


William R. Shadish
The University of Memphis

Thomas D. Cook
Northwestern University

Donald T. Campbell

Houghton Mifflin Company    Boston  New York
Experiment (ik-sper'e-ment): [Middle English from Old French from Latin experimentum, from experiri, to try; see per- in Indo-European Roots.] n. Abbr. exp., expt. 1. a. A test under controlled conditions that is made to demonstrate a known truth, examine the validity of a hypothesis, or determine the efficacy of something previously untried. b. The process of conducting such a test; experimentation. 2. An innovative act or procedure: "Democracy is only an experiment in government" (William Ralph Inge).

Cause (kôz): [Middle English from Old French from Latin causa, reason, purpose.] n. 1. a. The producer of an effect, result, or consequence. b. The one, such as a person, an event, or a condition, that is responsible for an action or a result. v. 1. To be the cause of or reason for; result in. 2. To bring about or compel by authority or force.

To many historians and philosophers, the increased emphasis on experimentation in the 16th and 17th centuries marked the emergence of modern science from its roots in natural philosophy (Hacking, 1983). Drake (1981) cites Galileo's 1612 treatise Bodies That Stay Atop Water, or Move in It as ushering in modern experimental science, but earlier claims can be made favoring William Gilbert's 1600 study On the Loadstone and Magnetic Bodies, Leonardo da Vinci's (1452-1519) many investigations, and perhaps even the 5th-century B.C. philosopher Empedocles, who used various empirical demonstrations to argue against Parmenides (Jones, 1969b). In the everyday sense of the term, humans have been experimenting with different ways of doing things from the earliest moments of their history. Such experimenting is as natural a part of our life as trying a new recipe or a different way of starting campfires.
1. EXPERIMENTS AND GENERALIZED CAUSAL INFERENCE

However, the scientific revolution of the 17th century departed in three ways from the common use of observation in natural philosophy at that time. First, it increasingly used observation to correct errors in theory. Throughout history, natural philosophers often used observation in their theories, usually to win philosophical arguments by finding observations that supported their theories. However, they still subordinated the use of observation to the practice of deriving theories from "first principles," starting points that humans know to be true by our nature or by divine revelation (e.g., the assumed properties of the four basic elements of fire, water, earth, and air in Aristotelian natural philosophy). According to some accounts, this subordination of evidence to theory degenerated in the 17th century: "The Aristotelian principle of appealing to experience had degenerated among philosophers into dependence on reasoning supported by casual examples and the refutation of opponents by pointing to apparent exceptions not carefully examined" (Drake, 1981, p. xxi). When some 17th-century scholars then began to use observation to correct apparent errors in theoretical and religious first principles, they came into conflict with religious or philosophical authorities, as in the case of the Inquisition's demands that Galileo recant his account of the earth revolving around the sun. Given such hazards, the fact that the new experimental science tipped the balance toward observation and away from dogma is remarkable. By the time Galileo died, the role of systematic observation was firmly entrenched as a central feature of science, and it has remained so ever since (Harré, 1981).
Second, before the 17th century, appeals to experience were usually based on passive observation of ongoing systems rather than on observation of what happens after a system is deliberately changed. After the scientific revolution in the 17th century, the word experiment (terms in boldface in this book are defined in the Glossary) came to connote taking a deliberate action followed by systematic observation of what occurred afterward. As Hacking (1983) noted of Francis Bacon: "He taught that not only must we observe nature in the raw, but that we must also 'twist the lion's tail', that is, manipulate our world in order to learn its secrets" (p. 149). Although passive observation reveals much about the world, active manipulation is required to discover some of the world's regularities and possibilities (Greenwood, 1989). As a mundane example, stainless steel does not occur naturally; humans must manipulate it into existence. Experimental science came to be concerned with observing the effects of such manipulations.
Third, early experimenters realized the desirability of controlling extraneous influences that might limit or bias observation. So telescopes were carried to higher points at which the air was clearer, the glass for microscopes was ground ever more accurately, and scientists constructed laboratories in which it was possible to use walls to keep out potentially biasing ether waves and to use (eventually sterilized) test tubes to keep out dust or bacteria. At first, these controls were developed for astronomy, chemistry, and physics, the natural sciences in which interest in science first bloomed. But when scientists started to use experiments in areas such as public health or education, in which extraneous influences are harder to control (e.g., Lind, 1753), they found that the controls used in natural science laboratories worked poorly in these new applications. So they developed new methods of dealing with extraneous influence, such as random assignment (Fisher, 1925) or adding a nonrandomized control group (Coover & Angell, 1907). As theoretical and observational experience accumulated across these settings and topics, more sources of bias were identified and more methods were developed to cope with them (Dehue, 2000).
Today, the key feature common to all experiments is still to deliberately vary something so as to discover what happens to something else later; that is, to discover the effects of presumed causes. As laypersons we do this, for example, to assess what happens to our blood pressure if we exercise more, to our weight if we diet less, or to our behavior if we read a self-help book. However, scientific experimentation has developed increasingly specialized substance, language, and tools, including the practice of field experimentation in the social sciences that is the primary focus of this book. This chapter begins to explore these matters by (1) discussing the nature of causation that experiments test, (2) explaining the specialized terminology (e.g., randomized experiments, quasi-experiments) that describes social experiments, (3) introducing the problem of how to generalize causal connections from individual experiments, and (4) briefly situating the experiment within a larger literature on the nature of science.

Experiments and Causation

A sensible discussion of experiments requires both a vocabulary for talking about causation and an understanding of key concepts that underlie that vocabulary.

Defining Cause, Effect, and Causal Relationship

Most people intuitively recognize causal relationships in their daily lives. For instance, you may say that another automobile's hitting yours was a cause of the damage to your car; that the number of hours you spent studying was a cause of your test grades; or that the amount of food a friend eats was a cause of his weight. You may even point to more complicated causal relationships, noting that a low test grade was demoralizing, which reduced subsequent studying, which caused even lower grades. Here the same variable (low grade) can be both a cause and an effect, and there can be a reciprocal relationship between two variables (low grades and not studying) that cause each other.
Despite this intuitive familiarity with causal relationships, a precise definition of cause and effect has eluded philosophers for centuries.1 Indeed, the definitions of terms such as cause and effect depend partly on each other and on the causal relationship in which both are embedded. So the 17th-century philosopher John Locke said: "That which produces any simple or complex idea, we denote by the general name cause, and that which is produced, effect" (1975, p. 324) and also: "A cause is that which makes any other thing, either simple idea, substance, or mode, begin to be; and an effect is that, which had its beginning from some other thing" (p. 325). Since then, other philosophers and scientists have given us useful definitions of the three key ideas (cause, effect, and causal relationship) that are more specific and that better illuminate how experiments work. We would not defend any of these as the true or correct definition, given that the latter has eluded philosophers for millennia; but we do claim that these ideas help to clarify the scientific practice of probing causes.

1. Our analysis reflects the use of the word causation in ordinary language, not the more detailed discussions of cause by philosophers. Readers interested in such detail may consult a host of works that we reference in this chapter, including Cook and Campbell (1979).

Cause

Consider the cause of a forest fire. We know that fires start in different ways: a match tossed from a car, a lightning strike, or a smoldering campfire, for example. None of these causes is necessary because a forest fire can start even when, say, a match is not present. Also, none of them is sufficient to start the fire. After all, a match must stay "hot" long enough to start combustion; it must contact combustible material such as dry leaves; there must be oxygen for combustion to occur; and the weather must be dry enough so that the leaves are dry and the match is not doused by rain. So the match is part of a constellation of conditions without which a fire will not result, although some of these conditions can be usually taken for granted, such as the availability of oxygen. A lighted match is, therefore, what Mackie (1974) called an inus condition, "an insufficient but nonredundant part of an unnecessary but sufficient condition" (p. 62; italics in original). It is insufficient because a match cannot start a fire without the other conditions. It is nonredundant only if it adds something fire-promoting that is uniquely different from what the other factors in the constellation (e.g., oxygen, dry leaves) contribute to starting a fire; after all, it would be harder to say whether the match caused the fire if someone else simultaneously tried starting it with a cigarette lighter. It is part of a sufficient condition to start a fire in combination with the full constellation of factors. But that condition is not necessary because there are other sets of conditions that can also start fires.
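Mackie's four-part definition can be made concrete with a small sketch. The code below is our own illustrative model, not anything from the text: the condition sets and function names are invented, and each frozenset stands for one sufficient constellation of conditions for a forest fire.

```python
# Toy model of Mackie's "inus condition" (illustrative; the condition
# sets below are invented for the example, not exhaustive).
SUFFICIENT_SETS = [
    frozenset({"match", "oxygen", "dry_leaves", "dry_weather"}),
    frozenset({"lightning", "oxygen", "dry_leaves", "dry_weather"}),
]

def fire_occurs(conditions):
    """Fire occurs if any full sufficient constellation is present."""
    return any(s <= conditions for s in SUFFICIENT_SETS)

def is_inus(factor):
    """Check each clause of 'insufficient but nonredundant part of an
    unnecessary but sufficient condition' for `factor`."""
    containing = [s for s in SUFFICIENT_SETS if factor in s]
    insufficient = not fire_occurs({factor})          # not enough alone
    nonredundant = all(not fire_occurs(s - {factor})  # removing it breaks
                       for s in containing)           # its constellation
    part_of_sufficient = len(containing) > 0
    unnecessary = any(factor not in s for s in SUFFICIENT_SETS)
    return insufficient and nonredundant and part_of_sufficient and unnecessary

print(is_inus("match"))   # the match satisfies all four clauses
print(is_inus("oxygen"))  # oxygen fails "unnecessary": it is in every set
```

In this toy model the match comes out as an inus condition, while oxygen does not, mirroring the text's point that oxygen belongs to every constellation and so is simply taken for granted.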
A research example of an inus condition concerns a new potential treatment for cancer. In the late 1990s, a team of researchers in Boston headed by Dr. Judah Folkman reported that a new drug called Endostatin shrank tumors by limiting their blood supply (Folkman, 1996). Other respected researchers could not replicate the effect even when using drugs shipped to them from Folkman's lab. Scientists eventually replicated the results after they had traveled to Folkman's lab to learn how to properly manufacture, transport, store, and handle the drug and how to inject it in the right location at the right depth and angle. One observer labeled these contingencies an "in-our-hands" phenomenon, meaning "even we don't know which details are important, so it might take you some time to work it out" (Rowe, 1999, p. 732). Endostatin was an inus condition. It was insufficient cause by itself, and its effectiveness required it to be embedded in a larger set of conditions that were not even fully understood by the original investigators.
Most causes are more accurately called inus conditions. Many factors are usually required for an effect to occur, but we rarely know all of them and how they relate to each other. This is one reason that the causal relationships we discuss in this book are not deterministic but only increase the probability that an effect will occur (Eells, 1991; Holland, 1994). It also explains why a given causal relationship will occur under some conditions but not universally across time, space, human populations, or other kinds of treatments and outcomes that are more or less related to those studied. To different degrees, all causal relationships are context dependent, so the generalization of experimental effects is always at issue. That is why we return to such generalizations throughout this book.

Effect

We can better understand what an effect is through a counterfactual model that goes back at least to the 18th-century philosopher David Hume (Lewis, 1973, p. 556). A counterfactual is something that is contrary to fact. In an experiment, we observe what did happen when people received a treatment. The counterfactual is knowledge of what would have happened to those same people if they simultaneously had not received treatment. An effect is the difference between what did happen and what would have happened.
We cannot actually observe a counterfactual. Consider phenylketonuria (PKU), a genetically based metabolic disease that causes mental retardation unless treated during the first few weeks of life. PKU is the absence of an enzyme that would otherwise prevent a buildup of phenylalanine, a substance toxic to the nervous system. When a restricted phenylalanine diet is begun early and maintained, retardation is prevented. In this example, the cause could be thought of as the underlying genetic defect, as the enzymatic disorder, or as the diet. Each implies a different counterfactual. For example, if we say that a restricted phenylalanine diet caused a decrease in PKU-based mental retardation in infants who are phenylketonuric at birth, the counterfactual is whatever would have happened had these same infants not received a restricted phenylalanine diet. The same logic applies to the genetic or enzymatic version of the cause. But it is impossible for these very same infants simultaneously to both have and not have the diet, the genetic disorder, or the enzyme deficiency.
So a central task for all cause-probing research is to create reasonable approximations to this physically impossible counterfactual. For instance, if it were ethical to do so, we might contrast phenylketonuric infants who were given the diet with other phenylketonuric infants who were not given the diet but who were similar in many ways to those who were (e.g., similar race, gender, age, socioeconomic status, health status). Or we might (if it were ethical) contrast infants who were not on the diet for the first 3 months of their lives with those same infants after they were put on the diet starting in the 4th month. Neither of these approximations is a true counterfactual. In the first case, the individual infants in the treatment condition are different from those in the comparison condition; in the second case, the identities are the same, but time has passed and many changes other than the treatment have occurred to the infants (including permanent damage done by phenylalanine during the first 3 months of life). So two central tasks in experimental design are creating a high-quality but necessarily imperfect source of counterfactual inference and understanding how this source differs from the treatment condition.
This counterfactual reasoning is fundamentally qualitative because causal inference, even in experiments, is fundamentally qualitative (Campbell, 1975; Shadish, 1995a; Shadish & Cook, 1999). However, some of these points have been formalized by statisticians into a special case that is sometimes called Rubin's Causal Model (Holland, 1986; Rubin, 1974, 1977, 1978). This book is not about statistics, so we do not describe that model in detail (West, Biesanz, & Pitts [2000] do so and relate it to the Campbell tradition). A primary emphasis of Rubin's model is the analysis of cause in experiments, and its basic premises are consistent with those of this book.2 Rubin's model has also been widely used to analyze causal inference in case-control studies in public health and medicine (Holland & Rubin, 1988), in path analysis in sociology (Holland, 1986), and in a paradox that Lord (1967) introduced into psychology (Holland & Rubin, 1983); and it has generated many statistical innovations that we cover later in this book. It is new enough that critiques of it are just now beginning to appear (e.g., Dawid, 2000; Pearl, 2000). What is clear, however, is that Rubin's is a very general model with obvious and subtle implications. Both it and the critiques of it are required material for advanced students and scholars of cause-probing methods.
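The potential-outcomes formalization of this counterfactual logic can be illustrated with a toy simulation. This sketch is ours, not from the book; the numbers (a true effect of 10, outcomes centered near 50) are invented for illustration. Each unit has two potential outcomes, but an experiment reveals only one, so the other remains the unobserved counterfactual; random assignment lets the control group stand in for it on average.

```python
import random

random.seed(0)

# Invented example: each unit has an outcome with and without treatment,
# but any one study can observe only one of the two.
def make_unit():
    y_control = random.gauss(50, 5)   # outcome without treatment
    y_treated = y_control + 10        # outcome with treatment (true effect = 10)
    return y_control, y_treated

units = [make_unit() for _ in range(10_000)]

# Random assignment: the control group approximates the missing
# counterfactual for the treated group, on average.
treated, control = [], []
for y0, y1 in units:
    if random.random() < 0.5:
        treated.append(y1)   # y0 is the unseen counterfactual
    else:
        control.append(y0)   # y1 is the unseen counterfactual

estimated_effect = sum(treated) / len(treated) - sum(control) / len(control)
print(round(estimated_effect, 1))  # close to the true effect of 10
```

No individual's effect is ever observed directly; only the average difference between the randomly formed groups recovers the effect, which is the sense in which random assignment creates a high-quality but imperfect source of counterfactual inference.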

Causal Relationship

How do we know if cause and effect are related? In a classic analysis formalized by the 19th-century philosopher John Stuart Mill, a causal relationship exists if (1) the cause preceded the effect, (2) the cause was related to the effect, and (3) we
 These three characteristics mirror what happens in experiments in which (1) we
 manipulate the presumed cause and observe an outcome afterward; (2) we see
 whether variation in the cause is related to variation in the effect; and (3) we use
 various methods during the experiment to reduce the plausibility of other expla-
 nations for the effect, along with ancillary methods to explore the plausibility of
 those we cannot rule out (most of this book is about methods for doing this).

 2. However, Rubin's model is not intended to say much about the matters of causal generalization that we address
 in this book.

Hence experiments are well suited to studying causal relationships. No other scientific method regularly matches the characteristics of causal relationships so well. Mill's analysis also points to the weakness of other methods. In many correlational studies, for example, it is impossible to know which of two variables came first, so defending a causal relationship between them is precarious. Understanding this logic of causal relationships and how its key terms, such as cause and effect, are defined helps researchers to critique cause-probing studies.

Causation, Correlation, and Confounds

A well-known maxim in research is: Correlation does not prove causation. This is so because we may not know which variable came first nor whether alternative explanations for the presumed effect exist. For example, suppose income and education are correlated. Do you have to have a high income before you can afford to pay for education, or do you first have to get a good education before you can get a better paying job? Each possibility may be true, and so both need investigation. But until those investigations are completed and evaluated by the scholarly community, a simple correlation does not indicate which variable came first. Correlations also do little to rule out alternative explanations for a relationship between two variables such as education and income. That relationship may not be causal at all but rather due to a third variable (often called a confound), such as intelligence or family socioeconomic status, that causes both high education and high income. For example, if high intelligence causes success in education and on the job, then intelligent people would have correlated education and incomes, not because education causes income (or vice versa) but because both would be caused by intelligence. Thus a central task in the study of experiments is identifying the different kinds of confounds that can operate in a particular research area and understanding the strengths and weaknesses associated with various ways of dealing with them.
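The third-variable problem can be made concrete with a small simulation. This is our own illustration rather than anything from the text; the variable names and coefficients are invented. Here "intelligence" drives both education and income, and education has no direct effect on income at all, yet the two end up strongly correlated:

```python
import random

random.seed(1)

# Invented confound example: a third variable causes both others.
n = 10_000
intelligence = [random.gauss(100, 15) for _ in range(n)]
education = [iq * 0.1 + random.gauss(0, 1) for iq in intelligence]
income = [iq * 500 + random.gauss(0, 5_000) for iq in intelligence]

def corr(xs, ys):
    """Pearson correlation, computed from scratch for transparency."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Education and income correlate strongly even though neither causes
# the other; the shared cause induces the association.
print(round(corr(education, income), 2))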

Manipulable and Nonmanipulable Causes

In the intuitive understanding of experimentation that most people have, it makes sense to say, "Let's see what happens if we require welfare recipients to work"; but it makes no sense to say, "Let's see what happens if I change this adult male into a three-year-old girl." And so it is also in scientific experiments. Experiments explore the effects of things that can be manipulated, such as the dose of a medicine, the amount of a welfare check, the kind or amount of psychotherapy, or the number of children in a classroom. Nonmanipulable events (e.g., the explosion of a supernova) or attributes (e.g., people's ages, their raw genetic material, or their biological sex) cannot be causes in experiments because we cannot deliberately vary them to see what then happens. Consequently, most scientists and philosophers agree that it is much harder to discover the effects of nonmanipulable causes.


To be clear, we are not arguing that all causes must be manipulable, only that experimental causes must be so. Many variables that we correctly think of as causes are not directly manipulable. Thus it is well established that a genetic defect causes PKU even though that defect is not directly manipulable. We can investigate such causes indirectly in nonexperimental studies or even in experiments by manipulating biological processes that prevent the gene from exerting its influence, as through the use of diet to inhibit the gene's biological consequences. Both the nonmanipulable gene and the manipulable diet can be viewed as causes; both covary with PKU-based retardation, both precede the retardation, and it is possible to explore other explanations for the gene's and the diet's effects on cognitive functioning. However, investigating the manipulable diet as a cause has two important advantages over considering the nonmanipulable genetic problem as a cause. First, only the diet provides a direct action to solve the problem; and second, we will see that studying manipulable agents allows a higher quality source of counterfactual inference through such methods as random assignment. When individuals with the nonmanipulable genetic problem are compared with persons without it, the latter are likely to be different from the former in many ways other than the genetic defect. So the counterfactual inference about what would have happened to those with the PKU genetic defect is much more difficult to make.
Nonetheless, nonmanipulable causes should be studied using whatever means are available and seem useful. This is true because such causes eventually help us to find manipulable agents that can then be used to ameliorate the problem at hand. The PKU example illustrates this. Medical researchers did not discover how to treat PKU effectively by first trying different diets with retarded children. They first discovered the nonmanipulable biological features of retarded children affected with PKU, finding abnormally high levels of phenylalanine and its associated metabolic and genetic problems in those children. Those findings pointed in certain ameliorative directions and away from others, leading scientists to experiment with treatments they thought might be effective and practical. Thus the new diet resulted from a sequence of studies with different immediate purposes, with different forms, and with varying degrees of uncertainty reduction. Some were experimental, but others were not.
Further, analogue experiments can sometimes be done on nonmanipulable causes, that is, experiments that manipulate an agent that is similar to the cause of interest. Thus we cannot change a person's race, but we can chemically induce skin pigmentation changes in volunteer individuals, though such analogues do not match the reality of being Black every day and everywhere for an entire life. Similarly, past events, which are normally nonmanipulable, sometimes constitute a natural experiment that may even have been randomized, as when the 1970 Vietnam-era draft lottery was used to investigate a variety of outcomes (e.g., Angrist, Imbens, & Rubin, 1996a; Notz, Staw, & Cook, 1971).
Although experimenting on manipulable causes makes the job of discovering their effects easier, experiments are far from perfect means of investigating causes. Sometimes experiments modify the conditions in which testing occurs in a way that reduces the fit between those conditions and the situation to which the results are to be generalized. Also, knowledge of the effects of manipulable causes tells nothing about how and why those effects occur. Nor do experiments answer many other questions relevant to the real world: for example, which questions are worth asking, how strong the need for treatment is, how a cause is distributed through society, whether the treatment is implemented with theoretical fidelity, and what value should be attached to the experimental results.
In addition, in experiments, we first manipulate a treatment and only then observe its effects; but in some other studies we first observe an effect, such as AIDS, and then search for its cause, whether manipulable or not. Experiments cannot help us with that search. Scriven (1976) likens such searches to detective work in which a crime has been committed (e.g., a robbery), the detectives observe a particular pattern of evidence surrounding the crime (e.g., the robber wore a baseball cap and a distinct jacket and used a certain kind of gun), and then the detectives search for criminals whose known method of operating (their modus operandi or m.o.) includes this pattern. A criminal whose m.o. fits that pattern of evidence then becomes a suspect to be investigated further. Epidemiologists use a similar method, the case-control design (Ahlbom & Norell, 1990), in which they observe a particular health outcome (e.g., an increase in brain tumors) that is not seen in another group and then attempt to identify associated causes (e.g., increased cell phone use). Experiments do not aspire to answer all the kinds of questions, not even all the types of causal questions, that social scientists ask.

Causal Description and Causal Explanation

The unique strength of experimentation is in describing the consequences attributable to deliberately varying a treatment. We call this causal description. In contrast, experiments do less well in clarifying the mechanisms through which and the conditions under which that causal relationship holds, which we call causal explanation. For example, most children very quickly learn the descriptive causal relationship between flicking a light switch and obtaining illumination in a room. However, few children (or even adults) can fully explain why that light goes on. To do so, they would have to decompose the treatment (the act of flicking a light switch) into its causally efficacious features (e.g., closing an insulated circuit) and its nonessential features (e.g., whether the switch is thrown by hand or a motion detector). They would have to do the same for the effect (either incandescent or fluorescent light can be produced, but light will still be produced whether the light fixture is recessed or not). For full explanation, they would then have to show how the causally efficacious parts of the treatment influence the causally affected parts of the outcome through identified mediating processes (e.g., the passage of electricity through the circuit, the excitation of photons).3 Clearly, the cause of the light going on is a complex cluster of many factors. For those philosophers who equate cause with identifying that constellation of variables that necessarily, inevitably, and infallibly results in the effect (Beauchamp, 1974), talk of cause is not warranted until everything of relevance is known. For them, there is no causal description without causal explanation. Whatever the philosophic merits of their position, though, it is not practical to expect much current social science to achieve such complete explanation.
The practical importance of causal explanation is brought home when the switch fails to make the light go on and when replacing the light bulb (another easily learned manipulation) fails to solve the problem. Explanatory knowledge then offers clues about how to fix the problem, for example, by detecting and repairing a short circuit. Or if we wanted to create illumination in a place without lights and we had explanatory knowledge, we would know exactly which features of the cause-and-effect relationship are essential to create light and which are irrelevant. Our explanation might tell us that there must be a source of electricity but that that source could take several different molar forms, such as a battery, a generator, a windmill, or a solar array. There must also be a switch mechanism to close a circuit, but this could also take many forms, including the touching of two bare wires or even a motion detector that trips the switch when someone enters the room. So causal explanation is an important route to the generalization of causal descriptions because it tells us which features of the causal relationship are essential to transfer to other situations.
     This benefit of causal explanation helps elucidate its priority and prestige in all sciences and helps explain why, once a novel and important causal relationship is discovered, the bulk of basic scientific effort turns toward explaining why and how it happens. Usually, this involves decomposing the cause into its causally effective parts, decomposing the effect into its causally affected parts, and identifying the processes through which the effective causal parts influence the causally affected outcome parts.
     These examples also show the close parallel between descriptive and explanatory causation and molar and molecular causation.4 Descriptive causation usually concerns simple bivariate relationships between molar treatments and molar outcomes, molar here referring to a package that consists of many different parts. For instance, we may find that psychotherapy decreases depression, a simple descriptive causal relationship between a molar treatment package and a molar outcome. However, psychotherapy consists of such parts as verbal interactions, placebo-

3. However, the full explanation a physicist would offer might be quite different from this electrician's explanation, perhaps invoking the behavior of subparticles. This difference indicates just how complicated is the notion of explanation and how it can quickly become quite complex once one shifts levels of analysis.
4. By molar, we mean something taken as a whole rather than in parts. An analogy is to physics, in which molar might refer to the properties or motions of masses, as distinguished from those of the molecules or atoms that make up those masses.
EXPERIMENTS AND CAUSATION | 11

generating procedures, setting characteristics, time constraints, and payment for services. Similarly, many depression measures consist of items pertaining to the physiological, cognitive, and affective aspects of depression. Explanatory causation breaks these molar causes and effects into their molecular parts so as to learn, say, that the verbal interactions and the placebo features of therapy both cause changes in the cognitive symptoms of depression, but that payment for services does not do so even though it is part of the molar treatment package.
     If experiments are less able to provide this highly prized explanatory causal knowledge, why are experiments so central to science, especially to basic social science, in which theory and explanation are often the coin of the realm? The answer is that the dichotomy between descriptive and explanatory causation is less clear in scientific practice than in abstract discussions about causation. First, many causal explanations consist of chains of descriptive causal links in which one event causes the next. Experiments help to test the links in each chain. Second, experiments help distinguish between the validity of competing explanatory theories, for example, by testing competing mediating links proposed by those theories. Third, some experiments test whether a descriptive causal relationship varies in strength or direction under Condition A versus Condition B (then the condition is a moderator variable that explains the conditions under which the effect holds). Fourth, some experiments add quantitative or qualitative observations of the links in the explanatory chain (mediator variables) to generate and study explanations for the descriptive causal effect.
     Experiments are also prized in applied areas of social science, in which the identification of practical solutions to social problems has as great or even greater priority than explanations of those solutions. After all, explanation is not always required for identifying practical solutions. Lewontin (1997) makes this point about the Human Genome Project, a coordinated multibillion-dollar research program to map the human genome that it is hoped eventually will clarify the genetic causes of diseases. Lewontin is skeptical about aspects of this search:

     What is involved here is the difference between explanation and intervention. Many disorders can be explained by the failure of the organism to make a normal protein, a failure that is the consequence of a gene mutation. But intervention requires that the normal protein be provided at the right place in the right cells, at the right time and in the right amount, or else that an alternative way be found to provide normal cellular function. What is worse, it might even be necessary to keep the abnormal protein away from the cells at critical moments. None of these objectives is served by knowing the DNA sequence of the defective gene. (Lewontin, 1997, p. 29)

     Practical applications are not immediately revealed by theoretical advance. Instead, to reveal them may take decades of follow-up work, including tests of simple descriptive causal relationships. The same point is illustrated by the cancer drug Endostatin, discussed earlier. Scientists knew the action of the drug occurred through cutting off tumor blood supplies; but to successfully use the drug to treat cancers in mice required administering it at the right place, angle, and depth, and those details were not part of the usual scientific explanation of the drug's effects.

     In the end, then, causal descriptions and causal explanations are in delicate balance in experiments. What experiments do best is to improve causal descriptions; they do less well at explaining causal relationships. But most experiments can be designed to provide better explanations than is typically the case today. Further, in focusing on causal descriptions, experiments often investigate molar events that may be less strongly related to outcomes than are more molecular mediating processes, especially those processes that are closer to the outcome in the explanatory chain. However, many causal descriptions are still dependable and strong enough to be useful, to be worth making the building blocks around which important policies and theories are created. Just consider the dependability of such causal statements as that school desegregation causes white flight, or that outgroup threat causes ingroup cohesion, or that psychotherapy improves mental health, or that diet reduces the retardation due to PKU. Such dependable causal relationships are useful to policymakers, practitioners, and scientists alike.

Some of the terms used in describing modern experimentation (see Table 1.1) are unique, clearly defined, and consistently used; others are blurred and inconsistently used. The common attribute in all experiments is control of treatment (though control can take many different forms). So Mosteller (1990, p. 225) writes, "In an experiment the investigator controls the application of the treatment"; and Yaremko, Harari, Harrison, and Lynn (1986, p. 72) write, "one or more independent variables are manipulated to observe their effects on one or more dependent variables." However, over time many different experimental subtypes have developed in response to the needs and histories of different sciences (Winston, 1990; Winston & Blais, 1996).

TABLE 1.1 The Vocabulary of Experiments

Experiment: A study in which an intervention is deliberately introduced to observe its effects.
Randomized Experiment: An experiment in which units are assigned to receive the treatment or an alternative condition by a random process such as the toss of a coin or a table of random numbers.
Quasi-Experiment: An experiment in which units are not assigned to conditions randomly.
Natural Experiment: Not really an experiment because the cause usually cannot be manipulated; a study that contrasts a naturally occurring event such as an earthquake with a comparison condition.
Correlational Study: Usually synonymous with nonexperimental or observational study; a study that simply observes the size and direction of a relationship among variables.
MODERN DESCRIPTIONS OF EXPERIMENTS | 13

The most clearly described variant is the randomized experiment, widely credited to Sir Ronald Fisher (1925, 1926). It was first used in agriculture but later spread to other topic areas because it promised control over extraneous sources of variation without requiring the physical isolation of the laboratory. Its distinguishing feature is clear and important: the various treatments being contrasted (including no treatment at all) are assigned to experimental units5 by chance, for example, by coin toss or use of a table of random numbers. If implemented correctly, random assignment creates two or more groups of units that are probabilistically similar to each other on the average.6 Hence, any outcome differences that are observed between those groups at the end of a study are likely to be due to treatment, not to differences between the groups that already existed at the start of the study. Further, when certain assumptions are met, the randomized experiment yields an estimate of the size of a treatment effect that has desirable statistical properties, along with estimates of the probability that the true effect falls within a defined confidence interval. These features of experiments are so highly prized that in research areas such as medicine the randomized experiment is often referred to as the gold standard for treatment outcome research.7
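The claim that chance alone produces probabilistically similar groups is easy to demonstrate by simulation. The sketch below is hypothetical (the data and the effect size are invented, not from the text): it assigns fabricated units to conditions by coin toss, checks that the groups' baseline means differ only by chance, and computes a difference-in-means effect estimate with an approximate 95% confidence interval.

```python
# Hypothetical sketch of a randomized experiment. All numbers are invented.
import math
import random
import statistics

random.seed(42)

# A pool of units, each with a pre-existing baseline score.
units = [random.gauss(50, 10) for _ in range(200)]

# Random assignment: each unit goes to treatment or control by coin toss.
treatment, control = [], []
for baseline in units:
    (treatment if random.random() < 0.5 else control).append(baseline)

# Probabilistic similarity: baseline means should differ only by chance.
baseline_gap = statistics.mean(treatment) - statistics.mean(control)

# Simulate outcomes with a true treatment effect of 5 points.
y_t = [b + 5 + random.gauss(0, 5) for b in treatment]
y_c = [b + random.gauss(0, 5) for b in control]

# Difference in means and an approximate 95% confidence interval.
diff = statistics.mean(y_t) - statistics.mean(y_c)
se = math.sqrt(statistics.variance(y_t) / len(y_t) +
               statistics.variance(y_c) / len(y_c))
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(round(baseline_gap, 2), round(diff, 2),
      (round(ci[0], 2), round(ci[1], 2)))
```

The balance is probabilistic, not exact: in any single run the baseline gap is small but nonzero, which is why the text's word "probabilistically" matters.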
     Closely related to the randomized experiment is a more ambiguous and inconsistently used term, true experiment. Some authors use it synonymously with randomized experiment (Rosenthal & Rosnow, 1991). Others use it more generally to refer to any study in which an independent variable is deliberately manipulated (Yaremko et al., 1986) and a dependent variable is assessed. We shall not use the term at all, given its ambiguity and given that the modifier true seems to imply restricted claims to a single correct experimental method.

Much of this book focuses on a class of designs that Campbell and Stanley (1963) popularized as quasi-experiments.8 Quasi-experiments share with all other

5. Units can be people, animals, time periods, institutions, or almost anything else. Typically in field experimentation they are people or some aggregate of people, such as classrooms or work sites. In addition, a little thought shows that random assignment of units to treatments is the same as assignment of treatments to units, so these phrases are frequently used interchangeably.
6. The word probabilistically is crucial, as is explained in more detail in Chapter 8.
7. Although the term randomized experiment is used this way consistently across many fields and in this book, statisticians sometimes use the closely related term random experiment in a different way to indicate experiments for which the outcome cannot be predicted with certainty (e.g., Hogg & Tanis, 1988).
8. Campbell (1957) first called these compromise designs but changed terminology very quickly; Rosenbaum (1995a) and Cochran (1965) refer to these as observational studies, a term we avoid because many people use it to refer to correlational or nonexperimental studies as well. Greenberg and Shroder (1997) use quasi-experiment to refer to studies that randomly assign groups (e.g., communities) to conditions, but we would consider these group-randomized experiments (Murray, 1998).


experiments a similar purpose, to test descriptive causal hypotheses about manipulable causes, as well as many structural details, such as the frequent presence of control groups and pretest measures, to support a counterfactual inference about what would have happened in the absence of treatment. But, by definition, quasi-experiments lack random assignment. Assignment to conditions is by means of self-selection, by which units choose treatment for themselves, or by means of administrator selection, by which teachers, bureaucrats, legislators, therapists, physicians, or others decide which persons should get which treatment. However, researchers who use quasi-experiments may still have considerable control over selecting and scheduling measures, over how nonrandom assignment is executed, over the kinds of comparison groups with which treatment groups are compared, and over some aspects of how treatment is scheduled. As Campbell and Stanley note:

     There are many natural social settings in which the research person can introduce something like experimental design into his scheduling of data collection procedures (e.g., the when and to whom of measurement), even though he lacks the full control over the scheduling of experimental stimuli (the when and to whom of exposure and the ability to randomize exposures) which makes a true experiment possible. Collectively, such situations can be regarded as quasi-experimental designs. (Campbell & Stanley, 1963, p. 34)

     In quasi-experiments, the cause is manipulable and occurs before the effect is measured. However, quasi-experimental design features usually create less compelling support for counterfactual inferences. For example, quasi-experimental control groups may differ from the treatment condition in many systematic (nonrandom) ways other than the presence of the treatment. Many of these ways could be alternative explanations for the observed effect, and so researchers have to worry about ruling them out in order to get a more valid estimate of the treatment effect. By contrast, with random assignment the researcher does not have to think as much about all these alternative explanations. If correctly done, random assignment makes most of the alternatives less likely as causes of the observed treatment effect at the start of the study.
     In quasi-experiments, the researcher has to enumerate alternative explanations one by one, decide which are plausible, and then use logic, design, and measurement to assess whether each one is operating in a way that might explain any observed effect. The difficulties are that these alternative explanations are never completely enumerable in advance, that some of them are particular to the context being studied, and that the methods needed to eliminate them from contention will vary from alternative to alternative and from study to study. For example, suppose two nonrandomly formed groups of children are studied, a volunteer treatment group that gets a new reading program and a control group of nonvolunteers who do not get it. If the treatment group does better, is it because of treatment or because the cognitive development of the volunteers was increasing more rapidly even before treatment began? (In a randomized experiment, maturation rates would

have been probabilistically equal in both groups.) To assess this alternative, the researcher might add multiple pretests to reveal the maturational trend before the treatment, and then compare that trend with the trend after treatment.
     Another alternative explanation might be that the nonrandom control group included more disadvantaged children who had less access to books in their homes or who had parents who read to them less often. (In a randomized experiment, both groups would have had similar proportions of such children.) To assess this alternative, the experimenter may measure the number of books at home, parental time spent reading to children, and perhaps trips to libraries. Then the researcher would see if these variables differed across treatment and control groups in the hypothesized direction that could explain the observed treatment effect. Obviously, as the number of plausible alternative explanations increases, the design of the quasi-experiment becomes more intellectually demanding and complex, especially because we are never certain we have identified all the alternative explanations. The efforts of the quasi-experimenter start to look like attempts to bandage a wound that would have been less severe if random assignment had been used initially.
     The ruling out of alternative hypotheses is closely related to a falsificationist logic popularized by Popper (1959). Popper noted how hard it is to be sure that a general conclusion (e.g., all swans are white) is correct based on a limited set of observations (e.g., all the swans I've seen were white). After all, future observations may change (e.g., someday I may see a black swan). So confirmation is logically difficult. By contrast, observing a disconfirming instance (e.g., a black swan) is sufficient, in Popper's view, to falsify the general conclusion that all swans are white. Accordingly, Popper urged scientists to try deliberately to falsify the conclusions they wish to draw rather than only to seek information corroborating them. Conclusions that withstand falsification are retained in scientific books or journals and treated as plausible until better evidence comes along. Quasi-experimentation is falsificationist in that it requires experimenters to identify a causal claim and then to generate and examine plausible alternative explanations that might falsify the claim.
     However, such falsification can never be as definitive as Popper hoped. Kuhn (1962) pointed out that falsification depends on two assumptions that can never be fully tested. The first is that the causal claim is perfectly specified. But that is never the case. So many features of both the claim and the test of the claim are debatable: for example, which outcome is of interest, how it is measured, the conditions of treatment, who needs treatment, and all the many other decisions that researchers must make in testing causal relationships. As a result, disconfirmation often leads theorists to respecify part of their causal theories. For example, they might now specify novel conditions that must hold for their theory to be true and that were derived from the apparently disconfirming observations. Second, falsification requires measures that are perfectly valid reflections of the theory being tested. However, most philosophers maintain that all observation is theory-laden. It is laden both with intellectual nuances specific to the partially unique scientific understandings of the theory held by the individual or group devising the test and also with the experimenters' extrascientific wishes, hopes, aspirations, and broadly shared cultural assumptions and understandings. If measures are not independent of theories, how can they provide independent theory tests, including tests of causal theories? If the possibility of theory-neutral observations is denied, with them disappears the possibility of definitive knowledge both of what seems to confirm a causal claim and of what seems to disconfirm it.
     Nonetheless, a fallibilist version of falsification is possible. It argues that studies of causal hypotheses can still usefully improve understanding of general trends despite ignorance of all the contingencies that might pertain to those trends. It argues that causal studies are useful even if we have to respecify the initial hypothesis repeatedly to accommodate new contingencies and new understandings. After all, those respecifications are usually minor in scope; they rarely involve wholesale overthrowing of general trends in favor of completely opposite trends. Fallibilist falsification also assumes that theory-neutral observation is impossible but that observations can approach a more factlike status when they have been repeatedly made across different theoretical conceptions of a construct, across multiple kinds of measurements, and at multiple times. It also assumes that observations are imbued with multiple theories, not just one, and that different operational procedures do not share the same multiple theories. As a result, observations that repeatedly occur despite different theories being built into them have a special factlike status even if they can never be fully justified as completely theory-neutral facts. In summary, then, fallible falsification is more than just seeing whether observations disconfirm a prediction. It involves discovering and judging the worth of ancillary assumptions about the restricted specificity of the causal hypothesis under test and also about the heterogeneity of theories, viewpoints, settings, and times built into the measures of the cause and effect and of any contingencies modifying their relationship.
     It is neither feasible nor desirable to rule out all possible alternative interpretations of a causal relationship. Instead, only plausible alternatives constitute the major focus. This serves partly to keep matters tractable because the number of possible alternatives is endless. It also recognizes that many alternatives have no serious empirical or experiential support and so do not warrant special attention. However, the lack of support can sometimes be deceiving. For example, the cause of stomach ulcers was long thought to be a combination of lifestyle (e.g., stress) and excess acid production. Few scientists seriously thought that ulcers were caused by a pathogen (e.g., virus, germ, bacteria) because it was assumed that an acid-filled stomach would destroy all living organisms. However, in 1982 Australian researchers Barry Marshall and Robin Warren discovered spiral-shaped bacteria, later named Helicobacter pylori (H. pylori), in ulcer patients' stomachs. With this discovery, the previously possible but implausible became plausible. By 1994, a U.S. National Institutes of Health Consensus Development Conference concluded that H. pylori was the major cause of most peptic ulcers. So labeling rival hypotheses as plausible depends not just on what is logically possible but on social consensus, shared experience, and empirical data.
     Because such factors are often context specific, different substantive areas develop their own lore about which alternatives are important enough to need to be controlled, even developing their own methods for doing so. In early psychology, for example, a control group with pretest observations was invented to control for the plausible alternative explanation that, by giving practice in answering test content, pretests would produce gains in performance even in the absence of a treatment effect (Coover & Angell, 1907). Thus the focus on plausibility is a two-edged sword: it reduces the range of alternatives to be considered in quasi-experimental work, yet it also leaves the resulting causal inference vulnerable to the discovery that an implausible-seeming alternative may later emerge as a likely causal agent.

The term natural experiment describes a naturally occurring contrast between a treatment and a comparison condition (Fagan, 1990; Meyer, 1995; Zeisel, 1973). Often the treatments are not even potentially manipulable, as when researchers retrospectively examined whether earthquakes in California caused drops in property values (Brunette, 1995; Murdoch, Singh, & Thayer, 1993). Yet plausible causal inferences about the effects of earthquakes are easy to construct and defend. After all, the earthquakes occurred before the observations on property values, and it is easy to see whether earthquakes are related to property values. A useful source of counterfactual inference can be constructed by examining property values in the same locale before the earthquake or by studying similar locales that did not experience an earthquake during the same time. If property values dropped right after the earthquake in the earthquake condition but not in the comparison condition, it is difficult to find an alternative explanation for that drop.
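The counterfactual logic just described (before versus after, earthquake locale versus comparison locale) amounts to comparing two changes. A minimal sketch, with invented property values rather than any real data:

```python
# Hypothetical sketch: compare the change in mean property values in an
# earthquake locale with the change in a similar, unaffected locale.
# All figures are invented for illustration.
quake_locale = {"before": 210_000, "after": 196_000}
comparison_locale = {"before": 205_000, "after": 204_000}

change_quake = quake_locale["after"] - quake_locale["before"]
change_comparison = comparison_locale["after"] - comparison_locale["before"]

# The comparison locale estimates what would have happened anyway; the
# difference between the two changes estimates the earthquake's effect,
# assuming the locales would otherwise have moved in parallel.
effect = change_quake - change_comparison
print(effect)  # -> -13000
```

Economists call this a difference-in-differences comparison; its credibility rests on the parallel-movement assumption noted in the comments.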
     Natural experiments have recently gained a high profile in economics. Before the 1990s economists had great faith in their ability to produce valid causal inferences through statistical adjustments for initial nonequivalence between treatment and control groups. But two studies on the effects of job training programs showed that those adjustments produced estimates that were not close to those generated from a randomized experiment and were unstable across tests of the model's sensitivity (Fraker & Maynard, 1987; Lalonde, 1986). Hence, in their search for alternative methods, many economists came to do natural experiments, such as the economic study of the effects that occurred in the Miami job market when many prisoners were released from Cuban jails and allowed to come to the United States (Card, 1990). They assume that the release of prisoners (or the timing of an earthquake) is independent of the ongoing processes that usually affect unemployment rates (or housing values). Later we explore the validity of this assumption; of its desirability there can be little question.

The terms correlational design, passive observational design, and nonexperimental design refer to situations in which a presumed cause and effect are identified and measured but in which other structural features of experiments are missing. Random assignment is not part of the design, nor are such design elements as pretests and control groups from which researchers might construct a useful counterfactual inference. Instead, reliance is placed on measuring alternative interpretations individually and then statistically controlling for them. In cross-sectional studies in which all the data are gathered on the respondents at one time, the researcher may not even know if the cause precedes the effect. When these studies are used for causal purposes, the missing design features can be problematic unless much is already known about which alternative interpretations are plausible, unless those that are plausible can be validly measured, and unless the substantive model used for statistical adjustment is well-specified. These are difficult conditions to meet in the real world of research practice, and therefore many commentators doubt the potential of such designs to support strong causal inferences in most cases.

EXPERIMENTS AND THE GENERALIZATION OF CAUSAL CONNECTIONS
The strength of experimentation is its ability to illuminate causal inference. The
weakness of experimentation is doubt about the extent to which that causal rela-
tionship generalizes. We hope that an innovative feature of this book is its focus
on generalization. Here we introduce the general issues that are expanded in later
chapters.

Most Experiments Are Highly Local But Have General Aspirations

Most experiments are highly localized and particularistic. They are almost always
conducted in a restricted range of settings, often just one, with a particular ver-
sion of one type of treatment rather than, say, a sample of all possible versions.
Usually they have several measures, each with theoretical assumptions that are
different from those present in other measures, but far from a complete set of all
possible measures. Each experiment nearly always uses a convenient sample of
people rather than one that reflects a well-described population; and it will in-
evitably be conducted at a particular point in time that rapidly becomes history.
     Yet readers of experimental results are rarely concerned with what happened
in that particular, past, local study. Rather, they usually aim to learn either about
theoretical constructs of interest or about a larger policy. Theorists often want to
connect experimental results to theories with broad conceptual applicability,
which requires generalization at the linguistic level of constructs rather than at the
level of the operations used to represent these constructs in a given experiment.
They nearly always want to generalize to more people and settings than are rep-
resented in a single experiment. Indeed, the value assigned to a substantive theory
usually depends on how broad a range of phenomena the theory covers. Similarly,
policymakers may be interested in whether a causal relationship would hold
(probabilistically) across the many sites at which it would be implemented as a
policy, an inference that requires generalization beyond the original experimental
study context. Indeed, all human beings probably value the perceptual and cogni-
tive stability that is fostered by generalizations. Otherwise, the world might ap-
pear as a buzzing cacophony of isolated instances requiring constant cognitive
processing that would overwhelm our limited capacities.
     In defining generalization as a problem, we do not assume that more broadly ap-
plicable results are always more desirable (Greenwood, 1989). For example, physi-
cists who use particle accelerators to discover new elements may not expect that it
would be desirable to introduce such elements into the world. Similarly, social scien-
tists sometimes aim to demonstrate that an effect is possible and to understand its
mechanisms without expecting that the effect can be produced more generally. For
instance, when a "sleeper effect" occurs in an attitude change study involving per-
suasive communications, the implication is that change is manifest after a time delay
but not immediately so. The circumstances under which this effect occurs turn out to
be quite limited and unlikely to be of any general interest other than to show that the
theory predicting it (and many other ancillary theories) may not be wrong (Cook,
Gruder, Hennigan, & Flay, 1979). Experiments that demonstrate limited generaliza-
tion may be just as valuable as those that demonstrate broad generalization.
     Nonetheless, a conflict seems to exist between the localized nature of the causal
knowledge that individual experiments provide and the more generalized causal
goals that research aspires to attain. Cronbach and his colleagues (Cronbach et al.,
1980; Cronbach, 1982) have made this argument most forcefully, and their works
have contributed much to our thinking about causal generalization. Cronbach
noted that each experiment consists of units that receive the experiences being con-
trasted, of the treatments themselves, of observations made on the units, and of the
settings in which the study is conducted. Taking the first letter from each of these
four words, he defined the acronym utos to refer to the "instances on which data
are collected" (Cronbach, 1982, p. 78), that is, the actual people, treatments, measures,
and settings that were sampled in the experiment. He then defined two problems of
generalization: (1) generalizing to the "domain about which [the] question is asked"
(p. 79), which he called UTOS; and (2) generalizing to "units, treatments, variables,
and settings not directly observed" (p. 83), which he called *UTOS.⁹

9. We oversimplify Cronbach's presentation here for pedagogical reasons. For example, Cronbach only used capital S,
not small s, so that his system referred only to utoS, not utos. He offered diverse and not always consistent definitions
of UTOS and *UTOS, in particular. And he does not use the word generalization in the same broad way we do here.

     Our theory of causal generalization, outlined below and presented in more de-
tail in Chapters 11 through 13, melds Cronbach's thinking with our own ideas
about generalization from previous works (Cook, 1990, 1991; Cook & Camp-
bell, 1979), creating a theory that is different in modest ways from both of these
predecessors. Our theory is influenced by Cronbach's work in two ways. First, we
follow him by describing experiments consistently throughout this book as con-
sisting of the elements of units, treatments, observations, and settings,¹⁰ though
we frequently substitute persons for units given that most field experimentation is
conducted with humans as participants. We also often substitute outcome for ob-
servations given the centrality of observations about outcome when examining
causal relationships. Second, we acknowledge that researchers are often interested
in two kinds of generalization about each of these five elements, and that these
two types are inspired by, but not identical to, the two kinds of generalization that
Cronbach defined. We call these construct validity generalizations (inferences
about the constructs that research operations represent) and external validity gen-
eralizations (inferences about whether the causal relationship holds over variation
in persons, settings, treatment, and measurement variables).

Construct Validity: Causal Generalization as Representation
The first causal generalization problem concerns how to go from the particular
units, treatments, observations, and settings on which data are collected to the
higher order constructs these instances represent. These constructs are almost al-
ways couched in terms that are more abstract than the particular instances sam-
pled in an experiment. The labels may pertain to the individual elements of the ex-
periment (e.g., is the outcome measured by a given test best described as
intelligence or as achievement?). Or the labels may pertain to the nature of rela-
tionships among elements, including causal relationships, as when cancer treat-
ments are classified as cytotoxic or cytostatic depending on whether they kill tu-
mor cells directly or delay tumor growth by modulating their environment.
Consider a randomized experiment by Fortin and Kirouac (1976). The treatment
was a brief educational course administered by several nurses, who gave a tour of
their hospital and covered some basic facts about surgery with individuals who
were to have elective abdominal or thoracic surgery 15 to 20 days later in a sin-
gle Montreal hospital. Ten specific outcome measures were used after the surgery,
such as an activities of daily living scale and a count of the analgesics used to con-
trol pain. Now compare this study with its likely target constructs: whether

10. We occasionally refer to time as a separate feature of experiments, following Campbell (1957) and Cook and
Campbell (1979), because time can cut across the other factors independently. Cronbach did not include time in
his notational system, instead incorporating time into treatment (e.g., the scheduling of treatment), observations
(e.g., when measures are administered), or setting (e.g., the historical context of the experiment).

patient education (the target cause) promotes physical recovery (the target effect)
among surgical patients (the target population of units) in hospitals (the target
universe of settings). Another example occurs in basic research, in which the ques-
tion frequently arises as to whether the actual manipulations and measures used
in an experiment really tap into the specific cause and effect constructs specified
by the theory. One way to dismiss an empirical challenge to a theory is simply to
make the case that the data do not really represent the concepts as they are spec-
ified in the theory.
     Empirical results often force researchers to change their initial understanding
of what the domain under study is. Sometimes the reconceptualization leads to a
more restricted inference about what has been studied. Thus the planned causal
agent in the Fortin and Kirouac (1976) study, patient education, might need to
be respecified as informational patient education if the information component of
the treatment proved to be causally related to recovery from surgery but the tour
of the hospital did not. Conversely, data can sometimes lead researchers to think
in terms of target constructs and categories that are more general than those with
which they began a research program. Thus the creative analyst of patient educa-
tion studies might surmise that the treatment is a subclass of interventions that
function by increasing "perceived control" or that recovery from surgery can be
treated as a subclass of "personal coping." Subsequent readers of the study can
even add their own interpretations, perhaps claiming that perceived control is re-
ally just a special case of the even more general self-efficacy construct. There is a
subtle interplay over time among the original categories the researcher intended
to represent, the study as it was actually conducted, the study results, and subse-
quent interpretations. This interplay can change the researcher's thinking about
what the study particulars actually achieved at a more conceptual level, as can
feedback from readers. But whatever reconceptualizations occur, the first problem
of causal generalization is always the same: How can we generalize from a sam-
ple of instances and the data patterns associated with them to the particular tar-
get constructs they represent?

External Validity: Causal Generalization as Extrapolation

The second problem of generalization is to infer whether a causal relationship
holds over variations in persons, settings, treatments, and outcomes. For example,
someone reading the results of an experiment on the effects of a kindergarten
Head Start program on the subsequent grammar school reading test scores of poor
African American children in Memphis during the 1980s may want to know if a
program with partially overlapping cognitive and social development goals would
be as effective in improving the mathematics test scores of poor Hispanic children
in Dallas if this program were to be implemented tomorrow.
     This example again reminds us that generalization is not a synonym for
broader application. Here, generalization is from one city to another city and
from one kind of clientele to another kind, but there is no presumption that Dal-
las is somehow broader than Memphis or that Hispanic children constitute a
broader population than African American children. Of course, some general-
izations are from narrow to broad. For example, a researcher who randomly
samples experimental participants from a national population may generalize
(probabilistically) from the sample to all the other unstudied members of that
same population. Indeed, that is the rationale for choosing random selection in
the first place. Similarly, when policymakers consider whether Head Start should
be continued on a national basis, they are not so interested in what happened in
Memphis. They are more interested in what would happen on the average across
the United States, as its many local programs still differ from each other despite
efforts in the 1990s to standardize much of what happens to Head Start children
and parents. But generalization can also go from the broad to the narrow. Cron-
bach (1982) gives the example of an experiment that studied differences between
the performances of groups of students attending private and public schools. In
this case, the concern of individual parents is to know which type of school is bet-
ter for their particular child, not for the whole group. Whether from narrow to
broad, broad to narrow, or across units at about the same level of aggregation,
all these examples of external validity questions share the same need: to infer the
extent to which the effect holds over variations in persons, settings, treatments,
or outcomes.

Approaches to Making Causal Generalizations
Whichever way the causal generalization issue is framed, experiments do not
seem at first glance to be very useful. Almost invariably, a given experiment uses
a limited set of operations to represent units, treatments, outcomes, and settings.
This high degree of localization is not unique to the experiment; it also charac-
terizes case studies, performance monitoring systems, and opportunistically
administered marketing questionnaires given to, say, a haphazard sample of re-
spondents at local shopping centers (Shadish, 1995b). Even when questionnaires
are administered to nationally representative samples, they are ideal for repre-
senting that particular population of persons but have little relevance to citizens
outside of that nation. Moreover, responses may also vary by the setting in which
the interview took place (a doorstep, a living room, or a work site), by the time
of day at which it was administered, by how each question was framed, or by the
particular race, age, and gender combination of interviewers. But the fact that the
experiment is not alone in its vulnerability to generalization issues does not make
it any less a problem. So what is it that justifies any belief that an experiment can
achieve a better fit between the sampling particulars of a study and more general
inferences to constructs or over variations in persons, settings, treatments, and
outcomes?

Sampling and Causal Generalization
The method most often recommended for achieving this close fit is the use of for-
mal probability sampling of instances of units, treatments, observations, and set-
tings (Rossi, Wright, & Anderson, 1983). This presupposes that we have clearly
delineated populations of each and that we can sample with known probability
from within each of these populations. In effect, this entails the random selection
of instances, to be carefully distinguished from random assignment discussed ear-
lier in this chapter. Random selection involves selecting cases by chance to repre-
sent that population, whereas random assignment involves assigning cases to mul-
tiple conditions.
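The distinction between selection and assignment can be made concrete in a few lines of code. This is our own minimal sketch, not a procedure from the text; the population size, sample size, and group sizes are invented for illustration:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# A hypothetical population of 1,000 potential participants.
population = [f"person_{i}" for i in range(1000)]

# Random SELECTION: draw a simple random sample of 100 cases by chance,
# so that the sample can represent the population it was drawn from.
sample = random.sample(population, k=100)

# Random ASSIGNMENT: allocate the selected cases by chance
# to treatment and control conditions.
shuffled = random.sample(sample, k=len(sample))  # chance ordering
treatment_group = shuffled[:50]
control_group = shuffled[50:]
```

Selection by chance supports generalizing from the sample to the population; assignment by chance supports the causal comparison between conditions. The two operations are independent, which is why a study can have either one without the other.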
     In cause-probing research that is not experimental, random samples of indi-
viduals are often used. Large-scale longitudinal surveys such as the Panel Study of
Income Dynamics or the National Longitudinal Survey are used to represent the
population of the United States, or certain age brackets within it, and measures
of potential causes and effects are then related to each other using time lags in
measurement and statistical controls for group nonequivalence. All this is done in
hopes of approximating what a randomized experiment achieves. However, cases
of random selection from a broad population followed by random assignment
from within this population are much rarer (see Chapter 12 for examples). Also
rare are studies of random selection followed by a quality quasi-experiment. Such
experiments require a high level of resources and a degree of logistical control that
is rarely feasible, so many researchers prefer to rely on an implicit set of nonsta-
tistical heuristics for generalization that we hope to make more explicit and sys-
tematic in this book.
     Random selection occurs even more rarely with treatments, outcomes, and
settings than with people. Consider the outcomes observed in an experiment. How
often are they randomly sampled? We grant that the domain sampling model of
classical test theory (Nunnally & Bernstein, 1994) assumes that the items used to
measure a construct have been randomly sampled from a domain of all possible
items. However, in actual experimental practice few researchers ever randomly
sample items when constructing measures. Nor do they do so when choosing ma-
nipulations or settings. For instance, many settings will not agree to be sampled,
and some of the settings that agree to be randomly sampled will almost certainly
not agree to be randomly assigned to conditions. For treatments, no definitive list
of possible treatments usually exists, as is most obvious in areas in which treat-
ments are being discovered and developed rapidly, such as in AIDS research. In
general, then, random sampling is always desirable, but it is only rarely and con-
tingently feasible.
     Fortunately, formal sampling methods are not the only option. Two informal, pur-
posive sampling methods are sometimes useful: purposive sampling of heteroge-
neous instances and purposive sampling of typical instances. In the former case, the
aim is to include instances chosen deliberately to reflect diversity on presumptively
important dimensions, even though the sample is not formally random. In the latter
case, the aim is to explicate the kinds of units, treatments, observations, and settings
to which one most wants to generalize and then to select at least one instance of each
class that is impressionistically similar to the class mode. Although these purposive
sampling methods are more practical than formal probability sampling, they are not
backed by a statistical logic that justifies formal generalizations. Nonetheless, they
are probably the most commonly used of all sampling methods for facilitating gen-
eralizations. A task we set ourselves in this book is to explicate such methods and to
describe how they can be used more often than is the case today.
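Purposive sampling of heterogeneous instances has no formal statistical machinery behind it, but its logic of deliberately picking instances that differ on presumptively important dimensions can be sketched as a greedy maximum-diversity selection. The sketch below is our own hypothetical illustration (the dimensions and scores are invented), not a procedure the text prescribes:

```python
def diversity_sample(candidates, k):
    """Greedy max-min selection: start with the first candidate, then
    repeatedly add the candidate whose minimum L1 distance to the
    already-chosen set is largest, i.e., the most dissimilar one."""
    chosen = [candidates[0]]
    while len(chosen) < k:
        def min_dist(c):
            return min(sum(abs(a - b) for a, b in zip(c, s)) for s in chosen)
        chosen.append(max((c for c in candidates if c not in chosen),
                          key=min_dist))
    return chosen

# Invented (urbanicity, size) scores for six hypothetical settings.
settings = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]
picked = diversity_sample(settings, k=3)
```

With these invented scores the procedure picks one setting from each cluster, which is the intuition behind heterogeneous purposive sampling: coverage of diversity rather than representativeness in the probability-sampling sense.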
     However, sampling methods of any kind are insufficient to solve either prob-
lem of generalization. Formal probability sampling requires specifying a target
population from which sampling then takes place, but defining such populations
is difficult for some targets of generalization, such as treatments. Purposive sam-
pling of heterogeneous instances is differentially feasible for different elements in
a study; it is often more feasible to make measures diverse than it is to obtain di-
verse settings, for example. Purposive sampling of typical instances is often feasi-
ble when target modes, medians, or means are known, but it leaves questions
about generalizations to a wider range than is typical. Besides, as Cronbach points
out, most challenges to the causal generalization of an experiment typically
emerge after a study is done. In such cases, sampling is relevant only if the in-
stances in the original study were sampled diversely enough to promote responsi-
ble reanalyses of the data to see if a treatment effect holds across most or all of the
targets about which generalization has been challenged. But packing so many
sources of variation into a single experimental study is rarely practical and will al-
most certainly conflict with other goals of the experiment. Formal sampling meth-
ods usually offer only a limited solution to causal generalization problems. A the-
ory of generalized causal inference needs additional tools.

A Grounded Theory of Causal Generalization
Practicing scientists routinely make causal generalizations in their research, and
they almost never use formal probability sampling when they do. In this book, we
present a theory of causal generalization that is grounded in the actual practice of
science (Matt, Cook, & Shadish, 2000). Although this theory was originally de-
veloped from ideas that were grounded in the construct and external validity lit-
eratures (Cook, 1990, 1991), we have since found that these ideas are common in
a diverse literature about scientific generalizations (e.g., Abelson, 1995; Campbell
& Fiske, 1959; Cronbach & Meehl, 1955; Davis, 1994; Locke, 1986; Medin,
1989; Messick, 1989, 1995; Rubins, 1994; Willner, 1991; Wilson, Hayward, Tunis,
Bass, & Guyatt, 1995). We provide more details about this grounded theory
in Chapters 11 through 13, but in brief it suggests that scientists make causal gen-
eralizations in their work by using five closely related principles:

1. Surface Similarity. They assess the apparent similarities between study opera-
   tions and the prototypical characteristics of the target of generalization.

2. Ruling Out Irrelevancies. They identify those things that are irrelevant because
   they do not change a generalization.
3. Making Discriminations. They clarify key discriminations that limit
   generalization.
4. Interpolation and Extrapolation. They make interpolations to unsampled val-
   ues within the range of the sampled instances and, much more difficult, they
   explore extrapolations beyond the sampled range.
5. Causal Explanation. They develop and test explanatory theories about the pat-
   tern of effects, causes, and mediational processes that are essential to the trans-
   fer of a causal relationship.
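The fourth principle has a simple numerical analogue in dose-response work. The sketch below is our own hedged illustration (the dose-response pairs are invented); it shows why interpolation within the sampled range is routine while extrapolation beyond it demands further assumptions:

```python
def linear_interpolate(x, points):
    """Piecewise-linear estimate of the response at dose x, given
    (dose, response) pairs sorted by dose. Doses outside the sampled
    range would require extrapolation, which rests on the riskier
    assumption that the observed trend continues."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("dose outside the sampled range: extrapolation needed")

# Hypothetical sampled doses (mg) and observed mean responses.
observed = [(0, 0.0), (10, 4.0), (20, 6.0)]
estimate = linear_interpolate(15, observed)  # interpolation within range
```

Interpolating at 15 mg merely fills in between observed doses; asking about 25 mg steps outside the data, and the function above deliberately refuses rather than assume the trend continues.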
     In this book, we want to show how scientists can and do use these five princi-
ples to draw generalized conclusions about a causal connection. Sometimes the
conclusion is about the higher order constructs to use in describing an obtained
connection at the sample level. In this sense, these five principles have analogues or
parallels both in the construct validity literature (e.g., with construct content, with
convergent and discriminant validity, and with the need for theoretical rationales
for constructs) and in the cognitive science and philosophy literatures that study
how people decide whether instances fall into a category (e.g., concerning the roles
that prototypical characteristics and surface versus deep similarity play in deter-
mining category membership). But at other times, the conclusion about general-
ization refers to whether a connection holds broadly or narrowly over variations
in persons, settings, treatments, or outcomes. Here, too, the principles have ana-
logues or parallels that we can recognize from scientific theory and practice, as in
the study of dose-response relationships (a form of interpolation-extrapolation) or
the appeal to explanatory mechanisms in generalizing from animals to humans (a
form of causal explanation).
     Scientists use these five principles almost constantly during all phases of re-
search. For example, when they read a published study and wonder if some varia-
tion on the study's particulars would work in their lab, they think about similari-
ties of the published study to what they propose to do. When they conceptualize
the new study, they anticipate how the instances they plan to study will match the
prototypical features of the constructs about which they are curious. They may de-
sign their study on the assumption that certain variations will be irrelevant to it but
that others will point to key discriminations over which the causal relationship
does not hold or the very character of the constructs changes. They may include
measures of key theoretical mechanisms to clarify how the intervention works.
During data analysis, they test all these hypotheses and adjust their construct de-
scriptions to match better what the data suggest happened in the study. The intro-
duction section of their articles tries to convince the reader that the study bears on
specific constructs, and the discussion sometimes speculates about how results
might extrapolate to different units, treatments, outcomes, and settings.
     Further, practicing scientists do all this not just with single studies that they
read or conduct but also with multiple studies. They nearly always think about
how their own studies fit into a larger literature about both the constructs being
measured and the variables that may or may not bound or explain a causal connec-
tion, often documenting this fit in the introduction to their study. And they apply all
five principles when they conduct reviews of the literature, in which they make in-
ferences about the kinds of generalizations that a body of research can support.
     Throughout this book, and especially in Chapters 11 to 13, we provide more
details about this grounded theory of causal generalization and about the scientific
practices that it suggests. Adopting this grounded theory of generalization does not
imply a rejection of formal probability sampling. Indeed, we recommend such sam-
pling unambiguously when it is feasible, along with purposive sampling schemes to
aid generalization when formal random selection methods cannot be implemented.
But we also show that sampling is just one method that practicing scientists use to
make causal generalizations, along with practical logic, application of diverse sta-
tistical methods, and use of features of design other than sampling.

EXPERIMENTS AND METASCIENCE

Extensive philosophical debate sometimes surrounds experimentation. Here we
briefly summarize some key features of these debates, and then we discuss some
implications of these debates for experimentation. However, there is a sense in
which all this philosophical debate is incidental to the practice of experimentation.
Experimentation is as old as humanity itself, so it preceded humanity's philo-
sophical efforts to understand causation and generalization by thousands of years.
Even over just the past 400 years of scientific experimentation, we can see some
constancy of experimental concept and method, whereas diverse philosophical
conceptions of the experiment have come and gone. Hacking (1983) said, "Ex-
perimentation has a life of its own" (p. 150). It has been one of science's most
powerful methods for discovering descriptive causal relationships, and it has done
so well in so many ways that its place in science is probably assured forever. To
justify its practice today, a scientist need not resort to sophisticated philosophical
reasoning about experimentation.
     Nonetheless, it does help scientists to understand these philosophical debates.
For example, previous distinctions in this chapter between molar and molecular
causation, descriptive and explanatory cause, or probabilistic and deterministic
causal inferences all help both philosophers and scientists to understand better
both the purpose and the results of experiments (e.g., Bunge, 1959; Eells, 1991;
Hart & Honoré, 1985; Humphreys, 1989; Mackie, 1974; Salmon, 1984, 1989;
Sobel, 1993; P. A. White, 1990). Here we focus on a different and broader set of
critiques of science itself, not only from philosophy but also from the history, so-
ciology, and psychology of science (see useful general reviews by Bechtel, 1988;
H. I. Brown, 1977; Oldroyd, 1986). Some of these works have been explicitly
about the nature of experimentation, seeking to create a justified role for it (e.g.,
                                                                      EXPERIMENTS METASCIENCE 27

Bhaskar,  L975;Campbell,1982,,1988;   Danziger,     S. Drake, l98l; Gergen,
1,973; Gholson, Shadish, Neimeyer,6d Houts, L989;Gooding,Pinch,6cSchaffer,
1,989b;Greenwood, L989; Hacking, L983; Latour, 1'987;Latour 6c
1.979;Morawski,   1988;Orne,1.962;R.  RosenthaL,1.966;Shadish        L994;
                                                             & Fuller,
Shapin,1,9941.  Thesecritiqueshelp scientists seesomelimits of experimenta-
                   and society.
tion in both science

The Kuhnian Critique

Kuhn (1962) described scientific revolutions as different and partly incommensurable paradigms that abruptly succeeded each other in time and in which the gradual accumulation of scientific knowledge was a chimera. Hanson (1958), Polanyi (1958), Popper (1959), Toulmin (1961), Feyerabend (1975), and Quine (1951, 1969) contributed to the critical momentum, in part by exposing the gross mistakes in logical positivism's attempt to build a philosophy of science based on reconstructing a successful science such as physics. All these critiques denied any firm foundations for scientific knowledge (so, by extension, experiments do not provide firm causal knowledge). The logical positivists hoped to achieve foundations on which to build knowledge by tying all theory tightly to theory-free observation through predicate logic. But this left out important scientific concepts that could not be tied tightly to observation; and it failed to recognize that all observations are impregnated with substantive and methodological theory, making it impossible to conduct theory-free tests.11
     The impossibility of theory-neutral observation (often referred to as the Quine-Duhem thesis) implies that the results of any single test (and so any single experiment) are inevitably ambiguous. They could be disputed, for example, on grounds that the theoretical assumptions built into the outcome measure were wrong or that the study made a faulty assumption about how high a treatment dose was required to be effective. Some of these assumptions are small, easily detected, and correctable, such as when a voltmeter gives the wrong reading because the impedance of the voltage source was much higher than that of the meter (Wilson, 1952). But other assumptions are more paradigmlike, impregnating a theory so completely that other parts of the theory make no sense without them (e.g., the assumption that the earth is the center of the universe in pre-Galilean astronomy). Because the number of assumptions involved in any scientific test is very large, researchers can easily find some assumptions to fault or can even posit new

11. However, Holton (1986) reminds us not to overstate the reliance of positivists on empirical data: "Even the father of positivism, Auguste Comte, had written . . . that without a theory of some sort by which to link phenomena to some principles 'it would not only be impossible to combine the isolated observations and draw any useful conclusions, we would not even be able to remember them, and, for the most part, the fact would not be noticed by our eyes'" (p. 32). Similarly, Uebel (1992) provides a more detailed historical analysis of the protocol sentence debate in logical positivism, showing some surprisingly nonstereotypical positions held by key players such as Carnap.

assumptions (Mitroff & Fitzgerald, 1977). In this way, substantive theories are less testable than their authors originally conceived. How can a theory be tested if it is made of clay rather than granite?
     For reasons we clarify later, this critique is more true of single studies and less true of programs of research. But even in the latter case, undetected constant biases can result in flawed inferences about cause and its generalization. As a result, no experiment is ever fully certain, and extrascientific beliefs and preferences always have room to influence the many discretionary judgments involved in all scientific belief.

Sociologists working within traditions variously called social constructivism, epistemological relativism, and the strong program (e.g., Barnes, 1974; Bloor, 1976; Collins, 1981; Knorr-Cetina, 1981; Latour & Woolgar, 1979; Mulkay, 1979) have shown those extrascientific processes at work in science. Their empirical studies show that scientists often fail to adhere to norms commonly proposed as part of good science (e.g., objectivity, neutrality, sharing of information). They have also shown how that which comes to be reported as scientific knowledge is partly determined by social and psychological forces and partly by issues of economic and political power both within science and in the larger society, issues that are rarely mentioned in published research reports. The most extreme among these sociologists attribute all scientific knowledge to such extrascientific processes, claiming that "the natural world has a small or nonexistent role in the construction of scientific knowledge" (Collins, 1981, p. 3).
     Collins does not deny ontological realism, that real entities exist in the world. Rather, he denies epistemological (scientific) realism, that whatever external reality may exist can constrain our scientific theories. For example, if atoms really exist, do they affect our scientific theories at all? If our theory postulates an atom, is it describing a real entity that exists roughly as we describe it? Epistemological relativists such as Collins respond negatively to both questions, believing that the most important influences in science are social, psychological, economic, and political, and that these might even be the only influences on scientific theories. This view is not widely endorsed outside a small group of sociologists, but it is a useful counterweight to naive assumptions that scientific studies somehow directly reveal nature to us (an assumption we call naive realism). The results of all studies, including experiments, are profoundly subject to these extrascientific influences, from their conception to reports of their results.

Science and Trust

A standard image of the scientist is as a skeptic, a person who only trusts results that have been personally verified. Indeed, the scientific revolution of the 17th century
claimed that trust, particularly trust in authority and dogma, was antithetical to good science. Every authoritative assertion, every dogma, was to be open to question, and the job of science was to do that questioning.
     That image is partly wrong. Any single scientific study is an exercise in trust (Pinch, 1986; Shapin, 1994). Studies trust the vast majority of already developed methods, findings, and concepts that they use when they test a new hypothesis. For example, statistical theories and methods are usually taken on faith rather than personally verified, as are measurement instruments. The ratio of trust to skepticism in any given study is more like 99% trust to 1% skepticism than the opposite. Even in lifelong programs of research, the single scientist trusts much more than he or she ever doubts. Indeed, thoroughgoing skepticism is probably impossible for the individual scientist, to judge from what we know of the psychology of science (Gholson et al., 1989; Shadish & Fuller, 1994). Finally, skepticism is not even an accurate characterization of past scientific revolutions; Shapin (1994) shows that the role of "gentlemanly trust" in 17th-century England was central to the establishment of experimental science. Trust pervades science, despite its rhetoric of skepticism.

Implications for Experiments
The net result of these criticisms is a greater appreciation for the equivocality of all scientific knowledge. The experiment is not a clear window that reveals nature directly to us. To the contrary, experiments yield hypothetical and fallible knowledge that is often dependent on context and imbued with many unstated theoretical assumptions. Consequently, experimental results are partly relative to those assumptions and contexts and might well change with new assumptions or contexts. In this sense, all scientists are epistemological constructivists and relativists. The difference is whether they are strong or weak relativists. Strong relativists share Collins's position that only extrascientific factors influence our theories. Weak relativists believe that both the ontological world and the worlds of ideology, interests, values, hopes, and wishes play a role in the construction of scientific knowledge. Most practicing scientists, including ourselves, would probably describe themselves as ontological realists but weak epistemological relativists.12 To the extent that experiments reveal nature to us, it is through a very clouded windowpane (Campbell, 1988).
     Such counterweights to naive views of experiments were badly needed. As recently as 30 years ago, the central role of the experiment in science was probably

12. If space permitted, we could extend this discussion to a host of other philosophical issues that have been raised about the experiment, such as its role in discovery versus confirmation, incorrect assertions that the experiment is tied to some specific philosophy such as logical positivism or pragmatism, and the various mistakes that are frequently made in such discussions (e.g., Campbell, 1982, 1988; Cook, 1991; Cook & Campbell, 1985; Shadish,

taken more for granted than is the case today. For example, Campbell and Stanley (1963) described themselves as:

     committed to the experiment: as the only means for settling disputes regarding educational practice, as the only way of verifying educational improvements, and as the only way of establishing a cumulative tradition in which improvements can be introduced without the danger of a faddish discard of old wisdom in favor of inferior novelties. (p. 2)

Indeed, Hacking (1983) points out that "'experimental method' used to be just another name for scientific method" (p. 149); and experimentation was then a more fertile ground for examples illustrating basic philosophical issues than it was a source of contention itself.
     Not so today. We now understand better that the experiment is a profoundly human endeavor, affected by all the same human foibles as any other human endeavor, though with well-developed procedures for partial control of some of the limitations that have been identified to date. Some of these limitations are common to all science, of course. For example, scientists tend to notice evidence that confirms their preferred hypotheses and to overlook contradictory evidence. They make routine cognitive errors of judgment and have limited capacity to process large amounts of information. They react to peer pressures to agree with accepted dogma and to social role pressures in their relationships to students, participants, and other scientists. They are partly motivated by sociological and economic rewards for their work (sadly, sometimes to the point of fraud), and they display all-too-human psychological needs and irrationalities about their work. Other limitations have unique relevance to experimentation. For example, if causal results are ambiguous, as in many weaker quasi-experiments, experimenters may attribute causation or causal generalization based on study features that have little to do with orthodox logic or method. They may fail to pursue all the alternative causal explanations because of a lack of energy, a need to achieve closure, or a bias toward accepting evidence that confirms their preferred hypothesis. Each experiment is also a social situation, full of social roles (e.g., participant, experimenter, assistant) and social expectations (e.g., that people should provide true information) but with a uniqueness (e.g., that the experimenter does not always tell the truth) that can lead to problems when social cues are misread or deliberately thwarted by either party. Fortunately, these limits are not insurmountable, as formal training can help overcome some of them (Lehman, Lempert, & Nisbett, 1988). Still, the relationship between scientific results and the world that science studies is neither simple nor fully trustworthy.
     These social and psychological analyses have taken some of the luster from the experiment as a centerpiece of science. The experiment may have a life of its own, but it is no longer life on a pedestal. Among scientists, belief in the experiment as the only means to settle disputes about causation is gone, though it is still the preferred method in many circumstances. Gone, too, is the belief that the power experimental methods often displayed in the laboratory would transfer easily to applications in field settings. As a result of highly publicized science-related


events such as the tragic results of the Chernobyl nuclear disaster, the disputes over certainty levels of DNA testing in the O. J. Simpson trials, and the failure to find a cure for most cancers after decades of highly publicized and funded effort, the general public now better understands the limits of science.
     Yet we should not take these critiques too far. Those who argue against theory-free tests often seem to suggest that every experiment will come out just as the experimenter wishes. This expectation is totally contrary to the experience of researchers, who find instead that experimentation is often frustrating and disappointing for the theories they loved so much. Laboratory results may not speak for themselves, but they certainly do not speak only for one's hopes and wishes. We find much to value in the laboratory scientist's belief in "stubborn facts" with a life span that is greater than the fluctuating theories with which one tries to explain them. Thus many basic results about gravity are the same, whether they are contained within a framework developed by Newton or by Einstein; and no successor theory to Einstein's would be plausible unless it could account for most of the stubborn factlike findings about falling bodies. There may not be pure facts, but some observations are clearly worth treating as if they were facts.
     Some theorists of science, Hanson, Polanyi, Kuhn, and Feyerabend included, have so exaggerated the role of theory in science as to make experimental evidence seem almost irrelevant. But exploratory experiments that were unguided by formal theory and unexpected experimental discoveries tangential to the initial research motivations have repeatedly been the source of great scientific advances. Experiments have provided many stubborn, dependable, replicable results that then become the subject of theory. Experimental physicists feel that their laboratory data help keep their more speculative theoretical counterparts honest, giving experiments an indispensable role in science. Of course, these stubborn facts often involve both commonsense presumptions and trust in many well-established theories that make up the shared core of belief of the science in question. And of course, these stubborn facts sometimes prove to be undependable, are reinterpreted as experimental artifacts, or are so laden with a dominant focal theory that they disappear once that theory is replaced. But this is not the case with the great bulk of the factual base, which remains reasonably dependable over relatively long periods of time.

A WORLD WITHOUT EXPERIMENTS OR CAUSES?
To borrow a thought experiment from MacIntyre (1981), imagine that the slates of science and philosophy were wiped clean and that we had to construct our understanding of the world anew. As part of that reconstruction, would we reinvent the notion of a manipulable cause? We think so, largely because of the practical utility that dependable manipulanda have for our ability to survive and prosper. Would we reinvent the experiment as a method for investigating such causes? Again yes, because humans will always be trying to better know how well these manipulable causes work. Over time, they will refine how they conduct those experiments and so will again be drawn to problems of counterfactual inference, of cause preceding effect, of alternative explanations, and of all of the other features of causation that we have discussed in this chapter. In the end, we would probably end up with the experiment or something very much like it. This book is one more step in that ongoing process of refining experiments. It is about improving the yield from experiments that take place in complex field settings, both the quality of causal inferences they yield and our ability to generalize these inferences to constructs and over variations in persons, settings, treatments, and outcomes.
A Critical Assessment of Our Assumptions

As·sump·tion (ə-sŭmp'shən): [Middle English assumpcion, from Latin assumptio, assumption-, adoption, from assumptus, past participle of assumere, to adopt; see assume.] n. 1. The act of taking to or upon oneself: assumption of an obligation. 2. The act of taking over: assumption of command. 3. The act of taking for granted: assumption of a false theory. 4. Something taken for granted or accepted as true without proof; a supposition: a valid assumption. 5. Presumption; arrogance. 6. Logic. A minor premise.

THIS BOOK covers five central topics across its 13 chapters. The first topic (Chapter 1) deals with our general understanding of descriptive causation and experimentation. The second (Chapters 2 and 3) deals with the types of validity and the specific validity threats associated with this understanding. The third (Chapters 4 through 7) deals with quasi-experiments and illustrates how combining design features can facilitate better causal inference. The fourth (Chapters 8 through 10) concerns randomized experiments and stresses the factors that impede and promote their implementation. The fifth (Chapters 11 through 13) deals with causal generalization, both theoretically and as concerns the conduct of individual studies and programs of research. The purpose of this last chapter is to critically assess some of the assumptions that have gone into these five topics, especially the assumptions that critics have found objectionable or that we anticipate they will find objectionable. We organize the discussion around each of the five topics and then briefly justify why we did not deal more extensively with nonexperimental methods for assessing causation.
     We do not delude ourselves that we can be the best explicators of our own assumptions. Our critics can do that task better. But we want to be as comprehensive and as explicit as we can. This is in part because we are convinced of the advantages of falsification as a major component of any epistemology for the social sciences, and forcing out one's assumptions and confronting them is one part of falsification. But it is also because we would like to stimulate critical debate about these assumptions so that we can learn from those who would challenge our thinking. If there were to be a future book that carried even further forward the tradition emanating from Campbell and Stanley via Cook and Campbell to this book, then that future book would probably be all the better for building upon all the justified criticisms coming from those who do not agree with us, either on particulars or on the whole approach we have taken to the analysis of descriptive causation and its generalization. We would like this chapter not only to model the attempt to be critical about the assumptions scholars must inevitably make but also to encourage others to think about these assumptions and how they might be addressed in future empirical or theoretical work.


CAUSATION AND EXPERIMENTATION

Causal Arrows and Pretzels
Experiments test the influence of one or at most a small subset of causes. If statistical interactions are involved, they tend to be among very few treatments or between a single treatment and a limited set of moderator variables. Many researchers believe that the causal knowledge that results from this typical experimental structure fails to map the many causal forces that simultaneously affect any given outcome in complex and nonlinear ways (e.g., Cronbach et al., 1980; Magnusson, 2000). These critics assert that experiments prioritize arrows connecting A to B when they should instead seek to describe an explanatory pretzel or set of intersecting pretzels, as it were. They also believe that most causal relationships vary across units, settings, and times, and so they doubt whether there are any constant bivariate causal relationships (e.g., Cronbach & Snow, 1977). Those that do appear to be dependable in the data may simply reflect statistically underpowered tests of moderators or mediators that failed to reveal the true underlying complex causal relationships. True variation in effects may also be obscured because the relevant substantive theory is underspecified, or the outcome measures are partially invalid, or the treatment contrast is attenuated, or causally implicated variables are truncated in how they are sampled (McClelland & Judd, 1993).
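The point about underpowered moderator tests can be made concrete with a small simulation. The sketch below is our illustration, not an analysis from the text: the sample size, effect sizes, and function names are all assumptions. It compares how often a simulated randomized experiment detects a main effect versus an equally sized treatment-by-moderator interaction; because the interaction term has far less variance than the treatment indicator itself, the interaction test rejects much less often at the same nominal level.

```python
import numpy as np

rng = np.random.default_rng(0)

def z_stats(X, y):
    """z statistics for OLS coefficients (normal approximation to the t test)."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta / se

def rejection_rates(n=200, main=0.4, moderation=0.4, reps=500):
    """Share of simulated experiments detecting the main effect and the
    treatment-by-moderator interaction at the nominal 5% level."""
    hits = np.zeros(2)
    for _ in range(reps):
        t = rng.integers(0, 2, n) - 0.5          # randomized treatment, centered
        m = rng.integers(0, 2, n) - 0.5          # binary moderator, centered
        y = main * t + moderation * t * m + rng.normal(size=n)
        X = np.column_stack([np.ones(n), t, m, t * m])
        z = np.abs(z_stats(X, y))
        hits += z[[1, 3]] > 1.96                 # [main effect, interaction]
    return hits / reps
```

Under these assumed parameters, `rejection_rates()` detects the main effect in roughly four out of five replications but the interaction far less than half the time, even though both coefficients are the same size, which is one reason dependable-looking bivariate relationships can coexist with undetected moderation.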
     As valid as these objections are, they do not invalidate the case for experiments. The purpose of experiments is not to completely explain some phenomenon; it is to identify whether a particular variable or small set of variables makes a marginal difference in some outcome over and above all the other forces affecting that outcome. Moreover, ontological doubts such as the preceding have not stopped believers in more complex causal theories from acting as though many causal relationships can be usefully characterized as dependable main effects or as very simple nonlinearities that are also dependable enough to be useful. To see this, consider some examples from education in the United States, where

objections to experimentation are probably the most prevalent and virulent. Few educational researchers seem to object to the following substantive conclusions of the form that A dependably causes B: small schools are better than large ones; time-on-task raises achievement; summer school raises test scores; school desegregation hardly affects achievement but does increase White flight; and assigning and grading homework raises achievement. The critics also do not seem to object to other conclusions involving very simple causal contingencies: reducing class size increases achievement, but only if the amount of change is large and to a level under 20; or Catholic schools are superior to public ones, but only in the inner city and not in the suburbs and then most noticeably in graduation rates rather than in achievement test scores.
     The primary justification for such oversimplifications, and for the use of the experiments that test them, is that some moderators of effects are of minor relevance to policy and theory even if they marginally improve explanation. The most important contingencies are usually those that modify the sign of a causal relationship rather than its magnitude. Sign changes imply that a treatment is beneficial in some circumstances but might be harmful in others. This is quite different from identifying circumstances that influence just how positive an effect might be. Policy-makers are often willing to advocate an overall change, even if they suspect it has different-sized positive effects for different groups, as long as the effects are rarely negative. But if some groups will be positively affected and others negatively, political actors are loath to prescribe different treatments for different groups because rivalries and jealousies often ensue. Theoreticians also probably pay more attention to causal relationships that differ in causal sign because this result implies that one can identify the boundary conditions that impel such a disparate data pattern.
     Of course, we do not advocate ignoring all causal contingencies. For example, physicians routinely prescribe one of several possible interventions for a given diagnosis. The exact choice may depend on the diagnosis, test results, patient preferences, insurance resources, and the availability of treatments in the patient's area. However, the costs of such a contingent system are high. In part to limit the number of relevant contingencies, physicians specialize, and within their own specialty they undergo extensive training to enable them to make these contingent decisions. Even then, substantial judgment is still required to cover the many situations in which causal contingencies are ambiguous or in dispute. In many other policy domains it would also be costly to implement the financial, management, and cultural changes that a truly contingent system would require even if the requisite knowledge were available. Taking such a contingent approach to its logical extremes would entail in education, for example, that individual tutoring become the order of the day. Students and instructors would have to be carefully matched for overlap in teaching and learning skills and in the curriculum supports they would need.
     Within limits, some moderators can be studied experimentally, either by measuring the moderator so it can be tested during analysis or by deliberately

varying it in the next study in a program of research. Conducting such studies would move the field from the black-box experiments of yesteryear toward taking causal contingencies more seriously and toward routinely studying them by, for example, disaggregating the treatment to examine its causally effective components, disaggregating the effect to examine its causally impacted components, conducting analyses of demographic and psychological moderator variables, and exploring the causal pathways through which (parts of) the treatment affect (parts of) the outcome. To do all of this well in a single experiment is not possible; but to do some of it well is possible and desirable.
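The last of these ideas, exploring the causal pathways through which a treatment affects an outcome, is often approximated with a product-of-coefficients mediation analysis: regress the mediator on the treatment, regress the outcome on both, and multiply the two path coefficients. The following is a minimal sketch on simulated data; the variable names, effect sizes, and helper function are our illustrative assumptions, not a prescription from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def ols(X, y):
    """Least-squares regression coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Simulated experiment: the treatment moves a mediator, and both the
# mediator and the treatment itself move the outcome (all paths assumed).
n = 1000
ones = np.ones(n)
t = rng.integers(0, 2, n).astype(float)        # randomized treatment
med = 0.5 * t + rng.normal(size=n)             # treatment -> mediator path (a)
y = 0.4 * med + 0.2 * t + rng.normal(size=n)   # mediator (b) and direct (c') paths

a = ols(np.column_stack([ones, t]), med)[1]              # estimate of a
b, direct = ols(np.column_stack([ones, med, t]), y)[1:]  # estimates of b and c'
total = ols(np.column_stack([ones, t]), y)[1]            # total effect c
indirect = a * b                                         # pathway through the mediator
# For OLS with an intercept, the decomposition c = c' + a*b holds exactly.
```

The decomposition is an algebraic identity for linear least squares; its causal interpretation, of course, rests on the further assumption that the mediator itself is not confounded with the outcome, which randomizing the treatment alone does not guarantee.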

Epistemological Criticisms of Experiments

In highlighting statistical conclusion validity and in selecting examples, we have
often linked causal description to quantitative methods and hypothesis testing.
Many critics will (wrongly) see this as implying a discredited theory of positivism.
As a philosophy of science first outlined in the early 19th century, positivism
rejected metaphysical speculations, especially about unobservables, and equated
knowledge with descriptions of experienced phenomena. A narrower school of
logical positivism emerged in the early 20th century that also rejected realism
while also emphasizing the use of data-theory connections in predicate logic form
and a preference for predicting phenomena over explaining them. Both these
related epistemologies were long ago discredited, especially as explanations of how
science operates, so few critics seriously criticize experiments on this basis. How-
ever, many critics use the term positivism with less historical fidelity to attack
quantitative social science methods in general (e.g., Lincoln & Guba, 1985).
Building on the rejection of logical positivism, they reject the use of quantification
and formal logic in observation, measurement, and hypothesis testing. Because
these last features are part of experiments, to reject this loose conception of posi-
tivism entails rejecting experiments. However, the errors in such criticisms are nu-
merous. For example, to reject a specific feature of positivism (like the idea that
quantification and predicate logic are the only permissible links between data and
theory) does not necessarily imply rejecting all related and more general proposi-
tions (such as the notion that some kinds of quantification and hypothesis testing
may be useful for knowledge growth). We and others have outlined more such er-
rors elsewhere (Phillips, 1990; Shadish, 1995a).
    Other epistemological criticisms of experimentation come from the work of
historians of science such as Kuhn (1962), of sociologists of science such as Latour
and Woolgar (1979), and of philosophers of science such as Harré (1981). These
critics tend to focus on three things. One is the incommensurability of theories, the
notion that theories are never perfectly specified and so can always be
reinterpreted. As a result, when disconfirming data seem to imply that a theory
should be rejected, its postulates can instead be reworked in order to make the
theory and observations consistent with each other. This is usually done by adding
new contingencies to the

     theory that limit the conditions under which it is thought to hold. A second cri-
     tique is of the assumption that experimental observations can be used as truth
tests. We would like observations to be objective assessments that can adjudicate
     between different theoretical explanations of a phenomenon. But in practice, ob-
servations are not theory neutral; they are open to multiple interpretations that
include such irrelevancies as the researcher's hopes, dreams, and predilections. The
consequence is that observations rarely result in definitive hypothesis tests. The
final criticism follows from the many behavioral and cognitive inconsistencies
between what scientists do in practice and what scientific norms prescribe they
    tween what scientists do in practice and what scientific norms prescribe they
    should do. Descriptions of scientists' behavior in laboratories reveal them as
choosing to do particular experiments because they have an intuition about a re-
    lationship, or they are simply curious to seewhat happens, or they want to play
    with a new piece of equipment they happen to find lying around. Their impetus,
    therefore, is not a hypothesis carefully deduced from a theory that they then test
    by means of careful observation.
     Although these critiques have some credibility, they are overgeneralized. Few
experimenters believe that their work yields definitive results even after it has been
    subjected to professional review. Further, though these philosophical, historical,
    and social critiques complicate what a "fact" means for any scientific method,
nonetheless many relationships have stubbornly recurred despite changes associ-
ated with the substantive theories, methods, and researcher biases that first gen-
erated them. Observations may never achieve the status of "facts," but many of
them are so stubbornly replicable that they may be considered as though they were
facts. For experimenters, the trick is to make sure that observations are not im-
pregnated with just one theory, and this is done by building multiple theories into
observations and by valuing independent replications, especially those of sub-
stantive critics-what we have elsewhere called critical multiplism (Cook, 1985;
Shadish, 1989, 1994).
         Although causal claims can never be definitively tested and proven, individ-
    ual experiments still manage to probe such claims. For example, if a study pro-
duces negative results, it is often the case that program developers and other ad-
    vocates then bring up methodological and substantive contingenciesthat might
have changed the result. For instance, they might contend that a different outcome
   measure or population would have led to a different conclusion. Subsequent       stud-
   ies then probe these alternatives and, if they again prove negative, lead to yet an-
   other round of probes of whatever new explanatory possibilities have emerged.
   After a time, this process runs out of steam, so particularistic are the contingen-
   cies that remain to be examined. It is as though a consensusemerges:"The causal
   relationship was not obtained under many conditions. The conditions that remain
   to be examined are so circumscribed that the intervention will not be worth much
even if it is effective under these conditions." We agree that this process is as much
   or more social than logical. But the reality of elastic theory does not mean that de-
   cisions about causal hypotheses are only social and devoid of all empirical and
   logical content.


      The criticisms noted are especially useful in highlighting the limited value of
individual studies relative to reviews of research programs. Such reviews are bet-
ter because the greater diversity of study features makes it less likely that the same
theoretical biases that inevitably impregnate any one study will reappear across all
the studies under review. Still, a dialectic process of point, response, and counter-
point is needed even with reviews, again implying that no single review is defini-
tive. For example, in response to Smith and Glass's (1977) meta-analytic claim
that psychotherapy was effective, Eysenck (1977) and Presby (1977) pointed out
methodological and substantive contingencies that challenged the original re-
viewers' results. They suggested that a different answer would have been achieved
if Smith and Glass had not combined randomized and nonrandomized experi-
ments or if they had used narrower categories with which to classify types of ther-
apy. Subsequent studies probed these challenges to Smith and Glass or brought
forth novel ones (e.g., Weisz et al., 1992). This process of challenging causal
claims with specific alternatives has now slowed in reviews of psychotherapy as
many major contingencies that might limit effectiveness have been explored. The
current consensus from reviews of many experiments in many kinds of settings is
that psychotherapy is effective; it is not just the product of a regression process
(spontaneous remission) whereby those who are temporarily in need seek profes-
sional help and get better, as they would have even without the therapy.

Neglected Questions
Our focus on causal questions within an experimental framework neglects many
other questions that are relevant to causation. These include questions about how
to decide on the importance or leverage of any single causal question. This could
entail exploring whether a causal question is even warranted, as it often is not at
the early stage of development of an issue. Or it could entail exploring what type
of causal question is more important: one that fills an identified hole in some lit-
erature, or one that sets out to identify specific boundary conditions limiting a
causal connection, or one that probes the validity of a central assumption held by
all the theorists and researchers within a field, or one that reduces uncertainty
about an important decision when formerly uncertainty was high. Our approach
also neglects the reality that how one formulates a descriptive causal question usu-
ally entails meeting some stakeholders' interests in the social research more than
those of others. Thus to ask about the effects of a national program meets the
needs of Congressional staffs, the media, and policy wonks to learn about whether
the program works. But it can fail to meet the needs of local practitioners who
usually want to know about the effectiveness of microelements within the pro-
gram so that they can use this knowledge to improve their daily practice. In more
theoretical work, to ask how some intervention affects personal self-efficacy is
likely to promote individuals' autonomy needs, whereas to ask about the effects
of a persuasive communication designed to change attitudes could well cater to

       the needs of those who would limit or manipulate such autonomy. Our narrow
technical approach to causation also neglected issues related to how such causal
       knowledge might be used and misused. It gave short shrift to a systematic analy-
       sis of the kinds of causal questions that can and cannot be answered through ex-
periments. What about the effects of abortion, divorce, stable cohabitation, birth
       out of wedlock, and other possibly harmful events that we cannot ethically ma-
       nipulate? What about the effects of class,race, and gender that are not amenable
to experimentation? What about the effects of historical occurrences that can be
       studied only by using time-seriesmethods on whatever variables might or might
       not be in the archives?Of what use, one might ask, is a method that cannot get at
       some of the most important phenomena that shape our social world, often over
       generations, in the caseof race, class,and gender?
            Many statisticians now consider questions about things that cannot be ma-
      nipulated as being beyond causal analysis,so closely do they link manipulation to
      causation. To them, the cause must be at least potentially manipulable, even if it
      is not actually manipulated in a given observational study. Thus they would not
consider race a cause, though they would speak of the causal analysis of race in
      studies in which Black and White couples are, say, randomly assignedto visiting
      rental units in order to seeif the refusal rates vary, or that entail chemically chang-
      ing skin color to seehow individuals are responded to differently as a function of
pigmentation, or that systematically varied the racial mix of students in schools or
      classrooms in order to study teacher responsesand student performance. Many
      critics do not like so tight a coupling of manipulation and causation. For exam-
      ple, those who do status attainment researchconsider it obvious that race causally
      influences how teachers treat individual minority students and thus affects how
      well these children do in school and therefore what jobs they get and what
prospects their own children will subsequently have. So this coupling of cause to
      manipulation is a real limit of an experimental approach to causation. Although
      we like the coupling of causation and manipulation for purposes of defining ex-
      periments, we do not seeit as necessaryto all useful forms of cause.

Objections to Internal Validity
There are several criticisms of Campbell's (1957) validity typology and its exten-
sions (Gadenne, 1976; Kruglanski & Kroy, 1976; Hultsch & Hickey, 1978; Cron-
bach, 1982; Cronbach et al., 1980). We start first with two criticisms of internal
validity raised by Cronbach (1982) and to a lesser extent by Kruglanski and Kroy
(1976): (1) an atheoretically defined internal validity (A causes B) is trivial with-
out reference to constructs; and (2) causation in single instances is impossible, in-
cluding in single experiments.

Internal Validity Is Trivial

    I consider it pointless to speak of causes when all that can be validly meant by refer-
    ence to a cause in a particular instance is that, on one trial of a partially specified ma-
    nipulation under conditions A, B, and C, along with other conditions not named, phe-
    nomenon P was observed. To introduce the word cause seems pointless. Campbell's
    writings make internal validity a property of trivial, past-tense, and local statements.
    (p. 137)

Hence, "causal language is superfluous" (p. 140). Cronbach does not retain a spe-
cific role for causal inference in his validity typology at all. Kruglanski and Kroy
(1976) criticize internal validity similarly, saying:

    The concrete events which constitute the treatment within a specific research are
    meaningful only as members of a general conceptual category. . . . Thus, it is simply
    impossible to draw strictly specific conclusions from an experiment: our concepts are
    general and each presupposes an implicit general theory about resemblance between
    different concrete cases. (p. 157)

All these authors suggest collapsing internal with construct validity in different
ways. Of course, we agree that researchers conceptualize and discuss treatments
and outcomes in conceptual terms. As we said in Chapter 3, constructs are so basic
to language and thought that it is impossible to conceptualize scientific work with-
out them. Indeed, in many important respects, the constructs we use constrain
what we experience, a point agreed to by theorists ranging from Quine (1951,
1969) to the postmodernists (Conner, 1989; Tester, 1993). So when we say that
internal validity concerns an atheoretical local molar causal inference, we do not
mean that the researcher should conceptualize experiments or report a causal
claim as "something made a difference," to use Cronbach's (1982, p. 130) exag-
geration.
     Still, it is both sensible and useful to differentiate internal from construct va-
lidity. The task of sorting out constructs is demanding enough to warrant separate
attention from the task of sorting out causes. After all, operations are concept
laden, and it is very rare for researchers to know fully what those concepts are. In
fact, the researcher almost certainly cannot know them fully because paradigmatic
concepts are so implicitly and universally imbued that those concepts and their as-
sumptions are sometimes entirely unrecognized by research communities for
years. Indeed, the history of science is replete with examples of famous series of
experiments in which a causal relationship was demonstrated early, but it took
years for the cause (or effect) to be consensually and stably named. For instance,
in psychology and linguistics many causal relationships originally emanated from
a behaviorist paradigm but were later relabeled in cognitive terms; in the early
Hawthorne study, illumination effects were later relabeled as effects of obtrusive
observers; and some cognitive dissonance effects have been reinterpreted as
attribution effects. In the history of a discipline, relationships that are correctly
identified as causal can be important even when the cause and effect constructs
are incorrectly labeled. Such examples exist because the reasoning used to draw
causal inferences (e.g., requiring evidence that treatment preceded outcome) dif-
fers from the reasoning used to generalize (e.g., matching operations to prototyp-
ical characteristics of constructs). Without understanding what is meant by de-
scriptive causation, we have no means of telling whether a claim to have
established such causation is justified.
     Cronbach's (1982) prose makes clear that he understands the importance of
causal logic; but in the end, his sporadically expressed craft knowledge does not
add up to a coherent theory of judging the validity of descriptive causal inferences.
His equation of internal validity with part of reproducibility (under replication)
misses the point that one can replicate incorrect causal conclusions. His solution
to such questions is simply that "the force of each question can be reduced by suit-
able controls" (1982, p. 233). This is inadequate, for a complete analysis of the
problem of descriptive causal inference requires concepts we can use to recognize
suitable controls. If a suitable control is one that reduces the plausibility of, say,
history or maturation, as Cronbach (1982, p. 233) suggests, this is little more than
internal validity as we have formulated it. If one needs the concepts enough to use
them, then they should be part of a validity typology for cause-probing methods.
     For completeness, we might add that a similar boundary question arises be-
tween construct validity and external validity and between construct validity and
statistical conclusion validity. In the former case, no scientist ever frames an ex-
ternal validity question without couching the question in the language of con-
structs. In the latter case, researchers never conceptualize or discuss their results
solely in terms of statistics. Constructs are ubiquitous in the process of doing re-
search because they are essential for conceptualizing and reporting operations.
But again, the answer to this objection is the same. The strategies for making in-
ferences about a construct are not the same as strategies for making inferences
about whether a causal relationship holds over variation in persons, settings,
treatments, and outcomes in external validity or for drawing valid statistical con-
clusions in the case of statistical conclusion validity. Construct validity requires a
theoretical argument and an assessment of the correspondence between samples
and constructs. External validity requires analyzing whether causal relationships
hold over variations in persons, settings, treatments, and outcomes. Statistical
conclusion validity requires close examination of the statistical procedures and as-
sumptions used. And again, one can be wrong about construct labels while being
right about external or statistical conclusion validity.

Objections to Causation in Single Experiments
A second criticism of internal validity denies the possibility of inferring causation
in a single experiment. Cronbach (1982) says that the important feature of cau-
sation is the "progressive localization of a cause" (Mackie, 1974, p. 73) over mul-
tiple experiments in a program of research in which the uncertainties about the es-
sential features of the cause are reduced to the point at which one can character-
ize exactly what the cause is and is not. Indeed, much philosophy of causation as-
serts that we only recognize causes through observing multiple instances of a
putative causal relationship, although philosophers differ as to whether the mech-
anism for recognition involves logical laws or empirical regularities (Beauchamp,
1974; P. White, 1990).
     However, some philosophers do defend the position that causes can be in-
ferred in single instances (e.g., Davidson, 1967; Ducasse, 1951; Madden & Hum-
ber, 1971). A good example is causation in the law (e.g., Hart & Honoré, 1985),
by which we judge whether or not one person, say, caused the death of another
despite the fact that the defendant may never before have been on trial for a crime.
The verdict requires a plausible case that (among other things) the defendant's ac-
tions preceded the death of the victim, that those actions were related to the death,
that other potential causesof the death are implausible, and that the death would
not have occurred had the defendant not taken those actions-the very logic of
causal relationships and counterfactualsthat we outlined in Chapter 1. In fact, the
defendant'scriminal history will often be specifically excluded from consideration
in judging guilt during the trial. The lesson is clear. Although we may learn more
about causation from multiple than from single experiments, we can infer cause
in single experiments. Indeed, experimenters will do so whether we tell them to or
not. Providing them with conceptual help in doing so is a virtue, not a vice; fail-
ing to do so is a major flaw in a theory of cause-probing methods.
      Of course, individual experiments virtually always use prior concepts from
other experiments. However, such prior conceptualizations are entirely consistent
with the claim that internal validity is about causal claims in single experiments.
If it were not (at least partly) about single experiments, there would be no point
to doing the experiment, for the prior conceptualization would successfully pre-
dict what will be observed. The possibility that the data will not support the prior
 conceptualization makes internal validity essential.Further, prior conceptualiza-
 tions are not logically necessary;we can experiment to discover effects that we
                                                 "The physicist George Darwin used
 have no prior conceptual structure to expect:
 to say tliat once in a while one should do a completely crazy experiment, like
 blowing the trumper to the tulips every morning for a month. Probably nothing
 wiil hafpen, but if something did happen, that would be a stupendousdiscovery"
 (Hacking, L983, p. 15a). But we would still need internal validity to guide us in
 judging if the trumpets had an effect.

Objections to Descriptive Causation
A few authors object to the very notion of descriptive causation. Typically, how-
ever, such objections are made about a caricature of descriptive causation that has
not been used in philosophy or in science for many years-for example, a billiard
ball model that requires a commitment to deterministic causation and that excludes

reciprocal causation. In contrast, most who write about experimentation today es-
pouse theories of probabilistic causation in which the many difficulties associated
with identifying dependable causal relationships are humbly acknowledged. Even
more important, these critics inevitably use causal-sounding language themselves,
for example, replacing "cause" with "mutual simultaneous shaping" (Lincoln &
Guba, 1985, p. 151). These replacements seem to us to avoid the word but keep
the concept, and for good reason. As we said at the end of Chapter 1, if we wiped
the slate clean and constructed our knowledge of the world anew, we believe we
would end up reinventing the notion of descriptive causation all over again, so
greatly does knowledge of causes help us to survive in the world.

Objections Concerning the Discrimination Between
Construct Validity and External Validity
      Although we traced the history of the present validity system briefly in Chapter 2,
      readers may want additional historical perspectiveon why we made the changes
      we made in the present book regarding construct and external validity. Both
Campbell (1957) and Campbell and Stanley (1963) only used the phrase external
validity, which they defined as inferring to what populations, settings, treatment
variables, and measurement variables an effect can be generalized. They did not
refer at all to construct validity. However, from his subsequent writings (Camp-
      bell, 1986), it is clear Campbell thought of construct validity as being part of ex-
ternal validity. In Campbell and Stanley, therefore, external validity subsumed
      generalizing from researchoperations about persons, settings,causes,and effects
      for the purposes of labeling theseparticulars in more abstract terms, and also gen-
      eralizing by identifying sourcesof variation in causal relationships that are attrib-
      utable to person, setting, cause, and effect factors. All subsequentconceptualiza-
      tions also share the same generic strategy based on sampling instancesof persons,
      settings, causes,and effects and then evaluating them for their presumed corre-
      spondenceto targets of inference.
     In Campbell and Stanley's formulation, person, setting, cause, and effect cat-
egories share two basic similarities despite their surface differences-to wit, all of
      them have both ostensive qualities and construct representations.Populations of
      persons or settings are composed of units that are obviously individually osten-
      sive. This capacity to point to individual persons and settings, especially when
they are known to belong in a referent category, permits them to be readily enu-
      merated and selectedfor study in the formal ways that sampling statisticianspre-
      fer. By contrast, although individual measures (e.g., the Beck Depression Inven-
      tory) and treatments (e.g., a syringe full of a vaccine) are also ostensive,efforts to
enumerate all existing ways of measuring or manipulating such measures and
treatments are much more rare (e.g., Bloom, 1956; Ciarlo et al., 1986; Steiner &
Gingrich, 2000). The reason is that researchers prefer to use substantive theory to
      determine which attributes a treatment or outcome measureshould contain in any

                                                                          vALrDtrY oe,

given study, recognizing that scholars often disagree about the relevant attributes
of the higher order entity and of the supposed best operations to represent them.
None of this negates the reality that populations of persons or settings are also de-
fined in part by the theoretical constructs used to refer to them, just like treatments
and outcomes; they also have multiple attributes that can be legitimately con-
tested. What, for instance, is the American population? While a legal definition
surely exists, it is not inviolate. The German conception of nationality allows that
the great-grandchildren of a German are Germans even if their parents and grand-
parents have not claimed German nationality. This is not possible for Americans.
And why privilege a legal definition? A cultural conception might admit as Amer-
ican all those illegal immigrants who have been in the United States for decades
and it might exclude those American adults with passports who have never lived
in the United States. Given that persons, settings, treatments, and outcomes all
have both construct and ostensive qualities, it is no surprise that Campbell and
Stanley did not distinguish between construct and external validity.
     Cook and Campbell, however, did distinguish between the two. Their un-
stated rationale for the distinction was mostly pragmatic-to facilitate memory
for the very long list of threats that, with the additions they made, would have
had to fit under Campbell and Stanley's umbrella conception of external validity.
In their theoretical discussion, Cook and Campbell associated construct validity
with generalizing to causes and effects, and external validity with generalizing to
and across persons, settings, and times. Their choice of terms explicitly refer-
enced Cronbach and Meehl (1955), who used construct and construct validity in
measurement theory to justify inferences "about higher-order constructs from re-
search operations" (Cook & Campbell, 1979, p. 38). Likewise, Cook and
Campbell associated the terms population and external validity with sampling
theory and the formal and purposive ways in which researchers select instances
of persons and settings. But to complicate matters, Cook and Campbell also
briefly acknowledged that "all aspects of the research require naming samples in
generalizable terms, including samples of peoples and settings as well as samples
of measures or manipulations" (p. 59). And in listing their external validity
 threats as statistical inieractions between a treatment and population, they linked
 external validity more to generalizing across populations than to generalizing to
 them. Also, their construct validity threats were listed in ways that emphasized
 generalizing to cause and effect constructs. Generalizing across different causes
 ind effect, *", listed as external validity becausethis task does not involve at-
 tributing meaning to a particular measure or manipulation. To read the threats
 in Cook and Campbell, external validity is about generalizing acrosspopulations
 of persons and settings and across different cause and effect constructs, while
 construct validity is about generalizing to causesand effects.Where, then, is gen-
  era\zing from samples of persons or settings to their referent populations? The
 text disiussesthis as a matter of external validitg but this classification is not ap-
 parent in the list of validity threats. A system is neededthat can improve on Cook
  and Campbell's partial confounding between objects of generalization (causes
468                         OF
        14.A CRITICAL

      and effects versus persons and settings) and functions of generalization (general-
      izing to higher-order constructs from researchoperations versus inferring the de-
      greeof replicationacrossdifferent constructsand populations).
     This book uses such a functional approach to differentiate construct validity from external validity. It equates construct validity with labeling research operations, and external validity with sources of variation in causal relationships. This new formulation subsumes all of the old. Thus, Cook and Campbell's understanding of construct validity as generalizing from manipulations and measures to cause and effect constructs is retained. So is external validity understood as generalizing across samples of persons, settings, and times. And generalizing across different cause or effect constructs is now even more clearly classified as part of external validity. Also highlighted is the need to label samples of persons and settings in abstract terms, just as measures and manipulations need to be labeled. Such labeling would seem to be a matter of construct validity given that construct validity is functionally defined in terms of labeling. However, labeling human samples might have been read as being a matter of external validity in Cook and Campbell, given that their referents were human populations and their validity types were organized more around referents than functions. So, although the new formulation in this book is definitely more systematic than its predecessors, we are unsure whether that systematization will ultimately result in greater terminological clarity or confusion. To keep the latter to a minimum, the following discussion reflects issues pertinent to the demarcation of construct and external validity that have emerged either in deliberations between the first two authors or in classes that we have taught using pre-publication versions of this book.

Is Construct Validity a Prerequisite for External Validity?
In this book, we equate external validity with variation in causal relationships and construct validity with labeling research operations. Some readers might see this as suggesting that successful generalization of a causal relationship requires accurate labeling of each population of persons and each type of setting to which generalization is sought, even though we can never be certain that anything is labeled with perfect accuracy. The relevant task is to achieve the most accurate assessment available under the circumstances. Technically, we can test generalization across entities that are already known to be confounded and thus not labeled well, as when causal data are broken out by gender but the females in the sample are, on average, more intelligent than the males and therefore score higher on everything else correlated with intelligence. This example illustrates how dangerous it is to rely on measured surface similarity alone (i.e., gender differences) for determining how a sample should be labeled in population terms. We might more accurately label gender differences if we had a random sample of each gender taken from the same population. But this is not often found in experimental work, and even this is not perfect because gender is known to be confounded with other attributes (e.g., income, work status) even in the population, and those other attributes may be pertinent labels for some of the inferences being made. Hence, we usually have to rely on the assumption that, because gender samples come from the same physical setting, they are comparable on all background characteristics that might be correlated with the outcome. Because this assumption cannot be fully tested and is anyway often false, as in the hypothetical example above, we could and should measure all the potential confounds within the limits of our theoretical knowledge to suggest them, and we should also use these measures in the analysis to reduce confounding.
     Even with acknowledged confounding, sample-specific differences in effect sizes may still allow us to conclude that a causal relationship varies by something associated with gender. This is a useful conclusion for preventing premature overgeneralization. With more breakdowns, confounded or not, one can even get a sense of the percentage of contrasts across which a causal relationship does and does not hold. But without further work, the populations across which the relationship varies are incompletely identified. The value of identifying them better is particularly salient when some effect sizes cannot be distinguished from zero. Although this clearly identifies a nonuniversal causal relationship, it does not advance theory or practice by specifying the labeled boundary conditions over which a causal relationship fails to hold. Knowledge gains are also modest from generalization strategies that do not explicitly contrast effect sizes. Thus, when different populations are lumped together in a single hypothesis test, researchers can learn how large a causal relationship is despite the many unexamined sources of variation built into the analysis. But they cannot accurately identify which constructs do and do not co-determine the relationship's size. Construct validity adds useful specificity to external validity concerns, but it is not a necessary condition for external validity. We can generalize across entities known to be confounded, albeit less usefully than across accurately labeled entities.
     This last point is similar to the one raised earlier to counter the assertion of Gadenne (1976) and Kruglanski and Kroy (1976) that internal validity requires the high construct validity of both cause and effect. They assert that all science is about constructs, and so it has no value to conclude that "something caused something else," the result that would follow if we did a technically exemplary randomized experiment with correspondingly high internal validity but the cause and effect were not labeled. Nonetheless, a causal relationship is demonstrably entailed, and the finding that "something reliably caused something else" might lead to further research to refine whatever clues are available about the cause and effect constructs. A similar argument holds for the relationship of construct to external validity. Labels with high construct validity are not necessary for internal or for external validity, but they are useful for both.
     Researchers necessarily use the language of constructs (including human and setting population ones) to frame their research questions and select their representations of constructs in the samples and measures chosen. If they have designed their work well and have had some luck, the constructs they begin and end with will be the same, though critics can challenge any claims they make. However, the samples and constructs might not match well, and then the task is to examine the samples and ascertain what they might alternatively stand for. As critics like Gadenne, Kruglanski, and Kroy have pointed out, such reliance on the operational level seems to legitimize operations as having a life independent of constructs. This is not the case, though, for operations are intimately dependent on interpretations at all stages of research. Still, every operation fits some interpretations, however tentative that referent may be due to poor research planning or to nature turning out to be more complex than the researcher's initial theory.

How Does Variation Across Different Operational Representations of the Same Intended Cause or Effect Relate to Construct and External Validity?
In Chapter 3 we emphasized how the valid labeling of a cause or effect benefits from multiple operational instances, and also that these various instances can be fruitfully analyzed to examine how a causal relationship varies with the definition used. If each operational instance is indeed of the same underlying construct, then the same causal relationship should result regardless of how the cause or effect is operationally defined. Yet data analysis sometimes reveals that a causal relationship varies by operational instance. This means that the operations are not in fact equivalent, so that they presumably tap both into different constructs and into different causal relationships. Either the same causal construct is differently related to what now must be seen as two distinct outcomes, or the same effect construct is differently related to two or more unique causal agents. So the intention to promote the construct validity of causes and effects by using multiple operations has now facilitated conclusions about the external validity of causes or effects; that is, when the external validity of the cause and effect are in play, the data analysis has revealed that more than one causal relationship needs to be invoked.
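The data-analytic check described here, asking whether the effect size depends on which operational instance of the outcome is used, can be sketched as follows. The two measures and the simulated pattern (the treatment moves measure A but not measure B) are hypothetical, chosen only to show how non-equivalent operations surface in analysis:

```python
import random
import statistics as st

random.seed(1)

def cohens_d(treat, control):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_sd = ((st.variance(treat) + st.variance(control)) / 2) ** 0.5
    return (st.mean(treat) - st.mean(control)) / pooled_sd

n = 500
# Two operational measures presumed to tap the same outcome construct;
# by construction, the treatment shifts measure A but leaves B untouched.
a_treat = [random.gauss(0.5, 1) for _ in range(n)]
a_ctrl  = [random.gauss(0.0, 1) for _ in range(n)]
b_treat = [random.gauss(0.0, 1) for _ in range(n)]
b_ctrl  = [random.gauss(0.0, 1) for _ in range(n)]

d_a = cohens_d(a_treat, a_ctrl)
d_b = cohens_d(b_treat, b_ctrl)
print(round(d_a, 2), round(d_b, 2))
# A clear gap between d_a and d_b signals that the "equivalent" operations
# tap different constructs: two causal relationships, not one.
```

When the two effect sizes diverge well beyond sampling error, the analyst is forced to the conclusion in the text: the single intended causal relationship must be split into more than one.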
     Fortunately, when we find that a causal relationship varies over different causes or different effects, the research and its context often provide clues as to how the causal elements in each relationship might be (re)labeled. For example, the researcher will generally examine closely how the operations differ in their particulars, and will also study which unique meanings have been attached to variants like these in the existing literature. While the meanings that are achieved might be less successful because they have been devised post hoc to fit novel findings, they may in some circumstances still attain an acceptable level of accuracy and will certainly prompt continued discussion to account for the findings. Thus, we come full circle. We began with multiple operational representations of the same cause or effect when testing a single causal relationship; then the data forced us to invoke more than one relationship; and finally the pattern of the outcomes and their relationship to the existing literature can help improve the labeling of the new relationships achieved. A construct validity exercise begets an external validity conclusion that prompts the need for relabeling constructs. Demonstrating effect size variation across operations presumed to represent the same cause or effect can enhance external validity by showing that more constructs and causal relationships are involved than was originally envisaged; and in that case, it can eventually increase construct validity by preventing any mislabeling of the cause or effect inherent in the original choice of measures and by providing clues from details of the causal relationships about how the elements of each relationship should be labeled. We see here analytic tasks that flow smoothly between construct and external validity concerns, involving each.

Should Generalizing from a Single Sample of Persons or Settings Be Classified as External or Construct Validity?
                                                             this samplemust represent       a
If a study hasa singlesampleof pers.ons settings,
                                                         is an issue'Given that construct
population.How ,"nlrrr-pre should be labeled
                                                        an issueof constructvalidity?Af-
validity is about rJ.iirrg, i, Itbeling the lample
ter all, externalvalidity hardly seems       relevantsincewith a singlesampleit is not
                                                                         relationships  would
immediately     obvious*n", comparisonof variation in causal
                                                     of personsor settings treatedas a
be involved.So if g.".t"iit-g fio* a sample
                                                                  from treatment and out-
 matter of constructvalidity analogousto generalizing
                                                                         potential conflict in
 come operations,     i*o probl.-, "r-ir.. Firstl this highlightsa
 usage the generalsocialscience
         in                               community'someparts of which saythat gen-
 eralizations  from;;;i;         of peopleto its pofulation are a matter of external
 lidity, evenwhen ;rh.;;;",         ,"y ih", labefingpeopleis a matter of constructva-
                                                                          and Campbellthat
 lidity. Second,   trrir-J".r not fit'with the discussion Cookin
                                                                                as an external
 treatsgeneralizing     rr.t" irrdiuidrr"lsamples personsand settings
                                                                     doesnot explicitly deal
 validity matter,   thoughtheir list of .*t.*"1 validity threats
 with this and only mentionsinteracti,ons         betweenthe treatmentand attributesof
 the settingand Person.                                                         from.the pop-
       The issue most acutewhen the sample
                  is                                 was randomlyselected
  ulation. considerwhy samplingstatisticians         are so keento promoterandom sam-
                                                            Suchsamplingensures        that the
  pling for represe";i"; " *.il-dJrignated universe.                              unmeasured
                                                              all measured    and
  sample   and populatiJndistributions identicalon
                                                                    this includes popula-
  variables  within the limits of samplingerror.Notice that                                also
  tion label(whether     moreor less"ccorit.;, which randomsampling
                                                                                         a well
  appliesto the ,";;[.       K.y tg tle or.i rl*r, of random samplingis having
  boundedpop.rl"tiJ., from which to sample,          a-requirement samplingtheory and
  something   often obviousin practice.Given that many well
   are alsowell tabeied,    r""a.- sampling     then guarantees a valid populationla-
                                                            For instance'the population of
   bel can equallyvalidly be applied,o itt. saripl..
                                                                         is obviouslycorrectly
   telephone  prefixesor.d i' tlie city of Chicagolsknown and
                                                                     dialing frol that list of
   labeled. Hence,i *""fa be difficuli. ,rrJt"ndom digit
                                                           sampleas representing     telephone
   Chicagopr.fi*., "nJ itt." mislabelthe resulting
                                                       sJction-of Chicago-     Given a clearly
   ownersin Detroii o, orty in the Edgewater
                                                      the samplelabel is the populationla-
   boundedpopulationand random saripling,
                                                 believe  that no methodis superiorto ran-
    bel, which is why samplingstatisticians
    dom selectio'f- iun.ii"g"tumples       when the populationlabelis known'

     With purposive sample selection, this elegant rationale cannot be used, whether or not the population label is known. Thus, if respondents were selected haphazardly from shopping malls all over Chicago, many of the people studied would belong in the likely population of interest, residents of Chicago. But many would not, because some Chicago residents do not go to malls at the hours interviewing takes place, and because many persons in these malls are not from Chicago. Lacking random sampling, we could not even confidently call this sample "people walking in Chicago malls," for other constructs such as volunteering to be interviewed may be systematically confounded with sample membership. So, mere membership in the sample is not sufficient for accurately representing a population, and by the rationale in the previous paragraph, it is also not sufficient for accurately labeling the sample. All this leads to two conclusions worth elaborating: (1) that random sampling can sometimes promote construct validity, and (2) that external validity is in play when inferring that a single causal relationship from a sample would hold in a population, whether from a random sample or not.
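The contrast between random and purposive selection can be illustrated with a small simulation. The population, the "goes to malls" attribute, and its link to age are invented for the example; the point is only that random sampling tracks the population on every attribute, while selection that conditions on an attribute does not:

```python
import random
import statistics as st

random.seed(2)

# Hypothetical population: an attribute ("goes to malls") correlated with age.
population = [{"age": random.gauss(45, 15)} for _ in range(100_000)]
for person in population:
    person["mall"] = random.random() < (0.7 if person["age"] < 40 else 0.3)

pop_age = st.mean(p["age"] for p in population)

# Simple random sampling: the sample mean age tracks the population mean,
# on measured and unmeasured attributes alike, within sampling error.
srs = random.sample(population, 1000)
srs_age = st.mean(p["age"] for p in srs)

# Purposive "mall intercept" selection: conditioning on the mall attribute
# drags in age, so the sample no longer matches the population it is
# labeled with ("residents of Chicago", in the text's example).
mall_sample = [p for p in population if p["mall"]][:1000]
mall_age = st.mean(p["age"] for p in mall_sample)

print(round(pop_age, 1), round(srs_age, 1), round(mall_age, 1))
```

The random sample lands close to the population mean age; the mall-intercept sample is noticeably younger, a concrete version of the confounding-with-sample-membership problem described above.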
     On the first point, the conditions under which random sampling can sometimes promote the construct validity of single samples are straightforward. Given a well bounded universe, sampling statisticians have justified random sampling as a way of clearly representing in the sample all population attributes. This must include the population label, and so random sampling results in labeling the sample in the same terms that apply to the population. Random sampling does not, of course, tell us whether the population label is itself reasonably accurate; random sampling will also replicate in the sample any mistakes that are made in labeling the population. However, given that many populations are already reasonably well labeled based on past research and theory and that such situations are often intuitively obvious for researchers experienced in an area, random sampling can, under these circumstances, be counted on to promote construct validity. When random selection has not occurred or when the population label is itself in doubt, this book has explicated other principles and methods that can be used for labeling study operations, including labeling the samples of persons and settings in a study.
     On the second point, when the question concerns the validity of generalizing from a causal relationship in a single sample to its population, the reader may also wonder how external validity can be in play at all. After all, we have framed external validity as being about whether the causal relationship holds over variation in persons, settings, treatment variables, and measurement variables. If there is only one random sample from a population, where is the variation over which to examine that causal relationship? The answer is simple: the variation is between sampled and unsampled persons in that population. As we said in Chapter 2 (and as was true in our predecessor books), external validity questions can be about whether a causal relationship holds (a) over variations in persons, settings, treatments, and outcomes that were in the experiment, and (b) for persons, settings, treatments, and outcomes that were not in the experiment. Those persons in a population who were not randomly sampled fall into the latter category. Nothing about external validity, either in the present book or in its predecessors, requires that all possible variations of external validity interest actually be observed in the study; indeed, it would be impossible to do so, and we provided several arguments in Chapter 2 about why it would not be wise to limit external validity questions only to variations actually observed in a study. Of course, in most cases external validity generalizations to things that were not studied are difficult, having to rely on the concepts and methods we outlined in our grounded theory of generalized causal inference in Chapters 11 through 13. But it is the great beauty of random sampling that it guarantees that this generalization will hold over both sampled and unsampled persons. So it is indeed an external validity question whether a causal relationship that has been observed in a single random sample would hold for those units that were in the population but not in the random sample.
     In the end, this book treats the labeling of a single sample of persons or settings as a matter of construct validity, whether or not random sampling is used. It also treats the generalization of causal relationships from a single sample to unobserved instances as a matter of external validity, again whether or not random sampling was used. The fact that random sampling (which is associated with external validity in this book) sometimes happens to facilitate the construct labeling of a sample is incidental to the fact that the population label is already known. Though many population labels are indeed well known, many more are still matters of debate, as reflected in the examples we gave in Chapter 3 of whether persons should be labeled schizophrenic or settings labeled as hostile work environments. In these latter cases, random sampling makes no contribution to resolving debates about the applicability of those labels. Instead, the principles and methods we outlined in Chapters 11 through 13 will have to be brought to bear. And when random sampling has not been used, those principles and methods will also have to be brought to bear on the external validity problem of generalizing causal relationships from single samples to unobserved instances.

Objections About the Completeness of the Typology

The first objection of this kind is that our lists of particular threats to validity are incomplete. Bracht and Glass (1968), for example, added new external validity threats that they thought were overlooked by Campbell and Stanley (1963), and more recently Aiken and West (1991) pointed to new reactivity threats. These challenges are important because the key to the most confident causal conclusions in our theory of validity is the ability to construct a persuasive argument that every plausible threat to validity has been identified and ruled out. However, there is no guarantee that all relevant threats to validity have been identified. Our lists are not divinely ordained, as can be observed from the changes in the threats from Campbell (1957) to Campbell and Stanley (1963) to Cook and Campbell (1979) to this book. Threats are better identified from insider knowledge than from abstract and nonlocal lists of threats.
     A second objection is that we may have left out particular validity types or organized them suboptimally. Perhaps the best illustration that this is true is Sackett's (1979) treatment of bias in case-control studies. Case-control studies do not commonly fall under the rubric of experimental or quasi-experimental designs; but they are cause-probing designs, and in that sense a general interest in generalized causal inference is at least partly shared. Yet Sackett created a different typology. He organized his list around seven stages of research at which bias can occur: (1) in reading about the field, (2) in sample specification and selection, (3) in defining the experimental exposure, (4) in measuring exposure and outcome, (5) in data analysis, (6) in interpretation of analyses, and (7) in publishing results. Each of these could generate a validity type, some of which would overlap considerably with our validity types. For example, his concept of biases "in executing the experimental manoeuvre" (p. 62) is quite similar to our internal validity, whereas his withdrawal bias mirrors our attrition. However, his list also suggests new validity types, such as biases in reading the literature, and the biases he lists at each stage are partly orthogonal to our lists. For example, biases in reading include biases of rhetoric in which "any of several techniques are used to convince the reader without appealing to reason" (p. 60).
     In the end, then, our claim is only that the present typology is reasonably well informed by knowledge of the nature of generalized causal inference and of some of the problems that are frequently salient about those inferences in field experimentation. It can and hopefully will continue to be improved both by addition of threats to existing validity types and by thoughtful exploration of new validity types that might pertain to the problem of generalized causal inference that is our main concern.1

 1. We are acutely aware of, and modestly dismayed at, the many different usages of these validity labels that have developed over the years and of the risk that poses for terminological confusion, even though we are responsible for many of these variations ourselves. After all, the understandings of validity in this book differ from those in Campbell and Stanley (1963), whose only distinction was between internal and external validity. They also differ from Cook and Campbell (1979), in which external validity was concerned with generalizing to and across populations of persons and settings, whereas all issues of generalizing from the cause and effect operations constituted the domain of construct validity. Further, Campbell (1986) himself relabeled internal validity and external validity as local molar causal validity and the principle of proximal similarity, respectively. Stepping outside Campbell's tradition, Cronbach (1982) used these labels with yet other meanings. He said internal validity is the problem of generalizing from samples to the domain about which the question is asked, which sounds much like our construct validity except that he specifically denied any distinction between construct validity and external validity, using the latter term to refer to generalizing results to unstudied populations, an issue of extrapolation beyond the data at hand. Our understanding of external validity includes such extrapolations as one case, but it is not limited to that because it also has to do with empirically identifying sources of variation in an effect size when existing data allow doing so. Finally, many other authors have casually used all these labels in completely different ways (Goetz & LeCompte, 1984; Kleinbaum, Kupper, & Morgenstern, 1982; Menard, 1991). So in view of all these variations, we urge that these labels be used only with descriptions that make their intended understandings clear.


Objections Concerning the Nature of Validity

We defined validity as the approximate truth of an inference. Others define it differently. Here are some alternatives and our reasons for not using them.

Validity in the New Test Theory Tradition

Test theorists discussed validity (e.g., Cronbach, 1946; Guilford, 1946) well before Campbell (1957) invented his typology. We can only begin to touch on the many issues pertinent to validity that abound in that tradition. Here we outline a few key points that help differentiate our approach from that of test theory. The early emphasis in test theory was mostly on inferences about what a test measures, with a pinnacle being reached in the notion of construct validity. Cronbach (1989) credits Cook and Campbell for giving "proper breadth to the notion of constructs" (p. 152) in construct validity through their claim that construct validity is not just limited to inferences about outcomes but also about causes and about other features of experiments. In addition, early test theory tied validity to the truth of such inferences: "The literature on validation has concentrated on the truthfulness of test interpretation" (Cronbach, 1988, p. 5).
However, the years have brought change to this early understanding. In one particularly influential definition of validity in test theory, Messick (1989) said, "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (p. 13); and later he says that "Validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual-as well as potential-consequences of score interpretation and use" (1995, p. 741). Whereas our understanding of validity is that inferences are the subject of validation, this definition suggests that actions are also subject to validation and that validation is actually evaluation. These extensions are far from our view.
A little history will help here. Tests are designed for practical use. Commercial test developers hope to profit from sales to those who use tests; employers hope to use tests to select better personnel; and test takers hope that tests will tell them something useful about themselves. These practical applications generated concern in the American Psychological Association (APA) to identify the characteristics of better and worse tests. APA appointed a committee chaired by Cronbach to address this problem. The committee produced the first in a continuing series of test standards (APA, 1954); and this work also led to Cronbach and Meehl's (1955) classic article on construct validity. The test standards have been frequently revised, most recently cosponsored by other professional associations (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1985, 1999). Requirements to adhere to the standards became part of professional ethical codes. The standards were also influential in legal and regulatory proceedings and have been cited, for example, in U.S. Supreme Court cases about alleged misuses of testing practices (e.g., Albermarle Paper Co. v. Moody, 1975; Washington v. Davis, 1976) and have influenced the "Uniform Guidelines" for personnel selection by the Equal Employment Opportunity Commission (EEOC) et al. (1978). Various validity standards were particularly salient in these uses.
Because of this legal, professional, and regulatory concern with the use of testing, the research community concerned with measurement validity began to use the word validity more expansively, for example, "as one way to justify the use of a test" (Cronbach, 1989, p. 149). It is only a short distance from validating use to validating action, because most of the relevant uses were actions such as hiring or firing someone or labeling someone retarded. Actions, in turn, have consequences-some positive, such as efficiency in hiring and accurate diagnosis that allows better tailoring of treatment, and some negative, such as loss of income and stigmatization. So Messick (1989, 1995) proposed that validation also evaluate those consequences, especially the social justice of consequences. Thus evaluating the consequences of test use became a key feature of validity in test theory. The net result was a blurring of the line between validity-as-truth and validity-as-evaluation, to the point where Cronbach (1988) said "Validation of a test or test use is evaluation" (p. 4).
We strongly endorse the legitimacy of questions about the use of both tests and experiments. Although scientists have frequently avoided value questions in the mistaken belief that they cannot be studied scientifically or that science is value free, we cannot avoid values even if we try. The conduct of experiments involves values at every step, from question selection through the interpretation and reporting of results. Concerns about the uses to which experiments and their results are put and the value of the consequences of those uses are all important (e.g., Shadish et al., 1991), as we illustrated in Chapter 9 in discussing ethical concerns with experiments.
However, if validity is to retain its primary association with the truth of knowledge claims, then it is fundamentally impossible to validate an action because actions are not knowledge claims. Actions are more properly evaluated, not validated. Suppose an employer administers a test, intending to use it in hiring decisions. Suppose the action is that a person is hired. The action is not itself a knowledge claim and therefore cannot be either true or false. Suppose that person then physically assaults a subordinate. That consequence is also not a knowledge claim and so also cannot be true or false. The action and the consequences merely exist; they are ontological entities, not epistemological ones. Perhaps Messick (1989) really meant to ask whether inferences about actions and consequences are true or false. If so, the inclusion of action in his (1989) definition of validity is entirely superfluous, for validity-as-truth is already about evidence in support of inferences, including those about action or consequences.2

2. Perhaps partly in recognition of this, the most recent version of the test standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) helps resolve some of the problems outlined herein by removing reference to validating action from the definition of validity: "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9).


Alternatively, perhaps Messick (1989, 1995) meant his definition to instruct test validators to evaluate the action or its consequences, as intimated in: "Validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual-as well as potential-consequences of score interpretation and use" (1995, p. 742). Validity-as-truth certainly plays a role in evaluating tests and experiments. But we must be clear about what that role is and is not. Philosophers (e.g., Scriven, 1980; Rescher, 1969) tell us that a judgment about the value of something requires that we (1) select criteria of merit on which the thing being evaluated would have to perform well, (2) set standards of performance for how well the thing must do on each criterion to be judged positively, (3) gather pertinent data about the thing's performance on the criteria, and then (4) integrate the results into one or more evaluative conclusions. Validity-as-truth is one (but only one) criterion of merit in evaluation; that is, it is good if inferences about a test are true, just as it is good for the causal inference made from an experiment to be true. However, validation is not isomorphic with evaluation. First, criteria of merit for tests (or experiments) are not limited to validity-as-truth. For example, a good test meets other criteria, such as having a test manual that reports norms, being affordable for the contexts of application, and protecting confidentiality as appropriate. Second, the theory of validity Messick proposed gives no help in accomplishing some of the other steps in the four-step evaluation process outlined previously. To evaluate a test, we need to know something about how much validity the inference should have to be judged good; and we need to know how to integrate results from all the other criteria of merit along with validity into an overall evaluation. It is not a flaw in validity theory that these other steps are not addressed, for they are the domain of evaluation theory. The latter tells us something about how to execute these steps (e.g., Scriven, 1980, 1991) and also about other matters to be taken into account in the evaluation. Validation is not evaluation; truth is not value.
Of course, the definition of terms is partly arbitrary. So one might respond that one should be able to conflate validity-as-truth and validity-as-evaluation if one so chooses. However:

     The very fact that terms must be supplied with arbitrary meanings demands that words
     be used with a great sense of responsibility. This responsibility is twofold: first, to es-
     tablished usage; second, to the limitations that the definitions selected impose on the
     user. (Goldschmidt, p. 642)
We need the distinction between truth and value because true inferences can be about bad things (the fact that smoking causes cancer does not make smoking or cancer good); and false inferences can lead to good things (the astrologer's advice to Pisces to "avoid alienating your coworkers today" may have nothing to do with heavenly bodies, but may still be good advice). Conflating truth and value can be actively harmful. Messick (1995) makes clear that the social consequences of testing are to be judged in terms of "bias, fairness, and distributive justice" (p. 745). We agree with this statement, but this is test evaluation, not test validity. Messick
478 | 14. A CRITICAL ASSESSMENT OF OUR ASSUMPTIONS

notes that his intention is not to open the door to the social policing of truth (i.e., a test is valid if its social consequences are good), but ambiguity on this issue has nonetheless opened this very door. For example, Kirkhart (1995) cites Messick as justification for judging the validity of evaluations by their social consequences: "Consequential validity refers here to the soundness of change exerted on systems by evaluation and the extent to which those changes are just" (p. 4). This notion is risky because the most powerful arbiter of the soundness and justice of social consequences is the sociopolitical system in which we live. Depending on the forces in power in that system at any given time, we may find that what counts as valid is effectively determined by the political preferences of those with power.

Validity in the Qualitative Traditions

One of the most important developments in recent social research is the expanded use of qualitative methods such as ethnography, ethnology, participant observation, unstructured interviewing, and case study methodology (e.g., Denzin & Lincoln, 2000). These methods have unrivaled strengths for the elucidation of meanings, the in-depth description of cases, the discovery of new hypotheses, and the description of how treatment interventions are implemented or of possible causal explanations. Even for those purposes for which other methods are usually preferable, such as for making the kinds of descriptive causal inferences that are the topic of this book, qualitative methods can often contribute helpful knowledge and on rare occasions can be sufficient (Campbell, 1975; Scriven, 1976). When resources allow, field experiments will benefit from including qualitative methods both for the primary benefits they are capable of generating and also for the assistance they provide to the descriptive causal task itself. For example, they can uncover important site-specific threats to validity and also contribute to explaining experimental results in general and perplexing outcome patterns in particular.

However, the flowering of qualitative methods has often been accompanied by theoretical and philosophical controversy, often referred to as the qualitative-quantitative debates. These debates concern not just methods but roles and rewards within science, ethics and morality, and epistemologies and ontologies. As part of the latter, the concept of validity has received considerable attention (e.g., Eisenhart & Howe, 1992; Goetz & LeCompte, 1984; Kirk & Miller, 1986; Kvale, 1989; J. Maxwell, 1992; J. Maxwell & Lincoln, 1990; Mishler, 1990; Phillips, 1987; Wolcott, 1990). Notions of validity that are different from ours have occasionally resulted from qualitative work, and sometimes validity is rejected entirely. However, before we review those differences we prefer to emphasize the commonalities that we think dominate on all sides of the debates.

Commonalities. As we read it, the predominant view among qualitative theorists is that validity is a concept that is and should be applicable to their work. We start with examples of discussions of validity by qualitative theorists that illustrate these similarities because they are surprisingly more common than some portrayals in the


qualitative-quantitative debates suggest and because they demonstrate an underlying unity of interest in producing valid knowledge that we believe is widely shared by most social scientists. For example, Maxwell (1990) says, "qualitative researchers are just as concerned as quantitative ones about 'getting it wrong,' and validity broadly defined simply refers to the possible ways one's account might be wrong, and how these 'validity threats' can be addressed" (p. 505). Even those qualitative theorists who say they reject the word validity will admit that they "go to considerable pains not to get it all wrong" (Wolcott, 1990, p. 127). Kvale (1989) ties validity directly to truth, saying "concepts of validity are rooted in more comprehensive epistemological assumptions about the nature of true knowledge" (p. 11); and later that validity "refers to the truth and correctness of a statement" (p. 73). Kirk and Miller (1986) say "the technical use of the term 'valid' is as a properly hedged weak synonym for 'true'" (p. 19). Maxwell (1992) says "Validity, in a broad sense, pertains to this relationship between an account and something outside that account" (p. 283). All these seem quite compatible with our understanding of validity.
Maxwell's (1992) account points to other similarities. He claims that validity is always relative to "the kinds of understandings that accounts can embody" (p. 284) and that different communities of inquirers are interested in different kinds of understandings. He notes that qualitative researchers are interested in five kinds of understandings about: (1) the descriptions of what was seen and heard, (2) the meaning of what was seen and heard, (3) theoretical constructions that characterize what was seen and heard at higher levels of abstraction, (4) generalization of accounts to other persons, times, or settings than originally studied, and (5) evaluations of the objects of study (Maxwell, 1992; he says that the last two understandings are of interest relatively rarely in qualitative work). He then proposes a five-part validity typology for qualitative researchers, one for each of the five understandings. We agree that validity is relative to understanding, though we usually refer to inference rather than understanding. And we agree that different communities of inquirers tend to be interested in different kinds of understandings, though common interests are illustrated by the apparently shared concerns that both experimenters and qualitative researchers have in how best to characterize what was seen and heard in a study (Maxwell's theoretical validity and our construct validity). Our extended discussion of internal validity reflects the interest of the community of experimenters in understanding descriptive causes, proportionately more so than is relevant to qualitative researchers, even when their reports are necessarily replete with the language of causation. This observation is not a criticism of qualitative researchers, nor is it a criticism of experimenters as being less interested than qualitative researchers in thick description of an individual. On the other hand, we should not let differences in prototypical tendencies across research communities blind us to the fact that when a particular understanding is of interest, the pertinent validity concerns are the same no matter what the methodology used to develop the knowledge claim. It would be wrong for a

qualitative researcher to claim that internal validity is irrelevant to qualitative methods. Validity is not a property of methods but of inferences and knowledge claims. On those infrequent occasions in which a qualitative researcher has a strong interest in a local molar causal inference, the concerns we have outlined under internal validity pertain. This argument cuts both ways, of course. An experimenter who wonders what the experiment means to participants could learn a lot from the concerns that Maxwell outlines under interpretive validity.

Maxwell (1992) also points out that his validity typology suggests threats to validity about which qualitative researchers seek "evidence that would allow them to be ruled out . . . using a logic similar to that of quasi-experimental researchers such as Cook and Campbell" (p. 296). He does not outline such threats himself, but his description allows one to guess what some might look like. To judge from Maxwell's prose, threats to descriptive validity include errors of commission (describing something that did not occur), errors of omission (failing to describe something that did occur), errors of frequency (misstating how often something occurred), and interrater disagreement about description. Threats to the validity of knowledge claims have also been invoked by qualitative theorists other than Maxwell-for example, by Becker (1979), Denzin (1989), and Goetz and LeCompte (1984). Our only significant disagreement with Maxwell's discussion of threats is his claim that qualitative researchers are less able to use "design features" (p. 296) to deal with threats to validity. For instance, his preferred use of multiple observers is a qualitative design feature that helps to reduce errors of omission, commission, and frequency. The repertoire of design features that qualitative researchers use will usually be quite different from those used by researchers in other traditions, but they are design features (methods) all the same.

Differences. These agreements notwithstanding, many qualitative theorists approach validity in ways that differ from our treatment. A few of these differences are based on arguments that are simply erroneous (Heap, 1995; Shadish, 1995a). But many are thoughtful and deserve more attention than our space constraints allow. Following is a sample.

Some qualitative theorists either mix together evaluative and social theories of truth (Eisner, 1979, 1983) or propose to substitute the social for the evaluative. So Jensen (1989) says that validity refers to whether a knowledge claim is "meaningful and relevant" (p. 107) to a particular language community; and Guba and Lincoln (1982) say that truth can be reduced to whether an account is credible to those who read it. Although we agree that social and evaluative theories complement each other and are both helpful, replacing the evaluative with the social is misguided. These social alternatives allow for devastating counterexamples (Phillips, 1987): the swindler's story is coherent but fraudulent; cults convince members of beliefs that have little or no apparent basis otherwise; and an account of an interaction between teacher and student might be true even if neither found it to be credible. Bunge (1992) shows how one cannot define the basic idea of error using social theories of truth. Kirk and Miller (1986) capture the need for an evaluative theory of truth in qualitative methods:

     In response to the propensity of so many nonqualitative research traditions to use such
     hidden positivist assumptions, some social scientists have tended to overreact by
     stressing the possibility of alternative interpretations of everything to the exclusion of
     any effort to choose among them. This extreme relativism ignores the other side of ob-
     jectivity-that there is an external world at all. It ignores the distinction between
     knowledge and opinion, and results in everyone having a separate insight that cannot
     be reconciled with anyone else's. (p. 15)

A second difference refers to equating validity of knowledge claims with their evaluation, as we discussed earlier with test theory (e.g., Eisenhart & Howe, 1992). This is most explicit in Salner (1989), who suggested that much of validity in qualitative methodology concerns criteria "that are useful for evaluating competing claims" (p. 51); and she urges researchers to expose the moral and value implications of research, much as Messick (1989) said in reference to test theory. Our response is the same as for test theory. We endorse the need to evaluate knowledge claims broadly, including their moral implications; but this is not the same as saying that the claim is true. Truth is just one criterion of merit for a good knowledge claim.
A third difference makes validity a result of the process by which truth emerges. For instance, emphasizing the dialectic process that gives rise to truth, Salner (1989) says: "Valid knowledge claims emerge . . . from the conflict and differences between the contexts themselves as these differences are communicated and negotiated among people who share decisions and actions" (p. 61). Miles and Huberman (1984) speak of the problem of validity in qualitative methods being an insufficiency of "analysis procedures for qualitative data" (p. 230). Guba and Lincoln (1989) argue that trustworthiness emerges from communication with other colleagues and stakeholders. The problem with all these positions is the error of thinking that validity is a property of methods. Any procedure for generating knowledge can generate invalid knowledge, and in the end it is the knowledge claim itself that must be judged. As Maxwell (1992) says, "The validity of an account is inherent, not in the procedures used to produce and validate it, but in its relationship to those things it is intended to be an account of" (p. 281).
A fourth difference suggests that traditional approaches to validity must be reformulated for qualitative methods because validity "historically arose in the context of experimental research" (Eisenhart & Howe, 1992, p. 644). Others reject validity for similar reasons except that they say that validity arose in test theory (e.g., Wolcott, 1990). Both are incorrect, for validity concerns probably first arose systematically in philosophy preceding test theory and experimental science by hundreds or thousands of years. Validity is pertinent to any discussion of the warrant for believing knowledge and is not specific to particular methods.
A fifth difference concerns the claim that there is no ontological reality at all, so there is no truth to correspond to it. The problems with this perspective are enormous (Schmitt, 1995). First, even if it were true, it would apply only to


correspondence theories of truth; coherence and pragmatist theories would be unaffected. Second, the claim contradicts our experience. As Kirk and Miller (1986) put it:

     There is a world of empirical reality out there. The way we perceive and understand
     that world is largely up to us, but the world does not tolerate all understandings of it
     equally (so that the individual who believes he or she can halt a speeding train with his
     or her bare hands may be punished by the world for acting on that understanding).
     (p. 11)

Third, the claim ignores evidence about the problems with people's constructions. Maxwell notes that "one of the fundamental insights of the social sciences is that people's constructions are often systematic distortions of their actual situation" (p. 506). Finally, the claim is self-contradictory because it implies that the claim itself cannot be true.
     A sixth difference is the claim that it makes no sense to speak of truth because there are many different realities, with multiple truths to match each (Filstead, 1979; Guba & Lincoln, 1982; Lincoln & Guba, 1985). Lincoln (1990), for example, says that "a realist philosophical stance requires, indeed demands, a singular reality and therefore a singular truth" (p. 502), which she juxtaposes against
      her own assumption of multiple realities with multiple truths. Whatever the mer-
      its of the underlying ontological arguments, this is not an argument against valid-
      ity. Ontological realism (a commitment that "something" does exist) does not re-
      quire a singular reality but merely a commitment that there be at least one reality.
     To take just one example, physicists have speculated that there may be circum-
      stancesunder which multiple physical realities could exist in parallel, as in the case
of Schrödinger's cat (Davies, 1984; Davies & Brown, 1986). Such circumstances would in no way constitute an objection to pursuing valid characterizations of those multiple realities. Nor, for that matter, would the existence of multiple realities require multiple truths; physicists use the same principles to account for the multiple realities that might be experienced by Schrödinger's cat. Epistemological
     realism (a commitment that our knowledge reflects ontological reality) does not
     require only one true account of that world(s), but only that there not be two con-
     tradictory accounts that are both true of the same ontological referent.3 How
     many realities there might be, and how many truths it takes to account for them,
     should not be decided by fiat.
     A seventh difference objects to the belief in a monolithic or absolute Truth (with a capital T). Wolcott (1990) says, "What I seek is something else, a quality that points more to identifying critical elements and wringing plausible interpretations from them, something one can pursue without becoming obsessed with

3. The fact that different people might have different beliefs about the same referent is sometimes cited as violating this maxim, but it need not do so. For example, if the knowledge claim being validated is "John views the program as effective but Mary views it as ineffective," the claim can be true even though the views of John and Mary are contradictory.



finding the right or ultimate answer, the correct version, the Truth" (p. 146). He describes "the critical point of departure between quantities-oriented and qualities-oriented research [as being that] we cannot 'know' with the former's satisfying levels of certainty" (p. 147). Mishler (1990) objects that traditional approaches to validation are portrayed "as universal, abstract guarantors of truth" (p. 420). Lincoln (1990) thinks that "the realist position demands absolute truth" (p. 502). However, it is misguided to attribute beliefs in certainty or absolute truth to approaches to validity such as that in this book. We hope we have made clear by now that there are no guarantors of valid inferences. Indeed, the more experience that most experimenters gain, the more they appreciate the ambiguity of their results. Albert Einstein once said, "An experiment is something everybody believes except the person who made it" (Holton, 1986, p. 13). Like Wolcott, most experimenters seek only to wring plausible interpretations from their work, believing that "prudence sat poised between skepticism and credulity" (Shapin, 1994, p. xxix). We need not, should not, and frequently cannot decide that one account is absolutely true and the other completely false. To the contrary, tolerance for multiple knowledge constructions is a virtual necessity (Lakatos, 1978), because evidence is frequently inadequate to distinguish between two well-supported accounts (is light a particle or a wave?), and sometimes accounts that appear to be unsupported by evidence for many years turn out to be true (do germs cause ulcers?).
     An eighth difference claims that traditional understandings of validity have moral shortcomings. The arguments here are many, for example, that it "forces issues of politics, values (social and scientific), and ethics to be submerged" (Lincoln, 1990, p. 503) and implicitly empowers "social science 'experts' whose class preoccupations (primarily White, male, and middle-class) ensure status for some voices while marginalizing . . . those of women, persons of color, or minority group members" (Lincoln, 1990, p. 502). Although these arguments may be overstated, they contain important cautions. Recall the example in Chapter 3 that "even the rats were white males" in health research. No doubt this bias was partly due to the dominance of White males in the design and execution of health research. None of the methods discussed in this book are intended to redress this problem or are capable of it. The purpose of experimental design is to elucidate causal inferences more than moral inferences. What is less clear is that this problem requires abandoning notions of validity or truth. The claim that traditional approaches to truth forcibly submerge political and ethical issues is simply wrong. To the extent that morality is reflected in the questions asked, the assumptions made, and the outcomes examined, experimenters can go a long way by ensuring a broad representation of stakeholder voices in study design. Further, moral social science requires commitment to truth. Moral righteousness without truthful analysis is the stuff of totalitarianism. Moral diversity helps prevent totalitarianism, but without the discipline provided by truth-seeking, diversity offers no means to identify those options that are good for the human condition, which is, after all, the essence of morality. In order to have a moral social science, we must have both the capacity to elucidate personal constructions and the capacity to see

how those constructions reflect and distort reality (Maxwell, 1992). We embrace the moral aspirations of scholars such as Lincoln, but giving voice to those aspirations simply does not require us to abandon such notions as validity and truth.


Criteria for Ruling Out Threats:
The Centrality of Fuzzy Plausibility
In a randomized experiment in which all groups are treated in the same way except for treatment assignment, very few assumptions need to be made about sources of bias. And those that are made are clear and can be easily tested, particularly as concerns the fidelity of the original assignment process and its subsequent maintenance. Not surprisingly, statisticians prefer methods in which the assumptions are few, transparent, and testable. Quasi-experiments, however, rely heavily on researcher judgments about assumptions, especially on the fuzzy but indispensable concept of plausibility. Judgments about plausibility are needed for deciding which of the many threats to validity are relevant in a given study, for deciding whether a particular design element is capable of ruling out a given threat, for estimating by how much the bias might have been reduced, and for assessing whether multiple threats that might have been only partially adjusted for might add up to a total bias greater than the effect size the researcher is inclined to claim. With quasi-experiments, the relevant assumptions are numerous, their plausibility is less evident, and their single and joint effects are less easily modeled. We acknowledge the fuzzy way in which particular internal validity threats are often ruled out, and it is because of this that we too prefer randomized experiments (and regression discontinuity designs) over most of their quasi-experimental alternatives.
     But quasi-experiments vary among themselves with respect to the number, transparency, and testability of assumptions. Indeed, we deliberately ordered the chapters on quasi-experiments to reflect the increase in inferential power that comes from moving from designs without a pretest or without a comparison group to those with both, to those based on an interrupted time series, and from there to regression discontinuity and random assignment. Within most of these chapters we also illustrated how inferences can be improved by adding design elements: more pretest observation points, better stable matching, replication and systematic removal of the treatment, multiple control groups, and nonequivalent dependent variables. In a sense, the plan of the four chapters on quasi-experiments reflects two purposes. One is to show how the number, transparency, and testability of assumptions varies by type of quasi-experimental design so that, in the best of quasi-experiments, internal validity is not much worse than with the randomized experiment. The other is to get students of quasi-experiments to be more sparing with the use of this overly general label, for it threatens to tar all quasi-


experiments with the same negative brush. As scholars who have contributed to the institutionalization of the term quasi-experiment, we feel a lot of ambivalence about our role. Scholars need to think critically about alternatives to the randomized experiment, and from this need arises the need for the quasi-experimental label. But all instances of quasi-experimental design should not be brought under the same unduly broad quasi-experimental umbrella if attributes of the best studies do not closely match the weaker attributes of the field writ large.
                                                                                    use of
     Statisticians seek to make their assumptions transparent through the use of formal models laid out as formulae. For the most part, we have resisted this strategy because it backfires with so many readers, alienating them from the very conceptual issues the formulae are designed to make evident. We have used words instead. There is a cost to this, and not just in the distaste of statistical readers, particularly those whose own research has emphasized statistical models. The main cost is that our narrative approach makes it more difficult to formally demonstrate how much fewer and more evident and more testable the alternative interpretations became as we moved from the weaker to the stronger experiments, both within the relevant quasi-experimental chapters and across the set of them. We regret this, but do not apologize for the accessibility we tried to create by minimizing the use of Greek symbols and Roman subscripts. Fortunately, this deficit is not absolute, as both we and others have worked to develop methods that can be used to measure the size of particular threats, both in particular studies (e.g., Gastwirth et al., 1994; Shadish et al., 1998; Shadish, 2000) and in sets of studies (e.g., Kazdin & Bass, 1989; Miller, Turner, Tindale, & Dugoni, 1991; Rosenthal & Rubin, 1978; Willson & Putnam, 1982). Our narrative approach has a significant advantage over a more narrowly statistical emphasis: it allows us to address a broader array of qualitatively different threats to validity, threats for which no statistical measure is yet available and that therefore might otherwise be overlooked with too strict an emphasis on quantification. Better to have imprecise attention to plausibility than to have no attention at all paid to many important threats just because they cannot be well measured.

Pattern Matching as a Problematic
This book is more explicit than its predecessors about the desirability of imbuing a causal hypothesis with multiple testable implications in the data, provided that they serve to reduce the viability of alternative causal explanations. In a sense, we have sought to substitute a pattern-matching methodology for the usual assessment of whether a few means, often only two, reliably differ. We do this not because complexity itself is a desideratum in science. To the contrary, simplicity in the number of questions asked and methods used is highly prized in science. The simplicity of randomized experiments for descriptive causal inference illustrates this well. However, the same simple circumstance does not hold with quasi-experiments. With them, we have asserted that causal inference is improved the more specific,

generating these lists. The main concern was to have a consensus of education researchers endorsing each practice; and he guessed that the number of these best practices that depended on randomized experiments would be zero. Several nationally known educational researchers were present, agreed that such assignment probably played no role in generating the list, and felt no distress at this. So long as the belief is widespread that quasi-experiments constitute the summit of what is needed to support causal conclusions, the support for experimentation that is currently found in health or agriculture is unlikely to occur in schools. Yet randomization is possible in many educational contexts within schools if the will exists to carry it out (Cook et al., 1999; Cook et al., in press). An unfortunate and inadvertent side effect of serious discussion of quasi-experiments may sometimes be the practical neglect of randomized experiments. That is a pity.

This section lists objections that have been raised to doing randomized experiments, and our analysis of the more and less legitimate issues that these objections raise.

Experiments Cannot Be Successfully Implemented
Even a little exposure to large-scale social experimentation shows that treatments
    are often improperly or incompletely implemented and that differential attrition
    often occurs. Organizational obstaclesto experiments are many. They include the
    reality that different actors vary in the priority they attribute to random assign-
    ment, that some interventions seem disruptive at all levels of the organization,
    and that those at the point of service delivery often find the treatment require-
ments a nuisance addition to their already overburdened daily routine. Then
    there are sometimes treatment crossovers,as units in the control condition adopt
    or adapt components from the treatment or as those in a treatment group are ex-
    posed to some but not all of these same components. These criticisms suggestthat
    the correct comparison is not between the randomized experiment and better
    quasi-experiments when each is implemented perfectly but rather between the
   randomized experiment as it is often imperfectly implemented and better quasi-
   experiments. Indeed, implementation can sometimes be better in the quasi-
   experiment if the decision not to randomize is based on fears of treatment degra-
   dation. This argument cannot be addressedwell becauseit dependson specifying
   the nature and degree of degradation and the kind of quasi-experimental alter-
   native. But taken to its extreme it suggeststhat randomized experiments have no
   special warrant in field settings becausethere is no evidencethat they are stronger
   than other designs in practice (only in theory).
     But the situation is probably not so bleak. Methods for preventing and coping with treatment degradation are improving rapidly (see Chapter 10, this volume; Boruch, 1997; Gueron, 1999; Orr, 1999). More important, random assignment may still create a superior counterfactual to its alternatives even with the flaws mentioned herein. For example, Shadish and Ragsdale (1996) found that, compared with randomized experiments without attrition, randomized experiments with attrition still yielded better effect size estimates than did nonrandomized experiments. Sometimes, of course, an alternative to severely degraded randomization will be best, such as a strong interrupted time series with a control. But routine rejection of degraded randomized experiments is a poor rule to follow; it takes careful study and judgment to decide. Further, many alternatives to experimentation are themselves subject to treatment implementation flaws that threaten the validity of inferences from them. Attrition and treatment crossovers also occur in them. We also suspect that implementation flaws are salient in experimentation because experiments have been around so long and experimenters are so critical of each other's work. By contrast, criteria for assessing the quality of implementation and results from other methods are far more recent (e.g., Datta, 1997), and they may therefore be less well developed conceptually, less subjected to peer criticism, and less improved by the lessons of experience.

Experimentation Needs Strong Theory and Standardized
Treatment Implementation
Many critics claim that experimentation is more fruitful when an intervention is based on strong substantive theory, when implementation of treatment details is faithful to that theory, when the research setting is well managed, and when implementation does not vary much between units. In many field experiments, these conditions are not met. For example, schools are large, complex social organizations with multiple programs, disputatious politics, and conflicting stakeholder goals. Many programs are implemented variably across school districts, as well as across schools, classrooms, and students. There can be no presumption of standard implementation or fidelity to program theory (Berman & McLaughlin, 1978).
     But these criticisms are, in fact, misplaced. Experiments do not require well-specified program theories, good program management, standard implementation, or treatments that are totally faithful to theory. Experiments make a contribution when they simply probe whether an intervention-as-implemented makes a marginal improvement beyond other background variability. Still, these factors can reduce statistical power and so cloud causal inference. This suggests that in settings in which none of these conditions hold, experiments should: (1) use large samples to detect effects; (2) take pains to reduce the influence of extraneous variation either by design or through measurement and statistical manipulation; and (3) study implementation quality both as a variable worth studying in its own right in order to ascertain which settings and providers implement the intervention better and as a mediator to see how implementation affects treatment effects on outcomes.

     Indeed, for many purposes the lack of standardization may aid in understanding how effective an intervention will be under normal conditions of implementation. In the social world, few treatments are introduced in a standard and theory-faithful way. Local adaptations and partial implementation are the norm. If this is the case, then some experiments should reflect this variation and ask whether the treatment can continue to be effective despite all the variation within groups that we would expect to find if the treatment were policy. Program developers and social theorists may want standardization at high levels of implementation, but policy analysts should not welcome this if it makes the research conditions different from the practice conditions to which they would like to generalize. Of course, it is most desirable to be able to answer both sets of questions: about the policy-relevant effects of treatments that are variably implemented and also about the more theory-relevant effects of optimal exposure to the intervention. In this regard, one might recall recent efforts to analyze the effects of the original intent to treat through traditional means but also the effects of the actual treatment through using random assignment as an instrumental variable (Angrist et al., 1996a).
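The distinction between the two analyses can be made concrete with a minimal sketch. Under the standard instrumental-variable logic (the Wald ratio), the effect of the actual treatment is the intent-to-treat effect divided by the difference in treatment-receipt rates between arms; the function name and all numbers below are invented for illustration, not taken from Angrist et al.

```python
# Hypothetical sketch of using random assignment as an instrumental
# variable (the Wald ratio). All names and numbers are invented.

def wald_estimate(y_assigned: float, y_control: float,
                  d_assigned: float, d_control: float) -> float:
    """IV (Wald) estimate of the effect of the actual treatment.

    y_*: mean outcome in each randomized arm.
    d_*: proportion in each arm that actually received the treatment.
    """
    itt = y_assigned - y_control      # intent-to-treat effect
    takeup = d_assigned - d_control   # net take-up induced by assignment
    return itt / takeup

# Assignment raises the mean outcome by 2.0 points, but only 80% of the
# assigned arm (and 10% of controls) actually receive treatment, so the
# estimated effect of actual treatment is 2.0 / 0.7, larger than the ITT.
effect = wald_estimate(y_assigned=12.0, y_control=10.0,
                       d_assigned=0.8, d_control=0.1)
```

The ratio is informative only under the usual exclusion assumption, that random assignment affects outcomes solely through receipt of the treatment.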

Experiments Entail Tradeoffs Not Worth Making
The choice to experiment involves a number of tradeoffs that some researchers believe are not worth making (Cronbach, 1982). Experimentation prioritizes unbiased answers to descriptive causal questions. But, given finite resources, some researchers prefer to invest what they have not into marginal improvements in internal validity but into promoting higher construct and external validity. They might be content with a greater degree of uncertainty about the quality of a causal connection in order to purposively sample a greater range of populations of people or settings or, when a particular population is central to the research, in order to generate a formally representative sample. They might even use the resources to improve treatment fidelity or to include multiple measures of a very important outcome construct. If a consequence of this preference for construct and external validity is to conduct a quasi-experiment or even a nonexperiment rather than a randomized experiment, then so be it. Similar preferences make other critics look askance when advocates of experimentation counsel restricting a study to volunteers in order to increase the chances of being able to implement and maintain random assignment, or when these same advocates advise close monitoring of the treatment to ensure its fidelity, thereby creating a situation of greater obtrusiveness than would pertain if the same treatment were part of some ongoing social policy (e.g., Heckman, 1992). In the language of Campbell and Stanley (1963), the claim was that experimentation traded off external validity in favor of internal validity. In the parlance of this book and of Cook and Campbell (1979), it is that experimentation trades off both external and construct validity for internal validity, to its detriment.
     Critics also claim that experiments overemphasize conservative standards of scientific rigor. These include (1) using a conservative criterion to protect against wrongly concluding that a treatment is effective (p < .05) at the risk of failing to detect true treatment effects; (2) recommending intent-to-treat analyses that include as part of the treatment those units that have never received treatment; (3) denigrating inferences that result from exploring unplanned treatment interactions with characteristics of units, observations, settings, or times; and (4) rigidly pursuing a priori experimental questions when other interesting questions emerge during a study. Most laypersons use a more liberal risk calculus to decide on causal inferences in their own lives, as when they consider taking up some potentially lifesaving therapy. Should not science do the same and be less conservative? Should it not at least sometimes make different tradeoffs between protection against incorrect inferences and the failure to detect true effects?
     Critics further object that experiments prioritize descriptive over explanatory causation. The critics in question would tolerate more uncertainty about whether the intervention works in order to learn more about any explanatory processes that have the potential to generalize across units, settings, observations, and times. Further, some critics prefer to pursue this explanatory knowledge using qualitative methods similar to those of the historian, journalist, and ethnographer rather than by means of, say, structural equation modeling that seems much more opaque than the narrative reports of these other fields.
     Critics also dislike the priority that experiments give to providing policymakers with often belated answers about what works instead of providing help to service providers in local settings. These providers are rarely interested in a long-delayed summary of what a program has achieved. They often prefer receiving continuous feedback about their work and especially about those elements of practice that they can change without undue complication. A recent letter to the New York Times captured this preference:

     Alan Krueger . . . claims to eschew value judgments and wants to approach issues
     (about educational reform) empirically. Yet his insistence on postponing changes in ed-
     ucation policy until studies by researchers approach certainty is itself a value judgment
     in favor of the status quo. In view of the tragic state of affairs in parts of public edu-
     cation, his judgment is a most questionable one. (Petersen, 1999)

     We agree with many of these criticisms. Among all possible research questions, causal questions constitute only a subset. And of all possible causal methods, experimentation is not relevant to all types of questions and all types of circumstances. One need only read the list of options and contingencies outlined in Chapters 9 and 10 to appreciate how foolhardy it is to advocate experimentation on a routine basis as a causal "gold standard" that will invariably result in clearly interpretable effect sizes. However, many of the criticisms about tradeoffs are based on artificial dichotomies, correctable problems, and even oversimplifications. Experiments can and should examine reasons for variable implementation, and they should search to uncover mediating processes. Experiments need not use stringent alpha rates; only statistical tradition argues for the .05 level. Nor need one restrict data analyses only to the intent-to-treat, though that
492 | 14. A CRITICAL ASSESSMENT OF OUR ASSUMPTIONS

should definitely be one analysis. Experimenters can also explore statistical interactions to the extent that substantive theory and statistical power allow, guarding against profligate error rates and couching their conclusions cautiously. Interim results from experiments can be published, and there can and should also be nonexperimental analyses of the representativeness of samples and of the construct validity of assessments of persons, settings, treatments, and outcomes. There can and should be qualitative data collection aimed at discovering unintended outcomes and mediating processes. And as much information as possible about causal generalization should be generated using the methods outlined in this book. All these procedures require resources, but sometimes few of them (e.g., adding measures of mediating variables). Experiments need not be as rigid as some texts suggest, and the pursuit of ever-finer marginal improvements can yield useful information. Nor is it clear that more will be learned from programs of research that consist mostly or even entirely of quasi-experimental and nonexperimental studies than from programs emphasizing the stronger experimental methods (e.g., Cronbach et al., 1980; Cronbach, 1982).

Although we are generally sympathetic to this point, some bounds cannot be crossed without compromising the integrity of key inferences, unless threats to internal validity are clearly implausible on logical or evidential grounds. To have no strong experimental studies on the effects of an intervention risks drawing broad general conclusions about a causal connection that is undependable. This happens all too often, alas. It is now 30 years since school vouchers were proposed, and we still have no clear answers about their effects. It is 15 years since Henry Levin began accelerated schools, and we have no experiments and no answers. It is 30 years since James Comer began the School Development Program, and almost the same situation holds. Although premature experimentation is a danger, such decade-long time lines without clear answers are probably even more problematic, particularly for those legislators and their staffs who want to promote effectiveness-based social policies. Finding out what works is too important to suggest that experiments require trade-offs that are never worth making.

By contrast, we are impressed with the capacity of programs of experimental research to address both construct and external validity issues modestly well. Granted, individual experiments have limited reach in addressing both these issues. But we see most clearly in meta-analyses that the capacity to address both construct and external validity issues over multiple experiments greatly exceeds what past wisdom has suggested. Of course, as made clear in Chapter 1, we are not calling for any routine primacy of internal validity over construct or external validity (every validity type must have its time in the spotlight). Rather, we are calling attention to the inferential weaknesses that history suggests have emerged in programs of research that deemphasize internal validity too much, and to the surprisingly broad inferential reach of programs in which internal validity plays a much more prominent role.
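The intent-to-treat principle mentioned above can be made concrete with a small simulation. This is an illustrative sketch only: the data, take-up rate, and effect size are hypothetical, and the code is ours, not the authors'. It shows why the as-assigned (intent-to-treat) comparison preserves randomization while diluting the effect by the rate of noncompliance.

```python
import random

random.seed(0)

# Hypothetical randomized trial: half assigned to treatment, but only some
# assigned subjects actually take it up (noncompliance). All numbers invented.
def simulate_trial(n=10_000, effect=5.0, takeup=0.7):
    rows = []
    for i in range(n):
        assigned = i % 2 == 0                       # assignment (alternation stands in for randomization)
        took = assigned and random.random() < takeup
        # Outcome: baseline noise plus the true effect only for actual takers
        y = random.gauss(50, 10) + (effect if took else 0.0)
        rows.append((assigned, took, y))
    return rows

def mean(xs):
    return sum(xs) / len(xs)

rows = simulate_trial()

# Intent-to-treat: compare groups AS ASSIGNED, ignoring compliance.
itt = mean([y for a, t, y in rows if a]) - mean([y for a, t, y in rows if not a])

# As-treated: compare by what people actually did.
as_treated = mean([y for a, t, y in rows if t]) - mean([y for a, t, y in rows if not t])

print(f"ITT estimate:        {itt:.2f}")         # roughly effect * takeup = 3.5
print(f"As-treated estimate: {as_treated:.2f}")  # roughly the effect among takers
```

The intent-to-treat estimate recovers approximately effect × take-up (the effect of offering the treatment), which is why the text treats it as one analysis to be reported alongside others rather than the only one.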
                                                           RANDOMIZED EXPERIMENTS | 493

Experiments Assume an Invalid Model of Research
To some critics, experiments recreate a naive rational choice model of decision making. That is, one first lays out the alternatives to choose among (the treatments); then one decides on criteria of merit (the outcomes); then one collects information on each criterion for each treatment (the data collection); and finally one makes a decision about the superior alternative. Unfortunately, work on the use of social science data shows that use is not so simple as the rational choice model suggests (C. Weiss & Bucuvalas, 1980; C. Weiss, 1988).

First, even when cause and effect questions are asked in decision making, experimental results are still used along with other forms of information: existing theories, personal testimony, extrapolations from surveys, a consensus of the field, claims from experts with interests to defend, and ideas that have recently become trendy. Decisions are shaped partly by ideology, interests, politics, personality, windows of opportunity, and values; and they are as much made by a policy-shaping community (Cronbach et al., 1980) as by an individual or committee. Further, many decisions are not so much made as accreted over time as earlier decisions constrain later ones, leaving the final decision maker with few options (Weiss, 1980). Indeed, by the time experimental results are available, new decision makers and issues may have replaced the old ones.
Second, experiments often yield contested rather than unanimous results that therefore have uncertain implications for decisions. Disputes arise about whether the causal questions were correctly framed, whether results are valid, whether relevant outcomes were assessed, and whether the results entail a specific decision. For example, reexaminations of the Milwaukee educational voucher study offered different conclusions about whether and where effects occurred (H. Fuller, 2000; Greene, Peterson, & Du, 1999; Witte, 1998, 1999, 2000). Similarly, different effect sizes were generated from the Tennessee class size experiment (Finn & Achilles, 1990; Hanushek, 1999; Mosteller, Light, & Sachs, 1996). Sometimes scholarly disagreements are at issue, but at other times the disputes reflect conflicted stakeholder interests.
Third, short-term instrumental use of experimental data is more likely when the intervention is a minor variant on existing practice. For example, it is easier to change textbooks in a classroom, or pills given to patients, or eligibility for program entry, than it is to relocate hospitals to underserved locations or to open day-care centers for welfare recipients throughout an entire state. Because the more feasible changes are so modest in scope, they are less likely to dramatically affect the problem they address. Critics note that prioritizing short-term instrumental change tends to preserve most of the status quo and is unlikely to solve trenchant social problems. Of course, there are some experiments that truly twist the lion's tail and involve bold initiatives. Thus moving families from densely poor inner-city locations to the suburbs involved a change of three standard deviations
     in the poverty level of the sending and receiving communities, much greater than
what happens when poor families spontaneously move. Whether such a dramatic
    change could ever be used as a model for cleaning out the inner cities of those who
    want to move is a moot issue. Many would judge such a policy to be unlikely.
    Truly bold experiments have many important rationales; but creating new policies
    that look like the treatment soon after the experiment is not one of them.
          Fourth, the most frequent use of research may be conceptual rather than in-
    strumental, changing how users think about basic assumptions,how they under-
stand contexts, and how they organize or label ideas. Some conceptual uses are
    intentional, as when a person deliberately reads a book on a current problem; for
    example, Murray's (1984) book on social policy had such a conceptual impact in
the 1980s, creating a new social policy agenda. But other conceptual uses occur
    in passing, as when a person reads a newspaper story referring to social research.
    Such usescan have great long-run impact as new ways of thinking move through
   the system, but they rarely change particular short-term decisions.
These arguments against a naive rational decision-making model of experimental usefulness are compelling. That model is rightly rejected. However, most of the objections are true not just of experiments but of all social science methods. Consider controversies over the accuracy of the U.S. Census, the entirely descriptive results of which enter into a decision-making process about the apportionment of resources that is complex and highly politically charged. No method offers a direct road to short-term instrumental use. Moreover, the objections are exaggerated. In settings such as the U.S. Congress, decision making is sometimes influenced instrumentally by social science information (Chelimsky, 1998), and experiments frequently contribute to that use as part of a research review on effectiveness questions. Similarly, policy initiatives get recycled, as happened with school vouchers, so that social science data that were not used in past years are used later when they become instrumentally relevant to a current issue (Polsby, 1984; Quirk, 1986). In addition, data about effectiveness influence many stakeholders' thinking even when they do not use the information quickly or instrumentally. Indeed, research suggests that high-quality experiments can confer extra credibility among policymakers and decision makers (C. Weiss & Bucuvalas, 1980), as happened with the Tennessee class size study. We should also not forget that the conceptual use of experiments occurs when the texts used to train professionals in a given field contain results of past studies about successful practice (Leviton & Cook, 1983). And using social science data to produce incremental change is not always trivial. Small changes can yield benefits of hundreds of millions of dollars (Fienberg, Singer, & Tanur, 1985). Sociologist Carol Weiss, an advocate of doing research for enlightenment's sake, says that 3 decades of experience and her studies of the use of social science data leave her "impressed with the utility of evaluation findings in stimulating incremental increases in knowledge and in program effectiveness. Over time, cumulative increments are not such small potatoes after all" (Weiss, 1998, p. 319). Finally, the usefulness of experiments can be increased by the actions outlined earlier in this chapter that involve complementing basic experimental design with adjuncts such as measures of implementation and mediation, or qualitative methods: anything that will help clarify program process and implementation problems. In summary, invalid models of the usefulness of experimental results seem to us to be no more nor less common than invalid models of the use of any other social science methods. We have learned much in the last several decades about use, and experimenters who want their work to be useful can take advantage of those lessons (Shadish et al., 1991).

The Conditions of Experimentation Differ from the Conditions of Policy Implementation
                                                                      if         were
Experiments often doneon a smalleiscalethan would pertain services
i-il.-r.rted state-or nationwide,and so they cannot mimic all the details
u"rr, ,o full policy implementation.    Hencepolicy implementationof an interven-
                                                                              For ex-
,i"" -ry yi.ta aiff.rint o,rt.omesthan the experiment(Elmore, 1996)'
                                                                           class size,
ample, t"r.d partly on researchabout the benefits of reducing
Tennessee Caliiornia implementedstatewidepoliciesto have more
with fewer studentsin each.This required many new teachersand new
rooms.However,because a nationalteacher
                            of                              some
                                                    shortage,     of thosenew teach-
ers may havebeenlessqualifiedthan those          in the experiment;and a shortageof
classrooms to more .rs. of trailers and dilapidatedbuildings that may
harmedeffectiveness    further.
Sometimes an experimental treatment is an innovation that generates enthusiastic efforts to implement it well. This is particularly frequent when the treatment is done by a charismatic innovator whose tacit knowledge may exceed that of those who would be expected to implement the program in ordinary practice, and whose charisma may induce high-quality implementation. These factors may generate more successful outcomes than will be seen when the intervention is implemented as routine policy.
Policy implementation may also yield different results when experimental treatments are implemented in a fashion that differs from or conflicts with practices in real-world application. For example, experiments studying psychotherapy outcome often standardize treatment with a manual and sometimes observe and correct the therapist for deviating from the manual (Shadish et al., 2000); these practices are rare in clinical practice. If manualized treatment is more effective (Chambless & Hollon, 1998; Kendall, 1998), experimental results may transfer poorly to practice settings.
Random assignment may also change the program from the intended policy implementation (Heckman, 1992). For example, those willing to be randomized may differ from those for whom the treatment is intended; randomization may change people's psychological or social response to treatment compared with those who self-select treatment; and randomization may disrupt administration and implementation by forcing the program to cope with a different mix of clients.


Heckman claims this kind of problem with the Job Training Partnership Act (JTPA) evaluation "calls into question the validity of the experimental estimates as a statement about the JTPA system as a whole" (Heckman, 1992, p. 221).
In many respects, we agree with these criticisms, though it is worth noting several responses to them. First, they assume a lack of generalizability from experiment to policy, but that is an empirical question. Some data suggest that generalization may be high despite differences between lab and field (C. Anderson, Lindsay, & Bushman, 1999) or between research and practice (Shadish et al., 2000). Second, it can help to implement treatment under conditions that are more characteristic of practice if it does not unduly compromise other research priorities. A little forethought can improve the surface similarity of units, treatments, observations, settings, or times to their intended targets. Third, some of these criticisms are true of any research methodology conducted in a limited context, such as locally conducted case studies or quasi-experiments, because local implementation issues always differ from large-scale issues. Fourth, the potentially disruptive nature of experimentally manipulated interventions is shared by many locally invented novel programs, even when they are not studied by any research methodology at all. Innovation inherently disrupts, and substantive literatures are rife with examples of innovations that encountered policy implementation impediments (Shadish, 1984).
However, the essential problem remains that large-scale policy implementation is a singular event, the effects of which cannot be fully known except by doing the full implementation. A single experiment, or even a small series of similar ones, cannot provide complete answers about what will happen if the intervention is adopted as policy. However, Heckman's criticism needs reframing. He fails to distinguish among validity types (statistical conclusion, internal, construct, external). Doing so makes it clear that his claim that such criticism "calls into question the validity of the experimental estimates as a statement about the JTPA system as a whole" (Heckman, 1992, p. 221) is really about external validity and construct validity, not statistical conclusion or internal validity. Except in the narrow econometrics tradition that he understandably cites (Haavelmo, 1944; Marschak, 1953; Tinbergen, 1956), few social experimenters have ever claimed that experiments could describe the "system as a whole"; even Fisher (1935) acknowledged this trade-off. Further, the econometric solutions that Heckman suggests cannot avoid the same trade-offs between internal and external validity. For example, surveys and certain quasi-experiments may avoid some problems by observing existing interventions that have already been widely implemented, but the validity of their estimates of program effects is suspect, and the estimates may themselves change if the program were imposed even more widely as policy.
Addressing these criticisms requires multiple lines of evidence: randomized experiments of efficacy and effectiveness, nonrandomized experiments that observe existing interventions, nonexperimental surveys to yield estimates of representativeness, statistical analyses that bracket effects under diverse assumptions,

                                                                   EXPERIMENTS Ot
                                                          RANDOMIZED         I

qualitative observation to discover potential incompatibilities between the intervention and its context of likely implementation, historical study of the fates of similar interventions when they were implemented as policy, policy analyses by those with expertise in the type of intervention at issue, and the methods for causal generalization in this book. The conditions of policy implementation will be different from the conditions characteristic of any research study of it, so predicting generalization to policy will always be one of the toughest problems.

Imposing Fundamentally Flawed Treatments Compared with Encouraging the Growth of Local Solutions to Problems
Experiments impose treatments on recipients. Yet some late 20th-century thought suggests that imposed solutions may be inferior to solutions that are locally generated by those who have the problem. Partly, this view is premised on research findings of few effects for the Great Society social programs of the 1960s in the United States (Murray, 1984; Rossi, 1987), with the presumption that a portion of the failure was due to the federally imposed nature of the programs. Partly, the view reflects the success of late 20th-century free market economics and conservative political ideologies compared with centrally controlled economies and more liberal political beliefs. Experimentally imposed treatments are seen in some quarters as being inconsistent with such thinking.
Ironically, the first objection is based on results of experiments: if it is true that imposed programs do not work, experiments provided the evidence. Moreover, these no-effect findings may have been partly due to methodological failures of experiments as they were implemented at that time. Much progress in solving practical experimental problems occurred after, and partly in response to, those experiments. If so, it is premature to assume these experiments definitively demonstrated no effect, especially given our increased ability to detect small effects today (D. Greenberg & Shroder, 1997; Lipsey, 1992; Lipsey & Wilson, 1993).
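The point about detecting small effects can be illustrated with a standard power calculation. This is a textbook normal-approximation formula, not anything from the original text, and the d values below follow Cohen's conventional small, medium, and large benchmarks.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-group comparison of means,
    using the normal approximation; d is the standardized mean difference."""
    za = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    zb = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * (za + zb) ** 2 / d ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
# d = 0.2 -> 393, d = 0.5 -> 63, d = 0.8 -> 25 (slightly below exact t-test values)
```

Detecting a small effect (d = 0.2) at conventional error rates takes roughly fifteen times the sample of a large one, which is why early underpowered studies could easily mistake small effects for no effects.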
We must also distinguish between political-economic currency and the effects of interventions. We know of no comparisons of, say, the effects of locally generated versus imposed solutions. Indeed, the methodological problems in doing such comparisons are daunting, especially accurately categorizing interventions into the two categories and unconfounding the categories with correlated method differences. Barring an unexpected solution to the seemingly intractable problems of causal inference in nonrandomized designs, answering questions about the effects of locally generated solutions may require exactly the kind of high-quality experimentation being criticized. Though it is likely that locally generated solutions may indeed have significant advantages, it also is likely that some of those solutions will have to be experimentally evaluated.


Internal validity is best promoted via random assignment, an omnibus mechanism that ensures that we do not have many assumptions to worry about when causal inference is our goal. By contrast, quasi-experiments require us to make explicit many assumptions (the threats to internal validity) that we then have to rule out by fiat, by design, or by measurement. The latter is a more complex and assumption-riddled process that is clearly inferior to random assignment. Something similar holds for causal generalization, in which random selection is the most parsimonious and theoretically justified method, requiring the fewest assumptions when causal generalization is our goal. But because random selection is so rarely feasible, one instead has to construct an acceptable theory of generalization out of purposive sampling, a much more difficult process. We have tried to do this with our five principles of generalized causal inference. These, we contend, are the keys to generalized inference that lie behind random sampling and that have to be identified, explicated, and assessed if we are to make better general inferences, even if they are not perfect ones. But these principles are much more complex to implement than is random sampling.
Let us briefly illustrate this with the category called American adult women. We could represent this category by random selection from a critically appraised register of all women who live in the United States and who are at least 21 years of age. Within the limits of sampling error, we could formally generalize any characteristics we measured on this sample to the population on that register. Of course, we cannot select this way because no such register exists. Instead, one does one's experiment with an opportunistic sample of women. On inspection they all turn out to be between and 30 years of age, to be higher than average in achievement and ability, and to be attending school; that is, we have used a group of college women. Surface similarity suggests that each is an instance of the category woman. But it is obvious that the modal American woman is clearly not a college student. Such students constitute an overly homogeneous sample with respect to educational abilities and achievement, socioeconomic status, occupation, and all observable and unobservable correlates thereof, including health status, current employment, and educational and occupational aspirations and expectations. To remedy this bias, we could use a more complex purposive sampling design that selects women heterogeneously on all these characteristics. But purposive sampling for heterogeneous instances can never do this as well as random selection can, and it is certainly more complex to conceive and execute. We could go on and illustrate how the other principles facilitate generalization. The point is that any theory of generalization from purposive samples is bound to be more complicated than the simplicity of random selection.
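The contrast between an opportunistic sample and purposive sampling for heterogeneous instances can be sketched in code. The sampling frame, the attribute categories, and the cell sizes below are invented for illustration; the point is only that deliberate selection spreads a sample across attribute combinations that convenience sampling leaves homogeneous.

```python
import random

random.seed(1)

# Hypothetical frame: each woman tagged with an age band and an occupation.
AGE_BANDS = ["21-30", "31-45", "46-64", "65+"]
OCCUPATIONS = ["student", "teacher", "clerical", "retired", "trades"]

frame = [{"age": random.choice(AGE_BANDS),
          "occ": random.choice(OCCUPATIONS)} for _ in range(5000)]

# Opportunistic sample: whoever is easiest to reach (here, students),
# homogeneous on the very attribute we might want to generalize over.
opportunistic = [p for p in frame if p["occ"] == "student"][:100]

# Purposive heterogeneous sample: deliberately pick instances so every
# age-band x occupation cell is represented, making irrelevancies heterogeneous.
purposive = []
for band in AGE_BANDS:
    for occ in OCCUPATIONS:
        cell = [p for p in frame if p["age"] == band and p["occ"] == occ]
        purposive.extend(cell[:5])  # a few instances per cell

print(len({p["occ"] for p in opportunistic}))           # 1 occupation represented
print(len({(p["age"], p["occ"]) for p in purposive}))   # up to 20 distinct cells
```

As the text notes, the purposive design is more complex to conceive and execute than drawing at random from a register, and it still cannot guarantee the formal generalization that random selection would.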
But because random selection is rarely possible when testing causal relationships within an experimental framework, we need these purposive alternatives.

Yet most experimental work probably still relies on the weakest of these alternatives, surface similarity. We seek to improve on such uncritical practice. Unfortunately, though, there is often restricted freedom for the more careful selection of instances of units, treatments, outcomes, and settings, even when the selection is done purposively. It requires resources to sample irrelevancies so that they are heterogeneous on many attributes, to measure several related constructs that can be discriminated from each other conceptually, and to measure a variety of possible explanatory processes. This is partly why we expect more progress on causal generalization from a review context rather than from single studies. Thus, if one researcher can work with college women, another can work with female schoolteachers, and another with female retirees, this creates an opportunity to see if these sources of irrelevant homogeneity make a difference to a causal relationship or whether it holds over all these different types of women.
Ultimately, causal generalization will always be more complicated than assessing the likelihood that a relationship is causal. The theory is more diffuse, more recent, and less well tested in the crucible of research experience. And in some quarters there is disdain for the issue, given the belief and practice that relationships that replicate once should be considered as general until proven otherwise, not to speak of the belief that little progress and prestige can be achieved by designing the next experiment to be some minor variant on past studies. There is no point in pretending that causal generalization is as institutionalized procedurally as other methods in the social sciences. We have tried to set the theoretical agenda in a systematic way. But we do not expect to have the last word. There is still no explication of causal generalization equivalent to the empirically produced list of threats to internal validity and the quasi-experimental designs that have evolved over 40 years to rule out these threats. The agenda is set but not complete.

NONEXPERIMENTAL ALTERNATIVES
Though this book is about experimental methods for answering questions about causal hypotheses, it is a mistake to believe that only experimental approaches are used for this purpose. In the following, we briefly consider several other approaches, indicating the major reasons why we have not dwelt on them in detail. Basically, the reason is that we believe that, whatever their merits for some research purposes, they generate less clear causal conclusions than randomized experiments or even the best quasi-experiments such as regression-discontinuity or interrupted time series.

The nonexperimental alternatives we examine are the major ones to emerge in various academic disciplines. In education and parts of anthropology and sociology, one alternative is intensive qualitative case studies. In these same fields, and also in developmental psychology, there is an emerging interest in theory-based
14. A CRITICAL ASSESSMENT OF OUR ASSUMPTIONS

causal studies based on causal modeling practices. Across the social sciences other than economics and statistics, the word quasi-experiment is routinely used to justify causal inferences, even though designs so referred to are so primitive in structure that causal conclusions are often problematic. We have to challenge such advocacy of low-grade quasi-experiments as a valid alternative to the quality of studies we have been calling for in this book. And finally, in parts of statistics and epidemiology, and overwhelmingly in econometrics and those parts of sociology and political science that draw from econometrics, the emphasis is more on control through statistical manipulation than on experimental design. When descriptive causal inferences are the primary concern, all of these alternatives will usually be inferior to experiments.

Intensive Qualitative Case Studies
The call to generate causal conclusions from intensive case studies comes from several sources. One is from quantitative researchers in education who became disenchanted with the tools of their trade and subsequently came to prefer the qualitative methods of the historian and journalist and especially of the ethnographer (e.g., Guba, 1981, 1990; and more tentatively Cronbach, 1986). Another is from those researchers originally trained in primary disciplines such as qualitative anthropology (e.g., Fetterman, 1984) or sociology (Patton, 1980).
The enthusiasm for case study methods arises for several different reasons. One is that qualitative methods often reduce enough uncertainty about causation to meet stakeholder needs. Most advocates point out that journalists, historians, ethnographers, and lay persons regularly make valid causal inferences using a qualitative process that combines reasoning, observation, and falsificationist procedures in order to rule out threats to internal validity, even if that kind of language is not explicitly used (e.g., Becker, 1958; Cronbach, 1982). A small minority of qualitative theorists go even further to claim that case studies can routinely replace experiments for nearly any causal-sounding question they can conceive (e.g., Lincoln & Guba, 1985). A second reason is the belief that such methods can also engage a broad view of causation that permits getting at the many forces in the world and human minds that together influence behavior in much more complex ways than any experiment will uncover. And the third reason is the belief that case studies are broader than experiments in the types of information they yield. For example, they can inform readers about such useful and diverse matters as how pertinent problems were formulated by stakeholders, what the substantive theories of the intervention are, how well implemented the intervention components were, what distal, as well as proximal, effects have come about in respondents' lives, what unanticipated side effects there have been, and what processes explain the pattern of obtained results. The claim is that intensive case study methods allow probes of an A to B connection, of a broad range of factors conditioning this relationship, and of a range of intervention-relevant questions that is broader than the experiment allows.



    Although we agree that qualitative evidence can reduce some uncertainty about cause (sometimes substantially), the conditions under which this occurs are usually rare (Campbell, 1975). In particular, qualitative methods usually produce unclear knowledge about the counterfactual of greatest importance: how those who received treatment would have changed without treatment. Adding design features to case studies, such as comparison groups and pretreatment observations, clearly improves causal inference. But it does so by melding case-study data collection methods with experimental design. Although we consider this a valuable addition to ways of thinking about case studies, many advocates of the method would no longer recognize it as still being a case study. To our way of thinking, case studies are very relevant when causation is at most a minor issue; but in most other cases, when substantial uncertainty reduction about causation is required, we value qualitative methods within experiments rather than as alternatives to them, in ways similar to those we outlined in Chapter 12.

Theory-Based Evaluations
This approach has been formulated relatively recently and is described in various books or special journal issues (Chen & Rossi, 1992; Connell, Kubisch, Schorr, & Weiss, 1995; Rogers, Hacsi, Petrosino, & Huebner, 2000). Its origins are in path analysis and causal modeling traditions that are much older. Although advocates have some differences with each other, basically they all contend that it is useful: (1) to explicate the theory of a treatment by detailing the expected relationships among inputs, mediating processes, and short- and long-term outcomes; (2) to measure all the constructs specified in the theory; and (3) to analyze the data to assess the extent to which the postulated relationships actually occurred. For shorter time periods, the available data may address only the first part of a postulated causal chain; but over longer periods the complete model could be involved. Thus, the priority is on highly specific substantive theory, high-quality measurement, and valid analysis of multivariate explanatory processes as they unfold in time (Chen & Rossi, 1987, 1992).
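The three steps just listed can be sketched in a few lines of code. Everything below (variable names, path coefficients, sample size) is illustrative rather than drawn from any study the text discusses; the point is only the logic of testing each postulated link in a treatment theory.

```python
import random

random.seed(0)

# Hypothetical program theory: treatment exposure X raises a mediating
# process M, which in turn raises the distal outcome Y.
n = 2000
X = [random.gauss(0, 1) for _ in range(n)]        # step 2: measure each construct
M = [0.6 * x + random.gauss(0, 1) for x in X]     # simulated mediating process
Y = [0.5 * m + random.gauss(0, 1) for m in M]     # simulated outcome

def slope(xs, ys):
    """Least-squares slope of ys on xs: cov(x, y) / var(x)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Step 3: does each postulated link in the causal chain hold in the data?
print(round(slope(X, M), 2))   # near 0.6 if the X -> M link holds
print(round(slope(M, Y), 2))   # near 0.5 if the M -> Y link holds
```

In a real theory-based evaluation the "data" would of course be field measurements rather than simulated values, and each link would be examined with far more care; the sketch only shows why the approach needs a measure for every construct the theory names.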
     Such theoretical exploration is important. It can clarify general issues with treatments of a particular type, suggest specific research questions, describe how the intervention functions, spell out mediating processes, locate opportunities to remedy implementation failures, and provide lively anecdotes for reporting results (Weiss, 1998). All these serve to increase the knowledge yield, even when such theoretical analysis is done within an experimental framework. There is nothing about the approach that makes it an alternative to experiments. It can clearly be a very important adjunct to such studies, and in this role we heartily endorse the approach (Cook, 2000).
     However, some authors (e.g., Chen & Rossi, 1987, 1992; Connell et al., 1995) have advocated theory-based evaluation as an attractive alternative to experiments when it comes to testing causal hypotheses. It is attractive for several reasons. First, it requires only a treatment group, not a comparison group whose

agreement to be in the study might be problematic and whose participation increases research costs. Second, demonstrating a match between theory and data suggests the validity of the causal theory without having to go through a laborious process of explicitly considering alternative explanations. Third, it is often impractical to measure distant end points in a presumed causal chain. So confirmation of attaining proximal end points through theory-specified processes can be used in the interim to inform program staff about effectiveness to date, to argue for more program resources if the program seems to be on its theoretical track, to justify claims that the program might be effective in the future on the as-yet-not-assessed distant criteria, and to defend against premature summative evaluations that claim that an intervention is ineffective before it has been demonstrated that the processes necessary for the effect have actually occurred.
     However, major problems exist with this approach for high-quality descriptive causal inference (Cook, 2000). First, our experience writing about the theory of a program with its developer (Anson et al., 1991) has shown that the theory is not always clear and could be clarified in diverse ways. Second, many theories are linear in their flow, omitting reciprocal feedback or external contingencies that might moderate the entire flow. Third, few theories specify how long it takes for a given process to affect an indicator, making it unclear if null results disconfirm a link or suggest that the next step did not yet occur. Fourth, failure to corroborate a model could stem from partially invalid measures as opposed to invalidity of the theory. Fifth, many different models can fit a data set (Glymour et al., 1987; Stelzl, 1986), so our confidence in any given model may be small. Such problems are often fatal to an approach that relies on theory to make strong causal claims. Though some of these problems are present in experiments (e.g., failure to incorporate reciprocal causation, poor measures), they are of far less import because experiments do not require a well-specified theory in constructing causal knowledge. Experimental causal knowledge is less ambitious than theory-based knowledge, but the more limited ambition is attainable.
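The fifth problem, that many different models can fit the same data, can be made concrete with a toy calculation. The path values below are arbitrary, and the three-variable chain is a deliberately minimal illustration of the equivalent-models point (Stelzl, 1986): a causal chain and its complete reversal imply exactly the same correlations, so the data alone cannot choose between them.

```python
# Two observationally equivalent path models (illustrative numbers only):
#   Model 1:  X -> M -> Y  with standardized paths a, b
#   Model 2:  Y -> M -> X  with standardized paths b, a  (the reversed chain)
a, b = 0.6, 0.5

def implied_corr_chain(p1, p2):
    """Implied correlations (r12, r23, r13) for a standardized
    three-variable causal chain 1 -> 2 -> 3 with paths p1, p2."""
    return (p1, p2, p1 * p2)

model_1 = implied_corr_chain(a, b)   # X -> M -> Y
model_2 = implied_corr_chain(b, a)   # Y -> M -> X

# Model 2 lists its correlations in reverse variable order; align them.
r_ym, r_mx, r_yx = model_2
print(model_1 == (r_mx, r_ym, r_yx))   # prints True: the correlations match
```

Because both models reproduce the observed correlations perfectly, a good fit for one is equally a good fit for the other; only design features or substantive theory external to the data can break the tie.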

Weaker Quasi-Experiments
  For some researchers,random assignment is undesirable for practical or ethical
  reasons, so they prefer quasi-experiments. Clearly, we support thoughtful use of
  quasi-experimentation to study descriptive causal questions. Both interrupted
  time series and regression discontinuity often yield excellent effect estimates.
  Slightly weaker quasi-experiments can also yield defensible estimates,especially
  when they involve control groups with careful matching on stable pretest attrib-
  utes combined with other design features that have been thoughtfully chosen to
address contextually plausible threats to validity. However, when a researcher can choose, randomized designs are usually superior to nonrandomized designs.
      This is especially true of nonrandomized designs in which little thought is
  given to such matters as the quality of the match when creating control groups,


including multiple hypothesis tests rather than a single one, generating data from several pretreatment time points rather than one, or having several comparison groups to create controls that bracket performance in the treatment groups. Indeed, when results from typical quasi-experiments are compared with those from randomized experiments on the same topic, several findings emerge. Quasi-experiments frequently misestimate effects (Heinsman & Shadish, 1996; Shadish & Ragsdale, 1996). These biases are often large and plausibly due to selection biases such as the self-selection of more distressed clients into psychotherapy treatment conditions (Shadish et al., 2000) or of patients with a poorer prognosis into controls in medical experiments (Kunz & Oxman, 1998). These biases are especially prevalent in quasi-experiments that use poor quality control groups and have higher attrition (Heinsman & Shadish, 1996; Shadish & Ragsdale, 1996). If the answers obtained from randomized experiments are more credible than those from quasi-experiments on theoretical grounds and are more accurate empirically, then the arguments for randomized experiments are even stronger whenever a high degree of uncertainty reduction is required about a descriptive causal claim.
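One of the structural features mentioned above, careful matching on stable pretest attributes when creating a control group, can be sketched in a few lines. The data and identifiers here are entirely hypothetical; the sketch shows only the basic nearest-neighbor logic, not the far more careful matching a real study would require.

```python
# Hypothetical sketch: build a control group by nearest-neighbor matching
# on a stable pretest score, matching without replacement.
treated = [(1, 52.0), (2, 47.5), (3, 60.1)]          # (id, pretest score)
pool    = [(10, 51.7), (11, 44.0), (12, 59.8),
           (13, 48.0), (14, 70.2)]                   # untreated comparison pool

matches = {}
available = dict(pool)
for tid, pre in treated:
    # pick the not-yet-used comparison case closest on the pretest
    cid = min(available, key=lambda c: abs(available[c] - pre))
    matches[tid] = cid
    del available[cid]          # each comparison case is used at most once

print(matches)   # {1: 10, 2: 13, 3: 12}
```

Even this toy example shows why match quality matters: comparison case 14, whose pretest score is far from every treated case, is rightly left unused, whereas a careless design might have included it and biased the control-group mean.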
     Because all quasi-experiments are not equal in their ability to reduce uncertainty about cause, we want to draw attention again to a common but unfortunate practice in many social sciences: to say that a quasi-experiment is being done in order to provide justification that the resulting inference will be valid. Then a quasi-experimental design is described that is so deficient in the desirable structural features noted previously, which promote better inference, that it is probably not worth doing. Indeed, over the years we have repeatedly noted the term quasi-experiment being used to justify designs that fell into the class that Campbell and Stanley (1963) labeled as uninterpretable and that Cook and Campbell (1979) labeled as generally uninterpretable. These are the simplest forms of the designs discussed in Chapters 4 and 5. Quasi-experiments cannot be an alternative to randomized experiments when the latter are feasible, and poor quasi-experiments can never be a substitute for stronger quasi-experiments when the latter are also feasible. Just as Gueron (1999) has reminded us about randomized experiments, good quasi-experiments have to be fought for, too. They are rarely handed out as though on a silver plate.

Statistical Controls

In this book, we have advocated that statistical adjustments for group nonequivalence are best used after design controls have already been used to the maximum in order to reduce nonequivalence to a minimum. So we are not opponents of statistical adjustment techniques such as those advocated by the statisticians and econometricians described in the appendix to Chapter 5. Rather, we want to use them as a last resort. The position we do not like is the assumption that statistical controls are so well developed that they can be used to obtain confident results in nonexperimental and weak quasi-experimental contexts. As we saw in Chapter 5, research over the past 2

decades has not much supported the notion that a control group can be constructed through matching from some national or state registry when the treatment group comes from a more circumscribed and local setting. Nor has research much supported the use of statistical adjustments in longitudinal national surveys in which individuals with different experiences are explicitly contrasted in order to estimate the effects of this experience difference. Undermatching is a chronic problem here, as are consequences of unreliability in the selection variables, not to speak of specification errors due to incomplete knowledge of the selection process. In particular, endogeneity problems are a real concern. We are heartened that more recent work on statistical adjustments seems to be moving toward the position we represent, with greater emphasis being placed on internal controls, on stable matching within such internal controls, on the desirability of seeking cohort controls through the use of siblings, on the use of pretests collected on the same measures as the posttest, on the utility of such pretest measures collected at several different times, and on the desirability of studying interventions that are clearly exogenous shocks to some ongoing system. We are also heartened by the progress being made in the statistical domain because it includes progress on design considerations, as well as on analysis per se (e.g., Rosenbaum, 1999a). We are agnostic at this time as to the virtues of the propensity score and instrumental variable approaches that predominate in discussions of statistical adjustment. Time will tell how well they pan out relative to the results from randomized experiments. We have surely not heard the last word on this topic.
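To make the flavor of one such adjustment concrete, here is a minimal sketch of stratifying on an already estimated propensity score. The scores, outcomes, and the crude two-stratum split are all hypothetical; in practice the score would come from a model of treatment assignment (such as a logistic regression on the selection variables), and finer strata or matching would be used.

```python
# Hypothetical records: (estimated propensity score, treated?, outcome)
data = [
    (0.2, 0, 3.0), (0.2, 1, 4.2), (0.3, 0, 3.1), (0.3, 1, 4.0),
    (0.7, 0, 5.0), (0.7, 1, 6.1), (0.8, 0, 5.2), (0.8, 1, 6.0),
]

def stratum(score):
    """Two deliberately coarse strata: low vs. high propensity to be treated."""
    return "low" if score < 0.5 else "high"

effects = []
for s in ("low", "high"):
    # compare treated and untreated cases only within the same stratum
    t = [y for p, z, y in data if stratum(p) == s and z == 1]
    c = [y for p, z, y in data if stratum(p) == s and z == 0]
    effects.append(sum(t) / len(t) - sum(c) / len(c))

# average the within-stratum treatment-control differences
print(round(sum(effects) / len(effects), 2))   # prints 1.0
```

The within-stratum comparisons are what the adjustment buys: treated and untreated cases are compared only against others with a similar estimated propensity for treatment, rather than against the whole pool. The sketch also shows the method's limit noted in the text: the result is only as good as the selection model that produced the scores.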

CONCLUSION

We cannot point to one new development that has revolutionized field experimentation in the past few decades, yet we have seen a very large number of incremental improvements. As a whole, these improvements allow us to create far better field experiments than we could do 40 years ago when Campbell and Stanley (1963) first wrote. In this sense, we are very optimistic about the future. We believe that we will continue to see steady, incremental growth in our knowledge about how to do better field experiments. The cost of this growth, however, is that field experimentation has become a more specialized topic, both in terms of knowledge development and of the opportunity to put that knowledge into practice in the conduct of field experiments. As a result, nonspecialists who wish to do a field experiment may greatly benefit by consulting with those with the expertise, especially for large experiments, for experiments in which implementation problems may be high, or for cases in which methodological vulnerabilities will greatly reduce credibility. The same is true, of course, for many other methods. Case-study methods, for example, have become highly enough developed that most researchers would do an amateurish job of using them without specialized training or supervised practice. Such Balkanization of methodology is, perhaps, inevitable, though nonetheless regrettable. We ease the regret somewhat by recognizing that with specialization may come faster progress in solving the problems of field experimentation.
