

Packing and Unpacking Sources of Validity Evidence:
History Repeats Itself Again

Stephen G. Sireci
University of Massachusetts Amherst

Presentation for the conference
“The Concept of Validity: Revisions, New Directions, & Applications”
October 9, 2008
University of Maryland, College Park

A concept that has evolved and is
 still evolving
The most important consideration in
 educational and psychological testing
Simple, but complex
  – Can be misunderstood
  – Disagreements regarding what it is,
    and what is important
Purposes of this presentation
Provide some historical context on
 the concept of validity in testing
Present current, consensus
 definitions of validity
Describe the validation framework
 implied in the Standards
Discuss limitations of current
Suggest new directions for validity
 research and practice
Packing and unpacking: A prelude
Packing                   Unpacking

Does the test measure     Predictive, status,
what it purports to       content, congruent
measure?                  validity

A test is valid for       Clarity, coherence,
anything with which it    plausibility of
correlates.               assumptions (validity

Validity is a unitary     5 sources of validity
concept.                  evidence
Validity defined
What is validity?
How have psychometricians come
 to define it?
What does valid mean?
According to Webster’s, Valid:
  1. having legal force; properly executed and
   binding under the law.
  2. sound; well grounded on principles or
   evidence; able to withstand criticism or
  3. effective, effectual, cogent
  4. robust, strong, healthy (rare)
What is validity?
According to Webster’s:
  1. the state or quality of being valid;
    specifically, (a) strength or force from
    being supported by fact; justness;
    soundness; (b) legal strength or force.
  2. strength or power in general
  3. value (rare)
How have psychometricians
defined validity?
Some History
In the beginning
Modern measurement started at the
 turn of the 20th century
1905: Binet-Simon scale
  – 30-item scale designed to ensure that
    no child could be denied instruction in
    the Paris school system without formal
    • Binet died in 1911 at age 54
College Board was established in
  – Began essay testing in 1901
What else was happening around
the turn of the century?
1896: Karl Pearson, Galton
 Professor of Eugenics at University
 College, published the formula for
 the correlation coefficient
Given the predictive purpose of
 Binet’s test,
interest in heredity and individual
and a new statistical formula
 relating variables to one another
validity was initially defined in terms
 of correlation
Earliest definitions of Validity
“Valid scale” (Thorndike, 1913)
A test is valid for anything with
 which it correlates
  – Kelley, 1927; Thurstone, 1932;
    Bingham, 1937; Guilford, 1946; others
Validity coefficients
  – correlations of test scores with grades,
    supervisor ratings, etc.
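
As a rough illustration of what a validity coefficient amounted to in this era, the sketch below computes a Pearson correlation between test scores and a criterion. The function name `validity_coefficient` and the data are hypothetical, made up for illustration; they are not from the presentation.

```python
from math import sqrt


def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores and a criterion measure."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 cases")
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    dev_x = [x - mean_x for x in test_scores]
    dev_y = [y - mean_y for y in criterion_scores]
    cross = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    ss_x = sum(dx * dx for dx in dev_x)
    ss_y = sum(dy * dy for dy in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: five examinees' test scores and later supervisor ratings.
test_scores = [10, 12, 15, 18, 20]
supervisor_ratings = [2.0, 2.5, 3.0, 3.5, 4.5]
r = validity_coefficient(test_scores, supervisor_ratings)
print(round(r, 2))
```

Under the early definition, a coefficient like this was the whole of the validity evidence, which is the view the critics discussed later in the deck pushed back on.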
Validation started with group
1917: Army Alpha and Army Beta
  – (Yerkes)
  – Classification of 1.5 million recruits

Borrowed items and ideas from Otis
  – Otis was one of Terman’s graduate students
Military Testing
Tests were added to or removed from
 batteries based solely on
 correlational evidence (e.g., increase
 in R²).
How well does test predict pass/fail
 criterion several weeks later?
Jenkins (1946) and others emerged
 in response to problems with the notion
 that validity = correlation
  – See also Pressey (1920)
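
The battery-building rule described above (keep a test if it raises R²) can be sketched for the simplest case: one test already in the battery and one candidate. The standard two-predictor formula gives R² from the three pairwise correlations, and the increment is what the candidate adds beyond the first test. The function names and correlation values are hypothetical and are not taken from the presentation.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion with two tests,
    computed from the three pairwise correlations."""
    if abs(r_12) >= 1:
        raise ValueError("the two tests must not be perfectly correlated")
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)


def incremental_r_squared(r_y1, r_y2, r_12):
    """Gain in R^2 from adding test 2 to a battery that already has test 1."""
    return r_squared_two_predictors(r_y1, r_y2, r_12) - r_y1**2


# Hypothetical correlations: test 1 and test 2 correlate .50 and .40 with
# the criterion; they correlate .60 (or .00) with each other.
gain_overlapping = incremental_r_squared(r_y1=0.50, r_y2=0.40, r_12=0.60)
gain_independent = incremental_r_squared(r_y1=0.50, r_y2=0.40, r_12=0.00)
```

With these made-up values the candidate adds far less when it overlaps with the existing test than when it is independent of it. A rule of this kind says nothing about what either test measures, which is the limitation the critics raised.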
Problems with notion that
validity = correlation
Finding criterion data
Establishing reliability of criterion
Establishing validity of criterion

If valid, measurable, criteria exist,
 why do we need the test?
What did critics of correlational evidence
of validity suggest for validating tests?
 Professional judgment
 “it is proper for the test developer to
   use his individual judgment in this
   matter though he should hardly
   accept it as being on par with, or as
   worthy of credence as,
   experimentally established facts
   showing validity.”
    – (Kelley, 1927, pp. 30-31)
What did critics of correlational evidence
of validity suggest for validating tests?
 Appraisal of test content with
  respect to the purpose of testing
  (Rulon, 1946)
    – rational relationship
 Sound familiar?
 Early notions of content validity
    – (Kelley, Mosier, Rulon, Thorndike,
    – but notice Kelley’s hesitation in
      endorsing this evidence, or going
      against the popular notion
Other precursors to content
Guilford (1946): validity by
Gulliksen (1950): “Intrinsic Validity”
  – pre/post instruction test score change
  – consensus of expert judgment
    regarding test content
  – examine relationship of test to other
    tests measuring same objectives
Herring (1918): 6 experts evaluated the
 “fitness of items”
Development of Validity Theory
By the 1950s, there was consensus
 that correlational evidence was not
and that judgmental data of the
 adequacy of test content should be
Growing idea of multiple lines of
 “validity evidence”
Emergence of Professional
Cureton (1951): First “Validity”
 chapter in first edition of
 “Educational Measurement” (edited
 by Lindquist).
Two aspects of validity
  – Relevance (what we would call
  – Reliability
Cureton (1951)
Validity defined as “the correlation
 between actual test scores and true
 criterion scores”
but: “curricular relevance or
 content validity” may be appropriate
 in some situations.
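
The definition above refers to true criterion scores, which are never observed directly. One way to read it, under classical test theory assumptions (measurement error in the criterion is uncorrelated with test scores), is through the standard correction for attenuation applied to the criterion only. The symbols are assumptions introduced here for illustration, not notation from the presentation: X is the observed test score, Y the observed criterion score, T_Y the true criterion score, and ρ_YY' the reliability of the criterion. This is a sketch of that reading, not necessarily the procedure Cureton described.

```latex
\rho_{X T_Y} = \frac{\rho_{XY}}{\sqrt{\rho_{YY'}}}
```

For example, an observed validity coefficient of .40 with a criterion reliability of .64 would correspond to a correlation of .40 / .80 = .50 with true criterion scores.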
Emergence of Professional
1952: APA Committee on Test
  – Technical Recommendations for
    Psychological Tests and Diagnostic
    Techniques: A Preliminary Proposal
Four “categories of validity”
  – predictive, status, content, congruent
 Emergence of Professional
1954: APA, AERA, & NCMUE
  – Technical Recommendations for
    Psychological Tests and Diagnostic
Four “types” or “attributes” of
  – construct validity (instead of congruent)
  – concurrent validity (instead of status)
  – predictive
  – content
1954 Standards
Chair was Cronbach and guess who
 else was on the Committee?
  – Hint: A philosopher

Promoted idea of:
  – different types of validity
  – multiple types of evidence preferred
  – some types preferable in some
Subsequent Developments
1955: Cronbach and Meehl
  – Formally defined and elaborated the
    concept of construct validity.
  – Introduced term “criterion-related
1956: Lennon
  – Formally defined and elaborated the
    concept of content validity.
Subsequent Developments
Loevinger (1957): big promoter of
 construct validity idea.

Ebel (1961…): big antagonist of
 unified validity theory
  – Preferred “meaningfulness”
Evolution of Professional
1966: AERA, APA, NCME
Standards for Educational and
  Psychological Tests and Manuals
Three “aspects” of validity:
  – Criterion-related (concurrent +
  – Construct
  – Content
1966: Standards

Introduced notion that test users are
 also responsible for test validity
Specific testing purposes called for
 specific types of validity evidence.
  – Three “aims of testing”
    • present performance
    • future performance
    • standing on trait of interest
Important developments in content
Evolution of Professional
1974: AERA, APA, NCME
Standards for Educational and
  Psychological Tests
Validity descriptions borrowed
 heavily from Cronbach (1971)
  – Validity chapter in 2nd edition of
    “Educational Measurement” (edited by
    R.L. Thorndike)
1974: Standards
Defined content validity in
 operational, rather than theoretical,
Beginning of notion that construct
 validity is much cooler than content
 or criterion-related.
Early consensus of “unitary”
 conceptualization of validity
Evolution of Professional
1985: AERA, APA, NCME
Standards for Educational and
  Psychological Testing
note “ing”
Described validity as unitary
Notion of validating score-based
Very Messick-influenced
1985 Standards

More responsibility on test users
More standards on applications and
 equity issues
Separate chapters for
  – Validity
  – Reliability
  – Test development
  – Scaling, norming, equating
  – Technical manuals
1985 Standards
New chapters on specific testing
  – Clinical
  – Educational
  – Counseling
  – Employment
  – Licensure & Certification
  – Program Evaluation
  – Linguistic Minorities
  – “People who have handicapping
1985 Standards
New chapters on
  – Administration, scoring, reporting
  – Protecting the rights of test takers
  – General principles of test use
Listed standards as
  – primary,
  – secondary, or
  – conditional.
1999 Standards
 New “Fairness in Testing” section
 No more “primary,” “secondary,”
 3-part organizational structure
  1. Test construction, evaluation, &
  2. Fairness in testing
  3. Testing applications
1999 Standards (2)
 Incorporated the “argument-based
   approach to validity”
Five “Sources of Validity Evidence”
1. Test content
2. Response processes
3. Internal structure
4. Relations to other variables
5. Testing consequences
We’ll return to these sources later.
Comparing the Standards: Packing
 & Unpacking Validity Evidence
Edition   Validity
 1954     Construct, concurrent, predictive,
          content
 1966     Criterion-related, construct,
          content
 1974     Criterion-related, construct,
          content
 1985     Unitary (but, content-related
          evidence, etc.)
 1999     Unitary: 5 sources of evidence
What are the current and
influential definitions of validity?
Cronbach: Influential, but not
 current (1971…)
Messick (1989…)
Shepard (1993)
Standards (1999)
Kane (1992, 2006)
Messick (1989): 1st sentence
“Validity is an integrated evaluative
  judgment of the degree to which
  empirical evidence and theoretical
  rationales support the adequacy
  and appropriateness of inferences
  and actions based on test scores
  and other modes of assessment.”
  (p. 13)
This “integrated” judgment led
Messick, and others, to conclude
All validity is construct validity.

It is outside my purpose today to debate
   the unitary conceptualization of
   validity, but like all theories, it has
   strengths and limitations.

But two quick points…
Unitary conceptualization of validity
Focuses on inferences derived from
 test scores
  – Assumes measurement of a construct
    motivates test development and
The focus on analysis of scores
 may undermine attention to content
Removal of term “content validity”
 may have had negative effect on
 validation practices.
Consider Ebel (1956)
“The degree of construct validity of a
  test is the extent to which a system
  of hypothetical relationships can be
  verified on the basis of measures of
  the construct…but this system of
  relationships always involves
  measures of observed behaviors
  which must be defended on the
  basis of their content validity” (p.
Consider Ebel (1956)
“Statistical validation is not an
  alternative to subjective evaluation,
  but an extension of it. All statistical
  procedures for validating tests are
  based ultimately upon common
  sense agreement concerning what
  is being measured by a particular
  measurement process” (p. 274).
The 1999 Standards accepted
the unitary conceptualization,
but also took a practical stance.
The practical stance stems from the
 use of an argument-based approach
 to validity.
  – Cronbach (1971, 1988)
  – Kane (1992, 2006)
The Standards (1999)
succinctly defined validity
“Validity refers to the degree to which
 evidence and theory support the
 interpretations of test scores entailed
 by proposed uses of tests.” (p. 9)
Why do I say the Standards
incorporated the argument-based
approach to validation?
“Validation can be viewed as
  developing a scientifically sound
  validity argument to support the
  intended interpretation of test
  scores and their relevance to the
  proposed use.”
 (AERA et al., 1999, p. 9)
Kane (1992)
“it is not possible to verify the
  interpretive argument in any
  absolute sense. The best that
  can be done is to show that the
  interpretive argument is highly
  plausible, given all available
  evidence” (p. 527).
Kane: Argument-based approach
a) Decide on the statements and
   decisions to be based on the test
b) Specify inferences/assumptions
   leading from test scores to
   statements and decisions.
c) Identify competing interpretations.
d) Seek evidence supporting
   inferences and assumptions and
   refuting counterarguments.
Philosophy of Validity
Messick (1989)
“if construct validity is considered to
  be dependent on a singular
  philosophical base such as logical
  positivism and that basis is seen to
  be deficient or faulty, then construct
  validity might be dismissed out of
  hand as being fundamentally
  flawed” (p. 22).
Messick (1989)
“nomological networks are viewed as
  an illuminating way of speaking
  systematically about the role of
  constructs in psychological theory
  and measurement, but not as the
  only way” (p. 23).
3 perspectives on the relationship between
a test and other indicators of a construct
1. Test and nontest consistencies are
   manifestations of real traits.
2. Test and nontest consistencies are
   defined by relationships among constructs
   in a theoretical framework.
3. Test and nontest consistencies are
   attributable to real entities but are
   understood in terms of constructs.
See Messick (1989) Figures 2.1-2.3
Messick on test validation
“test validation is a process of
  inquiry” (p. 31)
5 systems of inquiry
  – Leibnizian
  – Lockean
  – Kantian
  – Hegelian
  – Singerian
Systems of inquiry
Main points
  – Validation can seek to confirm
  – Validation can seek consensus
  (Leibniz, Locke)

  – Validation can seek alternative
  – Validation can seek to disconfirm
  (Kant, Hegel, Singer)
Two other important points by
Messick (1989)
“the major limitation is shortsightedness
  with respect to other possibilities” (p. 33)

“The very variety of methodological
  approaches in the validational
  armamentarium, in the absence of
  specific criteria for choosing among
  them, makes it possible to select
  evidence opportunistically and to ignore
  negative findings” (p. 33)
If you look at the seminal papers
and textbooks, and the various
editions of the Standards, there
are several fundamental and
consensus tenets about validity
theory and test validation.
Fundamental Validity Tenets
Validity is NOT a property of a test.
A test cannot be valid or invalid.
What we seek to validate are
 inferences from, and uses of, test scores.
Validity is not all or none.
Test validity must be evaluated with
 respect to a specific testing purpose.
 Thus, a test may be appropriate for
 one purpose, but not for another.
Fundamental Validity Tenets
Evaluating the validity of inferences
 derived from test scores requires
 multiple lines of evidence (i.e., different
 types of evidence for validity).
Test validation never ends—it is an
 ongoing process.
I believe these tenets can be
considered “consensus” due to
their incorporation in the
Standards and predominance in
the literature.
But of course, not everyone need
 agree with consensus, and we will
 hear important points from
 detractors over the next two days.
Criticisms of this perspective
Tests are never truly validated (we
 are never done).
No prescription or guidance
 regarding specific types of evidence
 to gather and how to gather it.
Ideal goals with no guidance lead
 to inaction.
The argument-based approach is
a compromise between
sophisticated validity theory and
the reality that at some point, we
must make a judgment about the
defensibility and suitability of
use of a test for a particular purpose.
What guidance do the
Standards give us?
 Five “sources of evidence that
 might be used in evaluating a
 proposed interpretation of test
 scores for particular purposes”
(Messick, 1989, p. 13).
“Validation is a matter of making the
  most reasonable case to guide both
  current use of the test and current
  research to advance understanding
  of what the test scores mean…
To validate an interpretive inference is
  to ascertain the degree to which
  multiple lines of evidence are
  consonant with the inference, while
  establishing that alternative
  inferences are less well supported.”
The current Standards
Provide a useful framework for
 evaluating the use of a test for a
 particular purpose.
  – And for documenting validity evidence
Allow us to use multiple lines of
 evidence to support use of a test for
 a particular purpose
But, are not prescriptive and do not
 provide examples or references to
 “adequate” validity arguments.
Standards’ Validation
Validity evidence based on
1. Test content
2. Response processes
3. Internal structure
4. Relations to other variables
5. Testing consequences
What is helpful in the Standards
It provides a system for categorizing
 validity evidence so that a coherent
 set of evidence can be put forward.
It provides a way of standardizing
 the reporting of validity evidence.
It focuses on both test construction
 and test score validation activities.
Emphasizes the importance of
 evaluating consequences
What are the limitations in the
Standards framework?
Not all types of evidence of validity
 fit into the 5 sources categories.
No examples of good validation
 studies or of when sufficient
 evidence is put forth
No statistical guidance
No references
Vagueness in some areas
Suggestions for revising the
Standards (1)
Need to refine sources of validity
 evidence to accommodate
  – Analysis of group differences
  – Alignment research
  – Differential item functioning
  – Statistical analysis of test bias
Need more clarity on validity
 evidence for accountability testing
 (groups, rather than individuals)
Suggestions for revising the
Standards (2)
Need to define “score
 – Across subgroups of examinees
   taking a single assessment
 – Across accommodations to
   standardized assessments
 – Across different language versions
   of an assessment
 – Across different tests in CAT/MST
 – Across different modes of
Suggestions for revising the
Standards (3)
Include specific examples of
 laudable test validation
 analyses and references to
 studies that exemplify sound
 validity arguments.
Closing remarks
There are different perspectives on
 validity theory.
Whether a test is valid for a
 particular purpose will always be a
 question of judgment.
A sound validity argument makes
 the judgment an easy one to make.
Closing remarks (2)
For educational tests, validity
 evidence based on test content is
 fundamental. Without confirming that
 the content tested is consistent with
 curricular goals, that the test adequately
 represents the intended domain,
 and that the test is free of
 construct-irrelevant material, the utility
 of the test for making educational
 decisions will be undermined.
Why are there different
perspectives on validity?
It’s philosophy.
It’s okay to disagree, but we need
 consensus with respect to
 nomenclature, and that is where
 differences can hurt us as a
For over 50 years, the Standards
 have provided consensus
Remember, threats to validity
boil down to
Construct underrepresentation
Construct-irrelevant variance
“Tests are imperfect measures of
  constructs because they either
  leave out something that should be
  included…or else include
  something that should be left out, or
  both” (Messick, 1989, p. 34)
Adhering to and Improving the
I don’t agree with everything in the
But I find it easier to work within the
 framework than against it.
Advice for the remainder of the
conference, and for your future
validity endeavors
If you criticize, have specific
 improvements to contribute.
  – (e.g., evidence-centered design)
Consider different perspectives on
 validity when evaluating use of a
 test for a particular purpose
  – If one statistical analysis is offered as
    “validation,” be suspicious
  – Look for evidence in test construction
Thank you for your attention
And thanks to Bob Lissitz and UMD
 for the invitation and for holding this
I look forward to continuing the
There is certainly a lot more to hear,
 and to say.
