   Construct Validity: A Universal
       Validity System

         Susan Embretson
   Georgia Institute of Technology
• Validity is a controversial concept in
  educational and psychological testing
• Research on educational and psychological
  tests during the last half of the 20th century
  was guided by distinction of types of validity
  • Criterion-related validity, content validity and
    construct validity
• Construct validity is the most problematic
  type of validity
  • It involves theory and the relationship of data to theory
• Yet the most controversial type of validity became
  the sole type of validity in the revised joint
  standards for educational and psychological tests
  (AERA/APA/NCME, 1999)
  • In the current standards, “Validity refers to the degree to
    which evidence and theory support the interpretations of
    test scores entailed by proposed uses of tests”
  • Content validity and criterion-related validity are two of
    five different kinds of evidence
  • This reflects the substantial impact of Messick’s (1989) thesis
    of a single type of validity (construct validity) with
    several different aspects
Overview of the validity concept
Current issues on validity
  • Discontent with construct validity for educational tests
     • Need for content validity
Critique of content validity as basis for educational testing
Universal system for construct validity
  • Applies to all tests
     • Achievement tests
     • Ability tests
   History of the Construct Validity
           Concept: Origins
• American Psychological Association (1954). Technical
  recommendations for psychological tests and diagnostic
  techniques. Psychological Bulletin, 51, 2, 1-38.
       • Prepared by a joint committee of the American Psychological
         Association, American Educational Research Association, and
         National Council on Measurements Used in Education.
   – “Validity information indicates to the test user the degree to
     which the test is capable of achieving certain aims. …” “Thus, a
     vocabulary test might be used simply as a measure of present
     vocabulary, as a predictor of college success, as a means of
     discriminating schizophrenics from organics, or as a means of
     making inferences about ‘intellectual capacity.’”

   – “We can distinguish among the four types of validity by noting
     that each involves a different emphasis on the criterion.” (p. 13)
   History of the Construct Validity
           Concept: Origins
Types of validity by use
  Content validity
     • “The test user wishes to determine how an individual would
       perform at present in a given universe of situations of which
       the test situation constitutes a sample.”
  Predictive validity
     • “The test user wishes to predict an individual's future
       performance …”
  Concurrent validity
     • “The test user wishes to estimate an individual's present status
       on some variable external to the test.”
  Construct validity
     • “The test user wishes to infer the degree to which the individual
       possesses some trait or quality (construct) presumed to be
       reflected in the test performance.”
  History of the Construct Validity
          Concept: Origins
Cronbach, L. J. & Meehl, P. E. (1955). Construct
  validity in psychological tests. Psychological
  Bulletin, 52, 281-302.
     • “We can distinguish among four types of validity by noting that
       each one puts a different emphasis on the criterion. In
       predictive or concurrent, the criterion behavior is of concern
       to the tester and he may have no concern whatever with the
       type of behavior observed on the test.”
     • “Content validity is studied when the tester is concerned with
       the type of behavior in the test performance. Indeed, if the
       test is a work sample, the test may be an end in itself.”
     • “Construct validity is ordinarily studied when the tester has no
       definite criterion measure of the quality with which he is
       concerned, and must use indirect measures. Here the trait or
       quality underlying the test is of central importance …”
  Implications of Original Views
• Same test can be used in different ways
• Relevant type of validity depends on test use
• The types of validity differ in the
  importance of the behaviors involved in
  the test
  More Recent Views on Types of
            Validity
• Standards for Educational and Psychological
  Testing (1954, 1966, 1974, 1985, 1999)
• 1985
  – “Traditionally, the various means of accumulating
    validity evidence have been grouped into categories
    called content-related, criterion-related and construct-
    related evidence of validity. …” “These categories are
    convenient … but the use of category labels does not
    imply that there are distinct types of validity …”
  – “An ideal validation includes several types of
    evidence, which span all three of the traditional
    categories.”
   Conceptualizations of Validity:
  Psychological Testing Textbooks
• “All validity analyses address the same basic question:
  Does the test measure knowledge and characteristics
  that are appropriate to its purpose? There are three
  types of validity analysis, each answering this question in
  a slightly different way.” (Friedenberg, 1995)
• “… the types of validity are potentially independent of
  one another.” (Murphy & Davidshofer, 1988)
• “There are three types of evidence: (1) construct-related,
  (2) criterion-related, and (3) content-related.” … “It is
  important to emphasize that categories for grouping
  different types of validity are convenient; however, the
  use of categories does not imply that there are distinct
  forms of validity.” (Kaplan & Saccuzzo, 1993)
    Most Recent View on Types of
              Validity
• Standards for Educational and Psychological Testing 1999
   – “Validity refers to the degree to which evidence and theory
     support the interpretations of test scores entailed by proposed
     uses of tests.” “The proposed interpretation refers to the construct
     or concepts that the test is intended to represent.” (p. 9)

   – “These sources of evidence may illuminate different aspects of
     validity, but they do not represent distinct types of validity.
     Validity is a unitary concept.”

   – “The wide variety of tests and circumstances makes it natural that
     some types of evidence will be especially critical in a given case,
     whereas other types will be less useful.” (p. 9)

   – “Because a validity argument typically depends on more than one
     proposition, strong evidence in support of one in no way
     diminishes the need for evidence to support others.” (p. 11)
  The Sources of Validity Evidence
Evidence based on test content
   • Logical & empirical analysis of how adequately the test content
     represents a content domain; includes themes, wording, item
     format and procedures for administration & scoring
Evidence based on response processes
   • Theoretical and empirical analysis of test takers’ response processes
     with respect to the construct
Evidence based on internal structure
   • Relationships among test items correspond to construct structure
Evidence based on relations to other variables
   • Convergent & discriminant evidence
   • Test-criterion relationships
   • Validity generalization
Evidence based on the consequences of testing
   • Different impact by group, claims of testing benefits
Implications of 1999 Validity Concept
• No distinct types of validity
• Multiple sources of evidence for single test aim
   • Example: mathematical achievement test used to
     assess readiness for more advanced course
   • Propositions for inference
      1) Certain skills are prerequisite for advanced course
      2) Content domain structure for the test represents skills
      3) Test scores represent domain performance
      4) Test scores are not unduly influenced by irrelevant variables,
         such as writing ability, spatial ability, anxiety, etc.
      5) Success in advanced course can be assessed
      6) Test scores are related to success in advanced curriculum
   Current Issues with the Validity
   Concept: Educational Testing
Crocker (2003)
  • Content aspect of validity deserves more prominence
     • Educational accountability needs content representativeness
  • More methods for content-related evidence needed
     • Design: test specification and item generation
     • Item review tasks; subject matter expert reliability
     • Data analysis techniques for content judgments

Fremer (2000)
  • Construct validity is an unreachable goal
Borsboom, Mellenbergh & van Heerden (2004)
  • Current validity theory “fails to serve either the
    theoretically oriented psychologist or the practically
    inclined tester”
  Current Issues with the Validity
  Concept: Educational Testing
• Lissitz and Samuelsen (2007)
   • Propose some changes in terminology and
     emphasis in the validity concept
   • Argue that “construct validity as it currently
     exists has little to offer test construction in
     educational testing”
   • In fact, their system leads to a most startling
     conclusion
      • Construct validity is irrelevant to defining what
        is measured by an educational test!
      • Content validity becomes primary in determining
        what an educational test measures
  Current Issues with the Validity
  Concept: Educational Testing
Several published responses in Educational
  Researcher
  • Embretson, S. E. (2007). Construct validity: A
    universal validity system or just another test
    evaluation procedure? Educational
    Researcher, 36(8), 449-455.
Lissitz’ response: Organize a conference!
    Critique of Content Validity as
    Basis for Educational Testing
• Content validity is not up to the burden of
  defining what is measured by a test
• Relying on content validity evidence, as
  available in practice, to determine the meaning
  of educational tests could have detrimental
  impact on test quality
• Giving content validity primacy for educational
  tests could lead to very different types and
  standards of evidence for educational and
  psychological tests
   Validity in Educational Tests
 Response to Lissitz & Samuelsen
• Background
  • Embretson, S. E. (1983). Construct validity:
    Construct representation versus nomothetic
    span. Psychological Bulletin, 93, 179-197.
    • Construct representation
       • Establishes the meaning of test scores by identifying
         the theoretical mechanisms that underlie test
         performance (i.e., the processes, strategies and
         knowledge stores)
    • Nomothetic span
       • Establishes the significance of test scores by identifying
         the network of relationships of test scores with other
         variables
Validity in Lissitz and Samuelsen’s
Taxonomy of test evaluation procedures
  1) Investigative Focus
    • Internal sources = analysis of the test and its items
       • Provides evidence about what is measured
    • External sources = relationship of test scores to
      other measures & criteria
       • Provides evidence about impact, utility and trait theory
  2) Perspective
    • Theoretical orientation = concern with measuring
    • Practical orientation = concern with measuring
Figure 2. Taxonomy of Test Evaluation Procedures

              Theoretical            Practical
   Internal   Latent Process         Content and Reliability
   External   Nomological Network    Utility and Impact

Figure 1. The Structure of the Technical Evaluation of
   Educational Testing

   Internal: Latent Process, Content, Reliability
   External: Theory (Nomological), Utility (Criterion), Impact


       Implications for Validity
System represents best current practices
Internal meaning (validity) established
   • For educational tests, content and reliability evidence
      • Evidence based on internal structure (i.e., reliability, etc.)
      • Evidence based on test content
   • For psychological tests, depends on latent processes
      • Evidence based on response processes
      • Evidence based on internal structure (item correlations)
But notice the limitations
   • Response process and test content evidence are not
     relevant to both types of tests
   • External evidence based on relations to other variables
     has no role in validity
      External Evidence Only?
• Construct validity is removed from the validity
   • Critical to this view of construct validity is classification as
     external evidence
• However, Cronbach and Meehl’s conceptualization
  did include internal sources of evidence
   • Studies of internal structure
   • Studies of change
   • Studies of processes
• Within the nomological network, these sources
  would be classified as test-to-construct evidence
• Thus, construct validity need not be decentralized
  for this reason
       Current Practice of Construct
                 Validity
• However, internal sources of information have no
  priority in Cronbach and Meehl
   • Simply another source of evidence
• Considering only external sources may characterize
  some current practices
   • Re-conceptualize test meaning based on external evidence
     rather than develop new tests
• Concern about the strong role of external sources
  motivated Embretson’s (1983) distinctions
   • If internal sources are primary, then item and test design
     principles can become central in establishing test validity
     (Embretson, 1995)
 Construct Validity for Psychological
   Tests in a Revised Taxonomy
• If construct validity included internal sources
  • Now crucial to meaning for psychological tests
     • Requires scientific foundation for item and test design
        • Impact of item features and testing procedures on KSAs
          (knowledge, skills and abilities)

• But the concept of construct validity, even if it included
  internal evidence, would still not be relevant for
  educational tests
   • Test meaning depends primarily on content-related
     evidence and reliability evidence
 Internal Evidence for Educational
               Tests
• Reliability concept in the Lissitz and
  Samuelsen framework is generally
  multifaceted and traditional
   • Item interrelationships
   • Relationship of test scores over conditions or
   • Differential item functioning (DIF)
   • Adverse impact
      • (Perhaps adverse impact and DIF could be
        considered as external information)
 Internal Evidence for Educational
               Tests
• Concept of Content Validity
• Previous test standards (1985)
   • Content validity was a type of evidence that
     “… demonstrates the degree to which a sample of
     items, tasks or questions on a test are representative
     of some defined universe or domain of content”
• Two important elements added by L&S
   • Cognitive complexity level
      • “whether the test covers the relevant instructional or content
        domain and the coverage is at the right level of cognitive
        complexity”
   • Test development procedures
      • Information about item writer credentials and quality control
Test Blueprints as Content Validity
• Blueprints specify percentages of test items that
  should fall in various categories
• Example: test blueprint for NAEP for
  mathematics
   • Five content strands
   • Three levels of complexity
   • Majority of states employ similar strands
• But there are several reasons why blueprints and other
  forms of test specifications (along with reliability
  evidence) are not sufficient to establish meaning
  for an educational test
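
A minimal sketch of how a blueprint's target percentages might be compared with an assembled test form. The strand names, target shares and the 20-item form are illustrative assumptions, not the actual NAEP blueprint, and `blueprint_deviation` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical blueprint: target share of items per content strand.
# Names and percentages are illustrative, not the NAEP values.
BLUEPRINT = {
    "number": 0.30,
    "measurement": 0.20,
    "geometry": 0.20,
    "data": 0.15,
    "algebra": 0.15,
}


def blueprint_deviation(item_strands, blueprint):
    """Return observed-minus-target proportion for each blueprint category."""
    counts = Counter(item_strands)
    n = len(item_strands)
    return {
        strand: counts.get(strand, 0) / n - target
        for strand, target in blueprint.items()
    }


# A made-up 20-item form, each item tagged with a single strand.
form = (
    ["number"] * 7
    + ["measurement"] * 4
    + ["geometry"] * 4
    + ["data"] * 2
    + ["algebra"] * 3
)

deviation = blueprint_deviation(form, BLUEPRINT)
for strand, diff in deviation.items():
    print(f"{strand:12s} {diff:+.2f}")
```

A check like this only confirms that the form matches the percentages. It assumes every item has one unambiguous category, which is the assumption the following slides question.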
  1. Domain Structure is a Theory
     Which Changes Over Time
• NAEP framework, particularly for cognitive
  complexity, has evolved (NAGB, 2006)
• Views on complexity level also may
  change based on empirical evidence, such
  as item difficulty modeling, task
  decomposition and other methods
• Changes in domain structure also could
  evolve in response to recommendations of
  panels of experts
   • National Mathematics Advisory Panel
      • Recommended changes in the basic strands
  2. Reliability of Classifications is
      Not Well Documented
• Scant evidence that items can be reliably classified
  into the blueprint categories
• Certain factors in an achievement domain may
  make these categorizations difficult
  • For example, in mathematics a single real-world
    problem may involve algebra and number sense, as
    well as measurement content
     • Item could be classified into three of the five strands
  • Similarly, classifying items for mathematical complexity
    also can be difficult
     • Abstract definitions of the various levels in many systems
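
One way the reliability of item classifications could be documented is an agreement index between independent raters. The sketch below computes Cohen's kappa for two raters; the choice of kappa and the rater data are assumptions for illustration, not a method or result from the presentation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Made-up strand classifications of eight items by two raters.
rater_a = ["algebra", "algebra", "number", "number",
           "measurement", "measurement", "algebra", "number"]
rater_b = ["algebra", "number", "number", "number",
           "measurement", "algebra", "algebra", "measurement"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Kappa assumes each rater assigns exactly one category per item, so an item that legitimately spans several strands shows up as disagreement.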
3. Unrepresentative Samples from
       the Content Domain
• Practical limitations on testing conditions
  may lead to unrepresentative samples of
  the content domain
   • More objective item formats, such as multiple
     choice and limited constructed response, have
     long been favored
      • Reliably and inexpensively scored
   • But these formats may not elicit the deeper
     levels of reasoning that experts believe
     should be assessed for the subject matter
       4. Irrelevant Item Solving
• Using content specifications, along with item
  writer credentials and item quality control, may
  not be sufficient to assure high quality tests
   • Leighton and Gierl (2007) view content specifications
     as one of three cognitive models for making
     inferences about examinees’ thinking processes
      • A limitation of the cognitive model of test specifications for
        such inferences is that no evidence is provided that examinees are
        in fact using the presumed skills and knowledge to solve
        items
     NAEP Validity Study for
Mathematics: Grade 4 and Grade 8
• Mathematicians examined items from NAEP and
  some state accountability tests
• Results
   • Small percent of items deemed flawed (3-7%)
   • Larger percent of items deemed marginal (23-30%)
   • Marginal items had construct-irrelevant difficulties
      • problems with pattern specifications
      • unduly complicated presentation
      • unclear or misleading language
      • excessively time-consuming processes
   • Marginal items previously had survived both content-
     related and empirical methods of evaluation
Examples of Irrelevant Knowledge,
       Skills and Abilities
• Source
  • National Mathematics Advisory Panel (2008).
    Foundations for success: The final report of
    the National Mathematics Advisory Panel.
    Washington, DC: Department of Education
• Method: logical-theoretical analysis by
  mathematicians & curriculum experts
  • Mathematics involves aspects of logical
    analysis, spatial ability and verbal reasoning,
    yet their role can be excessive
Dependence on Non-Mathematical
Dependence on Logic, Not
Excessive Dependence on Spatial
Excessive Dependence on
 Reasoning and Minimal
  Implication for Educational Tests
• Identifying irrelevant sources of item
  performance requires more than content-related
  evidence
   • Latent process evidence is relevant
      • Methods include cognitive analysis (e.g., item difficulty
        modeling), verbal reports of examinees and factor analysis
   • External sources of evidence may provide needed
      • Example: Implications of the correlation of an algebra test
        with a test of English
          • If this correlation is too high, it may suggest a failure in the
            system of internal evidence that supports test meaning
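
The algebra-English example can be sketched as a simple correlation check. The scores below are invented and the 0.80 cut-off is a hypothetical value chosen only for illustration; the presentation does not say how high is "too high".

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented scores for eight examinees on the two tests.
algebra = [12, 15, 9, 20, 17, 11, 14, 18]
english = [30, 34, 25, 41, 37, 28, 31, 40]

# Hypothetical cut-off; a real study would justify this value.
THRESHOLD = 0.80

r = pearson_r(algebra, english)
flag = r > THRESHOLD
print(f"r = {r:.2f}; revisit internal evidence: {flag}")
```

A flagged correlation is a prompt to re-examine item design, for example the reading load of word problems, rather than proof that the test is invalid.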
  Construct Validity as a Universal
  System and a Unifying Concept
• Features
   • Consistent with current Test Standards (1999)
   • Consistent with many of Lissitz and
     Samuelsen’s distinctions and elaborations
• Validity Concept
   • Universal
      • All sources of evidence are included
      • Appropriate for both educational and psychological
        tests
   • Interactive
      • Evidence in one category is influenced or informed
        by adequacy in the other categories
    Categories of Evidence in the
          Validity System
• Eleven categories of evidence
• The categories are conceived to apply to both
  educational and psychological tests
• Consistent with most validity frameworks and the
  current Test Standards (1999), it is postulated
  that tests differ in which categories in the system
  are most crucial to test meaning, depending on
  their intended use
• Even so, most categories of evidence are
  potentially relevant to a test
  A Universal Validity System

[Figure: diagram of the eleven evidence categories]
   Internal Meaning: Logic/Theoretical Analysis, Latent Process
      Studies, Testing Conditions, Item Design, Domain Structure,
      Test Specs, Psychometric Properties, Scoring Models
   External Significance: Utility, Other Measures, Impact

Internal Categories of Evidence
Logic/Theoretical Analysis   Theory of the subject matter content, specification of
                             areas and their interrelationships

Latent Process Studies       Studies on content interrelationships, impact of item
                             design features on psychometric properties & response
                             time, impact of various testing conditions, etc.

Testing Conditions           Available test administration methods, scoring
                             mechanisms (raters, machine scoring, computer
                             algorithms), testing time, locations, etc. Included
                             because they determine the item types for which it is
                             important to develop design principles

Item Design Principles       Scientific evidence and knowledge about how
                             features of items impact the KSAs applied by
                             examinees: formats, item context, complexity and
                             specific content as determining relevant & irrelevant
                             bases (KSAs) for item responses
Internal Categories of Evidence
Domain Structure          Specification of content areas and levels, as
                          well as relative importance and

Test Specifications       Blueprints specifying domain structure
                          representation, constraints on item features,
                          specification of testing conditions

Psychometric Properties   Item interrelationships, DIF, reliability,
                          relationship of item psychometric properties
                          to content & stimulus features

Scoring Models            Psychometric models and procedures to
                          combine responses within and between items,
                          weighting of items, item selection standards,
                          relationship of scores to proficiency
                          categories, etc. Decisions about
                          dimensionality, guessing, elimination of
                          poorly fitting items, etc. impact scores and
                          their relationships
External Categories of Evidence
Utility          Relationship of scores to external variables,
                 criteria & categories

Other Measures   Relationship of scores to other tests of
                 knowledge, skills and abilities

Impact           Consequences of test use, adverse impact,
                 proficiency levels, etc.
 The Universal System of Validity
• Test Specifications is the most essential
  category: it determines (with Scoring Models)
  • Representation of domain structure
  • Psychometric properties of the test
  • External relationships of test scores
• Preceding Test Specifications are categories
  that involve scientific evidence, knowledge and
  • Domain Structure
  • Item Design Principles
• In turn preceded by
  • Latent Process Studies
  • Logical/Theoretical Analysis
  • Testing Conditions
     General Features of Validity
• Test meaning is determined by internal sources
  of information
• Test significance is determined by external
  sources of information
• Content aspects of the test are central to test
  meaning
   • Test specifications, which include test content and
     test development procedures, have a central role in
     determining test meaning
   • Test specifications also determine the psychometric
     properties of tests, including reliability information
 General Features of the Universal
         Validity System
• Broad system of evidence is relevant to
  support Test Specifications
   • Item Design Principles: relevancy of
     examinees’ responses to the intended domain
   • Domain Structure: regarded as a theory
   • Other preceding evidence
      • Latent Process Studies
      • Logical/theoretical analyses of the domain
      • Testing Conditions
 General Features of the Universal
         Validity System
• Interactions among components
   • Internal evidence → expectations for external
     evidence
   • External evidence informs adequacy of
     evidence from internal sources
     • Potential inadequacies arise when
        • Hypotheses are not confirmed
        • There are unintended consequences of test use

• System of evidence includes both
  theoretical and practical elements
• Relevant to educational and psychological
  tests
 The Universal System of Validity
• Example of Feedback
• Speeded math test to emphasize
  automatic numerical processes
  • External evidence-- strong adverse impact
  • Internal evidence categories to question
    • Item Design
       • Relationship of item speededness to automaticity
    • Domain Structure
       • Heavy emphasis on the automaticity of numerical skills
        Analysis of Categories
• Other categories elaborate their distinctions
   • “Psychometric Properties”
      • Evidence in Lissitz and Samuelsen “Reliability” category
   • “Latent Process Studies” category as related to a
     specific test
   • Scoring Models is a separate category
      • The effect of decisions about dimensionality, guessing,
        elimination of poorly fitting items and so forth on scores
        and their relationships is highlighted
   • Test Specifications category is construed broadly
      • Includes test blueprints, item writer guides, item writer
        credentials, test administration procedures and so forth
  Application to Educational and
 Psychological Tests: Achievement
Current emphasis
   Test specification
       Central to standards-based testing
   Domain structures
       Essential to blueprints
   Scoring models & Psychometric properties
       State-of-art in large scale testing
Underemphasized areas
   Item design principles
       Research basis is emerging
   Latent process studies
       Important in establishing construct-relevancy of student responses
   Logical/Theoretical Analysis
       Important in defining domain structure
   Implications of feedback from studies on
       Other Measures
  Application to Educational and
 Psychological Tests: Achievement
Example: Feedback from external relationships
   Implications of negative evidence
   Speeded math test to emphasize automatic numerical processes
      External evidence-- strong adverse impact for certain groups
      Issues to question
          Item design
               Relationship of item speededness to automaticity
          Domain structure
               Heavy emphasis on the automaticity of numerical skills
Example: Item Design & Latent Process Studies
   Item response format for mathematics items
      Katz, I.R., Bennett, R.E., & Berger, A.E. (2000). Effects of response
        format on difficulty of SAT-Mathematics items: It’s not the strategy.
        Journal of Educational Measurement, 37(1), 39-57.
   Application to Educational and
  Psychological Tests: Personality
Current emphasis
  Logical/Theoretical Analysis
      I.e., personality theories
      Prediction of job performance
  Other Measures
      Factor analytic studies
Underemphasized areas
  Test Specifications
  Domain Structure
  Item Design Principles
  Latent Process Studies
   Application to Educational and
  Psychological Tests: Personality
• Test Specifications & Domain Structure
  • Multifaceted constructs
     • Ignoring domain structure → Lack of convergent validity
         • Unbalanced or uncontrolled item set
              • Emphasizing facet that is best represented if items
                selected for internal consistency
         • Item selection will not be consistent
  • Example: Conscientiousness construct
     • Major subdivisions
        • Dependability, Achievement (Moutafi, Furnham & Crump,
        • Duty (-), Achievement Striving (+) (Moon, 2001)
             • Opposing relationship to commitment
   Application to Educational and
  Psychological Tests: Personality
Test Specifications & Domain Structure
  • Example of structure in personality
    • Facet theory to
       • Define domain membership
       • Define domain structure & observations
          • Roskam, E. & Broers, N. (1996). Constructing
             questionnaires: An application of facet design and
             item response theory to the study of lonesomeness.
             In G. Engelhard & M. Wilson (Eds.). Objective
             Measurement: Theory into Practice Volume 3.
             Norwood, NJ: Ablex Publishing. Pp. 349-385.
Facet Theory Approach to Measure
        of Lonesomeness
   Application to Educational and
  Psychological Tests: Personality
Item Design Principles & Latent Process Studies
   Most measures are self-report format
   Basis of self-report may involve strong
     construct-irrelevant aspects
   Tasks require judgments about relevance of
     statement to own behavior and then reliably
   California Psychological Inventory items
     When in a group of people I usually do what the others want
        rather than make suggestions
     There have been a few times when I have been very mean to
        another person.
     I am a good mixer.
     I am a better talker than listener.
   Application to Educational and
  Psychological Tests: Personality
• Science of self-report is emerging and linked to
  cognitive psychology
     • Stone, A. A., Turkkan, J. S., Bachrach, C.A., Jobe, J. B.,
       Kurtzman, H. S. & Cain, V. S. (2000). The science of self-
       report. Mahwah, NJ: Erlbaum Publishers.
• Studies on how item and test design impacts
  self-report accuracy
  – Self-reports under optimal conditions are biased
     • Daily diaries of dietary self-reports contain insufficient
       calories to sustain life
     • Smith, A. F., Jobe, J. B., & Mingay, D. M. (1991b). Retrieval
       from memory of dietary information. Applied Cognitive
       Psychology, 5, 269-296.
  • Personality inventories are far less optimal for reliable
   Application to Educational and
  Psychological Tests: Personality
Mechanisms in self-report
  Response styles
    Social desirability
  Memory & Context
    When memory information is insufficient, other
     methods are applied
           Information earlier in the questionnaire
           Ambiguity of issue discussed
           Moods evoked by earlier questions
Self-Report Context Effects
   Application to Educational and
  Psychological Tests: Personality
Item Design Principles
     Lievens, F. & Sackett, P. (2007). Situational judgment tests in
       high stakes settings: Issues and strategies with generating
       equivalent forms. Journal of Applied Psychology, 92, 1043-
   Application to Educational and
  Psychological Tests: Personality
• Integration of Item Design Principles &
  Logical/Theoretical Analysis & Latent
  Process Studies
   – Example: Test of Aggression
       • James, L. R., McIntyre, M. D., Glisson, C. A., Green, P. D. (2005). A
         conditional reasoning measure for aggression. Organizational Research
         Methods, 8, 69-80.
• Item design based on hypothesis that responses to ambiguous
  scenarios involve justification mechanisms related to
Sample Item with Hostile Attribution
   Bias for Keyed Response
• History of validity shows changes in the
   • Notion of types still apparent
• Construct validity is appropriate for
  educational tests
   • Content aspect is not sufficient
• Construct validity is a universal system of
  evidence relevant to diverse tests
