Measurement
 What is measurement? – attempt to quantify
  a hypothetical construct.
 Psychometrics – study of measurement in
  psychology; in sociology, sociometry
 Operationalism – constructs must be defined in
  terms of the operations used to measure them
 Forms of measurement
   Observational measurement
    ○ Setting, disguise, how to measure behavior

   Physiological measures
    ○ Brain activity, hormone levels, etc.

   Response patterns
    ○ Latencies, choices, etc.
   Self-report measures
    ○ Questionnaires, interviews, limitations

   Archival measures
   Often constructs can be measured in multiple ways
    ○ E.g., anger
 Often it is best to get multiple measures of
  the construct of interest
 Triangulation (converging operations) –
  observing from several different
  viewpoints to understand the construct.
 Several types of triangulation
     Measurement
     Observers
     Theory
     Method
What makes a good measure (test)?

  Reliability – consistency or dependability
   of the test.
  If you use the test multiple times, do you
   get the same results?
      E.g., scale for weight.
  Measurement error affects reliability
  Observed score = true score +
   measurement error
 Factors that contribute to measurement error:
 Transient states of the participant
     E.g., anxiety
   Stable attributes of the participant
     E.g., low motivation, low intelligence
   Situational factors
     E.g., treatment of participants, lighting, etc.
   Characteristics of the measure
     E.g., bad questions
   Mistakes
     E.g., miscounting
   Inverse relationship between measurement
    error and reliability
   Estimating reliability
   Observed variance in a set of scores is due
    to two factors, individual differences and
    measurement error
   Total variance = systematic variance + error variance
   Reliability = systematic variance / total variance
   Reliability coefficient – if scores are near 1,
    test is reliable; the closer to 0, the lower the
    reliability.
   Rule of thumb – good if above 0.7
     70% of the total variance in scores is systematic
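The observed score = true score + measurement error model above can be illustrated with a small simulation. A minimal sketch in Python; the true-score and error distributions are illustrative assumptions, not from the source:

```python
import random

random.seed(1)

# Observed score = true score + measurement error (assumed distributions)
true_scores = [random.gauss(100, 15) for _ in range(5000)]
errors = [random.gauss(0, 5) for _ in range(5000)]
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = systematic (true-score) variance / total variance
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))  # close to 15**2 / (15**2 + 5**2) = 0.90
```

With a small error variance relative to true-score variance, the reliability estimate lands well above the 0.7 rule of thumb.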
Assessing Reliability
   Use two scores: if they are similar, there is
    reliability
   How similar are the scores?
   Correlation coefficient – assesses how
    similar they are.
   Correlation actually measures systematic
    variance / total variance
   Three methods to estimate reliability
     Test-retest reliability
     Inter-item reliability
     Interrater reliability
 Test-retest reliability – give the test on
  multiple occasions. Correlate the scores.
 Correlation reflects the degree of reliability
 Assumes that participants’ behavior is
  stable over time
 Some behaviors are not
     E.g., personality probably is stable
     E.g., hunger is not
    For some behaviors, it is impossible to
     measure test–retest reliability
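A test–retest estimate is just the correlation between the two administrations. A minimal sketch in Python, using hypothetical scores for five participants tested twice:

```python
# Hypothetical scores from the same five participants, tested twice
time1 = [85, 92, 78, 95, 88]
time2 = [83, 94, 80, 96, 85]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson_r(time1, time2)
print(round(r, 2))  # high correlation -> the test is reliable
```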
   Inter-item reliability – degree of consistency
    among the items on the test.
      Most measures use many items and sum them.
 Methods of calculating
 Item-total correlation – correlation between a
  particular item and the sum of all other items
  on the scale.
     Can be used to assess bad questions.
      Get rid of all of the bad questions, and reliability
       improves
   Split-half reliability – divide items on the
    test into two parts
     Even/odd, First half/second half, random halves
     If the two halves do not correlate well, there is
      measurement error.
      Cronbach’s alpha – a measure based on the average
       of all possible split-half reliabilities
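One common way to compute Cronbach's alpha from raw item responses is sketched below. The 4-item response matrix is a made-up example, not from the source:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical 4-item test: rows = respondents, columns = items
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 2, 3, 3],
    [4, 4, 3, 4],
]

k = len(responses[0])
items = list(zip(*responses))                  # one tuple per item
item_vars = sum(variance(it) for it in items)  # sum of item variances
total_var = variance([sum(row) for row in responses])  # variance of totals

# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(round(alpha, 2))  # 0.93 for these made-up data
```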
   Interrater reliability –
     aka interjudge or interobserver reliability.
      consistency among researchers who observe
      the behavior.
   They both watch the same thing, do they
    see it the same?
     E.g., conditioning
 Percentage of time they agree
 Correlation between the ratings
 Generally need around 0.9 (90% agreement)
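Percent agreement is the simplest of the two interrater statistics. A sketch with hypothetical codes from two observers rating the same ten trials:

```python
# Two raters code the same 10 trials (1 = behavior present, 0 = absent);
# the ratings are hypothetical
rater_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
pct_agree = agreements / len(rater_a)
print(pct_agree)  # 0.9 -> meets the ~90% rule of thumb
```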
Increasing Reliability
    Eliminate potential sources of measurement error
    Clearly conceptualize the construct
    Test construction – clarify instructions and questions
    Test administration and conditions –
     standardize for all measures
    Test scoring and interpretation –
      Train observers
      Careful in coding, tabulating, or computing data.
    Use pilot studies
 Validity – the extent to which the
  measurement procedure measures what you
  think it is measuring.
 Reliability is a necessary but not sufficient
  condition for validity.
     Test can be reliable but not valid.
     Test must be reliable to be valid.
     E.g., dart board, phrenology
   Different types of validity
       Face
       Content
       Construct
       Criterion
   Face validity – on the surface, does the test
    seem to be measuring what it is supposed to.
   If no face validity, many will doubt its relevance
  E.g., Target’s use of MMPI and CPI items in
   employee screening
     “Evil spirits possess me sometimes”
   Problems with face validity
   1. Not always useful
     Can have face validity but not be valid
   2. Not necessary
     Can be valid without face validity
   3. Sometimes disguising purpose is important.
 Content validity – special type of face validity
 Extent to which a test measures the
  content area it was designed to measure.
 Does it assess all of the content and not
  just a part of it?
 E.g., Psychology GRE, Math test
 Construct validity – does the test measure
  the construct of interest?
 Construct – entities that cannot be directly
  observed but are inferred on the basis of
  empirical evidence
     Most variables in the social sciences
     E.g., intelligence, media bias, anxiety
 Correlate the score on the test with scores on
  other converging tests.
 Two parts to construct validity
   Convergent validity – measure should correlate
    with measures that it should correlate with.
     E.g., measures of anxiety
   Divergent validity – measure should not
    correlate with measures of a different construct
 Criterion validity – the degree to which one
  measurement agrees with other approaches for
  measuring the same construct
 Two parts – depends on time
 Concurrent validity – does the test agree
  with a preexisting measure?
 Predictive validity – test’s ability to predict
  future behavior relevant to the construct
 E.g., SAT scores and college GPA
Comparing apples and oranges

    How do you compare scores on various
     measures with each other?
      E.g., SAT and ACT
      E.g., WAIS vs. Raven’s
    Standardization – placing all scores in the
     same unit of measurement.
      Force the measurement to have the same scale.
  Z-scores – the most common standardization
  Convert the scores into standard deviation units
 Z-scores represent distance away from the
  mean in terms of standard deviation units.
 Positive z-scores are above the mean.
 Negative z-scores are below the mean.
 Value represents distance from the mean.
 Ex. z = 3 – 3 standard deviations above
  the mean.
 Ex. z = -1 – 1 standard deviation below the
  mean.
 Note: mean of all z scores = 0, Standard
  deviation = 1
Calculating Z - scores

   Individual score minus the mean,
    divided by the standard deviation
   Calculate z scores on an IQ test if the mean is
    100 and the standard deviation is 15
   Brian – 130
   Jan – 72
   Jim – 100
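Applying the formula (score minus mean, divided by the standard deviation) to the three students above:

```python
def z_score(x, mean, sd):
    """z = (individual score - mean) / standard deviation."""
    return (x - mean) / sd

mean, sd = 100, 15  # IQ test parameters from the example
for name, score in [("Brian", 130), ("Jan", 72), ("Jim", 100)]:
    print(name, round(z_score(score, mean, sd), 2))
# Brian 2.0, Jan -1.87, Jim 0.0
```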
   Converting z-scores to raw scores
   Jody – z = 3
   Jeremiah – z = 0
   Zach – z= - 2.5
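Converting back simply reverses the formula: raw = mean + z × sd. Worked for the three z-scores above:

```python
def raw_score(z, mean, sd):
    """Invert the z-score formula: raw = mean + z * sd."""
    return mean + z * sd

mean, sd = 100, 15  # same IQ test parameters
for name, z in [("Jody", 3), ("Jeremiah", 0), ("Zach", -2.5)]:
    print(name, raw_score(z, mean, sd))
# Jody 145, Jeremiah 100, Zach 62.5
```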
   Comparison between tests
     Compare z-scores
   Ex. Jerry makes a 1200 on the SAT, Terry
    makes a 30 on the ACT. Who scored better?
     SAT – mean=1000, sd = 150
     ACT – mean = 21, sd = 3
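Standardizing both scores answers the question directly:

```python
# SAT: mean 1000, sd 150; ACT: mean 21, sd 3 (from the example)
jerry_z = (1200 - 1000) / 150
terry_z = (30 - 21) / 3
print(round(jerry_z, 2), terry_z)  # 1.33 3.0 -> Terry scored better
```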
   In 1960, the mean baseball salary was $50,000
    with a standard deviation of $10,000. Today, the
    mean salary is $2,000,000, with a standard
    deviation of $500,000. In 1960, Clete Boyer, the
    third baseman for the New York Yankees, made
     $30,000. What would he earn at today's salary
     scale?
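The same logic answers the salary question: find Boyer's 1960 z-score, then map that z-score onto today's distribution:

```python
# Boyer's 1960 salary as a z-score (mean $50,000, sd $10,000)
z = (30_000 - 50_000) / 10_000        # -2.0: two sds below the 1960 mean
# Map the same z onto today's scale (mean $2,000,000, sd $500,000)
today = 2_000_000 + z * 500_000
print(today)  # 1000000.0
```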
Percentiles and z-scores
   If normal distribution is assumed, the percentile
    score is known based on the z score.
Z-score   Percentile
  -3         0.1
  -2         2
  -1        16
   0        50
   1        84
   2        98
   3        99.9
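If a normal distribution is assumed, the percentile can be computed directly from the z-score with the standard normal cumulative distribution function (Python standard library only):

```python
from math import erf, sqrt

def percentile(z):
    """Percentile rank for a z-score under a normal distribution."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

for z in (-3, -2, -1, 0, 1, 2, 3):
    print(z, round(percentile(z), 1))
```

The exact values (0.1, 2.3, 15.9, 50.0, 84.1, 97.7, 99.9) match the rounded table above.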
Scaling and index construction
   Difference between the two
      Scale – measure the direction or intensity of a
       construct
     Index – measure that combines several indicators
      of a construct into a single score.
      ○ Ex. FBI crime index
      ○ Ex. Consumer price index

   Constructing an instrument to assign
    numbers to a qualitative concept
 Important characteristics of indexing and scaling
 Mutual exclusiveness – Individual cases fit
  into one category only
 Exhaustive – all cases fit into one of the
  categories formed.
 Unidimensionality – items comprising a scale
  or index must measure one and only one
  dimension or concept at a time
     Ex. Long tests
Index construction
   Exams are examples of indices.
   Best places to live
   America’s best colleges
   The difficult aspect of index construction is the
    evaluation of the construct
   Process is largely theoretical – face validity is
    often used.
   Weighting – certain factors are valued more
    than others
      Combination of factors does not assume that all are
       equally important
    US News and World Report weights:
      Peer assessment (25 percent)
      Retention (20 percent)
      Faculty resources (20 percent)
      Student selectivity (15 percent)
      Financial resources (10 percent)
      Graduation rate performance (5 percent)
      Alumni giving rate (5 percent)
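Combining the weighted components into a single index score can be sketched as follows; the component scores for the hypothetical school are made up for illustration, only the weights come from the list above:

```python
# US News-style weights (from the list above)
weights = {
    "peer_assessment": 0.25, "retention": 0.20, "faculty_resources": 0.20,
    "student_selectivity": 0.15, "financial_resources": 0.10,
    "graduation_performance": 0.05, "alumni_giving": 0.05,
}

# Hypothetical 0-100 component scores for one school (illustrative only)
scores = {
    "peer_assessment": 80, "retention": 90, "faculty_resources": 70,
    "student_selectivity": 85, "financial_resources": 60,
    "graduation_performance": 75, "alumni_giving": 50,
}

# Weighted sum collapses the indicators into a single index score
index = sum(weights[k] * scores[k] for k in weights)
print(index)  # 77.0
```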
   Missing data – can be very damaging to
    reliability and validity
   Systematic missing data is problematic
   Ways to handle missing data
   1. Eliminate that case
    2. Substitute with the average of available cases
   3. Try to estimate using another source
   4. Insert a random value
    5. Make “not available” a possible response
   6. Analyze reason for missing data

  Scaling is the assignment of objects to
   numbers according to a rule.
  Types of scaling
      Likert
      Thurstone
      Bogardus Social Distance
      Semantic Differential
      Guttman
Likert Scaling
     Assigns the construct a value based on a bipolar
      response set, e.g., level of agreement or approval
       Also known as summated-rating or additive scales
    E.g., Capital punishment should be reinstated.
     ____Strongly agree _____Agree
      _____Neutral _____Disagree ____Strongly disagree
    Generally want 5 point Likert scales
      Can have more – but little reason to go above 7 –
       limits in respondents’ ability to discriminate
      Collapsing data
    Some prefer even numbers – forced decision
   Multiple items can be combined into an index
     E.g., sum ten different questions together
   Dummy code the responses
       Strongly agree = 5
       Agree = 4
       Neutral = 3
       Disagree = 2
       Strongly disagree = 1
   Very important: data from Likert scaling is
    ordinal data
      Implication: statistical analyses that assume
       interval data are, strictly speaking, inappropriate
     No single way to code data.
     Strongly agree = 100, agree = 50, Neutral = 25,
        disagree = 10, strongly disagree = 1
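The dummy coding above can be applied mechanically; the response list here is hypothetical:

```python
# Dummy codes from the list above
codes = {"strongly agree": 5, "agree": 4, "neutral": 3,
         "disagree": 2, "strongly disagree": 1}

# Hypothetical responses to four Likert items
responses = ["agree", "strongly agree", "neutral", "agree"]

# Summated-rating score: add the coded responses
scale_score = sum(codes[r] for r in responses)
print(scale_score)  # 4 + 5 + 3 + 4 = 16
```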
 Summing items
 Questions included in the scale must be
  chosen carefully
 Requirements:
 Items must address more or less the
  same concept
     Check with item-whole correlation
   Items need to have the same level of
    importance to the respondent
     E.g., restaurant opinion
   Do not ask the same question repeatedly
     Only benefits the statistics
   When coding, be careful of the direction of
    the wording
     E.g., 1. I feel like I make a useful contribution
      at work – sd, d, n, a, sa
     2. At times I feel incompetent at my job –
      sd, d, n, a, sa
   Reverse coding
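Reverse coding on a 1–5 scale maps a code c to 6 − c, so a negatively worded item points the same direction as the rest before summing. A sketch using the two job-satisfaction items above (the specific response codes are hypothetical):

```python
# Reverse coding for negatively worded items on a 1-5 scale:
# new code = (scale max + 1) - old code, i.e., 6 - old
def reverse(code, scale_max=5):
    return (scale_max + 1) - code

item1 = 4        # "useful contribution" item: agree (hypothetical)
item2_raw = 2    # "feel incompetent" item: disagree (hypothetical)

# Reverse item 2 so both items point the same direction, then sum
total = item1 + reverse(item2_raw)
print(total)  # 4 + 4 = 8
```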
Thurstone scaling
   Also known as the Method of Equal-Appearing
    Intervals
   Generally used to assess attitudes
   First, generate a number of questions about the
    topic of interest (at least 100).
     Ex. Attitude towards RSU
   Second, have many judges (around 100) rate
    the questions on a scale of 1 to 11.
       Is the question favorable towards the concept
       1 = least favorable to concept
       11 = most favorable to concept
        Important – judges aren’t answering the questions,
         but evaluating them.
   Third, analyze the rating data
     Calculate median judges responses for each item
      and the variability
      Some questions will be rated very favorable, some
       very unfavorable
     E.g., RSU has small class sizes (11)
     E.g., RSU only offers a few bachelor’s degrees (1)
   Fourth, choose questions from all 11 median levels
     Reduce questionnaire to about 20 items.
   Fifth, administer the scale
      E.g., RSU has small class sizes: agree or disagree
     Can combine it with Likert scale
   Thurstone:
     Allows for analysis of inter-item agreement
      among the judges
      Allows for the identification of homogeneous items
     Very time-consuming and costly
     No more reliable than a Likert scale
Bogardus Social Distance Scale

    Used to measure social distances
     between groups of people
      E.g., ethnic closeness
      Religious closeness
      Prejudice
     Participants respond to a series of ordered
      statements, from least to most threatening
    E.g., Attitude towards homosexuals
1.   Would marry
2.   Would have as regular friends
3.   Would work beside in an office
4.   Would have several families in my
      neighborhood
5.   Would have merely as speaking
      acquaintances
6.   Would have them live outside my neighborhood
7.   Would have them live outside my country
    E.g., mental illness
    Can measure social distance to other groups
      E.g., level of education, geographical location, etc.
     E.g., Angermeyer and Matschinger (1997) –
      mental illness perception in Germany
       ○ Alcoholics elicited more social distance than
         people with other disorders
      ○ Personal experience with someone with mental
        illness reduced social distance
Semantic Differential Scaling

  Provides an indirect measure of how a
   person feels about a concept, object, or other
   stimulus
  Measures feelings by using adjectives
      Humans use language to communicate feelings by
       using adjectives
    Adjectives tend to have polar opposites
      Hot/cold
      Tall/short
    Uses adjectives to create a rating scale.
   Three main uses of adjectives: i.e., three
    dimensions of attitudes
      Evaluation (good-bad, pleasant-unpleasant, etc.)
     Potency (strong-weak, thick-thin, hard-soft)
     Activity (active-passive, slow-fast, hot-cold)
   Multiple uses for semantic differential
Guttman Scaling

  Also known as cumulative scaling
  Evaluates data after they are collected
  Meant to determine if a relationship exists
   within a group of items
  Items are arranged such that a person
   who agrees with an item will also agree
   with less extreme items.
   Example:
   1. Slapping a child’s hand is an appropriate
    way to teach the meaning of “No!”
   2. Spanking is sometimes necessary
   3. Sometimes discipline requires using a belt
    or paddle
   4. Some children need a good beating to
    keep them in line.
   Source: Monette et al. (1994)
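A response pattern fits a Guttman (cumulative) scale when agreeing with an extreme item implies agreement with all less extreme items, i.e., when items are ordered least to most extreme, the 1s all precede the 0s. A minimal check in Python (the example patterns are hypothetical):

```python
# A pattern fits a Guttman scale if, once a respondent rejects an
# item, they also reject every more extreme item after it.
def is_cumulative(pattern):
    """pattern: list of 1/0 answers ordered least to most extreme."""
    return sorted(pattern, reverse=True) == pattern

print(is_cumulative([1, 1, 0, 0]))  # True  - fits the scale
print(is_cumulative([1, 0, 1, 0]))  # False - a scale error
```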
Factor Analysis

    Statistical method for determining the
     dimensionality of a measure
      Is the test measuring more than one concept?
  Analyzes the pattern of responding to
   each item.
  E.g., Factor analysis and intelligence
  Tests unidimensionality
  Can allow for evaluation of constructs
