What is measurement? – attempt to quantify
a hypothetical construct.
Psychometrics – Study of measurement in psychology.
Operationalism is important
Forms of measurement
○ Observational – setting, disguise, how to measure behavior
○ Physiological – brain activity, hormone levels, etc.
○ Behavioral – latencies, choices, etc.
○ Self-report – questionnaires, interviews, limitations
Often constructs can be measured in multiple ways
○ E.g., anger
Often it is best to get multiple measures of
the construct of interest
Triangulation (converging operations) –
observing from several different
viewpoints to understand the construct.
Several types of triangulation
What makes a good measure (test)?
Reliability – consistency or dependability
of the test.
If you use the test multiple times, do you
get the same results?
E.g., scale for weight.
Measurement error affects reliability
Observed score = true score + measurement error
Factors that contribute to measurement error
Transient states of the participant
Stable attributes of the participant
E.g., low motivation, low intelligence
Situational factors
E.g., treatment of participants, lighting, etc.
Characteristics of the measure
E.g., bad questions
Inverse relationship between measurement
error and reliability
Observed variance in a set of scores is due
to two factors, individual differences and measurement error
Total variance = systematic variance + error variance
Reliability = systematic variance / total variance
Reliability coefficient – scores near 1 mean the
test is reliable; the closer to 0, the lower the reliability.
Rule of thumb – good if above 0.7
70% of the total variance in scores is systematic
Use two scores: if they are similar, the test is reliable.
How similar are the scores?
Correlation coefficient – assesses how
similar they are.
Correlation actually measures systematic
variance / total variance
Three methods to estimate reliability
Test-retest reliability – give the test on
multiple occasions. Correlate the scores.
Correlation reflects the degree of reliability
Assumes that participants’ behavior is
stable over time
Some behaviors are not
E.g., personality probably is stable
E.g., hunger is not
For some behaviors, it is impossible to
measure the reliability
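As a sketch, test-retest reliability can be estimated by correlating scores from two administrations of the same test. The scores below are hypothetical; Pearson's r is computed from scratch so the example is self-contained:

```python
# Test-retest reliability sketch (hypothetical scores for 5 participants).
# The Pearson correlation between time-1 and time-2 scores estimates reliability.
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [12, 18, 15, 20, 10]   # hypothetical first administration
time2 = [13, 17, 16, 21, 9]    # same people, second administration
r = pearson_r(time1, time2)
# r above the 0.7 rule of thumb would suggest acceptable reliability
```

An r near 1 indicates that people kept roughly the same rank order across the two occasions, which is the consistency the test-retest method looks for.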
Inter-item reliability – degree of consistency
among the items on the test.
Most measures use many items and sum them.
Methods of calculating
Item-total correlation – correlation between a
particular item and the sum of all other items
on the scale.
Can be used to assess bad questions.
Get rid of all of the bad questions and reliability increases.
Split-half reliability – divide items on the
test into two parts
Even/odd, First half/second half, random halves
If the two halves do not correlate well, there is measurement error.
Cronbach’s alpha – a measure of split-half reliability
(effectively the average of all possible split-half correlations).
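A minimal sketch of Cronbach's alpha, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The 4-item, 5-respondent data set is hypothetical:

```python
# Cronbach's alpha sketch (hypothetical 4-item Likert scale, 5 respondents).
from statistics import pvariance

responses = [  # rows = respondents, columns = items (coded 1-5)
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
]

def cronbach_alpha(rows):
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # transpose: one tuple per item
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_vars / total_var)

alpha = cronbach_alpha(responses)
# alpha above 0.7 would suggest acceptable inter-item reliability
```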
Interrater reliability –
aka interjudge or interobserver reliability.
consistency among researchers who observe or code the same behavior.
They both watch the same thing, do they
see it the same?
Percentage of time they agree
Correlation between the ratings
Generally need around 0.9. (90%)
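Percent agreement, the simplest interrater index mentioned above, can be sketched with hypothetical codings from two observers:

```python
# Interrater reliability sketch: two hypothetical observers code the same
# 10 trials as aggressive (1) or not aggressive (0).
rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = agreements / len(rater_a)   # 9 of 10 trials match: 0.9
# 0.9 meets the ~90% rule of thumb from the notes
```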
Eliminate potential sources of measurement error
Clearly conceptualize the construct
Test construction – clarify instructions and questions
Test administration and conditions –
standardize for all measures
Test scoring and interpretation –
be careful in coding, tabulating, or computing data.
Use pilot studies
Validity – the extent to which the
measurement procedure measures what you
think it is measuring.
Reliability is a necessary but not sufficient
condition for validity.
Test can be reliable but not valid.
Test must be reliable to be valid.
E.g., dart board, phrenology
Different types of validity
Face validity – on the surface, does the test
seem to be measuring what it is supposed to.
If no face validity, many will doubt its relevance
E.g., Target screened job applicants with MMPI and CPI items
“Evil spirits possess me sometimes”
Problems with face validity
1. Not always useful
Can have face validity but not be valid
2. Not necessary
Can be valid without face validity
3. Sometimes disguising purpose is important.
Content validity – a special type of face validity
Extent to which a test measures the
content area it was designed to measure.
Does it assess all of the content and not
just a part of it?
E.g., Psychology GRE, Math test
Construct validity – does the test measure
the construct of interest?
Construct – an entity that cannot be directly
observed but is inferred on the basis of empirical evidence
Most variables in the social sciences
E.g., intelligence, media bias, anxiety
Correlate the score on the test with scores on
other converging tests.
Two parts to construct validity
Convergent validity – measure should correlate
with measures that it should correlate with.
E.g., measures of anxiety
Divergent validity – measure should not
correlate with measures of a different construct
Criterion validity - the degree to which one
measurement agrees with other
approaches for measuring the same construct.
Two parts – depends on time
Concurrent validity – does the test agree
with a preexisting measure?
Predictive validity – the test’s ability to predict
future behavior relevant to the construct
E.g., SAT scores and college GPA
Comparing apples and oranges
How do you compare scores on various
measures with each other?
E.g., SAT and ACT
E.g., WAIS vs. Raven’s
Standardization – placing all scores in the
same unit of measurement.
Force the measurement to have the same scale.
Z scores – the most common standardization
Convert the scores into standard deviation units
Z-scores represent distance away from the
mean in terms of standard deviation units.
Positive z-scores are above the mean.
Negative z-scores are below the mean.
Value represents distance from the mean.
Ex. z = 3: 3 standard deviations above the mean
Ex. z = -1: 1 standard deviation below the mean
Note: mean of all z scores = 0, Standard
deviation = 1
Calculating z-scores
z = (individual score - mean) / standard deviation
Calculate z scores on an IQ test if the mean is
100 and the standard deviation is 15
Brian – 130
Jan – 72
Jim – 100
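The exercise above, worked in code with the mean (100) and standard deviation (15) given in the notes:

```python
# Z-scores for the IQ example: z = (score - mean) / sd
mean_iq, sd_iq = 100, 15

def z_score(x, mean, sd):
    return (x - mean) / sd

brian = z_score(130, mean_iq, sd_iq)   # 2.0: two sd above the mean
jan   = z_score(72,  mean_iq, sd_iq)   # about -1.87: below the mean
jim   = z_score(100, mean_iq, sd_iq)   # 0.0: exactly at the mean
```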
Converting z-scores to raw scores
Jody – z = 3
Jeremiah – z = 0
Zach – z= - 2.5
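Going the other direction with the same IQ mean and standard deviation, a z-score converts back to a raw score by reversing the formula:

```python
# Converting z-scores back to raw IQ scores: raw = mean + z * sd
mean_iq, sd_iq = 100, 15

def raw_score(z, mean, sd):
    return mean + z * sd

jody     = raw_score(3.0,  mean_iq, sd_iq)   # 145.0
jeremiah = raw_score(0.0,  mean_iq, sd_iq)   # 100.0 (the mean itself)
zach     = raw_score(-2.5, mean_iq, sd_iq)   # 62.5
```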
Comparison between tests
Ex. Jerry makes a 1200 on the SAT, Terry
makes a 30 on the ACT. Who scored better?
SAT – mean=1000, sd = 150
ACT – mean = 21, sd = 3
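Standardizing both scores answers the question directly:

```python
# Comparing scores on different tests by converting each to a z-score.
jerry_z = (1200 - 1000) / 150   # SAT: about 1.33 sd above the mean
terry_z = (30 - 21) / 3         # ACT: 3.0 sd above the mean
# Terry's score is farther above its mean, so Terry did better relatively
```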
In 1960, the mean baseball salary was $50,000
with a standard deviation of $10,000. Today, the
mean salary is $2,000,000, with a standard
deviation of $500,000. In 1960, Clete Boyer, the
third baseman for the New York Yankees, made
$30,000. What would he earn at today's rates?
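The salary question is the same standardization exercise: find Boyer's 1960 z-score, then map that z-score onto today's distribution:

```python
# Clete Boyer's 1960 salary expressed in today's distribution.
z_1960 = (30_000 - 50_000) / 10_000     # -2.0: two sd below the 1960 mean
today  = 2_000_000 + z_1960 * 500_000   # two sd below today's mean: $1,000,000
```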
Percentiles and z-scores
If normal distribution is assumed, the percentile
score is known based on the z score.
Z-score – Percentile (assuming normality)
z = -2 – ~2nd percentile
z = -1 – ~16th percentile
z = 0 – 50th percentile
z = +1 – ~84th percentile
z = +2 – ~98th percentile
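A sketch of the z-to-percentile conversion, assuming a normal distribution; the standard normal CDF can be written with `math.erf` from the standard library:

```python
# Percentile from a z-score under a normal distribution.
from math import erf, sqrt

def percentile(z):
    """Percentage of scores at or below this z-score, assuming normality."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

p_mean  = percentile(0)    # 50.0: the mean is the 50th percentile
p_plus1 = percentile(1)    # about 84
p_minus1 = percentile(-1)  # about 16
```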
Scaling and index construction
Difference between the two
Scale – measure of the direction or intensity of a construct.
Index – measure that combines several indicators
of a construct into a single score.
○ Ex. FBI crime index
○ Ex. Consumer price index
Constructing an instrument to assign
numbers to a qualitative concept
Important characteristics of indexing and scaling
Mutual exclusiveness – Individual cases fit
into one category only
Exhaustive – all cases fit into one of the categories
Unidimensionality – items comprising a scale
or index must measure one and only one
dimension or concept at a time
Ex. Long tests
Exams are examples of indices.
Best places to live
America’s best colleges
The difficult aspect of index construction is the
evaluation of the construct
Process is largely theoretical – face validity is often the main justification
Weighting – certain factors are valued more than others
Combining factors does not assume that all are equal
US News and World Report
Peer assessment (weighted by 25 percent).
Retention (20 percent).
Faculty resources (20 percent).
Student selectivity (15 percent).
Financial resources (10 percent).
Graduation rate performance (5 percent).
Alumni giving rate (5 percent).
Missing data – can be very damaging to
reliability and validity
Systematic missing data is problematic
Ways to handle missing data
1. Eliminate that case
2. Substitute with the average of the available data
3. Try to estimate using another source
4. Insert a random value
5. Make “not available” a possible response
6. Analyze reason for missing data
Scaling is the assignment of objects to
numbers according to a rule.
Types of scaling: Likert, Thurstone, Bogardus Social Distance, Semantic Differential, Guttman
Likert scaling
Assigns the construct a value based on a bipolar
response set. E.g., level of agreement, approval, etc.
Also known as summated-rating or additive scales
E.g., Capital punishment should be reinstated.
____Strongly agree _____Agree _____Neutral
_____Disagree ____Strongly disagree
Generally want 5-point Likert scales
Can have more, but no reason to go above 7 – limits of human discrimination
Some prefer even numbers – forces a decision
Multiple items can be combined into an index
E.g., sum ten different questions together
Dummy code the responses
Strongly agree = 5
Agree = 4
Neutral = 3
Disagree = 2
Strongly disagree = 1
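The dummy coding above can be sketched directly; the three answers are hypothetical responses to a 3-item questionnaire:

```python
# Dummy-coding Likert responses and summing them into an index score.
codes = {"strongly agree": 5, "agree": 4, "neutral": 3,
         "disagree": 2, "strongly disagree": 1}

answers = ["agree", "strongly agree", "neutral"]   # hypothetical responses
index_score = sum(codes[a] for a in answers)       # 4 + 5 + 3 = 12
```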
Very important: data from Likert scaling are ordinal, not interval
Implication: statistical analyses that require interval data are not strictly appropriate
No single way to code data.
Strongly agree = 100, agree = 50, Neutral = 25,
disagree = 10, strongly disagree = 1
Questions included in the scale must be related
Items must be more or less addressing the same construct
Check with item-whole correlation
Items need to have the same level of
importance to the respondent
E.g., restaurant opinion
Do not ask the same question repeatedly
Only benefits the statistics
When coding, be careful of the direction of the wording (reverse-scored items)
E.g., 1. I feel like I make a useful contribution
at work - sd, d, n, a, sa
2. At times I feel incompetent at my job -
sd, d, n, a, sa
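In the two items above, item 2 is negatively worded, so its code must be flipped before summing; otherwise agreement on the two items would cancel out. On a 1-5 scale the usual flip is reversed = 6 - code:

```python
# Reverse-coding sketch for the two example items (codes 1-5, sa = 5).
item1 = 5   # "I make a useful contribution at work" -> strongly agree
item2 = 1   # "At times I feel incompetent at my job" -> strongly disagree

item2_reversed = 6 - item2      # 5: now consistent with high job confidence
total = item1 + item2_reversed  # 10: both items point the same direction
```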
Thurstone scaling – also known as the Method of Equal-Appearing Intervals
Generally used to assess attitudes
First, generate a number of questions about the
topic of interest (at least 100).
Ex. Attitude towards RSU
Second, have many judges (around 100) rate
the questions on a scale of 1 to 11.
Is the question favorable towards the concept?
1 = least favorable to concept
11 = most favorable to concept
Important – judges aren’t answering the question, but rating how favorable it is to the concept
Third, analyze the rating data
Calculate the median of the judges’ responses for each item
and the variability
Some questions will be rated very favorable, some very unfavorable
E.g., RSU has small class sizes (11)
E.g., RSU only offers a few bachelor’s degrees (1)
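The analysis in the third step can be sketched for a single candidate item; the judges' 1-11 favorability ratings below are hypothetical:

```python
# Thurstone step 3 sketch: median and spread of judges' ratings for one item.
from statistics import median

ratings = [9, 10, 10, 11, 9, 10, 8, 10, 11, 9]   # hypothetical judge ratings

scale_value = median(ratings)          # the item's scale value (10 here)
spread = max(ratings) - min(ratings)   # simple check of variability
# keep items with low spread; high spread means judges disagreed (ambiguous item)
```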
Fourth, choose questions from all 11 median values
Reduce questionnaire to about 20 items.
Fifth, administer the scale
E.g., RSU has small class sizes: agree or disagree
Can combine it with Likert scale
Allows for analysis of inter-item agreement
among the judges
Allows for the identification of homogeneous items
Very time-consuming and costly
No more reliable than a Likert scale
Bogardus Social Distance Scale
Used to measure social distances
between groups of people
E.g., ethnic closeness
Participants respond to a series of ordered statements
E.g., Attitude towards homosexuals
1. Would marry
2. Would have as regular friends
3. Would work beside in an office
4. Would have several families in my neighborhood
5. Would have merely as speaking acquaintances
6. Would have them live outside my neighborhood
7. Would have them live outside my country
E.g., mental illness
Can measure social distance to other groups
E.g., level of education, geographical location, etc.
E.g., Angermeyer and Matschinger (1997) –
mental illness perception in Germany
○ More social distance toward alcoholics than toward other groups
○ Personal experience with someone with mental
illness reduced social distance
Semantic Differential Scaling
Provides an indirect measure of how a
person feels about a concept, object, or other person
Measures feelings by using adjectives
Humans use language, especially adjectives, to communicate feelings
Adjectives tend to have polar opposites
Uses adjectives to create a rating scale.
Three main uses of adjectives: i.e., three
dimensions of attitudes
Evaluation (good-bad, pleasant-unpleasant, etc.)
Potency (strong-weak, thick-thin, hard-soft)
Activity (active-passive, slow-fast, hot-cold)
Multiple uses for semantic differential
Guttman scaling – also known as cumulative scaling
Evaluation of the data after they are collected
Meant to determine if a relationship exists
within a group of items
Items are arranged such that a person
who agrees with an item will also agree
with less extreme items.
1. Slapping a child’s hand is an appropriate
way to teach the meaning of “No!”
2. Spanking is sometimes necessary
3. Sometimes discipline requires using a belt
4. Some children need a good beating to
keep them in line.
Source: Monette et al. (1994)
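The cumulative property of the four items above can be checked in code: a respondent who agrees with a more extreme item should also agree with every less extreme item before it. The response patterns below are hypothetical:

```python
# Guttman (cumulative) scaling sketch: 1 = agree, 0 = disagree, items
# ordered from least extreme (slapping a hand) to most extreme (beating).
def is_cumulative(responses):
    """True if the agreements form an unbroken run from the start."""
    seen_zero = False
    for r in responses:
        if r == 0:
            seen_zero = True
        elif seen_zero:      # a 1 after a 0 breaks the cumulative pattern
            return False
    return True

ok  = is_cumulative([1, 1, 0, 0])   # agrees with items 1-2 only: fits scale
bad = is_cumulative([1, 0, 1, 0])   # agrees with 3 but not 2: violates scale
```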
Statistical method for determining unidimensionality
Is the test measuring more than one concept?
Analyzes the pattern of responding to the items
E.g., Factor analysis and intelligence
Can allow for evaluation of constructs