Validity and reliability
Validity refers to the appropriateness,
meaningfulness, correctness, and
usefulness of the inferences a researcher
makes.
Reliability refers to the consistency of scores
or answers from one administration of an
instrument to another, or from one set of
items to another. A reliable instrument
yields similar results if given to a similar
population at different times.
Researchers make inferences from their
studies. That is, they use the results to
make decisions or judgments about what
they were trying to find out.
Content-related evidence of validity focuses
on the content and format of an
instrument. Is it appropriate?
Comprehensive? Is it logical? How do the
items or questions represent the content?
Is the format appropriate?
Criterion-related evidence of validity refers
to the relationship between the scores
obtained using the instrument and the
scores obtained using one or more other
instruments or measures. For example,
are students' scores on teacher-made
tests consistent with their scores on
standardized tests in the same subject?
Construct-related evidence of validity refers
to the psychological construct or
characteristic being measured. For
example, if one is looking at problem
solving in leaders, how well does a
particular instrument explain the
relationship between being able to solve
problems and effectiveness as a leader?
Elements of content-related evidence
Adequacy of sampling: the size and scope
of the questions must be large enough to
cover the topic.
Format of the instrument: clarity of
printing, type size, adequacy of work area,
appropriateness of language, clarity of
directions.
How to achieve content validity
Have other experts rate the items,
eliminating or changing those that do not
match the specified content. Repeat until
all raters agree on the questions and
answers.
To obtain criterion-related validity,
researchers identify a characteristic,
assess it using one instrument (e.g., an IQ
test), and compare the scores with
performance on an external measure,
such as GPA or an achievement test.
Predictive and concurrent validity
Predictive validity is used to look at validity
over time. Researchers administer one
assessment, allow time to lapse, then
compare the earlier score to another
measure of performance.
Concurrent validity looks at information
gathered from two sources at the same
time.
A correlation coefficient, symbolized by the letter
r, indicates the degree of relationship between
individuals' scores on two instruments. All
correlation coefficients fall between +1.00 and
-1.00. A strong positive correlation indicates that
a high score on one instrument accompanies a
high score on the other, and a low score on one
accompanies a low score on the other. A negative
correlation indicates that a high score on one
instrument is paired with a low score on the other.
A validity coefficient is obtained by
correlating a set of scores on one test (a
predictor) with a set of scores on another
(the criterion). The degree to which the
predictor and the criterion relate is the
validity coefficient. A predictor that has a
strong relationship to a criterion test
would have a high coefficient.
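To make this concrete, a validity coefficient is simply a Pearson correlation between paired predictor and criterion scores. A minimal sketch in Python; the scores below are invented for illustration:

```python
# Sketch: a validity coefficient computed as a Pearson correlation
# between predictor scores and criterion scores (data invented).
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

iq_scores = [95, 100, 105, 110, 120, 125]   # predictor (e.g., an IQ test)
gpas      = [2.4, 2.8, 2.9, 3.2, 3.6, 3.8]  # criterion (e.g., GPA)

r = pearson_r(iq_scores, gpas)
print(round(r, 3))  # close to +1.0: a strong positive relationship
```

A coefficient near +1.00 here would indicate that the predictor test relates strongly to the criterion; a coefficient near 0 would indicate little predictive value.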
An expectancy table is a two-way chart with the
predictor categories listed down the left-hand side
of the chart and the criterion categories listed
horizontally along the top. For example:

             Avg. # of Yrs.
New VPs     '00   '01   '02   '03
Male         12    11     9     8
Female       14    13    11    12
This type of validity is more typically
associated with research studies than with
testing. Because it relates to psychological
traits, multiple sources are used to collect
evidence. Often a combination of
observation, surveys, focus groups, and
other measures is used to identify how
much of the trait being measured the
individuals being observed possess.
Reliability refers to the consistency of scores
obtained from one instrument to another,
or from the same instrument over time.
Errors of measurement
Every test or instrument has some errors of
measurement associated with it. These can
be due to a number of things: testing
conditions, student health or motivation,
test anxiety, etc. Test developers work
hard to ensure that errors are not
grounded in flaws in the test itself.
A reliability coefficient is a number that tells
us how consistent an instrument is likely
to be over time.
Test-retest: the same test is given to the
same group at different times.
Equivalent forms: a different form of the
same instrument is given to the same
group of individuals.
Internal consistency: for example, the
split-half procedure, which correlates the
two halves of a test, or the Kuder-Richardson
approach, which computes reliability from
the number of items, the mean, and the
standard deviation of the test.
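The split-half idea can be sketched in a few lines of Python. Here each person's odd- and even-numbered item totals are correlated, and the Spearman-Brown formula projects that half-test correlation up to full-test length; the item scores are invented for illustration:

```python
# Sketch of a split-half reliability estimate (invented item scores).
# Each row is one person's scores on a 6-item test (1 = correct, 0 = wrong).
from statistics import mean, stdev

scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
]

# Split into odd- and even-numbered items and total each half per person.
odd_totals  = [sum(row[0::2]) for row in scores]
even_totals = [sum(row[1::2]) for row in scores]

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

half_r = pearson_r(odd_totals, even_totals)

# Spearman-Brown correction: estimate full-test reliability from half-test r.
full_r = (2 * half_r) / (1 + half_r)
print(round(full_r, 3))
```

The correction is needed because each half is only half as long as the real test, and shorter tests are less reliable.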
Alpha, or Cronbach's alpha
This test is used on instruments where
answers aren't scored "right" and "wrong".
It is often used to test the reliability of
instruments such as essay tests and
attitude scales.
Standard error of the measurement
This is a calculation that shows the extent to
which a measured score would vary under
changed circumstances. In other words, it
estimates how much measurement error is
attached to an individual score.
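One common formula estimates the standard error of measurement from the test's standard deviation and its reliability coefficient: SEM = SD * sqrt(1 - r). A sketch with invented values:

```python
# Sketch: standard error of measurement from a test's standard
# deviation and its reliability coefficient (values invented).
import math

sd = 15.0           # standard deviation of the test scores
reliability = 0.91  # reliability coefficient (e.g., a test-retest r)

# SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)
print(round(sem, 2))  # prints 4.5
```

On repeated testing, an individual's observed score would fall within about one SEM of his or her "true" score roughly 68% of the time; note how a more reliable test (r closer to 1.00) yields a smaller SEM.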
Many qualitative researchers contend that
validity and reliability are irrelevant to
their work because they study a single
phenomenon and don't seek to generalize.
Fraenkel and Wallen contend that any
instrument or design used to collect data
should be credible and backed by
evidence, consistent with the standards of
quantitative research.
Validity can be used in three ways. It can
refer to instrument or measurement
validity, as we just discussed; external or
generalization validity, as presented in
Chapter 6; or internal validity, which
means that the relationship a researcher
observes between two variables should be
clear in its meaning rather than due to
"something else."
What is "something else"?
Any one (or more) of these conditions:
Age or ability of subjects
Conditions under which the study was
conducted
Type of materials used in the study
Technically, the "something else" is called a
threat to internal validity.
Threats to internal validity
Subject characteristics
Loss of subjects (mortality)
Location
Instrumentation
Testing
History
Attitude of subjects
Regression
Implementation
Subject characteristics can pose a threat if
there is selection bias, or if there are
unintended factors present within or
among the groups selected for a study. For
example, in group studies, members may
differ on the basis of age, gender, ability,
socioeconomic background, etc. These
characteristics must be controlled for in
order to ensure that the key variables in
the study, not these, explain any
differences.
Maturity
Reading ability
Ethnicity
Manual dexterity
Coordination
Socioeconomic status
Speed
Religious/political beliefs
Loss of subjects (mortality)
Loss of subjects limits generalizability, but it can
also affect internal validity if the subjects who
don't respond or participate are overrepresented
in a group. For example, if you were conducting a
study to examine college students' participation
in on-campus extracurricular activities and a
substantial proportion of non-dorming students
didn't respond, your results would be skewed
toward the likely participants.
The place where data collection occurs
(the "location") might pose a threat. For
example, hot, noisy, unpleasant conditions
might affect test scores; situations where
privacy is important for the results, but
where people are streaming in and out of
the room, might also pose a threat.
Instrument decay: if the nature of the
instrument or the scoring procedure is
changed in some way, instrument decay
occurs.
Data collector characteristics: the person
collecting data can affect the outcome.
Data collector bias: the data collector
might hold an opinion that is at odds with
respondents', and this affects the data
that are collected.
In longitudinal studies, data are often
collected through more than one
administration of a test. If an earlier
test influences subsequent ones by getting
the subject to engage in learning or some
other behavior that he or she might not
otherwise have done, there is a testing
threat.
If an unanticipated or unplanned event
occurs during the course of a study or
intervention, there might be a history
threat.
Attitude of subjects
Sometimes the very fact of being studied
influences subjects. The best known
example of this is the Hawthorne Effect.
Regression
In testing this is known as regression
toward the mean: low- and high-performing
groups tend, over time, to score closer to
the average on subsequent tests regardless
of what happens in the meantime.
An implementation threat can be caused by
various things: different data collectors,
teachers, conditions in treatment, method
bias, etc.
Minimizing threats to internal validity
Standardize the conditions of the study
Obtain more information on subjects
Obtain as much information as possible on
the details of the study: location, history,
instrumentation, subject attitude, etc.
Choose an appropriate design
Train data collectors