Tests of Cognitive Intelligence
Common Characteristics of Individually Administered Tests
• individually administered
• administration requires advanced training
• tests cover wide range of age and ability
• examiner must establish rapport
• immediate scoring of items
• usually requires about one hour
• allows opportunity for observation
Two Main Individually Administered Tests
• Stanford-Binet: Binet wanted to create a process for
identifying intellectually limited children so they could
be removed from the regular classroom and
placed in special education.
• Wechsler scales: developed in response to the perceived
shortcomings of the Stanford-Binet.
Binet’s Principles of Test Construction
• Wanted tasks to measure judgment, attention, and
• Guided by two major concepts: age differentiation
and general mental ability.
• Age differentiation: Binet searched for tasks that
could be completed by two-thirds to three-quarters of the
children in a particular age group, by fewer
younger children, and by more older children.
• General mental ability: measured only the total
product of the various tasks. Judged value of task
in terms of its correlation with the combined result
of all other tasks.
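The age-differentiation rule above can be sketched as a simple check on pass rates. This is only an illustration: the function name `fits_age_level` and the pass-rate figures are hypothetical and are not taken from Binet's data.

```python
def fits_age_level(pass_rates, target_age):
    """Return True if an item suits `target_age` under the age-differentiation rule.

    pass_rates maps an age in years to the proportion of children of that
    age who pass the item (hypothetical figures, for illustration only).
    """
    ages = sorted(pass_rates)
    rates = [pass_rates[age] for age in ages]
    # Fewer younger children and more older children should pass the item.
    rises_with_age = all(lower < higher for lower, higher in zip(rates, rates[1:]))
    # Two-thirds to three-quarters of the target age group should pass.
    in_band = 2 / 3 <= pass_rates[target_age] <= 3 / 4
    return rises_with_age and in_band


example_rates = {6: 0.40, 7: 0.70, 8: 0.90}  # made-up proportions
item_suits_age_7 = fits_age_level(example_rates, 7)
item_suits_age_8 = fits_age_level(example_rates, 8)
```

With these made-up rates the item would be placed at age 7: it is too hard for most 6-year-olds and too easy to discriminate among 8-year-olds.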
Early Binet Scales
1905: 30 items ordered by difficulty. Test lacked:
• adequate measuring units to express results (only used
idiot, imbecile, and moron)
• adequate normative data (only used 50 subjects)
• evidence of validity
1908: Grouped items according to age level rather than
simply increasing difficulty. Introduced concept of mental age.
• Increased norm group to 203.
• Criticized because it produced only one score almost
exclusively related to verbal, language, and reading ability
1916 Stanford Binet Intelligence Scale
• Lewis Terman increased size of standardization
sample, though it included only white, native-Californian children.
• Introduced intelligence quotient concept to show
subjects’ rate of mental development.
– IQ = (MA/CA) × 100, where MA is mental age and CA is chronological age
• However, maximum mental age was 19.5. Had to
set maximum chronological age, too, so set it at
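A minimal sketch of the ratio IQ formula, IQ = (MA/CA) × 100, with the 19.5-year mental-age ceiling mentioned above. The function name `ratio_iq` is hypothetical, and the chronological-age ceiling is left out because its value is not given here.

```python
MAX_MENTAL_AGE = 19.5  # ceiling on mental age stated in the text


def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: mental age divided by chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    capped_mental_age = min(mental_age, MAX_MENTAL_AGE)
    return round(100 * capped_mental_age / chronological_age)


advanced_child = ratio_iq(mental_age=10, chronological_age=8)
average_child = ratio_iq(mental_age=6, chronological_age=6)
delayed_child = ratio_iq(mental_age=8, chronological_age=10)
```

A child whose mental age matches chronological age scores 100; a mental age ahead of or behind chronological age moves the score above or below 100.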
1937 Revision
• Extended age range down to 2 and up to 22 years,
• Scoring standards and instructions were improved
• Several performance items added
• Standardization sample improved to include 3184
subjects from 11 states. Subjects selected
according to their fathers’ occupations. Still,
sample included only whites and mainly those
from urban areas.
• Developed alternate form.
Problems with 1937 Form
• Reliability was higher for older subjects than for
younger ones, and for those in the lower IQ ranges.
• Scores were most unstable for young
children with high IQ.
• Each age group also had a different standard
deviation, which made interpretation difficult.
1960 Revision
• Used Binet’s principles to redo scale.
• Solved problem of differential variation in IQ
by using the deviation IQ concept. Set mean
at 100 with SD of 16. Could now compare
scores of one age level with another.
• No new normative sample but did one in
1972 that included non-whites and 2100
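The deviation IQ idea can be sketched as a standard-score conversion: a raw score is expressed as a distance from the mean of the examinee's own age group, then rescaled to a mean of 100 and SD of 16. The function name and the age-group figures below are hypothetical; the published scale used norm tables rather than this formula directly.

```python
def deviation_iq(raw_score, age_group_mean, age_group_sd, mean=100.0, sd=16.0):
    """Rescale a raw score to a deviation IQ using the examinee's age-group norms."""
    if age_group_sd <= 0:
        raise ValueError("standard deviation must be positive")
    z = (raw_score - age_group_mean) / age_group_sd
    return mean + sd * z


# Two made-up age groups with different raw-score spreads.
# Each examinee is one SD above the mean of their own group.
younger_examinee = deviation_iq(raw_score=45, age_group_mean=40, age_group_sd=5)
older_examinee = deviation_iq(raw_score=60, age_group_mean=50, age_group_sd=10)
```

Both examinees receive the same deviation IQ even though their age groups have different raw-score standard deviations, which is what makes scores comparable across age levels.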
Modern Binet Scale
• Totally revised in 1986 by Thorndike et al.
• Used Thurstone’s multidimensional model (1938):
g made up of crystallized ability (verbal &
quantitative reasoning), fluid-analytic abilities
(abstract-visual reasoning), and short-term memory.
• Used IRT (Rasch model) to determine proper order
of the items
• Used routing test (Vocabulary) as attempt to adapt
testing to specific ability level of each examinee
without computer adaptive testing
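As a rough illustration of how a Rasch model supports item ordering: each item has a single difficulty parameter, and the probability of a correct response depends only on the gap between examinee ability and item difficulty. The difficulty values below are made up, and the calibration step (estimating difficulties from response data) is not shown.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch (one-parameter) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def order_items(difficulties):
    """Return item names sorted from easiest to hardest."""
    return sorted(difficulties, key=difficulties.get)


# Hypothetical item difficulties on the logit scale.
item_difficulties = {"item_a": 1.2, "item_b": -0.5, "item_c": 0.3}
administration_order = order_items(item_difficulties)

# An examinee whose ability equals an item's difficulty has a 50% chance of passing it.
even_odds = rasch_probability(ability=0.3, difficulty=0.3)
```

Ordering items by estimated difficulty is what allows a routing test to suggest an entry point: an examinee can start near the items they have roughly even odds of passing.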
Structure of the SB-IV
• Verbal Reasoning included vocabulary test,
comprehension test, absurdities test, and verbal
relations test.
• Abstract-Visual Reasoning included pattern analysis
test, copying test, matrices test, and paper-folding and
cutting test.
• Quantitative Reasoning included quantitative test,
number series test, equation-building test.
• Short-term Memory included bead memory, memory
for sentences, memory for digits, and memory for
objects.
• Composite included all areas combined.
Psychometric properties of SB-IV
• Standardization sample has 5000+ subjects in 47
states and DC.
• Sample stratified based on 1980 census – geographic
region, community size, ethnic group, age, and
• Internal consistency reliability is .98 for composite and
.93-.97 for area scores. Some individual test scores
are lower: .73 for memory for objects is the lowest.
• Test-retest reliabilities for composite score were .91
and .90 for 5- and 8-year-olds, respectively.
• Factor analysis supports the structure of the test.
• Correlations with other IQ tests are generally in the
.70s and .80s.
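Internal consistency figures like those above are commonly reported as coefficient alpha. The sketch below assumes that statistic; the slides do not say which coefficient the SB-IV manual used, and the item scores are made up.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of examinee rows, each a list of item scores."""
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_columns = list(zip(*scores))
    # Sum of the individual item variances.
    item_variance_sum = sum(variance(column) for column in item_columns)
    # Variance of each examinee's total score.
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


# Four hypothetical examinees answering three items.
made_up_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 5],
]
alpha = cronbach_alpha(made_up_scores)
```

Alpha rises when items vary together, so a composite built from many related items will usually be more reliable than any single subtest, which matches the pattern in the figures above.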
The Wechsler Scales
• David Wechsler worked at NY’s Bellevue Hospital.
He wasn’t happy with the Stanford-Binet, with its
focus on children and its production of a single score.
• In 1939, he created the Wechsler-Bellevue, later
called the WAIS.
• In 1949, he created the children’s version, the WISC.
• In 1967, he added the WPPSI for children ages 2.5-
Structure of the WAIS
• The WAIS yields separate verbal and
performance scores.
• The WAIS-III has four index scores: verbal
comprehension, working memory,
perceptual organization, and processing
speed.
Verbal and Performance Tests on the WAIS
Verbal                         Performance
• Vocabulary                   • Picture completion
• Similarities                 • Digit symbol-coding
• Arithmetic                   • Block design
• Digit span                   • Matrix reasoning
• Information                  • Picture arrangement
• Comprehension                • Symbol search
• Letter-number sequencing     • Object assembly
Scales and Norms for the WAIS
• Determine raw score for each subtest.
• Convert raw scores to standard scores, called scaled
scores (M=10, SD=3)
• There are conversions for 13 age groups. This method of
conversion obscures any differences in performance by age.
• Subtest scaled scores are added, then converted to WAIS-
III composite scores.
• Three composite scores: verbal, performance, full scale,
each with M=100, SD=15
• Four index scores: verbal comprehension, perceptual
organization, working memory, processing speed
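The raw-to-scaled conversion can be approximated as a standard-score transformation within an age group. This is only a sketch: the published test uses age-specific lookup tables rather than a formula, and the function name and norm values below are hypothetical.

```python
def scaled_score(raw_score, age_group_mean, age_group_sd):
    """Approximate a subtest scaled score (M=10, SD=3) from age-group norms."""
    if age_group_sd <= 0:
        raise ValueError("standard deviation must be positive")
    z = (raw_score - age_group_mean) / age_group_sd
    return round(10 + 3 * z)


# The same raw score judged against two made-up age groups.
in_younger_group = scaled_score(raw_score=30, age_group_mean=24, age_group_sd=6)
in_older_group = scaled_score(raw_score=30, age_group_mean=18, age_group_sd=6)
```

Because each examinee is compared only with same-age peers, an identical raw score earns a higher scaled score in a group whose average raw performance is lower. Group means on the scaled metric are all 10, so raw differences between age groups no longer show.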
Standardization of the WAIS
• Standardized on a stratified sample of 2,450
adults representative of the US population
• There were 200 cases per age group, except
for the smaller numbers in the two oldest age groups.
• It is still difficult to know the effects of self-
selection, since participants had to be invited
and had to agree to take part.
Reliability of the WAIS
• Internal consistency and test-retest reliabilities are
about .95 or higher for full scale and verbal scores.
• They’re about .90 for performance and three other
index scores: perceptual organization, working
memory, and processing speed.
• Internal consistency reliability for the subtests
ranges from the upper .70s to the low .90s. Test-retest is
• Generally, performance reliabilities are lower than
verbal reliabilities on the subtests.
Validity of the WAIS
• Great deal of information on criterion-related and
construct validity.
• Factor analyses support use of 4 index scores.
• Comparison studies show the pattern of WAIS-III
scores for many special groups, e.g., Alzheimer’s
disease, Parkinson’s disease, learning disabled, brain
• Is the top test used today
• The WISC-III is the most popular test for assessing
intellectual ability of children ages 6 years, 0
months to 16 years, 11 months.
• Similar in structure to the WAIS, with easier items.
• Both tests yield verbal, performance, and
full scale IQ and 4 index scores
• Most of the subtests are the same
Psychometric Properties of the WISC-III
• Standardization program involved 2,200 cases
selected to represent the US population of children
• Composite scores generally have internal
consistency reliabilities in the mid-.90s and test-
retest reliabilities around .90.
• Subtest reliabilities are generally in the mid-.80s.
• Object Assembly and Mazes are problematic, with
reliabilities in the .60s.
Group Differences in IQ
• Psychological tests are designed to measure
differences among people.
• Test scores that demonstrate differences among
people may suggest that people are not created
with the same basic abilities.
• Biggest problem: Some ethnic groups obtain
lower average scores on some psychological
tests. On average, African Americans score 15
points lower than whites on IQ tests.
• Dispute is not whether differences occur but why
they occur: environment vs. biology.
Problems with Biology Argument
• IQ scores are improving (called the Flynn
effect), more so for African Americans than
for whites.
• Victimization by stereotyping could affect
test performance and grades.
• Construct of race has no biological meaning
based on evidence from studies in
population genetics, the human genome and
Criticisms Related to Content
• Looking at specific items, it was thought that
they might be biased because some children
wouldn’t have the opportunity to learn the
• Members of ethnic groups might answer
some items differently but still correctly
• Scores affected by language skills
inculcated as part of a white, middle-class
upbringing foreign to inner city children
Responses to Content Validity Criticisms
• Test developers are indifferent to the opportunities
people have to learn the information on the tests. The
meaning they assign to the tests comes from
correlations of test scores with other variables.
• Some evidence suggests that the linguistic bias in
standardized tests does not cause the observed
differences (Scheuneman, 1987).
• Elimination of biased items from a test didn’t change
the test scores (Bianchini, 1976).
• Can’t find classes of items most likely to be missed by
minority group members (Wild, et al., 1989)
Other Ways of Thinking About
• Differences in test scores may reflect
patterns of problem-solving that
characterize different subcultures (e.g.,
• R. D. Goldman (1973) proposed the
differential process theory which maintains
that different strategies may lead to effective
solutions for many types of tasks.
Strategies mediate abilities and
• Most standardized tests are evaluated against other
standardized tests. The criterion may be the same test
dressed up differently or measuring test-wiseness on both
• IQ tests may be correlated with achievement tests.
Achievement may be moderated by opportunity to learn.
• Goldman and Hartig (1976) found scores on the WISC to
be unrelated to teacher ratings of classroom performance
for minority children, but significantly related for non-minority
children.
• Majority and minority children grow up in different social
environments. Perhaps test scores accurately reflect the
effects of social and economic inequality.