# An Introduction to Multidimensional IRT

Derek Briggs, April 3, 2008. Presentation at UC Berkeley.

## Multidimensionality in Measurement?
## Statistical Reason to Care

- A fundamental assumption of unidimensional IRT models is local independence.
- If the latent construct being measured is actually multidimensional, this assumption may be violated.
- If it is violated, item parameter estimates will be biased, and the standard errors associated with ability estimates will be too small.
## Substantive Reason to Care

- It's important to measure what you intend to measure.
- Test fairness: multidimensionality can cause differential item functioning (DIF).
- Some tests have designs that lead one to expect multidimensionality.
- Some tests are modeled as unidimensional, but results are reported as subscores.
- Multiple dimensions may be useful diagnostically.
## Assess Dimensionality or Model It?

Two approaches:
1. Statistical tests that assess dimensionality: DIMTEST, POLY-DIMTEST (Stout, Froelich, Gao).
2. Models that assume the existence of multidimensionality: the MRCML (as described in Briggs & Wilson, 2001) and the M2PL (as described by Ackerman, Gierl & Walker, 2003).
## Assessing Dimensionality: DIMTEST

Null hypothesis: the test is unidimensional.
1. Split your test into two subtests: one subtest assumed to be dimensionally homogeneous (the partition subtest, "PT"), the other hypothesized to be dimensionally distinct (the assessment subtest, "AT").
2. Get ability estimates from your PT.
3. Compute covariances between items in your AT, conditioning on the ability estimate from the PT. If the test is unidimensional, the conditional covariances for AT items should be approximately zero.
4. DIMTEST generates a T statistic and p-value. If the p-value is low enough, we can reject the null hypothesis: the test has at least two dimensions.
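The conditional-covariance logic behind DIMTEST can be illustrated with simulated data. This is a toy sketch of the idea only, not the bias-corrected DIMTEST statistic; all simulation settings (sample size, item counts, the correlation between dimensions) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
theta1 = rng.normal(size=n)                          # dimension measured by the PT
theta2 = 0.5 * theta1 + 0.87 * rng.normal(size=n)    # correlated second dimension

def sim_items(theta, n_items, rng):
    """Simulate dichotomous Rasch responses with difficulties ~ N(0, 1)."""
    b = rng.normal(size=n_items)
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random((len(theta), n_items)) < p).astype(int)

pt = sim_items(theta1, 20, rng)   # PT items load on dimension 1
at = sim_items(theta2, 5, rng)    # AT items load on dimension 2

# Average off-diagonal covariance of AT items within PT-score groups:
# near zero under unidimensionality, clearly positive otherwise.
pt_score = pt.sum(axis=1)
covs = []
for s in np.unique(pt_score):
    grp = at[pt_score == s]
    if len(grp) > 10:
        c = np.cov(grp, rowvar=False)
        covs.append(c[np.triu_indices(5, k=1)].mean())
mean_cond_cov = np.mean(covs)
print(mean_cond_cov)  # expected to be positive here: the AT taps a distinct dimension
```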
## Example 1: Using DIMTEST with Dichotomous Items

- 8th grade science test form with 29 MC items
- Sample of 964 students
- 5 types of item content:
  - Understanding of Content: Physical Science (PS; 8 items), Life Science (LS; 5 items), Earth Science (ES; 5 items)
  - Process Skills: Scientific Technology (ST; 3 items), Scientific Inquiry (SI; 8 items)
- This provides a confirmatory basis for using DIMTEST. (It would also be possible to use DIMTEST in a purely exploratory way.)
## DIMTEST Output

- AT list: 3, 4, 6, 8, 10, 11, 15, 23, 24, 25, 27
- PT list: 1, 2, 5, 7, 9, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, 26, 28, 29

DIMTEST statistic: T = 0.9105, p-value = 0.1813

Weak evidence that the test is multidimensional on the basis of the AT we used here.
## MIRT Models: MRCML

$$P(X_{ik} = 1;\, \mathbf{A}, \mathbf{B}, \boldsymbol{\xi} \mid \boldsymbol{\theta}) = \frac{\exp(\mathbf{b}_{ik}'\boldsymbol{\theta} + \mathbf{a}_{ik}'\boldsymbol{\xi})}{\sum_{k=1}^{K_i} \exp(\mathbf{b}_{ik}'\boldsymbol{\theta} + \mathbf{a}_{ik}'\boldsymbol{\xi})}$$

where θ is a D-dimensional vector of person abilities, ξ is a vector of item difficulty parameters, b_ik is a score vector giving the scores assigned to response category k of item i on each dimension (the b_ik generate the scoring matrix B), and a_ik is a design vector indicating which elements of ξ apply to category k of item i (the a_ik generate the design matrix A).

Note: This is the statistical model that underlies both ConQuest and GradeMap.
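In code, the MRCML's category probabilities amount to a softmax over category-specific linear combinations of abilities and item parameters. A minimal sketch for a hypothetical item with three score categories loading on a single dimension (all parameter values invented for illustration):

```python
import numpy as np

def mrcml_category_probs(theta_d, xi, b, a):
    """Category probabilities for one item: softmax over b[k]*theta + a[k]*xi."""
    z = b * theta_d + a * xi          # numerator exponents, one per category
    ez = np.exp(z - z.max())          # numerically stabilized softmax
    return ez / ez.sum()

b = np.array([0.0, 1.0, 2.0])         # category scores 0, 1, 2 on the dimension
a = np.array([0.0, -1.0, -2.0])       # design weights on a single difficulty xi
probs = mrcml_category_probs(theta_d=0.5, xi=0.2, b=b, a=a)
print(probs.round(3))                 # three probabilities summing to 1
```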
## Between & Within Item Multidimensionality

[Figure: path diagram of Items 1-4 loading on DIM 1 and DIM 2. Solid lines only = between-item multidimensionality (BIMD); solid + dashed lines = within-item multidimensionality (WIMD).]

Terminology note: instead of BIMD and WIMD, Ackerman, Gierl & Walker use the terms "approximate simple structure" and "complex structure."
## The Simplest MIRT Extension Using the MRCML

Unidimensional Rasch model:

$$\ln\!\left(\frac{P(X_{is}=1)}{P(X_{is}=0)}\right) = \theta_s - \delta_i$$

Multidimensional Rasch model:

$$\ln\!\left(\frac{P(X_{is}=1)}{P(X_{is}=0)}\right) = \theta_{sd} - \delta_i$$

Say we have a test with multiple-choice math items. Some items are strictly computational (e.g., 54 / 8 = ?); other items require students to read a verbal prompt. We might want to analyze this test with a between-item two-dimensional Rasch model, where d = 1 represents the "computational" dimension and d = 2 represents the "verbal problem-solving" dimension.
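The two models above differ only in which ability enters the logit. A minimal sketch, using a hypothetical student who is stronger computationally than verbally (all numbers invented):

```python
import numpy as np

def rasch_prob(theta, delta):
    """P(X=1) under the unidimensional Rasch model."""
    return 1 / (1 + np.exp(-(theta - delta)))

def md_rasch_prob(theta_vec, delta, d):
    """P(X=1) under a between-item MD Rasch model:
    each item uses only the ability on its assigned dimension d."""
    return rasch_prob(theta_vec[d], delta)

# Hypothetical student: strong computationally, weaker verbally.
theta = np.array([1.0, -0.5])   # [computational, verbal problem-solving]

# A computational item (d=0) vs. a verbal item (d=1), both of difficulty 0.
p_comp = md_rasch_prob(theta, 0.0, d=0)
p_verb = md_rasch_prob(theta, 0.0, d=1)
print(round(p_comp, 3), round(p_verb, 3))  # 0.731 0.378
```

A unidimensional model would assign this student a single θ and hence the same probability on both items; the between-item model lets the two item types tap different abilities.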
## Example from Briggs & Wilson (2001)

- Science Education for Public Understanding Program (SEPUP; Roberts, Wilson & Draney, 1997; Wilson & Draney, 1997; http://bearcenter.berkeley.edu/publications/pubs.php)
- Middle school curriculum (8th grade students): "Issues, Evidence and You"
- Embedded assessments
- Student performance assessed across "SEPUP variables" (variable = dimension)
## Applying the MRCML

SEPUP variables:
- Designing and Conducting Investigations (DCI)
- Evidence and Tradeoffs (ET)
- Communicating Scientific Information (CSI)
- Understanding Concepts (UC)

SEPUP data:
- Collected during the 1994-95 school year
- 34 open-ended items, 541 students
- Items scored polytomously from 0 to 4
- Different scoring guides for each item as a function of the SEPUP variable associated with the item

How should student ability be measured?
## Modeling Science Ability: Traditional Approaches

Unidimensional approach:

[Figure: all 34 items X1 ... X34 load on a single latent variable θ_UNI.]
- UNI = sum of raw scores on all 34 test items for student s
- θ_UNI = single estimate of latent science ability

Consecutive approach:

[Figure: items load on four separate, independent latent variables DCI, ET, CSI, UC.]
- DCI = sum of raw scores on the 12 DCI items for student s
- ET = sum of raw scores on the 11 ET items for student s
- CSI = sum of raw scores on the 7 CSI items for student s
- UC = sum of raw scores on the 4 UC items for student s
- θ = four independent estimates of science ability
## Modeling Science Ability Multidimensionally

Multidimensional approach:

[Figure: items load on four correlated latent variables DCI (A), ET (B), CSI (C), UC (D).]
- DCI = sum of raw scores on the 12 DCI items for student s
- ET = sum of raw scores on the 11 ET items for student s
- CSI = sum of raw scores on the 7 CSI items for student s
- UC = sum of raw scores on the 4 UC items for student s
- θ = four correlated estimates of latent science ability
- AB, AC, AD, BC, BD, CD = dimensional correlations
## Comparing Dimensional Correlations

|     | DCI | ET  | CSI | UC  |
|-----|-----|-----|-----|-----|
| DCI | *** | .55 | .45 | .60 |
| ET  | .73 | *** | .66 | .55 |
| CSI | .59 | .83 | *** | .43 |
| UC  | .81 | .79 | .64 | *** |

Multidimensional correlations below the diagonal; consecutive correlations above the diagonal.
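The gap between the two sets of correlations is roughly what classical disattenuation would predict: dividing an observed (consecutive) correlation by the square root of the product of the two subscale reliabilities approximately recovers the multidimensional estimate. A minimal sketch using the consecutive DCI-ET correlation and the consecutive reliabilities reported in the SEPUP example:

```python
import math

def disattenuate(r_obs, rel_x, rel_y):
    """Correct an observed correlation for unreliability:
    r_true ~= r_obs / sqrt(rel_x * rel_y)."""
    return r_obs / math.sqrt(rel_x * rel_y)

# Consecutive DCI-ET correlation (.55); consecutive reliabilities .71 (DCI), .74 (ET).
r_true = disattenuate(0.55, 0.71, 0.74)
print(round(r_true, 2))  # close to the multidimensional estimate of .73
```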
## Comparing Reliability* Estimates

| SEPUP Variable | Consecutive Reliability | MD Reliability |
|----------------|-------------------------|----------------|
| DCI | .71 | .83 |
| ET  | .74 | .90 |
| CSI | .78 | .80 |
| UC  | .69 | .79 |

Unidimensional reliability = .90

This represents an idea called "subscore augmentation" that is currently popular thanks in part to NCLB. An illustration follows.
## Aside: Estimating Reliability in IRT

Reliability is really a CTT concept, based on observed test scores:

$$\rho = \frac{\sigma_t^2}{\sigma_x^2} = 1 - \frac{\sigma_e^2}{\sigma_x^2}$$

In IRT, test information and SEM plots are more useful to look at, but sometimes we want to compute one number that captures "reliability."

"Marginal" reliability:

$$\bar{\rho} = 1 - \frac{\text{average variance of posterior ability distributions}}{\text{variance of prior ability distribution}}$$
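The marginal-reliability formula is a one-liner once you have posterior variances from an IRT scoring run. A minimal sketch with hypothetical posterior standard deviations for five students (all numbers invented):

```python
import numpy as np

# Hypothetical EAP results: posterior SDs of ability for 5 students.
posterior_sd = np.array([0.30, 0.35, 0.28, 0.40, 0.32])
prior_variance = 1.0  # ability scaled to N(0, 1)

# Marginal reliability: 1 - (mean posterior variance) / (prior variance).
marginal_rel = 1 - np.mean(posterior_sd**2) / prior_variance
print(round(marginal_rel, 3))  # 0.889
```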
Raw Scores for Two
Hypothetical Students
JOHN
DCI    ET     CSI    UC
(DI = .54)
Raw Scores      30      24      10      14

CON Estimate   .244    -.865   -.059   -.418

MD Estimate    -.079   -.871   -.009   .014
SHELLY
DCI    ET     CSI    UC
(DI = 1.32)
Raw Scores      30      34      12      22

CON Estimate   .244    .497    1.535   .848

MD Estimate    .487    .971    2.02    1.51    19
## Correlation of Standardized Ability Estimates

[Figure: scatter plot of standardized DCI estimates (DCISTD) against standardized ET estimates (ETSTD); r = 0.73.]

Even when dimensions are strongly correlated, they may convey different information about individual students.
## Discrepant Cases

$$DI_s = \sum_{d=1}^{D} (\theta_{ds} - \bar{\theta}_s)^2$$

where θ̄_s is the mean of student s's standardized dimensional ability estimates.

162 students (30%) with DI > .5

Standardized MD ability estimates (θ_d):

| DI   | DCI  | ET    | CSI   | UC   |
|------|------|-------|-------|------|
| 3.24 | .76  | -.71  | -1.23 | .83  |
| 2.45 | .69  | -.96  | -1.33 | -.11 |
| 1.81 | .17  | -1.01 | -1.06 | .43  |
| 1.49 | .78  | 1.98  | 2.17  | .96  |
| 1.28 | -.15 | .97   | .93   | -.22 |
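The discrepancy index is easy to verify directly from the table. A minimal sketch that recomputes DI for the first student listed:

```python
import numpy as np

def discrepancy_index(theta):
    """DI for one student: sum of squared deviations of the standardized
    dimensional ability estimates from their mean."""
    theta = np.asarray(theta, dtype=float)
    return float(np.sum((theta - theta.mean()) ** 2))

# First student in the table: standardized MD estimates for DCI, ET, CSI, UC.
di = discrepancy_index([0.76, -0.71, -1.23, 0.83])
print(round(di, 2))  # close to the 3.24 in the table (estimates are rounded to 2 decimals)
```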
## Recap

- The preceding example using the MRCML was an application of a between-item multidimensional (BIMD) rating scale model.
- Modeling test data multidimensionally has some advantages:
  - Statistical: greater precision in dimensional ability estimates ("borrowing strength" for subscore augmentation), disattenuated dimensional correlations.
  - Interpretational: student ability represented in terms of related yet distinct dimensions; disaggregation as a first step toward diagnostic assessment.
## MIRT Models: M2PL

From the unidimensional 2PL model:

$$P(X_{is} = 1 \mid \theta_s;\, a_i, b_i) = \frac{\exp[a_i(\theta_s - b_i)]}{1 + \exp[a_i(\theta_s - b_i)]}$$

to the multidimensional 2PL model:

$$P(X_{is} = 1 \mid \boldsymbol{\theta}_s;\, \mathbf{a}_i, d_i) = \frac{\exp(\mathbf{a}_i'\boldsymbol{\theta}_s + d_i)}{1 + \exp(\mathbf{a}_i'\boldsymbol{\theta}_s + d_i)}$$

where θ_s is a vector of person abilities, a_i is a vector of item discrimination parameters (one per dimension), and d_i is a scalar parameter related to the difficulty of the item.

Note: I'm omitting the scaling constant 1.7 from the expressions above, and in what follows.
## Example of M2PL

35-item math test with two dimensions: general math ability (dimension 1) and spatial ability (dimension 2). Solving items 1-6 requires both dimensions 1 and 2; solving items 7-35 requires only dimension 1.

$$P(X_{is} = 1 \mid \boldsymbol{\theta}_s;\, \mathbf{a}_i, d_i) = \frac{\exp(a_{i1}\theta_{s1} + a_{i2}\theta_{s2} + d_i)}{1 + \exp(a_{i1}\theta_{s1} + a_{i2}\theta_{s2} + d_i)}$$

This is an example of:
1. a within-item multidimensional (WIMD) model, and
2. a compensatory multidimensional model.
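The compensatory structure is visible once the model is coded: abilities are summed inside the logit, so a surplus on one dimension can offset a deficit on the other. A minimal sketch with hypothetical item parameters (all numbers invented):

```python
import numpy as np

def m2pl_prob(theta, a, d):
    """P(X=1) under the compensatory M2PL (1.7 scaling constant omitted)."""
    z = np.dot(a, theta) + d
    return 1 / (1 + np.exp(-z))

# Hypothetical item loading on both dimensions:
a = np.array([0.8, 0.6])   # discriminations for dims 1 and 2
d = -0.5                   # scalar intercept (difficulty-related)

# High dim-1 ability compensates for low dim-2 ability:
p_balanced = m2pl_prob(np.array([1.0, 1.0]), a, d)
p_compensated = m2pl_prob(np.array([2.0, -0.33]), a, d)  # roughly the same logit
print(round(p_balanced, 2), round(p_compensated, 2))
```

Both ability profiles yield essentially the same success probability, even though the second student is far weaker on dimension 2.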
## An Aside: Compensatory vs. Noncompensatory MIRT Models

In a compensatory MIRT model, a respondent with a high amount of one dimension can compensate for low amounts of another. This is reflected in the additive nature of the model. By contrast, here's a noncompensatory model for our two-dimensional example:

$$P(X_{is} = 1 \mid \boldsymbol{\theta}_s;\, \mathbf{a}_i, \mathbf{b}_i) = \prod_{l=1}^{2} \frac{\exp[a_{il}(\theta_{sl} - b_{il})]}{1 + \exp[a_{il}(\theta_{sl} - b_{il})]}$$

Key distinction: the probability of a correct response is largely governed by the respondent's lowest ability dimension. This is reflected in the multiplicative nature of the model. Being high on one dimension doesn't compensate for being low on another.
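A side-by-side computation makes the distinction concrete. This sketch evaluates both models for a hypothetical respondent who is very strong on dimension 1 and weak on dimension 2 (all parameter values invented):

```python
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

def compensatory(theta, a, d):
    """Additive M2PL: abilities combine before the logistic."""
    return logistic(np.dot(a, theta) + d)

def noncompensatory(theta, a, b):
    """Multiplicative model: a separate logistic term per dimension."""
    return float(np.prod([logistic(a[l] * (theta[l] - b[l])) for l in range(len(a))]))

a = np.array([1.0, 1.0])
theta = np.array([3.0, -2.0])   # very high on dim 1, low on dim 2

p_comp = compensatory(theta, a, d=0.0)
p_noncomp = noncompensatory(theta, a, b=np.zeros(2))
print(round(p_comp, 2), round(p_noncomp, 2))
# The compensatory probability is high; the noncompensatory one stays low,
# capped by the weak dimension.
```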
## Graphical Presentation of M2PL Output

[Figure: equiprobability contour plot for item 3 in the (Dim 1, Dim 2) plane, with contours at P = 0.1 through P = 0.7.]

- The probability of a correct response is a function of both dimensions.
- Lines get closer together as the response surface becomes steeper.
## Graphical Presentation of M2PL Output

[Figure: vector plot for item 3, overlaid on the equiprobability contours P = 0.1 through P = 0.7 in the (Dim 1, Dim 2) plane.]
## Interpreting Output from the M2PL

Three key characteristics: discrimination, difficulty, and direction.

$$MDISC_i = \sqrt{a_{i1}^2 + a_{i2}^2}$$

The maximum amount of discrimination for an item.

$$D_i = \frac{-d_i}{MDISC_i}$$

The difficulty of an item; represents the location of the item in dimensional space. Positive = harder, negative = easier. The magnitude reflects the distance from the origin necessary for a 50% probability of a correct response.

$$\alpha_{i1} = \arccos\!\left(\frac{a_{i1}}{MDISC_i}\right)$$

The angular direction of an item. If the angle is 45°, the item measures both dimensions equally. If less than 45°, the item measures dimension 1 better than dimension 2, and vice versa.

Note: arccos = cos⁻¹. When you compute arccos, the result you get back is typically in radians, so convert it to degrees.
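The three summaries follow directly from an item's a-vector and intercept. A minimal sketch, using hypothetical parameter values chosen to roughly reproduce item 3's reported statistics (MDISC = .63, D = 1.39, angle = 29°); the a and d values themselves are invented:

```python
import math

def item_summary(a1, a2, d):
    """Reckase's MDISC, difficulty D, and angular direction (in degrees)
    for an item in a two-dimensional M2PL."""
    mdisc = math.sqrt(a1**2 + a2**2)
    difficulty = -d / mdisc
    angle = math.degrees(math.acos(a1 / mdisc))
    return mdisc, difficulty, angle

# Hypothetical parameters approximating item 3:
mdisc, D, angle = item_summary(a1=0.55, a2=0.31, d=-0.875)
print(round(mdisc, 2), round(D, 2), round(angle))  # 0.63 1.39 29
```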
## Graphical Presentation of M2PL Output

[Figure: item vector plot for item 3 (MDISC = .63, D = 1.39, α = 29°), drawn over the equiprobability contours P = 0.1 through P = 0.7.]

The vector points along the composite of the two dimensions that is best measured by the item.
## Vector Plots for 4 Spatial Items

| Item | MDISC | D     | Angle |
|------|-------|-------|-------|
| 1    | 2.11  | -.21  | 18°   |
| 2    | 1.20  | -1.70 | 33°   |
| 3    | 0.63  | 1.39  | 29°   |
| 4    | 0.45  | 0.71  | 21°   |

[Figure: vector plot showing items 1-4 in the (Dim 1, Dim 2) plane.]
## Concept of a "Validity Sector"

[Figure: vector plot in the (Dim 1, Dim 2) plane illustrating a validity sector; composite angle: 21.8°.]
## Practice

Match the items to the vector plots:

| Item | MDISC | D     | Angle |
|------|-------|-------|-------|
| 1    | 0.65  | -1.48 | 14°   |
| 2    | 1.32  | 0.18  | 32°   |
| 3    | 0.63  | 1.39  | 65°   |
| 4    | 1.00  | 1.00  | 0°    |

[Figure: four unlabeled item vectors in the (Dim 1, Dim 2) plane.]
## "Self-Test" Questions

1. What is the most important thing to do before conducting a MIRT analysis?
2. What is the difference between BIMD and WIMD?
3. The most common MIRT model is called compensatory; what does this mean?
4. How does the graphical representation of an item differ from IRT to MIRT?
5. If the same data used for the M2PL were modeled with the MRCML, how would the item vector plots differ?
## Resources: MIRT Software

- ConQuest
- TESTFACT, POLYFACT
- NOHARM
- SAS PROC NLMIXED
- GLLAMM (within Stata)
- BMIRT
## Resources: MIRT Literature

Theoretical:
- Ackerman, T. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91.
- Adams, R. J., Wilson, M., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1-23.
- Bock, R. D., Gibbons, R., & Muraki, E. (1988). Full-information factor analysis. Applied Psychological Measurement, 12, 261-280.
- Muraki, E., & Carlson, J. E. (1995). Full-information factor analysis for polytomous item responses. Applied Psychological Measurement, 19, 73-90.
- Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401-412.
- Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15, 361-373.
- Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589-617.

Applications:
- Ackerman, T. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Applied Measurement in Education, 7, 255-278.
- Kupermintz, H., Ennis, M. M., Hamilton, L. S., Talbert, J. E., & Snow, R. E. (1995). Enhancing the validity and usefulness of large-scale educational assessments: I. NELS:88 mathematics achievement. American Educational Research Journal, 32(3), 525-554.
- Walker, C. M., & Beretvas, S. N. (2003). Comparing multidimensional and unidimensional proficiency classifications: Multidimensional IRT as a diagnostic aid. Journal of Educational Measurement, 40, 255-275.