MARCES Chang

					  Making Computerized Adaptive
Testing Diagnostic Tools for Schools


               Hua-Hua Chang
     University of Illinois at Urbana-Champaign
                  October 17, 2010
        What is Adaptive Testing?
• Originally called tailored tests (Lord, 1970)
   – Examinees are measured most effectively when items are
     neither too difficult nor too easy.
• θ: latent trait. Heuristically,
   – if the answer is correct, the next item should be more difficult;
   – if the answer is incorrect, the next item should be easier.
• How does an adaptive test work?
   – An item pool with known item properties (such as difficulty level,
     discrimination level, ...)
   – Algorithm, computer, and network
   – The core is the item selection algorithm
   – Mathematicians' help is needed to design the algorithm

                                                                      2
                      Sequential Design &
                  Robbins-Monro Process (1955)
                        Responses:      $x_1, x_2, x_3, \ldots$
                        Design points:  $b_1, b_2, b_3, \ldots$
                        Constants:      $\delta_1, \delta_2, \delta_3, \ldots$

                        $b_{n+1} = b_n + \delta_n x_n$
                        $b_n \to m$ (a point of interest)

Numerous Refinements:

Engineering (Goodwin, Ramadge and Caines, 1981; Kumar, 1985)
Biomedical science (Finney, 1978)
Education (Lord, 1970)

                        $b_{n+1} = b_n + \delta_n (x_n - m)$
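
A minimal simulation sketch of the sequential update, in which the next difficulty equals the current difficulty plus a shrinking step times (response minus m). The logistic (Rasch-type) response model, the step sizes c/n, the target m = 0.5, and every name in the code are illustrative assumptions, not details taken from the slides.

```python
import math
import random


def simulate_difficulty_sequence(theta, n_items=5000, b_start=0.0, m=0.5, c=4.0, seed=0):
    """Simulate b_{n+1} = b_n + step_n * (x_n - m) for one examinee.

    theta   : true latent trait (assumed known only to the simulator)
    m       : target success rate; with m = 0.5 the sequence drifts toward theta
    c       : step-size constant, step_n = c / n (an assumed schedule)
    """
    rng = random.Random(seed)
    b = b_start
    path = [b]
    for n in range(1, n_items + 1):
        # Assumed response model: P(correct) = logistic(theta - b).
        p_correct = 1.0 / (1.0 + math.exp(-(theta - b)))
        x = 1 if rng.random() < p_correct else 0
        step = c / n
        # Correct answer (x = 1) raises the difficulty, incorrect lowers it.
        b = b + step * (x - m)
        path.append(b)
    return path


path = simulate_difficulty_sequence(theta=1.0)
```

Under these assumptions the difficulty sequence settles near the examinee's trait level, which is the heuristic behind "harder after a correct answer, easier after an incorrect one".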
                                                                        3
 The Maximum Information Criterion (MIC)
• Lord’s (1980) MIC method is the most
  commonly used method.
  $\theta_0$: true latent trait
  $\hat{\theta}_n$: MLE after n items were administered
  $I_i(\theta)$: item information function
  $I(\theta) = \sum_{i=1}^{n} I_i(\theta)$: test information

• MIC would select items with high
  discrimination
• There have been many other methods
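
A short sketch of maximum-information selection. The slides do not name the response model, so the two-parameter logistic (2PL) model is assumed here, for which item information is a² P (1 − P). The pool values and function names are hypothetical.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a 2PL item (assumed model) at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_max_information(theta_hat, pool, administered):
    """Return the index of the unused item with the largest information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))


# Hypothetical pool of (discrimination a, difficulty b) pairs.
pool = [(0.5, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.5)]
first_pick = select_max_information(0.0, pool, administered=set())
second_pick = select_max_information(0.0, pool, administered={first_pick})
```

Because information grows with a², the rule keeps reaching for highly discriminating items whose difficulty is close to the current estimate, which is the tendency the slide points out.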
                                            4
Item difficulty vs Item discrimination
[Figure: curves for items with discrimination a = 0.5, a = 1, and a = 2]

                                        5
6
In the 2D case: the item whose information volume is maximal should be selected
[Figure: item information surface above a theta region, enclosing the information volume]

                                                          7
From Theoretical Development to Large Scale
                 Operation
 • Issues to be addressed:
 • Should CAT only use the best items?
    – It is common that only 50% of the items are used
 • Is CAT more secure than paper/pencil test?
    – How to improve CAT test security?
 • How to control non-statistical constraints?
 • How to get diagnostic information?
 • How to make CAT affordable to many schools?

                                                 8
  Two NSF Grants and loads of Papers
• Constraint-weighted design
   – Cheng, Chang, & Yi (2006); Cheng & Chang (2008); Cheng, Chang, Douglas, &
     Guo (2009), etc.

• Establish theoretical foundation
   – Chang & Ying (2009)

• Test Security
   – Chang & Zhang (2001), Zhang & Chang (2010)

• Cognitive diagnostic CAT
   – McGlohen & Chang (2008)

• Multi-dimensional CAT
   – Wang & Chang (in press), Wang & Chang (accepted for publication)

• Large-scale K-12 applications in China
   – Liu, Yu, Wang, Ding & Chang (2010)
                                                                                9
  CAT & Transformative Research
• The National Science Board (2007) unanimously
  approved a motion to enhance support of
  transformative research at the NSF.
  – All proposals received after Jan 5, 2008, will be
    reviewed against the criterion.
     • revolutionizing entire disciplines;
     • creating entirely new fields; or
     • disrupting accepted theories and perspectives
• Much CAT research is transformative!
                                                        10
                 New Developments
• Measuring Patient-Reported Outcomes
  – Conventional measures of disease such as lab results do
    not fully capture information about chronic diseases and
    how treatments affect patients.
  – CAT can be used to assess patients' subjective experiences
    such as symptom severity, social well-being, and perceived
    level of health.
• K-12 Applications
  – Large-scale state testing
  – Teaching/learning, within-school application
  – Diagnostic purposes
  – Web-based learning
                                                             11
           Challenges in NCLB Testing

• Many items are too difficult for students
   – 70% of math items may be too difficult
      • The influence of this kind of test-taking experience on low-
        achieving students is not well understood (e.g., Roderick & Engle,
        2001; Ryan & Ryan, 2005; Ryan, Ryan, Arbuthnot, & Samuels,
        2007).
• Test security of NCLB
      • The number of security violations in P&P-based NCLB testing is on the
        rise.
      • Documented cases of such incidents have been uncovered in
        numerous states including New York, Texas, California, Illinois, and
        Massachusetts (Jacob & Levitt, 2003; Texas Education Agency,
        2007).


                                                                             12
  CAT Has a Glowing Future in the K-12
              Context
• Why not use benchmark testing?
  – Adaptive Testing can do better.
• Quellmalz & Pellegrino (2009):
  – more than 27 states currently have operational or
    pilot versions of online tests, including Oregon,
    North Carolina, Utah, Idaho, Kansas, Wyoming,
    and Maryland.
  – The landscape of educational assessment is
    changing rapidly with the growth of computer-
    administered tests.
                                                    13
Objectives:

1. MAKING CAT A DIAGNOSTIC TOOL
2. DELIVERING THE TOOL TO SCHOOLS

                                 14
How to get diagnostic information?
• Post-hoc approach (non-adaptive)
  – Perform cognitive diagnosis (CD) after students have completed the CAT
• Adaptive approach
  – Select the next item that provides the maximum information
    about the student’s strengths and weaknesses
  – Needs a model and an item selection algorithm
  – Psychometric theory
  – Simulation study
  – Field test
                                                       15
             Cognitive Diagnosis

Provide examinees with more information
than just a single score.

• How? By considering the different attributes measured by
  the test.

• An attribute is a “task, subtask, cognitive process, or skill”
  assessed by the test, such as addition or reading
  comprehension.


                                                                   16
What should be reported to
       examinees?
Traditional Testing:    A single score

Cognitive Diagnosis:    A set of scores, one for each attribute:
                        $[\alpha_1, \alpha_2, \ldots, \alpha_K]$
                        (K is the total # of attributes.)

                                                            17
       Why is this beneficial?
Feedback from an exam can be more
individualized to a student’s specific
strengths and weaknesses.


            Julia R.    score: 75    $\hat{\alpha} = [0000111]$

            Halle B.    score: 75    $\hat{\alpha} = [0101100]$
                                                 18
 The Item-Attribute Relationship

Which items measure which attributes is
represented by the Q-matrix:

                  i1   i2   i3   i4
             A1    0    1    0    1
             A2    1    0    0    1
             A3    1    0    1    0
                                          19
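
One simple way to hold the Q-matrix in code is sketched below, using the 3-attribute, 4-item example from the slide (attributes as rows, items as columns). The dictionary layout and the helper names are illustrative choices, not part of the original material.

```python
# Q-matrix from the slide: one row per attribute, one column per item.
ITEMS = ["i1", "i2", "i3", "i4"]
Q = {
    "A1": [0, 1, 0, 1],
    "A2": [1, 0, 0, 1],
    "A3": [1, 0, 1, 0],
}


def attributes_measured_by(item):
    """List the attributes that the given item requires."""
    col = ITEMS.index(item)
    return [attr for attr, row in Q.items() if row[col] == 1]


def q_vector(item):
    """Return the item's q-vector (its column of the Q-matrix)."""
    col = ITEMS.index(item)
    return [row[col] for row in Q.values()]
```

For instance, `attributes_measured_by("i4")` reads off the fourth column and reports attributes A1 and A2.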
    Cognitive Diagnostic Models
         $P(X_{ij} = 1 \mid \alpha_i)$
         i: person, j: item, $\alpha_i$: attribute vector


• Many models have been proposed
• DINA model
• Fusion model (Stout’s group)



                                        20
  The DINA Model
                                                          i: student
                                                          j: item

Deterministic Input; Noisy "And" Gate
(Macready & Dayton, 1977, 1989; Junker & Sijtsma, 2001)

$P(X_{ij} = 1 \mid \eta_{ij}) = (1 - s_j)^{\eta_{ij}} \, g_j^{(1 - \eta_{ij})}$
where
       $\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}}$

       $s_j = P(X_{ij} = 0 \mid \eta_{ij} = 1)$ -- "slip" parameter
       $g_j = P(X_{ij} = 1 \mid \eta_{ij} = 0)$ -- "guess" parameter
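
The DINA response probability can be sketched in a few lines. The function names are illustrative; the logic follows the definitions of the ideal response (1 only when every required attribute is mastered), the slip parameter, and the guess parameter.

```python
def dina_eta(alpha, q):
    """Ideal response: 1 only if every attribute required by q is mastered in alpha.

    Equivalent to the product over k of alpha_k ** q_k for 0/1 entries.
    """
    return int(all(a == 1 for a, required in zip(alpha, q) if required == 1))


def dina_prob_correct(alpha, q, slip, guess):
    """P(X = 1 | alpha) = (1 - slip) ** eta * guess ** (1 - eta)."""
    eta = dina_eta(alpha, q)
    return (1.0 - slip) ** eta * guess ** (1 - eta)


# A student who has mastered only the first of two attributes.
p_needs_both = dina_prob_correct([1, 0], [1, 1], slip=0.1, guess=0.2)   # must guess
p_needs_first = dina_prob_correct([1, 0], [1, 0], slip=0.1, guess=0.2)  # can only slip
```

An item whose q-vector is all zeros requires nothing, so every student answers it with probability 1 − slip.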
                                                                      21
      How to adaptively select items?
• No direct analogy to “match theta with b-parameter”
   – Regular CAT: match the b-parameter with $\hat{\theta}$
• Now $\alpha$ is a vector, called the latent class

                K: # of attributes
                $\alpha_c$: point in the latent space ($2^K$ points)
                $\hat{\alpha}$: estimated $\alpha$
                $P(X_i = x_i \mid \alpha)$: IRF
                                                        22
• The KL information approach (Xu, Chang, &
  Douglas, 2004)
• Let’s assume

          $H_0: \alpha = \hat{\alpha}$,  $H_1: \alpha = \alpha_c$

• The likelihood ratio test is the most powerful test
• Intuitively, the j-th item selected should make

          $\log \dfrac{P(X_j = x_j \mid \hat{\alpha})}{P(X_j = x_j \mid \alpha_c)}$

  large
                                                   23
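• Taking the expected value, assuming $\hat{\alpha}$ is true:

   $KL_{jc}(\hat{\alpha} \,\|\, \alpha_c) = \sum_{x=0}^{1} \log\left( \dfrac{P(X_j = x \mid \hat{\alpha})}{P(X_j = x \mid \alpha_c)} \right) P(X_j = x \mid \hat{\alpha})$

       Select item j to make the following as large as possible

   $KL_j(\hat{\alpha}) = \sum_{c=1}^{2^K} KL_{jc}(\hat{\alpha} \,\|\, \alpha_c)$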




                                                                        24
                                                        Demo
Consider two attributes and four candidate items
$\alpha_1 = [1,1]$, $\alpha_2 = [0,1]$, $\alpha_3 = [1,0]$, $\alpha_4 = [0,0]$: 4 possible patterns
$\hat{\alpha} = [1,0]$: interim estimate for an examinee

Item bank ($P_c = P(X_j = 1 \mid \alpha_c)$)
Item    Slip    Guess    Q1    Q2    P1     P2     P3     P4
1       0.1     0.2      1     1     0.9    0.2    0.2    0.2
2       0.1     0.2      0     1     0.9    0.9    0.2    0.2
3       0.1     0.2      1     0     0.9    0.2    0.9    0.2
4       0.1     0.2      0     0     0.9    0.9    0.9    0.9

$KL_{jc}(\hat{\alpha} \,\|\, \alpha_c) = \sum_{x=0}^{1} \log\left( \dfrac{P(X_j = x \mid \hat{\alpha})}{P(X_j = x \mid \alpha_c)} \right) P(X_j = x \mid \hat{\alpha})$    (j: item, c: attribute pattern)

$I_j(\hat{\alpha}) = \sum_{c=1}^{2^K} KL_{jc}(\hat{\alpha} \,\|\, \alpha_c)$

Item    KL1      KL2      KL3      KL4      Total
1       1.363    0.000    0.000    0.000    1.363
2       1.363    1.363    0.000    0.000    2.726
3       0.000    1.146    0.000    1.146    2.292
4       0.000    0.000    0.000    0.000    0.000
                                                                        25
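
The demo values can be reproduced with a short script. This is a sketch under the DINA model with natural logarithms (the choice of base is inferred from the tabled values); the function and variable names are illustrative.

```python
import math


def dina_prob_correct(alpha, q, slip, guess):
    """P(X = 1 | alpha) under DINA: 1 - slip if all required attributes are mastered."""
    mastered_all = all(a == 1 for a, required in zip(alpha, q) if required == 1)
    return 1.0 - slip if mastered_all else guess


def kl_between(p_hat, p_c):
    """KL divergence between Bernoulli(p_hat) and Bernoulli(p_c), summed over x = 1, 0."""
    total = 0.0
    for p, r in ((p_hat, p_c), (1.0 - p_hat, 1.0 - p_c)):
        total += p * math.log(p / r)
    return total


def kl_index(alpha_hat, q, slip, guess, patterns):
    """Sum of KL divergences between the interim estimate and every pattern."""
    p_hat = dina_prob_correct(alpha_hat, q, slip, guess)
    return sum(
        kl_between(p_hat, dina_prob_correct(alpha_c, q, slip, guess))
        for alpha_c in patterns
    )


patterns = [(1, 1), (0, 1), (1, 0), (0, 0)]           # alpha_1 .. alpha_4
bank = {1: (1, 1), 2: (0, 1), 3: (1, 0), 4: (0, 0)}   # item -> q-vector
alpha_hat = (1, 0)                                    # interim estimate

totals = {item: kl_index(alpha_hat, q, 0.1, 0.2, patterns) for item, q in bank.items()}
best_item = max(totals, key=totals.get)               # item with the largest total KL
```

With the interim estimate [1, 0], the item measuring only the second attribute has the largest total, so it would be administered next under this rule.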
                                     Demo
Change the slipping/guessing parameters of the items


Item    Slip   Guess   Q1      Q2            P1        P2        P3        P4
1       0.2    0.3     1       1             0.8       0.3       0.3       0.3
2       0.2    0.3     0       1             0.8       0.8       0.3       0.3
3       0.2    0.3     1       0             0.8       0.3       0.8       0.3
4       0.2    0.3     0       0             0.8       0.8       0.8       0.8


               Item    KL1          KL2      KL3       KL4       Total
               1           0.583     0.000     0.000     0.000     0.583
               2           0.583     0.583     0.000     0.000     1.165
               3           0.000     0.534     0.000     0.534     1.068
               4           0.000     0.000     0.000     0.000     0.000

       • The magnitude of the non-zero values depends on the item
         slipping and guessing parameters
                                                                                 26
                                       Demo
Change the interim estimate to $\hat{\alpha} = [0,1]$

Item    Slip   Guess   Q1      Q2            P1        P2        P3        P4
1       0.2    0.3     1       1             0.8       0.3       0.3       0.3
2       0.2    0.3     0       1             0.8       0.8       0.3       0.3
3       0.2    0.3     1       0             0.8       0.3       0.8       0.3
4       0.2    0.3     0       0             0.8       0.8       0.8       0.8


               Item    KL1          KL2      KL3       KL4       Total KL
               1           0.583     0.000     0.000     0.000     0.583
               2           0.000     0.000     0.534     0.534     1.068
               3           0.583     0.000     0.583     0.000     1.165
               4           0.000     0.000     0.000     0.000     0.000

       The positions of the zero KL cells changed for items 2 & 3
                                                                                 27
• To explain the last table on the previous slide:
   – “0” means the item provides no information for distinguishing the interim
     estimate from that attribute pattern.
   – The magnitude of the non-zero values depends on the item slipping and
     guessing parameters.
   – Which cells are zero depends on the q-vector and the examinee’s interim
     estimate. If the q-vector of a particular item contains all zeros
     (e.g., item 4 in this demo), all of its cells will be zero.




                                                                           28
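                           Estimation
   Response data                            Students’ latent class

   $x_{11}, x_{12}, \ldots, x_{1n} \;\Rightarrow\; (\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1K})$
   $x_{21}, x_{22}, \ldots, x_{2n} \;\Rightarrow\; (\alpha_{21}, \alpha_{22}, \ldots, \alpha_{2K})$
   $\vdots$
   $x_{N1}, x_{N2}, \ldots, x_{Nn} \;\Rightarrow\; (\alpha_{N1}, \alpha_{N2}, \ldots, \alpha_{NK})$

   Item parameters: $(s_1, g_1), (s_2, g_2), \ldots, (s_n, g_n)$
                                                                      29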
        New Tests vs. Existing Tests
• Existing Exams
   – Analyze the responses from an existing large-scale
     assessment within a cognitive diagnosis framework.
   – Examine the results across various methods of
     constructing a Q-matrix.
• New Exams
   – Identify attributes and content validity structure
   – Write items according to cognitive specifications
   – Pre-testing
   – Q-matrix validation


                                                             30
 Application 1: existing dataset
– A simple random sample of 2000 examinees who took the
   • Grade 3 TAAS from Spring 2002
   • Grade 11 TEKS from Spring 2003
– The Math & Reading portion of each test was analyzed by
  using the Fusion Model
– Item selection methods
   • Kullback-Leibler (KL)
   • Shannon Entropy (SHE), etc.
– Reference, e.g.,
   • McGlohen & Chang (2008)
   • Download from
     http://www.psych.illinois.edu/people/showprofile.php?id=539
   • Or, google Hua-Hua Chang
                                                                   31
       Texas 3rd grade reading assessment
             6 attributes (Application 1)

   • The student will determine the meaning of words in a variety of written texts.

   • The student will identify supporting ideas in a variety of written texts.

   • The student will summarize a variety of written texts.

   • The student will perceive relationships and recognize outcomes in a variety of
     written texts.

   • The student will analyze information in a variety of written texts in order to
     make inferences and generalizations.

   • The student will recognize points of view, propaganda, and/or statements of
     fact and opinion in a variety of written texts.



                                                                                     32
Why CD-CAT?

HOW TO HELP SCHOOLS TO OWN
AND OPERATE CD-CAT?

                             33
 Building CAT-Driven Assessment and Diagnosis
          to Improve Student Learning
                Chang & Ryan (IES Proposal)
• Develop the technical foundations for a CAT system
  to meet NCLB accountability and to inform teaching
  and learning.
• In alignment with Race to the Top (RTTT) priorities,
  the proposed CAT will include
   – individualized diagnostic information to provide teachers,
     schools, and states with more-precise information about
     student achievement levels along with valuable formative
     feedback to inform instructional planning.


                                                                  34
Address issues reviewers raised

HOW TO ADDRESS ISSUES SUCH AS
SCHOOLS HAVING NO MONEY TO BUY
AND OPERATE CD-CAT?
                                  35
                New Technologies
 --- Schools can use existing PCs or Macs
• Client/Server Architecture (CS)
   – CAT software has to be installed on each client computer (large
     workload)
   – Only applicable to a Local Area Network (LAN)
• Browser/Server Architecture (BS)
   – The database is still on the server
   – Nearly all tasks concerning development, maintenance, and
     upgrades are carried out on the server
   – Based on a Wide Area Network (WAN)
• Advantages of BS
   – Low maintenance, no network programming

                                                                    36
Hardware and Network Design




                Item Bank

                              37
Develop a CD-CAT system to show its applicability to improve teaching
and learning

APPLICATION 2: THE CHINA
PROJECT

                                                                        38
            Application 2:
  Level II English Proficiency Test
• Pretest and Calibration of Item bank
  – Pretest
     • 38,662 students from 78 schools in 12 counties participated
  – Analyzing pretest data
     1. Estimate the parameters of the DINA model
     2. Estimate the parameters of the 3PL model
     3. Calibrate the attributes of the items again
     4. If the model fits well, stop; otherwise revise the Q-matrix and go to 3
  – Assembling the item bank with item parameters and
    specifications.
                                                                     39
          Distribution of the students in pretest




Red: Field Test Sampling Area
Yellow and red: Current Implementation

                                                    40
                      Linking Design
[Figure: linking design grid. Rows: Test1 through Test12 and the Anchor Test;
 columns: anchor-item blocks Group1 through Group4. Callout: e.g., this block
 has 10 anchor items.]

      The locations of the anchor items in each booklet are the same as they
      appear in the anchor test.
                                                                                41
42
                   Item Writing
• About 40 excellent teachers in Beijing
• Process
  1.   Psychometric training
  2.   Identify attributes
  3.   Write items
  4.   Construct the Q-matrix
  5.   Pre-test and check model fit
  6.   If the fit is not acceptable, revise the Q-matrix and go to 5
  7.   Stop
                                                             43
        Item Selection Strategy

– The Shannon Entropy (SHE) procedure was applied to select
  the next items
   • SHE (Tatsuoka, 2002; Xu, Chang, & Douglas, 2004; McGlohen &
     Chang, 2008)

– Dual Information (McGlohen & Chang, 2004, 2008;
  Cheng & Chang, 2007)
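
The slides name the SHE procedure without stating its formula. The sketch below uses the commonly described form, which is an assumption here: for each candidate item, compute the posterior over attribute patterns after each possible response, average the Shannon entropy of those posteriors over the predicted response probabilities, and pick the item with the smallest expected entropy. The names and the example bank (success probabilities for four patterns, with slip 0.1 and guess 0.2) are illustrative.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_posterior_entropy(prior, p_correct):
    """Expected entropy of the pattern posterior after administering one item.

    prior[c]     : current posterior weight of pattern c
    p_correct[c] : P(X = 1 | pattern c) for the candidate item
    """
    expected = 0.0
    for x in (0, 1):
        likelihood = [p if x == 1 else 1.0 - p for p in p_correct]
        marginal = sum(w * l for w, l in zip(prior, likelihood))
        if marginal == 0.0:
            continue  # this response cannot occur under the current prior
        posterior = [w * l / marginal for w, l in zip(prior, likelihood)]
        expected += marginal * shannon_entropy(posterior)
    return expected


def select_min_entropy(prior, bank):
    """Return the item id whose expected posterior entropy is smallest."""
    return min(bank, key=lambda item: expected_posterior_entropy(prior, bank[item]))


# Hypothetical bank: item id -> P(X = 1 | pattern c) for four patterns.
bank = {
    1: [0.9, 0.2, 0.2, 0.2],
    2: [0.9, 0.9, 0.2, 0.2],
    3: [0.9, 0.2, 0.9, 0.2],
    4: [0.9, 0.9, 0.9, 0.9],
}
uniform_prior = [0.25, 0.25, 0.25, 0.25]
chosen = select_min_entropy(uniform_prior, bank)
```

An item that every pattern answers with the same probability (item 4 here) leaves the posterior unchanged, so its expected entropy equals the entropy of the prior and it is never preferred.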




                                                                   44
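• Parameter Estimation
  – The knowledge state of the examinee is estimated
    sequentially.
  – The maximum a posteriori estimation (MAPE)
    method was used in the system.

    $\hat{\alpha}_i = \arg\max_{c = 0, 1, \ldots, 2^K - 1} P(\alpha_c \mid u_i^{(m)})$

    $P(\alpha_c \mid u_i^{(m)}) = \dfrac{g_{0c} \, L(\alpha_c; u_i^{(m)})}{\sum_{c'=0}^{2^K - 1} g_{0c'} \, L(\alpha_{c'}; u_i^{(m)})} = \dfrac{g_{0c} \prod_{j=1}^{m} P_j(\alpha_c)^{u_{ij}} (1 - P_j(\alpha_c))^{1 - u_{ij}}}{\sum_{c'=0}^{2^K - 1} g_{0c'} \prod_{j=1}^{m} P_j(\alpha_{c'})^{u_{ij}} (1 - P_j(\alpha_{c'}))^{1 - u_{ij}}}$

    where $g_{0c}$ is the prior probability of pattern $\alpha_c$ and $u_i^{(m)}$ holds examinee i's responses to the first m items.

  – The ability is estimated at the end of the test.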
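
A minimal sketch of the maximum a posteriori update: multiply each pattern's prior weight by the likelihood of the responses observed so far, normalize, and take the pattern with the largest posterior. The uniform prior, the example responses, and all names are illustrative assumptions; the success probabilities correspond to DINA items with slip 0.1 and guess 0.2.

```python
def map_estimate(responses, p_correct_by_item, prior):
    """Return (index of the MAP pattern, posterior over patterns).

    responses[j]            : 0/1 response to the j-th administered item
    p_correct_by_item[j][c] : P(X_j = 1 | pattern c)
    prior[c]                : prior probability of pattern c
    """
    n_patterns = len(prior)
    weights = []
    for c in range(n_patterns):
        likelihood = 1.0
        for u, p_row in zip(responses, p_correct_by_item):
            p = p_row[c]
            likelihood *= p if u == 1 else 1.0 - p
        weights.append(prior[c] * likelihood)
    total = sum(weights)
    posterior = [w / total for w in weights]
    best = max(range(n_patterns), key=posterior.__getitem__)
    return best, posterior


# Patterns in the order [1,1], [0,1], [1,0], [0,0]; three administered items.
p_rows = [
    [0.9, 0.2, 0.2, 0.2],   # item requiring both attributes
    [0.9, 0.9, 0.2, 0.2],   # item requiring the second attribute
    [0.9, 0.2, 0.9, 0.2],   # item requiring the first attribute
]
best, posterior = map_estimate([0, 0, 1], p_rows, prior=[0.25] * 4)
```

Calling the function again after each new response gives the sequential estimate described on the slide: the posterior is simply recomputed from the longer response vector.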
                                                                                                                      45
       Monte Carlo simulation Studies
• Item selection rule
   – Content constraints (same test structure as the pretest)
        • Listening Dialog (item1-item10): the next item is selected from the remaining
          Listening Dialog items in the item bank.
        • Short Talks (item11-item12): the two items for a piece of speech are selected from the
          short-talk items in the item bank.
        • Grammar and Vocabulary (item17-item32): the next item is selected from the
          remaining Grammar and Vocabulary items in the item bank.
        • Reading Comprehension (item33-item40): the next item is selected from the
          remaining Reading Comprehension items in the item bank.

• Item selection strategy
   – Items were selected according to the Shannon entropy procedure




                                                                                            46
  Classification Accuracy & Evaluation Criteria

• Evaluation criteria
  – Rate of pattern match (RPM)

        $\mathrm{RPM} = \dfrac{\text{number of examinees whose pattern matches}}{M}$

  – Rate of marginal match (RMM)

        $\mathrm{RMM} = \dfrac{\sum_{i,j} h_{ij}}{M K}$

  – Average test information
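
The two match rates can be computed as sketched below. The slide does not define its symbols, so the sketch assumes that M is the number of simulated examinees, K the number of attributes, and that each h term equals 1 when one attribute of one examinee is classified correctly. The example patterns are made up for illustration.

```python
def rate_of_pattern_match(true_patterns, estimated_patterns):
    """Share of examinees whose whole attribute pattern is recovered exactly."""
    matches = sum(
        1 for t, e in zip(true_patterns, estimated_patterns) if tuple(t) == tuple(e)
    )
    return matches / len(true_patterns)


def rate_of_marginal_match(true_patterns, estimated_patterns):
    """Share of individual attribute classifications that are correct."""
    n_examinees = len(true_patterns)
    n_attributes = len(true_patterns[0])
    hits = sum(
        1
        for t, e in zip(true_patterns, estimated_patterns)
        for t_k, e_k in zip(t, e)
        if t_k == e_k
    )
    return hits / (n_examinees * n_attributes)


# Hypothetical simulation output: 4 examinees, 3 attributes.
true_patterns = [(1, 0, 1), (0, 0, 1), (1, 1, 1), (0, 1, 0)]
estimated = [(1, 0, 1), (0, 1, 1), (1, 1, 1), (1, 0, 0)]
rpm = rate_of_pattern_match(true_patterns, estimated)
rmm = rate_of_marginal_match(true_patterns, estimated)
```

The marginal rate is never lower than the pattern rate, because an exact pattern match requires every attribute to be classified correctly.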


                                                        47
                 Field Test
• SHE with content constraints
• The adaptive test was web-based, consisting
  of 36 items and lasting for 40 minutes.


• Number of Participants: 584
  – 5th and 6th grade, from 8 schools in Beijing,
    China

                                                48
                Validity Study
• Evaluating the consistency of
  – CD-CAT system results with an existing English
    achievement test
     • a group of students took two exams
  – CD-CAT system results with Teachers’ evaluation
    outcomes.



                                                      49
       CD scores vs. scores of an
           achievement test
Consistency between academic performance levels and # of mastered attributes

                                     # of mastered attributes

Academic performance level    0    1    2    3    4    5    6    7    8    Total

Excellent                     0    0    1    1    1    3    4    6   23     39

Good                          0    0    1    2    8    5    7    7    3     33

Pass                          1    1    3    5    3    1    0    0    1     15

Fail                          0    1    2    0    0    0    0    0    0      3

Total                         1    2    7    8   12    9   11   13   27     90

                                                                                 50
   CD-CAT Results vs. Teachers'
          Assessment
• Comparison of CD scores with teachers’
  assessment
  – Participants from three classes:
     •   91 sixth-grade students and 3 teachers, from one rural school and
         two urban schools, were recruited to evaluate the diagnostic reports.
  – Measurement
      • Students’ diagnostic reports were presented to the three teachers, who
        were asked to evaluate the accuracy of each report.



                                                                        51
  Validity Study: CD vs. Teachers
Teachers' evaluation of the CD-CAT feedback reports: count (%)

Teacher   High consistency   Medium consistency   Low consistency   Total

   A         28 (90.32)          3 (9.68)            0 (0.00)       31 (100)

   B         13 (41.94)         16 (51.61)           2 (6.45)       31 (100)

   C         27 (93.10)          1 (3.45)            1 (3.45)       29 (100)

 Total       68 (74.73)         20 (21.98)           3 (3.30)       91 (100)

                                                                           52
               Discussions
• Large scale field tests will take place in
  Shanghai and Dalian in the near future.
• CD-CAT can be implemented effectively and
  economically.
• Though the DINA model was used, the results
  can be generalized to many other IRT and
  Cognitive Diagnostic Models!
• A method for online calibration of pretest
  items has been developed. In the future,
  paper/pencil-based pretesting will not be needed.

                                                   53
                 Conclusion
• CAT is revolutionizing the way we
  address challenges in assessment and
  learning.
• In June 2010 the IES proposal was revised and
  resubmitted.
• Any good example of LARGE-SCALE CD-CAT?
  – http://cp.guoshi.com/


                                              54
Thank you!

             55

				