Basic Principles of Item Response Theory

Description

Introduces IRT applied to assessments and discusses different approaches to implementing an IRT based assessment system.

Shared by: Mohan Kannegal
Posted: 7/26/2009
Language: English
Basic Principles of IRT and Application to Practical Testing & Assessment

By Dr. V. Natarajan

About the Author

Dr. V. Natarajan is Professor Emeritus at MeritTrac Services. An engineer with a D.Litt in Educational Evaluation and Administration, he guides the Test Development and Research teams at MeritTrac. The author of 68 books and over 60 international papers on assessment, he is a pioneer in the area of education assessments in India. He is a visiting faculty member at ETS, Princeton, USA, and has been a member of the Association of Indian Universities for over three decades. You can write to Dr. V. Natarajan at drvnatarajan@gmail.com.

Copyright © Dr. V. Natarajan 2009. All Rights Reserved.

This book is self-published by the author, has been released only in an electronic format and is accessible to MeritTracers and participants of the 2008 IAEA conference. The author has allowed readers to freely distribute the book provided no modifications are made to the book in the distribution and the author is acknowledged in all distributions. Please write to drvnatarajan@gmail.com for more information regarding distribution of this book.

IRT Principles & Applications Copyright 2009 Dr. V. Natarajan

Acknowledgement

It gives me immense pleasure to acknowledge all those who have been instrumental in my gaining knowledge and skills in IRT. I am greatly indebted to my teacher Prof. Fred Lord of ETS, father of IRT, who taught me the basics and nuances of IRT and its applications to practical testing through the course “Applications of IRT to Practical Testing” at ETS. Full credit is due to Dr. D.H. Lawley for initiating the concept of the item characteristic curve as early as 1943. His work inspired researchers such as Rasch, Birnbaum and Fred Lord, who developed the single, two and three parameter models respectively.
I have used most of what I learnt, and was inspired by the work of Benjamin Wright of Chicago and of Frank Baker, Professor Emeritus, who pioneered the first e-book made available free of cost along with his BIRT software. I used this software in our R&D work and have included it in the appendix so that no one misses the chance to access it and learn from it. I gratefully acknowledge Dr. Lawrence Rudner, Vice President of GMAC and a consortium partner of MeritTrac, for making available a unique tutorial on computer adaptive testing that presents adaptive testing in its full perspective. I also thank my friends at NFER in the U.K., Dr. Skurnik and Dr. Nuttall, from whose book I learnt all about the Rasch model. I am deeply indebted to all of them and to my own students, who influenced me; without them this e-book could not have been made available to all who are interested.

Dr. V. Natarajan
Prof. Emeritus
MeritTrac Services (P) Ltd.

Contents

Introduction to Modern Testing
  Classical Test Theory to Item Response Theory
  Contributions in the Area of IRT
  IRT over CTT
  Basic Concepts of IRT
  Plotting Ability versus Probability
  Examples
  Exercises
Item Characteristic Curve Models
  Rasch’s Single Parameter Model
  Example for Rasch Model
  Birnbaum’s Two Parameter Model
  Example for Birnbaum’s Model
  Fred Lord’s Three Parameter Model
  Examples
  Interpretation of Item Parameters
Item Information Function
  Item Information Function of Single Parameter Model
  Item Information Function of Two Parameter Model
  Item Information Function of Three Parameter Model
  Examples
  Test Characteristic Curve (Test Response Function)
  Examples
  Test Information Function
  Interpreting the Test Information Function
  Test Information Function of Single Parameter Model
  Test Information Function of Two Parameter Model
  Test Information Function of Three Parameter Model
Estimating Parameters
  Procedure for Estimating Parameters
  Examples
  Group Invariance of Item Parameters
  Note
  Examples
  Estimating a Test Taker’s Ability
  Ability Estimation Parameters
  Item Invariance of a Test Taker’s Ability Estimate
Test Calibration
  Test Calibration Process
  The Metric Problem
  Summary of the Test Calibration Process
  The Likelihood Function
  The Maximum Likelihood Estimate of Ability
IRT Test & Item Analysis Using Software
  Examples
Application of IRT to Item Banking
Application of IRT to Adaptive or Tailored Testing
  Examples
Future of Item Response Theory in India
Appendix

Introduction to Modern Testing

Classical Test Theory to Item Response Theory

Classical Test Theory, popularly known as CTT, grew out of practices developed mostly during the 1920s. It comprises component theories such as the Theory of Validity, Theory of Reliability, Theory of Objectivity, Theory of Test Analysis and Theory of Item Analysis. Most of these practices were initially confined to psychological tests and were later extended to educational testing.

However, a new test theory has been developing over the past fifty years that is conceptually more powerful than CTT. Based upon items rather than test scores, the new approach is known as Item Response Theory (IRT). While the basic concepts of IRT were, and are, straightforward, the underlying mathematics is somewhat advanced compared to that of CTT. Obtaining usable information requires a large number of calculations, which made some of these concepts difficult to examine without computers. The advancement of computer technology has therefore accelerated the development of IRT.

CTT is best suited for traditional testing situations, either in group or individual settings, in which all the members of a target population are administered the same or parallel sets of test items, for instance, test takers seeking admission to a college or recruitment to a job. These item sets can be presented to the test taker in either a paper-and-pencil or a computer format. Regardless of the format, it is important for the measurement of individual ability that the items in each item set have “difficulties” that match the range of ability or proficiency in the population. In addition, precise estimation of individual ability requires the administration of a “large enough” number of items whose difficulty levels narrowly match the individual’s level of ability or proficiency.
For heterogeneous populations, these requirements of the “fixed length” test result in inefficient and wasteful testing situations that are frustrating to the test taker and neither very valid nor very reliable from the test administrator’s and analyst’s point of view.

Models for mental tests that addressed these problems with CTT and exploited the emergence of computing technology began to appear as early as the 1950s. In fact, a powerful feature of these newer testing models was the ability to choose test items appropriate to the test taker’s level of proficiency during the testing session, i.e. to tailor the test to the individual in real time. Today, the more popular and well developed of these models make up the family of mathematical characterizations of a test taker’s test responses known as IRT. Although difficult to implement in practice, IRT is the formulation of choice for modern testing.

Despite its popularity, CTT has a number of shortcomings that limit its usefulness as a foundation for modern testing. The emerging role of computing technology in mental testing highlights some of these limitations.

Contributions in the Area of IRT

Over the past century, many people have contributed to the development of IRT. Three of them deserve special mention and recognition. D.H. Lawley of the University of Edinburgh published a paper in 1943 showing that many of the constructs of CTT could be expressed in terms of the parameters of the item characteristic curve. This paper marks the beginning of IRT as a measurement theory. The work of Dr. F. M. Lord of the Educational Testing Service has been the driving force behind both the development of the theory and its application for the past 50 years. Dr. Lord systematically defined, expanded and explored the theory and developed the computer programs needed to put it into practice. In the late 1960s, Dr. B.D. Wright of the University of Chicago recognized the importance of the measurement work of the Danish mathematician Georg Rasch. Since that time, he has played a key role in bringing IRT, the Rasch model in particular, to the attention of practitioners. Without the work of these three individuals, IRT would not be at its present level of development.

Frank Baker came out with a book on IRT in 1985. IRT was an upstart whose popular acceptance was delayed partly because the underlying statistical calculations were quite complex. Baker’s contribution was a well written introductory textbook on IRT accompanied by software for the then state-of-the-art Apple II and IBM personal computers. This program freed readers from the tedious statistical calculations.

At about the same time, Dr. Natarajan (India, 1984) produced a text on Sample Free Item Analysis that addressed all three models, those of Rasch, Birnbaum and Fred Lord. Fred Lord taught Dr. Natarajan in a 10-day international course, Applications of Item Response Theory to Practical Testing, held at ETS in Princeton, where Dr. Natarajan came to appreciate all the nuances of Lord’s three parameter logistic model. Around the same period, Dr. Natarajan studied the three models in depth and submitted a thesis, An Application of Item Response Theory to Aid Discrimination Function in Achievement Testing, for which he was awarded a D.Litt by Pune University in India.

Much has changed since 1985. IRT now powers the work of major US test publishers and is used as the basis for developing the National Assessment of Educational Progress, as well as numerous state and local tests. In the UK, the National Foundation for Educational Research brought out a publication titled The Objective Interpretation of Test Performance, which dealt with the Rasch model and its applications.
Ever since, IRT has gained acceptance in the UK, and many testing organizations there are adopting IRT, particularly the Rasch model, for item calibration and item analysis. Given its widespread acceptance, test constructors and administrators need only a basic understanding of the IRT model. Today, more than 525 organizations in business, education and training use IRT models in their work. In India, several leading institutions use IRT analysis to prepare merit lists for admission tests, among them the admission tests to MIBE, AIIMS, CMC and REC.

IRT over CTT

IRT is gaining acceptance in psychological and educational testing because it provides more adaptable and effective methods of test construction, analysis and scoring than those derived from CTT. The power of IRT over CTT lies in the relationships it provides between item parameters and the ability of the test taker. A test built under CTT would have to be extremely long, 200 items or more, to provide accuracy comparable to that achieved under IRT. IRT needs a comparatively small number of items and hence a reasonably small item bank. It is also possible to reduce a parent test of 25 to 30 items to a third or a fourth of that number. This is adaptive testing, which retains the accuracy of the parent test and thus reduces the cost of producing a high quality operational test. Adaptive testing can take the form of a computerized version or of two stage testing using paper and pencil instruments.

Equally important for long term testing and assessment programs is the ability to retire and replace items in an operational test without altering the interpretation of the test scale.
Because IRT scale scores are functions of estimated item parameters, the scoring absorbs possible differences in characteristics (difficulty, discriminating power, etc.) between the retired items and their replacements. This eliminates the need, present in CTT, either to find new items with the same difficulty and discriminating power as the old ones or to run an equating study of the revised test separate from its operational use.

Another property unique to IRT is the location of items and test takers on the same scale. The response models on which IRT is based enable the analyst to state the probability that a test taker at a particular score level will answer a given item correctly. This permits “content referencing” of the scale scores: typical items that test takers can answer correctly with an assigned probability (for instance, 50% or 80%) illustrate the meaning of various points on the scale in terms of task content.

Under CTT, the test taker’s raw test score is the sum of the scores received on the items in the test. Under IRT, the primary interest is in whether a test taker got each individual item correct, rather than in the raw test score. This is because the basic concepts of IRT rest upon the individual items of a test rather than upon some aggregate of the item responses such as a test score.

Basic Concepts of IRT

The basic concepts of IRT include Ability, Difficulty and True Score. These are described below in individual sections.

Ability

In academic areas, one can use descriptive terms such as reading ability and arithmetic ability. Each of these is what psychometricians refer to as an unobservable, or latent, trait. Although such a variable is easily described, and knowledgeable test creators can list its attributes, it cannot be measured directly as height or weight can, since the variable is a concept rather than a physical dimension.
A primary goal of educational and psychological measurement is to determine how much of such a latent trait a test taker possesses at a given point of time. Since most of the research has dealt with variables such as scholastic, reading, mathematical and arithmetic abilities, the generic term “Ability” is used within IRT to refer to such latent traits. To measure how much of this latent trait a test taker has, it is necessary to have a scale of measurement.

IRT is also known as a probabilistic theory since it deals with the probability of each possible response to a test item, derived as a function of ability and item parameters. In CTT, the number-right score on a multiple choice test is used to indicate the ability of a test taker. In IRT, by contrast, ability is inferred from the probabilities of correct response to the individual items rather than from a simple count of correct answers.

To measure the ability of a test taker, a test consisting of a number of multiple choice items can be developed under IRT. Each of these items measures some facet of the particular ability in question. The person scoring the test must then decide whether the response given to each item is correct or not. When the item response is determined to be correct, the test taker receives a score of one; an incorrect answer receives a score of zero, i.e. the item is dichotomously scored. Items scored dichotomously are often referred to as binary items.

Difficulty

A test based on IRT consists of items that have been calibrated for their parameters. Different items in a test will therefore have different parameter values. The most common parameter for an item is its difficulty (item difficulty).
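As a brief aside, the dichotomous scoring described under Ability can be sketched in a few lines of Python. This is only an illustration: the answer key, the responses and the function name `score_dichotomously` are hypothetical and do not come from the book.

```python
from typing import List, Sequence


def score_dichotomously(responses: Sequence[str], key: Sequence[str]) -> List[int]:
    """Score each item 1 if the response matches the key, otherwise 0."""
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    return [1 if given == correct else 0 for given, correct in zip(responses, key)]


# Hypothetical five-item multiple choice test (illustrative values only).
answer_key = ["C", "A", "D", "B", "C"]
test_taker_responses = ["C", "A", "B", "B", "A"]

# Binary item scores X_i, one per item.
item_scores = score_dichotomously(test_taker_responses, answer_key)

# The CTT number-right score is simply the sum of the binary item scores.
number_right = sum(item_scores)

print(item_scores)
print(number_right)
```

The vector of item scores, rather than the single number-right total, is the kind of data that IRT works with.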
The probability of a correct response from a test taker with extremely low ability (-∞, or in practice -4 or -3) is 0 or almost 0. The probability of a correct response from a test taker with extremely high ability (+∞, or in practice +4 or +3) is almost 1, tending towards 1 without reaching it. The ability corresponding to a probability of 0.5 is defined as the item difficulty. Thus, the difficulty of an item and the ability of the test taker are on the same scale. This is the unique characteristic of IRT models, which provide a true relationship between test items and the true scores of test takers. For the Rasch model, the item difficulty is invariably the middle point of the item characteristic curve, the point of inflection at which the curve changes its direction of bending.

True Score

In CTT, the true score (if it exists) is the mean of the number-right scores a test taker would obtain over many administrations of the same or equivalent tests. Such a true score is clearly impractical or impossible to obtain. The next best thing in CTT is to estimate a Standard Error of Measurement (SEM) and specify the limits around any number-right score. For example, if a test taker’s number-right score is 72 and the SEM of the test is 7, the true score can be expected to lie between 65 (72 - 7) and 79 (72 + 7) with a probability of about two-thirds. This uncertainty undermines arithmetical value judgments based on number-right scores.

IRT enables the estimation of the True Score (TS) from a test taker’s ability with a low percentage of error (usually of the order of 0.1 percent). The probability of a correct answer indicates the mean score the test taker would obtain on the item if he took it a great number of times.
Let it be .75 (for instance, if his successive scores are 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, the average works out to .75).

The true score for an estimated ability θ of a test taker is the sum of the probabilities of a correct response over all the items (with their different item difficulty values), that is,

TS(θ) = Σ_i P_i(θ)    (D.H. Lawley)

where the sum runs over all items in the test. This presupposes that the abilities of the various test takers, taking a test of items with various difficulties, have been estimated in a given situation, so that the true score can be computed for any estimated ability value.

True scores for test takers, once determined, are invariant and “item-free”. Similarly, item parameters such as item difficulty, item discrimination and guessing are invariant for a given item and are “test taker free”. The graph titled “ICC of a Single Item” depicts the ICC of a single item.

Graph: ICC of A Single Item

If an item with an invariant item difficulty is administered to Group 2, of very high ability, nearly all responses will be correct. Since the probabilities of correct response are then 0.5 and above, approaching 1, the points plotted belong to the upper end of the item characteristic curve, as shown in the graph titled “ICC Of A Single Item (Group 2 – High Ability)”. Curve fitting software can trace the entire curve from these points and arrive at the item parameters, as shown in the graph titled “ICC Of A Single Item (Group 2 – High Ability with Curve Fitting)”.

Graph: ICC Of A Single Item (Group 2 – High Ability)

Graph: ICC Of A Single Item (Group 2 – High Ability With Curve Fitting)

On the other hand, if the same item is administered to Group 1, of low ability, the probability of correct response will be much lower than 0.5 (that is, most of the test takers might get the item wrong) and the points plotted belong to the lower end of the item characteristic curve, as in the graph titled “ICC Of A Single Item (Group 1 – Low Ability)”. Again, curve fitting software can trace the entire curve and arrive at the parameters, and these turn out to have the same values as before. This is shown in the graph titled “ICC Of A Single Item (Group 1 – Low Ability with Curve Fitting)”.

Graph: ICC Of A Single Item (Group 1 – Low Ability)

Graph: ICC Of A Single Item (Group 1 – Low Ability With Curve Fitting)

Plotting Ability versus Probability

The ability score is denoted by the Greek letter theta, θ. At each ability level, there is a certain probability that a test taker with that ability will give a correct answer to the item. This probability is denoted by P(θ).

A response to a binary test item “i” is indicated by the item score X_i, which takes one of two values:

X_i = 1 if the test taker answers the item correctly
X_i = 0 if the test taker answers the item incorrectly

By convention, the probability of a correct response to item “i” by a test taker of ability θ is written

P(X_i = 1 | θ) = P_i(θ)

and the probability of an incorrect answer is

P(X_i = 0 | θ) = 1 - P_i(θ)

For instance, consider the following item:

Item: “What is the area of a circle with radius 3 cm?”

The answer options are 9 cm², 18.85 cm² and 28.27 cm². The first option is very naïve, the second is wrong but implies more advanced knowledge (knowledge of circumference), and the third is the correct one. Let us assume that this item has been calibrated and that its psychometric properties are as shown in the graph titled “The Partial Credit Model”.
In this example, there is one correct answer and two wrong answers.

Graph: The Partial Credit Model

The black curve is for the 1st response, the blue curve for the 2nd response and the red curve for the 3rd response. The person’s ability, θ, is plotted along the horizontal axis, and the probability of each response, P(θ), along the vertical axis. The three probabilities sum to 1 at each value of θ. For the 1st response, which is totally irrelevant, the probability is very high at low ability and drops as ability increases and the person becomes more knowledgeable. The probability of the 2nd response rises with ability up to a certain point and then drops. The probability of the 3rd, correct response is small at low ability levels but rises as ability increases. At any ability level, however, a person still has a non-zero probability of selecting any of the responses.

This can be simplified further by lumping together the two wrong options, leaving one wrong and one correct response (a dichotomous item), as shown in the graph titled “Dichotomous Item”.

Graph: Dichotomous Item

The black curve is for the wrong option and the red one for the correct option. As ability increases, the probability of the correct option increases and the probability of the wrong option decreases. At any value of θ, the two probabilities sum to 1, so the probability of the wrong option is 1 - P(θ).

Item Characteristic Curve (ICC)

For a typical test item, the probability of the correct option, P(θ), is small for test takers of low ability and large for test takers of high ability. If P(θ) is plotted as a function of ability, the result is a smooth S-shaped curve such as the one shown in the graph titled “A Typical Item Characteristic Curve”.

Graph: A Typical Item Characteristic Curve

The probability of correct response is near zero at the lowest levels of ability and increases until, at the highest levels of ability, it approaches 1. This S-shaped curve describes the relationship between the probability of correct response to an item and the ability scale. In IRT, it is known as the item characteristic curve (ICC), also called the Item Response Function (IRF). Each item in a test has its own ICC, since every item in a test has a different difficulty value. Note that the item difficulty is not calculated from the group, as is the case in CTT. The ICC is the basic building block of IRT; all the other constructs of the theory depend upon this curve.

Two technical properties are used to describe an ICC. The first is the difficulty of the item. In IRT, the difficulty of an item describes where the item functions along the ability scale. For example, an easy item functions among the low-ability test takers and a hard item functions among the high-ability test takers; thus, difficulty is a location index. The second technical property is discrimination, which describes how well an item can differentiate between examinees having abilities below the item location and those having abilities above it. This property essentially reflects the steepness of the ICC in its middle section. The steeper the curve, the better the item can discriminate. The flatter the curve, the less the item is able to discriminate, since the probability of correct response at low ability levels is nearly the same as it is at high ability levels.

Using these two descriptors, one can describe the general form of the ICC. These descriptors are also used to discuss the technical properties of an item. But these two properties say nothing about whether the item really measures some facet of the underlying ability or not; that is a question of validity.
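The roles of difficulty and discrimination can be illustrated with a short sketch. It assumes a logistic form for the ICC with a difficulty parameter `b` and a discrimination parameter `a`; this functional form, the function names and all numeric values are assumptions made for illustration and are not given in the text above.

```python
import math


def icc(theta: float, b: float, a: float = 1.0) -> float:
    """Probability of a correct response at ability theta.

    Assumed logistic form (an illustration, not the book's definition):
    b is the item difficulty (location), a the discrimination (steepness).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def slope_at(theta: float, b: float, a: float, h: float = 1e-5) -> float:
    """Numerical slope of the ICC at theta (central difference)."""
    return (icc(theta + h, b, a) - icc(theta - h, b, a)) / (2.0 * h)


# Hypothetical items: same discrimination, different difficulty.
easy_item = {"b": -1.0, "a": 1.0}
hard_item = {"b": 1.0, "a": 1.0}

# Hypothetical items: same difficulty, different discrimination.
flat_item = {"b": 0.0, "a": 0.5}
steep_item = {"b": 0.0, "a": 2.0}

# Tabulate the four curves over the ability range used in the figures.
for theta in (-3, -2, -1, 0, 1, 2, 3):
    print(
        f"theta={theta:+d}  "
        f"easy={icc(theta, **easy_item):.2f}  "
        f"hard={icc(theta, **hard_item):.2f}  "
        f"flat={icc(theta, **flat_item):.2f}  "
        f"steep={icc(theta, **steep_item):.2f}"
    )
```

Under this assumed form the curve passes through a probability of 0.5 exactly where θ equals `b`, which is why difficulty acts as a location index, and the slope at that point is `a`/4, so a larger `a` gives a steeper middle section.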
These two properties simply describe the form of the ICC. As mentioned elsewhere, the validity of a test item depends on whether the item is constructed to reflect, and provide evidence of, achievement of an objective or outcome of learning the content. This process is known as building in validity at the micro level. An item is valid if it measures exactly what it is meant to measure.

In the graph titled “Three Item Characteristic Curves - Same Discrimination/Different Difficulty”, three ICCs are presented on the same graph. All have the same level of discrimination but differ with respect to difficulty. The left-hand curve represents an easy item because the probability of correct response is high for low-ability test takers and approaches 1 for high-ability test takers. The center curve represents an item of medium difficulty because the probability of correct response is low at the lowest ability levels, around .5 in the middle of the ability scale and near 1 at the highest ability levels. The right-hand curve represents a hard item: the probability of correct response is low for most of the ability scale and increases only when the higher ability levels are reached. Even at the highest ability level shown (+3), the probability of correct response is only .8 for the most difficult item.

Graph: Three Item Characteristic Curves - Same Discrimination/Different Difficulty

The concept of discrimination is illustrated in the graph titled “Three Item Characteristic Curves - Same Difficulty/Different Discrimination”. The figure contains three item characteristic curves having the same difficulty level but differing with respect to discrimination. The upper curve has a high level of discrimination since it is quite steep in the middle, where the probability of correct response changes very rapidly as ability increases.
Just a short distance to the left of the middle of the curve, the probability of correct response is much less than 0.5, and a short distance to the right the probability is much greater than 0.5. The middle curve represents an item with a moderate level of discrimination. Its slope is much smaller than that of the previous curve, and the probability of correct response changes less dramatically as the ability level increases. However, the probability of correct response is still near zero for the lowest-ability examinees and near 1 for the highest-ability examinees. The third curve represents an item with low discrimination. The curve has a very small slope, and the probability of correct response changes slowly over the full range of abilities shown. Even at low ability levels, the probability of correct response is reasonably large, and it increases only slightly when high ability levels are reached.

Although the figures only show a range of ability from -3 to +3, the theoretical range of ability is from negative infinity to positive infinity. However, Dr. Natarajan (1984) recommended -4 to +4 as the limits, since these will include 99.9% of observations. All the item characteristic curves of the type used here become asymptotic to a probability of zero at one tail and to 1.0 at the other tail, as shown below.

Graph: Three Item Characteristic Curves - Same Difficulty/Different Discrimination

The item characteristic curve of an item with perfect discrimination is illustrated in the graph below titled “An Item That Discriminates Perfectly at θ=1.5”. The item characteristic curve of such an item is a vertical line at some point along the ability scale. To the left of the vertical line at θ=1.5, the probability of correct response is zero; to the right of the line, the probability of correct response is 1.
Thus, the item discriminates perfectly between examinees whose abilities are above and below an ability score of 1.5. Such items would be ideal for distinguishing between examinees with abilities just above and just below 1.5. However, such an item makes no distinction among the examinees with abilities above 1.5, nor among those with abilities below 1.5, as shown in the graph below.

Graph: An Item That Discriminates Perfectly at θ=1.5

Examples

BIRT is a software application developed by Frank Baker to help learners familiarize themselves with item characteristic curves. BIRT is available as a free download at the URL given below.

http://echo.edres.org:8080/irt/baker/software.htm

The screen shots below, taken from BIRT, illustrate the relation of the ICC to difficulty and discrimination. In these examples difficulty has five levels: Very easy, Easy, Medium, Hard and Very hard. Discrimination also has five levels: None, Low, Moderate, High and Perfect.

The first example shows the item characteristic curve of an item with easy difficulty and high discrimination. As seen here, when item discrimination is greater than moderate, the curve is S-shaped and rather steep in the middle.

Screen Shot: Difficulty level is Easy and Discrimination is High

The screen shot titled “Difficulty level is Hard and Discrimination is Low” shows the item characteristic curve of an item with hard difficulty and low discrimination. As seen here, when item difficulty is greater than medium, most of the curve has a probability of correct response that is less than 0.5.

Screen Shot: Difficulty level is Hard and Discrimination is Low

The screen shot titled “Difficulty level is Medium and Discrimination is Low” shows the item characteristic curve of an item with medium difficulty and low discrimination. As seen here, when item discrimination is less than moderate, the curve is nearly linear and appears rather flat.

Screen Shot: Difficulty level is Medium and Discrimination is Low

The screen shot titled “Difficulty level is Medium and Discrimination is Moderate” shows the item characteristic curve of an item with medium difficulty and moderate discrimination.

Screen Shot: Difficulty level is Medium and Discrimination is Moderate

The screen shot titled “Difficulty level is Very Easy and Discrimination is Moderate” shows the item characteristic curve of an item with very easy difficulty and moderate discrimination. As seen here, when item difficulty is less than medium, most of the curve has a probability of correct response greater than 0.5.

Screen Shot: Difficulty level is Very Easy and Discrimination is Moderate

The screen shot titled “Difficulty level is Very Easy and No Discrimination” shows the item characteristic curve of an item with no discrimination. As seen here, no discrimination with any choice of difficulty level yields a horizontal line at a value of P(θ)=0.5. This is because the item difficulty of an item with no discrimination is undefined.

Screen Shot: Difficulty level is Very Easy and No Discrimination

Thus, regardless of the item discrimination, the difficulty of an item locates the item along the ability scale. Item difficulty and discrimination are therefore independent of each other.

Exercises

Using the BIRT software, find and plot the ICC for the following combinations of item difficulty and item discrimination:

Very easy difficulty, with discrimination levels: Low, High, Perfect
Easy difficulty, with discrimination levels: None, Low, Moderate, Perfect
Medium difficulty, with discrimination levels: None, High, Perfect
Hard difficulty, with discrimination levels: None, Moderate, High, Perfect
Very hard difficulty, with discrimination levels: None, Low, Moderate, High, Perfect

Item Characteristic Curve Models

The central feature of IRT is the set of three models for the item characteristic curve. They are:

Rasch (Single Parameter) Model
Birnbaum (Two Parameter) Model
Fred Lord (Three Parameter) Model

Each model provides a mathematical equation that gives the probability of a correct response to a test item as a function of ability. The models employ one or more parameters whose numerical values define a particular item characteristic curve. They provide a vehicle for communicating information about an item’s technical properties.

Model | Parameters
Single Parameter - Rasch Model | Item difficulty ‘b’
Two Parameter - Birnbaum Model | Item difficulty ‘b’ and item discrimination ‘a’
Three Parameter - Fred Lord Model | Item difficulty ‘b’, item discrimination ‘a’ and item guessing ‘c’

Every item has invariant item parameters. All three models are integrated into a single graph, shown below and titled “ICC’s of All Three IRT Models”.

Graph: ICC’s of All Three IRT Models
[Vertical axis: Pi(θ), the probability of getting the answer right on any item i with ability θ, with c, 0.5 and +1.0 marked. Horizontal axis: ability θ from -∞ to +∞, with -4, -3, +3, +4 and the item difficulty ‘b’ marked. The tangent to the curve makes an angle α, labelled tan α = a.]

Rasch’s Single Parameter Model

The equation for the Rasch Single Parameter Model is

$$P_i(\theta) = \frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}$$

where
e = 2.718 (approximately), the base of the natural logarithm
a = scale constant determining the units of θ (for Rasch it is 1)
b = location parameter related to the difficulty of item i (also referred to as the item threshold)

Multiplying the numerator and the denominator by $e^{-a(\theta - b)}$ and setting a = 1 gives the modified equation for the Rasch model:

$$P_i(\theta) = \frac{1}{1 + e^{-(\theta - b)}}$$

Items with larger values of $b_i$ are more difficult; those with smaller values are comparatively easier. See the graph below titled “ICC of Rasch’s Single Parameter Model”.

Graph: ICC of Rasch’s Single Parameter Model
[Vertical axis: Pi(θ), the probability of getting the answer right on any item i with ability θ, with c, 0.5 and +1.0 marked. Horizontal axis: ability θ, with the item difficulty ‘b’ marked.]

Example For Rasch Model

Example 1: Consider an item with item difficulty -2.40 and a person with ability +1.0. This person has a probability of correct response of 0.968. With a = 1 for the Rasch model, the calculation is:

$$P_i(\theta = +1.0) = \frac{1}{1 + e^{-1(1 - (-2.4))}} = \frac{1}{1 + e^{-1(1 + 2.4)}} = \frac{1}{1 + e^{-3.4}} = 0.968$$

Similar calculations are carried out for all values of θ = (-3, -2, -1, 0, +1, +2, +3). They are shown in the screen shot below titled “θ For Example 1 (Rasch Model)”.

Screen Shot: θ For Example 1 (Rasch Model)

In the screen shot, LOGIT (L) is θ-b and EXP(-L) is $e^{-(\theta - b)}$. The ICC for this item is given in the screen shot below titled “ICC For Example 1 (Rasch Model)”.

Screen Shot: ICC For Example 1 (Rasch Model)

Example 2: Consider another item with item difficulty 0 and test takers with varying ability -3, -2.9, ..., +3 at intervals of 0.1. (Note: an ability of 0 does not mean no ability; on the scale of -3 to +3 it is an average ability.)
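The Rasch calculations in Examples 1 and 2 can also be sketched in a few lines of Python. This is only an illustration of the modified Rasch equation above, not the procedure used by BIRT; the function name `rasch_probability` is an assumption made for this sketch.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model (a = 1)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Example 1: item difficulty b = -2.40, person ability theta = +1.0
p_example_1 = rasch_probability(1.0, -2.40)
print(round(p_example_1, 3))  # 0.968

# Example 2: item difficulty b = 0, abilities -3.0, -2.9, ..., +3.0
abilities = [round(-3.0 + 0.1 * k, 1) for k in range(61)]
icc_example_2 = [(theta, rasch_probability(theta, 0.0)) for theta in abilities]

# Print every tenth point of the curve (theta = -3, -2, ..., +3)
for theta, p in icc_example_2[::10]:
    print(f"theta={theta:+.1f}  P={p:.3f}")
```

Plotting the pairs in `icc_example_2` gives the S-shaped ICC, which passes through a probability of 0.5 at θ = b.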
The format given in the table below titled “ICC For Example 2 (Rasch Model)” helps in easy calculations:

Table: ICC For Example 2 (Rasch Model)

θi | b | (θi - b) = L | $e^{-L}$ | $1 + e^{-L}$ | $1/(1 + e^{-L})$
-3.0 | 0.0 | -3 - 0 = -3 | $e^{3}$ | $1 + e^{3}$ | $1/(1 + e^{3})$
-2.9 | 0.0 | -2.9 - 0 = -2.9 | $e^{2.9}$ | $1 + e^{2.9}$ | $1/(1 + e^{2.9})$
... | ... | ... | ... | ... | ...
+3.0 | 0.0 | 3 - 0 = 3 | $e^{-3}$ | $1 + e^{-3}$ | $1/(1 + e^{-3})$

Birnbaum’s Two Parameter Model

In the Two Parameter model, a tangent is drawn to the ICC at its point of inflection, which corresponds to a probability value of 0.5. The slope of this tangent represents the “item discrimination”, and this value is indicated by the letter “a”. The “b” and “a” values are estimated for the given items. The equation for the ICC is:

$$P_i(\theta) = \frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}$$

where
Pi(θ) = probability that a person with ability θ gets the correct answer to item i
θ = person ability
b = item difficulty
a = item discrimination

Multiplying the numerator and the denominator by $e^{-a(\theta - b)}$ gives the modified equation for the Birnbaum model:

$$P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}$$

See the graph below titled “ICC of Birnbaum’s Two Parameter Model”.

Graph: ICC of Birnbaum’s Two Parameter Model
[Vertical axis: Pi(θ), the probability of getting the answer right on any item i with ability θ, with c, 0.5 and +1.0 marked. Horizontal axis: ability θ, with the item difficulty ‘b’ marked. The tangent to the curve makes an angle α, labelled tan α = a.]

Example For Birnbaum’s Model

Example 1: Consider an item with item difficulty b = 0 and discrimination a = 1.25, and a person with ability +1.0. This person has a probability of correct response of 0.777. The calculation is:

$$P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b)}} = \frac{1}{1 + e^{-1.25(1 - 0)}} = \frac{1}{1 + e^{-1.25}} = 0.777$$

Similar calculations are carried out for all values of θ = (-3, -2, -1, 0, +1, +2, +3). They are shown in the screen shot below titled “θ For Example 1 (Birnbaum)”.

Screen Shot: θ For Example 1 (Birnbaum)

In the screen shot, LOGIT (L) is a(θ-b) and EXP(-L) is $e^{-a(\theta - b)}$.
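As a cross-check on the Birnbaum example, the following Python sketch evaluates the Two Parameter equation for a = 1.25 and b = 0 at the seven ability levels mentioned above. It is a minimal illustration written for this text; the function name `two_parameter_probability` is an assumption, not part of BIRT.

```python
import math

def two_parameter_probability(theta, a, b):
    """Probability of a correct response under the Two Parameter model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a, b = 1.25, 0.0  # discrimination and difficulty from Example 1

# The worked value for a person with ability +1.0
print(round(two_parameter_probability(1.0, a, b), 3))  # 0.777

# Logit L = a(theta - b), EXP(-L) and the probability at each ability level
for theta in range(-3, 4):
    logit = a * (theta - b)
    p = two_parameter_probability(theta, a, b)
    print(f"theta={theta:+d}  L={logit:+.2f}  EXP(-L)={math.exp(-logit):.3f}  P={p:.3f}")
```

Setting a = 1 in this function reproduces the Rasch model, which is why the Single Parameter model can be treated as a special case.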
The ICC for this item is given in the screen shot below titled “ICC For Example 1 (Birnbaum)”.

Screen Shot: ICC For Example 1 (Birnbaum)

Example 2: Consider another item with item difficulty 0 and a = 1.25, and test takers with varying ability -3, -2.9, ..., +3 at intervals of 0.1. The format given in the table titled “ICC For Example 2 (Birnbaum Model)” helps in easy calculations.

Table: ICC For Example 2 (Birnbaum Model)

θi | b, a | a(θi - b) = L | $e^{-L}$ | $1 + e^{-L}$ | $1/(1 + e^{-L})$ = Pi(θi)
-3.0 | 0.0, 1.25 | 1.25[-3 - 0.0] = -3.75 | $e^{3.75}$ | $1 + e^{3.75}$ | $1/(1 + e^{3.75})$
0 | 0.0, 1.25 | 1.25[0 - 0] = 0 | $e^{0}$ | $1 + e^{0} = 2$ | 1/2 = 0.5
+3.0 | 0.0, 1.25 | 1.25[3 - 0] = 3.75 | $e^{-3.75}$ | $1 + e^{-3.75}$ | $1/(1 + e^{-3.75})$

Fred Lord’s Three Parameter Model

The Three Parameter model adds a third item parameter, called guessing and designated by the letter “c”. It is given by the intercept of the ICC on the probability axis and indicates the probability of guessing the right answer. The guessing parameter is unique to the item and independent of test taker ability; it therefore remains constant for test takers of all abilities. The equation of the ICC is:

$$P_i(\theta) = c + (1 - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}$$

where
Pi(θ) = probability that a person with ability θ gets the correct answer to item i
θ = person ability
b = item difficulty
a = item discrimination
c = guessing parameter

Multiplying the numerator and the denominator of the fraction by $e^{-a(\theta - b)}$ gives the modified equation for Fred Lord’s model:

$$P_i(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}$$

See the dotted-line curve in the graph below titled “ICC of Fred Lord’s Three Parameter Model”.
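The effect of the guessing parameter can be seen by evaluating the Three Parameter equation directly. The sketch below is illustrative only: the function name and the parameter values a = 1.0, b = 0.5, c = 0.2 are hypothetical and are not taken from the book’s examples.

```python
import math

def three_parameter_probability(theta, a, b, c):
    """Probability of a correct response under the Three Parameter model."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Hypothetical item parameters chosen only for illustration
a, b, c = 1.0, 0.5, 0.2

# At theta = b the probability is (1 + c) / 2 rather than 0.5
print(three_parameter_probability(b, a, b, c))

# For very low ability the probability approaches the guessing level c
print(three_parameter_probability(-10.0, a, b, c))

# With c = 0 the model reduces to the Two Parameter model
print(three_parameter_probability(1.0, 1.25, 0.0, 0.0))
```

The first printed value is approximately 0.6, that is (1 + 0.2)/2, and the second is only slightly above 0.2, which shows c acting as the lower asymptote of the curve.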
Graph: ICC of Fred Lord’s Three Parameter Model
[Vertical axis: Pi(θ), the probability of getting the answer right on any item i with ability θ, with c, 0.5 and +1.0 marked. Horizontal axis: ability θ, with the item difficulty ‘b’ marked. The tangent to the curve makes an angle α, labelled tan α = a.]

Examples

Example 1: Consider an item with b = 0, a = 1.25 and c = 0.25, and a person with ability +1.0. This person has a probability of correct response of 0.833. The calculation is:

$$P_i(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}} = 0.25 + (1 - 0.25)\,\frac{1}{1 + e^{-1.25(1 - 0)}} = 0.833$$

Similar calculations are carried out for all values of θ = (-3, -2, -1, 0, +1, +2, +3). They are shown in the screen shot below titled “θ For Example 1 (Lord)”.

Screen Shot: θ For Example 1 (Lord)

In the screen shot, LOGIT (L) is a(θ-b) and EXP(-L) is $e^{-a(\theta - b)}$. The ICC for this item is given in the screen shot below titled “ICC For Example 1 (Lord)”.

Screen Shot: ICC For Example 1 (Lord)

Example 2: Consider another item with item difficulty 0, a = 0.75 and c = 0.15, and test takers with varying ability -3, -2.9, ..., +3 at intervals of 0.1. The format given in the table titled “ICC For Example 2 (Lord Model)” helps in easy calculations.

Table: ICC For Example 2 (Lord Model)

θi | a, b, c | a(θi - b) = L | $e^{-L}$ | $(1 - c)/(1 + e^{-L})$ | $c + (1 - c)/(1 + e^{-L})$
-3.0 | 0.75, 0.0, 0.15 | 0.75[-3 - 0] = -2.25 | $e^{2.25}$ | $0.85/(1 + e^{2.25})$ | $0.15 + 0.85/(1 + e^{2.25})$
0 | 0.75, 0.0, 0.15 | 0.75[0] = 0 | $e^{0}$ | 0.85/(1 + 1) | 0.15 + 0.85/(1 + 1)
+3.0 | 0.75, 0.0, 0.15 | 0.75[3 - 0] = 2.25 | $e^{-2.25}$ | $0.85/(1 + e^{-2.25})$ | $0.15 + 0.85/(1 + e^{-2.25})$

The Three Parameter (Fred Lord’s) model yields more accurate estimates than the Two Parameter (Birnbaum’s) model, while the Single Parameter (Rasch’s) model is the least accurate.

Interpretation of Item Parameters

Instead of the verbal labels used earlier to describe the technical properties of an ICC, the item parameters themselves can be used.
These parameters have numerical values that have intrinsic meaning. The next task is to interpret these values and to convey the interpretation to a non-technical audience. The verbal labels used to describe an item’s discrimination can be related to ranges of values of the parameter, as given in the table below titled “Range of Values For Item Discrimination”.

Table: Range of Values For Item Discrimination

Verbal Label | Range of Values
None | 0
Very low | 0.01 to 0.34
Low | 0.35 to 0.64
Moderate | 0.65 to 1.34
High | 1.35 to 1.69
Very high | > 1.70
Perfect | + infinity

These relations hold when the values of the discrimination parameter are interpreted under a logistic model for the ICC. If the discrimination parameter is to be interpreted under a normal ogive model, these values need to be divided by 1.7.

Establishing an equivalent table for the values of the item difficulty parameter poses some problems. The drawback of item difficulty as defined under CTT is that it is defined relative to a group of test takers. Thus, the same item could be easy for one group and hard for another group. Under IRT, an item’s difficulty is the point on the ability scale where the probability of correct response is 0.5 for the Single and Two Parameter models and (1 + c)/2 for the Three Parameter model. Because of this, the verbal labels used earlier have meaning only with respect to the midpoint of the ability scale. The proper way to interpret a numerical value of the item difficulty parameter is in terms of where the item functions on the ability scale.

The discrimination parameter can be used to add meaning to this interpretation. The slope of the ICC is at a maximum at the ability level corresponding to the item difficulty. Thus, the item does its best in distinguishing between test takers in the neighborhood of this ability level.
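The statement that the slope of the ICC is greatest at the item difficulty can be checked with a short derivation. The following is a sketch for the Two Parameter logistic form used in this text, writing $Q_i(\theta) = 1 - P_i(\theta)$ for the probability of an incorrect response.

```latex
% Two Parameter logistic ICC and its derivative with respect to ability
P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}, \qquad
\frac{dP_i}{d\theta} = \frac{a\, e^{-a(\theta - b)}}{\left(1 + e^{-a(\theta - b)}\right)^{2}}
                     = a\, P_i(\theta)\, Q_i(\theta)

% The product P Q is largest when P = Q = 1/2, which happens at theta = b
\left.\frac{dP_i}{d\theta}\right|_{\theta = b} = a \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \frac{a}{4}
```

The slope is therefore greatest at θ = b and is proportional to a. The graphs in this text label the tangent simply as tan α = a; under the logistic form written here the exact slope at θ = b is a/4, and for the Three Parameter model it becomes a(1 - c)/4 at the same point.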
Because the item discriminates best near its difficulty, one can speak of the item functioning at this ability level. For example, an item whose difficulty is -1 functions among the lower-ability test takers. A value of +1 denotes an item that functions among higher-ability test takers. Again, the underlying concept is that the item difficulty is a location parameter.

Under the Three Parameter model, the numerical value of the guessing parameter c is interpreted directly, since it is a probability. For example, c = 0.12 simply means that at all ability levels, the probability of getting the item correct by guessing alone is 0.12.

Item Information Function

Any item in a test provides some information about the ability of the examinee, but the amount of this information depends on how closely the difficulty of the item matches the ability of the person.

Item Information Function of Single Parameter Model

In the Single Parameter model, the item information function depends only on how closely the difficulty of the item matches the ability of the person, while in the other models this combines with other factors. The item information function of this model is:

$$I_i(\theta, b_i) = P_i(\theta, b_i)\, Q_i(\theta, b_i)$$

where $Q_i = 1 - P_i$ is the probability of an incorrect response. The maximum value of this item information function is 0.25. It occurs at the point where the probabilities of a correct and of an incorrect response are both equal to 0.5. Any item in this model is most informative for examinees whose ability is equal to the difficulty of the item. As ability becomes either smaller or greater than the item difficulty, the item information decreases. This is shown in the graph below titled “Item Information Function: Single Parameter Model”.

Graph: Item Information Function: Single Parameter Model

The most important practical implication is that items of different difficulty are needed to achieve good measurement for people across the whole range of abilities.
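The behaviour of the Single Parameter item information function can be checked numerically. The sketch below evaluates I = P × Q on a grid of abilities for an illustrative item with b = 0; the function name `rasch_information` and the grid are assumptions made for this example.

```python
import math

def rasch_information(theta, b):
    """Item information P * Q under the Single Parameter (Rasch) model."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

b = 0.0  # illustrative item difficulty
grid = [k / 10 for k in range(-30, 31)]  # abilities -3.0, -2.9, ..., +3.0
info = [rasch_information(theta, b) for theta in grid]

# Locate the ability level at which the item is most informative
peak_theta = grid[info.index(max(info))]
print(peak_theta, max(info))  # 0.0 0.25
```

The peak sits at θ = b with a height of 0.25, and the values fall away symmetrically on either side, which is why a test needs items spread across the difficulty range.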
Item Information Function of Two Parameter Model

The item information function for the Two Parameter model is:

$$I_i(\theta, b_i, a_i) = a_i^{2}\, P_i(\theta, b_i)\, Q_i(\theta, b_i)$$

The discrimination parameter $a_i$ has quite a strong influence because it appears in the formula as a square. Discrimination parameters below 1 can decrease the information function rather dramatically, while discrimination parameters above 1 increase it substantially. This is shown below in the graph titled “Item Information Function: Two Parameter Model”.

Graph: Item Information Function: Two Parameter Model

In the graph, the item response functions are plotted with dotted lines and matched in color with the corresponding item information functions. The item information functions still attain their maxima at the item difficulty. However, their shapes and the values of the maxima depend strongly on the discrimination parameter. When discrimination is high (and the item response function is steep), the item provides more information on ability, and the information is concentrated around the item difficulty. Items with low discrimination parameters are less informative, and their information is spread over a greater part of the ability range.

Item Information Function of Three Parameter Model

The item information function of the Three Parameter model is somewhat more complicated than that of the Single Parameter or Two Parameter model. The item information functions of the three models are:

Single Parameter Model: $I_i(\theta, b_i) = P_i(\theta, b_i)\, Q_i(\theta, b_i)$
Two Parameter Model: $I_i(\theta, b_i, a_i) = a_i^{2}\, P_i(\theta, b_i)\, Q_i(\theta, b_i)$
Three Parameter Model: $I(\theta, a, b, c) = a^{2}\, \dfrac{Q(\theta)}{P(\theta)}\, \dfrac{[P(\theta) - c]^{2}}{(1 - c)^{2}}$

The graph shown below, titled “Item Information Function: Three Parameter Model”, has been plotted with two items.
The item with the black lines has a=1, b=-1 and c=0.1, while the item with the red lines has a=1, b=+1 and c=0.3. The b parameter shifts the item information function to the left or to the right but does not affect its shape. Since the two items have the same a=1, the difference in the height of their information functions is due to c: a higher c leads to an overall decrease in item information. A further complication is that the item information function no longer peaks at θ=b.

Graph: Item Information Function: Three Parameter Model

Examples

Consider the examples below. The parameter values and the ICCs for these examples are calculated according to the Two Parameter model.

Example 1: For Item #1, a=1.42, b=1.50, c=0. At the b value of 1.50 the probability of getting the right answer, Pi(θ), is 0.5. The information function at this point is calculated as shown below:

(1.42 * 1.42) * (0.5 * 0.5) = 0.5041

This is the peak of the information function shown in the graph: the item gives maximum information where the ability of the test taker equals the difficulty of the item. At other abilities in the range from +1 to +2.5, the item can be expected to give average information.

The graphs shown in green hereinafter are BILOG outputs (BILOG is a software application used here to render the ICC graphs). See the graph below titled “Item Information Function: BILOG output two parameter model (Example 1)”.

Graph: Item Information Function: BILOG output two parameter model (Example 1)
[BILOG plot: Item Response Function and Item Information. Subtest 1: SAMP1; Item 1: 0001; a = 1.42; b = 1.50; c = 0.00. Axes: PROB (Correct) and Information against Scale Score (-3 to 3), with b marked; Metric Type: Normal.]

Example 2: For Item #2, a=1.42, b=0.42, c=0. At the b value of 0.42 the probability of getting the right answer, Pi(θ), is 0.5. The information function at this point is (1.42 * 1.42) * (0.5 * 0.5), that is, 0.5041.
This is the peak of the information function shown in the graph: the item gives maximum information where the ability of the test taker equals the difficulty of the item. At other abilities in the range from -0.5 to +1.5, the item can be expected to give average information. See the graph below titled “Item Information Function: BILOG output two parameter model (Example 2)”.

Graph: Item Information Function: BILOG output two parameter model (Example 2)
[BILOG plot: Item Response Function and Item Information. Subtest 1: SAMP1; Item 2: 0002; a = 1.42; b = 0.42; c = 0.00. Axes: PROB (Correct) and Information against Scale Score (-3 to 3), with b marked; Metric Type: Normal.]

Example 3: For Item #3, a=1.42, b=2.29, c=0. At the b value of 2.29 the probability of getting the right answer, Pi(θ), is 0.5. The information function at this point is (1.42 * 1.42) * (0.5 * 0.5), that is, 0.5041. This is the peak of the information function shown in the graph: the item gives maximum information where the ability of the test taker equals the difficulty of the item. At other abilities in the range from +1.5 to +3, the item can be expected to give average information. See the graph below titled “Item Information Function: BILOG output two parameter model (Example 3)”.

Graph: Item Information Function: BILOG output two parameter model (Example 3)
[BILOG plot: Item Response Function and Item Information. Subtest 1: SAMP1; Item 3: 0003; a = 1.42; b = 2.29; c = 0.00. Axes: PROB (Correct) and Information against Scale Score (-3 to 3), with b marked; Metric Type: Normal.]

Example 4: For Item #4, a=1.42, b=0.92, c=0. At the b value of 0.92 the probability of getting the right answer, Pi(θ), is 0.5.
The information function at this point is (1.42 * 1.42) * (0.5 * 0.5), that is, 0.5041. This is the peak of the information function shown in the graph: the item gives maximum information where the ability of the test taker equals the difficulty of the item. At other abilities in the range from 0 to +2, the item can be expected to give average information. See the graph below titled “Item Information Function: BILOG output two parameter model (Example 4)”.

Graph: Item Information Function: BILOG output two parameter model (Example 4)
[BILOG plot: Item Response Function and Item Information. Subtest 1: SAMP1; Item 4: 0004; a = 1.42; b = 0.92; c = 0.00. Axes: PROB (Correct) and Information against Scale Score (-3 to 3), with b marked; Metric Type: Normal.]

Example 5: For Item #5, a=1.42, b=-0.28, c=0. At the b value of -0.28 the probability of getting the right answer, Pi(θ), is 0.5. The information function at this point is (1.42 * 1.42) * (0.5 * 0.5), that is, 0.5041. This is the peak of the information function shown in the graph: the item gives maximum information where the ability of the test taker equals the difficulty of the item. At other abilities in the range from -1 to +1, the item can be expected to give average information. See the graph below titled “Item Information Function: BILOG output two parameter model (Example 5)”.

Graph: Item Information Function: BILOG output two parameter model (Example 5)
[BILOG plot: Item Response Function and Item Information. Subtest 1: SAMP1; Item 5: 0005; a = 1.42; b = -0.28; c = 0.00. Axes: PROB (Correct) and Information against Scale Score (-3 to 3), with b marked; Metric Type: Normal.]

In general terms, for a Two Parameter ICC model, the item information function is defined as:

$$I_i(\theta) = a_i^{2}\, P_i(\theta)\, Q_i(\theta)$$

where
$a_i$ is the discrimination parameter for item i
$P_i(\theta) = 1/(1 + \mathrm{EXP}(-a_i(\theta - b_i)))$
$Q_i(\theta) = 1 - P_i(\theta)$
θ is the ability level of the test taker

The amount of item information is computed below at seven ability levels for an item with parameter values b=1.0 and a=1.5, in the table titled “Item Information Function for b=1 and a=1.5”. Intermediate values in the table are rounded to two decimals.

Table: Item Information Function for b=1 and a=1.5

θ | L | EXP(-L) | Pi(θ) = 1/(1 + EXP(-L)) | Qi(θ) | Pi(θ) * Qi(θ) | a² | Ii(θ)
-3 | -6.0 | 403.43 | 0.00 | 1.00 | 0.00 | 2.25 | 0.00
-2 | -4.5 | 90.02 | 0.01 | 0.99 | 0.01 | 2.25 | 0.02
-1 | -3.0 | 20.09 | 0.05 | 0.95 | 0.05 | 2.25 | 0.11
0 | -1.5 | 4.48 | 0.18 | 0.82 | 0.15 | 2.25 | 0.34
1 | 0.0 | 1.00 | 0.50 | 0.50 | 0.25 | 2.25 | 0.56
2 | 1.5 | 0.22 | 0.82 | 0.18 | 0.15 | 2.25 | 0.34
3 | 3.0 | 0.05 | 0.95 | 0.05 | 0.05 | 2.25 | 0.11

This item information function increases rather smoothly as ability increases and reaches a maximum value of 0.56 at an ability of 1.0. After this point, it decreases. The item information function obtained is symmetrical about the value of the item’s difficulty parameter. Such symmetry holds for all item information functions under the Single and Two Parameter models. When only a single item is involved and the discrimination parameter has a moderate value, the amount of item information is quite small.

Similar calculations can be carried out for the item information function under the Single and Three Parameter models.
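The columns of the table for b = 1.0 and a = 1.5 can be regenerated with a short Python sketch. This is an illustration of the Two Parameter formula, not the BILOG computation, and the function name is an assumption. Because the code keeps full precision instead of rounding intermediate columns, the last digit of a few information values can differ slightly from the hand-worked table.

```python
import math

def two_parameter_information(theta, a, b):
    """Item information a^2 * P * Q under the Two Parameter model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    q = 1.0 - p
    return a * a * p * q

a, b = 1.5, 1.0  # parameter values used in the table

rows = []
for theta in range(-3, 4):
    logit = a * (theta - b)                # L = a(theta - b)
    p = 1.0 / (1.0 + math.exp(-logit))     # P_i(theta)
    rows.append((theta, logit, math.exp(-logit), p, 1.0 - p,
                 two_parameter_information(theta, a, b)))

print("theta      L   EXP(-L)     P     Q     I")
for theta, logit, exp_neg, p, q, info in rows:
    print(f"{theta:+5d} {logit:+6.2f} {exp_neg:9.2f} {p:5.2f} {q:5.2f} {info:5.2f}")
```

The information column peaks at θ = 1 with a value of 0.5625 and is symmetric about that point, in line with the discussion of the table.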
The item information functions under the Single and Three Parameter models are, respectively:

$$I_i(\theta) = P_i(\theta)\, Q_i(\theta)$$

$$I_i(\theta) = a^{2}\, \frac{Q_i(\theta)}{P_i(\theta)}\, \frac{[P_i(\theta) - c]^{2}}{(1 - c)^{2}}$$

where, for the Three Parameter model,
$P_i(\theta) = c + (1 - c)\,(1/(1 + \mathrm{EXP}(-L)))$
$L = a(\theta - b)$
$Q_i(\theta) = 1 - P_i(\theta)$

Exercises

1. Given an item with item difficulty b=-2, calculate the item information function ordinates for the Single Parameter (Rasch) model at ability levels of -3, -2, -1, 0, 1, 2, 3, using the BIRT software and the appropriate formula for the item information function. Also plot the item information curve.

2. Given an item with item difficulty b=-2 and item discrimination a=1.42, calculate the item information function ordinates for the Two Parameter (Birnbaum) model at ability levels of -3, -2, -1, 0, 1, 2, 3, using the BIRT software and the appropriate formula for the item information function. Also plot the item information curve.

3. Given an item with item difficulty b=-2, item discrimination a=1.42 and guessing parameter c=0.2, calculate the item information function ordinates for the Three Parameter (Fred Lord) model at ability levels of -3, -2, -1, 0, 1, 2, 3, using the BIRT software and the appropriate formula for the item information function. Also plot the item information curve.

Test Characteristic Curve (Test Response Function)

IRT is based upon the individual items of a test, and up to this point the items have been dealt with one at a time. Attention now turns to dealing with all the items in a test at once. When a test is scored, the response made by a test taker to each item is scored dichotomously. A correct response is given a score of 1 and an incorrect response a score of 0; the test taker’s raw test score is obtained by adding up the item scores. This raw test score is always an integer and ranges from 0 to N, where N is the number of items in the test.
If test takers were to take the test again, assuming they did not remember how they previously answered the items, a different raw test score would be obtained. Hypothetically, a test taker could take the test a great many times and obtain a variety of test scores. One would anticipate that these scores would cluster around some average value. In CTT, this value is known as the true score. In IRT, however, the definition of a true score according to D.H. Lawley is used. The formula for a true score is:

$$TS_j = \sum_{i=1}^{N} P_i(\theta_j)$$

where
$TS_j$ is the true score for the examinee with ability $\theta_j$
i denotes an item
$P_i(\theta_j)$ depends upon the particular ICC model employed

Consider a test of 20 items administered to seventy-six test takers and analyzed with a Two Parameter model. The BILOG outputs for this analysis are provided as embedded links; note that these files can be viewed only if BILOG is installed on your computer.

Examples

Example 1: To calculate the true score at a given ability level, we assume an ability of θ=+1 and find, for each item, the probability that a test taker of ability +1 gives a correct response, using the BIRT software. For the first item, b=1.5 and a=1.42. The output from BIRT for an ability of +1 is shown below in the screen shot titled “Probability of Getting the Correct Answer (Ability = +1) Example 1”.

Screen Shot: Probability of Getting Correct Answer (Ability = +1) Example 1

According to the screen shot above, Pi(θ) at θ = +1 is 0.330.
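The true-score formula above can be sketched in Python as a sum of item probabilities at a fixed ability. The sketch is illustrative only: it uses the Two Parameter model with the (a, b) values of the five items quoted in the item information examples of this section, so the result is a partial true score over those five items rather than over the full 20-item test. The function names are assumptions.

```python
import math

def two_parameter_probability(theta, a, b):
    """Probability of a correct response under the Two Parameter model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta, items):
    """Sum of the item probabilities P_i(theta) over the given items."""
    return sum(two_parameter_probability(theta, a, b) for a, b in items)

# (a, b) pairs for Items 1 to 5 as quoted in the item information examples
items = [(1.42, 1.50), (1.42, 0.42), (1.42, 2.29), (1.42, 0.92), (1.42, -0.28)]

theta = 1.0
for number, (a, b) in enumerate(items, start=1):
    p = two_parameter_probability(theta, a, b)
    print(f"Item {number}: P = {p:.3f}")

print(f"True score over these five items at theta = +1: {true_score(theta, items):.3f}")
```

Repeating the calculation over a range of θ values and plotting the results would trace out the test characteristic curve for these items.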
The calculation for the same is given below:

$$P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b)}} = \frac{1}{1 + e^{-1.42(1 - 1.5)}} = \frac{1}{1 + e^{0.71}} = 0.3296 \approx 0.330$$

The ICC for this item is shown below in the graph titled "ICC for Ability = +1 (Example 1)".

Graph: ICC for Ability = +1 (Example 1)
[BILOG plot "Item Response Function and Item Information", Subtest 1: SAMP1, Item 1 (0001): a = 1.42, b = 1.50, c = 0.00. Horizontal axis: Scale Score, -3 to 3. Vertical axes: PROB (Correct) and Information. Metric Type: Normal.]

Example 2: For the second item, b = 0.42 and a = 1.42. For an ability of +1, the output from the BIRT software application is shown below in the Screen Shot titled "Probability of Getting the Correct Answer (Ability = +1) Example 2".

Screen Shot: Probability of Getting Correct Answer (Ability = +1) Example 2

According to the Screen Shot given above, Pi(θ) at θ = +1 is 0.695. The calculation for the same is given below:

$$P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b)}} = \frac{1}{1 + e^{-1.42(1 - 0.42)}} = \frac{1}{1 + e^{-0.824}} = 0.695$$

The ICC for this item is given below in the graph titled "Item Information Function: BILOG output two parameter model (Example 2)".

Graph: Item Information Function: BILOG output two parameter model (Example 2)
[BILOG plot "Item Response Function and Item Information", Subtest 1: SAMP1, Item 2 (0002): a = 1.42, b = 0.42, c = 0.00. Horizontal axis: Scale Score, -3 to 3. Vertical axes: PROB (Correct) and Information. Metric Type: Normal.]

Example 3: For the third item, b = 2.29 and a = 1.42. For an ability of +1, the output from the BIRT software application is shown below in the Screen Shot titled "Probability of Getting the Correct Answer (Ability = +1) Example 3".
Screen Shot: Probability of Getting Correct Answer (Ability = +1) Example 3

According to the Screen Shot given above, Pi(θ) at θ = +1 is 0.138. The calculation for the same is given below:

$$P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b)}} = \frac{1}{1 + e^{-1.42(1 - 2.29)}} = \frac{1}{1 + e^{1.832}} = 0.138$$

The ICC for this item is given below in the graph titled "Item Information Function: BILOG output two parameter model (Example 3)".

Graph: Item Information Function: BILOG output two parameter model (Example 3)
[BILOG plot "Item Response Function and Item Information", Subtest 1: SAMP1, Item 3 (0003): a = 1.42, b = 2.29, c = 0.00. Horizontal axis: Scale Score, -3 to 3. Vertical axes: PROB (Correct) and Information. Metric Type: Normal.]

Example 4: For the fourth item, b = 0.92 and a = 1.42. For an ability of +1, the output from the BIRT software application is shown below in the Screen Shot titled "Probability of Getting the Correct Answer (Ability +1) Example 4".

Screen Shot: Probability of Getting Correct Answer (Ability = +1) Example 4

According to the Screen Shot given above, Pi(θ) at θ = +1 is 0.528. The calculation for the same is given below:

$$P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b)}} = \frac{1}{1 + e^{-1.42(1 - 0.92)}} = \frac{1}{1 + e^{-0.114}} = 0.528$$

The ICC for this item is given below in the graph titled "Item Information Function: BILOG output two parameter model (Example 4)".

Graph: Item Information Function: BILOG output two parameter model (Example 4)
[BILOG plot "Item Response Function and Item Information", Subtest 1: SAMP1, Item 4 (0004): a = 1.42, b = 0.92, c = 0.00. Horizontal axis: Scale Score, -3 to 3. Vertical axes: PROB (Correct) and Information. Metric Type: Normal.]

Example 5: For the fifth item, b = -0.28 and a = 1.42. For an ability of +1, the output from the BIRT software application is shown below in the Screen Shot titled "Probability of Getting the Correct Answer (Ability +1) Example 5".

Screen Shot: Probability of Getting Correct Answer (Ability = +1) Example 5

According to the Screen Shot given above, Pi(θ) at θ = +1 is 0.860. The calculation for the same is given below:

$$P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b)}} = \frac{1}{1 + e^{-1.42(1 + 0.28)}} = \frac{1}{1 + e^{-1.818}} = 0.860$$

The ICC for this item is given below in the graph titled "Item Information Function: BILOG output two parameter model (Example 5)".

Graph: Item Information Function: BILOG output two parameter model (Example 5)
[BILOG plot "Item Response Function and Item Information", Subtest 1: SAMP1, Item 5 (0005): a = 1.42, b = -0.28, c = 0.00. Horizontal axis: Scale Score, -3 to 3. Vertical axes: PROB (Correct) and Information. Metric Type: Normal.]

The test score of a test taker of ability +1 can now be obtained. It is the sum of the individual Pi(θ) values for the five items:

TS = 0.330 + 0.695 + 0.138 + 0.528 + 0.860 = 2.551

Thus, for a test taker of ability +1, the true score for the test is 2.55. Similarly, we may obtain the test scores at all ability levels from -3 to +3, as indicated in the table below titled "Test True Score for Different Ability Levels".

Table: Test True Score for Different Ability Levels

Ability Level   Item 1   Item 2   Item 3   Item 4   Item 5   Total
-3              0.002    0.008    0.001    0.004    0.021    0.036
-2              0.007    0.031    0.002    0.016    0.080    0.136
-1              0.028    0.117    0.009    0.061    0.265    0.480
 0              0.106    0.355    0.037    0.213    0.598    1.309
+1              0.330    0.695    0.138    0.528    0.860    2.551
+2              0.670    0.904    0.398    0.823    0.962    3.757
+3              0.894    0.974    0.733    0.950    0.991    4.542

The table below, along with the graph, shows the ability levels and the total of Pi(θ) over the five items at each level. A plot of ability along the X-axis and test score along the Y-axis, for test takers of ability ranging from -3 to +3, generates a test characteristic curve, as shown below in the graph titled "Test Taker Wise Ability Versus Test Score".

Graph: Item Response Function & Test Response Function

Ability Level   -3      -2      -1      0       +1      +2      +3
Total           0.036   0.136   0.480   1.309   2.551   3.757   4.542

The procedure used to work out the test characteristic curve for the Two Parameter model can be used in the same way to work out the curves for the Single and Three Parameter models. An important point about the test characteristic curve is that it belongs to a particular test; the test characteristic curves of different tests are different.

When a Single or Two Parameter model is used for N items in a test, the left tail of the test characteristic curve approaches zero as the ability score approaches negative infinity, and its upper tail approaches the number of items in the test as the ability score approaches positive infinity. The implication is that under these two models a true score of zero corresponds to an ability score of negative infinity, and a true score of N corresponds to an ability level of positive infinity. When a Three Parameter model is used for N items in a test, the lower tail of the test characteristic curve approaches the sum of the guessing parameters for the test items rather than zero.
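The true scores in the table above, and the tail behaviour just described, can be checked with a short script. The following is a minimal sketch in Python, not the BIRT or BILOG implementation: the function names are chosen here for illustration, and the Three Parameter variant with c = 0.2 on every item is a hypothetical addition used only to show the lower tail. Because the script does not round each Pi(θ) to three decimals before summing, its totals can differ from the table in the third decimal place.

```python
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """Logistic ICC: c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta, items):
    """True score at ability theta: the sum of P_i(theta) over the items."""
    return sum(p_correct(theta, **item) for item in items)

# Difficulties of the five example items; all share a = 1.42 and c = 0.
DIFFICULTIES = (1.5, 0.42, 2.29, 0.92, -0.28)
ITEMS_2PL = [{"b": b, "a": 1.42} for b in DIFFICULTIES]
# Hypothetical Three Parameter variant with c = 0.2 on every item.
ITEMS_3PL = [{"b": b, "a": 1.42, "c": 0.2} for b in DIFFICULTIES]

for theta in range(-3, 4):
    print(f"{theta:+d}  TS(2PL) = {true_score(theta, ITEMS_2PL):.3f}"
          f"  TS(3PL) = {true_score(theta, ITEMS_3PL):.3f}")
```

Evaluating `true_score` at a very low ability such as -50 gives a value near 0 for the Two Parameter items and near 1.0 (five items times c = 0.2) for the hypothetical Three Parameter items, while a very high ability gives a value near 5 in both cases.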
This reflects the fact that under this model, very low-ability test takers can get a test score simply by guessing. The upper tail of the test characteristic curve still approaches the number of items in the test. Hence, a true score of N corresponds to an ability of positive infinity under all three ICC models.

The primary role of the test characteristic curve in IRT is to provide a means of transforming ability scores to true scores. This becomes of interest in practical situations where the user of the test may not be able to interpret an ability score. By transforming the ability score into a true score, the user is given a number that relates to the number of items in the test. This number is in a more familiar frame of reference and the user is able to interpret it. However, those familiar with IRT can interpret the ability score directly. The test characteristic curve also plays an important role in the procedures for equating tests.

The general form of the test characteristic curve is that of a monotonically increasing function. In some cases, it has a rather smooth S-shape similar to an ICC. In other cases, it will increase smoothly and then have a small plateau before increasing again. However, in all cases, it will be asymptotic to a value of N in the upper tail. The shape of the test characteristic curve depends upon a number of factors, including the number of items, the ICC model employed, and the values of the item parameters.

Test Information Function

The test information function is an extremely useful feature of IRT. It indicates how well the test is doing in estimating ability over the whole range of ability scores. Since a test is used to estimate the ability of a test taker, the amount of information yielded by the test at any ability level can also be obtained.
A test is a set of items; therefore, the test information at a given ability level is simply the sum of the item information at that level. Since the test information is obtained by summing the item information at a given ability level, the amount of information is defined at the item level. Consequently, the test information function is defined as shown below:

$$I(\theta) = \sum_{i=1}^{N} I_i(\theta)$$

where
I(θ) is the amount of test information at any ability level θ,
Ii(θ) is the amount of information for item i at any ability level θ,
N is the number of items in the test.

The most important thing about the test information function is that it predicts the accuracy with which we can measure any value of the latent ability. The general level of the test information function will be much higher than that of a single item information function. Thus, a test measures ability more precisely than does a single item. Hence, the more items in a test, the greater the amount of information. Longer tests will measure a test taker's ability with greater precision than will shorter tests.

Plotting the amount of test information against ability yields a graph of the test information function, shown below for a ten-item test in the graph titled "A Test Information Function (Ten-Item)".

Graph: A Test Information Function (Ten-Item)

Note: The Y-axis in the above graph should be read as 0 to 5 instead of 0 to 10.

The maximum value of the test information function as seen above is modest, and the amount of information decreases rather steadily as the ability level moves away from the level corresponding to the maximum. Thus, ability is estimated with some precision near the center of the ability scale. However, as the ability level approaches the extremes of the scale, the amount of test information decreases significantly.

While the ideal test information function often may be a horizontal line, it may not be the best for a specific purpose.
For example, if we were interested in constructing a test to award scholarships, this ideal might not be optimal. In this situation, we should measure ability with considerable precision at ability levels near the ability used to separate those who will receive the scholarship from those who will not. The best test information function in this case would have a peak at the cutoff score. Other specialized uses of tests could require other forms of the test information function.

While an information function can be obtained for each item in a test, this is rarely done. The amount of information yielded by each item is rather small, and we typically do not attempt to estimate a test taker's ability with a single item. Consequently, the amount of test information at an ability level and the test information function are of primary interest. The mathematical definition of the amount of item information depends upon the particular ICC model employed. Therefore, it is necessary to examine these definitions under each model.

Let us look into a test of five items administered to ten test takers and analyzed through a Two Parameter model. The item parameter values are shown below in the table titled "5 Items, 10 Test Takers, 2 Parameter Model".

Table: 5 Items, 10 Test Takers, 2 Parameter Model

Item   b       a
1      1.50    1.42
2      0.42    1.42
3      2.29    1.42
4      0.92    1.42
5      -0.28   1.42

The amount of item information and the test information will be computed for the same seven ability levels, as shown below:

1. For the first item, b = 1.5, a = 1.42. For an ability of +1, Pi(θ) is 0.330, as calculated in the true score example above. Thus, Ii(θ) is calculated as shown below:

   Pi(θ) = 0.330
   Qi(θ) = 1 - Pi(θ) = 1 - 0.330 = 0.670
   Ii(θ) = a² × Pi(θ) × Qi(θ) = 1.42 × 1.42 × 0.330 × 0.670 = 0.446

2. For the second item, b = 0.42, a = 1.42. For an ability of +1, Pi(θ) is 0.695, as calculated in the true score example above.
Thus, Ii(θ) is calculated as shown below:

   Pi(θ) = 0.695
   Qi(θ) = 1 - Pi(θ) = 1 - 0.695 = 0.305
   Ii(θ) = a² × Pi(θ) × Qi(θ) = 1.42 × 1.42 × 0.695 × 0.305 = 0.427

3. For the third item, b = 2.29, a = 1.42. For an ability of +1, Pi(θ) is 0.138, as calculated in the true score example above. Thus, Ii(θ) is calculated as shown below:

   Pi(θ) = 0.138
   Qi(θ) = 1 - Pi(θ) = 1 - 0.138 = 0.862
   Ii(θ) = a² × Pi(θ) × Qi(θ) = 1.42 × 1.42 × 0.138 × 0.862 = 0.240

4. For the fourth item, b = 0.92, a = 1.42. For an ability of +1, Pi(θ) is 0.528, as calculated in the true score example above. Thus, Ii(θ) is calculated as shown below:

   Pi(θ) = 0.528
   Qi(θ) = 1 - Pi(θ) = 1 - 0.528 = 0.472
   Ii(θ) = a² × Pi(θ) × Qi(θ) = 1.42 × 1.42 × 0.528 × 0.472 = 0.503

5. For the fifth item, b = -0.28, a = 1.42. For an ability of +1, Pi(θ) is 0.860, as calculated in the true score example above. Thus, Ii(θ) is calculated as shown below:

   Pi(θ) = 0.860
   Qi(θ) = 1 - Pi(θ) = 1 - 0.860 = 0.140
   Ii(θ) = a² × Pi(θ) × Qi(θ) = 1.42 × 1.42 × 0.860 × 0.140 = 0.243

The Pi(θ) and Qi(θ) values for all five items are shown in the table below titled "Pi(θ) and Qi(θ) values":

Table: Pi(θ) and Qi(θ) values

Ability   Item 1          Item 2          Item 3          Item 4          Item 5
Level     P      Q=1-P    P      Q=1-P    P      Q=1-P    P      Q=1-P    P      Q=1-P
-3        0.002  0.998    0.008  0.992    0.001  0.999    0.004  0.996    0.021  0.979
-2        0.007  0.993    0.031  0.969    0.002  0.998    0.016  0.984    0.080  0.920
-1        0.028  0.972    0.117  0.883    0.009  0.991    0.061  0.939    0.265  0.735
 0        0.106  0.894    0.355  0.645    0.037  0.963    0.213  0.787    0.598  0.402
+1        0.330  0.670    0.695  0.305    0.138  0.862    0.528  0.472    0.860  0.140
+2        0.670  0.330    0.904  0.096    0.398  0.602    0.823  0.177    0.962  0.038
+3        0.894  0.106    0.974  0.026    0.733  0.267    0.950  0.050    0.991  0.009

The Ii(θ) values for all five items are calculated in the same way.
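The five worked calculations above can be repeated at every ability level with a few lines of code. The sketch below is written in Python with function names chosen here for illustration; it is not the BIRT or BILOG implementation. It uses the general Three Parameter expression for item information, which with c = 0 is the same as a² × Pi(θ) × Qi(θ). Since it works with unrounded probabilities, its results can differ slightly in the third decimal place from hand calculations that start from Pi(θ) rounded to three decimals.

```python
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """Logistic ICC: c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, b, a=1.0, c=0.0):
    """Item information a^2 * (Q/P) * ((P - c)/(1 - c))^2.

    With c = 0 this is a^2 * P * Q (Two Parameter model);
    with a = 1 and c = 0 it is P * Q (Single Parameter model).
    """
    p = p_correct(theta, b, a, c)
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2

def total_information(theta, items):
    """Test information: the sum of the item information values."""
    return sum(item_information(theta, **item) for item in items)

# The five example items (common a = 1.42, c = 0).
ITEMS = [{"b": b, "a": 1.42} for b in (1.5, 0.42, 2.29, 0.92, -0.28)]

for theta in range(-3, 4):
    per_item = "  ".join(f"{item_information(theta, **it):.3f}" for it in ITEMS)
    print(f"{theta:+d}  {per_item}  total = {total_information(theta, ITEMS):.3f}")
```

At θ = +1 the per-item values agree with the hand calculations to within rounding, and the same function with a nonzero c shows how guessing lowers the information an item provides.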
The test information at any ability level is the sum of the Ii(θ) values at that level for all five items, as shown in the table below titled "Information Function At All Ability Levels":

Table: Information Function At All Ability Levels

Ability   Item 1        Item 2        Item 3        Item 4        Item 5        Test Information
Level     a²PQ = I(θ)   a²PQ = I(θ)   a²PQ = I(θ)   a²PQ = I(θ)   a²PQ = I(θ)   Function
-3        0.004         0.016         0.002         0.008         0.041         0.071
-2        0.014         0.0606        0.004         0.0317        0.148         0.258
-1        0.055         0.2083        0.018         0.1155        0.393         0.789
 0        0.191         0.4617        0.0718        0.338         0.484         1.547
+1        0.446         0.4274        0.2399        0.5025        0.243         1.858
+2        0.446         0.1750        0.4831        0.2937        0.073         1.471
+3        0.191         0.0511        0.3946        0.0958        0.018         0.750

The table below, along with the graph, gives the ability levels and the test information function at each of these levels. It may be observed that the test information curve is the sum of the item information curves of the five items, as shown below in the graph titled "Test Information Curve and Item Information Curve (5 Items)".

Graph: Test Information Curve and Item Information Curve (5 Items)

Ability Level               -3        -2         -1        0          +1         +2        +3
Test Information Function   0.07153   0.258765   0.78942   1.547379   1.858409   1.47138   0.75054

Interpreting the Test Information Function

While the shape of the desired test information function depends upon the purpose for which a test is designed, some general interpretations can be made. A test information function that is peaked at some point on the ability scale measures ability with unequal precision along the ability scale. Such a test would be best for estimating the ability of test takers whose abilities fall near the peak of the test information function. In some tests, the test information function is rather flat over some region of the ability scale.
Such tests estimate some range of ability scores with nearly equal precision and outside this range with less precision. Thus, the test would be a desirable one for those test takers whose ability falls in the given range.

When interpreting a test information function, it is important to keep in mind the reciprocal relationship between the amount of information and the variability of the ability estimates. To translate the amount of information into a standard error of estimation, we take the reciprocal of the square root of the amount of test information, as shown below:

$$S.E.(\theta) = \frac{1}{\sqrt{I(\theta)}}$$

Test Information Function of Single Parameter Model

The test information function is equal to the sum of the item information functions. The test information under the Single Parameter model is shown below:

$$I_j(\theta_j) = \sum_i I_{ij}(\theta_j, b_i)$$

Graph: Item information functions and test information function (5 Items)

In the above graph titled "Item Information Function and Test Information Function", although the test information function is plotted on the same scale as the item information functions, a separate axis is added to emphasize the difference. The test as a whole is far more informative than each item alone, and it spreads the information over a wider ability range. The information provided by each item is, in contrast, concentrated around ability levels that are close to its difficulty.
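The conversion from test information to a standard error of estimation is a one-line computation. The sketch below is in Python; the function name is illustrative, and the two information values are taken from the five-item example at ability levels 0 and +1.

```python
import math

def standard_error(information):
    """Standard error of the ability estimate: 1 / sqrt(I(theta))."""
    if information <= 0:
        raise ValueError("test information must be positive")
    return 1.0 / math.sqrt(information)

# Test information from the five-item example at theta = 0 and theta = +1.
for theta, info in [(0, 1.547), (1, 1.858)]:
    print(f"theta = {theta:+d}: I = {info:.3f}, S.E. = {standard_error(info):.3f}")
```

The larger information value gives the smaller standard error, and because of the square root, quadrupling the information is needed to halve the standard error.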
Test Information Function of Two Parameter Model

The formula for the test information function under the Two Parameter model is shown below:

$$I_j(\theta_j) = \sum_i I_{ij}(\theta_j, b_i, a_i) = \sum_i a_i^2 \, P(\theta_j, b_i, a_i) \, Q(\theta_j, b_i, a_i)$$

Because the item information functions in the Two Parameter model depend so strongly on the discrimination parameters ai, the shape of the test information function can become rather curvy and unpredictable, especially in tests with very few items like our examples. In practice, we should have a test information function that is high and reasonably smooth over the relevant ability range, say (-3; +3). Ideally, this is attained with a large number of items that have large discrimination parameters and difficulties evenly distributed over the ability range. Items with very low discrimination parameters are usually discarded from practical use.

Test Information Function of Three Parameter Model

The test information function of the Three Parameter model is the sum of the item information functions over the items in a test. The formula for the test information function of the Three Parameter model is shown below:

$$I_j(\theta_j) = \sum_i I_{ij}(\theta_j, b_i, a_i, c_i) = \sum_i a_i^2 \, \frac{Q_i(\theta_j)}{P_i(\theta_j)} \left[ \frac{P_i(\theta_j) - c_i}{1 - c_i} \right]^2$$

As seen earlier, the item information function depends strongly on the discrimination parameters ai. In the Three Parameter model, there is the additional influence of the "guessing parameters" ci. Larger values of ci decrease the item information and shift its maximum away from bi. In practical applications, we should have a test information function that is high and reasonably smooth over the relevant ability range, say (-3; +3).

Estimating Parameters

In the preceding sections the ability and item parameters were assumed to be known, and hence it was easy to plot, examine and modify the IRF.
If the item parameters are known, then the ability can be estimated easily. Alternatively, the estimation of item parameters becomes easy if the true abilities of the test takers are known. Since the actual values of the item parameters in a test are unknown, one of the important tasks to be performed when a test is analyzed under IRT is to estimate these parameters. The estimates thus obtained yield information about the technical properties of the test items. To understand estimation, an individual item is taken, and the item difficulty, item discrimination and item guessing parameters for this item are estimated wherever relevant in the three models.

Procedure for Estimating Parameters

Let us look into the case of a typical test. This test of N items is administered to M test takers. The ability scores of these test takers will be distributed over a range of ability levels on the ability scale. The test takers are divided into J groups along the scale so that all the test takers within a given group have the same ability level θj, and there are mj test takers within group j, where j = 1, 2, 3, ..., J. Within a particular ability score group, rj test takers answer the given item correctly. Thus, at an ability level of θj, the observed proportion of correct response is p(θj) = rj/mj, which is an estimate of the probability of correct response at that ability level. The value of rj can be obtained and p(θj) computed for each of the J ability levels established along the ability scale. If the observed proportions of correct response in each ability group are plotted, the result will be something like that shown in the graph below titled "Proportion of Correct Responses (Case 1)".

Graph: Proportion of Correct Responses (Case 1)

The next task is to find the ICC that best fits the observed proportions of correct response.
To do so, a model needs to be selected for the curve to be fitted. Although any of the three models could be used, the Two Parameter model is employed here. The procedure used to fit the curve is based upon maximum likelihood estimation. Under this approach, initial values for the item parameters, such as b = 0.0 and a = 1.0, are established a priori. Then, using these estimates, the value of P(θj) is computed at each ability level via the equation for the ICC model. The agreement of the observed value p(θj) and the computed value P(θj) is determined across all ability groups. Then, adjustments to the estimated item parameters are found that result in better agreement between the ICC defined by the estimated values of the parameters and the observed proportions of correct response. This process of adjusting the estimates is continued until the adjustments get so small that little improvement in the agreement is possible. At this point, the estimation procedure is terminated and the current values of b and a are the item parameter estimates. Given these values, the equation for the ICC is used to compute the probability of correct response P(θj) at each ability level, and the ICC can be plotted. The resulting curve is the ICC that best fits the response data for that item.

Figure 3-2, the graph below titled "ICC Best Fit For Proportion of Correct Responses", shows an ICC fitted to the observed proportions of correct response shown in Figure 3-1. The estimated values of the item parameters were b = -.37 and a = 1.25.

Graph: ICC Best Fit For Proportion of Correct Responses

An important consideration within IRT is whether a particular ICC model fits the item response data for an item. The agreement of the observed proportions of correct response and those yielded by the fitted ICC for an item is measured by the chi-square goodness-of-fit index. This index is defined as follows:

$$\chi^2 = \sum_{j=1}^{J} m_j \, \frac{\left[ p(\theta_j) - P(\theta_j) \right]^2}{P(\theta_j)\,Q(\theta_j)}$$

where
J is the number of ability groups,
θj is the ability level of group j,
mj is the number of examinees having ability θj,
p(θj) is the observed proportion of correct response for group j,
P(θj) is the probability of correct response for group j computed from the ICC model using the item parameter estimates, and Q(θj) = 1 - P(θj).

If the value of the obtained index is greater than a criterion value, the ICC specified by the values of the item parameter estimates does not fit the data. This can be caused by two things. First, the wrong ICC model may have been employed. Second, the values of the observed proportions of correct response may be so widely scattered that a good fit, regardless of model, cannot be obtained. In most tests, a few items will yield large values of the chi-square index due to the second reason. However, if many items fail to yield well-fitting ICCs, there may be reason to suspect that the wrong model has been employed. In such cases, re-analyzing the test under an alternative model, say the Three Parameter model rather than a Single Parameter model, may yield better results. In the case of the item shown in Figure 3-2, the obtained value of the chi-square index was 28.88 and the criterion value was 45.91. Thus, the Two Parameter model with b = -.37 and a = 1.25 was a good fit to the observed proportions of correct response. Unfortunately, not all test analysis computer programs provide goodness-of-fit indices for each item in the test.

The actual maximum likelihood estimation (MLE) procedure is rather complex mathematically and entails very laborious computations that must be performed for every item in a test. In fact, until computers became widely available, IRT was not practical because of its heavy computational demands. For present purposes, it is not necessary to go into the details of this procedure.
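For readers who would still like a concrete picture of the iterative adjustment and of the chi-square index, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the algorithm used by BILOG or BIRT: it fits the Two Parameter curve to grouped data (θj, mj, rj) by Fisher scoring on the slope and intercept form of the logistic curve, starting from a = 1.0 and b = 0.0. The function names and the data are hypothetical; the data are synthetic expected counts generated from a = 1.25 and b = -0.37 with 40 test takers per group, so the fit should recover those values almost exactly.

```python
import math

def icc(theta, a, b):
    """Two Parameter logistic ICC."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fit_2pl(groups, a=1.0, b=0.0, tol=1e-8, max_iter=100):
    """Maximum likelihood fit of (a, b) to groups of (theta_j, m_j, r_j).

    Works on logit P = intercept + slope * theta, so that a = slope and
    b = -intercept / slope, and repeats Fisher scoring steps until the
    adjustments fall below tol.
    """
    slope, intercept = a, -a * b
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for theta, m, r in groups:
            p = 1.0 / (1.0 + math.exp(-(intercept + slope * theta)))
            resid = r - m * p          # observed minus expected correct
            w = m * p * (1.0 - p)      # binomial weight for this group
            g0 += resid
            g1 += resid * theta
            h00 += w
            h01 += w * theta
            h11 += w * theta * theta
        det = h00 * h11 - h01 * h01
        d_intercept = (h11 * g0 - h01 * g1) / det
        d_slope = (h00 * g1 - h01 * g0) / det
        intercept += d_intercept
        slope += d_slope
        if abs(d_intercept) < tol and abs(d_slope) < tol:
            break
    return slope, -intercept / slope

def chi_square_fit(groups, a, b):
    """Chi-square index: sum of m_j (p_j - P_j)^2 / (P_j Q_j)."""
    total = 0.0
    for theta, m, r in groups:
        p_fit = icc(theta, a, b)
        total += m * (r / m - p_fit) ** 2 / (p_fit * (1.0 - p_fit))
    return total

# Synthetic data (an assumption for illustration): 13 ability groups,
# 40 test takers each, expected number correct under a = 1.25, b = -0.37.
THETAS = [-3.0 + 0.5 * k for k in range(13)]
GROUPS = [(t, 40, 40 * icc(t, 1.25, -0.37)) for t in THETAS]

a_hat, b_hat = fit_2pl(GROUPS)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, "
      f"chi-square = {chi_square_fit(GROUPS, a_hat, b_hat):.6f}")
```

With real response data the observed counts scatter around the curve, the estimates differ from the generating values, and the chi-square index is compared against a criterion value as described above.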
It is sufficient to know that the curve-fitting procedure exists, that it involves a lot of computing, and that the goodness-of-fit of the obtained ICC can be measured. Because test analysis is done by computer, the computational demands of the item parameter estimation process do not present a major problem today.

Examples

Example 1: Let us look at an illustrative numerical example of item parameter estimation. The model used for this example is Rasch's Single Parameter model, for ease of illustration and understanding. Therefore a single parameter, namely item difficulty, is to be estimated. Let us take an objective-type test of 20 items given to 76 test takers. The key to the correct responses for all 20 items and the responses of all 76 test takers are given in a response matrix that was linked from the original electronic edition. The following table titled "Number Right Score" gives the number right scores and the number of test takers obtaining each score:

Table: Number Right Score

Number Right Score   Number of Test Takers Obtaining This Score
18                   4
17                   4
16                   5
15                   12
14                   7
13                   10
12                   8
11                   11
10                   5
9                    6
8                    2
7                    1
6                    1

Descriptive Statistics of Number Right Scores of 76 Test Takers for 20 Items

Case Processing Summary (VAR00001)

          N     Percent
Valid     76    100.0%
Missing   0     .0%
Total     76    100.0%

Descriptives (VAR00001)

                                                 Statistic   Std. Error
Mean                                             12.8684     .3219
95% Confidence Interval for Mean, Lower Bound    12.2271
95% Confidence Interval for Mean, Upper Bound    13.5097
5% Trimmed Mean                                  12.8977
Median                                           13.0000
Variance                                         7.876
Std. Deviation                                   2.8064
Minimum                                          6.00
Maximum                                          18.00
Range                                            12.00
Interquartile Range                              4.0000
Skewness                                         -.122       .276
Kurtosis                                         -.554       .545

VAR00001 Stem-and-Leaf Plot

Frequency   Stem &  Leaf
  .00        0 .
 2.00        0 .  67
 8.00        0 .  88999999
16.00        1 .  0000011111111111
18.00        1 .  222222223333333333
19.00        1 .  4444444555555555555
 9.00        1 .  666667777
 4.00        1 .  8888

Stem width: 10.00
Each leaf: 1 case(s)

Let us look at the 5th item for the purpose of parameter estimation. The correct answer to this item is B. The responses of the above score groups are given below in the table titled "Proportion of Right Answers".

Table: Proportion of Right Answers

Number Right   Number of Test Takers   Number Scoring   Proportion of Right
Score          Obtaining This Score    Right Answer     Answers (Probability)
18             4                       4                4/4 = 1.0
17             4                       4                4/4 = 1.0
16             5                       2                2/5 = 0.4
15             12                      5                5/12 = 0.42
14             7                       1                1/7 = 0.14
13             10                      6                6/10 = 0.6
12             8                       2                2/8 = 0.25
11             11                      1                1/11 = 0.09
10             5                       1                1/5 = 0.2
9              6                       0                0/6 = 0
8              2                       1                1/2 = 0.5
7              1                       1                1/1 = 1
6              1                       0                0/1 = 0

For the proportions of right answers and the number right score groups given above, a graph is plotted with an approximate ICC, as shown below in the graph titled "Proportion of Right Answers/Number Right Score".

Graph: ICC – Proportion of Right Answers/Number Right Score

The proportion of right answers at every number right score is plotted against the number right score, and a rough curve is fitted to these points by trial and error. The rough representation is given in the graph. The curve's point of inflection lies at a number right score of about 15, which corresponds to roughly +1.5 on the ability scale, as indicated in the graph. Let us therefore take the first estimate of the item difficulty b of this item to be 1.5. For a single parameter b = 1.5, the ICC or IRF is obtained by using the BIRT software application, as shown below:

Screen Shot: ICC for Item Difficulty b = 1.5 (BIRT)

Screen Shot: ICC for Item Difficulty b = 1.5

The Pi(θ) values of the BIRT curve at ability levels from -3 to +3 can now be compared with the observed proportions of right answers at the number right scores assumed to correspond to those ability levels, as shown in the table below titled "Comparison of Number Right Score versus Obtained Proportion of Right Answers":

Table: Comparison of Number Right Score versus Obtained Proportion of Right Answers

Number Right   Observed Proportion of         Ability   Pi(θ) at
Score          Right Answers at This Score    Level     This Level
18             1.0                            +3        0.894
17             1.0                            +2.5      0.805
16             0.4                            +2        0.670
15             0.42                           +1.5      0.500
14             0.14                           +1        0.330
13             0.6                            +0.5      0.195
12             0.25                           0         0.106
11             0.09                           -0.5      0.055
10             0.2                            -1        0.028
9              0                              -1.5      0.014
8              0.5                            -2        0.007
7              1                              -2.5      0.003
6              0                              -3        0.002

Group Invariance of Item Parameters

The group invariance of the item parameters is a very powerful feature of IRT. It says that the values of the item parameters are a property of the item, not of the group that responded to the item. The item parameters can be estimated from any segment of the item response curve, which means that they can be estimated from any group of test takers. The term group invariance refers to this independence of the item parameter estimates from the distribution of ability. Thus, the item parameters are said to be group invariant.

Under CTT, by contrast, the item difficulty is the overall proportion of correct response to an item for a group of test takers. Thus, if an item with b = 0 were responded to by a low-ability group, few of the test takers would get it correct, and the item difficulty index under CTT would yield a low value, say 0.3, as the item difficulty for this group. If the same item were responded to by a high-ability group, most of the test takers would get it correct, and the item difficulty index under CTT would yield a high value, say 0.8, indicating that the item was easy for this group. Thus, the value of the item difficulty index under CTT is not group invariant.
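The contrast between the two definitions of difficulty can be seen numerically. The sketch below, in Python, keeps the IRT parameters of one item fixed (b = 0, with a = 1 assumed for illustration) and computes the expected CTT difficulty index, that is, the expected proportion of correct responses, for two hypothetical groups of test takers. The ability values in the two groups are invented for this illustration.

```python
import math

def p_correct(theta, b, a=1.0):
    """Logistic ICC for one item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ctt_difficulty(abilities, b, a=1.0):
    """Expected CTT difficulty index: mean probability of a correct
    response over the abilities present in the group."""
    return sum(p_correct(t, b, a) for t in abilities) / len(abilities)

# Hypothetical groups (assumed abilities, not data from the text).
LOW_GROUP = [-2.0, -1.5, -1.0, -0.5, 0.0]
HIGH_GROUP = [0.5, 1.0, 1.5, 2.0, 2.5]

low = ctt_difficulty(LOW_GROUP, b=0.0)
high = ctt_difficulty(HIGH_GROUP, b=0.0)
print(f"same item (b = 0): low group p = {low:.2f}, high group p = {high:.2f}")
```

For these assumed groups the index comes out near 0.29 and 0.80, close to the illustrative 0.3 and 0.8 mentioned above, even though b is identical in both calls.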
Item difficulty as defined under IRT is therefore easier to interpret: it has a consistent meaning that is independent of the group used to obtain its value.

Note

Even though the item parameters are group invariant, this does not mean that the numerical values of the item parameter estimates yielded by the maximum likelihood estimation procedure for two groups of test takers taking the same items will always be identical. The obtained numerical values are subject to variation due to sample size, how well structured the data are, and the goodness of fit of the curve to the data. In addition, the item must be used to measure the same latent trait for both groups. An item’s parameters do not retain group invariance when taken out of context, i.e., when used to measure a different latent trait or with test takers from a population for which the test is inappropriate.

Examples

Example 1: Let us assume that two groups of test takers are chosen from a population of test takers. The first group has a range of ability scores from -3 to -1, with a mean of -2. The second group has a range of ability scores from +1 to +3, with a mean of +2. The observed proportion of correct responses to a given item is computed from the item response data for every ability level within each of the two groups. These proportions of correct response are plotted in the graph below titled “Observed Proportions of Correct Response (Group 1)”:

Graph: Observed Proportions of Correct Response (Group 1)

The maximum likelihood procedure is then used to fit an ICC to the data, and numerical values of the item parameter estimates, b(1) = -0.37 and a(1) = 1.25, are obtained [b(1) indicates the value of b for group 1 and a(1) indicates the value of a for group 1].
The ICC defined by these estimates is then plotted over the range of ability encompassed by the first group, as shown in the graph below titled “ICC Fitted to Group 1 Data”:

Graph: ICC Fitted to Group 1 Data

The process is repeated for the second group. The observed proportions of correct response for group 2 are shown in the graph below titled “Observed Proportions of Correct Responses (Group 2)”:

Graph: Observed Proportions of Correct Responses (Group 2)

The maximum likelihood procedure is then used to fit an ICC to the data, and numerical values of the item parameter estimates, b(2) = -0.37 and a(2) = 1.25, are obtained [b(2) indicates the value of b for group 2 and a(2) indicates the value of a for group 2]. The ICC defined by these estimates is then plotted over the range of ability encompassed by the second group, as shown in the graph below:

Graph: ICC Fitted to Group 2 Data

The important observation here is that b(1) = b(2) and a(1) = a(2): the two groups yield the same values of the item parameters. Hence, the item parameters are group invariant.

While this result may seem a bit unusual, its validity can be demonstrated easily by considering the process used to fit an ICC to the observed proportions of correct response. Since the first group had a low average ability (-2), the ability levels spanned by group 1 encompass only a section of the curve, in this case the lower left tail. Consequently, the observed proportions of correct response range from very small to moderate values. When fitting a curve to this data, only the lower tail of the ICC is involved, as in the graph titled “ICC Fitted to Group 1 Data”. Since group 2 had a high average ability (+2), its observed proportions of correct response range from moderate to very near 1.
When fitting an ICC to this data, only the upper right-hand tail of the curve is involved, as shown in the earlier graph. Since the same item was administered to both groups, the two curve-fitting processes were dealing with the same underlying ICC. Consequently, the item parameters yielded by the two analyses should be the same. The output shown below in the graph titled “The ICC Fitted to Pooled Data, b = -0.37 and a = 1.25” integrates the two groups into a single representation, showing how the same ICC fits the two sets of proportions of correct response:

Graph: The ICC Fitted to Pooled Data, b = -0.37 and a = 1.25

Example 2: Let us use the BIRT software application to demonstrate group invariance. We choose an upper bound group and a lower bound group, the upper bound group from +3 to +1 and the lower bound group from -1 to -3. Separate ICCs are drawn for these two groups, and then a combined ICC is shown, starting with the Screen Shot titled “Lower & Upper Bound Values (Example 2)”:

Screen Shot: Lower & Upper Bound Values (Example 2)

The output shown below indicates the Lower Bound and Upper Bound values for the two groups of abilities:

The output shown below indicates the plots of the lower bound group of abilities:

The output shown below indicates the plots of the upper bound group of abilities:

The output shown below indicates the ICC of the item for the lower bound group of abilities:

The output shown below indicates the ICC of the item for the upper bound group of abilities:

The output shown below indicates the combined ICC of the item for the two groups of abilities (the item has the same item difficulty for the two groups of abilities):

It is observed that the individual ICCs and the combined ICC give rise to the same b values.

For a sample of 25 items administered to 1000 test takers, item #5 has been taken to illustrate group invariance. The approach is approximate: the relationship between a test taker’s ability (x-axis) and the probability of a correct answer (y-axis) is approximated by the relationship between the number right score and the proportion of correct responses at that score level. Hereinafter, the number right score is taken as an approximation of ability, and the proportion of correct answers at a score level as an approximation of the probability of a correct response.

Three pairs of groups are compared: the higher ability group (HAG) and the lower ability group (LAG); the top half and the bottom half of the group; and test takers with odd-numbered and even-numbered scores. For each comparison the two groups are isolated, and a curve relating the number right score (ability) to the proportion of right answers (probability of a correct answer) is fitted to each using the Curve Expert software. The examples that follow illustrate these three methods.

1. The Higher Ability Group (HAG), the top 27% of the total group, and the Lower Ability Group (LAG), the bottom 27%, are considered as two groups. For item #5 the proportions of right answers at each score in HAG and LAG are given below:

Group   Score   Corres. Z-Score   Number Getting This Score   No. of People Getting This Qn. Right   Proportion Getting the Right Answer
HAG     22      3.339             2                           2                                      1
HAG     21      3.028             3                           3                                      1
HAG     20      2.717             10                          8                                      0.8
HAG     19      2.406             24                          20                                     0.833
HAG     18      2.096             40                          33                                     0.825
HAG     17      1.786             79                          60                                     0.759
HAG     16      1.475             151                         116                                    0.768
HAG     15      1.165             185                         123                                    0.665
HAG     14      0.854             309                         196                                    0.634
HAG     13      0.543             90                          59                                     0.656
LAG     9       -0.699            228                         96                                     0.421
LAG     8       -1.001            276                         84                                     0.304
LAG     7       -1.319            160                         56                                     0.35
LAG     6       -1.630            98                          26                                     0.265
LAG     5       -1.941            71                          19                                     0.268
LAG     4       -2.252            27                          4                                      0.148
LAG     3       -2.562            22                          2                                      0.091
LAG     2       -2.873            10                          1                                      0.1
LAG     1       -3.183            3                           0                                      0

i. Interpretation of the Best Curve Fit to Q.5 to determine Item Difficulty using HAG and LAG

The best fit to the HAG data for Q.5 is a LINEAR model:

Y = a + b*x, where a = 0.10769 and b = 0.040946,

x = total score achieved by the test takers and y = proportion of test takers getting Q.5 right when their score is x.

Checking the model at x = 22, we get y = 1.00851, which is very close to the actual value. Hence it can be concluded that the model is very accurate. The correlation coefficient is 0.92574 and the standard error is 0.10724. Setting y = 0.5 gives x = 9.581.

The best fit to the LAG data for Q.5 is also a LINEAR model:

Y = a + b*x, where a = -0.01833 and b = 0.04734,

with x and y defined as above.

Checking the model at x = 22, we get y = 1.02334, which is not very close to the actual value, but still acceptable. Hence it can be concluded that the model is moderately accurate. The correlation coefficient is 0.97291 and the standard error is 0.036129. Setting y = 0.5 gives x = 10.9471.

Incidentally, the Z-scores for these number right scores are also worked out. The proportion of test takers getting the right answer is plotted against the number right score separately for HAG and LAG.
Using the Curve Expert software, the best fit is found for each group. Both fits appear to agree with the proportions of right answers for the total group. At a proportion of 0.5 (taken to correspond approximately to a probability of 0.5 of getting the right answer), the number right score (approximately indicating the ability) is calculated in each of these groups. Within the limits of the standard error, these scores compare very well with the number right score of the total group at a proportion of 0.5. The values are 8.81 for the whole group, 9.030 for HAG and 9.047 for LAG. The graphs for LAG, HAG and the whole data are respectively shown below:

2. The top half and the bottom half of the total population of test takers are taken as two groups, following the same procedure. The results and the graphs are shown below:

Group         Score   Corres. Z-Score   Number Getting This Score   Number Getting This Qn. Right   Proportion Getting the Right Answer
TOP Half      22      3.338             2                           2                               1
TOP Half      21      3.028             3                           3                               1
TOP Half      20      2.717             10                          8                               0.8
TOP Half      19      2.407             24                          20                              0.833
TOP Half      18      2.096             40                          33                              0.825
TOP Half      17      1.786             79                          60                              0.759
TOP Half      16      1.475             151                         116                             0.768
TOP Half      15      1.1646            185                         123                             0.665
TOP Half      14      0.854             311                         196                             0.630
TOP Half      13      0.543             384                         230                             0.599
TOP Half      12      0.233             384                         206                             0.536
BOTTOM Half   11      -0.078            409                         194                             0.474
BOTTOM Half   10      -0.388            355                         160                             0.451
BOTTOM Half   9       -0.699            308                         119                             0.386
BOTTOM Half   8       -1.009            276                         84                              0.304
BOTTOM Half   7       -1.319            160                         56                              0.35
BOTTOM Half   6       -1.630            98                          26                              0.265
BOTTOM Half   5       -1.941            71                          19                              0.268
BOTTOM Half   4       -2.251            27                          4                               0.148
BOTTOM Half   3       -2.562            22                          2                               0.091
BOTTOM Half   2       -2.873            10                          1                               0.1
BOTTOM Half   1       -3.183            3                           0                               0

ii. Interpretation of the Best Curve Fit to Q.5 to determine Item Difficulty using Top and Bottom Half Data

The best fit to the TOP half data for Q.5 is a LINEAR model:

Y = a + b*x, where a = 0.00523 and b = 0.04467,

x = total score achieved by the test takers and y = proportion of test takers getting Q.5 right when their score is x.

Checking the model at x = 22, we get y = 0.98802, which is very close to the actual value. Hence it can be concluded that the model is very accurate. The correlation coefficient is 0.989188 and the standard error is 0.04055. Setting y = 0.5 gives x = 11.0754.

The best fit to the BOTTOM half data for Q.5 is also a LINEAR model:

Y = a + b*x, where a = -0.01254 and b = 0.04527,

with x and y defined as above.

Checking the model at x = 22, we get y = 1.05504, which is not very close to the actual value, but still acceptable. Hence it can be concluded that the model is moderately accurate. The correlation coefficient is 0.98353 and the standard error is 0.03247. Setting y = 0.5 gives x = 11.3204.

The graphs for the top and bottom half data are respectively shown below:

3. Out of the total population of test takers, those who secure odd-numbered scores and those who secure even-numbered scores are taken as two distinct groups. The same procedure is followed for these groups.
The values are shown below:

Group         Score   Corres. Z-Score   Number Getting This Score   Number Getting This Qn. Right   Proportion Getting the Right Answer
EVEN Scores   6       -1.630            98                          26                              0.265
EVEN Scores   8       -1.001            276                         84                              0.304
EVEN Scores   10      -0.389            355                         160                             0.451
EVEN Scores   12      0.233             384                         206                             0.536
EVEN Scores   14      0.854             311                         196                             0.630
EVEN Scores   16      1.475             151                         116                             0.768
EVEN Scores   18      2.096             40                          33                              0.825
EVEN Scores   20      2.717             10                          8                               0.8
EVEN Scores   22      3.339             2                           2                               1
ODD Scores    7       -1.319            160                         56                              0.35
ODD Scores    9       -0.698            308                         119                             0.386
ODD Scores    11      -0.077            409                         194                             0.474
ODD Scores    13      0.543             384                         230                             0.598
ODD Scores    15      1.164             185                         123                             0.665
ODD Scores    17      1.786             79                          60                              0.759
ODD Scores    19      2.407             24                          20                              0.833
ODD Scores    21      3.028             3                           3                               1

iii. Interpretation of the Best Curve Fit to Q.5 to determine Item Difficulty using Even and Odd Data

The best fit to the even-score data for Q.5 is a LINEAR model:

Y = a + b*x, where a = -0.00554 and b = 0.04472,

x = total score achieved by the test takers and y = proportion of test takers getting Q.5 right when their score is x.

Checking the model at x = 22, we get y = 0.978463, which is very close to the actual value. Hence it can be concluded that the model is very accurate. The correlation coefficient is 0.991148 and the standard error is 0.04322. Setting y = 0.5 gives x = 11.3027.

The best fit to the odd-score data for Q.5 is a RATIONAL FUNCTION model:

Y = (a + b*x)/(1 + c*x + d*x^2), where a = 0.000305, b = 0.06334, c = 0.062575 and d = -0.00219,

with x and y defined as above.

Checking the model at x = 22, we get y = 1.06015, which is very close to the actual value. Hence it can be concluded that the model is very accurate. The correlation coefficient is 0.99791 and the standard error is 0.024406. Setting y = 0.5 gives x = 11.2546.
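The linear fit and the "score at proportion 0.5" step can be sketched as follows. This is a minimal illustration using the even-score proportions from the table above and an ordinary unweighted least-squares line from NumPy; it is not the Curve Expert procedure, so the coefficients come out close to, but not identical with, the values reported in the text. The variable names are chosen here for illustration.

```python
import numpy as np

# Even number-right scores and the proportion answering item 5 correctly.
scores = np.array([6, 8, 10, 12, 14, 16, 18, 20, 22], dtype=float)
proportions = np.array([0.265, 0.304, 0.451, 0.536, 0.630,
                        0.768, 0.825, 0.8, 1.0])

# Unweighted least-squares line y = intercept + slope * x.
slope, intercept = np.polyfit(scores, proportions, 1)

# Score at which the fitted proportion equals 0.5, used in the text as a
# rough stand-in for the item difficulty on the number-right scale.
x_half = (0.5 - intercept) / slope

print(f"slope={slope:.5f}, intercept={intercept:.5f}, x at y=0.5: {x_half:.2f}")
```

Note that this fit gives every score level equal weight, although the score groups differ greatly in size; weighting by the number of test takers at each score would be a reasonable refinement.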
The graphs for even and odd scores are respectively shown below:

Estimating a Test Taker’s Ability

The primary purpose of administering a test to a test taker, under IRT, is to locate the test taker on the ability scale. By performing this process the test taker can be evaluated in terms of how much underlying ability he or she possesses. Following this, comparisons among test takers can be made to assign grades, award scholarships etc. In this chapter we will focus upon the test takers and the procedures for estimating an ability score (parameter) for a test taker.

The test used to measure an unknown latent trait will consist of N items, each of which measures some facet of the trait. We have earlier dealt with item parameters and their estimation, and while doing that we assumed that the ability parameter of each test taker was known. Conversely, to estimate a test taker’s unknown ability parameter, we will assume that the numerical values of the item parameters are known. A direct consequence of this assumption is that the metric of the ability scale will be the same as the metric of the known item parameters.

As seen earlier, when the test is taken, a test taker responds to each of the N items in the test, and the responses are dichotomously scored. The result is a score of either 1 or 0 for each item in the test. This set of 1’s and 0’s for the N items is called the test taker’s item response vector. The item response vector thus obtained and the known item parameters are then used to estimate the test taker’s unknown ability parameter.

Ability Estimation Parameters

In IRT, maximum likelihood procedures are used to estimate a test taker’s ability. This procedure is an iterative process, as in the case of estimating item parameters.
It begins with some a priori value for the ability of the test taker and the known values of the item parameters. These are used to compute the probability of correct response to each item for that test taker. Then an adjustment to the ability estimate is obtained that improves the agreement of the computed probabilities with the test taker’s item response vector. The process is repeated until the adjustment becomes small enough that the change in the estimated ability is negligible. The result is an estimate of the test taker’s ability parameter. This process is then repeated separately for each test taker taking the test. The estimation equation is as shown below:

$$\hat{\theta}_{s+1} = \hat{\theta}_s + \frac{\sum_{i=1}^{N} a_i \left[ u_i - P_i(\hat{\theta}_s) \right]}{\sum_{i=1}^{N} a_i^2 \, P_i(\hat{\theta}_s) \, Q_i(\hat{\theta}_s)}$$

where

$\hat{\theta}_s$ = the estimated ability of the test taker within iteration s

$a_i$ = the discrimination parameter of item i, where i = 1, 2, 3, ..., N

$u_i$ = the response made by the test taker to item i (1 for a correct response, 0 for an incorrect response)

$P_i(\hat{\theta}_s)$ = the probability of a correct response to item i, under the given ICC model, at ability level $\hat{\theta}$ within iteration s

$Q_i(\hat{\theta}_s) = 1 - P_i(\hat{\theta}_s)$ = the probability of an incorrect response to item i, under the given ICC model, at ability level $\hat{\theta}$ within iteration s

Initially, $\hat{\theta}_s$ on the right-hand side of the equal sign is set to some arbitrary value, such as 1. The probability of correct response to each of the N items in the test is calculated at this ability level using the known item parameters in the given ICC model. Then the second term to the right of the equal sign is evaluated. This is the adjustment term, denoted by $\Delta\hat{\theta}_s$. The value of $\hat{\theta}_{s+1}$ on the left side of the equal sign is obtained by adding $\Delta\hat{\theta}_s$ to $\hat{\theta}_s$. This value $\hat{\theta}_{s+1}$ becomes $\hat{\theta}_s$ in the next iteration. The numerator of the adjustment term contains the essence of the procedure.
It should be noted that $(u_i - P_i(\hat{\theta}_s))$ is the difference between the test taker’s item response and the probability of correct response at an ability level of $\hat{\theta}_s$. As the ability estimate gets closer to the test taker’s ability, the sum of the differences between $u_i$ and $P_i(\hat{\theta}_s)$ gets smaller. Thus, the goal is to find the ability estimate yielding values of $P_i(\hat{\theta}_s)$ for all items simultaneously that minimizes this sum. When this happens, the $\Delta\hat{\theta}_s$ term becomes as small as possible and the value of $\hat{\theta}_{s+1}$ will not change from iteration to iteration. This final value of $\hat{\theta}_{s+1}$ is then used as the test taker’s estimated ability. The ability estimate will be in the same metric as the numerical values of the item parameters. A point to be noted here is that the estimation equation given above can be used with all three ICC models, although the Three Parameter model requires a slight modification.

Let us illustrate the ability estimation process by looking into a test of 5 items administered under the Two Parameter model to 10 test takers. Under this model the known item parameters are as shown in the table below titled “Item Parameters”.

Table: Item Parameters

Item   b        a
1      +1.499   1.42
2      +0.424   1.42
3      +2.292   1.42
4      +0.920   1.42
5      -0.279   1.42

The test takers’ responses are given in the table below titled “Test Taker Responses”:

Table: Test Taker Responses

Test takers: Test Taker1, Test Taker2, Test Taker3, Test Taker4, Test Taker5, Test Taker6, Test Taker7, Test Taker8, Test Taker9, Test Taker10

Item1 Item2 Item3: 1 1 1 0 1 1 1 0 0 0 0 1 1 1 0 1 0 0 1 0 0 0 0 1 1 1 0 1 0 0
Item4: 1 1 1 0 0 0 0 0 0 0
Item5: 1 1 1 0 1 1 0 1 0 0

Let us look at test taker 3. The ui values are given below:

Test Taker3   Item1   Item2   Item3   Item4   Item5
Response      0       1       0       1       1
ui            0       1       0       1       1

The a priori estimate of ability for test taker #3 is set to $\hat{\theta}_s$ = 1.0.
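The iteration defined by the estimation equation can be sketched in a few lines. This is a minimal illustration under the Two Parameter logistic model, using the item parameters from the “Item Parameters” table and a starting value of 1.0. The response pattern u = (1, 0, 0, 1, 1) is an assumption: it is read from the u column of the worked iterations for this example. The function names, tolerance and iteration cap are choices made here, not part of the book.

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(u, a, b, theta=1.0, tol=1e-4, max_iter=25):
    """Repeat theta <- theta + sum(a(u-P)) / sum(a^2 P Q) until the step is tiny."""
    for _ in range(max_iter):
        p = [icc_2pl(theta, ai, bi) for ai, bi in zip(a, b)]
        numerator = sum(ai * (ui - pi) for ai, ui, pi in zip(a, u, p))
        denominator = sum(ai ** 2 * pi * (1.0 - pi) for ai, pi in zip(a, p))
        step = numerator / denominator   # the adjustment term
        theta += step
        if abs(step) < tol:
            break
    return theta

b = [1.499, 0.424, 2.292, 0.920, -0.279]   # item difficulties from the table
a = [1.42] * 5                             # common discrimination from the table
u = [1, 0, 0, 1, 1]                        # assumed response pattern (see lead-in)

theta_hat = estimate_ability(u, a, b)
print(round(theta_hat, 3))
```

With these inputs the estimate settles near 1.35 after three or four iterations. The sketch does not guard against all-correct or all-incorrect response vectors, for which the iteration has no finite solution.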
The three iterations are shown in the tables below.

First iteration ($\hat{\theta}_s$ = 1)

Item No   b       u   a      θ̂s   -a(θ-b)   e^(-a(θ-b))   p=1/(1+e^(-a(θ-b)))   q=1-p   a(u-p)   a^2*p*q
1         1.50    1   1.42   1     0.71      2.033         0.329                 0.670   0.952    0.445
2         0.42    0   1.42   1     -0.823    0.438         0.695                 0.305   -0.986   0.427
3         2.29    0   1.42   1     1.831     6.245         0.138                 0.862   -0.195   0.239
4         0.92    1   1.42   1     -0.113    0.892         0.528                 0.471   0.669    0.502
5         -0.28   1   1.42   1     -1.817    0.162         0.860                 0.139   0.198    0.242
Sum                                                                                      0.637    1.857
Correction factor: 0.343
Next estimate θ̂s+1: 1.343

Second iteration ($\hat{\theta}_s$ = 1.343)

Item No   b       u   a      θ̂s      -a(θ-b)   e^(-a(θ-b))   p=1/(1+e^(-a(θ-b)))   q=1-p   a(u-p)   a^2*p*q
1         1.50    1   1.42   1.343   0.222     1.249         0.444                 0.555   0.788    0.498
2         0.42    0   1.42   1.343   -1.310    0.269         0.787                 0.212   -1.118   0.337
3         2.29    0   1.42   1.343   1.344     3.837         0.206                 0.793   -0.293   0.331
4         0.92    1   1.42   1.343   -0.600    0.548         0.645                 0.354   0.503    0.461
5         -0.28   1   1.42   1.343   -2.304    0.099         0.909                 0.090   0.129    0.166
Sum                                                                                        0.009    1.793
Correction factor: 0.005
Next estimate θ̂s+1: 1.348

Third iteration ($\hat{\theta}_s$ = 1.348)

Item No   b       u   a      θ̂s      -a(θ-b)   e^(-a(θ-b))   p=1/(1+e^(-a(θ-b)))   q=1-p   a(u-p)   a^2*p*q
1         1.50    1   1.42   1.348   0.215     1.240         0.446                 0.553   0.786    0.498
2         0.42    0   1.42   1.348   -1.317    0.267         0.788                 0.211   -1.120   0.335
3         2.29    0   1.42   1.348   1.337     3.810         0.207                 0.792   -0.295   0.332
4         0.92    1   1.42   1.348   -0.607    0.544         0.647                 0.352   0.500    0.460
5         -0.28   1   1.42   1.348   -2.311    0.099         0.909                 0.090   0.128    0.165
Sum                                                                                        -0.0003  1.791
Correction factor: -0.000
Next estimate θ̂s+1: 1.347

At this point, the procedure is terminated because the adjustment (about -0.0002) is negligible. Thus, the test taker’s estimated ability is 1.347.

The test taker’s actual ability parameter cannot be known; the best one can do is estimate it. However, this does not prevent us from conceptualizing such a parameter. Fortunately, one can obtain a standard error of the estimated ability that provides some indication of the precision of the estimate.
The underlying principle is that a test taker, hypothetically, could take the same test a large number of times, assuming that he does not remember how he answered the previous test items. An ability estimate $\hat{\theta}$ would be obtained from each testing. The standard error is a measure of the variability of the values of $\hat{\theta}$ around the test taker’s unknown parameter value θ. In the present case, an estimated standard error can be computed using the equation given below:

$$S.E.(\hat{\theta}) = \frac{1}{\sqrt{\sum_{i=1}^{N} a_i^2 \, P_i(\hat{\theta}) \, Q_i(\hat{\theta})}}$$

In the equation given above, the term under the square root sign is the denominator of the estimation equation. As a result, the estimated standard error can be obtained as a side product of estimating the test taker’s ability. In the illustrated example given above, it is calculated as shown below:

$$S.E.(\hat{\theta}) = 1 / \sqrt{1.793} \approx 0.747$$

Thus, the test taker’s ability is not estimated very precisely, because the standard error of about 0.75 is very large. This is primarily due to the fact that only five items were used here, and one would not expect a very good estimate from so few items. Looking into the PH3 output of BILOG, test taker #3 has an ability of 1.266 as against 1.347.

There are two cases for which the maximum likelihood estimation procedure fails to yield an ability estimate. First, when a test taker answers none of the items correctly, the corresponding ability estimate is negative infinity. Second, when a test taker answers all the items in the test correctly, the corresponding ability estimate is positive infinity. In both of these cases it is impossible to obtain an ability estimate for the test taker (the computer literally cannot compute a number as big as infinity). Consequently, the computer programs used to estimate ability must protect themselves against these two conditions.
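The standard error calculation and the check for the two degenerate cases can be sketched as follows. This is a minimal illustration under the Two Parameter logistic model with the item parameters of the worked example; the function names are hypothetical and the guard shown is only one simple way a program might protect itself.

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ability_standard_error(theta_hat, a, b):
    """1 / sqrt(sum of a^2 * P * Q), evaluated at the ability estimate."""
    total = 0.0
    for ai, bi in zip(a, b):
        p = icc_2pl(theta_hat, ai, bi)
        total += ai ** 2 * p * (1.0 - p)
    return 1.0 / math.sqrt(total)

def has_finite_estimate(u):
    """False for zero or perfect scores, where the estimate is -inf or +inf."""
    return 0 < sum(u) < len(u)

b = [1.499, 0.424, 2.292, 0.920, -0.279]
a = [1.42] * 5

se = ability_standard_error(1.347, a, b)
print(round(se, 3))
print(has_finite_estimate([0, 0, 0, 0, 0]), has_finite_estimate([1, 1, 1, 1, 1]))
```

Because the sum under the square root grows as items are added, a longer test with the same kind of items would give a smaller standard error.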
When such programs detect either a test score of zero or a perfect test score, they eliminate the test taker from further analysis and set the estimated ability to some symbol such as ****** to indicate what has happened.

Item Invariance of a Test Taker’s Ability Estimate

Another basic principle of IRT is that the test taker’s ability is invariant with respect to the items used to determine it. This principle rests upon two conditions: first, all the items measure the same underlying latent trait; second, the values of all the item parameters are in a common metric.

To illustrate this principle, assume that a test taker has an ability score of zero, which places him at the middle of the ability scale. Now, if a set of ten items having an average difficulty of -2 were administered to this test taker, the item responses could be used to estimate the test taker’s ability, yielding $\hat{\theta}_1$ for this test. Then if a second set of ten items having an average difficulty of +1 were administered to this test taker, these item responses could be used to estimate the test taker’s ability, yielding $\hat{\theta}_2$ for this second test. Under the item invariance principle, $\hat{\theta}_1 = \hat{\theta}_2$; that is, the two sets of items should yield the same ability estimate, within sampling variation, for the test taker. In addition, there is no requirement that the discrimination parameters be the same for the two sets of items.

This principle is just a reflection of the fact that the ICC spans the whole ability scale. Just as any sub-range of the ability scale can be used in the estimation of item parameters, the corresponding segments of several ICCs can be used to estimate a test taker’s ability. Items with a high average difficulty will have a point on their ICCs that corresponds to the ability of interest. Similarly, items with a low average difficulty will have a point on their ICCs that corresponds to the ability of interest.
Consequently, either set of items can be used to estimate the ability of test takers at that point. In each set, a different part of the ICC is involved, but that is acceptable.

The practical implication of this principle is that a test located anywhere along the ability scale can be used to estimate a test taker’s ability. For instance, a test taker could take a test that is “easy” or a test that is “hard” and obtain, on the average, the same estimated ability. This is in sharp contrast to CTT, where such a test taker would get a high test score on the easy test and a low score on the hard test, and there would be no way of ascertaining the test taker’s underlying ability. Under IRT, the test taker’s ability is fixed and invariant with respect to the items used to measure it.

A word of caution is needed about the meaning of the word “fixed”: a test taker’s ability is fixed only in the sense that it has a particular value in a given context. For example, if a test taker took the same test several times, assuming he does not remember the items or the responses from test to test, then the test taker’s ability would be fixed. However, if the test taker received remedial instruction between the tests, or if there were carryover effects, the test taker’s underlying ability level would be different for each testing. Thus, the test taker’s underlying ability level is not immutable. There are a number of applications of IRT that depend upon a test taker’s ability level changing as a function of changes in the educational context.

The item invariance of a test taker’s ability and the group invariance of an item’s parameters are two facets of the invariance principle of IRT. This principle is the basis for a number of practical applications of the theory.

A twenty-item test administered to 76 test takers yielded true scores for each one of them.
The test takers and their true scores are listed in the table below titled “True Scores of 76 Test Takers”. Test takers with the same number right have identical values in every column and are shown in one row.

Table: True Scores of 76 Test Takers

Test Takers   No Tried   No Right   Ability   True Score   Odd TS   Even TS
1-4           20         18         1.4636    11.77        5.82     5.95
5-8           20         17         1.1712    11.31        5.58     5.72
9-13          20         16         0.8881    10.85        5.35     5.50
14-25         20         15         0.6133    10.41        5.13     5.28
26-32         20         14         0.3456    9.97         4.91     5.06
33-43         20         13         0.0839    9.54         4.70     4.85
44-50         20         12         -0.1726   9.13         4.49     4.64
51-61         20         11         -0.4247   8.72         4.29     4.44
62-66         20         10         -0.6736   8.33         4.09     4.24
67-72         20         9          -0.9203   7.94         3.90     4.04
73-74         20         8          -1.1654   7.56         3.71     3.85
75            20         7          -1.4095   7.19         3.52     3.67
76            20         6          -1.6539   6.83         3.34     3.48

Let us look at a test taker having a score of 15. His true score works out to be 10.41. In terms of percentage, this is 10.41/20 = 52.05%. When the same test taker takes only the odd-numbered items, the true score comes out to be 5.13. Similarly, his true score on the even-numbered items comes out to be 5.28. In terms of percentages, these are 5.13/10 = 51.3% and 5.28/10 = 52.8% respectively. The error in the estimated percentage if he takes only the even-numbered items is 52.8 - 52.05 = 0.75 percentage points, and if he takes only the odd-numbered items it is 52.05 - 51.3 = 0.75 percentage points. This is negligible and can be attributed to the small number of items. The example thus illustrates item invariance: a test taker’s true score, expressed as a percentage, does not depend materially on which items he takes.

Test Calibration

So far, the metric of the ability scale has been assumed to be known, and the numerical values of the item parameters and the test takers’ ability parameters have been expressed in this metric. Test constructors, while writing an item, know what trait they want the item to measure and whether the item is designed to function among low, medium or high ability test takers. But it is not possible to determine the values of the item’s parameters a priori. In addition, when a test is administered to a group of test takers, it is not known in advance how much of the latent trait each of the test takers possesses. As a result, a major task is to determine the values of the item parameters and the test takers’ abilities in a metric for the underlying latent trait. In IRT, this task is called test calibration, and it provides a frame of reference for interpreting test results.

Test calibration is accomplished by administering a test to a group of M test takers and dichotomously scoring their responses to the N items. Then mathematical procedures are applied to the item response data in order to create an ability scale that is unique to the particular combination of test items and test takers. Then the values of the item parameter estimates and the test takers’ estimated abilities are expressed in this metric. Once this is accomplished, the test has been calibrated, and the test results can be interpreted through the constructs of IRT.

Test Calibration Process

The procedure used to calibrate a test was proposed by Birnbaum in 1968 and has been implemented in widely used computer programs such as BICAL (Wright and Mead, 1976) and LOGIST (Wingersky, Barton and Lord, 1982). The Birnbaum paradigm is an iterative procedure employing two stages of maximum likelihood estimation. In one stage, the parameters of the N items in the test are estimated, and in the second stage, the ability parameters of the M test takers are estimated.
The two stages are performed iteratively until stable sets of parameter estimates are obtained. At this point, the test has been calibrated and an ability scale metric defined.

Within the first stage of the Birnbaum paradigm, the estimated ability of each test taker is treated as if it is expressed in the true metric of the latent trait. Then the parameters of each item in the test are estimated via the maximum likelihood procedure. This is done one item at a time, because an underlying assumption is that the items are independent of each other. The result is a set of values for the estimates of the parameters of the items in the test.

The second stage assumes that the item parameter estimates yielded by the first stage are actually the values of the item parameters. Then, the ability of each test taker is estimated using the maximum likelihood procedure. It is assumed that the ability of each test taker is independent of all other test takers. Hence, the ability estimates are obtained one test taker at a time.

The two-stage process is repeated until some suitable convergence criterion is met. The overall effect is that the parameters of the N test items and the ability levels of the M test takers have been estimated simultaneously, even though they were done one at a time. This clever paradigm reduces a very complex estimation problem to one that can be implemented on a computer.

The Metric Problem

An unfortunate feature of the Birnbaum paradigm is that it does not yield a unique metric for the ability scale. This means that the midpoint and the unit of measurement of the obtained ability scale are indeterminate, implying that many different values work equally well. In technical terms, the metric is unique up to a linear transformation. As a result, it is necessary to “anchor” the metric via arbitrary rules for determining the midpoint and unit of measurement of the ability scale.
How this is done is up to the persons implementing the Birnbaum paradigm in a computer program. In the BICAL computer program, this anchoring process is performed after the first stage is completed. Thus, each of the two stages within an iteration is performed using a slightly different ability scale metric. As the overall iterative process converges, the metric of the ability scale also converges to a particular midpoint and unit of measurement. The crucial feature of this process is that the resulting ability scale metric depends upon the specific set of items constituting the test and the responses of a particular group of test takers to that test. It is not possible to obtain estimates of the test takers' abilities and of the item parameters in the true metric of the underlying latent trait. The best we can do is to obtain a metric that depends upon a particular combination of test takers and test items.

Summary of the Test Calibration Process

To obtain calibrated items, one has to:

• Write them,
• Estimate their parameters, and
• Make sure that the estimates are on the same scale.

The end product of the test calibration process is the definition of an ability scale metric. Under the Rasch model, this scale has a unit of measurement of 1 and a midpoint of zero. Superficially this looks exactly the same as the ability scale metric used in previous chapters. However, it is not the metric of the underlying latent trait. The obtained metric depends upon the item responses yielded by a particular combination of test takers and test items being subjected to the Birnbaum paradigm. Since the true metric of the underlying latent trait cannot be determined, the metric yielded by the Birnbaum paradigm is used as if it were the true metric. The obtained item difficulty values and the test takers' abilities are interpreted in this metric. Thus, the test has been calibrated.
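The two-stage procedure and the anchoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BICAL or LOGIST implementation: it assumes the one-parameter (Rasch) model, anchors the scale by forcing the mean item difficulty to zero after the item stage, and uses a small made-up response matrix. The names `p_correct` and `calibrate` are hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch probability that ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def calibrate(responses, max_cycles=500, tol=1e-6, newton_steps=5):
    """Alternate item and ability estimation until the item difficulties stabilise."""
    n_items = len(responses[0])
    thetas = [0.0] * len(responses)
    bs = [0.0] * n_items
    for _cycle in range(max_cycles):
        previous = list(bs)
        # Stage 1: estimate each item difficulty, treating the abilities as known.
        for i in range(n_items):
            for _step in range(newton_steps):
                ps = [p_correct(t, bs[i]) for t in thetas]
                gradient = sum(p - row[i] for p, row in zip(ps, responses))
                information = sum(p * (1.0 - p) for p in ps)
                bs[i] += gradient / information
        # Anchor the metric (an assumed rule): mean item difficulty is set to zero.
        mean_b = sum(bs) / n_items
        bs = [b - mean_b for b in bs]
        # Stage 2: estimate each ability, treating the difficulties as known.
        for j, row in enumerate(responses):
            for _step in range(newton_steps):
                ps = [p_correct(thetas[j], b) for b in bs]
                gradient = sum(u - p for u, p in zip(row, ps))
                information = sum(p * (1.0 - p) for p in ps)
                thetas[j] += gradient / information
        if max(abs(new - old) for new, old in zip(bs, previous)) < tol:
            break
    return thetas, bs

# Made-up data: 8 test takers (rows) by 4 items (columns), 1 = correct.
# No test taker has a zero or perfect score, and no item is all right or all wrong,
# because maximum likelihood estimates do not exist for such rows or columns.
responses = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]
thetas, bs = calibrate(responses)
print("item difficulties:", [round(b, 3) for b in bs])
print("abilities:", [round(t, 3) for t in thetas])
```

Adding a constant to every ability and every difficulty would leave all the probabilities unchanged, which is why some anchoring rule is needed; a different rule would give a different but equally valid set of numbers.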
The outcome of the test calibration procedure is to locate each test taker and item along the obtained ability scale. In the present example, item 5 had a difficulty of -1 and test taker 10 had an ability estimate of -0.91. Therefore, the probability of test taker 10 answering item 5 correctly is approximately 0.5. The capability to locate items and test takers along a common scale is a powerful feature of item response theory. This feature allows one to interpret the results of a test calibration within a single framework and provides meaning to the values of the parameter estimates.

The Likelihood Function

A test taker taking a test of k items under the Single Parameter model can obtain one of k + 1 observed scores (0, 1, ..., k). However, the number of possible responses to the test (the response patterns) is much larger: 2^k. For a test of 5 items, there are 32 distinct response patterns. Each of them has a certain probability. Because every test taker must have some response pattern and the response patterns are mutually exclusive, their probabilities will sum to 1. This is true for the data set as a whole, and it is also true at any specific level of ability.

Let us look at how to calculate the probability that a test taker of ability θj will respond to the test with a certain pattern, e.g. (True, True, False, True, False). We already know how to calculate the probability of each response in the pattern separately: P(θj, b1), P(θj, b2), ..., Q(θj, b5), but what is their joint probability? IRT makes the important assumption of local independence. This means that the responses given to the separate items in a test are mutually independent, when ability is given. The actually observed responses may be correlated, even strongly correlated, but this is only because the responses of test takers with widely different abilities have been put together, ignoring ability.
If we consider only test takers having the same latent ability, the correlations between the responses are supposed to vanish. Now, because P(θj, b1), P(θj, b2), ..., Q(θj, b5) are probabilities conditional on θj, local independence (the responses to the individual items are mutually independent given θ) lets us multiply them to obtain the probability of the whole pattern. The function

$$L(\theta) = \prod_{i} P_i(\theta, b_i)^{u_i} \, Q_i(\theta, b_i)^{1-u_i}$$

where u_i ∈ {0, 1} is the score on item i, is called the likelihood function. It is the probability of a response pattern given the ability θ and, of course, the item parameters. There is one likelihood function for each response pattern, and the sum of all such functions equals 1 at any value of θ.

The likelihood is in fact a probability. The subtle difference between the two concepts has more to do with how we use them than with what they really are. Probabilities usually point from a theoretically assumed quantity to the data that may be expected to emerge: thus, the IRT model predicts the probability of any response to a test given the true ability of the test taker. The likelihood works in the opposite direction: it is used by the same IRT model to predict latent ability from the observed responses.

The Maximum Likelihood Estimate of Ability

Let us look at a more conventional approach to ability estimation, which is based on the principle of maximum likelihood. The ability, say θ̂, which has the highest likelihood given the observed pattern (and the item parameters), will become the ability estimate.

Graph: Finding The Ability Estimates by Maximum Likelihood

In the graph above titled “Finding The Ability Estimates by Maximum Likelihood”, the likelihood functions for the response patterns (T,F,F,F,F), (T,T,F,F,F), (T,T,T,F,F), and (T,T,T,T,F) are shown in blue.
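The likelihood function and its maximum can be sketched in Python as follows. The five item difficulties are hypothetical values chosen for illustration, and the grid search in `ml_estimate` is just one simple way of locating the peak, assumed here for clarity. The sketch checks two properties stated in the text: the likelihoods of all 2^5 patterns sum to 1 at a fixed ability, and under the Single Parameter model two patterns with the same total score peak at the same ability.

```python
import math
from itertools import product

def p_correct(theta, b):
    """One-parameter probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def likelihood(theta, pattern, difficulties):
    """Probability of a whole response pattern at ability theta (local independence)."""
    value = 1.0
    for u, b in zip(pattern, difficulties):
        p = p_correct(theta, b)
        value *= p if u == 1 else (1.0 - p)
    return value

def ml_estimate(pattern, difficulties, low=-6.0, high=6.0, steps=12000):
    """Grid search for the ability with the highest likelihood."""
    grid = [low + (high - low) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda theta: likelihood(theta, pattern, difficulties))

# Hypothetical difficulties for a 5-item test, ordered from easiest to hardest.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# The 32 pattern likelihoods sum to 1 at any fixed ability (0.7 is arbitrary).
total = sum(likelihood(0.7, pat, difficulties) for pat in product((0, 1), repeat=5))

# Two patterns with a total score of 1: only the easiest or only the hardest item right.
easiest_right = (1, 0, 0, 0, 0)
hardest_right = (0, 0, 0, 0, 1)
est_easy = ml_estimate(easiest_right, difficulties)
est_hard = ml_estimate(hardest_right, difficulties)
print("sum over all patterns:", round(total, 6))
print("estimates:", round(est_easy, 3), round(est_hard, 3))
```

The two likelihood curves have the same peak location but different heights: getting only the easiest item right is a far more probable pattern than getting only the hardest item right.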
It is easy to see that the likelihood functions peak exactly at the ability estimates found earlier. Hence, maximum likelihood will produce the same estimates of ability as the previous method. In the Single Parameter model, the ability estimate depends only on how many items were answered correctly, not on which items got the correct responses. This does not mean that the likelihood functions are invariant to the response pattern; it only means that the likelihood functions for patterns having the same number of correct responses peak at the same ability level.

Graph: Likelihood Functions For Various Response Patterns Having A Total Score of 1

The above graph titled “Likelihood Functions For Various Response Patterns Having A Total Score of 1” shows the likelihood functions for the five response patterns having the same total score of 1. All five functions lead to the same ability estimate even though they are not the same functions. It is easy to see why the likelihood functions are different: when a test taker can only get one item right, we expect this to be the easiest item, and we would be somewhat surprised if it turns out to be the most difficult item instead. The accompanying applet lets you manipulate the item difficulties and choose different response patterns simultaneously.

To finish with the Single Parameter model, there is yet another applet that brings together most of what we have learnt so far: the item response functions, the test response function, the likelihood function and two alternative ways to estimate ability, the test information function, and the standard error of measurement.

IRT Test & Item Analysis Using Software

It was mentioned that IRT enables fairly accurate test and item analysis.
Traditional test analyses yield all the CTT characteristics: mean, median, mode, standard deviation, variance, standard error of the mean, range of scores, quartiles, skewness and kurtosis. These are simply the descriptive statistics. The first output from IRT application software gives all of the above, together with the proportion of correct responses to every item and the item-test score correlation (point biserial) as a measure of discrimination. The important outputs follow, namely the estimates of the item characteristic parameters. Depending on whether a One, Two or Three Parameter model is used, the output gives the threshold (b = item difficulty), slope (a = discrimination) and asymptote (c = guessing parameter). These can be reported to any desired number of decimal places; in maximum likelihood estimation by successive approximation, iteration stops when the last two trials differ by no more than a chosen tolerance such as 0.001 or 0.002. The standard error of each of these estimates is also part of the output, as is the chi-square statistic for the goodness of fit of the logistic curve to the data. It must be noted that the numerical values of these estimates are on a metric from -3 to +3 or -4 to +4, and the actual values for the same items may differ from model to model. Thus, the values for item characteristics and test taker ability are specific to the model chosen. These are the location and shape parameters. When calibrating the test items, it is essential to specify a model, namely Rasch (one parameter), Birnbaum (two parameter) or Fred Lord (three parameter); the resulting item characteristics can then be used further under that model.

Several software applications are available under licence, among them BICAL (Benjamin Wright), BILOG (Scientific Software International) and MULTILOG. In addition, Frank Baker has provided software along with an eBook which helps in understanding the various concepts and principles underlying IRT that are otherwise very difficult to prove mathematically.
This is very valuable software available free on the Net. This eBook also has BIRT software (Basics of IRT by Baker). The readers are encouraged to use this software to understand, verify and clarify such difficult concepts through practical exercises. The BICAL software has been illustrated through an example given earlier as Benjamin Wright’s Mathematical Formulation. The detailed procedure for using the BIRT software is given in the Appendix.

Examples

Following are examples of tests run through BILOG:

Example 1: For a test of 5 items administered on 10 test takers

• A comprehensive illustration of the test based on CTT can be seen by clicking here.
• A complete report illustrating both CTT and IRT and their outputs can be seen by clicking here.

Example 2: For a test of 20 items administered on 76 test takers. A complete set of outputs adopting all three models illustrating IRT can be seen by clicking on

• 1Param (PH1, PH2, PH3, PLT) Note that the PLT output can be seen only in BILOG. The path is D:/New Fol~1/Meritt~4/1param/1PARAM.PLT.
• 2Param (PH1, PH2, PH3, PLT) Note that the PLT output can be seen only in BILOG. The path is D:/New Fol~1r/Meritt~4/2param/2PARAM.PLT.
• 3Param (PH1, PH2, PH3, PLT) Note that the PLT output can be seen only in BILOG. The path is D:/New Fol~1r/Meritt~4/3param3PARAM.PLT.

Test of 25 items administered on a random sample of 1000 test takers in the domain of Analytical Ability

• 1Param (PH1, PH2, PH3, PLT) Note that the PLT output can be seen only in BILOG. The path is D:/New Fol~1/Meritt~4/CTT-IR~1/Origin~1.PLT.

Note that only authorized MeritTrac employees will be able to see these outputs.

Application Of IRT To Item Banking

It was indicated earlier that one of the most important applications of IRT is in the domain of item banking.
The HMSO definition of question banking/item banking is as follows: “A bank of items (questions) of known technical values can be built up for future use. It might, in practice, prove of value to arrange for item construction to be a more or less continuous process. The construction of written papers can then become a matter of judging the suitability of items of known technical values from a bank of items. Items can be weeded out as out of date over a period of time. Further it can be said that new questions should be tried out and statistical evidences for its facility, discrimination ascertained it is absolutely necessary that the banks would have to be large to be of value”. This definition is universally adopted. Applying it to the existing MeritTrac item bank will involve:

1. Examining and pre-validating all the individual items in the bank in several domains to check content and format accuracy. Pre-validation is a process by which a judgment is made about an individual item against certain criteria covering both content and format accuracy. A checklist of general and specific criteria for multiple choice items is given below; similar checklists apply to other item types.

General

• Is the item measuring an important outcome or objective agreed to be included in the test?
• Is the item measuring an important content area or expansion of content area?
• Is the item pitched at an acceptable difficulty level (0.1 to 0.9)?
• Is the item capable of being answered right by a majority of more able and more proficient test takers (HAG)?
• Is the item likely to be answered wrong by a majority of less able and less proficient test takers (LAG)?
• Is the item capable of restructuring?
• Does the item have one and only one correct answer?

Specific

• Is the stem clear and unambiguous for a majority of test takers?
• Is the stem devoid of double negatives? If a single negative is unavoidable, does it get highlighted in the stem?
• Are the distractors plausible, i.e. the usual mistakes, misconceptions and misunderstandings?
• Is the key unarguably and unequivocally correct?
• If there are multiple answers, does the format take this into account?
• Is the item using an efficient format?
• Does the item avoid “window dressing”?

After applying this checklist to every individual item, a decision is taken whether to include the item in the bank, reject it, or improve it for inclusion.

2. Once it is decided to include an item in the bank, some technical values are to be added to the item. Some of these are:

Content ID
This is coded as C1, C11, C12 and so on. It indicates the major content topic and the sub topics that are expansions of the content.

Ability/Skill Tested ID
This is coded as A1, A2, A3 and so on, which are clusters of Bloom-level objectives/outcomes; for instance A1 may include Knowledge (recall), Comprehension (interpret, detect mistakes), Application (solve, predict) and Evaluation (judge), abbreviated as A1 KCAE.

Item ID
A combination of content/ability/difficulty; for instance C11 A1 KCAE d001, where the last digits give the identifying serial number. In content C11, we may have C11 A1 KC d2 004 or C11 A1 KCA d3 003 or C11 A1 KCAE d4 005 and so on.

Item Writer ID
It is the code given by MeritTrac to every item writer.

Difficulty Level
Difficulty levels are d1, d2, d3, d4, d5.
d1 – 0 to 0.2 Very Easy
d2 – 0.2 to 0.4 Easy
d3 – 0.4 to 0.6 Average
d4 – 0.6 to 0.8 Difficult
d5 – 0.8 to 1.0 Very Difficult

Time for answering
This is invariably decided by the item writer. For a simple MC item it can be 1 min; if the stem is lengthy, as in the case of a passage or data, the time will be decided accordingly.

Correct or key answer
This should be hidden from the bank.
Type of Item
MC1 – Multiple Choice 1 in n (n=3, 4 or 5)
MC2 – Multiple Completion (multiple answers combination)
MC3 – Multiple T/F
MC4 – Multiple Facet (a no. of MC items in a topic together)
MC5 – Assertion Reason

Allotted Marks
Marks are to be allotted for the right answer. Invariably 1 mark is to be allotted. If a scoring weight is available from a large number of past test takers and a CTT analysis has been done earlier, then the scoring weight can be indicated. The index of difficulty and scoring weights may be given. If an IRT analysis has been done earlier using any of the three models (preferably a Two Parameter model to start with, graduating to a Three Parameter model later), then the corresponding item difficulty and scoring weights may be added.

An illustrative example in C++ is shown below in the table titled “A 2 Dimensional Blueprint”.

Table: A 2 Dimensional Blueprint

C++ Test Outline

Classification Of Topics                                                    Total
Debugging Skills / Implementing OOPS in C++                                 1
Debugging Skills / Pointers                                                 1
Logic / Files & Streams                                                     3
Logic / Fundamentals                                                        4
Logic / Implementing OOPS in C++                                            2
Logic / User Defined Datatypes                                              1
Programming Concepts / Files & Streams                                      1
Programming Concepts / Friend Functions & Classes                           1
Programming Concepts / Fundamentals                                         4
Programming Concepts / Implementing OOPS in C++                             3
Programming Concepts / Late Binding                                         3
Programming Concepts / User Defined Datatypes                               1
Software / Advanced C Programming / Debugging Skills / Templates            1
Software / Advanced C Programming / Logic / Templates                       4
Software / Advanced C Programming / Programming Concepts / Exceptions       6
Software / Advanced C Programming / Programming Concepts / Templates        4
Total (No of Questions)                                                     40
Total Time in Minutes                                                       45

Questions of Each Difficulty Level (cell values in the order extracted; their row and column positions are not recoverable): 1 3 2 1 4 2 2 1 1 1 1 3 3 2 2 / 2 / 4 3 1 1 3 2 2 1 1

Table: A Modified Blueprint

Content          Content Expansion                          Ability Cluster  Difficulty Level        No. of Items  Total
C1 Debugging     C11 - Implementing OOPS                    A1 - KCE         KC – d2 (easy)          1             2
                 C12 - Skills & Pointers                                     KCE – d3 (average)      1
C2 Logic         C21 - Files & Streams                      A2 – KCAn        KCE – d3                3             10
                 C22 - Fundamentals                                          KC – d2                 2
                                                                             KCAn – d3               2
                 C23 - User Defined Datatypes                                KC – d2                 1
                 C24 - Implementing OOPS                                     KCAn – d3               2
C3 Programming   C31 - Files & Streams                      A3 – KCApE       KCAp – d3 (difficult)   1             13
                 C32 - Friend Functions & Classes                            KC – d2                 1
                 C33 - Fundamentals                                          KC – d2                 4
                 C34 - Implementing OOPS                                     KC – d2                 2
                                                                             KCApE – d3              1
                 C35 - Late Binding                                          KC – d2                 2
                                                                             KCApE – d3              1
                 C36 - User Defined Datatypes                                KCApE – d3              1
C4 Software &    C41 - Debugging Skills & Templates         A4 - KCAp        KCAp – d4               1             15
Advanced         C42 - Logic & Templates                                     KC – d3                 1
Programming                                                                  KCAp – d4               3
                 C43 - Programming Concepts & Exceptions                     KC – d3                 3
                                                                             KCAp – d4               3
                 C44 - Programming Concepts & Templates                      KC – d3                 2
                                                                             KCAp – d4               2

• A1 – KCE means Knowledge Comprehension Evaluation (a simple judgment).
• A2 – KCAn means Knowledge Comprehension Analysis
• A3 – KCApE means Knowledge Comprehension Application Evaluation
• d2 – easy
• d3 – average
• d4 – difficult

It is evident from the above that, in the case of an existing item bank, the present 2-dimensional blueprint can be modified and used to add an Item ID to every individual item.
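As an illustration of how such Item IDs could be handled in software, here is a minimal Python sketch that splits an ID into its parts and reassembles it. The field layout (content code, ability cluster, objective letters, difficulty level, three-digit serial number) is inferred from examples such as C11 A1 KC d2 004; the names `ItemId`, `parse_item_id` and `format_item_id` are hypothetical, and IDs that do not follow this assumed layout are rejected.

```python
import re
from typing import NamedTuple

class ItemId(NamedTuple):
    content: str      # content topic and sub-topic, e.g. "C11"
    ability: str      # ability cluster, e.g. "A1"
    objectives: str   # objective letters, e.g. "KC"
    difficulty: str   # difficulty level, e.g. "d2"
    serial: int       # identifying serial number, e.g. 4 for "004"

# Assumed layout: C<digits> A<digits> <letters other than d> d<digit> <3 digits>
_PATTERN = re.compile(r"(C\d+)(A\d+)([A-Za-ce-z]+)(d\d)(\d{3})")

def parse_item_id(text):
    """Split an item ID (with or without spaces) into its components."""
    match = _PATTERN.fullmatch(text.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a recognisable item ID: {text!r}")
    content, ability, objectives, difficulty, serial = match.groups()
    return ItemId(content, ability, objectives, difficulty, int(serial))

def format_item_id(item):
    """Reassemble the compact form of an item ID."""
    return (f"{item.content}{item.ability}{item.objectives}"
            f"{item.difficulty}{item.serial:03d}")

example = parse_item_id("C11 A1 KC d2 004")
print(example)
print(format_item_id(example))
```

Keeping the components as separate fields makes it straightforward to select items from the bank by content, ability cluster or difficulty level when assembling a paper.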
Item IDs used for C++ are given below in the table titled “Item IDs for C++”.

Table: Item IDs for C++

Sample Paper
Duration in minutes: 40    No of Questions: 40
Section 1 - C++ Programming

Item No  Item ID
Q1       C23A2KCEd1001
Q2       C23A2KCEd1002
Q3       C32A3Kd1003
Q4       C32A3Kd1004
Q5       C32A3KEd1005
Q6       C32A3Kd1006
Q7       C32A3KCEd1007
Q8       C33A3KCEd1008
Q9       C33A3Kd1009
Q10      C33A3KCEd1010
Q11      C34A3KCEd1011
Q12      C31A3KCAd2012
Q13      C31A3KCAd2013
Q14      C31A3KCAd2014
Q15      C32A3KCAd2015
Q16      C32A3KCAd3016
Q17      C32A3KCAd2017
Q18      C36A3KCAd2018
Q19      C36A3KCAd2019
Q20      C36A3KCAd2020
Q21      C11A1KCAd4021
Q22      C32A3Kd4022

Application of IRT to Adaptive or Tailored Testing

A very significant application of IRT is in the area of adaptive or tailored testing. For decades, attempts have been made to introduce adaptive or tailored testing out of the need to reduce the number of test items and at the same time increase the accuracy of measurement. Clients, who increasingly demand shorter tests together with more accurate measurement and assessment, can be expected to accept adaptive or tailored testing modules more and more.

It should be noted that there are some pre-requisites before an adaptive testing module is designed. The first is a sufficiently large bank of items that has been administered to a fairly large population of test takers. The number of items in the bank need not be the same as the number stored in a normal item bank used for creating a paper and pencil test. For instance, if the analytical ability test has at the moment 25 items, the normal item bank should have a minimum of 15 times this number when it is used for creating paper and pencil tests. But, for an adaptive testing module the item bank can be anything up to 10 times.
However, these items must first be calibrated using IRT, adopting any of the three models, on a population of test takers (anything between 200 and 500 test takers). The adaptive testing module will thus normally have 200 items distributed over the desired contents with the help of a 3-d building-block blueprint. After administration, these items are analyzed through any of the models and the item parameters (1, 2 or 3) are ascertained. The ICCs and item information function curves are also plotted and stored, preferably with the test characteristic curve and test information function. Then the parent test, and with it the module, is ready for use. All the calibrated items in the bank are to be rearranged in order of increasing difficulty. It is then possible to administer an adaptive test to test takers individually and ascertain each one's true score by finding the true ability and calculating the true score as if he had taken the parent test.

A test taker can now be administered an item selected on the basis of an assumption about his ability. This is invariably his/her own judgment of his ability, indicated by his position on the Z scale (-3 to +3) or on a scaled score available to him from standard scales like TOEFL, GRE, GMAT etc. The reader may refer to Rudner’s Computer Adaptive Testing tutorial that is attached to the appendix of this book. This is an interesting self-learning computer adaptive testing tutorial meant for testing an average level of arithmetic ability with a bank of 200 items.

A test taker is shown a first item whose difficulty matches the ability level he/she has judged for himself/herself. If the test taker answers it correctly, the test automatically moves to the next item, which is of higher difficulty than the first. If he answers this item correctly as well, the process is repeated till the test taker answers an item wrong.
This is where the test can be terminated. There are several rules for termination, as stated below, out of which any one can be adopted according to convenience:

• The time at the disposal of the test taker and administrator
• A fixed number of items, like 5 to 10
• Until a consistent estimate of his ability is obtained at successive trials during the process

The final ability of the test taker is determined, and a corresponding true score can be determined by taking this final ability estimate and applying it over all the items in the parent test. Thus the parent test is replaced by a short-duration adaptive testing module having a very small number of items.

Examples

Example 1: A test of 5 items administered on 10 test takers is converted into an adaptive or tailored testing module. At this stage a test taker X will be assumed to have taken 2 items, namely 5 and 2, and his response pattern is Correct & Incorrect. The test is terminated at this point. An ability estimate can be made starting from an initial estimate of 0.3, since items #5 and #2 have difficulty values of -0.28 and 0.42 respectively.

Test Taker X taking an Adaptive Test with 2 Items (Item #5, #2)

Iteration 1 (θ = 0.3)
Item No  b      u  -(θ-b)  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
5        -0.28  1  -0.58   0.560     0.641             0.359  0.359   0.230
2        0.42   0  0.12    1.127     0.470             0.530  -0.470  0.502
Sum                                                           -0.111  0.732
Correction Factor = -0.152    Next Estimate = 0.148

Iteration 2 (θ = 0.15)
Item No  b      u  -(θ-b)  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
5        -0.28  1  -0.43   0.652     0.605             0.395  0.395   0.239
2        0.42   0  0.27    1.313     0.432             0.568  -0.432  0.245
Sum                                                           -0.038  0.484
Correction Factor = -0.078    Next Estimate = 0.070

Iteration 3 (θ = 0.07)
Item No  b      u  -(θ-b)  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
5        -0.28  1  -0.350  0.705     0.587             0.413  0.413   0.242
2        0.42   0  0.350   1.419     0.413             0.587  -0.413  0.242
Sum                                                           0.000   0.485
Correction Factor = 0.000    Next Estimate = 0.070

Let us take another test taker Y who is assumed to have taken the items #4, #1 & #3.
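The correction used in the tables for test taker X is a Newton-Raphson step: the sum of (u - p) divided by the sum of p*q is added to the current ability estimate, and the step is repeated until the correction is negligible. A minimal Python sketch follows, assuming the one-parameter model; the function name `estimate_ability`, the tolerance and the iteration cap are illustrative choices rather than anything specified in the text. Because the code does not round between iterations, its intermediate corrections need not match the hand-worked tables, but it settles on the same final estimate of about 0.07.

```python
import math

def estimate_ability(difficulties, responses, theta, tol=0.0005, max_iter=25):
    """Maximum likelihood ability estimate by repeated Newton-Raphson corrections."""
    for _ in range(max_iter):
        # p: probability of a correct response to each item at the current estimate
        ps = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        numerator = sum(u - p for u, p in zip(responses, ps))    # sum of (u - p)
        denominator = sum(p * (1.0 - p) for p in ps)             # sum of p * q
        correction = numerator / denominator
        theta += correction
        if abs(correction) < tol:
            break
    return theta

# Test taker X: item #5 (b = -0.28) answered correctly, item #2 (b = 0.42) incorrectly,
# starting from the initial estimate of 0.3 used in the text.
theta_x = estimate_ability([-0.28, 0.42], [1, 0], 0.3)
print(round(theta_x, 3))
```

The routine needs at least one correct and one incorrect response; for an all-correct or all-incorrect pattern the likelihood has no finite maximum and the estimate keeps growing in magnitude.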
The response pattern is Correct, Correct and Incorrect. The test is terminated at this point. The final ability can be estimated starting from an initial estimate of 1.5. This assumption is made on the basis that he answered item #4, with a difficulty value of 0.92, correctly and also answered item #1, with a difficulty value of 1.5, correctly. He answered item #3, with a difficulty value of 2.29, incorrectly. His final estimate is worked out as follows:

Test Taker Y taking an Adaptive Test with 3 Items (Item #4, #1, #3)

Iteration 1 (θ = 1.5)
Item No  b     u  -(θ-b)  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
4        0.92  1  -0.58   0.560     0.641             0.359  0.359   0.230
1        1.50  1  0.00    1.000     0.500             0.500  0.500   0.250
3        2.29  0  0.79    2.203     0.312             0.688  -0.312  0.215
Sum                                                          0.547   0.695
Correction Factor = 0.787    Next Estimate = 2.287

Iteration 2 (θ = 2.287)
Item No  b     u  -(θ-b)  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
4        0.92  1  -1.37   0.255     0.797             0.203  0.203   0.162
1        1.50  1  -0.79   0.455     0.687             0.313  0.313   0.215
3        2.29  0  0.00    1.003     0.499             0.501  -0.499  0.250
Sum                                                          0.017   0.627
Correction Factor = 0.027    Next Estimate = 2.314

Iteration 3 (θ = 2.314)
Item No  b     u  -(θ-b)  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
4        0.92  1  -1.39   0.248     0.801             0.199  0.199   0.159
1        1.50  1  -0.81   0.443     0.693             0.307  0.307   0.213
3        2.29  0  -0.02   0.976     0.506             0.494  -0.506  0.250
Sum                                                          0.000   0.622
Correction Factor = 0.000    Next Estimate = 2.314

Thus, the abilities of test taker X and test taker Y are calculated as 0.070 and 2.314 respectively after administering the adaptive test to them. Their true scores can be calculated by applying these final ability values to the parent test of 5 items. The calculations are shown below:

Calculation of True Scores of Test Takers X & Y

              Ability  Item 1    Item 2    Item 3    Item 4    Item 5    True Score
b value                1.5       0.42      2.29      0.92      -0.28
Test Taker X  0.07     0.193099  0.443517  0.136286  0.31352   0.644172  1.73059428
Test Taker Y  2.314    0.692961  0.56782   0.151591  0.316823  0.644929  2.374123965

Example 2: A test taker X is administered the adaptive test. His initial ability is assumed to be 2.0. Accordingly, he is given as the first item #5, with a difficulty value of 0.340. He answers it correctly, and the next item administered has a difficulty value of 0.880. The process is continued till he answers an item with a difficulty value of 1.970 incorrectly. The test is terminated and his final ability is estimated at 2.531, as shown below.

Test Taker X administered with the Adaptive Test

Iteration 1 (θ = 2.0000)
Item No  b      u  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
5        0.340  1  0.190     0.8402            0.160  0.160   0.134
18       0.880  1  0.326     0.7540            0.246  0.246   0.185
9        0.990  1  0.364     0.7330            0.267  0.267   0.196
22       1.130  1  0.419     0.7047            0.295  0.295   0.208
21       1.970  0  0.970     0.5075            0.493  -0.507  0.250
Sum                                                   0.461   0.973
Correction Factor = 0.473    Next Estimate = 2.473

Iteration 2 (θ = 2.473)
Item No  b      u  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
5        0.340  1  0.118     0.8941            0.106  0.106   0.095
18       0.880  1  0.203     0.8310            0.169  0.169   0.140
9        0.990  1  0.227     0.8150            0.185  0.185   0.151
22       1.130  1  0.261     0.7930            0.207  0.207   0.164
21       1.970  0  0.605     0.6232            0.377  -0.623  0.235
Sum                                                   0.044   0.785
Correction Factor = 0.056    Next Estimate = 2.529

Iteration 3 (θ = 2.529)
Item No  b      u  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
5        0.340  1  0.112     0.899             0.101  0.101   0.091
18       0.880  1  0.192     0.839             0.161  0.161   0.135
9        0.990  1  0.215     0.823             0.177  0.177   0.145
22       1.130  1  0.247     0.802             0.198  0.198   0.159
21       1.970  0  0.572     0.64              0.364  -0.636  0.231
Sum                                                   0.001   0.762
Correction Factor = 0.001    Next Estimate = 2.530

Iteration 4 (θ = 2.530)
Item No  b      u  e^-(θ-b)  p=1/(1+e^-(θ-b))  q=1-p  u-p     p*q
5        0.340  1  0.112     0.899             0.101  0.101   0.091
18       0.880  1  0.192     0.839             0.161  0.161   0.135
9        0.990  1  0.215     0.823             0.177  0.177   0.145
22       1.130  1  0.247     0.802             0.198  0.198   0.159
21       1.970  0  0.572     0.64              0.364  -0.636  0.231
Sum                                                   0.001   0.762
Correction Factor = 0.001    Next Estimate = 2.531

Future of Item Response Theory in India

Many testing and assessment councils and national bodies in the US, including Educational Testing Service (ETS), College Entrance Board (CEB), American Psychological Association (APA), the US Civil Service Commission and other recruitment agencies, have been using IRT for the last few decades. In particular, the Rasch model of analysis has found applications in domains other than testing and assessment, such as medicine, textiles, aeronautics and manufacturing industries, where very accurate results are required. There is evidence that nearly 525 organizations in the US, UK, Germany, Japan and China are actively using IRT methods of analysis for extremely accurate results. The 3 parameter logistic model has been in use for several decades in educational and training institutions, and IRT continues to be a domain of research for scholars in measurement, evaluation and assessment.

In India, the author and several of his doctoral students are using IRT in admission, entrance and advanced placement tests. In recruitment testing, MeritTrac is the pioneer in using IRT, particularly adaptive testing modules, primarily driven by clients and customers who increasingly demand quick and more accurate results.

There are several domains in which IRT can influence the future of testing and assessment in India. Some of these are:

1. Achievement Testing
2. Recruitment Testing
3. Adaptive Testing
4. Mastery Testing
5. Scholarships and other Award Testing
6. Diagnostic Testing

The applications in these domains are explored below.
Achievement Testing

Following increasing demand for very accurate measurement, evaluation and assessment, school boards, universities and other certifying organizations are being driven towards shorter tests that are both more accurate and more efficient. Many of the organizations mentioned above use question/item banks to assemble both formative classroom tests and final end-of-course examinations. The questions and items are calibrated using IRT to ascertain invariant item parameters and are then incorporated in the question/item bank for future use. This healthy trend is visible in universities and in institutions of higher and school learning, and item banks are increasingly being utilized for mass-scale examinations.

Recruitment Testing

Recruiting agencies in India are being compelled to use shorter tests of shorter duration that at the same time yield better, more accurate results, and they are increasingly introducing computer-aided adaptive testing modules and other online instruments. Even though the application of IRT to recruitment in India has only just begun, there are increasing opportunities in the BPO and ITES sectors, where a very large number of aspirants compete for a smaller number of positions. The time at the disposal of industrial and training organizations for assessing the knowledge, skills and attitudes required for the various positions in industry is very limited. IRT can be applied to classical recruitment tests built from calibrated items to yield accurate results wherever and whenever time is not a restricting factor. Where, as noted earlier, time is short and more accurate results are needed, IRT-driven computer-aided adaptive testing modules and instruments can be put to effective use.

Adaptive Testing

This particular application of IRT is extremely promising.
It is especially suited to recruitment, admission, entrance and awards testing, for the simple reason that it calls for a shorter test of very short duration (MeritTrac is contemplating a module of 6 items lasting 8-10 minutes). The author has devised a simple procedure for conducting adaptive testing offline, as a paper-and-pencil test. A sufficient number of items calibrated through IRT are coded and stored in an item bank on a computer and classified into six groups by item difficulty range (-3 to -2, -2 to -1, -1 to 0, 0 to 1, 1 to 2, 2 to 3). A test taker wanting to take an adaptive test is given one item from each group, so that a test taker of any ability level may miss 1, 2, 3 or 4 items of the test. The author classifies those who miss 4, 3, 2 and 1 of the 6 items as "Below Average", "Average", "Good" and "Par Excellence" respectively in terms of their initial ability level before taking the test. Accordingly, the initial assumed values used for the successive approximation of the test taker's final ability are taken as -0.5, 0, 1 and 1.5 respectively. Benjamin Wright's final approximation using the maximum likelihood estimate is then applied, as discussed in detail elsewhere in the e-book.

Mastery Testing

In many assessment scenarios, achievers are certified or categorized in a dichotomous fashion, such as the traditional Pass/Fail, Selected/Rejected and Master/Non-Master, and mastery testing is resorted to. The concept of mastery testing and mastery learning grew out of years of research by Carroll and Bloom. It concerns the particular case of an achiever who is to be certified as a master or a non-master. The usual level of mastery is prescribed as 90/90, which indicates that 90% of test takers will secure 90% in the test.
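A 90/90 standard of this kind can be checked for a group of test takers with simple arithmetic. The sketch below reads the standard literally as stated above (a given percentage of test takers must secure a given percentage of the maximum score); the function name, its parameters and the sample scores are hypothetical and only serve as illustration.

```python
def meets_mastery_standard(scores, max_score, score_pct=90, taker_pct=90):
    """Return True if at least taker_pct% of test takers score at least score_pct% of max_score."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Integer arithmetic avoids floating-point rounding at the cut-off.
    masters = sum(1 for s in scores if s * 100 >= score_pct * max_score)
    return masters * 100 >= taker_pct * len(scores)


# Hypothetical scores out of 20 for ten test takers; nine of them reach 18 or more.
group = [19, 20, 18, 18, 20, 19, 18, 20, 19, 15]

print(meets_mastery_standard(group, 20))            # 90/90 standard
print(meets_mastery_standard(group, 20, 100, 100))  # stricter 100/100 standard
```

The two percentages are kept as parameters so that a stricter standard can be checked with the same function.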
There are also situations where 100/100 is insisted upon, particularly in nurses' certification tests, where 100% mastery is essential (for example, a nurse who is to be certified for distinguishing between poisonous and non-poisonous materials).

Scholarships and other Award Testing

Items calibrated through IRT for item difficulty, item discrimination and item guessing provide a platform for constituting a special test for the award of scholarships or other excellence awards. Items in a bank calibrated with IRT parameters can be sorted in order of increasing difficulty and discrimination, and as far as possible items with high difficulty values, ranging from 2 to 2.5, and very high to perfect discrimination can be selected to meet the requirements of a scholarship test. Those who perform very well on these items, and whose ability estimates and true scores are beyond the accepted cut-off for the award or scholarship, may be selected for such awards. IRT thus provides a test whose items are all aimed at the cut-off of difficulty and discrimination required for such awards. This application is worth attempting for awards such as foreign scholarships.

Diagnostic Testing

Educationists and trainers all over the world are nowadays increasingly providing feedback to students and trainees on the strengths and weaknesses of their performance in terms of the content areas, abilities and skills tested and the levels of difficulty of the items. With IRT calibration, and with items coded by content, ability cluster and difficulty level, it is possible to sort out items from different content areas, from different clusters and from different levels of difficulty.
It is then possible to generate feedback that lists strengths and weaknesses in respect of selected contents, selected clusters and selected levels of difficulty. Weaknesses in these areas can thus be diagnosed, and remedial steps can be recommended on that basis. This is an area of application that needs to be tried at all levels of education and training, in particular by teachers and trainers on a continuous basis.

APPENDIX

This appendix lists a set of web resources that the reader can view to enhance his or her understanding of IRT and Computer Adaptive Testing.

1) Please visit http://echo.edres.org:8080/scripts/cat/catdemo.htm for an online interactive Computer Adaptive Test taking tutorial. This was created by Dr. Larry Rudner of the Graduate Management Admission Council, which publishes the GMAT exam.

2) Please click on the links below to explore ICCs in different scenarios. These links were created by Dr. Rolf Steyer.

Applet1 (for illustration), Applet2, Applet3, Applet4, Applet5, Applet6, Applet7, Applet8, Applet9, Applet10, Applet11, Applet12, Applet13, Applet14, Applet15
