Likert Scale Reliability Procedures
Reliability Discussion Topics

Define reliability. What does it encompass?

Look at the various approaches in the attached handout. Which ones are useful for assessing accuracy? Which ones are useful for assessing stability?

How would you assess reliability for a test you are designing?

What could happen to diminish the reliability of your measure?

If you obtained a low reliability coefficient, what could you do to improve it?

SOURCES OF VARIATION REPRESENTED IN DIFFERENT PROCEDURES FOR ESTIMATING RELIABILITY

Sources of variation:
A. Variation caused by the measurement procedure
B. Variation caused by respondents' day-to-day variability
C. Variation in the items sampled from the content domain
D. Variation in respondents' speed of work

Methods of estimating reliability, with the sources of variation each one reflects:
1. Immediate retest with same test: A, D
2. Retest after interval with same test: A, B, D
3. Parallel test form without time interval: A, C, D
4. Parallel test form with time interval: A, B, C, D
5. Odd-even halves of single test: A, C
6. Kuder-Richardson single test analysis: A, C
7. Cronbach's alpha for single test analysis: A, C
Which measures are best for establishing stability? Which measures are best for establishing accuracy (equivalence)? If consistency = stability + accuracy, which measures are ideal for establishing the reliability of the measure you designed last week?
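To make the distinction concrete, here is a minimal Python sketch of the two families of procedures listed above: a retest (or parallel-form) correlation for stability and an odd-even split-half coefficient, stepped up with the Spearman-Brown formula, for accuracy/equivalence. The data are simulated and every name and sample size here is an illustrative assumption, not part of the handout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 respondents answering a 10-item test on two occasions.
true_score = rng.normal(size=30)
form1 = true_score[:, None] + rng.normal(scale=0.8, size=(30, 10))
form2 = true_score[:, None] + rng.normal(scale=0.8, size=(30, 10))

total1 = form1.sum(axis=1)
total2 = form2.sum(axis=1)

# Stability: retest (or parallel-form) reliability is the correlation
# between total scores from the two administrations.
r_retest = np.corrcoef(total1, total2)[0, 1]

# Accuracy/equivalence: odd-even split-half reliability from a single
# administration, stepped up with the Spearman-Brown formula.
odd = form1[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
even = form1[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
r_half = np.corrcoef(odd, even)[0, 1]
r_split = 2 * r_half / (1 + r_half)

print(f"retest r = {r_retest:.2f}, split-half (Spearman-Brown) r = {r_split:.2f}")
```

Under these simulated conditions the two coefficients come out similar; with real data they can diverge, because the retest correlation lets day-to-day and speed variation count as error while the single-sitting split-half coefficient does not.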

FACTORS INFLUENCING TEST RELIABILITY

1. The greater the number of items, the more accurate the test (see the sketch after this list). The respondents' mental set for accuracy is also important for reliability; that is, variation in incentive or effort matters, as do perseverations from previous mental or emotional experiences.
2. On the whole, the longer the test administration time, the greater the accuracy. Stability may decline if tests are too long.
3. The narrower the range of difficulty of items, the greater the reliability. Items of moderate difficulty are preferred over easy or hard items.
4. Interdependent items are those that require a correct answer on one item before it is possible to obtain a correct answer on others. Such grouped items tend to reduce reliability.
5. The more systematic or "objective" the scoring, the greater the reliability coefficient. Error due to mis-scored items reduces accuracy.
6. The greater the probability of achieving success by chance (guessing), the lower the reliability.
7. The more homogeneous the material, the greater the reliability.
8. Reliability is affected by the extent to which individuals have similar characteristics. A restricted range of characteristics in your sample can result in low reliability because there is little variance; when there is more variance, reliability can be increased.
9. Trick questions lower accuracy. Subtle factors leading to misinterpretation of the test item lead to unreliability.
10. Speed of work on the test influences accuracy. Some test-takers are set for speed and some are not; some distribute their time properly and some do not.
11. Distractions have some effect on accuracy, although those effects can be overrated. Accidents, such as breaking a pencil or finding a defective test blank, are incidental factors. The respondents' attention to the task may be limited by illness, worry, or excitement. These can affect accuracy, although not always to the extent that most people think.
12. Reliability generally decreases when there is intervening time between tests. Delayed posttests are given for the purpose of establishing validity, not reliability.
13. Cheating may be a factor in lowering accuracy or stability.
14. Position of the individual on the learning curve for the tasks of the test may be important (restriction of range).
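Factor 1 is commonly quantified with the Spearman-Brown prophecy formula, which projects how reliability changes when a test is lengthened (or shortened) with comparable items. The handout does not name the formula, so the short Python sketch below is supplementary; the reliability of .60 and the doubling factor are only example values.

```python
def spearman_brown(r_old: float, length_factor: float) -> float:
    """Projected reliability when a test is lengthened by `length_factor`
    (e.g., 2.0 doubles the number of comparable items)."""
    return length_factor * r_old / (1 + (length_factor - 1) * r_old)

# Example: a 20-item test with reliability .60, doubled to 40 comparable items.
print(round(spearman_brown(0.60, 2.0), 2))   # 0.75
```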

In Linden, K.W. (1985) Designing tools for assessing classroom achievement: A handbook of materials and exercises for Education 524.

Reviewing Teacher-Made Tests

From Mitchell, R.J. Measurement in the classroom. Dubuque, Iowa: Kendall/Hunt, 1972, pp. 115-116.

The comments and suggestions offered in the preceding pages are appropriate for planning and constructing the different types of test items. The purpose of the following suggestions is to present briefly the basic principles or ideas which apply to the development of classroom tests.

1. Item Format
A. The items in the test are numbered consecutively.
B. Each item is complete on a page.
C. Reference material for an item appears on the same page.
D. The item responses are arranged to achieve both legibility and economy of space.

2. Scoring Arrangements
A. Consideration has been given to the practicability of a separate answer sheet.
B. Answers are to be indicated by symbols rather than by underlining or copying.
C. Answer spaces are placed in a vertical column for easy scoring.
D. If answer spaces are placed at the right of the page, each answer space is clearly associated with its corresponding item.
E. Answer symbols to be used by the students are free from possible ambiguity due to careless penmanship or deliberate hedging.
F. Answer symbols to be used by the students are free from confusion with the substance or content of the responses.

3. Distribution of Correct Responses
A. Correct answers are distributed so that the same answer does not appear in a long series of consecutive questions.
B. Correct answers are distributed to avoid an excessive proportion of items in the test with the same answer.
C. Patterning of answers in a fixed repeating sequence is avoided.


4. Grouping and Arrangement of Items
A. Items of the same type requiring the same directions are grouped together in the test.
B. Where juxtaposition of items of markedly dissimilar content is likely to cause confusion, items are grouped by content within each item-type grouping.
C. Items are generally arranged from easy to more difficult within the test as a whole and within each major subdivision of the test.

5. Designating Credit Allowances
A. Credits are indicated for the major sections of the test.
B. The credit allowance for each item is clear to the student.
C. Where questions have subdivisions, especially in essay questions, credits are indicated for each of the parts of the question.

6. Directions for Answering Questions
A. Simple, clear, and specific directions are given for each different item type in the test.
B. Directions are clearly set off from the rest of the test by appropriate spacing or type style.
C. Effective use is made of sample questions and answers to help clarify directions for unusual item types.

7. Guessing
A. If deductions are to be made for wrong answers, pupils are instructed not to guess.
B. If no deductions are to be made for wrong answers, pupils are advised to answer every question according to their best judgment.

8. Allowing Choice of Items
A. The degree of choice is sufficiently limited, and the questions among which choice is allowed are sufficiently similar in difficulty, to maintain reasonable comparability of pupils' scores.
B. Directions covering choice are prominent, clear, and explicit.
C. Choice is exercised within relatively small groups of items rather than among many items.

9. Printing and Duplicating
A. The test has been duplicated to provide individual student copies.
B. The test is free from annoying and confusing typographical errors.
C. Legibility of the test is satisfactory from the viewpoint of type size, adequacy of spacing, and clarity of printing.
D. The length of line is neither too long nor too short for easy comprehension.

Values of r for Different Levels of Significance*

                      Levels of Significance
Sample Size (n)     .05       .02       .01       .001
       5           .7545     .8329     .8745     .9507
      10           .5760     .6581     .7079     .8233
      11           .5529     .6339     .6835     .8010
      12           .5324     .6120     .6614     .7800
      13           .5139     .5923     .6411     .7603
      14           .4973     .5742     .6226     .7420
      15           .4821     .5577     .6055     .7246
      16           .4683     .5425     .5897     .7084
      17           .4555     .5285     .5751     .6932
      18           .4438     .5155     .5614     .6787
      19           .4329     .5034     .5487     .6652
      20           .4227     .4921     .5368     .6524
      25           .3809     .4451     .4869     .5974
      30           .3494     .4093     .4487     .5541
      35           .3246     .3810     .4182     .5189
      40           .3044     .3578     .3932     .4896
      45           .2875     .3384     .3721     .4648
      50           .2732     .3218     .3541     .4433
      60           .2500     .2948     .3248     .4078
      70           .2319     .2737     .3017     .3799
      80           .2172     .2565     .2830     .3568
      90           .2050     .2422     .2673     .3375
     100           .1946     .2301     .2540     .3211

*Reduced version of Table VI of R.A. Fisher and F. Yates: Statistical Tables for Biological, Agricultural, and Medical Research, Oliver & Boyd Ltd., Edinburgh.
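For values not shown in the table, the critical r can be computed from the t distribution using the identity $t = r\sqrt{df}/\sqrt{1 - r^2}$. The short Python sketch below (using SciPy) is an illustration added here, not part of the original handout; note that the tabled entries are reproduced when the "Sample Size (n)" column is treated as degrees of freedom, as in the Fisher-Yates table.

```python
import math
from scipy import stats

def critical_r(df: int, alpha: float) -> float:
    """Two-tailed critical value of Pearson r for the given degrees of
    freedom, from the identity t = r * sqrt(df) / sqrt(1 - r**2)."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Example: reproduces the .05 entry in the row labeled 10 (approximately .5760).
print(round(critical_r(10, 0.05), 4))
```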

Computational Formulas for Test Analysis

In Linden, K.W. (1985) Designing tools for assessing classroom achievement: A handbook of materials and exercises for Education 524.

Measures of Central Tendency

Modal Score (Mo) = the most popular score.

Median Score (Mdn) = the score representing the performance of the middle person in the group, i.e., the midpoint of the distribution.

Mean Score ($\bar{X}$):
A: $\bar{X} = \frac{\sum X}{N}$, where $\sum X$ = sum of raw scores (X) and $N$ = number of cases.
B: $\bar{X} = AM + \frac{\sum fd}{N}$, where $AM$ = assumed mean (zero point for the deviation method), $N$ = number of cases, and $\sum fd$ = sum of all deviation scores ($f$ = frequency).

Measures of Variability

Range: $R = (\text{High Score} - \text{Low Score}) + 1$

Quartile Deviation: $QD = \frac{\text{score for 75th percentile rank} - \text{score for 25th percentile rank}}{2}$

Standard Deviation (SD or s): $s = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$, where $X$ = each raw score, $\bar{X}$ = mean, and $N$ = number of students.

Standard Scores

Basic z-score (mean = 0, s = 1): $z = 1\left(\frac{X - \bar{X}}{s}\right) + 0$, where $X$ = any raw score, $\bar{X}$ = mean of scores, and $s$ = standard deviation of scores.

T-score (mean = 50, s = 10): $T = 10\left(\frac{X - \bar{X}}{s}\right) + 50$

Internal Consistency Reliability and Standard Error of Measurement

Kuder-Richardson formula 21: $r_{KR21} = \frac{k}{k - 1}\left(1 - \frac{\bar{X}(k - \bar{X})}{k\,s^2}\right)$, where $k$ = number of items, $s^2$ = standard deviation squared (variance of scores), and $\bar{X}$ = mean score.

Standard error of measurement: $s_m = s\sqrt{1 - r}$, where $r$ is the reliability and $s$ is the standard deviation of scores.
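The following Python sketch applies these formulas to a small simulated set of dichotomous item responses. The data and variable names are hypothetical, and the standard deviation deliberately uses the N (not N − 1) denominator to match the formula above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 25 students by 20 dichotomous (0/1) items, generated so
# that higher-ability students tend to answer more items correctly.
ability = rng.normal(size=(25, 1))
difficulty = rng.normal(size=(1, 20))
items = (rng.random((25, 20)) < 1 / (1 + np.exp(difficulty - ability))).astype(int)

scores = items.sum(axis=1)        # total raw scores X
k = items.shape[1]                # number of items
N = items.shape[0]                # number of students

mean = scores.sum() / N                          # X-bar = (sum of X) / N
sd = np.sqrt(((scores - mean) ** 2).sum() / N)   # s, with N in the denominator
z = (scores - mean) / sd                         # z-scores (mean 0, s 1)
T = 10 * z + 50                                  # T-scores (mean 50, s 10)

kr21 = (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))  # KR-21
sem = sd * np.sqrt(1 - kr21)                     # standard error of measurement

print(f"mean = {mean:.2f}, s = {sd:.2f}, KR-21 = {kr21:.2f}, SEM = {sem:.2f}")
```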

In Linden, K.W. (1985) Designing tools for assessing classroom achievement: A handbook of materials and exercises for Education 524.

INTERPRETATION OF CORRELATION COEFFICIENTS

1. When may we call a coefficient "high" or "low"?
Stable coefficients from .00 to .20 = negligible correlation
Stable coefficients from .20 to .40 = low degree of correlation
Stable coefficients from .40 to .60 = moderate degree of correlation
Stable coefficients from .60 to .80 = marked degree of correlation
Stable coefficients from .80 to 1.00 = high degree of correlation

2. How high must a correlation be to be regarded as "satisfactory"? The function of a coefficient of correlation is to measure the degree of association between two variables. In some situations a correlation of .00 might be satisfactory, and in others a correlation of .90 might be regarded as unsatisfactory. The coefficient stands merely as a statement of fact.

3. Does correlation imply a causal relationship between the two traits studied? NO!

4. Does the correlation coefficient indicate the percentage of agreement between the two traits? Does a coefficient of .20 mean 20 percent agreement? NO! However, from the coefficient we can obtain a statement regarding the degree of overlap between the two variables. This is done by squaring the coefficient.

5. Does a knowledge of the coefficient of correlation between two traits enable us to predict one from the other? YES, but the relationship between the size of a correlation coefficient and its predictive value is not directly proportional. The lower correlations are of almost no value in prediction; the moderate ones are only slightly better; and the marked coefficients are somewhat, but not very much, better. Only as we advance into the high correlation range do the predictive values rise to usable levels. A statement of predictive efficiency can be found from the following formula: $100\left(1 - \sqrt{1 - r^2}\right)$

6. Is there a direct arithmetical relationship between the size of a correlation and its value? Is a coefficient of .75 three times as good as .25? NO! A statement can be made more accurate by looking at the squares of the correlation coefficients. The square of .25 is .0625, while the square of .75 is .5625. On this basis, a coefficient of .75 is nine times, not three times, better than a correlation of .25.
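A few lines of Python make points 4-6 concrete by computing the overlap (r squared) and the predictive-efficiency figure for three coefficients; the values of r chosen here are only examples.

```python
import math

for r in (0.25, 0.50, 0.75):
    overlap = r ** 2                                 # proportion of shared variance
    efficiency = 100 * (1 - math.sqrt(1 - r ** 2))   # predictive (forecasting) efficiency
    print(f"r = {r:.2f}  r^2 = {overlap:.4f}  efficiency = {efficiency:.1f}%")
```

The output shows that .75 yields nine times the shared variance of .25 (.5625 vs. .0625), and that predictive efficiency stays low until the correlation is quite high.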

Likert Scale Reliability Procedures

1. Run a factor analysis including all the scale items with a rotation that is either varimax or oblimin (look up the difference).
2. Look at the first factor matrix obtained.
3. Delete all items that do not load at least .33 on Factor 1.
4. Re-run the analysis without those items.
5. Look at the rotated factor matrix.
6. Identify subscales (groups of items that load at least .33 on a given factor). For each questionnaire item, look to see which factor has the highest loading. Ambiguous items are those that load well on more than one factor (they typically have about the same factor loading on more than one factor).
7. Check to see if any items load negatively and reverse the scoring of those items.
8. Run the Cronbach's alpha program with the item analysis option.
9. Interpret the item analysis (would the reliability go up if a particular item were deleted?).
10. Create the scales (I like to average items to facilitate comparison of means across scales, but that is not necessary).
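A rough Python sketch of steps 1-10, written against a hypothetical respondents-by-items data frame, is given below. It is a stand-in rather than the class software: scikit-learn's FactorAnalysis with varimax rotation (scikit-learn 0.24 or later) replaces the SPSS factor-analysis run, so loadings will not match SPSS scaling exactly; Cronbach's alpha and the alpha-if-item-deleted analysis are computed directly from their formulas; and reverse-scoring assumes a 1-5 response scale. The .33 cutoff follows the procedure above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a respondents-by-items data frame (step 8)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def alpha_if_deleted(items: pd.DataFrame) -> pd.Series:
    """Alpha recomputed with each item removed in turn (step 9)."""
    return pd.Series({col: cronbach_alpha(items.drop(columns=col))
                      for col in items.columns})

def build_scale(items: pd.DataFrame, n_factors: int = 2, cutoff: float = 0.33) -> dict:
    # Steps 1-4: factor-analyze, drop items loading below the cutoff on
    # Factor 1 (in absolute value), and re-fit the reduced item set.
    fa = FactorAnalysis(n_components=n_factors, rotation="varimax").fit(items)
    kept = items[items.columns[np.abs(fa.components_[0]) >= cutoff]]

    # Steps 5-7: inspect the new loadings and reverse-score items that load
    # negatively (a 1-5 Likert response scale is assumed here).
    fa2 = FactorAnalysis(n_components=n_factors, rotation="varimax").fit(kept)
    scored = kept.copy()
    for col, loading in zip(kept.columns, fa2.components_[0]):
        if loading < 0:
            scored[col] = 6 - scored[col]

    # Steps 8-10: alpha, item analysis, and a scale score formed by averaging.
    return {
        "alpha": cronbach_alpha(scored),
        "alpha_if_deleted": alpha_if_deleted(scored),
        "scale_score": scored.mean(axis=1),
    }
```

Usage would be `build_scale(df)`, where `df` has one column per questionnaire item and one row per respondent; inspecting `alpha_if_deleted` shows whether dropping any item would raise alpha.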


				