Posted on: 4/2/2011 – Public Domain
1 Chapter 4 – Reliability
1. Observed Scores and True Scores
2. Error
3. How We Deal with Sources of Error:
   A. Domain sampling – test items
   B. Time sampling – test occasions
   C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability

2 Chapter 4 – Reliability
• Measurement of human ability and knowledge is challenging because:
  – ability is not directly observable; we infer ability from behavior
  – all behaviors are influenced by many variables, only a few of which matter to us

3 Observed Scores
O = T + e
where O = observed score, T = true score, e = error

4 Reliability – the basics
1. A true score on a test does not change with repeated testing.
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).

5 Reliability – the basics
• Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
• The mean of many observed scores for one person will therefore be that person's true score.

6 Reliability – the basics
• Example: we want to measure Sarah's spelling ability for English words.
• We can't ask her to spell every word in the OED, so we ask her to spell a subset of English words.
• Her % correct on that subset estimates her true English spelling skill.
• But which words should be in our subset?

7 Estimating Sarah's spelling ability
• Suppose we choose 20 words randomly.
• What if, by chance, we get a lot of very easy words: cat, tree, chair, stand?
• Or, by chance, a lot of very difficult words: desiccate, arteriosclerosis, numismatics?

8 Estimating Sarah's spelling ability
• Sarah's observed score varies as the difficulty of the random sets of words varies.
• But presumably her true score (her actual spelling ability) remains constant.
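The true-score model on these slides lends itself to a quick simulation. The sketch below is illustrative only, not from the chapter: the true score of 75, the error SD of 5, and the 10,000 repetitions are invented numbers (no real examinee could be retested this way).

```python
import random

random.seed(42)

TRUE_SCORE = 75.0   # Sarah's (unobservable) true spelling score -- invented

def observed_score():
    """One test administration: O = T + e, with random error e."""
    error = random.gauss(0, 5)   # errors are random, mean zero
    return TRUE_SCORE + error

# A single observed score can miss the true score by several points...
one_score = observed_score()

# ...but the mean of many observed scores converges on the true score,
# because positive and negative errors cancel each other out.
many = [observed_score() for _ in range(10_000)]
mean_score = sum(many) / len(many)

print(round(one_score, 1))
print(round(mean_score, 1))   # very close to 75.0
```

With errors drawn from a zero-mean distribution, the average of many observed scores converges on T, which is exactly the claim on slide 5.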
9 Reliability – the basics
• Other things can also produce error in our measurement.
• E.g., on the first day that we test Sarah she is tired, but on the second day she is rested.
• This would lead to different scores on the two days.

10 Estimating Sarah's spelling ability
• Conclusion: the variation in Sarah's scores is produced by measurement error.
• O = T + e, but e1 ≠ e2 ≠ e3 …
• How can we measure such effects – how can we measure reliability?

11 Reliability – the basics
• In what follows, we consider various sources of error in measurement.
• Different ways of measuring reliability are sensitive to different sources of error.

12 How do we deal with sources of error?
• Error due to test items → domain sampling error

13 How do we deal with sources of error?
• Error due to test items → domain sampling error
• Error due to testing occasions → time sampling error

14 How do we deal with sources of error?
• Error due to test items → domain sampling error
• Error due to testing occasions → time sampling error
• Error due to testing multiple traits → internal consistency error

15 Domain sampling error
• A knowledge base or skill set containing many items is to be tested – e.g., the chemical properties of foods.
• We can't test the entire set of items, so we select a sample of items.
• That produces domain sampling error, as in Sarah's spelling test.

16 Domain sampling error
• There is a "domain" of knowledge to be tested.
• A person's score may vary depending upon what is included in or excluded from the test.

17 Domain sampling error
• Smaller sets of items may not test the entire knowledge base.
• Larger sets of items should do a better job of covering the whole knowledge base.
• As a result, the reliability of a test increases with the number of items on that test.

18 Domain sampling error
• Parallel forms reliability: choose 2 different sets of test items; these 2 sets give you "parallel forms" of the test.
• Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error.
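Parallel-forms reliability comes down to a correlation between scores on the two forms across examinees. A minimal sketch, with invented scores for five examinees:

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical scores for five examinees on two parallel forms:
form_a = [12, 17, 9, 14, 20]
form_b = [11, 18, 10, 13, 19]

r = pearson_r(form_a, form_b)
print(round(r, 3))   # high r -> little domain sampling error
```

A high correlation suggests the two item samples cover the domain equivalently; a low one signals domain sampling error.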
19 Time sampling error
• A person taking a test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
• Test-retest reliability: give the same test repeatedly and check correlations among the scores.
• High correlations indicate stability – less influence of bad or good days.

20 Time sampling error
• The test-retest approach is only useful for traits – characteristics that don't change over time.
• Not all low test-retest correlations imply a weak test: sometimes the characteristic being measured varies with time (as in learning).

21 Time sampling error
• The interval over which the correlation is measured matters.
• E.g., for young children, use a very short period (< 1 month, in general).
• In general, the interval should not be > 6 months.

22 Time sampling error
• Advantage of the test-retest approach: easy to evaluate, using correlation.
• Disadvantage: carryover and practice effects.
• Carryover: the first testing session influences scores on the next session.
• Practice: a carryover effect that involves learning.

23 Internal consistency error
• Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
• Would you expect much correlation between scores on the two parts?
• No – because the two 'skills' are unrelated.

24 Internal consistency approach
• A low correlation between scores on the 2 halves of a test suggests that the test is tapping two different abilities or traits.
• A good test has high correlations between scores on its two halves.
• But how should we divide the test in two to check that correlation?

25 Internal consistency error
• Split-half method
• Kuder-Richardson formula
• Cronbach's alpha
• All of these assess the extent to which items on a given test measure the same ability or trait.
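The first of these methods, split-half, can be sketched in a few lines. The item data below are hypothetical 0/1 scores; the last step applies the Spearman-Brown correction (detailed on the following slides), because each half-test is shorter, and hence less reliable, than the whole test.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def split_half_reliability(items_by_person):
    """Odd-even split: score the two halves separately, correlate
    them, then step the correlation up with Spearman-Brown."""
    odd = [sum(row[0::2]) for row in items_by_person]    # items 1,3,5,...
    even = [sum(row[1::2]) for row in items_by_person]   # items 2,4,6,...
    r_c = pearson_r(odd, even)        # computed half-test correlation
    return (2 * r_c) / (1 + r_c)      # Spearman-Brown estimate

# Hypothetical 0/1 item scores, one row per examinee:
scores = [
    [1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
]
rel = split_half_reliability(scores)
print(round(rel, 3))
```

Note that the corrected estimate is always higher than the raw half-test correlation, reflecting the greater reliability of the full-length test.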
26 Split-half reliability
• After testing, divide the test items into two halves – A and B – that are scored separately.
• Check the correlation of the results for A with the results for B.
• There are various ways of dividing the test into two: randomly, first half vs. second half, odd vs. even items…

27 Split-half reliability – a problem
• Each half-test is smaller than the whole test.
• Smaller tests have lower reliability (domain sampling error).
• So we shouldn't use the raw split-half correlation to assess reliability for the whole test.

28 Split-half reliability – a problem
• We correct the reliability estimate using the Spearman-Brown formula:

  re = 2rc / (1 + rc)

  where re = estimated reliability for the whole test, and rc = computed reliability (the correlation between scores on the two halves A and B).

29 Kuder-Richardson 20
• Kuder & Richardson (1937): an internal-consistency measure that doesn't require arbitrary splitting of the test into 2 halves.
• KR-20 avoids the problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.

30 Kuder-Richardson 20
• The formula contains two basic terms:
1. a measure of all the variance in the whole set of test results.

31 Kuder-Richardson 20
• The formula contains two basic terms:
2. "item variance" – when items measure the same trait, they co-vary (the same people get them right or wrong). More co-variance = less "item variance".

32 Internal consistency – Cronbach's α
• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
• Cronbach's α (alpha) generalizes KR-20 to tests with multiple response categories.
• α is a more generally useful measure of internal consistency than KR-20.
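Cronbach's α has a compact closed form built from exactly the two terms described above: α = (k/(k−1)) × (1 − Σ item variances / total-score variance). A sketch with invented Likert-type responses; for 0/1 items the same computation yields KR-20:

```python
from statistics import pvariance

def cronbach_alpha(items_by_person):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances /
    variance of total scores). Reduces to KR-20 for 0/1 items."""
    k = len(items_by_person[0])
    columns = list(zip(*items_by_person))          # one tuple per item
    item_vars = sum(pvariance(col) for col in columns)
    total_var = pvariance([sum(row) for row in items_by_person])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical Likert responses (1-5), one row per respondent:
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

When items co-vary strongly (the same people score high on all of them), the total-score variance dwarfs the summed item variances and α approaches 1, matching the slide's "more co-variance = less item variance" intuition.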
33 Review: How do we deal with sources of error?

  Approach        Measures                             Issues
  Test-retest     Stability of scores                  Carryover
  Parallel forms  Equivalence & stability              Effort
  Split-half      Equivalence & internal consistency   Shortened test
  KR-20 & α       Equivalence & internal consistency   Difficult to calculate

34 Reliability in observational studies
• Some psychologists collect data by observing behavior rather than by testing.
• This approach requires time sampling, leading to sampling error.
• Further error arises from observer failures and inter-observer differences.

35 Reliability in observational studies
• Deal with the possibility of failure in the single-observer situation by having more than one observer.
• Deal with inter-observer differences using: inter-rater reliability; the kappa statistic.

36 Reliability in observational studies
• Inter-rater reliability: % agreement between 2 or more observers.
• Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
• This means that % agreement may over-estimate inter-rater reliability.

37 Reliability in observational studies
• Kappa statistic (Cohen, 1960): estimates actual inter-rater agreement as a proportion of potential inter-rater agreement, after correction for chance.

38 Using reliability information
• Standard error of measurement (SEM): estimates the extent to which a test score misrepresents a true score.
• SEM = S√(1 – r), where S is the standard deviation of the test scores and r is the test's reliability.

39 Standard error of measurement
• We use the SEM to compute a confidence interval for a particular test score.
• The interval is centered on the test score.
• We have confidence that the true score falls in this interval.
• E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score.

40 Standard error of measurement
• A simple way to think of the SEM: suppose we gave one student the same test over and over, and suppose, too, that no learning took place between tests and the student did not memorize questions.
• The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.
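Once S and r are known, the SEM and its confidence interval are a one-line computation each. The test SD of 15, reliability of 0.91, and observed score of 110 below are invented, illustrative values:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(score, sd, reliability, z=1.96):
    """Interval around an observed score expected to contain the
    true score (95% of the time for z = 1.96)."""
    margin = z * sem(sd, reliability)
    return (score - margin, score + margin)

# Hypothetical IQ-style test: SD = 15, reliability = 0.91, score = 110
lo, hi = confidence_interval(110, sd=15, reliability=0.91)
print(round(sem(15, 0.91), 2))      # 4.5
print(round(lo, 1), round(hi, 1))   # 101.2 118.8
```

Note how the interval shrinks as reliability rises: at r = 1 the SEM is zero and the observed score equals the true score.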
41 What to do about low reliability
• Increase the number of items.
• To find how many items you need, use the Spearman-Brown formula.
• Note: using more items may introduce new sources of error, such as fatigue and boredom.

42 What to do about low reliability
• Discriminability analysis: find the correlation between each item and the whole test.
• Delete items with low correlations.
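Both remedies can be sketched directly. The Spearman-Brown prophecy formula, solved for the lengthening factor n, tells us how much longer the test must be to reach a target reliability; the item-total correlations implement the discriminability analysis. All numbers below are hypothetical:

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def lengthening_factor(r_current, r_target):
    """Spearman-Brown prophecy solved for n: the factor by which a
    test must be lengthened to raise its reliability from r_current
    to r_target."""
    return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

def item_total_correlations(items_by_person):
    """Discriminability analysis: correlate each item with the total
    test score. Items with low correlations are deletion candidates."""
    totals = [sum(row) for row in items_by_person]
    return [pearson_r(col, totals) for col in zip(*items_by_person)]

# Hypothetical: current reliability is 0.70 and we want 0.90.
factor = lengthening_factor(0.70, 0.90)
print(round(factor, 2))   # 3.86 -> roughly quadruple the item count

# Hypothetical 0/1 item scores, one row per examinee:
scores = [[1, 1, 0], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
itcs = item_total_correlations(scores)
print([round(c, 2) for c in itcs])
```

The prophecy result also illustrates the slide's caveat: nearly quadrupling the test length to move from 0.70 to 0.90 may itself introduce fatigue and boredom error.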