Posted on: 7/29/2012
Inference, Confidence Intervals, Effect Sizes, and Power

The Vote
Evidence of hacking into email accounts to fix the results. Only six people voted. In some countries you are only allowed to vote after a handful of rich people have paid lobbyists and created adverts that insult rodent intellect.
Jesse, Maya, Amy

Today's Aims
• What is inference?
• What are confidence intervals?
  – How to make and report confidence intervals.
  – A glance at bootstrapping (more in a couple of weeks).
• Touch upon hypothesis testing (more next week).
• Effect sizes (and this will continue).
• What is power?
  – How to calculate and report power.

Interval versus Point Reporting

Inference: Point Estimates
• When we calculate the mean of a sample, we use it as an estimate of the population mean μ.
• The plug-in principle:
  – Requires that we believe the sample is representative of the population.
  – Requires that the sample statistic is an unbiased (or at least good) estimate of the population parameter.
• Some estimates are biased. For example, the sample range underestimates the population range.

100,000 More Iraqi Dead Post-Invasion: Roberts et al. (2004)
• Political issues:
  – Timing??? (published the Thursday before the US election)
  – Recommendations: the Geneva Convention says an occupying force has responsibilities. A US general said "we don't do body counts". The authors argue a count can be done (they did it in 4 weeks with 7 people) and is necessary under the Geneva Convention.

Cluster Sample
• Travelling is important to minimise! (GPS used)
• 33 clusters of 30 houses. They chose the nearest 30 houses in each cluster, which is probably not good.
• Their power analysis seems to assume a non-clustered sample.

[Figure: mortality by area and by household, before and after the invasion. Sorry the labels are small; the point is there is more red and dark blue after the invasion.]

Violent Deaths Up
• But this should be viewed in light of many methodological limitations.
  – The authors discuss these.
• The 100,000 estimate is 98,000 with a 95% CI from 8,000 to 194,000 (without Falluja).
(With Falluja included, the lower bound of the confidence interval is even lower.)
• This band includes most other estimates.
• Ethical problems?

What does 8,000 to 194,000 mean?
It does NOT mean that:
• There is a 95% probability that the number of deaths is between those numbers.
It means that:
• If you repeated the survey a billion times, and making lots of assumptions, 95% of the time the true value will be in that range. The philosophers say we can be 95% "confident", whatever that means in this context???

Aside on "a billion": in the US and most English-speaking places (the UK changed in 1974) it means 1,000,000,000. Elsewhere (much of Europe, South America, Cuba, Mexico, etc.) it has traditionally meant 1,000,000,000,000.

Constructing Confidence Intervals
• The population mean is μ (pronounced "mu").
• Estimate it with the sample mean x̄: the plug-in principle.
• But there is sampling error. We estimate the region that will usually include the population mean:

  CI95% = x̄ ± t0.05 × sd/√n    (lots of assumptions)

• Need to know df; df = n − 1 for this test. Here 94 − 1 = 93.
• t0.05 is usually about 2, but you need to look it up in the t table.

Example: Newton's (1998) Hostility Data
• The mean on arrival for the 94 prisoners was 28.3, with a standard deviation of 8.0.
• df = n − 1, so 93 (use about 90 in the t table).

  CI95% = x̄ ± t0.05 × se = 28.3 ± 1.99 × (8.0/√94) = 28.3 ± 1.99 × 0.825
        = 28.3 ± 1.6, or 26.7 to 29.9

[Figure: t distribution with df = 93 and 2.5% in each tail.]

What does having a 95% CI of 28.3 ± 1.6 mean?
We expect that about 95% of the time when a confidence interval is made, the population mean (μ) will be within the interval created. This allows us to be fairly "confident" that the confidence interval we calculate contains the population mean. It is not that there is a 95% probability that μ is within the interval. This is a tricky concept. Confidence intervals are a fundamental tool for the frequentist statistician. In the long run, you should be right (i.e., μ within the interval) about 95% of the time.
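As a sketch, the Newton (1998) interval above can be reproduced from just the summary statistics on the slide (mean 28.3, sd 8.0, n = 94), using base R's qt for the t critical value:

```r
# 95% CI from summary statistics (Newton's hostility data on the slide)
m  <- 28.3
s  <- 8.0
n  <- 94
se    <- s / sqrt(n)            # standard error, about 0.825
tcrit <- qt(0.975, df = n - 1)  # t critical value for df = 93, about 1.99
ci <- c(m - tcrit * se, m + tcrit * se)
round(ci, 1)                    # 26.7 to 29.9, matching the slide
```

Note that qt(0.975, ...) is used because a 95% interval leaves 2.5% in each tail.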
(This is a tricky concept and will be revisited.)

Plotting the precision of the estimate (confidence intervals) and the spread of the distribution (standard deviations). Both are in units of the original variable (here, in years).

[Figure: histogram of years in prison (0 to 30), with x̄ ± t0.05 × sd/√n (the 95% CI) and x̄ ± sd marked; the CI is much narrower than the sd band.]

Examining the Difference between Two Means for the Same Person

  CI95% = x̄diff ± t0.05 × sddiff/√n,  where diffᵢ = x1ᵢ − x2ᵢ

• Difference in means ± t0.05 times the standard error.
• The standard error of the difference uses an estimate of the standard deviation of the difference scores.
• Assumptions include that the difference is normally distributed, not the individual scores (for most tests the assumptions are about the residuals): last week's journal question.
• Just calculate a variable for the difference, and perform the calculations as you did before.

Brewed Awakenings: http://mybrewedawakening.com/
Data from 10 people's coffee preferences. (It is the difference variable that is assumed normal.)

  FRESHi  INSTANTi  DIFFi  DIFFi − x̄diff  (DIFFi − x̄diff)²
    5        3        2         1              1
    4        3        1         0              0
    6        5        1         0              0
    3        4       -1        -2              4
    4        4        0        -1              1
    5        3        2         1              1
    6        3        3         2              4
    3        3        0        -1              1
    5        3        2         1              1
    4        4        0        -1              1
  Sum  45   35       10         0             14
  Mean 4.5  3.5      1.0        0          sd = 1.25

  CI95% = 1.0 ± 2.26 × (1.25/√10) = 1.0 ± 0.89

Does this allow us to say anything else?

[Figure: width of the 95% CI plotted against sample size (0 to 100), one curve per sd. For n = 20: sd = 10 gives width 9.33; sd = 5 gives 4.66; sd = 2 gives 1.87; sd = 1 gives 0.93.]

Confidence Intervals for Differences between Groups

  pooled var = [(n₁ − 1)var₁ + (n₂ − 1)var₂] / [(n₁ − 1) + (n₂ − 1)]    (one of several possibilities)

  CI95% = (x̄₁ − x̄₂) ± t0.05 × √[pooled var × (1/n₁ + 1/n₂)]

So, more of a pain to calculate.

How Big Is an Effect?
• APA and all other science organisations stress the importance of saying how large an effect is (when one is found).
• Difference in two means: raw value. Useful.
• Correlation: standardised. Also useful.
• Difference in means divided by some measure of spread: standardised. Also useful.
• In the coffee example, should that measure of spread be the standard deviation of the liking ratings or of the difference scores?
• There are lots of effect size measures for different situations. Many can be transformed into a correlation-like measure, so many people like these.

Faster than Light?
[Figure: the neutrino beam path from Switzerland to Italy, 730 km.]
They arrived (60.7 ± 6.9 (stat.) ± 7.4 (sys.)) ns faster than light!
(v − c)/c = (2.48 ± 0.28 (stat.) ± 0.30 (sys.)) × 10⁻⁵
http://static.arxiv.org/pdf/1109.4897.pdf (c = 186,282 miles per second)

Calculating Confidence Intervals
In SPSS: Explore, and often available as an option elsewhere. Similar in R (or as a function), but often you just get the standard error (the reason why will be discussed soon). Lots of procedures print the confidence intervals, or have printing them as an option. There used to be a single useful page that did lots. See http://faculty.vassar.edu/lowry/VassarStats.htm and http://www.stat.tamu.edu/~jhardin/applets/ for several pages. Maybe you can write R functions to do these?

How To, Number 1: Two Approaches
• Mathematics (which is what is built into SPSS).
• Computation: the bootstrap.

Hypothesis Testing: The Quest for p
• If p < .05 we are happy.
• Not a good philosophy of science, but how a lot of psychology (and other disciplines) has been done.
"The almost universal reliance on merely refuting the null hypothesis is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology" (Meehl, 1978, p. 817).
• If H0 is true, 5% of the time we would reject it. This is called a Type 1 error.
• H0 is always false, so it is not really clear what the point of it is (more next week).

Power: 1 − β

                        State of the World
  Decision              H0 true         H0 false
  don't reject H0       correct         Type 2 error
  reject H0             Type 1 error    correct

• The probability of making a Type 2 error is conditional on the effect being a certain size. It is denoted β, and 1 − β is power. The convention to aim for is 80%.
• You need to know the size of effect that you want to detect. Most use past research (recommended) or Cohen's guidelines. This is wrong!
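The Type 1 error claim above ("if H0 is true, 5% of the time we would reject it") can be checked by simulation. A minimal sketch in R, assuming two groups of 20 drawn from the same normal population (so H0 really is true):

```r
# Simulating the Type 1 error rate: when H0 is true, about 5% of
# t tests still come out "significant" at the .05 level.
set.seed(1234)                  # arbitrary seed, for reproducibility
pvals <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value)
mean(pvals < .05)               # close to 0.05
```

This is the same simulation idea mentioned later as one way of doing power calculations: simulate from a model of the effect you care about and count how often the test rejects.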
A Few Ways to Do It
• Simulation. Set up a model of the smallest effect you want to detect, and use "sample".
• General stats programs. SPSS/PASW has an add-on (and syntax); R has a few functions.
• Cohen's tables. Discussed later.
• G*Power (or other specialist programs).

How To: G*Power
G*Power (Erdfelder, Faul, & Buchner, 1996, and later versions). Lots of software out there. In R: power.t.test, fpower, power.anova.test, power.prop.test, etc. People have written SPSS syntax for power: http://psychology3.anu.edu.au/people/smithson/details/CIstuff/CI.html but it is not easy to use.

Cohen's Tables
• For a t test, a medium-sized effect is d = (μ₁ − μ₂)/σ = 0.50. Small is 0.20 and large is 0.80.
• If the minimum difference worth detecting is d = 0.50, you need 128 people in your sample (64 in each group) to give you an 80% probability of detecting this difference (p < .05).
• For a small effect size you need 784.
• Many surveys have shown that often the power is too low!

Is It Really That Easy? Yes and No
The computations are a little tricky, but looking values up in the tables is easy. Understanding what to do, and getting adequate sample sizes, is sometimes difficult.

Thom Baguley's (2004) Critique
Positives:
• Avoiding low power
• Avoiding excessive power
• Efficient planning
Negatives:
• Used retrospectively, because SPSS prints something called power ("fundamentally flawed").
• Standardisation and automation.
• Ignoring things other than n which affect power.
• Treating the effect size as the expected effect size, not the minimum worth detecting.
• Should we be rejecting interval hypotheses rather than point hypotheses?

Journal: Why Dilbert was doomed to fail, assuming all Ratbert is claiming is that he is different from chance.

Recap
• Confidence intervals give you all the information in a p value, plus more.
• Still, an odd thing.
• Power is one thing to take into account when deciding on the sample size.
• Do not blindly use Cohen's conventions.
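The Cohen's-table figures above can be reproduced with power.t.test in base R (the tables round slightly differently, so the small-effect answer is a person or two off the 784 quoted):

```r
# Sample size for an independent-groups t test, 80% power, alpha = .05.
# Medium effect (d = 0.50): about 64 per group, 128 in total.
n_medium <- power.t.test(delta = 0.5, sd = 1, sig.level = .05,
                         power = .80)$n
round(n_medium)   # 64 per group

# Small effect (d = 0.20): about 393 per group.
n_small <- power.t.test(delta = 0.2, sd = 1, sig.level = .05,
                        power = .80)$n
round(n_small)    # 393 per group
```

Setting sd = 1 means delta is on the standardised (d) scale; with raw units you would give the real sd and the raw minimum difference worth detecting.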
Last Week's Journal
• Take one of your peers' research statements. Generate a causal hypothesis of interest and an associative hypothesis of interest.
• Create a variable that is the average (i.e., the mean) of two normally distributed variables. Is the average of two normally distributed variables itself normally distributed? The sum of normal variables is normal; the amalgamation is not (in general).
• In one sentence, answer the following: why do we calculate the mean value for some attribute for our sample?

Journal
• Find out how many participants you need if you want to be able to detect an r of .05 with an 80% chance, with alpha = .05, and write down the number. How about for r = .55?
  – Some do this with Cohen's tables, some with G*Power. Talk with peers.
• Play with the G*Power plots.
• Why was Dilbert, in the first frame, bound to fail?
• Suppose you have 25 variables distributed chi-square with three degrees of freedom, and 200 people. var1 <- rchisq(200,3) makes one variable. Look at hist(var1). Is it skewed? Add up 25 of these variables. Is this sum skewed? Look at hist(var1 + ... + var25). (There is a reason we are doing this, and the code is two pages on; try first without looking, then look.)

Stop, Dan
They don't want any more statistics now. They want to go.
(But there is a hint for the journal on the next slide.)

# Here it is for 25
library(e1071)        # this is for the skewness function
par(mfrow = c(1, 2))  # this makes 2 graphs on 1 screen
x <- rchisq(200, 3)
hist(x); qqnorm(x); qqline(x)
skewness(x)
shapiro.test(x)       # this tests normality
# (Shapiro is from FIU)
for (i in 2:25) x <- x + rchisq(200, 3)
hist(x); qqnorm(x); qqline(x)
skewness(x)
shapiro.test(x)
# Try with the sum of 100 variables

http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
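Finally, the coffee example from earlier can be checked with R's built-in paired t test, which prints the same 95% confidence interval as the hand calculation (the data are the ten preference pairs from the slide):

```r
# The Brewed Awakenings coffee data: paired 95% CI for fresh - instant.
fresh   <- c(5, 4, 6, 3, 4, 5, 6, 3, 5, 4)
instant <- c(3, 3, 5, 4, 4, 3, 3, 3, 3, 4)

mean(fresh - instant)   # 1.0, the mean difference from the slide
sd(fresh - instant)     # about 1.25, the sd of the differences

# Paired t test: CI is roughly 0.11 to 1.89, i.e. 1.0 +/- 0.89
ci_coffee <- t.test(fresh, instant, paired = TRUE)$conf.int
ci_coffee
```

This is the "just calculate a difference variable" shortcut from the slides: t.test(fresh - instant) gives the identical interval.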