Posted on 9/28/2011. Public Domain.
HYPOTHESIS TESTING

Often a set of data is collected, or an experiment carried out, not simply with a view to summarising the results and estimating suitable parameters, but rather in order to test an idea. This idea or hypothesis may arise by purely theoretical reasoning, or it may be suggested by the results of earlier experiments.

Brief overview: the way statistics sets up a hypothesis for testing is a little strange. First we start with what is called the "null hypothesis". This is the assumption that there is no effect of, e.g., the experimental treatment, a difference in conditions, etc. We test this against an alternative hypothesis: the hypothesis we are attempting to support with our data. Generally we hope that our data show sufficient differences from the expectations of the null hypothesis to reject it, and so accept our alternative hypothesis. E.g. from the null hypothesis we expect no effect of a drug upon heart rate. Our data show an increase. If that increase is sufficiently large, then we may conclude that the null hypothesis was wrong: there is an effect of this drug, which does cause an increase in heart rate.

(It is not always the case that we hope for a difference; one may instead hope to show that there is no effect. E.g. a tobacco company may wish to show that smoking its cigarettes does not cause an increase in a certain type of cancer. Rather than hoping to reject the null hypothesis, we may hope to be able to "fail to reject" the null hypothesis.)

We then use a statistical test to calculate the probability of observing a difference as large as that obtained, or larger, given that the null hypothesis is true. If the probability is less than some specified level, then we reject the null hypothesis and accept the alternative.

Null hypothesis

The notation commonly used to represent the null hypothesis is Ho, and that of the alternative hypothesis Ha (or H1). However, you do not often see these explicitly written in scientific papers.
You do sometimes see "the hypothesis we wish to test is…". During your research, however, it is very useful to state the null hypothesis as you would see it in statistical textbooks. Start by assuming there is no effect. What then would you expect? You would write something like:

Ho: µd - µp = 0, where "d" and "p" represent drug and placebo

[This is more often written as above rather than as Ho: µd = µp, but they are the same.] [Note that we use population parameters, not sample statistics.]

It may be something like: Ho: ρ = 0 [no correlation]

or something like: Ho: µ = 2 vs. Ha: µ ≠ 2

If the null hypothesis is rejected, then we need an alternative hypothesis to fall back on (your expectation, or the hypothesis being tested). This dichotomy is denoted:

Ho: µd - µp = 0 vs. Ha: µd - µp ≠ 0

or it might be something like:

Ho: µd - µp = 0 vs. Ha: µd < µp

The difference between these depends upon your expectations. For instance, if you were developing a drug that reduces heart rate (a beta blocker), then you may hope that µd < µp. Alternatively, you may wish to show that a change, any change, in heart rate as a side effect of the drug is not present, in which case Ha: µd - µp ≠ 0 is more appropriate. [We will come back to these differences.]

Level of significance

The next thing to do is set our level of significance. This is generally 5%, or 0.05, for historical reasons, but it is essentially arbitrary. [However, there are reasons why we might not want to use 0.05, which we will come back to later (Bonferroni).] The value 0.05 = 1/20 apparently comes from one sentence in one paper, but was undoubtedly based upon experience.

What is the level of significance? It is the chance we are willing to take of rejecting the null hypothesis given that it is true; 0.05 equals a 1/20 chance. There will always be natural variation, and so there may be some differences between what we would expect if the null hypothesis were true and the data we observed. For instance, if we toss a coin 100 times we would expect 50 heads and 50 tails.
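The drug-versus-placebo hypotheses above can be tested with a two-sample t-test. The sketch below uses SciPy, and the heart-rate numbers are made up purely for illustration; the `alternative` argument selects which Ha is being tested.

```python
# Hypothetical data: testing Ho: µd - µp = 0 against Ha: µd - µp != 0.
from scipy import stats

drug = [72, 78, 81, 75, 79, 74, 77, 80]      # invented heart rates (drug)
placebo = [70, 73, 69, 74, 71, 72, 68, 73]   # invented heart rates (placebo)

# alternative="two-sided" matches Ha: µd - µp != 0;
# alternative="less" would instead match Ha: µd < µp.
t_stat, p_value = stats.ttest_ind(drug, placebo, alternative="two-sided")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With these invented numbers the p-value comes out well below 0.05, so Ho would be rejected; with real data the conclusion depends entirely on the observations.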
If we got 51 heads and 49 tails we would rightly assume that this is probably not an effect of bias but simply natural variation. The level of significance sets how stringent we will be about any differences. If it is very important that we do not make the mistake of rejecting the null hypothesis when it is true, then we set a very low level of significance; e.g. perhaps we will only accept a 1/1000 chance of being wrong. The level of significance is our preset maximum (critical) level of rejection. It is usually denoted α.

p-value

Conducting a statistical test gives us a p-value. What is a p-value? The p-value represents the probability of observing a difference as large as that obtained, or larger, given that the null hypothesis is true. More formally: the probability, computed assuming that Ho is true, that the test statistic would take a value as extreme as or more extreme than that actually observed is called the p-value of the test. The smaller the p-value, the stronger the evidence against Ho provided by the data.

Example: our level of significance is 0.05. We are only willing to accept a 1/20 chance that we may reject the null hypothesis given that it is true. Our test gives us a p-value of 0.01. This means there is only a 1/100 chance that the difference we saw between the observed and expected data arose BY CHANCE ALONE. This chance, 1/100, is below our critical level of 1/20, and so we are confident we have made the right decision in rejecting the null hypothesis and accepting the alternative.

We have some difference in results, X. This gives us a p-value of 0.02. We are confident that there is enough evidence to reject the null hypothesis. However, had we had a more extreme difference, X + c, then we would be even more confident in rejecting the null hypothesis.
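The coin example can be made concrete: under Ho (a fair coin), the two-sided p-value for 51 heads in 100 tosses sums the probabilities of all outcomes at least as far from 50 as the one observed. A minimal sketch using only the Python standard library:

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n tosses of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, observed = 100, 51
# "As extreme or more extreme": every outcome at least as far from n/2 = 50
# as the observed count of 51.
p_value = sum(binom_pmf(k, n) for k in range(n + 1)
              if abs(k - n / 2) >= abs(observed - n / 2))
print(round(p_value, 4))  # 0.9204: far above 0.05, so we fail to reject Ho
```

As the text says, 51 heads is entirely compatible with natural variation; only a much larger imbalance would push the p-value below 0.05.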
"As extreme or more extreme" simply means that if we can reject the null hypothesis at a given degree of difference, then any greater difference we might see will automatically cause the null hypothesis to be rejected.

Statistical significance

If the p-value is as small as or smaller than α, we say that the data are statistically significant at level α. (The term "significant" was introduced by the English statistician Francis Y. Edgeworth in 1885 as meaning "corresponds to a real difference in fact".)

If α = 0.05 and p <= 0.05, we reject the null hypothesis and accept the alternative hypothesis. If α = 0.05 and p > 0.05, we fail to reject the null hypothesis (= "accept" the null hypothesis as being true). A large p means a large probability that the difference you saw was due to chance (so we fail to reject the null hypothesis); a low p means a small probability that the difference you saw was due to chance, and thus a high chance that there really is an effect of the experimental treatment (so we reject the null hypothesis in favour of the alternative hypothesis).

Terminology:

                            Probability           z
significant                 5%      *       |z| > 1.96
highly significant          1%      **      |z| > 2.58
very highly significant     0.1%    ***     |z| > 3.29

*   = 0.05 > p > 0.01
**  = 0.01 > p > 0.001
*** = p < 0.001

Usually the study will specify what the number of asterisks refers to.

The concept of the p-value is confusing: it is a bit of a double negative ("fail to reject the null hypothesis"). Technically we should not say that we accept the alternative hypothesis, because there may be other alternative hypotheses that fit the data better than the null hypothesis. This is rather like the Popperian method of falsification (The Logic of Scientific Discovery): we cannot prove something is true, but we can certainly prove it is not true. E.g. we cannot prove that all swans are white, since we would need to present evidence for every possible swan; however, the presence of a single black swan is sufficient to disprove the theory. Example 2: we cannot prove that the dodo is extinct; we could only conclusively show the opposite, that a dodo survives.
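The z thresholds in the table above are just quantiles of the standard normal distribution. A small sketch, using only the Python standard library, that reproduces them for a two-tailed test:

```python
from statistics import NormalDist

for alpha in (0.05, 0.01, 0.001):
    # Two-tailed test: put alpha/2 in each tail of N(0, 1).
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    print(f"alpha = {alpha}: reject Ho if |z| > {z_crit:.2f}")
```

This prints the familiar thresholds 1.96, 2.58 and 3.29.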
The same is true with statistics. It is difficult to prove that your hypothesis is the best possible hypothesis. However, you can prove the opposite: that there is not "no effect".

Testing our hypothesis:
1) Specify the null hypothesis (Ho) and the alternative hypothesis (Ha). The test is designed to assess the strength of the evidence against Ho. Ha is the statement we will accept if the evidence enables us to reject Ho.
2) (Optional) Specify the significance level α. This states how much evidence against Ho we will regard as decisive. Normally this will be 5%.
3) Calculate the value of the test statistic on which the test will be based. This is a statistic that measures how well the data conform to Ho.
4) Find the p-value for the observed data. This is the probability, calculated assuming that Ho is true, that the test statistic will weigh against Ho at least as strongly as it does for the observed data. If the p-value is less than or equal to α, the test result is statistically significant at level α.

Type I and type II error

The p-value is not a measure of effect but the risk you take in rejecting the null hypothesis given that it is true: the point at which you are willing to lose if you are wrong. With α = 0.05 there is a 1/20 chance that you might reject the null hypothesis when it is true. This error is known as a type I error.

                          reject Ho            fail to reject Ho
null hypothesis true      type I error (α)     correct decision
null hypothesis false     correct decision     type II error (β)

[1 - β = the power of the test, but we will return to this later.]

There are some situations in which you might want to reduce type I errors, and therefore use a smaller α, e.g. 0.01. For instance, you might want to make a decision about whether to invest 500 million marks in developing a promising new drug. You would not want to make the mistake of rejecting the null hypothesis (implying that the drug does work) when the results you got arose by chance.
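The four steps above can be sketched in code. The example below is a hypothetical one-sample, two-tailed z-test of Ho: µ = 2 with the population standard deviation assumed known, using only the Python standard library; the sample values and sigma are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma, alpha=0.05):
    """Steps 2-4: two-tailed one-sample z-test of Ho: µ = mu0 (sigma known)."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))   # step 3: test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))         # step 4: p-value under Ho
    return z, p, p <= alpha                        # significant at level alpha?

# Hypothetical data for Ho: µ = 2 vs. Ha: µ != 2, with sigma assumed to be 0.5.
z, p, significant = z_test([2.3, 2.1, 2.6, 2.2, 2.4, 2.5], mu0=2, sigma=0.5)
print(f"z = {z:.2f}, p = {p:.4f}, reject Ho: {significant}")
```

With these invented numbers the p-value is about 0.086, so at α = 0.05 we would fail to reject Ho even though the sample mean is above 2.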
You can easily reduce type I errors by altering α. You can reduce type II errors by increasing the sample size, which reduces the variance of the sample mean.

One-tailed and two-tailed tests

Earlier I stated that the steps are: 1) set α; 2) calculate the test statistic; 3) look up the p-value. The p-value does not necessarily correspond to α in the tables. It depends upon the alternative hypothesis, namely whether the test is one- or two-tailed.

If Ha: µd - µp ≠ 0, then we are not worried about which direction the difference is in, simply that there is some difference. If Ha: µd > µp, then we are explicitly stating which direction the expected difference is in; a large difference in the opposite direction will cause us to accept the null hypothesis in the same way as a small difference would.

The former is a two-tailed test: we test the test statistic at α/2 in the tables. The latter is a one-tailed test: we test the test statistic at α in the tables.

This is partly why you need to be very clear about setting down your null and alternative hypotheses, as it can alter the results at the final stage, when looking up p-values. For instance, in Minitab, when using a 2-sample t-test (a test to look at a difference in means), you must specify whether it is one- or two-tailed by choosing the right option for the alternative hypothesis: less than; not equal [the default]; greater than.

Z scores

All the above examples can be formulated in terms of z scores, which ultimately is how statistical packages will calculate them.

For two-tailed tests: the given value of z has been picked at random from N(0,1). Test: inspect the value of z, and if it is less than -1.96 or greater than 1.96 (i.e. if |z| > 1.96), reject the null hypothesis at the 5% level of significance. If |z| > 2.58, reject the null hypothesis at the 1% level, and if |z| > 3.29, reject it at the 0.1% level.

Critical values for one-tailed tests: 95%: 1.645; 99%: 2.326; 99.9%: 3.09.
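The one-tailed versus two-tailed distinction is easy to see in code: for the same z score, the two-tailed p-value is simply twice the one-tailed one. A sketch with the Python standard library (the z value of 1.8 is an arbitrary example):

```python
from statistics import NormalDist

def p_values(z):
    """Return (one-tailed, two-tailed) p-values for a z score from N(0, 1)."""
    one_tail = 1 - NormalDist().cdf(abs(z))  # area beyond |z| in one tail
    return one_tail, 2 * one_tail

one, two = p_values(1.8)
print(f"one-tailed p = {one:.4f}, two-tailed p = {two:.4f}")
```

For z = 1.8 this gives a one-tailed p of about 0.036 (significant at α = 0.05) but a two-tailed p of about 0.072 (not significant), which is exactly why the alternative hypothesis must be fixed before looking anything up.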