Hypothesis Testing

Bayesian Inference, 4/10/09

Hypothesis Testing

• Hypothesis testing is one area where Bayesian methods and results differ sharply from frequentist ones
• Example: Suppose we have a coin and wish to test the hypothesis that the coin is fair
• H0: The coin is fair, p(H) = p(T) = 0.5
• H1: The coin is not fair (anything but the above)
• We divide the parameter space Θ into two pieces, Θ0 and Θ1, such that if the parameter θ is in Θ0 then the hypothesis is true, and if it is in Θ1 then the hypothesis is false
• We observe a test statistic x, which is a function of the data X, and wish to decide, given x, whether or not to reject H0

Classical Hypothesis Testing

• Classical statisticians recognize two kinds of errors (this is horrible terminology, but we are stuck with it)
• A Type I error is made if we reject H0 when it is true
• A Type II error is made if we do not reject H0 when it is false
• Choose a rejection region R: R = {x : observing x in R leads to the rejection of H0}
• Then the probability of making a Type I error is p(x ∈ R | θ ∈ Θ0), and the probability of making a Type II error is p(x ∉ R | θ ∈ Θ1)

Classical Hypothesis Testing

• Often we take Θ0 to be a set containing a single point (simple hypothesis), so that
    α = p(x ∈ R | θ ∈ Θ0)
  is well-defined. However, Θ1 is usually a collection of intervals (composite hypothesis), and
    β = p(x ∉ R | θ ∈ Θ1)
  has no definite value. The best one can do is to evaluate
    p(x ∉ R | θ) = β(θ)
  as a function of θ. β(θ*) is the probability of making a Type II error if the true value of θ is θ* ∈ Θ1. q(θ) = 1 − β(θ) is known as the power function of the test.

Classical Hypothesis Testing

• We can construct different tests by
  » Choosing a different test statistic x(X) (though there is no real choice if x is sufficient)
  » Choosing a different rejection region R
• Note that both of these are subjective choices!
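These definitions can be made concrete with a short numerical sketch (my own illustration, not from the slides): for the coin example with n = 100 tosses and a one-sided rejection region R = {h ≥ 60} (a region assumed here purely for illustration), the Type I error rate α and the power function q(θ) are:

```python
from math import comb

def binom_tail(n, k, theta):
    """P(h >= k) for h ~ Binomial(n, theta)."""
    return sum(comb(n, h) * theta**h * (1 - theta)**(n - h)
               for h in range(k, n + 1))

n, k = 100, 60                 # 100 tosses; reject H0 if h >= 60 (assumed region)

# Type I error probability: alpha = p(x in R | theta in Theta0), Theta0 = {0.5}
alpha = binom_tail(n, k, 0.5)

# Power function q(theta) = 1 - beta(theta) = p(x in R | theta),
# evaluated on a grid of alternatives theta in Theta1
power = {theta: binom_tail(n, k, theta) for theta in (0.55, 0.6, 0.7)}

print(f"alpha = {alpha:.4f}")          # ~0.028, the one-sided tail area quoted later
for theta, q in power.items():
    print(f"q({theta}) = {q:.4f}")
```

Note how β(θ) = 1 − q(θ) has no single value: it depends on which alternative θ ∈ Θ1 is true, which is exactly why only the whole power function is well-defined.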
Classical Hypothesis Testing

• A common practice is to choose things so that α is some fixed fraction like 0.05 or 0.01. This gives us an α-level test, and the probability of making a Type I error is α
• Then, amongst the α-level tests, the goal would be to choose the test whose power function dominates that of all other tests. Such tests are called "uniformly most powerful" (UMP) tests, and they have the smallest probability of committing a Type II error for any given value of θ. Unfortunately, in general a UMP test does not exist

Classical Hypothesis Testing

• However, from a Bayesian point of view these classical tests are all suspect. Since they depend on the probability of x falling into some region R, they depend on the probability of data that might have been observed, but was not. Thus, they violate the Likelihood Principle.
• A Bayesian might well say that classical hypothesis tests commit a Type III error: giving the right answer to the wrong question.
• A Bayesian test would have to depend on, and be conditioned on, just the data X that was observed.

Bayesian Hypothesis Testing

• So, we return to the Bayesian paradigm. We have to have a prior; we need the likelihood; then we can compute the posterior.
• Suppose we have two simple hypotheses H0 and H1, likelihoods p(X | H0) and p(X | H1), and priors p(H0) and p(H1). Then the posterior odds are
    p(H0 | X) / p(H1 | X) = [p(X | H0) / p(X | H1)] × [p(H0) / p(H1)]

Marginal Likelihood

• In the Bayesian approach to hypothesis testing, the "marginal likelihood" plays a key role. Suppose we have hypotheses H0, H1, …, Hn. To each hypothesis Hj there corresponds a (possibly empty) set of parameters θj. For example, in the coin-tossing case there are no parameters for j = 0 and one parameter for j = 1.
• Note that there is no requirement that the parameters be nested.

Marginal Likelihood

• We can calculate the joint density of parameters and hypotheses, and from that the posterior probability, in the usual way:
    p(x, θj, Hj) = p(x | θj, Hj) p(θj | Hj) p(Hj)
    p(θj, Hj | x) = p(x | θj, Hj) p(θj | Hj) p(Hj) / p(x)
  where
    p(x) = Σ_{j=0..n} ∫ p(x, θj, Hj) dθj
  i.e., sum over the discrete parameters and integrate over the continuous ones, model by model.

Marginal Likelihood

• Now we can compute the posterior probability of any of our hypotheses by simply integrating over the continuous parameters of each hypothesis:
    p(Hj | x) = ∫ p(θj, Hj | x) dθj
• Since the proportionality constant p(x) is independent of j, we can simply write
    p(Hj | x) ∝ [∫ p(x | θj, Hj) p(θj | Hj) dθj] p(Hj) = m(x | Hj) p(Hj)
  m(x | Hj) is known as the marginal likelihood under Hj.

Bayesian Hypothesis Testing

• Many realistic examples involve testing a compound hypothesis. For example, our coin-tossing problem is to decide whether the coin is fair, based on observations of the number of heads and tails. The expected proportion of heads is θ. If the coin is fair, then θ = 0.5; if it is not fair, θ is some other value. This means we will have to put a prior on θ under the alternative hypothesis, for example U(0,1).
• Thus we are testing
    H0: θ = 0.5, p(θ | H0) = δ(θ − 0.5)
  against
    H1: θ ≠ 0.5, p(θ | H1) ~ U(0,1)

Bayesian Hypothesis Testing

• We compute the marginal likelihood under each model, which will be proportional to the posterior probability of each model:
    p(H0 | x) ∝ [∫ p(x | θ0, H0) p(θ0 | H0) dθ0] p(H0) = m(x | H0) p(H0)
  and similarly for H1.
• The odds for H0 and against H1 are simply
    p(H0 | x) / p(H1 | x) = [m(x | H0) / m(x | H1)] × [p(H0) / p(H1)]
• The ratio of the two marginal likelihoods is known as the Bayes factor.

Bayesian Hypothesis Testing

• For the particular case of coin tossing, x = {h, t}, with h the number of heads and t the number of tails observed, this becomes
    p(H0 | x) / p(H1 | x) = [∫ θ^h (1−θ)^t δ(θ − 0.5) dθ / ∫₀¹ θ^h (1−θ)^t dθ] × [p(H0) / p(H1)]
  and the Bayes factor works out to
    (h + t + 1) C(h+t, h) (0.5)^{h+t} = (n + 1) C(n, h) / 2^n, with n = h + t
• Do you see how the numerator and denominator get the values they do?

Bayesian Hypothesis Testing

• As with all Bayesian inference, the results of such a test depend on the prior. And in the case of a simple versus a compound hypothesis, the dependence on the prior on θ is even more sensitive than in parameter estimation problems, which means that one is less certain of the inference
• One can look at a range of sensible priors to see how sensitive the results are
• One can also look at particular classes of priors to see what the maximum evidence against the simple hypothesis is under that class of priors

Bayesian Hypothesis Testing

• Example: Suppose we toss a coin 100 times and obtain 60 heads and 40 tails. What is the evidence against the hypothesis that the coin is fair?
• Assuming the priors we did for our calculation, we find on prior odds 1 that the posterior odds are
    101 × C(100, 60) / 2^100 = 1.095
• In other words, these data (slightly) favor the null hypothesis that the coin is fair!
  » Surprising to a frequentist, since the one-sided p-value (the tail area for obtaining 60 or more heads in 100 tosses of a fair coin) is 0.028, which would reject the null in an α = 0.05 level test

p-values

• Recall that for a symmetric distribution a p-value is the tail area (one-sided) or twice the tail area (two-sided) under the probability distribution of the test statistic r(x) | θ, from where the data actually lie at r0 to infinity:
    p-value = ∫_{r0}^{∞} p(r | θ0) dr   (one-sided)
    p-value = ∫_{−∞}^{−r0} p(r | θ0) dr + ∫_{r0}^{∞} p(r | θ0) dr   (two-sided)

Bayesian Hypothesis Testing

• Example (continued): 60 heads and 40 tails in 100 tosses. If we look at this example by comparing the simple hypotheses "fair" versus "biased coin with θ = 0.6", where δ(θ − 0.6) is the most favorable prior on θ for the alternative hypothesis, we still get
    p(H0 | X) / p(H1 | X) = [0.5^h 0.5^t / (0.6^h 0.4^t)] × [p(H0) / p(H1)] = 0.134
• The corresponding probability is 0.118
• This is independent of the prior on the parameter, and we can consider it to be the maximum evidence against the null hypothesis. It is still over four times greater than the classical one-sided p-value

Bayesian Hypothesis Testing

• Priors for binomial data
• We used a flat prior for our analysis, for convenience
• The Jeffreys prior is beta(θ | 1/2, 1/2) ∝ θ^{−1/2} (1 − θ)^{−1/2}
  » Show this! Hint: E(h | θ) = nθ
• In practice the difference between the flat and Jeffreys priors won't matter much, since it amounts to just a difference of one extra head or tail
• Informative conjugate priors are beta distributions, ∝ θ^{a−1} (1 − θ)^{b−1}. You may choose the parameters a and b to match your prior knowledge

Bayesian Hypothesis Testing

• Priors for binomial data
• Jaynes suggests beta(θ | 0, 0) ∝ θ^{−1} (1 − θ)^{−1}
  » This has the advantage of agreeing with intuition if there is a good probability that either of the extremes θ = 0 or θ = 1 may be true (as with, for example, whether a random chemical taken off the shelf will or will not dissolve: presumably if it dissolves the first time, it will each time, and if it doesn't dissolve the first time, it won't dissolve any other time either)
  » However, if the number of heads or tails is 0, the posterior will not be normalizable, and the test will give odds 0 or ∞, which may or may not be desirable

MCMC Simulation

• We can calculate our results using simulation (useful when an exact solution is unavailable)
• We do this by simulating a random walk in both model space {H0, H1} and parameter space (θ). Thus we are simulating on both discrete and continuous parameters
• The key here is to allow ourselves to jump between our two models. This will in general be a Metropolis-Hastings (M-H) step
• Since the two models have differing numbers of parameters (in the coin-tossing case, one model has zero parameters and the other has one), we will have to propose parameters and models simultaneously
• I will describe a technique known as reversible jump MCMC, which is very effective

MCMC Simulation

• The best introduction to the reversible jump MCMC technique that I have found is "On Bayesian Model and Variable Selection Using MCMC," by Petros Dellaportas, Jonathan Forster and Ioannis Ntzoufras. It has been published in Statistics and Computing. A copy may be downloaded from the course website

MCMC Simulation

• Thus, in our coin-tossing problem, we may be in state (Hj, θj) and wish to jump to another state (Hk, θk)
• Propose a jump to state (Hk, θk) with probability q(Hk, θk | Hj, θj)
• Compute (the log of) the Metropolis-Hastings factor:
    α = [p(X | Hk, θk) p(Hk, θk) q(Hj, θj | Hk, θk)] / [p(X | Hj, θj) p(Hj, θj) q(Hk, θk | Hj, θj)]
• Generate u, (the log of) a U(0,1) random variable, and accept the step if u < α (log u < log α); otherwise stay where you are
• The resulting Markov chain samples the models in proportion to their posterior probability; marginalize out θ by ignoring it

MCMC Simulation

• We see that α is the ratio of two quantities of the form
    r_{m|n} = p(X | Hm, θm) p(Hm, θm) / q(Hm, θm | Hn, θn)
  where m and n refer to the two states
• We have a great deal of latitude in picking q. For example, we could choose it independent of the state n (independence sampler):
    r_{m|n} = p(X | Hm, θm) p(Hm, θm) / q(Hm, θm)

MCMC Simulation

• And we might factorize both p and q:
    r_{m|n} = p(X | Hm, θm) p(θm | Hm) p(Hm) / [q(θm | Hm) q(Hm)]
• Example: coin tosses. Choose a prior on Hm, for example p(H0) = p(H1) = 1/2
• Choose a proposal, for example q(H0) = q(H1) = 1/2
  » We'll want to reconsider this
• If m = 0, θ0 = 1/2 [strictly, p(θ0 | H0) = δ(θ0 − 0.5)], but if m = 1 we need a prior on θ1. For simplicity we will take a uniform prior, as in our calculation
• We also need to consider the proposals q(θm | Hm) for m = 0, 1

MCMC Simulation

• An excellent choice of q(θm | Hm) (if possible) would be to make it proportional to the posterior p(θm | X, Hm)! For then we would get
    r_{m|n} ∝ p(Hm) / q(Hm)
  which is a constant. Indeed, if we can also arrange things so that r_{m|n} = 1, we would get a Gibbs sampler!
• The latter can be done approximately by using a small training sample and picking q(Hm) using the results

MCMC Simulation

• Usually this excellent strategy is not possible; but if one can approximate the posterior by a distribution that one can sample from, that is a good strategy.
• A simple-minded strategy would be simply to choose q flat, whence
    r_{m|n} = p(X, θm | Hm) p(Hm) / q(Hm)
  » This works if the posterior is not too sharp (not much data). It is not so good if there is a very sharp posterior

p-values

• Amongst the many errors that people make in interpreting frequentist results, one in particular is very common: quoting a p-value as if it were a probability (e.g., that the null hypothesis is true, or that "the results occurred by chance")
• The approved use of p-values, on frequentist reasoning, is to report whether or not the p-value falls into the rejection region. This has the interpretation that if the null is true, then in no more than a fraction α of all cases will we commit a Type I error.
• The observed p-value is not a probability in any real sense! It is a statistic that happens to have a U(0,1) distribution under the null
• Despite the appeal of quoting p-values, an observed p-value has no valid frequentist probability interpretation

p-values

• If observed p-values were real probabilities, we could combine them using the rules for probability to obtain p-values of combined experiments. Thus (on the null hypothesis of a fair coin), if we observed 60 heads and 40 tails and then independently observed 40 heads and 60 tails, the one-sided p-value for the combined experiment is evidently 0.5, whereas the one-sided p-values for the two independent experiments are 0.028 and (1 − 0.028) respectively; their product is obviously not 0.5, contrary to the multiplication law
• Similar results hold for two-sided p-values

p-values

• Furthermore, suppose you routinely reject two-sided at a fixed α-level, say 0.05
• Suppose that in half the experiments the null was actually true
• Finally, suppose that in those experiments for which the null is false, the probability of a given effect size x decreases monotonically as you go away from 0 (in either direction)
  [Figure: a density p(e) of effect sizes, peaked at 0 and falling off monotonically in both directions]

p-values

• Then amongst those experiments rejected with p-values in (0.05 − ε, 0.05), for small ε, at least 30% will actually turn out to have a true null, and the true proportion can be much higher (it depends upon the distribution of the actual parameter for the experiments where the null is false)
• This says that, under these circumstances, the Type I error rate (the probability of rejecting a true null), conditioned on our having observed p = 0.05, is at least 30%!
• Thus the numerical value of an observed p-value greatly overstates the evidence against the null hypothesis, as we already found for coin tosses.

p-values

• The absolute maximum evidence against the null hypothesis can be gotten by evaluating the likelihood ratio at the data. For example, if x is standard normal and we observe x = 1.96, which corresponds to an α level of 0.05 (two-tailed), we can calculate the likelihood ratio as
    p(x | H0) / p(x | H1) = exp(−½ · 1.96²) / exp(−½ · 0²) = exp(−½ · 1.96²) = 0.146
  giving
    p(H0 | x) = 0.128 on prior odds of 1

p-values

• Papers on this subject can be found on the web:
• http://makeashorterlink.com/?P3CB12232 (paper by Berger and Delampady)
• http://makeashorterlink.com/?V2FB21232 (paper by Berger and Sellke, with comments and rejoinder)
• http://www.stat.duke.edu/~berger/p-values.html
• Note that the papers by Berger and Delampady and by Berger and Sellke must be accessed from within the university network or by proxy server.

Falsification

• Popper proposed that a scientific hypothesis must be falsifiable by data. For example, the hypothesis that a coin has two heads can be falsified by observing one tail
• A hypothesis H0 is falsifiable in Bayesian terms if, for some data D, its likelihood on H0 is 0: p(D | H0) = 0
• However, the requirement of falsifiability is too restrictive. In science, ideas are seldom, if ever, actually falsified.
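As an aside, the coin-tossing numbers quoted above (posterior odds 1.095, the maximum-evidence ratio 0.134, the non-multiplying p-values, and the likelihood-ratio bound 0.146 at x = 1.96) can all be reproduced with a short script; a sketch, assuming prior odds 1 throughout as the slides do:

```python
from math import comb, exp

h, t = 60, 40                     # 100 tosses: 60 heads, 40 tails
n = h + t

# Posterior odds for H0 (theta = 0.5) vs H1 (theta ~ U(0,1)): (n+1) C(n,h) / 2^n
odds = (n + 1) * comb(n, h) / 2**n
print(f"posterior odds = {odds:.3f}")        # ~1.095: data slightly favor the fair coin

# Maximum evidence against H0: point alternative theta = 0.6 (the slides' choice)
lr = 0.5**n / (0.6**h * 0.4**t)
print(f"LR vs theta=0.6 = {lr:.3f}")         # ~0.134
print(f"p(H0|x) = {lr / (1 + lr):.3f}")      # ~0.118

# One-sided p-values do not combine like probabilities:
def p_tail(n, k):
    """One-sided p-value P(h >= k) under the fair-coin null."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n

p1 = p_tail(100, 60)      # 60 heads in 100 tosses -> ~0.028
p2 = p_tail(100, 40)      # 40 heads in 100 tosses -> ~0.98
p12 = p_tail(200, 100)    # pooled data: 100 heads in 200 tosses -> ~0.53
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, p1*p2 = {p1*p2:.4f}, pooled p = {p12:.3f}")

# Maximum evidence at an 'exactly significant' normal observation x = 1.96
lr_norm = exp(-0.5 * 1.96**2)     # p(x|H0)/p(x|H1), with H1 centered at the data
print(f"LR at x=1.96: {lr_norm:.3f}, p(H0|x) = {lr_norm/(1+lr_norm):.3f}")
```

The pooled p-value sits near 0.5 while the product of the individual p-values is tiny, which is exactly the failure of the multiplication law described above.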
What usually happens is that old hypotheses are discarded in favor of new ones that new data have rendered more plausible, i.e., that have higher posterior probability

Bayesian Epistemology

• Bayesians measure the effect of new data D on the relative plausibility of hypotheses by calculating the Bayes factor
    F[H0 : H1 | D] = P(D | H0) / P(D | H1)
• Then we compute posterior odds from prior odds:
    O[H0 : H1 | D] = F[H0 : H1 | D] × O[H0 : H1]
• Bayes' theorem allows us to calculate the effect of new data on various hypotheses and adjust posterior probabilities accordingly. It thus becomes a justification for the inductive method

Ockham's Razor

• "Pluralitas non est ponenda sine necessitate." ("Plurality is not to be posited without necessity.")
  —William of Ockham
• Preferring the simpler of two hypotheses to the more complex, when both account for the data, is an old principle in science
• Why do we consider
    s = a + ut + ½gt²
  to be simpler than
    s = a + ut + ½gt² + ct³ ?

Ockham's Razor

• One way to reflect the common scientific experience that simple hypotheses are preferable is to choose the prior probabilities so that the simpler hypotheses have greater prior probability (Wrinch and Jeffreys)
• This is a "prior probabilities" interpretation of Ockham's razor
• Does it beg the question?
• What principle should be used to assign the priors?

Simplicity

• We regard H0 as simpler than H1 if it makes sharper predictions about what data will be observed
• Hypotheses can be considered more complex if they have extra adjustable parameters ("knobs") that allow them to be tweaked to accommodate a wider variety of data
• Complex hypotheses can accommodate a larger set of potential observations than can simple ones
• "This coin has two heads" vs. "This coin is fair"
• "This coin is fair" vs. "This coin has unknown bias θ"
• "The relationship is s = a + ut + ½gt²" vs. "The relationship is s = a + ut + ½gt² + ct³"

Plagiarism

• Compilers of mailing lists include bogus addresses to catch unauthorized repeat use of the list
• Mapmakers include small, innocuous mistakes to catch copyright violations
• Mathematical tables can be rounded up or down when the last digit ends in '5' without compromising the accuracy of the table. The compiler can embed a secret "code" in the table to catch copyright violations
• In all cases, duplication of these errors provides prima facie evidence, useful in court, that copying took place

Plagiarism

• Example: a table of 1000 sines
• We can expect to have a choice of rounding in about 100 cases
• Let D = "The rounding pattern is the same"
• Let C = "The second table was copied from the first"
• Then
    P(D | C) = 1, P(D | ¬C) ≈ 10^{−30}
    F[C : ¬C] ≈ 10^{30}

Evolutionary Biology

• The principle of descent with modification underlies evolution
• Pseudogenes are genes that have lost essential codes, rendering them nonfunctional
• Nearly identical pseudogenes are observed in closely related organisms (e.g., chimpanzees and humans). By the same arguments as before, the posterior probability that this is due to actual copying from a common ancestor is vastly greater than the posterior probability that it is due to coincidence. This is powerful evidence in favor of evolution
• Similar evidence is provided by the fact that the genetic code is redundant.
Several triplets of base pairs code for the same amino acids

Mercury's Perihelion Motion

• Around 1900, Newtonian mechanics was in trouble because of the problem of Mercury's perihelion motion
• Proposed solutions:
  » Rings of matter around the Sun, too faint to see
  » "Vulcan", a small planet near the Sun, difficult to detect
  » Flattening of the Sun
  » Additional terms in the law of gravity (e.g., a κr^{−3} term, where κ is an adjustable constant)
• Some solutions could be ruled out on observational grounds (Jeffreys-Poor debate, 1921)
• One could not rule out modifications to the law of gravity. The adjustable parameter κ can be chosen to allow any motion a of the perihelion

Mercury's Perihelion Motion

• Along comes Einstein and the General Theory of Relativity, which predicts a very precise value for Mercury's perihelion motion: no other value is possible
• Using contemporary figures:
  » Poor gives a = 41.6″ ± 2.0″
  » We have aE = 42.98″ for Einstein's theory (E)
• The conditional probability of the data a on the hypothesis E that the true value is aE is
    p(a | aE) = [1/(√(2π) σ)] exp[−(a − aE)² / 2σ²] = p(a | E)
  where σ = 2.0″ (the error of observation)

Mercury's Perihelion Motion

• The older theory F can be thought of as matching the observed value with a "fudge factor" aF
• Observations of Mars, Earth and Venus can limit the fudge factor to |aF| < 100″
• Assuming that Newton's theory is approximately correct, we have a prior density for Mercury's perihelion motion, which we take for now to be N(0, τ²) with τ = 50″:
    p(aF | F) = [1/(√(2π) τ)] exp[−aF² / 2τ²]

Mercury's Perihelion Motion

• The Bayes factor is
    p(a | E) / p(a | F) = √(1 + τ̄²) exp(−DE²/2) exp[DF² / 2(1 + τ̄²)] = 26.0
  where
    DE = (a − aE)/σ = −0.69,  DF = a/σ = 20.8,  τ̄ = τ/σ = 25.0
• This is moderately strong evidence in favor of E.
• The last two factors are O(1) and measure the "fit" of the two theories to the data. Nearly all of the Bayes factor is due to the first factor, which is known as the 'Ockham factor'.

What's Happening

• The Ockham factor arises from the fact that F spreads its bets over a much larger portion of parameter space than does E. Essentially, E puts all its money on aE, a precise value, spread out only by the error of observation σ. On the other hand, F spreads its bets out over a range that is 25 times bigger, and hence most of its probability is "wasted" covering regions of the parameter space that were not observed.
• E makes a sharp prediction, F a fuzzy one
• When the data come out even moderately close to where E predicts they will, E is rewarded for the risk it took by getting a larger share of the posterior probability
• The factor of about 25 is just the dilution in probability that F must sacrifice in order to fit a larger range of data

Examples

• How do p-values and posterior probabilities compare for sharp null hypotheses? Evidently, a small p-value is evidence against the null, but as we have seen, its numerical value overstates the evidence against the null
• Consider aE = aF = 0, so that we center everything at 0. Then the Bayes factor is
    F[E : F] = p(a | E) / p(a | F) = √(1 + τ̄²) exp[−(DE²/2) τ̄²/(1 + τ̄²)]

Examples

• Case 1: Let the p-value be 0.05, so that DE = 1.96. We can plot the Bayes factor versus τ̄
  [Figure: Bayes factor and posterior probability as functions of τ̄, for 0 ≤ τ̄ ≤ 6]

Examples

• Case 1: The minimum of the posterior probability is about 0.32, which is not very good evidence against the null hypothesis
• For large τ̄, we see that the Bayes factor is asymptotically
    F[E : F] = p(a | E) / p(a | F) ≈ τ̄ exp(−DE²/2)
  (the first factor is the Ockham factor, the second the evidence factor)

Examples

• From this we see that, for a given DE, the more vague the prior on the alternative (measured by τ̄), the larger the Bayes factor in favor of the sharp prediction of the null
• Theories with great predictive power (sharp predictions) are favored over those with vague predictions

Examples

• For fixed τ, and a standard deviation for a single observation of σ, with n observations we will have
    1 + τ̄² = 1 + nτ²/σ² = O(n) for large n
• Thus, asymptotically we have for large n
    F[E : F] ≈ √n (τ/σ) exp(−z²/2)
  where z = DE is the z-score, or standardized variable
  (√n (τ/σ) is the Ockham factor, exp(−z²/2) the evidence factor)

Examples

• This suggests a way to interpret p-values obtained on data with large n (I. J. Good)
• Let the p-value be α
• Compute the z-score zα which gives this value of α for the p-value (use the normal approximation and tables)
• Take τ/σ = O(1) and compute
    B = F[E : F] ≈ √n exp(−zα²/2)
• Then the posterior probability of the null is approximately
    p(H0 | data) ≈ B / (1 + B)

Jeffreys-Lindley "Paradox"

• Choose a p-value α > 0, however small. To this p-value there corresponds a z-score zα, and for large n the Bayes factor against the alternative is
    B ≈ √n exp(−zα²/2)
  (the √n factor increases with n; the exponential factor is fixed)
• Thus, for large enough n, a classical test can strongly reject the null at the same time that the Bayesian analysis strongly affirms it

Jeffreys-Lindley "Paradox"

• If H is a simple hypothesis and x the result of an experiment, the following two phenomena can exist simultaneously:
  » A significance test for H reveals that x is significant at the level p < α, where the pre-chosen rejection level α > 0 can be as small as we wish, and
  » The posterior probability of H, given x, is, for quite small prior probabilities of H, as high as (1 − α)
• This means that the classical significance test can reject H with an arbitrarily small p-value, while at the same time the evidence can convince us that H is almost certainly true

Jeffreys-Lindley "Paradox"

• Example: A parapsychologist has a number of subjects attempt to "influence" the output of a hardware random number generator (which operates by radioactive decays). In approximately 104,490,000 events, 18,471 excess events are counted in one direction versus the other
• This is a binomial distribution. The standard deviation of the binomial distribution is σ = √(nf(1−f)), where f is the expected frequency of counts. Straightforwardly we find that σ = 5111 counts (using f = 0.5)
• The classical significance test finds that the effect is significant at 18471/5111 = 3.61 standard deviations, for a p-value of 0.0003 (two-tailed), using the approximation, excellent in this case, that the binomial distribution can be approximated by a normal distribution

Jeffreys-Lindley "Paradox"

• The Bayesian analysis is quite different. We have a genuine belief that the null hypothesis of no (significant) effect might be true.
• To be sure, no point null is probably ever exactly true, because the random event generator might not be perfect. But tests of the generator are claimed to show that its bias is very small, so a point null is a good approximation

Jeffreys-Lindley "Paradox"

• On the alternative hypothesis, all we know is that the effect might be something, but we don't know how much, or even in what direction
• Parapsychologists call effects measured in the direction opposite to the intended one "psi-missing", and it is considered evidence for a real effect: sort of "heads I win, tails you lose".

Jeffreys-Lindley "Paradox"

• To reflect our ignorance we choose a uniform prior on the alternative
• We have already seen the analysis of this problem in the coin-flipping problem. The result is
    p(H0 | x) / p(H1 | x) ≈ 12 × [p(H0) / p(H1)]
• In other words, although the classical test rejects H0 with a very small p-value, the Bayesian analysis has made us twelve times more confident of the null hypothesis than we were!

Jeffreys-Lindley "Paradox"

• The p-value of 0.0003 corresponds to a z-score of zα = 3.61 on n = 104,490,000 events. Our approximate formula for the Bayes factor yields
    B ≈ √104,490,000 × exp(−(3.61)²/2) = 15.1
• This is certainly in the right ballpark and confirms the approximate formula

Connection with Ockham's Razor

• This result, where the Bayesian answer is at great odds with the orthodox result, can be understood in terms of Ockham's razor
• The sharp null hypothesis H0 is a special one that comes from our genuine belief (in our parapsychology example) that people cannot really influence the output of a random number generator by simply wishing it. The sharp hypothesis is inconsistent with nearly all possible data sets, since it is consistent only with the minuscule fraction of possible sequences for which the numbers of 0's and 1's are approximately equal. On the other hand, the alternative hypothesis is very open-ended and would be consistent with any possible data

Connection with Ockham's Razor

• This means that the alternative hypothesis has an adjustable parameter, the effect size θ, that the null hypothesis does not have. The null hypothesis makes a definite prediction, θ = 0.5, whereas the alternative hypothesis can fit any value of θ.
• Therefore the null hypothesis is simpler than the others in the sense we have been discussing. It has fewer parameters.
• Because of the Ockham factor, which naturally arises in the analysis, we can say that in some sense Ockham's razor is a consequence of Bayesian ideas. The Ockham factor automatically penalizes complex hypotheses, forcing them to fit the data significantly better than the simple one before they will be accepted

Connection with Ockham's Razor

• To conclude: there are at least three Bayesian interpretations of Ockham's razor
  » As a way of assigning prior probabilities to hypotheses, based on experience
  » As a consequence of the fact that complex hypotheses with more parameters, in their attempt to accommodate a larger set of possible data, are forced to waste prior probability on outcomes that are never observed. This automatic penalty factor favors the simpler theory
  » As an interpretation of the notion that, when fitting data to empirical models, one should avoid overfitting the data

Sampling to a Foregone Conclusion

• The phenomenon we have just discussed is closely related to the phenomenon of sampling to a foregone conclusion
• In classical significance testing, one is supposed to decide on everything in advance, and one is especially supposed to decide in advance on exactly how much data to take
• Failure to do this can lead to a situation where, if you sample long enough, at some point, with probability as close to 1 as you wish, you will reject a true null hypothesis with as small a preset significance level as you wish
• This is sampling to a foregone conclusion
• This phenomenon is peculiar to classical significance testing.
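As an aside, the approximate Bayes-factor formula B ≈ √n exp(−z²/2) can be checked against the random-number-generator example above; a sketch (the slides' 15.1 uses the rounded z = 3.61, so the unrounded figure comes out slightly lower):

```python
from math import sqrt, exp

# Figures from the slides' parapsychology example
n = 104_490_000            # total events
excess = 18_471            # excess counts in one direction
f = 0.5                    # null frequency

sigma = sqrt(n * f * (1 - f))      # binomial standard deviation, ~5111 counts
z = excess / sigma                 # ~3.61 standard deviations

# Good's large-n approximation for the Bayes factor in favor of the null,
# taking tau/sigma = O(1):  B ~ sqrt(n) * exp(-z^2 / 2)
B = sqrt(n) * exp(-0.5 * z**2)
posterior = B / (1 + B)            # approximate p(H0 | data) on prior odds 1

print(f"sigma = {sigma:.0f}, z = {z:.2f}")
print(f"B = {B:.1f}, p(H0|data) = {posterior:.3f}")
```

So while the classical test rejects at p = 0.0003, the approximate Bayes factor is around 15 in favor of the null, in line with the exact factor of about 12 quoted above.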
It cannot occur in Bayesian testing Bayesian Inference 4/10/09 70 Bayesian Inference 4/10/09 71 Stopping Rule Principle Stopping Rule Principle • Sampling to a foregone conclusion is related to the • The ability to ignore the stopping rule in Bayesian stopping rule principle (SRP) according to which the inference has profound implications for experimental stopping rule—how we decide to stop taking data—should design have no effect on the ﬁnal reported evidence about the • It is OK to stop if “the data look good” or “the data parameter ! obtained from the data look horrible”. Breaking a prior decision to test n • The SRP is a consequence of the Likelihood Principle patients, for example, will not compromise the validity • Classical testing violates the stopping rule principle of the test just as it violates the likelihood principle. Thus, » Thus if the data look good and the new treatment “sample until n=12” and “sample until t=3” are looks very effective, it would be unethical not to different stopping rules that give different inferences in break the protocol so that the patients on placebo classical binomial testing, but the same inferences in can get the effective drug Bayesian testing » Likewise if the ﬁrst 20 patients all died under the • See Berger & Wolpert, The Likelihood Principle new treatment it would be unethical to continue Bayesian Inference 4/10/09 72 Bayesian Inference 4/10/09 73 Stopping Rule Principle Frequentist Work-arounds • The ability to ignore the stopping rule in Bayesian • In frequentist hypothesis testing this problem can be inference has profound implications for experimental avoided through the device of “# spending”. 
For example, suppose we are in a drug clinical trial and wish to stop the trial at some point to do a preliminary assessment of the results, to decide whether to continue the trial:
• Terminate the trial because of excess bad outcomes
• Terminate the trial because the drug is so effective it would be unethical not to give it to the placebo group
A frequentist can "peek" at the data in this way if one is willing to "spend" some of the α at that point; but if the test is continued, it will require a smaller α at a later point to reach the preassigned α-level for the overall trial. However, this is much more complex and involved than the Bayesian approach, under which it is also OK simply to continue the test longer if the results are promising but not fully conclusive at the end of the scheduled test.

Hypothesis Testing and Prior Belief
• Prior belief has to be a consideration in any kind of hypothesis testing. Thus, on the same data, different degrees of plausibility may accrue to a hypothesis
• For example, consider shuffling a pack of alphabet cards. A subject is supposed to guess the letters on three cards, and suppose the subject names all three correctly. What do we think when:
» H1: The subject is a child and is allowed to look at the cards
» H2: The subject is a magician, who looks only at the backs of the cards
» H3: The subject is a psychic, who looks only at the backs of the cards

Good's Device
• The statistical evidence is the same in each case, but the posterior beliefs are very different!
• This example illustrates a device due to I.J. Good for setting prior probabilities
• It will take much more data to convince most people of the truth of H3 than of H2, even though the evidence is identical.
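Good's device of running Bayes' theorem in reverse can be sketched numerically. The assumptions here are illustrative and not from the slides: a 26-letter alphabet pack with chance probability 1/26 per guess, and a "psychic" hypothesis under which every guess is correct. If k correct calls in a row would just bring you to 1:1 posterior odds, your implied prior odds on the psychic hypothesis were 1 : 26^k.

```python
def implied_prior_odds(k, p_chance=1/26, p_psychic=1.0):
    """Prior odds on 'psychic' implied by reaching 1:1 posterior odds
    after k correct guesses in a row (Bayes' theorem in reverse).

    Posterior odds = prior odds * Bayes factor, and the Bayes factor
    for k correct guesses is (p_psychic / p_chance)**k.  Setting the
    posterior odds to 1 and solving gives the prior odds."""
    bayes_factor = (p_psychic / p_chance) ** k
    return 1.0 / bayes_factor

for k in (3, 5, 10):
    odds = implied_prior_odds(k)
    print(f"k = {k:2d} correct guesses: implied prior odds = 1 : {1/odds:.3g}")
```

So someone who would demand, say, five correct calls in a row before giving even odds is implicitly assigning the psychic hypothesis a prior probability of about one in twelve million.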
One can ask, "how much more data?" Then, by using Bayes' theorem in reverse, we can estimate our prior on the hypotheses
» How many times in a row would someone have to name a randomly picked card before you would give 1:1 odds that he was a psychic?

Example
• Published report of an experiment in ESP (Soal and Bateman, 1954)
• A deck of 25 Zener cards (5 different designs) is shuffled, and the subject guesses the order of the cards
• This is sampling without replacement, often misanalyzed by parapsychologists (P. Diaconis)
• The subject obtained r = 9410 "hits" out of n = 37100 trials (vs. 7420 ± 77 expected)
• What should we think? Under H0 (no ESP), p = 0.20, while the observed hit rate f = r/n = 0.2536 is 25.8σ away from the expected value
• Calculate the probability of the data given the null hypothesis:

    p(data | H0) = C(n,r) p^r (1 − p)^(n−r)

• Using the Stirling approximation

    n! ≈ √(2π) n^(n+1/2) e^(−n)

  we obtain

    p(data | H0) ≈ [n / (2π r(n − r))]^(1/2) exp(n H(f,p)),   f = 0.2536

  where the cross-entropy is

    H(f,p) = −f ln(f/p) − (1 − f) ln[(1 − f)/(1 − p)]

• The cross-entropy (a Kullback-Leibler divergence) measures the degree to which the observed distribution matches the expected distribution. Our entropy is just the cross-entropy relative to a uniform distribution
• Plugging in the data, we find that

    p(data | H0) = 0.00476 × exp(−313.6) = 3.15 × 10^(−139)

• This is very small! Does the subject have ESP?
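These quantities can be checked directly. This sketch computes the exact log binomial likelihood with `math.lgamma` rather than by hand through Stirling, and also evaluates the slides' cross-entropy approximation; the two agree to high accuracy, and both give a likelihood on the order of 10^(−139).

```python
from math import lgamma, log, pi, sqrt

n, r, p = 37100, 9410, 0.20
f = r / n                                    # observed hit rate

# Exact log binomial likelihood: log C(n,r) + r log p + (n-r) log(1-p)
log_like = (lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)
            + r * log(p) + (n - r) * log(1 - p))

# Stirling / cross-entropy form from the slides:
#   p(data|H0) ~ [n / (2 pi r (n-r))]^(1/2) * exp(n H(f,p))
H = -f * log(f / p) - (1 - f) * log((1 - f) / (1 - p))
log_approx = 0.5 * log(n / (2 * pi * r * (n - r))) + n * H

sigmas = (r - n * p) / sqrt(n * p * (1 - p))
print(f"f = {f:.4f}, sigmas from expectation: {sigmas:.1f}")
print(f"exact  log10 p(data|H0) = {log_like / log(10):.2f}")
print(f"approx log10 p(data|H0) = {log_approx / log(10):.2f}")
```

The Stirling-based approximation is essentially indistinguishable from the exact value at this sample size, which is why the slides can work with the cross-entropy form.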
• If we calculate the Bayes factor against the value of p that maximizes the cross-entropy (p = f), we still get huge odds against the null hypothesis
• The priors hardly matter. No matter what prior you choose on the alternative hypothesis, you're going to get very strong evidence against the null hypothesis
• So, does ESP exist?
• Well, no. Bayesian methodology requires us to consider a mutually exclusive and exhaustive set of hypotheses, and we haven't done that. We've left out, for example:
» H1: The experimenter altered the data
» H2: The subject was a conjuror
» H3: There was a flaw in the experimental design (e.g., the subject might have noticed the card reflected in the scientist's glasses)
• Each of H1, H2, H3, … may have a prior probability π(Hi) that is much greater than that of the hypothesis P that the subject has genuine psychic powers, and each would adequately account for the data. As a result,

    p(P | data) = p(data | P) π(P) / [ p(data | P) π(P) + Σ_{i≠P} p(data | Hi) π(Hi) ]

  and if the sum in the denominator dominates the first term, the posterior probability of P will always be much less than 1
• Our data is not that someone actually performed the feat in question; it is that someone reported to us that the feat was performed
• There are always hypotheses that haven't been considered, and sometimes they may be raised to significance when data come along that support them
• In fact, the Soal/Bateman result was shown to have been due to experimenter fraud: tampering with the data records. This was shown by Betty Markwick, who provided convincing evidence that Soal had altered the record sheets in a systematic way so as to achieve an excess of "hits".
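The denominator-domination argument can be illustrated with made-up numbers; every prior and likelihood below is purely hypothetical and chosen only to show the mechanism.

```python
# Hypothetical priors: mundane explanations are far more probable
# a priori than genuine psychic powers.
priors = {"psychic": 1e-7, "fraud": 1e-3, "conjuror": 1e-3, "design flaw": 1e-2}

# Hypothetical likelihoods p(data | H): each hypothesis accounts for
# the observed excess of hits about equally well.
likelihoods = {"psychic": 1.0, "fraud": 1.0, "conjuror": 1.0, "design flaw": 0.5}

total = sum(priors[h] * likelihoods[h] for h in priors)
posterior = {h: priors[h] * likelihoods[h] / total for h in priors}

for h, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"p({h} | data) = {prob:.2e}")
# The psychic hypothesis stays negligible even though the data are
# astronomically unlikely under "no ESP": the mundane alternatives
# absorb essentially all of the posterior mass.
```

However strongly the data disfavor "no ESP", the posterior probability of genuine psychic powers stays tiny, because the sum over mundane hypotheses dominates the denominator.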
• This is why careful consideration of all possible hypotheses is important

Making Decisions
• If one is testing hypotheses, it is for a reason. One does not just sit around saying, "Oh, well, I guess that hypothesis isn't true!"
• There must be some action implied by making that decision (an action can include "doing nothing")
» We decide to approve that drug
» We decide to invest in that stock
» We decide to publish our paper
» Etc., etc.
• Example: testing a drug or treatment
• Usually when testing a new treatment we will compare it to an old treatment or a placebo
• We won't approve the new treatment if it isn't better than the old one
• We might not approve the new treatment if it is significantly more costly than the old one, unless it is significantly better than the old one
» But that judgement might depend upon whether you are the patient, the drug company, the government, or the insurance company!
• We probably won't approve the new treatment if it has significant and adverse side-effects
• This shows that not only is the probability of the states of nature important, but we must also consider the consequences of each state of nature (cost, side-effects, desirable effects, and so on) given each of the possible decisions that we might make. Thus we must decide:
• What are the states of nature?
• What are the probabilities of each state of nature (Bayes)?
• What actions are available to us?
• In drug testing, for example, the actions a we might take are
» Approve the drug
» Do not approve the drug
• As a result of our testing we will end up with posterior probabilities on the states of nature θ, which will, for example, include the cure rate of the new drug relative to the old one, information on side effects, etc.
• What are the costs or benefits given each possible state of nature and each action (the loss function)?
• What are the expected costs or benefits of each action?
• Which action is the best under the circumstances?
• We will have to summarize the consequences of making various decisions about the drug as a utility or loss (= negative utility). Call the loss function L(θ,a); it is a function of the states of nature θ and the actions a
• Then the expected loss is a function of the action a and can be calculated, e.g.

    E_θ[L(θ, a)] = ∫ L(θ, a) p(θ | data) dθ

• Evidently, we would want to choose the action (approve, disapprove) depending upon which of these two actions gives the smallest expected loss
• If we were using utilities, we would instead maximize the expected utility
• From this discussion we can see that losses/utilities play an important role in making decisions. Some aspects are objective (e.g., monetary costs); however, many of them are subjective, just as priors may be subjective
• The person who is affected by the decision is the one that must determine the loss/utility for this calculation
» The insurance company, the drug manufacturer, the FDA, and the patient each will have a different loss/utility when choosing to approve, market, or use the drug. Each must use his own loss/utility when making the decision
» The role of the statistician is to assist the user, but it is not to set the loss/utility (the same goes for the patient's doctor)

Making Decisions
• Our course is not a course in decision theory. Useful books on decision theory are:
• Smart Choices, by Hammond, Keeney and Raiffa. A good introduction for lay people; stresses the process of eliciting utilities and discusses basics of probability
• Decision Analysis, by Howard Raiffa, and Making Decisions, by Dennis Lindley.
Both are good introductions
• Making Hard Decisions, by Robert Clemen. More advanced, with very detailed case analyses
• Statistical Decision Theory and Bayesian Analysis, Second Edition, by Jim Berger. Very advanced and theoretical
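The expected-loss calculation described above can be sketched end to end. Everything here is hypothetical: the posterior over the cure-rate improvement θ is a small discrete stand-in for a real posterior, and the loss table is invented purely for illustration.

```python
# Discrete stand-in for the posterior p(theta | data), where theta is
# the new drug's cure-rate improvement over the old one (hypothetical).
posterior = {-0.05: 0.10, 0.00: 0.20, 0.05: 0.40, 0.10: 0.30}

def loss(theta, action):
    """Invented loss table L(theta, a): approving a worse drug is very
    costly; failing to approve a better one forgoes its benefit."""
    if action == "approve":
        return 100.0 if theta < 0 else -50.0 * theta   # harm vs. benefit
    return 50.0 * max(theta, 0.0)                       # foregone benefit

actions = ("approve", "do not approve")

# Expected loss E[L(theta, a)] = sum_theta L(theta, a) p(theta | data),
# the discrete analogue of the integral above.
expected = {a: sum(p * loss(th, a) for th, p in posterior.items())
            for a in actions}

best = min(expected, key=expected.get)
for a in actions:
    print(f"E[L | {a}] = {expected[a]:+.2f}")
print("Bayes action:", best)
```

With these made-up numbers the small chance of approving a harmful drug outweighs the likely benefit, so the minimum-expected-loss action is not to approve; a different stakeholder's loss table could reverse that, which is exactly the point about losses being subjective.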
