11. Bayesian Hypothesis Testing

                     Hypothesis Testing
       •    Hypothesis testing is one area where Bayesian methods and results differ
            sharply from frequentist ones
       •    Example: Suppose we have a coin and wish to test the hypothesis that the
            coin is fair
             • H0: The coin is fair, p(H) = p(T) = 0.5
             • H1: The coin is not fair (anything but the above)
       •    We divide the parameter space $\Theta$ into two pieces, $\Theta_0$ and
            $\Theta_1$, such that if the parameter $\theta$ is in $\Theta_0$ then the
            hypothesis is true, and if it is in $\Theta_1$ then the hypothesis is false.
       •    We observe a test statistic x, which is a function of the data X, and wish
            to decide, given x, whether or not to reject H0

                     Classical Hypothesis Testing
       •    Classical statisticians recognize two kinds of errors (this is horrible
            terminology, but we are stuck with it)
             • A Type I error is made if we reject H0 when it is true
             • A Type II error is made if we do not reject H0 when it is false
       •    Choose a rejection region R:
             • R = {x: observing x in R leads to the rejection of H0}
       •    Then the probability of making a Type I error is
                    $p(x \in R \mid \theta \in \Theta_0)$
            and the probability of making a Type II error is
                    $p(x \notin R \mid \theta \in \Theta_1)$
Bayesian Inference                      4/10/09                             1   Bayesian Inference                     4/10/09                               2




                     Classical Hypothesis Testing
       •    Often we take $\Theta_0$ as a set containing a single point (simple
            hypothesis), so
                    $\alpha = p(x \in R \mid \theta \in \Theta_0)$
            is well-defined. However, $\Theta_1$ is usually a collection of intervals
            (composite hypothesis) and
                    $\beta = p(x \notin R \mid \theta \in \Theta_1)$
            has no definite value. The best one can do is to evaluate
                    $p(x \notin R \mid \theta) = \beta(\theta)$
            as a function of $\theta$. $\beta(\theta^*)$ is the probability of making a
            Type II error if the true value of $\theta$ is $\theta^* \in \Theta_1$.
            $q(\theta) = 1 - \beta(\theta)$ is known as the power function of the test.

                     Classical Hypothesis Testing
       •    We can construct different tests by
             • Choosing a different test statistic x(X)
                  » But no real choice if x is sufficient
             • Choosing a different rejection region R
       •    Note that both of these are subjective choices!
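As a concrete illustration of the Type I error probability and the power function defined above, the following Python sketch evaluates both for the coin example. The sample size n = 100, the one-sided rejection region R = {h >= 59}, and the function names are assumptions chosen for illustration only; they do not come from the slides.

```python
from math import comb

def binom_pmf(h, n, theta):
    """Probability of h heads in n tosses when p(H) = theta."""
    return comb(n, h) * theta**h * (1.0 - theta)**(n - h)

def reject_prob(theta, n=100, h_crit=59):
    """p(x in R | theta) for the assumed region R = {h >= h_crit}."""
    return sum(binom_pmf(h, n, theta) for h in range(h_crit, n + 1))

# Type I error probability: the null is simple (theta = 0.5), so alpha is a number.
alpha = reject_prob(0.5)
print(f"alpha = {alpha:.4f}")

# The alternative is composite, so the Type II error is a function of theta.
for theta in (0.55, 0.60, 0.70):
    q = reject_prob(theta)        # power q(theta) = 1 - beta(theta)
    print(f"theta = {theta:.2f}   power = {q:.3f}   beta = {1.0 - q:.3f}")
```

Changing `h_crit` gives a different rejection region and hence a different test, which is the sense in which the choice of R is left to the analyst.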




                     Classical Hypothesis Testing
       •    A common practice is to choose things so that $\alpha$ is some fixed
            fraction like 0.05 or 0.01. This gives us an $\alpha$-level test, and the
            probability of making a Type I error is $\alpha$
       •    Then, amongst the $\alpha$-level tests, the goal would be to choose that
            test such that the power function of the test dominates that of all other
            tests. Such tests are called “uniformly most powerful” (UMP) tests, and
            they have the smallest probability of committing a Type II error for any
            given value of $\theta$. Unfortunately, in general a UMP test does not exist

                     Classical Hypothesis Testing
       •    However, from a Bayesian point of view these classical tests are all
            suspect. Since they depend on the probability of x falling into some region
            R, they depend on the probability of data that might have been observed,
            but were not. Thus, they violate the Likelihood Principle.
       •    A Bayesian might well say that classical hypothesis tests commit a Type III
            error: Giving the right answer to the wrong question.
       •    A Bayesian test would have to depend on and be conditioned on just the data
            X that was observed.




                     Bayesian Hypothesis Testing
       •    So, we return to the Bayesian paradigm. We have to have a prior; we need
            the likelihood; then we can compute the posterior.
       •    Suppose we have two simple hypotheses. Thus, we have H0 and H1, likelihoods
            p(X|H0), p(X|H1), and priors p(H0), p(H1). Then the posterior odds are
                    $\frac{p(H_0 \mid X)}{p(H_1 \mid X)} = \frac{p(X \mid H_0)}{p(X \mid H_1)} \times \frac{p(H_0)}{p(H_1)}$

                     Marginal Likelihood
       •    In the Bayesian approach to hypothesis testing, the "marginal likelihood"
            plays a key role. Suppose we have hypotheses H0, H1,…,Hn. To each
            hypothesis Hj there corresponds a (possibly empty) set of parameters
            $\theta_j$. For example, in the coin-tossing case there are no parameters
            for j=0 and one parameter for j=1.
       •    Note that there is no requirement that the parameters be nested.

                     Marginal Likelihood
       •    We can calculate the joint density of parameters and hypotheses, and from
            that the posterior probability, in the usual way:
                    $p(x, \theta_j, H_j) = p(x \mid \theta_j, H_j)\, p(\theta_j \mid H_j)\, p(H_j)$
                    $p(\theta_j, H_j \mid x) = \frac{p(x \mid \theta_j, H_j)\, p(\theta_j \mid H_j)\, p(H_j)}{p(x)}$
            where
                    $p(x) = \sum_{j=0}^{n} \int p(x, \theta_j, H_j)\, d\theta_j$
            i.e., sum over the discrete parameters and integrate over the continuous
            ones, model by model.

                     Marginal Likelihood
       •    Now we can compute the posterior probability of any of our hypotheses, by
            simply integrating over the continuous parameters of each hypothesis:
                    $p(H_j \mid x) = \int p(\theta_j, H_j \mid x)\, d\theta_j = \frac{\int p(x \mid \theta_j, H_j)\, p(\theta_j \mid H_j)\, p(H_j)\, d\theta_j}{p(x)}$
       •    Since the proportionality constant p(x) is independent of j, we can simply
            write
                    $p(H_j \mid x) \propto \left[ \int p(x \mid \theta_j, H_j)\, p(\theta_j \mid H_j)\, d\theta_j \right] p(H_j) = m(x \mid H_j)\, p(H_j)$
            m(x|Hj) is known as the marginal likelihood under Hj.
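The recipe above (integrate over the continuous parameters model by model, then normalize over models) can be sketched numerically for the coin-tossing case. Everything specific here is an assumption made for illustration: the likelihood is taken as theta^h (1-theta)^t with the binomial coefficient dropped (it is common to both models), the prior on theta under H1 is taken flat on (0,1), the models get equal prior probability, and the data h = 7, t = 3 are hypothetical.

```python
def likelihood(theta, h, t):
    """p(x | theta) up to the binomial coefficient, which is common to both models."""
    return theta**h * (1.0 - theta)**t

def marginal_h0(h, t):
    """m(x | H0): H0 has no free parameter, theta is fixed at 0.5."""
    return likelihood(0.5, h, t)

def marginal_h1(h, t, steps=100_000):
    """m(x | H1): midpoint-rule integral of likelihood times an assumed flat prior."""
    dx = 1.0 / steps
    return sum(likelihood((i + 0.5) * dx, h, t) for i in range(steps)) * dx

def posterior_probs(h, t, prior=(0.5, 0.5)):
    """p(H_j | x) = m(x | H_j) p(H_j) / p(x), with p(x) the sum over models."""
    weights = [marginal_h0(h, t) * prior[0], marginal_h1(h, t) * prior[1]]
    p_x = sum(weights)
    return [w / p_x for w in weights]

p_h0, p_h1 = posterior_probs(7, 3)
print(f"p(H0 | x) = {p_h0:.3f}, p(H1 | x) = {p_h1:.3f}")
```

Because p(x) is just the sum of the per-model weights, it never has to be computed separately; it drops out when the weights are normalized.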




                     Bayesian Hypothesis Testing
       •    Many realistic examples involve testing a compound hypothesis. For example,
            our coin tossing problem is to decide if the coin is fair, based on
            observations of the number of heads and tails. The expected proportion of
            heads is $\theta$. If the coin is fair, then $\theta = 0.5$, and if it is
            not fair it is some other value. This means we will have to put a prior on
            $\theta$ under the alternative hypothesis, for example, U(0,1)
       •    Thus we are testing
                    $H_0: \theta = 0.5, \quad p(\theta \mid H_0) = \delta(\theta - 0.5)$
            against
                    $H_1: \theta \neq 0.5, \quad p(\theta \mid H_1) \sim U(0,1)$

                     Bayesian Hypothesis Testing
       •    We compute the marginal likelihood under each model; multiplied by the
            prior probability of the model, it is proportional to the posterior
            probability of that model:
                    $p(H_0 \mid x) \propto \left[ \int p(x \mid \theta_0, H_0)\, p(\theta_0 \mid H_0)\, d\theta_0 \right] p(H_0) = m(x \mid H_0)\, p(H_0)$
            and similarly for H1.
       •    The odds for H0 and against H1 are simply
                    $\frac{p(H_0 \mid x)}{p(H_1 \mid x)} = \frac{m(x \mid H_0)}{m(x \mid H_1)} \times \frac{p(H_0)}{p(H_1)}$
       •    The ratio of the two marginal likelihoods is known as the Bayes factor.
                     Bayesian Hypothesis Testing
       •    For the particular case of coin tossing, x={h,t} with h the number of heads
            and t the number of tails observed, this becomes
                    $\frac{p(H_0 \mid x)}{p(H_1 \mid x)} = \frac{\int_0^1 \theta^h (1-\theta)^t\, \delta(\theta - 0.5)\, d\theta}{\int_0^1 \theta^h (1-\theta)^t\, d\theta} \times \frac{p(H_0)}{p(H_1)}$
                    $= (h+t+1)\, C^{h+t}_h\, (0.5)^{h+t} \times \frac{p(H_0)}{p(H_1)}$
                    $= \frac{(n+1)\, C^n_h}{2^n} \times \frac{p(H_0)}{p(H_1)}$ with n = h + t
       •    Do you see how the numerator and denominator get the values they do?

                     Bayesian Hypothesis Testing
       •    As with all Bayesian inference, the results of such a test depend on the
            prior. In the case of a simple versus a compound hypothesis, the result
            depends on the prior on $\theta$ even more sensitively than in parameter
            estimation problems, which means that one is less certain of the inference
       •    One can look at a range of sensible priors to see how sensitive the results
            are
       •    One can also look at particular classes of priors to see what the maximum
            evidence against the simple hypothesis is under that class of priors
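One way to see where the numerator and denominator of the coin-tossing odds come from is sketched below. It assumes, as in the expression on the slide, that the binomial coefficient of the likelihood has already been cancelled between the two hypotheses; B denotes the beta function, a symbol not used on the slides.

```latex
\begin{align*}
\int_0^1 \theta^h (1-\theta)^t \,\delta(\theta - 0.5)\, d\theta
  &= (0.5)^h (0.5)^t = (0.5)^{h+t}, \\
\int_0^1 \theta^h (1-\theta)^t \, d\theta
  &= B(h+1,\, t+1) = \frac{h!\, t!}{(h+t+1)!}
   = \frac{1}{(h+t+1)\, C^{h+t}_h}, \\
\frac{(0.5)^{h+t}}{1 / \left[ (h+t+1)\, C^{h+t}_h \right]}
  &= (h+t+1)\, C^{h+t}_h\, (0.5)^{h+t}
   = \frac{(n+1)\, C^n_h}{2^n}, \qquad n = h + t.
\end{align*}
```

The delta function simply evaluates the likelihood at the point 0.5, while the flat prior averages the likelihood over the whole interval; the Bayes factor compares these two numbers.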




                     Bayesian Hypothesis Testing
       •    Example: Suppose we toss a coin 100 times and obtain 60 heads and 40 tails.
            What is the evidence against the hypothesis that the coin is fair?
             • Assuming the priors we used in our calculation, we find on prior odds 1
               that the posterior odds are
                    $\frac{101 \times C^{100}_{60}}{2^{100}} = 1.095$
             • In other words, these data (slightly) favor the null hypothesis that the
               coin is fair!
                  » Surprising to a frequentist since the one-sided p-value (tail area
                    for obtaining 60 or more heads on 100 tosses given a fair coin) is
                    0.028 which would reject the null in an $\alpha$=0.05 level test

                     p-values
       •    Recall that for a symmetric distribution a p-value is the tail area
            (one-sided) or twice the tail area (two-sided) under the probability
            distribution of the test statistic $r(x) \mid \theta$ from where the data
            actually lie at $x_0$ (with $r_0 = r(x_0)$) to infinity:
                    one-sided:  $\text{p-value} = \int_{r_0}^{\infty} p(r \mid \theta_0)\, dr$
                    two-sided:  $\text{p-value} = \int_{-\infty}^{-r_0} p(r \mid \theta_0)\, dr + \int_{r_0}^{\infty} p(r \mid \theta_0)\, dr$
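The two numbers quoted for the 60 heads / 40 tails example can be checked with a few lines of Python. This is a minimal sketch using exact integer binomial coefficients; the function names are made up for illustration, and the flat prior on the alternative and prior odds of 1 are the assumptions already stated on the slide.

```python
from math import comb

def posterior_odds_fair(h, t, prior_odds=1.0):
    """Odds H0:H1 = (n+1) C(n,h) / 2^n times the prior odds (flat prior under H1)."""
    n = h + t
    return (n + 1) * comb(n, h) / 2**n * prior_odds

def one_sided_p_value(h, n):
    """Tail area: probability of h or more heads in n tosses of a fair coin."""
    return sum(comb(n, k) for k in range(h, n + 1)) / 2**n

odds = posterior_odds_fair(60, 40)
p_val = one_sided_p_value(60, 100)
print(f"posterior odds H0:H1 = {odds:.3f}")    # the slide quotes 1.095
print(f"one-sided p-value    = {p_val:.3f}")   # the slide quotes 0.028
```

The same data therefore give odds slightly in favor of the null and a tail area that would reject it at the 0.05 level; the two quantities answer different questions.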
                     Bayesian Hypothesis Testing
       •    Example: Suppose we toss a coin 100 times and obtain 60 heads and 40 tails.
            What is the evidence against the hypothesis that the coin is fair?
             • If we look at this example by comparing the simple hypotheses “fair”
               versus “biased coin with $\theta$=0.6”, which corresponds to the prior
               $\delta(\theta - 0.6)$ on $\theta$ that is most favorable to the
               alternative hypothesis, we still get (on prior odds 1)
                    $\frac{p(H_0 \mid X)}{p(H_1 \mid X)} = \frac{0.5^h\, 0.5^t}{0.6^h\, 0.4^t} \times \frac{p(H_0)}{p(H_1)} = 0.134$
             • The corresponding posterior probability of H0 is 0.118
             • This bound holds whatever prior is placed on the parameter under the
               alternative, so we can consider it to be the maximum evidence against
               the null hypothesis. It is still over four times greater than the
               classical one-sided p-value

                     Bayesian Hypothesis Testing
       •    Priors for binomial data
             • We used a flat prior for our analysis, for convenience
             • The Jeffreys prior is $\text{beta}(\theta \mid 1/2, 1/2) \propto \theta^{-1/2}(1-\theta)^{-1/2}$
                  » Show this! Hint: $E(h \mid \theta) = n\theta$
             • In practice the choice between flat and Jeffreys won’t make much
               difference since it amounts to just one extra head or tail
             • Informative conjugate priors are beta distributions
               $\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$. You may choose the
               parameters to match your prior knowledge
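The simple-versus-simple bound and the effect of different beta priors can be compared in a short Python sketch. It assumes prior odds of 1 and uses the standard result that, for a beta(a, b) prior on the alternative, the marginal likelihood is B(h+a, t+b)/B(a, b) (binomial coefficient dropped on both sides). The function names are hypothetical, and no numerical value is claimed here for the Jeffreys case.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def odds_point_alternative(h, t, theta1):
    """Odds H0:H1 (prior odds 1) when H1 fixes theta at the single value theta1."""
    return exp((h + t) * log(0.5) - h * log(theta1) - t * log(1.0 - theta1))

def odds_beta_alternative(h, t, a, b):
    """Odds H0:H1 (prior odds 1) when theta ~ beta(a, b) under H1."""
    log_m1 = log_beta(h + a, t + b) - log_beta(a, b)
    return exp((h + t) * log(0.5) - log_m1)

h, t = 60, 40
bound = odds_point_alternative(h, t, h / (h + t))   # alternative placed at 0.6
flat = odds_beta_alternative(h, t, 1.0, 1.0)
jeffreys = odds_beta_alternative(h, t, 0.5, 0.5)
print(f"point alternative at 0.6: odds = {bound:.3f}, p(H0|x) = {bound / (1 + bound):.3f}")
print(f"flat prior, beta(1,1):    odds = {flat:.3f}")
print(f"Jeffreys, beta(1/2,1/2):  odds = {jeffreys:.3f}")
```

Any proper prior on the alternative averages the likelihood over values of theta that fit no better than 0.6, so its odds can never fall below the point-alternative bound.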




                     Bayesian Hypothesis Testing
       •    Priors for binomial data
             • Jaynes suggests $\text{beta}(\theta \mid 0, 0) \propto \theta^{-1}(1-\theta)^{-1}$
                  » This has the advantage of agreeing with intuition if there is a
                    good probability that either of the extremes $\theta$=0 or
                    $\theta$=1 may be true (as with, for example, whether a random
                    chemical taken off the shelf will or will not dissolve. Presumably
                    if it dissolves the first time, it will each time and if it doesn’t
                    dissolve the first time, it won’t dissolve any other time either)
                  » However, if the number of heads or tails is 0 the posterior will
                    not be normalizable and the test will give odds 0 or $\infty$,
                    which may or may not be desirable

                     MCMC Simulation
       •    We can calculate our results using simulation (useful when an exact
            solution is unavailable)
       •    We do this by simulating a random walk in both model space {H0, H1} and in
            parameter space ($\Theta$). Thus we are simulating on both discrete and
            continuous parameters
       •    The key here is to allow ourselves to jump between our two models. This
            will in general be a Metropolis-Hastings (M-H) step
       •    Since the two models have differing numbers of parameters (in the coin
            tossing case, one model has zero parameters and the other has one) we will
            have to propose parameters and models simultaneously
       •    I will describe a technique known as reversible jump MCMC which is very
            effective
                     MCMC Simulation
       •    The best introduction to the reversible jump MCMC technique that I have
            found is “On Bayesian Model and Variable Selection Using MCMC,” by Petros
            Dellaportas, Jonathan Forster and Ioannis Ntzoufras. It has been published
            in Statistics and Computing. A copy may be downloaded from the course
            website

                     MCMC Simulation
       •    Thus, in our coin tossing problem, we may be in state $(H_j, \theta_j)$ and
            wish to jump to another state $(H_k, \theta_k)$
             • Propose jump to state $(H_k, \theta_k)$ with probability
               $q(H_k, \theta_k \mid H_j, \theta_j)$
             • Compute (the log of) the Metropolis-Hastings factor:
                    $\alpha = \frac{p(X \mid H_k, \theta_k)\, p(H_k, \theta_k)\, q(H_j, \theta_j \mid H_k, \theta_k)}{p(X \mid H_j, \theta_j)\, p(H_j, \theta_j)\, q(H_k, \theta_k \mid H_j, \theta_j)}$
             • Generate u, (the log of) a U(0,1) random variable and accept the step if
               $u < \alpha$ ($\log u < \log \alpha$), otherwise stay where you are
             • The resulting Markov chain samples the models in proportion to their
               posterior probability; marginalize out $\theta$ by ignoring it




                     MCMC Simulation
       •    We see that $\alpha$ is the ratio of two quantities of the form
                    $\gamma_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(H_m, \theta_m)}{q(H_m, \theta_m \mid H_n, \theta_n)}$
            where m and n refer to the two states
       •    We have a great deal of latitude in picking q. For example, we could choose
            it independent of state n (independence sampler):
                    $\gamma_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(H_m, \theta_m)}{q(H_m, \theta_m)}$

                     MCMC Simulation
       •    And, we might factorize both p and q:
                    $\gamma_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(\theta_m \mid H_m)\, p(H_m)}{q(\theta_m \mid H_m)\, q(H_m)}$
             • Example: Coin tosses. Choose prior on Hm, for example,
               p(H0) = p(H1) = 1/2
             • Choose a proposal, for example q(H0) = q(H1) = 1/2
                  » We’ll want to reconsider this
             • If m=0, $\theta_0$=1/2 [strictly, $p(\theta_0 \mid H_0) = \delta(\theta_0 - 0.5)$],
               but if m=1 we need a prior on $\theta_1$. For simplicity we will take a
               uniform prior, as in our calculation
             • We need also to consider the proposals $q(\theta_m \mid H_m)$ for m=0,1
                     MCMC Simulation
       •    And, we might factorize both p and q:
                    $\gamma_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(\theta_m \mid H_m)\, p(H_m)}{q(\theta_m \mid H_m)\, q(H_m)}$
             • An excellent choice of $q(\theta_m \mid H_m)$ (if possible) would be to
               make it proportional to the posterior $p(\theta_m \mid X, H_m)$! For
               then we would get
                    $\gamma_{m|n} \propto \frac{p(H_m)}{q(H_m)}$
               which is constant in $\theta_m$ (the factor of proportionality is the
               marginal likelihood $m(X \mid H_m)$). Indeed, if we can also arrange
               things so that $\gamma_{m|n}$ is the same for every model, so that
               $\alpha = 1$ and every proposal is accepted, we would get a Gibbs
               sampler!
             • The latter can be done approximately by using a small training sample
               and picking $q(H_m)$ using the results

                     MCMC Simulation
       •    And, we might factorize both p and q:
                    $\gamma_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(\theta_m \mid H_m)\, p(H_m)}{q(\theta_m \mid H_m)\, q(H_m)}$
             • Usually the excellent strategy is not possible; if one can approximate
               the posterior by a distribution that one can sample from, that is a good
               strategy.
             • A simple-minded strategy would simply be to choose $q(\theta_m \mid H_m)$
               flat, whence
                    $\gamma_{m|n} = \frac{p(X, \theta_m \mid H_m)\, p(H_m)}{q(H_m)}$
                  » Works if the posterior is not too sharp (not much data). Not so
                    good if there is a very sharp posterior
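The "simple-minded" independence sampler described above can be sketched in Python for the coin problem. This is a minimal illustration, not the reversible jump scheme of Dellaportas et al.: it assumes p(H0) = p(H1) = 1/2, q(H0) = q(H1) = 1/2, a U(0,1) prior on theta under H1, and a flat proposal for theta, so that the ratio p(X|Hm,theta_m) p(theta_m|Hm) p(Hm) / [q(theta_m|Hm) q(Hm)] reduces to the likelihood. The function names, step count, and seed are arbitrary choices.

```python
import math
import random

def log_lik(theta, h, t):
    """log p(X | theta) for h heads and t tails (binomial coefficient dropped)."""
    if theta <= 0.0 or theta >= 1.0:
        return -math.inf
    return h * math.log(theta) + t * math.log(1.0 - theta)

def log_weight(model, theta, h, t):
    """log of p(X|H,theta) p(theta|H) p(H) / [q(theta|H) q(H)].

    With the assumed priors and proposals every factor except the
    likelihood cancels; under H0 theta is fixed at 0.5.
    """
    return log_lik(0.5 if model == 0 else theta, h, t)

def propose(rng):
    """Independence proposal: pick a model, then (for H1) a flat theta."""
    model = rng.randrange(2)                      # q(H0) = q(H1) = 1/2
    theta = 0.5 if model == 0 else rng.random()   # flat q(theta | H1)
    return model, theta

def run_chain(h, t, steps=200_000, seed=1):
    """Return the fraction of steps spent in H0, an estimate of p(H0 | X)."""
    rng = random.Random(seed)
    model, theta = 0, 0.5
    visits_h0 = 0
    for _ in range(steps):
        new_model, new_theta = propose(rng)
        log_alpha = log_weight(new_model, new_theta, h, t) - log_weight(model, theta, h, t)
        if math.log(1.0 - rng.random()) < log_alpha:   # accept if log u < log alpha
            model, theta = new_model, new_theta
        visits_h0 += (model == 0)                      # theta is simply ignored
    return visits_h0 / steps

frac_h0 = run_chain(60, 40)
print(f"estimated p(H0 | X) = {frac_h0:.3f}")   # analytic odds 1.095 give 1.095/2.095
```

With only 100 tosses the posterior for theta is broad enough that flat proposals are accepted reasonably often; with much more data nearly all proposed values of theta would be rejected, which is the weakness of the flat proposal noted above.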




                     p-values
       •    Amongst the many errors that people make interpreting frequentist results,
            one in particular is very common, and that is to quote a p-value as if it
            were a probability (e.g., that the null hypothesis is true, or that “the
            results occurred by chance”)
             • The approved use of p-values, on frequentist reasoning, is to report
               whether or not the p-value falls into the rejection region
               (p $\le \alpha$). This has the interpretation that if the null is true,
               then in no more than a fraction $\alpha$ of all cases will we commit a
               Type I error.
       •    The observed p-value is not a probability in any real sense! It is a
            statistic that happens to have a U(0,1) distribution under the null
       •    Despite the appeal of quoting p-values, an observed p-value has no valid
            frequentist probability interpretation

                     p-values
       •    The observed p-value is not a probability in any real sense! It is a
            statistic that happens to have a U(0,1) distribution under the null
             • If the observed p-values were real probabilities, we could combine them
               using the rules for probability to obtain p-values of combined
               experiments. Thus (on the null hypothesis of a fair coin), if we
               observed 60 heads and 40 tails and then independently observed 40 heads
               and 60 tails, the one-sided p-value for the combined experiment is
               evidently about 0.5, whereas the one-sided p-values for the two
               independent experiments are 0.028 and approximately (1 - 0.028)
               respectively; the product is obviously not 0.5, contrary to the
               multiplication law
             • Similar results hold for two-sided p-values
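The combined-experiment argument can be checked directly with exact binomial tail areas. The sketch below assumes a fair coin under the null and a one-sided test in the direction of "too many heads"; the function name is hypothetical. Because the binomial distribution is discrete, the exact values differ slightly from the rounded figures quoted in the argument.

```python
from math import comb

def one_sided_p_value(h, n):
    """Probability of h or more heads in n tosses of a fair coin."""
    return sum(comb(n, k) for k in range(h, n + 1)) / 2**n

p1 = one_sided_p_value(60, 100)           # first experiment: 60 heads, 40 tails
p2 = one_sided_p_value(40, 100)           # second experiment: 40 heads, 60 tails
p_combined = one_sided_p_value(100, 200)  # pooled data: 100 heads, 100 tails

print(f"p1 = {p1:.3f}, p2 = {p2:.3f}")
print(f"product p1 * p2     = {p1 * p2:.3f}")
print(f"combined experiment = {p_combined:.3f}")
```

The product of the two p-values is nowhere near the p-value of the pooled data, which is the sense in which p-values do not obey the multiplication law for independent events.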
                     p-values
       •    Furthermore, suppose you routinely reject two-sided at a fixed
            $\alpha$-level, say 0.05
       •    Suppose in half the experiments the null was actually true
       •    Finally, suppose that in those experiments for which the null is false, the
            probability of a given effect size x decreases monotonically as you go away
            from 0 (in either direction):
                    [Figure: p(e) plotted against x, peaked at 0 and decreasing in
                    either direction]

                     p-values
       •    Then amongst those experiments rejected with p-values in
            (0.05 - $\epsilon$, 0.05) for small $\epsilon$, in at least 30% the null
            will actually turn out to be true, and the true proportion can be much
            higher (depends upon the distribution of the actual parameter for the
            experiments where the null is false)
             • This says that under these circumstances, the Type I error rate
               (probability of rejecting a true null), conditioned on our having
               observed p=0.05, is at least 30%!
             • Thus the numerical value of an observed p-value greatly overstates the
               evidence against the null hypothesis, which we already found for coin
               tosses.




                                          p-values
       •    The absolute maximum evidence against the null
            hypothesis is obtained by evaluating the likelihood ratio
            at the data. For example, if x is standard normal and we
            observe x=1.96, which corresponds to an α level of 0.05
            (two-tailed), we can calculate the likelihood ratio as

            $$\frac{p(x \mid H_0)}{p(x \mid H_1)} = \frac{\frac{1}{\sqrt{2\pi}}\exp\left(-\tfrac{1}{2}\,1.96^2\right)}{\frac{1}{\sqrt{2\pi}}\exp\left(-\tfrac{1}{2}\,0^2\right)} = \exp\left(-\tfrac{1}{2}\,1.96^2\right) = 0.146$$

            $$p(H_0 \mid x) = 0.128 \text{ on prior odds of 1}$$

       •    Papers on this subject can be found on the web:
             • http://makeashorterlink.com/?P3CB12232 (Paper by
               Berger and Delampady)
             • http://makeashorterlink.com/?V2FB21232 (Paper by
               Berger and Sellke, with comments and rejoinder)
             • http://www.stat.duke.edu/~berger/p-values.html
       •    Note that the papers by Berger and Delampady and by
            Berger and Sellke must be accessed from within the
            university network or by proxy server.
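The likelihood-ratio bound for x=1.96 can be checked numerically. The following is a minimal sketch, assuming (as the formula on the slide suggests) that H1 is the alternative most favorable to the data, i.e. a unit-variance normal centered exactly on the observation; the helper name `normal_pdf` is only illustrative.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

x = 1.96  # observed value, two-tailed alpha = 0.05

# H0: mean 0.  Best-case H1 (assumption): mean placed at the observed x.
likelihood_ratio = normal_pdf(x, 0.0) / normal_pdf(x, x)

prior_odds = 1.0  # equal prior probability for H0 and H1
posterior_odds = likelihood_ratio * prior_odds
posterior_h0 = posterior_odds / (1.0 + posterior_odds)

print(f"likelihood ratio = {likelihood_ratio:.3f}")
print(f"p(H0 | x)        = {posterior_h0:.3f}")
```

Even with the alternative chosen after the fact to fit the data perfectly, the posterior probability of the null stays well above the nominal 0.05.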
                                Falsification
       •    Popper proposed that a scientific hypothesis must be
            falsifiable by data. For example, the hypothesis that a coin
            has two heads can be falsified by observing one tail
       •    A hypothesis H0 is falsifiable in Bayesian terms if, for
            some data D, its likelihood on H0 is 0: p(D|H0)=0
       •    However, the requirement of falsifiability is too
            restrictive. In science, ideas are seldom, if ever, actually
            falsified. What usually happens is that old hypotheses are
            discarded in favor of new ones that new data have
            rendered more plausible, i.e., have higher posterior
            probability

                            Bayesian Epistemology
       •    Bayesians measure the effect of new data D on the
            relative plausibility of hypotheses by calculating the
            Bayes factor

            $$F\left(\frac{H_0 \mid D}{H_1 \mid D}\right) = \frac{P(D \mid H_0)}{P(D \mid H_1)}$$

       •    Then we compute posterior odds from prior odds

            $$O\left(\frac{H_0 \mid D}{H_1 \mid D}\right) = F\left(\frac{H_0 \mid D}{H_1 \mid D}\right)\, O\left(\frac{H_0}{H_1}\right)$$

       •    Bayes’ theorem allows us to calculate the effect of new
            data on various hypotheses and adjust posterior
            probabilities accordingly. It thus becomes a justification
            for the inductive method
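The odds form of Bayes' theorem above lends itself to sequential updating: the posterior odds after one data set serve as the prior odds for the next. A minimal sketch follows; the function names and the two Bayes factors (0.5 and 4.0) are hypothetical values chosen only for illustration, and the data sets are assumed to be independent given each hypothesis.

```python
def update_odds(prior_odds, bayes_factor):
    """Posterior odds O(H0/H1 | D) = F * O(H0/H1)."""
    return bayes_factor * prior_odds

def odds_to_probability(odds):
    """Convert odds in favor of H0 into the probability of H0."""
    return odds / (1.0 + odds)

odds = 1.0  # prior odds of 1: H0 and H1 equally plausible

# Hypothetical Bayes factors for two successive, independent data sets
for bayes_factor in (0.5, 4.0):
    odds = update_odds(odds, bayes_factor)
    print(f"F = {bayes_factor}: odds = {odds}, "
          f"p(H0 | data) = {odds_to_probability(odds):.3f}")
```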




                             Ockham’s Razor
       •    “Pluralitas non est ponenda sine necessitate.”
            [“Plurality should not be posited without necessity.”]
                                         (William of Ockham)
       •    Preferring the simpler of two hypotheses to the more
            complex, when both account for the data, is an old
            principle in science
             • Why do we consider
               $$s = a + ut + \tfrac{1}{2} g t^2$$
               to be simpler than
               $$s = a + ut + \tfrac{1}{2} g t^2 + c t^3\ ?$$
       •    One way to reflect the common scientific experience that
            simple hypotheses are preferable is to choose the prior
            probabilities so that the simpler hypotheses have greater
            prior probability (Wrinch and Jeffreys)
             • This is a “prior probabilities” interpretation of
               Ockham’s razor
             • Does it beg the question?
             • What principle should be used to assign the priors?
                                 Simplicity
       •    We regard H0 as simpler than H1 if it makes sharper
            predictions about what data will be observed
       •    Hypotheses can be considered more complex if they have
            extra adjustable parameters (“knobs”) that allow them to
            be tweaked to accommodate a wider variety of data
       •    Complex hypotheses can accommodate a larger set of
            potential observations than can simple ones
             • “This coin has two heads” vs. “This coin is fair”
             • “This coin is fair” vs. “This coin has unknown bias θ”
             • “The relationship is $s = a + ut + \tfrac{1}{2} g t^2$” vs. “The
               relationship is $s = a + ut + \tfrac{1}{2} g t^2 + c t^3$”

                                 Plagiarism
       •    Compilers of mailing lists include bogus addresses to
            catch unauthorized repeat use of the list
       •    Mapmakers include small, innocuous mistakes to catch
            copyright violations
       •    Mathematical tables can be rounded up or down if the last
            digit ends in ‘5’ without compromising the accuracy of the
            table. The compiler can embed a secret “code” in the table
            to catch copyright violations
       •    In all cases, duplication of these errors provides prima
            facie evidence, useful in court, that copying took place




                                 Plagiarism
       •    Example: a table of 1000 sines
             • Can expect to have a choice of rounding in 100 cases
             • Let D = “The rounding pattern is the same”
             • Let C = “The second table was copied from the first”
       •    Then

            $$P(D \mid C) = 1, \qquad P(D \mid \bar{C}) \approx 10^{-30}$$

            $$F\left(\frac{C}{\bar{C}}\right) \approx 10^{30}$$

                            Evolutionary Biology
       •    The principle of descent with modification underlies
            evolution
       •    Pseudogenes are genes that have lost essential codes,
            rendering them nonfunctional
       •    Nearly identical pseudogenes are observed in closely
            related organisms (e.g., chimpanzees and humans). By the
            same arguments as before, the posterior probability that
            this is due to actual copying from a common ancestor is
            vastly greater than the posterior probability that it is due to
            coincidence. This is powerful evidence in favor of
            evolution
       •    Similar evidence is provided by the fact that the genetic
            code is redundant. Several triplets of base pairs code for
            the same amino acids
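The figure of 10^-30 in the table-of-sines example can be reproduced with a short calculation. This sketch assumes that, without copying, each of the 100 free rounding choices independently matches the first table with probability 1/2; that independence assumption is not stated on the slide but yields the quoted order of magnitude.

```python
n_choices = 100            # entries where the compiler may round either way
p_match_by_chance = 0.5    # assumed chance of agreeing on one entry without copying

p_d_given_copy = 1.0                                # a copy reproduces every choice
p_d_given_no_copy = p_match_by_chance ** n_choices  # all 100 agree by coincidence

bayes_factor = p_d_given_copy / p_d_given_no_copy   # F(C / not-C)

print(f"P(D | not C) = {p_d_given_no_copy:.2e}")
print(f"Bayes factor = {bayes_factor:.2e}")
```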
                     Mercury’s Perihelion Motion
       •    Around 1900, Newtonian mechanics was in trouble
            because of the problem of Mercury’s perihelion motion
       •    Proposed solutions:
             • Rings of matter around the Sun, too faint to see
             • “Vulcan”, a small planet near the Sun, difficult to detect
             • Flattening of the Sun
             • Additional terms in the law of gravity (e.g., an $r^{-3}$ term
               with an adjustable constant as its coefficient)
       •    Some solutions could be ruled out on observational
            grounds (Jeffreys-Poor debate, 1921)
       •    One could not rule out modifications to the law of gravity.
            The adjustable constant can be chosen to allow any
            motion a of the perihelion
       •    Along comes Einstein and the General Theory of
            Relativity, which predicts a very precise value for
            Mercury’s perihelion motion; no other value is possible
       •    Using contemporary figures
             • Poor gives a=41.6″±2.0″
             • We have $a_E$=42.98″ for Einstein’s theory (E)
             • The conditional probability of data a on the hypothesis
               E that the true value is $a_E$ is

               $$p(a \mid a_E) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^2}(a - a_E)^2\right] = p(a \mid E)$$

               where σ=2.0″ (error of observation)




                     Mercury’s Perihelion Motion
       •    The older theory F can be thought of as matching the
            observed value with a “fudge factor” $a_F$
       •    Observations of Mars, Earth and Venus can limit the
            fudge factor to $|a_F|$<100″
       •    Assuming that Newton’s theory is approximately correct,
            we have a prior density of Mercury’s perihelion motion
            which we take for now to be N(0,τ²) with τ=50″:

            $$p(a_F \mid F) = \frac{1}{\sqrt{2\pi}\,\tau}\exp\left(-\frac{a_F^2}{2\tau^2}\right)$$

       •    The Bayes factor is

            $$\frac{p(a \mid E)}{p(a \mid F)} = \sqrt{1 + \bar{\tau}^2}\,\exp\left(-\frac{D_E^2}{2}\right)\exp\left(\frac{D_F^2}{2(1 + \bar{\tau}^2)}\right) = 26.0$$

            where

            $$D_E = \frac{a - a_E}{\sigma} = -0.69, \qquad D_F = \frac{a}{\sigma} = 20.8, \qquad \bar{\tau} = \frac{\tau}{\sigma} = 25.0$$

       •    This is moderately strong evidence in favor of E.
       •    The last two factors are O(1) and measure the “fit” of the
            two theories to the data. Nearly all of the Bayes factor is
            due to the first factor, which is known as the ‘Ockham
            factor’.
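The Mercury Bayes factor can be evaluated directly from the figures quoted on the slides (a=41.6″, σ=2.0″, $a_E$=42.98″, τ=50″). The sketch below is one way to do the arithmetic; the variable names are illustrative. With these exact inputs the product comes out near 28 rather than the quoted 26.0, a gap that presumably reflects rounding or slightly different input figures in the original calculation; the qualitative conclusion is unchanged.

```python
import math

a_obs = 41.6   # Poor's observed perihelion motion (arcsec)
sigma = 2.0    # error of observation (arcsec)
a_E = 42.98    # Einstein's prediction (arcsec)
tau = 50.0     # prior spread of the fudge factor under F (arcsec)

D_E = (a_obs - a_E) / sigma   # standardized miss of E
D_F = a_obs / sigma           # standardized distance of the data from 0
tau_bar = tau / sigma         # prior spread in units of sigma

ockham_factor = math.sqrt(1.0 + tau_bar ** 2)
fit_E = math.exp(-0.5 * D_E ** 2)
fit_F = math.exp(D_F ** 2 / (2.0 * (1.0 + tau_bar ** 2)))

bayes_factor = ockham_factor * fit_E * fit_F

print(f"D_E = {D_E:.2f}, D_F = {D_F:.1f}, tau_bar = {tau_bar:.1f}")
print(f"Ockham factor = {ockham_factor:.2f}")
print(f"fit factors   = {fit_E:.3f}, {fit_F:.3f}")
print(f"Bayes factor  = {bayes_factor:.1f}")
```

The two fit factors are both of order 1, so nearly all of the result comes from the Ockham factor of about 25.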
                              What’s Happening
       •    The Ockham factor arises from the fact that F spreads its
            bets over a much larger portion of parameter space than
            does E. Essentially E puts all its money on $a_E$, a precise
            value, spread out only by the error of observation σ. On
            the other hand, F spreads its bets out over a range that is
            25 times bigger, and hence most of its probability is
            “wasted” covering regions of the parameter space that
            were not observed.
       •    E makes a sharp prediction, F a fuzzy one
       •    When the data come out even moderately close to where E
            predicts they will, E is rewarded for the risk it took by getting
            a larger share of the posterior probability
       •    The factor of about 25 is just the dilution in probability
            that F must sacrifice in order to fit a larger range of data

                                  Examples
       •    How do p-values and posterior probabilities compare for
            sharp null hypotheses? Evidently, a small p-value is
            evidence against the null, but as we have seen, its
            numerical value overstates the evidence against the null
       •    Consider $a_E = a_F = 0$, so we center everything at 0. Then the
            Bayes factor is

            $$F\left(\frac{E}{F}\right) = \frac{p(a \mid E)}{p(a \mid F)} = \sqrt{1 + \bar{\tau}^2}\,\exp\left[-\frac{D_E^2}{2}\left(\frac{\bar{\tau}^2}{1 + \bar{\tau}^2}\right)\right]$$




                                  Examples
       •    Case 1: Let the p-value be 0.05 so that $D_E$=1.96. We can
            plot the Bayes factor versus $\bar{\tau}$
            [Figure: posterior probability of the null and Bayes factor (vertical axis
            “Value”, 0 to 1.2) plotted against $\bar{\tau}$ (horizontal axis “taubar”, 0 to 6)]
       •    Case 1: The minimum in the posterior probability is about
            0.32, which is not very good evidence against the null
            hypothesis
       •    For large $\bar{\tau}$, we see that the Bayes factor is asymptotically
            like

            $$F\left(\frac{E}{F}\right) = \frac{p(a \mid E)}{p(a \mid F)} = \underbrace{\bar{\tau}}_{\text{Ockham factor}}\ \underbrace{\exp\left(-\frac{D_E^2}{2}\right)}_{\text{Evidence factor}}$$
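The minimum posterior probability quoted for Case 1 can be found by scanning the Bayes factor over $\bar{\tau}$. The following is a minimal sketch, assuming prior odds of 1, the centered formula $\sqrt{1+\bar{\tau}^2}\exp[-(D_E^2/2)\,\bar{\tau}^2/(1+\bar{\tau}^2)]$, and $D_E$=1.96 for a two-sided p-value of 0.05; the grid and function names are illustrative choices.

```python
import math

def bayes_factor(d, tau_bar):
    """Bayes factor for the sharp null against a N(0, tau^2) alternative."""
    shrink = tau_bar ** 2 / (1.0 + tau_bar ** 2)
    return math.sqrt(1.0 + tau_bar ** 2) * math.exp(-0.5 * d ** 2 * shrink)

def posterior_null(d, tau_bar, prior_odds=1.0):
    """Posterior probability of the null for given prior odds."""
    odds = bayes_factor(d, tau_bar) * prior_odds
    return odds / (1.0 + odds)

d = 1.96                                     # z-score for two-sided p = 0.05
grid = [i / 100.0 for i in range(0, 601)]    # tau_bar from 0 to 6

best_tau = min(grid, key=lambda t: posterior_null(d, t))
min_posterior = posterior_null(d, best_tau)

print(f"minimum posterior of the null = {min_posterior:.3f} "
      f"at tau_bar = {best_tau:.2f}")
```

No choice of prior width in this family drives the posterior probability of the null below roughly 0.32, which is far from the 0.05 that a naive reading of the p-value suggests.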
                                  Examples
       •    From this we see that for a given $D_E$, the more vague the
            prior on the alternative (measured by $\bar{\tau}$), the larger the
            Bayes factor in favor of the sharp prediction of the null
       •    Theories with great predictive power (sharp predictions)
            are favored over those with vague predictions

            $$F\left(\frac{E}{F}\right) = \frac{p(a \mid E)}{p(a \mid F)} = \underbrace{\bar{\tau}}_{\text{Ockham factor}}\ \underbrace{\exp\left(-\frac{D_E^2}{2}\right)}_{\text{Evidence factor}}$$

       •    For fixed τ, and a standard deviation for a single
            observation of σ, with n observations we will have

            $$\sqrt{1 + \bar{\tau}^2} = \sqrt{1 + \frac{n\tau^2}{\sigma^2}} = O(\sqrt{n}) \text{ for large } n$$

       •    Thus, asymptotically we have for large n

            $$F\left(\frac{E}{F}\right) \sim \sqrt{n}\left(\frac{\tau}{\sigma}\right)\exp\left(-\frac{z^2}{2}\right)$$

            where $z = D_E$ is the z-score, or standardized variable




                                  Examples
       •    This suggests a way to interpret p-values obtained on data
            with large n (I.J. Good)
             • Let the p-value be α
             • Compute the z-score $z_\alpha$ which gives this value of α for
               the p-value (use normal approximation and tables)
             • Take τ/σ = O(1) and compute

               $$B = F\left(\frac{E}{F}\right) \approx \sqrt{n}\,\exp\left(-\frac{z_\alpha^2}{2}\right)$$

             • Then the posterior probability of the null is
               approximately

               $$p(H_0 \mid \text{data}) \approx \frac{B}{1 + B}$$

                         Jeffreys-Lindley “Paradox”
       •    Choose a p-value α>0, however small. To this p-value
            there corresponds a z-score $z_\alpha$, and for large n the Bayes
            factor against the alternative is

            $$B \approx \underbrace{\sqrt{n}}_{\text{increases with } n}\ \underbrace{\exp\left(-\frac{z_\alpha^2}{2}\right)}_{\text{fixed}}$$

       •    Thus, for large enough n, a classical test can strongly
            reject the null at the same time that the Bayesian analysis
            strongly affirms it
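I.J. Good's recipe above can be sketched in a few lines. This version assumes τ/σ = 1 (one concrete choice of "O(1)"), a two-sided p-value, and prior odds of 1 unless stated otherwise; the function names are illustrative. `statistics.NormalDist` from the Python standard library stands in for the normal tables.

```python
import math
from statistics import NormalDist

def z_from_p(alpha):
    """z-score whose two-sided normal p-value equals alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def good_posterior(alpha, n, prior_odds=1.0):
    """Approximate Bayes factor B = sqrt(n) exp(-z^2/2) and posterior p(H0 | data)."""
    z = z_from_p(alpha)
    b = math.sqrt(n) * math.exp(-0.5 * z * z)
    odds = b * prior_odds
    return b, odds / (1.0 + odds)

# Same p-value, increasing sample size: the evidence swings toward the null
for n in (100, 10_000, 1_000_000):
    b, post = good_posterior(0.05, n)
    print(f"n = {n:>9}: B = {b:8.2f}, p(H0 | data) = {post:.3f}")
```

Holding the p-value at 0.05 while n grows makes the posterior probability of the null climb toward 1, which is the Jeffreys-Lindley effect in miniature.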
                         Jeffreys-Lindley “Paradox”
       •    If H is a simple hypothesis, x the result of an experiment,
            the following two phenomena can exist simultaneously:
             • A significance test for H reveals that x is significant at
               the level p<α where the pre-chosen rejection level α>0
               can be as small as we wish, and
             • The posterior probability of H, given x, is, for quite
               small prior probabilities of H, as high as (1−α)
       •    This means that the classical significance test can reject H
            with an arbitrarily small p-value, while at the same time
            the evidence can convince us that H is almost certainly true
       •    Example: A parapsychologist has a number of subjects
            attempt to “influence” the output of a hardware random
            number generator (which operates by radioactive decays).
            In approximately 104,490,000 events, 18,471 excess
            events are counted in one direction versus the other
       •    The counts follow a binomial distribution. The standard
            deviation of the binomial distribution is $\sigma = \sqrt{n f (1 - f)}$,
            where f is the expected frequency of counts. Straightforwardly
            we find that σ=5111 counts (using f=0.5)
       •    The classical significance test finds that the effect is
            significant at 18471/5111=3.61 standard deviations, for a
            p-value of 0.0003 (two-tailed), using the approximation,
            excellent in this case, that the binomial distribution can be
            approximated by a normal distribution
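The classical calculation for the random-number-generator example can be reproduced as follows. This is a minimal sketch using the normal approximation to the binomial, with the counts quoted on the slide and f=0.5; variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 104_490_000   # number of events
excess = 18_471   # excess counts in one direction
f = 0.5           # expected frequency under the null

sigma = math.sqrt(n * f * (1.0 - f))   # binomial standard deviation
z = excess / sigma                     # z-score of the observed excess
p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(z))

print(f"sigma = {sigma:.0f} counts")
print(f"z     = {z:.2f}")
print(f"p     = {p_two_tailed:.4f} (two-tailed)")
```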




                         Jeffreys-Lindley “Paradox”
       •    The Bayesian analysis is quite different. We have a
            genuine belief that the null hypothesis of no (significant)
            effect might be true.
             • To be sure, no point null is probably ever exactly true,
               because the random event generator might not be
               perfect. But tests of the generator are claimed to show
               that its bias is very small so a point null is a good
               approximation
       •    On the alternative hypothesis, all we know is that the
            effect might be something, but we don’t know how much
            or even in what direction
             • Parapsychologists call effects measured in the direction
               opposite to the intended one “psi-missing”, and it is
               considered evidence for a real effect: a sort of “heads I
               win, tails you lose”.
                         Jeffreys-Lindley “Paradox”
       •    To reflect our ignorance we choose a uniform prior on the
            alternative
       •    We’ve already seen the analysis of this problem in the
            coin-flipping problem. The result is

            $$\frac{p(H_0 \mid x)}{p(H_1 \mid x)} \approx 12\,\frac{p(H_0)}{p(H_1)}$$

       •    In other words, although the classical test rejects H0 with a
            very small p-value, the Bayesian analysis has made us
            twelve times more confident of the null hypothesis than
            we were!
       •    The p-value of 0.0003 corresponds to a z-score of $z_\alpha$=3.61
            standard deviations on n=104,490,000. Our approximate
            formula for the Bayes factor yields

            $$B \approx \sqrt{104{,}490{,}000}\,\exp\left(-\frac{(3.61)^2}{2}\right) = 15.1$$

       •    This is certainly in the right ballpark and confirms the
            approximate formula
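Both numbers on these slides, the factor of about 12 from the uniform prior and the approximate value 15.1, can be checked numerically. The sketch below rests on two assumptions that are not spelled out on the slides: the observed count in the favored direction is taken to be n/2 + 18,471, and the Bayes factor under a uniform prior on θ is computed as C(n,x)·2^(-n)·(n+1), using the fact that the binomial likelihood integrated over a uniform prior equals 1/(n+1). Log-gamma is used to avoid overflow.

```python
import math

n = 104_490_000
excess = 18_471
x = n // 2 + excess   # assumed count in the favored direction

# Exact Bayes factor, H0: theta = 0.5 versus H1: theta uniform on (0, 1)
log_binom = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
log_b_exact = log_binom - n * math.log(2.0) + math.log(n + 1)
b_exact = math.exp(log_b_exact)

# Approximate formula B = sqrt(n) exp(-z^2 / 2) with z = 3.61 as on the slide
z = 3.61
b_approx = math.sqrt(n) * math.exp(-0.5 * z * z)

print(f"exact Bayes factor (uniform prior) = {b_exact:.1f}")
print(f"approximate Bayes factor           = {b_approx:.1f}")
```

The two values differ by a factor of order 1, as expected when τ/σ is only specified up to O(1).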




                      Connection with Ockham’s Razor
       •    This result, where the Bayesian answer is at great odds
            with the orthodox result, can be understood in terms of
            Ockham’s razor
       •    The sharp null hypothesis H0 is a special one that comes
            from our genuine belief (in our parapsychology example)
            that people cannot really influence the output of a random
            number generator by simply wishing it. The sharp
            hypothesis is inconsistent with nearly all possible data
            sets, since it is consistent only with the minuscule fraction
            of possible sequences for which the number of 0’s and 1’s
            are approximately equal. On the other hand, the
            alternative hypothesis is very open ended and would be
            consistent with any possible data
       •    This means that the alternative hypothesis has an
            adjustable parameter, the effect size θ, that the null
            hypothesis does not have. The null hypothesis makes a
            definite prediction θ=0.5, whereas the alternative
            hypothesis can fit any value of θ.
       •    Therefore the null hypothesis is simpler than the others in
            the sense we’ve been discussing. It has fewer parameters.
       •    Because of the Ockham factor, which naturally arises in
            the analysis, we can say that in some sense Ockham’s
            razor is a consequence of Bayesian ideas. The Ockham
            factor automatically penalizes complex hypotheses,
            forcing them to fit the data significantly better than the
            simple one before they will be accepted
                     Connection with Ockham’s Razor
       •    To conclude: There are at least three Bayesian
            interpretations of Ockham’s razor
             • As a way of assigning prior probabilities to
                hypotheses, based on experience
             • As a consequence of the fact that complex hypotheses
                with more parameters, in their attempt to accommodate
                a larger set of possible data, are forced to waste prior
                probability on outcomes that are never observed. This
                automatic penalty factor favors the simpler theory
             • As an interpretation of the notion that when fitting data
                to empirical models, one should avoid overfitting the
                data

                     Sampling to a Foregone Conclusion
       •    The phenomenon we have just discussed is closely related
            to the phenomenon of sampling to a foregone conclusion
             • In classical significance testing, one is supposed to
                decide on everything in advance, and one is especially
                supposed to decide on exactly how much data to take
                in advance.
             • Failure to do this can lead to a situation where if you
                sample long enough, at some point with probability as
                close to 1 as you wish you will reject a true null
                hypothesis with as small a preset significance level as
                you wish
             • This is sampling to a foregone conclusion
       •    This phenomenon is peculiar to classical significance
            testing. It cannot occur in Bayesian testing
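Sampling to a foregone conclusion can be illustrated with a short simulation. The sketch below is an assumption-laden toy, not the lecture's own example: it tosses a fair coin (so the null is true), applies a two-sided z-test at a nominal 5% level after every toss from the tenth onward, and stops at the first "significant" result. All names and settings are hypothetical.

```python
import random
from math import sqrt

def z_score(heads, n):
    # Normal-approximation z statistic for H0: p = 0.5
    return (heads - 0.5 * n) / (0.5 * sqrt(n))

def rejects_with_peeking(max_n, rng, z_crit=1.96, min_n=10):
    # Test after every toss; report whether H0 is ever rejected
    heads = 0
    for n in range(1, max_n + 1):
        heads += rng.random() < 0.5
        if n >= min_n and abs(z_score(heads, n)) > z_crit:
            return True
    return False

def rejects_at_fixed_n(n, rng, z_crit=1.96):
    # Test once, at the sample size fixed in advance
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return abs(z_score(heads, n)) > z_crit

def rejection_rate(test, size, trials=500, seed=0):
    # Fraction of simulated experiments in which the true null is rejected
    rng = random.Random(seed)
    return sum(test(size, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("fixed n = 1000:  ", rejection_rate(rejects_at_fixed_n, 1000))
    print("peeking to 1000: ", rejection_rate(rejects_with_peeking, 1000))
```

With the sample size fixed in advance the rejection rate stays near the nominal level, while testing after every toss rejects the true null far more often, and the rate keeps growing as the maximum sample size is raised.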





                     Stopping Rule Principle
       •    Sampling to a foregone conclusion is related to the
            stopping rule principle (SRP), according to which the
            stopping rule (how we decide to stop taking data) should
            have no effect on the final reported evidence about the
            parameter θ obtained from the data
       •    The SRP is a consequence of the Likelihood Principle
             • Classical testing violates the stopping rule principle
                just as it violates the likelihood principle. Thus,
                “sample until n=12” and “sample until t=3” are
                different stopping rules that give different inferences in
                classical binomial testing, but the same inferences in
                Bayesian testing
       •    See Berger & Wolpert, The Likelihood Principle

                     Stopping Rule Principle
       •    The ability to ignore the stopping rule in Bayesian
            inference has profound implications for experimental
            design
             • It is OK to stop if “the data look good” or “the data
                look horrible”. Breaking a prior decision to test n
                patients, for example, will not compromise the validity
                of the test
                 » Thus if the data look good and the new treatment
                   looks very effective, it would be unethical not to
                   break the protocol so that the patients on placebo
                   can get the effective drug
                 » Likewise if the first 20 patients all died under the
                   new treatment it would be unethical to continue
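The contrast between “sample until n=12” and “sample until t=3” can be checked numerically. The sketch below assumes a hypothetical outcome of 9 heads and 3 tails, reads t as the number of tails at which sampling stops, and tests a fair coin one-sidedly. These specifics are illustrative assumptions, as are the function names.

```python
from math import comb

def binomial_p_value(heads, n):
    # Stopping rule "toss n times": P(X >= heads) for a fair coin
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

def negative_binomial_p_value(heads, tails):
    # Stopping rule "toss until `tails` tails": seeing at least `heads` heads
    # means at most tails-1 tails occurred in the first heads+tails-1 tosses
    m = heads + tails - 1
    return sum(comb(m, k) for k in range(tails)) / 2**m

def binomial_likelihood(theta, heads, tails):
    return comb(heads + tails, heads) * theta**heads * (1 - theta)**tails

def negative_binomial_likelihood(theta, heads, tails):
    # The final toss is forced to be the last tail, hence the smaller coefficient
    return comb(heads + tails - 1, heads) * theta**heads * (1 - theta)**tails

if __name__ == "__main__":
    heads, tails = 9, 3  # hypothetical data
    print("p-value, fixed n=12:   ", binomial_p_value(heads, heads + tails))
    print("p-value, stop at t=3:  ", negative_binomial_p_value(heads, tails))
    for theta in (0.3, 0.5, 0.7):
        ratio = (binomial_likelihood(theta, heads, tails)
                 / negative_binomial_likelihood(theta, heads, tails))
        print("likelihood ratio at theta =", theta, "is", ratio)
```

For these assumed data the two p-values are 299/4096 (about 0.073) and 67/2048 (about 0.033), which fall on opposite sides of a 5% threshold. The two likelihoods differ only by a constant factor that does not depend on θ, so any posterior for θ is the same under either stopping rule.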


                     Stopping Rule Principle
       •    The ability to ignore the stopping rule in Bayesian
            inference has profound implications for experimental
            design
             • Similarly, it is OK to continue the test longer if the
                results are promising but not fully conclusive at the
                end of the scheduled test

                     Frequentist Work-arounds
       •    In frequentist hypothesis testing this problem can be
            avoided through the device of “α spending”. That is,
            suppose we are in a drug clinical trial, and wish to stop the
            trial at some point to do a preliminary assessment of the
            results, to decide whether to continue the trial
             • Terminate trial because of excess bad outcomes
             • Terminate trial because drug is so effective it would be
                unethical not to give it to the placebo group
       •    A frequentist can “peek” at the data in advance if one is
            willing to “spend” some of the α at that point; but if the
            test is continued, it will require a smaller α at a later point
            to reach the preassigned α-level for the overall trial.
       •    However, this is much more complex and involved than
            the Bayesian approach
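As a rough illustration of what “spending” α means, the sketch below uses one commonly cited spending function, the Pocock-type form α(t) = α ln(1 + (e − 1)t), where t is the fraction of the planned information collected so far. The slides do not say which spending function is intended, so this choice, the look schedule, and the function names are all assumptions.

```python
from math import e, log

def pocock_type_spending(t, alpha=0.05):
    # Cumulative type I error allowed to be spent by information fraction t
    if not 0.0 <= t <= 1.0:
        raise ValueError("information fraction must lie in [0, 1]")
    return alpha * log(1 + (e - 1) * t)

def incremental_alpha(looks, alpha=0.05):
    # Amount of alpha newly spent at each interim look (looks must be increasing)
    spent = [pocock_type_spending(t, alpha) for t in looks]
    return [later - earlier for earlier, later in zip([0.0] + spent, spent)]

if __name__ == "__main__":
    looks = [0.25, 0.5, 0.75, 1.0]  # hypothetical schedule of four analyses
    for t, a in zip(looks, incremental_alpha(looks)):
        print("look at t =", t, "spends", a)
```

The increments add up to the overall α, so each peek uses part of a fixed budget. Turning the increments into actual critical values requires the joint distribution of the sequential test statistics, which this sketch does not attempt.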





                     Hypothesis Testing and Prior Belief
       •    Prior belief has to be a consideration in any kind of
            hypothesis testing. Thus, on the same data, different
            degrees of plausibility may accrue to a hypothesis
       •    For example, consider shuffling a pack of alphabet cards.
            A subject is supposed to guess the letter on three cards.
            Suppose the subject names all three correctly. What do we
            think when
             • H1: The subject is a child and is allowed to look at the
                cards
             • H2: The subject is a magician, who only looks at the
                backs of the cards
             • H3: The subject is a psychic, who only looks at the
                backs of the cards

                     Good’s Device
       •    The statistical evidence is the same, but the posterior
            beliefs are very different!
       •    This example illustrates a device due to I.J. Good for
            setting prior probabilities
             • It will take much more data to convince most people of
                the truth of H3 than H2, even though the evidence is
                identical. One can ask, “how much more data?” Then,
                by using Bayes’ theorem in reverse, we can estimate
                our prior on the hypotheses
                 » How many times in a row would someone have to
                   name a randomly-picked card before you would
                   give 1:1 odds that he was a psychic?
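Good’s device of running Bayes’ theorem in reverse can be sketched in a few lines. The sketch assumes a 26-letter alphabet deck, independent draws, a chance success probability of 1/26 per card, and a genuine psychic who is always right, so each hit multiplies the odds by 26. These modelling choices and the function names are illustrative assumptions.

```python
def hits_for_even_odds(prior_odds, n_cards=26):
    # Number of consecutive hits needed to lift the posterior odds to at least 1:1,
    # given that each hit multiplies the odds by the likelihood ratio n_cards
    hits, odds = 0, prior_odds
    while odds < 1.0:
        odds *= n_cards
        hits += 1
    return hits

def implied_prior_odds(n_hits, n_cards=26):
    # Bayes' theorem in reverse: if n_hits would bring you to 1:1 odds,
    # your prior odds were about n_cards ** (-n_hits)
    return float(n_cards) ** (-n_hits)

if __name__ == "__main__":
    print("hits needed from prior odds 1e-6:", hits_for_even_odds(1e-6))
    print("prior odds implied by needing 10 hits:", implied_prior_odds(10))
```

Answering “how many hits would convince me?” therefore pins down a prior that would otherwise be hard to state directly.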


                     Example
       •    Published report of an experiment in ESP (Soal and
            Bateman, 1954)
       •    A deck of 25 Zener cards (5 different designs) is shuffled;
            the subject guesses the order of the cards
             • Sampling without replacement, often misanalyzed by
                parapsychologists (P. Diaconis)
       •    Subject obtained r=9410 “hits” out of n=37100 (vs. 7420±77
            expected), f=0.2536
       •    What should we think? H0 = no ESP: p=0.20,
            f=r/n=0.2536, which is 25.8σ away from the expected
            value

                     Example
       •    Calculate probability given null hypothesis
            $$p(\text{data} \mid H_0) = \binom{n}{r} p^r (1-p)^{n-r}$$
       •    Use Stirling approximation
            $$n! \approx \sqrt{2\pi}\, n^{n+1/2} e^{-n}$$
            obtain
            $$p(\text{data} \mid H_0) \approx \left[\frac{n}{2\pi r (n-r)}\right]^{1/2} \exp\bigl(n H(f,p)\bigr)$$
            where the cross-entropy is
            $$H(f,p) = -f \ln(f/p) - (1-f) \ln\left(\frac{1-f}{1-p}\right)$$







                     Example
       •    The cross-entropy (Kullback-Leibler divergence)
            $$H(f,p) = -f \ln(f/p) - (1-f) \ln\left(\frac{1-f}{1-p}\right)$$
            measures the degree to which the observed distribution
            matches the expected distribution. Our entropy is just the
            cross-entropy relative to a uniform distribution
       •    Plug in the data and find that
            $$p(\text{data} \mid H_0) = 0.00476 \exp(-313.6) = 3.15 \times 10^{-139}$$
       •    This is very small! Does the subject have ESP?
       •    If we calculate the Bayes factor against the value of p that
            maximizes the cross-entropy (p=f) we still get huge odds
            against the null hypothesis
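The calculation for the Soal and Bateman data can be reproduced with a short script. The sketch below evaluates the exact binomial probability through log-gamma functions and compares it with the Stirling form given above, using n=37100, r=9410 and p=0.20 from the slides; the function names are hypothetical. Small differences from the rounded figures quoted on the slide may appear, depending on how f is rounded.

```python
from math import lgamma, log, pi, sqrt

def log_binomial_pmf(r, n, p):
    # Exact log p(data | H0) = log[C(n, r) p^r (1-p)^(n-r)]
    return (lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)
            + r * log(p) + (n - r) * log(1 - p))

def cross_entropy(f, p):
    # H(f, p) as defined on the slides (zero at p = f, negative otherwise)
    return -f * log(f / p) - (1 - f) * log((1 - f) / (1 - p))

def log_stirling_pmf(r, n, p):
    # Stirling form: sqrt(n / (2 pi r (n - r))) * exp(n H(f, p)), in logs
    return 0.5 * log(n / (2 * pi * r * (n - r))) + n * cross_entropy(r / n, p)

def sigmas_from_expected(r, n, p):
    # Distance of the hit count from its expectation in standard deviations
    return (r - n * p) / sqrt(n * p * (1 - p))

if __name__ == "__main__":
    n, r, p = 37100, 9410, 0.20
    print("sigmas from expectation:", sigmas_from_expected(r, n, p))
    print("log10 p(data|H0), exact:   ", log_binomial_pmf(r, n, p) / log(10))
    print("log10 p(data|H0), Stirling:", log_stirling_pmf(r, n, p) / log(10))
```

Working in logarithms avoids underflow, since the probability itself is far smaller than anything an ordinary product of floating-point factors could track reliably.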

                     Example
       •    The priors hardly matter. No matter what prior you choose
            on the alternative hypothesis, you’re going to get very
            strong evidence against the null hypothesis
             • So, does ESP exist?
       •    Well, no. Bayesian methodology requires us to consider a
            mutually exclusive and exhaustive set of hypotheses, and
            we haven’t done that. We’ve left out, for example
             • H1: The experimenter altered the data
             • H2: The subject was a conjuror
             • H3: There was a flaw in the experimental design (e.g.,
                the subject might have noticed the card reflected in the
                scientist’s glasses)

                     Example
       •    Each of H1, H2, H3,… may have a prior probability π(Hi)
            that is much greater than that of the hypothesis P = H_P that
            the subject has genuine psychic powers, and each would
            adequately account for the data. As a result
            $$p(P \mid \text{data}) = \frac{p(\text{data} \mid P)\,\pi(P)}{p(\text{data} \mid P)\,\pi(P) + \sum_{i \ne P} p(\text{data} \mid H_i)\,\pi(H_i)}$$
            and if the sum in the denominator dominates the first term,
            the posterior probability of P will always be much less
            than 1
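The effect of the neglected hypotheses on the posterior probability of P can be seen by plugging numbers into Bayes’ theorem. In the sketch below every prior and likelihood is invented for illustration; only the structure of the formula follows the slides, and the function name is hypothetical.

```python
def posterior_of_psychic(like_P, prior_P, rivals):
    # rivals: list of (likelihood, prior) pairs, one per competing hypothesis H_i
    numerator = like_P * prior_P
    denominator = numerator + sum(like * prior for like, prior in rivals)
    return numerator / denominator

if __name__ == "__main__":
    # Hypothetical inputs: each deception hypothesis explains the data as well
    # as genuine ESP would (likelihood 1), while the null explains it very badly.
    rivals = [
        (1.0, 1e-3),        # H1: experimenter altered the data
        (1.0, 1e-3),        # H2: subject was a conjuror
        (1.0, 1e-2),        # H3: flaw in the experimental design
        (3.15e-139, 0.988), # H0: no ESP, honest experiment (likelihood from the slides)
    ]
    print("posterior of P:", posterior_of_psychic(1.0, 1e-20, rivals))
```

With these assumed inputs the data demolish the null, but the probability released flows to the deception hypotheses, not to P, because their priors are so much larger than the prior for P.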






                     Example
       •    Our data is not that someone actually performed the feat
            in question. It is that someone reported to us that the feat
            was performed
       •    There are always hypotheses that haven’t been considered,
            and sometimes they may be raised to significance when
            data come along that support them
       •    This is why careful consideration of all possible
            hypotheses is important

                     Example
       •    In fact, the Soal/Bateman result was shown to have been
            due to experimenter fraud: tampering with the data
            records. This was shown by Betty Markwick, who
            provided convincing evidence that Soal had altered the
            record sheets in a systematic way so as to achieve an
            excess of “hits”.




                     Making Decisions
       •    If one is testing hypotheses, it is for a reason. One does
            not just sit around saying “Oh, well, I guess that
            hypothesis isn’t true!”
             • There must be some action implied by making that
                decision (an action can include “doing nothing”)
                 » We decide to approve that drug
                 » We decide to invest in that stock
                 » We decide to publish our paper
                 » Etc., etc.

                     Making Decisions
       •    Example: Testing a drug or treatment
             • Usually when testing a new treatment we will compare
                it to an old treatment or a placebo
             • We won’t approve the new treatment if it isn’t better
                than the old one
             • We might not approve the new treatment if it is
                significantly more costly than the old one, unless it is
                significantly better than the old one
                 » But that judgement might depend upon whether you
                   were the patient, the drug company, the
                   government, or the insurance company!
             • We probably won’t approve the new treatment if it has
                significant and adverse side-effects






                     Making Decisions
       •    This shows that not only is the probability of the states of
            nature important, but we must also consider the
            consequences of each state of nature (cost, side-effects,
            desirable effects, and so on) given each of the possible
            decisions that we might make. Thus we must decide
             • What are the states of nature?
             • What are the probabilities of each state of nature
                (Bayes)?
             • What actions are available to us?
             • What are the costs or benefits given each possible state
                of nature and each action (loss function)?
             • What are the expected costs or benefits of each action?
             • Which action is the best under the circumstances?

                     Making Decisions
       •    In drug testing, for example, the actions a we might take
            are
             • Approve the drug
             • Do not approve the drug
       •    As a result of our testing we will end up with posterior
            probabilities on the states of nature θ which will, for
            example, include the cure rate of the new drug relative to
            the old one, information on side effects, etc.
       •    We will have to summarize the consequences of making
            various decisions about the drug as a utility or loss
            (= −utility). Call the loss function L(θ,a); it is a function of
            the states of nature θ and the actions a


                     Making Decisions
       •    Then the expected loss is a function of the actions a and
            can be calculated, e.g.
            $$E_\theta[L(\theta, a)] = \int L(\theta, a)\, p(\theta \mid \text{data})\, d\theta$$
       •    Evidently, we would want to choose whichever of the two
            actions (approve, disapprove) gives the smallest expected
            loss
             • If we were using utilities, we would maximize the
                expected utility

                     Making Decisions
       •    From this discussion we can see that losses/utilities play
            an important role in making decisions. Some aspects are
            objective (e.g., monetary costs); however many of them
            are subjective, just as priors may be subjective.
             • The person who is affected by the decision is the one
                that must determine the loss/utility for this calculation
                 » The insurance company, the drug manufacturer, the
                   FDA, and the patient each will have a different
                   loss/utility when choosing to approve, market, or
                   use the drug. Each must use their own loss/utility
                   when making the decision
                 » The role of the statistician is to assist the user, but it
                   is not to set the loss/utility (the same goes for the
                   patient’s doctor)
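The expected-loss rule can be sketched for a discrete set of states of nature, where the integral over θ becomes a sum. The posterior probabilities, the loss table, and the names used below are all hypothetical; in practice the loss values would come from whoever bears the consequences of the decision.

```python
def expected_loss(action, posterior, loss):
    # Sum over states of L(state, action) * p(state | data)
    return sum(prob * loss[(state, action)] for state, prob in posterior.items())

def best_action(actions, posterior, loss):
    # Choose the action with the smallest posterior expected loss
    return min(actions, key=lambda a: expected_loss(a, posterior, loss))

if __name__ == "__main__":
    actions = ["approve", "do_not_approve"]
    # Hypothetical posterior over two coarse states of nature
    posterior = {"new_drug_better": 0.9, "new_drug_not_better": 0.1}
    # Hypothetical losses in arbitrary units, keyed by (state, action)
    loss = {
        ("new_drug_better", "approve"): 0.0,
        ("new_drug_better", "do_not_approve"): 10.0,
        ("new_drug_not_better", "approve"): 50.0,
        ("new_drug_not_better", "do_not_approve"): 0.0,
    }
    for a in actions:
        print(a, "expected loss:", expected_loss(a, posterior, loss))
    print("chosen action:", best_action(actions, posterior, loss))
```

Changing either the posterior or the loss table can reverse the decision, which is why different parties with different losses may reasonably choose differently on the same data.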





                            Making Decisions
       •    Our course is not a course in decision theory. Useful
            books on decision theory are
             • Smart Choices, by Hammond, Keeney and Raiffa.
               Good introduction for lay people. Stresses the process
               of eliciting utilities. Discusses basics of probability
             • Decision Analysis, by Howard Raiffa, and Making
               Decisions, by Dennis Lindley. Good introductions
             • Making Hard Decisions, by Robert Clemen. More
               advanced, very detailed case analyses
             • Statistical Decision Theory and Bayesian Analysis,
               Second Edition, by Jim Berger. Very advanced and
               theoretical



				