# 11. Bayesian Hypothesis Testing

## Hypothesis Testing

- Hypothesis testing is one area where Bayesian methods and results differ sharply from frequentist ones
- Example: Suppose we have a coin and wish to test the hypothesis that the coin is fair
  - H0: The coin is fair, p(H) = p(T) = 0.5
  - H1: The coin is not fair (anything but the above)
- We divide the parameter space Θ into two pieces, Θ0 and Θ1, such that if the parameter θ is in Θ0 then the hypothesis is true, and if it is in Θ1 then the hypothesis is false
- We observe a test statistic x, which is a function of the data X, and wish to decide, given x, whether or not to reject H0

## Classical Hypothesis Testing

- Classical statisticians recognize two kinds of errors (this is horrible terminology, but we are stuck with it)
  - A Type I error is made if we reject H0 when it is true
  - A Type II error is made if we do not reject H0 when it is false
- Choose a rejection region R:
  - R = {x: observing x in R leads to the rejection of H0}
- Then the probability of making a Type I error is
  $$p(x \in R \mid \theta \in \Theta_0)$$
  and the probability of making a Type II error is
  $$p(x \notin R \mid \theta \in \Theta_1)$$

*Bayesian Inference, 4/10/09*

## Classical Hypothesis Testing

- Often we take Θ0 as a set containing a single point (simple hypothesis), so
  $$\alpha = p(x \in R \mid \theta \in \Theta_0)$$
  is well-defined. However, Θ1 is usually a collection of intervals (composite hypothesis) and
  $$\beta = p(x \notin R \mid \theta \in \Theta_1)$$
  has no definite value. The best one can do is to evaluate
  $$p(x \notin R \mid \theta) = \beta(\theta)$$
  as a function of θ. β(θ*) is the probability of making a Type II error if the true value of θ is θ* ∈ Θ1. q(θ) = 1 − β(θ) is known as the power function of the test.

## Classical Hypothesis Testing

- We can construct different tests by
  - Choosing a different test statistic x(X)
    - But no real choice if x is sufficient
  - Choosing a different rejection region R
- Note that both of these are subjective choices!
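The power function of a concrete test is easy to tabulate. As a minimal sketch (the sample size and rejection region here are illustrative choices, not from the notes), consider rejecting H0: θ = 0.5 whenever the number of heads in 100 tosses falls outside [40, 60]:

```python
from math import comb

def power(theta, n=100, lo=40, hi=60):
    """Power q(theta) = 1 - beta(theta) of the test that rejects H0
    whenever the number of heads in n tosses falls outside [lo, hi]."""
    # beta(theta) = p(x not in R | theta) = P(lo <= heads <= hi | theta)
    beta = sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(lo, hi + 1))
    return 1 - beta

alpha = power(0.5)   # Type I error rate: the power evaluated on H0
```

Evaluating `power` over a grid of θ values traces out q(θ); the Type II error probability at any θ* in Θ1 is 1 − power(θ*).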

## Classical Hypothesis Testing

- A common practice is to choose things so that α is some fixed fraction like 0.05 or 0.01. This gives us an α-level test, and the probability of making a Type I error is α
- Then, amongst the α-level tests, the goal would be to choose that test whose power function dominates that of all other tests. Such tests are called "uniformly most powerful" (UMP) tests, and they have the smallest probability of committing a Type II error for any given value of θ. Unfortunately, in general a UMP test does not exist

## Classical Hypothesis Testing

- However, from a Bayesian point of view these classical tests are all suspect. Since they depend on the probability of x falling into some region R, they depend on the probability of data that might have been observed, but was not. Thus, they violate the Likelihood Principle
- A Bayesian might well say that classical hypothesis tests commit a Type III error: giving the right answer to the wrong question
- A Bayesian test would have to depend on, and be conditioned on, just the data X that was observed

## Bayesian Hypothesis Testing

- So, we return to the Bayesian paradigm. We have to have a prior; we need the likelihood; then we can compute the posterior
- Suppose we have two simple hypotheses. Thus, we have H0 and H1, likelihoods p(X|H0) and p(X|H1), and priors p(H0) and p(H1). Then the posterior odds are
  $$\frac{p(H_0 \mid X)}{p(H_1 \mid X)} = \frac{p(X \mid H_0)}{p(X \mid H_1)} \times \frac{p(H_0)}{p(H_1)}$$

## Marginal Likelihood

- In the Bayesian approach to hypothesis testing, the "marginal likelihood" plays a key role. Suppose we have hypotheses H0, H1, …, Hn. To each hypothesis Hj there corresponds a (possibly empty) set of parameters θj. For example, in the coin-tossing case there are no parameters for j = 0 and one parameter for j = 1
- Note that there is no requirement that the parameters be nested

## Marginal Likelihood

- We can calculate the joint density of parameters and hypotheses, and from that the posterior probability, in the usual way:
  $$p(x, \theta_j, H_j) = p(x \mid \theta_j, H_j)\, p(\theta_j \mid H_j)\, p(H_j)$$
  $$p(\theta_j, H_j \mid x) = \frac{p(x \mid \theta_j, H_j)\, p(\theta_j \mid H_j)\, p(H_j)}{p(x)}$$
  where
  $$p(x) = \sum_{j=0}^{n} \int p(x, \theta_j, H_j)\, d\theta_j$$
  i.e., sum over the discrete parameters and integrate over the continuous ones, model by model

## Marginal Likelihood

- Now we can compute the posterior probability of any of our hypotheses, by simply integrating over the continuous parameters of each hypothesis:
  $$p(H_j \mid x) = \int p(\theta_j, H_j \mid x)\, d\theta_j = \frac{\int p(x \mid \theta_j, H_j)\, p(\theta_j \mid H_j)\, p(H_j)\, d\theta_j}{p(x)}$$
- Since the proportionality constant p(x) is independent of j, we can simply write
  $$p(H_j \mid x) \propto \left[\int p(x \mid \theta_j, H_j)\, p(\theta_j \mid H_j)\, d\theta_j\right] p(H_j) = m(x \mid H_j)\, p(H_j)$$
  m(x|Hj) is known as the marginal likelihood under Hj

## Bayesian Hypothesis Testing

- Many realistic examples involve testing a compound hypothesis. For example, our coin-tossing problem is to decide if the coin is fair, based on observations of the number of heads and tails. The expected proportion of heads is θ. If the coin is fair, then θ = 0.5, and if it is not fair it is some other value. This means we will have to put a prior on θ under the alternative hypothesis, for example U(0,1)
- Thus we are testing
  $$H_0: \theta = 0.5, \quad p(\theta \mid H_0) = \delta(\theta - 0.5)$$
  against
  $$H_1: \theta \neq 0.5, \quad p(\theta \mid H_1) \sim U(0,1)$$

## Bayesian Hypothesis Testing

- We compute the marginal likelihood under each model, which will be proportional to the posterior probability of each model:
  $$p(H_0 \mid x) \propto \left[\int p(x \mid \theta_0, H_0)\, p(\theta_0 \mid H_0)\, d\theta_0\right] p(H_0) = m(x \mid H_0)\, p(H_0)$$
  and similarly for H1
- The odds for H0 and against H1 are simply
  $$\frac{p(H_0 \mid x)}{p(H_1 \mid x)} = \frac{m(x \mid H_0)}{m(x \mid H_1)} \times \frac{p(H_0)}{p(H_1)}$$
- The ratio of the two marginal likelihoods is known as the Bayes factor

## Bayesian Hypothesis Testing

- For the particular case of coin tossing, x = {h, t} with h the number of heads and t the number of tails observed, this becomes
  $$\frac{p(H_0 \mid x)}{p(H_1 \mid x)} = \frac{\int_0^1 \theta^h (1-\theta)^t\, \delta(\theta - 0.5)\, d\theta}{\int_0^1 \theta^h (1-\theta)^t\, d\theta} \times \frac{p(H_0)}{p(H_1)}$$
  $$= (h + t + 1)\binom{h+t}{h}(0.5)^{h+t} = \frac{(n+1)\binom{n}{h}}{2^n} \quad \text{with } n = h + t$$
- Do you see how the numerator and denominator get the values they do? (Hint: the denominator is the beta integral, $\int_0^1 \theta^h(1-\theta)^t\,d\theta = h!\,t!/(n+1)!$.)

## Bayesian Hypothesis Testing

- As with all Bayesian inference, the results of such a test depend on the prior. And in the case of a simple versus a compound hypothesis, the result is more sensitive to the prior on θ than it is in parameter estimation problems, which means that one is less certain of the inference
- One can look at a range of sensible priors to see how sensitive the results are
- One can also look at particular classes of priors to see what the maximum evidence against the simple hypothesis is under that class of priors
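The closed-form odds above are easy to check numerically. A minimal sketch (the function name is mine):

```python
from math import comb

def posterior_odds(h, t, prior_odds=1.0):
    """Posterior odds p(H0|x)/p(H1|x) for a fair coin (H0) against
    a U(0,1) prior on the bias (H1), given h heads and t tails."""
    n = h + t
    # Bayes factor: (n + 1) * C(n, h) / 2^n
    bayes_factor = (n + 1) * comb(n, h) / 2**n
    return bayes_factor * prior_odds

odds = posterior_odds(60, 40)   # about 1.095: the data slightly favor H0
```

Note that an even split, `posterior_odds(50, 50)`, gives odds of about 8 in favor of the fair coin.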

## Bayesian Hypothesis Testing

- Example: Suppose we toss a coin 100 times and obtain 60 heads and 40 tails. What is the evidence against the hypothesis that the coin is fair?
  - Assuming the priors we did for our calculation, we find on prior odds 1 that the posterior odds are
    $$\frac{101 \times \binom{100}{60}}{2^{100}} = 1.095$$
  - In other words, these data (slightly) favor the null hypothesis that the coin is fair!
    - Surprising to a frequentist, since the one-sided p-value (tail area for obtaining 60 or more heads in 100 tosses given a fair coin) is 0.028, which would reject the null in an α = 0.05 level test

## p-values

- Recall that for a symmetric distribution a p-value is the tail area (one-sided) or twice the tail area (two-sided) under the probability distribution of the test statistic r(x)|θ, from where the data actually lie at r0 to infinity:
  $$\text{p-value} = \int_{r_0}^{\infty} p(r \mid \theta_0)\, dr$$
  $$\text{p-value} = \int_{-\infty}^{-r_0} p(r \mid \theta_0)\, dr + \int_{r_0}^{\infty} p(r \mid \theta_0)\, dr$$

## Bayesian Hypothesis Testing

- Example: Suppose we toss a coin 100 times and obtain 60 heads and 40 tails. What is the evidence against the hypothesis that the coin is fair?
  - If we look at this example by comparing the simple hypotheses "fair" versus "biased coin with θ = 0.6", which is the most favorable prior δ(θ − 0.6) on θ for the alternative hypothesis, we still get
    $$\frac{p(H_0 \mid X)}{p(H_1 \mid X)} = \frac{0.5^h\, 0.5^t}{0.6^h\, 0.4^t} \times \frac{p(H_0)}{p(H_1)} = 0.134$$
  - The corresponding probability is 0.118
  - This is independent of the prior on the parameter, and we can consider this to be the maximum evidence against the null hypothesis. It is still over four times greater than the classical one-sided p-value

## Bayesian Hypothesis Testing

- Priors for binomial data
  - We used a flat prior for our analysis, for convenience
  - The Jeffreys prior is beta(θ|1/2, 1/2) ∝ θ^(−1/2)(1 − θ)^(−1/2)
    - Show this! Hint: E(h|θ) = nθ
  - In practice the difference between flat and Jeffreys won't make much difference, since it amounts to just one extra head or tail
  - Informative conjugate priors are beta distributions ∝ θ^(α−1)(1 − θ)^(β−1). You may choose the parameters to match your prior knowledge
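The "most favorable simple alternative" bound can be reproduced directly; a minimal sketch (names are mine):

```python
def simple_vs_simple_odds(h, t, theta1, prior_odds=1.0):
    """Posterior odds for H0: theta = 0.5 against the simple
    alternative H1: theta = theta1, given h heads and t tails."""
    likelihood_ratio = (0.5**h * 0.5**t) / (theta1**h * (1 - theta1)**t)
    return likelihood_ratio * prior_odds

odds = simple_vs_simple_odds(60, 40, 0.6)   # about 0.134
prob = odds / (1 + odds)                    # about 0.118, p(H0 | x) on prior odds 1
```

Placing the alternative at the maximum-likelihood value θ1 = h/n makes this the smallest posterior odds any simple alternative can produce, hence the maximum evidence against H0.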

## Bayesian Hypothesis Testing

- Priors for binomial data
  - Jaynes suggests beta(θ|0, 0) ∝ θ^(−1)(1 − θ)^(−1)
    - This has the advantage of agreeing with intuition if there is a good probability that either of the extremes θ = 0 or θ = 1 may be true (as with, for example, whether a random chemical taken off the shelf will or will not dissolve; presumably if it dissolves the first time, it will each time, and if it doesn't dissolve the first time, it won't dissolve any other time either)
    - However, if the number of heads or tails is 0, the posterior will not be normalizable and the test will give odds 0 or ∞, which may or may not be desirable

## MCMC Simulation

- We can calculate our results using simulation (useful when an exact solution is unavailable)
- We do this by simulating a random walk in both model space {H0, H1} and in parameter space (θ). Thus we are simulating on both discrete and continuous parameters
- The key here is to allow ourselves to jump between our two models. This will in general be a Metropolis-Hastings (M-H) step
- Since the two models have differing numbers of parameters (in the coin-tossing case, one model has zero parameters and the other has one), we will have to propose parameters and models simultaneously
- I will describe a technique known as reversible jump MCMC, which is very effective

## MCMC Simulation

- The best introduction to the reversible jump MCMC technique that I have found is "On Bayesian Model and Variable Selection Using MCMC," by Petros Dellaportas, Jonathan Forster and Ioannis Ntzoufras, published in Statistics and Computing

## MCMC Simulation

- Thus, in our coin-tossing problem, we may be in state (Hj, θj) and wish to jump to another state (Hk, θk)
  - Propose a jump to state (Hk, θk) with probability q(Hk, θk | Hj, θj)
  - Compute (the log of) the Metropolis-Hastings factor:
    $$\alpha = \frac{p(X \mid H_k, \theta_k)\, p(H_k, \theta_k)\, q(H_j, \theta_j \mid H_k, \theta_k)}{p(X \mid H_j, \theta_j)\, p(H_j, \theta_j)\, q(H_k, \theta_k \mid H_j, \theta_j)}$$
  - Generate u, (the log of) a U(0,1) random variable, and accept the step if u < α (log u < log α); otherwise stay where you are
  - The resulting Markov chain samples the models in proportion to their posterior probability; marginalize out θ by ignoring it

## MCMC Simulation

- We see that α is the ratio of two quantities of the form
  $$r_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(H_m, \theta_m)}{q(H_m, \theta_m \mid H_n, \theta_n)}$$
  where m and n refer to the two states
- We have a great deal of latitude in picking q. For example, we could choose it independent of state n (independence sampler):
  $$r_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(H_m, \theta_m)}{q(H_m, \theta_m)}$$

## MCMC Simulation

- And we might factorize both p and q:
  $$r_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(\theta_m \mid H_m)\, p(H_m)}{q(\theta_m \mid H_m)\, q(H_m)}$$
  - Example: coin tosses. Choose a prior on Hm, for example p(H0) = p(H1) = 1/2
  - Choose a proposal, for example q(H0) = q(H1) = 1/2
    - We'll want to reconsider this
  - If m = 0, θ0 = 1/2 [strictly, p(θ0|H0) = δ(θ0 − 0.5)], but if m = 1 we need a prior on θ1. For simplicity we will take a uniform prior, as in our calculation
  - We also need to consider the proposals q(θm | Hm) for m = 0, 1

## MCMC Simulation

- And we might factorize both p and q:
  $$r_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(\theta_m \mid H_m)\, p(H_m)}{q(\theta_m \mid H_m)\, q(H_m)}$$
  - An excellent choice of q(θm | Hm), if possible, would be to make it proportional to the posterior p(θm | X, Hm)! For then we would get
    $$r_{m|n} \propto \frac{p(H_m)}{q(H_m)}$$
    which is a constant. Indeed, if we can also arrange things so that α = 1, we would get a Gibbs sampler!
- The latter can be done approximately by using a small training sample and picking q(Hm) using the results

## MCMC Simulation

- And we might factorize both p and q:
  $$r_{m|n} = \frac{p(X \mid H_m, \theta_m)\, p(\theta_m \mid H_m)\, p(H_m)}{q(\theta_m \mid H_m)\, q(H_m)}$$
  - Usually the excellent strategy is not possible; if one can approximate the posterior by a distribution that you can sample from, that is a good strategy
  - A simple-minded strategy would be simply to choose q flat, whence
    $$r_{m|n} = \frac{p(X, \theta_m \mid H_m)\, p(H_m)}{q(H_m)}$$
    - This works if the posterior is not too sharp (not much data); not so good if there is a very sharp posterior
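For the coin problem the whole scheme fits in a few lines. A minimal sketch (names are mine), using the simple-minded flat proposal: with p(H0) = p(H1) = 1/2, q(H0) = q(H1) = 1/2, and a U(0,1) proposal for θ under H1 (proposal equal to prior), every prior and proposal term cancels in the M-H factor, which reduces to a likelihood ratio:

```python
import math
import random

def loglike(theta, h, t):
    """Log-likelihood of h heads and t tails given heads-probability theta."""
    return h * math.log(theta) + t * math.log(1.0 - theta)

def rj_mcmc(h, t, n_steps=200_000, seed=1):
    """Reversible jump sampler for H0: theta = 0.5 versus H1: theta ~ U(0,1).
    Because the model proposal and the theta proposal both equal the
    corresponding priors, the acceptance ratio is just a likelihood ratio."""
    rng = random.Random(seed)
    model, theta = 0, 0.5
    visits0 = 0
    for _ in range(n_steps):
        new_model = rng.randrange(2)          # propose a model uniformly
        if new_model == 0:
            new_theta = 0.5                   # H0 has no free parameter
        else:
            # draw from the U(0,1) prior, clamped away from the endpoints
            new_theta = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        log_alpha = loglike(new_theta, h, t) - loglike(theta, h, t)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            model, theta = new_model, new_theta
        visits0 += (model == 0)
    return visits0 / n_steps                  # estimate of p(H0 | x)

p_h0 = rj_mcmc(60, 40)   # should land near 1.095 / 2.095, about 0.52
```

The fraction of steps spent in model 0 estimates p(H0 | x), matching the exact odds of 1.095 computed earlier; with more data the posterior on θ sharpens and this flat proposal mixes poorly, as the notes warn.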

## p-values

- Amongst the many errors that people make interpreting frequentist results, one in particular is very common, and that is to quote a p-value as if it were a probability (e.g., the probability that the null hypothesis is true, or that "the results occurred by chance")
  - The approved use of p-values, on frequentist reasoning, is to report whether or not the p-value falls into the rejection region. This has the interpretation that if the null is true, then in no more than a fraction α of all cases will we commit a Type I error
- The observed p-value is not a probability in any real sense! It is a statistic that happens to have a U(0,1) distribution under the null
- Despite the appeal of quoting p-values, an observed p-value has no valid frequentist probability interpretation

## p-values

- The observed p-value is not a probability in any real sense! It is a statistic that happens to have a U(0,1) distribution under the null
  - If observed p-values were real probabilities, we could combine them using the rules for probability to obtain p-values of combined experiments. Thus (on the null hypothesis of a fair coin), if we observed 60 heads and 40 tails and then independently observed 40 heads and 60 tails, the one-sided p-value for the combined experiment is evidently 0.5, whereas the one-sided p-values for the two independent experiments are 0.028 and (1 − 0.028) respectively; the product is obviously not 0.5, contrary to the multiplication law
  - Similar results hold for two-sided p-values

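The coin example above can be checked with exact binomial tail sums; `one_sided_p` is a helper name of mine:

```python
from math import comb

def one_sided_p(heads, n):
    """One-sided p-value: probability of observing `heads` or more
    heads in n tosses of a fair coin."""
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

p1 = one_sided_p(60, 100)      # about 0.028
p2 = one_sided_p(40, 100)      # about 0.98 (the complementary tail)
p12 = one_sided_p(100, 200)    # combined experiment: about 0.53, near 0.5
# p1 * p2 is about 0.03, nowhere near p12: observed p-values do not
# obey the multiplication law, so they are not probabilities
```
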
## p-values

- Furthermore, suppose you routinely reject two-sided at a fixed α-level, say 0.05
- Suppose in half the experiments the null was actually true
- Finally, suppose that in those experiments for which the null is false, the probability of a given effect size x decreases monotonically as you go away from 0 (in either direction)

[Figure: a density p(e) of effect sizes, peaked at 0 and decreasing in both directions]

## p-values

- Then, amongst those experiments rejected with p-values in (0.05 − ε, 0.05) for small ε, at least 30% will actually turn out to have a true null, and the true proportion can be much higher (it depends upon the distribution of the actual parameter for the experiments where the null is false)
  - This says that under these circumstances, the Type I error rate (probability of rejecting a true null), conditioned on our having observed p = 0.05, is at least 30%!
  - Thus the numerical value of an observed p-value greatly overstates the evidence against the null hypothesis, which we already found for coin tosses

## p-values

- The absolute maximum evidence against the null hypothesis can be gotten by evaluating the likelihood ratio at the data. For example, if x is standard normal and we observe x = 1.96, which corresponds to an α level of 0.05 (two-tailed), we can calculate the likelihood ratio as
  $$\frac{p(x \mid H_0)}{p(x \mid H_1)} = \frac{\tfrac{1}{\sqrt{2\pi}}\exp(-\tfrac{1}{2}\,1.96^2)}{\tfrac{1}{\sqrt{2\pi}}\exp(-\tfrac{1}{2}\,0^2)} = \exp(-\tfrac{1}{2}\,1.96^2) = 0.146$$
  which gives p(H0 | x) = 0.128 on prior odds of 1

## p-values

- Papers on this subject can be found on the web:
  - http://makeashorterlink.com/?P3CB12232 (paper by Berger and Delampady)
  - http://makeashorterlink.com/?V2FB21232 (paper by Berger and Sellke, with comments and rejoinder)
  - http://www.stat.duke.edu/~berger/p-values.html
- Note that the papers by Berger and Delampady and by Berger and Sellke must be accessed from within the university network or by proxy server

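The arithmetic is one line; a minimal sketch (the function name is mine):

```python
import math

def max_evidence_odds(x):
    """Likelihood ratio of H0 (standard normal, mean 0) against the most
    favorable simple alternative, which puts the mean at the observed x."""
    return math.exp(-0.5 * x**2)

odds = max_evidence_odds(1.96)   # about 0.146
p_h0 = odds / (1 + odds)         # about 0.128 on prior odds of 1
```

So even the most anti-null alternative leaves H0 with posterior probability 0.128, far above what "p = 0.05" suggests.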
## Falsification

- Popper proposed that a scientific hypothesis must be falsifiable by data. For example, the hypothesis that a coin has two heads can be falsified by observing one tail
- A hypothesis H0 is falsifiable in Bayesian terms if, for some data D, its likelihood on H0 is 0: p(D|H0) = 0
- However, the requirement of falsifiability is too restrictive. In science, ideas are seldom, if ever, actually falsified. What usually happens is that old hypotheses are discarded in favor of new ones that new data have rendered more plausible, i.e., that have higher posterior probability

## Bayesian Epistemology

- Bayesians measure the effect of new data D on the relative plausibility of hypotheses by calculating the Bayes factor
  $$F\!\left[\frac{H_0 \mid D}{H_1 \mid D}\right] = \frac{P(D \mid H_0)}{P(D \mid H_1)}$$
- Then we compute posterior odds from prior odds
  $$O\!\left[\frac{H_0 \mid D}{H_1 \mid D}\right] = F\!\left[\frac{H_0 \mid D}{H_1 \mid D}\right] O\!\left[\frac{H_0}{H_1}\right]$$
- Bayes' theorem allows us to calculate the effect of new data on various hypotheses and adjust posterior probabilities accordingly. It thus becomes a justification for the inductive method

## Ockham's Razor

- "Pluralitas non est ponenda sine necessitate." (William of Ockham)
- Preferring the simpler of two hypotheses to the more complex, when both account for the data, is an old principle in science
  - Why do we consider
    $$s = a + ut + \tfrac{1}{2}gt^2$$
    to be simpler than
    $$s = a + ut + \tfrac{1}{2}gt^2 + ct^3\,?$$

## Ockham's Razor

- One way to reflect the common scientific experience that simple hypotheses are preferable is to choose the prior probabilities so that the simpler hypotheses have greater prior probability (Wrinch and Jeffreys)
  - This is a "prior probabilities" interpretation of Ockham's razor
  - Does it beg the question?
  - What principle should be used to assign the priors?

## Simplicity

- We regard H0 as simpler than H1 if it makes sharper predictions about what data will be observed
- Hypotheses can be considered more complex if they have to be tweaked to accommodate a wider variety of data
- Complex hypotheses can accommodate a larger set of potential observations than can simple ones
  - "This coin has two heads" vs. "This coin is fair"
  - "This coin is fair" vs. "This coin has unknown bias θ"
  - "The relationship is $s = a + ut + \tfrac{1}{2}gt^2$" vs. "The relationship is $s = a + ut + \tfrac{1}{2}gt^2 + ct^3$"

## Plagiarism

- Compilers of mailing lists include bogus addresses to catch unauthorized repeat use of the list
- Mapmakers include small, innocuous mistakes to catch copying
- Mathematical tables can be rounded up or down if the last digit ends in '5' without compromising the accuracy of the table. The compiler can embed a secret "code" in the table to catch copyright violations
- In all cases, duplication of these errors provides prima facie evidence, useful in court, that copying took place

## Plagiarism

- Example: a table of 1000 sines
  - Can expect to have a choice of rounding in 100 cases
  - Let D = "The rounding pattern is the same"
  - Let C = "The second table was copied from the first"
- Then
  $$P(D \mid C) = 1, \quad P(D \mid \bar{C}) \approx 10^{-30}$$
  $$F\!\left[\frac{C}{\bar{C}}\right] \approx 10^{30}$$

## Evolutionary Biology

- The principle of descent with modification underlies evolution
- Pseudogenes are genes that have lost essential codes, rendering them nonfunctional
- Nearly identical pseudogenes are observed in closely related organisms (e.g., chimpanzees and humans). By the same arguments as before, the posterior probability that this is due to actual copying from a common ancestor is vastly greater than the posterior probability that it is due to coincidence. This is powerful evidence in favor of evolution
- Similar evidence is provided by the fact that the genetic code is redundant. Several triplets of base pairs code for the same amino acids

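The Bayes factor in the sines-table example is just the factor for 100 independent fifty-fifty rounding choices all agreeing; a minimal sketch (the function name is mine):

```python
def copying_bayes_factor(n_choices):
    """Bayes factor for copying (C) versus independent work (C-bar),
    given that all n_choices free rounding decisions agree."""
    p_d_given_c = 1.0                   # a copier reproduces every choice
    p_d_given_not_c = 0.5**n_choices    # independent roundings agree by chance
    return p_d_given_c / p_d_given_not_c

bf = copying_bayes_factor(100)   # about 1.3e30, i.e. on the order of 10**30
```
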
Mercury’s Perihelion Motion
•    Around 1900, Newtonian mechanics was in trouble because of the problem of Mercury’s perihelion motion
•    Proposed solutions:
     • Rings of matter around the Sun, too faint to see
     • “Vulcan”, a small planet near the Sun, difficult to detect
     • Flattening of the Sun
     • Additional terms in the law of gravity (e.g., a λr^−3 term, where λ is an adjustable constant)
•    Some solutions could be ruled out on observational grounds (Jeffreys-Poor debate, 1921)
•    One could not rule out modifications to the law of gravity. The adjustable parameter λ can be chosen to allow any motion a of the perihelion

Mercury’s Perihelion Motion
•    Along comes Einstein and the General Theory of Relativity, which predicts a very precise value for Mercury’s perihelion motion—no other value is possible
•    Using contemporary figures
     • Poor gives a = 41.6″ ± 2.0″
     • We have aE = 42.98″ for Einstein’s theory (E)
     • The conditional probability of data a on the hypothesis E that the true value is aE is

               p(a | aE) = (1/(√(2π)σ)) exp(−(a − aE)^2/(2σ^2)) = p(a | E)

          where σ = 2.0″ (error of observation)

Mercury’s Perihelion Motion
•    The older theory F can be thought of as matching the observed value with a “fudge factor” aF
•    Observations of Mars, Earth and Venus can limit the fudge factor to |aF| < 100″
•    Assuming that Newton’s theory is approximately correct, we have a prior density of Mercury’s perihelion motion which we take for now to be N(0, s^2) with s = 50″:

          p(aF | F) = (1/(√(2π)s)) exp(−aF^2/(2s^2))

Mercury’s Perihelion Motion
•    The Bayes factor is

          p(a | E)/p(a | F) = √(1 + τ^2) exp(−DE^2/2) exp(DF^2/(2(1 + τ^2))) = 26.0

     where

          DE = (a − aE)/σ = −0.69,   DF = a/σ = 20.8,   τ = s/σ = 25.0

•    This is moderately strong evidence in favor of E.
•    The last two factors are O(1) and measure the “fit” of the two theories to the data. Nearly all of the Bayes factor is due to the first factor, which is known as the ‘Ockham factor’.
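The quoted Bayes factor can be checked numerically from the slide’s inputs. The sketch below reproduces a value within about 10% of the quoted 26.0; the small residual difference appears to come from rounding in the quoted intermediate figures:

```python
import math

# Mercury perihelion Bayes factor, using the slide's figures:
# a = 41.6" +/- 2.0" (Poor), a_E = 42.98" (Einstein), prior sd s = 50".
sigma = 2.0
s = 50.0
a, a_E = 41.6, 42.98

D_E = (a - a_E) / sigma      # -0.69
D_F = a / sigma              # 20.8
tau = s / sigma              # 25.0

bf = (math.sqrt(1 + tau**2)
      * math.exp(-D_E**2 / 2)
      * math.exp(D_F**2 / (2 * (1 + tau**2))))
print(f"D_E={D_E:.2f}, D_F={D_F:.1f}, tau={tau}, Bayes factor = {bf:.1f}")
```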

What’s Happening
•    The Ockham factor arises from the fact that F spreads its bets over a much larger portion of parameter space than does E. Essentially E puts all its money on aE, a precise value, spread out only by the error of observation σ. On the other hand, F spreads its bets out over a range that is 25 times bigger, and hence most of its probability is “wasted” covering regions of the parameter space that were not observed.
•    E makes a sharp prediction, F a fuzzy one
•    When the data come out even moderately close to where E predicts they will, E is rewarded for the risk it took by getting a larger share of the posterior probability
•    The factor of about 25 is just the dilution in probability that F must sacrifice in order to fit a larger range of data

Examples
•    How do p-values and posterior probabilities compare for sharp null hypotheses? Evidently, a small p-value is evidence against the null, but as we have seen, its numerical value overstates the evidence against the null
•    Consider aE = aF = 0, so we center everything at 0. Then the Bayes factor is

          F(E : F) = p(a | E)/p(a | F) = √(1 + τ^2) exp(−(DE^2/2)·τ^2/(1 + τ^2))

Examples
•    Case 1: Let the p-value be 0.05 so that DE = 1.95. We can plot the Bayes factor versus τ

     [Figure: Bayes factor and posterior probability plotted against τ from 0 to 6; the posterior probability dips to a minimum of about 0.32]

Examples
•    Case 1: The minimum in the posterior probability is about 0.32, which is not very good evidence against the null hypothesis
•    For large τ, we see that the Bayes factor is asymptotically like

          F(E : F) = p(a | E)/p(a | F) ≈ τ exp(−DE^2/2)

     where τ is the Ockham factor and exp(−DE^2/2) is the evidence factor
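The minimum posterior of about 0.32 can be recovered by a direct scan over τ, assuming equal prior odds so that the posterior probability of the null is B/(1+B):

```python
import math

# Posterior probability of the null versus tau, for D_E = 1.95
# (the slide's Case 1), assuming equal prior odds so p = B/(1+B).
D_E = 1.95

def bayes_factor(tau):
    return (math.sqrt(1 + tau**2)
            * math.exp(-(D_E**2 / 2) * tau**2 / (1 + tau**2)))

# Scan tau over the plotted range and locate the posterior minimum.
taus = [i / 1000 for i in range(1, 6000)]
posteriors = [bayes_factor(t) / (1 + bayes_factor(t)) for t in taus]
p_min = min(posteriors)
print(f"minimum posterior probability ~ {p_min:.2f}")  # -> 0.32
```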

Examples
•    From this we see that for a given DE, the more vague the prior on the alternative (measured by τ), the larger the Bayes factor in favor of the sharp prediction of the null
•    Theories with great predictive power (sharp predictions) are favored over those with vague predictions

          F(E : F) = p(a | E)/p(a | F) ≈ τ exp(−DE^2/2)

     where τ is the Ockham factor, exp(−DE^2/2) is the evidence factor, and z = DE is the z-score, or standardized variable

Examples
•    For fixed s, and a standard deviation for a single observation of σ, with n observations we will have

          √(1 + τ^2) = √(1 + n s^2/σ^2) = O(√n) for large n

•    Thus, asymptotically we have for large n

          F(E : F) ≈ √n (s/σ) exp(−z^2/2)

•    This suggests a way to interpret p-values obtained on data with large n (I.J. Good)
     • Let the p-value be α
     • Compute the z-score zα which gives this value of α for the p-value (use normal approximation and tables)
     • Take s/σ = O(1) and compute

               B = F(E : F) ≈ √n exp(−zα^2/2)

     • Then the posterior probability of the null is approximately

               p(H0 | data) ≈ B/(1 + B)

•    Choose a p-value α > 0, however small. To this p-value there corresponds a z-score zα, and for large n the Bayes factor against the alternative is

               B ≈ √n exp(−zα^2/2)

     where the √n factor increases with n while exp(−zα^2/2) stays fixed
•    Thus, for large enough n, a classical test can strongly reject the null at the same time that the Bayesian analysis strongly affirms it
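Good’s rule of thumb can be sketched numerically. Holding the z-score fixed at the classical 0.05 cutoff while n grows shows the Bayes factor, and with it the posterior probability of the null, increasing without bound:

```python
import math

# I.J. Good's rule of thumb: for a sharp null tested on n observations
# at fixed z-score z, the Bayes factor in favor of the null is roughly
# B ~ sqrt(n) * exp(-z**2 / 2)   (taking s/sigma = O(1)).
def good_bayes_factor(n, z):
    return math.sqrt(n) * math.exp(-z**2 / 2)

z = 1.96  # classically "significant at the 0.05 level"
for n in (100, 10_000, 1_000_000, 100_000_000):
    B = good_bayes_factor(n, z)
    post = B / (1 + B)
    print(f"n={n:>11,}  B={B:8.1f}  p(H0|data)={post:.3f}")
```

Even though every row is “significant at the 0.05 level” classically, the posterior probability of the null climbs toward 1 as n grows.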

•    If H is a simple hypothesis, x the result of an experiment, the following two phenomena can exist simultaneously:
     • A significance test for H reveals that x is significant at the level p < α, where the pre-chosen rejection level α > 0 can be as small as we wish, and
     • The posterior probability of H, given x, is, for quite small prior probabilities of H, as high as (1 − α)
•    This means that the classical significance test can reject H with an arbitrarily small p-value, while at the same time the evidence can convince us that H is almost certainly true

•    Example: A parapsychologist has a number of subjects attempt to “influence” the output of a hardware random number generator (which operates by radioactive decays). In approximately 104,490,000 events, 18,471 excess events are counted in one direction versus the other
•    This is a binomial distribution. The standard deviation of the binomial distribution is σ = √(n f(1 − f)), where f is the expected frequency of counts. Straightforwardly we find that σ = 5111 counts (using f = 0.5)
•    The classical significance test finds that the effect is significant at 18471/5111 = 3.61 standard deviations, for a p-value of 0.0003 (two-tailed), using the approximation, excellent in this case, that the binomial distribution can be approximated by a normal distribution
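The classical numbers quoted above are easy to verify, using the normal approximation for the two-tailed p-value:

```python
import math

# Classical analysis of the random-number-generator experiment:
# n events, an 18,471-count excess in one direction, null f = 0.5.
n = 104_490_000
excess = 18_471

sigma = math.sqrt(n * 0.5 * 0.5)       # binomial standard deviation
z = excess / sigma                     # standardized excess
# two-tailed p-value under the normal approximation
p_value = math.erfc(z / math.sqrt(2))

print(f"sigma = {sigma:.0f} counts, z = {z:.2f}, p = {p_value:.4f}")
```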


•    The Bayesian analysis is quite different. We have a genuine belief that the null hypothesis of no (significant) effect might be true.
     • To be sure, no point null is probably ever exactly true, because the random event generator might not be perfect. But tests of the generator are claimed to show that its bias is very small, so a point null is a good approximation

•    On the alternative hypothesis, all we know is that the effect might be something, but we don’t know how much or even in what direction
     • Parapsychologists call effects measured in the direction opposite to the intended one “psi-missing”, and it is considered evidence for a real effect—sort of “heads I win, tails you lose”.
•    To reflect our ignorance we choose a uniform prior on the alternative
•    We’ve already seen the analysis of this problem in the coin-flipping problem. The result is

          p(H0 | x)/p(H1 | x) ≈ 12 · p(H0)/p(H1)

•    In other words, although the classical test rejects H0 with a very small p-value, the Bayesian analysis has made us twelve times more confident of the null hypothesis than we were!

•    The p-value of 0.0003 corresponds to a z-score of zα = 3.61 standard deviations on n = 104,490,000. Our approximate formula for the Bayes factor yields

          B ≈ √104,490,000 exp(−(3.61)^2/2) = 15.1

•    This is certainly in the right ballpark and confirms the approximate formula
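The approximate Bayes factor of 15.1 follows directly from B ≈ √n exp(−z²/2):

```python
import math

# Approximate Bayes factor for the parapsychology data via Good's
# formula B ~ sqrt(n) * exp(-z**2 / 2), with z = 3.61, n = 104,490,000.
n = 104_490_000
z = 3.61

B = math.sqrt(n) * math.exp(-z**2 / 2)
print(f"B ~ {B:.1f}")   # -> 15.1
```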


Connection with Ockham’s Razor
•    This result, where the Bayesian answer is at great odds with the orthodox result, can be understood in terms of Ockham’s razor
•    The sharp null hypothesis H0 is a special one that comes from our genuine belief (in our parapsychology example) that people cannot really influence the output of a random number generator by simply wishing it. The sharp hypothesis is inconsistent with nearly all possible data sets, since it is consistent only with the minuscule fraction of possible sequences for which the numbers of 0’s and 1’s are approximately equal. On the other hand, the alternative hypothesis is very open ended and would be consistent with any possible data

Connection with Ockham’s Razor
•    This means that the alternative hypothesis has an adjustable parameter, the effect size f, that the null hypothesis does not have. The null hypothesis makes a definite prediction f = 0.5, whereas the alternative hypothesis can fit any value of f.
•    Therefore the null hypothesis is simpler than the alternative in the sense we’ve been discussing. It has fewer parameters.
•    Because of the Ockham factor, which naturally arises in the analysis, we can say that in some sense Ockham’s razor is a consequence of Bayesian ideas. The Ockham factor automatically penalizes complex hypotheses, forcing them to fit the data significantly better than the simple one before they will be accepted
Connection with Ockham’s Razor
•    To conclude: There are at least three Bayesian interpretations of Ockham’s razor
     • As a way of assigning prior probabilities to hypotheses, based on experience
     • As a consequence of the fact that complex hypotheses with more parameters, in their attempt to accommodate a larger set of possible data, are forced to waste prior probability on outcomes that are never observed. This automatic penalty factor favors the simpler theory
     • As an interpretation of the notion that when fitting data to empirical models, one should avoid overfitting the data

Sampling to a Foregone Conclusion
•    The phenomenon we have just discussed is closely related to the phenomenon of sampling to a foregone conclusion
     • In classical significance testing, one is supposed to decide on everything in advance, and one is especially supposed to decide on exactly how much data to take in advance.
     • Failure to do this can lead to a situation where, if you sample long enough, at some point with probability as close to 1 as you wish you will reject a true null hypothesis with as small a preset significance level as you wish
     • This is sampling to a foregone conclusion
•    This phenomenon is peculiar to classical significance testing. It cannot occur in Bayesian testing

Stopping Rule Principle
•    Sampling to a foregone conclusion is related to the stopping rule principle (SRP), according to which the stopping rule—how we decide to stop taking data—should have no effect on the final reported evidence about the parameter θ obtained from the data
•    The SRP is a consequence of the Likelihood Principle
     • Classical testing violates the stopping rule principle just as it violates the likelihood principle. Thus, “sample until n=12” and “sample until t=3” are different stopping rules that give different inferences in classical binomial testing, but the same inferences in Bayesian testing
•    See Berger & Wolpert, The Likelihood Principle

Stopping Rule Principle
•    The ability to ignore the stopping rule in Bayesian inference has profound implications for experimental design
     • It is OK to stop if “the data look good” or “the data look horrible”. Breaking a prior decision to test n patients, for example, will not compromise the validity of the test
          » Thus if the data look good and the new treatment looks very effective, it would be unethical not to break the protocol so that the patients on placebo can get the effective drug
          » Likewise if the first 20 patients all died under the new treatment it would be unethical to continue
Stopping Rule Principle
•    The ability to ignore the stopping rule in Bayesian inference has profound implications for experimental design
     • Similarly, it is OK to continue the test longer if the results are promising but not fully conclusive at the end of the scheduled test

Frequentist Work-arounds
•    In frequentist hypothesis testing this problem can be avoided through the device of “α spending”. That is, suppose we are in a drug clinical trial, and wish to stop the trial at some point to do a preliminary assessment of the results, to decide whether to continue the trial
     • Terminate trial because of excess bad outcomes
     • Terminate trial because drug is so effective it would be unethical not to give it to the placebo group
•    A frequentist can “peek” at the data in advance if one is willing to “spend” some of the α at that point; but if the test is continued, it will require a smaller α at a later point to reach the preassigned α-level for the overall trial.
•    However, this is much more complex and involved than the Bayesian approach


Hypothesis Testing and Prior Belief
•    Prior belief has to be a consideration in any kind of hypothesis testing. Thus, on the same data, different degrees of plausibility may accrue to a hypothesis
•    For example, consider shuffling a pack of alphabet cards. A subject is supposed to guess the letter on three cards. Suppose the subject names all three correctly. What do we think when
     • H1: The subject is a child and is allowed to look at the cards
     • H2: The subject is a magician, who only looks at the backs of the cards
     • H3: The subject is a psychic, who only looks at the backs of the cards

Good’s Device
•    The statistical evidence is the same, but the posterior beliefs are very different!
•    This example illustrates a device due to I.J. Good for setting prior probabilities
     • It will take much more data to convince most people of the truth of H3 than H2, even though the evidence is identical. One can ask, “how much more data?” Then, by using Bayes’ theorem in reverse, we can estimate our prior on the hypotheses
          » How many times in a row would someone have to name a randomly-picked card before you would give 1:1 odds that he was a psychic?
Example
•    Published report of an experiment in ESP (Soal and Bateman, 1954)
•    Deck of 25 Zener cards (5 different designs) is shuffled; subject guesses the order of the cards
     • Sampling without replacement, often misanalyzed by parapsychologists (P. Diaconis)
•    Subject obtained r = 9410 “hits” out of n = 37100 (vs. 7420 ± 77 expected), f = 0.2536
•    What should we think? H0 = no ESP: p = 0.20, f = r/n = 0.2536, which is 25.8σ away from the expected value

Example
•    Calculate probability given null hypothesis

          p(data | H0) = C(n, r) p^r (1 − p)^(n−r)

•    Use the Stirling approximation

          n! ≈ √(2π) n^(n+1/2) e^(−n)

     to obtain

          p(data | H0) ≈ (n/(2π r(n − r)))^(1/2) exp(n H(f, p))

     where the cross-entropy is

          H(f, p) = −f ln(f/p) − (1 − f) ln((1 − f)/(1 − p))

Example
•    The cross-entropy (Kullback-Leibler divergence)

          H(f, p) = −f ln(f/p) − (1 − f) ln((1 − f)/(1 − p))

     measures the degree to which the observed distribution matches the expected distribution. Our entropy is just the cross-entropy relative to a uniform distribution
•    Plug in the data and find that

          p(data | H0) = 0.00476 exp(−313.6) = 3.15 × 10^−139

•    This is very small! Does the subject have ESP?
•    If we calculate the Bayes factor against the value of p that maximizes the cross-entropy (p = f) we still get huge odds against the null hypothesis
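The saddle-point numbers can be checked directly. The sketch below recomputes the prefactor and n·H(f, p), landing near the quoted 0.00476·exp(−313.6) ≈ 3×10^−139; small differences come from rounding of f in the slide:

```python
import math

# Saddle-point (Stirling) approximation to the binomial likelihood:
# p(data|H0) ~ (n / (2*pi*r*(n-r)))**0.5 * exp(n * H(f, p)).
n, r, p = 37_100, 9_410, 0.20
f = r / n

def cross_entropy(f, p):
    return -f * math.log(f / p) - (1 - f) * math.log((1 - f) / (1 - p))

nH = n * cross_entropy(f, p)
prefactor = math.sqrt(n / (2 * math.pi * r * (n - r)))
log10_like = math.log10(prefactor) + nH / math.log(10)
print(f"prefactor = {prefactor:.5f}, n*H = {nH:.1f}")
print(f"p(data|H0) ~ 10^{log10_like:.1f}")
```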

Example
•    The priors hardly matter. No matter what prior you choose on the alternative hypothesis, you’re going to get very strong evidence against the null hypothesis
     • So, does ESP exist?
•    Well, no. Bayesian methodology requires us to consider a mutually exclusive and exhaustive set of hypotheses, and we haven’t done that. We’ve left out, for example
     • H1: The experimenter altered the data
     • H2: The subject was a conjuror
     • H3: There was a flaw in the experimental design (e.g., the subject might have noticed the card reflected in the scientist’s glasses)

Example
•    Each of H1, H2, H3, … may have a prior probability π(Hi) that is much greater than that of the hypothesis P = HP that the subject has genuine psychic powers, and each would adequately account for the data. As a result

          p(P | data) = p(data | P)π(P) / [p(data | P)π(P) + Σ_{i≠P} p(data | Hi)π(Hi)]

     and if the sum in the denominator dominates the first term, the posterior probability of P will always be much less than 1
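The effect of the denominator sum can be illustrated with hypothetical priors (the numbers below are purely illustrative, not from the slides): even when every hypothesis fits the data equally well, a psychic-powers prior much smaller than the mundane alternatives keeps p(P | data) tiny:

```python
# Hypothetical numbers (not from the slides) illustrating how mundane
# alternatives with larger priors swamp the psychic hypothesis P, even
# when every hypothesis explains the data equally well.
likelihood = 1.0          # assume each of P, H1, H2, H3 fits the data
prior_P = 1e-6            # genuine psychic powers: very small prior
priors_H = [1e-3, 1e-3, 1e-3]   # fraud, conjuring, design flaw

numerator = likelihood * prior_P
denominator = numerator + sum(likelihood * pi for pi in priors_H)
posterior_P = numerator / denominator
print(f"p(P | data) ~ {posterior_P:.2e}")   # -> 3.33e-04
```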


Example
•    Our data is not that someone actually performed the feat in question. It is that someone reported to us that the feat was performed
•    There are always hypotheses that haven’t been considered, and sometimes they may be raised to significance when data come along that support them
•    This is why careful consideration of all possible hypotheses is important

Example
•    In fact, the Soal/Bateman result was shown to have been due to experimenter fraud—tampering with the data records. This was shown by Betty Markwick, who provided convincing evidence that Soal had altered the record sheets in a systematic way so as to achieve an excess of “hits”.
Making Decisions
•    If one is testing hypotheses, it is for a reason. One does not just sit around saying “Oh, well, I guess that hypothesis isn’t true!”
     • There must be some action implied by making that decision (an action can include “doing nothing”)
          » We decide to approve that drug
          » We decide to invest in that stock
          » We decide to publish our paper
          » Etc., etc.

Making Decisions
•    Example: Testing a drug or treatment
     • Usually when testing a new treatment we will compare it to an old treatment or a placebo
     • We won’t approve the new treatment if it isn’t better than the old one
     • We might not approve the new treatment if it is significantly more costly than the old one, unless it is significantly better than the old one
          » But that judgement might depend upon whether you were the patient, the drug company, the government, or the insurance company!
     • We probably won’t approve the new treatment if it has …

Making Decisions
•    This shows that not only is the probability of the states of nature important, but we must also consider the consequences of each state of nature (cost, side-effects, desirable effects, and so on) given each of the possible decisions that we might make. Thus we must decide
     • What are the states of nature?
     • What are the probabilities of each state of nature (Bayes)?
     • What actions are available to us?
     • What are the costs or benefits given each possible state of nature and each action (loss function)?
     • What are the expected costs or benefits of each action?
     • Which action is the best under the circumstances?

Making Decisions
•    In drug testing, for example, the actions a we might take are
     • Approve the drug
     • Do not approve the drug
•    As a result of our testing we will end up with posterior probabilities on the states of nature θ which will, for example, include the cure rate of the new drug relative to the old one, information on side effects, etc.
•    We will have to summarize the consequences of making various decisions about the drug as a utility or loss (= −utility). Call the loss function L(θ, a); it is a function of the states of nature θ and the actions a
Making Decisions
•    Then the expected loss is a function of the actions a and can be calculated, e.g.

          E_θ[L(θ, a)] = ∫ L(θ, a) p(θ | data) dθ

•    Evidently, we would want to choose the action (approve, disapprove) depending upon which of these two actions gives the smaller loss
     • If we were using utilities, we would maximize the expected utility

Making Decisions
•    From this discussion we can see that losses/utilities play an important role in making decisions. Some aspects are objective (e.g., monetary costs); however, many of them are subjective, just as priors may be subjective.
     • The person who is affected by the decision is the one that must determine the loss/utility for this calculation
          » The insurance company, the drug manufacturer, the FDA, and the patient each will have a different loss/utility when choosing to approve, market, or use the drug. Each must use his own loss/utility when making the decision
          » The role of the statistician is to assist the user, but it is not to set the loss/utility (the same goes for the patient’s doctor)
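The expected-loss recipe can be sketched with a Monte Carlo sum over posterior samples. Everything numerical here (the stand-in posterior, the loss function) is hypothetical, chosen only to show the mechanics of picking the smaller-loss action:

```python
# A minimal sketch (hypothetical numbers) of choosing the action with
# the smaller expected loss. theta = cure-rate advantage of the new
# drug; the integral is approximated by a sum over posterior samples.
import random

random.seed(1)
# Stand-in posterior for theta: normal around a modest 2% advantage.
posterior_samples = [random.gauss(0.02, 0.03) for _ in range(100_000)]

def loss(theta, action):
    # Hypothetical loss: approving a worse drug, or rejecting a better
    # one, both cost us in proportion to the advantage theta.
    if action == "approve":
        return max(0.0, -theta) * 100   # harm if the drug is worse
    return max(0.0, theta) * 100        # missed benefit if it is better

losses = {}
for action in ("approve", "do not approve"):
    losses[action] = sum(loss(t, action) for t in posterior_samples) / len(posterior_samples)
    print(f"E[loss | {action}] = {losses[action]:.3f}")
```

With this stand-in posterior, “approve” comes out with the smaller expected loss, which is the decision rule the slide describes.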


Making Decisions
•    Our course is not a course in decision theory. Useful books on decision theory are
     • Smart Choices, by Hammond, Keeney and Raiffa. Good introduction for lay people. Stresses the process of eliciting utilities. Discusses basics of probability
     • Decision Analysis, by Howard Raiffa, and Making Decisions, by Dennis Lindley. Good introductions
     • Making Hard Decisions, by Robert Clemen. More …
     • Statistical Decision Theory and Bayesian Analysis, Second Edition, by Jim Berger. Very advanced and theoretical

```