
# Bayesian Analysis of the Normal Distribution, Part II by kjv16990


- Set-up of the basic model: a normally distributed random variable with unknown mean and variance (a two-parameter model).
- Discussion of philosophies of prior selection.
- Implementation of different priors, with a discussion of MCMC methods.
The normal model with unknown mean and variance

Let's extend the normal model to the case where the variance parameter is assumed to be unknown. Thus, yᵢ ~ N(μ, σ²), where μ and σ² are both unknown random variables.

This is our first example of a multi-parameter model. The Bayesian set-up should still look familiar:

p(μ, σ² | y) ∝ p(μ, σ²) p(y | μ, σ²).

Note: we would like to make inferences about the marginal distributions p(μ | y) and p(σ² | y) rather than the joint distribution p(μ, σ² | y). Ultimately, we'd like to find:

p(μ | y) = ∫ p(μ | σ², y) p(σ² | y) dσ²

What should we choose for the prior distribution p(μ, σ²)?

Different "types" of Bayesians choose different priors

Classical Bayesians: the prior is a necessary evil.
→ choose priors that interject the least information possible.

Modern Parametric Bayesians: the prior is a useful convenience.
→ choose prior distributions with desirable properties (e.g. conjugacy). Given a distributional choice, prior parameters are chosen to interject the least information possible.

Subjective Bayesians: the prior is a summary of old beliefs.
→ choose prior distributions based on previous knowledge, either the results of earlier studies or non-scientific opinion.
The Classical Bayesian and the normal model with unknown mean and variance

y ~ N(μ, σ²) where μ and σ² are both unknown random variables.

What prior distribution would we choose to represent the absence of any knowledge in this instance?

What if we assumed that the two parameters were independent, so p(μ, σ²) = p(μ)p(σ²)?
Classical Bayesians and the normal model with unknown mean and variance

y ~ N(μ, σ²) where μ and σ² are both unknown random variables. What prior would a classical Bayesian use?

If p(μ, σ²) = p(μ)p(σ²), one option would be to assume uniform prior distributions for both parameters. Thus,

p(μ) ∝ c         for −∞ < μ < ∞
p(σ²) ∝ 1/σ²     for 0 < σ² < ∞

And the joint density would be: p(μ, σ²) ∝ 1/σ²

Are these distributions proper?
yi ~ N( , 2) for i  {1,…,n} and p( , 2)  1/2

p (  ,  2 | y )  p(  ,  2 ) p ( y |  ,  2 )
n/2
 1               1
 2 
1
 exp            n  1s 2  n( y   )2 

  2 2               2
2

1
where s 2 
n 1
 ( yi  y ) 2
It can be shown that the conditional posterior distribution
p(  |  2 , y ) ~ N ( y ,  2 / n )
To compute the marginal distributi on, we need to calculate

p(  | y )   p(  |  2 , y )d 2
0                                              Notice the parallel
It can be shown that : p(μ | y)~tn 1 ( y , s 2 / n )           between these
results and what we
μ y                 would find in
or more convention ally : p(                    |y)~tn 1
s/ n                 classical theory.
yi ~ N( , 2) for i  {1,…,n} and p( , 2)  1/2

From the previous slide, we saw that:
p (  ,  2 | y )  p(  ,  2 ) p ( y |  ,  2 )
n/2
 1               
 2 
1
 exp  
1
n  1s 2  n( y   ) 2 

      2  
2
 2
2

1
where s 2 
n 1
 ( yi  y ) 2
What is the marginal distribution of 2?
n / 21
 1                   1
p ( 2 | y )    2              exp      n  1s 2  n( y   )2 d

                    2                            
2

ICBST this is a scaled inverse-2 density
2 | y ~ Inv-2 (n-1,s2) 
 Inv-gamma( (n-1)/2), (n-1)s2/2 )
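The equivalence of the two parameterizations can be checked by simulation. A minimal numpy sketch, using ν = 19 and s² = 193.4 as hypothetical inputs (roughly the values in the example below): if σ² ~ Inv-χ²(ν, s²), then 1/σ² ~ Gamma(shape = ν/2, rate = νs²/2).

```python
import numpy as np

rng = np.random.default_rng(0)
nu, s2 = 19, 193.4  # hypothetical values, roughly matching the n = 20 example

# Draw sigma^2 ~ Scaled-Inv-chi2(nu, s2) via its Inv-gamma((nu)/2, nu*s2/2) form:
# 1/sigma^2 ~ Gamma(shape = nu/2, rate = nu*s2/2), i.e. scale = 2/(nu*s2)
draws = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / (nu * s2), size=200_000)

# The theoretical mean of Inv-gamma(a, b) is b/(a-1) = nu*s2/(nu-2)
print(draws.mean(), nu * s2 / (nu - 2))
```

The Monte Carlo mean should agree with the closed-form mean νs²/(ν−2) to within simulation error.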
Two different methods to sample from the posterior in this case

Method 1:
Sample directly from each of the two marginal distributions.

Method 2:
Two stages:
1) Sample a value from the distribution σ² | y.
2) Sample a value from the distribution μ | σ², y.
Repeat these stages lots of times.
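Both methods can be sketched in Python (numpy stands in for WinBugs here; the data vector is a synthetic placeholder, not the example analyzed below):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(128, 15, size=20)          # placeholder data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

# Method 1: draw independently from each marginal posterior
# sigma^2 | y ~ Inv-chi2(n-1, s2), drawn via 1/Gamma((n-1)/2, rate (n-1)s2/2)
sig2_m1 = 1.0 / rng.gamma((n - 1) / 2, 2.0 / ((n - 1) * s2), size=50_000)
# mu | y ~ t_{n-1}(ybar, s2/n)
mu_m1 = ybar + np.sqrt(s2 / n) * rng.standard_t(n - 1, size=50_000)

# Method 2: draw sigma^2 from its marginal, then mu | sigma^2, y ~ N(ybar, sigma^2/n)
sig2_m2 = 1.0 / rng.gamma((n - 1) / 2, 2.0 / ((n - 1) * s2), size=50_000)
mu_m2 = rng.normal(ybar, np.sqrt(sig2_m2 / n))

print(mu_m1.mean(), mu_m2.mean())   # both approximate E(mu | y) = ybar
```

The two samplers target the same joint posterior, so their marginal summaries should agree up to Monte Carlo error.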
Example 2.1 from Congdon, but with the assumption of known variance relaxed

yᵢ denotes the systolic blood pressure of individual i = 1, …, 20.

Assume that: yᵢ ~ N(μ, σ²) and p(μ, σ²) ∝ 1/σ²

From the previous results for the normal distribution with the improper prior 1/σ², we found that:

p(μ | y) ~ t_{n−1}(ȳ, s²/n)

By the properties of the t-distribution, which we can find in a table:

E(μ | y) = ȳ = 128  and  Var(μ | y) = [(n−1)/(n−3)] · (s²/n) ≈ 10.81, so SD(μ | y) ≈ 3.27

p(σ² | y) ~ Scaled-Inv-χ²(n−1, s²)

E(σ² | y) = [(n−1)/(n−3)] · s² ≈ 216.1
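These analytic summaries can be reproduced directly from the data (the same 20 observations supplied to WinBugs with this example):

```python
import numpy as np

# Congdon 2.1 blood-pressure data (the same list used in the WinBugs runs)
y = np.array([98, 160, 136, 128, 130, 114, 123, 134, 128, 107, 123, 125,
              129, 132, 154, 115, 126, 132, 136, 130])
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

post_mean_mu = ybar                              # E(mu | y) = ybar
post_var_mu = (s2 / n) * (n - 1) / (n - 3)       # variance of t_{n-1}(ybar, s2/n)
post_mean_sig2 = s2 * (n - 1) / (n - 3)          # E(sigma^2 | y)

print(post_mean_mu, post_var_mu, post_mean_sig2)  # 128.0, ~10.8, ~216.1
```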
This is general code that can be used to model (among other things) Congdon 2.1 for the case with unknown mean and variance.

```
# Method 1 -- directly sample from both marginal posterior distributions
model {
  for (i in 1:N) {
    ytemp[i] <- (y[i] - mean(y[])) * (y[i] - mean(y[]))
  }

  # GENERATE SAMPLE QUANTITIES
  ybar <- mean(y[])               # sample mean
  s2 <- sum(ytemp[]) / (N - 1)    # sample variance

  # GENERATE RANDOM DRAWS FROM THE MARGINAL DISTRIBUTION OF SIGMA-SQ
  # If sigma2 ~ inverse-gamma(alpha, beta) then 1/sigma2 ~ gamma(alpha, beta)
  sigma2 <- 1 / tau
  alpha <- (N - 1) / 2
  beta <- (N - 1) * s2 / 2
  # this generates the draws from the distribution of the posterior precision
  tau ~ dgamma(alpha, beta)

  # GENERATE RANDOM DRAWS FROM THE MARGINAL DISTRIBUTION OF MU
  df <- N - 1             # degrees of freedom
  s2prime <- s2 / N       # variance of the sample mean
  inv_s2 <- 1 / s2prime   # precision of the sample mean
  # this generates the draws from the distribution of the posterior mean
  mu ~ dt(ybar, inv_s2, df)
}
```
How do the analytic results and those from method 1 compare?

Analytic Results

p(μ | y) ~ t_{n−1}(ȳ, s²/n)
E(μ | y) = ȳ = 128  and  Var(μ | y) = [(n−1)/(n−3)] · (s²/n) ≈ 10.81 (SD ≈ 3.27)

p(σ² | y) ~ Scaled-Inv-χ²(n−1, s²)
E(σ² | y) = [(n−1)/(n−3)] · s² ≈ 216.1

WinBugs Results

node    mean   sd     MC error  2.5%   median  97.5%  start  sample
mu      128.0  3.263  0.03698   121.6  128.0   134.6  1000   9001
sigma2  216.9  79.33  0.8072    111.7  201.4   414.1  1000   9001
Plots of posterior densities from WinBugs, Method 1: [kernel density plots of the sigma2 sample (9001 draws, spanning roughly 0–800) and the mu sample (9001 draws, spanning roughly 110–140)]
This is general code that can be used to model (among other things) Congdon 2.1 for the case with unknown mean and variance.

```
# Method 2 -- first draw from the marginal distribution of the posterior
# precision, then from the conditional distribution of the posterior mean
model {
  for (i in 1:N) {
    ytemp[i] <- (y[i] - mean(y[])) * (y[i] - mean(y[]))
  }

  # GENERATE SAMPLE QUANTITIES
  ybar <- mean(y[])               # sample mean
  s2 <- sum(ytemp[]) / (N - 1)    # sample variance

  # GENERATE RANDOM DRAWS FROM THE MARGINAL DISTRIBUTION OF SIGMA-SQ
  # If sigma2 ~ inverse-gamma(alpha, beta) then 1/sigma2 ~ gamma(alpha, beta)
  sigma2 <- 1 / tau
  alpha <- (N - 1) / 2
  beta <- (N - 1) * s2 / 2
  # this generates the draws from the distribution of the posterior precision
  tau ~ dgamma(alpha, beta)

  # GENERATE RANDOM DRAWS FROM THE CONDITIONAL DISTRIBUTION OF MU
  # mu | sigma2, y ~ N(ybar, sigma2/N), i.e. normal with precision tau*N
  tauprime <- tau * N
  # this generates the draws from the distribution of the posterior mean
  mu ~ dnorm(ybar, tauprime)
}
```
How do the analytic results and those from method 2 compare?

Analytic Results

p(μ | y) ~ t_{n−1}(ȳ, s²/n)
E(μ | y) = ȳ = 128  and  Var(μ | y) = [(n−1)/(n−3)] · (s²/n) ≈ 10.805 (10.805^(1/2) ≈ 3.27)

p(σ² | y) ~ Scaled-Inv-χ²(n−1, s²)
E(σ² | y) = [(n−1)/(n−3)] · s² ≈ 216.1

WinBugs Results

node    mean   sd     MC error  2.5%   median  97.5%  start  sample
mu      128.0  3.229  0.03006   121.7  128.0   134.3  1000   9001
sigma2  216.3  77.46  0.7695    111.7  201.3   408.6  1000   9001
Method Two is a simple example of Markov Chain Monte Carlo numerical integration

As may be becoming apparent, the greatest practical difficulty in Bayesian inference is computing the integrals. In fact, except for "simple" cases like conjugate priors, researchers long lacked the ability to do Bayesian analysis. This integration must be done to identify the marginal posterior distribution, which is the object of analysis.

→ Many Bayesians suggest that the inability to compute these integrals is the only reason why we use the hypothesis-testing framework today.

Markov Chain Monte Carlo (MCMC), a computationally intensive simulation method that replaces exact integration, was developed in the 1980s and made it possible to tackle more complex, realistic problems.
Basics of MCMC

The goal of MCMC is to use simulation to take a large number of random draws from the posterior distribution. The resulting sample can be used to estimate the posterior mean, median, Bayesian credible interval (i.e. HDR), etc.

In class, we will implement the Gibbs sampler, the most common form of MCMC, with WinBugs.
→ BUGS = Bayesian inference Using Gibbs Sampling

We will generally duck the details of how the program is making its calculations, but it really isn't that complex in theory.
The intuition behind the Gibbs Sampler

Consider a problem with two parameters θ₁ and θ₂, and let X denote the data.

Suppose further that we know the conditional distributions: p(θ₁ | θ₂, X) and p(θ₂ | θ₁, X).

We need to find: p(θ₁ | X) and p(θ₂ | X).

The Gibbs Sampler proceeds by choosing some initial point, which we will denote (θ₁(0), θ₂(0)), from the parameter space.
→ This can be any reasonable value of θ.

Then, we take draws from the two conditional distributions in the following sequence:

θ₁(1) ~ p(θ₁ | θ₂(0), X)
θ₂(1) ~ p(θ₂ | θ₁(1), X)
θ₁(2) ~ p(θ₁ | θ₂(1), X)
θ₂(2) ~ p(θ₂ | θ₁(2), X)
θ₁(3) ~ p(θ₁ | θ₂(2), X)
θ₂(3) ~ p(θ₂ | θ₁(3), X)
etc.

This sequence of draws is a Markov chain because the values at step t only depend on the values at step t−1. If allowed to run long enough, the Gibbs sampler will "converge" to the "true" posterior distribution. If allowed to run sufficiently long after convergence, the Gibbs sampler produces a complete sample from the distribution of θ.
This algorithm can be easily generalized to the n-parameter case θ₁, θ₂, …, θₙ.

For the tth iteration of the Gibbs Sampler we have:

θ₁(t) ~ p(θ₁ | θ₂(t−1), θ₃(t−1), …, θₙ(t−1), X)
θ₂(t) ~ p(θ₂ | θ₁(t), θ₃(t−1), …, θₙ(t−1), X)
…
θₙ(t) ~ p(θₙ | θ₁(t), θ₂(t), …, θₙ₋₁(t), X)
The Gibbs Sampler is not rocket science, even if the original application was in physics

We have already created a Gibbs sampler when we implemented Method 2 above. In the code, we used WinBugs' random number generators rather than allowing the program to identify the conditional distributions, which will be the usual case. We had to do this because WinBugs does not allow researchers to use improper priors. When we get to implementing the modern parametric Bayesian's priors, I will show how we can even use Excel to do MCMC.
Modern Parametric Bayesians and the normal model with unknown mean and variance

y ~ N(μ, σ²) where μ and σ² are both unknown random variables.

What prior distribution would a modern parametric Bayesian choose to satisfy the demands of convenience?

What if we used the definition of conditional probability, so p(μ, σ²) = p(μ | σ²)p(σ²)?
Modern Parametric Bayesians and the normal model with unknown mean and variance

y ~ N(μ, σ²) where μ and σ² are both unknown random variables.

A modern parametric Bayesian would typically choose a conjugate prior. For the normal model with unknown mean and variance, the conjugate prior for the joint distribution of μ and σ² is the normal-inverse-gamma distribution (i.e. normal-inverse-χ²):

p(μ, σ²) ~ N-Inv-χ²(μ0, σ0²/k0; v0, σ0²)

Four parameters in the prior.

Suppose p(μ, σ²) ~ N-Inv-χ²(μ0, σ0²/k0; v0, σ0²).

ICBST the above expression can be factored such that:

p(μ, σ²) = p(μ | σ²)p(σ²)

where μ | σ² ~ N(μ0, σ²/k0) and σ² ~ Inv-χ²(v0, σ0²).

Because this is a conjugate distribution for the normal distribution with unknown mean and variance, the posterior distribution will also be normal-Inv-χ².
The posterior distribution if y ~ N(μ, σ²) and μ, σ² ~ N-Inv-χ²(μ0, σ0²/k0; v0, σ0²)

p(μ, σ² | y) ~ N-Inv-χ²(μn, σn²/kn; vn, σn²)

μn = [k0/(k0 + n)]·μ0 + [n/(k0 + n)]·ȳ      ← weighted average of the prior mean and the data

kn = k0 + n
vn = v0 + n

vn·σn² = v0·σ0² + (n − 1)s² + [k0·n/(k0 + n)]·(ȳ − μ0)²

← weighted sum of the prior variance, the sample variance, and the distance between the sample and prior means
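The update rules above can be collected into a small helper. The function name `nix2_posterior` and the example prior values are hypothetical, but the formulas are the ones on this slide:

```python
import numpy as np

def nix2_posterior(y, mu0, k0, v0, s02):
    """Posterior hyperparameters for the N-Inv-chi2 conjugate prior."""
    n, ybar, s2 = len(y), np.mean(y), np.var(y, ddof=1)
    kn = k0 + n
    vn = v0 + n
    # weighted average of the prior mean and the data
    mun = (k0 / kn) * mu0 + (n / kn) * ybar
    # weighted sum of prior variance, sample variance, and prior-data distance
    sn2 = (v0 * s02 + (n - 1) * s2
           + (k0 * n / kn) * (ybar - mu0) ** 2) / vn
    return mun, kn, vn, sn2

# e.g. a weak hypothetical prior centered at 120, with four observations
mun, kn, vn, sn2 = nix2_posterior([98, 160, 136, 128], mu0=120.0, k0=1, v0=1, s02=100.0)
print(mun, kn, vn, sn2)
```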
If p(,2 |y) ~ N-Inv-2(n, n2/kn; vn, n2) we can
factor the posterior distribution just like the prior…
 |  2 , y ~ N ( n ,  2 / k n )
 k0     n                            
 2 0  2 y                          
 N       ,      1

 k0  n

k0
 2
n                 

   2
 2
 2
                  
 2 | y ~ Inv   2 (vn ,  n 2 )
Further, we can integrate out  2 to find :
 | y ~ t (  n ,  n / k n , vn )
2

These are essentially the same posterior distributions as we found with the
improper priors. To implement the conjugate priors in WinBugs we would
use the same code, but substitute the values n, n2/kn; vn, n2 into the
posterior.
Excel implementation of the normal model with conjugate priors for the mean and variance

To implement a Gibbs sampler in this case, we need to perform the following steps (everything subscripted with an n is a function of stuff that is "known"):

σ²(1) | y ~ Inv-χ²(vn, σn²)
μ(1) | σ²(1), y ~ N(μn, σ²(1)/kn)
…
σ²(t) | y ~ Inv-χ²(vn, σn²)
μ(t) | σ²(t), y ~ N(μn, σ²(t)/kn)

In Excel, for each iteration t:

σ²(t) = 1/GAMMAINV(RAND(), vn/2, 2/(vn·σn²))
μ(t) = NORMINV(RAND(), μn, SQRT(σ²(t)/kn))

where σ²(t) is the spreadsheet cell corresponding to the tth draw. (GAMMAINV takes a shape and a scale, and NORMINV takes a standard deviation rather than a variance, so the inverse-χ² draw uses shape vn/2 and scale 2/(vn·σn²).)
Lazy Modern Parametric Bayesians and the normal model with unknown mean and variance

Suppose that y ~ N(μ, τ), where τ = 1/σ² is the precision.

From here on, when we talk about the normal distribution you should expect that we will speak in terms of the precision τ rather than the variance σ². This is because WinBugs is programmed to use τ rather than σ².

Suppose also that you don't want to think too hard about the prior joint distribution of μ and τ, and assume that: p(μ, τ) = p(μ)p(τ).

What distributions would you choose for p(μ) and p(τ)?
Suppose that y ~ N(μ, τ). What priors would you choose for μ and τ?

I would choose:

μ ~ N(μ0, t)      (where the precision t is a small number, so the prior variance is large)

This is because, if we expect something like the central limit theorem to hold, then the distribution of the sample mean should be approximately normal for large n.

τ ~ Gamma(a, b)      (where a and b are small numbers)

This is because this distribution is bounded below at zero and, unlike the χ² distribution which shares this property, it is not constrained to have an equal mean and variance.
→ Note how we now have to talk about the mean of the distribution of the variance.
```
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, .001)
  tau ~ dgamma(.01, .001)
}

list(N=20, y=c(98,160,136,128,130,114,123,134,128,107,123,125,
               129,132,154,115,126,132,136,130))
```
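WinBugs derives the full conditionals for this model itself; for comparison, here is a hand-rolled sketch of the same Gibbs sampler in Python, assuming the standard semi-conjugate updates for a normal likelihood with a normal prior on μ and a gamma prior on τ:

```python
import numpy as np

rng = np.random.default_rng(4)
y = np.array([98, 160, 136, 128, 130, 114, 123, 134, 128, 107, 123, 125,
              129, 132, 154, 115, 126, 132, 136, 130])
n, ybar = len(y), y.mean()

mu0, t0 = 0.0, 0.001      # prior on mu: N(0, precision .001)
a, b = 0.01, 0.001        # prior on tau: Gamma(.01, .001)

mu, tau = ybar, 1.0       # starting values
mus, taus = [], []
for t in range(30_000):
    # full conditional: mu | tau, y ~ N((t0*mu0 + n*tau*ybar)/(t0 + n*tau), 1/(t0 + n*tau))
    prec = t0 + n * tau
    mu = rng.normal((t0 * mu0 + n * tau * ybar) / prec, 1.0 / np.sqrt(prec))
    # full conditional: tau | mu, y ~ Gamma(a + n/2, rate b + sum((y - mu)^2)/2)
    tau = rng.gamma(a + n / 2, 1.0 / (b + 0.5 * np.sum((y - mu) ** 2)))
    if t >= 2_000:        # discard burn-in
        mus.append(mu)
        taus.append(tau)

print(np.mean(mus), np.mean(taus))  # near the WinBugs values below
```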
WinBugs results for the model where y ~ N(μ, τ) and μ ~ N(0, .001), τ ~ Gamma(.01, .001)

node  mean      sd      MC error  2.5%      median   97.5%     start  sample
mu    126.6     3.3     0.03433   119.9     126.7    133.1     10001  10000
tau   0.005145  0.0017  1.815E-5  0.002365  0.00496  0.008981  10001  10000

[Density plots of the mu sample (10000 draws, spanning roughly 110–140) and the tau sample (10000 draws, spanning roughly 0–0.01).]

Notice that the mean of the posterior mean is less than 128, despite the fact that we are using what appear to be diffuse prior beliefs.
The Subjective Bayesian and the normal model with unknown mean and variance

The subjective Bayesian framework provides little guidance about which prior distribution one should choose. In a sense, that is the point of the subjective approach: it is subjective. You are free to pick whatever prior distribution you want: multi-modal, uniform, high or low variance, skewed, constrained to lie between a certain set of values, etc.

One of the key difficulties is that the prior distributions probably are not independent (i.e. p(θ₁, θ₂) ≠ p(θ₁)p(θ₂)). For example, regression coefficients are generally not independent, even if that isn't transparent in your STATA output. If you want to incorporate subjective beliefs, this non-independence should be taken into account.
Some General Guidelines

I recommend the following general guidelines:

1) If possible, use standard distributions (i.e. conjugate or semi-conjugate) and choose parameters that fix the mean, variance, kurtosis, etc. to be some desirable level.

2) Sample from the prior predictive distribution and check to see if your results make sense. Mechanically, perform the following steps:
   i) take a random draw θ′ from the joint prior distribution of θ;
   ii) take a random draw Y from the pdf of Y | θ with θ = θ′;
   iii) repeat steps i and ii several thousand times to provide a sample that you can use to summarize the prior predictive distribution;
   iv) generate various summaries of the prior predictive distribution and check to see if the model's predictions are consistent with