

Bayesian Analysis of the Normal Distribution, Part II

- Set-up of the basic model of a normally distributed random variable with unknown mean and variance (a two-parameter model).
- Discussion of philosophies of prior selection.
- Implementation of different priors, with a discussion of MCMC methods.
The normal model with unknown mean and variance
Let’s extend the normal model to the case where the variance parameter is also assumed to be unknown. Thus, yi ~ N(μ, σ²), where μ and σ² are both unknown random variables.

This is our first example of a multi-parameter model. The Bayesian set-up should still look familiar:
    p(μ, σ² | y) ∝ p(μ, σ²) p(y | μ, σ²)

Note: we would like to make inferences about the marginal distributions p(μ | y) and p(σ² | y) rather than the joint distribution p(μ, σ² | y). Ultimately, we’d like to find:
    p(μ | y) = ∫ p(μ | σ², y) p(σ² | y) dσ²
What should we choose for the prior distribution p(μ, σ²)?
Different “types” of Bayesians choose different priors
Classical Bayesians: the prior is a necessary evil.
→ Choose priors that interject the least information possible.

Modern Parametric Bayesians: the prior is a useful convenience.
→ Choose prior distributions with desirable properties (e.g., conjugacy). Given a distributional choice, prior parameters are chosen to interject the least information possible.

Subjective Bayesians: the prior is a summary of old beliefs.
→ Choose prior distributions based on previous knowledge—either the results of earlier studies or non-scientific opinion.
The Classical Bayesian and the normal model with unknown mean and variance
y ~ N(μ, σ²) where μ and σ² are both unknown random variables.

What prior distribution would we choose to represent the absence of any knowledge in this instance?

What if we assumed that the two parameters were independent, so p(μ, σ²) = p(μ)p(σ²)?
Classical Bayesians and the normal model with unknown mean and variance
y ~ N(μ, σ²) where μ and σ² are both unknown random variables. What prior would a classical Bayesian use?

If p(μ, σ²) = p(μ)p(σ²), one option would be to assume uniform prior distributions for both parameters (for σ², uniform on the log scale). Thus,
    p(μ) ∝ c          for −∞ < μ < ∞
    p(σ²) ∝ 1/σ²      for 0 < σ² < ∞
And the joint density would be: p(μ, σ²) ∝ 1/σ²

Are these distributions proper? (No—neither integrates to a finite value, so both priors are improper.)
yi ~ N(μ, σ²) for i ∈ {1, …, n} and p(μ, σ²) ∝ 1/σ²

    p(μ, σ² | y) ∝ p(μ, σ²) p(y | μ, σ²)
                 ∝ (σ²)^(−(n/2)−1) exp( −[(n−1)s² + n(ȳ − μ)²] / (2σ²) )
    where s² = [1/(n−1)] Σ (yi − ȳ)²

It can be shown that the conditional posterior distribution is
    μ | σ², y ~ N(ȳ, σ²/n)

To compute the marginal distribution, we need to calculate
    p(μ | y) = ∫₀^∞ p(μ, σ² | y) dσ²

It can be shown that: μ | y ~ t_(n−1)(ȳ, s²/n),
or, more conventionally: (μ − ȳ)/(s/√n) | y ~ t_(n−1).

Notice the parallel between these results and what we would find in classical theory.
yi ~ N(μ, σ²) for i ∈ {1, …, n} and p(μ, σ²) ∝ 1/σ²

From the previous slide, we saw that:
    p(μ, σ² | y) ∝ p(μ, σ²) p(y | μ, σ²)
                 ∝ (σ²)^(−(n/2)−1) exp( −[(n−1)s² + n(ȳ − μ)²] / (2σ²) )
    where s² = [1/(n−1)] Σ (yi − ȳ)²

What is the marginal distribution of σ²?
    p(σ² | y) ∝ ∫ (σ²)^(−(n/2)−1) exp( −[(n−1)s² + n(ȳ − μ)²] / (2σ²) ) dμ

ICBST (it can be shown that) this is a scaled inverse-χ² density:
    σ² | y ~ Inv-χ²(n−1, s²) = Inv-gamma( (n−1)/2, (n−1)s²/2 )
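As a quick numerical cross-check of this identity (a sketch in Python/scipy, not from the slides; the values of nu and s² are illustrative):

# Check that scaled Inv-chi^2(nu, s^2) matches Inv-gamma(nu/2, nu*s^2/2).
import numpy as np
from scipy import stats

nu, s2 = 19, 193.4                 # illustrative: n-1 and s^2 from the example below
x = np.linspace(50, 800, 5)

# inverse-gamma pdf with shape nu/2 and scale nu*s^2/2
pdf_ig = stats.invgamma.pdf(x, a=nu/2, scale=nu*s2/2)
# scaled Inv-chi^2 pdf: if X ~ chi^2(nu), then nu*s^2/X has this law
# (change of variables from the chi^2 density)
pdf_sic = stats.chi2.pdf(nu*s2/x, df=nu) * nu*s2/x**2
print(np.allclose(pdf_ig, pdf_sic))   # True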
Two different methods to sample from the posterior in this case

Method 1:
Sample directly from each of the two marginal distributions.

Method 2:
Two stages:
 1) Sample a value from the distribution σ² | y
 2) Sample a value from the distribution μ | σ², y
Repeat these stages many times.
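Here is a minimal Python (numpy) sketch of the two schemes (not the deck's WinBugs code), using the blood-pressure data from the Congdon example that follows:

# Both sampling methods for the improper-prior normal model.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([98,160,136,128,130,114,123,134,128,107,123,125,129,132,154,
              115,126,132,136,130])
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)
T = 10_000

# Method 1: sample each marginal distribution directly
sigma2_m1 = (n - 1) * s2 / rng.chisquare(n - 1, T)         # Inv-chi^2(n-1, s^2)
mu_m1 = ybar + np.sqrt(s2 / n) * rng.standard_t(n - 1, T)  # t_{n-1}(ybar, s^2/n)

# Method 2: sigma^2 from its marginal, then mu | sigma^2, y ~ N(ybar, sigma^2/n)
sigma2_m2 = (n - 1) * s2 / rng.chisquare(n - 1, T)
mu_m2 = rng.normal(ybar, np.sqrt(sigma2_m2 / n))

for mu in (mu_m1, mu_m2):
    print(round(mu.mean(), 1), round(mu.std(), 2))  # both roughly 128 and 3.3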
Example 2.1 from Congdon, but with the assumption of known variance relaxed

yi denotes the systolic blood pressure of individual i = 1, …, 20.
Assume that: yi ~ N(μ, σ²) and p(μ, σ²) ∝ 1/σ²

From the previous results for the normal distribution with the improper prior 1/σ², we found that:
    μ | y ~ t_(n−1)(ȳ, s²/n)
By the properties of the t-distribution, which we can find in a table:
    E(μ) = ȳ = 128 and Var(μ) = [(n−1)/(n−3)] · s²/n ≈ 10.81, so sd(μ) ≈ 3.27

    σ² | y ~ Scaled-Inv-χ²(n−1, s²)
    E(σ²) = s² (n−1)/(n−3) ≈ 216.1
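These numbers can be sanity-checked in Python from the raw data (the data list appears with the WinBugs model further below):

# Reproducing the slide's analytic numbers from the raw data.
import numpy as np

y = np.array([98,160,136,128,130,114,123,134,128,107,123,125,129,132,154,
              115,126,132,136,130])
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

print(ybar)                            # E(mu) = 128.0
print((n - 1) / (n - 3) * s2 / n)      # Var(mu) ~ 10.81 (sd ~ 3.3)
print(s2 * (n - 1) / (n - 3))          # E(sigma^2) ~ 216.1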
# Method 1--Directly sample from both marginal posterior distributions.
# This is general code that can be used to model (among other things)
# Congdon 2.1 for the case with unknown mean and variance.
model {
 for (i in 1:N)
 {
  ytemp[i] <- (y[i] - mean( y[] ) )*(y[i] - mean( y[] ) )
 }
 # GENERATE SAMPLE QUANTITIES
 # sample mean
 ybar <- mean(y[])
 # sample variance
 s2 <- sum( ytemp[] ) / (N-1)

 # GENERATE RANDOM DRAWS FROM THE MARGINAL DISTRIBUTION OF SIGMA-SQ
 # If sigma2 ~ inverse-gamma(alpha, beta) then 1/sigma2 ~ gamma(alpha, beta)
 sigma2 <- 1/tau
 alpha <- (N-1)/2
 beta <- (N-1)*s2 / 2

 # this generates the draws from the distribution of the posterior precision
 tau ~ dgamma(alpha , beta)

 # GENERATE RANDOM DRAWS FROM THE MARGINAL DISTRIBUTION OF MU
 # specify the parameters of the t-distribution
 df <- N-1            # degrees of freedom
 s2prime <- s2/N      # variance of the sample mean
 inv_s2 <- 1/s2prime  # precision of the sample mean

 # this generates the draws from the distribution of the posterior mean
 mu ~ dt(ybar, inv_s2, df)
}
How do the analytic results and those from method 1 compare?

Analytic Results
    μ | y ~ t_(n−1)(ȳ, s²/n)
By the properties of the t-distribution, which we can find in a table:
    E(μ) = ȳ = 128 and Var(μ) = [(n−1)/(n−3)] · s²/n ≈ 10.81 (sd ≈ 3.27)

    σ² | y ~ Scaled-Inv-χ²(n−1, s²)
By the properties of the scaled inverse-χ²:
    E(σ²) = s² (n−1)/(n−3) ≈ 216.1

WinBugs Results
node     mean   sd     MC error  2.5%   median  97.5%  start  sample
mu       128.0  3.263  0.03698   121.6  128.0   134.6  1000   9001
sigma2   216.9  79.33  0.8072    111.7  201.4   414.1  1000   9001
Plots of posterior densities from WinBugs—Method 1
[Figure: posterior density plots of the sigma2 sample (9001 draws, roughly 0 to 800) and the mu sample (9001 draws, roughly 110 to 140).]
# Method 2--First draw from the marginal distribution of the posterior precision,
# then from the conditional distribution of the posterior mean.
# This is general code that can be used to model (among other things)
# Congdon 2.1 for the case with unknown mean and variance.
model {
for (i in 1:N)
{
  ytemp[i] <- (y[i] - mean( y[] ) )*(y[i] - mean( y[] ) )
}
# GENERATE SAMPLE QUANTITIES
# sample mean
ybar <- mean(y[])
# sample variance
s2 <- sum( ytemp[] ) / (N-1)

# GENERATE RANDOM DRAWS FROM THE MARGINAL DISTRIBUTION OF SIGMA-SQ
# If sigma2 ~ inverse-gamma(alpha, beta) then 1/sigma2 ~ gamma(alpha, beta)
sigma2 <- 1/tau
alpha <- (N-1)/2
beta <- (N-1)*s2 / 2

# this generates the draws from the distribution of the posterior precision
tau ~ dgamma(alpha , beta)

# GENERATE RANDOM DRAWS FROM THE CONDITIONAL DISTRIBUTION OF MU GIVEN SIGMA-SQ
# precision of the sample mean: 1/(sigma2/N) = tau*N
tauprime <- tau*N

# this generates the draws from the conditional distribution of the posterior mean
mu ~ dnorm(ybar, tauprime)
}
How do the analytic results and those from method 2 compare?

Analytic Results
    μ | y ~ t_(n−1)(ȳ, s²/n)
By the properties of the t-distribution, which we can find in a table:
    E(μ) = ȳ = 128 and Var(μ) = [(n−1)/(n−3)] · s²/n = 10.805 (10.805^(1/2) ≈ 3.27)

    σ² | y ~ Scaled-Inv-χ²(n−1, s²)
By the properties of the scaled inverse-χ²:
    E(σ²) = s² (n−1)/(n−3) = 216.1

WinBugs Results
node     mean   sd     MC error  2.5%   median  97.5%  start  sample
mu       128.0  3.229  0.03006   121.7  128.0   134.3  1000   9001
sigma2   216.3  77.46  0.7695    111.7  201.3   408.6  1000   9001
Method Two is a simple example of Markov Chain Monte Carlo numerical integration

As may be becoming apparent, the greatest practical difficulty in Bayesian inference is computing the integrals. In fact, except for “simple” cases like conjugate priors, researchers long lacked the ability to do Bayesian analysis at all. This integration must be done to identify the marginal posterior distribution, which is the object of analysis. Many Bayesians suggest that the inability to compute these integrals is the only reason why we use the hypothesis testing framework today.

Markov Chain Monte Carlo (MCMC), a computationally intensive simulation method developed in the 1980s to replace exact integration, made it possible to tackle more complex, realistic problems.
Basics of MCMC

The goal of MCMC is to use simulation to take a large number of random draws from the posterior distribution. The resulting sample can be used to estimate the posterior mean, median, Bayesian credible interval (i.e., HDR), etc.
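For instance, a minimal sketch in Python, where 'draws' is a hypothetical stand-in for MCMC output:

# Posterior summaries are just sample statistics of the draws.
import numpy as np

draws = np.random.default_rng(1).normal(128, 3.3, 10_000)  # stand-in for MCMC output
print(draws.mean(), np.median(draws))     # posterior mean and median
print(np.percentile(draws, [2.5, 97.5]))  # central 95% credible interval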

In class, we will implement the Gibbs sampler, the most common form of MCMC, with WinBugs.
    Bugs = Bayesian inference Using Gibbs Sampling

We will generally duck the details of how the program makes its calculations, but it really isn’t that complex in theory.
The intuition behind the Gibbs Sampler

Consider a problem with two parameters θ1 and θ2, and let X denote the data.

Suppose further that we know the conditional distributions: p(θ1 | θ2, X) and p(θ2 | θ1, X)

We need to find: p(θ1 | X) and p(θ2 | X)

The Gibbs Sampler proceeds by choosing some initial point, which we will denote (θ1^(0), θ2^(0)), from the parameter space. This can be any reasonable value of θ.
Then, we take draws from the two conditional distributions in the following sequence:
    θ1^(1) ~ p(θ1 | θ2^(0), X)
    θ2^(1) ~ p(θ2 | θ1^(1), X)
    θ1^(2) ~ p(θ1 | θ2^(1), X)
    θ2^(2) ~ p(θ2 | θ1^(2), X)
    θ1^(3) ~ p(θ1 | θ2^(2), X)
    θ2^(3) ~ p(θ2 | θ1^(3), X)
    etc.
This sequence of draws is a Markov chain because the values at step t depend only on the values at step t−1. If allowed to run long enough, the Gibbs sampler will “converge” to the “true” posterior distribution. If allowed to run for sufficiently long after convergence, the Gibbs sampler produces a complete sample from the distribution of θ.
This algorithm is easily generalized to the n-parameter case θ1, θ2, …, θn.

For the t-th iteration of the Gibbs Sampler we have:
    θ1^(t) ~ p(θ1 | θ2^(t−1), θ3^(t−1), …, θn^(t−1), X)
    θ2^(t) ~ p(θ2 | θ1^(t), θ3^(t−1), …, θn^(t−1), X)
    …
    θn^(t) ~ p(θn | θ1^(t), θ2^(t), …, θ(n−1)^(t), X)
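To make this concrete, here is a minimal Python sketch of a two-parameter Gibbs sampler for the normal model, assuming the semi-conjugate priors mu ~ N(m0, v0) and tau ~ Gamma(a, b) that mirror the “lazy” WinBugs model later in the deck; the hyperparameter names and starting values are mine:

# Alternate draws from the full conditionals mu | tau, y and tau | mu, y.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([98,160,136,128,130,114,123,134,128,107,123,125,129,132,154,
              115,126,132,136,130])
n, ybar = len(y), y.mean()
m0, v0, a, b = 0.0, 1000.0, 0.01, 0.001   # diffuse hyperparameters (illustrative)
T = 10_000

mu, tau = ybar, 1.0 / y.var(ddof=1)       # any reasonable starting point
mu_draws = np.empty(T)
for t in range(T):
    # mu | tau, y ~ N(precision-weighted mean, 1 / (1/v0 + n*tau))
    prec = 1.0 / v0 + n * tau
    mu = rng.normal((m0 / v0 + n * tau * ybar) / prec, np.sqrt(1.0 / prec))
    # tau | mu, y ~ Gamma(a + n/2, rate = b + sum((y - mu)^2)/2)
    tau = rng.gamma(a + n / 2, 1.0 / (b + 0.5 * np.sum((y - mu) ** 2)))
    mu_draws[t] = mu
print(mu_draws[1000:].mean())             # ~126.6, cf. the WinBugs run below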
The Gibbs Sampler is not rocket science, even if the original application was in physics

We have already created a Gibbs Sampler when we implemented Method 2 above.
In that code, we used WinBugs’ random number generators directly rather than allowing the program to identify the conditional distributions, as will be the usual case. We had to do this because WinBugs does not allow researchers to use improper priors.
When we get to implementing the modern parametric Bayesian’s priors, I will show how we can even use Excel to do MCMC.
Modern Parametric Bayesians and the normal model with unknown mean and variance

y ~ N(μ, σ²) where μ and σ² are both unknown random variables.

What prior distribution would a modern parametric Bayesian choose to satisfy the demands of convenience?

What if we used the definition of conditional probability, so p(μ, σ²) = p(μ | σ²) p(σ²)?
Modern Parametric Bayesians and the normal model with unknown mean and variance

y ~ N(μ, σ²) where μ and σ² are both unknown random variables.

A modern parametric Bayesian would typically choose a conjugate prior.

For the normal model with unknown mean and variance, the conjugate prior for the joint distribution of μ and σ² is the normal-inverse-gamma distribution (i.e., normal-inverse-χ²):
    p(μ, σ²) ~ N-Inv-χ²(μ0, σ0²/k0; v0, σ0²)

Four parameters in the prior
Suppose p(μ, σ²) ~ N-Inv-χ²(μ0, σ0²/k0; v0, σ0²)

ICBST the above expression can be factored such that:
    p(μ, σ²) = p(μ | σ²) p(σ²)
    where μ | σ² ~ N(μ0, σ²/k0) and σ² ~ Inv-χ²(v0, σ0²)

Because this is a conjugate distribution for the normal distribution with unknown mean and variance, the posterior distribution will also be normal-Inv-χ².
The posterior distribution if y ~ N(μ, σ²) and μ, σ² ~ N-Inv-χ²(μ0, σ0²/k0; v0, σ0²)

    p(μ, σ² | y) ~ N-Inv-χ²(μn, σn²/kn; vn, σn²)

    μn = [k0/(k0 + n)] μ0 + [n/(k0 + n)] ȳ
         (a weighted average of the prior mean and the data)
    kn = k0 + n
    vn = v0 + n
    vn σn² = v0 σ0² + (n − 1)s² + [k0 n/(k0 + n)] (ȳ − μ0)²
         (a weighted sum of the prior variance, the sample variance, and the distance between the sample and prior means)
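These updating formulas translate directly into code; a minimal Python sketch (the function and variable names are mine, not from the slides):

# Posterior hyperparameters for the N-Inv-chi^2 conjugate prior.
import numpy as np

def update_hyperparameters(y, mu0, k0, v0, sigma02):
    n, ybar, s2 = len(y), np.mean(y), np.var(y, ddof=1)
    kn = k0 + n
    vn = v0 + n
    mun = (k0 * mu0 + n * ybar) / kn             # weighted average of prior mean and data
    sigman2 = (v0 * sigma02 + (n - 1) * s2       # weighted sum of prior variance,
               + k0 * n / kn * (ybar - mu0)**2   # sample variance, and prior-data distance
               ) / vn
    return mun, kn, vn, sigman2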
If p(μ, σ² | y) ~ N-Inv-χ²(μn, σn²/kn; vn, σn²), we can factor the posterior distribution just like the prior…

    μ | σ², y ~ N(μn, σ²/kn)
              = N( [(k0/σ²) μ0 + (n/σ²) ȳ] / (k0/σ² + n/σ²), [k0/σ² + n/σ²]^(−1) )

    σ² | y ~ Inv-χ²(vn, σn²)

Further, we can integrate out σ² to find:
    μ | y ~ t_(vn)(μn, σn²/kn)

These are essentially the same posterior distributions as we found with the improper priors. To implement the conjugate priors in WinBugs we would use the same code, but substitute the values μn, σn²/kn, vn, σn² into the posterior.
Excel Implementation of the Normal model with conjugate priors for the mean and variance

To implement a Gibbs Sampler in this case, we need to perform the following steps:
    σ²(1) | y ~ Inv-χ²(vn, σn²)
    μ(1) | σ², y ~ N(μn, σ²(1)/kn)
    …
    σ²(t) | y ~ Inv-χ²(vn, σn²)
    μ(t) | σ², y ~ N(μn, σ²(t)/kn)
Everything subscripted with an n is a function of stuff that is “known”.

In Excel, for each iteration t:
    σ²(t) = 1/GAMMAINV(RAND(), vn/2, 2/(vn*σn²))
        (since 1/σ² ~ gamma with shape vn/2 and scale 2/(vn σn²))
    μ(t) = NORMINV(RAND(), μn, SQRT(σ²(t)/kn))
        (NORMINV takes a standard deviation, not a variance)
where σ²(t) denotes the spreadsheet cell corresponding to the t-th draw.
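For comparison, the same two draws in Python (numpy); the hyperparameter values below are illustrative placeholders, not results from the example:

# One Gibbs iteration for the conjugate model: sigma^2 from its scaled
# Inv-chi^2 marginal, then mu from its conditional normal.
import numpy as np

rng = np.random.default_rng(0)
mun, kn, vn, sigman2 = 127.6, 21.0, 21.0, 210.0   # illustrative values only

sigma2_t = vn * sigman2 / rng.chisquare(vn)       # sigma^2(t) ~ Inv-chi^2(vn, sigman2)
mu_t = rng.normal(mun, np.sqrt(sigma2_t / kn))    # mu(t) ~ N(mun, sigma^2(t)/kn)
print(sigma2_t, mu_t)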
Lazy Modern Parametric Bayesians and the normal model with unknown mean and variance

Suppose that y ~ N(μ, τ), where τ is the precision.

From here on, when we talk about the normal distribution you should expect that we will speak in terms of the precision τ rather than the variance σ². This is because WinBugs is programmed to use τ rather than σ².

Suppose also that you don’t want to think too hard about the prior joint distribution of μ and τ, and assume that:
    p(μ, τ) = p(μ)p(τ)

What distributions would you choose for p(μ) and p(τ)?

Suppose that y ~ N(μ, τ). What priors would you choose for μ and τ?
I would choose:
    μ ~ N(μ0, t)      (where the variance t is a large number, i.e., the precision is small)
This is because, if we expect something like the central limit theorem to hold, then the distribution of the sample mean should be approximately normal for large n.
    τ ~ Gamma(a, b)   (where a and b are small numbers)
This is because the gamma distribution is bounded below at zero and, unlike the χ² distribution (which shares this property), its variance is not tied to its mean. Note how we now have to talk about the mean of the distribution of the variance.
model {
 for (i in 1:N)
  { y[i] ~ dnorm( mu, tau) }
 mu ~ dnorm(0, .001)
 tau ~ dgamma(.01 , .001)
}
list(N=20, y=c(98,160,136,128,130,114,123,134,128,107,123,125,
               129,132,154,115,126,132,136,130))
WinBugs results for the model where y ~ N(μ, τ) and μ ~ N(0, .001), τ ~ Gamma(.01, .001)

node   mean      sd      MC error  2.5%      median   97.5%     start  sample
mu     126.6     3.3     0.03433   119.9     126.7    133.1     10001  10000
tau    0.005145  0.0017  1.815E-5  0.002365  0.00496  0.008981  10001  10000

[Figure: posterior density plots of the mu sample (10000 draws, roughly 110 to 140) and the tau sample (10000 draws, roughly 0 to 0.01).]

Notice that the mean of the posterior mean is less than 128, despite the fact that we are using what appear to be diffuse prior beliefs.
The Subjective Bayesian and the normal model with unknown mean and variance

The subjective Bayesian framework provides little guidance about which prior distribution one should choose.

In a sense, that is the point of the subjective approach—it is subjective.

You are free to pick whatever prior distribution you want: multi-modal, uniform, high or low variance, skewed, constrained to lie between a certain set of values, etc.

One of the key difficulties is that the prior distributions probably are not independent (i.e., p(θ1, θ2) ≠ p(θ1)p(θ2)). For example, regression coefficients are generally not independent, even if that isn’t transparent in your STATA output. If you want to incorporate subjective beliefs, this non-independence should be taken into account.
Some General Guidelines

I recommend the following general guidelines:
1) If possible, use standard distributions (i.e., conjugate or semi-conjugate) and choose parameters that fix the mean, variance, kurtosis, etc. at some desirable level.

2) Sample from the prior predictive distribution and check to see if your results make sense. Mechanically, perform the following steps (see the sketch after this list):
   i) take a random draw θ′ from the joint prior distribution of θ;
   ii) take a random draw y′ from the pdf p(y | θ) with θ = θ′;
   iii) repeat steps i and ii several thousand times to provide a sample that you can use to summarize the prior predictive distribution;
   iv) generate various summaries of the prior predictive distribution and check to see if the model’s predictions are consistent with your beliefs about the data-generating process.
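A minimal Python sketch of steps i–iv for the conjugate N-Inv-χ² prior (the prior settings below are illustrative only, not a recommendation):

# Simulate the prior predictive distribution and summarize it.
import numpy as np

rng = np.random.default_rng(0)
mu0, k0, v0, sigma02 = 120.0, 1.0, 5.0, 200.0   # illustrative prior settings
T = 10_000

sigma2 = v0 * sigma02 / rng.chisquare(v0, T)    # step i: draw theta' from the prior:
mu = rng.normal(mu0, np.sqrt(sigma2 / k0))      #   sigma^2 first, then mu | sigma^2
y_rep = rng.normal(mu, np.sqrt(sigma2))         # step ii: y' ~ p(y | theta')
# steps i-ii are vectorized over T draws, which is already step iii
print(np.percentile(y_rep, [2.5, 50, 97.5]))    # step iv: summaries to check beliefs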
