Bayesian Analysis of the Normal Distribution, Part II

- Set-up of the basic model of a normally distributed random variable with unknown mean and variance (a two-parameter model).
- Discussion of philosophies of prior selection.
- Implementation of different priors, with a discussion of MCMC methods.

The normal model with unknown mean and variance

Let's extend the normal model to the case where the variance parameter is assumed to be unknown. Thus, $y_i \sim N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are both unknown random variables. This is our first example of a multi-parameter model. The Bayesian set-up should still look familiar:

$$p(\mu, \sigma^2 \mid y) \propto p(\mu, \sigma^2)\, p(y \mid \mu, \sigma^2).$$

Note: we would like to make inferences about the marginal distributions $p(\mu \mid y)$ and $p(\sigma^2 \mid y)$ rather than the joint distribution $p(\mu, \sigma^2 \mid y)$. Ultimately, we'd like to find:

$$p(\mu \mid y) = \int p(\mu \mid \sigma^2, y)\, p(\sigma^2 \mid y)\, d\sigma^2.$$

What should we choose for the prior distribution $p(\mu, \sigma^2)$? Different "types" of Bayesians choose different priors:

- Classical Bayesians: the prior is a necessary evil. Choose priors that interject the least information possible.
- Modern parametric Bayesians: the prior is a useful convenience. Choose prior distributions with desirable properties (e.g., conjugacy). Given a distributional choice, prior parameters are chosen to interject the least information possible.
- Subjective Bayesians: the prior is a summary of old beliefs. Choose prior distributions based on previous knowledge, either the results of earlier studies or non-scientific opinion.

The classical Bayesian and the normal model with unknown mean and variance

$y \sim N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are both unknown random variables. What prior distribution would we choose to represent the absence of any knowledge in this instance? What if we assumed that the two parameters were independent, so that $p(\mu, \sigma^2) = p(\mu)\, p(\sigma^2)$?

One option would be to assume uniform prior distributions for both parameters: uniform on $\mu$, and uniform on $\log \sigma^2$ (the standard noninformative choice for a scale parameter, which translates into $1/\sigma^2$ on the variance scale). Thus,

$$p(\mu) \propto c \quad \text{for } -\infty < \mu < \infty, \qquad p(\sigma^2) \propto 1/\sigma^2 \quad \text{for } 0 < \sigma^2 < \infty,$$

and the joint density would be

$$p(\mu, \sigma^2) \propto 1/\sigma^2.$$

Are these distributions proper? No: neither integrates to a finite value, so both priors are improper.

Now suppose $y_i \sim N(\mu, \sigma^2)$ for $i \in \{1, \ldots, n\}$ and $p(\mu, \sigma^2) \propto 1/\sigma^2$. Then

$$p(\mu, \sigma^2 \mid y) \propto p(\mu, \sigma^2)\, p(y \mid \mu, \sigma^2) \propto \left(\sigma^2\right)^{-(n/2+1)} \exp\!\left(-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y}-\mu)^2\right]\right),$$

where $s^2 = \frac{1}{n-1}\sum_i (y_i - \bar{y})^2$.

It can be shown that the conditional posterior distribution is

$$\mu \mid \sigma^2, y \sim N(\bar{y},\, \sigma^2/n).$$

To compute the marginal distribution, we need to calculate

$$p(\mu \mid y) = \int_0^{\infty} p(\mu \mid \sigma^2, y)\, p(\sigma^2 \mid y)\, d\sigma^2.$$

It can be shown that

$$\mu \mid y \sim t_{n-1}(\bar{y},\, s^2/n), \qquad \text{or, more conventionally,} \qquad \frac{\mu - \bar{y}}{s/\sqrt{n}} \,\Big|\, y \sim t_{n-1}.$$

Notice the parallel between these results and what we would find in classical theory.

What is the marginal distribution of $\sigma^2$? Integrating $\mu$ out of the joint posterior,

$$p(\sigma^2 \mid y) \propto \int_{-\infty}^{\infty} \left(\sigma^2\right)^{-(n/2+1)} \exp\!\left(-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y}-\mu)^2\right]\right) d\mu.$$

ICBST (it can be shown that) this is a scaled inverse-$\chi^2$ density:

$$\sigma^2 \mid y \sim \text{Inv-}\chi^2(n-1,\, s^2) = \text{Inv-gamma}\!\left(\frac{n-1}{2},\; \frac{(n-1)s^2}{2}\right).$$

Two different methods to sample from the posterior in this case:

Method 1: Sample directly from each of the two marginal distributions.

Method 2: Two stages: 1) sample a value from the distribution $\sigma^2 \mid y$; 2) sample a value from the distribution $\mu \mid \sigma^2, y$. Repeat these stages lots of times.
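Before turning to WinBUGS, here is a minimal sketch of both methods in Python/NumPy (my illustration, not part of the original slides). It uses the Congdon 2.1 blood-pressure data that appears later in these notes; the variable names are mine.

import numpy as np
from scipy import stats

# Congdon 2.1 systolic blood pressure data (n = 20)
y = np.array([98, 160, 136, 128, 130, 114, 123, 134, 128, 107,
              123, 125, 129, 132, 154, 115, 126, 132, 136, 130])
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)
rng = np.random.default_rng(0)
ndraws = 10_000

# Method 1: sample directly from each marginal posterior.
# sigma^2 | y ~ Inv-gamma((n-1)/2, (n-1)s^2/2); mu | y ~ t_{n-1}(ybar, s^2/n)
sigma2_m1 = stats.invgamma.rvs((n - 1) / 2, scale=(n - 1) * s2 / 2,
                               size=ndraws, random_state=rng)
mu_m1 = stats.t.rvs(n - 1, loc=ybar, scale=np.sqrt(s2 / n),
                    size=ndraws, random_state=rng)

# Method 2: sample sigma^2 | y from its marginal, then mu | sigma^2, y.
sigma2_m2 = stats.invgamma.rvs((n - 1) / 2, scale=(n - 1) * s2 / 2,
                               size=ndraws, random_state=rng)
mu_m2 = rng.normal(ybar, np.sqrt(sigma2_m2 / n))  # mu | sigma^2, y ~ N(ybar, sigma^2/n)

# Both methods target the same joint posterior: E(mu | y) = 128, E(sigma^2 | y) = 216.1
print(mu_m1.mean(), mu_m2.mean(), sigma2_m1.mean(), sigma2_m2.mean())

Method 2 matters because it generalizes: it only requires the conditional distribution of one parameter given the other, which is the idea behind the Gibbs sampler discussed below.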
Example 2.1 from Congdon, but with the assumption of known variance relaxed

$y_i$ denotes the systolic blood pressure of individuals $i = 1, \ldots, 20$. Assume that $y_i \sim N(\mu, \sigma^2)$ and $p(\mu, \sigma^2) \propto 1/\sigma^2$.

From the previous results for the normal distribution with the improper prior $1/\sigma^2$, we found that $\mu \mid y \sim t_{n-1}(\bar{y}, s^2/n)$. By the properties of the $t$ distribution, which we can find in a table,

$$E(\mu \mid y) = \bar{y} = 128 \qquad \text{and} \qquad \text{Var}(\mu \mid y) = \frac{n-1}{n-3}\cdot\frac{s^2}{n} \approx 10.81 \quad (\text{sd} \approx 3.29).$$

Likewise, $\sigma^2 \mid y \sim \text{Scaled Inv-}\chi^2(n-1, s^2)$, so by the properties of the scaled inverse $\chi^2$,

$$E(\sigma^2 \mid y) = \frac{n-1}{n-3}\, s^2 = 216.1.$$

The WinBUGS code below is general code that can be used to model (among other things) Congdon 2.1 for the case with unknown mean and variance. (The data statement used by both models is given after the Method 2 code.)

# Method 1 -- directly sample from both marginal posterior distributions
model {
  for (i in 1:N) {
    ytemp[i] <- (y[i] - mean(y[])) * (y[i] - mean(y[]))
  }
  # GENERATE SAMPLE QUANTITIES
  ybar <- mean(y[])                 # sample mean
  s2 <- sum(ytemp[]) / (N - 1)      # sample variance

  # GENERATE RANDOM DRAWS FROM THE MARGINAL DISTRIBUTION OF SIGMA-SQ
  # If sigma2 ~ inverse-gamma(alpha, beta), then 1/sigma2 ~ gamma(alpha, beta)
  sigma2 <- 1 / tau
  alpha <- (N - 1) / 2
  beta <- (N - 1) * s2 / 2
  tau ~ dgamma(alpha, beta)         # draws of the posterior precision

  # GENERATE RANDOM DRAWS FROM THE MARGINAL DISTRIBUTION OF MU
  df <- N - 1                       # degrees of freedom
  s2prime <- s2 / N                 # variance of the sample mean
  inv_s2 <- 1 / s2prime             # precision of the sample mean
  mu ~ dt(ybar, inv_s2, df)         # draws of the posterior mean
}

How do the analytic results and those from Method 1 compare?

Analytic results: $E(\mu \mid y) = 128$ with sd $\approx 3.29$; $E(\sigma^2 \mid y) = 216.1$.

WinBUGS results:

node    mean   sd     MC error  2.5%   median  97.5%  start  sample
mu      128.0  3.263  0.03698   121.6  128.0   134.6  1000   9001
sigma2  216.9  79.33  0.8072    111.7  201.4   414.1  1000   9001

[Figure: plots of the posterior densities from WinBUGS, Method 1; one panel for sigma2 (sample: 9001) and one for mu (sample: 9001).]

# Method 2 -- first draw from the marginal distribution of the posterior precision,
# then from the conditional distribution of the posterior mean
model {
  for (i in 1:N) {
    ytemp[i] <- (y[i] - mean(y[])) * (y[i] - mean(y[]))
  }
  # GENERATE SAMPLE QUANTITIES
  ybar <- mean(y[])                 # sample mean
  s2 <- sum(ytemp[]) / (N - 1)      # sample variance

  # GENERATE RANDOM DRAWS FROM THE MARGINAL DISTRIBUTION OF SIGMA-SQ
  # If sigma2 ~ inverse-gamma(alpha, beta), then 1/sigma2 ~ gamma(alpha, beta)
  sigma2 <- 1 / tau
  alpha <- (N - 1) / 2
  beta <- (N - 1) * s2 / 2
  tau ~ dgamma(alpha, beta)         # draws of the posterior precision

  # GENERATE RANDOM DRAWS FROM THE CONDITIONAL DISTRIBUTION OF MU
  sigma2prime <- sigma2 / N         # variance of the sample mean
  tauprime <- tau * N               # precision of the sample mean
  mu ~ dnorm(ybar, tauprime)        # draws of the posterior mean
}

This is the same general code as before, except that mu is now drawn from its conditional distribution given the current draw of the precision.
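For reference, here is the data statement loaded with both models (the same list given again with the lazy-prior model later in these notes):

list(N = 20, y = c(98, 160, 136, 128, 130, 114, 123, 134, 128, 107,
                   123, 125, 129, 132, 154, 115, 126, 132, 136, 130))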
How do the analytic results and those from Method 2 compare?

Analytic results: as before,

$$\mu \mid y \sim t_{n-1}(\bar{y}, s^2/n), \qquad E(\mu \mid y) = \bar{y} = 128, \qquad \text{Var}(\mu \mid y) = \frac{n-1}{n-3}\cdot\frac{s^2}{n} = 10.805 \quad (10.805^{1/2} \approx 3.29),$$

and

$$\sigma^2 \mid y \sim \text{Scaled Inv-}\chi^2(n-1, s^2), \qquad E(\sigma^2 \mid y) = \frac{n-1}{n-3}\, s^2 = 216.1.$$

WinBUGS results:

node    mean   sd     MC error  2.5%   median  97.5%  start  sample
mu      128.0  3.229  0.03006   121.7  128.0   134.3  1000   9001
sigma2  216.3  77.46  0.7695    111.7  201.3   408.6  1000   9001

Method 2 is a simple example of Markov Chain Monte Carlo.

Numerical integration

As may be becoming apparent, the greatest practical difficulty in Bayesian inference is computing the integrals. This integration must be done to identify the marginal posterior distribution, which is the object of analysis. In fact, except for "simple" cases like conjugate priors, researchers long lacked the ability to do Bayesian analysis at all. Many Bayesians suggest that the inability to compute these integrals is the only reason why we use the hypothesis testing framework today. Markov Chain Monte Carlo (MCMC), a computationally intensive simulation method that replaces exact integration, was developed in the 1980s and made it possible to tackle more complex, realistic problems.

Basics of MCMC

The goal of MCMC is to use simulation to take a large number of random draws from the posterior distribution. The resulting sample can be used to estimate the posterior mean, median, Bayesian credible interval (i.e., HDR), and so on. In class, we will implement the Gibbs sampler, the most common form of MCMC, with WinBUGS (BUGS = Bayesian inference Using Gibbs Sampling). We will generally duck the details of how the program makes its calculations, but it really isn't that complex in theory.

The intuition behind the Gibbs sampler

Consider a problem with two parameters, $\theta_1$ and $\theta_2$, and let $X$ denote the data. Suppose further that we know the conditional distributions $p(\theta_1 \mid \theta_2, X)$ and $p(\theta_2 \mid \theta_1, X)$. We need to find $p(\theta_1 \mid X)$ and $p(\theta_2 \mid X)$.

The Gibbs sampler proceeds by choosing some initial point $(\theta_1^0, \theta_2^0)$ from the parameter space; this can be any reasonable value of $\theta$. Then we take draws from the two conditional distributions in the following sequence:

$$\theta_1^1 \sim p(\theta_1 \mid \theta_2^0, X)$$
$$\theta_2^1 \sim p(\theta_2 \mid \theta_1^1, X)$$
$$\theta_1^2 \sim p(\theta_1 \mid \theta_2^1, X)$$
$$\theta_2^2 \sim p(\theta_2 \mid \theta_1^2, X)$$
$$\theta_1^3 \sim p(\theta_1 \mid \theta_2^2, X)$$
$$\theta_2^3 \sim p(\theta_2 \mid \theta_1^3, X)$$
etc.

This sequence of draws is a Markov chain because the values at step $t$ only depend on the values at step $t-1$. If allowed to run long enough, the Gibbs sampler will "converge" to the "true" posterior distribution. If allowed to run for sufficiently long after convergence, the Gibbs sampler produces a complete sample from the posterior distribution of $\theta$.

This algorithm is easily generalized to the $n$-parameter case $\theta_1, \theta_2, \ldots, \theta_n$. For the $t$-th iteration of the Gibbs sampler we have:

$$\theta_1^t \sim p(\theta_1 \mid \theta_2^{t-1}, \theta_3^{t-1}, \ldots, \theta_n^{t-1}, X)$$
$$\theta_2^t \sim p(\theta_2 \mid \theta_1^{t}, \theta_3^{t-1}, \ldots, \theta_n^{t-1}, X)$$
$$\vdots$$
$$\theta_n^t \sim p(\theta_n \mid \theta_1^{t}, \theta_2^{t}, \ldots, \theta_{n-1}^{t}, X)$$

The Gibbs sampler is not rocket science, even if the original application was in physics. We already created something very close to a Gibbs sampler when we implemented Method 2 above. In that code, we used WinBUGS's random number generators rather than allowing the program to identify the conditional distributions itself, which will be the usual case; we had to do this because WinBUGS does not allow researchers to use improper priors. When we get to implementing the modern parametric Bayesian's priors, I will show how we can even use Excel to do MCMC.
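To make the algorithm concrete, here is a minimal two-parameter Gibbs sampler in Python (my sketch, not from the slides) for the same normal model with prior $p(\mu, \sigma^2) \propto 1/\sigma^2$. The full conditionals are $\mu \mid \sigma^2, y \sim N(\bar{y}, \sigma^2/n)$ and $\sigma^2 \mid \mu, y \sim \text{Inv-gamma}(n/2,\; \sum_i (y_i - \mu)^2/2)$.

import numpy as np

# Congdon 2.1 blood-pressure data again
y = np.array([98, 160, 136, 128, 130, 114, 123, 134, 128, 107,
              123, 125, 129, 132, 154, 115, 126, 132, 136, 130])
n, ybar = len(y), y.mean()
rng = np.random.default_rng(0)

ndraws, burnin = 10_000, 1_000
mu, sigma2 = ybar, y.var(ddof=1)   # any reasonable starting point
draws = np.empty((ndraws, 2))

for t in range(burnin + ndraws):
    # draw mu from its full conditional given the current sigma^2
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # draw sigma^2 from its full conditional given the new mu;
    # if 1/sigma2 ~ Gamma(shape a, rate b), then sigma2 ~ Inv-gamma(a, b),
    # and NumPy's gamma takes a scale parameter, so scale = 1/b
    b = 0.5 * np.sum((y - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(n / 2, 1.0 / b)
    if t >= burnin:
        draws[t - burnin] = (mu, sigma2)

print(draws.mean(axis=0))  # posterior means: roughly 128 and 216

Unlike Method 2, each draw here conditions on the current value of the other parameter rather than on a marginal, so consecutive draws are dependent; that dependence is exactly what makes the sequence a Markov chain, and it is why a burn-in is discarded before summarizing.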
Modern parametric Bayesians and the normal model with unknown mean and variance

$y \sim N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are both unknown random variables. What prior distribution would a modern parametric Bayesian choose to satisfy the demands of convenience? What if we used the definition of conditional probability, so that $p(\mu, \sigma^2) = p(\mu \mid \sigma^2)\, p(\sigma^2)$?

A modern parametric Bayesian would typically choose a conjugate prior. For the normal model with unknown mean and variance, the conjugate prior for the joint distribution of $\mu$ and $\sigma^2$ is the normal-inverse-gamma distribution (i.e., normal-inverse-$\chi^2$):

$$p(\mu, \sigma^2) \sim \text{N-Inv-}\chi^2(\mu_0,\, \sigma_0^2/\kappa_0;\; \nu_0,\, \sigma_0^2).$$

There are four parameters in the prior. ICBST the above expression can be factored such that $p(\mu, \sigma^2) = p(\mu \mid \sigma^2)\, p(\sigma^2)$, where

$$\mu \mid \sigma^2 \sim N(\mu_0,\, \sigma^2/\kappa_0) \qquad \text{and} \qquad \sigma^2 \sim \text{Inv-}\chi^2(\nu_0,\, \sigma_0^2).$$

Because this is a conjugate distribution for the normal likelihood with unknown mean and variance, the posterior distribution will also be normal-inverse-$\chi^2$.

The posterior distribution: if $y \sim N(\mu, \sigma^2)$ and $\mu, \sigma^2 \sim \text{N-Inv-}\chi^2(\mu_0, \sigma_0^2/\kappa_0;\; \nu_0, \sigma_0^2)$, then

$$p(\mu, \sigma^2 \mid y) \sim \text{N-Inv-}\chi^2(\mu_n,\, \sigma_n^2/\kappa_n;\; \nu_n,\, \sigma_n^2),$$

where

$$\mu_n = \frac{\kappa_0}{\kappa_0 + n}\,\mu_0 + \frac{n}{\kappa_0 + n}\,\bar{y} \qquad \text{(a weighted average of the prior mean and the data)},$$

$$\kappa_n = \kappa_0 + n, \qquad \nu_n = \nu_0 + n,$$

$$\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n-1)s^2 + \frac{\kappa_0\, n}{\kappa_0 + n}\,(\bar{y} - \mu_0)^2$$

(a weighted sum of the prior variance, the sample variance, and the distance between the sample and prior means).

If $p(\mu, \sigma^2 \mid y) \sim \text{N-Inv-}\chi^2(\mu_n, \sigma_n^2/\kappa_n;\; \nu_n, \sigma_n^2)$, we can factor the posterior distribution just like the prior:

$$\mu \mid \sigma^2, y \sim N(\mu_n,\, \sigma^2/\kappa_n) = N\!\left(\frac{\frac{\kappa_0}{\sigma^2}\,\mu_0 + \frac{n}{\sigma^2}\,\bar{y}}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}},\; \frac{1}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}}\right), \qquad \sigma^2 \mid y \sim \text{Inv-}\chi^2(\nu_n,\, \sigma_n^2).$$

Further, we can integrate out $\sigma^2$ to find:

$$\mu \mid y \sim t_{\nu_n}(\mu_n,\, \sigma_n^2/\kappa_n).$$

These are essentially the same posterior distributions as we found with the improper priors. To implement the conjugate priors in WinBUGS, we would use the same code but substitute the values $\mu_n$, $\sigma_n^2/\kappa_n$, $\nu_n$, $\sigma_n^2$ into the posterior.

Excel implementation of the normal model with conjugate priors for the mean and variance

To implement a Gibbs sampler in this case, we need to perform the following steps for each iteration $t$ (everything subscripted with an $n$ is a function of quantities that are "known"):

$$\sigma^{2(t)} \mid y \sim \text{Inv-}\chi^2(\nu_n,\, \sigma_n^2)$$
$$\mu^{(t)} \mid \sigma^{2(t)}, y \sim N(\mu_n,\, \sigma^{2(t)}/\kappa_n)$$

In Excel, noting that $\sigma^2 \sim \text{Inv-}\chi^2(\nu_n, \sigma_n^2)$ means $1/\sigma^2 \sim \text{Gamma}(\nu_n/2,\ \text{rate} = \nu_n\sigma_n^2/2)$, and that GAMMAINV takes a scale (not a rate) parameter while NORMINV takes a standard deviation (not a variance), for each iteration $t$:

σ²(t) = 1/GAMMAINV(RAND(), νn/2, 2/(νn·σn²))
μ(t) = NORMINV(RAND(), μn, SQRT(σ²(t)/κn))

where σ²(t) and μ(t) are the spreadsheet cells corresponding to the t-th draw. (A Python version of this recipe is sketched at the end of this section.)

Lazy modern parametric Bayesians and the normal model with unknown mean and variance

Suppose that $y \sim N(\mu, \tau)$, where $\tau = 1/\sigma^2$ is the precision. From here on, when we talk about the normal distribution, you should expect that we will speak in terms of the precision $\tau$ rather than the variance $\sigma^2$. This is because WinBUGS is programmed to use $\tau$ rather than $\sigma^2$.

Suppose also that you don't want to think too hard about the prior joint distribution of $\mu$ and $\tau$, and assume that $p(\mu, \tau) = p(\mu)\, p(\tau)$. What distributions would you choose for $p(\mu)$ and $p(\tau)$?

I would choose:

$\mu \sim N(\mu_0, t)$, where the variance $t$ is a large number (equivalently, a small precision, such as the .001 in the WinBUGS code below). This is because, if we expect something like the central limit theorem to hold, then the distribution of the sample mean should be approximately normal for large $n$.

$\tau \sim \text{Gamma}(a, b)$, where $a$ and $b$ are small numbers. This is because the gamma distribution is bounded below at zero and, unlike the $\chi^2$ distribution (which shares this property but has only one free parameter, so its mean and variance are tied together), the gamma's two parameters let us set the mean and variance separately.
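As a concrete version of the Excel recipe above, here is a minimal Python sketch (mine, not from the slides). The prior values mu0, kappa0, nu0, sigma0sq are hypothetical placeholders chosen only for illustration.

import numpy as np

y = np.array([98, 160, 136, 128, 130, 114, 123, 134, 128, 107,
              123, 125, 129, 132, 154, 115, 126, 132, 136, 130])
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

# hypothetical prior parameters (for illustration only)
mu0, kappa0, nu0, sigma0sq = 120.0, 1.0, 1.0, 200.0

# posterior parameters of the N-Inv-chi^2 distribution (formulas above)
kappa_n = kappa0 + n
nu_n = nu0 + n
mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
sigma2_n = (nu0 * sigma0sq + (n - 1) * s2
            + kappa0 * n / kappa_n * (ybar - mu0) ** 2) / nu_n

rng = np.random.default_rng(0)
ndraws = 10_000
# sigma^2 | y ~ Inv-chi^2(nu_n, sigma2_n) = Inv-gamma(nu_n/2, nu_n*sigma2_n/2);
# NumPy's gamma takes a scale parameter, so scale = 2/(nu_n*sigma2_n)
sigma2 = 1.0 / rng.gamma(nu_n / 2, 2.0 / (nu_n * sigma2_n), size=ndraws)
# mu | sigma^2, y ~ N(mu_n, sigma^2 / kappa_n)
mu = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
print(mu.mean(), sigma2.mean())

With $\kappa_0$ and $\nu_0$ this small, the posterior parameters are dominated by $\bar{y}$ and $s^2$, so the results sit close to the improper-prior analysis above.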
Note how we now have to talk about the mean of the distribution of the variance: the gamma prior on the precision $\tau$ has mean $a/b$.

model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, .001)
  tau ~ dgamma(.01, .001)
}

list(N = 20, y = c(98, 160, 136, 128, 130, 114, 123, 134, 128, 107,
                   123, 125, 129, 132, 154, 115, 126, 132, 136, 130))

WinBUGS results for the model where $y \sim N(\mu, \tau)$, $\mu \sim N(0, .001)$, and $\tau \sim \text{Gamma}(.01, .001)$:

node  mean      sd      MC error  2.5%      median   97.5%     start  sample
mu    126.6     3.3     0.03433   119.9     126.7    133.1     10001  10000
tau   0.005145  0.0017  1.815E-5  0.002365  0.00496  0.008981  10001  10000

[Figure: posterior density plots from WinBUGS for mu (sample: 10000) and tau (sample: 10000).]

Notice that the mean of the posterior mean is less than 128, despite the fact that we are using what appear to be diffuse prior beliefs. The prior on $\mu$ is centered at 0, and although its precision (.001) looks negligible, it still receives weight $.001/(.001 + n\bar{\tau}) \approx 1\%$ in the posterior's precision-weighted average, which is enough to pull the posterior mean from 128 down to about 126.6.

The subjective Bayesian and the normal model with unknown mean and variance

The subjective Bayesian framework provides little guidance about which prior distribution one should choose. In a sense, that is the point of the subjective approach: it is subjective. You are free to pick whatever prior distribution you want: multi-modal, uniform, high or low variance, skewed, constrained to lie between a certain set of values, etc.

One of the key difficulties is that the prior distributions probably are not independent (i.e., $p(\theta_1, \theta_2) \neq p(\theta_1)\, p(\theta_2)$). For example, regression coefficients are generally not independent, even if that isn't transparent in your STATA output. If you want to incorporate subjective beliefs, this non-independence should be taken into account.

Some general guidelines

I recommend the following general guidelines:

1) If possible, use standard distributions (i.e., conjugate or semi-conjugate) and choose parameters that fix the mean, variance, kurtosis, etc. to some desirable level.

2) Sample from the prior predictive distribution and check to see whether your results make sense. Mechanically, perform the following steps (sketched in code below):

   i) Take a random draw $\theta'$ from the joint prior distribution of $\theta$.
   ii) Take a random draw $Y'$ from the pdf of $Y \mid \theta$ with $\theta = \theta'$.
   iii) Repeat steps i and ii several thousand times to provide a sample that you can use to summarize the prior predictive distribution.
   iv) Generate various summaries of the prior predictive distribution and check whether the model's predictions are consistent with your beliefs about the data-generating process.
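As an illustration (mine, not from the slides), here is a minimal Python sketch of steps i through iv for the "lazy" priors used above. A prior this diffuse produces wildly dispersed predictive draws, which is itself informative about what the prior really says.

import numpy as np

rng = np.random.default_rng(0)
ndraws, n = 5_000, 20
y_pred = np.empty((ndraws, n))

for t in range(ndraws):
    # i) draw (mu', tau') from the joint prior: mu ~ N(0, precision .001),
    #    tau ~ Gamma(.01, rate .001); NumPy's gamma takes a scale = 1/rate
    mu = rng.normal(0.0, np.sqrt(1 / 0.001))
    tau = rng.gamma(0.01, 1 / 0.001)
    tau = max(tau, 1e-300)  # guard: with shape .01 a draw can underflow to 0
    # ii) draw a data set of size n from p(y | mu', tau')
    y_pred[t] = rng.normal(mu, np.sqrt(1 / tau), size=n)

# iii)-iv) summarize the prior predictive distribution and check whether it is
# consistent with your beliefs about, say, systolic blood pressure
print(np.percentile(y_pred, [2.5, 50, 97.5]))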