
Approximate Bayesian Computation

Simon Tavaré, USC
IMA Lecture #5, September 18, 2003

Outline
• Four examples on different time scales
• Stochastic computation
• Simulating posterior distributions
• MCMC without likelihood ratios
• Open problems

Ex. 1. The Origin of Primates
Primate phylogeny

Ex. 2. Human Prehistory: mitochondrial Eve
• Mitochondrial DNA: 'Out of Africa'
  – Amerindians (Ryk Ward and collaborators)
  – 63 HVR I sequences, 360 bp long
• Microsatellites
  – Rosenberg et al. (2002), Science
  – 377 autosomal microsatellites, 1056 individuals, 52 populations

Ex. 3. Genome-wide patterns of SNP variation
• Nucleotide diversity across the genome
• Looking for signs of selection
• LD across the genome
• LD mapping

Ex. 4. Cancer genomics
Estimating the age of a tumor

Statistical aspects
• Highly dependent data
• Dependence caused by the tree or graph linking observations
• Explicit theory hard to come by . . .
• . . . so computational inference methods are essential

Stochastic computation in evolutionary genetics
Many different approaches have been employed:
• Rejection
• Importance sampling
• Sequential importance sampling with resampling
• Markov chain Monte Carlo
• Metropolis-coupled MCMC
• Perfect sampling: coupling from the past

Introduction to Bayesian computation
• Data D, prior π(·)
• Want to generate observations from the posterior distribution f(θ | D)
• Bayes' Theorem gives f(θ | D) ∝ P(D | θ) π(θ)

Rejection methods I
1. Generate θ from π(·)
2. Accept θ with probability P(D | θ); return to 1

Accepted observations have distribution f(θ | D)

Complex stochastic models
• A stochastic process often underlies the likelihood computation
• This process may be extremely complex, making explicit probability calculations difficult or impossible
• Thus P(D | θ) may be uncomputable (either quickly or at all)

Rejection methods II
The process is often easy to simulate, so:
1. Generate θ from π(·)
2. Simulate D′ from the stochastic model with parameter θ
3. Accept θ if D′ = D; return to 1

Do we ever hit the target?

Approximate Bayesian Computation I
1. Generate θ from π(·)
2. Simulate D′ from the stochastic model with parameter θ
3. Calculate the distance ρ(D, D′) between D and D′
4. Accept θ if ρ ≤ ε; return to 1

• If ε → ∞, this generates from the prior
• If ε = 0, it generates from f(θ | D)
• The choice of ε reflects the tension between computability and accuracy
• The method is honest: you get observations from f(θ | ρ(D, D′) ≤ ε)

Approximate Bayesian Computation II
• The limit ε = 0 reproduces the data precisely
• In many examples the data are too high-dimensional
• . . . so reduce dimension by using summary statistics

Motivation: sufficiency
If S is sufficient for θ, then f(θ | D) = f(θ | S)
The inference method can then be considerably simplified, e.g. the rejection method via f(θ | S) ∝ P(S | θ) π(θ)
This puts a premium on finding decent summary statistics
• A systematic, implementable approach? (Le Cam (1963))
• Estimate the distance between f(θ | D) and f(θ | S), given a measure of how far from sufficient S is for θ

Combine summaries and rejection
Tavaré et al. (1997), Fu and Li (1997), Weiss and von Haeseler (1998), Pritchard et al. (1999)
Choose statistics S = (S1, . . . , Sp) to summarize D:
1. Generate θ from π(·)
2. Simulate D′, calculate s′
3. Accept θ if ρ(s′, s) ≤ ε; return to 1

An example from population genetics
The coalescent is a useful tool for interpreting patterns of molecular variation.
Ingredients:
• A biological model for the relationships among n sampled sequences (tree or graph; Λ, T)
• Parameters of interest (mutation, migration, recombination, . . . rates θ)
• Aim: generate observations from f(θ | D) or from f(θ, Λ, T | D)
Probability model: a branching Markov process

Ward's mtDNA data
Yakima: n = 42, 31 SNPs
Interested (for this talk) in inference about
• the scaled mutation parameter θ
• the TMRCA T of the sample (= height of the tree)
• T has prior mean 1.95

Inference in population genetics
• In the classical frequentist setting, most estimators are inconsistent
• Even in simple cases where we do have consistency, rates of convergence are of order log n
• Inference about T (an unobserved random variable) makes the Bayesian approach more sensible

Results
Acceptance rate: 0.6%

TMRCA T:
  1st quartile   1.07
  mean           1.71
  median         1.49
  3rd quartile   2.11

Advantages and disadvantages of ABC
Pros:
• Usually easy to code
• Generates independent observations (can use embarrassingly parallel computation)
• Can be used to estimate Bayes factors directly
Cons:
• For complex probability models, sampling from the prior does not make good use of accepted observations
• Q: is there a scheme that combines generation of perfect observations with use of the existing sample?

ABC III: generalization of the rejection method
Beaumont, Zhang and Balding (2002), Genetics
A classic in the frequentist world that uses implicit probability models to estimate likelihoods is Diggle and Gratton (1984), JRSSB

Reminder: standard MCMC
1. Now at θ
2. Propose a move to θ′ according to q(θ → θ′)
3. Calculate
     h = min{ 1, [P(D | θ′) π(θ′) q(θ′ → θ)] / [P(D | θ) π(θ) q(θ → θ′)] }
4. Accept θ′ with probability h, else return θ

MCMC in the evolutionary genetics setting
• Small tweaks in the biology often translate into huge changes in the algorithm
• Long development time
• All the usual problems with convergence
• Almost all the effort goes into evaluation of the likelihood

A bad sampler

(Yet) another MCMC approach
1. Now at θ
2. Propose a move to θ′ according to q(θ → θ′)
3. Generate D′ using θ′
4. If D′ = D, go to the next step, else return θ
5. Calculate
     h = h(θ, θ′) = min{ 1, [π(θ′) q(θ′ → θ)] / [π(θ) q(θ → θ′)] }
6. Accept θ′ with probability h, else return θ

Proposition
The stationary distribution of the chain is indeed f(θ | D)

Proof
Denote the transition mechanism of the chain by r(θ → θ′), and choose θ ≠ θ′ satisfying

  [π(θ′) q(θ′ → θ)] / [π(θ) q(θ → θ′)] ≤ 1,

so that h(θ, θ′) equals this ratio and h(θ′, θ) = 1. Then

  f(θ | D) r(θ → θ′)
    = f(θ | D) q(θ → θ′) P(D | θ′) h(θ, θ′)
    = [P(D | θ) π(θ) / P(D)] q(θ → θ′) P(D | θ′) · [π(θ′) q(θ′ → θ)] / [π(θ) q(θ → θ′)]
    = [P(D | θ′) π(θ′) / P(D)] {q(θ′ → θ) P(D | θ)}
    = f(θ′ | D) q(θ′ → θ) P(D | θ) h(θ′, θ)
    = f(θ′ | D) r(θ′ → θ)

The chain is reversible with respect to f(θ | D), which is therefore its stationary distribution.

Practical version: ABC
Data D, summary statistics S. Replace step 4 above by

  4′. If ρ(D′, D) ≤ ε, go to the next step, otherwise return θ

or

  4″. If ρ(S′, S) ≤ ε, go to the next step, otherwise return θ

for some suitable metric ρ and approximation level ε. Observations now come from f(θ | ρ(D′, D) ≤ ε) or f(θ | ρ(S′, S) ≤ ε).

Variations on ABC
• What if D is not discrete?
  – Use the previous method (binning)
  – Use a simulation approach to estimate the likelihood terms in the Hastings ratio
• If the underlying probability model is complex, simulating data will not often lead to acceptance, so we need updates for parts of the probability model
• These methods can often be started at stationarity, so no burn-in

1. Now at (θ, T)
2. Propose a move to (θ′, T′) according to q((θ, T) → (θ′, T′))
3. Simulate D′ using (θ′, T′)
4. If D′ = D, go to the next step, else return (θ, T) (or use ABC)
5. Calculate
     h = min{ 1, [π(θ′, T′) q((θ′, T′) → (θ, T))] / [π(θ, T) q((θ, T) → (θ′, T′))] }
6. Accept (θ′, T′) with probability h, else return (θ, T).
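The likelihood-free MCMC scheme above is easy to try on a toy model. The sketch below is my own illustration, not an example from the lecture: the data D are the number of successes in n Bernoulli(θ) trials, the prior on θ is uniform on [0, 1], and the proposal is a symmetric random walk, so the Hastings ratio h reduces to the (flat) prior ratio and the only real work is the exact-match check D′ = D of step 4.

```python
import random

def simulate(theta, n, rng):
    """Simulate D' from the model: number of successes in n Bernoulli(theta) trials."""
    return sum(rng.random() < theta for _ in range(n))

def likelihood_free_mcmc(D, n, iters, step=0.1, seed=1):
    """MCMC without likelihoods: propose, simulate D', accept only if D' == D."""
    rng = random.Random(seed)
    theta = rng.random()  # start from the uniform prior
    samples = []
    for _ in range(iters):
        theta_prop = theta + rng.uniform(-step, step)  # symmetric proposal q
        # Flat prior and symmetric q give h = 1 inside [0, 1] and 0 outside,
        # so the move is accepted exactly when theta_prop is valid and D' == D.
        if 0.0 <= theta_prop <= 1.0 and simulate(theta_prop, n, rng) == D:
            theta = theta_prop
        samples.append(theta)  # on rejection, return (stay at) theta, as in step 6
    return samples

samples = likelihood_free_mcmc(D=7, n=10, iters=20000)
est = sum(samples[2000:]) / len(samples[2000:])
print(est)  # should be close to 8/12 ≈ 0.667, the exact Beta(8, 4) posterior mean
```

Because the match D′ = D is exact, the stationary distribution here is the exact posterior; an ε-ball acceptance rule would give the ABC variant instead.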
Mitochondrial example revisited
H = number of haplotypes, V = number of SNPs
• Yakima: n = 42, V = 31, H = 20
• Nuu Chah Nulth: n = 63, V = 26, H = 28
Interested in inference about
• the scaled mutation parameter θ
• the TMRCA T of the sample (= height of the tree)
• T has prior mean 1.95

From MCMC methods:
  Yakima           E(T | D) = 0.72
  Nuu Chah Nulth   E(T | D) = 0.68

Yakima, S = V, ε = 0
                    Rejection   Estimated likelihood   No likelihood
  Acceptance rate   0.6%        65%                    3.9%
  TMRCA T
    1st quartile    1.07        1.05                   1.09
    mean            1.71        1.70                   1.72
    median          1.49        1.45                   1.49
    3rd quartile    2.11        2.07                   2.15

Yakima, S = (V, H), ε = 0
                    Rejection   No likelihood
  Acceptance rate   0.03%       0.46%
  TMRCA T
    1st quartile    0.73        0.74
    mean            1.01        1.03
    median          0.93        0.94
    3rd quartile    1.22        1.24

Nuu Chah Nulth, S = (V, H), ε = 2
                    Rejection   Estimated likelihood   No likelihood
  Acceptance rate   0.0008%     16.9%                  0.2%
  TMRCA T
    1st quartile    0.51        0.50                   0.54
    mean            0.69        0.67                   0.70
    median          0.64        0.63                   0.66
    3rd quartile    0.81        0.80                   0.81

The effects of ε: Nuu Chah Nulth, S = (V, H)
                    ε = 2    ε = 1    ε = 0
  Acceptance rate   0.2%     0.04%    0.005%
  TMRCA T
    1st quartile    0.54     0.49     0.46
    mean            0.70     0.64     0.59
    median          0.66     0.60     0.55
    3rd quartile    0.81     0.74     0.69

Summary
In the population genetics setting, there are many existing MCMC algorithms that explore the space of (θ, Λ, T). These can be used immediately to address harder problems, by replacing 'peeling' with simulation:
• Complex mutation models (e.g. dependent sites in a sequence)
• RFLPs: population expansion, multiple cutters, mutation model
• Ascertainment (e.g. SNP panels)

Current work
• Testing on complex problems (e.g. subdivision and multiple loci, population history)
• Mapping disease genes using the ancestral recombination graph
• Exploring the connection with other auxiliary variable methods
• Exploring projection pursuit for combining summary statistics

Acknowledgements
• (MCMC, statistical genetics) Duncan Thomas, John Molitor
• (MCMC) Paul Marjoram, Vincent Plagnol, Lada Markovtsova
• (Population genetics) Magnus Nordborg, Noah Rosenberg
• (Projection pursuit) Peter Calabrese

Paper: P. Marjoram, J. Molitor, V. Plagnol, S. Tavaré (2003). Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci., in press.
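The summary-statistic rejection scheme discussed in the talk (generate θ from the prior, simulate, accept when ρ(s′, s) ≤ ε) can also be sketched in a few lines. The example below is hypothetical and much simpler than coalescent data: the observations are n Poisson(θ) counts with an Exponential(1) prior on θ. The sum of the counts is sufficient for θ, so with ρ = |s′ − s| and ε = 0 the accepted draws come from the exact posterior, which by conjugacy is Gamma(1 + s, 1 + n).

```python
import math
import random

def rpois(mu, rng):
    """Poisson sampler via Knuth's product-of-uniforms method (fine for small mu)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def abc_rejection(s_obs, n, draws, eps=0, seed=2):
    """ABC rejection: keep theta whenever the simulated summary is within eps."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        theta = -math.log(1.0 - rng.random())              # 1. theta ~ Exponential(1) prior
        s_sim = sum(rpois(theta, rng) for _ in range(n))   # 2. simulate D', compute s'
        if abs(s_sim - s_obs) <= eps:                      # 3. accept if rho(s', s) <= eps
            accepted.append(theta)
    return accepted

post = abc_rejection(s_obs=12, n=5, draws=50000)
mean = sum(post) / len(post)
print(len(post), mean)  # mean should be near (1 + 12) / (1 + 5) ≈ 2.17
```

Raising ε trades accuracy for acceptance rate, exactly the tension described in the slides; with a summary statistic that is not sufficient, even ε = 0 gives f(θ | S), not f(θ | D).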
