VIEWS: 0 PAGES: 62 POSTED ON: 9/14/2012 Public Domain
1 Dirichlet process mixtures are active research areas Dirichlet mixtures are it! The flexibility of DPM models supported its huge popularity in wide variety of areas of application. DPM models are general and can be argued to have less structure. Double Dirichlet Process Mixtures add a degree of structure, possibly at the expense of some degree of flexibility, but possibly with better interpretability in some cases We discuss applications (and limitations) of these semiparametric double mixtures We compare fit-prediction duality with competing models 2 Other DP extensions Double Dirichlet process mixtures are a subclass of dependent Dirichlet Process mixtures (MacEachern 1999,……) Double DP mixture are different from Hierarchical Dirichlet Processes (The et al. 2006 ) Double DPM is simply independent DPMS 3 Motivating Example 1 Luminex measurements on two biomarker proteins from n=156 Patients IL-1β protein C-reactive protein The biological effects of these two proteins are thought to be not (totally) overlapping. y1i 1i ~ N 2 , i , i 1,...,n y 2i 2i 4 Two Biomarkers (y1 and y2) y1i 1i ~ N 2 , i , i 1,...,n y 2i 2i Usual DP mixture of normals (Ferguson 1983,…..) ( i , i ) i.i.d. ~ G, i 1,...,n G ~ DP( , G0 ) G0 ~ N 2 ( 0 , 0 ) ( ~ Inv.Wishart) Questions Should we model the two biomarkers jointly? Should we cluster the patients based on both biomarkers jointly? The biomarkers may operate somewhat independently. 5 Double DP mixtures y1i 1i 12i 1i 2i ~ N 2 , i y , i 1,...,n 2i 2i 2i 2 ( 1i , 12i ) i.i.d. ~ G1 , i 1,..., n G1 ~ DP (1 , G01 ) G01 1 ~ N ( 01 , 01 ) ( 12 ~ Inv. Gamma ) 2 ( 2i , 2i ) i.i.d. ~ G2 , i 1,..., n 2 G2 ~ DP ( 2 , G0 2 ) G02 2 ~ N ( 02 , 02 ) ( 2 ~ Inv. Gamma ) 2 2 Equicorrelation – corr(y1i, y2i) are assumed to be the same for all i=1,…,n Clustering based on biomarker 1 and based on biomarker 2 can be different 6 Motivating Example 2: Interrater Agreement Agreement between 2 Raters (Melia and Diener-West 1994) Each rater provides an ordinal rating on a scale of 1-5 (lowest to highest invasion)of the extent to which tumor has invaded the eye,n=885 Rater 1 1 2 3 4 5 R 1 291 74 1 1 1 a 2 186 256 7 7 3 t 3 2 4 0 2 0 e 4 3 10 1 14 2 r 5 1 7 1 8 3 2 7 Interrater agreement Kottas, Muller, Quintana (2005) analyzed these data using a flexible DP mixture of Bivariate probit ordinal model which modeled the unstructured joint probabilities prob(Rater 1=i and Rater2 = j), i=1,…,5, j=1,…,5 One way to quantify interrrater agrrement is to measure departure from the structured model of independence We consider a (mixture of) Double DP mixtures model here which provides separate DP structures for the two raters. We then measure ``agreement’’ from this model. 8 Motivating Example 3 Mixed model for longitudinal data yij X ij Z ij bi ij , or yi ( yi1 ,..., yip ) ~ N ( X i Z i bi , i ) It is common to assume (Bush and MacEachern 1996) bi ~ G, G ~ DP Modeling the error covariance i or the error variance (if i =diag(2i)) extends the normal distribution assumption to normal scale mixtures (t, Logistic,…) i2 ~ G , G ~ DP 9 Putting the two together One way to combine these two structures is (bi , i2 ) ~ G , G ~ DP Do we expect the random effects bi appearing in the modeling the mean and the error variances to cluster similarly? The error variance model often is used to extend the distributional assumption. bi ~ G1 , G1 ~ DP(1 , G01 ) i2 ~ G2 , G2 ~ DP( 2 , G02 ) 10 Double DPM I will discuss Fitting Applicability Flexibility Limitations of such double semiparametric mixtures I will also compare these models with usual DP models via predictive model comparison criteria 11 Dirichlet process Dirichlet Process is a probability measure on the space of distributions (probability measures) G. G ~ Dirichlet Process (G0), where G0 is a probability Dirichlet Process assigns positive mass to every open set of probabilities on support(G0) Conjugacy: Y1,…., Yn ~ i.i.d. G, (G) = DP( G0) Then Posterior (G|Y) ~ DP( G0 + nFn) where Fn is the empirical distn. Polya Urn Scheme 12 Stick breaking and discreteness G~ DP( G0) implies G is almost surely discrete G q j{ j } with prob.1 (Tiwariand Sethuraman1982) j 1 1 , 2 ,....~ i..i.d . G0 Stick - breaking represntation q1 w1 , w1 ~ Beta(1, q2 w2 (1 w1 ), w2 ~ Beta(1, q3 w3 (1 w2 )(1 w1 ), w3 ~ Beta(1, 13 Bayes estimate from DP The discrete nature of a random G from a DP leads to some disturbing features, such as this result from Diaconis and Freedman (1986) Location model yi = + i, i=1,…n has prior (), such as a normal prior 1,…, n ~ i.i.d. G G ~ DP(G0) - symmetrized G0 = Cauchy or t-distn Then the posterior mean is an inconsistent estimate of 14 Dirichlet process mixtures (DPM) Yi ~ f ( yi | i , , xi ), such as N (i , ), i 1,...n f is a known parametric distn. xi are possible covariates. other parms 1 ,...,n ~ i.id . G G ~ DP(G0 ) If we marginalize over i, we obtain a semiparametric mixture Yi ~ f ( yi | , , xi )dG( ) where the mixing distribution G is random and follows DP(G0) 15 DPM - clusters Yi ~ f ( yi | i , , xi ) 1 ,...,n ~ i.id. G, G ~ DP(G0 ) Since G is almost surely discrete, 1,…,n form clusters 1= 5 = 8 1unique 2= 3 = 4= 6= 7 2unique etc. The number of clusters, and the clusters themselves, are random. 16 DPM – MCMC The Polya urn/marginalized sampler (Escobar 1994, Escobar & West 1995) samples i one-at-a-time from (i | -i, data) Improvements, known as collapsed samplers, are proposed in MacEachern (1994, 1998) where, instead of sampling i , only the cluster membership of i are sampled. For non-conjugate DPM (sampling density f(yi |i ) and base measure G0 are not conjugate), various algorithms have been proposed. 17 Finite truncation and Blocked Gibbs Yi ~ f ( yi | i , , xi ) M 1 ,...,n ~ G q j{ } instead of DP - G q j{ j j} j 1 j 1 With this finite truncation, it is now a finite mixture model with stick-breaking structure on qj (1,....,n) and (q1,....,qM) can be updated in blocks (instead of one-at-time as in Polya Urn sampler) which may provide better mixing 18 Comments In each iteration, the Polya urn/marginal sampler cycles thru each observation, and for each, assigns its membership among a new and existing clusters. The Poly urn sampler is also not straightforward to implement in non-linear (non-conjugate) problems or when the sample size n may not be fixed. For the blocked sampler, on the other hand, the choice of the truncation M is not well understood. 19 Model comparison in DPM models Basu and Chib (2003) developed Bayes factor/ marginal likelihood Log-Marginal and log-Bayes factor for Longitudinal Aids trial computation method for (n=467) with random coeffs DPM. having distn G DPM Student-t Normal This provided a framework DPM -3477 (76) (62) for quantitative comparison Student-t Normal -3553 (-14) -3539 of DPM with competing parametric and semi/nonparametric models. 20 Marginal likelihood of DPM Based on the Basic marginal identity (Chib 1995) log-posterior()=log-likelihood() + log-prior() - log-marginal log-marginal = log-likelihood(*) + log-prior(*) – log-posterior(*) The posterior ordinate of DPM is evaluated via prequential conditioning as in Chib (1995) Log-Marginal and log-Bayes factor The likelihood ordinate of DPM is evaluated from a DPM Student-t Normal sampler. (collapsed) sequential importance (62) DPM -3477 (76) Student-t -3553 (-14) Normal -3539 21 22 23 24 Double Dirichlet process mixtures (DDPM) Yi ~ f ( yi | i , i , , xi ), such as N (i , i ), i 1,...n f is a known parametric distn. xi are possible covariates. other parms 1 ,...,n ~ i.id . G 1 ,..., n ~ i.id . G G ~ DP( G 0 ) G ~ DP( G 0 ) Marginalization obtains a double semiparametric mixture Yi ~ f ( yi | , , , xi )dG ( ) dG ( ) where the mixing distributions G and G are random 25 Two Biomarkers case: y1 and y2 y1i 1i 12i 1i 2i ~ N 2 , i y , i 1,...,n 2i 2i 2i 2 ( 1i , 12i ) i.i.d. ~ G1 , i 1,..., n G1 ~ DP (1 , G01 ) G01 1 ~ N ( 01 , 01 ) ( 12 ~ Inv. Gamma ) 2 ( 2i , 2i ) i.i.d. ~ G2 , i 1,..., n 2 G2 ~ DP ( 2 , G0 2 ) G02 2 ~ N ( 02 , 02 ) ( 2 ~ Inv. Gamma ) 2 2 26 A simpler model: normal means only yi1 i ~ N 2 , , i 1,..n y i2 i 1 ,....., n ~ i.i.d . G , 1 ,....., n ~ i.i.d . G , has a G ~ DP( G 0 ) G ~ DP( G 0 ) Wishart prior We generate n=50 (i,i) means and then (yi1,yi2) observations from this Double-DPM model 27 Double DPM mu Observations y 2 2 0 0 -2 -2 y[,2] psi -4 -4 -6 -6 -8 -8 -10 -10 -4 -2 0 2 4 6 -5 0 5 phi y[,1] 28 Plot of y and mu: Clusters in different vector Single DPM in the bivariate meansymbols Double DPM Clusters components Plot of y and mu: in mean in different symbols 2 y y 2 mu mu 0 0 -2 -2 y[,2] y[,2] -4 -4 -6 -6 -8 -8 -10 -10 -5 0 5 10 -5 0 5 y[,1] y[,1] 29 Model fitting We fitted the Double DPM and the Bivariate DPM models to these data. The Double DPM model can be fit by a two-stage Polya urn sampler or a two-stage blocked Gibbs sampler. “Collapsing” can become more difficult. 1 n Average MSEd E post ( id id ) 2 , d 1,2 true n i 1 1 n Average MADd E post | id id |, d 1,2 true n i 1 30 MSE MAD y1 y2 y1 y2 Data generated from Double DP Double DP 0.99 1.32 0.72 0.92 Bivariate DP in 1.02 4.29 0.83 1.73 (y1,y2) Data generated from Bivariate DP Double DP 0.98 1.29 0.81 0.94 Bivariate DP in 0.98 1.08 0.75 0.81 (y1,y2) 31 Wallace (asymmetric) criterion for comparing two clusters/partitions Let S be the number of mean pairs which are in the same cluster in a MCMC posterior draw and also in the true clustering. Let nk, k=1,..K be the number of means in cluster Ck in the MCMC draw. Then the Wallace asymmetric criterion for comparing these two clusters is Average Wallace S Data generated from Double DP Bivariate DP k nk (nk 1) / 2 Double DP y1 y2 y1 y2 0.89 0.42 0.66 0.48 Bivariate DP in (y1,y2) 0.72 0.24 0.62 0.62 32 Measurements on two biomarker proteins by Luminex panels • Frozen parafin embedded tissues, pre and post surgery 22000 • Luminex panel 20000 • Nodal involvement 18000 CRP 16000 14000 10 20 30 40 50 60 70 80 33 Two biomarker proteins The bivariate DPM ( i , i ) i.i.d. ~ G, i 1,...,n y1i ~ N 2 1i , i , i 1,...,n y G ~ DP( , G0 ) 2i 2i G0 ~ N 2 ( 0 , 0 ) ( ~ Inv.Wishart) vs the Double DPM y1i 1i 12i 1i 2i ~ N 2 , i y , i 1,...,n 2i 2i 2i 2 ( 1i , 12i ) i.i.d. ~ G1 , i 1,..., n ( 2i , 2i ) i.i.d. ~ G2 , i 1,..., n 2 G1 ~ DP (1 , G01 ) G2 ~ DP ( 2 , G0 2 ) G01 1 ~ N ( 01 , 01 ) ( 12 ~ Inv.Gamma ) 2 2 G02 2 ~ N ( 02 , 02 ) ( 2 ~ Inv. Gamm 2 34 µpred Double DP Double DP 22000 100 mu.pred[1] mu.pred[2] 50 18000 0 14000 -50 0 2000 6000 10000 0 2000 6000 10000 Bivariate DP Bivariate DP 22000 100 mu.pred[1] mu.pred[2] 50 18000 0 14000 -50 0 2000 6000 10000 0 2000 6000 10000 35 y.pred[1] y.pred[1] 0.00 0.04 0.08 0.00 0.04 0.08 0.12 0 0 10 10 20 20 Double DP Bivariate DP 30 30 40 40 y.pred[1] y.pred[1] ypred 0.00000 0.00015 0.00030 0.00000 0.00015 0.00030 14000 14000 18000 18000 Double DP Bivariate DP 22000 22000 36 ypred Double DP Bivariate DP 0.030 0.025 0.010 0.020 densit y 0.015 25000 25000 densit y 0.005 0.010 20000 20000 [ 2] [ 2] r ed r ed 15000 15000 y.p y.p 0.005 0.000 10000 0.000 10000 0 10 20 30 40 50 60 0 10 20 30 40 50 60 y.pred[1] y.pred[1] 37 log CPO = log f(yi| y-i) -8 -10 LPML = log f(yi| y-i) Log CPO -12 Double DP = -1498.67 Bivariate DP= -1533.01 -14 Double DP Bivariate DP 0 50 100 150 Observation 38 Model comparison I prefer to use marginal likelihood/ Bayes factor for model comparison. The DIC (Deviance Information Criterion) , as proposed in Spiegelhalter et al. (2002) can be problematic for missing data/random-effects/mixture models. Celeux et al. (2006) proposed many different DICs for missing data models 39 DIC3 I have earlier considered DIC3 (Celeux et al. 2006, Richardson 2002) in missing data and random effects models which is based on the observed likelihood DIC3 4 E log p( yobs | ) | yobs 2 log E p( yobs | ) | yobs , where p( yobs | ) p( yobs , , | ) d ( ) The integration over the latent parameters often has to be obtained numerically. This is difficult in the present problem 40 DIC9 I am proposing to use DIC9 which is similar to DIC3 but is based on the conditional likelihood p( yobs | , , ) DIC 9 4 E , , log p ( yobs | , , ) | yobs 2 log E , , p( yobs | , , ) | yobs , DIC9 Double DP 3018.3 Bivariate DP 3050.3 41 Convergence rate results: Ghosal and Van Der Vaart (2001) Normal location mixtures Model: Yi ~ i.i.d. p(y) = (y-)dG(), i=1,…,n G ~ DP(G0), G0 is Normal Truth: p0(y) = (y-)dF() Ghosal and Van Der Vaart (2001): Under some regularity conditions, Hellinger distance (p, p0) 0 “almost surely” at the rate of (log n)3/2/n 42 Ghosal and Van Der Vaart (2001): results contd. Bivariate DP location-scale mixture of normals Yi ~ i.i.d. p(y) = (y-)dH(,), i=1,…,n H ~DP(H0) Ghosal and Van Der Vaart (2001): If H0 is Normal {a compactly supported distn}, then the convergence rate is (log n)7/2/n Double DP location-scale mixture of normals Yi ~ i.i.d. p(y) = (y-)dG() dG(), i=1,…,n G ~DP(G0), G ~DP(G0) Ghosal and Van Der Vaart (2001): If G0 is Normal, G0 is compactly supported and the true density p0(y) = (y-)dF1() dF2() is also a double mixture, then Hellinger distance (p, p0) 0 at the rate of (log n)3/2/n 43 Interrater data Agreement between 2 Raters (Melia and Diener-West 1994) Each rater provides an ordinal rating on a scale of 1-5 (lowest to highest invasion)of the extent to which tumor has invaded the eye,n=885 Rater 1 1 2 3 4 5 R 1 291 74 1 1 1 a 2 186 256 7 7 3 t 3 2 4 0 2 0 e 4 3 10 1 14 2 r 5 1 7 1 8 3 2 44 DPM multivariate ordinal model Kottas, Muller and Quintana (2005) P (Rater1 j and Rater2 k for subject i) P( 1 j-1 Z i1 1 j , 2 k -1 Z i 2 2 k ) Z i1 Z i i.i.d. ~ N 2 ( z | , ) dG( , ) Z i2 G ~ DP( G0 ) 45 Interrater agreement The objective is to measure agreement Rater 1 between raters beyond what is 1 2 3 4 5 possible by chance. R 1 291 74 1 1 1 a t 2 186 256 7 7 3 This is often measured by departure e from independence, often specifically r 3 2 4 0 2 0 in the diagonals 2 4 3 10 1 14 2 5 1 7 1 8 3 Polychoric correlation of the latent bivariate normal Z has been used as a measure of association. ………………… of the latent bivariate normal mixtures??? 46 Latent class model (Agresti & Lang 1993) C latent classes pijk P(Rater1 j and Rater2 k for subject i) pcjk if subject i belongs to class c, c 1,...,C Ratings of the two raters within a class are independent pcjk pc1 j pc 2 k Rater 1 1 2 3 4 5 R 1 291 74 1 1 1 a C pijk I ( i c) pc1 j pc 2 k t 2 186 256 7 7 3 e 3 2 4 0 2 0 c 1 r 2 4 3 10 1 14 2 5 1 7 1 8 3 47 Mixtures of Double DPMs C pijk I ( i c) pc1 j pc 2 k c 1 For each latent class, we model pc1j and pc2k by two separate univariate ordinal probit DPM models p clj P ( cl, j 1 Z cl cl, j ), j 1,..5, l 1,2, c 1,..,C Z cli i.i.d. ~ N1 ( z | , 2 ) dGcl ( , 2 ) Gcl ~ DP( cl G0 cl ) 48 Computational issue C pijk I ( i c) pc1 j pc 2 k c 1 The ``sample size’’ nc in latent group c is not fixed. This causes problem for the polya-urn/marginal sampler which works with fixed sample size Do, Muller, Tang (2005) suggested a solution to this problem by jointly sampling the latent il =(il,il2) and the latent rating class membership i. 49 1 2 3 4 5 Estimated cell probabilities 1 .3288 .3261 .0836 .0869 .0011 .0013 .0011 .0020 .0011 .0007 (.2945,.3590) (.0696,.1078) (.0001,.0044) (.0003,.0056) (0,.0027) .3286 .0821 .003 .0022 .0009 (.3060,,.3524) (.0701,.09510) (.0008,.007) (.0007,.0047) (.0001,.0023) .2102 .2893 .0079 .0079 .0034 2 .2135 .2826 .0082 .0069 .0031 (.1858,.2428) (.2521,.3141) (.0031,.0152) (.0022,.0143) (.0007,.0075) .2098 .2853 .0076 0103 .0033 (.1856,.2334) (.2700,.2997) (.0036,.0129) (..0074,.0138) (.0017,.0053) .0023 .0045 0 .0023 0 3 .0023 .0055 .0016 .0022 .0008 (.004,.0065) (.0017,.0107) (.0003,.0038) (.004,.0062) (.0,.0032) .0029 .006 .0003 .0015 .0004 (.0009,.0068) (.0025,.0114) (0,.0009) (.0002,.0039) (0,.0011) .0034 .0113 .0011 .0158 .0023 4 .0042 .0102 .0023 ..0143 .0028 (.0012,.0094) (.0042,.0187) (.0004,.006) (.0066,.024) (.0006,.0069) .0031 .0124 .0016 .0127 .0033 (.001,.0059) (.0092,.0157) (.0002,.0044) (.0089,.0168) (.0014,.0057) .0011 .0079 .0011 .009 .0034 5 .0012 .0071 .0019 .0083 .0039 (.0001,.0041) (.0026,.0140) (.0003,.0054) (.0034,.0153) (.0009,.009) .0019 .0086 .0011 .0086 .0026 (.0006,.004) (.0055,.0129) (.0002,.0032) (.0053,.0118) (.0011,.0045) 50 Marginal probability estimates Rater A 1 2 3 4 5 Latent Group 1 0.6691 .3246 .0036 .0014 .0012 Rater B 1 2 3 4 5 0.9374 .0559 .0049 .0011 .0008 1 2 3 4 5 1 0.6261 0.03859 0.003273 0.00068 0.000473 2 0.3058 0.01675 0.001511 0.000326 0.000225 3 0.003405 0.000194 2.38E-05 6.81E-06 5.06E-06 4 0.001237 0.000121 4.37E-05 1.86E-05 1.79E-05 5 0.000882 0.000212 7.31E-05 3.12E-05 3.10E-05 51 Marginal probability estimates Rater A Latent Group 2 1 2 3 4 5 0.1562 .7139 .0188 .0661 .0451 Rater B 1 2 3 4 5 0.1432 .7432 .0226 .0706 .0205 1 2 3 4 5 1 0.02196 0.1264 0.002716 0.003709 0.001382 2 0.1107 0.5625 0.0138 0.02051 0.006414 3 0.002445 0.01194 0.000555 0.003084 0.000764 4 0.005113 0.02517 0.003294 0.02575 0.006747 5 0.002968 0.01718 0.002219 0.01753 0.005212 52 Joint mean covariance modeling Trial with n=200 patients who had acute MI within 28 days of baseline and are depressed/low social support Underwent 6 months of usual care (control) or individual and/or group-based cognitive behavioral counseling (treatment). Response y = depression (Beck Depression Inventory) measured at 0,182,365,548, 913, 1278 days (but actually at irregular intervals) Covariate: Treatment, Family history, Age, Sex, BMI,…… Intermittent missing response, missing covariate…. 53 Beck depression Inventory 0 10 20 30 40 50 0 200 400 600 Visit Days 800 1000 1200 54 Model Model for the mean E ( yij ) ij X ij Z ij bi Z ij bi b01 b1 tij b2i (tij ti 2 ) i Model for the covariance yi ( yi1 ,.., yi 6 ) ~ N 6 ( i , i ) Pourahmadi (1999), Pourahmadi and Daniels (2002) use a Cholesky decomposition of the covariance which allows one to use log-linear model for the variances and “linear regression” for the off-diagonal terms 55 Modeling the covariance We assume i diag ( i2 ,.., i26 ) 80 1 75 70 log ij X ij Z ij bi 2 Variance 65 2 Z ij bi boi b1i tij b2i tij 60 55 1 2 3 4 5 6 visits 56 Mean and variance level random effects bi (boi , b1 , b2i ) and bi (boi , b1i , b2i ), i 1,..n i ? A joint DPM for the two random effects together which allows clustering at the patient level (bi , bi ) i.i.d ~ G , , G , ~ DP ? Or Double DPM, that is, independent DPM separately for the each of the two random effects which allows separate clustering at the mean and variance level bi i.i.d ~ G , G ~ DP bi i.i.d ~ G , G ~ DP Most frequentist and parametric Bayesian analyses use the latter independence among the mean and variance level random effects. 57 Double DP Bivariate DP -2200 -2200 log-likeliood log-likeliood -2300 -2300 -2400 -2400 -7000 -5000 -3000 -7000 -5000 -3000 iterations iterations Log CPO -5 -10 Log-CPO -15 -20 Double DP -25 Bivariate DP 0 50 100 150 200 Observations 58 Fixed effects estimates Double DP Bivariate DP Normal .slope -1.12 (-2.39,-0.16) -0.89 (-1.97,0.24) -0.26 (-.65,.28) .change -3.29 (-4.36,-1.77) -2.98 (-4.4,-1.11) -4.057 (-4.72,-3.43) .slope 0.18 (-.006,.38) 0.19 (.014,.382) -0.16 (-0.32,0.05) .quad -0.005 (-0.027,0.021) -.006 (-.027,.013) 0.033 (0.009,0.051) 59 Pseudo marginal likelihood log f(yi| y-i) Double DP -2495.290 Bivariate DP -2503.916 Normal -2530.92 60 Summary Double DP mixtures may add a level of structure to mixture modeling with DP. They produce interesting “product-clustering” They are applicable to specific problems that may benefit from this structure 61 basu@niu.edu 62