
Biomarker Discovery Analysis: Targeted Maximum Likelihood
Cathy Tuglus, UC Berkeley Biostatistics
BASS XIV Workshop, November 7th-9th, 2007
With Mark van der Laan

Overview
□ Motivation
□ Common methods for biomarker discovery: linear regression, randomForest, LARS/multiple regression
□ Variable importance measure: estimation using tMLE, inference, extensions, issues
□ Two-stage multiple testing
□ Simulations comparing methods

Motivation
"Better Evaluation Tools – Biomarkers and Disease" is the #1 highly-targeted research project in the FDA "Critical Path Initiative", which requests:
□ "clarity on the conceptual framework and evidentiary standards for qualifying a biomarker for various purposes"
□ "accepted standards for demonstrating comparability of results, ... or for biological interpretation of significant gene expression changes or mutations"
Proper identification of biomarkers can . . .
□ Identify patient risk or disease susceptibility
□ Determine the appropriate treatment regime
□ Detect disease progression and clinical outcomes
□ Assess therapy effectiveness
□ Determine the level of disease activity
□ etc . . .

Biomarker Discovery: Possible Objectives
□ Identify particular genes or sets of genes that modify disease status (tumor vs. normal tissue)
□ Identify particular genes or sets of genes that modify disease progression (good vs. bad responders to treatment)
□ Identify particular genes or sets of genes that modify disease prognosis (stage/type of cancer)
□ Identify particular genes or sets of genes that may modify disease response to treatment

Biomarker Discovery: Set-up
Data: O = (A, W, Y) ~ P0
□ Variable of interest (A): a particular biomarker or treatment
□ Covariates (W): additional biomarkers to control for in the model
□ Outcome (Y): biological outcome (disease status, etc.)
Parameter: ψ(a, W) = E_p(Y | A = a, W) − E_p(Y | A = 0, W)
[Diagrams: gene expression (A, W) → disease status (Y); treatment (A) and gene expression (W) → disease status (Y)]

Causal Story
Ideal result: a measure of the causal effect of A on the outcome,
ψ(a, v) = E_P*[ E_p(Y | A = a, W) − E_p(Y | A = 0, W) | V = v ]
Strict assumptions:
□ Experimental Treatment Assignment (ETA) assumption: given the covariates, assignment of the treatment/exposure is randomized
□ Missing-data structure: the full data contain all possible treatments for each subject
Under small violations of these assumptions, the causal effect is still well-approximated by the van der Laan (VDL) variable importance measure.

Possible Methods: Solutions to Deal with the Issues at Hand
□ Linear regression + variable reduction methods
□ Random Forest
□ tMLE variable importance

Common Approach: Linear Regression
Model: E[Y | X = (A, W)] = βᵀX, optimized using least squares, which seeks
β̂ = argmin_b Σ_{i=1..n} (y_i − bᵀX_i)²
Notation: Y = disease status, A = treatment/biomarker 1, W = biomarkers, demographics, etc.
E[Y | A, W] = β1·f1(A) + β2·f2(A, W) + β3·f3(W) + . . .
Common issues:
□ A large number of input variables: which variables to include? (risk of over-fitting)
□ May want to try alternative functional forms of the input variables: what are f1, f2, f3, . . .?
□ Improper bias-variance trade-off for estimating a single parameter of interest: estimating the full vector β biases the estimate of β1
□ Using a variable reduction method, a low-dimensional fit may discount variables believed to be important, and we may believe the outcome is a function of all variables
What about Random Forest?
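Before turning to random forests, the least-squares setup above can be made concrete with a small numerical sketch. This is my illustration on simulated data (numpy), not code from the workshop: the single coefficient of interest β1 is estimated only as part of the full coefficient vector, so every modeling choice about the W's affects it.

```python
# Sketch (illustration only): least squares for E[Y|A,W] = b1*A + b2*W1 + ...
# The fit targets the whole vector beta, not just b1, so the analyst must
# decide which of the many candidate W's (and functional forms) to include.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10                      # n subjects, p background biomarkers W
W = rng.normal(size=(n, p))
A = rng.normal(size=n)              # biomarker/treatment of interest
Y = 2.0 * A + W @ np.ones(p) + rng.normal(size=n)   # true effect of A is 2.0

X = np.column_stack([A, W])         # design matrix, variable of interest first
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
b1 = beta[0]                        # estimate of the effect of A
print(round(b1, 2))                 # should be near 2.0
```

With p small relative to n this works; with thousands of candidate biomarkers the same fit over-parameterizes, which is the bias-variance problem the slides point to.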
Random Forest
Breiman (1996, 1999): a classification and regression algorithm
□ Seeks to estimate E[Y | A, W], i.e. the prediction of Y given the set of covariates {A, W}
□ Bootstrap aggregation of classification trees, to reduce the instability of a single tree
□ Cross-validation to assess misclassification rates: the out-of-bag (oob) error rate
□ Permutation to determine variable importance
[Diagram: a classification tree splitting on covariates W = {W1, W2, W3, . . .} with 0/1 leaves]
Assumes all trees are independent draws from an identical distribution, minimizing the loss function at each node of a given tree: data are drawn randomly for each tree, and variables randomly for each node.

Random Forest: Basic Algorithm for Classification, Breiman (1996, 1999)
□ Draw a bootstrap sample of the data
□ Using 2/3 of the sample, fit a tree to its greatest depth, determining the split at each node by minimizing the loss function over a random sample of covariates (sample size is user-specified)
□ For each tree:
  - Predict the classification of the leftover 1/3 using the tree, and calculate the misclassification rate = out-of-bag (oob) error rate
  - For each variable in the tree, permute the variable's values and recompute the oob error; compare to the original oob error: the increase is an indication of the variable's importance
□ Aggregate the oob errors and importance measures from all trees to determine the overall oob error rate and variable importance measure.
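The permute-and-compare step of the algorithm above can be sketched generically. This is my simplification (numpy only): a single least-squares fit stands in for the tree ensemble, and a held-out split stands in for the out-of-bag sample, but the importance logic (permute one feature, measure the increase in prediction error) is the same.

```python
# Sketch of Breiman-style permutation importance (illustration only:
# a plain least-squares predictor replaces the tree ensemble, and a
# held-out test split replaces the out-of-bag sample).
import numpy as np

rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + rng.normal(size=n)       # only feature 0 matters

train, test = slice(0, 300), slice(300, n)
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

def mse(Xm):
    return np.mean((y[test] - Xm @ beta) ** 2)

base = mse(X[test])
importance = []
for j in range(3):
    Xp = X[test].copy()
    Xp[:, j] = rng.permutation(Xp[:, j])     # break feature j's link to y
    importance.append(mse(Xp) - base)        # error increase = importance

print(int(np.argmax(importance)))            # feature 0 should rank first
```

The same mechanics underlie the real randomForest importance; the difference is that the forest averages the error increase over many trees and their oob samples.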
□ Oob error rate: the overall percentage of misclassification
□ Variable importance: the average increase in oob error over all trees; assuming a normal distribution of the increase among the trees, an associated p-value can be determined

Random Forest: Considerations for Variable Importance
The resulting predictor set is high-dimensional, giving an incorrect bias-variance trade-off for an individual variable importance measure:
□ Seeks to estimate the entire model, including all covariates
□ Does not target the variable of interest
□ The final set of variable importance measures may not include the covariate of interest
□ The variable importance measure lacks interpretability
□ No formal inference (p-values) is available for the variable importance measures

Targeted Semi-Parametric Variable Importance
van der Laan (2005, 2006), Yu and van der Laan (2003)
Given observed data O = (A, W, Y) ~ P0, the parameter of interest is the "direct effect":
m(A, W | β) = E(Y | A = a, W) − E(Y | A = 0, W), for instance m(A, W | β) = a·(βᵀW)
Semi-parametric model representation with unspecified g(W):
E(Y | A, W) = m(A, W | β) + g(W)
For example, with Y = tumor progression, A = treatment, and W = gene expression, age, gender, etc.:
E[Y | A, W] = β1·f1(treatment) + β2·f2(treatment × gene expression) + β3·f3(gene expression) + β4·f4(age) + . . .
m(A, W | β) = E[Y | A = a, W] − E[Y | A = 0, W] = β1·f1(treatment) + β2·f2(treatment × gene expression)
There is no need to specify f3 or f4.

tMLE Variable Importance: General Set-Up
Given observed data O = (A, W, Y) ~ P0:
□ W* = {possible biomarkers, demographics, etc.}
□ A = W*_j (the current biomarker of interest)
□ W = W*_(-j)
Parameter of interest, given m(A, W | β) = E_p(Y | A = a, W) − E_p(Y | A = 0, W):
ψ(A) = E_W*[ m(A, W | β) ],  ψ(a) = E_W[ m(a, W | β) ] ≈ (1/n) Σ_{i=1..n} m(a, W_i | β)
If m is linear, ψ(a) = a·βᵀE[W]; in the simplest (marginal) case, ψ(a) = a·β.

Nuts and Bolts
Ψ(P)(a, V) = E[ E_p(Y | A = a, W) − E_p(Y | A = 0, W) | V ⊆ W ]
Basic inputs:
1. A model specifying only the terms involving the variable of interest, e.g. m(A, V | β) = a·(βᵀV)
2. Nuisance parameters:
□ E[A | W], the treatment mechanism (confounding covariates on treatment): E[treatment | biomarkers, demographics, etc.]
□ E[Y | A, W], an initial model fit of Y given all covariates (output from linear regression, Random Forest, etc.): E[disease status | treatment, biomarkers, demographics, etc.]
VDL variable importance is a robust method: it takes a possibly non-robust E[Y | A, W] and accounts for the treatment mechanism E[A | W].
□ Only one nuisance parameter needs to be correctly specified to obtain consistent, efficient estimators
□ VDL variable importance methods will perform as well as the non-robust method, or better
□ The new targeted MLE estimation method will also provide model selection capabilities

tMLE Variable Importance: Model-Based Set-Up
van der Laan (2006)
Given observed data O = (A, W, Y) ~ P0, the parameter of interest is
Ψ(P)(a, W) = E_p(Y | A = a, W) − E_p(Y | A = 0, W)
Model: { p : E_p(Y | A = a, W) − E_p(Y | A = 0, W) = m(A, W | β(p)) for some β(p) ∈ R^d }, subject to m(0, W | β) = 0 and m(A, W | 0) = 0, with β0 = β(p0).
Here m(A, W | β(p)) is the projection of E_p(Y | A = a, W) − E_p(Y | A = 0, W) onto { m(a, W | β) : β }.

tMLE Variable Importance: Estimation
van der Laan (2006)
Parameter of interest: m(A, W | β) = E_p(Y | A = a, W) − E_p(Y | A = 0, W)
Factorize the density of the data: p(Y, A, W) = p(Y | A, W) p(A | W) p(W)
Define Q(p) = p(Y | A, W) with Qn(A, W) = E[Y | A, W], and G(p) = p(A | W) with Gn(W) = E[A | W].
Efficient influence curve:
D_h(p)(O) = h(A, W) · (Y − m(A, W | β) − Q(0, W)),
where h(A, W) = (d/dβ) m(A, W | β) − E[ (d/dβ) m(A, W | β) | W ].
The true β(p0) = β0 solves (1/n) Σ_{i=1..n} D_h(p0)(O_i | β0) = 0.

tMLE Variable Importance: Simple Solution Using Standard Regression
van der Laan (2006)
1) Given the model m(A, W | β) = E[Y | A, W] − E[Y | A = 0, W]
2) Estimate an initial solution Q0n(A, W) = E[Y | A, W] = m(A, W | β) + g(W), giving the initial estimate β0. Any prediction technique allowing specification of m(A, W | β) can be used; g(W) can be estimated in a non-parametric fashion.
3) Solve for the clever covariate derived from the influence curve:
r(A, W) = (d/dβ) m(A, W | β) − E[ (d/dβ) m(A, W | β) | W ]
4) Update the initial estimate Q0n(A, W) by regressing Y onto r(A, W) with offset Q0n(A, W); ε = the coefficient of this updated regression
5) Update the initial parameter estimate β and the overall estimate of Q(A, W):
β1 = β0 + ε,  Qn1(A, W) = Q0n(A, W) + ε·r(A, W)

Formal Inference
van der Laan (2005)
Given D_h(O | β, Q, G), estimate the influence curve as D_h(O | βn, Qn, Gn) and center it:
IC1(O) = D_h(O | βn, Qn, Gn) − En[ D_h(O | βn, Qn, Gn) ]
Then the asymptotic covariance matrix is estimated as
Σn = (1/n) Σ_{i=1..n} IC1(O_i) IC1(O_i)ᵀ,
where √n (βn − β0) ~ N(0, Σ), giving the confidence interval
βn(j) ± 1.96 √( Σn(j, j) / n )
and the hypothesis test of H0: β0(j) = 0 via
Tn(j) = √n βn(j) / √Σn(j, j) ~ N(0, 1) as n → ∞.

"Sets" of biomarkers
The variable of interest A may itself be a set of variables (multivariate A):
□ Results in a higher-dimensional ε
□ Same easy estimation: set the offset and project onto a clever covariate
□ Update a multivariate β
"Sets" can be clusters, or representative genes from a cluster. We can define sets
for each variable W', e.g. all variables whose correlation with A is greater than 0.8.
□ Formal inference is available: test H0: β' = 0, where β' is multivariate, using a Chi-square test

"Sets" of biomarkers
We can also extract an interaction effect:
I(A1, A2) = m(A1=1, A2=1, W | β) − m(A1=0, A2=1, W | β) − m(A1=1, A2=0, W | β)
          = E(Y | A1=1, A2=1, W) − E(Y | A1=1, A2=0, W) − E(Y | A1=0, A2=1, W) + E(Y | A1=0, A2=0, W)
Given a linear model m(A1, A2, W | β) = A1·(β1ᵀW) + A2·(β2ᵀW) + A1·A2·(β3ᵀW),
I(A1, A2) = [β1ᵀW + β2ᵀW + β3ᵀW] − β2ᵀW − β1ᵀW = β3ᵀW.
This provides inference using the hypothesis test of H0: cᵀβ = 0.

Benefits of Targeted Variable Importance
□ Targets the variable of interest: focuses estimation on the quantity of interest,
Ψ(P)(a, V) = E[ E_p(Y | A = a, W) − E_p(Y | A = 0, W) | V ⊆ W ],
with a proper bias-variance trade-off
□ Hypothesis driven: allows for effect modifiers, and focuses on a single variable or a set of variables
□ Double robust estimation: does at least as well as, or better than, common approaches
□ Formal inference for variable importance measures: provides proper p-values for targeted measures
□ Combines estimating-function methodology with a maximum likelihood approach: estimates the entire likelihood while targeting the parameter of interest
□ The algorithm updates the parameter of interest as well as the nuisance parameters (E[A | W], E[Y | A, W]), giving less dependency on the initial nuisance model specification
□ Allows application of loss-function-based cross-validation for model selection, e.g. the DSA data-adaptive model selection algorithm (future work)

Steps to Discovery: General Method
1. Univariate linear regressions: apply to all W, control FDR using Benjamini-Hochberg (BH), and select the W significant at the 0.05 level to be W' (for computational ease)
2. Define m(A, W' | β) = βA (marginal case)
3. Define the initial Q(A, W') using data-adaptive model selection, completed for all A in W. We use LARS because it allows us to include the form m(A, W | β) in the model; one can also use DSA, or glmpath() for penalized regression with a binary outcome
4.
Solve for the clever covariate, which for m(A, W | β) = βA simplifies to r(A, W') = A − E[A | W']; E[A | W'] can be estimated with any prediction method (we use polymars())
5. Update Q(A, W') using tMLE
6. Calculate the appropriate inference for ψ(A) using the influence curve

Simulation Set-up: Methods Compared
□ Univariate linear regression. Importance measure: the coefficient value with its associated p-value; measures marginal association
□ RandomForest (Breiman 2001). Importance measures (no p-values): RF1 = the variable's influence on the error rate; RF2 = the mean improvement in node splits due to the variable
□ Variable importance with LARS. Importance measure: the causal effect, with formal inference (p-values) provided; LARS is used to fit the initial E[Y | A, W] estimate, with W = {marginally significant covariates}
All p-values are FDR adjusted.

Simulation Set-up: Design
Test each method's ability to recover the "true" variables under increasing correlation:
□ Rank variables by importance measure and by p-value; what is the minimal list length needed to capture all "true" variables?
□ Variables: block-diagonal correlation structure with 10 independent sets of 10; multivariate normal with constant ρ within blocks and variance 1; ρ = {0, 0.1, 0.2, 0.3, . . ., 0.9}
□ Outcome: main-effects linear model with 10 "true" biomarkers (one from each set of 10), equal coefficients, and a noise term with mean 0 and sigma = 10 ("realistic noise")

Simulation Results (in Summary)
[Figure: minimal list length needed to obtain all 10 "true" variables vs. correlation ρ, for linear regression, variable importance with LARS, RF1, and RF2]
□ No appreciable difference between ranking by importance measure and ranking by p-value (the plot is with respect to ranked importance measures)
□ List lengths for linear regression and randomForest increase with increasing correlation; variable importance with LARS stays near the minimum (10) through ρ = 0.6, with only small decreases in power
□ The linear regression list length is 2X the variable importance list length at ρ = 0.4, and 4X at ρ = 0.6
□ The randomForest (RF2) list length is consistently shorter than linear regression's, but is still 50% longer than the variable importance list length at ρ = 0.4,
and twice as long at ρ = 0.6.
Variable importance coupled with LARS estimates the true causal effect and outperforms both linear regression and randomForest.
[Result slides: Type I error and power; length of list (two slides); average importance; average rank]

ETA Bias: Heavy Correlation Among Biomarkers
In applications, biomarkers are often heavily correlated, leading to large ETA violations. This semi-parametric form of variable importance is more robust than the non-parametric form (no inverse weighting), but it is still affected. Work is currently being done on methods to alleviate this problem:
□ Pre-grouping (clustering)
□ Removing highly correlated Wi from W*
□ Publications forthcoming . . .
For simplicity we restrict W to contain no variables whose correlation with A is greater than r (r = 0.5 and r = 0.75).

Secondary Analysis
What to do when W is too large: switch to the MTP (multiple testing) presentation.

Application: Golub et al. 1999
□ Classification of AML vs. ALL using microarray gene expression data
□ N = 38 individuals (27 ALL, 11 AML)
□ Originally 6817 human genes, reduced to 3051 genes using the pre-processing methods outlined in Dudoit et al. 2003
□ Objective: identify biomarkers which are differentially expressed (ALL vs. AML)
□ Adjust for ETA bias by restricting W' to contain no variables whose correlation with A is greater than r (r = 0.5 and r = 0.75)

Steps to Discovery: Golub Application (Slight Variation from the General Method)
1. Univariate regressions: apply to all W, control FDR using BH, and select the W significant at the 0.1 level to be W' (for computational ease); before the correlation restriction, W' has 550 genes. Restrict W' to W'' based on correlation with A (r = 0.5 and r = 0.75). Then, for each A in W:
2. Define m(A, W'' | β) = βA (marginal case)
3. Define the initial Q(A, W'') using polymars(); find the initial fit and initial β
4. Solve for the clever covariate A − E[A | W''], with E[A | W''] estimated using polymars()
5. Update Q(A, W'') and β using tMLE
6. Calculate the appropriate inference for ψ(A) using the influence curve
7.
Adjust the p-values for multiple testing, controlling FDR using BH.

Golub Results – Top 15 VIM
Top 15 genes (lowest p-values) when rho <= 0.5:

  VIM      Gene ID            Gene Description
  0.3378   L20815_at          S protein mRNA
  0.2716   U81554_at          CaM kinase II isoform mRNA
 -0.1593   X59871_at          TCF7 Transcription factor 7 (T-cell specific)
  0.8393   X70297_at          CHRNA7 Cholinergic receptor, nicotinic, alpha polypeptide 7
 -0.3112   Z22551_at***       Kinectin gene
 -0.9727   Z15115_at          TOP2B Topoisomerase (DNA) II beta (180kD)
  0.3144   U05681_s_at        Proto-oncogene BCL3 gene
  0.3101   Z50022_at          Surface glycoprotein
  0.2933   L08246_at***       Induced myeloid leukemia cell differentiation protein MCL1
 -0.1718   Z46973_at          Phosphatidylinositol 3-kinase
  0.2894   M62762_at***       ATP6C Vacuolar H+ ATPase proton channel subunit
  0.2496   X85786_at          Binding regulatory factor
  0.1988   M27891_at***       CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage)
 -0.1769   U27460_at          Uridine diphosphoglucose pyrophosphorylase mRNA
 -0.1880   D50840_at          Ceramide glucosyltransferase

Top 15 genes (lowest p-values) when rho <= 0.75:

  VIM      Gene ID            Gene Description
  0.3266   L20815_at          S protein mRNA
  1.2598   X70297_at          CHRNA7 Cholinergic receptor, nicotinic, alpha polypeptide 7
 -0.9460   Z15115_at***       TOP2B Topoisomerase (DNA) II beta (180kD)
  0.1811   U05681_s_at        Proto-oncogene BCL3 gene
  0.3101   Z50022_at          Surface glycoprotein
 -0.1747   X59871_at          TCF7 Transcription factor 7 (T-cell specific)
  0.1629   X04085_rna1_at***  Catalase (EC 1.11.1.6) 5' flank and exon 1 mapping to chromosome 11, band p13 (and joined CDS)
 -0.2555   U18422_at          DP2 (Humdp2) mRNA
 -0.2438   U27460_at          Uridine diphosphoglucose pyrophosphorylase mRNA
  0.1834   M15395_at          SELL Leukocyte adhesion protein beta subunit
  0.3307   AJ000480_at        GB DEF = C8FW phosphoprotein
 -0.1880   D50840_at          Ceramide glucosyltransferase
 -0.1516   Z14982_rna1_at     MHC-encoded proteasome subunit gene LAMP7-E1, extracted from the H. sapiens gene for major histocompatibility complex encoded proteasome subunit LMP7
  0.3182   U10868_at          ALDH7 Aldehyde dehydrogenase 7
 -0.2073   M89957_at          IGB Immunoglobulin-associated beta (B29)

*** Indicates the gene was in the original 50-gene set from Golub et al. 1999.

Golub Results – Comparison of Methods
[Figure slides: comparison of methods; results with a better initial Q; percent similarity with univariate regression (rank by p-value); percent similarity with randomForest measures of importance]

Acknowledgements
Mark van der Laan, Biostatistics, UC Berkeley
Dave Nelson, Lawrence Livermore Nat'l Lab
Sandrine Dudoit, Biostatistics, UC Berkeley
Catherine Metayer, NCCLS, UC Berkeley
Alan Hubbard, Biostatistics, UC Berkeley
NCCLS Group

References
L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
L. Breiman. Random forests - random features. Technical Report 567, Department of Statistics, University of California, Berkeley, 1999.
M. J. van der Laan. Statistical inference for variable importance. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 188, August 2005. http://www.bepress.com/ucbbiostat/paper188
M. J. van der Laan and D. Rubin. Estimating function based cross-validation and learning. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 180, May 2005. http://www.bepress.com/ucbbiostat/paper180
M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 213, October 2006. http://www.bepress.com/ucbbiostat/paper213
S. E. Sinisi and M. J. van der Laan. Deletion/Substitution/Addition algorithm in learning with applications in genomics. Statistical Applications in Genetics and Molecular Biology, 3(1), Article 18, 2004. http://www.bepress.com/sagmb/vol3/iss1/art18
Z. Yu and M. J. van der Laan. Measuring treatment effects using semiparametric models. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 136, September 2003. http://www.bepress.com/ucbbiostat/paper136
