Integrative Omics for Cancer Biology
Xiang Zhang, PhD
Department of Chemistry
Center for Regulatory and Environmental Analytical Metabolomics
University of Louisville, Louisville, KY 40292
email@example.com

Systems Biology
Systems biology is a field of biology that aims at a systems-level understanding of biological processes, treating a biological system as a set of parts that are connected to one another and work together. It attempts to create predictive models of cells, organs, biochemical processes and complete organisms.
•Integrative systems biology: extracting biological knowledge from the ‘omics through integration
•Predictive systems biology: predicting the future of a biosystem using ‘omics knowledge, e.g. in-silico biosystems
Davidov, E.; Clish, C. B.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 267-288.
Clish, C. B.; Davidov, E.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 3-13.

Omics Space
Differential omics is the beginning of Systems Biology
molecule, cell, tissue, organism, …

Differential Proteomics & Metabolomics
1. Differential proteomics and metabolomics are the qualitative and quantitative comparison of the proteome and metabolome under different conditions, with the aim of unraveling complex biological processes.
2. They can be used to study any phenomenon that may change the proteome and/or metabolome of a living system.
Cancer Biomarker Discovery, NIH Nano-medicine, Environment, preventative medicine, Food and nutrition

Biomarker Discovery is a Major Research Field of Differential Omics
Biomarkers are naturally occurring biomolecules useful for measuring the prognosis and/or progress of diseases and therapies. These substances may normally be present in small amounts in the blood or other tissues. When the amounts of these substances change, they may indicate disease.

Valid biomarkers should
•demonstrate drug activity sooner
•facilitate clinical trial design by defining patient populations
•optimize dosing for safety and efficacy
•be sensitive and easy to assay to speed drug development

What Types of Change Are Expected?
[Figure: types of change, split by whether protein structure is unchanged or changed. Labels: concentration, degradation, post-translational modification, sequence (mutation)]
•Sensing structural change is a major element of comparative proteomics
•Most metabolomics work focuses on concentration change only.

Challenges in Proteomics
Sample complexity
•About 25K protein-coding genes are present in human. The IPI human database (v3.25) has 67,250 entries, which could generate about 10^6 to 10^8 peptides
•More than one hundred post-translational modifications (PTMs) could happen in a proteome
Large protein concentration difference
•10^7 to 10^8 in human cells, and at least 10^12 in human plasma
•The dynamic range of LC-MS is about 10^4 to 10^6
•The top 12 high-abundance proteins constitute approximately 95% of the total protein mass of plasma/serum: Albumin, IgG, Fibrinogen, Transferrin, IgA, IgM, Haptoglobin, alpha 2-Macroglobulin, alpha 1-Acid Glycoprotein, alpha 1-Antitrypsin and HDL (Apo A-I & Apo A-II).
Dynamic system, large subject variation
[Figure: Body fluid profiling: biomarker platform. Generic sample prep. for high concentration compounds, focused sample prep. for low concentration compounds; concentration scale labeled g/ml, ng/ml, pg/ml]

Challenges in Metabolomics
•Metabolites have a wide range of molecular weights and large variations in concentration
•The metabolome is much more dynamic than the proteome and genome, which makes the metabolome more time sensitive
•Metabolites can be either polar or nonpolar, as well as organic or inorganic molecules.
This makes the chemical separation a key step in metabolomics
•Metabolites have chemical structures, which makes the identification using MS an extreme challenge
[Figure: structure of cholesta-3,5-diene]

Differential Omics
biomarker discovery
[Figure: molecules A, B, C, D, …, Z measured in samples S1 to S8, grouped as Diseased and Healthy]

Informatics Platform
[Figure: informatics platform workflow. Labels include: experiment design, sample information, experiment execution, raw data, spectrum deconvolution, peak alignment, normalization, transformation, quality control, significance test, regulated peaks, pattern recognition, cluster loadings, regulated molecules, molecular identification, molecular validation, unidentified molecules, targeted tandem MS, correlation, interaction networks, molecular pathway, protein function, modeling, knowledge assembling, data re-examination, LIMS]

Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks

MDLC Platforms
• MudPIT, i.e. SCX followed by RP
• The proteome is split into 10-20X more fractions
• There is carry-over between fractions
• LC fractions generally are still too complex for MS
• Affinity Selection
• Avidin selection of Cys-containing peptides
• Cu-IMAC for His-containing peptides
• Ga-IMAC for phosphorylated peptides
• Lectins for glycosylated peptides
[Figure: workflow labels: Sample, APR, AP, Digestion, SCX, fractions F1, F2, RPC-MS]
Qiu, R.; Zhang, X. and Regnier, F. E. J. Chromatogr. B. 2007, 845, 143-150.
Wang, S.; Zhang, X.; and Regnier, F. E. J. Chromatogr. A 2002, 949, 153-162.
Regnier, F. E.; Amini, A.; Chakraborty, A.; Geng, M.; Ji, J.; Sioma, C.; Wang, S.; and Zhang, X. LC/GC 2001, 19(2), 200-213.
Geng, M.; Zhang, X.; Bina, M.; and Regnier, F. E. J. Chromatogr. B 2001, 752, 293-306.

In-Gel Stable Isotope Labeling
a sample gel based platform
•Avoiding gel-to-gel variability
•Only labeling K-containing peptides
•Accurate quantification
Asara, J.
M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. Nature Protocols, 2006, 1, 46-51.
Asara, J. M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. J. Proteome Res., 2006, 5, 155-163.
Ji, J.; Chakraborty, A.; Geng M.; Zhang, X.; Amini, A.; Bina, M.; and Regnier, F. E. J. Chromatogr. A 2000, 745, 197-210.

Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Molecular identification
   protein identification
   metabolite identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks

Protein Identification
database searching
The database searching approach uses a protein database to find a peptide for which a theoretically predicted spectrum best matches experimental data.
[Figure labels: Protein, Peptide, Mass, matched peptide]

Protein Identification
database searching
More than 20 algorithms have been developed, e.g. Sequest, Spectrum Mill, Mascot, X! Tandem, OMSSA.
1. About 20% of tandem MS spectra could provide confident peptide identification
2. < 50% of peptides can be identified by all algorithms
Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.

Protein Identification
de novo sequencing
De novo sequencing reconstructs the partial or complete sequence of a peptide directly from its MS/MS spectrum. The performance of de novo methods is limited by low mass accuracy, mass equivalence, and the completeness of fragmentation.
Pevtsov, S.; Fedulova, I.; Mirzaei, H.; Buck, C.; Zhang, X. Journal of Proteome Research. 2006, 5, 3018-3028.
Fedulova, I.; Ouyang, Z.; Buck, C.; Zhang, X. The Open Spectroscopy Journal 2007, 1, 1-8.
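
As a rough illustration of the database-searching idea and of the mass-equivalence limit mentioned for de novo sequencing, the sketch below digests a protein in silico and matches an observed peptide mass against the resulting candidates. It is a deliberate simplification and not how Sequest, Mascot or the other engines named above work: real engines score predicted fragment spectra, whereas this sketch compares precursor masses only. The function names, the toy database and the 10 ppm tolerance are assumptions made for the example; the residue masses are standard monoisotopic values.

```python
import re

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum of a free peptide


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def search(database, observed_mass, tol_ppm=10.0):
    """Return (protein, peptide, mass) for candidates within tol_ppm of observed_mass."""
    hits = []
    for name, sequence in database.items():
        for pep in tryptic_digest(sequence):
            mass = peptide_mass(pep)
            if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
                hits.append((name, pep, mass))
    return hits


# Hypothetical one-protein database used only for this example.
toy_db = {"toy_protein": "LSPLGEEMRTHLAPYSDELRQGLLPVLESFK"}
hits = search(toy_db, peptide_mass("THLAPYSDELR"))

# Mass equivalence: leucine and isoleucine have identical masses, so these
# two sequences cannot be told apart by mass alone.
same_mass = peptide_mass("LSPLGEEMR") == peptide_mass("ISPIGEEMR")
```

Matching on precursor mass alone produces many false candidates in a realistic database, which is one reason fragment-spectrum scoring and the multi-algorithm consensus discussed in this talk are needed.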

Incorporating Peptide Separation Information for Protein Identification
structure of pattern classifier
[Figure: example peptides (VSFLSALEEYTKK, LSPLGEEMR, DYVSQFEGSALGKQLNLK, DSGRDYVSQFEGSALGK, AKPALEDLRQGLLPVLESFK, DLATVYVDVLKDSGR, THLAPYSDELR, VQPYLDDFQKK, QGLLPVLESFK) go through feature extraction (Feature 1 to Feature N) into a neural network with input layer, hidden layer and output layer; other labels: Partition, Flow through, Elution]
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.

Training the ANNs with Genetic Algorithm
[Figure: genetic algorithm loop. Labels: Encoding, Initial candidate solutions, Initial population, Selection, Crossover, Mutation, Best chromosome, Optimal solution; encoded parameters whji, wokj, thj, tok]
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.

Protein Identification Using Multiple Algorithms and Predicted Peptide Separation in HPLC
PIUMA architecture
[Figure: PIUMA architecture. Raw LC/MS/MS data in mzData or mzXML format give processed MS/MS data. Labeled steps: 1, database searching (Mascot, Sequest, X! Tandem) producing a peptide list; 2, unknown modification search on unmatched spectra; 3, de novo sequencing of unmatched spectra (Lutefisk, novoHMM, Peaks) with a consensus peptide list. Validation based on chromatography modeling and machine learning leads to the protein list and report. Color legend: existing algorithms, algorithms to be developed, method descriptions]
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E. and Zhang, X. Bioinformatics, 2007, 23, 114-118.
Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.

Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
   Spectrum deconvolution
   Quality control
   Alignment
   Normalization
4. Statistical significance test
5. Pattern recognition
6. Molecular networks

Spectrum Deconvolution
GISTool, single sample analysis
1. To differentiate signals arising from the real analytes as opposed to signals arising from contaminants or instrument noise
2.
To reduce data dimensionality, which will benefit downstream statistical analysis.
Functionality
•Smoothing and centralization
•Peak cluster detection
•Charge recognition
•De-isotope
•Peak identification at LC level
•Doublet recognition
•Doublet quantification

GISTool Algorithm
Deconvoluting MS spectra
[Figure: single sample analysis. Mass spectrum, intensity (%) vs. m/z from 747 to 751, with peaks at 748.64, 748.97, 749.29, 749.47, 749.62, 749.97 and 750.50 separated into a +3 peptide (748.6354, 3+) and a +2 peptide (748.9694, 2+)]
Zhang, X.; Hines, W.; Adamec, J.; Asara, J.; Naylor, S.; and Regnier, F. E. J. Am. Soc. Mass Spectrom. 2005, 16, 1181-1191.

Quality Assessment / Control
• Biological Sample QA/C
• protein assay
• Experimental Data QA/C
• 2D K-S test
• Percentile of detected peaks
• Percentile of aligned peaks
• Retention time variance vs. retention time
• m/z variance vs. retention time
• Frequency distribution of RT & m/z variance
[Figures: D value vs. sample ID (1 to 10); retention time variation (%) vs. retention time (min)]
Zhang, X.; Asara, J. M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K. Bioinformatics, 2005, 21, 4054-4059.

Data Alignment
To recognize peaks of the same molecule occurring in different samples from the thousands of peaks detected during the course of an experiment.
1. MS to MS data alignment
•Referenced alignment
•Blind alignment
•Quality depending on the information of peak detection
2. MS to MS/MS data alignment
•Depends on experimental design

LC-MS Data Alignment
XAlign software for proteomics & metabolomics data
•Detecting median sample
M_j = \sum_i I_{i,j} M_{i,j} / \sum_i I_{i,j}
T_j = \sum_i I_{i,j} T_{i,j} / \sum_i I_{i,j}
D_i = \sum_{j=1}^{s} |T_{i,j} - \mu_j|
[Figure: retention time difference (min) vs. retention time (min)]
•Aligning samples to the median sample
[Figure: intensity of aligned peaks in sample 2 plotted against sample 1, logarithmic axes from 10 to 10000; fit y = 1.3636x + 16.511, R^2 = 0.9475]
Zhang, X.; Asara, J.
M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K. Bioinformatics, 2005, 21, 4054-4059.

Chromatogram of Serum Analyzed on GCxGC/TOF-MS

GCxGC-MS Data Alignment
metabolite component of human serum
•Four dimension
•1535 peaks have been detected

GCxGC/TOF-MS Data Alignment
MSort software for metabolomics
Criteria for alignment
•1st dim. rt
•2nd dim. rt
•spec. correlation
Features
•peak entry merging
•cont. exclusion
Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215

Analysis Results of MAlign
53 standard acids
[Figures: the number of rows in the alignment table vs. the number of peak entries in a row of the alignment table (1 to 16); peak area (x 10^5) vs. the number of peak entries in a row of the alignment table (1 to 16)]
1. 8 [OA + FA] samples and 8 [AA + FA] samples
2. derivatization reagent: (N-Methyl-N-t-butyldimethylsilyl)-trifluoroacetamide (MTBSTFA)
Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215

Normalization
To reduce concentration effect and experimental variance to make the data comparable.
Methods
1. Log linear model: x_{ij} = a_i r_j e_{ij}
2. Reference sample normalization
3. Auto-scaling
4. Constant mean / trimmed constant mean
5. Constant median / trimmed constant median
[Figure: intensity vs. peak index]

CV Distribution of Peak Intensities
human serum sample
Log linear model:
x_{ij} = a_i r_j e_{ij}
\log(x_{ij}) = \log(a_i) + \log(r_j) + \log(e_{ij})
[Figures: intensity variation before normalization (annotated 20.7%) and after normalization (annotated 17.3%); each shown as rel peak no (%) vs. CV and Frequency vs. CV]

Roadmap
Systems Biology
Differential omics
1. Experimental design
2.
Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks

Statistical Significance Tests
To find individual peaks for which there are significant differences between groups.
Methods
1. Pair-wise t-test (diff. mean?)
2. Mann-Whitney U test (diff. median?)
3. Kolmogorov-Smirnov test (diff. population?)
4. Kruskal-Wallis analysis of variance

Statistical Significance Tests
metabolome of great blue heron fertilized eggs contaminated by PCBs
PCBs: polychlorinated biphenyls
[Figure: p (-log) vs. fold change (log), with down-regulated and up-regulated regions marked; fold change = I_c / I_n; blue line: p = 0.05; dashed line: fold change (log) = 0]

Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks

Clustering or Classification
The resulting pattern recognition provides the first glimpse of improved understanding of the underlying biology.
Unsupervised Methods
•Principal component analysis (PCA)
•Clustering objects on subsets of attributes (COSA)
Supervised Methods
•Linear discriminant analysis (LDA)
•Support vector machine (SVM)
•Artificial neural network (ANN)

Cross Species Comparison
•27 of the 28 control humans and all 8 control rats cluster to one group
•11 of the 14 diseased humans and all diseased rats cluster to a second group

Differential Metabolomics of Human Blood
breast cancer samples vs. control samples

Roadmap
Systems Biology
Differential omics
1. Experimental design
2. Protein identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6.
Molecular networks
   correlation network
   interaction network
   regulation network
   pathway analysis

Molecular Correlation Analysis
pair-wise correlation of proteins and metabolites
[Figure: molecules A, B, C, D, …, Z measured in samples S1 to S8, grouped as Diseased and Healthy]

Molecular Correlation Network
an example of drug effect on disease state
•Reveal important relationships among the various components
•Complementary to abundance level information
•Provides information about the biochemical processes underlying the disease or drug response
[Figure: molecular correlation network. Node labels include proteins and peptides (e.g. ApoE_1, ApoA1_6, FetuinA_2, Hemopex_1), lipids (e.g. C52:2 TG, C34:2 PC, C18:1 LPC, C20:4 CE), small molecules (e.g. leucine, valine, lactate, a-glucose) and other measurements (e.g. HDL, TRIG, GLUC, BUN). Legend: Lipids (LCMS), NMR (DE) diffusion, NMR (CPMG), Peptides, Proteins, Clinical; positive correlation, negative correlation; higher in treated group, lower in treated group]
Clish, C. B.; Davidov, E.; Oresic, M.; Plasterer, T.; Lavine, G.; Londo, T. R.; Meys, M.; Snell, P.; Stochaj, W.; Adourian, A.; Zhang, X.; Morel, N.; Neumann, E.; Verheij, E.; Vogels, J. T. W. E.; Havekes, L. M.; Afeyan, N.; Regnier, F. E.; Greef, J.; Naylor, S. Omics: A Journal of Integrative Biology 2004, 8, 3-13.

SysNet: Interactive Visual Data Mining of Molecular Correlation Network
An interactive integration and visualization environment for molecular correlation of ‘omics data.
•Integrating molecular expression information generated in different ‘omics
•Visualizing molecular correlation in interactive mode
•Enabling time course data visualization and analysis
•Automatically organizing molecules based on their expression pattern in time course.
Zhang, M.; Ouyang, Q.; Stephenson, A.; Salt, D.; Kane, D. M.; Burgner J.; Buck, C. and Zhang, X. Accepted by BMC Systems Biology.

Biomarker Verification
Wet-lab verification
•AQUA
•MRM
•Antibody
In-silico verification
•tracing lineage
•pathway analysis

Automated Lineage Tracing
•Interested in identifying the connections between input and output data for a program
•Tracing of fine-grained lineage through run-time analysis
•Developed based on dynamic slicing techniques used in debugging
•Applicable to any arbitrary function
[Figure labels: Analysis Software, Lineage Tracing]
Zhang, M.; Zhang, X.; Zhang, X. and Prabhakar, S. 33rd International Conference on Very Large Data Bases (VLDB 2007), 2007.

Summary
• The informatics platform developed in my group can be used to analyze protein and metabolite profiling data to differentiate disease and normal samples for biomarker discovery
• Groups identified using clustering analysis reflected the phenotypic categories of cancer and control samples, the animal and human subjects, etc.
with a high degree of accuracy
• The application of SysNet, using an interactive visual data mining approach, integrates omics data into a single environment, which enables biologists to perform data mining
• Lineage tracing technology is an efficient and effective approach for in-silico biomarker verification. This technique will significantly reduce the false discovery rate (FDR) of biomarker discovery

Acknowledgements
Irina Fedulova, Dr. John Burger, Dr. David Clemmer, Dr. Hamid Mirzaei, Dr. Michael D. Kane, Dr. John Asara, Dr. Cheolhwan Oh, Dr. Fred E. Regnier, Dr. Mu Wang, Sergey E. Pevtsov, Dr. David Salt, Dr. Jake Chen, Ouyang Qi, Dr. Mohammad Sulma, Dr. Steve Valentine, Alan Stephenson, Dr. Daniel Raftery, Dr. Steve Naylor, Mingwu Zhang, Dr. Sunil Prabhakar

Postdoc Positions
Posting Title: Industrial Postdoctoral Fellow - Bioinformatician
Work Location: University of Louisville, KY
Job Type: Full time
Starting Date: Position immediately available
Job Description: Predictive Physiology and Medicine (PPM) Inc. is an exciting health and life sciences company based in Bloomington, Indiana focused on developing analytical systems for the individualized health and wellness industry. We have an immediate opening for a postdoctoral fellow. The successful candidate will develop bioinformatics systems for mass spectrometry based quantitative proteomics and metabolomics.
Requirements: The position requires a bioinformatician with a strong computational background. Priority will be given to candidates with a PhD in bioinformatics, computer science, statistics, engineering, or computational physics. The successful candidate should have a strong understanding of statistics and pattern recognition. Programming skills in Matlab, Microsoft .NET, or Java to accomplish analyses are required. Experience in analyzing biological data is not required; however, interest in multidisciplinary research is a must.