Integrative Omics for Cancer
Biology
Xiang Zhang, PhD
Department of Chemistry
Center for Regulatory and Environmental Analytical Metabolomics
University of Louisville, Louisville, KY 40292
xiang.zhang@louisville.edu
Systems Biology
is a field of biology that aims at a systems-level understanding of
biological processes, in which many parts are connected to
one another and work together. It attempts to create predictive
models of cells, organs, biochemical processes and complete
organisms.
•Integrative systems biology
Extracting biological knowledge from
the ‘omics through integration
•Predictive systems biology
Predicting the future state of a biosystem using
‘omics knowledge, e.g. in-silico
biosystems
Davidov, E.; Clish, C. B.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 267--288.
Clish, C. B.; Davidov, E.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 3--13.
Omics Space
Differential omics is
the beginning of
Systems Biology
molecule
cell
tissue
organism
…
Differential Proteomics &
Metabolomics
1. Differential proteomics and metabolomics are the qualitative and
quantitative comparison of the proteome and metabolome under
different conditions, with the aim of unraveling complex biological
processes
2. They can be used to study any phenomenon that may
change the proteome and/or metabolome of a living system.
Cancer Biomarker Discovery
NIH
Nano-medicine
Environment
preventative medicine
Food and nutrition
Biomarker Discovery is Major
Research Field of Differential Omics
Biomarkers are naturally occurring biomolecules useful
for measuring the prognosis and/or progress of diseases
and therapies.
These substances may be normally present in small amounts
in the blood or other tissues
When the amounts of these substances change, they may
indicate disease.
Valid biomarkers should
demonstrate drug activity sooner
facilitate clinical trial design by defining patient populations
optimize dosing for safety and efficacy
be sensitive and easy to assay to speed drug development
What Types of Change Are
Expected?
•Sensing structural
change is a major element
of comparative
proteomics
•Most metabolomics
work focuses on
concentration change
only.
Challenges in Proteomics
Sample complexity
About 25K protein-coding genes are present in
human. IPI human database (v3.25) has 67,250 entries,
which could generate about $10^{6}$ to $10^{8}$ peptides
More than one hundred post-translational modifications
(PTMs) could happen in a proteome
Large protein concentration difference
$10^{7}$ to $10^{8}$ in human cells, and at least $10^{12}$ in human plasma
Dynamic range of an LC-MS is about $10^{4}$ to $10^{6}$
The top 12 highly abundant proteins constitute
approximately 95% of total protein mass of
plasma/serum
Albumin, IgG, Fibrinogen, Transferrin, IgA, IgM,
Haptoglobin, alpha 2-Macroglobulin, alpha 1-Acid
Glycoprotein, alpha 1-Antitrypsin and HDL (Apo A-I &
Apo A-II).
Dynamic system, large subject variation
Challenges in Metabolomics
•Metabolites have a wide range of molecular
weights and large variations in concentration
•The metabolome is much more dynamic
than proteome and genome, which makes the
metabolome more time sensitive
•Metabolites can be either polar or nonpolar,
as well as organic or inorganic molecules.
This makes the chemical separation a key
step in metabolomics
•Metabolites have chemical structures, which
makes the identification using MS an
extreme challenge
[Figure: chemical structure of cholesta-3,5-diene]
Differential Omics
biomarker discovery
Diseased Healthy
A A A A A A A A
B B B B B B B B
C C C C C C C C
D D D D D D D D
… … … … … … … …
Z Z Z Z Z Z Z Z
S1 S2 S3 S4 S5 S6 S7 S8
Informatics Platform
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
MDLC Platforms
• MudPIT, i.e. SCX followed by RP
• The proteome is split into 10-20X more
fractions
• There is carry-over between fractions
• LC fractions generally still are too complex
for MS
• Affinity Selection
• Avidin selection of Cys-containing peptides
• Cu-IMAC for His-containing peptides
• Ga-IMAC for phosphorylated peptides
• Lectins for glycosylated peptides
[Figure: workflow diagram with labels Sample, APR, AP, Digestion, SCX, fractions F1, F2, RPC-MS]
Qiu, R.; Zhang, X. and Regnier, F. E. J. Chromatogr. B. 2007, 845, 143-150.
Wang, S.; Zhang, X.; and Regnier, F. E. J. Chromatogr. A 2002, 949, 153-162.
Regnier, F. E.; Amini, A.; Chakraborty, A.; Geng, M.; Ji, J.; Sioma, C.; Wang, S.; and Zhang, X. LC/GC 2001, 19(2), 200-213.
Geng, M.; Zhang, X.; Bina, M.; and Regnier, F. E. J. Chromatogr. B 2001, 752, 293-306.
In-Gel Stable Isotope Labeling
a sample gel based platform
•Avoiding gel-to-gel variability
•Only labeling K-containing peptides
•Accurate quantification
Asara, J. M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. Nature Protocols, 2006, 1, 46-51.
Asara, J. M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. J. Proteome Res., 2006, 5, 155-163.
Ji, J.; Chakraborty, A.; Geng M.; Zhang, X.; Amini, A.; Bina, M.; and Regnier, F. E. J. Chromatogr. A 2000, 745, 197-210.
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
protein identification
metabolite identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Protein Identification
database searching
The database searching approach uses a protein database to
find a peptide for which a theoretically predicted spectrum best
matches experimental data.
[Figure: database search workflow with labels Protein, Peptide, Mass, matched peptide]
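The candidate-selection step of a database search can be sketched as follows. This is a minimal illustration, not the method of any particular search engine: the protein sequences, the simplified trypsin rule, and the 0.02 Da tolerance are assumptions made for the example. Real engines go on to score predicted fragment spectra against the experimental MS/MS spectrum, which is omitted here.

```python
import re

# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """In-silico digestion: cleave after K or R unless followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def search(observed_mass, database, tol=0.02):
    """Return (protein, peptide) pairs whose mass is within tol of the observed mass."""
    hits = []
    for name, sequence in database.items():
        for peptide in tryptic_peptides(sequence):
            if abs(peptide_mass(peptide) - observed_mass) <= tol:
                hits.append((name, peptide))
    return hits


# Hypothetical two-protein database, used only for illustration.
database = {"protA": "MKTAYIAKQR", "protB": "GASPVK"}
print(search(557.317, database))
```

Each hit would then be ranked by how well its theoretical fragment spectrum matches the measured one.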
Protein Identification
database searching
More than 20 algorithms have been developed, including:
Sequest
Spectrum Mill
Mascot
X! Tandem
OMSSA
1. About 20% of tandem MS spectra could provide confident peptide
identification
2. < 50% of peptides can be identified by all algorithms
Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.
Protein Identification
de novo sequencing
de novo sequencing
reconstructs the
partial or complete
sequence of a
peptide directly from
its MS/MS spectrum.
Performance of de novo
method is limited by low mass
accuracy, mass equivalence,
and completeness of
fragmentation.
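The limitations listed above can be illustrated with a small sketch that reads residues from the mass gaps between consecutive fragment ions of one series. It is a toy example under stated assumptions (a clean ladder of same-type, singly charged ions; the function names and tolerances are hypothetical), not the algorithm of the cited papers.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_for_gap(gap, tol):
    """All residues whose mass matches the gap within the tolerance."""
    return sorted(aa for aa, mass in RESIDUE_MASS.items() if abs(mass - gap) <= tol)


def read_ladder(ion_masses, tol=0.01):
    """Candidate residues for each gap between consecutive ions; '?' if none fits."""
    ions = sorted(ion_masses)
    return [residues_for_gap(b - a, tol) or ["?"] for a, b in zip(ions, ions[1:])]


# Hypothetical b-ion ladder b1..b3 of a peptide starting with S-L-K.
ladder = [88.03931, 201.12337, 329.21833]
print(read_ladder(ladder, tol=0.01))
print(read_ladder(ladder, tol=0.05))
```

With the tight tolerance the second gap resolves to K, while the looser tolerance also admits Q (low mass accuracy). The first gap is ambiguous between I and L at any tolerance (mass equivalence), and a missing ion produces a gap that matches no single residue (incomplete fragmentation).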
Pevtsov, S.; Fedulova, I.; Mirzaei, H.; Buck, C.; Zhang, X. Journal of Proteome Research. 2006, 5, 3018-3028.
Fedulova, I.; Ouyang, Z.; Buck, C.; Zhang, X. The Open Spectroscopy Journal 2007, 1, 1-8.
Incorporating Peptide Separation
Information for Protein Identification
structure of pattern classifier
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.
Training the ANNs with a Genetic
Algorithm
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.
Protein Identification Using Multiple
Algorithms and Predicted Peptide
Separation in HPLC
PIUMA architecture
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E. and Zhang, X. Bioinformatics, 2007, 23, 114-118.
Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
Spectrum deconvolution
Quality control
Alignment
Normalization
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Spectrum Deconvolution
GISTool, single sample analysis
1. To differentiate signals arising from the real analytes as opposed
to signals arising from contaminants or instrument noise
2. To reduce data dimensionality, which will benefit downstream
statistical analysis.
Functionality
•Smoothing and centralization
•Peak cluster detection
•Charge recognition
•De-isotoping
•Peak identification at LC level
•Doublet recognition
•Doublet quantification
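Charge recognition is commonly done from the spacing of isotope peaks, which is roughly 1.00335/z on the m/z axis. The sketch below shows that idea only. It is not GISTool's implementation, and the function names, the tolerance, and the example m/z values are assumptions.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference, in Da
PROTON = 1.007276          # proton mass, in Da


def infer_charge(mz_a, mz_b, max_charge=6, tol=0.005):
    """Infer the charge state from two adjacent isotope peaks, or None if no fit."""
    spacing = abs(mz_b - mz_a)
    for z in range(1, max_charge + 1):
        if abs(spacing - ISOTOPE_SPACING / z) <= tol:
            return z
    return None


def neutral_mass(mz, z):
    """Neutral monoisotopic mass of a protonated ion observed at m/z with charge z."""
    return z * (mz - PROTON)


# Hypothetical isotope pair spaced by about 0.5 m/z units.
z = infer_charge(500.0000, 500.5017)
print(z, neutral_mass(500.0000, z))
```

Once the charge is known, the isotope cluster can be collapsed to a single monoisotopic mass, which is what reduces the data dimensionality.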
GISTool Algorithm
Deconvoluting MS spectra
[Figure: MS spectrum with peaks labeled 748.6354 (3+) and 748.9694 (2+)]
Single sample
analysis
Zhang, X.; Hines, W.; Adamec, J.; Asara, J.; Naylor, S.; and Regnier, F. E. J. Am. Soc. Mass
Spectrom. 2005, 16, 1181-1191.
Quality Assessment / Control
• Biological Sample QA/C
• protein assay
• Experimental Data QA/C
• 2D K-S test
• Percentile of detected peaks
• Percentile of aligned peaks
• Retention time variance vs.
retention time
• m/z variance vs. retention time
• Frequency distribution of RT & m/z
variance
Zhang, X.; Asara, J. M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K.
Bioinformatics, 2005, 21, 4054-4059.
Data Alignment
To recognize peaks of the same molecule occurring in different
samples from the thousands of peaks detected during the
course of an experiment.
1. MS to MS data alignment
•Referenced alignment
•Blind alignment
•Quality depends on the information from peak detection
2. MS to MS/MS data alignment
•Depends on experimental design
LC-MS Data Alignment
XAlign software for proteomics & metabolomics
data
•Detecting median sample
$M_j = \sum_i I_{i,j} M_{i,j} / \sum_i I_{i,j}$
$T_j = \sum_i I_{i,j} T_{i,j} / \sum_i I_{i,j}$
$D_i = \sum_{j=1}^{s} |T_{i,j} - \mu_j|$
•Aligning samples to the median sample
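The median-sample step can be sketched as below. This is a minimal reading of the formulas above, not the XAlign source code. It assumes that i indexes samples, that j indexes s peaks common to all samples, and that the reference retention time for peak j is the intensity-weighted mean T_j. The data are made up.

```python
def weighted_means(values, intensities):
    """Intensity-weighted mean of each peak (column) across samples (rows)."""
    n_peaks = len(values[0])
    means = []
    for j in range(n_peaks):
        total_intensity = sum(row[j] for row in intensities)
        weighted_sum = sum(i_row[j] * v_row[j] for i_row, v_row in zip(intensities, values))
        means.append(weighted_sum / total_intensity)
    return means


def median_sample(rt, intensities):
    """Index of the sample with the smallest summed retention-time deviation D_i."""
    reference = weighted_means(rt, intensities)
    distances = [sum(abs(t - r) for t, r in zip(row, reference)) for row in rt]
    return distances.index(min(distances)), distances


# Hypothetical retention times (minutes) of two peaks in three samples.
rt = [[10.0, 20.0], [10.2, 20.2], [10.4, 20.4]]
intensities = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
index, distances = median_sample(rt, intensities)
print(index, distances)
```

The remaining samples would then be aligned against the sample returned here.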
Zhang, X.; Asara, J. M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K.
Bioinformatics, 2005, 21, 4054-4059.
Chromatogram of Serum Analyzed on GCxGC/TOF-MS
GCxGC-MS Data Alignment
metabolite component of human serum
•Four dimensions
•1535 peaks have
been detected
GCxGC/TOF-MS Data Alignment
MSort software for metabolomics
Criteria for alignment
•1st-dimension retention time
•2nd-dimension retention time
•spectral correlation
Features
*peak entry merging
*cont. exclusion
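The three alignment criteria can be combined into a single match test, sketched below. This is not the MSort implementation. The Pearson correlation as the spectral similarity measure, the tolerance values, and the dictionary keys are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length intensity vectors (non-constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def same_compound(p, q, tol_rt1=5.0, tol_rt2=0.05, min_corr=0.9):
    """True if two peaks agree in both retention times and in their mass spectra."""
    return (
        abs(p["rt1"] - q["rt1"]) <= tol_rt1
        and abs(p["rt2"] - q["rt2"]) <= tol_rt2
        and pearson(p["spectrum"], q["spectrum"]) >= min_corr
    )


# Hypothetical peaks: retention times in seconds, spectra binned on a shared m/z grid.
peak_a = {"rt1": 600.0, "rt2": 1.20, "spectrum": [0, 100, 20, 5]}
peak_b = {"rt1": 602.0, "rt2": 1.22, "spectrum": [0, 95, 25, 4]}
peak_c = {"rt1": 601.0, "rt2": 1.21, "spectrum": [100, 0, 5, 20]}
print(same_compound(peak_a, peak_b), same_compound(peak_a, peak_c))
```

Here peak_c elutes at nearly the same position as peak_a but has a different spectrum, so the retention times alone would have produced a false match.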
Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215
Analysis Results of MSort
53 standard acids
1. 8 [OA + FA] samples and 8 [AA + FA] samples
2. derivatization reagent: N-Methyl-N-(t-butyldimethylsilyl)trifluoroacetamide (MTBSTFA)
Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215
Normalization
To reduce concentration effect and
experimental variance to make the
data comparable.
Methods
1. Log linear model $x_{ij} = a_i r_j e_{ij}$
2. Reference sample normalization
3. Auto-scaling
4. Constant mean / trimmed constant mean
5. Constant median / trimmed constant median
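A minimal sketch of the log linear model follows. The slide does not say what the indices stand for, so this example assumes that i indexes peaks and j indexes samples, that a_i is the peak abundance and r_j a sample-specific scaling factor. The estimator (column means in log space, centered on the grand mean) is one simple choice among several.

```python
from math import exp, log


def log_linear_normalize(x):
    """Remove per-sample scaling r_j from x[i][j] (peak i, sample j; all values > 0).

    Returns the normalized matrix and the estimated scaling factors.
    """
    logs = [[log(value) for value in row] for row in x]
    n_peaks, n_samples = len(logs), len(logs[0])
    # Mean log intensity of each sample (column).
    col_mean = [sum(logs[i][j] for i in range(n_peaks)) / n_peaks for j in range(n_samples)]
    grand_mean = sum(col_mean) / n_samples
    # log(r_j): how far each sample sits above or below the grand mean.
    log_r = [m - grand_mean for m in col_mean]
    normalized = [
        [exp(logs[i][j] - log_r[j]) for j in range(n_samples)] for i in range(n_peaks)
    ]
    return normalized, [exp(v) for v in log_r]


# Hypothetical data: sample 2 was loaded at twice the concentration of sample 1.
x = [[100.0, 200.0], [10.0, 20.0], [50.0, 100.0]]
normalized, factors = log_linear_normalize(x)
print(normalized)
print(factors)
```

After normalization each peak has the same intensity in both samples, so the remaining differences between real samples would reflect biology rather than loading.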
CV Distribution of Peak Intensities
human serum sample
[Figure: CV distributions of peak intensities; values shown: 20.7% and 17.3%]
Log linear model:
$x_{ij} = a_i r_j e_{ij}$
$\log(x_{ij}) = \log(a_i) + \log(r_j) + \log(e_{ij})$
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Statistical Significance Tests
To find individual peaks for which there are significant
differences between groups.
Methods
1. Pair-wise t-test (diff. mean?)
2. Mann-Whitney U test (diff. median?)
3. Kolmogorov-Smirnov test (diff. population?)
4. Kruskal-Wallis analysis of variance
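As an example of one of the listed methods, the sketch below runs an exact two-sided Mann-Whitney U test on a single peak by enumerating every possible group labeling. This is practical only for small groups and assumes no tied intensities. The intensity values are made up.

```python
from itertools import combinations


def mann_whitney_exact(x, y):
    """Exact two-sided Mann-Whitney U test for small samples without ties.

    Returns (U for group x, p-value).
    """
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)

    def u_statistic(group1_idx):
        # U counts the (a, b) pairs with a from group 1, b from group 2 and a > b.
        group1 = [pooled[k] for k in group1_idx]
        group2 = [pooled[k] for k in range(len(pooled)) if k not in group1_idx]
        return sum(a > b for a in group1 for b in group2)

    observed = u_statistic(tuple(range(n1)))
    center = n1 * n2 / 2
    all_u = [u_statistic(idx) for idx in combinations(range(len(pooled)), n1)]
    # Fraction of labelings at least as far from the null expectation as observed.
    extreme = sum(abs(u - center) >= abs(observed - center) for u in all_u)
    return observed, extreme / len(all_u)


# Hypothetical intensities of one peak in 4 diseased and 4 healthy samples.
diseased = [10.0, 11.0, 12.0, 13.0]
healthy = [1.0, 2.0, 3.0, 4.0]
u, p = mann_whitney_exact(diseased, healthy)
print(u, p)
```

With complete separation of two groups of four, the smallest achievable two-sided p-value is 2/70, which shows how group size limits the significance any single peak can reach.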
Statistical Significance Tests
metabolome of great blue heron fertilized eggs
contaminated by PCBs
PCBs: polychlorinated biphenyls
[Figure: significance vs. fold change plot with regions labeled down-regulated and up-regulated]
fold change = I_c / I_n
blue line: p=0.05
dashed line: fold change = 0
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Clustering or Classification
Pattern recognition provides a first glimpse into
the underlying biology.
Unsupervised Methods
Principal component analysis (PCA)
Clustering objects on subsets of attributes (COSA)
Supervised Methods
Linear discriminant analysis (LDA)
Support vector machine (SVM)
Artificial neural network (ANN)
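To make the PCA entry concrete, the sketch below finds the first principal component by power iteration on the covariance matrix. It is a teaching example with assumed data. In practice a linear algebra library would be used, and the start vector here is assumed not to be orthogonal to the leading component.

```python
from math import sqrt


def first_principal_component(data, iterations=200):
    """Leading principal axis and per-sample scores for data[sample][feature]."""
    n, d = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - means[k] for k in range(d)] for row in data]
    # Sample covariance matrix of the features.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[k] * v[k] for k in range(d)) for r in centered]
    return v, scores


# Hypothetical intensities of two perfectly co-varying peaks in four samples.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
axis, scores = first_principal_component(data)
print(axis)
print(scores)
```

Plotting the scores of the first two components is the usual way to see whether diseased and control samples separate without using the class labels.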
Cross Species Comparison
27 of the 28 control humans and all 8 control rats cluster into one group
11 of the 14 diseased humans and all diseased rats cluster into a second group
Differential Metabolomics of
Human Blood
breast cancer samples vs. control samples
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
correlation network
interaction network
regulation network
pathway analysis
Molecular Correlation Analysis
pairwise correlation of proteins and metabolites
Diseased Healthy
A A A A A A A A
B B B B B B B B
C C C C C C C C
D D D D D D D D
… … … … … … … …
Z Z Z Z Z Z Z Z
S1 S2 S3 S4 S5 S6 S7 S8
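Pairwise correlation over a molecules-by-samples matrix like the one above can be sketched as follows. The abundance profiles and the 0.8 threshold are hypothetical, and Pearson correlation is assumed as the correlation measure.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlation_network(profiles, threshold=0.8):
    """Edges (molecule1, molecule2, r) for every pair with |r| >= threshold."""
    edges = []
    for m1, m2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[m1], profiles[m2])
        if abs(r) >= threshold:
            edges.append((m1, m2, round(r, 3)))
    return edges


# Hypothetical abundances of molecules A-D across samples S1-S8.
profiles = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [2, 4, 6, 8, 10, 12, 14, 16],
    "C": [8, 7, 6, 5, 4, 3, 2, 1],
    "D": [5, 1, 4, 2, 5, 1, 4, 2],
}
for edge in correlation_network(profiles):
    print(edge)
```

Negative correlations are kept as edges because an inverse relationship between two molecules can be as informative as a positive one.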
Molecular Correlation Network
an example of drug effect on disease state
•Reveal important relationships
among the various
components
•Complementary to abundance
level information
•Provides information about
the biochemical processes
underlying the disease or drug
response
Clish, C. B.; Davidov, E.; Oresic, M.; Plasterer, T.; Lavine, G.; Londo, T. R.; Meys, M.; Snell, P.; Stochaj, W.; Adourian, A.;
Zhang, X.; Morel, N.; Neumann, E.; Verheij, E.; Vogels, J, T.W.E.; Havekes, L. M.; Afeyan, N.; Regnier, F. E.; Greef, J.;
Naylor, S. Omics: A Journal of Integrative Biology 2004, 8, 3--13.
SysNet: Interactive Visual Data
Mining of Molecular Correlation
Network
An interactive integration and
visualization environment for
molecular correlation of ‘omics data.
•Integrating molecular expression
information generated in different ‘omics
•Visualizing molecular correlation in
interactive mode
•Enabling time course data visualization and
analysis
•Automatically organizing molecules based
on their expression pattern in time course.
Zhang, M.; Ouyang, Q.; Stephenson, A.; Salt, D.; Kane, D. M.; Burgner, J.; Buck, C. and Zhang, X.
Accepted by BMC Systems Biology.
Biomarker Verification
Wet-lab verification
AQUA
MRM
Antibody
In-silico verification
tracing lineage
pathway analysis
Automated Lineage Tracing
•Interested in identifying the
connections between input and
output data for a program
•Tracing of fine-grained lineage
through run-time analysis
•Developed based on dynamic slicing
techniques used in debugging
•Applicable to any arbitrary
function
[Figure: diagram with labels Analysis Software, Lineage Tracing]
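The idea of fine-grained lineage can be illustrated with a toy value wrapper that records, at run time, which inputs contributed to each result. This is only an illustration of the concept. It is not the dynamic-slicing technique of the cited VLDB paper, and the class name and peak labels are hypothetical.

```python
import operator


class Traced:
    """A number that remembers which named inputs it was computed from."""

    def __init__(self, value, lineage):
        self.value = value
        self.lineage = frozenset(lineage)

    def _combine(self, other, op):
        # Results inherit the union of the lineages of their operands.
        if isinstance(other, Traced):
            return Traced(op(self.value, other.value), self.lineage | other.lineage)
        return Traced(op(self.value, other), self.lineage)

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __sub__(self, other):
        return self._combine(other, operator.sub)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    def __truediv__(self, other):
        return self._combine(other, operator.truediv)


# Hypothetical peak intensities fed into an analysis.
raw = {"peak1": 10.0, "peak2": 30.0, "peak3": 5.0}
inputs = {name: Traced(value, {name}) for name, value in raw.items()}

ratio = inputs["peak1"] / inputs["peak2"]
total = inputs["peak1"] + inputs["peak2"] + inputs["peak3"]
print(sorted(ratio.lineage), sorted(total.lineage))
```

For in-silico verification, tracing a candidate biomarker back to the raw peaks it was derived from lets those peaks be inspected directly.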
Zhang, M.; Zhang, X.; Zhang, X. and Prabhakar, S. 33rd International
Conference on Very Large Data Bases (VLDB 2007), 2007.
Summary
• The informatics platform developed in my group can be used to analyze
protein and metabolite profiling data to differentiate disease and
normal samples for biomarker discovery
• Groups identified using clustering analysis reflected the phenotypic
categories of cancer and control samples, the animal and human
subjects, etc. with a high degree of accuracy
• SysNet uses an interactive visual data mining
approach to integrate omics data into a single environment, which
enables biologists to perform data mining
• Lineage tracing technology is an efficient and effective approach for
in-silico biomarker verification. This technique will significantly
reduce the false discovery rate (FDR) of biomarker discovery
Acknowledgements
Irina Fedulova Dr. John Burger Dr. David Clemmer
Dr. Hamid Mirzaei Dr. Michael D. Kane Dr. John Asara
Dr. Cheolhwan Oh Dr. Fred E. Regnier Dr. Mu Wang
Sergey E. Pevtsov Dr. David Salt Dr. Jake Chen
Ouyang Qi Dr. Mohammad Sulma Dr. Steve Valentine
Alan Stephenson Dr. Daniel Raftery Dr. Steve Naylor
Mingwu Zhang Dr. Sunil Prabhakar
Postdoc Positions
Posting Title: Industrial Postdoctoral Fellow - Bioinformatician
Work Location: University of Louisville, KY
Job Type: Full time
Starting Date: Position immediately available
Job Description: Predictive Physiology and Medicine (PPM) Inc. is an exciting
health and life sciences company based in Bloomington, Indiana focused on
developing analytical systems for the individualized health and wellness industry.
We have an immediate opening for a postdoctoral fellow. The successful
candidate will develop bioinformatics systems for mass spectrometry based
quantitative proteomics and metabolomics.
Requirements: The position requires a bioinformatician with a strong
computational background. Priority will be given to candidates with a PhD in
bioinformatics, computer science, statistics, engineering, or computational physics.
The successful candidate should have a strong understanding of statistics and
pattern recognition. Programming skills in Matlab, Microsoft .NET, or Java to
accomplish analyses are required. Experience in analyzing biological data is not
required; however, interest in multidisciplinary research is a must.