Integrative Omics for Cancer Biology
Xiang Zhang, PhD
Department of Chemistry
Center for Regulatory and Environmental Analytical Metabolomics
University of Louisville, Louisville, KY 40292
xiang.zhang@louisville.edu
Systems Biology
Systems biology is a field of biology that aims at a systems-level understanding of biological processes, in which many connected parts work together. It attempts to create predictive models of cells, organs, biochemical processes and complete organisms.
•Integrative systems biology
Extracting biological knowledge from the ‘omics through integration
•Predictive systems biology
Predicting the future of a biosystem using ‘omics knowledge, e.g. in-silico biosystems
Davidov, E.; Clish, C. B.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 267--288.
Clish, C. B.; Davidov, E.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 3--13.
Omics Space
Differential omics is the beginning of Systems Biology
[Figure: omics space spanning molecule, cell, tissue, organism, …]
Differential Proteomics & Metabolomics
1. Differential proteomics and metabolomics are the qualitative and quantitative comparison of the proteome and metabolome under different conditions, with the aim of unraveling complex biological processes.
2. They can be used to study any phenomenon that may change the proteome and/or metabolome of a living system.
Cancer Biomarker Discovery
NIH
Nano-medicine
Environment
preventative medicine
Food and nutrition
Biomarker Discovery Is a Major Research Field of Differential Omics
Biomarkers are naturally occurring biomolecules useful for measuring the prognosis and/or progress of diseases and therapies.
•These substances may normally be present in small amounts in the blood or other tissues.
•When the amounts of these substances change, they may indicate disease.
Valid biomarkers should
•demonstrate drug activity sooner
•facilitate clinical trial design by defining patient populations
•optimize dosing for safety and efficacy
•be sensitive and easy to assay to speed drug development
What Types of Change Are Expected?
[Figure: protein structure is unchanged vs. protein structure is changed; labeled changes: concentration, degradation, post-translational modification, sequence (mutation)]
•Sensing structural change is a major element of comparative proteomics
•Most metabolomics work focuses on concentration change only.
Challenges in Proteomics
Sample complexity
•About 25K protein-coding genes are present in humans. The IPI human database (v3.25) has 67,250 entries, which could generate about 10^6 to 10^8 peptides
•More than one hundred post-translational modifications (PTMs) could happen in a proteome
Large protein concentration difference
•10^7 to 10^8 in human cells, and at least 10^12 in human plasma
•The dynamic range of an LC-MS is about 10^4 to 10^6
•The top 12 high-abundance proteins constitute approximately 95% of the total protein mass of plasma/serum: Albumin, IgG, Fibrinogen, Transferrin, IgA, IgM, Haptoglobin, alpha 2-Macroglobulin, alpha 1-Acid Glycoprotein, alpha 1-Antitrypsin and HDL (Apo A-I & Apo A-II)
Dynamic system, large subject variation
[Figure: Body fluid profiling: biomarker platform. Concentration scale g/ml, ng/ml, pg/ml; generic sample prep. for high concentration compounds, focused sample prep. for low concentration compounds]
Challenges in Metabolomics
•Metabolites have a wide range of molecular weights and large variations in concentration
•The metabolome is much more dynamic than the proteome and genome, which makes the metabolome more time sensitive
•Metabolites can be either polar or nonpolar, as well as organic or inorganic molecules. This makes chemical separation a key step in metabolomics
•Metabolites have varied chemical structures (figure example: cholesta-3,5-diene), which makes identification using MS an extreme challenge
Differential Omics
biomarker discovery
Diseased Healthy
A A A A A A A A
B B B B B B B B
C C C C C C C C
D D D D D D D D
… … … … … … … …
Z Z Z Z Z Z Z Z
S1 S2 S3 S4 S5 S6 S7 S8
Informatics Platform
[Figure: informatics platform workflow. Labels recovered from the figure: experiment design, sample information, experiment execution, raw data, LIMS; spectrum deconvolution, peak alignment, normalization, transformation, quality control; significance test, regulated peaks, pattern recognition, cluster loadings, regulated molecules; molecular identification, molecular validation, unidentified molecules, targeted tandem MS; correlation, molecular networks, interaction, pathway modeling, protein function, knowledge assembling, data re-examination]
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
MDLC Platforms
• MudPIT, i.e. SCX followed by RP
• The proteome is split into 10-20X more fractions
• There is carry-over between fractions
• LC fractions generally are still too complex for MS
• Affinity Selection
• Avidin selection of Cys-containing peptides
• Cu-IMAC for His-containing peptides
• Ga-IMAC for phosphorylated peptides
• Lectins for glycosylated peptides
[Figure: workflow labels: Sample, APR, AP, Digestion, SCX, fractions (F1, F2, …), RPC-MS]
Qiu, R.; Zhang, X. and Regnier, F. E. J. Chromatogr. B. 2007, 845, 143-150.
Wang, S.; Zhang, X.; and Regnier, F. E. J. Chromatogr. A 2002, 949, 153-162.
Regnier, F. E.; Amini, A.; Chakraborty, A.; Geng, M.; Ji, J.; Sioma, C.; Wang, S.; and Zhang, X. LC/GC 2001, 19(2), 200-213.
Geng, M.; Zhang, X.; Bina, M.; and Regnier, F. E. J. Chromatogr. B 2001, 752, 293-306.
In-Gel Stable Isotope Labeling
a sample gel based platform
•Avoiding gel-to-gel variability
•Only labeling K-containing peptides
•Accurate quantification
Asara, J. M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. Nature Protocols, 2006, 1, 46-51.
Asara, J. M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. J. Proteome Res., 2006, 5, 155-163.
Ji, J.; Chakraborty, A.; Geng M.; Zhang, X.; Amini, A.; Bina, M.; and Regnier, F. E. J. Chromatogr. A 2000, 745, 197-210.
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
protein identification
metabolite identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Protein Identification
database searching
The database searching approach uses a protein database to find the peptide whose theoretically predicted spectrum best matches the experimental data.
[Figure: protein, peptide, mass, matched peptide]
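The matching idea can be illustrated with a much simpler stand-in: matching an observed peptide mass against an in-silico tryptic digest. This is only a sketch under stated assumptions, not how any particular search engine works: real engines score whole fragment spectra, whereas this toy compares precursor masses only. The database entries, function names and the 10 ppm tolerance are hypothetical; the peptide sequences are borrowed from later slides.

```python
import re

# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide (N-terminal H + C-terminal OH)


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def search(database, observed_mass, tol_ppm=10.0):
    """Return (protein, peptide, mass) for peptides within tol_ppm of the observed mass."""
    hits = []
    for name, sequence in database.items():
        for pep in tryptic_peptides(sequence):
            mass = peptide_mass(pep)
            if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
                hits.append((name, pep, mass))
    return hits


# Hypothetical two-entry database, for illustration only.
TOY_DB = {
    "toy_protein_1": "LSPLGEEMRTHLAPYSDELR",
    "toy_protein_2": "QGLLPVLESFKVQPYLDDFQKK",
}
```

Calling `search(TOY_DB, peptide_mass("LSPLGEEMR"))` returns the single matching peptide from `toy_protein_1`. With a real database, many peptides share nearly the same mass, which is why fragment spectra rather than masses alone are compared.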
Protein Identification
database searching
More than 20 algorithms have been developed, e.g.
•Sequest
•Spectrum Mill
•Mascot
•X! Tandem
•OMSSA
1. About 20% of tandem MS spectra could provide confident peptide identification
2. < 50% of peptides can be identified by all algorithms
Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.
Protein Identification
de novo sequencing
De novo sequencing reconstructs the partial or complete sequence of a peptide directly from its MS/MS spectrum.
The performance of de novo methods is limited by low mass accuracy, mass equivalence, and incomplete fragmentation.
Pevtsov, S.; Fedulova, I.; Mirzaei, H.; Buck, C.; Zhang, X. Journal of Proteome Research. 2006, 5, 3018-3028.
Fedulova, I.; Ouyang, Z.; Buck, C.; Zhang, X. The Open Spectroscopy Journal 2007, 1, 1-8.
Incorporating Peptide Separation
Information for Protein Identification
structure of pattern classifier
[Figure: peptide sequences (VSFLSALEEYTKK, LSPLGEEMR, DYVSQFEGSALGKQLNLK, DSGRDYVSQFEGSALGK, AKPALEDLRQGLLPVLESFK, DLATVYVDVLKDSGR, THLAPYSDELR, QGLLPVLESFKVSFLSALEEYTK, VQPYLDDFQKK, QGLLPVLESFK) pass through feature extraction (Feature 1 … Feature N) into a neural network with input, hidden and output layers, which partitions peptides into flow-through and elution. Network symbols: x_l, w^h, y_m, w^o, z_n]
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.
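Training the ANNs with Genetic Algorithm
[Figure: genetic algorithm loop. Network parameters (w^h_ji, w^o_kj, t^h_j, t^o_k) are encoded as a chromosome; initial candidate solutions form the initial population, which undergoes selection, crossover and mutation; the best chromosome is decoded into the optimal solution]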
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.
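The loop of encoding, selection, crossover and mutation can be sketched generically. This is a minimal illustration, not the training procedure of the cited paper: the population size, truncation selection, one-point crossover, Gaussian mutation and the toy single-neuron fitness function are all assumptions chosen for brevity.

```python
import random


def evolve(fitness, n_genes, pop_size=30, generations=50,
           mutation_rate=0.1, seed=0):
    """Minimize `fitness` over chromosomes of n_genes floats (n_genes >= 2)."""
    rng = random.Random(seed)
    # Initial population: random candidate solutions.
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(n_genes)]
           for _ in range(pop_size)]
    history = []  # best error seen in each generation
    for _ in range(generations):
        pop.sort(key=fitness)              # lower error is better
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]     # truncation selection
        children = [pop[0][:]]             # elitism: keep the best chromosome
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + rng.gauss(0.0, 0.1) if rng.random() < mutation_rate else g
                     for g in child]                 # Gaussian mutation
            children.append(child)
        pop = children
    return min(pop, key=fitness), history


# Toy problem: chromosome = [weight, threshold] of one linear neuron y = w*x + t.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]


def squared_error(chromosome):
    w, t = chromosome
    return sum((w * x + t - y) ** 2 for x, y in zip(XS, YS))


best, history = evolve(squared_error, n_genes=2)
```

Because the best chromosome is always copied into the next generation, the error recorded in `history` never increases. For a full network, the chromosome would simply hold every weight and threshold in a fixed order.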
Protein Identification Using Multiple
Algorithms and Predicted Peptide
Separation in HPLC
PIUMA architecture
[Figure: PIUMA workflow with numbered steps 1-3. Labels recovered from the figure: raw LC/MS/MS data; mzData or mzXML format; processed MS/MS data; database searching (Mascot, Sequest, X! Tandem) giving a peptide list; unmatched spectra; unknown modification search; de novo sequencing (Lutefisk, novoHMM, Peaks) with consensus peptide list; chromatography modeling based on machine learning; validation; protein list; report. Color legend: existing algorithms, algorithms to be developed, method descriptions]
Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E. and Zhang, X. Bioinformatics, 2007, 23, 114-118.
Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
Spectrum deconvolution
Quality control
Alignment
Normalization
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Spectrum Deconvolution
GISTool, single sample analysis
1. To differentiate signals arising from real analytes from signals arising from contaminants or instrument noise
2. To reduce data dimensionality, which benefits downstream statistical analysis.
Functionality
•Smoothing and centralization
•Peak cluster detection
•Charge recognition
•De-isotoping
•Peak identification at LC level
•Doublet recognition
•Doublet quantification
GISTool Algorithm
Deconvoluting MS spectra
[Figure: single sample analysis. An MS spectrum (intensity (%) vs. m/z, 747-751) with overlapping isotope clusters is deconvoluted into a +3 peptide (peaks at 748.64, 748.97, 749.29, 749.62; reported as 748.6354, 3+) and a +2 peptide (peaks at 748.97, 749.47, 749.97, 750.50; reported as 748.9694, 2+)]
Zhang, X.; Hines, W.; Adamec, J.; Asara, J.; Naylor, S.; and Regnier, F. E. J. Am. Soc. Mass
Spectrom. 2005, 16, 1181-1191.
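Charge recognition can be illustrated from isotope spacing alone: neighbouring isotope peaks of an ion with charge z lie about 1.00335/z apart in m/z. The sketch below is an assumption-laden illustration, not the GISTool algorithm; the function names and the tolerance are hypothetical, and the peak lists are the m/z values readable from the figure on this slide.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference, Da
PROTON = 1.007276          # proton mass, Da


def charge_from_spacing(mz_values, max_charge=6, tol=0.03):
    """Estimate charge from the mean gap between consecutive isotope peaks.

    Returns None when no charge in 1..max_charge explains the spacing.
    """
    mz = sorted(mz_values)
    gaps = [b - a for a, b in zip(mz, mz[1:])]
    mean_gap = sum(gaps) / len(gaps)
    z = round(ISOTOPE_SPACING / mean_gap)
    if z < 1 or z > max_charge:
        return None
    if abs(mean_gap - ISOTOPE_SPACING / z) > tol:
        return None
    return z


def neutral_mass(monoisotopic_mz, z):
    """Neutral mass of an ion observed at monoisotopic_mz with z protons."""
    return z * (monoisotopic_mz - PROTON)


cluster_a = [748.64, 748.97, 749.29, 749.62]  # spacing of about 0.33
cluster_b = [748.97, 749.47, 749.97, 750.50]  # spacing of about 0.5
```

Here `charge_from_spacing(cluster_a)` gives 3 and `charge_from_spacing(cluster_b)` gives 2, matching the 3+ and 2+ labels in the figure. Separating the two clusters when they overlap, as they do at 748.97, is the harder part and is not attempted here.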
Quality Assessment / Control
• Biological Sample QA/C
  • protein assay
• Experimental Data QA/C
  • 2D K-S test
  • Percentile of detected peaks
  • Percentile of aligned peaks
  • Retention time variance vs. retention time
  • m/z variance vs. retention time
  • Frequency distribution of RT & m/z variance
[Figure: K-S test D value (0-0.08) vs. sample ID (1-10); retention time variation (%) vs. retention time (min)]
Zhang, X.; Asara, J. M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K.
Bioinformatics, 2005, 21, 4054-4059.
Data Alignment
To recognize peaks of the same molecule occurring in different
samples from the thousands of peaks detected during the
course of an experiment.
1. MS to MS data alignment
•Referenced alignment
•Blind alignment
•Quality depending on the information of peak detection
2. MS to MS/MS data alignment
•Depends on experimental design
LC-MS Data Alignment
XAlign software for proteomics & metabolomics data
•Detecting median sample
$M_j = \sum_i I_{i,j} M_{i,j} / \sum_i I_{i,j}$
$T_j = \sum_i I_{i,j} T_{i,j} / \sum_i I_{i,j}$
$D_i = \sum_{j=1}^{s} |T_{i,j} - \mu_j|$
(I: intensity, M: m/z, T: retention time of peak j in sample i)
[Figure: retention time difference (min), -0.8 to 0.8, vs. retention time (min), 10 to 70]
•Aligning samples to the median sample
[Figure: intensity of aligned peaks (sample 2) vs. intensity of aligned peaks (sample 1), log-log axes from 10 to 10000; y = 1.3636x + 16.511, R^2 = 0.9475]
Zhang, X.; Asara, J. M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K. Bioinformatics, 2005, 21, 4054-4059.
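The median sample detection can be sketched from the formulas on this slide. Two assumptions are made here and are not stated on the slide: that µ_j is the intensity-weighted mean retention time T_j, and that the median sample is the one with the smallest D_i. The data layout (`rt[i][j]` and `intensity[i][j]` for sample i and peak j) is hypothetical.

```python
def weighted_mean(values, weights):
    """Intensity-weighted average, as in the M_j and T_j formulas."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def median_sample(rt, intensity):
    """Return (index of the median sample, list of D_i values).

    rt[i][j] and intensity[i][j] describe peak j in sample i; every sample
    is assumed to contain the same s reference peaks.
    """
    n_samples, n_peaks = len(rt), len(rt[0])
    mu = [
        weighted_mean([rt[i][j] for i in range(n_samples)],
                      [intensity[i][j] for i in range(n_samples)])
        for j in range(n_peaks)
    ]
    # D_i: summed absolute retention time deviation of sample i from mu.
    d = [sum(abs(rt[i][j] - mu[j]) for j in range(n_peaks))
         for i in range(n_samples)]
    best = min(range(n_samples), key=d.__getitem__)
    return best, d


# Three toy samples with two peaks each; the middle sample sits at the centre.
toy_rt = [[10.0, 20.0], [10.2, 20.2], [10.4, 20.4]]
toy_intensity = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
best, d = median_sample(toy_rt, toy_intensity)
```

For the toy data the second sample (index 1) is selected, and the other samples would then be aligned to it.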
Chromatogram of Serum Analyzed on GCxGC/TOF-MS
GCxGC-MS Data Alignment
metabolite component of human serum
•Four dimensions
•1535 peaks have been detected
GCxGC/TOF-MS Data Alignment
MSort software for metabolomics
Criteria for alignment
•1st dimension retention time
•2nd dimension retention time
•spectral correlation
Features
•peak entry merging
•contaminant exclusion
Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215
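The three alignment criteria can be combined into a simple pairwise test. This is only a sketch of the idea, not the MSort implementation: the tolerances, the correlation cutoff, the dictionary layout of a peak entry and the use of Pearson correlation as the spectral similarity are all assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length intensity vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def same_metabolite(p, q, tol_rt1=5.0, tol_rt2=0.05, min_corr=0.9):
    """True when two peak entries agree in both retention times and in spectrum.

    tol_rt1 / tol_rt2 are hypothetical tolerances in seconds for the first
    and second chromatographic dimension.
    """
    return (abs(p["rt1"] - q["rt1"]) <= tol_rt1
            and abs(p["rt2"] - q["rt2"]) <= tol_rt2
            and pearson(p["spectrum"], q["spectrum"]) >= min_corr)


# Toy peak entries: spectra are intensities on a shared m/z grid.
peak_p = {"rt1": 600.0, "rt2": 1.50, "spectrum": [10, 0, 50, 100, 5]}
peak_q = {"rt1": 602.0, "rt2": 1.52, "spectrum": [12, 1, 48, 100, 4]}
peak_r = {"rt1": 601.0, "rt2": 1.51, "spectrum": [100, 50, 0, 5, 10]}
```

`peak_p` and `peak_q` pass all three criteria, while `peak_r` elutes at nearly the same position but is rejected because its spectrum does not correlate. That is the case the spectral criterion exists to catch.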
Analysis Results of MAlign
53 standard acids
[Figure: two panels plotted against the number of peak entries in a row of the alignment table (1-16): the number of rows in the alignment table (0-1000), and peak area (x 10^5, 0-10)]
1. 8 [OA + FA] samples and 8 [AA + FA] samples
2. Derivatization reagent: (N-Methyl-N-t-butyldimethylsilyl)-trifluoroacetamide (MTBSTFA)
Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215
Normalization
To reduce concentration effects and experimental variance so that the data are comparable.
[Figure: intensity (0-8000) vs. peak index (0-1000)]
Methods
1. Log linear model: $x_{ij} = a_i r_j e_{ij}$
2. Reference sample normalization
3. Auto-scaling
4. Constant mean / trimmed constant mean
5. Constant median / trimmed constant median
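Of the methods listed, constant median normalization is the simplest to show. The sketch below assumes one plausible variant: each sample is rescaled so that its median intensity equals the mean of all sample medians. The choice of target and the function name are assumptions, not a description of the group's software.

```python
from statistics import median


def constant_median_normalize(samples):
    """Scale each sample (a list of peak intensities) to a common median.

    The common target is taken as the mean of the per-sample medians,
    which is one possible convention among several.
    """
    medians = [median(s) for s in samples]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in s] for s, m in zip(samples, medians)]


# Two toy samples where the second was loaded at twice the concentration.
raw = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
normalized = constant_median_normalize(raw)
```

After normalization both toy samples share a median of 300 and their peaks become directly comparable. A trimmed variant would compute the scale factor after discarding the most extreme peaks.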
CV Distribution of Peak Intensities
human serum sample
[Figure: intensity variation before and after normalization, each shown as a frequency histogram of CV and a plot of rel. peak no. (%) vs. CV (0.0-1.0); annotated values: 20.7% before normalization, 17.3% after normalization]
Log linear model:
$x_{ij} = a_i r_j e_{ij}$
$\log(x_{ij}) = \log(a_i) + \log(r_j) + \log(e_{ij})$
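The CV distribution can be computed per peak across replicate analyses. This is a minimal sketch with a hypothetical table layout (one row per aligned peak, one column per sample); it does not reproduce the values annotated in the figure.

```python
from statistics import mean, median, stdev


def peak_cvs(table):
    """Coefficient of variation (sample std / mean) for each aligned peak.

    table[j] holds the intensities of peak j across replicate samples.
    """
    return [stdev(row) / mean(row) for row in table]


# Toy alignment table: three peaks measured in three replicates.
toy_table = [
    [90.0, 100.0, 110.0],
    [950.0, 1000.0, 1050.0],
    [40.0, 50.0, 60.0],
]
cvs = peak_cvs(toy_table)
typical_cv = median(cvs)
```

Comparing the distribution of `cvs` before and after normalization shows whether normalization reduced experimental variance.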
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Statistical Significance Tests
To find individual peaks for which there are significant
differences between groups.
Methods
1. Pair-wise t-test (different means?)
2. Mann-Whitney U test (different medians?)
3. Kolmogorov-Smirnov test (different distributions?)
4. Kruskal-Wallis analysis of variance
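As an illustration of one of these tests, the Mann-Whitney U statistic for a single peak can be computed with the standard library. The p-value below uses the normal approximation without tie or continuity correction, which is an assumption made for brevity and is rough for small groups; a statistics package would normally be used instead.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p) for two groups of peak intensities."""
    pooled = sorted((v, g) for g, group in enumerate((x, y)) for v in group)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values and give them the average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation
    return u, p


# Toy intensities of one peak in healthy vs. diseased samples.
healthy = [1.0, 2.0, 3.0, 4.0]
diseased = [5.0, 6.0, 7.0, 8.0]
u, p = mann_whitney_u(healthy, diseased)
```

The two toy groups do not overlap at all, so U is 0. Applying the test to every aligned peak yields one p-value per peak, which then calls for multiple testing control.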
Statistical Significance Tests
metabolome of great blue heron fertilized eggs
contaminated by PCBs
PCBs: polychlorinated biphenyls
[Figure: volcano plot of p (-log, 0-8) vs. fold change (log, -3 to 3), with down-regulated peaks on the left and up-regulated peaks on the right; fold change = I_c / I_n; blue line: p = 0.05; dashed line: fold change (log) = 0]
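The coordinates of such a plot follow directly from the definition fold change = I_c / I_n. The sketch below assumes log base 2 for the fold change and base 10 for the p-value, since the slide only says "log". The function names and the significance cutoff argument are hypothetical.

```python
import math


def volcano_point(i_c, i_n, p_value):
    """Return (log2 fold change, -log10 p) for one peak."""
    return math.log2(i_c / i_n), -math.log10(p_value)


def classify(log_fc, neg_log_p, alpha=0.05):
    """Label a peak relative to the p = alpha line and the zero fold change line."""
    if neg_log_p < -math.log10(alpha):
        return "not significant"
    if log_fc > 0:
        return "up-regulated"
    if log_fc < 0:
        return "down-regulated"
    return "unchanged"


x, y = volcano_point(i_c=400.0, i_n=100.0, p_value=0.001)
label = classify(x, y)
```

A peak that is four times more intense in the contaminated group with p = 0.001 lands in the upper right of the plot and is labeled up-regulated.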
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Molecular identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
Clustering or Classification
The resulting pattern recognition provides a first glimpse of improved understanding of the underlying biology.
Unsupervised Methods
Principal component analysis (PCA)
Clustering objects on subsets of attributes (COSA)
Supervised Methods
Linear discriminant analysis (LDA)
Support vector machine (SVM)
Artificial neural network (ANN)
Cross Species Comparison
27 of the 28 control humans and all 8 control rats cluster to one group
11 of the 14 diseased humans and all diseased rats cluster to a second group
Differential Metabolomics of
Human Blood
breast cancer samples vs. control samples
Roadmap
Systems Biology Differential omics
1. Experimental design
2. Protein identification
3. Data preprocessing
4. Statistical significance test
5. Pattern recognition
6. Molecular networks
correlation network
interaction network
regulation network
pathway analysis
Molecular Correlation Analysis
pairwise correlation of proteins and metabolites
Diseased Healthy
A A A A A A A A
B B B B B B B B
C C C C C C C C
D D D D D D D D
… … … … … … … …
Z Z Z Z Z Z Z Z
S1 S2 S3 S4 S5 S6 S7 S8
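Pairwise correlation over a molecules-by-samples matrix like the one above can be sketched as follows. The Pearson coefficient, the 0.8 threshold and the toy profiles are assumptions for illustration; molecule names A to D simply mirror the rows of the matrix.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two abundance profiles across the same samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def correlation_edges(profiles, threshold=0.8):
    """Return (molecule_1, molecule_2, r) for every pair with |r| >= threshold."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(profiles.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            edges.append((name_a, name_b, r))
    return edges


# Toy abundances of four molecules across samples S1..S8.
toy_profiles = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [2, 4, 6, 8, 10, 12, 14, 16],
    "C": [8, 7, 6, 5, 4, 3, 2, 1],
    "D": [3, 1, 4, 1, 5, 9, 2, 6],
}
edges = correlation_edges(toy_profiles)
```

The resulting edge list, with the sign of r marking positive or negative correlation, is the raw material for a correlation network. Molecule D correlates only weakly with the others and stays unconnected.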
Molecular Correlation Network
an example of drug effect on disease state
•Reveals important relationships among the various components
•Complementary to abundance level information
•Provides information about the biochemical processes underlying the disease or drug response
[Figure: correlation network whose nodes are lipids (LCMS), NMR (DE) diffusion and NMR (CPMG) signals, peptides, proteins and clinical measurements, e.g. ApoA1, ApoE, ALB, HDL, TRIG, C52:2 TG, C18:1 LPC, lactate, valine. Legend: positive correlation, negative correlation, higher in treated group, lower in treated group]
Clish, C. B.; Davidov, E.; Oresic, M.; Plasterer, T.; Lavine, G.; Londo, T. R.; Meys, M.; Snell, P.; Stochaj, W.; Adourian, A.;
Zhang, X.; Morel, N.; Neumann, E.; Verheij, E.; Vogels, J, T.W.E.; Havekes, L. M.; Afeyan, N.; Regnier, F. E.; Greef, J.;
Naylor, S. Omics: A Journal of Integrative Biology 2004, 8, 3--13.
SysNet: Interactive Visual Data Mining of Molecular Correlation Network
An interactive integration and visualization environment for molecular correlation of ‘omics data.
•Integrating molecular expression information generated in different ‘omics
•Visualizing molecular correlation in interactive mode
•Enabling time course data visualization and analysis
•Automatically organizing molecules based on their expression pattern in time course.
[Figure: panels a) and b)]
Zhang, M.; Ouyang, Q.; Stephenson, A.; Salt, D.; Kane, D. M.; Burgner J.; Buck, C. and Zhang, X. BMC Systems Biology. Accepted.
Biomarker Verification
Wet-lab verification
AQUA
MRM
Antibody
In-silico verification
tracing lineage
pathway analysis
Automated Lineage Tracing
•Interested in identifying the connections between input and output data for a program
•Tracing of fine-grained lineage through run-time analysis
•Developed based on dynamic slicing techniques used in debugging
•Applicable to any arbitrary function
[Figure: Analysis Software, Lineage Tracing]
Zhang, M.; Zhang, X.; Zhang, X. and Prabhakar, S. 33rd International
Conference on Very Large Data Bases (VLDB 2007), 2007.
Summary
• The informatics platform developed in my group can be used to analyze protein and metabolite profiling data to differentiate disease and normal samples for biomarker discovery
• Groups identified using clustering analysis reflected the phenotypic categories of cancer and control samples, animal and human subjects, etc. with a high degree of accuracy
• SysNet, using an interactive visual data mining approach, integrates omics data into a single environment, which enables biologists to perform data mining
• Lineage tracing technology is an efficient and effective approach for in-silico biomarker verification. This technique will significantly reduce the false discovery rate (FDR) of biomarker discovery
Acknowledgements
Irina Fedulova Dr. John Burger Dr. David Clemmer
Dr. Hamid Mirzaei Dr. Michael D. Kane Dr. John Asara
Dr. Cheolhwan Oh Dr. Fred E. Regnier Dr. Mu Wang
Sergey E. Pevtsov Dr. David Salt Dr. Jake Chen
Ouyang Qi Dr. Mohammad Sulma Dr. Steve Valentine
Alan Stephenson Dr. Daniel Raftery Dr. Steve Naylor
Mingwu Zhang Dr. Sunil Prabhakar
Postdoc Positions
Posting Title: Industrial Postdoctoral Fellow - Bioinformatician
Work Location: University of Louisville, KY
Job Type: Full time
Starting Date: Position immediately available
Job Description: Predictive Physiology and Medicine (PPM) Inc. is an exciting
health and life sciences company based in Bloomington, Indiana focused on
developing analytical systems for the individualized health and wellness industry.
We have an immediate opening for a postdoctoral fellow. The successful
candidate will develop bioinformatics systems for mass spectrometry based
quantitative proteomics and metabolomics.
Requirements: The position requires a bioinformatician with a strong
computational background. Priority will be given to candidates with a PhD in
bioinformatics, computer science, statistics, engineering, or computational physics.
The successful candidate should have a strong understanding of statistics and
pattern recognition. Programming skills in Matlab, Microsoft .NET, or Java
are required. Experience in analyzing biological data is not
required; however, interest in multidisciplinary research is a must.