Integrative Omics for Cancer Biology

Xiang Zhang, PhD



Department of Chemistry

Center for Regulatory and Environmental Analytical Metabolomics

University of Louisville, Louisville, KY 40292



xiang.zhang@louisville.edu

Systems Biology

Systems biology is a field of biology that aims at a systems-level understanding of biological processes, treating a living system as a set of parts that are connected to one another and work together. It attempts to create predictive models of cells, organs, biochemical processes and complete organisms.



• Integrative systems biology: extracting biological knowledge from the ‘omics through integration

• Predictive systems biology: predicting the future state of a biosystem using ‘omics knowledge, e.g. in-silico biosystems









Davidov, E.; Clish, C. B.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 267--288.

Clish, C. B.; Davidov, E.; Oresic, M.; Zhang, X; et al. Omics: A Journal of Integrative Biology. 2004, 8, 3--13.

Omics Space









Differential omics is the beginning of Systems Biology

[Figure: omics space across scales: molecule, cell, tissue, organism]



Differential Proteomics & Metabolomics

1. Differential proteomics and metabolomics are the qualitative and quantitative comparison of the proteome and metabolome under different conditions, with the aim of unraveling complex biological processes.

2. They can be used to study any phenomenon that may change the proteome and/or metabolome of a living system.

• Cancer biomarker discovery
• Nano-medicine
• Environment
• Food and nutrition

[Figure labels: NIH; preventative medicine]

Biomarker Discovery Is a Major Research Field of Differential Omics

Biomarkers are naturally occurring biomolecules useful for measuring the prognosis and/or progress of diseases and therapies.

• These substances may normally be present in small amounts in the blood or other tissues
• When the amounts of these substances change, they may indicate disease
• Valid biomarkers should
   • demonstrate drug activity sooner
   • facilitate clinical trial design by defining patient populations
   • optimize dosing for safety and efficacy
   • be sensitive and easy to assay, to speed drug development

What Types of Change Are Expected?

[Figure: types of protein change: degradation, structure (unchanged vs. changed), concentration, post-translational modification, sequence (mutation)]

• Sensing structural change is a major element of comparative proteomics
• Most metabolomics work focuses on concentration change only

Challenges in Proteomics

• Sample complexity
   • About 25,000 protein-coding genes are present in the human genome. The IPI human database (v3.25) has 67,250 entries, which could generate about 10^6 to 10^8 peptides
   • More than one hundred post-translational modifications (PTMs) can occur in a proteome
• Large protein concentration difference
   • 10^7 to 10^8 in human cells, and at least 10^12 in human plasma
   • The dynamic range of an LC-MS system is about 10^4 to 10^6
   • The top 12 high-abundance proteins constitute approximately 95% of the total protein mass of plasma/serum: Albumin, IgG, Fibrinogen, Transferrin, IgA, IgM, Haptoglobin, alpha 2-Macroglobulin, alpha 1-Acid Glycoprotein, alpha 1-Antitrypsin and HDL (Apo A-I & Apo A-II)
• Dynamic system, large subject variation

[Figure: body fluid profiling as a biomarker platform. Concentration scale from g/ml through ng/ml to pg/ml; generic sample prep. for high concentration compounds, focused sample prep. for low concentration compounds]

Challenges in Metabolomics

• Metabolites have a wide range of molecular weights and large variations in concentration

• The metabolome is much more dynamic than the proteome and genome, which makes the metabolome more time sensitive

• Metabolites can be either polar or nonpolar, as well as organic or inorganic molecules. This makes chemical separation a key step in metabolomics

• Metabolites have diverse chemical structures, which makes identification using MS an extreme challenge

[Figure: chemical structure of cholesta-3,5-diene]

Differential Omics
biomarker discovery

[Figure: data matrix of molecules A, B, C, D, … Z (rows) measured in samples S1 to S8 (columns), with the samples grouped as Diseased and Healthy]

Informatics Platform

[Figure: informatics platform workflow. Recoverable labels: LIMS; experiment design; sample information; experiment execution; raw data; spectrum deconvolution; peak alignment; normalization; transformation; quality control; significance test; regulated peaks; pattern recognition; cluster loadings; regulated molecules; molecular identification; unidentified molecules; targeted tandem MS; molecular validation; correlation; molecular interaction networks; pathway modeling; protein function; knowledge assembling; data re-examination]

Roadmap





Systems Biology Differential omics



1. Experimental design

2. Molecular identification

3. Data preprocessing

4. Statistical significance test

5. Pattern recognition

6. Molecular networks

MDLC Platforms

• MudPIT, i.e. SCX followed by RP
   • The proteome is split into 10-20X more fractions
   • There is carry-over between fractions
   • LC fractions generally are still too complex for MS
• Affinity selection
   • Avidin selection of Cys-containing peptides
   • Cu-IMAC for His-containing peptides
   • Ga-IMAC for phosphorylated peptides
   • Lectins for glycosylated peptides

[Figure: workflow labels: Sample, APR, AP, Digestion, SCX, fractions (F1, F2, ...), RPC-MS]


Qiu, R.; Zhang, X. and Regnier, F. E. J. Chromatogr. B. 2007, 845, 143-150.

Wang, S.; Zhang, X.; and Regnier, F. E. J. Chromatogr. A 2002, 949, 153-162.

Regnier, F. E.; Amini, A.; Chakraborty, A.; Geng, M.; Ji, J.; Sioma, C.; Wang, S.; and Zhang, X. LC/GC 2001, 19(2), 200-213.

Geng, M.; Zhang, X.; Bina, M.; and Regnier, F. E. J. Chromatogr. B 2001, 752, 293-306.

In-Gel Stable Isotope Labeling

a sample gel based platform





• Avoiding gel-to-gel variability
• Only labeling K-containing peptides
• Accurate quantification









Asara, J. M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. Nature Protocols, 2006, 1, 46-51.

Asara, J. M.; Zhang, X.; Zheng, B.; Christofk, H. H.; Wu, N.; Cantley, L. C. J. Proteome Res., 2006, 5, 155-163.

Ji, J.; Chakraborty, A.; Geng M.; Zhang, X.; Amini, A.; Bina, M.; and Regnier, F. E. J. Chromatogr. A 2000, 745, 197-210.

Roadmap







Systems Biology Differential omics



1. Experimental design

2. Molecular identification

protein identification

metabolite identification

3. Data preprocessing

4. Statistical significance test

5. Pattern recognition

6. Molecular networks

Protein Identification

database searching

The database searching approach uses a protein database to

find a peptide for which a theoretically predicted spectrum best

matches experimental data.
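
The idea of database searching can be sketched in a few lines of Python. This is a simplified illustration, not the method of any particular search engine: it matches only the precursor mass of a tryptic peptide rather than a full predicted MS/MS spectrum, and the database entry, the function names and the 10 ppm tolerance are assumptions made for the example.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """In-silico trypsin digest: cleave after K or R unless followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def search(database, observed_mass, tol_ppm=10.0):
    """Return (protein, peptide, mass) for peptides within tol_ppm of the observed mass."""
    hits = []
    for name, sequence in database.items():
        for pep in tryptic_digest(sequence):
            mass = peptide_mass(pep)
            if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
                hits.append((name, pep, mass))
    return hits


# Hypothetical one-protein database and a simulated observation.
DATABASE = {"demo_protein": "LSPLGEEMRTHLAPYSDELRVQPYLDDFQK"}
observed = peptide_mass("THLAPYSDELR")
hits = search(DATABASE, observed)
```

A real search engine would go on to score the predicted fragment ions of each candidate against the experimental MS/MS spectrum.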



[Figure labels: Protein, Peptide, Mass, matched peptide]

Protein Identification

database searching





More than 20 algorithms have been developed.

• Sequest
• Spectrum Mill
• Mascot
• X! Tandem
• OMSSA

1. About 20% of tandem MS spectra could provide confident peptide identification
2. < 50% of peptides can be identified by all algorithms









Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.

Protein Identification

de novo sequencing





De novo sequencing reconstructs the partial or complete sequence of a peptide directly from its MS/MS spectrum.

The performance of de novo methods is limited by low mass accuracy, mass equivalence, and the completeness of fragmentation.





Pevtsov, S.; Fedulova, I.; Mirzaei, H.; Buck, C.; Zhang, X. Journal of Proteome Research. 2006, 5, 3018-3028.

Fedulova, I.; Ouyang, Z.; Buck, C.; Zhang, X. The Open Spectroscopy Journal 2007, 1, 1-8.
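
A minimal sketch of the principle behind de novo sequencing, assuming an idealized, complete ladder of singly charged b-ions: the mass difference between consecutive ions is read off as a residue. The function names and tolerances are hypothetical. The example also shows two of the limitations named above: leucine and isoleucine have identical masses (mass equivalence), and K and Q become indistinguishable once the mass tolerance is loosened (low mass accuracy).

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def b_ion_ladder(peptide):
    """Simulated singly charged b-ion m/z values, starting from a virtual b0."""
    ladder, total = [PROTON], PROTON
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def read_ladder(ion_mz, tol=0.01):
    """Call one residue (or an ambiguous set) per gap between consecutive ions."""
    ions = sorted(ion_mz)
    calls = []
    for low, high in zip(ions, ions[1:]):
        delta = high - low
        matches = sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)
        calls.append("/".join(matches) if matches else "?")
    return calls


ladder = b_ion_ladder("SLK")
accurate = read_ladder(ladder, tol=0.01)   # I/L cannot be separated by mass
inaccurate = read_ladder(ladder, tol=0.05)  # K and Q also merge at low accuracy
```

A gap that matches no residue is reported as "?", which is what a missing fragment ion looks like to this naive reader.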

Incorporating Peptide Separation

Information for Protein Identification

structure of pattern classifier

[Figure: peptide sequences (VSFLSALEEYTKK, LSPLGEEMR, DYVSQFEGSALGKQLNLK, DSGRDYVSQFEGSALGK, AKPALEDLRQGLLPVLESFK, DLATVYVDVLKDSGR, THLAPYSDELR, QGLLPVLESFKVSFLSALEEYTK, VQPYLDDFQKK, QGLLPVLESFK) pass through feature extraction (Feature 1 to Feature N) into a network with input, hidden and output layers. Output classes: flow through, partition, elution. Signal and weight labels: x_l, w^h, y_m, w^o, z_n]


Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.

Training the ANNs with Genetic Algorithm

[Figure: genetic algorithm loop. Recoverable labels: encoding of the network parameters (w^h_ji, w^o_kj, t^h_j, t^o_k) into chromosomes; initial population of candidate solutions; selection; crossover; mutation; best chromosome; optimal solution]


Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E.; Zhang, X. Bioinformatics 2007, 23, 114-118.
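
The loop of encoding, selection, crossover and mutation can be illustrated with a toy example. The sketch below is not the published classifier: it evolves the nine parameters of a hypothetical 2-2-1 tanh network on a small XOR-style data set, and the population size, mutation rate and elitism rule are arbitrary choices made for the illustration.

```python
import math
import random


def forward(w, x):
    """2-2-1 network; w holds 2 hidden units x (2 weights + bias), then the output unit."""
    hidden = [math.tanh(w[3 * k] * x[0] + w[3 * k + 1] * x[1] + w[3 * k + 2]) for k in range(2)]
    return math.tanh(w[6] * hidden[0] + w[7] * hidden[1] + w[8])


def error(w, data):
    """Mean squared error of the network over the data set (lower is fitter)."""
    return sum((forward(w, x) - y) ** 2 for x, y in data) / len(data)


def evolve(data, pop_size=40, generations=60, mut_rate=0.2, seed=0):
    rng = random.Random(seed)
    # Encoding: each chromosome is the flat list of network parameters.
    pop = [[rng.uniform(-1, 1) for _ in range(9)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda w: error(w, data))
        history.append(error(pop[0], data))
        parents = pop[: pop_size // 2]       # selection: keep the fitter half
        children = [pop[0][:]]               # elitism: best chromosome survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, 9)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + rng.gauss(0, 0.5) if rng.random() < mut_rate else g
                     for g in child]         # mutation: random perturbation
            children.append(child)
        pop = children
    best = min(pop, key=lambda w: error(w, data))
    return best, history + [error(best, data)]


DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
best, history = evolve(DATA)
```

Because the best chromosome is copied unchanged into each new generation, the best error recorded in `history` can never increase from one generation to the next.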

Protein Identification Using Multiple

Algorithms and Predicted Peptide

Separation in HPLC

PIUMA architecture

[Figure: PIUMA workflow, with numbered steps 1 to 3. Recoverable labels: raw LC/MS/MS data; mzData or mzXML format; processed MS/MS data; database searching (Mascot, Sequest, X! Tandem); peptide list; unmatched spectra; unknown modification search; de novo sequencing (Lutefisk, novoHMM, Peaks); peptide list; consensus; validation by chromatography modeling based on machine learning; protein list; report. Color legend: existing algorithms, algorithms to be developed, method descriptions]


Oh, C.; Zak, S. H.; Mirzaei, H.; Regnier, F. E. and Zhang, X. Bioinformatics, 2007, 23, 114-118.

Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Current Proteomics 2007, 4, 121-130.

Roadmap



Systems Biology Differential omics



1. Experimental design

2. Molecular identification

3. Data preprocessing

Spectrum deconvolution

Quality control

Alignment

Normalization

4. Statistical significance test

5. Pattern recognition

6. Molecular networks

Spectrum Deconvolution

GISTool, single sample analysis





1. To differentiate signals arising from real analytes from signals arising from contaminants or instrument noise
2. To reduce data dimensionality, which will benefit downstream statistical analysis

Functionality
• Smoothing and centralization
• Peak cluster detection
• Charge recognition
• De-isotope
• Peak identification at LC level
• Doublet recognition
• Doublet quantification

GISTool Algorithm

Deconvoluting MS spectra

[Figure: single sample analysis. Overlapping isotope clusters of a +3 peptide (748.6354, 3+) and a +2 peptide (748.9694, 2+); labeled peaks at m/z 748.64, 748.97, 749.29, 749.47, 749.62, 749.97 and 750.50; axes: intensity (%) vs. m/z (747 to 751)]








Zhang, X.; Hines, W.; Adamec, J.; Asara, J.; Naylor, S.; and Regnier, F. E. J. Am. Soc. Mass

Spectrom. 2005, 16, 1181-1191.
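
Charge recognition rests on simple arithmetic: neighboring isotope peaks of an ion with charge z are about 1.00335/z apart on the m/z axis. The sketch below applies this to the peak labels from the spectrum above. It is an illustration rather than the GISTool implementation, and the split of the labeled peaks into two series is an assumption made for the example.

```python
ISOTOPE_SPACING = 1.00335  # approx. 13C - 12C mass difference in Da
PROTON = 1.00728


def infer_charge(mz_peaks, max_charge=6):
    """Estimate charge from the mean spacing of consecutive isotope peaks."""
    peaks = sorted(mz_peaks)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_gap = sum(gaps) / len(gaps)
    charge = round(ISOTOPE_SPACING / mean_gap)
    if not 1 <= charge <= max_charge:
        raise ValueError("isotope spacing does not match a plausible charge")
    return charge


def neutral_mass(mz, charge):
    """Neutral monoisotopic mass from the monoisotopic m/z of a protonated ion."""
    return (mz - PROTON) * charge


# Assumed assignment of the labeled peaks to the two overlapping clusters.
triply_charged = [748.64, 748.97, 749.29, 749.62]
doubly_charged = [748.97, 749.47, 749.97, 750.50]

z3 = infer_charge(triply_charged)   # spacing of about 0.33
z2 = infer_charge(doubly_charged)   # spacing of about 0.50
mass3 = neutral_mass(748.6354, z3)
```

Once the charge is known, the isotope cluster can be collapsed to a single monoisotopic mass, which is the de-isotoping step listed among the GISTool functions.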

Quality Assessment / Control

Biological Sample QA/C
• protein assay

Experimental Data QA/C
• 2D K-S test
• Percentile of detected peaks
• Percentile of aligned peaks
• Retention time variance vs. retention time
• m/z variance vs. retention time
• Frequency distribution of RT & m/z variance

[Figure: D value (0 to 0.08) vs. sample ID (1 to 10)]
[Figure: retention time variation (%) vs. retention time (min)]










Zhang, X.; Asara, J. M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K.

Bioinformatics, 2005, 21, 4054-4059.

Data Alignment



To recognize peaks of the same molecule occurring in different

samples from the thousands of peaks detected during the

course of an experiment.





1. MS to MS data alignment

•Referenced alignment

•Blind alignment

•Quality depending on the information of peak detection





2. MS to MS/MS data alignment

•Depends on experimental design

LC-MS Data Alignment

XAlign software for proteomics & metabolomics data

• Detecting median sample

M_j = \sum_i I_{i,j} M_{i,j} / \sum_i I_{i,j}

T_j = \sum_i I_{i,j} T_{i,j} / \sum_i I_{i,j}

D_i = \sum_{j=1}^{s} |T_{i,j} - \mu_j|

[Figure: retention time difference (min) vs. retention time (min)]

• Aligning samples to the median sample

[Figure: intensity of aligned peaks (sample 2) vs. intensity of aligned peaks (sample 1), log-log axes from 10 to 10000; fitted line y = 1.3636x + 16.511, R^2 = 0.9475]

Zhang, X.; Asara, J. M.; Adamec, J.; Ouzzani, M.; and Elmagarmid, A. K. Bioinformatics, 2005, 21, 4054-4059.
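
The median-sample step can be sketched as follows. The slide does not define its indices, so this example assumes that i runs over samples and j over a set of peaks already matched across samples, and that \mu_j is the intensity-weighted mean retention time of peak j. The function names are hypothetical and this is not the XAlign code itself.

```python
def weighted_mean(values, weights):
    """Intensity-weighted average of a list of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def median_sample(rt, intensity):
    """rt[i][j] and intensity[i][j] describe peak j in sample i.

    Returns the index of the sample whose retention times lie closest
    to the weighted means, together with every sample's distance D_i.
    """
    n_samples, n_peaks = len(rt), len(rt[0])
    mu = [
        weighted_mean([rt[i][j] for i in range(n_samples)],
                      [intensity[i][j] for i in range(n_samples)])
        for j in range(n_peaks)
    ]
    dist = [sum(abs(rt[i][j] - mu[j]) for j in range(n_peaks))
            for i in range(n_samples)]
    best = min(range(n_samples), key=dist.__getitem__)
    return best, dist


# Three samples, three matched peaks, retention times in minutes.
rt = [[10.0, 20.0, 30.0],
      [10.2, 20.2, 30.2],
      [10.5, 20.5, 30.5]]
intensity = [[1.0, 1.0, 1.0]] * 3
index, distances = median_sample(rt, intensity)
```

Here the second sample sits between the other two on every peak, so it has the smallest total deviation and would serve as the reference to which the remaining samples are aligned.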

Chromatogram of Serum Analyzed on GCxGC/TOF-MS
GCxGC-MS Data Alignment
metabolite component of human serum

• Four dimensions
• 1535 peaks have been detected

GCxGC/TOF-MS Data Alignment

MSort software for metabolomics









Criteria for alignment
• 1st dimension retention time
• 2nd dimension retention time
• spectral correlation

Features
• peak entry merging
• cont. exclusion









Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215
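
The three alignment criteria can be combined as in the sketch below, which treats two peak entries as the same compound when both retention times agree within a tolerance and their mass spectra are highly correlated. The tolerances, the correlation cutoff and the data layout are assumptions for illustration and are not taken from the software.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def spectral_correlation(spec_a, spec_b):
    """Correlate two spectra given as {m/z: intensity} over the union of their m/z values."""
    bins = sorted(set(spec_a) | set(spec_b))
    return pearson([spec_a.get(mz, 0.0) for mz in bins],
                   [spec_b.get(mz, 0.0) for mz in bins])


def same_compound(a, b, rt1_tol=5.0, rt2_tol=0.1, min_corr=0.9):
    """Apply the 1st-dimension, 2nd-dimension and spectral criteria in turn."""
    return (abs(a["rt1"] - b["rt1"]) <= rt1_tol
            and abs(a["rt2"] - b["rt2"]) <= rt2_tol
            and spectral_correlation(a["spectrum"], b["spectrum"]) >= min_corr)


# Hypothetical peak entries (retention times in seconds).
peak_a = {"rt1": 600.0, "rt2": 1.50, "spectrum": {73: 100.0, 147: 50.0, 218: 20.0}}
peak_b = {"rt1": 602.0, "rt2": 1.52, "spectrum": {73: 200.0, 147: 100.0, 218: 40.0}}
peak_c = {"rt1": 601.0, "rt2": 1.51, "spectrum": {73: 10.0, 147: 90.0, 218: 100.0}}
```

Peaks a and b differ only in overall intensity, so their spectra correlate almost perfectly; peak c elutes at nearly the same position but has a different fragmentation pattern and is rejected.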

Analysis Results of MAlign

53 standard acids

[Figure, two panels: the number of rows in the alignment table (0 to 1000) and peak area (0 to 10 x 10^5), each plotted against the number of peak entries in a row of the alignment table (1 to 16)]







1. 8 [OA + FA] samples and 8 [AA + FA] samples

2. derivatization reagent: (N-Methyl-N-t-butyldimethylsilyl)-trifluoroacetamide (MTBSTFA)





Oh, C.; Huang, X.; Buck, C.; Regnier, F. E. and Zhang, X. J. Chromatogr. A. 2008, 1179, 205-215

Normalization









To reduce concentration effect and experimental variance to make the data comparable.

[Figure: intensity (0 to 8000) vs. peak index (0 to 1000)]

Methods
1. Log linear model: x_{ij} = a_i \cdot r_j \cdot e_{ij}
2. Reference sample normalization
3. Auto-scaling
4. Constant mean / trimmed constant mean
5. Constant median / trimmed constant median
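
One simple way to fit the log linear model is sketched below. The slide does not say which index is which, so the example assumes that i indexes peaks (a_i is the peak abundance) and j indexes samples (r_j is a per-sample scaling factor), and it estimates r_j from the column means of the log intensities. Other estimators are possible; this is an illustration only.

```python
import math


def sample_factors(x):
    """x[i][j] is the intensity (> 0) of peak i in sample j. Returns estimates of r_j."""
    n_peaks, n_samples = len(x), len(x[0])
    logs = [[math.log(v) for v in row] for row in x]
    grand_mean = sum(sum(row) for row in logs) / (n_peaks * n_samples)
    col_means = [sum(logs[i][j] for i in range(n_peaks)) / n_peaks
                 for j in range(n_samples)]
    # log(r_j) is the deviation of sample j's mean log intensity from the grand mean.
    return [math.exp(m - grand_mean) for m in col_means]


def normalize(x):
    """Divide every intensity by the estimated factor of its sample."""
    r = sample_factors(x)
    return [[v / r[j] for j, v in enumerate(row)] for row in x]


# Three peaks measured in two samples; sample 2 was loaded at twice the amount.
raw = [[100.0, 200.0],
       [10.0, 20.0],
       [50.0, 100.0]]
factors = sample_factors(raw)
normalized = normalize(raw)
```

In this constructed case the two samples differ only by a constant factor, so after normalization every peak has the same intensity in both samples.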

CV Distribution of Peak Intensities

human serum sample







[Figure: before normalization. Panels: CV distribution (frequency and relative peak number (%) vs. CV, 0.0 to 1.0) and intensity variation; annotated value 20.7%]

Log linear model:

x_{ij} = a_i \cdot r_j \cdot e_{ij}

\log(x_{ij}) = \log(a_i) + \log(r_j) + \log(e_{ij})

[Figure: after normalization. Same panels; annotated value 17.3%]


Roadmap





Systems Biology Differential omics



1. Experimental design

2. Molecular identification

3. Data preprocessing

4. Statistical significance test

5. Pattern recognition

6. Molecular networks

Statistical Significance Tests



To find individual peaks for which there are significant

differences between groups.



Methods

1. Pair-wise t-test (diff. mean?)

2. Mann-Whitney U test (diff. median?)

3. Kolmogorov-Smirnov test (diff. population?)

4. Kruskal-Wallis analysis of variance
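
For a single peak, the statistics behind the first three tests can be computed as in the sketch below. It uses only the Python standard library and stops at the test statistics; turning them into p-values requires the corresponding reference distributions, which a statistics package would provide. Welch's unequal-variance form of the t statistic is used here as one common choice, and the intensity values are made up for the example.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """t statistic for a difference in means, without assuming equal variances."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def mann_whitney_u(a, b):
    """U statistic of sample a: pairs with a-value above b-value, ties counted as 1/2."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in sorted(set(a) | set(b)))


# Hypothetical intensities of one peak in two groups of four samples.
diseased = [5.1, 6.0, 5.8, 6.3]
healthy = [3.9, 4.2, 4.8, 4.1]

t_stat = welch_t(diseased, healthy)
u_stat = mann_whitney_u(diseased, healthy)
d_stat = ks_statistic(diseased, healthy)
```

Every diseased value exceeds every healthy value in this example, so U reaches its maximum of 4 x 4 = 16 and D reaches 1.0, the strongest separation either statistic can show.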

Statistical Significance Tests

metabolome of great blue heron fertilized eggs contaminated by PCBs (polychlorinated biphenyls)

[Figure: volcano plot of p (-log) vs. fold change (log), with down-regulated peaks on the left and up-regulated peaks on the right. Fold change = I_c / I_n. Blue line: p = 0.05. Dashed line: fold change (log) = 0]
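
The coordinates of such a plot follow directly from the fold change and the p-value of each peak. The sketch below assumes base-2 logarithms for the fold change and base-10 for the p-value, which the slide does not specify, and the function names are hypothetical.

```python
import math


def volcano_point(fold_change, p_value):
    """Return (log2 fold change, -log10 p) for one peak."""
    return math.log2(fold_change), -math.log10(p_value)


def classify(fold_change, p_value, alpha=0.05):
    """Label a peak by the side of the plot it falls on, if it is significant."""
    if p_value >= alpha:
        return "not significant"
    if fold_change > 1:
        return "up-regulated"
    if fold_change < 1:
        return "down-regulated"
    return "unchanged"


# fold change = I_c / I_n, as defined on the slide
x, y = volcano_point(4.0, 0.01)
labels = [classify(4.0, 0.001), classify(0.25, 0.001), classify(4.0, 0.2)]
```

A fold change of 1 maps to 0 on the log axis, which is why the dashed vertical line separates down-regulated from up-regulated peaks.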

Roadmap



Systems Biology Differential omics



1. Experimental design

2. Molecular identification

3. Data preprocessing

4. Statistical significance test

5. Pattern recognition

6. Molecular networks

Clustering or Classification





The resulting pattern recognition provides the first glimpse of improved understanding of the underlying biology.

Unsupervised Methods
Principal component analysis (PCA)
Clustering objects on subsets of attributes (COSA)

Supervised Methods
Linear discriminant analysis (LDA)
Support vector machine (SVM)
Artificial neural network (ANN)

Cross Species Comparison









27 of the 28 control humans and all 8 control rats cluster into one group

11 of the 14 diseased humans and all diseased rats cluster into a second group

Differential Metabolomics of

Human Blood

breast cancer samples vs. control samples


Roadmap



Systems Biology Differential omics

1. Experimental design

2. Molecular identification

3. Data preprocessing

4. Statistical significance test

5. Pattern recognition

6. Molecular networks

correlation network

interaction network

regulation network

pathway analysis

Molecular Correlation Analysis

pairwise correlation of proteins and metabolites

[Figure: data matrix of molecules A, B, C, D, … Z (rows) measured in samples S1 to S8 (columns), with the samples grouped as Diseased and Healthy]
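
A correlation network can be built from such a matrix by computing the correlation of every pair of molecules across the samples and keeping the strong ones as edges. The sketch below uses Pearson correlation with an arbitrary cutoff of 0.8; the molecule names and abundance values are invented for the illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_network(profiles, threshold=0.8):
    """profiles maps molecule name -> abundances across samples. Returns the edge list."""
    edges = []
    for (a, xa), (b, xb) in combinations(profiles.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            edges.append((a, b, "positive" if r > 0 else "negative", round(r, 3)))
    return edges


# Hypothetical abundances of four molecules in samples S1 to S8.
profiles = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [2, 4, 6, 8, 10, 12, 14, 17],
    "C": [8, 7, 6, 5, 4, 3, 2, 1],
    "D": [3, 1, 4, 1, 5, 9, 2, 6],
}
edges = correlation_network(profiles)
```

Molecules A and B rise together, C falls as they rise, and D follows neither closely, so the network contains three edges and leaves D unconnected.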

Molecular Correlation Network

an example of drug effect on disease state

• Reveals important relationships among the various components
• Complementary to abundance level information
• Provides information about the biochemical processes underlying the disease or drug response

[Figure: molecular correlation network. Node classes: lipids (LCMS), NMR (DE, diffusion), NMR (CPMG), peptides, proteins, clinical. Edge types: positive correlation, negative correlation. Node marking: higher in treated group, lower in treated group. Example node labels: ApoE_1, ApoA1_6, C52:2 TG, C18:2 LPC, valine, lactate, HDL]


Clish, C. B.; Davidov, E.; Oresic, M.; Plasterer, T.; Lavine, G.; Londo, T. R.; Meys, M.; Snell, P.; Stochaj, W.; Adourian, A.;

Zhang, X.; Morel, N.; Neumann, E.; Verheij, E.; Vogels, J, T.W.E.; Havekes, L. M.; Afeyan, N.; Regnier, F. E.; Greef, J.;

Naylor, S. Omics: A Journal of Integrative Biology 2004, 8, 3--13.

SysNet: Interactive Visual Data Mining of Molecular Correlation Network

An interactive integration and visualization environment for molecular correlation of ‘omics data.

• Integrating molecular expression information generated in different ‘omics
• Visualizing molecular correlation in interactive mode
• Enabling time course data visualization and analysis
• Automatically organizing molecules based on their expression pattern in time course

Zhang, M.; Ouyang, Q.; Stephenson, A.; Salt, D.; Kane, D. M.; Burgner J.; Buck, C. and Zhang, X. Accepted by BMC Systems Biology.

Biomarker Verification

• Wet-lab verification
   • AQUA
   • MRM
   • Antibody

• In-silico verification
   • tracing lineage
   • pathway analysis

Automated Lineage Tracing



• Interested in identifying the connections between input and output data for a program
• Tracing of fine-grained lineage through run-time analysis
• Developed based on dynamic slicing techniques used in debugging
• Applicable to any arbitrary function

[Figure labels: Analysis Software, Lineage Tracing]




Zhang, M.; Zhang, X.; Zhang, X. and Prabhakar, S. 33rd International

Conference on Very Large Data Bases (VLDB 2007), 2007.
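
The notion of fine-grained lineage can be illustrated with a much simpler mechanism than dynamic slicing: a value wrapper that carries the set of input records it was computed from and merges those sets on every arithmetic operation. The class below is a hypothetical sketch of that idea, not the technique described in the cited paper.

```python
import operator


class Traced:
    """A number that remembers which input records it was derived from."""

    def __init__(self, value, sources=()):
        self.value = value
        self.sources = frozenset(sources)

    def _apply(self, other, op):
        # Combine values and take the union of lineages when both sides are traced.
        if isinstance(other, Traced):
            return Traced(op(self.value, other.value), self.sources | other.sources)
        return Traced(op(self.value, other), self.sources)

    def __add__(self, other):
        return self._apply(other, operator.add)

    def __sub__(self, other):
        return self._apply(other, operator.sub)

    def __mul__(self, other):
        return self._apply(other, operator.mul)

    def __truediv__(self, other):
        return self._apply(other, operator.truediv)


# Hypothetical input records: three peak intensities.
raw = {"peak_1": 10.0, "peak_2": 30.0, "peak_3": 5.0}
inputs = {name: Traced(value, {name}) for name, value in raw.items()}

# An "analysis" that only uses two of the three inputs.
result = (inputs["peak_1"] + inputs["peak_2"]) / 2
```

After the computation, `result.sources` names exactly the inputs that contributed to the output, which is the kind of input-to-output connection that in-silico verification of a candidate biomarker would inspect.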

Summary



• The informatics platform developed in my group can be used to analyze protein and metabolite profiling data to differentiate diseased and normal samples for biomarker discovery
• Groups identified using clustering analysis reflected the phenotypic categories of cancer and control samples, the animal and human subjects, etc. with a high degree of accuracy
• SysNet, using an interactive visual data mining approach, integrates omics data into a single environment, which enables biologists to perform data mining
• Lineage tracing technology is an efficient and effective approach for in-silico biomarker verification. This technique will significantly reduce the false discovery rate (FDR) of biomarker discovery

Acknowledgements







Irina Fedulova Dr. John Burger Dr. David Clemmer

Dr. Hamid Mirzaei Dr. Michael D. Kane Dr. John Asara

Dr. Cheolhwan Oh Dr. Fred E. Regnier Dr. Mu Wang

Sergey E. Pevtsov Dr. David Salt Dr. Jake Chen

Ouyang Qi Dr. Mohammad Sulma Dr. Steve Valentine

Alan Stephenson Dr. Daniel Raftery Dr. Steve Naylor

Mingwu Zhang Dr. Sunil Prabhakar

Postdoc Positions

Posting Title: Industrial Postdoctoral Fellow - Bioinformatician

Work Location: University of Louisville, KY

Job Type: Full time

Starting Date: Position immediately available



Job Description: Predictive Physiology and Medicine (PPM) Inc. is an exciting health and life sciences company based in Bloomington, Indiana, focused on developing analytical systems for the individualized health and wellness industry. We have an immediate opening for a postdoctoral fellow. The successful candidate will develop bioinformatics systems for mass spectrometry based quantitative proteomics and metabolomics.

Requirements: The position requires a bioinformatician with a strong computational background. Priority will be given to candidates with a PhD in bioinformatics, computer science, statistics, engineering, or computational physics. The successful candidate should have a strong understanding of statistics and pattern recognition. Programming skills in Matlab, Microsoft .NET, or Java are required. Experience in analyzing biological data is not required; however, interest in multidisciplinary research is a must.

