Analysis of Protein Geometry_ Particularly Related to Packing at
Permissions Statement
1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
This Presentation is copyright Mark Gerstein,
Yale University, 2002. Feel free to use
images in it with PROPER acknowledgement.
Do not reproduce without permission
Computational Proteomics:
Genome-scale studies of protein
function, structure, and evolution
Mark B Gerstein
Yale U
H Hegyi, J Lin, J Qian, N Luscombe, T Johnson,
A Drawid, R Jansen, V Alexandrov,
M Snyder, A Kumar, H Zhu, D Greenbaum, N Lan, P Harrison,
N Echols, S Balasubramanian, P Bertone, Z Zhang,
R Das, Y Liu, Y Kluger, H Yu, D Greenbaum,
P Miller, K Cheung, S Weissman
Talk at Harvard
02.02.11
Understand Proteins,
through analyzing populations
Structures (motions, packing, folds)
Functions
Evolution
Integration of Information
Structures Sequences Microarrays
The central post-genomic term
Term            Google Hits   PubMed Hits   1st PubMed Hit Year
Genome          ~1880000      66171         1932
Proteome        ~63,000       703           1995
Transcriptome   3520          72            1997
Physiome        2980          15            1997
Metabolome      349           12            1998
Phenome         4980          6             1995
Morphome        238           2             1996
Interactome     56            2             1999
Glycome         46            1             2000
Secretome       21            1             2000
Ribonome        1             1             2000
Orfeome         42            -             -
Regulome        18            -             -
Cellome         17            -             -
Operome         8             -             -
Transportome    1             -             -
Functome        1             -             -
[Chart: Proteome PubMed Hits]
Proteins are central to
the 2 major post-genomic challenges
(Initial Step: genome sequence & genes)
1. Understanding genes in detail
Predicting protein function on a genomic scale
2. Understanding what’s between genes
Analyzing protein fossils
Proteins are central to
the 2 major post-genomic challenges
Predicting protein function on a genomic scale
How to predict function
for 1000s of proteins?
• 250 of ~650 known on chr. 22 [Dunham et al.]
• >>30K+ Proteins in Entire Human Genome
(alt. splicing)
How to predict functions
for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
1) "Traditional" sequence patterns
Compare uncharacterized genome sequences
against known sequences in DBs, transferring
func. annotation for similar sequences
Issue: Threshold is major parameter & limitation
Also, look for motifs & sites [Sternberg, Thornton, Rose, Koonin]
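A minimal sketch of threshold-based annotation transfer, to make the role of the threshold concrete. The hit list, the protein names and the function name `transfer_annotation` are hypothetical illustrations, not data or code from the talk; the EC numbers are borrowed only as example labels.

```python
from typing import List, Optional, Tuple

# Hypothetical database hits for one query: (protein, % identity, annotation).
HITS: List[Tuple[str, float, str]] = [
    ("P1", 90.0, "5.3.1.1"),
    ("P2", 45.0, "5.3.1.24"),
    ("P3", 20.0, "4.1.3.3"),
]


def transfer_annotation(hits, min_identity: float) -> Optional[str]:
    """Return the annotation of the best hit at or above the threshold.

    Returns None when no hit is similar enough, i.e. the query
    stays unannotated rather than receiving an unreliable label.
    """
    passing = [hit for hit in hits if hit[1] >= min_identity]
    if not passing:
        return None
    best = max(passing, key=lambda hit: hit[1])
    return best[2]


# With a 40% threshold the 90% hit supplies the annotation;
# a query whose only hit is at 20% identity gets nothing.
print(transfer_annotation(HITS, 40.0))
print(transfer_annotation(HITS[2:], 40.0))
```

Raising `min_identity` trades coverage for reliability, which is the limitation the slide points to.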
1000s of structurally based alignments
of structurally and functionally characterized sequences
Sequence    Function
(Human)     5.3.1.1  (TP Isomerase)
(Chick)     5.3.1.1  (TP Isomerase)
(E coli)    5.3.1.1  (TP Isomerase)
(E coli)    5.3.1.24 (PRA Isomerase)
(B ster.)   5.3.1.15 (Xylose Isom.)
(E coli)    4.1.3.3  (Aldolase)
(Yeast)     4.2.1.11 (Enolase)
Sequence identity labels: 90%, Same Exact function; 45%, Both Class 5 (isom.); 20%, Different Classes
Relationship of Similarity in Sequence to
that in Function
[Plot: % Same Function (0 to 100) v. sequence similarity of pairs of proteins (%ID, 70 down to 0)]
% Same Function: percentage of pairs that have same precise function as
defined by Enzyme & FlyBase functional classifications
Relationship of Similarity in Sequence to
that in Function
[Plot: % Same Function (0 to 100) v. %ID (70 down to 0), divided into three regions from high to low %ID]
• Can transfer both Fold & Functional Annotation
• Can transfer Annotation related to Fold but not Function
• Can not transfer Fold or Functional Annotation ("Twilight Zone")
Broad v Narrow Similarity
Caveats: Sequence Divergence of Multidomain
Proteins Implies a Practical Threshold is >40%
[Figure: Single Domain Sequences v. Multidomain Sequences; organisms shown: Human, Chick, E coli, B ster., Yeast, Rat]
Multi-domain proteins have greater
divergence in function with sequence
[Plot: % Same Function (0% to 100%) v. Sequence Similarity [ -log(e-value) ], from 50 (Very Close) down to 0; curves for Single-domain and Multi-domain proteins]
2) Via fold similarity (structural genomics)
Structures of ORFs with unknown function,
Use Fold & Site Similarity to Determine Function
Rationale for Structure Prediction
Issue:
To what degree does fold determine function?
[Kim, Edwards & Arrowsmith, Montelione, Burley, Eisenberg]
Fold Function Combinations
• Many Functions on Same Fold (TIM-barrel)
• Different Folds with Same Function (Carbonic Anhydrases, 4.2.1.1)
Global View of Fold-Function Combinations
[Matrix figure: 229 Folds v. 91 Enzymatic Functions plus Non-Enz]
Correlation with Structural Features
[Same matrix with folds grouped by Architectural Class (all-α, all-β, small) and functions by Enzyme Class; Slight Overpopulation marked]
Global View of Fold-Function Combinations
[Same matrix after Sort]
To what degree is fold associated with
function? Folds with multiple functions
[Histogram: Frequency in database of 229 folds v. Number of functions associated with a fold]
[Similar results by Thornton]
3) Clustering a microarray experiment
Microarray experiments
• Expression Arrays [Brown]
• Proteome Chips [Snyder]
Clustering the yeast cell cycle to uncover interacting proteins
[Brown, Davis] [Botstein; Church, Vidal]
[Plots: mRNA expression level (ratio), -2 to 4, v. Time, 0 to 16]
• RPL19B: Microarray timecourse of 1 ribosomal protein
• RPL19B v. TFIIIC: Random relationship from ~18M
• RPL19B v. RPS6B: Close relationship from ~18M (2 Interacting Ribosomal Proteins)
• RPL19B, RPS6B, RPP1A, RPL15A & ?????: Predict Functional Interaction of Unknown Member of Cluster
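A small sketch of the two quantities behind these plots: the number of possible gene pairs and the correlation between two expression profiles. The profiles are invented, and the gene count of 6000 is an assumption chosen because it reproduces the ~18M figure; the slides state only the total.

```python
import math


def n_pairs(n_genes: int) -> int:
    """Number of distinct unordered gene pairs."""
    return n_genes * (n_genes - 1) // 2


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Assumed gene count (not from the slides): about 6000 yeast ORFs.
print(n_pairs(6000))  # 17997000, i.e. roughly 18M pairs

# Hypothetical expression ratios over a timecourse.
profile_a = [0.1, 1.2, 2.0, 0.9, -0.3, -1.1]
profile_b = [0.0, 1.0, 1.9, 1.1, -0.2, -1.0]
print(round(pearson(profile_a, profile_b), 2))
```

A pair of co-expressed genes, such as two ribosomal proteins, would give a coefficient near 1; an unrelated pair would sit near 0.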
Local Clustering algorithm identifies further (reasonable) types of expression relationships
• Simultaneous (Traditional Global Correlation)
• Time-Shifted
• Inverted
[Church]
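To illustrate the three relationship types, here is a deliberately simplified stand-in that uses plain and lagged Pearson correlation. It is not the local clustering algorithm from the talk; the cutoff of 0.9, the maximum shift of 2 timepoints and the profiles are all assumptions made for the example.

```python
import math


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def classify(x, y, max_shift: int = 2, cutoff: float = 0.9) -> str:
    """Label a pair as simultaneous, inverted, time-shifted or none."""
    r = pearson(x, y)
    if r >= cutoff:
        return "simultaneous"
    if r <= -cutoff:
        return "inverted"
    for shift in range(1, max_shift + 1):
        # Compare x against y delayed by `shift`, and the reverse.
        if pearson(x[:-shift], y[shift:]) >= cutoff:
            return "time-shifted"
        if pearson(x[shift:], y[:-shift]) >= cutoff:
            return "time-shifted"
    return "none"


# Hypothetical periodic profile and three derived partners.
base = [0, 1, 0, -1, 0, 1, 0, -1]
print(classify(base, [2 * v for v in base]))         # same shape, same time
print(classify(base, [-v for v in base]))            # mirror image
print(classify(base, [-1, 0, 1, 0, -1, 0, 1, 0]))    # delayed by one timepoint
```

The global correlation of the delayed pair is zero, so a purely simultaneous measure would miss it; allowing a shift recovers the relationship.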
Examples: inverted relationships
Documented: YME1 v. YNT20
YME1 : mito. protease involved in cplx. assembly
YNT20 : known suppressor of YME1
Suggestive: PUT2 v. SER3
PUT2 : involved in Pro degradation
SER3 : involved in Ser synthesis
[Plots: Expr. Ratio (-3 to 3) v. Time (0 to 16)]
Examples: time-shifted relationships
Suggestive: ARP3 v. ARC35
ARP3 : in actin remodelling cplx.
ARC35 : in same cplx. (required late in cell cycle)
Predicted: J0544 v. MRPL19
J0544 : unknown function
MRPL19 : mito. ribosome
[Plots: Expr. Ratio v. Time (0 to 16); second plot also shows ATP11, MRPL17, YDR116C]
Expression Correlations Segment Large Replication Complex into Component Parts
[Correlation matrix figure. Row order: MCM3, MCM6, CDC47, MCM2, CDC46, CDC54, DPB3, CDC45, DPB2, CDC2, CDC7, POL2, HYS2, POL32, DBF4, ORC2, ORC6, ORC5, ORC4, ORC3, ORC1]
[Labelled blocks: MCM prots., Polym. δ & ε, ORC]
Range of Expression
Correlations within Complexes
Replication Cplx: Overall .05; ORC .19, MCMs .75, Pol. δ .45, ε .75
Proteasome: Overall .43; 20S .50, 19S .51
Ribosome: Overall .80; Large .80, Small .81
Permanent v. Transient
Complexes
Global Network
of 3 Different
Types of
Relationships
~313K significant relationships from ~18M possible
Simultaneous 188K
Inverted 63K
Shifted 67K
Globally, how well do expression relationships predict known interactions?
            Coverage of the 8250 Known         Enrichment Compared to Randomized
            Interactions in Complexes Found    Expression Relationships
Random      ~2% (313K/18M)                     1x
CC          42%                                24x
CC: 313K relationships from ~18M possible from clustering cell-cycle expt.
Combining Expression Data Sets Increases Coverage & Decreases Noise
            Coverage of the 8250 Known         Enrichment Compared to Randomized
            Interactions in Complexes Found    Expression Relationships
CC          42%                                24x
KO          34%                                22x
KO ∨ CC     55%                                111x
KO ∧ CC     21%                                254x
CC: 313K relationships from ~18M possible from clustering cell-cycle expt.
KO: 278K relationships from clustering knock-out profiles [Rosetta]
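A short worked check of how coverage and enrichment relate. The reading of "enrichment" as coverage divided by the fraction of all pairs selected is an inference; it reproduces the CC and KO rows but is not stated in the slides. The pair sets at the end are hypothetical and only show the "either" and "both" combinations.

```python
N_POSSIBLE = 18_000_000  # ~18M possible gene pairs, from the slides


def random_coverage(n_selected: int, n_possible: int = N_POSSIBLE) -> float:
    """Coverage expected if the selected pairs were chosen at random."""
    return n_selected / n_possible


def enrichment(coverage: float, n_selected: int,
               n_possible: int = N_POSSIBLE) -> float:
    """Observed coverage relative to the random expectation (assumed definition)."""
    return coverage / random_coverage(n_selected, n_possible)


print(round(100 * random_coverage(313_000), 1))  # 1.7, quoted as ~2%
print(round(enrichment(0.42, 313_000)))          # 24 for CC
print(round(enrichment(0.34, 278_000)))          # 22 for KO

# Hypothetical relationship sets, to show the two ways of combining them.
cc_pairs = {("A", "B"), ("A", "C"), ("B", "D")}
ko_pairs = {("A", "B"), ("C", "D")}
either = cc_pairs | ko_pairs   # more pairs, so higher coverage
both = cc_pairs & ko_pairs     # fewer pairs, so less noise
print(len(either), len(both))
```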
4) Data integration
Obviously integration of orthogonal info. is good
but how to achieve it?
And what are the issues?
An example for subcellular localization in yeast
Subcellular Localization,
a standardized aspect of function
[Cell diagram: Cytoplasm, Nucleus, Membrane, ER, Golgi, Mitochondria, Extracellular [secreted]]
"Traditionally" subcellular localization is
"predicted" by sequence patterns
Nucleus: NLS
Membrane: TM-helix
ER: HDEL
Extracellular [secreted]: Sig. Seq.
Mitochondria: Import Sig.
(Cytoplasm and Golgi shown without a pattern)
Subcellular localization is associated with
the level of gene expression
[Cell diagram with Expression Level in Copies/Cell for Cytoplasm, Nucleus, Membrane, ER, Golgi, Mitochondria, Extracellular [secreted]]
Combine Expression Information &
Sequence Patterns to Predict Localization
[Cell diagram combining Expression Level in Copies/Cell with the patterns NLS, TM-helix, HDEL, Sig. Seq., Import Sig.]
Issues in Combining Many Features
Total of 30 diverse features (also including essentiality, coiled-coils, expression fluc., & obscure seq. patterns)
How to standardize features?
How to weight them?
[Cell diagram with the patterns NLS, TM-helix, HDEL, Sig. Seq., Import Sig.]
Bayesian System for Localizing Proteins
Everything expressed in standardized probabilistic terms
(Features as freq. in training set)
Sequentially apply features to refine prior assumed estimate using Bayes Rule
(Feature x Prior / Normalization)
Final estimate that naturally weights features comes out
[Diagram: Prior, then Feature 1: NLS (# NLS) gives New Estimate, then Feature 2: High Expr. gives Better Estimate, then Feature 3: Is Essential? gives Final Estimate]
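The update rule (Feature x Prior / Normalization) can be sketched as follows. This is a minimal illustration, not the system described in the talk: the compartments, priors and feature frequencies are invented, and applying features one after another assumes they are conditionally independent given the compartment.

```python
def bayes_update(prior, likelihood):
    """One step of Bayes Rule: posterior = likelihood x prior / normalization.

    prior:      compartment -> P(compartment)
    likelihood: compartment -> P(feature observed | compartment)
    """
    unnormalized = {c: prior[c] * likelihood[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: p / total for c, p in unnormalized.items()}


def localize(prior, feature_likelihoods):
    """Apply features sequentially; each posterior becomes the next prior."""
    estimate = dict(prior)
    for likelihood in feature_likelihoods:
        estimate = bayes_update(estimate, likelihood)
    return estimate


# Hypothetical numbers, standing in for frequencies from a training set.
PRIOR = {"nucleus": 0.25, "cytoplasm": 0.50, "mitochondria": 0.25}
HAS_NLS = {"nucleus": 0.6, "cytoplasm": 0.1, "mitochondria": 0.1}
HIGH_EXPR = {"nucleus": 0.2, "cytoplasm": 0.5, "mitochondria": 0.2}

after_nls = bayes_update(PRIOR, HAS_NLS)
final = localize(PRIOR, [HAS_NLS, HIGH_EXPR])
print({c: round(p, 3) for c, p in after_nls.items()})
print({c: round(p, 3) for c, p in final.items()})
```

Because every feature enters as a probability, a weakly informative feature (likelihoods nearly equal across compartments) barely moves the estimate, which is the sense in which the features are weighted naturally.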
Results on Testing Data
• 7-fold cross-validation training & test sets
• Overall compartment population: 96% accuracy
[Pie chart of compartment populations: ER, Cyt., TM, Mito., Nuc.]
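For readers unfamiliar with the procedure, a sketch of how 7-fold cross-validation partitions a data set into training and test sets. The function name and the interleaved assignment of items to folds are assumptions for illustration; the talk does not say how the folds were drawn.

```python
def k_fold_indices(n_items: int, k: int = 7):
    """Yield (train, test) index lists; each item is tested exactly once."""
    # Interleaved assignment: item i goes to fold i mod k.
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


splits = list(k_fold_indices(21, k=7))
for train, test in splits[:2]:
    print(len(train), test)
```

Each protein is used for testing once and for training in the other six rounds, so the reported accuracy is always measured on data the model has not seen.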
Extrapolation to Compartment Populations of Whole Yeast Genome
~3300 Known Localizations
from Existing DB + Expt. Localizations by Transposon Tagging & Direct Overexpression [Snyder]
Cyt. 1570, Nuc. 975, Mito. 452, Sec. Path 346, Not localized 2789
+ Predictions (from Bayesian System)
Cyt. 47%, Nuc. 27%, Mito. 13%, Sec. Path 13%
How to predict function
for 1000s of proteins?
1) "Traditional" sequence patterns
o Limitation: tight threshold
o Developed 40% threshold for sequence comparison
2) Via fold similarity (structural genomics)
o Limitation: multifunctionality on a fold, so weak relationship
o Measured extent of this in current DB
3) Clustering a microarray experiment
o Limitation: suggestive relationships but not yet predictive
o Local clustering found ~130K relationships beyond ~180K
simultaneous ones
4) Data integration
o Limitation: power is obvious but complicated to achieve
o Increased power. Works for localization of all proteins in yeast.
Proteins are central to
the 2 major post-genomic challenges
Analyzing protein fossils
Pseudogenes (ψG) as Disabled Homologies
S/T Protein Phosphatase PP1 (C-term)
…SRILCMHGGLSPHLQTLDQLRQLPRPQDPPNPSIGIDLLWADPDQWVKGWQAN
TRGVSYVFGQDVVADVCSRLDIDLVARAHQVVQDGYEFFASKKMVTIFSAPHYC
GQFDNSAATMKVDENMVCTFVMYKPTPKSMRRG*
[Figure: Worm Genome, chromosomes I II III IV V X]
Most Multiply Disabled Pseudogenic fragment
TKRTSNGFGQDVVVDLFSILDSGLVARAHX
VLQDIFEFFASKKMVTIFS#APHSPHSAPH
YCAQFDNSAATVKV
Types of ψG: Duplicated & Processed
[Diagram: Original Gene, Duplicated Gene, Duplicated ψG; Spliced mRNA (AAAAAA>), Processed ψG (AAAAAA>), Processed ψG with disablements (AAAAAA>)]
[Heidmann]
Large-scale Assignment of Pseudogenes
          Size    Genes    Total          Pseudogenes
          (Mb)    (apx)    ψG             Duplicated   Processed
Yeast     12      6200     221 (4%)       221          0
Worm      100     19000    2168 (11%)     1960         208
Fly       116     14000    <100 (1%)
Human*    69      930      350 (38%)      178          172
(*Chr 21+22 only)
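A quick consistency check on the table: the percentages appear to be pseudogenes relative to the gene count, and the duplicated and processed columns sum to the totals. The numbers are copied from the table; reading the percentage as ψG / genes is an inference that happens to reproduce all three values. Fly is omitted because only an upper bound is given.

```python
# organism -> (genes, total pseudogenes, duplicated, processed), from the table
TABLE = {
    "Yeast": (6200, 221, 221, 0),
    "Worm": (19000, 2168, 1960, 208),
    "Human chr 21+22": (930, 350, 178, 172),
}


def pseudogene_percent(genes: int, pseudogenes: int) -> int:
    """Pseudogenes as a rounded percentage of the gene count."""
    return round(100 * pseudogenes / genes)


for organism, (genes, total, duplicated, processed) in TABLE.items():
    assert duplicated + processed == total
    print(organism, pseudogene_percent(genes, total))
```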
ψG Calculations
• Large Basic Calculation
"blasting" ~1M fragments against protein DB
50 CPU days for worm
Parallel computing + large DBs
[Diagram: chromosome v. Protein DB]
• Integrating heterogeneous, dynamically changing annotation
Changing sequences, gene predictions, repeats
[Diagram: annotation tracks Sequence 1, Sequence 2, Sequence 3, Repeats 1, Repeats 2, Genes A, Genes B, ψG]
ψG: Questions
1) What are they composed of? (composition)
2) Where are they? (which organisms, chr. position)
3) Which type of proteins are they? Why? (functions)
4) Do processed & duplicated varieties differ?
Worm ψG: Lots! Good Stats
Amino-acid composition of Pseudogenes is
midway between Genes
and translated Intergenic DNA
[Plot: Frequency v. Amino Acid (sorted), Worm]
Amino-acid composition
of Pseudogenes is midway between Genes and
translated Intergenic DNA in many genomes
[Plots: Human, Worm, Yeast, Fly]
Midway composition also applies to codons
Pseudogene distribution on worm chromosomes: On Ends
• Pseudogenes elevated at ends of worm chr I
• Genes elevated in middle
• ψG/G ratio: 29% (max), 16% (min)
• ~50% ψG in terminal 3Mb vs ~30% G
Default: #ψG #genes, in a family
[Scatter plot: # genes in family v. # pseudogenes in family; labelled point RT, with values 28 and 59]
Completely Dead Families in Worm Genome
# worm ψG    organism of
in family    closest match     Function
3            E. coli           acrAB operon repressor
7            Yeast             unknown function
3            Vaccinia virus    unknown function
3            Vaccinia virus    Host range protein
5            Human             Complementing protein
4            Human             SEX gene
3            Human             Initiation factor 4A
4            Frog              Thyroid hormone receptor
4            Fly               Multidrug resistance protein 1
5            Cow               Polyadenylation specificity factor
Possible explanations: Extinction? Horz. Transfer? Contamination?
Worm ψG families:
chemoreceptors & transposon functions
Unique or highly expanded relative to fly
                                       Associated Gene Family in Worm vs. Fly
ψG Families                     ψG     (* = worm only, else relative expansion in worm)
RT                              59     6x
7TM chemoreceptor fam. #1       51     *
fam. unk. func. #1              31     *
7TM chemoreceptor fam. #2       27     *
7TM chemoreceptor fam. #3       22     *
Major sperm protein             21     28x
fam. unk. func. #3              20     *
fam. unk. func. #4              19     *
TcA transposase                 19     23x
7TM receptor fam. #4            17     1.4x
[Legend: Environmental Response Family]
Common worm Pseudofolds
[Figure: folds with their rank in ψG and rank in Genes]
[Scop, Murzin]
Fly ψG: Broken Fossils
Fly fossils are "broken up" by deletions
Explanation for having only 5% of ψG of worm
is high rate of small genomic deletions (~10 - 50 bp)
[Petrov, Hartl]
pseudomotifs
Same density of pseudomotifs
in fly and worm
Analyze occurrence of "pseudomotifs" (ancient, broken-up fossils) in intergenic regions relative to statistical expectation
1329 Prosite Motifs
(e.g. ZnF C-x(2,4)-C-x(3)-φ-x(8)-H-x(3,5)-H & Tubulin)
TM-helices
                                                              worm   fly
Size in Mb: Whole Genome                                      97     116
Size in Mb: Pure Intergenic Region (no repeats, ψgenes, &c)   42     61
Size in Mb: Pseudomotifs                                      1.3    2.1
Number of over represented motifs in Prosite                  40     70
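A sketch of how a Prosite-style pattern can be scanned for, plus the density arithmetic behind the slide title. The converter handles only the simple syntax used here (single residues, bracketed classes, `x`, and `(n)` or `(n,m)` repeats). The residue class `[LIVMFYWC]` for the hydrophobic position φ is an assumption, and the test sequence is invented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple Prosite-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":          # x means any residue
            core = "."
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(core + repeat)
    return "".join(parts)


# Zinc-finger pattern from the slide, with an assumed class for φ.
ZNF_PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
ZNF_REGEX = prosite_to_regex(ZNF_PATTERN)
print(ZNF_REGEX)

# Invented sequence containing one match.
sequence = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(bool(re.search(ZNF_REGEX, sequence)))


def density_percent(pseudomotif_mb: float, intergenic_mb: float) -> float:
    """Pseudomotif content as a percentage of pure intergenic sequence."""
    return round(100 * pseudomotif_mb / intergenic_mb, 1)


print(density_percent(1.3, 42), density_percent(2.1, 61))  # worm, fly
```

The two densities come out close to each other (about 3% of pure intergenic sequence in both genomes), which is the comparison the slide title summarizes.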
Human ψG: Duplicated v. Processed
Expansion of genome papers [Venter et al., Dunham et al., Lander et al.]
Top Functional Families
in Genes and ψG on Human Chr. 21 and 22
Genes                              Pseudogenes
Ig*                         69     Ig*                         70
Other DNA binding           32     Ribosomal protein           43
Nucleotide binding          28     Transcription factor        12
Transcription Factor        28     Other DNA binding           8
Other Nucleic-acid binding  16     Receptor                    8
Kinase                      14     Kinase                      5

Duplicated                         Processed
Ig*                         70     Ribosomal protein           42
Transcription factor        6      Transcription factor        7
Nucleotide binding          5      Other DNA binding           7
Kinase                      4      Receptor                    4
Transferase                 4      Other Nucleic-acid binding  4
Receptor                    4
[Legend: Environmental Response Family]
Distribution of processed ψG roughly
matches expectation of random insertions
Constant density between human & worm
                        human 21+22   worm
total size (Mb)         69            97
non-coding size (Mb)    67            70
processed ψG total      178           208
processed ψG per Mb     2.6           3.0
Occurrence related to expression: ribosomal proteins
Among ribosomal proteins:
Proportional selection between subunits
Uniform insertion across 21 & 22
[Figure: chr. 21 and chr. 22]
Yeast ψG: Simple Story & Mechanism
Yeast ψG concentrated near telomeres
[Figure: positions of Pseudogenes and Genes along the chromosomes]
Environmental response
functions of yeast ψG
Family                                                      ψG
Growth Inhibitor (GIN11)                                    16
Flocculins                                                  11
DUP (hypothet. TM prot. associated with pheromone resp.)    6
DEAD box helicase                                           6
Stress response (SRP/TIP, cell-wall mannoproteins)          3
5 most common families in ψG
Not same as the most common families in genes
[Legend: Environmental Response Family]
Yeast ψG come from yeast-specific families
Fraction having a non-yeast homolog
ψG 40%
genes 80%
Resurrecting pseudogenes: is it possible?
Hypothetical example of a flocculin
• FLO8 disabled in lab strain (S288C)
• Functioning FLO8 causes filamentous growth in most strains [Fink]
Idea of "untranslatable intermediates" in protein evolution has been around for a while
[Nei, '70; Koch, '72] [Walsh]
A speculative mechanism for
resurrecting yeast ψG, via [PSI+],
perhaps in environmental response
[PSI+] [Lindquist]
• Prion of Sup35p, translation-termination protein
• Causes read-through of stops
• Causes phenotypic diversity, through the
expression of new or altered proteins
We find suggestive evidence that PSI resurrects ψG:
• 35 ψG easily resurrectable with only 1 stop
• Microarrays show some of these are expressed
[M Snyder]
• Many involved in environmental response
Perhaps testable with selection experiments
Evolutionary Implications of a Reservoir of
Resurrectable ψG for Creating New Folds
Not all folds shared
between phylogenetic
groups
Evolution of new
folds
[Venn diagram: Plants, Eubacteria, Eukaryotes, Animals; region counts 20, 46, 156, 73, 104, 90]
Evolutionary Implications of a Reservoir of
Resurrectable ψG for Creating New Folds
Paradox: going between folds A & Z with all intermediates functional
Pseudogenes free of constraints of being transcribed & translated
[Koch '72]
ψG: Summary
1) What are they composed of?
Intermediate composition between genes & translated intergenic DNA.
2) Where are they?
On chromosomal ends.
In all organisms, though reduced to pseudomotifs in the fly.
3) Which type of proteins are they? Why?
Environmental response proteins.
ψG may be extra parts that can be resurrected.
Potential mechanism suggested for yeast involving [PSI+].
4) Do processed & duplicated varieties differ?
YES. (Duplicated ψG described above.) Processed ψG appear to be just randomly inserted from the mRNA pool. Hence, they show an obvious relationship to mRNA level & intergenic region size.
Practical Backdrop to Integrated Gene Annotation & Interpretation of DNA Arrays
[Figure: panels labeled I to VIII]
137 potential new yeast genes
• Integrated approach: homology search + transposons + microarrays
• Small ORFs & anti-sense to existing ORFs
Human DNA Arrays (ongoing)
• All of chr22 on a chip in ~1kb chunks, probe for expression, TF binding
• Need to have mapped landscape (genes, ψGs, repeats, SNPs, &c) to design chip & interpret results
[Snyder]
GeneCensus.org
PartsList.org
[Figure: site map with components Detailed Tables, Alignment Database, Alignment Server, ORF Query, PDB Query, Ranks, Trees]
Acknowledgements
Predicting Protein Function on a Genomic Scale
Jiang Qian, Ronald Jansen, Amar Drawid, Cyrus Wilson, Hedi Hegyi, Dov Greenbaum, Haiyuan Yu, Jimmy Lin, Ning Lan, Yuval Kluger
Analysis of Pseudogenes
Paul Harrison, Zhaolei Zhang, Nathaniel Echols, Suganthi Balasubramanian, Nicholas Luscombe, Paul Bertone, Ted Johnson, Patrick McGarvey
Other Projects
Yang Liu, Jochen Junker, Vadim Alexandrov, Rajdeep Das, Werner Krebs, Brad Stenger
Collaborators
M Snyder (A Kumar, H Zhu, M Bilgin, C Horack …)
S Weissman (Z Lian, S Yamaga…)
P Miller (K Cheung), M Schultz
G Montelione et al.
PartsList.org, GeneCensus.org, nesg.org
Assessing Function Globally
Known associated pairs have same "cellular role" (according to MIPS, GO, &c)
Results on Function Prediction
[Figure: panel "C. Functions", odds ratio R (0 to 6) versus score S (9 to 16), with a P-value cutoff marked at 2.7e-3; legend entries: timeshifted, simultaneous]
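The plot of R against score S is not defined in the slide text. One simple reading, offered here only as an assumption, is a fold enrichment: the fraction of gene pairs scoring at or above a cutoff that share a functional category, divided by the same fraction over all pairs. The sketch below uses the hypothetical name `enrichment_ratio` and made-up toy pairs; it is not the talk's actual scoring procedure.

```python
# Minimal sketch (assumption: R is read as fold enrichment over the background rate).
def enrichment_ratio(pairs, cutoff):
    """pairs: iterable of (score, shares_function) tuples; returns the enrichment."""
    pairs = list(pairs)
    # Keep only the function flags of pairs that pass the score cutoff.
    selected = [shared for score, shared in pairs if score >= cutoff]
    if not pairs or not selected:
        raise ValueError("no pairs available at this cutoff")
    background = sum(shared for _, shared in pairs) / len(pairs)
    if background == 0:
        raise ValueError("no pair shares a function; ratio is undefined")
    hit_rate = sum(selected) / len(selected)
    return hit_rate / background


# Toy data, purely illustrative: (score S, pair shares a functional category).
toy_pairs = [(15, True), (14, True), (13, False), (10, False),
             (9, True), (9, False), (9, False), (9, False)]
```

With these toy pairs the background rate is 3/8, so `enrichment_ratio(toy_pairs, 14)` is 1.0 / 0.375, and raising the cutoff increases the ratio, which is the qualitative trend a plot of R against S would show.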
Correlation:
Always Significant
Sometimes Significant (depends on expt.)
Never Significant
Based on Distributions, Correlation of Established Functional Categories, Computer Clusterings
Results on Interaction Prediction
[Figure: panel "B. Interactions", odds ratio R (0 to 25), with curves labeled unmatched, inverted, timeshifted, and simultaneous; panel "C. Functions", odds ratio R (0 to 6) versus score S (9 to 16), with a P-value cutoff marked at 2.7e-3]
Protein-Protein Interactions & Expression
Cell Cycle CDC28 expt. (Davis)
Sets of interactions (all pairs, control)
Pairwise interactions (from MIPS) (Uetz et al.)
(strong interactions in permanent complexes, clearly diff.)
between selected expression timecourses
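Comparing expression timecourses for a pair of genes can be sketched with a plain Pearson correlation, extended to the inverted and timeshifted relationships named in the plot legends. This is a simplified global correlation under stated assumptions (equal-length profiles, a fixed shift of one time point by default); the function names are hypothetical and this is not presented as the scoring method used in the talk.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sum(a * b for a, b in zip(dx, dy)) / denom


def relationship_scores(x, y, shift=1):
    """Score a gene pair under three relationships (shift is an assumed parameter)."""
    return {
        "simultaneous": pearson(x, y),               # profiles rise and fall together
        "inverted": -pearson(x, y),                  # one rises while the other falls
        "timeshifted": pearson(x[:-shift], y[shift:]),  # y lags x by `shift` points
    }
```

Taking the largest of the three scores gives a crude label for the pair; a distribution of such scores over known interacting pairs could then be compared with the all-pairs control.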