Automating comparative and metagenomics methods using Perl and BioPerl
Document Sample


Automating comparative and
metagenomics methods using
Perl and BioPerl
Jason Stajich
UC Berkeley
1
Workshop introduction
• Programming in Biology
• Pipelines and Automation of Genome analyses
• Phylogenomics and Metagenomics
2
Goals
• Introduction to Bioinformatics solutions with BioPerl
• Some motivation for learning the programming
approaches
• Generic solutions to comparative genomics questions
3
Programming languages
• Why Perl? Why not (Python,Ruby,etc)
• I don’t believe in language wars -- use what makes sense
to you. Perl makes sense to me.
• Perl has had a large community developing modules to
support Bioinformatics
4
BioPerl
• Toolkit of Perl modules for lifesciences data
• Collaborative open-source
• Multiple contributors from academic and industry
groups
• Not directly funded through grants
• http://bioperl.org/
5
BioPerl: What can you do?
• Parse, write out Sequence data in a variety of file
formats
• FASTA, GenBank, EMBL, SwissProt, FASTQ
• Process results from BLAST, HMMER, FASTA, etc.
• Parse Multiple Sequence Alignments
6
Parse some Sequence
Script
use
strict;
use
Bio::SeqIO;
my
$file
=
‘input.fasta’; Input
my
$io
=
Bio::SeqIO‐>new(‐format
=>
‘fasta’, >Locus123
‐file
=>
$file); tatatatatt
ctcgtatgtc
ttattctacc
my
$seq
=
$io‐>next_seq; aacacaacaa
taagcgtgtt
gcagtcaatg
print
$seq‐>display_id,
“
is
display
id\n”;
print
$seq‐>seq,
“
is
sequence
string\n”;
print
$seq‐>length,
“
is
sequence
length\n”;
Output
Locus123
tatatatattctcgtatgtcttattctaccaacacaacaataagcgtgttgcagtcaatg
60
7
Sequence data
• Bio::SeqIO module for parsing and writing sequence in
many different formats
• Represent sequence data as a Bio::Seq Object
• Wrap up Sequence data, Annotations (Features), and
methods to compute on it (translate, subseq, length)
8
Translate sequence
Script
use
strict;
use
Bio::SeqIO;
my
$file
=
‘input.fasta’; Input
my
$io
=
Bio::SeqIO‐>new(‐format
=>
‘fasta’, >Locus123
‐file
=>
$file); tatatatatt
ctcgtatgtc
ttattctacc
my
$out
=
Bio::SeqIO‐>new(‐format
=>
‘fasta’); aacacaacaa
taagcgtgtt
gcagtcaatg
my
$seq
=
$io‐>next_seq;
my
$peptide
=
$seq‐>translate;
$out‐>write_seq($peptide);
Output
>Locus123
YIYSRMSYSTNTTISVLQSM
9
Process BLAST reports
• Sequence database searches components
• Query sequence
• Significant Hits
• Pairwise Alignments of Query and Hit (Subject)
sequences
10
Process BLAST reports
• Sequence database searches components
• Result - Total search result for a single query sequence
• Hit - Sequences from database with significant similarity
to Query
• HSP - High-scoring Segment Pair
• Pairwise alignment segment between Query and Hit
11
Reference:
Gish,
W.
(1996‐2000)
http://blast.wustl.edu
Query=
BOSS_DROME
Bride
of
sevenless
protein
precursor.
(896
letters)
Database:
wormpep87
Result
20,881
sequences;
9,238,759
total
letters.
Searching....10....20....30....40....50....60....70....80....90....100%
done
Smallest
Sum
High
Probability
Sequences
producing
High‐scoring
Segment
Pairs:
Score
P(N)
N
F35H10.10
CE24945
status:Partially_confirmed
TR:Q20073...
182
4.9e‐11
1
M02H5.2
CE25951
status:Predicted
TR:Q966H5
protein_id:...
86
0.15
1
ZC506.4
CE01682
locus:mgl‐1
metatrophic
glutamate
recept...
91
0.18
1
F23D12.2
CE05700
status:Partially_confirmed
TR:Q19761
...
73
0.45
3 Hit
>F35H10.10
CE24945
status:Partially_confirmed
TR:Q20073
protein_id:AAA81683.2
Length
=
1404
Score
=
182
(69.1
bits),
Expect
=
4.9e‐11,
P
=
4.9e‐11
Identities
=
75/315
(23%),
Positives
=
149/315
(47%)
Query:
511
YPFLFDGESVMFWRIKMDTWVATGLTAAILGLIATLAILVFIVVRISLGDVFEGNPTTSI
570
Y
+F+
+
WR
+V
L
++
+
+A+LV
++V++
L
V
+GN
+
I
Sbjct:
1006
YQSVFEHITTGHWRDHPHNYVLLALITVLV‐‐VVAIAVLVLVLVKLYLR‐VVKGNQSLGI
1062
Query:
571
LLLLSLILVFCSFVPYSIEYVGEQRNSHVTFEDAQTLNTLCAVRVFIMTLVYCFVFSLLL
630 HSP
LL+
+I++
YS
+
F+
+++C
+RV
+
L
Y
F
+++
Sbjct:
1063
SLLIGIIIL‐‐‐‐‐‐YSTAFF‐‐‐‐‐‐‐FVFDPT‐‐‐DSVCRLRVILHGLGYTICFGVMI
1106
Query:
631
CRAVMLASIGSEG‐GFLSHVNGYIQAVICAFSVVAQVGMSVQLLVVMHVASETVSCENIY
689
+A
L
+
+
G
G
H++
+
++
F
V
Q+
+S+
+
+++
V
N+
Sbjct:
1107
AKATQLRNAETLGFGTAIHISFWNYWLLLFFIVGVQIALSISWFLEPFMSTIGVIDTNVQ
1166
12
Reference:
Gish,
W.
(1996‐2000)
http://blast.wustl.edu
Query=
BOSS_DROME
Bride
of
sevenless
protein
precursor.
(896
letters)
Database:
wormpep87
Result
20,881
sequences;
9,238,759
total
letters.
Searching....10....20....30....40....50....60....70....80....90....100%
done
Smallest
Sum
High
Probability
Sequences
producing
High‐scoring
Segment
Pairs:
Score
P(N)
N
F35H10.10
CE24945
status:Partially_confirmed
TR:Q20073...
182
4.9e‐11
1
M02H5.2
CE25951
status:Predicted
TR:Q966H5
protein_id:...
86
0.15
1
ZC506.4
CE01682
locus:mgl‐1
metatrophic
glutamate
recept...
91
0.18
1
F23D12.2
CE05700
status:Partially_confirmed
TR:Q19761
...
73
0.45
3 Hit
>F35H10.10
CE24945
status:Partially_confirmed
TR:Q20073
protein_id:AAA81683.2
Length
=
1404
Score
=
182
(69.1
bits),
Expect
=
4.9e‐11,
P
=
4.9e‐11
Identities
=
75/315
(23%),
Positives
=
149/315
(47%)
Query:
511
YPFLFDGESVMFWRIKMDTWVATGLTAAILGLIATLAILVFIVVRISLGDVFEGNPTTSI
570
Y
+F+
+
WR
+V
L
++
+
+A+LV
++V++
L
V
+GN
+
I
Sbjct:
1006
YQSVFEHITTGHWRDHPHNYVLLALITVLV‐‐VVAIAVLVLVLVKLYLR‐VVKGNQSLGI
1062
Query:
571
LLLLSLILVFCSFVPYSIEYVGEQRNSHVTFEDAQTLNTLCAVRVFIMTLVYCFVFSLLL
630 HSP
LL+
+I++
YS
+
F+
+++C
+RV
+
L
Y
F
+++
Sbjct:
1063
SLLIGIIIL‐‐‐‐‐‐YSTAFF‐‐‐‐‐‐‐FVFDPT‐‐‐DSVCRLRVILHGLGYTICFGVMI
1106
Query:
631
CRAVMLASIGSEG‐GFLSHVNGYIQAVICAFSVVAQVGMSVQLLVVMHVASETVSCENIY
689
+A
L
+
+
G
G
H++
+
++
F
V
Q+
+S+
+
+++
V
N+
Sbjct:
1107
AKATQLRNAETLGFGTAIHISFWNYWLLLFFIVGVQIALSISWFLEPFMSTIGVIDTNVQ
1166
12
Reference:
Gish,
W.
(1996‐2000)
http://blast.wustl.edu
Query=
BOSS_DROME
Bride
of
sevenless
protein
precursor.
(896
letters)
Database:
wormpep87
Result
20,881
sequences;
9,238,759
total
letters.
Searching....10....20....30....40....50....60....70....80....90....100%
done
Smallest
Sum
High
Probability
Sequences
producing
High‐scoring
Segment
Pairs:
Score
P(N)
N
F35H10.10
CE24945
status:Partially_confirmed
TR:Q20073...
182
4.9e‐11
1
M02H5.2
CE25951
status:Predicted
TR:Q966H5
protein_id:...
86
0.15
1
ZC506.4
CE01682
locus:mgl‐1
metatrophic
glutamate
recept...
91
0.18
1
F23D12.2
CE05700
status:Partially_confirmed
TR:Q19761
...
73
0.45
3 Hit
>F35H10.10
CE24945
status:Partially_confirmed
TR:Q20073
protein_id:AAA81683.2
Length
=
1404
Score
=
182
(69.1
bits),
Expect
=
4.9e‐11,
P
=
4.9e‐11
Identities
=
75/315
(23%),
Positives
=
149/315
(47%)
Query:
511
YPFLFDGESVMFWRIKMDTWVATGLTAAILGLIATLAILVFIVVRISLGDVFEGNPTTSI
570
Y
+F+
+
WR
+V
L
++
+
+A+LV
++V++
L
V
+GN
+
I
Sbjct:
1006
YQSVFEHITTGHWRDHPHNYVLLALITVLV‐‐VVAIAVLVLVLVKLYLR‐VVKGNQSLGI
1062
Query:
571
LLLLSLILVFCSFVPYSIEYVGEQRNSHVTFEDAQTLNTLCAVRVFIMTLVYCFVFSLLL
630 HSP
LL+
+I++
YS
+
F+
+++C
+RV
+
L
Y
F
+++
Sbjct:
1063
SLLIGIIIL‐‐‐‐‐‐YSTAFF‐‐‐‐‐‐‐FVFDPT‐‐‐DSVCRLRVILHGLGYTICFGVMI
1106
Query:
631
CRAVMLASIGSEG‐GFLSHVNGYIQAVICAFSVVAQVGMSVQLLVVMHVASETVSCENIY
689
+A
L
+
+
G
G
H++
+
++
F
V
Q+
+S+
+
+++
V
N+
Sbjct:
1107
AKATQLRNAETLGFGTAIHISFWNYWLLLFFIVGVQIALSISWFLEPFMSTIGVIDTNVQ
1166
12
Reference:
Gish,
W.
(1996‐2000)
http://blast.wustl.edu
Query=
BOSS_DROME
Bride
of
sevenless
protein
precursor.
(896
letters)
Database:
wormpep87
Result
20,881
sequences;
9,238,759
total
letters.
Searching....10....20....30....40....50....60....70....80....90....100%
done
Smallest
Sum
High
Probability
Sequences
producing
High‐scoring
Segment
Pairs:
Score
P(N)
N
F35H10.10
CE24945
status:Partially_confirmed
TR:Q20073...
182
4.9e‐11
1
M02H5.2
CE25951
status:Predicted
TR:Q966H5
protein_id:...
86
0.15
1
ZC506.4
CE01682
locus:mgl‐1
metatrophic
glutamate
recept...
91
0.18
1
F23D12.2
CE05700
status:Partially_confirmed
TR:Q19761
...
73
0.45
3 Hit
>F35H10.10
CE24945
status:Partially_confirmed
TR:Q20073
protein_id:AAA81683.2
Length
=
1404
Score
=
182
(69.1
bits),
Expect
=
4.9e‐11,
P
=
4.9e‐11
Identities
=
75/315
(23%),
Positives
=
149/315
(47%)
Query:
511
YPFLFDGESVMFWRIKMDTWVATGLTAAILGLIATLAILVFIVVRISLGDVFEGNPTTSI
570
Y
+F+
+
WR
+V
L
++
+
+A+LV
++V++
L
V
+GN
+
I
Sbjct:
1006
YQSVFEHITTGHWRDHPHNYVLLALITVLV‐‐VVAIAVLVLVLVKLYLR‐VVKGNQSLGI
1062
Query:
571
LLLLSLILVFCSFVPYSIEYVGEQRNSHVTFEDAQTLNTLCAVRVFIMTLVYCFVFSLLL
630 HSP
LL+
+I++
YS
+
F+
+++C
+RV
+
L
Y
F
+++
Sbjct:
1063
SLLIGIIIL‐‐‐‐‐‐YSTAFF‐‐‐‐‐‐‐FVFDPT‐‐‐DSVCRLRVILHGLGYTICFGVMI
1106
Query:
631
CRAVMLASIGSEG‐GFLSHVNGYIQAVICAFSVVAQVGMSVQLLVVMHVASETVSCENIY
689
+A
L
+
+
G
G
H++
+
++
F
V
Q+
+S+
+
+++
V
N+
Sbjct:
1107
AKATQLRNAETLGFGTAIHISFWNYWLLLFFIVGVQIALSISWFLEPFMSTIGVIDTNVQ
1166
12
use
Bio::SearchIO;
my
$cutoff
=
’0.001’;
my
$file
=
‘BOSS_Ce.BLASTP’,
my
$in
=
Bio::SearchIO‐>new(‐format
=>
‘blast’,
‐file
=>
$file);
while(
my
$r
=
$in‐>next_result
)
{
print
"Query
is:
",
$r‐>query_name,
"
",
$r‐>query_description,"
",
$r‐>query_length,"
aa\n";
print
"
Matrix
was
",
$r‐>get_parameter(’matrix’),
"\n";
while(
my
$h
=
$r‐>next_hit
)
{
last
if
$h‐>significance
>
$cutoff;
print
"Hit
is
",
$h‐>name,
"\n";
while(
my
$hsp
=
$h‐>next_hsp
)
{
print
"
HSP
Len
is
",
$hsp‐>length(’total’),
"
",
"
E‐value
is
",
$hsp‐>evalue,
"
Bit
score
",
$hsp‐>score,
"
\n",
"
Query
loc:
",$hsp‐>query‐>start,
"
",
$hsp‐>query‐>end,"
",
"
Sbject
loc:
",$hsp‐>hit‐>start,
"
",
$hsp‐>hit‐>end,"\n";
}
}
}
13
Parsing Result
Query
is:
BOSS_DROME
Bride
of
sevenless
protein
precursor.
896
aa
Matrix
was
BLOSUM62
Hit
is
F35H10.10
HSP
Len
is
315
E‐value
is
4.9e‐11
Bit
score
182
Query
loc:
511
813
Sbject
loc:
1006
1298
HSP
Len
is
28
E‐value
is
1.4e‐09
Bit
score
39
Query
loc:
508
535
Sbject
loc:
427
454
14
FASTA
searches
a
protein
or
DNA
sequence
data
bank
version
3.4t20
Sep
26,
2002
Please
cite:
W.R.
Pearson
&
D.J.
Lipman
PNAS
(1988)
85:2444‐2448
Query
library
BOSS_DROME.aa
vs
/blast/wormpep87
library
searching
/blast/wormpep87
library
1>>>BOSS_DROME
Bride
of
sevenless
protein
precursor.
‐
896
aa
vs
/blast/wormpep87
library
9238759
residues
in
20881
sequences
Expectation_n
fit:
rho(ln(x))=
5.6098+/‐0.000519;
mu=
12.8177+/‐
0.030
mean_var=107.8223+/‐22.869,
0's:
0
Z‐trim:
2
B‐trim:
0
in
0/62
Lambda=
0.123515
Kolmogorov‐Smirnov
statistic:
0.0333
(N=29)
at
48
FASTA
(3.45
Mar
2002)
function
[optimized,
BL50
matrix
(15:‐5)]
ktup:
2
join:
38,
opt:
26,
gap‐pen:
‐12/‐2,
width:
16
Scan
time:
9.680
The
best
scores
are:
opt
bits
E(20881)
F35H10.10
CE24945
status:Partially_confirmed
T
(1404)
207
48.5
6.8e‐05
T06E4.11
CE06377
locus:pqn‐63
status:Predicted
(
275)
122
32.6
0.8
C33B4.3
CE01508
ankyrin
and
proline
rich
domain
(1110)
124
33.6
1.6
Y48C3A.8
CE22141
status:Predicted
TR:Q9NAG3
pr
(
291)
110
30.5
3.7
Y34D9A.2
CE30217
status:Partially_confirmed
TR
(
326)
108
30.2
5.1
K06H7.3
CE26941
Isopentenyl‐diphosphate
delta
i
(
618)
107
30.3
8.9
F44B9.8
CE29044
ARPA
status:Partially_confirmed
(
388)
104
29.5
9.4
>>F35H10.10
CE24945
status:Partially_confirmed
TR:Q20
(1404
aa)
initn:
94
init1:
94
opt:
207
Z‐score:
197.9
bits:
48.5
E():
6.8e‐05
Smith‐Waterman
score:
275;
22.527%
identity
(27.152%
ungapped)
in
728
aa
overlap
(207‐847:640‐1330)
180
190
200
210
220
230
BOSS_D
RAISIDNASLAENLLIQEVQFLQQCTTYSMGIFVDWELYKQLESVIKD‐‐‐LEYNIWPIP
:
:
.
.
.:
::
..
:.
..
.:
F35H10
NQAGRNITIVPKSVFGYASALHGDSQESLKGYFSSGDTDASLVSVDSEHSALQRSFTALP
610
620
630
640
650
660
15
use
Bio::SearchIO;
my
$cutoff
=
’0.001’;
my
$file
=
‘BOSS_Ce.BLASTP’,
my
$in
=
Bio::SearchIO‐>new(‐format
=>
‘fasta’,
‐file
=>
$file);
while(
my
$r
=
$in‐>next_result
)
{
print
"Query
is:
",
$r‐>query_name,
"
",
$r‐>query_description,"
",
$r‐>query_length,"
aa\n";
print
"
Matrix
was
",
$r‐>get_parameter(’matrix’),
"\n";
while(
my
$h
=
$r‐>next_hit
)
{
last
if
$h‐>significance
>
$cutoff;
print
"Hit
is
",
$h‐>name,
"\n";
while(
my
$hsp
=
$h‐>next_hsp
)
{
print
"
HSP
Len
is
",
$hsp‐>length(’total’),
"
",
"
E‐value
is
",
$hsp‐>evalue,
"
Bit
score
",
$hsp‐>score,
"
\n",
"
Query
loc:
",$hsp‐>query‐>start,
"
",
$hsp‐>query‐>end,"
",
"
Sbject
loc:
",$hsp‐>hit‐>start,
"
",
$hsp‐>hit‐>end,"\n";
}
}
}
16
Query is: BOSS_DROME Bride of sevenless protein
precursor. 896 aa
Matrix was BL50
Hit is F35H10.10
HSP Len is 728 E-value is 6.8e-05 Bit score 197.9
Query loc: 207 847 Sbject loc: 640 1330
17
Parsing Multiple Alignments
• Read in many different formats
• Clustal, Phylip, Selex, Stockholm, multi-FASTA, Nexus,
MAF
• Create a Multiple Alignment Object
• Process alignment (trim gapped columns, calculate
percent identity, etc)
18
Input
CATL_HUMAN
......MNPT
LILAAFCLGI
..........
...ASATLTF
DHSLEAQWTK
CATL_RAT
......MTPL
LLLAVLCLGT
..........
...ALATPKF
DQTFNAQWHQ
CATH_RAT
...MWTALPL
LCAGAWLLSA
G.........
...ATAELTV
NAIEKFHFTS
PAPA_CARPA
MAMIPSISKL
LFVAICLFVY
MGLSFGDFSI
VGYSQNDLTS
TERLIQLFES
Script
use
strict;
use
Bio::AlignIO;
my
$in
=
Bio::AlignIO‐>new(‐format
=>
‘msf’,
‐file
=>
‘hmmer.aln’);
my
$out
=Bio::AlignIO‐>new(‐format
=>
‘clustalw’,
‐file
=>
‘>out.aln’);
my
$aln
=
$in‐>next_aln;
$aln‐>map_chars(‘\.’,
‘‐’);
$aln‐>map_chars(‘\*’,
‘‐’);
my
$ungapped
=
$aln‐>remove_gaps;
$out‐>write_aln($ungapped);
print
“percent
identity=”,
$ungapped‐>percentage_identity,
“\n”;
19
Result
24.7311827956989
CLUSTAL
W(1.81)
multiple
sequence
alignment
CATL_HUMAN/1‐31
MNPTLILAAFCLGIASATLTFDHSLEAQWTK
CATL_RAT/1‐31
MTPLLLLAVLCLGTALATPKFDQTFNAQWHQ
CATH_RAT/4‐35
ALPLLCAGAWLLSAATAELTVNAIEKFHFTS
PAPA_CARPA/7‐50
ISKLLFVAICLFVYSQNDLTSTERLIQLFES
*
.
:
:
.
:
.
20
Phylogenetic Trees
• Parse and Write various Tree formats
• Manipulate (re-root, remove nodes, add bootstraps) and
compute relationships on trees
21
Read/Write Trees
use
Bio::TreeIO;
use
strict;
my
$in
=
Bio::TreeIO‐>new(‐format
=>
‘nexus’,
‐file
=>
‘trees.nex’);
my
$out
=
Bio::TreeIO‐>new(‐format
=>
‘newick’,
‐file
=>
‘>trees.nh’);
while(
my
$tree
=
$in‐>next_tree
)
{
$out‐>write_tree($tree);
}
22
Fetching subset of nodes
use
strict;
use
Bio::TreeIO;
my
$in
=
Bio::TreeIO‐>new(‐format
=>
‘newick’,
‐fh
=>
\*DATA);
if(
my
$tree
=
$in‐>next_tree
)
{
my
@nodes
=
$tree‐>get_nodes;
my
@tips
=
grep
{
$_‐>is_Leaf
}
@nodes;
print
"there
are
",scalar
@tips,
"
tips\n";
my
@internal
=
grep
{
!
$_‐>is_Leaf
}
@nodes;
print
"there
are
",scalar
@internal,
"
internal
nodes\n";
my
($A_node)
=
$tree‐>find_node(‐id
=>
'A');
print
"branch
length
of
Ancestor
of
",$A_node‐>id,
A
"
is
",
$A_node‐>ancestor‐>branch_length,
"\n";
} B
__DATA__
C
(((A:10,B:11):2,C:5),((D:7,F:6):17,G:8));
Output D
there
are
6
tips F
there
are
5
internal
nodes
G
branch
length
of
Ancestor
of
A
is
2
23
Walking up the tree (tips to root)
if(
my
$tree
=
$in‐>next_tree
)
{
my
@tips
=
grep
{
$_‐>is_Leaf
}
$tree‐>get_nodes;
for
my
$node
(
@tips
)
{
my
@path;
while(
defined
$node)
{
push
@path,
$node‐>id;
$node
=
$node‐>ancestor;
}
print
join(“,“,
@path),
“\n”;
}
}
__DATA__
(((A:10,B:11)I1:2,C:5)I2,((D:7,F:6)I3:17,G:8)I4)Root; A
I1
Output I2
B
C,I2,Root C
A,I1,I2,Root Root
B,I1,I2,Root D
I3
G,I4,Root
F
D,I3,I4,Root I4
F,I3,I4,Root G
24
Visualizations
25
Draw Sequence Graphics
• Read in GenBank file
• Write out features as tracks
• Automatically grab feature names and
26
LOCUS
AB077698
2701
bp
mRNA
linear
PRI
01‐MAR‐2002
/db_xref="GI:19032345"
DEFINITION
Homo
sapiens
mRNA
for
hCHCR‐G,
complete
cds.
/translation="MTAVNVALIRDTKWLTLEVCREFQRGTCSRADADCKFAHPPRVC
ACCESSION
AB077698
HVENGRVVACFDSLKGRCTRENCKYLHPPPHLKTQLEINGRNNLIQQKTAAAMFAQQM
VERSION
AB077698.1
GI:19032344
QLMLQNAQMSSLGSFPMTPSIPANPPMAFNPYIPHPGMGLVPAELVPNTPVLIPGNPP
KEYWORDS
.
LAMPGAVGPKLMRSDKLEVCREFQRGNCTRGENDCRYAHPTDASMIEASDNTVTICMD
SOURCE
Homo
sapiens
(human)
YIKGRCSREKCKYFHPPAHLQARLKAAHHQMNHSAASAMALQPGTLQLIPKRSALEKP
ORGANISM
Homo
sapiens
NGATPVFNPTVFHCQQALTNLQLPQPAFIPAGPILCMAPASNIVPMMHGATPTTVSAA
Eukaryota;
Metazoa;
Chordata;
Craniata;
Vertebrata;
Euteleostomi;
TTPATSVPFAAPTTGNQLKF"
Mammalia;
Eutheria;
Euarchontoglires;
Primates;
Haplorrhini;
misc_feature
137..196
Catarrhini;
Hominidae;
Homo.
/gene="CHCR"
REFERENCE
1
/note="Cys3His,
zinc
finger
AUTHORS
Squillace,R.M.
and
Wang,E.H.
Encoded
on
BAC
clone
RP5‐842K24
(AL050310)"
TITLE
Genomic
structure,
chromosomal
localization,
and
splicing
variation
misc_feature
239..292
of
the
human
CHCR
gene,
cloning
and
characterization
of
mouse
CHCR
/gene="CHCR"
JOURNAL
Unpublished
/note="Cys3His,
zinc
finger
REFERENCE
2
Encoded
on
BAC
clone
RP5‐842K24
(AL050310)"
AUTHORS
Squillace,R.M.,
Chenault,D.M.
and
Wang,E.H.
misc_feature
617..676
TITLE
Inhibition
of
myogenesis
by
the
novel
Muscleblind‐related
protein
/gene="CHCR"
CHCR
/note="Cys3His,
zinc
finger
JOURNAL
Unpublished
Encoded
on
BAC
clone
RP5‐842K24
(AL050310)"
REFERENCE
3
(bases
1
to
2701)
misc_feature
725..778
AUTHORS
Squillace,R.M.,
Chenault,D.M.
and
Wang,E.H.
/gene="CHCR"
TITLE
Direct
Submission
/note="Cys3His,
zinc
finger
JOURNAL
Submitted
(10‐JAN‐2002)
Edith
H.
Wang,
University
of
Washington,
Encoded
on
BAC
clone
RP5‐842K24
(AL050310)"
Pharmacology;
1959
NE
Pacific
Ave.,
Box
357280,
Seattle,
Washington
3'UTR
1145..2659
98195,
USA
(E‐mail:ehwang@u.washington.edu,
Tel:206‐616‐5376,
/gene="CHCR"
Fax:206‐685‐3822)
polyA_site
1606
FEATURES
Location/Qualifiers
/gene="CHCR"
source
1..2701
/note="Encoded
on
BAC
clone
RP5‐842K24
(AL050310);
/organism="Homo
sapiens"
PolyA_site#1
used
by
CHCR
EST
clone
PLACE1010202
/mol_type="mRNA"
(AK002178)"
/db_xref="taxon:9606"
polyA_site
2660
/chromosome="X"
/gene="CHCR"
/map="Xq26.1"
/note="Encoded
on
BAC
clone
RP5‐842K24
(AL050310);
gene
1..2701
PolyA_site#2
used
by
CHCR
EST
clone
DKFZp434G2222
/gene="CHCR"
(AL133625)"
5'UTR
<1..79 ORIGIN
/gene="CHCR"
1
aattcatttt
taatccttta
atagtccaca
gtaatattgt
cctaaagagg
gtacattgga
CDS
80..1144
61
ttttaatttt
gctttcaata
tgacggctgt
caatgttgcc
ctgattcgtg
ataccaagtg
/gene="CHCR"
121
gctgacttta
gaagtctgta
gagaatttca
gagaggaact
tgctctcgag
ctgatgcaga
/note="Cys3His
CCG1‐Required
181
ttgcaagttt
gcccatccac
caagagtttg
ccatgtggaa
aatggtcgtg
tggtggcctg
Encoded
on
BAC
clone
RP5‐842K24
(AL050310)
241
ttttgattct
ctaaagggtc
ggtgtacccg
agagaactgc
aagtaccttc
accctcctcc
The
human
CHCR
(Cys3His
CCG1‐Required)
protein
is
highly
301
acacttaaaa
acgcagctgg
agattaatgg
gcggaacaat
ctgattcaac
agaagactgc
related
to
EXP/MBNL
(Y13829,
NM_021038,
AF401998)
and
MBLL
361
cgcagccatg
ttcgcccagc
agatgcagct
tatgctccaa
aacgctcaaa
tgtcatcact
(NM_005757,AF061261),
which
together
comprise
the
human
421
tggttctttt
cctatgactc
catcaattcc
agctaatcct
cccatggctt
tcaatcctta
Muscleblind
family"
481
cataccacat
cctgggatgg
gcctcgttcc
tgcagaactt
gtaccaaata
cacctgttct
/codon_start=1
541
gattcctgga
aacccacctc
ttgcaatgcc
aggagctgtt
ggcccaaaac
tgatgcgttc
/product="hCHCR‐G"
601
agataaactg
gaggtttgcc
gagaatttca
gcgtggaaat
tgtacccgtg
gggagaatga
/protein_id="BAB85648.1"
661
ttgccgctat
gctcacccta
ctgatgcttc
catgattgaa
gcgagtgata
atactgtgac
721
aatctgcatg
gattacatca
aaggtcgatg
ctcgcgggag
aaatgcaagt
actttcatcc
27
Rendered Sequence File
$
perl
render.pl
seqfile.gbk
>
result.png
28
use
strict;
use
Bio::Graphics;
use
Bio::SeqIO;
use
Bio::SeqFeature::Generic;
$panel‐>add_track(generic
=>
Bio::SeqFeature::Generic‐>new(‐start
=>
1,
my
$file
=
shift
‐end
=>
$seq‐>length,
or
die
"provide
a
sequence
file
as
the
argument";
‐bgcolor
=>
'blue',
my
$io
=
Bio::SeqIO‐>new(‐file=>$file)
‐label
=>
1));
or
die
"couldn't
create
Bio::SeqIO";
my
$seq
=
$io‐>next_seq
#
general
case
or
die
"couldn't
find
a
sequence
in
the
file";
my
@colors
=
qw(cyan
orange
blue
purple
green
chartreuse
magenta
yellow
aqua);
my
@features
=
$seq‐>all_SeqFeatures;
my
$idx
=
0;
#
sort
features
by
their
primary
tags
for
my
$tag
(sort
keys
%sorted_features)
{
my
%sorted_features;
my
$features
=
$sorted_features{$tag};
for
my
$f
(@features)
{
$panel‐>add_track($features,
my
$tag
=
$f‐>primary_tag;
‐glyph
=>
'generic',
if(
!
$f‐>has_tag('note')
&&
$f‐>has_tag('gene')
)
{
‐bgcolor
=>
$colors[$idx++
%
@colors],
$f‐>add_tag_value('note',
$f‐>get_tag_values('gene'));
‐fgcolor
=>
'black',
}
‐font2color
=>
'red',
push
@{$sorted_features{$tag}},$f;
‐key
=>
"${tag}s",
}
‐bump
=>
+1,
my
$panel
=
Bio::Graphics::Panel‐>new(
‐height
=>
8,
‐length
=>
$seq‐>length,
‐label
=>
1,
‐key_style
=>
'between',
‐description
=>
1,
‐width
=>
1200,
);
‐pad_left
=>
10,
}
‐pad_right
=>
10);
print
$panel‐>png;
$panel‐>add_track(arrow
=>
Bio::SeqFeature::Generic‐>new(‐start
=>
1,
‐end
=>
$seq‐>length),
‐bump
=>
0,
‐double=>1,
‐tick
=>
2);
29
Genome Browser
30
Pipelines, Workflows,
Automation in
Bioinformatics
31
Test for Positive Selection
• Codon alignments
• Run Molecular Evolution Tests
• Aggregate data
32
Codon Alignment
• Molecular Evolution studies need sequence aligned on
codon boundaries
• Start with coding sequence
• Align in protein space
• Back-Translate into codons
33
use
Bio::AlignIO;
use
Bio::SeqIO;
use
Bio::Align::Utilities
qw(
aa_to_dna_aln
);
my
$input
=
Bio::AlignIO‐>new
(‐format
=>
’clustalw’,
‐file
=>
’pep.aln’);
my
$out
=
Bio::AlignIO‐>new
(‐format
=>
’clustalw’,
‐file
=>
’>cds.aln’);
my
$aa_aln
=
$input‐>next_aln
;
my
$cds
=
Bio::SeqIO‐>new
(‐format
=>
’file’,
‐file
=>
’cds.fa’);
my
%cds;
while
(
my
$seq
=
$cds‐>next_seq
)
{
$cds
{$seq‐>display_id}
=
$seq;
}
my
$cds_aln
=
&aa_to_dna_aln
(
$aa_aln,
\%cds);
$out‐>write_aln(
$cds_aln
);
34
use
Bio::Tools::Run::PAML::Codeml;
use
Bio::AlignIO;
#
get
a
codon
alignment
from
a
file
my
$alnio
=
Bio::AlignIO‐>new
(‐format
=>
’clustalw’,
‐file
=>
‘cds.aln’
);
my
$codon_aln
=
$alnio‐>next_aln;
my
$kaks_f
=
Bio::Tools::Run::Phylo::PAML::Codeml
‐>new
(
‐params
=>
{
’runmode
’
=>
‐2,
’seqtype
’
=>
1}
);
$kaks_f‐>alignment($codon_aln);
my
($rc,
$parser
)
=
$kaks_factory‐>run;
my
$result
=
$parser‐>next_result
;
my
$MLmatrix
=
$result‐>get_MLmatrix();
my
$pair1
=
$MLmatrix‐>[0]‐>[1];
printf
"dN,
dS,
dN/dS
for
seqs
0
and
1
is
%.4f,
%.4f
,%.4f\n",
$pair1‐>{
’dS’},
$pair1‐>{
’dN’},
$pair1‐>{
’omega’};
35
Genes
One at a time
Run perl script on
each gene file as Result file per gene
Compute separate job
Cluster
Summarize results
36
Application Wrappers
• A Perl module that will run the application
• Programatic access to object to run the analysis, but
local script doesn’t require all the command line
arguments
• Handles the creation of temporary files, and cleanup
• bioperl-run package
37
Workflows are protocols
• Create a reusable set of steps
• Can be re-run easily
• Makes writing Methods section easier!
• Have stored the parameters, program names, program
version
• Can even be a simple set of Perl scripts or a Shell script
with commands
• Large-scale workflow managers for Bioinformatics are
also available
38
Pipelines and Workflow
management Software
• Open Source
• Taverna: taverna.sourceforge.net
• MyExperiment.org, myGrid.org
• Commercial
• Pipeline Pilot - student edition free to academics
• Grid-based computing
• MOBY - http://biomoby.org
39
Phylogenomics &
Metagenomics
40
Phylogenomics
• Gene trees from Gene families across multiple species
• Identification of Orthologs/Paralogs
• Multiple Sequence Alignments
• Tree Building
41
Clusters of Orthologs
• Best reciprocal Hits (BLAST,FASTA)
• Reciprocal Smallest Distance (RSD)
• Markov CLustering, TribeMCL
• OrthoMCL - MCL with in-paralog correction
• Synergy - RSD + Synteny
42
Best Reciprocal Hits
43
Multiple Alignments
• ClustalW - Fast
• Muscle - Fast, but more accurate
• T-Coffee - Slower, much more accurate than ClustalW
• ProbCons - Comparable to T-Coffee, prob score for
each alignment column
• MAFFT - Multiple alignment from FFT
44
Tree Building
• PHYLIP - Parsimony, Distance, Nucleotide Likelihood
• BEAST, MrBayes - Bayesian for nt and aa
• PAUP - Parsimony, Distance, Nucleotide Likelihood
(commercial)
• RAxML - Likelihood with bootstrapping
• GARLI - Genetic algorithm for nt likelihood tree
45
Phylogenetic Profiling
• Classifying a gene’s history
• When was it gained evolutionarily (on species
phylogeny)
• If it is lost in other lineages
• Co-gained or Co-lost clusters?
46
Dictyostelium discoideum
Monosiga brevicollis
Homo sapiens
Batrachochytrium dendrobatidis
Rhizopus oryzae
Cryptococcus neoformans
Coprinopsis cinerea
Laccaria bicolor
Schizosaccharomyces pombe
Saccharomyces cerevisiae
Neurospora crassa
Aspergillus nidulans
Coccidioides immitis
0.1
47
Genome Phylogenetic Profile
48
Mechanics of profiling
• Take set of all Proteins from genome
• Search against each of the genomes in the tree
• For that gene mark whether or not it has a significant
homolog
• Can be dynamic, based on E-value, Similarity, etc
• NOT a statement of orthology
49
A Gene’s Profile
50
51
Metagenomics
• Same types of problems, but much bigger sequence sets
• Binning Sequences
• Building Trees
52
Binning
• By nucleotide sequence similarity
• Classify a sequence into a Clade if it has similarity only
with members of that clade
• Parsing a BLAST or similar-type report
• Keeping track of the Bins for each gene
• Producing lists and new databases of sequence that
are in each bin
53
Building Trees
• Have to use Distance-based methods (NJ) for the
LARGE sequence sets
• Cluster by similarity just like orthology approaches but
may have to use the simplest calculations for speed on
large datasets
54
Summary
• Tools and techniques in bioinformatics can be
automated and reusable with pipelines
• Perl and BioPerl are one way to interact with the data
and tools
• Tutorials http://jason.open-bio.org/
• HOWTOs http://bioperl.org/wiki/HOWTOs
55
Related docs
Get documents about "