Automating comparative and metagenomics methods using Perl and BioPerl

Document Sample
scope of work template
							Automating comparative and
metagenomics methods using
      Perl and BioPerl
          Jason Stajich
          UC Berkeley




                             1
    Workshop introduction


• Programming in Biology
• Pipelines and Automation of Genome analyses
• Phylogenomics and Metagenomics


                                                2
                      Goals

• Introduction to Bioinformatics solutions with BioPerl
• Some motivation for learning the programming
  approaches
• Generic solutions to comparative genomics questions


                                                          3
     Programming languages

• Why Perl? Why not (Python,Ruby,etc)
• I don’t believe in language wars -- use what makes sense
  to you. Perl makes sense to me.
• Perl has had a large community developing modules to
  support Bioinformatics



                                                             4
                     BioPerl
• Toolkit of Perl modules for lifesciences data
• Collaborative open-source
• Multiple contributors from academic and industry
  groups
• Not directly funded through grants
• http://bioperl.org/


                                                     5
    BioPerl: What can you do?

•    Parse, write out Sequence data in a variety of file
     formats
    • FASTA, GenBank, EMBL, SwissProt, FASTQ
• Process results from BLAST, HMMER, FASTA, etc.
• Parse Multiple Sequence Alignments

                                                          6
               Parse some Sequence
Script
use
strict;
use
Bio::SeqIO;
my
$file
=
‘input.fasta’;                      Input
my
$io
=
Bio::SeqIO‐>new(‐format
=>
‘fasta’,   >Locus123

























‐file


=>
$file);    tatatatatt
ctcgtatgtc
ttattctacc

my
$seq
=
$io‐>next_seq;                       aacacaacaa
taagcgtgtt
gcagtcaatg
print
$seq‐>display_id,
“
is
display
id\n”;
print
$seq‐>seq,
“
is
sequence
string\n”;
print
$seq‐>length,
“
is
sequence
length\n”;




Output
            Locus123
            tatatatattctcgtatgtcttattctaccaacacaacaataagcgtgttgcagtcaatg
            60
                                                                                   7
               Sequence data

•   Bio::SeqIO module for parsing and writing sequence in
    many different formats
• Represent sequence data as a Bio::Seq Object
• Wrap up Sequence data, Annotations (Features), and
    methods to compute on it (translate, subseq, length)



                                                            8
                 Translate sequence
Script
use
strict;
use
Bio::SeqIO;
my
$file
=
‘input.fasta’;                     Input
my
$io
=
Bio::SeqIO‐>new(‐format
=>
‘fasta’,   >Locus123

























‐file


=>
$file);    tatatatatt
ctcgtatgtc
ttattctacc

my
$out
=
Bio::SeqIO‐>new(‐format
=>
‘fasta’); aacacaacaa
taagcgtgtt
gcagtcaatg
my
$seq
=
$io‐>next_seq;
my
$peptide
=
$seq‐>translate;
$out‐>write_seq($peptide);




Output
           >Locus123
           YIYSRMSYSTNTTISVLQSM

                                                                                   9
    Process BLAST reports

• Sequence database searches components
• Query sequence
• Significant Hits
• Pairwise Alignments of Query and Hit (Subject)
  sequences



                                                   10
    Process BLAST reports
• Sequence database searches components
• Result - Total search result for a single query sequence
• Hit - Sequences from database with significant similarity
  to Query
• HSP - High-scoring Segment Pair
 • Pairwise alignment segment between Query and Hit


                                                             11
Reference:

Gish,
W.
(1996‐2000)
http://blast.wustl.edu

Query=

BOSS_DROME
Bride
of
sevenless
protein
precursor.








(896
letters)

Database:

wormpep87
                                                                                  Result











20,881
sequences;
9,238,759
total
letters.
Searching....10....20....30....40....50....60....70....80....90....100%
done






































































Smallest







































































Sum






























































High

Probability
Sequences
producing
High‐scoring
Segment
Pairs:













Score

P(N)





N

F35H10.10
CE24945



status:Partially_confirmed
TR:Q20073...


182

4.9e‐11


1
M02H5.2


CE25951



status:Predicted
TR:Q966H5
protein_id:...

86

0.15





1
ZC506.4


CE01682

locus:mgl‐1
metatrophic
glutamate
recept...

91

0.18





1
F23D12.2

CE05700



status:Partially_confirmed
TR:Q19761
...


73

0.45





3    Hit
>F35H10.10
CE24945



status:Partially_confirmed
TR:Q20073











protein_id:AAA81683.2








Length
=
1404


Score
=
182
(69.1
bits),
Expect
=
4.9e‐11,
P
=
4.9e‐11

Identities
=
75/315
(23%),
Positives
=
149/315
(47%)

Query:


511
YPFLFDGESVMFWRIKMDTWVATGLTAAILGLIATLAILVFIVVRISLGDVFEGNPTTSI
570













Y

+F+

+


WR




+V


L


++

+

+A+LV
++V++
L

V
+GN
+

I
Sbjct:

1006
YQSVFEHITTGHWRDHPHNYVLLALITVLV‐‐VVAIAVLVLVLVKLYLR‐VVKGNQSLGI
1062

Query:


571
LLLLSLILVFCSFVPYSIEYVGEQRNSHVTFEDAQTLNTLCAVRVFIMTLVYCFVFSLLL
630     HSP














LL+
+I++





YS

+









F+




+++C
+RV
+

L
Y


F
+++
Sbjct:

1063
SLLIGIIIL‐‐‐‐‐‐YSTAFF‐‐‐‐‐‐‐FVFDPT‐‐‐DSVCRLRVILHGLGYTICFGVMI
1106

Query:


631
CRAVMLASIGSEG‐GFLSHVNGYIQAVICAFSVVAQVGMSVQLLVVMHVASETVSCENIY
689














+A

L
+

+
G
G


H++
+


++

F
V

Q+
+S+


+


+++

V


N+

Sbjct:

1107
AKATQLRNAETLGFGTAIHISFWNYWLLLFFIVGVQIALSISWFLEPFMSTIGVIDTNVQ
1166
                                                                                           12
Reference:

Gish,
W.
(1996‐2000)
http://blast.wustl.edu

Query=

BOSS_DROME
Bride
of
sevenless
protein
precursor.








(896
letters)

Database:

wormpep87
                                                                                  Result











20,881
sequences;
9,238,759
total
letters.
Searching....10....20....30....40....50....60....70....80....90....100%
done






































































Smallest







































































Sum






























































High

Probability
Sequences
producing
High‐scoring
Segment
Pairs:













Score

P(N)





N

F35H10.10
CE24945



status:Partially_confirmed
TR:Q20073...


182

4.9e‐11


1
M02H5.2


CE25951



status:Predicted
TR:Q966H5
protein_id:...

86

0.15





1
ZC506.4


CE01682

locus:mgl‐1
metatrophic
glutamate
recept...

91

0.18





1
F23D12.2

CE05700



status:Partially_confirmed
TR:Q19761
...


73

0.45





3    Hit
>F35H10.10
CE24945



status:Partially_confirmed
TR:Q20073











protein_id:AAA81683.2








Length
=
1404


Score
=
182
(69.1
bits),
Expect
=
4.9e‐11,
P
=
4.9e‐11

Identities
=
75/315
(23%),
Positives
=
149/315
(47%)

Query:


511
YPFLFDGESVMFWRIKMDTWVATGLTAAILGLIATLAILVFIVVRISLGDVFEGNPTTSI
570













Y

+F+

+


WR




+V


L


++

+

+A+LV
++V++
L

V
+GN
+

I
Sbjct:

1006
YQSVFEHITTGHWRDHPHNYVLLALITVLV‐‐VVAIAVLVLVLVKLYLR‐VVKGNQSLGI
1062

Query:


571
LLLLSLILVFCSFVPYSIEYVGEQRNSHVTFEDAQTLNTLCAVRVFIMTLVYCFVFSLLL
630     HSP














LL+
+I++





YS

+









F+




+++C
+RV
+

L
Y


F
+++
Sbjct:

1063
SLLIGIIIL‐‐‐‐‐‐YSTAFF‐‐‐‐‐‐‐FVFDPT‐‐‐DSVCRLRVILHGLGYTICFGVMI
1106

Query:


631
CRAVMLASIGSEG‐GFLSHVNGYIQAVICAFSVVAQVGMSVQLLVVMHVASETVSCENIY
689














+A

L
+

+
G
G


H++
+


++

F
V

Q+
+S+


+


+++

V


N+

Sbjct:

1107
AKATQLRNAETLGFGTAIHISFWNYWLLLFFIVGVQIALSISWFLEPFMSTIGVIDTNVQ
1166
                                                                                           12
Reference:

Gish,
W.
(1996‐2000)
http://blast.wustl.edu

Query=

BOSS_DROME
Bride
of
sevenless
protein
precursor.








(896
letters)

Database:

wormpep87
                                                                                  Result











20,881
sequences;
9,238,759
total
letters.
Searching....10....20....30....40....50....60....70....80....90....100%
done






































































Smallest







































































Sum






























































High

Probability
Sequences
producing
High‐scoring
Segment
Pairs:













Score

P(N)





N

F35H10.10
CE24945



status:Partially_confirmed
TR:Q20073...


182

4.9e‐11


1
M02H5.2


CE25951



status:Predicted
TR:Q966H5
protein_id:...

86

0.15





1
ZC506.4


CE01682

locus:mgl‐1
metatrophic
glutamate
recept...

91

0.18





1
F23D12.2

CE05700



status:Partially_confirmed
TR:Q19761
...


73

0.45





3    Hit
>F35H10.10
CE24945



status:Partially_confirmed
TR:Q20073











protein_id:AAA81683.2








Length
=
1404


Score
=
182
(69.1
bits),
Expect
=
4.9e‐11,
P
=
4.9e‐11

Identities
=
75/315
(23%),
Positives
=
149/315
(47%)

Query:


511
YPFLFDGESVMFWRIKMDTWVATGLTAAILGLIATLAILVFIVVRISLGDVFEGNPTTSI
570













Y

+F+

+


WR




+V


L


++

+

+A+LV
++V++
L

V
+GN
+

I
Sbjct:

1006
YQSVFEHITTGHWRDHPHNYVLLALITVLV‐‐VVAIAVLVLVLVKLYLR‐VVKGNQSLGI
1062

Query:


571
LLLLSLILVFCSFVPYSIEYVGEQRNSHVTFEDAQTLNTLCAVRVFIMTLVYCFVFSLLL
630     HSP














LL+
+I++





YS

+









F+




+++C
+RV
+

L
Y


F
+++
Sbjct:

1063
SLLIGIIIL‐‐‐‐‐‐YSTAFF‐‐‐‐‐‐‐FVFDPT‐‐‐DSVCRLRVILHGLGYTICFGVMI
1106

Query:


631
CRAVMLASIGSEG‐GFLSHVNGYIQAVICAFSVVAQVGMSVQLLVVMHVASETVSCENIY
689














+A

L
+

+
G
G


H++
+


++

F
V

Q+
+S+


+


+++

V


N+

Sbjct:

1107
AKATQLRNAETLGFGTAIHISFWNYWLLLFFIVGVQIALSISWFLEPFMSTIGVIDTNVQ
1166
                                                                                           12
Reference:

Gish,
W.
(1996‐2000)
http://blast.wustl.edu

Query=

BOSS_DROME
Bride
of
sevenless
protein
precursor.








(896
letters)

Database:

wormpep87
                                                                                  Result











20,881
sequences;
9,238,759
total
letters.
Searching....10....20....30....40....50....60....70....80....90....100%
done






































































Smallest







































































Sum






























































High

Probability
Sequences
producing
High‐scoring
Segment
Pairs:













Score

P(N)





N

F35H10.10
CE24945



status:Partially_confirmed
TR:Q20073...


182

4.9e‐11


1
M02H5.2


CE25951



status:Predicted
TR:Q966H5
protein_id:...

86

0.15





1
ZC506.4


CE01682

locus:mgl‐1
metatrophic
glutamate
recept...

91

0.18





1
F23D12.2

CE05700



status:Partially_confirmed
TR:Q19761
...


73

0.45





3    Hit
>F35H10.10
CE24945



status:Partially_confirmed
TR:Q20073











protein_id:AAA81683.2








Length
=
1404


Score
=
182
(69.1
bits),
Expect
=
4.9e‐11,
P
=
4.9e‐11

Identities
=
75/315
(23%),
Positives
=
149/315
(47%)

Query:


511
YPFLFDGESVMFWRIKMDTWVATGLTAAILGLIATLAILVFIVVRISLGDVFEGNPTTSI
570













Y

+F+

+


WR




+V


L


++

+

+A+LV
++V++
L

V
+GN
+

I
Sbjct:

1006
YQSVFEHITTGHWRDHPHNYVLLALITVLV‐‐VVAIAVLVLVLVKLYLR‐VVKGNQSLGI
1062

Query:


571
LLLLSLILVFCSFVPYSIEYVGEQRNSHVTFEDAQTLNTLCAVRVFIMTLVYCFVFSLLL
630     HSP














LL+
+I++





YS

+









F+




+++C
+RV
+

L
Y


F
+++
Sbjct:

1063
SLLIGIIIL‐‐‐‐‐‐YSTAFF‐‐‐‐‐‐‐FVFDPT‐‐‐DSVCRLRVILHGLGYTICFGVMI
1106

Query:


631
CRAVMLASIGSEG‐GFLSHVNGYIQAVICAFSVVAQVGMSVQLLVVMHVASETVSCENIY
689














+A

L
+

+
G
G


H++
+


++

F
V

Q+
+S+


+


+++

V


N+

Sbjct:

1107
AKATQLRNAETLGFGTAIHISFWNYWLLLFFIVGVQIALSISWFLEPFMSTIGVIDTNVQ
1166
                                                                                           12
use
Bio::SearchIO;
my
$cutoff
=
’0.001’;
my
$file
=
‘BOSS_Ce.BLASTP’,
my
$in
=
Bio::SearchIO‐>new(‐format
=>
‘blast’,



























‐file



=>
$file);
while(
my
$r
=
$in‐>next_result
)
{


print
"Query
is:
",
$r‐>query_name,
"
",
$r‐>query_description,"
",









$r‐>query_length,"
aa\n";


print
"
Matrix
was
",
$r‐>get_parameter(’matrix’),
"\n";


while(
my
$h
=
$r‐>next_hit
)
{




last
if
$h‐>significance
>
$cutoff;




print
"Hit
is
",
$h‐>name,
"\n";




while(
my
$hsp
=
$h‐>next_hsp
)
{





print
"
HSP
Len
is
",
$hsp‐>length(’total’),
"
",











"
E‐value
is
",
$hsp‐>evalue,












"
Bit
score
",
$hsp‐>score,
"
\n",











"
Query
loc:
",$hsp‐>query‐>start,
"
",
$hsp‐>query‐>end,"
",











"
Sbject
loc:
",$hsp‐>hit‐>start,
"
",
$hsp‐>hit‐>end,"\n";




}


}
}


                                                                           13
             Parsing Result
Query
is:
BOSS_DROME
Bride
of
sevenless
protein

precursor.
896
aa
Matrix
was
BLOSUM62
Hit
is
F35H10.10
HSP
Len
is
315
E‐value
is
4.9e‐11
Bit
score
182
Query
loc:
511
813
Sbject
loc:
1006
1298
HSP
Len
is
28
E‐value
is
1.4e‐09
Bit
score
39
Query
loc:
508
535
Sbject
loc:
427
454




                                                   14

FASTA
searches
a
protein
or
DNA
sequence
data
bank

version
3.4t20
Sep
26,
2002
Please
cite:

W.R.
Pearson
&
D.J.
Lipman
PNAS
(1988)
85:2444‐2448

Query
library
BOSS_DROME.aa
vs
/blast/wormpep87
library
searching
/blast/wormpep87
library



1>>>BOSS_DROME
Bride
of
sevenless
protein
precursor.
‐
896
aa

vs

/blast/wormpep87
library
9238759
residues
in
20881
sequences


Expectation_n
fit:
rho(ln(x))=
5.6098+/‐0.000519;
mu=
12.8177+/‐
0.030

mean_var=107.8223+/‐22.869,
0's:
0
Z‐trim:
2

B‐trim:
0
in
0/62

Lambda=
0.123515

Kolmogorov‐Smirnov

statistic:
0.0333
(N=29)
at

48

FASTA
(3.45
Mar
2002)
function
[optimized,
BL50
matrix
(15:‐5)]
ktup:
2

join:
38,
opt:
26,
gap‐pen:
‐12/‐2,
width:

16

Scan
time:

9.680
The
best
scores
are:






































opt
bits
E(20881)
F35H10.10
CE24945



status:Partially_confirmed
T

(1404)

207

48.5
6.8e‐05
T06E4.11
CE06377

locus:pqn‐63

status:Predicted


(
275)

122

32.6




0.8
C33B4.3
CE01508


ankyrin
and
proline
rich
domain

(1110)

124

33.6




1.6
Y48C3A.8
CE22141



status:Predicted
TR:Q9NAG3
pr

(
291)

110

30.5




3.7
Y34D9A.2
CE30217



status:Partially_confirmed
TR

(
326)

108

30.2




5.1
K06H7.3
CE26941


Isopentenyl‐diphosphate
delta
i

(
618)

107

30.3




8.9
F44B9.8
CE29044


ARPA
status:Partially_confirmed

(
388)

104

29.5




9.4

>>F35H10.10
CE24945



status:Partially_confirmed
TR:Q20

(1404
aa)

initn:

94
init1:

94
opt:
207

Z‐score:
197.9

bits:
48.5
E():
6.8e‐05
Smith‐Waterman
score:
275;

22.527%
identity
(27.152%
ungapped)
in
728
aa
overlap
(207‐847:640‐1330)









180






190






200






210






220









230



BOSS_D
RAISIDNASLAENLLIQEVQFLQQCTTYSMGIFVDWELYKQLESVIKD‐‐‐LEYNIWPIP





































:
:
.
.


.:
::
..


:.
..

.:
F35H10
NQAGRNITIVPKSVFGYASALHGDSQESLKGYFSSGDTDASLVSVDSEHSALQRSFTALP





610






620






630






640






650






660









                                                                                                       15
use
Bio::SearchIO;
my
$cutoff
=
’0.001’;
my
$file
=
‘BOSS_Ce.BLASTP’,
my
$in
=
Bio::SearchIO‐>new(‐format
=>
‘fasta’,



























‐file



=>
$file);
while(
my
$r
=
$in‐>next_result
)
{


print
"Query
is:
",
$r‐>query_name,
"
",
$r‐>query_description,"
",









$r‐>query_length,"
aa\n";


print
"
Matrix
was
",
$r‐>get_parameter(’matrix’),
"\n";


while(
my
$h
=
$r‐>next_hit
)
{




last
if
$h‐>significance
>
$cutoff;




print
"Hit
is
",
$h‐>name,
"\n";




while(
my
$hsp
=
$h‐>next_hsp
)
{





print
"
HSP
Len
is
",
$hsp‐>length(’total’),
"
",











"
E‐value
is
",
$hsp‐>evalue,












"
Bit
score
",
$hsp‐>score,
"
\n",











"
Query
loc:
",$hsp‐>query‐>start,
"
",
$hsp‐>query‐>end,"
",











"
Sbject
loc:
",$hsp‐>hit‐>start,
"
",
$hsp‐>hit‐>end,"\n";




}


}
}


                                                                           16
Query is: BOSS_DROME Bride of sevenless protein
precursor. 896 aa
Matrix was BL50
Hit is F35H10.10
HSP Len is 728 E-value is 6.8e-05 Bit score 197.9
Query loc: 207 847 Sbject loc: 640 1330




                                                    17
Parsing Multiple Alignments

•    Read in many different formats
    • Clustal, Phylip, Selex, Stockholm, multi-FASTA, Nexus,
      MAF
    • Create a Multiple Alignment Object
    • Process alignment (trim gapped columns, calculate
      percent identity, etc)



                                                               18
Input
 CATL_HUMAN





......MNPT
LILAAFCLGI
..........
...ASATLTF
DHSLEAQWTK

 CATL_RAT







......MTPL
LLLAVLCLGT
..........
...ALATPKF
DQTFNAQWHQ

 CATH_RAT







...MWTALPL
LCAGAWLLSA
G.........
...ATAELTV
NAIEKFHFTS

 PAPA_CARPA





MAMIPSISKL
LFVAICLFVY
MGLSFGDFSI
VGYSQNDLTS
TERLIQLFES



Script
use
strict;
use
Bio::AlignIO;
my
$in
=
Bio::AlignIO‐>new(‐format
=>
‘msf’,



























‐file


=>
‘hmmer.aln’);
my
$out
=Bio::AlignIO‐>new(‐format
=>
‘clustalw’,



























‐file


=>
‘>out.aln’);
my
$aln
=
$in‐>next_aln;
$aln‐>map_chars(‘\.’,
‘‐’);
$aln‐>map_chars(‘\*’,
‘‐’);
my
$ungapped
=
$aln‐>remove_gaps;
$out‐>write_aln($ungapped);
print
“percent
identity=”,
$ungapped‐>percentage_identity,
“\n”;



                                                                           19
                      Result
24.7311827956989

CLUSTAL
W(1.81)
multiple
sequence
alignment


CATL_HUMAN/1‐31







MNPTLILAAFCLGIASATLTFDHSLEAQWTK
CATL_RAT/1‐31









MTPLLLLAVLCLGTALATPKFDQTFNAQWHQ
CATH_RAT/4‐35









ALPLLCAGAWLLSAATAELTVNAIEKFHFTS
PAPA_CARPA/7‐50







ISKLLFVAICLFVYSQNDLTSTERLIQLFES



























*

.


:

:



.







:
.



                                                     20
         Phylogenetic Trees


• Parse and Write various Tree formats
• Manipulate (re-root, remove nodes, add bootstraps) and
  compute relationships on trees




                                                           21
          Read/Write Trees

use
Bio::TreeIO;
use
strict;
my
$in
=
Bio::TreeIO‐>new(‐format
=>
‘nexus’,


























‐file


=>
‘trees.nex’);
my
$out
=
Bio::TreeIO‐>new(‐format
=>
‘newick’,



























‐file


=>
‘>trees.nh’);
while(
my
$tree
=
$in‐>next_tree
)
{

$out‐>write_tree($tree);
}




                                                      22
            Fetching subset of nodes
  use
strict;
  use
Bio::TreeIO;
  my
$in
=
Bio::TreeIO‐>new(‐format
=>
‘newick’,
  

























‐fh


=>
\*DATA);
  if(
my
$tree
=
$in‐>next_tree
)
{
  


my
@nodes
=
$tree‐>get_nodes;
  


my
@tips
=
grep
{
$_‐>is_Leaf
}
@nodes;
  


print
"there
are
",scalar
@tips,
"
tips\n";
  


my
@internal
=
grep
{
!
$_‐>is_Leaf
}
@nodes;
  


print
"there
are
",scalar
@internal,
"
internal
nodes\n";
  


my
($A_node)
=
$tree‐>find_node(‐id
=>
'A');
  


print
"branch
length
of
Ancestor
of
",$A_node‐>id,
                                                                 A
  


"
is
",
$A_node‐>ancestor‐>branch_length,
"\n";
  }                                                              B
  __DATA__
                                                                 C
  (((A:10,B:11):2,C:5),((D:7,F:6):17,G:8));
Output                                                           D

 there
are
6
tips                                                F
 there
are
5
internal
nodes
                                                                 G
 branch
length
of
Ancestor
of
A
is
2
                                                                     23
  Walking up the tree (tips to root)
 if(
my
$tree
=
$in‐>next_tree
)
{
 


my
@tips
=
grep
{
$_‐>is_Leaf
}
$tree‐>get_nodes;
 


for
my
$node
(
@tips
)
{
 





my
@path;
 





while(
defined
$node)
{
 







push
@path,
$node‐>id;
 







$node
=
$node‐>ancestor;
 





}
 





print
join(“,“,
@path),
“\n”;
 


}
 }
 __DATA__
 (((A:10,B:11)I1:2,C:5)I2,((D:7,F:6)I3:17,G:8)I4)Root;         A

                                                          I1

Output                                               I2
                                                               B



 C,I2,Root                                                     C

 A,I1,I2,Root                                 Root
 B,I1,I2,Root                                                  D

                                                          I3
 G,I4,Root
                                                               F
 D,I3,I4,Root                                        I4

 F,I3,I4,Root                                                  G
                                                                   24
Visualizations



                 25
  Draw Sequence Graphics


• Read in GenBank file
• Write out features as tracks
• Automatically grab feature names and


                                         26
LOCUS






AB077698















2701
bp



mRNA



linear


PRI
01‐MAR‐2002   




















/db_xref="GI:19032345"
DEFINITION

Homo
sapiens
mRNA
for
hCHCR‐G,
complete
cds.                          




















/translation="MTAVNVALIRDTKWLTLEVCREFQRGTCSRADADCKFAHPPRVC
ACCESSION


AB077698                                                              




















HVENGRVVACFDSLKGRCTRENCKYLHPPPHLKTQLEINGRNNLIQQKTAAAMFAQQM
VERSION




AB077698.1

GI:19032344                                               




















QLMLQNAQMSSLGSFPMTPSIPANPPMAFNPYIPHPGMGLVPAELVPNTPVLIPGNPP
KEYWORDS



.                                                                     




















LAMPGAVGPKLMRSDKLEVCREFQRGNCTRGENDCRYAHPTDASMIEASDNTVTICMD
SOURCE





Homo
sapiens
(human)                                                  




















YIKGRCSREKCKYFHPPAHLQARLKAAHHQMNHSAASAMALQPGTLQLIPKRSALEKP


ORGANISM

Homo
sapiens                                                          




















NGATPVFNPTVFHCQQALTNLQLPQPAFIPAGPILCMAPASNIVPMMHGATPTTVSAA












Eukaryota;
Metazoa;
Chordata;
Craniata;
Vertebrata;
Euteleostomi;     




















TTPATSVPFAAPTTGNQLKF"












Mammalia;
Eutheria;
Euarchontoglires;
Primates;
Haplorrhini;          




misc_feature



137..196












Catarrhini;
Hominidae;
Homo.                                          




















/gene="CHCR"
REFERENCE


1                                                                     




















/note="Cys3His,
zinc
finger


AUTHORS


Squillace,R.M.
and
Wang,E.H.                                          




















Encoded
on
BAC
clone
RP5‐842K24
(AL050310)"


TITLE




Genomic
structure,
chromosomal
localization,
and
splicing
variation   




misc_feature



239..292












of
the
human
CHCR
gene,
cloning
and
characterization
of
mouse
CHCR    




















/gene="CHCR"


JOURNAL


Unpublished                                                           




















/note="Cys3His,
zinc
finger
REFERENCE


2                                                                     




















Encoded
on
BAC
clone
RP5‐842K24
(AL050310)"


AUTHORS


Squillace,R.M.,
Chenault,D.M.
and
Wang,E.H.                           




misc_feature



617..676


TITLE




Inhibition
of
myogenesis
by
the
novel
Muscleblind‐related
protein     




















/gene="CHCR"












CHCR                                                                  




















/note="Cys3His,
zinc
finger


JOURNAL


Unpublished                                                           




















Encoded
on
BAC
clone
RP5‐842K24
(AL050310)"
REFERENCE


3

(bases
1
to
2701)                                                  




misc_feature



725..778


AUTHORS


Squillace,R.M.,
Chenault,D.M.
and
Wang,E.H.                           




















/gene="CHCR"


TITLE




Direct
Submission                                                     




















/note="Cys3His,
zinc
finger


JOURNAL


Submitted
(10‐JAN‐2002)
Edith
H.
Wang,
University
of
Washington,      




















Encoded
on
BAC
clone
RP5‐842K24
(AL050310)"












Pharmacology;
1959
NE
Pacific
Ave.,
Box
357280,
Seattle,
Washington   




3'UTR










1145..2659












98195,
USA
(E‐mail:ehwang@u.washington.edu,
Tel:206‐616‐5376,         




















/gene="CHCR"












Fax:206‐685‐3822)                                                     




polyA_site





1606
FEATURES












Location/Qualifiers                                          




















/gene="CHCR"





source









1..2701                                                      




















/note="Encoded
on
BAC
clone
RP5‐842K24
(AL050310);





















/organism="Homo
sapiens"                                     




















PolyA_site#1
used
by
CHCR
EST
clone
PLACE1010202





















/mol_type="mRNA"                                             




















(AK002178)"





















/db_xref="taxon:9606"                                        




polyA_site





2660





















/chromosome="X"                                              




















/gene="CHCR"





















/map="Xq26.1"                                                




















/note="Encoded
on
BAC
clone
RP5‐842K24
(AL050310);





gene











1..2701                                                      




















PolyA_site#2
used
by
CHCR
EST
clone
DKFZp434G2222





















/gene="CHCR"                                                 




















(AL133625)"





5'UTR










<1..79                                                       ORIGIN



























/gene="CHCR"                                                 







1
aattcatttt
taatccttta
atagtccaca
gtaatattgt
cctaaagagg
gtacattgga





CDS












80..1144                                                     






61
ttttaatttt
gctttcaata
tgacggctgt
caatgttgcc
ctgattcgtg
ataccaagtg





















/gene="CHCR"                                                 





121
gctgacttta
gaagtctgta
gagaatttca
gagaggaact
tgctctcgag
ctgatgcaga





















/note="Cys3His
CCG1‐Required                                 





181
ttgcaagttt
gcccatccac
caagagtttg
ccatgtggaa
aatggtcgtg
tggtggcctg





















Encoded
on
BAC
clone
RP5‐842K24
(AL050310)                   





241
ttttgattct
ctaaagggtc
ggtgtacccg
agagaactgc
aagtaccttc
accctcctcc





















The
human
CHCR
(Cys3His
CCG1‐Required)
protein
is
highly     





301
acacttaaaa
acgcagctgg
agattaatgg
gcggaacaat
ctgattcaac
agaagactgc





















related
to
EXP/MBNL
(Y13829,
NM_021038,
AF401998)
and
MBLL   





361
cgcagccatg
ttcgcccagc
agatgcagct
tatgctccaa
aacgctcaaa
tgtcatcact





















(NM_005757,AF061261),
which
together
comprise
the
human      





421
tggttctttt
cctatgactc
catcaattcc
agctaatcct
cccatggctt
tcaatcctta





















Muscleblind
family"                                          





481
cataccacat
cctgggatgg
gcctcgttcc
tgcagaactt
gtaccaaata
cacctgttct





















/codon_start=1                                               





541
gattcctgga
aacccacctc
ttgcaatgcc
aggagctgtt
ggcccaaaac
tgatgcgttc





















/product="hCHCR‐G"                                           





601
agataaactg
gaggtttgcc
gagaatttca
gcgtggaaat
tgtacccgtg
gggagaatga





















/protein_id="BAB85648.1"                                     





661
ttgccgctat
gctcacccta
ctgatgcttc
catgattgaa
gcgagtgata
atactgtgac
                                                                                  





721
aatctgcatg
gattacatca
aaggtcgatg
ctcgcgggag
aaatgcaagt
actttcatcc
                                                                                                                                                                    27
        Rendered Sequence File

$
perl
render.pl
seqfile.gbk
>
result.png




                                            28

use
strict;

use
Bio::Graphics;

use
Bio::SeqIO;

use
Bio::SeqFeature::Generic;
                                                                 $panel‐>add_track(generic
=>

                                                                 


Bio::SeqFeature::Generic‐>new(‐start
=>
1,


my
$file
=
shift
                                                                 
































‐end
=>
$seq‐>length,


or
die
"provide
a
sequence
file
as
the
argument";
                                                                 































‐bgcolor
=>
'blue',

my
$io
=
Bio::SeqIO‐>new(‐file=>$file)

                                                                 































‐label

=>
1));









or
die
"couldn't
create
Bio::SeqIO";

my
$seq
=
$io‐>next_seq

                                                                 
#
general
case









or
die
"couldn't
find
a
sequence
in
the
file";
                                                                 
my
@colors
=
qw(cyan
orange
blue
purple
green

                                                                 
















chartreuse
magenta
yellow
aqua);

my
@features
=
$seq‐>all_SeqFeatures;
                                                                 
my
$idx



=
0;
                                                                 


#
sort
features
by
their
primary
tags
                                                                 for
my
$tag
(sort
keys
%sorted_features)
{

my
%sorted_features;
                                                                 


my
$features
=
$sorted_features{$tag};

for
my
$f
(@features)
{
                                                                 


$panel‐>add_track($features,



my
$tag
=
$f‐>primary_tag;
                                                                 











‐glyph



=>

'generic',



if(
!
$f‐>has_tag('note')
&&
$f‐>has_tag('gene')
)
{
                                                                 











‐bgcolor

=>

$colors[$idx++
%
@colors],








$f‐>add_tag_value('note',
$f‐>get_tag_values('gene'));
                                                                 











‐fgcolor

=>
'black',



}
                                                                 











‐font2color
=>
'red',



push
@{$sorted_features{$tag}},$f;
                                                                 











‐key





=>
"${tag}s",

}
                                                                 











‐bump




=>
+1,

my
$panel
=
Bio::Graphics::Panel‐>new(
                                                                 











‐height


=>
8,




‐length
=>
$seq‐>length,
                                                                 











‐label



=>
1,




‐key_style
=>
'between',
                                                                 











‐description
=>
1,




‐width




=>
1200,
                                                                 










);




‐pad_left

=>
10,
                                                                 
}




‐pad_right
=>
10);
                                                                 
print
$panel‐>png;
$panel‐>add_track(arrow
=>

Bio::SeqFeature::Generic‐>new(‐start
=>
1,






























‐end
=>
$seq‐>length),





‐bump
=>
0,




‐double=>1,




‐tick
=>
2);
                                                                                                                          29
Genome Browser




                 30
Pipelines, Workflows,
   Automation in
   Bioinformatics


                       31
  Test for Positive Selection


• Codon alignments
• Run Molecular Evolution Tests
• Aggregate data


                                  32
           Codon Alignment

•   Molecular Evolution studies need sequence aligned on
    codon boundaries
• Start with coding sequence
• Align in protein space
• Back-Translate into codons

                                                           33
use
Bio::AlignIO;
use
Bio::SeqIO;
use
Bio::Align::Utilities
qw(
aa_to_dna_aln
);
my
$input
=
Bio::AlignIO‐>new
(‐format
=>
’clustalw’,































‐file
=>
’pep.aln’);
my
$out
=
Bio::AlignIO‐>new
(‐format
=>
’clustalw’,





























‐file
=>
’>cds.aln’);
my
$aa_aln
=
$input‐>next_aln
;
my
$cds
=
Bio::SeqIO‐>new
(‐format
=>
’file’,



























‐file
=>
’cds.fa’);
my
%cds;
while
(
my
$seq
=
$cds‐>next_seq
)
{

$cds
{$seq‐>display_id}
=
$seq;

}

my
$cds_aln
=
&aa_to_dna_aln
(
$aa_aln,
\%cds);

$out‐>write_aln(
$cds_aln
);




                                                        34
use
Bio::Tools::Run::PAML::Codeml;
use
Bio::AlignIO;
#
get
a
codon
alignment
from
a
file
my
$alnio
=
Bio::AlignIO‐>new
(‐format
=>
’clustalw’,


‐file
=>
‘cds.aln’
);
my
$codon_aln
=
$alnio‐>next_aln;
my
$kaks_f
=
Bio::Tools::Run::Phylo::PAML::Codeml
‐>new
(
‐params
=>
{
’runmode
’
=>
‐2,
’seqtype
’
=>
1}
);
$kaks_f‐>alignment($codon_aln);
my
($rc,
$parser
)
=
$kaks_factory‐>run;
my
$result
=
$parser‐>next_result
;
my
$MLmatrix
=
$result‐>get_MLmatrix();
my
$pair1
=
$MLmatrix‐>[0]‐>[1];

printf
"dN,
dS,
dN/dS
for
seqs
0
and
1
is
%.4f,
%.4f
,%.4f\n",
$pair1‐>{
’dS’},
$pair1‐>{
’dN’},
$pair1‐>{
’omega’};




                                                                 35
             Genes
             One at a time
           Run perl script on
            each gene file as    Result file per gene
Compute       separate job
 Cluster


                                Summarize results




                                                      36
      Application Wrappers

• A Perl module that will run the application
• Programatic access to object to run the analysis, but
  local script doesn’t require all the command line
  arguments
• Handles the creation of temporary files, and cleanup
• bioperl-run package

                                                          37
    Workflows are protocols
• Create a reusable set of steps
• Can be re-run easily
•  Makes writing Methods section easier!
  • Have stored the parameters, program names, program
    version
• Can even be a simple set of Perl scripts or a Shell script
  with commands
• Large-scale workflow managers for Bioinformatics are
  also available
                                                               38
       Pipelines and Workflow
        management Software
•    Open Source
    • Taverna: taverna.sourceforge.net
    • MyExperiment.org, myGrid.org
• Commercial
 • Pipeline Pilot - student edition free to academics
• Grid-based computing
 • MOBY - http://biomoby.org
                                                        39
Phylogenomics &
 Metagenomics


                  40
              Phylogenomics

•   Gene trees from Gene families across multiple species


• Identification of Orthologs/Paralogs
• Multiple Sequence Alignments
• Tree Building



                                                            41
     Clusters of Orthologs

• Best reciprocal Hits (BLAST,FASTA)
• Reciprocal Smallest Distance (RSD)
• Markov CLustering, TribeMCL
• OrthoMCL - MCL with in-paralog correction
• Synergy - RSD + Synteny



                                              42
Best Reciprocal Hits




                       43
         Multiple Alignments
•   ClustalW - Fast
• Muscle - Fast, but more accurate
• T-Coffee - Slower, much more accurate than ClustalW
• ProbCons - Comparable to T-Coffee, prob score for
    each alignment column
•   MAFFT - Multiple alignment from FFT


                                                        44
               Tree Building
• PHYLIP - Parsimony, Distance, Nucleotide Likelihood
• BEAST, MrBayes - Bayesian for nt and aa
• PAUP - Parsimony, Distance, Nucleotide Likelihood
  (commercial)
• RAxML - Likelihood with bootstrapping
• GARLI - Genetic algorithm for nt likelihood tree


                                                        45
         Phylogenetic Profiling

•    Classifying a gene’s history
    • When was it gained evolutionarily (on species
      phylogeny)
    • If it is lost in other lineages
    • Co-gained or Co-lost clusters?

                                                      46
        Dictyostelium discoideum
      Monosiga brevicollis
        Homo sapiens
      Batrachochytrium dendrobatidis
      Rhizopus oryzae
           Cryptococcus neoformans
         Coprinopsis cinerea
        Laccaria bicolor
       Schizosaccharomyces pombe
          Saccharomyces cerevisiae
             Neurospora crassa
             Aspergillus nidulans
             Coccidioides immitis
0.1


                                       47
Genome Phylogenetic Profile




                             48
      Mechanics of profiling
• Take set of all Proteins from genome
• Search against each of the genomes in the tree
• For that gene mark whether or not it has a significant
  homolog
• Can be dynamic, based on E-value, Similarity, etc
• NOT a statement of orthology


                                                          49
A Gene’s Profile




                  50
51
             Metagenomics


• Same types of problems, but much bigger sequence sets
• Binning Sequences
• Building Trees


                                                          52
                      Binning
• By nucleotide sequence similarity
• Classify a sequence into a Clade if it has similarity only
  with members of that clade
 • Parsing a BLAST or similar-type report
 • Keeping track of the Bins for each gene
 • Producing lists and new databases of sequence that
    are in each bin


                                                               53
               Building Trees

• Have to use Distance-based methods (NJ) for the
  LARGE sequence sets
• Cluster by similarity just like orthology approaches but
  may have to use the simplest calculations for speed on
  large datasets




                                                             54
                    Summary

•   Tools and techniques in bioinformatics can be
    automated and reusable with pipelines
• Perl and BioPerl are one way to interact with the data
    and tools
• Tutorials http://jason.open-bio.org/
• HOWTOs http://bioperl.org/wiki/HOWTOs

                                                           55

						
Related docs