Tutorial for phylogenetic tree


Phylogenetic Analysis
An Implementation-Oriented View
2012/6/22 NDHU CSE ALGOLAB Yu-Wei TSAY 1
Abstract
Concepts of Phylogenetic Tree
Two Categories of Phylogenetic Tree
Character State Matrix
Distance Matrix
Tools Introduction
Definition of a Phylogenetic Tree
What is a phylogenetic tree?
A diagram that presents how species relate to one
another in terms of common ancestors.
How do we construct a phylogenetic
tree?
Generally, we don't have enough data
about distinct ancestors of present-day
species.
Most phylogenetic trees are hypotheses!
Common Phylogenetic Tree Terminology
[Figure: a rooted tree with terminal nodes A to E, labeled as follows.]
Branches or lineages
Terminal nodes (A to E): represent the taxa (genes, populations, species, etc.) used to infer the phylogeny
Ancestral node or root of the tree
Internal nodes or divergence points: represent hypothetical ancestors of the taxa
Two Categories of Input
Data for Phylogenetic Trees
Comparative characters:
Characters such as beak shape, number of fingers, or the
presence or absence of a trait. The resulting table is called a
character state matrix.
Numerical distance data:
Distances between objects. The resulting
matrix is called the distance matrix.
Convergence
(parallel evolution)
Two or more objects have the same
state for the same character without being closely related.
Example: birds and insects can both
fly, so they share a state, but they
are not genetically close.
In an inferred tree, convergence events should not occur,
or their number should be minimized.
Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA
or protein sequences, directly during tree inference.
Taxa Characters
Species A ATGGCTATTCTTATAGTACG
Species B ATCGCTAGTCTTATATTACA
Species C TTCACTAGACCTGTGGTCCA
Species D TTGACCAGACCTGTGGTCCG
Species E TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pair-wise
distances (dissimilarities), and then use the matrix during tree building.
A B C D E
Species A ---- 0.20 0.50 0.45 0.40
Species B 0.23 ---- 0.40 0.55 0.50
Species C 0.87 0.59 ---- 0.15 0.40
Species D 0.73 1.12 0.17 ---- 0.25
Species E 0.59 0.89 0.61 0.31 ----
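The transformation from aligned characters to pairwise distances can be sketched as follows. This is a minimal illustration, assuming the simplest possible measure, the proportion of differing sites; the function names are hypothetical and not taken from any package. On the five sequences above it gives 0.20 for A-B and 0.50 for A-C, which agrees with the corresponding upper-triangle entries of the matrix.

```python
from itertools import combinations

# Aligned sequences copied from the character table above.
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}

def p_distance(s, t):
    """Proportion of aligned sites at which two sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t)) / len(s)

def distance_matrix(alignment):
    """Map each unordered pair of taxa to its p-distance."""
    return {(a, b): p_distance(alignment[a], alignment[b])
            for a, b in combinations(sorted(alignment), 2)}

if __name__ == "__main__":
    for (a, b), d in distance_matrix(ALIGNMENT).items():
        print(f"{a}-{b}: {d:.2f}")
```

Note that every pair is reduced to one number, which is exactly the information loss listed among the drawbacks of distance methods.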
Comparison between Character
Methods and Distance Methods
Pros of distance methods
Very fast
Cons of distance methods
Sequence information is reduced to a single number per pair
Provide only one tree topology
Dependent on the model of evolution used
Two cases showing that
phylogenetic analysis is helpful
Coronaviruses
From "Characterization of a Novel Coronavirus
Associated with Severe Acute Respiratory Syndrome",
Science, 1 May 2003.
HIV
From "Molecular epidemiology of HIV transmission in
a dental practice", Science, 22 May 1992.
The Florida dentist case
A dentist from Florida appeared to have infected 7 of his
patients. The dentist died, and the patients claimed
insurance money.
Samples:
the dentist, 7 patients, and controls (HIV-positive people from the same area).
env gene:
HIV isolates from 5 patients clustered with the dentist's strain
with sufficient bootstrap support (80%).
Two patients had different virus strains.
Conclusion:
The dentist infected 5 of his patients.
The insurance company reached a settlement with the patients.
Classification of
Coronaviruses
There are three groups of coronaviruses:
groups 1 and 2 contain only
mammalian viruses, while group 3
contains only avian viruses.
They are classified into distinct species by
host range
antigenic relationships
genomic organization
[Figure: Membrane Spanning]
[Figure: Phylogenetic Analysis [1]]
[Figure: Phylogenetic Analysis [2]]
There are three possible unrooted
trees for four taxa (A, B, C, D)
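Tree 1: A and B on one side, C and D on the other
Tree 2: A and C on one side, B and D on the other
Tree 3: A and D on one side, B and C on the other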
Phylogenetic tree building (or inference) methods are aimed at
discovering which of the possible unrooted trees is "correct".
We would like this to be the “true” biological tree — that is, one
that accurately represents the evolutionary history of the taxa.
However, we must settle for discovering the computationally
correct or optimal tree for the phylogenetic method of choice.
The number of unrooted trees increases in a
greater-than-exponential manner with the
number of taxa.
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...           ...
30            ~8.69 x 10^36
# unrooted trees for N taxa = $\prod_{i=3}^{N} (2i - 5)$
[Figure: example unrooted trees for 3, 4, 5 and 6 taxa.]
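The product formula for the number of unrooted trees is easy to check numerically. The following is a small sketch; the function name is an assumption made for illustration. Python integers have arbitrary precision, so the value for 30 taxa is computed exactly.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees: product of (2i - 5) for i = 3..n_taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for i in range(3, n_taxa + 1):
        count *= 2 * i - 5
    return count

if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
        print(n, count_unrooted_trees(n))
```

Even at 30 taxa the count is far beyond what an exhaustive search could visit, which is why the tree-building methods below rely on clustering or heuristic search.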
Inferring evolutionary relationships
between the taxa requires rooting the tree:
To root a tree mentally,
imagine that the tree is
made of string. Grab the
string at the root and
tug on it until the ends of
the string (the taxa) fall
opposite the root.
[Figure: an unrooted tree of taxa A, B, C, D with a root position marked, and the rooted tree that results.]
Note that in this rooted tree, taxon A is
no more closely related to taxon B than
it is to C or D.
Now, try it again with the root at another
position:
[Figure: the same unrooted tree with the root marked at a different position, and the rooted tree that results.]
Note that in this rooted tree, taxon A is most
closely related to taxon B, and together they
are equally distantly related to taxa C and D.
Each unrooted tree theoretically can be
rooted anywhere along any of its
branches.
# Taxa   # Unrooted trees   x   # Roots   =   # Rooted trees
3        1                      3             3
4        3                      5             15
5        15                     7             105
6        105                    9             945
7        945                    11            10,395
8        10,395                 13            135,135
9        135,135                15            2,027,025
...      ...                    ...           ...
30       ~8.69 x 10^36          57            ~4.95 x 10^38
# rooted trees for N taxa = $\prod_{i=3}^{N} (2i - 3)$
[Figure: example unrooted trees for 4, 5 and 6 taxa.]
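The relation in the table (rooted trees = unrooted trees x possible root positions) can be sketched in a few lines. The helper names are hypothetical; the only fact used beyond the table is that an unrooted bifurcating tree with N leaves has 2N - 3 branches, each of which is one possible root position.

```python
def count_unrooted_trees(n_taxa):
    """Product of (2i - 5) for i = 3..n_taxa."""
    count = 1
    for i in range(3, n_taxa + 1):
        count *= 2 * i - 5
    return count

def count_root_positions(n_taxa):
    """An unrooted bifurcating tree on n_taxa leaves has 2n - 3 branches."""
    return 2 * n_taxa - 3

def count_rooted_trees(n_taxa):
    """Each unrooted tree can be rooted on any one of its branches."""
    return count_unrooted_trees(n_taxa) * count_root_positions(n_taxa)

if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7, 8, 9, 30):
        print(n, count_unrooted_trees(n), count_root_positions(n), count_rooted_trees(n))
```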
Building a Phylogenetic Tree
Phylogenetic analysis should be
conceived as a search for a correct
model.
Presumable
Particular
Rationality
Explanation
Categories of Tree-Building Methods
Distance-Based
UPGMA
Neighbor Joining
Fitch-Margoliash
Minimum Evolution
Least Squares
Tree Construction by UPGMA
Unweighted Pair Group Method with Arithmetic Mean
        ATCC  ATGC  TTCG  TCGG
ATCC    0     1     2     4
ATGC          0     3     3
TTCG                0     2
TCGG                      0
Tree Construction by UPGMA (Cont.)
Search the distance matrix for
the minimal distance: d(ATCC, ATGC) = 1.
Merge the pair into the cluster {ATCC, ATGC}; each branch has length 1/2 = 0.5.
[Figure: ATCC and ATGC joined by branches of length 0.5 each.]
Tree Construction by UPGMA (Cont.)
               {ATCC, ATGC}   TTCG             TCGG
{ATCC, ATGC}   0              ½(2+3) = 2.5     ½(4+3) = 3.5
TTCG                          0                2
TCGG                                           0
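Tree Construction by UPGMA (Cont.)
The minimal distance is now d(TTCG, TCGG) = 2.
Merge the pair into the cluster {TTCG, TCGG}; each branch has length 2/2 = 1.
[Figure: ATCC and ATGC joined by branches of length 0.5 each; TTCG and TCGG joined by branches of length 1 each.]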
Tree Construction by UPGMA (Cont.)
               {ATCC, ATGC}   {TTCG, TCGG}
{ATCC, ATGC}   0              ½(2.5+3.5) = 3
{TTCG, TCGG}                  0
Tree Construction by UPGMA (Cont.)
[Figure: final UPGMA tree. ATCC and ATGC join at height 0.5, TTCG and TCGG join at height 1, and the two clusters join at height 3/2 = 1.5.]
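The UPGMA walk-through above can be reproduced with a short sketch. It assumes, as the example matrix suggests, that the distance between two sequences is the number of differing positions; the function names and the tuple representation of clusters are choices made for illustration, not part of any package.

```python
from itertools import combinations

def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(s, t))

def upgma(seqs):
    """Return the merges as (cluster_a, cluster_b, height) tuples, in order."""
    clusters = [(s,) for s in seqs]  # each cluster is a tuple of its leaves
    dist = {frozenset((a, b)): float(hamming(a[0], b[0]))
            for a, b in combinations(clusters, 2)}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters (the first one found on ties).
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        d_ab = dist[frozenset((a, b))]
        merged = a + b
        merges.append((a, b, d_ab / 2))  # the new node sits at height d/2
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            # Size-weighted mean, i.e. the plain average over all leaf pairs.
            dist[frozenset((merged, c))] = (
                len(a) * dist[frozenset((a, c))]
                + len(b) * dist[frozenset((b, c))]
            ) / (len(a) + len(b))
        clusters = rest + [merged]
    return merges

if __name__ == "__main__":
    for a, b, height in upgma(["ATCC", "ATGC", "TTCG", "TCGG"]):
        print(a, b, height)
```

The three merges come out at heights 0.5, 1.0 and 1.5, matching the slides. The sketch assumes all input sequences are distinct, since clusters are used as dictionary keys.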
Four steps in phylogenetic
data analysis
1. Alignment
Building the data model
Extracting a phylogenetic dataset
2. Determining the substitution model
models of heterogeneity
Which model to use
3. Tree building
4. Tree evaluation
Alignment:
Building the data model
How much computer dependence?
manual? automatic optimization?
Phylogenetic criteria
is explicit use preferred?
Alignment parameter estimation
parameters should vary dynamically with divergence
Which alignment procedure is best?
this cannot be determined unless the actual tree relationships are known beforehand
Mathematical optimization and analysis structure
it is not yet clear which statistical models can determine the alignment
Alignment: Extraction of a
phylogenetic data set
1. One of the most important steps in phylogenetic tree
analysis, because it produces the data set.
2. Be careful when deleting ambiguously aligned
regions and when inserting or deleting gaps.
3. Analyze slightly modified alignments to determine how
ambiguous regions in the alignment affect the results.
Determining the
Substitution Model
The substitution model should be given
the same emphasis as alignment and
tree building !
Which substitution model to use?
The fewer the parameters the better. This
is because every parameter estimate has
an associated variance.
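As an illustration of a model with very few parameters, here is a sketch of the Jukes-Cantor correction (Jukes-Cantor is listed among the DNADIST options in the PHYLIP section of this tutorial). It assumes equal base frequencies and a single substitution rate, and converts an observed proportion of differing sites p into an estimated number of substitutions per site. The function name is hypothetical.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

if __name__ == "__main__":
    for p in (0.05, 0.20, 0.50):
        print(f"p = {p:.2f} -> d = {jukes_cantor_distance(p):.4f}")
```

The corrected distance is always at least as large as p, because the model accounts for multiple substitutions at the same site that a simple count cannot see.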
Molecular phylogenetic tree building
methods
There are many phylogenetic methods available today, each having
strengths and weaknesses. Most can be classified as follows:
                            COMPUTATIONAL METHOD
DATA TYPE     Optimality criterion     Clustering algorithm
Characters    PARSIMONY
              MAXIMUM LIKELIHOOD
Distances     MINIMUM EVOLUTION        UPGMA
              LEAST SQUARES            NEIGHBOR-JOINING
Tree Construction by Neighbor Joining (NJ)
Distance matrix for six OTUs (A to F):
Object   A    B    C    D    E
B        5
C        4    7
D        7    10   7
E        6    9    6    5
F        8    11   8    9    8
Neighbor Joining (Cont.) Step 1
Calculate the net divergence r(i) for
each OTU from all other OTUs.
r(A) = 5+4+7+6+8 = 30
r(B) = 5+7+10+9+11 = 42
r(C) = 32
r(D) = 38
r(E) = 34
r(F) = 44
Neighbor Joining (Cont.) Step 2
Calculate a new distance matrix using,
for each pair of OTUs, the formula:
$M_{ij} = d_{ij} - \frac{r(i) + r(j)}{N - 2}$
For A and B, with N = 6:
$M_{AB} = d_{AB} - \frac{r(A) + r(B)}{N - 2} = 5 - \frac{30 + 42}{6 - 2} = -13$
Neighbor Joining (Cont.) Step 2
New distance matrix M:
Object   A       B       C       D       E
B        -13
C        -11.5   -11.5
D        -10     -10     -10.5
E        -10     -10     -10.5   -13
F        -10.5   -10.5   -11     -11.5   -11.5
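Steps 1 and 2 can be checked with a short sketch. The distances are copied from the six-OTU example table; the function and variable names are assumptions made for illustration.

```python
from itertools import combinations

TAXA = ["A", "B", "C", "D", "E", "F"]
# Lower-triangle rows of the example distance table.
ROWS = {"B": [5], "C": [4, 7], "D": [7, 10, 7], "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
DIST = {frozenset((row, TAXA[k])): float(v)
        for row, values in ROWS.items() for k, v in enumerate(values)}

def net_divergence(taxa, dist):
    """Step 1: r(i) is the sum of distances from OTU i to all other OTUs."""
    return {i: sum(dist[frozenset((i, j))] for j in taxa if j != i) for i in taxa}

def rate_corrected_matrix(taxa, dist):
    """Step 2: M_ij = d_ij - (r(i) + r(j)) / (N - 2)."""
    n = len(taxa)
    r = net_divergence(taxa, dist)
    return {(i, j): dist[frozenset((i, j))] - (r[i] + r[j]) / (n - 2)
            for i, j in combinations(taxa, 2)}

if __name__ == "__main__":
    print(net_divergence(TAXA, DIST))
    for pair, value in rate_corrected_matrix(TAXA, DIST).items():
        print(pair, value)
```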
Neighbor Joining (Cont.) Step 3
Choose as neighbors the two OTUs for
which $M_{ij}$ is the smallest. Here A-B and D-E tie at -13,
and the example joins A and B. Now we
calculate the branch lengths from the
internal node U to the external OTUs A
and B.
$S(AU) = \frac{d_{AB}}{2} + \frac{r(A) - r(B)}{2(N - 2)} = 1$
$S(BU) = d_{AB} - S(AU) = 4$
Neighbor Joining (Cont.) Step 4
Define new distances from U to each
other terminal node:
$d_{CU} = \frac{d_{AC} + d_{BC} - d_{AB}}{2} = 3$
$d_{DU} = \frac{d_{AD} + d_{BD} - d_{AB}}{2} = 6$
$d_{EU} = \frac{d_{AE} + d_{BE} - d_{AB}}{2} = 5$
$d_{FU} = \frac{d_{AF} + d_{BF} - d_{AB}}{2} = 7$
Neighbor Joining (Cont.) Step 4
Reduced distance matrix:
Object   U    C    D    E
C        3
D        6    7
E        5    6    5
F        7    8    9    8
Neighbor Joining (Cont.) Step 5
N = N - 1 = 5
The entire procedure is repeated starting
at step 1.
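The five steps can be put together as a loop. The following is a sketch under stated assumptions: ties in M are broken by taking the first pair found (which selects A and B in this example, as on the slides), new internal nodes are named by a made-up "(i,j)" convention, and all function names are hypothetical.

```python
from itertools import combinations

TAXA = ["A", "B", "C", "D", "E", "F"]
ROWS = {"B": [5], "C": [4, 7], "D": [7, 10, 7], "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
DIST = {frozenset((row, TAXA[k])): float(v)
        for row, values in ROWS.items() for k, v in enumerate(values)}

def join_once(nodes, dist):
    """One pass of steps 1 to 4: returns the join, the new node list, the new distances."""
    n = len(nodes)
    r = {i: sum(dist[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
    # Steps 2 and 3: choose the pair with the smallest M_ij.
    i, j = min(combinations(nodes, 2),
               key=lambda p: dist[frozenset(p)] - (r[p[0]] + r[p[1]]) / (n - 2))
    d_ij = dist[frozenset((i, j))]
    s_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))  # branch from i to new node U
    s_j = d_ij - s_i                                # branch from j to new node U
    u = f"({i},{j})"
    new_dist = dict(dist)
    for k in nodes:
        if k not in (i, j):
            # Step 4: distance from U to every remaining node.
            new_dist[frozenset((u, k))] = (
                dist[frozenset((i, k))] + dist[frozenset((j, k))] - d_ij) / 2
    new_nodes = [k for k in nodes if k not in (i, j)] + [u]
    return (i, j, u, s_i, s_j), new_nodes, new_dist

def neighbor_joining(nodes, dist):
    """Step 5: repeat until two nodes remain; they are joined by one final branch."""
    joins = []
    while len(nodes) > 2:
        join, nodes, dist = join_once(nodes, dist)
        joins.append(join)
    return joins, (nodes[0], nodes[1], dist[frozenset(nodes)])

if __name__ == "__main__":
    joins, last_branch = neighbor_joining(TAXA, DIST)
    for join in joins:
        print(join)
    print(last_branch)
```

The first join reproduces S(AU) = 1, S(BU) = 4 and the reduced matrix shown in step 4.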
Use of the FM algorithm
for three sequences
     A    B    C
A    ─    22   39
B    ─    ─    41
C    ─    ─    ─
Let a, b and c be the branch lengths leading to A, B and C:
Distance from A to B = a + b = 22 (1)
Distance from A to C = a + c = 39 (2)
Distance from B to C = b + c = 41 (3)
Subtract (3) from (2): a - b = -2 (4)
Add (1) and (4): 2a = 20, so a = 10
From (1) and (2): b = 12, c = 29
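Solving the three equations in general gives a closed form for the three branch lengths. A minimal sketch, with a hypothetical function name:

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for the branch lengths a, b, c."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

if __name__ == "__main__":
    print(three_point_branches(22, 39, 41))
```

For the distances 22, 39 and 41 this returns the branch lengths 10, 12 and 29 derived above.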
Tree showing the relationship among
three sequences A, B and C.
[Figure: star tree with branch a = 10 to A, branch b = 12 to B, and branch c = 29 to C.]
This calculation finds that the branch
lengths of A and B from their common
ancestor are not the same.
A and B are diverging at different rates of
evolution by this calculation and model.
Use of the FM algorithm
for five sequences
     A    B    C    D    E
A    ─    22   39   39   41
B    ─    ─    41   41   43
C    ─    ─    ─    18   20
D    ─    ─    ─    ─    10
E    ─    ─    ─    ─    ─
Use of the FM algorithm
for five sequences
The most closely related sequences given in
the distance table are D and E. A new table is
made with the remaining sequences combined.
              D    E     Average ABC
D             ─    10    32.7
E             ─    ─     34.7
Average ABC   ─    ─     ─
Distance from D to average ABC = (39 + 41 + 18) / 3 = 32.7
Distance from E to average ABC = (41 + 43 + 20) / 3 = 34.7
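Use of the FM algorithm
for five sequences
D and E are combined into (DE); each distance to (DE) is the average of the distances to D and to E:
        A    B    C    (DE)
A       ─    22   39   40
B       ─    ─    41   42
C       ─    ─    ─    19
(DE)    ─    ─    ─    ─
A and B are then combined into AB in the same way:
        DE   C    AB
DE      ─    19   41
C       ─    ─    40
AB      ─    ─    ─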
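The five-sequence calculation can be traced with a sketch that combines group averaging with the three-sequence formula. Two points are assumptions rather than statements from the slides: internal branch lengths are obtained by subtracting the average of the two terminal branches from the composite branch, and all names are hypothetical.

```python
from itertools import product
from statistics import mean

# Upper-triangle distances from the five-sequence table.
DIST = {("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
        ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
        ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10}

def dist(x, y):
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]

def group_dist(g1, g2):
    """Average distance over all pairs drawn from two groups of sequences."""
    return mean(dist(x, y) for x, y in product(g1, g2))

def three_point(d_xy, d_xz, d_yz):
    """Branch lengths from the central node to x, y and z."""
    return ((d_xy + d_xz - d_yz) / 2,
            (d_xy + d_yz - d_xz) / 2,
            (d_xz + d_yz - d_xy) / 2)

# D and E are the closest pair; A, B and C form the composite sequence.
d, e, _ = three_point(dist("D", "E"), group_dist("D", "ABC"), group_dist("E", "ABC"))
# Second table: composite DE, sequence C, composite AB.
to_de, c, to_ab = three_point(group_dist("DE", "C"), group_dist("DE", "AB"),
                              group_dist("C", "AB"))
g = to_de - (d + e) / 2  # internal branch leading to the D/E node (assumption)
a, b, _ = three_point(dist("A", "B"), group_dist("A", "CDE"), group_dist("B", "CDE"))
f = to_ab - (a + b) / 2  # internal branch leading to the A/B node (assumption)

if __name__ == "__main__":
    print(dict(a=a, b=b, c=c, d=d, e=e, f=f, g=g))
```

Under these assumptions the sketch yields a = 10, b = 12, d = 4, e = 6, f = 20 and g = 5, together with c = 9.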
Tree showing relationships
among sequences A-E
[Figure: tree with terminal branches a = 10 (to A), b = 12 (to B), c (to C), d = 4 (to D), e = 6 (to E) and internal branches f = 20 and g = 5.]
Steps followed by the Fitch-Margoliash
algorithm for phylogenetic analysis
1. Find the most closely related pair of sequences.
2. Treat the rest of the sequences as a single composite sequence.
3. Calculate the branch lengths as in the above example with three sequences.
4. Calculate the average distances between the joined pair (e.g. AB) and the other sequences, and make a new
distance table.
5. Identify the next pair of most closely related sequences.
6. When necessary, calculate the lengths of intermediate branches.
7. Repeat the entire procedure starting with all possible pairs.
8. Calculate the predicted distances between each pair of
sequences.
Construction of a Minimum Evolution (ME) tree
The tree with the shortest sum of branch lengths (the overall
tree length) is chosen as the best tree.
Categories of Tree-Building Methods
Character-Based
Maximum Parsimony
Maximum Likelihood
Maximum parsimony method
Chooses the tree that requires the minimum number of
mutational changes.
Pros:
Does not reduce the sequence information to a single number
Evaluates different tree topologies
Cons:
Slow for large data sets
Sensitive to unequal rates of evolution
Gives only the topology, with no branch lengths
Steps in building
maximum parsimony tree
Investigate all possible tree topologies
Reconstruct ancestral sequences
Choose topology with smallest number
of steps
Taxon   Sequence
A       A C T G A
B       A T T G A
C       G T G G A
D       G T G A C
Topology 1: A and B on one side, C and D on the other
Inferred internal nodes: ATTGA (joining A and B) and GTGGA (joining C and D)
Changes per branch: A 1, B 0, internal branch 2, C 0, D 2
There are 5 substitutions
Topology 2: A and C on one side, B and D on the other
Inferred internal nodes: ATTGA and ATTGA
Changes per branch: A 1, C 2, internal branch 0, B 0, D 4
There are 7 substitutions
Topology 3: A and D on one side, B and C on the other
Inferred internal nodes: ATTGA and ATTGA
Changes per branch: A 1, D 4, internal branch 0, B 0, C 2
There are 7 substitutions
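The substitution counts for the three topologies can be verified with a small-parsimony sketch. It uses Fitch's algorithm, which is one standard way to count the minimum number of changes on a fixed topology; the slides do not name the counting procedure, so this is an assumed technique. Each unrooted topology is written as a nested pair, which does not affect the count.

```python
SEQS = {"A": "ACTGA", "B": "ATTGA", "C": "GTGGA", "D": "GTGAC"}

TOPOLOGIES = {
    "Topology 1": (("A", "B"), ("C", "D")),
    "Topology 2": (("A", "C"), ("B", "D")),
    "Topology 3": (("A", "D"), ("B", "C")),
}

def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is either a leaf name or a pair of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Sum the minimum number of changes over all sites."""
    length = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
               for i in range(length))

if __name__ == "__main__":
    for name, tree in TOPOLOGIES.items():
        print(name, parsimony_score(tree, SEQS))
```

Topology 1 needs 5 substitutions and the other two need 7 each, so maximum parsimony selects Topology 1.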
Maximum Parsimony Method
Branch and Bound !
char 1 2 3 4 5 6 7 8 9
C1 A G G A G T G C A
C2 A G C C G T G C G
C3 A G A T A T C C A
C4 A G A G A T C C G
Tree Evaluation
Tools Introduction
http://evolution.genetics.washington.edu/phylip/software.html
Comprehensive website of phylogenetic tree software
http://www.tigr.org/tigr-scripts/CMR2/webmum/mumplot
The Whole Genome Alignment Tool
PHYLIP
Phylogeny Inference Package (PHYLIP)
Consists of about 30 programs that
cover most aspects of phylogenetic tree analysis
Free and available for a wide variety of
computer platforms (DOS, Mac, Unix)
A command-line program without a GUI
[Figure: PHYLIP analysis workflow, summarized below.]
Sequence data in a FASTA format file
READSEQ: converts the FASTA file to an input file in PHYLIP format
SEQBOOT options: bootstrap, jackknife, permute
DNADIST options: K2P, Jin and Nei, max likelihood, Jukes-Cantor
PROTDIST options: PAM matrix, Kimura, categories model
NEIGHBOR options: Neighbor-join, UPGMA, randomize input order
CONSENSE options: outgroup, rooting
View the treefile with Treetool or TreeView, or view trees with a standard text editor
[Figure: PHYLIP input file]
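To make the FASTA-to-PHYLIP conversion step concrete, here is a sketch that writes already aligned sequences in the strict sequential PHYLIP layout: a header line with the number of sequences and the alignment length, then one line per sequence whose name occupies exactly 10 characters. It is an illustrative stand-in, not the READSEQ program, and the function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into an ordered dict of name -> sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def to_phylip(records):
    """Format aligned sequences in strict sequential PHYLIP layout."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records.items():
        # Names are truncated or padded with blanks to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    fasta = ">Species_A\nATGGCTATTC\nTTATAGTACG\n>Species_B\nATCGCTAGTCTTATATTACA\n"
    print(to_phylip(parse_fasta(fasta)), end="")
```

Names longer than 10 characters are truncated here, so distinct names that share their first 10 characters would collide; a real conversion should check for that.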
[Figure: PROTDIST output]
Other References Used in This
Report
Molecular Evolution and Phylogenetics, by Masatoshi Nei and Sudhir Kumar
Phylogenetic Analysis, by Caro-Beth Stewart
Introduction to Bioinformatics, by Arthur M. Lesk
THE END
Thank you for your attention