CATEGORY: Biology POSTED ON: 2/26/2010 Public Domain
Phylogenetics

• The study of evolutionary relationships.
• Conversion of DNA or protein sequence data into a branching diagram (‘tree’) that shows the relationships between the sequences.

Phylogenetics: the anatomy of a tree

[Figure: a tree of taxa A to D drawn along a time axis, with labels for branch (edge, link), node, clade, and taxon (leaf, tip).]

Phylogenetics: the many shapes of trees

[Figure: the same tree of taxa A to D drawn in several equivalent layouts.]

Phylogenetics: time and most recent common ancestors (MRCA)

[Figure: tree of taxa A to D with MRCA(A,B) at time 1, MRCA(AB,C) at time 2, and MRCA(ABC,D) at time 3.]

Rooting Trees: rooted vs. unrooted trees

• Root – one node identified as the root from which all other nodes descend.
• Rooted trees have a direction corresponding to evolutionary time.
  – Allow us to define ancestor-descendant relationships.

[Figure: a rooted tree and an unrooted tree of taxa A to D.]

Rooting Trees

• An unrooted 3-taxon tree has 3 branches, so there are 3 possible roots for a 3-taxon tree.
• An unrooted 4-taxon tree has 5 branches, so it can be rooted in 5 places. With 3 unrooted 4-taxon trees, this gives 15 possible rooted trees for 4 taxa.
• The 5 rooted trees obtained from the unrooted tree that groups A with B and C with D:
  ((A,B),(C,D))   (A,(B,(C,D)))   (B,(A,(C,D)))   (C,(D,(A,B)))   (D,(C,(A,B)))

[Figure: unrooted trees with branches numbered 1 to 5 and the corresponding rooted trees.]

Rooting Trees

# Sequences   # Unrooted Trees   # Rooted Trees
 2            1                  1
 3            1                  3
 4            3                  15
 5            15                 105
 6            105                945
 7            945                10,395
 8            10,395             135,135
 9            135,135            2,027,025
10            2,027,025          34,459,425

Rooting Trees

• Outgroup rooting
• Midpoint rooting

[Figure: a tree of taxa A to D rooted with an outgroup, and the same tree rooted at its midpoint.]

Rooting Trees: SARS example

[Figure: replicase protein tree of the ten sequences listed below, with three candidate root positions marked 1, 2 and 3, followed by the three trees that result from rooting at root 1, root 2 and root 3.]

Sequences: feline AAK09095, human NP 073549, porcine AJ271965, porcine AF353511, avian AJ311317, SARS AAP13442, bovine AF220295, murine NP 068668, murine AF201929, murine AF029248.

Stavrinides & Guttman 2004 J. Virology 78:76

Phylogenetics: terminology

• Ancestral State
  – The state of the common ancestor
  – a.k.a. plesiomorphy
• Derived State
  – A state that has changed from the ancestor
  – a.k.a. apomorphy
    • Autapomorphy = unique derived state
    • Synapomorphy = shared derived state
• Homoplasy
  – Similarity due to parallel evolution, convergent evolution or secondary loss.

Phylogenetics: terminology (homoplasy)

• A character may be ancestral, derived, or a homoplasy.
• Homoplasy arises in three ways:
  – Parallel evolution: independent evolution of the same character from the same ancestral state.
  – Convergent evolution: independent evolution of the same character from a different ancestral state.
  – Secondary loss: reversion to the ancestral state.

Phylogenetics: homoplasy

Ancestral sequence: ACTGAACGTAACGC

[Figure: two lineages descending from the ancestral sequence, annotated with the kinds of substitution that can occur at a site: single substitution (C → A), multiple substitution (A → C → T), coincidental substitutions (C → G in one lineage, C → A in the other), parallel substitutions (T → A in both lineages), convergent substitution (A → C → T in one lineage, A → T in the other), and back substitution (C → T → C).]

Phylogenetics: fundamental elements

• Taxa
  – Proper sampling
• Loci
  – Homologous sequences
  – Sufficient (but not excessive) genetic variation
  – Proper sampling of genetic variation
  – Independence of characters
• Analysis
  – Quality data
  – Good multiple sequence alignments

Phylogenetics: tree building methods

• Distance methods
  – UPGMA
  – Neighbor-joining
  – Minimum evolution
• Character-based (discrete) methods
  – Maximum parsimony
  – Maximum likelihood
• Note that there are many other powerful and important methods that we do not have time to discuss.

Phylogenetics: distance-based methods

• Relationships are based upon the amount of dissimilarity between sequences.
• Advantages
  – Computationally fast.
  – Single ‘best tree’ found.
• Disadvantages
  – Assume additive distances.
  – Information loss occurs when transforming sequence data into distances.
  – Uninterpretable branch lengths (e.g. negative distances or distances corresponding to fractional substitutions).
  – Single ‘best tree’ found.

Distance-Based Phylogenetic Methods: additive distances and the four-point metric condition

• The evolutionary distance between each pair of taxa is equal to the sum of the lengths of the branches in the path connecting them.
  – This includes hypothetical ancestral taxa.
• Must satisfy the four-point metric condition:
  $d_{AB} + d_{CD} \le \max(d_{AC} + d_{BD},\ d_{AD} + d_{BC})$
  where $d_{ij}$ is the distance between taxa i and j.
• This is equivalent to stating that of the three sums ($d_{AB} + d_{CD}$, $d_{AC} + d_{BD}$, $d_{AD} + d_{BC}$), the two largest are equal.

Distance-Based Phylogenetic Methods: four-point metric condition

A   ACGCGTTGGGCGATGGCAAC
B   -------------C--T--T
C   ----A---AAT----AT--T
D   --A-A---A-T---AAT--T

      A    B    C    D
A     -    3    7    8
B          -    6    7
C               -    3
D                    -

[Figure: unrooted tree with A and B on one side and C and D on the other; terminal branches v1 (A), v2 (B), v4 (C), v5 (D) and internal branch v3.]

$d_{AB} = v_1 + v_2$
$d_{AC} = v_1 + v_3 + v_4$
$d_{AD} = v_1 + v_3 + v_5$
$d_{BC} = v_2 + v_3 + v_4$
$d_{BD} = v_2 + v_3 + v_5$
$d_{CD} = v_4 + v_5$

For the example matrix these equations give v1 = 2, v2 = 1, v3 = 4, v4 = 1, v5 = 2, and the condition holds:

$d_{AB} + d_{CD} \le \max(d_{AC} + d_{BD},\ d_{AD} + d_{BC})$
$3 + 3 \le 7 + 7 = 8 + 6$
$6 \le 14 = 14$

Distance-Based Phylogenetic Methods: ultrametric trees

• Distances between all taxa and their common ancestors are equal.
• Molecular clock: the rate of evolution is the same and constant for all lineages.

      A    B    C    D
A     -
B     2    -
C     6    6    -
D    10   10   10    -

[Figure: ultrametric tree for this matrix drawn against a scale from 5 to 0; all tips are at the same distance from the root.]

Distance-Based Phylogenetic Methods: additive trees

• The taxa may diverge different amounts from their common ancestor.

      A    B    C    D
A     -
B     6    -
C     7    3    -
D    14   10    9    -

[Figure: additive tree for this matrix drawn against a scale from 6 to 0; tips are at different distances from the root.]

Distance-Based Phylogenetic Methods: UPGMA

Unweighted Pair Group Method using Arithmetic Averages (UPGMA)

• Assumes the rate of change among branches of the tree is constant (molecular clock).
• Distances are ultrametric.
  – Distances between all taxa and their common ancestors are equal.

1. Given a matrix of pairwise distances, find the clusters i and j such that $d_{ij}$ is the minimum.

      A      B      C      D      E
A     -      0.17   0.21   0.31   0.23
B            -      0.30   0.34   0.21
C                   -      0.28   0.39
D                          -      0.43
E                                 -

2. Define the depth of the branch between i and j to be $d_{ij}/2$.
   Branch between A and B at depth of 0.17 / 2 = 0.085.

[Figure: partial tree joining A and B at depth 0.085.]

3. Define the distance from the new cluster u (i joined with j) to each other cluster k to be the average of the distances $d_{ki}$ and $d_{kj}$.
   $d_{A:B,C} = (0.21 + 0.30)/2 = 0.26$
   $d_{A:B,D} = (0.31 + 0.34)/2 = 0.33$

4. Go back to step 1 with one less cluster.
   • Clusters i and j have been eliminated, and cluster u has been added.
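Steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `upgma`, the `"i:j"` cluster naming, and the dictionary layout are assumptions made for the example. Step 3 uses the size-weighted average, which reduces to the simple average when both clusters hold a single taxon.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster taxa by UPGMA; return a list of (cluster_i, cluster_j, depth) merges."""
    sizes = {name: 1 for name in labels}                  # number of taxa per cluster
    d = {frozenset(pair): v for pair, v in dist.items()}  # unordered pair -> distance
    merges = []
    while len(sizes) > 1:
        # Step 1: find the pair of clusters with the minimum distance.
        i, j = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        d_ij = d[frozenset((i, j))]
        # Step 2: the branch joining i and j sits at depth d_ij / 2.
        u = f"{i}:{j}"  # hypothetical naming scheme for the merged cluster
        merges.append((i, j, d_ij / 2))
        # Step 3: distance from u to every other cluster k, weighted by cluster size.
        t_i, t_j = sizes[i], sizes[j]
        for k in sizes:
            if k not in (i, j):
                d[frozenset((k, u))] = (
                    t_i * d[frozenset((k, i))] + t_j * d[frozenset((k, j))]
                ) / (t_i + t_j)
        # Step 4: replace clusters i and j by u and repeat with one less cluster.
        del sizes[i], sizes[j]
        sizes[u] = t_i + t_j
    return merges


# The five-taxon distance matrix used in the slides.
labels = ["A", "B", "C", "D", "E"]
dist = {
    ("A", "B"): 0.17, ("A", "C"): 0.21, ("A", "D"): 0.31, ("A", "E"): 0.23,
    ("B", "C"): 0.30, ("B", "D"): 0.34, ("B", "E"): 0.21,
    ("C", "D"): 0.28, ("C", "E"): 0.39, ("D", "E"): 0.43,
}

for i, j, depth in upgma(labels, dist):
    print(f"join {i} and {j} at depth {depth:.3f}")
```

The sketch works on unrounded distances throughout, so its intermediate values can differ in the last digit from a hand calculation that rounds to two decimals at every step.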
UPGMA: worked example

$d_{A:B,E} = (0.23 + 0.21)/2 = 0.22$

        A:B    C      D      E
A:B     -      0.26   0.33   0.22
C              -      0.28   0.39
D                     -      0.43
E                            -

Branch between A:B and E at depth of 0.22 / 2 = 0.11.

[Figure: partial tree joining A and B, then E at depth 0.11.]

If cluster i contains $T_i$ taxa and cluster j contains $T_j$ taxa, then:

$d_{ku} = (T_i d_{ki} + T_j d_{kj}) / (T_i + T_j)$

$d_{A:B:E,C} = (2 \times 0.26 + 0.39)/3 = 0.30$
$d_{A:B:E,D} = (2 \times 0.33 + 0.43)/3 = 0.36$

        A:B:E  C      D
A:B:E   -      0.30   0.36
C              -      0.28
D                     -

Branch between C and D at depth of 0.28 / 2 = 0.14.

$d_{A:B:E,C:D} = (0.30 + 0.36)/2 = 0.33$

        A:B:E  C:D
A:B:E   -      0.33
C:D            -

Branch between A:B:E and C:D at depth of 0.33 / 2 ≈ 0.17.

[Figure: the resulting UPGMA tree (((A,B),E),(C,D)) with the branch depths computed above.]

Distance-Based Phylogenetic Methods: Neighbor-Joining

• Additive trees, but removes the assumption that the data are ultrametric.
  – The taxa may diverge different amounts from their common ancestor.

Neighbor-Joining

1. Given a matrix of pairwise distances (d), calculate the net divergence ($r_i$) for each terminal node i. N is the number of terminal nodes.
   $r_i = \sum_{k=1}^{N} d_{ik}$
2. Create a rate-corrected distance matrix (M).
   $M_{ij} = d_{ij} - (r_i + r_j)/(N - 2)$
   – The only value of interest is the minimum $M_{ij}$.
3. Define a new node u whose three branches join nodes i, j and the rest of the tree.
   $s_{iu} = d_{ij}/2 + (r_i - r_j)/[2(N - 2)]$
   $s_{ju} = d_{ij} - s_{iu}$
4. Regenerate the matrix by defining the distance from u to each remaining terminal node k as:
   $d_{ku} = (d_{ik} + d_{jk} - d_{ij})/2$
5. If more than 2 nodes remain, return to step 1. If the tree is fully defined except for the length of the branch joining the 2 remaining nodes (i and j), define this branch as:
   $s_{ij} = d_{ij}$

Neighbor-Joining: star decomposition

[Figure: a star tree of taxa A to H that is resolved step by step by joining pairs of neighbors.]

Neighbor-Joining: worked example

Pairwise distances are in the upper diagonal, rate-corrected distances in the lower diagonal.

Steps 1 and 2. With N = 5, for example $M_{AB} = 0.17 - (0.92 + 1.02)/3 = -0.48$.

       A       B       C       D       E       r_i
A      -       0.17    0.21    0.31    0.23    0.92
B     -0.48    -       0.30    0.34    0.21    1.02
C     -0.49   -0.43    -       0.28    0.39    1.18
D     -0.45   -0.45   -0.57    -       0.43    1.36
E     -0.50   -0.55   -0.42   -0.44    -       1.26

The minimum is $M_{CD} = -0.57$, so C and D are joined.

Step 3. Define Node1 joining C and D.

$s_{C,Node1} = d_{CD}/2 + (r_C - r_D)/[2(N - 2)] = 0.28/2 + (1.18 - 1.36)/[2(3)] = 0.11$
$s_{D,Node1} = d_{CD} - s_{C,Node1} = 0.28 - 0.11 = 0.17$

Step 4. Distances from Node1 to the remaining tips, where i and j are the joined nodes and k is the tip:

$d_{A,Node1} = (d_{AC} + d_{AD} - d_{CD})/2 = (0.21 + 0.31 - 0.28)/2 = 0.12$

        A      B      E      node1
A       -      0.17   0.23   0.12
B              -      0.21   0.18
E                     -      0.27
node1                        -

Begin the process again:

• Calculate the net divergence ($r_i$) and the rate-corrected distance matrix (M).

        A       B       E       node1   r_i
A       -       0.17    0.23    0.12    0.52
B      -0.37    -       0.21    0.18    0.56
E      -0.39   -0.43    -       0.27    0.71
node1  -0.43   -0.39   -0.37    -       0.57

• Find the minimum rate-corrected distance. The pairs (A, node1) and (B, E) tie at -0.43; the example joins A and node1 to form Node2.
• Calculate the distances to Node2, with N = 4:

$s_{A,Node2} = d_{A,Node1}/2 + (r_A - r_{Node1})/[2(N - 2)] = 0.12/2 + (0.52 - 0.57)/4 = 0.05$
$s_{Node1,Node2} = 0.12 - 0.05 = 0.07$

Repeat with the three remaining nodes (N = 3):

        B       E       node2   r_i
B       -       0.21    0.11    0.32
E      -0.51    -       0.19    0.40
node2  -0.51   -0.51    -       0.31

$s_{B,Node3} = 0.11/2 + (0.32 - 0.31)/2 = 0.06$
$s_{Node2,Node3} = 0.11 - 0.06 = 0.05$

Two nodes remain, so the last branch is defined directly:

        E      node3
E       -      0.14
node3          -

$s_{E,Node3} = 0.14$

Branch lengths of the final tree:

         Node1   Node2   Node3
A        -       0.05    -
B        -       -       0.06
C        0.11    -       -
D        0.17    -       -
E        -       -       0.14
Node1    -       0.07    -
Node2    -       -       0.05

[Figure: the resulting unrooted neighbor-joining tree. C and D join at n1, A and n1 join at n2, and B, E and n2 join at n3. Scale bar 0.05.]
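A single neighbor-joining iteration (net divergence, rate-corrected matrix, branch lengths, and the regenerated matrix) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `nj_step`, its return layout, and the `new_name` parameter are hypothetical and do not come from the slides. Ties for the minimum rate-corrected distance are broken by whichever pair is encountered first.

```python
from itertools import combinations


def nj_step(nodes, d, new_name):
    """Perform one neighbor-joining iteration.

    nodes: list of node names; d: dict mapping frozenset({a, b}) -> distance.
    Returns ((i, j), (s_iu, s_ju), new_nodes, new_d).
    """
    n = len(nodes)
    # Net divergence r_i: sum of distances from i to every other node.
    r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}

    # Rate-corrected distance M_ij = d_ij - (r_i + r_j) / (N - 2).
    def m(pair):
        i, j = pair
        return d[frozenset(pair)] - (r[i] + r[j]) / (n - 2)

    # Join the pair with the minimum rate-corrected distance.
    i, j = min(combinations(nodes, 2), key=m)
    d_ij = d[frozenset((i, j))]

    # Branch lengths from i and j to the new node u.
    s_iu = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    s_ju = d_ij - s_iu

    # Regenerate the matrix: keep old distances, add distances to the new node.
    rest = [k for k in nodes if k not in (i, j)]
    new_d = {frozenset(p): d[frozenset(p)] for p in combinations(rest, 2)}
    for k in rest:
        new_d[frozenset((k, new_name))] = (
            d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
        ) / 2
    return (i, j), (s_iu, s_ju), rest + [new_name], new_d


# The five-taxon distance matrix used in the slides.
nodes = ["A", "B", "C", "D", "E"]
d = {
    frozenset(pair): value
    for pair, value in {
        ("A", "B"): 0.17, ("A", "C"): 0.21, ("A", "D"): 0.31, ("A", "E"): 0.23,
        ("B", "C"): 0.30, ("B", "D"): 0.34, ("B", "E"): 0.21,
        ("C", "D"): 0.28, ("C", "E"): 0.39, ("D", "E"): 0.43,
    }.items()
}

joined, lengths, nodes2, d2 = nj_step(nodes, d, "node1")
print(joined, [round(s, 2) for s in lengths])
```

Calling `nj_step` again on `nodes2` and `d2` continues the procedure. Once only two nodes remain, the last branch length is simply the remaining distance, so no further call is needed (and the function would divide by zero if called with N = 2).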
UPGMA vs. Neighbor-Joining

[Figure: the UPGMA tree and the neighbor-joining tree for taxa A to E shown side by side. Scale bar 0.05.]

Distance-Based Phylogenetic Methods: Minimum Evolution

• Finds the tree that minimizes the total branch length of the tree (L).
• An unrooted tree of n taxa has (2n-3) branches, each with length $e_i$.
• Estimate the length $e_i$ of each branch from the pairwise distances between taxa.
  $L = \sum_{i=1}^{2n-3} e_i$
• Has an optimality criterion: the optimum is defined by the minimum tree length.
  – Conceptually similar to Maximum Parsimony, except based upon pairwise distances.
• Simulation studies find ME to be one of the most accurate methods.

Distance-Based Phylogenetic Methods: Minimum Evolution

1. Pick a tree topology.
2. Estimate the length of each branch of the tree based on the pairwise distances between the taxa.
3. Determine the total tree length by summing the individual branch lengths.
4. Return to step 1 and repeat until the smallest (optimal) tree length is found.

www.megasoftware.net

Character-Based Phylogenetic Methods: Maximum Parsimony

• Optimization criterion
  – The best explanation of the data is the simplest
    • requires the fewest ad hoc assumptions.
  – Gives rise to the shortest tree
    • The tree with the fewest number of substitutions.
    • The tree with the fewest number of homoplasies.

Maximum Parsimony

• Advantages
  – Maximizes similarity that can be attributed to common ancestry.
    • Any character that does not fit a given tree requires us to postulate that the similarity is due to homoplasy, not homology.
  – Based on an implicit evolutionary assumption:
    • Evolutionary change is rare.
  – Can produce many equally parsimonious trees.
• Disadvantages
  – Can be inconsistent under certain models
    • “Long branch attraction”
  – Can produce many equally parsimonious trees.
Maximum Parsimony: minimizing evolutionary change across a tree

Sequence 1   ATATT
Sequence 2   ATCGT
Sequence 3   GCAGT
Sequence 4   GCCGT

Position 1, assuming tree ((1,2),(3,4)):

[Figure: two reconstructions of position 1 (A, A, G, G) on this tree, one requiring 1 step and one requiring 5 steps.]

Maximum Parsimony: selecting the most parsimonious tree

Position 1 on the three possible trees:

Potential Tree 1   ((1,2),(3,4))   1 step
Potential Tree 2   ((1,3),(2,4))   2 steps
Potential Tree 3   ((1,4),(2,3))   2 steps

[Figure: reconstructions of positions 2 to 5 on the candidate trees, with step counts.]

                   Sites
Tree               1    2    3    4*   5*   Total
((1,2),(3,4))      1    1    2    1    0    5
((1,3),(2,4))      2    2    1    1    0    6
((1,4),(2,3))      2    2    2    1    0    7

* Not phylogenetically informative

Most parsimonious tree: ((1,2),(3,4)).

Maximum Parsimony: the process

[Figure: tree of taxa 1 to 4 with character states {A}, {C}, {A}, {G} and taxon 5 as the root; internal nodes are labelled with state sets such as {A,C} and {A,G} and with the running length l.]

• Initialize tree length = 0.
• For an internal node k with descendant state sets Si and Sj:
  – If the intersection of Si and Sj is not empty, set the state of this node (Sk) to the intersection.
  – Else, set the state of this node (Sk) to be the union of Si and Sj.
• Continue with the procedure until reaching the node above the root (the basal fork node).
• If the root state is not contained in the basal fork node state set, increase the length of the tree by 1.

Maximum Parsimony: the process

• Start at the root node and traverse up through the nodes.
• Visit internal nodes. For the case where an internal node has more than one potential state:
  – If the derived node state set shares a state with the ancestral node, set the derived node state equal to the ancestral node state.
  – If the derived node state set does not share a state with the ancestral node, pick one of the states from the derived node arbitrarily.
• Count the number of state changes and add this number to the tree length.
  – In this case, number of state changes = 3.
• Note that there is an alternative, equally parsimonious reconstruction for this tree.

The total number of evolutionary changes on a tree (the tree’s length, L) is the sum of the number of changes at each position. If we have k positions, each with a length $l_i$, the total length L of a tree is:

$L = \sum_{i=1}^{k} l_i$

Maximum Parsimony: generalized parsimony

• We can easily establish weighting schemes for all possible substitutions to increase the biological realism.

Substitution model: equal substitution probability. Step matrix:

      A    C    G    T
A     0    1    1    1
C     1    0    1    1
G     1    1    0    1
T     1    1    1    0

Substitution model: transversions more costly than transitions. Step matrix:

      A    C    G    T
A     0    2    1    2
C     2    0    2    1
G     1    2    0    2
T     2    1    2    0

Maximum Parsimony: generalized parsimony

Sequence 1   ATATT
Sequence 2   ATCGT
Sequence 3   GCAGT
Sequence 4   GCCGT

[Figure: position 3 (A, C, A, C) reconstructed on the three candidate trees.]

                   Tv = Ts = 1                  Tv = 2, Ts = 1
Tree               1   2   3   4   5   T        1   2   3   4   5   T
((1,2),(3,4))      1   1   2   1   0   5        1   1   4   2   0   8
((1,3),(2,4))      2   2   1   1   0   6        2   2   2   2   0   8
((1,4),(2,3))      2   2   2   1   0   7        2   2   4   2   0   10

Maximum Parsimony: weighted parsimony

• Different sequence positions often evolve at different rates (e.g. 3rd positions of codons).
• Rapidly evolving sites may quickly become saturated with change.

$L = \sum_{i=1}^{k} w_i l_i$

where $w_i$ is the weight of position i.
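The per-site tree lengths for the four example sequences can be computed with a short script. Note that this sketch uses Sankoff's dynamic-programming algorithm rather than the set-intersection procedure described in the slides, because Sankoff's method accepts an arbitrary step matrix and so covers both the equal-cost and the transversion-weighted case. The function names and the nested-tuple tree encoding are assumptions made for the illustration.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def step_cost(a, b, tv=1, ts=1):
    """Cost of changing base a into base b (ts = transition, tv = transversion)."""
    if a == b:
        return 0
    is_transition = (a in PURINES) == (b in PURINES)
    return ts if is_transition else tv


def sankoff(tree, site, tv=1, ts=1):
    """Return {base: minimum cost of the subtree if its root carries that base}.

    tree is a nested tuple of 1-based sequence numbers, e.g. ((1, 2), (3, 4));
    site is the tuple of bases observed at one alignment position.
    """
    if isinstance(tree, int):  # leaf: only the observed base is allowed
        return {b: (0 if b == site[tree - 1] else float("inf")) for b in BASES}
    child_costs = [sankoff(child, site, tv, ts) for child in tree]
    return {
        b: sum(min(step_cost(b, c, tv, ts) + cost[c] for c in BASES)
               for cost in child_costs)
        for b in BASES
    }


def site_lengths(tree, seqs, tv=1, ts=1):
    """Minimum number of (weighted) steps at each alignment position."""
    return [min(sankoff(tree, site, tv, ts).values()) for site in zip(*seqs)]


seqs = ["ATATT", "ATCGT", "GCAGT", "GCCGT"]
trees = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]

for tree in trees:
    equal = site_lengths(tree, seqs)
    weighted = site_lengths(tree, seqs, tv=2, ts=1)
    print(tree, equal, sum(equal), weighted, sum(weighted))
```

A per-position weight $w_i$, as in weighted parsimony, could be applied by multiplying each entry of the list returned by `site_lengths` by its weight before summing.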