Phylogenetic trees
BioE131/231

[Figure: HIV phylogeny. Simmonds et al (left); Yamamoto et al (right)]

“Star” vs hierarchical phylogenies
[Figure: two panels, “Star” and “Hierarchical”]

Rooted and unrooted trees
“Degree” of a node = the number of neighbors of that node.
In a “binary” tree, all nodes have degree 1 or 3, except the root, which has degree 2 (i.e. each node has 0 or 2 children).
A binary tree with N leaf nodes has N-1 internal nodes (cf. table-tennis tournament...)

Rooted = directed
[Figure labels: root node; internal node; clade, subtree; leaf node, “taxon” (pl. “taxa”)]
Synonyms: phylogeny, tree, dendrogram, cladogram

Rooting via “outgroups”

Root can be ambiguous
No outgroup... Earliest studies placed LUCA on the eukarya-bacteria branch; later studies suggested bacteria-archaea.

“Ultrametric” trees & molecular clocks
Branch lengths are typically in units of “average number of substitutions per site”.
Thus, branch lengths of >1 have large estimation errors.
[Figure: two panels, “Ultrametric” and “Non-ultrametric”; the “height” (“distance”) of a node X is marked]

Q. Why are non-ultrametric trees necessary?
A. Mutation rate ~ 1/(generation time) (Wen-Hsiung Li, 1985; 2003 Balzan Prize). Also correlated with other physiological variables (e.g. metabolic rate).
“Longitudinal” data (e.g. serial viral sequencing from the same host) can also generate non-ultrametric trees, since leaf nodes are not contemporaneous.

Newick format
a.k.a. New Hampshire format
• Rooted tree topologies: (A,B,(C,D));
• Branch lengths: (A:.1,B:.2,(C:.3,D:.4):.5);
• Internal node names: (A:.1,B:.2,(C:.3,D:.4)E:.5)F;

Algorithms for phylogenetic reconstruction
• Start with a multiple alignment
– use substitutions to evaluate trees
– indels informative, but harder to model
• Parsimony
– find the tree with the fewest substitutions
• Likelihood
– find the tree with the most likely substitutions (transition/transversion bias, long branches, ...)
– sum probabilities over unseen ancestral states
– enumerating all possible tree topologies is sloooooow
• Distance matrix
– Start by computing all pairwise distances
– Quick approximation to likelihood methods

UPGMA algorithm
• Creates ultrametric trees
• Basic idea:
– Two closest nodes must be siblings
– Parent is equidistant between siblings
– Distance from parent to any other node is average of distances of siblings to those nodes

UPGMA algorithm
• Input: a “distance matrix”, D_ij (N^2 entries)
• Let N be the set of nodes to be joined
• Let the “height” of node i be H_i
– Initialize H_i = 0 for all the leaf nodes in N
• While N contains >1 node (N-1 steps):
– Find i & j, the two closest nodes in N
  • (i,j) = argmin_{i,j} D_ij (N^2 steps)
– Create a new node, k, the parent of (i,j)
– Set H_k = .5 * (H_i + H_j + D_ij)
  • Branch length ki is (H_k - H_i) and similarly for kj
– For all nodes n in N (excluding i & j):
  • Set D_kn = .5 * (D_in + D_jn)
– Add k to N; remove i & j
• Cost: O(N^3) time (O(N^2) if we maintain argmin_j D_ij for each node i); O(N^2) memory

UPGMA in Perl
• Questions:
– How to represent a tree?
  • For each node, need children/parents/both, name, branch length to parent...
– How to print a tree in Newick format?
  • Recursive (“print a particular node”)
  • Pre-order traversal (parents before children)
– How to represent a distance matrix?
• Can side-step some of these...

Identify nodes by name, not by number
• Entry D_ij of distance matrix is $distance{$iname}->{$jname}, where $iname is the name of node i

Accessing the distance matrix
• Set of all nodes, N: keys (%distance)
• Removing a node from the set: delete $distance{$iname}

Construct the Newick representation on-the-fly
• Siblings (i, j) = ($iname, $jname)
• Branch lengths: branch ki has length $ki; branch kj has length $kj
• Name of new node (k): "($iname:$ki,$jname:$kj)"
• Then, Newick-format tree is just the name of the root node (plus a semicolon)

Other phylogeny algorithms
• “Neighbor-joining” (e.g.
“neighbor” program)
– Parents not equidistant from siblings
• “Weighted neighbor-joining” (e.g. “weighbor” program)
– Corrects for long-branch estimation error
• “Quartet-puzzling” (e.g. “tree-puzzle” program)
– Looks at sets of 4 nodes, instead of pairs
• MCMC sampling (e.g. “MrBayes” program)
– Stochastically explores tree space
– Slow, but provides much more information (confidence limits, etc.)

Long branch attraction
• Arises because sequences on long branches share chance similarities
• Some methods (esp. parsimony) interpret this incorrectly as relatedness
• Solutions:
– add more taxa to break up the branches
– use more realistic likelihood models

Confidence estimates
• “Bootstrap”
– Sample a random subset of alignment columns (with replacement) and build a tree from those
– Repeat a large number of times
• “Support” for a branch
– defined as % of trees that include that branch
– identify a branch by its partitioning of the taxa
• MCMC is a more statistically rigorous way to get confidence estimates for trees
– because it samples directly from the posterior distribution of trees

Evolutionary linguistics

How to estimate distances?
• T. Jukes and C. Cantor – Berkeley, 1969
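
The slides cite Jukes and Cantor without stating a formula at this point. As a hedged sketch, the code below assumes the standard one-parameter Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), which turns the observed fraction p of differing sites into an estimate of substitutions per site (the branch-length unit used earlier). The function names `p_distance` and `jukes_cantor_distance` are illustrative and not from the slides. The sequences are assumed to be aligned and of equal length.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    # Skip any column where either sequence has a gap character.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Estimated substitutions per site under Jukes-Cantor, given observed p."""
    if p >= 0.75:
        # Saturation: the logarithm's argument is zero or negative.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Usage: one difference in four sites gives p = 0.25.
p = p_distance("ACGT", "ACGA")
d = jukes_cantor_distance(p)
```

The corrected distance is always at least as large as p, and it grows without bound as p approaches 0.75. This is one way to see why long branches carry large estimation errors.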
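
The UPGMA steps listed earlier, together with the trick of naming each new node by its own Newick string, can be sketched roughly as follows. The slides do this in Perl with a hash of hashes; this is a Python rendering using a dict of dicts, and the function name `upgma_newick` is hypothetical. It follows the slides in taking the plain mean of the two sibling distances. One assumption to flag: the sketch sets the height of a new node to D_ij / 2, which keeps heights consistent with plain averaging of distances. For two leaves this equals .5 * (H_i + H_j + D_ij), but the two expressions differ once i or j is an internal node.

```python
def upgma_newick(distance):
    """Join the closest pair of nodes repeatedly; return a Newick string.

    distance is a symmetric dict of dicts keyed by leaf name, so that
    distance[a][b] == distance[b][a] for every pair of distinct leaves.
    """
    # Work on a copy so the caller's matrix is left untouched.
    dist = {a: dict(row) for a, row in distance.items()}
    height = {name: 0.0 for name in dist}  # leaves sit at height 0

    while len(dist) > 1:
        # Find the two closest remaining nodes (each pair visited once).
        i, j = min(
            ((a, b) for a in dist for b in dist[a] if a < b),
            key=lambda pair: dist[pair[0]][pair[1]],
        )
        # Assumption of this sketch: parent height is half the pair distance.
        h_k = 0.5 * dist[i][j]
        # The new node's name is its Newick subtree, with branch lengths.
        k = f"({i}:{h_k - height[i]:g},{j}:{h_k - height[j]:g})"
        height[k] = h_k

        # Distance from k to every other node: mean of the sibling distances.
        others = [n for n in dist if n not in (i, j)]
        dist[k] = {n: 0.5 * (dist[i][n] + dist[j][n]) for n in others}
        for n in others:
            dist[n][k] = dist[k][n]
            del dist[n][i]
            del dist[n][j]
        del dist[i]
        del dist[j]

    (root,) = dist  # exactly one node is left
    return root + ";"


example = {
    "A": {"B": 2, "C": 6},
    "B": {"A": 2, "C": 6},
    "C": {"A": 6, "B": 6},
}
print(upgma_newick(example))  # prints ((A:1,B:1):2,C:3);
```

Because the tree is carried entirely in the node names, no separate tree structure or recursive printer is needed, which is the side-step the slides describe. The pair search rescans the whole matrix at each of the N-1 joins, so this version has the O(N^3) running time noted above.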

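The bootstrap and branch-support ideas from the “Confidence estimates” slide can be sketched in a few lines. This is a minimal illustration and not the procedure of any particular program. It assumes the usual convention of drawing as many columns as the alignment has. It also assumes that each replicate tree has already been reduced to a set of branches, each branch being the frozenset of taxa on one agreed side of it. The names `bootstrap_columns` and `branch_support` are hypothetical.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the length.

    alignment maps each taxon name to its aligned sequence; all sequences
    are assumed to have the same length.
    """
    length = len(next(iter(alignment.values())))
    # Pick column indices at random; the same column may be picked twice.
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def branch_support(branch, replicate_trees):
    """Percentage of replicate trees that contain the given branch."""
    hits = sum(branch in tree for tree in replicate_trees)
    return 100.0 * hits / len(replicate_trees)


# Usage: one resampled alignment, and support over two toy replicate trees.
rng = random.Random(0)
resampled = bootstrap_columns({"A": "ACGT", "B": "AGGT"}, rng)
replicates = [{frozenset({"A", "B"})}, {frozenset({"A", "C"})}]
support = branch_support(frozenset({"A", "B"}), replicates)
```

Each resampled alignment would be passed to a tree-building method to produce one replicate tree. That tree-building step is left out of the sketch.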