Phylogenetic trees
BioE131/231








[Figure: HIV phylogeny. Simmonds et al. (left); Yamamoto et al. (right)]
“Star” vs hierarchical phylogenies


[Figure panels: Star (left); Hierarchical (right)]
     Rooted and unrooted trees






“Degree” of a node = the number of neighbors of that node.
In a rooted “binary” tree, all nodes have degree 1 or 3, except the
root, which has degree 2 (i.e. each node has 0 or 2 children).
A binary tree with N leaf nodes has N-1 internal nodes, counting the root
(cf. a knockout table-tennis tournament: N players, N-1 matches).
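
The tournament analogy can be written out as a short counting argument. This is only a sketch; the symbol I for the number of internal nodes is introduced here and does not appear on the slide.

```latex
% Start with N separate leaves. Each internal node joins two subtrees into one,
% so every join reduces the number of separate subtrees by exactly one.
\underbrace{N}_{\text{subtrees at the start}} \;-\; \underbrace{I}_{\text{joins}} \;=\; 1
\quad\Longrightarrow\quad I = N - 1
% Example: ((A,B),(C,D)) has N = 4 leaves and I = 3 internal nodes
% (the parent of A and B, the parent of C and D, and the root).
```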
Rooted = directed

[Figure labels: Root node; Internal node; Clade, subtree; Leaf node, “taxon” (pl. “taxa”)]


   Synonyms: phylogeny, tree, dendrogram, cladogram
                              Rooting via “outgroups”



        Root can be ambiguous






No outgroup...
Earliest studies placed LUCA (the last universal common ancestor) on the
eukarya-bacteria branch; later studies suggested the bacteria-archaea branch.
“Ultrametric” trees & molecular clocks
Branch lengths are typically in units of “average number of substitutions per site”.
Thus, branch lengths >1 (more than one substitution per site, on average) have large estimation errors.

[Figures: an ultrametric tree, with the “height” of node X marked (left); a non-ultrametric tree, with a “distance” axis (right)]




  Q. Why are non-ultrametric trees necessary?
  A. Mutation rate ~ 1/(generation time) (Wen-Hsiung Li, 1985; 2003 Balzan Prize)
  Also correlated with other physiological variables (e.g. metabolic rate).
  “Longitudinal” data (e.g. serial viral sequencing from the same host) can also generate
  non-ultrametric trees, since leaf nodes are not contemporaneous.
                    Newick format
                  a.k.a. New Hampshire format

• Rooted tree topologies
  (A,B,(C,D));

• Branch lengths
  (A:.1,B:.2,(C:.3,D:.4):.5);

• Internal node names
  (A:.1,B:.2,(C:.3,D:.4)E:.5)F;
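
To make the three forms above concrete, here is a minimal sketch of a recursive Newick reader in Perl. The function names (`parse_newick`, `leaf_names`) and the nested-hash layout are assumptions made for illustration; quoted labels and comments, which fuller Newick variants allow, are not handled.

```perl
use strict;
use warnings;

# Parse a Newick string into nested hashes of the form
#   { name => 'E', length => 0.5, children => [ ... ] }
sub parse_newick {
    my ($text) = @_;
    $text =~ s/\s+//g;                       # ignore whitespace
    my $pos  = 0;
    my $node = _parse_node(\$text, \$pos);
    die "expected ';' at offset " . $pos . "\n"
        unless substr($text, $pos, 1) eq ';';
    return $node;
}

# Parse one node (leaf or parenthesised subtree) starting at offset $$pos.
sub _parse_node {
    my ($text, $pos) = @_;
    my %node = (name => '', length => undef, children => []);
    if (substr($$text, $$pos, 1) eq '(') {
        do {
            $$pos++;                         # skip '(' or ','
            push @{ $node{children} }, _parse_node($text, $pos);
        } while (substr($$text, $$pos, 1) eq ',');
        die "expected ')' at offset " . $$pos . "\n"
            unless substr($$text, $$pos, 1) eq ')';
        $$pos++;                             # skip ')'
    }
    pos($$text) = $$pos;
    if ($$text =~ /\G([^():,;]+)/gc)  { $node{name}   = $1 }      # optional name
    if ($$text =~ /\G:([^():,;]+)/gc) { $node{length} = $1 + 0 }  # optional length
    $$pos = pos($$text);
    return \%node;
}

# Leaf names in left-to-right order.
sub leaf_names {
    my ($node) = @_;
    return ($node->{name}) unless @{ $node->{children} };
    return map { leaf_names($_) } @{ $node->{children} };
}

my $tree = parse_newick('(A:.1,B:.2,(C:.3,D:.4)E:.5)F;');
print join(',', leaf_names($tree)), "\n";    # A,B,C,D
print "root = $tree->{name}\n";              # root = F
```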




Algorithms for phylogenetic reconstruction

  • Start with a multiple alignment
     – use substitutions to evaluate trees
     – indels informative, but harder to model
  • Parsimony
     – find the tree with the fewest substitutions
  • Likelihood
     – find the tree with the most likely substitutions
       (transition/transversion bias, long branches, ...)
     – sum probabilities over unseen ancestral states
     – enumerating all possible tree topologies is sloooooow
  • Distance matrix
     – Start by computing all pairwise distances
     – Quick approximation to likelihood methods
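
The distance-matrix approach starts from all pairwise distances. A minimal sketch in Perl, assuming the simplest possible measure (the fraction of differing columns, often called the p-distance, which the slides do not name) and a made-up toy alignment:

```perl
use strict;
use warnings;

# Toy alignment (made-up sequences, all of the same length).
my %alignment = (
    A => 'ACGTACGTAC',
    B => 'ACGTACGTAA',
    C => 'ACGAACGTTA',
    D => 'TCGAACGTTA',
);

# Fraction of columns at which two aligned sequences differ,
# skipping columns where either sequence has a gap ('-').
sub p_distance {
    my ($x, $y) = @_;
    die "sequences must have equal length\n" unless length($x) == length($y);
    my ($diff, $sites) = (0, 0);
    for my $col (0 .. length($x) - 1) {
        my ($cx, $cy) = (substr($x, $col, 1), substr($y, $col, 1));
        next if $cx eq '-' || $cy eq '-';
        $sites++;
        $diff++ if $cx ne $cy;
    }
    die "no comparable columns\n" unless $sites;
    return $diff / $sites;
}

# All pairwise distances, stored as a hash of hashes keyed by sequence name.
my %distance;
my @names = sort keys %alignment;
for my $i (@names) {
    for my $j (@names) {
        next if $i eq $j;
        $distance{$i}{$j} = p_distance($alignment{$i}, $alignment{$j});
    }
}

printf "A-B: %.1f\n", $distance{A}{B};   # A-B: 0.1
printf "A-C: %.1f\n", $distance{A}{C};   # A-C: 0.3
```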
        UPGMA algorithm
• Creates ultrametric trees
• Basic idea:
  – Two closest nodes must be siblings
  – Parent is equidistant between siblings
  – Distance from parent to any other node is
    average of distances of siblings to those
    nodes
                  UPGMA algorithm
• Input: a “distance matrix”, Dij (N^2 entries)
• Let N be the set of nodes to be joined (N also denotes the number of leaves)
• Let the “height” of node i be Hi
  – Initialize Hi=0 for all the leaf nodes in N
• While N contains >1 node (N-1 steps):
  – Find i & j, the two closest nodes in N
     • (i,j) = argmin_{i,j} Dij (N^2 steps)
  – Create a new node, k, the parent of (i,j)
  – Set Hk = .5 * (Hi + Hj + Dij)
     • Branch length ki is (Hk-Hi) and similarly for kj
  – For all nodes n in N (excluding i & j):
     • Set Dkn = .5 * (Din + Djn)
  – Add k to N; remove i & j
• Cost: O(N^3) time; if we maintain argmin_j Dij for each i, then it is O(N^2)
• O(N^2) memory
            UPGMA in Perl
• Questions:
  – How to represent a tree?
     • For each node, need children/parents/both,
       name, branch length to parent...
  – How to print a tree in Newick format?
     • Recursive (“print a particular node”)
     • Pre-order traversal (parents before children)
  – How to represent a distance matrix?
• Can side-step some of these...
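
For comparison, this is roughly what the explicit route looks like before side-stepping it: a sketch in Perl that stores each node as a hash and prints it recursively. The field names (`name`, `length`, `children`) and the function name `to_newick` are assumptions for illustration. The recursion starts at the parent, but Newick writes a node's own name and branch length after its children.

```perl
use strict;
use warnings;

# A node is a hash with an optional name, an optional branch length
# to its parent, and an optional list of child nodes.
sub to_newick {
    my ($node) = @_;
    my $text = '';
    my @children = @{ $node->{children} || [] };
    if (@children) {
        $text .= '(' . join(',', map { to_newick($_) } @children) . ')';
    }
    $text .= $node->{name} if defined $node->{name};
    $text .= ':' . $node->{length} if defined $node->{length};
    return $text;
}

my $tree = {
    name     => 'F',
    children => [
        { name => 'A', length => 0.1 },
        { name => 'B', length => 0.2 },
        {
            name     => 'E',
            length   => 0.5,
            children => [
                { name => 'C', length => 0.3 },
                { name => 'D', length => 0.4 },
            ],
        },
    ],
};

# The Newick string is the printed root plus a semicolon.
print to_newick($tree), ";\n";   # (A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;
```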
Identify nodes by name, not by number

  • Entry Dij of distance matrix is

       $distance{$iname}->{$jname}

    where $iname is the name of node i
Accessing the distance matrix
• Set of all nodes, N:

     keys (%distance)

• Removing a node from the set:

     delete $distance{$iname}
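
A small sketch of these operations together, using made-up distances. The helper `set_distance` is hypothetical; it stores each value in both directions so that either name can be used as the first key.

```perl
use strict;
use warnings;

my %distance;

# Hypothetical helper: store a symmetric distance under both orderings.
sub set_distance {
    my ($i, $j, $d) = @_;
    $distance{$i}{$j} = $d;
    $distance{$j}{$i} = $d;
}

set_distance('A', 'B', 0.2);
set_distance('A', 'C', 0.6);
set_distance('B', 'C', 0.6);

# The set of all nodes, N (sorted so that ties are broken the same way each run).
my @nodes = sort keys %distance;

# Find the closest pair by scanning every unordered pair once.
my ($best_i, $best_j);
for my $i (@nodes) {
    for my $j (@nodes) {
        next unless $i lt $j;
        ($best_i, $best_j) = ($i, $j)
            if !defined $best_i || $distance{$i}{$j} < $distance{$best_i}{$best_j};
    }
}
print "closest: $best_i, $best_j\n";   # closest: A, B

# Removing a node deletes its row. Entries such as $distance{C}{A} remain,
# which is harmless as long as both names always come from keys %distance.
delete $distance{$best_i};
print join(',', sort keys %distance), "\n";   # B,C
```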
       Construct the Newick
     representation on-the-fly
• Siblings (i, j) = ($iname , $jname)
• Branch lengths:
      Branch ki has length $ki
      Branch kj has length $kj


• Name of new node (k):

      "($iname:$ki,$jname:$kj)"

• Then, Newick-format tree is just the name of
  the root node (plus a semicolon)
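
Putting the pieces together, here is a minimal sketch of the whole procedure in Perl. It follows the slides' rules as written, reading the distance update as the average of the two siblings' distances, and it names each new node by its Newick string. The function name `upgma` and the example matrix are assumptions for illustration. Textbook UPGMA differs in two details: it sets the parent's height to Dij/2 and weights the average by cluster size.

```perl
use strict;
use warnings;

# Takes a reference to a hash-of-hashes distance matrix keyed by node name
# and returns a Newick string.
sub upgma {
    my ($input) = @_;
    my (%distance, %height);
    for my $i (keys %$input) {
        $distance{$i} = { %{ $input->{$i} } };   # copy, so the caller's data is untouched
        $height{$i}   = 0;                       # leaves have height 0
    }

    while (scalar(keys %distance) > 1) {
        # Find the two closest nodes (sorted names make ties deterministic).
        my @nodes = sort keys %distance;
        my ($i, $j);
        for my $x (@nodes) {
            for my $y (@nodes) {
                next unless $x lt $y;
                ($i, $j) = ($x, $y)
                    if !defined $i || $distance{$x}{$y} < $distance{$i}{$j};
            }
        }

        # Height of the new parent k and the branch lengths below it.
        my $hk = 0.5 * ($height{$i} + $height{$j} + $distance{$i}{$j});
        my ($ki, $kj) = ($hk - $height{$i}, $hk - $height{$j});
        my $k = "($i:$ki,$j:$kj)";               # the name of k is its Newick subtree

        # Distance from k to every other node: average of the siblings' distances.
        for my $n (grep { $_ ne $i && $_ ne $j } @nodes) {
            my $d = 0.5 * ($distance{$i}{$n} + $distance{$j}{$n});
            $distance{$k}{$n} = $d;
            $distance{$n}{$k} = $d;
        }
        $distance{$k} ||= {};                    # keep k in the set when it is the last node
        $height{$k} = $hk;
        delete $distance{$i};
        delete $distance{$j};
    }

    my ($root) = keys %distance;
    return "$root;";
}

my %matrix = (
    A => { B => 2, C => 6 },
    B => { A => 2, C => 6 },
    C => { A => 6, B => 6 },
);
print upgma(\%matrix), "\n";   # ((A:1,B:1):2.5,C:3.5);
```

With the height rule as written, the A-to-C path in this tree has length 7 rather than the input distance of 6. Using Dij/2 for the height would give ((A:1,B:1):2,C:3); instead, which reproduces the input distances.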
   Other phylogeny algorithms
• “Neighbor-joining” (e.g. “neighbor” program)
   – Parents not equidistant from siblings
• “Weighted neighbor-joining” (e.g. “weighbor” program)
   – Corrects for long-branch estimation error
• “Quartet-puzzling” (e.g. “tree-puzzle” program)
   – Looks at sets of 4 nodes, instead of pairs
• MCMC sampling (e.g. “MrBayes” program)
   – Stochastically explores tree space
   – Slow, but provides much more information
     (confidence limits, etc.)
Long branch attraction



• Arises because sequences on long branches share chance similarities
• Some methods (esp. parsimony) interpret this incorrectly as relatedness
• Solutions:
   – add more taxa to break up the branches
   – use more realistic likelihood models
               Confidence estimates
• “Bootstrap”
   – Resample the alignment columns at random (with replacement) and build a
     tree from the resampled columns
   – Repeat a large number of times
• “Support” for a branch
   – defined as % of trees that include that branch
   – identify a branch by its partitioning of the taxa
• MCMC is a more statistically rigorous way to get confidence estimates for trees
   – because it samples directly from the posterior distribution of trees
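
The bootstrap bookkeeping can be sketched in Perl as follows. All three function names are hypothetical. Tree building itself is left out, so each replicate tree is assumed to arrive as a list of branch keys, and the sketch assumes the usual convention of drawing as many columns as the original alignment has.

```perl
use strict;
use warnings;

# Resample alignment columns with replacement, keeping the alignment length.
# Input and output: hash reference, taxon name => aligned sequence.
sub bootstrap_alignment {
    my ($alignment) = @_;
    my @names  = sort keys %$alignment;
    my $length = length $alignment->{ $names[0] };
    my @cols   = map { int(rand($length)) } 1 .. $length;   # same columns for every taxon
    my %resampled;
    for my $name (@names) {
        $resampled{$name} = join '', map { substr($alignment->{$name}, $_, 1) } @cols;
    }
    return \%resampled;
}

# Identify a branch by how it partitions the taxa. Either side of the split
# gives the same key: the sorted taxa on the side that does NOT contain
# the alphabetically first taxon.
sub split_key {
    my ($side, $all_taxa) = @_;
    my %in_side = map { ($_ => 1) } @$side;
    my @all     = sort @$all_taxa;
    my @other   = grep { !$in_side{$_} } @all;
    my @same    = grep {  $in_side{$_} } @all;
    return join ',', ($in_side{ $all[0] } ? @other : @same);
}

# Support = % of replicate trees that contain the branch.
# Each tree is an array reference of split keys.
sub support {
    my ($key, @trees) = @_;
    my $hits = grep { my $t = $_; scalar grep { $_ eq $key } @$t } @trees;
    return 100 * $hits / @trees;
}

my @taxa = qw(A B C D);
print split_key([qw(A B)], \@taxa), "\n";   # C,D
print split_key([qw(C D)], \@taxa), "\n";   # C,D
```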
Evolutionary linguistics

How to estimate distances?



• T. Jukes and C. Cantor
  – Berkeley, 1969
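
The formula on this slide was an image and did not survive extraction. As a hedged sketch, the standard Jukes-Cantor correction converts the observed fraction of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The function name `jukes_cantor` is an assumption for illustration.

```perl
use strict;
use warnings;

# Jukes-Cantor distance from the observed fraction of differing sites.
# Returns undef when p >= 0.75, where the correction is undefined (saturation).
sub jukes_cantor {
    my ($p) = @_;
    return undef if $p >= 0.75;
    return -0.75 * log(1 - 4 * $p / 3);   # Perl's log() is the natural logarithm
}

printf "%.4f\n", jukes_cantor(0.1);   # 0.1073
```

The corrected distance is slightly larger than p because some sites have changed more than once, and the gap widens as p grows.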



