Solving Phylogenetic Trees

          Benjamin Loyle
          March 16, 2004
          Cse 397 : Intro to MBIO




      Benjamin Loyle 2004 Cse 397
Table of Contents
 Problem & Term Definitions
 A DCM*-NJ Solution
 Performance Measurements
 Possible Improvements




Phylogeny
[Figure: phylogeny relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona.]




 DNA Sequence Evolution
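[Figure: a tree of sequences descending from a single ancestor; the sequences present at each time are listed below]
    -3 mil yrs:   AAGACTT
    -2 mil yrs:   AAGGCCT   TGGACTT
    -1 mil yrs:   AGGGCAT   TAGCCCT   AGCACTT
    today:        AGGGCAT   TAGCCCA   TAGACTT   AGCACAA   AGCGCTT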




Problem Definition
   The Tree of Life
     Connecting all living organisms
     All encompassing

     Find evolution from simple beginnings

 Even smaller relations are tough
 Knowing the true history is impossible
       Instead, infer a possible ancestral history.


So what?
 Genome sequencing provides the entire map
  of a species, so why link species together?
 We can understand evolution
 Viable drug testing and design
 Predict the function of genes
 Influenza evolution



Why is that a problem?
 Over 8 million organisms
 The optimization problems behind current methods are NP-hard
 Computing a few hundred species takes
  years
 Error is a very large factor




What do we want?
   Input
       A collection of nodes such as taxa or protein
        strings to compare in a tree
   Output
       A topological link to compare those nodes to
        each other
   When do we want it?
       FAST!

Preparing the input
 Create a distance matrix
 Collect all of the known pairwise distances into a
  matrix of size n x n
       n is the number of nodes or taxa
   Distances are found by sequence comparison




Distance Matrix
Take 5 separate DNA strings

A : GATCCATGA
B : GATCTATGC
C : GTCCCATTT
D : AATCCGATC
E : TCTCGATAG

The distance between A and B is 2
The distance between A and C is 4
      The distance measure is subjective and depends on your criteria;
      here it is the number of positions at which two strings differ.
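
The two distances quoted above can be reproduced by counting mismatched positions. The sketch below assumes that this position-by-position count (the Hamming distance) is the criterion in use; the function names are illustrative only.

```python
from itertools import combinations

# The five DNA strings from the slide.
SEQUENCES = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}

def hamming(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))

def distance_matrix(seqs):
    """Return {(x, y): distance} for every ordered pair of names."""
    names = sorted(seqs)
    d = {(x, x): 0 for x in names}
    for x, y in combinations(names, 2):
        d[x, y] = d[y, x] = hamming(seqs[x], seqs[y])
    return d

D = distance_matrix(SEQUENCES)
print(D["A", "B"], D["A", "C"])  # 2 4, matching the slide
```

A different criterion (for example one that weights some changes more than others) would give a different matrix from the same strings.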

Distance Matrix
   Let's start with an example matrix

           A      B      C      D      E
     A     0     63     94    111     67
     B            0     79     96     16
     C                   0     47     83
     D                          0    100
     E                                 0
Let's make it simple
(constrain the input)
   Let's keep the distance between nodes
    within a certain limit
     Suppose nodes F and G have the largest distance
      of any pair; they are the most dissimilar nodes.
     This largest distance is called the diameter of the tree

   Let's keep the length of the input (length of
    the strings) polynomial in the number of taxa.
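
As a small illustration, the diameter can be read straight off a distance matrix as its largest entry. The sketch below uses the example matrix from the Distance Matrix slide; the dictionary layout and the name `diameter` are assumptions made for this example.

```python
# Upper triangle of the example matrix from the slides.
EXAMPLE = {
    ("A", "B"): 63, ("A", "C"): 94, ("A", "D"): 111, ("A", "E"): 67,
    ("B", "C"): 79, ("B", "D"): 96, ("B", "E"): 16,
    ("C", "D"): 47, ("C", "E"): 83,
    ("D", "E"): 100,
}

def diameter(dist):
    """Return (largest distance, the pair of nodes that attains it)."""
    pair = max(dist, key=dist.get)
    return dist[pair], pair

print(diameter(EXAMPLE))  # (111, ('A', 'D'))
```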
ERROR?!?!!?
 All trees are inferred, so how do you ever
  know if you're right?
 How accurate do we have to be?
 We can create data sets to test trees that
  we create and assume that it will then
  work in the real world



Data Sets
   JC (Jukes-Cantor) Model
     Sites evolve independently
     Sites change with the same probability

     Changes are single character changes
         • I.e. A -> G or T -> C
       The number of changes on an edge e is a Poisson
        random variable with expectation $\lambda(e)$


More Data Sets
   K2P (Kimura 2-parameter) Model
     Based on the JC Model
     Allows transitions and transversions to occur
      with different probabilities
        • Transitions (A <-> G and C <-> T) are more likely
          than transversions (all other changes)
        • Transitions are normally set to twice as likely
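
One way to sketch a single K2P-style change is shown below. It assumes that "twice as likely" means the ratio of all transitions to all transversions is 2; the parameter `ts_tv_ratio` and the function name are hypothetical, and other parameterizations of the model exist.

```python
import random

# Transition partners: purine <-> purine, pyrimidine <-> pyrimidine.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

def k2p_substitute(base, rng, ts_tv_ratio=2.0):
    """Replace base with a different base, favoring its transition partner."""
    if rng.random() < ts_tv_ratio / (ts_tv_ratio + 1.0):
        return TRANSITION[base]
    # Otherwise pick one of the two transversion targets uniformly.
    transversions = [b for b in "ACGT" if b not in (base, TRANSITION[base])]
    return rng.choice(transversions)

rng = random.Random(0)
changed = [k2p_substitute("A", rng) for _ in range(10)]  # mostly "G"
```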




Data Use
 Using these data sets we can create our
  own evolution of data.
 Start with one “ancestor” and create
  evolutions
 Plug the evolutions back and see if you
  get what you started with



Aspects of Trees
   Topology
      • The method in which nodes are connected to
        each other
      • “Are we really connected to apes directly, or just
        linked long before we could be considered
        mammals?”
   Distance
      • The sum of the weighted edges to reach one
        node from another
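
The distance definition above can be sketched as a walk along the unique path between two nodes of a weighted tree. The small tree `TREE` and its edge weights are made up for illustration.

```python
def tree_distance(edges, start, goal):
    """Sum the edge weights on the unique path from start to goal.

    edges: iterable of (u, v, weight) triples describing a tree.
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adj[node]:
            if nxt != parent:          # never walk back up the same edge
                stack.append((nxt, node, dist + w))
    raise ValueError("goal is not reachable from start")

# Hypothetical tree: leaves A, B hang off u; leaves C, D hang off v.
TREE = [("A", "u", 1), ("B", "u", 2), ("u", "v", 3), ("C", "v", 4), ("D", "v", 5)]
print(tree_distance(TREE, "A", "C"))  # 8
```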


What can distance tell us?
 The distance between nodes IS the
  evolutionary distance between the nodes
 The distance between an ancestor and a
  leaf (present-day object) can be
  interpreted as an estimate of the number
  of evolutionary 'steps' that occurred.



Current Techniques
   Maximum Parsimony
     Minimize the total number of evolutionary
      events
     Find the tree that has the minimum number of
      changes from ancestors
   Maximum Likelihood
     Probability based
     Which tree is most probable to occur based
      on current data

More Techniques
   Neighbor Joining
     Repeatedly joins pairs of leaves (or subtrees)
      by rules of numerical optimization
     It shrinks the distance matrix by considering
      two 'neighbors' as one node




Learning Neighbor Joining
   It will become apparent later on why, but let's
    learn how to do Neighbor Joining (NJ)

           A      B      C      D      E
     A     0      3      3      4      3
     B            0      3      3      4
     C                   0      3      3
     D                          0      3
     E                                 0
NJ Part 1
   First start with a “star tree”
   [Figure: star tree with the leaves A, B, C, D, and E all attached to one central node]



NJ Part 2
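   Combine the closest two nodes (from the
    distance matrix)
        • In our case it is nodes A and B, at distance 3
        • Full NJ picks the pair with a criterion that also accounts
          for each node's total distance to all other nodes, not the
          raw distance alone
   [Figure: the star tree on A, B, C, D, E with A and B joined as neighbors]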
NJ Part 3
   Repeat this until you have added n-2
    internal nodes (3 in our case)
       • n-2 internal nodes make it a binary tree, so we only
         have to include one more node.
   [Figure: the tree on A, B, C, D, E after another join]
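
The three parts above can be put together as a short program. This is a sketch of the standard textbook form of Neighbor Joining, not code from the talk: it selects each pair by the usual rate-corrected criterion rather than by raw distance. The input layout (a dictionary keyed by `frozenset` pairs) and the internal node names `u0`, `u1`, ... are assumptions, and leaf names must not collide with them.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Build an unrooted tree; return its edges as (node, node, length).

    names: leaf labels (at least three).
    dist:  {frozenset({x, y}): distance} for every pair of leaves.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    new_id = 0
    while len(nodes) > 3:
        n = len(nodes)
        # r[x]: total distance from x to all other active nodes.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}
        # Join the pair minimizing Q(a, b) = (n - 2) d(a, b) - r[a] - r[b].
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ab = d[frozenset((a, b))]
        u = f"u{new_id}"
        new_id += 1
        len_a = d_ab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        edges += [(a, u, len_a), (b, u, d_ab - len_a)]
        # Distances from the new internal node to every remaining node.
        for x in nodes:
            if x not in (a, b):
                d[frozenset((u, x))] = (
                    d[frozenset((a, x))] + d[frozenset((b, x))] - d_ab) / 2
        nodes = [x for x in nodes if x not in (a, b)] + [u]
    # Three nodes remain: attach them to one final internal node.
    a, b, c = nodes
    d_ab, d_ac, d_bc = (d[frozenset(p)] for p in ((a, b), (a, c), (b, c)))
    u = f"u{new_id}"
    edges += [(a, u, (d_ab + d_ac - d_bc) / 2),
              (b, u, (d_ab + d_bc - d_ac) / 2),
              (c, u, (d_ac + d_bc - d_ab) / 2)]
    return edges
```

For n leaves the function creates n-2 internal nodes and 2n-3 edges, which matches the count used on the slide.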
Are we done?
 ML and MP, even in heuristic form, take
  too long for large data sets
 NJ has poor topological accuracy,
  especially for large diameter trees
 We need something that works for large
  diameter trees and can be run fast.



Here's what we want
   Our Goal
       An “Absolute Fast Converging” Method
        • A method $\Phi$ is afc for the model $M$ if, for all positive
          $f, g, \varepsilon$, there is a polynomial $p$ such that, for all
          $(T, \{\lambda(e)\}) \in M_{f,g}$, on a set $S$ of $n$ sequences of
          length at least $p(n)$ generated on $T$, we have
          $\Pr[\Phi(S) = T] > 1 - \varepsilon$.
        • Simply: recover the true tree with high probability from
          sequences whose length is only polynomial in n.


    A DCM* - NJ Solution
 Two-phase construction of a final phylogenetic
  tree given a distance matrix d.
 Phase 1 : Create a set of plausible trees for
  the distance matrix
 Phase 2 : Find the best fitting tree




Phase 1

   For each q in {dij}, compute a tree tq

   Let T = { tq : q in {dij} }




Finding tq
 Step 1: Compute Thresh(d,q)
 Step 2: Triangulate Thresh(d,q)
 Step 3: Compute an NJ tree for each
  maximal clique
 Step 4: Merge the subtrees into a
  supertree


What does that mean?
   Breaking the problem up
       Use a distance threshold to break the
        problem into
           • A bunch of smaller diameter trees (cliques)
     Apply NJ to those cliques
     Merge them back




Finding tq (terms)
   Threshold Graph
       Thresh(d,q) is the threshold graph where (i,j)
        is an edge if and only if dij <= q.
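
The definition translates almost directly into code. The sketch below uses the example matrix from the Distance Matrix slide and q = 67 (the distance between A and E); the dictionary layout and function name are assumptions for this example.

```python
# Upper triangle of the example matrix from the slides.
EXAMPLE = {
    ("A", "B"): 63, ("A", "C"): 94, ("A", "D"): 111, ("A", "E"): 67,
    ("B", "C"): 79, ("B", "D"): 96, ("B", "E"): 16,
    ("C", "D"): 47, ("C", "E"): 83,
    ("D", "E"): 100,
}

def threshold_graph(dist, q):
    """Return the set of edges (i, j) whose distance is at most q."""
    return {pair for pair, d in dist.items() if d <= q}

print(sorted(threshold_graph(EXAMPLE, 67)))
# [('A', 'B'), ('A', 'E'), ('B', 'E'), ('C', 'D')]
```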




Threshold
   Let's bring back our distance matrix and
    create a threshold graph with q equal to d15,
    the distance between A and E
       So q = 67




Distance Matrix
   Our old example matrix

           A      B      C      D      E
     A     0     63     94    111     67
     B            0     79     96     16
     C                   0     47     83
     D                          0    100
     E                                 0
With q = d15 = 67
   [Figure: threshold graph on A, B, C, D, E with edges A-B (63), A-E (67), B-E (16), and C-D (47)]


Triangulating
   A graph is triangulated if any cycle with
    four or more vertices has a chord
       That is, an edge joining two nonconsecutive
        vertices of the cycle.
   Our example is already triangulated, but
    let's look at another
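
Whether a graph is triangulated can be checked by repeatedly deleting a vertex whose remaining neighbors are all connected to each other; the graph is triangulated exactly when this empties it. The sketch below is a simple, unoptimized version of that test, not the triangulation heuristic used by DCM; the name `is_triangulated` is an assumption.

```python
from itertools import combinations

def is_triangulated(vertices, edges):
    """Return True if every cycle of four or more vertices has a chord."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining = set(vertices)
    while remaining:
        for v in sorted(remaining):
            nbrs = adj[v] & remaining
            # v can be deleted if its remaining neighbors form a clique.
            if all(y in adj[x] for x, y in combinations(sorted(nbrs), 2)):
                remaining.remove(v)
                break
        else:
            return False   # no deletable vertex: some long cycle has no chord
    return True

# The q = 67 threshold graph from the example matrix.
print(is_triangulated("ABCDE", [("A", "B"), ("A", "E"), ("B", "E"), ("C", "D")]))  # True
# A four-cycle with no chord.
print(is_triangulated("WXYZ", [("W", "X"), ("X", "Z"), ("Z", "Y"), ("Y", "W")]))   # False
```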



  Triangulating
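  Let's say this is for q = 5
   [Figure: four vertices W, X, Y, Z joined in a cycle by four edges of length 5, with two diagonals of length 10 and 15]
  The edges of length 10 and 15 would not be in the graph.
  To triangulate this graph you add the edge of length 10.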

Maximal Cliques
 A clique that cannot be enlarged by the
  addition of another vertex.
 Recall our original threshold graph which
  is triangulated:




Triangulated Threshold Graph
   Our old graph
   [Figure: threshold graph with edges A-B (63), A-E (67), B-E (16), and C-D (47)]

Clique
Our maximal cliques would be:
{A, B, E}
{C, D}
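
These two cliques can be recovered mechanically. The sketch below uses the general-purpose Bron-Kerbosch recursion, which works on any graph but can be slow in the worst case; it is a stand-in for, not a copy of, the polynomial-time procedure available for triangulated graphs. The adjacency dictionary `ADJ` encodes the q = 67 threshold graph.

```python
def maximal_cliques(adj):
    """Return all maximal cliques of a graph given as {vertex: set_of_neighbors}."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))   # cannot be enlarged
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques

# Threshold graph for q = 67: edges A-B, A-E, B-E, C-D.
ADJ = {"A": {"B", "E"}, "B": {"A", "E"}, "E": {"A", "B"}, "C": {"D"}, "D": {"C"}}
print(sorted(sorted(c) for c in maximal_cliques(ADJ)))
# [['A', 'B', 'E'], ['C', 'D']]
```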




Create Trees for the Cliques
   We have two maximal cliques, so we
    make two trees; {A, B, E} and {C, D}
     How do we make these trees?
     Remember NJ?




Tree {A, B, E} and {C,D}

[Figure: one tree joining A, B, and E, and a separate tree joining C and D]




Merge your separate trees together
 Create one supertree
 This is done by finding a minimum set of
  edges in the trees and calling that the
  “backbone”
 This is its own doctoral thesis, so let's do
  a little hand waving



That sounds like NP-hard!
   Computing the threshold graph is polynomial
   Minimally triangulating is NP-hard, but a triangulation can be
    obtained in polynomial time using a greedy
    heuristic without too much loss in performance.
   Finding maximal cliques is polynomial only if the
    input graph is triangulated (which it is!).
   If all previous steps are done, creating a supertree
    can be done in polynomial time as well.


Where are we now?
   We now have a finalized phylogeny created
    from smaller trees in our matrix joined together
   Remember we did this for every possible threshold q,
    so we have a whole set of candidate trees.




Phase 2
   Which one is right?
     Found using the SQS (Short Quartet
      Support) method
     Let T be one of the candidate trees made in Phase 1

     Break the data into sets of four taxa
        • {A, B, C, D} {A, C, D, E} {A, B, D, E}… etc
        • Reduce the larger tree to only hold “one set”
        • These are called quartets


SQS - A Guide

  Q(T) is the set of trees induced by T on each
   set of four leaves.
  Let Qw (a different Q) be the set of quartet trees with
   diameter less than or equal to w
  Find the maximum w for which every quartet tree in Qw
   is also in Q(T)
  This w is the “support” of that tree



SQS - Rephrased
 Qw is the set of quartet trees which have a
  diameter <= w
 Support of T is the max w where Qw is a
  subset of Q(T)
     Support is our “quality measure”
     What exactly are we measuring?
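
A rough sketch of the support computation follows. It rests on several assumptions that the slides do not spell out: the quartet trees in Qw are inferred from the input distance matrix by picking the pairing with the smallest sum of distances, the diameter of a quartet is its largest input distance, and the candidate tree T is supplied as its leaf-to-leaf path distances. All names are hypothetical.

```python
from itertools import combinations

def quartet_split(dist, quartet):
    """Return the pairing {a,b}|{c,d} with the smallest d(a,b) + d(c,d)."""
    a, b, c, d = quartet
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    best = min(pairings,
               key=lambda p: dist[frozenset(p[0])] + dist[frozenset(p[1])])
    return frozenset(frozenset(side) for side in best)

def support(taxa, d_input, d_tree):
    """Largest w such that every quartet of diameter <= w agrees with the tree.

    d_input, d_tree: {frozenset({x, y}): distance} for the input matrix
    and for path lengths in the candidate tree, respectively.
    """
    diameters, bad = [], []
    for q in combinations(sorted(taxa), 4):
        diam = max(d_input[frozenset(p)] for p in combinations(q, 2))
        diameters.append(diam)
        if quartet_split(d_input, q) != quartet_split(d_tree, q):
            bad.append(diam)               # this quartet contradicts the tree
    limit = min(bad) if bad else float("inf")
    good = [w for w in diameters if w < limit]
    return max(good) if good else 0
```

Under these assumptions a tree that agrees with all of the short quartets gets a high support even if it disagrees with some long, less reliable ones.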




Qw =
   [Figure: example quartet trees, on leaves {A, B, C, D} and {A, B, D, E}, drawn alongside trees on all five leaves A, B, C, D, E]
SQS Method
   Return the tree whose support is
    the maximum.
     If more than one such tree exists return the
      tree found first.
     This is the tree with the smallest original
      diameter (remember from phase 1)




How do we know we're right?
   Compare it to the data set we created
   Look at Robinson-Foulds accuracy
       Remove one edge in the tree we've created.
         • We now have two trees
       Is there any way to create the same two sets of leaves by
        removing one edge in the true tree from our data set?
         • If no, add a 'point' of error.
       Repeat this for all edges
       When the value is not zero, the trees are not
        identical
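
The edge-by-edge comparison described above can be sketched as follows. Each edge is summarized by the two sets of leaves it separates, and an inferred edge scores a point of error when the true tree has no edge giving the same separation. This follows the one-directional count on the slide; the usual Robinson-Foulds distance also counts the edges of the true tree that are missing from the inferred one. Function names and the example trees are illustrative.

```python
def splits(edges):
    """Return the set of leaf bipartitions induced by removing each edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = frozenset(n for n, nbrs in adj.items() if len(nbrs) == 1)
    result = set()
    for u, v in edges:
        # Collect everything reachable from u without crossing edge (u, v).
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and not (x == u and y == v):
                    seen.add(y)
                    stack.append(y)
        side = frozenset(seen) & leaves
        result.add(frozenset((side, leaves - side)))
    return result

def rf_errors(inferred, true):
    """Count edges of the inferred tree whose split is absent from the true tree."""
    return len(splits(inferred) - splits(true))

# Two hypothetical four-leaf trees: ((A,B),(C,D)) and ((A,C),(B,D)).
T1 = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
T2 = [("A", "u"), ("C", "u"), ("u", "v"), ("B", "v"), ("D", "v")]
print(rf_errors(T1, T1), rf_errors(T1, T2))  # 0 1
```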


     Performance of DCM*-NJ
   Outperforms the NJ method at sequence
    lengths above 4000 and with more taxa.
   [Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ]
Improvements
 Possible improvements lie in Phase 2
 Include test of Maximum Parsimony (MP)
       Try and minimize the overall size of the tree
   Test using statistical evidence
       Maximum Likelihood (ML)




Performance gains
 Simply changing Phase 2 has massive
  gains in accuracy!
 DCM-NJ+MP and DCM-NJ+ML are
  VERY accurate for sequence lengths greater than
  4000 and are NOT NP-hard.
 DCM - NJ + MP finished its analysis on a
  107 taxon tree in under three minutes.

             Comparing Improvements
[Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for DCM-NJ+SQS, NJ, DCM-NJ+MP, and HGT-FP]