# Solving Phylogenetic Trees


```							Solving Phylogenetic Trees
Benjamin Loyle
March 16, 2004
Cse 397 : Intro to MBIO

Benjamin Loyle 2004 Cse 397
• Problem & Term Definitions
• A DCM*-NJ Solution
• Performance Measurements
• Possible Improvements

Phylogeny
From the Tree of Life website, University of Arizona
[Figure: phylogeny with leaves Orangutan, Gorilla, Chimpanzee, Human]

DNA Sequence Evolution
-3 mil yrs:  AAGACTT
-2 mil yrs:  AAGGCCT   TGGACTT
-1 mil yrs:  AGGGCAT   TAGCCCT   AGCACTT
today:       AGGGCAT   TAGCCCA   TAGACTT   AGCACAA   AGCGCTT

Problem Definition
• The Tree of Life
  • Connecting all living organisms
  • All encompassing
  • Find evolution from simple beginnings
  • Even smaller relations are tough
  • Impossible
• Infer possible ancestral history.

So what...
• Genome sequencing provides an entire map of a species, so why link them?
• We can understand evolution
• Viable drug testing and design
• Predict the function of genes
• Influenza evolution

Why is that a problem?
• Over 8 million organisms
• The optimization problems behind current methods are NP-hard
• Computing a few hundred species takes years
• Error is a very large factor

What do we want?
• Input
  • A collection of nodes, such as taxa or protein strings, to compare in a tree
• Output
  • A topological link to compare those nodes to each other
• When do we want it?
  • FAST!

Preparing the input
• Create a distance matrix
• Sum up all of the known distances into a matrix sized n x n
  • n is the number of nodes or taxa
  • Found with sequence comparison

Distance Matrix
Take 5 separate DNA strings

A : GATCCATGA
B : GATCTATGC
C : GTCCCATTT
D : AATCCGATC
E : TCTCGATAG

The distance between A and B is 2
The distance between A and C is 4
Here the distance is the number of positions at which two strings differ; the measure is subjective and depends on your criteria.

Distance Matrix
     A     B     C     D     E
A    0    63    94   111    67
B          0    79    96    16
C                0    47    83
D                      0   100
E                            0

Let's make it simple (constrain the input)
• Let's keep the distance between nodes within a certain limit
  • From F -> G
  • F and G have the largest distance; they are the most dissimilar of any nodes.
  • This is called the diameter of the tree
• Let's keep the length of the input (length of the strings) polynomial in the number of taxa.

ERROR?!?!!?
• All trees are inferred, so how do you ever know if you're right?
• How accurate do we have to be?
• We can create data sets to test the trees that we create, and assume that a method that works on them will then work in the real world

Data Sets
• JC (Jukes-Cantor) Model
  • Sites evolve independently
  • Sites change with the same probability
  • Changes are single character changes
    • I.e., A -> G or T -> C
• The number of changes on an edge e is a Poisson random variable with expectation λ(e)

More Data Sets
• K2P (Kimura 2-parameter) Model
  • Based on JC Model
  • Allows for a ratio between the probabilities of transitions and transversions
    • Transitions (A <-> G and C <-> T) are more likely than transversions (all other substitutions)
    • Normally set to twice as likely

Data Use
• Using these models we can simulate our own evolutions of data.
• Plug the simulated sequences back into the method and see if you get the tree you started with

Aspects of Trees
• Topology
  • The way in which nodes are connected to each other
  • "Are we really connected to apes directly, or just linked long before we could be considered mammals?"
• Distance
  • The sum of the weighted edges to reach one node from another

What can distance tell us?
• The distance between nodes IS the evolutionary distance between the nodes
• The distance between an ancestor and a leaf (present-day object) can be interpreted as an estimate of the number of evolutionary 'steps' that occurred.

Current Techniques
• Maximum Parsimony
  • Minimize the total number of evolutionary events
  • Find the tree that has the minimum number of changes from ancestors
• Maximum Likelihood
  • Probability based
  • Which tree makes the current data most probable

More Techniques
• Neighbor Joining
  • Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization
  • It shrinks the distance matrix by considering two 'neighbors' as one node

Learning Neighbor Joining
• It will become apparent later on, but let's learn how to do Neighbor Joining (NJ)
     A    B    C    D    E
A    0    3    3    4    3
B         0    3    3    4
C              0    3    3
D                   0    3
E                        0

NJ Part 1
[Figure: the five leaves A, B, C, D, E]

NJ Part 2
• Combine the closest two nodes (from the distance matrix)
  • In our case it is nodes A and B at distance 3
[Figure: the leaves A, B, C, D, E]

NJ Part 3
• Repeat this until you have added n-2 nodes (3)
  • n-2 internal nodes will make it a binary tree, so we only have to include one more node.
[Figure: the leaves A, B, C, D, E]

Are we done?
• ML and MP, even in heuristic form, take too long for large data sets
• NJ has poor topological accuracy, especially for large diameter trees
• We need something that works for large diameter trees and can be run fast.

Here's what we want
• Our Goal
  • An "Absolute Fast Converging" (afc) Method
    • A method Φ is afc for the model M if, for all positive f, g, ε, there is a polynomial p such that, for all (T, {λ(e)}) in the set M_{f,g}, on a set S of n sequences of length at least p(n) generated on T, we have Pr[Φ(S) = T] > 1 - ε.
    • Simply: let's recover the true tree, within a degree of error, from sequences of polynomial length.

A DCM*-NJ Solution
• 2-phase construction of a final phylogenetic tree given a distance matrix d.
  • Phase 1: Create a set of plausible trees for the distance matrix
  • Phase 2: Find the best fitting tree

Phase 1
• For each q in {d_ij}, compute a tree t_q
• Let T = { t_q : q in {d_ij} }

Finding t_q
• Step 1: Compute Thresh(d, q)
• Step 2: Triangulate Thresh(d, q)
• Step 3: Compute an NJ tree for each maximal clique
• Step 4: Merge the subtrees into a supertree

What does that mean
• Breaking the problem up
  • Create a threshold of diameters to break the problem into
    • A bunch of smaller diameter trees (cliques)
  • Apply NJ to those cliques
  • Merge them back

Finding t_q (terms)
• Threshold Graph
  • Thresh(d, q) is the threshold graph where (i, j) is an edge if and only if d_ij <= q.

Threshold
• Let's bring back our distance matrix and create a threshold with q equal to d_15, the distance between A and E
• So q = 67

Distance Matrix
• Our old example matrix
     A     B     C     D     E
A    0    63    94   111    67
B          0    79    96    16
C                0    47    83
D                      0   100
E                            0

With q = d_15 = 67
[Figure: threshold graph on A, B, C, D, E with the edges below]
  A - B : 63
  A - E : 67
  B - E : 16
  C - D : 47

Triangulating
• A graph is triangulated if any cycle with four or more vertices has a chord
  • That is, an edge joining two nonconsecutive vertices of the cycle.
• Our example is already triangulated, but let's look at another

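Triangulating
Let's say this is for q = 5
[Figure: vertices W, X, Y, Z joined in a cycle by four edges of length 5; the two remaining pairs are at distances 10 and 15]
• The edges of length 10 and 15 would not be in the graph
• To triangulate this graph, add the edge of length 10.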

Maximal Cliques
• A clique that cannot be enlarged by the addition of another vertex
• Recall our original threshold graph, which is triangulated:

Triangulated Threshold Graph
• Our old graph
  A - B : 63
  A - E : 67
  B - E : 16
  C - D : 47

Clique
Our maximal cliques would be:
{A, B, E}
{C, D}

Create Trees for the Cliques
• We have two maximal cliques, so we make two trees: {A, B, E} and {C, D}
• How do we make these trees?
• Remember NJ?

Tree {A, B, E} and {C, D}
[Figure: one tree on leaves A, B, E and one tree on leaves C, D]

Merge the subtrees together.
• Create one Supertree
• This is done by creating a minimum set of edges in the trees and calling that the "backbone"
• This is its own doctoral thesis, so let's do a little hand waving

That sounds like NP-hard!
• Computing the threshold graph is polynomial
• Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance.
• Finding maximal cliques is only polynomial if the input graph is triangulated (which it is!).
• If all previous steps are done, creating a supertree can be done in polynomial time as well.

Where are we now?
• We now have a finalized phylogeny for each threshold q, created from smaller trees in our matrix joined together
• Remember we started from all possible sizes of smaller trees.

Phase 2
• Which one is right?
  • Found using the SQS (Short Quartet Support) method
  • Let T be a tree in S (the set of trees made in Phase 1)
  • Break the data into sets of four taxa
    • {A, B, C, D} {A, C, D, E} {A, B, D, E}... etc.
    • Reduce the larger tree to only hold "one set"
    • These are called Quartets

SQS - A Guide
• Q(T) is the set of trees induced by T on each set of four leaves.
• Let Q_w (a different Q) be the set of quartets with diameter less than or equal to w
• Find the maximum w where the quartets are inclusive of the nodes of the tree
• This w is the "support" of that tree

SQS - Rephrased
• Q_w is the set of quartet trees which have a diameter <= w
• Support of T is the max w where Q_w is a subset of Q(T)
• Support is our "quality measure"
• What exactly are we measuring?

Q_w =
[Figure: quartet trees on {A, B, C, D} and {A, B, D, E}, drawn with trees on the leaves A, B, C, D, E]

SQS Method
• Return the tree whose support is the maximum.
  • If more than one such tree exists, return the tree found first.
  • This is the tree with the smallest original diameter (remember from Phase 1)

How do we know we're right?
• Compare it to the data set we created
• Look at Robinson-Foulds accuracy
  • Remove one edge in the tree we've created.
    • We now have two trees
  • Is there any way to create the same two sets of leaves by removing one edge in the tree from our data set?
    • If no, add a 'point' of error.
  • Repeat this for all edges
  • When the value is not zero, the trees are not identical

Performance of DCM*-NJ
• Outperforms the NJ method at sequence lengths above 4000 and with more taxa.
[Figure: Error Rate (0 to 0.8) vs. No. Taxa (0 to 1600) for NJ and DCM-NJ]

Improvements
• Improvement possibilities lie in Phase 2
• Include test of Maximum Parsimony (MP)
  • Try and minimize the overall size of the tree
• Test using statistical evidence
  • Maximum Likelihood (ML)

Performance gains
• Simply changing Phase 2 has massive gains in accuracy!
• DCM-NJ+MP and DCM-NJ+ML are VERY accurate for data sets greater than 4000 and are NOT NP-hard.
• DCM-NJ+MP finished its analysis on a 107 taxon tree in under three minutes.

Comparing Improvements
[Figure: Error Rate (0 to 0.8) vs. # leaves (0 to 1600) for NJ, DCM-NJ+SQS, DCM-NJ+MP, and HGT-FP]

```
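The worked example in the slides (string distances, the threshold graph for q = 67, and its maximal cliques) can be checked with a short Python sketch. This is only an illustration under stated assumptions, not the DCM*-NJ implementation: the function names `hamming`, `threshold_graph`, and `maximal_cliques` are made up here, distance is taken to be the count of differing positions, the triangulation step is skipped because the slides note that the example graph is already triangulated, and cliques are enumerated with the basic Bron-Kerbosch recursion rather than any method named in the slides.

```python
# Sequences from the "Distance Matrix" slide.
SEQS = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}

# Upper triangle of the example distance matrix from the slides.
DIST = {
    ("A", "B"): 63, ("A", "C"): 94, ("A", "D"): 111, ("A", "E"): 67,
    ("B", "C"): 79, ("B", "D"): 96, ("B", "E"): 16,
    ("C", "D"): 47, ("C", "E"): 83,
    ("D", "E"): 100,
}


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


def threshold_graph(dist, q):
    """Adjacency sets of Thresh(d, q): (i, j) is an edge iff d_ij <= q."""
    nodes = sorted({v for pair in dist for v in pair})
    adj = {v: set() for v in nodes}
    for (i, j), d in dist.items():
        if d <= q:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored.
        if not p and not x:
            found.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(found)


if __name__ == "__main__":
    print(hamming(SEQS["A"], SEQS["B"]))   # distance between A and B
    print(hamming(SEQS["A"], SEQS["C"]))   # distance between A and C
    graph = threshold_graph(DIST, 67)      # q = d_15 = 67
    print(maximal_cliques(graph))
```

Each clique returned for q = 67 would then be handed to NJ to build one subtree; for a graph that is not already triangulated, a triangulation step would have to run before the clique search.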