

Solving Phylogenetic Trees
Benjamin Loyle
March 16, 2004
Cse 397 : Intro to MBIO
Benjamin Loyle 2004 Cse 397
Table of Contents
Problem & Term Definitions
A DCM*-NJ Solution
Performance Measurements
Possible Improvements
Phylogeny
[Figure: phylogeny of Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona.]
DNA Sequence Evolution
-3 mil yrs: AAGACTT
-2 mil yrs: AAGGCCT, TGGACTT
-1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT
today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
Problem Definition
The Tree of Life
Connecting all living organisms
All encompassing
Find evolution from simple beginnings
Even smaller relations are tough
Impossible
Infer possible ancestral history.
So what….
Genome sequencing provides the entire map
of a species, so why link species together?
We can understand evolution
Viable drug testing and design
Predict the function of genes
Influenza evolution
Why is that a problem?
Over 8 million organisms
Current approaches must solve NP-hard problems
Computing a few hundred species takes
years
Error is a very large factor
What do we want?
Input
A collection of nodes such as taxa or protein
strings to compare in a tree
Output
A topological link to compare those nodes to
each other
When do we want it?
FAST!
Preparing the input
Create a distance matrix
Collect all of the known pairwise distances into
an n x n matrix
n is the number of nodes or taxa
Found with sequence comparison
Distance Matrix
Take 5 separate DNA strings
A : GATCCATGA
B : GATCTATGC
C : GTCCCATTT
D : AATCCGATC
E : TCTCGATAG
The distance between A and B is 2
The distance between A and C is 4
This is subjective based on what your criteria are.
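As a minimal sketch of how the distances on this slide could be computed, the snippet below counts mismatched positions (Hamming distance) between the five example strings. Using plain mismatch counts is an assumption; as the slide says, the criterion is up to you. The helper names `hamming` and `distance_matrix` are illustrative only.

```python
from itertools import combinations

# The five example strings from the slide.
seqs = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))

def distance_matrix(seqs):
    """Symmetric dict-of-dicts matrix of pairwise Hamming distances."""
    d = {x: {y: 0 for y in seqs} for x in seqs}
    for x, y in combinations(seqs, 2):
        d[x][y] = d[y][x] = hamming(seqs[x], seqs[y])
    return d

d = distance_matrix(seqs)
print(d["A"]["B"], d["A"]["C"])  # prints: 2 4
```

The two printed values match the A-B and A-C distances quoted on the slide.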
Distance Matrix
Let’s start with an example matrix
     A    B    C    D    E
A    0   63   94  111   67
B         0   79   96   16
C              0   47   83
D                   0  100
E                        0
Let’s make it simple
(constrain the input)
Let’s keep the distance between nodes
within a certain limit
From F -> G
F and G have the largest distance; they are
the most dissimilar of any nodes.
This is called the diameter of the tree
Let’s keep the length of the input (length of
the strings) polynomial in the number of taxa.
ERROR?!?!!?
All trees are inferred, so how do you ever
know if you’re right?
How accurate do we have to be?
We can create data sets with a known tree, test
our methods on them, and assume that they will
then work in the real world
Data Sets
JC Model
Sites evolve independently
Sites change with the same probability
Changes are single character changes
• e.g. A -> G or T -> C
The number of changes on an edge e is a Poisson
variable with expectation λ(e)
More Data Sets
K2P Model
Based on JC Model
Allows different probabilities for transitions and
transversions
• Transitions (A <-> G and C <-> T) are more likely
than transversions (any other switch)
• Normally set to twice as likely
Data Use
Using these models we can simulate our
own evolution of data.
Start with one “ancestor” and create
evolutions (descendants)
Plug the descendants back in and see if you
get the tree you started with
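A rough sketch of generating such a test data set is shown below. It is a simplification, not the exact model from the slides: each site changes with a fixed per-edge probability `p_change` (a hypothetical parameter) instead of a Poisson number of changes, and a changed site moves to one of the three other bases with equal probability, in the spirit of the JC model. The names `evolve_jc` and `simulate` are illustrative.

```python
import random

BASES = "ACGT"

def evolve_jc(seq, p_change, rng):
    """Copy seq; each site independently changes with probability p_change
    to one of the three other bases, chosen uniformly."""
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(ancestor, generations, p_change, seed=0):
    """Grow a binary tree of sequences: every sequence in one generation
    leaves two mutated children in the next. Returns all generations."""
    rng = random.Random(seed)
    levels = [[ancestor]]
    for _ in range(generations):
        levels.append([evolve_jc(s, p_change, rng)
                       for s in levels[-1] for _ in range(2)])
    return levels

levels = simulate("AAGACTT", generations=3, p_change=0.2)
leaves = levels[-1]  # the "present day" sequences; the true tree is known
print(leaves)
```

Because the tree that produced `leaves` is known, any reconstruction method can be scored against it.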
Aspects of Trees
Topology
• The method in which nodes are connected to
each other
• “Are we really connected to apes directly, or just
linked long before we could be considered
mammals?”
Distance
• The sum of the weighted edges to reach one
node from another
What can distance tell us?
The distance between nodes IS the
evolutionary distance between the nodes
The distance between an ancestor and a
leaf (present-day object) can be
interpreted as an estimate of the number
of evolutionary ‘steps’ that occurred.
Current Techniques
Maximum Parsimony
Minimize the total number of evolutionary
events
Find the tree that has a minimum amount of
changes from ancestors
Maximum Likelihood
Probability based
Which tree makes the current data
most probable
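To make "minimum amount of changes" concrete for Maximum Parsimony, here is a small sketch that counts the changes a fixed tree needs, using Fitch's algorithm (a standard counting method, swapped in here for illustration; the slides do not name one). The nested-tuple tree encoding and the four toy sequences are assumptions made for the example.

```python
def fitch_site(tree, states):
    """Return (candidate state set, change count) for one site.
    `tree` is a leaf name or a pair (left_subtree, right_subtree)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on an edge below this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony(tree, seqs):
    """Total number of changes over all sites for a fixed tree."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )

# Hypothetical toy data: A and B look alike, C and D look alike.
toy = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
print(parsimony((("A", "B"), ("C", "D")), toy))  # prints: 2
print(parsimony((("A", "C"), ("B", "D")), toy))  # prints: 4
```

The tree that groups similar sequences needs fewer changes, so parsimony prefers it. Scoring one tree is cheap; the expensive part is searching over all possible trees.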
More Techniques
Neighbor Joining
Repeatedly joins pairs of leaves (or subtrees)
by rules of numerical optimization
It shrinks the distance matrix by considering
two ‘neighbors’ as one node
Learning Neighbor Joining
It will become apparent why later on, but let’s
learn how to do Neighbor Joining (NJ)
     A    B    C    D    E
A    0    3    3    4    3
B         0    3    3    4
C              0    3    3
D                   0    3
E                        0
NJ Part 1
First start with a “star tree”
[Figure: star tree with leaves A, B, C, D, E]
NJ Part 2
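Combine the two nodes picked by the NJ criterion
(computed from the distance matrix): a pair that is
close together and far from everything else
• In our case it is node A and B at distance 3
[Figure: the star tree after joining A and B]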
NJ Part 3
Repeat this until you have added n-2
internal nodes (3)
• n-2 internal nodes make it a binary tree, so we only have to
include one more node.
[Figure: the resulting tree on leaves A, B, C, D, E]
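One joining step on the example matrix can be sketched as follows. The selection rule used here is the standard NJ criterion Q(i,j) = (n-2)·d(i,j) − r(i) − r(j), where r(i) is the row sum for node i; the slides do not spell this formula out, so treat it as an assumption about what "numerical optimization" means. The helper names are illustrative, and branch-length estimation is left out.

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E"]
upper = {  # upper triangle of the example NJ matrix
    ("A", "B"): 3, ("A", "C"): 3, ("A", "D"): 4, ("A", "E"): 3,
    ("B", "C"): 3, ("B", "D"): 3, ("B", "E"): 4,
    ("C", "D"): 3, ("C", "E"): 3,
    ("D", "E"): 3,
}

def dist(d, x, y):
    """Symmetric lookup into an upper-triangle dictionary."""
    if x == y:
        return 0
    return d[(x, y)] if (x, y) in d else d[(y, x)]

def nj_pick(labels, d):
    """Return the pair minimizing Q(i,j) = (n-2)*d(i,j) - r(i) - r(j)."""
    n = len(labels)
    r = {x: sum(dist(d, x, y) for y in labels) for x in labels}
    q = {(x, y): (n - 2) * dist(d, x, y) - r[x] - r[y]
         for x, y in combinations(labels, 2)}
    best = min(q, key=q.get)  # on ties, the first pair found wins
    return best, q

def nj_join(labels, d, pair, new):
    """Treat `pair` as one node `new` and return the shrunken matrix."""
    i, j = pair
    rest = [x for x in labels if x not in pair]
    out = {(x, y): dist(d, x, y) for x, y in combinations(rest, 2)}
    for k in rest:
        out[(new, k)] = (dist(d, i, k) + dist(d, j, k) - dist(d, i, j)) / 2
    return [new] + rest, out

pair, q = nj_pick(labels, upper)
labels2, d2 = nj_join(labels, upper, pair, "AB")
print(pair, labels2)  # prints: ('A', 'B') ['AB', 'C', 'D', 'E']
```

On this matrix several pairs tie on Q, and A and B are simply the first such pair, which agrees with the slide. Repeating `nj_pick` and `nj_join` on the shrunken matrix continues the construction.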
Are we done?
ML and MP, even in heuristic form, take
too long for large data sets
NJ has poor topological accuracy,
especially for large diameter trees
We need something that works for large
diameter trees and can be run fast.
Here’s what we want
Our Goal
An “Absolute Fast Converging” Method
• A method $\Phi$ is afc for the model $M$ if, for all positive $f, g, \varepsilon$,
there is a polynomial $p$ such that, for all $(T, \{\lambda(e)\})$
in the set $M_{f,g}$, on a set $S$ of $n$ sequences of
length at least $p(n)$ generated on $T$, we have
$\Pr[\Phi(S) = T] > 1 - \varepsilon$.
• Simply: Let’s get it right with polynomially long sequences (and in
polynomial time), within a degree of error.
A DCM* - NJ Solution
2 Phase construction of a final phylogenetic
tree given a distance matrix d.
Phase 1 : Create a set of plausible trees for
the distance matrix
Phase 2 : Find the best fitting tree
Phase 1
For each q in {dij}, compute a tree tq
Let T = { tq : q in {dij} }
Finding tq
Step 1: Compute Thresh(d,q)
Step 2: Triangulate Thresh(d,q)
Step 3: Compute a NJ Tree for all
maximal cliques
Step 4: Merge the subtrees into a
supertree
What does that mean
Breaking the problem up
Use a distance threshold to break the
problem into
• A bunch of smaller diameter subproblems (cliques)
Apply NJ to those cliques
Merge them back
Finding tq (terms)
Threshold Graph
Thresh(d,q) is the threshold graph where (i,j)
is an edge if and only if dij <= q.
Threshold
Let’s bring back our distance matrix and
create a threshold graph with q equal to d15, or
the distance between A and E
So q = 67
Distance Matrix
Our old example matrix
     A    B    C    D    E
A    0   63   94  111   67
B         0   79   96   16
C              0   47   83
D                   0  100
E                        0
With q = d15 = 67
[Figure: threshold graph with edges A-B (63), A-E (67), B-E (16), and C-D (47)]
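The threshold graph for this example can be reproduced with a few lines. This is only a sketch of the definition (keep an edge when its distance is at most q); the function name `thresh` mirrors the slide's Thresh(d,q), and storing the matrix as an upper-triangle dictionary is an assumption.

```python
d = {  # upper triangle of the example distance matrix
    ("A", "B"): 63, ("A", "C"): 94, ("A", "D"): 111, ("A", "E"): 67,
    ("B", "C"): 79, ("B", "D"): 96, ("B", "E"): 16,
    ("C", "D"): 47, ("C", "E"): 83,
    ("D", "E"): 100,
}

def thresh(d, q):
    """Edges (i, j) of the threshold graph: kept if and only if d[i, j] <= q."""
    return {pair for pair, dij in d.items() if dij <= q}

edges = thresh(d, 67)
print(sorted(edges))
# prints: [('A', 'B'), ('A', 'E'), ('B', 'E'), ('C', 'D')]

# Phase 1 tries every distinct matrix entry as a threshold q.
thresholds = sorted(set(d.values()))
```

With q = 67 the four surviving edges are exactly the ones drawn in the figure.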
Triangulating
A graph is triangulated if any cycle with
four or more vertices has a chord
That is, an edge joining two nonconsecutive
vertices of the cycle.
Our example is already triangulated, but
let’s look at another
Triangulating
Let’s say this is for q = 5
10 and 15 would not be in the graph
To triangulate this graph you add the edge of length 10.
[Figure: graph on W, X, Y, Z with four edges of length 5 forming a cycle, plus diagonals of length 10 and 15]
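A simple way to test whether a graph is triangulated is to keep deleting a vertex whose remaining neighbours are all connected to each other; the graph is triangulated exactly when this empties it. The sketch below applies that check to the four-cycle example. Which diagonal carries length 10 cannot be read off the flattened figure, so using W-Z as the added chord is an assumption.

```python
def is_triangulated(vertices, edges):
    """Check for chords by repeatedly deleting a simplicial vertex
    (one whose remaining neighbours are pairwise adjacent)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(vertices)
    while remaining:
        for v in sorted(remaining):
            nbrs = adj[v] & remaining
            if all(b in adj[a] for a in nbrs for b in nbrs if a != b):
                remaining.remove(v)
                break
        else:
            return False  # no such vertex left: a chordless cycle exists
    return True

vertices = ["W", "X", "Y", "Z"]
square = [("W", "X"), ("X", "Z"), ("Z", "Y"), ("Y", "W")]  # the length-5 edges

print(is_triangulated(vertices, square))                 # prints: False
print(is_triangulated(vertices, square + [("W", "Z")]))  # prints: True
```

Adding one diagonal splits the four-cycle into two triangles, which is all that triangulating this graph requires.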
Maximal Cliques
A clique that cannot be enlarged by the
addition of another vertex.
Recall our original threshold graph which
is triangulated:
Triangulated Threshold Graph
Our old Graph
[Figure: threshold graph with edges A-B (63), A-E (67), B-E (16), and C-D (47)]
Clique
Our maximal cliques would be:
{A, B, E}
{C, D}
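The two cliques above can be checked with a short enumeration. The sketch uses the general-purpose Bron-Kerbosch recursion, which is an assumption made for brevity: it is not the specialized polynomial-time method that triangulated graphs allow, but it gives the same answer on this small graph.

```python
def maximal_cliques(vertices, edges):
    """List all maximal cliques with the basic Bron-Kerbosch recursion."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(clique))  # cannot be enlarged any further
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(vertices), set())
    return found

edges = [("A", "B"), ("A", "E"), ("B", "E"), ("C", "D")]
cliques = maximal_cliques(["A", "B", "C", "D", "E"], edges)
print([sorted(c) for c in cliques])
# prints: [['A', 'B', 'E'], ['C', 'D']]
```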
Create Trees for the Cliques
We have two maximal cliques, so we
make two trees; {A, B, E} and {C, D}
How do we make these trees?
Remember NJ?
Tree {A, B, E} and {C,D}
[Figure: one tree on leaves A, B, E and one tree on leaves C, D]
Merge your separate trees together.
Create one Supertree
This is done by creating a minimum set of
edges in the trees and calling that the
“backbone”
This is its own doctoral thesis, so let’s do
a little hand waving
That sounds like NP-hard!
Computing the threshold graph is polynomial
Optimally triangulating is NP-hard, but a good triangulation can be
obtained in polynomial time using a greedy
heuristic without too much loss in performance.
Finding maximal cliques is only polynomial if the
input graph is triangulated (which it is!).
If all previous steps are done, creating a supertree
can be done in polynomial time as well.
Where are we now?
We now have a finalized phylogeny for each q, created
from smaller trees in our matrix joined together
Remember we started from all possible sizes of
smaller trees.
Phase 2
Which one is right?
Found using the SQS (Short Quartet
Support) method
Let T be a tree in the set made in Phase 1
Break the taxa into sets of four
• {A, B, C, D} {A, C, D, E} {A, B, D, E}… etc
• Reduce the larger tree to only hold one such set
• These are called Quartets
SQS - A Guide
Q(T) is the set of trees induced by T on each
set of four leaves.
Let Qw (a different Q) be the set of quartets with
diameter less than or equal to w
Find the maximum w where every quartet in Qw
also appears in Q(T)
This w is the “support” of that tree
SQS - Rephrased
Qw is the set of quartet trees which have a
diameter <= w
Support of T is the max w where Qw is a
subset of Q(T)
Support is our “quality measure”
What exactly are we measuring?
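The support computation can be sketched as below, under several stated assumptions: each quartet's shape is inferred with the four-point rule (the pairing with the smallest distance sum is the split), the candidate tree is supplied as a hypothetical table of leaf-to-leaf path lengths counted in edges, and support is reported as 0 when even the tightest quartet disagrees. None of these choices is stated on the slides.

```python
from itertools import combinations

def get(d, x, y):
    """Symmetric lookup into an upper-triangle dictionary."""
    return d[(x, y)] if (x, y) in d else d[(y, x)]

def quartet_split(d, quartet):
    """Four-point rule: the pairing with the smallest distance sum wins."""
    a, b, c, e = quartet
    pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    best = min(pairings, key=lambda p: get(d, *p[0]) + get(d, *p[1]))
    return frozenset(frozenset(side) for side in best)

def diameter(d, quartet):
    return max(get(d, x, y) for x, y in combinations(quartet, 2))

def support(d, tree_d, taxa):
    """Largest w such that every quartet with diameter <= w has the same
    split in the input matrix d and in the candidate tree."""
    quartets = list(combinations(taxa, 4))
    widths = sorted({diameter(d, q) for q in quartets})
    bad = [diameter(d, q) for q in quartets
           if quartet_split(d, q) != quartet_split(tree_d, q)]
    if not bad:
        return widths[-1]
    agreeing = [w for w in widths if w < min(bad)]
    return agreeing[-1] if agreeing else 0

taxa = ["A", "B", "C", "D", "E"]
d = {("A", "B"): 63, ("A", "C"): 94, ("A", "D"): 111, ("A", "E"): 67,
     ("B", "C"): 79, ("B", "D"): 96, ("B", "E"): 16,
     ("C", "D"): 47, ("C", "E"): 83, ("D", "E"): 100}
# Candidate 1 pairs B with E and C with D; candidate 2 pairs A with B instead.
tree1 = {("A", "B"): 3, ("A", "C"): 3, ("A", "D"): 3, ("A", "E"): 3,
         ("B", "C"): 4, ("B", "D"): 4, ("B", "E"): 2,
         ("C", "D"): 2, ("C", "E"): 4, ("D", "E"): 4}
tree2 = {("A", "B"): 2, ("A", "C"): 4, ("A", "D"): 4, ("A", "E"): 3,
         ("B", "C"): 4, ("B", "D"): 4, ("B", "E"): 3,
         ("C", "D"): 2, ("C", "E"): 3, ("D", "E"): 3}
print(support(d, tree1, taxa), support(d, tree2, taxa))  # prints: 111 0
```

The first candidate agrees with every quartet of the example matrix, so its support is the largest quartet diameter; the second already disagrees on the tightest quartet {A, B, C, E}.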
Qw =
[Figure: example quartet trees on {A, B, C, D} and {A, B, D, E}, drawn against the full taxon set A, B, C, D, E]
SQS Method
Return the tree in which the support of
that tree is the maximum.
If more than one such tree exists return the
tree found first.
This is the tree with the smallest original
diameter (remember from phase 1)
How do we know we’re right?
Compare it to the data set we created
Look at Robinson-Foulds accuracy
Remove one edge in the tree we’ve created.
• We now have two trees, which split the leaves into two sets
Is there any way to create the same two sets of leaves by
removing one edge in the true tree from our data set?
• If no, add a ‘point’ of error.
Repeat this for all edges
When the value is not zero, the trees are not
identical
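The edge-removal procedure above can be sketched directly. Trees are given as edge lists, an encoding chosen for this illustration; the internal node names `u`, `v`, `w` and both five-leaf trees are hypothetical. Edges that touch a single leaf are skipped, since every tree on the same leaves shares those splits.

```python
def splits(edges, leaves):
    """Leaf bipartitions obtained by deleting each internal edge in turn."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = frozenset(leaves)
    out = set()
    for u, v in edges:
        # Collect everything reachable from u without crossing edge (u, v).
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if (x, y) in ((u, v), (v, u)) or y in seen:
                    continue
                seen.add(y)
                stack.append(y)
        side = frozenset(seen) & leaves
        if 1 < len(side) < len(leaves) - 1:  # ignore single-leaf splits
            out.add(frozenset([side, leaves - side]))
    return out

def rf_errors(inferred_edges, true_edges, leaves):
    """Count inferred splits that no edge of the true tree reproduces."""
    return len(splits(inferred_edges, leaves) - splits(true_edges, leaves))

leaves = ["A", "B", "C", "D", "E"]
true_tree = [("B", "u"), ("E", "u"), ("u", "v"), ("A", "v"),
             ("v", "w"), ("C", "w"), ("D", "w")]
inferred = [("A", "u"), ("B", "u"), ("u", "v"), ("E", "v"),
            ("v", "w"), ("C", "w"), ("D", "w")]

print(rf_errors(inferred, true_tree, leaves))   # prints: 1
print(rf_errors(true_tree, true_tree, leaves))  # prints: 0
```

The inferred tree groups A with B, a split the true tree never produces, so it earns one point of error; the split separating C and D from the rest is shared by both trees.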
Performance of DCM*-NJ
Outperforms the NJ method at sequence lengths above 4000 and with more taxa.
[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ]
Improvements
The possible improvements lie in Phase 2
Include a test of Maximum Parsimony (MP)
Try to minimize the overall size of the tree
Test using statistical evidence
Maximum Likelihood (ML)
Performance gains
Simply changing Phase 2 gives massive
gains in accuracy!
DCM-NJ+MP and DCM-NJ+ML are
VERY accurate for data sets greater than
4000 and do NOT require solving an NP-hard problem.
DCM-NJ+MP finished its analysis on a
107 taxon tree in under three minutes.
Comparing Improvements
[Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for NJ, DCM-NJ+SQS, DCM-NJ+MP, and HGT-FP]