
Developing Better Supertree Methods

SuperFine,
a new supertree method

      Shel Swenson
   September 17th 2009
Reconstructing the Tree of Life

Tree of Life challenges:
- millions of species
- lots of missing data

Two possible approaches:
- Combined Analysis
- Supertree Methods
Two competing approaches
[Figure: a species-by-gene data matrix (gene 1, gene 2, ..., gene k) passed as a whole to a Combined Analysis.]
Combined Analysis Methods

gene 1            gene 2            gene 3
S1 TCTAATGGAA     S4 GGTAACCCTC     S1 TATTGATACA
S2 GCTAAGGGAA     S5 GCTAAACCTC     S3 TCTTGATACC
S3 TCTAAGGGAA     S6 GGTGACCATC     S4 TAGTGATGCA
S4 TCTAACGGAA     S7 GCTAAACCTC     S7 TAGTGATGCA
S7 TCTAATGGAC                       S8 CATTCATACC
S8 TATAACGGAA
Combined Analysis
     gene 1     gene 2     gene 3
S1   TCTAATGGAA ?????????? TATTGATACA
S2   GCTAAGGGAA ?????????? ??????????
S3   TCTAAGGGAA ?????????? TCTTGATACC
S4   TCTAACGGAA GGTAACCCTC TAGTGATGCA
S5   ?????????? GCTAAACCTC ??????????
S6   ?????????? GGTGACCATC ??????????
S7   TCTAATGGAC GCTAAACCTC TAGTGATGCA
S8   TATAACGGAA ?????????? CATTCATACC
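The concatenation step can be sketched in Python as follows. This is only an illustration, not the pipeline used in the study: the function name `concatenate` and the dictionary layout are assumptions, while the sequences are the ones shown on the slides.

```python
# Per-gene alignments from the slides: taxon -> aligned sequence.
genes = [
    {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
     "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"},
    {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
     "S7": "GCTAAACCTC"},
    {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
     "S7": "TAGTGATGCA", "S8": "CATTCATACC"},
]

def concatenate(alignments, missing="?"):
    """Build one supermatrix row per taxon, padding absent genes with `missing`.

    Hypothetical helper for illustration; assumes every row within a gene
    alignment has the same length.
    """
    taxa = sorted(set().union(*alignments))  # union of the taxon names
    rows = {}
    for taxon in taxa:
        parts = []
        for aln in alignments:
            width = len(next(iter(aln.values())))
            parts.append(aln.get(taxon, missing * width))
        rows[taxon] = "".join(parts)
    return rows

supermatrix = concatenate(genes)
for taxon, row in supermatrix.items():
    print(taxon, row)
```

Taxa sampled for only one gene (S2, S5, S6) end up with rows that are two-thirds missing data, which is the situation the talk is concerned with.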
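Two competing approaches
[Figure: the same species-by-gene matrix (gene 1, gene 2, ..., gene k) is either combined into one dataset for a Combined Analysis, or the genes are analyzed separately and the resulting trees are merged by a Supertree Method.]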
Why use supertree methods?

• Missing data
• Large dataset sizes
• Incompatible data types (e.g.,
 morphological features, biomolecular
 sequences, gene orders, even distances
 based upon biochemistry)
• Unavailable sequence data (only trees)
Many Supertree Methods
• MRP: Matrix Representation with Parsimony
  (most commonly used and most accurate)
• weighted MRP
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• MRF
• MRD
• QILI
• SDM
• Q-imputation
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...
            Today’s Outline
• Supertree and combined analysis methods       
• Why we need better supertree methods
• SuperFine: a new supertree method that is
  fast and more accurate than other supertree
  methods
  – Strict Consensus Merger (SCM)
  – Resolving polytomies
  – Performance of SuperFine (compared to MRP and
    combined analyses)
  – Applications and future work
Previous Simulation Studies
1. Generate Model Tree
2. Generate sequence data
3. Select Subsets (taxa by gene 1, gene 2, ..., gene k)
4. Construct Source Trees
5. Apply Supertree Method
6. Compare to Model Tree
What leads to missing data?
• Evolution (gain and loss of genes)

• Dataset selection

• Limited resources (time, money, etc.)
          My Simulation Study
1. Generate model trees (100-1000 taxa)
2. Simulate gene gain and loss and generate sequences
3. Simulate techniques for gene and taxon selection
   •   Clade-based datasets
   •   Scaffold dataset
4. Generate source trees and a combined dataset
5. Apply supertree and combined analysis methods
6. Compare each estimated tree to the model tree, and
   record topological error
   Experimental Parameters

• Number of taxa in model tree: 100, 500, and
  1000
  – Generate 5, 15 and 25 clade-based datasets,
    respectively
• Scaffold density: 20%, 50%, 75%, and 100%
• Six super-methods:
  – Combined analysis using ML and MP
  – MRP on ML and MP source trees
  – Weighted MRP on ML and MP source trees
      (MRP = Matrix Representation with Parsimony)
Quantifying Topological Error
[Figure: a True Tree and an Estimated Tree on the leaves A, B, C, D, E, F.]


• False negative (FN): An edge in the true
  tree missing from the estimated tree
• False positive (FP): An edge in the
  estimated tree not in the true tree
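The two error counts can be sketched in Python by comparing the sets of non-trivial bipartitions (internal edges) of the two trees. The trees below are hypothetical, not the ones in the figure, and dividing by the number of internal edges to get a rate is an assumption about how the rates in this talk are normalized.

```python
def splits(tree):
    """Non-trivial bipartitions of a tree written as nested tuples of leaf names."""
    def leaves_of(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves_of(child) for child in node))

    all_leaves = leaves_of(tree)
    anchor = min(all_leaves)  # name each split by the side holding this leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # An internal edge needs at least two leaves on each side.
        if 2 <= len(below) <= len(all_leaves) - 2:
            found.add(below if anchor in below else all_leaves - below)
        return below

    visit(tree)
    return found

def error_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate); the normalization is an assumption."""
    true_splits, est_splits = splits(true_tree), splits(estimated_tree)
    fn = len(true_splits - est_splits)  # true edges missing from the estimate
    fp = len(est_splits - true_splits)  # estimated edges absent from the truth
    return fn / max(len(true_splits), 1), fp / max(len(est_splits), 1)

# Hypothetical six-leaf trees that differ in where C and D attach.
true_tree = ((("A", "B"), "C"), ("D", ("E", "F")))
estimated_tree = ((("A", "B"), "D"), ("C", ("E", "F")))
fn_rate, fp_rate = error_rates(true_tree, estimated_tree)
print(fn_rate, fp_rate)
```

For two fully resolved trees on the same leaves the two rates coincide; they differ once the estimated tree contains polytomies, which is why SCM can have a low FP rate and a high FN rate at the same time.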
Comparison of MRP-ML and CA-ML
     (False Negative Rate)
[Plot: false negative rate vs. scaffold density (%)]
     We still need supertree
            methods!

Combined analysis cannot be used for:
  – Datasets that are very large
  – Incompatible data types
  – Unavailable sequence data
                    Outline
• Supertree and combined analysis methods        
• Why we need better supertree methods           
• SuperFine: a new supertree method that is
  fast and more accurate than other supertree
  methods
   – Strict Consensus Merger (SCM)
   – Resolving polytomies
   – Performance of SuperFine (compared to MRP
     and combined analyses)
   – Applications and future work
Methods that Led to SuperFine

• The Strict Consensus Merger (SCM)
          (Huson et al. 1999)

• Quartet MaxCut (QMC)
    (Snir and Rao 2008)
Strict Consensus Merger (SCM)
[Figure: two source trees, on leaf sets {a, b, c, d, e, f, g} and {a, b, c, d, h, i, j}, are merged into an SCM tree on {a, b, c, d, e, f, g, h, i, j}.]
               Theorem

Let S be a collection of source trees and
  T be a SCM tree on S.
Then for every s in S, ∑(T|L(s)) ⊆ ∑(s),
  where T|L(s) is the induced subtree of T
  on the leafset of s.
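The condition in the theorem can be checked mechanically. The sketch below reads ∑(t) as the set of bipartitions (edges) of a tree t and the theorem's relation as set containment; both readings are assumptions about the slide's notation. The function names and the small trees are hypothetical.

```python
def restrict(split_sides, leaves):
    """Restrict splits (each given as one side) to `leaves`, dropping trivial ones."""
    leaves = frozenset(leaves)
    anchor = min(leaves)  # name each split by the side holding this leaf
    kept = set()
    for side in split_sides:
        inside = frozenset(side) & leaves
        outside = leaves - inside
        if len(inside) >= 2 and len(outside) >= 2:
            kept.add(inside if anchor in inside else outside)
    return kept

def source_refines(supertree_splits, source_splits, source_leaves):
    """True if every split of T restricted to L(s) is also a split of s."""
    induced = restrict(supertree_splits, source_leaves)
    return induced <= restrict(source_splits, source_leaves)

# Hypothetical source tree s on {a, b, c, d, e} with edges ab|cde and abc|de.
s_leaves = "abcde"
s_splits = [{"a", "b"}, {"d", "e"}]

# A supertree on {a, ..., f} whose only edges are ab|cdef and abcd|ef:
# restricted to L(s), the second edge becomes trivial and the first is in s.
print(source_refines([{"a", "b"}, {"e", "f"}], s_splits, s_leaves))  # True

# A supertree containing ac|bdef conflicts with s.
print(source_refines([{"a", "c"}], s_splits, s_leaves))  # False
```

A supertree that passes this check for every source tree adds no edge that some source tree contradicts, which matches the low false positive rate reported for SCM.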
Intuition for the Theorem
[Figure: the same two source trees and SCM tree as on the Strict Consensus Merger slide.]
      Performance of SCM

• Low false positive (FP) rate
  (Estimated supertree has few false edges)


• High false negative (FN) rate
  (Estimated supertree is missing many true
    edges)
     Quartet MaxCut (QMC)
QMC is a heuristic for the following
 optimization problem:

Given a collection Q of quartet trees, find
 a supertree T, with leaf set L(T) = ∪_{q∈Q} L(q),
 that displays the maximum number
 of quartet trees in Q.
[Figure: a quartet tree on {1, 2, 4, 5} and a supertree on leaves 1 to 7.]
     Maximizing # of Quartet Trees
              Displayed
• 12|34, 23|45, 34|56, 45|67 are compatible quartet
  trees: a single supertree on leaves 1 to 7 displays
  all four
[Figure: the four quartet trees and the supertree on leaves 1 to 7.]

• Adding the quartet 17|23 creates an incompatible
  set of quartet trees. An “optimal” supertree would
  be the same as above, because it agrees with 4
  out of 5 quartet trees.
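The count in this example can be reproduced with a short Python sketch. A tree displays the quartet ab|cd when one of its edges separates {a, b} from {c, d}. The supertree is read from the figure as a caterpillar with cherries (1, 2) and (6, 7); that reading is an assumption, and so is the function name.

```python
def displays(tree_sides, quartet):
    """True if some edge of the tree separates the two pairs of the quartet.

    `tree_sides` lists one side of each internal edge of the tree.
    """
    (a, b), (c, d) = quartet
    for side in tree_sides:
        first_inside = {a, b} <= side and not ({c, d} & side)
        second_inside = {c, d} <= side and not ({a, b} & side)
        if first_inside or second_inside:
            return True
    return False

# Assumed caterpillar on leaves 1..7: edges 12|34567, 123|4567, 1234|567, 12345|67.
supertree = [frozenset({1, 2}), frozenset({1, 2, 3}),
             frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 4, 5})]

quartets = [((1, 2), (3, 4)), ((2, 3), (4, 5)), ((3, 4), (5, 6)),
            ((4, 5), (6, 7)), ((1, 7), (2, 3))]

score = sum(displays(supertree, q) for q in quartets)
print(score, "of", len(quartets))  # 4 of 5
```

QMC itself searches for a tree with a high score; the sketch only evaluates the score of a given tree.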
 QMC as a Supertree Method

• Step 1: Encode source trees as a set of
  quartets

• Step 2: Apply QMC
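Step 1 can be sketched in Python by listing, for every set of four leaves, the way the source tree splits them two against two. This is a brute-force illustration on a hypothetical five-leaf tree, not the encoding used in the QMC or SuperFine software, and the function name is an assumption.

```python
from itertools import combinations

def quartets_of(leaves, tree_sides):
    """All quartet trees induced by a tree given as one side of each internal edge."""
    found = set()
    for four in combinations(sorted(leaves), 4):
        four = frozenset(four)
        for side in tree_sides:
            left = four & side
            if len(left) == 2:  # this edge splits the four leaves 2 against 2
                found.add(frozenset([left, four - left]))
                break
    return found

# Hypothetical source tree on {a, b, c, d, e} with edges ab|cde and abc|de.
source_leaves = "abcde"
source_sides = [frozenset("ab"), frozenset("abc")]

encoded = quartets_of(source_leaves, source_sides)
for pair_one, pair_two in sorted(tuple(sorted(map(sorted, q))) for q in encoded):
    print("".join(pair_one) + "|" + "".join(pair_two))
```

A fully resolved tree on n leaves yields one quartet for each of its n-choose-4 leaf subsets, so this encoding grows quickly with tree size.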
     Idea behind SuperFine

• First, construct a supertree with low
  false positives using SCM
  (the Strict Consensus Merger)
• Then, refine the tree to reduce false
  negatives by resolving each
  polytomy using QMC
  (Quartet MaxCut)
Resolving a single polytomy, v

• Step 1: Encode each source tree as a
  collection of quartet trees on {1,2,...,d},
  where d=degree(v) Why?
• Step 2: Apply Quartet MaxCut (Snir and
  Rao) to the collection of quartet trees, to
  produce a tree t on leafset {1,2,...,d}
• Step 3: Replace the star tree at v by
  tree t
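The relabeling in Step 1 can be sketched in Python. Each leaf is mapped to the number of the subtree around v that contains it, and only quartets whose four leaves fall into four different subtrees are kept. The label map below follows the talk's running example (1 = {a, b, c, e}, 2 = {h}, 3 = {i, j}, 4 = {d}, 5 = {g}, 6 = {f}); the input quartets and the function name are hypothetical.

```python
def relabel_quartets(quartets, component_of):
    """Map quartets on leaves to quartets on the component labels 1..d around v."""
    relabeled = set()
    for (a, b), (c, d) in quartets:
        la, lb, lc, ld = (component_of[x] for x in (a, b, c, d))
        # Keep only quartets that touch four distinct components.
        if len({la, lb, lc, ld}) == 4:
            relabeled.add(frozenset([frozenset([la, lb]), frozenset([lc, ld])]))
    return relabeled

# Components around the polytomy v, as labeled in the talk's example.
component_of = {"a": 1, "b": 1, "c": 1, "e": 1,
                "h": 2, "i": 3, "j": 3, "d": 4, "g": 5, "f": 6}

# Hypothetical quartets from a source tree on {a, b, c, d, e, f, g}.
source_quartets = [(("a", "f"), ("d", "g")),   # becomes 16|45
                   (("e", "f"), ("d", "g")),   # also 16|45, so a duplicate
                   (("a", "b"), ("f", "g"))]   # a and b share label 1: dropped

print(relabel_quartets(source_quartets, component_of))
```

Quartets with two leaves in the same component carry no information about how to resolve v, which is why they are discarded before QMC is run.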
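Back to Our Example
[Figure: the polytomy v in the SCM tree has degree 6. The subtrees around v are labeled 1 = {a, b, c, e}, 2 = {h}, 3 = {i, j}, 4 = {d}, 5 = {g}, 6 = {f}, and the leaves of both source trees are relabeled accordingly.]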
Where We Use the Theorem
For every s in S, ∑(T|L(s)) ⊆ ∑(s)
[Figure: the SCM tree next to the two source trees with their leaves relabeled 1 to 6.]
Step 1: Encode each source tree as a
collection of quartet trees on {1,2,...,d}
[Figure: after relabeling, each source tree reduces to a single quartet tree, one on {1, 4, 5, 6} and one on {1, 2, 3, 4}.]
Step 2: Apply Quartet MaxCut (QMC)
   to the collection of quartet trees
[Figure: QMC combines the two quartet trees into a single tree on the leaves 1 to 6.]
 Replace polytomy using tree from
              QMC
[Figure: the star at v in the SCM tree is replaced by the QMC tree, giving a more resolved supertree on {a, ..., j}.]
False Negative Rate
[Plot: false negative rate vs. scaffold density (%)]
False Positive Rate
[Plot: false positive rate vs. scaffold density (%)]
Running Time: SuperFine vs. MRP
[Plots: three panels of running time vs. scaffold density (%)]
MRP 8-12 sec.
SuperFine 2-3 sec.
           Observations

• SuperFine is much more accurate than
  MRP, with comparable performance
  only when the scaffold density is 100%
• SuperFine is almost as accurate as CA-ML
• SuperFine is extremely fast
                     Future Work
• Exploring the algorithm design space for SuperFine
    – Different quartet encodings
    – Not using SCM in Step 1
    – Parallel version
    – Post-processing step to minimize Sum-of-FN to source trees
• Using SuperFine to enable phylogeny estimation
    – without an alignment
    – on many-marker combined datasets
• Using SuperFine in conjunction with divide-and-conquer methods
  to create more accurate phylogenetic methods
• Exploration of impact of source tree collections (in particular the
  scaffold) on supertree analyses
• Revisiting specific biological supertrees

								