Document Sample
paris Powered By Docstoc
					         Reconstructing Ancestral
        Recombination Graphs - or
        Phylogenetic Networks with
                        Dan Gusfield
                         UC Davis
Different parts of this work are joint with Satish Eddhu, Charles
Langley, Dean Hickerson, Yun Song, Yufeng Wu.
 Reconstructing the Evolution of
     Binary Bio-Sequences
• Perfect Phylogeny (tree) model
• Phylogenetic Networks (DAG) with
• Phylogenetic Networks with disjoint cycles:
• Phylogenetic Networks with unconstrained cycles:
• Combinatorial Structure and Efficient Algorithms
• Efficiently Computed Lower and Upper bounds
  on the number of recombinations needed
   Geneological or Phylogenetic
• The major biological motivation comes from
  genetics and attempts to reconstruct the history of
  recombination in populations.
• Also relates to phylogenetic-based haplotyping.
• Some of the algorithmic and mathematical results
  have phylogenetic applications, for example in
  hybrid speciation, lateral gene transfer.
               Why binary?

Single nucleotide polymorphisms are the key
data that we use. SNPs imply that the sequences
are binary, and also that the order of the sites is
fixed (on a chromosome). This is in contrast to
a set of taxonomic characters, which may be
binary, but where the given order
is arbitrary.
    The Perfect Phylogeny Model for
            binary sequences
                                sites 12345
                  Ancestral sequence 00000
                                      1     4
      Site mutations on edges
The tree derives the set M:
                                   3               00010
10100                         10100
10000                                       5
01011                           10000             01010
01010                                01011
00010                   Extant sequences at the leaves
           The converse problem
Given a set of sequences M we want to find, if possible,
a perfect phylogeny that derives M. Remember that
each site can change state from 0 to 1 only once.

 n will denote the number of sequences in M, and m will
 denote the length of each sequence in M.
  M    n    11100101
When can a set of sequences be
derived on a perfect phylogeny?
Classic NASC: Arrange the sequences in a
 matrix. Then (with no duplicate columns),
 the sequences can be generated on a unique
 perfect phylogeny if and only if no two
 columns (sites) contain all four pairs:
           0,0 and 0,1 and 1,0 and 1,1

       This is the 4-Gamete Test
                     A richer model

        01011                             1     4
        00010                         3                    00010
        10101 added           10100                    2
pair 4, 5 fails the three
gamete-test. The sites 4, 5
Real sequence histories often involve recombination.
       Sequence Recombination
  10100                       01011

             P           S

                             Single crossover recombination
A recombination of P and S at recombination point 5.

 The first 4 sites come from P (Prefix) and the sites
 from 5 onward come from S (Suffix).
       Network with Recombination

         01011                             1     4
         00010                         3                    00010
         10101 new         10100                        2
                                   P   10000
The previous tree with one
recombination event now derives            5       0101101010
all the sequences.                 10101
Multiple Crossover


         2-crossovers = ``gene conversion”
     Elements of a Phylogenetic
     Network (single crossover
• Directed acyclic graph.
• Integers from 1 to m written on the edges. Each integer
  written only once. These represent mutations.
• A choice of ancestral sequence at the root.
• Every non-root node is labeled by a sequence obtained
  from its parent(s) and any edge label on the edge into it.
• A node with two edges into it is a ``recombination node”,
  with a recombination point r. One parent is P and one is S.
• The network derives the sequences that label the leaves.
          A Phylogenetic Network
                      1          3
                                      00100      5
                                 2                       00101
b:10010                               01100         S
                     P       S                  4
          c:00100        3               p           01101   g:00101

                     d:10100                   f:01101

   Which Phylogenetic Networks
         are meaningful?
Given M we want a phylogenetic network that
 derives M, but which one?
A: A perfect phylogeny (tree) if possible. As little deviation
from a tree, if a tree is not possible. Use as little recombination
or gene-conversion as possible.
     Minimizing recombinations
• Any set M of sequences can be generated by a
  phylogenetic network with enough recombinations, and
  one mutation per site. This is not interesting or useful.
• However, the number of (observable) recombinations is
  small in realistic sets of sequences. ``Observable” depends
  on n and m relative to the number of recombinations.
• Two algorithmic problems: given a set of sequences M,
  find a phylogenetic network generating M, minimizing the
  number of recombinations (Hein’s problem). Find a
  network generating M that has some biologically-
  motivated structural properties.
         Minimization is NP-hard
 The problem of finding a phylogenetic network that creates a given set of
  sequences M, and minimizes the number of recombinations, is NP-
  hard. (Wang et al 2000) (Semple 2004)

Wang et al. explored the problem of finding a phylogenetic network
  where the recombination cycles are required to be node disjoint, if

They gave a sufficient but not a necessary condition to recognize cases
  when this is possible. O(nm + n^4) time.
       Recombination Cycles
• In a Phylogenetic Network, with a
  recombination node x, if we trace two paths
  backwards from x, then the paths will
  eventually meet.
• The cycle specified by those two paths is
  called a ``recombination cycle”.
A recombination cycle in a phylogenetic
 network is called a “gall” if it shares no
 node with any other recombination cycle.

A phylogenetic network is called a “galled-
 tree” if every recombination cycle is a gall.
A galled-tree generating
the sequences generated
by the prior network.


             a: 00010          p s

                   b: 10010          c: 00100

                           d: 10100               2            5

                                                   p   4
                                                                g: 00101
                                           e: 01100

                                                  f: 01101
      Sales pitch for Galled-Trees
Galled-trees represent a small deviation from true trees.

There are sufficient applications where it is plausible that a galled tree
exists that generates the sequences.

  Observable recombinations tend to be recent; block structure of human
  DNA; recombination is sparse, so the true history of observable
   recombinations may be a galled-tree.
The number of recombinations is never more than m/2. Moreover,
when M can be derived on a galled-tree, the number of recombinations used
is the minimum number over any phylogenetic network, even if multiple
 cross-overs at a recombination event are counted as a single recombination.

A galled-tree for M is ``almost unique” - implications for reconstructing the
correct history.
       Old (Aug. 2003) Results
• O(nm + n^3)-time algorithm to determine whether
  or not M can be derived on a galled-tree with all-0
  ancestral sequence.
• Proof that the galled-tree produced by the
  algorithm is a “nearly-unique” solution.
• Proof that the galled-tree (if one exists) produced
  by the algorithm minimizes the number of
  recombinations used, over all phylogenetic-
  networks with all-0 ancestral sequence.
               New work
We derive the galled-tree results in a more
general setting that addresses unconstrained
  recombination cycles and multiple
crossover recombination. This also solves the
problem of finding the ``most tree-like”
network when a perfect phylogeny is not
possible. In this algorithm, no ancestral
  sequence is known in advance.
     Blobbed-trees: generalizing
• In a phylogenetic network a maximal set of
  intersecting cycles is called a blob.
• Contracting each blob results in a directed, rooted
  tree, otherwise one of the “blobs” was not
• So every phylogenetic network can be viewed as a
  directed tree of blobs - a blobbed-tree.
 The blobs are the non-tree-like parts of the network.
                 Every network is a
                 tree of blobs.
                 How do the tree parts
                 and the blobs relate?
                 How can we exploit
                 this relationship?

Ugly tangled
network inside
the blob.
           Incompatible Sites
A pair of sites (columns) of M that fail the
4-gametes test are said to be incompatible.

A site that is not in such a pair is compatible.
      12345              Incompatibility Graph
    a 00010                4
    b 10010
    c 00100
M   d 10100          1       3      2      5
    e 01100
    f 01101          Two nodes are connected iff the pair
    g 00101          of sites are incompatible, i.e, fail the
                     4-gamete test.

     THE MAIN TOOL: We represent the pairwise
     incompatibilities in a incompatibility graph.
   The connected components of
    G(M) are very informative
• The number of non-trivial connected components
  is a lower-bound on the number of recombinations
  needed in any network (Bafna, Bansal; Gusfield,
• When each blob is a single-cycle (galled-tree case)
  all the incompatible sites in a blob must come
  from a single connected component C, and that
  blob must contain all the sites from C.
  Compatible sites need not be inside any blob.
  (Gusfield et al 2003-5)
                      Simple Fact
     If sites two sites i and j are incompatible,
    then the sites must be together on some
    recombination cycle whose recombination
    point is between the two sites i and j.
   (This is a general fact for all phylogenetic

Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4;
as are 2, 5.
          A Phylogenetic Network
                      1          3
                                      00100      5
                                 2                       00101
b:10010                               01100         S
                     P       S                  4
          c:00100        3               p           01101   g:00101

                     d:10100                   f:01101

     Simple Consequence of the
            simple fact
All sites on the same (non-trivial) connected
  component of the incompatibility graph
must be on the same blob in any blobbed-tree.
Follows by transitivity.

So we can’t subdivide a blob into a tree-like
  structure if it only contains sites from a single
  connected component of the incompatibility
    Key Result about Galls: For
  galls, the converse of the simple
      consequence is also true.
 Two sites that are in different (non-trivial) connected
 components cannot be placed on the same gall in
 any phylogenetic network for M.

 Hence, in a galled-tree T for M each gall contains all
   and only the sites of one (non-trivial) connected
   component of the incompatibility graph. All
   compatible sites can be put on edges outside of the
This is the key to the galled-tree solution.
A galled-tree generating                            Incompatibility Graph
the sequences generated                               4
by the prior network.

                           1                    1         3               2   5

             a: 00010          p s

                   b: 10010          c: 00100

                           d: 10100                   2               5

                                                   p          4
                                                                       g: 00101
                                           e: 01100

                                                      f: 01101
Motivated by the one-one correspondence between galls
and non-trivial connected components, we ask:

To what extent does this one-one correspondence hold
in general blobbed-trees, i.e. with no constraints on
how recombination cycles interweave?
  The Decomposition Theorem
       (Recomb 2005)
For any set of sequences M, there is a
blobbed-tree T(M) that derives M, where
each blob contains all and only the sites in
one non-trivial connected component of
G(M). The compatible sites can always be
put on edges outside of any blob.

A blobbed-tree with this structure is called fully-decomposed.
               General Structure
So, for any set of sequences M, there is a phylogenetic
  network T(M) that is fully decomposed.

Moreover, the tree part of T(M) is unique. And it is easy to
 find the tree part.
A fully-decomposed
network for the sequences 4                        Incompatibility Graph
generated by the prior                               4
                         1                     1         3               2   5

            a: 00010          p s

                  b: 10010          c: 00100

                          d: 10100                   2               5

                                                  p          4
                                                                      g: 00101
                                          e: 01100

                                                     f: 01101
Since all sites from a single connected
  component must be together on some blob
  in any phylogenetic network, no network is
  more decomposed than the fully
  decomposed network.
              Proof Ideas
Let C be a connected component of G(M).
  Define M[C] as the sequences in M
  restricted to the sites in C.
    a 00010             4
    b 10010                         C2
    c 00100
M   d 10100        1        3   2        5
    e 01100
    f 01101            B1           B2
    g 00101
                        134             25
                       a001         a   00
                       b101         b   00
                       c010         c   00
                       d110         d   00
                       e010         e   10
                       f 010        f   11
                       g010         g   01

                       M[C1]     M[C2]
                     Faux Proof
Pick one site from each connected component C in G(M) to
   ``represent” C. No pair of those sites are incompatible, so
   by the NASC for a perfect phylogeny, there will be a
   perfect phylogeny T for the sites. Expand each node to a
   network generating the sequences in M[C].
Incorrect, because the structure of T can be wrong. We need
   to use information about all the sites in each C.
    Now for each connected component C in G(M), call each distinct
     sequence in M[C] a supercharacter, and let W be the indicator
    matrix for the supercharacters. So W indicates which rows of M
    contain which particular supercharacters.

      134               25        a   10001000
1    a001       5   a   00        b   01001000
2    b101       5   b   00        c   00101000
3    c010       5   c   00        d   00011000
4    d110       5   d   00        e   00100100
3    e010       6   e   10        f   00100010
3    f010       7   f   11        g   00100001
3    g 01 0     8   g   01
     M[C1]          M[C2]
               Proof Ideas
Lemma: No pair of supercharacters are

So by the NASC for a Perfect Phylogeny,
  there is a unique perfect phylogeny T for W.
                  Proof Ideas
For each connected component C of G(M), all
supercharacters that originate from C label edges in T that
are incident with one single node v[C] in T. So, if we
expand each node v[C] to be a network that generates the
supercharacters from C (the sequences in M[C]), and
connect each network correctly to the edges in T, the
resulting network is a fully-decomposed blobbed-tree that
generates M.
 Algorithmically, T is easy to find and is the
tree resulting from contracting each blob in
the fully-decomposed blobbed-tree T(M) for
M. T can be constructed from M in
O(nm^2) time.
Broader Biological Applications
Our major interest is in recombination, but
the proof of the decomposition theorem
does not explicitly use recombination. So it
holds for whatever biological phenomena
caused the incompatibility of sites. For
example, back or recurrent mutation, gene-
conversion, lateral gene transfer etc.
     The main open question
The Decomposition Theorem says there is
 always a fully-decomposed blobbed-tree for
 any M, but
Is there always a fully-decomposed
 blobbed-tree that minimizes the number of
 recombinations over all possible
 phylogenetic networks for M?
We conjecture the answer is yes.
If true, then we can decompose the problem of minimizing the total
 number of recombinations into separate problems on each connected
 component, and also find lower bounds on the needed number of
 recombinations, in each component separately, adding those bounds to
 get a valid overall lower bound for M. This computation of lower
 bounds is known to be correct for certain lower bounds (Bafna, Bansal
          Progress on Proving the
Definition: If N is a phylogenetic network for M, and a node v in N is
   labeled with a sequence in M, then v is said to be visible in N.
Theorem: If every node in N is visible, then there is a fully-decomposed
   network for M where the number of recombinations is at most the
   number in N.
Corollary: The conjecture is true for any M where the Haplotype or
   History lower bounds (S. Myers) on the number of recombinations
   needed to generate M, is tight.
     Optimality of Galled-Trees

Theorem: (G,H,B,B) The minimum number of recombination nodes
in any phylogenetic network for M is at least the number
 of non-trivial connected components of the incompatibility graph.

Hence, when the sequences on each blob on T(M) can be generated
with a single recombination node, the blobbed-tree minimizes
the number of recombination nodes over all phylogenetic
networks and all choices of ancestral sequence. This solves
the root-unknown galled-tree problem in polynomial time.
Code is on the web.
The number of arrangements on a
 gall (all-0 ancestral sequence)
By analysing the algorithm to layout the
sites on a gall (not discussed here), one can
prove that the number of arrangements of any gall is
at most three, and this happens only if the gall has two

If the gall has more than two sites, then the number of
arrangements is at most two.

If the gall has four or more sites, with at least two sites
on each side of the recombination point (not the side of
the gall) then the arrangement is forced and unique.
 Computing close bounds on the
    minimum number of
     Dan Gusfield UCD

Y. Song, Y. F. Wu, D. Gusfield (ISMB2005)

D. Gusfield, D. Hickerson (Dis. Appl. Math 2005)
D. Gusfield, V. Bansal (Recomb 2005)
     The grandfather of all lower
         bounds - HK 1985
• Arrange the nodes of the incompatibility graph on the line
  in order that the sites appear in the sequence. This bound
  requires a linear order.
• The HK bound is the minimum number of vertical lines
  needed to cut every edge in the incompatibility graph.
  Weak bound, but widely used - not only to bound the
  number of recombinations, but also to suggest their
         Justification for HK
If two sites are incompatible, there must have
   been some recombination where the
   crossover point is between the two sites.
HK Lower Bound

   1   2   3   4   5
HK Lower Bound = 1

   1   2   3   4   5
       More general view of HK
Given a set of intervals on the line, and for each interval I, a
number N(I), define the composite problem: Find the minimum
number of vertical lines so that every interval I intersects at
least N(I) of the vertical lines.

In HK, each incompatibility defines an interval I where N(I) = 1.

The composite problem is easy to solve by a left-to-right myopic
placement of vertical lines: Sort the intervals by right end-point;
Process the intervals left to right in that order; when the right
endpoint of an interval I is reached, place there (if needed)
additional vertical so that N(I) lines intersect I.
 If each N(I) is a ``local” lower bound on the number of
 recombinations needed in interval I, then the solution to
 the composite problem is a valid lower bound for the
 full sequences. The resulting bound is called the composite
 bound given the local bounds.

This general approach is called the Composite Method
(Simon Myers 2002).
 Haplotype Bound (Simon Myers)
• Rh = Number of distinct sequences (rows) - Number of
  distinct sites (columns) -1 (folklore)

• Before computing Rh, remove any site that is compatible
  with all other sites. A valid lower bound results - generally
  increases the bound.

• Generally really bad bound, often negative, when used on
  large intervals, but Very Good when used as local bounds
  in the Composite Interval Method, and other methods.
Composite Interval Method using
          RH bounds
 Compute Rh separately for each of the C(m,2) intervals of
  the m sites; let N(I) = Rh(I) be the local lower bound for
  interval I. Then compute the composite bound using these
  local bounds.

Polynomial time and gives bounds that often double HK in
  our simulations.
     Composite Subset Method
• Let S be subset of sites, and Rh(S) be the
  haplotype bound for subset S. If the
  leftmost site in S is L and the rightmost site
  in S is R, then use Rh(S) as a local bound
  N(I) for interval I = [S,L].
• Compute Rh(S) on many subsets, and then
  solve the composite problem to find a
  composite bound.
             RecMin (Myers)
• World Champion Lower Bound Program (until
  now). Often RecMin gives a bound three times as
  large as HK.
• Computes Rh on subsets of sites, but limits the
  size and the span of the subsets. Default
  parameters are s = 6, w = 15 (s = size, w = span).
• Generally, impractical to set s = w = m, so
  generally one doesn’t know if increasing the
  parameters would increase the bound. (example:
  Myers bound of 70 on the LPL data).
 Optimal RecMin Bound (ORB)
• The Optimal RecMin Bound is the lower
  bound that RecMin would produce if both
  parameters were set to their maximum
  values (s = w= m).
• In general, RecMin cannot compute (in
  practical time) the ORB.
• Practical computation of the ORB is our
  first contribution.
            Computing the ORB
• Gross Idea: For each interval I, use ILP to find a subset of
  sites S that maximizes Rh(S) over all subsets in interval I.
  Call the result Opt(I).
• Set N(I) = Opt(I), and compute the composite bound using
  those local bounds.
• The composite bound is the ORB. -- the result one would
  get by using all 2^m subsets in RecMin, with s = w = m.
We have moved from doing an exponential number of simple
computations (computing Rh for each subset), to solving
 a quadratic number of (possibly expensive) ILPs.

Is this a good trade-off in practice? Our experience - very much
    How to compute Opt(I) by ILP
Create one variable Xi for each row i; one variable Yj
for each column j in interval I. All variables are 0/1 variables.

Define S(i,i’) as the set of columns where rows i and i’ have different
values. Each column in S(i,i’) is a witness that rows i and i’ differ.

For each pair of rows (i,i’), create the constraint:
Xi + Xi’ <= 1 + ∑ [Yj: j in S(i,i’)]

Objective Function: Maximize ∑ Xi - ∑ Yj -1
   Alternate way to compute Opt(I)
               by ILP
First remove any duplicate rows. Let N be the number of rows
Create one variable Yj for each column j
in interval I. All variables are 0/1 variables.

S(i,i’) as before.

For each pair of rows (i,i’) create the constraint:
1 <= ∑ [Yj: j in S(i,i’)]

Objective Function: Maximize N - ∑(Yj) -1

Finds the smallest number of columns to distinguish the rows.
             Two critical tricks
1) Use the second ILP formulation to compute Opt(I). It solves
much faster than the first (why?)

2) Reduce the number of intervals where Opt(I) is computed:

                       If the solution to Opt(I) uses columns that
                       only span interval L, then there is no
        L              need to find Opt(I’) in any interval
                       I’ containing L. Each ILP computation
                       directly spawns at most 4 other ILPs.
                        Apply this idea recursively.
With the second trick we need to find Opt(I)
for only 0.5% - 35% of all the C(m,2)
intervals (empirical result). Surprisingly
fast in practice (with either the GNU solver
or CPLEX).
Bounds Higher Than the Optimal
       RecMin Bound
• Often the ORB underestimates the true minimum, e.g.
  Kreitman’s ADH data: 6 v. 7
• How to derive sharper bound? Idea: In the composite
  method, check if each local bound N(I) = Opt(I) is tight,
  and if not, increase N(I) by one.
• Small increases in local bounds can increase the composite
   Bounds Sharper Than Optimal
         RecMin Bound
• A set of sequences M is called self-derivable if there is a
  network generating M where the sequence at every node
  (leaf and intermediate) is in M.

• Observation: The haplotype bound for a set of sequences is
  tight if and only if the sequences are self-derivable.

• So for each interval I where Opt(I) is computed, we check
  self-derivability of the sequences induced by columns S*,
  where S* are the columns specified by Opt(I). Increase
  N(I) by 1 if the sequences are not self-derivable.
  Algorithm for Self-Derivability
• Solution is easy when there are no mutations --only
  recombinations are allowed and one initial pair of
  sequences is chosen as ``reached”.
• Two reached sequences S1 and S2 can reach a third
  sequence S3, if S3 can be created by recombining S1 and
• Do BFS to see if all sequences can be reached by
  successive application of the ``reach operation”.
• Clearly polynomial time and can be optimized with suffix-
  trees etc. (Kececiouglu, Gusfield)
       Self-derivability Test with
           mutations allowed
• For each site i, construct set MUT(i) of sequence pairs (S1,
  S2) in M where S1 and S2 differ at only site.
• Try each sequence in M as root (which is the only reached
  sequence initially).
• For each root, enumerate all ways of choosing exactly one
  ordered pair of sequences (Sp, Sq) from each MUT(i). Sp
  is allowed to ``reach” Sq.
• Run the prior self-derivability algorithm with these new
  permitted reach relations, to test if all sequences in M can
  be reached. If so, then M is self-derivable, otherwise it is
    Checking if N(I) should be
        increased by two
If the set of sequences is not self-derivable,
we can test if adding one new sequence makes
   it self-derivable.
the number of candidate sequences is
   polynomial and for each one added to M we
   check self-derivability. N(I) should be
   increased by two if none of these sets of
   sequences is self-derivable.
             Program HapBound
• HapBound –S. Checks each Opt(I) subset for self-
  derivability. Increase N(I) by 1 or 2 if possible. This often
  beats ORB and is still practical for most datasets.
• HapBound –M. Explicitly examine each interval directly
  for self-derivability. Increase local bound if possible.
  Derives lower bound of 7 for Kreitman’s ADH data in this
  HapBound vs. RecMin on LPL
       from Clark et al.
Program              Lower Bound   Time
RecMin (default)     59            3s
RecMin –s 25 –w 25   75            7944s
RecMin –s 48 –w 48   No result     5 days
HapBound             75            31s
HapBound -S          78            1643s

  2 Ghz PC
     Example where RecMin has
difficulity in Finding Optimal Bound
     on a 25 by 376 Data Matrix
Program              Bound   Time
RecMin default       36      1s
RecMin –s 30 –w 30   42      3m 25s
RecMin –s 35 –w 35   43      24m 2s
RecMin –s 40 –w 40   43      2h 9m 4s
RecMin –s 45 –w 45   43      10h 20m 59s
HapBound             44      2m 59s
HapBound -S          48      39m 30s
    Frequency of HapBound –S
   Bound Sharper Than Optimal
         RecMin Bound
ms param. Rho=1          Rho=5       Rho=10         Rho=20

Theta=1      0.0%        0.4%        0.5%           1.5%

Theta=5      0.7%        4.0%        10.4%          27.0%

Theta=10 1.4%            9.2%        17.8%          40.4%

Theta=20 1.4%            10.5%       27.8%          45.4%

For every ms parameters, 1000 data sets are used.
     Computing Upper Bounds
• The method is an adaptation of the
  ``history” lower bound (Myers).
• A non-informative column is one with
  fewer than two 0’s or fewer than two 1’s.
     Single History Computation
1)   Set W = 0
2)   Collapse identical rows together, and remove non-
     informative columns. Repeat until neither is possible.
3)   Let A be the data at this point. If A is empty, stop, else
     remove some row r from A, and set W = W + 1. Go to
     step 2).

Note that the choice of r is arbitrary in Step 3), so the
    resulting W can vary.
        History Lower Bound
Theorem (Myers) Let W* be the minimum W
 obtained from all possible single history
Then W* is a valid lower bound on the
 number of recombinations needed.

Naïve time: theta(n!) (RecMin), but can be
reduced to theta(2^n) (Bafna, Bansal).
  Converting the History Lower
   Bound to an Upper Bound
• Given a set of rows A and a single row r,
  define w(r | A - r) as the minimum number
  of recombinations needed to create r from
  A-r (well defined in our application).
• w(r | A-r) can be computed in linear time by
  a greedy-type algorithm.
         Upper Bound Computation
  1)   Set W = 0
  2)   Collapse identical rows together, and remove non-
       informative columns. Repeat until neither is possible.
  3)   Let A be the data at this point. If A is empty, stop, else
       remove some row r from A, and set W = W + W(r | A-r).
       Go to step 2).

  Note that the choice of r is arbitrary in Step 3), so the
      resulting W can vary.

This is the Single History Computation, with a change in
step 3).
Note, even a single execution of the upper bound computation
  gives a valid upper bound, and a way to construct a
  network. This is in contrast to the History Bound which
  requires finding the minimum W over all histories.

However, we also would like to find the lowest upper bound
  possible with this approach. This can be done in O(2^n)
  time by DP. In practice, we can use branch and bound to
  find the lowest upper bound, but we have also found that
  branching on the best local choice, or randomizing gives
  good results.
          Branch and Bound
(Branching) In Step 3) choose r to minimize
   w(r | A-r) + L(A-r), where L(A-r) is some
  fast lower bound on the number of
  recombinations needed for the set A-r.
  Even HK is good for this purpose.
(Bounding) Let C be the min for an full
  solution found so far; If W + L(A) >= C,
  then backtrack.
             Kreitman’s 1983 ADH Data

•   11 sequences, 43 segregating sites
•   Both HapBound and SHRUB took only a fraction of a
    second to analyze this data.
•   Both produced 7 for the number of detected
    recombination events
    Therefore, independently of all other methods, our lower
    and upper bound methods together imply that 7 is the
    minimum number of recombination events.
       A minimal ARG for Kreitman’s data

SHRUB produces
code that can be
input to an open
source program to
display the                   QuickTime™ and a
                          TIFF (LZW) decompressor
                       are neede d to see this picture.
constructed ARG
         The Human LPL Data (Nickerson et al. 1998)
     (88 Sequences, 88 sites)

 Our new lower                                  QuickTime™ an d a
                                            TIFF (LZW) decomp resso r
 and upper                               are need ed to see this picture.


                                                        QuickTime™ and a
 Optimal RecMin                                     TIFF (LZW) decompressor
                                                 are neede d to see this picture.


(We ignored insertion/deletion, unphased sites, and sites with missing data.)
                                  Study on simulated data:
      Exact-Match frequency for varying parameters
  = Scaled mutation rate
                                                     Used Hudson’s MS to
 = Scaled recombination rate                    generate1000 simulated datasets
• n = Number of sequences                          for each pair of and 

                               n = 25                                   n = 15

                   QuickTime™ and a                             QuickTime™ an d a
               TIFF (LZW) decompresso r                     TIFF (LZW) decomp resso r
            are neede d to see this picture.             are need ed to see this picture .

 For < 5, our lower and upper bounds match more than
 95% of the time.
             Exact-Match frequency
        for varying number of sequences

                                 QuickTime™ and a
                             TIFF (LZW) decompressor
                          are neede d to see this picture.

Match frequency does not depend on n as much as it does on
 or 
            A closer look at the deviation
       Average ratio of lower bound to upper bound when they
        do not match

 For n = 25:
                                       QuickTime™ and a
                                   TIFF (LZW) decomp resso r
                                are need ed to see this picture.

The numerical difference between lower and upper bounds
grows as or  increases, but their ratio is more stable.
        Papers and Software

Shared By: