A powerful low-parameter method for inferring quartets
under the General Markov Model

Jeremy Sumner
Barbara Holland
Peter Jarvis

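                The general Markov model (GMM)

[Figure: four-taxon tree with root distribution π and transition matrices M1 to M5, one per edge]

e.g. base composition π =

            A        C     G        T
            0.3      0.2   0.4      0.1

and transition matrix M =

                 A    C    G    T
             A   0.80 0.10 0.07 0.03
             C   0.05 0.75 0.15 0.05
             G   0.02 0.03 0.92 0.03
             T   0.02 0.05 0.04 0.87
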
• In the GMM the mutation transition matrices do not have
  to be symmetrical.


• As a consequence of this, base frequencies could be
  different in different taxa.


• Almost all phylogenetic methods / commonly used
  models cannot account for drift in base composition
  across the tree.
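
The drift in base composition can be seen by pushing the example π through the example M. The sketch below is only an illustration: it assumes rows of M are "from" bases and columns are "to" bases, and because the T row as printed sums to 0.98 it rescales every row to sum to 1, which is an assumption rather than something stated on the slide. The names `normalise_rows` and `evolve` are made up for this example.

```python
BASES = "ACGT"

# Example values from the GMM slide.
pi = [0.3, 0.2, 0.4, 0.1]
M = [
    [0.80, 0.10, 0.07, 0.03],
    [0.05, 0.75, 0.15, 0.05],
    [0.02, 0.03, 0.92, 0.03],
    [0.02, 0.05, 0.04, 0.87],
]


def normalise_rows(matrix):
    """Rescale each row to sum to 1 (assumption: the printed T row sums to 0.98)."""
    return [[x / sum(row) for x in row] for row in matrix]


def evolve(freqs, matrix):
    """Base composition after one edge: new[j] = sum_i freqs[i] * matrix[i][j]."""
    return [sum(freqs[i] * matrix[i][j] for i in range(4)) for j in range(4)]


M_norm = normalise_rows(M)
pi_after = evolve(pi, M_norm)

for base, before, after in zip(BASES, pi, pi_after):
    print(f"{base}: {before:.3f} -> {after:.3f}")
```

Because M is not symmetric, the composition at the end of the edge differs from π; here the frequency of G rises while A falls.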
The exception: Log-det distances

                      d_xy = -ln det F_xy

   x:   GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC...
   y:   GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC...

F_xy (rows: base in x, columns: base in y) =

                  A        C       G      T
           A      6        0       0      0
           C      1        8       0      2
           G      1        1       8      1
           T      1        0       0      9
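
As a rough check, the matrix above can be rebuilt by counting aligned base pairs in the visible parts of the two sequences. The sketch below uses only the 38 sites shown (the rest is elided on the slide). Scaling the counts to relative frequencies before taking the determinant is an assumption; the slide only gives the formula -ln det Fxy. The function names are illustrative.

```python
import math
from fractions import Fraction

BASES = "ACGT"

# Visible prefixes of the two sequences on the slide.
x = "GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC"
y = "GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC"


def divergence_matrix(seq_x, seq_y):
    """F[i][j] = number of sites with base i in x and base j in y."""
    F = [[0] * 4 for _ in range(4)]
    for a, b in zip(seq_x, seq_y):
        F[BASES.index(a)][BASES.index(b)] += 1
    return F


def determinant(matrix):
    """Exact determinant by Gaussian elimination over fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def logdet_distance(seq_x, seq_y):
    """-ln det F, with F scaled to relative frequencies (an assumption)."""
    F = divergence_matrix(seq_x, seq_y)
    n_sites = sum(map(sum, F))
    freq = [[Fraction(v, n_sites) for v in row] for row in F]
    return -math.log(determinant(freq))


F_xy = divergence_matrix(x, y)
for base, row in zip(BASES, F_xy):
    print(base, row)
print("log-det distance:", logdet_distance(x, y))
```

The distance is undefined when the determinant is zero or negative, which can happen for short or highly diverged sequences; this sketch does not guard against that case.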
           Markov invariants
•   The log det is the simplest example
    of a Markov invariant
•   Jeremy Sumner and Peter Jarvis extended the theory of
    Markov invariants to larger subsets of
    taxa
     –   Tangles (3 taxa)
     –   Squangles (4 taxa)
     –   Stangles
Math wizards...
  ...and their magical polynomials

Each row is one term: a coefficient followed by five indices (1-256)

e.g. the row "3 1 18 73 168 255" is the term
     3*p1*p18*p73*p168*p255
     = 3*pAAAA*pACAC*pCAGA*pGGCT*pTTTG

3 1 18 69 171 256
-3 1 18 69 172 255
-3 1 18 69 175 252
3 1 18 69 176 251
-3 1 18 69 187 240
3 1 18 69 188 239
3 1 18 69 191 236
-3 1 18 69 192 235
-1 1 18 71 169 256
1 1 18 71 172 253
1 1 18 71 173 252
-1 1 18 71 176 249
1 1 18 71 185 240
-1 1 18 71 188 237
-1 1 18 71 189 236
1 1 18 71 192 233
1 1 18 72 169 255
-1 1 18 72 171 253
-1 1 18 72 173 251
1 1 18 72 175 249
-1 1 18 72 185 239
1 1 18 72 187 237
1 1 18 72 189 235
-1 1 18 72 191 233
-3 1 18 73 167 256
3 1 18 73 168 255
3 1 18 73 175 248
-3 1 18 73 176 247
3 1 18 73 183 240
-3 1 18 73 184 239
-3 1 18 73 191 232
3 1 18 73 192 231

and another 66,712 terms
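
The index convention can be read off the worked term 3*p1*p18*p73*p168*p255 = 3*pAAAA*pACAC*pCAGA*pGGCT*pTTTG: indices run from 1 to 256 over four-letter site patterns in the order A, C, G, T, with the first taxon as the most significant position. The sketch below encodes that reading and evaluates one term. The helper names and the uniform pattern frequencies are hypothetical, used only to make the example runnable.

```python
BASES = "ACGT"


def pattern_from_index(index):
    """Map a 1-based index (1..256) to a four-taxon site pattern."""
    k = index - 1
    letters = []
    for _ in range(4):
        k, digit = divmod(k, 4)
        letters.append(BASES[digit])
    return "".join(reversed(letters))


def index_from_pattern(pattern):
    """Inverse of pattern_from_index."""
    k = 0
    for base in pattern:
        k = 4 * k + BASES.index(base)
    return k + 1


def term_value(coefficient, indices, p):
    """coefficient * product of p[i]; p maps index -> pattern frequency."""
    value = coefficient
    for i in indices:
        value *= p[i]
    return value


row = (3, [1, 18, 73, 168, 255])
patterns = [pattern_from_index(i) for i in row[1]]
print(patterns)

# Hypothetical input: every pattern equally frequent.
uniform = {i: 1 / 256 for i in range(1, 257)}
value = term_value(row[0], row[1], uniform)
```

Evaluating the full polynomial would mean summing `term_value` over all rows of the coefficient file, with `p` taken from the observed site-pattern frequencies of the four sequences.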
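               Choosing a quartet

Each row gives the expected values of (q1, q2, q3) under one of the
three quartet trees:

q1   q2   q3
0    -u   u
v    0    -v
-w   w    0

[Figure: number line marking -u, 0 and u]

[Figure: the same number line with u = 0]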
       Residual sum of squares
• Pick the quartet tree that minimises the residual sum
  of squares (RSS)

• u = max{0, (q3 - q2)/2}    (v and w are defined similarly)

• The RSS is always of the form
     q1^2 + [(q3 - q2)/2 - u]^2

• If the values are in the right order (q3 > q2) the second
  term vanishes; if they are not, u is set to 0 and the term remains
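
A minimal sketch of this calculation for all three trees follows. It assumes the expected patterns (0, -u, u), (v, 0, -v) and (-w, w, 0) from the "Choosing a quartet" table and that v and w follow the same rule as u with the roles of q1, q2, q3 rotated. The function name is illustrative, and the input values are the q1, q2, q3 from the mtDNA example slide.

```python
def quartet_rss(q1, q2, q3):
    """Return [(edge_parameter, rss), ...] for trees 1, 2 and 3.

    Tree 1 expects (0, -u, u), tree 2 (v, 0, -v), tree 3 (-w, w, 0).
    """
    fits = []
    # Each tuple: (value expected to be 0, value expected +t, value expected -t)
    for zero, plus, minus in ((q1, q3, q2), (q2, q1, q3), (q3, q2, q1)):
        half_diff = (plus - minus) / 2
        t = max(0.0, half_diff)          # u, v or w, constrained to be >= 0
        rss = zero ** 2 + (half_diff - t) ** 2
        fits.append((t, rss))
    return fits


fits = quartet_rss(9.14e-07, -7.58e-06, 6.67e-06)
best = min(range(3), key=lambda i: fits[i][1])

for i, (t, rss) in enumerate(fits, start=1):
    print(f"tree {i}: parameter = {t:.3g}, RSS = {rss:.3g}")
print("chosen tree:", best + 1)
```

With these inputs the first tree has the smallest RSS, and v and w are both clamped to 0.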
                  Weights (I)
• Weight each quartet
   w_i = 1/RSS_i

• A posterior probability (ish) weighting scheme for the
  quartets is then
   p_i = w_i/(w_1 + w_2 + w_3)
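
The weighting can be sketched in a few lines. The function name is illustrative, and the inputs are the rounded RSS values printed on the example slide, so the result lands near, though not exactly on, the p values shown there.

```python
def quartet_weights(rss_values):
    """Inverse-RSS weights, normalised so that they sum to 1."""
    weights = [1.0 / rss for rss in rss_values]
    total = sum(weights)
    return [w / total for w in weights]


# Rounded RSS1, RSS2, RSS3 from the mtDNA example.
p = quartet_weights([8.36e-13, 6.58e-11, 6.25e-11])
print([round(value, 3) for value in p])
```

An RSS of exactly zero would cause a division by zero here; how the authors handle that case is not stated, so this sketch leaves it unguarded.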
                 Example

MtDNA genomes, 13856 sites

((Rhea,Hippo),Platypus,Wallaroo);

q1 = 9.14e-07     u = 7.13e-06     RSS1 = 8.36e-13     p1 = 0.978
q2 = -7.58e-06    v = 0            RSS2 = 6.58e-11     p2 = 0.011
q3 = 6.67e-06     w = 0            RSS3 = 6.25e-11     p3 = 0.011
                   Weights (II)
• The RSS weights give a measure of the relative support
  for each topology.

• It would also be useful to have a quartet weight that is
  related to the length of the middle edge of the
  quartet

       q1     q2     q3
       0      -u     u
       v      0      -v
       -w     w      0

• The most likely suspect is u = (q3-q2)/2
Felsenstein tree, pendant short edges = 0.01, pendant long edges = 0.1

[Figure: q1 plotted against the probability of mutation on the middle edge (x-axis 0 to 0.16, y-axis -0.03 to 0.03)]

[Figure: q3-q2 plotted against the probability of mutation on the middle edge (x-axis 0 to 0.16, y-axis -0.1 to 0.9)]
       Basic simulation setup

[Figure: quartet trees for the Felsenstein zone and the Farris zone]

Jukes-Cantor model: equal base frequencies, all changes equally likely

100 data sets for each parameter choice
            Simulations (I)
      Testing power compared to cNJ

[Figure: four bar charts of frequency against #sites (201, 402, 804, 1608, 10000), with series SQ(12)(34), SQ(13)(24), SQ(14)(23), NJ(12)(34), NJ(13)(24), NJ(14)(23). Panels:
   Felsenstein: short edge = 0.0025, long edge = 0.15
   Farris: short edge = 0.0025, long edge = 0.15
   Felsenstein: short edge = 0.005, long edge = 0.075
   Farris: short edge = 0.005, long edge = 0.075]
          Simulations (II)
    Adding base composition drift
•   Added a GC bias along the long edges

             A     C      G      T
       A     *     pl*b   pl*b   pl
       C     pl    *      pl     pl
       G     pl    pl     *      pl
       T     pl    pl*b   pl*b   *
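
A small sketch of how such a matrix could be built. It assumes that each `*` on the diagonal is whatever makes the row sum to 1, which is the usual convention but is not spelled out on the slide. The function name and the values pl = 0.025 and b = 3 are hypothetical.

```python
BASES = "ACGT"


def gc_bias_matrix(pl, b):
    """Transition matrix with changes A->C, A->G, T->C, T->G scaled by b."""
    matrix = []
    for i, source in enumerate(BASES):
        row = []
        for j, target in enumerate(BASES):
            if i == j:
                row.append(0.0)            # placeholder, filled in below
            elif source in "AT" and target in "CG":
                row.append(pl * b)         # biased change towards G or C
            else:
                row.append(pl)
        row[i] = 1.0 - sum(row)            # assumption: '*' makes the row sum to 1
        matrix.append(row)
    return matrix


M_bias = gc_bias_matrix(pl=0.025, b=3)
M_flat = gc_bias_matrix(pl=0.025, b=1)

for base, row in zip(BASES, M_bias):
    print(base, [round(value, 3) for value in row])
```

With b = 1 every off-diagonal entry equals pl and the matrix is symmetric, matching the unbiased Jukes-Cantor setting; with b > 1 the rows for A and T push probability towards C and G.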
         GC bias on long edges
          Felsenstein: short edge = 0.005, long edge = 0.075

#Sites   Method   bias = 1     2      3      4      5

200      SQ           71      59     50     39     24
         NJ           66      49     15      0      0

400      SQ           86      68     62     42     35
         NJ           86      47      8      0      0

800      SQ           93      75     63     60     36
         NJ           91      53      3      0      0

1600     SQ          100      92     79     59     38
         NJ          100      56      0      0      0

10000    SQ          100     100     95     78     50
         NJ          100      67      0      0      0
                Simulations (II)
Adding a proportion of invariant sites
         Felsenstein: short edge = 0.005, long edge = 0.075

#Sites   pInv = 0    0.1    0.2    0.3    0.4    0.5

200          73      64     60     53     48     35
400          84      78     62     57     42     41
800          93      76     70     55     47     34
1600         97      89     72     66     35     22
10000       100     100     97     63     26      4
          Putting it all together
•   Most people want to build trees on more than 4 taxa


•   Fortunately there are already several methods for
    going from quartets to larger trees
     – Q*
     – Quartet puzzling
     – Any supertree method


•   Or from quartets to splits graphs
     – QNet
mt genomes

[Figure: Qnet network for the mt genomes, distance based weights]

[Figure: Qnet networks with distance based weights for the 1st, 2nd and 3rd codon positions]
      Detecting invariant sites
•   The residual sum of squares (RSS)
    scores give an opportunity to detect
    invariant sites.

•   Remove constant sites in order to
     –   Idea 1: Minimise sum of RSS
     –   Idea 2: Minimise minimum RSS
   15,000 sites of which 5000 are invariable
   proportion of constant sites out of 10,000 variable sites was 0.58

constant sites    PP (three quartets)       sum RSS     min RSS

     0.72         0.25   0.44   0.30       2.22E-09    7.14E-10
     0.70         0.28   0.40   0.32       1.74E-09    5.40E-10
     0.68         0.31   0.35   0.33       1.45E-09    3.81E-10
     0.66         0.37   0.30   0.33       1.20E-09    2.46E-10
     0.64         0.45   0.25   0.30       9.90E-10    1.37E-10
     0.62         0.59   0.18   0.23       8.31E-10    5.85E-11
     0.60         0.80   0.09   0.11       7.40E-10    1.33E-11
     0.57         0.98   0.01   0.01       7.07E-10    2.06E-14
     0.55         0.97   0.02   0.02       7.15E-10    1.29E-11
     0.52         0.82   0.09   0.08       7.38E-10    4.26E-11
     0.50         0.69   0.17   0.14       7.47E-10    7.76E-11
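
Both ideas reduce to picking the row with the smallest score. The sketch below applies that to the values in the table above. It only reads the table and does not re-run any squangle calculation; the variable names are illustrative.

```python
# (proportion of constant sites, sum RSS, min RSS), copied from the table.
rows = [
    (0.72, 2.22e-09, 7.14e-10),
    (0.70, 1.74e-09, 5.40e-10),
    (0.68, 1.45e-09, 3.81e-10),
    (0.66, 1.20e-09, 2.46e-10),
    (0.64, 9.90e-10, 1.37e-10),
    (0.62, 8.31e-10, 5.85e-11),
    (0.60, 7.40e-10, 1.33e-11),
    (0.57, 7.07e-10, 2.06e-14),
    (0.55, 7.15e-10, 1.29e-11),
    (0.52, 7.38e-10, 4.26e-11),
    (0.50, 7.47e-10, 7.76e-11),
]

best_by_sum = min(rows, key=lambda r: r[1])[0]   # Idea 1: minimise sum of RSS
best_by_min = min(rows, key=lambda r: r[2])[0]   # Idea 2: minimise minimum RSS

print("Idea 1 picks", best_by_sum)
print("Idea 2 picks", best_by_min)
```

Both criteria select 0.57, the tabulated value closest to the stated proportion of 0.58 among the variable sites.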
       Vagaries of real data
• Dealing sensibly with missing or
  ambiguous data
• Currently all sites with question
  marks, gaps or ambiguities anywhere in the
  alignment are removed
• It seems better to do this on a per quartet
  basis
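
The difference between the two filtering strategies can be illustrated with a toy alignment. Everything in this sketch is hypothetical: the taxon names, the sequences and the function name. It also assumes upper-case sequences in which anything other than A, C, G or T counts as missing or ambiguous.

```python
VALID = set("ACGT")


def clean_sites(alignment, taxa):
    """Keep only the columns where every listed taxon has A, C, G or T."""
    seqs = [alignment[name] for name in taxa]
    kept = [col for col in zip(*seqs) if all(c in VALID for c in col)]
    if not kept:
        return [""] * len(seqs)
    return ["".join(chars) for chars in zip(*kept)]


# Toy alignment: one gap, one ambiguity code and one question mark.
alignment = {
    "t1": "ACGTAC",
    "t2": "AC-TAC",
    "t3": "ACGTNC",
    "t4": "ACGTAC",
    "t5": "?CGTAC",
}

whole = clean_sites(alignment, list(alignment))             # filter once, globally
quartet = clean_sites(alignment, ("t1", "t2", "t3", "t4"))  # filter per quartet

print("sites kept over whole alignment:", len(whole[0]))
print("sites kept for quartet t1-t4:   ", len(quartet[0]))
```

Filtering per quartet keeps any column that is clean for the four taxa in question, so each quartet is scored on at least as many sites as the global filter would leave.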
                  Code
• R code
• Python code, creates output that can be
  understood by Qnet
           Simulation plans
• Compare to likelihood
• Compare to NJ with log-det distances
• Look at rates across sites instead of just
  proportions of invariant sites

				