Maximum Likelihood

Document Sample
Maximum Likelihood Powered By Docstoc
					                         Maximum Likelihood



                                  James McInerney,



    This presentation is based almost entirely on Peter G. Fosters - "The Idiot’s Guide to the Zen of
                          Likelihood in a Nutshell in Seven Days for Dummies,
Unleashed”.http://www.bioinf.org/molsys/data/idiots.pdf (Peter cannot be blamed for any inaccuracies in
                                            this presentation)
              Maximum Likelihood
•   Could equally be called maximum probability.
•   Historically the newest method.
•   Popularized by Joseph Felsenstein, Seattle, Washington.
•   Its slow uptake by the scientific community has to do with the
    difficulty of understanding the theory and also the absence
    (initially) of good quality software with choice of models and
    ease of interaction with data.
•   Also, at the time, it was computationally intractable to analyse
    large datasets (consider that in the mid-1980s a typical
    desktop computer had a processor speed of less than 30 MHz).
•   In recent times, software, models and computer hardware have
    become sufficiently sophisticated that ML is becoming a method
    of choice.
     ML: comparison with other
            methods.
• ML is similar to many other methods in many ways
• In many ways it is fundamentally different.
• ML assumes a model of sequence evolution (so does
  Maximum Parsimony and so do distance matrix methods).
• ML attempts to answer the question: What is the
  probability that I would observe these data (a multiple
  sequence alignment), given a particular model of evolution
  (a tree and a process).
• ML uses a „model‟. This is justifiable, since molecular
  sequence data can be shown to have arisen according to a
  stochastic process.
        Maximum Likelihood - goal

   • To estimate the probability that we would
     observe a particular dataset, given a
     phylogenetic tree and some notion of how the
     evolutionary process worked over time.


                                                b
                                                 a     c   d




                                (                            )
                                                           
                                               b a   e   f 
                                                           
                                               c e   a   g
                                                           
Probability of          given                   c
                                                 d     f   a


                                     a,c,g,t
     What is the probability of
       observing a datum?
• If we flip a coin and get a head and we think the coin is
  unbiased, then the probability of observing this head is
  0.5.
• If we think the coin is biased so that we expect to get a
  head 80% of the time, then the likelihood of observing
  this datum (a head) is 0.8.
• Therefore: The likelihood of making some observation is
  entirely dependent on the model that underlies our
  assumption.

                                         Lesson: The datum has not



       p                 =?
                                         changed, our model has.
                                         Therefore under the new model
                                         the likelihood of observing the
                                         datum has changed.
    What is the probability of
    observing a 'G' nucleotide?
• Question:If we have a DNA sequence of one nucleotide in
  length and the identity of this nucleotide is 'G', what is
  the likelihood that we would observe this 'G'?
• Answer: In the same way as the coin-flipping
  observation, the likelihood of observing this 'G' is
  dependent on the model of sequence evolution that is
  thought to underlie the data.
• E.g.
   – Model 1: frequency of G = 0.4 => likelihood(G) = 0.4
   – Model 2: frequency of G = 0.1 => likelihood(G) =0.1
   – Model 3: frequency of G = 0.25 => likelihood(G) = 0.25
     One rule…the rule of 1.

• The sum of the likelihoods of all
  the possibilities will always equal 1.

• E.g. for DNA p(a)+p(c)+p(g)+p(t)=1
 What about longer sequences?

• If we consider a gene of length 2:
      Gene 1:     ga
• The the probability of observing this gene is
  the product of the probabilities of observing
  each character.
• E.g
   – p(g) = 0.4; p(a)=0.15 (for instance)
   – likelihood(ga) = 0.4 x 0.15 = 0.06
      …or even longer sequences?

• Gene 1: gactagctagacagatacgaattac
• Model (simple base frequency model):
  – p(a)=0.15; p(c)=0.2; p(g)=0.4; p(t)=0.25;
  – (the sum of all probabilities must equal 1)


Like(Gene 1) = 0.000000000000000018452813
              Note about models
• You might notice that our model of base
   frequency is not the optimal model for our
   observed data. If we had used the following
   model:
p(a)=0.4; p(c) =0.2; p(g)= 0.2; p(t) = 0.2;
The likelihood of observing the gene is:
Like(gene 1) = 0.000000000000335544320000
(a value that is almost 10,000 times higher)


                                               Lesson: The datum has not
                                               changed, our model has.
                                               Therefore under the new model
                                               the likelihood of observing the
                                               datum has changed.
      How does this relate to
        phylogenetic trees?
• Consider an alignment of two sequences:
   – Gene 1: gaac
   – Gene 2: gacc
• We assume these genes are related by a
  (simple) phylogenetic tree with branch lengths.
Increase in model sophistication
• It is no longer possible to simply invoke a model that
  encompasses base composition, we must also include the
  mechanism of sequence change and stasis.
• There are two parts to this model - the tree and the
  process (the latter is confusingly referred to as the
  model, although both parts really compose the model).




Note: We will stay with the confusing notation - to avoid further confusion.
                    The model
 • The two parts of the model are the tree and the
   process (the model).
 • The model is composed of the composition and the
   substitution process -rate of change from one character
   state to another character state.




                                 b
                                  a     c   d
                                            
Model =                    +    b a
                                
                                c e
                                
                                 c
                                  d
                                        e
                                        a
                                        f
                                            g
                                                 a,c,g,t
                                            f 
                                              
                                              
                                            a
 Simple “time-reversible” model
• A simple model is that the rate of change from a to c or
  vice versa is 0.4, the composition of a is 0.25 and the
  composition of c is 0.25 (a simplified version of the
  Jukes and Cantor 1969 model)



        . 0.4    .   .
                      
P=     0.4
       
        .
             .
             .
                   .
                   .
                       .
                        
                       .
                                 0.25 0.25 . .
                      
        .  .     .   .
       Probability of the third nucleotide
        position in our current alignment
     • p(a) =0.25; p(c) = 0.25;            pa c  0.4

     Starting with a, the likelihood of the nucleotide is 0.25 and
        the likelihood of the substitution (branch) is 0.4. So the
        likelihood of observing these data is:
     *Likelihood(D|M) = 0.25 x 0.4 =0.01

Note: you will get the same result if you start with c, since this model is
reversible




                                  *The likelihood of the data, given the model.
                   Substitution matrix

     • For nucleotide sequences, there are 16 possible
       ways to describe substitutions - a 4x4 matrix.

                          a        b    c     d 
                                                
                          e        f    g     h 
                      P                              Convention dictates
                           i       j    k     l       that the order of
                                                      the nucleotides is
                          
                           m         n    o     p       a,c,g,t


Note: for amino acids, the matrix is a 20 x 20 matrix and for codon-based models,
the matrix is 61 x 61
           Substitution matrix - an
                   example
                     0.976 0.01 0.007 0.007
                                            
                     0.002 0.983 0.005 0.01 
                 P                         
                     0.003 0.01 0.979 0.007
                                            
                     0.002 0.013 0.005 0.979


   • In this matrix, the probability of an a changing
     to a c is 0.01 and the probability of a c
     remaining the same is 0.979, etc.

Note: The rows of this matrix sum to 1 - meaning that for every
nucleotide, we have covered all the possibilities of what might happen
to it. The columns do not sum to anything in particular.
      To calculate the likelihood of the entire
     dataset, given a substitution matrix, base
      composition and a branch length of one
      "certain evolutionary distance" or "ced"


                                       0.976 0.01 0.007 0.007
                                                              
                                       0.002 0.983 0.005 0.01 
                                   P                         
                                       0.003 0.01 0.979 0.007
                                                              
                                       0.002 0.013 0.005 0.979

              Gene 1: ccat
Likelihood of Gene 2: ccgt given



                                       π=[0.1,0.4,0.2,0.3]
  Likelihood of a two-sequence
            alignment.
• ccat
• ccgt
             c Pc c c Pc c  aPa g t Pt t
         =0.4x0.983x0.4x0.983x0.1x0.007x0.3x0.979
         =0.0000300



Likelihood of going from the first to the second
sequence is 0.0000300
      Different Branch Lengths
• For very short branch lengths, the probability of a
  character staying the same is high and the probability of it
  changing is low (for our particular matrix).
• For longer branch lengths, the probability of character
  change becomes higher and the probability of staying the
  same is lower.
• The previous calculations are based on the assumption that
  the branch length describes one Certain Evolutionary
  Distance or CED.
• If we want to consider a branch length that is twice as long
  (2 CED), then we can multiply the substitution matrix by
  itself (matrix2).
                    2 CED model

    0.976 0.01 0.007 0.007            0.976 0.01 0.007 0.007
                                                             
    0.002 0.983 0.005 0.01 
P                         
    0.003 0.01 0.979 0.007
                                  X       0.002 0.983 0.005 0.01 
                                      P                         
                                          0.003 0.01 0.979 0.007
                                                             
    0.002 0.013 0.005 0.979           0.002 0.013 0.005 0.979
                       0.953 0.02 0.013 0.015
                      
                                            
                =      0.005 0.966 0.015 0.029
                      
                                            
                      0.01 0.029 0.939 0.022
                                            
                       0.007 0.038 0.015 0.94 
                      



Which gives a likelihood of 0.0000559

                                                   Note the higher likelihood
                   For 3 CED
               0.93   0.029   0.019 0.022
                                         
            3  
                0.007   0.949   0.015 0.029
           P                            
               0.01   0.029   0.939 0.022
                                         
               
                0.007   0.038   0.015 0.94 

 This gives a likelihood of 0.0000782

Note that as the branch lengths increase, the
values on diagonal decrease and the values on
the off-diagonals increase.
For higher values of CED units
                 L
1    0.0000300   i
                 k
2    0.0000559   e
3    0.0000782   l
                 i
10   0.0001620   h
                 o
15   0.0001770   o
20   0.0001750   d

30   0.0001520       0   10   20   30    40
                         Branch Length
    Raising P to a large power

• If we raise P to a very large power, we find
  that the ML base composition „pops out‟.

             0.1 0.4 0.2 0.3
            
                             
       10 6  0.1 0.4 0.2 0.3
            
      P                     
             0.1 0.4 0.2 0.3
            
            
• So the base composition is0.3 into the
             0.1 0.4 0.2 built
                             
  probability matrix P.
                                   Rate Matrices
                                                          This does make
                                   5  exp 4log(5)
                                    4
Consider the following equation…                          sense doesn’t it??


       • In the same way, raising a matrix to the power of a number
         can be calculated by taking the log of the matrix, multiplying
         it by branch length and taking the exponent of the product.
       • In this way, you can exponentiate the matrix by a number
         that is not a whole number (e.g. 4.5698 or whatever).
       • E.g. The log of the previous matrix, P is:


                             0.0244 0.0101 0.0067 0.0076
                            
                                                       
                            0.0025 0.0176 0.005 0.0101
                    log P                             
                            0.0034 0.0101 0.021 0.0076     Note: the
                                                            sum of each
                            0.0025 0.0134 0.005 0.021
                                                                row is zero.
                   0.0244 0.0101 0.0067 0.0076
                  
                                             
                  0.0025 0.0176 0.005 0.0101
          log P                             
                  0.0034 0.0101 0.021 0.0076
                                             
                  0.0025 0.0134 0.005 0.021

•This matrix corresponds to one CED. What we want is to
derive a matrix so that when we exponentiate it, the
values correspond to substitutions per site.
•We must therefore scale logP so that when the rows of
logP are multiplied by       the off-diagonal elements sum
to 1.                   row
•The resulting scaled logP (lets call it Q), when its
exponent is taken gives a P corresponding to 1
substitution per site.
    Converting to substitutions per
                 site.
•   For a branch length of v:

                                Qv
                          e           P(v)
•   If we scale the logP appropriately, we will get a Q matrix. If we
    multiply this Q matrix by a diagonal matrix of the composition we
    get a matrix where the off-diagonal elements sum to 1 and the
    diagonal elements sum to -1.
           Scaling logP appropriately.
                     1.218 0.504 0.336 0.378
                    
 LogP scaled by                             
 a factor of 50 Q  0.126 0.882 0.252 0.504
                                            
 (for instance)     0.168 0.504 1.05 0.378
                                            
                    0.126 0.672 0.252 1.05
     
      0.1 0  0  0   0.122 0.05 0.034 0.038 
                                          
     0 0.4 0  0   0.05 0.353 0.101 0.202 
                  
                   .Q                           
     0  0 0.2 0  0.034 0.101 0.21 0.076 
                                          
     0  0  0 0.3 0.038 0.202 0.076 0.315
                                      Off-diagonal elements sum to 1,
     (diagonal matrix of the
                                        diagonal elements sum to -1
           composition)

P’s generated from this Q will give branch lengths in substitutions per site.
   Separating composition from
             rates.
• If we divide the columns of Q by     the composition is
                                     col use the exact
  separated from the rates. You can then
  same rate matrix with different matrices of base
  composition.
• For the model we have been using, the rate matrix R is:

                       0.3 0.4 0.3
                                   
                      0.3  0.3 0.4
                     
                 R                
                      0.4 0.3  0.3
                     
• The diagonal elements do not matter and are left out.
                                   
                      0.3 0.4 0.3  
  The model is symmetrical (time reversible).
                     
Relationships between R, Q and
          P matrices.
    Multiply columns by
     the composition,
     scale so that the        Multiply by branch
     off-diagonals of            length, then


R                         Q                          P
         sum to 1               exponentiate



                                Log, then scale so
    Divide columns by           that off-diagonals
                                  of    sum to 1
     the composition
      Likelihood of the alignment at
          various branch lengths
                              0.0002
                             0.00018

     ccat                    0.00016
                             0.00014


     ccgt                    0.00012
                              0.0001
                             0.00008
                             0.00006
                             0.00004
                             0.00002
                                   0
                                       0   0.1   0.2   0.3   0.4   0.5   0.6




The maximum likelihood value is 0.0001777 at a branch length
of 0.330614
Likelihood of a two-branch tree

                                              A
                                0.1
                        O

                                      0.2
                                                       B
O is the origin or root, the numbers represent branch lengths. The
likelihood can be calculated in three ways:
     •from A to B in one step (this amounts to the previous method)
     •from A to B in two steps (through O)
     •in two parts starting at O.
                         Lesson about O
    •    O is an unknown sequence.
    •    We can only speculate what each position in the alignment would be if
         we could observe the sequence of O.
    •    What we do know is that the sum of all possibilities is equal to 1.
    •    Therefore we must sum the likielihoods for all possibilities of O.
    •    This becomes computationally intensive.



                     A                                                   {c}
        0.1                                                0.1
O                        For position 1:   {a,c,g,t}

              0.2                                                0.2
                               B                                                 {c}
               A three branch tree

                           A    0.1         0.3
                                                       C
                         B     0.2


The tree can be rooted anywhere and the substitutions calculated
accordingly.
There are many ways of doing this and this is left as an exercise for
the student.
   Increasing the sophistication of
               models
   • So far, the models we have dealt with assume that
     change is equally likely at all positions and that the rate
     of change is constant for the entire duration of the
     phylogeny.
   • This is not a realistic model for all sequences (it is a
     neutral model with a constant molecular clock).

             A                 B         A

               p               p                    D
CORRECT TREE           q                                WRONG TREE
                   q       q       p>q
                                                    C
                   C       D
                                         B
Small subunit ribosomal RNA




                      18S or 16S rRNA
The molecular clock for alpha-globin:
Each point represents the number of substitutions separating each animal
                             from humans
                            100
  number of substitutions


                                                                      shark
                            80


                            60                                       carp
                                             platypus
                                                               chicken
                            40

                                      cow
                            20


                             0
                                            100




                                                   200




                                                         300




                                                                    400




                                                                              500
                                  0




                      Time to common ancestor (millions of years)
                       Invariable sites
•   For a given dataset we can assume that a certain proportion of sites are not free
    to vary - purifying selection (related to function) prevents these sites from
    changing).

•   We can therefore observe invariable positions either because they are under this
    selective constraint or because they have not had a chance to vary or because
    there is homoplasy in the dataset and a reversal (say) has caused the site to
    appear constant.

•   The likelihood that a site is invariable can be calculated by incorporating this
    possibility into our model and calculating for every site the likelihood that it is an
    invariable site.

•   It might improve the likelihood of the dataset if we remove a certain proportion
    of invariable sites (in a way that is analogous to the preceding discussion).

•   The PAUP software can estimate the proportion of invariable sites using Maximum
    Likelihood.
                  Variable sites
• Obviously other sites in the dataset are free to vary.
• Selection intensity on these sites is rarely uniform, so it
  is desirable to model site-by-site rate variation.
• This is done in two ways:
   – site specific (codon position, or alpha helix etc.)
   – using a discrete approximation to a continuous distribution
     (gamma distribution).
• Again, these variables are modeled over all possibilities
  of sequence change over all possibilities of branch length
  over all possibilities of tree topology.
The shape of the gamma distribution for
different values of alpha.
    Does changing a model affect
           the outcome?
There are different models
Jukes and Cantor (JC69):
    All base compositions equal (0.25 each), rate of change from one base to
    another is the same
Kimura 2-Parameter (K2P):
    All base compositions equal (0.25 each), different substitution rate for
    transitions and transversions).
Hasegawa-Kishino-Yano (HKY):
    Like the K2P, but with base composition free to vary.
General Time Reversible (GTR):
    Base composition free to vary, all possible substitutions can differ.


All these models can be extended to accommodate invariable sites and site-to-site
rate variation.
       Long-branch attraction (LBA)
   •   In the case below, the wrong tree is often selected. ML will
       not be prone to this problem, if the correct model of sequence
       evolution is used.

               A                 B            A

                 p               p                      D
CORRECT TREE             q                                  WRONG TREE
                     q       q       p>q
                                                        C
                     C       D
   •   This is often called the „felsenstein-zone‟.
                                              B
                 Strengths of ML
•   Does not try to make an observation of sequence change and
    then a correction for superimposed substitutions. There is no
    need to „correct‟ for anything, the models take care of
    superimposed substitutions.
•   Accurate branch lengths.
•   Each site has a likelihood.
•   If the model is correct, we should retrieve the correct tree*.
•   You can use a model that fits the data.
•   ML uses all the data (no selection of sites based on
    informativeness, all sites are informative).
•   ML can not only tell you about the phylogeny of the sequences,
    but also the process of evolution that led to the observations of
    today‟s sequences.



              *If we have long-enough sequences and a sophisticated-enough model.
                           Models
• You can use models that:
   –   Deal with different transition/transversion ratios.
   –   Deal with unequal base composition.
   –   Deal with heterogeneity of rates across sites.
   –   Deal with heterogeneity of the substitution process
       (different rates across lineages, different rates at
       different parts of the tree).
• The more free parameters, the better your model fits
  your data (good).
• The more free parameters, the higher the variance of
  the estimate (bad).
• Use a model that fits your data.
     Over-fitting a model to your data.
12                            12

10                            10

 8                                 8

 6                                 6

 4                                 4

 2                                 2


 0                                 0
     0   2   4   6   8   10            0   2   4   6   8   10
12                            12

10                            10

 8                             8

 6                             6

 4                             4

 2                             2

 0                             0
     0   2   4   6   8   10            0   2   4   6   8   10
          Weaknesses of ML

• Can be inconsistent if we use models that are
  not accurate.
• Model might not be sophisticated enough (you
  can „max-out‟ on models).
• Very computationally-intensive. Might not be
  possible to examine all models (substitution
  matrices, tree topologies, etc.).
             Recommendations
• Interact with the data.
• If you have collected enough data, you might get a good
  picture of the underlying model of sequence evolution.
• Use a test of alternative models (implemented in the
  modeltest software).
• Don‟t just „choose‟ a model, use a model that fits your
  data.
• Don‟t „over-fit‟ a model to your data.
“How often have I said to you that when you have eliminated
the impossible, whatever remains, however improbable,
must be the truth.”

                         Sherlock Holmes to Dr. Watson in
                      The Sign of Four, by A. Conan Doyle.

				
DOCUMENT INFO