
Phylogeny Tree Reconstruction

[Figure: example tree topologies over five species labeled 1 to 5]

Phylogenetic Trees

       • Nodes: species
       • Edges: time of independent
         evolution

       • Edge length represents
         evolution time

               AKA genetic distance


               Not necessarily
                chronological time




CS262 Lecture 13, Win07, Batzoglou
Parsimony – direct method not using distances


       •     One of the most popular methods:
                   GIVEN multiple alignment
                   FIND tree & history of substitutions explaining alignment


       Idea:
          Find the tree that explains the observed sequences with a minimal
          number of substitutions

       Two computational subproblems:

       1. Find the parsimony cost of a given tree (easy)

       2. Search through all tree topologies (hard)
Example: Parsimony cost of one column

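       [Figure: four-leaf tree for one alignment column, with leaf characters A, B, A, A
        and leaf sets {A}, {B}, {A}, {A}. The node joining A and B gets {A, B}
        (empty intersection, so C += 1); the nodes above it, including the root, get {A}.
        Final cost C = 1.]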
Parsimony Scoring
       Given a tree, and an alignment column u
          Label internal nodes to minimize the number of required substitutions

       Initialization:
           Set cost C = 0; node k = 2N – 1 (the root)
       Iteration:
           If k is a leaf, set R_k = { x_k[u] } // R_k is simply the character of the kth species

            If k is not a leaf,
                    Let i, j be the daughter nodes (compute R_i and R_j first);
                    Set R_k = R_i ∩ R_j if the intersection is nonempty
                    Set R_k = R_i ∪ R_j, and C += 1, if the intersection is empty

       Termination:
          Minimal cost of the tree for column u is C
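
The scoring recursion above can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: the nested-tuple tree encoding, the leaf names s1 to s4, and the function name parsimony_cost are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple, Union

# Assumed encoding: a tree is a leaf name (str) or a pair (left, right) of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def parsimony_cost(tree: Tree, column: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (R_k, C) for the subtree rooted at `tree` and one alignment column."""
    if isinstance(tree, str):
        # Leaf: R_k is simply the observed character of that species.
        return {column[tree]}, 0
    left, right = tree
    r_i, c_i = parsimony_cost(left, column)
    r_j, c_j = parsimony_cost(right, column)
    common = r_i & r_j
    if common:
        # Nonempty intersection: no extra substitution is needed at this node.
        return common, c_i + c_j
    # Empty intersection: take the union and pay one substitution.
    return r_i | r_j, c_i + c_j + 1


# The one-column example from the earlier slide: leaves A, B, A, A.
tree = ((("s1", "s2"), "s3"), "s4")
column = {"s1": "A", "s2": "B", "s3": "A", "s4": "A"}
root_set, cost = parsimony_cost(tree, column)
print(root_set, cost)
```

Written recursively, the daughters are always scored before their parent, which is the order the iteration requires.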
Example

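       [Figure: eight-leaf tree with leaf characters A, A, A, A, B, B, A, B and leaf sets
        {A}, {A}, {A}, {A}, {B}, {B}, {A}, {B}. The internal nodes carry the sets
        {A}, {A}, {A}, {A, B}, {A, B}, {B}, with {B} at the root.]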
Traceback to find ancestral nucleotides


       Traceback:

       1. Choose an arbitrary nucleotide from R_{2N–1} for the root

       2. Having chosen nucleotide r for parent k,
          If r ∈ R_i, choose r for daughter i
          Else, choose an arbitrary nucleotide from R_i



       Easy to see that this traceback produces some assignment of cost C
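
A minimal sketch of the two passes together, assuming the same nested-tuple tree encoding as a leaf name or a (left, right) pair. The function names and the tuple layout are illustrative choices, and the "arbitrary" nucleotide is picked deterministically with min() so the result is reproducible.

```python
def annotate(tree, column):
    """Bottom-up pass: return (R_k, cost_of_subtree, children) for each node."""
    if isinstance(tree, str):
        return {column[tree]}, 0, ()
    left = annotate(tree[0], column)
    right = annotate(tree[1], column)
    cost = left[1] + right[1]
    common = left[0] & right[0]
    if common:
        return common, cost, (left, right)
    return left[0] | right[0], cost + 1, (left, right)


def traceback(node, parent_char=None):
    """Top-down pass: return a labeled tree (char, child, child) or (char,) for a leaf."""
    r_set, _, children = node
    if parent_char in r_set:
        char = parent_char          # keep the parent's nucleotide when allowed
    else:
        char = min(r_set)           # "arbitrary" choice, made deterministic here
    return (char,) + tuple(traceback(child, char) for child in children)


def count_substitutions(labeled):
    """Count edges whose two endpoints carry different nucleotides."""
    char, *children = labeled
    return sum((child[0] != char) + count_substitutions(child) for child in children)


tree = ((("a", "b"), "c"), "d")
column = {"a": "A", "b": "B", "c": "A", "d": "B"}
node = annotate(tree, column)
labeled = traceback(node)
print(node[1], count_substitutions(labeled))
```

On this four-leaf column the labeled tree uses exactly as many substitutions as the cost found in the bottom-up pass, which is the claim made above.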


Example
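       [Figure: four-leaf tree with leaf characters A, B, A, B and leaf sets {A}, {B}, {A}, {B};
        the internal sets shown are {A, B}, {A}, and {A, B}. Several labelings of the
        internal nodes are drawn, with x marking each substitution: some are admissible
        with Traceback, and one is still optimal but inadmissible with Traceback.]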


Multiple Sequence Alignments

Definition


       • Given N sequences x1, x2,…, xN:
               Insert gaps (-) in each sequence xi, such that
                  • All sequences have the same length L
                  • Score of the global map is maximum

       • A faint similarity between two sequences becomes significant if
         present in many

       • Multiple alignments reveal elements that are conserved among a
         class of organisms and therefore important in their common biology

       • The patterns of conservation can help us infer the function of the element


Scoring Function: Sum Of Pairs

       Definition: Induced pairwise alignment
             The pairwise alignment obtained by keeping two rows of the multiple
             alignment and dropping every column in which both rows have a gap


       Example:
                                     x:   AC-GCGG-C
                                     y:   AC-GC-GAG
                                     z:   GCCGC-GAG
       Induces:

                      x: ACGCGG-C;         x: AC-GCGG-C;   y: AC-GCGAG
                      y: ACGC-GAG;         z: GCCGC-GAG;   z: GCCGCGAG
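
The projection in this example can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, and the match, mismatch, and gap values are assumed numbers, not ones given in the lecture.

```python
from itertools import combinations


def induced_pairwise(row_a, row_b):
    """Keep two rows of a multiple alignment and drop gap-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Assumed toy scoring scheme for one column of a pairwise alignment."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Sum the scores of all induced pairwise alignments."""
    total = 0
    for row_a, row_b in combinations(rows, 2):
        ind_a, ind_b = induced_pairwise(row_a, row_b)
        total += sum(pair_score(a, b) for a, b in zip(ind_a, ind_b))
    return total


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))
print(induced_pairwise(y, z))
print(sum_of_pairs([x, y, z]))
```

Dropping gap-gap columns before scoring means a column where both sequences are gapped contributes nothing to that pair's score.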


Sum Of Pairs (cont’d)

       • Heuristic way to incorporate evolution tree:

                                                      Human

                                                      Mouse
                                                      Duck
                                                      Chicken


        • Weighted SOP:


                                     S(m) = Σ_{k<l} w_kl s(m^k, m^l)
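
As a small illustration of the weighted sum, the sketch below scores a single column. The weight matrix and the match/mismatch scores are made-up values for the example; in practice the weights would be derived from the tree.

```python
from itertools import combinations


def weighted_sop_column(column, weights, s):
    """Weighted sum of pairs for one column: sum over k < l of w[k][l] * s(m^k, m^l)."""
    return sum(
        weights[k][l] * s(column[k], column[l])
        for k, l in combinations(range(len(column)), 2)
    )


def toy_score(a, b):
    # Assumed scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


# Hypothetical weights: the first two sequences are close relatives,
# so their pair gets full weight; pairs with the third get half weight.
weights = [
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
print(weighted_sop_column("AAC", weights, toy_score))
```

Only the entries with k < l are read, so the weights can be stored as an upper-triangular matrix.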

A Profile Representation

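       Alignment (5 sequences, 14 columns):

                   -   A   G   G   C   T   A   T   C   A   C   C   T   G
                   T   A   G   -   C   T   A   C   C   A   -   -   -   G
                   C   A   G   -   C   T   A   C   C   A   -   -   -   G
                   C   A   G   -   C   T   A   T   C   A   C   -   G   G
                   C   A   G   -   C   T   A   T   C   G   C   -   G   G

       Profile (frequency of each symbol in each column):

                   1   2   3   4   5   6   7   8   9  10  11  12  13  14
                A   0   1   0   0   0   0   1   0   0  .8   0   0   0   0
                C  .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
                G   0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
                T  .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
                -  .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0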


       • Given a multiple alignment M = m1…mn
               Replace each column mi with profile entry pi
                      • Frequency of each letter in the alphabet Σ
                      • # gaps
                      • Optional: # gap openings, extensions, closings
               Can think of this as a “likelihood” of each letter in each position
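
A minimal sketch of building such a profile from the five-sequence alignment on this slide. The function name build_profile and the choice to treat the gap as a fifth symbol are assumptions; gap openings and extensions are not tracked here.

```python
from collections import Counter

ALPHABET = "ACGT-"  # the gap is treated as a fifth symbol in this sketch


def build_profile(rows):
    """Return one dict per column mapping each symbol to its frequency."""
    n = len(rows)
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({symbol: counts[symbol] / n for symbol in ALPHABET})
    return profile


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
profile = build_profile(rows)
for symbol in ALPHABET:
    print(symbol, [round(entry[symbol], 1) for entry in profile])
```

Each column's frequencies sum to 1, which is what allows the profile to be read as a per-position "likelihood".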
Multiple Sequence Alignments: Algorithms

Multidimensional DP


       Generalization of Needleman-Wunsch:


                                                   S(m) = Σ_i S(m_i)

       (sum of column scores)



       F(i1,i2,…,iN):                Optimal alignment score up to (i1, …, iN)

       F(i1,i2,…,iN)                 = max over all neighbors nbr of the cell in the cube of ( F(nbr) + S(nbr) )


Multidimensional DP

       • Example: in 3D (three
         sequences):

       • 7 neighbors/cell

            F(i,j,k)                 = max{ F(i – 1, j – 1, k – 1) + S(xi, xj, xk),
                                            F(i – 1, j – 1, k    ) + S(xi, xj, - ),
                                            F(i – 1, j    , k – 1) + S(xi, - , xk),
                                            F(i – 1, j    , k    ) + S(xi, - , - ),
                                            F(i    , j – 1, k – 1) + S( - , xj, xk),
                                            F(i    , j – 1, k    ) + S( - , xj, - ),
                                            F(i    , j    , k – 1) + S( - , - , xk) }
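
The three-sequence recurrence can be sketched directly. This is a minimal illustration that returns only the optimal score: the column score is an assumed sum-of-pairs scheme (match +1, mismatch -1, gap -2, gap against gap 0), and the function names are hypothetical.

```python
from itertools import product


def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Assumed sum-of-pairs column score; a gap-gap pair scores 0."""
    def s(p, q):
        if p == "-" and q == "-":
            return 0
        if p == "-" or q == "-":
            return gap
        return match if p == q else mismatch
    return s(a, b) + s(a, c) + s(b, c)


def align3_score(x, y, z):
    """Optimal global alignment score of three sequences by 3D dynamic programming."""
    # The 7 moves: every nonzero 0/1 vector says which sequences advance.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    # Lexicographic order guarantees every predecessor is filled in first.
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[i - 1] if di else "-"
            b = y[j - 1] if dj else "-"
            c = z[k - 1] if dk else "-"
            best = max(best, F[pi, pj, pk] + column_score(a, b, c))
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]


print(align3_score("GAAG", "GAC", "GAAC"))
```

Storing the table in a dictionary keeps the sketch short; a real implementation would use a dense array and keep back-pointers to recover the alignment itself.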


Multidimensional DP

           Running Time:

           1. Size of matrix: L^N;

                  Where L = length of each sequence
                        N = number of sequences

           2. Neighbors/cell: 2^N – 1

           Therefore………………………… O(2^N L^N)
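
To make the growth concrete, the sketch below enumerates the moves for N sequences and multiplies by the table size. The helper names are hypothetical, and the table is counted as (L + 1)^N cells because index 0 is included, which does not change the asymptotic bound.

```python
from itertools import product


def dp_moves(n):
    """All nonzero 0/1 vectors of length n: which sequences advance in one step."""
    return [move for move in product((0, 1), repeat=n) if any(move)]


def dp_work(length, n):
    """Rough count of (cell, neighbor) pairs examined by N-dimensional DP."""
    return len(dp_moves(n)) * (length + 1) ** n


for n in (2, 3, 4, 5):
    print(n, len(dp_moves(n)), dp_work(100, n))
```

Even at length 100, each added sequence multiplies the work by roughly two hundred, which is why exact multidimensional DP is limited to a handful of sequences.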

Multidimensional DP

       • How do gap states generalize?

       • VERY badly!
                Require 2^N – 1 states, one per combination of
                 gapped/ungapped sequences
                Running time: O(2^N × 2^N × L^N) = O(4^N L^N)

       [Figure: the seven states for three sequences X, Y, Z: X, Y, Z, XY, XZ, YZ, XYZ]

Progressive Alignment
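       [Figure: guide tree. Sequences x and y are aligned into profile pxy, z and w into
        profile pzw, and pxy and pzw into profile pxyzw.]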

       •     When evolutionary tree is known:

               Align closest first, in the order of the tree
               In each step, align two sequences x, y, or profiles px, py, to generate a new
                alignment with associated profile presult

              Weighted version:
               Tree edges have weights, proportional to the divergence in that edge
               New profile is a weighted average of two old profiles


Progressive Alignment
       Example

       Profile: (A, C, G, T, -)
       px = (0.8, 0.2, 0, 0, 0)
       py = (0.6, 0, 0, 0, 0.4)

       s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A)
                 + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
       Result: pxy = (0.7, 0.1, 0, 0, 0.2)

       s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
       Result: px- = (0.4, 0.1, 0, 0, 0.5)
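
The profile arithmetic in this example can be reproduced with a short sketch. The function names and the substitution scores (match +1, mismatch -1, gap -2) are assumptions for illustration; the equal-weight average corresponds to the unweighted case where both profiles count the same.

```python
ALPHABET = "ACGT-"


def toy_s(a, b):
    """Assumed scores: gap-gap 0, letter-gap -2, match +1, mismatch -1."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def profile_score(p, q, s=toy_s):
    """Expected score of aligning profile column p against profile column q."""
    return sum(
        p[i] * q[j] * s(a, b)
        for i, a in enumerate(ALPHABET)
        for j, b in enumerate(ALPHABET)
    )


def merge_profiles(p, q, weight_p=0.5):
    """New profile column as a weighted average of the two old ones."""
    return tuple(weight_p * a + (1 - weight_p) * b for a, b in zip(p, q))


px = (0.8, 0.2, 0.0, 0.0, 0.0)
py = (0.6, 0.0, 0.0, 0.0, 0.4)
gap_column = (0.0, 0.0, 0.0, 0.0, 1.0)

print(profile_score(px, py))
print(merge_profiles(px, py))
print(merge_profiles(px, gap_column))
```

Changing weight_p gives the weighted version, where a profile on a shorter tree edge could be given more influence.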


Progressive Alignment
       [Figure: sequences x, y, z, w joined by an unknown tree, marked "?"]

       •     When evolutionary tree is unknown:

               Perform all pairwise alignments
               Define distance matrix D, where D(x, y) is a measure of evolutionary
                distance, based on pairwise alignment
               Construct a tree (UPGMA / Neighbor Joining / Other methods)
               Align on the tree
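
The steps above can be sketched as follows, using UPGMA for the tree-building step. This is a minimal sketch under stated assumptions: the distance is taken to be the fraction of differing columns in a pairwise alignment, the example distances are made up, and the tree is returned as nested tuples giving the order in which to align.

```python
from itertools import combinations


def pairwise_distance(aligned_a, aligned_b):
    """Assumed distance: fraction of differing columns in a pairwise alignment."""
    diffs = sum(p != q for p, q in zip(aligned_a, aligned_b))
    return diffs / len(aligned_a)


def upgma(names, dist):
    """Build a guide tree by UPGMA.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    clusters = {name: 1 for name in names}   # cluster -> number of leaves in it
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Made-up distances: x is close to y, z is close to w.
names = ["x", "y", "z", "w"]
dist = {frozenset(pair): 0.5 for pair in combinations(names, 2)}
dist[frozenset(("x", "y"))] = 0.1
dist[frozenset(("z", "w"))] = 0.2
print(upgma(names, dist))
```

The nested tuple reads from the inside out: align x with y, align z with w, then align the two resulting profiles.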



Heuristics to improve alignments

       • Iterative refinement schemes

       • A*-based search

       • Consistency

       • Simulated Annealing

       • …
Iterative Refinement


       One problem of progressive alignment:
       • Initial alignments are “frozen” even when new evidence arrives

       Example:


                      x:             GAAGTT
                      y:             GAC-TT   Frozen!

                      z:             GAACTG   Now it is clear that the correct alignment is y = GA-CTT
                      w:             GTACTG



Iterative Refinement

      Algorithm (Barton-Sternberg):

      1.  For j = 1 to N,
         Remove xj, and realign it to the alignment of
             x1…xj-1xj+1…xN
      2.  Repeat step 1 until convergence

      [Figure: projection with x, z fixed; y is allowed to vary]

Iterative Refinement

       Example: align (x,y), (z,w), (xy, zw):

                                     x:   GAAGTTA
                                     y:   GAC-TTA
                                     z:   GAACTGA
                                     w:   GTACTGA


       After realigning y:

                                     x:   GAAGTTA
                                     y:   G-ACTTA   + 3 matches
                                     z:   GAACTGA
                                     w:   GTACTGA
Iterative Refinement



       Example not handled well:

                                     x:    GAAGTTA
                                     y1:   GAC-TTA   Realigning any single yi
                                     y2:   GAC-TTA   changes nothing
                                     y3:   GAC-TTA

                                     z:    GAACTGA
                                     w:    GTACTGA


Consistency


       [Figure: sequences x, y, z, with position xi in x, position zk in z, and two
        candidate positions yj and yj’ in y]




Consistency


       Basic method for applying consistency

       •     Compute all pairs of alignments xy, xz, yz, …

       •     When aligning x, y during progressive alignment,

                 For each (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz)
                 Align x and y with DP using the modified s(.,.) function
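
The source leaves function_of unspecified. The sketch below shows one possible choice, purely as an assumption: add a bonus to s(xi, yj) for every position zk that is aligned to xi in the x-z alignment and to yj in the y-z alignment. Pairwise alignments are represented as sets of matched index pairs, which is also an illustrative choice.

```python
def consistency_score(i, j, base_score, a_xz, a_yz, bonus=1.0):
    """Modified s(xi, yj): base score plus a bonus per supporting position in z.

    a_xz: set of (i, k) pairs meaning x[i] is aligned to z[k]
    a_yz: set of (j, k) pairs meaning y[j] is aligned to z[k]
    """
    z_partners_of_xi = {k for (xi, k) in a_xz if xi == i}
    z_partners_of_yj = {k for (yj, k) in a_yz if yj == j}
    support = z_partners_of_xi & z_partners_of_yj
    return base_score + bonus * len(support)


# Hypothetical pairwise alignments against z.
a_xz = {(0, 0), (1, 1)}
a_yz = {(0, 0), (1, 2)}

print(consistency_score(0, 0, 1.0, a_xz, a_yz))  # x0 and y0 both align to z0
print(consistency_score(1, 1, 1.0, a_xz, a_yz))  # x1 and y1 align to different z positions
```

The modified scores would then be fed to the usual pairwise DP in place of the original s(.,.).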
Some Resources
       Genome Resources

       Annotation and alignment genome browser at UCSC
       http://genome.ucsc.edu/cgi-bin/hgGateway

       Specialized VISTA alignment browser at LBNL
       http://pipeline.lbl.gov/cgi-bin/gateway2

       ABC—Nice Stanford tool for browsing alignments
       http://encode.stanford.edu/~asimenos/ABC/



       Protein Multiple Aligners

       CLUSTALW – most widely used
       http://www.ebi.ac.uk/clustalw/

       MUSCLE – most scalable
       http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

       PROBCONS – most accurate
       http://probcons.stanford.edu/
