Posted on 11/5/2012. Public Domain.
Phylogeny Tree Reconstruction
[Figure: example tree over five leaves numbered 1 to 5]

Phylogenetic Trees
• Nodes: species
• Edges: time of independent evolution
• Edge length represents evolution time, also known as genetic distance; this is not necessarily chronological time

CS262 Lecture 13, Win07, Batzoglou

Parsimony: a direct method that does not use distances
• One of the most popular methods:
  GIVEN a multiple alignment
  FIND a tree and a history of substitutions explaining the alignment
Idea: find the tree that explains the observed sequences with a minimal number of substitutions.
Two computational subproblems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)

Example: Parsimony cost of one column
[Figure: tree with four leaves A, B, A, A, i.e. leaf sets {A}, {B}, {A}, {A}. The internal nodes are labeled {A}, {A}, and {A, B}; the cost is incremented (C += 1) at the node labeled {A, B}. Final cost C = 1.]

Parsimony Scoring
Given a tree and an alignment column $u$, label the internal nodes to minimize the number of required substitutions.
Initialization: set cost $C = 0$; node $k = 2N - 1$ (the root)
Iteration (the sets of the daughters are computed before the set of their parent):
  If $k$ is a leaf, set $R_k = \{ x_k[u] \}$   // $R_k$ is simply the character of the kth species
  If $k$ is not a leaf, let $i$, $j$ be the daughter nodes;
    set $R_k = R_i \cap R_j$ if the intersection is nonempty
    set $R_k = R_i \cup R_j$, and $C$ += 1, if the intersection is empty
Termination: the minimal cost of the tree for column $u$ is $C$.

Example
[Figure: tree with eight leaves A, A, A, A, B, B, A, B, i.e. leaf sets {A}, {A}, {A}, {A}, {B}, {B}, {A}, {B}. The seven internal nodes carry the sets {B}, {A, B}, {A}, {B}, {A}, {A, B}, {A}.]

Traceback to find ancestral nucleotides
Traceback:
1. Choose an arbitrary nucleotide from $R_{2N-1}$ for the root
2. Having chosen nucleotide $r$ for parent $k$:
   If $r \in R_i$, choose $r$ for daughter $i$
   Else, choose an arbitrary nucleotide from $R_i$
It is easy to see that this traceback produces some assignment of cost $C$.

Example
[Figure: several labelings of a tree with leaves A, B, A, B (leaf sets {A}, {B}, {A}, {B}; internal sets {A, B}); substitutions are marked with x. Some labelings are admissible with Traceback; one is still optimal, but inadmissible with Traceback.]

Multiple Sequence Alignments

Definition
• Given N sequences $x_1, x_2, \dots, x_N$:
  Insert gaps (-) in each sequence $x_i$, such that
  • All sequences have the same length L
  • The score of the global map is maximum
• A faint similarity between two sequences becomes significant if present in many
• Multiple alignments reveal elements that are conserved among a class of organisms and are therefore important in their common biology
• The patterns of conservation can help us tell the function of the element

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment: keep two rows and drop every column in which both rows have a gap.
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Induces:
  x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

Sum Of Pairs (cont'd)
• Heuristic way to incorporate the evolution tree:
  [Figure: tree over Human, Mouse, Duck, Chicken]
• Weighted SOP: $S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$, where $m^k$ and $m^l$ are rows $k$ and $l$ of the alignment $m$

A Profile Representation
Alignment (5 sequences, 14 columns):
  - A G G C T A T C A C C T G
  T A G - C T A C C A - - - G
  C A G - C T A C C A - - - G
  C A G - C T A T C A C - G G
  C A G - C T A T C G C - G G
Profile (one column per alignment column; 0 means the letter does not occur):
  A    0   1   0   0   0   0   1   0   0  .8   0   0   0   0
  C   .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
  G    0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
  T   .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
  -   .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0
• Given a multiple alignment $M = m_1 \dots m_n$:
  Replace each column $m_i$ with profile entry $p_i$
  • Frequency of each letter (A, C, G, T) in the column
  • Number of gaps
  • Optional: number of gap openings, extensions, closings
One can think of this as a "likelihood" of each letter in each position.

Multiple Sequence Alignments: Algorithms
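As a rough sketch of the objective that the algorithms in this part try to maximize, the sum-of-pairs score defined above can be computed by scoring every induced pairwise alignment. The function names (`pair_score`, `induced_pair`, `sum_of_pairs`) and the match/mismatch/gap values are assumptions chosen for illustration; the slides do not fix a particular $s(\cdot,\cdot)$.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy per-column score s(a, b); the numeric values are illustrative assumptions."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def induced_pair(row_k, row_l):
    """Induced pairwise alignment: drop columns where both rows have a gap."""
    cols = [(a, b) for a, b in zip(row_k, row_l) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

def sum_of_pairs(rows, weights=None):
    """S(m) = sum over k < l of w_kl * s(m^k, m^l); w_kl defaults to 1."""
    total = 0.0
    for k, l in combinations(range(len(rows)), 2):
        w = 1.0 if weights is None else weights[(k, l)]
        a, b = induced_pair(rows[k], rows[l])
        total += w * sum(pair_score(p, q) for p, q in zip(a, b))
    return total

# The three-sequence example from the Sum Of Pairs slide.
rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(induced_pair(rows[0], rows[1]))  # ('ACGCGG-C', 'ACGC-GAG')
print(sum_of_pairs(rows))
```

Passing a `weights` dictionary keyed by index pairs `(k, l)` gives the weighted variant, for example with weights derived from distances in the evolutionary tree.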
Multidimensional DP
Generalization of Needleman-Wunsch:
  $S(m) = \sum_i S(m_i)$   (sum of column scores)
  $F(i_1, i_2, \dots, i_N)$: score of the optimal alignment up to $(i_1, \dots, i_N)$
  $F(i_1, i_2, \dots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$, where the maximum runs over all neighbors of the cell in the cube

Multidimensional DP
• Example: in 3D (three sequences):
• 7 neighbors/cell
  $F(i, j, k) = \max \{$
    $F(i-1, j-1, k-1) + S(x_i, x_j, x_k)$,
    $F(i-1, j-1, k) + S(x_i, x_j, -)$,
    $F(i-1, j, k-1) + S(x_i, -, x_k)$,
    $F(i-1, j, k) + S(x_i, -, -)$,
    $F(i, j-1, k-1) + S(-, x_j, x_k)$,
    $F(i, j-1, k) + S(-, x_j, -)$,
    $F(i, j, k-1) + S(-, -, x_k)$ $\}$

Multidimensional DP
Running time:
1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$

Multidimensional DP
• How do gap states generalize?
• VERY badly!
  Require $2^N - 1$ states, one per combination of gapped/ungapped sequences
  Running time: $O(2^N \cdot 2^N L^N) = O(4^N L^N)$
[Figure: the seven states for three sequences X, Y, Z: X, Y, Z, XY, XZ, YZ, XYZ]

Progressive Alignment
[Figure: tree with leaves x, y, z, w; x and y join at profile $p_{xy}$, z and w join at profile $p_{zw}$, and the root is profile $p_{xyzw}$]
• When the evolutionary tree is known:
  Align the closest sequences first, in the order of the tree
  In each step, align two sequences x, y, or profiles $p_x$, $p_y$, to generate a new alignment with associated profile $p_{\text{result}}$
Weighted version:
  Tree edges have weights, proportional to the divergence in that edge
  The new profile is a weighted average of the two old profiles

Progressive Alignment
Example
Profile: (A, C, G, T, -)
  $p_x = (0.8, 0.2, 0, 0, 0)$
  $p_y = (0.6, 0, 0, 0, 0.4)$
  $s(p_x, p_y) = 0.8 \cdot 0.6 \cdot s(A, A) + 0.2 \cdot 0.6 \cdot s(C, A) + 0.8 \cdot 0.4 \cdot s(A, -) + 0.2 \cdot 0.4 \cdot s(C, -)$
  Result: $p_{xy} = (0.7, 0.1, 0, 0, 0.2)$
  $s(p_x, -) = 0.8 \cdot 1.0 \cdot s(A, -) + 0.2 \cdot 1.0 \cdot s(C, -)$
  Result: $p_{x-} = (0.4, 0.1, 0, 0, 0.5)$

Progressive Alignment
[Figure: sequences x, y, z, w joined by a tree that is unknown, marked "?"]
• When the evolutionary tree is unknown:
  Perform all pairwise alignments
  Define a distance matrix D, where D(x, y) is a measure of evolutionary distance, based on the pairwise alignment
  Construct a tree (UPGMA / Neighbor Joining / other methods)
  Align on the tree

Heuristics to improve alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
• …

Iterative Refinement
One problem of progressive alignment:
• Initial alignments are "frozen" even when new evidence comes
Example:
  x: GAAGTT
  y: GAC-TT    Frozen!
  z: GAACTG
  w: GTACTG
Once z and w are added, it is clear that the correct alignment of y is GA-CTT.

Iterative Refinement
Algorithm (Barton-Sternberg):
1. For j = 1 to N, remove $x_j$ and realign it to $x_1 \dots x_{j-1} x_{j+1} \dots x_N$
2. Repeat step 1 until convergence
[Figure: sequences x, y, z; x and z are held fixed as a projection while y is allowed to vary]

Iterative Refinement
Example: align (x, y), (z, w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA    + 3 matches
  z: GAACTGA
  w: GTACTGA

Iterative Refinement
Example not handled well:
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
Realigning any single $y_i$ changes nothing.

Consistency
[Figure: three sequences x, y, z with position $x_i$ in x, positions $y_j$ and $y_{j'}$ in y, and position $z_k$ in z]
Basic method for applying consistency
• Compute all pairs of alignments xy, xz, yz, …
• When aligning x, y during progressive alignment:
  For each $(x_i, y_j)$, let $s(x_i, y_j) = \text{function\_of}(x_i, y_j, a_{xz}, a_{yz})$
  Align x and y with DP using the modified $s(\cdot,\cdot)$ function

Some Resources
Genome Resources
• Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
• Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2
• ABC, a nice Stanford tool for browsing alignments: http://encode.stanford.edu/~asimenos/ABC/
Protein Multiple Aligners
• CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/
• MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
• PROBCONS (most accurate): http://probcons.stanford.edu/
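
Looking back at the Progressive Alignment example earlier in these slides, the profile-to-profile score and the merged profile can be sketched in a few lines. This is only a sketch under stated assumptions: `toy_s` is a hypothetical substitution score (the slides leave $s(\cdot,\cdot)$ unspecified), and `merge_profiles` uses equal weights, which is what reproduces the numbers in the example; the weighted version would pass weights derived from the tree edges.

```python
ALPHABET = "ACGT-"  # profile entries are ordered (A, C, G, T, -) as in the slides

def profile_score(p, q, s):
    """Expected column score: sum over letters a, b of p[a] * q[b] * s(a, b)."""
    return sum(pa * qb * s(a, b)
               for a, pa in zip(ALPHABET, p)
               for b, qb in zip(ALPHABET, q))

def merge_profiles(p, q, wp=0.5, wq=0.5):
    """New profile as a weighted average of two old profiles (equal weights assumed)."""
    return tuple(wp * a + wq * b for a, b in zip(p, q))

def toy_s(a, b):
    """Hypothetical scores: match +1, mismatch -1, letter against gap -2."""
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return -2.0
    return 1.0 if a == b else -1.0

px = (0.8, 0.2, 0.0, 0.0, 0.0)
py = (0.6, 0.0, 0.0, 0.0, 0.4)
GAP = (0.0, 0.0, 0.0, 0.0, 1.0)  # an all-gap column, used for s(px, -)

print(profile_score(px, py, toy_s))   # aligning column px with column py
print(merge_profiles(px, py))         # approximately (0.7, 0.1, 0, 0, 0.2)
print(profile_score(px, GAP, toy_s))  # aligning column px with a gap
print(merge_profiles(px, GAP))        # approximately (0.4, 0.1, 0, 0, 0.5)
```

Only the four terms written out on the slide contribute to `profile_score(px, py, toy_s)`, because every other product involves a zero frequency.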