WEEK 1, LECTURE 2 Dynamic Programming in Sequence Alignment and its Motivation (AKA Whereas the first lecture was very biologically oriented, this second lecture will be much more computational.) Serafim Batzoglu; Scribed by Lekan Wang 1. DNA SEQUENCING 1.1 WHY SEQUENCE THE GENOME? 1.1.1 FULL GENOME SEQUENCING Complete DNA sequencing is a relatively new thing. While the biggest breakthrough is often considered to be the Human Genome Project, over 1000 other full genomes have been sequenced besides that of the human. Most of these genomes are those of microorganisms several thousand times smaller than the human genome (where microorganisms are around 5 million base pairs (BPs), and mammals are typically around 3 billion BPs), and because the cost to sequence a genome is currently approximately proportional to the length of the genome, these microorganisms are significantly cheaper to sequence. This rapid sequencing is still going on; if you go to http://genome.gov/10002154, you can see a very long list of organisms that have been approved and funded for sequencing just in the United States. Notice that many of the approved sequencing targets are primates, cats, and other large mammals. The cost of sequencing has also fallen dramatically since the Human Genome Project. The full cost of the project was at least $3 billion, and took more than a full decade to complete. As soon as the Project completed, the sequencing of the mouse genome began in 2000. Though the length of the genome was comparable to the human genome, the budgeted cost was only slightly greater than $300 million. Since then, costs have dropped even more substantially, and now, at the start of 2010, the cost to sequence a full mammalian genome is only around $1 million. In full, around forty mammalian genomes have been sequenced thus far at a cost of several billion dollars. 1.1.2 COMPARATIVE GENOMICS While some of the organisms we sequence are model organisms useful for work in the lab, such as the mouse, yeast, E. coli, etc, and some others are animals with near universal appeal, such as cats, dogs, etc., some of the remaining choices of animals to sequence are head‐scratchers. Why did we choose to sequence the fruit bat, or the lemur, or the sloth, or the hundreds of other species? Why are they special, and why are we investing millions for their DNA sequence? The answer is comparative genomics. By comparing the genomes of different species, and seeing whether the gene is seen across a wide variety of species, or only endemic to a few, we can often deduce important facts about phenotypes. For example, if a gene is conserved across species ranging from humans all the way to yeast, it probably has something to do with basic eukaryotic function, and probably unlikely has WEEK 1, LECTURE 2 anything to do with growing limbs. This is comparative genomics, and one of the most important reasons why so many mammals have been sequenced. 1.2 EVOLUTION AT THE DNA LEVEL Every time we copy DNA, polymorphisms occur that accumulate at specific rates across generations. There are several types of polymorphisms, as illustrated below. The first general category is the sequence edit, where a single nucleotide changes. This can occur as either a deletion from the sequence, a mutation, or an insertion. Mutations are the most common type of polymorphism. Notice that insertions and deletions are often frame‐shifts, meaning that the sequences of three nucleotides that comprise the code for an amino acid will be offset, and hence, new amino acids will be encoded. We have a good estimate for the rate of these single mutations, at approximately 10‐9. For insertions and deletions, the probability of occurring have a geometric distribution on the length of the inserted or deleted sequence. The second general category is the rearrangement. In the illustration below, the original sequence can change in three ways. The first is an inversion, in which part of the sequence changes direction; the second is a translocation, in which two subsequences change places; and the third is a duplication, in which a subsequence is repeated in a future generation. WEEK 1, LECTURE 2 All of these mutations have physical biological explanations. Take, for example, the inversion. This can happen when DNA crosses over itself, as shown below. DNA may cross over itself, then pinch off and recombine again with the opposite end of the strand, which essentially will reverse the direction of the DNA for the pinched off segment. Duplication polymorphisms—or, as it’s often referred to, copy number variations—have received a fair share of buzz in the media lately as it relates to certain diseases and phenotypes. Having copy number variations in an exonic region of the genome will obviously increase protein production of the encoded protein, which may lead to phenotypic changes. 1.3 COMPARING GENOMES All the variations above occur at some specified rate. Many variations occur in an intronic segment of the DNA (“junk” DNA), or are synonymous (silent) mutations, so there will be no phenotypic effect, and will accumulate randomly according to the known rates. However, other mutations will affect phenotype. Most of these will be deleterious, leading to a significant change in the fitness of the organism, leading to the pruning off of the polymorphism by the effects of evolution. Some of the mutations will be beneficial, and be positively selected and eventually overtake a population. WEEK 1, LECTURE 2 So, now we can look at the sequence below of small portions of DNA mapped and aligned across many different species. The purple segments are coding segments, and the plots at the top of each alignment is the level of conservation at each base pair across the species we’re mapping. From this, we see that most of the conservation occurs at exonic regions, since, as we mentioned before, polymorphisms in those areas are more likely to cause phenotype changes, and most changes will probably be deleterious. We also notice that by removing the fish and chicken from the alignment, and only retaining the mammals, the amount of conservation goes up dramatically. Looking at the alignment with all the species again, only the functional segments are most conserved, then as we move to the alignments of more similar species, more genes are conserved, and even non‐functional areas. One question that always comes up in comparative genomics research is what species to include in the comparison in order to maximize signal, but still be selective. After the Human Genome Project completed, the first animal to be sequenced was the mouse, since it was considered the perfect species to compare to humans to determine functional segments of the human genome. This was true for several reasons. The mouse is a placental mammal, like humans; the mouse has a much shorter generation time and a much higher population, allowing mice to evolve much faster than humans. Both of these reasons lend themselves to making the mouse a perfect species with which to do functional comparative genomics. If we were to sequence any animal today after the human, it would still be the mouse or rat. 2 SEQUENCE ALIGNMENT 2.1 MOTIVATION In order to study the conservation across species, we must first find a way of intelligently aligning the different sequences. Hence, the problem of sequence alignment is the simple concept of aligning two sequences and potentially inserting gaps in between letters of the two sequences to reveal the similarity between the two sequences. So, given the two sequences AGGCTATCACCTGACCTCCAGGCCGATGCCC WEEK 1, LECTURE 2 and TAGCTATCACGACCGCGGTCGATTTGCCCGAC a possible alignment is -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC When aligning between species, we’re making the implicit evolutionary hypothesis that the letters that correspond come from the most recent common ancestor between the two organisms in the evolutionary tree. This is interesting because most of the genome is lost when it gets passed down from generation to generation. You have 1024 ancestors in the 10th generation, but you only have one genome. Between any two people, we can trace the genomes back to some ancestor long ago that gave the sequence to both, and the same is true for individuals of different species. 2.2 ALIGNMENT So, what makes a good alignment? If we have the example sequence AGGCTAGTT and AGCGAAGTTT, we can have many alignments. Three possible alignments are: AGGCTAGTT‐ AGCGAAGTTT yielding 6 matches, 3 mismatches, and 1 gap AGGCTA‐GTT‐ AG‐CGAAGTTT yielding 7 matches, 1 mismatch, and 3 gaps AGGC‐TA‐GTT‐ AG‐CG‐AAGTTT yielding 7 matches, 0 mismatches, and 5 gaps So which one is best? Well, that would depend on our scoring function. A scoring function in sequence alignment is a function of the number of matches, mismatches, and gaps. We can assign some score, m, to each match, some penalty, s, to each mismatch, and some penalty, d, to each gap. An alternate definition uses the minimal edit distance: “Given two strings x, y, find the minimum # of edits (insertions, deletions, mutations) to transform one string to the other.” 2.3 COMPUTING THE ALIGNMENT We can think of an alignment as just a path through a graph with each sequence on an axis. By convention, both sequences start in the top‐left, and one goes horizontally while the other goes vertically. Starting at the top‐left, we can move one to the right (+1,0), one down (0,+1), or one diagonally (+1,+1). A (+1,0) move corresponds to inserting a gap in the sequence Y (the sequence along the Y axis), a (0,+1) WEEK 1, LECTURE 2 move corresponds to inserting a gap in the sequence X (the sequence along the X axis), and a (+1,+1) move corresponds to a either an alignment if the letter match, or a mismatch if they don’t. See the figure below. Finding the best alignment by exhaustive search is an exponential‐complexity problem. Oh no. Fortunately, we note that alignments are additive. That is, we note that the score of aligning x1…xm to y1…yn is just the sum of the score of aligning x1…xi to y1…yj and the score of aligning xi+1…xm to yj+1…yn. This lends itself very well to dynamic programming. 2.4 THE DYNAMIC PROGRAMMING SOLUTION Dynamic programming is a way of solving an exponential‐complexity problem in a polynomial subproblems, each solved using the result from previous subproblems. It depends on a table of cached solved subproblems, a process called memorization. In our sequence alignment problem, notice that there are three possible cases at each point in our alignment: 1. xi aligns to yj, in which case we did a (+1,+1) move, and so the score, F(i,j) = F(i‐1,j‐1)+m if xi=yj, or F(i‐1,j‐1)‐s if there was a mismatch. 2. xi aligns to a gap, in which case we did a (+1,0) move, and so the score, F(i,j) = F(i‐1,j)‐d 3. yj aligns to a gap, in which case we did a (0,+1) move, and so the score, F(i,j) = F(i,j‐1)‐d WEEK 1, LECTURE 2 Once we have the three scores, we can just take the max of the three for the score for our current cell, since we are seeking to maximize our alignment score. This depends on the inductive assumption that F(i,j‐1), F(i‐1, j) and F(i‐1,j‐1) are all themselves optimal. You can prove optimality using induction. Depending on which of the three options are chosen, a back‐pointer is created to point to the cell from which we calculated the current cell’s score. Then, to produce the final alignment, we can start at the end of the sequence, and just follow our back‐pointers until we reach the beginning of the sequence. This algorithm is called the Needleman‐Wunsch algorithm, and the matrix of alignments of subsequences is called the Needleman‐Wunsch Matrix. Every non‐decreasing path through the matrix from (0,0) to (m,n) corresponds to an alignment of the two sequences. The original Needleman‐Wunsch algorithm took O(N3) time somehow. The modern algorithm is just O(N2), or more specifically, O(NM) in both time and space. WEEK 1, LECTURE 2 The pseudo‐code for the Needleman‐Wunsch Algorithm is as follows: 1. Initialization. F(0, 0) = 0 F(0, j) = ‐ j × d F(i, 0) = ‐ i × d 2. Main Iteration. Filling‐in partial alignments For each i = 1……M For each j = 1……N F(i – 1,j – 1) + s(xi, yj) [case 1] F(i, j) = max F(i – 1, j) – d [case 2] F(i, j – 1) – d [case 3] DIAG, if [case 1] Ptr(i, j) = LEFT, if [case 2] UP, if [case 3] 3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment 2.5 VARIANTS Often, it is considered acceptable to have an unlimited number of gaps in the beginning and the end of a sequence. We can easily accommodate this by initializing 0’s in the first row and column of the alignment matrix, and then after aligning as usual, we then take the max in the final row and column as our starting point in tracing back through the matrix. This variant is called the overlap detection variant. The alignment matrix may end up looking something like this: WEEK 1, LECTURE 2 Another variant that is extremely useful and widely used is the local alignment problem. This algorithm variant even has its own name, the Smith‐Waterman algorithm. Given two strings, x and y, we want to find two substrings, x’ and y’ whose similarity (optimal global alignment value) is maximized. Note that this is different from the overlap detection variant in that we no longer require that we begin and end our aligned sequences at the end of one of the original sequences. So, for example, given the strings aaaacccccggggtta and ttcccgggaaccaacc, we find a local alignment of cccggg in the sequences. Local alignment is especially useful in biology because we are often not sure which parts of the genome to align. Genes are shuffled between genomes, and polymorphisms change the expected locations of genes. The Smith‐Waterman algorithm can be thought of a a two‐dimensional extension of the greatest maximal subsequence problem. After initializing like in the overlap detection variant, all we need to do is to change the iteration condition to the max of the three conditions we had before and 0. Hence, if we ever have an alignment that has a negative score, we just ignore that alignment. There are other variants, like the Waterman‐Eggert for finding all local alignments with a score greater than some threshold. 2.6 SCORING FUNCTIONS In our current model of scoring, a gap of length n incurs a penalty nxd, so the penalty is linear to the length. We can write that γ(n)=nd. However, gaps do occur in bunches, so gaps after the first one don’t do as much harm as the initial gap, so we want our gap function to be convex, that is, we want that for γ(n), for all n, γ(n + 1) – γ(n) ≤ γ(n) – γ(n – 1). So, for consecutive gaps in a sequence, the penalty of extending the gap one more letter is less than or equal to the cost of the last extension. So, the convex gap dynamic programming algorithm looks something like this: 1. Initialization: same 2. Iteration: F(i – 1, j – 1) + s(xi, yj) F(i, j) = max maxk=0…i‐1F(k, j) – γ(i – k) maxk=0…j‐1F(i, k) – γ(j – k) 3. Termination: same This algorithm has running time O(N2M) and space complexity of O(NM), assuming N>M. But what if we can’t afford O(N2M)? We can then compromise with using affine gap penalties. We now introduce d, the gap opening penalty, and a separate e, the gap extension penalty. This simulates the convex gap function, but will give us O(NM) performance. We will complete sequence alignment and the affine gap algorithm in the next lecture.
Pages to are hidden for
"Dynamic Programming in Sequence Alignment and its"Please download to view full document