							Sequence Alignment Continued                                         CS262 Winter 2005 Lecture 5, 1/13/05
Lecturer: Serafim Batzoglou                                                        Scribe: Daniel Woods



                             Sequence Alignment Continued


1       Previous Lecture Review
First, several topics from the previous lecture were reviewed.

1.1     Needleman-Wunsch with Affine Gaps

The affine gap technique allows alignments to be scored without a strictly linear gap penalty. Finding the
best-scoring alignment with this technique requires multiple arrays, because the maximum score at a given
cell is no longer determined by the scores of its predecessors alone (which would require only one array).
Instead, several candidate scores must be kept for each cell, one for each way the alignment can end there
(for example, in a match or inside an open gap), and later cells resolve which of them is used.

The affine gap method specifically has two parameters for scoring gaps. There is a gap “open” penalty
incurred once for each gap, regardless of the gap length. There is also a gap “extension” penalty which is
multiplied by the length of each gap. This generates a 2-part piecewise model for gap scoring.
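
As a small illustration of the two-parameter model just described, the penalty for a single gap can be
written as follows. The function name and the default parameter values are assumptions chosen for the
example; they are not values given in the lecture.

```python
def affine_gap_penalty(length, gap_open=5.0, gap_extend=1.0):
    """Penalty for one gap: the open penalty once, plus extension times length.

    The default values are illustrative assumptions only.
    """
    if length <= 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * length
```

With these assumed values, one gap of length 4 costs 9, while four separate gaps of length 1 cost 24, so
the model favors a few long gaps over many short ones.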

This method can be easily extended to handle more complicated piecewise models, although it will require
even more memory space.
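
The multiple-array idea can be sketched as below. This is a minimal, score-only version that keeps three
arrays (alignment ends in a match or mismatch, ends in a gap in y, ends in a gap in x). The array names and
all scoring values are assumptions made for the example, and the traceback is omitted.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=1.0, mismatch=-1.0, gap_open=2.0, gap_extend=0.5):
    """Best global alignment score of x and y under affine gap penalties."""
    m, n = len(x), len(y)
    # M: x[i-1] is aligned to y[j-1]
    # X: x[i-1] is aligned to a gap (gap in y)
    # Y: y[j-1] is aligned to a gap (gap in x)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        X[i][0] = -(gap_open + gap_extend * i)
    for j in range(1, n + 1):
        Y[0][j] = -(gap_open + gap_extend * j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Either open a new gap (pay open + one extension) or extend one.
            X[i][j] = max(M[i - 1][j] - gap_open - gap_extend,
                          Y[i - 1][j] - gap_open - gap_extend,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open - gap_extend,
                          X[i][j - 1] - gap_open - gap_extend,
                          Y[i][j - 1] - gap_extend)
    return max(M[m][n], X[m][n], Y[m][n])
```

A more complicated piecewise model would add further arrays of the same kind, which is where the extra
memory comes from.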

1.2     Linear-space Alignment Algorithm

This is a divide-and-conquer version of the Needleman-Wunsch algorithm. The basic premise is to first
find only where the midpoint of one sequence maps in the other, and then to treat each half as a smaller
problem of the same kind, solved recursively.

Figure 1 depicts this process graphically. The subproblems shrink quickly because large portions of the
grid never need to be revisited. Total running time stays proportional to the size of the grid, while
memory drops from quadratic to linear in the sequence lengths.
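
The midpoint step can be sketched as follows, assuming a simple linear gap penalty. The function names
and scoring values are assumptions for the example. Only two rows of the grid are held at any time, and
the full recursion on the two halves is left out.

```python
def nw_last_row(x, y, match=1, mismatch=-1, gap=-1):
    """Last row of the Needleman-Wunsch score grid for x vs y, in O(len(y)) memory."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev

def midpoint_split(x, y):
    """Return (j, score): where in y the best alignment crosses the middle row of x."""
    mid = len(x) // 2
    forward = nw_last_row(x[:mid], y)
    # Scores of the second half are obtained by aligning the reversed suffixes.
    backward = nw_last_row(x[mid:][::-1], y[::-1])
    totals = [forward[j] + backward[len(y) - j] for j in range(len(y) + 1)]
    best_j = max(range(len(y) + 1), key=totals.__getitem__)
    return best_j, totals[best_j]
```

The full algorithm would then recurse on (x[:mid], y[:j]) and (x[mid:], y[j:]).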








Figure 1 – Graphical depiction of the Linear-space Alignment Algorithm, which greatly reduces the
memory requirements of the Needleman-Wunsch algorithm


1.3     The Four-Russian Algorithm

The Four-Russian Algorithm was covered only conceptually in the previous lecture, and is summarized here
in review. The entire Needleman-Wunsch grid is divided into “t-blocks” (named for the variable t, the
length of a side). The key observation behind the algorithm is that the rightmost column and bottom row of
each t-block are completely determined by its leftmost column and top row, together with the two
subsequences that the block spans. This allows the edges of all t-blocks in an entire grid to be calculated
very quickly using a lookup table of possible t-blocks. This way, no calculations need to be done involving
the cells inside the t-blocks at this stage.
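
The key observation can be illustrated with a simplified sketch. Instead of precomputing a table of all
possible t-blocks, the code below memoizes each block the first time it is seen, keyed on the block's top
row and left column (expressed relative to the corner value) and its two subsequences. The scoring
constants and function names are assumptions for the example, and the sketch shows only the determinism
of block edges, not the speedup of the real method.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scoring, not from the lecture

@lru_cache(maxsize=None)
def _fill_block(top, left, xs, ys):
    """Fill one block and return (bottom_row, right_column), corners included."""
    rows, cols = len(xs), len(ys)
    grid = [list(top)] + [[left[i]] + [0] * cols for i in range(1, rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s = MATCH if xs[i - 1] == ys[j - 1] else MISMATCH
            grid[i][j] = max(grid[i - 1][j - 1] + s,
                             grid[i - 1][j] + GAP,
                             grid[i][j - 1] + GAP)
    return tuple(grid[rows]), tuple(row[cols] for row in grid)

def block_edges(top, left, xs, ys):
    """Look up a block by its edges relative to the corner, then shift back."""
    base = top[0]
    bottom, right = _fill_block(tuple(v - base for v in top),
                                tuple(v - base for v in left), xs, ys)
    return tuple(v + base for v in bottom), tuple(v + base for v in right)

def blocked_nw_score(x, y, t):
    """Global alignment score of non-empty x and y, computed one t-block at a time."""
    starts = range(0, len(y), t)
    tops = [tuple(GAP * j for j in range(c, min(c + t, len(y)) + 1)) for c in starts]
    for r in range(0, len(x), t):
        xs = x[r:r + t]
        left = tuple(GAP * i for i in range(r, r + len(xs) + 1))
        for b, c in enumerate(starts):
            # Only block edges are passed from one block to the next.
            tops[b], left = block_edges(tops[b], left, xs, y[c:c + t])
    return left[-1]
```

Because only edges travel between blocks, the score is the same for any block size t.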

Once that is complete, it can quickly be determined where the best alignment enters and exits each t-block
it crosses, and only those t-blocks then need to be examined in detail to find the exact alignment.








Figure 2 - Graphical depiction of the Four-Russian Algorithm, which further reduces the computational
requirements of an alignment


2      Heuristic Local Aligners
2.1    Background

The algorithms covered so far cannot be applied, as discussed, to today’s large genomic databases.
The amount of known genomic data has roughly doubled each year, and this trend has held steadily
for 15 years. Enormous efforts are underway to sharply reduce the cost of sequencing, so this rate may
even accelerate in coming years. More heuristic approaches are therefore necessary to handle the
enormous amounts of data.

When a new genome is sequenced, it would be extremely useful to a biologist to be able to compare it
against the entire genome database in order to find similar genes. The numbers of genes typically found
in some classes of life forms are as follows:

      •   Mammals:               ~25,000
      •   Insects:               ~14,000
      •   Worms:                 ~17,000
      •   Fungi:                 ~6,000 – 10,000
      •   Small Organisms:       100s – 1,000s

Finding the best mappings between genes throughout the entire genome database requires some new
techniques.


2.1       Indexing-based Local Alignment with BLAST (Basic Local Alignment Search Tool)

2.1.1 Dictionary

The basic premise behind this methodology is to create a dictionary of “words” from across the entire
genome database. This calculation is performed one time for a particular database, and can then be used
for many searches. Here, a “word” refers to a very short subsequence. With such a dictionary, a lookup
can be performed on any word and it will immediately return all locations in all genomes of the database
where this word is found.
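
A minimal sketch of such a dictionary is shown below, assuming the database is given as a mapping from
sequence names to strings. The function name and data layout are assumptions for the example, not the
structure used by BLAST itself.

```python
from collections import defaultdict

def build_word_index(database, k):
    """Map every length-k word to the list of (sequence name, position) where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index
```

The index is built once per database. Each later query word is then a single lookup, for example
index.get("ACG", []).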

Some immediate observations can be made about word length as it affects the usefulness of this algorithm.
If the word length is too short, the dictionary will show many hits that are not actually well aligned
sequences and sifting through them to find the correct one will require significant computation. If words
are too long, there may be good alignments that are missed because the word does not appear without a
substitution occurring within it.

Another important issue is that not all words occur with equal frequency. Some words are far more common
than others, in a quite predictable way. This makes it difficult to “even out” the dictionary so that each
word returns roughly the same number of matches, which in turn makes it harder to choose a single best
word length. One approach is simply to exclude particularly repetitive words from the dictionary, so that
matches returned for the words that remain are more likely to correspond to real matches.
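
Excluding repetitive words can be sketched as a simple filter over the dictionary. The cutoff max_hits
is a hypothetical parameter introduced for the example; the lecture does not give a specific threshold.

```python
def drop_frequent_words(index, max_hits):
    """Keep only the words that occur at most max_hits times in the database."""
    return {word: hits for word, hits in index.items() if len(hits) <= max_hits}
```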

2.1.2     Alignment

For a particular match returned by the dictionary, the goal is to score the alignment using the subsequences
located just before and after the query word in both the query sequence and the returned position in the
genome database. Using the Needleman-Wunsch scoring methods, there are still several ways to find the
best scoring matches.








Figure 3 - A naive approach to extending a sequence match found by a dictionary lookup



A naïve approach, as described graphically in Figure 3, is to simply extend the match out in each direction,
applying rewards and penalties until the score drops below a certain threshold from the maximum score
obtained along the way. It will be considered a good match if the returned score is above a certain
threshold. The problem with this approach is that it does not allow for any gaps to occur in the optimal
alignment.
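
The naive gap-free extension can be sketched as follows. The scoring values and the drop-off threshold
x_drop are assumptions for the example. The extension in each direction stops once the running score has
fallen more than x_drop below the best score seen in that direction.

```python
def extend_ungapped(query, target, q_pos, t_pos, k, match=1, mismatch=-1, x_drop=3):
    """Extend an exact k-long word hit in both directions without gaps.

    Returns (score, q_start, q_end), with q_end exclusive.
    """
    def extend(qi, ti, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            score += match if query[qi] == target[ti] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # fallen too far below the best score so far
            qi += step
            ti += step
        return best, best_len

    right_score, right_len = extend(q_pos + k, t_pos + k, +1)
    left_score, left_len = extend(q_pos - 1, t_pos - 1, -1)
    score = k * match + left_score + right_score
    return score, q_pos - left_len, q_pos + k + right_len
```

A hit would then be reported only if the returned score is above a chosen reporting threshold.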








Figure 4 - Using an expanded band to allow for some gaps in subsequence alignment

To allow for the possibility of gaps in the best alignment, a simple approach is to examine not just the
single diagonal line that assumes no gaps, but a wider band around it that extends beyond each end of
the match. This method is demonstrated graphically in Figure 4.
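
A band of fixed width can be sketched as a Needleman-Wunsch computation restricted to cells within band
positions of the main diagonal. The function name and scoring values are assumptions for the example, and
only the score is computed.

```python
NEG_INF = float("-inf")

def banded_nw_score(x, y, band, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only cells (i, j) with |i - j| <= band."""
    m, n = len(x), len(y)
    if abs(m - n) > band:
        raise ValueError("band is too narrow to contain the end cell")
    # Row 0: only the cells with j <= band lie inside the band.
    prev = {j: j * gap for j in range(min(n, band) + 1)}
    for i in range(1, m + 1):
        curr = {}
        for j in range(max(0, i - band), min(n, i + band) + 1):
            if j == 0:
                curr[j] = i * gap
                continue
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Cells outside the band are treated as unreachable.
            curr[j] = max(prev.get(j - 1, NEG_INF) + s,
                          prev.get(j, NEG_INF) + gap,
                          curr.get(j - 1, NEG_INF) + gap)
        prev = curr
    return prev[n]
```

The work is proportional to the band width times the sequence length rather than to the full grid, at
the price of missing alignments whose gaps would carry them outside the band.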








Figure 5 - A smarter way of examining the match's context in order to quantify match quality by expanding as
necessary


As depicted in Figure 5, the most commonly used method of accomplishing this alignment and scoring is
to widen the area of analysis as the analysis proceeds. If a particular direction is generating good
alignment scores, examination continues in that direction. Examination stops in a particular direction when
scores in that direction fall below a threshold from the highest score previously obtained.
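
One way to sketch this adaptive region is a row-by-row dynamic program in which any cell scoring more than
x_drop below the best score seen so far is discarded, so the set of live cells widens or narrows as the
scores dictate. This is a simplified illustration under assumed scoring values, not the exact procedure
used by any particular tool. Here x and y are the sequences to one side of the matched word; for the other
side, the reversed prefixes would be passed in.

```python
def xdrop_gapped_extend(x, y, x_drop, match=1, mismatch=-1, gap=-1):
    """Best score of a gapped alignment between a prefix of x and a prefix of y."""
    best = 0
    prev = {0: 0}  # live cells of the previous row: column -> score
    j = 1
    while j <= len(y) and j * gap >= best - x_drop:
        prev[j] = j * gap
        j += 1
    for i in range(1, len(x) + 1):
        curr = {}
        # Consider only columns adjacent to live cells of the previous row.
        for j in range(min(prev), min(len(y), max(prev) + 1) + 1):
            candidates = []
            if j in prev:
                candidates.append(prev[j] + gap)
            if j - 1 in prev:
                s = match if x[i - 1] == y[j - 1] else mismatch
                candidates.append(prev[j - 1] + s)
            if j - 1 in curr:
                candidates.append(curr[j - 1] + gap)
            if candidates and max(candidates) >= best - x_drop:
                curr[j] = max(candidates)
                best = max(best, curr[j])
        if not curr:
            break  # every direction has dropped below the threshold
        prev = curr
    return best
```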

2.1.3   Improvements to Indexing Techniques

As already discussed, there is a tradeoff between the speed and the sensitivity of a word-based search.
This tradeoff is governed by word length: shorter words are more sensitive, while longer words return
fewer hits and are therefore faster to search with. A quantitative analysis of this tradeoff can be seen
in Figure 6.
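
The direction of this tradeoff can be reproduced with a back-of-the-envelope calculation. The sketch
below assumes that each position of a true match is conserved independently with probability identity, and
that the database is a random sequence with equally frequent letters. These are simplifying assumptions
made for the example; they are not the model behind Figure 6.

```python
def word_tradeoff(k, identity, db_size, alphabet=4):
    """Return (chance a given k-word of a true match is intact, expected random hits)."""
    p_word_conserved = identity ** k            # falls as k grows: less sensitive
    expected_chance_hits = db_size / alphabet ** k  # falls as k grows: faster
    return p_word_conserved, expected_chance_hits
```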








                                                          Kent WJ, Genome Research
Figure 6 - The Sensitivity vs. Speed (implied by number of hits shown at bottom) for various word lengths (K)
and mismatch rates.


Three techniques were discussed that can improve on these numbers. The first is to allow inexact words.
For example, the dictionary may return all words with at most one mismatch from the query word. This
allows longer words to be used without requiring that they be found with zero substitutions. Figure 7
shows the resulting statistics, which are clearly more favorable for computation than those in Figure 6.
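
One simple way to sketch a one-mismatch lookup is to enumerate every word within one substitution of the
query word and look each of them up in an exact-match dictionary. The function names and the plain
dictionary layout are assumptions for the example.

```python
def one_mismatch_neighbors(word, alphabet="ACGT"):
    """All words that differ from word in at most one position."""
    neighbors = {word}
    for i, original in enumerate(word):
        for base in alphabet:
            if base != original:
                neighbors.add(word[:i] + base + word[i + 1:])
    return neighbors

def lookup_inexact(index, word):
    """Collect dictionary hits for word and for all of its one-mismatch neighbors."""
    hits = []
    for candidate in sorted(one_mismatch_neighbors(word)):
        hits.extend(index.get(candidate, []))
    return hits
```

A word of length k over a four-letter alphabet has 3k + 1 such neighbors, so each query word costs 3k + 1
exact lookups.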




                                                            Kent WJ, Genome Research

Figure 7 - The resulting Sensitivity vs. Speed chart when the words are allowed to have one mismatch







The second improvement is to look up words in pairs and return only instances where both words are found
within a specific, relatively short distance of each other. Figure 8 shows the resulting statistics, which
are clearly more favorable for computation than those in Figure 6.
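
The pairing step can be sketched as follows, given the positions at which each of the two words was found
on one database sequence. The function name and the max_distance parameter are assumptions for the
example.

```python
def two_word_hits(positions_a, positions_b, max_distance):
    """Pairs (a, b) where the second word starts after the first, within max_distance."""
    pairs = []
    for a in sorted(positions_a):
        for b in sorted(positions_b):
            if 0 < b - a <= max_distance:
                pairs.append((a, b))
    return pairs
```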




                                                             Kent WJ, Genome Research
Figure 8 - The resulting Sensitivity vs. Speed chart when two words are used allowing no substitutions within
them


The third technique is to query on a pattern of non-consecutive positions. By doing this, the distribution
of words becomes far more uniform (whereas the contiguous-word dictionary already discussed is known to
have some words occur far more often than others).
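
Pattern-based words can be sketched by writing the pattern as a string of 1s (positions that are read) and
0s (positions that are skipped). The short pattern used in the usage note is a made-up example, not the
pattern shown in Figure 9.

```python
def spaced_word(seq, pos, pattern):
    """Characters of seq at the '1' positions of pattern, starting at pos."""
    return "".join(seq[pos + i] for i, flag in enumerate(pattern) if flag == "1")

def build_spaced_index(seq, pattern):
    """Map each pattern-derived word to the positions where it starts in seq."""
    index = {}
    for pos in range(len(seq) - len(pattern) + 1):
        index.setdefault(spaced_word(seq, pos, pattern), []).append(pos)
    return index
```

For example, with the pattern "1101", the sequences ACGT and ACTT produce the same word, ACT, so a
substitution at the skipped position does not prevent a hit.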




Figure 9 - Demonstration of Sensitivity using patterns





As can be seen in Figure 9, higher sensitivity can be obtained by the 11-position pattern shown than by
the 10-position word. Sensitivity can be further increased by using many patterns in the same dictionary.



