Document Sample
Searching Powered By Docstoc



School of Computer Science

Scott Hazelhurst

February/March 2005


1 Introduction
Searching and matching is fundamental to much of bioinformatics: • function follows shape; • shape follows sequence; • similar sequence means similar form Understanding similarities and differences gives insight • can be visual or numeric • can be global or local


Matching can be exact: • are these sequences the same? • does this (short) sequence occur in this (long) sequence? Matching can be approximate: • are these sequences similar? (what does it mean to be similar?) • how similar (or different) are these sequences? (how do we measure) And what about 3D matching?


Exact matching: • can be performed very efficiently (linear or even better) e.g consider the performance of Google. Inexact matching: • different models; • need to take into account the biology • computationally more challenging


Will look at • Dot Plots Good tool for visualisation • Edit distance, alignment and similarity Smith-Waterman, Needleman-Wunsch • Heuristic tools Blast and others


Why do we do approximate matching? • evolution The DNA of organisms change Individual bases change, are deleted, inserted. Can be gross changes: gene re-ordering, chromosomal splitting joining (much rarer, but very important) • lab errors, contamination, sequencing errors


2 Dot Plots
• Find regions of similarity between two sequences • Find repeated regions within a sequence Two dimensional picture: graph sequence r1 against sequence r2 as follows: • r1 is represented on the x-axis; • r2 is represented on the y-axis; • put a dot at position i, j iff i-th nucleotide/amino-acid are the same. r1 [i] == r2 [j]


Plotting at the nucleotide/amino-acid level is too fine! Probability high that two positions match by chance: • choose a sliding window k, threshold θ • put a dot at position (i, j) if a word of length k starting at position i matches a word of length k at position j with at least a score θ. Problem of selectivity versus sensitivity: • make k too small, θ too small matches happen by chance; • make k too big, θ too big, miss real matches




Can also do an n versus n comparison


3 Alignments
Want to compare two sequences: We hypothesise that they came from a common ancestor through: • mutations • insertions, deletions How similar are they? Try to align them • ACGACTTTACTTAACCGAGGGTAGTC • AGACTTTACTATGAAACGAGTGGCAGTC ACGACTTTACT-T-AACCGA-GGGTAGTC A-GACTTTACTATGAAACGAGTGGCAGTC


Once you have an alignment you can score it: • reward matches • penalise mismatches • have costs for indels • usually have some cost model – transitions, transversions – gap creation, extension penalties • give a scoring matrix – more later Goal: find the optimal alignment – find fewest number of edit operations required to transform one into the other.



Global alignment

Find the best overall alignment between two sequences overall • known by dynamic programming algorithm used to solve it: Needleman-Wunsch • basic idea: – trivial to compute optimal alignment of simple sequences – build optimal alignment of bigger sequences from smaller sequences. Illustrate using unit cost model: reward of 1 for a match, −1 for a mismatch or indel.


Suppose you have two empty sequences: • optimal alignment cost is 0 Suppose you have two sequences each with exactly one character • If the two characters are the same, get the reward for the match • If the two charactes are different, pay the penalty for a mismatch. Look at the general case: • sequence x = c0 c1 . . . cn−1 • sequence y = d0 d1 . . . dm−1


To find the score of the optimal alignment for x and y consider each of the following three cases: • Reward/Penalty for matching/mismatching x0 with y0 plus Score of alignment of x1 x2 . . . xn−1 with y1 y2 . . . ym−1 • Penalty of a deletion of x0 from x plus Score of alignment of x1 x2 . . . xn−1 with y0 y1 . . . ym−1 • Penalty of deletion of y0 from y plus Score of alignment of x0 x1 x2 . . . xn−1 with y1 . . . ym−1 Choose the best.

ed (ATAT, CAT) = best of A −1 + ed (TAT, AT) (mismatch) B −1 + ed (TAT, CAT) (insert in ATAT, gap in CAT) C −1 + ed (ATAT, AT) (gap in ATAT, insert in CAT)


(A) ed (TAT, AT) = best of A1 −1 + ed (AT, T) A2 −1 + AT , AT ed (AT, AT) = best of – 1 + ed (T, T) – −1 + ed (T, AT) – −1 + ed (AT, T)

A3 −1 + ed (T, AT) And so and so forth: caclulate all, subsitute back.


Dynamic programming approach: • e[i, j] is the score of aligning xi xi+1 . . . xn−1 and yj yj+1 . . . ym−1 . • c(i, j) is reward/penalty for matching xi with yj • δ cost for a deletion • Represent e as an n + 1 × m + 1 matrix. • Fill in the matrix in stripes • e[n, m] is easy to compute: 0 • For each other element in row n or column m e[n, j] = δ + e[n, j + 1], e[j, m] = δ + e[j + 1, m]


• For each other element:   c(i, j) + e[i + 1, j + 1]   e[i, j] = max δ + e[i + 1, j]    δ + e[i, j + 1]

      

Fill in the matrix from bottom-right to top-left • can compute the overall score; • can reconstruct what the alignment is depending on choice at each stage;


Work out best optimal alignment of ATTT and ACTAT. • +1 for a match, 0 for mismatch or gap A A T T T C T A T



Semi-global alignments

NW algorithm above does global alignment: compare sequence x to sequence y. Sometimes have a short sequence x and a long sequence y and want to find the best alignment of x to some substring of y. • semi-global alignment • allow free deletions at either end of the long sequence



Local alignment

Known as Smith-Waterman algorithm. Given two sequences x and y find the optimal alignment of some sub-string of x with some substring of y. Need to balance: • Want as few errors, deletions as possible; • Want the match to be as long as possible. Approach: amend our DP algorithm above so that at each stage there is a new choice • can throw away the match found so far, and can start again


 c(i, j) + e[i + 1, j + 1]     δ + e[i + 1, j] e[i, j] = δ + e[i, j + 1]     0

Match/mismatch Gap in y Gap in x Start again

Can always avoid errors, but then may only have short local alignments.


4 Heuristic searches
DP approaches are seen as the most accurate. • Expensive: quadratic in the size of the
10000 x x*log(x) x*x





0 0 20 40 60 80 100


Balance performance against accuracy. Basic idea: • inexact matching is slow, exact matching is fast • for one sequence to match another approximately: – parts of the two sequences must match exactly • Find possible places of matching, refine further • Different approaches with different trade-offs




User chooses a paramter ktup – substring length (commonly 2 or 1 for amino acids, 6 for nucleotides0. Build an approximate DP table as follows: • Find all hotspots (i, j) in the table where a ktup length substring of x starting at i matches a substring of y starting at j. • Find and evaluate runs of hotspots in the diagonals — Weight for length, penalise for gaps. • Return the best 10. • Score using biological information (e.g. PAM, BLOSUM): get sub-alignments • Try to combine sub-alignments to allow gaps. • Apply Smith-Waterman on the sub-matrices of interest.



Basic Local Alignment Search Tool (Altschul, Gish, Miller, Myers, Lipman, 1990) Generic algorithm: • Fix cut-off score C, length w (about 12) threshold θ. • To compare strings S1 and S2 , • Find all sub-strings of length w that match in both and have a score above θ This can be done very efficiently. • Extend these matches as much as possible: find locally maximal matching pairs. • Return matches above the cut-off score. Original BLAST did not deal with gaps. Latest versions of BLAST do

Only generic versions of algorithm described – have been many extensions and improvements. BLAST is seen as faster & better than FASTA but mileage differs. NB: Both are heuristics — can miss interesting matches


5 Substitution matrices
Important to incorporate biological information. • Some differences are more biologically important than others. Must apply mind to the problem: • What are you comparing? nucleotides or amino acids? • Do you know how similar the sequences are?



Basic nucleotide substitution models

Will explore evolutionary models in more detail later. Here are three basic approaches.
Identity matrix A A T C G 1 0 0 0 T 0 1 0 0 C 0 0 1 0 G 0 0 0 1 A T C G BLAST matrix A 5 −4 −4 −4 T −4 5 −4 −4 C −4 −4 5 −4 G −4 −4 −4 5 A T C G Transition/transversion matrix A 1 −5 −5 −1 T −5 1 −1 −5 C −5 −1 1 −5 G −1 −5 −5 1



Amino-acid models

PAM – point/percent accepted mutation • obtained by studies of similar sequences • different variants depending on how closely related the sequences are PAM-x: expect there to be x mutations per 100 characters. Use larger x values for more distantly related matrices. Empirical studies done to produce. Many tools allow you to select.


Problems: • determining how far this is • have extant sequences only (particularly for amino-acids) • how to handle gaps do we allow? how do we weight opening a gap? how do we weight exending? PAM-250 often used for evolutionary studies, but you must apply your mind to the data.


BLOSUM (blocks substiution matrix) Obtained from studies of conserved regions in distantly related matrix. Also parameterised: BLOSUM-x • Roughly: BLOSUM-x is useful for comparing sequences with x% similarity. Lower numbered matrices are used for more distantly related sequences. Good when there are blocks of conserved amino acids. BLOSUM-62 is a commonly used sequence Again: you must apply your mind to the problem, and understand the biology (e.g. even understand the 3D structure).

What are you comparing? Nucleotide vs amino-acids Good rule of thumb: using standard tools it is better to compare amino-acids than nucleotides • but must understand the biology Why? • amino acids scoring matrices allow more meaninful alignments e.g. a change of A −→ C may have very different effects depending on where it is • do need to know ORF or try all 6 reading frames. • sometimes nucleotide comparison is the best


Example tools: • BLASTP: source protein in a protein database; • BLASTN: source nucleotide sequence in a nucleotide sequence DB best for high-scoring matches • BLASTX: source nucleotide sequence in a protein database • TBLASTN: source protein sequence in a nucleotide database • TBLASTX: source nucleotide sequence in a nucleotide database first translating to amino acids.



Interpretation of results

A match can happen because: • biologically relevant and interesting; • chance: e.g. small query sequences will occur by chance in any reasonable size database with absolutely no biological evidence Should decide whether statistically interesting. Topic of another lecture.


6 Distance and similarity
Duals of each other • large distance implies low similarity and vice-versa • scores in matrices change their roles Have already seen Edit or Levenshtein distance • score of the optimal alignment NB: typically when seen as a distance, penalties are positive numbers, rewards are 0. Another simple one is Hamming distance: • number of places two sequences match after an alignment is done.

There are many other distance functions, based upon both empirical and statistical studies. Many are word statistic based. • we can either fix a word size or consider all words. Suppose we fix a word length k • Let w be be all the possible words of length k • Let cw (y) be the number of times that the string w appears in the string y.


Given two sequences x and y Manhattan distance: d(x, y) = abs( d2 distance: d2 (x, y) =
w∈W (cw (x) w∈W

cw (x) − cw (y))

− cw (y))2 xs

Often used in clustering applications: • sensitive to repeats


Shared By:
Tags: Searc, hing
Description: Searching