# Overview of Sequence Alignment and Database Searches

```
Overview of Sequence Alignment and Database Searches
John Hamilton, Michigan State University

Sequence Alignment
• Sequence alignment is a set of algorithmic methods for determining the similarity between two sequences.
• Pairwise alignment generates an optimal alignment between two sequences and is useful for database searches.
• Dot plots, dynamic programming, and word-based methods are three common approaches used in bioinformatics.

Dot Plots
• A method of sequence alignment that produces a visual output.
• The two sequences form the axes of a matrix; a dot is placed in the matrix wherever the two sequences match.
• Similar regions form a diagonal.
• Repeats, inversions, and indels can be identified.
• Filtering is usually needed to reduce noise.
• Programs: Dotter, dotplot
• Web tool: DNADot - http://www.vivo.colostate.edu/molkit/dnadot/index.html

Dot Plot Examples

Dynamic Programming Approaches
• These approaches use a method called dynamic programming to find the optimal sequence alignment.
• The algorithm uses a scoring matrix and a gap scoring scheme in its calculations.
• The algorithm can find the optimal global alignment (Needleman-Wunsch) or the optimal local alignment (Smith-Waterman).
• These approaches are computationally intensive (the work grows with the product of the two sequence lengths) and therefore slow, but they provide the optimal solution.

Local vs Global Alignments

For sequences that are divergent, the optimal global alignment introduces gaps that can hide biologically relevant information such as motifs.

Biological Relevance of Similarity
• Homology: similarity arising from shared ancestry.
• Regions of similarity in a sequence alignment may be homologous.
• There are two categories of homology:
–  Orthologs: similar genes in different species that descend from a common ancestral gene and generally have the same function.
–  Paralogs: similar genes that arose from a gene duplication event. The function of one copy may diverge because of reduced selection pressure.
• Regions of similarity are also called conserved regions.

Proteins and Similarity
• The scoring matrix for proteins has to take into account that an amino acid at a given position has differing probabilities of mutating to each of the other amino acids over time.
• The probabilities are calculated from alignments of protein sequences known to be related.
• BLOSUM62 is the default substitution matrix; it is effective at finding similarity in distantly related sequences.
• For determining similarity in more closely related sequences, use a BLOSUM matrix with a higher cutoff (for example, BLOSUM80).

BLAST
• BLAST = Basic Local Alignment Search Tool
• Used for searching a query sequence or sequences against a database of subject sequences.
• Local alignment = searches for regions in the query that locally align to subject sequences.
• BLAST uses word-based heuristics to approximate the Smith-Waterman algorithm and find near-optimal local alignments quickly, trading some sensitivity for speed.
• Two main versions: NCBI BLAST and WU-BLAST

BLAST Algorithm
• The process starts by searching the subject sequences for exact matches to short, fixed-length strings from the query, called 'words'.
• These matches are the seeds for the local alignments.
• The next step is to extend each seed without gaps; mismatches are allowed at this stage.
• A gapped alignment is then performed using a modified Smith-Waterman algorithm; indels are introduced at this stage.
• Only hits with an expect value (E-value) at or below the threshold are reported back to the user.

BLAST Programs
• blastn - query DNA, subject DNA
• blastp - query protein, subject protein
• blastx - query DNA (6-frame translation), subject protein
• tblastn - query protein, subject DNA (6-frame translation) - slow
• tblastx - query DNA (6-frame translation), subject DNA (6-frame translation) - very slow
• megablast - fast nucleotide searching of many closely related sequences
• blastz - pairwise alignment of large genomic sequences

Which one do I use?
What database do you have, and how sensitive does your search have to be?
• blastn, blastp - good for identifying sequences that are already in a database and for finding local regions of similarity in closely related organisms.
• blastx - use when you have a nucleotide sequence with an unknown reading frame and/or sequencing errors that would lead to frame shifts or coding errors, such as ESTs. More sensitive.
• tblastn - useful for finding homologs in sequence where the frame is unknown or sequencing errors are likely to be present, such as ESTs and draft sequence.
• tblastx - used to detect novel ORFs/exons. Very slow; use as a last resort.

A Look at a BLAST Web Interface - Part 1
[Screenshot of the BLAST web form. Labelled areas: BLAST programs; query sequence in FASTA format; subject databases.]

A Look at a BLAST Web Interface - Part 2

Note: Only one sequence at a time can be searched using the CPGR BLAST web tool.

Understanding BLAST Options
• Expect (E-value) - the expectation threshold for reporting hits. The E-value of a hit is the number of alignments with an equal or better score that would be expected to occur by chance in a search of the same database; hits with an E-value above the threshold are not reported.
• Max Number of Descriptions - the number of descriptions to show in the hit table at the top of the BLAST report.
• Max Number of Alignments - the number of alignments to show at the bottom of the report.
• Matrix - specifies the scoring matrix for protein searches.
• Word Length - for blastn, sets the length of the initial exact matches used as seeds.
• Strand - restricts searching to one strand; for translated searches this limits the search to a three-frame translation.

• Filter - dust for nucleotide; seg, xnu, or seg+xnu for protein. Masks regions of low complexity.
• View Filter - show the query sequence in the report after all filters have been applied.
• A review of BLAST options can be found here:
–  WU-BLAST: http://blast.wustl.edu/blast/parameters.html
–  NCBI BLAST: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html (highly recommended)

The BLAST OUTPUT

Other Sequence alignment programs
• MUMmer - suffix tree based - whole genomes
• Vmatch - suffix array based - exact matching
• BLAT, GMAP - splice-site-aware gap extension - mRNA/cDNA alignment
• ClustalW, T-Coffee - multiple sequence alignment - protein families, phylogenetics
• HMMER - profile-based HMM search - search protein domain consensus databases such as Pfam

Review of Primer Design
• A primer is a short oligonucleotide that is the reverse complement of the template DNA at the site where we want the primer to anneal.
• It is important to design primers with an optimal melting temperature (Tm) and length so that they anneal specifically to the desired site on the template.
• Tm is approximated by the following formula:
–  Tm = 4(G+C) + 2(A+T) °C, where G, C, A, and T are the counts of each base in the primer
• Therefore Tm is a function of G/C% and length.
• The annealing temperature (Ta) should be 5 °C less than the lowest Tm of the primers in the reaction.
• It is important to avoid complementarity between the primer pair (primer dimers) and between regions of a single primer (formation of hairpins and other secondary structures).

Summary of Primer Design Guidelines
• Primers should be 17-28 bases in length.
• Base composition should be 50-60% (G+C).
• Primers should end (3') in a G or C, or CG or GC: this prevents "breathing" of ends and increases efficiency of priming.
• Tms between 55-80 °C are preferred.
• Runs of three or more Cs or Gs at the 3'-ends of primers may promote mispriming at G- or C-rich sequences (because of stability of annealing), and should be avoided.
• 3'-ends of primers should not be complementary (i.e. base pair), as otherwise primer dimers will be synthesized preferentially to any other product.
• Primer self-complementarity (ability to form secondary structures such as hairpins) should be avoided.
–  adapted from Innis and Gelfand, 1991

•  Fortunately, there is some help in meeting these guidelines when designing primers …

Primer3 Web Tool

•  http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi

• For simple primer picking, you provide Primer3 with the template sequence in FASTA format, the desired product size range(s), the target to flank, and any regions to exclude.
• Primer3 picks the optimal primers, taking into account the primer design parameters discussed in the previous slide.

```
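The Tm approximation and the annealing-temperature rule from the primer design review can be sketched in a few lines of Python. This is only an illustration of the slide's formula, Tm = 4(G+C) + 2(A+T), and of the "5 °C below the lowest Tm" rule. The function names and the example primer sequences are made up for this sketch and are not part of Primer3 or any other tool.

```python
def wallace_tm(primer: str) -> int:
    """Approximate Tm in degrees C using Tm = 4(G+C) + 2(A+T)."""
    seq = primer.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at != len(seq):
        raise ValueError("primer contains characters other than A, C, G, T")
    return 4 * gc + 2 * at


def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    seq = primer.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def annealing_temp(primers: list[str]) -> int:
    """Ta is taken as 5 degrees C below the lowest Tm in the reaction."""
    return min(wallace_tm(p) for p in primers) - 5


if __name__ == "__main__":
    # Hypothetical 20-mers, used only to exercise the functions.
    forward = "AGCGTACGTTAGCCTAGGAC"
    reverse = "TCGATCGGCTAAGCTTGCAG"
    for name, primer in (("forward", forward), ("reverse", reverse)):
        print(name, len(primer), "nt",
              f"GC={gc_percent(primer):.0f}%",
              f"Tm={wallace_tm(primer)} C")
    print("Ta =", annealing_temp([forward, reverse]), "C")
```

The formula is a rough rule of thumb for short oligonucleotides. The sketch does not check for primer dimers or hairpins, which the guidelines above also ask you to avoid; a tool such as Primer3 handles those checks.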