What is BLAST

What is BLAST?

BLAST® (Basic Local Alignment Search Tool) is a set of

similarity search programs designed to explore all of the available

sequence databases regardless of whether the query is protein or

DNA.

“Local” means it searches and aligns sequence segments, rather
than aligning entire sequences. It is able to detect relationships
among sequences which share only isolated regions of similarity.

It is currently among the most widely used sequence analysis tools.

Why BLAST?

• Identify unknown sequences - The best way to identify an

unknown sequence is to see if that sequence already exists in a

public database. If the database sequence is a well-characterized

sequence, then you may have access to a wealth of biological

information.

• Help predict gene/protein function and structure – genes with
similar sequences tend to share similar functions or structures.

• Identify protein families – group related (paralogous or orthologous)
genes and their proteins into a family.

• Prepare sequences for multiple alignments

• And more …

Different types of homology search

DNA vs. DNA

GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
|| |||||  | |||  ||||   ||    |||||||||||||||||| | |||||||| |  | |||||
GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC

GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----
|| ||||||||| ||||||  | |||||   |||||||| ||| ||||||||  |  | | ||
GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG

------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA
                  ||||||||||||| ||| ||||||||||| || ||||||| || ||||   |
TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA

GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG
||||||| |||| | | |||| ||||| || |||||  ||        |||||| |||||||||||||||
GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG

Protein vs. Protein

DNA translated vs. protein

Or the other way around

DNA translated vs. DNA translated

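Basic BLAST programs and databases

Program   Query                                       Database
blastn    Nucleotide sequence                         Nucleotide DB
blastp    Protein sequence                            Protein DB
blastx    Nucleotide sequence, translated in 6 frames Protein DB
tblastn   Protein sequence                            Nucleotide DB, translated in 6 frames
tblastx   Nucleotide sequence, translated in 6 frames Nucleotide DB, translated in 6 frames

A translated DB contains amino acid sequences.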

How Does BLAST Work

Three-step procedure:

1. Compare the query sequence to every database entry. If an entry
contains a segment of a certain length (the word size) that is similar
to part of the query sequence, that is a hit.

Word size = 7
Query: GTTGACCGTTAGCCGACGTTAAGCT
DB entry: ACATAGCCCGTTAGCCGCTGATACGACCGTAC

2. Extend each hit in both directions until the alignment score drops
too far below the best score reached so far. Extended hits that score
above the threshold become “high-scoring segment pairs” (HSPs).

3. A Smith-Waterman-like algorithm is used to do local alignment
around each HSP.
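
The first two steps can be sketched on the example above. This is a minimal illustration, not the BLAST implementation: it uses exact word matches only, and the function names, the match/mismatch scores (+1/-3) and the drop-off limit (x_drop = 5) are assumptions chosen for the example.

```python
def find_word_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) for every exact shared word."""
    # Index every word of the query by its start position.
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    # Scan the subject and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, qpos, spos, word_size,
               match=1, mismatch=-3, x_drop=5):
    """Ungapped extension of one hit; returns (q_start, q_end, score)."""
    score = best = word_size * match
    q_start, q_end = qpos, qpos + word_size
    # Extend to the right until the score drops too far below the best.
    i, j = qpos + word_size, spos + word_size
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_end = score, i + 1
        elif best - score > x_drop:
            break
        i, j = i + 1, j + 1
    # Extend to the left, restarting from the best score found so far.
    score = best
    i, j = qpos - 1, spos - 1
    while i >= 0 and j >= 0:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_start = score, i
        elif best - score > x_drop:
            break
        i, j = i - 1, j - 1
    return q_start, q_end, best


query = "GTTGACCGTTAGCCGACGTTAAGCT"
subject = "ACATAGCCCGTTAGCCGCTGATACGACCGTAC"
hits = find_word_hits(query, subject, word_size=7)
q_start, q_end, score = extend_hit(query, subject, *hits[0], word_size=7)
print(hits)
print(query[q_start:q_end], score)
```

All word hits here lie on the same diagonal, so extending any one of them recovers the same shared segment.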

Blastn ~ Construct Queries





paste your sequence here





specify search region



choose database









nr ~ non-redundant database

Others are subsets of nr database.

Blastn ~ Options

Limit results to only certain organisms.

Example: protease NOT hiv1[Organism]









Lower EXPECT thresholds are more stringent.



The smaller the word size, the higher the sensitivity.

Blastn ~ Filters

• Low-complexity: Some sequence segments are biologically uninteresting
(e.g., hits against common acidic-, basic- or proline-rich regions). Such
segments are identified by the SEG or DUST program and screened out.

• Human repeats: This option masks human repeats (LINEs and SINEs) and
is especially useful for human sequences that may contain these repeats.
Filtering for repeats can increase the speed of a search, especially with
very long sequences (>100 kb) and against databases that contain a large
number of repeats (e.g. htgs).

• Mask for lookup table only: BLAST searches consist of two phases:
finding hits based upon a lookup table and then extending them. This
option tells BLAST to apply the other filters only in the first phase.

• Mask lower case: Sequences in lower case are screened out. This allows
users to define customized filtering regions.

Blastn ~ When to Use

Your query sequence is a nucleotide sequence. Blastn can help to

• Find the identity of your query sequence.

• Find sequences similar to your query sequence.

Blastn returns nucleotide sequences stored in NCBI databases.

Variant of blastn ~ MegaBlast:

It is specifically designed to efficiently find long alignments between
very similar sequences (up to 10 times faster).

Interpret BLAST results - Distribution

Query sequence





BLAST hits.

Click to access

the pairwise

alignment.





This image shows the distribution of BLAST hits on the query

sequence. Each line represents a hit. The span of a line represents

the region where similarity is detected. Different colors represent

different ranges of scores.

Interpret BLAST results - Description

The description (also called definition) lines are listed below under

the heading "Sequences producing significant alignments". The

term "significant" simply refers to all those hits whose E value was

less than the threshold. It does not imply biological significance.









• ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank.
• Gene/sequence definition.
• Bit score: higher is better. Click to access the pairwise alignment.
• Expect value: lower is better. It is the number of hits with at least
this score that are expected to occur by chance.
• Links.

Interpret BLAST results – Pairwise Alignment

Query line: the segment from the query sequence.

Sbjct line: the segment from the hit (subject) sequence.

Middle line: a “|” marks each position where the two bases are identical.
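
As a small illustration of how the middle line relates to the other two, the sketch below rebuilds it for a short nucleotide alignment. The function name and the two segments are made up for the example; gap characters are assumed to be written as "-".

```python
def midline(query_seg, sbjct_seg):
    """Mark identical, non-gap positions of two aligned segments with '|'."""
    return "".join(
        "|" if q == s and q != "-" else " "
        for q, s in zip(query_seg, sbjct_seg)
    )


query_seg = "GGATGAGGTGGAGC"
sbjct_seg = "GGATGAGATGGAAC"
mid = midline(query_seg, sbjct_seg)
identities = mid.count("|")

print("Query", query_seg)
print("     ", mid)
print("Sbjct", sbjct_seg)
print(f"Identities = {identities}/{len(query_seg)}")
```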

Blastp ~ Protein – Protein DB



Blastp is used for both identifying a query amino acid sequence and

for finding similar sequences in protein databases. Like other

BLAST programs, blastp is designed to find local regions of

similarity. However, when sequence similarity spans the whole

sequence, blastp will report a global alignment, which is the

preferred result for protein identification purposes.



Unlike nucleotide BLAST, there is no comparable MEGABLAST

for protein searches.

Blastp ~ Special Parameters

Gap: penalties for opening a new gap, or for extending an existing gap.

Matrix: a table of scores that are assigned to various amino acid
substitutions. In general, different substitution matrices are tailored
to detecting similarities among sequences that have diverged to
differing degrees.

The BLOSUM-62 matrix is among the best for detecting most weak protein
similarities. For particularly long and weak alignments, the BLOSUM-45
matrix may prove superior. For short queries, PAM matrices may be used
instead.

Exercise

• Find out how the gap cost is calculated. For a gap of length k, is the cost
  • Gap_exist + k * gap_ext, or
  • Gap_exist + (k-1) * gap_ext?

Blastp ~ Special Parameters



For proteins, a provisional table of recommended substitution

matrices and gap costs for various query lengths is:





Query Length Substitution Matrix Gap Costs

85 BLOSUM-62 (10,1)

BLOSUM62 matrix

    C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
C   9 -1 -1 -3  0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2
S  -1  4  1 -1  1  0  1  0  0  0 -1 -1  0 -1 -2 -2 -2 -2 -2 -3
T  -1  1  5 -1  0 -2  0 -1 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -2
P  -3 -1 -1  7 -1 -2 -2 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4
A   0  1  0 -1  4  0 -2 -2 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -3
G  -3  0 -2 -2  0  6  0 -1 -2 -2 -2 -2 -2 -3 -4 -4 -3 -3 -3 -2
N  -3  1  0 -2 -2  0  6  1  0  0  1  0  0 -2 -3 -3 -3 -3 -2 -4
D  -3  0 -1 -1 -2 -1  1  6  2  0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4
E  -4  0 -1 -1 -1 -2  0  2  5  2  0  0  1 -2 -3 -3 -2 -3 -2 -3
Q  -3  0 -1 -1 -1 -2  0  0  2  5  0  1  1  0 -3 -2 -2 -3 -1 -2
H  -3 -1 -2 -2 -2 -2  1 -1  0  0  8  0 -1 -2 -3 -3 -3 -1  2 -2
R  -3 -1 -1 -2 -1 -2  0 -2  0  1  0  5  2 -1 -3 -2 -3 -3 -2 -3
K  -3  0 -1 -1 -1 -2  0 -1  1  1 -1  2  5 -1 -3 -2 -2 -3 -2 -3
M  -1 -1 -1 -2 -1 -3 -2 -3 -2  0 -2 -1 -1  5  1  2  1  0 -1 -1
I  -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3  1  4  2  3  0 -1 -3
L  -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2  2  2  4  1  0 -1 -2
V  -1 -2  0 -2  0 -3 -3 -3 -2 -2 -3 -3 -2  1  3  1  4 -1 -1 -3
F  -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3  0  0  0 -1  6  3  1
Y  -2 -2 -2 -3 -2 -3 -2 -3 -2 -1  2 -2 -2 -1 -1 -1 -1  3  7  2
W  -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3  1  2 11

Basic idea

• Conserved regions from multiple sources are aligned into blocks.

• The identity level is high, so we know they are homologues without
needing a score matrix.

Frequency of AA pairs

• 37 columns, each with 3*(3-1)/2 = 3 pairs: 111 pairs in total.
• Pair I-L occurs 3 times; pair L-L occurs 13 times.
• P_{IL} = 3/111, P_{LL} = 13/111
• 111 amino acids in total.
• P_I = 2/111, P_L = 21/111
• 2 * P_I * P_L < P_{IL}!
• P_L * P_L < P_{LL}!
• Both pairs are observed more often than expected by chance.

Blosum

• Score(x,y) = log_2 (p_{xy} / e_{xy}),
• where e_{xy} = 2 p_x p_y for x ≠ y, and
  e_{xx} = p_x p_x
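
The log-odds formula can be applied directly to the counts from the block example (111 pairs, 111 residues, I-L 3 times, L-L 13 times, 2 I and 21 L residues). A minimal sketch, with a function name and argument layout chosen only for this illustration:

```python
import math


def log_odds(pair_count, total_pairs, count_x, count_y, total_residues, same):
    """log2 of observed pair frequency over the frequency expected by chance."""
    observed = pair_count / total_pairs
    p_x = count_x / total_residues
    p_y = count_y / total_residues
    # e_xx = p_x * p_x for identical residues, e_xy = 2 * p_x * p_y otherwise.
    expected = p_x * p_y if same else 2 * p_x * p_y
    return math.log2(observed / expected)


score_il = log_odds(3, 111, 2, 21, 111, same=False)
score_ll = log_odds(13, 111, 21, 21, 111, same=True)
print(f"Score(I,L) = {score_il:.2f}")
print(f"Score(L,L) = {score_ll:.2f}")
```

Both scores come out positive, which matches the observation that these pairs occur more often than chance would predict.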

BLOSUM 62

• Some protein families are better studied, so they are
overrepresented in the database.

• To remove this bias from the statistics, closely related proteins are
grouped together before the BLOSUM calculation.

[Figure: example sequence weights 0.5, 0.5, 1 and 1]

• Sequences that are at least 62% identical are grouped together,
and each group is given a total weight of 1.

• This way, AA pairs are counted only between groups that are less
than 62% identical.

• The lower this number, the better suited the matrix is to distant
homology searches.
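
The grouping and weighting step can be sketched as follows. This is an illustration under assumptions: the sequences are made up, identity is computed over equal-length ungapped segments, and a sequence joins a group if it reaches the threshold against any member of that group.

```python
def percent_identity(a, b):
    """Percent of positions at which two equal-length segments agree."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(seqs, threshold=62.0):
    """Group sequences by identity and give every group a total weight of 1."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of any two sequences at or above the threshold.
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if percent_identity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(seqs))]
    sizes = {r: roots.count(r) for r in set(roots)}
    return [1.0 / sizes[r] for r in roots]


seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY", "LLLLLLLLLL"]
weights = cluster_weights(seqs)
print(weights)
```

The first two sequences are 90% identical and share one unit of weight; the other two each form their own group.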

Blastx ~ nucleotide – protein DB

Blastx is useful for finding similar proteins to those

encoded by a nucleotide query. It compares the translation

of the nucleotide query sequence to a protein database.

Because blastx translates the query sequence in all six

reading frames and provides combined significance

statistics for hits to different frames, it is particularly

useful when the reading frame of the query sequence is

unknown or it contains errors that may lead to frame

shifts or other coding errors. Thus blastx search is often

the first analysis performed with a read from a newly

derived sequence and is used extensively in analyzing

EST sequences.
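
A six-frame translation can be sketched in a few lines. The example assumes the standard genetic code and translates any codon containing an ambiguous base to "X"; the function names and the short test sequence are made up for illustration.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six translations would then be compared to the protein database, which is why a frame shift or an unknown reading frame does not prevent a match.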

Blastx ~ Attention



ATTENTION:

1. Make sure that your query sequence is a nucleotide coding region.

2. Blastx is not applicable to genomic DNA/RNA that does not code for
protein (introns, intergenic regions, tRNA, rRNA).

Blastx ~ Special Parameters

Different organisms and organelles may use different genetic codes,
so the same codon can encode different amino acids. You have to
specify the appropriate genetic code (translation table) for your
query sequence based on its organism and source.

Blastx ~ Interpret Results









Middle line:

letters ~ identical amino acid residues

+ ~ similar amino acid residues (positive substitution score)

white space ~ unmatched

Tblastn ~ protein – translated DB

A tblastn search allows you to compare a protein sequence to the

six-frame translations of a nucleotide database. It can be a very

productive way of finding homologous protein coding regions

in unannotated nucleotide sequences such as expressed

sequence tags (ESTs) and draft genome records (HTG),

located in BLAST databases est and htgs, respectively.

Tblastx ~ nucleotide – translated DB

tblastx takes a nucleotide query sequence, translates it in

all six frames, and compares those translations to the

database sequences dynamically translated in all six

frames. This effectively performs a more sensitive blastp

search without doing the manual translation.



tblastx gets around the potential frame-shifts and
ambiguities that may prevent certain open reading frames
from being detected. This is very useful in identifying
potential proteins encoded by single-pass read ESTs. In
addition, it can be a good tool for identifying novel
genes.

Other blast programs

PSI-BLAST: Position-Specific Iterated (PSI)-BLAST is the
most sensitive BLAST program, making it useful for finding
very distantly related proteins. Use PSI-BLAST when your
standard protein-protein BLAST search either fails to find
significant hits, or returns hits with descriptions such as
"hypothetical protein" or "similar to...".

Other blast programs

BLAST 2 Sequences: "BLAST 2 Sequences" is designed for direct

comparison of two sequences. This program takes two input

sequences and compares them directly. Please note that "BLAST 2

Sequences" regards the second sequence as the database. If the

database sequence or second query is present in NCBI databases,

using GI/Accession instead of the FASTA sequence would allow

the program to incorporate the translation and other sequence

features, found in that record, into the final result to make it more

informative.

Other blast programs

Search for short and near exact matches: Normal parameters for
standard BLAST are too stringent for short query sequences.
Therefore, appropriate parameters are set for short and near exact
matches.

• For nucleotide (<20 bp): A common use is to check the specificity
of primers used in the polymerase chain reaction (PCR) or
hybridization. The two primers can be concatenated into one query:
forward primer – NNNNNNNNNN – reverse primer. Since BLAST looks for
local alignments and searches both strands, there is no need to reverse
complement one of the primers before doing the concatenation or the
search. Use word size 7, E value 1000, no filter.

• For protein (<10-15mer): use matrix PAM30, E value 20000,
word size 2, no filter.
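
For the nucleotide case, building the concatenated primer query is a one-liner. The sketch below is only an illustration: the function name is hypothetical and the two primer sequences are made up.

```python
def primer_pair_query(forward_primer, reverse_primer, spacer_length=10):
    """Join two primers with a run of Ns so they can be searched as one query."""
    spacer = "N" * spacer_length
    return forward_primer.upper() + spacer + reverse_primer.upper()


# Made-up primers, used here only to show the shape of the query.
forward = "agcttgaccgttagc"
reverse = "ttgacggctaacggt"
primer_query = primer_pair_query(forward, reverse)
print(primer_query)
```

The reverse primer is pasted in as designed, without reverse complementing it, because both strands are searched.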

Summary - If your sequence is NUCLEOTIDE

Length           DB    Purpose                                     Program
20 bp or longer  Nucl  Identify the query sequence                 MegaBlast, blastn
                 Nucl  Find sequences similar to query sequence    blastn
                 Nucl  Find similar proteins to translated query   tblastx
                       in a translated database
                 Prot  Find similar proteins to translated query   blastx
                       in a protein database
7-20 bp          Nucl  Find primer binding sites or map short      Search for short,
                       contiguous motifs                           nearly exact matches

Summary - If your sequence is PROTEIN

Length                 DB    Purpose                                      Program
15 residues or longer  Prot  Identify the query sequence or find protein  blastp
                             sequences similar to query
                       Prot  Find members of a protein family or build a  PSI-blast
                             custom position-specific score matrix
                       Nucl  Find similar proteins in a translated        tblastn
                             nucleotide database
5-15 residues          Prot  Search for peptide motifs                    Search for short,
                                                                          nearly exact matches

Raw Score, Bit Score, P-value and E-value

Score Matrix

• BLOSUM62

Raw Score and E-value

• VLNVWGKVEAD
• VLKCWGPMEAD
• raw score = S(V,V)+S(L,L)+S(N,K)+…+S(D,D)
• The two segments are substrings of the query and of the subject
(database) sequence.
• Because there is no gap, this is called an HSP (High-Scoring
Segment Pair).
• Is this HSP significant?
  • Can it occur purely by chance?
  • The E-value of this raw score is the number of occurrences expected
  if both query and database are random sequences.
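
The raw score of the example HSP can be worked out by summing the pair scores. The sketch below includes only the eleven BLOSUM62 entries this particular pair needs, taken from the standard BLOSUM62 matrix; the function names are made up for illustration.

```python
# Only the BLOSUM62 entries needed for this HSP (standard matrix values).
PAIR_SCORES = {
    ("V", "V"): 4, ("L", "L"): 4, ("N", "K"): 0, ("V", "C"): -1,
    ("W", "W"): 11, ("G", "G"): 6, ("K", "P"): -1, ("V", "M"): 1,
    ("E", "E"): 5, ("A", "A"): 4, ("D", "D"): 6,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def raw_score(query_seg, sbjct_seg):
    """Sum the substitution scores over an ungapped segment pair."""
    return sum(pair_score(a, b) for a, b in zip(query_seg, sbjct_seg))


hsp_score = raw_score("VLNVWGKVEAD", "VLKCWGPMEAD")
print(hsp_score)  # 4+4+0-1+11+6-1+1+5+4+6 = 39
```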

How to compute E-value from raw score

• There is rigorous mathematical analysis behind this. But we only
need to know that
  • If the query sequence has length m and the database has length n,
  then by chance, the number of non-overlapping HSPs with score at
  least x is expected to be
  • K*m*n*exp(- lambda * x)
• This makes sense:
  • Doubling the length of either sequence should double the number
  of HSPs attaining a given score.
  • Also, for an HSP to attain the score 2x it must attain the score x
  twice in a row, so one expects E to decrease exponentially with score.
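
The formula is easy to try out numerically. In the sketch below the values of K and lambda are illustrative assumptions (they are of the order often quoted for ungapped BLOSUM62 scoring), and the query and database lengths are made up.

```python
import math

# Illustrative statistical parameters; real values depend on the scoring system.
K = 0.134
LAMBDA = 0.318


def expected_hsps(score, m, n, k=K, lam=LAMBDA):
    """Expected number of chance HSPs scoring at least `score`."""
    return k * m * n * math.exp(-lam * score)


m = 250          # query length (made up)
n = 50_000_000   # database length (made up)
e_value = expected_hsps(39, m, n)
print(f"E = {e_value:.3g}")
print(f"E with doubled query length = {expected_hsps(39, 2 * m, n):.3g}")
print(f"E with doubled score = {expected_hsps(78, m, n):.3g}")
```

Doubling m doubles E, while doubling the score shrinks E by a factor of exp(lambda * x).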

Bit Score

• Raw scores have little meaning without detailed knowledge of the
scoring system used, or more simply its statistical parameters K and
lambda.

• The bit score S' is the “normalized” score:
  S' = (lambda * x - ln K) / ln 2, where x is the raw score.

• Therefore, E-value = m*n*2^(-S')
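
The relation between bit score and E-value can be checked numerically. This sketch assumes the usual normalization S' = (lambda * x - ln K) / ln 2, under which E = m * n * 2^(-S') (note the negative exponent); K, lambda and the lengths are illustrative values, not taken from the slides.

```python
import math

K = 0.134       # illustrative value
LAMBDA = 0.318  # illustrative value


def bit_score(raw, k=K, lam=LAMBDA):
    """Normalize a raw score so that K and lambda are no longer needed."""
    return (lam * raw - math.log(k)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E-value from a bit score and the query/database lengths."""
    return m * n * 2.0 ** (-bits)


m, n, raw = 250, 50_000_000, 39
bits = bit_score(raw)
e_from_bits = e_value_from_bits(bits, m, n)
e_from_raw = K * m * n * math.exp(-LAMBDA * raw)
print(f"bit score = {bits:.1f}")
print(f"E from bit score = {e_from_bits:.3g}, E from raw score = {e_from_raw:.3g}")
```

Both routes give the same E-value, which is the point of the normalization: bit scores from different scoring systems can be compared directly.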

Exercise

• Retrieve horse myoglobin.
• BLASTp
  • What do you get?
  • What is hemoglobin?
• TBLASTN
  • Find the DNA sequence corresponding to horse myoglobin.
  • Can you do the reverse translation without knowing the DNA sequence?

