What is BLAST?
BLAST® (Basic Local Alignment Search Tool) is a set of
similarity search programs designed to explore all of the available
sequence databases regardless of whether the query is protein or
DNA.
“Local” means it searches for and aligns sequence segments rather
than aligning entire sequences. It can therefore detect relationships
among sequences that share only isolated regions of similarity.
It is one of the most widely used and accepted sequence
analysis tools.
Why BLAST?
• Identify unknown sequences - The best way to identify an
unknown sequence is to see if that sequence already exists in a
public database. If the database sequence is a well-characterized
sequence, then you may have access to a wealth of biological
information.
• Help gene/protein function and structure prediction – genes with
similar sequences tend to share similar functions or structure.
• Identify protein family – group related (paralog or ortholog)
genes and their proteins into a family.
• Prepare sequences for multiple alignments
• And more …
Different types of homology search
DNA vs. DNA
GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
|| ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | |||||
GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC
GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----
|| ||||||||| |||||| | ||||| |||||||| ||| |||||||| | | | ||
GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG
------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA
||||||||||||| ||| ||||||||||| || ||||||| || |||| |
TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA
GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG
||||||| |||| | | |||| ||||| || ||||| || |||||| |||||||||||||||
GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG
Protein vs. protein
Translated DNA vs. protein
Or the other way around
Translated DNA vs. translated DNA
Basic BLAST programs and databases
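Program   Query                                Database
blastn    Nucleotide sequence                  Nucleotide DB
blastp    Protein sequence                     Protein DB
blastx    Nucleotide, translated in 6 frames   Protein DB
tblastn   Protein sequence                     Nucleotide DB, translated in 6 frames
tblastx   Nucleotide, translated in 6 frames   Nucleotide DB, translated in 6 frames
A translated DB contains amino acid sequences.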
How Does BLAST Work
Three-step procedure:
1. Compare the query sequence to every database entry. If an
entry contains a segment of a certain length (the word size) that is
similar to part of the query sequence, that segment is a hit.
Word size = 7
Query: GTTGACCGTTAGCCGACGTTAAGCT
DB entry: ACATAGCCCGTTAGCCGCTGATACGACCGTAC
2. Extend each hit in both directions until the alignment score falls
too far below the best value reached so far. Extended hits that score
above the threshold become “high-scoring segment pairs” (HSPs).
3. A Smith-Waterman-like algorithm is used to do local alignment
around each HSP.
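Steps 1 and 2 can be sketched in a few lines of Python. This is a simplified illustration, not BLAST's actual implementation: it only accepts exact word matches, extends without gaps, and the scoring values (match +1, mismatch -3) and the drop-off limit `x_drop` are assumptions chosen for the example.

```python
def find_word_hits(query, subject, word_size=7):
    """Step 1: return (query_pos, subject_pos) for every exact shared word."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Walk in one direction; return (best score gain, length giving it)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        sj += step
    return best_gain, best_len


def extend_hit(query, subject, i, j, word_size=7,
               match=1, mismatch=-3, x_drop=5):
    """Step 2: ungapped extension of one word hit in both directions.

    Returns (query_start, subject_start, length, score).
    """
    right_gain, right_len = _extend(query, subject, i + word_size,
                                    j + word_size, 1, match, mismatch, x_drop)
    left_gain, left_len = _extend(query, subject, i - 1, j - 1, -1,
                                  match, mismatch, x_drop)
    score = word_size * match + left_gain + right_gain
    return (i - left_len, j - left_len,
            word_size + left_len + right_len, score)


query = "GTTGACCGTTAGCCGACGTTAAGCT"
db_entry = "ACATAGCCCGTTAGCCGCTGATACGACCGTAC"
hits = find_word_hits(query, db_entry)
q_start, s_start, length, score = extend_hit(query, db_entry, *hits[0])
print(hits)
print(query[q_start:q_start + length], score)
```

For the example sequences, the four overlapping 7-letter words lie on the same diagonal and all extend to the same 10-base segment, CCGTTAGCCG.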
Blastn ~ Construct Queries
paste your sequence here
specify search region
choose database
nr ~ non-redundant database
Others are subsets of nr database.
Blastn ~ Options
Limit results to hits from only certain organisms.
Example: protease NOT hiv1[Organism]
Lower EXPECT thresholds are more stringent.
The smaller the word size, the higher the sensitivity.
Blastn ~ Filters
• Low-complexity: Some sequence segments are biologically uninteresting
(e.g., hits against common acidic-, basic- or proline-rich regions). Such segments
are identified by the SEG (protein) or DUST (nucleotide) program and screened out.
• Human repeats: This option masks human repeats (LINEs and SINEs) and
is especially useful for human sequences that may contain these repeats. Filtering
for repeats can increase the speed of a search, especially with very long sequences
(>100 kb) and against databases that contain a large number of repeats (e.g. htgs).
• Mask for lookup table only: BLAST searches consist of two phases:
finding hits based upon a lookup table and then extending them. This option tells
BLAST to apply the other filters only in the first phase.
• Mask lower case: Sequence regions in lower case are screened out. This allows
users to define customized filtering regions.
Blastn ~ When to Use
Your query sequence is a nucleotide sequence. Blastn can help to
• Find the identity of your query sequence.
• Find sequences similar to your query sequence.
Blastn returns nucleotide sequences stored in NCBI databases.
Variant of blastn ~ MegaBlast:
It is specifically designed to efficiently (up to 10 times faster)
find long alignments between very similar sequences.
Interpret BLAST results - Distribution
[Figure: graphical overview of BLAST hits drawn under the query sequence. Click a hit to access its pairwise alignment.]
This image shows the distribution of BLAST hits on the query
sequence. Each line represents a hit. The span of a line represents
the region where similarity is detected. Different colors represent
different ranges of scores.
Interpret BLAST results - Description
The description (also called definition) lines are listed below under
the heading "Sequences producing significant alignments". The
term "significant" simply refers to all those hits whose E value was
less than the threshold. It does not imply biological significance.
Columns in each description line:
• ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank.
• Gene/sequence definition.
• Bit score: the higher, the better. Click to access the pairwise alignment.
• Expect value: the lower, the better. It is the number of hits with a score this good that are expected to occur by chance.
• Links.
Interpret BLAST results – Pairwise Alignment
Query line: the segment from query sequence.
Sbjct line: the segment from the hit (subject) sequence.
Middle line: marks the positions where the query and subject bases are identical.
Blastp ~ Protein – Protein DB
Blastp is used for both identifying a query amino acid sequence and
for finding similar sequences in protein databases. Like other
BLAST programs, blastp is designed to find local regions of
similarity. However, when sequence similarity spans the whole
sequence, blastp will report a global alignment, which is the
preferred result for protein identification purposes.
Unlike nucleotide BLAST, there is no comparable MEGABLAST
for protein searches.
Blastp ~ Special Parameters
Gap: penalties for opening a new gap, or for extending an existing gap.
Matrix: a table of scores that are assigned to
various amino acid substitutions. In general,
different substitution matrices are tailored to
detecting similarities among sequences that are
diverged by differing degrees.
BLOSUM-62 matrix is among the best for
detecting most weak protein similarities. For
particularly long and weak alignments, the
BLOSUM-45 matrix may prove superior. For
short queries, PAM matrices may be used
instead.
Exercise
Find out how the gap cost is calculated:
For a length k gap, the cost is
Gap_exist + k * gap_ext OR
Gap_exist + (k-1) * gap_ext
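As a starting point for the exercise, the two candidate formulas can be written out and compared. The function names and the example costs (existence 10, extension 1) are assumptions for illustration, and the sketch deliberately does not say which convention BLAST uses.

```python
def gap_cost_a(k, gap_exist, gap_ext):
    """Candidate 1: existence cost plus k extension costs."""
    return gap_exist + k * gap_ext


def gap_cost_b(k, gap_exist, gap_ext):
    """Candidate 2: the first gap position is covered by the existence cost."""
    return gap_exist + (k - 1) * gap_ext


for k in (1, 2, 5):
    print(k, gap_cost_a(k, 10, 1), gap_cost_b(k, 10, 1))
```

The two conventions always differ by exactly one extension cost, so a single gapped alignment with a known score is enough to tell them apart.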
Blastp ~ Special Parameters
For proteins, a provisional table of recommended substitution
matrices and gap costs for various query lengths is:
Query Length    Substitution Matrix    Gap Costs
>85             BLOSUM-62              (10,1)
BLOSUM62 matrix
   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
C  9 -1 -1 -3  0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2
S -1  4  1 -1  1  0  1  0  0  0 -1 -1  0 -1 -2 -2 -2 -2 -2 -3
T -1  1  5 -1  0 -2  0 -1 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -2
P -3 -1 -1  7 -1 -2 -2 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4
A  0  1  0 -1  4  0 -2 -2 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -3
G -3  0 -2 -2  0  6  0 -1 -2 -2 -2 -2 -2 -3 -4 -4 -3 -3 -3 -2
N -3  1  0 -2 -2  0  6  1  0  0  1  0  0 -2 -3 -3 -3 -3 -2 -4
D -3  0 -1 -1 -2 -1  1  6  2  0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4
E -4  0 -1 -1 -1 -2  0  2  5  2  0  0  1 -2 -3 -3 -2 -3 -2 -3
Q -3  0 -1 -1 -1 -2  0  0  2  5  0  1  1  0 -3 -2 -2 -3 -1 -2
H -3 -1 -2 -2 -2 -2  1 -1  0  0  8  0 -1 -2 -3 -3 -3 -1  2 -2
R -3 -1 -1 -2 -1 -2  0 -2  0  1  0  5  2 -1 -3 -2 -3 -3 -2 -3
K -3  0 -1 -1 -1 -2  0 -1  1  1 -1  2  5 -1 -3 -2 -2 -3 -2 -3
M -1 -1 -1 -2 -1 -3 -2 -3 -2  0 -2 -1 -1  5  1  2  1  0 -1 -1
I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3  1  4  2  3  0 -1 -3
L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2  2  2  4  1  0 -1 -2
V -1 -2  0 -2  0 -3 -3 -3 -2 -2 -3 -3 -2  1  3  1  4 -1 -1 -3
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3  0  0  0 -1  6  3  1
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1  2 -2 -2 -1 -1 -1 -1  3  7  2
W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3  1  2 11
Basic idea
Conserved regions from multiple sources
are aligned into blocks
The identity level is high, so we
know they are homologues without needing a score
matrix
Frequency of AA pairs
37 columns, each column has 3*(3-1)/2 pairs. In
total 111 pairs.
Pair I-L occurs 3 times. L-L occurs 13 times
P_{IL} = 3/111. P_{LL}= 13/111
Total number of amino acids: 111.
P_I = 2/111, P_L = 21/111
2 * P_I * P_L < P_{IL}!
P_L * P_L < P_{LL}!
Blosum
Score(x,y) = log_2 (p_{xy} / e_{xy}),
where e_{xy} = 2 p_x p_y
e_{xx} = p_x p_x
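The counts from the example block can be plugged into the score formula. The sketch below does that arithmetic for the I-L and L-L pairs. The helper name `log_odds` is an assumption, and the result is left as an unrounded log2 value, whereas published BLOSUM matrices are scaled and rounded to integers.

```python
import math

TOTAL_PAIRS = 111       # 37 columns x 3 pairs per column
TOTAL_RESIDUES = 111    # 37 columns x 3 sequences

p_I = 2 / TOTAL_RESIDUES
p_L = 21 / TOTAL_RESIDUES
p_IL = 3 / TOTAL_PAIRS
p_LL = 13 / TOTAL_PAIRS


def log_odds(p_xy, p_x, p_y, same):
    """Score(x,y) = log2(p_xy / e_xy), with e_xy = 2*p_x*p_y or p_x*p_x."""
    expected = p_x * p_y if same else 2 * p_x * p_y
    return math.log2(p_xy / expected)


score_IL = log_odds(p_IL, p_I, p_L, same=False)
score_LL = log_odds(p_LL, p_L, p_L, same=True)
print(round(score_IL, 2), round(score_LL, 2))
```

Both scores come out positive because each pair is observed more often than the background frequencies alone would predict.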
BLOSUM 62
Some protein families are better
studied, so they are overrepresented in
the database.
To remove this bias from the statistics, closely
related proteins are clustered together before
the BLOSUM calculation.
BLOSUM 62
Weight 0.5
Weight 0.5
Weight 1
Weight 1
Sequences that are at least 62% identical
are grouped together and given a total weight of 1.
This way, AA pairs are effectively counted between
groups that are less than 62% identical.
The lower this number, the better suited the matrix
is to distant homology searches.
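The weighting scheme can be illustrated with a small sketch. It assumes single-linkage clustering on percent identity (a sequence joins a group if it reaches the threshold against any member) and splits a total weight of 1 evenly within each group. The four toy sequences are invented for the example.

```python
def percent_identity(a, b):
    """Percent of aligned positions that are identical (equal-length input)."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(seqs, threshold=62.0):
    """Group sequences at or above the threshold; each group weighs 1 in total."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if percent_identity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(seqs))]
    return [1.0 / roots.count(r) for r in roots]


# Toy aligned sequences (hypothetical): the first two are 90% identical.
seqs = ["AVLKWGEADM", "AVLKWGEADL", "PQRSTWYFHC", "GGHNDEKRIV"]
print(cluster_weights(seqs))
```

The first two sequences share one group and get weight 0.5 each, while the two unrelated sequences keep weight 1, matching the weights shown on the slide.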
Blastx ~ nucleotide – protein DB
Blastx is useful for finding similar proteins to those
encoded by a nucleotide query. It compares the translation
of the nucleotide query sequence to a protein database.
Because blastx translates the query sequence in all six
reading frames and provides combined significance
statistics for hits to different frames, it is particularly
useful when the reading frame of the query sequence is
unknown or it contains errors that may lead to frame
shifts or other coding errors. Thus blastx search is often
the first analysis performed with a read from a newly
derived sequence and is used extensively in analyzing
EST sequences.
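The six-frame translation that blastx applies to the query can be sketched as follows. This is only an illustration of the idea, assuming the standard genetic code (translation table 1) and plain A/C/G/T input. The function names are made up for the example and are not part of BLAST.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the translations of frames +1, +2, +3, -1, -2 and -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six translated sequences is then compared to the protein database, which is why the correct reading frame does not need to be known in advance.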
Blastx ~ Attention
ATTENTION:
1. You have to make sure that your sequence is a
nucleotide coding region.
2. Blastx is not applicable to non-coding genomic DNA/RNA (introns,
intergenic regions, tRNA, rRNA), because these do not
encode protein.
Blastx ~ Special Parameters
Different organisms and organelles may use different genetic
codes to translate codons into amino acids. You have to specify
the appropriate genetic code (translation table) for your query
sequence based on the organism and source.
Blastx ~ Interpret Results
Middle line:
letters ~ consensus amino acid residues
+ ~ similar amino acid residue
white space ~ unmatched
Tblastn ~ protein – translated DB
A tblastn search allows you to compare a protein sequence to the
six-frame translations of a nucleotide database. It can be a very
productive way of finding homologous protein coding regions
in unannotated nucleotide sequences such as expressed
sequence tags (ESTs) and draft genome records (HTG),
located in BLAST databases est and htgs, respectively.
Tblastx ~ nucleotide – translated DB
tblastx takes a nucleotide query sequence, translates it in
all six frames, and compares those translations to the
database sequences dynamically translated in all six
frames. This effectively performs a more sensitive blastp
search without doing the manual translation.
tblastx gets around the potential frame-shifts and
ambiguities that may prevent certain open reading frames
from being detected. This is very useful in identifying
potential proteins encoded by single pass read ESTs. In
addition, it would be a good tool for identifying novel
genes.
Other blast programs
PSI blast: Position-Specific Iterated (PSI)-BLAST is the
most sensitive BLAST program, making it useful for finding
very distantly related proteins. Use PSI-BLAST when your
standard protein-protein BLAST search either failed to find
significant hits, or returned hits with descriptions such as
"hypothetical protein" or "similar to..."
Other blast programs
BLAST 2 Sequences: "BLAST 2 Sequences" is designed for direct
comparison of two sequences. This program takes two input
sequences and compares them directly. Please note that "BLAST 2
Sequences" regards the second sequence as the database. If the
database sequence or second query is present in NCBI databases,
using GI/Accession instead of the FASTA sequence would allow
the program to incorporate the translation and other sequence
features, found in that record, into the final result to make it more
informative.
Other blast programs
Search for short and near exact matches: Normal parameters for
standard blast are too stringent for short query sequences.
Therefore, appropriate parameters are set for short and near exact
matches.
• For nucleotide (< 20 bp): A common use is to check the specificity
of primers used in the polymerase chain reaction (PCR) or
hybridization. The two primers can be joined into a single query:
forward primer – NNNNNNNNNN – reverse primer. Since BLAST looks for
local alignments and searches both strands, there is no need to
reverse complement one of the primers before doing the concatenation
or the search. Use word size 7, E value 1000, no filter.
• For protein (< 10-15mer): use matrix PAM30, E value 20000,
word size 2, no filter.
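Building the concatenated primer query is a one-line string operation. The sketch below assumes a spacer of ten N's, as in the pattern above. The primer sequences are invented, and the `search_settings` dictionary is only a hypothetical way of noting the recommended parameters, not a real BLAST interface.

```python
def primer_query(forward_primer, reverse_primer, spacer_length=10):
    """Join two primers with a run of N's so one search covers both."""
    return (forward_primer.upper()
            + "N" * spacer_length
            + reverse_primer.upper())


# Hypothetical primer sequences, invented for this example.
forward = "GTTGACCGTTAGCCGACG"
reverse = "ACGGTCGTATCAGCGGCT"

query = primer_query(forward, reverse)
# Hypothetical record of the settings recommended for short nucleotide queries.
search_settings = {"word_size": 7, "expect": 1000, "filter": None}

print(query)
```

The reverse primer is used as written, since both strands are searched.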
Summary - If your sequence is NUCLEOTIDE
Length           DB    Purpose                                                             Program
20 bp or longer  Nucl  Identify the query sequence                                         MegaBlast, blastn
                       Find sequences similar to query sequence                            blastn
                       Find similar proteins to translated query in a translated database  tblastx
                 Prot  Find similar proteins to translated query in a protein database     blastx
7-20 bp          Nucl  Find primer binding sites or map short contiguous motifs            Search for short, nearly exact matches
Summary - If your sequence is PROTEIN
Length                 DB    Purpose                                                                             Program
15 residues or longer  Prot  Identify the query sequence or find protein sequences similar to query             blastp
                             Find members of a protein family or build a custom position-specific score matrix  PSI-blast
                       Nucl  Find similar proteins in a translated nucleotide database                           tblastn
5-15 residues          Prot  Search for peptide motifs                                                           Search for short, nearly exact matches
Raw Score, Bit Score, P-value and E-value
Score Matrix
BLOSUM62
Raw Score and E-value
VLNVWGKVEAD
VLKCWGPMEAD
raw score = S(V,V)+S(L,L)+S(N,K)+…+S(D,D)
The two sequences are substrings of the query and
of the subject (database) sequence, respectively.
Because the alignment contains no gaps, it is called an HSP
(High-Scoring Segment Pair).
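The raw score of this HSP can be computed by summing the substitution scores position by position. The sketch below hard-codes only the BLOSUM62 entries needed for this one example (for the V/M pair it uses the value 1, as in the V row of the matrix). The names `BLOSUM62_SUBSET`, `pair_score` and `raw_score` are assumptions for illustration.

```python
# BLOSUM62 entries needed for this example only; keys are sorted residue pairs.
BLOSUM62_SUBSET = {
    ("V", "V"): 4, ("L", "L"): 4, ("K", "N"): 0, ("C", "V"): -1,
    ("W", "W"): 11, ("G", "G"): 6, ("K", "P"): -1, ("M", "V"): 1,
    ("E", "E"): 5, ("A", "A"): 4, ("D", "D"): 6,
}


def pair_score(a, b):
    """Look up S(a, b); the matrix is symmetric, so the order is ignored."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def raw_score(seq1, seq2):
    """Sum of substitution scores over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("an ungapped HSP needs segments of equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(raw_score("VLNVWGKVEAD", "VLKCWGPMEAD"))
```

With these values the raw score is 39: the identical positions contribute most of it, and the four substitutions add 0, -1, -1 and 1.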
Is this HSP significant?
Can it occur purely by chance?
E-value of this raw score is the number of expected
occurrences if both query and database are random
sequences.
How to compute E-value from raw score
There is rigorous mathematical analysis
behind this, but we only need to know the result:
If the query sequence has length m and the database
has length n, then by chance the number of
non-overlapping HSPs with score at least x is expected to be
K*m*n*exp(- lambda * x)
This makes sense:
• Doubling the length of either sequence should double
the number of HSPs attaining a given score.
• For an HSP to attain the score 2x it must attain the
score x twice in a row, so one expects E to decrease
exponentially with score.
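The formula and its two sanity checks can be tried numerically. In this sketch the values of K and lambda, the sequence lengths, and the raw score of 39 are all assumed, illustrative numbers. Real values of K and lambda depend on the scoring system in use.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * score)."""
    return K * m * n * math.exp(-lam * score)


K, LAM = 0.134, 0.318      # assumed, illustrative parameter values
M, N = 150, 1_000_000      # assumed query and database lengths

e_base = expected_hsps(39, M, N, K, LAM)
e_double_m = expected_hsps(39, 2 * M, N, K, LAM)
e_double_score = expected_hsps(78, M, N, K, LAM)

print(e_base, e_double_m, e_double_score)
```

Doubling the query length doubles E, while doubling the score squares the exponential factor and makes E far smaller.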
Bit Score
Raw scores have little meaning without
detailed knowledge of the scoring system
used, or more simply its statistical
parameters K and lambda.
Bit score is the “normalized” score:
S' = (lambda * S - ln K) / ln 2
Therefore, E-value = m*n*2^(-S')
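The relation between raw score, bit score and E-value can be checked with a short calculation. The sketch uses the usual normalization, bit score = (lambda * S - ln K) / ln 2. The values of K, lambda, m and n are assumed for illustration only.

```python
import math


def bit_score(raw_score, K, lam):
    """Normalize a raw score using the scoring system's K and lambda."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def evalue_from_bits(bits, m, n):
    """E-value from a bit score: m * n * 2^(-bits)."""
    return m * n * 2.0 ** (-bits)


K, LAM = 0.134, 0.318      # assumed, illustrative parameter values
M, N = 150, 1_000_000      # assumed query and database lengths

bits = bit_score(39, K, LAM)
e_from_bits = evalue_from_bits(bits, M, N)
e_from_raw = K * M * N * math.exp(-LAM * 39)

print(bits, e_from_bits, e_from_raw)
```

Both routes give the same E-value, which shows that a bit score can be interpreted without knowing K and lambda.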
Exercise
Retrieve the horse myoglobin sequence.
BLASTp
What do you get?
What is hemoglobin?
tblastn
Find the DNA sequence corresponding to
horse myoglobin.
Can you do the reverse translation
without knowing the DNA sequence?