					              BLAST



Lecture 3.1           1
                 BLAST
• Basic Local Alignment Search Tool
• Developed in 1990; gapped BLAST and PSI-BLAST
  followed in 1997 (S. Altschul et al.)
• A heuristic method for performing local
  alignments through searches of high
  scoring segment pairs (HSPs)
• First to use statistics to predict the significance
  of initial matches, which saves on false leads
• Offers both sensitivity and speed
                      BLAST
• Looks for clusters of nearby or locally dense “similar
  or homologous” k-tuples
• Uses “look-up” tables to shorten search time
• Uses larger “word size” than FASTA to accelerate the
  search process
• Performs local alignment only (not global alignment)
• Fastest and most frequently used sequence alignment
  tool -- THE STANDARD
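
The look-up table idea on this slide can be sketched in a few lines of Python. This is a simplified illustration, not BLAST's actual implementation: it indexes only exact word matches, whereas BLAST also admits "neighbourhood" words that score at least T against the query word. The function names and the default word size of 3 are assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(sequence, word_size=3):
    """Map every overlapping word (k-tuple) to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_seed_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    index = build_word_index(query, word_size)
    hits = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        for q_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


def group_by_diagonal(hits):
    """Group hits by diagonal (subject_pos - query_pos).

    Several hits on one diagonal form the kind of locally dense
    cluster that is worth extending into an HSP.
    """
    diagonals = defaultdict(list)
    for q_pos, s_pos in hits:
        diagonals[s_pos - q_pos].append((q_pos, s_pos))
    return dict(diagonals)


hits = find_seed_hits("KIQIYGTG", "AAKIQIYAA")
print(group_by_diagonal(hits))
```

Building the index once per query means each database word costs a single dictionary look-up, which is where the time saving comes from.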


              BLAST Access
• NCBI BLAST
• http://www.ncbi.nlm.nih.gov/BLAST/
• Canadian Bioinformatics Resource BLAST
• http://cbr-rbc.nrc-cnrc.gc.ca/blast/
• European Bioinformatics Institute BLAST
• http://www.ebi.ac.uk/blastall/
• http://www.ebi.ac.uk/blast2/


  Different Flavours of BLAST

• BLASTP - protein query against protein DB
• BLASTN - DNA/RNA query against GenBank (DNA)
• BLASTX - six-frame translated DNA query against protein DB
• TBLASTN - protein query against six-frame translation of GenBank
• TBLASTX - six-frame translated DNA query against six-frame translation of GenBank
• PSI-BLAST - protein 'profile' query against protein DB
• PHI-BLAST - protein pattern against protein DB
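
The "six-frame translation" used by BLASTX, TBLASTN and TBLASTX can be illustrated with a short Python sketch. It assumes the standard genetic code and writes unknown codons as X; the function names and frame labels (+1 to +3, -1 to -3) are choices made for this example, not BLAST's own output format.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the three forward and three reverse reading frames."""
    dna = dna.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGAAATAA").items():
    print(name, protein)
```

Each of the six protein strings is then searched as if it were a protein query (or protein database entry), which is why these flavours are slower than BLASTP or BLASTN.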
               Other BLAST Services
• MEGABLAST - for comparison of large sets
  of long DNA sequences
• RPS-BLAST - Conserved Domain Detection
• BLAST 2 Sequences - for performing pairwise
  alignments for 2 chosen sequences
• Genomic BLAST - for alignments against
  select human, microbial or malarial genomes
• VecScreen - for detecting cloning vector
  contamination in sequenced data

              Running NCBI BLAST




              MT0895
• MMKIQIYGTGCANCQMLEKNAREAVKELG
  IDAEFEKIKEMDQILEAGLTALPGLAVDG
  ELKIMGRVASKEEIKKILS




              Running NCBI BLAST
• Paste in the sequence (FASTA format or raw
  sequence), or type in a GI or accession number
              >Mysequence MT0895
              KIQIYGTGCANCQMLEKNAREAVKELGIDAE
              FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
              >
              KIQIYGTGCANCQMLEKNAREAVKELGIDAE
              FEKIKEMDQILEAGLTALPGLAVDGELKIDS
OR
              KIQIYGTGCANCQMLEKNAREAVKELGIDAE
              FEKIKEMDQILEAGLTALPGLAVDGELKIDS
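
The three accepted input forms above (FASTA with a description, FASTA with a bare ">", and raw sequence) all reduce to the same residues. A minimal Python sketch of that normalisation follows; `parse_query` is a hypothetical helper written for this example, and it does not handle GI or accession numbers, which need a database look-up.

```python
def parse_query(text):
    """Return (identifier, sequence) from FASTA-formatted or raw sequence text.

    The identifier is empty when there is no '>' line or the '>' line is bare.
    """
    identifier = ""
    residues = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            identifier = line[1:].strip()
        else:
            # Drop internal whitespace and normalise to upper case.
            residues.append("".join(line.split()).upper())
    return identifier, "".join(residues)


fasta = """>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAE
FEKIKEMDQILEAGLTALPGLAVDGELKIDS"""
print(parse_query(fasta))
```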
              Running NCBI BLAST
• Choose a range of interest in the sequence
  “set subsequences” (not usually used)
• Select the database from pull-down menu
  (usually choose nr = non-redundant)
• Keep CD Search “check box” on
• Leave “Options” unchanged (use defaults)
• Go to “Format” menu and adjust Number of
  descriptions and alignments as desired
              Running NCBI BLAST




                       Select Database




Conserved Domain Database
• Contains a collection of pre-identified
  functional or structural domains
• Derived from Pfam and SMART databases
  as well as other sources
• Uses Reverse Position Specific BLAST
  (RPS-BLAST) to perform search
• Query sequence is compared to a PSSM
  derived from each of the aligned domains

              Running NCBI BLAST




                              Click BLAST!



              Formatting Results




              BLAST Format Options




              BLAST Output




              BLAST Parameters
• Identities - number and % of exact residue matches
• Positives - number and % of similar plus identical matches
• Gaps - number and % of gaps introduced
• Score - summed HSP score (S)
• Bit Score - a normalized score (S')
• Expect (E) - expected number of chance HSP alignments
• P - probability of getting a score > X by chance
• T - minimum word or k-tuple score (threshold)
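
How S, S', E and P relate can be shown with the standard Karlin-Altschul formulas: S' = (lambda*S - ln K) / ln 2, E = m*n*2^(-S'), and P = 1 - exp(-E). The Python sketch below is illustrative only. The lambda and K values are assumed for the example (BLAST prints the values it actually used at the end of its report), and BLAST uses edge-corrected "effective" lengths rather than the raw lengths used here.

```python
import math

# Assumed, illustrative Karlin-Altschul parameters; read the real ones
# from the statistics block at the end of a BLAST report.
EXAMPLE_LAMBDA = 0.267
EXAMPLE_K = 0.041


def bit_score(raw_score, lam=EXAMPLE_LAMBDA, k=EXAMPLE_K):
    """Normalise a raw HSP score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance HSPs scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)


def p_value(expect):
    """Probability of seeing at least one chance HSP, given E."""
    return -math.expm1(-expect)


bits = bit_score(100)
e = expect_value(bits, query_length=62, database_length=1_000_000_000)
print(f"bit score {bits:.1f}, E {e:.3g}, P {p_value(e):.3g}")
```

For small E the two numbers are nearly identical (P is approximately E), which is why BLAST reports only E.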
          BLAST - Rules of Thumb
• Expect (E-value) is equal to the number of BLAST
  alignments with a given Score that are expected to
  be seen simply due to chance
• Don't trust a BLAST alignment with an Expect value
  > 0.01 (the grey zone is between 0.01 and 1)
• Expect and Score are related, but Expect contains
  more information. Note that % Identities is more
  useful than the bit Score
• Recall Doolittle's Curve (%ID vs. Length, next slide)
  %ID > 30 - numres/50
• If uncertain about a hit, perform a PSI-BLAST search
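
The rules of thumb above translate directly into a small Python helper. The thresholds (0.01, 1, and 30 - numres/50) are taken from this slide; the function names and the category labels are assumptions made for the example.

```python
def classify_expect(e_value):
    """Apply the slide's Expect thresholds to a single hit."""
    if e_value <= 0.01:
        return "trusted"
    if e_value <= 1.0:
        return "grey zone"
    return "not trusted"


def identity_threshold(num_residues):
    """Minimum % identity from the slide's rule: 30 - numres/50."""
    return 30.0 - num_residues / 50.0


def passes_identity_rule(percent_identity, num_residues):
    """True when the hit lies above the threshold line."""
    return percent_identity > identity_threshold(num_residues)


print(classify_expect(0.5))
print(identity_threshold(100))
print(passes_identity_rule(35.0, 100))
```

A hit that lands in the grey zone or fails the identity rule is a natural candidate for the PSI-BLAST follow-up suggested in the last bullet.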

              Doolittle's Curve
Evolutionary Distance vs. Percent Sequence Identity
[Figure: Sequence Identity (%) plotted against Number of Residues (0 to 400), with a region labelled "Twilight Zone"]
    Getting the Most from
           BLAST



              BLAST Options
• Composition-based statistics (Yes)
• Sequence Complexity Filter (Yes)
• Expect (E) value (10)
• Word Size (3)
• Substitution or Scoring Matrix (Blosum62)
• Gap Insertion Penalty (11)
• Gap Extension Penalty (1)
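
For readers who run BLAST locally rather than through the web form, the same settings can be passed on the command line. The Python sketch below only builds the argument list; it assumes the NCBI BLAST+ `blastp` program is installed, and the mapping from the form's options to flags (in particular "Yes" to `-comp_based_stats 1` and `-seg yes`) is an assumption of this example. The file and database names are placeholders.

```python
def blastp_command(query_path, database, out_path):
    """Build a blastp argument list mirroring the slide's settings."""
    return [
        "blastp",
        "-query", query_path,        # FASTA file holding the query
        "-db", database,             # name of a formatted protein database
        "-out", out_path,
        "-comp_based_stats", "1",    # composition-based statistics (assumed mapping)
        "-seg", "yes",               # low-complexity filter
        "-evalue", "10",             # Expect threshold
        "-word_size", "3",
        "-matrix", "BLOSUM62",
        "-gapopen", "11",            # gap insertion penalty
        "-gapextend", "1",           # gap extension penalty
    ]


command = blastp_command("query.fasta", "nr", "hits.txt")
print(" ".join(command))
```

If BLAST+ is available, the list can be handed to `subprocess.run(command, check=True)`; nothing is executed by the sketch itself.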

              Composition Statistics
• Recent addition to BLAST algorithm
• Permits calculated E (Expect) values to
  account for amino acid composition of
  queries and database hits
• Improves accuracy and reduces false
  positives
• Effectively conducts a different scoring
  procedure for each sequence in database

              LCRs (low complexity regions)
• Watch out for…
        – transmembrane or signal peptide regions
        – coiled-coil regions
        – short amino acid repeats (collagen, elastin)
        – homopolymeric repeats
• BLAST uses SEG to mask amino acids
• BLAST uses DUST to mask bases
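
To make the idea of masking concrete, here is a simplified Python stand-in that masks windows with low Shannon entropy. It is not the SEG or DUST algorithm: both are more elaborate, and the window length and entropy cut-off used here are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of `window`."""
    total = len(window)
    counts = Counter(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Replace residues in every low-entropy window with `mask_char`.

    Sequences shorter than `window` are returned unchanged.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("MKIQIYGTGCANCQ" + "Q" * 15 + "MLEKNAREAVKELG"))
```

Masked residues are written as X so that they cannot seed word hits, which keeps repeats from producing long lists of spurious matches.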

               Scoring Matrices
• BLOSUM Matrices
        – Developed by Henikoff & Henikoff (1992)
        – BLOcks SUbstitution Matrix
        – Derived from the BLOCKS database
• PAM Matrices
        – Developed by Schwartz and Dayhoff (1978)
        – Point Accepted Mutation
        – Derived from manual alignments of closely
          related proteins
How to Make Your Own Matrix

    Perform Alignment
        ACDEFGH..
        ACDEFGK..
        AADEFGH..
        GCDEFGH..
        ACAEYGK..
        ACAEFAH..

    Calculate Frequencies
        f(A,A) = #A_obs / #A_exp
        f(C,A) = #C/A_obs / (#A_exp + #C_exp)

    Fill Sub Matrix
             A     C     D    ...
        A   0.8    --    --
        C   0.2   0.8    --
        D   0.0   0.3   1.0
        E   --     --    --
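
The three steps on this slide can be tried out in Python on the slide's own six-sequence alignment. Note the swap: instead of the frequency ratios shown on the slide, this sketch uses the log-odds form log2(observed / expected) that BLOSUM-style matrices use, with expected pair frequencies derived from the residue frequencies in the alignment. Pairs that never occur in a column are simply left out of the result.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(aligned_sequences):
    """Return {(a, b): score} with a <= b, scores in bits."""
    # Step 1: count every residue pair within each alignment column.
    pair_counts = Counter()
    for column in zip(*aligned_sequences):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Step 2: background residue frequencies implied by those pairs.
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    total_residues = sum(residue_counts.values())
    background = {r: n / total_residues for r, n in residue_counts.items()}

    # Step 3: log-odds of observed versus expected pair frequency.
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = math.log2(observed / expected)
    return scores


alignment = ["ACDEFGH", "ACDEFGK", "AADEFGH", "GCDEFGH", "ACAEYGK", "ACAEFAH"]
for pair, score in sorted(log_odds_matrix(alignment).items()):
    print(pair, round(score, 2))
```

Positive scores mark pairs seen more often than chance predicts, and negative scores mark pairs seen less often.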

                PAM versus BLOSUM
PAM
• First useful scoring matrix for proteins
• Assumed a Markov model of evolution (i.e., all sites
  equally mutable and independent)
• Derived from small, closely related proteins
  with ~15% divergence

BLOSUM
• Much later entry to the matrix "sweepstakes"
• No evolutionary model is assumed
• Built from PROSITE-derived sequence blocks
• Uses a much larger, more diverse set of protein
  sequences (30% - 90% ID)


                PAM versus BLOSUM
PAM
• Higher PAM numbers to detect more remote
  sequence similarities
• Lower PAM numbers to detect high similarities
• 1 PAM ~ 1 million years of divergence
• Errors in PAM 1 are scaled 250X in PAM 250

BLOSUM
• Lower BLOSUM numbers to detect more remote
  sequence similarities
• Higher BLOSUM numbers to detect high similarities
• Sensitive to structural and functional substitution
• Errors in BLOSUM arise from errors in alignment


               PAM Matrices
• PAM 40 - prepared by raising PAM 1 to the
  40th power (repeated matrix multiplication);
  best for short alignments with high similarity
• PAM 120 - prepared by raising PAM 1 to the
  120th power;
  best for general alignment
• PAM 250 - prepared by raising PAM 1 to the
  250th power;
  best for detecting distant sequence similarity
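
The extrapolation from PAM 1 to PAM 40, 120 and 250 is plain matrix exponentiation of the mutation probability matrix. The Python sketch below uses a made-up three-letter alphabet with a 1% chance of change per step, purely for illustration; the real PAM 1 is a 20 x 20 matrix estimated from data, and the published scoring matrices are log-odds values computed from these probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = matrix
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical "PAM 1" for a 3-letter alphabet: each row sums to 1 and
# every residue has a 99% chance of staying unchanged.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

for n in (40, 120, 250):
    unchanged = mat_power(TOY_PAM1, n)[0][0]
    print(f"PAM {n}: probability a residue is unchanged = {unchanged:.3f}")
```

The probability of a residue remaining unchanged falls as the PAM number rises, which is why high-numbered matrices suit distant relationships. It also shows why any error in PAM 1 is carried into, and compounded in, every derived matrix.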
              BLOSUM Matrices
• BLOSUM 90 - prepared from BLOCKS sequences
  clustered at 90% sequence ID (more similar
  sequences are merged into one cluster);
  best for short alignments with high similarity
• BLOSUM 62 - prepared from BLOCKS sequences
  clustered at 62% sequence ID;
  best for general alignment (default)
• BLOSUM 30 - prepared from BLOCKS sequences
  clustered at 30% sequence ID;
  best for detecting weak local alignments
  Scraping the Bottom of
the Barrel with Psi-BLAST



               PSI-BLAST Algorithm
• Perform initial alignment with BLAST using
  BLOSUM 62 substitution matrix
• Construct a multiple alignment from matches
• Prepare position specific scoring matrix
• Use PSSM profile as the scoring matrix for a
  second BLAST run against database
• Repeat the previous three steps (multiple alignment,
  PSSM, database search) until convergence
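
The PSSM step can be sketched in Python as follows. This is a deliberately reduced illustration: it assumes an ungapped multiple alignment, a uniform background frequency and a flat pseudocount, whereas PSI-BLAST itself uses sequence weighting, data-dependent pseudocounts and gapped alignments. The function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    pssm = []
    for column in zip(*aligned_sequences):
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in ALPHABET
        })
    return pssm


def score_window(pssm, window):
    """Sum the column scores for an ungapped window of the subject."""
    return sum(column.get(res, 0.0) for column, res in zip(pssm, window))


def best_match(pssm, subject):
    """Return (start, score) of the best-scoring ungapped window."""
    width = len(pssm)
    if len(subject) < width:
        raise ValueError("subject is shorter than the profile")
    starts = range(len(subject) - width + 1)
    best = max(starts, key=lambda s: score_window(pssm, subject[s:s + width]))
    return best, score_window(pssm, subject[best:best + width])


profile = build_pssm(["KIQIY", "KIQIY", "KLQIY", "RIQIF"])
print(best_match(profile, "AAAKIQIYAAA"))
```

In an iteration loop, the hits found with the profile would be added to the multiple alignment and the PSSM rebuilt, until no new sequences are recruited.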

              PSI-BLAST




              PSI-BLAST

              Press Iterate!




              PSI-BLAST
• For Protein Sequences ONLY
• Much more sensitive than BLAST
• Slower (iterative process)
• Often yields results that are as good as
  many common threading methods
• SHOULD BE YOUR FIRST CHOICE IN
  ANALYZING A NEW SEQUENCE

              BLAST against PDB




               Still Confused?
 http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html




              Conclusions
• BLAST is the most important program in
  bioinformatics (maybe all of biology)
• BLAST is based on sound statistical
  principles (key to its speed and sensitivity)
• A basic understanding of its principles is
  key for using/interpreting BLAST output
• Use BLASTN or MEGABLAST for DNA
• Use PSI-BLAST for protein searches