Gapped BLAST and PSI-BLAST

             Altschul et al.
        Presenters: 張耿豪 莊凱翔
Outline
   BLAST 1.0 background (from lecture slides)
   BLAST 2.0
   Gapped BLAST
   PSI-BLAST
   Demonstration
Statistical preliminaries
 Pi: background probability with which amino
  acid i occurs at random at any position
 E: expected number of distinct HSPs with
  normalized score at least S
 sij: score for aligning the pair of letters (i, j)
 qij: target frequency of the aligned pair of
  letters (i, j) within HSPs (high-scoring
  segment pairs)
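The relation between a raw score, the normalized score and E can be sketched in a few lines. The formulas used are the ones given in the Altschul et al. paper, S' = (lambda*S - ln K) / ln 2 and E = N / 2^S', where N is the product of the query and database lengths. The values of `lam` and `K` below are illustrative placeholders, not parameters fitted for any particular scoring system.

```python
import math

def normalized_score(raw_score, lam, K):
    """Convert a raw HSP score S to a normalized (bit) score S'."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def expected_hsps(norm_score, query_len, db_len):
    """Expected number of distinct HSPs with normalized score >= norm_score."""
    search_space = query_len * db_len  # N = m * n
    return search_space / 2 ** norm_score

# Illustrative (assumed) parameters, not taken from the slides.
lam, K = 0.318, 0.13
s_prime = normalized_score(100, lam, K)
e_value = expected_hsps(s_prime, query_len=250, db_len=50_000_000)
print(f"S' = {s_prime:.1f} bits, E = {e_value:.3g}")
```

Because E halves for every additional bit of normalized score, E-values from different scoring systems can be compared on the same scale.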
BLAST
 Basic Local Alignment Search Tool
   (by Altschul, Gish, Miller, Myers and Lipman)
 The central idea of the BLAST algorithm is
  that a statistically significant alignment is
  likely to contain a high-scoring pair of
  aligned words.
The maximal segment pair measure
 A maximal segment pair (MSP) is
  defined to be the highest scoring pair of
  identical length segments chosen from 2
  sequences.
  (for DNA: Identities: +5; Mismatches: -4)
 The MSP score may be computed in time proportional to the product
  of their lengths. (How?) An exact procedure is too time consuming.
 BLAST heuristically attempts to calculate the MSP score.
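One way to answer the "(How?)" is a dynamic program over diagonals: the best ungapped segment ending at positions (i, j) either extends the best segment ending at (i-1, j-1) or starts fresh. The sketch below uses the DNA scores from the slide (+5 / -4); the function name `msp_score` and the convention that an empty segment scores 0 are assumptions made for illustration.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Score of the maximal segment pair of a and b (ungapped), in O(len(a) * len(b))."""
    best = 0
    prev = [0] * (len(b) + 1)  # best segment scores ending in the previous row
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Extend the segment on the same diagonal, or restart from zero.
            cur[j] = max(0, prev[j - 1] + s)
            best = max(best, cur[j])
        prev = cur
    return best

print(msp_score("AGATCGAT", "GATC"))
```

Every cell of the table is visited once, which is exactly the cost BLAST avoids by looking only at diagonals that contain a word hit.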
BLAST
1) Build the hash table for Sequence A.
2) Scan Sequence B for hits.
3) Extend hits.
                          BLAST
Step 1: Build the hash table for Sequence A. (3-tuple example)

  For DNA sequences:
  Seq. A = AGATCGAT
           12345678
  AAA
  AAC
  ..
  AGA   1
  ..
  ATC   3
  ..
  CGA   5
  ..
  GAT   2   6
  ..
  TCG   4
  ..
  TTT

  For protein sequences:
  Seq. A = ELVIS
  Add xyz to the hash table if Score(xyz, ELV) ≥ T;
  Add xyz to the hash table if Score(xyz, LVI) ≥ T;
  Add xyz to the hash table if Score(xyz, VIS) ≥ T;
  The higher T, the lower the sensitivity, but the faster the search.
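The DNA table above can be reproduced with a dictionary from each 3-tuple to its start positions. This is a minimal sketch of the exact-match (DNA) case only; the function name `build_word_index` is hypothetical, and the protein case, which also needs a substitution matrix and the threshold T, is not covered.

```python
from collections import defaultdict

def build_word_index(seq, w=3):
    """Map every length-w word of seq to its 1-based start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - w + 1):
        index[seq[pos:pos + w]].append(pos + 1)  # 1-based, as on the slide
    return dict(index)

index = build_word_index("AGATCGAT")
for word in sorted(index):
    print(word, index[word])
```

Words that never occur in Sequence A (AAA, AAC, TTT, ...) simply have no entry, which corresponds to the empty rows of the table.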
                          BLAST
Step 2: Scan sequence B for hits.
Step 3: Extend hits.
Terminate if the score of the extension fades
away. (That is, when we reach a segment pair
whose score falls a certain distance below the
best score found for shorter extensions.)
BLAST 2.0 saves the time spent in extension, and
considers gapped alignments.
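Steps 2 and 3 can be sketched together for the DNA case. The code below finds exact word hits and extends each hit in both directions along its diagonal, stopping once the running score falls more than `x_drop` below the best score seen so far. The names `find_hits`, `extend_hit` and `x_drop`, and the value 10 for the drop-off, are assumptions made for illustration; positions are 0-based.

```python
def find_hits(a, b, w=3):
    """Return (i, j) pairs with a[i:i+w] == b[j:j+w], in scan order over b."""
    index = {}
    for i in range(len(a) - w + 1):
        index.setdefault(a[i:i + w], []).append(i)
    return [(i, j) for j in range(len(b) - w + 1)
            for i in index.get(b[j:j + w], [])]

def extend_one_way(a, b, i, j, step, x_drop, match, mismatch):
    """Walk along the diagonal from (i, j); return (best gain, length at best)."""
    score = best = 0
    best_len = n = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += match if a[i] == b[j] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break  # the score has faded away: stop extending
        i += step
        j += step
    return best, best_len

def extend_hit(a, b, i, j, w=3, x_drop=10, match=5, mismatch=-4):
    """Extend a word hit at (i, j); return (score, start in a, end in a)."""
    seed = sum(match if a[i + k] == b[j + k] else mismatch for k in range(w))
    right, rlen = extend_one_way(a, b, i + w, j + w, +1, x_drop, match, mismatch)
    left, llen = extend_one_way(a, b, i - 1, j - 1, -1, x_drop, match, mismatch)
    return seed + left + right, i - llen, i + w + rlen  # end is exclusive

a, b = "AGATCGAT", "TTGATCGG"
for i, j in find_hits(a, b):
    print((i, j), extend_hit(a, b, i, j))
```

A larger `x_drop` tolerates longer low-scoring stretches before giving up, at the cost of more time spent in extension.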
Two-Hit Method
 BLAST 1.0
    The extension step accounts for 90% of total time
 Observations:
    An HSP of interest is much longer than a single word pair
    It may therefore entail multiple hits on the same diagonal,
     within a short distance of one another
 Invoke an extension only when two non-overlapping
  hits are found within distance A of one another on
  the same diagonal
Demonstration
 Recent[i]: the most recent hit found on the
  ith diagonal (always increasing)
  [Figure: three cases for a new hit on a diagonal: overlap with the previous hit; distance > A; distance < A (Extend!)]
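The bookkeeping with Recent[i] can be sketched as follows. Each hit is a pair (i, j) of start positions in the query and the database sequence, the diagonal is j - i, and hits are assumed to arrive in increasing order of j, so the stored coordinate on each diagonal only increases. The function name `two_hit_triggers`, the use of a dictionary instead of an array, and the default window of 40 are assumptions of this sketch.

```python
def two_hit_triggers(hits, w=3, window=40):
    """Return the hits that trigger an extension under the two-hit rule."""
    recent = {}  # diagonal -> database coordinate of the most recent hit
    triggers = []
    for i, j in hits:
        diag = j - i
        last = recent.get(diag)
        if last is not None:
            distance = j - last
            if distance < w:
                continue  # overlaps the most recent hit: ignore it
            if distance < window:
                triggers.append((i, j))  # second non-overlapping hit: extend
        recent[diag] = j
    return triggers

hits = [(0, 10), (1, 11), (5, 15), (60, 70)]
print(two_hit_triggers(hits))
```

Only the second of two nearby hits triggers work; isolated hits cost a dictionary lookup and nothing more.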
Discussion
 T must be lowered
    This produces more single hits, but the
     majority are dismissed without extension
 Speed:
    About twice as fast as the one-hit method
 Sensitivity:
    Almost the same
Gapped BLAST
 Original BLAST: finds several distinct HSPs
    All HSPs related to one alignment should be found
 Now:
    Find only one HSP to use as a seed, then use 2-hit
 T can be raised, which makes the search faster
    Finding all HSPs vs. finding one HSP for one optimal
     alignment
    For example, the result should be found with probability > 0.95; p: probability of missing an HSP
      Original with 2 HSPs: (1-p)(1-p) > 0.95, so p < 0.025
      Now: p^2 < 0.05, so p < 0.22
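The two bounds on p follow from simple arithmetic, reproduced here as a check. The function names are hypothetical, and the calculation assumes that HSPs are missed independently with the same probability p.

```python
def max_miss_prob_all(target, n_hsps):
    """Largest p such that all n HSPs are found with probability >= target."""
    # (1 - p)^n >= target  =>  p <= 1 - target^(1/n)
    return 1 - target ** (1 / n_hsps)

def max_miss_prob_any(target, n_hsps):
    """Largest p such that at least one of n HSPs is found with probability >= target."""
    # 1 - p^n >= target  =>  p <= (1 - target)^(1/n)
    return (1 - target) ** (1 / n_hsps)

print(round(max_miss_prob_all(0.95, 2), 3))
print(round(max_miss_prob_any(0.95, 2), 3))
```

Being allowed to miss an individual HSP almost nine times as often is what makes it possible to raise T.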
Gapped BLAST (contd)
 A gapped extension takes much longer to
  execute than an ungapped extension, but
  by performing very few of them the fraction
  of the total time can be kept low.
 Trigger a gapped extension for any HSP
  exceeding score Sg
 Original BLAST locates only the first and the last
  ungapped alignments, with an E-value more than 50 times larger
 Example
PSI-BLAST
 Position-specific score matrices
    vs. substitution matrices
    Used in the same way as an ordinary substitution matrix
 Iterated search, using position-specific score matrices
 For a BLAST run
    The matrix is constructed automatically from the output
    Use this matrix in place of the query for the next run
 For proteins, |query| = L
    Position-specific matrix: L × 20
 Benefits:
    Better at detecting weak relationships
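To make the L × 20 shape concrete, the sketch below stores a position-specific matrix as a list of L rows, each mapping the 20 amino acids to a score, and scores a segment by looking up each residue in the row for its position. The toy scores (+2 for the query residue, -1 otherwise) are placeholders chosen for illustration, not values PSI-BLAST would compute.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def score_segment(pssm, segment, offset=0):
    """Sum the position-specific scores of segment aligned at query position offset."""
    return sum(pssm[offset + k][res] for k, res in enumerate(segment))

# Toy matrix for the query ELVIS (assumed scores, for illustration only).
query = "ELVIS"
pssm = [{aa: (2 if aa == q else -1) for aa in AMINO_ACIDS} for q in query]

print(len(pssm), len(pssm[0]))       # L rows, 20 columns
print(score_segment(pssm, "ELAIS"))  # one substitution at position 3
```

With an ordinary substitution matrix the score of a residue depends on the query letter; here it depends on the query position, so two positions holding the same letter can score differently.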
Construct Position-specific matrix
1. Construct multiple alignment M from the
   output
2. For every column of M
  1) Find the reduced alignment Mc for column C
  2) Calculate scores in column C of the position-
     specific matrix
Construct multiple alignment M
 Collect the sequence segments in the output
    With E-value below a threshold (why?)
    Identical sequences are dropped
 Pairwise alignment columns that involve a gap
  inserted into the query are ignored
    The multiple alignment M has the same length
     (number of columns) as the query
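The construction of M can be sketched as a projection of each pairwise alignment onto the query coordinates. Each alignment is given here as a tuple (E-value, 0-based query start, aligned query string, aligned subject string) with '-' for gaps; this input format, the function names and the use of a space for "sequence not represented in this column" are assumptions of the sketch. Only rows that are exact duplicates are dropped, which is a simplification.

```python
def project_onto_query(query_len, q_start, aligned_query, aligned_subject):
    """Place the subject residues in query coordinates, skipping inserted columns."""
    row = [" "] * query_len
    pos = q_start
    for q_char, s_char in zip(aligned_query, aligned_subject):
        if q_char == "-":
            continue  # gap inserted into the query: column is ignored
        row[pos] = s_char
        pos += 1
    return "".join(row)

def build_multiple_alignment(query, alignments, e_threshold):
    """Return the rows of M; every row has the same length as the query."""
    rows, seen = [query], {query}
    for e_value, q_start, aligned_query, aligned_subject in alignments:
        if e_value >= e_threshold:
            continue  # not significant enough to include
        row = project_onto_query(len(query), q_start, aligned_query, aligned_subject)
        if row in seen:
            continue  # identical sequences are dropped
        seen.add(row)
        rows.append(row)
    return rows

alignments = [
    (1e-5, 0, "EL-VIS", "ELAVLS"),
    (1e-3, 1, "LVI", "L-I"),
    (5.0, 0, "ELVIS", "ELVIS"),
    (1e-4, 0, "ELVIS", "ELVIS"),
]
M = build_multiple_alignment("ELVIS", alignments, e_threshold=0.01)
print(M)
```

Local alignments that cover only part of the query leave blanks at both ends of their row, which is why different columns of M can contain different sets of sequences.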
Calculate position-specific matrix scores
 The scores of a given alignment column
  should depend not only on the residues appearing in
  that column,
 but on those in other columns as well
Find reduced Mc of column C
 R: the set of sequences that contribute a residue in
  column C
 Mc: R restricted to those columns of M in which all the
  sequences of R are represented
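The definition of R and Mc can be sketched directly. M is represented as a list of equal-length strings in which a space means that the sequence is not represented in that column; this representation and the function name `reduced_alignment` are assumptions made for illustration.

```python
def reduced_alignment(M, c):
    """Return (R, columns, Mc) for column c of the multiple alignment M."""
    # R: indices of the sequences that are represented in column c.
    R = [r for r, row in enumerate(M) if row[c] != " "]
    # Keep only the columns in which every sequence of R is represented.
    columns = [k for k in range(len(M[0])) if all(M[r][k] != " " for r in R)]
    Mc = ["".join(M[r][k] for k in columns) for r in R]
    return R, columns, Mc

M = ["ELVIS", "ELVLS", " L-I "]
print(reduced_alignment(M, 0))
print(reduced_alignment(M, 2))
```

Different columns C generally give different reduced alignments, which is how the scores of one column come to depend on residues in other columns.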
Calculate scores in column C of the position-specific matrix
 Related to the observed residue frequencies
  fi and to the number of independent observations in
  column C (Nc)
    log(Qi/Pi)
       Qi: estimated probability for residue i to be found
        in C
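How Qi might be estimated from fi and Nc can be sketched with the pseudocount scheme described in the Altschul et al. paper: Qi = (alpha*fi + beta*gi) / (alpha + beta), with gi = sum over j of (fj/Pj)*qij, alpha = Nc - 1 and beta = 10. The slides do not spell this out, so treat the formula as supplied from the paper rather than from the presentation. The two-letter alphabet and its numbers are a toy assumption chosen to keep the example readable, and the final scaling of the scores is omitted.

```python
import math

def column_scores(f, P, q, Nc, beta=10.0):
    """Scores log(Qi/Pi) for one column, mixing observed and pseudocount frequencies."""
    alpha = Nc - 1  # weight given to the observed frequencies
    scores = {}
    for i in P:
        # Pseudocount frequency implied by the target frequencies qij.
        g_i = sum(f[j] / P[j] * q[(i, j)] for j in P)
        Q_i = (alpha * f[i] + beta * g_i) / (alpha + beta)
        scores[i] = math.log(Q_i / P[i])
    return scores

# Toy two-letter alphabet (assumed values, for illustration only).
P = {"A": 0.5, "B": 0.5}
q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
f = {"A": 1.0, "B": 0.0}  # only A was observed in this column
scores = column_scores(f, P, q, Nc=3)
print(scores)
```

When Nc is small the pseudocounts dominate and the column behaves like an ordinary substitution-matrix row; as Nc grows the observed frequencies take over.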
 Thank you

 Any questions?