Documents
Resources
Learning Center
Upload
Plans & pricing Sign in
Sign Out

2pairwise_alignment

VIEWS: 2 PAGES: 70

									                     Motivation
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG
TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT
|| || ||||| ||| || ||            |||||||||||||||||||
GATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCA
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
GAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAG
GTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACA
ACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCC
TGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGT
CATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGC
ATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTT
TCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACA
ATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTT
TCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTA
CTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAG
GGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGG
TTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAAC
AAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGT
CTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAA
GGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCC
                                                            1
CTGGCTCACAAGTACCATTGA
          Lesson 2


Aligning sequences and searching
            databases




                                   2
  Homology and
sequence alignment




                     3
              Homology
Homology = Similarity between objects due to
a common ancestry
   Sequence homology
Similarity between sequences as a
result of common ancestry.

     VLSPAVKWAKVGAHAAGHG
     ||| || |||| | ||||
     VLSEAVLWAKVEADVAGHG




                                    5
      Sequence alignment

Alignment: Comparing two (pairwise) or
more (multiple) sequences. Searching
for a series of identical or similar
characters in the sequences.




                                         6
                                  VLSPAVKWAKV
                Why align?        ||| || ||||
                                  VLSEAVLWAKV
1.To detect if two sequence are homologous. If
  so, homology may indicate similarity in
  function (and structure).
2.Required for evolutionary studies (e.g., tree
  reconstruction).
3.To detect conservation (e.g., a tyrosine that
  is evolutionary conserved is more likely to be
  a phosphorylation site.

                                                  7
Insertions, deletions, and
       substitutions




                             8
Evolutionary changes in sequences

   Three types of changes:
1. Substitution – a replacement of one (or more)
   sequence letter by another: AA GA C
2. Insertion - an insertion of a letter or several
   letters to the sequence: AAGA  T
3. Deletion - deleting a letter (or more) from the
   sequence: A A GA

            Insertion + Deletion  Indel
                                                 9
       Sequence alignment
If two sequences share a common ancestor
– for example human and armadillo
hemoglobin, we can represent their
evolutionary relationship using a tree


                           VLSPAV-WAKV
                           ||| || ||||
                           VLSEAVLWAKV
VLSPAV-WAKV    VLSEAVLWAKV
                                       10
           Perfect match
A perfect match suggests that no change
has occurred from the common ancestor
(although this is not always the case).


     VLSEAVLWAKV
                            VLSPAV-WAKV
                            ||| || ||||
                            VLSEAVLWAKV
VLSPAV-WAKV    VLSEAVLWAKV
                                          11
            A substitution
A substitution suggests that at least one
change has occurred since the common
ancestor (although we cannot say in
     which lineage it has occurred).

      VLSEAVLWAKV
      VLSPAVLWAKV
                              VLSPAV-WAKV
                              ||| || ||||
                              VLSEAVLWAKV
VLSPAV-WAKV     VLSEAVLWAKV
                                            12
                  Indel
Option 1: The ancestor had L and it was
lost here*. In such a case, the event was a
deletion.


      VLSEAVLWAKV
                             VLSPAV-WAKV
        *                    ||| || ||||
                             VLSEAVLWAKV
VLSPAV-WAKV     VLSEAVLWAKV
                                              13
                    Indel
Option 2: The ancestor was shorter and the
L was inserted here*. In such a case, the
event was an insertion.
             L
       VLSEAV WAKV
                            VLSPAV-WAKV
                *           ||| || ||||
                            VLSEAVLWAKV
VLSPAV-WAKV    VLSEAVLWAKV
                                         14
                  Indel
Normally, given two sequences we cannot
tell whether it was an insertion or a
deletion, so we term the event as an indel.




 Deletion?        Insertion?


VLSPAV-WAKV     VLSEAVLWAKV
                                              15
 Indels in protein coding genes
Indels in protein coding genes are often of
3bp, 6bp, 9bp, etc...


Gene Search
In fact, searching for indels of length 3K
(K=1,2,3,…) can help algorithms that
search a genome for open reading frames
(ORFs).
                                              16
Global and Local pairwise
       alignments




                            17
              Global vs. Local                Global
                                           alignment:
• Global alignment – finds the best           forces
                                          alignment in
  alignment across the entire two            regions
                                          which differ
  sequences.        ADLGAVFALCDRYFQ
                     ||||     |||| |
                     ADLGRTQN-CDRYYQ           Local
                                            alignment
• Local alignment – finds regions of        will return
                                                only
  similarity in parts of the sequences.    regions of
                                               good
                     ADLG   CDRYFQ          alignment
                     ||||   |||| |
                     ADLG   CDRYYQ
                                                    18
      Global alignment




PTK2 protein tyrosine kinase 2 of human
and rhesus monkey



                                      19
Proteins are comprised of domains



 Human PTK2 :

 Domain A
                               Domain B


            Protein tyrosine
            kinase domain                 20
  Protein tyrosine kinase domain

In leukocytes, a different gene for tyrosine
kinase is expressed.


    Domain A
                                 Domain X


  Protein tyrosine
  kinase domain                                21
PTK2   The sequence similarity is
       restricted to a single domain
Domain A     Protein tyrosine
                                 Domain B
                   kinase domain




                                   Domain X


Protein tyrosine
kinase domain        Leukocyte TK             22
Global alignment of PTK and LTK




                                  23
Local alignment of PTK and LTK




                                 24
             Conclusions
Use global alignment when the two
sequences share the same overall
sequence arrangement.


Use local alignment to detect regions of
similarity.



                                           25
How alignments scores
   are computed?




                        26
    Pairwise alignment

          AAGCTGAATTCGAA
          AGGCTCATTTCTGA



One possible alignment:

         AAGCTGAATT-C-GAA
         AGGCT-CATTTCTGA-


                            27
      AAGCTGAATT-C-GAA
      AGGCT-CATTTCTGA-




This alignment includes:
         2 mismatches
         4 indels (gap)
     10 perfect matches

                           28
       Choosing an alignment
       for a pair of sequences
      Many different alignments are
      possible for 2 sequences:
             AAGCTGAATTCGAA
             AGGCTCATTTCTGA


A-AGCTGAATTC--GAA     AAGCTGAATT-C-GAA
AG-GCTCA-TTTCTGA-     AGGCT-CATTTCTGA-


      Which alignment is better?         29
            Scoring system (naïve)
  Perfect match: +1
  Mismatch: -2
  Indel (gap): -1
    AAGCTGAATT-C-GAA                        A-AGCTGAATTC--GAA
    AGGCT-CATTTCTGA-                        AG-GCTCA-TTTCTGA-

Score: = (+1)x10 + (-2)x2 + (-1)x4 = 2   Score: = (+1)x9 + (-2)x2 + (-1)x6 = -1



          Higher score  Better alignment                                 30
Scoring systems




                  31
          Scoring system

• In the example above, the choice of +1
  for match,-2 for mismatch, and -1 for
  gap is quite arbitrary

• Different scoring systems  different
  alignments

• We want a good scoring system…
                                           32
             Scoring matrix

• Representing the
                                 A    G    C    T
  scoring system as a
  table or matrix n X n (n   A   2
  is the number of
  letters the alphabet       G   -6   2
  contains. n=4 for
                             C   -6   -6   2
  nucleotides, n=20 for
  amino acids)               T   -6   -6   -6   2
• Symmetric
                                                    33
          DNA scoring matrices
• Uniform substitutions between all nucleotides:


        From       A        G        C        T
         To
          A        2
          G        -6       2
          C        -6       -6       2
          T        -6       -6       -6       2


               Mismatch              Match         34
      DNA scoring matrices
Can take into account biological phenomena
  such as:
• Transition-transversion




                                             35
    Amino-acid
     scoring
     matrices
•    Take into
     account
     physico-
     chemical
     properties


                  36
    Amino-acid substitution matrices
• Actual substitutions:
  – Based on empirical data
  – Commonly used by many bioinformatics
    programs
  – PAM & BLOSUM




                                           37
       Protein matrices – actual
             substitutions
The idea: Given an alignment of a large number of
closely related sequences we can score the relation
between amino acids based on how frequently they
substitute each other
            M   G   Y   D   E
            M   G   Y   D   E
            M   G   Y   E   E   In the 4th Column D/E is found in 7/8
            M   G   Y   D   E    of the cases (compared with 5/8 to
            M   G   Y   Q   E               D/Q and E/Q).
            M   G   Y   D   E
            M   G   Y   E   E
            M   G   Y   E   E

                                                                        38
  BLOSUM: Blocks Substitution
           Matrix
• Based on BLOCKS database
  – ~2000 blocks from 500 families of related
    proteins
  – Families of proteins with identical function
• Blocks are short                 AABCDA----BBCDA
  conserved patterns of            DABCDA----BBCBB
                                   BBBCDA-AA-BCCAA
  3-60 aa                          AAACDA-A--CBCDB
  without gaps                     CCBADA---DBBDCC
                                   AAACAA----BBCCC
                                                   39
               BLOSUM
• Each block represents a sequence
  alignment with different identity
  percentage

• For each block the amino-acid substitution
  rates were calculated to create the
  BLOSUM matrix


                                           40
      BLOSUM Matrices
• BLOSUMn is based on sequences that
  share at least n percent identity

• BLOSUM62 represents closer sequences
  than BLOSUM45




                                       41
     Example : Blosum62




Derived from blocks where the sequences
share at least 62% identity               42
              Scoring gaps
In advanced algorithms, two gaps of one
amino-acid (X-Y-) are given a different score
than one gap of two amino acids (X--Y).
This is performed by giving different penalty for
“opening” a gap and for extending a gap


Gap extension penalty < Gap opening penalty
                                              43
      Intermediate summary
1. Scoring system =
   substitution matrix + gap penalty.
2. Used for both global and local alignment
3. For amino acids, there are two types of
   substitution matrices: PAM and
   BLOSUM



                                              44
Computational aspects




                        45
   Many possible alignments
 AAGCTGAATTCGAA     It is not trivial (for
 AGGCTCATTTCTGA
                    most people) to
AAGCTGAATT-C-GAA    figure out how to go
AGGCT-CATTTCTGA-    over all possible
                    pairwise alignments
AAG-CTGAATT-C-GAA   and find the one with
AGGCT-CATTT-CTGA-   the highest score.

AAGCT-GAATT-C-GAA
A-GGCT-CATTTCTGA-
                                         46
     Optimal alignment algorithms
•Needleman-Wunsch (global) [1970]
•Smith-Waterman (local) [1981]

Their algorithm’s complexity is O(mn)
(m – length of sequence 1, n – length of sequence 2).

Informally:
If one doubles the sequence length of one sequence  it
doubles the computation time.
If one doubles both  it quadruples the computation time. For
proteins of lengths < 1000 it takes much less than a second to
compute the alignments.
                                                            47
      Dynamic programming
• Solving a problem with many overlapping
  sub-problems
• Example:

 Fibonacci sequnce:
 1, 1, 2, 3, 5, 8,13,…

 F)1) = F(2) = 1;
 F(n) = F(n-1) + F(n-2)
                                            48
       Dynamic programming
                               F)1) = F(2) = 1;
• Naïvely solving F(7):        F(n) = F(n-1) + F(n-2)
  F(7) = F(6) + F(5)
    = F(5) + F(4) + F(4) + F(3)
    = F(4) + F(3) + F(3) + F(2) + F(3) + F(2) +F(2)
    + F(1)
    = F(3) + F(2) + F(2) + F(1) + F(2) + F(1) + F(2)
    + F(2) + F(1) + F(2) +F(2) + F(1)
    = F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(1)
    + F(2) + F(2) + F(1) + F(2) +F(2) + F(1)
    = 13                                           49
      Dynamic programming
• F(7) using Dynamic programming:
  F(3) = F(2) + F(1) = 1 + 1 = 2
  F(4) = F(3) + F(2) = 2 + 1 = 3
  F(5) = F(4) + F(3) = 3 + 2 = 5
  F(6) = F(5) + F(4) = 5 + 3 = 8
  F(7) = F(6) + F(5) = 8 + 5 = 13




                                    50
   Needleman Wunsch (1970)
Finds the best alignment for the first i
characters of seq1 with the first j of seq2
• Base Case:
  F ( i , 0 ) = i ×Gap penalty      F ( 0 , j ) = j ×Gap penalty

• Recursion rule
                        F ( i - 1, j - 1) + s ( x i , y j ),
    F ( i , j ) = max   F ( i - 1, j ) + Gap penalty
                        F ( i , j - 1) + Gap penalty
                                                               51
   Needleman Wunsch (1970)
• Base Case:
  F ( i , 0 ) = i ×Gap penalty      F ( 0 , j ) = j ×Gap penalty

• Recursion rule
                        F ( i - 1, j - 1) + s ( x i , y j ),
    F ( i , j ) = max   F ( i - 1, j ) + Gap penalty
                        F ( i , j - 1) + Gap penalty

• Cool alignment applet:
  http://baba.sourceforge.net/
                                                               52
Searching databases




                      53
 Searching a sequence database

Idea: In order to find homologous sequences
to a sequence of interest, one should compute
its pairwise alignment against all known
sequences in a database, and detect the best
scoring significant homologs
      The same idea in short: Use your
   sequence as a query to find homologous
     sequences in a sequence database
                                            54
            Some terminology
• Query sequence - the sequence with
  which we are searching
• Hit – a sequence found in the database,
  suspected as homologous




                                            55
Protein or DNA search




                        56
   Query sequence: DNA or protein?
• For coding sequences, we can use the
  DNA sequence or the protein sequence to
  search for similar sequences.
• Which is preferable?




                                            57
             Protein is better!
• Selection (and hence conservation) works
  (mostly) at the protein level:

 CTTTCA =     Leu-Ser
 TTGAGT =     Leu-Ser




                                             58
              Query type
• Nucleotides: 4 letter alphabet
• Amino acids: 20 letter alphabet




• Two random DNA sequences will, on
  average, have 25% identity
• Two random protein sequences will,
  on average, have 5% identity
                                       59
        Conclusion
The amino-acid sequence is often
 preferable for homology search




                                   60
Computation time




                   61
 How do we search a database?
• If each pairwise alignment takes 1/10 of a
  second, and if the database contains 107
  sequences, it will take 106 seconds
  = 11.5 days to complete one search.

• 150,000 searches (at least!!) are
  performed per day. >82,000,000 sequence
  records in GenBank.
                                               62
                Conclusion
• Using the exact comparison pairwise
  alignment algorithm between the query
  and all DB entries – too slow




                                          63
                Heuristic
• Definition: a heuristic is a design to
  solve a problem that does not
  provide an exact solution (but is not
  too bad) but reduces the time
  complexity of the exact solution



                                           64
BLAST




        65
                   BLAST

• BLAST -
  Basic Local Alignment and Search Tool
• A heuristic for searching a database for
  similar sequences
• The heuristic based on restrictions of the
  similarity (such as using ungapped word
  matching instead of single character
  matching).
                                               66
                  DNA or Protein
• All types of searches are possible
         Query:        DNA           Protein



         Database:     DNA           Protein
 blastn – nuc vs. nuc
 blastp – prot vs. prot
 blastx – translated query vs. protein database
 tblastn – protein vs. translated nuc. DB
 tblastx – translated query vs. translated database
                                                      67
                    E-value
• The number of times we will theoretically
  find an alignment with a score ≥ Y of a
  random sequence vs. a random database

  Theoretically,   In practice – BLAST uses estimations.
  we could trust    E-values of 10-4 and lower indicate a
                             significant homology.
    any result
                   E-values between 10-4 and 10-2 should
     with an        be checked (similar domains, maybe
   E-value ≤ 1                non-homologous).
                    E-values between 10-2 and 1 do not
                         indicate a good homology
                                                       68
         Filtering low complexity
• Low complexity regions : e.g., Proline rich
  areas (in proteins), Alu repeats (in DNA)
• Regions of low complexity generate high
  scores of alignment, BUT – this does not
  indicate homology




                                                69
                  Solution
• In BLAST there is an option to mask low-
  complexity regions in the query sequence
  (such regions are represented as XXXXX
  in query)




                                             70

								
To top