					            Pairwise Alignment


           How do we tell whether
            two sequences are
                 similar?
    Assigned reading:
    Ch 4.1-4.7,
    Ch 5.1, get what you can out of 5.2, 5.4
BIO520 Bioinformatics   Jim Lund
        Pairwise alignment

• DNA:DNA
• polypeptide:polypeptide


The   BASIC Sequence Analysis Operation
          Alignments

• Pairwise sequence alignments
  –One-to-One
  –One-to-Database
• Multiple sequence alignments
  –Many-to-Many
Origins of Sequence Similarity

 • Homology
   – common evolutionary descent
 • Chance
   – Short similar segments are very
     common.
 • Similarity in function
   – Convergence (very rare)
Visual sequence comparison: Dotplot
Visual sequence comparison: Filtered dotplot




       4 bp window, 75% identity cutoff
Visual sequence comparison: Dotplot




     4 bp window, 75% identity cutoff
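
A filtered dotplot of this kind can be sketched in a few lines of Python. This is only an illustration, assuming a 4-base window and a 75% identity cutoff (3 of 4 bases matching); the function names and the example sequences are made up for the sketch.

```python
def dotplot(seq_a, seq_b, window=4, cutoff=0.75):
    """Return the set of (i, j) where the windows starting at seq_a[i] and
    seq_b[j] are identical at >= cutoff of their positions."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq_a[i:i + window], seq_b[j:j + window]))
            if matches / window >= cutoff:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, base in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "."
                      for i in range(len(seq_a)))
        lines.append(base + " " + row)
    return "\n".join(lines)


dots = dotplot("GAACAAT", "GAATAAT")
print(render("GAACAAT", "GAATAAT", dots))
```

Similar regions show up as diagonal runs of dots. Raising the window size or the cutoff removes more of the background noise from short chance matches.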
Dotplots of sequence
  rearrangements
Assessing similarity

 GAACAAT
 ||||||| 7/7 OR 100%
 GAACAAT
            Which is BETTER?
            How do we SCORE?

    GAACAAT
     |   1/7 or 14%
GAACAAT
    Similarity

GAACAAT
||||||| 7/7 OR 100%
GAACAAT
            MISMATCH
GAACAAT
||| ||| 6/7 OR 86%
GAATAAT
  Mismatches

GAACAAT
||| ||| 6/7 OR 86%
GAATAAT


GAACAAT
||| ||| 6/7 OR 86%
GAAGAAT
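
Percent identity is just the number of matching columns divided by the number of aligned columns. A minimal sketch, assuming the two strings are already aligned and of equal length (the function name is illustrative):

```python
def percent_identity(top, bottom):
    """Percent of aligned columns where the two sequences carry the same base."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b for a, b in zip(top, bottom))
    return 100.0 * matches / len(top)


print(round(percent_identity("GAACAAT", "GAACAAT")))  # 100
print(round(percent_identity("GAACAAT", "GAATAAT")))  # 86 (6 of 7 columns match)
```

Note that identity alone does not say which base replaced which: GAATAAT and GAAGAAT get the same value against GAACAAT.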
Terminal Mismatch

     GAACAATttttt
     ||| |||
aaaccGAATAAT 6/7 OR 86%
       INDELS

GAAgCAAT
||| ||||   7/7 OR 100%
GAA*CAAT
      Indels, cont’d


GAAgCAAT       GAAggggCAAT
||| ||||       |||    ||||
GAA*CAAT       GAA****CAAT
         Similarity Scoring

Common Method:
•   Terminal mismatches (0)
•   Match score (1)
•   Mismatch penalty (-3)
•   Gap penalty (-1)
•   Gap extension penalty (-1)
             DNA Defaults
             DNA Scoring

GGGGGGAGAA
|||||*|*||      8(1)+2(-3)=2
GGGGGAAAAAGGGGG


GGGGGGAGAA--GGG
|||||*|*||  |||   11(1)+2(-3)+1(-1)+1(-1)=3
GGGGGAAAAAGGGGG
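
The two scores above can be reproduced with a short function. This is a sketch under the scoring scheme listed on the previous slide (match 1, mismatch -3, gap -1, gap extension -1, terminal overhangs 0). It assumes the first position of a gap costs the gap penalty and each further position costs the extension penalty, which is how the 1(-1)+1(-1) term above is counted. The function name and the use of '-' for gaps are illustrative.

```python
def score_alignment(top, bottom, match=1, mismatch=-3,
                    gap_open=-1, gap_extend=-1):
    """Score a pre-aligned pair of equal-length strings ('-' marks a gap).
    Columns at either end where one sequence has run out score 0."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    # Trim terminal overhangs: leading/trailing columns that contain a gap.
    start, end = 0, len(top)
    while start < end and "-" in (top[start], bottom[start]):
        start += 1
    while end > start and "-" in (top[end - 1], bottom[end - 1]):
        end -= 1
    score = 0
    in_gap = False
    for a, b in zip(top[start:end], bottom[start:end]):
        if a == "-" or b == "-":
            # First gap column pays gap_open, later ones pay gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


print(score_alignment("GGGGGGAGAA-----", "GGGGGAAAAAGGGGG"))  # 2
print(score_alignment("GGGGGGAGAA--GGG", "GGGGGAAAAAGGGGG"))  # 3
```

For simplicity the sketch does not distinguish a gap in the top sequence from an adjacent gap in the bottom one; both count as one continuing gap.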
Absurdity of Low Gap Penalty



     GATCGCTACGCTCAGC
      A.C.C..C..T


    Perfect similarity,
      Every time!
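
Why this is absurd: if gaps cost nothing, any query whose bases occur in the same order somewhere in the subject aligns with no mismatches at all. A small sketch of that check (the function name is made up; the query ACCCT is the one dotted out above):

```python
def aligns_perfectly_with_free_gaps(query, subject):
    """True if every base of query can be matched, in order, within subject
    when gaps of any length are allowed at no cost."""
    position = 0
    for base in query:
        position = subject.find(base, position)
        if position == -1:
            return False
        position += 1  # the next base must match further to the right
    return True


print(aligns_perfectly_with_free_gaps("ACCCT", "GATCGCTACGCTCAGC"))  # True
```

With only four letters in the alphabet, almost any short query passes this test against a long enough subject, so the "perfect" match carries no information.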
Sequence alignment algorithms


 • Local alignment
   – Smith-Waterman
 • Global alignment
   – Needleman-Wunsch
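
Both algorithms fill the same dynamic-programming table and differ in two details: the global version charges for gaps along the table edges and reads the score from the last cell, while the local version never lets a cell drop below zero and takes the best cell anywhere. The sketch below returns only the score and assumes a simple linear gap cost (every gap position costs the same), which is a simplification; the function name and default scores are illustrative.

```python
def align_score(a, b, local, match=1, mismatch=-3, gap=-1):
    """Best alignment score by dynamic programming.
    local=True  -> Smith-Waterman (local alignment)
    local=False -> Needleman-Wunsch (global alignment)"""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1]
                                          else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


print(align_score("ttttgaggcaaaa", "ccgaggccc", local=True))   # 5 (gaggc)
print(align_score("GAACAAT", "GAACAAT", local=False))          # 7
```

Recovering the alignment itself requires a traceback through the table, which is left out here to keep the sketch short.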
     Alignment Programs


• Local alignment (Smith-Waterman)
  – BLAST (heuristic approximation of Smith-Waterman)
  – FASTA (heuristic approximation of Smith-Waterman)
  – BESTFIT (GCG program)
• Global alignment (Needleman-Wunsch)
  – GAP
    Local vs. global alignment

                         10 gaggc 15
                            |||||
                          3 gaggc 7
Local alignment: alignment of regions of substantial similarity

           1    gggggaaaaagtggccccc 19
                ||        ||||   ||
           1 gggggttttttttgtggtttcc 22

Global alignment: alignment of the full length of the sequences
Local vs. global alignment
            BLAST Algorithm

Look for local alignments: High-scoring Segment Pairs (HSPs)
• Find words of length W in the query that match the subject with
  score > T.
• Extend each hit in both directions until the score drops X below
  its running maximum.
• Keep HSPs with scores > S.
• Report multiple HSPs per query if present.
• Compute an expectation value (E value) using Karlin-Altschul
  statistics.
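
The seed-and-extend idea in the steps above can be sketched as follows. This is a toy version, not the real BLAST code: it assumes DNA input, uses exact word matches as seeds (so the threshold T is not modelled), extends without gaps, and skips the statistics. All names and default values are illustrative.

```python
def find_hsps(query, subject, word=4, match=1, mismatch=-3,
              x_drop=5, s_cutoff=4):
    """Toy seed-and-extend search. Returns a sorted list of
    (score, query_start, subject_start, length) tuples."""
    # Index every word of the query by its start position.
    index = {}
    for i in range(len(query) - word + 1):
        index.setdefault(query[i:i + word], []).append(i)

    def extend(i, j, step):
        """Walk from (i, j) in one direction; return (best gain, length)."""
        best = running = 0
        best_len = length = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            running += match if query[i] == subject[j] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running >= x_drop:
                break  # score has fallen X below the maximum: stop
            i += step
            j += step
        return best, best_len

    hsps = set()
    for j in range(len(subject) - word + 1):
        for i in index.get(subject[j:j + word], []):
            right, right_len = extend(i + word, j + word, +1)
            left, left_len = extend(i - 1, j - 1, -1)
            score = word * match + left + right
            if score > s_cutoff:
                hsps.add((score, i - left_len, j - left_len,
                          left_len + word + right_len))
    return sorted(hsps, reverse=True)


print(find_hsps("ttttgaggcaaaa", "ccgaggccc"))  # [(5, 4, 2, 5)]
```

Two overlapping seeds (gagg and aggc) extend to the same gaggc segment here, so collecting results in a set reports it once.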
 BLAST statistical significance:
  assessing the likelihood a match
        occurs by chance

Karlin-Altschul statistic:
E = k m N exp(-Lambda S)

m = Size of query sequence
N = Size of database
k = Search space scaling parameter
Lambda = scoring scaling parameter
S = BLAST HSP score

Low E -> good match
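
The formula is easy to evaluate directly. In the sketch below the default values of k and Lambda are placeholders chosen for illustration only; in practice both depend on the scoring matrix and gap penalties in use, and the sequence and database sizes in the example are made up.

```python
import math


def expect_value(score, query_len, db_len, k=0.041, lam=0.267):
    """E = k * m * N * exp(-Lambda * S): the number of HSPs with score >= S
    expected by chance in a search space of size m * N."""
    return k * query_len * db_len * math.exp(-lam * score)


# Same query and database, increasing HSP scores: E falls off exponentially.
for s in (30, 60, 90):
    print(s, expect_value(s, query_len=300, db_len=1_000_000))
```

Because E scales with m and N, the same score is less significant in a larger database.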
 BLAST statistical significance:


Rule of thumb for a good match:

•Nucleotide match
  •E < 1e-6
  •Identity > 70%

•Protein match
   •E < 1e-3
   •Identity > 25%
  Protein Similarity Scoring

• Identity - Easy
• WEAK Alignments
• Chemical Similarity
   – L vs I, K vs R…
• Evolutionary Similarity
   – How do proteins evolve?
   – How do we infer
     similarities?
              BLOSUM62

    C    S    T    P    A    G    N    D
C   9   -1   -1   -3    0   -3   -3   -3
S  -1    4    1   -1    1    0    1    0
T  -1    1    5   -1    0   -2    0   -1
P  -3   -1   -1    7   -1   -2   -2   -1
A   0    1    0   -1    4    0   -2   -2
G  -3    0   -2   -2    0    6    0   -1
N  -3    1    0   -2   -2    0    6    1
D  -3    0   -1   -1   -2   -1    1    6
 Single-base evolution
changes the encoded AA

      CAU=H
CAC=H CGU=R UAU=Y
CAA=Q CCU=P GAU=D
CAG=Q CUU=L AAU=N
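
The nine neighbours of CAU listed above can be generated from the standard genetic code. A minimal sketch, assuming the standard code and the RNA alphabet used on the slide; the table is built from the usual UCAG ordering and the function name is illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def single_base_neighbors(codon):
    """Map every codon that differs from `codon` at exactly one base
    to the amino acid it encodes."""
    neighbors = {}
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                neighbors[mutant] = CODON_TABLE[mutant]
    return neighbors


print(CODON_TABLE["CAU"])            # H
print(single_base_neighbors("CAU"))  # 9 codons: H, Q, Q, P, R, L, Y, D, N
```

Only one of the nine changes (CAC) is silent, which is part of why some amino acid substitutions are seen far more often than others.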
   Substitution Matrices

Two main classes:

• PAM-Dayhoff

• BLOSUM-Henikoff
           PAM-Dayhoff

• Built from closely related proteins,
  substitutions constrained by evolution
  and function
• “accepted” by evolution (Point
  Accepted Mutation=PAM)
• 1 PAM = 1% divergence (one accepted point mutation per 100 residues)
     • PAM120=closely related proteins
     • PAM250=divergent proteins
              BLOSUM-Henikoff & Henikoff

• Built from ungapped alignments in proteins:
  “BLOCKS”
• Cluster sequences in each block that share at least a given % identity;
  each cluster counts as one sequence
• Calculate “target” frequencies
• BLOSUM62 = sequences clustered at 62% identity
   – good general purpose
• BLOSUM30
   – Detects weak similarities, used for distantly related proteins
              BLOSUM62

    C    S    T    P    A    G    N    D
C   9   -1   -1   -3    0   -3   -3   -3
S  -1    4    1   -1    1    0    1    0
T  -1    1    5   -1    0   -2    0   -1
P  -3   -1   -1    7   -1   -2   -2   -1
A   0    1    0   -1    4    0   -2   -2
G  -3    0   -2   -2    0    6    0   -1
N  -3    1    0   -2   -2    0    6    1
D  -3    0   -1   -1   -2   -1    1    6
      Gapped alignments

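• No general theory for significance of
  matches!!
• Gap penalty = G + L(n)
  – G = gap opening penalty, L = per-residue extension penalty,
    n = gap length
  – indel mutations are rare
  – variation in gap length is “easy”, so G > L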
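
A short calculation shows what G > L buys. The sketch assumes the penalty for a gap of n residues is G + L*n, as written above; the values G = 11 and L = 1 are illustrative choices, not values taken from these slides.

```python
def gap_cost(n, G=11, L=1):
    """Affine gap penalty G + L*n for a single gap of n residues."""
    if n <= 0:
        return 0
    return G + L * n


# One indel event of length 4 versus four separate single-residue indels.
print(gap_cost(4))      # 15
print(4 * gap_cost(1))  # 48
```

Opening a gap is expensive (indel events are rare) but lengthening one is cheap, so the aligner prefers one long gap over many scattered short ones.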
Real Alignments
Phylogeny
        Cow-to-Pig Protein
  1 MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHL   50
    ||||||||||||| |||||||||||||||||||| |||||||||||||||
  1 MGLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHL   50
             .         .         .         .         .
 51 KTEAEMKASEDLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKI   100
    |.| ||||||||||||||||||||||||||||||||. ||:||| ||||
 51 KSEDEMKASEDLKKHGNTVLTALGGILKKKGHHEAELTPLAQSHATKHKI   100
             .         .         .         .         .
101 PVKYLEFISDAIIHVLHAKHPSDFGADAQAAMSKALELFRNDMAAQYKVL   150
    |||||||||:||| || .||| ||||||| |||||||||||||||.|| |
101 PVKYLEFISEAIIQVLQSKHPGDFGADAQGAMSKALELFRNDMAAKYKEL   150

151 GFHG 154
    || |
151 GFQG 154
           Cow-to-Pig cDNA
  1 CAGCTGTCGGAGACAGACACCCAGTCAGTCCCGCCCTTGTTCTTTTTCTC   50
           | ||| |||     ||   | ||||| |||| ||| ||||||
  1 .......CAGAGCCAGGACACCCAGTACGCCCGCACTTGCTCTGTTTCTC   43
             .         .         .         .         .
 51 TTCTTCAGACTGCGCCATGGGGCTCAGCGACGGGGAATGGCAGTTGGTGC   100
    |||| ||||||| |||||||||||||||||||||||||||||| ||||||
 44 TTCTGCAGACTGTGCCATGGGGCTCAGCGACGGGGAATGGCAGCTGGTGC   93
             .         .         .         .         .
101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG   150
    |||| | |||||||||||||||||||||||||||||||||||||||||||
 94 TGAACGTCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG   143
             .         .         .         .         .
151 GTCCTCATCAGGCTCTTCACAGGTCATCCCGAGACCCTGGAGAAATTTGA   200
    ||||||||||||||||| | ||||| |||||||||||||||||||||||
144 GTCCTCATCAGGCTCTTTAAGGGTCACCCCGAGACCCTGGAGAAATTTGA   193
             .         .         .         .         .
201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC   250
    |||||| |||||||||||| |||||| ||||||||||||||| |||||||
194 CAAGTTTAAGCACCTGAAGTCAGAGGATGAGATGAAGGCCTCTGAGGACC   243


         80% Identity (88% at aa!)
     DNA similarity reflects
     polypeptide similarity
101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150
    |||| | |||||||||||||||||||||||||||||||||||||||||||
 94 TGAACGTCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 143




501 CCAGTACAAGGTGCTGGGCTTCCATGGCTAAGCCCCACCCCTGTGCCCCT 550
    | ||||||||| |||||||||||| ||||||||||| |    | || |
494 CAAGTACAAGGAGCTGGGCTTCCAGGGCTAAGCCCCCCAGACGCCCCTCA 543
             .         .         .         .         .
Coding vs Non-coding Regions

451 CAGGCTGCCATGAGCAAGGCCCTGGAACTGTTCCGGAATGACATGGCTGC   500
    |||| ||||||||||||||||||||||| |||||||| |||||||| ||
444 CAGGGAGCCATGAGCAAGGCCCTGGAACTCTTCCGGAACGACATGGCGGC   493
             .         .         .         .         .
501 CCAGTACAAGGTGCTGGGCTTCCATGGCTAAGCCCCACCCCTGTGCCCCT   550
    | ||||||||| |||||||||||| ||||||||||| |    | || |
494 CAAGTACAAGGAGCTGGGCTTCCAGGGCTAAGCCCCCCAGACGCCCCTCA   543
             .         .         .         .         .
551 CAC.CCCACCCACCTGGG...........CAGGGTGGGCGGGGACTGAAT   588
    | | |||| |||| ||||           | || ||| ||| |||||
544 CCCACCCATCCACTTGGGCCAGGGCCCCCCGCGGAGGGTGGGCGCTGAAG   593
             .         .         .         .         .
589 CCCAAGTAGTTATAGGGTTTGCTTCTGAGTGTGTGCTTTGTTTAGGAGAG   638
    | | |||| | |||||||||||||||||||| ||||||||| | |||||
594 CTCCTGTAGCTGTAGGGTTTGCTTCTGAGTGT.TGCTTTGTTCATGAGAG   642
             .         .         .         .         .
639 GTGGGTGGAAGAGGTGGATGGGTTAGGGGTGGAGG...............   673
    |||||||| ||||||||| ||| | | ||||| ||
643 GTGGGTGGGAGAGGTGGAGGGGCTGGTGGTGGTGGTGGGGGGGTGTTCAG   692


 90% in coding (70% in non-coding)
        Third Base of Codon is
            Hypervariable

201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC   250
    ||||||*||||||||||||*||||||*|||||||||||||||*|||||||
194 CAAGTTTAAGCACCTGAAGTCAGAGGATGAGATGAAGGCCTCTGAGGACC   243
             .         .         .         .         .
251 TGAAGAAGCATGGCAACACGGTGCTCACGGCCCTGGGGGGTATCCTGAAG   300
    ||||||||||*||||||||||||||*||*|||||||||||*|||||*|||
244 TGAAGAAGCACGGCAACACGGTGCTGACTGCCCTGGGGGGCATCCTTAAG   293
       Cow-to-Fish Protein

  1 MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHL   50
          :. :|| || .||| | || || |||| |||||. | || :
  1 ....MADFDMVLKCWGPMEADHATHGSLVLTRLFTEHPETLKLFPKFAGI   46
             .         .         .         .         .
 51 KTEAEMKASEDLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKI   100
        ::     . || ||| || :|| :| | | .| |. ||| ||||
 47 .AHGDLAGDAGVSAHGATVLNKLGDLLKARGAHAALLKPLSSSHATKHKI   95
             .         .         .         .         .
101 PVKYLEFISDAIIHVLHAKHPSDFGADAQAAMSKALELFRNDMAAQYKVL   150
    |:   . |.: | |: |     | | | |:     : :   || | || |
 96 PIINFKLIAEVIGKVMEEKAGLD..AAGQTALRNVMAIIITDMEADYKEL   143

151 GFHG 154
    ||
144 GFTE 147



       42% identity, 51% similarity
        Cow-to-Fish DNA
 32 .ACAGGACATTTTACTACTCTGCAGATAATGGCTGACTTTGACATGGTAC   80
      |   | | |   | |    || |      | || |    | |||| |
 51 TTCTTCAGACTGCGCCATGGGGCTCAGCGACGGGGAATGGCAGTTGGTGC   100
             .         .         .         .         .
 81 TGAAGTGCTGGGGTCCAATGGAGGCGGACCACGCAACCCACGGGAGTCTG   130
    ||||   ||||||     ||||||| ||   |||| ||| |||      |
101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG   150
             .         .         .         .         .
131 GTGCTGACCCGTTTATTCACAGAGCACCCAGAAACCCTAAAGTTATTCCC   180
    || || | | | | ||||||| || || || ||||| || |||
151 GTCCTCATCAGGCTCTTCACAGGTCATCCCGAGACCCTGGAGAAATTTGA   200
             .         .         .         .         .
181 CAAGTTTGCTGGC...ATCGCCCATGGGGACCTGGCCGGGGATGCAGGTG   227
    ||||||      |   |   | | | || ||      |     | |
201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC   250




                48% similarity
              Protein vs. DNA
                Alignments

• Polypeptide similarity > DNA
• Coding DNA > Non-coding

• 3rd base of codon hypervariable
• Moderate Distance -> poor DNA similarity
           Rules of Thumb

• DNA-DNA similarities
   – 50% significant if “long”
   – E < 1e-6, 70% identity
• Protein-protein similarities
   – 80% end-end: same structure, same function
   – 30% over domain, similar function, structure
     overall similar
   – 15-30% “twilight zone”
   – Short, strong match…could be a “motif”
     Basic BLAST Family
• BLASTN
  – DNA to DNA database
• BLASTP
  – protein to protein database
• TBLASTN
  – protein to DNA database (translated)
• BLASTX
  – DNA (translated) to protein database
• TBLASTX
  – DNA (translated) to DNA database (translated)
                   DNA Databases

    • nr (non-redundantish merge of Genbank,
      EMBL, etc…)
         – EXCLUDES HTGS0,1,2, EST, GSS, STS, PAT, WGS
    •   est (expressed sequence tags)
    •   htgs (high throughput genome seq.)
    •   gss (genome survey sequence)
    •   vector, yeast, ecoli, mito
    •   chromosome (complete genomes)
    •   And more

http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases
       Protein Databases

• nr (non-redundant Swiss-prot, PIR,
  PRF, PDB, Genbank CDS)
• swissprot
• ecoli, yeast, fly
• month (sequences added in the last 30 days)
• And more
            BLAST Input

•   Program
•   Database
•   Options - see more
•   Sequence
    – FASTA
    – gi or accession#
          BLAST Options

• Algorithm and output options
  – # descriptions, # alignments returned
  – Probability cutoff
  – Strand
• Alignment parameters
  – Scoring Matrix
     • PAM30, PAM70, BLOSUM45,
       BLOSUM62, BLOSUM80
  – Filter (low complexity) PPPPP->XXXXX
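
The effect of the filter on the PPPPP example can be imitated with a regular expression. This is a crude stand-in that only masks runs of one repeated residue, assuming a minimum run length of 5; real low-complexity filters use more general measures of sequence composition. The function name is made up.

```python
import re


def mask_runs(sequence, min_run=5):
    """Replace any run of min_run or more identical residues with X's."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "X" * len(m.group(0)), sequence)


print(mask_runs("MKPPPPPLV"))  # MKXXXXXLV
```

Masked positions cannot seed word hits, which keeps proline-rich or poly-A stretches from producing long lists of meaningless matches.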
  Extended BLAST Family

• Gapped Blast (default)
• PSI-Blast (Position-specific iterated
  blast)
   – “self” generated scoring matrix
• PHI BLAST (motif plus BLAST)
• BLAST2 client (align two seqs)
• megablast (genomic sequence)
• rpsblast (search for domains)

				