# Homology Modeling - PowerPoint

```
Homology Modeling
Part I
Homology Modeling

• Presentation
• Fold recognition
• Model building
– Loop building
– Sidechain modeling
– Refinement
• Testing methods: the CASP experiment
Why do we need homology modeling ?

To be compared with:
Structural Genomics project

• Aim to solve the structure of all proteins: this is
too much work experimentally!

• Solve enough structures so that the remaining
structures can be inferred from those
experimental structures

• The number of experimental structures needed
depends on our ability to generate a model.
Structural Genomics

[Figure: proteins with known structures; unknown proteins]
Homology Modeling: why it works

High sequence identity

High structure similarity
Homology Modeling: How it works

o Find template

o Align target sequence with template

o Generate model

o Refine model
What is sequence alignment?

• Given two sequences of letters and a scoring
scheme for evaluating letter matching, find the
optimal pairing of letters from one sequence to
the other.
• Different alignments:
Favors identity          Favors similarity
ACCTAGGC                 ACCTAGGC
AC-T-GG                  ACT-GG

Gaps
Computing Cost

The computational complexity of aligning two
sequences when gaps are allowed anywhere is
exponential in the length of the sequences being
aligned.

Computer science offers a solution
for reducing the running time:
Dynamic Programming
Dynamic Programming (DP) Concept

A problem with overlapping sub-problems and
optimal substructure can be solved using the
following algorithm:

(1)    break the problem into smaller sub-problems

(2)    solve these sub-problems optimally by applying this
three-step procedure recursively

(3)    use these optimal solutions to construct an
optimal solution to the original problem
DP and Sequence Alignment

Key idea:

The score of the optimal alignment that ends at a given
pair of positions in the sequences is the score of the best
alignment previous to these positions plus the score of
aligning these two positions.
DP and Sequence Alignment
Test all alignments that can lead to i aligned with j

[Figure: alignment matrix with candidate paths leading to the pair (i,j)]

DP and Sequence Alignment
Find alignment with best [previous score + score(i,j)]

Best alignment that ends at (i,j)
Implementing the DP algorithm for sequences

Aligning two sequences S1 and S2 of lengths N and M:

1) Build an NxM alignment matrix A such that
A(i,j) is the optimal score for alignments
up to the pair (i,j)

2) Find the best score in A

3) Trace back through the matrix to get
the optimal alignment of S1 and S2.
Example 1

Sequence 1: ATGCTGC

Sequence 2: AGCC

Score(i,j) = 10 if the residues at positions i and j are identical, 0 otherwise

no gap penalty
Example 1
1) Initialize
        A    T    G    C    T    G    C
   A   10    0    0    0    0    0    0
   G    0
   C    0
   C    0
Example 1
2) Propagate
        A    T    G    C    T    G    C
   A   10    0    0    0    0    0    0
   G    0   10
   C    0
   C    0
Example 1
2) Propagate
        A    T    G    C    T    G    C
   A   10    0    0    0    0    0    0
   G    0   10   20
   C    0
   C    0
Example 1
2) Propagate
        A    T    G    C    T    G    C
   A   10    0    0    0    0    0    0
   G    0   10   20   10   10   20   10
   C    0   10
   C    0
Example 1
2) Propagate
        A    T    G    C    T    G    C
   A   10    0    0    0    0    0    0
   G    0   10   20   10   10   20   10
   C    0   10   10   30   20   20   30
   C    0   10   10   30   30
Example 1
3) Trace Back
        A    T    G    C    T    G    C
   A   10    0    0    0    0    0    0
   G    0   10   20   10   10   20   10
   C    0   10   10   30   20   20   30
   C    0   10   10   30   30   30   40

Alignment:          ATGCTGC              Score: 40
                    AXGCXXC
Mathematical Formulation

Global alignment (Needleman-Wunsch):

A(i,j) = \mathrm{Score}(i,j) + \max \left\{
  \begin{array}{l}
    A(i-1, j-1), \\
    \max_{1 \le k \le j-2} \left[ A(i-1, j-1-k) - W_k \right], \\
    \max_{1 \le k \le i-2} \left[ A(i-1-k, j-1) - W_k \right]
  \end{array}
\right.

W_k: penalty for a gap of size k
Complexity

1) The computing time required to fill in the
alignment matrix is O(NM(N+M)), where
N and M are the lengths of the 2 sequences

2) This can be reduced to O(NM) by storing
the best score for each row and column.
This holds when the gap penalty is linear.
Example 2

        A    A    T    G    C
   A   10   10    0    0    0
   G    0   10   10   20   10
   G    0   10   10   20   20
   C    0   10   10   10   30

High score: 30

Alignments:
AATGC       AATGC   AATGC     AATG C    AATG C
AG GC       A GGC    AGGC     A GGC      A GGC
Complexity (2)

1) The traceback routine can be quite costly in computing
time if all possible optimal paths are required, since
there may be many branches.

2) Usually, an arbitrary choice is made about which
branch to follow. The computing time is then O(max(N,M)),
since the traceback simply follows pointers.
The Scoring Scheme
• Scores are usually stored in a “weight” matrix or
“substitution” matrix or “matching” matrix.
• Defining the “proper” matrix is still an active area of
research
• Usually, start from known, reliable alignments.
Compute f_i, the frequency of occurrence of residue
type i, and q_ij, the probability that residue types i and
j are aligned; the score is computed as:

S_{ij} = \log \left( \frac{q_{ij}}{f_i f_j} \right)
Example of a Scoring matrix
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S  -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T  -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P  -3  -1   1   7  -1  -2  -1  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A   0   1  -1  -1   4   0  -1  -2  -1  -1  -2  -1  -1  -1  -1  -1  -2  -2  -2  -3
G  -3   0   1  -2   0   6  -2  -1  -2  -2  -2  -2  -2  -3  -4  -4   0  -3  -3  -2
N  -3   1   0  -2  -2   0   6   1   0   0  -1   0   0  -2  -3  -3  -3  -3  -2  -4
D  -3   0   1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E  -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -3  -3  -2  -3
Q  -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H  -3  -1   0  -2  -2  -2   1   1   0   0   8   0  -1  -2  -3  -3  -2  -1   2  -2
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K  -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -3  -3  -2  -3
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2  -2   0  -1  -1
I  -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   1   0  -1  -3
L  -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   3   0  -1  -2
V  -1  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W  -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
Heuristic methods

• O(NM) is too slow for database search

• Heuristic methods based on frequency of shared
subsequences

• Usually look for ungapped small sequences

FASTA, BLAST
FASTA
• Create hash table of short words of the query sequence
(from 2 to 6 characters)
• Scan database and look for matches in the query hash table
• Extend good matches empirically

[Figure: table of query words (Word1 … WordP) against database sequences (Seq1 … SeqN)]
BLAST

1) Break the query sequence and the database sequences into words

2) Search for word matches (not necessarily perfect) that score at least T

3) Extend matches, and look for alignments that score at least S

Tutorial:   http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
[Screenshots: BLAST input; BLAST results; BLAST results (2)]
Significance
• We have found that the score of the alignment
between two sequences is S.
Question: What is the “significance” of this score?

• Otherwise stated, what is the probability P that the
alignment of two random sequences has a score at
least equal to S ?

• P is the P-value, and is considered a measure of
statistical significance.
If P is small, the initial alignment is significant.
Statistics of Protein Sequence Alignment

What is a local alignment ?

“Pair of equal length segments, one from each sequence, whose scores
can not be improved by extension or trimming. These are called
high-scoring pairs, or HSP”

http://www.people.virginia.edu/~wrp/cshl98/Altschul/Altschul-1.html
The E-value for a sequence alignment
HSP scores follow an extreme value distribution, characterized
by two parameters, K and λ.

The expected number of HSPs with score at least S is given by:

E = K m n \, e^{-\lambda S}

[Figure: extreme value distribution; horizontal axis: S]

m, n: sequence lengths
E: E-value
The Bit Score of a sequence alignment
Raw scores have little meaning without knowledge of the scoring
scheme used for the alignment, or equivalently of the parameters
K and λ.
Scores can be normalized according to:

S' = \frac{\lambda S - \ln K}{\ln 2}

S' is the bit score of the alignment.

The E-value can be expressed as:

E = m n \, 2^{-S'}
The P-value of a sequence alignment
The number of random HSPs with score greater than or equal to S follows a
Poisson distribution:

P(X \text{ random HSPs with score} \ge S) = \frac{E^X}{X!} \, e^{-E}

(E: E-value)
Then:

P(0 \text{ random HSPs with score} \ge S) = e^{-E}

P_{val} = P(\text{at least 1 random HSP with score} \ge S) = 1 - e^{-E}

Note: when E << 1, P ≈ E
The database E-value for a sequence
alignment

Database search, where the database contains N_S sequences
corresponding to N_R residues:

1) All sequences are a priori equally likely to be related to the query:

E_{DB} = N_S \, K m n \, e^{-\lambda S}

2) Longer sequences are more likely to be related to the query:

E_{DB2} = K m N_R \, e^{-\lambda S}

BLAST reports E_{DB2}
Fold Recognition

Homology modeling refers to the easy case when the template structure can be
identified using BLAST alone.

What to do when BLAST fails to identify a template?

• Use more sophisticated sequence methods
– Profile-based BLAST: PSIBLAST
– Hidden Markov Models (HMM)
Fold Recognition: Sequence approaches

Single sequence alignment:

o The score Score(i,j) is usually obtained from a substitution
matrix M of size 20x20 (PAM, BLOSUM matrices…)

o This score is independent of the position in the protein:
for example, substitution of A with P would have the
same score in a loop or in a helix

o There is a need for a position-specific scoring system
-> profiles.

[Figure: best previous alignment + score(i,j)]

A(i,j) = \mathrm{Score}(i,j) + \max \left\{
  \begin{array}{l}
    A(i-1, j-1), \\
    \max_{1 \le k \le j-2} \left[ A(i-1, j-1-k) - W_k \right], \\
    \max_{1 \le k \le i-2} \left[ A(i-1-k, j-1) - W_k \right]
  \end{array}
\right.

Building a profile using sequence
Start from a multiple sequence alignment:

Multiple sequence alignments can be computed either exactly using
multi-dimensional dynamic programming (very costly in computing time),
or heuristically (iterative pairwise alignments)

Multiple sequence alignments reveal:

- conservation of individual residues
- conservation of regions
- differences within protein families
From sequences to profile

For each position along the sequence, tabulate how often each type of
amino acid occurs (include ‘.’ for gap).

The profile is always of size Nx21, no matter how many sequences
are considered.

Number of G   0   0   0   0   0   0   0   0   4
Number of A   0   0   0   0   0   0   0   0   0
Number of V   0   0   0   0   0   1   0   0   0
Number of I   0   0   0   0   0   0   0   1   0
Number of L   0   0   3   0   0   0   0   0   0
Number of F   3   0   0   0   0   0   0   0   0
Number of P   0   0   0   0   5   4   0   0   0
Number of M   0   0   0   0   0   0   0   0   0
Number of W   0   0   0   0   0   0   0   0   0
Number of C   0   5   0   0   0   0   0   0   0
Number of S   0   0   0   0   0   0   0   0   0
Number of T   0   0   0   0   0   0   0   3   0
Number of N   0   0   0   0   0   0   0   0   0
Number of Q   0   0   0   0   0   0   0   0   0
Number of H   0   0   0   0   0   0   0   0   0
Number of Y   1   0   0   0   0   0   3   0   0
Number of D   1   0   1   0   0   0   1   1   0
Number of E   0   0   0   4   0   0   0   0   0
Number of K   0   0   1   1   0   0   0   0   1
Number of R   0   0   0   0   0   0   1   0   0
STRUCTURE-DERIVED PROFILES

(Frozen approximation)

(Bowie, Luthy, Eisenberg, Science, 253:164-170 (1991))
Different types of alignment

Query       Library     Program

Sequence    Sequence    Blast, Fasta,…

Profile     Sequence    Psiblast

IMPALA,
PSSM

Profile     Profile     ?
Some References
• PSIBLAST
Altschul, Stephen F., Madden, Thomas L., Schaffer, Alejandro A., Zhang, Jinghui, Zhang, Zheng,
Miller, Webb, and Lipman, David J. (1997). Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res. 25(17); 3389-3402

http://helix.nih.gov/docs/gcg/psiblast.html

• IMPALA
Schaffer, Wolf, Ponting, Koonin, Aravind, Altschul, IMPALA: matching a protein sequence against
a collection of PSI-BLAST constructed position specific score matrices. Bioinformatics, 15:1000-
1011 (1999)
Jones, D.T., Taylor, W.R. & Thornton, J.M. A new approach to protein fold recognition. Nature.
358: 86-89 (1992)

• Profile – Profile
Yona, G. and M. Levitt. Within the Twilight Zone: A Sensitive Profile-Profile Comparison Tool
Based on Information Theory. J Mol Biol. 315: 1257-1275 (2002).
Fold Recognition

Blast for PDB search

Full homology modeling
packages

Profile based approach
HMM

Structure-derived profiles

Fold recognition and
Secondary structure prediction

```
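
The matrix-filling step described in the slides ("Implementing the DP algorithm for sequences" and the Needleman-Wunsch recurrence) can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names `fill_alignment_matrix`, `best_score` and `no_gap_penalty` are hypothetical, the gap penalty is assumed to be subtracted from the score, and the first row and column are assumed to hold only Score(i,j), as in Example 1. It uses the direct O(NM(N+M)) recurrence rather than the O(NM) optimization.

```python
def no_gap_penalty(k):
    """Gap penalty W_k for a gap of size k (Example 1 uses no penalty)."""
    return 0


def fill_alignment_matrix(s1, s2, match=10, mismatch=0, gap_penalty=no_gap_penalty):
    """Fill the alignment matrix A; rows follow s2, columns follow s1.

    A[i][j] is the best score of an alignment ending with s2[i] paired
    with s1[j], following the recurrence given in the slides.
    """
    n_rows, n_cols = len(s2), len(s1)
    A = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            score = match if s2[i] == s1[j] else mismatch
            best_previous = 0  # first row/column: no previous alignment
            if i > 0 and j > 0:
                # No gap: extend the alignment ending at (i-1, j-1).
                best_previous = A[i - 1][j - 1]
                # Gap of size k that skips k residues of s1.
                for k in range(1, j):
                    best_previous = max(best_previous,
                                        A[i - 1][j - 1 - k] - gap_penalty(k))
                # Gap of size k that skips k residues of s2.
                for k in range(1, i):
                    best_previous = max(best_previous,
                                        A[i - 1 - k][j - 1] - gap_penalty(k))
            A[i][j] = score + best_previous
    return A


def best_score(A):
    """Step 2 of the slides: find the best score in A."""
    return max(max(row) for row in A)


if __name__ == "__main__":
    matrix = fill_alignment_matrix("ATGCTGC", "AGCC")
    for row in matrix:
        print(row)
    print("Best score:", best_score(matrix))
```

Run on the sequences of Example 1 (ATGCTGC and AGCC, match score 10, no gap penalty), the sketch reproduces the matrix shown in the slides, with a best score of 40. A traceback step would additionally need to record, for each cell, which predecessor gave the maximum.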
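
The significance formulas from the slides (E-value, bit score, P-value) are easy to check numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the values of K and λ used in the example are arbitrary placeholders, not the parameters of any real scoring scheme.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S): expected number of HSPs scoring >= S."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2: score normalized to bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S'): the same E-value, expressed with the bit score."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """P = 1 - exp(-E): probability of at least one random HSP scoring >= S."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-E)


if __name__ == "__main__":
    # Placeholder parameters chosen for illustration only.
    K, lam = 0.1, 0.3
    m, n = 250, 300      # sequence lengths
    S = 100              # raw alignment score

    E = e_value(S, m, n, K, lam)
    bits = bit_score(S, K, lam)
    print("E-value:", E)
    print("Bit score:", bits)
    print("E-value from bit score:", e_value_from_bits(bits, m, n))
    print("P-value:", p_value(E))
```

The two E-value routes agree because 2^(-S') = K exp(-λS) by construction, and for small E the P-value is close to E, as noted in the slides.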