Documents
Resources
Learning Center
Upload
Plans & pricing Sign in
Sign Out

green

VIEWS: 10 PAGES: 65

									Computational Protein Design:
 A problem in combinatorial
        optimization
         CSE 549 Guest Lecture
          September 17, 2009
              David Green
    Applied Mathematics & Statistics
            What is a protein?
• Polymers (chains) of amino acids.
  – There are 20 different amino acids that can be
    part of the chain.
• Machines of the cell.
  – It’s proteins that do most of the work involved in
    life!
       Polymers of amino acids.




• Amino acids link to form polypeptides.
   – There is a backbone of constant composition.
   – There are side chains that vary.
      The twenty amino acids.

• AA side chains vary
  from:
  – Big to small.
  – Non-polar (all C and
    H) to polar.
  – Positive to
    negative.
  – Flexible to rigid.
           The machinery of life.
• Protein sensors (receptors) are responsible for all the
  senses (sight, smell, taste, touch, hearing).
• Enzymes are proteins the catalyze chemical
  reactions, like the ones that convert food to energy.
• Specialized structural proteins make skin elastic, and
  make the lens of the eye work.
• Muscles are primarily composed of proteins that
  combine structural and enzymatic parts to make a
  machine.
           Why design proteins?
• New sensors based on biology.
   – Proteins have been engineered to detect TNT (explosive)
     and sarin (nerve gas).
• Proteins are used as treatments for many diseases.
   – Protein engineering has helped improve proteins that are
     given to cancer patients on radiation or chemo-therapy.
   – Work in the Green lab is on-going to design proteins for
     use as anti-HIV prophylatics.
• Many nanotechnology applications that haven’t even
  been considered yet!
   Where do proteins come from?
• The genome contains instructions for every protein
  in a cell.
   – A few HUGE molecules of DNA.
• Each gene is the code for one protein.
   – There are ~30,000 genes in humans.
• Genes are expressed through an intermediate
  molecule, RNA.
   – Many copies of each protein can be made.
The Central Dogma of Molecular
            Biology.




    • Then proteins do the work!
        How do proteins work?
• Proteins fold into a unique 3-dimensional
  structure.
• The amino acid sequence of a protein dictates
  it’s structure.
• The function of a protein is controlled by it’s
  structure.
Many polymers are long, unstructured
             chains.

• Polyethylene
  – Is made of long
    chains of the same
    monomers.
  – Adopts a random
    mesh of inter-
    weaving strands.
  – This structure gives
    us PLASTIC!
DNA has the same structure for every
             sequence.




• The “double-helix” is a great structure for storing
  and replication information.
 Protein structures are well-defined
             and diverse!

• One chain or
  many.
• Elongated or
  globular.
• Many forms of
  symmetry (or
  none).
  What does a protein look like?




• Cyanovirin – A protein that inhibits the
  entry of HIV into human cell.
  What does a protein look like?




• The atoms of a protein form a compact,
  well-packed cluster.
  What does a protein look like?




• A protein can be thought of as a nearly
  solid object.
  What does a protein look like?




• Simplified cartoons make the structure
  easier to see.
  What does a protein look like?




• The path of the backbone of a protein is
  called it’s “fold”.
  What does a protein look like?




• Different types of amino acids are found all
  along the protein chain.
  What does a protein look like?




• Each amino acid has a side chain that
  protrudes from the backbone.
  What does a protein look like?




• Many proteins bind other molecules, like
  the sugar molecules here.
  What does a protein look like?




• Binding interfaces are usually a close fit of
  two complementary surfaces.
  What does a protein look like?




• The core of a protein is key in keeping a
  stable structure.
Many side chains fill the core.
The core is well packed …
… with groups from all along the chain.
Each side chain fits perfectly.
           What is a protein?
• A protein is a complicated three-dimensional
  structure, made up by an amazing 3-D jigsaw
  puzzle of interlocking amino acids.
• Amino acids pack together not just
  geometrically, but with complementary
  chemical groups as well.
• Proteins move too, but we’ll ignore that for
  now.
      How can we design one?!?
1. Choose a fold (path of the backbone).
2. Pack the core with the right set of amino
   acids to achieve the desired fold.
3. Choose other amino acids to achieve the
   desired function (such as binding to a target
   molecule, or getting the right molecular
   motions).
   Structure prediction is a forward
               problem.
•Given a protein
sequence, what is the
structure that it will
adopt (fold to)?
•This is a VERY hard
problem, and it not yet
fully solved.
•Prediction is difficult
because you are stuck
with what nature gives
you.
Protein design is an inverse problem.
•Given a desired 3-
dimensional protein
structure, what is a
sequence that will fold
to that structure?
•We have the freedom
to add constraints that
simplify the problem.
•As a result, methods
for protein design have   • Pabo. Nature 301: 200 (1981).
had many successes.       • Drexler. PNAS 78: 5275-5278 (1981).
  A designed sequence should fold
        according to design.
•ANY sequence which
folds to the correct
target structure (and
carries out the desired
function) can be
considered a successful
design
•There is more than
one right answer, unlike
in prediction!
      Choosing a backbone fold.
• The structure dictates the function, and a big
  part of structure is the fold.
• We still don’t really know how to choose the
  “best” fold.
• Instead, we just borrow from nature –
  redesign a natural protein to do something
  new.
Zinc finger proteins bind DNA.
A Zinc ion holds them together.

                • The protein will not
                  fold if zinc is not
                  present.
                • The protein only binds
                  DNA when it is folded.
                • A group at Caltech set
                  out to design a zinc
                  finger that doesn’t
                  need zinc!
1997: The first fully automated protein
                 design!




• Dahiyat and Mayo. Science 287: 82-87 (1997).
           Designing function.
• Making a molecule bind is like designing a the
  core – we want to make the interface between
  the two pieces complementary.
• Other functions are a lot trickier … and we
  don’t have good ways to solve them yet, but
  we’re on our way.
   2003: A Duke group designs a set of
            protein sensors.




• Looger, Dwyer, Smith and Hellinga. Nature 423: 185-190 (2003).
   Protein design is a BIG problem.
• The zinc finger is one of the smallest protein domains …
  about 30 amino acids long.
• How many different 30 amino acid polypeptides are
  there?
   – Choose from any of 20 amino acids at each position.
   – Total sequences = 2030 = 1x1039
      • Mass of earth = 6x1027 g
      • Mass of a grain of sand ~ 1x10-3 g
          – A billion earths’ worth of sand grains
• Enumeration of possible states is beyond impossible
  — must take advantage of need to achieve
  complementary interactions between amino acids.
Many different structures are possible.




• An arginine and a glutamate interact.
Many different structures are possible.




• An arginine and a glutamate interact.
Many different structures are possible.




• An arginine and a glutamate interact.
Many different structures are possible.




• An arginine and a glutamate interact in several different
  conformations.
                      Really Big!!!
• Amino-acid side chains are flexible.
• But not every shape (conformation) is equal.
• Each amino acid has a set of preferred conformations
  (rotamers).
   – 1 to 80 per amino acid.
• Instead of choosing from 20 amino acids … we need
  to choose from ~400 (at least) amino acid rotamers!
   – Total structures = 40030 = 1x1078
      • (approx. number of atoms in the universe!!!!!)
      Packing side chains – a puzzle.
• How do you solve a jigsaw puzzle?
• Impossible to try all combinations of piece
  placement
   – Unique ways of placing N pieces on a grid is (4N)(N!)
       • For N=100, (1.6x1060)(9.3x10157)= 1.5x10218
• Trying each piece one by one is better, but still
  infeasible
   – Number of iterative tries for a N piece puzzle is:
                                                 N

  N                                               4( N  i)(2i  2)
  4( N  i)( 2i  2)  3 N ( N  4)( N  1)
                        4                        i 1

                                                                     N      N
                                                                                  
                                                  8 N 2  ( N  1) i   i 2 
                                                                    i 1   i 1  
 i 1
                                                                     N ( N  1) N ( N  1)(2 N  1) 
                                                  8 N 2  ( N  1)                                
       • For N=100, 1.37x106                                             2               6          
                                                  N ( N  4)(N  1) 
                                                   4
                                                   3
   Packing side chains – be smart.
• How do you solve a jigsaw puzzle?
• Group pieces by colors and patterns.
• Iterate over matching of pieces that are
  complementary
  – Shape is important.
  – The pattern must also match.
   Pattern matching in proteins?
• What does it mean for two amino acids in the
  core of a protein to “match”?
  – Must fit close together (but not too close) 
    Steric complementarity.
  – Neighboring atoms must have complementary
    charges (neutral likes neutral, positive likes
    negative)  Electrostatic complementarity.
Steric fit: Lennard-Jones potential.
          r0 12  r0  6 
 U  U o    2            • Van der Waals attraction
          r 
                  r          between atoms at moderate
                                  distances.
                                • Repulsion of atoms from
                                  one another at short
                                  distances.
                                • If atoms are not nearby, the
                                  energy between them will
                                  be very close to zero.
                                • The total score of the
                                  “goodness” of fit in a
                                  molecule is the sum of
                                  the energy for every pair
                                  of atoms.
 Electrostatic fit: Coulomb’s Law
       q1q2   • Atoms in molecules can be thought
U  kC          of as having tiny charges on them,
         r      even if the total charge on a
                molecule is zero.
              • Coulomb’s Law describes the
                energy of how two charges interact.
              • The overall electrostatic fit is
                calculated by adding up the energy
                of all pairs of atoms.
                 – Like charges give a positive value.
                 – Opposite charges give a negative.
                 – Neutral (zero charge) groups don’t
                   matter.
The total energy describes the fitness
            of a structure.
• Van der Waals + Coulomb’s Law, for every pair
  of atoms, and all added up.
• Negative energies are favorable, positive
  energies unfavorable.

  – Nature works to MINIMIZE energy.
             Protein Design as a Discrete
Position 1
               Conformational Search


Position 2




Position 3




                 Conformational states of system
Tree Pruning with Dead End Elimination
                                                Molecular Mechanics energy
i , j are positions in sequence
                                                 Van der Waals
Ri is rotamer choice at
                                                 Coulombic
position i
                                                 Bond, angle, dihedral
     
 E ( R)        Ei,Ri     Ei , Ri  Eiself   1 2 E(pairi ),( j , R j )
                                       , Ri           i,R
            i positions                         i j

The Dead-end Elimination Theorem
  Given two rotamer choices X, and Y at position I, if the best energy
  of X (with any choice of rotamers at other positions) is worse than
  the worst energy of Y, then X can not be part of the global energy
  minimum.
                                                           ?                
  Need to make the comparison,            min ( E ( X i , R))  max( E (Yi , R))
                                                                
                                            R                   R



  but the min and max functions require evaluating all states.
                             Making DEE feasible.
                    self                                  
E ( X i , R)    E p , R p   2 ( p , R p ),( q , Rq )  * Ri is used to replace Xi in the first equation.
                                        1 E pair

                 p                q p                     
                                           self                                 
 Eiselfi   E(pairi ),( q , Rq )    E p , R p   1 2 E(pair p ),( q , Rq ) 
     ,X            i, X                                       p,R
            q i                     p i           q  p ,i                    
 Last sum is invariant with respect to our choice at position i, and thus:

              ?
                                  
                                                     self                         ?          self                   
 min E ( X i , R)  max E (Yi , R)  min  Ei , X i   E(i , X i ),( q , Rq )   max Ei ,Yi   E(i ,Yi ),( q , Rq ) 
                                             
                                                                          pair
                                                                                             
                                                                                                            pair
  R                  R                        R
                                                                 q i                     R
                                                                                                      q i             
 But note that:          self
                    min  Ei , X i   E(i , X i ),( q , Rq )   Eiselfi   min E(pairi ),( q , Rq )
                     
                                           pair
                                                              
                                                                     ,X                i, X                            
                                                             
                     R                                                           Rq
                                      q i                                  q i


                    max Ei ,Yi   E(i ,Yi ),( q , Rq )   Eiself   max E(pair),( q , Rq )
                     
                         self             pair
                                                             
                                                                    ,Yi                i ,Yi                       
                     R
                                     q i                                 q i
                                                                                 Rq


 This gives a sufficient condition for the DEE theorem to hold:
                         E  self
                           i, X i     min E
                                            Rq
                                                     pair
                                                     ( i , X i ),( q , Rq )    E
                                                                               ?
                                                                                    self
                                                                                   i ,Yi                 
                                                                                             max E(pair),( q , Rq )
                                                                                                     i ,Yi
                                                                                                    Rq
                                                                                                                        
                                     q i                                                    q i

min and max are evaluated over rotamers at a single position, so entirely feasible!
        Improving the bounds for DEE
Original problem statement was to compare the best structure with X i at
position I to the worst with Yi; it is a easier to satisfy, but still sufficient,
criterion to find the single set of choices at all other positions with the minimum
difference between choice Xi and Yi.
                   min E ( X i , R)  max E (Yi , R)  min E ( X i , R)  E(Yi , R)  0
                    
                    R
                                 ?
                                                 
                                                  
                                                        
                                                               R
                                                                     
                                                                                 ?
                                                                                                       R
                                                                                                                                                             
                                                                                                                     ?
                           E     self
                                 i, X i   E        self
                                                   i ,Yi       min  E(i , X i ),( q , Rq )  E(i ,Yi ),( q , Rq )   0
                                                                  
                                                                            pair                   pair
                                                                  R
                                                                      q i                                           
Again, we use the same trick to bound the minimum in a feasible manner:
                   pair                   pair
                                                              
         min  E(i , X i ),( q , Rq )  E(i ,Yi ),( q , Rq )    min E(pairi ),( q , Rq )  E(pair),( q , Rq )
                                                                         i, X                   i ,Yi                                                            
                                                             
          R                                                            Rq
               q i                                               q i


This gives a alternate sufficient condition for the DEE theorem to hold:
                        E                                  min E                                                                        0
                                                                                                                                             ?
                               self
                              i, X i   E         self
                                                 i ,Yi
                                                                                        pair
                                                                                       ( i , X i ),( q , Rq )   E    pair
                                                                                                                     ( i ,Yi ),( q , Rq )
                                                                          Rq
                                                                   q i
This is a tighter bound on the “true” desired comparison, since:
       minE
       q i
              Rq
                         pair
                        ( i , X i ),( q , Rq )     maxE
                                                      q i
                                                                   Rq
                                                                                pair
                                                                               ( i ,Yi ),( q , Rq )     minE
                                                                                                            q i
                                                                                                                    Rq
                                                                                                                                 pair
                                                                                                                                ( i , X i ),( q , Rq )    E(pair),( q, Rq )
                                                                                                                                                             i ,Yi             
        DEE as an iterative algorithm
As rotamers are flagged as incompatible with the global minimum solution, the
min/max functions are evaluated over a smaller and smaller set of choices, and
so additional iterations of the comparison can eliminate more possibilities.
Thus, the algorithm can be outlined as:

While (any rotamers eliminated ) {
  For each position in the sequence (i){
    For each rotamer choice (X) at position i {
      For each rotamer choice (Y ne X) at position i {
        If DEE criterion is satisfied {
          Eliminate choice X at position I
        }
      }                 The order of each cycle is ~N*R2, where N is
    }                   the number of positions, and R is the number
  }                     of choices at each position. However, as R
}                       decreases with iteration, each subsequent
                                  cycle costs less.
          DEE identifies branches that are
  incompatible with the global minimum
Position 1




Position 2




Position 3




             Conformational states of system
   The pruned tree can be much smaller
Position 1




Position 2




Position 3




             Conformational states of system
          DEE can be highly effective
Example of a 5-position design, with 306 choices per position, done in a simple
MATLAB script:

Iteration    0:   306 306 306 306 306
Iteration    0:   Structures = 2.682916e+12
Iteration    1:   143 198 145 42 33
Iteration    1:   Structures = 5.690265e+09 from 2.682916e+12
Iteration    1:   Elapsed time is 96.461053 seconds.
Iteration    2:   52 83 55 11 8
Iteration    2:   Structures = 20889440 from 5.690265e+09
Iteration    2:   Elapsed time is 11.604577 seconds.
Iteration    3:   4 12 36 6 5
Iteration    3:   Structures = 51840 from 20889440
Iteration    3:   Elapsed time is 0.625266 seconds.
Iteration    4:   4 12 35 6 5
Iteration    4:   Structures = 50400 from 51840
Iteration    4:   Elapsed time is 0.148598 seconds.
Iteration    5:   4 12 35 6 5
Iteration    5:   Structures = 50400 from 50400
Iteration    5:   Elapsed time is 0.045631 seconds.
            DEE: Notes and Caveats
Dead-end elimination is not guaranteed not to eliminate any
choices … in this case computational expense is used at zero gain.
However, experience suggests that in the case of protein design,
the algorithm is highly efficient.

For large design problems, even a highly efficient pruning can leave
a tree which is too large to be searched by enumeration (such as
depth-first search); for example, consider an original space of 10100
states, reduced to 1020.
An efficient bounding heuristic can be defined using similar tricks as
discussed here; this can be used in the A* algorithm to find the
global minimum within the remaining space.

The DEE criterion can be extended to pairs, triples, and n-tuples of
positions. Most applications use only singles and pairs.
    Optimization as a tree search.
How to choose a path through a tree, which the goal of
reaching the global minimum state?
First, define an energy to be associated with each step through
the tree:
                                         i 1
                   Ei , Ri  Eiself   E(pairi ),( j , R j )
                               , Ri       i,R
                                         j 1

This is the energy of placing rotamer choice R at position i, given that
all positions in the tree above i have been selected.
Note this is similar, but not identical, to the definition used with DEE.
 At the leaf node, the state energy is then the sum of all individual
 path energies:
                                
                            E ( R)        E
                                       i positions
                                                      i , Ri


 Thus, we wish to find the path with the lowest total.
Efficient search using a heuristic (A*)
Challenge is that we do not know the total until we
have traversed the tree to a leaf!
 Assume we have some heuristic, that estimates the best possible
 energy that we can achieve down some path. For choice R at
 position I, we will call this E *,R
                                 ii



 When choosing a path to take at a given node, we know the path we
 have already taken, and thus the true cost. We thus combine this
 information with the heuristic in deciding what step to take:
                                         i 1
                  f (i, Ri )  E *,Ri   E j , R j
                                 i
                                         j i

 At a leaf node, the heuristic is 0, and this gives the total energy.

  However, if we only use the heuristic, how do we know we get the
  correct solution?
   Guaranteed optimality with A*
1. Allow backtracking.
At every step, not only choose between the possible paths from the
current node, but also the paths from all nodes which have been
visited in the past.
In this way, if the heuristic turned out to be poor for a given path (and
the true energy became large), a new path is chosen.

2. Use a heuristic that bounds the true solution.
Consider a heuristic that is guaranteed to be lower than the true best
energy down a path (an optimistic prediction of the best energy). When
a leaf is reached, a comparison between the true energy of that leaf
and the heuristic energy of the un-followed choice can be made. Since
the heuristic is an optimistic guess, if the true energy is lower than the
heuristic for all other choices, it must be the global minimum.
Thus, it is possible to have a guarantee that the solution found is
the true global minimum!
           Defining the bounding heuristic
       Recall our energy definitions:
                                   i 1                                     
           Ei , Ri  E    self
                         i , Ri   E       pair
                                           ( i , Ri ),( j , R j )
                                                                        E ( R)           E
                                                                                   i positions
                                                                                                  i , Ri
                                   j 1
       The optimal heuristic is:
                                                   i 1                   N
                            f opt (i, Ri )   E j , R j  min
                                                                        E      j ,R j
                                                                    R
                                                   j 1                   j i

       Now, we bound the second term:
                  N
                     self j 1 pair                      N        self j 1 pair                        
           min   E j , R j   E( j , R j ),( k , Rk )    min  E j , R j   E( j , R j ),( k , Rk ) 
                                                               
               j i                                      j i                                           
            R                                                   R
                               k 1                                              k 1

      N
             self j 1 pair                      N         self            j 1 pair                
     i min  E j ,R j   E( j ,R j ),( k ,Rk )   i min  E j ,R j  min   E( j ,R j ),( k ,Rk )  
         
                                                          Rj              R 
                                                                                                       
     j
         R
                        k 1                     j                       k 1                     
                                                                                                           
                                                       N 
         self             j 1 pair                                   self j 1
                                                                                                                
                                                                                                                 
 N
   min  E j , R j  min   E( j , R j ),( k , Rk )    min  E j , R j   min E( j , R j ),( k , Rk )
i R j 
j 
                                                                   
                                                                                         pair
                                                                                                                 
                                                                                                                 
                     R
                           k 1                         j i  R j 
                                                                                k 1
                                                                                      Rk
                                                                                                                 
      Overview of A* for protein design
   Initialize a list of traveled nodes with the root.
   While (no mininum leaf in list) {
     Select minimum f* of all paths from nodes in list.
     Add this new node to list.
     If (new node is leaf) {
        Compare leaf energy to minimum f* from list.
        If (Leaf energy < min(f*)) {
          Leaf is global minimum.
     }
   }

    Our final heuristic is:
                     self j 1 pair                      N         self j 1
                                                                                                               
                                                                                                                
                i 1
f * (i, Ri )    E j , R j   E( j , R j ),( k , Rk )    min  E j , R j   min E(pair j ),( k , Rk )   
                                                                  Rj                                           
                                                                                           j,R
               j 1                                      j i                                               
                                                                                       Rk
                               k 1                                               k 1


     The second term involves a minimum over each choice at
     following position, which is order (N-i)*R, with an inner minimum
     over the same, order (N-j)*R. Thus, the cost is less than (N-j)2*R2.
             A*: Notes and Caveats
Performance of A* is not guaranteed — in the worst case, the
entire tree must be enumerated to find the solution. Again,
however, experience suggests that the algorithm is highly efficient
for protein design.

In many cases, we wish not only to have the global minimum, but
all solutions within a cutoff of the minimum.
A* can be adapted to solve this problem as well, but care must be
taken in the first step of pruning with DEE — the elimination
criterion must be modified to prevent elimination of low, but not
minimum, energy states.
                         References
     Protein design as an inverse problem:
1.   Pabo. Nature 301: 200 (1981).
2.   Drexler. PNAS 78: 5275-5278 (1981).
     Examples of successful protein design:
3.   Dahiyat and Mayo. Science 287: 82-87 (1997).
4.   Looger, Dwyer, Smith and Hellinga. Nature 423: 185-190 (2003).
     Development of the Dead-End Elimination algorithm:
5.   Desmet, DeMaeyer, Hazes and Lasters, Nature 356: 539-542 (1992).
6.   Goldstein, Biophys. J. 66: 1335-1340 (1994).
7.   Gordon and Mayo. J. Comput. Chem. 19: 1505-1514 (1998).
     Formulation of A* for protein design:
8.   Leach and Lemon, Proteins 33: 227-239 (1998).

								
To top