Docking by dandanhuanghuang


Exploring Chemical Space with Computers: Challenges and Opportunities

         Pierre Baldi
             UCI
Chemical Informatics

   Historical perspective: physics,
    chemistry and biology
   Understanding chemical space
   Small molecules (systems biology,
    chemical synthesis, drug design,
    nanotechnology)
Chemical Space


             Stars        Small Mol.
 Existing    10^22        10^7
 Virtual     0            10^60 (?)
 Access      Difficult    “Easy”
 Mode        Individual   Combinatorial
Chemical Space
Chemical Informatics

   Historical perspective: physics, chemistry and biology
   Understanding chemical space
   Small molecules (systems biology, chemical
    synthesis, drug design, nanotechnology)
   Predict physical, chemical, biological properties
    (classification/regression)
   Build filters/tools to efficiently navigate chemical
    space to discover new drugs, new galaxies, etc.
Methods

   Spectrum:
       Schrödinger Equation
       Molecular Dynamics
       Machine Learning (e.g. SS prediction)
Chemical Informatics

   Informatics must be able to deal with
    variable-size structured data
       Graphical Models
       (Recursive) Neural Networks
       ILP
       GA
       SGs
       Kernels
Two Essential Ingredients

1.       Data
2.       Similarity Measures

Bioinformatics analogy and differences:
          Data (GenBank, Swissprot, PDB)
          Similarity (BLAST)
Data

   Mutag (Mutagenicity)
        200 compounds (125/63), mutagenicity in Salmonella
   PTC (Predictive Toxicity Challenge)
        A few hundred compounds, carcinogenicity (FM,MM,FR,MR)
   NCI (Anti-cancer activity)
        70,000 compounds screened for ability to inhibit growth in 60 human tumor
         cell lines
   Alkanes (Boiling points)
        All 150 non-cyclic alkanes (C_nH_{2n+2}) with n < 11 and their boiling points
         (range [-164, 174] °C)
   Benzodiazepines (QSAR)
        79 1,4-benzodiazepin-2-ones, affinity towards the GABA_A receptor
   ChemDB
        7M compounds
Similarity

   Rapid Search of Large Databases
     Protein Receptor (Docking)
     Small Molecule/Ligand (Similarity)

   Predictive Methods (Kernel Methods)
   Why it is not hopeless
Linear Classifiers
    Classification

   Learning to Classify
       Limited number of training
        examples (molecules, patients,
        sequences, etc.)
       Learning algorithm (how to
        build the classifier?)
       Generalization: should correctly
        classify test data.
   Formalization
       X is the input space
       Y (e.g. toxic/non-toxic, or {1, -1})
        is the target class
       f: X→Y is the classifier.
 Classification


 Fundamental Point:
   f is entirely determined by the dot products $\langle x_i, x_j \rangle$,
   which measure the similarity between pairs of data points
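
The point above can be checked on a toy problem. The sketch below trains a perceptron in dual form, so that training and prediction touch the data only through dot products, and then recovers the equivalent weight vector w = Σ α_i y_i x_i. The perceptron, the absence of a bias term, and the four 2-D points are assumptions made for illustration; the slides do not say which linear classifier was used.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def train_dual_perceptron(X, y, epochs=20):
    """Return dual coefficients alpha (mistake counts per training example)."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            # The score uses the data only through dot products <x_j, x_i>.
            score = sum(a * yj * dot(xj, xi) for a, xj, yj in zip(alpha, X, y))
            if yi * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict_dual(alpha, X, y, x):
    """Classify x using only dot products with the training points."""
    score = sum(a * yj * dot(xj, x) for a, xj, yj in zip(alpha, X, y))
    return 1 if score > 0 else -1

def primal_weights(alpha, X, y):
    """Recover w = sum_i alpha_i * y_i * x_i from the dual coefficients."""
    return [sum(a * yj * xj[d] for a, xj, yj in zip(alpha, X, y))
            for d in range(len(X[0]))]

# Hypothetical, linearly separable toy data.
X = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
alpha = train_dual_perceptron(X, y)
w = primal_weights(alpha, X, y)
```

Because the dual form never needs the coordinates themselves, the dot product can later be swapped for any other similarity function that behaves like one.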
Non Linear Classification
(Kernel Methods)
 We can transform a nonlinear problem
  into a linear one using a kernel K.
 Fundamental property: the linear
  decision surface depends on
  $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
 All we need is the Gram similarity
  matrix K. K defines the local metric of
  the embedding space.
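
As a small worked check of the kernel identity, the sketch below compares a degree-2 polynomial kernel with the dot product of an explicit feature map φ. The choice of kernel, the 2-D inputs, and the function names are assumptions for illustration only; the slides do not name a specific kernel at this point.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel on 2-D inputs: (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def gram_matrix(points, kernel):
    """Gram matrix K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(a, b) for b in points] for a in points]

def feature_dot(a, b):
    """Dot product computed in the embedding space."""
    return sum(p * q for p, q in zip(phi(a), phi(b)))

# Hypothetical sample points.
points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
K = gram_matrix(points, poly2_kernel)
K_explicit = gram_matrix(points, feature_dot)
```

The two matrices agree up to floating-point rounding, which is why a learner that only needs K never has to build φ explicitly.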
 Similarity: Data Representations




NC(O)C(=O)O
Molecular Representations

   1D: SMILES strings
   2D: Graph of bonds
   2.5D: Surfaces
   3D: Atomic coordinates
   4D: Temporal evolution
      1D SMILES Kernel


CCCCCc1ccc(cc1)CO               CCCCCCc1ccc(cc1O)O




                    Total: 15
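
One way to read a 1D SMILES kernel is as a spectrum kernel: count every substring of length k in each string and take the dot product of the count vectors. The sketch below assumes that reading, treats SMILES as raw character strings, and picks k = 3 arbitrarily; it is not claimed to reproduce the total shown on the slide.

```python
from collections import Counter

def spectrum(smiles, k):
    """Count every length-k substring (k-mer) of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

def spectrum_kernel(s, t, k=3):
    """Dot product of the two k-mer count vectors."""
    cs, ct = spectrum(s, k), spectrum(t, k)
    return sum(count * ct[kmer] for kmer, count in cs.items())

# The two SMILES strings shown on the slide.
a = "CCCCCc1ccc(cc1)CO"
b = "CCCCCCc1ccc(cc1O)O"
similarity = spectrum_kernel(a, b)
```

Since one molecule can be written as several different SMILES strings, a kernel of this kind is only meaningful if the strings are produced by a consistent canonicalization procedure.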
        2D Molecule Graph Kernel

   For chemical compounds
       atom/node labels:
        A = {C,N,O,H, … }
       bond/edge labels:
        B = {s, d, t, ar, … }
   Count labeled paths
                        (CsNsCdO)
   Fingerprints
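
The path-counting idea can be sketched as a depth-first walk over the labeled graph, followed by hashing each path string into a fixed-length bit vector. The graph encoding, the SHA-256 hash, the one-bit-per-path rule, and the default length of 1024 are all assumptions for illustration (the length echoes the l = 1024 setting in the results table; reading l as fingerprint length is itself an assumption).

```python
import hashlib

def labeled_paths(atoms, bonds, max_bonds):
    """Enumerate labeled simple paths (no repeated atom) of up to max_bonds bonds."""
    neighbors = {i: [] for i in range(len(atoms))}
    for (i, j), label in bonds.items():
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))
    paths = []

    def extend(node, visited, text, depth):
        paths.append(text)
        if depth == max_bonds:
            return
        for nxt, label in neighbors[node]:
            if nxt not in visited:
                extend(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        extend(start, {start}, atoms[start], 0)
    return paths

def fingerprint(paths, length=1024):
    """Hash each path string to one bit position in a vector of `length` bits."""
    bits = set()
    for p in paths:
        digest = hashlib.sha256(p.encode()).digest()
        bits.add(int.from_bytes(digest[:4], "big") % length)
    return bits

# Hypothetical 4-atom chain C-N-C=O with single (s) and double (d) bonds.
atoms = ["C", "N", "C", "O"]
bonds = {(0, 1): "s", (1, 2): "s", (2, 3): "d"}
paths = labeled_paths(atoms, bonds, max_bonds=3)
fp = fingerprint(paths)
```

Note that this sketch records each path in both directions (CsNsCdO and OdCsNsC); a real implementation would likely canonicalize the direction before counting or hashing.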
Similarity Measures
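
Two similarity measures named in the results table, Tanimoto and MinMax, can be sketched as follows. The code assumes fingerprints are stored as Python sets of "on" bit positions and counted features as `collections.Counter` objects; returning 1.0 for two empty inputs is a convention chosen here, not something the slides specify.

```python
from collections import Counter

def tanimoto(a, b):
    """Tanimoto similarity of two bit sets: |A and B| / |A or B|."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def minmax(ca, cb):
    """MinMax similarity of two Counters: sum of minima over sum of maxima."""
    keys = set(ca) | set(cb)
    num = sum(min(ca[k], cb[k]) for k in keys)
    den = sum(max(ca[k], cb[k]) for k in keys)
    return num / den if den else 1.0

# Hypothetical fingerprints and path counts.
bits_a, bits_b = {1, 2, 3}, {2, 3, 4}
counts_a = Counter({"C": 3, "O": 1})
counts_b = Counter({"C": 2, "N": 1})
t = tanimoto(bits_a, bits_b)
m = minmax(counts_a, counts_b)
```

When every count is 0 or 1, MinMax reduces to Tanimoto, so the two measures differ only in whether repeated features are counted.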
 3D Coordinate Kernel

 [Figure: 3D structure labeled with distances of 1.4 Å, 2.0 Å, 2.8 Å, 3.4 Å and 4.2 Å]
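
A kernel on 3D coordinates can be sketched by binning all pairwise atomic distances into a histogram and comparing histograms with a dot product. This is only one plausible reading of the "3D Histogram" entry in the results table; the bin width, the number of bins, and the coordinates below are assumptions for illustration.

```python
import math
from itertools import combinations

def distance_histogram(coords, bin_width=1.0, n_bins=10):
    """Histogram of pairwise distances; the last bin collects all overflow."""
    hist = [0] * n_bins
    for p, q in combinations(coords, 2):
        d = math.dist(p, q)
        hist[min(int(d / bin_width), n_bins - 1)] += 1
    return hist

def histogram_kernel(h1, h2):
    """Dot product of two distance histograms."""
    return sum(a * b for a, b in zip(h1, h2))

# Hypothetical 3-atom geometry, coordinates in angstroms.
coords = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.0, 2.5, 0.0)]
h = distance_histogram(coords)
```

Because only distances enter the histogram, the representation does not change when the molecule is translated or rotated, which avoids having to align structures first.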
Example of Results
    Summary


   Derived a variety of kernels for small molecules
   State-of-the-art performance on several benchmark datasets
   2D kernels slightly better than 1D and 3D kernels
   Many possible extensions: 2.5D kernels, isomers, etc…
   Need for larger data sets and new models of cooperation in the
    chemistry community
   Many open (ML) questions (e.g. clustering and visualizing 10^7
    compounds, intelligent recognition of useful molecules,
    information retrieval from literature, docking, prediction of
    reaction rates, matching table of all proteins against all known
    compounds, origin of life)
   Chemistry version of the Turing test
    ChemDB


   7M compounds (3.5M unique)
   Commercially available
   PostgreSQL/Oracle
   Annotation (Experimental,
    Computational)
   Searchable
   Web interface
   Similarity, in silico reactions
    Acknowledgements
   Informatics               Pharmacology
        Liva Ralaivola           Daniele Piomelli
        J. Chen              Chemistry
        S. J. Swamidass          G. Weiss
        Yimeng Dou               J. S. Nowick
                                  R. Chamberlin
        Peter Phung
        Jocelyne Bruand
   Funding
        NIH
        NSF
        IGB
    New Questions

   Predict drug-like molecules? toxicity?
       New Strategies

   How can we search efficiently? Intelligently?
       New data structures and algorithms
       Optimizing old structures

   How can we understand this much data?
       Cluster and visualize millions of data points
       Define commercially accessible space.

   Are there other useful things we can do with this?
       Discover new polymers, etc.
       Wonder about the origin of life.
       Combinatorially combine all known chemicals.
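
For the efficient-search question above, one simple pruning idea is that the Tanimoto similarity of two fingerprints can never exceed the ratio of the smaller bit count to the larger one, so many candidates can be rejected from their bit counts alone. The sketch below assumes fingerprints stored as sets of bit positions and a small in-memory dictionary standing in for a database; names and data are hypothetical, and this is not presented as how ChemDB is searched.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def search(query, database, threshold):
    """Return hits with Tanimoto >= threshold and the number of pruned entries."""
    hits, skipped = [], 0
    nq = len(query)
    for name, fp in database.items():
        nb = len(fp)
        # Upper bound on Tanimoto from bit counts only: min / max.
        bound = min(nq, nb) / max(nq, nb) if max(nq, nb) else 1.0
        if bound < threshold:
            skipped += 1  # cannot reach the threshold, skip the full comparison
            continue
        sim = tanimoto(query, fp)
        if sim >= threshold:
            hits.append((name, sim))
    return sorted(hits, key=lambda h: -h[1]), skipped

# Hypothetical query and database.
query = {1, 2, 3, 4}
database = {"a": {1, 2, 3, 4, 5}, "b": {1}, "c": {1, 2, 9, 10}}
hits, skipped = search(query, database, threshold=0.5)
```

If the database is sorted by bit count in advance, the same bound turns into a contiguous range of candidates, which is where most of the saving would come from at the scale of millions of compounds.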
    Acknowledgements

   Jocelyne Bruand
   Peter Phung
   Liva Ralaivola
   S. Joshua Swamidass
   Yimeng Dou
   NIH/NSF/IGB
Questions
Docking

   Query: binding site of protein
   Scoring function & efficient minimizer

…
Some Targets
     P53 (Luecke)
     ACCD5 (Tsai)
     IMPDH, PPAR, etc. (Luecke)
     HIV Integrase (Robinson)
P53
Drug Rescue of P53 Mutants
Docking → ChemDB

   ~6 million commercially available
    compounds
   Searchable, annotated, downloadable.
   Other Databases:
       Cambridge Structural Database
       ChemBank
       PubChem
Chemical Toxicity Prediction
      By Kernel Methods

        Jonathan Chen
     S. Joshua Swamidass
           The Baldi Lab
     Data Flow
ID       Toxic?
1        No
2        No
3        Yes
4        Yes

Molecules → Kernel → Gram Matrix
Gram Matrix + Toxicity State List → Linear Classifier → Predictions
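
The data flow above can be sketched end to end on toy input: a kernel turns the molecules into a Gram matrix, and a classifier is trained from that matrix and the label list alone. A kernel perceptron stands in for the unspecified linear classifier, the kernel is a simple 2-mer count kernel on SMILES strings, and the four labels are placeholders that copy the No/No/Yes/Yes pattern of the slide; none of them are measured toxicities.

```python
from collections import Counter

def kmer_kernel(s, t, k=2):
    """Dot product of k-mer count vectors of two strings."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(n * ct[m] for m, n in cs.items())

def gram(items, kernel):
    """Gram matrix K[i][j] = kernel(items[i], items[j])."""
    return [[kernel(a, b) for b in items] for a in items]

def train_kernel_perceptron(K, y, epochs=50):
    """Train from the Gram matrix and labels only; returns dual coefficients."""
    alpha = [0] * len(y)
    for _ in range(epochs):
        errors = 0
        for i in range(len(y)):
            score = sum(alpha[j] * y[j] * K[j][i] for j in range(len(y)))
            if y[i] * score <= 0:
                alpha[i] += 1
                errors += 1
        if errors == 0:
            break
    return alpha

def predict(alpha, y, train, kernel, x):
    """Predict +1 or -1 for a new item from kernel values against the training set."""
    score = sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, train, y))
    return 1 if score > 0 else -1

# Placeholder training set: IDs 1-4 with labels No, No, Yes, Yes (-1 = No, +1 = Yes).
train = ["CCCC", "CCCCO", "c1ccccc1", "c1ccccc1N"]
labels = [-1, -1, 1, 1]
K = gram(train, kmer_kernel)
alpha = train_kernel_perceptron(K, labels)
predictions = [predict(alpha, labels, train, kmer_kernel, x) for x in train]
```

Swapping in a different kernel changes only the `gram` call; the training step is untouched, which is the practical appeal of routing everything through the Gram matrix.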
Results
   Example of Results

Kernel/Method                   Mutag   MM      FM      MR      FR
Kashima (2003)                  89.1    61.0    61.0    62.8    66.7
Kashima (2003)                  85.1    64.3    63.4    58.4    66.1
1D SMILES spec.                 84.0    66.1    61.3    57.3    66.1
1D SMILES spec+                 85.6    66.4    63.0    57.6    67.0
2D Tanimoto                     87.8    66.4    64.2    63.7    66.7
2D MinMax                       86.2    64.0    64.5    64.5    66.4
2D Tanimoto, l = 1024, b = 1    87.2    66.1    62.4    65.7    66.9
2D Hybrid, l = 1024, b = 1      87.2    65.2    61.9    64.2    65.8
2D Tanimoto, l = 512, b = 1     84.6    66.4    59.9    59.9    66.1
2D Hybrid, l = 512, b = 1       86.7    65.2    61.0    60.7    64.7
2D Tanimoto, l = 1024 + MI      84.6    63.1    63.0    61.9    66.7
2D Hybrid, l = 1024 + MI        84.6    62.8    63.7    61.9    65.5
2D Tanimoto, l = 512 + MI       85.6    60.1    61.0    61.3    62.4
2D Hybrid, l = 512 + MI         86.2    63.7    62.7    62.2    64.4
3D Histogram                    81.9    59.8    61.0    60.8    64.4
Chemical Informatics

   Historical perspective: physics, chemistry and biology
   Understanding chemical space
   Small molecules (systems biology, chemical
    synthesis, drug design, nanotechnology)
   Catalog
   Predict physical, chemical, biological properties
   Build filters/tools to efficiently navigate chemical
    space to discover new drugs, new galaxies, etc.
Datasets
    Small Molecules as Undirected Labeled
    Graphs of Bonds



   atom/node labels:
    A = {C,N,O,H, … }
   bond/edge labels:
    B = {s, d, t, ar, … }
Chemical Informatics

   Historical perspective: physics, chemistry and biology
   Understanding chemical space
   Small molecules (systems biology, chemical
    synthesis, drug design, nanotechnology)
   Bioinformatics analogy:
       Catalog (GenBank)
       Search (BLAST)
   Predict physical, chemical, biological properties
   Build filters/tools to efficiently navigate chemical
    space to discover new drugs, new galaxies, etc.