CICC MEP SDSC final Amino Acid Soap

Building a Chemical Informatics Grid
    Marlon Pierce
    Community Grids Laboratory
    Indiana University

 CICC researchers and developers who contributed
 to this presentation:
   Prof. Geoffrey Fox, Prof. David Wild, Prof. Mookie Baik,
   Prof. Gary Wiggins, Dr. Jungkee Kim, Dr. Rajarshi Guha,
   Sima Patel, Smitha Ajay, Xiao Dong
 Thanks also to Prof. Peter Murray Rust and the
 WWMM group at Cambridge University
 More info: and
Chemical Informatics and the

    An overview of the basic problem and
Chemical Informatics as a Grid
 Chemical Informatics is the application of information technology
 to problems in chemistry.
    Example problems: managing data in large scale drug discovery
    and molecular modeling
 Building Blocks: Chemical Informatics Resources:
    Chemical databases maintained by various groups
      NIH PubChem, NIH DTP
   Application codes (both commercial and open source)
      Data mining, clustering
      Quantum chemistry and molecular modeling
   Visualization tools
   Web resources: journal articles, etc.
 A Chemical Informatics Grid will need to integrate these into a
 common, loosely coupled, distributed computing environment.
Problem: Connecting It Together
 The problem is defining an architecture for tying all of
 these pieces into a distributed computing system.
    A “Grid”
 How can I combine application codes, web resources,
 and databases to solve a particular problem that
 interests me?
   Specifically, how do I build a runtime environment that can
   connect the distributed services I need to solve an interesting
   problem?
 For academic and government researchers, how can I
 do all of this in an open fashion?
   Data and services can come from anywhere
   That is, I must avoid proprietary infrastructure.
NIH Roadmap for Medical Research
The NIH recognizes chemical and biological
information management as critical to medical research.
Federally funded high throughput screening centers.
  100-200 HTS assays per year on small molecules.
  Hundreds of thousands of small molecules analyzed
  Data published, publicly available through NIH PubChem
  online database.
What do you do with all of this data?
High-Throughput Screening

                   Testing perhaps millions of compounds
                   in a corporate collection to see if any
                   show activity against a certain disease
High-Throughput Screening

 Traditionally, small numbers of compounds were
 tested for a particular project or therapeutic area
 About 10 years ago, technology developed that
 enabled large numbers of compounds to be assayed
 High-throughput screening can now test 100,000
 compounds a day for activity against a protein target
 Maybe tens of thousands of these compounds will
 show some activity for the protein
 The chemist needs to intelligently select the 2-3
 classes of compounds that show the most promise for
 being drugs to follow up
Informatics Implications
 Need to be able to store chemical structure and
 biological data for millions of data points
   Computational representation of 2D structure
 Need to be able to organize thousands of active
 compounds into meaningful groups
   Group similar structures together and relate to activity
 Need to learn as much information as possible
 (data mining)
   Apply statistical methods to the structures and related
 Need to use molecular modeling to gain direct
 chemical insight into reactions.
The Solution, Part I: Web Services

 Web Services provide the means for wrapping
 databases, applications, web scavengers, etc, with
 programming interfaces.
   WSDL definitions define how to write clients to talk with
   databases, applications, etc.
   Web Service messaging through SOAP
   Discovery services such as UDDI, MDS, and so on.
 Many toolkits available
   Axis, .NET, gSOAP, SOAP::Lite, etc.
 Web Services can be combined with each other into workflows
   Workflow == use case scenario
   More about this later.
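
 A minimal sketch of what "Web Service messaging through SOAP" looks like on the wire, assuming SOAP 1.1. The operation name getFingerprint, its smiles parameter, and the urn:example:cheminfo namespace are hypothetical and do not come from any CICC service; a real client would normally be generated from the service's WSDL by one of the toolkits listed above.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:cheminfo"  # hypothetical service namespace (assumption)


def build_soap_request(operation: str, params: dict) -> bytes:
    """Wrap one operation call and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical operation and parameter, for illustration only.
request = build_soap_request("getFingerprint", {"smiles": "CCO"})
print(request.decode("utf-8"))
```

 The resulting bytes would be sent as the body of an HTTP POST to the service endpoint; the response comes back as another envelope with the result inside its Body element.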
Basic Architectures: Servlets/CGI and
Web Services
 [Diagram: a Browser connected to a Web Server by HTTP GET/POST, alongside a GUI client connected to a Web Service through WSDL]

Solution Part II: Grid Resources

 Many Grid tools provide powerful backend services
   Globus: uniform, secure access to computing resources
   (like TeraGrid)
     File management, resource allocation management, etc.
   Condor: job scheduling on computer clusters and
   SRB: data grid access
   OGSA-DAI: uniform Grid interface to databases.
 These have Web Service as well as other interfaces
 (or equivalently, protocols).
Solution, Part III: Domain Specific
Tools and Standards -->More Services
 For Chemical Informatics, we have a number of tools and standards
    Chemical string representations
      SMILES, InChI
   Chemistry Markup Language
      XML language for describing, exchanging data.
      JUMBO 5: a CML parser and library
   Glue Tools and Applications
      Chemistry Development Kit (CDK)
 These are the basis for building interoperable Chemical
 Informatics Web Services
 Analogous situations exist for other domains
    Astronomy, Geosciences, Biology/Bioinformatics
Solution Part IV: Workflows

 Workflow engines allow you to connect services
 together into interesting composite applications.
 This allows you to directly encode your scientific use
 case scenario as a graph of interacting services.
 There are many workflow tools
   We’ll briefly cover these later.
   General guidance is to build web services first and then
   use workflow tools on top of these services.
   Don’t get married to a particular workflow technology yet,
   unless someone pays you.
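
 The idea of a workflow as "a graph of interacting services" can be sketched without committing to any workflow engine. The three services below are toy stand-ins (assumptions, not CICC services), and the "fingerprint" is just the string length; only the graph-driven execution order is the point.

```python
from graphlib import TopologicalSorter


# Stand-in "services": each receives the outputs of its upstream steps.
def fetch_compounds(_inputs):
    return ["CCO", "CCN", "c1ccccc1"]


def fingerprint(inputs):
    # Toy fingerprint: the length of the SMILES string.
    return {smiles: len(smiles) for smiles in inputs["fetch_compounds"]}


def cluster(inputs):
    groups = {}
    for smiles, fp in inputs["fingerprint"].items():
        groups.setdefault(fp, []).append(smiles)
    return sorted(groups.values())


SERVICES = {
    "fetch_compounds": fetch_compounds,
    "fingerprint": fingerprint,
    "cluster": cluster,
}

# The workflow graph: each step maps to the set of steps it depends on.
WORKFLOW = {
    "fetch_compounds": set(),
    "fingerprint": {"fetch_compounds"},
    "cluster": {"fingerprint"},
}


def run_workflow(graph, services):
    """Run every step after its dependencies and collect the results."""
    results = {}
    for step in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[step]}
        results[step] = services[step](upstream)
    return results


results = run_workflow(WORKFLOW, SERVICES)
print(results["cluster"])
```

 Because the services only see their declared inputs, any of them could be replaced by a remote Web Service call without changing the graph.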
Solution Part V: User Interfaces

 Web Services allow you to cleanly separate user
 interfaces from backend services.
   Model-view-controller pattern for web applications
 Client environments include
   Grid and web service scripting environments
   Desktop tools like Taverna and Kepler
   Portlet-based Web portal systems
 Typically, desktop tools like Taverna are used by
 power users to define interesting workflows.
 Portals are for running canned workflows.
Next steps

 Next we will review the online database
 resources that are available to us.
 Databases come in two varieties
   Journal databases
   Data databases
 As we will discuss, it is useful to build
 services and workflows for automatically
 interacting with both types.
Online Chemical Journal and
Data Resources
MEDLINE: Online Journal Database

 MEDLINE (Medical Literature Analysis and Retrieval
 System Online) is an international literature
 database of life sciences and biomedical
 It covers the fields of medicine, nursing, dentistry,
 veterinary medicine, and health care.
 MEDLINE covers much of the literature in biology
 and biochemistry, and fields with no direct medical
 connection, such as molecular evolution.
 It is accessed via PubMed.

PubMed: Journal Search Engine
PubMed is a free search engine offered by the United States
National Library of Medicine as part of the Entrez information
retrieval system.
The PubMed service allows searching the MEDLINE database.
   MEDLINE covers over 4,800 journals published in the United States
   and more than 70 other countries primarily from 1966 to the present.
In addition to MEDLINE, PubMed also offers access to:
   OLDMEDLINE for pre-1966 citations.
   Citations to articles that are out-of-scope (e.g., general science and
   chemistry) from certain MEDLINE journals
   In-process citations which provide a record for an article before it is
   indexed with MeSH and added to MEDLINE
   Citations that precede the date that a journal was selected for
   MEDLINE indexing
   Some life science journals
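
A hedged sketch of programmatic PubMed access. It assumes NCBI's E-utilities esearch endpoint, which these slides do not mention; the search term is only an example, and the code builds the request URL without sending it.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# NCBI E-utilities search endpoint (an assumption; not named in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term: str, max_results: int = 20) -> str:
    """Build an esearch URL that queries the PubMed database for a term."""
    query = urlencode({"db": "pubmed", "term": term, "retmax": max_results})
    return f"{ESEARCH}?{query}"


url = pubmed_search_url("HSP90 inhibitor")
print(url)
```

Fetching this URL returns an XML list of matching PubMed identifiers, which a service or workflow step could then pass on to other tools.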

PubChem: Chemical Database
 PubChem is a database of chemical molecules.
 The system is maintained by the National Center for
 Biotechnology Information (NCBI) which belongs to the United
 States National Institutes of Health (NIH).
 PubChem can be accessed for free through a web user interface
   And Web Services for programmatic access
 PubChem contains mostly small molecules with a molecular
 mass below 500.
 Anyone can contribute
   The database is free to use, but it is not curated, so the value of
   the information on a specific compound may be questionable.
   NIH funded HTS results are (intended to be) available through

NIH DTP Database
Part of NIH’s Developmental Therapeutics Program
Screens up to 3,000 compounds per year for
potential anticancer activity.
Utilizes 59 different human tumor cell lines,
representing leukemia, melanoma and cancers
of the lung, colon, brain, ovary, breast, prostate,
and kidney.
DTP screening results are part of PubChem
and also available as a separate database.
 Example screening results.
 Positive results (red bar to the
 right of the vertical line) indicate
 greater than average toxicity
 of the tested agent to the cell line.

 COMPARE is an algorithm for mining DTP
 result data to find and rank order compounds
 with similar DTP screening results.
  Discovered compounds may be less toxic to
  humans but just as effective against cancer cell
  May be much easier/safer to manufacture.
  May be a guide to deeper understanding of
Many Other Online Databases
 Complementary protein information
 Indiana University: Varuna project
   Discussed in this presentation
 University of Michigan: Binding MOAD
   “Mother of All Databases”
   Largest curated database of protein-ligand complexes
   Subset of protein databank
   Prof. Heather Carlson
 University of Michigan: PDBBind
   Provides a collection of experimentally measured binding
   affinity data (Kd, Ki, and IC50) exclusively for the protein-
   ligand complexes available in the Protein Data Bank (PDB)
   Dr. Shaomeng Wang
The Point Is…
 All of these databases can be accessed online with
 human-usable interfaces.
   But that’s not so important for our purposes
 More importantly, many of them are beginning to
 define Web Service interfaces that let other
 programs interact with them.
   Plenty of tools and libraries can simulate browsers, so you
   can also build your own service.
 This allows us to remotely analyze databases with
 clustering and other applications without modifying
 the databases themselves.
 Can be combined with text mining tools and web
 robots to find out who else is working in the area.
Encoding chemistry
Chemical Machine Languages
 Interestingly, chemistry has defined three simple
 languages for encoding chemical information.
   Can generate these by hand or automatically
 InChIs and SMILES can represent molecules as a
 single string/character array.
   Useful as keys for databases and for search queries in
 You can convert between SMILES and InChIs
   OpenBabel, OELib, JOELib
 CML is an XML format, and more verbose, but
 benefits from XML community tools
SMILES: Simplified Molecular Input
Line Entry Specification
 Language for describing the structure of
 chemical molecules using ASCII strings.
InChI: International Chemical Identifier
 IUPAC and NIST Standard similar to SMILES
 Encodes structural information about compounds
 Based on an open standard and algorithms.

InChI in Public Chemistry Databases
 US National Institute of Standards and Technology (NIST) - 150,000
 NIH/NCBI/PubChem project - >3.2 million structures
 Thomson ISI - 2+ million structures
 US National Cancer Institute(NCI) Database - 23+ million structures
 US Environmental Protection Agency(EPA)-DSSToX Database - 1450
 Kyoto Encyclopaedia of Genes and Genomes (KEGG) database - 9584
 University of California at San Francisco ZINC - >3.3 million structures
 BRENDA enzyme information system (University of Cologne) - 36,000
 Chemical Entities of Biological Interest (ChEBI) database of the European
 Bioinformatics Institute - 5000 structures
 University of California Carcinogenic Potency Project - 1447 structures
 Compendium of Pesticide Common Names - 1437 (2005-03-03) structures
Journals and Software Using InChI

   Nature Chemical Biology.
   Beilstein Journal of Organic Chemistry
   ACD/Labs ACD/ChemSketch.
   ChemAxon Marvin.
   SciTegic Pipeline Pilot.
   CACTVS Chemoinformatics Toolkit by Xemistry, GmbH.

Chemistry Markup Language

 CML is an XML markup language for encoding
 chemical information.
     Developed by Peter Murray Rust, Henry Rzepa and others.
     Actually dates from the SGML days before XML
 More verbose than InChI and SMILES
     But inherits XML schema, namespaces, parsers, XPATH,
     language binding tools like XML Beans, etc.
 Not limited to structural information
 Has OpenBabel support.
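
 Because CML is XML, ordinary XML tooling can read it. The sketch below parses a small hand-written CML fragment (heavy atoms of ethanol only, written for this illustration and not taken from the slides) with Python's standard library and a namespace-aware path query.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative fragment: heavy atoms of ethanol, hydrogens omitted.
CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>"""

NS = {"cml": "http://www.xml-cml.org/schema"}

root = ET.fromstring(CML)
# Element symbols, in document order.
elements = [atom.get("elementType") for atom in root.findall("cml:atomArray/cml:atom", NS)]
# Each bond as a pair of atom ids.
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.findall(".//cml:bond", NS)]

print(elements)
print(bonds)
```

 The same document could equally be validated against a schema or bound to classes, which is the practical payoff of CML's verbosity compared with SMILES and InChI strings.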
InChI Compared to SMILES
 SMILES is proprietary and
 different algorithms can give
 different results.
 Seven different unique
 SMILES for caffeine on
 Web sites:
   N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N
 On the other hand, some claim SMILES are more intuitive for
 human readers.

A CML Example
Clustering Techniques, Computing
Requirements, and Clustering

    Computational techniques for
    organizing data
The Story So Far

 We’ve discussed managing screening assay
 output as the key problem we face
   Must sift through mountains of data in PubChem
   and DTP to find interesting compounds.
   NIH funded High Throughput Screening will make
   this very important in the near future.
 Need now a way to organize and analyze the
Clustering and Data Analysis
 Clustering is a technique that can be applied to large data sets to
 find similarities
    Popular technique in chemical informatics
 Data sets are segmented into groups (clusters) in which
 members of the same cluster are similar to each other.
 Clustering is distinct from classification:
    There are no pre-determined characteristics used to define the
    membership of a cluster,
    Although items in the same cluster are likely to have many
    characteristics in common.
 Clustering can be applied to chemical structures, for example, in
 the screening of combinatorial or Markush compound libraries in
 the quest for new active pharmaceuticals.
 We also note that these techniques are fairly primitive
   More interesting clustering techniques exist but apparently are not
   well known by the chemical informatics community.
Non-Hierarchical Clustering
 Clusters form around centroids, the number of which
 can be specified by the user.
 All clusters rank equally and there is no particular
 relationship between them.

Hierarchical Clustering
 Clusters are arranged in hierarchies
   Smaller clusters are contained within larger ones; the bottom of
   the hierarchy consists of individual objects in "singleton" clusters,
   while the top of it consists of one cluster containing all the objects
   in the dataset.
   Such hierarchies can be built either from the bottom up
   (agglomerative) or the top downwards (divisive)

Fingerprinting and Dictionaries--What Is
Your Parameter Space?
 Clustering algorithms require a parameter space
   Clusters defined along coordinate axes.
 Coordinate axes defined by a dictionary of chemical
 Use binary on/off for fingerprinting a particular compound
 against a dictionary.
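
 A minimal sketch of binary fingerprinting against a dictionary. The five-entry dictionary is hypothetical, and plain substring tests on the SMILES string stand in for real substructure matching, which needs a chemistry toolkit; the Tanimoto coefficient then compares two fingerprints as the ratio of shared "on" bits to bits set in either.

```python
# Hypothetical fragment dictionary; real dictionaries hold substructure definitions.
DICTIONARY = ["C(=O)", "N", "O", "c1ccccc1", "Cl"]


def fingerprint(smiles, dictionary=DICTIONARY):
    """One bit per dictionary entry: 1 if the fragment text occurs in the SMILES.

    Crude stand-in: substring search is NOT true substructure matching.
    """
    return [1 if fragment in smiles else 0 for fragment in dictionary]


def tanimoto(a, b):
    """Ratio of bits set in both fingerprints to bits set in either."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


acetic_acid = fingerprint("CC(=O)O")
acetamide = fingerprint("CC(=O)N")
chlorobenzene = fingerprint("Clc1ccccc1")

print(acetic_acid, acetamide, chlorobenzene)
print(tanimoto(acetic_acid, acetamide), tanimoto(acetic_acid, chlorobenzene))
```

 Each dictionary entry is one coordinate axis of the parameter space, so every compound becomes a point with 0/1 coordinates that a clustering algorithm can work on.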

Cluster Analysis and Chemical
 Used for organizing datasets into chemical series, to build
 predictive models, or to select representative compounds
 Clustering Methods
   Jarvis-Patrick and variants
      O(N^2), single partition
   Ward’s method
      Hierarchical, regarded as best, but at least O(N^2)
   K-means
      < O(N^2), requires a set number of clusters, a little “messy”
   Sphere-exclusion (Butina)
      Fast, simple, similar to JP
   Kohonen network
      Clusters arranged in 2D grid, ideal for visualization
Limitations of Ward’s method for
large datasets (>1m)
 Best algorithms have O(N^2) time requirement (RNN)
 Requires random access to fingerprints
   hence substantial memory requirements (O(N))
 Problem of selection of best partition
   can select desired number of clusters
 Easily hit 4GB memory addressing limit on 32 bit
   Approximately 2m compounds
Scaling up clustering methods

   Clustering algorithms can be adapted for multiple
   Some algorithms more appropriate than others for
   particular architectures
   Ward’s has been parallelized for shared memory machines,
   but overhead considerable
 New methods and algorithms
   Divisive (“bisecting”) K-means method
   Hierarchical Divisive
   Approx. O(N log N)
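
 A back-of-the-envelope comparison of the two growth rates. It counts abstract "operations" only, ignores constant factors, and assumes a base-2 logarithm, so the ratios indicate scale rather than actual run times.

```python
import math


def quadratic_ops(n):
    """Operation count for an O(N^2) method, constant factor ignored."""
    return n * n


def nlogn_ops(n):
    """Operation count for an O(N log N) method, base-2 log assumed."""
    return n * math.log2(n)


def speedup(n):
    return quadratic_ops(n) / nlogn_ops(n)


for n in (10_000, 100_000, 1_000_000):
    print(f"N = {n:>9,}: N^2 / (N log2 N) = {speedup(n):,.0f}")
```

 The gap widens as N grows, which is why an O(N log N) method stays practical at dataset sizes where an O(N^2) method does not.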
Divisive K-means Clustering

 New hierarchical divisive method
   Hierarchy built from top down, instead of bottom up
   Divide complete dataset into two clusters
   Continue dividing until all items are singletons
   Each binary division done using K-means method
   Originally proposed for document clustering
 “Bisecting K-means”
   Steinbach, Karypis and Kumar (Univ. Minnesota)
   Found to be more effective than agglomerative methods
   Forms more uniformly-sized clusters at given level
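
 The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not the BCI Divkmeans implementation: it uses squared Euclidean distance on small coordinate tuples, random seeds, and batch centroid updates, all of which are assumptions.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def two_means(points, rng, iterations=20):
    """One binary division: plain K-means with K=2."""
    seeds = rng.sample(points, 2)
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if dist2(p, seeds[0]) <= dist2(p, seeds[1])]
        right = [p for p in points if dist2(p, seeds[0]) > dist2(p, seeds[1])]
        if not left or not right:
            break
        new_seeds = [centroid(left), centroid(right)]
        if new_seeds == seeds:  # converged
            break
        seeds = new_seeds
    if not left or not right:  # e.g. duplicate points: force a split
        left, right = points[:1], points[1:]
    return left, right


def bisecting_kmeans(points, rng=None):
    """Build the hierarchy top down; a leaf is a point, a node is [left, right]."""
    rng = rng or random.Random(0)
    if len(points) == 1:
        return points[0]
    left, right = two_means(points, rng)
    return [bisecting_kmeans(left, rng), bisecting_kmeans(right, rng)]


def leaves(node):
    """Flatten a (sub)hierarchy back into its points."""
    if isinstance(node, list):
        return leaves(node[0]) + leaves(node[1])
    return [node]


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
tree = bisecting_kmeans(points)
print(tree)
```

 Reading the nested lists from the outside in gives the hierarchy from the top down; cutting it at any depth yields a partition.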
BCI Divkmeans
 Several options for detailed operation
    Selection of next cluster for division
    size, variance, diameter
    affects selection of partitions from hierarchy, not shape of hierarchy
 Options within each K-means division step
    distance measure
    choice of seeds
    batch-mode or continuous update of centroids
    termination criterion
 Have developed parallel version for Linux clusters / grids in conjunction
 with BCI
 For more information, see Barnard and Engels talks at:
Divisive K-means: Conclusions
 Much faster than Ward’s, speed comparable to K-means,
 suitable for very large datasets (millions)
   Time requirements approximately O(N log N)
   Current implementation can cluster 1m compounds in under
   a week on a low-power desktop PC
   Cluster 1m compounds in a few hours with a 4-node parallel
   Linux cluster
 Better balance of cluster sizes than Ward’s or K-means
 Visual inspection of clusters suggests better assembly of
 compound series than other methods
 Better clustering of actives together than previously-
 studied methods
 Memory requirements minimal
 Experiments using AVIDD cluster and Teragrid
 (50+ nodes)
Effective exploitation of large volumes and diverse sources of
chemical information is a critical problem to solve, with a
potential huge impact on the drug discovery process
Most information needs of chemists and drug discovery
scientists are conceptually straightforward, but complex to
All of the technology is now in place to implement many of these
information-need “use cases”: the four-level model using
service-oriented architectures together with smart clients looks
like a neat way of doing this
In conjunction with grid computing, rapid and effective
organization and visualization of large chemical datasets is
feasible in a web service environment
Some pieces are missing:
  Chemical structure search of journals (wait for InChI)
  Automated patent searching
  Effective dataset organization
  Effective interfaces, especially visualization of large numbers of 2D structures
Divisive K-Means as a Web Service

 The previous exercise was intended to show
 that Divisive K-Means is a classic example of
 a Grid application.
   Needs to be parallelized
   Should run on TeraGrid
 How do you make this into a service?
 We’ll go on a small tour before getting back
 to our problem.
Wrapping Science Applications as Services

 Science Grid services typically must wrap legacy
 applications written in C or Fortran.
 You must handle such problems as
   Specifying several input and output files
     These may need to be staged in
   Launching executables and monitoring their progress.
   Specifying environment variables
 Often these also have shell scripts to do some
 miscellaneous tasks.
 How do you convert this to WSDL?
   Or (equivalently) how do you automatically generate the
   XML job description for WS-GRAM?
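
 Before any WSDL is involved, the core of such a wrapper is launching an executable with the right environment and watching it. A minimal sketch, assuming a local process rather than a Grid job; the CUTOFF variable and the one-line Python "legacy code" are stand-ins invented for the example.

```python
import os
import subprocess
import sys


def launch(args, extra_env=None, workdir=None):
    """Start a command-line application without waiting for it (non-blocking)."""
    env = dict(os.environ, **(extra_env or {}))
    return subprocess.Popen(
        args,
        env=env,
        cwd=workdir,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )


def run_blocking(args, extra_env=None, workdir=None, timeout=None):
    """Start the application and wait for it; return (exit code, stdout, stderr)."""
    process = launch(args, extra_env, workdir)
    out, err = process.communicate(timeout=timeout)
    return process.returncode, out, err


# Stand-in for a legacy C or Fortran executable: it echoes an environment variable.
LEGACY_APP = [sys.executable, "-c", "import os; print(os.environ['CUTOFF'])"]

code, out, err = run_blocking(LEGACY_APP, extra_env={"CUTOFF": "0.75"})
print(code, out.strip())
```

 For long-running codes, a caller would keep the object returned by launch and check process.poll() periodically instead of blocking on communicate.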
Generic Service Toolkit (GFAC)
(G. Kandaswamy, IU and RENCI)
  The Generic Service Toolkit can "wrap" any command-line
  application as an application service.
     Given a set of input parameters, it runs the application, monitors
     the application and returns the results.
  Requires no modification to program code.
  Also has web user interface generating tools.
     When a user accesses an application service, the user is
     presented with a graphical user interface (GUI) to that service.
     The GUI contains a list of operations that the user is allowed to
     invoke on that service.
  After choosing an operation, the user is presented with a GUI for
  that operation, which allows the user to specify all the input
  parameters to that operation.
     The user can then invoke the operation on the service and get the output

OPAL (S. Krishnan, SDSC)
Features include scheduling (using Globus and Condor/SGE) and
security (using GSI-based certificates), and persistent state
The WSDL defines operations to do the following:
  getAppMetadata: includes usage information, arbitrary application-specific
  metadata specified as an array of other elements,
      e.g. description of the various options that are passed to the application
  launchJob: runs job with specified input and returns a Job ID.
  queryStatus: returns status code, message, and URL of the working
  getOutputs: returns the outputs from a job that is identified by a Job ID.
      URLs for the standard output and error
      Array of structures representing the output file names and URLs
  getOutputAsBase64ByName: This operation returns the contents of an
  output file as Base64 binary.
  destroy: This operation destroys a running job identified by a Job ID.
  launchJobBlocking: This operation requires the list of arguments as a string,
  and an array of structures representing the input files.
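
The typical client pattern for these operations is launch, poll, then fetch. The sketch below runs that loop against an in-memory stand-in rather than a real Opal endpoint; the operation names come from the list above, while the status codes ("RUNNING", "DONE"), the job ID format, and the returned dictionaries are assumptions made for illustration.

```python
import itertools
import time


class FakeOpalService:
    """In-memory stand-in; a real client would call the WSDL-defined operations over SOAP."""

    def __init__(self, polls_until_done=2):
        self._ids = itertools.count(1)
        self._remaining = {}
        self._polls = polls_until_done

    def launchJob(self, arg_list, input_files=()):
        job_id = f"job-{next(self._ids)}"
        self._remaining[job_id] = self._polls
        return job_id

    def queryStatus(self, job_id):
        # Hypothetical status codes; the real service defines its own.
        if self._remaining[job_id] > 0:
            self._remaining[job_id] -= 1
            return {"code": "RUNNING", "message": "in progress"}
        return {"code": "DONE", "message": "finished"}

    def getOutputs(self, job_id):
        return {"stdout": f"{job_id}/stdout.txt", "stderr": f"{job_id}/stderr.txt", "files": []}


def run_job(service, arg_list, poll_seconds=0.0, max_polls=100):
    """Launch a job, poll until it reports completion, then fetch its outputs."""
    job_id = service.launchJob(arg_list)
    for _ in range(max_polls):
        if service.queryStatus(job_id)["code"] == "DONE":
            return service.getOutputs(job_id)
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


outputs = run_job(FakeOpalService(), "-input compounds.smi")
print(outputs)
```

This non-blocking pattern suits long-running codes; launchJobBlocking collapses the same steps into a single call for short jobs.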
Our Solution: Apache Ant Services
 We’ve found using Apache Ant to be very useful for
 wrapping services.
   Can call executables, set environment variables.
   Lots of useful built-in shell-like tasks.
   Extensible (write your own tasks).
   Develop build scripts to run your application
 You can easily call Ant from other Java programs.
   So just write a wrapper service
   We use both blocking (hold connection until return) and
   non-blocking version (suitable for long running codes).
   In non-blocking case, “Context” web service is used for
  Flow Chart of SMILES to Cluster
  Partitioned of BCI Web Service
   [Flow chart, SMILES to DKM: SMILES string -> Makebits (fingerprints dictionary, default) -> fingerprint (*.scn) -> DivKmeans (clustering fingerprints) -> hierarchy]
   [Flow chart, continued: Optclus (generating the best levels) -> best level -> RNNclus (extracting individual cluster partitions) -> extracted cluster hierarchy -> One Column Process -> Merge Process -> new SMILES string]
 BCI Clustering Service Methods
 Service Method            Description                                     Input      Output
 makebitsGenerate          Generate fingerprints from a SMILES structure   SMIstring  Fingerprint string
 divkmGenerate             Cluster fingerprints with Divkmeans             SCNstring  Clustered Hierarchy
 smile2dkm                 Makebits + divkm                                SMIstring  Clustered
 optclusGenerate           Generate the best levels in a hierarchy         DKMstring  Best partition cluster level
 rnnclusGenerate           Extract individual cluster partitions           DKMstring  Indiv. cluster partitions
 smile2ClusterPartitioned  Generate a new SMILES structure w/ extra col.   SMIstring  New SMILES structure
A Library of Chemical
Informatics Web Services
All Services Great and Small

 Like most Grids, a Chemical Informatics Grid will
 have the classic styles:
   Data Grid Services: these provide access to data sources
   like PubChem, etc.
   Execution Grid Services: used for running cluster analysis
   programs, molecular modeling codes, etc, on TeraGrid and
   similar places.
 But we also need many additional services
   Handling format conversions (InChI<->SMILES)
   Shipping and manipulating tabular data
   Determining toxicity of compounds
   Generating batch 2D images
 So one of our core activities is “build lots of services”
VOTables: Handling Tabular Data
 Developed by the Virtual Observatory community for encoding
 astronomy data.
 The VOTable format is an XML representation of the tabular
 data (data coming from BCI, NIH DTP databases, and so on).
 VOTables-compatible tools have been built
   We just inherit them.
 SAVOT and JAVOT Java parser APIs for VOTable allow us
 to easily build VOTable-based applications
   Web Services
   Spreadsheets
   Plotting applications.
      VOPlot and TopCat are two
  Document Structure of VOTable

Compound Name   Cluster Number
Acemetacin      1
Candesartan     1
Acenocoumarol   2
Dicumarol       2
Phenprocoumon   2
Trioxsalen      2
Warfarin        2

    <?xml version="1.0"?>
    <VOTABLE version="1.1" xmlns:xsi=
    instance
    ble/v1.1">
     <RESOURCE >
       <TABLE name="results">
         <FIELD name="CompoundName" ID="col1" datatype="char"
         <FIELD name="ClustureNumber" ID="col2" datatype="int"/>
         <DATA>
           <TR><TD>Acenocoumarol</TD><TD>2</TD></TR>
           <TR><TD>Dicumarol</TD><TD>2</TD></TR>
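
A hedged sketch of producing such a document from tabular rows with Python's standard library. The field names are modeled on the slide; the TABLEDATA wrapper and the arraysize attribute follow the VOTable format as generally used, and the XML namespace declarations are left out for brevity, so this shows the structure rather than a schema-valid file.

```python
import xml.etree.ElementTree as ET


def rows_to_votable(rows):
    """Serialize (compound name, cluster number) rows as a minimal VOTable string."""
    votable = ET.Element("VOTABLE", version="1.1")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE", name="results")
    # Column descriptions: a variable-length string and an integer.
    ET.SubElement(table, "FIELD", name="CompoundName", ID="col1", datatype="char", arraysize="*")
    ET.SubElement(table, "FIELD", name="ClusterNumber", ID="col2", datatype="int")
    data = ET.SubElement(table, "DATA")
    tabledata = ET.SubElement(data, "TABLEDATA")
    for name, cluster_number in rows:
        tr = ET.SubElement(tabledata, "TR")
        ET.SubElement(tr, "TD").text = name
        ET.SubElement(tr, "TD").text = str(cluster_number)
    return ET.tostring(votable, encoding="unicode")


rows = [("Acemetacin", 1), ("Candesartan", 1), ("Acenocoumarol", 2), ("Dicumarol", 2)]
xml_text = rows_to_votable(rows)
print(xml_text)
```

Any VOTable-aware tool should then be able to treat the two FIELD elements as column definitions and each TR as one row.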
mrtd1.txt: SMILES representation of chemical compounds
along with their properties
 [Diagram labels: Taverna Client, Tomcat Server, votable.xml, VOPlot]
votable.xml: XML representation of the mrtd1.txt file
VOPlot application using the generated votable.xml file: graph
plotted with Mass on the X-axis and PSA on the Y-axis
Other Uses for VOTables

 VOTables is a useful intermediate format for
 exchanging data between databases.
 Simple example: exchange data between VARUNA
   Each student in the Baik group maintains his/her own copy
   (sandbox purposes).
   Often need to import/export individual data sets.
 It is also good for storing intermediate results in
   Value is not the format, but the fact that the XML can be
   manipulated programmatically.
      Unions, subset, intersection operations
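
 A small sketch of those operations on tables that have already been parsed into rows. Rows are (compound name, cluster number) tuples keyed on the first column; the two sample tables are made up from compound names used elsewhere in this presentation.

```python
def index_by(rows, column=0):
    """Map the key column's value to its row."""
    return {row[column]: row for row in rows}


def union(a, b, column=0):
    """All rows from a, plus rows from b whose key is not already present."""
    merged = index_by(a, column)
    for key, row in index_by(b, column).items():
        merged.setdefault(key, row)
    return list(merged.values())


def intersection(a, b, column=0):
    """Rows of a whose key also appears in b."""
    keys_b = index_by(b, column).keys()
    return [row for row in a if row[column] in keys_b]


def subset(rows, predicate):
    """Rows that satisfy an arbitrary condition."""
    return [row for row in rows if predicate(row)]


table_a = [("Warfarin", 2), ("Dicumarol", 2), ("Acemetacin", 1)]
table_b = [("Warfarin", 2), ("Candesartan", 1)]

print(union(table_a, table_b))
print(intersection(table_a, table_b))
print(subset(union(table_a, table_b), lambda row: row[1] == 1))
```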
More Services: WWMM Services
Service          Description                                 Input                                  Output
InChIGoogle      Search an InChI structure through Google    inchiBasic, type                       Search result in HTML format
InChIServer      Generate InChI                              version, format                        An InChI structure
OpenBabelServer  Transform a chemical format to another      format, inputData, outputData          Converted chemical structure string
                 using Open Babel
CMLRSSServer     Generate CMLRSS feed from CML data          mol, title, description, link, source  Converted CMLRSS feed of CML data
CDK-Based Services
Common Substructure  Calculates the common substructure between two molecules.
CDKsim               Takes two SMILES and evaluates the Tanimoto coefficient (ratio of intersection to union of their
CDKdesc              Calculates a variety of molecular and atomic descriptors for QSAR modeling
CDKws                Fingerprint generation
CDKsdg               Creates a jpeg of the compound’s 2D structure
CDKStruct3D          Generates 3D coordinates of a molecule from its
ToxTree Service
  The Threshold of
  Toxicological Concern
  (TTC) establishes a level of
  exposure for all chemicals
  below which there would be
  no appreciable risk to
  human health.
  ToxTree implements the
  Cramer Decision Tree
  approach to estimate TTC.
  We have converted this into
  a service.
     Uses SMILES as input.
     Note the GUI must be
     separated from the library to
     be a service
Taverna Workflow for Toxic Hazard Estimation
OSCAR3 Service
 Oscar3 is a tool for shallow, chemistry-specific
 natural language parsing of chemical documents
 (i.e. journal articles).
 It identifies (or attempts to identify):
   Chemical names: singular nouns, plurals, verbs etc., also
   formulae and acronyms.
   Chemical data: Spectra, melting/boiling point, yield etc. in
   experimental sections.
   Other entities: Things like N(5)-C(3) and so on.
 There is a larger effort, SciBorg, in this area
 This (like ToxTree) is pleasingly parallel and could
 productively be parallelized.
 It also has potentially very interesting Workflows
Use Cases and Workflows

   Putting data and clustering together in a
   distributed environment.
Chemical Informatics as a Grid Problem
 NIH-Funded experimental screening
    NIH DTP and HTS projects are generating a wealth of raw data
    on small compounds.
    Available in PubChem
 Journal and chemical data sources often have public Web clients
 and GUIs.
    But we need Web Service interfaces, not just Web interfaces.
    These provide programming interfaces for building both human
    and machine clients.
 These need to be connected to computing resources for running
 clustering, data mining, and molecular modeling applications.
    Excellent candidates for running on the TeraGrid
 We can formulate scientific problems that map to inter-
 connections of Grid services.
    This is generally called “Grid workflow” or “Service Orchestration”
 [Diagram: a scientist surrounded by the information sources that bear on one decision]
 SCIENTIST: “These compounds look promising from their HTS results. Should I commit some
 chemistry resources to following them up?”
 Oracle Database (HTS)
    Compounds were tested against related assays and showed activity, including
    selectivity within target families
 Oracle Database (Genomics)
    None of these compounds have been tested in a microarray assay
 Excel Spreadsheet (Toxicity)
    One of the compounds was previously tested for toxicology and was found to
    have no liver toxicity
 Word Document (Chemistry)
    Several of the compounds had been followed up in a previous project, and solubility
    problems prevented further
 Journal Article
    A recent journal article reported the effectiveness of some compounds in a related
    series against a target in the same family
 External Database (Patent)
    Some structures with a similarity > 0.75 to these appear to be covered by a
    patent held by a competitor
 Word Document (Marketing)
    A report by a team in Marketing casts doubt on whether the market for this
    target is big enough to make development cost-effective
 Unlabeled boxes
    All the compounds pass the Lipinski Rule of Five and toxicity filters
    The information in the structures and known activity data is good enough to create
    a QSAR model with a confidence of 75%
Workflow, Services, and Science
 Web Services work best as simple stateless
   No implicit input, output, or interdependency of
 Services must be composed into interesting
 This is called workflow.
 A good workflow ...
   Is composed of independent services
   Completely specifies an interesting science
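 As a minimal sketch of this idea, the Python fragment below treats each
 service as a stateless function whose output depends only on its explicit
 input, and treats the workflow as an ordered list of such functions. The
 service names (fetch_compounds, add_length, keep_short) and the records are
 hypothetical stand-ins for illustration; they are not CICC services.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Service = Callable[[List[Record]], List[Record]]


def fetch_compounds(_: List[Record]) -> List[Record]:
    """Hypothetical data service: returns a fixed list of (id, SMILES) records."""
    return [
        {"id": 1, "smiles": "CCO"},
        {"id": 2, "smiles": "c1ccccc1"},
        {"id": 3, "smiles": "CC(=O)O"},
    ]


def add_length(records: List[Record]) -> List[Record]:
    """Hypothetical property service: adds a column computed from the input only."""
    return [{**r, "smiles_length": len(str(r["smiles"]))} for r in records]


def keep_short(records: List[Record]) -> List[Record]:
    """Hypothetical filter service: keeps records with short SMILES strings."""
    return [r for r in records if int(r["smiles_length"]) <= 7]


def run_workflow(steps: List[Service]) -> List[Record]:
    """Run the services in order, passing each output to the next input."""
    data: List[Record] = []
    for step in steps:
        data = step(data)
    return data


# The workflow is only the ordered list of independent services.
workflow = [fetch_compounds, add_length, keep_short]
result = run_workflow(workflow)
print([r["id"] for r in result])
```

 Because no service keeps hidden state, any step can be replaced or
 reordered without touching the others, which is the property the workflow
 engines on the next slide rely on.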
Some Open Source Grid Workflow Projects
UK e-Science Project’s Taverna
   Scufl.xml scripting, GUI interface, works with Web Services.
   Works with Web services and the Globus Toolkit.
Condor DAGMan
   Works over the top of Condor’s scheduler.
   Extended by the GriPhyN Virtual Data System
Java CoGKit’s Karajan
   XML workflow specification for scripting COG clients.
   Works with GT 2 and 4.
Community Grids Lab’s HPSearch
   JavaScript scripting, works with Web services.
Indiana Extreme Lab’s Workflow Composer
   Jython, BPEL (soon) scripting
Finding compound-protein relationships
 A 2D structure is supplied for input into the similarity search (in
 this case, the extracted bound ligand from the PDB 1Y4 complex).
 The workflow employs our local NIH DTP database service to search
 200,000 compounds tested in human tumor cellular assays for
 structures similar to the ligand. Client portlets are used to browse
 these structures.
 Similar structures are filtered for druggability, and are automatically
 passed to the OpenEye FRED docking program for docking into the
 target protein.
 A protein implicated in tumor growth is supplied to the docking
 program (in this case HSP90 taken from the PDB 1Y4 complex).
 Once docking is complete, the user visualizes the high-scoring docked
 structures in a portlet using the JMOL
 Correlation of docking results and “biological fingerprints” across
 the human tumor cell lines can help identify potential mechanisms
 of action of DTP
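 The similarity search step can be sketched as follows. The slides do not
 name the similarity measure; this sketch assumes the Tanimoto coefficient
 on binary fingerprints, a common choice, and represents each fingerprint
 as the set of its "on" bit positions. The compound identifiers and
 fingerprints are invented for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(
    query: Fingerprint, database: Dict[str, Fingerprint], threshold: float
) -> List[Tuple[str, float]]:
    """Return (compound id, similarity) pairs at or above the threshold, best first."""
    scored = [(cid, tanimoto(query, fp)) for cid, fp in database.items()]
    hits = [(cid, s) for cid, s in scored if s >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints (sets of bit positions), not real DTP data.
query = frozenset({1, 4, 7, 9})
database = {
    "cmpd-A": frozenset({1, 4, 7, 9}),   # identical to the query
    "cmpd-B": frozenset({1, 4, 7, 12}),  # 3 shared bits out of 5 in total
    "cmpd-C": frozenset({2, 3}),         # no shared bits
}

hits = similarity_search(query, database, threshold=0.5)
for compound_id, similarity in hits:
    print(compound_id, round(similarity, 2))
```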
HTS data organization & flagging
 A tumor cell line is selected. The activity results for all the
 compounds in the DTP database in the given range are extracted from
 the PostgreSQL database.
 OpenEye FILTER is used to calculate biological and chemical properties
 of the compounds that are related to their potential effectiveness as
 drugs.
 The compounds are clustered on chemical structure similarity, to
 group similar compounds together.
 The compounds along with property and cluster information are
 converted to VOTable format and displayed in VOPlot.
Use Case: Which of these hits should I
follow up?
 An HTS experiment has produced 10,000 possible hits out of a
 screening set of 2 million compounds. A chemist on the project wants
 to know which series of compounds are the most promising for
 follow-up, based on:
    Series selection: cluster analysis
    Structure-activity relationships: modal fingerprints/stigmata
    Chemical and pharmacokinetic properties: mitools, ChemAxon
    Compound history: gNova / PostgreSQL
    Patentability: BCI Markush handling software
    Synthetic feasibility
    + requires visualization tools!
A Workflow Scenario: HTS Data
Organization and Flagging
 This workflow demonstrates how screening data can be flagged
 and organized for human analysis.
 The compounds and data values for a particular screen are
 retrieved from the NIH DTP database and then are filtered to
 remove compounds with reactive groups, etc.
    A tumor cell line is selected. The activity results for all the
    compounds in the DTP database in the given range are extracted
    from the PostgreSQL database
 OpenEye FILTER is used to calculate biological and chemical
 properties of the compounds that are related to their potential
 effectiveness as drugs
 ToxTree is used to flag the potential toxicities of compounds.
 Divkmeans is used to add a column of cluster numbers.
 Finally, the results are visualized using VOPlot and the 2D viewer
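 The pattern in this workflow (filter the table, then let each service add
 a flag column) can be sketched as below. This is an illustration only: the
 rows and property values are invented, and a Lipinski Rule of Five check
 stands in for the property calculation. It is not a description of what
 OpenEye FILTER, ToxTree, or Divkmeans compute. The "at most one violation"
 pass criterion is a common convention and is an assumption here.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical screening rows; every value here is invented.
rows: List[Row] = [
    {"id": "c1", "mol_weight": 320.4, "logp": 2.1, "h_donors": 2,
     "h_acceptors": 5, "has_reactive_group": False},
    {"id": "c2", "mol_weight": 612.8, "logp": 6.3, "h_donors": 4,
     "h_acceptors": 11, "has_reactive_group": False},
    {"id": "c3", "mol_weight": 287.3, "logp": 1.4, "h_donors": 1,
     "h_acceptors": 4, "has_reactive_group": True},
]


def remove_reactive(table: List[Row]) -> List[Row]:
    """Filter step: drop compounds marked as having reactive groups."""
    return [r for r in table if not r["has_reactive_group"]]


def add_rule_of_five(table: List[Row]) -> List[Row]:
    """Flag step: add a violation count and a pass/fail column to each row."""
    flagged = []
    for r in table:
        violations = sum([
            float(r["mol_weight"]) > 500,   # molecular weight over 500
            float(r["logp"]) > 5,           # logP over 5
            int(r["h_donors"]) > 5,         # more than 5 H-bond donors
            int(r["h_acceptors"]) > 10,     # more than 10 H-bond acceptors
        ])
        flagged.append({**r, "ro5_violations": violations,
                        "ro5_pass": violations <= 1})
    return flagged


flagged = add_rule_of_five(remove_reactive(rows))
for r in flagged:
    print(r["id"], r["ro5_violations"], r["ro5_pass"])
```

 Toxicity flags and cluster numbers would be added the same way, as further
 columns on the same table, before the table is handed to the visualizer.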
Web Services
Example plots of our
workflow output using
VOPlot and VOTables
 [Workflow diagram] Cluster the compounds in the NIH DTP database by
 chemical structure, then choose representative compounds from the
 clusters and dock them into PDB protein files of interest.
    Services shown: NIH Database Service (PostgreSQL), Fingerprint
    Generator (BCI Makebits), Cluster Analysis (BCI Divkmeans), Table,
    Docking Selector, Plot Visualizer (VoPlot), 2D-3D (OpenEye OMEGA),
    Docking (OpenEye FRED), PDB Database, 3D Visualizer
    Data passed between services: SMILES + ID, Fingerprints, Cluster
    Members, SMILES + ID + Cluster #, MOL File, PDB Structure + Box,
    Docked Complex
Use Case: Are there any good ligands for
my target?
 A chemist is working on a project involving a
 particular protein target, and wants to know:
    Any newly published compounds which might fit the protein
    receptor site: gNova / PostgreSQL, PubChem search,
    FRED Docking
    Any published 3D structures of the protein or of
    protein-ligand complexes: PDB search
    Any interactions of compounds with other proteins: gNova
    / PostgreSQL, PubChem search
    Any information published on the protein target: Journal
    text search
Use Case: Who else is working on these
 A chemist is working on a chemical series for a particular
 project and wants to know:
    If anyone publishes anything using the same or related
    compounds: PubChem search
    Any new compounds added to the corporate collection which are
    similar or related: gNova CHORD / PostgreSQL
    If any patents are submitted that might overlap the compounds he
    is working on: BCI Markush handling software
    Any pharmacological or toxicological results for those or related
    compounds: gNova CHORD / PostgreSQL, MiToolkit
    The results for any other projects for which those compounds
    were screened: gNova CHORD / PostgreSQL, PubChem
VARUNA – Towards a Grid-based
Molecular Modeling Environment

    A brief overview of Prof. Mookie Baik’s
    VARUNA project.
Chemical Informatics in Academic Research?
 Industrial Research: Target Oriented
    Not bound to a specific molecular system
    Not bound to a method
    Not concerned with generality
    Aware of Efficiency
    Aware of Overall Cost
    Aware of Toxicity
    Concerned about Formulations
    Cares about active MOLECULES
 Academic Research: Concept Oriented
    Specialized on few molecular families
    Method Development is important
    Obsessed with generality
    Does not care much about efficiency
    Cost is unimportant
    Often can’t even assess for
    Formulation is a minor issue
    Cares mostly about REACTIONS, i.e. ways to GET to a molecule
AutoGeFF, Varuna and Workflows
 Metalloproteins are extremely important in
 biochemical processes
 Understanding their chemistry is difficult
 To add value to the small molecule DB’s
 (PubChem, etc.), we must somehow connect
 them to PDB’s, BindMOAD, etc.
 By extending Varuna’s functionality to handle and
 store metalloproteins, we could provide a
 connection
Automatic Generator of ForceFields
 Developing a service that can take ANY
    drug-like molecule (from PubChem, for example)
    metal complexes
    metalloenzymes (from PDB, for example)
    unnatural or functionalized amino acids, nucleobases (from in-
    house db)
 for which molecular mechanics force fields are not available and
 automatically generate FF’s based on
    High level Quantum Simulations (using Varuna as a Web service)
 for Sophisticated Molecular Mechanics Simulations
 First Step: Coding of a specialized Prototype that can reproduce
 our manually derived novel force fields for Cu-Aβ Alzheimer’s
 Disease as a Proof-Of-Principles Study.
Automatic Quantum Mechanical Curation
of Structure Data
 Chemical Research logic is often driven by molecular
 Large-scale, small molecule DB’s (such as PubChem)
 have low-resolution structure data
 Often key properties are not consistently available:
   e.g.: Rotation-barriers, Redox Potentials, Polarizabilities, IR
   frequencies, reactivity towards nucleophiles
 QM web-services will provide tools for generating
 high-resolution data
   that will curate the results of traditional ChemInfo studies
   allow for combinatorial computational chemistry
   access a database of modeling data
Prototype-Project: Controlling the TGFβ pathway
 Diagram labels: in-house Molecules in Varuna, PubChem, PDB,
 Inactive TGFβ (1IAS), Active TGFβ (with inhibitor)
 Understanding of TGFβ
    What molecular feature controls inhibitor binding?
    How do mutations impact
 in the Zhang
Consequences for ChemInfo Design
for Academia
TWO Strategies are needed:
  Making traditional ChemInfo tools that are often available in
  commercial research available to Academia is in principle
  New ChemInfo Tools that are CONCEPT centered and include
  REACTIONS in addition to MOLECULES must be developed.
Our approach:
     Development of
             (a) Quantum Chemical Database
             (b) Molecular Modeling Database
  Harness the power of recent advances in Molecular Modeling
  (QM, QM/MM, MM, MD) through information management.
  Data-depository for Quantum Chemical Data including both
  Properties & Mechanisms
QM Calculation Workflow


 Diagram labels: Input Param, XYZ File of a Molecule, Input File
 Generator, Job Script Service, Job Scheduling Service, SSH Service,
 List of
More Information

 Contact me:
 Most of this was taken from our CICC project. See
   Note we’ve found wikis to be extremely useful and fun to
   use for maintaining collaborative web sites.
   See also and
   for other examples using Media Wiki.
 Many elements of our approach are based on Prof.
 Peter Murray Rust’s group’s approach.
   WWMM Wiki:
 SourceForge Project Site
Additional Slides
Use Case - CICC
Which of these hits should I follow up?
 An MLI HTS experiment has produced 10,000 possible hits out of
 a screening set of 2m compounds. A chemist at another
 laboratory wants to know if there are any interesting active
 series she might want to pursue, based on:
    Structure-activity relationships
    Chemical and pharmacokinetic properties
    Compound history
    Synthetic feasibility
CICC Web Services I
 BCI Clustering
    Provides Barnard Chemical Information (BCI) clustering packages
    A module of the workflow for HTS data organization and flagging
        Added URL output support to the previous solid prototype (Multi-user durable)
        Taverna Beanshell Scripting for data format adjusting (e.g. Filtering out the head part listing
        column names)
    To do: Evaluating the URI(URL) based workflow design
 ToxTree
    Estimates toxic hazard by applying a decision tree approach
    A module of the workflow for HTS data organization and flagging
    Status: A test prototype producing the level of toxicity in a brief or verbose explanation
    against a SMILES structure
    To do:
        Refining the Web service for cluster input and external property support
        The Taverna Beanshell scripting for data merging not used in some modules
CICC Web Services II
Workflow for HTS data organization and flagging
  Demonstrates how screening data can be flagged and organized for human
  analysis
  Status: Individual modules except the visualization are in prototype
  To do:
      Defining at least XML schema or DTD for the workflow data (at most the Ontology)
      Redefining current workflow model to reflect the new feature of Taverna 1.4
      supporting complex data structures and the provenance plugin
Other Planned Web Services
  Open Source Chemistry Analysis Routines (OSCAR)
      Extracts chemical information from text and produces an XML instance highlighting
      the chemical information
      A module of the PMR workflow
      Status: OSCAR3 is available and works fine as a Java application
      To do: Studying XML instances for extracting chemical names
  InfoChem’s SPRESI Web Service
      Provides access to the SPRESI molecule database
      Status: Perl scripts for accessing SPRESI Web Service
      To do: Developing a Web service wrapper to utilize InfoChem’s SPRESI Web
      Service
   BCI Clustering URL Service Methods

 Service Method               Description                                     Input                  URL Output
 makebitsURLGenerate          Generate fingerprints from a SMILES structure   SMIstring              Fingerprint and program output
 divkmURLGenerate             Cluster fingerprints with Divkmeans             SCNstring              DKM data and program output
 smile2dkmURL                 Makebits + divkm                                SMIstring              All SMI, DKM and std. outputs
 optclusURLGenerate           Generate the best levels in a hierarchy         SMIstring, DKMstring   Best data and program output
 rnnclusURLGenerate           Extract individual cluster partitions           SMIstring, DKMstring   New partition and std. output
 smile2ClusterPartitionedURL  Generate a new SMILES structure w/ extra col.   SMIstring              All intermediate data and output
Workflow for smile2ClusterPartitionedURL
Workflow for Toxic Hazard in Verbose
Diagram of Workflow2

Web Services

Beanshell Scripting
Informatics is the discipline of science which investigates the
   structure and properties (not specific content) of scientific
   information, as well as the regularities of scientific information
   activity, its theory, history, methodology and organization. The
   purpose of informatics consists in developing optimal methods
   and means of presentation (recording), collection, analytical-
   synthetic processing, storage, retrieval and dissemination of
   scientific information.
A. I. Mikhailov, A. I. Chernyi, R. S. Gilyarevskii (1967) “Informatics --
   New Name of the Theory of Scientific Information”
Chemical informatics is …

 More usually known as chemoinformatics or
 Very differently defined, reflecting its cross-
 disciplinary nature
   Chemist (synthetic, medicinal, theoretical)
   Biologist / Bioinformatician
   Molecular modeler
   Pharmaceutical or Chemical Engineer
   Computer Scientist / Informatician
More definitions

 Computational Chemistry – The application of
 mathematical and computational methods to
 particularly to theoretical chemistry
 Molecular Modeling – Using 3D graphics and
 optimization techniques to help understand the
 nature and action of compounds and proteins
 Computer-Aided Drug Design – The discipline of
 using computational techniques (including chemical
 informatics) to assist in the discovery and design of
Traditional areas of application

 Pharmaceutical & life science industry
   particularly in early stage drug design
 Databases of available chemicals
 Electronic publishing
   including searchable chemical structure
   information in journals, etc.
 Government and patent databases
The –ics so far (1960’s to present) …
 How do you represent 2D and 3D chemical structures?
   Not just a pretty picture
 How do you search databases of chemical structures?
   Google doesn’t help (much, but it might do soon…)
 How do you organize large amounts of chemical information?
 How do you visualize chemical structures & proteins?
 Can computers predict how chemicals are going to behave
   … in the test tube?
   … in the body?
Current trends & hot topics

 The decorporatization of chemical informatics
 (PubChem, MLI, eScience, open source)
 Service-oriented architectures
 Packaging & processing large volumes of complex
 information for human consumption
 Integration with other –ics (bioinformatics, genomics,
 proteomics, systems biology)
Main players (Commercial)

 Tripos, Inc.
 Daylight CIS, Inc.
Main players (Academia)
“Pure” Chemoinformatics
   University of Sheffield, UK (Willett / Gillet)
   Erlangen, Germany (Gasteiger)
   Cambridge Unilever Center
   Indiana University School of Informatics
Related (computational chemistry, etc.)
   UCSF (Kuntz)
   University of Texas (Pearlman)
   Yale (Jorgensen)
   University of Michigan (Crippen)
“Traditional” Journals
 Journal of Chemical Information & Modeling (formerly JCICS)
 Journal of Computer-Aided Molecular Design
 Journal of Molecular Graphics and Modeling
 Journal of Computational Chemistry
 Journal of Chemical Theory and Computation
 Journal of Medicinal Chemistry
“Informal” publications

 Network Science (online)
 Chemical & Engineering News
 Drug Discovery Today
 Scientific Computing World
 Bio-IT World
CINF-L Distribution List

 Chemical Information Sources Discussion
 Created by Gary Wiggins at IUB
Yahoo! Chemoinformatics Discussion
   Job postings
   Ideas exchange
   Industry – Student
 All students encouraged to
 Open to others

       To join, go to
       Or send an email to
Open Source / Free Software
 Blue Obelisk -
 InChI -
 OpenBabel -
 CML -
 CDK -
Example 2
3D Visualization & Docking

    3D Visualization of interactions between compounds and proteins
    “Docking” compounds into proteins computationally
3D Visualization

 X-ray crystallography and NMR Spectroscopy can
 reveal 3D structure of protein and bound
 Visualization of these “complexes” of proteins and
 potential drugs can help scientists understand the
 mechanism of action of the drug and to improve the
 design of a drug
 Visualization uses computational “ball and stick”
 model of atoms and bonds, as well as surfaces
 Stereoscopic visualization available
Docking algorithms

 Require 3D atomic structure for protein, and 3D
 structure for compound (“ligand”)
 May require initial rough positioning for the ligand
 Will use an optimization method to try and find the
 best rotation and translation of the ligand in the
 protein, for optimal binding affinity
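 As a toy illustration of "use an optimization method to find the best
 rotation and translation", the sketch below works in 2D, assumes a known
 pairing between ligand atoms and target positions in the site, and uses
 simple random hill climbing. The scoring function (negative sum of squared
 distances) is a stand-in for a real binding-affinity score. None of this
 reflects how FRED or any other docking program is implemented.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]
Pose = Tuple[float, float, float]  # (rotation angle, x shift, y shift)


def transform(points: List[Point], angle: float, dx: float, dy: float) -> List[Point]:
    """Rotate the points about the origin, then translate them."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]


def score(ligand: List[Point], site: List[Point]) -> float:
    """Toy score: negative sum of squared atom-to-site distances (higher is better)."""
    return -sum((lx - sx) ** 2 + (ly - sy) ** 2
                for (lx, ly), (sx, sy) in zip(ligand, site))


def dock(ligand: List[Point], site: List[Point],
         steps: int = 5000, seed: int = 0) -> Tuple[Pose, float]:
    """Random hill climbing over the pose: keep a perturbed pose only if it scores better."""
    rng = random.Random(seed)
    pose: Pose = (0.0, 0.0, 0.0)  # rough initial positioning
    best = score(transform(ligand, *pose), site)
    for _ in range(steps):
        trial: Pose = (pose[0] + rng.gauss(0, 0.1),
                       pose[1] + rng.gauss(0, 0.1),
                       pose[2] + rng.gauss(0, 0.1))
        trial_score = score(transform(ligand, *trial), site)
        if trial_score > best:
            pose, best = trial, trial_score
    return pose, best


# Hypothetical three-atom ligand; the site is the same shape, rotated and shifted.
ligand = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
site = transform(ligand, math.pi / 2, 2.0, -1.0)

start_score = score(ligand, site)
pose, best = dock(ligand, site)
print("start score:", start_score)
print("best score:", best)
```

 Real docking programs search in 3D, also consider ligand flexibility, and
 score poses with physically motivated functions; only the search-and-score
 loop carries over from this sketch.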
Genetic Algorithms

 Create a “population” of possible solutions,
 encoded as “chromosomes”
 Use “fitness function” to score solutions
 Good solutions are combined together
 (“crossover”) and altered (“mutation”) to
 provide new solutions
 The process repeats until the population
 “converges” on a solution
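 The steps above can be sketched on a deliberately simple problem: each
 chromosome is a list of bits and the fitness is the number of 1 bits. The
 problem, the parameter values, and the stopping rule (a perfect score or a
 generation limit, rather than a formal convergence test) are assumptions
 chosen for brevity. A docking application would instead encode a ligand
 pose in the chromosome and use a docking score as the fitness function.

```python
import random
from typing import List

Chromosome = List[int]


def fitness(chromosome: Chromosome) -> int:
    """Fitness function: count of 1 bits (higher is better)."""
    return sum(chromosome)


def crossover(a: Chromosome, b: Chromosome, rng: random.Random) -> Chromosome:
    """Single-point crossover: head of one parent joined to tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chromosome: Chromosome, rng: random.Random, rate: float = 0.02) -> Chromosome:
    """Mutation: flip each bit independently with a small probability."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


def evolve(length: int = 20, pop_size: int = 30,
           generations: int = 200, seed: int = 1) -> Chromosome:
    rng = random.Random(seed)
    # Create a random initial population of candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == length:
            break  # a perfect solution has been found
        # Keep the better half as parents; they survive unchanged.
        parents = population[: pop_size // 2]
        # Combine and alter good solutions to refill the population.
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```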
Traditional Workflow of Molecular Modeling

 Diagram labels: Researcher; Chemical Concepts; Experiments;
 FORTRAN Code, Scripts, Visualization Code; Hard Drive Directory
 Jungle; Supercomputer
 Highly inefficient workflow (no automation)
 Knowledge is human bound (grad student leaves and project dies)
 Incorporation with other DB’s is done in Researcher’s head
Varuna – a new environment for molecular modeling

 Diagram labels: Concepts; Experiments; Researcher; Chem-Grid;
 Simulation Service (FORTRAN Code); DB Service (Queries, Clustering,
 Curation, etc.); Reaction DB; QM Database; PubChem, PDB, NCI, etc.;
 Database; Supercomputer
Tools for mining the data

 Tripos Benchware HTS Dataminer (formerly SAR Navigator),
