OGF21-EchemistryPierce.ppt - Semantic Grid

					E-Chemistry and Web 2.0

          Marlon Pierce
     mpierce@cs.indiana.edu
      Community Grids Lab
        Indiana University


        One Talk, Two Projects
 NIH-funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) @ IU.
    Geoffrey Fox
    Gary Wiggins
    Rajarshi Guha
    David Wild
    Mookie Baik
    Kevin Gilbert
    And others
 Proposed Microsoft-funded project: E-Chemistry
    Carl Lagoze (Cornell)
    Lee Giles (PSU)
    Steve Bryant (NIH)
    Jeremy Frey (Soton)
    Peter Murray-Rust (Cambridge)
    Herbert Van de Sompel (Los Alamos)
    Geoffrey Fox (Indiana)
    And others
          CICC Infrastructure Vision
 Chemical Informatics: drug discovery and other academic chemistry,
  pharmacology, and bioinformatics research will be aided by powerful,
  modern, open, information technology.
    NIH PubChem and PubMed provide unprecedented open, free data and
     information.
    We need a corresponding open service architecture (i.e. avoid stove-piped
     applications)
    CICC set up as distributed cyberinfrastructure in eScience model
 Web clients (user interfaces) to distributed databases, results of high
  throughput screening instruments, results of computational chemical
  simulations and other analyses.
    Composed of clients to open service APIs (mash-ups)
    Aggregated into portals
 Web services manipulate this data and are combined into workflows.
 So our main agenda items: create interesting databases and build lots of
  Web services and clients.
            CICC Databases
Most of our databases aim to add value to
 PubChem or link into PubChem
  1D (SMILES) and 2D structures
3D structures (MMFF94)
  Searchable by CID, SMARTS, 3D similarity
Docked ligands (FRED, Autodock)
  906K drug-like compounds into 7 ligands
  Will eventually cover ~2000 targets
Philosophy: we have big computers, so let's
 calculate everything ahead of time and put the
 results in a DB.
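
The "calculate everything ahead of time" approach can be sketched with a small lookup table. This is only an illustration, not the CICC schema: the table name `docking`, its columns, the compound IDs, target names, and scores are all invented, and the assumption that a lower score means a better dock is a convention chosen for the example.

```python
import sqlite3

def build_db(rows):
    """Create an in-memory table of precomputed results keyed by compound ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docking (cid INTEGER, target TEXT, score REAL, "
        "PRIMARY KEY (cid, target))"
    )
    conn.executemany("INSERT INTO docking VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def best_hits(conn, target, n):
    """Return the n best-scoring compounds for a target (lower = better here)."""
    cur = conn.execute(
        "SELECT cid, score FROM docking WHERE target = ? "
        "ORDER BY score ASC LIMIT ?",
        (target, n),
    )
    return cur.fetchall()

# Invented rows standing in for results computed offline on a cluster.
ROWS = [
    (1001, "targetA", -7.2),
    (1002, "targetA", -9.1),
    (1003, "targetA", -5.4),
    (1001, "targetB", -6.0),
]

if __name__ == "__main__":
    print(best_hits(build_db(ROWS), "targetA", 2))
```

The expensive step happens once, offline; a service in front of the table only has to run a cheap query.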
  Building Up the Infrastructure
Our SOA philosophy: use standard Web services.
  Mostly stateless
  Some cluster, HPC work needed but these populate
   databases
Services are aggregate-able into different
 workflows.
  Taverna, Pipeline Pilot, …
You can also build lots of Web clients.
See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources
 for links and details.
Not so far from Web 2.0….
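
Why stateless services aggregate easily into workflows can be shown with a toy example. The three steps below are hypothetical stand-ins for remote service calls (they are not CICC services), and `compose` is a helper invented for this sketch. Each step takes a plain dictionary and returns a new one, so steps can be chained in any order that makes sense.

```python
from functools import reduce

def compose(*steps):
    """Chain stateless steps left to right; each takes and returns a dict."""
    def run(payload):
        return reduce(lambda data, step: step(data), steps, payload)
    return run

# Hypothetical steps; in a real workflow each would wrap a Web service call.
def strip_whitespace(data):
    return {**data, "smiles": [s.strip() for s in data["smiles"]]}

def drop_duplicates(data):
    seen, unique = set(), []
    for s in data["smiles"]:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return {**data, "smiles": unique}

def add_count(data):
    return {**data, "count": len(data["smiles"])}

workflow = compose(strip_whitespace, drop_duplicates, add_count)

if __name__ == "__main__":
    print(workflow({"smiles": [" CCO", "CCO ", "c1ccccc1"]}))
```

Because no step keeps state between calls, the same steps can be rearranged or reused in another workflow tool without changes.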
                   Sample Services
Type                    Service                       Functionality                             Source                      License
Database                Docking                       Provides access to the results of docking Indiana University          Freely accessible
                                                      a subset of PubChem into a set of
                                                      ligands. Searchable by 2D structure and
                                                      docking score
Database                3D Structure                  Provides access to 3D structure generated Indiana University          Freely accessible
                                                      for most of PubChem
Cheminformatics         OSCAR3                        Extract chemical structures from text     Cambridge University        Freely accessible
Cheminformatics         InChiGoogle                   Uses Google to search for an InChI        Cambridge University        Freely accessible
Cheminformatics         CMLRSSServer                  Generates a CMLRSS feed from CML data     Cambridge University        Freely accessible
Cheminformatics         OpenBabel                     Converts chemical file formats            Cambridge University        Freely accessible
Type                    Service                       Functionality                             Source                      License
Cheminformatics         ToxTreeServer                 Obtains toxicity hazard predictions       Indiana University &        Freely accessible
                                                                                                European Chemical Bureau
Cheminformatics         DBUtil                        Generates 166 bit MACCS keys              Indiana University &        Freely accessible
                                                                                                gNova Consulting
Cheminformatics         Molecular Similarity          Evaluates 2D/3D similarity and evaluates  Indiana University & CDK    Freely accessible
                                                      distance moments for 3D similarity
                                                      calculations
Cheminformatics         Molecular Descriptors         Generates various descriptors including   Indiana University & CDK    Freely accessible
                                                      TPSA, XLogP, surface areas
Cheminformatics         2D Structure Diagrams         Generates 2D structure diagrams from      Indiana University & CDK    Freely accessible
                                                      SMILES
Cheminformatics         Druglikeness Methods          Evaluates measures of druglikeness        Indiana University & CDK    Freely accessible
Cheminformatics         Utility Methods               Generates hashed fingerprints, 2D         Indiana University & CDK    Freely accessible
                                                      coordinate generation etc.
Type                    Service                       Functionality                             Source                      License
Statistics              Sampling Distributions        Samples from several distributions        Indiana University          Freely accessible
                                                      (normal, uniform, Weibull etc)
Statistics              Linear Regression             Builds linear regression models           Indiana University          Freely accessible
Statistics              CNN Regression                Builds neural network regression models   Indiana University          Freely accessible
Statistics              RF Regression                 Builds random forest regression models    Indiana University          Freely accessible
Statistics              LDA                           Builds linear discriminant analysis       Indiana University          Freely accessible
                                                      models
Statistics              K-Means                       Performs K-means clustering               Indiana University          Freely accessible
Statistics              Feature Selection             Performs feature selection using stepwise Indiana University          Freely accessible
                                                      regression
Statistics              XY Plots                      Generates 2D scatter plots                Indiana University          Freely accessible
Statistics              Histogram Plots               Generates histograms                      Indiana University          Freely accessible
Type                    Service                       Functionality                             Source                      License
Data Exchange           TabToVOTables                 Converts tab delimited files to VOTables  Indiana University          Freely accessible
Data Exchange           VOTablesToTab                 Converts VOTables to tab delimited files  Indiana University          Freely accessible
Data Exchange           VOTablesToXLS                 Converts VOTables to Excel spreadsheet    Indiana University          Freely accessible
Data Exchange           VOTable Retrieve              Retrieves field names and data types from Indiana University          Freely accessible
                                                      a VOTables document
Data Exchange           VOTableExtract                Extracts columns from a VOTables document Indiana University          Freely accessible
Computational Chemistry Varuna File Format            Handles file formats for QM/MM packages   Indiana University          Freely accessible
Computational Chemistry Varuna Analysis               Performs analysis of results from Jaguar  Indiana University          Freely accessible
                                                      and ADF
Computational Chemistry Varuna Query                  Searches the Varuna database              Indiana University          Freely accessible
Computational Chemistry Varuna Submit                 Submits input data for calculation on a   Indiana University          Freely accessible
                                                      local cluster
Type                    Service                       Functionality                             Source                      License
Application             Fred                          Performs docking                          Openeye Software            Commercial
Application             Filter                        Property calculation and filtering        Openeye Software            Commercial
Application             Omega                         Generates 3D conformers                   Openeye Software            Commercial
Application             BCI Fingerprint               Generates 1052 BCI structural keys        Digital Chemistry           Commercial
Application             BCI Clustering                Performs divisive k-means clustering      Digital Chemistry           Commercial
Application             PkCell                        Evaluates pharmacokinetic parameters for  Indiana University &        Freely accessible
                                                      druglike molecules                        University of Michigan
Application             Scripps MLSCN Toxicity        Gets toxicity predictions for RF models   Indiana University &        Freely accessible
                                                      built using MLSCN cell-line data          Scripps, FL.
Application             NTP DTP Anti-cancer activity  Gets anti-cancer activity predictions for Indiana University          Freely accessible
                                                      the 60 NCI cell lines
Application             Ames Mutagenicity             Gets mutagenicity predictions             Indiana University          Freely accessible
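
To make the Data Exchange entries above more concrete, here is a minimal sketch of a tab-delimited-to-VOTable conversion. It is not the TabToVOTables service itself: the function name `tab_to_votable` is invented, the sample columns are made up, and every column is declared as a variable-length `char` field for simplicity rather than having its type inferred.

```python
import csv
import io
import xml.etree.ElementTree as ET

def tab_to_votable(text):
    """Convert tab-delimited text (first row = column names) to VOTable XML."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, records = rows[0], rows[1:]

    votable = ET.Element("VOTABLE", version="1.1")
    table = ET.SubElement(ET.SubElement(votable, "RESOURCE"), "TABLE")

    # Assumption: treat every column as a string field.
    for name in header:
        ET.SubElement(table, "FIELD", name=name, datatype="char", arraysize="*")

    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for record in records:
        tr = ET.SubElement(tabledata, "TR")
        for cell in record:
            ET.SubElement(tr, "TD").text = cell
    return ET.tostring(votable, encoding="unicode")

if __name__ == "__main__":
    print(tab_to_votable("cid\tscore\n1001\t-7.2\n1002\t-9.1\n"))
```

A production converter would also need to detect numeric columns and emit `int` or `double` datatypes so that plotting clients can use the values directly.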


                 Web Client Interfaces
Name                           Functionality                           Type                    Links
PubDock                        Interface to the docking database       Web                     http://www.chembiogrid.org/cheminfo/dock/
Pub3D                          Interface to the 3D structure database  Web                     http://www.chembiogrid.org/cheminfo/p3d/
Frequent Hitters               Identify compounds that occur in        Web                     http://www.chembiogrid.org/cheminfo/freqhit/fh
                               multiple assays, with links to
                               individual assays
MLSCN Toxicity Predictions     Predict whether a compound will be      Web and Pipeline Pilot  http://www.chembiogrid.org/cheminfo/rws/scripps
                               toxic or not
ToxTree                        Predict toxicity hazard class           Web                     http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html#tox
DTP Anti-Cancer Predictions    Predict whether a compound exhibits     Web                     http://www.chembiogrid.org/cheminfo/ncidtp/dtp
                               anti-cancer activity against the 60 NCI
                               cell lines
                          More Clients…
Name                           Functionality                           Type                    Links
Ames Mutagenicity Predictions  Predict whether a compound is mutagenic Web                     http://www.chembiogrid.org/cheminfo/rws/ames
                               or not in the Ames test
PkCell                         Evaluate pharmacokinetic parameters     Web                     http://www.chembiogrid.org/cheminfo/pkcell/
Kemo                           Natural language interface to PubChem   Web                     http://cheminfo.informatics.indiana.edu:8080/kemo/
RSS Feeds                      Generate RSS feeds for various          Web and RSS feed        http://www.chembiogrid.org/cheminfo/rssint.html
                               PubChem related queries
Statistical Model Download     Download statistical models as R binary Web                     http://www.chembiogrid.org/cheminfo/rws/mlist
                               files
Cheminformatics                Miscellaneous functions such as         Web                     http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html
                               structure diagrams, similarity etc.
                           More Clients…
Name                           Functionality                           Type                    Links
Varuna                         File operations and result analysis     Web                     http://129.79.139.29/filecon/Default.aspx and
                                                                                               http://129.79.139.29/utilityclient/Default.aspx
VOTables                       Plotting data using VOTables as well as Web                     http://gf1.ucs.indiana.edu:9080/axis/VOTables.html and
                               using Excel files via VOTables                                  http://www.chembiogrid.org/cheminfo/rws/xlsvor
PubChemSR                      .Net interface to PubChem               Desktop application     http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR/
rpubchem and rcdk              R packages to interface with the CDK    Desktop application     http://cran.r-project.org/src/contrib/Descriptions/rcdk.html and
                               and access PubChem                                              http://cran.r-project.org/src/contrib/Descriptions/rpubchem.html
Chimera plugin                 A plugin to allow Chimera to utilize    Desktop application     http://poincare.uits.iupui.edu/~heiland/cicc/code/
                               the PubDock database                    (requires Chimera)
PubChem 3D View                A Greasemonkey script that shows 3D     Web (requires Firefox   http://rna.informatics.indiana.edu/hgopalak/3DStructView.user.js
                               structures when viewing PubChem pages   and Greasemonkey)
                Example: PubDock
 Database of approximately 1
  million PubChem structures (the
  most drug-like) docked into
  proteins taken from the PDB
 Available as a web service, so
  structures can be accessed in
  your own programs, or using
  workflow tools like Pipeline Pilot
 Several interfaces developed,
  including one based on Chimera
  (right) which integrates the
  database with the PDB to allow
  browsing of compounds in
  different targets, or different
  compounds in the same target
 Can be used as a tool to help
  understand molecular basis of
  activity in cellular or image
  based assays

        Example: R Statistics applied to
               PubChem data
 By exposing the R statistical package, and the Chemistry Development Kit
  (CDK) toolkit as web services and integrating them with PubChem, we can
  quickly and easily perform statistical analysis and virtual screening of
  PubChem assay data
 Predictive models for particular screens are exposed as web services, and
  can be used either as simple web tools or integrated into other applications
 Example uses DTP Tumor Cell Line screens - a predictive model using
  Random Forests in R makes predictions of probability of activity across
  multiple cell lines.




    Example assay screening workflow: finding cell-protein relationships
 A protein implicated in tumor growth with known ligand is selected (in this
  case HSP90 taken from the PDB 1Y4 complex)
 The screening data from a cellular HTS assay is similarity searched for
  compounds with similar 2D structures to the ligand.
 Similar structures to the ligand can be browsed using client portlets.
 Similar structures are filtered for drugability, are converted to 3D, and are
  automatically passed to the OpenEye FRED docking program for docking into
  the target protein.
 Once docking is complete, the user visualizes the high-scoring docked
  structures in a portlet using the JMOL applet.
 Docking results and activity patterns fed into R services for building of
  activity models and correlations
    Least Squares Regression
    Random Forests
    Neural Nets
        Relevance to Web 2.0
Some Web 2.0 Key Features
  REST Services
  Use of RSS/Atom feeds
  Client interfaces are “mashups”
  Gadgets, widgets for portals aggregate clients
So…
  We provide RSS as an alternative WS format.
  We have experimented with RSS feeds, using Yahoo
   Pipes to manipulate multiple feeds.
  CICC Web interfaces can be easily wrapped as
   universal gadgets in iGoogle, Netvibes.
     Alternative to classic science gateways.
    RSS Feeds/REST Services
Provide access to DB's via RSS feeds
Feeds include 2D/3D structures in CML
Viewable in Bioclipse, Jmol as well as Sage etc.
Two feeds currently available
  SynSearch – get structures based on full or partial
   chemical names
  DockSearch – get best N structures for a target
Really hampered by size of DB and Postgres
 performance.
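
A client for such a feed only needs an XML parser. The sketch below reads an RSS document whose items carry CML molecules and lists the element types of each molecule. The feed content is invented for illustration and is not output from SynSearch or DockSearch; the function name `molecules_in_feed` is likewise hypothetical. The only fixed detail is the CML namespace URI.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Invented sample; a real feed would be fetched over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:cml="http://www.xml-cml.org/schema">
  <channel>
    <title>Example structure feed</title>
    <item>
      <title>ethanol</title>
      <cml:molecule id="m1">
        <cml:atomArray>
          <cml:atom id="a1" elementType="C"/>
          <cml:atom id="a2" elementType="C"/>
          <cml:atom id="a3" elementType="O"/>
        </cml:atomArray>
      </cml:molecule>
    </item>
  </channel>
</rss>"""

def molecules_in_feed(feed_xml):
    """Return (item title, molecule id, element types) for each CML molecule."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        for mol in item.iter("{%s}molecule" % CML_NS):
            elements = [
                atom.get("elementType")
                for atom in mol.iter("{%s}atom" % CML_NS)
            ]
            results.append((title, mol.get("id"), elements))
    return results

if __name__ == "__main__":
    for entry in molecules_in_feed(SAMPLE_FEED):
        print(entry)
```

Because the structure travels inside the feed item, any RSS-aware tool can poll the feed while chemistry-aware viewers pick out the CML payload.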
Tools and mashups based on web service infrastructure
      http://www.chembiogrid.org/projects/proj_tools.html
    Mining information from journal
                articles
 Until now SciFinder / CAS only chemistry-aware portal
  into journal information
 We can access full text of journal articles online (with
  subscription)
 ACS does not make full text available … but there are
  ways round that!
 RSC is now marking up with SMILES and GO/Goldbook
  terms!
    www.projectprospect.org
 Having SMILES or InChI means that we can build a
  similarity/structure searchable database of papers: e.g.
  “find me all the papers published since 2000 which
  contain a structure with >90% similarity to this one”
 In the absence of full text, we can at least use the abstract
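
The ">90% similarity" query above can be sketched once each structure has a fingerprint. The slide does not name a similarity measure; the Tanimoto coefficient used here is an assumption (it is a common choice for fingerprint comparison). The paper IDs, years, and fingerprints are invented, and fingerprints are represented simply as sets of "on" bit positions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def similar_papers(query_fp, index, threshold=0.9, since=2000):
    """Find papers from `since` onward containing a structure above threshold.

    `index` maps paper id -> (publication year, list of fingerprints).
    """
    hits = []
    for paper_id, (year, fingerprints) in index.items():
        if year < since:
            continue
        best = max((tanimoto(query_fp, fp) for fp in fingerprints), default=0.0)
        if best > threshold:
            hits.append((paper_id, round(best, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Invented data: bit positions stand in for real structural keys.
QUERY = frozenset(range(20))
INDEX = {
    "paper-A": (2004, [frozenset(range(19))]),
    "paper-B": (2006, [frozenset(range(10))]),
    "paper-C": (2001, [frozenset(range(20)) | {99}]),
    "paper-D": (1998, [frozenset(range(20))]),
}

if __name__ == "__main__":
    print(similar_papers(QUERY, INDEX))
```

A linear scan like this is fine for a sketch; a database of all published structures would need an index or pre-screening step to stay responsive.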
         Text Mining: OSCAR
 A tool for shallow, chemistry-specific natural
  language parsing of chemical documents (e.g. journal
  articles).
 It identifies (or attempts to identify):
    Chemical names: singular nouns, plurals, verbs etc., also
     formulae and acronyms.
    Chemical data: Spectra, melting/boiling point, yield etc. in
     experimental sections.
    Other entities: Things like N(5)-C(3) and so on.
 Part of the larger SciBorg effort
    See http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
 http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
Mash-Up: What published compounds might bind to this protein?
 Create a database containing the text of all recent PubMed abstracts
  (2006-2007 = ~500,000)
 Use OSCAR to extract all of the chemical names referred to in the
  abstracts and convert to SMILES
 DATABASE SERVICE + DOCKING SERVICE: convert molecules to 3D and dock
  into a protein of interest
 Visualize top docked molecules in a Google-like interface
E-Chemistry and Digital Libraries
We can't wait to get started….
 E-Chemistry and Digital Libraries
Key problem with our SOA-based e-Science is
 information management.
  Where is the service that I need?
  What does it do?
We may consider our data-centric services to be
 digital libraries.
Data is diverse
  Documents
  Not just computational information like structures.
Another point of view: how can I link together
 publications, results, workflows, etc?
  That is, I need to manage digital documents.
                  Digital Libraries
 Open Archives Initiative Object Reuse and Exchange
  Project (OAI-ORE)
 Developing standardized, interoperable, and machine-
  readable mechanisms to express information about
  compound information objects on the web.
 Graph-based representations of connected digital objects.
 Objects may be encoded in (for example) RDF or XML.
 Retrievable via repositories with REST service interfaces
  (cf. Atom Publishing Protocol)
    Obtain, harvest, and register
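
The graph-based view of a compound object can be sketched with plain subject-predicate-object triples. The two predicates come from the OAI-ORE vocabulary (a resource map `describes` an aggregation, which `aggregates` its parts); everything else, including the example.org URIs and the helper names, is invented for illustration. A real implementation would use an RDF library and serialize to RDF/XML or Atom.

```python
ORE = "http://www.openarchives.org/ore/terms/"

def build_resource_map(rem_uri, aggregation_uri, resource_uris):
    """Express a compound object as (subject, predicate, object) triples."""
    triples = [(rem_uri, ORE + "describes", aggregation_uri)]
    triples += [
        (aggregation_uri, ORE + "aggregates", uri) for uri in resource_uris
    ]
    return triples

def aggregated_resources(triples, aggregation_uri):
    """List the resources that an aggregation links together."""
    return [
        obj for subj, pred, obj in triples
        if subj == aggregation_uri and pred == ORE + "aggregates"
    ]

# Hypothetical URIs: a paper, the workflow behind it, and its result data.
REM = "http://example.org/rem/1"
AGG = "http://example.org/aggregation/1"
PARTS = [
    "http://example.org/papers/conference-paper.pdf",
    "http://example.org/workflows/docking-workflow.xml",
    "http://example.org/results/docking-scores.csv",
]

if __name__ == "__main__":
    for triple in build_resource_map(REM, AGG, PARTS):
        print(triple)
```

This is the sense in which a publication, the workflow that produced it, and the resulting data can be linked as one retrievable object.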



   Challenges for E-Chemistry
Can digital library principles be applied to data as
 well as documents?
  Can you link your workflow to your conference paper?
Can we engineer a publishing framework and
 message formats around Web 2.0 principles?
  REST, Atom Publishing Protocol, Atom Syndication
   Format, JSON, Microformats
Can we do this securely?
  Access control, provenance, identity federation are key
   problems.

Institution   Project Focus

Cambridge        Retrospective Data Extraction
                 Searching and Indexing
                 Data Models/Ontologies
                 Tools and Applications

Cornell          Data Models
                 Interoperability infrastructure
                 Project Management
                 Publicity and outreach

Indiana          Infrastructure Integration
                 Trust and Provenance
                 Tools and Applications

LANL             Data Models
                 Interoperability infrastructure

PubChem          Chemical Structure Archive
                 Results of Experimental Biological Activity Testing
                 Cross References to BioMedical Databases

Penn State       Retrospective Data Extraction
                 Searching and Indexing
                 Analysis

Southampton      Prospective & Retrospective Data Provision
                 Tools and Applications
                 In-process capture of eChemistry data
                 Data Linking - in analysis and publication
        More Information
Project Web Site: www.chembiogrid.org
Project Wiki: www.chembiogrid.org/wiki
Contact me: mpierce@cs.indiana.edu




                           Chemical Informatics and Cyberinfrastucture Collaboratory
  CICC                                                  Funded by the National Institutes of Health                                CICC
                                                                 www.chembiogrid.org


        CICC Combines Grid Computing with Chemical Informatics

Large Scale Computing Challenges
 Chemical Informatics is a non-traditional area of high performance
  computing, but many new, challenging problems may be investigated.
 Chemical informatics text analysis programs can process 100,000's of
  abstracts of online journal articles to extract chemical signatures of
  potential drugs.
 OSCAR-mined molecular signatures can be clustered, filtered for toxicity,
  and docked onto larger proteins. These are classic “pleasingly parallel”
  tasks. Top-ranking docked molecules can be further examined for drug
  potential.
 Big Red (and the TeraGrid) will also enable us to perform time consuming,
  multi-stepped Quantum Chemistry calculations on all of PubMed.
 Results go back to public databases that are freely accessible by the
  scientific community.

[Diagram: NIH PubMed DataBase, OSCAR Text Analysis, Cluster Grouping,
 Toxicity Filtering, Docking; Initial 3D Structure Calculation, Molecular
 Mechanics Calculations, Quantum Mechanics Calculations; NIH PubChem
 DataBase; POVRay Parallel Rendering; IU’s Varuna DataBase]

Science and Cyberinfrastructure
 CICC is an NIH funded project to support chemical informatics needs of
  High Throughput Cancer Screening Centers. The NIH is creating a data
  deluge of publicly available data on potential new drugs.
 CICC supports the NIH mission by combining state of the art chemical
  informatics techniques with
   World class high performance computing
   National-scale computing resources (TeraGrid)
   Internet-standard web services
   International activities for service orchestration
   Open distributed computing infrastructure for scientists world wide

                        Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories
         MLSCN Post-HTS Biology Decision Support
PROCESS / CHEMINFORMATICS
 Percent Inhibition or IC50 data is retrieved from HTS
 Question: Was this screen successful?
   Workflows encoding plate & control well statistics, distribution
    analysis, etc
 Question: What should the active/inactive cutoffs be?
   Workflows encoding distribution analysis of screening results
 Question: What can we learn about the target protein or cell line from
  this screen?
   Workflows encoding statistical comparison of results to similar
    screens, docking of compounds into proteins to correlate binding with
    activity, literature search of active compounds, etc
 Compounds submitted to PubChem
GRIDS
 Grids can link data analysis (e.g. image processing developed in existing
  Grids), traditional Cheminformatics tools, as well as annotation tools
  (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis
 A Grid of Grids linking collections of services at
   PubChem
   ECCR centers
   MLSCN centers
R Web Services




                 34
                Why?
Need access to math and stat
 functionality
Did not want to recode algorithms
Wanted latest methods
Needed a distributed approach to
 computation
  Keep computation on a powerful machine
  Access it from a smaller machine

                                            35
              Why R?
Free, open-source
Many cutting-edge methods available
Flexible programming language
Interfaces with many languages
  Python
  Perl
  Java
  C

                                      36
           The R Server
R can be run as a remote compute
 server
  Requires the rserve package
Allows authenticated access over
 TCP/IP
Connections can maintain state
Client libraries for Java & C

                                    37
       R as a Web Service
On its own the R server is not a web
 service
We provide Java frontends to specific
 functionalities
The frontend classes are hosted in a
 Tomcat web container
Accessible via SOAP
Full Javadocs for all available WS's
                                         38
Flowchart




            39
              Functionality
Two classes of functionality
  General functions
     Allows you to supply data and build a
      predictive model
     Sample from various distributions
     Obtain scatter plots and histograms
     Model development functions use a Java front-
      end to encapsulate model specific information


                                                  40
              Functionality
Two classes of functionality
  Model deployment
     Allows you to build a model outside of the
      infrastructure
     Place the final model in the infrastructure
     Becomes available as a web service
     Each model deployed requires its own front
      end class
     In general, these classes are identical - could
      be autogenerated

                                                        41
     Available Functionality
Predictive models - OLS, RF, CNN,
 LDA
Clustering - k-means
Statistical distributions
XY plot and scatter plots
Model deployment for single model
 types and ensemble model types

                                     42
        Deployed Models
Since deployed models are visible as
 web services we can build a simple web
 front end for them
Examples
  NCI anti-cancer predictions
  Ames mutagenicity predictions



                                      43
              Applications
The R WS is not restricted to 'atomic'
 functionality
Can write a whole R program
  Load it on the R compute server
  Provide a Java WS frontend
Examples
  Feature selection
  Automated model generation
  Pharmacokinetic parameter calculation


                                           44
        Data Input/Output
Most modeling applications require data
 matrices
Depending on client language we can
 use
  SOAP array of arrays (2D matrices)
  SOAP array (1D vector form of a 2D
   matrix)
  VOTables
                                        45
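The "1D vector form of a 2D matrix" option can be illustrated with a small sketch. It assumes row-major packing and that the matrix dimensions travel alongside the vector; neither detail is stated on the slide, and the function names flatten and unflatten are made up for this example. (R itself stores matrices column-major, so a real service may well pack the data differently.)

```python
from typing import List, Tuple

Matrix = List[List[float]]


def flatten(matrix: Matrix) -> Tuple[List[float], int, int]:
    """Pack a rectangular 2D matrix into a row-major 1D vector plus its shape.

    Row-major order is an assumption made for this sketch only.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("all rows must have the same length")
    vector = [value for row in matrix for value in row]
    return vector, n_rows, n_cols


def unflatten(vector: List[float], n_rows: int, n_cols: int) -> Matrix:
    """Rebuild the 2D matrix from the row-major vector and its shape."""
    if len(vector) != n_rows * n_cols:
        raise ValueError("vector length does not match the given shape")
    return [vector[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    vec, rows, cols = flatten(data)
    print(vec, rows, cols)
    print(unflatten(vec, rows, cols) == data)
```

Sending the shape with the vector is what lets a client whose SOAP toolkit lacks nested arrays still round-trip a data matrix.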
        Data Input/Output
Some R web services can take a URL
 to a VOTables document
  Conversion to R or Java matrices is done
   by a local VOTables Java library
R also has basic support for VOTables
 directly
  Ignores binary data streams


                                              46
    Interacting With R WS's
Traditional WS's do not maintain state
Predictive models are different
  A model is built at one time
  May be used for prediction at another time
  Need to maintain state
State is maintained by serialization to R
 binary files on the compute server
Clients deal with model IDs
                                            47
    Interacting with R WS's
Protocol
  Send data to model WS
  Get back model ID
  Get various information via model ID
    Fitted values
    Training statistics
    New predictions



                                          48
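The model-ID protocol described on the two slides above can be sketched as a toy, in-process stand-in. This is not the CICC service: the class ModelService, its method names, the pickle file format and the one-variable least-squares fit are all assumptions chosen to keep the example short. The real services serialize R binary files on the compute server and are reached over SOAP.

```python
import pickle
import tempfile
import uuid
from pathlib import Path
from typing import Any, Dict, List


class ModelService:
    """Hypothetical stand-in for a stateful model web service."""

    def __init__(self, store_dir: Path) -> None:
        self.store_dir = store_dir
        self.store_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, model_id: str) -> Path:
        return self.store_dir / f"{model_id}.pkl"

    def build_model(self, x: List[float], y: List[float]) -> str:
        """Fit y = intercept + slope * x, persist the model, return its ID."""
        n = len(x)
        if n < 2 or n != len(y):
            raise ValueError("need at least two (x, y) pairs of equal length")
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            raise ValueError("x values must not all be identical")
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        model = {
            "slope": slope,
            "intercept": intercept,
            "fitted": [intercept + slope * xi for xi in x],
        }
        model_id = uuid.uuid4().hex
        # State lives on disk, so a later call can find the model by ID.
        with self._path(model_id).open("wb") as fh:
            pickle.dump(model, fh)
        return model_id

    def _load(self, model_id: str) -> Dict[str, Any]:
        with self._path(model_id).open("rb") as fh:
            return pickle.load(fh)

    def fitted_values(self, model_id: str) -> List[float]:
        return list(self._load(model_id)["fitted"])

    def predict(self, model_id: str, new_x: List[float]) -> List[float]:
        model = self._load(model_id)
        return [model["intercept"] + model["slope"] * xi for xi in new_x]


if __name__ == "__main__":
    service = ModelService(Path(tempfile.mkdtemp()))
    model_id = service.build_model([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
    print(model_id)
    print(service.fitted_values(model_id))
    print(service.predict(model_id, [4.0]))
```

Because the client holds only the ID, each request stays stateless at the transport level while the model itself persists between the build and predict calls.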
 Cheminformatics at Indiana
University School of Informatics


                  David J. Wild
               djwild@indiana.edu

   Associate Director of Chemical Informatics &
               Assistant Professor
     Indiana University School of Informatics,
                   Bloomington

                http://djwild.info                49
Cheminformatics education at
         Indiana
 M.S. in Chemical Informatics
    2 years, 36 semester hours
    Includes a 6-hour capstone / research project
    Opportunity to work in Laboratory Informatics (IUPUI) or
     closely with Bioinformatics (IUB)
    Currently 9 students enrolled
 Ph.D. in Informatics, Cheminformatics Specialty
    90 credit hours, including 30 hours dissertation research.
     Usually 4 years.
    Research rotations expose students to research in related
     areas
    Currently 4 students enrolled
 Graduate Certificate
    4 courses, all available by Distance Education               50
      Distance Education for
         Cheminformatics
Uses Breeze + teleconference for live sharing
 of classes: all that is required is a P.C. and a
 telephone. Optional Polycom
 videoconferencing.
Lectures are recorded for easy playback
 through a web browser
Wiki or similar webpage for dissemination of
 course materials
Also participate in CIC courseshare to give
 class at University of Michigan
Of 75 students taking our courses since fall
                                              51
 Current research in the Wild
             lab
Integration of cheminformatics tools and data
 sources
  A web service infrastructure for cheminformatics
  Compound information & aggregation web service
   and interface (“by the way box”)
  An enhanced chatbot for exploiting chemical
   information & web services
  Semantically-aware workflow tools for
   cheminformatics
  Data mining the NIH DTP tumor cell line database
  PubDock: a docking database for PubChem
                                                  52
 Current research in the Guha
             lab
Predictive Modeling
  Interpretation, validation, domain applicability
  Generalization to other 'models' such as docking,
   pharmacophore etc
  Integration of multiple data types
  Addressing imbalanced and noisy datasets
Analysis of Chemical Spaces
  Quantify distributions in spaces
  Investigation of density approaches
  Applications to lead hopping, model domains
Methods to summarize & compare data
  Applications to HTS and smaller lead series type
                                                       53
     Cheminformatics web service infrastructure
Database Services
  PostgreSQL + gNova
  PubChem mirror (augmented)
  Pub3D - 3D structures for PubChem
  PubDock - Bound 3D structures
  Compound-indexed journal article DB
  NIH Human Tumor Cell Line
Cheminformatics services
  Docking (FRED)
  3D structure generation (OMEGA)
  Filtering (FRED, etc)
  OSCAR3
  Fingerprints (BCI, CDK)
  Clustering (BCI)
  Toxicity prediction (ToxTree)
  R-based predictive models
  Similarity calculations (CDK)
  Descriptor calculation (CDK)
  2D structure diagrams (CDK)
Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy Heiland, Jungkee Kim,
 Marlon E. Pierce, Geoffrey C. Fox and David J. Wild, Web service
 infrastructure for chemoinformatics, Journal of Chemical Information and
 Modeling, 2007; 47(4) pp 1303-1307
                                                                      54
     Local PubChem mirror
 RSC Project Prospect - what
       can we do with the
            information?
www.projectprospect.org
>100 papers marked up with SMILES/InChI
 (using OSCAR3), plus Gene Ontology and
 Goldbook Ontology terms
Created similarity searchable PostgreSQL /
 gNova database with paper DOIs, SMILES,
 and ontology terms
Web service and simple HTML interfaces for
 searching … “which papers reference
 compounds similar to this one in the scope of
 these ontological terms?”
                                             55
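The query quoted above ("which papers reference compounds similar to this one in the scope of these ontological terms?") can be sketched in a few lines. Everything here is assumed for illustration: fingerprints are plain sets of bit positions, similarity is the Tanimoto coefficient, and the DOIs and term labels are placeholders. The slide only says that the database is similarity searchable via PostgreSQL / gNova and does not name the metric or schema.

```python
from typing import Any, Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_papers(
    query_fp: Set[int],
    query_terms: Set[str],
    papers: List[Dict[str, Any]],
    threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Return (doi, similarity) for papers that share at least one ontology
    term with the query and contain a sufficiently similar compound."""
    hits = []
    for paper in papers:
        if not (query_terms & paper["terms"]):
            continue  # outside the requested ontological scope
        score = tanimoto(query_fp, paper["fingerprint"])
        if score >= threshold:
            hits.append((paper["doi"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Placeholder records; real rows would come from the paper database.
    papers = [
        {"doi": "10.0000/example.1", "fingerprint": {1, 2, 3, 4}, "terms": {"term:A"}},
        {"doi": "10.0000/example.2", "fingerprint": {1, 2, 3, 4}, "terms": {"term:B"}},
        {"doi": "10.0000/example.3", "fingerprint": {7, 8}, "terms": {"term:A"}},
    ]
    print(similar_papers({1, 2, 3}, {"term:A"}, papers))
```

In a database-backed service the filter and the similarity test would normally be pushed into the SQL query rather than looped over in client code.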
Greasemonkey / OSCAR
       script




http://cheminfo.informatics.indiana.edu:8080/ChemGM/index.jsp

                                                                56
     By the way… annotation (mock-up!)

By the way…
 This compound is very similar to a prescription drug, Tamoxifen.
 This compound is referenced in 20 journal articles published in the
  last 5 years
 Similar compounds are associated with the words “toxic” and “death” in
  280 web pages
 It appears to be covered under 3 patents
 It has been shown to be active in 5 screens
 Computer models predict it to show some activity against 8 protein
  targets

Here are some comments on this compound:
 David Wild: don't take any notice of the computational models - they
  are rubbish




                                                          57
     Cheminformatics aware simple lab notebook (mock up!)
Plug-in allows structures to be drawn with the pen and cleaned up
Free text input can be converted to machine readable form by electrovaya
Automatic detection of data fields (yield, etc) where possible
Web service interface provides access to computation and searching. Page
 is marked up by what is possible
[Mock-up notebook page: “Some useful chemical reactions”; “Iodoacetate a
 Iodoacetamide I-CH4COO- ICH2CONH2”; hand-drawn reaction scheme; “FIND INFO
 ABOUT THIS REACTION”; “This may also react, chem favored by alkaline pH ….”]
                                                                                                                 58
     Automatic workflow generation and natural language queries
Develop service ontology using OWL-S or similar language
  Allows service interoperability, replacement and input/output
   compatibility
We can then use generic reasoning and network analysis tools to find paths
 from inputs to desired outputs
Natural language can be parsed to inputs and desired outputs
Smart Clients <--> Agents <--> Services
Possible “supercharged life science Google?”
[Diagram: services (2D structure crawler, 2D similarity, 2D -> 3D, 3D
 search, P’phore search, dock) linked by their data types (2D structures,
 3D structures, 3D protein structure, 3D structures & complexes, result);
 annotations: “2D structures are compounds”, “3D structures are compounds”,
 “dock = bind”]
                                                    59
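Finding "paths from inputs to desired outputs" can be sketched as a breadth-first search over services described by their input and output data types. This is a simplification under stated assumptions: each service has exactly one input and one output type, the service names are invented, and no OWL-S reasoning is involved. Only the data-type labels are taken from the slide's diagram.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# (service name, input data type, output data type); names are hypothetical.
Service = Tuple[str, str, str]


def find_workflow(
    services: List[Service], start: str, goal: str
) -> Optional[List[str]]:
    """Return the shortest chain of service names that turns data of type
    `start` into data of type `goal`, or None if no chain exists."""
    if start == goal:
        return []
    by_input: Dict[str, List[Tuple[str, str]]] = {}
    for name, in_type, out_type in services:
        by_input.setdefault(in_type, []).append((name, out_type))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        data_type, path = queue.popleft()
        for name, out_type in by_input.get(data_type, []):
            if out_type == goal:
                return path + [name]
            if out_type not in seen:
                seen.add(out_type)
                queue.append((out_type, path + [name]))
    return None


if __name__ == "__main__":
    services = [
        ("2d_similarity", "2D structures", "2D structures"),
        ("2d_to_3d", "2D structures", "3D structures"),
        ("3d_search", "3D structures", "3D structures"),
        ("dock", "3D structures", "3D structures & complexes"),
    ]
    print(find_workflow(services, "2D structures", "3D structures & complexes"))
```

A fuller version would let services take several inputs (docking also needs a 3D protein structure) and would use the ontology to decide when two differently named types are compatible.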

				