
									E-Chemistry and Web 2.0
Marlon Pierce mpierce@cs.indiana.edu Community Grids Lab Indiana University


One Talk, Two Projects
• NIH-funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) @ IU
  - Geoffrey Fox
  - Gary Wiggins
  - Rajarshi Guha
  - David Wild
  - Mookie Baik
  - Kevin Gilbert
  - And others

• Proposed Microsoft-funded project: E-Chemistry
  - Carl Lagoze (Cornell)
  - Lee Giles (PSU)
  - Steve Bryant (NIH)
  - Jeremy Frey (Soton)
  - Peter Murray-Rust (Cambridge)
  - Herbert Van de Sompel (Los Alamos)
  - Geoffrey Fox (Indiana)
  - And others

CICC Infrastructure Vision
• Chemical informatics: drug discovery and other academic chemistry, pharmacology, and bioinformatics research will be aided by powerful, modern, open information technology.
  - NIH PubChem and PubMed provide unprecedented open, free data and information.
  - We need a corresponding open service architecture (i.e., avoid stove-piped applications).
  - CICC is set up as distributed cyberinfrastructure in the eScience model.

• Web clients (user interfaces) to distributed databases, results of high-throughput screening instruments, and results of computational chemical simulations and other analyses.
  - Composed of clients to open service APIs (mash-ups)
  - Aggregated into portals

• Web services manipulate this data and are combined into workflows.
• So our main agenda items: create interesting databases and build lots of Web services and clients.

CICC Databases
Most of our databases aim to add value to PubChem or link into PubChem
1D (SMILES) and 2D structures

3D structures (MMFF94)
Searchable by CID, SMARTS, 3D similarity

Docked ligands (FRED, Autodock)
906K drug-like compounds into 7 ligands
Will eventually cover ~2000 targets

Philosophy: we have big computers, so let's calculate everything ahead of time and put the results in a DB.

Building Up the Infrastructure
Our SOA philosophy: use standard Web services.
• Mostly stateless
• Some cluster and HPC work is needed, but these jobs populate databases

Services are aggregate-able into different workflows.
• Taverna, Pipeline Pilot, …

You can also build lots of Web clients. See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources for links and details. Not so far from Web 2.0…

Sample Services
Type | Service | Functionality | Source | License
Database | Docking | Provides access to the results of docking a subset of PubChem into a set of ligands. Searchable by 2D structure and docking score | Indiana University | Freely accessible
Database | 3D Structure | Provides access to 3D structure generated for most of PubChem | Indiana University | Freely accessible
Cheminformatics | (not recoverable) | Extract chemical structures from text | Cambridge University | Freely accessible
Cheminformatics | (not recoverable) | Uses Google to search for an InChI | Cambridge University | Freely accessible
Cheminformatics | CMLRSSServer | Generates a CMLRSS feed from CML data | Cambridge University | Freely accessible
Cheminformatics | OpenBabel | Converts chemical file formats | Cambridge University | Freely accessible
Cheminformatics | (not recoverable) | Obtains toxicity hazard predictions | Indiana University & European Chemical Bureau | Freely accessible
Cheminformatics | (not recoverable) | Generates 166 bit MACCS keys | Indiana University & gNova Consulting | Freely accessible
Cheminformatics | Molecular Similarity | Evaluates 2D/3D similarity and evaluates distance moments for 3D similarity calculations | Indiana University & CDK | Freely accessible
Cheminformatics | Molecular Descriptors | Generates various descriptors including TPSA, XLogP, surface areas | Indiana University & CDK | Freely accessible
Cheminformatics | 2D Structure Diagrams | Generates 2D structure diagrams from SMILES | Indiana University & CDK | Freely accessible
Cheminformatics | Druglikeness Methods | Evaluates measures of druglikeness | Indiana University & CDK | Freely accessible
Cheminformatics | Utility Methods | Generates hashed fingerprints, 2D coordinate generation etc. | Indiana University & CDK | Freely accessible
Statistics | Sampling Distributions | Samples from several distributions (normal, uniform, Weibull etc) | Indiana University | Freely accessible
Statistics | Linear Regression | Builds linear regression models | Indiana University | Freely accessible
Statistics | CNN Regression | Builds neural network regression models | Indiana University | Freely accessible
Statistics | RF Regression | Builds random forest regression models | Indiana University | Freely accessible
Statistics | LDA | Builds linear discriminant analysis models | Indiana University | Freely accessible
Statistics | K-Means | Performs K-means clustering | Indiana University | Freely accessible
Statistics | Feature Selection | Performs feature selection using stepwise regression | Indiana University | Freely accessible
Statistics | XY Plots | Generates 2D scatter plots | Indiana University | Freely accessible
Statistics | Histogram Plots | Generates histograms | Indiana University | Freely accessible
Data Exchange | (not recoverable) | Converts tab delimited files to VOTables | Indiana University | Freely accessible
Data Exchange | (not recoverable) | Converts VOTables to tab delimited files | Indiana University | Freely accessible
Data Exchange | (not recoverable) | Converts VOTables to Excel spreadsheet | Indiana University | Freely accessible
Data Exchange | VOTable Retrieve | Retrieves field names and data types from a VOTables document | Indiana University | Freely accessible
Data Exchange | VOTableExtract | Extracts columns from a VOTables document | Indiana University | Freely accessible
Computational Chemistry | Varuna File Format | Handles file formats for QM/MM packages | Indiana University | Freely accessible
Computational Chemistry | Varuna Analysis | Performs analysis of results from Jaguar and ADF | Indiana University | Freely accessible
Computational Chemistry | Varuna Query | Searches the Varuna database | Indiana University | Freely accessible
Computational Chemistry | Varuna Submit | Submits input data for calculation on a local cluster | Indiana University | Freely accessible
Application | Fred | Performs docking | Openeye Software | Commercial
Application | Filter | Property calculation and filtering | Openeye Software | Commercial
Application | Omega | Generates 3D conformers | Openeye Software | Commercial
Application | BCI Fingerprint | Generates 1052 BCI structural keys | Digital Chemistry | Commercial
Application | BCI Clustering | Performs divisive k-means clustering | Digital Chemistry | Commercial
(not recoverable) | (not recoverable) | Evaluates pharmacokinetic parameters for druglike molecules | Indiana University & University of Michigan | Freely accessible
(not recoverable) | Scripps MLSCN Toxicity | Gets toxicity predictions for RF models built using MLSCN cell-line data | Indiana University & Scripps, FL | Freely accessible
(not recoverable) | NTP DTP Anticancer activity | Gets anti-cancer activity predictions for the 60 NCI cell lines | Indiana University | Freely accessible
(not recoverable) | Ames Mutagenicity | Gets mutagenicity predictions | Indiana University | Freely accessible

Web Client Interfaces
Name | Functionality | Type | Links
PubDock | Interface to the docking database | Web | http://www.chembiogrid.org/cheminfo/dock/
Pub3D | Interface to the 3D structure database | Web | http://www.chembiogrid.org/cheminfo/p3d/
Frequent Hitters | Identify compounds that occur in multiple assays, with links to individual assays | (not recoverable) | http://www.chembiogrid.org/cheminfo/freqhit/fh
MLSCN Toxicity Predictions | Predict whether a compound will be toxic or not | Web and Pipeline Pilot | http://www.chembiogrid.org/cheminfo/rws/scripps
ToxTree | Predict toxicity hazard class | Web | http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html#tox
DTP Anti Cancer Predictions | Predict whether a compound exhibits anti-cancer activity against the 60 NCI cell lines | Web | http://www.chembiogrid.org/cheminfo/ncidtp/dtp


More Clients…
Name | Functionality | Type | Links
Ames Mutagenicity Predictions | Predict whether a compound is mutagenic or not in the Ames test | Web | http://www.chembiogrid.org/cheminfo/rws/ames
PkCell | Evaluate pharmacokinetic parameters | (not recoverable) | http://www.chembiogrid.org/cheminfo/pkcell/
(not recoverable) | Natural language interface to PubChem | (not recoverable) | http://cheminfo.informatics.indiana.edu:8080/kemo/
RSS Feeds | Generate RSS feeds for various PubChem related queries | Web and RSS feed | http://www.chembiogrid.org/cheminfo/rssint.html
Statistical Model Download | Download statistical models as R binary files | (not recoverable) | http://www.chembiogrid.org/cheminfo/rws/mlist
Cheminformatics | Miscellaneous functions such as structure diagrams, similarity etc. | (not recoverable) | http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html



More Clients…
Name | Functionality | Type | Links
Varuna | File operations and result analysis | Web | aspx and fault.aspx (truncated in source)
(not recoverable) | Plotting data using VOTables as well as using Excel files via VOTables | Web | http://gf1.ucs.indiana.edu:9080/axis/VOTables.html and http://www.chembiogrid.org/cheminfo/rws/xlsvor
(not recoverable) | .Net interface to PubChem | Desktop application | http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR/
rpubchem and rcdk | R packages to interface with the CDK and access PubChem | Desktop application | http://cran.r-project.org/src/contrib/Descriptions/rcdk.html and http://cran.r-project.org/src/contrib/Descriptions/rpubchem.html
Chimera plugin | A plugin to allow Chimera to utilize the PubDock database | Desktop application (requires Chimera) | http://poincare.uits.iupui.edu/~heiland/cicc/code/
PubChem 3D View | A Greasemonkey script that shows 3D structures when viewing PubChem pages | Web (requires Firefox and Greasemonkey) | http://rna.informatics.indiana.edu/hgopalak/3DStructView.user.js


Example: PubDock
• Database of approximately 1 million PubChem structures (the most drug-like) docked into proteins taken from the PDB
• Available as a web service, so structures can be accessed in your own programs or through workflow tools like Pipeline Pilot
• Several interfaces developed, including one based on Chimera (right), which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target
• Can be used as a tool to help understand the molecular basis of activity in cellular or image-based assays

Example: R Statistics applied to PubChem data
• By exposing the R statistical package and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data
• Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications
• Example uses DTP Tumor Cell Line screens: a predictive model using Random Forests in R makes predictions of probability of activity across multiple cell lines


Example assay screening workflow: finding cell-protein relationships
• A protein implicated in tumor growth with a known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex).
• The screening data from a cellular HTS assay is similarity searched for compounds with 2D structures similar to the ligand.
• Structures similar to the ligand can be browsed using client portlets.
• Similar structures are filtered for druggability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein.
• Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.
• Docking results and activity patterns are fed into R services for building activity models and correlations (Least Squares Regression, Random Forests, Neural Nets).


Relevance to Web 2.0
Some Web 2.0 Key Features
• REST services
• Use of RSS/Atom feeds
• Client interfaces are “mashups”
• Gadgets and widgets for portals aggregate clients

We provide RSS as an alternative WS format. We have experimented with RSS feeds, using Yahoo Pipes to manipulate multiple feeds. CICC Web interfaces can be easily wrapped as universal gadgets in iGoogle, Netvibes.
Alternative to classic science gateways.

RSS Feeds/REST Services
• Provide access to DBs via RSS feeds
• Feeds include 2D/3D structures in CML
• Viewable in Bioclipse and Jmol, as well as Sage etc.
• Two feeds currently available
  - SynSearch: get structures based on full or partial chemical names
  - DockSearch: get best N structures for a target

Really hampered by size of DB and Postgres performance.
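
A client for such a feed might look like the minimal sketch below. The feed content is a made-up sample, not output from SynSearch or DockSearch; it assumes that each RSS item carries an embedded CML molecule in the CML namespace, and the function name read_feed is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed. The assumption is that each <item> embeds a CML
# <molecule>; the real CICC feeds may be laid out differently.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample structure feed</title>
    <item>
      <title>compound-1</title>
      <description>sample entry</description>
      <cml:molecule xmlns:cml="http://www.xml-cml.org/schema" id="m1">
        <cml:atomArray>
          <cml:atom id="a1" elementType="C"/>
          <cml:atom id="a2" elementType="O"/>
        </cml:atomArray>
      </cml:molecule>
    </item>
  </channel>
</rss>"""

CML_NS = {"cml": "http://www.xml-cml.org/schema"}


def read_feed(xml_text):
    """Return (item title, list of element symbols) for every feed item."""
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title")
        atoms = item.findall(".//cml:atom", CML_NS)
        results.append((title, [a.get("elementType") for a in atoms]))
    return results


if __name__ == "__main__":
    for title, elements in read_feed(SAMPLE_FEED):
        print(title, elements)
```

Because the payload is ordinary XML inside ordinary RSS, any feed reader can list the items while a chemistry-aware client pulls out the structures.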

Tools and mashups based on web service infrastructure



Mining information from journal articles
• Until now, SciFinder / CAS has been the only chemistry-aware portal into journal information
• We can access full text of journal articles online (with subscription)
• ACS does not make full text available … but there are ways round that!
• RSC is now marking up with SMILES and GO/Goldbook terms!
  - www.projectprospect.org
• Having SMILES or InChI means that we can build a similarity/structure searchable database of papers: e.g. “find me all the papers published since 2000 which contain a structure with >90% similarity to this one”
• In the absence of full text, we can at least use the abstract
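
The ">90% similarity" query can be sketched as follows, assuming that each structure has already been reduced to a fingerprint (a set of "on" bit positions) and that similarity is measured with the Tanimoto coefficient. The slides do not say which measure is used, and the paper identifiers and fingerprints here are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_papers(query_fp, papers, threshold=0.9, since=2000):
    """Return (paper id, score) pairs, best first.

    papers is an iterable of (paper_id, year, fingerprint) tuples; this
    record layout is an assumption made for the sketch.
    """
    hits = []
    for paper_id, year, fp in papers:
        if year < since:
            continue
        score = tanimoto(query_fp, fp)
        if score > threshold:
            hits.append((paper_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    query = set(range(20))
    papers = [
        ("paper-A", 2005, set(range(19))),   # 19 of 20 bits shared
        ("paper-B", 2006, set(range(10))),   # only half the bits shared
        ("paper-C", 1998, set(range(20))),   # identical but too old
    ]
    print(similar_papers(query, papers))
```

A production system would push the comparison into the database (the slides mention PostgreSQL with gNova) rather than scanning in Python.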

Text Mining: OSCAR
• A tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g. journal articles).
• It identifies (or attempts to identify):
  - Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
  - Chemical data: spectra, melting/boiling point, yield etc. in experimental sections.
  - Other entities: things like N(5)-C(3) and so on.
• Part of the larger SciBorg effort
  - See http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
• http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3


Mash-Up: What published compounds might bind to this protein?
• Create a database containing the text of all recent PubMed abstracts (2006-2007 = ~500,000)
• Use OSCAR to extract all of the chemical names referred to in the abstracts and convert to SMILES
• Convert molecules to 3D and dock into a protein of interest
• Visualize top docked molecules in a Google-like interface

E-Chemistry and Digital Libraries
We can't wait to get started…


E-Chemistry and Digital Libraries
• A key problem with our SOA-based e-Science is information management.
  - Where is the service that I need?
  - What does it do?
• We may consider our data-centric services to be digital libraries.
• Data is diverse
  - Documents
  - Not just computational information like structures.
• Another point of view: how can I link together publications, results, workflows, etc.?
  - That is, I need to manage digital documents.

Digital Libraries
• Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE)
• Developing standardized, interoperable, and machine-readable mechanisms to express information about compound information objects on the web.
• Graph-based representations of connected digital objects.
• Objects may be encoded in (for example) RDF or XML.
• Retrievable via repositories with REST service interfaces (cf. Atom Publishing Protocol)
  - Obtain, harvest, and register
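
A graph-based description of a compound object can be pictured as a set of subject/predicate/object triples. The sketch below is only an illustration: it borrows the `describes` and `aggregates` term URIs from the ORE vocabulary as later published, the example.org resource URIs are invented, and the helper names are hypothetical.

```python
# Term URIs from the ORE vocabulary as later published; the slides do not
# fix a vocabulary, so treat these as an assumption.
ORE_DESCRIBES = "http://www.openarchives.org/ore/terms/describes"
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"


def build_resource_map(map_uri, aggregation_uri, resource_uris):
    """Return triples: one map describing one aggregation of resources."""
    triples = [(map_uri, ORE_DESCRIBES, aggregation_uri)]
    for uri in resource_uris:
        triples.append((aggregation_uri, ORE_AGGREGATES, uri))
    return triples


def aggregated_resources(triples, aggregation_uri):
    """List every resource that the given aggregation pulls together."""
    return [o for s, p, o in triples
            if s == aggregation_uri and p == ORE_AGGREGATES]


if __name__ == "__main__":
    # Invented URIs for a paper, the workflow behind it, and its results.
    triples = build_resource_map(
        "http://example.org/rem/1",
        "http://example.org/aggregation/1",
        ["http://example.org/paper.pdf",
         "http://example.org/workflow.xml",
         "http://example.org/docking-results.cml"],
    )
    for triple in triples:
        print(triple)
```

The same triples could be serialized as RDF/XML or carried in an Atom document; the graph is the model and the encoding is a separate choice.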



Challenges for E-Chemistry
• Can digital library principles be applied to data as well as documents?
  - Can you link your workflow to your conference paper?
• Can we engineer a publishing framework and message formats around Web 2.0 principles?
  - REST, Atom Publishing Protocol, Atom Syndication Format, JSON, Microformats
• Can we do this securely?
  - Access control, provenance, and identity federation are key problems.
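
As one illustration of linking a workflow to a paper with the Atom Syndication Format, the sketch below builds a single Atom entry whose link elements point at both. The identifiers, URLs and the choice of `rel` values are assumptions made for the example; it is not a proposed E-Chemistry message format.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


def make_entry(entry_id, title, updated, links):
    """Serialize a minimal Atom entry; links is a list of (rel, href)."""
    ET.register_namespace("", ATOM)
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}id" % ATOM).text = entry_id
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    ET.SubElement(entry, "{%s}updated" % ATOM).text = updated
    for rel, href in links:
        ET.SubElement(entry, "{%s}link" % ATOM, rel=rel, href=href)
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    # Invented identifiers: a conference paper and the workflow behind it.
    print(make_entry(
        "urn:example:entry-1",
        "Docking study (sample entry)",
        "2007-01-01T00:00:00Z",
        [("alternate", "http://example.org/paper.pdf"),
         ("related", "http://example.org/workflow.xml")],
    ))
```

An entry like this can be POSTed to an Atom Publishing Protocol collection, which is where access control and provenance would have to be enforced.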

Institution / Project Focus
Institutions named: Cambridge, Penn State (other institution rows not recoverable)
Project focus areas, in source order:
• Retrospective Data Extraction
• Searching and Indexing
• Data Models/Ontologies
• Tools and Applications
• Data Models
• Interoperability infrastructure
• Project Management
• Publicity and outreach
• Infrastructure Integration
• Trust and Provenance
• Tools and Applications
• Data Models
• Interoperability infrastructure
• Chemical Structure Archive
• Results of Experimental Biological Activity Testing
• Cross References to BioMedical Databases
• Retrospective Data Extraction
• Searching and Indexing
• Analysis
• Prospective & Retrospective Data Provision
• Tools and Applications
• In-process capture of eChemistry data
• Data Linking in analysis and publication


More Information
Project Web Site: www.chembiogrid.org
Project Wiki: www.chembiogrid.org/wiki
Contact me: mpierce@cs.indiana.edu




Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health www.chembiogrid.org


CICC Combines Grid Computing with Chemical Informatics

Large Scale Computing Challenges
Chemical informatics is a non-traditional area of high performance computing, but many new, challenging problems may be investigated.
[Figure labels: NIH PubMed DataBase, OSCAR Text Analysis, Cluster Grouping, Toxicity Filtering]

Science and Cyberinfrastructure
CICC is an NIH funded project to support chemical informatics needs of High Throughput Cancer Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs.


Chemical informatics text analysis programs can process 100,000s of abstracts of online journal articles to extract chemical signatures of potential drugs.

Initial 3D Structure Calculation

Molecular Mechanics Calculations

OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further examined for drug potential.

Quantum Mechanics Calculations

NIH PubChem DataBase

POVRay Parallel Rendering

IU’s Varuna DataBase

Big Red (and the TeraGrid) will also enable us to perform time consuming, multi-stepped Quantum Chemistry calculations on all of PubMed. Results go back to public databases that are freely accessible by the scientific community.

CICC supports the NIH mission by combining state of the art chemical informatics techniques with
• World class high performance computing
• National-scale computing resources (TeraGrid)
• Internet-standard web services
• International activities for service orchestration
• Open distributed computing infrastructure for scientists world wide

Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories

MLSCN Post-HTS Biology Decision Support
Percent inhibition or IC50 data is retrieved from HTS.

Question: Was this screen successful?
Workflows encoding plate & control well statistics, distribution analysis, etc.

Question: What should the active/inactive cutoffs be?
Workflows encoding distribution analysis of screening results

Question: What can we learn about the target protein or cell line from this screen?
Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.

Compounds submitted to PubChem

Grids can link data analysis (e.g., image processing developed in existing Grids), traditional cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis.
A Grid of Grids linking collections of services at PubChem, ECCR centers, and MLSCN centers.
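
For the "Was this screen successful?" question, one widely used control-well statistic is the Z'-factor, Z' = 1 - 3(sd_pos + sd_neg) / |mean_pos - mean_neg|. The slides do not say which statistics the CICC workflows encode, so the sketch below is only an example of the kind of check such a workflow could run. The control values are invented, and the 0.5 threshold is a common rule of thumb, not a CICC setting.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor of a plate from its positive and negative control wells."""
    spread = stdev(positive_controls) + stdev(negative_controls)
    window = abs(mean(positive_controls) - mean(negative_controls))
    return 1.0 - 3.0 * spread / window


def plate_passes(positive_controls, negative_controls, threshold=0.5):
    """True when the plate's Z'-factor exceeds the chosen threshold."""
    return z_prime(positive_controls, negative_controls) > threshold


if __name__ == "__main__":
    # Invented percent-inhibition readings for one plate's control wells.
    pos = [98.0, 100.0, 102.0]
    neg = [0.0, 2.0, 4.0]
    print(round(z_prime(pos, neg), 3), plate_passes(pos, neg))
```

Plates that fail such a check would typically be flagged before any active/inactive cutoff is applied to their compound wells.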




R Web Services


• Need access to math and stat functionality
• Did not want to recode algorithms
• Wanted latest methods
• Needed a distributed approach to computation
  - Keep computation on a powerful machine
  - Access it from a smaller machine

Why R?
• Free, open-source
• Many cutting-edge methods available
• Flexible programming language
• Interfaces with many languages
  - Python
  - Perl
  - Java
  - C

The R Server
• R can be run as a remote compute server
  - Requires the Rserve package
• Allows authenticated access over TCP/IP
• Connections can maintain state
• Client libraries for Java & C

R as a Web Service
• On its own the R server is not a web service
• We provide Java frontends to specific functionalities
• The frontend classes are hosted in a Tomcat web container
• Accessible via SOAP
• Full Javadocs for all available WSs



Two classes of functionality
General functions
• Allows you to supply data and build a predictive model
• Sample from various distributions
• Obtain scatter plots and histograms
• Model development functions use a Java frontend to encapsulate model-specific information


Two classes of functionality
Model deployment
• Allows you to build a model outside of the infrastructure
• Place the final model in the infrastructure
• Becomes available as a web service
• Each model deployed requires its own front end class
• In general, these classes are identical and could be autogenerated

Available Functionality
• Predictive models: OLS, RF, CNN, LDA
• Clustering: k-means
• Statistical distributions
• XY plot and scatter plots
• Model deployment for single model types and ensemble model types

Deployed Models
• Since deployed models are visible as web services, we can build a simple web front end for them
• Examples
  - NCI anti-cancer predictions
  - Ames mutagenicity predictions


• The R WS is not restricted to 'atomic' functionality
• Can write a whole R program
  - Load it on the R compute server
  - Provide a Java WS frontend
• Examples
  - Feature selection
  - Automated model generation
  - Pharmacokinetic parameter calculation

Data Input/Output
• Most modeling applications require data matrices
• Depending on client language we can use
  - SOAP array of arrays (2D matrices)
  - SOAP array (1D vector form of a 2D matrix)
  - VOTables
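
The "1D vector form of a 2D matrix" option can be sketched as below. The slides do not state the element order, so row-major order and the explicit row and column counts are assumptions. R itself stores matrices column by column, so a real client would have to agree on the order with the service.

```python
def flatten(matrix):
    """Flatten a rectangular list-of-lists into (values, nrow, ncol).

    Row-major order is an assumption made for this sketch.
    """
    nrow = len(matrix)
    ncol = len(matrix[0]) if matrix else 0
    if any(len(row) != ncol for row in matrix):
        raise ValueError("matrix rows must all have the same length")
    values = [value for row in matrix for value in row]
    return values, nrow, ncol


def unflatten(values, nrow, ncol):
    """Rebuild the 2D matrix from its flat form and dimensions."""
    if len(values) != nrow * ncol:
        raise ValueError("value count does not match the dimensions")
    return [values[i * ncol:(i + 1) * ncol] for i in range(nrow)]


if __name__ == "__main__":
    descriptors = [[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]]
    flat, nrow, ncol = flatten(descriptors)
    print(flat, nrow, ncol)
    print(unflatten(flat, nrow, ncol))
```

The flat form suits client languages whose SOAP toolkits handle nested arrays poorly, at the cost of sending the dimensions alongside the data.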

Data Input/Output
Some R web services can take a URL to a VOTables document
Conversion to R or Java matrices is done by a local VOTables Java library

R also has basic support for VOTables directly
Ignores binary data streams
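
To make the VOTables exchange format concrete, the sketch below converts tab-delimited text into a minimal VOTable document using the TABLEDATA serialization and reads it back. It is not the CICC Java library. Every column is declared as character data, the XML namespace is omitted for brevity, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def tsv_to_votable(tsv_text):
    """Build a minimal VOTable (TABLEDATA form) from tab-delimited text.

    The first non-empty line is taken as the header row. All fields are
    typed as variable-length char, which is a simplification.
    """
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    votable = ET.Element("VOTABLE", version="1.1")
    table = ET.SubElement(ET.SubElement(votable, "RESOURCE"), "TABLE")
    for name in header:
        ET.SubElement(table, "FIELD", name=name, datatype="char",
                      arraysize="*")
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for line in lines[1:]:
        row = ET.SubElement(tabledata, "TR")
        for cell in line.split("\t"):
            ET.SubElement(row, "TD").text = cell
    return ET.tostring(votable, encoding="unicode")


def votable_to_rows(xml_text):
    """Return (field names, rows of cell strings) from a VOTable."""
    root = ET.fromstring(xml_text)
    names = [field.get("name") for field in root.iter("FIELD")]
    rows = [[td.text or "" for td in tr.iter("TD")]
            for tr in root.iter("TR")]
    return names, rows


if __name__ == "__main__":
    doc = tsv_to_votable("cid\tscore\n1\t-7.5\n2\t-8.1\n")
    print(doc)
    print(votable_to_rows(doc))
```

The field list is what lets a service report column names and data types without reading the data section.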


Interacting With R WSs
• Traditional WSs do not maintain state
• Predictive models are different
  - A model is built at one time
  - May be used for prediction at another time
  - Need to maintain state
• State is maintained by serialization to R binary files on the compute server
• Clients deal with model IDs

Interacting with R WSs
• Send data to model WS
• Get back model ID
• Get various information via model ID
  - Fitted values
  - Training statistics
  - New predictions


Cheminformatics at Indiana University School of Informatics

David J. Wild djwild@indiana.edu Associate Director of Chemical Informatics & Assistant Professor Indiana University School of Informatics, Bloomington http://djwild.info

Cheminformatics education at Indiana
• M.S. in Chemical Informatics
  - 2 years, 36 semester hours
  - Includes a 6-hour capstone / research project
  - Opportunity to work in Laboratory Informatics (IUPUI) or closely with Bioinformatics (IUB)
  - Currently 9 students enrolled

• Ph.D. in Informatics, Cheminformatics Specialty
  - 90 credit hours, including 30 hours dissertation research. Usually 4 years.
  - Research rotations expose students to research in related areas
  - Currently 4 students enrolled

• Graduate Certificate
  - 4 courses, all available by Distance Education

Distance Education for Cheminformatics
• Uses Breeze + teleconference for live sharing of classes: all that is required is a PC and a telephone. Optional Polycom videoconferencing.
• Lectures are recorded for easy playback through a web browser
• Wiki or similar webpage for dissemination of course materials
• Also participate in CIC courseshare to give class at University of Michigan
• Of 75 students taking our courses since fall 51

Current research in the Wild lab
Integration of cheminformatics tools and data sources
• A web service infrastructure for cheminformatics
• Compound information & aggregation web service and interface (“by the way box”)
• An enhanced chatbot for exploiting chemical information & web services
• A semantically-aware workflow tool for cheminformatics
• Data mining the NIH DTP tumor cell line database
• PubDock: a docking database for PubChem

Current research in the Guha lab
Predictive Modeling
• Interpretation, validation, domain applicability
• Generalization to other 'models' such as docking, pharmacophore etc.
• Integration of multiple data types
• Addressing imbalanced and noisy datasets

Analysis of Chemical Spaces
• Quantify distributions in spaces
• Investigation of density approaches
• Applications to lead hopping, model domains

Methods to summarize & compare data
• Applications to HTS and smaller lead series type

Cheminformatics web service infrastructure

Cheminformatics services
• Docking (FRED)
• 3D structure generation (OMEGA)
• Filtering (FRED, etc)
• OSCAR3
• Fingerprints (BCI, CDK)
• Clustering (BCI)
• Toxicity prediction (ToxTree)
• R-based predictive models
• Similarity calculations (CDK)
• Descriptor calculation (CDK)
• 2D structure diagrams (CDK)

Database services
• PostgreSQL + gNova
• PubChem mirror (augmented)
• Pub3D: 3D structures for PubChem
• PubDock: bound 3D structures
• Compound-indexed journal article DB
• NIH Human Tumor Cell Line
• Local PubChem mirror

Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey C. Fox and David J. Wild, Web service infrastructure for chemoinformatics, Journal of Chemical Information and Modeling, 2007; 47(4) pp 1303-1307

RSC Project Prospect: what can we do with the information? www.projectprospect.org
• >100 papers marked up with SMILES/InChI (using OSCAR3), plus Gene Ontology and Goldbook Ontology terms
• Created similarity searchable PostgreSQL / gNova database with paper DOIs, SMILES, and ontology terms
• Web service and simple HTML interfaces for searching … “which papers reference compounds similar to this one in the scope of these ontological terms?”


Greasemonkey / OSCAR script


By the way… annotation (mock-up!)

By the way…

This compound is very similar to a prescription drug, Tamoxifen. This compound is referenced in 20 journal articles published in the last 5 years

Similar compounds are associated with the words “toxic” and “death” in 280 web pages
It appears to be covered under 3 patents It has been shown to be active in 5 screens

Computer models predict it to show some activity against 8 protein targets

Here are some comments on this compound: David Wild: don't take any notice of the computational models - they are rubbish


Cheminformatics aware simple lab notebook (mock up!)
Plug-in allows structures to be drawn with the pen and cleaned up
Some useful chemical reactions Iodoacetate a Iodoacetamide I-CH4COO- ICH2CONH2







This may also react, chem favored by alkaline pH

Free text input can be converted to machine readable form by electrovaya


Web service interface provides access to computation and searching. Page is marked up by what is possible

Automatic detection of data fields (yield, etc) Where possible


Automatic workflow generation and natural language queries
• Develop service ontology using OWL-S or similar language
• Allows service interoperability, replacement and input/output compatibility
• We can then use generic reasoning and network analysis tools to find paths from inputs to desired outputs
• Natural language can be parsed to inputs and desired outputs
• Smart Clients <--> Agents <--> Services
• Possible “supercharged life science Google?”

[Diagram labels. Services: 2D similarity, 2D -> 3D, structure crawler, P'phore search, dock, 3D search. Inputs and outputs: 2D structures, 3D structures, 3D structures & complexes, 3D protein structure, result. Ontology statements: 2D structures are compounds; 3D structures are compounds; dock = bind]
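
Finding a path from available inputs to desired outputs can be sketched as a breadth-first search over services described by their input and output types. The catalogue below is a guess at how the diagram's services connect; it is not an OWL-S ontology, and the service keys and type labels are illustrative.

```python
from collections import deque

# Hypothetical catalogue: service name -> (input type, output type).
# The type signatures are assumptions read off the diagram labels.
SERVICES = {
    "2d_similarity": ("2D structures", "2D structures"),
    "2d_to_3d": ("2D structures", "3D structures"),
    "3d_search": ("3D structures", "3D structures"),
    "dock": ("3D structures", "3D structures & complexes"),
}


def find_workflow(start_type, goal_type, services=SERVICES):
    """Return the shortest chain of services turning start into goal.

    Returns an empty list when start and goal already match, and None
    when no chain of services connects them.
    """
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        current, path = queue.popleft()
        if current == goal_type:
            return path
        for name, (source, target) in services.items():
            if source == current and target not in seen:
                seen.add(target)
                queue.append((target, path + [name]))
    return None


if __name__ == "__main__":
    print(find_workflow("2D structures", "3D structures & complexes"))
```

With a richer ontology the same search could use subclass reasoning (for example, treating both 2D and 3D structures as compounds) to decide which services are compatible.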
