Slide 1 - Indiana University

					Grids for Chemical
      Chemistry, IU Bloomington
            Oct. 21 2005

             Geoffrey Fox
 Computer Science, Informatics, Physics
   Pervasive Technology Laboratories
Indiana University Bloomington IN 47401

          Why are Grids Important
   Grids are important for Chemistry because they support key
    functionalities that grow in importance as we are deluged with
    data from instruments and simulations
   Grids provide information access, storage and management
   Grids manage multiple simulations with different defining
   Grids allow complex workflows with data flowing between
   Grids define models for portals
   Grids are built on top of commodity web service technology
    with broad industry support – the next generation
    information technology
   Grids are used in multiple NIH and other life
    science/chemistry projects across the world (BIRN, caBIG,
    myGrid, Comb-e-Chem )
    Internet Scale Distributed Services
   Grids use Internet technology and are distinguished by
    managing or organizing sets of network connected resources
     • Classic Web allows independent one-to-one access to
       individual resources
     • Grids integrate together and manage multiple Internet-
       connected resources: People, Sensors, computers, data
   Organization can be explicit as in
     • TeraGrid which federates many supercomputers;
     • Deep Web Technologies IR Grid which federates multiple
       data resources;
     • CrisisGrid which federates first responders, commanders,
       sensors, GIS, (Tsunami) simulations, science/public data
   Organization can be implicit as in Internet resources such as
    curated databases and simulation resources that “harmonize a
       Different Visions of the Grid
   Grid just refers to the technologies
     • Or Grids represent the full system/Applications
   DoD’s vision of Network Centric Computing can be considered a
    Grid (linking sensors, warfighters, commanders, backend
    resources) and they are building the GiG (Global Information
   Utility Computing or X-on-demand (X=data, computer ..) is a
    major computer industry interest in Grids and a key part of
    enterprise or campus Grids
   e-Science or Cyberinfrastructure are virtual organization Grids
    supporting global distributed science (note sensors, instruments
    and people are all distributed)
   Skype (Kazaa) VOIP system is a Peer-to-peer Grid (and
    VRVS/GlobalMMCS like Internet A/V conferencing are
    Collaboration Grids)
   Commercial 3G Cell-phones and DoD ad-hoc network initiative
    are forming mobile Grids
           Types of Computing Grids
   Running “Pleasingly Parallel Jobs” as in United Devices, Entropia
    (Desktop Grid) “cycle stealing systems”
   Can be managed (“inside” the enterprise as in Condor) or more
    informal (as in SETI@Home)
   Computing-on-demand in Industry where jobs spawned are
    perhaps very large (SAP, Oracle …)
   Support distributed file systems as in Legion (Avaki), Globus with
    (web-enhanced) UNIX programming paradigm
     • Particle Physics will run some 30,000 simultaneous jobs
   Distributed Simulation HLA/RTI style Grids
   Linking Supercomputers as in TeraGrid
   Pipelined applications linking data/instruments, compute,
   Seamless Access where Grid portals allow one to choose one of
    multiple resources through a common interface
   Parallel Computing typically NOT suited for a Grid (latency)
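
The "pleasingly parallel" case above can be sketched in a few lines. This is a minimal illustration only: `simulate` is a hypothetical stand-in for one independent job, and a local thread pool stands in for the separate machines that a desktop Grid or Condor pool would supply.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(seed: int) -> int:
    """Hypothetical stand-in for one independent job (one parameter setting)."""
    # A single linear congruential step keeps the job deterministic.
    return (1103515245 * seed + 12345) % 2**31


def run_jobs(seeds, workers=4):
    # No job reads another job's output, so the jobs can run in any order
    # and on any worker; pool.map still returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, seeds))


results = run_jobs(range(8))
```

Because the jobs never communicate, network latency between workers does not matter, which is why this style suits a Grid while tightly coupled parallel computing does not.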
[Figure: Old Style Metacomputing Grid. Labels: Analysis and Visualization, Large Disks, Large-Scale Databases, Large Scale Parallel Computers]
         Original: Spread a single large Problem over multiple supercomputers
         Now-1: Control multiple smallish jobs each on independent Computers
         Now-2: Choose which of a few supercomputers to use
Towards an International Compute Grid Infrastructure
[Figure: map of sites. Labels: US TeraGrid, SDSC, Starlight (Chicago), Netherlight, UK NGS, Leeds, Manchester, RAL, local laptops in Seattle and UK. All sites connected by production network (not all shown). Legend: Computation, Steering clients, Network PoP, Service Registry]
     Information/Knowledge Grids
   Distributed (10’s to 1000’s) of data sources (instruments,
    file systems, curated databases …)
   Data Deluge: 1 (now) to 100’s petabytes/year (2012)
    • Moore’s law for Sensors
   Possible filters assigned dynamically (on-demand)
    • Run image processing algorithm on telescope image
    • Run Gene sequencing algorithm on compiled data
   Needs decision support front end with “what-if”
   Metadata (provenance)
    critical to annotate data
   Integrate across experiments
     as in multi-wavelength
Data Deluge comes from pixels/year available
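
The idea of filters assigned on demand, with provenance metadata attached to the result, can be sketched as follows. All names here (`FILTERS`, `register`, `threshold_image`, `apply_filter`) are hypothetical and chosen only for illustration; a real Grid would dispatch to remote services rather than local functions.

```python
FILTERS = {}  # maps a data kind to the filter currently assigned to it


def register(kind):
    """Decorator that assigns a filter function to a kind of data."""
    def wrap(fn):
        FILTERS[kind] = fn
        return fn
    return wrap


@register("image")
def threshold_image(pixels):
    # Hypothetical image-processing step: keep only bright pixels.
    return [p for p in pixels if p > 10]


def apply_filter(kind, data, source):
    """Run the filter assigned to this kind and record where the result came from."""
    fn = FILTERS[kind]
    result = fn(data)
    provenance = {
        "source": source,
        "filter": fn.__name__,
        "n_in": len(data),
        "n_out": len(result),
    }
    return result, provenance


filtered, meta = apply_filter("image", [3, 12, 40, 7], source="telescope-A")
```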
               Data Deluged Science
   Now particle physics will get 100 petabytes from CERN using
    around 30,000 CPU’s simultaneously 24X7
   Exponential growth in data; for comparison:
    •   The Bible = 5 Megabytes
    •   Annual refereed papers = 1 Terabyte
    •   Library of Congress = 20 Terabytes
    •   Internet Archive (1996 – 2002) = 100 Terabytes
   Weather, climate, solid earth (EarthScope)
   Bioinformatics curated databases (Biocomplexity only 1000’s of
    data points at present)
   Virtual Observatory and SkyServer in Astronomy
   Environmental Sensor nets
   In the past, HPCC community worried about data in the form of
    parallel I/O or MPI-IO, but we didn’t consider it as an enabler
    of new science and new ways of computing
   Data assimilation was not central to HPCC
   DoE ASCI set up because didn’t want test data!
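
The size comparisons on this slide can be put on one scale with a few lines of arithmetic. The sketch assumes decimal units (1 Terabyte = 10^12 bytes), which the slide does not specify.

```python
MB, TB, PB = 10**6, 10**12, 10**15  # decimal units (an assumption)

sizes = {
    "The Bible": 5 * MB,
    "Annual refereed papers": 1 * TB,
    "Library of Congress": 20 * TB,
    "Internet Archive (1996-2002)": 100 * TB,
}
cern = 100 * PB  # particle physics data volume quoted on the slide

# How many copies of each collection fit into the CERN data volume.
ratios = {name: cern // size for name, size in sizes.items()}
```

Under these assumptions the 100 petabytes correspond to 5,000 Libraries of Congress, or 1,000 copies of the 1996-2002 Internet Archive.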
      Virtual Observatory Astronomy Grid
              Integrate Experiments
[Figure panels: Radio, Far-Infrared, Visible, Dust Map, Visible + X-ray, Galaxy Density Map]
            International Virtual
            Observatory Alliance
• Reached international agreements on Astronomical
  Data Query Language, VOTable 1.1, UCD 1+,
  Resource Metadata Schema
• Image Access Protocol, Spectral Access Protocol
  and Spectral Data Model, Space-Time Coordinates
  definitions and schema
• Interoperable registries by Jan 2005 (NVO,
  AstroGrid, AVO, JVO) using OAI publishing and
• So each Community of Interest builds data AND
  service standards that build on GS-* and WS-*
                      myGrid Project
• Imminent
  'deluge' of data
• Highly
• Highly complex
  and inter-related
• Convergence of
  data and
The Williams
A                        B   C

A: Identification of
overlapping sequence
B: Characterisation of
nucleotide sequence
C: Characterisation of
protein sequence
    Web services
   Web Services build distributed applications (wrapping existing
    codes and databases) based on the SOA (service oriented
    architecture) principles.
   Web Services interact by exchanging messages in SOAP format
   The contracts for the message exchanges that implement those
    interactions are described via WSDL
[Figure: Devices, Databases and Computational resources behind a service layer. Labels: service logic (BPEL, Java, .NET), message processing (SOAP and WSDL), SOAP messages of the form <env:Envelope> ... <env:Body> </env:Body>]

             A typical Web Service
   In principle, services can be in any language (Fortran .. Java ..
    Perl .. Python) and the interfaces can be method calls, Java RMI
    Messages, CGI Web invocations, totally compiled away (inlining)
   The simplest implementations involve XML messages (SOAP) and
    programs written in net friendly languages like Java and Python
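
A minimal sketch of such an XML message, built with Python's standard library. The envelope namespace is the SOAP 1.2 one; the application namespace `urn:example:chem`, the operation `getMolecularWeight` and its `smiles` parameter are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
APP_NS = "urn:example:chem"  # hypothetical application namespace


def build_request(operation: str, params: dict) -> bytes:
    """Wrap one operation call and its parameters in a SOAP envelope."""
    ET.register_namespace("env", SOAP_ENV)  # serialize with the env: prefix
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


message = build_request("getMolecularWeight", {"smiles": "CCO"})
```

The receiving service only needs an XML parser to read this message, which is what makes the implementation language on either side irrelevant.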

[Figure: a typical Web Service application. Labels: Web Services, Service, Security, Catalog, Payment, Credit Card, Warehouse, Shipping, all joined by WSDL interfaces]
         Two-level Programming I
• The Web Service (Grid) paradigm implicitly assumes a
  two-level Programming Model
• We make a Service (same as a “distributed object” or
  “computer program” running on a remote computer) using
  conventional technologies
   – C++ Java or Fortran Monte Carlo module
   – Data streaming from a sensor or Satellite
   – Specialized (JDBC) database access
• Such services accept and produce data from users files and
• The Grid is built by coordinating such services assuming
  we have solved problem of programming the service
         Two-level Programming II
   The Grid is discussing the composition of distributed
    services with the runtime interfaces to Grid as
    opposed to UNIX pipes/data streams
[Figure: Service1, Service2, Service3 and Service4 linked together]

   Familiar from use of UNIX Shell, PERL or Python
    scripts to produce real applications from core programs
   Such interpretative environments are the single
    processor analog of Grid Programming
   Some projects like GrADS from Rice University are
    looking at integration between service and composition
    levels but dominant effort looks at each level separately
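
The two levels can be sketched on a single processor, in the spirit of the shell-script analogy above. Each function below is a hypothetical stand-in for a service that accepts and returns a message (a dict); `compose` plays the role of the workflow level and knows nothing about how each service is programmed.

```python
from functools import reduce


# Level 1: individual services, written with conventional technologies.
def fetch_data(msg):
    return {**msg, "values": [1.0, 2.0, 3.0, 4.0]}


def filter_data(msg):
    kept = [v for v in msg["values"] if v >= msg["threshold"]]
    return {**msg, "values": kept}


def summarize(msg):
    values = msg["values"]
    return {**msg, "mean": sum(values) / len(values)}


# Level 2: composition, the analog of chaining programs with UNIX pipes.
def compose(*services):
    return lambda msg: reduce(lambda m, service: service(m), services, msg)


workflow = compose(fetch_data, filter_data, summarize)
result = workflow({"threshold": 2.0})
```

On a Grid the calls inside `compose` would become message exchanges with remote services, but the separation between the two levels stays the same.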
[Figure: SERVOGrid. Labels: Repositories, Federated Databases, Database, Sensors, Streaming Data, Field Trip Data, Database Grid, Sensor Grid, Compute Grid, GIS, Discovery Grid, Research, Education, Services, Simulations, Analysis and Visualization, Farm]
Grid of Grids: Research Grid and Education Grid
        SERVOGrid Requirements
   Seamless Access to Data repositories and large scale
   Integration of multiple data sources including sensors,
    databases, file systems with analysis system
    • Including filtered OGSA-DAI (Grid database access)
   Rich meta-data generation and access with
    SERVOGrid specific Schema extending openGIS
    (Geography as a Web service) standards and using
    Semantic Grid
   Portals with component model for user interfaces and
    web control of all capabilities
   Collaboration to support world-wide work
   Basic Grid tools: workflow and notification
   NOT metacomputing
Portal Screen
[Figure: layered architecture for a DoD NCOW Grid and an Earthquake Grid, top to bottom]
   CoI Specific: C2 (JBI CEE etc.) and NCOW-IS Services (DoD NCOW Grid);
    Earthquake Data & Simulation Service (Earthquake Grid)
   Grids: Information Grid, Compute Grid, 6: Collaboration Grid, GIS Grid, Sensor Grid
   7: Portals, 9: Application Services, 10: Policy (ECS)
   8: Data Access/Storage, 4: Discovery, 11: Metadata
   Core Low Level Grid Services: 2: Security, 3: Messaging, 5: Mediation, 1: Management
   Physical Network
       n: Service refers to core services identified by DoD
CoI Community of Interest     GIS Geographical Information System
[Figure: layered architecture for a Chemical Informatics Grid and a BioInformatics Grid, top to bottom]
   Domain Specific: HTS Tools, Calculations, CIS (Chemical Informatics Grid);
    Sequencing Tools, Simulations, BIS (BioInformatics Grid)
   Grids: Information Grid, Compute Grid, 6: Collaboration Grid, MIS Grid, Instrument Grid
   7: Portals, 9: Application Services, 10: Policy
   8: Data Access/Storage, 4: Discovery, 11: Metadata
   Core Low Level Grid Services: 2: Security, 3: Messaging, 5: Workflow, 1: Management
   Physical Network
          M(B,C)IS Molecular (Bio, Chem) Information System
GIS Grid with WMS, WFS, data sources and GML
[Figure: a WMS and a WFS Server issuing SQL queries over data sources labeled Railroads [a-b], Bridge [1-5], River [a-d], Highways and Rivers. A GML sample beside the map shows <name> Northridge2 </name>, <segment> Northridge2, <author> Wald D. J.</author> and the coordinates -118.72,34.243 -118.591,34.176, closed by </gml:LineString> and </gml:featureMember>]
GML becomes CML, CellML, SBML
Electric Power and Natural Gas data from LANL
Interdependent Critical Infrastructure Simulations



[Figure: map viewer controls: FeatureInfo mode, Measure distance mode, Clear Distance, Drag and Drop mode, Refresh to initial map]
Google maps can be integrated with Web Feature Archives to filter and
browse seismic
Archived Web Feature Services and Google Maps
              What is Happening?
   Grid ideas are being developed in (at least) four communities
     • Web Service – W3C, OASIS, (DMTF)
     • Grid Forum (High Performance Computing, e-Science)
     • Enterprise Grid Alliance (Commercial “Grid Forum” with a
       near term focus)
   Service Standards are being debated
   Grid Operational Infrastructure is being deployed
   Grid Architecture and core software being developed
     • Apache has several important projects as do academia; large
       and small companies
   Particular System Services are being developed “centrally” –
    OGSA or GS-* framework for this in GGF; WS-* for
   Lots of fields are setting domain specific standards and building
    domain specific services
   USA started but now Europe is probably in the lead and Asia
    will soon catch USA if momentum (roughly zero for USA)
The Grid and Web Service Institutional Hierarchy

    4: Application or Community of Interest Specific Services
       such as “Run BLAST” or “Look at Houses for sale”

    3: Generally Useful Services and Features (OGSA GS-* and some WS-*; GGF/W3C/….)
       such as “Access a Database” or “Submit a Job” or “Manage Cluster”
       or “Support a Portal” or “Collaborative Visualization”

    2: System Services and Features (WS-* from OASIS/W3C/)
       Handlers like WS-RM, Security, Programming Models like BPEL
       or Registries like UDDI

    1: Container and Run Time (Hosting) Environment (Apache Axis, .NET etc.)

      Must set standards to get interoperability
Location of software for Grid Projects in
     Community Grids Laboratory
   http:// provides Web service
    (and JMS) compliant distributed publish-subscribe
    messaging (software overlay network)
   http:// is a service oriented (Grid)
    collaboration environment (audio-video conferencing)
 is an OGC (open geospatial
    consortium) Geographical Information System (GIS)
    compliant GIS and Sensor Grid (with POLIS center)
 has WS-Context, Extended
    UDDI etc.
   The work is still in progress but NaradaBrokering is
    quite mature
   All software is open source and freely available
                      Project Goals
   Establish Requirements from stakeholders
    • Research
    • Pharmaceutical Industry
    • Government
   Consider educational implications
    • e-Science v Bio/Chem/Molecular Informatics
   Consider other national and international projects to ensure we
    either lead or use best practice
   Design a Grid architecture and staged implementation
   Start pilot projects led by Chemistry/Chemical Informatics
   Evaluate and iterate
   Design and implement ?(Chem, Life Science, Science, Molecular)
    Informatics educational program that will attract students
   Write winning center grant in 2006-7
   Web Services Introduction
• What are “Web Services”?
  – A distributed invocation system built on Grid
    • Independent of platform and programming
    • Built on existing Web standards
  – A service oriented architecture with
    • Interfaces based on Internet protocols
    • Messages in XML (except for binary data
    Web Services Introduction
• A web-based architecture providing for
  interoperability among resources
  – Centralized service registry
  – Solves problems associated with finding, using, and
    combining online resources
• Employ standard Internet protocols for:
  – Communication with resources
  – Automated discovery using centralized registries
• Communicate with devices, people, and each
  other with the protocols and computer
   Service Oriented Architecture
• Goal is to achieve loose coupling among
  interacting software agents
• Define service: a unit of work done by a
  service provider to achieve desired end
  results for a service consumer
• Both provider and consumer are roles
  played by software agents on behalf of
  their owners.
       How does SOA work?
• Two architectural constraints are
  – Small set of simple and ubiquitous interfaces
    to all participating software agents
  – Descriptive messages constrained by an
    extensible schema delivered through the
   Web Services Architectures
• Individual services are registered globally
  – Broken down into individual services with
    inputs and outputs specified
• Services are published
• Services are requested
• Open registry, publishing, and requesting
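
The register, publish and request cycle can be sketched with an in-memory registry. This is an illustration under stated assumptions, not the UDDI API: `ServiceRecord`, `Registry`, the service names and the endpoints are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecord:
    name: str
    endpoint: str
    inputs: tuple   # names of the inputs the service accepts
    outputs: tuple  # names of the outputs the service produces


class Registry:
    """Toy registry: providers publish records, consumers request by output."""

    def __init__(self):
        self._records = {}

    def publish(self, record: ServiceRecord) -> None:
        self._records[record.name] = record

    def request(self, output: str) -> list:
        # Return the names of every service that can produce the given output.
        return sorted(r.name for r in self._records.values() if output in r.outputs)


registry = Registry()
registry.publish(ServiceRecord("weight", "http://example.org/weight", ("smiles",), ("mass",)))
registry.publish(ServiceRecord("convert", "http://example.org/convert", ("molfile",), ("smiles",)))
found = registry.request("smiles")
```

Specifying inputs and outputs for each service is what lets a consumer discover that the output of `convert` can feed `weight` without reading either implementation.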
  Service-Oriented Architecture
• From Curcin et al.
  DDT, 2005,
    Web Services for Science
• Invisible Services, Semantic Web, and
• Easy-to-use tools for any scientist
• High throughput, resource intensive
  computing done for low cost/resources
• Shared community
  – Collaborations between labs and fields
  – Shared data
  – Shared tools
     e-Science and the Grid 1
• e-Science: Major UK Program
  – global collaboration in key areas of science and the
    next generation of infrastructure that will enable it
     • reflects growing importance of international
       laboratories, satellites and sensors and their
       integrated analysis by distributed teams
     • total investment of some £200M over the five-year
       period from 2001 to 2006
• CyberInfrastructure: the analogous US initiative
• Grid Technology: supports e-Science &
      Basic Architectures:
Servlets/CGI and Web Services
[Figure: two stacks side by side. Left: Browser, HTTP GET/POST, Web Server, DB or MPI Appl. Right: GUI, WSDL, Web, DB or MPI Appl.]
  Importance of Web Services
• Building a true science community
• Enabling interoperability between tools
  and the integration of data
• Less time coding, more time for science
• Change the way scientists work by
  achieving new levels of integration
  When To Use Web Services?
• Applications do not have severe restrictions on
  reliability and speed.
• Two or more organizations need to cooperate.
  – One needs to write an application that uses another's service.
• Services can be upgraded independently of
• Services can be easily expressed with simple
  request/response semantics and simple state.
      Web Services Benefits
• Web services provide a clean separation
  between a capability and its user interface.
• Increase in productivity
• Increase in flexibility
• Rapid return on investment
• Integration across multiple applications
    Web Services Advantages
• Output in human- and computer-readable
• I/O formats based on standard Internet
• Resources accessible server to server
  allow automated I/O
• Integration based on specific services: you
  select services or data needed without
  downloading the entire data set
    Web Services Advantages
• Description protocols provide details of
  service provided and interface
• Semantic Web standards increase
• Use a central registry and standardized
  description of services
• Quality and status of the information is
  dynamically available
      Web Services Drawbacks
•   Based on new technologies
•   Time and commitment required to learn
•   Standards still in a state of rapid flux
•   Issues with quality of data, (and for
    chemistry, quantity of open data), security,
    and privacy
 Components of Web Services
• Protocols
  – SOAP
  – WSDL
  – UDDI
• XML as a basis for the protocols
• Ontologies
  – OWL: Web Ontology Language
• Semantic Web
    Components of the Semantic Web
            for Chemistry
• XML – eXtensible Markup Language
• RDF – Resource Description Framework
• RSS – Rich Site Summary
• Dublin Core – allows metadata-based
• OWL – for ontologies
• BPEL4WS – for workflow and web services
     – Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-
   SOAP: Simple Object Access Protocol
• Flexible protocol to communicate
  information between server and server or
  client and server using XML
• Supports Remote Procedure Calls
• Allows layers (security, authentication,
  transactions) over the basic SOAP
   WSDL: Web Service Definition Language
• Describes a service's interface to clients
• Services register themselves with Web
• WSDL describes how to contact and
  interact with services
  – I/O, operations and messages to aid
    interaction with client
                WSDL Overview
• An XML-based Interface Definition Language.
   – You can define the APIs for all of your services in WSDL.
• WSDL docs are broken into five major parts:
   – Data definitions (in XML) for custom types
   – Abstract message definitions (request, response)
   – Organization of messages into “ports” and “operations”
     (classes and methods).
   – Protocol bindings (to SOAP, for example)
   – Service point locations (URLs)
• Some interesting features
   – A single WSDL document can describe several versions of an
   – A single WSDL doc can describe several related services.
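
The five parts listed above can be seen in a skeletal WSDL 1.1 document. The sketch below parses such a skeleton with Python's standard library; the target namespace `urn:example:chem`, the names `ChemPort` and `getMolecularWeight`, and the endpoint URL are hypothetical, and the skeleton omits details a deployable WSDL file would need.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"            # WSDL 1.1 namespace
SOAP_BINDING_NS = "http://schemas.xmlsoap.org/wsdl/soap/"

SKELETON = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
             xmlns:tns="urn:example:chem"
             targetNamespace="urn:example:chem">
  <types/>
  <message name="WeightRequest"/>
  <message name="WeightResponse"/>
  <portType name="ChemPort">
    <operation name="getMolecularWeight">
      <input message="tns:WeightRequest"/>
      <output message="tns:WeightResponse"/>
    </operation>
  </portType>
  <binding name="ChemSoapBinding" type="tns:ChemPort">
    <soap:binding transport="http://schemas.xmlsoap.org/soap/http"/>
  </binding>
  <service name="ChemService">
    <port name="ChemPort" binding="tns:ChemSoapBinding">
      <soap:address location="http://example.org/chem"/>
    </port>
  </service>
</definitions>"""


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split("}", 1)[-1]


root = ET.fromstring(SKELETON)
parts = [local(child.tag) for child in root]          # the top-level sections
operations = [op.get("name") for op in root.iter(f"{{{WSDL_NS}}}operation")]
endpoint = root.find(f".//{{{SOAP_BINDING_NS}}}address").get("location")
```

Here `types` holds the data definitions, `message` the abstract messages, `portType` the operations, `binding` the protocol choice and `service` the URL where the service lives.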
     UDDI: Universal Description,
      Discovery, and Integration
• Provides ways for clients and services to interact
  with other services
• Uses XML
• Defines the means of access, e.g.,
  – URL
  – E-Mail
• Defines services hosted by an entity
• Business-oriented tags
• Uses SOAP for communicating
XML: eXtensible Markup Language
• Allows definitions of types of documents
• Tags are used to specify components of
• Allows specification of namespaces to
  differentiate between identical tag names
• Tag names do not provide semantics other
  than simple hierarchical relations
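
A short sketch of how namespaces keep identical tag names apart. Both namespace URIs and the element names are hypothetical examples.

```python
import xml.etree.ElementTree as ET

DOC = """<record xmlns:chem="urn:example:chemistry"
        xmlns:pub="urn:example:publication">
  <chem:title>Ethanol</chem:title>
  <pub:title>Solvent survey</pub:title>
</record>"""

# Prefix-to-URI map used only for lookups; the URIs are what actually matter.
ns = {"chem": "urn:example:chemistry", "pub": "urn:example:publication"}

root = ET.fromstring(DOC)
compound = root.find("chem:title", ns).text  # the chemical's title
paper = root.find("pub:title", ns).text      # the publication's title
```

Both elements are called `title`, yet a parser never confuses them. What each one means is still left to the application, which is the gap ontologies are meant to fill.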
              XML Overview
• A language for building languages
• Basic rules: be well formed and be valid
• Particular XML “dialects” are defined by XML
  – XML itself is defined by its own schema.
• Extensible via namespaces
• Many non-Web services dialects
• Many basic tools available: parsers, XPath
  and XQuery for searching/querying, etc.
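
The well-formedness rule and a simple query can be tried with Python's standard library, as in the sketch below. The `molecules` document is a made-up example. ElementTree supports only a subset of XPath and does not check validity against a schema, so this covers the first of the two basic rules only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """True if the text parses as XML (tags nested and closed properly)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


CATALOG = (
    "<molecules>"
    "<molecule id='m1'><name>water</name></molecule>"
    "<molecule id='m2'><name>ethanol</name></molecule>"
    "</molecules>"
)

root = ET.fromstring(CATALOG)
# XPath-style query: the name of the molecule whose id attribute is 'm2'.
name = root.find("./molecule[@id='m2']/name").text
```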
       XML and Web services
• XML lends itself to distributed computing:
  – It's just a data description.
  – Platform, programming language independent
• Web Services Description Language
  – Describes how to invoke a service
  – Can bind to SOAP, other protocols for actual
• Simple Object Access Protocol (SOAP)
  – Wire protocol extension for conveying RPC calls
  – Can be carried over HTTP, SMTP
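
Carrying a SOAP message over HTTP amounts to an ordinary POST with the envelope as the body. The sketch below only constructs the request and does not send it; the endpoint URL, the `urn:example:chem` namespace and the `getMolecularWeight` call are hypothetical.

```python
from urllib.request import Request

ENDPOINT = "http://example.org/chem"  # hypothetical service location

ENVELOPE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">'
    "<env:Body>"
    '<m:getMolecularWeight xmlns:m="urn:example:chem">'
    "<m:smiles>CCO</m:smiles>"
    "</m:getMolecularWeight>"
    "</env:Body>"
    "</env:Envelope>"
)


def make_http_request(url: str, envelope: str) -> Request:
    """Package a SOAP 1.2 envelope as an HTTP POST (built here, not sent)."""
    return Request(
        url,
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "application/soap+xml; charset=utf-8"},
        method="POST",
    )


request = make_http_request(ENDPOINT, ENVELOPE)
```

Passing `request` to `urllib.request.urlopen` would perform the call against a real endpoint. The same envelope could equally be carried over SMTP, since the message does not depend on the transport.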
OWL: Web Ontology Language
• Builds on RDF and RDFS and adds a
  means for richer descriptions of properties
  and classes
  – Disjoint classes
  – Cardinality of classes
  – Characteristics of relations, like symmetry
  Standards for Web Services
• Business Process Execution Language for
  Web Services (BPEL4WS)
• Ontology Web Language Semantics
• Web Service Modeling Ontology (WSMO)
    Standards Setting Boards
• OASIS: Organization for the Advancement of
  Structured Information Standards
  – ebXML: e-business XML
  – UDDI: Universal Description, Discovery and
• Global Grid Forum
  – community of users, developers, and vendors
    leading the global standardization effort for
    grid computing
    Standards Setting Boards
• W3C: World Wide Web Consortium
  – OWL: Web Ontology Language
  – RDF/RDFS: Resource Description
  – SOAP: Simple Object Access Protocol
  – URI/URL/URN: Universal Resource
  – WSDL: Web Service Definition Language
  – XML: eXtensible Markup Language
  SWWS: Semantic Web-Enabled
        Web Services
• Main objectives:
  – Provide a comprehensive Web Service
    description framework
  – Define a Web Service discovery framework
  – Provide a scalable Web Service mediation
• A program of the European Commission to
  run 2002-2005
Web Services Integration Projects:
• myGrid
    Web Services for Chemistry:
• Performance and scalability
• Proprietary data
• Competition from high-performance desktop
  -- Geoff Hutchison, it's a puzzle blog, 2005-01-05
  – Lack of a substantial body of trustworthy Open
    Access databases
  – Non-standard chemical data formats (over 40 in
    regular use and requiring normalization to one
 Missing Ingredients in Chemistry
• Chemical communities to assemble Open
  Access databases
  – Well-defined quality assurance procedures
    performed by distributed peer-review systems
  – Software underlying the databases needs to
    be open source.
 Chemistry Databases on the Web
• Marc Nicklaus lists 37 databases as of
  October 2001
  – Must have structure searching and at least
    100 molecules
• SoaringBear's List has 15 databases
     Institutional Repositories
• NARSTO Quality Systems Science Center
  – Pollutant species in the troposphere over
    North America
  – Part of the Carbon Dioxide Information
    Analysis Center at ORNL
  – NARSTO Data and Information Sharing Tool
     Public Data Repositories
• Developmental Therapeutics Program/NCI
  – Some assay data for download
  – Structures for over 200,000 compounds
• Zinc and other screening databases
• NIST computational chemistry database
• Environmental fate and exposure
   Other Public Repositories 1
• ChemExper Chemical Directory
  – > 200,000 substances; > 10,000 IR spectra
• HIC-Up; Hetero-Compound Identification Centre
  – Uppsala
  – 5384 substances as of 1/15/05
• Chemicals with Pharmaceutical Activity; a 3D
  Structural Database
  – 400 3D structures
   Other Public Repositories 2
  – 41 data sets in 9 categories as of 8/18/05
• WebReactions
    Other Public Repositories 3
• MolTable
• MatWeb Materials Property Data
• Spectral Database for Organic Compounds (SDBS)
   – Over 32,000 compounds
   – Has EI-MS, FT-IR, 1H NMR, 13C NMR, Raman, ESR
• NMRShiftDB (Christoph Steinbeck)
   – 14,753 structures as of 8/19/05
   – Features peer-reviewed submission of data sets
         Other Public Repositories:
           Commercial Teasers
• (Thermo Electron)
   – Demo file of 575 spectra from 87,000 in the full database
• ChemACX
   – 30 of >350 suppliers catalog data
• Sunset Molecular Discovery, LLC
   – Wombat (World of Molecular BioAcTivity)
       • 117,007 entries with over 230,000 biological activities
   – Wombat PK
       • Database for Clinical Pharmacokinetics: 643 substances with 4668
   – Three sample files from Wombat containing 341 Histamine-1 receptor
• A group of chemists, programmers, and
  informaticians working collaboratively on
  projects such as:
  –   Chemistry Development Kit (CDK)
  –   JChemPaint
  –   Jmol
  –   JUMBO
  –   NMRShiftDB
  –   Octet
  –   Open Babel
  –   QSAR
  –   World Wide Molecular Matrix (WWMM)
Indiana University Existing Projects
• System for the Integration of
  Bioinformatics Services (SIBIOS)
• PlatCom: A Platform for Computational
  Comparative Genomics
• Reciprocal Net
Indiana University Planned Projects
• Design of a Grid-based distributed data
• Development of tools for HTS data analysis and
  virtual screening
• Database for quantum mechanical simulation
• Chemical prototype projects
  – Novel routes to enzymatic reaction mechanisms
  – Mechanism-based drug design
  – Data-inquiry-based development of new methods in
    natural product synthesis
       Web Services Future
• Depends on
  – Adoption of standards
  – Incorporation of WS in current and newly
    developed applications
  – Security, privacy, quality of data issues
  – Development of WS tools and resources for