

									  Data Grids for HPC:
Geographical Information
     System Grids

          Marlon Pierce
           Geoffrey Fox
        Indiana University
         December 7 2004
         Internet Seminar

Overview from Previous

              Parallel Computing
   Parallel processing is built on breaking problems up
    into parts and simulating each part on a separate
    computer node
   There are several ways of expressing this breakup into
    parts with Software:
    • Message Passing as in MPI or
    • OpenMP model for annotating traditional languages
    • Explicitly parallel languages like High Performance Fortran
   And several computer architectures designed to
    support this breakup
    • Distributed Memory with or without custom interconnect
    • Shared Memory with or without good cache
    • Vectors with usually good memory bandwidth
          What are Web Services?
   Web Services are distributed computer programs that
    can be in any language (Fortran .. Java .. Perl .. Python)
   The simplest implementations involve XML messages
    (SOAP) and programs written in net friendly languages
    like Java and Python
   Here is a typical e-commerce use:

    [Figure: e-commerce example in which Security, Catalog, Credit Card, and shipping services are connected through WSDL interfaces]
         What Is the Connection?
   Both MPI and Web Services rely upon messaging to
   But the difference is in speed of message transmission
    • MPI useful for microsecond communication speeds.
        Clusters, traditional parallel computing

    • Web Services communicate with Internet speeds
        Millisecond communication times at best.

   This implies that we have (at least) a two-level
    programming model.
    • Level 1: MPI within science applications on clusters and
    • Level 2: Programming between science applications.
         Two-level Programming I
   The Web Service (Grid) paradigm implicitly assumes a two-level
    Programming Model
   We make a Service (same as a “distributed object” or “computer
    program” running on a remote computer) using conventional
    • C++ Java or Fortran Monte Carlo module perhaps running with MPI on
      a parallel machine
    • Data streaming from a sensor or Satellite
    • Specialized (JDBC) database access
   Such services accept and produce data from other services, files
    and databases
   The Grid is used to coordinate such services assuming we have
    solved problem of programming the service

         Two-level Programming II
   The Grid is discussing the composition of distributed
    services with the runtime interfaces to Grid as
    opposed to UNIX pipes/data streams
    [Figure: Service1, Service2, Service3, and Service4 linked together as a composition of services]

   Familiar from use of UNIX Shell, PERL or Python
    scripts to produce real applications from core programs
   Such interpretative environments are the single
    processor analog of Grid Programming
   Some projects like GrADS from Rice University are
    looking at integration between service and composition
    levels but dominant effort looks at each level separately
       3 Layer Programming Model

              Application                    MPI Fortran C++ etc.
        (level 1 Programming)

Application Semantics (Metadata, Ontology)       Semantic Web
          Level 2 “Programming”

   Basic Web Service Infrastructure

            Web Service 1                       WS 2    WS 3        WS 4

                            Workflow (level 3) Programming BPEL

    Semantic Web adds another layer between workflow and
         Services representing traditional applications
    Data and Science Applications
   Two- (or three-) level programming applies to all
   Typically we need to bind together HPC and non-HPC
    • How do you provide data to your application?
    • How do you share data between applications?
    • How do you communicate results to analysis and visualization
   This is particularly important as the size and quality of
    observational data are growing rapidly.
   Q: How do you easily bind together science apps and
    remote data sources?
    • A: Web Services (and Grids) provide the unifying
                   Grid Libraries
   Programming the Grid has many similarities with
    conventional languages
    • In HPSearch you use similar Scripting languages
   Grids are particularly good at supporting user
    interfaces as the browser is a particular service
    • Portal technology important “gift” of Grids for HPC
   Most promising (and not exploited often) is building
    Grid “Libraries” which are collections of services
    which can be re-used in several applications
    • Mastercard service is a typical business Grid library
    • Visualization, Sensor processing, GIS are naturally
      distributed components of a HPC application that can be
      developed as Grid libraries
Data Grids for HPC

             Data Deluged Science
   In the past, we worried about data in the form of parallel I/O or
    MPI-IO, but we didn’t consider it as an enabler of new
    algorithms and new ways of computing
   Data assimilation was not central to HPCC
   ASC set up because didn’t want test data!
   Now particle physics will get 100 petabytes from CERN
    • Nuclear physics (Jefferson Lab) in same situation
    • Use around 30,000 CPUs simultaneously 24X7
   Weather forecasting, climate, solid earth (EarthScope, Earth
    System Grid, GEON)
    • We discussed our project SERVOGrid in October 2004 lecture.
   Bioinformatics curated databases (Biocomplexity only 1000’s of
    data points at present)
   Virtual Observatory and SkyServer in Astronomy
   Environmental Sensor nets

                Data Deluge @ Home
   In 2003, all of Marion County, IN (including Indianapolis) was surveyed using
    Light Detection and Ranging (LiDAR) sensing.
   GRW, Inc flew a Cessna 337 airplane over the entire county to produce
    digitized maps.
     • 1 point per square meter.
     • 495 square miles total.
   Can be used to create high resolution contour maps….
   But what do you do with all of the data?
     • LiDAR data represents 3 orders of magnitude increase in data resolution over
       what is used today in conventional flood prediction (B. Engles, Purdue).
     • Flood modeling codes thus must become HPC codes to handle the size of newly
       available data.
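
To get a feel for the data volume, the survey figures on this slide can be turned into a point count. The sketch below is only a back-of-the-envelope estimate: the 24 bytes per point (three 64-bit coordinates) is an assumption made for illustration, not a property of the actual Marion County data set.

```python
# Rough size estimate for a LiDAR survey sampled at 1 point per square meter.
METERS_PER_MILE = 1609.344      # international mile, exact by definition
AREA_SQ_MILES = 495             # survey area quoted on the slide
POINTS_PER_SQ_METER = 1         # sampling density quoted on the slide
BYTES_PER_POINT = 3 * 8         # ASSUMPTION: x, y, z stored as 64-bit floats

area_sq_meters = AREA_SQ_MILES * METERS_PER_MILE ** 2
num_points = area_sq_meters * POINTS_PER_SQ_METER
raw_gigabytes = num_points * BYTES_PER_POINT / 1e9

print(f"{num_points:.3e} points, roughly {raw_gigabytes:.0f} GB uncompressed")
```

Under these assumptions a single county already yields on the order of a billion points, which is why the flood modeling codes that consume it start to look like HPC codes.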

          Example Data Grid:
         The Earth System Grid
   U.S. DOE SciDAC funded R&D effort
   Build an “Earth System Grid” that enables
    management, discovery, distributed access,
    processing, & analysis of distributed terascale climate
    research data
   A “Collaboratory Pilot Project”
   Build upon ESG-I, Globus Toolkit, DataGrid
    technologies, and deploy
   Potential broad application to other areas

                   ESG Challenges
   By the end of 2003, DOE-sponsored climate change research had
    produced 100 TB of scientific data.
    • Stored across several DOE sites and NCAR.
   Consequence of HPC, will only escalate as models can simulate
    global weather patterns at increasingly fine resolution.
   Basic problems in data management
    • What is in the data files (metadata)?
    • How were data created and by whom (provenance)?
    • How can data be stored and moved between sites efficiently?
    • How can data be delivered to the scientific community?
    [Figure: ESG web portal]

                   ESG Data Sets
   Community Climate Systems Model data
    • This is data that is compatible with the National Center for
      Atmospheric Research (NCAR) global climate model, CCSM
        Couples atmospheric, land surface, ocean, and sea ice
    • This is the US government’s workhorse code for climate
      modeling and prediction.
   Parallel Climate Model data
    • Data compatible with extensions to CCSM.
    • Uses same atmospheric model but different ocean and sea ice

        Example Data Grid: GEON
   Project Goal: Prototype interpretive environments of the future
    in Earth Sciences.
   Use advanced information technologies to facilitate
    collaborative, inter-disciplinary science efforts.
   Scientists will be able to discover data, tools, and models via
    portals, using advanced, semantics-based search engines and
    query tools, in a uniform authentication environment that
    provides controlled access to a wide range of resources.
    • A prototype “Semantic Grid”
   A services-based environment facilitates creation of scientific
    workflows that are executed in the distributed environment.
   Advanced GIS mapping, 3D, and 4D visualization tools allow
    scientists to interact with the data.

GEON Grid Application: SYNSEIS
• SYNSEIS is a grid application that provides an
  opportunity for seismologists and other earth
  science partners to compute and study 3D
  seismic records to understand complex subsurface

• SYNSEIS is built using a service-based
  architecture. While it provides users an easy-to-
  use GUI to access data, models and compute
  resources, it also provides “connectors” (APIs) for
  developers should they choose to utilize any of its
  components in other applications.
             SYNSEIS Architecture
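    [Figure: components shown are the GEON Portal; SYNSEIS (FLASH GUI); a SOAP Web service; the SynSeis Engine; the Cornell Map Server; DMC (waveform and seismic event catalogs); the Grid services GASS, GRAM, GridFTP, and GSI; and the resources SDSC, TeraGrid, NCSA, and LLNL]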
    GEON SYNSEIS Conclusions
   Using Grid technology, the GEON team was able to bring an extremely
    complex and cumbersome seismic data analysis procedure to a level that can
    be used by anyone efficiently and effectively; hence SYNSEIS is a first step
    towards faster discovery.
   Democratization of community resources allows not only GEON
    researchers but also external community members to access state-of-the-
    art software and tools.
   Although the tool is developed for GEON applications, it holds a
    tremendous potential for projects like EarthScope. SYNSEIS can be used by
    EarthScope researchers to conduct timely analysis of collected data
   SYNSEIS also has a high potential to be used in educational environments
    allowing students to experiment with data and make their own
   SYNSEIS has allowed us to practice building distributed data and
    computational resources.

    SERVOGrid Example: GeoFEST
   SERVOGrid was discussed in more detail in the October lecture
    of this series.
    • But worth another mention in this context.
   GeoFEST is
    • Geophysical Finite Element Simulation Tool
    • GeoFEST solves solid mechanics forward models with these
        2-D or 3-D irregular domains
        1-D, 2-D or 3-D displacement fields

        Static elastic or time-evolving viscoelastic problems
        Driven by faults, boundary conditions or distributed loads

    • GeoFEST runs in a variety of computing environments:
        UNIX workstations (including LINUX, Mac OS X, etc.)

        Web portal environment

        Parallel cluster/supercomputer environment

   GeoFEST output can be compared directly with current and
    future InSAR satellite data.
         GeoFEST and Data Grids
   GeoFEST works directly with Earth fault data.
   Luckily for us, there is a Web Service data source for earth
    faults in California
    • QuakeTables: accessible for human use through


    • USC, UC-Irvine, and IU designed and built this as part of the SERVO
   But GeoFEST needs programmatic access to the fault data
    • Users design layer and fault geometry problems and create finite element
      meshes through Web portal interface.
        Like GEON, we use portlets.

        Portlets are a standard way to make Java-based (and other) portals
         out of reusable components.
    • Must then pass this information to GeoFEST as an input file.
    • GeoFEST may run on some host remote from the data.
    [Figure: a Browser Interface connects to a User Interface Server, which uses SOAP and WSDL interfaces to reach services on three hosts: DB Service 1 with its DB (Host 1); Job Sub/Mon and File services on top of Operating and Queuing Systems (Host 2); and a Viz Service using IDL and GMT (Host 3)]

    [Figure: image panels labeled Site-specific Irregular Scalar Measurements; Constellations for Plate Boundary-Scale Vector; Ice Sheets; Long Valley, CA; Northridge, CA; Hector Mine, CA; Earthquakes; Stress Change (1 km scale)]
    [Figure: Data Deluged Science diagram with the labels Grid Services, Grid Data, Grid, HPC, Analysis, and Control; Distributed Filters massage data for simulation]

                             Data Assimilation
   Data assimilation implies one is solving some optimization
    problem which might have Kalman Filter like structure
    \[
    \min_{\text{Theoretical Unknowns}} \; \sum_{i=1}^{N_{obs}}
      \frac{\bigl(\mathrm{Data}_i(\mathrm{position}, \mathrm{time}) - \mathrm{Simulated\_Value}\bigr)^2}{\mathrm{Error}_i^2}
    \]

   Due to data deluge, one will become more and more dominated
    by the data (Nobs much larger than number of simulation
   Natural approach is to form for each local (position, time)
    patch the “important” data combinations so that optimization
    doesn’t waste time on large error or insensitive data.
   Data reduction done in natural distributed fashion NOT on
    HPC machine as distributed computing most cost effective if
    calculations essentially independent
     • Filter functions must be transmitted from HPC machine
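
The optimization described on this slide can be illustrated with a very small example. The sketch below evaluates the error-weighted sum of squared residuals and scans one unknown for the best value. The toy model (`simulate`), the sample observations, and the brute-force scan are all assumptions made for illustration; it is a plain least-squares fit, not a Kalman filter.

```python
def misfit(observations, simulate, unknowns):
    """Sum over observations of ((data - simulated value) / error) squared."""
    total = 0.0
    for position, time, data, error in observations:
        residual = data - simulate(position, time, unknowns)
        total += (residual / error) ** 2
    return total


def simulate(position, time, unknowns):
    """Hypothetical toy model: a signal growing linearly in time."""
    (rate,) = unknowns
    return rate * time


# (position, time, observed value, error); invented numbers for illustration.
observations = [
    (0.0, 1.0, 2.1, 0.1),
    (0.0, 2.0, 3.9, 0.1),
    (1.0, 3.0, 6.0, 0.5),
]

# Crude optimization: scan candidate values of the single unknown.
candidates = [i / 100 for i in range(100, 301)]
best_rate = min(candidates, key=lambda r: misfit(observations, simulate, (r,)))
print("best rate on the grid:", best_rate)
```

The observation with the large error (0.5) contributes little to the sum, which is the sense in which the optimization should not waste effort on large-error data.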
               Distributed Filtering
    \[ N_{obs}^{\text{local patch}} \gg N_{filtered}^{\text{local patch}} \approx \mathrm{Number\_of\_Unknowns}^{\text{local patch}} \]
 In simplest approach, filtered data gotten by linear transformations on
 original data based on Singular Value Decomposition of Least squares
 matrix
    [Figure: geographic sensor patches hold the raw observations (Nobs for local patch 1, Nobs for local patch 2); the HPC Machine sends the needed Filter to each patch and receives the filtered data (Nfiltered for local patch 1, Nfiltered for local patch 2). Caption: Factorize Matrix to product of local patches]
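
The idea of reducing data near the sensors and shipping only the filtered result can be sketched as follows. This is a simplified stand-in, not the SVD-based filter named on the slide: each patch compresses its observations into the five weighted sums needed by a straight-line fit (the normal equations), and the "HPC machine" adds the per-patch sums and solves for the two unknowns. The function names, the linear model, and the synthetic data are all hypothetical.

```python
def filter_patch(observations):
    """Run near the sensors: compress one patch's observations.

    observations: list of (position, value, error) tuples.
    Returns five weighted sums, however many observations the patch holds.
    """
    s_w = s_x = s_xx = s_d = s_xd = 0.0
    for x, value, error in observations:
        w = 1.0 / error ** 2
        s_w += w
        s_x += w * x
        s_xx += w * x * x
        s_d += w * value
        s_xd += w * x * value
    return s_w, s_x, s_xx, s_d, s_xd


def solve_on_hpc(filtered_patches):
    """Run on the HPC machine: add the patch sums and solve value = a + b * x."""
    s_w, s_x, s_xx, s_d, s_xd = (sum(col) for col in zip(*filtered_patches))
    det = s_w * s_xx - s_x * s_x
    a = (s_d * s_xx - s_x * s_xd) / det
    b = (s_w * s_xd - s_x * s_d) / det
    return a, b


# Two hypothetical patches of noise-free data from value = 2.0 + 0.5 * position.
patch_1 = [(i * 0.01, 2.0 + 0.5 * i * 0.01, 1.0) for i in range(1000)]
patch_2 = [(i * 0.01, 2.0 + 0.5 * i * 0.01, 1.0) for i in range(1000, 2000)]

filtered = [filter_patch(patch_1), filter_patch(patch_2)]
a, b = solve_on_hpc(filtered)
print(len(patch_1) + len(patch_2), "observations reduced to",
      sum(len(f) for f in filtered), "numbers")
```

Here `filter_patch` plays the role of the filter function that would be transmitted from the HPC machine, and the patches can be processed independently because the sums simply add.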
Standards For Geographic
     Data Services

                 The Story So Far…
   HPC applications generate huge amounts of data.
    • Constant problem for all HPC centers, including DOD MSRCs.
    • Managing scientific information about these applications is just as
      important as storage technology.
   HPC applications use observational data as input.
    • Projects like the ESG, GEON, and SERVO illustrate how HPC
      applications need to be coupled to data sources.
    • Quantity of observational data is growing rapidly, opening fields for non-
      traditional HPC (LiDAR and flood modeling).
   Huge amounts of new data potentially drive new HPC
    applications (LiDAR->Flood modeling)
   Earth sciences are a focus of our examples, but really, many
    applications have data sources that are geographically described.
    • Weather prediction is an obvious example.
   Thus we see the importance of coupling GIS data grid services to
    HPC applications for both data access and
                     What is GIS?
   Geographic Information Systems
    • ESRI: commercial company with many popular GIS
    • Open Geospatial Consortium (formerly OpenGIS Consortium)
    • We will focus on OGC since they define open and
      interoperable standards.
   What are the characteristics of a GIS system?
    • Need data models to represent information
    • Need services for remotely accessing data.
    • Need metadata for determining what is stored in the services.

      GML: A Data Model For GIS
   GML 3.x is an interconnected suite of over 20
    XML schemas.
   GML is an abstract model for geography.
   With GML, you can encode
    • Features: abstract representations of map entities.
    • Geometry: encode abstractly how to represent a feature
    • Coordinate reference systems
    • Topology
    • Time, units of measure
    • Observation data.

              Example Use of GML
   The SCIGN (Southern
    California Integrated GPS
    Network) maintains online
    catalogs of GPS stations.
   Collective data for each site is
    made available through
    online catalogs.
    • Using various text formats.
   This is not suitable for
    processing, but GML is.
   GML can be used to describe
    GPS using Feature.xsd
    schema, with values encoded
    at GPS observations.
                 Open GIS Services
   GML abstract data models can encode data but you need
    services to interact with the remote data.
   Some example OGC services include
    • Web Feature Service: for retrieving GML-encoded features, like faults,
      roads, county boundaries, GPS station locations, and so on.
    • Web Map Service: for creating maps out of Web Features
    • Sensor Grid Services: for working with streaming, time-stamped data.
   Problems with OGC services
    • Not (yet) Web Service compliant
        “Pre” web service, no SOAP or WSDL

        Use instead HTTP GET/POST conventions.

    • Often define general Web Service services as specialized standards
        Information services

        Notification services in sensor grids

       Anatomy of WFS (G. Aydin)
   WFS provides three major services as described in OGC specification:
     • GetCapabilities: The clients (WMS servers or users) start by requesting a
       document from the WFS that describes its capabilities. When a GetCapabilities request
       arrives, the server dynamically creates a capabilities document and returns it.
         • This is OGC’s formalization of metadata, so important to GEON, ESG, etc.

     • DescribeFeatureType: After the client receives the capabilities document he/she
       can request a more detailed description for any of the features listed in the WFS
       capabilities document.
         • The WFS returns an XML schema that describes the requested feature.

         • Metadata about a specific entry.

     • GetFeature: The client can ask the WFS to return a particular portion of any
       feature data.
         • GetFeature requests contain some property names of the feature and a Filter
           element to describe the query.
         • The WFS extracts the query and bounding box from the filter and queries the
           feature databases.
         • The results obtained from the DB query are converted to the feature’s GML
           format and returned to the client as a FeatureCollection.
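
Because these services use HTTP GET/POST conventions, the three requests can be sketched as plain URLs. The example below builds key-value-pair requests in the style of WFS 1.0.0. The endpoint `http://example.org/wfs`, the feature type name `fault`, and the bounding box are hypothetical, and a `BBOX` parameter is used here in place of the XML Filter element mentioned above to keep the sketch short. No request is actually sent.

```python
from urllib.parse import urlencode

WFS_ENDPOINT = "http://example.org/wfs"  # hypothetical server address


def wfs_url(request, **params):
    """Build a key-value-pair WFS request URL for the given operation."""
    query = {"SERVICE": "WFS", "VERSION": "1.0.0", "REQUEST": request}
    query.update(params)
    return WFS_ENDPOINT + "?" + urlencode(query)


# 1. Ask the server what it offers.
capabilities_url = wfs_url("GetCapabilities")

# 2. Ask for the XML schema of one feature type listed in the capabilities.
describe_url = wfs_url("DescribeFeatureType", TYPENAME="fault")

# 3. Ask for the features of that type inside a bounding box
#    (min longitude, min latitude, max longitude, max latitude).
feature_url = wfs_url("GetFeature", TYPENAME="fault",
                      BBOX="-119.0,34.0,-118.0,34.5")

for url in (capabilities_url, describe_url, feature_url):
    print(url)
```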

 Example WFS Capability Entries

Entry               Description

Name                A name the service provider assigns to the web feature service instance.

Title               Human-readable title to briefly identify this server in menus.

Abstract            Descriptive narrative for more information about the server.

Keyword             Contains short words to aid catalog searching.

OnlineResource      Defines the top-level HTTP URL of this service. Typically the URL of a "home page" for the service.

Fees                Contains a text block indicating any fees imposed by the service provider for usage of the service or for data retrieved from the WFS. The keyword NONE is reserved to mean no fees.

AccessConstraints   Text block describing any access constraints imposed by the service provider on the WFS or data retrieved from that service. The keyword NONE is reserved to indicate no access constraints are imposed.

 Sample Feature- CA Fault Lines
<gml:featureMember>
  <fault>
   <name>Northridge2</name>
   <segment>Northridge2</segment>
   <author>Wald D. J.</author>
   <gml:lineStringProperty>
    <gml:LineString srsName="null">
      <gml:coordinates>
       -118.72,34.243 -118.591,34.176
      </gml:coordinates>
    </gml:LineString>
   </gml:lineStringProperty>
  </fault>
</gml:featureMember>

   After receiving getFeature request, WFS decodes this request, creates a DB query
    from it and queries the database.
   WFS then retrieves the features from the database and converts them into GML
    documents.
   Each feature instance is wrapped as a gml:featureMember element.
   WFS returns a wfs:FeatureCollection document which includes all
    featureMembers returned in the query result.
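
A client that receives such a feature needs to turn it back into numbers. The sketch below parses the sample fault with Python's standard `xml.etree.ElementTree`. The `xmlns:gml` declaration is added to the sample so that the fragment parses on its own (in a real response it would come from the enclosing FeatureCollection), and the function name and returned dictionary layout are illustrative choices only.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"

# The sample feature, with the gml namespace declared so it parses standalone.
SAMPLE = """
<gml:featureMember xmlns:gml="http://www.opengis.net/gml">
  <fault>
    <name>Northridge2</name>
    <segment>Northridge2</segment>
    <author>Wald D. J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates>
          -118.72,34.243 -118.591,34.176
        </gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>
"""


def parse_fault(xml_text):
    """Extract the name, author, and (x, y) points of one fault feature."""
    member = ET.fromstring(xml_text.strip())
    fault = member.find("fault")
    coords_text = fault.find(f".//{{{GML_NS}}}coordinates").text
    # Coordinate tuples are separated by whitespace, values by commas.
    points = [tuple(float(v) for v in pair.split(","))
              for pair in coords_text.split()]
    return {"name": fault.findtext("name"),
            "author": fault.findtext("author"),
            "points": points}


fault = parse_fault(SAMPLE)
print(fault["name"], fault["points"])
```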












    • A WFS can serve data for multiple feature types.
    • WFS returns the results of GetFeature requests as GML documents (Feature Collections).
    • Clients may include other services as well as humans.
    [Figure: a WFS Server issues SQL queries to feature databases; the legible labels are Railroad [a-b], Bridge [1-5], River [a-d], and Rivers]

Schematic Interactions Between GIS
    [Figure: two Clients and a WMS interact with a WFS serving california fault data (@complexity), a WFS serving california river data (@gf1), and california boundary data (@gf1)]
                          Defining IS
   The central IS block in the preceding diagram represents
    nebulous “information services.”
   Information services are needed to bind together various GIS
    and other services.
    • What are their URLs? How do you interact with them (WSDL)? What do
      they do (capabilities)?
   The OGC defines information services, but they are specialized
    to GIS.
    • Web Catalogue Service: state appears uncertain.
    • Web Registry Service: a common mechanism to classify, register,
      describe, search, maintain and access information about OGC Web
   But if they adopt Web Service standards, they get Web Service
    information system solutions for free.
    • IS is a more general problem than just GIS.

    Universal Description, Discovery
            and Integration
   UDDI is the standard for building service registries and for
    describing their contents.
    • UDDI is part of the WS-I core:
   But no one seems to like it…
   Centralized solution
    • Single point of failure
   Poor discovery model
    • No uniform way of querying about services, service interfaces and
    • Limited query capabilities: search for services restricted to WS name and
      its classification
   Stale data in registries
    • Out-of-date service documents in UDDI registries.
    • Need a leasing system
    • Registry entries need to be dynamically updated
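
The leasing idea can be sketched in a few lines: a registry entry stays valid only for a fixed period unless the service renews it, so entries for services that have gone away disappear on their own. The class below is a toy illustration with hypothetical names; it is not UDDI and does not reflect any particular registry API. Times are passed in explicitly so the behaviour is easy to follow.

```python
class LeasedRegistry:
    """Toy service registry whose entries expire unless the lease is renewed."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self._entries = {}  # service name -> (url, expiry time)

    def register(self, name, url, now):
        self._entries[name] = (url, now + self.lease_seconds)

    def renew(self, name, now):
        url, _ = self._entries[name]
        self._entries[name] = (url, now + self.lease_seconds)

    def lookup(self, name, now):
        entry = self._entries.get(name)
        if entry is None:
            return None
        url, expires = entry
        if now >= expires:
            # Lease ran out: drop the stale entry instead of returning it.
            del self._entries[name]
            return None
        return url


registry = LeasedRegistry(lease_seconds=60)
registry.register("fault-wfs", "http://example.org/wfs", now=0)   # hypothetical URL
fresh = registry.lookup("fault-wfs", now=30)      # still within the lease
registry.renew("fault-wfs", now=50)               # lease now runs until t=110
renewed = registry.lookup("fault-wfs", now=100)   # found thanks to the renewal
stale = registry.lookup("fault-wfs", now=200)     # lease expired, entry removed
print(fresh, renewed, stale)
```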
        UDDI Has Other Problems
   Many Web Services need to maintain the concept of
    state between themselves during complicated
   For example, for better performance, I may wish to
    cache maps in a Web Map Server instead of
    reconstructing them via calls to a Web Feature Service
    every time.
   This is basically a glorified HTTP Cookie problem.
   We need a way to store this kind of volatile session state
    data in light weight data.
    • UDDI==heavyweight.
   So IS must support both registries and contexts.
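
The map-caching example can be made concrete with a small session-state store. The sketch below is an assumption-laden illustration: `ContextStore`, `fetch_features`, and `get_map` are hypothetical names, and the store is an in-memory dictionary rather than a real context service. It shows only the pattern of keeping volatile per-session data outside a heavyweight registry.

```python
class ContextStore:
    """Toy session-state repository: small key/value data kept per session."""

    def __init__(self):
        self._data = {}

    def put(self, session_id, key, value):
        self._data.setdefault(session_id, {})[key] = value

    def get(self, session_id, key, default=None):
        return self._data.get(session_id, {}).get(key, default)


feature_calls = []  # records every (simulated) call to the feature service


def fetch_features(layer, bbox):
    """Stand-in for an expensive Web Feature Service request (hypothetical)."""
    feature_calls.append((layer, bbox))
    return f"features of {layer} in {bbox}"


def get_map(context, session_id, layer, bbox):
    """Return a rendered map, reusing the session's cached copy if present."""
    key = ("map", layer, bbox)
    cached = context.get(session_id, key)
    if cached is not None:
        return cached
    rendered = "map<" + fetch_features(layer, bbox) + ">"
    context.put(session_id, key, rendered)
    return rendered


context = ContextStore()
bbox = (-119.0, 34.0, -118.0, 34.5)
first = get_map(context, "session-1", "faults", bbox)
second = get_map(context, "session-1", "faults", bbox)  # served from context
print(len(feature_calls), "feature service call(s) for two map requests")
```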

               GIS Service Registries
   Functional capabilities of a GIS service are defined in a
    “capabilities.xml” file
   An information service can gather metadata about functional
    requirements of a GIS service
    • By processing the capabilities file in an automated fashion when a service
      is registered
    • By having the service provider declare these capabilities when publishing
      a service
    • The Information System API introduces a library for XML Schema processing
      of different capability files
   UDDI with the geospatial focus of GIS Services
    • Data layers (features) of a GIS service may have varying geospatial
    • UDDI registries do not natively support spatial queries.
    • We use existing geographic taxonomies such as QuadCode taxonomy to
      associate service descriptions with spatial coverage.
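
One way to picture how a taxonomy code can stand in for a spatial query is a quadtree key: the globe is split into four quadrants repeatedly, and the sequence of quadrant digits becomes a string that a text-only registry can match. The sketch below is a generic quadtree encoding written for illustration; the digit convention (0 = NW, 1 = NE, 2 = SW, 3 = SE) and all names are assumptions, and it should not be read as the actual QuadCode taxonomy.

```python
def quad_key(lon, lat, depth):
    """Encode a longitude/latitude as a string of quadrant digits."""
    west, east, south, north = -180.0, 180.0, -90.0, 90.0
    digits = []
    for _ in range(depth):
        mid_lon = (west + east) / 2
        mid_lat = (south + north) / 2
        digit = 0
        if lon >= mid_lon:      # eastern half
            digit += 1
            west = mid_lon
        else:
            east = mid_lon
        if lat < mid_lat:       # southern half
            digit += 2
            north = mid_lat
        else:
            south = mid_lat
        digits.append(str(digit))
    return "".join(digits)


registry = {}  # quad key -> list of registered service names


def register(service, lon, lat, depth=4):
    registry.setdefault(quad_key(lon, lat, depth), []).append(service)


def find_services(lon, lat, depth=4):
    """Look up services whose registered cell contains the query point."""
    return registry.get(quad_key(lon, lat, depth), [])


# Register a hypothetical fault service at one end of the Northridge2 segment,
# then query with the other end and with a point far away (near Indianapolis).
register("california-fault-wfs", -118.72, 34.243)
nearby = find_services(-118.591, 34.176)
far_away = find_services(-86.15, 39.77)
print(quad_key(-118.72, 34.243, 4), nearby, far_away)
```

The spatial lookup reduces to an exact string match on the key, which is the kind of query a registry without native spatial support can answer.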

    WS-Context: Session State Service
   Repository of Context Information
   Allows for
     • Sharing Context info
         Info related to a particular transaction in multiple Web
          Service interactions
     • Sharing data
         Data in multiple Web service interactions

   Simply put, it’s a Distributed Variation of Shared
   See

An Information Service with both WS-Registry and WS-Context capability
    [Figure: three WMS clients, each with a WSDL interface, reach the Information Service over SOAP. Behind its WSDL interfaces sit a WS-Context Replica Group (WS-Context I and II) and a UDDI Replica Group (UDDI-Replica I, II, and III); each replica is a WSDL service backed by a DB through JDBC]
      GIS FTHPIS Implementation
          Status (M. S. Aktas)
   UDDI v.3E implementation
     • metadata extension [completed]
     • Processing geographic taxonomies to enable UDDI to support spatial queries [completed]
     • WSDL interface to UDDI v.3 [completed]
     • WSDL interface to WS-Context 1.0
     • Monitoring scheme
          Leasing [completed]
          Heart-beat
   WS-Discovery implementation
     • metadata extension [completed]
     • WSDL interface to Information Service
     • Message dissemination via Soap Handler Environment
   Caching mechanism
   Replication mechanism

            Concluding Remarks
   High Performance Computing will be increasingly data
   High volumes of observational data will push many
    applications into the realms of HPC.
   There must be an overarching architecture to integrate
    data sources, HPC applications, visualization
    applications, users.
    • Web Service architectures provide this.
    • Use to build Grid libraries
   Large amounts of data related to the earth’s surface.
   GIS data and service standards need to be integrated
    into HPC applications.
