



Survey of Emerging IT Trends
      and Technologies

         Chaitan Baru
        Monday, 10th Aug




               OUTLINE
• Trends in data sharing
  – And, Discovery/Search
• Trends in service-oriented architectures
• Trends in computing and data
  infrastructure
• The road ahead




       Geoinformatics Use Cases
• “…a user has access from a terminal to vast stores of
  data of almost any kind, with the easy ability to
  visualize, analyze and model those data.”

• “For a given region (i.e. lat/long extent, plus depth),
  return a 3D structural model with accompanying
  geophysical parameters and geologic information, at a
  specified resolution”




    Implied IT Requirements
• Search and discovery of resources
• Integration of heterogeneous 3D / 4D
  Earth Science data
• Integration of data with tools
• Analysis and Visualization
  – Ability to feed data to tools, and analyze &
    visualize model outputs
• (data-centric view…)




       Search and Discovery
• Searching “structured data”, i.e.
  metadata catalogs
[Figure: Search querying structured metadata catalogs]




      Search and Discovery
• Searching “unstructured data”, i.e. the Web
[Figure: Search querying the Web]




• Structured databases are a major
  component of the “Deep Web”




Combined Search and
    Discovery
[Figure: a single Search spanning both structured metadata catalogs and the Web]




               Advanced Search
• Proposed:
   – Geoscience
     Knowledge System,
     GeoKnowSys
   – Built using Yahoo
     Build Your Own
     Search (BOSS)
     service
• E.g. See
  wolframalpha.com




   Advanced Search: PaleoLit
• Research project at Dept of CS, CMU
  – Dr. Judith Gelernter and Prof. Jamie Carbonell
• Use ontologies to match search requests
  to related publications
• Demo…
         Informatics Issues:
     The Informatics Progression
IT Cyber Infrastructure → Cyber Informatics → Core Informatics → Science Informatics, aka Xinformatics → Science, SBAs

Courtesy: Prof. Peter Fox, RPI, CSIG’08




          The Computer Science /
         Domain Science continuum
• Computer Science Topics: e.g. Database Systems, Semistructured data
• IT Standards: e.g. ODBC, XML
• Geoinformatics Standards: e.g. Ontologies, GeoSciML definitions
• Domain Standards: e.g. domain vocabularies (Geologic Time, rock description, …)
• Domain Science Topics: e.g. geology




The data interoperability onion
• System Interop
   – Approaches: e.g., ODBC, JDBC, Java, Web services, …
   – Purview of: Computer Science
• Syntactic
   – Approaches: Schema standards
   – Purview of: Standards organizations, domain science
     repositories, data archives
• Semantic
   – Approaches: Controlled vocabularies, thesauri, domain ontologies
   – Purview of: Domain scientists
• Social Networks
   – Approaches: recommendation systems
   – Purview of: social networking software (CS and domain science, data driven)
[Figure: onion with layers Systems, Syntax, Semantics, Social Networks]




Software interoperability onion
• System Interop
    – Approaches: e.g., REST, Web services
• Syntactic
    – Approaches: e.g., SOAP, WSDL
• Semantic
    – Approaches: Controlled vocabularies, thesauri, domain ontologies
    – Purview of: Domain scientists
• Social Networks
    – Approaches: recommendation systems
    – Purview of: social networking software
• Service orchestration via workflow systems
[Figure: onion with layers Systems, Syntax, Semantics, Social Networks]
Geologic Map Integration
                    Data Mediation
• Dealing with heterogeneities in (distributed) data sources
   – Data may be in different “administrative domains”
        •  Manage authentication
   –   Data schemas may be different among sources
   –   Terminologies may be different among sources
   –   Terminologies may be different between sources and the user
   –   Software infrastructure (“stack”) may be different
• Solve the problem with “middleware”
   – Layers of software between the original application and the end user
• Mediator
   – Middleware that bridges across heterogeneities without requiring sources
     to change
         A Data Integration Example:
               Geologic Maps
[Figure: geologic maps for MT, ID, WY, NV, UT, AZ, NM, and CO, held in different systems (DB2, SRB, GML, ESRI Shapefile, PostGIS, Oracle) on different platforms (Windows, Linux, iMac)]
Heterogeneities
• Operating system
• File storage
• Database schemas
• Data Semantics
     Adopting WMS/WFS: Can provide
           Syntactic Integration
[Figure: state geologic maps (MT, ID, WY, NV, UT, AZ, NM, CO), each served through WMS]
Advantages
• Integrated presentation
• Uniform syntactical structure
• Uniform spatial definition
Problems
• Each resource may use a different schema
• Difficult to build a uniform query interface for multiple resources.
• Example attribute names seen across resources: FORMATION, UNIT_NAME, ROCK_TYPE, ERA, SYSTEM, SERIES, LITH, PERIOD
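The schema problem can be illustrated with a small mediator-style sketch. Only the attribute names come from the slide; the two sources, their records, and the pairing of attribute names onto a common schema are hypothetical.

```python
# Hypothetical per-source mappings from local attribute names to a common
# schema. The pairing of names is an assumption made for illustration only.
SOURCE_MAPPINGS = {
    "source_a": {"FORMATION": "unit_name", "LITH": "rock_type", "SYSTEM": "age"},
    "source_b": {"UNIT_NAME": "unit_name", "ROCK_TYPE": "rock_type", "PERIOD": "age"},
}


def to_common_schema(source, record):
    """Rename one source record's attributes into the common schema."""
    mapping = SOURCE_MAPPINGS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def query_all(sources, predicate):
    """Apply one predicate to every source, without changing the sources."""
    results = []
    for source, records in sources.items():
        for record in records:
            common = to_common_schema(source, record)
            if predicate(common):
                results.append((source, common))
    return results


# Hypothetical sample records.
sources = {
    "source_a": [{"FORMATION": "Unit 1", "LITH": "granite", "SYSTEM": "Cretaceous"}],
    "source_b": [
        {"UNIT_NAME": "Unit 2", "ROCK_TYPE": "basalt", "PERIOD": "Cretaceous"},
        {"UNIT_NAME": "Unit 3", "ROCK_TYPE": "shale", "PERIOD": "Devonian"},
    ],
}
cretaceous = query_all(sources, lambda r: r["age"] == "Cretaceous")
```

The user writes a single query against the common schema; the per-source mappings absorb the naming differences.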
     GeoSciML: Can Provide Schema Integration
[Figure: state geologic maps (MT, ID, WY, NV, UT, AZ, NM, CO), each exposed through GeoSciML]
Advantages
• Integrated schema
• Partially integrated semantics
Problem
• Each resource may use a different vocabulary and semantic model, e.g. British Rock Classification vs. Multi-hierarchical Rock Classification.
          Semantic Mediation with
               GeoSciML
• Mappings may also be needed between the data and the application ontology
• E.g., say, mapping 240 mya to Mesozoic
[Figure: data for CO and NM, classified with the British Rock Classification and the Multi-hierarchical Rock Classification, connect through semantic mappings to GeoSciML and the application ontology]
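The “240 mya to Mesozoic” mapping can be sketched as a lookup from a numeric age to a named era. The boundary values below are rounded approximations chosen for illustration, and the function names are hypothetical.

```python
# Approximate era boundaries in millions of years ago (Ma).
# Rounded values for illustration; a real mapping would use an agreed time scale.
ERAS = [
    ("Cenozoic", 0.0, 66.0),
    ("Mesozoic", 66.0, 252.0),
    ("Paleozoic", 252.0, 541.0),
]


def era_for_age(age_mya):
    """Return the era whose interval contains the age, or None if out of range."""
    for name, youngest, oldest in ERAS:
        if youngest <= age_mya < oldest:
            return name
    return None


def matches_era(record_age_mya, era_name):
    """True when a record stored as a numeric age falls in the requested era."""
    return era_for_age(record_age_mya) == era_name
```

With such a mapping, a source that stores numeric ages can still answer a query phrased in era names.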
          Query Rewriting:
     Example: A Rock Classification
               Ontology
[Figure: rock classification ontology with branches Genesis, Fabric, Composition, Texture]
            Query: Concept Expansion
Concept expansion:
• what else to look for when the user asks for ‘Mafic’
[Figure: Composition branch of the ontology]
      Query: Concept Generalization
Generalization:
• finding data that are ‘like’ X and Y
[Figure: Composition branch of the ontology]
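Both query rewrites can be sketched over a toy hierarchy. The Composition branch below is a simplified, hypothetical stand-in for the real ontology: expansion collects a concept's descendants, and generalization finds the nearest common ancestor of two concepts.

```python
# Simplified, hypothetical Composition branch (not the GEON ontology itself).
CHILDREN = {
    "Composition": ["Felsic", "Intermediate", "Mafic"],
    "Felsic": ["Granite", "Rhyolite"],
    "Intermediate": ["Andesite", "Diorite"],
    "Mafic": ["Basalt", "Gabbro"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def expand(concept):
    """Concept expansion: the concept plus all of its descendants."""
    terms = [concept]
    for child in CHILDREN.get(concept, []):
        terms.extend(expand(child))
    return terms


def ancestors(concept):
    """The concept followed by its parent, grandparent, and so on."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalize(x, y):
    """Generalization: nearest common ancestor of x and y, or None."""
    seen = set(ancestors(x))
    for concept in ancestors(y):
        if concept in seen:
            return concept
    return None
```

A query for ‘Mafic’ is rewritten to search for every term in `expand("Mafic")`; data “like” X and Y can be found by expanding `generalize(X, Y)`.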
     Ontology-based Geologic Map
  Integration: Implemented in GEON
[Figure: domain knowledge applied to a geologic map of Nevada]
• Show formations where AGE = ‘Paleozoic’ (without age ontology)
• Show formations where AGE = ‘Paleozoic’ +/- a few hundred million years (with age ontology)
                    ODAL, SOQL, and
                  Data Integration Carts™
              • ODAL: Ontological Database Annotation Language
                  • Create a partial model of ontologies from database

                   <odal:NamedIndividuals odal:id="RockSample"
                                           odal:database="VTDatabase">
                     <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
                     <odal:Table>Samples</odal:Table>
                     <odal:Table>RockTexture</odal:Table>
                     <odal:Table>RockGeoChemistry</odal:Table>
                     <odal:Table>ModalData</odal:Table>
                     <odal:Table>MineralChemistry</odal:Table>
                     <odal:Table>Images</odal:Table>
                     <odal:Column>ssID</odal:Column>
                    </odal:NamedIndividuals>

[Figure: the GUI generates the ODAL annotation, which is sent to the ODAL processor]

The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry,
ModalData, MineralChemistry and Images represent instances of RockSample
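A minimal sketch of how a processor might read such an annotation, using only the Python standard library. The namespace URI is an assumption (the slide does not show an `xmlns:odal` declaration), and `parse_annotation` is a hypothetical helper, not part of any ODAL tool.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; the slide omits the xmlns declaration.
ODAL_NS = "http://example.org/odal"

ODAL_TEXT = f"""
<odal:NamedIndividuals xmlns:odal="{ODAL_NS}"
                       odal:id="RockSample" odal:database="VTDatabase">
  <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
  <odal:Table>Samples</odal:Table>
  <odal:Table>RockTexture</odal:Table>
  <odal:Table>RockGeoChemistry</odal:Table>
  <odal:Table>ModalData</odal:Table>
  <odal:Table>MineralChemistry</odal:Table>
  <odal:Table>Images</odal:Table>
  <odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>
"""


def qname(name):
    """Build the namespace-qualified name ElementTree uses internally."""
    return f"{{{ODAL_NS}}}{name}"


def parse_annotation(text):
    """Extract the ontology class and the tables/columns that hold its instances."""
    root = ET.fromstring(text.strip())
    return {
        "id": root.get(qname("id")),
        "database": root.get(qname("database")),
        "class": root.find(qname("Class")).get(qname("resource")),
        "tables": [t.text for t in root.findall(qname("Table"))],
        "columns": [c.text for c in root.findall(qname("Column"))],
    }


annotation = parse_annotation(ODAL_TEXT)
# Every (table, column) pair is a place where RockSample instances can be found.
instance_sources = [(t, c) for t in annotation["tables"] for c in annotation["columns"]]
```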
  SOQL: Simple Ontology Query Language
Query single or many resources
      • via ontologies (i.e., high level logical views)
      • independent of physical representation (i.e. schemas)
Ontology fragment used in the example:
      • RockSample has a location (a Location, with lat and long)
      • RockSample has hasSiO2 (a ValueWithUnit, with value: float and unit: string)

                         SELECT X.location.*;
                           FROM RockSample X
                          WHERE X.location.lat > 60
                            AND X.location.long > 100
                            AND X.hasSiO2.value < 30
                            AND X.hasSiO2.unit =‘weightPercetage’

[Figure: the GUI generates the SOQL query, which is sent to the SOQL processor]
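To make the meaning of the query concrete, the sketch below evaluates the same conditions over in-memory objects shaped like the ontology fragment. It is not a SOQL implementation; the classes mirror the ontology names from the slide, the sample data are hypothetical, and the unit string is copied from the slide as written.

```python
from dataclasses import dataclass


@dataclass
class Location:
    lat: float
    long: float


@dataclass
class ValueWithUnit:
    value: float
    unit: str


@dataclass
class RockSample:
    location: Location
    hasSiO2: ValueWithUnit


UNIT = "weightPercetage"  # spelled exactly as on the slide


def select_locations(samples):
    """Return X.location for every RockSample X that satisfies the WHERE clause."""
    return [
        s.location
        for s in samples
        if s.location.lat > 60
        and s.location.long > 100
        and s.hasSiO2.value < 30
        and s.hasSiO2.unit == UNIT
    ]


# Hypothetical sample data.
samples = [
    RockSample(Location(65.0, 120.0), ValueWithUnit(25.0, UNIT)),
    RockSample(Location(40.0, 120.0), ValueWithUnit(25.0, UNIT)),
    RockSample(Location(65.0, 120.0), ValueWithUnit(45.0, UNIT)),
]
matches = select_locations(samples)
```

The query is written against ontology paths such as `X.location.lat`; which tables and columns actually hold those values is left to the annotations.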




        Issues in sharing data:
    Primary vs secondary (derived)
[Figure: Collect Data → Process and Visualize → Share Results; data can also be shared right after collection, and intermediate results can be shared during processing]




                 Sources of Data
• Distributed data collections
   – By individual PIs
   – “Informal” sharing, e.g. via social network
   – “Formal” sharing, e.g. via submission to community data archives /
     databases
• Centralized data collections
   – E.g. via a large project (standardized protocols)
   – By agencies (internal protocols)
• Metadata to the rescue
   – Data description standards
   – Process description standards (workflows)
• State Surveys and USGS are major sources




  Major Interoperability Efforts
• OneGeology.org
  – International initiative of
    geological surveys to create
    dynamic geological map data
    available via the web.
• US Geoscience Information
  Network (US GIN)
  – Led by Lee Allison, AZGS




 Federating Metadata Catalogs
• Local vs Community “View”
   – Individual data providers may choose to “export” a
     community view
• Direct access to the source may still provide
  more “rich” access to data
• Federated Catalogs
   – The Geosciences Information Network, GIN approach
   – Adopt standards for catalog content (ISO) and
     implementation (CSW)
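A federated search over CSW catalogs amounts to sending the same GetRecords request to each catalog. The sketch below only builds the request URLs. The endpoints are hypothetical, and the key-value parameters follow CSW 2.0.2 conventions as commonly implemented; they should be checked against each catalog's capabilities document.

```python
from urllib.parse import urlencode


def csw_getrecords_url(endpoint, keyword, max_records=10):
    """Build a CSW 2.0.2 GetRecords (KVP) URL for a free-text keyword search."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "summary",
        "resultType": "results",
        "constraintLanguage": "CQL_TEXT",
        "constraint_language_version": "1.1.0",
        "constraint": f"AnyText LIKE '%{keyword}%'",
        "maxRecords": str(max_records),
    }
    return f"{endpoint}?{urlencode(params)}"


def federated_urls(endpoints, keyword):
    """One identical request per catalog in the federation."""
    return [csw_getrecords_url(e, keyword) for e in endpoints]


# Hypothetical catalog endpoints.
ENDPOINTS = ["https://catalog-a.example.org/csw", "https://catalog-b.example.org/csw"]
urls = federated_urls(ENDPOINTS, "geology")
```

Because every catalog speaks the same interface, adding a catalog to the federation means adding one endpoint, not writing a new client.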
              Interoperation between GEON
                     and GEO GRID
[Figure: the GEON catalog (ADN metadata, SRB storage, WMS server) and the GEO Grid catalog (storage receiving 600 scenes/day, WMS server) each sit behind a catalog service web adapter; a composite service exchanges CSW requests and responses with both, and results carry WMS URLs]

• Implement CSW interfaces
   – Collaboration with the NSF PRAGMA project (Pacific Rim Assembly for Grid
     Middleware Applications)
Integration & Visualization of 3D/4D data
“For a given region (i.e. lat/long extent, plus depth), return a 3D
structural model with accompanying physical parameters of
density, seismic velocities, geochemistry, and geologic ages,
using a cell size of 10km”

• Derived 3D volumetric model
   – Multiple isosurfaces with different transparencies
   – Slices through the volume
   – Variable gridding: data typically has lower resolution at
     greater depths
• 2D surface data: topography (“2.5D”), satellite imagery, street
  maps, geologic maps, fault lines, and other derived features
• Bore hole or well data and point observations
 OpenEarth Framework Goals
Geoscience Integration:

• Data types - topography, imagery, bore hole samples,
  velocity models from seismic tomography, gravity
  measurements, simulation results…


• Data coordinate spaces and dimensionality - 2D and 3D
  spatial representations and 4D that covers the range of
  geologic processes (EQ cycle to deep time).
   OpenEarth Framework Goals
Structural Integration:

• Data formats – shapefiles, NetCDF, GeoTIFF, and other formal
  and de facto standards.

• Data models - 2D and 3D geometry to semantically richer models
  of features and relationships between those features.

• Data delivery methods & Storage Schemes- local files to database
  queries, web services (WMS, WFS) and services for new data
  types (large tomographic volumes, etc.).
              OEF Philosophy
• OEF focused on integrating data spanning the
  geosciences.

• Open software architecture and corresponding
  software that can properly access, manipulate and
  visualize the integrated data.

• Open source to provide the necessary flexibility for
  academic research and to provide a flexible test bed for
  new data models and visualization ideas.
OEF Architecture
             OEF Architecture
Data Integration Services:
 – Designed to support rapid
   visualization of integrated
   datasets
 – operations to grid data,
   resample it at multiple
   resolutions and subdivide data
   to better support progressive
   changes to the display as the
   user pans and zooms
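The resample-and-subdivide operations can be sketched on a plain 2D grid. This is an illustration of the general technique (block averaging into a resolution pyramid, plus tiling), not the OEF implementation; all function names are hypothetical.

```python
def downsample(grid, factor=2):
    """Average non-overlapping factor x factor blocks of a 2D grid."""
    rows, cols = len(grid) // factor, len(grid[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [
                grid[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def build_pyramid(grid, levels):
    """Full-resolution grid followed by successively coarser copies."""
    pyramid = [grid]
    for _ in range(levels - 1):
        if len(pyramid[-1]) < 2 or len(pyramid[-1][0]) < 2:
            break  # cannot halve a 1-cell-wide grid any further
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def tiles(grid, size):
    """Subdivide a grid into size x size tiles keyed by (tile_row, tile_col)."""
    return {
        (r // size, c // size): [row[c:c + size] for row in grid[r:r + size]]
        for r in range(0, len(grid), size)
        for c in range(0, len(grid[0]), size)
    }


grid = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
pyramid = build_pyramid(grid, levels=3)
grid_tiles = tiles(grid, 2)
```

A viewer can draw the coarse level first and fetch only the finer tiles that fall inside the current view as the user pans and zooms.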
             OEF Architecture
Visualization Tools:
 – Run on the user's computer,
   dynamically query spatial and
   temporal data from the OEF
   services
 – Uses 3D graphics hardware for
   fast display
 – Open architecture supports
   multiple visualization tools
   authored throughout the
   community (e.g., GEON IDV)
 – New viz capabilities
   developed as necessary
OEF Visualization
The software services stack
     Example: GEON



[Figure: software services stack above compute nodes and disk storage; the service interface is being pushed down the stack]
             Software as a Service:
              At different levels of software
  • Software as a Service, SaaS
       – E.g., Google Apps, Salesforce.com, SAP, …
  • Infrastructure as a Service, IaaS
       – E.g., Amazon EC2, …
  • Platform as a Service, PaaS
[Figure: SaaS, PaaS, and IaaS layered above compute nodes and disk storage]




  The evolving computational
         architecture
• Mainframe computers (institutional
  computing)
• Minicomputers (departmental
  computing)
• Workstations (laboratory computing)
• Laptops (personal computing)
• …back to the future..??
Cloud Computing: A meeting
        of trends
• Price/performance of computing platforms
• Data volumes
• Cost of ownership
• Capabilities of networking and distributed systems
     Cloud Computing Origins
• Cloud computing: Many definitions
   – Here’s one: Use of remote data centers to manage scalable, reliable, on-
     demand access to applications
• Origins
   – Goes back to the need by Web search engines to inexpensively process all
     the pages on the Web
   – Done by creating a grid of datacenters and processing data in parallel
     across them
   – Development of a parallel data programming environment by Google:
     MapReduce
• Data + cloud computing
   – what about remote centers for scalable, reliable, on-demand access to
     data?
          Cloud Computing
• A different pricing model
  – No upfront cost of acquisition. Rent, don’t buy.
• Can access 1000’s of processors / disks
  – Scalability
  – “Elastic computing”
• A different model for dealing with
  system failures
  – Retry, loose consistency, …
     Cloud computing for data
• Data as a service: what is the abstraction for storage?
   – Table, Blob, Queue
   – …??
• Describing characteristics of the data
   – Metadata about storage to specify policies to be applied
   – Security, reliability, performance, etc
• Scaling to meet application needs
   – Large configurations
   – Dealing with virtualization
   – New failure models
       • Retry, loose consistency
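The “retry” failure model can be sketched as a wrapper that re-issues an operation after a transient failure, waiting longer each time. `TransientError` and `with_retries` are hypothetical names; real storage clients define their own error types and retry policies.

```python
import time


class TransientError(Exception):
    """Stand-in for a temporary failure such as a timeout or a busy node."""


def with_retries(operation, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call operation(); on TransientError wait and retry with doubling delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the wrapper can be exercised without real waiting. The approach assumes the operation is safe to repeat.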
              Storage as a Service
• Amazon S3: An example
   – Charges for Storage, Data Transfer, and Requests (e.g. PUT, COPY, POST, LIST,
     GET)

• Issues
   –   Bandwidth to storage
   –   Quality of Service
   –   Storage Elasticity
   –   Privacy / security

• Standardization efforts
   – Storage Networking Industry Association (SNIA) Technical Working Group (TWG)
     on Cloud Storage has just started

• Important Issues
   – Metadata for storage
   – Scaling up to large dataset sizes
The two sides of Cloud Computing
 • Large distributed infrastructure
     –   “Everything is in the cloud”
     –   Interesting as a proposition for the IT operations of an enterprise
     –   Cloud companies would like to reach deep into enterprise IT
     –   “Our business is not the entrenched data centers in current large
         organizations, but the new companies…”
 •   Large-scale infrastructure in the Datacenter
     – Seeding the cloud
     – Shared-nothing parallelism
     – Data on the cheap…a la Google
 The NSF Cluster Exploratory
      (CluE) Program
• Google-IBM-NSF Cluster
   – Well over a thousand processors
       • When fully built out, will comprise approximately 1,600 processors
   – Terabytes of memory
   – Hundreds of terabytes of storage
• Open source software
   – Linux and Apache Hadoop
• IBM Tivoli
   – System management, monitoring and dynamic resource
     provisioning
• A platform for “apples-to-apples” comparisons
   – Can reserve time on nodes for exclusive access
                 Our CluE Project
• Project (PI: Baru; co-PI: Krishnan)
    – Performance Evaluation of On-Demand Provisioning Strategies for Data
      Intensive Applications
• Investigate hybrid software model
    – Database system / Hadoop system
    – Some parts of the application require features provided by a DBMS
        • Transactional capability, full SQL support
    – Other parts of the application can exploit Hadoop model
        • Very large data sets
        • Data parallel processing
        • Loose consistency models
• Price / performance is an issue
    – Including energy costs
   San Andreas Fault LiDAR
           Dataset:
     Data Access Patterns
• B4 Dataset
                  Experiments
•   “On-demand” database vs Hadoop
•   SQL vs Hadoop
•   Energy consumption as a factor in price/performance
•   Platforms to be used
     • Google-IBM cluster
     • OpenCirrus testbed
     • Triton resource




            The Road Ahead
• Advanced search engines
  – Search structured and unstructured data
  – Deal with display of heterogeneous results
  – Show provenance of data
• Sophisticated tools for 3D and 4D data
  integration
  – Combination of “server-side” processing and caching
    and client-side interaction and visualization
• Service-oriented architecture
  – Applications and IT infrastructure available as services
  – Perhaps some of them in “the Cloud”
  Dealing with very large data
• Either the data can be partitioned into
  segments and processed in parallel
  – Shared-nothing parallelism
• Or not
  – Shared memory systems
Parallel Processing of Large
            Data
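• Shared Memory: processors (P) share a single memory (M) and disk (D)
• Shared Nothing: each processor (P) has its own memory (M) and disk (D); nodes communicate over a network
• Shared Nothing with a partitioning strategy: the dataset is divided across the nodes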
       Data partitioning strategies
• Round-robin
  – Equal distribution across nodes by data volume
• Hash
  – all data with the same key value go to same node
• Range
  – all data within a range of values go to the same node
[Figure: a partitioning strategy distributes the dataset across shared-nothing nodes (P, M, D)]
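The three strategies can be sketched in a few lines each. Function names and the sample data are hypothetical; `zlib.crc32` is used for hashing only because it is stable across runs, not because any particular system uses it.

```python
import zlib
from bisect import bisect_right


def round_robin(records, n):
    """Deal records to n nodes in turn, giving near-equal volumes."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


def hash_partition(records, n, key):
    """Send all records with the same key value to the same node."""
    parts = [[] for _ in range(n)]
    for rec in records:
        h = zlib.crc32(str(key(rec)).encode("utf-8"))
        parts[h % n].append(rec)
    return parts


def range_partition(records, boundaries, key):
    """Send records to nodes by sorted boundaries; len(boundaries) + 1 nodes."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        parts[bisect_right(boundaries, key(rec))].append(rec)
    return parts


rr = round_robin(list(range(7)), 3)
rp = range_partition([5, 10, 15, 25], [10, 20], key=lambda v: v)
hp = hash_partition(["NV", "UT", "NV", "AZ", "UT"], 3, key=lambda s: s)
```

Round-robin balances volume but scatters related records; hash and range keep related records together at the risk of uneven node sizes.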
          MapReduce / Hadoop
• Programming environment for very large scale
  data processing
• Managing task executions and data transfers in a
  shared nothing environment
   – MapReduce: Infrastructure to support data scatter / gather
   – Distributed data repository (“file system”)
       • Google File System (GFS)
       • Hadoop Distributed File System (HDFS)
   – Round-robin partitioning of data
• MapReduce
   – Google’s proprietary implementation
• Hadoop
   – Apache, open source implementation
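The scatter/gather pattern can be sketched in a single process. This is a teaching sketch, not the Hadoop or Google API: `run_mapreduce` and the word-count mapper and reducer are hypothetical, and real systems run the phases on separate nodes against a distributed file system.

```python
import zlib
from collections import defaultdict


def run_mapreduce(splits, mapper, reducer, num_reducers=2):
    """Map each split, re-hash intermediate keys to reducers, then reduce."""
    # Shuffle target: one bucket of {key: [values]} per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:              # each split could run on its own node
        for record in split:
            for key, value in mapper(record):
                target = zlib.crc32(key.encode("utf-8")) % num_reducers
                buckets[target][key].append(value)
    output = {}
    for bucket in buckets:            # each bucket could run on its own node
        for key, values in bucket.items():
            output[key] = reducer(key, values)
    return output


def count_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    """Sum the counts gathered for one word."""
    return sum(values)


# Hypothetical input, already divided into two splits.
splits = [["basalt granite", "basalt"], ["shale granite"]]
counts = run_mapreduce(splits, count_mapper, count_reducer)
```

Re-hashing by key guarantees that every value for a given key reaches the same reducer, which is what makes the reduce step independent across nodes.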
         MapReduce execution
• Hadoop vs database
          MapReduce vs Database
•   Database
     –   Partition “base tables” into N partitions
     –   Intermediate data can be “re-partitioned”
     –   Intermediate data can be combined
     –   Well-defined algebra for data manipulation (SQL)
•   MapReduce / Hadoop
     –   Partition input data file into M splits
     –   Intermediate data are re-hashed
     –   Intermediate data can be “combined”
     –   Java programs
•   Cost of dynamic vs static partitioning
     –   Run time costs
     –   Storage costs
•   Optimal partitioning
     –   Query and Workload dependent
     –   How to measure any deviations from the optimal?
     –   When to repartition?
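One simple way to quantify deviation from a balanced partitioning is the ratio of the largest partition to the mean partition size. This is only one possible proxy, since the optimum is query- and workload-dependent; the function names and the threshold value are assumptions.

```python
def imbalance(partition_sizes):
    """Largest partition divided by the mean size; 1.0 means perfectly balanced."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean if mean else 0.0


def should_repartition(partition_sizes, threshold=1.5):
    """Flag a layout whose imbalance exceeds an (assumed) tolerance."""
    return imbalance(partition_sizes) > threshold
```

In a shared-nothing system the slowest node bounds the run time, so a high ratio suggests that repartitioning may pay for its run-time and storage costs.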
    USGS Role in Geoinformatics
Fundamental: Develop, maintain, make accessible:
• Long-term national and regional geologic, hydrologic,
  biologic, and geographic databases
• Earth and planetary imagery
• Open-source models of complex natural systems and
  human interaction with those systems
• Physical collections of earth materials, biologic materials,
  reference standards, geophysical recordings, paper records
• National geologic, biologic, hydrologic, and geographic
  monitoring systems
• Standards of practice for the geologic, hydrologic, biologic,
  and geographic sciences
           Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics
                                   2007, San Diego, CA.
     USGS Role in Geoinformatics
• All activities (data creation, modeling,
  monitoring, collections, standards, etc.) must
  be done in cooperation and collaboration
  with the public and with governmental,
  academic, and private sector partners and
  stakeholders.

             A critical USGS role:
facilitate bringing communities together!
          Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics
                                  2007, San Diego, CA.
            Data Collections versus
           Communities of Practice
• Geoinformatics must evolve beyond the
  accumulation of data, models, and standards to
  become the framework for a community of practice
  in the natural sciences.

• Etienne Wenger and Jean Lave coined the term
  and developed the learning theory of communities
  of practice: that we learn not only as individuals
  but as communities. By engaging in communities
  of practice we increase our capacity and innovation
  as well as leverage our support for areas of interest.
           Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics
                                   2007, San Diego, CA.
Creativity, Learning, and Innovation
A community of practice is not merely a community
  with a common interest. Its members are practitioners who
  share experiences and learn from each other. They
  develop a shared repertoire of resources:
  experiences, stories, tools, vocabularies, ways of
  addressing recurring problems. This takes time
  and sustained interaction. Standards of practice
  and reference materials will grow out of this. But
  the critical benefits include: creating and
  sustaining knowledge, leveraging of resources, and
  rapid learning and innovation.
         Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics
                                 2007, San Diego, CA.
           1000’s of National and Regional
                      Databases
• The National Map – topographic, elevation,
  orthoimagery, transportation, hydrography, etc.
• Geospatial One Stop portal
• MRDATA – Mineral Resources and Related
  Data
• The National Geologic Map Database -
  standardized community collection of
  geologic mapping
• National Water Information System -
  NWISWeb
• National Geochemical Survey Database
  (PLUTO, NURE)
• National Geophysical Database (aeromag,
  gravity, aerorad)
• Earthquake Catalogs
• North American Breeding Bird Survey
• National Vegetation/speciation maps
• National Oil and Gas Assessment
• National Coal Quality Inventory
         Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics
                                 2007, San Diego, CA.

				