

              Proposal Presentation to the Supersite Program
                              Nov 30, 2001

Supersite Relational Database System (SRDS)

                            a sub-project of
          St. Louis Midwest Supersite Project, Jay Turner, PI

                            Rudolf Husar, PI
     Center for Air Pollution Impact and Trend Analysis (CAPITA)
                 Washington University, St. Louis, MO
                     Purpose of the Project:

              Design, Populate and Maintain a
            Supersite Relational Database System

• Facilitate cross-Supersite, regional, comparative data analyses

• Support analyses by a variety of research groups

• Include monitoring data from Supersites and auxiliary projects
 EPA Specs of the Supersite Relational Data System (from RFP)

• Data Input:
   – Data input electronically
   – Modest amount of metadata on sites, instruments, data sources/version, contacts etc.
   – Simple data structures, formats and convenient submission procedures

• Data Storage and Maintenance:
   – Data storage in relational database(s), possibly distributed over multiple servers
   – A catalog of data holdings and request logs
   – Supersite data updates quarterly

• Data Access:
   – User-friendly web-access by multiple authorized users
   – Data query by parameter, method, location, date/time, or other metadata
   – Multiple data output formats (ASCII, spreadsheet, other: dbf, XML)
              General Approach to SRDS Design
• Based on consensus, adopt a uniform relational data structure, suitable for
  regional and cross-Supersite data integration and analysis.
• We propose a star schema with spatial, temporal, parameter and method
  dimensions.

• The ‘original’ data are to be maintained at the respective providers or
  custodians (Supersites, CIRA, CAPITA..).
• We propose the creation of flexible ‘adapters’ and web-submission forms
  for the transfer of data subsets into the uniformly formatted ‘Federated
  Data Warehouse’.

• Data users would access the data warehouse manually or through software.
• We propose data access using modern ‘web services’ protocol, suitable for
  adding data viewers, processors (filtering, aggregation and fusion) and
  other value-adding processes.
                     The RDBMS Schema: ‘Minimal’ Star Schema
                 For integrative, cross-Supersite analysis, with data queries by time,
              location, parameter and method, the database has to have, at a
                 minimum, time, location, parameter and method as dimensions

•     The minimal Sites table includes SiteID, Name and Lat/Lon.

•     The minimal Parameter table consists of ParameterID,
      Description and Unit

•     The time dimension table is usually omitted, since the time
      code is self-describing

•     The minimal Fact (Data) table consists of the Obs_Value and
      the three dimensional codes: Obs_DateTime, Site_ID and
      Parameter_ID
•     Additional dimension tables may include Method and Data Source

    The CAPITA data exploration software, Voyager, uses this minimal schema. Voyager has been in use for the past
        12 years, successfully encoding and browsing 1000+ datasets worldwide.
    The state of California still formats and distributes its AQ data on CDs using Voyager.
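
The minimal star schema described above can be sketched in SQL. The following is an illustrative sketch using SQLite: the table and column names (Sites, Parameters, Data, Obs_Value, etc.) follow the slide, while the sample site, parameter and observation are invented for demonstration.

```python
import sqlite3

# Illustrative sketch of the minimal star schema, using SQLite.
# Table/column names follow the slide; the sample row is invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE Sites (
    SiteID      TEXT PRIMARY KEY,
    Name        TEXT,
    Lat         REAL,
    Lon         REAL
);
CREATE TABLE Parameters (
    ParameterID TEXT PRIMARY KEY,
    Description TEXT,
    Unit        TEXT
);
-- Fact table: the observed value plus the three dimensional codes.
CREATE TABLE Data (
    Obs_DateTime TEXT,
    Site_ID      TEXT REFERENCES Sites(SiteID),
    Parameter_ID TEXT REFERENCES Parameters(ParameterID),
    Obs_Value    REAL
);
""")

cur.execute("INSERT INTO Sites VALUES ('STL01', 'East St. Louis', 38.61, -90.15)")
cur.execute("INSERT INTO Parameters VALUES ('PM25', 'Fine particulate mass', 'ug/m3')")
cur.execute("INSERT INTO Data VALUES ('2001-11-30T12:00', 'STL01', 'PM25', 18.2)")

# Query along the dimensions: one parameter, joined back to the
# Sites and Parameters dimension tables.
rows = cur.execute("""
    SELECT d.Obs_DateTime, s.Name, p.Description, d.Obs_Value, p.Unit
    FROM Data d
    JOIN Sites s      ON s.SiteID = d.Site_ID
    JOIN Parameters p ON p.ParameterID = d.Parameter_ID
    WHERE d.Parameter_ID = 'PM25'
""").fetchall()
print(rows)
```

The joins show why the schema supports query by parameter, location and time with a single, uniform pattern.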
    From Heterogeneous to Homogeneous Schema
•   Individual Supersite SQL databases have varied designs, usually following a more elaborate
    ‘snowflake’ pattern (see Database Schema Design for the Federated Data Warehouse)

•   Though they have more complicated schemata, these SQL servers can be queried along spatial,
    temporal, parameter, method dimensions. However, the query to retrieve the same information
    depends on the particular database schema.

•   A way to homogenize the distributed data is to access all the data through a Data Adapter, using
    only a subset of the tables/fields from any particular database (shown in red in the schemata below)

•   The proposed extracted uniform (abstract) schema is the Minimal Star Schema (possibly expanded).

•   The final form of the uniformly extracted data schema will be arrived at by Supersite consensus.
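
The adapter idea can be illustrated with a small sketch: each provider registers a mapping from the uniform dimensions (parameter, location, time) to its own table and column names, and the adapter rewrites a uniform query into native SQL. All schema names below are hypothetical stand-ins, not the actual CIRA or CAPITA schemata.

```python
# Hypothetical sketch of a Data Adapter: per-provider mappings from the
# uniform dimensions to native table/column names. The names below are
# invented stand-ins, not the real federation schemata.
ADAPTERS = {
    "capita": {
        "table": "Observations",
        "param": "ParamCode", "site": "SiteCode", "time": "ObsTime",
        "value": "Value",
    },
    "cira": {
        "table": "ImproveData",
        "param": "Species", "site": "StationID", "time": "SampleDate",
        "value": "Concentration",
    },
}

def uniform_to_native(server, param, site, start, end):
    """Translate a uniform (parameter, site, time-window) query into
    the native SQL of one federation member."""
    m = ADAPTERS[server]
    return (
        f"SELECT {m['time']}, {m['site']}, {m['param']}, {m['value']} "
        f"FROM {m['table']} "
        f"WHERE {m['param']} = '{param}' AND {m['site']} = '{site}' "
        f"AND {m['time']} BETWEEN '{start}' AND '{end}'"
    )
```

The same uniform request thus yields a different native query per server, while the caller never sees the heterogeneous schemata.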
      [Diagram: a Data Adapter extracts homogeneous data from heterogeneous sources,
       using only the subset of each native schema (shown in red) that maps onto the
       Uniform Schema table.]
             Live Demo of the Data Warehouse Prototype
Currently online data are accessible from the
    CIRA (IMPROVE) and CAPITA SQL servers

 Uniform Data Query regardless of the native
     schema: Query by parameter, location,
     time, method

The hidden DataAdapter
    - accepts the uniform query
    - translates the uniform query into a server-specific query
    - returns a DataSet in the uniform schema

 Data Returned in uniform schema

 A rudimentary viewer displays the data in a
     table for browsing.
                         Federated Data Warehouse Architecture
•   Three-tier architecture consisting of
     –   Provider Tier: Back-end servers containing heterogeneous data, maintained by the federation members
     –   Proxy Tier: Retrieves Provider data and homogenizes it into common, uniform Datasets
     –   User Tier: Accesses the Proxy Server and uses the uniform data for presentation, integration or further processing
•   The Provider servers interact only with the Proxy Server in accordance with the Federation Contract
     –   The contract sets the rules of interaction (accessible data subsets; types of queries submitted by the Proxy)
      –   The Proxy layer allows strong security measures, e.g. through Secure Sockets Layer
•   The data User interacts only with the generic Proxy Server using flexible Web Services interface
     –   Generic data queries, applicable to all data in the Warehouse (e.g. space, time, parameter data sub-cube)
     –   The data query is addressed to a Web Service provided by the Proxy Server of the Federation
     –   Uniformly formatted, self-describing XML data packages are handed to the user for presentation or further machine processing
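
The last step can be sketched as below: homogenized rows are wrapped into a self-describing XML package of the kind the Proxy Server's Web Service would hand back. The element and attribute names here are assumptions for illustration, not an agreed Federation schema.

```python
import xml.etree.ElementTree as ET

def package_dataset(rows):
    """Wrap homogenized (time, site, param, value, unit) rows into a
    self-describing XML package. Element/attribute names are
    illustrative, not an agreed Federation schema."""
    root = ET.Element("Dataset", schema="MinimalStar")
    for t, site, param, value, unit in rows:
        obs = ET.SubElement(root, "Obs")
        obs.set("time", t)
        obs.set("site", site)
        obs.set("param", param)
        obs.set("unit", unit)
        obs.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Because the package carries its own metadata (time, site, parameter, unit), it can go either to a human viewer or to further machine processing without out-of-band documentation.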

      [Diagram: Federated Data Warehouse. The User Tier (data access, presentation and
       consumption) talks to the Proxy Tier, where the Proxy Server's SQLDataAdapter1,
       SQLDataAdapter2 and CustomDataAdapter homogenize data drawn from the Provider
       Tier's heterogeneous member servers (SQLServer1, SQLServer2, LegacyServer).
       User and Proxy communicate via a Web Service with uniform queries and data;
       Proxy and Providers are separated by a firewall under the Federation Contract.]
            Federated Warehouse: Data Re-Use and Synergy
    •     Data producers maintain their own workspace and resources (data, reports, comments).
    •     Part of the resources is shared by creating common virtual resources.
    •     Web-based integration of the resources can be across several dimensions:
               Spatial scale:        Local – global data sharing
               Data content:         Combination of data generated internally and externally

      [Diagram: local and global resources each contribute their shared part to a pool
       of Virtual Shared Resources (data, knowledge, tools, methods), which multiple
       users then draw on.]
•       The main benefits of sharing are data re-use, data complementing and synergy.
•       To be self-sustaining, the user-benefits of sharing need to outweigh the ‘costs’ of sharing.
     Data Entry to the Supersite Relational Data System:
1.      Batch transfer of large Supersite and other datasets to the SRDS SQL server
2.      Web-submission of relational tables by the data producers/custodians
3.      Automatic translation and transfer of NARSTO-archived DES data to SQL
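
The DES-SQL path (item 3) might look roughly like the sketch below. It assumes a flat, comma-delimited NARSTO-style layout with DATE_TIME, SITE_ID, PARAMETER and VALUE columns; the actual DES layout and column names would come from the NARSTO specification.

```python
import csv
import io

def des_to_rows(des_text):
    """Rough sketch of the DES-SQL transformer: read a flat,
    comma-delimited NARSTO-style file (layout assumed here) and
    emit rows shaped for the minimal star schema fact table."""
    reader = csv.DictReader(io.StringIO(des_text))
    return [
        (r["DATE_TIME"], r["SITE_ID"], r["PARAMETER"], float(r["VALUE"]))
        for r in reader
    ]

# Invented two-line sample in the assumed layout.
sample = (
    "DATE_TIME,SITE_ID,PARAMETER,VALUE\n"
    "2001-07-04T00:00,STL01,PM25,21.4\n"
)
```

The emitted tuples match the fact-table shape (Obs_DateTime, Site_ID, Parameter_ID, Obs_Value), so a batch insert into the SQL server completes the transfer.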

      [Diagram: EPA Supersite data, EOSDIS data and the NARSTO ORNL DES data ingest
       feed the Supersite SQL server, which answers queries and produces output, via
       the three paths above: (1) DataAdapters from Supersite and other SQL data,
       (2) direct web data input of coordinated relational tables, and (3) the
       DES-SQL Transformer.]
                Federated Data Warehouse Features

•   As much as possible, data should reside in their respective home environments. ‘Uprooted’ data
    in decoupled databases tend to decay, i.e. they cannot be easily updated, maintained or enriched.

•   Data Providers would need to ‘open up’ their SQL data servers for limited data subsets and
    queries, in accordance with a ‘contract’. However, the data structures of the Providers will not
    need to be changed.

•   Data from the providers will be transferred to the ‘federated data warehouse’ through (1) on-line
    DataAdapters, (2) manual web submission and (3) semi-automated transfer from the NARSTO
    archive
•   Retrieval of uniform data from the data warehouse facilitates integration and comparison along
    the key dimensions (space, time, parameter, method)

•   The open-architecture data warehouse (see Web Services) promotes the building of further value
    chains: Data Viewers, Data Integration Programs, Automatic Report Generators, etc.
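
One such value-adding step, temporal aggregation of uniform warehouse rows, can be sketched as follows. The row shape follows the minimal star schema; the function itself is illustrative, not part of the proposal.

```python
from collections import defaultdict

def daily_mean(rows):
    """Aggregate uniform (time, site, param, value) rows to daily
    per-site, per-parameter means -- one link in a value chain that
    consumes the warehouse's uniform output."""
    acc = defaultdict(list)
    for t, site, param, value in rows:
        day = t[:10]  # ISO date part of 'YYYY-MM-DDTHH:MM'
        acc[(day, site, param)].append(value)
    return {k: sum(v) / len(v) for k, v in acc.items()}
```

Because every dataset in the warehouse shares the same shape, the same aggregator works for any Supersite or auxiliary dataset without per-source code.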
                         Data Preparation Procedures:

•   Data gathering, QA/QC and standard formatting are to be done by the individual projects

•   The data naming standards, data ingest and archives are provided by ORNL and NASA

•   Data ingest is to be automated, aided by tools and procedures supplied by this project
     – NARSTO DES-SQL translator
     – Web submission tools and procedures
     – Metadata Catalog and I/O facilities

•   Data submissions and access will be password protected as set by the community.

•   Submitted data will be retained in a temporary buffer space and following verification
    transferred to the shared SQL database.

•   The data access, submissions and other ‘transactions’ will be automatically recorded and
    summarized in human-readable reports.
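
The transaction recording and report roll-up could be sketched as below; the log fields and report layout are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

LOG = []  # in-memory stand-in for the transaction log table

def record(user, action, dataset):
    """Record one access/submission 'transaction' (fields assumed)."""
    LOG.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "dataset": dataset,
    })

def summary():
    """Roll the transaction log up into a short human-readable report:
    one line per (action, dataset) pair, most frequent first."""
    counts = Counter((e["action"], e["dataset"]) for e in LOG)
    lines = [
        f"{n:4d}  {action:<8} {ds}"
        for (action, ds), n in counts.most_common()
    ]
    return "\n".join(lines)
```

In the real system the log would live in a SQL table rather than memory, but the record-then-summarize flow is the same.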
    Data Catalog
•   The Data Catalog will be
    maintained through the
    CAPITA website.

•   Limited metadata (based on
    user consensus) will be
    recorded for each dataset

•   User feedback on individual
    datasets will be through
    comments/feedback pages

•   An example is the data
    section of the St. Louis
    Supersite website.
    SRDS and Federated Data Warehouse Technologies

•   Server hardware: 2 identical Dell PowerEdge 4400 servers, (SQL server and Web
    Server); dual processor Xeon, 1 GB memory, 260 GB RAID drives

•   SQL Software: Microsoft SQL 2000 Enterprise Development Server

•   Web Server: Microsoft IIS 2000, including the Data Transformation Services for data

•   Programming Environment: Microsoft .Net languages (VB, C#) for creating and
    programming web objects and ASP.NET to create the distributed web pages.

•   Note: The rapid development of distributed applications was recently made possible by the ubiquity of
    SOAP/XML as a data transport protocol, and Web Services/.NET as the distributed programming environment. In
    fact, .NET is still in Beta 2.
                         Related CAPITA Projects
•   EPA Network Design Project (~$150K/yr –April 2003). Development of novel quantitative
    methods of network optimization. The network performance evaluation is conducted using
    the complete PM FRM data set in AIRS which will be available for input into the SRDS.

•   EPA WebVis Project (~$120K/yr - April 2003). Delivery of current visibility data to the
    public through a web-based system. The surface met data have been transferred into the SQL
    database since March 2001 and will be available to SRDS.

•   NSF Collaboration Support Project (~$140K/yr – Dec 2004). Continuing development of
    interactive web sites for community discussions and for web-based data sharing; (directly
    applicable to this project)

•   NOAA ASOS Analysis Project (~$50K/yr - May 2002). Evaluate the potential utility of the
    ASOS visibility sensors (900 sites, one-minute resolution) as a PM surrogate. Data are now
    available for April-October 2001 and can be incorporated into the Supersite Relational Data
    System.
•   St. Louis Supersite Project website (~$50K/yr – Dec 2003) . The CAPITA group maintains
    the St. Louis Supersite website and some auxiliary data. It will also be used for this project
    Federated Data Warehouse Applications: Distributed ‘Voyager’

   [Diagram: XDim/OLAP data arrive via XML Web services, GIS data via OpenGIS
    services, and text data/web pages via HTTP services. In the Data Warehouse Tier,
    the Broker's Connection Manager, Cursor-Query Manager, Data Access Manager and
    Data View Manager handle the views, connections, data access and cursor. In the
    Data View & Process Tier, the data are rendered by linked Data Views: layered
    map, time chart, scatter chart and text/table, coordinated by a shared cursor.]
          Live Demo:
     Distributed ‘Voyager’

• Browsing by space, time and parameter
  dimensions (parameters now ‘hard-coded’)

• Data selection from distributed servers

• Data overlay in time and map views

• ‘Clickable’ data map and time views
  for browsing.
                    Supersite Relational Data System: Schedule

         Year 1 - 2002                         Year 2 - 2003                      Year 3 - 2004

     [Timeline: SQL implementation & testing, then Supersite data entry (Year 1);
      auxiliary data entry and other coordinated data entry (Years 1-2); Supersite,
      coordinated and auxiliary data updates (Years 2-3).]

•    The first four months are devoted to the design of the relational database, associated data transformers and I/O,
     submitted to the Supersite workgroups for comment
•    Within six months, Supersite data preparation and entry begin
•    In Year 2 and Year 3, data sets will be updated by providers as needed; system accessible to data
     user community
                Personnel, Management and Facilities

•   PI, R. B. Husar (10%); Kari Hoijarvi (25%), with software experience at CAPITA, Microsoft and Vaisala.
•   20% of project budget ($12k/yr) to consultants: J. Watson, DRI, W. White and J. Turner, WU.
•   Collaborators, (CAPITA associates): B. Schichtel, CIRA, S. Falke, EPA, M. Bezic, Microsoft.

•   This project is a sub-project of the St. Louis-Midwest Supersite project, Dr. Jay Turner, PI.
•   Special focus is on supporting large scale, crosscutting, and integrative analysis.
•   This project will leverage the other CAPITA data sharing projects

                                Resources and Facilities
•   CAPITA has the ‘largest known privately held collection of air quality, meteorological and emission
    data’, available in uniform Voyager format and extensively accessed from the CAPITA website
•   The computing and communication facilities include two servers, ten workstations and laptops,
    connected internally and externally through high-speed networks.
•   Software development tools include Visual Studio, part of the .NET distributed development
    environment.