					              Data Management Services

                      Reagan W. Moore
             San Diego Supercomputer Center
        9500 Gilman Drive, La Jolla, CA 92093-0505
        Phone: 858 534-5073 FAX: 858 534-5152
                 E-mail: moore@sdsc.edu
                http://www.npaci.edu/DICE/




National Partnership for Advanced Computational Infrastructure   San Diego Supercomputer Center
Data Intensive Computing Environment Group
   Staff
   •   Reagan Moore
   •   Chaitan Baru
   •   Sheau Yen Chen
   •   Charles Cowart
   •   Amarnath Gupta
   •   George Kremenek
   •   Bertram Ludäscher
   •   Richard Marciano
   •   Arcot Rajasekar
   •   Abe Singer
   •   Michael Wan
   •   Ilya Zaslavsky
   •   Bing Zhu

   Students - GSRA
   •   Martin Kuhl
   •   Liying Sui
   •   Yang Yu
   •   Valter Crescenzi

   Students - Undergrad Interns
   •   Peter Shin
   •   Roman Olshanowsky
   •   Shabbar Tambawala
   •   Pratik Mukhopadhyay
   •   +/- NN



                                     Topics

• Data management systems
• Examples of large-scale data management
• Characterization of data, information, and
  knowledge for digital libraries




     Evolution of Data Management
Collection - managed data
      Use database to organize attributes about data objects
      Separate information management from data storage
      Support APIs for information discovery, data access

    Database A               Storage Resource Broker                    Storage

   Integration accomplished through a data handling system
   which characterizes the storage systems

   Evolution of Data Management
Distributed Data Collection
       Same name space
       Same schema
       Separate administration domains
       Heterogeneous database instances

  Database A               Storage Resource Broker                    Database B

Integration requires the ability to characterize both the
schemas and the table structures of each information repository

                              Data Grids
Data Grid - linking multiple data collections
      Separate name spaces
      Separate schema
      Separate administration domains
      Heterogeneous database instances

 Database A                           Data grid                     Database B

The data grid is itself a collection that provides
mechanisms to hide latency and manage semantics
Astronomy Sky Survey Data Grid

[Diagram: layered architecture, numbered as in the original figure]
1. Portals and Workbenches
2. Knowledge & Resource Management: Concept space
3. Metadata View, Data View, Bulk Data Analysis, Catalog Analysis
   (interface: Standard APIs and Protocols)
4. Grid: Security, Caching, Replication, Backup, Scheduling
5. Information Discovery, Metadata Delivery, Data Discovery, Data Delivery
   (interface: Standard Metadata format, Data model, Wire format)
6. Catalog Mediator, Data Mediator
   (interface: Catalog/Image Specific Access)
7. Compute Resources, Derived Collections, Catalogs, Data Archives

        Federated Digital Libraries
Virtual Data Grid - linking multiple data collections
      Ability to execute processes to recreate derived data

   Database A                                                         Database B
                               Virtual Data Grid
    Services                                                           Services

 The virtual data grid integrates data grid and digital library
 technology to manage processes
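
As a rough illustration of "ability to execute processes to recreate derived data", the sketch below records, for each derived object, the process and inputs that produce it, and re-runs that process when the object is missing. The names `VirtualDataGrid`, `register_derived`, and `sort_tokens` are hypothetical and are not taken from any system named in these slides.

```python
from typing import Callable, Dict, List, Tuple


class VirtualDataGrid:
    """Toy store that keeps data objects and the processes that derive them."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        # Hypothetical derivation record: name -> (process, names of inputs).
        self.derivations: Dict[str, Tuple[Callable[..., bytes], List[str]]] = {}

    def put(self, name: str, data: bytes) -> None:
        self.objects[name] = data

    def register_derived(self, name: str, process: Callable[..., bytes],
                         inputs: List[str]) -> None:
        self.derivations[name] = (process, list(inputs))

    def get(self, name: str) -> bytes:
        """Return the object, re-running its recorded process if it is missing."""
        if name in self.objects:
            return self.objects[name]
        if name not in self.derivations:
            raise KeyError(name)
        process, inputs = self.derivations[name]
        data = process(*[self.get(i) for i in inputs])
        self.objects[name] = data  # materialize the derived product
        return data


def sort_tokens(raw: bytes) -> bytes:
    """Example process: sort whitespace-separated tokens."""
    return b" ".join(sorted(raw.split()))


grid = VirtualDataGrid()
grid.put("raw", b"3 1 2")
grid.register_derived("sorted", sort_tokens, ["raw"])
first = grid.get("sorted")    # derived on demand
del grid.objects["sorted"]    # simulate loss of the derived product
second = grid.get("sorted")   # recreated from the recorded process
```

Storing the process description next to the data is what lets the derived product be treated as recoverable rather than as something that must itself be archived.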

[Diagram: NSDL components; labels recovered from the original figure]
• User Interfaces: Portals & Clients
• NSDL Usage Enhancement
• Delivery, Presentation, Aggregation - Channels
• Information about collections
• NSDL Services, Other NSDL Services
• Core NSDL Bus: Meta-data delivery, Data delivery, Query, Global Ids, Security, Network
• Metadata & data access-based services
• NSDL Collections
• Virtual Collections & Mediators
• Referenced Items & Collections
• Core Services; Core Collection-Building Services: metadata normalizing, metadata harvesting, persistent storage
• CI Services: annotation, query transform, topic-map registry, personalization, discussion, visualization...
• Collection Building

                    Persistent Archive
Persistent archive
       Describe archived data as collections
       Describe processes used to create collections
       Manage evolution of technology

Database A                                                            Database A
                             Virtual Data Grid
 (today)                                                              (tomorrow)

The persistent archive is itself a virtual data grid that
provides mechanisms to manage relationships over time
ERA: Archival Components Concept

[Diagram: ERA Concept model; labels recovered from the original figure]
• Grid Security Infrastructure
• Storage Resource Broker/Extensible Meta-data CATalog
• Tapes, Disks, Internet
• Accessioning Workbench: Accession, Verify, Wrap & Containerize, Describe
• Archival Repository: Collection, Collection, Collection, Metadata
• Reference Workbench: Query, Rebuild, Present
• Mediation of Information using XML
• Records Schedules, Archival Research Catalog, Order Fulfillment System

      Data Management Systems
• Distributed data collections
   – Single name space
   – Distributed data storage systems
• Data Grid - integration of multiple data collections
   – Each collection has a separate name space
   – Infrastructure that interconnects the collections can use its
     own name space, containers, replication
• Virtual Data Grids - federation of digital libraries
   – In addition, support interoperability between services for
     manipulation, presentation, discovery of digital objects
• Persistent archive
   – In addition, manage evolution of technology components
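
The distinction between a collection's name space and the data grid's own name space can be sketched as follows. This is a minimal illustration under assumed names (`Collection`, `DataGrid`, `survey_a`, `survey_b`, and all paths are hypothetical), not a description of how any particular data grid implements it.

```python
from typing import Dict, Tuple


class Collection:
    """A collection with its own local name space."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, bytes] = {}

    def put(self, local_name: str, data: bytes) -> None:
        self._objects[local_name] = data

    def get(self, local_name: str) -> bytes:
        return self._objects[local_name]


class DataGrid:
    """Maps grid-level names onto (collection, local name) pairs."""

    def __init__(self) -> None:
        self._collections: Dict[str, Collection] = {}
        self._names: Dict[str, Tuple[str, str]] = {}

    def attach(self, collection: Collection) -> None:
        self._collections[collection.name] = collection

    def register(self, grid_name: str, collection_name: str,
                 local_name: str) -> None:
        if collection_name not in self._collections:
            raise KeyError(f"unknown collection: {collection_name}")
        self._names[grid_name] = (collection_name, local_name)

    def get(self, grid_name: str) -> bytes:
        collection_name, local_name = self._names[grid_name]
        return self._collections[collection_name].get(local_name)


# Two collections reuse the same local name for different objects.
survey_a = Collection("survey_a")
survey_b = Collection("survey_b")
survey_a.put("/images/field1", b"pixels from A")
survey_b.put("/images/field1", b"pixels from B")

grid = DataGrid()
grid.attach(survey_a)
grid.attach(survey_b)
grid.register("/grid/a/field1", "survey_a", "/images/field1")
grid.register("/grid/b/field1", "survey_b", "/images/field1")
```

Because the grid keeps its own names, the clash between the two local `/images/field1` entries never reaches the user of the grid.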
Distributed Environment Hurdles
• Access to data distributed across multiple
  administration domains
• Access to local name spaces
• Persistence / consistency of distributed
  digital objects

• Latency hiding mechanisms

     Distributed Data Collection
• Logical organization of distributed digital objects
  into a collection
   – Access through federated servers
   – Collection-owned data: the server at each
     storage repository runs under a collection user-ID
   – Collection attributes define a global namespace
   – Self-consistent attribute update on all data accesses
   – Support for multiple access APIs
   – Extensible support for access to any type of storage
     system (archive, file system, database)
   – Extensible collection attributes
                 Logical Collections
• Separate the organization of digital objects
  into a collection from their physical storage
  location
  – Metadata catalog to manage attributes about the
    digital objects
  – Data handling system to manage interaction
    with remote storage systems
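
A minimal sketch of the separation described above, assuming a catalog that holds a logical name, a set of attributes, and a physical location for each object. `MetadataCatalog`, `CatalogEntry`, and the location strings are hypothetical stand-ins, not the MCAT schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    logical_name: str
    physical_location: str  # hypothetical "resource:path" string
    attributes: Dict[str, str] = field(default_factory=dict)


class MetadataCatalog:
    """Keeps attributes and locations apart from the stored bits."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.logical_name] = entry

    def find(self, **criteria: str) -> List[str]:
        """Return logical names whose attributes match every criterion."""
        return sorted(
            name for name, e in self._entries.items()
            if all(e.attributes.get(k) == v for k, v in criteria.items())
        )

    def locate(self, logical_name: str) -> str:
        return self._entries[logical_name].physical_location

    def move(self, logical_name: str, new_location: str) -> None:
        """Record a migration; the logical name and attributes are unchanged."""
        self._entries[logical_name].physical_location = new_location


catalog = MetadataCatalog()
catalog.register(CatalogEntry("/sky/field1", "archive1:/tape/0042", {"band": "J"}))
catalog.register(CatalogEntry("/sky/field2", "disk1:/data/f2", {"band": "K"}))
catalog.move("/sky/field1", "disk1:/cache/f1")
```

Discovery goes through `find`, access goes through `locate`, and a migration between storage systems changes only the location field.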


 Interoperability across Data and
     Information Repositories
• Define a representation for storage that is
  independent of the implementation of the
  storage system
  – Unix file system semantics -
    Open/Close/Read/Write/Seek
• Define a representation of a collection that
  is independent of the choice of database
  – XML DTD defining schema, table structures
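
One way to picture a storage representation that is independent of the storage system is an abstract driver exposing only Open/Close/Read/Write/Seek. The sketch below is an assumption-laden illustration: `StorageDriver` and `InMemoryDriver` are hypothetical names, and a real driver would wrap an archive, file system, or database rather than a dictionary.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class StorageDriver(ABC):
    """Unix-style file semantics that every storage back end must provide."""

    @abstractmethod
    def open(self, path: str) -> int: ...

    @abstractmethod
    def close(self, fd: int) -> None: ...

    @abstractmethod
    def read(self, fd: int, size: int) -> bytes: ...

    @abstractmethod
    def write(self, fd: int, data: bytes) -> int: ...

    @abstractmethod
    def seek(self, fd: int, offset: int) -> None: ...


class InMemoryDriver(StorageDriver):
    """Stand-in back end; holds file contents in a dictionary."""

    def __init__(self) -> None:
        self._files: Dict[str, bytearray] = {}
        self._handles: Dict[int, List] = {}  # fd -> [path, offset]
        self._next_fd = 0

    def open(self, path: str) -> int:
        self._files.setdefault(path, bytearray())
        fd = self._next_fd
        self._next_fd += 1
        self._handles[fd] = [path, 0]
        return fd

    def close(self, fd: int) -> None:
        del self._handles[fd]

    def read(self, fd: int, size: int) -> bytes:
        path, offset = self._handles[fd]
        data = bytes(self._files[path][offset:offset + size])
        self._handles[fd][1] = offset + len(data)
        return data

    def write(self, fd: int, data: bytes) -> int:
        path, offset = self._handles[fd]
        buf = self._files[path]
        if offset > len(buf):  # pad with zeros when writing past the end
            buf.extend(b"\0" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data
        self._handles[fd][1] = offset + len(data)
        return len(data)

    def seek(self, fd: int, offset: int) -> None:
        self._handles[fd][1] = offset


driver: StorageDriver = InMemoryDriver()
fd = driver.open("/collection/object1")
driver.write(fd, b"hello world")
driver.seek(fd, 6)
tail = driver.read(fd, 5)
driver.close(fd)
```

Client code written against `StorageDriver` does not change when a different back end is substituted.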
  Defining Collection Attributes
• Composing schema - define sets of
  attributes that are needed for each collection
  function
        •   SRB metadata - Unix file system semantics
        •   Provenance metadata - Dublin Core
        •   Resource metadata - User access control lists
        •   Discipline metadata - User defined attributes
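
Composing a schema from attribute sets might look like the following sketch. The set names follow the four kinds of metadata listed above, but the individual attributes are assumptions chosen for illustration: `path`, `size`, `owner`, and `acl` are guesses at what such sets could hold, and `title`, `creator`, `date` are three of the Dublin Core elements.

```python
from typing import Dict, Iterable, List

# Assumed contents for each attribute set; only the set names come from the slide.
ATTRIBUTE_SETS: Dict[str, List[str]] = {
    "srb": ["path", "size", "owner"],            # file-system style attributes
    "provenance": ["title", "creator", "date"],  # subset of Dublin Core
    "resource": ["acl"],                         # user access control list
}


def compose_schema(set_names: Iterable[str],
                   discipline: Iterable[str] = ()) -> List[str]:
    """Build a collection schema as a list of qualified attribute names."""
    schema: List[str] = []
    for set_name in set_names:
        schema.extend(f"{set_name}.{attr}" for attr in ATTRIBUTE_SETS[set_name])
    # Discipline metadata is user defined, so it is passed in by the caller.
    schema.extend(f"discipline.{attr}" for attr in discipline)
    return schema


def missing_attributes(record: Dict[str, object], schema: List[str]) -> List[str]:
    """List schema attributes that a record does not yet supply."""
    return [name for name in schema if name not in record]


schema = compose_schema(["srb", "provenance"], discipline=["wavelength"])
record = {"srb.path": "/sky/field1", "srb.size": 1024,
          "provenance.title": "Field 1"}
gaps = missing_attributes(record, schema)
```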



SDSC Storage Resource Broker
& Meta-data Catalog
[Diagram: SRB architecture; labels recovered from the original figure]
• Application
• Access interfaces: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog Predicate; Web; Third-party copy
• SRB
• MCAT: Resource, User, User Defined, Dublin Core, Application Meta-data
• Archives: HPSS, ADSM, HRM, UniTree, DMF
• File Systems: Unix, NT, Mac OSX
• Databases: DB2, Oracle, Postgres
• Remote Proxies: DataCutter


               Latency Management
• Data streaming
   – Overlap I/O access time with data movement
• Data caching
   – Create a local copy to minimize I/O access time
• Data replication
   – Choose between multiple sources for data access
• Data aggregation
   – Use containers to hold multiple small data sets
• I/O aggregation
   – Use remote proxies to do remote filtering/data subsetting
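
Two of the mechanisms above, data aggregation with containers and choosing between replicas, can be sketched in a few lines. This is only an illustration: `Container`, `choose_replica`, the site names, and the latency figures are all hypothetical.

```python
from typing import Dict, Tuple


class Container:
    """Packs many small data sets into one blob so they move as one transfer."""

    def __init__(self) -> None:
        self._blob = bytearray()
        self._index: Dict[str, Tuple[int, int]] = {}  # name -> (offset, length)

    def add(self, name: str, data: bytes) -> None:
        self._index[name] = (len(self._blob), len(data))
        self._blob.extend(data)

    def extract(self, name: str) -> bytes:
        offset, length = self._index[name]
        return bytes(self._blob[offset:offset + length])

    def __len__(self) -> int:
        return len(self._blob)


def choose_replica(latency_seconds: Dict[str, float]) -> str:
    """Pick the replica site with the lowest estimated access latency."""
    return min(latency_seconds, key=lambda site: latency_seconds[site])


container = Container()
container.add("a", b"xx")
container.add("b", b"yyy")
small = container.extract("b")

# Illustrative latency estimates for two hypothetical replica sites.
best = choose_replica({"site_a": 0.02, "site_b": 0.15})
```

The container pays the per-request latency once for the whole blob instead of once per small data set.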
 Minimizing Latency in I/O Pipes
 Source                                    Network                          Destination
 Remote Proxies                            Data Aggregation                 Staging
 Replication                               Streaming                        Caching

         Knowledge Management
• Must manage semantic relationships between the
  multiple name spaces
   – Data Grid
• Must manage procedural relationships between
  digital library services
   – Federated digital library
• Must manage structural relationships between
  different versions of software systems
   – Persistent archive


              Differentiating between Data,
              Information, and Knowledge
• Data
   – Digital object
   – Objects are streams of bits
• Information
   – Any tagged data, which is treated as an attribute.
   – Attributes may be tagged data within the digital object, or tagged data that is
     associated with the digital object
• Knowledge
   – Relationships between attributes
   – Relationships can be procedural/temporal, structural/spatial, logical/semantic,
     functional
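
The three levels can be written down as simple data types. The sketch below is one possible encoding under assumed names (`DigitalObject`, `Attribute`, `Relationship`); the tags and dates are made-up example values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class RelationshipKind(Enum):
    PROCEDURAL = "procedural/temporal"
    STRUCTURAL = "structural/spatial"
    LOGICAL = "logical/semantic"
    FUNCTIONAL = "functional"


@dataclass(frozen=True)
class DigitalObject:
    """Data: a stream of bits."""
    bits: bytes


@dataclass(frozen=True)
class Attribute:
    """Information: tagged data inside, or associated with, a digital object."""
    describes: DigitalObject
    tag: str
    value: str
    embedded: bool  # True if tagged within the object, False if associated


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a typed relationship between two attributes."""
    kind: RelationshipKind
    source: Attribute
    target: Attribute


image = DigitalObject(bits=b"\x00\x01\x02")
observed = Attribute(image, "observation_date", "2000-01-01", embedded=True)
calibrated = Attribute(image, "calibration_date", "2000-02-01", embedded=False)
knowledge: List[Relationship] = [
    Relationship(RelationshipKind.PROCEDURAL, observed, calibrated),
]
```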

Types of Knowledge Relationships
• Logical / semantic
   – Digital Library cross-walks
• Temporal / procedural
   – Workflow systems
• Spatial / structural
   – GIS systems
• Functional / algorithmic
   – Scientific feature analysis
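
As an example of a logical/semantic relationship, a cross-walk can be sketched as a mapping between attribute names in two schemas. Here a hypothetical local schema is mapped onto Dublin Core element names; the local names, the record, and the choice to drop unmapped attributes are assumptions made for illustration.

```python
from typing import Dict

# Hypothetical local attribute names mapped onto Dublin Core element names.
CROSSWALK: Dict[str, str] = {
    "author": "creator",
    "name": "title",
    "created": "date",
}


def apply_crosswalk(record: Dict[str, str],
                    crosswalk: Dict[str, str]) -> Dict[str, str]:
    """Rename attributes via the cross-walk; unmapped attributes are dropped."""
    return {crosswalk[k]: v for k, v in record.items() if k in crosswalk}


local = {"author": "A. Researcher", "name": "Field 1", "instrument": "camera"}
mapped = apply_crosswalk(local, CROSSWALK)
```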

  Knowledge Based Digital Libraries
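[Diagram: layers of a knowledge-based digital library]

              Ingest Services                  Management                            Access Services
Knowledge     Relationships Between Concepts   Knowledge Repository for Rules        Knowledge or Topic-Based Query / Browse
Information   Attributes, Semantics            Information Repository                Attribute-based Query
Data          Fields, Containers, Folders      Storage (Replicas, Persistent IDs)    Feature-based Query

Interface labels shown between the boxes:
• Knowledge layer: XTM DTD, Rules - KQL
• Between Knowledge and Information: (Model-based Access)
• Information layer: XML DTD, SDLIP
• Between Information and Data: (Data Handling System - SRB)
• Data layer: MCAT/HDF, Grids
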

    Information Management Projects
• Digital Libraries
    –   NSF Digital Library Initiative, Phase II - UCSB, Stanford
    –   Digital Embryo digital library - GMU
    –   NPACI Digital Sky - Caltech 2MASS sky survey
    –   CDL - AMICO
    –   NSF NSDL - UCAR / DLESE
• Grid Environments
    –   NASA Information Power Grid - NASA Ames
    –   DOE Data Visualization Corridor - LLNL
    –   DOE Particle Physics Data Grid - Stanford, Caltech
    –   NSF Grid Physics Network - U Fl
• Persistent Archives
    – NARA Persistent Archive
    – NHPRC - Scalable archives

              Further Information

               http://www.npaci.edu/DICE



