NPACI Neuroscience

					Storage Resource Broker

       Modern Data Management

                Reagan W. Moore

• Data management evolution
  • Shared collections
  • Digital Libraries
  • Persistent Archives
• Building shared collections
  • Project level / National level / International
• Demonstration of shared collections
  • Access to collections at SDSC
Types of Data Management
• File system (AFS)
   • Provides caching at remote sites, uses single authentication system
• Backup system (Veritas)
   • Provides time-based copies of data, tools for re-loading backups
• Database system (Oracle 10g IFS)
   • Can link metadata to files on an Internet File System
• Archive system (HPSS)
   • Manages data stored on tape, supports parallel I/O streams
• Persistent object environment (Avaki)
   • Provides vaults for storing objects
• Globus toolkit
   • Provides differentiated services for building a data grid
Data Management Environments

• Data grids
  • Manage shared collections
• Digital libraries
  • Provide discovery, browsing, presentation services on
    top of collections
• Persistent archives
  • Manage technology evolution while the authenticity
    and integrity of the assembled collection is preserved
• Real-time sensor networks
  • Manage access to real-time data streams from
    thousands of sensors
  Generic Infrastructure

• Can a single system provide all of the
  features needed to implement each type
  of data management system, while
  supporting access across
  administrative domains and managing
  data stored in multiple types of storage
  systems?
• Answer is data grid technology
 Types of Data Management
• File system (AFS)
   • Data grid manages replication, parallel I/O, containers
• Backup system (Veritas)
   • Data grid supports replicas, versions, and snapshots of files and
• Database system (Oracle 10g IFS)
   • Data grid virtualizes catalogs - schema extension, bulk metadata load
• Archive system (HPSS)
   • Data grid integrates access across archives and file systems
• Persistent object environment (Avaki)
   • Data grid manages user-defined metadata and collection hierarchy
• Globus toolkit - set of differentiated services
   • Data grid manages consistent state information
      Shared Collections
• Purpose of SRB data grid is to enable the
  creation of a collection that is shared between
  academic institutions
  •   Register digital entity into the shared collection
  •   Assign owner, access controls
  •   Assign descriptive, provenance metadata
  •   Manage state information
       • Audit trails, versions, replicas, backups, locks
       • Size, checksum, validation date, synchronization date, …
  • Manage interactions with storage systems
       • Unix file systems, Windows file systems, tape archives, …
  • Manage interactions with preferred access mechanisms
       • Web browser, Java, WSDL, C library, …
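The registration steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SRB implementation: the names `DigitalEntity`, `SharedCollection`, `register`, and `grant` are hypothetical, the catalog is an in-memory dictionary, and SHA-256 is chosen only as an example checksum.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib


@dataclass
class DigitalEntity:
    """Hypothetical catalog record for one registered digital entity."""
    logical_name: str
    owner: str
    size: int
    checksum: str
    acl: dict = field(default_factory=dict)          # user -> permission
    metadata: dict = field(default_factory=dict)     # descriptive / provenance
    replicas: list = field(default_factory=list)     # physical locations
    audit_trail: list = field(default_factory=list)  # (time, user, action)


class SharedCollection:
    """Hypothetical in-memory stand-in for a shared collection catalog."""

    def __init__(self):
        self.entities = {}

    def register(self, logical_name, owner, content, location, metadata=None):
        # Record owner, access controls, metadata, and state information.
        entity = DigitalEntity(
            logical_name=logical_name,
            owner=owner,
            size=len(content),
            checksum=hashlib.sha256(content).hexdigest(),
            acl={owner: "write"},
            metadata=dict(metadata or {}),
            replicas=[location],
        )
        entity.audit_trail.append((datetime.now(timezone.utc), owner, "register"))
        self.entities[logical_name] = entity
        return entity

    def grant(self, logical_name, user, permission):
        # Extend the access controls and note the change in the audit trail.
        entity = self.entities[logical_name]
        entity.acl[user] = permission
        entity.audit_trail.append(
            (datetime.now(timezone.utc), entity.owner, f"grant {permission} to {user}")
        )
```

The point of the sketch is that ownership, access controls, metadata, and state information all hang off the logical name rather than off a path in any one storage system.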
Federated Server Architecture
[Figure: federated server architecture. Labels: Read Application; Logical Name or Attribute Condition; SRB server; SRB agent (two); Brokering; Spawning; Data Access; Parallel Data Access; replicas R1 and R2; numbered steps 1-6.]
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
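The three labelled steps of the read path can be sketched as follows. This is a toy illustration, not SRB code: `CATALOG`, `AUDIT_LOG`, `resolve_read`, and the host and path strings are all hypothetical, and the final replica-selection rule (caller preference, else the first replica by name) is an assumption made only so the example returns something.

```python
class AccessDenied(Exception):
    """Raised when the user may not read the requested entity."""


# Hypothetical catalog: logical name -> replicas and permitted readers.
CATALOG = {
    "/collection/demo/file1": {
        "replicas": {"R1": "hostA:/vault/f1", "R2": "hostB:/archive/f1"},
        "readers": {"alice"},
    }
}
AUDIT_LOG = []


def resolve_read(logical_name, user, preferred=None):
    entry = CATALOG[logical_name]        # 1. logical-to-physical mapping
    replicas = entry["replicas"]         # 2. identification of replicas
    allowed = user in entry["readers"]   # 3. access control ...
    AUDIT_LOG.append((user, logical_name, "read", allowed))  # ... and audit
    if not allowed:
        raise AccessDenied(f"{user} may not read {logical_name}")
    # Assumed selection rule: honour the caller's preference if it exists.
    name = preferred if preferred in replicas else sorted(replicas)[0]
    return name, replicas[name]
```

The application only ever supplies the logical name; which physical copy is read is decided on the server side.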
  Generic Infrastructure
• Digital libraries now build upon data grids to
  manage distributed collections
  • DSpace digital library - MIT and Hewlett-Packard
  • Fedora digital library - Cornell University and University
    of Virginia
• Persistent archives build upon data grids to
  manage technology evolution
  • NARA research prototype persistent archive
  • California Digital Library - Digital Preservation Repository
  • NSF National Science Digital Library persistent archive
  Southern California Earthquake Center
• Intuitive User Interface
    – Pull-Down Query Menus
    – Graphical Selection of Source Model
    – Clickable LA Basin Map extraction (Olsen)
• Access SCEC Digital Library
    – Data stored in a data grid
    – Annotated by modelers
    – Standard naming convention
    – Automated extraction of selected data and metadata
    – Management of
[Figure: SCEC Digital Library interface. Labels: Select Scenario (Fault Model, Source Model); Select Receiver (Lat/Lon); Output Time History Seismograms; SCEC Community]
  Terashake Data Handling
• Simulate 7.7 magnitude
  earthquake on San Andreas
   • 50 Terabytes in a simulation
   • Move 10 Terabytes per day
• Post-Processing of wave field
   • Movies of seismic wave propagation
   • Seismogram formatting for interactive
     on-line analysis
   • Velocity magnitude
   • Displacement vector field
   • Cumulative peak maps
   • Statistics used in visualizations
   • Register derived data products into
     SCEC digital library
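The two volume figures above imply a transfer time and a sustained rate, worked out here as a small Python calculation. The decimal definition of a terabyte (10^12 bytes) and the function names are assumptions for illustration; the slide does not say which unit convention was used.

```python
TB = 10**12             # assumed decimal terabyte, in bytes
SECONDS_PER_DAY = 86400


def days_to_move(total_tb, rate_tb_per_day):
    """Days needed to move total_tb at a steady rate_tb_per_day."""
    return total_tb / rate_tb_per_day


def sustained_rate_mb_per_s(rate_tb_per_day):
    """Average rate in MB/s (10**6 bytes) needed to sustain the daily volume."""
    return rate_tb_per_day * TB / SECONDS_PER_DAY / 10**6


print(days_to_move(50, 10))                    # 5.0
print(round(sustained_rate_mb_per_s(10), 1))   # 115.7
```

Under these assumptions, moving one 50 TB simulation takes five days and requires roughly 116 MB/s held continuously around the clock.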

Sensor Network Data Integration
[Figure: sensor data streams. Labels: Wind Speed; Seismic; Rain start; Fire start. Frank Vernon - UCSD/SIO]
[Figure: Chile, June 13, 2005, Mw 7.9. Frank Vernon - UCSD/SIO]
National Science Digital Library

 • URLs for educational material for all
   grade levels registered into repository
   at Cornell
 • SDSC crawls the URLs, registers the
   web pages into a SRB data grid, builds
   a persistent archive
   • 750,000 URLs
   • 13 million web pages
   • About 3 TBs of data
Worldwide University Network Data Grid

•   SDSC
•   Manchester
•   Southampton
•   White Rose
•   NCSA
•   U. Bergen

• A functioning, general
  purpose international
  Data Grid for academic
[Figure: Manchester-SDSC mirror]
          KEK Data Grid

•   Japan
•   Taiwan
•   South Korea
•   Australia
•   Poland
•   US

• A functioning, general
  purpose international
  Data Grid for high-energy physics
BaBar High-energy Physics

• Stanford Linear
• Lyon, France
• Rome, Italy
• San Diego

• A functioning
  international Data
  Grid for high-energy
  physics
• Moved over 100 TBs of data
    Astronomy Data Grid

• Chile
• Tucson, Arizona
• NCSA, Illinois

• A functioning
  international Data Grid
  for Astronomy
• Moved over 400,000 images
 International Institutions (2005)
Project                                                   Institution
Data management project                                   British Antarctic Survey, UK
eMinerals                                                 Cambridge e-Science Center, UK
Sickkids Hospital in Toronto                              Canada
Welsh e-Science Centre                                    Cardiff University, UK
Australian Partnership for Advanced Computing Data Grid   Victoria, Australia
Australian Partnership for Advanced Computing Data Grid   Commonwealth Scientific and Industrial Research Organization, Australia
Australian Partnership for Advanced Computing Data Grid   University of Technology, Australia
Center for Advanced Studies, Research, and Development    Italy
LIACS (Leiden Inst. of Comp. Sci)                         Leiden University, The Netherlands
Australian Partnership for Advanced Computing Data Grid   Melbourne, Australia
Monash E-Research Grid                                    Monash University, Australia
Computational Modelling                                   University of Queensland, Australia
Virtual Tissue Bank                                       Osaka University, Japan
Cybermedia Center                                         Osaka University, Japan
Belfast e-Science Centre                                  Queen's University, UK
Information Technology Department                         Sejong University, South Korea
Nanyang Centre for Supercomputing                         Singapore
National University (Biology data grid)                   Singapore
Protein structure prediction                              Taiwan University, Taiwan
 Storage Resource Broker Collections at SDSC (8/2/2005)

Collection                                                               GBs of data stored   Number of files   Users with ACLs
Data Grid
NSF/ITR - National Virtual Observatory                                               53,862         9,536,751               100
NSF - National Partnership for Advanced Computational Infrastructure                 36,149         7,539,180               380
Static collections - Hayden planetarium                                               8,013           161,352               227
Pzone - public collections                                                           12,998         6,707,952                68
NSF/NPACI - Biology and Environmental collections                                    40,155            76,083                67
NSF/NPACI - Joint Center for Structural Genomics                                     15,731         1,577,260                55
NSF - TeraGrid, ENZO Cosmology simulations                                          176,730         2,125,945             3,267
NIH - Biomedical Informatics Research Network                                        10,561         7,596,888               303
Digital Library
NSF/NPACI - Long Term Ecological Reserve                                                256             9,033                36
NSF/NPACI - Grid Portal                                                               2,620            53,048               460
NIH - Alliance for Cell Signaling microarray data                                       741            84,594                21
NSF - National Science Digital Library SIO Explorer collection                        2,733         1,083,998                27
NSF/ITR - Southern California Earthquake Center                                     131,010         2,702,421                73
Persistent Archive
NHPRC Persistent Archive Testbed (Kentucky, Ohio, Michigan, Minnesota)                  100           382,186                28
UCSD Libraries archive                                                                4,147           408,050                29
NARA - Research Prototype Persistent Archive                                          1,478           893,434                58
NSF - National Science Digital Library persistent archive                             3,600        27,034,150               136
TOTAL                                                                                501 TB        68 million             5,335
Storage Resource Broker 3.3.1

[Figure: SRB 3.3.1 layered architecture, listed top to bottom]
• Access mechanisms
  • C Library; Unix Shell; Linux I/O; C++; NT Browser; Kepler Actors
  • DLL / Python, Perl, Windows
  • HTTP, DSpace, OpenDAP, GridFTP
  • OAI, WSDL, (WSRF)
• Federation Management
• Consistency & Metadata Management / Authorization, Authentication, Audit
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Database Abstraction
  • Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
• Storage Repository Abstraction
  • Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  • ORB
  • File Systems - Unix, NT, Mac OSX
  • Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
        SRB Objectives
• Automate all aspects of data discovery,
  access, management, analysis,
  • Security paramount
  • Distributed data
• Provide distributed data support for
  •   Data sharing - data grids
  •   Data publication - digital libraries
  •   Data preservation - persistent archives
  •   Data collections - Real time sensor data
 SRB Developers
Reagan Moore       - PI
Michael Wan        - SRB Architect
Arcot Rajasekar    - SRB Manager
Wayne Schroeder    - SRB Productization
Charlie Cowart     - inQ
Lucas Gilbert      - Jargon
Bing Zhu           - Perl, Python, Windows
Antoine de Torcy   - mySRB web browser
Sheau-Yen Chen     - SRB Administration
George Kremenek    - SRB Collections
Arun Jagatheesan   - Matrix workflow
Marcio Faerman     - SCEC Application
Sifang Lu          - ROADnet Application
Richard Marciano   - SALT persistent archives

            75 FTE-years of support
            About 300,000 lines of C
• 1995 - DARPA Massive Data Analysis Systems
• 1997 - DARPA/USPTO Distributed Object Computation Testbed
• 1998 - NSF National Partnership for Advanced Computational Infrastructure
• 1998 - DOE Accelerated Strategic Computing Initiative data grid
• 1999 - NARA persistent archive
• 2000 - NASA Information Power Grid
• 2001 - NLM Digital Embryo digital library
• 2001 - DOE Particle Physics data grid
• 2001 - NSF Grid Physics Network data grid
• 2001 - NSF National Virtual Observatory data grid
• 2002 - NSF National Science Digital Library persistent archive
• 2003 - NSF Southern California Earthquake Center digital library
• 2003 - NIH Biomedical Informatics Research Network data grid
• 2003 - NSF Real-time Observatories, Applications, and Data management Network
• 2004 - NSF ITR, Constraint based data systems
• 2005 - LC Digital Preservation Lifecycle Management
• 2005 - LC National Digital Information Infrastructure and Preservation program
• SRB 1.1.8 - December 15, 2000
  • Basic distributed data management system
  • Metadata Catalog
• SRB 2.0 - February 18, 2003
  • Parallel I/O support
  • Bulk operations
• SRB 3.0 - August 30, 2003
  • Federation of data grids
• SRB 3.3.1 - April 6, 2005
  • Feature requests (extensible schema)
 SRB Latency Management

Source                    Network              Destination

Remote Proxies,           Data Aggregation,    Prefetch
Staging                   Containers
Replication               Streaming            Caching
Server-initiated I/O      Parallel I/O         Client-initiated I/O
 Latency Management -
    Bulk Operations
• Bulk register
  • Create a logical name for a file
  • Load context (metadata)
• Bulk load
  • Create a copy of the file on a data grid storage
• Bulk unload
  • Provide containers to hold small files and pointers
    to each file location
• Bulk delete
  • Trash can
• Sticky bits for access control, …
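The container idea mentioned above (holding many small files in one object, with a pointer to each file's location) can be sketched in Python. This is an illustration only: `pack_container`, `read_from_container`, and the (offset, length) index layout are assumptions, not the SRB container format.

```python
import io


def pack_container(files):
    """Concatenate small files into one container.

    files: dict mapping file name -> bytes.
    Returns (container_bytes, index) where index maps each name to its
    (offset, length) pointer inside the container.
    """
    buf = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (buf.tell(), len(data))  # pointer to this file
        buf.write(data)
    return buf.getvalue(), index


def read_from_container(container, index, name):
    """Extract one file from the container using its pointer."""
    offset, length = index[name]
    return container[offset:offset + length]
```

Moving one container instead of thousands of small files replaces many per-file round trips with a single large transfer, which is where the latency saving comes from.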
      Logical Name Spaces
         Data Access Methods (C library, Unix, Web Browser)

      Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
                Data access directly between application and storage
                repository using names required by the local repository
      Logical Name Spaces
         Data Access Methods (C library, Unix, Web Browser)

                                             Data Collection

      Storage Repository                         Data Grid
• Storage location                   • Logical resource name space
• User name                          • Logical user name space
• File name                          • Logical file name space
• File context (creation date,…)     • Logical context (metadata)
• Access constraints                 • Control/consistency constraints
               Data is organized as a shared collection
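A small Python sketch may help show what the logical name spaces buy: the application addresses a logical file name, and the physical location can change underneath it. The class `LogicalNameSpace`, its methods, and the host and path strings are hypothetical and stand in for the data grid catalog.

```python
class LogicalNameSpace:
    """Hypothetical mapping from logical names to physical locations."""

    def __init__(self):
        self.resources = {}  # logical resource name -> (host, root path)
        self.files = {}      # logical file name -> (logical resource, relative path)

    def add_resource(self, logical_resource, host, root):
        self.resources[logical_resource] = (host, root)

    def register_file(self, logical_file, logical_resource, relative_path):
        self.files[logical_file] = (logical_resource, relative_path)

    def physical_location(self, logical_file):
        # Resolve through both name spaces: file -> resource -> host and path.
        resource, relative_path = self.files[logical_file]
        host, root = self.resources[resource]
        return f"{host}:{root.rstrip('/')}/{relative_path}"

    def migrate(self, logical_file, new_resource, new_relative_path):
        # The logical file name is unchanged; only the mapping is updated.
        self.files[logical_file] = (new_resource, new_relative_path)
```

For example, a file registered on a disk resource can later be migrated to a tape resource while every reference to its logical name keeps working.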
Federation Between Data Grids

        Data Access Methods (Web Browser, DSpace, OAI-PMH)

         Data Collection A                    Data Collection B

            Data Grid                              Data Grid
• Logical resource name space          • Logical resource name space
• Logical user name space              • Logical user name space
• Logical file name space              • Logical file name space
• Logical context (metadata)           • Logical context (metadata)
• Control/consistency constraints       • Control/consistency constraints
               Access controls and consistency constraints
               on cross registration of digital entities
         Types of Risk
• Media failure
  • Replicate data onto multiple media
• Vendor specific systemic errors
  • Replicate data onto multiple vendor products
• Operational error
  • Replicate data onto a second administrative domain
• Natural disaster
  • Replicate data to a geographically remote site
• Malicious user
  • Replicate data to a deep archive
   How Many Replicas
• Three sites minimize risk
  • Primary site
    • Supports interactive user access to data
  • Secondary site
    • Supports interactive user access when first site is
    • Provides 2nd media copy, located at a remote site,
      uses different vendor product, independent
      administrative procedures
  • Deep archive
    • Provides 3rd media copy, staging environment for
      data ingestion, no user access
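The pairing of risks and replication strategies can be expressed as a simple check over a set of replicas. The sketch below is an assumption-laden illustration: the `Replica` fields, the `uncovered_risks` function, and the rule that "multiple" means at least two distinct values are choices made for this example, not SRB policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical description of where and how one copy is held."""
    site: str
    media: str
    vendor: str
    admin_domain: str
    deep_archive: bool = False  # True if the copy has no user access


def uncovered_risks(replicas):
    """Return the risk types that the given replicas do not yet mitigate."""
    risks = []
    if len({r.media for r in replicas}) < 2:
        risks.append("media failure")
    if len({r.vendor for r in replicas}) < 2:
        risks.append("vendor specific systemic errors")
    if len({r.admin_domain for r in replicas}) < 2:
        risks.append("operational error")
    if len({r.site for r in replicas}) < 2:
        risks.append("natural disaster")
    if not any(r.deep_archive for r in replicas):
        risks.append("malicious user")
    return risks
```

With a primary site, a remote secondary site on a different vendor product, and a deep archive, the returned list is empty, which matches the three-site recommendation.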
State of the Art Technology

• Grid - workflow virtualization
  • Support execution of jobs (processes) across
    multiple compute servers
• Data grid - data virtualization
  • Manage a shared collection that is distributed
    across multiple storage servers
• Semantic grid - information virtualization
  • Create a common understanding of information
    (metadata) across multiple collections.
For More Information

         Reagan W. Moore
 San Diego Supercomputer Center