A Simple Mass Storage System for the SRB Data Grid
Michael Wan, Arcot Rajasekar, Reagan Moore, Phil Andrews
San Diego Supercomputer Center,
University of California at San Diego
(mwan,sekar,moore,andrews)@sdsc.edu
Abstract:

The functionality that is provided by Mass Storage Systems can be implemented using data grid technology. Data grids already provide many of the required features, including a logical name space and a storage repository abstraction. We demonstrate how management of tape resources can be integrated into data grids. The resulting infrastructure has the ability to manage archival storage of digital entities on tape or other media, while maintaining copies on distributed, remote disk caches that can be accessed through advanced discovery mechanisms. Data grids provide additional levels of data management, including the ability to aggregate data into containers before storage on tape, and the ability to migrate collections across a hierarchy of storage devices.

1. Introduction

The SDSC Storage Resource Broker (SRB) [1, 2, 3, 4] is data grid middleware that provides a storage repository abstraction for transparent access to multiple types of storage resources. The SRB has been used to implement data grids (to integrate access to data distributed across multiple resources), digital libraries (to support collection-based management of distributed data), and persistent archives (to manage technology evolution). The Storage Resource Broker is in widespread use, supporting collections that have five million images replicated across multiple HPSS archives, and data grids that span the continent. Persistent archives are now being implemented using the Storage Resource Broker in support of the National Archives and Records Administration, and the National Science Foundation National Science Digital Library.

The capabilities provided by the SRB represent a unique integration of data grids, digital libraries, and persistent archives. The mechanisms that were required to integrate these three types of data handling systems turn out to be uniquely suited to the implementation of a mass storage system. The name space is managed by the digital library technology, distributed physical storage resources are managed by the data grid technology, and technology evolution is managed through the persistent archive storage repository abstractions.

The SRB is implemented as a federated client-server system, with each server managing/brokering a set of storage resources. Storage resources that are brokered by the SRB include Mass Storage Systems (MSS) such as HPSS [5], UniTree [6], DMF [7] and ADSM [8], as well as file systems. What is of great interest is that the SRB data grid can be used to implement all of the capabilities of a distributed Mass Storage System, while providing access to data stored in file systems, databases, object ring buffers, and other types of storage systems. A Mass Storage System based upon data grid technology can be implemented using virtually any type of storage device.

The motivations for implementing an MSS in a data grid are:
• Cost of licensing - Not all of our users can afford the licensing fees of commercial MSS systems. By implementing the Mass Storage System functionality within a data grid, a common software system can be used for resource federation and data sharing, as well as for data archiving.
• Efficiency and performance - An MSS that is tightly integrated with the infrastructure of the SRB data grid can take full advantage of SRB features such as file replication, server-directed parallel I/O, latency management, data-based access controls, and collection-based data management.
• Elimination of duplication of features - The SRB and MSS systems such as HPSS duplicate some features. For example, HPSS has its own disk cache that is used as a front end to a tape system. The SRB can be configured to provide a cache that serves as a front end to an HPSS resource. Since the SRB has no knowledge of the operational characteristics of the HPSS cache, it may not be able to effectively manage its own cache utilization in conjunction with the HPSS cache management. With the integration of Mass Storage System capabilities into the SRB, a single large cache pool can be used. Another example is that most MSS systems have their own authentication schemes in addition to the SRB authentication system. A single authentication system can be used if their capabilities are integrated. The resulting system is easier to administer.
• SRB already has many of the required capabilities - The features needed for a simple MSS include a name space, a storage repository abstraction, storage resource naming and data management tools. For example, the SRB Metadata Catalog (MCAT) [9] maintains a POSIX-like logical name space that eliminates the need to design a name server from scratch. Since the MCAT namespace is based on commercial database technology, the transaction performance is substantially better than that of current archives. Other reusable features will be discussed later. Leveraging these reusable features greatly reduces the effort required to implement a simple MSS in a data grid.
• Finally, an MSS based on data grids can provide a storage system that spans remote caches and distributed archival devices interconnected by a Wide Area Network. Such a logical linking of distributed devices will provide new ways for data sharing and fault tolerance not currently provided by site-located mass storage systems.

2. SRB Architecture and Features

The Storage Resource Broker (SRB) is middleware that uses distributed clients to provide uniform access to diverse storage resources. It consists of three components: the metadata catalog (MCAT) service, SRB servers for access to storage repositories, and SRB clients, connected to each other via a network.

The MCAT is implemented using a relational DBMS such as Oracle, DB2, SQL Server, PostgreSQL, or Sybase. It stores metadata associated with data sets, users and resources managed by the SRB. It maintains a POSIX-like logical name space (file names, directories and subdirectories) and provides a mapping of each logical file name to a set of physical attributes and a physical handle for data access. The physical attributes include the host name and the type of resource (UNIX file system, HPSS archive, object ring buffer, database). The physical handle for data access is the file path for UNIX file system type
resources. The MCAT server handles requests from the SRB servers. These requests include information queries as well as instructions for metadata creation and update.

The MCAT imposes additional mappings on the logical name space to support replication (one logical name mapped to multiple physical file names), soft links (a logical name mapped to another logical name), aggregation (structural mapping of a file to a location in a container), segmentation (structural mapping of a file across multiple tape media), and file-based access control (users mapped to permissions on roles for each digital entity). These mappings make it possible to organize digital entities independently of their actual storage location.

The SRB is implemented as a federated server system. Each server consists of three layers. The top-level "communication and dispatcher" layer listens for incoming client requests and dispatches the requests to the proper request handlers.

The middle layer is the logical layer or the "high-level API handler" layer. This layer handles requests in which all input parameters are given in terms of their logical representations (e.g., logical path name in the logical name space, logical resource name, logical user, etc.). Upon receiving a request, the logical layer handler queries the MCAT and translates the logical input parameters into their physical representations. It then calls upon the appropriate handler in the physical layer to perform the actual data access and movement.

The physical layer or the "low-level API handler" layer handles data access and data movement requests from its own logical layer or directly from the physical layer of other SRB servers. This layer basically consists of driver functions for the 16 most commonly used POSIX I/O functions: create, open, close, unlink, read, write, seek, sync, stat, fstat, mkdir, rmdir, chmod, opendir, closedir, and readdir. If the handler cannot handle the request locally, it will forward the request to the server that can respond.

3. Simple MSS Design

The design goals for a simple Mass Storage System are:

• Provide a distributed farm of disk cache resources backed by a tape library system. The cache system should be configured to contain any number of distributed cache resources that may or may not be on the same host as the tape system. This makes it possible to treat the disk cache as an independent level of the storage hierarchy, with the disk cache created "near" the end user.
• Provide a tape library system to control the mounting and dismounting of tapes. At a minimum, a StorageTek (STK) silo running ACSLS software should be supported.
• Provide a uniform access mechanism for data stored on the Mass Storage System or on disk caches. A file in the logical name space stored in the MSS resource should appear and behave the same as any other file stored on other resources. The physical location (on cache or tape) of the file should be totally transparent to users.
• Files should always be staged automatically to cache before any I/O operations are done. Tools for system administrators to manage the cache system are also needed, i.e., tools to synchronize files from cache to tape and purge files on cache.
• Large files should be stored in segments. The advantages of using segmented files are: the system can
handle files of very large size; and parallel data transfer between tapes and the cache system can be implemented. In our first release, although large files are stored in segments on tapes, parallel transfer between tapes and cache has not been implemented. Data transfer between the cache system and clients can be done in parallel using existing SRB infrastructure.

Based on the above design goals, the following software components are needed to build the MSS:

1. A client-server architecture with an authentication scheme appropriate for access across administration domains and a framework for exchange of information between clients and servers that can function over Wide Area Networks.
2. A federated server system that allows cache and tape resources to be located on different hosts.
3. A metadata server that maintains a logical POSIX-like name space and provides a mapping of each logical file name to its physical location.
4. Additional metadata and server functions that allow files stored in the MSS resource to appear and behave the same as any other files stored on other resources.
5. Storage resource servers that have the following capabilities:
   a. Ability to translate user requests to physical actions using metadata information maintained in the MCAT catalog.
   b. A set of driver functions for basic tape I/O operations.
   c. A set of driver functions for basic cache I/O operations.
   d. Functions to stage files from tape to cache and dump files from cache to tapes. The tape and cache resources can be distributed.
   e. Support for data transfer between the cache system and clients.
6. A tape library server whose primary function is to schedule and perform the mounting and dismounting of tapes.
7. A tape database that tracks the usage of all tapes controlled by the MSS.

4. Implementation

The SRB framework version 1.1.8 initially provided the functionalities listed above, except for the metadata needed to manage files on the MSS, the drivers for basic tape I/O functions, functions to stage files from tape to cache, and the tape library server and tape database.

A major innovation that was needed to implement an MSS within the SRB data grid was the development of a new compound resource type as a fundamental resource type within the SRB/MCAT system. Files written to a compound resource are treated as residing on a single resource. In order to allow files stored in the MSS resource to appear and behave the same as files stored on other resources, support is needed for compound digital entities. The file that is written to a compound resource can be migrated between the cache and the tape back-end within the compound resource, without requiring separate metadata attributes to describe the separate residency of the file on either component of the compound resource.

A compound resource contains multiple cache resources for a given tape resource (each of which is called an internal compound resource). When a user creates a file using a compound resource, the object created is
tagged as a compound digital entity. With the help of MCAT, the server then drills down through the compound resource and discovers all of the internal resources. It selects the cache resource where the digital entity will be stored initially. After the file operation is completed (with a "close" call), the metadata of the just created compound digital entity is updated with the physical description of the file pointing to the digital entity created in the cache resource.

A compound digital entity is treated as any other digital entity for most operations in SRB, except when the digital entity is opened for read or write. In this case, the server will check to see if the digital entity is already in a cache. If it is not, the digital entity will be staged on one of the cache resources configured in the compound resource. If the digital entity is changed, the dirty bit of the cached digital entity is set. The dirty copy is not automatically synchronized to tape. Synchronization is only done via requests by system administrators. A "dump tape" API and command have been created to allow system administrators to manage the cache system by synchronizing files in the cache system onto tape, and then purging the files from the cache system.

The ability to manipulate data that is stored on tape requires additional capabilities beyond those required for access to data on disk. A set of driver functions for basic tape I/O operations has been defined and incorporated into the SRB server. These functions include mount, dismount, open, close, read, write, seek, etc. Currently, the driver has only been tested for 3590 tape drives.

A tape library server for the STK silo running ACSLS software has been incorporated into the SRB system. Its primary function is to schedule and perform the mounting and dismounting of tapes. It uses the same authentication system and server framework as other SRB servers. Currently, tape mounting is on a first-come-first-served basis. Some amount of intelligence is built in such that when a client is done using a tape and issues a dismount request, the tape will not actually be dismounted if there is another request for the mounting of the same tape in the queue. In this case, the server will just pass the tape to the new request. Specialized queuing features can be implemented as needed.

A database schema that tracks the usage of all tapes controlled by the MSS has been incorporated in the MCAT. The schema is used to track the tape position, total bytes written, full flag, etc. for all tapes controlled by the MSS. A set of system utilities has been created for tape labeling and tape metadata ingestion, listing of tape metadata and modification of tape metadata. By managing these attributes in the MCAT catalog, it is possible to support sophisticated queries against the tape attributes and against the attributes of the digital entities within the MSS. One can readily determine all of the files resident on a given tape, identify all tapes that are filled beyond a given level, and identify all tapes that are needed to retrieve all digital entities within a logical sub-collection in the metadata catalog.

5. Comparisons with the IEEE Mass Storage System Reference Model

The SRB MSS provides similar functionality to the IEEE MSS Reference Model [10]. Compared with the implementations of the reference model, there are both similarities and differences. A major difference stems from the fact that the SRB MSS uses the underlying File System of the operating system for managing data storage instead of the mapping of bitfiles to the logical and physical volume abstractions used in the Reference Model. The use of a
File System greatly simplifies the design of the storage manager. This approach is reasonable given that the performance of modern File Systems is quite good. Another source of difference is that the SRB MSS integrates the functionality of several servers of the Reference Model into a single server. This greatly simplifies the architecture of the SRB MSS. Improved robustness and performance of the system are achieved at the expense of modularity. The design provides modular interfaces to support addition of new storage repositories and new access APIs, while aggregating all metadata into a single database.

The major components of the Reference Model include:

1. Name Server - provides a POSIX-like name space, a mapping of logical names to bitfile IDs and access control (ACL) for the name space objects.
2. Bitfile Server - provides the abstraction of logical bitfiles to its clients and handles the logical aspects of the storage and retrieval of bitfiles.
3. Storage Server - handles the physical aspect of bitfile storage and retrieval. It translates references to storage segments into references to virtual volumes and into physical volume references.
4. Mover - transfers data from a source device to a sink device.
5. Migration-Purge server - provides storage management by migrating bitfiles on disks to tapes.

The SRB MCAT server, which maintains a POSIX-like logical name space, is equivalent to the Name Server of the Reference Model. The only difference is that each SRB digital entity is mapped directly to a set of physical attributes rather than to a logical bitfile as in the Reference Model. Because of the direct mapping, the functionality of translating from logical bitfiles into physical volume references is not needed. The rest of the functionalities of the Bitfile Server, Storage Server and Mover of the Reference Model are combined into a single SRB Resource server. For tape resources, the SRB uses a tape library server for mounting and dismounting of tapes. As for the Migration-Purge server of the Reference Model, SRB has an API and a system utility that migrates files on cache to tape resources.

6. Usage Examples

The use of data grid technology to implement a Mass Storage System makes it possible to incorporate latency management capabilities directly into the architecture. For access to data distributed across multiple resources, the finite speed of light can severely limit sustainable transaction rates if the transactions are issued one by one. The SRB data grid provides multiple mechanisms to minimize the number of messages that are sent over a wide area network, including the aggregation of data into containers, the aggregation of metadata into an XML file, and the aggregation of I/O commands through the use of remote proxies.

For a Mass Storage System, the ability to aggregate data into containers is essential for achieving high performance when managing small digital entities. When the size of a digital entity is less than the tape access bandwidth multiplied by the tape latency, it becomes cost effective to work with containers of files. The size of the container is adjusted such that the time to retrieve two digital entities from the same container is shorter than the time to retrieve the two files directly from tape.

A usage scenario that illustrates the generality of the data grid based mass storage system is the storage of a container on a mixture of caches, archives,
and compound resources. This scenario requires the use of five different mappings on the logical name space:

• Mapping from the logical file name to a location within a container
• Mapping of the container to one of several replicas
• Mapping of a logical resource name to a physical resource name
• Mapping of access control lists between the user name and the requested file
• Mapping of a compound digital entity to its location in a compound resource

The SRB provides the ability to organize physical resources by a logical resource name. Writing to the logical resource name then results in the replication of the file across all of the physical resources. Separate metadata attributes are maintained for each replica of the file. The SRB also supports the aggregation of files into a container. Containers are manipulated on disk caches, and then written to an archive. A primary disk cache can be identified with multiple secondary disk caches. A primary archive can be identified with multiple secondary archives. When a primary resource is not available, the SRB will then complete the operation on a secondary resource. Note that containers can be replicated onto multiple storage repositories.

The SRB also supports compound resources composed of a cache and either a tape or archive. Writing a file to a compound resource results in the creation of a single set of metadata, with an attribute used to specify which component of the compound resource holds the data. The interesting management scenario is the creation of a logical resource name that includes a compound resource, a primary disk cache, secondary disk caches, a primary archive, and secondary archives. Writing a container to this logical resource then results in the creation of a replica of the container in the compound resource (disk cache and backend tape), and the creation of a replica on one of the disk caches. A synchronization command will cause the replica on the disk cache to be copied onto one of the archives.

The ability of data grids to support sophisticated resource management functions on top of distributed storage repositories makes it possible to greatly increase the number of options when archiving data. An example is the implementation of alternate completion scenarios, in which a file is assumed to be archived when it is written to "k" of the "n" physical resources specified by a logical resource name. Another example is the implementation of load balancing, by the writing of digital entities in turn to a list of resources specified by the logical resource name. The ability to replicate files across trees of storage resource options, instead of the traditional simple storage hierarchy, greatly increases the ability to manage archival copies of data.

A second data grid capability that greatly enhances mass storage systems is the organization of the digital entities as a hierarchical collection. It is possible to use digital library discovery mechanisms to identify relevant files within the mass storage system. The discovery mechanisms can be exercised through interactive web interfaces, or directly from applications through C library calls.

A third data grid capability that simplifies management of mass storage systems is the association of access controls with the data, rather than the storage system. This makes it possible to include sites across administration domains within the mass storage system, while simplifying administration of the system. Data grids support collection-owned files, in which access to the files is restricted to the
collection. Users authenticate themselves to the collection. The collection uses access control lists that are specified separately for each registered digital entity to determine whether a person is authorized to access a file. The collection in turn authenticates itself to the remote storage system.

A fourth data grid capability that simplifies incorporation of new technology into the mass storage system is the use of a storage repository abstraction that defines the set of operations that will be performed when accessing and manipulating digital entities. The storage repository abstraction makes it possible to write drivers for each type of storage system, without having to modify any of the higher software levels of the storage environment. The storage repository abstraction is also used to support dynamic addition of resources to the system.

7. Conclusions

Because of the cost of licensing, access efficiency and transaction performance issues, we have implemented a simple MSS in the Storage Resource Broker data grid. We were able to accomplish this task within a relatively short time (slightly over 6 man-months) because we were able to leverage existing capabilities within the SRB infrastructure. We believe this approach will radically change how archives are constructed. Managing replicas of data on low-cost storage media is as simple as making a replica of a digital entity on a disk cache. Discovery, access, and manipulation of digital entities stored on tape media can now be done through the same sophisticated interfaces that data grids provide for access to all types of storage systems.

8. Acknowledgements

This research has been sponsored by the Data Intensive Computing thrust area of the National Science Foundation project ASC 96-19020 "National Partnership for Advanced Computational Infrastructure," the NSF National Science Digital Library, the NARA supplement to the NSF NPACI program, the NSF National Virtual Observatory, and the DOE ASCI Data Visualization Corridor.
9. References

1. Baru, C., R. Moore, A. Rajasekar, and M. Wan, (1998), "The SDSC Storage Resource Broker," Proc. CASCON 98 Conference, Nov. 30-Dec. 3, 1998, Toronto, Canada.
2. SRB, (2001), "Storage Resource Broker, Version 1.1.8", SDSC (http://www.npaci.edu/dice/srb).
3. Rajasekar, A., and M. Wan, (2002), "SRB & SRBRack - Components of a Virtual Data Grid Architecture", Advanced Simulation Technologies Conference (ASTC02), San Diego, April 15-17, 2002.
4. Rajasekar, A., M. Wan, and R. Moore, (2002), "MySRB & SRB - Components of a Data Grid," The 11th International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland, July 24-26, 2002.
5. HPSS, High Performance Storage System, http://www4.clearlake.ibm.com/hpss/index.jsp.
6. UniTree, http://www.unitree.com.
7. DMF, Data Migration Facility, http://136.162.32.160/products/software/dmf.html.
8. ADSM, ADSTAR Distributed Storage Management, http://searchstorage.techtarget.com/sDefinition/0,,sid5_gci214398,00.html.
9. MCAT, (2000), "MCAT: Metadata Catalog", SDSC (http://www.npaci.edu/dice/srb/mcat.html).
10. Sam Coleman and Steve Miller, "Mass Storage System Reference Model: Version 4", Goddard Conference on Mass Storage Systems and Technologies, Volume 1, 1992.
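
As a supplementary illustration of the compound resource behavior described in Section 4 (staging to cache on open, setting a dirty bit on change, and administrator-driven synchronization and purge), the following is a minimal sketch in Python. It is not the SRB API: the class names, method names, and the in-memory dictionaries standing in for the disk cache, the tape back-end, and the MCAT metadata are all hypothetical. It also assumes that a newly created file is marked dirty until it has been copied to tape, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class EntityMeta:
    """Hypothetical stand-in for the MCAT record of a compound digital entity."""
    on_cache: bool = False
    on_tape: bool = False
    dirty: bool = False  # cache copy differs from (or is absent on) tape


class CompoundResource:
    """Toy model of one disk cache plus one tape back-end (illustrative only)."""

    def __init__(self):
        self.cache = {}    # logical name -> bytes on the disk cache
        self.tape = {}     # logical name -> bytes on tape
        self.catalog = {}  # logical name -> EntityMeta

    def create(self, name, data):
        # New files are written to the cache first. Marking them dirty is an
        # assumption of this sketch: they are not yet on tape.
        self.cache[name] = data
        self.catalog[name] = EntityMeta(on_cache=True, dirty=True)

    def open(self, name):
        # On open, stage from tape only if no cache copy exists.
        meta = self.catalog[name]
        if not meta.on_cache:
            self.cache[name] = self.tape[name]
            meta.on_cache = True
        return self.cache[name]

    def write(self, name, data):
        # I/O always goes through the cache; a change sets the dirty bit.
        self.open(name)
        self.cache[name] = data
        self.catalog[name].dirty = True

    def dump_tape(self, purge=True):
        # Administrator request: copy dirty cache files to tape, then
        # optionally purge cache copies that are safely on tape.
        for name, meta in self.catalog.items():
            if meta.on_cache and meta.dirty:
                self.tape[name] = self.cache[name]
                meta.on_tape = True
                meta.dirty = False
            if purge and meta.on_cache and meta.on_tape:
                del self.cache[name]
                meta.on_cache = False


# Example use: create a file, dump it to tape, then reopen it (forcing a stage).
resource = CompoundResource()
resource.create("/home/demo/a.dat", b"v1")
resource.dump_tape()
staged = resource.open("/home/demo/a.dat")
```

The single EntityMeta record per file mirrors the idea that a compound digital entity carries one set of metadata, with attributes indicating which component of the compound resource currently holds the data.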