                     STATE OF THE ART REPORT

                                   Document identifier:      DataGrid-02-D2.1-0105-1_0

                                   Date:                     26/06/2011

                                   Work package:             WP02 Grid Data Management

                                   Partner(s):               CERN, IRST, SRC, UH, INFN

                                   Lead Partner:             UH/CSC

                                   Document status:          DRAFT

                                   Deliverable identifier:   DataGrid-D2.1

Abstract: This document reports on the current state of the art in data access and mass
storage system technology.

IST-2000-25182                             PUBLIC                                    1 / 67
                      DATA ACCESS AND MASS STORAGE SYSTEMS
                            State of the Art Report

                                             Delivery Slip
                            Name                    Partner         Date                Signature

      From       Olli Serimaa                    UH/CSC         19/11/2001

 Verified by     Peter Kunszt                    CERN           22/11/2001

Approved by

                                           Document Log
Issue        Date                        Comment                                   Author
0_0     19/11/2001     First draft                                  Olli Serimaa

1_0     22/11/2001     Updated draft                                Olli Serimaa

                                Document Change Record
Issue                Item                                       Reason for Change

        Software Products                                           User files

Word                                         DataGrid-02-D2.1-0105-1_0.doc


1. INTRODUCTION ...................................................... 5
    1.1. OBJECTIVES OF THIS DOCUMENT ................................. 5
    1.2. APPLICATION AREA ............................................ 5
    1.3. APPLICABLE DOCUMENTS AND REFERENCE DOCUMENTS ................ 5
    1.4. TERMINOLOGY ................................................. 5
2. EXECUTIVE SUMMARY ................................................. 6
3. INTRODUCTION: GRIDS AND METACOMPUTERS ............................. 7
    3.1. MIDDLEWARE .................................................. 7
    3.2. METACOMPUTING AND THE GRID .................................. 8
    3.3. THE WEB AS A SUCCESS MODEL .................................. 8
    3.4. WHAT IS NEEDED FOR GRID INFRASTRUCTURE? ..................... 8
    3.5. DATA ACCESS ................................................. 8
4. CURRENT [NON-GRID] DATA ACCESS ................................... 10
    4.1. INTRODUCTION: SPECIAL FEATURES ............................. 10
       4.1.1. Software environment .................................. 10
       4.1.2. Storage hierarchies ................................... 12
       4.1.3. Storage Media ......................................... 13
       4.1.4. Local vs. network access .............................. 15
       4.1.5. Network Attached Storage (NAS) and Storage Area Network (SAN) .. 16
       4.1.6. RAID .................................................. 17
    4.2. LOCAL FILE SYSTEMS ......................................... 19
       4.2.1. Address size of file systems .......................... 19
       4.2.2. Local Unix file systems and inodes .................... 20
       4.2.3. Traditional Unix file systems (UFS and EXT) ........... 20
       4.2.4. Journaling ............................................ 21
       4.2.5. Extended File System (XFS) ............................ 22
       4.2.6. ReiserFS .............................................. 23
       4.2.7. Journaled File System (JFS) ........................... 23
       4.2.8. EXT3FS ................................................ 24
       4.2.9. Local Windows file systems ............................ 24
    4.3. NETWORK FILE SYSTEMS ....................................... 24
       4.3.1. Authentication issues ................................. 24
       4.3.2. NFS (Network File System) ............................. 24
       4.3.3. AFS (Andrew File System) .............................. 25
       4.3.4. Open AFS .............................................. 26
       4.3.5. GPFS (General Parallel File System) ................... 26
       4.3.6. Clustered XFS (CXFS, SGI) ............................. 27
       4.3.7. GFS (Global File System) .............................. 28
       4.3.8. Direct Access File System (DAFS) ...................... 29
       4.3.9. Coda and InterMezzo ................................... 30
       4.3.10. Microsoft Distributed File System (MS DFS) ........... 31
       4.3.11. Windows Network File System (CIFS) ................... 31
       4.3.12. Web-based file systems ............................... 32
    4.4. HIERARCHICAL STORAGE MANAGEMENT (HSM) ...................... 32
       4.4.1. Introduction .......................................... 32
       4.4.2. Commercial solutions .................................. 33
       4.4.3. Open source solutions ................................. 36
5. GRID STYLE DATA ACCESS ........................................... 39
    5.1. INTRODUCTION ............................................... 39
    5.2. GRID CONCEPTS AND REQUIREMENTS FOR DATA ACCESS ............. 39
       5.2.1. Requirements for a Grid environment ................... 40
    5.3. GRID DIRECTORY SERVICES .................................... 40
       5.3.1. Data Catalogues and Data discovery .................... 40
       5.3.2. Data replication and caching .......................... 40
    5.4. GRID DATA STORAGE .......................................... 41
    5.5. NETWORK .................................................... 41
       5.5.1. Network access - WAN .................................. 41
       5.5.2. Protocols ............................................. 41
       5.5.3. Network tuning ........................................ 42
    5.6. GRID MIDDLEWARE ............................................ 42
       5.6.1. Globus Project ........................................ 43
       5.6.2. Condor Project ........................................ 45
       5.6.3. Globus and Condor compared ............................ 48
       5.6.4. Legion ................................................ 48
       5.6.5. Sequential Access to data via Metadata (SAM) .......... 49
       5.6.6. Storage Resource Broker (SRB) ......................... 50
       5.6.7. Nimrod-G Resource Broker .............................. 51
       5.6.8. Other distributed computation technologies ............ 52
    5.7. PEER TO PEER (P2P) COMPUTING AND NETWORKING ................ 53
       5.7.1. Introduction .......................................... 53
       5.7.2. ACIRI Content Addressable Network (ACIRI CAN) ......... 57
       5.7.3. Napster and Gnutella .................................. 57
       5.7.4. Chord ................................................. 59
       5.7.5. OceanStore ............................................ 62
       5.7.6. Freenet ............................................... 63
       5.7.7. Mojo Nation ........................................... 64
       5.7.8. JXTA Search ........................................... 64
6. ANNEXES .......................................................... 66
    6.1. BACKUP AND ARCHIVAL ........................................ 66
       6.1.1. Backup tools .......................................... 66
       6.1.2. Backups and archives at CERN .......................... 67

1. INTRODUCTION

1.1. OBJECTIVES OF THIS DOCUMENT


This report on current technology gives an overview of existing work on data access systems and mass
storage systems. Both the academic domain and the commercial market are considered. The aim of the
document is to provide a survey of current efforts in the domain of WP2, so as to avoid inadvertent
duplication of effort on the part of WP2 where an existing technology could be used or existing ideas
could be exploited.

1.2. APPLICATION AREA

WP2, possibly WP5.

1.3. APPLICABLE DOCUMENTS AND REFERENCE DOCUMENTS

Applicable documents

Reference documents
[R1] Ian Foster and Carl Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure,
      Morgan Kaufmann Publishers, Inc., 1998.



2. EXECUTIVE SUMMARY

Data Management is a broad concept in today's computing world. It has many aspects depending on
what kind of data is to be managed. In this report we try to describe how data management is handled
by many existing applications, systems and services in a networked environment.
Of course by now we have to include the substantial amount of work that has been done by the many
Grid and Metacomputing projects. Data access in a Grid environment is a rather broad subject. The
performance, reliability, availability, and usability all depend on the media and hardware, operating
system, local file system, network software, and protocols both at local and wide area networking
levels. In the Grid, all resources are distributed between the individual workstations and other
resources of the organisations, their regional centres and individual participants belonging to a virtual
organisation. Hence, the computing resources, application programs and data are shared by many sites
and organisations, and the data will be replicated, scattered and possibly "fragmented" all around the
Grid network. The data storage consists of regional storage systems, whose architectural details,
transfer speeds and storage capacities vary widely. These local stores are autonomous and independent
of each other, and governed by the participants.
The three major Grid research projects are Globus, Condor, and Legion. The Globus Toolkit has
become the de facto standard tool collection in Grid computing middleware. The older Condor
project is aimed at running High Throughput Computing jobs on a pool of otherwise idle distributed
workstations. Legion is an object-oriented middleware product for Grid functionality. We report on
these and other Grid technologies currently available and focus on their Data Management aspects.
We also report on different techniques and philosophies for data access that are currently available
outside the Grid context as well. We discuss the Open versus Proprietary software development model
and report on the existing achievements of both.
We discuss different properties and several kinds of storage; in reality, the various properties and
classifications form a continuum. In this context we also report on the current disk and tape
technologies being used in data storage, as well as how they are accessed: locally or over the network.
The current techniques for networked storage are also reviewed.
Our report also discusses different local and networked file systems and highlights the features of
each. Again, there is no such thing as the best file system; it all depends on what the application wants
to do with the data. Knowledge of the usage patterns on the data is essential to making qualified
technology choices.
Another aspect of data management comes from the world of Hierarchical Storage Management
(HSM). An HSM system implements storage in several layers, from fast online primary storage to
manual, off-line or automated near-line secondary storage, and even to off-site tertiary storage for
archiving, backup and safekeeping purposes. In other words, a simple HSM system consists of a disk
farm, a tape cartridge robot silo and special software for the HSM file system. The system
automatically handles data movements between the storage layers. We report on the commercial and
open HSMs available today.
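The layered migration idea can be sketched in a few lines. This is a toy model only: the file names, the two-tier disk/tape structure and the 30-day idle threshold are all invented for illustration, and real HSMs such as those surveyed later are vastly richer.

```python
import time

MIGRATE_AFTER = 30 * 24 * 3600   # demote files idle for 30 days (illustrative threshold)

def migrate_idle_files(catalogue, now):
    """Move files idle longer than MIGRATE_AFTER from the 'disk' tier to 'tape'.

    catalogue maps file name -> (tier, last_access_time, size_bytes).
    """
    moved = []
    for name, (tier, last_access, size) in catalogue.items():
        if tier == "disk" and now - last_access > MIGRATE_AFTER:
            catalogue[name] = ("tape", last_access, size)
            moved.append(name)
    return moved

def recall(catalogue, name, now):
    """On access, transparently stage a file back to the disk tier."""
    tier, _, size = catalogue[name]
    catalogue[name] = ("disk", now, size)  # the user never addresses the tape tier directly
    return tier                            # where the bytes actually came from
```

The essential HSM property shows up in `recall`: the user opens a file by name only, and the system decides whether the data must first be staged back from secondary storage.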


3. INTRODUCTION: GRIDS AND METACOMPUTERS

The concept of a computing or data Grid was introduced recently as an analogy of electrical power
grids or electrical networks [R1]. The editors of The Grid: Blueprint for a New Computing
Infrastructure [R1], Foster and Kesselman, are also members of Globus, at present one of the most
important collections of middleware tools realizing the Grid.
One definition of the Grid (according to Kesselman, modified) is: "the Grid is networked computer
resource sharing and coordinated problem solving in dynamic, multi-institutional virtual
organisations."
The Grid concept is too elusive to be defined exactly, as it seems to be evolving constantly and
people mean different things by it. At present, there is still some "hype" around the Grid: it is the
current popular buzzword in distributed computing and gaining enormous verbal support. As with all
good concepts, initial expectations are too high. The Grid will not solve all the problems in distributed
computing, resource sharing or efficient utilisation. But it will make many things simpler and easier to
use by standardising the usage and creating middleware for the infrastructure. In the future, the Grid
will probably become a general model for scientific computing, needed for all applications and
benefiting all users of computers.
The Grid sometimes means that the computational and database environments are combined so that
the users do not have to know nor care where and how the data is stored or the applications run. This
concept applies especially well to data intensive computing, where the amount of data is huge and
most calculations can be performed independent of each other.
The software tools (middleware) developed for the Grid are suitable also for computing intensive and
co-operative applications, such as remote instrumentation and virtual environments. Conceptually one
might speak of computational grids, which are extensions to cluster computing, and data grids, which
are conceptually close to wide area versions of distributed database systems. Naturally, these two
versions of the Grid are overlapping.
Site usage policies, user authorisation and user access mechanisms are still being worked on. Of
course, the control of the shared resources will remain within the organisation that owns the resource.
The usage patterns, access rights and prices of the use of the resources are not wild and free, but will
and must be agreed upon by the organisations that use and share the resources, thus forming the virtual
organisation. Technically these requirements are enforced by the middleware security tools.

3.1. MIDDLEWARE

Middleware is a software layer that functions as a conversion or translation layer. It is also a
consolidator and integrator. Custom-programmed middleware solutions have been developed for
decades to enable one application to communicate with another that either runs on a different platform
or comes from a different vendor.
Distributed processing middleware connects networks, workstations, supercomputers, and other
computer resources together into a system that can encompass different architectures, operating
systems, and physical locations.
At present, there are already many serious projects developing Grid infrastructure and the necessary
middleware tools or trying to make use of existing ones. A lot of money and effort is being invested in
these projects worldwide. The Global Grid Forum [R?] is the international body that takes care of
standardisation, and is a communication forum between all the existing Grid projects. The most
popular toolkits are the Globus Toolkit [R?], the Sun Grid Engine [R?], Condor [R?] and Legion [R?].

3.2. METACOMPUTING AND THE GRID

In essence the Grid is a logical extension and continuation of the older metacomputing concept
introduced in the late 1980s and implemented in several instances in the 1990s. Originally
metacomputing denoted a collection of physically disjoint and disparate computers, which could be
used like one logical computing resource. This concept has been one of the main guidelines in
developing many supercomputer centres. The metacomputer integration has been realised quite well in
general, even though there have been some difficulties due to different architectures and
implementations, for example. So the Grid concept is similar to the metacomputing one, but on a
larger and deeper scale.
In a metacomputer, the resources are usually connected within the Local Area Network (LAN)
belonging to one organisation, and the coupling of the resources is restricted to sharing storage, files
and separate applications.
For a Grid, the coupling goes both much deeper and wider, and most of the details of resource sharing
and coupling are hidden from the user (using middleware). The resources are often distributed on the
Wide Area Network (WAN) and they belong to and are administered by different organisations, even
though these separate organisations are themselves loosely combined to one virtual organisation. Not
only files and separate applications are seamlessly combined, but also other resources such as
databases are transparently used by large distributed applications. Transparency means that from the
user's point of view, these distributed applications behave like one conventional (local) application. To
achieve this kind of integration one needs a lot of standardised protocols and interfaces (i.e. the
middleware) for the interaction between users and grid applications, between the applications and
between different servers. Also special application programming interfaces are necessary to be able to
develop the middleware.

3.3. THE WEB AS A SUCCESS MODEL

The model for developing the Grid is the hugely successful Internet and the World Wide Web
(WWW). In Grid computing, one aim is to build the infrastructure for large-scale distributed
computing, the "computing resource web of the future". Even though the initiative has once again
come from scientific computing and supercomputing, it seems that the concept is already being
embraced by the whole computing community. If it proves successful, it may affect all computing like
the Internet and the web did.

3.4. WHAT IS NEEDED FOR GRID INFRASTRUCTURE?

For the Grid, one needs mainly standardised interfaces, policies and a well-organised administration.
The Global Grid Forum tries to be the body of standardisation for the Grid, just as W3C is for the
WWW. Elements to consider are interoperability, shared infrastructure services, Grid protocols and
services, and Grid Application Programming Interfaces (API) and Software Development Kits (SDK).
Based on these elements, existing applications can be gridified, i.e. the applications can be adapted to
the grid environment.
The tools, especially the middleware, are still in the early stages of development, but there are very
good tools already available. One of these is the Globus Toolkit, which contains some – already partly
integrated – tools for Layered Grid Architecture, Resource Management, Data Access and Transfer,
and Replica Management. The DataGrid project's aims are to develop additional middleware services
and to have a fully integrated running testbed by the end of the project.
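The Replica Management idea mentioned above can be sketched as a toy catalogue that maps one logical file name to several physical copies and picks one. All names, URLs and the cost function here are invented for illustration; an actual Grid replica catalogue is a directory service with a far richer interface.

```python
# Toy replica catalogue: one logical file name (LFN) maps to many physical copies (PFNs).
replica_catalogue = {
    "lfn:higgs-candidates.dat": [
        "gsiftp://se.cern.example/data/higgs-candidates.dat",
        "gsiftp://se.hel.example/mirror/higgs-candidates.dat",
    ],
}

def register_replica(lfn, pfn):
    """Record a new physical copy of a logical file."""
    replica_catalogue.setdefault(lfn, []).append(pfn)

def select_replica(lfn, cost):
    """Pick the cheapest physical copy according to a caller-supplied
    cost function (e.g. estimated network distance to the storage element)."""
    return min(replica_catalogue[lfn], key=cost)
```

The point of the design is that applications name data logically and leave it to the middleware to resolve the name to the most convenient copy, which is what makes transparent replication and caching possible.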


3.5. DATA ACCESS

Data access in a Grid environment is a rather broad subject. The performance, reliability, availability,
and usability all depend on the media and hardware, operating system, local file system, network
software, and protocols both at local and wide area networking levels. Finally, the decisive role will be
played by the wide area network based new Grid middleware tools for data access. This report
captures the most important aspects of the current status of data access in a Grid environment.
Nowadays the local computing resources are already networked on a LAN scale. In other words, the
local environment consists of a metacomputer, which is built integrating many heterogeneous
resources into one. The Grid then extends the integration to wide area networking and multiple
organisations (forming one "virtual organisation").
At the core there will always be an individual local resource, be it a computing resource, such as a
clustered computer, or a storage unit. First we will discuss data access and its requirements at this
fundamental level. In Section 4.1, the usual operating systems, storage hierarchies, and current trends
in storage media are discussed. For the Grid the Unix-like operating systems, like Linux, are most
important. Also some special features like RAIDs and the new style special network-attached storage
units, such as SAN and NAS, are described. In an annex, some backup concepts are discussed.
Local file system concepts, such as address size limitations and journaling, are discussed in Section
4.2, emphasising the new journaling file systems like ext3fs, ReiserFS, XFS and JFS. The local
networked file systems are treated in Section 4.3. After the traditional old network file systems, like
NFS and AFS, the new distributed file systems, like GPFS, CXFS, GFS, DFS, DAFS, Coda, and
InterMezzo are presented. Section 4.4 presents some hierarchical storage management systems, like
HPSS, DMF, EuroStore, Enstore, and Castor, where — from the user’s point of view — tape robots
and disk farms are integrated into a single large storage system.
In Chapter 5, the current state of the art of Grid data access is presented. First, Grid concepts like data
catalogues, data discovery, replication, and caching, are discussed, and then in Section 5.5, some
network considerations are presented.
The main part of Chapter 5, Section 5.6, concentrates on the current status of Grid middleware. The
important ones are Globus, Condor, Legion, SAM, SRB and Nimrod-G, but also other distributed
computation technologies, like Java and CORBA, are discussed.
The final Section 5.7 presents peer-to-peer (P2P) computing and networking, which has gained
popularity as a promising method to solve many problems in distributed computing and Grid
environments. The important software and protocols in peer-to-peer computing are Chord, CAN,
OceanStore, FreeNet, Mojo Nation, and JXTA Search, in addition to the famous first ones to succeed,
Napster and Gnutella.




4.1.1. Software environment Open Source vs. Proprietary Software
Today many software packages and network protocols are open or free in some sense. But there seems
to be a lot of confusion and discussion about different meanings of Open Source Software (OSS): e.g.,
free software, open source, and shared source.
Open software is freely distributable, its source code is always available, and often it is cheap or even
free of charge. In contrast, proprietary software is available in executable form only. The company
who develops the software also owns it, and usually it is expensive. Some open source-like software is
available as shareware or freeware, but then the source code is not necessarily available (i.e. the
software is actually not open source); here the focus is either on the distributability (shareware) or
price (freeware).
There are some very successful major open source software projects, for instance:
       The Linux operating system, initiated by Linus Torvalds at the University of Helsinki,
        Finland, but now developed and maintained through international voluntary co-operation,
       All GNU tools and software (discussed below),
       TeX typesetting package, developed for mathematical typesetting by Donald Knuth at
        Stanford University,
       The Kermit terminal emulation package,
       Ghostscript, a Postscript interpreter by Aladdin Software, and
       The programming (scripting) language Perl, developed, modified, and improved
        through a global network of collaborators coordinated by programmer Larry Wall.
Most application software domains like word processing and web publishing applications, databases,
etc. have at least one open source version, too.
In the case of Linux and TeX, people even argue that the open source approach has resulted in
something superior to corresponding commercial software packages. But clearly not all open source
projects are of such a high quality.
Free and open software and protocol practice was started with (and was almost required by) the US
federal government funding and support of the Internet networks, protocols and software packages.
One can argue that free, open source software created the Internet. Early examples include the email
handling program, sendmail, developed by Eric Allman, and the program bind for Domain Name
Server (DNS), both at the University of California, Berkeley. Most Internet protocol standards are
open. New open protocols include the encrypting Secure Shell protocol (SSH1, even though the
program ssh itself is not open) by Tatu Ylönen from Finland, and Gnutella.
The WWW, too, was created with open ideas. Ted Nelson introduced the hypertext concept, and
Timothy Berners-Lee at CERN created the HTTP protocol. The most popular WWW server software,
Apache, is open source, and the first browser, Mosaic (or Mozilla), was developed at the National
Center for Supercomputing Applications (NCSA) at the University of Illinois. All well-known browsers
were initially based on Mozilla.
The Open Source Initiative (OSI) is a non-profit corporation dedicated to managing and promoting the
Open Source Definition. According to this definition, Open Source Software must be freely distributed
and redistributed, it must include or make available the source code, it must allow distribution of the
source code as well as the compiled form, and modifications and derived works must be allowed under
the same terms as the license of the original software.
The GNU tools and the associated GNU General Public Licence (GPL, also known as the copyleft
license and GNU Public License) form the first and best-known Free Software collection. The whole
GNU Project (GNU is a self-referential acronym: GNU's Not Unix) and the Free Software idea were
invented by Richard Stallman and fostered by the Free Software Foundation (FSF) he founded.
Stallman comes originally from the Artificial Intelligence Laboratory at MIT.
The GPL is the granddaddy of all open software licenses, including the Open Source Definition. The
GPL grants the user three rights: to copy the software and give it away, to have access to the source
code, and to change the software. A key requirement is that the user passes these rights, unimpaired,
on to other users. Because many people follow this license, much of the free software and all of the
Linux kernel have the same licensing terms, thus simplifying compound distributions.
In the GNU project terminology, free has two connotations. In the economic sense the software is
(mostly) free of charge, but more important is the free-speech sense of free programming: when the
source code is available, one is able to build on the works of others. The GNU license model gives the
users special rights to obtain the source, and this right cannot be revoked.
In spirit it is very close to the newer open source license, but see the essay "Why Free Software is
better than Open Source" on the GNU web site. There are other free licenses besides the GPL; these
include the Open Group X License, the other BSD-style licenses, and the Perl Artistic License (none of
which insists on source code distribution).
There seems to be almost a religious war between the proponents of the various free and open source
software concepts. For more information on the legal ramifications of the various public software
license types, see for instance the articles by Donald K. Rosenberg of Stromian Technologies, and
also the white paper "The Origins and Future of Open Source Software" by NetAction.
The pressure from the open source side on Microsoft has been so hard that it has been forced to
announce its "Shared Source" idea to give customers, partners and developers greater access to its
source code. Microsoft has often been criticised for trying to monopolise and capture standards by
using the "embrace and extend" principle to create incompatible software, causing interoperability
problems. Clearly there have been such problems, but it is not easy to decide whether this is
intentional. Definitely some people see Microsoft's Shared Source program as an undertaking to
undermine Open Source; see also the "Shared Source vs. Open Source" panel discussion.
A large part of the software packages in use are proprietary: Windows, most Windows programs, most
database managers, and many wide-purpose scientific programs and libraries (such as Matlab,
Mathematica, and NAG). Most of them are also of very high quality. Often they are easier to use, since
they usually aim explicitly at usability. Since open source programs are often created by enthusiastic
programmers who are experts in their own area, the usability aspect is given only secondary thought.
The biggest benefit of the open source software is the fast development, debugging and correcting
cycle. Because the code is accessible to all, people can easily spot errors and correct them.
Enhancements can be built on proven existing solutions. The software is no longer a black box:
people can check whether they can trust and rely on it. On the other hand, openness can lead to
source code "forking", i.e. to the creation of several slightly differing similar programs with
incompatibilities and interoperability problems.


A drawback can be the possible lack of the programmer base necessary to support more esoteric
hardware and software needs. But sometimes this can be a problem with commercial software, too.
Most companies are reluctant to use open source software because there are no guarantees of
support: when there are serious problems, there might be no one interested enough to fix the
software. For commercial software, the vendors are usually very interested in developing and
correcting their software, because their income depends on its quality.
Sometimes the openness can weaken the security, because crackers can read the source code and spot
ways to exploit security holes and software bugs. But due to the openness, the security hole can be
patched and bugs easily eliminated just by modifying the offending part of the code. There are
conflicting views on whether open source or proprietary software is more secure.

Operating systems
There are several different versions of Unix (or Unix-like, e.g. Linux) operating systems available.
Practically every major computer manufacturer has its own flavour of Unix. The most important ones
are:

Unix version                     Vendor                        Platform, architecture
Linux                            Open source                   several: Intel x86, Power PC, Alpha, …
FreeBSD, OpenBSD, NetBSD         Open source                   several: Intel x86, Alpha, …
Darwin, Mac OS X                 Apple                         Power PC, Intel x86
Solaris and SunOS                Sun Microsystems              Sparc
AIX                              IBM                           RS/6000 (Power PC)
Tru64 (Digital Unix, OSF/1)      Compaq (Digital Equipment)    Alpha
IRIX                             Silicon Graphics (SGI)        MIPS R10000
HPUX                             Hewlett Packard               HP PA Risc
Unicos                           Cray                          Cray
For the Grid the most important operating system is Unix and especially Linux. Marginally interesting
are the most common desktop operating systems:
Windows                          Microsoft                     Intel x86 (also IA-64)
Mac OS                           Apple Computer                Power PC

4.1.2. Storage hierarchies
Originally in computer systems there were only two types of storage components: volatile work
storage (memory and registers) and the proper permanent storage, disks and tapes. Historically and
traditionally the only data accessible was local, and had to be either laboriously keyed in locally or
created as a result of computation, but soon there were input peripherals for data entry (paper tape,
punched cards or magnetic tape).
Later many kinds of caches were introduced: several levels of memory caches between registers and
memory, and disk caches between memory and disk. The purpose was to speed up the transfer and
eliminate bottlenecks. The various caches may lead to performance bottlenecks if cache schemes are
not designed carefully; especially, efficient cache coherence can be difficult to achieve in networked
environments.
Several types of robots for magnetic tapes and cassettes were introduced to make off-line storage
available at all times without operator assistance (near-line), with a delay of a few seconds. The latest
introduction is networked storage, where the actual storage is on a server computer connected over a
network (LAN).


The storage in a computer system (in addition to memory, the primary storage) can be hierarchically
outlined as follows (usually permanent, non-volatile and rewritable):
         Secondary online storage (in practice often local magnetic disks)
                  special disk configurations (e.g. RAIDS)
                  tape cache on disk
                  network storage (non-local)
         Tertiary off-line storage (tapes and other such media, also backup)
                  near-line version automated operated (tape robots)
                  offline version manually operated
                  tertiary long time archival and backup (off-site tapes and also write once CD-R)
There are several different properties and several kinds of storage. In reality the various properties and
classifications form a continuum.
The storage levels form a pyramid, with memory at the top and secondary and tertiary storage at the
bottom. The secondary storage is usually faster but more expensive and has less capacity than the
tertiary storage. So the performance (and price) goes up when going from the tertiary to the secondary
storage, while the capacity increases in the reverse direction. Today we are at 0.25 – 2 GB of memory,
but permanent storage runs from tens and hundreds of gigabytes on disk up to hundreds of terabytes
and more on tapes. Storage is permanent whereas memory is volatile, resulting in different uses:
memory performs well, whereas storage is persistent.
The access speed for data is slower at the bottom. The memory and disks are typically randomly
accessed, always online and fast, whereas the tapes are sequentially accessed and slower. Tape media
can be off-line, manually loaded, or near-line on huge tape robots loaded automatically when needed.
Data is usually archived or backed up on tapes. The backup tapes are kept in vaults off-site. Some
media are made more permanent, for instance read-only CD's for software and tape distribution or
write-once CD-R's for precious data for security. The usual secondary storage devices, disks and tapes,
may of course be rewritten.
For backup and archival purposes the secondary and tertiary storage is used. Off-site storing in vaults
preserves the data even if the computer site is destroyed (e.g. in a fire). There is a definite difference
between backup and archiving: backup means storing everything for a medium length of time
according to a definite backup plan and schedule, so that one can technically recover from minor
hardware failures or even total catastrophes, whereas archiving means storing precious data in
principle indefinitely, for historical or legal purposes for example. Also, archived data has no online
counterpart.
Network storage can sometimes be regarded as a (possibly a little bit slower) secondary storage, or
something between the secondary and the tertiary storage. Usually the main part of the tertiary storage
is nowadays implemented as network storage, where files residing on tapes or on other off-line or
near-line storage are stored temporarily or cached on disks (tape cache on disks) on a network server.
Written files are later transferred to tapes (manually or by robots). Only when reading old files that
are no longer on the tape cache disks does one have to wait for the data to appear online.

4.1.3. Storage Media
Disk and tape technology
There are two major disk interface technologies today: Small Computer System Interface (SCSI) and
Integrated Drive Electronics (IDE) or nowadays usually its modern counterpart Enhanced Integrated
Drive Electronics (EIDE), differing in principle only in how they are connected via the bus to the host
computer. Typical disk sizes are 30 – 300 GB. IDE is also known as Advanced Technology
Attachment (ATA), and EIDE as Advanced Technology Attachment with Extensions (ATA-2).
SCSI disks are mainly used on servers and IDE disks on desktops, even though many sites have started
to use cheaper IDE disks also on servers. The more powerful disks usually have a SCSI interface, but
for average server usage even the IDE disks perform well enough. Today raw disk capacity is getting
relatively cheap, IDE being cheaper than SCSI by a factor of about 2-3. Typical prices per megabyte
have been $10 in 1988, $1 in 1993, $0.1 in 1997 and $0.01 in 2000, which means that prices drop to
half within a little more than a year. A typical low price nowadays is 2-3 dollars/GB, but the powerful
server disks are more expensive.
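The halving time implied by the quoted price points can be checked with a quick calculation, using only the rough figures stated above:

```python
# $10/MB in 1988 down to $0.01/MB in 2000 is a factor of 1000 over 12 years.
# The halving time follows from years / log2(total drop factor).
import math

drop_factor = 10 / 0.01              # prices fell 1000x
years = 2000 - 1988                  # over 12 years
halving_time = years / math.log2(drop_factor)
# log2(1000) is about 9.97, so prices halve roughly every 1.2 years,
# i.e. "within a little more than a year" as stated in the text.
```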
The disk rotation speed determines the latency to find a specific sector. The normal speed has long
been 3600 rpm, but nowadays 5400-7200 rpm is a commodity, and the best models have 15 000 rpm
and more. Seek times are 8-9 ms for desktop models, and about 3-4 ms for the best server models.
Transfer rates are up to 100 MB/s for desktops, up to 400 MB/s for more expensive servers. Packing
densities are being improved so that drives with a capacity of 30 – 100 GB seem to be becoming the
norm on desktop workstations, and about 10 GB on laptops. Of course these numbers will be outdated
within a few months.
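For the rotation speeds quoted above, the average rotational latency (half a revolution) follows directly from the spindle speed; a small illustrative helper:

```python
# Average rotational latency is half a revolution: 0.5 * 60 / rpm seconds,
# expressed here in milliseconds.
def rotational_latency_ms(rpm):
    return 0.5 * 60.0 / rpm * 1000.0

# The old 3600 rpm standard gives ~8.3 ms, commodity 7200 rpm ~4.2 ms,
# and a top 15 000 rpm server disk 2.0 ms.
for rpm in (3600, 5400, 7200, 15000):
    print(rpm, round(rotational_latency_ms(rpm), 2))
```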
There are several different types of tape media in various shapes and forms, on cartridges and
cassettes, with conventional or video-like usage (up to 100 GB per cassette). The access is serial by
necessity, access times are longer, and transfer rates are much lower, but the media is cheaper than
disk by a factor of 3-10. The capacity of one cartridge is typically 20-100 GB. Tape media is still the
choice for backups and for volume storage of data that is not accessed very often.
Different media for secondary and tertiary storage have proliferated: several kinds of diskettes (12",
5.25", 3.5", ZIP, 1-250 MB), magneto-optic media and optical media (data CD, CD-ROM, CD-RW,
DVD, about 0.5 – 5 GB), and detachable disks (SyQuest, JAZ, external packs, 2-20 GB). Several
promising new technologies exist, based for instance on holography or on an atomic force microscope
needle imprinting slots into plastic (Millipede). The production use of these new technologies is
probably far in the future, however.
All this has made storage issues difficult and complicated, as each medium has different access
characteristics: serial or random access, access speed, transfer speed and storage capacity.

Example: SCSI vs. IDE at CERN
CERN has shifted in recent years from using SCSI disks to commodity IDE disks (not the cheapest,
but rather sturdy IBM disks). For most manufacturers the HDA (Head Disk Assembly) is the same for
SCSI and IDE disks, i.e. there are corresponding SCSI and IDE models where only the controller
differs. However, the fastest disks (basically determined by the spinning speed and controller transfer
rates) are manufactured only as SCSI models.
CERN did not use SCSI RAIDs much. Currently RAID is not otherwise used for the IDE disks, except
that everything is mirrored. Reliability has been as good as for SCSI disks, and because everything is
mirrored, there is less downtime (reboots, etc.) and less manpower is needed for the administration of
the disks. In the future the disks might be based on other RAID schemes instead of mirroring.
The price has been cut by a factor of 1/6 compared to the prices for SCSI systems (because of
mirroring, the real price cut has been a factor of 1/3). The system is capacity oriented; basically it is
considered a cache for data. Most data resides on tapes; new data is first created on disk, and
frequently accessed files are kept on disks (tape cache).
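The arithmetic behind the price cut can be made explicit. The unit cost below is a normalised illustrative figure, not an actual price:

```python
# Raw IDE capacity costs about 1/6 of equivalent SCSI capacity, but
# mirroring doubles the number of IDE disks needed, so the effective
# price cut is only a factor of 1/3.
scsi_cost_per_gb = 6.0                        # normalised units/GB (illustrative)
ide_cost_per_gb = scsi_cost_per_gb / 6        # raw IDE is ~6x cheaper
mirrored_ide_cost = 2 * ide_cost_per_gb       # every byte is stored twice
effective_cut = mirrored_ide_cost / scsi_cost_per_gb   # -> 1/3
```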
For performance reasons there have been some difficulties when pushing to the upper limit: a
bottleneck has been noticed when trying to get the transfer rates from 45 MB/s to over 60 MB/s. It is
not clear what the real nature of the problem is: it might be Linux buffering of data while transferring
(i.e. copying data between buffers "unnecessarily"), or some kind of memory bandwidth or
hardware-level limitation. Neither the disks nor the network is a problem here, but network software,
like IP stacks, could be.

Pixie dust
In the past decade, the data density for magnetic hard disk drives has grown faster than Moore's Law
for integrated circuits predicts, doubling every year since 1997. But when magnetic regions on the disk
become too small, they cannot retain their magnetic orientations over the typical lifetime of the
product. This is the "superparamagnetic effect", which has long been predicted to appear when
densities reach 20 to 40 gigabits per square inch, near the data density of current products.
The new AntiFerromagnetically Coupled (AFC) media by IBM is the first dramatic change in disk
drive design made to avoid the high-density data decay due to the superparamagnetic effect. AFC
works by sandwiching a thin, three-atom-thick layer of ruthenium (Ru, similar to platinum) between
two magnetic layers on a disk. This technology allows hard disk drives to store four times as much
data per square inch of disk area as previous hard drives. That only a few atoms could have such a
dramatic impact caused some IBM scientists to informally nickname the ruthenium layer "pixie dust".
Fujitsu Ltd. is using a similar technology called SF Media. To make use of the new medium, Fujitsu
also developed a better disk drive head. Fujitsu is now working toward densities of around
300 gigabits per square inch.
With the pixie dust technology, hard-disk densities will reach 100 Gbits/in² by 2003. This means that
hard drives could come with capacities of 400 GB for desktop drives and 200 GB for notebooks; 200
GB is equivalent to more than 40 DVDs or 300 CDs. One-inch MicroDrives for handheld devices
could hold 6 GB, or 13 hours of MPEG-4 compressed digital video, in other words about eight
complete movies.
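The capacity comparisons above can be sanity-checked against typical media sizes. The DVD and CD capacities used here (4.7 GB and 0.65 GB) are assumed round figures for the single-layer media of the time:

```python
# 200 GB compared against single DVDs (~4.7 GB) and CDs (~0.65 GB).
notebook_gb = 200
dvds = notebook_gb / 4.7    # ~42, i.e. "more than 40 DVDs"
cds = notebook_gb / 0.65    # ~308, i.e. "about 300 CDs"
```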

4.1.4. Local vs. network access
From the user’s point of view, data is either
       A. "local" to the computer used, and hence readily available, or
       B. "accessed through the network", and hence possibly unavailable or accessed with a
performance penalty.
With the clustering of computing resources in computer centres there is a tendency to keep only the
run-time working files on the fastest "local" file system (and of course also the operating system and
its files). For performance reasons this local working storage is always needed, and in the final
analysis all files reside on the local store of some server.
The programs and even the data files are stored on a networked file server and copied to the local file
system only when needed, and only if there are performance gains from doing so. As LAN speeds
increase, the gain from local copies diminishes, and most of the time one can regard the network
storage as local. However, there can be bottlenecks and contention when many clustered computers
use the network to access the file servers.
We can classify further the typical storage possibilities:
      local                      storage attached to the computer itself with IDE or SCSI interface
      cluster                    special high speed connection to cluster storage server
      network                    LAN connection to a network storage server
      remote                     WAN (storage is ―on the Internet‖)
Systems usually try to hide from the user the complexity of data storage levels and the details about
the data location and the means used to access the data.


4.1.5. Network Attached Storage (NAS) and Storage Area Network (SAN)
A current trend in storage is storage virtualisation, achieved by attaching storage directly to the
network. The primary approaches are Network Attached Storage (NAS) and Storage Area Network
(SAN). By attaching storage directly to the network, storage becomes divorced from servers, and one
achieves several benefits: for instance the ability to add storage without interruption, and the
consolidation of storage, making it easier to administer (many scattered devices are harder to manage).
One can distinguish at least the following storage attachment solutions:
    1. Directly Attached Storage (DAS). Basically IDE or SCSI disks attached locally to the
        computer, accessed directly via bus.
    2. Server Attached Storage (SAS). Basically IDE or SCSI disks attached locally to the server on
        the network, accessed via LAN using TCP/IP.
    3. Network Attached Storage (NAS). Most simply and typically some RAID disk system with
        network (Ethernet) card, and enough optimised software and processing power for file access,
        but doing nothing else, so one can say that the server is dispensed with. Accessed via LAN
        using TCP/IP.
    4. Storage Area Network (SAN). A special Fibre Channel network: simply a bunch of disks strung
        together with a bit of fibre optics, accessed at block level.
The examples may be oversimplifying the technologies a little. All NAS and SAN systems use more
or less standard, readily available technology: NAS takes RAID disks and connects them to the
network using Ethernet or other LAN topologies while a typical SAN implementation will provide a
separate data network for disks and tape devices using the Fibre Channel equivalents of hubs, switches
and cabling. Some SAN vendors are promising to offer 10 Gigabit Ethernet as an alternative when it
becomes available.
The following table highlights some of the key characteristics of SAN and NAS.
             SAN                                        NAS
Protocol     Fibre Channel (Fibre Channel-to-SCSI)      TCP/IP
Applications Mission-critical transactions              File sharing in NFS and CIFS
             Database application processing            Limited database access
             Centralized data backup and recovery       Small-block transfers over long distances
Benefits     High availability and reliability          Simplified addition of file sharing capacity
             High performance and scalability           No distance limitations
             Reduced traffic on the primary network     Easy deployment and maintenance
             Flexibility, centralized management
SAN can provide high-bandwidth block storage access over a long distance via extended Fibre
Channel links, but such links are generally restricted to connections between data centres. Physical
distance restricts NAS access less, since communications are via TCP/IP.
The two technologies have been rivals, and there have been some "holy wars" between them. But there
is a place for both NAS and SAN in most large enterprises, because despite their similarities they are
actually complementary storage technologies. They have a lot in common and are in fact rapidly
converging towards a common future, reaching a balanced combination of approaches. Storage
vendors thus seem to be moving toward unified solutions bringing both SAN and NAS access to the
same self-optimising large storage array with a single management interface. Unified storage will
accept multiple types of connectivity and offer traditional NAS-based file access over IP links, while
allowing for underlying SAN-based disk-farm architectures for increased capacity and scalability.

NAS
NAS is an optimised single-function solution that provides quick, easy-to-add and easy-to-manage file
storage using the existing TCP/IP network and file sharing protocols like Network File System (NFS)
or Common Internet File System (CIFS). NAS provides shared file-level access to common storage
from anywhere in the organization, and the storage can be shared across different operating system
environments. Most NAS connections reside between workstation clients and the NAS file-sharing
facility.
Because file access is typically low volume and less sensitive to response times, predictable
performance and distance are less of a concern in NAS. Typical interaction between a NAS client and
an appliance involves data transfers of relatively short duration and volume. Large transfers must be
split into many small network packets, and processing is required at each end of the connection to
break down and reassemble the data stream. NAS performance depends on the ability of the network
to deliver the data, and network congestion directly affects it.
The NAS server controls the (often proprietary) file system it uses and the manner in which stored
data is accessed on directly attached storage devices. NAS is independent of traditional servers, which
can improve availability: if file sharing were hosted on a traditional server instead of a NAS server,
the network transfer processing might adversely affect applications running on that server.
NAS has been used in the ISP and ASP communities, where many internet-based applications require
file sharing. NAS is well suited for file access by combined UNIX and Windows computers, e-mail
and web page services, and wide-area streaming video distribution and media serving.

SAN
SAN is designed to provide block-level data access at high speeds, primarily to application servers,
from flexible, high-performance, highly scalable networked storage. It uses many direct connections on
a dedicated, high-speed storage network between servers and storage devices, such as disk storage
systems and tape libraries.
High-performance Fibre Channel switches and protocols ensure that device connections are both
reliable and efficient. Connections are based on either native Fibre Channel or SCSI through a SCSI-
to-Fibre Channel converter or gateway, making SAN flexible by overcoming the cabling restrictions
traditionally associated with SCSI. One or more Fibre Channel switches provide the interconnectivity
for the host servers and storage devices in a "SAN fabric" meshed topology.
When transferring large blocks there is not much processing overhead on servers, since the data is
broken into a few large segments. Hence SAN is effective for large bursts of block data, which makes it ideal
for storage-intensive environments. SAN is well suited to high-bandwidth storage access by
transaction processing and database applications that manage their own large data spaces and run in a
block-oriented environment. These applications do not need universal access at the file level from
several environments; they care more about block level performance and control.
SAN requires sophisticated knowledge to implement, especially when compared with NAS; it needs to
be "built". SAN does not impose any inherent restrictions on the operating system or file system that
may be used.
Organizations can benefit from employing SAN for mission-critical applications, heavy data-mining
environments, backups and restores, and high-availability computing. SAN has also been used where
huge files have to be manipulated and shared at a level of reliability that no ordinary network can
support, for instance in video production houses.

4.1.6. RAID
For both data protection and performance, special features and schemes are often used for data
storage at the disk level, i.e. at the driver, controller, and media level. One of the best known of these
is RAID (Redundant Array of Inexpensive Disks) for parallel disks, which includes mirroring and
striping. By performing I/O operations in parallel, performance is enhanced. Keeping an exact copy of
one disk on another, or disk mirroring, is the simplest form of protection, but it halves the capacity (or
doubles the number of drives, with the associated cost). Using special error-correcting codes one can
also improve security and fault tolerance, and can reduce the redundancy.
There are concepts similar to RAID for other media too, e.g. RAIT (Redundant Array of Inexpensive
Tapes) for tapes.
The purposes and benefits of a disk array or RAID are:
  1. High data availability, protection, and fault tolerance: data availability and disk reliability are
  increased by using redundancy and special error-correcting parity checks. If a single drive crashes,
  no data is lost, and the disk array can continue to function.
   2. Performance enhancement: I/O operations are parallelised by spreading data over multiple
   drives (spindles). This allows multiple drives to work in parallel on a single transfer request.
   3. Increased total storage capacity, achieved by using many small and inexpensive commodity
   disks; disk connectivity per system also increases, since multiple drives usually appear as one.
   4. Enhanced operating flexibility and minimised downtime for the disk subsystem: intelligent
   array controllers often make the system hot swappable, i.e. individual disks can be replaced
   without disturbing the storage system (without downtime or reboots).
RAID can be implemented either (partly) in hardware or (entirely) in software. Although not strictly
necessary, special RAID hardware is usually needed for high performance; software-only RAIDs tend
to be rather slow.
Nowadays there are a lot of commercial RAID solutions from literally hundreds of vendors, for all
computer platforms, ranging from simple PC servers to supercomputers. Usually these are mainly
hardware solutions where the RAID disk system is placed onto a special intelligent RAID controller,
which makes all the drives appear as one drive to the computer system. The special hardware will add
to the price, but the disks are commodity SCSI or IDE disks. Of course, usually some special software,
e.g. drivers, is needed too.
There are some software-only solutions, especially for striping and mirroring (also for Linux). In a
software-only solution, all the bookkeeping done by the special controller is instead done by driver
software. Thus a server host can effectively act as a networked RAID controller, providing flexibility
and a possibly reduced price. There is, however, usually some performance penalty in the form of
increased CPU overhead, and also for software mirroring, since everything has to be written at least
twice. On the other hand, a software solution can spread the I/O load over multiple controllers.

RAID Levels
There are several different types of RAID, with varying subtypes, depending on the goals and details
of the implementation. David A. Patterson, Garth Gibson and Randy H. Katz (UC Berkeley)
introduced the concept and published the original (non-exhaustive) five-level taxonomy of RAID
levels in their SIGMOD paper of 1988 (A Case for Redundant Arrays of Inexpensive Disks (RAID),
Proc. SIGMOD, Chicago, Illinois, 1-3 June 1988, pp. 109-116).
The paper roughly classifies RAID architectures according to the layout of data and parity
information: RAID 0, striping, where data is segmented and split (i.e. striped) onto multiple drives to
gain performance by parallelising I/O; RAID 1, mirroring, where duplicate data is kept on multiple
drives for protection by redundancy; RAID 3-4, data protection by redundancy and error-correcting
codes; and RAID 5, combined striping and redundancy with error-correcting codes. (RAID 2,
bit-level striping with Hamming codes, is rarely used in practice.) For the higher levels, a
mathematical error-correcting code (ECC) is calculated from multiple drives and stored on another
drive.
A simple, high-performance, fault-tolerant but only moderately expensive RAID consisting of 10
disks, special fast controllers, and software demonstrates the benefits and properties neatly:
1. Performance: Using 8 disks, each bit of a byte is written to and read from a separate disk in
parallel, so one can effectively get 8 times the transfer rate of one disk.
2. Data protection by redundancy: At the same time, a parity bit is written on the 9th disk. If one disk
fails, the system recreates its information on the fly using the parity. With the controller, the penalty of
computing the parity for writes can be negligible. Because the parity depends only on the data being
written, there is no need to read data already on the disk to calculate it, as with some other schemes.
3. Fault tolerance: The 10th disk is an on-line replacement disk: the system can switch (even
automatically) to use it at once. There is no downtime, because the data can still be read correctly
while the system recreates the contents of the broken disk. If the disks are hot swappable, the failed
disk can then be replaced later, also on the fly.
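The parity scheme described above can be sketched in a few lines: the parity disk holds the bitwise XOR of the data disks, and any single failed disk can be reconstructed by XOR-ing the survivors with the parity. This is a toy illustration of the principle only, not a description of any particular controller; the stripe contents are made up for the example.

```python
from functools import reduce

def parity(disks):
    """The parity 'disk' is the bitwise XOR of the given disks, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))

# Eight data disks, each holding one stripe of the data (toy 2-byte stripes).
data_disks = [bytes([17 * i % 256, (3 * i + 1) % 256]) for i in range(8)]
parity_disk = parity(data_disks)

# Disk 5 fails: rebuild its contents from the seven survivors plus the parity
# disk, since XOR-ing everything except the missing disk yields the missing disk.
survivors = data_disks[:5] + data_disks[6:] + [parity_disk]
rebuilt = parity(survivors)
assert rebuilt == data_disks[5]
```

Because XOR is its own inverse, the same operation both computes the parity and recovers a lost disk, which is why the reconstruction can happen on the fly during normal reads.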

Here we discuss Unix (and Windows) local file systems.

4.2.1. Address size of file systems
As the performance of computers and the amount of stored data increase, one needs to process with
more precision and store more data than is possible with existing architectures. Hence the sizes of
various registers, addresses, pointers, and memory units have to be increased. This is usually a slow
and tedious process, because it means redesigning the whole architecture of both hardware and
software around the new values while usually maintaining backward compatibility.
Nowadays these sizes are invariably power-of-two multiples of a byte or octet of 8 bits: 8-bit bytes,
16-bit words, 32-bit long or double words, 64-bit quad words, or 128-bit "octawords".
The computer industry is in the process of changing these sizes: for computation with real floating-
point numbers, from 32-bit single precision to 64-bit double precision (mostly done); for operating
systems, from 32 bit to 64 bit (mostly done, but PCs with Linux and Windows are lagging behind,
because the Intel x86 or IA32 hardware platform is 32-bit and the new 64-bit Intel Architecture IA64
is just emerging); and for character sets, from 8-bit ISO-8859 to 16- and 32-bit Unicode and
ISO 10646.
One of the limiting factors in file systems is the size of the pointers or addresses used to access data.
For file systems the prevalent address size used to be 32 bits, but most modern file systems are 64-bit.
This mostly determines the largest possible file system size and file size; usually the largest possible
file is about the same size as the largest file system, though a single file can span several file systems
and hence be even larger than one file system. The actual maximum in a given computer system is
usually limited by the amount of storage media, not by the address size.
In a file system, all blocks are identified with a 32- or 64-bit (unsigned or signed) integer. Hence the
last block has address 2^32 - 1 or 2^64 - 1 (for signed integers, these numbers should be halved). The
maximum file system size is obtained by multiplying the number of addressable blocks by the
allocation block size, which is typically 4 kB = 2^12 B. The size of the file in bytes is also stored,
again in 32-bit or 64-bit integer format. The resulting sizes are as follows (depending on the details of
the file system architecture, the exponents might again be slightly lower than indicated, requiring
division by 2, 4, or some other small power of 2; here EB is exabytes and ZB zettabytes):
                              32-bit                                 64-bit
   for bytes                  2^32 ~  4 * 10^9  =  4 GB              2^64 ~ 16 * 10^18 = 16 EB
   for 4 kB blocks            2^44 ~ 16 * 10^12 = 16 TB              2^76 ~ 64 * 10^21 = 64 ZB
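The limits in the table follow directly from the address width and the block size; the sketch below recomputes them, using binary prefixes (1 EB = 2^60 B, 1 ZB = 2^70 B), which is how the rounded figures above are obtained.

```python
def max_bytes(address_bits, block_size=1):
    """Largest size addressable with the given number of address bits,
    optionally counting in allocation blocks rather than bytes."""
    return 2 ** address_bits * block_size

GB, TB, EB, ZB = 2 ** 30, 2 ** 40, 2 ** 60, 2 ** 70  # binary prefixes

assert max_bytes(32) == 4 * GB                # 32-bit byte addresses:   4 GB files
assert max_bytes(64) // EB == 16              # 64-bit byte addresses:  16 EB files
assert max_bytes(32, 4096) // TB == 16        # 32-bit 4 kB blocks:     16 TB file systems
assert max_bytes(64, 4096) // ZB == 64        # 64-bit 4 kB blocks:     64 ZB file systems
```

Multiplying by the 4 kB = 2^12 block size simply adds 12 to the exponent, which is why the block-addressed limits are 4096 times the byte-addressed ones.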

An exabyte is about a million terabytes, and hence about a million times larger than most large file
systems in use today, but a large address space is needed to accommodate the exponential disk
capacity growth observed in recent years. As disk sizes grow, the address space needs to be
sufficiently large, and the file system structures and algorithms also need to scale. In the future, as the
file system size limitations of Linux are eliminated, Linux file systems will also scale to the full 64 bits.

4.2.2. Local Unix file systems and inodes
In traditional UNIX file systems, files are written to disk using inodes, indirect blocks, and data
blocks. Inodes and indirect blocks are considered metadata, as distinguished from data, the actual file
contents:
         1. Directory entries (one or more hard links) point to the inode of the file.
         2. Each file has an inode containing information such as the file size and the time of last
modification. The inodes of small files also contain the addresses of all disk blocks that comprise the
file data.
         3. A large file can use too many data blocks for an inode to address directly. In such a case,
the inode points instead to one or more levels of indirect blocks that are deep enough to hold all of
the data block addresses. This is the indirection level of the file.
         A file starts out with direct pointers to data blocks in the inode (zero levels of indirection). As
the file grows to the point where the inode cannot hold enough direct pointers, the indirection level is
increased by adding an indirect block and moving the direct pointers there. Subsequent levels of
indirect blocks are added as the file grows. This allows file sizes to grow up to the largest supported
file system size.
         4. The data resides in the data blocks.
Extents are sets of contiguous allocation blocks used by several file systems to enhance the spatial
locality of files, leading to better performance: fewer disk head movements are needed, the chances
for multi-sector transfers improve, and disk cache misses are reduced. Using extents also reduces
external fragmentation. Extents also provide a way to organise free contiguous space efficiently.
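The reach of each indirection level follows directly from the block and pointer sizes. A minimal sketch, assuming illustrative values of 4 kB blocks, 4-byte (32-bit) block addresses and 12 direct pointers in the inode; real file systems differ in the exact numbers and in how the levels are combined:

```python
BLOCK = 4096              # allocation block size in bytes (assumed)
PTR = 4                   # size of one block address in bytes (assumed, 32-bit)
DIRECT = 12               # direct pointers held in the inode (assumed)
PER_BLOCK = BLOCK // PTR  # addresses one indirect block can hold: 1024

def max_file_size(levels):
    """Largest file addressable with the given number of indirection levels,
    keeping the direct pointers and adding one full indirect tree per level."""
    blocks = DIRECT
    for level in range(1, levels + 1):
        blocks += PER_BLOCK ** level
    return blocks * BLOCK

assert max_file_size(0) == 12 * 4096                       # 48 kB, direct only
assert max_file_size(1) == (12 + 1024) * 4096              # ~4 MB, one level
assert max_file_size(2) == (12 + 1024 + 1024 ** 2) * 4096  # ~4 GB, two levels
```

Each extra level multiplies the addressable block count by roughly a factor of 1024, which is why a small, fixed number of levels suffices to reach the largest supported file system size.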

4.2.3. Traditional Unix File systems (UFS and EXT)
On Unix, some of the traditional file systems are the Unix File System (UFS, Sun) and the Second
Extended Filesystem (ext2fs, by Rémy Card, Theodore Ts'o and Stephen Tweedie), usually used in
Linux and based on the original Extended File System (ext) and the Virtual File System (VFS) layer.
The ext2fs file system works rather well in small Linux installations (and considering that the source
code consists of only about 6000 lines, its effectiveness is remarkable).
All these traditional static Unix file systems use 32-bit addressing, are block-structured, and have
slow restart times after system crashes. They use fixed space allocation for inodes, which can take on
the order of 5% of the disk space.
These conventional block-allocation-based file systems like ext2fs allocate fixed-size (often 4 kB)
allocation units called blocks for the data of a file. Hence the last, trailing block is on average only
half filled, leading to an average waste of 2 kB per file. Since the average file size on Unix servers is
typically around 100 kB, this wasted space can be a problem. This waste is called internal
fragmentation (and in modern file systems it is often solved by packing small files and the tail ends of
large files).
These traditional file systems are especially space-inefficient and troublesome for small files, because
at least one 4 kB block must be used for each small file, even if there are only a few bytes in
the file. If there are a lot of small files, the wasted space can be huge. For instance, if one writes 100
files with contents of 1 byte each, every one of them uses one block (often 4 kB, altogether 400 kB),
when they could really be packed into one block (100 B in one 4 kB block).
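The waste from internal fragmentation is easy to quantify: each file occupies a whole number of blocks, so on average about half a block is lost per file. A sketch under the same 4 kB block assumption used in the text:

```python
import math

BLOCK = 4096  # allocation block size (assumed 4 kB, as in the text)

def allocated(size):
    """Space actually consumed: size rounded up to whole blocks (min. one block)."""
    return max(1, math.ceil(size / BLOCK)) * BLOCK

# 100 one-byte files: 400 kB allocated for only 100 bytes of data.
assert sum(allocated(1) for _ in range(100)) == 100 * 4096

# A 100 kB file wastes whatever is left of its last, partially filled block.
waste = allocated(100_000) - 100_000
assert 0 <= waste < BLOCK
```

With a typical 100 kB average file size the ~2 kB average waste is only a few percent, but for a workload dominated by tiny files the overhead approaches a full block per file.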
In all block-allocation-based file systems, (external) fragmentation, where the files and the pool of
free blocks tend to consist of non-contiguous blocks because of the constant deletion and creation of
files of different sizes, is always a problem, and the file system needs to be defragmented from time to
time. When files are stored in non-contiguous blocks, accessing them becomes slower.
A historically important file system is the Fast File System (FFS, by M.K. McKusick, W.N. Joy, S.J.
Leffler, and R.S. Fabry, 1984). FFS employs knowledge of the parent directory location when
determining file layout. It uses large blocks for all but the tail of a file to improve I/O performance,
and small blocks called fragments for the tails, so as to reduce the cost of internal fragmentation.
Numerous other improvements were also made to what was then the state of the art. FFS remains the
architectural foundation for many current block allocation file systems, and was later bundled with
the standard Unix releases.
In these file systems there are performance problems, for instance in supporting tens or hundreds of
thousands of files in one directory, and size problems, mainly due to the 32-bit addressing. There is
also the problem of slow restart times after a system crash, because there is no journaling (as
discussed below).

4.2.4. Journaling
When a Unix computer with a traditional file system (ext2fs or a similar file system) is rebooted after
an unexpected interruption such as a power failure, it runs a file system consistency check (a program
called fsck) that walks through the entire file system, validating all entries and making sure that blocks
are allocated and referenced correctly. It finds corrupted directory entries and attempts to repair them.
When the damage cannot be repaired, all the entries in a corrupted directory can be "lost", which
means they get linked into a special directory (lost+found) on each file system. Blocks put into this
directory are in use, but there is no way to know where they were referenced.
These checks can take many hours to complete, depending on the number of files the file system is
managing. Using journaling technology, a file system can recover quickly from a disaster and restart
very quickly after an interruption, regardless of the number of files, thus avoiding these lengthy
checks. Journaling is similar to the transaction logging done by databases.
Journaling file systems maintain a special non-cached file called a log or journal. Whenever the file
system is updated, a record describing the transaction is added to the log. An idle thread processes
these transactions, writes the data to the file system, and flags processed transactions as completed. If
the machine crashes, the background process is run on reboot and simply finishes copying updates
from the journal to the file system. Incomplete transactions in the journal file are discarded, so the file
system's internal consistency is guaranteed. This cuts the complexity of a file system check by a
couple of orders of magnitude; a complete consistency check is in principle never necessary, and
restoring a file system after a reboot is a matter of seconds at most.
A journaling file system has to provide journaling while minimising the performance impact of
journaling on read and write transactions, i.e. its journaling structures and algorithms have to be
tuned to log the transactions rapidly, for instance using table structures like balanced trees for fast
searches and rapid space allocation. This also speeds up response times for directories with tens of
thousands of entries.
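The recovery logic described above can be sketched as a toy write-ahead log: updates are appended to a journal followed by a commit marker, and on reboot only committed transactions are replayed while incomplete ones are discarded. This illustrates the principle only; real journaling file systems log block-level metadata updates, not key/value pairs.

```python
journal = []        # append-only log of (transaction_id, operation) records
COMMIT = "commit"   # marker record: the transaction before it is complete

def log_txn(txn_id, updates):
    """Record a transaction's updates in the journal, then commit it."""
    for key, value in updates:
        journal.append((txn_id, ("write", key, value)))
    journal.append((txn_id, COMMIT))

def replay(journal):
    """On reboot: apply committed transactions, discard incomplete ones."""
    committed = {txn for txn, op in journal if op == COMMIT}
    state = {}
    for txn, op in journal:
        if op != COMMIT and txn in committed:
            _, key, value = op
            state[key] = value
    return state

log_txn(1, [("a", 1), ("b", 2)])
journal.append((2, ("write", "a", 99)))     # crash before transaction 2 commits
assert replay(journal) == {"a": 1, "b": 2}  # the incomplete update is discarded
```

Because recovery only scans the journal rather than the whole file system, its cost depends on the journal length, not on the number of files, which is exactly why restart times become independent of file system size.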
For Linux there are several open source journaling file systems in various stages of completion,
offering different advantages; some of them are ready for production: ReiserFS (Hans Reiser),
XFS/Linux (SGI), JFS (IBM), and ext3fs (Stephen Tweedie, Red Hat). Veritas also has a journaling
file system, VxFS, for Sun Solaris (where the traditional file system is UFS), and they have
announced plans to
port it to Linux, but it will not be open source. A detailed technical comparison of some of them is
available in Linux Gazette, issue 55.
Most of the available options support dynamically extending file systems using a logical volume
manager (such as LVM), which makes them well suited for large server installations.

4.2.5. Extended File System (XFS)
XFS is a journaled 64-bit file system from Silicon Graphics (SGI), originally developed for their
desktop workstations and supercomputers. XFS is one of the leading high-performance file systems
and has been in production since late 1994. XFS for Linux was made available under the GPL in May
2001 by SGI. XFS (now at release 1.0.1 on Linux 2.4) is the first production-grade journaled file
system for Linux.
The XFS file system integrates volume management with full 64-bit addressing, scalable structures
and algorithms with the ability to support extremely large disk farms, guaranteed rate I/O, and
advanced journaling technology with fast transactions for fast, reliable recovery from system crashes
with guaranteed file system consistency. SGI claims that this combination delivers the most scalable
and high-performance file system in the world.
The XFS journaling technology allows it to restart very quickly after an unexpected interruption,
regardless of the number of files it is managing, avoiding lengthy file system checks. The XFS file
system provides the advantages of journaling while minimizing the performance impact of journaling
on read and write data transactions. Its journaling structures and algorithms are tuned to log the
transactions rapidly. XFS uses efficient table structures for fast searches and rapid space allocations,
and it delivers rapid response times even for directories with tens of thousands of entries.
It provides full 64-bit file capabilities that are supposed to scale far beyond today's file systems of
about 1 TB.
High Scalability
XFS is a full 64-bit file system and thus, as a file system, is capable of handling files as large as a
million terabytes. In the future, as the file system size limitations of Linux are eliminated, XFS will
scale to the largest file systems.
For the current Linux 2.4, the maximum file system size is 2 TB, and the maximum file size
(maximum offset) is 16 TB (4 kB pages) or 64 TB (16 kB pages). As Linux moves to 64 bits on the
block device layer, the file system limits will increase, and the file size limit will increase to 9 million
terabytes (or to the system drive limits). The file system block size is currently fixed at the system
page size, which is 4 kB on IA32. File system extents (contiguous data) are configurable at file
creation time and are multiples of the file system block size. Single extents can be up to 4 GB in size.
The supported physical disk sector size is the usual 512 bytes.
XFS is designed for high performance: as a file system it is capable of delivering near-raw I/O
performance, and sustained throughput in excess of 300 MB/s has been obtained on SGI MIPS
systems. XFS has proven scalability on SGI of multiple gigabytes per second on multiple-terabyte
file systems. As the bandwidth capabilities of Linux improve, the XFS file system will be able to
utilise those capabilities.
XFS implements fully journaled extended attributes. An extended attribute is a name/value pair
associated with a file. Attributes can be attached to all types of inodes: regular files, directories,
symbolic links, device nodes, and so forth. Attribute values can contain up to 64KB of arbitrary binary
data. XFS implements two attribute namespaces: a user namespace available to all users, protected by
the normal file permissions; and a system namespace, accessible only to privileged users. The system
namespace can be used for protected file system meta-data such as access control lists (ACLs) and
hierarchical storage manager (HSM) file migration status.
The Data Management API (DMAPI/XDSM) allows implementation of hierarchical storage
management software with no kernel modifications as well as high-performance dump programs
without requiring "raw" access to the disk and knowledge of file system structures.
The NFS version 3 protocol allows 64-bit file systems to be exported to other systems, and systems
that use the NFS v2 protocol may access XFS file systems within the 32-bit limit imposed by the
protocol. For Windows connectivity, XFS uses the open source Samba server to export XFS file
systems to Windows; Samba speaks the SMB (Server Message Block) and CIFS (Common Internet
File System) protocols.
The XFS for Linux file system supports swap to files, user and group quotas, and the POSIX Access
Control List (ACL) semantics and interfaces described in the draft POSIX 1003.1e standard. File
systems can be backed up while still in use, significantly reducing administrative overhead. Special
backup commands can be used for backup and restore of XFS file systems. XFS supports dumping of
extended attributes and quota information. XFS dumps created on either IRIX or Linux can be
restored onto an XFS file system on either operating system.

4.2.6. ReiserFS
ReiserFS version 4 (Hans Reiser) departs from the traditional block structure of Unix file systems.
ReiserFS was the first journaling file system to be included in the standard Linux kernel distribution:
it is available in Red Hat 7.1 and SuSE Linux 7.0.
ReiserFS initially used buffering and preserve lists to track modifications, which is in effect very
similar to journaling; full journaling support is being developed. Journaling features still exact a slight
performance penalty in some cases, in the interest of increased reliability and faster restart times.
ReiserFS can be slower than ext2fs when dealing with files between 1 kB and 10 kB (the average file
size on Unix servers is typically around 100 kB), but on average it is substantially faster.
Both files and filenames are stored in a balanced tree using a plug-in based object oriented variant of
classical balanced tree algorithms. Balanced trees are a robust algorithmic foundation for a file system.
It efficiently supports 100 000 files in one directory. The results, when compared to the conventional
block allocation based file system ext2fs, running under the same operating system and employing the
same buffering code, suggest that these algorithms are overall more efficient.
ReiserFS is space-efficient: small files, directory entries, inodes, and the tail ends of large files are
packed, reducing the storage overhead due to fragmentation. The traditional requirements for block
alignment are relaxed, and the fixed space allocation for inodes is eliminated. The effect is that many
common operations, such as filename resolution and file access, are faster than, for instance, in
ext2fs. Furthermore, the optimisations for small files are well developed. Being more effective for
small files does not make it less effective for other files; the file system is truly general purpose.
ReiserFS supports file system plug-ins that make it easy to create one's own types of directories and
files. This makes it easy to extend ReiserFS to support the requirements of protocols that are still being
finalized, such as streaming audio and video. For example, a system administrator can create a special
file system object for streaming audio or video files, and then create her own special item and search
handlers for the new object types. The content of such files can already be stored in TCP/IP packet
format, reducing processing latency during subsequent transmission of the actual file.
Future development plans include facilities to store objects much smaller than those that are normally
saved as separate files, to add set-theoretic semantics, and to retrieve files by specifying their attributes
instead of an explicit pathname. The ReiserFS file system is not yet 64-bit enabled, but this will
change in the future with later versions of the Linux kernel.

4.2.7. Journaled File System (JFS)
Journaled File System (JFS, IBM) technology is used in and developed for IBM's high-throughput
enterprise servers. JFS is a full 64-bit journaling file system; its design is sound and it has a proven
record on IBM servers.
The Linux port of JFS (by Steve Best) seems to be incomplete and further from production use than
its competitors, but it is licensed under the GNU General Public License.
JFS organises free blocks by structuring them in a tree, using a special technique to collect and group
contiguous runs of free logical blocks. Although it uses extents for a file's block addressing, extents
are not used to maintain the free space. Small directories are stored directly within an inode, although
with different limitations than in XFS; small files, however, cannot be stored directly within an inode.

4.2.8. EXT3FS
The ext3fs file system is an alternative for those Linux users who do not want to switch away from
their existing ext2fs file systems but require journaling capabilities. It is distributed in the form of a
kernel patch and provides full backward compatibility. It also allows one to convert an ext2fs
partition to ext3fs and back without reformatting, but this has the drawback that none of the advanced
optimisation techniques employed in the other journaling file systems is available: no balanced trees,
no extents for free space, and so on. It seems to lack the momentum that the others have.

4.2.9. Local Windows file systems
The original file system for DOS-based PCs is the File Allocation Table (FAT) file system, with the
infamous 8.3 file naming convention, even though long filenames are supported nowadays. FAT has
several subversions, of which the original FAT12 is still used for floppies. The usual FAT16 type
used for hard disks has severe size and performance limitations. These are partly addressed by the
newer FAT32 version supported by Windows 98 and 2000.
The first modern file system for Windows is the Microsoft NT File System (NTFS), a serialized log-
structured file system architecture by Tom Miller, Gary Kimura, Brian Andrew, and David Goebel.
The design attempts to optimise for small files, and it was one of the first OS designer efforts to
integrate small objects into the file name space. It also has some performance problems, especially
when compared with ext2fs. The design is perhaps optimal for floppies and other hardware-eject
media beyond OS control.
The Apple Macintosh file system employs balanced trees for filenames. It was an interesting file
system architecture for its time in a number of ways, but its problems with internal fragmentation
have become more severe as disk drives have grown larger, and the code has not received sufficient
further development.
These non-Unix workstation file systems are probably not very important for Grid computing.

4.3. Network file systems
Network file systems and access protocols allow users in diverse operating environments to access
and share remote files across the network. The most popular network file systems have been NFS and
AFS, but many new file systems are also available.

4.3.1. Authentication issues
For NFS, identical UIDs are needed on each system, which is a real administrative hassle. AFS
uses Kerberos v4, where authentication is done with a token. This solution is much better than
NFS's, but at CERN, for instance, it raises compatibility issues because of the differences between
Kerberos versions 4 and 5: CERN requires AFS functionality, but it currently seems to
require too much work to get AFS working with Kerberos 5.
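The UID requirement can be made concrete with a small sketch. The NFS protocol carries only numeric UIDs on the wire, so file ownership is preserved across hosts only if every host assigns the same UID to the same user. The hostnames and UID tables below are invented for illustration:

```python
# Sketch of the NFS UID problem: the protocol carries only numeric UIDs,
# so ownership is preserved across hosts only if every host assigns the
# same UID to the same user. (Hostnames and UIDs below are made up.)
passwd = {
    "host_a": {"alice": 1001, "bob": 1002},
    "host_b": {"alice": 1002, "bob": 1001},  # same users, swapped UIDs
}

def owner_seen_by(host, uid):
    """Resolve a numeric UID received in an NFS request on the given host."""
    for user, u in passwd[host].items():
        if u == uid:
            return user
    return "nobody"

# alice creates a file on host_a; host_b resolves the UID it receives.
uid_on_wire = passwd["host_a"]["alice"]          # 1001
print(owner_seen_by("host_b", uid_on_wire))      # bob - wrong owner!
```

This is exactly why NFS administration requires keeping the password databases of all participating hosts synchronised, whereas AFS sidesteps the problem with Kerberos tokens.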

4.3.2. NFS (Network File System)


One of the first network file systems was NFS (Network File System), originally
developed by Sun Microsystems in 1985 and then turned over to the Internet
Engineering Task Force (IETF; NFS V2 is defined in RFC 1094, V3 in RFC 1813). Nowadays it is
a de facto standard in the Unix world and used almost universally (but not exclusively) on networked
Unix computers, at least in simple configurations. NFS clients also exist for Windows,
Macintosh, VMS, and other non-Unix platforms.
NFS uses the UDP protocol and is hence stateless; some implementations also support TCP. The NFS
protocol is synchronous, and each block of data requires two network I/Os.
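The statelessness can be sketched as follows: every READ request carries the complete file handle and offset, so the server keeps no per-client state and any request can simply be retried after a crash. The file handle and contents below are illustrative only:

```python
# Minimal sketch of a stateless READ protocol in the spirit of NFS v2:
# each request is self-describing (handle + offset + count), so the
# server needs no per-client state and requests are idempotent.
# (File contents and handle values are invented for illustration.)

FILES = {0x2A: b"hello grid storage"}  # file handle -> data

def nfs_read(handle, offset, count):
    """Server side: stateless, idempotent READ."""
    data = FILES[handle]
    return data[offset:offset + count]

# Client side: one request and one reply (two network I/Os) per block.
blocks = []
offset, block_size = 0, 8
while True:
    chunk = nfs_read(0x2A, offset, block_size)
    if not chunk:
        break
    blocks.append(chunk)
    offset += len(chunk)

print(b"".join(blocks))  # b'hello grid storage'
```

The per-block request/reply round trip in the loop is the synchronous behaviour the text refers to; with a small kernel cache, the same blocks end up being fetched repeatedly.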
There are some fundamental drawbacks with NFS:
    1. The kernel cache is small, which results in poor performance: when an application
    accesses data only a few bytes at a time, the same data is transferred again and again over the
    network.
    2. The file server, which is visible to several machines, has to export file systems (even
    thousands of them).
NFS is a nightmare to administer because of the required file system exports; exports
management is hopeless. There have also been system hangs and deadlocks when several computers
tried to mount file systems at the same time. Mounting was made a bit easier with auto-mounting,
i.e., file systems are mounted only when accessed, but this causes problems when systems hang:
failures are often not reported to clients.
Version V2 supports single files only up to 2^31-1 bytes (2 GB). Version 2 was simpler than the
original version 1, and especially simple to implement.
NFS Version V3 (around 1995) supports 64-bit file sizes, write caching and buffers, and achieves
relatively good performance. The first implementation was Digital's, with DEC OSF/1 V3.0 for
Alpha AXP. Most vendors support NFS version V3.2. Further information on NFS V3 can be found
in Compaq's FTP archive.
A new version, V4, of NFS is also available. NFSv4 improves on security (strong security via
Kerberos V5 and the Low Infrastructure Public Key mechanism, LIPKEY), interoperability (namespace
compatibility across all platforms), performance (client file caching via the Internet while maintaining
traditional LAN performance) and Internet access (compound operations to minimize the number of
round trips). Sun is developing a reference implementation for Solaris, and is funding the University
of Michigan's Center for Information Technology Integration to produce an enterprise-quality
reference implementation for Linux on the basis of the Linux NFS Version 3 implementation.
Prototypes are available for download.
Recently (February 2000) Sun released its rights to the NFS trademark, along with the source code for
the Transport Independent Remote Procedure Call protocol (TI-RPC) of NFS, to the open source
community. TI-RPC is one of the foundations of NFS and a key component of the security
advancements in version 4.

4.3.3. AFS (Andrew File System)
AFS is a distributed file system product pioneered at Carnegie Mellon University as the Andrew File
System. It was later renamed and transferred to Transarc Corporation, which has supported and
developed it as a product. Transarc is now owned by IBM as a subsidiary (IBM Pittsburgh Labs).
AFS offers a client-server architecture for file sharing, providing location independence, scalability
and transparent migration capabilities for data. AFS also works well over WANs.


AFS is a distributed file system with several data servers and a metadata catalogue, the "name server".
AFS allows naming of files ("worldwide", from cooperating clients, of course) as if they were on a
locally mounted file system: everything is seen as a single tree. The name servers are replicated to
avoid the performance degradation that occurs if only a single name server is used.
In AFS, files reside in volumes, which can easily be split in two and then joined back together. A new
volume can also be added easily. Hence it is rather easy to add storage by adding more disks. This
kind of easy administration and expansion is a big advantage for large sites like CERN.
AFS is much better and easier to manage than NFS, because there is no exporting of mapped file
systems. AFS is also better than NFS because it has a cache, but there are problems when the cache is
too small: AFS works well if the cache is large enough, but when there is a lot of activity, the cache
gets constantly flushed, which degrades performance. AFS is good for small files, but not for large
files; AFS is not well suited, e.g., for distribution of software.
The future of AFS may be in doubt, because IBM is dropping it. IBM is designing a new file system
to replace AFS, but it has released the AFS code to the public (see OpenAFS below).
There is also the Distributed File System (DFS) developed by Transarc. It uses DCE (the Distributed
Computing Environment), and it was supposed to replace AFS; however, it will probably be
discontinued very soon.

4.3.4. OpenAFS
Transarc (IBM) released most of the code of AFS as open source and made it available for community
development and maintenance (from the IBM Open Source site). The release, called OpenAFS, is an
open source version of AFS 3.6. A dedicated site coordinates and distributes ongoing OpenAFS
development; OpenAFS version 1.1.1 was recently (July 2001) released.
Since then many sites have improved on AFS, and hopefully many more will; there are, e.g., ports to
new platforms, and the performance problems will hopefully be addressed. OpenAFS works nicely, is
well debugged, and might be an interesting solution for the future.
Arla is a free AFS implementation. The currently supported platforms are Linux, OpenBSD,
FreeBSD, NetBSD, Solaris, and Darwin (Mac OS X); part of the work has already been done on
AIX, Tru64, IRIX, HPUX and SunOS, but these are not supported.

4.3.5. GPFS (General Parallel File System)
General Parallel File System (GPFS) for AIX is a file system used in IBM RS/6000 SP systems, the
MPP computing systems manufactured by IBM and running its version of Unix, AIX. It is designed
for parallel jobs running on the SP. It allows shared access to files that may span multiple disk
drives on multiple SP nodes, and lets parallel applications simultaneously access even the same files
from different nodes.
GPFS uses the speed of the SP switch to accelerate parallel file operations. The communication
protocol between the GPFS daemons can be the Low-Level Application Programming Interface
(LAPI, better performance through the SP Switch) or Transmission Control Protocol/Internet
Protocol (TCP/IP).
GPFS has very high performance and does journaling. It offers high recoverability and data
accessibility. DFS can export a GPFS file system to DFS clients. Almost all applications run exactly
the same under GPFS as they do with other file systems, and Unix file system utilities are supported,
so users can use familiar commands for file operations.
GPFS improves system performance by:


     •  Allowing multiple processes on different nodes simultaneous access to the same file using
        standard file system calls.
     •  Increasing file system bandwidth by striping across multiple disks.
     •  Balancing the load across all disks to maximize throughput; all disks are equally active.
     •  Supporting large amounts of data and bigger file systems.
     •  Allowing concurrent reads and concurrent writes from multiple nodes, which is important in
        parallel processing.
In GPFS, all I/O requests are protected by a token management system, which ensures that the file
system honours atomicity and provides data consistency across multiple nodes. GPFS allows
independent paths to the same file from anywhere in the system, and it can find an available path to
file system data even when nodes are down. GPFS increases data availability by creating separate
logs for each node, and it supports mirroring of data to preserve it in the event of disk failure.
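The token idea can be sketched in a few lines: before a node touches a byte range of a file, it must hold a token for that range, and conflicting requests are blocked. This is a simplification (a real token manager revokes and migrates tokens rather than refusing requests outright), and the class and node names are invented:

```python
# Sketch of byte-range token management in the spirit of GPFS: a node
# must hold a token covering a byte range before reading or writing it;
# overlapping requests involving a writer conflict.
# (A real token manager revokes tokens instead of refusing outright.)

class TokenManager:
    def __init__(self):
        self.tokens = []  # (node, start, end, mode) with mode "r" or "w"

    def acquire(self, node, start, end, mode):
        for n, s, e, m in self.tokens:
            overlaps = start < e and s < end
            if overlaps and n != node and "w" in (mode, m):
                return False  # a writer is involved in the overlap
        self.tokens.append((node, start, end, mode))
        return True

tm = TokenManager()
print(tm.acquire("node1", 0, 4096, "w"))     # True
print(tm.acquire("node2", 2048, 8192, "r"))  # False, overlaps a write token
print(tm.acquire("node2", 4096, 8192, "w"))  # True, disjoint range
```

Because tokens are granted per range rather than per file, many nodes can write disjoint parts of the same file in parallel, which is precisely the concurrent-write pattern of SP parallel jobs.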
GPFS enhances system flexibility through dynamic configuration: disks can be added or deleted while
the file system is mounted. Administration is simple; most tasks can be performed from any node,
affecting the file system across the entire system, and the commands are similar to UNIX file system
commands.
GPFS is implemented on each node as a kernel extension and a multi-threaded daemon. GPFS appears
to applications as just another file system, because they make normal file system calls to AIX: the
kernel extension satisfies requests by sending them to the daemon, which performs all I/O and
buffer management, including read-ahead for sequential reads and write-behind for non-synchronous
writes. Files are written to disk as in traditional UNIX file systems, using inodes, indirect blocks, and
data blocks.
There is one metanode per open file which is responsible for maintaining file metadata integrity. All
nodes accessing a file can read and write data directly using the shared disk capabilities, but updates to
metadata are written only by the metanode. The metanode for each file is independent of that for any
other file and can be moved to any node to meet application requirements.
The maximum number of file systems is 32, disks in a file system 1024, and maximum file system size
9 TB. The maximum indirection is 3 and replication 2. The maximum block size is 1 MB.

4.3.6. Clustered XFS (CXFS, SGI)
Clustered XFS (CXFS) was developed by Silicon Graphics for high-performance computing
environments like their Origin systems. It is supported on IRIX 6.5, and also on Linux and Windows
NT. CXFS is designed as an extension to SGI's XFS file system, and its performance, scalability and
properties are for the most part similar to XFS; for instance, there is API support for hierarchical
storage management.
Like XFS, CXFS is a high-performance and scalable file system, journaled for fast recovery, with
64-bit scalability to support extremely large files and file systems. Size limits are similar to XFS:
maximum file size 9 EB, maximum file system size 18 EB; block and extent (contiguous data) sizes
are configurable at file system creation, with block sizes from 512 B to 64 kB for normal data and up
to 1 MB for real-time data, and single extents up to 4 GB in size. There can be up to 64k partitions,
64k-wide stripes and dynamic configurations.
CXFS differs from XFS by being a distributed, clustered, shared-access file system, allowing multiple
computers to share large amounts of data. All systems in a CXFS file system have the same, single file
system view, i.e. all systems can read and write all files at the same time at near-local file system speeds.
CXFS performance approaches the speed of standalone XFS even when multiple processes on


multiple hosts are reading from and writing to the same file. This makes CXFS suitable for
applications with large files, and even with real-time requirements like video streaming. Dynamic
allocation algorithms ensure that a file system can store and a single directory can contain millions of
files without wasting disk space or degrading performance.
CXFS extends XFS to Storage Area Network (SAN) disks, working with all storage devices and SAN
environments supported by SGI. CXFS provides the infrastructure allowing multiple hosts and
operating systems to have simultaneous direct access to shared disks, and the SAN provides high-
speed physical connections between the hosts and disk storage.
Disk volumes can be configured across thousands of disks with the XVM volume manager, and hence
configurations scale easily through the addition of disks for more storage capacity. CXFS can use
multiple Host Bus Adapters to scale a single system’s I/O path to add more network bandwidth. CXFS
and NFS can export the same file system, allowing scaling of NFS servers.
CXFS requires only a few metadata I/Os; the file data itself is transferred directly to disk. There is a
metadata server for each CXFS file system, responsible for metadata alterations, coordinating access
and ensuring data integrity. Metadata transactions are routed over a TCP/IP network to the metadata
server. Metadata transactions are typically small and infrequent relative to data transactions, and the
metadata server deals only with them; hence it does not need to be large to support many clients, and
even a slower connection (like Fast Ethernet) can be sufficient, though a faster one (like Gigabit
Ethernet) could be used in high-availability environments. Fast, buffered algorithms and structures
for metadata and lookups enhance performance, and allocation transactions are minimised by using
large extents. However, especially metadata-intensive applications could see reduced performance
due to the overhead of coordinating access between multiple systems.
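The split between the control path and the data path can be sketched as follows. Metadata operations go as small messages to a metadata server, while file data moves directly between the client and the SAN disks. All class and method names here are invented for illustration:

```python
# Sketch of the CXFS-style split between control and data paths:
# metadata transactions go to a metadata server over TCP/IP, while
# file data is written directly to the shared (SAN) disks.
# (Names and block addresses are invented for illustration.)

class MetadataServer:
    """Handles only small, infrequent metadata transactions."""
    def __init__(self):
        self.extents = {}   # filename -> list of (disk_block, length)
        self.requests = 0

    def allocate(self, name, length):
        self.requests += 1
        extent = (len(self.extents) * 1024, length)  # fake block address
        self.extents.setdefault(name, []).append(extent)
        return extent

san_disk = {}  # stands in for the shared disks on the SAN

def write_file(mds, name, data):
    block, _ = mds.allocate(name, len(data))  # one small metadata I/O
    san_disk[block] = data                    # bulk data goes direct to disk

mds = MetadataServer()
write_file(mds, "run42.dat", b"x" * 4096)
print(mds.requests, len(san_disk))  # 1 1
```

Since the metadata server sees one small request per allocation while megabytes of data bypass it entirely, a modest server and network suffice for many clients, exactly as the text argues.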

4.3.7. GFS (Global File System)
The Global File System (GFS, version 4.2) is a 64-bit shared-disk cluster file system for Linux. GFS
was developed at the University of Minnesota, USA (Department of Electrical and Computer
Engineering, Parallel Computer Systems Lab, Binary Operations Research Group, B.O.R.G.), but it
is now owned and developed by a small company, Sistina Software.
GFS allows multiple servers on a storage area network (SAN) to have read and write access to a single
file system on shared SAN devices; cluster nodes physically share the same storage by means of Fibre
Channel or shared SCSI devices. The file system appears to be local on each node, and GFS
synchronizes file access across the cluster. GFS is fully symmetric, that is, all nodes are equal and
there is no server that could become a bottleneck or single point of failure. GFS uses read and write
caching while maintaining full UNIX file system semantics. GFS supports journaling and recovery
from client failures.
The maximum GFS file system size is 1 TB (2^40 bytes) due to Linux kernel limitations, while the
maximum GFS file size is a full 64 bits (i.e. much more than 1 TB). It does not yet support quotas or
ACLs.
GFS is an interesting product, but it is not in production anywhere, and hence its true performance
and properties are not yet known (it is being tested, for instance, at CERN). It is claimed to address
all of the problems described above.
GFS creates a single image of the data for all users. Should a server fail, the load is automatically
redistributed and balanced on the remaining servers, and access to the data is uninterrupted.
GFS permits users to select low cost hardware from trusted vendors of their choice. Servers and
storage devices from different vendors can be used side by side with no detrimental effect on system
performance. Servers and storage devices can be dynamically added as the data storage requirements
of an organization increase while the entire system remains on-line and accessible.


4.3.8. Direct Access File System (DAFS)
The Direct Access File System (DAFS) protocol is a new, fast, and light-weight file-access protocol
for data centre environments designed to take advantage of standard memory-to-memory interconnect
technologies such as Virtual Interface (VI, discussed below) and InfiniBand as its standard transport
mechanism. The protocol will enhance the performance, reliability and scalability of applications by
using a new generation of high-performance and low-latency storage networks.
The DAFS Collaborative consists of nearly a hundred companies, including Adaptec, Cisco,
Compaq, Fujitsu, HP, IBM, Intel, NEC and Seagate. It has released the v1.0 specification of the
DAFS protocol and submitted it to the IETF as an Internet Draft. The protocol is expected to be
published as an Internet standard soon, and the first DAFS products should appear on the market
early next year (2002), with availability on multiple platforms later that year.
Recently (in October 2001) the Storage Networking Industry Association (SNIA) announced the
formation of the DAFS Implementers' Forum, effectively the successor to the DAFS Collaborative,
to focus on the marketing and delivery of interoperable solutions based on the DAFS file access
protocol.
DAFS is implemented as a file access library, which will require a VI provider library implementation.
Once an application is modified to link with DAFS, it is independent of the operating system for its
data storage. The Direct Access File System (DAFS) uses the underlying VI capabilities to provide
direct application access to shared file servers. It is optimised for high-throughput, low-latency
communication, and for the requirements of local file-sharing architectures.
Local file sharing requires high-performance file and record locking to maintain consistency. DAFS
allows locks to be cached, so that repeated access to the same data need not result in a file server
interaction, and when required by a node, a lock cached by another node is transferred without
timeouts. DAFS is also designed to be resilient to both client and file server reboots and failures.
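The benefit of lock caching can be sketched with a toy lock server: a node keeps a granted lock cached, so repeated access to the same record costs no server interaction, and when another node needs the lock it is transferred rather than waited out. All names here are invented for illustration:

```python
# Sketch of DAFS-style lock caching: a cached lock makes repeated access
# free of server round trips; a conflicting request transfers the lock
# instead of timing out. (Class and record ids are invented.)

class LockServer:
    def __init__(self):
        self.holder = {}      # record id -> node currently holding the lock
        self.messages = 0     # server interactions (network round trips)

    def access(self, node, record):
        if self.holder.get(record) == node:
            return "cached"          # no network round trip needed
        self.messages += 1
        self.holder[record] = node   # grant, or transfer from the old holder
        return "transferred"

srv = LockServer()
print(srv.access("A", 7))  # transferred (first grant)
print(srv.access("A", 7))  # cached
print(srv.access("A", 7))  # cached
print(srv.access("B", 7))  # transferred (lock moves to B)
print(srv.messages)        # 2
```

Three accesses by node A cost a single server interaction; only the hand-over to node B costs another, which is the behaviour that makes local file sharing over DAFS cheap in the common case.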
Given the nature of VI, with networked blocks of shared memory, the VI architecture and DAFS
require some level of trust between clustered machines. However, DAFS will provide secure user
authentication. The servers maintain a table of their partners, preventing unauthorized servers from
entering the cluster, and preventing servers that have left the cluster from accessing data for which
they were formerly authorized.

The Virtual Interface Architecture
The Virtual Interface architecture was developed by Compaq, Intel, and Microsoft to provide a single
standard interface for clustering software, independent of the underlying networking technology.
Currently, both network-attached storage and direct-attached storage require a significant number of
extra computing cycles and operating system support to copy data from file system buffers into
application buffers. VI provides two fundamentally new capabilities to eliminate this overhead, plus
any additional overhead from network protocol processing: direct memory-to-memory transfer and
direct application access. These allow data to bypass normal protocol processing: applications can
perform data transfers directly to VI-compliant network interfaces, without operating system
involvement. The data is transferred directly between appropriately aligned buffers on the
communicating machines, and the VI host adapters perform all message fragmentation, assembly,
and alignment in hardware, allowing data transfer directly to or from application buffers in virtual
memory.
By eliminating protocol-processing overhead and buffer handling, the VI architecture improves CPU
utilization and drastically reduces latency. However, the VI architecture is optimised for
communication within a controlled high-speed, low-latency interconnection network; it is not
suitable for general WAN communication via the Internet.
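The essence of the direct memory-to-memory transfer can be sketched with Python's `memoryview`, which gives a rough, same-process analogue of writing into a pre-registered receive buffer without any intermediate protocol buffer. This is only an illustration of the idea, not of the VI API itself:

```python
# Sketch of VI's direct memory-to-memory idea: data lands straight in a
# pre-registered application buffer, with no copy into an intermediate
# protocol buffer. memoryview is a same-process stand-in for the
# hardware-assisted transfer.

send_buf = bytearray(b"physics event data")
recv_buf = bytearray(len(send_buf))   # "registered" receive buffer

# "RDMA-style write": place the bytes directly into the receiver's buffer.
memoryview(recv_buf)[:] = send_buf

print(bytes(recv_buf))  # b'physics event data'
```

In a real VI transfer the copy is performed by the host adapter in hardware while the CPU is free, which is where the latency and CPU-utilization gains come from.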


The specification can be implemented to run over a variety of physical interconnection networks:
there are current implementations for Fibre Channel and proprietary interconnection networks, and
planned ones for TCP/IP over Gigabit Ethernet (even 10 Gigabit Ethernet) and InfiniBand. The
InfiniBand Architecture is a new approach to I/O technology and a common I/O specification for a
channel-based, switched-fabric technology, developed by the InfiniBand Trade Association (members
include Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft and Sun).

4.3.9. Coda and InterMezzo
Coda is an advanced networked file system developed at Carnegie Mellon University since 1987 by
the systems group, including M. Satyanarayanan, Peter J. Braam, and Michael Callahan. The source
code of Coda is freely available under a liberal license.
Coda was originally implemented on Mach 2.6, but it has been ported to Linux, NetBSD and
FreeBSD. There is also a port of a large portion of Coda to Windows 95, and developers are currently
evaluating the feasibility of porting Coda to Windows NT.
Coda is a distributed file system with its origin in AFS2. It has many features that are desirable for
network file systems. Coda has several failure resilience features directed at mobile computing, such
as disconnected operation, continued operation during partial network failures, network bandwidth
adaptation, reintegration of data from disconnected clients, and ability to share files even in the
presence of network failures. It also has features targeted at good performance, reliability and
availability: scalability, high performance through client-side persistent caching of files, directories
and attributes, write-back caching, read/write replication servers, and a security model for
authentication, encryption and access control, with Kerberos-like authentication and access control
lists (ACLs).

InterMezzo
InterMezzo is a new distributed file system project that has its origin in Coda. InterMezzo focuses
on high availability and is suitable for replication of servers, mobile computing, and managing
system software on large clusters. It was started in the fall of 1998 at Carnegie Mellon University by
Peter Braam, Michael Callahan and others, in an attempt to see whether some of Coda's achievements
could be reached with a much simpler design and implementation. InterMezzo is an Open Source
(GPL) project, and it is currently included in the Linux kernel.
InterMezzo is a client-server file system that maintains replicas of file system data across multiple
server machines. InterMezzo offers disconnected operation and automatic recovery from network
outages. The file system can also be used for replicating HTTP servers via fail-over replication.
InterMezzo file systems can be exported, somewhat as in NFS, and are then accessible over the
network.
InterMezzo consists of a Linux kernel module, Presto, which adds a new file system to the kernel, and
a user-level cache manager and file server, Lento or InterSync, which handles all the networking
aspects and data transfer. The kernel code, Presto, is responsible for managing the replication log,
for the reintegration of update records, and for presenting a file system to the kernel.
Lento is the original file server and cache manager, written in Perl. Servers and clients run the same
Lento daemon but employ different policies. InterSync is the new cache manager: InterMezzo
systems run a web server and InterSync, which initiates actions driven by kernel events, while the
web server is mainly there to serve up data (TUX is the primary target, but Apache works fine).
InterMezzo only tries to keep a file set (folder collection or volume) synchronised across a number of
replicators, one of which is the server and the others clients, and to synchronise a new client with the
servers. If a replicator modifies a file, the modifications are journaled and forwarded to all replicators.
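The journal-and-forward scheme can be sketched as follows: a modification is first appended to the local replication log, then the update record is forwarded to every other replicator of the file set. The class structure and names are invented for illustration:

```python
# Sketch of InterMezzo-style replication: a modification is journaled
# locally, then the update record is forwarded to all other replicators
# of the file set, which reintegrate it. (Names are invented.)

class Replicator:
    def __init__(self, name):
        self.name = name
        self.files = {}     # path -> contents
        self.journal = []   # replication log of local modifications

    def modify(self, path, data, peers):
        record = (path, data)
        self.journal.append(record)      # journal first
        self.files[path] = data
        for peer in peers:               # then forward the update record
            peer.reintegrate(record)

    def reintegrate(self, record):
        path, data = record
        self.files[path] = data

server = Replicator("server")
client = Replicator("client")
server.modify("/etc/motd", b"welcome", peers=[client])
print(client.files["/etc/motd"])  # b'welcome'
```

Because the journal is persistent in the real system (it lives in the underlying journaling file system), a disconnected replicator can simply replay the accumulated records when the connection returns, which is how disconnected operation and recovery fall out of the same mechanism.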


InterMezzo uses a standard local journaling file system (e.g. Ext3 on Linux) as the cache and server
file storage. It exploits the journaling infrastructure for its recovery.
InterMezzo does not entirely replace a disk file system; it merely wraps itself around the local file
system to manage moving data from servers to clients. Normally, access to and modification of the
InterMezzo file system runs at the full speed of the underlying disk file system. The scalability and
recovery are also those of the local file system, with a minor amount of recovery overhead for
InterMezzo. This design has some limitations compared with fully native metadata handling as in
Coda and AFS, but the drawbacks are compensated by simplicity, scalability, performance and
administrative ease.
Some of the design criteria were the following. The server file storage must reside in a native file
system, and the client kernel-level file system should exploit this. The system should perform
persistent kernel-level write-back caching. File system objects should have metadata suitable for
disconnected operation. Scalability and recovery of the distributed state should build on the
scalability and recovery of the local file systems. The system should use standard TCP protocols and
be designed to exploit existing advanced protocols such as rsync for synchronization and SSL and
SSH for security. Management of the client cache and server file systems should differ in policy but
use the same mechanisms.

4.3.10. Microsoft Distributed File System (MS DFS)
This is a completely different file system (from the Transarc DFS) by Microsoft, which runs only on
Windows NT or Windows 2000. Here all data resides in a single tree, as in Unix, irrespective of
which server the files actually reside on.
This file system is usable on a portable Windows laptop: when the laptop is not connected to the
network, it works on a sub-tree from a local cache, and the data files are resynchronised
automatically when the laptop is reconnected to the network.
The concept and implementation are nice, but a drawback, of course, is that it only works on
Windows.
4.3.11. Windows Network File System (CIFS)
The standard Windows network file system is the Common Internet File System (CIFS), an enhanced
version of the Server Message Block (SMB) protocol. It is the native protocol in Windows 2000, and
there is also a Linux and Unix implementation of it (Samba). The CIFS protocol has been tuned to
run well over slow-speed dial-up lines.
The redirector packages requests meant for remote computers in a CIFS structure, and also uses CIFS
to make requests to the protocol stack of the local computer. For NetBIOS requests, NetBIOS is
encapsulated in the IP protocol and transported over the network to the appropriate server. CIFS also
provides support for NFS (when Microsoft Windows Services for Unix is installed). Extensions to
CIFS and NetBT allow connections for Windows Internet Name Service (WINS) and Domain Name
System (DNS) name resolution directly over TCP/IP.
CIFS allows multiple clients to access and update the same file while preventing conflicts by
providing file sharing and file locking mechanisms that can be used over the Internet. They ensure that
only one copy of a file can be active at a time, prevent data corruption, and also permit caching and
read-ahead and write-behind without loss of integrity.
CIFS servers support both anonymous transfers and authenticated access to named files. File and
directory security policies are easy to administer. CIFS servers are highly integrated with the operating
system, and are tuned for maximum system performance.


Uniform Naming Convention (UNC) file names of the form \\Server\Resource\Path\filename.ext are
supported; these provide global file names, so that a drive letter does not need to be assigned for
remote files. File names support Unicode, so that they can use any character set, not just character sets
designed for English or Western European languages. The Distributed File System (DFS) allows the use of an
enterprise-wide namespace.
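Splitting a UNC name into its components takes only a few lines; the helper below is hypothetical, written solely to show why no drive letter is needed, since the server is already part of the global name:

```python
# Hypothetical helper splitting a UNC name of the form
# \\Server\Resource\Path\filename.ext into its components. The server
# is the first component of the name itself, which is why a remote
# file can be addressed globally without mapping a drive letter.
def parse_unc(name: str):
    if not name.startswith("\\\\"):
        raise ValueError("not a UNC name")
    server, resource, *rest = name[2:].split("\\")
    return server, resource, "\\".join(rest)

server, share, path = parse_unc(r"\\fileserver\projects\wp2\report.doc")
print(server, share, path)  # fileserver projects wp2\report.doc
```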

4.3.12. Web-based file systems
Web-based file systems are file systems where access is (exclusively or not) through web
browsers or similar client interfaces, i.e., using the HTTP protocol instead of FTP or a similar file
access protocol. There is also some record-level access (to databases and files) using the HTTP
protocol. These browser interfaces are very easy to use: files are accessed simply by typing the
name of the file (or pointing at it with the mouse). File systems of this kind probably work only for rather
small files.
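The idea can be sketched with a throwaway HTTP server exporting a directory and a client fetching a file by name alone, the way a browser would; the served directory, file name and content are invented for illustration:

```python
import http.server
import os
import tempfile
import threading
from urllib.request import urlopen

# A throwaway HTTP server exports a directory, and the client retrieves
# a file from it simply by naming it in a URL, with no FTP session or
# mount required. Everything here is local and illustrative.
root = tempfile.mkdtemp()
with open(os.path.join(root, "results.dat"), "wb") as f:
    f.write(b"event data")

handler = lambda *a: http.server.SimpleHTTPRequestHandler(*a, directory=root)
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/results.dat"
with urlopen(url) as response:   # a plain HTTP GET, as from a browser
    data = response.read()
print(data)  # b'event data'
server.shutdown()
```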


4.4. Hierarchical storage management

4.4.1. Introduction
By Hierarchical Storage Management (HSM) one means a system where storage is implemented in
several layers from fast online primary storage to manual, off-line or automated near-line secondary
storage and even to off-site tertiary storage for archiving, backup and safekeeping purposes. In other
words, a simple HSM system consists of a disk farm, a tape cartridge robot silo, and special software
for the HSM file system. The HSM software includes automatic tools to keep track of the files, so
that the users do not have to know exactly where the data resides at any particular moment. The system
automatically handles the data movements between the storage layers.
The storage capacity of the layers increases from the primary online disk storage nearest to processing
to the more distant secondary tape storage layers, whereas the performance increases in the reverse
direction, and so does the price of storing a fixed amount of data. The top level of the hierarchy
consists of fast, relatively expensive random-access primary storage, usually disk, possibly in some
RAID configuration or even with a special extra-fast cache arrangement; it offers the best performance
(short access times, large transfer rates) but less storage capacity (from tens to thousands of
gigabytes). The secondary storage, on the other hand, consists of slower media, sometimes slower disk
but most often tape cartridges or cassettes, possibly automated by robots. There the initial access time
can be considerable, even a few minutes for near-line storage, and access on tapes is often serial, but
the storage capacity is large (20-100 GB per cartridge, and 10-100 TB, even petabytes, per robot silo
system) and the storage media are rather cheap.
HSM manages the storage space automatically by detecting the data to be migrated using criteria
specified by administrators, usually based on the size and age of the file, and automatically moves
selected data from online disk to less expensive secondary offline storage media. Each site configures
the HSM for optimal performance based on, e.g., the total number of files, average size of individual
files, total amount of data in the system, and network bandwidth requirements.
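Such a size-and-age policy could be sketched as follows; the thresholds, field names and file names are illustrative, not those of any particular HSM product:

```python
import time
from dataclasses import dataclass

# Sketch of the kind of policy an HSM applies: files are selected for
# migration to secondary storage once they exceed an administrator-
# chosen size and have been idle for a given number of days.
@dataclass
class FileInfo:
    name: str
    size: int            # bytes
    last_access: float   # seconds since the epoch

def migration_candidates(files, min_size, max_idle_days, now=None):
    now = time.time() if now is None else now
    idle_limit = max_idle_days * 86400
    return [f.name for f in files
            if f.size >= min_size and now - f.last_access >= idle_limit]

now = time.time()
files = [
    FileInfo("raw_run_017.dat", 8_000_000_000, now - 90 * 86400),  # big, cold
    FileInfo("analysis.log",    4_000,         now - 90 * 86400),  # too small
    FileInfo("raw_run_042.dat", 9_000_000_000, now - 1 * 86400),   # too recent
]
print(migration_candidates(files, min_size=1_000_000, max_idle_days=30, now=now))
# ['raw_run_017.dat']
```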
The HSM systems also support migration of data tapes to and from shelves, or export and import of
tapes out of the near-line robot system. Here operator intervention is required to move the tapes
between the shelves and robot silos.
The files are automatically and transparently recalled from offline storage when the user accesses a file. At
any time, many of the tapes are accessed through automated tape libraries. Users can access their data
stored on secondary storage devices using normal Unix commands, as if the data were online. Hence
to users all data appears to be online.

An HSM system can provide distributed access to, and management of, petabytes of data stored on tapes
and consisting of billions of files of varying sizes. HSM is a cost-efficient way to store huge amounts
of data. However, because of the relatively expensive initial investment, it is usually used only by
big institutions and organisations.
From the administrator’s point of view, one of the advantages of the HSM system is that there is no
need for a separate backup of the files, and also only the most recently referenced files are kept on
disks. When the system copies files to tapes, it creates backup copies on tapes. If a file remains unused
for a long period of time, the system deletes it from the disks. For security and backup reasons one
often makes at least two tape copies of the files, even in physically different locations. If desired,
the second copy can be dispensed with for huge, relatively "unimportant" files whose loss is not vital,
because a complete loss of tape files is relatively rare.
One administrative problem is the compacting of the tapes. When users delete files kept on tapes, the
files are only marked as deleted; for obvious reasons they are not actually deleted from the sequentially
accessed slow tapes. Over time this creates free "holes" on the tapes, and periodically the tapes in a
library must be compacted, i.e., all non-deleted files are copied to disk and back to different tapes,
thus eliminating the free holes and freeing tape media.
Hiding the concept from users
One of the objectives of HSM is to hide the details and exact location from users. Basically, this means
that the metadata for the files (name, type, creation, change and access dates and timestamps, online
status of the file (online, near-line, off-line etc), and other such information about the file extraneous to
data itself) should be kept online, even when the data itself is moved near-line or off-line. The exact
status of a file should be given only when specifically requested. Access to the file itself should be
transparent to the users and also to the software accessing the file (i.e., the users use the same
commands, software, and API calls to access the files regardless of their online status).
The system should keep track of the copies of the files kept on tapes and different caches, and move or
copy the file where it is needed.
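The "metadata stays online" idea described above can be sketched as follows; the entry fields, file name and in-memory "tape" dictionary are purely illustrative:

```python
from dataclasses import dataclass
from typing import Optional

# The catalogue entry (name, size, status) is always on disk and hence
# always visible, while the data itself is recalled transparently only
# when it is actually read.
@dataclass
class Entry:
    name: str
    size: int
    status: str               # "online" or "offline"
    data: Optional[bytes]     # None while the content lives only on tape

tape = {"run_007.dat": b"payload"}    # stand-in for the tape layer

def read(entry: Entry) -> bytes:
    if entry.status == "offline":     # transparent recall on access
        entry.data = tape[entry.name]
        entry.status = "online"
    return entry.data

e = Entry("run_007.dat", 7, "offline", None)
print(e.name, e.size, e.status)       # metadata visible before any recall
print(read(e), e.status)              # data recalled on first access
```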
From the user's point of view, the files on an HSM system are always visible, i.e., the directory entries
and metadata for the files are always kept on disk. Of course, there has to be some indication that the
file actually resides on tapes, and when accessing the file, the user might experience a longer delay
while the file is retrieved from tape. Depending on the local policy, the files could be retained on tapes
for a long period, and hence the system can be regarded also as a long-term archive. On more simple
systems, the responsibility of moving the files to the tape archive could be left to the user, with a quota
system ensuring that large files are not kept on disk unnecessarily.

4.4.2. Commercial solutions
UniTree
UniTree is one of the oldest hierarchical storage management systems; the current version is 2.1. It is
now part of the company OTG Software. The main problem with UniTree is that it does not scale at all.
There have been some attempts to make UniTree more scalable, but the scalability problems are deeply
rooted in the design, implementation decisions and architecture of UniTree. Its development started in
the 1980s at the San Diego Supercomputer Centre. It used to be very popular in the early 1990s, but it
is not used much anymore in big installations because of this non-scalability. UniTree is the
predecessor of HPSS, which is scalable.

HPSS (High Performance Storage System)
Development of HPSS began in 1993 as a Cooperative Research and Development Agreement (CRADA)
between government and industry in the USA. HPSS is progressing towards its fifth release (the current
release is 4.2). HPSS provides a scalable parallel storage
system for highly parallel computers, traditional supercomputers, and workstation clusters. A growing
number of sites around the world are using HPSS for production storage services.
The development team consists of IBM Global Government Industry and five DOE laboratories: Los
Alamos, Lawrence Livermore, Lawrence Berkeley National Energy Research Supercomputer Center
(NERSC), Oak Ridge, and Sandia. Also universities and National Science Foundation (NSF)
supercomputer centres and federal agencies have contributed to various aspects of this effort. Cornell
University, NASA's Langley Research Center, the San Diego Supercomputer Center, Argonne
National Laboratory, the National Center for Atmospheric Research, and Pacific Northwest National
Laboratory have contributed requirements and other assistance to the work.
Industry collaborators include Transarc (an IBM company, integrating HPSS with the Distributed File
System, DFS), Objectivity Inc. (which, together with several high-energy physics laboratories, is
trying to integrate its object-oriented data management system with HPSS), Kinesix Corporation (the
SAMMI graphical user interface for system managers), StorageTek, Sun Microsystems, and Compaq.
The HPSS collaboration is based on the premise that no single organization has the experience and
resources to meet all the challenges represented by the growing imbalance in storage system I/O,
capacity, and functionality.
HPSS is designed for high performance computing environments, in which large amounts of data are
generated by massively parallel processors (MPPs) and workstation clusters, and to use network-
connected and directly connected storage devices to achieve high transfer rates. The design is based on
IEEE Mass Storage System Reference Model (MSSRM), version 5.
Scalability is along several dimensions: data transfer rate, storage size, number of name space objects,
size of objects, and geographical distribution. Although developed to scale for order of magnitude
improvements, HPSS is a general-purpose storage system. It is claimed to be portable, but this seems
not to be true. Other key objectives are modularity with open interfaces, reliability, and recovery.
As can be seen from the large number of collaborators, HPSS was originally very popular. But
because it tries to implement the IEEE reference model, there are a lot of daemons, and since all the
daemons need to talk to each other, performance goes down. That is the basic reason why other HSM
solutions usually do not try to implement the MSSRM. HPSS maintenance seems to be still largely an
IBM effort.
HPSS was originally meant for MPP supercomputers, where one often needs large source data files
and high transfer rates, of the order of 100 MB/s for a single stream. For performance, HPSS supports
software striping, parallel threads, RAID disks, and RAITs (Redundant Arrays of Inexpensive Tape
drives, i.e., tape RAIDs, where tape drives work in parallel).
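The striping idea behind such parallel operation can be sketched as follows; the in-memory "drives" are plain byte buffers, purely for illustration of how round-robin chunking lets the aggregate rate scale with the number of drives:

```python
# A stream is cut into fixed-size chunks dealt round-robin across
# several drives, so the drives can write in parallel; reading reverses
# the round-robin order to reassemble the original stream.
def stripe(data: bytes, n_drives: int, chunk: int):
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), chunk):
        drives[(i // chunk) % n_drives] += data[i:i + chunk]
    return drives

def unstripe(drives, chunk: int) -> bytes:
    out = bytearray()
    offsets = [0] * len(drives)
    d = 0
    while offsets[d] < len(drives[d]):          # stop at the first empty drive
        out += drives[d][offsets[d]:offsets[d] + chunk]
        offsets[d] += chunk
        d = (d + 1) % len(drives)
    return bytes(out)

data = bytes(range(10))
drives = stripe(data, n_drives=3, chunk=2)
assert unstripe(drives, chunk=2) == data        # round trip is lossless
```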
HPSS supports large files spanning several disks on several servers, so that segments of file reside on
different disks. But this brings a drawback: the software has to inquire where the next block is, and
that requires a central server, which becomes a bottleneck.
HPSS is not well suited to applications that use a lot of random access, even though it handles very
fast streams well. Hence it works well with sequential access, but it does not work for databases.
DMF (Data Migration Facility)
Cray developed the Data Migration Facility (DMF) originally for the UNICOS operating system of
their vector supercomputers, several years ago in the early 1990s.

Later it was adapted by SGI (who owned Cray for a while) for their Origin platforms with the IRIX
operating system. There is even a port to Linux.
DMF manages tape libraries, makes archival decisions, and does journaling. However, DMF is not
scalable; the main problem is that it is completely centralised. It was basically designed for a single
supercomputer and is not suitable for several machines or distributed environments. About two years
ago DMF was redesigned to be more distributed, but it still does not scale beyond about 1000 users.
However, it works well for a few tens of computers, i.e., in a typical supercomputer centre.
DMF is running at over 80% of large Cray supercomputing sites today. In typical usage, DMF
supports about 4 million files per system and handles files up to 4-7 TB in size. The overall data
managed by DMF averages in the range of tens of terabytes, up to 300 TB, and will reach 1 PB in
some centres. Some DMF centres routinely move 2 TB daily; 150 GB to 500 GB per day is typical.
DMF supports several tape media and formats and absolute block positioning, and it is rather resilient
to interruptions and media failures thanks to media checking and recovery, two-phase commit, file
system auditing and journaled database transactions.
Storage and Archive Manager (SAM/ASM)
The Application Storage Manager (ASM, by StorageTek) and the Storage and Archive Manager File
System (SAM-FS, by LSC) are the same product sold under different brand names. The developing
company LSC was recently acquired by Sun. The file system supports only Sun's Solaris.
SAM-FS is a hierarchical, high-performance, 64-bit file system and volume manager. The product has
a nice file system, migration, and integrated backup, with good performance, but it does not scale
well. ASM has HSM features, and its performance is higher than that of, for instance, UniTree or
DMF. Applications
can easily access everything, and the file system is fully integrated and completely transparent to users
and applications.
Up to four backup and archival copies of data can be made to different types of media to protect
against media malfunction and create media volume sets for off-site storage. SAM-FS provides
continuous automatic and unobtrusive backup of work-in progress, effective long-term data storage,
and GUI-based management tools for flexible control of high performance storage systems. SAM-FS
archived files are written in tar format.
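Because the archive copies are plain tar, they can be read back with any standard tar reader; a minimal sketch, with the archive built in memory and the file name invented:

```python
import io
import tarfile

# An archive written in tar format can be unpacked with standard tools
# on any system, with or without the HSM software that created it.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as archive:
    payload = b"detector calibration data"
    info = tarfile.TarInfo(name="calib.dat")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as archive:   # any tar reader works
    recovered = archive.extractfile("calib.dat").read()
print(recovered)  # b'detector calibration data'
```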
With SAM-FS, multiple copies of a file can be made as soon as the file is created, and users can
instantly recall their files using their standard file system interface with no manual intervention by a
network administrator. Data can be recovered on any system, with or without SAM-FS software.
Open Storage Manager (OSM)
Open Storage Manager (OSM) was designed by a small company, Lachman. The design is good, but
the company had no money to implement it properly. Computer Associates bought OSM and renamed
it CA-Unicenter/Open Storage Manager, but it has been discontinued and unsupported for two years.
It was tried at CERN, and it is still in production at DESY, but they basically only fix problems. The
Thomas Jefferson National Accelerator Facility (Jefferson Lab, USA) used OSM for several years.
OSM is designed to fully exploit disk capacity by applying hierarchical storage management
techniques to managing data in large, distributed environments. It provides a multilevel, policy-based
solution for transparently and automatically managing the movement of infrequently accessed files to
and from slower and less expensive networked or local storage devices.

The Hierarchical Storage Management part of Tivoli/TSM is based on OSM. EuroStore and Enstore,
described below, are also based on OSM.
AMASS
AMASS from ADIC transforms an off-line storage device into an online direct
access mass storage. Instead of increasing raw disk capacity or using backup software or a true HSM
solution, AMASS presents automated tape, optical or DVD libraries as a single, exportable mount
point with a standard Unix file system consisting of directories and files.
Because AMASS uses a Unix-based directory and file system interface, it is transparent to
applications and users. The files are moved to and read from the library through the file system
interface, as with any disk. A disk-based index means disk-speed browsing, with no need to access the
library for operations like cd, ls, find, etc.
AMASS supports Compaq Tru64 Unix, HP PA-RISC HP-UX, IBM AIX, SGI IRIX and Sun Solaris,
as well as TCP/IP, NFS, RFS, NetWare, LAN Manager, and AppleShare. There is also a Windows
version available. It also supports a wide range of storage devices, drive technologies, and interface
protocols. It
can manage more than 100 different automated storage libraries. AMASS is scalable, supporting a few
gigabytes up to more than a petabyte.
Block-level file access allows AMASS to retrieve only selected data from large files, reducing disk
requirements and increasing system performance. A configurable disk cache provides disk-speed file
transfer from optical and tape media, minimising the need to access data in the library. A journaled
database provides continuous updating and journaling of index data for fast and reliable recovery.

4.4.3. Open source solutions
Here we discuss some open source (or nearly so) solutions for HSM: EuroStore (DESY et al.),
Enstore (FermiLab) and CASTOR (CERN). Jefferson Lab has also implemented a scalable,
distributed, high-performance mass storage system called JASMine, which is implemented entirely in
Java.
EuroStore
The EuroStore project is part of the ESPRIT IV (European Strategic Programme for Research and
Development in Information Technology) programme funded by the EU. It was launched in 1998 and
lasted for two years.
There are 4 full and 3 associate project partners: CERN (Conseil Européen pour la Recherche
Nucléaire), the European Research Institute for High Energy Physics, DESY (Deutsche Elektronen-
Synchrotron, in Hamburg), German national research centre for particle physics and synchrotron
radiation research, Quadrics Supercomputers World Ltd (QSW, owned by Alenia Aerospazio SpA,
Italy and Meiko, UK) and HCSA (Hellenic Company for Space Applications, owned by Alenia Spazio
SpA, Italy, and Commersa Ltd., Greece), Tera Foundation, "Terapia con Radiazioni Adroniche"
(Therapy with Hadronic Radiation, Italy), HNMS (Hellenic
National Meteorological Service, Greece), and AMC (Athens Medical Centre S.A.).
EuroStore is based on OSM; it is OSM rewritten in Java. It is supposed to scale to petabytes and even
exabytes, but to be much more secure (it should even have the security needed for medical
applications, and could hence be used, e.g., in hospitals).
EuroStore consists of a PFS parallel file system (developed by QSW) using HSM (developed by
DESY). The parallel file system PFS was initially introduced in the Meiko CS-2 (Computing Surface)
massively parallel computer system a few years ago, and allows versatile configurations of hardware and
software components among all the nodes of the system, leading to a high-performance, high-capacity
implementation. Due to its implementation on top of the standard VFS layer in the UNIX kernel, any
legacy application can use PFS and its additional features, like file migration, without any changes.
There is an implementation for the Sun Solaris environment.
EuroStore seems to be still, after three years, at the prototype level. There will be no more funding
from the EU. There are no technical problems, but most of the developers have already left. The
system has been tested at CERN. DESY is the only site where the product is in some use, but it is not
in production anywhere.
FermiLab Enstore
Enstore is designed for the data acquisition, processing and analysis systems of experiments at
FermiLab (Fermi National Accelerator Laboratory, Illinois, USA). The Enstore architecture and
design goals are described in its Technical Design Document.
Enstore is a better product than EuroStore. It is well tested with millions of files, and it seems to scale well.
It is not known whether it scales to very high data rates; the rates are now of the order of 10 MB/s
only. It has a good, well-done graphical interface and even a training guide for operators. Most HSM
products have their real bottleneck in configuration. With other products, for instance HPSS, one
needs a specially trained and knowledgeable system administrator to configure them. Enstore is in that
sense a simpler product, and standard operators can configure and administer it.
The Enstore system is designed to provide mass storage for large experimental data sets with billions of
files with a typical size of a few gigabytes. As such, it is not a general-purpose mass storage system,
but optimised to allow access to large datasets made of many files. The system supports random
access of files, but also streaming, the sequential access of successive files on tape. It is based on a
client-server model that allows hot swapping of hardware components and dynamic software
configuration, is platform independent, runs on heterogeneous environments, and is easily extendable.
The Enstore system provides access to the data by user/client applications, which are distributed
across an IP network. It supports tape drives attached locally to the users' computer, as well as those
remotely accessible over the network.
The system is written in Python, a scripting language with object-oriented features. It uses FermiLab's
StorageTek silos and ADIC robots. It also incorporates the pnfs package from DESY.
Castor
Castor (CERN Advanced Storage Manager) is an in-house disk cache system and Hierarchical Storage
Manager written at CERN. The goal is to handle data produced at high data rate by CERN elementary
particle physics experiments, and hence it is focused on High Energy Physics (HEP) requirements. It
is an open source program.
Castor is the second incarnation of an earlier archiving system, SHIFT, written 10-12 years ago by
Les Robertson and others (Jean-Philippe Baud for the tape and disk cache part), because Unix
(UNICOS excepted) had no tape software (and still has not). SHIFT has been in production for 10
years. SHIFT is
relatively "old-fashioned" and cumbersome, requiring external databases managed by the users.
SHIFT scales to tens of machines and hundreds of users, but not much beyond that.
CERN then tried to use OSM, and after that HPSS, but HPSS is a proprietary IBM product which runs
mainly on IBM platforms, not on Linux PCs; it did not give high performance in random access mode,
and it does not evolve quickly. It will be phased out at CERN by the end of 2001. Therefore, it was
decided to develop Castor to be backward compatible with the existing SHIFT software. Castor is a
superset of SHIFT; for instance, there are facilities to import into Castor tapes written by SHIFT.
Castor uses the Unix file system, but adds more space using tape cartridges to put data near-line. It
supports most SCSI tape drives and robotics.
Castor provides some basic Hierarchical Storage Manager functionality and capabilities: a Name
Server has been implemented, and users can migrate and recall files explicitly or the disk pool
manager can initiate migration automatically. Tape volume allocation is done automatically. Castor is
transparent to users and all levels of applications. Accessed files are copied to disk and accessed there.
One goal has been easy integration of new technologies and commercial products, by simple, modular
and clean design, and modularity so components can be easily replaced. Castor is available on all Unix
and NT platforms. The user, operator and administrator interface is a command-line interface, but a
graphical (GUI and Web) interface is being developed. Use, administration and deployment are easy.
Castor scales well, better than SHIFT, and has better queue management than SHIFT.
Castor has been in production at CERN for about a year now. It contains about 260 TB of data and
2 600 000 files. It has more data (but not more files!) than any other site. The migration and
recall rate is about 20 TB/week. Periodically tapes are compacted.
Castor performs well for both sequential and random access. Currently, it can sustain transfer rates of
more than 30 MB/s per stream. The goal this year was to sustain
aggregate transfer rates of 100 MB/s for weeks (for a single experiment). A peak performance of 120
MB/s was obtained and a sustained rate of 85 MB/s has been demonstrated for a week. At present, it is
unclear what the problem is: a hardware limitation, Linux-kernel, or software?
Castor supports very large files (file sizes up to 2^64 bytes, up to 2^64 files) and long file names
(pathnames up to 1023 bytes, of which the filename component is up to 255 bytes).
Castor offers the same security as SHIFT, with trusted hosts and passwords for remote non-trusted
requests. It is being interfaced to the GSI modules.

Grids are distributed environments that allow software applications to integrate computational and
information resources, and also enable specialised hardware, like experimental instruments, to be
shared. All these resources are managed by diverse organizations in widespread locations.
The ten-year history of the World Wide Web (WWW) has been a revolution in the way information is
organised and accessed. Today, everyone is taking for granted the ability to access information from
all over the world using the Web. The aim in developing Grid software is to cause a similar revolution
in the sharing of other computer resources, like processing power, database access and specialised
hardware. As with the Web, it is not easy to forecast future applications, or whether, for instance,
remote access to supercomputers and mass storage will ever be as straightforward as using the Web.
In a Grid, all resources are distributed among the individual workstations and other resources of the
organisations, their regional centres and individual participants belonging to a virtual organisation.
Hence, the computing resources, application programs and data are shared by many sites and
organisations, and the data will be replicated, scattered and possibly "fragmented" all around the Grid
network. The data storage consists of regional storage systems, whose architectural details, transfer
speeds and storage capacities vary widely. These local stores are autonomous and independent of each
other, and governed by the participants. In other words, the remote resources are wholly administered
and maintained by the owning organisations.
Most of the Grid environments will probably be so complex that a centralised administration and
architecture is impossible. Grid architecture will likely consist of resources acting at the same time as
both servers and clients for various aspects of Grid, and communicating with each other on a peer-to-
peer basis, similar to what has recently been used in Napster and Gnutella. Hence, no centralised
storage is needed, only some mechanism for discovering the data elements and accessing them.
These regional centres are often rather complex and heterogeneous computing environments. In all of
them, the conventional and common local storage infrastructure, like backups, access control and other
security features, are already in use. In addition, one needs a lot of special Grid enabling software
tools, programming interfaces and protocols (middleware) to enable seamless and transparent
integration of these distributed storage resources into a unified Grid Storage.
From the users' point of view, all the data will reside "on the Grid File System", i.e., it will be
distributed in the Grid, but the Grid middleware tools provide transparent and seamless access,
completely similar to local file system access, hiding all the details about the data location and
retrieval, how the data is distributed between the individual regional centres, and the sometimes rather
complex details of the means used to access the data.
To facilitate the seamless and transparent access to data in the Grid environment, one needs complex
and specialised software layers (middleware) to tie the architecturally varying components owned by
different organisations, with different ―cultures‖ and administrative usages, into a unified whole as
seen by the users. And all the time necessary security and access controls and good administrative
practices have to be maintained.
Below some requirements and considerations for data access in a Grid environment are discussed.
Although some solutions for the issues are beginning to emerge, much remains to be done. The initial
attempts are still far from integrated, polished, and widely accepted middleware components. Thus, for
the main part, this is still an open research field.



5.2.1. Requirements for a Grid environment
A large part of the requirements for a Grid storage system are the same as for a local file system, but
there are some special requirements, because a Grid consists of widely distributed storage facilities
and will store a huge amount of data (of the order of petabytes or more). Such a large amount of data
needs so much storage space that the storage will probably have to be distributed as well.
For instance, in the case of the DataGrid, the huge amount of data (around 10 petabytes in a year) to be
processed is produced centrally at CERN (by the LHC experiments), and then distributed to the
storage and processing servers at the Tier centres. So, basically, the experimental data will be
distributed on the whole DataGrid (i.e. on the European and even world-wide scale).

One problem within the Grid environment will be the location of the data. One will need Data
Catalogues to find the data, special Data Discovery mechanisms to find the relevant data by describing
it, Data Replication to intelligently disperse the files nearer to the computing resources where they will
be needed later.
For a Grid, one needs special grid-wide Data Catalogue, Discovery, Replication and Cache Services,
and special interfaces for the user to find, organise and retrieve the data from these databases. All these
special services have to be distributed, which requires several servers. These services together with
other needed middleware layers form a grid-wide hierarchical network file system, the Grid File
System. The catalogues describe the stored files to the users, while behind the scenes replication and
caching take care of performance of this Grid file system.
These servers might exchange information using peer-to-peer protocols and mechanisms, where
central data servers and repositories are not necessary. This might be similar to the concept of Napster,
where only a database of the available (music) files was kept centrally, but the transfers were always
between the computers of individual users, or Gnutella, where even the central database was dispensed
with. Each Gnutella client is also a server, and needs only a few neighbouring servers to connect to. A
user sends the request for a file using the Gnutella protocol via these neighbours, and the request then
spreads through the whole Gnutella network. By limiting the number of "hops" one can limit the number
of servers accepting the request (for instance, the request could be limited to spread to about 10 000
computers), creating a limiting horizon for the request. For efficiency, once the requested file is
found on a server, it is transferred directly between that server and the requesting computer.
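The hop-limited forwarding described above can be sketched as a breadth-first flood with a time-to-live counter. The following Python sketch uses an invented toy peer graph; real Gnutella messages carry a TTL field and travel over TCP, which this in-process simulation only stands in for.

```python
from collections import deque

# Toy peer network: node -> neighbours and locally stored files.
# All names are invented for illustration.
PEERS = {
    "a": {"links": ["b"], "files": []},
    "b": {"links": ["a", "c"], "files": []},
    "c": {"links": ["b", "d"], "files": ["song.mp3"]},
    "d": {"links": ["c"], "files": ["song.mp3"]},
}

def flood_query(peers, start, filename, ttl):
    """Gnutella-style hop-limited flood: forward the query to neighbours,
    decrementing a time-to-live at each hop, so the request stops at a
    fixed 'horizon'.  Returns the nodes holding the file, in the order
    they were reached."""
    seen = {start}
    frontier = deque([(start, ttl)])
    hits = []
    while frontier:
        node, hops_left = frontier.popleft()
        if filename in peers[node]["files"]:
            hits.append(node)          # the transfer itself is then direct
        if hops_left == 0:
            continue                   # horizon reached: do not forward
        for neighbour in peers[node]["links"]:
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops_left - 1))
    return hits
```

With a TTL of 2 starting from "a" the query only reaches "c"; raising the TTL to 3 also finds the copy on "d", illustrating how the horizon bounds the number of servers a request touches.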

5.3.1. Data Catalogues and Data discovery
For the individual user, the first problem is to find the relevant data files from the Grid. For instance,
in the DataGrid case, one has to locate the experimental events one is interested in. For this purpose
one needs distributed Grid-wide databases or Data Catalogues, where all the files are described and
named. Using suitable interfaces the user specifies what kind of files she is going to use, and the data
discovery mechanisms will locate (the nearest copies) of the relevant files. When needed, this data
discovery can be handled automatically in the scripts of the jobs submitted for processing.
Using the data catalogues and data discovery mechanisms the relevant data files can be located. In
practice, these data catalogues also have to be distributed, and the users do not have to know the
locations of the catalogue servers. The catalogue servers work together in finding the relevant
data, transparently to the users and the applications.
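At its core, such a catalogue maps a logical file name to the physical replicas that hold it. The sketch below is a deliberately minimal stand-in, with invented logical names, sites and paths; a real Grid catalogue is itself a distributed database and uses network cost metrics rather than the naive site match shown here.

```python
# Toy replica catalogue mapping a logical file name to its physical
# copies.  The logical names, sites and paths are invented.
CATALOGUE = {
    "lfn:run42-events.dat": [
        {"site": "cern.ch", "path": "/castor/run42-events.dat"},
        {"site": "csc.fi",  "path": "/grid/run42-events.dat"},
    ],
}

def locate(lfn, near):
    """Resolve a logical file name to one physical replica.  An exact
    site match stands in for a real 'nearest copy' metric; failing
    that, the first registered replica is returned."""
    replicas = CATALOGUE.get(lfn)
    if not replicas:
        raise FileNotFoundError(lfn)
    for replica in replicas:
        if replica["site"] == near:
            return replica
    return replicas[0]
```

A job running at csc.fi would be directed to the local copy, while a site with no replica falls back to the first registered one; the application never needs to know where the catalogue servers themselves run.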

5.3.2. Data replication and caching
In a Grid environment one needs automatic and transparent movement of both the data files from the
storage computer onto the computer processing the data and the result files back to storage. The data
files needed in a computation might be scattered in different locations.

For efficiency reasons the data files usually have to be replicated into several locations, i.e. several
copies or replicas of the data files will be stored at several distributed locations. When needed, the
required data files can then be fetched quickly and without long delays. On the computing nodes one
also needs locally cached copies of the files to be processed. The data catalogues also have to
maintain the locations of the replicas of the files.
The data file catalogues will keep track of where the "original" data files are, where they are replicated
on a permanent basis, and where they happen to be cached locally for fast access during processing.
Basically this amounts to a Grid-wide multi-stage caching scheme (in addition to the traditional
"local" caches) where cache coherence will be of utmost importance. Due to the relative slowness of
the WAN network there are special considerations to be taken into account here.

From the user's point of view, Grid storage should have no artificial limits on the size of an individual
file nor on the size of a directory or a file system (not so long ago there was a 2 GB limit for a file
system on Unix, because of the 32-bit file systems; and at the same time that was the biggest possible
file on Unix, because a file could usually not span from one file system to another). Similarly, there
should be no practical limits on the number of files. Also, there should be no significant performance
or storage penalty for storing either very small or very large files.
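The old 2 GB ceiling mentioned above follows directly from the width of the file offset: a sketch of the arithmetic.

```python
# With a signed 32-bit file offset, 31 bits address the bytes of a file,
# so the largest representable position is 2**31 - 1 bytes -- the old
# "2 GB" Unix limit.  64-bit offsets push the ceiling to roughly 8 EiB.
def max_file_size(offset_bits):
    """Largest byte offset representable with a signed integer of the
    given width."""
    return 2 ** (offset_bits - 1) - 1

limit_32 = max_file_size(32)    # 2 147 483 647 bytes, just under 2 GiB
limit_64 = max_file_size(64)    # 9 223 372 036 854 775 807 bytes
```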
For the administrator, the totality of the file systems must be scalable, so that storage capacity can be
added basically just by adding more storage servers, disks, tapes, and tape drives to the system. For
availability, one should be able to do this without having to shut the whole system down and restart it
when repairing, adding more capacity, or updating the file system. The changes to the file system
should be made incrementally, and thus be invisible to the users, i.e. they should see no disruption
of the storage service.
Even though there are some solutions to this effect, at present most of them are confined to a single
computer system on a proprietary platform.


5.5.1. Network access - WAN
At present, the typical speed of a LAN is 100 megabit/s, with the higher speed of 1 gigabit/s rapidly
gaining popularity and some older parts still running at 10 megabit/s. The Ethernet hierarchy
(10 Mbit/s Ethernet, 100 Mbit/s Fast Ethernet, emerging 1 Gbit/s Gigabit Ethernet, and future
10 Gbit/s Gigabit Ethernet) seems to be at the core of LAN networking. For special purposes, other
networking solutions, like Fibre Channel and Myrinet, might also be used.
WAN speeds vary more than LAN speeds, but good backbone speeds are at present typically of the
order of 2.5 gigabits/s. Of course, the network will probably always be very heterogeneous, so that
also in the future some Grid nodes, hopefully the more important ones, will be better connected than
others.
For the reliability and availability of the network connections, the whole network has to be redundant,
so that failures in any part will not result in loss of connectivity between the still functioning nodes.
This will hopefully be the case in general, because, even with the Grid development, network
availability is of utmost importance.

5.5.2. Protocols

The fundamental IP protocols UDP and TCP and the basic application protocols like FTP and HTTP
form the networking core, which is not expected to change. The application protocols might be
enhanced for Grid use.
For Grid use, there are already emerging some enhanced FTP variants, such as Globus GridFTP and
GSIFTP, and CERN RFIO (Remote File I/O).
RFIO is a component of the CERN Advanced Storage Manager (Castor), which implements remote
versions of most POSIX calls like open, read, write, lseek and close, and several Unix commands like
cp, rm, and mkdir. Programs can use RFIO libraries to access files on remote disks or in the Castor
namespace. By overlapping network and disk I/O and using a lightweight protocol separating control
and data streams one optimises the data throughput. Each connection uses a circular buffer and two
threads, and on a high speed WAN one could implement multiple parallel streams.
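The "circular buffer and two threads" arrangement used per connection can be sketched as a classic bounded ring shared by a producer (network) thread and a consumer (disk) thread, so the two I/O activities overlap rather than alternate. This is an illustrative Python sketch, not RFIO's actual C implementation.

```python
import threading

class RingBuffer:
    """Bounded circular buffer shared by two threads: one fills it from
    the 'network', the other drains it to 'disk'."""
    def __init__(self, slots=4):
        self.buf = [None] * slots
        self.head = self.tail = self.count = 0
        self.cv = threading.Condition()

    def put(self, chunk):
        with self.cv:
            while self.count == len(self.buf):
                self.cv.wait()                 # buffer full: producer blocks
            self.buf[self.tail] = chunk
            self.tail = (self.tail + 1) % len(self.buf)
            self.count += 1
            self.cv.notify_all()

    def get(self):
        with self.cv:
            while self.count == 0:
                self.cv.wait()                 # buffer empty: consumer blocks
            chunk = self.buf[self.head]
            self.head = (self.head + 1) % len(self.buf)
            self.count -= 1
            self.cv.notify_all()
            return chunk

def copy(chunks):
    """Move `chunks` through the ring with a reader and a writer thread,
    preserving order; returns the reassembled byte string."""
    ring, out = RingBuffer(), []
    def reader():
        for c in chunks:
            ring.put(c)
        ring.put(None)                         # end-of-stream marker
    def writer():
        while (c := ring.get()) is not None:
            out.append(c)
    t1 = threading.Thread(target=reader)
    t2 = threading.Thread(target=writer)
    t1.start(); t2.start(); t1.join(); t2.join()
    return b"".join(out)
```

Because the two threads block only when the ring is completely full or completely empty, network receive and disk write proceed concurrently, which is the point of the RFIO design.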
The Globus GSIFTP is a subset of the GridFTP protocol. It is a high-performance, secure protocol
compatible with the popular FTP protocol, but using GSI (Grid Security Infrastructure) for
authentication, and having useful features such as TCP buffer sizing and multiple parallel streams. It is
being integrated with the GASS client library, and enhanced with a variety of features to be used as a
tool for higher-level application data access on the Grid.
Rajesh Kalmady and Brian Tierney at CERN made a comparison of RFIO and GSIFTP using various
tools, of which especially Netlogger is highly useful for network tuning. From these tests, one finds
that proper TCP buffer size setting is the single most important factor in achieving good performance,
although with un-tuned TCP buffers and enough parallel streams one can reach the same throughput.
However, 2-3 parallel tuned streams gain an additional 25% of performance over a single tuned
stream. Data transfer with tuned buffers is highly sensitive to variations in ambient network traffic,
and hence there should be a mechanism for dynamically varying the buffer sizes during a transfer.
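The "proper TCP buffer size" is conventionally derived from the bandwidth-delay product: the amount of data that must be in flight to keep the pipe full. The figures below (a 2.5 Gbit/s backbone, a 100 ms round trip) are illustrative assumptions, not measurements from the tests described above.

```python
def tcp_buffer_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: how much data must be 'in flight' to
    keep the pipe full, and hence the minimum useful TCP window size."""
    return int(bandwidth_bps * rtt_seconds / 8)

# A 2.5 Gbit/s backbone with a 100 ms round trip needs roughly a 31 MB
# window -- far above old default buffer sizes, which is why untuned
# single streams crawl on WAN links.
single = tcp_buffer_bytes(2.5e9, 0.100)       # 31_250_000 bytes
per_stream = single // 10                     # rough share for 10 parallel streams
```

With N parallel streams each stream only needs roughly a 1/N share of the product, which is consistent with the observation that many untuned streams can match one tuned stream.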

5.5.3. Network tuning
In a high-performance and high-throughput Grid environment it is of utmost importance to have the
performance of the host computers (which most probably will be Linux PCs) tuned in the best possible
way.
One of the possible pitfalls here is caused by the multiple buffers for both storage and network and the
unnecessary copying of data between these buffers. Currently the data to be transferred is often
copied from a user-level buffer to a system-level buffer and back, and, in the most unfortunate
case, several times. This unnecessary copying back and forth can easily drain all the cycles of even the
most powerful computer.
The efficient use of the buffers should be ensured in the Linux kernel, ideally with tuneable
parameters, so that the system can easily be tuned to suit different purposes with varying performance
demands. Recent Linux kernels (version 2.4) include an auto-tuning feature for the optimisation of
network performance. When mature, this could solve many of the network performance issues.

There are at present three major Grid tools projects: Globus, Condor, and Legion. The Globus Toolkit
has become the de facto standard tool collection in Grid computing middleware. The older Condor
project is aimed at running High Throughput Computing jobs on a pool of otherwise idle, distributively
owned workstations, but it contains interesting features for the implementation of a Grid. Legion is an
object-oriented middleware product for Grid functionality. A few other, non-grid-specific distributed
computing technologies, like Java and JINI (and JavaScript and ActiveX), CORBA, and DCOM, have
also emerged.

Basically, these Grid technologies are all complementary, not competing. For instance, the Globus and
Condor projects are co-operating in developing interfaces to the Grid middleware based on these
additional technologies. In particular, Condor-G is a gridified interface to Globus in the Condor
environment, and here Condor and Globus complement each other. The Grid-specific tools, like the
Globus toolkit, address the Grid issues more directly, and use some of these technologies to provide
portable clients. The role of the other technologies in the Grid environment will probably be to assist
in developing user interfaces, but people are seriously speaking about running grid jobs through them
as well.

5.6.1. Globus Project
The Globus project focuses on enabling the application of Grid concepts to
scientific and engineering computing. The Globus project is developing fundamental technologies and
prototype software tools (the Globus Toolkit) needed to build computational grids on a variety of
platforms. It is also doing basic research in resource management, data management and access,
application development environments, and security. The project supports the construction of Grid
infrastructures spanning supercomputer centres, data centres, and scientific organisations.
The Globus project started in 1996, and the main collaborators are Argonne National Laboratory’s
Mathematics and Computer Science Division (Ian Foster) and the University of Southern California’s
Information Sciences Institute (Carl Kesselman). Other institutions and people contribute to the
toolkit, for instance the National Computational Science Alliance, NASA, the National Partnership
for Advanced Computational Infrastructure, the University of Chicago, and the University of
Wisconsin. Globus is supported by the Defense Advanced Research Projects Agency (DARPA,
Quorum project), the U.S. Department of Energy (ASCI Distributed Resource Management project),
the National Aeronautics and Space Administration (NASA, Information Power Grid project), and the
National Science Foundation (NSF, National Technology Grid).
The Globus Toolkit is used to develop Grids and applications. Several tools projects (including
CAVERNsoft, MPICH-G2, NEOS, Nimrod-G, NetSolve, WebFlow) have used Globus, and
international collaborations have used Globus components. The European DataGrid project and other
major initiatives are now making extensive use of Globus technologies.
Many challenging problems remain to be overcome before a fully functional Grid will be ready.
Among the major research challenges and ambitious goals that are addressed in the Globus project are
designing, developing and supporting
       Resource management infrastructure that supports end-to-end resource and performance
        management, adaptation techniques and fault tolerance via network scheduling, advance
        reservations, and policy-based authorisation, so that one could provide application-level
        performance guarantees despite dynamically varying resource properties.
       Automated techniques for negotiation of resource usage, policy, and accounting in large-scale
        grid environments.
       Technologies to support data grids: distributed infrastructures and tools for data-intensive
        applications for managing and providing high-performance access to and processing of large
        amounts of data (terabytes or even petabytes).
       New problem solving techniques, programming models, and algorithms for Grid computing.
       High-performance communication methods and protocols.

Globus Toolkit
The Globus Toolkit is a collection of software tools, or a set of services and software libraries to
support Grids and Grid applications. It is not yet very well integrated or transparent, and it is a bit

difficult to install; this was especially true for version 1, but the new version 2 is supposed to be much
easier. The Toolkit is first and foremost a "bag of services": a set of useful components that can be
used either independently or together to develop useful grid applications and programming tools.
Some applications can run with no modifications at all, simply by linking with a grid-enabled version
of an appropriate programming library. Globus capabilities can often be incorporated into an existing
application incrementally, producing a series of increasingly "grid-enabled" versions.
The Toolkit includes software for security, information infrastructure, resource management, data
management, communication, fault detection, and portability:
       The Globus Resource Allocation Manager (GRAM) provides resource allocation and
        process creation, monitoring, and management services. GRAM implementations map
        requests expressed in a Resource Specification Language (RSL) into commands to local
        schedulers and computers.
       The Grid Security Infrastructure (GSI) provides a single-sign-on, run-anywhere
        authentication service, with support for local control over access rights and mapping from
        global to local user identities. Smartcard support increases credential security.
       The Metacomputing Directory Service (MDS) is an extensible Grid information service that
        combines data discovery mechanisms with the Lightweight Directory Access Protocol
        (LDAP). MDS provides a uniform framework for providing and accessing system
        configuration and status information such as compute server configuration, network status, or
        the locations of replicated datasets.
       Global Access to Secondary Storage (GASS) implements a variety of automatic and
        programmer-managed data movement and data access strategies, enabling programs running
        at remote locations to read and write local data.
       Nexus and globus_io provide communication services for heterogeneous environments,
        supporting multimethod communication, multithreading, and single-sided operations.
       The Heartbeat Monitor (HBM) allows system administrators or ordinary users to detect
        failure of system components or application processes.
For each component, an application programmer interface (API) written in the C programming
language is provided for use by software developers. Command line tools are also provided for most
components, and Java classes are provided for the most important ones. Some APIs make use of
Globus servers running on computing resources.
In addition to these core services, the Globus Project has developed prototypes of higher-level
components (resource brokers, resource co-allocators) and services. A large number of individuals,
organizations, and projects have developed higher-level services, application frameworks, and
scientific/engineering applications using the Globus Toolkit. For example, the Condor-G software
provides an excellent framework for high-throughput computing using the Globus Toolkit for inter-
organizational resource management, data movement, and security.

Globus security
The site administrator creates a file containing mappings from Grid credentials to local account names.
In this way, sites retain complete control over access to their resources. Only people who are mapped
to a valid local account can submit jobs, and all submissions are logged.
For a user to utilise resources at some remote site, the user certificates identifying the user and
defining Grid credentials for the user have to be added to the Grid credential map file at that site.
Acquiring permission to use another site is the responsibility of the user.
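The mapping file described above is the Globus "grid-mapfile": each line pairs a certificate subject name with a local account. The subjects below are invented for illustration; the quoting of the distinguished name follows the usual gridmap convention.

```
"/O=Grid/O=CERN/OU=cern.ch/CN=John Doe"    jdoe
"/O=Grid/O=DataGrid/OU=csc.fi/CN=Jane Roe" grid001
```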

Administering these certificates for thousands of users and hundreds of sites on an international scale
will be one of the difficult problems associated with the use of Globus in Grid computing.

5.6.2. Condor Project
Condor is a High Throughput Computing (as contrasted with a High
Performance Computing) environment that can manage very large collections of distributed
workstations. Bypass, FTP-Lite Library, Pluggable File System, Kangaroo, and NeST are additional
research projects pursued by Douglas Thain, John Bent, and other members of the Condor Project
Team. Related Globus projects are GridFTP and Globus GASS, and Condor-G is an interface between
Condor and Globus. Even though Condor itself was not originally aimed at Grid use, the techniques,
protocols and software tools of Condor can be useful in creating gridified applications and middleware
when combined with other Grid tools, Condor-G being the first example.
The Condor project was started in 1988 by a group directed by Professor Miron Livny at the
University of Wisconsin-Madison. It grew out of an earlier project on load balancing in a distributed
system, when the focus shifted to distributively owned computing environments, where owners have
full control over their resources. The first version of the Condor Resource Management system was
based on the earlier Remote Unix project, and it was already implemented in 1986.
Many scientific problems require weeks or months of computation and need an environment
delivering a lot of computational power over a long period of time, called a High Throughput
Computing (HTC) environment. One is interested in the sustained throughput and the number of
completed jobs, rather than in the peak performance and wall clock time. The latter are important in a
High Performance Computing (HPC) environment, which delivers tremendous power over a short
period of time.
Earlier, typically just one large expensive mainframe computer was used. Scientists would wait their
turn for the allocated mainframe time, limiting the problem size to ensure completion. This
environment was inconvenient but efficient. When computers became smaller and less expensive,
scientists purchased personal computers, dedicated workstations or clusters of workstations, which
provide exclusive access, are always available, and are owned by their users. This distributed
ownership is more convenient but less efficient. Condor takes this wasted time and puts it to
computational use.

Condor
Condor is currently available for download on many Unix platforms, and a port to Windows NT is
underway. Condor has been used in production mode for more than 10 years.
Condor aims to harness the capacity of collections of workstations for users with large computing
needs, in environments with heterogeneous distributed resources; it works well with a pool
containing hundreds of machines in an environment of distributed ownership. It allows users to utilise
resources otherwise wasted, making resource usage more efficient by putting idle machines to work. It
expands the resources available by allowing the use of machines otherwise inaccessible to the users.
Many jobs can be submitted at once, so that a lot of computation can be done with little intervention
from the user.
Condor is based on a layered architecture providing a suite of Resource Management services to
applications. When using Condor, it is not necessary to modify the source code. And if the code can be
re-linked with the Condor libraries, then a process can produce checkpoints and perform remote
system calls. A user-submitted job is executed on a remote machine within the pool of machines
available to Condor. System calls are trapped and serviced by Condor instead of by the operating
system of the remote machine, and Condor sends them to the machine where the job was submitted.
The system call is executed on the originating computer and the results are sent back to the remote
executing machine.
By using remote system calls one can preserve most of the environment of the originating computer on
the execution machine, even when the two do not share a file system or user ID scheme, and the impact
of Condor on the remote machine is minimal. A user of Condor does not need an account on the remote
machine, so the security of the remote system is preserved, and all data is maintained on the
submitting machine instead of residing on the remote machine, where it could be an imposition.
Condor jobs consisting of a single process are check-pointed periodically, or when the machine will
soon become unavailable, allowing the jobs to be continued on or migrated to another machine (with
a similar platform) to ensure eventual completion. A checkpoint consists of saving the complete state
of the program to a file, to be used later to resume execution, preserving computations already
completed if a machine crashes or must be rebooted. For long-running computations checkpoints can
save weeks of accumulated computation time.
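The checkpoint-and-resume cycle can be sketched with a small resumable computation. A real Condor checkpoint snapshots the entire process image; the two-field snapshot below merely stands in for that, and the file name is invented.

```python
import json, os, tempfile

# Invented checkpoint file name, kept in the temp directory for the sketch.
CKPT = os.path.join(tempfile.gettempdir(), "job.ckpt")

def long_sum(n, ckpt_every=1000):
    """Sum 1..n, periodically writing the loop state to a checkpoint
    file so that a restart resumes from the last snapshot instead of
    from scratch."""
    i = total = 0
    if os.path.exists(CKPT):                   # resuming after a 'crash'
        with open(CKPT) as f:
            i, total = json.load(f)
    while i < n:
        i += 1
        total += i
        if i % ckpt_every == 0:
            with open(CKPT, "w") as f:         # snapshot current state
                json.dump([i, total], f)
    if os.path.exists(CKPT):                   # job done: discard checkpoint
        os.remove(CKPT)
    return total
```

If the machine dies at iteration 7300, a restart rereads the snapshot taken at 7000 and loses at most `ckpt_every` iterations of work, which is the essence of what Condor's periodic checkpoints buy for week-long computations.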
Condor acts as a matchmaker between the jobs and the machines by pairing the ClassAd of the job
with the ClassAd of a machine providing a mechanism more flexible than a simple platform matching.
A separate ClassAd is produced for each job and machine. The ClassAd of a job specifies the
requirements, like specific platform or amount of memory, and preferences (e.g. good floating point
performance). The owner writes the ClassAd of a machine to a configuration file specifying available
resources, like platform and memory, and the requirements (e.g. jobs run only when idle) and
preferences (e.g. preferably jobs from a specific group, but others are accepted if there are no jobs
from that group) for the jobs.
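The two-sided match between job and machine ads can be sketched as follows. Real ClassAds use their own expression language; plain Python callables and invented attribute names stand in for it here.

```python
def matches(job_ad, machine_ad):
    """ClassAd-style symmetric match: each ad's Requirements expression
    is evaluated against the other ad, so both the job's needs and the
    owner's policy must be satisfied."""
    return (job_ad["Requirements"](machine_ad)
            and machine_ad["Requirements"](job_ad))

# A job ad: what the job needs from a machine (attributes invented).
job = {
    "Owner": "alice",
    "ImageSize": 800,   # MB of memory the job needs
    "Requirements": lambda m: m["Arch"] == "INTEL" and m["Memory"] >= 800,
}

# Machine ads: what the owners are willing to run.
workstation = {
    "Arch": "INTEL", "Memory": 1024,
    # owner's policy: accept any job that fits in memory
    "Requirements": lambda j: j["ImageSize"] <= 1024,
}
small_box = {
    "Arch": "INTEL", "Memory": 512,
    "Requirements": lambda j: True,
}
```

The job matches the 1024 MB workstation but not the 512 MB machine, because the job's own Requirements expression rejects it; preferences (Rank) would then order the surviving matches.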
Thus the owner of a machine has extensive control over their machine, e.g. when and which jobs are
executed, which keeps the owners happy to participate in a Condor pool. Condor pays special attention
to the rights and sensitivities of the owners of the resources, viewing the owners as holding the key to
the success of Condor. The owner defines the conditions under which the workstation can be allocated
to an external user.

Condor Bypass
Bypass is a software tool for writing interposition agents and
split execution systems. Example applications written using this technology include: Grid Console,
Automatic GASS, Pluggable File System, and Kangaroo.
An interposition agent is a small piece of software which inserts itself into existing programs and
modifies their behaviour and operation. It places itself between the program and the operating system,
and when the program attempts certain system calls, the agent grabs control and manipulates the
results. Agents can be used to instrument, measure and debug programs, to give new capabilities to an
old program, to attach it to new storage systems, to emulate one system while using another, and to
emulate operations that otherwise might not be available. Bypass can thus attach legacy applications
to new distributed systems.
A split execution system consists of a matched interposition agent and a shadow process. The
interposition agent traps system calls and sends them back to the shadow process on another machine
for execution. Thus a program can run on any networked machine and yet execute exactly as if it were
running on the same machine as the shadow.
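The agent/shadow pair can be sketched with two in-process objects; in reality the boundary between them is a remote procedure call over the network, and the class and path names below are invented.

```python
class Shadow:
    """Runs on the submitting machine and services trapped system calls
    with the submitter's own files; a dict stands in for the local file
    system."""
    def __init__(self, files):
        self.files = files
    def syscall(self, name, *args):
        if name == "read":
            return self.files[args[0]]
        if name == "exists":
            return args[0] in self.files
        raise NotImplementedError(name)

class Agent:
    """Interposition agent linked into the job on the execute machine:
    it traps I/O calls and forwards them to the shadow instead of the
    local operating system."""
    def __init__(self, shadow):
        self.shadow = shadow
    def read(self, path):
        return self.shadow.syscall("read", path)
    def exists(self, path):
        return self.shadow.syscall("exists", path)

def job(io):
    # the application itself is unchanged; only its I/O layer was swapped
    return io.read("/home/alice/input.txt").upper()
```

The job reads a file that exists only on the submitting machine, yet runs unmodified on the execute machine, which is exactly the illusion a split execution system provides.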
An interposition agent created by Bypass can be added to (nearly) any Unix program at run-time. The
receiving program must be dynamically linked, but it need not otherwise be specially prepared for
Bypass. Agents created by Bypass have been used with unmodified system tools such as cp, grep,
and emacs.
Bypass is a code generator, much like the compiler tools yacc and bison; it relies on lists specifying
which system calls are to be trapped and what code is to replace them. Bypass produces the C++
source for an agent implementing them, hiding all the tricky details and differences of the system
calls on each operating system.

Condor FTP-Lite Library
FTP-Lite is a library providing a simple implementation of GridFTP for accessing FTP servers from
Unix C programs. The library is distributed under the GNU General Public License. It is designed to
be simple, correct, and easily debuggable. FTP-Lite should run on most any POSIX-compliant system;
it has successfully run on Linux, Solaris, and IRIX.
It provides a blocking, Unix-like interface to the protocol, and presents an FTP service in terms of
Unix abstractions. Errors are returned in standard errno values, and data streams are presented as
FILE objects. All procedures perform blocking I/O.
The interface is easily used by single-threaded applications. In particular, the library was designed for
the Pluggable File System, which presents network storage devices as Unix file systems, to attach FTP
services to ordinary POSIX programs.
FTP-Lite provides reasonable performance for simple clients that access one file at a time. FTP-Lite
does not provide all the advanced features present in the complete GridFTP implementation provided
by the Globus Project. For clients that need to manage multiple servers at once, the FTP
implementation found in the Globus Toolkit uses a variety of techniques, such as multiple data streams
and non-blocking interfaces, to achieve high performance and multi-server access.
The library can be used to communicate with ordinary name/password or anonymous FTP servers, and
if the Globus Toolkit is available, it will also perform GSI authentication.

Condor Pluggable File System (PFS)
The Pluggable File System (PFS) is a tool for attaching old
legacy applications to new storage systems. PFS presents new storage systems as file system entries,
making a user-level storage system appear as a file system to a legacy application.
PFS does not require any special privileges, any recompiling, or any change whatsoever to existing
programs. Normal users doing normal tasks can use it, for example, an HTTP service could be
presented to a vi editor like so:
    % vi /http/
or to make vi Grid-enabled
    % vi /gass/http/
PFS is useful to developers of distributed systems, because it allows rapid deployment of new code to
real applications and real users that do not have the time, skill, or permissions to build a kernel-level
file system. It is useful to users of distributed systems, because it frees them from the unpleasant
possibilities of either rewriting the code to work with new systems, or relying on a remote
administrator to trust and install new software.
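The prefix-to-protocol mapping that PFS applies to paths like /http/ and /gass/ can be sketched as a tiny router. The driver names and prefixes below are illustrative stubs mirroring the examples above, not PFS internals.

```python
# Toy path router in the spirit of PFS: a path prefix selects the
# protocol driver servicing the access.  Drivers are stand-in labels.
DRIVERS = {
    "/http/": "HTTP via FTP-Lite",
    "/ftp/":  "FTP via FTP-Lite",
    "/gass/": "Globus GASS",
    "/nest/": "NeST",
}

def route(path):
    """Decide which storage system services an access to `path`; paths
    with no known prefix fall through to the ordinary file system."""
    for prefix, driver in DRIVERS.items():
        if path.startswith(prefix):
            return driver, path[len(prefix):]
    return "local file system", path
```

An unmodified application that opens /http/server/index.html is thus served by the HTTP driver, while /etc/passwd passes through to the local file system untouched.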
The current version is PFS 0.4, for Linux, IRIX and Solaris, and it ought to run on most any POSIX-
compliant system. PFS currently supports the following distributed systems, with more to be added
later: HTTP, FTP and GSIFTP via the FTP-Lite library, Globus GASS, NeST, and Kangaroo.

Condor Kangaroo
Kangaroo ( is a wide-area data-movement system. Current
version is Kangaroo 0.5, and it has implementations for Linux, IRIX, and Solaris. It is an Open source
A collection of Kangaroo servers can be used to provide robust background data movement for
remotely executing programs. Applications write to nearby Kangaroo servers at disk speeds, and rely
on background processes to guide data files home. With the help of the Pluggable File System,
applications perceive Kangaroo to be a mere file system. No special privileges are needed to install
and use this feature.
Kangaroo keeps working, even under hostile conditions, but it does not provide optimized, high-
performance single-file transfers. It does aim to provide high-throughput computing by overlapping
CPU and I/O tasks over a wide area, giving high end-to-end application performance.

Condor Network Storage (NeST)
The Network Storage (NeST) project (John Bent) aims to design and build flexible commodity
storage appliances. The need for reliable, scalable, manageable, and
high performing network storage has led to a proliferation of commercial storage appliances provided
by such vendors as NetApp and EMC.
However, these products have two limitations: the hardware and the software are bundled together,
creating a dependence between them that leads to unnecessarily expensive solutions; and the software
is inflexible, requiring the use of only a vendor-approved set of protocols, security mechanisms, and
data semantics.
The goals in NeST are to break the dependence between hardware and software by creating software-
only storage appliances out of commodity hardware, and to create software with flexible mechanisms
supporting a wide range of communication protocols, security mechanisms, and data semantics.

5.6.3. Globus and Condor compared
Condor is a tool for harnessing the capacity of idle workstations for
computational tasks. Condor is well suited for parameter studies and high throughput computing,
where parts of the job generally do not need to communicate with each other.
Condor and Globus are complementary technologies, as demonstrated by Condor-G, a Globus-enabled
version of Condor that uses Globus to handle inter-organizational problems like security, resource
management for supercomputers, and executable staging. Condor can be used to submit jobs to
systems managed by Globus, and Globus tools can be used to submit jobs to systems managed by
Condor. For instance the job description language of Condor could be utilised to specify the jobs for
Globus. The Condor and Globus teams co-operate to ensure that the Globus Toolkit and Condor
software fit well together.

5.6.4. Legion
The Legion project is developing an object-oriented middleware framework
that can be used to implement Grid functionality. The goal of the Legion project is to promote the
principled design of distributed system software by providing standard object representations for
processors, data systems, etc. Legion applications are developed in terms of these standard objects.
Legion was conceived in 1993, based on an earlier project Mentat, and in 1996 development of a full
version of Legion was started. The current release, Legion 1.4, runs on Solaris/Sun, Irix/SGI, Linux
(both Intel and Alpha), DEC Alpha, AIX/IBM RS6000, HP-UX, and CRAY UNICOS (as a virtual
host only). It does not currently support Windows. Legion consists of libraries, source code, and
executable binaries. It is designed to support a variety of architectures and to run applications on
multiple platforms.
Legion is a middleware layer between operating system and other Legion resources, connecting
networks and computer resources together in different architectures, operating systems, and physical
locations. There is no central controller of the resources, but each resource is an independent element
instead. These combined resources can be used to parallelise programs, and Legion will seamlessly
schedule and distribute the programs on available and appropriate hosts, returning the results and
creating "a worldwide Virtual Computer".
All local resources are represented in Legion as objects, and the fundamental Legion resource is the
ability to call a method on an object. In fact in Legion everything is an object represented by an active
process that responds to function invocations from other objects in the system. Legion defines the
message format and high-level protocol for object interaction, but not the programming language or
the communications protocol.
Every Legion object is defined and managed by its class object. Classes create new instances,
schedule, activate and deactivate them, and provide information about their current location. Legion
allows users to define and build their own class objects. Legion 1.4 contains default implementations
of several types of classes. Users can change these, if they do not meet the performance, security, or
functionality requirements. Legion is scalable: it is designed to handle trillions of remote objects in
varied, fully distributed computing environments.
Core objects implement common services that support basic system services, such as naming, binding,
object creation, activation, deactivation, and deletion. Examples of core objects include hosts, vaults,
contexts, binding agents, and implementations. Hosts represent processors; one or more host objects
run on each computing resource, where they create and manage processes. This gives resource owners
control over their resources. Implementation objects allow other Legion objects to run as processes in
the system; they typically contain executable code. A vault object represents persistent storage, but
only for the purpose of maintaining the state of the inert Legion objects that the vault object supports.
Legion supports a wide variety of applications and programs. Users can start remote programs from
the Legion command line. A user's workload can be spread over several processors, and users can
execute multiple programs in parallel. Legion offers Basic Fortran Support (BFS), which provides a
set of Legion directives that can be embedded in Fortran code in the form of pseudo-comment lines.
The Mentat Programming Language (MPL) is an extension of C++ and is designed to help parallelise
applications. Applications using MPI or PVM can use the Legion core MPI or PVM interfaces to
access Legion features from those environments. PVM programs can be registered and run with
special Legion tools.
The Globus and Legion technologies are in some respects complementary. Both are distributed high-
performance metacomputing systems, and since both are addressing the same problem, there are
similar features and significant areas of overlap. Globus can be characterized as a "sum of services"
architecture focusing on low-level services, while Legion is an integrated architecture with a focus on
higher-level programming models. The Globus Toolkit is being used as the basis for numerous
production Grid environments (from modest collaborative research projects to huge international
scientific ventures), whereas Legion's community of users is smaller and more focused.
A similar object-oriented middleware research project is Globe, at the Vrije Universiteit in the
Netherlands.

5.6.5. Sequential Access to data via Metadata (SAM)
The Sequential Access to data via Metadata (SAM) system is developed at
Fermilab (Lauri Loebel-Carpenter, Lee Lueking, Carmenita Moore, Igor Terekhov, Julie Trumbo,
Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White) to accommodate the high volume
data management operations needed for Run II physics in the D0 Experiment Collaboration and to enable
streamlined access and data mining of these large data sets.
The D0 detector collaboration includes 60 institutions and over 500 physicists all over the globe. Data
from the D0 operation are estimated to require more than 0.5 PB of storage over the next two years.
SAM is being employed to store, manage, deliver, and track processing of all data, and provides
transparent caches for the various D0 computing platforms.
Here the term sequential refers to sequential events within files stored sequentially on tapes within a
Mass Storage System (MSS). At Fermilab and some other collaborating sites the MSS is Enstore.
Other MSS types, including HPSS, are used, and interfaces to these have also been provided.
Metadata and configuration information for the entire system are currently maintained in a central
Oracle database at Fermilab, but it is planned to distribute this to reduce latencies and improve
reliability.
SAM is a file-based, network-distributed data management system implemented as an access layer
between the storage management system and the data processing layers. SAM is designed with a
distributed network architecture using CORBA services and interfaces.
The goal of SAM is to optimise the use of data storage and delivery resources, such as tape mounts,
drive usage, and network bandwidth. The primary objectives are: clustering the data onto tertiary
storage corresponding to access patterns, caching data on disk or tape, organizing data requests to
minimise tape mounts, and estimating the resources required and deciding data delivery priority.
Individual file tracking is done by the data management system instead of the scientists.
The globally shared services include the CORBA Naming Service, the database server, a global
resource manager, and a log server. The database servers could be cloned to distribute the load and
make the system more reliable.
The station is deployed on local processing platforms. Stations can communicate among themselves
and data within one station’s cache can be replicated to other stations on demand. Local groupings of
stations, for instance at Fermilab, can share a locally available Mass Storage System. Most of the
contentions are among local station groupings, like those that share a robotic tape library.
The station’s responsibilities include storing and retrieving data files, managing data stored on cache
disk, and launching project managers, which oversee the processing of data requested by users
(consumers). The station manager oversees the removal of files from the cache disk, and instructs file
stagers to add new files. All processing projects are started through the station server, which starts
project managers. Files are added to the system through the File Storage Server (FSS), which uses the
Stagers to initiate transfers to the available MSS.
Data to be added to the system must be described by a small Python file containing information such
as physics-experiment parameters (like the number of events and the event range) and attributes
describing the files. A map telling the system where the data should be stored is also needed.
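A hypothetical sketch of such a description file is shown below. SAM's actual field names are not specified in this report, so every name here (params, files, storage_map, and all dictionary keys) is illustrative only:

```python
# Hypothetical SAM data-description file; all field names are
# illustrative, not SAM's actual schema.

# Physics parameters and event information for the files being added.
params = {
    "experiment": "D0",        # assumed attribute names
    "run_number": 12345,
    "event_count": 250000,     # number of events
    "first_event": 1,          # event range
    "last_event": 250000,
}

# Attributes describing the files themselves.
files = [
    {"name": "d0_run12345_000.raw", "size_bytes": 1073741824},
    {"name": "d0_run12345_001.raw", "size_bytes": 1073741824},
]

# A map telling the system where each file should be stored.
storage_map = {
    "d0_run12345_000.raw": "enstore:/pnfs/d0/raw/run12345",
    "d0_run12345_001.raw": "enstore:/pnfs/d0/raw/run12345",
}
```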
User and administrative interfaces are provided to add data, access data, set configuration parameters,
and monitor the system. Interfaces to the system include a UNIX command line, Web GUIs, and a Python
API. There is also a C++ interface provided for accessing data through a standard D0 framework
package. Most of these interfaces are connected to the system through CORBA.

5.6.6. Storage Resource Broker (SRB)
The Storage Resource Broker (SRB) is client-server based middleware implemented at the San Diego
Supercomputer Center (SDSC) to provide distributed clients
with uniform access to different types of storage devices, diverse storage resources, and replicated data
sets in a heterogeneous computing environment.
The SRB is implemented on AIX, Sun Solaris, SunOS, DEC OSF, SGI, and the Cray C90. Storage
systems supported by the current release (V1.1.8) include the Unix file system, hierarchical archival
storage systems UniTree and HPSS, and database objects managed by DB2, Oracle and Illustra.
SRB provides client applications with a library and a uniform API that can be used to connect to
heterogeneous distributed resources and access replicated data sets. The srbBrowser is a Java based
SRB client GUI that can be used to perform a variety of client level operations including replication,
copy and paste, registration of datasets, and metadata manipulation and query. Unix-like utilities
(such as ls, cp, and chmod) for manipulating datasets in the SRB space are also provided.
The Metadata Catalog (MCAT) is a metadata
repository system managed by the MCAT relational database server and implemented at SDSC to
provide a mechanism for storing and querying system-level and domain-dependent metadata using a
uniform interface. Together the SRB and MCAT servers provide a way to access data sets and
resources through querying their attributes instead of knowing their physical names or locations.
The MCAT server stores metadata associated with data sets, users and resources; here metadata
includes information for access control, and the data structures required for implementing the
"collection" (directory) abstraction. The MCAT relational database can be extended beyond the
capabilities of a traditional file system, for example to implement a more complex access control
system, proxy operations, and information discovery based on system-level and application-level
metadata.
The MCAT presents clients with a logical view of data. Each data set stored in SRB has a logical
name (similar to a file name), which may be used as a handle for data operations. The physical location
of a data set in the SRB environment is logically mapped to the data set, but it is not implicit in the
name as in a file system. Hence the data sets of a collection may reside in different storage systems.
A client does not need to remember the physical location, since it is stored in the metadata of the data
set.
Data sets can be arranged in a directory-like structure, called a collection. Collections provide a logical
grouping mechanism where each collection may contain a group of physically distributed data sets or
sub-collections. Large numbers of small files incur performance penalties due to the high overhead of
creating and opening files in hierarchical archival storage systems. The container concept of SRB was
specifically created to circumvent this type of limitation. Using containers, many small files can be
aggregated in the cache system before storage in the archival storage system.
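The aggregation idea can be illustrated with a short sketch. SRB's actual container format is not described in this report, so a plain tar archive stands in for it here, purely as an analogy: many small files become one large object for the archive, and individual members can still be pulled back out.

```python
import io
import tarfile

def build_container(small_files):
    """Aggregate many small in-memory files (a dict of name -> bytes)
    into one container object, here a tar archive, so the archival
    system sees a single large object instead of many small ones."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in small_files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def extract_member(container_bytes, name):
    """Pull a single small file back out of the container."""
    with tarfile.open(fileobj=io.BytesIO(container_bytes)) as tar:
        return tar.extractfile(name).read()
```

Only one create/open operation reaches the archival system per container, which is exactly the overhead the SRB container concept is meant to avoid.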
The SRB supports three authentication schemes: plain-text password, SEA, and the Grid Security
Infrastructure (GSI). GSI is a security infrastructure based on X.509 certificates, developed by the
Globus group. SEA is an authentication and encryption scheme developed at SDSC, based on RSA.
The ticket abstraction facilitates sharing of data among users. The owner of a data set may grant
read-only access to a user or a group of users by issuing tickets. A ticket can be issued to either
MCAT-registered or unregistered users. For unregistered users, normal user authentication is bypassed,
but the resulting connection has only limited privileges.
The SRB supports automatic creation of data replicas by grouping two or more physical resources into
a resource group, or logical resource. When a logical resource rather than a physical resource is
specified at data set creation, a copy of the data is created in each of the physical resources belonging
to the logical resource. Subsequent writes to this data object update all copies. A user can specify
which replica to open by giving the replica number of a data set.
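The logical-resource behaviour just described can be sketched in a few lines. This is an in-memory toy, not SRB's API; the class and method names are invented for illustration:

```python
class PhysicalResource:
    """A single storage system; here just an in-memory dict."""
    def __init__(self, name):
        self.name = name
        self.objects = {}

class LogicalResource:
    """Groups physical resources. Creating or writing a data set
    through the logical resource touches every member, so each
    member holds one replica."""
    def __init__(self, members):
        self.members = list(members)

    def create(self, key, data):
        for m in self.members:      # one replica per physical resource
            m.objects[key] = data

    def write(self, key, data):
        self.create(key, data)      # subsequent writes update all replicas

    def open_replica(self, key, replica_number):
        # A user may pick a specific replica by number.
        return self.members[replica_number].objects[key]
```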
Proxy operations are implemented in SRB. Some operations can be more efficiently done by the server
without involvement by the client. For instance, when copying a file, it is more efficient for the server
to do all the read and write operations than passing the data read to the client and then the client
passing it back to the server to be written. Another proxy function can be a data sub-setting function,
which can return a portion instead of the full data set. As an example, the dataCutter code from UMD
has been integrated as one of the proxy operations supported by the SRB.

5.6.7. Nimrod-G Resource Broker
The Nimrod-G Grid resource broker supports deadline- and budget-constrained scheduling of
applications on a peer-to-peer Grid distributed across
the globe. A prototype version of Nimrod-G has been developed at Monash University, Melbourne,
Australia, and is available for download.
Nimrod-G uses an economics paradigm for resource management and scheduling on the Grid. It
supports deadline and budget constraints for schedule optimisations and regulates supply and demand
of resources in the Grid by leveraging the services of GRACE (Grid Architecture for Computational
Economy) resource trading. This provides an economic incentive for resource owners to share their
resources on the Grid, offers mechanisms to trade off quality of service (QoS) parameters, deadlines,
and computational costs, and rewards users for relaxing their requirements. The resource broker
dynamically leases Grid services depending on their cost, quality, and availability. In scheduling, the
broker supports optimisation for deadline, budget, or both.
Nimrod-G provides a programmable and persistent Task Farming Engine (TFE) that coordinates
resource trading, scheduling, data staging, execution, and gathering results from remote nodes to the
user’s home transparently. A dispatcher is capable of deploying jobs on Grid resources enabled by
Globus, Legion, and Condor.
Scheduling experiments have been conducted involving over 200 job executions on the Nimrod-G
World Wide Grid (WWG) testbed, with resources located on five continents: Australia, Asia, Europe,
North America, and South America. Each job had a specified budget and timeframe, and jobs were
prioritised and distributed according to their deadlines to the cheapest available machines, which
varied with the time of day on the different continents. If the cheapest machine was unavailable, a
more expensive resource was selected to meet the deadline, as long as it was within budget.
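The selection rule used in these experiments can be sketched as a greedy choice. The field names below are illustrative, not Nimrod-G's actual data model:

```python
def pick_resource(resources, deadline, budget):
    """Greedy sketch of deadline/budget-constrained scheduling:
    among available resources predicted to finish the job by its
    deadline at a cost within budget, choose the cheapest one.
    Returns None when the deadline/budget cannot be met."""
    candidates = [r for r in resources
                  if r["available"]
                  and r["predicted_runtime"] <= deadline
                  and r["cost"] <= budget]
    if not candidates:
        return None
    # Cheapest qualifying resource wins; a pricier one is chosen
    # only when cheaper ones are unavailable or too slow.
    return min(candidates, key=lambda r: r["cost"])
```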
Nimrod is a tool for managing the execution of parametric studies across distributed computers.
EnFuzion is a commercial version of the research system
Nimrod. Nimrod tools for modelling parametric experiments are mature and in production use for
cluster computing.

5.6.8. Other distributed computation technologies

Java and Jini
The Java programming language, developed by Sun, offers a way to do cross-platform, platform-independent
remote computing. All that is needed is a Java runtime environment, and then programs
can be executed remotely. The runtime environment interprets Java programs and executes them. It is
available without cost for virtually every platform used today. The programs can be transferred, for
instance, by a web browser (using the HTTP protocol).
This provides a promising, platform-independent way of building Grid applications, especially user
interfaces for Grid resources. Even the inherent slowness of interpreting Java code (on the order of
5 to 10 times slower than native compiled code) does not deter people from developing Java-based
Grid applications. The slowness may eventually be overcome by some kind of on-demand, on-the-fly
compilation of the bytecode.
While Java provides useful technology for portable, object-oriented application development, it does
not address many of the hard problems that arise when one tries to achieve high-performance
execution in heterogeneous distributed environments. For example, Java does not help run programs on
different types of supercomputers, discover the policy elements that apply at a particular site, achieve
single sign-on authentication, or perform high-speed transfers across wide-area networks.
JavaScript, developed by Netscape, and ActiveX, by Microsoft, are more browser-specific scripting
facilities. Netscape supports Java and JavaScript, and Microsoft Internet Explorer supports all three:
Java, JavaScript and ActiveX. (But, typical of the current competitive situation, the supported
JavaScript versions are not entirely compatible.)
Jini is a Sun software system designed to easily connect any computing
device (whether PC, cell phone, or laptop) to a larger network, using Java to distribute processes
among the devices connected to the network. Jini allows users to simply plug the devices into the
network without having to initialise or restart any part of the system. The device announces and
describes itself to the network, so that its resources (memory, display, applications, etc.) can be
utilized as necessary by the network.
However, Jini is not middleware and it is not concerned with the communication processes involved in
distributed computing. Jini allows devices to communicate on a common network, but it does not
handle the actual communications or the transfer of data. It is a simple publish-and-subscribe
architecture, and does not concern itself with fault tolerance, access control, authentication, or parallel
computing.

CORBA
The Common Object Request Broker Architecture (CORBA), developed by the Object Management
Group, defines an object-oriented model for accessing distributed objects. CORBA supports describing
interfaces to distributed objects via an Interface Description Language
(IDL) for inter-language interoperability and linking the IDL to implementation code that can be
written in several supported languages. Compiled object implementations use the Object Request
Broker (ORB) run-time system to perform remote method invocations. CORBA provides a simple
remote-procedure-call execution model suited to client-server style applications, and a variety
of more specialized services such as a trader (for resource location).
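The remote-invocation pattern that CORBA standardises can be illustrated, by rough analogy only, with Python's standard xmlrpc library: a server registers an object's methods, and a client-side proxy plays the role a compiled IDL stub plays in CORBA, forwarding calls across the network. This is not CORBA or an ORB, merely the same client-server RPC shape:

```python
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy
import threading

# Server side: register a method so remote clients can invoke it.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
port = server.server_address[1]          # ephemeral port chosen by the OS
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy forwards method calls to the remote object,
# analogous to what an IDL-generated stub does under CORBA.
proxy = ServerProxy("http://localhost:%d" % port)
result = proxy.add(2, 3)                 # remote invocation; returns 5
```

In CORBA the interface would instead be declared in IDL and compiled into stubs for each supported language, with the ORB handling the wire protocol.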
Like Java, CORBA provides important software engineering advantages, but it doesn’t directly
address the challenges that arise in Grid environments such as specialized devices and the high
performance required by many Grid applications. CORBA is commonly used for business
applications, such as providing remote database access for clients. It is not designed for
high-performance applications, parallel programs, or multi-organizational situations.

Microsoft DCOM
Microsoft’s Distributed Component Object Model (DCOM) provides a variety of services, including
remote procedure call, directory service, and distributed file system. These services are useful but
don't address issues of heterogeneity or performance directly. And, of course, they can be used only on
Windows computers.

Grid and Windows
Basically all Java client programs can be used on Windows NT, of course, and there are some Grid
portals (for instance SDSC's HotPage) designed with Windows clients in mind. The main protocols
used by Globus and all other Grid tools (LDAP, HTTP, FTP) are already widely available on
Windows. Windows NT systems can be harnessed as Grid compute resources using Condor-G.


5.7.1. Introduction
Networked computer systems are rapidly growing in importance as the medium for the storage and the
exchange of information. Today peer-to-peer (P2P) computing and networking are widely used for
distributing computer and processing efforts. A general peer-to-peer information can be found at the

IST-2000-25182                                   PUBLIC                                          53 / 67
                                                                                             Doc. Identifier:
                        DATA ACCESS AND MASS STORAGE
                                  SYSTEMS                                                  Date: 26/06/2011
                                    S tate of the Art Report

site For instance, contains a list of
distributed search engines.
One of the earliest well-known efforts in distributed computing (even though it is not exactly
peer-to-peer) is the seti@home project, which searches for signs of extra-terrestrial intelligence in
astronomical data.
Peer-to-peer systems and applications are distributed systems without any centralised control or
hierarchical organization. All participating components or nodes are equal to each other and the
software running at each node is equivalent in functionality. There are no servers or clients; or rather,
each computer is a server and a client at the same time. There are no dedicated servers, masters, or
slaves; the peer-to-peer system is "democratic".
A traditional client-server model has powerful dedicated servers and relatively simple clients. A
dedicated server is a single point of failure: when it fails, the resource usually becomes unusable.
Sometimes redundancy can be enhanced by duplicating the servers.
In peer-to-peer systems, some nodes are sometimes elevated to "super-node" status for efficiency
reasons and given special duties in the network. Still, they are not dedicated to this role,
and usually the super-nodes are peers to each other. The super-nodes are a distributed version of a
dedicated server.
The most important features recently attributed to peer-to-peer applications are redundant storage,
permanence, anonymity, search, authentication, and hierarchical naming. But the core operation in
most peer-to-peer systems is efficient location of data items.
Many general computing concepts, for instance caches, hashing, and cryptography, have been utilised
for years in single-computer systems. Their use was later extended to clusters, massively parallel
processing (MPP), and metacomputing. These concepts are now being applied to large-scale,
wide-area peer-to-peer computing systems and Grids, too.
Because the essential parameters and characteristics (relative speeds and latencies, bandwidths,
capacities, etc.) of the involved hardware and system components vary enormously and differ from the
previous local and cluster situations, new strategies are needed. At present, several differing schemes
are being tried. These schemes have strengths and weaknesses in different areas, and it is not known
yet which scheme will prevail. It is probable that many schemes will be utilised simultaneously, each
in its strong area.
In particular, the networking characteristics of peer-to-peer computing differ from those of a local
computer, whose "network" is formed by the buses connecting hardware components; from cluster
networking, where specialised high-speed networks are installed; and even from the usual local area
networks.

Limitations of current systems
Wide area networks are controlled by several organisations. They are public and open in character, and
so inherently "uncontrollable" and unreliable (component systems appear and disappear seemingly at
random as computers go up and down), and to a certain extent unpredictable (only statistical
predictions can be made).
Because of this openness, the resources have to be considered untrusted. This creates large availability
and security problems, which can be solved by strong cryptography and by designing the systems to
adapt to the constantly changing and unguaranteed availability of the resources.
Current systems afford little privacy to their users, and typically store any given data item in only one
or a few fixed places, creating a central point of failure. There is a continued desire among individuals
to protect the privacy of their authorship or readership of various types of sensitive information. The
central points of failure can be attacked by opponents who wish to remove data from the system, or
they can simply be overloaded by too much interest. Centralisation has also enabled government
agencies and other organisations with a "vested interest" to censor what can be found on the net.

Hashing
A fundamental problem in peer-to-peer applications is how to efficiently locate the node that stores a
particular data item. Scalable protocols and methods are needed in a dynamic peer-to-peer system with
frequent node arrivals and departures.
Central to any peer-to-peer system is the indexing scheme used to map file names to the addresses of
their locations in the system. The peer-to-peer file transfer process itself is inherently scalable, but the
hard part is finding the peer from whom to retrieve the file. Thus, a scalable peer-to-peer system
requires a scalable indexing mechanism. Such indexing systems are called here Content-Addressable
Networks (this general concept was introduced in the paper about CANs; see ACIRI CAN below).
CANs resemble a hash table: the basic operations performed on a CAN are the insertion, lookup and
deletion of (key, value ) pairs. A hash table is a data structure that efficiently maps ―keys‖ onto
―identifiers‖. Hash tables are essential core building blocks in the implementation of many modern
software systems. Many large-scale distributed systems could likewise benefit from similar hash table
functionality. The term CAN describes a distributed Internet-scale hash table. Interest in them is based
on the belief that a hash table-like abstraction would give developers a powerful design tool leading to
new applications and communication models.
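The hash-table abstraction described above can be sketched minimally as follows. This is an illustrative toy, not any system's actual interface: the class and method names are assumptions, and the key-to-node assignment is a plain modular hash rather than a real zone partition.

```python
import hashlib

class ToyCAN:
    """Minimal sketch of the CAN hash-table abstraction: insertion, lookup
    and deletion of (key, value) pairs, with each key deterministically
    assigned to one node's chunk (zone) of the table."""

    def __init__(self, node_ids):
        self.nodes = {nid: {} for nid in node_ids}   # each node owns a chunk of the table
        self.node_ids = sorted(node_ids)

    def _owner(self, key):
        # Deterministically map the key onto one node (simplified: hash mod n).
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.node_ids[h % len(self.node_ids)]

    def insert(self, key, value):
        self.nodes[self._owner(key)][key] = value

    def lookup(self, key):
        return self.nodes[self._owner(key)].get(key)

    def delete(self, key):
        self.nodes[self._owner(key)].pop(key, None)

can = ToyCAN(node_ids=[0, 1, 2, 3])
can.insert("song.mp3", "138.96.250.1")    # store a (key, value) pair
address = can.lookup("song.mp3")          # any node hashing the same key finds it
```

Because every participant applies the same hash, any node can locate the owner of a key without a central directory, which is the point of the abstraction.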
A CAN is usually composed of many individual nodes. Each CAN node stores a chunk (zone) of the
entire hash table and holds information about a small number of "adjacent" zones in the table.
Requests to insert, lookup or delete a particular key are routed by intermediate CAN nodes towards the
CAN node whose zone contains that key.
The CAN design should strive to be completely distributed. It should not require centralised control,
coordination or configuration. It should be scalable in the sense that nodes maintain only a small
amount of control state that is independent of the number of nodes in the system. It should also be
fault-tolerant (nodes routing around failures).
Several peer-to-peer systems support indexed CAN-type access. Given a key, they map the key onto a
node. Data location can be easily implemented by associating a key with each data item. The (key, data item) pair is then stored at the node to which the key maps.

Topology
A CAN is designed topologically as a virtual coordinate space. This coordinate space is completely
logical and bears no relation to any physical coordinate system. At any point in time, the entire
coordinate space is dynamically partitioned among all the nodes in the system such that every node
"owns" its individual, distinct zone within the overall space.
This virtual coordinate space is used to store (key, value) pairs as follows: to store a pair (K, V), the
key K is deterministically mapped onto a point P in the coordinate space using a uniform hash
function. The corresponding key-value pair is then stored at the node that owns the zone within which
the point P lies. To retrieve an entry corresponding to key K, any node can apply the same
deterministic hash function to map K onto point P and then retrieve the corresponding value from the
point P. If the point P is not owned by the requesting node or its immediate neighbours, the request
must be routed through the CAN infrastructure until it reaches the node in whose zone P lies. Efficient
routing is therefore a critical aspect of the CAN.
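The store-and-retrieve scheme just described can be sketched as follows. The fixed four-quadrant zone layout and all names are illustrative assumptions; real zones are assigned dynamically as nodes join.

```python
import hashlib

def key_to_point(key, d=2):
    """Deterministically hash a key K onto a point P in the d-dimensional
    unit coordinate space, using disjoint slices of a SHA-1 digest as
    coordinates (the choice of hash is an assumption here)."""
    digest = hashlib.sha1(key.encode()).digest()
    return tuple(int.from_bytes(digest[4 * i:4 * i + 4], "big") / 2**32
                 for i in range(d))

# Zones partition the space; here, four fixed quadrants of the unit square.
zones = {
    "node-A": ((0.0, 0.0), (0.5, 0.5)),
    "node-B": ((0.5, 0.0), (1.0, 0.5)),
    "node-C": ((0.0, 0.5), (0.5, 1.0)),
    "node-D": ((0.5, 0.5), (1.0, 1.0)),
}

def owner_of(point):
    """Return the node whose zone contains point P; the (K, V) pair is
    stored at, and later retrieved from, that node."""
    for node, (lo, hi) in zones.items():
        if all(lo[i] <= point[i] < hi[i] for i in range(len(point))):
            return node
    raise ValueError("point outside the unit square")

p = key_to_point("replica-catalog-entry-42")
storing_node = owner_of(p)
```

Since the hash is deterministic, a requesting node recomputes the same point P for the same key and can route toward the zone that contains it.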
Nodes in the CAN self-organize into an overlay network that represents this virtual coordinate space.
A node learns and maintains as its set of neighbours the IP addresses of those nodes that hold
coordinate zones adjoining its own zone. This set of immediate neighbours serves as a coordinate
routing table that enables routing between arbitrary points in the coordinate space.
Routing in a Content Addressable Network works by following the straight-line path through the
Cartesian space from source to destination coordinates. A CAN node maintains a coordinate routing
table that holds the IP address and virtual coordinate zone of each of its neighbours in the coordinate
space. This purely local neighbour state is sufficient to route between two arbitrary points in the space.
Note that many different paths exist between two points in the space and so, even if one or more of a
node’s neighbours were to crash, a node would automatically route along the next best available path.
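The straight-line routing just described can be sketched with a simplified model in which zones form a regular grid. This is only an illustration of the greedy rule (forward to the neighbour closest to the destination), not the paper's algorithm for arbitrary zone shapes.

```python
def grid_neighbours(zone, n):
    """Neighbours of a zone in an n x n grid of equal zones (a simplified
    stand-in for arbitrary CAN zones)."""
    x, y = zone
    cand = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(a, b) for a, b in cand if 0 <= a < n and 0 <= b < n]

def greedy_route(src, dst, n):
    """Follow the approximate straight-line path: at each hop, forward to
    the neighbour whose zone is closest to the destination coordinates."""
    path = [src]
    here = src
    while here != dst:
        here = min(grid_neighbours(here, n),
                   key=lambda z: (z[0] - dst[0]) ** 2 + (z[1] - dst[1]) ** 2)
        path.append(here)
    return path

hops = greedy_route((0, 0), (3, 3), n=4)   # route across a 4 x 4 zone grid
```

Because only local neighbour state is consulted at each step, a crashed neighbour can simply be dropped from the candidate list and the next best hop taken instead.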
A peer computer usually needs only a few connections to its nearest neighbours (the network
connections define an abstract virtual topology, not necessarily a spatial one). For example, in the
Gnutella mesh 2-4 direct connections are enough. The path length scales approximately logarithmically. In an N-node
system in the steady state, each node maintains information only about O(log N) other nodes, and
resolves all lookups via O(log N) messages to other nodes.
While CAN maps keys onto nodes, traditional name and location services provide a direct mapping
between keys and values. A value can be an address, a document, or an arbitrary data item. CAN can
easily implement this functionality by storing each key/value pair at the node to which that key maps.
For example, for Chord each node maintains information only about O(log N) other nodes, and a
lookup requires O(log N) messages. The updating of the routing information when a node joins or
leaves the network requires O(log^2 N) messages.
ACIRI CAN uses a d-dimensional Cartesian coordinate space (for some fixed d, designed not to vary) to implement a distributed hash table that maps keys onto values. Each node maintains O(d) state, and the lookup cost is O(d N^(1/d)). The state maintained by an ACIRI CAN node does not depend on the network size N, but the lookup cost increases faster than log N.

Current peer-to-peer systems
CANs are used in large-scale storage management systems such as ACIRI CAN, Chord, OceanStore,
Farsite, and Publius. These systems all require efficient insertion and retrieval of content in a large
distributed storage infrastructure, and a scalable indexing mechanism is an essential component. Other
recently introduced file sharing systems and protocols include Scour, FreeNet, Ohaha, Jungle
Monkey, and Mojo Nation.
OceanStore uses a variant of the distributed data location protocol developed by Plaxton et al. It
guarantees that queries make a logarithmic number of hops and that keys are well balanced, and the
Plaxton protocol also ensures that queries never travel further in network distance than the node where
the key is stored.
The Freenet peer-to-peer storage system is decentralized and symmetric and automatically adapts
when hosts leave and join. Freenet does not assign responsibility for documents to specific servers;
instead its lookups take the form of searches for cached copies. This allows Freenet to provide a
degree of anonymity, but prevents it from guaranteeing retrieval of existing documents or from
providing low bounds on retrieval costs.
The Ohaha system uses a consistent hashing-like algorithm for mapping documents to nodes, and
Freenet-style query routing.
Most of these are examples of the general concept of a CAN. Some of these are discussed more
thoroughly below. Perhaps the best examples of current peer-to-peer systems that could potentially be
improved by a CAN are the recently introduced file sharing systems like Napster and Gnutella.
An instructive potential application for CANs would be to construct wide-area name resolution
services that (unlike the DNS) decouple the naming scheme from the name resolution process thereby
enabling arbitrary, location-independent naming schemes. DNS provides a host name to IP address
mapping. A CAN can provide the same service with the name representing the key and the associated
IP address representing the value.

5.7.2. ACIRI Content Addressable Network (ACIRI CAN)
A scalable Content-Addressable Network (CAN) was presented in a (theoretical) paper by Sylvia
Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker (University of California,
Berkeley and ACIRI, the AT&T Center for Internet Research at ICSI).
The paper presents theoretical CAN properties and results, some of which are presented above in the
introduction. The paper also describes a specific implementation that fixes some of the parameters.
To distinguish this specific implementation from the general concept of a Content-Addressable
Network, the former is called here ACIRI CAN. Most of the discussion in the paper
applies to general CANs.
The ACIRI CAN design is scalable, fault-tolerant and completely self-organizing. CAN, unlike
systems such as the DNS or IP routing, does not impose any form of rigid hierarchical naming
structure to achieve scalability.
For ACIRI CAN the virtual coordinate space is a d-dimensional torus. For a d-dimensional space
partitioned into n equal zones, the average routing path length is (d/4)(N^(1/d)) hops and individual
nodes maintain 2d neighbours. These scaling results mean that for a d-dimensional space, the number
of nodes (and hence zones) can grow without increasing per-node state, while the path length grows
as O(N^(1/d)).
Several recently proposed routing algorithms for location services route in O(log N) hops with each
node maintaining O(log N) neighbours. If one selects the number of dimensions d = (log2 n)/2, the
same scaling properties would be achieved. Here, however, d is fixed independent of N, since ACIRI
CANs are applied to very large systems with frequent topology changes. In such systems, it is
important to keep the number of neighbours independent of the system size.
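The scaling figures above can be checked numerically. A small sketch (the network size N is an arbitrary example value):

```python
from math import log2

def can_path_length(N, d):
    """Average CAN routing path length, (d/4) * N**(1/d) hops, with a
    constant 2*d neighbours of per-node state (figures from the text)."""
    return (d / 4) * N ** (1 / d)

N = 1_000_000
for d in (2, 3, 6, 10):
    # per-node state stays at 2*d neighbours no matter how large N grows
    print(f"d={d:2}: {2*d:2} neighbours, ~{can_path_length(N, d):7.1f} hops")
print(f"log2(N) = {log2(N):.1f} hops for O(log N) schemes")
```

For N = 10^6 this gives roughly 500, 75, 15 and 10 hops for d = 2, 3, 6 and 10; with d near (log2 N)/2 = 10 the CAN path length indeed approaches the ~20-hop range of O(log N) schemes, as the text notes.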
To allow the CAN to grow incrementally, a new node that joins the system must be allocated its own
portion of the coordinate space. This is done by an existing node splitting its allocated zone in half,
retaining half and handing the other half to the new node.
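The zone split on node join can be sketched as below. Splitting along the longest side is a simplification assumed here; the actual design cycles through the dimensions in a fixed order.

```python
def split_zone(zone):
    """Split an existing node's zone in half: the existing node retains one
    half and hands the other half to the joining node.  A zone is a pair of
    corner points (lo, hi) in the coordinate space."""
    lo, hi = zone
    d = max(range(len(lo)), key=lambda i: hi[i] - lo[i])  # longest dimension
    mid = (lo[d] + hi[d]) / 2
    kept = (lo, tuple(mid if i == d else hi[i] for i in range(len(hi))))
    handed = (tuple(mid if i == d else lo[i] for i in range(len(lo))), hi)
    return kept, handed

# An existing node owning the whole unit square admits a new node:
old_zone, new_zone = split_zone(((0.0, 0.0), (1.0, 1.0)))
```

The (key, value) pairs whose points fall in the handed-over half would then be transferred to the new node, so the space stays fully partitioned.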
A new CAN node first discovers the IP address of any node currently in the system. The functioning
of a CAN does not depend on the details of how this is done. A CAN has an associated DNS domain
name, and this resolves to the IP address of one or more CAN bootstrap nodes. A bootstrap node
maintains a partial list of CAN nodes it believes are currently in the system. To join a CAN, a new
node looks up the CAN domain name in DNS to retrieve a bootstrap node’s IP address. The bootstrap
node then supplies the IP addresses of several randomly chosen nodes currently in the system.

5.7.3. Napster and Gnutella
Napster and similar peer-to-peer systems like Gnutella have become quite popular. The first and
biggest, Napster, was introduced in mid-1999, and its software had been downloaded by 50 million
users by December 2000, making it the fastest-growing application on the Web.
The popularity was no doubt explained by the sole purpose of Napster, sharing music audio files. This
was found in court to be a breach of copyright, and the Napster server was shut down, but it will be
reopened as a commercial music-sharing venture. In its heyday there were about 1.6 million users
daily, but now only about a tenth of that number use the Napster client (to listen to their files).
Gnutella shares any type of file, not just audio. At present, Napster has largely been superseded in
music file sharing by Gnutella and several similar applications like Morpheus.
In Napster and Gnutella, files are stored at the end user machines or node peers, not at a central server.
In Napster, a central directory or index server (a single point of failure) stores the metadata (file
names) and the location (node server) of all the files available within the Napster user community. To
retrieve a file, a user queries this central server using the name of the desired file and obtains the IP
address of a user node storing the file. The file is then downloaded directly from this user machine.
Gnutella went one step further and decentralised the file location process as well: even the directory of
shared files is distributed. Nodes in a Gnutella network self-organise into an application-level mesh,
on which requests for a file (whispers or "gossips") are flooded within a certain scope (the horizon,
see below). In this respect Gnutella is in many ways similar to the way gossip travels (and this is
apparently the reason for its name, resembling "Gnu tell").
The Gnutella network is segmented using the concept of a horizon: the GnutellaNet organizes itself
into segments of about 10 000 nodes. Because nodes leave and rejoin over time, a given computer can
see about 50 000 other hosts. The information one is seeking has to be found within one's horizon.
But the horizon can change completely over time, and the computer can drift into a different (abstract
topological) part of the network.
Both in Napster and Gnutella, files are transferred directly between peer nodes, as opposed to the
traditional client-server model.
Unfortunately, these peer-to-peer designs are not scalable. Although Napster uses peer-to-peer
communication for the actual file transfer, the process of locating a file is still centralized, making it
both expensive (to scale the central directory) and vulnerable (since there is a single point of failure).
Gnutella's method of flooding the network on every request is not scalable either. It may fail to find a
file that is actually in the system, because the flooding has to be cut off at the horizon.
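The horizon effect can be sketched as a hop-limited (TTL-bounded) flood over the mesh. The mesh and file placement below are made-up example data.

```python
def flooded_query(mesh, start, filename, have, ttl):
    """Breadth-first flood of a query with a hop horizon (TTL), Gnutella
    style.  Returns peers inside the horizon that hold the file; a file held
    only beyond the horizon is never found."""
    hits, frontier, seen = [], [start], {start}
    for _ in range(ttl):
        nxt = []
        for node in frontier:
            for peer in mesh.get(node, []):
                if peer not in seen:
                    seen.add(peer)
                    if filename in have.get(peer, set()):
                        hits.append(peer)
                    nxt.append(peer)
        frontier = nxt
    return hits

# A chain A-B-C-D: only D holds the file.
mesh = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
have = {"D": {"rare-file"}}
found = flooded_query(mesh, "A", "rare-file", have, ttl=2)   # horizon too small
```

With ttl=2 the query from A reaches B and C but never D, so the existing file goes unfound; raising the TTL finds it but multiplies the message traffic, which is exactly the scalability problem noted above.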
Peer-to-peer file sharing can "easily" store huge amounts of data. Without centralised planning or huge
investments in hardware and bandwidth, the content available through Napster has been estimated to
exceed 7 TB of storage on a single day. While the business potential of the file sharing systems has
changed after the court decision, their rapid and wide-spread deployment indicates important
advantages in peer-to-peer systems and file sharing. This can lead to new content distribution models,
for instance for software distribution and web content delivery.

Napster
According to its own web site, Napster is the world's leading file sharing community, founded by
Shawn Fanning in 1999 (still only 20 years old). Now, after the court decision and the agreement with
music publishers, the Napster CEO is Konrad Hilbers, formerly at BMG Entertainment and AOL
Europe.
Napster is the pioneer, the best known, and the first success in peer-to-peer file sharing. It is designed
only for sharing music files (in MPEG-3 audio format, *.mp3, and now in a new proprietary secured
NAP format). The client is designed and available mainly for desktop computers, Windows and
Macintosh, and a Java applet is also available. There are other compatible clients and similar music
file sharing systems as well (for instance SpinFrenzy, CuteMX, OpenNAP).
There were legal difficulties with the music publishers, songwriters, and composers; the sharing of
music files without royalties was found to be a breach of copyright, and the Napster server was shut
down for several months by a court order. But now there is an agreement with the publishers, and
Napster will be reopened in early 2002, running legally on a royalty basis as a membership service
with a monthly fee. Currently file sharing is offline, and in all old clients file sharing is disabled
(but they can play already downloaded files).
As indicated above, in Napster there is no centralised storage server; the music files are on individual
desktops. There were lots of problems with broken connections, node (peer server) computers being
down and the requested file unavailable, or a peer server node just quitting in the midst of a file
transfer. The transfer protocol is proprietary and implemented in the client application. And the shutting down
of the Napster server illustrates the wholesale failure of the service that occurs when the vulnerable
single point of access fails.

Gnutella
Gnutella is an open, decentralized, peer-to-peer search system that is mainly used to find files.
Basically, Gnutella is the name of a protocol (or a technology: "Gnutella is nothing but a protocol").
The Gnutella protocol and the original servent (SERVer + cliENT), known as Gnutella v0.56, were
developed in March 2000 by Justin Frankel and Tom Pepper at Nullsoft (nowadays owned by
America Online and AOL Time Warner; Nullsoft also develops the well-known Windows media
player Winamp). The protocol has been openly published.
There is no official program named "Gnutella"; the original version, v0.56, was released as an early
beta. Because the Gnutella protocol is open, there are many interoperable servents to choose from. All
existing servents are clones, with their functionality derived from the original program. For each
supported platform there are several client applications: Linux/Unix (Gnut, LimeWire, Phex),
Windows (BearShare, Gnotella, Gnucleus, LimeWire, Phex, SwapNut, XoloX), Macintosh
(LimeWire, Mactella, Phex), and Java (LimeWire, Phex). Recent servents provide improved features
and network behaviour. Some servents are open source.
With a servent, one can connect to others and form a private network or connect to the general
Gnutella public network (GnutellaNet). Newer programs automatically connect to GnutellaNet, but
some older ones require an initial IP address to connect to. The other half of Gnutella is giving back:
almost everyone on GnutellaNet shares files.
Gnutella (partially) preserves anonymity: when one sends a query to the GnutellaNet, there is not
much in it that can link the query message to the originator. It is not totally impossible (but unlikely)
to figure out who is searching for what. Each time the query is passed on, the possibility of
discovering who originated it is reduced exponentially.
There are no centralized servers, and Gnutella is not focused on trading music only. Gnutella client
software is basically a mini search engine and file serving system in one: a Gnutella servent is an
application that lets one search for, download, and upload any type of file. When one gets a search
hit on the GnutellaNet, the file is virtually guaranteed to be there and available for download. To speed
things up, downloads go directly from the storing node to the requesting node, and clearly they are not
routed through the GnutellaNet mesh itself.

5.7.4. Chord
Chord is a flexible lookup primitive for peer-to-peer environments. Chord maps keys to servers in a
decentralised manner and requires only O(log N) messages to perform the mapping if there are N
nodes in the system. Chord was developed by a team led by Frans Kaashoek (Robert Morris, Frank
Dabek, Ion Stoica, Emma Brunskill, and David Karger) at the Massachusetts Institute of Technology
(MIT). The distinguishing features of Chord are its simplicity and provable correctness and
performance.
Chord is based on the Self-certifying File System (SFS) user level file system toolkit. SFS is also
developed at MIT by a related team and described shortly below. Chord runs on a system that supports
SFS, including Linux, FreeBSD, OpenBSD, Solaris, Macintosh OS X, and NetBSD. Chord is
implemented as a library, which is linked with the client and server applications.
The Chord protocol supports just one operation. Given a key ("file name"), it maps the key onto a
node ("computer"), which might store a value ("file data") corresponding to the key. It efficiently
adapts to nodes joining and leaving as the system continuously changes. Chord is scalable, with
communication cost and the state of each node scaling logarithmically with the number of nodes. The
Chord lookup algorithm yields the IP address of the node for the key. Chord notifies the application of
changes in the set of keys of the node, and the application can, for example, move the corresponding
values to a new node that has just joined the system.
The Chord application provides all desired authentication, caching, replication, and naming of data.
For instance, data could be authenticated by storing it under a key derived from a cryptographic hash
of the data. A file could be replicated by storing it under two distinct keys derived from the name of
the file.
In the application, the higher "file system" layer would provide a file-like interface to users, including
user-friendly named directories and files, and authentication. The middle "block storage" layer would
implement block operations such as storage, caching, and replication of blocks. It would call the
underlying Chord library to identify the node storing a block and would talk to the block storage
server on that node to read or write the block.
Chord implements a distributed hash function, spreading keys evenly over the nodes and providing a
natural load balance. Chord is fully distributed and decentralised, improving robustness and making it
appropriate for loosely organized peer-to-peer applications. The cost of a Chord lookup grows in
proportion to the logarithm of the number of nodes. Thus, even very large systems are feasible, and
no parameter tuning is required to achieve this scaling. Chord adjusts its internal tables to reflect node
joins and departures, even in a continuous state of change, ensuring that the node for a key is found
and improving availability. The Chord key-space is flat, giving applications a large amount of
flexibility in mapping names to keys.
Examples of application areas using Chord as a lookup service:
- Shared storage, where the storage computers are only occasionally available. The data name is
  the key to the live node that stores the data at any given time. Using the same mechanism, one
  can also address load balance in the distributed storage.
- Distributed indexes for Napster- or Gnutella-like keyword search: here the keys could be hashed
  keywords and the values could be lists of machines offering documents with those keywords.
  Chord avoids single points of failure or control as in Napster. Chord also avoids the lack of
  scalability caused by the widespread use of broadcasts as in Gnutella.
- Large-scale combinatorial search, such as code breaking: keys are candidate solutions, and
  Chord maps them to the machines that test them as solutions.
The Chord protocol specifies how to find the locations of keys, how new nodes join the system, and
how to recover from the failure or departure of existing nodes.
Chord uses a variant of consistent hashing to assign keys to Chord nodes. Consistent hashing tends to
balance load, since each node receives roughly the same number of keys, and involves relatively little
movement of keys. Chord improves the scalability of consistent hashing by distributing the routing
information: each node needs information about only a few other nodes. A node resolves the hash
function by communicating with a few other nodes.
In an N-node network, each node maintains information only about O(log N) other nodes, and a
lookup requires O(log N) messages. The updating of the routing information when a node joins or
leaves the network requires O(log^2 N) messages.
A Chord node requires logarithmic information for efficient routing, but its performance degrades
gracefully when that information is out of date. This is important in practice because nodes will join
and leave arbitrarily. Only one piece of information per node needs to be correct in order for Chord to
guarantee correct (though slow) routing of queries.
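Chord's consistent-hashing assignment of keys to nodes (described in detail below) can be sketched as follows. The 8-bit identifier space and the example IP addresses are toy assumptions; real deployments use much larger identifiers.

```python
import hashlib

M = 8                                  # identifier bits; real systems use e.g. 160 (SHA-1)

def chord_id(text):
    """m-bit identifier obtained by hashing, as in consistent hashing."""
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % 2**M

def successor(key_id, node_ids):
    """First node whose identifier is equal to or follows key_id on the
    identifier circle modulo 2**M; the key is assigned to that node."""
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= key_id:
            return nid
    return ring[0]                     # wrap around the circle

nodes = [chord_id(ip) for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]]
home = successor(chord_id("grid-file-007"), nodes)   # node responsible for the key
```

This linear scan is the slow, obviously-correct form of lookup; Chord's contribution is resolving the same successor function in O(log N) messages using distributed routing tables.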

The consistent hash function assigns each node and key an m-bit identifier using e.g. SHA-1, where m
must be large enough to make collisions of the hashed values improbable. The node identifier is chosen
by hashing the IP address, while the key identifier is the hash of the key. Key identifiers are ordered on
a one-dimensional identifier circle modulo 2^m, and the key k is assigned to the first node whose
identifier is equal to or follows k in the identifier space. This node is called the successor of k; it is
the first node clockwise from k. Consistent hashing is designed to let nodes enter and leave the
network with minimal disruption. When a node joins the network, certain keys previously assigned to
its successor are assigned to it. When the node leaves the network, all of its assigned keys are
reassigned to its successor.

Self-certifying File System (SFS)
The Self-certifying File System (SFS) is a network file system providing strong security in untrusted
networks. It tries to prevent security from hurting performance or becoming an administrative burden.
It is developed at MIT (David Mazières, Chuck Blake, Frank Dabek, Kevin Fu, Frans Kaashoek,
Michael Kaminsky, Emil Sit, Emmett Witchel), sponsored by DARPA.
SFS runs on Unix platforms with NFS version 3 support (for instance OpenBSD, FreeBSD, OSF/1,
Solaris, and Linux with an SFS-capable Linux kernel).
SFS is also a global file system. Users can access any server from any client in the world, and share
files with anyone anywhere. There is no need to rely on system administrators or trusted third parties
to coordinate the sharing of files across administrative realms. Thus, SFS provides convenient file
sharing over the Internet (even if security is not essential).
SFS always provides security over untrusted networks, but does not perform any key management.
SFS accomplishes this by naming file systems by their public keys. Every SFS file server is accessible
under a self-certifying pathname of the form /sfs/Location:HostID, where Location is the server's DNS
hostname or IP address and HostID is a cryptographic hash of the server's public key. Self-certifying
pathnames are automatically created, or auto-mounted, the first time they are referenced.
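The self-certifying pathname construction can be sketched as below. This only illustrates the idea; SFS's actual HostID derivation differs in detail (it hashes more than just the raw key bytes), and the hostname and key here are made-up examples.

```python
import hashlib

def self_certifying_path(location, public_key):
    """Sketch of an SFS-style pathname /sfs/Location:HostID: HostID is a
    cryptographic hash of the server's public key, so the name itself
    certifies which key the server must prove possession of."""
    host_id = hashlib.sha1(public_key).hexdigest()[:32]
    return f"/sfs/{location}:{host_id}"

path = self_certifying_path("fileserver.example.org", b"...example public key bytes...")
```

Because the pathname commits to the key, a client that resolves it can authenticate the server without any trusted third party: a server presenting a different key simply does not match the name.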
One can use SFS to improve the security of a local-area network by providing a secure replacement
for NFS, for instance. To replace NFS with SFS, one can simply add symbolic links from the local
hard disks of all client machines to the self-certifying pathnames of all relevant servers.
One can also use SFS to gain remote file system access by creating a new private server, even in
places where one is not normally allowed to have one, or to set up a file server to share files across
administrative realms, even if servers are under centralized control. SFS is designed for secure file
sharing across the Internet and is also trivial to set up. Often file servers and user accounts are
centrally maintained, and one cannot set up one's own file servers or create guest accounts without
involving the administrators. But SFS lets one create a file server on one's own machine and access
that server from any other machine without any special privileges, if the computers run the SFS
client. There is no administrative overhead for accessing many separately administered SFS servers.
To do this, one can simply download (securely, using a password) the self-certifying pathname of the
server and create a symbolic link to the self-certifying pathname.

This makes it easy to access a file server at work from home, or to share a file system with
collaborators at a different institution. This would otherwise often be impractical or impossible
because of security concerns or administrative hassles.
Configuring and running SFS is relatively easy. The SFS client can peacefully coexist with other
network file system clients. The SFS server serves the operating system's local file systems. Thus, one
can install and evaluate SFS without disrupting any existing network file systems and gradually
migrate to SFS. The SFS server does not even require a dedicated disk or partition; the current
implementation of SFS serves existing local file systems on the server.
Currently SFS is slightly slower than plain NFS 3 over UDP. But the developers think that SFS could
outperform NFS by moving a small portion of its functionality into the kernel, because most of the
slowdown is caused by its portable user-level implementation rather than by encryption, for instance.
One can even use SFS for one's home directory, but it is not very convenient. At the first login one has
to run a program to authenticate and get permission to access the home directory. Often it is easiest to
log out and back in again to let the shell process all settings from the initialisation (dot) files. In the
future, login programs will probably do this automatically (similarly to what is currently done for ssh
and AFS).
SFS uses several cryptographic algorithms: It uses SHA-1 or slightly strengthened variants as a
"random oracle" in a number of situations. For instance, host IDs are computed with SHA-1. SFS
generates random bits with the NIST pseudo-random generator, which is based on SHA-1. It protects
the secrecy of messages using ARC4 (alleged RC4). A separate key is used for traffic in each
direction. SFS never operates in a degraded or insecure mode. Public key encryption and digital
signatures are performed using the Rabin-Williams algorithm. SFS uses blowfish with 20-byte keys to
encrypt NFS file handles. All user-chosen passwords are processed with the eksblowfish algorithm.
Private keys are encrypted with eksblowfish when stored on disk.

5.7.5. OceanStore
The OceanStore project, led by John Kubiatowicz at the UC Berkeley Computer Science Division,
tries, in its own words, to provide a global-scale persistent data store. OceanStore is designed to scale
up to thousands of millions of users, and to provide a consistent, highly available, and durable storage
utility atop an infrastructure composed of untrusted servers. Any computer can join, contribute or
consume storage, and provide user access in exchange for economic compensation. Many components
of OceanStore are already functioning, but an integrated prototype is currently being developed.
Users subscribe to a single OceanStore service provider, but they may then utilise storage and
bandwidth from many different providers, in analogy with current Internet service providers (ISPs).
The providers transparently purchase capacity and coverage among themselves. OceanStore thus
combines the resources of many providers to achieve a better quality of service than is possible for a
single provider.
OceanStore caches data promiscuously; any server may create a local replica of any data object.
These local replicas provide faster access and robustness to network partitions. They also reduce
network congestion by localizing access traffic. The model also assumes that any server in the
infrastructure may crash, leak information, or become compromised. Promiscuous caching therefore
requires redundancy and cryptographic techniques to protect the data from the servers upon which it
resides.
OceanStore stores each version of a data object in a permanent read-only form, which is encoded with
an erasure code and spread over hundreds or thousands of servers. A small subset of the encoded
fragments is sufficient to reconstruct the archived object. Only a global-scale disaster could disable
enough machines to destroy the archived object. This version-based archival storage provides
durability that exceeds today's best systems by orders of magnitude.
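The claim that a small subset of the encoded fragments suffices to reconstruct an object can be illustrated with a toy (4, 3) XOR-parity erasure code: any 3 of the 4 fragments recover the 3 original data blocks. Real systems such as OceanStore use far stronger codes (e.g. Reed-Solomon) spread over hundreds of servers; this sketch only shows the principle.

```python
# Toy (4, 3) XOR-parity erasure code: 3 data blocks are encoded into
# 4 fragments, and any single lost fragment can be rebuilt from the
# other three. Production erasure codes tolerate many more losses.

def encode(blocks):
    """blocks: list of 3 equal-length byte strings -> 4 fragments."""
    parity = bytes(a ^ b ^ c for a, b, c in zip(*blocks))
    return blocks + [parity]

def decode(fragments):
    """fragments: list of 4 entries, at most one of which is None."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if not missing:
        return fragments[:3]
    # XOR of the three surviving fragments reproduces the lost one,
    # because data0 ^ data1 ^ data2 ^ parity == 0 bytewise.
    present = [f for f in fragments if f is not None]
    rebuilt = bytes(a ^ b ^ c for a, b, c in zip(*present))
    fragments[missing[0]] = rebuilt
    return fragments[:3]
```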
OceanStore employs a fault tolerant commit protocol to provide strong consistency across replicas.
The OceanStore API also allows applications to weaken their consistency restrictions in exchange for
higher performance and availability.
The OceanStore introspection layer adapts the system to improve performance and fault tolerance.
Internal event monitors collect and analyse information such as usage patterns, network activity, and
resource availability. OceanStore can adapt to regional outages and denial of service attacks by


proactively migrating data towards areas of use and by maintaining sufficiently high levels of data redundancy.

5.7.6. Freenet
Freenet is a distributed, anonymous, adaptive peer-to-peer system for the storage and retrieval of
information, developed by Ian Clarke (USA), Oskar Sandberg (Sweden), Brandon Wiley (USA), and
Theodore W. Hong (UK). It grew out of work done by Ian Clarke at the University of Edinburgh.
Freenet is developed as a free software project on SourceForge, where documentation, further
references, and an initial implementation are available.
Freenet has had the following design goals:
   Publication, replication, and retrieval of data while addressing concerns of privacy and availability
   Anonymity for both the authors producing and the consumers reading the information
   Resistance to attempts by third parties to deny access to information
   Efficient dynamic storage and routing of information
   Decentralisation of all network functions; hence no broadcast search or centralised location index
Freenet is designed to respond adaptively to usage patterns, transparently moving, replicating, and
deleting files as necessary; files are dynamically replicated at, and deleted from, locations depending
on their usage. Freenet thus forms an adaptive caching system driven by usage and incorporating lazy
replication.
A Freenet network consists of similar nodes that pool their storage space and route requests towards
the most likely physical location of the data. Each user contributes part of their local storage to the
pool, and the pool in turn becomes an extension of every user's storage capacity. File "names" are
location-independent; Freenet is thus a co-operative, location-independent distributed file system
spanning many computers. It allows files to be stored in and requested from the pooled storage
anonymously.
Freenet is not intended to guarantee permanent file storage. In fact, the only storage is the cache,
because storage nodes can delete files at will using least-recently-used (LRU) caching. Hence there is
no guarantee of permanent storage, although with a sufficient number of nodes and enough storage
capacity most files can be stored indefinitely.
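The LRU eviction a node applies to its local datastore can be sketched as follows; this is a minimal illustration, not Freenet's actual datastore code, and the class and method names are invented for the example.

```python
# Minimal sketch of least-recently-used eviction in a node's local
# datastore: reading a file marks it as recently used, and inserting
# beyond capacity silently drops the stalest entry.

from collections import OrderedDict

class DataStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()

    def get(self, key):
        if key in self.files:
            self.files.move_to_end(key)       # mark as recently used
            return self.files[key]
        return None

    def put(self, key, data):
        self.files[key] = data
        self.files.move_to_end(key)
        if len(self.files) > self.capacity:
            self.files.popitem(last=False)    # evict least recently used
```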
Anonymity is maintained by a complex scheme of cryptographic keys and hashing for file naming,
transfer, retrieval, and content protection (for details, see the documentation). This makes it virtually
infeasible to discover the true origin or destination of a file, and difficult for node operators to
determine, or be held responsible for, the contents of their nodes.
The system operates at the application layer and assumes the existence of a secure transport layer,
although it is transport-independent. It only provides anonymity for Freenet file transactions, not for
general network usage.
Freenet is implemented as an adaptive peer-to-peer network of nodes that query one another to store
and retrieve data files named by location-independent keys. Each node maintains its own local store
available to the network for reading and writing, and a dynamic routing table for addresses and keys.
Probably most users of the system will run nodes, both to provide security guarantees against a hostile
foreign node and to increase the storage capacity available.
Requests for keys are passed along from node to node. Each node makes a local decision where to
send the request next in the style of IP (Internet Protocol) routing. Each request is assigned a pseudo-


unique random identifier and a hops-to-live limit, analogous to the IP protocol, to prevent infinite chains
and loops. Routes vary depending on the key. The routing algorithms for storing and retrieving data
are designed to adjust routes adaptively over time, providing efficient performance while using local,
rather than global, knowledge. Locality is necessary because, to preserve privacy, nodes only know
their immediate upstream and downstream neighbours. The quality of routing should improve with
time, as nodes gather more information about where files with "similar" keys are to be found and
actually store more and more of the files requested from them.
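The request forwarding described above can be sketched as key-closeness routing with a hops-to-live counter. This is not Freenet's actual algorithm: the XOR closeness metric, the node API, and the caching policy here are simplifications invented for illustration.

```python
# Hedged sketch of key-based request routing: each node forwards a
# request to the neighbour whose advertised key is closest to the
# target, using only its local routing table, and caches the data on
# the return path. Metric and API are illustrative, not Freenet's.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}      # key -> data held locally
        self.routing = {}    # key -> neighbouring Node believed to hold it

    def request(self, key, hops_to_live=10, visited=None):
        visited = visited if visited is not None else set()
        visited.add(self.name)
        if key in self.store:                  # local hit
            return self.store[key]
        if hops_to_live == 0:                  # chain limit reached
            return None
        # Forward to the neighbour advertising the closest known key
        # (toy metric: XOR of the key bytes as integers).
        candidates = [
            (int.from_bytes(k, "big") ^ int.from_bytes(key, "big"), n)
            for k, n in self.routing.items() if n.name not in visited
        ]
        if not candidates:
            return None
        _, nxt = min(candidates, key=lambda c: c[0])
        data = nxt.request(key, hops_to_live - 1, visited)
        if data is not None:
            self.store[key] = data             # cache on the return path
            self.routing[key] = nxt
        return data
```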
Finding the keys for file requests in the first place still seems to be a partially open problem.

5.7.7. Mojo Nation
Mojo Nation is a peer-driven content distribution technology. While simple data distribution
architectures like Napster or Gnutella may be sufficient to let users trade MP3 files, they are unable
to scale up to deliver rich-media content while still taking advantage of the cost savings of
peer-to-peer systems. Mojo Nation combines the flexibility of a marketplace with a secure "swarm
distribution" mechanism that goes beyond current file-sharing systems, providing high-speed
downloads that run from multiple peers in parallel. The Mojo Nation technology presents itself as an
efficient, massively scalable, and secure toolkit for distributors and consumers of digital content.
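The "swarm distribution" idea of pulling one file from multiple peers in parallel can be sketched as below. The `fetch_block` peer API is hypothetical; this is not Mojo Nation's implementation, only an illustration of the download pattern.

```python
# Sketch of swarm downloading: different blocks of one file are fetched
# from several peers concurrently and reassembled in order. The peer
# interface (fetch_block) is a hypothetical stand-in.

from concurrent.futures import ThreadPoolExecutor

def swarm_download(peers, fetch_block, n_blocks):
    """Fetch n_blocks blocks, assigning block i to peers[i % len(peers)]."""
    def fetch(i):
        return fetch_block(peers[i % len(peers)], i)
    with ThreadPoolExecutor(max_workers=len(peers)) as pool:
        blocks = list(pool.map(fetch, range(n_blocks)))  # order preserved
    return b"".join(blocks)
```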
Supported platforms are Windows 98/ME, NT, and 2000 (Windows 95 is not officially supported);
Linux on the Red Hat and Debian distributions; and the BSDs FreeBSD and OpenBSD. Macintosh
OS X will probably be supported as well. Other platforms can be supported by compiling the open
source (with GNU tools); the source is available from SourceForge.

5.7.8. JXTA Search
JXTA Search is a new service based on the Sun-initiated Project JXTA (discussed below) that enables
efficient search in distributed networks. It is based on technology originally developed by
InfraSearch, which was acquired by Sun. JXTA Search finds content and services on JXTA nodes
and on the web, from either network. The JXTA Search technology is made available as open source.
JXTA Search consists of a new XML search protocol for describing queries and responses, and
provides a hub search service for both JXTA nodes and the web. This allows consumer applications
to efficiently find providers that can answer their requests. JXTA Search is developed in Java, XML,
and JXTA, and provides distributed search of JXTA and web content and services. (An early
InfraSearch demonstration was based on Gnutella.)
Unlike search engines such as Google or AltaVista, JXTA Search does not crawl the web and build
large indexes to provide search results. Instead, it specifies a simple XML protocol that content
providers use to return current, dynamic content in response to search requests. JXTA Search is
therefore well suited to environments where content changes rapidly and is spread across many
different providers. Content providers can be other JXTA Search hubs, JXTA peer nodes, or web sites.
Scraping or proxy engines use a customised understanding of each network service, e.g. a web site, to
perform distributed search. This does not scale: it requires a developer to read through the HTML of
each site they wish to proxy, and sites may change their look and feel at any time, which usually
breaks the proxy.
Meta-search engines like Dogpile focus on combining the results of a limited number of existing
search engines such as Google and AltaVista. JXTA Search instead targets search in a more general
sense: searching millions of peers for content, or hundreds of product catalogues, in a consistent
fashion.

5.7.9. Project JXTA
Project JXTA (short for Juxtapose, as in side by side) started as a research project incubated at Sun
under the guidance of Bill Joy and Mike Clary. Its goal is to explore distributed network computing
using peer-to-peer topology, and to develop basic services that would enable innovative applications
for peer groups.
The project has posted a draft specification and implementation code to a web site under the Apache
Software License, encouraging others to join the effort. The available code is of early prototype
quality, and the latest builds of the source code are posted on the project web site. The software and
code are free, subject only to the terms of the Apache Software License (modified only to reflect the
Project JXTA name and that Sun was the original contributor).
The initial prototype implementations for Project JXTA are written in the Java programming language,
but the specifications are intentionally language-independent, and JXTA will also be implemented in
C in the future. The initial implementation requires a platform that supports the Java Run-time
Environment (JRE), which is available on Windows, Solaris, Linux, and Macintosh. Alternate
implementations of JXTA will be developed to run in additional environments and on other
development platforms such as Perl and Python.
Project JXTA can be described using three layers: a core layer, a middle services layer, and an
application layer. The core layer includes protocols and building blocks for peer-to-peer networking,
including discovery, transport, and the creation of peers and peer groups. The services layer provides
hooks for supporting generic services such as searching, sharing, and security. The application layer
supports the implementation of integrated applications, such as file sharing, resource sharing,
monetary systems, and distributed storage. The entire system is designed to be modular.



Archiving is an integral part of an HSM solution. Files that are important but perhaps not needed for
a while, or so big that they cannot be stored on online disks, are moved to tertiary storage: a near-line
robotic tape or cartridge archive, or even an off-line archive. These non-online files are usually read
back from tape to disk when needed. The metadata of the files should remain on-line.
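A migration pass of the kind described above can be sketched as follows; the cut-off period and the idea of selecting candidates by last-access time are illustrative assumptions, and a real HSM would also write stub entries and drive the tape robot.

```python
# Illustrative sketch (not a real HSM): select files untouched for a
# cut-off period as candidates to migrate from online disk to the tape
# archive, leaving only their metadata online. The 90-day threshold is
# an assumed policy parameter.

import os
import time

MIGRATE_AFTER = 90 * 24 * 3600        # seconds without access

def migration_candidates(root, now=None):
    """Yield paths under root whose last access is older than the cut-off."""
    now = now or time.time()
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if now - os.stat(path).st_atime > MIGRATE_AFTER:
                yield path
```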
Sometimes files are truly archived for future use. The nature of that use may not be known
beforehand, but it is anticipated that the file could be needed sooner or later. Backups may also be
kept for historical reasons, to track and document the evolution of a project, or because one is
officially required to retain some data. This kind of "historical archive" could also be burned onto
CDs if the amount of data is moderate.
Backup and archiving are superficially similar, yet different. The main purpose of backup is to store an
extra copy of all important files, for recovery from any disaster that might befall them, whether
malfunctions of the storage software, hardware, or media, or accidental deletion by the user. An HSM
system does not provide extra copies of the files unless special arrangements are made; this is the
major difference between archival systems, including HSM systems, and backup systems.

6.1.1. Backup tools
Backup tools are, in principle, separate from an HSM, even though some backup tools have properties
that can be used to implement HSM features. In any case, the backup tools must cooperate with the
HSM and complement it. TSM (Tivoli Storage Manager)
Tivoli Storage Manager (TSM), the successor to the IBM/Tivoli ADSM storage management
software, is developed and supported by Tivoli Systems Inc., a firm founded by former IBM
employees in 1989 and bought by IBM in 1996. It is a backup tool, but it also has some HSM
functionality. It is supposed to scale, but it is limited to relatively small tape devices.
Tivoli supports about 35 platforms, including Linux, most Unix platforms, and Macintosh. The server
can run on AIX, HP-UX, Sun Solaris, and Windows NT and 2000. It also supports a broad range of
storage devices, about 250 different systems with disks, removable cartridge drives, and tapes, and it
tries to maximise tape usage to minimise the required number of tapes.
Tivoli has Java-enabled Web browser interfaces for administrators and end-users. Application
Program Interfaces (APIs) enable applications to access a Tivoli Storage Manager server for backup,
archive, or specialized services.
The hierarchical storage management part of TSM is based on OSM (Open Storage Manager).
ADSTAR Distributed Storage Manager (ADSM) has the following properties:
      Automated, unattended backups, high-speed restore, and disaster recovery
      Hierarchical storage management and long-term data archives
      Server-to-server communication, enabling objects to be sent to or received from another server
      Robust storage management server database


      Multitasking capability
      Robust security capabilities Legato Networker

6.1.2. Backups and archives at CERN
For historical reasons, four different backup and archiving systems are presently used at CERN.
Legato Networker backs up the central servers; AFS has its own backup system; and Tivoli/TSM
keeps public archives, except for large physics files (bulk data), which are archived in Castor,
possibly with multiple copies for precious data.
A. For individual desktop workstations, Tivoli/TSM is used, and only file systems are backed up, not
whole disks. Backups are taken via AIX to an STK robot with IBM 3590 cartridges (1000 cartridges,
plus some archiving). TSM is relatively pleasant to use and has an especially nice interface. Recovery
of full disks used to be slow, but the performance is now as good as with Legato.
B. For all central systems, i.e. departmental and IT servers and Windows NT servers, backups are
taken with the Legato Networker backup tool to an STK 9840 drive (fast positioning) holding 800
cartridges, recycled after 3 months. Originally these were backed up with TSM; Networker was
acquired for faster restores, even though it is more expensive to use. Now one could return to using TSM.
C. AFS is used for the home directories of about 1200 staff users, containing some 3-4 TB on 10 disk
servers. AFS makes clones of the file system, which are kept for 24 hours. Backups are done with the
AFS backup product to DLT 7000 cartridges on a robot system. An incremental backup is taken
nightly (about 20 DLT cartridges) and full backups monthly (about 100 DLT cartridges), which are
kept for 12 months. Currently users have to send a recovery request to the administrators, even though
they should be able to recover files themselves. For security, the solution is classical: copies are kept
in a vault.
D. Bulk data (which here basically means "physics", i.e. the huge data sets resulting from physics
experiments) are kept in Castor. For each file the user can request a second copy for precious data;
the second copy incurs a cost to the user. There are two STK clusters, which are physically separated
for safety reasons. Data is stored in round-robin alternation between the silos, so that on average a
user's file copies alternate between the clusters.
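The round-robin placement with an optional second copy on the other cluster can be sketched as below. The cluster names and the `place` function are hypothetical, invented only to illustrate the alternation policy.

```python
# Sketch of the placement policy described above: successive files
# alternate between two physically separated clusters, and a requested
# "precious" second copy always lands on the other cluster. Names are
# hypothetical.

import itertools

CLUSTERS = ["STK-cluster-A", "STK-cluster-B"]
_rr = itertools.cycle(range(len(CLUSTERS)))

def place(filename, precious=False):
    """Return the list of clusters a file's copies are written to."""
    primary = next(_rr)                       # alternate A, B, A, B, ...
    copies = [CLUSTERS[primary]]
    if precious:                              # second copy, at extra cost
        copies.append(CLUSTERS[(primary + 1) % len(CLUSTERS)])
    return copies
```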
