LegionFS: A Secure and Scalable File System Supporting
Cross-Domain High-Performance Applications
Brian S. White, Michael Walker, Marty Humphrey, Andrew S. Grimshaw
Department of Computer Science
University of Virginia
Charlottesville, VA 22903
{bsw9d,mpw7t,humphrey,grimshaw}@cs.virginia.edu
ABSTRACT
Realizing that current file systems cannot cope with the diverse requirements of wide-area collaborations, researchers have developed data access facilities to meet their needs. Recent work has focused on comprehensive data access architectures. In order to fulfill the evolving requirements in this environment, we suggest a more fully-integrated architecture built upon the fundamental tenets of naming, security, scalability, extensibility, and adaptability. These form the underpinning of the Legion File System (LegionFS). This paper motivates the need for these requirements and presents benchmarks that highlight the scalability of LegionFS. LegionFS aggregate throughput follows the linear growth of the network, yielding an aggregate read bandwidth of 193.8 MB/s on a 100 Mbps Ethernet backplane with 50 simultaneous readers. The serverless architecture of LegionFS is shown to benefit important scientific applications, such as those accessing the Protein Data Bank, within both local- and wide-area environments.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SC2001 November 2001, Denver. (c) 2001 ACM 1-58113-293-X/01/0011 $5.00

1. INTRODUCTION
Emerging wide-area collaborations are rapidly causing the manner and mechanisms in which files are stored, retrieved, and accessed to be re-evaluated. New, inexpensive storage technology is making terabyte and petabyte weather data stores feasible. Such data should be accessible physically close to the place of origin and by clients around the world. Companies are seeking mechanisms to share data without compromising the proprietary information of any involved site. Increasingly, clients desire the file system to dynamically adapt to varying connectivity, security, and latency requirements.

Accommodating the varied and continually evolving requirements of applications existing in these domains precludes the use of file systems that impose static interfaces or fixed access semantics. Common access patterns include whole-file access to large, immutable files and strided access to numerical, scientific datasets. The latter benefits from a non-traditional interface, though neither requires typical file system precautions such as consistency guarantees. However, a file system catering only to these file access characteristics would be short-sighted, ignoring a possible requirement for additional policy, such as a stricter form of consistency. To service an environment which continues to evolve, a file system should be flexible and extensible.

Wide-area environments are fraught with insecurity and resource failure. Providing abstractions which mask such nuances is a requirement. The success of Grid and wide-area environments will be determined in no small part by their initial and primary users, domain scientists and engineers, who have little expertise in coping with the vagaries of misbehaved systems. Corporations may wish to publish large datasets via mechanisms such as TerraVision [27], while limiting access to the data. This is appropriate for collaborations that are mutually beneficial to organizations, which are, nevertheless, mutually distrusting. Such varied and dynamic security requirements are most easily captured by a security mechanism that transcends object interactions.

As resources are incorporated into wide-area environments, the likelihood of failure increases. A file system should relieve the user of coping with such failures. Approaches which require a user to explicitly name data resources in a location-dependent manner require that a user first locate the resource and later deal with any potential faults or migrations of that resource.

To address these concerns, we advocate a fully-integrated file system infrastructure. We have implemented the Legion [17] File System (LegionFS), an architecture supporting the following five tenets, which we consider fundamental to any system hoping to meet the goals delineated above:

Location-Independent Naming: LegionFS utilizes a three-level naming scheme that shields users from low-level resource discovery and is employed to seamlessly handle faults and object migrations.

Security: Each component of the file system is represented as an object. Each object is its own security domain, controlled by fine-grained Access Control Lists (ACLs). The security mechanisms can be easily configured on a per-client basis to meet the dynamic requirements of the request.

Scalability: Files can be distributed to any host participating in the system. This yields superior performance to monolithic solutions and addresses the goals of fault tolerance and availability.

Extensibility: Every object publishes an interface, which may be inherited, extended, and specialized to provide alternate policies or a novel implementation.

Adaptability: LegionFS maintains a rich set of system-wide metadata that is useful in tailoring an object's behavior to environmental changes.

Previous work targeted at wide-area, collaborative environments has successfully constructed infrastructures composed of existent, deployed internet resources. Such approaches are laudable in that they leverage valuable, legacy data repositories. However, they fail to seamlessly federate such distributed resources to achieve a unified and resilient environment. A fully-integrated architecture adopts basic mechanisms such as naming and security upon which new services are built or within which existent services are wrapped. This obviates the need for application writers and service providers to focus on tedious support structure and allows them to concentrate on realizing policies within the flexible framework provided by the mechanisms.

LegionFS provides only basic functionality and is intended to be extended to meet the performance requirements of specific environments. The core of LegionFS functionality is provided at the user-level by Legion's distributed object-based system. The file and directory abstractions of LegionFS may be accessed independently of any kernel file system implementation through libraries encapsulating Legion communication primitives. This approach provides flexibility as interfaces are not required to conform to standard UNIX system calls. A modified user-level NFS daemon, lnfsd, interposes between an NFS kernel client and the objects constituting LegionFS. This implementation provides legacy applications with seamless access to LegionFS.

This paper is organized as follows: Section 2 contains a description of the design of LegionFS, including a brief overview of the Legion wide-area operating system and an in-depth discussion of naming, security, scalability, extensibility, and adaptability. Section 3 contains a performance evaluation highlighting the advantages afforded by a scalable design. Section 4 presents an overview of related work and Section 5 concludes.

2. LEGIONFS DESIGN
Legion [17] is middleware that provides the illusion of a single virtual machine and the security to cope with its untrusted, distributed realization. From its inception, Legion was designed to deal with tens of thousands of hosts and millions of objects, a capability lacking in other object-based distributed systems. This section discusses the key areas of Legion as they apply to the design of LegionFS.

2.1 Object Model
Legion is an object-based system comprised of independent, logically address space-disjoint, active objects that communicate via remote procedure calls (RPCs). Objects represent coarse-grained resources and entities such as users, hosts, schedulers, files, and directories. Each Legion object belongs to a class, which is itself a Legion object. Much of what is usually considered system-level responsibility is delegated to user-level class objects. For instance, Legion classes are responsible for creating and locating their instances, and for selecting appropriate security and object placement policies.

Legion objects may be active or inactive, and store their internal state on disk. Objects may be migrated simply by transferring this internal state to another host. The object's class then spawns a process which is instantiated with the migrated internal state.

The complete set of method signatures exported by an object defines its interface. The Legion file abstraction is a BasicFileObject, whose methods closely resemble UNIX system calls such as read, write, and seek. ContextObjects manage the Legion name space. Due to the resource inefficiency of representing each as a stand-alone process, files and contexts residing on one host have been aggregated into container processes, called ProxyMultiObjects (Figure 1). A ProxyMultiObject polls for requests and demultiplexes them to the corresponding contained file or context. Files store data in a LegionBuffer, which achieves persistence through the underlying UNIX file system. ProxyMultiObjects leverage existent file systems for data storage, providing direct access to UNIX files. Unlike traditional file servers, ProxyMultiObjects are lightweight and intended to be distributed throughout the system. They service only a portion of the name space, rather than comprising it in its entirety.

Figure 1: ProxyMultiObject. [Diagram: a server loop dispatching to BasicFileObjects, each holding a LegionBuffer (data), backed by the native file system.]

2.2 Naming
User-defined text strings called context names identify Legion objects. Context names are mapped by a directory service called context space to unique, location-independent binary names called Legion object identifiers (LOIDs). For direct object-to-object communication, LOIDs must be bound via a binding process to low-level Object Addresses. An Object Address (OA) represents an arbitrary communication endpoint, such as a TCP socket.

The LOID records the class of an object, its instance number, and a public key to enable encrypted communication. New LOID types can be constructed to contain additional security information such as an X.509 certificate, location hints, and other information.

Context space is similar to a globally distributed, rooted directory. It is comprised of ContextObjects, which provide mappings from context names to LOIDs in the same fashion that directories map path names to inode numbers. Unlike directories, ContextObjects may contain references to arbitrary objects.

Having translated a context name to a LOID, an object consults a series of distributed caches to bind the LOID to an OA. Each object maintains a local binding cache. A binding cache miss results in a call to a Binding Agent object. Cache misses at the Binding Agent are serviced by the class of the LOID. This operation recurses, if necessary, but is guaranteed to terminate at LegionClass, the root of the Legion object hierarchy.

Legion's location-independent naming facilitates fault tolerance and replication. Because objects are not bound by name to individual hosts, they may be seamlessly migrated. If an object's host fails, but the internal state of an object is still accessible, the object's class may restart it elsewhere. Classes may act as replication managers by mapping one LOID to multiple OAs, referring to objects on different hosts. A class object is a logical replication manager as its instances would likely employ the same replica consistency policies. By entrusting a class object with more responsibility, the system increases the load on that object. Means of ensuring that individual objects do not become bottlenecks are discussed in Section 2.4.

2.3 Security
Legion's distributed, extensible nature and user-level implementation prevent it from relying on a trusted code base or kernel. Furthermore, there is no concept of a superuser in Legion. Individual objects are responsible for legislating and enforcing their own security policies. The public key embedded in an object's name enables secure communication with other objects. Objects are free to negotiate the per-transaction security level on messages, such as full encryption, digital signatures, or cleartext.

When a user authenticates to Legion, currently via password, she obtains a short-lived, unforgeable credential [12] that uniquely identifies her. Authorization is determined by an Access Control List (ACL) associated with each object; an ACL enumerates the operations on an object and the associated access rights of specific principals or groups of principals. If the signer of any credentials passed in an invocation is allowed to perform the operation, the operation is permitted.

Per-method access control facilitates finer-grained security than traditional UNIX file systems. No special privilege is necessary to create a group upon which to base access. A client can dynamically modify the level of security employed for communication, for example to use encryption when transacting with a geographically-distant peer, but to communicate in the clear within a cluster. Specialized file objects can be designed to keep audit trails on a per-object or per-user basis (i.e., auditing can be performed by someone other than a system administrator).

2.4 Scalability
LegionFS distributes files and contexts across the available resources in the system. This allows applications to access files without encountering centralized servers and ensures that they can enjoy a larger percentage of the available network bandwidth without contending with other application accesses.

Scheduler objects provide placement decisions upon object creation. Utilizing information on host load, network connectivity, or other system-wide metadata, a scheduler can make intelligent placement decisions. A user may employ existing schedulers, implement an application-tailored scheduler which places files, contexts, and objects according to domain-specific requirements, or may enforce directed placement decisions. Using the latter mechanism, a user might specify that all of his files be created on a local host or within a highly-connected, nearby cluster. This ensures that most file accesses are local, while allowing for wide-area access. It also isolates user file accesses to achieve maximum efficiency. A user may employ the replication techniques described elsewhere in this paper to tolerate failures of local resources. This provides highly-efficient access in the common case, with a measure of insurance in case of host or disk failures.

The fully-distributed design of LegionFS allows the user to remain ignorant of the constraints of physical disk enclosures, available disk space, and file system allocations. Administrators seamlessly incorporate additional storage resources into the file system. By simply adding a storage subsystem to a context of available storage elements, the additional space is advertised to the system and becomes a target for placement decisions.

LegionFS utilizes multiple levels of caching to facilitate efficient file and directory lookups and employs limited forms of replication. Aside from their role in the binding process, Binding Agents cache translations between context names and LOIDs. lnfsd similarly caches translations to avoid excessive RPCs.

Manager objects such as classes can become hot spots. Fortunately, there is no inherent reason to have one class manager for all instances of a particular class. To mitigate potential bottlenecks, management responsibilities are distributed across 'clones' of a particular class.

2.5 Extensibility
LegionFS differentiates between objects according to their exported interfaces, not their implementations. For example, LegionFS interacts with any object providing the standard BasicFileObject interface as if it were a file. By focusing on the interface without concern for the object's actual class or implementation, LegionFS provides an extensible set of services which can be specialized on an application- or domain-specific basis. An object may provide a value-added service by changing the semantics associated with a method. Thus the same interface can be used to wrap different implementations. Further, an interface may be augmented to provide functionality in the form of additional methods.

A newly-minted object exporting the standard interface may be accessed by existent libraries. If functionality warrants an additional method, it may be implemented, exported by the object, and incorporated into a newly generated library. This allows multiple policies governing a
particular design issue to co-exist. A programmer builds upon lower-level functionality, such as the Legion security and communication layers, to construct objects suited for particular domains, adding them to the pool of objects already populating the system.

Legion's event-based protocol stack provides an additional opportunity for extensibility. Remote messages and exceptions are intercepted and announced to higher-level handlers. These handlers are registered according to priority and may handle an event or provide limited processing and announce the event to subsequent handlers. The Legion security layer is implemented as a layer in the protocol stack. Operations that transcend method invocations, such as an auditing facility, may be implemented as additional layers in the stack.

Providing excessive and heavy-weight functionality such as consistency and replication in all file and context objects is inappropriate as some applications neither require nor want the overhead associated with these mechanisms. Instead LegionFS provides the basic set of functionality described above and the framework to extend semantics where desired. Such functionality need be implemented only in the objects that require it, without impeding objects and applications that do not.

Interface inheritance was useful in implementing ProxyMultiObjects, TwoDFileObjects, and Simple K-Copy Classes (SKCC). TwoDFileObjects are a domain-specific implementation serving the scientific community, but are applicable to a broader audience. A TwoDFileObject implements the BasicFileObject interface such that reads and writes are striped across constituent, underlying BasicFileObjects, arranged as a two-dimensional matrix. A parallel file interface provides convenient access to applications performing matrix operations. The two-dimensional design degenerates to striping for high-performance I/O.

SKCC wrap standard classes to provide fault tolerance, by replicating an object's internal state (but not the object itself) across a number of user-specified storage elements. The state of an active class object may be synchronized across the replicas at convenient stable points of execution, such as during object deactivation. This approach provides a good measure of fault tolerance with a minimum of performance degradation.

Some environments need more full-featured replication and consistency guarantees than those provided by LegionFS. It is possible to extend ContextObjects to perform replication management: instead of a one-to-one mapping of context names to LOIDs, ContextObjects could provide a one-to-many context name-to-LOID translation. The ContextObject could perform replica selection based on availability or network connectivity constraints.

File data consistency is not addressed by basic Legion mechanisms, because no current Legion object caches file data. The initial implementation of lnfsd, which serves as an access point to LegionFS, provides NFS-like consistency semantics; it caches data for a configurable amount of time before re-validating file metadata via a stat call. There are important classes of domains where consistency guarantees are not appropriate, for example large read-only scientific datasets. For environments where consistency is necessary, it can be handled on a per-file or per-context basis at the object itself, without forcing the semantics on users accessing other data. An object could grant leases [16], which are more scalable than simple callbacks [21], or implement any other consistency scheme.

2.6 Adaptability
A wide-area file system must be adaptable to a diverse set of network, load, and system-wide conditions. LegionFS facilitates adaptation by maintaining system-wide metadata. Each object has an associated, arbitrary set of (key, value) pairs. Typical attributes for a host object include load averages, architecture, and operating system. This list could easily be extended to include other factors which might affect file placement in a wide-area environment such as network interfaces and their associated nominal bandwidths, local file systems, and disk configurations.

Attributes are available directly from the object and are also stored in a metadata repository, called the Collection. The Collection is a hierarchically distributed set of objects which is queried by schedulers to determine object characteristics and state. Objects periodically push their state information to the Collection. More sophisticated monitoring facilities such as the Network Weather Service [44] could also be employed to populate the Collection.

The Collection allows applications to track the dynamics of the system as well as capitalize on its more stable, inherent diversity. A geographically-distributed system is likely to contain a range of heterogeneity in the form of underlying file systems, storage devices, and architectures. If the characteristics of an application are well-known, the application may benefit from placement that matches these needs against the properties of particular resources. As specific examples, XFS [4] provides benefits to streaming applications by allowing them to circumvent standard kernel buffer caches, and RAID enclosures may provide more efficient availability than can be provided at higher layers in the system.

Since files and contexts are logically self-contained objects, it is more convenient to specify fine-grained policies than would be possible in a more conventional distributed file system. Objects may act on these policies asynchronously with respect to the user. LegionFS allows a user to explicitly migrate or deactivate an object. More interesting behaviors include the ability to migrate due to network conditions or replicate to accommodate increased load. A file might consider re-negotiating transfer size, changing consistency policy, or varying write-back policy in accordance with network constraints. Many of these issues were explored in the Coda file system [24].

Golding et al. [15] discuss means of exploiting idle periods in computer systems. Assuming fair load distribution, a file object is more likely to experience idleness than a centralized file server. Therefore, a file object has an opportunity to analyze its access patterns in order to prefetch. The file system literature is replete with prefetching mechanisms [8, 10, 26, 29, 35]. Often efficiency is a concern as the mechanisms must be realized within the limited latency and memory constraints of a kernel-resident file system. Being less memory-constrained, a Legion file may retain more exact data concerning access patterns and prefetch schedules. Having characterized its own usage, a file object could provide access hints [35] to the client to facilitate prefetching across the network. As a further optimization, a file object could recognize long periods of inactivity and move data to a more space-efficient, but less readily-accessible representation, file system, or storage device, as done in the HP
Figure 2: Scalability of read performance in NFS, lnfsd, and LegionFS. [Plot: bandwidth (MB/sec) versus number of readers (1 to 50), one curve each for NFS, lnfsd, and LegionFS.]
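As a rough check on the scaling plotted in Figure 2, the bandwidths reported in the evaluation (193.8 MB/s for LegionFS and 95.4 MB/s for lnfsd at 50 readers, against single-reader rates of 4.5 MB/s and 2.2 MB/s) can be converted into a per-reader share and a fraction of ideal linear scaling. The sketch below is illustrative arithmetic only; the helper names `per_reader` and `scaling_efficiency` are hypothetical and not part of Legion.

```python
def per_reader(aggregate_mb_s: float, readers: int) -> float:
    """Average share of the aggregate bandwidth seen by each reader (MB/s)."""
    return aggregate_mb_s / readers


def scaling_efficiency(aggregate_mb_s: float, readers: int,
                       single_reader_mb_s: float) -> float:
    """Fraction of ideal linear scaling.

    1.0 would mean every reader kept its single-reader rate.
    """
    return per_reader(aggregate_mb_s, readers) / single_reader_mb_s


# Bandwidths reported in the evaluation, in MB/s.
legionfs = scaling_efficiency(193.8, 50, 4.5)
lnfsd = scaling_efficiency(95.4, 50, 2.2)

print(f"LegionFS: {per_reader(193.8, 50):.2f} MB/s per reader, "
      f"efficiency {legionfs:.2f}")
print(f"lnfsd:    {per_reader(95.4, 50):.2f} MB/s per reader, "
      f"efficiency {lnfsd:.2f}")
```

Under these numbers both systems retain roughly 86 to 87 percent of their single-reader rate at 50 readers, which is what near-linear scaling looks like in practice.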
AutoRAID system [43].

3. EVALUATION
This section compares the scalable design of LegionFS to a more traditional volume-oriented approach. The gross disparity in potential parallelism between the two experimental setups is intentional, and serves to validate the move from monolithic servers as employed by NFS to the peer-to-peer architecture advocated by Legion, xFS [4], and others. Previous work [42] examined Legion wide-area I/O performance alongside the Globus [14] I/O facility and FTP, the de facto means of transferring files in a wide-area environment.

Each benchmark utilizes the Centurion cluster [28] at the University of Virginia. These experiments employ 400 MHz dual-processor Pentium II machines running Linux 2.2.14 with 256 MB of main memory and IDE local disks. These commodity components are directly connected to 100 Mbps Ethernet switches, which are in turn connected via a 1 Gbps switch. A 100 Mbps link provides the cluster with access to the vBNS. During the second experiment, remote hosts at Binghamton University and the University of Minnesota communicate with the Centurion cluster using this connection. The Sparc hosts at Binghamton University run Solaris 5.7, while the dual-processor Intel machines at the University of Minnesota run Linux 2.2.12.

The first micro-benchmark (Figure 2) is designed to show that LegionFS clients accessing independent subtrees achieve a linear increase in aggregate throughput in accordance with the linear growth of the network. lnfsd performance also scales nearly linearly. On the other hand, NFS performance scales poorly with additional clients. Each reader accesses a private 10 MB file via a series of 1 MB transfers. The experiment varies the number of simultaneous readers per run. Each reader and its associated target file are placed on separate nodes, though they share the same switch whenever possible. The experiment employs up to 100 nodes, providing the opportunity to scale the benchmark to 50 readers accessing files on 50 separate nodes. The LegionFS case utilizes the Legion BasicFile library and distributes BasicFileObjects throughout the network. These same BasicFileObjects are accessed in the lnfsd experiment by clients that are co-located with the lnfs interposition agents. The NFS experiment uses a single NFS daemon to service file system requests from 50 readers. In all cases, caching occurs only on the server side.

Single readers attained 4.5 MB/s and 2.1 MB/s under LegionFS and NFS, respectively. NFS is limited to 4K transfers over the network, whereas LegionFS can use arbitrary transfer sizes. lnfsd performs similarly with a bandwidth of 2.2 MB/s. lnfsd performance is degraded by frequent context switches and RPCs between the kernel client and lnfsd. This pure overhead is the expense of supporting legacy applications, and is avoided when using the Legion library interfaces. lnfsd attempts to mitigate the inefficiency of its user-level implementation by performing read-ahead on sequential file access, asynchronous write-behind, and file and metadata caching.

LegionFS and lnfsd each achieved peak performance at 50 readers, yielding aggregate bandwidths of 193.8 MB/s and 95.4 MB/s, respectively. NFS peak performance occurred at 2 readers, yielding aggregate bandwidth of 2.1 MB/s. NFS does not scale well with more than two readers, whereas both lnfsd and LegionFS scale linearly with the number of readers, assuming the file partitioning described above.

To put the above results in the context of a popular domain, the next benchmark examines access to a subset of the Protein Data Bank (PDB). This experiment is intended to simulate the workings of parameter space studies such as
Figure 3: Centurion clients accessing PDB data stored in ProxyMultiObject within Centurion cluster. (a) Average Client Bandwidth (b) Aggregate Bandwidth. [Plots: KB/s versus number of readers.]
Figure 4: Centurion clients accessing PDB data stored in BasicFileObjects within Centurion cluster. (a) Average Client Bandwidth (b) Aggregate Bandwidth. [Plots: KB/s versus number of readers.]
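The two metrics plotted in panels (a) and (b) of Figures 3 through 6 can be made concrete with a short sketch. Average client bandwidth is the mean of the rates each client computes from its own elapsed time, while aggregate bandwidth divides the total data moved by the wall-clock time from the start of the first job to the completion of the last. The `Job` record, the function names, and the timings below are hypothetical illustrations, not the authors' test harness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    kbytes_read: float  # total data read by the client (KB)
    start_s: float      # job start time (seconds)
    end_s: float        # job completion time (seconds)


def average_client_bandwidth(jobs: List[Job]) -> float:
    """Mean of per-client rates, each from that client's own elapsed time (KB/s)."""
    rates = [j.kbytes_read / (j.end_s - j.start_s) for j in jobs]
    return sum(rates) / len(rates)


def aggregate_bandwidth(jobs: List[Job]) -> float:
    """Total data moved over the time from first start to last completion (KB/s)."""
    total_kb = sum(j.kbytes_read for j in jobs)
    elapsed = max(j.end_s for j in jobs) - min(j.start_s for j in jobs)
    return total_kb / elapsed


# Hypothetical timings: two clients, each reading 100 files of about 171 KB.
jobs = [Job(17100.0, 0.0, 50.0), Job(17100.0, 10.0, 70.0)]
avg = average_client_bandwidth(jobs)
agg = aggregate_bandwidth(jobs)
print(f"average client bandwidth: {avg:.1f} KB/s")
print(f"aggregate bandwidth:      {agg:.1f} KB/s")
```

Because the aggregate figure is taken over the whole run, any time spent starting jobs or collecting results lowers it without affecting the per-client rates.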
Feature [3], which has been used to scan the PDB searching for calcium binding sites. Feature, and similar parameter space studies, employ coarse-grained parallelism to execute large simultaneous runs against different datasets. The Protein Data Bank is typical of large datasets in that it services many applications from various sites worldwide desiring to access it via a high-sustained data rate.

Clients read a subset of files from the PDB stored in Legion context space. To avoid excessively long runs, only the first 100 files from the PDB were accessed. These files have an average size of approximately 171 KB, with a file size standard deviation of 272 KB. Such a distribution indicates there are many small files in the database along with a few very large files. A client's execution is termed a job and consists of 100 whole-file reads. Client execution is not synchronized. Each stage of the experiment defines the number of active clients. While the number of active clients is varied from 1 to 32 between stages of the experiment, the number of jobs remains constant at 100.

The test harness iterates through the target hosts in round-robin fashion, assigning readers until the specified parallelism is reached. Upon a client's completion, an additional reader begins execution. Each client records the elapsed time to read the list of files in its entirety and calculates its bandwidth. The average of these bandwidths is reported on the left-hand side of Figures 3, 4, 5, and 6 as average client bandwidth. The test harness responsible for remotely executing the hosts records the elapsed time from instantiation of the first job to completion of the last. This aggregate bandwidth is reported on the right-hand pane of the same figures. The two metrics are intended to capture the performance of individual clients and the throughput of the system under a specified load. During the prelude and epilogue of an experiment, the test is not in a steady state and the number of active clients is below the specified value.

Files hosted on the Centurion cluster store the PDB data. Though only the first 100 files are accessed, 12,000 files are stored under a single context. This simulates accessing a relatively small subset of a large data collection. The experiments vary the placement of the clients and the file system distribution to cover local- and wide-area environments and volume-oriented and peer-to-peer designs. The local-area experiments (Figures 3 and 4) execute each of the clients on one of 32 nodes within the Centurion cluster, though
Figure 5: Remote clients accessing PDB data stored in ProxyMultiObject within Centurion cluster. (a) Average Client Bandwidth (b) Aggregate Bandwidth. [Plots: KB/s versus number of readers.]
Figure 6: Remote clients accessing PDB data stored in BasicFileObjects within Centurion cluster. (a) Average Client Bandwidth (b) Aggregate Bandwidth. [Plots: KB/s versus number of readers.]
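The ProxyMultiObject results (Figures 3 and 5) are described in terms of how quickly average client bandwidth falls as readers double. A small worked calculation makes that rate explicit using the reported endpoints: 827 KB/s falling to 42 KB/s in the local-area case and 312 KB/s falling to 35 KB/s in the wide-area case. The sketch assumes the peak corresponds to a single reader and the minimum to 32 readers, as the text reports; the function name `per_doubling_factor` is hypothetical.

```python
import math


def per_doubling_factor(peak_kb_s: float, floor_kb_s: float,
                        readers_at_floor: int) -> float:
    """Geometric-mean bandwidth ratio per doubling of readers.

    Assumes the peak is measured with one reader and the floor with
    readers_at_floor readers, so there are log2(readers_at_floor) doublings.
    """
    doublings = math.log2(readers_at_floor)
    return (floor_kb_s / peak_kb_s) ** (1.0 / doublings)


local = per_doubling_factor(827, 42, 32)  # clients inside the cluster
wide = per_doubling_factor(312, 35, 32)   # remote clients

print(f"local-area: bandwidth multiplied by about {local:.2f} per doubling")
print(f"wide-area:  bandwidth multiplied by about {wide:.2f} per doubling")
```

The local-area factor comes out near 0.55, in line with a reduction of close to half per doubling, while the wide-area factor of about 0.65 corresponds to a less steep curve.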
they are never co-located with PDB files. The wide-area experiments
(Figures 5 and 6) place jobs on a pool of 4 machines at the University of
Minnesota and 12 at Binghamton University. The relative dearth of remote
machines requires that clients be scheduled on the same host when the number
of active clients exceeds 16. While an unfortunate incongruity between the
environments, the jobs are I/O-bound and do not suffer unduly by being placed
on the same host. The experiments designed to stress the volume-oriented
design (Figures 3 and 5) host all files within a single ProxyMultiObject. The
peer-to-peer setup (Figures 4 and 6) distributes the BasicFileObjects across
32 Centurion nodes.

As expected, the ProxyMultiObject shows immediate and drastic performance
degradation with increasing load. The effect is particularly acute when the
clients execute within the cluster (Figure 3). In this case, there is a near
50% reduction in bandwidth with each doubling of the number of active
readers. Figure 5 exhibits a similar dramatic trend, though the curve is not
as steep. Given its relatively greater distance from the ProxyMultiObject, a
client's requests are less densely concentrated than when running within the
cluster. This ensures less immediate contention for the ProxyMultiObject and
results in a slightly less severe performance impact. Clients achieve peak
average bandwidth at 827 KB/s and 312 KB/s within local-area and wide-area
environments, respectively. This occurs when a client need not contend with
other readers. Average client bandwidth is minimized under each case at 32
readers, dropping to 42 KB/s and 35 KB/s for the local-area and wide-area
cases, respectively.

The ProxyMultiObject aggregate bandwidth curves bear close resemblance to one
another. Aggregate bandwidth grows steadily until a maximum is reached at 8
clients, and then flattens. The ProxyMultiObject is best utilized by a small
number of clients, but cannot continue to scale with increased load. The peak
bandwidths of 944 KB/s within the cluster and 717 KB/s for the remote clients
may seem surprisingly small in comparison to the achieved average client
bandwidths. This occurs because the aggregate bandwidth measures the total
elapsed execution time of all 100 jobs, including the time required to start
the remote jobs, transfer an input file, and reap the results. While this
additional overhead comprises a non-trivial percentage of the total job
turnaround time, it is illustrative of actual execution. The
average client bandwidths report performance once the job has begun
execution; the aggregate bandwidth is indicative of system throughput.

Distributing the load amongst the BasicFileObjects leads to more graceful
performance degradation in Figures 4 and 6. The system does not scale
linearly, however. Unlike the raw throughput experiment above, clients in
this setup access a shared portion of the PDB, rather than dedicated
per-client files. While average client bandwidth remains fairly steady with a
few additional clients in both graphs, large numbers of active clients
increase the likelihood that one or more will access the same data, leading
to contention at the BasicFileObject. At 1406 KB/s, peak client bandwidth
accessing BasicFileObjects within the cluster is significantly higher than
the corresponding ProxyMultiObject case. This suggests additional benefits of
BasicFileObjects. ProxyMultiObjects must maintain state for each constituent
object, leading to overhead when demultiplexing a request to the target.
Further, BasicFileObjects may better exploit the local file system cache
since they serve a much smaller portion of the name space and are less likely
to suffer capacity cache misses.

The increasingly large standard deviations of Figure 4 result from the
contention described above. Since all clients iterate through files in the
same order, contention is more likely at the onset of the experiment. During
the ramp-up stage, clients perform more poorly than during steady-state
execution. This may seem counterintuitive as the test has not reached its
full complement of active clients. Nevertheless, clients caravan behind one
another until adequate spacing is achieved. As a job completes, a new job
begins execution and inherits the spacing won by the finished job. This
effect is not present in the wide-area case of Figure 6, where temporal
distance between jobs is achieved by the relatively longer time required to
start remote execution.

The caravan effect is pronounced in the aggregate bandwidths of Figure 4(b),
where performance dips under the load of 32 clients. Unlike previous cases,
the slowed progress of the 32 initial clients is significant amongst 100
jobs. Aggregate bandwidth reaches its height of 3044 KB/s at 16 clients.
Unburdened by temporal proximity, the remote clients accessing the
BasicFileObjects contribute to increased aggregate bandwidth up to 32 clients
at 1938 KB/s.

4. RELATED WORK

The continued and increasing interest in wide-scale distributed computing,
driven by high-bandwidth, long-haul networks and the economies of scale of
commodity hardware, has led to the design of file systems and data access
facilities engineered specifically for such an environment [2, 6, 5, 9, 13].
Such file systems were motivated by concerns inherent in wide-area
environments, unlike their predecessors which were originally intended for
campus- or local-area networks and were retrofitted to fill expanding roles
[21, 38, 37].

Recognizing the diverse and evolving nature of wide-area environments,
researchers have followed the approach taken in LegionFS of developing
layered architectures consisting of a potentially-expansive set of services
integrated via lower-level protocols [5, 9]. SRB [5] is middleware that
provides access to data stored on heterogeneous resources residing within a
distributed system. SRB Agents contact the MCAT metadata service in order to
locate and transact with local storage facilities, such as file systems,
databases, and hierarchical storage systems.

In the context of the Globus Grid Toolkit [14], Chervenak et al. [9] posit a
framework that stresses the importance of employing standard protocols to
achieve interoperability. This work leverages previous work on Globus data
access [6], deployed internet infrastructure and protocols such as HTTP and
LDAP, and protocol extensions such as GridFTP [1]. File replication and
selection via Condor ClassAds [36] have been successfully implemented using
these mechanisms [41]. While Globus benefits from existent protocols and
internet services, it is also constrained by their mandates. To ensure
interoperability, entities must communicate using the standard protocol. A
perceived need or feature in the service may require amending that standard.
Not held captive to prescribed interfaces, Legion objects may simply export
new methods. Because internet protocols evolved independently, they do not
necessarily share commonalities along important dimensions such as naming,
authentication, and authorization. Thus features such as authorization, which
might be expected to pervade the system, must be implemented anew for each
service, either as a mapping to each specific protocol or outside the service
proper. By exposing uniform and integrated mechanisms to distributed objects,
LegionFS ensures file abstractions are secured in a consistent manner without
this burden.

WebFS [40] and Ufo [2] also provide access to internet services. WebFS is a
kernel-resident file system that provides access to the global HTTP name
space. It supports three cache coherency policies deemed appropriate for HTTP
access: last writer wins, append only, and multicast updates. Ufo employs the
UNIX tracing facility to intercept open system calls and transfers whole
files from FTP and HTTP servers.

The PUNCH Virtual File System (PVFS) [13] interposes NFS-forwarding proxies
between unmodified NFS clients and servers. PVFS allows a client executing on
a compute server to access files stored within another security domain.
During the course of a session, clients are allocated a temporary shadow
account on the compute server. Requests are directed to the proxy, co-located
with the target NFS server. The proxy maps the shadow account id of the
request to the user's corresponding id on the target host and forwards the
request to the NFS server.

File system adaptability has been addressed in Coda [24] and Odyssey [34],
which support application-transparent and application-aware adaptation,
respectively. Both adaptation strategies are designed to provide resilience
in the presence of varying network performance and collect simple information
about certain resources to aid in system monitoring.

The Hurricane File System (HFS) [25] employs building blocks to encapsulate
file system policies, such as prefetching and distribution. These building
blocks may be composed according to their interfaces to achieve per-file and
per-open file instance specialization.

While building blocks are relatively coarse-grained and focus on policies
that span the entire file system, stacking allows individual file system
calls to be interposed. Higher layers in a stacked file system may provide
additional processing or modify arguments before invoking the same operation
on the subsequent, symmetric layer. Stacking vnodes [39] create a chain of
traditional vnodes to support
interposing. Ficus [19] is a replicated file system that allows kernel- or
user-level file system modules exporting the vnode interface to be stacked.
Later work [20] abandoned the rigid vnode interface in favor of the UCLA
interface, which is formed at kernel initialization and is the union of
interfaces exported by each layer. A directory subtree constitutes a layer
and may be mounted atop another layer to form a stack.

The Spring [32] object-oriented operating system is composed of cooperating
servers running on a micro-kernel. File objects inherit from Spring
interfaces charged with handling operations such as paging, authentication,
consistency, and I/O [33]. A new file system is allocated by contacting its
corresponding creator object. This file system may be stacked on an existing
file system by means of a stackon method [23]. Subsequent work on the Solaris
MC File System [30] replaces the vnode interface with a new interface defined
in CORBA IDL.

The FiST language [45] is a high-level language for describing stackable file
systems. By providing a standard interface to mask operating system
peculiarities, FiST allows for portable file system implementations. File
systems may interpose specific operations or a set of operations and may
choose to insert code before, after, or in lieu of the operation. The FiST
description of the file system extensions is input to fistgen, a parser and
code generator, which outputs kernel C sources.

Legion's goal of acceptance amongst diverse organizations requires both that
it provide secure means of cross-domain access and that administrative
overhead be minimized. Centralized key services, such as Kerberos, have been
successfully employed by AFS [21, 38] and DFS [22], but do not meet these
requirements. The centralized key management in Kerberos becomes increasingly
difficult as the system scales. The Self-certifying File System (SFS) [31]
embeds a public key in the name of a file, making "self-certifying"
pathnames. LegionFS leverages a similar, distributed key management system.

The notion of serverless or peer-to-peer file systems was popularized by xFS
[4], and has spurred a rash of related projects [7, 11, 18]. xFS implements a
serverless architecture to provide scalable file service, and provides data
redundancy through networked disk striping to increase reliability. JetFile
[18] relies on multicast to locate files distributed throughout the network.
This location-independent naming scheme encourages data replication.
Unfortunately, multicast is problematic in wide-area environments as it
floods networks and relies on router support.

5. CONCLUSION

This paper has examined a small sample of the usage scenarios and
requirements of file access in Computational Grids as they exist today. With
this knowledge, we advocate an architecture integrated by basic, but
powerful, facilities such as location-independent naming and pervasive
authentication, authorization, and confidentiality mechanisms. A scalable,
peer-to-peer design ensures that the Grid can benefit fully from its
constituent resources, rather than be bound by the performance limitations of
centralized services. Finally, wide-area applications can exploit the
dynamics of the system through adaptation and continue to evolve with our
understanding of the Grid's potential through a framework promoting
extensibility.

Understanding that many classes of scientific applications can best utilize
the Grid without the imposition of costly functionality, LegionFS follows a
minimalist approach. However, the means of incorporating application-specific
policies is enabled by the set of mechanisms afforded by Legion. This ensures
that emerging services and applications can effortlessly utilize existent
infrastructure to form a cohesive system, without having to cobble together
and reconcile mechanisms that were not intended to work in unison. Extensions
to core services, such as ProxyMultiObjects and TwoDFileObjects, are a result
of this philosophy. We have also described as yet unimplemented opportunities
such as replication via class or context objects and consistency guarantees
that capitalize on lower-level Legion facilities.

The heterogeneity, wealth of storage, and abundance of CPU cycles in
wide-area environments suggest interesting possibilities for file systems.
The ability to schedule processes according to their I/O affinities and
leverage idle periods are two avenues for continued research. We expect
wide-area file systems to evolve into more than mere extensions of
smaller-scale distributed file systems. Rather, they may efficiently bridge
local or local-area file systems, gaining advantage from their unique
strengths.

While anticipating the future of wide-area file systems, this paper provided
a quantitative study of the current state of LegionFS. The utility of
LegionFS has been demonstrated with the Legion object-to-object protocol as
well as lnfsd, a user-level daemon designed to exploit UNIX file system calls
and provide an interface between a UNIX kernel and LegionFS. Benchmarks
showed that the scalability of LegionFS compared favorably under load to
volume-based file systems, such as NFS. Finally, LegionFS was shown to
facilitate efficient data access in an important scientific domain, the
Protein Data Bank.

6. ACKNOWLEDGMENTS

This work was partially supported by Logicon for the DoD HPCMOD PET program
(DAHC 94-96-C-0008), NSF-NGS EIA-9974968, NSF-NPACI ASC-96-10920, and a grant
from NASA-IPG. In addition, the authors would like to thank John Karpovich
and Mark Morgan for answering questions on the Legion architecture, Norm
Beekwilder for his aid in experimental design and administration, Katherine
Holcomb for her patience as we taxed the Centurion network, Anand Natrajan
for his feedback and guidance with Legion scheduling, and the entire Legion
team. The wide-area results would not have been feasible without contributed
academic resources. The authors thank Mike Lewis for the use of machines at
Binghamton University and Jon Weissman of the University of Minnesota for his
support. Finally, we thank our shepherd, Ann Chervenak, and the anonymous
referees for adding clarity to this paper's presentation.

7. REFERENCES

[1] GridFTP: FTP extensions for the Grid. Grid Forum Remote Data Access
    group, October 2000.
[2] A. D. Alexandrov, M. Ibel, K. E. Schauser, and C. J. Scheiman. Extending
    the operating system at the user level: the Ufo global file system. In
    1997 Annual Technical Conference on Unix and Advanced Computing Systems
    (USENIX '97), January 1997.
[3] R. Altman and R. Moore. Knowledge from biological data collections.
    enVision, 16(2), April 2000.
[4] T. E. Anderson, M. D. Dahlin, J. M. Neefe, D. A. Patterson, D. S.
    Roselli, and R. Y. Wang. Serverless network file systems. In Proceedings
    of the Fifteenth ACM Symposium on Operating Systems Principles, pages
    109-126, Copper Mountain, CO, December 1995. ACM Press.
[5] C. Baru, R. Moore, A. Rajasekar, and M. Wan. The SDSC storage resource
    broker. In CASCON'98, Toronto, Canada, November-December 1998.
[6] J. Bester, I. Foster, C. Kesselman, J. Tedesco, and S. Tuecke. GASS: A
    data movement and access service for wide area computing systems. In
    Proceedings of the Sixth Workshop on Input/Output in Parallel and
    Distributed Systems, pages 78-88, Atlanta, GA, May 1999. ACM Press.
[7] W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer. Feasibility of a
    serverless distributed file system deployed on an existing set of desktop
    PCs. In SIGMETRICS 2000, pages 34-43, 2000.
[8] P. Cao, E. W. Felten, A. R. Karlin, and K. Li. A study of integrated
    prefetching and caching strategies. In Proceedings of the 1995 ACM
    SIGMETRICS Conference on Measurement and Modeling of Computer Systems,
    pages 188-196, Ottawa, Ontario, Canada, 1995.
[9] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke. The
    data grid: Towards an architecture for the distributed management and
    analysis of large scientific datasets. Journal of Network and Computer
    Applications, 1999.
[10] K. M. Curewitz, P. Krishnan, and J. S. Vitter. Practical prefetching via
    data compression. In Proceedings of the ACM SIGMOD International
    Conference on Management of Data. ACM Press, 1993.
[11] P. Druschel and A. Rowstron. PAST: A large-scale, persistent
    peer-to-peer storage utility. In HotOS VIII, Schloss Elmau, Germany, May
    2001.
[12] A. Ferrari, F. Knabe, M. Humphrey, S. Chapin, and A. Grimshaw. A
    flexible security system for metacomputing environments. Technical
    report, University of Virginia, December 1998.
[13] R. J. Figueiredo, N. H. Kapadia, and J. A. B. Fortes. The PUNCH virtual
    file system: Seamless access to decentralized storage services in a
    computational grid. In Proceedings of the Tenth IEEE International
    Symposium on High Performance Distributed Computing. IEEE Computer
    Society Press, August 2001.
[14] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure
    toolkit. International Journal of Supercomputer Applications,
    11(2):115-128, 1997.
[15] R. Golding, P. Bosch, C. Staelin, T. Sullivan, and J. Wilkes. Idleness
    is not sloth. In USENIX Technical Conference, pages 201-212, January
    1995.
[16] C. Gray and D. Cheriton. Leases: An efficient fault-tolerant mechanism
    for distributed file cache consistency. In Proceedings of the Twelfth ACM
    Symposium on Operating Systems Principles, pages 202-210. ACM Press,
    December 1989.
[17] A. S. Grimshaw, A. Ferrari, F. Knabe, and M. Humphrey. Wide-area
    computing: Resource sharing on a large scale. IEEE Computer, 32(5):29-37,
    May 1999.
[18] B. Gronvall, A. Westerlund, and S. Pink. The design of a multicast-based
    distributed file system. In Proceedings of the Third Symposium on
    Operating Systems Design and Implementation, New Orleans, Louisiana,
    February 1999.
[19] R. G. Guy, J. S. Heidemann, W. Mak, J. Thomas W. Page, G. J. Popek, and
    D. Rothmeier. Implementation of the Ficus replicated file system. In
    USENIX Conference Proceedings, Berkeley, CA, June 1990. USENIX
    Association.
[20] J. S. Heidemann and G. J. Popek. File system development with stackable
    layers. ACM Transactions on Computer Systems, 12(1):58-89, February 1994.
[21] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R.
    Sidebotham, and M. West. Scale and performance in a distributed file
    system. ACM Transactions on Computer Systems, 6(1):51-81, February 1988.
[22] M. L. Kazar, B. W. Leverett, O. T. Anderson, V. Apostolides, B. A.
    Bottos, S. Chutani, C. F. Everhart, W. A. Mason, S.-T. Tu, and E. R.
    Zayas. Decorum file system architectural overview. In Proceedings of the
    1990 Summer USENIX Conference, pages 151-163, Anaheim, CA, June 1990.
    USENIX Association.
[23] Y. A. Khalidi and M. N. Nelson. Extensible file systems in Spring. In
    Proceedings of the Fourteenth ACM Symposium on Operating Systems
    Principles, Asheville, NC, December 1993. ACM Press.
[24] J. J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda
    file system. In Proceedings of the Thirteenth ACM Symposium on Operating
    Systems Principles, pages 3-25. ACM Press, February 1992.
[25] O. Krieger and M. Stumm. HFS: A performance-oriented flexible file
    system based on building-block compositions. ACM Transactions on Computer
    Systems, 15(3):286-321, August 1997.
[26] T. M. Kroeger and D. D. E. Long. The case for efficient file access
    pattern modeling. In Proceedings of the 1996 USENIX Technical Conference,
    January 1996.
[27] Y. G. Leclerc, M. Reddy, L. Iverson, and N. Bletter. TerraVision II: An
    overview. Technical report, SRI International, 2000.
[28] G. Lindahl, S. J. Chapin, N. Beekwilder, and A. Grimshaw. Experiences
    with Legion on the Centurion cluster. Technical report, University of
    Virginia, August 1998.
[29] T. M. Madhyastha and D. A. Reed. Input/output access pattern
    classification using hidden Markov models. In Proceedings of the Fifth
    Workshop on Input/Output in Parallel and Distributed Systems, pages
    57-67, San Jose, CA, November 1997.
[30] V. Matena, Y. A. Khalidi, and K. Shirriff. Solaris MC file system
    framework. Technical report, Sun Microsystems Research, 1996.
[31] D. Mazieres, M. Kaminsky, M. F. Kaashoek, and E. Witchel. Separating key
    management from file system security. In Proceedings of the Seventeenth
    ACM Symposium on Operating Systems Principles, Kiawah Island, SC,
    December 1999. ACM Press.
[32] J. G. Mitchell, J. J. Gibbons, G. Hamilton, P. B. Kessler, Y. A.
    Khalidi, P. Kougiouris, P. W. Madany, M. N. Nelson, M. L. Powell, and
    S. R. Radia. An overview of the Spring system. In CompCon Conference
    Proceedings, 1994.
[33] M. Nelson, Y. Khalidi, and P. Madany. The Spring file system. Technical
    report, Sun Microsystems Research, February 1993.
[34] B. Noble, M. Satyanarayanan, D. Narayanan, J. E. Tilton, J. Flinn, and
    K. R. Walker. Agile application-aware adaptation for mobility. In
    Proceedings of the Sixteenth ACM Symposium on Operating Systems
    Principles, St. Malo, France, October 1997. ACM Press.
[35] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka.
    Informed prefetching and caching. In Proceedings of the Fifteenth ACM
    Symposium on Operating Systems Principles, pages 79-95. ACM Press,
    December 1995.
[36] R. Raman, M. Livny, and M. Solomon. Matchmaking: Distributed resource
    management for high throughput computing. In Proceedings of the Seventh
    IEEE International Symposium on High Performance Distributed Computing.
    IEEE Computer Society Press, 1998.
[37] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and
    implementation of the Sun network filesystem. In USENIX Conference
    Proceedings, Berkeley, CA, Summer 1985. USENIX Association.
[38] M. Satyanarayanan. Scalable, secure and highly available file access in
    a distributed workstation environment. IEEE Computer, 23(5):9-22, May
    1990.
[39] G. C. Skinner and T. K. Wong. "Stacking" vnodes: A progress report. In
    USENIX Conference Proceedings, pages 61-74. USENIX Association, Summer
    1993.
[40] A. M. Vahdat, P. C. Eastham, and T. E. Anderson. WebFS: A global cache
    coherent file system. Technical report, University of California,
    Berkeley, 1996.
[41] S. Vazhkudai, S. Tuecke, and I. Foster. Replica selection in the Globus
    data grid. In Proceedings of the First IEEE/ACM International Conference
    on Cluster Computing and the Grid, pages 106-113. IEEE Computer Society
    Press, May 2001.
[42] B. S. White, A. S. Grimshaw, and A. Nguyen-Tuong. Grid-based file
    access: The Legion I/O model. In Proceedings of the Ninth IEEE
    International Symposium on High Performance Distributed Computing,
    Pittsburgh, PA, August 2000. IEEE Computer Society Press.
[43] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID
    hierarchical storage system. ACM Transactions on Computer Systems,
    14(1):108-136, February 1996.
[44] R. Wolski, N. Spring, and J. Hayes. The network weather service: A
    distributed resource performance forecasting service for metacomputing.
    Journal of Future Generation Computing Systems, 1998.
[45] E. Zadok and J. Nieh. FiST: A language for stackable file systems. In
    Proceedings of the 2000 USENIX Annual Technical Conference, San Diego,
    California, June 2000. USENIX Association.