					         PersiFS: A Continuously Versioned Network File System
          Austin T. Clements, Dan R. K. Ports, Ben A. Schmeckpeper, Hector Yuen
                                {aclements, drkp, bschmeck, hyz}

                                                 May 12, 2005

Abstract

Most file systems are ephemeral, meaning that once a change has been made, there is no way to recall the previous contents of the file system. Backups, version control systems, and user interface improvements such as “trash cans” attempt to alleviate this problem; however, these are all rough approximations of persistent file system structures, giving users restricted access to a restricted set of past states of the file system. PersiFS is a fully persistent file system, providing access to any past state of the entire file system. PersiFS achieves full persistence without sacrificing access time to either current versions or past versions, using inordinate amounts of disk space, or requiring modification to existing applications.

1     Introduction

A version-controlled file system lets the user access not just the current state of the file system but previous states as well. For example, backups are one typical, restricted form of a version-controlled file system, allowing high-latency access to a highly restricted set of past copies of a file system.

PersiFS is a continuously versioned network file system. A continuously version-controlled file system allows access to the complete state of the file system at any point in the past. Incorporating continuous versioning into the file system means that a file can never be lost and mistaken modifications can always be undone. In PersiFS, any change to a file forces the file system to archive a copy of that file, indexed by a timestamp. PersiFS’s timestamp interface provides users with a natural way to express accesses to past versions of files.

1.1   Motivation

Users frequently wish to undo changes they have made, whether intentionally or inadvertently, to the file system: for example, to recover from inadvertent deletions, to restore files corrupted by application bugs, or simply to revert to an earlier revision of a document. Regular file systems do not support this operation natively, which results in a number of tools that address parts of this problem in different ways. Some operating systems provide a “trash can” interface to help users avoid mistaken deletions of files. However, this is only a partial solution because it only addresses the problem of deletions, and even then only until the trash can is emptied and the files are permanently removed. The proliferation of “undelete” tools for administrators suggests that this solution is inadequate. PersiFS makes these tools unnecessary, since files are never truly deleted from the file system. Users need no longer fear accidental overwrites, renames, or deletions.

Versioning is a natural and desirable aspect of file systems. Often, critical files are versioned through the use of version control systems. However, these must be explicitly configured, maintained, and interacted with. Providing such support at the file system level gives the ability to easily and automatically track changes to any file over time. To eliminate the need for user interaction and the possibility of user error, systems exist which automatically create periodic snapshots (either of just critical system files, or of entire file systems). These, however, may fail to capture multiple changes made between snapshots and can create inconsistent snapshots at inopportune times. By capturing every revision, PersiFS does not suffer from these problems.

Because PersiFS is a network-accessible file system, it is ideal for multi-user systems. Such systems typically use some form of snapshotting to provide all users of the system with security for their files and to avoid administrative nightmares with recovering from user errors. By using a continuously versioned file system like PersiFS, system administrators can provide users with an easy way to recover from a much wider range of problems.

1.2   Challenges

The primary challenges to such a system lie in not only providing access to past versions, but doing so reasonably fast; efficiently utilizing disk space; providing access and modification to the current version at speeds comparable to non-versioned file systems; and achieving all of this without requiring modifications to existing applications.

PersiFS provides durable storage by storing all data on disk on a central server. For reliability and ease of implementation, the PersiFS data structures are stored on a normal file system, though with minor modifications it could operate on a physical disk as well. Because the server must be able to retrieve any past version of any file from durable storage, PersiFS introduces many optimizations in order to achieve an efficient durable representation. As further discussed in Section 3.2, it uses a compact metadata log to track changes over time and an underlying content store that efficiently fuses common data between files and versions. These structures are designed to optimize access to the current version so that it is nearly indistinguishable from access to a regular file system. Access time to past versions is less critical than to the current version; accordingly, the PersiFS structures provide reasonably fast recall of past versions, but do not optimize for it.

PersiFS exposes a novel file system interface that allows users and applications to access the file system and its past states without any modifications to applications or the need for any special libraries. For improved interoperability, PersiFS exposes this file system interface over an unmodified NFS protocol by providing a special root directory containing automatically generated subdirectories for every point in time. Section 3.1 discusses this interface further. While the NFS interface allows unmodified applications and operating systems to interface with PersiFS, it introduces further complications to its design in order to accommodate the quirks of NFS. Primarily, the statelessness of certain aspects of the NFS protocol directly conflicts with the persistence goals of PersiFS, as further discussed in Section 4.2.

2     Related Work

Many other systems provide some form of persistent versioned storage. PersiFS differs from these systems in that it provides access to all previous versions of the file system’s state via a standard file system interface.

2.1   Version Control Systems

Version control systems such as CVS [1] are the standard mechanism for tracking revision histories in large projects. While PersiFS provides similar revision history operations, it does not provide the higher-level semantics for synchronizing the work of multiple users, such as file locking as in RCS [10] or merging as in CVS. PersiFS provides only revision history support; it does not provide the higher-level functionality because the appropriate semantics are dependent on the type of file and its mode of use (e.g. text vs. binary files), and so should not be applied at the file system level.

CVS organizes revision histories by assigning a version number to each revision of a particular file. This is a natural interface for the history of a particular file, but does not generalize well to tracking the history of the entire file system. Many of the weaknesses of CVS as a version control system are due to this interface: it does not capture the notion of changes that occur simultaneously to multiple files (changesets), and it does not effectively handle changes to the directory structure, such as moving or renaming a file. The latter is a major problem for a file system, since the directory structure can be highly dynamic.

Subversion [2] addresses many of the shortcomings of CVS by using a single global revision number that identifies a particular state of the entire repository. This is similar to PersiFS’s internal representation; however, the PersiFS user interface uses date/timestamps instead because global revision numbers do not generally correspond to a useful identifier from the user’s perspective.

2.2   Snapshots vs. Continuous Versioning

Previous states of the file system are commonly stored by a backup system that periodically archives the state of the file system. Snapshot-based version-controlled file systems are the natural extension of this idea: they periodically take a snapshot of the state of the file system, and make it readily available. The Plan 9 file server [6], for example, creates daily snapshots of the file system and stores them on a write-once mass-storage system such as a WORM jukebox or the Venti archival server [7]. WAFL [3] provides similar snapshotting functionality using copy-on-write disk blocks. AFS provides access to the most recent snapshot through the OldFiles mechanism.

While a great improvement over a standard ephemeral file system, snapshotting file systems have the obvious disadvantage that they cannot track changes that occur between snapshots. PersiFS is a continuously version-controlled file system: each change to the file system is stored as a new revision. Elephant [8] and CVFS [9] also use this technique. However, PersiFS provides a much more convenient interface for users.

2.3   Interfaces

Some versioned file systems require special tools to access previous versions of the file system. For example, CVFS is designed for performing post-intrusion forensic analysis, so it does not provide an interface for users to easily recover old versions of their files. A convenient way to provide access to old versions is via a file system interface. Plan 9 creates a directory hierarchy of snapshots; Pike et al. [6] report that providing access to snapshots via this interface is very convenient for users.

Ideally, it should be possible to interact with the versioned file system exclusively through the file system interface, such that unmodified standard UNIX utilities (ls, cp, etc.) suffice for revision control. This is not possible in Elephant, which adds new system calls. VersionFS [4] allows unmodified utilities to be used, but only via a library wrapper and the LD_PRELOAD mechanism. PersiFS uses a purely file-system-based interface, using an automounter to allow standard utilities to be used without any modification.

3     Design

3.1   User Interface

PersiFS provides access to both current and previous versions of the file system state via a standard file system interface. Volumes are accessed across the network using an automounter interface: PersiFS is mounted over NFS, say as /persifs, and the current version of the file system is exposed as /persifs/now.

The automounter provides access to previous versions of the file system using timestamps as names, such as /persifs/2005-04-13-12-00-00. This gives
a read-only snapshot of the file system as it appeared at noon on April 13th, 2005.

The user interface also includes a number of small tools that improve the usability of PersiFS. For example,

  • persifs now
    Prints the archive path to the current directory as of the current time.

  • persifs info
    Prints statistics about the server.

  • persifs log file
    Displays a log of the modification times for file.

3.2   File System Structure

A trivial representation for a file system that achieves persistence is a log of every operation ever performed on the file system. In order to answer a query for a past point in time, the server can replay the log to this point in time and answer the query from that snapshot of the file system.

However, this fails to achieve the goals of PersiFS: reading from any version requires vast amounts of time in order to replay the log (though writing is very fast!), and space utilization will be poor because many applications rewrite the full contents of a file to disk even when only a small portion has been changed.

3.2.1 Read Optimization

To optimize this, we separate file contents from metadata. File contents are stored in a separate block store called the superblob, while metadata changes (including pointers to the file contents in the superblob) are stored in an inode log. Because the inode log stores only metadata changes, it is very compact. Furthermore, the metadata is small enough compared to the file contents that it becomes reasonable to store snapshots of the entire state of the file system metadata in the inode log at periodic intervals, allowing a fast replay to be performed by starting from the last snapshot before the desired time.

Like traditional UNIX file systems, PersiFS uses inodes identified with unique inumbers to store file and directory metadata. Unlike most file systems, however, inumbers are never recycled because inodes are never completely removed. An NFS file handle is thus simply an inumber, plus a timestamp if it does not reference the current version; no generation numbers are required. Also unlike most file systems, a PersiFS inumber is simply a unique identifier for a file, and has no correlation to physical disk addresses.

In order to further compact the inode log, the actual contents of the inodes are also stored in the superblob, and inode log entries contain only pointers into the superblob. This allows more inode log entries to be stored in each disk block, thereby dramatically increasing the rate at which a specific entry can be located during a replay. Furthermore, it is quite practical to keep the mapping of inumbers to superblob addresses for the current version of the file system in memory, allowing operations on the metadata of the current file system to be performed without replaying the log.

Together, the capabilities of the inode log allow PersiFS to nearly achieve its speed goals. Reads from the current version of the file system need only read the file contents from the superblob, and writes to the current version of the file system need only append to the inode log and insert the file contents into the superblob. Answering reads for past versions of the file system only requires replaying a small amount of log from a past snapshot.

3.2.2 Write Optimization

The superblob can be optimized in order to improve the write performance of PersiFS and achieve its space efficiency goals. First, instead of storing entire files as blocks in the superblob, files are divided into chunks which are stored in the superblob. Thus, a modification to some
part of a file need only rewrite the affected chunks. Because files in this scheme cannot be addressed by a single location in the superblob, inodes must store the sequence of chunk locations for a file.

In a further refinement, chunk boundaries are not placed at regular intervals, but are instead chosen by content-sensitive chunking. This technique places chunk boundaries based on file contents such that local modifications to file contents, including insertions and deletions, only affect chunks in that region of the file. We use the same Rabin fingerprint-based algorithm as LBFS [5] for content-sensitive chunking.

Chunking allows PersiFS to achieve its speed goals because it has only minor effects on read performance, while greatly improving write performance by avoiding the need to manipulate file contents that have not been modified. Chunking also improves PersiFS’s space efficiency by avoiding rewriting some redundant data.

3.2.3 Space Optimization

To further optimize space efficiency, the superblob leverages chunk fusion. Every chunk has a fingerprint, which is simply the 160-bit SHA-1 hash of its contents. The blob index maps chunk fingerprints to the locations of those chunks in the superblob. Whenever a chunk is added to the superblob, the blob index is first checked for that chunk’s fingerprint. If the fingerprint is found, the chunk being inserted can be fused with the existing chunk in the superblob, requiring no additional space. Otherwise, it is appended to the superblob and indexed. This is similar to the approach employed by Venti [7] for archival block storage. While the blob index could be recomputed by reading the entire superblob, this would lead to prohibitively long recovery, so the blob index is maintained as an on-disk B+-tree.

Combined with content-sensitive chunking, chunk fusion allows PersiFS to efficiently store local modifications to files, sharing most file contents between versions of that file. Furthermore, it also ensures that identical content in multiple files (for instance, due to file copying or automatic backup copies) will consume very little space beyond the single copy.

4     Implementation

The main aspects of the implementation of PersiFS are described in this section. A logical subdivision of the modules involved in PersiFS is shown in Figure 1. In particular, the lowest levels of the implementation are modules corresponding to the information structures represented on disk: the inode log, superblob, and blob index. Layered atop these are file and directory modules, which provide an abstraction based on the standard file system concepts. Finally, interaction with users is handled by an NFS interface, including the automounter.

Figure 1: Implementation model. Module layers, from top to bottom: Network; NFS; Directory and File; Inode Log and Chunkable; Superblob; Blob Index.

4.1   Representation Management

Each of the three information structures is stored on disk, and has a corresponding module that provides an interface to the higher levels of the PersiFS implementation. For ease of implementation, our initial implementation stores each structure as a separate file on a standard file system, though it would be reasonably straightforward to adapt it for direct storage on a physical disk.

The Superblob module implements content-addressable storage, providing a standard get/put-block interface that uses the hash fingerprint of the block to identify it. It makes use of the Blob Index module, which uses an on-disk B+-tree to map block fingerprints to indexes into the superblob structure on disk.

The Inode Log module stores inumber mappings in a time-indexed log. Each mapping contains the inode number and the address of the inode in the superblob. The inode log interface allows creation, modification, and deletion of inodes to be recorded in the log, and supports scanning the log to obtain the inode map at any point in the past as well as its current state. This module also periodically records snapshots of the entire inode map in the log for rapid replay from any past point.

4.2   File System Abstraction

To simplify the implementation, a file system abstraction consisting of file and directory representations is built atop the persistent data structures. Using these abstractions also makes it feasible to experiment with different on-disk representations of the file system data structures.

The File module bundles together a file’s inode metadata and its content, stored as chunks. The File module API defines meaningful file operations, such as create, read, write, getAttr, and setAttr.

The contents of a file are represented by a mutable chunkable string backed by the superblob and aware of the LBFS-style content-sensitive chunking. The Chunkable module provides a substring-read and substring-write interface. As file contents are large, the contents of the string are not read from the superblob unless necessary. Multiple batched changes to the chunkable string can be made without being committed to disk; when a flush operation is performed, the chunk boundaries are recalculated and new chunks are added to the superblob as necessary. The chunkable string can be converted to a marshalled representation, suitable for storing in inodes, that lists all of the chunks contained in a file and their offsets in the superblob.

The Directory module provides the standard directory operations of accessing or modifying the list of files in the directory. Each directory is stored as a file whose content happens to be directory entries, distinguished from other files through special flags.

Ideally, multiple related changes to a file would be aggregated and committed as a whole only when the file is closed, in order to avoid inconsistent states. Often one write from the user is in fact broken into a series of writes, and forcing the file system to create distinct versions for each “sub-write” would be undesirable. It is both inefficient, since it creates many unnecessary versions that will not need to be accessed, and logically incorrect, since a read could conceivably access an inconsistent state created by a partially performed change. Unfortunately, NFS v3 does not provide file closure notification to the server, so the desired commit-on-close semantics are impossible to achieve. Instead, the File module attempts to approximate these semantics by grouping together successive writes to a single file that occur within a span of five seconds.

4.3   Network Interface

Clients access files on the PersiFS server using standard NFS. The PersiFS NFS layer exposes a magic root directory that contains a now directory reflecting the current file system state, as well as directories created on-the-fly that expose any past version of the file system. The NFS layer translates requests for current or past versions of the file system into operations on the underlying file and directory structures, which in turn operate on the underlying disk representation.
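
As an illustration of how names in the magic root directory could be interpreted, the following is a minimal sketch, not the PersiFS implementation. It assumes the timestamp layout shown in Section 3.1 (for example 2005-04-13-12-00-00); the function name resolve_root_entry, the constants, and the choice of None to denote the live tree are hypothetical.

```python
from datetime import datetime

NOW = "now"                          # name of the live, writable tree
STAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"   # layout of names such as 2005-04-13-12-00-00


def resolve_root_entry(name):
    """Map a name in the root directory to a point in time.

    Returns None for the current version ("now") and a datetime for a
    read-only historical view. Any other name does not exist, so a
    FileNotFoundError is raised (hypothetical error convention).
    """
    if name == NOW:
        return None
    try:
        return datetime.strptime(name, STAMP_FORMAT)
    except ValueError:
        raise FileNotFoundError(name) from None
```

In such a design, a lookup of /persifs/2005-04-13-12-00-00 would resolve to a point in time that is then carried along with every inumber beneath that directory, matching the file handle layout of Section 3.2.1.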

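The chunk fusion behaviour of the Superblob and Blob Index modules (Sections 3.2.3 and 4.1) can be sketched in the same spirit. This is an assumption-laden illustration only: the class ChunkStore and its put/get methods are hypothetical names, the blob index is an in-memory dictionary rather than the on-disk B+-tree PersiFS uses, and the chunks are supplied by the caller instead of being produced by Rabin fingerprint-based chunking.

```python
import hashlib


class ChunkStore:
    """Append-only superblob with a fingerprint index (hypothetical API)."""

    def __init__(self):
        self.superblob = bytearray()  # stands in for the on-disk superblob
        self.index = {}               # SHA-1 digest -> (offset, length)

    def put(self, chunk):
        """Store a chunk and return its location, fusing duplicates."""
        fingerprint = hashlib.sha1(chunk).digest()  # 160-bit fingerprint
        location = self.index.get(fingerprint)
        if location is None:
            # Unseen content: append it to the superblob and index it.
            location = (len(self.superblob), len(chunk))
            self.superblob += chunk
            self.index[fingerprint] = location
        return location

    def get(self, location):
        """Read a chunk back from its (offset, length) location."""
        offset, length = location
        return bytes(self.superblob[offset:offset + length])


store = ChunkStore()
v1 = [b"alpha" * 100, b"beta" * 100, b"gamma" * 100]
v2 = [b"alpha" * 100, b"BETA" * 100, b"gamma" * 100]  # only the middle chunk differs

inode_v1 = [store.put(c) for c in v1]  # chunk locations, as an inode would list them
size_after_v1 = len(store.superblob)
inode_v2 = [store.put(c) for c in v2]
```

Here the second version adds only the one modified chunk to the superblob; the two unchanged chunks resolve to the locations already recorded for the first version.
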
5     Evaluation

Upon implementing PersiFS, we found that it achieves performance on operations in the current version of the file system comparable to that of other NFS-based file systems. Snapshots were sufficiently compact (80K for 10,000 files) that the occasional blocking write operation was barely noticeable.

Navigating past versions of the file system was slower than expected, though still usable. Reading a past version of the file system containing 10,000 files required reading approximately 80K from the inode log (the cost was dominated by reading the previous snapshot), in addition to the time required to read the file data from the superblob (which was no more expensive than reading file data for the current version).

While examining why read performance did not reach expectations, we found that standard UNIX tools typically made more queries than expected or repeated queries. ls, for example, performed multiple NFS calls for each file in the listed directory in order to obtain metadata. We suspect that better caching of historic data (in particular, metadata) would greatly improve the performance of PersiFS.

Chunk fusion proved quite advantageous for some operations, while introducing no noticeable overhead in general. We experimented with editing a 160K text file. After four revisions, the file system had grown by merely 30K. Reducing the average chunk size (currently 8K) would have improved this further, though it would have increased the time required for both read and write operations. Chunk fusion also abated the effects of programs which implement their own backup. For example, backup copies created by Emacs required less than a kilobyte of additional file system space, regardless of the size of the original file.

6     Future Work

In the future, we would like to provide improved administrative control over the file system. Primarily, we would like to implement user-controlled retention policies, allowing old revisions to be deleted or merged with other revisions, selectively reducing the granularity of time in order to conserve space. This capability would eliminate the primary space drawback to using a continuously versioned file system such as PersiFS. How to effectively implement retention policies with the existing file system structures is unclear, so new structures may be necessary.

We would also like to improve the user interface of PersiFS, perhaps adding support for some of the version management features typical in version control systems. One such example is tagging, in which a user is able to associate a symbol with a particular version of a file for easy reference later. It would also be advantageous to support a wider range of date specifications, including relative ones, somewhat like CVS’s date-specs.

We are currently developing PersiFS 2, a re-implementation of PersiFS that applies theoretical results from the field of data structures to the underlying file system structures, primarily building on work with persistent, external memory structures. From this, we hope to gain insight into the applicability of these advanced data structures in solving real problems and to examine the trade-offs between the increased implementation complexity of these structures and the realization of their theoretical promise.

7     Conclusions

PersiFS successfully overcomes the challenges described at the outset of this paper by achieving both time and space efficiency and transparently providing persistent file system services that can be utilized without the need to modify applications. Time and space efficiency are achieved with the separation of the log from the bulk
data, content-sensitive chunking, and chunk fusion. PersiFS behaves like a regular NFS server, providing access to different versions of the file system through a naming convention that otherwise retains regular UNIX semantics and thus does not require any changes to existing applications to work.

PersiFS has the potential to give users peace of mind and ease of use by eliminating the need to worry about the integrity of their files. For administrators of multi-user systems, most user errors can be trivially dealt with by simply going back to a point in time before the mistake.

Continuously versioned file systems like PersiFS have the potential to change the way users view and interact with their files. By adding a time axis to the file system and giving it a very tangible archeology, such systems give users an extra dimension of expressive power over their manipulations of the file system.

References

 [1] P. Cederqvist, editor. Version Management with CVS. Free Software Foundation, 2005. Available at https://www.cvshome.

 [2] B. Collins-Sussman, B. W. Fitzpatrick, and C. M. Pilato. Version Control with Subversion. O’Reilly Media, 2004. Available at

 [3] D. Hitz, J. Lau, and M. Malcolm. File system design for an NFS file server appliance. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 235–246, San Francisco, CA, USA, 17–21 1994.

 [4] K.-K. Muniswamy-Reddy. VERSIONFS: A versatile and user-oriented versioning file system. Master’s thesis, December 2003.

 [5] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In Symposium on Operating Systems Principles, pages 174–187, 2001.

 [6] R. Pike, D. Presotto, S. Dorward, B. Flandrena, K. Thompson, H. Trickey, and P. Winterbottom. Plan 9 from Bell Labs. Computing Systems, 8(3):221–254, Summer 1995.

 [7] S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002.

 [8] D. S. Santry, M. J. Feeley, N. C. Hutchinson, A. C. Veitch, R. W. Carton, and J. Ofir. Deciding when to forget in the Elephant file system. In Symposium on Operating Systems Principles, pages 110–123, 1999.

 [9] C. Soules, G. Goodson, J. Strunk, and G. Ganger. Metadata efficiency in versioning file systems, 2003.

[10] W. F. Tichy. RCS: a system for version control. Software: Practice and Experience, 15(7):637–654, 1985.

