File Access Patterns in Coda Distributed File System

      Yevgeniy Vorobeychik
Project Description
Related Work
Case Analysis
Experimental setup
   DFSTrace
   Custom Perl library
   Process
Flaws and Limitations
Future Work
DFS: Distributed File System
CMU: Carnegie Mellon University
Coda: DFS created at CMU
(File) Caching: storing replicas of files locally
Unstable files: files that are frequently updated
Peer-to-peer network: network with no central
Ousterhout, Baker, Sandhu, Zhou: last names of
File caching has long been used as a
technique to improve DFS performance
When a cached copy is updated, it has to
be written back to the server at some point
Or does it?
   What if you have a peer-to-peer network?
   What if there are many unstable files?
What if there is a “very small” set of computers that
update a file?
   Then you can avoid writing back to the server,
    reducing server load (if there is a server at all)
   Members of the “writers” group can synchronize the
    file amongst themselves
   Clients can contact a member of the “writers” group
    directly for an updated version of the file
What does “very small” mean?
   Reduction in server load should justify the amount of
    intra-group synchronization
   I make a very conservative assumption that
    “very small” = 1
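
The “very small” threshold can be illustrated with a minimal sketch. It is written in Python rather than the Perl used in this project, and the function and parameter names are hypothetical. A file qualifies for direct writer-to-client sharing when the number of distinct computers that update it is non-zero and no larger than the threshold.

```python
def can_skip_server_writeback(writers, max_group_size=1):
    """Return True when a file's "writers" group is small enough.

    writers        -- iterable of computer names that updated the file
    max_group_size -- the "very small" bound (1 under the conservative
                      assumption made here); hypothetical parameter name
    """
    group = set(writers)  # count each computer only once
    return 0 < len(group) <= max_group_size


# One writer qualifies; several writers or no writers do not.
print(can_skip_server_writeback(["hostA", "hostA"]))
print(can_skip_server_writeback(["hostA", "hostB"]))
```

Raising max_group_size models a looser definition of “very small”, at the cost of more intra-group synchronization.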
          Project Description
In this project I tried to determine the access
patterns that can be observed in the Coda
Distributed File System
   Used Coda traces collected continuously for
    over 2 years at CMU
   Collected information on “create”, “read”, and
    “write” system calls
   Created several access summary files
    (discussed later)
               Related Work
Ousterhout et al. (1985)
   Analyzed UNIX 4.2 BSD File System to determine file
    access patterns and effects of memory caching
Baker et al. (1991)
   Analyzed user-level access patterns in Sprite
Sandhu, Zhou (1992)
   Noted that there is a high level of sharing of unstable
    files in a corporate environment
   However, there tends to be one cluster that writes to a
    file and many that read it
   Introduced FROLIC system for cluster-based file
What About Access Patterns?
    A case analysis of file access:
      CASE I: “No Creators” – file was created outside of the trace set
      CASE II: “1 Creator” – file was created by one computer and never
       deleted and recreated by another
      a)   created, but never updated
      b)   updated by only one computer
           Was that computer the creator?
      c)   updated by multiple computers
           Was one of those computers the creator?
      d)   created, but never read
      e)   read by only one computer
           Was that computer the creator?
      f)   read by multiple computers
           Was one of those computers the creator?
         Case Analysis (cont’d)
   CASE III: “Many Creators” – file was recreated by multiple computers
   CASE IV: “No Writers” – file was never updated
   CASE V: “1 Writer” – file was updated by only 1 computer
    a)   File was written to but never read
    b)   File was read by only one computer
         Was the reader also the writer?
    c)   File was read by many computers
         Was the writer one of the readers?
   CASE VI: “Many Writers” – file was updated by many computers
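
The case analysis can be sketched as a small classifier. This is an illustrative Python sketch, not the project's Perl code; the function name, the dictionary keys, and the representation of each group as a set of computer names are all assumptions.

```python
def classify(creators, writers, readers):
    """Label one file using the cases above.

    Each argument is a set of computer names observed in the traces.
    """
    def label(group, noun):
        # "No ...", "1 ..." or "Many ..." depending on group size
        if not group:
            return f"No {noun}s"
        if len(group) == 1:
            return f"1 {noun}"
        return f"Many {noun}s"

    return {
        "creators": label(creators, "Creator"),
        "writers": label(writers, "Writer"),
        "readers": label(readers, "Reader"),
        # Follow-up questions asked within each case:
        "creator_writes": bool(creators & writers),
        "creator_reads": bool(creators & readers),
        "writer_reads": bool(writers & readers),
    }


result = classify({"hostA"}, set(), {"hostA", "hostB"})
print(result["creators"], "/", result["writers"], "/", result["readers"])
```

Using set intersection answers “was that computer the creator?” and “was one of those computers the creator?” with the same test.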
          Experimental Setup
DFSTrace
   Library and related programs for analyzing Coda traces
Custom Perl Library
   Wrote a small (4 classes) library in Perl for analyzing
    ASCII Coda Traces generated by DFSTrace
   Generated summary files of only creates, reads, and
    writes for each computer from the original trace files
   Used the summary files to tally the access patterns
    for each file
                   DFSTrace
Library for writing, reading, and
manipulating Coda traces
I used it to convert traces to ASCII for
further manipulation with Perl scripts
                   Perl Library
4 Classes
   Tracefile class
       Reads the trace file and outputs the create, read, and write system
       calls and the files they affect
       Stores the output in a <computername>.sum.txt file, since each trace
       file contains information gathered from a single computer
   TracefileSet class
       Uses the Tracefile class and collects information for all the trace
       files on CD or on the web (as specified by a switch)
   File class
       Maintains and manipulates information about a single file
       accessed within the traces
   ComputerSet class
     - Uses the File class to maintain information for all files accessed
       within the traces
     - Writes the access summary information into the “accesstally.txt” file
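
The tallying step performed by the File and ComputerSet classes can be sketched as follows. This is a Python stand-in for the Perl library, and the record layout (computer, operation, path) is a hypothetical simplification; the actual format of the .sum.txt files is not shown here.

```python
from collections import defaultdict

TRACKED = ("create", "read", "write")  # the three calls kept in the summaries


def tally_access(records):
    """Group computers by file and by operation.

    records -- iterable of (computer, operation, path) tuples; this layout
               is an assumption made for illustration
    Returns {path: {"create": set, "read": set, "write": set}}.
    """
    files = defaultdict(lambda: {op: set() for op in TRACKED})
    for computer, operation, path in records:
        if operation in TRACKED:  # ignore every other system call
            files[path][operation].add(computer)
    return dict(files)


sample = [
    ("hostA", "create", "/coda/project/notes"),
    ("hostA", "read", "/coda/project/notes"),
    ("hostB", "read", "/coda/project/notes"),
    ("hostB", "stat", "/coda/project/other"),
]
tally = tally_access(sample)
print(sorted(tally["/coda/project/notes"]["read"]))
```

The per-file sets produced here are what a case classifier needs: the sizes of the creator, writer, and reader groups and their overlaps.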
       Perl Library (cont’d)
2 scripts that use the above classes
   One uses the TracefileSet class to
    read and summarize all the trace files on a
    CD or the web
   The other uses the ComputerSet class to
    read and summarize information for all the
    traced files
                “No Creators”   “1 Creator”                   “Many Creators”
Files           23507
No writers                      6593
1 writer                        0
Many writers                    0
No readers                      2710
1 reader                        3871 = creator; 2 ≠ creator
Many readers                    10, all include creator

                “No Writers”    “1 Writer”                    “Many Writers”
Files           29987                                         3
No readers                      122
1 reader                        13 ≠ writer
Many readers                    1, does not include writer
Total: 30126
136 files are updated by only one computer vs. only 3 files that
are updated by more than one computer
    Thus, even the conservative assumption of “very small” = 1
     encompasses 136 of 139 files that were updated
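
The arithmetic behind these figures can be checked with a few lines of Python, using only the counts reported above (the variable names are illustrative):

```python
# Counts for files with exactly one writer, split by number of readers
one_writer = {"no_readers": 122, "one_reader": 13, "many_readers": 1}
many_writers = 3
no_writers = 29987

one_writer_total = sum(one_writer.values())   # 122 + 13 + 1
updated = one_writer_total + many_writers     # all files that were updated
share = one_writer_total / updated
total = no_writers + updated                  # all files in the tally

print(f"{one_writer_total} of {updated} updated files "
      f"({share:.1%}) have a single writer")
# prints: 136 of 139 updated files (97.8%) have a single writer
print(f"total files: {total}")
# prints: total files: 30126
```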
There are very few unstable files
    Vast majority of the files are accessed only to be read, as found in
     earlier studies
It’s very likely that a file will be read by the same computer that
created it
In most of the instances when a file has one writer or one
creator, it is read by only one computer
    The reader group for unstable files tends to be small
It’s likely that a file will be read by a different computer from the
one that updated it
    Thus, there seems to be a separation between computers that update
     files and computers that only read them
Do the results make sense?
   It makes sense that a computer that created a
    file will subsequently read it
   It seems counterintuitive that a computer that
    updated the file will not be the one reading it
    in the future
     - such a scenario is possible in a project oriented
     - indeed, this is similar to the observation made by
       Sandhu and Zhou that there is typically one cluster
       that updates a file, while other clusters read it
Since the “writers” group is “very small” for most
files, this group can be contacted directly by
other clients, avoiding server write-back
It makes a lot of sense for a computer that
creates a file to cache a copy of it
Since unstable files tend to have small “readers”
groups, a DFS may maintain a list of “readers”
as well as “writers” to optimize file sharing
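
A minimal sketch of such a “readers” and “writers” list is shown below. It is a hypothetical Python illustration, not part of Coda or of this project; the class and method names are invented for the example.

```python
class SharingDirectory:
    """Tracks, per file, which computers write it and which read it."""

    def __init__(self):
        self.writers = {}  # path -> set of computers that updated the file
        self.readers = {}  # path -> set of computers that read the file

    def record_write(self, path, computer):
        self.writers.setdefault(path, set()).add(computer)

    def record_read(self, path, computer):
        self.readers.setdefault(path, set()).add(computer)

    def update_source(self, path):
        """Return the single writer to contact directly, or None when the
        "writers" group is empty or larger than one."""
        group = self.writers.get(path, set())
        return next(iter(group)) if len(group) == 1 else None

    def readers_to_notify(self, path):
        """Readers other than the writers, e.g. for pushing updates."""
        return self.readers.get(path, set()) - self.writers.get(path, set())


directory = SharingDirectory()
directory.record_write("/coda/report", "hostA")
directory.record_read("/coda/report", "hostB")
print(directory.update_source("/coda/report"))
```

Returning None for larger groups leaves room for a fallback such as ordinary server write-back.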
     Flaws and Limitations
Traces were collected only at CMU and
only for Coda
Only 5 of 38 CDs of data were analyzed,
leaving a lot of questions unanswered
Very little data is analyzed in detail: there
is no further analysis on the “No Creators”
and “No Writers” cases, into which most of
the data falls
              Future Work
This follows directly from the “Flaws and
Limitations” section
   Analyze the rest of the Coda trace data
   Analyze other available trace data (Sprite,
   Analyze in more detail the “No Creators” and
    “No Writers” cases