Filesystem Optimizations for
Static Content Multimedia Servers
    Review of academic papers for TDC573
    Jeff Absher
Papers Reviewed

   Implementation and Evaluation of EXT3NS
    Multimedia File System
       Baik-Song Ahn, Sung-Hoon Sohn, Chei-Yol Kim, Gyu-Il
        Cha, Yun-Cheol Baek, Sung-In Jung, Myung-Joon Kim.
       Presented at the 12th Annual ACM International Conference on
        Multimedia, October 10–16, 2004, New York, New York.
   The Tiger Shark File System
       Roger L. Haskin, Frank B. Schmuck
       IBM Journal of Research and Development, 1998
What is the problem?
   MM Server with relatively static content
     Prerecorded movies
     Audio
     Lectures
     Commercials
   End user can start/stop/pause/seek
   Many different simultaneous users.
   Massive transfer of data from disks to NIC.
   Can safely avoid focusing on real-time writes.
     It is a server.
     Assume data is collected non-real-time.
     Note: both systems could be easily extended within
      their scope to handle RT
   Should be backward compatible for legacy
Scope Limitations and Design Goals

   Limitations
       Single Server or Cluster with single shared set of
        disks. No distributed nodes.
           There is research in the slightly different areas of
            distributed Filesystems, P2P filesystems, and
       Single local Filesystem, may consist of an array of
        multiple disks.
   Design Goals in order of importance
       Pump as much data as you can from the disks to
        the NICs.
           This can be done by avoiding kernel memcopys
       Seeking
       Quick Recoverability for very large filesystems
           Journaling
       Legacy Compatibility
Problems with “old” filesystem block
transfer to NIC in the network-server
context? (simplified)
   Multiple Memcpy() calls across user/kernel mode.
   Disk blocks optimized for small files.
   Many context switches.
   The kernel must be involved in both reading from the disk and
    writing to the NIC.
   Bus contention with other IO.
   Block Cache is in main memory, may not be fast enough from a
    hardware perspective.
   The data may be slow to “bubble down” the Networking layer due to
    redirectors, routing, etc.
   Checksum calculations and such for networking happen in software.
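A minimal sketch of the copy problem listed above, assuming a POSIX-like system. The first function is the "old" path: every chunk is copied from the kernel into a user buffer and then back into the kernel for the socket. The second uses Python's socket.sendfile(), which hands the transfer to the kernel's sendfile mechanism where the platform supports it. The function names and the socketpair demo are illustrative only and are not taken from either paper.

```python
import os
import socket
import tempfile

def send_with_copies(path, sock, chunk=64 * 1024):
    """Classic path: read() copies kernel -> user, sendall() copies user -> kernel."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)      # kernel -> user copy
            if not buf:
                break
            sock.sendall(buf)        # user -> kernel copy
            sent += len(buf)
    return sent

def send_with_sendfile(path, sock):
    """socket.sendfile() uses os.sendfile() when available, so the file data
    does not have to pass through a user-space buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)

def recv_exact(sock, n):
    """Read exactly n bytes (or fewer if the peer closes early)."""
    data = b""
    while len(data) < n:
        part = sock.recv(n - len(data))
        if not part:
            break
        data += part
    return data

def demo():
    payload = b"x" * 4096            # small enough to fit in the socket buffer
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    a, b = socket.socketpair()
    try:
        n1 = send_with_copies(tmp.name, a)
        ok1 = recv_exact(b, n1) == payload
        n2 = send_with_sendfile(tmp.name, a)
        ok2 = recv_exact(b, n2) == payload
    finally:
        a.close()
        b.close()
        os.unlink(tmp.name)
    return n1, n2, ok1, ok2
```

Both calls deliver the same bytes; the difference is only in how many times the data crosses the user/kernel boundary on the way.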
The newer MM Filesystems:
Classes of requests
   Both of the studied filesystems assign some type of
    class to FS requests.
       The minimum needed is 2 classes.
           Legacy Requests
               Read/Write data for small files, not needed quickly at the NIC
           High-Performance Requests
               Read data for large likely-contiguous files that needs to be quickly
                dumped to the NIC
       This is similar to our newer networking paradigm
           “not all traffic is equal”
       Unaddressed question that I had: Can we take the concept
        of discardability and apply it to filesystems?
Classes of requests

   EXT3NS
       2 classes which are determined by an argument to the
        system call in a user buffer
           Fastpath Class dumps data onto the NIC,
           Legacy Class handles legacy filesystem requests.
       The data itself does not have an inherent class and the
        client process explicitly defines its class.
   Tiger Shark
       Real-time Class
           Real-time class is fine grained into subclasses,
            because Tiger Shark has
               Resource Reservation
               Admission Control
                   If the controllers and disks cannot handle the
                    predicted load then the request is denied.
       Legacy Class
           Also has a legacy interface for old filesystem access
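The admission control idea (deny a request if the controllers and disks cannot handle the predicted load) can be sketched as a simple bandwidth budget. This is an illustration only: the class name, the single capacity_mbps figure, and the per-stream rates are hypothetical, and Tiger Shark's actual reservation model is more detailed than this.

```python
from dataclasses import dataclass, field

@dataclass
class AdmissionController:
    capacity_mbps: float                 # assumed sustainable disk/controller throughput
    reserved_mbps: float = 0.0           # sum of the rates of all admitted streams
    streams: dict = field(default_factory=dict)

    def admit(self, stream_id, rate_mbps):
        """Reserve bandwidth for a new stream, or deny it if the budget is exceeded."""
        if self.reserved_mbps + rate_mbps > self.capacity_mbps:
            return False                 # predicted load too high: request denied
        self.streams[stream_id] = rate_mbps
        self.reserved_mbps += rate_mbps
        return True

    def release(self, stream_id):
        """Return a finished stream's reservation to the pool."""
        self.reserved_mbps -= self.streams.pop(stream_id, 0.0)
```

For example, with a 10 Mbps budget a 6 Mbps stream is admitted, a following 5 Mbps stream is denied, and it is admitted once the first stream is released.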
EXT3NS Caching, Quantization, and
Scheduling optimizations
   The hardware is designed to have a minimum block size of 256
    KB up to a maximum of 2MB;
     normal Linux block devices have a maximum block size of
     Some compromises were made in disk metadata block design
       for SDA (what is SDA? The substitute for RAID) so that it
       remains compatible with EXT3FS.
     The large block sizes lead to a large maximum addressable
       file size: 275 GB with first-level indirection, ~2^53 B with
       maximal indirection.
   The memory contained on the NS card is actually a buffer in the
    current version of EXT3NS; the authors plan to add caching
    capability to it. (If you don't know the difference between a buffer
    and a cache, look it up!)
   Asynch IO is not currently supported, but plans are in place.
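A quick arithmetic sketch of why large blocks raise the addressable file size through one level of indirection. The assumptions are mine, not the paper's: 4-byte block pointers as in stock ext3, and one indirect block holding only pointers. Under those assumptions a 1 MiB block gives 2^38 bytes, which is about 275 GB in decimal units and so is consistent with the 275 GB figure above.

```python
POINTER_BYTES = 4  # assumption: 32-bit block pointers, as in stock ext3

def single_indirect_capacity(block_bytes):
    """Bytes addressable through one indirect block that is filled with pointers."""
    pointers_per_block = block_bytes // POINTER_BYTES
    return pointers_per_block * block_bytes

def report():
    """Print the single-indirect capacity for a few block sizes."""
    for kib in (4, 256, 1024, 2048):
        size = single_indirect_capacity(kib * 1024)
        print(f"{kib:>5} KiB blocks -> {size / 1e9:.3f} GB")
```

Capacity grows with the square of the block size, because a bigger block both holds more pointers and covers more data per pointer.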
Tiger Shark Caching, Quantization, and
Scheduling optimizations
   "Deadline Scheduling" instead of elevator
       This is an interesting aspect of Tiger Shark: it benchmarks
        the hardware against a "synthetic workload" to determine
        the best order in which to schedule the disk requests and
        the best thresholds at which to start denying requests.
   Blocksize is 256KB (default), Normal AIX uses 4KB
   Tiger Shark will "chunk" contiguous block reads
    better than the default filesystems to work with its
    large blocksize.
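To make the "deadline instead of elevator" contrast concrete, here is a minimal sketch of the two orderings over the same set of requests. It is a textbook earliest-deadline-first queue next to a simple one-sweep elevator, not Tiger Shark's actual scheduler; the (deadline, block) tuples and function names are hypothetical.

```python
import heapq

def deadline_order(requests):
    """requests: iterable of (deadline, block). Serve the earliest deadline first."""
    heap = list(requests)
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

def elevator_order(requests, head=0):
    """Sweep upward from the head position, then serve the remaining blocks downward."""
    blocks = [block for _, block in requests]
    up = sorted(b for b in blocks if b >= head)
    down = sorted((b for b in blocks if b < head), reverse=True)
    return up + down
```

With requests [(30, 100), (10, 900), (20, 500)] and the head at block 400, the deadline order is 900, 500, 100 while the elevator order is 500, 900, 100: the elevator minimizes head movement, the deadline queue protects the stream that is closest to running dry.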
EXT3NS Streamlining of operations to
get the data from the platter to the NIC.
   EXT3NS has special hardware that avoids
    memcopy and most kernel calculations.
   This hardware moves the data from the disk
    hardware buffer directly onto a custom PCI bus,
    then through on-card buffers directly to the NIC
    on the SAME CARD.
   Hardware avoids using the system's PCI bus when
    the fastpath option is used.
   Joint Network Interface and Disk Controller.
    Hardware speedups also calculate IP/TCP/UDP
    headers and checksums to speed up processing.
Tiger Shark Streamlining of operations to
get the data from the platter to the NIC.
   A running daemon that pre-allocates OS resources
    such as buffer space, disk bandwidth and controller
   Not a hardware-dependent solution.
   Even though it does not have shared memory
    hardware, Tiger Shark copies data from the disks
    into a shared memory area. Essentially this is a very
    large extension of the kernel's disk block cache.
   VFS layer for Tiger Shark intercepts repeated calls
    and uses the shared memory area, therefore saving
    kernel memcopys on subsequent requests.
Platter Layout and Scaling optimizations
for Contiguous Streaming
   EXT3NS Hardware uses a RAID-3-like cross platter
    optimization called "SDA" which distributes the
    blocks across multiple disk platters (simple striping,
    not interleaving).
       Maximum of 4 platters as implemented.
   Tiger Shark
       Striping across a maximum of 65000 platters
       Striping method unspecified; it looks flexible and can be
        extended to include redundancy if desired.
   Keeps all members of a block group contiguous (per
    journaling FS concepts) and attempts to keep the
    block groups contiguous.
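A minimal sketch of simple round-robin striping, the general idea behind spreading consecutive blocks across disks so that a sequential read keeps every disk busy. Neither paper's exact layout is reproduced here (the Tiger Shark method is unspecified in these notes), so the function below is a generic illustration with hypothetical names.

```python
def locate(logical_block, num_disks):
    """Map a logical block number to (disk index, block offset on that disk)."""
    return logical_block % num_disks, logical_block // num_disks

def disks_touched(first_block, count, num_disks):
    """Disk index for each block of a sequential read."""
    return [locate(b, num_disks)[0] for b in range(first_block, first_block + count)]
```

With 4 disks, blocks 0 through 7 land on disks 0, 1, 2, 3, 0, 1, 2, 3, so a contiguous stream draws on all four disks in turn.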
Seeking Optimizations

   EXT3NS
       None noted beyond large block size.
   Tiger Shark
       Byte Range Locking.
           Allows multiple clients to access different areas of
            a file with real-time guarantees if they don't
            step on each other.
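Byte range locking comes down to an interval overlap test: two clients can proceed together as long as their ranges do not intersect. The sketch below is a single-process illustration with hypothetical names; it is not Tiger Shark's lock manager and ignores shared versus exclusive modes.

```python
class ByteRangeLocks:
    def __init__(self):
        self._held = []  # list of (start, end, owner), end exclusive

    def try_lock(self, owner, start, length):
        """Grant the range unless it overlaps a range held by another owner."""
        end = start + length
        for s, e, o in self._held:
            if o != owner and start < e and s < end:
                return False             # ranges intersect: clients would collide
        self._held.append((start, end, owner))
        return True

    def unlock(self, owner):
        """Drop every range held by this owner."""
        self._held = [r for r in self._held if r[2] != owner]
```

Two viewers at different positions in the same movie hold disjoint ranges and never block each other; only overlapping ranges are refused.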
Legacy nods
   EXT3NS and Tiger Shark: Fully compatible with
    VFS for respective platforms.
       Virtual Filesystem
   EXT3NS: If the legacy option (slow class) is used, the disk
    contents are copied into the system's page cache through
    the system's bus as if EXT3FS was being used.
       The paper does not go into it, but my guess is that this is a
        rather wasteful operation given the large blocksize of SDA.
        Other legacy tools such as fsck and mkfs are also available for
   Tiger Shark: Compatible with JFS.
       VFS/JFS calls go through the kernel interface with some
        Block translation.
Current Research and Future Directions
and Jeff’s questions
   Tiger Shark gives us Filesystem QoS. But can we do better
    by integrating VBR/ABR into the system?
   What about Peeling in a VBR system to save resources?
   Replication and redundancy are always an issue, but not
    addressed in this scope.
   If it is a software-based system such as Tiger Shark, where
    in the OS should we put these optimizations? (Kernel,
    Tack-On Daemon, Middleware)
   Legacy disk accesses have a huge cost in both of these
    systems. How can we minimize it?
EXT3NS Final Thoughts
   Valid, but not a novel approach.
       Custom hardware does not represent an incremental step forward in
        universal knowledge.
   EXT3NS is built for exactly one thing: Network Streaming of data.
       An engineering change was made to the hardware design of a computer
        system, and some optimizations were made to the software to take
        advantage of it. The authors are not advocating a radical design change to
        all computers.
       Violates a few “design principles” therefore it must be relegated to a
        customized specific-purpose system.
   Empirical data confirm that EXT3NS design is able to squeeze more
    concurrent sessions out of a multimedia server than would have been
    available previously.
       There is still a saturation point where the memory of the NS card or the
        capabilities of the card's internal bus break down and the system cannot
        scale beyond that point.
   Better than Best Effort.
Tiger Shark Final Thoughts
   Valid, somewhat novel approach
     It adds QoS guarantees to current disk interface architectures

   Built to be extensible to more than just MM disk access. But
    definitely optimized for it.
   Empirical data confirm that Tiger Shark design is able to serve
    more concurrent sessions out of a multimedia server than would
    have been available previously, BUT there is still a kernel
    bottleneck for the initial block load.
   Better suited to multiple concurrent access than EXT3NS
     Currently appears scalable beyond any reasonable (modern)
      demands. As usual in computer science though, future
      demands may find a point of scaling breakdown of the system.
   Guaranteed QoS.
   Many other later QoS filesystems extend this concept and tweak
    some aspects of it such as scheduling.
The fundamental academic question at the
end of the day:
   The 2 major competing solution paradigms:
       Fundamentally alter the hardware datapath in a
        computer and present a customized hardware
        solution with relevant changes in OS.
           Scaling = not addressed.
       Retrofit current operating systems with some
        tacked-on task-specific optimizations and
        tweaking of settings. The system and the
        hardware are kept generic.
           Scaling = buy more hardware
   Or can we find an alternate third paradigm?
