Filesystem Optimizations for Static Content Multimedia Servers
Review of academic papers for TDC573
Jeff Absher

Papers Reviewed
- "Implementation and Evaluation of EXT3NS Multimedia File System"
  Baik-Song Ahn, Sung-Hoon Sohn, Chei-Yol Kim, Gyu-Il Cha, Yun-Cheol Baek, Sung-In Jung, Myung-Joon Kim.
  Presented at the 12th Annual ACM International Conference on Multimedia, October 10-16, 2004, New York, New York, USA.
- "The Tiger Shark File System"
  Roger L. Haskin, Frank B. Schmuck.
  IBM Journal of Research and Development, 1998.

What is the problem?
- A multimedia server with relatively static content:
  - Prerecorded movies
  - Audio
  - Lectures
  - Commercials
- End users can start/stop/pause/seek.
- Many different simultaneous users.
- Should be backward compatible for legacy requests.
- Massive transfer of data from disks to the NIC.
- We can safely avoid focusing on real-time writes: it is a server, so assume the data is collected non-real-time.
  (Note: both systems could be easily extended within their scope to handle real-time writes.)

Scope Limitations and Design Goals
Limitations:
- Single server, or a cluster with a single shared set of disks; no distributed nodes. (Distributed filesystems, P2P filesystems, and others are slightly different research areas.)
- Single local filesystem, which may consist of an array of multiple disks.
Design goals, in order of importance:
- Pump as much data as you can from the disks to the NICs. This can be done by avoiding kernel memcpys.
- Seeking.
- Quick recoverability for very large filesystems (journaling).
- Legacy compatibility.

Problems with the "old" filesystem block transfer to the NIC in the network-server context (simplified)
- Multiple memcpy() calls across user/kernel mode.
- Disk blocks optimized for small files.
- Many context switches: the kernel must be involved in both reading from the disk and writing to the NIC.
- Bus contention with other I/O.
- The block cache is in main memory, which may not be fast enough from a hardware perspective.
- The data may be slow to "bubble down" the networking layer due to redirectors, routing, etc.
- Checksum calculations and the like for networking happen in software.

The newer multimedia filesystems: classes of requests
- Both of the studied filesystems assign some type of class to filesystem requests. The minimum needed is two classes:
  - Legacy requests: read/write data for small files, not needed quickly at the NIC.
  - High-performance requests: read data for large, likely-contiguous files that needs to be dumped to the NIC quickly.
- This is similar to our newer networking paradigm: "not all traffic is equal."
- Unaddressed question that I had: can we take the concept of discardability and apply it to filesystems?

Classes of requests
EXT3NS:
- Two classes, determined by an argument to the system call in a user buffer address.
  - The Fastpath class dumps data onto the NIC.
  - The Legacy class handles legacy filesystem requests.
- The data itself does not have an inherent class; the client process explicitly defines its class.
Tiger Shark:
- Real-time class: fine-grained into subclasses, because Tiger Shark has resource reservation and admission control. If the controllers and disks cannot handle the predicted load, the request is denied.
- Legacy class: also provides a legacy interface for old filesystem access interfaces.

EXT3NS caching, quantization, and scheduling optimizations
- The hardware is designed for a minimum block size of 256 KB, up to a maximum of 2 MB; normal Linux block devices have a maximum block size of 4 KB.
- Some compromises were made in the disk metadata block design for SDA (what is SDA? the substitute for RAID) so that it stayed compatible with EXT3FS.
- The large block sizes lead to a large maximum addressable file size: 275 GB with first-level indirection, ~253B with maximal indirection.
- The memory contained on the NS card is actually a buffer in the current version of EXT3NS; the authors plan to add caching capability to it. (If you don't know the difference between a buffer and a cache, look it up!)
- Asynchronous I/O is not currently supported, but plans are in place.

Tiger Shark caching, quantization, and scheduling optimizations
- "Deadline scheduling" instead of elevator algorithms. This is an interesting aspect of Tiger Shark: it benchmarks the hardware against a "synthetic workload" to determine the best order in which to schedule disk requests and the best thresholds at which to start denying requests.
- Block size is 256 KB by default; normal AIX uses a 4 KB size.
- Tiger Shark will "chunk" contiguous block reads better than the default filesystems do, to work with its large block size.

EXT3NS streamlining of operations to get the data from the platter to the NIC
- EXT3NS has special hardware that avoids memcpys and most kernel calculations.
- This hardware takes the data output from the disk hardware buffer directly onto a custom PCI bus, then copies it through buffers directly to the NIC on the SAME CARD: a joint network interface and disk controller.
- The hardware avoids using the system's PCI bus when the fastpath option is used.
- Hardware speedups also calculate IP/TCP/UDP headers and checksums to speed up processing.

Tiger Shark streamlining of operations to get the data from the platter to the NIC
- A running daemon pre-allocates OS resources such as buffer space, disk bandwidth, and controller time. Not a hardware-dependent solution.
- Even though it does not have shared-memory hardware, Tiger Shark copies data from the disks into a shared memory area; essentially, this is a very large extension of the kernel's disk block cache.
- The VFS layer for Tiger Shark intercepts repeated calls and serves them from the shared memory area, thereby saving kernel memcpys on subsequent requests.

Platter layout and scaling optimizations for contiguous streaming
EXT3NS:
- The hardware uses a RAID-3-like cross-platter optimization called "SDA," which distributes the blocks across multiple disk platters (simple striping, not interleaving).
- Maximum of 4 platters as implemented.
Tiger Shark:
- Striping across a maximum of 65000 platters.
- The striping method is unspecified; it looks flexible, and it can be extended to include redundancy if desired.
- Keeps all members of a block group contiguous (per journaling-FS concepts) and attempts to keep the block groups themselves contiguous.

Seeking optimizations
EXT3NS:
- None noted beyond the large block size.
Tiger Shark:
- Byte-range locking: allows multiple clients to access different areas of a file with real-time guarantees, as long as they don't step on each other.

Legacy nods
EXT3NS and Tiger Shark:
- Both are fully compatible with the Virtual Filesystem (VFS) layer on their respective platforms.
EXT3NS:
- If the legacy option (slow class) is used, the disk contents are copied into the system's page cache through the system's bus as if EXT3FS were being used. The paper does not go into it, but my guess is that this is a rather wasteful operation given the large block size of SDA.
- Other legacy tools such as fsck and mkfs are also available for EXT3NS.
Tiger Shark:
- Compatible with JFS. VFS/JFS calls go through the kernel interface with some block translation.

Current research, future directions, and Jeff's questions
- Tiger Shark gives us filesystem QoS, but can we do better by integrating VBR/ABR into the system? What about peeling in a VBR system to save resources?
- Replication and redundancy are always an issue, but they are not addressed in this scope.
- If it is a software-based system such as Tiger Shark, where in the OS should we put these optimizations? (Kernel, tack-on daemon, middleware?)
- Legacy disk accesses have a huge cost in both of these systems; how can we minimize it?

EXT3NS final thoughts
- A valid, but not novel, approach: custom hardware does not represent an incremental step forward in universal knowledge.
- EXT3NS is built for exactly one thing: network streaming of data.
- An engineering change was made to the hardware design of a computer system, and some optimizations were made to the software to take advantage of it.
- The authors are not advocating a radical design change to all computers.
- It violates a few "design principles," so it must be relegated to a customized, specific-purpose system.
- Empirical data confirm that the EXT3NS design is able to squeeze more concurrent sessions out of a multimedia server than were available previously.
- There is still a saturation point where the memory of the NS card, or the capabilities of the card's internal bus, breaks down, and the system cannot scale beyond that point.
- Better than best effort.

Tiger Shark final thoughts
- A valid, somewhat novel approach: it adds QoS guarantees to current disk interface architectures.
- Built to be extensible to more than just multimedia disk access, but definitely optimized for it.
- Empirical data confirm that the Tiger Shark design is able to serve more concurrent sessions from a multimedia server than were available previously, BUT there is still a kernel bottleneck for the initial block load.
- Better suited to multiple concurrent accesses than EXT3NS.
- Currently appears scalable beyond any reasonable (modern) demands. As usual in computer science, though, future demands may find a point where the system's scaling breaks down.
- Guaranteed QoS. Many later QoS filesystems extend this concept and tweak some aspects of it, such as scheduling.

The fundamental academic question at the end of the day
The two major competing solution paradigms:
- Fundamentally alter the hardware datapath in a computer and present a customized hardware solution, with relevant changes in the OS. Scaling = not addressed.
- Retrofit current operating systems with some tacked-on, task-specific optimizations and tweaking of settings, keeping the system and the hardware generic. Scaling = buy more hardware.
Or can we find an alternate third paradigm?