Filesystem Optimizations for
Static Content Multimedia
Review of academic papers for TDC573
Implementation and Evaluation of EXT3NS
Multimedia File System
Baik-Song Ahn, Sung-Hoon Sohn, Chei-Yol Kim, Gyu-Il Cha, Yun-Cheol Baek, Sung-In Jung, Myung-Joon Kim.
Presented at the 12th Annual ACM International Conference on Multimedia, October 10–16, 2004, New York, New York.
The Tiger Shark File System
Roger L. Haskin, Frank B. Schmuck
IBM Journal of Research and Development, 1998
What is the problem?
MM server with static content: prerecorded movies, audio, lectures, commercials.
End user can start/stop/pause/seek.
Many different simultaneous users.
Should be backward compatible for legacy applications.
Relatively massive transfer of data from disks to NIC.
Can safely avoid focusing on real-time writes: it is a server, and we assume data is collected non-real-time.
Note: both systems could easily be extended within their scope to handle RT writes.
Scope Limitations and Design Goals
Limitations:
Single server, or a cluster with a single shared set of disks; no distributed nodes. (There is research in the slightly different areas of distributed filesystems, P2P filesystems, and the like.)
Single local filesystem, which may consist of an array of multiple disks.
Design goals, in order of importance:
Pump as much data as you can from the disks to the NICs. This can be done by avoiding kernel memcopys.
Quick recoverability for very large filesystems: journaling.
Problems with “old” filesystem block transfer to NIC in the network-server case
Multiple memcpy() calls across user/kernel mode.
Disk blocks optimized for small files.
Many context switches: the kernel must be involved in both reading from the disk and writing to the NIC.
Bus contention with other IO.
Block cache is in main memory and may not be fast enough.
The data may be slow to “bubble down” the networking layer due to redirectors, routing, etc.
Checksum calculations and the like for networking happen in software.
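The overhead above can be put in rough numbers. The following is my own simplified cost model, not taken from either paper: it assumes the classic read()/send() server loop copies each block three times (disk to page cache, page cache to user buffer, user buffer to socket buffer) and crosses the user/kernel boundary twice per block, while a kernel-resident fast path copies each block once and makes a single syscall.

```python
def legacy_path_cost(file_bytes, block_size=4096):
    """Model of the classic read()/send() server loop: three copies
    and two user/kernel crossings (read + send) per block."""
    blocks = -(-file_bytes // block_size)  # ceiling division
    return {"memcpys": 3 * blocks, "context_switches": 2 * blocks}

def fastpath_cost(file_bytes, block_size=256 * 1024):
    """Model of a zero-copy fast path with large blocks: data never
    enters user space, one copy per block, one syscall total."""
    blocks = -(-file_bytes // block_size)
    return {"memcpys": blocks, "context_switches": 1}
```

Even this toy model shows why large blocks and kernel bypass multiply together: fewer blocks means fewer copies AND fewer switches.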
The newer MM Filesystems: Classes of requests
Both of the studied filesystems assign some type of class to FS requests; the minimum needed is 2 classes:
Read/write data for small files, not needed quickly at the NIC.
Read data for large, likely-contiguous files that needs to be quickly dumped to the NIC.
This is similar to our newer networking paradigm: “not all traffic is equal.”
Unaddressed question that I had: can we take the concept of discardability and apply it to filesystems?
Classes of requests
EXT3NS:
2 classes, determined by an argument to the system call.
Fastpath Class dumps data directly onto the NIC; Legacy Class handles legacy filesystem requests in a user buffer.
The data itself does not have an inherent class; the client process explicitly defines its class.
Tiger Shark:
Real-time Class, fine-grained into subclasses. Tiger Shark performs admission control: if the controllers and disks cannot handle the predicted load, then the request is denied.
Legacy Class: also has a legacy interface for old filesystem access.
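Tiger Shark's deny-on-overload behavior amounts to admission control. A minimal sketch of the idea (my own illustration with invented names and numbers, not the paper's algorithm): track the bandwidth already promised to real-time streams and refuse any new stream that would push the total past what the disks and controllers can deliver.

```python
class AdmissionController:
    """Toy admission control: deny a real-time stream if the total
    reserved bandwidth would exceed measured device capacity."""

    def __init__(self, capacity_mbps):
        self.capacity = capacity_mbps
        self.reserved = 0.0

    def admit(self, stream_mbps):
        if self.reserved + stream_mbps > self.capacity:
            return False           # deny: predicted load too high
        self.reserved += stream_mbps
        return True

    def release(self, stream_mbps):
        self.reserved -= stream_mbps
```

The key property is that a denied request costs nothing: already-admitted streams keep their guarantees instead of everyone degrading together.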
EXT3NS Caching and Quantization
The hardware is designed to have a minimum block size of 256 KB, up to a maximum of 2 MB; normal Linux block devices have a maximum block size of 4 KB.
Some compromises were made in the disk metadata block design for SDA (what is SDA? The substitute for RAID) so that it would be compatible with EXT3FS.
The large block sizes lead to a large maximum addressable file size: for first-level indirection it is 275 GB; for maximal indirection it is ~253B.
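For intuition on why block size drives these limits, here is the generic ext2/ext3-style indirection arithmetic. This is illustrative only: the pointer size and direct-block count are my assumptions, and the paper's exact figures depend on SDA's on-disk layout details, so this formula will not reproduce them.

```python
def max_file_size(block_size, ptr_size=4, direct=12, levels=3):
    """Max addressable file size for an ext2/ext3-style inode:
    `direct` direct blocks plus one indirect tree per level."""
    ptrs = block_size // ptr_size    # pointers per indirect block
    total_blocks = direct
    for level in range(1, levels + 1):
        total_blocks += ptrs ** level
    return total_blocks * block_size
```

With a 256 KB block and 4-byte pointers, a single indirect block alone holds 65536 pointers, so one level of indirection already covers 65536 × 256 KB = 16 GB; each further level multiplies coverage by another 65536.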
The memory contained on the NS card is actually a buffer in the current version of EXT3NS; the authors plan to add caching capability to it. (If you don't know the difference between a buffer and a cache, look it up!)
Async IO is not currently supported, but plans are in place.
Tiger Shark Caching, Quantization, and Scheduling
"Deadline scheduling" instead of the elevator.
This is an interesting aspect of Tiger Shark: it benchmarks the hardware against a "synthetic workload" to determine the best order in which to schedule the disk requests and the best thresholds at which to start denying requests.
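The contrast between the two scheduling policies fits in a few lines. This is my own sketch of the general idea, not Tiger Shark's implementation: the elevator orders requests by on-disk position to minimize head movement, while a deadline scheduler orders them by when the data must reach its consumer.

```python
def elevator_order(requests):
    """Classic elevator: sort by on-disk block position to
    minimize seek distance."""
    return sorted(requests, key=lambda r: r["block"])

def deadline_order(requests):
    """Deadline scheduling: serve the request whose consumer
    will starve first, regardless of head position."""
    return sorted(requests, key=lambda r: r["deadline"])
```

For streaming, a missed deadline is a visible glitch while an extra seek is merely throughput lost, so trading seek efficiency for deadline safety is the right bargain.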
Blocksize is 256 KB (default); normal AIX uses 4 KB.
Tiger Shark will "chunk" contiguous block reads better than the default filesystems do, to work with its large block size.
EXT3NS Streamlining of operations to get the data from the platter to the NIC
EXT3NS has special hardware that avoids memcopy and most kernel calculations.
This hardware takes the data output from the disk hardware buffer directly onto a custom PCI bus, then copies it through buffers directly to the NIC on the SAME CARD.
The hardware avoids using the system's PCI bus when the fastpath option is used.
Joint network interface and disk controller.
The hardware speedups also calculate IP/TCP/UDP headers and checksums to speed up processing.
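To make the checksum offload concrete, this is the Internet checksum (the RFC 1071 ones'-complement sum used by IP, TCP, and UDP) that ordinarily runs in software for every outgoing packet, and which the NS card computes in hardware instead:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: ones'-complement sum of 16-bit words,
    carries folded back in, result complemented."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length payload
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carry
    return ~total & 0xFFFF
```

A receiver verifies by summing the data together with the transmitted checksum; a correct packet folds to zero. Doing this per-byte work on-card, next to the DMA engine, is exactly the kind of cycle-stealing the fastpath removes from the host CPU.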
Tiger Shark Streamlining of operations to get the data from the platter to the NIC
A running daemon pre-allocates OS resources such as buffer space, disk bandwidth, and controller bandwidth.
Not a hardware-dependent solution.
Even though it does not have shared-memory hardware, Tiger Shark copies data from the disks into a shared memory area. Essentially this is a very large extension of the kernel's disk block cache.
The VFS layer for Tiger Shark intercepts repeated calls and uses the shared memory area, thereby saving kernel memcopys on subsequent requests.
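The effect of that shared memory area can be sketched as a block cache that pays the kernel copy only on a miss. This is my own illustration of the principle; the names and structure are not from the paper:

```python
class SharedBlockCache:
    """Toy model of a shared memory area: the first request for a
    (file, block) pays a kernel copy, repeats are served from the
    shared area with no additional copy."""

    def __init__(self):
        self.cache = {}
        self.kernel_copies = 0

    def read_block(self, file_id, block_no, fetch):
        key = (file_id, block_no)
        if key not in self.cache:
            self.cache[key] = fetch(file_id, block_no)  # the one memcpy
            self.kernel_copies += 1
        return self.cache[key]
```

For a popular movie served to many viewers, almost every request after the first is a hit, so the per-client kernel copy cost approaches zero.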
Platter Layout and Scaling optimizations for Contiguous Streaming
EXT3NS: the hardware uses a RAID-3-like cross-platter optimization called "SDA" which distributes the blocks across multiple disk platters (simple striping). Maximum of 4 platters as implemented.
Tiger Shark: striping across a maximum of 65000 platters. The striping method is unspecified; it looks like it is flexible and can be extended to include redundancy if desired.
Keeps all members of a block group contiguous (per journaling FS concepts) and attempts to keep the block groups contiguous.
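Simple striping of the kind SDA apparently does maps each logical block round-robin across the platters. A generic sketch (the papers don't give the exact mapping, so treat the formula as illustrative):

```python
def stripe_location(logical_block, num_disks=4):
    """Round-robin striping: return (disk index, physical block
    offset on that disk) for a logical block number."""
    return logical_block % num_disks, logical_block // num_disks
```

A contiguous logical read of N blocks thus touches all `num_disks` platters nearly equally, so sequential streaming bandwidth scales with the number of disks until the bus saturates.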
EXT3NS: none noted beyond large block size.
Tiger Shark: byte-range locking. Allows multiple clients to access different areas of a file with real-time guarantees if they don't step on each other.
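The core of byte-range locking is an interval-overlap test: two clients "step on each other" only if their ranges intersect. A minimal sketch of the idea (not Tiger Shark's actual lock manager):

```python
def ranges_conflict(start_a, len_a, start_b, len_b):
    """Two byte ranges conflict iff they overlap."""
    return start_a < start_b + len_b and start_b < start_a + len_a

class ByteRangeLocks:
    """Grant a lock only if it overlaps no currently held range."""

    def __init__(self):
        self.held = []                       # list of (start, length)

    def try_lock(self, start, length):
        if any(ranges_conflict(start, length, s, l) for s, l in self.held):
            return False
        self.held.append((start, length))
        return True
```

Adjacent, non-overlapping ranges coexist, which is what lets many viewers stream different parts of the same file with independent guarantees.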
EXT3NS and Tiger Shark: fully compatible with VFS for their respective platforms.
EXT3NS: if the legacy option (slow class) is used, the disk contents are copied into the system's page cache through the system's bus, as if EXT3FS were being used. The paper does not go into it, but my guess is that this is a rather wasteful operation given the large blocksize of SDA. Other legacy tools such as fsck and mkfs are also available.
Tiger Shark: compatible with JFS. VFS/JFS calls go through the kernel interface with some block translation.
Current Research and Future Directions (and Jeff's questions)
Tiger Shark gives us filesystem QoS. But can we do better by integrating VBR/ABR into the system?
What about peeling in a VBR system to save resources?
Replication and redundancy are always an issue, but not addressed in this scope.
If it is a software-based system such as Tiger Shark, where in the OS should we put these optimizations? (Kernel, tack-on daemon, middleware?)
Legacy disk accesses have a huge cost in both of these systems; how can we minimize it?
EXT3NS Final Thoughts
A valid, but not novel, approach: custom hardware does not represent an incremental step forward.
EXT3NS is built for exactly one thing: network streaming of data. An engineering change was made to the hardware design of a computer system, and some optimizations were made to the software to take advantage of it. The authors are not advocating a radical design change for general-purpose systems.
It violates a few "design principles," so it must be relegated to being a customized, specific-purpose system.
Empirical data confirm that the EXT3NS design is able to squeeze more concurrent sessions out of a multimedia server than would have been available previously.
There is still a saturation point where the memory of the NS card or the capabilities of the card's internal bus break down, and the system cannot scale beyond that point.
Better than best effort.
Tiger Shark Final Thoughts
A valid, somewhat novel approach.
It adds QoS guarantees to current disk interface architectures.
Built to be extensible to more than just MM disk access, but definitely optimized for it.
Empirical data confirm that the Tiger Shark design is able to serve more concurrent sessions out of a multimedia server than would have been available previously, BUT there is still a kernel bottleneck for the initial block load.
Better suited to multiple concurrent access than EXT3NS.
Currently appears scalable beyond any reasonable (modern) demands. As usual in computer science, though, future demands may find a point of scaling breakdown in the system.
Many other, later QoS filesystems extend this concept and tweak some aspects of it, such as scheduling.
The fundamental academic question at the end of the day
The 2 major competing solution paradigms:
Fundamentally alter the hardware datapath in a computer and present a customized hardware solution with relevant changes in the OS. Scaling = not addressed.
Retrofit current operating systems with some tacked-on, task-specific optimizations and tweaking of settings. The system and the hardware are kept generic. Scaling = buy more hardware.
Or can we find an alternate third paradigm?