Andrew Whitaker, Marianne Shaw, and Steven D. Gribble
Presented by Steve Rizor

Abstract
• The Denali isolation kernel is an operating system architecture designed to safely multiplex a large number of Internet services on shared hardware.
• It allows new services to be “pushed” onto third-party infrastructure, relieving authors of the burden of maintaining physical infrastructure.
• It exposes a virtual machine abstraction but does not attempt to emulate the underlying hardware precisely.
• It modifies the virtual architecture to gain scale, performance, and simplicity of implementation.

Introduction
• With the proliferation of Internet services comes the need for hardware to host them, but dedicating one machine to each service is usually highly inefficient.
• A large fraction of web services are infrequently accessed, while a small fraction are frequently accessed.
• Why not virtualize all of the infrequently accessed services?
• If one machine can handle 10,000 requests per hour for one service, why can't one machine handle 1 request per hour for each of 10,000 services?

Making a Case for Isolation Kernels
• Many services can already run on one machine, but they need to be protected from one another.
• Isolation not only lets many services share a machine, it also prevents them from affecting one another.
• This makes it possible to push new or untrusted services without worrying about harm to other services.
• It also enables an interesting experimentation infrastructure: wide-area testbeds for network research, with thousands of running subjects and no need for thousands of physical machines.

Isolation Kernel Design Principles
An isolation kernel is a small-kernel operating system architecture targeted at hosting multiple untrusted applications that require little data sharing.
1. Expose low-level resources rather than high-level abstractions.
• High-level abstractions entail significant complexity and typically have a wide API, violating the security principle of economy of mechanism.
• High-level abstractions also invite “layer below” attacks, in which an attacker gains unauthorized access to a resource by requesting it below the layer of enforcement.
2. Prevent direct sharing by exposing only private, virtualized namespaces.
• Little direct sharing is needed across Internet services, so an isolation kernel should prevent direct sharing by confining each application to a private namespace.
• Memory pages, disk blocks, and all other resources should be virtualized, eliminating the need for a complex access control policy: the only sharing allowed is through the virtual network.
3. Scalability.
• An isolation kernel designed for Internet services must be able to scale to thousands of services on a single machine, so the memory footprint (including kernel metadata) must be minimized.
• Since the set of all unpopular services won't fit in memory, the kernel must treat memory as a cache of popular services, swapping inactive services to disk.
• This cache will have a poor hit rate, so swapping must be rapid to reduce the cache miss penalty.
4. Modify the virtualized architecture for simplicity, scale, and performance.
• VMMs such as Disco adhere to the first two principles, but they also strive to support legacy operating systems by precisely emulating the physical hardware.
• For an isolation kernel, deviating from the underlying physical hardware can enhance performance, simplicity, and scalability.
• The drawback is the loss of support for unmodified legacy operating systems.

Denali Isolation Kernel
The Denali isolation kernel looks like a standard VMM, but its virtual machine interface is quite different from most others.
• The Denali virtual instruction set is a subset of x86, so most virtual instructions execute directly on the physical processor.
• x86 VMMs normally have to use binary rewriting and memory protection techniques to virtualize some instructions. Since Denali does not support legacy operating systems, those instructions are simply defined to have ambiguous semantics: at worst, a VM that executes one harms only itself. Such instructions are rarely used, and none are emitted by C compilers such as gcc.
• The instruction set also adds an “idle-with-timeout” instruction that relinquishes control to another VM instead of spending time in an idle loop, an instruction that terminates the VM, and several virtual registers that reveal information about the system.

Denali's virtual machine interface also differs in that the emulated hardware is not a representation of the physical system.
• Because the emulated devices are static, there is no need to poll for hardware.
• Because the devices are simple, fewer programmed I/O instructions are needed to transmit or receive a single packet.
• Denali schedules round-robin across all active VMs (those with active threads) and uses a buffered interrupt scheme to prevent thrashing.
• VMs that voluntarily give up time via the “idle-with-timeout” instruction are given priority once the timeout has expired.
• Each Denali VM is given its own (virtualized) physical 32-bit address space. A VM may access only a subset of this address space, the size and range of which the isolation kernel chooses when the VM is instantiated.
• The kernel itself is mapped into a portion of the address space that the VM cannot access, which avoids physical TLB flushes on VM/VMM crossings.
• Virtual registers are stored in a page at the beginning of a VM's (virtual) physical address space. This page is shared between the VM and the isolation kernel, avoiding the overhead of kernel traps for register modifications. In other respects, the virtual registers behave like normal memory (for example, they can be paged out to disk).
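
The scheduling policy described above (round-robin across active VMs, with priority for a VM whose idle-with-timeout has expired) can be illustrated with a small Python model. This is a sketch under stated assumptions, not Denali's implementation: the class name, method names, and the one-tick-per-time-slice clock are all hypothetical, and interrupt buffering is not modeled.

```python
from collections import deque
import heapq


class ToyVmScheduler:
    """Hypothetical model of round-robin scheduling with idle-with-timeout."""

    def __init__(self, vm_ids):
        self.run_queue = deque(vm_ids)  # active VMs, in round-robin order
        self.sleeping = []              # min-heap of (wake_tick, vm_id)
        self.now = 0                    # assumed clock: one tick per time slice

    def idle_with_timeout(self, vm_id, timeout):
        """The VM yields voluntarily and becomes runnable after `timeout` ticks."""
        self.run_queue.remove(vm_id)
        heapq.heappush(self.sleeping, (self.now + timeout, vm_id))

    def pick_next(self):
        """Advance one tick; return the VM to run, or None if none is runnable."""
        self.now += 1
        # A VM whose timeout has expired runs ahead of the round-robin queue.
        if self.sleeping and self.sleeping[0][0] <= self.now:
            _, vm_id = heapq.heappop(self.sleeping)
            self.run_queue.append(vm_id)  # rejoins the rotation at the back
            return vm_id
        if not self.run_queue:
            return None
        # Plain round-robin: take the head and rotate it to the back.
        vm_id = self.run_queue.popleft()
        self.run_queue.append(vm_id)
        return vm_id
```

With three VMs where one calls idle_with_timeout for two ticks, the other two share the processor until the timeout expires, at which point the sleeping VM runs next regardless of its position in the rotation. A VM that idles costs the scheduler one heap entry rather than repeated time slices spent in an idle loop.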

Benchmarks
• Because a standard operating system must be modified to run on the Denali isolation kernel, a small guest OS named Ilwaco was developed against the virtual machine interface for testing.
• Because the virtual network device is simplified, fewer programmed I/O instructions are needed per packet. However, Denali still requires a user/kernel switch where BSD does not. Adding a system call to BSD's packet handling (forcing the same user/kernel switch) brings BSD's performance more in line with Denali's.
• Buffering interrupt requests yields a clear performance gain. Note the performance drop at around 800 VMs, caused by memory demands and excessive paging.
• The idle-with-timeout instruction yields a large performance gain over normal OS idle loops.
• Even with 800 virtual machines running, throughput remains high. The effects of paging are evident: with more memory, the cliff can be pushed further out.
• Running the Quake II Linux server on Denali, even with 30 servers (4 clients each) there is no change in latency or reliability. The scheduling algorithm, combined with the idle-with-timeout instruction and buffered interrupts, keeps the servers running without issues.

References
Andrew Whitaker, Marianne Shaw, and Steven D. Gribble, “Scale and Performance in the Denali Isolation Kernel”, OSDI’02.