      Caution: Still a new topic for me as well.
Note: these slides draw, sometimes verbatim, on the
           papers cited on the next slide.
Xen and the Art of Virtualization
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho,
   Rolf Neugebauer, Ian Pratt, Andrew Warfield, SOSP 2003.
• What is virtualization?
• Why would you do it?
• Why is it important?
       What is virtualization?
• Used to present the illusion of many
  smaller virtual machines (VMs) each
  running a separate operating system
  – Run multiple XPs on XP.
  – Run LINUX, Solaris, XP on XP.
  – Run XP on LINUX.
  – Etc.
          Why would you do it?
• Single user: You have an XP box and you want to run
• Single user: You don’t trust security of one OS or of
  some applications and you’d like to “wall it off.”
• Miroslav Ponec: Run Linux on new laptop in VMware to
  avoid driver problems.
• Enterprise manager: You have lots of boxes that sit idle
  a lot of the time. If you multiplex you can save hardware,
  etc. costs.
• Shared grid/resource: host multiple (untrusted?)
  applications and servers on a shared machine.
                        Why VMs?
• Number of ways to build a system to host multiple
  applications and servers on a shared machine.
   – Deploy hosts running standard OS and allow users to install files
     and start processes, protection between processes provided by
     standard OS techniques.
       • System administration challenging due to complex configuration
       • No adequate support for performance isolation: scheduling priority,
         memory demand, network traffic and disk accesses of one process
         impact the performance of others.
   – Possible solution: retrofit support for performance isolation to the
     operating system.
       • Hard to ensure that all resource usage is accounted to the correct
         process– complex interactions due to buffer cache or page
         replacement algorithms.
• Heck, we had trouble getting the
  Operating System to properly account for
  resource usage of different threads in one
               VM approach
• Multiplex physical resources at the
  granularity of an entire operating system
  and provide performance isolation.
• Price paid:
  – More heavyweight in terms of initialization
    and resource consumption.
  – Xen: “For target of up to 100 hosted OS
    instances, price worth paying. Enhanced
    flexibility, avoid configuration interactions (e.g.
    Windows registry).”
A little hype from the media …
               eweek on Virtualization
•   IT departments are doing this to try to find "ways to use the newest in
    technology (processors, storage, memory, communications, and software)
    to improve: the application environment by increasing performance;
    optimizing processor utilization through workload management, scalability
    and reliability; increasing organizational efficiency by reducing costs of
    hardware, software and staff; and reducing both the number and the impact
    of system outages regardless of the underlying reason," said Kusnetzky.
•   At a recent Gartner Symposium/ITxpo, Gartner Inc. vice president John
    Enck called virtualization a "megatrend."
•   "We see virtualization being extremely important across all server types"
    and "virtualization is the best tool you have right now in the market to
    increase efficiency and drive up the utilization of your servers," said Enck.
•   What all this boils down to is that virtualization should make today's more
    powerful computers more productive while simultaneously making them
    easier and cheaper to manage.
•   The trick is how to make this happen.
      ComputerWorld, 11/21/2005
•   NOVEMBER 21, 2005 (IDG NEWS SERVICE) - A recent survey of 100 IT executives predicts that
    IT spending will decrease slightly in 2006 as more businesses worry about global economic
    conditions, but security software and enterprise IT upgrades remain top concerns.
•   Macroeconomic factors such as high oil prices and a devastating hurricane season in the U.S.
    have caused 40% of the executives surveyed by Goldman, Sachs & Co. to consider reducing their
    2006 IT budgets, according to survey results released Friday. Most executives, 52%, believe IT
    spending will be unchanged in 2006.
•   Security software has been a long-running priority among the executives on Goldman’s survey
    panel, and nothing has changed that mind-set based on the current results. Spending on antivirus
    products has eased up after a flurry of activity, but CIOs continue to focus on improving security in
    areas like identity management and regulatory compliance, the survey said.
•   Other enterprise software priorities include enterprise resource management and customer
    relationship management systems, with CIOs upgrading those two categories to top priorities.
    When Goldman polled its panel in April, ERP and CRM software were considered only medium
•   Among enterprise software vendors, VMware Inc. and SAP AG were the two most cited
    companies receiving a larger percentage of the respondents’ IT budgets. Virtualization
    technologies are a hot topic this year as Intel Corp. and Advanced Micro Devices Inc.
    prepare chips that improve the performance of virtualization software. Respondents listed
    Novell Inc. and Computer Associates International Inc. as receiving less of their IT budgets.
            ZDNet Blog 11/14/05
• “With virtual machines of the desktop sort that VW5 enables, PC
  users can literally carve their desktop and notebook systems into
  completely separate instances of Windows that run side-by-side with
  each other as though the other instances don't exist. In other words,
  if some process in one tries some sort of security exploit like a buffer
  overflow, it can't get to the others any more than a buffer overflow
  could affect another computer across the network. It can only get to
  whatever is running in that instance or "partition of Windows." The
  idea of partitioning systems in this way makes it possible to dedicate
  partitions to specific activities. For example, you can do all your
  Web browsing in one partition while you run your corporate
  applications in another and your personal applications like Quicken
  in a third and never the three shall meet. I'm a Firefox user. But for
  those Web sites that require Internet Explorer (which I'm always
  nervous about using), I just run it in a separate partition. Using a
  virtual machine for just one application is like driving on a completely
  empty road with airbags.”
                       More …
• Intel has announced the arrival of the first desktop chips
  to include its hardware-based virtualization technology
  known as VT (codenamed Vanderpool). This could very
  well signal a new era in desktop/notebook computing
  and I would think long and hard before buying a new
  system that doesn't include this new and worthwhile
•   So, why is the Intel announcement so significant? Until Intel started
    releasing its VT technology (it first debuted in the company's recently
    announced Paxville XEON server chips), companies like SWSoft, VMWare,
    and Microsoft had to do a lot of the virtual machine heavy lifting in their
    software. Without any hardware assistance the likes of which VT provides,
    it takes far more in the way of physical resources (processor, memory) to
    launch and run virtual machines than it does if those instantiations can be
    activated through hardware. While such technologies make it easier for
    competing virtual machine software solutions like Xen to get in the virtual
    machine game, Raghu Raghuram, VMware's senior director of strategy and
    marketing, told me earlier this year that his company welcomes innovations
    like VT because end users will get better performance and his company can
    focus its attention on adding value in higher layers of the virtualization stack
    such as management. VMWare is wasting no time in rolling out its support
    for Intel's VT technology. According to a press release on its Web site, VT
    support is being beta tested in version 5.5 of VMWare Workstation, which
    the company expects to release by the end of the year.
           Diane Greene, President,
•   To start out, why don't you describe what your company does?
    VMware produces virtualization software. What that means is we take a physical x86-based
    system and we provide the multiple isolated, movable partitions that you can
    run operating systems with their applications in. In terms of what the customer gets,
    they get a way to drive utilization from, say, 15 percent, on up to 85 percent. They get
    very cost-effective ways to do disaster recovery, high availability, provisioning--all
    sorts of system-level services.
•   Pick a typical customer. What's their life before and after VMware? What
    A typical customer has got widely proliferated x86 machines, and depending on the
    power of the server, they can get a 10-to-1, 4-to-1 reduction in the number of servers
    they need. Or they can stop that proliferation and contain it better. And beforehand, to
    bring a new service online you have to go order the machine, install it in the server
    room, get it network-connected, make sure the power is there--it can be a multi-
    month process. Post-VMware, all they do is keep pre-built images of different
    software services like SQL Server, and when someone needs that service, they just
    find some excess capacity somewhere and deploy it.
•   So what's the penalty? Why doesn't everybody do this?
    Actually, what we were finding is that for people who use it, it's become the default
    way that they run their x86 workloads.
          OK, I’m convinced
• So what do we do?
• First let’s think about high-level challenges
  and approaches.
       High-level Challenges
• VMs must be isolated from each other: it is
  not acceptable for execution of one to
  adversely affect performance of the other.
  – Have to think about what this really means.
• Support variety of OSs.
• Performance overhead introduced by
  virtualization should be small.
• Full Virtualization:
  – Virtual hardware exposed is functionally
    identical to the underlying machine.
     • Allows unmodified operating systems to be hosted.
     • Seems like this is what VMware supports.
Drawbacks of Full Virtualization
• Especially on x86 architecture:
     • Support for full virtualization never part of x86 design, e.g.
       certain supervisor instructions would need to be handled by
       the VMM for correct virtualization, but executing with
       insufficient privilege fails silently as opposed to a nice trap.
     • Virtualizing the x86 MMU is also a challenge.
         – VMware ESX Server dynamically rewrites portions of the
           hosted machine code to insert traps wherever VMM
           intervention might be required. Applied to entire guest OS
           kernel since all non-trapping privileged instructions must be
           caught and handled.
         – ESX maintains shadow versions of things like page tables and
           maintains consistency with the virtual tables by trapping every
           update attempt – high cost for update-intensive operations such
           as creating a new application process.
    More arguments against Full Virtualization
• Sometimes it is desirable for hosted OS to
  see real as well as virtual resources:
  – providing both real and virtual time allows a
    guest OS to better support time-sensitive
    tasks and to correctly handle TCP timeouts
    and RTT estimates
  – Exposing real machine addresses allows a
    guest OS to improve performance by using
    superpages or page coloring.
  Xen Approach: Paravirtualization
• Present a virtual machine abstraction that
  is similar but not identical to the underlying
  hardware.
  – Requires modifications to the guest OS.
  – No changes to the application binary interface
    (ABI), so no modifications needed to
    guest applications.
        Xen Design Principles
1. Support for unmodified application binaries is
   essential.
2. Need to support full multi-application operating
   systems.
3. Paravirtualization is necessary to obtain high
   performance and strong resource isolation on
   uncooperative machine architectures such as
   x86.
4. Even on cooperative machine architectures,
   completely hiding the effects of resource
   virtualization from guest OSes risks both
   correctness and performance.
• Guest OS: one of the OSs that Xen can
  host.
• Domain: running virtual machine within
  which a guest OS executes.
• Xen itself is called the hypervisor since it
  operates at a higher privilege level than
  the supervisor code of the guest operating
  systems that it hosts.
     Xen’s Paravirtualized (x86) Interface
• Need to discuss
  – Memory management
  – CPU
  – Device I/O
       Memory Management
• Hardest part.
• Easier if
  – the architecture provides a software-managed
    TLB as these can be easily virtualized.
  – Tagged TLB: ability to associate an address-
    space identifier tag with each TLB entry to
    allow hypervisor and each guest OS to
    efficiently coexist in separate address
    spaces – no need to flush the entire TLB
    when transferring execution.
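A toy sketch of why tags matter, in Python. This is only an illustration under simplifying assumptions: the class `ToyTLB`, the helper `count_misses`, and the page numbers are all made up here, and a dictionary stands in for the hardware. It alternates between a guest and the hypervisor and counts how many lookups miss.

```python
class ToyTLB:
    """Toy translation cache; a teaching model, not real hardware behavior."""

    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = {}          # (tag, virtual page) -> physical frame
        self.current_asid = None

    def switch_address_space(self, asid):
        if not self.tagged:
            # Without tags, stale entries would be ambiguous: flush everything.
            self.entries.clear()
        self.current_asid = asid

    def lookup(self, vpage, page_table):
        """Return (frame, hit) for vpage in the current address space."""
        tag = self.current_asid if self.tagged else None
        key = (tag, vpage)
        if key in self.entries:
            return self.entries[key], True
        frame = page_table[vpage]  # stands in for the page-table walk
        self.entries[key] = frame
        return frame, False


def count_misses(tagged, rounds=3):
    """Alternate between a guest and the hypervisor and count TLB misses."""
    tlb = ToyTLB(tagged)
    guest_pt = {0: 100, 1: 101}
    hypervisor_pt = {0: 900}
    misses = 0
    for _ in range(rounds):
        tlb.switch_address_space("guest")
        for vpage in (0, 1):
            _, hit = tlb.lookup(vpage, guest_pt)
            misses += 0 if hit else 1
        tlb.switch_address_space("xen")
        _, hit = tlb.lookup(0, hypervisor_pt)
        misses += 0 if hit else 1
    return misses
```

With tags, only the first pass through each address space misses. Without tags, every switch throws the cached translations away, so the same working set misses again on every round.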
             (What’s a TLB?)
• Short for translation look-aside buffer, a table in
  the processor’s memory that contains
  information about the pages in memory the
  processor has accessed recently. The table
  cross-references a program’s virtual addresses
  with the corresponding absolute addresses in
  physical memory that the program has most
  recently used. The TLB enables faster
  computing because it allows the address
  processing to take place independent of the
  normal address-translation pipeline.
• Unfortunately x86 does not have a software-
  managed TLB: TLB misses are serviced
  automatically by the processor by walking the
  page table structure in hardware.
• Thus to achieve best possible performance, all
  valid page translations for the current address
  space should be present in the hardware-
  accessible page table.
• Moreover, because the TLB is not tagged,
  address space switches require a complete TLB
  flush.
• Given these limitations, two decisions:
  – Guest OS has direct read access to hardware
    page tables, but updates are batched and
    validated by the hypervisor.
  – Xen exists in a 64MB section at the top of
    every address space, thus avoiding a TLB
    flush when entering and leaving the
    hypervisor.
• OS no longer most privileged entity in
  system. Guest OS must run at a lower
  privilege level than Xen.
  – x86 has 4 privilege levels, 2 unused, so OK.
  – Guest OS can’t execute privileged
    instructions, but is protected from applications,
    which run at privilege level 3.
  – Privileged instructions “paravirtualized” by
    requiring them to be validated and executed
    within Xen.
• Exceptions (e.g. memory faults, software traps):
  Guest OS must register a descriptor table for
  exception handlers with Xen.
  – Usually the same as real x86 hardware. Page fault
    handler would need to read from a privileged
    register, so need to work around this.
  – Only two types of exceptions frequent enough for real
    performance hits:
     • System Calls: Guest OS may install a “fast” handler for
       system calls, allowing direct calls from an application into its
       guest OS and avoiding indirecting through Xen on every call.
     • Can’t do with page faults – only code executing in ring 0 can
       read the faulting address from register CR2.
• Hardware interrupts replaced with a
  lightweight event system.
• Each guest OS has a timer interface and
  is aware of both ‘real’ and ‘virtual’ time.
                   Device I/O
• Xen exposes a set of clean and simple device
  abstractions.
  – Efficient
  – Allows protection and isolation
• I/O Data transferred to and from each domain
  via Xen, using shared-memory asynchronous
  buffer rings.
• Lightweight event delivery mechanism used for
  sending asynchronous notifications to a domain.
       Control and Management
• “Separate policy from mechanism”
    – Keep hypervisor out of as much as possible.
• Hypervisor provides only basic control operations.
    – Exported through an interface accessible only from authorized
      domains.
• Domain is created at boot time which is permitted to use the control
  interface. This domain (Domain0) responsible for hosting
  application-level management software.
    – Control interface allows creation and termination of other domains and
      their scheduling parameters, physical memory allocation and access
      given to machine’s physical disks and network devices.
• Control interface exported to a suite of application-level
  management software running in Domain0.
    – Tools allow creation and destruction of domains, set network filters and
      routing rules, creation and deletion of virtual network interfaces and
      virtual block devices.
           Cost of Porting
• Linux: porting touched about 1.36% of the x86 code base.
           Detailed Design
• Control Transfer
• Data Transfer
• Subsystem Virtualization
           Control Transfer
• Synchronous calls from a domain to Xen
  made using a hypercall.
  – Domain can perform a synchronous software
    trap into the hypervisor to do a privileged
    operation.
• Notifications delivered to domains from
  Xen using an asynchronous event
  mechanism.
  – Small number of events: new data received,
    virtual disk request has been completed.
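A rough sketch of the two directions of control transfer, in Python. Everything here is hypothetical: the class names, the `"mask_events"` operation, and the event numbers are invented for illustration, and the pending-event bitmask is an assumption about how such a mechanism could be organized, not Xen's actual interface.

```python
EVT_NET_RX = 0      # hypothetical event numbers
EVT_DISK_DONE = 1


class ToyDomain:
    def __init__(self):
        self.pending = 0       # bitmask of events not yet handled
        self.masked = False    # guest may hold off event delivery
        self.handled = []

    def run_event_handler(self):
        if self.masked:
            return
        for bit in (EVT_NET_RX, EVT_DISK_DONE):
            if self.pending & (1 << bit):
                self.pending &= ~(1 << bit)
                self.handled.append(bit)


class ToyXen:
    def hypercall(self, domain, op, arg):
        """Synchronous call: the result is returned directly to the caller."""
        if op == "mask_events":
            domain.masked = bool(arg)
            domain.run_event_handler()   # deliver anything held back
            return 0
        return -1                        # unknown operation

    def send_event(self, domain, bit):
        """Asynchronous notification: set the bit, then upcall if allowed."""
        domain.pending |= 1 << bit
        domain.run_event_handler()
```

The point of the sketch is the asymmetry: the domain blocks on a hypercall until it returns, whereas an event only marks a bit and is handled whenever the domain is ready for it.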
         Data Transfer: I/O Rings
The presence of a hypervisor means there is an additional protection
domain between guest OSes and I/O devices, so it is crucial
that a data transfer mechanism be provided that allows data to move
vertically through the system with as little overhead as possible.

Two main factors have shaped the design of the I/O-transfer
mechanism: resource management and event notification. For resource
accountability, the design attempts to minimize the work required to
demultiplex data to a specific domain when an interrupt is received
from a device. The overhead of managing buffers is carried out
later where computation may be accounted to the appropriate domain.
Similarly, memory committed to device I/O is provided by
the relevant domains wherever possible to prevent the crosstalk inherent
in shared buffer pools; I/O buffers are protected during data
transfer by pinning the underlying page frames within Xen.
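A minimal sketch of a shared request/response ring, in Python. The four-pointer layout (`req_prod`, `req_cons`, `rsp_prod`, `rsp_cons`) is my reading of the paper's description and should be treated as an assumption; the class `IORing` and its methods are invented for illustration, requests are served strictly in order, and plain Python objects stand in for descriptors that would reference guest-provided buffers.

```python
class IORing:
    """Fixed-size ring shared by a guest (front end) and a back end."""

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size
        self.req_prod = 0   # advanced by the guest when it queues a request
        self.req_cons = 0   # advanced by the back end when it takes a request
        self.rsp_prod = 0   # advanced by the back end when it posts a response
        self.rsp_cons = 0   # advanced by the guest when it reads a response

    def put_request(self, req):
        # A slot is reusable only after its response has been consumed.
        if self.req_prod - self.rsp_cons >= self.size:
            return False
        self.slots[self.req_prod % self.size] = req
        self.req_prod += 1
        return True

    def service(self, handler):
        """Back end: handle every queued request, writing responses in place."""
        handled = 0
        while self.req_cons < self.req_prod:
            req = self.slots[self.req_cons % self.size]
            self.req_cons += 1
            self.slots[self.rsp_prod % self.size] = handler(req)
            self.rsp_prod += 1
            handled += 1
        return handled

    def get_responses(self):
        """Guest: collect every response posted since the last call."""
        out = []
        while self.rsp_cons < self.rsp_prod:
            out.append(self.slots[self.rsp_cons % self.size])
            self.rsp_cons += 1
        return out
```

Because the guest can queue several requests before the back end runs, and the back end can post several responses before the guest looks, one notification in each direction can cover a whole batch.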
  Subsystem: CPU Scheduling
• Uses Borrowed Virtual Time algorithm.
  – Has low-latency wakeup of a domain when it receives
    an event.
  – Fast dispatch important to minimize effect of
    virtualization on OS subsystems that need to run in a
    timely fashion, e.g. TCP relies on timely delivery of
    acknowledgements to estimate round-trip times.
  – BVT uses virtual-time warping, which temporarily
    violates ideal “fair sharing” to favor recently-woken
    domains.
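A stripped-down sketch of the BVT idea, in Python. It keeps only the core rule (run the domain with the smallest effective virtual time, where a warped domain gets its warp value subtracted) and leaves out details of the real algorithm such as the context-switch allowance and limits on how long a domain may stay warped. The names `Dom`, `pick_next`, and `simulate` are made up for illustration.

```python
class Dom:
    def __init__(self, name, weight, warp=0.0):
        self.name = name
        self.weight = weight    # larger weight means a larger CPU share
        self.warp = warp        # how far this domain may jump ahead
        self.warped = False     # set when the domain is woken by an event
        self.avt = 0.0          # actual virtual time consumed so far


def effective_virtual_time(d):
    return d.avt - (d.warp if d.warped else 0.0)


def pick_next(doms):
    """Choose the runnable domain with the earliest effective virtual time."""
    return min(doms, key=effective_virtual_time)


def run(d, ticks):
    # Virtual time advances more slowly for heavier-weighted domains.
    d.avt += ticks / d.weight


def simulate(doms, total_ticks):
    """Run one tick at a time and count how often each domain is chosen."""
    counts = {d.name: 0 for d in doms}
    for _ in range(total_ticks):
        d = pick_next(doms)
        run(d, 1)
        counts[d.name] += 1
    return counts
```

Over a long run, CPU time divides roughly in proportion to the weights. Setting `warped` on a domain that has just received an event lets it run ahead of domains with less accumulated virtual time, which is the temporary unfairness the slide mentions.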
  Subsystem: Time and Timers
• Xen provides guest OSes with notions of
  – Real time
  – Virtual time
  – Wall-clock time: offset to real time
• Each guest OS can program a pair of
  alarm timers, one for real time and one for
  virtual time.
• Timeouts delivered using Xen’s event
  mechanism.
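A small sketch of the three notions of time and the pair of alarms, in Python. The class `DomainClock` and its methods are hypothetical, and the sketch assumes that virtual time advances only while the domain is actually running and that nanoseconds are the unit.

```python
class DomainClock:
    def __init__(self, wall_offset_ns):
        self.real_ns = 0                      # advances always
        self.virtual_ns = 0                   # advances only while running
        self.wall_offset_ns = wall_offset_ns  # wall clock = real + offset
        self.alarms = {}                      # "real"/"virtual" -> deadline
        self.fired = []                       # alarms delivered so far

    def set_alarm(self, kind, deadline_ns):
        if kind not in ("real", "virtual"):
            raise ValueError(kind)
        self.alarms[kind] = deadline_ns

    def advance(self, ns, running):
        """Move time forward; 'running' says whether the domain had the CPU."""
        self.real_ns += ns
        if running:
            self.virtual_ns += ns
        now = {"real": self.real_ns, "virtual": self.virtual_ns}
        for kind in ("real", "virtual"):
            deadline = self.alarms.get(kind)
            if deadline is not None and now[kind] >= deadline:
                del self.alarms[kind]
                self.fired.append(kind)       # stands in for event delivery

    def wall_clock_ns(self):
        return self.real_ns + self.wall_offset_ns
```

A real-time alarm can fire while the domain is descheduled, whereas a virtual-time alarm waits until the domain has consumed that much CPU time itself.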
     Virtual Address Translation
• Xen tries to virtualize this with as little overhead as
  possible.
   – Harder due to x86’s use of hardware page tables.
   – VMware: provide each guest OS with a virtual page table, not
     visible to the memory management unit. Hypervisor responsible
     for trapping accesses to the virtual page table, validating
     updates, and propagating changes back and forth between it
     and the MMU-visible “shadow” page table.
• Full virtualization forces use of shadow page tables; Xen
  is not so constrained.
• Xen only involved in page table updates to prevent guest
  OSes from making unacceptable changes.
• Approach: Register guest OS page tables directly with
  MMU, and restrict guest OSes to read-only access.
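A toy sketch of validated page-table updates, in Python. The guest never writes its page table directly; it hands a batch of proposed updates to the hypervisor, which checks each one. The two checks shown (the guest must own the target frame, and a frame holding a page table may not be mapped writable) are my simplified reading of the paper, and `ToyHypervisor`, `register_page_table`, and `mmu_update` are illustrative names, not Xen's real interface. Skipping bad entries and returning a count is also a choice made for this sketch.

```python
class ToyHypervisor:
    def __init__(self, frame_owner):
        self.frame_owner = dict(frame_owner)  # machine frame -> owning domain
        self.page_tables = {}                 # domain -> {vpage: (frame, rw)}
        self.pt_frames = set()                # frames that hold page tables

    def register_page_table(self, domain, frame):
        """Guest hands one of its own frames to the MMU as a page table."""
        if self.frame_owner.get(frame) != domain:
            raise PermissionError("frame not owned by domain")
        self.pt_frames.add(frame)
        self.page_tables.setdefault(domain, {})

    def mmu_update(self, domain, batch):
        """Validate and apply a batch of (vpage, frame, writable) updates.

        Returns how many updates were accepted; rejected ones are skipped.
        """
        table = self.page_tables[domain]
        applied = 0
        for vpage, frame, writable in batch:
            if self.frame_owner.get(frame) != domain:
                continue   # would map another domain's memory
            if writable and frame in self.pt_frames:
                continue   # would let the guest edit a page table unchecked
            table[vpage] = (frame, writable)
            applied += 1
        return applied
```

Batching is what keeps this cheap: one trap into the hypervisor covers many updates, while reads of the page table need no trap at all.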
            Physical Memory
• Initial memory allocation or reservation for each
  domain is specified at the time of its creation.
  Memory statically partitioned between domains,
  providing strong isolation.
• Maximum-allowable reservation also specified: if
  memory pressure in a domain increases, it may
  then attempt to claim additional memory pages
  from Xen, up to the limit.
• If a domain wants to save resources, can
  release pages back to Xen.
• XenoLinux implements a balloon driver, which
  adjusts a domain’s memory usage by passing
  memory pages back and forth between Xen and
  XenoLinux’s page allocator.
• Could modify Linux MM routines directly, but the balloon
  driver makes adjustments by using existing OS
  functions, thus simplifying Linux porting effort.
• Paravirtualization could be used to extend the
  capabilities of this driver: e.g. out-of-memory
  handling mechanism in the guest OS can be
  modified to automatically alleviate memory
  pressure by requesting more memory from Xen.
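A sketch of the reservation bookkeeping behind this, in Python. It models only the accounting: an initial reservation, a maximum, claiming pages up to that maximum, and releasing pages back. The class `ToyXenMemory` and its methods are hypothetical, and granting a partial claim when not enough pages are free is an assumption of this sketch.

```python
class ToyXenMemory:
    def __init__(self, total_pages):
        self.free = total_pages
        self.domains = {}     # name -> {"pages": current, "max": limit}

    def create_domain(self, name, initial, maximum):
        if initial > maximum or initial > self.free:
            raise ValueError("cannot satisfy reservation")
        self.free -= initial
        self.domains[name] = {"pages": initial, "max": maximum}

    def claim(self, name, wanted):
        """Domain asks for more pages; grant what the limits allow."""
        d = self.domains[name]
        granted = max(0, min(wanted, d["max"] - d["pages"], self.free))
        d["pages"] += granted
        self.free -= granted
        return granted

    def release(self, name, count):
        """Domain hands pages back (what a balloon driver does when inflating)."""
        d = self.domains[name]
        returned = max(0, min(count, d["pages"]))
        d["pages"] -= returned
        self.free += returned
        return returned
```

A domain under memory pressure calls something like `claim`; a balloon driver that inflates inside one guest corresponds to `release`, which is what makes pages available for another guest to claim.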
• Xen provides abstraction of virtual firewall-
  router where each domain has 1 or more
  network interfaces.
• Rules for transmit/receive/whatever.
•   Only Domain0 has direct access to physical disks.
•   All other domains access disk through abstraction of virtual block devices.
•   Domain0 manages the VBDs – keeps mechanisms in Xen very simple.
•   VBD comprises a list of extents with associated ownership and access
    control information.
•   Guest OS disk scheduling algorithm will reorder requests prior to queueing
    them on the ring in an attempt to reduce response time or to supply
    differentiated service.
•   Xen has more complete knowledge of actual disk layout, so we support
    reordering within Xen, and responses may come back out of order.
•   Xen services batches of requests from competing domains in a simple
    round-robin fashion; these are then passed to a standard elevator
    scheduler before reaching disk hardware. Domains can pass down reorder
    barriers to prevent reordering.
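A simplified sketch of the last two bullets, in Python: batches are taken from competing domains round-robin, and each batch is then sorted by block number except across reorder barriers. The function names, the `BARRIER` marker, and the use of bare block numbers as requests are all illustrative assumptions. A real elevator scheduler is more involved than a sort, and here it is applied per batch only.

```python
from collections import deque

BARRIER = "barrier"   # hypothetical marker a domain can place in its queue


def round_robin_batches(queues, batch_size):
    """Take up to batch_size requests from each domain in turn.

    queues maps a domain name to a deque of requests; it is drained.
    Returns a list of (domain, batch) in service order.
    """
    order = []
    while any(queues.values()):
        for dom, q in queues.items():
            batch = [q.popleft() for _ in range(min(batch_size, len(q)))]
            if batch:
                order.append((dom, batch))
    return order


def elevator(batch):
    """Sort requests by block number, never moving one across a barrier."""
    out, segment = [], []
    for req in batch:
        if req == BARRIER:
            out.extend(sorted(segment))
            segment = []
        else:
            segment.append(req)
    out.extend(sorted(segment))
    return out
```

Round-robin keeps one busy domain from starving the others, while the barrier gives a guest a way to insist on ordering where it matters.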