					            Operating Systems Engineering

                            Virtual Machines

By Dan Tsafrir, 25/5/2011

 1                                             OSE 2011– OSE – virtual machines
             What’s a virtual machine?
       A VM is a simulation of a full computer
          With its disk & NIC & OS & user-level apps, …

       Running as an application
          On some “host” computer
          Simulation is called a “guest”

                      VMs – requirements
       Simulation needs to be accurate 
          Emulate HW faithfully, handle weird quirks of kernels & such
          Reproduce bugs exactly
       Simulation needs to be isolated 
          Guest must not break out of VM
          SW inside guest might be faulty and/or malicious
       Simulation needs to be fast 
          Well, as fast as possible…
       Simulation needs to be believable 
          Guest shouldn’t be able to distinguish VM from real computer
              The “blue pill” saga
          In reality, if guests can accurately time stuff, they can know
          (And indeed, viruses often refuse to work when virtualized)

                         VMs – origin
       Late 1960s
          IBM used VMs to share mainframes

       Late 1990s
          VMware re-popularized VMs (for x86 HW)
          Economic boom: nowadays a multi-billion-dollar business
          Everyone is playing
              SW: Microsoft, IBM, Red Hat, Oracle, …
              HW: Intel, AMD, ARM, IBM, Oracle, …

                          VMs – why?
       For developers & power users

           One computer w/ multiple OSes
              My Win 7 laptop also runs Ubuntu
              My MacBook Pro @ home also runs XP (for office)

           Kernel development
              Like QEMU, but performs reasonably

                          VMs – why?
       Business case: saves money!
          Server consolidation
              Once we had underutilized machines per service…
              Reduces cost of HW, power consumption, cooling
          Portability (why should Intel/AMD/IBM care about consolidation?)
              Decouples OS from HW and makes upgrades easy
          Increased robustness
              Can backup entire machine + easily restore if HW breaks
              No need to reinstall all SW
              Can isolate important apps in their own VM (safety)
          Makes cloud models possible
              Such as Amazon’s EC2 (“Elastic Compute Cloud”)
          Certain costly sys-admin chores made much easier
              Provisioning a new machine (just clone ready image)
                     What’s in a name
       SW that runs the show (3 names referring to same thing):

           VMM
              Virtual machine monitor
           Hypervisor
              (Of IBM origin)
              Sometimes denoted “HV”
           ~Host

       VMMs
          Citrix Xen, KVM, VMware ESXi, MS Hyper-V, IBM pHyp, …

       2 possible settings
          Next 2 slides…

       Hosted VMM (“type 2 hypervisor”)

• Like VMware Workstation, Parallels, VirtualBox, QEMU, …
• Typically personal use

         Bare metal / native VMM
          (“type 1 hypervisor”)

    • XenServer, VMware ESXi, MS Hyper-V, IBM pHyp, …
    • Typically for servers, data centers, clouds
              VMM multiplexes HW
    Just like an OS…

    Divides memory among guests
       Related: de-duplication, ballooning

    Time-shares CPU among guests
       Related: notion of VCPU vs. PCPU (can hot-plug)

    Simulates per-guest virtual devices
       Disk
       Network,
       …

            Virtualization refinement
    Paravirtualization
       Guest OS is aware it is being virtualized
       For performance purposes
       Paravirtualized devices

    HW support
       Intel-VT
       AMD-V

     How to virtualize x86…


                        VMs – how?
    SW interpretation, instruction by instruction
       Can do it, but much, much too slow

    Idea1: when possible, execute VM’s instructions on real CPU
       Works fine for most instructions (e.g., add %eax, %ebx)

        But what about isolation? (e.g., VM writes outside its memory)

    Idea2: run VMs at CPL=3
       Ordinary instructions work fine
       Writing to %cr3 traps to VMM
          VMM examines guest’s page table
          VMM can manipulate page table if it wants
          Only then set %cr3 and resume VM

    This virtualization model is called: “trap & emulate”
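The trap & emulate flow above can be sketched as a toy model. This is only an illustration under stated assumptions: the classes `Guest` and `VMM`, the instruction names, and `build_shadow` are all hypothetical, and a real VMM would be driven by hardware faults rather than a Python loop.

```python
# Toy model of trap & emulate; every name here is hypothetical.
class Guest:
    def __init__(self):
        self.virtual_cr3 = None   # what the guest believes cr3 holds


class VMM:
    def __init__(self):
        self.real_cr3 = None      # what the hardware would actually use
        self.trap_count = 0

    def build_shadow(self, guest_cr3):
        # Placeholder: a real VMM would examine (and possibly rewrite)
        # the guest's page table here.
        return ("shadow-of", guest_cr3)

    def on_trap(self, guest, instr, operand):
        """Called when a privileged instruction faults at CPL=3."""
        self.trap_count += 1
        if instr == "mov_to_cr3":
            guest.virtual_cr3 = operand                 # keep the guest's illusion
            self.real_cr3 = self.build_shadow(operand)  # only then set real cr3
        else:
            raise NotImplementedError(instr)


def run(vmm, guest, program):
    """Run ordinary instructions directly; send privileged ones to the VMM."""
    regs = {"eax": 0, "ebx": 0}
    for instr, *args in program:
        if instr == "movi":           # ordinary: load immediate
            value, dst = args
            regs[dst] = value
        elif instr == "add":          # ordinary: runs "natively"
            src, dst = args
            regs[dst] += regs[src]
        else:                         # privileged: traps to the VMM
            vmm.on_trap(guest, instr, *args)
    return regs
```

Only the write to cr3 costs a trap in this model; the arithmetic never involves the VMM, which is the point of Idea1 and Idea2.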

           VMM hides real machine
    Virtual vs. real resources
       Virtual vs. real cr3
           Virtual cr3: the VM (thinks it) sets the real cr3
           Real cr3: exclusively managed (= virtualized) by VMM
       Virtual vs. real machine-defined data structures
           Virtual page table: VM thinks it’s real
           Real page table: real page tables virtualized by VMM
    VMM’s job
       Make guest see only virtual machine state
       Completely hide & protect real machine state
    Problems
       Trap-&-emulate is tricky on x86
           Not all privileged instructions trap at CPL=3
       All those traps can be slow…

         x86 state we must virtualize

state                 reason for hiding it
CPL (low bits of CS)  always 3; guest sometimes expects it to be 0
GDT descriptors       their DPL (descriptor priv level) is 3; guest may expect 0
gdtr                  points to “shadow” (real) GDT
IDT descriptors       trap to VMM code, not guest kernel (VMM forwards or
                      fakes interrupts to guest when necessary)
idtr                  points to “shadow” (real) IDT
page tables           entries don’t map to expected physical address
cr3                   points to “shadow” page table
IF in EFLAGS          interrupts must always be on when in guest mode
cr0                   can’t allow guest to go into real mode

    Letters
       H = host
       G = guest
       P = physical
       V = virtual
       A = address

    Combinations
       GVA = guest virtual address
       GP = guest physical
       HP = host physical
       …

     Providing guest with illusion of
      physical memory (simplistic)
    Guest view
       Wants to start at PA=0
       Wants to use all “installed” DRAM
    Host opposing view
       Must support several guests, they can’t all start at 0
       Must protect one VM’s memory from the others
    Idea
       Fake a smaller DRAM size than real DRAM
       Ensure paging is enabled
       Rewrite guest’s PTEs

     Providing guest with illusion of
      physical memory (simplistic)
    Example
       VMM allocates a guest phys mem 0x1000000 to 0x2000000
       VMM gets trap if guest changes cr3 (guest @ CPL=3)
       VMM copies guest's page table to "shadow" page table
       While copying, VMM adds 0x1000000 to each PA in shadow tab
       VMM checks that each resulting HPA is < 0x2000000
       Must copy the guest's page table
          So guest doesn't see VMM's modifications to PAs
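The example above can be sketched in a few lines. This is a minimal sketch, not a real VMM: a flat dict stands in for the multi-level page table, and the names `GUEST_BASE`, `GUEST_LIMIT` and `make_shadow` are assumptions made for illustration.

```python
GUEST_BASE = 0x1000000    # start of host memory given to the guest (from the example)
GUEST_LIMIT = 0x2000000   # end (exclusive) of that region


def make_shadow(guest_page_table):
    """Copy a guest page table, relocating each PA into the guest's region.

    guest_page_table maps a virtual page address to the physical address
    the guest believes it has (a flat dict replaces the real hierarchy).
    """
    shadow = {}
    for va, guest_pa in guest_page_table.items():
        hpa = guest_pa + GUEST_BASE
        # Reject mappings that would escape the guest's allocation.
        if not (GUEST_BASE <= hpa < GUEST_LIMIT):
            raise ValueError(f"guest PA {guest_pa:#x} is outside its memory")
        shadow[va] = hpa
    return shadow   # the guest's own table is left untouched
```

Because the shadow is a separate copy, the guest still reads back the PAs it wrote, while the hardware only ever sees the relocated and bounds-checked ones.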

       Address translation (reminder)
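       48-bit VA layout: p0 (9 bits) | p1 (9 bits) | p2 (9 bits) | p3 (9 bits) | offset (12 bits)

       [Figure: 4-level page-table walk; each table has entries 0..511]
          Entry p0 of the root table holds Q, the address of the 2nd-level table
          Entry p1 of table Q holds W, the address of the 3rd-level table
          Entry p2 of table W holds K, the address of the 4th-level table
          Entry p3 of table K holds the PA of the page

       4KB page-table page => 512 PTEs (8B each)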
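As a reminder of the arithmetic, the index extraction and walk can be sketched as follows. The function names and the dict-based `memory` model (table address mapped to a dict of entries) are assumptions made for illustration; real PTEs also carry permission and present bits, which are omitted here.

```python
def split_va(va):
    """Split a 48-bit VA into four 9-bit indices and a 12-bit offset."""
    if not 0 <= va < (1 << 48):
        raise ValueError("not a 48-bit virtual address")
    offset = va & 0xFFF
    p3 = (va >> 12) & 0x1FF
    p2 = (va >> 21) & 0x1FF
    p1 = (va >> 30) & 0x1FF
    p0 = (va >> 39) & 0x1FF
    return p0, p1, p2, p3, offset


def translate(memory, root_pa, va):
    """Walk the 4 levels; memory maps a table's PA to {index: next PA}."""
    *indices, offset = split_va(va)
    table_pa = root_pa
    for index in indices:
        table_pa = memory[table_pa][index]   # KeyError models a page fault
    return table_pa + offset
```

Each of the four iterations is one memory reference, which is why a native walk costs four references on a TLB miss.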

     Providing guest with illusion of
       physical memory (realistic)
    Host allocates N pages to guest
       No need for them to be contiguous in phys mem
       Host maintains a GPA_to_HPA mapping (say, using a hash)
       GPAs are contiguous

    What happens when guest changes cr3
      Assume guest assigns GPA1 to cr3
      A trap will occur and host will gain control
      Host’s goal:
         Generate, on the fly, the shadow page table hierarchy
         From GVA to HPA
         There’s only one such shadow hierarchy at any given time
           per core
     Providing guest with illusion of
       physical memory (realistic)
    The host’s actions
       Saves GPA1 internally
       Allocates brand new zeroed page = root of the shadow hierarchy
          Let base of new page be HPA1
       Assigns HPA1 to cr3
       Resumes guest, which immediately faults on GVA2
          GVA2 = virtual address of 1st fetched instruction of guest
       Takes 9 most significant bits from GVA2
          Assume 48bit VA = 4 levels hierarchy (9bits each) + 4KB page
          8 bytes per PTE
       Computes GPA_to_HPA(GPA1) + (top 9 bits of GVA2) * 8
          = HPA of the guest’s top-level PTE, which holds the GPA of
            the guest’s 2nd-level table
       …
     Providing guest with illusion of
       physical memory (realistic)
    The host’s actions (cont.)
       …
       Continue like so with next 9bits, repeatedly,
          Until reaching the HPA of the requested page = HPA2
          Now, there needs to be a GVA2=>HPA2 mapping in the
           shadow hierarchy
       Adds the translation GVA2=>HPA2 to shadow hierarchy
          Starting at HPA1 and allocating the rest of the levels in
           the hierarchy as needed
       Resumes guest
       Repeats same procedure when next fault occurs
          This continues until all address space is mapped
          Or until next context switch (=> need to start over)
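The host's actions on a cr3 write and on each subsequent fault can be sketched like this. It is a simplified model under stated assumptions: `ShadowMMU`, `host_mem` (host page base mapped to a dict of entries) and the nested-dict shadow hierarchy are hypothetical stand-ins, and guest faults that should be forwarded to the guest (a missing guest PTE) simply surface as `KeyError`.

```python
PAGE = 4096


def split_va(va):
    """Four 9-bit indices plus a 12-bit offset, as in the reminder slide."""
    return ((va >> 39) & 0x1FF, (va >> 30) & 0x1FF,
            (va >> 21) & 0x1FF, (va >> 12) & 0x1FF, va & 0xFFF)


class ShadowMMU:
    def __init__(self, gpa_to_hpa, host_mem):
        self.gpa_to_hpa = gpa_to_hpa   # guest page base -> host page base
        self.host_mem = host_mem       # host page base -> {index: entry}
        self.guest_cr3 = None
        self.shadow = None             # root of the shadow hierarchy

    def to_hpa(self, gpa):
        base = gpa & ~(PAGE - 1)
        return self.gpa_to_hpa[base] + (gpa & (PAGE - 1))

    def on_cr3_write(self, gpa1):
        self.guest_cr3 = gpa1          # save GPA1 internally
        self.shadow = {}               # brand-new, empty shadow root

    def on_page_fault(self, gva):
        *indices, _ = split_va(gva)
        # Walk the guest's own tables; every pointer in them is a GPA.
        table_gpa = self.guest_cr3
        for index in indices:
            table = self.host_mem[self.to_hpa(table_gpa)]
            table_gpa = table[index]
        hpa_page = self.to_hpa(table_gpa)   # HPA of the requested page
        # Add GVA=>HPA to the shadow, allocating levels as needed.
        node = self.shadow
        for index in indices[:-1]:
            node = node.setdefault(index, {})
        node[indices[-1]] = hpa_page
        return hpa_page

    def shadow_lookup(self, gva):
        """What the hardware would do with the shadow; KeyError = fault."""
        *indices, offset = split_va(gva)
        node = self.shadow
        for index in indices:
            node = node[index]
        return node + offset
```

Note that the shadow starts empty and is filled one faulting page at a time, so a context switch (another `on_cr3_write`) throws all of that work away unless the VMM caches it.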
     Providing guest with illusion of
       physical memory (realistic)
    Building shadow page tables is costly

    Can we cache?
       Yes, but need to write protect all pages involved
          Will generate trap whenever pages are modified
          Host would be able to respond accordingly
       The problem
          How do we know when to stop write-protecting?
       Solution
          Must employ some heuristic
          Need not be perfect, as long as it maintains correctness
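One way to picture the caching idea is a cache keyed by the guest's cr3 value, invalidated by write faults on the protected pages. The sketch below is an assumption-laden illustration: `ShadowCache` and its methods are hypothetical, and it uses the most conservative policy (drop the whole cached hierarchy on any write) rather than any particular VMM's heuristic.

```python
class ShadowCache:
    def __init__(self):
        self.by_cr3 = {}      # guest cr3 -> cached shadow hierarchy
        self.protected = {}   # guest page-table page -> cr3 values relying on it

    def store(self, guest_cr3, shadow, table_pages):
        """Cache a shadow and write-protect the guest pages it was built from."""
        self.by_cr3[guest_cr3] = shadow
        for page in table_pages:
            self.protected.setdefault(page, set()).add(guest_cr3)

    def lookup(self, guest_cr3):
        """Return the cached shadow on a cr3 switch, or None if absent."""
        return self.by_cr3.get(guest_cr3)

    def on_write_fault(self, page):
        """Guest wrote to a write-protected page-table page."""
        for cr3 in self.protected.pop(page, set()):
            # Conservative but correct: forget the shadow, rebuild lazily.
            self.by_cr3.pop(cr3, None)
```

Dropping too much only costs performance, never correctness, which is the sense in which the heuristic is allowed to be imperfect.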

Not all sensitive instructions trap at CPL=3
     Push CS
        Will show CPL=3 (not 0) if guest reads pushed value
     sgdt (save gdtr)
        Reveals real gdtr if guest reads it
     pushf
        Pushes real IF
        Always on in guest mode (why?)
        Host injects interrupts to guest as needed
     popf
        Ignores IF in CPL=3
        => no trap => host won’t know if guest wants interrupts
     iret
        Invoked, e.g., after handling a system call
        No ring change => SS/ESP will not be restored
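The popf case is the easiest to model. The sketch below tracks only the IF bit and assumes IOPL=0, as the slide implicitly does; the function name `popf_if` is hypothetical. In protected mode, popf executed with CPL greater than IOPL leaves IF unchanged and raises no fault.

```python
IF_MASK = 1 << 9   # interrupt-enable flag in EFLAGS


def popf_if(eflags, popped, cpl, iopl=0):
    """Return the new EFLAGS, modelling only how popf treats IF."""
    if cpl <= iopl:
        # Privileged enough: IF is taken from the popped value.
        return (eflags & ~IF_MASK) | (popped & IF_MASK)
    # Not privileged enough: IF is silently preserved, and no trap occurs.
    return eflags
```

A guest kernel at CPL=3 that tries to disable interrupts this way therefore keeps running with IF set, and the VMM never hears about the attempt.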

                 How can we cope?
    Solution: binary translation
       Rewrite guest code
       Change every problematic instruction to INT 3
       Keep track of original instructions + emulate in VMM
       Note: INT 3 is 1-byte long => small enough to overwrite any inst

    Must be done dynamically at runtime
       Need to know whether bytes are code or data
       Need to know where instructions start (x86 is CISC)
       Consequently, scan code only as executed
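The patching step can be sketched on a byte buffer. This is a minimal sketch: the helper names are hypothetical, and the list of instruction start offsets is assumed to come from scanning code as it executes. The opcode values (0xCC for INT 3, 0x9D for popf, 0x55 for push %ebp, 0x90 for nop) are standard x86 encodings.

```python
INT3 = 0xCC   # 1-byte breakpoint instruction
POPF = 0x9D   # one of the problematic instructions


def patch(code, offsets):
    """Overwrite the first byte of each listed instruction with INT3.

    Returns {offset: original byte} so the VMM can emulate the original
    instruction when the INT3 traps.
    """
    saved = {}
    for off in offsets:
        saved[off] = code[off]
        code[off] = INT3
    return saved


def original_at(code, saved, off):
    """First byte of the instruction the guest originally had at off."""
    return saved.get(off, code[off])
```

Since INT 3 is a single byte, saving one byte per patched instruction is enough; the remaining bytes of a longer instruction stay in place for the emulator to read.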

        Binary translation – example
    Write INT3 in place of
       Bad instructions (popf)
       First jump (jnz)
    Then start guest kernel
       INT3 traps to host
       Emulates popf
       Look where jump could go
    For each jump
       Translate upon the 1st encounter of block
       Keep track of translated code
       Next time, replace INT3 with original instructions if target
        is known (when j is direct)

    Assume guest kernel starts like so:
           pushl %ebp
           …
           jnz x
           …
           j?? y
        x:
           j?? z
              BT: indirect jumps & ret
    Same, but

         Can’t replace INT3 with original jump
         Since we’re not sure address will be the same next time
         ret = indirect jump via pointer on the stack
         must take trap every time (slow!)

    Can we speed up?
       Yes, by writing our own code rather than hacking the original
        => more aggressive translation, addresses change
       See VMware’s
          “A Comparison of Software and Hardware Techniques for x86
          Virtualization”, by Adams & Agesen, in ASPLOS 2006

         Read it to make sure you know how!
      Intel/AMD HW support for VMs
    Much easier to implement VMM w/ reasonable performance
    HW itself directly maintains per-guest virtual state
       CS (w/ CPL), EFLAGS, idtr, etc.
       In-memory HW struct can be loaded/unloaded like a context switch
    HW knows it’s in guest mode
       Instructions directly modify virtual state
       Avoids lots of traps to VMM
    HW basically adds a new privilege level
       VMM mode, CPL=0, ..., CPL=3
       Guest-mode/CPL=0 isn’t fully privileged
    No traps to VMM on system calls
       HW handles CPL transition
    No need for shadow page tables
       Next slide…

                    Nested paging
    In guest mode, there are *2* page tables in effect
       Guest page table & host page table
    Guest memory refs go through multiple lookups
       Guest tables hold GVA=>GPA translations
       HW knows this, so in every level of the hierarchy
       HW automatically translates GPA to HPA
       Continues the table walk process
       HW table walk can take ~20 memory refs
       => There’s a new “page table cache” (in addition to the TLB),
         which caches partial translations of the GVA in an attempt to
         skip levels (shown to be very effective)
    Thus, guest can directly modify its page table w/o VMM having
     to shadow it
       No need for VMM to write-protect guest page tables
       No need for VMM to track cr3 changes
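The "~20 memory refs" figure can be checked with a small count. The sketch assumes a completely cold walk (no TLB or page-table-cache hits) and 4-level guest and host tables; the function name is hypothetical.

```python
def nested_walk_refs(guest_levels=4, host_levels=4):
    """Memory references for one guest access under nested paging."""
    # For each guest level: translate the guest table's GPA through the
    # host tables, then read the guest PTE itself.
    per_guest_level = host_levels + 1
    # Finally, translate the GPA of the data page through the host tables.
    final_translation = host_levels
    return guest_levels * per_guest_level + final_translation
```

With the defaults this gives 4 * 5 + 4 = 24 references, against 4 for a native walk, which is why caching partial translations matters so much.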

                   Nested paging
    Is nested paging faster than shadow paging?
       Depends… (on what?)

    trap INB and OUTB
    DMA addresses are physical,
       VMM must trust devices or utilize HW support (IOTLB)
    Devices nowadays are typically shared (=> virtualized)
       If you want to share between multiple guests
       Each guest gets a part of the disk
       Each guest looks like a distinct Internet host
       Each guest gets an X window
    VMM might mimic some standard (or legacy) devices
       Regardless of actual h/w on host computer
    Guest might run paravirtualized drivers
       Typically aggregate messages before switching to VMM
    For high-performance I/O => device assignment
       Sharing through SR-IOV (new standard)
