					                                    Android Virtualization
                             Andreas Nilsson, Christoffer Dall, David Albert
                                           Columbia University
                                            Advisor: Jason Nieh




May 13, 2009

document created on: May 13, 2009
                                Abstract
    We present an ARM virtualization solution built on the Android oper-
ating system. The solution supports running unmodified guest operating
systems natively using Linux Kernel Virtual Machine (KVM).
    Users can choose an operating system from a virtual boot loader
menu and run it on mobile devices that run
Android.
    With mobile virtualization a user can have multiple personas on a
single device, easily use the same configuration on multiple devices, replay
execution, migrate processes and more.
Contents
1 Introduction

2 Related work
  2.1   x86 virtualization
  2.2   KVM on PowerPC
  2.3   ARM virtualization

3 ARM Virtualization
  3.1   Dynamic binary translation
  3.2   Translate to trap
  3.3   Basic block breakpoints
  3.4   Paravirtualization
  3.5   Memory issues

4 Porting KVM to ARM
  4.1   QEMU
  4.2   CPU virtualization
        4.2.1   Traps and interrupts
  4.3   Memory virtualization
        4.3.1   MMU emulation
        4.3.2   Shadow page table creation
        4.3.3   QEMU/KVM memory mapping
        4.3.4   Memory protection
  4.4   Interrupt handlers in virtual memory
  4.5   I/O emulation

5 Infrastructure work
  5.1   Android
  5.2   QEMU
  5.3   Physical device

6 Usage model
  6.1   Application screens
        6.1.1   Boot menu
        6.1.2   Guest utility applications

7 Evaluation

8 Limitations
  8.1   Design limitations
  8.2   Implementation limitations

9 Member contribution

10 Conclusion
  10.1  Future work

11 Resources

12 Acknowledgements

Appendices

A Sensitive instructions listing
  A.1   Privileged instructions
  A.2   Other instructions

B Code snippets
  B.1   Interrupt handler replacement
  B.2   Custom interrupt handlers
  B.3   Executing guest code natively
  B.4   Switching page tables
  B.5   Page table creation

1     Introduction
These days mobile devices such as mobile phones, Blackberries and smartphones
are becoming more and more powerful and they are being used more as small
computers than simply as phones. Virtualizing operating systems in desktop
and server environments has proven useful for a large number of reasons and we
expect a similar development for mobile devices.
Virtualization on mobile devices would remove the need to carry both a work phone
and a private phone, since the user could simply switch between isolated operating
systems on a single device. For the enterprise, tasks related to device administration,
such as backup and updates, will be simplified by being able to copy
virtual machine images from or to the mobile devices. Finally, features such
as process migration and enhanced security by “hypervisor isolation”[8] have
many interesting applications.
The solution for mobile virtualization, which we present here, is built on An-
droid. Android is an open source operating system for mobile devices. It is based
on Linux and is developed by Google, who keeps the source code synchronized
with the upstream version of Linux. Currently the only device available in retail
is the HTC Dream G1 device.
As a part of the Android SDK, Google provides a device emulator for Android[6],
which is built on the QEMU system emulator[2]. A naive approach to a mobile
virtualization solution would be to simply run QEMU within Android (footnote 1).
However, since all instructions are emulated, the performance is unacceptably
slow.
The solution to this issue is to use kernel acceleration, in our case KVM[10],
which makes it possible to run guest code natively from QEMU. KVM was
built as an interface to virtualizable architectures (Intel VT-x, AMD-V and
PowerPC) and makes it possible to use the Linux kernel as a hypervisor. Cur-
rently, KVM has no support for ARM. The main focus of this project was to
port the QEMU/KVM functionality to the ARM platform.


2     Related work
Virtualization has been around since the middle of the 1970s. However, it
regained popularity at the end of the last decade, when hardware was becoming fast
enough to enable several full-fledged operating systems to run simultaneously.


2.1     x86 virtualization
The ground-breaking x86 hosted virtualization solution VMware Workstation
was released in 1999. VMware uses a binary translation approach to overcome
the non-virtualizability of the x86 instruction set [12]. The VMware application
includes a kernel part, which enables instructions to be run natively using
binary translation and trap-and-emulate. The open source emulator QEMU uses
a similar approach, improved by the kernel acceleration module KQEMU.
    Footnote 1: QEMU already provides support for ARM hosts, even though the
    cross-compilation process is non-trivial.
Hardware vendors have recently begun equipping their x86 cpus with native
virtualization support (Intel VT and AMD-V). The solution essentially boils
down to providing a new cpu guest mode. When running in this mode, sensitive
instructions automatically trap, so they can be handled by a VMM (Virtual
Machine Monitor).
A native virtualization infrastructure called KVM (Kernel Virtual Machine) has
been included in the Linux kernel since version 2.6.20. The idea behind KVM
is that a hypervisor requires support for memory handling, process scheduling,
device initialization, driver support and more. Since the Linux kernel already
supports all of the above, it makes sense to reuse this functionality for a VMM.
KVM does just that by implementing a kernel module, which utilizes the x86
hardware virtualization support to achieve virtualization through a trap-and-
emulate scheme.
KVM exposes an ioctl interface to a user space program, which supports op-
erations such as ’create vm’, ’run virtual cpu’, ’map memory’ and more. KVM
performs the necessary emulation and sends a signal back to the user space pro-
gram when it requires i/o or device emulation. Currently, the only user space
application using KVM is a modified version of QEMU.


2.2    KVM on PowerPC
KVM has recently been ported to the PowerPC instruction set by Hollis Blan-
chard from IBM [4]. The PowerPC does not have hardware virtualization sup-
port but the instruction set is virtualizable so all privileged instructions trap in
user mode. Thus the PowerPC instruction set was fit to be used directly with
KVM. There are many similarities between ARM and PowerPC and our ARM
implementation borrowed as much input from the PowerPC port as possible.
We have been in personal contact with Hollis Blanchard regarding the PowerPC
KVM and he also gave us some pointers for ARM.


2.3    ARM virtualization
Since the ARM instruction set is typically used on embedded devices, there has
traditionally been less effort put into ARM virtualization. Recent years,
however, have seen a great increase in virtualization work for ARM.
The ARM architecture is unfortunately not strictly virtualizable, as there are
twelve privileged instructions, which are simply executed as no-ops in user mode
instead of trapping to the kernel. With no hardware virtualization support, this
fact rules out the possibility of a pure trap-and-emulate solution.


In the open source domain, a big effort has been put forward by a Samsung-based
team, which has ported the paravirtualization solution Xen to ARM [7].
Their work was partly based on the Master’s Thesis “Fast Secure Virtualization
for the ARM Platform” by Daniel Ferstay [5]. The ARM port of Xen investigates
and solves some interesting virtualization problems for the ARM architecture,
especially with regard to memory protection. However, a substantial amount of
their work went into designing the hypercall interface.
A group of students from the University of Illinois developed an ARM hypervisor
prototype from scratch. It seems, however, that the team had to spend most
of their time coding pure OS functionality such as drivers and scheduling [11].
Another project, which we have so far only examined briefly, is an ARM hypervisor
based on the object-oriented OS Choices [3].
In the closed source domain there are two main players. VMware is currently
developing a product called MVP (Mobile Virtualization Platform) based on the
paravirtualization solution Trango. It is rumored that MVP will be a scaled-down
version of ESX. MVP is supposed to support closed-source OSes such
as Windows Mobile and Symbian, which would require a binary translation
approach. VirtualLogix VLX is another commercial ARM virtualization solution,
mainly targeted at running real-time operating systems concurrently with
Linux.


3    ARM Virtualization
Running KVM on an ARM host may sound contradictory at first, since KVM
has been referred to as a ’simple interface to the hardware virtualization features’
and ARM processors do not provide any such hardware features. However, KVM
has been ported to run on the PowerPC architecture, which does not have
hardware virtualization support either, but does support a trap-and-emulate
virtualization solution. The point in using KVM for such a solution is to reuse
the remaining infrastructure that KVM provides, such as the interface to user
space, architecture independent meta-data, memory mapping, i/o emulation and
interface emulation.
By the same notion, we want to use KVM features as much as possible for the
ARM architecture. As mentioned in section 2.3, the problem with using KVM on
an ARM host is the fact that some privileged instructions are executed as no-ops
when run in user mode. Since the guest operating system kernel will be running
in user mode and from time to time execute these instructions, we have to find
a way to emulate correct execution of these instructions.
Several solutions present themselves for this problem: dynamic binary translation
(as VMware does in their desktop products), translate to trap, breakpoints
and finally paravirtualization. Each of these methods is discussed in the
subsequent sections.



If we succeed in finding a good and simple approach for making KVM work
efficiently on ARM, it might be interesting to consider KVM as a general
open infrastructure to build VMMs on for any kind of virtualizable and non-
virtualizable host architectures based on the Linux kernel. The original KVM
implementation is designed to leverage hardware virtualization and Linux ker-
nel features. There is no inherent conflict in generalizing KVM to also support
non-virtualizable architectures.


3.1    Dynamic binary translation
Dynamic binary translation scans the code at run-time, replaces potentially
unsafe sections with a set of instructions that emulate the required behavior,
and subsequently runs the code natively. The challenge with this approach is
knowing when to translate and how. Further, since the translation might replace
a single instruction (in the extreme case) with a set of instructions, it needs to
handle jumps and branches and possibly even relocation of relative addressing
and more.
Custom code has to be generated for each case of problematic instructions and
for performance reasons it is necessary to create a cache of translated code
blocks.


3.2    Translate to trap
A simplified translation approach is to locate all problematic instructions and
replace them with a single SWI instruction (software interrupt). This would ease
the binary translation scheme somewhat, however it is necessary to record the
original instruction including its operands somewhere and be able to retrieve
this information when handling the SWI instruction.
This translation could take place at run time, as for normal binary translation
or as a prepatching process based on a static analysis of the binary. If we know
the format of the operating system binary (as is the case for Linux), we could go
through the binary’s text segments and perform this translation before starting
the guest OS. This would reduce the overhead of translation greatly, but it is not
without problems. We have to be able to separate code from data and would thus
have to do a recursive descent traversal of the code flow from all possible entry
points. Additionally, we would have to figure out how to encode a problematic
instruction into a SWI call while preserving all necessary information about the
instruction, including operands and literal values.
The prepatching approach could also be problematic for other binary formats
than Linux (such as Windows Mobile or Symbian), since we simply don’t know
the formats beforehand. Modules loaded at run-time for the guest kernel would
also have to be prepatched, and it is not apparent how this would be implemented.
An OS-specific translation would have to be implemented for each
supported guest OS, which would require resources both now and in the future
to keep up with future releases. The approach is still interesting though, since
it provides a way to paravirtualize operating systems, for which we do not have
the source code and are thus not able to compile.


3.3    Basic block breakpoints
An intermediate step before a full-fledged dynamic binary translation approach
could be to translate a basic block at a time, by setting breakpoints at each
control statement and at each privileged instruction. The cpu would enter
debug mode, which could be handled in the KVM module. In the handler for
prefetch abort (the exception caused by a breakpoint), the problematic instruction
can be decoded and emulated, the program counter set to the next instruction,
and execution continued from there.
We recognize that performance might suffer somewhat in this solution, but it
will probably be a fast way to have a system running and we could experiment
with different ways to improve performance after that has been accomplished.


3.4    Paravirtualization
Paravirtualization requires changes to the source code of the guest OS to cause
traps on every privileged instruction in a naive implementation. It would still
be necessary to encode all information about the original instruction including
operands and literal values.
The main problem with paravirtualization is obviously that it only supports
open-source operating systems and requires separate changes for each version
of the OS. Many mobile operating systems available at the time of writing are
not open source, which makes paravirtualization less attractive considering
the usage model (see section 6).


3.5    Memory issues
Virtualizing memory poses a few general challenges and some ARM-specific ones.
Since the guest is running natively and is not emulated, so-called shadow page
tables need to be created. Whenever the guest tries to access memory, the MMU
will translate the virtual address to a physical address. In general this translation
needs to be from guest virtual memory → host physical memory. The shadow page
tables need to be switched in whenever the guest is run and switched out on
each trap to the host. During the guest boot up process the virtual MMU is
initially disabled and the CPU operates in real mode. This is a special case
where the shadow page tables need to translate from guest physical memory →
host physical memory. See figure 3 in section 4.3.2.



The ARMv5 architecture unfortunately features virtually tagged caches and a
TLB without any ASID (Address Space Identifier). These two facts make
guest traps very expensive, since the TLB and the caches need to be flushed
every time. The ARMv6 architecture, on the other hand, has physically tagged
caches and uses ASIDs in the TLB.


4     Porting KVM to ARM
Porting KVM to support ARM processors involves writing code in the Linux
kernel and in QEMU. Even though KVM provides a general interface to be used
by any user space emulator, QEMU is the only one implemented to this date,
and there is no need to write something from scratch for ARM.
There is a significant portion of architecture specific KVM code in the Linux
kernel. Everything related to switching processor modes (when running the
guest), changing page tables and invalidating caches must be rewritten for each
supported architecture. KVM for x86 is implemented as a kernel module, but
the PowerPC implementation is not. The reason is that the PowerPC implementation
hijacks the interrupt handlers, which is considered bad practice from
a module due to memory mapping issues. Since it is also necessary to hijack the
interrupt handlers for an ARM implementation (see 4.2), our proposed solution
uses a patched kernel instead of a module.
While vanilla QEMU has emulation support for ARM hosts and guest envi-
ronments, it ignores attempts to enable KVM in that environment. Thus some
amount of boilerplate code had to be written for QEMU. QEMU, as well as
KVM, maintains a virtual cpu state, which must be synchronized between KVM
and QEMU resulting in a need for architecture specific KVM code in QEMU.


4.1    QEMU
QEMU is a user- and system emulator. It is used in the context of KVM as
the user space application, which interacts with the user, sends messages to
the kernel module and performs i/o and device emulation. Support for KVM
is enabled through compilation configuration and runtime parameters. We have
developed our code based on the latest upstream QEMU git repository.
KVM traditionally provided a user space package called ’KVM userspace’, but
this was discontinued as of April 29, 2009 and replaced by the upstream QEMU
git repository. Cross-compiling QEMU for ARM hosts is not a trivial challenge
and is naturally a prerequisite to implementing KVM for ARM. The specifics of
how to accomplish this can be found in the Android Virtual Machine Wiki (see
section 11). In addition to the cross-compilation settings there are a number of
additions to the build system to link with the patched kernel and accept KVM
parameters.



The interaction between QEMU and KVM is actually rather simple. At startup,
QEMU initializes its own data structures and performs ioctl calls on the
/dev/kvm device to create a KVM virtual machine, create a virtual cpu and
set up KVM memory regions. Finally, QEMU maps a boot image into memory
and enters a loop, which continuously performs an ioctl call (footnote 2) to run
the guest. On each return from the ioctl call, some number of instructions have
been executed in the guest. QEMU considers the reason for the guest exit and
acts correspondingly. For example, if the exit was due to system shutdown or
an error, QEMU will leave the execution loop and exit the program gracefully.
On the other hand, if KVM required i/o from QEMU, QEMU will emulate the i/o
and re-execute the ioctl to run the guest.
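The execution cycle described above can be sketched in a few lines of C. This is a simplified simulation and not QEMU's actual code: the exit_reason values, the run_guest_fn callback (standing in for the ioctl that runs the guest), emulate_io and the scripted guest are all hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Hypothetical exit reasons; the real KVM interface defines its own set. */
enum exit_reason { EXIT_IO, EXIT_SHUTDOWN, EXIT_ERROR };

/* Stand-in for the ioctl that runs the guest: returns why the guest stopped. */
typedef enum exit_reason (*run_guest_fn)(void *ctx);

struct loop_stats {
    unsigned io_emulated;  /* number of i/o requests emulated in user space */
    int      clean_exit;   /* 1 if the guest shut down, 0 on error */
};

/* Placeholder for device emulation; a real emulator would update a device model. */
static void emulate_io(struct loop_stats *stats)
{
    stats->io_emulated++;
}

/* Run the guest until it shuts down or fails, emulating i/o in between. */
static struct loop_stats execution_loop(run_guest_fn run_guest, void *ctx)
{
    struct loop_stats stats = { 0, 0 };

    for (;;) {
        switch (run_guest(ctx)) {
        case EXIT_IO:
            emulate_io(&stats);      /* then re-enter the guest */
            break;
        case EXIT_SHUTDOWN:
            stats.clean_exit = 1;
            return stats;
        case EXIT_ERROR:
        default:
            return stats;
        }
    }
}

/* A scripted "guest" for experimentation: replays a fixed list of exits. */
struct script {
    const enum exit_reason *exits;
    size_t pos;
};

static enum exit_reason scripted_guest(void *ctx)
{
    struct script *s = (struct script *)ctx;
    return s->exits[s->pos++];
}
```

Passing scripted_guest with a list such as two i/o exits followed by a shutdown drives the loop through both branches without needing a kernel that provides /dev/kvm.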
Device i/o is performed exclusively using MMIO on an ARM architecture, which
is accomplished by mapping a region of the machine’s physical memory to a
region of physical memory as seen from the guest perspective (footnote 3). QEMU can
inject interrupts (from emulated hardware events) into the guest by setting a
flag on a data structure shared between KVM and QEMU, which causes KVM
to execute the guest’s interrupt handler instead of resuming at the saved program counter.
Thus QEMU can easily emulate a graphics device, keyboard, mouse, disk and
even USB devices. However, KVM needs to handle emulation of the cpu and
MMU operation by itself, since these functions require execution in privileged
mode and access to kernel data structures.
In the PowerPC implementation, KVM also emulates a timer controller (PIC).
However, we will let QEMU emulate a timer by the use of normal system timers
(SIGALRM) and inject those interrupts into the guest. This is more sensible on
ARM hardware than on PowerPC hardware, since the PIC is a separate device
on ARM and an intrinsic part of the hardware on the PowerPC chip.


4.2     CPU virtualization
The need to virtualize the cpu comes from running all of the guest in user
mode on the physical cpu. For example, when the guest is executing kernel
code it thinks that it is running in supervisor mode, and it is up to the KVM
implementation to guarantee correct execution despite this discrepancy.
In practice, cpu virtualization is accomplished by keeping a structure describing
a virtual state of the cpu as seen from the guest’s perspective. When the guest
queries the mode of the cpu, KVM must make sure that the value in the virtual
cpu and not the value in the physical cpu is returned.
    Footnote 2: QEMU issues the KVM_RUN ioctl call on a file descriptor representing
    the virtual cpu, which was returned by the previous ioctl calls.
    Footnote 3: This is done by creating KVM memory slots through ’set user memory
    region’ ioctl calls.

[Figure 1: QEMU startup and execution cycle. The flowchart runs from vl.c: main()
through kvm_init() (/dev/kvm, KVM_CREATE_VM, map KVM_RUN struct), machine->init()
(integratorcp_init(), set user memory regions), kvm_sync_vcpus() (KVM_GET_REGS,
KVM_SET_REGS) and main_loop() to cpu_exec(), kvm_cpu_exec() and KVM_RUN, after
which the return code is evaluated: debug, handle I/O, or shutdown or error,
the last of which exits QEMU.]

[Figure 2: The roles of QEMU and KVM. KVM handles the CPU and the MMU; QEMU
emulates disk, graphics, keyboard, mouse and timer. The two are connected by MMIO
and by injection of emulated IRQs. Hardware IRQs reach the host kernel, which
delivers signals.]

As an example of the use of the virtual cpu state, consider the execution mode.
ARM processors store their execution mode in bits [4:0] of the ’current program
status register’ (CPSR), which is copied to and from general purpose registers
using the instructions MRS and MSR, respectively. During guest execution,
we make sure that execution of these instructions traps to a special handler
in KVM, which emulates the instruction. For example, given the instruction

MRS r0, cpsr

KVM would copy the CPSR field from the virtual cpu into r0 before resuming
execution. For a complete listing of all the instructions that require emulation,
please refer to Appendix A.
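The MRS emulation described above can be sketched as follows. This is a minimal illustration under our own assumptions rather than the code of the KVM port: the struct vcpu layout and the function names are hypothetical, only the MRS encoding (cond 0001 0R00 1111 Rd 0000 0000 0000) and the mode field in CPSR bits [4:0] are taken from the ARM architecture, and advancing the program counter past the emulated instruction is left out.

```c
#include <stdint.h>

#define MODE_MASK 0x1Fu   /* CPSR bits [4:0] hold the execution mode */
#define MODE_USR  0x10u   /* user mode */
#define MODE_SVC  0x13u   /* supervisor mode */

/* Hypothetical virtual cpu state as seen from the guest's perspective. */
struct vcpu {
    uint32_t regs[16];  /* r0-r15 */
    uint32_t cpsr;      /* virtual CPSR */
    uint32_t spsr;      /* virtual SPSR of the current virtual mode */
};

/* MRS Rd, CPSR/SPSR is encoded as: cond 0001 0R00 1111 Rd 0000 0000 0000 */
static int is_mrs(uint32_t insn)
{
    return (insn & 0x0FBF0FFFu) == 0x010F0000u;
}

/* Emulate MRS against the virtual state.
 * Returns 0 on success and -1 if insn is not an MRS instruction. */
static int emulate_mrs(struct vcpu *vcpu, uint32_t insn)
{
    uint32_t rd;
    int use_spsr;

    if (!is_mrs(insn))
        return -1;

    rd = (insn >> 12) & 0xFu;          /* destination register */
    use_spsr = (int)((insn >> 22) & 1u); /* R bit selects SPSR over CPSR */
    vcpu->regs[rd] = use_spsr ? vcpu->spsr : vcpu->cpsr;
    return 0;
}
```

With a virtual CPSR whose mode bits equal MODE_SVC, emulating the word 0xE10F0000 (MRS r0, cpsr) leaves the supervisor mode value in the virtual r0, even though the physical cpu is running the guest in user mode.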


4.2.1   Traps and interrupts

The ARM architecture defines seven exceptions, five of which are traps (synchronous
interrupts) and two of which are interrupts (asynchronous interrupts).
The traps are:

    - Reset
    - Undefined instruction
    - Software interrupt

    - Prefetch abort (instruction page fault)
    - Data abort (data page fault)




The two interrupts are IRQ and FIQ, where IRQ is raised from external hard-
ware and can be considered a ’normal’ interrupt and FIQ is a ’fast’ interrupt
using banked registers designed for very fast and compact interrupt handling.
When a trap is generated by an instruction executed in the guest operating
system, KVM must handle it differently from the way the Linux kernel usually
handles it. A reset must reset the virtual cpu state, and an undefined
instruction exception must invoke the guest kernel’s handler to deal with its
faulty user space program. A software interrupt must also invoke the guest
kernel’s handler, allowing the guest kernel to service a guest user process. Finally,
prefetch and data aborts must be carefully handled by the host’s emulated
MMU or forwarded to the guest (see section 3.5).
Asynchronous interrupts are always related to real hardware, to which the guest
operating system is never exposed. Thus KVM does not need to handle this
type of interrupt but can leave that up to the host kernel. The only caveat is
that timer interrupts must be delivered to the guest kernel’s interrupt handler
in some connection with hardware generated timer interrupts. That is easily
accomplished by making sure that whenever the guest is scheduled after a hardware
interrupt, the KVM layer is invoked before the guest code, so that the
KVM layer can check for pending timer interrupts or signals to the QEMU user
space process.
In addition to normal traps and interrupts there has to be a mechanism for
the guest code to trap to the host kernel for the purpose of emulation. This is
accomplished by issuing a software interrupt (SWI/SVC instruction) instead of
the sensitive instruction, which must be emulated. Since we are translating the
guest code on a block by block basis, we also have to replace every control flow
changing instruction with a software interrupt to allow for translation of the
next basic block.
This approach is sufficient to get a guest kernel running, but in the future it is
important to optimize, for example, loops so that they do not trap on each iteration.
We are not too concerned about the performance penalties caused by replacing
sensitive instructions, as traps are relatively cheap due to a quite optimized and short
pipeline on the ARM architecture (contrary to the x86 architecture). On every
dispatching of the guest, we call the function kvmarm_pre_guest_enter and on
every exit we call the function kvmarm_pre_guest_exit, which can be found in
arch/arm/kvm/arm_host.c in the source tree.
In kvmarm_pre_guest_enter, the guest code to be run is analyzed and the first
basic block is determined. The basic block will end with either a control flow
changing instruction, a sensitive instruction or the end of a page of instructions.
That last instruction is replaced with a software interrupt and the original
instruction is stored in memory for later access in the KVM custom software
interrupt handler. When the custom handler for a software interrupt from the
guest is invoked, kvmarm_pre_guest_exit is first called, which restores the
original instruction, and KVM can then determine whether it must emulate an
instruction, inject an interrupt into the guest kernel or return to QEMU.
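The patching step described above can be illustrated with a small, simplified sketch. It is not the code from the source tree: struct patch, patch_block and unpatch_block are hypothetical names, the block is treated as a plain array of 32-bit ARM instruction words, and the classification only recognizes B/BL branches and MRS as one example of a sensitive instruction. A complete decoder would have to cover every instruction in Appendix A and every instruction that writes the program counter.

```c
#include <stddef.h>
#include <stdint.h>

#define SWI_TRAP 0xEF000000u  /* SWI #0 with the "always" condition */

/* Simplified classification, see the assumptions stated above. */
static int is_branch(uint32_t insn)      /* B and BL: bits [27:25] == 101 */
{
    return (insn & 0x0E000000u) == 0x0A000000u;
}

static int is_sensitive(uint32_t insn)   /* MRS only, as an example */
{
    return (insn & 0x0FBF0FFFu) == 0x010F0000u;
}

/* Hypothetical record of the single instruction that was replaced. */
struct patch {
    size_t   index;     /* position of the patched word in the block */
    uint32_t original;  /* instruction that was replaced */
    int      active;    /* 1 while the software interrupt is in place */
};

/* Find the end of the basic block starting at code[0] and replace its
 * last instruction with a software interrupt. The block ends at the first
 * branch, the first sensitive instruction, or the last word of the page. */
static void patch_block(uint32_t *code, size_t page_words, struct patch *p)
{
    size_t i = 0;

    p->active = 0;
    if (page_words == 0)
        return;

    while (i + 1 < page_words && !is_branch(code[i]) && !is_sensitive(code[i]))
        i++;

    p->index = i;
    p->original = code[i];
    p->active = 1;
    code[i] = SWI_TRAP;
}

/* Restore the original instruction, as done on guest exit. */
static void unpatch_block(uint32_t *code, struct patch *p)
{
    if (p->active) {
        code[p->index] = p->original;
        p->active = 0;
    }
}
```

Keeping the original word in a side record rather than encoding it into the SWI operand avoids the problem of fitting a full 32-bit instruction into the 24-bit immediate field of SWI.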


Since KVM needs to know the exact state of the hardware registers at the time
of the trap, KVM must 'hijack' the low-level interrupt handlers in assembly
code. The ARM architecture executes code from pre-defined virtual addresses
for each of the interrupts (both asynchronous and synchronous), located from
0xFFFF0000 to 0xFFFF001C (footnote 4). In short, we store the original kernel jump table,
overwrite the content of these addresses to execute our custom code and possibly
execute the kernel code subsequently. For details, please see the source code in
Appendix B.1.
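To make the address range concrete, the following sketch computes the high vector address of each exception and simulates saving and replacing the jump table. The vector order and the 4-byte spacing follow the ARM architecture; the enum names, struct vector_table and hijack_vectors are hypothetical, and the table is modelled as opaque 32-bit words, whereas the real slots hold branch or load instructions (see Appendix B.1 for the actual code).

```c
#include <stdint.h>

#define HIGH_VECTOR_BASE 0xFFFF0000u

/* Order of the ARM exception vectors; each slot is one 4-byte word. */
enum arm_vector {
    VEC_RESET = 0,
    VEC_UNDEFINED,
    VEC_SWI,
    VEC_PREFETCH_ABORT,
    VEC_DATA_ABORT,
    VEC_RESERVED,      /* unused slot between data abort and IRQ */
    VEC_IRQ,
    VEC_FIQ,
    VEC_COUNT
};

/* Virtual address from which the cpu fetches for a given exception. */
static uint32_t vector_address(enum arm_vector v)
{
    return HIGH_VECTOR_BASE + 4u * (uint32_t)v;
}

/* Simulated vector page. */
struct vector_table {
    uint32_t slot[VEC_COUNT];
};

/* Save the kernel's entries, then install the custom ones. */
static void hijack_vectors(struct vector_table *live,
                           struct vector_table *saved,
                           const struct vector_table *custom)
{
    *saved = *live;   /* kept so the host handlers can still be reached */
    *live = *custom;
}
```

The seven exceptions plus the reserved slot account for the eight words between 0xFFFF0000 and 0xFFFF001C.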
Whenever the custom handlers are invoked, they first check whether the current process
is in fact a guest operating system. If so, the KVM-specific handling is performed, and if
not, control is simply passed to the host kernel handlers. For the specifics of
how this works, please refer to the source code in Appendix B.2.


4.3     Memory virtualization
4.3.1    MMU emulation

As was mentioned in section 3.5, we need to emulate the MMU in KVM by
creating shadow page tables. Hardware-wise, ARM has a two-level page table
structure with a first level consisting of 4096 entries and a second level with 256
entries. Linux expects one page table per page and needs to store more information
per pte (page table entry) than the ARM hardware supports. Therefore
Linux does some tricks to let both architectures live side by side. In order to
avoid dealing with the double page table structure and be able to move forward
quickly, we try to create the shadow page tables using built-in Linux kernel
functions.
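As a worked example of the hardware layout, the sketch below splits a 32-bit virtual address into the two table indices and the page offset. It assumes 4 KB pages, which is what the 4096 and 256 entry counts imply (12 + 8 + 12 bits); the struct and function names are ours.

```c
#include <stdint.h>

#define L1_ENTRIES      4096u  /* first-level table entries */
#define L2_ENTRIES      256u   /* second-level table entries */
#define PAGE_SIZE_BYTES 4096u  /* assumed 4 KB pages */

struct va_parts {
    uint32_t l1_index;  /* index into the first-level table */
    uint32_t l2_index;  /* index into the second-level table */
    uint32_t offset;    /* byte offset within the page */
};

static struct va_parts split_va(uint32_t va)
{
    struct va_parts p;

    p.l1_index = va >> 20;            /* top 12 bits, 4096 possible values */
    p.l2_index = (va >> 12) & 0xFFu;  /* next 8 bits, 256 possible values */
    p.offset   = va & 0xFFFu;         /* low 12 bits */
    return p;
}
```

For the address 0xC0008123 this yields first-level index 0xC00, second-level index 0x08 and offset 0x123. Each first-level entry therefore covers 1 MB of the address space, and 4096 x 256 x 4096 bytes covers the full 4 GB.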
The address of the currently used page table is stored in register 2 (c2) of the
system control co-processor 15. In order to switch page tables we invalidate the
memory caches and simply update the register value. See Appendix B.4 for an
example code snippet of how to switch page tables.


4.3.2    Shadow page table creation

We leverage Linux to create the shadow page tables for us as far as possible.
However, for the most part we had to create our own functions mimicking
the behaviour of their Linux counterparts, since Linux typically operates on the
current mm whereas we need to work on a custom shadow page table mm. For
code snippets from the page table creation, see Appendix B.5.
The guest boots in real mode (that is, with the MMU disabled) and hence we
first create a boot page table that maps guest physical addresses → host
physical addresses. As soon as the boot code in head.S sets up a basic page
table structure and enables the MMU we need to intercept any guest page table
updates and update the shadow page tables accordingly. The guest OS will keep
unique page tables for every process and each of them needs to have a
corresponding shadow page table.

   4 These addresses are the high vector addresses, which are always the ones used once Linux
has started up.
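
A shadow page table entry is the composition of two mappings: the guest's own
virtual-to-physical mapping and the guest-physical-to-machine mapping kept by
the hypervisor. The sketch below shows that composition on simple address
ranges, using the example addresses from Figure 5. The range_map type and both
functions are hypothetical and stand in for real page table walks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Maps the source range [src, src + len) onto the range starting at dst. */
struct range_map {
    uint32_t src;
    uint32_t dst;
    uint32_t len;
};

/* Translate addr through the first range containing it; 0 on success. */
int translate(const struct range_map *maps, size_t n, uint32_t addr, uint32_t *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (addr >= maps[i].src && addr - maps[i].src < maps[i].len) {
            *out = maps[i].dst + (addr - maps[i].src);
            return 0;
        }
    }
    return -1;
}

/* Shadow translation: guest virtual -> guest physical -> machine. */
int shadow_translate(const struct range_map *guest_pt, size_t n_pt,
                     const struct range_map *gpa_to_machine, size_t n_slots,
                     uint32_t gva, uint32_t *machine)
{
    uint32_t gpa;
    if (translate(guest_pt, n_pt, gva, &gpa) != 0)
        return -1;  /* no guest mapping: the guest itself would fault here */
    return translate(gpa_to_machine, n_slots, gpa, machine);
}
```

With a guest mapping of 0xCEFF0040 to guest physical 0x0 and a slot mapping of
guest physical 0x0 to machine 0x0C20, the shadow translation of 0xCEFF0040 is
0x0C20, which matches the last table of Figure 5.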


[Figure: for each of two guest processes (Process 1, Process 2), a guest page
table covering virtual addresses 0 to 0xFFFFFFFF in guest memory and a matching
shadow page table. The rows are labelled virtual addresses, physical addresses,
machine addresses and virtual addresses.]

                Figure 3: Guest processes and shadow page tables


4.3.3   QEMU/KVM memory mapping

When QEMU starts a new guest system it allocates a number of chunks of
virtual host memory corresponding to the physical memory (RAM, devices etc.) of
the guest. For KVM-enabled QEMU this memory is mapped into KVM memory
slots using the KVM_SET_USER_MEMORY_REGION ioctl.
When setting up the initial shadow page table in KVM we create mappings
between the guest physical addresses and the host physical frames corresponding
to the QEMU virtual memory represented in the KVM memslots. See function
mmap_memslot_guest in Appendix B.5.
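
The lookup that a memory slot enables can be sketched as follows. The memslot
structure here is a simplified local model with illustrative field names, not
the kernel's own definition, and gpa_to_hva is a hypothetical helper that works
on 4 KB page frames.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/* Simplified model of a memory slot: a run of guest page frames backed by
 * a contiguous range of QEMU (host user space) virtual memory. */
struct memslot {
    uint64_t base_gfn;        /* first guest frame number in the slot */
    uint64_t npages;          /* number of pages in the slot */
    uint64_t userspace_addr;  /* QEMU virtual address backing base_gfn */
};

/* Find the slot that contains the given guest frame number, or NULL. */
const struct memslot *find_slot(const struct memslot *slots, size_t n, uint64_t gfn)
{
    assert(slots != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (gfn >= slots[i].base_gfn && gfn - slots[i].base_gfn < slots[i].npages)
            return &slots[i];
    return NULL;
}

/* Guest physical address -> QEMU virtual address; 0 if no slot covers it. */
uint64_t gpa_to_hva(const struct memslot *slots, size_t n, uint64_t gpa)
{
    uint64_t gfn = gpa >> PAGE_SHIFT;
    const struct memslot *s = find_slot(slots, n, gfn);
    if (s == NULL)
        return 0;
    return s->userspace_addr + (gfn - s->base_gfn) * PAGE_SIZE + (gpa & (PAGE_SIZE - 1));
}
```

The host physical frame for the shadow page table is then obtained by resolving
the resulting QEMU virtual address through the host's own page tables.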

[Figure: columns for user space QEMU virtual addresses (0 to 0xFFFFFFFF, with a
kernel region and an MMAP region), machine physical addresses, guest physical
addresses and guest virtual addresses.]

                   Figure 4: Memory mapping from QEMU to guest


4.3.4     Memory protection

There are several levels of memory protection to take into account. First of
all the KVM hypervisor needs to be protected from the guest kernel. Moreover
since the guest is running in user mode, KVM needs to protect the guest’s kernel
memory from rogue guest user processes. Lastly the guest user processes need
to be protected from each other.
Since both the guest kernel and user processes are running in user mode there
is no inherent memory isolation mechanism. Isolation needs to be provided by
the hypervisor. As advised in Xen on ARM [7] we use the ARM domain protection
mechanisms [1] to separate hypervisor, guest kernel and guest user
mode memory. We also need to make sure that the shadow page tables for the
different guest user processes are kept independent so that protection between
user processes is enforced.
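
The ARM domain access control register holds a two-bit field for each of 16
domains: 00 means no access, 01 means client (accesses are checked against the
page permissions) and 11 means manager (accesses are not checked). The sketch
below builds register values for the two guest situations. The domain numbers
and the policy in dacr_for_guest are assumptions made for illustration; in
particular it assumes that hypervisor pages are additionally marked as
privileged-only in their page table entries.

```c
#include <assert.h>
#include <stdint.h>

/* Two-bit access field values defined by the ARM architecture. */
enum domain_access {
    DOMAIN_NO_ACCESS = 0x0,
    DOMAIN_CLIENT    = 0x1,
    DOMAIN_MANAGER   = 0x3
};

/* Hypothetical domain assignment for this sketch. */
enum {
    DOM_HYPERVISOR   = 0,
    DOM_GUEST_KERNEL = 1,
    DOM_GUEST_USER   = 2
};

/* Return dacr with the field for the given domain replaced by access. */
uint32_t dacr_set(uint32_t dacr, unsigned domain, enum domain_access access)
{
    assert(domain < 16);
    unsigned shift = domain * 2;
    dacr &= ~(UINT32_C(3) << shift);
    return dacr | ((uint32_t)access << shift);
}

/* Register value to load before entering the guest. Guest kernel memory is
 * only reachable while the guest believes it is in kernel mode. */
uint32_t dacr_for_guest(int guest_in_kernel_mode)
{
    uint32_t dacr = 0;
    dacr = dacr_set(dacr, DOM_HYPERVISOR, DOMAIN_CLIENT);
    dacr = dacr_set(dacr, DOM_GUEST_USER, DOMAIN_CLIENT);
    dacr = dacr_set(dacr, DOM_GUEST_KERNEL,
                    guest_in_kernel_mode ? DOMAIN_CLIENT : DOMAIN_NO_ACCESS);
    return dacr;
}
```

Switching between guest kernel and guest user mode then only requires writing a
different value to this register, without touching the shadow page tables.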


                           KVM memory slot
             User virt.              Guest phys.         Machine phys.
         start        end          start     end        start      end
       0x0000FF80  0x0000FF84      0x0       0x4        0x0C20     0x0C24

                 Shadow page table (real mode)
              Virtual                  Physical
         start        end          start        end
       0x0         0x4             0x0C20       0x0C24

                       Guest page table
              Virtual                  Physical
         start        end          start        end
       0xCEFF0040  0xCEFF0044      0x0          0x4

                 Shadow page table (MMU mode)
              Virtual                  Physical
         start        end          start        end
       0xCEFF0040  0xCEFF0044      0x0C20       0x0C24

                       Figure 5: Shadow page table entry example


4.4      Interrupt handlers in virtual memory
When an interrupt occurs the hardware will execute instructions from fixed
virtual memory locations. Thus the entire interrupt handler must necessarily be
mapped at all times, both when running the host and when running guests.
There are two different strategies to accomplish this:

   - Since Linux is designed not to use the highest 64 MB of the address space
     [0xFC000000-0xFFFFFFFF] this part can be used for the hypervisor [7].
     When an interrupt occurs the handler can either switch back to the host
     kernel page tables and deal with the interrupt or in some cases deal with
     the interrupt directly. There are two drawbacks with this approach. First,
     the code needs to be relocated to run from its new location. Second, it only
     works for guest OSes that do not use this 64 MB memory region. Even
     so we have chosen this way forward for the time being since we are only
     supporting Linux guests at the moment anyway.
   - The alternative approach is to protect the pages occupied by our hypervisor
     and emulate any access by the guest kernel. This is a general method
     that will work regardless of the guest OS memory usage.
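
For the first strategy the hypervisor has to notice when a guest mapping would
collide with the reserved region. In a 32-bit address space the highest 64 MB
start at 0xFC000000, so the check can be sketched as below. The function name
and the constant are illustrative and not taken from the project source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Start of the highest 64 MB of a 32-bit address space (2^32 - 64 MB). */
#define HYP_REGION_START UINT32_C(0xFC000000)

/* True if the range [start, start + len) reaches into the reserved region. */
bool touches_hypervisor_region(uint32_t start, uint32_t len)
{
    if (len == 0)
        return false;
    uint32_t last = start + (len - 1);
    if (last < start)
        return true;  /* the range wraps past 0xFFFFFFFF, so it crosses the top */
    return last >= HYP_REGION_START;
}
```

A guest for which this check fails on a legitimate mapping cannot be supported
by the first strategy and would need the second one.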


4.5    I/O emulation
I/O emulation is left almost exclusively up to QEMU. However, KVM is involved
in mapping memory for MMIO between the guest and QEMU, in dispatching
interrupts from hardware to QEMU and in injecting interrupts from QEMU into
the guest.
The mappings illustrated in Figure 4 show how an emulated QEMU device
communicates with the guest kernel. The guest kernel will think that a part of its
physical address space is dedicated by hardware to memory on a device
(footnote 5) and map this physical memory to a part of the guest kernel address
space.
From the other point of view, QEMU allocates memory for this purpose and
will do so as any user process allocates memory, resulting in a contiguous virtual
address range backed by possibly non-contiguous physical frames. QEMU will
tell KVM through the KVM_SET_USER_MEMORY_REGION ioctl call that it will emulate
hardware in the memory belonging to a specific virtual address range and that
the guest should see this memory through its perception of a corresponding
physical address range. KVM will in turn create KVM memory slots to hold
this information. This is all done in the architecture-agnostic part of KVM
(virt/kvm/kvm_main.c) in the Linux kernel source tree.
What we had to do was to make sure that those memory regions were mapped
correctly in the shadow page tables as discussed in section 4.3.3.


5     Infrastructure work
There are a number of changes that have to be made to the Android stack (footnote 6) to
get it into a state where QEMU will run. Areas of the kernel, the user space of
the host OS, and QEMU itself need to be modified. Also, there are nontrivial
differences between the physical device and the emulated one, both in the actual
hardware that they expose and the software that is running on them.


5.1     Android
The proposed solution is to use Android as our host OS since it is essentially
an easy-to-use Linux distribution for mobile devices (footnote 7). Furthermore, Android is
already designed to run on our hardware, the HTC Dream, and comes with an
emulated device which is similar enough to our hardware that we can develop
on both in parallel and expect that the majority of the work we do on one will
transfer to the other with relative ease.
Android, however, comes with its own set of problems. A traditional Linux
userland is virtually nonexistent in Android. In its place is a
custom application framework built on top of the Android Runtime. The
application framework includes a windowing system that was created from scratch
specifically for Android and does not share a legacy with X11 or any other
existing Linux windowing system. All of this code is run on Dalvik, a virtual
machine similar to the Java Virtual Machine. All code run on Dalvik is managed
and is written in the Java programming language. Dalvik uses its own bytecode
rather than Java bytecode, and thus is not a JVM. Java source files are first
compiled into Java bytecode and then run through a tool called dx to convert
them into Dalvik bytecode.

   5 Possibly writing a byte in the MMIO region will just cause the byte to be sent to a device
without actually mapping any real memory, but the details are not important for the purpose of
this discussion.
   6 Here, "stack" refers to the Linux kernel, specific drivers, libraries, and userspace code
shipped with Android to facilitate application development.
   7 However, the solution should be easily portable to any Linux host OS which can load the
KVM kernel module.

Unfortunately, QEMU is not designed to run on top of this stack and the work
involved in porting it over would be significant to say the least. In addition,
Android's stack contains a good amount of functionality that we would not
need to run QEMU and that would be hard to selectively strip out. With this in
mind, we have disabled the Android Runtime and instead chosen to port over an
existing stack that QEMU supports. Disabling the Android Runtime is as easy
as commenting out some entries in init.rc (footnote 8).
Android also includes a custom C runtime library called bionic. bionic is
designed to be small and only to support the functionality that the Android
Runtime requires to work. It is also hard to link against. Rather than trying
to work with bionic, we statically link against glibc. While this does take
up more memory, the amount is negligible compared to the amount of space
available to us and it makes our job much easier.


5.2    QEMU
In order to get QEMU running on Android, we have to port over a stack that it
can run on top of. The bottom of this stack consists of fbcon, the framebuffer
console driver. fbcon is a console driver that was created for computers that
do not have a dedicated text mode. It runs directly on top of the framebuffer
driver for the LCD screen and is able to draw images as well as text. This is
part of the Linux kernel and is already compiled into the kernel that runs on
the emulated device.
SDL, or the Simple DirectMedia Layer, is a cross-platform multimedia library
that handles both input and output. It is most commonly used to develop games,
but QEMU also uses it as a graphical back end. SDL in turn has an fbcon video
driver. This allows programs built against SDL to be run without any windowing
system. There is only one change that needs to be made to the fbcon video driver
in order to get SDL running on Android. Android's framebuffer is located at
/dev/graphics/fb0 rather than at /dev/fb0. Once the references to /dev/fb0
are changed in the fbcon video driver, SDL can be compiled for Android. As
mentioned earlier, SDL is compiled statically against glibc.
Once SDL is compiled, QEMU can be built on top of it. QEMU assumes a
minimum resolution of 640 by 480, while our devices have resolutions of 320
by 480. After hardcoding a resolution of 320 by 480, QEMU runs successfully
with graphical output. The picture is distorted, most likely because QEMU still
believes that the screen width is 640. Given more time, we would develop a more
robust solution that did not rely on hardcoding resolution values.
   8 init is a custom init program developed from scratch by Google; init.rc is its configuration script.



          Figure 6: Booting Linux in QEMU with graphics enabled


5.3    Physical device
Running our code on the physical device presents its own set of problems.
There are two versions of the HTC Dream in existence: the T-Mobile
G1 and the Android Dev Phone 1 (ADP1). The devices have identical hardware
and differ only cosmetically; however, they run different versions of
the software. Specifically, the G1 runs a locked down version of Android where
the user does not have root access. The bootloader on the G1 is also locked down
and will only install versions of Android that are signed by T-Mobile. The ADP1
does not present any of these problems, but was unfortunately unavailable to
us for the project.
Fortunately there is a significant security flaw in the original version of Android
that came installed on the G1 that can be exploited to gain root access. On
the stock install of Android 1.0 RC29 from T-Mobile, input is directed not only
to the foreground app, but also to an invisible root console running in the
background. Using this flaw, a telnet server can be started, which allows us to
replace T-Mobile's locked down version of Android with a custom unrestricted
one. Because the bootloader only checks the signature of firmware packages that
are about to be installed and not of the currently installed firmware, Android
will have no trouble booting up.
The next step is to replace the Secondary Program Loader (SPL) with the
Engineering SPL from the ADP1. The SPL is the part of the bootloader that is
stored in flash and is responsible for checking the signature of the firmware
that the user wants to install on the phone. The Engineering SPL has no such
signature checking restriction. Installing it allows us to flash the phone with
new versions of Android without having to rely on the bug in RC29.
The Engineering SPL also includes fastboot. Fastboot is a mode that allows us
to write to flash memory over USB directly from the bootloader. This makes
installing custom kernels and root file systems much less risky because we can
always recover from the bootloader.
The default kernel installed on the custom device does not have the fbcon driver
or virtual terminal support compiled in. As such the kernel must be recompiled
to get QEMU to run. Additionally, there is a problem with the framebuffer
driver for the physical LCD that causes the driver to time out while waiting for
the next frame when a program running on top of SDL is run. With more time,
we would have debugged the physical LCD's framebuffer driver.


6    Usage model
To simplify matters and to face the fact that mobile devices are
limited in terms of available memory, processing power and electrical power, we
do not anticipate running several operating systems in parallel, nor do we see
any real need to do so.
Instead, we suggest creating a virtual 'multi-boot' environment, similar to the
boot menu rendered in a boot loader but in a stripped-down version of the
Android kernel and OS. The menu should let users choose a desired operating
system to start in guest mode, which will then appear to be the 'only' system
running on the device.
Once the user is working inside his or her chosen operating system, a user
application must be installed in the guest OS, which can communicate with
the underlying host kernel to, for instance, shut down the guest system or, in
the future, even checkpoint or migrate the system.
The main reason for only running a single guest at a time is lack of hardware
resources. A short performance analysis of the HTC Dream (G1) shows
that out of its approximately 100MB of memory about 3MB was free under normal
usage. When running Android as a guest, we will also have the overhead of the
hypervisor. In conclusion, memory and CPU resources are simply too scarce to
run multiple guests efficiently.
Another major reason for not opting for a traditional hosted usage model is the
typically small screen size of phones. Our “virtual boot loader” UI will give the
user the feeling of actually booting into the OS of his/her choice and not being
bugged by other applications.
Of the three concepts listed above 'multiple personas' would probably be the
application area best facilitated by our solution. There are two ways personas can
be used. First, multiple personas can be used to switch between a personal and
a corporate phone personality. Second, migration can enable a user's personality
(including applications, pictures, e-mail and documents) to be safely migrated
to a new phone.
Looking at our virtualization solution from a user perspective, it is quite
possible that users in practice will treat our "virtual boot loader" as a normal
boot loader. Hence they would simply reduce the timeout before the default system
is loaded and let it boot. If used this way, users would still have the opportunity
to checkpoint and migrate their machines. In this case rendering our first user
interface would be virtually unnecessary.


6.1      Application screens
The virtualization solution, which we are developing, will essentially have two
user interfaces. First, there will be a text-based 'boot-menu' where a user can
choose a desired operating system. Second, there will be a graphical management
application inside the chosen operating system.


6.1.1    Boot menu

The boot menu will be rendered on the console as follows:

                         - Android VM Boot menu -

                         Please select from the following list, your
                         mobile operating system:

                          (a)   Android RC33 (default)
                          (b)   Windows Mobile
                          (c)   Symbian ....
                          (x)   Console

                         Time left: 6

                         #: _

                                Figure 7: Virtual boot loader menu.

The boot menu will display the installed system images and let the user select
which one to start up. There will be a counter, which will automatically count
down from a number of seconds before the default system is started.
Additionally, there will be a last option, which is to start a console directly at
this state and operate in text mode. The menu should be based on letters and not
on numbers since current hardware supporting our solution has a QWERTY
keyboard, on which letters are easier to access than numbers.
In the future, if our solution should be packaged as a product, a small graphically
rendered menu would most likely be preferred, where the touch screen could
actually be used and where the screen orientation would depend on whether the
keyboard was opened or not, but this also depends on the target hardware.
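
The selection logic of such a menu is small. The following sketch shows one way
it could look, assuming the caller reads a key with a timeout and passes
KEY_TIMEOUT when the counter reaches zero. Since the virtual boot loader is not
implemented, all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One line of the boot menu: the letter to press and the text to show. */
struct boot_entry {
    char key;
    const char *label;
};

/* Value the caller passes when the countdown expired without a key press. */
#define KEY_TIMEOUT (-1)

/* Return the entry to boot, or NULL if the key matches no entry and the
 * caller should prompt again. */
const struct boot_entry *select_entry(const struct boot_entry *entries, size_t n,
                                      size_t default_index, int key)
{
    assert(default_index < n);
    if (key == KEY_TIMEOUT)
        return &entries[default_index];
    for (size_t i = 0; i < n; i++)
        if (entries[i].key == key)
            return &entries[i];
    return NULL;
}
```

Keeping the default as an index into the same table means that changing the
default guest only requires storing a different index.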



6.1.2    Guest utility applications

For operations such as switching guest operating system, changing the default
guest, or migrating a process or an entire guest, we suggest writing a small
application for each supported guest.
Such an application would be able to fully utilize graphical libraries, touch screen
hardware and anything else available to the guest, resulting in an easy-to-use and
consistent application for managing the guests.
It is not apparent how the utility applications should communicate with the
hypervisor. If the guest system is also Android, the applications would be limited
to running in the Dalvik environment, in which you cannot write custom hypercalls
or access specific memory areas. Instead, one would have to resort to shared
files or network sockets.


7       Evaluation
The project has involved many relatively disjoint activities and each activity
deserves a separate evaluation.
Pure QEMU evaluation - Emulating Linux guests using QEMU works very
well on both the emulator and the physical device. The major problem is the
poor performance, which makes the solution practically unusable.
Running flat binaries - Running flat binaries using KVM works well on both
the emulator and on the physical device. We have not made any performance
measurements since the crucial performance bottlenecks for the system
virtualization are the context switches and traps.
QEMU graphics support - Graphical support on QEMU is at best a proof
of concept. QEMU makes the assumption that the smallest screen size is larger
than ours, which causes a number of problems. While hardcoding the proper
values is a quick and dirty solution, it leads to other issues such as the distortion
of the video when it is displayed. Furthermore, serious investigation of SDL,
the framebuffer console, and the physical framebuffer driver is needed before
QEMU will even show graphics on the physical device. Despite this, graphical
support in QEMU is sufficient to begin testing our port of KVM.
KVM system emulation - Running the full-fledged guest OS using KVM
has primarily been tested on the emulator for two different reasons. First, the
emulator environment lends itself well to debugging using GDB. Second, since
we are developing at multiple sites and only have one physical device, we had to
choose a common architecture. It should be noted that the Android goldfish
emulator uses the ARMv5 architecture whereas the HTC Dream is an ARMv6
device. The ARMv6 architecture is however backwards compatible with ARMv5,
so with some luck an ARMv5 solution will work out of the box if we can convince
the guest OS that it is running on ARMv5.


Since we are still only debugging the guest OS we have not been able to do any
performance measurements. However, we can safely assume that the approach
of clearing caches and the TLB on each trap, and even trapping on each
control-flow-changing instruction, will not be feasible in the long run.


8     Limitations
We have chosen to divide the limitations into design and implementation. Design
limitations are more problematic than implementation limitations since they
imply problems we do not see any apparent solution to yet.


8.1    Design limitations
    - In section 4.4 we discussed the different ways to isolate the hypervisor
      from the guest. With our current strategy of using the upper 64 MB of
      memory we are limited to Linux guests.
    - As long as we are working with ARMv5 we will have to clear the caches
      and flush the TLB on each trap. Thus from a performance perspective it
      is not feasible to trap on each privileged instruction. Instead we would
      have to translate and emulate, or emulate in code residing in the upper
      64 MB of memory (which would eliminate the need to change page tables
      and flush caches).


8.2    Implementation limitations
Our main focus has been to develop a feasible virtualization strategy for ARM
and to get a guest kernel up and running as soon as possible. As such, UI,
performance and security issues have been considered but not implemented.

    - There is currently no virtual boot loader or guest OS management appli-
      cation implemented.
    - We have not implemented any guest kernel memory protection.

    - Performance issues have been postponed until performance can be
      measured.


9     Member contribution
Our project team has consisted of four students. This section describes the
division of work in the project.



David Albert - CS undergraduate student.


Christoffer Dall - CS Master’s student


Andreas Nilsson - CS Master’s student


Brian Smith - Professional Degree student


David’s involvement in the project is in context of a project course and his work
is documented in section 5.
Christoffer and Andreas are doing the project as a combined project course
and main project work in “COMS E6998 - Mobile Computing with iPhone and
Android”. They have been involved with all work except the work described in
section 5.
Brian is participating in this project as the final project in his professional
degree. Except for the translation functions and common interrupt code
included in the appendices, his work is not described in detail in this report.


 Task                           David   Christoffer   Andreas   Brian

 Infrastructure
 Project design                              X           X        X
 Custom kernel for emulator                  X           X        X
 Cross-compile QEMU                          X           X
 Manage svn/git repos                        X
 Contribute to wiki               X          X           X        X
 Port SDL to Android              X
 Kernel debug of Android                     X           X

 Physical device
 Jailbreak phone                             X           X
 Custom bootloader                X          X           X
 Custom kernel for device         X          X           X

 KVM
 Basic KVM ARM module                        X           X
 KVM flat binaries                           X           X
 Adapt KVM userspace                         X           X
 SWI interrupt handler                       X           X
 Interrupt handlers                                               X
 Shadow page tables                          X           X
 Binary Translation                                               X
 Emulation                                                        X



10      Conclusion
We are well on our way to having a working mobile virtualization solution capable
of delivering all standard virtualization features. Taking inspiration from the
PowerPC port of KVM we are creating the first open-source ARM virtualization
solution which can run closed-source guest operating systems such as Windows
Mobile and Symbian. We accomplish this by translating code block by block
in order to trap and emulate guest kernel code. We can dispatch guest code to
run natively and handle interrupts correctly. Reusing Linux kernel functionality
we have created shadow page tables and have correct CPU and memory state
mappings between QEMU and KVM.
We can run a custom built kernel on the physical device and render QEMU
graphics in a stripped down version of Android on the emulator. In addition to
the virtualization results in this project we have shown that it is fairly
straightforward to modify and add to the Android kernel.
This project has been 50% research and 50% development. The subject of mobile
virtualization is quite cutting-edge and with no prior knowledge of the area we
have come a very long way. Apart from the need for performance optimizations,
especially with regard to the number of TLB and cache flushes, we see no major
implementation obstacles.


10.1     Future work
The team is dedicated to finishing the work. We can currently run around 200
Linux guest boot instructions (up to the point when the guest MMU is enabled).
The remaining task before we can boot the complete kernel is to finish the
memory virtualization for MMU-enabled guests.
When we have that working, our work can be expanded with the following tasks
(listed in no particular order):

     - Support QEMU graphics on the physical device.
     - Implement guest kernel memory protection.

     - Performance optimizations such as minimizing TLB flushes and cache in-
       validations. Further performance improvements could be achieved by do-
       ing a binary prepatching of the kernel using static binary translation and
       not during runtime as we currently do.

     - Incorporate QEMU changes directly into the Android emulator with de-
       vice support.
     - Support for non-Linux guest operating systems.
     - Virtual bootloader and guest OS user space applications.


     - Include our changes into the upstream QEMU and KVM source trees.
     - Evaluate performance and possibly publish findings.


11      Resources
     - Project wiki containing a blog, references and a number of useful guides:
       http://android.chazy.dk
     - Mailing list: android-virt@lists.cs.columbia.edu (archive:
       https://lists.cs.columbia.edu/cucslists/listinfo/android-virt)

     - Kernel git repository: git://slice.chazy.dk/android/kernel/common.git
     - QEMU git repository: git://slice.chazy.dk/qemu.git
     - Screen recording: http://www.youtube.com/watch?v=heYR1kbOf54

The screen recording was made for the “Mobile Computing with iPhone and
Android” class and shows:

     - booting of our custom kernel on the emulator (notice the custom boot
       message)

     - running a generic Linux ARM image, v. 2.6.17, on the cross-compiled version
       of QEMU for ARM hosts
     - loading a basic kvm-arm kernel module and retrieving its version number
       from a test program


12      Acknowledgements
     - Professor Jason Nieh - Project Advisor
     - Computer Science PhD student Oren Laadan - Technical Advisor: help
       with solution design and Linux kernel operations.
     - Computer Science PhD student Carlos Rene Perez - Input on virtualization
       in general and on Linux/KVM specifics.
     - Hollis Blanchard (and other people from the KVM mailing list and IRC
       channel) - Input on the KVM PowerPC port and pointers on how to move
       forward with ARM.




References
 [1] ARM. ARM926EJ-S Technical Reference Manual, 2003.
 [2] Bellard, F. QEMU, a fast and portable dynamic translator. In USENIX
     Annual Technical Conference, FREENIX Track (2005), pp. 41–46.
 [3] Bhardwaj, R., Reames, P., Greenspan, R., Nori, V. S., and Ucan,
     E. A Choices hypervisor on the ARM architecture.
     http://choices.cs.uiuc.edu/ChoicesHypervisor.pdf.
 [4] Blanchard, H. Linux kernel 2.6.27,
     Documentation/powerpc/kvm_440.txt, 2008.
 [5] Ferstay, D. R. Fast secure virtualization for the ARM platform, 2006.
 [6] Google. Android emulator, 2009.
     http://developer.android.com/guide/developing/tools/emulator.html.
 [7] Hwang, J.-Y., Suh, S.-B., Heo, S.-K., Park, C.-J., Ryu, J.-M.,
     Park, S.-Y., and Kim, C.-R. Xen on ARM: System Virtualization using
     Xen hypervisor for ARM-based secure mobile phones. In 5th IEEE
     Consumer Communications and Networking Conference (2008), pp. 257–261.
 [8] IBM. sHype - Secure Hypervisor, 2009.
     http://www.research.ibm.com/secure_systems_department/projects/hypervisor.
 [9] Popek, G. J., and Goldberg, R. P. Formal requirements for virtualizable
     third generation architectures. Commun. ACM 17, 7 (1974), 412–421.
[10] Qumranet. KVM - Kernel Virtual Machine whitepaper, 2006.
     http://www.qumranet.com/files/white_papers/KVM_Whitepaper.pdf.
[11] Reames, P., Chan, E., David, F., Carlyle, J., and Campbell, R.
     A hypervisor for embedded computing, 2007.
[12] Rosenblum, M., and Garfinkel, T. Virtual Machine Monitors: Current
     Technology and Future Trends. Computer (2005), 39–47.




Appendices
A     Sensitive instructions listing
A.1    Privileged instructions
      Instruction   Description

      CPS           Change Processor State. Changes the processor state
                    bits in the CPSR. In user mode this instruction is
                    ignored.
      LDM           The second version of LDM loads user mode registers
                    when the processor is in a privileged mode. This has
                    unpredictable behavior when in user mode or system
                    mode.
      LDM           The third version of LDM loads registers, then copies
                    the current SPSR into the CPSR. This has unpredictable
                    behavior when in user mode or system mode.
      LDRBT         Load Register Byte with Translation. When in privileged
                    mode, the memory system is signalled to treat the
                    access as if the processor were in user mode. When in
                    user mode, it behaves as an ordinary user mode access.
      LDRT          Load Register with Translation, same as LDRBT.
      MRS           Move to ARM Register from Status Register. Moves the
                    CPSR or SPSR to a general purpose register. In user
                    mode or system mode there is no SPSR, therefore
                    reading the SPSR is unpredictable.
      MSR           Move to Status Register from ARM Register. Sets the
                    CPSR or SPSR from a general purpose register. User
                    mode cannot set most of the CPSR. Any writes to the
                    privileged bits in the CPSR while in user mode are
                    ignored; no exception is raised. In user mode or
                    system mode there is no SPSR, therefore writing the
                    SPSR is unpredictable.
      RFE           Return From Exception. Loads the PC and CPSR from
                    memory. Unpredictable in user mode.
      SRS           Store Return State. Stores R14 and the SPSR to memory.
                    Unpredictable in user mode and system mode because
                    there is no SPSR.
      STM           The second version of STM stores user mode registers
                    when the processor is in a privileged mode. This has
                    unpredictable behavior when in user mode or system
                    mode.
      STRBT         Store Register Byte with Translation. When in privileged
                    mode, the memory system is signalled to treat the
                    access as if the processor were in user mode. When in
                    user mode, it behaves as an ordinary user mode access.
      STRT          Store Register with Translation, same as STRBT.
A.2       Other instructions
Data processing instructions such as ADDS and MOVS are sensitive when the
destination register is the PC (r15) and the S bit is set, because the SPSR
of the current mode is then copied to the CPSR. This causes unpredictable
behavior when executed in system mode or user mode, where there is no SPSR.

Co-processor instructions manipulate co-processor data. Whether an operation
is allowed in user mode depends on the coprocessor and the instruction. In
most cases (at least all cases defined by the architecture), a privileged
operation attempted in user mode raises an undefined instruction exception.
The instructions that manipulate co-processor data are CDP, LDC, MCR, MCRR,
MRC, MRRC and STC.
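
The two rules in this section can be expressed as checks on a 32-bit instruction word. The following is a minimal sketch, assuming the classic 32-bit ARM (ARMv5) instruction encoding; the function names are hypothetical and are not part of our code base.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for a data processing instruction that writes
 * the PC (Rd == r15) with the S bit set, for example MOVS pc, lr. */
static bool is_pc_write_with_s_bit(uint32_t insn)
{
    if ((insn >> 28) == 0xFu)          /* cond 1111 is not data processing */
        return false;
    if (((insn >> 26) & 0x3u) != 0)    /* bits[27:26] must be 00 */
        return false;
    /* Register form with bits 7 and 4 both set encodes multiplies and
     * extra load/stores, not data processing. */
    if (((insn >> 25) & 1u) == 0 && (insn & 0x90u) == 0x90u)
        return false;
    /* S bit is bit 20, Rd is bits[15:12]. */
    return ((insn >> 20) & 1u) == 1 && ((insn >> 12) & 0xFu) == 15;
}

/* Hypothetical helper: true for CDP, LDC, STC, MCR, MRC, MCRR and MRRC,
 * which occupy bits[27:24] == 1110 and bits[27:25] == 110. */
static bool is_coprocessor_insn(uint32_t insn)
{
    uint32_t op = (insn >> 24) & 0xFu;

    return op == 0xEu || (op & 0xEu) == 0xCu;
}
```

For instance, 0xE1B0F00E (MOVS pc, lr) is flagged by the first check, while 0xE1A0F00E (MOV pc, lr, no S bit) is not. 0xEE021F10 (MCR p15, 0, r1, c2, c0, 0) is flagged by the second check.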


B       Code snippets
B.1       Interrupt handler replacement
Assembly code to store interrupt jump table and backup location:
@ The table where each word is the address of a word that contains the
@ address of the interrupt handler that needs to be replaced.
@
  .globl __replace_vector_table
__replace_vector_table:
  .word 0                               @ don't care about reset
  .word __branch_undefined_vector
  .word __branch_swi_vector
  .word __branch_prefetch_abort_vector
  .word __branch_data_abort_vector
  .word 0                               @ undefined
  .word __branch_irq_vector
  .word 0                               @ fiq is always disabled
@
@
@ The table where each word is the address of kvm's interrupt handler. The
@ addresses here will replace the addresses pointed to by
@ __replace_vector_table.
@
  .globl __kvm_vector_table
__kvm_vector_table:
  .word 0
  .word kvmarm_und_vector
  .word kvmarm_swi_vector
  .word kvmarm_pabt_vector
  .word kvmarm_dabt_vector
  .word 0
  .word kvmarm_irq_vector
  .word 0                               @ kvmarm_fiq_vector

@
@ The table to hold the replaced addresses.
@
  .globl __old_vector_table
__old_vector_table:
  .word 0
  .word 0
  .word 0
  .word 0
  .word 0
  .word 0
  .word 0
  .word 0


Initialization code to actually hijack interrupts:
#define STUB_TO_LIVE(addr) \
    (CONFIG_VECTORS_BASE + 0x200 + ((u32)(addr) - (u32)&__stubs_start))

static int arm_init(void)
{
    int i;
    int rc = kvm_init(NULL, sizeof(struct kvm_vcpu), THIS_MODULE);

    /* Do not touch the vectors if KVM failed to initialize */
    if (rc)
        return rc;

    for (i = 0; i < 8; i++) {
        u32 *handler;

        /* A zero entry means this vector is not replaced */
        if (__replace_vector_table[i] == 0)
            continue;

        /* Locate the live copy of the word holding the handler address */
        handler = (u32 *)STUB_TO_LIVE(__replace_vector_table[i]);

        /* Back up the original handler, then install the KVM handler */
        __old_vector_table[i] = *handler;
        if (__kvm_vector_table[i] != 0)
            *handler = __kvm_vector_table[i];
    }
    return rc;
}
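
The address arithmetic and the swap loop above can be tried out in user space. The following is a minimal sketch under stated assumptions: the tables are plain arrays, the slot pointers stand in for the relocated stub words, and all names (stub_to_live, hijack_vectors, NUM_VECTORS) and addresses are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS 8

/* Model of STUB_TO_LIVE: translate a link-time address inside the stub
 * section to its run-time copy, 0x200 bytes above the vectors base. */
static uint32_t stub_to_live(uint32_t vectors_base, uint32_t stubs_start,
                             uint32_t addr)
{
    return vectors_base + 0x200u + (addr - stubs_start);
}

/* Model of the loop in arm_init: slots[i] points at the word holding the
 * live handler address, or is NULL when the vector is left alone. */
static void hijack_vectors(uint32_t *slots[NUM_VECTORS],
                           const uint32_t kvm_table[NUM_VECTORS],
                           uint32_t old_table[NUM_VECTORS])
{
    for (size_t i = 0; i < NUM_VECTORS; i++) {
        if (slots[i] == NULL)
            continue;
        old_table[i] = *slots[i];        /* back up the original handler */
        if (kvm_table[i] != 0)
            *slots[i] = kvm_table[i];    /* install the replacement */
    }
}
```

With an assumed vectors base of 0xffff0000 and an assumed stub section start of 0xc0020000, a stub word at 0xc0020040 has its live copy at 0xffff0000 + 0x200 + 0x40 = 0xffff0240.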




B.2          Custom interrupt handlers
The initial handler duplicated for each interrupt type:
.macro KVM_HANDLER vecName, vecOffset, vecIdx
ENTRY(\vecName)
  stmdb sp, {r0, pc}              @ first make sure this is an interrupt
                                  @ we care about, use r0 as a work reg
  get_thread_info r0
  ldr   r0, [r0, #TI_TASK]
  ldr   r0, [r0, #TSK_FLAGS]
  tst   r0, #PF_VCPU
  bne   is_guest_\vecName         @ do more stuff if it was a guest
  ldr   r0, =__old_vector_table
  ldr   r0, [r0, #\vecOffset]
  str   r0, [sp, #-4]             @ store in PC slot
  ldmdb sp, {r0, pc}              @ go to real interrupt handler
is_guest_\vecName:
  ldr   r0, [sp, #-8]             @ refresh r0 to what it was
.ifc \vecIdx, ARM_INTERRUPT_SOFTWARE
  sub   sp, sp, #S_FRAME_SIZE     @ for SVC, we do the saving because
                                  @ there was no preliminary processing
                                  @ (r0 is user's r0)
  stmia sp, {r0 - r12}            @ Store user registers on the stack
  add   r0, sp, #S_PC
  stmdb r0, {sp, lr}^             @ Store user R13 and R14
  mrs   r0, spsr                  @ Get the user's CPSR
  str   lr, [sp, #S_PC]           @ User R15 is our R14
  str   r0, [sp, #S_PSR]          @ Store the user's CPSR on the stack
  zero_fp
.else
  usr_entry                       @ save state on stack the generic way
.endif
  ldr   r5, =\vecIdx
  b     common_interrupt
.endm

.LCcralign:
  .word cr_alignment

KVM_HANDLER kvmarm_und_vector,  0x04, ARM_INTERRUPT_UNDEFINED
KVM_HANDLER kvmarm_swi_vector,  0x08, ARM_INTERRUPT_SOFTWARE
KVM_HANDLER kvmarm_pabt_vector, 0x0C, ARM_INTERRUPT_PREF_ABORT
KVM_HANDLER kvmarm_dabt_vector, 0x10, ARM_INTERRUPT_DATA_ABORT
KVM_HANDLER kvmarm_irq_vector,  0x18, ARM_INTERRUPT_IRQ
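
The decision made at the top of each handler can be summarized in C. The following is a minimal sketch, not our implementation: the enum values follow the hardware vector order and are an assumption (the actual ARM_INTERRUPT_* values are not shown here), PF_VCPU_FLAG is assumed to mirror the Linux PF_VCPU task flag, and addresses are modelled as plain integers.

```c
#include <stdint.h>

#define PF_VCPU_FLAG 0x00000010u   /* assumed to mirror Linux's PF_VCPU */

/* Hypothetical vector indices in hardware vector order. */
enum vector_index {
    VEC_RESET = 0, VEC_UNDEFINED, VEC_SOFTWARE, VEC_PREF_ABORT,
    VEC_DATA_ABORT, VEC_RESERVED, VEC_IRQ, VEC_FIQ
};

/* Byte offset of a vector's slot in a table of 32-bit words; this matches
 * the 0x04 ... 0x18 offsets passed to KVM_HANDLER. */
static uint32_t vector_offset(enum vector_index idx)
{
    return (uint32_t)idx * 4u;
}

/* Returns the address at which execution continues: the guest path when
 * the interrupted task is a VCPU, otherwise the saved original handler. */
static uint32_t select_handler(uint32_t task_flags,
                               const uint32_t old_table[8],
                               uint32_t guest_path,
                               enum vector_index idx)
{
    if (task_flags & PF_VCPU_FLAG)
        return guest_path;
    return old_table[vector_offset(idx) / 4u];
}
```

A host task therefore pays only for the flag test before it reaches the original kernel handler, which keeps the cost of the hijacked vectors low for non-guest processes.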


The common interrupt handling function for guest processes after the initial
processing shown above:
@ Input: r5 - interrupt index
@        sp - struct pt_regs
@
common_interrupt:
  bl    kvmarm_getVCPU            @ get this guest's VCPU context into R0

  ldr   r1, [r0, #VCPU_HOST_PGD_ADDR]
#ifdef CONFIG_CPU_DCACHE_WRITETHROUGH
  mcr   p15, 0, ip, c7, c6, 0     @ invalidate D cache
#else
@ && 'Clean & Invalidate whole DCache'
2: mrc  p15, 0, r15, c7, c14, 3   @ test, clean, invalidate
  bne   2b
#endif
  mcr   p15, 0, ip, c7, c5, 0     @ invalidate I cache
  mcr   p15, 0, ip, c7, c10, 4    @ drain WB
  mcr   p15, 0, r1, c2, c0, 0     @ load page table pointer
  mcr   p15, 0, ip, c8, c7, 0     @ invalidate I & D TLBs

  @ save guest state from the stack into VCPU guest_regs
  add   r2, r0, #VCPU_REGS+S_R0   @ get offset into VCPU where to store
                                  @ first reg (guest_regs)
  ldmia sp!, {r3, r4, r6 - r11}
  stmia r2!, {r3, r4, r6 - r11}   @ Save R0-R7
  ldmia sp!, {r3, r4, r6 - r12}
  stmia r2!, {r3, r4, r6 - r12}   @ Save R8-R15, and the CPSR
  add   sp, sp, #S_FRAME_SIZE-(17*4) @ Return stack pointer to where it
                                  @ was on entry, pt_regs isn't made up
                                  @ of just what we use, make up the
                                  @ difference
  cmp   r5, #ARM_INTERRUPT_DATA_ABORT
  bne   check_irq
  mrc   p15, 0, r1, c1, c0, 0     @ get the control register
  str   r1, [r0, #VCPU_HOST_CR]
  mrc   p15, 0, r1, c5, c0, 0     @ get the fault status register
  str   r1, [r0, #VCPU_HOST_FSR]
  mrc   p15, 0, r1, c6, c0, 0     @ get the fault address register
  str   r1, [r0, #VCPU_HOST_FAR]
  b     handle_exit
check_irq:
  cmp   r5, #ARM_INTERRUPT_IRQ
  bne   handle_exit
  @ For an IRQ, we want to let the kernel handle the interrupt, however
  @ we want to get control, not the guest, when we get redispatched.
  @ Therefore we want to mess around with the state such that when
  @ we get redispatched it will pick up here, not in the guest where
  @ the interrupt actually occurred.
  sub   sp, sp, #12               @ room on the stack for state
  ldr   r1, =resume_point
  mrs   r2, CPSR
  and   r2, #SVC_MODE             @ Enable for all interrupts on redispatch
  stmia sp, {r0 - r2}
  mov   r0, sp
  ldr   pc, =__irq_svc            @ branch to irq handler, when this gets
                                  @ redispatched, it will begin execution
                                  @ at resume_point
resume_point:
  add   sp, sp, #12               @ push stack back to where it was
  b     handle_exit_enabled       @ already enabled
@
@ Input: r0 - struct kvm_vcpu
@        r5 - interrupt index
handle_exit:
  enable_irq
handle_exit_enabled:
  mov   r1, r5                    @ copy interrupt index as parameter
  mov   r5, r0                    @ save kvm_vcpu address across call
  bl    kvmarm_handle_exit        @ handle the reason for the swi, returns
                                  @ how to continue in r0
  mov   r1, r0
  mov   r0, r5
  cmp   r1, #RESUME_GUEST
  bne   guest_stop
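
The register save in common_interrupt copies 17 words (R0-R15 and the CPSR) from the saved frame into the VCPU and then skips the rest of the frame. The following is a minimal sketch of that bookkeeping, assuming the ARM Linux layout in which the saved frame holds 18 words (the 17 above plus ORIG_r0); the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUEST_WORDS 17   /* r0-r15 plus CPSR */
#define FRAME_WORDS 18   /* assumed: the saved frame also holds ORIG_r0 */

struct frame_model { uint32_t uregs[FRAME_WORDS]; };
struct vcpu_model  { uint32_t guest_regs[GUEST_WORDS]; };

/* Copies r0-r15 and the CPSR out of the frame and returns how many bytes
 * the stack pointer still has to advance to release the whole frame,
 * which corresponds to S_FRAME_SIZE - (17 * 4) in the assembly. */
static size_t save_guest_state(struct vcpu_model *vcpu,
                               const struct frame_model *frame)
{
    memcpy(vcpu->guest_regs, frame->uregs,
           GUEST_WORDS * sizeof(uint32_t));
    return sizeof(*frame) - GUEST_WORDS * sizeof(uint32_t);
}
```

Under the 18-word assumption the frame is 72 bytes, so the remaining adjustment is 72 - 68 = 4 bytes.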




B.3         Executing guest code natively
KVM_RUN ioctl call:
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
{
  int r;
  struct mm_struct *mm;

  orig_vcpu = vcpu;
  mm = create_template_mm();
  if (!mm) {
    r = -EFAULT;
    goto out;
  }
  r = map_physical_memory_regions(vcpu->kvm, mm);
  if (r)
    goto out;

  vcpu->arch.shadow_mm = mm;
  vcpu->arch.shadow_pgd_addr = page_to_phys(virt_to_page(mm->pgd));

  kvm_guest_enter();
  kvmarm_pre_guest_enter(vcpu, 0);
  r = __kvmarm_vcpu_run(vcpu);

  kvm_guest_exit();
out:
  return r;
}


Some pre-processing for shared format guest run:
@
@ Input: r0 - struct kvm_vcpu
@
.globl __kvmarm_vcpu_run
__kvmarm_vcpu_run:
  @ store host status
  str   r0, [r0, #VCPU_HOST_GPR(0)]  @ Store R0 first, because we need offset
  add   r0, r0, #VCPU_HOST_GPR(1)    @ Bump R0 to R1 slot
  stmia r0, {r1 - pc}                @ Store the rest of the registers, probably
                                     @ don't need to store PC
  sub   r0, r0, #VCPU_HOST_GPR(1)    @ R0 is now back to the start of kvm_vcpu
  mrs   r1, spsr                     @ Save the host's SPSR
  str   r1, [r0, #VCPU_HOST_SPSR]
  mrs   r1, cpsr                     @ Save the host's CPSR
  str   r1, [r0, #VCPU_HOST_CPSR]
  b     guest_run


Actual dispatching of the guest:
guest_run:
  ldr   r1, [r0, #VCPU_REGS+S_PSR]   @ get the cpsr in r1
  msr   spsr_cxsf, r1                @ Copy to spsr
  ldr   lr, [r0, #VCPU_REGS+S_PC]    @ Get where we're going in our R14

  @ switch the page tables
  mrc   p15, 0, r1, c2, c0, 0
  str   r1, [r0, #VCPU_HOST_PGD_ADDR]
  ldr   r1, [r0, #VCPU_SHADOW_PGD_ADDR]

  @ Then the kernel context switch code...
  mov   ip, #0
#ifdef CONFIG_CPU_DCACHE_WRITETHROUGH
  mcr   p15, 0, ip, c7, c6, 0        @ invalidate D cache
#else
@ && 'Clean & Invalidate whole DCache'
1: mrc  p15, 0, r15, c7, c14, 3      @ test, clean, invalidate
  bne   1b
#endif
  mcr   p15, 0, ip, c7, c5, 0        @ invalidate I cache
  mcr   p15, 0, ip, c7, c10, 4       @ drain WB
  mcr   p15, 0, r1, c2, c0, 0        @ load page table pointer
  mcr   p15, 0, ip, c8, c7, 0        @ invalidate I & D TLBs

  add   r0, r0, #VCPU_REGS+S_R0      @ offset to prepare loading the rest of
                                     @ the regs
  ldmia r0, {r0 - lr}^               @ load user mode registers
  movs  pc, lr                       @ Do the branch, setting the new CPSR
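
Taken together, guest_run and the exit path in B.2 form a loop: the guest runs natively until an exception occurs, the exit is handled, and the guest is either resumed or stopped. The following is a minimal user-space sketch of that control flow only. The demo_* functions are hypothetical stand-ins for guest_run and kvmarm_handle_exit, and the assumption that the exit path re-enters the guest directly on RESUME_GUEST is ours, since the snippet in B.2 ends at the comparison.

```c
#include <stddef.h>

enum exit_action { ACTION_RESUME_GUEST, ACTION_STOP_GUEST };

struct demo_vcpu {
    int resumable_exits;   /* exits that can still be handled in place */
    int last_index;        /* interrupt index of the most recent exit */
};

/* Stand-in for guest_run plus the exception that ends native execution;
 * returns the interrupt index that caused the exit. */
static int demo_enter_guest(struct demo_vcpu *vcpu)
{
    vcpu->last_index = 2;  /* pretend every exit is a software interrupt */
    return vcpu->last_index;
}

/* Stand-in for kvmarm_handle_exit: decides how to continue. */
static enum exit_action demo_handle_exit(struct demo_vcpu *vcpu, int index)
{
    (void)index;
    if (vcpu->resumable_exits > 0) {
        vcpu->resumable_exits--;
        return ACTION_RESUME_GUEST;
    }
    return ACTION_STOP_GUEST;
}

/* Runs the guest until an exit cannot be resumed; returns the exit count. */
static int run_vcpu(struct demo_vcpu *vcpu)
{
    int exits = 0;

    for (;;) {
        int index = demo_enter_guest(vcpu);

        exits++;
        if (demo_handle_exit(vcpu, index) != ACTION_RESUME_GUEST)
            return exits;
    }
}
```

With three resumable exits the loop enters the guest four times: three exits are handled in place and the fourth stops the guest.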




B.4        Switching page tables
  mcr   p15, 0, ip, c7, c5, 0        @ invalidate I cache
  mcr   p15, 0, ip, c7, c10, 4       @ drain WB
  mcr   p15, 0, r1, c2, c0, 0        @ load page table pointer
  mcr   p15, 0, ip, c8, c7, 0        @ invalidate I & D TLBs




B.5        Page table creation
Create a stripped down template mm based on the currently running process.
struct mm_struct *create_template_mm(void)
{
  struct mm_struct *mm;
  struct vm_area_struct *vmnext;
  struct vm_area_struct *vma;
  int ret;

  /*
   * Copy some existing mm and clean it up to have
   * a clean template to work with
   */
  mm = dup_mm(current);
  if (!mm)
    return NULL;
  vmnext = mm->mmap;

  while (vmnext) {
    /* Fetch the next vma first, do_munmap() frees the current one */
    vma = vmnext;
    vmnext = vmnext->vm_next;

    ret = do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
    if (ret < 0) {
      printk(KERN_CRIT "Could not unmap vma for guest\n");
      return NULL;
    }
  }

  return mm;
}
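
The loop above reads the next pointer before unmapping the current area, because the unmap releases the node being visited. The following is a minimal user-space sketch of the same traversal pattern on a plain singly linked list. The types and functions are hypothetical and only illustrate the ordering; they do not use any kernel API.

```c
#include <stddef.h>
#include <stdlib.h>

struct region {
    unsigned long start, end;
    struct region *next;
};

struct space {
    struct region *head;
    size_t count;
};

/* Prepends a region; returns 0 on success, -1 on allocation failure. */
static int add_region(struct space *s, unsigned long start, unsigned long end)
{
    struct region *r = malloc(sizeof(*r));

    if (!r)
        return -1;
    r->start = start;
    r->end = end;
    r->next = s->head;
    s->head = r;
    s->count++;
    return 0;
}

/* Unlinks and frees one region, as an unmap would. */
static void unmap_region(struct space *s, struct region *r)
{
    struct region **link = &s->head;

    while (*link && *link != r)
        link = &(*link)->next;
    if (*link) {
        *link = r->next;
        free(r);
        s->count--;
    }
}

/* Empties the space; the next pointer is read before the node is freed. */
static void clear_space(struct space *s)
{
    struct region *next = s->head;

    while (next) {
        struct region *cur = next;

        next = next->next;
        unmap_region(s, cur);
    }
}
```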


Map the KVM memslot into the mm page table:
/*
 * Maps virtual guest addresses to machine physical
 * as specified in memslot.
 *
 * @kvm: The VM to map
 * @mm: The "shadow" mm struct
 * @memslot: The allocated kvm memory slot from userspace
 */
int mmap_memslot_guest(struct kvm *kvm,
                       struct mm_struct *mm,
                       struct kvm_memory_slot *memslot)
{
  int i;
  int ret;
  unsigned long vm_flags;
  struct vm_area_struct *vma, *prev;
  struct rb_node **rb_link, *rb_parent;
  unsigned long gfn, addr;
  unsigned long guest_va = memslot->base_gfn << PAGE_SHIFT;
  unsigned int len = memslot->npages << PAGE_SHIFT;
  int remap_pfn;

  /* Create flags on VM for all access */
  /* XXX maybe read the flags off the memslot instead */
  vm_flags = calc_vm_prot_bits(PROT_EXEC | PROT_READ | PROT_WRITE | PROT_NONE)
           | calc_vm_flag_bits(MAP_SHARED)
           | mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

  /* Called to fill in prev, rb_link and rb_parent for vma_link() below */
  vma = find_vma_prepare(mm, guest_va, &prev, &rb_link, &rb_parent);
  vma = kmalloc(sizeof(struct vm_area_struct), GFP_KERNEL);
  if (!vma)
    return -ENOMEM;

  vma->vm_mm = mm;
  vma->vm_start = guest_va;
  vma->vm_end = guest_va + len;
  vma->vm_flags = vm_flags;
  vma->anon_vma = NULL;
  vma->vm_page_prot = vm_get_page_prot(vm_flags);
  vma->vm_file = NULL;
  vma->vm_pgoff = 0;

  /* tweak it so that it is not cow */
  vma->vm_flags |= VM_SHARED;

  vma_link(mm, vma, prev, rb_link, rb_parent);
  mm->total_vm += len >> PAGE_SHIFT;

  for (i = 0; i < memslot->npages; i++) {
    /* Map guest frame number to host virtual address */
    gfn = memslot->base_gfn + i;
    addr = gfn_to_hva(kvm, gfn);

    /* Map host address to physical frame number */
    remap_pfn = va_to_pfn(current->mm, addr);
    if (remap_pfn < 0)
      continue;

    /* Map the physical frame number in the page table */
    ret = map_gva_to_pfn(mm, gfn << PAGE_SHIFT, remap_pfn);

    if (ret < 0)
      return ret;
  }

  return 0;
}
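
The address arithmetic in mmap_memslot_guest is easiest to follow with concrete numbers. The following is a minimal sketch, assuming 4 KB pages (a page shift of 12); the struct and helper names are hypothetical and only mirror the base_gfn and npages fields used above.

```c
#define PAGE_SHIFT_4K 12   /* assumed page size of 4 KB */

struct memslot_model {
    unsigned long base_gfn;   /* first guest frame number of the slot */
    unsigned long npages;     /* number of pages in the slot */
};

/* Guest virtual address at which the slot's VMA starts. */
static unsigned long slot_guest_va(const struct memslot_model *slot)
{
    return slot->base_gfn << PAGE_SHIFT_4K;
}

/* Length of the slot's VMA in bytes. */
static unsigned long slot_len(const struct memslot_model *slot)
{
    return slot->npages << PAGE_SHIFT_4K;
}

/* Guest address of the i-th page, as passed to map_gva_to_pfn. */
static unsigned long page_guest_va(const struct memslot_model *slot,
                                   unsigned long i)
{
    return (slot->base_gfn + i) << PAGE_SHIFT_4K;
}
```

For a slot with base_gfn = 0x100 and npages = 16, the VMA covers 0x100000 up to 0x110000 (length 0x10000), and page 3 of the slot is mapped at guest address 0x103000.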


Remap a physical frame in the page table to the corresponding virtual address:
/*
 * Maps a virtual address in the supplied mm's pgd
 * table to a physical frame.
 * @mm: custom mm for shadow page table
 * @addr: virtual address representation of gfn
 * @pfn: the machine physical frame number
 */
int map_gva_to_pfn(struct mm_struct *mm,
                   unsigned long addr,
                   unsigned long pfn)
{
  pgd_t *pgd;
  pud_t *pud;
  pmd_t *pmd;
  pte_t *pte, entry, *page_table;
  struct page *page;
  struct vm_area_struct *vma;

  if (!(vma = find_vma(mm, addr)))
    return -EFAULT;

  /* Walk the page table, allocating missing levels on the way down */
  pgd = pgd_offset(mm, addr);
  pud = pud_alloc(mm, pgd, addr);
  if (!pud)
    return -ENOMEM;
  pmd = pmd_alloc(mm, pud, addr);
  if (!pmd)
    return -ENOMEM;

  pte = pte_alloc_map(mm, pmd, addr);
  if (!pte)
    return -ENOMEM;
  page_table = pte;
  page = pfn_to_page(pfn);
  entry = mk_pte(page, vma->vm_page_prot);
  entry = pte_mkwrite(pte_mkdirty(entry));

  set_pte_at(mm, addr, page_table, entry);

  return 0;
}
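
The walk performed by map_gva_to_pfn can be modelled with a toy two-level table. The following is a minimal user-space sketch, assuming the ARM hardware layout for 4 KB small pages (a first-level index taken from address bits [31:20] and a second-level index from bits [19:12]). The types, the PTE_VALID bit and the function names are hypothetical and do not correspond to kernel structures.

```c
#include <stdint.h>
#include <stdlib.h>

#define L1_ENTRIES    4096u   /* indexed by va[31:20] */
#define L2_ENTRIES    256u    /* indexed by va[19:12] */
#define PAGE_SHIFT_4K 12u
#define PTE_VALID     1u      /* hypothetical "present" bit */

struct toy_pgd {
    uint32_t *l2[L1_ENTRIES];   /* NULL until a second level is allocated */
};

/* Maps the page containing va to frame pfn, allocating the second-level
 * table on demand. Returns 0 on success, -1 on allocation failure. */
static int toy_map(struct toy_pgd *pgd, uint32_t va, uint32_t pfn)
{
    uint32_t i1 = va >> 20;
    uint32_t i2 = (va >> PAGE_SHIFT_4K) & (L2_ENTRIES - 1u);

    if (!pgd->l2[i1]) {
        pgd->l2[i1] = calloc(L2_ENTRIES, sizeof(uint32_t));
        if (!pgd->l2[i1])
            return -1;
    }
    pgd->l2[i1][i2] = (pfn << PAGE_SHIFT_4K) | PTE_VALID;
    return 0;
}

/* Translates va to a physical address. Returns 0 on success, -1 when the
 * address is not mapped. */
static int toy_translate(const struct toy_pgd *pgd, uint32_t va, uint32_t *pa)
{
    uint32_t i1 = va >> 20;
    uint32_t i2 = (va >> PAGE_SHIFT_4K) & (L2_ENTRIES - 1u);
    uint32_t pte;

    if (!pgd->l2[i1])
        return -1;
    pte = pgd->l2[i1][i2];
    if (!(pte & PTE_VALID))
        return -1;
    *pa = (pte & ~0xFFFu) | (va & 0xFFFu);
    return 0;
}
```

Mapping the guest address 0x00103000 to frame 0x8042 makes the address 0x00103abc translate to 0x08042abc, while an address in an untouched first-level slot is reported as unmapped.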



