Characterizing Low Level Virtual Machine
Performance in Scientific Applications
Wesley Emeneker and Amy Apon
Abstract— Virtual machine use in datacenters is increasing. Despite this boost in acceptance, virtual machine application execution negatively impacts performance. Furthermore, little is known about the exact causes of performance loss. This paper presents an initial investigation of performance-impacting machine-level events. The High Performance LINPACK application is benchmarked in the Xen virtual machine monitor on two different x86_64 CPU architectures. Several machine-level events are gathered, including translation lookaside buffer misses and CPU pipeline clears. Results from the experiments show that (expectedly) virtual machines suffer more performance-detracting events than native execution. However, the magnitude of differences between native and virtual machine execution is unexpected.

I. INTRODUCTION

Virtual machine (VM) usage in datacenters is increasing. In an era of computing where clusters of hundreds of computers reside in datacenters, the drive for sheer computational power is giving ground to the need for machine management and service reliability. VMs act like completely independent computers, regardless of whether or not they actually are. Though they suffer a performance penalty when compared to native application execution, many are willing to accept this cost in order to benefit from VM abilities such as live migration between physical hosts, transparent checkpointing and suspension, and secure resource partitioning.

Despite the surge in popularity, little is known about the exact causes of VM performance degradation. Many popular VM technologies suffer performance loss, ranging from 1-2% to 10-20% depending on the system workload. A VM may perform well with one application and poorly with another. For example, a VM that executes a serial workload well may perform poorly with threaded applications. In this work we quantify the cost of running a scientific application inside a Xen VM based on several profiling features available on the CPU. Translation lookaside buffer (TLB) misses, CPU pipeline clears, and page table walks are a few of the events recorded. Oprofile and the Xenoprofile extension are used to gather the machine-level event data.

II. BACKGROUND

This work combines virtualization and performance profiling. The virtualization technology chosen for experiments is the Xen hypervisor. The performance profiling technology used is Oprofile and its cousin Xenoprofile. Xen is chosen because of its performance as well as its machine profiling capabilities. Xenoprofile is chosen for its implementation of a Xen-level profiler that fully utilizes built-in CPU performance counters.

A. Xen

The Xen hypervisor is a paravirtualizing virtual machine monitor (VMM). Paravirtualization is a technique in which the machine architecture presented to an operating system is not identical to the actual hardware architecture. With a few small changes to the underlying architecture, a paravirtualizing VMM securely partitions guest VMs while providing a near-native level of performance. For example, one necessary change is to the memory management system of VMs. Since the VM does not control the machine, it must communicate with the VMM, which does control the computer. Whenever a VM needs more memory or wants to change page table mappings, the VMM must validate the operation before allowing it to execute. In this way, the VMM can ensure that all VMs on the system are securely partitioned. The disadvantage to this approach is that paravirtualization requires modifications to the host OS. So, in order to use a Xen VMM, the OS kernel must be modified to function properly.

The Xen VMM supports virtualization extensions on recent x86 CPUs designed for fully virtualized machines. However, the cost of full virtualization outweighs the benefits gained using the CPU extensions. Full virtualization implies that the BIOS, disk drive, video cards, network cards, sound cards, CD-ROM drives, etc. are virtualized (some of these devices can be removed, but some cannot). Many applications do not need a BIOS, a sound card, or a video card. Instead, these applications simply need a CPU, network, and a way to write output (to disk or stdout, for example). Removing support for emulated hardware that is not needed by the application improves the performance of the VM.
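To make the validation step described above concrete, the sketch below shows the general shape of a paravirtualized page-table update in C. It is our own illustration: the names (guest_set_pte, vmm_validate_and_install) and the toy ownership table are hypothetical and do not correspond to Xen's actual hypercall interface.

    #include <stdio.h>

    /* Hypothetical sketch of the validation step in a paravirtualized
     * page-table update.  Names and data structures are illustrative
     * only; this is not Xen's actual hypercall interface. */

    typedef unsigned long pte_t;
    typedef unsigned long pfn_t;

    #define GUEST_FRAMES 4
    /* Toy ownership table: the physical frames this guest may map. */
    static const pfn_t guest_owned[GUEST_FRAMES] = { 100, 101, 102, 103 };

    /* "VMM" side: install the mapping only if the frame belongs to the guest. */
    static int vmm_validate_and_install(pte_t *pte_slot, pfn_t frame, int writable)
    {
        for (int i = 0; i < GUEST_FRAMES; i++) {
            if (guest_owned[i] == frame) {
                *pte_slot = (frame << 12) | (writable ? 0x3 : 0x1); /* toy PTE */
                return 0;
            }
        }
        return -1; /* refused: frame is not owned by this guest */
    }

    /* "Guest" side: ask the VMM instead of writing the entry directly. */
    static int guest_set_pte(pte_t *pte_slot, pfn_t frame, int writable)
    {
        return vmm_validate_and_install(pte_slot, frame, writable);
    }

    int main(void)
    {
        pte_t pte = 0;
        printf("map owned frame:   %d\n", guest_set_pte(&pte, 101, 1)); /* 0  */
        printf("map foreign frame: %d\n", guest_set_pte(&pte, 999, 1)); /* -1 */
        return 0;
    }

The point of the indirection is that the guest kernel never installs a translation the VMM has not checked, which is exactly the isolation property the paragraph above describes.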
B. Oprofile and Xenoprofile

Oprofile is a free-software statistical profiling package for Linux. It supports a range of kernels, from 2.2 to 2.6, and multiple CPU architectures including x86, ARM, and Alpha. Oprofile uses built-in CPU performance counters to obtain data about executing applications and processes. CPU-supported events enable oprofile to accurately gather system information while having a low impact on system performance. For example, oprofile can capture data about translation lookaside buffer hits and misses. Information that is not kept in CPU performance counters, such as how many processes are started, is not captured by oprofile. Oprofile writes values into CPU registers in a way that specifies which events to gather. Once set, the CPU will log the number of times the event occurs. Oprofile also sets a counter limit in CPU registers so that whenever the desired event occurs a specified number of times, an interrupt is generated. Upon receiving an interrupt, oprofile logs data that includes the process or module that was executing when the event was generated as well as the function or frame of the segment of code.

Xenoprofile is an extension to oprofile that interfaces with the Xen VMM. The Xen hypervisor controls the CPU. Figure 1 shows an example of how the Xen VMM interacts with host and guest OS kernels on an x86_64 system. In this figure, the VMM runs with the most privilege. It completely controls the CPU, and can execute any privileged instruction. The host OS (domain 0) runs with less privilege than the VMM, but with more than a VM. Lastly, VMs execute with the least amount of privilege. Xenoprofile bridges the Xen hypervisor, Xen Domain 0 (the controlling Linux kernel), and any VM desired. Additionally, Xenoprofile sets CPU event counters through the hypervisor. This software records events that occur while executing in the hypervisor, in Domain 0, and in VMs. Xenoprofile's bridge enables accurate comparison between each environment.

Fig. 1. Xen VMM on an x86_64 system
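The following sketch illustrates the count-and-overflow idea described above using the perf_event_open(2) interface found in later Linux kernels. It is our illustration of the mechanism only; oprofile 0.9.3 programs the counters through its own kernel driver rather than through this system call.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Program one hardware counter, let it count, then read the total.
     * Setting sample_period would instead request an interrupt/sample
     * every N occurrences, analogous to the overflow threshold that
     * oprofile programs into the counter registers. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* retired instructions */
        attr.disabled = 1;
        attr.exclude_kernel = 0;   /* count kernel and user space alike */
        /* attr.sample_period = 100000;  one sample per 100,000 events   */

        int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 10000000UL; i++)
            x += i;                              /* work to be measured */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long count = 0;
        read(fd, &count, sizeof(count));
        printf("instructions retired: %lld\n", count);
        close(fd);
        return 0;
    }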
C. Related Research

The performance of VMs with respect to machine-level events has been subjected to scrutiny, often with a focus on networking. Santos examines the use of combined hardware/software network virtualization where network cards implement a thin layer of abstraction for speeding VM network performance. Raj's research takes a path similar to Santos. Menon's research uses three techniques to improve Xen VM network performance. First, the virtual network interface is improved by offloading checksum calculations, TCP segment offload, and zero-copy techniques. Second, they introduce a faster I/O channel for data movement. Lastly, the VMM is extended to use superpages and global page mappings in guest VMs.

Superpages and global page mappings are the techniques most relevant to this research. A superpage maps a contiguous set of physical memory pages to a contiguous set of virtual memory pages. We denote "physical" memory to be the addresses used by the VMM and controlling domain; "virtual" memory refers to the addresses used by guest VMs. Superpages offer a way to reduce the load on TLBs by representing multiple pages with a single translation. The downside of this approach with VMs is that the guest must have an idea of its physical memory assignment, breaking the abstraction of virtualization.

Global page mappings allow entries in the TLB to be kept across a TLB flush. With global mappings, a VM's entries in the TLB can be kept when a guest domain is switched out to let the VMM execute. This optimization reduces the number of TLB misses incurred by VM execution, unless multiple guest domains are switched rapidly. When switching between guest domains, a complete TLB flush is required, including all global page mappings.
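The arithmetic behind the superpage optimization can be made concrete with a small sketch. The example below is ours: it uses Linux huge pages (MAP_HUGETLB, which appeared in kernels newer than the 2.6.18 used in this work and requires huge pages to be reserved by the administrator) as a stand-in for the superpage support discussed above. One 2 MB mapping covers the same range as 512 ordinary 4 KB pages, so it consumes a single TLB entry instead of up to 512.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define BASE_PAGE  (4UL * 1024)
    #define HUGE_PAGE  (2UL * 1024 * 1024)

    int main(void)
    {
        /* One superpage replaces this many base-page translations. */
        printf("4 KB translations covered by one 2 MB superpage: %lu\n",
               HUGE_PAGE / BASE_PAGE);                        /* 512 */

        void *buf = mmap(NULL, HUGE_PAGE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");  /* e.g. no huge pages reserved */
            return 1;
        }
        /* Touch the region: a single TLB entry now translates all 2 MB. */
        ((char *)buf)[0] = 1;
        munmap(buf, HUGE_PAGE);
        return 0;
    }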
Menon's research into optimizing Xen's network performance showed that the superpage and global page mapping optimizations reduce data TLB misses by as much as a factor of four.

Tikotekar examines VM performance with HPC benchmarks. His research compares a single VM's machine-level events to that of the host machine. Tikotekar does not examine VM performance when applications are spanned across multiple nodes.

This research strictly examines machine-level events that affect performance without the goal of enhancing a particular application.
III. EXPERIMENTAL STUDY

The goal of this research is to quantify machine-level events that impact application performance. Previous work on VM performance impact focuses on improving the performance of a particular subsystem, such as network or disk performance. The performance of an application varies between executions as a result of operating system events that affect timing, cache, and memory behavior. Thus, each run of the application constitutes one sample of performance. In this study the same application is run many times to obtain a statistically valid sample set. A test consists of the application to be executed, as well as the setup required to start and stop application profiling.

The hardware used in testing consists of 32 nodes with the following characteristics:
• Dual 3.2GHz single-core Intel Xeon EM64T processors (codename "Nocona")
• 4GB RAM
• Gigabit Ethernet
• 79GB 10K RPM SCSI disk
This set of hardware was installed in May 2005, and was part of the University of Arkansas' Red Diamond supercomputer. The Nocona processors have two different translation lookaside buffers - one for instructions and one for data. The instruction TLB has 128 entries in a fully associative cache. The data TLB has sixty-four entries, also in a fully associative cache.

The second set of nodes used in testing consists of four nodes with the following hardware:
• Dual 2.66GHz quad-core Intel Xeon EM64T processors (codename "Harpertown")
• 16GB RAM
• Gigabit Ethernet
• 500GB 10K RPM SCSI disk
This set of hardware was installed in May 2008, and is part of the University of Arkansas' Star of Arkansas supercomputer. The Harpertown processors have three translation lookaside buffers - one for instructions, one for data, and a shared third. The first two TLBs are as close to the processor as the L1 cache (1-3 cycles). The instruction TLB has 128 entries. The data TLB has sixty-four entries. The shared TLB is the same distance as the L2 cache, and contains 512 entries. Results of the same experiments executed on both sets of hardware are presented.

The software used in testing is:
• Rocks 5.0 (based on CentOS 5.0)
• Linux 2.6.18
• Xen 3.1
• Oprofile/Xenoprofile 0.9.3
• High Performance LINPACK (HPL) - This is the application executed during tests. It is compiled with "g++ -O3" and is linked with the high-performance, threaded linear algebra library "GotoBLAS".

Xen 3.1 was released in May of 2007. This version of the hypervisor does not support profiling (with oprofile) on the newer "Harpertown" processors. As such, the hypervisor is modified to correctly identify the model (a "Core 2" architecture) of the newer processors. The software used in the experiments is selected to be what is believed to be commonly used in data centers and computational environments today, so that the experimental setup mirrors what is found in "real" usage.

One virtual machine image is used in testing, and the software in the VM image is nearly identical to that of the host computer. The only customizations made to each VM are the hostname and IP address. All Xen VMs are paravirtualized. No hardware assisted virtualization techniques are used.

Five different events for profiling the available nodes are studied. The events are:
1) CPU execution cycles
2) Instruction Translation Lookaside Buffer (TLB) misses
3) Data Translation Lookaside Buffer misses
4) CPU pipeline clears
5) Page walks due to Translation Lookaside Buffer misses
These events were chosen for two reasons. First, these events can be gathered on both sets of hardware. The second reason for choosing these events is to examine the most obvious events that could cause the greatest disparity in performance.
A. Experimental Setup

Each experiment consists of the run of a single instance of HPL with a single problem size. A total of five different experiments are constructed. Three of the experiments are designed to run on a single computer (when possible). The remaining two experiments are run across multiple nodes.
1) HPL1 is run on a single CPU core of a single computer.
2) HPL2 is run across two cores on a single computer.
3) HPL4 is run across four cores on a single computer. (This experiment is only applicable to the eight-core Harpertown computers.)
4) HPL2MPI is run across two cores, but on two different computers. Gigabit Ethernet connects the two computers.
5) HPL4MPI is run across four cores on four different computers, again connected with Gigabit Ethernet.

Finding the "best" problem size to execute is a difficult problem when benchmarking. Depending on the desired results, the benchmark problem size may remain constant across all experiments. This scheme will evaluate how the application responds to parallelism - i.e. finding the smallest problem size per node/core that should not be crossed. However, one goal of parallel computing is to tackle larger problems than have been possible before. If this is the goal, the previous scheme for benchmarking is not "fair". Scaling the problem size as the number of cores increases is another method of benchmarking. Keeping the problem size constant per core tests the weak scalability of the application.

In these experiments, we apply both approaches to benchmarking. Single node benchmarks use a single problem size. HPL is linked with the GotoBLAS library, which is capable of multi-threaded execution. Each job is given a number of nodes (and all processors on those nodes). Environment arguments constrain GotoBLAS to use a particular number of processors on each node. On a single node, the problem size given to HPL takes approximately 512MB of RAM during execution. When scaling the application to two distinct nodes, the same problem size is used. Keeping the same problem size reveals the limitations and impact of using a network for message passing instead of using thread communication. Specifically, the impact revealed will be in the form of machine-level events. The problem size is quadrupled when moving the application to four nodes.

Observations suggest that the HPL application performs best (on Intel processors) with approximately 3.5-4GB of RAM per core. To make each experiment perform close to optimally, the problem size must be increased from 512MB. The side effect of increasing the problem size is that the run time lengthens. Increasing the run time reduces either the number of events that can be profiled, or the number of experiments that can be run in a reasonable amount of time. Both the events and the number of experiments are important for this research, so the problem size is restricted.
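As a rough, illustrative calculation of our own (HPL's dominant memory consumer is its dense N x N double-precision matrix, at 8 bytes per element), the two memory targets quoted above correspond approximately to the following problem sizes:

    \[
      N_{\mathrm{512\,MB}} \approx \sqrt{\frac{512 \times 2^{20}}{8}} = 8192,
      \qquad
      N_{\mathrm{3.5\,GB}} \approx \sqrt{\frac{3.5 \times 2^{30}}{8}} \approx 21{,}700 .
    \]

Since HPL's run time grows much faster than linearly in N, the larger value illustrates why running near the 3.5-4GB optimum would make each profiled run far longer, which is the trade-off behind restricting the problem size.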
IV. RESULTS

For each event, all oprofile results are aggregated and sorted by application and number of occurrences. If an application does not appear during each experiment, it is not represented in the results. It is unlikely that an application that does not appear in each experiment has a significant impact on execution.

In each set of results, the environments that are compared are native Linux, Xen Domain 0, and Xen Domain U. Native Linux is the operating system environment commonly used in high performance clusters today. In native Linux, the Linux kernel has complete control of the host. There are no guest VMs running.

Xen Domain 0 is the Xen analog to native Linux. In Xen Domain 0, the hypervisor runs "underneath" a modified Linux kernel. The modifications made are necessary to support the paravirtualization technique described in section II. In order to run Xen VMs, a Domain 0 is required. Applications are executed in Domain 0 to see what impact the Xen layer has without any VM interference.

Xen Domain U is the guest VM, and runs on top of Xen Domain 0. The guest VM communicates with and requests resources from both Domain 0 and the Xen VMM. A modified Linux kernel runs inside the Xen VM.

The data gathered can generally be grouped into kernel and user space (i.e. time spent executing in the kernel vs. time spent in user space). However, since this is a comparison between Xen and Linux, grouping both into "kernel" space would defeat some of the purpose of the comparison. Additionally, since we are executing the HPL application, we group user-space data into HPL events and "other" user events. The resulting grouping of data will examine Xen, Linux, HPL, and "other" events separately.

Many of the figures shown in this section are aggregations of multiple experiments. The results of each experiment are three-fold - Linux, Xen Domain 0, and Xen Domain U. In figures with experiment aggregation, results from HPL1, HPL2, and HPL4 may be presented side-by-side. In all experiments, the comparison of interest is strictly between Linux, Domain 0, and Domain U.

Due to space constraints, not all results are shown. Results that are similar between experiments or processors (i.e. Noconas and Harpertowns) are not shown; however, omitted results will be noted.

A. CPU Execution Cycles

The CPU execution cycles event gathers the number of cycles during which the CPU executes instructions. This event does not record actual runtime.

Fig. 2. Global Power Events on Nocona Processors

For this experiment the event counter is set to overflow (and thus generate an interrupt) for every 100,000 instructions executed. Of all events counted, the occurrence of this event is expected to be the most similar between Linux, Xen Domain 0, and Xen Domain U. Application runtimes are similar between each environment, so intuitively the number of instructions executed per platform should also be comparable. Figure 2 shows that the mean number of events for a single experiment is very similar between the three environments. Figures for the Harpertown processors are not shown since the results are the same as with the Nocona processors. This event only measures the number of cycles during which the CPU executed instructions. HPL benchmark numbers are nearly identical for each environment (they differ by 1-2%), so the amount of time the CPU spends executing instructions should be very similar. When compared with the benchmark output, the results show that the environments put forth the same amount of work to complete the task. As a counterexample, if the Xen environments had executed ten or twenty percent more instructions than Linux but gave the same benchmark results, we would determine that the CPU stalled more during execution in the Linux environment. Additionally, it would mean that the Xen environments execute more instructions to perform the same amount of work. However, since the benchmark numbers and experiment results are similar, we can determine that the Xen environments execute approximately the same number of instructions to perform the same work.

Another behavior seen in figure 2 is the increase in the amount of time spent executing in the kernel. In all single node experiments (HPL1 and HPL2) and in HPL2MPI, kernel execution cycles are very small. Only in the HPL4MPI test does kernel execution take a significant amount of time. In this test, the amount of time spent executing the HPL application remains the same across environments. The increase in instructions executed occurs in the Linux kernel and Xen kernel. Interestingly, the Xen environments spend less time in the Linux kernel, but more time overall in kernel-space.

One difference to note in the experiments occurs between the HPL2 and HPL2MPI tests. Figure 2 shows that the number of cycles spent executing instructions is reduced by fifty percent going from HPL2 to HPL2MPI. This is explained by the setup of the experiment. On a single node, two processors are used in the HPL2 experiment. So, even though the benchmark runs for half the time, the same number of instructions are executed (and the same number of cycles are spent executing). The event counters do not distinguish between events executed by different processors. Instead, all the events are aggregated. However, in the HPL2MPI experiment, only a single processor on each node is used. In this case, the same number of instructions are executed, but oprofile only tabulates results from one processor per node. In the figure, we do not sum the instructions executed over all processors. We only report the mean number of instructions executed per node. Thus, the apparent reduction in instructions executed is only true per node. The total number of instructions executed by all nodes is approximately the same.
B. Instruction Translation Lookaside Buffer Misses

This event reveals how many Instruction Translation Lookaside Buffer (ITLB) checks were missed when looking for the next executable CPU instruction. Based on how the Xen VMM and guest VMs interact, we expect that native Linux will have the fewest ITLB misses, followed by Xen Domain 0, and finally Xen guest VMs (Domain U). For this experiment the event counter is set to overflow when three thousand ITLB misses occur.

Fig. 3. Instruction TLB Misses on Nocona Processors

Figures 3 and 4 show the differences between environments. Many more instruction TLB misses are generated by the Xen environments. Although it is expected that Xen will incur more ITLB misses, the factor of sixteen increase on a single node and the factor of one hundred increase on four nodes are unexpected. Using the Xen virtualization technology without virtual machines incurs an order of magnitude more ITLB misses than Linux. Furthermore, a guest VM (DomU) incurs fewer ITLB misses than the controlling domain (Dom0). Results from Harpertown experiments mirror those seen in figures 3 and 4, and so are not shown. Superpages and global page mappings are not available for pages with executable instructions, so this lack of optimizations adversely affects the number of ITLB misses. Whenever a VM is switched out (i.e. context switched) or when a Virtual CPU (VCPU) is assigned to a different physical CPU, a TLB flush occurs. Without global page mappings, the number of entries flushed increases. Without superpages, the number of entries that have to be brought back into the TLB also increases. The lack of these optimizations accounts for at least some of the difference in TLB misses, but it does not appear to completely explain the factor of sixteen difference on a single node and the factor of one hundred difference across nodes. Further investigation is required to fully explain the deviation between Xen and Linux.

Fig. 4. Instruction TLB Misses on Nocona Processors

Some applications may perform worse than expected in VMs with this signature of ITLB misses. For example, consider an application that has a number of "if . . . then" (conditional) clauses with large segments of code between the clauses. Each segment of conditional code takes several pages of instructions. If this application checks the conditions frequently, the ITLB misses in Xen may significantly impact application execution time. In applications with small, frequently used loops (as in HPL, for example), ITLB misses do not cause a significant impact.
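To make the contrast above concrete, the fragment below (our illustration, not code from HPL or any benchmark used here) sketches the two code shapes: a small, hot loop whose instructions fit in a few pages, and a dispatcher that keeps branching into separate regions, each of which stands in for several pages of instructions.

    #include <stdio.h>

    /* (a) A tight, small loop, as in HPL's inner kernels: the instruction
     *     working set fits in a few pages, so ITLB misses are rare. */
    static double dot(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    /* (b) Frequent branching between separate code regions.  Each handler
     *     stands in for several pages of instructions; jumping between such
     *     regions on every check keeps evicting and refilling ITLB entries. */
    static void handler_a(void) { puts("large code region A"); }
    static void handler_b(void) { puts("large code region B"); }
    static void handler_c(void) { puts("large code region C"); }

    static void dispatch(int condition)
    {
        if (condition == 0)      handler_a();
        else if (condition == 1) handler_b();
        else                     handler_c();
    }

    int main(void)
    {
        double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
        printf("dot = %f\n", dot(a, b, 4));
        for (int i = 0; i < 3; i++)
            dispatch(i);
        return 0;
    }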
C. CPU Pipeline Clears

The information gathered by this event tells the number of cycles in which the CPU pipeline was cleared of all instructions. All modern x86 CPUs have deep execution pipelines so that streams of instructions can be executed quickly. A CPU pipeline clear is an expensive operation, costing many cycles of execution, both for the work that is discarded and for the time required to refill the pipeline. This event is set to generate an interrupt whenever the pipeline has been cleared a specified number of times.

Figure 5 shows the number of cycles during which the CPU execution pipeline is cleared. Unfortunately, the CPU pipeline clear event did not function on the Harpertown processors, so no results are presented. The cause of the Harpertown's non-functioning pipeline clear event is currently unknown.

Fig. 5. CPU Pipeline Clears on Nocona Processors

It is expected that the Xen environments will cause many more clears. The reasoning behind this expectation is that the Xen hypervisor, Domain 0, and Domain U are all securely partitioned. For example, if the hypervisor needs to execute, it would seem that the CPU pipeline should be cleared in order to remove possible interference between environments.

Instead of this expected behavior, the Xen environments have many fewer clears than Linux. Linux has approximately a factor of ten more clears than the Xen environments. Furthermore, when the application is spanned over multiple computers, the number of cycles spent with the CPU pipeline cleared does not increase in the same manner as in the Linux environment. When the application is spanned over two nodes, the mean number of cycles spent with the CPU pipeline cleared decreases.

One reason for pipeline clears is memory ordering issues. Modern x86 CPUs perform memory access reordering in order to increase application performance. Unfortunately, not all memory accesses can be reordered. When an absolute ordering is required, the CPU may be forced to flush the pipeline until the proper sequence of memory operations (loads and stores) is completed.
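The sketch below, which is ours and not taken from HPL, shows the kind of cross-thread communication in which ordering matters. The __sync_synchronize() builtin (a full memory barrier, compiled to an instruction such as mfence on x86) forces the pending loads and stores to complete in program order at that point; ordering violations detected during speculative execution are one of the conditions that force a processor to discard in-flight work and clear the pipeline.

    /* Compile with: gcc -O2 -pthread ordering.c */
    #include <pthread.h>
    #include <stdio.h>

    static volatile int payload = 0;
    static volatile int ready   = 0;

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;              /* store the data ...              */
        __sync_synchronize();      /* ... and make it visible first   */
        ready = 1;                 /* then publish the flag           */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!ready)             /* spin until the flag is set      */
            ;
        __sync_synchronize();      /* do not let the payload load be
                                      satisfied ahead of the flag     */
        printf("payload = %d\n", payload);
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }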
The reason for the results shown in figure 5 is currently unknown, and warrants further investigation. It is not clear why the Xen environments suffer fewer pipeline clears, but this behavior may be (partially) due to memory ordering issues.
D. Data Translation Lookaside Buffer Misses

The data TLB provides the same function as the instruction TLB - it caches recently needed translations from virtual addresses to physical addresses. The difference between the two TLBs is that the data TLB is used for translations referring to data, while the instruction TLB is used for translations referring to executable instruction pages. The event counter is set to interrupt after a specified number of misses.

Fig. 6. Data TLB Misses on Nocona Processors

Figure 6 shows that the Xen environments generate approximately the same number of page walks due to data TLB misses on the Nocona processors. The most interesting data from the Harpertown experiments are shown in figure 7. Data from the HPL1, HPL2, and HPL2MPI experiments are not shown in figure 7 since they mirror the results from figure 6. The four-core experiment (HPL4) results in a large number of data TLB misses, but only in a guest VM. Both Xen Domain 0 and Linux have approximately the same number of misses. Furthermore, the increase in TLB misses in the guest VM shows up only in "vmlinux", which is Domain 0's kernel. In essence, oprofile captured more miss events occurring only in the host kernel. Identical behavior is seen in the HPL4MPI experiment. In this case, the VM's TLB misses are almost identical to Domain 0's misses. The only difference between the two experiments is that HPL4 runs on a single node, sharing all processors. HPL4MPI runs on four nodes and shares no processors.

Fig. 7. Data TLB Misses on Harpertown Processors

One explanation for this behavior is that Xen Virtual CPUs (VCPUs) were not pinned to real processors. Xen keeps shadows of TLB entries for guest domains. If a VM's VCPUs are switched between physical processors during execution, the per-CPU TLBs are flushed and refilled. If this switching occurred during execution, the behavior seen by the guest VM in the HPL4 experiment could be explained.

The number of data TLB misses in the Xen environments is much smaller than the number of instruction TLB misses. Since the data TLB and instruction TLB are closely related (i.e. they both perform the same function), and the data TLB is smaller, it is expected that there would be approximately the same ratio of misses as seen in the ITLB experiment. Instead, the number of misses differs by less than a factor of four between Xen and Linux. One reason for this is the use of superpages and global page mappings. As discussed earlier (and in previous research), superpages and global page mappings help reduce the load on TLB translations by batching contiguous regions of pages and by removing the need to flush all entries from the TLB (in some circumstances).
E. Page Walks

In the x86 (and x86_64) architecture, the CPU walks the page tables whenever a TLB miss occurs. If the CPU finds a valid translation in memory, it brings the translation into the TLB and retries the memory reference. The less time the page walk uses, the faster the memory reference. The Harpertown processors have a new (compared to the Noconas) event capability that gathers not only the number of page walks, but also the number of cycles used by the page walks. Figures 8 and 9 show the differences in both the number of page walks and the time required for page walks in the HPL1, HPL2, and HPL4 experiments. These figures aggregate the page walks required by both instruction and data TLB misses.

Fig. 8. Number of Page Walks Performed on Harpertown Processors

Fig. 9. Cycles Required to Complete Page Walks on Harpertown Processors
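As a reminder of what each walk costs, the following simplified model (ours; it mirrors neither the hardware's exact algorithm nor Xen's added translation layer) performs the four-level x86_64 walk in software: every TLB miss requires up to four dependent memory reads before the faulting access itself can complete.

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified model of the 4-level x86_64 walk (PML4 -> PDPT -> PD -> PT). */
    #define LEVELS  4
    #define ENTRIES 512               /* 9 address bits per level */

    typedef uint64_t pte_t;

    /* walk(): index one table per level with 9 bits of the virtual address. */
    static pte_t walk(pte_t *top, uint64_t vaddr, int *memory_reads)
    {
        pte_t *table = top;
        pte_t entry = 0;
        for (int level = LEVELS - 1; level >= 0; level--) {
            unsigned idx = (vaddr >> (12 + 9 * level)) & (ENTRIES - 1);
            entry = table[idx];                 /* one memory read per level */
            (*memory_reads)++;
            if (!(entry & 1))                   /* not present: page fault   */
                return 0;
            table = (pte_t *)(uintptr_t)(entry & ~0xfffULL);
        }
        return entry;                           /* final PTE: frame + flags  */
    }

    int main(void)
    {
        /* One 4 KB-aligned toy table reused at every level. */
        static _Alignas(4096) pte_t toy[ENTRIES];
        for (int i = 0; i < ENTRIES; i++)
            toy[i] = (pte_t)(uintptr_t)toy | 1; /* "next level" = same table */

        int reads = 0;
        walk(toy, 0x7f1234567000ULL, &reads);
        printf("memory reads needed for one TLB miss: %d\n", reads);  /* 4 */
        return 0;
    }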
Figure 8 shows that the number of page walks across the environments is roughly the same. The Xen environments (Domain 0 and guest VMs) suffer approximately ten percent more page walks than Linux. However, figure 9 shows the total number of CPU cycles spent performing page walks. In the HPL1 and HPL2 experiments, the CPU spends twice the time walking the page tables for the Xen environments as for Linux. One explanation for this is that the Xen hypervisor manages page tables for guest VMs. An extra layer of translation overhead obviously costs more than a single translation.

Additionally, the time spent walking page tables remains semi-constant for Xen, while Linux's time nearly doubles from HPL1 to HPL4. Page tables are managed in memory, so as more CPUs attempt to walk the page tables, contention for a fixed resource (the table) will slow operations. Under this assumption, page table walking for the Xen environments would also be expected to increase the number of cycles required, except that the Xen VMM is already providing another layer of translation, and is taking more time than any walk in native Linux. Thus, the time for page walks under Xen remains semi-constant.
V. CONCLUSIONS

For the most part, the results are as expected. Xen has more events that negatively affect performance than Linux. It is expected that VMs suffer a performance penalty compared to native application execution, and it follows that the performance counters will provide evidence of application impact. However, several interesting things happened. First, in many cases, the actual VM has fewer ITLB misses and CPU pipeline clears than Domain 0. The reason for this is unknown, but may be related to the fact that each VM is only allocated 1GB of RAM, while Domain 0 has the full 4GB of RAM available. Second, Linux suffers more pipeline clears than any Xen environment. Though the reason for this behavior is unknown, it is suspected that the number of clears is due to memory-ordering issues. The Xen hypervisor implements an extra layer of data control for Domain 0 and guest VMs, which may be responsible for enforcing a memory order that reduces the need for pipeline clears.

The execution time of the application is nearly identical for each environment, but the Xen environments incur many more instruction TLB misses. Because of this, we expect that applications running in Xen Domain 0 or Domain U will show decreased performance, but since Linux has a much larger number of CPU pipeline clears, it is possible that the performance penalties cancel each other out.

Each environment has approximately the same number of data TLB misses. Superpages and global page mappings are most likely responsible for the small increase in misses over native Linux. The number of TLB misses cannot be completely reduced to Linux levels since some TLB flushes are required for switching execution between domains. However, even though VMs will incur more misses, it is unknown exactly how many more misses can be expected. Also, even though the environments have the same number of TLB misses, the Harpertown processors spend more time per page table walk in the Xen environments. This behavior warrants further study.

VI. FUTURE WORK

The parameter space for experimentation in VM benchmarking is large. The environments used - Linux, Xen Domain 0, and Xen Domain U - are just a few of the possibilities for VMs. These were chosen because of pre-existing support for oprofile. In the future, VMware (Player, Server, and ESX), the Kernel Virtual Machine (KVM), and Xen hardware-assisted virtual machines (HVM domains) will be tested. Different x86 machines may show different patterns of events, especially if the CPUs have built-in virtualization extensions. More event data can be gathered, and many more benchmarks must be executed. Other research shows that single benchmarks are an inaccurate predictor of real application performance. Many different combinations of benchmark problem sizes are possible. Depending on the results of these experiments (and more experiments not presented here), several different benchmark problem sizes for each different test will be executed.

VII. ACKNOWLEDGEMENTS

The work is supported in part by MRI Grant #072265 from the National Science Foundation.

REFERENCES

Laura C. Carrington, Michael Laurenzano, Allan Snavely, Roy L. Campbell, and Larry P. Davis. How Well Can Simple Metrics Represent the Performance of HPC Applications? In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 48, Washington, DC, USA, 2005. IEEE.
B. Clark, T. Deshane, E. Dow, S. Evanchik, M. Finlayson, J. Herne, and J. N. Matthews. Xen and the Art of Repeated Research. In Proceedings of the USENIX Annual Technical Conference, July 2004.
Yaozu Dong, Shaofan Li, Asit Mallick, Jun Nakajima, Kun Tian, Xuefei Xu, Fred Yang, and Wilfred Yu. Extending Xen with Intel Virtualization Technology. Intel Technology Journal, volume 3, 2006.
B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer. Xen and the Art of Virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles, October 2003.
Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture. March 2009.
John Levon and Philippe Elie. Oprofile: A System-wide Profiler for Linux Systems. http://oprofile.sourceforge.net, December 2008.
Paul E. McKenney. Memory ordering in modern microprocessors, part I. Linux Journal, 2005(136):2, 2005.
Aravind Menon, Alan L. Cox, and Willy Zwaenepoel. Optimizing network virtualization in Xen. In Proceedings of the USENIX Annual Technical Conference, pages 15–28, 2006.
Aravind Menon, Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, and Willy Zwaenepoel. Diagnosing Performance Overheads in the Xen Virtual Machine Environment. In VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, pages 13–23, New York, NY, USA, 2005. ACM Press.
Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. Practical, transparent operating system support for superpages. In SIGOPS Oper. Syst. Rev., pages 89–104, 2002.
Himanshu Raj and Karsten Schwan. High performance and scalable I/O virtualization via self-virtualized devices. In HPDC '07: Proceedings of the 16th International Symposium on High Performance Distributed Computing, pages 179–188, New York, NY, USA, 2007. ACM.
John Scott Robin and Cynthia E. Irvine. Analysis of the Intel Pentium's ability to support a secure virtual machine monitor. In SSYM'00: Proceedings of the 9th Conference on USENIX Security Symposium, pages 10–10, Berkeley, CA, USA, 2000. USENIX Association.
Theodore H. Romer, Wayne H. Ohlrich, Anna R. Karlin, and Brian N. Bershad. Reducing TLB and Memory Overhead Using Online Superpage Promotion. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 176–187, 1995.
Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, and Ian Pratt. Bridging the gap between software and hardware techniques for I/O virtualization. In Proceedings of the USENIX Annual Technical Conference, 2008.
Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically characterizing large scale program behavior. In ASPLOS-X: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 45–57, New York, NY, USA, 2002. ACM.
J. Smith and R. Nair. The Architecture of Virtual Machines. Computer, 38(5), May 2005.
Anand Tikotekar, Geoffroy Valle, Thomas Naughton, Hong H. Ong, Christian Engelmann, and Stephen L. Scott. An Analysis of HPC Benchmarks in Virtual Machine Environments. In 3rd Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC) 2008, 2008.
Melinda Varian. VM and the VM Community: Past, Present, and Future, 1997. http://www.os.nctu.edu.tw/vm/pdf/VM and the VM Community Past Present and Future.pdf.