Efficient Virtual Machine Caching in Dynamic
Virtual Clusters
Wesley Emeneker, Dan Stanzione
Fulton High Performance Computing Initiative
Arizona State University
Wesley.Emeneker@asu.edu,dstanzi@asu.edu
Abstract: At many university, government, and corporate facilities, it is increasingly
common for multiple compute clusters to exist in a relatively small geographic area. These
clusters represent a significant investment, but effectively leveraging this investment
across clusters is a challenge. Dynamic Virtual Clustering has been shown to be an effective
way to increase utilization, decrease job turnaround time, and increase workload throughput
in a multi-cluster environment on a small geographic scale. Dynamic Virtual Clustering is a
system for flexibly and seamlessly deploying virtual machines in a single or multi-cluster
environment. The amount of time required to deploy virtual machines may be prohibitively
large, especially when the jobs designated to run inside the virtual machines are
short-lived. In this paper we examine the overhead of deploying virtual machine images, and
present an implementation of image caching as a way to reduce this overhead.

I. INTRODUCTION
At many university, government, and corporate facilities, it is increasingly common for
multiple compute clusters to exist in a relatively small geographic area. These clusters
represent a significant investment, but effectively leveraging this investment across
clusters is a challenge. Dynamic Virtual Clustering (DVC) has been shown to be an effective
way to increase utilization, decrease job turnaround time, and increase workload throughput
in a multi-cluster environment on a small geographic scale. DVC is a system for flexibly
and seamlessly deploying virtual machines (VMs) across a single or multi-cluster
environment. DVC tightly integrates VM technology with the cluster’s resource management
and scheduling software to allow jobs to run on any cluster in any software environment
while effectively sandboxing users and applications from the host system. DVC uses VMs in a
cluster environment by staging images to compute nodes and booting the VMs on those nodes.
However, the amount of time required to stage and boot VMs may be prohibitively large,
especially when the jobs designated to run inside the VMs are short-lived. In this paper we
examine the overhead of staging and booting, consider issues associated with caching VM
images, and present an implementation of image caching as a way to reduce this overhead.
The basic implementation and analysis of caching presented here will later be used to
create intelligent scheduling algorithms and heuristics that use cache information to
reduce overhead due to VM use.
Section II examines DVC, virtualization, and resource management in cluster environments
with respect to virtual machines. Section III examines the initial implementation of VM
creation, details image caching as a way to reduce the overhead of staging and booting VM
images, and enumerates situations where caching can cause unexpected and incorrect
behavior.

II. BACKGROUND AND RELATED WORK
A. Dynamic Virtual Clustering
Dynamic Virtual Clustering is a system for deploying and using VMs in a cluster or
multi-cluster environment [1]. The goals of DVC are to improve cluster utilization, reduce
job turnaround time, reduce queue wait time, and increase workload throughput. It has been
theoretically shown by Jones and others that forwarding jobs between clusters, and spanning
jobs over the resources of multiple clusters, can improve workload throughput even in
bandwidth-limited environments [2], [3], [4], [5]. One major difficulty in forwarding and
spanning jobs between clusters is providing software environment homogeneity [6].
Intelligently deployed VMs provide software homogeneity, even if the underlying hosts are
heterogeneous. DVC uses VMs to provide the following capabilities:
1) Customization: The ability to run jobs in a cluster that would otherwise require
   modifications to the software environment.
2) Forwarding: The ability to run a job submitted to one cluster unmodified on a second
   cluster even if the second has a different software stack.
3) Spanning: The ability to transparently run a single job that spans multiple clusters.
Using virtualization to implement these three capabilities has been shown to achieve the
desired goals for a wide range of cluster workloads [6], [7]. However, for some workloads
the initial implementation of DVC did not achieve the desired performance. The design of
DVC required that a single central location host all VM images. Each virtual machine image
contains the root filesystem needed to boot a VM. In many respects, an image is similar to
a node disk image, with the main difference being that it is encapsulated in a single file.
Each image must be staged to the nodes that will execute the VMs, and is destroyed once it
is no longer needed. For small images, the overhead of staging for each job was acceptable,
but for larger images the time-to-ready of a VM could be as long as the runtime of the
application itself. Reducing this overhead to improve the performance of jobs is a clear
motivation to implement caching. Although DVC is examined in this paper, image caching is
applicable to other projects using virtualization, such as OSCAR-V [8] and the work on
VMM-bypass [9].
B. Virtualization
Virtualization is a commonly used technique in computing that abstracts and partitions
system resources [10], [11]. Whether it is done to partition system resources, or to
present resources that may not exist (like virtual terminals, memory, or machine
architecture), virtualization has served in a wide variety of roles in computing. Virtual
machines abstract an entire machine and can provide users and processes with an operating
environment significantly different from the host machine. Pioneered by IBM in the 1960s
[12], VMs were useful for giving users a fraction of system resources, or providing a
different software environment than the host system. When applied to HPC, a VM can
duplicate a cluster software environment, so that when a job expects a software environment
different from that of the assigned host, it can run inside a VM without needing to be
aware of the host’s environment.
However, VMs suffer a performance penalty on application execution when compared to native
execution. The magnitude of the performance loss depends on the VM architecture. Several
studies have benchmarked VMs [13], [14], [15], [16], and the Xen Virtual Machine Monitor
(VMM) has been found to provide the best performance on many benchmarks. The emphasis on
performance in HPC makes Xen a natural choice for applying VMs to a cluster or
multi-cluster environment.

C. Resource Management
A cluster resource manager provides a few key functions in a cluster [17].
1) It allocates access to cluster resources (typically compute nodes).
2) It provides a framework for executing and monitoring work performed on allocated
   resources.
3) It arbitrates requests for resources.
These cluster functions are critical to properly deploying and managing VMs. When
allocating resources to deploy VMs, we must ensure that there is enough disk space and RAM
to accommodate what is allocated to the VM. If VM creation fails, the resource manager must
either return an error, or find another way to instantiate the guest. When cached image
checking is added to VM creation, the resource manager must check whether a cached image
exists and is usable. Despite the extra step, caching is valuable enough to warrant
implementing the extra functionality.

III. VIRTUAL MACHINE IMAGE CACHING
The use of virtual machines to execute jobs incurs an application performance penalty. This
penalty is generally less than 10% [13], [14] no matter the runtime of the job. However,
deploying and booting VMs introduces a constant time overhead that may represent a
significant fraction of a job’s turnaround time for smaller jobs. VM boot time is an
unavoidable source of overhead, while image staging can be reduced or eliminated. Tests
were performed to determine the overhead of staging and booting VM images. Three VM images
were used in testing, with sizes of 2GB, 3GB, and 10GB. The times to stage and boot each
image are shown in Table I and Figure 1.

Fig. 1. VM Staging Time

TABLE I
VIRTUAL MACHINE OVERHEAD

                 Image Size
VM Operation     2GB        3GB        10GB
Startup          23 secs    30 secs    30 secs
Shutdown         11 secs    13 secs    15 secs
Destruction      < 1 sec    < 1 sec    < 1 sec

Guest startup, shutdown, and destruction were tested on individual hosts since these
operations are completely independent of any other VM executing. Each image was staged
concurrently to a number of guests as denoted in Figure 1. The NFS server used to execute
the staging tests is a 10TB fileserver with a maximum read speed of 100MB/s. In each test,
a single image was staged from the fileserver to a number of nodes. The amount of time
required to stage the image to each node is presented in the graph. The results in the
table show that the boot and shutdown overhead is roughly the same across images, but the
results in the figure show that staging time is highly dependent on the number of
concurrent image reads. Average staging time, the largest source of overhead, can be
greatly reduced with image caching. The advantages of caching (faster startup, reduced job
turnaround time, and less fileserver load) are obvious, but it is not always advantageous
to cache images. Several factors must be considered when implementing image caching.
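The cached-image check that Section II-C assigns to the resource manager can be sketched as follows. This is a minimal illustration and not the DVC implementation: the class name `NodeImageCache`, the `stage` hook, and the injectable `clock` are all hypothetical, and "usable" is assumed here to mean "present and no older than a configurable age limit", anticipating the age-based policy of Section III-C.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict
import time


@dataclass
class NodeImageCache:
    """Tracks which VM images are cached on one node (illustrative only)."""
    max_age_minutes: float
    stage: Callable[[str], None]            # hypothetical hook: copy image from the central server
    clock: Callable[[], float] = time.time  # injectable so the policy can be exercised in tests
    staged_at: Dict[str, float] = field(default_factory=dict)

    def is_usable(self, image: str) -> bool:
        """A cached image is usable if it exists and is within the age limit."""
        if image not in self.staged_at:
            return False
        age_minutes = (self.clock() - self.staged_at[image]) / 60.0
        return age_minutes <= self.max_age_minutes

    def acquire(self, image: str) -> bool:
        """Make the image available before VM boot; return True on a cache hit."""
        if self.is_usable(image):
            return True
        self.staged_at.pop(image, None)     # drop a missing or expired entry
        self.stage(image)                   # may raise; the caller decides how to report it
        self.staged_at[image] = self.clock()
        return False
```

A resource manager built along these lines would call `acquire` once per node before booting the guest, paying the staging cost only on a miss. Checks for corruption or a full image would be further conditions inside `is_usable`.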
A. Caching Considerations
The advantage of the initial approach of staging images for each job is that we ensure that
the image will work as expected. With image caching, “dirty” images (images that have been
used before) can be reused a number of times. The following issues must be addressed when
dealing with VM image caching.
a) Filesystem Corruption: Filesystem corruption is an expensive problem to correct,
requiring the entire filesystem to be checked and corrected. Unclean shutdown, hard
reboot/poweroff, and kernel crashes are just a few of the ways that filesystems can become
corrupted. Furthermore, many filesystems have settings requiring complete checks (as a
precautionary measure) after a certain amount of time passes or after a particular number
of mounts. A decision must be made whether to verify image consistency, or to remove the
image and start fresh.
b) Image Fill: VM images are optimized for size and generally have only a small fraction of
space free. Images that are reused multiple times may become completely full due to
logging, or due to temporary files being written to the image. If an image is completely
full, attempting to reuse it may put the machine into an unusable state.
c) Image Skew: The design of DVC uses a central “master” repository that contains all VM
images. Without caching, any changes made to a master image are seen by any new job using
that image. The use of caching bypasses staging from the master repository to nodes and
uses a dirty image to execute jobs. If the master image is updated, we must ensure that the
new image is used for every job. Removing the cached images from nodes is one possibility,
but any VM images currently in use cannot be removed without killing jobs. Once jobs using
the old VM image complete, those images may remain cached on the nodes and be reused for
future jobs. This difference between the master image and the cached image is known as
skew. If images are skewed, jobs may not execute as expected and administrator intervention
may be required to put the system back into the expected state.
Although these issues are non-trivial and must be dealt with when image caching is
implemented, the advantages gained in dynamic virtual cluster startup time are worth the
complexities of ensuring correct functionality.

B. Methodology
Caching VM images on nodes is one way to reduce the time required to make a dynamic virtual
cluster ready to execute applications. However, the possible advantages of caching diminish
as the number of VMs allocated to the job increases. VMs allocated to a job must be
completely booted and ready to accept jobs before any application can run. The probability
that a node contains a cached image is denoted by ρ, which is independent of any other node
or cached image. As the number of VMs V increases, the likelihood that caching will
eliminate the staging time of the dynamic virtual cluster decreases as V ∗ ρ. As the amount
of VM use increases, the number of nodes with cached images increases, therefore ρ
increases. However, let C denote the number of nodes with cached images, and let
W = V − C be the number of nodes that the image must be staged to in order for a job to
execute. In terms of ρ, the average number of nodes without cached images is
W = V − (ρ ∗ V). The smaller W, the smaller the staging time of images will be. Staging
time in testing follows a roughly linear pattern of S + (V ∗ S)/F seconds, where S is the
amount of time always required to stage an image, and F is a factor of the linear increase
in staging time for a particular image to a number of nodes. As W decreases, the required
staging time decreases as W/2.
Given this information, Equation 1 shows the average time-to-ready of a cluster of virtual
machines defined by boot time B and the staging time of images.

$$
T =
\begin{cases}
B + \dfrac{1}{n} \sum_{i=1}^{n} \left( S + \dfrac{V \ast S}{F} \right) & \text{for } W > 0 \\
B & \text{for } W = 0
\end{cases}
\qquad (1)
$$

This equation states that the average time-to-ready of a cluster requires the constant boot
time B plus the average staging time of an image to all nodes that need the image. For
single node jobs, the equation is simple, but as the number of nodes allocated to a job
increases, the equation becomes more important since the number of cached images is
directly related to the amount of time required to make a dynamic virtual cluster ready to
accept jobs. This equation is simple, and is largely dependent on the characteristics of
the fileserver distributing the images. In the test cases, the fileserver’s performance was
limited by the single Gigabit Ethernet link to the rest of the cluster. With parallel
filesystems (PVFS, Lustre, IBrix, Panasas, etc.), there may be multi-Gigabit links to
cluster nodes, which could change the equation presented. Instead of trying to account for
all possible scenarios and performance characteristics, Equation 1 uses simple factors to
estimate staging time for a number of nodes.

C. Initial Implementation
The gains in faster VM startup with cached images must be balanced against the possibility
of image change. The design of a system to cache images must take into account the factors
of corruption, skew, and fill. Following is a description of the design of a system for
caching images.
1) Implemented Design - Aged Caching: The first implementation of caching with DVC uses a
configurable maximum image age (in minutes) as part of a heuristic to reduce the likelihood
of filesystem corruption, image fill, and image skew. Once a cached image reaches a certain
age, it is removed at the start of the next job.
An alternate approach to aging images is to count the number of times an image has been
used to boot a VM. Instead of removing a cached image after a certain number of minutes, we
could remove an image after it had been used ten, twenty, or two times. Like the walltime
image age, aging images based on usage will also reduce the likelihood of image corruption
and fill. However, the major disadvantage to this approach is that image skew becomes
possible. Aging images based on time ensures that no cached
image older than a certain number of minutes will be reused. Aging images based on number
of times used does nothing to ensure that an image is removed from cache in a reasonable
amount of time. It is possible to imagine scenarios where an image is used on a node that
is rarely used to execute DVC jobs. If the node is used once every month and the image has
a use limit of thirty mounts, the cache won’t be cleaned for more than two years. In the
unlikely event that the image is in use each time the node is assigned to a job, the cached
image may be significantly skewed from the master image.
2) Other Possible Designs: Other schemes for image caching which were examined (but not
implemented) are briefly described here.
a) Double Copy Caching: Double caching is a scheme that attempts to eliminate the problems
of corruption, fill, and skew by always providing a clean image. With double caching, the
image is copied from the master image server to the node. The node then copies the image
from cache to the execution location. This implementation would eliminate corruption, fill,
and skew at the cost of increased initial startup time, since we have to double-copy the
cache. Once the image has been double-copied for the first job, all subsequent jobs must
copy the image from cache to the execution location. The only speed advantage this approach
brings occurs when the staging time from the central fileserver is longer than copying
disk-to-disk.
b) Double Copy Caching v2: The first double-copying approach would not significantly
decrease the time-to-ready of a VM, but with a few changes we can eliminate the increased
startup time. Instead of copying the cached image to the execution location (a slow
operation), moving the image to the execution location will bypass staging time (a move is
assumed to be a faster operation when the cache and execution locations reside on the same
filesystem). Once the image is moved, we copy the image from the server to cache. This
approach will reduce the time-to-ready of the VM, but will greatly increase the load on the
image server. In addition, copying the image from the server may adversely impact the
performance of the VM until the copy is complete.
These two approaches are valid for caching, and each has the advantage of eliminating
corruption, fill, and skew. However, we must keep two copies of the same filesystem on disk
at all times. If the images are large, or if the node is using multiple images, these
approaches may be constrained by disk space.
c) Image Check Caching: Filesystem corruption and fill are two undesirable image states.
However, if a cached image enters one of these states we can ensure that the image is not
used as a VM filesystem until the problem is fixed. Filesystem corruption can be dealt with
simply by forcing a filesystem check each time the image is put into the cache after being
used. This ensures that the filesystem state is consistent at the cost of CPU and I/O time.
Image fill can also be dealt with when the image is put into cache. By comparing available
image space to used image space, we can remove the image if the difference is too small.
This caching scheme can detect and correct image corruption and fill, but it cannot fix
image skew.
Caching has been researched in many fields in computing [18]. The motivation for caching is
to increase the speed of an operation. One major difficulty in caching is that of
coherency, i.e., how to keep the state of the system consistent. This paper draws on past
ideas of keeping system consistency while factoring in the use of VMs in clusters.
Each approach to image caching presented solves one or more of the problems described:
staging speed, corruption, fill, and skew. However, the walltime age approach shows the
most promise for maximizing speed and minimizing the possibility of non-working images.
Next we examine the initial implementation of aged caching.

D. Analytical Results
Aged caching has been implemented with a configurable image elimination policy that is
specified in minutes. Table II shows the maximum percentage of job turnaround time required
to stage and boot a virtual cluster with different maximum cache image ages. We assume that
identical (or extremely similar) jobs are likely to run on the same set of nodes one after
the other. In these results, each job is executed ten times and the overhead of staging is
shown for each image and each job runtime. The percentages shown are highly dependent on
the staging time of the image, which is directly influenced by the number of nodes, the
performance of the fileserver, and the speed of the local disks. For the following results,
each image was copied from an NFS mount to local disk. The NFS server is a 10TB fileserver
with a maximum read speed of 100MB/s. Staging the 3GB image to one node takes approximately
one minute, while staging the same image to twelve nodes simultaneously takes 5.5 minutes.
Once we reach the twelve node limit, staging time increases little as the NFS server is
able to cache large amounts of data. Figure 2 shows a graphical representation of how
maximum cache age affects job turnaround time with different cache expiration times.
The table and figure show that, as expected, the longer the cache image is allowed to
remain, the lower the overhead of staging. Also as expected, the largest gains from caching
are seen by short-running applications. For applications running ten hours or more, staging
time is a negligible portion of runtime.

IV. CONCLUSIONS
The analytical results of aged image caching show that the decreased time-to-ready of VMs
is important for short-running jobs. At this time, the probability that image corruption,
fill, and skew will occur is unknown, and likely varies depending on the image size and
use. Despite these unknowns, the results are the most promising for jobs that execute for a
short period of time, where staging and boot time comprise a large portion of job
turnaround time. The longer a job executes, the less important image caching becomes, since
staging and booting is a constant time operation.
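To make the relationship between time-to-ready and job runtime concrete, the following sketch evaluates Equation 1 from Section III-B and the resulting overhead fraction. Several things here are assumptions rather than results from the paper: the values of S and F are not reported, so they are fitted from the two 3GB staging times quoted in Section III-D (about one minute to one node, 5.5 minutes to twelve nodes); the summand of Equation 1 does not depend on i, so the average is taken to be the single term S + (V ∗ S)/F; and turnaround time is taken to be time-to-ready plus runtime, ignoring queue wait and shutdown. All function names are illustrative.

```python
def fit_staging_factors(t_one: float, t_many: float, nodes_many: int):
    """Solve S + (V*S)/F for S and F from staging times measured at V=1 and V=nodes_many."""
    slope = (t_many - t_one) / (nodes_many - 1)  # S/F: added seconds per additional node
    S = t_one - slope
    F = S / slope
    return S, F


def staging_time(V: int, S: float, F: float) -> float:
    """Roughly linear staging model from Section III-B, in seconds."""
    return S + (V * S) / F


def time_to_ready(B: float, S: float, F: float, V: int, W: int) -> float:
    """Equation 1: only the boot time B remains when no node needs staging (W == 0)."""
    if W == 0:
        return B
    return B + staging_time(V, S, F)


def overhead_fraction(ready_seconds: float, runtime_seconds: float) -> float:
    """Share of turnaround time spent staging and booting (assumed definition)."""
    return ready_seconds / (ready_seconds + runtime_seconds)


# 3GB image: 30 s startup (Table I); staging times taken from Section III-D.
S, F = fit_staging_factors(t_one=60.0, t_many=330.0, nodes_many=12)
for runtime_min in (6, 60, 600):
    cold = time_to_ready(B=30.0, S=S, F=F, V=12, W=12)  # no node has the image
    warm = time_to_ready(B=30.0, S=S, F=F, V=12, W=0)   # every node has the image
    print(runtime_min,
          round(overhead_fraction(cold, runtime_min * 60), 3),
          round(overhead_fraction(warm, runtime_min * 60), 3))
```

Under these assumptions a fully uncached twelve-node cluster spends about half of a six-minute job's turnaround getting ready, while a fully cached one spends under a tenth. The gap narrows quickly as runtime grows, which is the trend the conclusions describe.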
TABLE II
VIRTUAL MACHINE STAGING OVERHEAD FOR 10 JOBS

                          Performance Overhead
Image Size   Max Age      6 min job   60 min job   600 min job
2GB          no cache     29.4%       4%           0.4%
2GB          30 min       12%         4%           0.4%
2GB          1 hr         9.9%        4%           0.4%
2GB          24 hr        7.8%        0.9%         0.2%
3GB          no cache     48%         8.4%         0.9%
3GB          30 min       15.7%       8.4%         0.9%
3GB          1 hr         11.7%       8.4%         0.9%
3GB          24 hr        7.9%        1.1%         0.85%
10GB         no cache     75%         23%          2.9%
10GB         30 min       21.1%       23%          2.9%
10GB         1 hr         14.4%       23%          2.9%
10GB         24 hr        8%          1.7%         1%

Fig. 2. 6 Minute Job Turnaround Time with Cache Expiration

Although the initial results for reducing job turnaround time are promising, deficiencies
in the caching scheme have become obvious. The largest problem occurs with highly parallel
jobs that require many nodes. If one node doesn’t have a cached image, every other VM must
wait until staging and booting of the entire set of VMs is complete. As the number of nodes
allocated to a job increases, the likelihood that caching benefits a job decreases. One way
to reduce the likelihood of this is to enhance the job scheduler.
Nodes are assigned to jobs by the cluster scheduler, while images are staged and booted
with a resource manager. If we take information from the resource manager about which nodes
have accessible cached images, the scheduler can intelligently allocate nodes to ensure
that the time-to-ready of a dynamic virtual cluster is minimized. Furthermore, if the
scheduler knows which nodes have cached images, it can request that the resource manager
prestage images to nodes so that when the job is ready to start, it will have a cached
image available. This optimization is useful since staging time is a function of
concurrency. The more images that have to be staged to nodes, the longer the staging will
take. Therefore, if the job scheduler can use nodes that already have the image cached, we
can decrease the time-to-ready of the virtual cluster.
The results from this paper will form the basis for developing a model for effectively
using image caching at the job scheduler level. Enhancing the job scheduler to be aware of
cached images and to take the best advantage of cache will be investigated in later work.

REFERENCES
[1] W. Emeneker, D. Jackson, J. Butikofer, and D. Stanzione, “Dynamic Virtual Clustering
    with Xen and Moab,” in Workshop on XEN in HPC Cluster and Grid Computing Environments
    (XHPC), 2006.
[2] W. Jones, L. Pang, D. Stanzione, and W. I. Ligon, “Characterization of Bandwidth-aware
    Meta-schedulers for Co-allocating Jobs Across Multiple Clusters,” Journal of
    Supercomputing, Special Issue on the Evaluation of Grid and Cluster Computing, 2005.
[3] W. Jones, L. Pang, D. Stanzione, and W. Ligon, “Bandwidth-aware Co-allocating
    Meta-schedulers for Mini-grid Architectures,” International Conference on Cluster
    Computing (Cluster 2004), 2004.
[4] J. Sinaga, H. Mohammed, and D. Epema, “A dynamic co-allocation service in multicluster
    systems,” 2004. [Online]. Available: citeseer.ist.psu.edu/sinaga04dynamic.html
[5] C.-T. Yang, I.-H. Yang, K.-C. Li, and S.-Y. Wang, “Improvements on dynamic adjustment
    mechanism in co-allocation data grid environments,” The Journal of Supercomputing,
    vol. 40, pp. 269–280, June 2007.
[6] W. Emeneker, “Dynamic Virtual Clustering,” Master’s thesis, Arizona State University,
    April 2007.
[7] W. Emeneker and D. Stanzione, “Dynamic Virtual Clustering,” in Submitted to the IEEE
    International Conference on Cluster Computing 2007, 2007.
[8] “Xen-oscar for cluster virtualization,” in Submitted to Workshop on XEN in HPC Cluster
    and Grid Computing Environments (XHPC), 2006.
[9] J. Liu, W. Huang, B. Abali, and D. K. Panda, “High Performance VMM-Bypass I/O in
    Virtual Machines,” 2006.
[10] R. P. Goldberg, “Architecture of virtual machines,” in Proceedings of the workshop on
    virtual computer systems. New York, NY, USA: ACM Press, 1973, pp. 74–112.
[11] N. Holmes, “The turning of the wheel,” Computer, vol. 38, no. 7, pp. 100, 98–99, 2005.
[12] M. Varian, “VM and the VM Community: Past, Present, and Future,” 1997,
    http://www.os.nctu.edu.tw/vm/pdf/VM and the VM Community Past Present and Future.pdf.
[13] B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham,
    and R. Neugebauer, “Xen and the Art of Virtualization,” in Proceedings of the ACM
    Symposium on Operating Systems Principles, October 2003.
[14] B. Clark, T. Deshane, E. Dow, S. Evanchik, M. Finlayson, J. Herne, and J. Matthews,
    “Xen and the Art of Repeated Research,” in Proceedings of the Usenix annual technical
    conference, July 2004.
[15] W. Huang, J. Liu, B. Abali, and D. K. Panda, “A case for high performance computing
    with virtual machines,” in The 20th ACM International Conference on Supercomputing
    (ICS’06), June 2006.
[16] W. Emeneker and D. Stanzione, “HPC Cluster Readiness of Xen and UML,” in Proceedings
    of IEEE International Conference on Cluster Computing 2006, 2006.
[17] M. Jette and M. Grondona, “Slurm: Simple linux utility for resource management,” 2002.
    [Online]. Available: citeseer.ist.psu.edu/jette02slurm.html
[18] A. J. Smith, “Cache memories,” ACM Comput. Surv., vol. 14, no. 3, pp. 473–530, 1982.