

Efficient Virtual Machine Caching in Dynamic Virtual Clusters

Wesley Emeneker, Dan Stanzione
Fulton High Performance Computing Initiative
Arizona State University

Abstract— At many university, government, and corporate facilities, it is increasingly common for multiple compute clusters to exist in a relatively small geographic area. These clusters represent a significant investment, but effectively leveraging this investment across clusters is a challenge. Dynamic Virtual Clustering has been shown to be an effective way to increase utilization, decrease job turnaround time, and increase workload throughput in a multi-cluster environment on a small geographic scale. Dynamic Virtual Clustering is a system for flexibly and seamlessly deploying virtual machines in a single or multi-cluster environment. The amount of time required to deploy virtual machines may be prohibitively large, especially when the jobs designated to run inside the virtual machines are short-lived. In this paper we examine the overhead of deploying virtual machine images, and present an implementation of image caching as a way to reduce this overhead.

                       I. INTRODUCTION

   At many university, government, and corporate facilities, it is increasingly common for multiple compute clusters to exist in a relatively small geographic area. These clusters represent a significant investment, but effectively leveraging this investment across clusters is a challenge. Dynamic Virtual Clustering (DVC) has been shown to be an effective way to increase utilization, decrease job turnaround time, and increase workload throughput in a multi-cluster environment on a small geographic scale. DVC is a system for flexibly and seamlessly deploying virtual machines (VMs) across a single or multi-cluster environment. DVC tightly integrates VM technology with the cluster's resource management and scheduling software to allow jobs to run on any cluster in any software environment while effectively sandboxing users and applications from the host system. DVC uses VMs in a cluster environment by staging images to compute nodes and booting the VMs on those nodes. However, the amount of time required to stage and boot VMs may be prohibitively large, especially when the jobs designated to run inside the VMs are short-lived. In this paper we examine the overhead of staging and booting, consider issues associated with caching VM images, and present an implementation of image caching as a way to reduce this overhead. The basic implementation and analysis of caching presented here will later be used to create intelligent scheduling algorithms and heuristics that use cache information to reduce overhead due to VM use.

   Section II examines DVC, virtualization, and resource management in cluster environments with respect to virtual machines. Section III examines the initial implementation of VM creation, details image caching as a way to reduce the overhead of staging and booting VM images, and enumerates situations where caching can cause unexpected and incorrect behavior.

               II. BACKGROUND AND RELATED WORK

A. Dynamic Virtual Clustering

   Dynamic Virtual Clustering is a system for deploying and using VMs in a cluster or multi-cluster environment [1]. The goals of DVC are to improve cluster utilization, reduce job turnaround time, reduce queue wait time, and increase workload throughput. It has been theoretically shown by Jones and others that forwarding jobs between clusters, and spanning jobs over the resources of multiple clusters, can improve workload throughput even in bandwidth-limited environments [2], [3], [4], [5]. One major difficulty in forwarding and spanning jobs between clusters is providing software environment homogeneity [6]. Intelligently deployed VMs provide software homogeneity, even if the underlying hosts are heterogeneous. DVC uses VMs to provide the following capabilities:
   1) Customization: The ability to run jobs in a cluster that would otherwise require modifications to the software environment.
   2) Forwarding: The ability to run a job submitted to one cluster unmodified on a second cluster, even if the second has a different software stack.
   3) Spanning: The ability to transparently run a single job that spans multiple clusters.
   Using virtualization to implement these three capabilities has been shown to achieve the goals desired for a wide range of cluster workloads [6], [7]. However, for some workloads the initial implementation of DVC did not achieve the desired performance. The design of DVC required that a single central location host all VM images. Each virtual machine image contains the root filesystem needed to boot a VM. In many respects, an image is similar to a node disk image, with the main difference being that it is encapsulated in a single file. Each image must be staged to the nodes that will execute the VMs, and is destroyed once no longer needed. For small images, the overhead of staging for each job was acceptable, but for larger images the time-to-ready of a VM could be as long as the runtime of the application itself. Reducing this overhead to improve the performance of jobs is a clear motivation to implement caching. Although DVC is examined in this paper, image caching is applicable to other projects using virtualization like OSCAR-V [8] and the work on VMM-Bypass I/O [9].
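To make the staging cost concrete, the following back-of-the-envelope sketch computes the fraction of job turnaround consumed by deploying a VM. It is a simplified single-node model, not part of the DVC implementation; the image size, fileserver read speed, and boot/teardown times are illustrative assumptions.

```python
def deployment_overhead_fraction(image_gb, runtime_s,
                                 read_mb_s=100.0, boot_s=30.0, teardown_s=15.0):
    """Fraction of total job turnaround consumed by staging,
    booting, and tearing down a VM (single node, no caching)."""
    stage_s = image_gb * 1024 / read_mb_s   # copy image from the fileserver
    overhead_s = stage_s + boot_s + teardown_s
    return overhead_s / (overhead_s + runtime_s)

# A 3GB image hurts a 6-minute job far more than a 10-hour job:
short_job = deployment_overhead_fraction(3, 6 * 60)     # roughly 17%
long_job = deployment_overhead_fraction(3, 10 * 3600)   # well under 1%
```

Under these assumptions, deployment consumes roughly a sixth of a short job's turnaround, which is exactly the overhead that caching aims to remove.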
B. Virtualization

   Virtualization is a commonly used technique in computing that abstracts and partitions system resources [10], [11]. Whether it is done to partition system resources, or to present resources that may not exist (like virtual terminals, memory, or machine architecture), virtualization has served in a wide variety of roles in computing. Virtual machines abstract an entire machine and can provide users and processes with an operating environment significantly different from the host machine. Pioneered by IBM in the 1960s [12], VMs were useful for giving users a fraction of system resources, or for providing a different software environment than the host system. When applied to HPC, a VM can duplicate a cluster software environment, so that when a job expects a software environment different from the assigned host, it can run inside a VM without needing to be aware of the host's environment.

   However, VMs suffer a performance penalty on application execution when compared to native execution. The magnitude of the performance loss depends on the VM architecture. Many people have benchmarked VMs [13], [14], [15], [16], and the Xen Virtual Machine Monitor (VMM) has been found to provide the best performance on many benchmarks. The emphasis on performance in HPC makes Xen a natural choice for applying VMs to a cluster or multi-cluster environment.

C. Resource Management

   A cluster resource manager provides a few key functions in a cluster [17].
   1) It allocates access to cluster resources (typically compute nodes).
   2) It provides a framework for executing and monitoring work performed on allocated resources.
   3) It arbitrates requests for resources.
These functions are critical to properly deploying and managing VMs. When allocating resources to deploy VMs, we must ensure that there is enough disk space and RAM to accommodate what is allocated to the VM. If VM creation fails, the resource manager must either return an error, or find another way to instantiate the guest. When cached image checking is added to VM creation, the resource manager must check to see if a cached image exists and is usable. Despite the extra step, caching is valuable enough to warrant implementing the extra functionality.

            III. VIRTUAL MACHINE IMAGE CACHING

   The use of virtual machines to execute jobs incurs an application performance penalty. This penalty is generally less than 10% [13], [14] no matter the runtime of the job. However, deploying and booting VMs introduces a constant time overhead that may represent a significant fraction of a job's turnaround time for smaller jobs. VM boot time is an unavoidable source of overhead, while image staging can be reduced or eliminated. Tests were performed to determine the overhead of staging and booting VM images. Three VM images were used in testing, with sizes of 2GB, 3GB, and 10GB respectively. The time to stage and boot each image is shown in Table I and Figure 1.

[Fig. 1. VM Staging Time]

                            Image Size
   VM Operation      2GB         3GB         10GB
   Startup           23 secs     30 secs     30 secs
   Shutdown          11 secs     13 secs     15 secs
   Destruction       < 1 sec     < 1 sec     < 1 sec
              TABLE I: VIRTUAL MACHINE OVERHEAD

   Guest startup, shutdown, and destruction were tested on individual hosts, since these operations are completely independent of any other VM executing. Each image was staged concurrently to a number of guests as denoted in Figure 1. The NFS server used to execute the staging tests is a 10TB fileserver with a maximum read speed of 100MB/s. In each test, a single image was staged from the fileserver to a number of nodes. The amount of time required to stage the image to each node is presented in the graph. The results in the table also show that the boot and shutdown overhead of images is roughly the same, but the results in the figure show that staging time is highly dependent on the number of concurrent image reads. Average staging time, the largest source of overhead, can be greatly reduced with image caching. The advantages of caching – faster startup, reduced job turnaround time, and less fileserver load – are obvious, but it is not always advantageous to cache images. Several factors must be considered when implementing image caching.
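The dependence of staging time on concurrency can be illustrated with a minimal shared-bandwidth model. The model is an assumption for illustration only: it splits the fileserver's read bandwidth evenly among concurrent reads and ignores server-side caching and per-link limits.

```python
def concurrent_stage_time_s(image_gb, n_nodes, server_mb_s=100.0):
    """Seconds to stage one image to each of n_nodes at once when
    the fileserver's read bandwidth is divided evenly among them."""
    per_node_mb_s = server_mb_s / n_nodes
    return image_gb * 1024 / per_node_mb_s

one_node = concurrent_stage_time_s(3, 1)       # ~31 seconds
twelve_nodes = concurrent_stage_time_s(3, 12)  # ~6.1 minutes
```

Even this crude model reproduces the trend of the measurements: staging the 3GB image to twelve nodes takes on the order of minutes rather than seconds.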
A. Caching Considerations

   The advantage of the initial approach of staging images for each job is that we ensure the image will work as expected. With image caching, "dirty" images (images that have been used before) can be reused a number of times. Following are a number of issues that must be addressed when dealing with VM image caching.

      a) Filesystem Corruption: Filesystem corruption is an expensive problem to correct, requiring the entire filesystem to be checked and corrected. Unclean shutdown, hard reboot/poweroff, and kernel crashes are just a few of the ways that filesystems can become corrupted. Furthermore, many filesystems have settings requiring complete checks (as a precautionary measure) after a certain amount of time passes or after a particular number of mounts. A decision must be made whether to verify image consistency, or to remove the image and start fresh.

      b) Image Fill: VM images are optimized for size and generally have only a small fraction of space free. Images that are reused multiple times may become completely full due to logging, or due to temporary files being written to the image. If an image is completely full, attempting to reuse it may put the machine into an unusable state.

      c) Image Skew: The design of DVC uses a central "master" repository that contains all VM images. Without caching, any changes made to a master image are seen by any new job using that image. The use of caching bypasses staging from the master repository to nodes and uses a dirty image to execute jobs. If the master image is updated, we must ensure that the new image is used for every job. Removing the cached images from nodes is one possibility, but any VM images currently in use cannot be removed without killing jobs. Once jobs using the old VM image complete, the images may be cached on the nodes and reused for future jobs. This difference between the master image and the cached image is known as skew. If images are skewed, jobs may not execute as expected, and administrator intervention may be required to put the system back into the expected state.

   Although these issues are non-trivial and must be dealt with when image caching is implemented, the advantages gained in dynamic virtual cluster startup time are worth the complexities of ensuring correct functionality.

B. Methodology

   Caching VM images on nodes is one way to reduce the time required to make a dynamic virtual cluster ready to execute applications. However, the possible advantages of caching shrink as the number of VMs allocated to the job increases.

   VMs allocated to a job must be completely booted and ready to accept jobs before any application can run. The probability that a node contains a cached image is denoted by ρ, which is independent of any other node or cached image. As the number of VMs V increases, the likelihood that caching will eliminate the staging time of the dynamic virtual cluster decreases as ρ^V (the probability that all V nodes hold a cached copy). As the amount of VM use increases, the number of nodes with cached images increases, and therefore ρ increases. Now let C denote the number of nodes with cached images, and let W = V − C be the number of nodes that the image must be staged to in order for a job to execute. In terms of ρ, the average number of nodes without cached images is W = V − (ρ ∗ V). The smaller W, the smaller the staging time of images will be. Staging time in testing follows a roughly linear pattern of S + (V ∗ S)/F seconds, where S is the amount of time always required to stage an image, and F is a factor of the linear increase in staging time for a particular image to a number of nodes. As W decreases, the required staging time decreases as W/2.

   Given this information, equation 1 shows the average time-to-ready of a cluster of virtual machines, defined by boot time B and the staging time of images.

       T = B + (1/n) Σ_{i=1}^{n} [ S + (V ∗ S)/F ]   for W > 0
       T = B                                         for W = 0      (1)

This equation states that the average time-to-ready of a cluster requires the constant boot time B plus the average staging time of an image to all nodes that need the image. For single-node jobs the equation is simple, but as the number of nodes allocated to a job increases, the equation becomes more important, since the number of cached images is directly related to the amount of time required to make a dynamic virtual cluster ready to accept jobs. This equation is simple, and is largely dependent on the characteristics of the fileserver distributing the images. In the test cases, the fileserver's performance was limited by the single Gigabit Ethernet link to the rest of the cluster. With parallel filesystems (PVFS, Lustre, IBrix, Panasas, etc.), there may be multi-Gigabit links to cluster nodes, which could change the equation presented. Instead of trying to account for all possible scenarios and performance characteristics, equation 1 uses simple factors to estimate staging time for a number of nodes.

C. Initial Implementation

   The gains in faster VM startup with cached images must be balanced with the possibility of image change. The design of a system to cache images must take into account the factors of corruption, skew, and fill. Following is a description of the design of a system for caching images.

   1) Implemented Design - Aged Caching: The first implementation of caching with DVC uses a configurable maximum image age (in minutes) as part of a heuristic to reduce the likelihood of filesystem corruption, image fill, and image skew. Once a cached image reaches a certain age, it is removed at the start of the next job.

   An alternate approach to aging images is the number of times an image has been used to boot a VM. Instead of removing a cached image after a certain number of minutes, we could remove an image after it had been used, say, two, ten, or twenty times. Like the walltime image age, aging images based on usage will also reduce the likelihood of image corruption and fill. However, the major disadvantage of this approach is that image skew becomes possible. Aging images based on time ensures that no cached
image older than a certain number of minutes will be reused. Aging images based on the number of times used does nothing to ensure that an image is removed from the cache in a reasonable amount of time. It is possible to imagine scenarios where an image is cached on a node that is rarely used to execute DVC jobs. If the node is used once every month and the image has a use limit of thirty mounts, the cache won't be cleaned for more than two years. In the unlikely event that the image is in use each time the node is assigned to a job, the cached image may be significantly skewed from the master image.

   2) Other Possible Designs: Other schemes for image caching which were examined (but not implemented) are briefly described here.

      a) Double Copy Caching: Double caching is a scheme that attempts to eliminate the problems of corruption, fill, and skew by always providing a clean image. With double caching, the image is copied from the master image server to the node. The node then copies the image from cache to the execution location. This implementation would eliminate corruption, fill, and skew at the cost of increased initial startup time, since we have to double-copy the cache. Once the image has been double-copied for the first job, all subsequent jobs must copy the image from cache to the execution location. The only advantage in speed that this approach brings occurs when the staging time from the central fileserver is longer than copying disk-to-disk.

      b) Double Copy Caching v2: The first double-copying approach would not significantly decrease the time-to-ready of a VM, but with a few changes we can eliminate the increased startup time. Instead of copying the cached image to the execution location (a slow operation), moving the image to the execution location will bypass staging time (a move is assumed to be a faster operation when the cache and execution locations reside on the same filesystem). Once the image is moved, we copy the image from the server to cache. This approach will reduce the time-to-ready of the VM, but will greatly increase the load on the image server. In addition, copying the image from the server may adversely impact the performance of the VM until the copy is complete.

   These two approaches are valid for caching, and each has the advantage of eliminating corruption, fill, and skew. However, we must keep two copies of the same filesystem on disk at all times. If the images are large, or if the node is using multiple images, these approaches may be constrained by disk space.

      c) Image Check Caching: Filesystem corruption and fill are two undesirable image states. However, if a cached image enters one of these states we can ensure that the image is not used as a VM filesystem until the problem is fixed. Filesystem corruption can be dealt with simply by forcing a filesystem check each time the image is put into the cache after being used. This ensures that the filesystem state is consistent, at the cost of CPU and I/O time. Image fill can also be dealt with when the image is put into cache. By comparing available image space to used image space, we can remove the image if the difference is too small. This caching scheme can detect and correct image corruption and fill, but it cannot fix image skew.

   Caching has been researched in many fields in computing [18]. The motivation for caching is to increase the speed of an operation. One major difficulty in caching is that of coherency – i.e., how to keep the state of the system consistent. This paper draws on past ideas for keeping system consistency while factoring in the use of VMs in clusters. Each approach to image caching presented solves one or more of the problems described – staging speed, corruption, fill, and skew. However, the walltime age approach shows the most promise for maximizing speed and minimizing the possibility of non-working images. Next we examine the initial implementation of age caching.

D. Analytical Results

   Aged caching has been implemented with a configurable image elimination policy that is specified in minutes. Table II shows the maximum percentage of job turnaround time required to stage and boot a virtual cluster with different maximum cache image ages. We assume that identical (or extremely similar) jobs are likely to run on the same set of nodes one after the other. In these results, each job is executed ten times and the overhead of staging is shown for each image and each job runtime. The percentages shown are highly dependent on the staging time of the image, which is directly influenced by the number of nodes, the performance of the fileserver, and the speed of the local disks. For the following results, each image was copied from an NFS mount to local disk. The NFS server is a 10TB fileserver with a maximum read speed of 100MB/s. Staging the 3GB image to one node takes approximately one minute, while staging the same image to twelve nodes simultaneously takes 5.5 minutes. Once we reach the twelve node limit, staging time increases little, as the NFS server is able to cache large amounts of data. Figure 2 shows a graphical representation of how maximum cache age affects job turnaround time with different cache expiration times.

   The table and figure show that, as expected, the longer the cache image is allowed to remain, the lower the overhead of staging. Also as expected, the largest gains from caching are seen by short-running applications. For applications running ten hours or more, staging time is a negligible portion of runtime.

                      IV. CONCLUSIONS

   The analytical results of aged image caching show that the decreased time-to-ready of VMs is important for short-running jobs. At this time, the probability that image corruption, fill, and skew will occur is unknown, and likely varies depending on the image size and use. Despite these unknowns, the results are most promising for jobs that execute for a short period of time, wherein staging and boot time comprise a large portion of job turnaround time. The longer a job executes, the less important image caching becomes, since staging and booting is a constant-time operation.
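The constant-time overhead just described is what equation 1 captures, and it can be sketched directly in code. The parameter names follow the notation of Section III-B (B boot time, S base staging time, F staging factor, V VMs, ρ per-node cache probability); the numeric values in the example are illustrative, not measurements.

```python
def expected_time_to_ready(B, S, F, V, rho):
    """Average time-to-ready of a V-node virtual cluster per
    equation 1: boot time B plus staging time S + (V*S)/F,
    paid only when some node lacks a cached image (W > 0)."""
    W = V - rho * V              # expected nodes that still need staging
    if W > 0:
        return B + S + (V * S) / F
    return B                     # fully warmed cache: boot time only

cold = expected_time_to_ready(B=30, S=30, F=2, V=12, rho=0.0)  # 240.0
warm = expected_time_to_ready(B=30, S=30, F=2, V=12, rho=1.0)  # 30
```

With these illustrative numbers, a fully warmed cache cuts the cluster's time-to-ready from four minutes to the unavoidable boot time alone.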
                                  Performance Overhead
   Image Size     Max Age       6 min job    60 min job   600 min job
   2GB            no cache      29.4%        4%           0.4%
                  30 min        12%          4%           0.4%
                  1 hr          9.9%         4%           0.4%
                  24 hr         7.8%         0.9%         0.2%
   3GB            no cache      48%          8.4%         0.9%
                  30 min        15.7%        8.4%         0.9%
                  1 hr          11.7%        8.4%         0.9%
                  24 hr         7.9%         1.1%         0.85%
   10GB           no cache      75%          23%          2.9%
                  30 min        21.1%        23%          2.9%
                  1 hr          14.4%        23%          2.9%
                  24 hr         8%           1.7%         1%
                                TABLE II

[Fig. 2. 6 Minute Job Turnaround Time with Cache Expiration]

   Although the initial results for reducing job turnaround time are promising, deficiencies in the caching scheme have become obvious. The largest problem occurs with highly parallel jobs that require many nodes. If one node doesn't have a cached image, every other VM must wait until staging and booting of the entire set of VMs is complete. As the number of nodes allocated to a job increases, the likelihood that caching benefits a job decreases. One way to reduce the likelihood of this is to enhance the job scheduler.

   Nodes are assigned to jobs by the cluster scheduler, while images are staged and booted by a resource manager. If we take information from the resource manager about which nodes have accessible cached images, the scheduler can intelligently allocate nodes to ensure that the time-to-ready of a dynamic virtual cluster is minimized. Furthermore, if the scheduler knows which nodes have cached images, it can request that the resource manager prestage images to nodes so that when the job is ready to start, it will have a cached image available. This optimization is useful since staging time is a function of concurrency. The more images that have to be staged to nodes, the longer the staging will take. Therefore, if the job scheduler can use nodes that already have the image cached, we can decrease the time-to-ready of the virtual cluster.

   The results from this paper will form the basis for developing a model for effectively using image caching at the job scheduler level. Enhancing the job scheduler to be aware of cached images and to take the best advantage of the cache will be investigated in later work.

                       REFERENCES

 [1] W. Emeneker, D. Jackson, J. Butikofer, and D. Stanzione, "Dynamic Virtual Clustering with Xen and Moab," in Workshop on XEN in HPC Cluster and Grid Computing Environments (XHPC), 2006.
 [2] W. Jones, L. Pang, D. Stanzione, and W. I. Ligon, "Characterization of Bandwidth-aware Meta-schedulers for Co-allocating Jobs Across Multiple Clusters," Journal of Supercomputing, Special Issue on the Evaluation of Grid and Cluster Computing, 2005.
 [3] W. Jones, L. Pang, D. Stanzione, and W. Ligon, "Bandwidth-aware Co-allocating Meta-schedulers for Mini-grid Architectures," in International Conference on Cluster Computing (Cluster 2004), 2004.
 [4] J. Sinaga, H. Mohammed, and D. Epema, "A dynamic co-allocation service in multicluster systems," 2004.
 [5] C.-T. Yang, I.-H. Yang, K.-C. Li, and S.-Y. Wang, "Improvements on dynamic adjustment mechanism in co-allocation data grid environments," The Journal of Supercomputing, vol. 40, pp. 269–280, June 2007.
 [6] W. Emeneker, "Dynamic Virtual Clustering," Master's thesis, Arizona State University, April 2007.
 [7] W. Emeneker and D. Stanzione, "Dynamic Virtual Clustering," submitted to the IEEE International Conference on Cluster Computing 2007, 2007.
 [8] "Xen-OSCAR for cluster virtualization," submitted to Workshop on XEN in HPC Cluster and Grid Computing Environments (XHPC), 2006.
 [9] J. Liu, W. Huang, B. Abali, and D. K. Panda, "High Performance VMM-Bypass I/O in Virtual Machines," 2006.
[10] R. P. Goldberg, "Architecture of virtual machines," in Proceedings of the Workshop on Virtual Computer Systems. New York, NY, USA: ACM Press, 1973, pp. 74–112.
[11] N. Holmes, "The turning of the wheel," Computer, vol. 38, no. 7, pp. 100, 98–99, 2005.
[12] M. Varian, "VM and the VM Community: Past, Present, and Future," 1997, http://www.os.nctu.edu.tw/vm/pdf/VM and the VM Community Past Present and Future.pdf.
[13] B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer, "Xen and the Art of Virtualization," in Proceedings of the ACM Symposium on Operating Systems Principles, October 2003.
[14] B. Clark, T. Deshane, E. Dow, S. Evanchik, M. Finlayson, J. Herne, and J. Matthews, "Xen and the Art of Repeated Research," in Proceedings of the USENIX Annual Technical Conference, July 2004.
[15] W. Huang, J. Liu, B. Abali, and D. K. Panda, "A case for high performance computing with virtual machines," in The 20th ACM International Conference on Supercomputing (ICS'06), June 2006.
[16] W. Emeneker and D. Stanzione, "HPC Cluster Readiness of Xen and UML," in Proceedings of the IEEE International Conference on Cluster Computing 2006, 2006.
[17] M. Jette and M. Grondona, "SLURM: Simple Linux Utility for Resource Management," 2002. [Online]. Available: citeseer.ist.psu.edu/jette02slurm.html
[18] A. J. Smith, "Cache memories," ACM Comput. Surv., vol. 14, no. 3, pp. 473–530, 1982.
