


                           Cashing in on the Cache in the Cloud
    Hyuck Han, Young Choon Lee, Member, IEEE, Woong Shin, Hyungsoo Jung, Heon Y. Yeom, Member, IEEE, and
                                        Albert Y. Zomaya, Fellow, IEEE

      Abstract—Over the past decades, caching has become the key technology used for bridging the performance gap across memory
      hierarchies via temporal or spatial localities; in particular, the effect is prominent in disk storage systems. Applications that involve heavy
      I/O activities, which are common in the cloud, probably benefit the most from caching. The use of local volatile memory as cache might
      be a natural choice, but many well-known restrictions, such as capacity and the utilization of host machines, hinder its effective
      use. In addition to technical challenges, providing cache services in clouds encounters a major practical issue (quality of service or
      service level agreement issue) of pricing. Currently, (public) cloud users are limited to a small set of uniform and coarse-grained service
      offerings, such as High-Memory and High-CPU in Amazon EC2. In this paper, we present the cache as a service (CaaS) model as
      an optional service to typical infrastructure service offerings. Specifically, the cloud provider sets aside a large pool of memory that
      can be dynamically partitioned and allocated to standard infrastructure services as disk cache. We first investigate the feasibility of
      providing CaaS with a proof-of-concept elastic cache system (using dedicated remote memory servers) built and validated on an
      actual system; the practical benefits of CaaS for both users and providers (i.e., performance and profit, respectively) are then thoroughly
      studied with a novel pricing scheme. Our CaaS model helps to leverage the cloud economy greatly in that (1) the extra user cost for I/O
      performance gain is minimal, if any, and (2) the provider's profit increases due to improvements in server consolidation resulting
      from that performance gain. Through extensive experiments with eight resource allocation strategies, we demonstrate that our CaaS
      model can be a promising cost-efficient solution for both users and providers.

      Index Terms—Cloud Computing, Cache as a Service, Remote Memory, Cost Efficiency


• Hyuck Han, Woong Shin, and Heon Y. Yeom are with the School of Computer Science and Engineering, Seoul National University, Korea. E-mail: {hhyuck, wshin, yeom}@dcslab.snu.ac.kr
• Young Choon Lee and Albert Y. Zomaya are with the Centre for Distributed and High Performance Computing, School of Information Technologies, University of Sydney, NSW 2006, Australia. E-mail: {young.lee,albert.zomaya}@sydney.edu.au
• Hyungsoo Jung (corresponding author) is with the School of Information Technologies, University of Sydney, NSW 2006, Australia. E-mail: hyungsoo.jung@sydney.edu.au

1 INTRODUCTION

The resource abundance (redundancy) in many large datacenters is increasingly engineered to offer the spare capacity as a service, like electricity, water and gas. For example, public cloud service providers like Amazon Web Services virtualize resources, such as processors, storage and network devices, and offer them as services on demand, i.e., infrastructure as a service (IaaS), which is the main focus of this paper. A virtual machine (VM) is a typical instance of IaaS. Although a VM is an isolated computing platform capable of running multiple applications, it is assumed in this study to be solely dedicated to a single application; thus, we use the expressions VM and application interchangeably hereafter. Cloud services as virtualized entities are essentially elastic, creating an illusion of "unlimited" resource capacity. This elasticity with utility computing (i.e., pay-as-you-go pricing) inherently brings the cost effectiveness that is the primary driving force behind the cloud.

However, putting a higher priority on cost efficiency than cost effectiveness might be more beneficial to both the user and the provider. Cost efficiency can be characterized by having the temporal aspect as priority, which translates to the cost-to-performance ratio from the user's perspective and improvement in resource utilization from the provider's perspective. This characteristic is reflected in the present economics of the cloud to a certain degree [1]. However, the conflicting nature of these perspectives (or objectives) and their resolution remain an open issue for the cloud.

In this paper, we investigate how cost efficiency in the cloud can be further improved, particularly with applications that involve heavy I/O activities; hence, I/O-intensive applications. They account for the majority of applications deployed on today's cloud platforms. Clearly, their performance is significantly affected by how fast their I/O activities are processed. Here, caching plays a crucial role in improving their performance.

Over the past decades, caching has become the key technology in bridging the performance gap across memory hierarchies via temporal or spatial localities; in particular, the effect is prominent in disk storage systems. Currently, the effective use of cache for I/O-intensive applications in the cloud is limited for both architectural and practical reasons. Due to the essentially shared nature of some resources like disks (which are not performance isolatable), the virtualization overhead with these resources is not negligible and further worsens disk I/O performance. Thus, low disk I/O performance is one of the major challenges encountered by most infrastructure services, as in Amazon's relational database service, which provisions virtual servers with database servers. At present, the performance issue of I/O-intensive applications is mainly dealt with by using high performance servers with large amounts of memory, leaving it as the user's responsibility.

To overcome low disk I/O performance, there have been extensive studies on memory-based cache systems [2], [3], [4], [5]. The main advantage of memory is that its access time is several orders of magnitude faster than that of disk storage. Clearly, disk-based information systems with a memory-based cache can greatly outperform those without cache. A natural design choice in building a disk-based information system with ample cache capacity is to exploit a single, expensive, large-memory computer system. This simple design—using local volatile memory as cache (LM cache)—costs a great deal, and
may not be practically feasible in the existing cloud services due to various factors, including capacity and the utilization of host machines.

In this paper, we address the issue of disk I/O performance in the context of caching in the cloud and present a cache as a service (CaaS) model as an additional service to IaaS. For example, a user is able to simply specify more cache memory as an additional requirement to an IaaS instance with the minimum computational capacity (e.g., a micro/small instance in Amazon EC2) instead of an instance with a large amount of memory (a high-memory instance in Amazon EC2). The key contribution of this work is that our cache service model greatly augments the cost efficiency and elasticity of the cloud from the perspective of both users and providers. CaaS as an additional service (provided mostly in separate cache servers) gives the provider an opportunity to reduce both capital and operating costs by using fewer active physical machines for IaaS; this can justify the cost of cache servers in our model. The user also benefits from CaaS in terms of application performance with minimal extra cost; besides, caching is enabled in a user-transparent manner and cache capacity is not limited to local memory. The specific contributions of this paper are listed as follows. First, we design and implement an elastic cache system, as the architectural foundation of CaaS, with remote memory (RM) servers or solid state drives (SSDs); this system is designed to be pluggable and file system independent. By incorporating our software component into existing operating systems, we can configure various settings of storage hierarchies without any modification of operating systems and user applications. Currently, many users exploit the memory of distributed machines (e.g., memcached) by integrating a cache system and user applications at the application level or the file-system level. In such cases, users or administrators must prepare cache-enabled versions of user applications or file systems to deliver the cache benefit. Hence, file system transparency and application transparency are key issues, since there is a great diversity of applications and file systems in the cloud computing era.

Second, we devise a service model with a pricing scheme, as the economic foundation of CaaS, which effectively balances the conflicting objectives of the user and the provider, i.e., performance vs. profit. The rationale behind our pricing scheme in CaaS is that it ensures that the user gains I/O performance improvement with little or no extra cost, and at the same time it enables the provider to increase profit by improving resource utilization, i.e., better service (VM) consolidation. Specifically, the user cost for a particular application increases proportionally to the performance gain; thus, the user's cost eventually remains similar to that without CaaS. Besides, the performance gains that the user gets with CaaS have further cost efficiency implications if the user is a business service provider who rents IaaS instances and offers value-added services to other users (end-users).

Finally, we apply four well-known resource allocation algorithms (first-fit, next-fit, best-fit and worst-fit) and develop their variants with live VM migration to demonstrate the efficacy of CaaS.

Our CaaS model and its components are thoroughly validated and evaluated through extensive experiments in both a real system and a simulated environment. Our RM-based elastic cache system is tested in terms of its performance and reliability to verify its technical feasibility and practicality. The complete CaaS model is evaluated through extensive simulations, and their parameters are modeled based on preliminary experimental results obtained using the actual system.

The remainder of this paper is organized as follows: Section 2 reviews related work on caching and its impact on I/O performance in the context of cloud computing. Section 3 overviews and conceptualizes the CaaS model. Section 4 articulates the architectural design of our 'elastic' cache system. Section 5 describes the service model with a pricing scheme for CaaS. In Section 6, we present experimental validation results for the cache system and evaluation results for our CaaS model. We then conclude this paper in Section 7.

2 BACKGROUND AND RELATED WORK

A number of studies have been conducted to investigate the issue of I/O performance in virtualized systems. The focus of these investigations includes I/O virtualization, cache alternatives and caching mechanisms. In this section, we describe and discuss notable work related to our study. What primarily distinguishes ours from previous studies is its practicality, with virtualization support for remote memory access and the incorporation of a service model; hence, cache as a service.

2.1 I/O Virtualization

Virtualization enables resources in physical machines to be multiplexed and isolated for hosting multiple guest OSes (VMs). In virtualized environments, I/O between a guest OS and a hardware device should be coordinated in a safe and efficient manner. However, I/O virtualization is one of the severe software obstacles that VMs encounter due to its performance overhead. Menon et al. [6] tackled virtualized I/O by performing a full functional breakdown with their profiling tools.

Several studies [7], [8], [9] contribute to the efforts narrowing the gap between virtual and native performance. Cherkasova et al. [7] and Menon et al. [6] studied I/O performance in the Xen hypervisor [10] and showed a significant I/O overhead in Xen's zero-copy with the page-flipping technique. They proposed that page-flipping simply be replaced by the memcpy function to avoid side-effects. Menon et al. [9] optimized I/O performance by introducing virtual machine monitor (VMM) superpage and global page mappings. Liu et al. [8] proposed a new device virtualization approach called VMM-bypass that eliminates data transfer between the guest OS and the hypervisor by giving the guest device driver direct access to the device.

With an increasing emphasis on virtualization, many hardware vendors have started to support hardware-level features for virtualization. Hardware-level features have been actively evaluated to seek near-native I/O performance [11], [12], [13]. Zhang et al. [11] used the Intel Virtualization Technology architecture to gain better I/O performance. Santos et al. [14] used devices that support multiple contexts. Data transfer is offloaded from the hypervisor to the guest OS by using mapped contexts. Dong et al. [13] achieved 98% of the native performance by incorporating several hardware features such as device semantic preservation with an input/output memory management unit (IOMMU), effective interrupt sharing with message signaled interrupts, and reuse of direct memory access (DMA) mappings. All these studies focused on network I/O, whereas this work looks at disk I/O.
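The four allocation heuristics named in the contributions (first-fit, next-fit, best-fit and worst-fit) are classic bin-packing strategies for placing VMs onto physical machines. The sketch below is only an illustration of how the four choices differ; it is not the simulator used in this paper, and the machine capacity and VM sizes are arbitrary placeholders.

```python
# Illustrative sketch of the four bin-packing heuristics for VM placement
# (not the authors' simulator; capacities and sizes are arbitrary).

def place(vms, capacity, strategy):
    """Assign each VM (by size) to a machine index, opening machines as needed."""
    machines = []      # free capacity per active machine
    placement = []     # machine index chosen for each VM
    next_idx = 0       # rolling pointer used by next-fit
    for size in vms:
        candidates = [i for i, free in enumerate(machines) if free >= size]
        if strategy == "first-fit":
            choice = candidates[0] if candidates else None
        elif strategy == "next-fit":   # only look at the current machine
            choice = next_idx if machines and machines[next_idx] >= size else None
        elif strategy == "best-fit":   # tightest remaining space
            choice = min(candidates, key=lambda i: machines[i], default=None)
        elif strategy == "worst-fit":  # loosest remaining space
            choice = max(candidates, key=lambda i: machines[i], default=None)
        if choice is None:             # no machine fits: open a new one
            machines.append(capacity)
            choice = len(machines) - 1
        machines[choice] -= size
        placement.append(choice)
        next_idx = choice
    return placement, machines
```

For example, with capacity 8 and VM sizes [5, 4, 3], best-fit places the third VM on the tighter machine (placement [0, 1, 0]) while worst-fit places it on the looser one ([0, 1, 1]); the migration-enabled variants studied in the paper build on these base heuristics.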

Fig. 1. Overview of CaaS. (Diagram: user requests arrive at the IaaS provider; VMs on compute servers issue I/O through an RM-Cache layer in the OS, which is backed by a memory pool exported by dedicated RM servers.)

2.2 Cache Device

Cooperative cache [2] is a kind of RM cache that improves the performance of networked file systems. In particular, it is adopted in the Serverless Network File System [3]. It uses participating clients' memory regions as a cache. A remote cache is placed between the memory-based cache of a requesting client and a server disk. Each participating client exchanges meta information for the cache with others periodically. Such a caching scheme is effective where RM is faster than a local disk of the requesting client. Jiang et al. [4] propose advanced buffer management techniques for cooperative cache. These techniques are based on the degree of locality. Data that have high (low) locality scores are placed on a high-level (low-level) cache. Kim et al. [5] propose a cooperative caching system that is implemented at the virtualization layer, and the system reduces disk I/O operations for shared working sets of virtual machines.

Lim et al. [15] proposed two architectures for RM systems: (1) block-access RM supported in the coherence hardware (FGRA), and (2) page-swapped RM at the virtualization layer (PS). In FGRA, a few hardware changes to memory producers are necessary. On the other hand, PS implements a RM sharing module in a VMM.

Marazakis et al. [16] utilize RDMA (remote direct memory access) technology to improve I/O performance in a storage area network environment. It abstracts disk devices of remote machines into local block devices. RDMA-enabled memory regions in remote machines are used as buffers for write operations. Remote buffers are placed between virtually addressed pages of requesting clients and disk devices of remote machines in a storage hierarchy. These proposals are different from our work in that our system focuses on improving the I/O performance of a local disk instead of a remote disk by using RM as a cache.

Recently, SSDs have been used as a file system cache or a disk device cache in many studies. A hybrid drive [17] is a NAND flash memory attached disk. Its internal flash memory is used as the I/O buffer for frequently used data. It was developed in 2007, but the performance improvement was not significant due to the inadequate size of the cache [18]. The Drupal data management system [19] utilizes both SSD and HDD implicitly according to data usage patterns. It is implemented at the software level. It uses SSD as a file-system level cache for frequently used data. Like a hybrid disk, the performance gain of Drupal is not significant. Lee et al. [20] showed that SSDs can benefit transaction processing performance. Makatos et al. [21] use SSD as a disk cache, and further performance improvement is gained by employing online compression. To alleviate performance problems of NAND flash memory, SSD-based cache systems can adopt striping [22], parallel I/O [23], NVRAM-based buffers [24], and log-based I/O [20], and these techniques could significantly help amortize the inherent latency of a raw SSD. Nevertheless, the latency of an SSD is still higher than that of RM.

Ousterhout et al. [25] recently presented a new approach to data processing, and proposed an architecture, called RAMCloud, that stores data entirely in the DRAM of distributed systems. RAMCloud has performance benefits owing to its extremely low latency. Thus, it can be a good solution to overcome the I/O problem of cloud computing. However, RAMCloud incurs high (operational) cost and high energy usage. In this study we use remote memory as a cache device, which stores only data having high locality, to meet the balanced point of I/O performance and its cost.

3 CACHE AS A SERVICE: OVERVIEW

The CaaS model consists of two main components: an elastic cache system as the architectural foundation and a service model with a pricing scheme as the economic foundation.

The basic system architecture for the elastic cache aims to use RM, which is exported from dedicated memory servers (or possibly SSDs). It is not a new caching algorithm; the elastic cache system can use any of the existing cache replacement algorithms. Near-uniform access time to the RM-based cache is guaranteed by a modern high speed network interface that supports RDMA as primitive operations. Each VM in the cloud accesses the RM servers via the access interface that is implemented and recognized as a normal block device driver. Based on this access layer, VMs utilize RM to provision a necessary amount of cache memory on demand.

As shown in Figure 1, a group of dedicated memory servers exports their local memory to VMs, and the exported memory space can be viewed as an available memory pool. This memory pool is used as an elastic cache for VMs in the cloud. For billing purposes, cloud service providers could employ a lease mechanism to manage the RM pool.

To employ the elastic cache system for the cloud, service components are essential. The CaaS model consists of two cache service types (CaaS types) based on whether LM or RM is allocated. Since these types differ in their performance and costs, a pricing scheme that incorporates these characteristics is devised as part of CaaS.
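The charging rule just outlined (cache billed per unit cache size per unit time, on top of a base IaaS instance) can be made concrete with a small sketch. All rates and the speedup below are hypothetical placeholders, not figures from this paper; the point is only the shape of the computation: a faster job is billed for fewer hours, which offsets the cache surcharge.

```python
# Illustrative CaaS billing sketch. The instance rate, cache rate (per
# GB-hour) and the 2x speedup are hypothetical, not the paper's numbers.

def job_cost(instance_rate, hours, cache_gb=0.0, cache_rate=0.0):
    """Total charge: instance price per hour plus cache price per GB-hour."""
    return hours * (instance_rate + cache_gb * cache_rate)

# A 10-hour I/O-bound job on a small instance without CaaS:
base = job_cost(instance_rate=0.085, hours=10)
# Suppose 8 GB of RM cache doubles I/O throughput, halving the runtime:
cached = job_cost(instance_rate=0.085, hours=5, cache_gb=8, cache_rate=0.01)
# The user's total stays roughly the same while the job finishes sooner,
# and the provider can consolidate the freed machine hours.
```

With these placeholder rates, base is 0.85 and cached is 0.825: the user pays a similar total for half the turnaround time, which is the cost-efficiency balance the CaaS pricing scheme aims at.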

   Together, we consider the following scenario. The service               than that of RM. In addition to this, RM has no such limitations
provider sets up a dedicated cache system with a large pool of             so that it can be a good candidate for cache memory.
memory and provides cache services as an additional service                   Implementation Level. Elastic cache can be deployed at
to IaaS. Now, users have an option to choose a cache service               either application or OS level (block device or file system
specifying their cache requirement (cache size) and that cache             level). In this paper, it is the fundamental principle that the
service is charged per unit cache size per time. Specifically, the          cache need not affect application code or file systems owing
user first selects an IaaS type (e.g., Standard small in Amazon             to the diversity of applications or file system configurations
EC2) as a base service. The user then estimates the performance            on cloud computing. Application level elastic cache such as
benefit of additional cache to her application taking into ac-              memcached2 could have better performance than OS level
count the extra cost, and determines an appropriate cache size             cache, since application level cache can exploit application
based on that estimation. We assume that the user is at least              semantics. However, modification of application code is always
aware whether her application is I/O-intensive, and aware                  necessary for application level cache. A file system level im-
roughly how much data it deals with. The additional cache                  plementation can also provide many chances for performance
in our study can be provided either from the local memory of               improvements, such as buffering and prefetching. However, it
the physical machine on which the base service resides or from             forces users to use a specific file system with the RM-based
the remote memory of dedicated cache servers. The former LM                cache. In contrast, although a block-device level implemen-
case can be handled simply by configuring the memory of the                 tation has fewer chances of performance improvements than
base service to be the default memory size plus the additional             the application or file system level counterpart, it does not
cache size. On the other hand, the latter RM case requires an              depend on applications or file systems to take benefits from
atomic memory allocation method to dedicate a specific region               the underlying block-level cache implementation.
of remote memory to a single user. Specific technical details of               RDMA vs. TCP/IP. Despite the popularity of TCP/IP, its
RM cache handling are presented in Section 4.2.                            use in high performance clusters has some restrictions due to
   The cost benefit of our CaaS model is two-fold: profit max-               its higher protocol processing overhead and less throughput
imization and performance improvement. Clearly, the former                 than other cutting-edge interconnects, such as Myrinet and
is the main objective of service provider. The latter also con-            Infiniband. Since disk cache in our system requires a low
tributes to achieving such an objective by reducing the num-               latency communication channel, we choose a RDMA-enabled
ber of active physical machines. From the user’s perspective,              interface to guarantee fast and uniform access time to RM
the performance improvement of application (I/O-intensive                  space.
applications in particular) can be obtained with CaaS in a                    Dedicated-Server-Based Cache vs. Cooperative Cache. Re-
much more cost efficient manner since caching capacity is more              mote memory from dedicated servers might demand more
important than processing power for those applications.                    servers and related resources, such as rack and power, during
                                                                           the operation. However, the total number of machines for data
4   E LASTIC C ACHE S YSTEM                                                processing applications is not greater than that of machines
                                                                           without RM-based cache systems. As an alternative way, we
In this section, we describe an elastic cache architecture, which          could implement remote memory based on a cooperative
is the key component in realizing CaaS. We first discuss the                cache, which uses participants’ local memory as remote mem-
design rationale for a RM-based cache, and its technical details.          ory. This might help saving the number of machines used
                                                                           and the energy consumed, but the efficient management of
4.1 Design Rationale                                                       cooperative cache is a daunting task in large data centers. We
Among many important factors in designing an elastic cache                 are now back to the principle that local memory should be
system, we particularly focus on the type of cache medium, the             used for a guest OS or an application on virtual machines,
implementation level of our cache system, the communication                rather than for remote memory. We consider that this design
medium between a cache server and a VM, and reliability.                   rationale is practically less problematic and better choice for
   Cache Media. We have three alternatives to implement cache              implementing real systems.
devices. Clearly, LM would be the best option due to the speed gap between LM and other devices (RM and SSD). Because LM has a higher cost per capacity, which causes the capacity limitation, dedicating a large amount of LM as cache could cause a side effect of memory pressure in operating systems; this capacity issue primarily motivates us to consider using RM and SSD as alternative cache media. RM and SSD enable VMs to flexibly provision cache practically without such a strict capacity limit.
   SSDs have recently emerged as a new storage medium that offers faster and more uniform access times than HDDs. However, SSDs have a few drawbacks due to the characteristics of NAND flash memory; in-place updates are not possible, and this causes extra overhead (latency)1 in page update operations. Although many strategies [22], [23], [20] have been proposed to alleviate such problems, the latency of an SSD is still higher
   Reliability. One of the most important requirements for the elastic cache is failure resilience. Since we implement the elastic cache at the block device level, the cache system is designed to support a RAID-style fault-tolerant mechanism. Based on a RAID-like policy, the elastic cache can detect any failure of cache servers and recover automatically from the failure (a single cache server failure).
   In summary, we suggest that the CaaS model can be better realized with an RM-based elastic cache system at the block device level.

4.2 System Architecture
In this section, we discuss the important components of the elastic cache. The elastic cache system is conceptually composed of two components: a VM and a cache server.

  1. As Ousterhout et al. [25] pointed out, the low latency of a storage device is pivotal in designing storage systems.
  2. Available at http://www.memcached.org.
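The RAID-style fault tolerance described above can be illustrated with a small sketch. This is not the paper's implementation; it assumes a simple RAID-1-style scheme in which every cached block is written to two cache servers, so a single server failure loses no cached data. All class and method names are hypothetical.

```python
class MirroredCache:
    """Toy RAID-1-style elastic cache: each block is mirrored on two cache servers."""

    def __init__(self):
        self.replicas = [{}, {}]        # block stores of the two cache servers
        self.alive = [True, True]

    def write(self, block, data):
        for i in range(2):              # mirror every write to all live replicas
            if self.alive[i]:
                self.replicas[i][block] = data

    def read(self, block):
        for i in range(2):              # fall back to the survivor on failure
            if self.alive[i] and block in self.replicas[i]:
                return self.replicas[i][block]
        return None                     # cache miss: read the backing disk

    def fail(self, i):
        """Model the loss of a single cache server."""
        self.alive[i] = False
        self.replicas[i] = {}

    def recover(self, i):
        """Rebuild the failed replica from the surviving copy."""
        self.replicas[i] = dict(self.replicas[1 - i])
        self.alive[i] = True
```

After `fail(0)`, reads are served from the second replica, which mirrors the single-cache-server failure that the elastic cache is designed to tolerate.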

Fig. 2. Elastic cache structure and double paging problem (the VM's levels of memory and the RM-Cache device with its DMA buffer, connected over the RDMA interface to a memory pool whose chunks carry Chunk_Lock and Owner regions)

   A VM demands RM for use as a disk cache. We build an RM-based cache as a block device and implement a new block device driver (the RM-Cache device). In the RM-Cache device, RM regions are viewed as byte-addressable space. The block address of each block I/O request is translated into an offset within a region, and all read/write requests are also transformed into RDMA read/write operations. We use the device-mapper module of the Linux operating system (i.e., DM-Cache3) to integrate both the RM-Cache device and a general block device (HDD) into a single block device. This forms a new virtual block device, which makes our cache pluggable and file-system independent.
   In order to deal with the allocation of remote memory requested by each VM, a memory server offers a memory pool as a cache pool. When a VM needs cache from the memory pool, the memory pool provides available memory. To this end, a memory server in the pool exports a portion of its physical memory4 to VMs, and a server can have several chunks. A normal server process creates a 512MB memory space (chunk) via the malloc function, and it exports a newly created chunk to all VMs, along with Chunk_Lock and Owner regions to guarantee exclusive access to the chunk. After a memory server process exchanges RDMA-specific information (e.g., the rkey and memory address of the corresponding chunks) with a VM that demands RM, the exported memory of each machine in the pool can be viewed as actual cache. When a VM wants to use RM, it should first mark its ownership of the assigned chunks; then it can make use of the chunks as cache. An example of the layered architecture of a VM and a memory pool, connected via the RDMA interface, is described in Figure 2.
   When multiple VMs try to mark their ownership of the same chunk simultaneously, the access conflict is resolved by a safe and atomic chunk allocation method, which is based on the CompareAndSwap operation supported by InfiniBand. The CompareAndSwap operation of InfiniBand atomically compares the 64-bit value stored at the remote memory to a given value and replaces the value at the remote memory with a new value only if they are the same. By the CompareAndSwap operation, only one node can acquire the Chunk_Lock lock, and it can safely mark its ownership of the chunk by setting the Owner variable to the consumer's id.
   Double Paging in RDMA. The double paging problem was first addressed in [26], and techniques such as ballooning [27] have been proposed to avoid it. Since the problem is somewhat technical but very critical in realizing CaaS on a cloud platform, we describe what implementation difficulty it causes and how we overcome the obstacle. Goldberg et al. [26] define levels of memory as follows:
   • Level 0 memory: memory of the real machine
   • Level 1 memory: memory of the VM
   • Level 2 memory: virtual memory of the VM
In VM environments, the level 2 (level 1) memory is mapped into the level 1 (level 0) memory, and this is called double paging. For RDMA communication, a memory region (level 0 memory) must be registered with the RDMA device (i.e., the InfiniBand device). Generally, kernel-level functions that map virtual to physical addresses (i.e., virt_to_phys) are used for memory registration with the RDMA device. In VMs, the addresses returned by such functions in a guest OS are in level 1 memory. Since the RDMA device cannot interpret level 1 memory addresses, direct registration of level 1 memory space with RDMA leads to malfunction of RDMA communication.
   To avoid this type of double paging anomaly in RDMA communication, we exploit hardware IOMMUs to get DMA-able memory (level 0 memory). IOMMUs are hardware devices that manage device DMA addresses. To virtualize IOMMUs, VMMs like Xen provide software IOMMUs. Many hardware vendors also redesign IOMMUs so that they are isolated between multiple operating systems with direct device access. Thus, we use kernel functions related to IOMMUs to get level 0 memory addresses. The RM-Cache device allocates level 2 memory space through kernel-level memory allocation functions in the VM. Then, it remaps the allocated memory to DMA-able memory space through the IOMMU. The mapped address of the DMA-able memory becomes level 0 memory that can now be registered correctly with RDMA devices. Figure 2 describes all these mechanisms in detail.

5 SERVICE MODEL
In this section, we first describe performance characteristics of different cache alternatives and design two CaaS types. Then, we present a pricing model that effectively captures the trade-off between performance and cost (profit).

5.1 Modeling Cache Services
I/O-intensive applications can be characterized primarily by data volume, access pattern and access type; i.e., file size, random/sequential and read/write, respectively. The identification of these characteristics is critical in choosing the most appropriate cache medium and a proper size, since the performance of different storage media (e.g., DRAMs, SSDs and HDDs) varies depending on one or more of those characteristics. For example, the performance bottleneck caused by frequent disk accesses may be significantly alleviated using SSDs as cache. However, if those accesses are mostly sequential write operations, the performance with SSDs might only be marginally improved or even made worse. Although the use of LM as cache delivers incomparably better I/O performance than other cache alternatives (e.g., RM),5 such use is limited by several issues including capacity and the utilization of host machines. With these facts in mind, we have designed two CaaS types as follows:
   • High performance (HP) - makes use of LM as cache; thus, its service capacity is bounded by the maximum amount of LM
   • Best value (BV) - exploits RM as cache practically without a limit

  3. Available at http://visa.cis.fiu.edu/ming/dmcache/index.html
  4. A basic unit is called a chunk (512MB).
  5. Surprisingly, the performance of LM cache is only marginally better than RM in most of our experiments. The main cause of this unexpected result is believed to be the behavior of the 'pdflush' daemon in Linux, i.e., frequently writing back dirty data to disk.

Fig. 3. Cost efficiency of CaaS. nc·C_HP and nc·C_BV are the extra costs charged for the HP and BV CaaS types, respectively, where nc is the number of cache units (e.g., 0.5GB per cache unit). t_HP, t_BV, and t_no-CaaS are the performance delivered with the two CaaS types and without CaaS, respectively. Then, for a given IaaS type s_i, we have the following: (f_i + C_HP,i·nc)·t_HP,i ≥ (f_i + C_BV,i·nc)·t_BV,i = f_i·t_no-CaaS,i.

   In our CaaS model, it is assumed that a user who sends a request with a CaaS option (HP or BV) also supplies an application profile including data volume, data access pattern and data access type. It can be argued that these pieces of application-specific information might not be readily available, particularly for average users, and that some applications behave unpredictably. In this paper, we primarily target the scenario in which users repeatedly and/or regularly run their applications in clouds, and they are aware of their applications' characteristics either by analyzing the business logic of their applications or by obtaining such information using system tools (e.g., sysstat6) and/or application profiling [28], [29]. When a user is unable to identify/determine them, he/she simply rents default IaaS instances without any cache service option, since CaaS is an optional service on top of IaaS. The service granularity (cache size) in our CaaS model is set to a fixed unit (512MB/0.5GB). In this study, we adopt three default IaaS types: small, medium and large, with flat rates of f_s, f_m and f_l, respectively.

5.2 Pricing
A pricing model that explicitly takes into account the various elastic cache options is essential for effectively capturing the trade-off between (I/O) performance and (operational) cost.
   With HP, it is rather common to have many "awkward" memory fragmentations (more generally, resource fragmentations) in the sense that physical machines may not be usable for incoming service requests due to lack of memory. For example, for a physical machine with four processor cores and a maximum LM of 16GB, a request for 13GB of HP cache on top of a small IaaS instance (which uses 1 core) occupies the majority of LM, leaving only 3GB available. Due to such fragmentations, an extra cost is imposed on the HP cache option as a fragmentation penalty (or performance penalty).
   The average number of services (VMs) per physical machine with the HP cache option (or simply HP services) is defined as:

      HP_services = (LM_max / m_HP) · a_HP                     (1)

where LM_max is the maximum local memory available and m_HP is the average amount of local memory for HP services. The amount of LM cache requested for HP is assumed to follow a uniform distribution.
   And the average number of services per physical machine without HP is defined as:

      nonHP_services = Σ_{j=1}^{s_t} (LM_max / m_j) · a_j      (2)

where s_t is the number of IaaS types (i.e., three in this study), m_j is the memory capacity of service type j (s_j), and a_j is the rate of services with type j.
   Then, the average numbers of services (service count, or sc) per physical machine with and without HP requests are defined as:

      sc_HP = HP_services + nonHP_services                     (3)

      sc_noHP = nonHP_services / (1 − a_HP)                    (4)

where a_HP is the rate of HP services. Note that the sum of all a_j is 1 − a_HP. We assume that the service provider has a means to determine the request rates of service types, including the rate of I/O-intensive applications (a_IO) and further the rates of those with HP and BV (a_HP and a_BV, respectively). Since services with BV use a separate RM server, they are treated the same as default IaaS types (small, medium and large).
   In the CaaS model, the difference between sc_noHP and sc_HP can be seen as the consolidation improvement (CI). For a given IaaS type s_i, the rates (unit prices) for HP and BV are then defined as:

      c_HP,i = f_i · pi_HP,i + (f · CI) / sc_HP                (5)

      c_BV,i = f_i · pi_BV,i                                   (6)

where pi_HP,i and pi_BV,i are the average performance improvements per unit (e.g., 0.5GB) increase of LM and RM cache, respectively, and f is the average service rate; these values might be calculated based on application profiles (empirical data).
   With BV, the rate is solely dependent on pi_BV; thus, the total price the user pays for a given service request is expected to be equivalent, on average, to that without cache, as shown in Figure 3. We acknowledge that using average performance improvements, which makes the service rates (c_HP,i and c_BV,i) uniform, might not be accurate; however, it is only indicative. In the actual experiments, charges for services with a cache option have been accurately calculated in such a way that the price for a particular service (application) remains the same regardless of the use of a cache option and the type of cache option.

  6. Available at http://sebastien.godard.pagesperso-orange.fr
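Equations (1)–(6) can be collected into a short numerical sketch. The variable names follow the equations; every input value below is purely illustrative and not taken from the paper's experiments.

```python
# Sketch of Equations (1)-(6). All numeric inputs are illustrative only.

def hp_services(lm_max, m_hp, a_hp):
    # Eq. (1): average number of HP services per physical machine
    return lm_max / m_hp * a_hp

def non_hp_services(lm_max, m, a):
    # Eq. (2): services without HP, summed over the s_t IaaS types
    return sum(lm_max / m_j * a_j for m_j, a_j in zip(m, a))

def service_counts(lm_max, m_hp, a_hp, m, a):
    n_hp = hp_services(lm_max, m_hp, a_hp)
    n_non = non_hp_services(lm_max, m, a)
    sc_hp = n_hp + n_non               # Eq. (3): service count with HP requests
    sc_no_hp = n_non / (1.0 - a_hp)    # Eq. (4): service count without HP
    return sc_hp, sc_no_hp

def rates(f_i, pi_hp_i, pi_bv_i, f_avg, ci, sc_hp):
    # Eq. (5): HP carries the fragmentation penalty (f * CI) / sc_HP
    c_hp = f_i * pi_hp_i + (f_avg * ci) / sc_hp
    # Eq. (6): BV depends only on the performance-improvement factor
    c_bv = f_i * pi_bv_i
    return c_hp, c_bv
```

For instance, with LM_max = 16GB, hypothetical per-type memory capacities m = [1.7, 3.5, 7.0]GB and rates a = [0.45, 0.2, 0.1] (so a_HP = 0.25), sc_noHP exceeds sc_HP whenever HP requests reserve a large share of LM, and the difference CI = sc_noHP − sc_HP feeds the HP surcharge of Eq. (5).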

Fig. 4. Experimental environment (a VM connected over an InfiniBand network to 6 cache servers, each of which exports 1GB of memory)

The cost-efficiency characteristic of BV can justify the use of the average over varying pi_BV values, the different values being due to application characteristics (e.g., data access pattern and type) and cache size. Alternatively, different average performance improvement values (i.e., pi_HP and pi_BV) can be used depending on the application characteristics (e.g., data access pattern and type) profiled and specified by the user/provider. Further, rates (pricing) may be mediated between the user and the provider through service-level agreement negotiation.
   It might be desirable that the performance gain that users experience with BV be proportional to that with HP. In other words, their performance gap may be comparable to the extra rate imposed on HP. The performance of a BV service might not be easily guaranteed or accurately predicted, since that performance is heavily dependent on (1) the type and amount of additional memory, (2) the data access pattern and type, and (3) the interplay of (1) and (2).

6 EVALUATION
In this section, we evaluate CaaS from the viewpoints of both users and providers. To this end, we first measure the performance benefit of our elastic cache system in terms of performance (e.g., transactions per minute), cache hit ratio and reliability. The system-level modification required by our system is not possible with existing cloud providers like Amazon and Microsoft: we can neither dedicate the cloud providers' physical servers to RM servers nor assign SSDs and RDMA devices to physical servers. Owing to these issues, we could not test our system on real cloud services, but we built an RDMA- and SSD-enabled cloud infrastructure (Figure 4) to evaluate it. We then simulate a large-scale cloud environment with more realistic settings for resources and user requests. This simulation study enables us to examine the cost efficiency of CaaS. While the experimental results in Section 6.1 demonstrate the feasibility of our elastic cache system, those in Section 6.2 confirm the practicality of CaaS (or the applicability of CaaS to the cloud).

6.1 Experimental Validation: Elastic Cache System
We validate the proof-of-concept elastic cache system with two well-known benchmark suites: a database benchmark program (TPC-C) and a file system benchmark program (Postmark). TPC-C, which simulates OLTP activities, is composed of read-only and update transactions. The TPC-C benchmark is update intensive with a 1.9:1 I/O read-to-write ratio, and it has random I/O access patterns [30]. Postmark, which is designed to evaluate the performance of email servers, is performed in three phases: file creation, transaction execution, and file deletion. Operations and files in the transaction execution phase are randomly chosen. We chose these two benchmarks because they have all the important characteristics of modern data processing applications. Intensive experiments with these applications show that the prototype elastic cache architecture is a suitable model for an efficient caching system for existing IaaS models.
   Because of the attractive performance characteristics of SSDs, the usefulness of our system might be questioned in comparison with an SSD-based cache system. To answer this, we compared our elastic cache system with an SSD-based system.

Fig. 5. Results of TPC-C Benchmark (12 clients): measured tpmC for RM-Cache, SSD-Cache and No-Cache at 60 WH, 90 WH and 120 WH

6.1.1 Experimental Environments
Throughout this paper, we use the experimental environment shown in Figure 4. For performance evaluation, we used a 7-node cluster, each node of which is equipped with an Intel(R) Core(TM)2 Quad CPU at 2.83GHz and 8GB RAM. All nodes are connected via both switched 1 Gbps Ethernet and 10 Gbps InfiniBand. We used InfiniHost III Lx HCA cards from Mellanox for the InfiniBand connection. A memory server runs Ubuntu 8.04 with a Linux 2.6.24 kernel and exports 1GB of memory. One of the cluster nodes instantiates a VM using Xen 3.4.0. The VM, with Linux 2.6.32, has 2GB memory and 1 vCPU, and it runs the benchmark programs. The cache replacement policy is Least Recently Used (LRU).
   We configured the VM to use a 16GB virtual disk combined with a 4GB elastic cache (i.e., RM) via the RM-Cache device. The ext3 file system was used for the benchmark tests. To assess the efficiency of our system, we compared it to a virtual disk with an SSD-based cache device and a virtual disk without any cache space. For the SSD-based cache device, we used one Intel X25-M SSD. Throughout this section, we denote "virtual disk with the RM-based cache", "virtual disk with the SSD-based cache", and "virtual disk without any cache" as RM-cache, SSD-cache, and No-cache, respectively.

6.1.2 TPC-C Results
We first evaluate the OLTP (Online Transaction Processing) performance on PostgreSQL, a popular open-source DBMS. The DBMS server runs inside the VM, and the RM-Cache device is used as the disk device assigned to the databases. To measure the OLTP performance on PostgreSQL, we used BenchmarkSQL7, a JDBC benchmark that closely resembles the TPC-C standard for OLTP. We measured the transaction rate (transactions per minute, tpmC) with varying numbers of clients and warehouses. It is worth noting that "warehouse" or "warehouses" will be abbreviated as WH.
   Figure 5 shows the measured tpmC and the database size. We observe the highest tpmC at the smallest WH instance in the RM-cache environment. Also, as the number of WHs and clients increases, the tpmC value decreases in all device configurations. Measured tpmC values at 60 WH are between 270 and 400 in the No-cache environment.

  7. Available at http://pgfoundry.org/projects/benchmarksql.
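The LRU replacement policy used by the cache device in the setup above can be sketched at block granularity. This is an illustration of the policy, not the RM-Cache implementation; capacity is counted in cache blocks and all names are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal block-level LRU cache: evicts the least-recently-used block."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()      # block number -> data, oldest first

    def lookup(self, block):
        if block not in self.blocks:
            return None                  # miss: the caller reads the virtual disk
        self.blocks.move_to_end(block)   # hit: mark the block most-recently-used
        return self.blocks[block]

    def insert(self, block, data):
        if block in self.blocks:
            self.blocks.move_to_end(block)
        self.blocks[block] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least-recently-used block
```

A larger `capacity_blocks` raises the probability that a hot block is still resident, which is the effect the cache-size experiment in Section 6.1.4 measures.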

                         TABLE 1                                                                                                TABLE 2
  Database size and cache hit ratio of the TPC-C Benchmark                                                              Cache hit ratio of Postmark
                                                60 WH        90 WH        120 WH                                      # of files   20000      400000         800000
                             Database size      6.88GB       10.24GB      13.44GB                                     Hit ratio    62%        51%            46%
                             Cache hit ratio   82 - 84%      78 - 80%     70 - 73%                                   Disk usage    24%        45%            89%

                          1600                                                                                3000

                                                                                              Measured tpmC
                                    RM-Cache     SSD-Cache          No-Cache
     Elapsed time (sec)



                             0                                                                                         1G           2G                 4G            8G
                                     200,000              400,000              800,000                                                    Cache Size
                                                   Number of files

Fig. 6. Results of Postmark Benchmark (seconds)                                          Fig. 7. Effects of cache size (RM-cache, TPC-C, 90 WH, and 12

SSD-cache environment is better than that without cache by a
                                                                                         cache increases the performance of TPC-C, due to the high
factor of 8, and the RM-based cache outperforms the SSD-based
                                                                                         probability that a data block will reside in the cache. When a
cache by a factor of 1.5 due to superior bandwidth and latency.8
                                                                                         cache of 1GB RM is used, the performance with a cache is 2.5
As shown in Table 1, the PostgresSQL DBMS has a strong
                                                                                         times better than that without any cache. A cache of 4GB (8GB)
locality in its data access pattern when processing the TPC-
                                                                                         RM shows 2.4 (2.6) times better performance than that of 1GB
C-like workload, and SSD-based and RM-based cache devices
                                                                                         RM. From the observation, we can safely conclude that even a
exploit this locality. Actually, frequently accessed data, such
                                                                                         small or a moderate size of RM-based cache can accelerate data
indices, is always in the cache device, while less frequently
                                                                                         processing applications on existing cloud services and users
accessed data, such as unpopular records, is located to the
                                                                                         can choose the suitable cache size for their performance criteria.
virtual disk. Results of 90 and 120 WH cases are similar to
those of the 60 WH case in that the performance of the RM-
                                                                                            Effects of File Systems. Figure 8 shows TPC-C results with
cache case is always the best.
                                                                                         various file systems. For this experiment, we used ext2, ext3,
                                                                                         and reiserfs file systems. In all cases, we can see that RM-
6.1.3 Postmark Results
                                                                                         cache cases show better performance than No-cache cases.
Postmark, which is designed to evaluate the performance of                               The ext3 and reiserfs file systems are journaling file systems;
file servers for applications, such as email, netnews, and web-                           updates to files are first written as predefined compact entries
based commerce, is performed in three phases: file creation,                              in the journal region, and then the updates are written to their
transaction execution, and file deletion. In this experiment, the                         destination on the disk. This leads to less performance benefits
number of transactions and subdirectories are 100,000 and 100,                           in journaling file systems. In fact, the journal data are not
respectively. Three experiments are performed by increasing                              necessary to be cached since they are used only for recovery
the number of files.                                                                      from a file system crash. While the ext3 file system journals
   Figure 6 and Table 2 show the results of the Postmark                                 both meta data and data, the reiserfs file system journals only
benchmark when (1) a RM-based device is used as a cache                                  meta data. This leads to better performance in the reiserfs case
of a virtual disk, (2) an SSD device is used, and (3) no cache device is used. The total size of files for each experiment (as the number of files increases from 200,000 to 800,000) is 3.4, 6.8, and 13.4GB, respectively, and this leads to a lower cache hit ratio.
   From the figure, we can see that both cache-enabled cases outperform the No-cache cases. Because Postmark is an I/O-intensive benchmark, its I/O operations involve many cache operations; thus, cache devices lead to better I/O performance of virtual resources. With 200,000, 400,000, and 800,000 files, the RM-cache cases show 9, 5.5, and 2.5 times better performance, respectively, than the No-cache cases. The RM-cache cases also show up to 130% better performance than the SSD-cache cases.

6.1.4 Other Experiments
Effects of Cache Size. Figure 7 shows the results of the TPC-C benchmark when the size of RM is varied. A large size of

with cache. On the contrary, since the ext2 file system is not a journaling file system, the ext2 case with cache shows the best performance among the three. In the ext2 file system, meta blocks, such as super-blocks and indirect blocks, must be accessed before actual data are read. Thus, when such meta blocks are located in the cache, the performance gain of the elastic cache is maximized. From this experiment, we can see that the elastic cache provided by our cache system is file-system independent and greatly helpful to file system performance.

6.1.5 Discussion
From our experimental results, we can draw the following lessons. First, a small or moderate size of RM-based cache can improve virtual disk I/O performance; thus, if users set an appropriate cache size, it can lead to cost-effective performance. Second, our system can safely recover from a single machine crash, although performance gradually decreases during the recovery; this enhances reliability. Third, our system improves virtual disk I/O performance irrespective

   8. For our evaluation, we used a new SSD; thus, the SSD device used in our experiments was in its best condition. It is well known that the performance of an SSD degrades greatly after it has been used for a long period.
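The effect noted earlier, where growing the file set under a fixed cache lowers the hit ratio, can be illustrated with a minimal LRU simulation. This is only a sketch: it assumes uniformly random block accesses, and the block counts are illustrative rather than those of the actual Postmark runs.

```python
import random
from collections import OrderedDict

def lru_hit_ratio(num_blocks: int, cache_blocks: int, accesses: int = 50_000) -> float:
    """Replay uniformly random block reads against an LRU cache of fixed size."""
    cache: OrderedDict[int, None] = OrderedDict()
    hits = 0
    rng = random.Random(42)
    for _ in range(accesses):
        block = rng.randrange(num_blocks)
        if block in cache:
            cache.move_to_end(block)  # refresh recency on a hit
            hits += 1
        else:
            cache[block] = None
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / accesses

# Fixed cache, growing working set: the hit ratio falls monotonically,
# mirroring the 3.4/6.8/13.4GB file-set progression in the text.
ratios = [lru_hit_ratio(n, cache_blocks=1_000) for n in (2_000, 4_000, 8_000)]
```

Under uniform access the steady-state hit ratio is roughly cache size over working-set size, so doubling the file set halves the expected hit ratio.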

Fig. 8. Effects of file systems (TPC-C, 90 WH, and 12 clients): measured tpmC for ext2, ext3, and reiserfs, with RM-Cache and No-Cache.

                        TABLE 3
  Comparison between MMDB and DDB with RM-cache and
   RM-based block device (TPC-C, 40 WH, and 12 clients)

                                     Measured tpmC
   MMDB                                  3984
   DDB with RM-cache                     4566
   DDB on RM-based Block Device          7748

of file systems and supports various configurations of data processing applications.
   It is well known that main memory databases (MMDBs) outperform disk-based databases (DDBs) due to the locality of data in local main memory. However, since an MMDB typically requires a large amount of main memory, it costs a great deal, and it may not be possible to provide adequate main memory with virtual machines. From the previous section, we can see that a DDB with RM-cache leads to up to 7-8 times better performance than one without any cache for TPC-C, making it a real alternative to an MMDB.
   To verify this, we compare MMDBs to DDBs with RM-cache and with an RM-based block device. In the experiment, we use MySQL Cluster and MySQL with InnoDB as the MMDB and the DDB, respectively. The core components of MySQL Cluster are mysqld, ndbd, and ndb_mgmd: mysqld is the process that allows external clients to access the data in the cluster; ndbd stores data in memory and supports both replication and fragmentation; and ndb_mgmd manages all processes of MySQL Cluster. An RM-based block device appears as a mounted file system, but it is stored in RM instead of on a persistent storage device. Table 3 shows the TPC-C results obtained using the three alternatives. The results seem somewhat controversial in that the performance of the MMDB is not as good as what is normally expected. The main reason for this is the inherent architecture of MySQL Cluster. An MMDB stores all data (including records and indices for relational algebraic operations) in the address space of the ndbd processes, and this requires coordination among the MySQL daemons (mysqld and ndbd). Thus, it usually exchanges many control messages, and MySQL is designed to use TCP/IP for all communications between these processes. This incurs significant overhead, especially when transaction throughput reaches a certain threshold level that inevitably saturates the performance. However, DDBs do not incur this IPC overhead since the InnoDB storage engine is directly embedded in mysqld. The results in Table 3 show that the DDB with RM-cache outperforms the MMDB. In addition, MySQL Cluster supports only a very small temporary space, and queries that require temporary space incur large overhead when processing relational algebraic operations. These factors create relatively unfavorable performance for the MMDB.

6.2 Experiments: Cost Efficiency of CaaS
In this section, the cost efficiency of CaaS is evaluated. Specifically, extensive experiments with the elastic cache system are performed under a variety of workload characteristics to extract performance metrics, which are then used as important parameters for large-scale simulations.

6.2.1 Preliminary Experiments
The performance metric of I/O-intensive applications is obtained to measure the average performance improvement of LM and non-LM cache (i.e., pi_HP and pi_BV). To this end, we slightly modified Postmark so that all I/O operations are either reads or updates. The modified Postmark is used to profile I/O-intensive applications by varying the ratio of reads to updates. A set of performance profiles is used as parameters for our simulation presented in Section 6.2.2.
   The experiment is conducted on the same cluster that was used in the previous performance experiments (Section 6.1). To obtain as many profiles as possible, we increase the virtual disk space from 16GB to 32GB. We vary the dataset size from 3GB to 30GB (3, 7, 10, 15, and 30) and the (RM/SSD) cache size from 512MB to 16GB. In addition, six different read-to-update ratios (10:0, 8:2, 6:4, 4:6, 2:8, and 0:10) are used to represent various I/O access patterns. We set the parameters of Postmark, such as the min/max file size and the number of subdirectories, to 1.5KB, 90KB, and 100, respectively.
   Figure 9 shows the measured elapsed time of executing 100,000 transactions only for RM-Cache with the 3GB and 10GB datasets, because the results using SSD and the other datasets (i.e., 7GB, 15GB, and 30GB) reveal similar performance

Fig. 9. Results of Postmark 100k transactions (RM-Cache): (a) 3GB data; (b) 10GB data. Each panel plots elapsed time (sec) against the read-to-update ratio (R:U) for No Cache and cache sizes of 0.5/1/2GB in (a) and 1/2/4GB in (b).

Fig. 10. Results of Postmark 100k transactions for 10GB data (Extra Memory): elapsed time (sec) against the read-to-update ratio (R:U) for No-Cache and 1/2/4GB of extra memory (EM).

                          TABLE 4
                  Experimental parameters

   Parameter                                Value
   a_IO (rate of I/O-intensive jobs)        { 90%, 80%, 70%, 50% }
   HP to BV job ratio                       { 0:1, 1:5, 1:1, 5:1, 1:0 }
   Request arrival rate                     Poisson distribution with a mean of 0.5 hour
   # of transactions                        Poisson distribution with a mean of 24 million transactions
   VM lease time of non-I/O-intensive jobs  Poisson distribution with a mean of 60 hours
   Data size (GB)                           { 3, 5, 7, 10, 15, 30 }
   Cache size (GB)                          U(0,16)
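A workload matching Table 4 can be generated along the following lines. This is a sketch: modeling the Poisson arrival process with exponential inter-arrival gaps, and approximating the Poisson transaction count (mean 24 million) by a normal distribution, are our implementation choices, not details from the paper.

```python
import random

SIZES_GB = [3, 5, 7, 10, 15, 30]  # data sizes from Table 4

def make_jobs(n: int, seed: int = 1) -> list:
    """Draw n simulated service requests using the Table 4 parameters."""
    rng = random.Random(seed)
    jobs, t = [], 0.0
    for _ in range(n):
        t += rng.expovariate(1 / 0.5)  # Poisson arrivals: mean gap of 0.5 hour
        # For a mean this large, Poisson(24e6) is indistinguishable from
        # a normal with the same mean and a standard deviation of sqrt(mean).
        txns = max(1, int(rng.gauss(24e6, 24e6 ** 0.5)))
        jobs.append({
            "arrival_h": t,                     # cumulative arrival time (hours)
            "transactions": txns,
            "data_gb": rng.choice(SIZES_GB),
            "cache_gb": rng.uniform(0, 16),     # U(0,16) from Table 4
        })
    return jobs

jobs = make_jobs(1000)
```

Each generated job would then be placed by one of the allocation algorithms described in Section 6.2.2 and charged under the CaaS pricing scheme.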
characteristics. As the cache size increases, the performance gain increases as well. Most of the cases benefit from the increased cache size, except when the dataset is small. As shown in Figure 9(a), in some cases the hard disk outperforms the elastic cache, since the 3GB dataset almost fits into the local memory (2GB); most of the data can be loaded into and served from the page cache. The use of an additional cache device like the elastic cache, which is inherently slower than the page cache, may cause more overhead than expected in certain workload configurations.
   Increasing the rate of update operations also affects performance. As we increase the rate of updates, the performance of the elastic cache increases when datasets are large (Figure 9(b)), while it degrades when datasets are small (Figure 9(a)). Since the coherency protocol of the elastic cache is a write-back protocol, the cache operates as a write buffer for updates, and this benefits update operations. Increasing the cache size further improves the throughput of update-intensive workloads. However, with small datasets, the page cache is better for read operations: while most read operations can be served from the page cache, updates suffer from dirty-page replacement traffic with the relatively high latency of the cache device and the hard disk. The throughput also decreases as the size of the data grows; this is expected because the advantage of using LM no longer exists, and it is generally the result of higher latency when accessing larger datasets.
   To measure the performance gain of HP jobs, we additionally give HP jobs the same amount of extra memory, to make the experiments fair, because BV jobs require that amount of cache space on SSD or the elastic cache. We configure the experiments so that the extra memory is used as the page cache of Linux, which is the user's natural choice. Figure 10 shows the measured elapsed time for executing 100,000 transactions. From the figure we see the somewhat unexpected (or controversial) result that the performance gain of LM depends strongly on the read-to-update ratio rather than on the amount of page cache; in other words, more update operations make this unexpected performance pattern more conspicuous. This is because the ‘pdflush’ daemon in Linux writes dirty data to disk once the data have resided in memory for more than 30 seconds or the dirty pages have consumed more than 10% of the active working memory.

6.2.2 Experimental Settings
The cost efficiency of CaaS is evaluated through extensive simulations with randomly generated workloads, and each simulation is conducted using the metric for the performance improvement of each cache. Different workload characteristics are applied; Table 4 summarizes the parameters used in our experiments. For this evaluation, each computational resource has two quad-core processors, 16GB RAM, an 80GB SSD, and a 1TB HDD, while each RM cache server has a dual-core processor, 32GB RAM, and a 500GB HDD. In this experiment, we adopt three default IaaS types with the following specifications:

   •  small: 1 core, 1GB RAM, and 50GB disk ($0.1/hr)
   •  medium: 2 cores, 4GB RAM, and 100GB disk ($0.2/hr)
   •  large: 4 cores, 8GB RAM, and 200GB disk ($0.4/hr)

   A distinctive design rationale for CaaS is that the service provider should be assured of profitability improvement under various operational conditions; that is, the impact of the provider's resource scheduling policy on its profit should be minimal. To meet this requirement, we assess the performance characteristics under four well-known resource allocation algorithms—First-Fit (FF), Next-Fit (NF), Best-Fit (BF), and Worst-Fit (WF)—and a variant of each of the four; hence, eight in total. The four variants adopt live resource (VM) migration, which has been widely studied primarily for better resource management [31], [32]. FF places a user's resource request in the first resource that can accommodate the request. NF is a variant of FF that searches for an available resource starting from the resource selected in the previous scheduling decision. BF (WF) selects the smallest (largest) resource among those that can meet the user's resource request. In our service, a resource is migrated to another physical machine only if the application running on that resource is not I/O-intensive, and the decision on resource migration is made in a best-fit fashion. Thus, we evaluate our CaaS model using the following eight algorithms: FF, NF, BF, WF, and their migration counterparts, FFM, NFM, BFM, and WFM.
   In our simulations, we set the number of physical resources to be virtually unlimited.

6.2.3 Performance Metrics
We assume that users who select BV are conservative in terms of their spending, and that their applications are I/O-intensive but not mission critical. Therefore, the performance gain from services with more cache in BV is very beneficial. The reciprocal benefit of that performance gain is realized on the service provider's side through more efficient resource utilization by effective service consolidation. These benefits are measured using two performance metrics that express the benefits in monetary terms. Specifically, the benefit for users is measured by the prices paid for their I/O-intensive applications, whereas that for providers is quantified by the profit (more specifically, unit profit) obtained from running those applications. The former metric is quite direct, and the average price paid for I/O-intensive applications is adopted. However, the metric for providers is a little more complicated, since the cost of serving those applications (including the number of physical resources used) needs to be taken into account; thus, neither the total profit nor the average profit may be an accurate measurement. As a result, the average unit profit up is devised as the primary performance metric for providers; it is defined as the total profit p_total obtained over the ‘relative’ number of physical nodes rpn. More formally,

   p_total = Σ_{i=1}^{r} p_i,                  (7)

   rpn = Σ_{i=1}^{r} act_i / act_max,          (8)

and

   up = p_total / rpn,                         (9)

where r is the total number of service requests (VMs), and act_i and act_max are the active duration of a physical node m_i (which may vary between nodes) and the maximum duration among all physical nodes, respectively. The active duration of a physical node is defined as the amount of time from when the node is instantiated to the end of a given operation period (or the finish time of a particular experiment in our study).

6.2.4 Results
The number of experiments conducted with the eight resource allocation algorithms is 320. Eight repeated trials are executed for each experiment, and we take the average of the eight results as the average profit under the corresponding parameters. These average unit profits are normalized to the average unit profit of the WF algorithm. Figure 11 shows the overall benefit of CaaS. From the figure, we identify that IaaS requests with CaaS give more benefit (36% on average) to service providers than those without CaaS, regardless of the resource allocation algorithm and VM migration policy. The benefit of using VM migration is, on average, 32% more than that without VM migration. The Best-Fit algorithm gives more profit than the other algorithms since it minimizes resource fragmentation, which results in higher resource utilization.

Fig. 11. Overall results: normalized unit profit with CaaS and No-CaaS under each resource allocation algorithm (FF, FFM, NF, NFM, BF, BFM, WF, WFM).

   Figure 12 shows the average unit profits when the rate of I/O-intensive jobs is varied. From the results without VM migration, we can see that I/O-intensive jobs lead to more benefit due to the efficiency of the elastic cache. The normalized unit profit with VM migration increases when the number of non-I/O-intensive jobs increases. This is because VM migration applies only to non-I/O-intensive jobs, which leads to more migration opportunities and higher resource utilization.

Fig. 12. Results with varying rates of I/O-intensive jobs (90%, 80%, 70%, 50%, and No-CaaS): normalized unit profit under each resource allocation algorithm.

   Figure 13 shows the normalized unit profits with various ratios of HP jobs to BV jobs. The provider profit is noticeably higher with CaaS than with No-CaaS when the rate of HP jobs is low. However, a small loss to providers is incurred when the HP to BV ratio is high (i.e., 2:1 and 1:0); this results from the unexpected LM results (shown in Figure 10). With the inherent cost efficiency of BV, the profits obtained from these jobs are promising, particularly when the rate of BV jobs is high. If a more efficient LM-based cache is devised, increases in HP jobs are also most likely to lead to high profits.

Fig. 13. Results with varying ratios of HP jobs to BV jobs (0:1, 1:2, 1:1, 2:1, 1:0, and No-CaaS): normalized unit profit under each resource allocation algorithm.

7 CONCLUSION
With the increasing popularity of infrastructure services such as Amazon EC2 and Amazon RDS, low disk I/O performance is one of the most significant problems they face. In this paper, we have presented the CaaS model as a cost-efficient cache solution to mitigate the disk I/O problem in IaaS. To this end, we have

built a prototype elastic cache system using a remote-memory-based cache, which is pluggable and file-system independent so as to support various configurations. This elastic cache system, together with the pricing model devised in this study, has validated the feasibility and practicality of our CaaS model. Through extensive experiments we have confirmed that CaaS helps IaaS greatly improve disk I/O performance. The performance improvement gained from cache services clearly reduces the number of (active) physical machines the provider uses, increases throughput, and in turn results in increased profit. This profitability improvement enables the provider to adjust its pricing to attract more users.

ACKNOWLEDGEMENTS
Professor Albert Zomaya would like to acknowledge the Australian Research Council Grant DP A7572. Hyungsoo Jung is the corresponding author for this paper.

REFERENCES
[1] L. Wang, J. Zhan, and W. Shi, “In Cloud, Can Scientific Communities Benefit from the Economies of Scale?” IEEE Transactions on Parallel and Distributed Systems, vol. 99, no. PrePrints, 2011.
[2] M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. A. Patterson, “Cooperative caching: using remote client memory to improve file system performance,” in Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation (USENIX OSDI ’94), 1994.
[3] T. E. Anderson, M. D. Dahlin, J. M. Neefe, D. A. Patterson, D. S. Roselli, and R. Y. Wang, “Serverless network file systems,” ACM Trans. Comput. Syst., vol. 14, pp. 41–79, February 1996.
[4] S. Jiang, K. Davis, and X. Zhang, “Coordinated Multilevel Buffer Cache Management with Consistent Access Locality Quantification,” IEEE Transactions on Computers, vol. 56, pp. 95–108, January 2007.
[5] H. Kim, H. Jo, and J. Lee, “XHive: Efficient Cooperative Caching for Virtual Machines,” IEEE Transactions on Computers, vol. 60, pp. 106–119, 2011.
[6] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel, “Diagnosing performance overheads in the Xen virtual machine environment,” in Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (VEE ’05), 2005.
[7] L. Cherkasova and R. Gardner, “Measuring CPU overhead for I/O processing in the Xen virtual machine monitor,” in Proceedings of the USENIX Annual Technical Conference (USENIX ATC ’05), 2005.
[8] J. Liu, W. Huang, B. Abali, and D. K. Panda, “High performance VMM-bypass I/O in virtual machines,” in Proceedings of the USENIX Annual Technical Conference (USENIX ATC ’06), 2006.
[9] A. Menon, A. L. Cox, and W. Zwaenepoel, “Optimizing network virtualization in Xen,” in Proceedings of the USENIX Annual Technical Conference (USENIX ATC ’06), 2006.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the art of virtualization,” in Proceedings of the nineteenth ACM symposium on
[14] J. R. Santos, Y. Turner, G. Janakiraman, and I. Pratt, “Bridging the gap between software and hardware techniques for I/O virtualization,” in Proceedings of the USENIX Annual Technical Conference (USENIX ATC ’08), 2008.
[15] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch, “Disaggregated memory for expansion and sharing in blade servers,” in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA ’09), 2009.
[16] M. Marazakis, K. Xinidis, V. Papaefstathiou, and A. Bilas, “Efficient remote block-level I/O over an RDMA-capable NIC,” in Proceedings of the 20th Annual International Conference on Supercomputing (ICS ’06), 2006.
[17] J. Creasey, “Hybrid Hard Drives with Non-Volatile Flash and Longhorn,” in Proceedings of the Windows Hardware Engineering Conference (WinHEC), 2005.
[18] R. Harris, “Hybrid drives: not so fast.” ZDNet, CBS Interactive, 2007.
[19] E. R. Reid, “Drupal performance improvement via SSD technology,” Sun Microsystems, Inc., Tech. Rep., 2009.
[20] S.-W. Lee and B. Moon, “Design of flash-based DBMS: an in-page logging approach,” in Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD ’07), 2007.
[21] T. Makatos, Y. Klonatos, M. Marazakis, M. D. Flouris, and A. Bilas, “Using transparent compression to improve SSD-based I/O caches,” in Proceedings of the 5th European Conference on Computer Systems (EuroSys ’10), 2010.
[22] J.-U. Kang, J.-S. Kim, C. Park, H. Park, and J. Lee, “A multi-channel architecture for high-performance NAND flash-based storage system,” J. Syst. Archit., vol. 53, pp. 644–658, September 2007.
[23] C. Park, P. Talawar, D. Won, M. Jung, J. Im, S. Kim, and Y. Choi, “A High Performance Controller for NAND Flash-based Solid State Disk (NSSD),” in Proceedings of the 21st IEEE Non-Volatile Semiconductor Memory Workshop (NVSMW ’06), 2006.
[24] S. Kang, S. Park, H. Jung, H. Shim, and J. Cha, “Performance Trade-Offs in Using NVRAM Write Buffer for Flash Memory-Based Storage Devices,” IEEE Transactions on Computers, vol. 58, pp. 744–758, 2009.
[25] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman, “The case for RAMClouds: scalable high-performance storage entirely in DRAM,” SIGOPS Oper. Syst. Rev., vol. 43, pp. 92–105, January 2010.
[26] R. P. Goldberg and R. Hassinger, “The double paging anomaly,” in Proceedings of the International Computer Conference and Exposition (AFIPS ’74), 1974.
[27] C. A. Waldspurger, “Memory Resource Management in VMware ESX Server,” in Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation (USENIX OSDI ’02), 2002.
[28] B. Urgaonkar, P. J. Shenoy, and T. Roscoe, “Resource Overbooking and Application Profiling in Shared Hosting Platforms,” in Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation (USENIX OSDI ’02), 2002.
[29] A. V. Do, J. Chen, C. Wang, Y. C. Lee, A. Y. Zomaya, and B. B. Zhou, “Profiling Applications for Virtual Machine Placement in Clouds,” in Proceedings of the 2011 IEEE International Conference on Cloud Computing, 2011.
[30] S. Chen, A. Ailamaki, M. Athanassoulis, P. B. Gibbons, R. Johnson, I. Pandis, and R. Stoica, “TPC-E vs. TPC-C: characterizing the new TPC-E benchmark via an I/O comparison study,” SIGMOD Rec., vol. 39, pp. 5–10, February 2011.
[31] H. Liu, H. Jin, X. Liao, C. Yu, and C.-Z. Xu, “Live Virtual Machine
       Operating systems principles (ACM SOSP ’03), 2003.                          Migration via Asynchronous Replication and State Synchroniza-
[11]   X. Zhang and Y. Dong, “Optimizing Xen VMM based on Intel                    tion,” IEEE Transactions on Parallel and Distributed Systems, no.
       virtualization technology,” in Proceedings of the 2008 International        PrePrints, 2011.
       Conference on Internet Computing in Science and Engineering (IEEE      [32] G. Jung, M. Hiltunen, K. Joshi, R. Schlichting, and C. Pu, “Mistral:
       ICICSE ’08), 2008.                                                          Dynamically Managing Power, Performance, and Adaptation Cost
[12]   P. Willmann, J. Shafer, D. Carr, A. Menon, S. Rixner, A. L.                 in Cloud Infrastructures,” in Proceedings of the 2010 IEEE 30th
       Cox, and W. Zwaenepoel, “Concurrent direct network access for               International Conference on Distributed Computing Systems (IEEE
       virtual machine monitors,” in Proceedings of the 2007 IEEE 13th             ICDCS ’10), 2010, pp. 62–73.
       International Symposium on High Performance Computer Architecture
       (IEEE HPCA ’07), 2007.
[13]   Y. Dong, J. Dai, Z. Huang, H. Guan, K. Tian, and Y. Jiang, “To-
       wards high-quality I/O virtualization,” in Proceedings of SYSTOR
       2009: The Israeli Experimental Systems Conference, 2009.

Hyuck Han received his B.S., M.S., and Ph.D. degrees in Computer Science and Engineering from Seoul National University, Seoul, Korea, in 2003, 2006, and 2011, respectively. Currently, he is a postdoctoral researcher at Seoul National University. His research interests are distributed computing systems and algorithms.

Young Choon Lee received the BSc (hons) degree in 2003 and the Ph.D. degree from the School of Information Technologies at the University of Sydney in 2008. He is currently a postdoctoral research fellow in the Centre for Distributed and High Performance Computing, School of Information Technologies. His current research interests include scheduling and resource allocation for distributed computing systems, nature-inspired techniques, and parallel and distributed algorithms. He is a member of the IEEE and the IEEE Computer Society.

Albert Y. Zomaya is currently the Chair Professor of High Performance Computing & Networking and Australian Research Council Professorial Fellow in the School of Information Technologies, The University of Sydney. He is also the Director of the Centre for Distributed and High Performance Computing, which was established in late 2009. Professor Zomaya is the author/co-author of seven books and more than 400 papers, and the editor of nine books and 11 conference proceedings. He is the Editor in Chief of the IEEE Transactions on Computers and serves as an associate editor for 19 leading journals, such as the IEEE Transactions on Parallel and Distributed Systems and the Journal of Parallel and Distributed Computing. Professor Zomaya is the recipient of the Meritorious Service Award (in 2000) and the Golden Core Recognition (in 2006), both from the IEEE Computer Society. He also received the IEEE Technical Committee on Parallel Processing Outstanding Service Award and the IEEE Technical Committee on Scalable Computing Medal for Excellence in Scalable Computing, both in 2011. Professor Zomaya is a Chartered Engineer, a Fellow of AAAS, IEEE, and IET (U.K.), and a Distinguished Engineer of the ACM.

Woong Shin received his B.S. degree in Computer Science from Korea University, Seoul, Korea, in 2003. He worked for Samsung Networks from 2003 to 2006 and TmaxSoft from 2006 to 2009 as a software engineer. He is currently an M.S. candidate at Seoul National University. His research interests are in system performance study, virtualization, storage systems, and cloud computing.

Hyungsoo Jung received the B.S. degree in mechanical engineering from Korea University, Seoul, Korea, in 2002, and the M.S. and Ph.D. degrees in computer science from Seoul National University, Seoul, Korea, in 2004 and 2009, respectively. He is currently a postdoctoral research associate at the University of Sydney, Sydney, Australia. His research interests are in the areas of distributed systems, database systems, and transaction processing.

Heon Y. Yeom is a Professor with the School of Computer Science and Engineering, Seoul National University. He received his B.S. degree in Computer Science from Seoul National University in 1984, and his M.S. and Ph.D. degrees in Computer Science from Texas A&M University in 1986 and 1992, respectively. From 1986 to 1990, he worked with the Texas Transportation Institute as a Systems Analyst, and from 1992 to 1993, he was with Samsung Data Systems as a Research Scientist. He joined the Department of Computer Science, Seoul National University, in 1993, where he currently teaches and conducts research on distributed systems, multimedia systems, and transaction processing.
