							                          A Quick Introduction to Clouds
                                     Robert L. Grossman
                                University of Illinois at Chicago
                                               and
                                      Open Data Group

                                        October 29, 2008


1     Introduction
1.1   What is cloud computing?
There is not yet a standard definition for cloud computing, but a good working definition is to say
that clouds provide on demand resources or services over the Internet, usually at the scale and with
the reliability of a data center.
    There are at least two different, but related, types of clouds: the first category of clouds is
designed to provide computing instances on demand, while the second category of clouds is designed
to provide computing capacity on demand.
    As an example of the first category of cloud, Amazon’s EC2 service [1] provides computing
instances on demand. A small EC2 computing instance costs $0.10 per hour and provides the
approximate computing power of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor, with 1.7 GB
of memory, 160 GB of available disk space, and moderate I/O performance [1]. The Eucalyptus system
[16] is an open source cloud that provides on demand computing instances and shares the same
APIs as Amazon’s EC2 cloud.
    As an example of the second category of cloud, Google’s MapReduce application provides com-
puting capacity on demand. An example from [6] describes a sorting application that was run
on a cluster containing approximately 1800 machines. Each machine had two 2 GHz Intel Xeon
processors, 4 GB memory, and two 160 GB IDE disks. The TeraSort benchmark [9] was coded
using MapReduce on this cluster. MapReduce is described in Section 2 below. The goal of the
TeraSort benchmark is to sort $10^{10}$ 100-byte records, which is about 1 TB of data. The application
required about 850 seconds to complete [6] on this cluster. The Hadoop system is an open source
cloud that implements a version of MapReduce [11].
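
    As a rough check on the figures quoted above, the following Python sketch works out the data
volume and the aggregate sort rate. The per-machine figure assumes that the work was spread evenly
over all 1800 machines, which is a simplification and not something the cited source states.

```python
# Back-of-the-envelope arithmetic for the TeraSort run described above.
RECORDS = 10**10      # number of records sorted
RECORD_BYTES = 100    # size of each record in bytes
MACHINES = 1800       # approximate cluster size
SECONDS = 850         # approximate elapsed time

total_bytes = RECORDS * RECORD_BYTES              # 10^12 bytes, i.e. 1 TB (decimal)
aggregate_mb_per_s = total_bytes / SECONDS / 1e6  # sort rate of the whole cluster

# Assumption: the load is spread evenly across machines.
per_machine_mb_per_s = aggregate_mb_per_s / MACHINES

print(f"total data: {total_bytes / 1e12:.1f} TB")
print(f"aggregate rate: {aggregate_mb_per_s:.0f} MB/s")
print(f"per machine: {per_machine_mb_per_s:.2f} MB/s")
```

The aggregate rate is a little over 1 GB per second, while each machine on average handles well
under 1 MB per second of the final sorted output.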
    Notice that both clouds use similar machines, but that the first is designed to scale out by
providing additional computing instances, while the second is designed to support data or compute
intensive applications by scaling capacity.




1.2   What is new?
It is important to understand what is new about cloud computing. There have been on-demand
services and resources available over the Internet for some time, but there are two important
differences with today’s interest in cloud computing:

  1. The first difference is the scale. Some companies that rely on cloud computing add capacity
     by the data center and have infrastructures that scale over several (or more) data centers.

  2. The second difference is the simplicity of many of today’s cloud service offerings. Prior to
     cloud-based computing services, writing code for high performance and distributed computing
     was relatively complicated and usually required working with grid-based services, developing
     code that explicitly passed messages between nodes, and employing other specialized methods.
     Although simplicity is in the eye of the beholder, most people feel that the cloud-based storage
     service APIs and MapReduce style computing APIs are relatively simple compared to previous
     methods.

    The impact is the following. By using the Google File System and MapReduce, or the open
source Hadoop Distributed File System and the Hadoop implementation of MapReduce, it is rel-
atively easy for a project to compute using 10 TB of data over 1000 nodes. Until recently, this
would have been out of reach for most projects.

1.3   Private vs Hosted Clouds
It is important to distinguish between private clouds that are designed to be used internally by a
company or organization and hosted clouds that are designed to provide cloud-based services to
third party clients. As an example, the Google File System (GFS) [8], MapReduce [5], and BigTable
[3] are used internally within Google and are examples of private cloud services. At least at the
time this article was written, these services were not directly available to third parties. On the other
hand, Amazon’s EC2, S3 and SimpleDB [2] are examples of hosted cloud services. Anyone with a
credit card can quickly and easily access these services, even at 3 am.
    Note that Google’s private cloud is used to provide hosted cloud-based applications that are
offered to customers, such as the email and office-based services provided by Google Apps.

1.4   Utility computing
Cloud computing may, or may not, be an example of utility, or pay-as-you-go, computing. Utility
computing is an economic model in which you request and pay for computing services by the slice
as you need them. Amazon’s S3 and EC2 are based upon a utility computing model. Cloud-based
services can be run by an organization as a private cloud, can be offered through a utility computing
model, or can be offered through another business model (such as a fixed price up to a not-to-exceed
amount and then by the slice, as in a standard cell phone plan).
    Utility computing offers several important advantages:



   • Utility computing is less capital intensive. It does not require up-front investments, but
     instead, as an on-demand service, you pay for capacity as you need it.

   • Utility computing allows you to access capacity exactly when you need it. For Web 2.0
     applications, this means that with a utility computing model, you can support 100 users one
     day and 10,000 users the next day.

    Here is a simple example to help understand utility computing. Assume that you have a
requirement to operate 100 servers for three years. One option is to lease them at $0.40 per
instance-hour. This would cost approximately

             100 servers · $0.40/instance-hour · 3 years · 8760 hours/year = $1,051,200.
    Another option is to buy them. Assume the cost to buy each server is $750 and that two staff
at $100,000 per year are required to administer the servers. Assume that the servers require 150
Watts each and the cost of electricity is $0.10 per Kilowatt-hour so that the yearly cost to operate
the 100 servers is $13,140. Then this option would cost approximately

                       100 servers · $750/server + 3 years · $13,140 electricity/year
                          + 3 years · 2 staff · $100,000 salary/year = $714,420.
    So if you were to run the servers at 100% utilization, then buying the 100 servers would be less
expensive. On the other hand, the break-even point is $714,420 / $1,051,200, or about 68%: if you
were to use the 100 servers at 68% utilization or less, and paid only for the instance-hours actually
used, then an instance-on-demand style of cloud would be less expensive.
    Of course, the numbers here are only estimates, and not all costs are considered, but even from
this simple example it is clear that leasing in this way using a pay as you go utility computing
model is preferable for many use cases.
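
    The lease-versus-buy comparison above can be reproduced with a short Python sketch. The
function and parameter names are illustrative only. The sketch assumes, as the example does
implicitly, that the cost of owned servers is fixed regardless of utilization, while leased
instances are paid for only during the hours they are used.

```python
HOURS_PER_YEAR = 8760

def lease_cost(servers, rate_per_hour, years, utilization=1.0):
    """Cost of leasing instances, paying only for the hours actually used."""
    return servers * rate_per_hour * years * HOURS_PER_YEAR * utilization

def buy_cost(servers, price_per_server, watts_per_server,
             dollars_per_kwh, staff, salary, years):
    """Purchase price plus electricity and staff cost over the whole period."""
    kilowatts = servers * watts_per_server / 1000
    electricity_per_year = kilowatts * HOURS_PER_YEAR * dollars_per_kwh
    running_per_year = electricity_per_year + staff * salary
    return servers * price_per_server + years * running_per_year

lease = lease_cost(servers=100, rate_per_hour=0.40, years=3)
buy = buy_cost(servers=100, price_per_server=750, watts_per_server=150,
               dollars_per_kwh=0.10, staff=2, salary=100_000, years=3)

# Utilization below which leasing is cheaper than buying.
break_even = buy / lease

print(f"lease at 100% utilization: ${lease:,.0f}")
print(f"buy:                       ${buy:,.0f}")
print(f"break-even utilization:    {break_even:.0%}")
```

Changing any single input (for example, the number of administrators) shifts the break-even
point, which is why the comparison should be treated as an estimate.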

1.5   Benefits of Cloud Computing
Cloud computing provides a number of important benefits:
    First, when cloud computing is offered with a utility computing pay-as-you-go model, then the
advantages include: reduced capital expense, low barrier to entry, and the ability to scale up as
demand requires, including support for brief surges in capacity.
    Second, cloud services enjoy the economies of scale and various other benefits offered by data
centers. For this reason, the unit cost for cloud-based services should be lower in general than
competing approaches. Cloud services also have the reliability and capacity that well run data
centers can provide.
    Third, the architectures used by cloud computing have proven to be very scalable. For example,
cloud based storage services can easily manage a PB of data, while managing this much data with
a traditional database is problematic.
    Of course there are some disadvantages:
    First, since cloud services are often remote (at least for hosted cloud services), cloud services
can suffer latency and bandwidth related issues associated with any remote application.


    Second, since hosted cloud services serve multiple customers, there are various issues related to
multiple customers possibly sharing the same piece of hardware. Also, having data accessible by
third parties (such as the provider of cloud services) may present security, compliance or regulatory
issues. On the other hand, there are economy-of-scale advantages when security-related services are
provided by data centers.

1.6    Layering Cloud Services
It is sometimes useful to view clouds that provide on-demand computing capacity as consisting of
layers that form a stack of cloud services.
    A storage cloud provides storage services (block-based or file-based services); a data cloud pro-
vides data management services (record-based, column-based or object-based services); and a com-
pute cloud provides computational services. Often these are layered (compute services over data
services over storage service) to create a stack of cloud services that serves as a computing platform
for developing cloud-based applications. See Figure 1.
    Examples of cloud computing stacks include Google’s Google File System (GFS) [8], MapReduce
[5] and BigTable [3] infrastructure and the open source Hadoop Distributed File System
(HDFS) and Hadoop’s implementation of BigTable [11].
    The goal for clouds that provide on-demand computing capacity is to provide a stack of cloud
services with the scale and the reliability of a data center.

1.7    * as a Service
Sometimes it is helpful to view clouds as providing on-demand services and to distinguish between
various types of services, such as SaaS and PaaS. SaaS is an abbreviation for software as a service,
while PaaS is an abbreviation for platform as a service. For example, SalesForce provides its CRM
application using a SaaS model. Amazon’s EC2 computing instances, S3 storage service and SimpleDB
data service are examples of PaaS offerings. Google provides a platform as a service offering with its
Google App Engine (GAE).
    Clouds in this sense provide a software application as an on-demand service or a platform as
an on-demand service with the reliability of a data center. Clouds like these can be designed to
support thousands to millions of separate instances of a SaaS or PaaS offering. A common use
case is to use these types of clouds to provide on demand support for Web 2.0 applications.


2     Parallel Computing over Clouds
In this section, we explain a style of parallel programming called MapReduce that is supported
by some capacity-on-demand style clouds such as Google’s BigTable [3], Hadoop [11] and Sector
[10]. A good way to understand MapReduce is by considering how to compute an inverted index
in parallel for a large collection of web pages that are stored in a cloud.
Figure 1: A layered model of cloud services is used by some clouds that provide on-demand computing
capacity.

    Assume that each node $i$ in the cloud stores web pages $p_{i,1}, p_{i,2}, p_{i,3}, \ldots, p_{i,i_j}$. Assume that a
web page $p_i$ contains words (terms) $w_{j,1}, w_{j,2}, w_{j,3}, \ldots$. A basic structure important in information
retrieval is an inverted index; that is, a list

                                          $(w_1;\ p_{1,1}, p_{1,2}, p_{1,3}, \ldots)$
                                          $(w_2;\ p_{2,1}, p_{2,2}, p_{2,3}, \ldots)$
                                          $(w_3;\ p_{3,1}, p_{3,2}, p_{3,3}, \ldots)$
with the properties:

   1. The list is sorted by the word $w_j$;

   2. Associated with each word $w_j$, there is a list that consists of all web pages $p_i$ containing the
      word.

    The computation is parallelized by hashing each word $w$ with a hash function $h(w)$ and using
the machine labeled $h(w)$ to collect all the web pages associated with $w$.
    The first step is for each page $p_i$ to extract all the words $w_j$ and compute the hash $h(w_j)$ of each
word $w_j$. The second step is to send the page $p_i$ to the processor $h(w_j)$ that is storing the portion
of the inverted index associated with the word $w_j$ and then to sort the words on that processor.
The third step is to collect all the pages $p_i$ associated with each word $w_j$ to create
the inverted index.
    This is a very common pattern and is well suited to a cloud. The first phase is usually called
the map and the third phase is usually called the reduce. See [5] for a full description of this style
of parallel computing.
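
    The three steps above can be sketched in Python as a single-process simulation. This is only
an illustration of the pattern, not the implementation used by Google, Hadoop or Sector: the
function names, the choice of CRC32 as the hash function, and the use of in-memory dictionaries
as stand-ins for worker machines are all assumptions made for the example.

```python
from collections import defaultdict
import zlib

def h(word, num_workers):
    """Hash a word to a worker id (CRC32 is stable across runs)."""
    return zlib.crc32(word.encode("utf-8")) % num_workers

def map_phase(pages):
    """Step 1: for each page, emit (word, page_id) for every distinct word."""
    for page_id, text in pages.items():
        for word in set(text.lower().split()):
            yield word, page_id

def shuffle_phase(pairs, num_workers):
    """Step 2: route each pair to the worker labeled h(word)."""
    workers = [defaultdict(list) for _ in range(num_workers)]
    for word, page_id in pairs:
        workers[h(word, num_workers)][word].append(page_id)
    return workers

def reduce_phase(workers):
    """Step 3: on each worker, collect the sorted page list for each word."""
    index = {}
    for partition in workers:
        for word, page_ids in partition.items():
            index[word] = sorted(page_ids)
    # Merge the per-worker results into one index sorted by word.
    return dict(sorted(index.items()))

def inverted_index(pages, num_workers=4):
    return reduce_phase(shuffle_phase(map_phase(pages), num_workers))

# Hypothetical sample data: page ids mapped to page text.
pages = {
    "p1": "clouds provide storage",
    "p2": "clouds provide compute",
    "p3": "storage clouds scale",
}
index = inverted_index(pages)
for word, page_ids in index.items():
    print(word, page_ids)
```

Because every occurrence of a given word hashes to the same worker, each worker can build its
portion of the index independently of the others, which is what makes the pattern parallel.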
    Notice that this example immediately generalizes. Here is another example.
    Consider credit card transactions containing account numbers and merchant names. It is com-
mon for a processor of credit card transactions to archive credit card transactions as they are
processed, but due to the size and volume of the data, a processor does not always ingest the trans-
actions into a database. Once in a database, a SQL group-by operation could quickly produce, for
each merchant, all transactions associated with that merchant for a given time period. If, instead,
the transactions are stored in files as they are processed, producing this list is not always easy.
On the other hand, a MapReduce computation can easily produce a sorted list of merchants, with
each merchant containing a list of credit card transactions that the merchant processed (during a
particular time period).
   Note that this example generalizes to any list of events. Indeed, given a table of events (an
event table), a MapReduce operation will produce a sorted list (an index table) for any metadata
associated with the transaction of interest.
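
    A minimal Python sketch of the event-table to index-table computation follows. The
transaction records, field names and function names are invented for illustration, and the
grouping runs in a single process rather than on a cloud.

```python
from collections import defaultdict

def map_event(event, key_field):
    """Map: emit (key, event), where the key is the chosen metadata field."""
    return event[key_field], event

def reduce_events(pairs):
    """Reduce: group events by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, event in pairs:
        groups[key].append(event)
    return sorted(groups.items(), key=lambda item: item[0])

def index_table(events, key_field):
    """Build an index table from an event table for one metadata field."""
    return reduce_events(map_event(e, key_field) for e in events)

# Hypothetical event table of credit card transactions.
transactions = [
    {"account": "A-1001", "merchant": "Zed Books", "amount": 12.50},
    {"account": "A-1002", "merchant": "Acme Hardware", "amount": 40.00},
    {"account": "A-1001", "merchant": "Acme Hardware", "amount": 7.25},
]

by_merchant = index_table(transactions, "merchant")
by_account = index_table(transactions, "account")

for merchant, events in by_merchant:
    print(merchant, [e["amount"] for e in events])
```

The same `index_table` call with a different field name yields the index for that field, which
is the sense in which the example generalizes to any metadata.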


3    Security
Security is an area of cloud computing that presents some special challenges. For hosted clouds, the
first challenge is simply that a third party is responsible for your data and for providing security.
On the positive side, they can take advantage of economies of scale and use this to provide a level
of security that may not be cost effective for smaller companies. Another security issue is that
with a hosted cloud two or more organizations may share the same physical resource and not be
aware of it.
    For some cloud applications, security is still somewhat immature. For example, Hadoop [11]
does not currently have user level authentication and access controls, although this is expected
in a later version. On the other hand, there is no technical difficulty per se in providing these
for clouds. For example, Sector, which also provides on-demand computing capacity, does offer
authentication, authorization and access controls, and as measured by the Terasort benchmark is
faster than Hadoop [13]. In fact, the next release of Sector will be HIPAA compliant.


4    Standards and Interoperability
Companies and organizations that develop cloud-based applications have an interest in cloud frame-
works that enable applications to be ported from one cloud to another and to interoperate with
different cloud-based services. For example, with an appropriate interoperability framework a cloud
application could switch from one storage cloud to another. In this section, we discuss portability
and interoperability for clouds.
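
    One way to picture what an interoperability framework would offer is a shared storage
interface that application code targets, with each cloud supplying its own implementation. The
Python sketch below is hypothetical throughout: the `StorageCloud` interface and the two in-memory
classes are stand-ins invented for the example and do not correspond to the API of any real cloud.

```python
from typing import Dict, Protocol

class StorageCloud(Protocol):
    """Hypothetical minimal interface shared by all storage clouds."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class InMemoryCloudA:
    """Stand-in for one storage cloud (no real service is contacted)."""
    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

class InMemoryCloudB:
    """Stand-in for a second cloud that organizes objects into buckets."""
    def __init__(self) -> None:
        self._buckets: Dict[str, Dict[str, bytes]] = {}

    def put(self, key: str, data: bytes) -> None:
        bucket, _, name = key.partition("/")
        self._buckets.setdefault(bucket, {})[name] = data

    def get(self, key: str) -> bytes:
        bucket, _, name = key.partition("/")
        return self._buckets[bucket][name]

def archive_report(store: StorageCloud, text: str) -> str:
    """Application code: depends only on the shared interface."""
    key = "reports/latest.txt"
    store.put(key, text.encode("utf-8"))
    return store.get(key).decode("utf-8")

# The same application code runs unchanged against either cloud.
results = [archive_report(cloud, "hello")
           for cloud in (InMemoryCloudA(), InMemoryCloudB())]
print(results)
```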
    Amazon’s APIs [2] have become the de facto standard for clouds that provide on-demand in-
stances. Cloud based applications that use this API enjoy portability and interoperability. For
example, as mentioned above, Eucalyptus [16] uses these APIs and applications that run on Ama-
zon’s EC2 service can also run on a Eucalyptus cloud.
    On the other hand, for clouds that provide on-demand capacity, portability and interoperability
are much more problematic today. Hadoop is by far the most prevalent system that provides
on-demand capacity, but, for example, it is not straightforward for a Hadoop MapReduce application
to run on another on-demand capacity cloud that is written in C++ [10].
    Although it may be too early for standards to emerge, there are several efforts to develop
standards for cloud computing, including an effort by the Open Cloud Consortium [12].
    There are also service based frameworks that have been developed that are well suited for clouds.
For example, Thrift is a software framework for scalable cross-language services development
that relies on a code generation engine [15]. Using Thrift, it is straightforward for cloud-based
applications to access different storage clouds, such as Hadoop [11] and Sector [10].
    There are several attempts to provide a language for MapReduce style parallel programming,
including several that extend SQL in a way that supports this style of programming. A common
language would provide an interoperable way for applications to access compute services across
different clouds.
    A closely related question is that of standards that would enable different clouds to interoperate. It
may be useful to think back to the beginning of the Internet. At that time, any organization that
wanted a network set up its own network, and sending data between networks was quite difficult.
The introduction of TCP and related Internet protocols and standards made it possible to move
data easily between networks. Many companies with network products resisted TCP and related
standards for some time. Today we are in a somewhat analogous position with respect to the
interoperability of clouds. It is interesting to imagine a world in which there were standards for
clouds and different clouds could easily interoperate. Although there is some resistance to this view
from providers of cloud services, the ability for different clouds to interoperate easily would enable
an interesting new class of applications.


5    Benchmarks for Cloud Computing
There are not yet well established benchmarks for cloud computing.
    One common method for measuring the performance of clouds is to use the Terasort Benchmark
[9]. There is an example in Section 1 above and in Table 1 below.
    For clouds that provide on-demand instances, a recent benchmark called Cloudstone has been
developed [14]. Cloudstone is a toolkit consisting of an open source Web 2.0 social application, a set
of tools for generating work loads, a set of tools for performance monitoring, and a recommended
methodology for computing a metric that quantifies the dollars per user per month that a given
cloud requires.
    For clouds that provide on-demand capacity, a recent benchmark called Creditstone has been
developed [4]. Creditstone is based upon the credit card example of a MapReduce computation de-
scribed in Section 2 above. Creditstone includes code to generate synthetic credit card transactions
and a recommended MapReduce computation.


6    Summary and Conclusion
Recall that we used a working definition that viewed clouds as providing on-demand resources or
services over the Internet, usually at the scale and with the reliability of a data center.
    Although services and resources have been offered over the Internet for some time, what is
perhaps new is the scale (the unit for cloud computing can be thought of as a data center) and
the simplicity (most people find cloud services much simpler to use than competing distributed
computing frameworks such as grids [7]).
    We have given a quick introduction to two of the most common categories of clouds: clouds that
provide on-demand computing instances, such as the Amazon EC2 cloud, and clouds that provide
on-demand computing capacity, such as Google’s GFS and MapReduce applications and the open
source Hadoop system.

                                                     Terasort
                         Hadoop version 0.17.2       3702 sec
                         Sector version v1.8         1526 sec
                         Number of nodes             118
                         Number of records           10 billion

                                                     Creditstone A
                         Hadoop version 0.17.2       189 min
                         Sector version v1.12        71 min
                         Number of nodes             117
                         Number of records           58.5 billion
                         Size of data                3.276 TB

Table 1: A comparison of Sector [10] and Hadoop [11] using the Terasort and Creditstone benchmarks.
The measurements were performed on the Open Cloud Testbed. Notice that the performance
advantage of Sector over Hadoop is roughly the same with both benchmarks: 2.4 times faster as
measured by Terasort and 2.6 times faster as measured by Creditstone A.
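
    The speedups quoted in the caption of Table 1 can be recomputed from the timings in the
table. The short Python sketch below does only that arithmetic; the variable and function names
are illustrative.

```python
# Timings copied from Table 1.
terasort_sec = {"Hadoop": 3702, "Sector": 1526}
creditstone_min = {"Hadoop": 189, "Sector": 71}

def speedup(times):
    """How many times faster Sector finished than Hadoop."""
    return times["Hadoop"] / times["Sector"]

terasort_speedup = speedup(terasort_sec)
creditstone_speedup = speedup(creditstone_min)

print(f"Terasort:      {terasort_speedup:.2f}x")
print(f"Creditstone A: {creditstone_speedup:.2f}x")
```

The two ratios come out at roughly 2.4 and between 2.6 and 2.7, consistent with the caption's
observation that the advantage is similar on both benchmarks.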
   Clouds may be for the private use of a company or organization or hosted and shared by multiple
organizations. Clouds are also commonly offered with a utility model, in the sense that you can
pay as you go, just for the instances or the capacity that you require.
   Two of the challenges facing cloud deployments today are the lack of standards and the various
security issues that arise when third parties provide resources that are possibly shared by more
than one company or organization.




References
 [1] Amazon. Amazon Elastic Compute Cloud (Amazon EC2). aws.amazon.com/ec2, 2008.

 [2] Amazon. Amazon Web Services Developer Connection. Retrieved from
     http://aws.amazon.com, 2008.

 [3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike
     Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed
     storage system for structured data. In OSDI’06: Seventh Symposium on Operating System
     Design and Implementation, 2006.

 [4] Robert L Grossman, Collin Bennett, and Jonathan Seidman. Creditstone: A benchmark for
     clouds that provide on-demand capacity.

 [5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters.
     In OSDI’04: Sixth Symposium on Operating System Design and Implementation, 2004.

 [6] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters.
     Communications of the ACM, 51(1):107–113, 2008.

 [7] Ian Foster and Carl Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure.
     Morgan Kaufmann, San Francisco, California, 2004.

 [8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In SOSP
     ’03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages
     29–43, New York, NY, USA, 2003. ACM.

 [9] Jim Gray. Sort benchmark home page. http://research.microsoft.com/barc/SortBenchmark/,
     2008.

[10] Robert L Grossman and Yunhong Gu. Data mining using high performance clouds: Experimen-
     tal studies using sector and sphere. In Proceedings of The 14th ACM SIGKDD International
     Conference on Knowledge Discovery and Data Mining (KDD 2008). ACM, 2008.

[11] Hadoop. Welcome to Hadoop! hadoop.apache.org/core/, 2008.

[12] Open Cloud Consortium. http://www.opencloudconsortium.org, 2008.

[13] Sector. http://sector.sourceforge.net, 2008.

[14] Will Sobel, Shanti Subramanyam, Akara Sucharitakul, Jimmy Nguyen, Hubert Wong, Arthur
     Klepchukov, Sheetal Patil, Armando Fox, and David Patterson. Cloudstone: Multi-platform,
     multi-language benchmark and measurement tools for web 2.0. In Proceedings of Cloud Com-
     puting and its Applications 2008, 2008.

[15] Thrift. http://incubator.apache.org/thrift/, 2008.


[16] Rich Wolski, Chris Grzegorczyk, Dan Nurmi, et al. Eucalyptus. Retrieved from
     http://eucalyptus.cs.ucsb.edu/, 2008.



