# Cloud Computing Systems

Lin Gu

Hong Kong University of Science and Technology
Sept. 14, 2011
How to effectively compute in a datacenter?

Is MapReduce the best answer to computation in the cloud?
What is the limitation of MapReduce?
How to provide general-purpose parallel processing in DCs?
Program Execution on Web-Scale Data
The MapReduce Approach

• MapReduce—parallel computing for Web-scale
data processing
• Fundamental component in Google’s
technological architecture
– Why didn’t Google use parallel Fortran, MPI, …?
• Followed by many technology firms
MapReduce
• Map and Fold

– Map: do something to all elements in a list

– Fold: aggregate elements of a list

• Used in functional programming languages
such as Lisp

Old ideas can be fabulous, too!

( = Lisp “Lost In Silly Parentheses”) ?
MapReduce

• Map is a higher-order function: apply an op to all
elements in a list
– Result is a new list
• Parallelizable

(map (lambda (x) (* x x)) '(1 2 3 4 5))
 => '(1 4 9 16 25)
Program Execution on Web-Scale Data
The MapReduce Approach

• Reduce is also a higher-order function
• Like “fold”: aggregate elements of a list
–   Accumulator set to initial value
–   Function applied to list element and the accumulator
–   Result stored in the accumulator
–   Repeated for every item in the list
–   Result is the final value in the accumulator

(fold + 0 '(1 2 3 4 5))
 => 15
(fold * 1 '(1 2 3 4 5))
 => 120
Program Execution on Web-Scale Data
The MapReduce Approach

Massively parallel processing made simple
• Example: word count
• Map: parse a document and generate <word, 1> pairs
• Reduce: receive all pairs for a specific word, and count
(sum)

Map (D is a document):
  for each word w in D
    output <w, 1>

Reduce (for key w):
  count = 0
  for each input item
    count = count + 1
  output <w, count>
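As a concrete illustration, here is a minimal, self-contained sketch of the same word-count logic in Python. The function names (map_doc, reduce_word) and the in-memory grouping step are illustrative assumptions, not part of Google's MapReduce or Hadoop APIs.

```python
from collections import defaultdict

def map_doc(doc):
    # Map: emit a <word, 1> pair for every word in the document.
    for word in doc.split():
        yield (word, 1)

def reduce_word(word, counts):
    # Reduce: sum all counts received for one word.
    return (word, sum(counts))

def word_count(documents):
    # "Shuffle": group intermediate pairs by key (done by the framework in a real deployment).
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_doc(doc):
            groups[word].append(count)
    return dict(reduce_word(w, c) for w, c in groups.items())

print(word_count(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```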
Design Context
• Big data, but simple dependence
– Relatively easy to partition data
• Supported by a distributed system
– Distributed OS services across thousands of
commodity PCs (e.g., GFS)
• First users are search oriented
– Crawl, index, search

Designed years ago, still working today, with growing adoption
Workflow
[Figure: a single master node coordinating numerous worker threads]

Single master, numerous worker threads
Workflow
• 1. The MapReduce library in the user program first
splits the input files into M pieces of typically 16
megabytes to 64 megabytes (MB) per piece. It then
starts up many copies of the program on a cluster of
machines.
• 2. One of the copies of the program is the master. The
rest are workers that are assigned work by the master.
There are M map tasks and R reduce tasks to assign.
The master picks idle workers and assigns each one a
map task or a reduce task.
Workflow
• 3. A worker who is assigned a map task reads the
contents of the corresponding input split. It parses
key/value pairs out of the input data and passes each
pair to the user-defined Map function. The
intermediate key/value pairs produced by the Map
function are buffered in memory.
• 4. Periodically, the buffered pairs are written to local
disk, partitioned into R regions by the partitioning
function. The locations of these buffered pairs on the
local disk are passed back to the master, who is
responsible for forwarding these locations to the
reduce workers.
Workflow
• 5. When a reduce worker is notified by the master about
these locations, it uses RPCs to read the buffered data
from the local disks of the map workers. When a reduce
worker has read all intermediate data, it sorts it by the
intermediate keys so that all occurrences of the same key
are grouped together.
• 6. The reduce worker iterates over the sorted
intermediate data and for each unique intermediate key
encountered, it passes the key and the corresponding set
of intermediate values to the Reduce function. The output
of the Reduce function is appended to a final output file
for this reduce partition.
• 7. When all map tasks and reduce tasks have been completed, the MapReduce call returns to the user code.
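To make the M-map-task / R-reduce-task flow above concrete, here is a minimal single-process simulation of steps 1-7: split, map, partition into R regions, group by key, and reduce. Everything in it (the hash partitioner, the in-memory "regions" list) is a simplifying assumption; a real deployment buffers intermediate pairs on local disk and moves them with RPCs.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R=3):
    # Steps 3-4: each map task emits key/value pairs, partitioned into R regions
    # by a partitioning function (here simply hash(key) mod R).
    regions = [defaultdict(list) for _ in range(R)]
    for split in inputs:                      # one input split per map task
        for key, value in map_fn(split):
            regions[hash(key) % R][key].append(value)

    # Steps 5-6: each reduce task reads one region, walks the keys in sorted
    # order, and appends results to its own output partition.
    outputs = []
    for r in range(R):
        part = []
        for key in sorted(regions[r]):
            part.append(reduce_fn(key, regions[r][key]))
        outputs.append(part)
    return outputs                            # step 7: results returned to the caller

# Example: word count over two input splits.
splits = ["to be or not to be", "to map and to reduce"]
mapper = lambda doc: ((w, 1) for w in doc.split())
reducer = lambda w, vals: (w, sum(vals))
print(run_mapreduce(splits, mapper, reducer))
```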
Programming

• How to write a MapReduce program to
– Generate inverted indices?
– Sort?
• How to express more sophisticated logic?
• What if some workers (slaves) or the master fail?
Workflow
[Figure: initial data split into 64MB blocks; map workers compute and store results locally; the master is informed of result locations; R reducers retrieve data from the mappers; final output written]

Where is the communication-intensive part?
Data Storage – Key-Value Store

• Distributed, scalable storage for key-value pairs
• Example: Dynamo (Amazon)

• Another example may be P2P storage (e.g., Chord)

• Key-value store can be a general foundation for more
complex data structures
• But performance may suffer
Data Storage – Key-Value Store
Dynamo: a decentralized, scalable key-value store
– Used in Amazon
– Uses consistent hashing to distribute data among nodes (a minimal sketch follows)
– Replication, versioning, and load balancing
– Easy-to-use interface: put()/get()
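The sketch below illustrates the consistent-hashing idea behind this kind of key-value store: nodes and keys are hashed onto the same ring, and each key is stored on the first node clockwise from its hash. This is a generic illustration, not Dynamo's actual implementation (which adds virtual nodes, preference lists, and N-way replication).

```python
import hashlib
from bisect import bisect_right

def ring_hash(s):
    # Map any string onto a fixed-size hash ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Place each node at its hash position on the ring.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        # A key is owned by the first node clockwise from its hash position.
        points = [h for h, _ in self.ring]
        i = bisect_right(points, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["nodeA", "nodeB", "nodeC"])
print(ring.node_for("user:42"), ring.node_for("cart:7"))
```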
Data Storage – Network Block Device
• Networked block storage
– ND by SUN Microsystems
• Remote block storage over Internet
– Use S3 as a block device [Brantner]
• Block-level remote storage may become slow in
networks with long latencies
Data Storage – Traditional File Systems
• PC file systems
• Link together all clusters of a file
– Directory entry: filename, attributes, date/time,
starting cluster, file size
• Boot sector (superblock) : file system wide
information
• File allocation table, root directory, …

Boot sector | FAT 1 | FAT 2 (duplicate) | Root directory | Normal directories and files
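To show how "linking together all clusters of a file" works, here is a tiny sketch of following a FAT chain: the directory entry supplies the starting cluster, and each FAT entry points to the next cluster of the file until an end-of-chain marker. The table contents and the end-of-chain value are made-up illustrative values, not a real on-disk FAT.

```python
EOC = 0xFFF  # illustrative end-of-chain marker

# fat[c] gives the cluster that follows cluster c (toy values, not a real FAT).
fat = {2: 5, 5: 6, 6: 9, 9: EOC}

def file_clusters(start_cluster):
    # Walk the chain starting from the directory entry's starting cluster.
    chain = []
    c = start_cluster
    while c != EOC:
        chain.append(c)
        c = fat[c]
    return chain

print(file_clusters(2))   # [2, 5, 6, 9]
```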
Data Storage – Network File System
• NFS—Network File System [Sandberg]
– Designed by SUN Microsystems in the 1980’s
• Transparent remote access to files stored
remotely
– XDR, RPC, VNode, VFS
– Mountable file system, synchronous behavior
• Stateless server
Data Storage – Network File System
NFS organization
[Figure: client-side and server-side NFS organization]
Data Storage – Google File System (GFS)

• A distributed file system at work (GFS)
• Single master and numerous slaves communicate with each other

• File data unit, “chunk”, is up to 64MB. Chunks are replicated.

• The master is a single point of failure and a scalability bottleneck; the consistency model is difficult to use
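A minimal sketch of the chunk arithmetic a GFS-style client performs: a byte offset in a file is translated into a chunk index and an offset within that chunk, and the chunk index (plus the file name) is what the client asks the master to resolve into chunkserver locations. The 64 MB constant matches the slide; the function name is illustrative.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as on the slide

def locate(byte_offset):
    # Translate a file offset into (chunk index, offset within that chunk).
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# A read at byte 200,000,000 falls in chunk 2 of the file.
print(locate(200_000_000))   # (2, 65782272)
```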
Data Storage – Database
PNUTS – a relational database service
– Parallel database with replication
– Structured schema; indexes and views
– Designed and used by Yahoo!

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)

[Figure: a table of records replicated and served from multiple regions]
MapReduce/Hadoop
• Around 2004, Google invented MapReduce to
parallelize computation of large data sets. It’s been a
key component in Google’s technology foundation
• Around 2008, Yahoo! developed the open-source
variant of MapReduce named Hadoop
• After 2008, MapReduce/Hadoop became a key technology component in cloud computing

[Timeline: MapReduce (Google) → Hadoop (Yahoo!) → Hadoop and its variants]

• In 2010, the U.S. granted Google a patent on MapReduce
MapReduce—Limitations
• MapReduce provides an easy-to-use framework for parallel
programming, but is it the most efficient and best solution to
program execution in datacenters?
• MapReduce has its discontents
– DeWitt and Stonebraker: “MapReduce: A major step backwards” –
MapReduce is far less sophisticated and efficient than parallel query
processing
• MapReduce is a parallel processing framework, not a database
system, nor a query language
– It is possible to use MapReduce to implement some of the parallel query
processing functions
– What are the real limitations?
• Inefficient for general programming (and not designed for that)
– Hard to handle data with complex dependence, frequent updates, etc.
– High overhead, bursty I/O, difficult to handle long streaming data
– Limited opportunity for optimization
Critiques
MapReduce: A major step backwards
-- David J. DeWitt and Michael Stonebraker

(MapReduce) is
– A giant step backward in the programming paradigm for large-
scale data intensive applications
– A sub-optimal implementation, in that it uses brute force
instead of indexing
– Not novel at all
– Missing features
– Incompatible with all of the tools DBMS users have come to
depend on
MapReduce—Limitations

• Inefficient for general programming (and not designed
for that)
– Hard to handle data with complex dependence, frequent
updates, etc.
– High overhead, bursty I/O
• Experience with developing a Hadoop-based distributed
compiler
– Workload: compile Linux kernel
– 4 machines available to Hadoop for parallel compiling
– Observation: parallel compiling on 4 nodes with Hadoop can
be even slower than sequential compiling on one node
Re-thinking MapReduce
• Proprietary solution developed in an environment with
one prevailing application (web search)
– The assumptions introduce several important constraints in
data and logic
– Not a general-purpose parallel execution technology
• Design choices in MapReduce
– Optimizes for throughput rather than latency
– Optimizes for large data set rather than small data structures
– Optimizes for coarse-grained parallelism rather than fine-
grained
MRlite: Lightweight Parallel Processing
• A lightweight parallelization framework following the
MapReduce paradigm
– Implemented in C++
– More than just an efficient implementation of MapReduce
– Goal: a lightweight “parallelization” service that programs
can invoke during execution
• MRlite follows several principles
– Memory is media—avoid touching hard drives
– Static facility for dynamic utility—use and reuse threads
for map tasks
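A minimal sketch of the "static facility for dynamic utility" idea: create a fixed pool of worker threads once and reuse them for successive map tasks, instead of paying start-up cost per task. This uses Python's standard thread pool purely for illustration; MRlite itself is a C++ framework and its internals are not shown here.

```python
from concurrent.futures import ThreadPoolExecutor

# Create the worker threads once (the "static facility") ...
pool = ThreadPoolExecutor(max_workers=4)

def run_map_tasks(map_fn, splits):
    # ... and reuse them for every batch of map tasks (the "dynamic utility").
    return list(pool.map(map_fn, splits))

# Two successive jobs reuse the same threads; no per-job start-up cost.
print(run_map_tasks(lambda s: s.upper(), ["ab", "cd"]))
print(run_map_tasks(len, ["hello", "world", "!"]))
```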
MRlite: Towards Lightweight, Scalable, and General Parallel Processing

[Architecture figure]
• The MRlite master accepts jobs from clients and schedules them to execute on slaves
• Slaves are distributed nodes that accept tasks from the master and execute them
• The MRlite client library, linked with the application, accepts calls from the app and submits jobs to the master
• High-speed distributed storage stores intermediate files
• Legend: data flow vs. command flow
[Bar chart: execution time (sec) of compiling the Linux kernel, ImageMagick, and Xen tools with gcc on one node, mrcc/Hadoop, and mrcc/MRlite.
Source: Z. Ma and L. Gu. The Limitation of MapReduce: A Probing Case and a Lightweight Solution. CLOUD COMPUTING 2010]
Using MRlite, the parallel compilation tool, mrcc, runs about 10 times faster than it does on Hadoop!
Inside MapReduce-Style Computation

Network activities under MapReduce/Hadoop workload
• Hadoop: open-source implementation of MapReduce
• Processing data with 3 servers (20 cores)
– 116.8GB input data
• Network activities captured with Xen virtual
machines
Workflow
[Figure: initial data split into 64MB blocks; map workers compute and store results locally; the master is informed of result locations; R reducers retrieve data from the mappers; final output written]

Where is the communication-intensive part?
Inside MapReduce
• Packet reception under MapReduce/Hadoop workload
– Large data volume
– Bursty network traffic
• Generality—widely observed in MapReduce workloads

[Figure: packet reception on a slave server]
Inside MapReduce
Packet reception on the master server
Inside MapReduce
Packet transmission on the master server
Datacenter Networking

Major Components of a Datacenter

• Computing hardware (equipment racks)

• Power supply and distribution hardware

• Cooling hardware and cooling fluid
distribution hardware

• Network infrastructure

• IT Personnel and office equipment
Datacenter Networking
Growth Trends in Datacenters
• Load on network & servers continues to rapidly grow
– Rapid growth: a rough estimate of annual growth rate:
enterprise data centers: ~35%, Internet data centers: 50% -
100%
– Information access anywhere, anytime, from many devices
• Desktops, laptops, PDAs & smart phones, sensor
networks, proliferation of broadband
• Mainstream servers moving towards higher speed links
– 1-GbE to 10-GbE in 2008-2009
– 10-GbE to 40-GbE in 2010-2012
• High-speed datacenter-MAN/WAN connectivity
– High-speed datacenter syncing for disaster recovery
Datacenter Networking
• A large part of the total cost of the DC hardware
– Large routers and high-bandwidth switches are very
expensive
• Relatively unreliable – many components may fail.
• Many major operators and companies design their
own datacenter networking to save money and
improve reliability/scalability/performance.
– The topology is often known
– The number of nodes is limited
– The protocols used in the DC are known
• Security is simpler inside the data center, but
challenging at the border
• We can distribute applications to servers to distribute
load and minimize hot spots
Datacenter Networking
Networking components (examples)

• Highly scalable DC border routers
  – 3.2 Tbps capacity in a single chassis
  – 64 10-GE ports upstream; 768 1-GE ports downstream
  – 10 million routes, 1 million in hardware
  – 2,000 BGP peers
  – 2K L3 VPNs, 16K L2 VPNs
• High-performance, high-density switches and routers
  – Scaling to 512 10GbE ports per chassis
  – High port density for GE and 10GE application connectivity
  – Security
  – No need for proprietary protocols to scale
Datacenter Networking

Common data center topology

• Internet
• Core: Layer-3 routers
• Aggregation: Layer-2/3 switches
• Access: Layer-2 switches
• Servers

(The core, aggregation, and access layers and the servers sit inside the data center.)
Datacenter Networking
Data center network design goals
• High network bandwidth, low latency
• Reduce the need for large switches in the core
• Simplify the software, push complexity to the
edge of the network
• Improve reliability
• Reduce capital and operating cost
Data Center Networking

Avoid this… and simplify this… [figure]

Interconnect

Can we avoid using high-end switches?
• Expensive high-end switches to scale up
• Single point of failure and bandwidth bottleneck
  – Experiences from real systems
• One answer: DCell
Interconnect
DCell Ideas
• #1: Use mini-switches to scale out
• #2: Leverage servers to be part of the routing
infrastructure
– Servers have multiple ports and need to forward
packets
• #3: Use recursion to scale and build complete
graph to increase capacity
Data Center Networking
One approach: switched network with
a hypercube interconnect
• Leaf switch: 40 1-Gbps ports + 2 10-Gbps ports.
– One switch per rack.
– Not replicated (if a switch fails, lose one rack of capacity)
• Core switch: 10 10Gbps ports
– Form a hypercube
• Hypercube – the high-dimensional analogue of a cube
Interconnect

Hypercube properties
•   Minimum hop count
•   Even load distribution for all-to-all communication.
•   Can route around switch/link failures.
•   Simple routing:
– Outport = f(Dest xor NodeNum)
– No routing tables
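A small sketch of the table-free routing rule: XOR the current node number with the destination, pick one differing bit (here, the lowest), and forward out the port for that dimension. Repeating this corrects one bit per hop, which is why the hop count is minimal. The dimension-order choice is an illustrative assumption; any differing bit would do.

```python
def next_hop(node, dest):
    # XOR shows which dimensions still differ between here and the destination.
    diff = node ^ dest
    if diff == 0:
        return None                          # already at the destination
    dim = (diff & -diff).bit_length() - 1    # lowest differing bit = output port
    return dim, node ^ (1 << dim)            # (outport, neighbor reached over it)

# Route from node 3 (0011) to node 12 (1100) in a dimension-4 hypercube.
node = 3
while node != 12:
    port, node = next_hop(node, 12)
    print(f"out port {port} -> node {node}")
# Four hops: one per differing bit of 3 XOR 12.
```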
Interconnect
A 16-node (dimension 4) hypercube
[Figure: 16 nodes numbered 0-15, each with four links; port numbers 0-3 label the four dimensions]
Interconnect
64-switch Hypercube
[Figure: four 4x4 sub-cubes joined by 16 links each; each core switch has 10 10-Gbps ports and 4 links; 63 * 4 links connect to other containers]

How many servers can be connected in this system?

One container: 81,920 servers with 1 Gbps bandwidth
• Level 2: 2 10-port 10 Gb/sec switches (16 10 Gb/sec links)
• Level 1: 8 10-port 10 Gb/sec switches (64 10 Gb/sec links)
• Level 0: 32 40-port 1 Gb/sec switches (1280 Gb/sec links)
• Leaf switch: 40 1-Gbps ports + 2 10-Gbps ports
Data Center Networking
The Black Box
Data Center Network
Shipping Container as Data Center Module
• Data Center Module
– Contains network gear, compute, storage, &
cooling
– Just plug in power, network, & chilled water
• Increased cooling efficiency
– Water & air flow
– Better air flow management
• Meet seasonal load requirements
Data Center Network
Unit of Data Center Growth

• One at a time:
– 1 system
– Racking & networking: 14 hrs ($1,330)
• Rack at a time:
– ~40 systems
– Install & networking: 0.75 hrs ($60)
• Container at a time:
–   ~1,000 systems
–   No packaging to remove
–   No floor space required
–   Power, network, & cooling only
–   Weatherproof & easy to transport
• Data center construction takes 24+
months
Data Center Network
Multiple-Site Redundancy and Enhanced Performance Using Load Balancing

[Figure: multiple datacenters fronted by a load-balancing system and DNS that direct users across a global data center deployment]

Global data center deployment problems
• Handling site failures transparently
• Providing best site selection per user
• Leveraging both DNS and non-DNS methods for multi-site redundancy
• Providing disaster recovery and non-stop operation

LB (load balancing) system
• The load balancing system regulates global data center traffic
• Incorporates site health, load, user proximity, and service response for user site selection
• Provides transparent site failover in case of disaster or service outage
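Below is a toy sketch of the kind of site-selection policy just described: score each healthy site by a weighted combination of load, user proximity, and recent service response time, and fail over transparently by skipping unhealthy sites. The field names and weights are invented for illustration and do not come from any specific load-balancing product.

```python
def pick_site(sites, w_load=0.4, w_rtt=0.4, w_resp=0.2):
    # Consider only healthy sites (transparent failover skips the rest).
    healthy = [s for s in sites if s["healthy"]]
    # Lower score is better: combine load, user proximity (RTT), and response time.
    score = lambda s: w_load * s["load"] + w_rtt * s["rtt_ms"] + w_resp * s["resp_ms"]
    return min(healthy, key=score)["name"]

sites = [
    {"name": "dc-east", "healthy": True,  "load": 0.7, "rtt_ms": 20,  "resp_ms": 35},
    {"name": "dc-west", "healthy": True,  "load": 0.3, "rtt_ms": 80,  "resp_ms": 30},
    {"name": "dc-asia", "healthy": False, "load": 0.1, "rtt_ms": 200, "resp_ms": 25},
]
print(pick_site(sites))   # chooses the best-scoring healthy site
```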
Challenges and Research Problems
Hardware
– High-performance, reliable, cost-effective
computing infrastructure
– Cooling, air cleaning, and energy efficiency

Related work: [Andersen] FAWN, [Barroso] clusters, [Fan] power, [Raghavendra] power
Challenges and Research Problems
System software
– Operating systems
– Compilers
– Database
– Execution engines and containers

Related work: DeCandia (Dynamo), Cooper (PNUTS), Burrows (Chubby), Isard (Quincy), Yu (DryadLINQ), Ghemawat (GFS), Chang (Bigtable), Dean (MapReduce), Brantner (DB on S3)
Challenges and Research Problems

Networking
– Interconnect and global network structuring
– Traffic engineering

Related work: Guo 2008 (DCell), Guo 2009 (BCube), Al-Fares (commodity data center network architecture)
Challenges and Research Problems
• Data and programming
– Data consistency mechanisms (e.g., replications)
– Fault tolerance
– Interfaces and semantics
• Software engineering
• User interface
• Application architecture

Related work: Olston (Pig Latin), Pike (Sawzall), Buyya (IT services)
Resources
•   [Al-Fares] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center
network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data
Communication (Seattle, WA, USA, August 17 - 22, 2008). SIGCOMM '08. 63-74.
http://baijia.info/showthread.php?tid=139
•   [Andersen] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee,
Lawrence Tan, Vijay Vasudevan. FAWN: A Fast Array of Wimpy Nodes. SOSP'09.
http://baijia.info/showthread.php?tid=179
•   [Barroso] Luiz Barroso, Jeffrey Dean, Urs Hoelzle, "Web Search for a Planet: The Google
Cluster Architecture," IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003
http://baijia.info/showthread.php?tid=133
•   [Brantner] Brantner, M., Florescu, D., Graf, D., Kossmann, D., and Kraska, T. Building a
database on S3. In Proceedings of the 2008 ACM SIGMOD international Conference on
Management of Data (Vancouver, Canada, June 09 - 12, 2008). SIGMOD '08. 251-264.
http://baijia.info/showthread.php?tid=125
Resources
•   [Burrows] Burrows, M. The Chubby lock service for loosely-coupled distributed systems.
In Proceedings of the 7th Symposium on Operating Systems Design and Implementation
(Seattle, Washington, November 06 - 08, 2006). 335-350.
http://baijia.info/showthread.php?tid=59
•   [Buyya] Buyya, R. Chee Shin Yeo Venugopal, S. Market-Oriented Cloud Computing. The
10th IEEE International Conference on High Performance Computing and
Communications, 2008. HPCC '08. http://baijia.info/showthread.php?tid=248
•   [Chang] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M.,
Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for
structured data. In Proceedings of the 7th Symposium on Operating Systems Design and
Implementation (Seattle, Washington, November 06 - 08, 2006). 205-218.
http://baijia.info/showthread.php?tid=4
•   [Cooper] Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P.,
Jacobsen, H., Puz, N., Weaver, D., and Yerneni, R. PNUTS: Yahoo!'s hosted data serving
platform. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1277-1288.
http://baijia.info/showthread.php?tid=126
Resources
•   [Dean] Dean, J. and Ghemawat, S. 2004. MapReduce: simplified data processing on large
clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems
Design & Implementation - Volume 6 (San Francisco, CA, December 06 - 08, 2004).
http://baijia.info/showthread.php?tid=2
•   [DeCandia] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A.,
Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: amazon's
highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium
on Operating Systems Principles (Stevenson, Washington, USA, October 14 - 17, 2007).
SOSP '07. ACM, New York, NY, 205-220. http://baijia.info/showthread.php?tid=120
•   [Fan] Fan, X., Weber, W., and Barroso, L. A. Power provisioning for a warehouse-sized
computer. In Proceedings of the 34th Annual international Symposium on Computer
Architecture (San Diego, California, USA, June 09 - 13, 2007). ISCA '07. 13-23.
http://baijia.info/showthread.php?tid=144
Resources
•   [Ghemawat] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In
Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton
Landing, NY, USA, October 19 - 22, 2003). SOSP '03. ACM, New York, NY, 29-43.
http://baijia.info/showthread.php?tid=1
•   [Guo 2008] Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and
Songwu Lu, DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers, in
ACM SIGCOMM 08. http://baijia.info/showthread.php?tid=142
•   [Guo 2009] Chuanxiong Guo, Guohan Lu, Dan Li, Xuan Zhang, Haitao Wu, Yunfeng Shi,
Chen Tian, Yongguang Zhang, and Songwu Lu, BCube: A High Performance, Server-
centric Network Architecture for Modular Data Centers, in ACM SIGCOMM 09.
http://baijia.info/showthread.php?tid=141
•   [Isard] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar and
Andrew Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. SOSP'09.
http://baijia.info/showthread.php?tid=203
Resources
•   [Olston] Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. 2008. Pig Latin: a
not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD
international Conference on Management of Data (Vancouver, Canada, June 09 - 12,
2008). SIGMOD '08. 1099-1110. http://baijia.info/showthread.php?tid=124
•   [Pike] Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. 2005. Interpreting the data:
Parallel analysis with Sawzall. Sci. Program. 13, 4 (Oct. 2005), 277-298.
http://baijia.info/showthread.php?tid=60
•   [Raghavendra] Ramya Raghavendra, Parthasarathy Ranganathan, Vanish Talwar, Zhikui
Wang, Xiaoyun Zhu. No "Power" Struggles: Coordinated Multi-level Power Management
for the Data Center. In Proceedings of the International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS), Seattle, WA,
March 2008. http://baijia.info/showthread.php?tid=183
•   [Yu] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey.
DryadLINQ: A system for general-purpose distributed data-parallel computing using a
high-level language. In Proceedings of the 8th Symposium on Operating Systems Design
and Implementation (OSDI), December 8-10 2008.
http://baijia.info/showthread.php?tid=5
Thank you!

Questions?
