Cloud Computing Systems

Lin Gu

Hong Kong University of Science and Technology
Sept. 14, 2011
How to effectively compute in a datacenter?

• Is MapReduce the best answer to computation in the cloud?
• What are the limitations of MapReduce?
• How to provide general-purpose parallel processing in DCs?
Program Execution on Web-Scale Data
                            The MapReduce Approach

• MapReduce—parallel computing for Web-scale
  data processing
• Fundamental component in Google’s
  technological architecture
  – Why didn’t Google use parallel Fortran, MPI, …?
• Followed by many technology firms
• Map and Fold

  – Map: do something to all elements in a list

  – Fold: aggregate elements of a list

• Used in functional programming languages
  such as Lisp

      Old ideas can be fabulous, too!

      (Lisp = "Lost In Silly Parentheses"?)

• Map is a higher-order function: apply an op to all
  elements in a list
  – Result is a new list
• Parallelizable

    (map (lambda (x) (* x x)) '(1 2 3 4 5))
    => '(1 4 9 16 25)
  Program Execution on Web-Scale Data
                                             The MapReduce Approach

  • Reduce is also a higher-order function
  • Like “fold”: aggregate elements of a list
       –   Accumulator set to initial value
       –   Function applied to list element and the accumulator
       –   Result stored in the accumulator
       –   Repeated for every item in the list
       –   Result is the final value in the accumulator

    (fold + 0 '(1 2 3 4 5))
    => 15
    (fold * 1 '(1 2 3 4 5))
    => 120

[Figure: fold chains f across the list elements, carrying the
accumulator from the initial value to the final result]
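For comparison, the same two folds in Python, using functools.reduce as the accumulator loop described above (a sketch, not from the slides):

    from functools import reduce

    # the accumulator starts at the initial value; the function is
    # applied to each list element and the accumulator in turn
    print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0))   # 15
    print(reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1))   # 120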
Program Execution on Web-Scale Data
                              The MapReduce Approach

Massively parallel processing made simple
• Example: word count
• Map: parse a document and generate <word, 1> pairs
• Reduce: receive all pairs for a specific word, and count

    Map:                         // D is a document
      for each word w in D
        output <w, 1>

    Reduce:                      // for key w
      count = 0
      for each input item
        count = count + 1
      output <w, count>
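The pseudocode above can be exercised end to end in a few lines of Python. This single-process sketch is our own (a real MapReduce run spreads the map and reduce tasks across many machines); it groups the <w, 1> pairs by key before counting:

    from collections import defaultdict

    def map_phase(document):
        # parse a document and generate <word, 1> pairs
        for word in document.split():
            yield (word, 1)

    def reduce_phase(pairs):
        # receive all pairs for a specific word, and count
        counts = defaultdict(int)
        for word, one in pairs:
            counts[word] += one
        return dict(counts)

    print(reduce_phase(map_phase("the quick fox jumps over the lazy fox")))
    # {'the': 2, 'quick': 1, 'fox': 2, 'jumps': 1, 'over': 1, 'lazy': 1}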
                    Design Context
• Big data, but simple dependence
  – Relatively easy to partition data
• Supported by a distributed system
  – Distributed OS services across thousands of
    commodity PCs (e.g., GFS)
• First users are search oriented
  – Crawl, index, search

Designed years ago, still working today, with growing adoption
[Figure: a single master node coordinating numerous worker threads]
• 1. The MapReduce library in the user program first
  splits the input files into M pieces of typically 16
  megabytes to 64 megabytes (MB) per piece. It then
  starts up many copies of the program on a cluster of
  machines.
• 2. One of the copies of the program is the master. The
  rest are workers that are assigned work by the master.
  There are M map tasks and R reduce tasks to assign.
  The master picks idle workers and assigns each one a
  map task or a reduce task.
• 3. A worker who is assigned a map task reads the
  contents of the corresponding input split. It parses
  key/value pairs out of the input data and passes each
  pair to the user-defined Map function. The
  intermediate key/value pairs produced by the Map
  function are buffered in memory.
• 4. Periodically, the buffered pairs are written to local
  disk, partitioned into R regions by the partitioning
  function. The locations of these buffered pairs on the
  local disk are passed back to the master, who is
  responsible for forwarding these locations to the
  reduce workers.
• 5. When a reduce worker is notified by the master about
  these locations, it uses RPCs to read the buffered data
  from the local disks of the map workers. When a reduce
  worker has read all intermediate data, it sorts it by the
  intermediate keys so that all occurrences of the same key
  are grouped together.
• 6. The reduce worker iterates over the sorted
  intermediate data and for each unique intermediate key
  encountered, it passes the key and the corresponding set
  of intermediate values to the Reduce function. The output
  of the Reduce function is appended to a final output file
  for this reduce partition.
• 7. When all map tasks and reduce tasks have been
  completed, the MapReduce call returns to the user code.
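Steps 1 through 7 can be mimicked in miniature. The toy sketch below is our own (Python's built-in hash stands in for the partitioning function): it splits the input into M pieces, partitions intermediate pairs into R regions, and sorts each region by key before reducing.

    from collections import defaultdict

    M, R = 3, 2
    splits = ["a b a", "b c", "c a"]        # M input splits (step 1)

    # steps 3-4: map tasks emit pairs, partitioned into R regions
    # by a hash partitioning function
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for word in split.split():
            regions[hash(word) % R][word].append(1)

    # steps 5-6: each reduce worker sorts its region by key, then
    # passes each key and its value list to the Reduce function
    for r in range(R):
        for key in sorted(regions[r]):
            print(key, sum(regions[r][key]))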

• How to write a MapReduce program to
  – Generate inverted indices? (one sketch follows below)
  – Sort?
• How to express more sophisticated computation?
• What if some workers (slaves) or the
  master fail?
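For the inverted-index question, one possible shape of the answer (a sketch under our own naming, not code from the lecture): Map emits <word, doc_id> for every word occurrence, and Reduce collects the posting list for each word.

    from collections import defaultdict

    docs = {1: "cloud computing systems", 2: "computing in the cloud"}

    postings = defaultdict(set)
    for doc_id, text in docs.items():      # Map: output <word, doc_id>
        for word in text.split():
            postings[word].add(doc_id)     # Reduce: merge ids per word

    print({w: sorted(ids) for w, ids in postings.items()})
    # {'cloud': [1, 2], 'computing': [1, 2], 'systems': [1],
    #  'in': [2], 'the': [2]}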
[Figure: MapReduce dataflow: initial data is split into 64MB blocks;
map results are computed and stored locally; the master is informed of
result locations; R reducers retrieve data from the mappers; final
output is written]

Where is the communication-intensive part?
            Data Storage – Key-Value Store

• Distributed, scalable storage for key-value pairs
   • Example: Dynamo (Amazon)

   • Another example may be P2P storage (e.g., Chord)

• Key-value store can be a general foundation for more
  complex data structures
   • But performance may suffer
            Data Storage – Key-Value Store
Dynamo: a decentralized, scalable key-value store
   – Used in Amazon
   – Uses consistent hashing to distribute data
     among nodes
   – Replication, versioning, load balancing
   – Easy-to-use interface: put()/get()
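A minimal consistent-hashing sketch in the spirit of Dynamo's put()/get() interface (the node names, MD5 hash, and ring layout below are our illustration, not Dynamo's actual implementation):

    import bisect, hashlib

    def h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes):
            # each node owns the arc of the hash ring preceding it
            self.points = sorted((h(n), n) for n in nodes)

        def owner(self, key):
            # first node clockwise from the key's hash position
            i = bisect.bisect(self.points, (h(key), ""))
            return self.points[i % len(self.points)][1]

    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.owner("user:42"))
    # adding or removing a node remaps only the keys on its arc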
       Data Storage – Network Block Device
• Networked block storage
  – ND by Sun Microsystems
• Remote block storage over Internet
  – Use S3 as a block device [Brantner]
• Block-level remote storage may become slow in
  networks with long latencies
          Data Storage – Traditional File Systems
• PC file systems
• Link together all clusters of a file
   – Directory entry: filename, attributes, date/time,
     starting cluster, file size
• Boot sector (superblock): file-system-wide information
• File allocation table, root directory, …

    Boot | FAT 1 | FAT 2 (dup) | Root dir | Normal directories and files
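Reading a file then amounts to following its cluster chain through the FAT, starting from the directory entry's starting cluster. A hedged sketch (the table values are made up; real FAT entries use a reserved end-of-chain marker rather than None):

    FAT = {2: 5, 5: 6, 6: None}     # cluster -> next cluster of the file

    def clusters_of(start):
        c = start
        while c is not None:         # walk the linked clusters
            yield c
            c = FAT[c]

    print(list(clusters_of(2)))      # [2, 5, 6]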
        Data Storage – Network File System
• NFS—Network File System [Sandberg]
  – Designed by Sun Microsystems in the 1980s
• Transparent remote access to stored files
  – XDR, RPC, VNode, VFS
  – Mountable file system, synchronous behavior
• Stateless server
           Data Storage – Network File System
NFS organization

[Figure: NFS client-server organization]
       Data Storage – Google File System (GFS)

• A distributed file system at work (GFS)
   • A single master and numerous slaves communicate with each other
   • The file data unit, a "chunk", is up to 64MB. Chunks are replicated.
• The "master" is a single point of failure and a scalability
  bottleneck, and the consistency model is difficult to use
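In a GFS-like design, a client turns a byte offset into a chunk index, asks the master only for that chunk's handle and replica locations, and then reads the data directly from a chunkserver. A sketch of the offset arithmetic (our own; the chunk size is from the slide):

    CHUNK = 64 * 2**20                  # 64 MB chunk size

    def chunk_index(offset):
        # which chunk of the file holds this byte offset?
        return offset // CHUNK

    print(chunk_index(200 * 2**20))     # byte 200 MB falls in chunk 3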
Data Storage – Database
PNUTS – a relational database service, designed and used by Yahoo!

    CREATE TABLE Parts (
       ID VARCHAR,
       StockNumber INT,
       Status VARCHAR,
       …
    )

[Figure: a parallel database with a structured schema, indexes and
views, and replication of rows (e.g., A 42342 E, B 42521 W, …)
across sites]

• Around 2004, Google invented MapReduce to
  parallelize computation over large data sets. It has been a
  key component in Google's technology foundation
• Around 2008, Yahoo! developed the open-source
  variant of MapReduce named Hadoop
• After 2008, MapReduce/Hadoop became a key
  technology component in cloud computing

  MapReduce -> Hadoop -> … Hadoop or variants …

• In 2010, the U.S. conferred the MapReduce patent to Google
• MapReduce provides an easy-to-use framework for parallel
  programming, but is it the most efficient and best solution to
  program execution in datacenters?
• MapReduce has its discontents
   – DeWitt and Stonebraker: "MapReduce: A major step backwards" –
     MapReduce is far less sophisticated and efficient than parallel
     query processing
• MapReduce is a parallel processing framework, not a database
  system, nor a query language
   – It is possible to use MapReduce to implement some of the parallel query
     processing functions
   – What are the real limitations?
• Inefficient for general programming (and not designed for that)
   – Hard to handle data with complex dependence, frequent updates, etc.
   – High overhead, bursty I/O, difficult to handle long streaming data
   – Limited opportunity for optimization
MapReduce: A major step backwards
                           -- David J. DeWitt and Michael Stonebraker

(MapReduce) is
   – A giant step backward in the programming paradigm for large-
     scale data intensive applications
   – A sub-optimal implementation, in that it uses brute force
     instead of indexing
   – Not novel at all
   – Missing features
   – Incompatible with all of the tools DBMS users have come to
     depend on

• Inefficient for general programming (and not designed
  for that)
   – Hard to handle data with complex dependence, frequent
     updates, etc.
   – High overhead, bursty I/O
• Experience with developing a Hadoop-based distributed
  compilation tool (mrcc)
   – Workload: compile the Linux kernel
   – 4 machines available to Hadoop for parallel compiling
   – Observation: parallel compiling on 4 nodes with Hadoop can
     be even slower than sequential compiling on one node
Re-thinking MapReduce
• Proprietary solution developed in an environment with
  one prevailing application (web search)
   – The assumptions introduce several important constraints in
     data and logic
   – Not a general-purpose parallel execution technology
• Design choices in MapReduce
   – Optimizes for throughput rather than latency
   – Optimizes for large data set rather than small data structures
   – Optimizes for coarse-grained parallelism rather than
     fine-grained parallelism
MRlite: Lightweight Parallel Processing
 • A lightweight parallelization framework following the
   MapReduce paradigm
    – Implemented in C++
    – More than just an efficient implementation of MapReduce
    – Goal: a lightweight “parallelization” service that programs
      can invoke during execution
 • MRlite follows several principles
    – Memory is media—avoid touching hard drives
    – Static facility for dynamic utility—use and reuse threads
      for map tasks
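MRlite itself is C++, but the "use and reuse threads" principle is easy to illustrate: keep one long-lived worker pool and feed successive map tasks to it, instead of paying thread startup cost per job. A Python sketch of the idea (ours, not MRlite code):

    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=8)   # created once, reused

    def parallel_map(fn, items):
        # successive jobs reuse the same threads (the static facility)
        return list(pool.map(fn, items))

    print(parallel_map(lambda x: x * x, range(5)))   # [0, 1, 4, 9, 16]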
MRlite: Towards Lightweight, Scalable, and
General Parallel Processing
[Figure: MRlite architecture]
• The MRlite master accepts jobs from clients and schedules them to
  execute on slaves
• Distributed nodes (slaves) accept tasks from the master and
  execute them
• Linked together with the app, the MRlite client library accepts
  calls from the app and submits jobs to the master
• High-speed distributed storage stores intermediate files
• The figure distinguishes data flow from command flow

[Bar chart: execution time (sec) for compiling the Linux kernel,
ImageMagick, and Xen tools with gcc on one node and with mrcc/Hadoop.
Source: Z. Ma and L. Gu. The Limitation of MapReduce: a Probing Case
and a Lightweight Solution. CLOUD COMPUTING 2010]

Using MRlite, the parallel compilation job, mrcc, is 10
times faster than when running on Hadoop!
Inside MapReduce-Style Computation

Network activities under MapReduce/Hadoop workload
• Hadoop: open-source implementation of MapReduce
• Processing data with 3 servers (20 cores)
   – 116.8GB input data
• Network activities captured with Xen virtual machines
[Figure (repeated): MapReduce dataflow: initial data is split into
64MB blocks; map results are computed and stored locally; the master
is informed of result locations; R reducers retrieve data from the
mappers; final output is written]

Where is the communication-intensive part?
Inside MapReduce
• Packet reception under MapReduce/Hadoop workload
    – Large data volume
    – Bursty network traffic
• Generality—widely observed in MapReduce workloads

[Figure: packet reception on a slave server]
Inside MapReduce
[Figure: packet reception on the master server]
Inside MapReduce
[Figure: packet transmission on the master server]
                Datacenter Networking

Major Components of a Datacenter

 • Computing hardware (equipment racks)

 • Power supply and distribution hardware

 • Cooling hardware and cooling fluid
   distribution hardware

 • Network infrastructure

 • IT Personnel and office equipment
                 Datacenter Networking
Growth Trends in Datacenters
• Load on networks & servers continues to grow rapidly
  – A rough estimate of annual growth rate:
    enterprise data centers ~35%, Internet data centers ~50%
  – Information access anywhere, anytime, from many devices
     • Desktops, laptops, PDAs & smart phones, sensor
       networks, proliferation of broadband
• Mainstream servers moving towards higher-speed links
  – 1 GbE to 10 GbE in 2008-2009
  – 10 GbE to 40 GbE in 2010-2012
• High-speed datacenter-MAN/WAN connectivity
  – High-speed datacenter syncing for disaster recovery
                Datacenter Networking
• A large part of the total cost of the DC hardware
   – Large routers and high-bandwidth switches are very expensive
• Relatively unreliable – many components may fail.
• Many major operators and companies design their
  own datacenter networking to save money and
  improve reliability/scalability/performance.
   – The topology is often known
   – The number of nodes is limited
   – The protocols used in the DC are known
• Security is simpler inside the data center, but
  challenging at the border
• We can distribute applications to servers to distribute
  load and minimize hot spots
Datacenter Networking
Networking components (examples)

• Highly scalable DC border routers
   – 3.2 Tbps capacity in a single chassis
   – 64 10-GE ports upstream, 768 1-GE ports downstream
   – 10 million routes, 2,000 BGP peers
   – 2K L3 VPNs, 16K L2 VPNs
   – High port density for GE and 10GE application connectivity
   – Security
• High-performance, high-density switches & routers
   – Scaling to 512 10GbE ports per chassis
   – No need for proprietary protocols to scale
              Datacenter Networking

Common data center topology

  Core:        layer-3 routers
  Aggregation: layer-2/3 switches
  Access:      layer-2 switches
             Datacenter Networking
Data center network design goals
• High network bandwidth, low latency
• Reduce the need for large switches in the core
• Simplify the software, push complexity to the
  edge of the network
• Improve reliability
• Reduce capital and operating cost
      Data Center Networking

Avoid this… and simplify this…

Can we avoid using high-end switches?
• Expensive high-end switches to scale up
• Single point of failure and bandwidth bottleneck
   – Experiences from real systems
• One answer: DCell
DCell Ideas
• #1: Use mini-switches to scale out
• #2: Leverage servers to be part of the routing
  – Servers have multiple ports and need to forward packets
• #3: Use recursion to scale and build a complete
  graph to increase capacity
Data Center Networking
One approach: a switched network with
a hypercube interconnect
• Leaf switch: 40 1Gbps ports + 2 10Gbps ports
   – One switch per rack
   – Not replicated (if a switch fails, lose one rack of capacity)
• Core switch: 10 10Gbps ports
   – Forms a hypercube
• Hypercube – the high-dimensional analogue of a cube

Hypercube properties
•   Minimum hop count
•   Even load distribution for all-to-all communication
•   Can route around switch/link failures
•   Simple routing (sketched below):
    – Outport = f(Dest xor NodeNum)
    – No routing tables
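One concrete choice of f is to fix the lowest differing dimension on each hop; if crossing port p flips bit p of the node number (our assumption for this sketch of the standard rule), every hop reduces the XOR distance by one:

    def outport(node, dest):
        diff = node ^ dest
        if diff == 0:
            return None                          # arrived
        return (diff & -diff).bit_length() - 1   # lowest set bit = dimension

    node, dest = 5, 12        # 0101 -> 1100 in a dimension-4 hypercube
    while node != dest:
        node ^= 1 << outport(node, dest)         # cross that port
        print("hop to", node)                    # 4 (0100), then 12 (1100)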
[Figure: a 16-node (dimension-4) hypercube; each node's links are
labeled with port numbers 0-3]
64-switch Hypercube

[Figure: 4x4 sub-cubes joined by 16-link bundles; 63 * 4 links
connect a container to other containers, 4 links each]

One container:
   – Level 2: 2 10-port 10 Gb/sec switches
     (16 10 Gb/sec links)
   – Level 1: 8 10-port 10 Gb/sec switches
     (64 10 Gb/sec links)
   – Level 0: 32 40-port 1 Gb/sec switches
     (1280 Gb/sec links)
   – Leaf switch: 40 1Gbps ports + 2 10Gbps ports
   – Core switch: 10 10Gbps ports

How many servers can be connected in this design?
81920 servers with 1Gbps bandwidth.
Data Center Networking
 The Black Box
                    Data Center Network
    Shipping Container as Data Center Module
• Data Center Module
   – Contains network gear, compute, storage, & cooling
   – Just plug in power, network, & chilled water
• Increased cooling efficiency
   – Water & air flow
   – Better air flow management
• Meets seasonal load requirements
                         Data Center Network
           Unit of Data Center Growth

• One at a time:
   – 1 system
   – Racking & networking: 14 hrs ($1,330)
• Rack at a time:
   – ~40 systems
   – Install & networking: 0.75 hrs ($60)
• Container at a time:
   – ~1,000 systems
   – No packaging to remove
   – No floor space required
   – Power, network, & cooling only
   – Weatherproof & easy to transport
• Data center construction takes 24+ months
                      Data Center Network
Multiple-Site Redundancy and Enhanced
Performance Using Load Balancing

[Figure: a DNS-fronted load-balancing (LB) system directing users
across multiple datacenters]

Global data center deployment problems:
• Handling site failures transparently
• Providing best site selection per user
• Leveraging both DNS and non-DNS methods for multi-site redundancy
• Providing disaster recovery and non-stop service

LB (load balancing) system:
• The load balancing system regulates global data center traffic
• Incorporates site health, load, user proximity, and service response
  for user site selection (a scoring sketch follows below)
• Provides transparent site failover in case of disaster or service
  outage
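The slides name the inputs to site selection but not the formula; a hypothetical scoring sketch (the site data, field names, and weights are ours) shows the shape of the decision:

    sites = [
        {"name": "dc-east", "healthy": True,  "load": 0.9, "rtt_ms": 40},
        {"name": "dc-west", "healthy": True,  "load": 0.3, "rtt_ms": 80},
        {"name": "dc-eu",   "healthy": False, "load": 0.1, "rtt_ms": 150},
    ]

    def pick_site(sites):
        live = [s for s in sites if s["healthy"]]   # transparent failover
        # lower is better: weigh load against user proximity (RTT)
        return min(live, key=lambda s: s["load"] * 100 + s["rtt_ms"])

    print(pick_site(sites)["name"])                 # dc-west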
Challenges and Research Problems
   – High-performance, reliable, cost-effective
     computing infrastructure
   – Cooling, air cleaning, and energy efficiency

   Related work: [Barraso] clusters, [Reghavendra] power, [Fan] power
Challenges and Research Problems
System software
   – Operating systems
   – Compilers
   – Databases
   – Execution engines and containers

   Related work: [Burrows] Chubby, [DeCandia] Dynamo, [Cooper] PNUTS,
   [Isard] Quincy, [Ghemawat] GFS, [Dean] MapReduce, [Brantner] DB on
   S3, [Yu] DryadLINQ
Challenges and Research Problems
   – Interconnect and global network structuring
   – Traffic engineering

   Related work: [Guo 2008] DCell, [Guo 2009] BCube, [Al-Fares]
   commodity DC network architecture
Challenges and Research Problems
• Data and programming
  – Data consistency mechanisms (e.g., replication)
  – Fault tolerance
  – Interfaces and semantics
• Software engineering
• User interface
• Application architecture

   Related work: [Pike] Sawzall, [Olston] Pig, [Buyya] market-oriented
   cloud computing
•   [Al-Fares] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center
    network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data
    Communication (Seattle, WA, USA, August 17 - 22, 2008). SIGCOMM '08. 63-74.
•   [Andersen] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee,
    Lawrence Tan, Vijay Vasudevan. FAWN: A Fast Array of Wimpy Nodes. SOSP'09.
•   [Barraso] Luiz Barroso, Jeffrey Dean, Urs Hoelzle, "Web Search for a Planet: The Google
    Cluster Architecture," IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003
•   [Brantner] Brantner, M., Florescu, D., Graf, D., Kossmann, D., and Kraska, T. Building a
    database on S3. In Proceedings of the 2008 ACM SIGMOD international Conference on
    Management of Data (Vancouver, Canada, June 09 - 12, 2008). SIGMOD '08. 251-264.
•   [Burrows] Burrows, M. The Chubby lock service for loosely-coupled distributed systems.
    In Proceedings of the 7th Symposium on Operating Systems Design and Implementation
    (Seattle, Washington, November 06 - 08, 2006). 335-350.
•   [Buyya] Buyya, R., Yeo, C. S., and Venugopal, S. Market-Oriented Cloud Computing. The
    10th IEEE International Conference on High Performance Computing and
    Communications, 2008. HPCC '08.
•   [Chang] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M.,
    Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for
    structured data. In Proceedings of the 7th Symposium on Operating Systems Design and
    Implementation (Seattle, Washington, November 06 - 08, 2006). 205-218.
•   [Cooper] Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P.,
    Jacobsen, H., Puz, N., Weaver, D., and Yerneni, R. PNUTS: Yahoo!'s hosted data serving
    platform. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1277-1288.
•   [Dean] Dean, J. and Ghemawat, S. 2004. MapReduce: simplified data processing on large
    clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems
    Design & Implementation - Volume 6 (San Francisco, CA, December 06 - 08, 2004).
•   [DeCandia] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A.,
    Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: amazon's
    highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium
    on Operating Systems Principles (Stevenson, Washington, USA, October 14 - 17, 2007).
    SOSP '07. ACM, New York, NY, 205-220.
•   [Fan] Fan, X., Weber, W., and Barroso, L. A. Power provisioning for a warehouse-sized
    computer. In Proceedings of the 34th Annual international Symposium on Computer
    Architecture (San Diego, California, USA, June 09 - 13, 2007). ISCA '07. 13-23.
•   [Ghemawat] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In
    Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton
    Landing, NY, USA, October 19 - 22, 2003). SOSP '03. ACM, New York, NY, 29-43.
•   [Guo 2008] Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and
    Songwu Lu. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers.
    In ACM SIGCOMM 2008.
•   [Guo 2009] Chuanxiong Guo, Guohan Lu, Dan Li, Xuan Zhang, Haitao Wu, Yunfeng Shi,
    Chen Tian, Yongguang Zhang, and Songwu Lu, BCube: A High Performance, Server-
    centric Network Architecture for Modular Data Centers, in ACM SIGCOMM 09.
•   [Isard] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar and
    Andrew Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. SOSP'09.
•   [Olston] Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. 2008. Pig Latin: a
    not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD
    international Conference on Management of Data (Vancouver, Canada, June 09 - 12,
    2008). SIGMOD '08. 1099-1110.
•   [Pike] Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. 2005. Interpreting the data:
    Parallel analysis with Sawzall. Sci. Program. 13, 4 (Oct. 2005), 277-298.
•   [Reghavendra] Ramya Raghavendra, Parthasarathy Ranganathan, Vanish Talwar, Zhikui
    Wang, Xiaoyun Zhu. No "Power" Struggles: Coordinated Multi-level Power Management
    for the Data Center. In Proceedings of the International Conference on Architectural
    Support for Programming Languages and Operating Systems (ASPLOS), Seattle, WA,
    March 2008.
•   [Yu] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey.
    DryadLINQ: A system for general-purpose distributed data-parallel computing using a
    high-level language. In Proceedings of the 8th Symposium on Operating Systems Design
    and Implementation (OSDI), December 8-10 2008.
Thank you!