F21DP2 Distributed & Parallel Technologies
9. Industrial Parallel Frameworks
9.1. Introduction
   growing industrial use of very large scale parallel
    systems

   especially for Internet services

        vast quantities of data

        millions of users

        highly distributed user base

   need to find development frameworks

        capture standard patterns of parallel processing

   e.g. Google MapReduce

   e.g. Hadoop MapReduce

   e.g. Microsoft Dryad

   NB

        hard to get reliable/recent data

        commercially sensitive
9.2. Google
9.2.1. Overview
 in 2006 Google:

    stored
         search crawler                               850 TB
         Google Analytics                             220 TB
         Google Earth                                 70.5 TB
         personalised search                          4 TB

    with 11% compression
(Alex Chitu
http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html)


 had 450,000 computers in 25 centres
(New York Times,
http://www.nytimes.com/2006/06/14/technology/14search.html?pagewanted=2&_r=1&ei=5094&en=4b91d1f7096cf107&hp&ex=1150257600&partner=homepage&adxnnlx=1150258448-9M6P8rTdHIBUH24IJMpGGg)


 how to organise this?

 three key technologies

    Google File System (GFS)

    Bigtable

    MapReduce
Google File System (GFS)

   Ghemawat et al, SOSP 2003

   scalable, distributed, fault tolerant

   commodity hardware

   single master + many chunk servers (toy sketch below)

   Linux

   largest GFS cluster had 1000 storage nodes with
    300 TB
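
A toy, in-memory caricature of that master/chunk-server split, assuming the 64 MB
chunk size from the GFS paper; this is not the real GFS interface, and every class
and method name below is invented for illustration. The master holds only metadata,
mapping each (file, chunk index) to the chunk servers holding a replica; clients
then read chunk data directly from those servers.

import java.util.*;

// Toy model of the GFS master's metadata role (illustrative only).
public class ToyGfsMaster {
    static final long CHUNK_SIZE = 64L << 20;          // 64 MB fixed-size chunks

    record ChunkLocation(String chunkHandle, List<String> replicaServers) {}

    private final Map<String, List<ChunkLocation>> fileToChunks = new HashMap<>();

    // Register one more chunk of a file and the servers holding its replicas.
    void addChunk(String path, String handle, List<String> servers) {
        fileToChunks.computeIfAbsent(path, p -> new ArrayList<>())
                    .add(new ChunkLocation(handle, servers));
    }

    // Which servers hold the chunk covering this byte offset? A client asks the
    // master this, then fetches the data from a chunk server, not from the master.
    ChunkLocation locate(String path, long offset) {
        return fileToChunks.get(path).get((int) (offset / CHUNK_SIZE));
    }
}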

 Bigtable

   Chang et al, OSDI 2006

   sparse, distributed, persistent, multi-dimensional
    sorted map (see the toy sketch below)

   uses Chubby

     highly available, persistent, distributed, lock
      service

   library + master server + many tablet servers

   60 projects: Google Earth, Finance, Analytics etc

   388 clusters with 24,500 tablet servers
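
The toy model below illustrates just that data model: a sorted map from (row key,
column key, timestamp) to an uninterpreted value, sparse because only written cells
are stored. It is a single-machine sketch, not the Bigtable API; in the real system
the row range is split into tablets served by the tablet servers above, and all
names in the code are invented.

import java.util.Comparator;
import java.util.TreeMap;

// Toy single-machine model of the Bigtable data model (illustrative only).
public class ToyBigtable {
    // row -> column -> timestamp (newest first) -> value
    private final TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> rows = new TreeMap<>();

    public void put(String row, String column, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
            .put(timestamp, value);                   // sparse: only written cells stored
    }

    // Most recent version of a cell, or null if it was never written.
    public byte[] getLatest(String row, String column) {
        var columns = rows.get(row);
        if (columns == null) return null;
        var versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }
}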
9.2.2. MapReduce

  Dean & Ghemawat, OSDI 2004 & CACM 2008

  hundreds of special purpose computations over raw
   data on multiple machines at Google

    many conceptually simple

  simplicity obscured by

    large input data

    computations distributed across 100s/1000s of
     machines

    failure handling

  MapReduce abstraction hides this detail

  strongly influenced by functional programming map
   & reduce

    most computations apply

       map to logical records to produce key/value
        pairs

       reduce to all values with same key to combine
        derived data

  re-execution for fault tolerance
 input: key/value pairs

 output: key/value pairs

 MapReduce library written in C++

 user provides:

   map & reduce arguments:

     map - (k1,v1) -> list(k2,v2)

    i.e. a’ * b’ -> (c’ * d’) list

     reduce - (k2,list(v2)) -> list(v2)

    i.e. c’ * (d’ list) -> d’ list

   I/O file names
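
A minimal sketch of those two user-supplied functions, using Java generics to mirror
the signatures above (the real library is C++); word count is the standard example
from the paper, but the Mapper/Reducer interfaces and all other names here are
invented for illustration.

import java.util.*;

// The two functions the user writes, plus word count as a concrete instance.
public class MapReduceSignatures {
    interface Mapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);      // (k1,v1) -> list(k2,v2)
    }
    interface Reducer<K2, V2> {
        List<V2> reduce(K2 key, List<V2> values);           // (k2,list(v2)) -> list(v2)
    }

    // map: for each word in the document emit the pair (word, 1)
    static final Mapper<String, String, String, Integer> wordCountMap =
        (docName, contents) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : contents.split("\\s+"))
                out.add(Map.entry(w, 1));
            return out;
        };

    // reduce: sum all the 1s emitted for a given word
    static final Reducer<String, Integer> wordCountReduce =
        (word, counts) -> List.of(counts.stream().mapToInt(Integer::intValue).sum());
}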
 after user program calls MapReduce function

1. MapReduce library:

   partitions input files into M pieces - 16-64 MB per
    piece

   starts many copies of program on cluster

2. program copies behave as master + workers

   system must assign M map tasks and R reduce tasks

   master assigns map or reduce tasks to idle workers
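
A rough sketch of step 1's bookkeeping, assuming a single input file of known size:
cutting it into ~64 MB byte ranges yields the M map tasks that the master then hands
to idle workers in step 2. The real library also respects record boundaries, data
locality and machine failures; the file name and all other names are illustrative.

import java.util.*;

// Partition an input file into fixed-size splits, one per map task.
public class InputSplits {
    record Split(String file, long offset, long length) {}

    static List<Split> partition(String file, long fileSize, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long off = 0; off < fileSize; off += splitSize)
            splits.add(new Split(file, off, Math.min(splitSize, fileSize - off)));
        return splits;                                    // M = splits.size()
    }

    public static void main(String[] args) {
        long MB = 1L << 20;
        // e.g. a 10 GB input file becomes 160 map tasks of 64 MB each
        System.out.println(partition("crawl-00.dat", 10_240 * MB, 64 * MB).size());
    }
}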
3. map task worker:

   reads input piece

   parses key/value pairs

   passes each pair to map argument function

   intermediate results buffered in memory

4. map workers write buffered pairs to local disk

   disk partitioned into R regions

   master:

      is told locations of pairs on disk

      sends locations to reduce workers
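
A sketch of how a map worker buckets its buffered pairs into the R regions of step 4;
hash(key) mod R is the default partitioning function described in the MapReduce paper,
while the surrounding class and method names are invented for illustration.

import java.util.*;

// Split buffered intermediate (key, value) pairs into R regions, one per reduce task.
public class IntermediatePartition {
    // Default partitioning function: hash(key) mod R.
    static int regionFor(String key, int R) {
        return Math.floorMod(key.hashCode(), R);
    }

    static List<List<Map.Entry<String, Integer>>> partition(
            List<Map.Entry<String, Integer>> buffered, int R) {
        List<List<Map.Entry<String, Integer>>> regions = new ArrayList<>();
        for (int r = 0; r < R; r++) regions.add(new ArrayList<>());
        for (Map.Entry<String, Integer> pair : buffered)
            regions.get(regionFor(pair.getKey(), R)).add(pair);
        return regions;   // each region is spilled to local disk; the master learns its location
    }
}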

5. reduce worker:

   reads intermediate data from map worker local
    disks using RPC

   sorts by intermediate keys

   uses external sort if too much data
6. reduce worker:

   iterates over sorted data

   for each unique key passes key & values to reduce
    function

   appends reduce function result to final output file
    for this reduce partition

7. master wakes up user program

      MapReduce call returns to user code
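
The heart of steps 5 and 6, sketched under the assumption that the fetched intermediate
pairs fit in memory (the real worker falls back to an external sort): sort by key, walk
the sorted run, and call the user's reduce function once per unique key, appending its
results to this partition's output. All names are illustrative.

import java.util.*;

// Group sorted intermediate pairs by key and apply the user's reduce function.
public class ReduceSide {
    interface Reducer<K, V> { List<V> reduce(K key, List<V> values); }

    static <K extends Comparable<K>, V> List<V> runReduce(List<Map.Entry<K, V>> fetched,
                                                          Reducer<K, V> reducer) {
        List<Map.Entry<K, V>> sorted = new ArrayList<>(fetched);
        sorted.sort(Map.Entry.comparingByKey());          // sort by intermediate key
        List<V> output = new ArrayList<>();               // this reduce partition's output
        int i = 0;
        while (i < sorted.size()) {
            K key = sorted.get(i).getKey();
            List<V> values = new ArrayList<>();
            while (i < sorted.size() && sorted.get(i).getKey().equals(key))
                values.add(sorted.get(i++).getValue());   // gather all values with this key
            output.addAll(reducer.reduce(key, values));   // append the reduce result
        }
        return output;
    }
}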
 typical MapReduce job:

   200,000 map

   5,000 reduce

   2,000 machines

                               Aug. ’04   Mar. ’06   Sep. ’07
Number of jobs (1000s)         29         171        2,217
Avg. completion time (secs)    634        874        395
Machine years used             217        2,002      11,081
map input data (TB)            3,288      52,254     403,152
map output data (TB)           758        6,743      34,774
reduce output data (TB)        193        2,970      14,018
Avg. machines per job          157        268        394
Unique implementations
   map                         395        1,958      4,083
   reduce                      269        1,208      2,418
Table I. MapReduce Statistics for Different Months. (CACM
2008)


 used for Google Web search

 20TB of document data

 MapReduce reduced code size from 3800 to 700
  lines of C++

 Jan 2008 MapReduce processed 20PB per day
9.3. Apache Hadoop
(Hadoop MapReduce Tutorial
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html)

     open source MapReduce

     Java

     all nodes run Hadoop Distributed File System

     user supplies I/O and map/reduce functions (see the word-count sketch below)

     Hadoop job client submits job to JobTracker

     single master JobTracker

     slave TaskTracker per cluster node

     master

           schedules tasks on slaves

           monitors

           re-executes failed tasks

     Yahoo major Hadoop user

     13,000+ nodes in 2008
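
For concreteness, the classic word-count job in the style of the Hadoop MapReduce
tutorial cited above: the user supplies only the Mapper, the Reducer and the job
configuration, while Hadoop handles splitting, shuffling, sorting and re-execution.
This sketch uses the newer org.apache.hadoop.mapreduce API, so details differ slightly
from the 2008-era org.apache.hadoop.mapred examples and between Hadoop versions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // map: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // reduce: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner: local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));      // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));    // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}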
9.4. MapReduce v Hadoop
    standard 1 TB sort benchmark

          Jim Gray 1998

          10 billion 100-byte records of uncompressed
           text

    Hadoop at Yahoo held previous record (5/08)
(Apache Hadoop wins Terabyte sort benchmark
http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html)


     910 computers

     209 secs

    Google (11/08)
(Official Google Blog: Sorting 1PB with MapReduce
http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html)


     on GFS

     1,000 computers

     68 secs
       Google then tried 1PB

     1,000 TB

     4,000 computers

     48,000 hard drives

             GFS wrote 3 copies of each file to 3 different
              disks

         6 hours 2 minutes
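
      rough back-of-envelope: 1 PB in 6 h 2 min (about 21,700 s) is roughly
       46 GB/s of sorted data, or about 140 GB/s of aggregate disk writes
       once the 3 GFS copies are counted, i.e. only a few MB/s per drive
       across the 48,000 drives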
9.5. Microsoft Dryad
   Isard et al, EuroSys’07

   application described by an arbitrary directed acyclic graph (DAG)

       distributed analogue of Unix pipe

       fine control over communications as well as
        computation

       user can specify communication transport
        mechanism

        files

        TCP pipes

        shared memory channels

   graph vertices may use arbitrary numbers of inputs
    and outputs

       MapReduce is single input to single output

   designed as lower layer for higher level
    programming models

   based around Dryad execution engine

       general purpose data-parallel
   limitations

       greater architectural complexity than
        MapReduce

       vertices must be deterministic




   job is directed acyclic graph

       vertex is program

       edges are data channels

   mapped onto physical resources at runtime

       more vertices in graph than cores in cluster

   channel transports finite sequence of items

       program is not aware of channel form

       program must supply own serialisation
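
A hypothetical toy model of such a job graph (this is not the Dryad API, and every
name below is invented): vertices name programs, edges are channels, and the transport
used by each channel is chosen per edge and hidden from the vertex code itself.

import java.util.*;

// Toy DAG of vertices (programs) connected by typed channels (illustrative only).
public class ToyJobGraph {
    enum Transport { FILE, TCP_PIPE, SHARED_MEMORY_FIFO }

    record Vertex(String name, Runnable program) {}
    record Channel(Vertex from, Vertex to, Transport transport) {}

    final List<Vertex> vertices = new ArrayList<>();
    final List<Channel> channels = new ArrayList<>();

    Vertex vertex(String name, Runnable program) {
        Vertex v = new Vertex(name, program);
        vertices.add(v);
        return v;
    }

    // A real system would reject an edge that introduces a cycle.
    void connect(Vertex from, Vertex to, Transport transport) {
        channels.add(new Channel(from, to, transport));
    }

    public static void main(String[] args) {
        ToyJobGraph g = new ToyJobGraph();
        Vertex read  = g.vertex("read",  () -> {});
        Vertex parse = g.vertex("parse", () -> {});
        Vertex merge = g.vertex("merge", () -> {});
        g.connect(read, parse, Transport.SHARED_MEMORY_FIFO);
        g.connect(parse, merge, Transport.FILE);
        // at run time a scheduler maps each vertex onto a physical machine
    }
}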
   job manager

            runs on cluster or user workstation

            application specific code to construct
             communication graph

            library code to schedule work

       data sent directly between vertices

       cluster has name server

            knows location of each computer

            enables scheduling to take account of locality

       daemon on each computer

            proxy for job manager

            creates processes on behalf of job manager

       vertex execution

            first time, binary sent from job manager to
             daemon

            subsequently, vertex executed from cache
       simple task scheduler queues batch jobs

       distributed storage system

   can run whole assembly on single workstation for
    debugging

   evaluation: SQL

       Sloan Digital Sky Survey database

       astronomical objects: 354M records

       neighbours within 30 arcsecs: 2,800M records

       find neighbours with same colour as primary
        object colour

       10 computers

       compared:

       in memory: shared-memory FIFO

       two-pass: NTFS files
   evaluation: data mining

   MapReduce style

   query logs from MSN Search service

   10,000 GB

   1800 computers in data centre

   shows good scalability
   Nebula scripting language

       generalises Unix pipes

       hides Dryad details

   stages connected by operators determine number of
    vertices

       e.g. Filter: create new vertex from list and make
        pipeline

       e.g. Aggregate: exchanges & merges

   front end

       perl: text I/O <-> records

       SQL subset: select, project, join

       converted to Nebula script for Dryad

				