
UC Berkeley




   Efficient Fair Scheduling for
           MapReduce
Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*,
     Khaled Elmeleegy+, Scott Shenker, Ion Stoica

       UC Berkeley, *Facebook Inc, +Yahoo! Research
                        Motivation
• Hadoop was designed for large batch jobs
  – FIFO job scheduler

• At Facebook, we saw a different workload:
  – Many users want to share a cluster
  – Many jobs are small (10-100 tasks)
    • Sampling, ad-hoc queries, hourly reports, etc

Wanted fair scheduler for Hadoop
                Benefits of Sharing
• Statistical multiplexing
  – Higher utilization, lower costs


• Data consolidation
  – Query disjoint data sets together
            Why is it Interesting?
Two aspects of MapReduce present obstacles:
• Data locality (placing computation near
  data) conflicts with fairness
• Dependence between map & reduce tasks
  can cause underutilization & starvation

Result: 2-10x gains from 2 simple techniques
                         Outline
• Hadoop terminology
• Data locality
  – 2 problems
  – solution: delay scheduling
• Reduce/map dependence
  – 2 problems
  – solution: copy-compute splitting
• Conclusion
                 Hadoop Terminology
• Job = entire MapReduce computation; consists
  of map & reduce tasks
• Task = piece of a job
• Slot = partition of a slave where a task can run
• Heartbeat = message from slave to master, in
  response to which master may launch a task


[Diagram: slaves send heartbeats to the master, which holds the job queue and replies with tasks to launch]
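
A minimal sketch of how these pieces fit together, assuming a simplified single-slot-type model; the classes and the on_heartbeat handler below are illustrative only, not Hadoop's actual API:

# Illustrative model of the terminology above (not Hadoop's real classes).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    job_id: int
    kind: str                      # "map" or "reduce"

@dataclass
class Job:
    job_id: int
    pending: List[Task] = field(default_factory=list)   # tasks not yet launched

@dataclass
class Slave:
    free_slots: int = 0            # open slots reported in each heartbeat

def on_heartbeat(job_queue: List[Job], slave: Slave) -> Optional[Task]:
    """Master-side handler: when a slave heartbeats with a free slot,
    the master may reply with a task to launch (FIFO over the job queue)."""
    if slave.free_slots > 0:
        for job in job_queue:
            if job.pending:
                slave.free_slots -= 1
                return job.pending.pop(0)
    return None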
                         Outline
• Hadoop terminology
• Data locality
  – 2 problems
  – solution: delay scheduling
• Reduce/map dependence
  – 2 problems
  – solution: copy-compute splitting
• Conclusion
                   Problem 1: Poor Locality for Small Jobs
[Chart: "Job Locality at Facebook" shows percent of local maps (node locality vs. rack locality) against job size in number of maps, from 10 to 100,000; annotation: 58% of jobs at Facebook fall in the small-job range]
       Cause: Head-of-line Scheduling
• Only the job at the head of the job queue can
  be scheduled on each heartbeat
• If this job is small, the chance that the node
  sending the heartbeat has data for it is small
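
A rough back-of-the-envelope sketch of why small jobs suffer; the cluster size, replication factor, and job sizes below are hypothetical, and block placements are treated as independent purely for illustration:

# Probability that a randomly chosen heartbeating node holds at least one
# local input block for a job with k_blocks blocks, each replicated on
# `replication` of n_nodes nodes (independence assumed for illustration).
def p_node_local(k_blocks, n_nodes, replication=3):
    p_miss_one = 1 - replication / n_nodes
    return 1 - p_miss_one ** k_blocks

print(p_node_local(k_blocks=10,   n_nodes=1000))   # small job:  ~0.03
print(p_node_local(k_blocks=1000, n_nodes=1000))   # large job: ~0.95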
            Problem 2: Sticky Slots
• Suppose we do fair sharing as follows:
  – Divide slots equally between jobs
  – When a slot is free, give it to the job
    farthest below its fair share
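
A minimal sketch of this naive rule (illustrative only; the job objects are assumed to carry fair_share, running, and pending fields, and this is not the actual Hadoop Fair Scheduler code):

# Naive fair sharing: when a slot frees up, hand it to the job that is
# farthest below its fair share of running tasks.
def assign_free_slot(jobs):
    candidates = [j for j in jobs if j.pending]          # jobs with work left
    if not candidates:
        return None
    job = min(candidates, key=lambda j: j.running - j.fair_share)
    job.running += 1
    return job.pending.pop(0)      # launched on whichever node freed the slot

Because the freed slot simply goes back to whichever job is furthest below its share, regardless of where that job's data lives, a job tends to keep getting the same slots back as its own tasks finish: the sticky-slots behavior illustrated on the next slide.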
              Problem 2: Sticky Slots
[Diagram: four slaves and a master; a table shows each job's fair share (2 slots each) and its currently running tasks]


Problem: Jobs never leave their original slots
          Solution: Delay Scheduling

• Jobs must wait before they are allowed to
  run non-local tasks
  – If wait < T1, only allow node-local tasks
  – If T1 < wait < T2, also allow rack-local
  – If wait > T2, also allow off-rack
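
A sketch of the policy above, layered on top of fair sharing; T1, T2, the helper methods has_task_at and launch_task, and the skipped_since field are illustrative assumptions, not Hadoop's API:

import time

T1, T2 = 5.0, 10.0   # locality wait thresholds in seconds (illustrative values)

def allowed_levels(wait):
    """Locality levels a job may use after waiting `wait` seconds."""
    if wait < T1:
        return ("node",)
    if wait < T2:
        return ("node", "rack")
    return ("node", "rack", "off-rack")

def schedule_on_heartbeat(jobs, node):
    """jobs: ordered by how far below fair share they are; node has a free slot."""
    now = time.time()
    for job in jobs:
        for level in allowed_levels(now - job.skipped_since):
            if job.has_task_at(node, level):
                job.skipped_since = now   # job launched a task; reset its wait (simplification)
                return job.launch_task(node, level)
    return None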
         Delay Scheduling Example
[Diagram: the same four slaves and master; with delay scheduling, the running-task counts shift between Job 1 and Job 2 over time]
Jobs can now shift between slots
Sticky Slots Experiment
                 Further Analysis
• When is it worthwhile to wait, and how long?

• For throughput:
  – Always worth it, unless there’s a hotspot node
  – If hotspot, run IO-bound tasks on the hotspot
    node and CPU-bound tasks non-locally
    (maximizes rate of local IO)

• For response time: similar
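
A sketch of the hotspot heuristic above; the io_bound flag, the data_nodes field, and the node identifiers are assumptions for illustration:

# Hotspot heuristic: keep the hotspot's disks serving IO-bound tasks locally,
# and push CPU-bound tasks to other nodes even though they then read remotely.
def choose_node(task, hotspot, other_nodes):
    if task.io_bound and hotspot in task.data_nodes:
        return hotspot             # local IO keeps the hotspot's disks busy
    return other_nodes[0]          # CPU-bound work runs non-locally elsewhere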
IO Rate Biasing Experiment
                         Outline
• Hadoop terminology
• Data locality
  – 2 problems
  – solution: delay scheduling
• Reduce/map dependence
  – 2 problems
  – solution: copy-compute splitting
• Conclusion
         Reduce/Map Dependence
• Reducers need data from all maps to run
• But want to start reducers before all maps
  finish to overlap network IO and compute
          Problem 1: Reduce Slot Hoarding
[Animated diagram: map and reduce slot timelines for Jobs 1-3. Job 2's maps and reduces finish; Job 3 is submitted and its maps finish, but Job 3 can't launch reduces because Job 1's long-running reduces still occupy the reduce slots]
       Problem 2: Synchronization (in a single job)
[Diagram: one job's maps finish, followed by its reduces running in three waves, each wave split into a copy phase and a compute phase, all occurring after the maps are done]
        Solution: Copy-Compute Splitting

• Logically split each reduce task into a
  “copy task” and a “compute task”
• Allow more copy tasks per node than there
  are compute slots
  – E.g. 2 compute slots and 6 copy slots
• Limit the number of copy tasks per job on each
  node to the number of compute slots
  – i.e. at most 2 copy tasks per job in the example above
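
A sketch of the per-node slot accounting described above; the 2 compute / 6 copy counts mirror the example on the slide, and the class itself is illustrative:

from collections import Counter

# Copy-compute splitting: a node admits more copy tasks than compute tasks,
# but caps copy tasks per job at the number of compute slots.
class NodeSlots:
    def __init__(self, compute_slots=2, copy_slots=6):
        self.compute_slots = compute_slots
        self.copy_slots = copy_slots
        self.copies_per_job = Counter()    # running copy tasks, keyed by job id
        self.running_copies = 0
        self.running_computes = 0

    def can_start_copy(self, job_id):
        return (self.running_copies < self.copy_slots and
                self.copies_per_job[job_id] < self.compute_slots)

    def start_copy(self, job_id):
        if not self.can_start_copy(job_id):
            return False
        self.copies_per_job[job_id] += 1
        self.running_copies += 1
        return True

    def can_start_compute(self):
        return self.running_computes < self.compute_slots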
        Copy-Compute Splitting Example
[Diagram: the same job with copy-compute splitting; the copy phases of the reduce waves overlap with the maps, and the compute phases run once the maps are done]
              Copy-Compute Splitting
                   Experiment
• Job mix based on Facebook workload
• 9 “bins” of job sizes, from small to large:

Bin #    0   1     2     3     4     5     6       7      8
Maps    16   40   80   160   320   600   1200   2400   6400
Jobs    29   5     4     4     3     2     1       1      1
                  Copy-Compute Splitting Experiment
Measured response time gain over FIFO

[Bar chart comparing the response time gains of simple fair sharing and of fair sharing + copy-compute splitting]
                      Conclusions
• Lessons for sharing cluster computing systems:
  – Make tasks small in time and resource consumption
  – Split tasks into pieces with orthogonal resource
    requirements
  – Be ready to trade user isolation for throughput


• Two simple techniques give 2-10x perf. gains
                        Further Analysis
• When is it worthwhile to wait, and how long?

• For response time:
    E(gain) = (1 - e^(-w/t)) (D - t)
  where w = wait amount, t = expected time to get a local heartbeat,
  and D = delay from running non-locally

  – Worth it if E(wait) < cost of running non-locally
  – Optimal wait time is infinity
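
A quick numerical check of the formula above; t and D below are hypothetical values chosen only to show that the expected gain grows monotonically with the wait w (which is why the optimal wait in this simple model is unbounded):

import math

def expected_gain(w, t, D):
    """E(gain) = (1 - e^(-w/t)) * (D - t)."""
    return (1 - math.exp(-w / t)) * (D - t)

t, D = 2.0, 30.0        # hypothetical: local-heartbeat wait vs. non-local delay
for w in (1, 5, 10, 60):
    print(w, round(expected_gain(w, t, D), 2))
# 1 -> 11.02, 5 -> 25.7, 10 -> 27.81, 60 -> 28.0: approaches D - t = 28 as w grows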
