
Efficient Fair Scheduling for MapReduce

Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Khaled Elmeleegy+, Scott Shenker, Ion Stoica

UC Berkeley, *Facebook Inc, +Yahoo! Research

Motivation
• Hadoop was designed for large batch jobs
– FIFO job scheduler

• At Facebook, we saw a different workload:
– Many users want to share a cluster
– Many jobs are small (10-100 tasks)
• Sampling, ad-hoc queries, hourly reports, etc

Wanted fair scheduler for Hadoop

Benefits of Sharing
• Statistical multiplexing
– Higher utilization, lower costs

• Data consolidation
– Query disjoint data sets together

Why is it Interesting?
Two aspects of MapReduce present obstacles:
• Data locality (placing computation near data) conflicts with fairness
• Dependence between map & reduce tasks can cause underutilization & starvation

Result: 2-10x gains from 2 simple techniques

Outline
• Data locality
– 2 problems
– solution: delay scheduling
• Reduce/map dependence
– 2 problems
– solution: copy-compute splitting
• Conclusion

Definitions
• Job = entire MapReduce computation; consists of map & reduce tasks
• Task = piece of a job
• Slot = partition of a slave where a task can run
• Heartbeat = message from slave to master, in response to which the master may launch a task

[Diagram: slaves send heartbeats to the master, which holds the job queue and assigns tasks in response]
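
Not in the original deck: a minimal Python sketch that just pins down these terms (the class and field names are hypothetical, not Hadoop's real API):

```
# Hypothetical model of the entities defined above (illustration only).
from dataclasses import dataclass, field

@dataclass
class Task:
    job_id: str
    kind: str              # "map" or "reduce"
    preferred_nodes: set   # slaves holding this task's input data (for maps)

@dataclass
class Job:
    job_id: str
    pending_tasks: list = field(default_factory=list)  # tasks not yet launched
    running: int = 0                                    # tasks currently running
    wait_start: float = 0.0                             # used later for delay scheduling

# A heartbeat is simply "slave X has a free slot"; when the master receives
# one, it may pick a task from its job queue and launch it in that slot.
```
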
Outline
• Data locality
– 2 problems
– solution: delay scheduling
• Reduce/map dependence
– 2 problems
– solution: copy-compute splitting
• Conclusion

Problem 1: Poor Locality for Small Jobs

[Chart: “Job Locality at Facebook” – percent of local maps (node locality vs. rack locality) by job size in number of maps, from 10 to 100,000; annotation: “58% of jobs”]

• Only the job at the head of the job queue can be scheduled on each heartbeat
• If this job is small, the chance that the node sending the heartbeat has data for it is small

Problem 2: Sticky Slots
• Suppose we do fair sharing as follows:
– Divide slots equally between jobs
– When a slot frees up, give it to the job farthest below its fair share (this policy is sketched after the example below)

[Diagram: master and four slaves; table of each job’s fair share vs. currently running tasks – Job 1 and Job 2 each have a fair share of 2 slots]

Problem: Jobs never leave their original slots
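
Not from the slides: the naive policy above written out in Python (hypothetical names, reusing the Job fields sketched earlier), which is exactly the behavior that produces sticky slots:

```
# Naive fair sharing, as described above (illustration only, not Hadoop code).
def pick_job(job_queue, fair_share):
    """When a slot frees up, give it to the job farthest below its fair share."""
    candidates = [j for j in job_queue if j.pending_tasks]
    if not candidates:
        return None
    return max(candidates, key=lambda j: fair_share[j.job_id] - j.running)

# Sticky slots: when one of job J's tasks finishes on node N, J becomes the job
# farthest below its fair share, so pick_job hands the freed slot on N straight
# back to J. Jobs therefore never migrate off the nodes where they first got
# slots, no matter where their data lives.
```
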
Solution: Delay Scheduling

• Jobs must wait before they are allowed to run non-local tasks (sketched below):
– If wait < T1, only allow node-local tasks
– If T1 < wait < T2, also allow rack-local
– If wait > T2, also allow off-rack
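
Not from the slides: a rough Python sketch of this rule. T1 and T2 are the wait thresholds above; the helper names, the rack_of mapping, and the wait_start field are assumptions carried over from the earlier sketches, not the real Hadoop scheduler code:

```
# Delay scheduling: gate tasks by locality level until the job has waited long enough.
NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2

def allowed_locality(wait_time, T1, T2):
    """Worst locality level a job may accept after waiting `wait_time`."""
    if wait_time < T1:
        return NODE_LOCAL        # only node-local tasks
    if wait_time < T2:
        return RACK_LOCAL        # node-local or rack-local
    return OFF_RACK              # anything, even off-rack

def task_locality(task, node, rack_of):
    """How local `task` would be on `node`: 0 best (node-local), 2 worst (off-rack)."""
    if node in task.preferred_nodes:
        return NODE_LOCAL
    if rack_of[node] in {rack_of[n] for n in task.preferred_nodes}:
        return RACK_LOCAL
    return OFF_RACK

def maybe_launch(job, node, now, rack_of, T1, T2):
    """On a heartbeat from `node`, launch one of `job`'s tasks only if it is
    local enough for how long the job has been waiting; otherwise skip the
    job this heartbeat and let it keep waiting."""
    level = allowed_locality(now - job.wait_start, T1, T2)
    for task in list(job.pending_tasks):
        if task_locality(task, node, rack_of) <= level:
            job.pending_tasks.remove(task)
            job.wait_start = now     # the job got a slot, so its wait clock resets
            return task
    return None
```

Under this rule a job is delayed by at most T2 before it may run anywhere, so the fairness cost of waiting is bounded.
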
Delay Scheduling Example

[Diagram: master and four slaves, showing each job’s fair share and currently running tasks under delay scheduling]

Jobs can now shift between slots

Sticky Slots Experiment

Further Analysis
• When is it worthwhile to wait, and how long?

• For throughput:
– Always worth it, unless there’s a hotspot node
– If there is a hotspot, run IO-bound tasks on the hotspot node and CPU-bound tasks non-locally (maximizes rate of local IO)

• For response time: similar

IO Rate Biasing Experiment

Outline
• Data locality
– 2 problems
– solution: delay scheduling
• Reduce/map dependence
– 2 problems
– solution: copy-compute splitting
• Conclusion

Reduce/Map Dependence
• Reducers need data from all maps before they can run
• But we want to start reducers before all maps finish, to overlap network IO with computation

Problem 1: Reduce Slot Hoarding

[Animation: timelines of map and reduce slots for several jobs. Job 2’s maps finish, and later its reduces finish, but Job 1’s long-running reduces keep holding the reduce slots. When Job 3 is submitted and its maps finish, Job 3 can’t launch reduces.]

Problem 2: Synchronization

[Diagram: a single job’s maps run in Waves 1–3; its reduces spend that whole time in the copy phase and only enter the compute phase once the maps are done]

Solution: Copy-Compute Splitting

• Logically split each reduce task into a copy task and a compute task
• Allow more copy tasks per node than there are resources for compute tasks
– E.g. 2 compute slots and 6 copy slots
• Limit the # of copy tasks per job on each node to the # of compute slots
– 2 copy tasks / job in the example above (see the sketch below)
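
Not from the slides: a minimal sketch of this per-node admission rule (the constants are the example numbers from the slide; everything else is a hypothetical name):

```
# Per-node admission rule for copy tasks, as described above (illustration only).
from collections import Counter

COMPUTE_SLOTS_PER_NODE = 2   # reduce tasks that can be computing at once
COPY_SLOTS_PER_NODE = 6      # reduce tasks allowed to be fetching map output

def can_start_copy(copies_on_node: Counter, job_id: str) -> bool:
    """May `job_id` start another copy task on this node?

    `copies_on_node` maps job id -> number of copy tasks it already has here.
    """
    if sum(copies_on_node.values()) >= COPY_SLOTS_PER_NODE:
        return False                                  # node-wide copy limit reached
    # Per-job cap: at most as many copy tasks as there are compute slots,
    # so a single job cannot hoard the node (2 copy tasks / job here).
    return copies_on_node[job_id] < COMPUTE_SLOTS_PER_NODE

# Example: three jobs can each overlap 2 copies on a node (6 total), but any
# one job is still capped at 2, leaving copy capacity for newly submitted jobs.
```
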
Copy-Compute Splitting Example

[Diagram: the same single-job timeline, now with each reduce split into copy tasks that overlap the map waves and compute tasks that run after the maps finish]

Copy-Compute Splitting Experiment
• Job mix based on Facebook workload
• 9 “bins” of job sizes, from small to large:

Bin #    0    1    2    3    4    5     6     7     8
Maps    16   40   80  160  320  600  1200  2400  6400
Jobs    29    5    4    4    3    2     1     1     1

Copy-Compute Splitting Experiment
Measured response time gain over FIFO

[Chart: response time gain over FIFO under simple fair sharing vs. fair sharing + copy-compute splitting]

Conclusions
• Lessons for sharing cluster computing systems:
– Make tasks small in time and resource consumption
– Split tasks into pieces with orthogonal resource requirements
– Be ready to trade user isolation for throughput

• Two simple techniques give 2-10x perf. gains

Further Analysis
• When is it worthwhile to wait, and how long?

• For response time:
E(gain) = (1 – e^(−w/t)) · (D – t)
where w = wait amount, t = expected time to get a local heartbeat, and D = delay from running non-locally

– Worth it if E(wait) < cost of running non-locally
– Optimal wait time is infinity
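
Not on the slide: one way to read this expression, assuming heartbeats from nodes holding local data arrive roughly as a Poisson process with mean inter-arrival time t:

```
\[
  \Pr[\text{local heartbeat within wait } w] = 1 - e^{-w/t},
  \qquad
  \mathbb{E}[\text{gain}] = \underbrace{\left(1 - e^{-w/t}\right)}_{\text{chance of going local}}
  \cdot \underbrace{(D - t)}_{\text{delay avoided minus expected wait}}
\]
```

Since 1 − e^(−w/t) grows monotonically with w, the expected gain keeps increasing with the wait (the “optimal wait time is infinity” remark above), and the gain is positive exactly when t < D, i.e. when the expected wait for a local heartbeat is less than the cost of running non-locally.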
