MapReduce Online




Tyson Condie and Neil Conway

UC Berkeley



Joint work with Peter Alvaro, Rusty Sears, Khaled Elmeleegy (Yahoo! Research), and Joe Hellerstein

MapReduce Programming Model

• Programmers think in a data-centric fashion

– Apply transformations to data sets

• The MR framework handles the Hard Stuff:

– Fault tolerance

– Distributed execution, scheduling, concurrency

– Coordination

– Network communication

MapReduce System Model

• Designed for batch-oriented computations over large data sets

– Each operator runs to completion before producing any output

– Operator output is written to stable storage

• Map output to local disk, reduce output to HDFS

• Simple, elegant fault tolerance model: operator restart

– Critical for large clusters

Life Beyond Batch Processing

• Can we apply the MR programming model outside batch processing?

• Two domains of interest:

1. Interactive data analysis

• Enabled by high-level MR query languages, e.g. Hive, Pig, Jaql

• Batch processing is a poor fit

2. Continuous analysis of data streams

• Batch processing adds massive latency

• Requires saving and reloading analysis state

MapReduce Online

• Pipeline data between operators as it is produced

– Decouple computation schedule (logical) from data transfer schedule (physical)

• Hadoop Online Prototype (HOP): Hadoop with pipelining support

– Preserving the Hadoop interfaces and APIs

– Challenge: retain elegant fault tolerance model

• Enables approximate answers and stream processing

– Can also reduce the response times of jobs

Outline

1. Hadoop Background

2. HOP Architecture

3. Online Aggregation

4. Stream Processing with MapReduce

5. Future Work and Conclusion

Hadoop Architecture

• Hadoop MapReduce

– Single master node, many worker nodes

– Client submits a job to master node

– Master splits each job into tasks (map/reduce) and assigns tasks to worker nodes

• Hadoop Distributed File System (HDFS)

– Single name node, many data nodes

– Files stored as large, fixed-size (e.g. 64MB) blocks

– HDFS typically holds map input and reduce output

Job Scheduling

• One map task for each block of the input file

– Applies user-defined map function to each record in the block

– Record = (key, value) pair

• User-defined number of reduce tasks

– Each reduce task is assigned a set of record groups

• Record group = all records with same key

– For each group, apply user-defined reduce function to the record values in that group

• Reduce tasks read from every map task

– Each read returns the record groups for that reduce task
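The scheduling rules above can be sketched as a tiny in-memory job. This is only an illustration, not Hadoop code: `map_fn`, `reduce_fn`, `partition`, and `run_job` are hypothetical names, word count is an assumed workload, and the CRC32-based partitioner is an assumption standing in for whatever partitioner a real job configures.

```python
from collections import defaultdict
from zlib import crc32


def map_fn(record):
    """Hypothetical user-defined map function: emit (word, 1) for each word."""
    _offset, line = record
    return [(word, 1) for word in line.split()]


def reduce_fn(key, values):
    """Hypothetical user-defined reduce function: sum the values of one group."""
    return key, sum(values)


def partition(key, num_reduce_tasks):
    """Assign a key to a reduce task using a stable hash (an assumption)."""
    return crc32(key.encode("utf-8")) % num_reduce_tasks


def run_job(blocks, num_reduce_tasks):
    # One "map task" per input block; each applies map_fn to every record.
    # inputs[r] holds the record groups assigned to reduce task r.
    inputs = [defaultdict(list) for _ in range(num_reduce_tasks)]
    for block in blocks:
        for record in block:
            for key, value in map_fn(record):
                inputs[partition(key, num_reduce_tasks)][key].append(value)
    # Each reduce task applies reduce_fn once per record group.
    output = {}
    for groups in inputs:
        for key in sorted(groups):
            out_key, out_value = reduce_fn(key, groups[key])
            output[out_key] = out_value
    return output
```

Because all records with the same key land in the same reduce task, the result does not depend on how many reduce tasks the user asks for.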

Dataflow in Hadoop

• Map tasks write their output to local disk

– Output available after map task has completed

• Reduce tasks write their output to HDFS

– Once job is finished, next job’s map tasks can be scheduled, and will read input from HDFS

• Therefore, fault tolerance is simple: re-run tasks on failure

– No consumers see partial operator output

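Dataflow in Hadoop (diagram sequence)

1. Submit job: the client submits the job to the master, which schedules the map and reduce tasks

2. Read input file: each map task reads its block of the input file (Block 1, Block 2) from HDFS

3. Finished: each map task writes its output to the local FS and reports that it has finished; reduce tasks learn which map tasks finished and where their output is located

4. HTTP GET: reduce tasks fetch map output from the map tasks' local FS

5. Write final answer: reduce tasks write the final answer to HDFS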

Hadoop Online Prototype (HOP)

Hadoop Online Prototype

• HOP supports pipelining within and between MapReduce jobs: push rather than pull

– Preserve simple fault tolerance scheme

– Improved job completion time (better cluster utilization)

– Improved detection and handling of stragglers

• MapReduce programming model unchanged

– Clients supply same job parameters

• Hadoop client interface backward compatible

– No changes required to existing clients

• E.g., Pig, Hive, Sawzall, Jaql

– Extended to take a series of jobs

Pipelining Batch Size

• Initial design: pipeline eagerly (for each row)

– Prevents use of combiner

– Moves more sorting work to mapper

– Map function can block on network I/O

• Revised design: map writes into buffer

– Spill thread: sort & combine buffer, spill to disk

– Send thread: pipeline spill files => reducers

• Simple adaptive algorithm
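A minimal sketch of the revised design, under stated assumptions: the map function writes into a bounded buffer, and when the buffer fills it is sorted, combined, and handed to a sender. `MapOutputBuffer`, `capacity`, `combine`, and `send` are hypothetical names. The sketch runs the spill and the send synchronously, whereas the slide describes separate spill and send threads, and it does not attempt the adaptive algorithm, which the slide does not specify.

```python
from collections import defaultdict


class MapOutputBuffer:
    """Buffers map output, then sorts and combines it into spills."""

    def __init__(self, capacity, combine, send):
        self.capacity = capacity  # records held before a spill is forced
        self.combine = combine    # combiner: list of values -> one value
        self.send = send          # called with each finished spill
        self.records = []

    def emit(self, key, value):
        """Called by the map function; never touches the network directly."""
        self.records.append((key, value))
        if len(self.records) >= self.capacity:
            self.spill()

    def spill(self):
        """Sort and combine the buffer, then pipeline the spill to reducers."""
        if not self.records:
            return
        groups = defaultdict(list)
        for key, value in self.records:
            groups[key].append(value)
        spill_file = [(key, self.combine(groups[key])) for key in sorted(groups)]
        self.records = []
        self.send(spill_file)
```

Batching is what makes the combiner usable again: with per-row pipelining there is never more than one value to combine.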

Fault Tolerance

• Fault tolerance in MR is simple and elegant

– Simply recompute on failure, no state recovery

• Initial design for pipelining FT:

– Reduce treats in-progress map output as tentative

• Revised design:

– Pipelining maps periodically checkpoint output

– Reducers can consume output reduce task

– Fault tolerance: fate share

– “Pushdown” predicates and scalar transforms

– Total order = single reduce task

• User-defined code at data producer = bad?

– Fault-tolerant “buffer” (map task), coordination

#2: Fault Tolerance for Streams

• Operator restart for long-running reduces: too expensive

• Hence, window-oriented fault tolerance

– Reducers label windows with IDs

– Mappers use window IDs to garbage collect spills

• Probably need fault-tolerant Job Tracker and HDFS Name Node
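One way to picture window-oriented garbage collection is the sketch below. It is an illustration only: `SpillStore` and its methods are hypothetical, and it assumes window IDs are increasing integers and that a reducer reports the highest window it has fully processed.

```python
class SpillStore:
    """Mapper-side store of spills, tagged with the window each belongs to."""

    def __init__(self):
        self.spills = {}  # window id -> list of spills for that window

    def add(self, window_id, spill):
        self.spills.setdefault(window_id, []).append(spill)

    def garbage_collect(self, completed_window_id):
        """Drop spills for every window the reducer has finished."""
        for window_id in [w for w in self.spills if w <= completed_window_id]:
            del self.spills[window_id]

    def replay_from(self, window_id):
        """Spills a restarted reducer would need, oldest window first."""
        return [
            spill
            for w in sorted(self.spills)
            if w >= window_id
            for spill in self.spills[w]
        ]
```

Under these assumptions a failed reducer only replays the windows it had not finished, rather than restarting from the beginning of the stream.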

#3: Intra-Job Elasticity

• Peak load != average load

– Increasingly important as job duration grows

• Solution: consistent hashing over reduce key space

– Job Tracker manages reduce key => task mapping

• Useful for regular Hadoop as well
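A minimal consistent-hashing sketch of the reduce key => task mapping follows. It is not HOP code: `ReduceRing`, the MD5-based hash, and the `replicas` (virtual points per task) parameter are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a point on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ReduceRing:
    """Consistent hashing of reduce keys onto a changing set of reduce tasks."""

    def __init__(self, tasks, replicas=64):
        self.replicas = replicas  # virtual points per task, to even out load
        self.ring = []            # sorted list of (ring point, task)
        for task in tasks:
            self.add_task(task)

    def add_task(self, task):
        for i in range(self.replicas):
            bisect.insort(self.ring, (_hash(f"{task}#{i}"), task))

    def remove_task(self, task):
        self.ring = [(point, t) for point, t in self.ring if t != task]

    def task_for(self, key):
        """First ring point at or after the key's hash, wrapping around."""
        i = bisect.bisect_left(self.ring, (_hash(key), ""))
        return self.ring[i % len(self.ring)][1]
```

The property that matters for elasticity is that adding a reduce task only moves keys onto the new task; keys never shuffle between the existing tasks.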

Other HOP Benefits

• Shorter job completion time via improved cluster utilization: reduce work starts early

– Important for high-priority jobs, interactive jobs

• Adaptive load management

– Better detection and handling of “straggler” tasks

– Elastic scale-up/scale-down: better pre-emption

– Decouple unit of data transfer from unit of scheduling

• E.g. Yahoo! Petasort: 15GB/map task

Sort Performance: Blocking









• 60 node EC2 cluster, 5.5GB input file

• 40 map tasks, 59 reduce tasks

Sort Performance: Pipelining









• 927 seconds (blocking) vs. 610 seconds (pipelining)
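The two completion times can be turned into a speedup figure. The arithmetic below assumes, from the order of the two slides, that 927 seconds is the blocking run and 610 seconds the pipelining run.

```python
blocking_seconds = 927    # assumed: blocking (stock Hadoop) sort
pipelining_seconds = 610  # assumed: pipelining (HOP) sort

speedup = blocking_seconds / pipelining_seconds
reduction = (blocking_seconds - pipelining_seconds) / blocking_seconds

print(f"speedup: {speedup:.2f}x, completion time reduced by {reduction:.0%}")
# prints: speedup: 1.52x, completion time reduced by 34%
```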

Future Work

1. Basic pipelining

– Performance analysis at scale (e.g. PetaSort)

– Job scheduling is much harder

2. Online Aggregation

– Statistically-robust estimation

– Better UI for approximate results

3. Stream Processing

– Develop into full-fledged stream processing engine

– Stream support for high-level query languages

– Online machine learning

Thanks!

Questions?



Source code and technical report:

http://code.google.com/p/hop/



Contact: nrc@cs.berkeley.edu

Map Task Execution

1. Map phase

– Read the assigned input split from HDFS

• Split = file block by default

– Parses input into records (key/value pairs)

– Applies map function to each record

• Returns zero or more new records

2. Commit phase

– Registers the final output with the slave node

• Stored in the local filesystem as a file

• Sorted first by bucket number then by key

– Informs master node of its completion
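The ordering of the committed map output ("sorted first by bucket number then by key") can be sketched as a single sort on a compound key. `bucket_of` and `commit_map_output` are hypothetical names, and the CRC32 hash used to pick a bucket is an assumption.

```python
from zlib import crc32


def bucket_of(key, num_buckets):
    """Bucket (reduce task) a key belongs to; the hash choice is an assumption."""
    return crc32(key.encode("utf-8")) % num_buckets


def commit_map_output(records, num_buckets):
    """Order records by bucket number, then by key within each bucket."""
    return sorted(records, key=lambda kv: (bucket_of(kv[0], num_buckets), kv[0]))
```

With this layout each reduce task's portion is one contiguous, already key-sorted region of the file.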

Reduce Task Execution

1. Shuffle phase

– Fetches input data from all map tasks

• The portion corresponding to the reduce task’s bucket

2. Sort phase

– Merge-sort *all* map outputs into a single run

3. Reduce phase

– Applies user reduce function to the merged run

• Arguments: key and corresponding list of values

– Write output to a temp file in HDFS

• Atomic rename when finished
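The sort and reduce phases can be sketched as a merge of already-sorted map outputs followed by one reduce call per key. This is an illustration, not Hadoop code: `run_reduce_task` is a hypothetical name, the map outputs are assumed to be in-memory lists sorted by key, and the local filesystem stands in for HDFS.

```python
import heapq
import itertools
import os
import tempfile


def run_reduce_task(sorted_runs, reduce_fn, output_path):
    """Merge sorted map outputs, reduce each key group, commit atomically."""
    # Sort phase: merge all per-map runs into a single key-ordered stream.
    merged = heapq.merge(*sorted_runs, key=lambda kv: kv[0])

    # Reduce phase: write results to a temp file next to the final output.
    out_dir = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as tmp:
        for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
            result = reduce_fn(key, [value for _, value in group])
            tmp.write(f"{key}\t{result}\n")

    # Atomic rename: readers see either no output or the complete output.
    os.replace(tmp_path, output_path)
```

Writing to a temp file first is what lets a failed reduce task be restarted without anyone having seen its partial output.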

Design Implications

1. Fault Tolerance

– Tasks that fail are simply restarted

– No further steps required since nothing left the task

2. “Straggler” handling

– Job response time affected by slow task

– Slow tasks get executed redundantly

• Take result from the first to finish

• Assumes slowdown is due to physical components (e.g., network, host machine)





• Pipelining can support both!

Fault Tolerance in HOP

• Traditional fault tolerance algorithms for pipelined dataflow systems are complex

• HOP approach: write to disk and pipeline

– Producers write data into in-memory buffer

– In-memory buffer periodically spilled to disk

– Spills sent to consumers

– Consumers treat pipelined data as “tentative” until producer is known to complete

– Fault tolerance via task restart; tentative output is discarded
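The consumer side of this scheme can be sketched as follows. `Consumer` and its method names are hypothetical; the sketch assumes the consumer is told by some outside party (for example the master) when a producer completes or fails.

```python
class Consumer:
    """Holds pipelined spills as tentative until their producer completes."""

    def __init__(self):
        self.tentative = {}  # producer task id -> spills received so far
        self.committed = []  # spills from producers known to have completed

    def receive(self, producer_id, spill):
        self.tentative.setdefault(producer_id, []).append(spill)

    def producer_completed(self, producer_id):
        """The producer finished: its spills are now safe to use."""
        self.committed.extend(self.tentative.pop(producer_id, []))

    def producer_failed(self, producer_id):
        """The producer will be restarted: discard everything it sent."""
        self.tentative.pop(producer_id, None)
```

Recovery stays as simple as in stock Hadoop: a restarted producer starts over, and nothing it sent before the failure is ever merged into the result.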

Refinement: Checkpoints

• Problem: Treating output as tentative inhibits parallelism

• Solution: Producers periodically “checkpoint” with Hadoop master node

– “Output split x corresponds to input offset y”

– Pipelined data <= split x is now non-tentative

– Also improves speculation for straggler tasks, reduces redundant work on task failure
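The checkpoint refinement can be sketched by recording, per producer, the pair "output split x corresponds to input offset y". `CheckpointingConsumer` and its methods are hypothetical, and the sketch assumes splits are numbered from 0 and that a producer with no checkpoint restarts from input offset 0.

```python
class CheckpointingConsumer:
    """Tracks which pipelined splits are non-tentative for each producer."""

    def __init__(self):
        self.spills = {}      # producer id -> {split number: spill}
        self.checkpoint = {}  # producer id -> (split x, input offset y)

    def receive(self, producer_id, split, spill):
        self.spills.setdefault(producer_id, {})[split] = spill

    def on_checkpoint(self, producer_id, split, input_offset):
        """Record that output split `split` corresponds to `input_offset`."""
        self.checkpoint[producer_id] = (split, input_offset)

    def non_tentative(self, producer_id):
        """Spills at or below the checkpointed split are safe to consume."""
        x, _ = self.checkpoint.get(producer_id, (-1, 0))
        received = self.spills.get(producer_id, {})
        return [received[s] for s in sorted(received) if s <= x]

    def on_failure(self, producer_id):
        """Drop only tentative spills; return the input offset to resume from."""
        x, y = self.checkpoint.get(producer_id, (-1, 0))
        received = self.spills.get(producer_id, {})
        self.spills[producer_id] = {s: v for s, v in received.items() if s <= x}
        return y
```

A restarted (or speculative) producer can begin at the returned input offset instead of reprocessing its whole input, which is where the saving in redundant work comes from.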

