MapReduce Programming

Shared by: Jun Wang
posted: 10/27/2011
Yue-Shan Chang

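[Figure: MapReduce execution overview. (1) The user program forks the master and the workers. (2) The master assigns map tasks and reduce tasks to workers. (3) Map workers read the input splits (split 0 to split 4). (4) Map workers write intermediate files to local disk. (5) Reduce workers read the intermediate files remotely. (6) Reduce workers write the output files (output file 0, output file 1). Stages, left to right: input files, map phase, intermediate files (on local disk), reduce phase, output files.]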

MapReduce Program Structure

class MapReduce {

    class Mapper … {
        // Map code
    }

    class Reducer … {
        // Reduce code
    }

    main() {
        // Main program configuration area
        JobConf conf = new JobConf(MapReduce.class);
        // Other configuration parameter code
    }
}

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Emits (word, 1) for every whitespace-separated token of an input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Sums the counts collected for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Configures the job and submits it; args[0] is the input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

MapReduce Job

Handled parts

Configuration of a Job

• JobConf object

– JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution.

– JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat and OutputFormat implementations to be used.

– The job also indicates the set of input files, through FileInputFormat.setInputPaths(JobConf, Path...) / addInputPath(JobConf, Path) or FileInputFormat.setInputPaths(JobConf, String) / addInputPaths(JobConf, String), and where the output files should be written, through FileOutputFormat.setOutputPath(JobConf, Path).

Input Splitting

• An input split will normally be a contiguous group of records from a single input file.

– If the number of requested map tasks is larger than the number of files, or the individual files are larger than the suggested fragment size, there may be multiple input splits constructed from each input file.

• The user has considerable control over the number of input splits.
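
The interplay between requested map tasks and fragment size can be sketched in a few lines. This is a simplified model, assuming the split-size rule of the classic FileInputFormat (goal size capped by the block size, but never below a minimum); the class name, method names and the sizes in main are hypothetical, and details of the real implementation (such as slack on the last split) are left out.

```java
// Simplified model of input split sizing; not Hadoop code.
final class SplitSketch {

    /** Split size: the goal size capped by the block size, but never below the minimum. */
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Number of splits for one file of the given length (ceiling division). */
    static long splitsForFile(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 1024 * mb;  // one hypothetical 1 GB input file
        long blockSize = 64 * mb;     // assumed file system block size
        long minSize = 1;             // assumed minimum split size in bytes
        int requestedMaps = 4;        // the number of map tasks the job asks for

        long goalSize = fileLength / requestedMaps;  // 256 MB per requested map
        long size = splitSize(goalSize, minSize, blockSize);
        System.out.println(size / mb + " MB per split, "
                + splitsForFile(fileLength, size) + " splits");
    }
}
```

In this model a single file larger than the block size yields several splits even though only four map tasks were requested, which is the case described in the slide.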

Specifying Input Formats

• The Hadoop framework provides a large variety of

input formats.

– KeyValueTextInputFormat: Key/value pairs, one per line.

– TextInputFormat: The key is the byte offset of the line within the file, and the value is the line.

– NLineInputFormat: Similar to TextInputFormat, but the splits are based on N lines of input rather than Y bytes of input.

– MultiFileInputFormat: An abstract class that lets the user

implement an input format that aggregates multiple files

into one split.

– SequenceFileInputFormat: The input file is a Hadoop sequence file, containing serialized key/value pairs.

Setting the Output Parameters

• The framework requires that the output

parameters be configured, even if the job will

not produce any output.

• The framework will collect the output from the specified tasks and place it into the configured output directory.

A Simple Map Function: IdentityMapper

A Simple Reduce Function: IdentityReducer

Configuring the Reduce Phase

• The user must supply the framework with five pieces of information:

– The number of reduce tasks; if zero, no reduce

phase is run

– The class supplying the reduce method

– The input key and value types for the reduce task;

by default, the same as the reduce output

– The output key and value types for the reduce

task

– The output file type for the reduce task output

How Many Maps?

• The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

• The right level of parallelism for maps seems to be around 10-100 maps per node.

• It is best if the maps take at least a minute to execute.

• JobConf.setNumMapTasks(int) gives the framework a hint for the number of maps.
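
As a rough illustration of these guidelines, the sketch below counts the blocks of a set of input files and checks whether the resulting maps per node fall within the suggested 10-100 range. It is plain arithmetic rather than Hadoop code; the class name, the 128 MB block size, the 10 TB input and the 1,000-node cluster are all assumptions chosen for the example.

```java
// Back-of-the-envelope estimate of the number of maps; not Hadoop code.
final class MapCountSketch {

    /** Total number of blocks: each file contributes ceil(length / blockSize). */
    static long blocks(long[] fileLengths, long blockSize) {
        long total = 0;
        for (long length : fileLengths) {
            total += (length + blockSize - 1) / blockSize;
        }
        return total;
    }

    /** True if maps per node fall within the suggested 10-100 range. */
    static boolean withinSuggestedRange(long maps, int nodes) {
        double perNode = (double) maps / nodes;
        return perNode >= 10 && perNode <= 100;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long tb = 1L << 40;
        long[] inputs = { 10 * tb };  // hypothetical input: a single 10 TB file
        long blockSize = 128 * mb;    // assumed block size
        int nodes = 1000;             // assumed cluster size

        long maps = blocks(inputs, blockSize);
        System.out.println(maps + " maps, within range: "
                + withinSuggestedRange(maps, nodes));
    }
}
```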

Reducer

• Reducer reduces a set of intermediate values which

share a key to a smaller set of values.

• Reducer has 3 primary phases: shuffle, sort and reduce.

• Shuffle

– Input to the Reducer is the sorted output of the mappers.

In this phase the framework fetches the relevant partition

of the output of all the mappers, via HTTP.

• Sort

– The framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage.

– The shuffle and sort phases occur simultaneously; while

map-outputs are being fetched they are merged.
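
The shuffle, sort and reduce phases can be imitated in memory to make the data flow concrete. The sketch below is a plain-Java model, not Hadoop code: it assumes hash partitioning of keys (the rule used by Hadoop's default HashPartitioner), represents each mapper's output as a list of (word, count) pairs, and uses hypothetical class and method names throughout. HTTP transfer and on-disk merging are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory model of shuffle, sort and reduce for word counts; not Hadoop code.
final class ShuffleSortSketch {

    /** Hash partitioning (assumed): which reducer receives a given key. */
    static int partition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    /** Shuffle and sort for one reducer: keep its partition and group values by sorted key. */
    static SortedMap<String, List<Integer>> shuffleAndSort(
            List<List<Map.Entry<String, Integer>>> mapOutputs, int reducer, int numReduces) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> oneMapper : mapOutputs) {
            for (Map.Entry<String, Integer> pair : oneMapper) {
                if (partition(pair.getKey(), numReduces) == reducer) {
                    grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                           .add(pair.getValue());
                }
            }
        }
        return grouped;
    }

    /** Reduce: sum the grouped values of each key. */
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            out.put(entry.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two hypothetical mappers that both emitted the key "hello".
        List<List<Map.Entry<String, Integer>>> mapOutputs = List.of(
                List.of(Map.entry("hello", 1), Map.entry("world", 1)),
                List.of(Map.entry("hello", 1), Map.entry("hadoop", 1)));
        int numReduces = 1;  // a single reducer receives every key
        System.out.println(reduce(shuffleAndSort(mapOutputs, 0, numReduces)));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```

With more than one reducer, each key still lands in exactly one partition, so the per-reducer results never overlap.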

How Many Reduces?

• The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).

• With 0.95 all of the reduces can launch

immediately and start transferring map outputs

as the maps finish.

• With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
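
The rule of thumb is easy to turn into arithmetic. The sketch below assumes the quantity being scaled is the number of nodes times mapred.tasktracker.reduce.tasks.maximum, that is, the total number of reduce slots in the cluster; rounding down to a whole number of tasks is a choice made for this example, and the class name and cluster size are hypothetical.

```java
// Rule-of-thumb reduce count; plain arithmetic, not Hadoop code.
final class ReduceCountSketch {

    /** factor * (nodes * reduce slots per node), rounded down (rounding is an assumption). */
    static int reduceTasks(int nodes, int reduceSlotsPerNode, double factor) {
        return (int) Math.floor(factor * (nodes * reduceSlotsPerNode));
    }

    public static void main(String[] args) {
        int nodes = 10;             // assumed cluster size
        int reduceSlotsPerNode = 2; // assumed mapred.tasktracker.reduce.tasks.maximum

        // One wave: every reduce can start as soon as the maps begin to finish.
        System.out.println("0.95 -> " + reduceTasks(nodes, reduceSlotsPerNode, 0.95));
        // Two waves: faster nodes pick up a second round of reduces.
        System.out.println("1.75 -> " + reduceTasks(nodes, reduceSlotsPerNode, 1.75));
    }
}
```

The resulting value would then be passed to JobConf.setNumReduceTasks(int).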

How Many Reduces?

• Increasing the number of reduces increases the

framework overhead, but increases load

balancing and lowers the cost of failures.

• Reducer NONE

– It is legal to set the number of reduce-tasks to zero if

no reduction is desired.

– In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(JobConf, Path).

– The framework does not sort the map-outputs before

writing them out to the FileSystem

Reporter

• Reporter is a facility for Map/Reduce

applications to report progress, set

application-level status messages and update

Counters.

• Mapper and Reducer implementations can

use the Reporter to report progress or just

indicate that they are alive.

JobTracker

• JobTracker is the central location for

submitting and tracking MR jobs in a network

environment.

• JobClient is the primary interface by which a user job interacts with the JobTracker.

– It provides facilities to submit jobs, track their progress, access component-tasks' reports and logs, get the Map/Reduce cluster's status information and so on.

Job Submission and Monitoring

• The job submission process involves:

– Checking the input and output specifications of

the job.

– Computing the InputSplit values for the job.

– Setting up the requisite accounting information

for the DistributedCache of the job, if necessary.

– Copying the job's jar and configuration to the

Map/Reduce system directory on the FileSystem.

– Submitting the job to the JobTracker and optionally monitoring its status.

MapReduce Details for

Multimachine Clusters

Introduction

• Why use a multimachine cluster? For jobs that

– have datasets that can’t fit on a single machine,

– have time constraints that are impossible to satisfy with a small number of machines, or

– need to rapidly scale the computing power applied to a problem due to varying input set sizes.

Requirements for Successful

MapReduce Jobs

• Mapper

– Ingests the input and processes each input record, sending forward the records that can be passed to the reduce task or to the final output directly.

• Reducer

– Accepts the key and value groups that passed through the mapper, and generates the final output.

• The job must be configured with the location and type of the input data, the mapper class to use, the number of reduce tasks required, and the reducer class and I/O types.

Requirements for Successful

MapReduce Jobs

• The TaskTracker service will actually run your

map and reduce tasks, and the JobTracker

service will distribute the tasks and their input

split to the various trackers.

• The cluster must be configured with the nodes

that will run the TaskTrackers, and with the

number of TaskTrackers to run per node.

Requirements for Successful

MapReduce Jobs

• Configuring MapReduce on your cluster involves three levels of configuration:

– the machines,

– the Hadoop MapReduce framework,

– the jobs themselves.

Launching MapReduce Jobs







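• Launch the preceding example from the command line (generic options such as -libjars follow the class name and are parsed only if MyClass uses ToolRunner or GenericOptionsParser):

> bin/hadoop jar myjar.jar MyClass [-libjars jar1.jar,jar2.jar,jar3.jar]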

MapReduce-Specific Configuration

for Each Machine in a Cluster

• Install any standard JARs that your application uses.

• It is probable that your applications will have a

runtime environment that is deployed from a

configuration management application, which you

will also need to deploy to each machine.

• The machines will need to have enough RAM for the

Hadoop Core services plus the RAM required to run

your tasks.

• The conf/slaves file should have the set of machines

to serve as TaskTracker nodes

DistributedCache

• Distributes application-specific, large, read-only files efficiently.

• A facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications.

• The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

Adding Resources to the Task

Classpath

• Methods

– JobConf.setJar(String jar): Sets the user JAR for the

MapReduce job.

– JobConf.setJarByClass(Class cls): Determines the

JAR that contains the class cls and calls

JobConf.setJar(jar) with that JAR.

– DistributedCache.addArchiveToClassPath(Path

archive, Configuration conf): Adds an archive path

to the current set of classpath entries.

Configuring the Hadoop Core

Cluster Information

• Setting the Default File System URI









• You can also use the JobConf object to set the default file system:

– conf.set("fs.default.name", "hdfs://NamenodeHostname:PORT");

Configuring the Hadoop Core

Cluster Information

• Setting the JobTracker Location









• Use the JobConf object to set the JobTracker information:

– conf.set("mapred.job.tracker", "JobtrackerHostname:PORT");


