MapReduce Programming
Yue-Shan Chang
[Figure: MapReduce execution overview. The user program forks the master and the workers (1). The master assigns map and reduce tasks (2). Map workers read the input splits (3) and write intermediate files to local disk (4). Reduce workers read those files remotely (5) and write the output files (6). Stages, left to right: input files, map phase, intermediate files (on local disk), reduce phase, output files.]
MapReduce Program Structure
class MapReduce {
  class Mapper … {    // map code
  }
  class Reducer … {   // reduce code
  }
  main() {            // main program: job configuration
    JobConf conf = new JobConf(MapReduce.class);
    // other configuration parameters
  }
}
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Driver: args[0] is the input path, args[1] is the output path.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
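The data flow of WordCount can be tried without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class LocalWordCount and its methods map, reduce and run are hypothetical names that only mimic the map, group-by-key and reduce steps in a single process.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of the WordCount data flow.
public class LocalWordCount {

    // Map step: emit a (word, 1) pair for every token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Reduce step: sum all counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map every line, group the values by key in sorted order, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(run(Arrays.asList(
                "Hello World Bye World", "Hello Hadoop Goodbye Hadoop")));
    }
}
```

In the real job the grouping is done by the framework between the map and reduce phases, and the pairs travel across machines rather than through an in-memory map.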
MapReduce Job
Handled parts
Configuration of a Job
• JobConf object
– JobConf is the primary interface for a user to describe a
map-reduce job to the Hadoop framework for
execution.
– JobConf typically specifies the Mapper, combiner (if
any), Partitioner, Reducer, InputFormat and
OutputFormat implementations to be used
– It also indicates the set of input files, through
FileInputFormat.setInputPaths(JobConf, Path...) /
addInputPath(JobConf, Path) or the String variants
setInputPaths(JobConf, String) / addInputPaths(JobConf, String),
and where the output files should be written, through
FileOutputFormat.setOutputPath(JobConf, Path).
Input Splitting
• An input split will normally be a contiguous
group of records from a single input file
– If the number of requested map tasks is larger
than the number of files, or the individual files are
larger than the suggested fragment size, multiple
input splits may be constructed from each input file.
• The user has considerable control over the
number of input splits.
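How the requested number of map tasks and the fragment size interact can be sketched with a little arithmetic. The sketch below is a simplified illustration, not the framework's code: the class SplitSketch, its method names and the file sizes are hypothetical, and it ignores details such as the slack the framework allows on the last split of a file. The max/min rule itself follows the one used by the old-API FileInputFormat.

```java
// Hypothetical, simplified model of how input splits are sized and counted.
public class SplitSketch {

    // The split size is the goal size capped by the block size,
    // but never smaller than the configured minimum.
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits needed to cover one file (ceiling division).
    static long splitsForFile(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] fileLengths = {200 * mb, 50 * mb}; // hypothetical input files
        long blockSize = 64 * mb;                 // hypothetical block size
        int requestedMaps = 2;                    // hint given by the user

        long total = 0;
        for (long len : fileLengths) {
            total += len;
        }
        long goal = total / requestedMaps;        // suggested fragment size
        long size = splitSize(goal, 1, blockSize);

        long splits = 0;
        for (long len : fileLengths) {
            splits += splitsForFile(len, size);
        }
        // prints "split size = 64 MB, splits = 5"
        System.out.println("split size = " + size / mb + " MB, splits = " + splits);
    }
}
```

Here the 200 MB file is larger than the fragment size and yields four splits, while the 50 MB file fits in one, so the job runs five map tasks even though only two were requested.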
Specifying Input Formats
• The Hadoop framework provides a large variety of
input formats.
– KeyValueTextInputFormat: Key/value pairs, one per line.
– TextInputFormat: The key is the byte offset of the line
within the file, and the value is the line.
– NLineInputFormat: Similar to KeyValueTextInputFormat,
but the splits are based on N lines of input rather than Y
bytes of input.
– MultiFileInputFormat: An abstract class that lets the user
implement an input format that aggregates multiple files
into one split.
– SequenceFileInputFormat: The input file is a Hadoop
sequence file, containing serialized key/value pairs.
Setting the Output Parameters
• The framework requires that the output
parameters be configured, even if the job will
not produce any output.
• The framework will collect the output from
the specified tasks and place them into the
configured output directory.
A Simple Map Function:
IdentityMapper
A Simple Reduce Function:
IdentityReducer
Configuring the Reduce Phase
• The user must supply the framework with five
pieces of information:
– The number of reduce tasks; if zero, no reduce
phase is run
– The class supplying the reduce method
– The input key and value types for the reduce task;
by default, the same as the reduce output
– The output key and value types for the reduce
task
– The output file type for the reduce task output
How Many Maps?
• The number of maps is usually driven by the
total size of the inputs, that is, the total
number of blocks of the input files.
• The right level of parallelism for maps seems
to be around 10-100 maps per node.
• Task setup takes a while, so it is best if the maps
take at least a minute to execute.
• setNumMapTasks(int) can request more maps, but it
is only a hint to the framework.
Reducer
• Reducer reduces a set of intermediate values which
share a key to a smaller set of values.
• Reducer has 3 primary phases: shuffle, sort and reduce.
• Shuffle
– Input to the Reducer is the sorted output of the mappers.
In this phase the framework fetches the relevant partition
of the output of all the mappers, via HTTP.
• Sort
– The framework groups Reducer inputs by keys (since
different mappers may have output the same key) in this
stage
– The shuffle and sort phases occur simultaneously; while
map-outputs are being fetched they are merged.
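The partition, fetch and group steps can be imitated in plain Java. This is a rough sketch under stated assumptions: ShuffleSketch, pair and shuffleAndSort are hypothetical names, keys are plain Strings rather than Hadoop Text objects (so the hash values differ from a real job), and everything happens in memory instead of over HTTP. The partition formula is the one used by Hadoop's default HashPartitioner.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the shuffle and sort phases.
public class ShuffleSketch {

    // Helper that builds one (key, value) pair of map output.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<String, Integer>(key, value);
    }

    // Mask the sign bit of the hash code, then take the modulus.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Collect this reducer's partition from every mapper and group it by key, sorted.
    static SortedMap<String, List<Integer>> shuffleAndSort(
            List<List<Map.Entry<String, Integer>>> mapOutputs,
            int reducer, int numReduceTasks) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> p : output) {
                if (partition(p.getKey(), numReduceTasks) == reducer) {
                    groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = Arrays.asList(
                Arrays.asList(pair("a", 1), pair("b", 1)), // output of mapper 0
                Arrays.asList(pair("a", 1)));              // output of mapper 1
        // prints "reducer 0 -> {b=[1]}" and "reducer 1 -> {a=[1, 1]}"
        for (int r = 0; r < 2; r++) {
            System.out.println("reducer " + r + " -> " + shuffleAndSort(mapOutputs, r, 2));
        }
    }
}
```

Both mappers emitted the key "a", and both values end up in the same group on the same reducer, which is the guarantee the reduce method relies on.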
How Many Reduces?
• The right number of reduces seems to be 0.95 or
1.75 multiplied by (number of nodes *
mapred.tasktracker.reduce.tasks.maximum).
• With 0.95 all of the reduces can launch
immediately and start transferring map outputs
as the maps finish.
• With 1.75 the faster nodes will finish their first
round of reduces and launch a second wave of
reduces doing a much better job of load
balancing.
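The rule of thumb is easy to evaluate for a given cluster. The sketch below is only an illustration: ReduceCount and suggestedReduces are hypothetical names, the cluster size is made up, and the product is taken as the node count times the per-node value of mapred.tasktracker.reduce.tasks.maximum. The result would typically be passed to JobConf.setNumReduceTasks(int).

```java
// Hypothetical helper that evaluates the 0.95 / 1.75 rule of thumb.
public class ReduceCount {

    // factor: 0.95 for a single wave of reduces, 1.75 for two waves.
    // maxReduceTasksPerNode: value of mapred.tasktracker.reduce.tasks.maximum.
    static int suggestedReduces(double factor, int nodes, int maxReduceTasksPerNode) {
        // Truncate so that the 0.95 case never exceeds the available reduce slots.
        return (int) (factor * (nodes * maxReduceTasksPerNode));
    }

    public static void main(String[] args) {
        int nodes = 10;      // hypothetical cluster size
        int maxPerNode = 2;  // hypothetical per-node reduce slot limit
        System.out.println(suggestedReduces(0.95, nodes, maxPerNode)); // prints 19
        System.out.println(suggestedReduces(1.75, nodes, maxPerNode)); // prints 35
    }
}
```

With 20 reduce slots, 19 reduces all start in the first wave, while 35 reduces leave 15 for a second wave that the faster nodes pick up first.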
How Many Reduces?
• Increasing the number of reduces increases the
framework overhead, but improves load
balancing and lowers the cost of failures.
• Reducer NONE
– It is legal to set the number of reduce-tasks to zero if
no reduction is desired.
– In this case the outputs of the map-tasks go directly to
the FileSystem, into the output path set by
setOutputPath(Path).
– The framework does not sort the map-outputs before
writing them out to the FileSystem
Reporter
• Reporter is a facility for Map/Reduce
applications to report progress, set
application-level status messages and update
Counters.
• Mapper and Reducer implementations can
use the Reporter to report progress or just
indicate that they are alive.
JobTracker
• JobTracker is the central location for
submitting and tracking MR jobs in a network
environment.
• JobClient is the primary interface by which
user-job interacts with the JobTracker
– provides facilities to submit jobs, track their
progress, access component-tasks' reports and
logs, get the Map/Reduce cluster's status
information and so on.
Job Submission and Monitoring
• The job submission process involves:
– Checking the input and output specifications of
the job.
– Computing the InputSplit values for the job.
– Setting up the requisite accounting information
for the DistributedCache of the job, if necessary.
– Copying the job's jar and configuration to the
Map/Reduce system directory on the FileSystem.
– Submitting the job to the JobTracker and
optionally monitoring its status.
MapReduce Details for
Multimachine Clusters
Introduction
• Why?
– datasets that can’t fit on a single machine,
– have time constraints that are impossible to
satisfy with a small number of machines,
– need to rapidly scale the computing power
applied to a problem due to varying input set sizes.
Requirements for Successful
MapReduce Jobs
• Mapper
– ingest the input and process the input record, sending
forward the records that can be passed to the reduce task
or to the final output directly
• Reducer
– Accept the key and value groups that passed through the
mapper, and generate the final output
• The job must be configured with the location and type of
the input data, the mapper class to use, the number
of reduce tasks required, and the reducer class and
I/O types.
Requirements for Successful
MapReduce Jobs
• The TaskTracker service will actually run your
map and reduce tasks, and the JobTracker
service will distribute the tasks and their input
split to the various trackers.
• The cluster must be configured with the nodes
that will run the TaskTrackers, and with the
number of TaskTrackers to run per node.
Requirements for Successful
MapReduce Jobs
• Three levels of configuration to address to
configure MapReduce on your cluster
– configure the machines,
– the Hadoop MapReduce framework,
– the jobs themselves
Launching MapReduce Jobs
• launch the preceding example from the
command line
> bin/hadoop [-libjars jar1.jar,jar2.jar,jar3.jar] jar
myjar.jar MyClass
MapReduce-Specific Configuration
for Each Machine in a Cluster
• install any standard JARs that your application uses
• It is probable that your applications will have a
runtime environment that is deployed from a
configuration management application, which you
will also need to deploy to each machine.
• The machines will need to have enough RAM for the
Hadoop Core services plus the RAM required to run
your tasks.
• The conf/slaves file should have the set of machines
to serve as TaskTracker nodes
DistributedCache
• distributes application-specific, large, read-
only files efficiently
• a facility provided by the Map/Reduce
framework to cache files (text, archives, jars
and so on) needed by applications.
• The framework will copy the necessary files to
the slave node before any tasks for the job are
executed on that node
Adding Resources to the Task
Classpath
• Methods
– JobConf.setJar(String jar): Sets the user JAR for the
MapReduce job.
– JobConf.setJarByClass(Class cls): Determines the
JAR that contains the class cls and calls
JobConf.setJar(jar) with that JAR.
– DistributedCache.addArchiveToClassPath(Path
archive, Configuration conf): Adds an archive path
to the current set of classpath entries.
Configuring the Hadoop Core
Cluster Information
• Setting the Default File System URI
• You can also use the JobConf object to set the
default file system:
– conf.set( "fs.default.name",
"hdfs://NamenodeHostname:PORT");
Configuring the Hadoop Core
Cluster Information
• Setting the JobTracker Location
• use the JobConf object to set the JobTracker
information:
– conf.set( "mapred.job.tracker",
"JobtrackerHostname:PORT");