Creating Map-Reduce Programs Using Hadoop
         Presentation Overview
Recall Hadoop
Overview of the map-reduce paradigm
Elaboration on the WordCount example
  components of Hadoop that make WordCount
    possible
Major new example: N-Gram Generator
  step-by-step assembly of this map-reduce job
Design questions to ask when creating your own
 Hadoop jobs
       Recall why Hadoop rocks
Hadoop is:
  free and open source
  high quality, like all Apache Foundation projects
  cross-platform (pure Java)
  fault-tolerant
  highly scalable
  equipped with bindings for non-Java programming languages
  applicable to many computational problems
  Map-Reduce System Overview
JobTracker – Makes scheduling decisions
TaskTracker – Manages tasks for a given node
Task process
  Runs an individual map or reduce fragment for
   a given job
  Forks from the TaskTracker
   Map-Reduce System Overview
Processes communicate by custom RPC
 implementation
  Easy to change/extend
  Defined as Java interfaces
  Server objects implement the interface
  Client proxy objects automatically created
All messages originate at the client (e.g., Task to
  TaskTracker)
  Prevents cycles and therefore deadlocks
Process Flow Diagram
          Application Overview
Launching Program
  Creates a JobConf to define a job.
  Submits JobConf to JobTracker and waits for
   completion.
Mapper
  Is given a stream of key1,value1 pairs
  Generates a stream of key2, value2 pairs
Reducer
  Is given a key2 and a stream of value2's
  Generates a stream of key3, value3 pairs
      Job Launch Process: Client

Client program creates a JobConf
Identify classes implementing Mapper and Reducer
interfaces
  JobConf.setMapperClass(); JobConf.setReducerClass()
Specify input and output formats
  JobConf.setInputFormat(TextInputFormat.class);
  JobConf.setOutputFormat(TextOutputFormat.class);
Other options too:
  JobConf.setNumReduceTasks()
  JobConf.setOutputFormat()
  Many, many more (Facade pattern)
     An onslaught of terminology
We'll explain these terms, each of which plays a
 role in any non-trivial map/reduce job:
  InputFormat, OutputFormat, FileInputFormat, ...
  JobClient and JobConf
  JobTracker and TaskTracker
  TaskRunner, MapTaskRunner, MapRunner, ...
  InputSplit, RecordReader, LineRecordReader, ...
  Writable, WritableComparable, IntWritable, ...
   InputFormat and OutputFormat
The application also chooses input and output formats,
  which define how the persistent data is read and
  written. These are interfaces and can be defined by
  the application.
InputFormat
  Splits the input to determine the input to each map
   task.
  Defines a RecordReader that reads key, value pairs
   that are passed to the map task
OutputFormat
  Given the key, value pairs and a filename, writes the
   reduce output to persistent storage
                   Example
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}
    Job Launch Process: JobClient
Pass JobConf to JobClient.runJob() or
JobClient.submitJob()
runJob() blocks – waits until the job finishes
submitJob() does not
Poll for status to make running decisions
Avoid polling with JobConf.setJobEndNotificationURI()

JobClient:
Determines proper division of input into InputSplits
Sends job data to master JobTracker server
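
Returning to runJob() vs. submitJob(): a minimal sketch of both launch styles against the classic org.apache.hadoop.mapred API (error handling omitted; the class name and polling interval are arbitrary, not from the original deck):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RunningJob;

  public class LaunchStyles {
    // Blocking launch: returns when the job finishes, throws if it fails.
    public static void blockingLaunch(JobConf conf) throws Exception {
      JobClient.runJob(conf);
    }

    // Non-blocking launch: submit, keep the RunningJob handle, and poll it.
    public static void nonBlockingLaunch(JobConf conf) throws Exception {
      JobClient client = new JobClient(conf);
      RunningJob job = client.submitJob(conf);
      while (!job.isComplete()) {
        Thread.sleep(5000);                 // poll every 5 seconds
      }
      System.out.println("Job succeeded? " + job.isSuccessful());
    }
  }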
   Job Launch Process: JobTracker

JobTracker:
Inserts jar and JobConf (serialized to XML) in shared
location
Posts a JobInProgress to its run queue
   Job Launch Process: TaskTracker

TaskTrackers running on slave nodes periodically
query JobTracker for work
Retrieve job-specific jar and config
Launch task in separate instance of Java
main() is provided by Hadoop
       Job Launch Process: Task

TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads XML configuration
Connects back to necessary MapReduce components
via RPC
Uses TaskRunner to launch user process
    Job Launch Process: TaskRunner

TaskRunner, MapTaskRunner, MapRunner work in
a daisy-chain to launch your Mapper
Task knows ahead of time which InputSplits it should be
mapping
Calls Mapper once for each record retrieved from the
InputSplit
Running the Reducer is much the same
             Creating the Mapper

You provide the instance of Mapper
Should extend MapReduceBase
Implement interface Mapper<K1,V1,K2,V2>
One instance of your Mapper is initialized by the
MapTaskRunner for a TaskInProgress
Exists in separate process from all other instances of
Mapper – no data sharing!
                    Mapper
Override function – map()
void map(WritableComparable key,
              Writable value,
              OutputCollector output,
              Reporter reporter)
Emit (k2,v2) with output.collect(k2, v2)
                  Example
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    // Tokenize the line and emit (word, 1) for every token.
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
             What is Writable?

Hadoop defines its own "box" classes for strings
(Text), integers (IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
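
When the built-in box classes don't fit your keys or values, you can roll your own. A minimal sketch of a custom WritableComparable (the class name and fields are illustrative, not part of Hadoop):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.WritableComparable;

  // Hypothetical composite key: a word plus the year it was observed.
  public class WordYear implements WritableComparable<WordYear> {
    private String word = "";
    private int year;

    public void write(DataOutput out) throws IOException {
      out.writeUTF(word);
      out.writeInt(year);
    }

    public void readFields(DataInput in) throws IOException {
      word = in.readUTF();
      year = in.readInt();
    }

    public int compareTo(WordYear other) {
      int c = word.compareTo(other.word);
      if (c != 0) return c;
      return (year < other.year) ? -1 : ((year == other.year) ? 0 : 1);
    }

    // hashCode() matters too: the default HashPartitioner uses it to pick a reducer.
    public int hashCode() {
      return word.hashCode() * 31 + year;
    }
  }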
                    Reading data

Data sets are specified by InputFormats
Defines input data (e.g., a directory)‫‏‬
Identifies partitions of the data that form an InputSplit
Factory for RecordReader objects to extract (k, v) records
from the input source
      FileInputFormat and friends

TextInputFormat – Treats each '\n'-terminated line
of a file as a value
KeyValueTextInputFormat – Maps '\n'-terminated
text lines of "k SEP v"
SequenceFileInputFormat – Binary file of (k, v)
pairs with some add'l metadata
SequenceFileAsTextInputFormat – Same, but
maps (k.toString(), v.toString())
             Filtering File Inputs

FileInputFormat will read all files out of a
specified directory and send them to the mapper
Delegates filtering this file list to a method
subclasses may override
e.g., Create your own "xyzFileInputFormat" to read *.xyz
from directory list
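
A rough sketch of such a subclass, assuming a 0.20-era mapred API where the overridable file-listing hook is the protected listStatus(JobConf) method (older releases named this hook differently, so check your version; the class name is the slide's "xyz" example made concrete):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;

  // Only feed *.xyz files from the input directories to the mappers.
  public class XyzFileInputFormat extends TextInputFormat {
    protected FileStatus[] listStatus(JobConf job) throws IOException {
      FileStatus[] all = super.listStatus(job);
      List<FileStatus> keep = new ArrayList<FileStatus>();
      for (FileStatus f : all) {
        if (f.getPath().getName().endsWith(".xyz")) {
          keep.add(f);
        }
      }
      return keep.toArray(new FileStatus[keep.size()]);
    }
  }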
                Record Readers
Without a RecordReader, Hadoop would be forced
to divide input on byte boundaries.
Each InputFormat provides its own RecordReader
implementation
This provides capability multiplexing – each format can parse its records its own way
LineRecordReader – Reads a line from a text file
KeyValueRecordReader – Used by
KeyValueTextInputFormat
                 Input Split Size

FileInputFormat will divide large files into chunks
Exact size controlled by mapred.min.split.size
RecordReaders receive file, offset, and length of
chunk
Custom InputFormat implementations may
override split size – e.g., "NeverChunkFile"
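
For example, a sketch of such a "never chunk" format, using the same isSplitable() hook that the ParagraphInputFormat later in this deck overrides (the class name is the slide's example made concrete):

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // Each file becomes exactly one InputSplit, no matter how large it is.
  public class NeverChunkFileInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }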
        Sending Data To Reducers

Map function receives OutputCollector object
OutputCollector.collect() takes (k, v) elements
Any (WritableComparable, Writable) can be used
             WritableComparator

Compares WritableComparable data
Will call WritableComparable.compareTo()
Can provide fast path for serialized data
Explicitly stated in JobConf setup
JobConf.setOutputValueGroupingComparator()
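
A hedged sketch of what such a comparator can look like; the fast path compares raw serialized bytes for Text keys without deserializing them (illustrative only – Text already ships with an optimized comparator, so you would normally write one of these for your own key class):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparator;
  import org.apache.hadoop.io.WritableUtils;

  // Compares serialized Text keys directly on their bytes.
  public class RawTextComparator extends WritableComparator {
    public RawTextComparator() {
      super(Text.class);
    }

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      // Skip each key's variable-length length prefix, then compare the payload bytes.
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
    }
  }

Register it with JobConf.setOutputKeyComparatorClass() if it defines the sort order, or with setOutputValueGroupingComparator() if it controls how values are grouped, as the slide notes.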
        Sending Data To The Client

Reporter object sent to Mapper allows simple
asynchronous feedback
incrCounter(Enum key, long amount)
setStatus(String msg)
Allows self-identification of input
InputSplit getInputSplit()
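
A minimal sketch of a mapper using the Reporter (the RecordTypes counter enum and the empty-record check are made up for illustration):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class ReportingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    // Hypothetical counters, aggregated by the framework and visible to the client.
    public static enum RecordTypes { EMPTY, NONEMPTY }

    public void map(LongWritable key, Text value,
        OutputCollector<Text, LongWritable> output,
        Reporter reporter) throws IOException {
      if (value.toString().trim().length() == 0) {
        reporter.incrCounter(RecordTypes.EMPTY, 1);
        return;
      }
      reporter.incrCounter(RecordTypes.NONEMPTY, 1);
      // Self-identification: report which InputSplit we are working on.
      reporter.setStatus("processing " + reporter.getInputSplit());
      output.collect(value, key);
    }
  }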
                    Partitioner

int getPartition(key, val, numPartitions)
Outputs the partition number for a given key
One partition == values sent to one Reduce task
HashPartitioner used by default
Uses key.hashCode() to return partition num
JobConf sets Partitioner implementation
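
A sketch of a custom Partitioner against the classic mapred API (the class name and routing rule are illustrative); it routes keys by their first character instead of by hash:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Routes words (or N-grams) to reduce tasks by the first letter of the key.
  public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf conf) {
      // No per-job setup needed for this example.
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
      String s = key.toString();
      int firstChar = (s.length() > 0) ? Character.toLowerCase(s.charAt(0)) : 0;
      return firstChar % numPartitions;
    }
  }

Install it with conf.setPartitionerClass(FirstLetterPartitioner.class); it then replaces the default HashPartitioner.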
                    Reducer

reduce( WritableComparable key,
              Iterator values,
              OutputCollector output,
              Reporter reporter)
Keys & values sent to one partition all go to the
same reduce task
Calls are sorted by key – "earlier" keys are
reduced and output before "later" keys
                   Example
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    // Sum the 1s emitted by the mapper for this word.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
               OutputFormat

Analogous to InputFormat
TextOutputFormat – Writes "key val\n" strings to
output file
SequenceFileOutputFormat – Uses a binary format
to pack (k, v) pairs
NullOutputFormat – Discards output
         Presentation Overview
Recall Hadoop
Overview of the map-reduce paradigm
Elaboration on the WordCount example
  components of Hadoop that make WordCount
    possible
Major new example: N-Gram Generator
  step-by-step assembly of this map-reduce job
Design questions to ask when creating your own
 Hadoop jobs
Major example: N-Gram Generation
N-gram generation is a common natural language
 processing technique (used by Google, etc.)
An N-gram is a subsequence of N items from a given
 sequence (e.g., a subsequence of words in a
 given text)
Example 3-grams (from Google) with
 corresponding occurrences:
  ceramics collectables collectibles (55)
  ceramics collected by (52)
  ceramics collectibles cooking (45)
     Understanding the process
Someone wise said, "A week of writing code
 saves an hour of research."
Before embarking on developing a Hadoop job,
 walk through the process step by step manually
 and understand the flow and manipulation of
 data.
Once you can comfortably (and deterministically!)
 do it mentally, begin writing code.
                Requirements
Input:
  a beginning word/phrase
  n-gram size (bigram, trigram, n-gram)
  the minimum number of occurrences (frequency)
  whether letter case matters
Output: all possible n-grams that occur sufficiently
 frequently.
     High-level view of data flow
Given: one or more files containing regular text.
Look for the desired startword. If seen, take the
 next N-1 words and add the group to the
 database.
As in WordCount, count the number of
  occurrences of each N-gram.
Remove those N-grams that do not occur
 frequently enough for our liking.
                   Follow along
The N-grams implementation exists and is ready
 for your perusal.
Grab it:
  if you use Git revision control:
     git clone git://git.qnan.org/pmw/hadoop-ngram
  to get the files with your browser, go to:
     http://www.qnan.org/~pmw/software/hadoop-ngram
We used Project Gutenberg ebooks as input.
                     Follow along
Start Hadoop
  bin/start-all.sh
Grab the NGram code and build it:
  Type "ant" and all will be built
  Look at the README to see how to run it.
Load some text files into your HDFS
  good source: http://www.gutenberg.org
Run it yourself (or see me do it) before we
 proceed.
     Can we just use WordCount?
We have the WordCount example that does a similar
 thing. But there are differences:
  We don't want to count the number of times our
   startword appears; we want to capture the
   subsequent words too.
  A more subtle problem is that WordCount maps one
    line at a time. That's a problem if we want 3-grams
    with a startword of "pillows" in the book containing
    this:
     The guests stayed in the guest bedroom; the pillows were
     delightfully soft and had a faint scent of mint.
Still, WordCount is a good foundation for our code.
           Steps we must perform
Read our text in paragraphs rather than in discrete lines:
 RecordReader
 InputFormat
Develop the mapper and reducer classes:
 first mapper: find startword, get the next N-1 words, and
   return <N-gram, 1>
 first reducer: sum the number of occurrences of each N-
   gram
 second mapper: no action
 second reducer: discard N-grams that are too rare
Driver program
          A new RecordReader
Ours must implement RecordReader<K, V>
  Contain certain functions: createKey(), createValue(),
    getPos(), getProgress(), next()
Hadoop offers a LineRecordReader but no
 support for Paragraphs
We'll need a ParagraphRecordReader
  Use Delegation Pattern instead of extending
   LineRecordReader. We couldn't extend it because
   it has private elements.
  Create new next() function
// Accumulate consecutive non-empty lines (joined by spaces) into one
// "paragraph" value; an empty line or end-of-split ends the paragraph.
public synchronized boolean next(LongWritable key, Text value) throws IOException {
  Text linevalue = new Text();
  boolean appended, gotsomething;
  boolean retval;
  byte space[] = {' '};

  value.clear();
  gotsomething = false;
  do {
    appended = false;
    retval = lrr.next(key, linevalue);
    if (retval) {
      if (linevalue.toString().length() > 0) {
        byte[] rawline = linevalue.getBytes();
        int rawlinelen = linevalue.getLength();
        value.append(rawline, 0, rawlinelen);
        value.append(space, 0, 1);
        appended = true;
      }
      gotsomething = true;
    }
  } while (appended);

  return gotsomething;
}
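
The remaining RecordReader methods can simply delegate to the wrapped LineRecordReader. A rough sketch of that scaffolding (field and constructor shapes assumed; together with the next() above it completes the class, and the repository code is the reference):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.LineRecordReader;
  import org.apache.hadoop.mapred.RecordReader;

  public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {
    private LineRecordReader lrr;   // the delegate that actually reads lines

    public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
      lrr = new LineRecordReader(conf, split);
    }

    public LongWritable createKey() { return lrr.createKey(); }
    public Text createValue() { return lrr.createValue(); }
    public long getPos() throws IOException { return lrr.getPos(); }
    public float getProgress() throws IOException { return lrr.getProgress(); }
    public void close() throws IOException { lrr.close(); }

    // public synchronized boolean next(LongWritable key, Text value)
    //   – the paragraph-building logic shown above.
  }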
           A new InputFormat
Given to the JobTracker during execution
getRecordReader method
  This is why we need a new InputFormat
  Must return our ParagraphRecordReader
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text>
  implements JobConfigurable {

    private CompressionCodecFactory compressionCodecs = null;

    public void configure(JobConf conf) {
      compressionCodecs = new CompressionCodecFactory(conf);
    }

    protected boolean isSplitable(FileSystem fs, Path file) {
      return compressionCodecs.getCodec(file) == null;
    }

  public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
      JobConf job, Reporter reporter) throws IOException {

        reporter.setStatus(genericSplit.toString());
        return new ParagraphRecordReader(job, (FileSplit) genericSplit);
    }
}
       First stage: "Find" Mapper
Define the startword at startup
Each time map is called we parse an entire
 paragraph and output matching N-Grams
Tell Reporter how far done we are to track
 progress
Output <N-Gram, 1> like WordCount
  output.collect(ngram, new IntWritable(1));
This last part is important... next slide explains.
   Importance of "output.collect()"
Remember Hadoop's data type model:
  map: (K1, V1) → list(K2, V2)
This means that for every single (K1, V1) tuple,
 the map stage can output zero, one, two, or any
 other number of tuples, and they don't have to
 match the input at all.
Example:
  output.collect(ngram, new IntWritable(1));
  output.collect(new Text("good-ol'-" + ngram), new IntWritable(0));
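
Putting the pieces together, here is a simplified sketch of the Find mapper (illustrative only, not the repository's exact code; it handles a single start word and assumes the configure() shown on the next slide has already set desiredPhrase, Nvalue, and caseSensitive):

  // Imports as in the WordCount Map example (org.apache.hadoop.io.*, org.apache.hadoop.mapred.*).
  public static class FindMapperSketch extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private String desiredPhrase;     // filled in by configure(), see next slide
    private int Nvalue;
    private boolean caseSensitive;

    public void map(LongWritable key, Text paragraph,
        OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      String text = caseSensitive ? paragraph.toString()
                                  : paragraph.toString().toLowerCase();
      String target = caseSensitive ? desiredPhrase : desiredPhrase.toLowerCase();
      String[] words = text.split("\\s+");

      for (int i = 0; i < words.length; i++) {
        if (!words[i].equals(target)) continue;
        if (i + Nvalue > words.length) break;   // paragraph ended early: a partial N-gram
        StringBuilder ngram = new StringBuilder(words[i]);
        for (int j = 1; j < Nvalue; j++) {
          ngram.append(' ').append(words[i + j]);
        }
        output.collect(new Text(ngram.toString()), new IntWritable(1));
      }
    }
  }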
                         Find Mapper
Our mapper must have a configure() method
       public void configure(JobConf conf) {
          desiredPhrase = conf.get("mapper.desired-phrase");
          Nvalue = conf.getInt("mapper.N-value", 3);
          caseSensitive = conf.getBoolean("mapper.case-sensitive", false);
       }
We can pass primitives through JobConf
             "Find" Reducer
Like WordCount example
Sum all the numbers matching our N-Gram
Output <N-Gram, # of occurrences>
   Second stage: "Prune" Mapper
Parse line from previous output and divide into
 Key/Value pairs


             "Prune" Reducer
This way we can sort our elements by frequency
If this N-Gram occurs fewer times than our
   minimum, trim it out
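
One plausible reading of the Prune stage, sketched against the Find job's text output (lines of "ngram TAB count"); the repository code is the authority and may differ in detail:

  // Imports as in the WordCount examples (org.apache.hadoop.io.*, org.apache.hadoop.mapred.*).
  public static class PruneMapperSketch extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text line,
        OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      // The Find job wrote "ngram \t count"; split it back apart.
      String[] parts = line.toString().split("\t");
      if (parts.length == 2) {
        output.collect(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
      }
    }
  }

  public static class PruneReducerSketch extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    private int minFreq;

    public void configure(JobConf conf) {
      minFreq = conf.getInt("reducer.min-freq", 1);   // set by the driver (see Prune JobConf)
    }

    public void reduce(Text ngram, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      int total = 0;
      while (counts.hasNext()) {
        total += counts.next().get();
      }
      if (total >= minFreq) {                         // drop N-grams that are too rare
        output.collect(ngram, new IntWritable(total));
      }
    }
  }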
    Piping data between M/R jobs
How does the "Find" map/reduce job pass its
 results to the "Prune" map/reduce job?
I create a temporary file within HDFS. This
  temporary file is used as the output of Find and
  the input of Prune.
At the end, I delete the temporary file.
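
In the driver, that plumbing can look roughly like this (a sketch, not the repository's exact code; the temp path name and cleanup placement are assumptions, with imports from org.apache.hadoop.fs and java.util assumed):

  // Scratch directory in HDFS that carries Find's output into Prune.
  Path tempDir = new Path("ngram-temp-" +
      Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);    // Find writes here
  FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);    // Prune reads from here

  JobClient.runJob(ngram_find_conf);
  JobClient.runJob(ngram_prune_conf);

  // Remove the intermediate data once both jobs have finished.
  FileSystem.get(ngram_find_conf).delete(tempDir, true);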
                Counters
The N-Gram generator has one programmer-
 defined counter: the number of
 partial/incomplete N-grams. These occur when
 a paragraph ends before we can read N-1
 subsequent words.
We can add as many counters as we want.
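
Defining and reading such a counter might look like this (the enum name is hypothetical; the Counters and RunningJob accessors are from the classic mapred API):

  // In the mapper class:
  public static enum NGramCounters { PARTIAL_NGRAMS }
  // ... and inside map(), when a paragraph ends before N words are available:
  //     reporter.incrCounter(NGramCounters.PARTIAL_NGRAMS, 1);

  // In the driver, after the job completes:
  RunningJob job = JobClient.runJob(ngram_find_conf);
  long partial = job.getCounters().getCounter(NGramCounters.PARTIAL_NGRAMS);
  System.out.println("Partial/incomplete N-grams: " + partial);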
                            JobConf
We need to set everything up
  2 jobs executing in series: Find and Prune
      JobConf ngram_find_conf = new JobConf(getConf(),
  NGram.class),
        ngram_prune_conf = new JobConf(getConf(), NGram.class);
User inputs parameters
  Starting N-Gram word/phrase
  N-Gram size
  Minimum frequency for pruning
                         Find JobConf
Now we can plug everything in:
         ngram_find_conf.setJobName("ngram-find");
         ngram_find_conf.setInputFormat(ParagraphInputFormat.class);
         ngram_find_conf.setOutputKeyClass(Text.class);
         ngram_find_conf.setOutputValueClass(IntWritable.class);
         ngram_find_conf.setMapperClass(FindJob_MapClass.class);
         ngram_find_conf.setReducerClass(FindJob_ReduceClass.class);

Also pass input parameters
 ngram_find_conf.set("mapper.desired-phrase", args.get(2), true));
 ngram_find_conf.setInt("mapper.N-value", new Integer(other_args.get(3)).intValue());
 ngram_find_conf.setBoolean("mapper.case-sensitive", caseSensitive);


And point to our input and output files
       FileInputFormat.setInputPaths(ngram_find_conf,
  other_args.get(0));
       FileOutputFormat.setOutputPath(ngram_find_conf, tempDir);
                         Prune JobConf
  Perform setup as before
            ngram_prune_conf.setJobName("ngram-prune");
            ngram_prune_conf.setInt("reducer.min-freq", min_freq);
            ngram_prune_conf.setOutputKeyClass(Text.class);
            ngram_prune_conf.setOutputValueClass(IntWritable.class);
            ngram_prune_conf.setMapperClass(PruneJob_MapClass.class);
            ngram_prune_conf.setReducerClass(PruneJob_ReduceClass.class);

  We need to point our inputs to the outputs of the
   previous job
    FileInputFormat.setInputPaths(ngram_prune_conf, tempDir);
    FileOutputFormat.setOutputPath(ngram_prune_conf,
        new Path(other_args.get(1)));
                      Execute Jobs
Run as blocking process with runJob
  Batch processing is done in series
  JobClient.runJob(ngram_find_conf);
  JobClient.runJob(ngram_prune_conf);
        Design questions to ask
From where will my input come?
  InputFormat (e.g., FileInputFormat)
How is my input structured?
  RecordReader
(There are already several common InputFormats and
  RecordReaders. Don't reinvent the wheel.)
Mapper and Reducer classes
  Do Key (WritableComparable) and Value (Writable)
   classes exist?
        Design questions to ask
Do I need to count anything while the job is in
 progress?
Where is my output going?
Executor class
  What information do my map/reduce classes need?
   Must I block, waiting for job completion? Set
   FileFormat?

				