
Map-Reduce, Hadoop
           Presentation Overview
What is map-reduce?
  input/output data types
  why is it useful and where is it used?
Execution overview
Features
  fault tolerance
  ordering guarantee
  other perks and bonuses
Hands-on demonstration and follow-along
Map-reduce-merge
          What is map-reduce?
Map-reduce is a programming model (and an
 associated implementation) for processing and
 generating large data sets.
It consists of two steps: map and reduce.
The “map” step takes an input key/value pair and
 produces a list of intermediate key/value pairs.
The “reduce” step takes an intermediate key and the
 list of that key's values, and produces the final
 list of output values.
                       Types
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)
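As a concrete instance of these types (the word-count example worked through later):
  map: (line offset, line text) → list(word, 1)
  reduce: (word, [1, 1, …]) → [total count]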
            Why is this useful?
Map-reduce jobs are automatically parallelized.
Partial failure of the processing cluster is
 expected and tolerable.
Redundancy and fault tolerance are built in, so the
 programmer doesn't have to worry about them.
It scales very well.
Many jobs are naturally expressible in the
 map/reduce paradigm.
          What are some uses?
Word count
  map: <word, 1>. reduce: <word, #>
Grep
  map: <file, line>. reduce: identity
Inverted index
  map: <word, docID>. reduce: <word, list(docID)>
Distributed sort (special case)
  map: <key, record>. reduce: identity
Users: Google, Yahoo!, Amazon, Facebook, etc.
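To make the inverted-index pattern concrete, here is a rough sketch against
Hadoop's old (0.18-era) Java API. It assumes the input format hands the mapper
<docID, document text> pairs; that choice, and the class names, are
illustrative rather than fixed by map-reduce itself.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertedIndex {

  // Map: for every word in a document, emit <word, docID>.
  public static class Map extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text docId, Text body,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      StringTokenizer words = new StringTokenizer(body.toString());
      while (words.hasMoreTokens()) {
        out.collect(new Text(words.nextToken()), docId);
      }
    }
  }

  // Reduce: concatenate the docIDs seen for a word into one posting list.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterator<Text> docIds,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      StringBuilder postings = new StringBuilder();
      while (docIds.hasNext()) {
        if (postings.length() > 0) postings.append(",");
        postings.append(docIds.next().toString());
      }
      out.collect(word, new Text(postings.toString()));
    }
  }
}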
           Presentation Overview
What is map-reduce?
  input/output data types
  why is it useful and where is it used?
Execution overview
Features
  fault tolerance
  ordering guarantee
  other perks and bonuses
Hands-on demonstration and follow-along
Map-reduce-merge
       Execution overview: map
The user begins a map-reduce job. One of the
 machines becomes the master.
Partition the input into M splits (16-64 MB each)
 and distribute them among the machines. Each worker
 reads its split and begins work. Upon
 completion, the worker notifies the master.
The master partitions the intermediate keyspace
 into R pieces with a partitioning function.
     Execution overview: reduce
When a reduce worker is notified about a job, it
 uses RPC to read the intermediate data from the
 mappers' local disks, then sorts it by key.
The reducer processes its job, then writes its
 output to the final output file for its reduce
 partition.
When all reducers are finished, the master wakes
 up the user program.
           What are M and R?
M is the number of map pieces. R is the number
 of reduce pieces.
Ideally, M and R are much larger than the number
  of workers. This allows one machine to perform
  many different tasks, which improves load
  balancing and speeds up recovery.
The master makes O(M+R) scheduling decisions
 and keeps O(M*R) states in memory.
At least R files end up being written.
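For a rough sense of scale (the numbers here are made up purely for
illustration): a 10 GB input split into 64 MB pieces gives M ≈ 160 map tasks;
with 20 worker machines, one might pick R in the range of 50-100 so that each
machine still handles several reduce tasks.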
        Example: counting words
We have UTD's fight song:
  C-O-M-E-T-S! Go!
  Green, Orange, White!
  Comets! Go!
  Strong of will, we fight for right!
  Let's all show our comet might!
We want to count the number of occurrences of
 each word.
The next slides show the map and reduce
 phases.
                First stage: map
Go through the input, and for each word return a
 tuple of (<word>, 1).
Output:
  <C-O-M-E-T-S!, 1>
  <Go!, 1>
  <Green,, 1>
  <Orange,, 1>
  <White!, 1>
  <Comets!, 1>
  <Go!, 1>
  <Strong, 1>
     Between map and reduce...
Between the mapper and the reducer, some
 gears turn within Hadoop, and it groups
 identical keys and sorts by key before starting
 the reducer.
Here's the output:
  <C-O-M-E-T-S!, [1]>
  <Comets!, [1]>
  <Go!, [1,1]>
  <Green,, [1]>
  <Orange,, [1]>
  <Strong, [1]>
         Second stage: reducer
The reducer receives the content, one key/value-list
 pair at a time, and does its own processing.
For wordcount, it sums the values in each list.
Here's the output:
  <C-O-M-E-T-S!, 1>
  <Go!, 2>
  <Green,, 1>
  <Orange,, 1>
  …
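Here is a sketch of how this word-count job might be written against Hadoop's
old (0.18-era) Java API. The class names are our own, and the job wiring
(input/output paths, combiner) is shown later in the combiner section.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map: for every word in the input line, emit <word, 1>.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  // Reduce: sum the list of 1s for each word and emit <word, total>.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(word, new IntWritable(sum));
    }
  }
}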
      How can we improve our wordcount?

 Also, any questions?
           Presentation Overview
What is map-reduce?
  input/output data types
  why is it useful and where is it used?
Execution overview
Features
  fault tolerance
  ordering guarantee
  other perks and bonuses
Hands-on demonstration and follow-along
Map-reduce-merge
               Fault tolerance
Worker failure is expected. If a worker fails
 during the map phase, its map tasks are reassigned
 to another worker. If a map worker fails after the
 reduce phase has begun, its completed map tasks are
 re-executed as well, since their output lived on
 that worker's local disk.
Master failure is not expected, though
 checkpointing can be used for recovery.
If a particular record causes the mapper or
   reducer to reliably crash, the map-reduce
   system can figure this out, skip the record, and
   proceed.
          Ordering guarantee
The implementation of map-reduce guarantees
 that within a given partition, the intermediate
 key/value pairs are processed in increasing key
 order.
This means that each reduce partition ends up
 with an output file sorted by key.
           Partitioning function
By default, intermediate keys are distributed
 evenly across reduce tasks using a
 hash(intermediate key) mod R partitioning function.
You can specify a custom partitioning function.
Useful for locality reasons, such as if the key is a
 URL and you want all URLs belonging to a
 single host to be processed on a single
 machine.
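A sketch of what such a partitioner could look like in the old Hadoop API
(the class name and the use of Text keys and values are assumptions made for
illustration); it would be registered on the job with
conf.setPartitionerClass(HostPartitioner.class).

import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Route every URL with the same host name to the same reduce partition.
public class HostPartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) { }

  public int getPartition(Text key, Text value, int numPartitions) {
    try {
      String host = new URL(key.toString()).getHost();
      return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    } catch (MalformedURLException e) {
      // Fall back to hashing the whole key if it isn't a valid URL.
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }
}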
            Combiner function
After a map phase, the mapper transmits its entire
 intermediate output over the network to the
 reducers.
This output often contains many repeated keys, so
 it can be shrunk considerably before it is sent.
The user can specify a combiner function. It's
 just like a reduce function, except it runs on the
 mapper's output before that output is shipped to
 the reducers.
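A sketch of a job driver that wires in a combiner, assuming the WordCount.Map
and WordCount.Reduce classes sketched earlier; the paths and the driver class
itself are illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCount.Map.class);
    // The reducer doubles as the combiner: it pre-sums counts on each mapper,
    // so far fewer <word, 1> pairs cross the network.
    conf.setCombinerClass(WordCount.Reduce.class);
    conf.setReducerClass(WordCount.Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path("/book.txt"));
    FileOutputFormat.setOutputPath(conf, new Path("/wc-result"));
    JobClient.runJob(conf);
  }
}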
                  Counters
A counter can be associated with any action that
  a mapper or a reducer does. This is in addition
  to default counters such as the number of input
  and output key/value pairs processed.
A user can watch the counters in real time to see
  the progress of a job.
When the map/reduce job finishes, these
 counters are provided to the user program.
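A sketch of how a user-defined counter might be bumped from inside a mapper in
the old Hadoop API; the counter enum and the empty-line condition are made up
for illustration.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // User-defined counters show up next to Hadoop's built-in ones.
  public enum LineCounters { EMPTY_LINES }

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    String text = line.toString();
    if (text.trim().length() == 0) {
      reporter.incrCounter(LineCounters.EMPTY_LINES, 1);
      return;
    }
    StringTokenizer tokens = new StringTokenizer(text);
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      out.collect(word, ONE);
    }
  }
}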
           Presentation Overview
What is map-reduce?
  input/output data types
  why is it useful and where is it used?
Execution overview
Features
  fault tolerance
  ordering guarantee
  other perks and bonuses
Hands-on demonstration and follow-along
Map-reduce-merge
    What is Hadoop?
Hadoop is the implementation of the map/reduce
 design that we will use.
Hadoop is released under the Apache License
 2.0, so it's open source.
Hadoop uses the Hadoop Distributed File
 System, HDFS. (In contrast to what we've seen
 with Lucene.)
Get the release from:
  http://hadoop.apache.org/core/
 Preparing Hadoop on your system
Configure passwordless public-key SSH on
localhost
Configure Hadoop:
 look at the two configuration files at
   http://utdallas.edu/~pmw033000/hadoop/
Format the HDFS:
 bin/hadoop namenode -format
Start Hadoop:
 cd <hadoop-dir>
 bin/start-all.sh (and wait ≈20 seconds)
               Example: grep
Standard Unix 'grep' behavior: run it on the
 command line with the search string as the first
 argument and the list of files or directories as
 the subsequent argument(s).


$ grep HelloWorld file1.c file2.c file3.c
file2.c:System.out.println("I say HelloWorld!");
$
   Preparing for 'grep' in Hadoop
Hadoop's jobs always operate within the HDFS.
Hadoop will read its input from HDFS, and will
 write its output to HDFS.
Thus, to prepare:
  Download a free electronic book:
  http://utdallas.edu/~pmw033000/hadoop/book.txt
  Load the file into HDFS:
     bin/hadoop fs -copyFromLocal book.txt /book.txt
      Using 'grep' within Hadoop
bin/hadoop jar hadoop-0.18.2-examples.jar \
 grep /book.txt /grep-result "search string"
bin/hadoop fs -ls /grep-result
bin/hadoop fs -cat /grep-result/part-00000


A good string to try: "Horace de \S+"
         How 'grep' in Hadoop works
The program runs two map/reduce jobs in sequence. The first job counts how
  many times a matching string occurred and the second job sorts matching
  strings by their frequency and stores the output in a single output file.
Each mapper of the first job takes a line as input and matches the user-
  provided regular expression against the line. It extracts all matching strings
  and emits (matching string, 1) pairs. Each reducer sums the frequencies of
  each matching string. The output is sequence files containing the matching
  string and count. The reduce phase is optimized by running a combiner that
  sums the frequency of strings from local map output. As a result it reduces
  the amount of data that needs to be shipped to a reduce task.
The second job takes the output of the first job as input. The mapper is an
  inverse map (it swaps the key and value), while the reducer is an identity
  reducer. The number of reducers is one, so the output is stored in one file,
  sorted by count in descending order. The output file is text, and each line
  contains a count and the matching string.
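A condensed sketch of that two-job chain against the old Hadoop API. It leans
on library classes shipped with Hadoop (RegexMapper, LongSumReducer,
InverseMapper, IdentityReducer); the temporary path, and the assumption that
RegexMapper picks up its pattern from the mapred.mapper.regex property, are
how we believe the bundled grep example is wired, so treat the details as
approximate.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.InverseMapper;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.RegexMapper;

public class GrepSketch {
  public static void main(String[] args) throws Exception {
    Path tmp = new Path("/grep-temp");   // intermediate output between the jobs

    // Job 1: count how many times each matching string occurs.
    JobConf countJob = new JobConf(GrepSketch.class);
    countJob.setJobName("grep-count");
    countJob.set("mapred.mapper.regex", args[2]);     // pattern read by RegexMapper (assumed key)
    countJob.setMapperClass(RegexMapper.class);       // emits (matching string, 1)
    countJob.setCombinerClass(LongSumReducer.class);  // pre-sum counts on each mapper
    countJob.setReducerClass(LongSumReducer.class);
    countJob.setOutputKeyClass(Text.class);
    countJob.setOutputValueClass(LongWritable.class);
    countJob.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(countJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(countJob, tmp);
    JobClient.runJob(countJob);

    // Job 2: swap to (count, string), sort descending, use a single reducer.
    JobConf sortJob = new JobConf(GrepSketch.class);
    sortJob.setJobName("grep-sort");
    sortJob.setInputFormat(SequenceFileInputFormat.class);
    sortJob.setMapperClass(InverseMapper.class);      // (string, count) -> (count, string)
    sortJob.setReducerClass(IdentityReducer.class);
    sortJob.setNumReduceTasks(1);                     // one sorted output file
    sortJob.setOutputKeyClass(LongWritable.class);
    sortJob.setOutputValueClass(Text.class);
    sortJob.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);
    FileInputFormat.setInputPaths(sortJob, tmp);
    FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
    JobClient.runJob(sortJob);
  }
}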
    Another example: word count
bin/hadoop jar hadoop-0.18.2-examples.jar \
 wordcount /book.txt /wc-result
bin/hadoop fs -cat /wc-result/part-00000 | sort -n -k 2
You can also try passing a “-r #” option to
 increase the number of parallel reducers.
Each mapper takes a line as input and breaks it
 into words. It then emits a key/value pair of the
 word and a count of 1; the reducers sum the counts
 for each word, as in the earlier example.
           Presentation Overview
What is map-reduce?
  input/output data types
  why is it useful and where is it used?
Execution overview
Features
  fault tolerance
  ordering guarantee
  other perks and bonuses
Hands-on demonstration and follow-along
Map-reduce-merge
Does map-reduce satisfy all needs?
Map-reduce is great for homogeneous data, such
 as grepping a large collection of files or word-
 counting a huge document.
Joining heterogeneous databases does not work
  well.
As is, we'd need additional map-reduce steps,
 such as map-reducing one database and
 reading from the others on the fly.
We want to support relational algebra.
                       Solution
The solution to these problems is map-reduce-
 merge: map-reduce with an additional merge step.
The merge phase makes it easier to process data
 relationships among heterogeneous data sets.
Types:
  map: (k1, v1)α → [(k2, v2)]α
  reduce: (k2, [v2])α → (k2, [v3])α   (notice that the output [v] is a list)

  merge: ((k2, [v3])α, (k3, [v4])β) → (k4, v5)γ
  If α=β, then the merging step performs a self-merge
     (self-join in R.A.).
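A hypothetical instantiation: let α be an employee data set reduced to
(dept-id, [employee records]) and β a department data set reduced to
(dept-id, [department records]). A merger keyed on dept-id can then emit
joined (employee, department) records, which is a relational equi-join
expressed in the merge step.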
                  New terms
Partition selector: determines which data
 partitions produced by reducers should be
 retrieved for merging.
Processor: user-defined logic of processing data
 from an individual source.
Merger: user-defined logic of processing data
 merged from two sources where data satisfies a
 merge condition.
Configurable iterator: next slide.
          Configurable iterators
The map and reduce user-defined functions get
 one iterator for the values.
The merge function gets two iterators, one for
 each data source.
The iterators do not have to simply advance in
 lockstep; they can be configured to move however
 the user wants.
Relational join algorithms have specific patterns
 for the merging step.

								