VIEWS: 16 PAGES: 34 POSTED ON: 6/7/2012
Map-Reduce, Hadoop

Presentation Overview
What is map-reduce?
  input/output data types
  why is it useful and where is it used?
Execution overview
Features
  fault tolerance
  ordering guarantee
  other perks and bonuses
Hands-on demonstration and follow-along
Map-reduce-merge

What is map-reduce?
Map-reduce is a programming model (and an associated implementation) for processing and generating large data sets.
It consists of two steps: map and reduce.
The “map” step takes a key/value pair and produces a list of intermediate key/value pairs.
The “reduce” step takes an intermediate key and the list of that key's values and outputs the final values for that key.

Types
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)

Why is this useful?
Map-reduce jobs are automatically parallelized.
Partial failure of the processing cluster is expected and tolerable.
Redundancy and fault tolerance are built in, so the programmer doesn't have to worry about them.
It scales very well.
Many jobs are naturally expressible in the map/reduce paradigm.

What are some uses?
Word count. map: <word, 1>. reduce: <word, #>
Grep. map: <file, line>. reduce: identity
Inverted index. map: <word, docID>. reduce: <word, list(docID)>
Distributed sort (special case). map: <key, record>. reduce: identity
Users: Google, Yahoo!, Amazon, Facebook, etc.

Execution overview: map
The user begins a map-reduce job. One of the machines becomes the master.
The input is partitioned into M splits (16-64 MB each), which are distributed among the machines.
A worker reads its split and begins work. Upon completion, the worker notifies the master.
The master partitions the intermediate keyspace into R pieces with a partitioning function.
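The map and reduce types and the partitioning of the intermediate keyspace into R pieces can be sketched in a few lines of Python. This is only an in-memory illustration, not Hadoop code: the names map_words, reduce_counts, partition and run_job are hypothetical, and crc32 is an assumed stand-in for whatever hash a real implementation uses.

```python
from collections import defaultdict
from zlib import crc32


def map_words(doc_id, text):
    """map: (k1, v1) -> list(k2, v2). Emits (word, 1) for every word."""
    return [(word, 1) for word in text.split()]


def reduce_counts(word, counts):
    """reduce: (k2, list(v2)) -> list(v2). Sums the counts of one word."""
    return [sum(counts)]


def partition(key, r):
    """Assigns an intermediate key to one of r reduce pieces (assumed hash)."""
    return crc32(key.encode("utf-8")) % r


def run_job(inputs, mapper, reducer, r=3):
    """Runs a toy map-reduce job in memory and returns one list per partition."""
    # Map phase: route every intermediate pair to its reduce partition.
    partitions = [defaultdict(list) for _ in range(r)]
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            partitions[partition(k2, r)][k2].append(v2)
    # Reduce phase: each partition handles its own keys in sorted order.
    return [
        [(k2, reducer(k2, part[k2])) for k2 in sorted(part)]
        for part in partitions
    ]


if __name__ == "__main__":
    lines = [("line1", "Comets! Go!"), ("line2", "Go! Green, Orange, White!")]
    for index, part in enumerate(run_job(lines, map_words, reduce_counts)):
        print(index, part)
```

Every key lands in exactly one partition, so each of the R reducers can work independently of the others.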

Execution overview: reduce
When a reduce worker is notified about a job, it uses RPC to read the intermediate data from a mapper, then sorts it by key.
The reducer processes its job, then writes its output to the final output file for its reduce partition.
When all reducers are finished, the master wakes up the user program.

What are M and R?
M is the number of map pieces. R is the number of reduce pieces.
Ideally, M and R are much larger than the number of workers. This lets one machine perform many different tasks, which improves load balancing and speeds up recovery.
The master makes O(M+R) scheduling decisions and keeps O(M*R) state in memory.
At least R files end up being written.

Example: counting words
We have UTD's fight song:
  C-O-M-E-T-S! Go!
  Green, Orange, White!
  Comets! Go!
  Strong of will, we fight for right!
  Let's all show our comet might!
We want to count the number of occurrences of each word. The next slides show the map and reduce phases.

First stage: map
Go through the input, and for each word return a tuple of (<word>, 1).
Output:
  <C-O-M-E-T-S!, 1>
  <Go!, 1>
  <Green,, 1>
  <Orange,, 1>
  <White!, 1>
  <Comets!, 1>
  <Go!, 1>
  <Strong, 1>

Between map and reduce...
Between the mapper and the reducer, Hadoop groups identical keys and sorts by key before starting the reducer.
Here's the output:
  <C-O-M-E-T-S!, [1]>
  <Comets!, [1]>
  <Go!, [1,1]>
  <Green,, [1]>
  <Orange,, [1]>
  <Strong, [1]>

Second stage: reducer
The reducer receives the content, one key/value-list pair at a time, and does its own processing.
For wordcount, it sums the values in each list.
Here's the output:
  <C-O-M-E-T-S!, 1>
  <Go!, 2>
  <Green,, 1>
  <Orange,, 1>
  …
How can we improve our wordcount? Also, any questions?

Fault tolerance
Worker failure is expected.
If a worker fails during a map phase, its workload is reassigned to another worker.
If a mapper fails during a reduce phase, its map work is re-executed as well, because the intermediate output it produced is stored on the failed machine and can no longer be read.
Master failure is not expected, though checkpointing can be used for recovery.
If a particular record causes the mapper or reducer to reliably crash, the map-reduce system can figure this out, skip the record, and proceed.

Ordering guarantee
The implementation of map-reduce guarantees that within a given partition, the intermediate key/value pairs are processed in increasing key order.
This means that each reduce partition ends up with an output file sorted by key.

Partitioning function
By default, your reduce tasks will be distributed evenly by using a hash(intermediate key) mod R function.
You can specify a custom partitioning function.
This is useful for locality reasons, such as when the key is a URL and you want all URLs belonging to a single host to be processed on a single machine.

Combiner function
After a map phase, the mapper transmits the entire intermediate data file over the network to the reducer.
Sometimes this file is highly compressible.
The user can specify a combiner function. It's just like a reduce function, except that it's run by the mapper before the data is passed to the reducer.

Counters
A counter can be associated with any action that a mapper or a reducer performs.
This is in addition to default counters such as the number of input and output key/value pairs processed.
A user can watch the counters in real time to see the progress of a job.
When the map/reduce job finishes, these counters are provided to the user program.

What is Hadoop?
Hadoop is the implementation of the map/reduce design that we will use.
Hadoop is released under the Apache License 2.0, so it's open source.
Hadoop uses the Hadoop Distributed File System, HDFS. (In contrast to what we've seen with Lucene.)
Get the release from: http://hadoop.apache.org/core/

Preparing Hadoop on your system
Configure passwordless public-key SSH on localhost.
Configure Hadoop: look at the two configuration files at http://utdallas.edu/~pmw033000/hadoop/
Format the HDFS:
    bin/hadoop namenode -format
Start Hadoop (and wait ≈20 seconds):
    cd <hadoop-dir>
    bin/start-all.sh

Example: grep
Standard Unix 'grep' behavior: run it on the command line with the search string as the first argument and the list of files or directories as the subsequent argument(s).
    $ grep HelloWorld file1.c file2.c file3.c
    file2.c:System.out.println("I say HelloWorld!");
    $

Preparing for 'grep' in Hadoop
Hadoop's jobs always operate within the HDFS: Hadoop reads its input from HDFS and writes its output to HDFS.
Thus, to prepare:
Download a free electronic book: http://utdallas.edu/~pmw033000/hadoop/book.txt
Load the file into HDFS:
    bin/hadoop fs -copyFromLocal book.txt /book.txt

Using 'grep' within Hadoop
    bin/hadoop jar \
        hadoop-0.18.2-examples.jar \
        grep /book.txt /grep-result \
        "search string"
    bin/hadoop fs -ls /grep-result
    bin/hadoop fs -cat /grep-result/part-00000
A good string to try: "Horace de \S+"

How 'grep' in Hadoop works
The program runs two map/reduce jobs in sequence. The first job counts how many times each matching string occurred, and the second job sorts the matching strings by their frequency and stores the output in a single output file.
Each mapper of the first job takes a line as input and matches the user-provided regular expression against the line.
It extracts all matching strings and emits (matching string, 1) pairs.
Each reducer sums the frequencies of each matching string. The output is a set of sequence files containing the matching strings and their counts.
The reduce phase is optimized by running a combiner that sums the frequency of strings from local map output. As a result, less data needs to be shipped to a reduce task.
The second job takes the output of the first job as input. The mapper is an inverse map (it swaps key and value), while the reducer is an identity reducer.
The number of reducers is one, so the output is stored in one file, sorted by count in descending order.
The output file is text, each line of which contains a count and a matching string.

Another example: word count
    bin/hadoop jar hadoop-0.18.2-examples.jar \
        wordcount /book.txt /wc-result
    bin/hadoop fs -cat /wc-result/part-00000 | \
        sort -n -k 2
You can also try passing a “-r #” option to increase the number of parallel reducers.
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the

Does map-reduce satisfy all needs?
Map-reduce is great for homogeneous data, such as grepping a large collection of files or word-counting a huge document.
Joining heterogeneous databases does not work well. As is, we'd need additional map-reduce steps, such as map-reducing one database and reading from the others on the fly.
We want to support relational algebra.

Solution
The solution to these problems is map-reduce-merge: map-reduce with an additional merging step.
The merge phase makes it easier to process data relationships among heterogeneous data sets.
Types:
map: (k1, v1)α → [(k2, v2)]α
reduce: (k2, [v2])α → (k2, [v3])α (notice that the output [v] is a list)
merge: ((k2, [v3])α, (k3, [v4])β) → (k4, v5)γ
Here α, β and γ label the data set that each key/value pair belongs to.
If α=β, then the merging step performs a self-merge (a self-join in relational algebra).

New terms
Partition selector: determines which data partitions produced by reducers should be retrieved for merging.
Processor: user-defined logic for processing data from an individual source.
Merger: user-defined logic for processing data merged from two sources, where the data satisfies a merge condition.
Configurable iterator: next slide.

Configurable iterators
The map and reduce user-defined functions get one iterator for the values.
The merge function gets two iterators, one for each data source.
The iterators do not have to move forward; they can be instrumented to do whatever the user wants.
Relational join algorithms have specific patterns for the merging step.
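
One such pattern, a sort-merge equi-join, can be sketched as follows. This is an illustration only, not the map-reduce-merge API: it assumes both reducer outputs are already sorted by key with one entry per key, and the names merge_join, bonuses, departments and label_total are hypothetical. The two indices play the role of the two iterators, and the merger is called only when the merge condition (equal keys) holds.

```python
def merge_join(left, right, merger):
    """Sort-merge join over two key-sorted lists of (key, values) pairs.

    `left` and `right` stand in for reducer outputs from two data sets;
    `merger` is the user-defined logic applied when the keys are equal.
    """
    out = []
    i, j = 0, 0  # one cursor ("iterator") per data source
    while i < len(left) and j < len(right):
        left_key, left_values = left[i]
        right_key, right_values = right[j]
        if left_key < right_key:
            i += 1  # no partner on the right; advance the left cursor
        elif left_key > right_key:
            j += 1  # no partner on the left; advance the right cursor
        else:
            out.append(merger(left_key, left_values, right_key, right_values))
            i += 1
            j += 1
    return out


if __name__ == "__main__":
    # Hypothetical reducer outputs, both keyed by a department id.
    bonuses = [(1, [10, 20]), (2, [5]), (4, [7])]
    departments = [(2, ["ops"]), (3, ["hr"]), (4, ["dev"])]

    def label_total(k2, v3, k3, v4):
        # Produces one (k4, v5) pair: department name and total bonus.
        return (v4[0], sum(v3))

    print(merge_join(bonuses, departments, label_total))
```

Other join algorithms would move the two cursors differently, for example by rewinding one of them, which is why the iterators are configurable.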