MapReduce

• MapReduce overview
• Note: These notes are based on notes provided by Google

What is a Cloud?
• Cloud = lots of storage + compute cycles

Data-Intensive Computing
• Data-intensive
  - Typically store data at datacenters
  - Use compute nodes nearby
  - Compute nodes run computation services
• In data-intensive computing, the focus is on the data. Problem areas include:
  - Storage
  - Communication bottlenecks
  - Moving tasks to data (rather than vice versa)
  - Security
  - Availability of data
  - Scalability

Computation Services
• Google → MapReduce, Sawzall
• Yahoo → Hadoop, Pig Latin
• Microsoft → Dryad, DryadLINQ

Motivation: Large-Scale Data
• Want to process lots of data (> 1 TB)
• Want to parallelize across hundreds or thousands of CPUs
  - How to parallelize
  - How to distribute
  - How to handle failures
• Want to make this easy

What is MapReduce?
• MapReduce is an abstraction that lets programmers specify computations that can be done in parallel
• MapReduce hides the messy details needed to support the computations, e.g.,
  - Distribution and synchronization
  - Machine failures
  - Data distribution
  - Load balancing
• MapReduce is widely used at Google

Programming Model
• MapReduce simplifies programming through its library
• The user of the MapReduce library expresses the computation as two functions: Map and Reduce
• Map
  - Takes an input pair and produces a set of intermediate key/value pairs, e.g.,
    * Map: (key1, value1) → list(key2, value2)
• The MapReduce library groups together all intermediate values associated with the same intermediate key
• Reduce
  - Accepts an intermediate key and the set of values for that key
  - Reduce: (key2, list(value2)) → value3

Example: Word Frequencies in Web Pages
• Determine the count of each word that appears in a document (or a set of documents)
  - Each file is associated with a document URL
• Map function
  - Key = document URL
  - Value = document contents
• Output of the map function is (potentially many) key/value pairs
  - Output (word, “1”) once per word in the document
• Pseudocode for Map

  Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");
• Example key/value pair:
  - “document_example”, “to be or not to be”

• Result of applying the map function:
  - “to”, 1
  - “be”, 1
  - “or”, 1
  - “not”, 1
  - “to”, 1
  - “be”, 1
• Pseudocode for Reduce

  Reduce(String key, values):
    // key: a word, same for input and output
    // values: a list of counts
    int result = 0;
    for each v in values:
      result = result + ParseInt(v);
    Emit(AsString(result));

  The function sums all counts emitted for a particular word.
• The MapReduce framework sorts all pairs by key
  - (be, 1), (be, 1), (not, 1), (or, 1), (to, 1), (to, 1)
• Pairs with the same key are then grouped
  - (be, [1, 1]), (not, [1]), (or, [1]), (to, [1, 1])

• The reduce function combines (sums) the values for a key
  - Example: applying reduce to (be, [1, 1]) yields 2
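
The word-count walk-through above can be sketched end to end in a few lines of Python. This is a single-process illustration under stated assumptions, not Google's library: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and counts are emitted as integers instead of the string "1" used in the pseudocode.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(key, value):
    """key: document name; value: document contents."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word; values: the counts emitted for that word."""
    return sum(values)


def run_mapreduce(inputs, mapper, reducer):
    """Single-process stand-in for the framework: map, sort, group, reduce."""
    intermediate = []
    for key, value in inputs:
        intermediate.extend(mapper(key, value))
    intermediate.sort(key=itemgetter(0))  # sort pairs by intermediate key
    return {
        key: reducer(key, [v for _, v in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    }


counts = run_mapreduce(
    [("document_example", "to be or not to be")], map_fn, reduce_fn
)
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Sorting before grouping matters here because `itertools.groupby` only merges adjacent items with equal keys, which mirrors the sort-then-group step on the slide.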

Example: Distributed Grep
• Find all occurrences of a given pattern in a file (or set of files)
• Input consists of (url+offset, line)
• map(key=url+offset, val=line):
  - If contents match the specified pattern, emit (line, 1)
• reduce(key=line, values=uniq_counts):
  - Example of input to reduce is essentially (line, [1, 1, ...])
  - Don’t do anything; just emit line
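
A minimal Python sketch of the grep example, assuming the map function emits the value 1 for each matching line (the emitted value is cut off on the slide) and that the pattern is a regular expression. `grep_map`, `grep_reduce`, and the sample lines are hypothetical.

```python
import re
from collections import defaultdict


def grep_map(key, line, pattern):
    """key: (url, offset); line: one line of the file."""
    if re.search(pattern, line):
        yield line, 1


def grep_reduce(line, counts):
    """Identity reduce: ignore the counts and just emit the matching line."""
    return line


lines = [
    (("file_a", 0), "map tasks run first"),
    (("file_a", 20), "nothing to see here"),
    (("file_b", 0), "reduce tasks run after map tasks"),
]

# Group the intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for key, line in lines:
    for out_key, out_value in grep_map(key, line, r"\bmap\b"):
        grouped[out_key].append(out_value)

matches = [grep_reduce(line, counts) for line, counts in sorted(grouped.items())]
print(matches)
```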

Example: Count of URL Access
• Map function
  - Input: <log of web page requests, content of the log>
  - Outputs: <URL, 1>
• Reduce function adds together all values for the same URL

Example: Web Structure
• Simple representation of the WWW link graph
  - Map
    * Input: (URL, page-contents)
    * Output: (URL, list-of-URLs)
• Who links to me?
  - Map
    * Input: (URL, list-of-URLs)
    * Output: for each u in list-of-URLs, output <u, URL>
  - Reduce: concatenates the list of all source URLs associated with u and emits <u, list of source URLs>
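
The "who links to me?" step can be sketched in Python as follows. The function names and the tiny link graph are made up for illustration, and sorting the source list is an added convenience so the output is deterministic.

```python
from collections import defaultdict


def invert_map(url, outlinks):
    """Input: (URL, list-of-URLs). Emit (target, source) for each link."""
    for target in outlinks:
        yield target, url


def invert_reduce(target, sources):
    """Concatenate all source URLs that link to target."""
    return target, sorted(sources)


link_graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
}

# Group intermediate pairs by target URL, as the framework would.
grouped = defaultdict(list)
for url, outlinks in link_graph.items():
    for target, source in invert_map(url, outlinks):
        grouped[target].append(source)

inbound = dict(invert_reduce(t, s) for t, s in grouped.items())
print(inbound)
```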

The Infrastructure
• Large clusters of commodity PCs and networking hardware
• Clusters consist of hundreds or thousands of machines (failures are common)
• GFS (Google File System)
  - Distributed file system
  - Provides replication of the data
• Users submit jobs to a scheduling system
• Possible partitions of the data can be based on files, databases, file lines, database records, etc.
• Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits
• The input splits can be processed in parallel by different machines
• Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a hash function: hash(key) mod R
  - R and the partitioning function are specified by the programmer
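
As a rough illustration of hash(key) mod R, the sketch below assigns intermediate pairs to R buckets. It assumes `zlib.crc32` as a stand-in hash, because Python's built-in `hash()` for strings varies between runs; the name `partition` and the value R = 4 are arbitrary.

```python
import zlib


def partition(key: str, r: int) -> int:
    """Return the reduce partition (0..r-1) for an intermediate key."""
    return zlib.crc32(key.encode("utf-8")) % r


R = 4  # number of reduce tasks (assumed value)
pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]

buckets = {i: [] for i in range(R)}
for key, value in pairs:
    buckets[partition(key, R)].append((key, value))

for index, bucket in buckets.items():
    print(index, bucket)
```

Every occurrence of a key lands in the same bucket, which is what lets a single reduce task see all values for that key.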

• Workers are assigned work by the master
• The master is started by the MapReduce framework
• Workers assigned map tasks read the input, parse it, and invoke the user’s Map() method
• Intermediate key/value pairs are buffered in memory
• Periodically, buffered data is written to local disk, split into R files by a pseudo-random partitioning function, e.g., hash(k) mod R
• The locations of these files are passed back to the master, which forwards them to the workers executing the reduce function
• Reduce runs after all mappers are done
• Workers executing Reduce are notified by the master about the location of the intermediate data
• Reduce workers use remote procedure calls to read the data from the local disks of the map workers
• Each reduce worker sorts its intermediate data by intermediate key
• The reduce worker iterates over the sorted intermediate data and, for each key encountered, passes the key and the corresponding set of intermediate values to the Reduce function
• The output of the Reduce function is appended to a final output file
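
The execution steps above can be imitated in a single process. In this sketch each map task returns R in-memory lists in place of its R local files, and each reduce task gathers its partition from every map task, sorts, and reduces. All names, the two input splits, and the use of `zlib.crc32` as the hash are assumptions made for illustration.

```python
import zlib
from itertools import groupby
from operator import itemgetter

R = 2  # number of reduce tasks (assumed value)


def word_map(key, value):
    for word in value.split():
        yield word, 1


def run_map_task(split):
    """One map task: returns R in-memory 'files', one per reduce partition."""
    files = [[] for _ in range(R)]
    for key, value in split:
        for k, v in word_map(key, value):
            files[zlib.crc32(k.encode("utf-8")) % R].append((k, v))
    return files


def run_reduce_task(r, map_outputs):
    """One reduce task: fetch partition r from every map task, sort, reduce."""
    pairs = sorted(p for files in map_outputs for p in files[r])
    return [
        (k, sum(v for _, v in group))
        for k, group in groupby(pairs, key=itemgetter(0))
    ]


splits = [[("doc1", "to be or")], [("doc2", "not to be")]]  # M = 2 splits
map_outputs = [run_map_task(s) for s in splits]  # all map tasks finish first
outputs = [run_reduce_task(r, map_outputs) for r in range(R)]
result = dict(pair for out in outputs for pair in out)
print(result)
```

The list `outputs` holds one entry per reduce task, matching the point that a job produces R output files.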

Data Flow
• Input and final output are stored on a distributed file system
  - The scheduler tries to schedule map tasks “close” to the physical storage location of the input data
• Intermediate results are stored on the local file systems of the map and reduce workers
• Output can be input to another MapReduce job

Parallel Execution
• Master data structures
  - Task status: idle, in-progress, or completed
  - Idle tasks get scheduled as workers become available
  - When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer
  - The master pushes this information to the reducers

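Fault Tolerance
• The master pings workers periodically to detect failures
• Map worker failure
  - Map tasks completed or in progress at the worker are reset to idle (completed map output is stored on the failed worker's local disk and is lost with it)
  - Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
  - Only in-progress tasks are reset to idle (completed reduce output is already stored in the distributed file system)
• Master failure
  - The MapReduce task is aborted and the client is notified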
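
The worker-failure rules can be expressed as a small piece of master-side bookkeeping. This is a sketch only: the `Task` record and `handle_worker_failure` are hypothetical, and a real master would also track intermediate file locations and notify reducers.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None  # worker currently or last assigned


def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must be re-executed after a worker failure."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output was on the failed worker, so it is redone too.
        lost_map_output = task.kind == "map" and task.state == COMPLETED
        if task.state == IN_PROGRESS or lost_map_output:
            task.state, task.worker = IDLE, None


tasks = [
    Task("map", COMPLETED, "w1"),
    Task("map", IN_PROGRESS, "w1"),
    Task("reduce", COMPLETED, "w1"),
    Task("reduce", IN_PROGRESS, "w1"),
    Task("map", COMPLETED, "w2"),
]
handle_worker_failure(tasks, "w1")
states = [t.state for t in tasks]
print(states)
```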

Locality
• The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data
• Failing that, it schedules the map task near a replica of that task’s input data
• The goal is to read most input data locally and thus reduce the consumption of network bandwidth
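
One way to picture the locality preference is a worker-selection function. The sketch assumes that "near" means "on the same rack", which the slides do not specify; `pick_worker` and the `rack_of` mapping are hypothetical.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Pick an idle worker for a map task, preferring data locality."""
    # 1. A worker that stores a replica of the input split.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # 2. A worker on the same rack as some replica (assumed meaning of "near").
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker
    # 3. Fall back to any idle worker, or None if there is none.
    return idle_workers[0] if idle_workers else None


rack_of = {"w1": "rackA", "w2": "rackA", "w3": "rackB"}
print(pick_worker({"w1"}, ["w3", "w1"], rack_of))  # w1: holds a replica
print(pick_worker({"w1"}, ["w3", "w2"], rack_of))  # w2: same rack as w1
```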

Task Granularity
• M and R should be much larger than the number of available machines
  - Improves dynamic load balancing
  - Speeds up recovery in case of failures
• R determines the number of output files
  - Often constrained by users

Backup Tasks
• Stragglers (tasks that take unusually long to finish) are a common reason for long job completion times
• Schedule backup executions of the remaining in-progress tasks when the map or reduce phase is near completion
  - Slightly increases the computational resources needed
  - Does not increase running time, and has the potential to improve it significantly
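
A possible backup-task policy is sketched below. The slides only say backups are scheduled when a phase is "near completion", so the 90% threshold and the name `tasks_to_back_up` are assumptions chosen for illustration.

```python
def tasks_to_back_up(states, threshold=0.9):
    """Return indices of in-progress tasks to duplicate once a phase is nearly done."""
    completed = sum(1 for s in states if s == "completed")
    if not states or completed / len(states) < threshold:
        return []  # phase is not close enough to completion yet
    return [i for i, s in enumerate(states) if s == "in-progress"]


early = ["completed"] * 5 + ["in-progress"] * 5
late = ["completed"] * 19 + ["in-progress"]
print(tasks_to_back_up(early))  # []
print(tasks_to_back_up(late))   # [19]
```

Whichever copy of a task finishes first wins; the other result is discarded, which is why backups cost some extra resources but do not lengthen the job.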

Combiners
• Often a map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k
  - E.g., popular words in Word Count
• Can save network time by pre-aggregating at the mapper
  - combine(k1, list(v1)) → v2
  - Usually the same as the reduce function
• Works only if the reduce function is commutative and associative
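
The effect of pre-aggregating at the mapper can be seen with a small word-count sketch. `combine` here is a hypothetical helper that sums values per key for one mapper's output; summation is commutative and associative, so combining locally does not change the final counts.

```python
from collections import Counter


def word_map(key, value):
    for word in value.split():
        yield word, 1


def combine(pairs):
    """Pre-aggregate one mapper's output: sum the values for each key locally."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


raw = list(word_map("doc", "to be or not to be"))
combined = combine(raw)
print(len(raw), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Only four pairs cross the network instead of six, and the saving grows with the number of repeated keys per mapper.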

Partition Function
• Inputs to map tasks are created by contiguous splits of the input file
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker
• The system uses a default partition function, e.g., hash(key) mod R
• Sometimes it is useful to override it. What if all output keys are URLs and we want all entries for a single host to end up in the same output file?
  - Using hash(hostname(URL)) mod R ensures that URLs from the same host end up in the same output file
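
A host-based partition function might look like the sketch below. It assumes `urllib.parse.urlparse` for extracting the hostname and `zlib.crc32` as a stable hash; `host_partition` and the sample URLs are hypothetical.

```python
import zlib
from urllib.parse import urlparse


def host_partition(url: str, r: int) -> int:
    """Partition by hostname so all URLs from one host share an output file."""
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % r


R = 8  # number of reduce tasks (assumed value)
urls = [
    "http://example.com/a",
    "http://example.com/b/c?x=1",
    "http://example.org/index.html",
]
for url in urls:
    print(host_partition(url, R), url)
```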

• MapReduce: a framework for distributed computing
  - Distributed programs are easy to write and
  - Provides fault tolerance
  - Program execution can be easily monitored
• It works for Google!!
