Joins in Hadoop


 Gang and Ronnie
                     Agenda
•   Introduction of new types of joins
•   Experiment results
•   Join plan generator
•   Summary and future work
                 Problem at hand
• Map join (fragment-duplicate join)
  [Figure: fragment-duplicate join. The large table (fragment) is divided into splits, one per map task; the small table (duplicate) is copied to every map task.]
 Slide taken from project proposal
• Too many copies of the small table are shuffled across the network:

      Size of R   Size of S   Map tasks   Duplicate data
      150 MB      24 GB       352         32 GB

• Partially solved by the distributed cache:

      Size of R   Size of S   Map tasks   Duplicate data
      150 MB      24 GB       277         <= 64 * 150 MB§

• Doesn't work when too many nodes are involved:

      Size of R   Size of S   # of nodes  Duplicate data
      150 MB      24 TB       277 k       TB? PB?

§ There are 64 nodes in our cluster, and the distributed cache copies the data no more than that many times.
Slide taken from project proposal II
• Memory limitation
  – A hash table is not memory-efficient.
  – The table size is usually larger than the heap memory assigned to a task.

      Out Of Memory Exception!
Solving Not-Enough-Memory problem
New map joins:
• Multi-phase map join   (MMJ)
• Reversed map join      (RMJ)
• JDBM-based map join    (JMJ)

Terminology: the small table is the duplicate; the large table is the fragment.
             Multi-phase map join
• n-phase map join
  [Figure: n-phase map join. The duplicate table is split into parts 1..n; each phase joins the entire fragment against one part.]
 Problem? – The large table is read multiple times!
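
A minimal in-memory sketch of the n-phase idea (the data, names, and Java form are illustrative, not the talk's implementation); note how the fragment is re-scanned once per phase:

    import java.util.*;

    // Illustrative n-phase map join: the duplicate is split into n parts
    // that each fit in memory; the fragment is re-scanned for every part.
    public class MultiPhaseJoinSketch {
        public static void main(String[] args) {
            List<String[]> duplicate = List.of(
                new String[]{"k1", "a"}, new String[]{"k2", "b"},
                new String[]{"k3", "c"}, new String[]{"k4", "d"});
            List<String[]> fragment = List.of(
                new String[]{"k2", "x"}, new String[]{"k4", "y"});

            int n = 2;                                 // number of phases
            int part = (duplicate.size() + n - 1) / n; // part size
            for (int phase = 0; phase < n; phase++) {
                // Phase i: hash only the i-th part of the duplicate ...
                Map<String, String> hash = new HashMap<>();
                for (String[] r : duplicate.subList(phase * part,
                        Math.min((phase + 1) * part, duplicate.size())))
                    hash.put(r[0], r[1]);
                // ... then probe it with every tuple of the fragment.
                for (String[] r : fragment)
                    if (hash.containsKey(r[0]))
                        System.out.println(r[0] + "\t" + hash.get(r[0]) + "\t" + r[1]);
            }
        }
    }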
            Reversed map join
• Default map join (in each map task):
  1. read the duplicate into memory and build a hash table
  2. for each tuple in the fragment, probe the hash table
• Reversed map join (in each map task):
  1. read the fragment (this task's split) into memory and build a hash table
  2. for each tuple in the duplicate, probe the hash table

 Problem? – not really a Map job…
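
As a hedged sketch of the default map join in Hadoop's mapreduce API (the file format, column layout, and cache-file handling below are assumptions, not the talk's code):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Default map join: hash the duplicate (small table), stream the fragment.
    // Assumes tab-separated lines with the join key in the first column.
    public class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> hash = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The duplicate was shipped to this node via the distributed cache
            // (job.addCacheFile(...)); read it and build the in-memory hash table.
            URI[] cached = context.getCacheFiles();
            try (BufferedReader in = new BufferedReader(
                    new FileReader(cached[0].getPath()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split("\t", 2);
                    hash.put(cols[0], cols[1]);
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            // Probe the hash table with each tuple of the fragment.
            String[] cols = value.toString().split("\t", 2);
            String match = hash.get(cols[0]);
            if (match != null) {
                context.write(new Text(cols[0]), new Text(cols[1] + "\t" + match));
            }
        }
    }

This runs as a map-only job (setNumReduceTasks(0)), so there is no shuffle. The reversed variant swaps the two roles: setup() hashes this task's fragment split and the duplicate is streamed against it.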
            JDBM-based map join
• JDBM is a transactional persistence engine for Java.

• Using JDBM, we can eliminate the OutOfMemoryException: the size of the hash table is no longer bound by the heap size.

Problem? – Probing a hash table on disk may take much time!
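
A hedged sketch against the JDBM 1.x API (an HTree over a RecordManager); exact class names and signatures vary across JDBM versions, so treat this as an assumption-laden illustration:

    import jdbm.RecordManager;
    import jdbm.RecordManagerFactory;
    import jdbm.htree.HTree;

    // Disk-backed hash table: put() spills to disk instead of filling the heap.
    public class JdbmJoinSketch {
        public static void main(String[] args) throws Exception {
            RecordManager recman =
                RecordManagerFactory.createRecordManager("joincache");
            HTree hash = HTree.createInstance(recman);

            hash.put("k1", "a");                // build phase: insert duplicate tuples
            String v = (String) hash.get("k1"); // probe phase: may hit disk (the drawback)
            System.out.println(v);

            recman.commit();
            recman.close();
        }
    }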
                Advanced Joins
• Step 1: semi-join on the join key only
• Step 2: use the result to filter the large table
• Step 3: join the filtered tables (sketched below)

• Can be applied to both map-side and reduce-side joins

      Problem? – Steps 1 and 2 have overhead!
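
An in-memory sketch of the three steps (illustrative data and names; in the real setting each step would be its own MapReduce pass):

    import java.util.*;

    // Advanced join: semi-join on the key, filter the large table, then join.
    public class AdvancedJoinSketch {
        public static void main(String[] args) {
            List<String[]> small = List.of(
                new String[]{"k1", "a"}, new String[]{"k2", "b"});
            List<String[]> large = List.of(
                new String[]{"k1", "x"}, new String[]{"k3", "y"});

            // Step 1: semi-join -- project the join keys of the small table.
            Set<String> keys = new HashSet<>();
            for (String[] r : small) keys.add(r[0]);

            // Step 2: filter the large table down to matching keys only.
            List<String[]> filtered = new ArrayList<>();
            for (String[] r : large) if (keys.contains(r[0])) filtered.add(r);

            // Step 3: join the small table with the much smaller filtered table.
            Map<String, String> hash = new HashMap<>();
            for (String[] r : small) hash.put(r[0], r[1]);
            for (String[] r : filtered)
                System.out.println(r[0] + "\t" + hash.get(r[0]) + "\t" + r[1]);
        }
    }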
           The Nine Candidates
• AMJ/no dist   advanced map join without distributed cache (DC)
• AMJ/dist      advanced map join with DC
• DMJ/no dist   default map join without DC
• DMJ/dist      default map join with DC
• MMJ           multi-phase map join
• RMJ/dist      reversed map join with DC
• JMJ/dist      JDBM-based map join with DC
• ARJ/dist      advanced reduce join with DC
• DRJ           default reduce join
            Experiment Setup
• TPC-DS benchmark
• Evaluated query:
  JOIN customer, web_sales ON cid
• Performed on data generated at different scales, e.g. 10 GB, 170 GB (the scale, not the actual table size)

• Each combination is run five times
• Results are analyzed with error bars
Hadoop Cluster

• 128 Hewlett-Packard DL160 Compute Building Blocks, each equipped with:
  – 2 quad-core CPUs
  – 16 GB RAM
  – 2 TB storage
  – a high-speed network connection
• Used in the experiment: Hadoop cluster (Altocumulus), 64 nodes
Result analysis

  [Figure: running times of all nine candidates (AMJ/no dist, AMJ/dist, DMJ/no dist, DMJ/dist, MMJ, RMJ/dist, JMJ/dist, ARJ/dist, DRJ) across data scales; some results ignored]
One small note
• What does 50*200 mean?

• TABLE customer comes from the 50 GB version of TPC-DS
  – actual table size: about 100 MB
• TABLE web_sales comes from the 200 GB version of TPC-DS
  – actual table size: about 30 GB
Distributed Cache

  [Figure: running times of DMJ/no dist vs. DMJ/dist across data scales]
           Distributed Cache II
• The distributed cache introduces an overhead when copying the file from HDFS to local disks.
• The following situations favor the distributed cache (compared to non-DC):
  1. the number of nodes is low
  2. the number of map tasks is high
Advanced vs. Default

  [Figure: running times of ARJ/dist vs. DRJ for scale combinations 10*10 through 70*70]
Advanced vs. Default II

  [Figure: running times of AMJ/dist vs. DMJ/dist]
        Advanced vs. Default III
• The overhead of the semi-join and filtering steps is heavy.
• The following situations favor advanced joins (compared to reduce joins):
  1. join selectivity gets lower
  2. the network becomes slower (true!)
  3. we need to handle skewed data
Map Join vs. Reduce Join – Part I

  [Figure: running times of DMJ/no dist, MMJ, JMJ/dist, ARJ/dist, DRJ]
Map Join vs. Reduce Join – Part II

  [Figure: running times of DMJ/no dist, RMJ/dist, JMJ/dist, ARJ/dist, DRJ]
Map Join vs. Reduce Join
• In most situations, Default Map Join performs better than Default Reduce Join
  – It eliminates the data transfer and sorting at the shuffle stage
• The gap is not significant due to the fast network
• Potential problems of map joins:
  – A job involving too many map tasks causes a large amount of data to be transferred over the network
  – The distributed cache may do harm to performance
Beyond Default Map Join
• Multi-phase Map Join
  – Succeeds in all experiment groups.
  – Performance is comparable with DMJ when only one phase is involved.
  – Performance degrades sharply when the number of phases is greater than 2, due to the many more tasks we launch.
  – Currently no support for the distributed cache, so not scalable.
Beyond Default Map Join
• Reversed Map Join
  – Succeeds in all experiment groups.
  – Does not perform as well as DRJ due to the overhead of the distributed cache.
  – Performs best when
Beyond Default Map Join
• JDBM Map Join
  – Fails for the last two experiment groups, mainly due to improper configuration settings.
Join Plan Generator
• Cost-based + rule-based
• Focus on three aspects:
  – whether or not to use the distributed cache
  – whether to use Default Map Join
  – map join or reduce-side join
• Parameters:
    d   number of distributed files
    v   network speed
    m   number of map tasks
    r   number of reduce tasks
    n   number of working nodes
    s   small table size
    l   large table size
Join Plan Generator
• Whether to use the distributed cache
  – Only works for map join approaches
  – Cost model:
      with distributed cache:    (1/v) * n * s + α * d
      without distributed cache: (1/v) * m * s
    where α is the average overhead to distribute one file
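
That comparison translates directly into a hedged Java helper (alpha stands for the per-file overhead symbol α above; the sample numbers in main are illustrative only):

    // Decide on the distributed cache by comparing the two costs above.
    public class DistCacheCost {
        // v: network speed, n: nodes, m: map tasks, s: small table size,
        // d: distributed files, alpha: average overhead per distributed file.
        static boolean useDistributedCache(double v, long n, long m,
                                           long s, long d, double alpha) {
            double withDC    = (double) n * s / v + alpha * d;
            double withoutDC = (double) m * s / v;
            return withDC < withoutDC;
        }

        public static void main(String[] args) {
            // Illustrative numbers: 64 nodes, 277 map tasks, 150 MB small table.
            System.out.println(useDistributedCache(1e8, 64, 277,
                    150L << 20, 1, 0.5));
        }
    }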
Join Plan Generator
• Whether to use Default Map Join
  – We give Default Map Join the highest priority, since it usually works best
  – The choice on the distributed cache ensures Default Map Join works efficiently
  – Rule: if the small table fits into memory entirely, just do it
Join Plan Generator
• Map joins or Default Reduce-side Join
  – In those situations where DMJ fails, Reversed Map Join is the most promising in terms of usability and scalability
  – Cost model:
      RMJ: (1/v) * m * s           (without distributed cache)
           (1/v) * n * s + α * d   (with distributed cache)
        where α is the average overhead to distribute one file
      DRJ: (1/v) * (s + l) + f(v, r), with f a function of the network speed and the number of reduce tasks
Join Plan Generator

  Distributed cache? (Y/N, decided by the cost model)
            |
            v
  Default Map Join? ---- Y ----> do it
            |
            N
            |
            v
  Reversed Map Join / Default Reduce-side Join ----> do it
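
Putting the rule and the two cost models together, a minimal sketch of the generator's decision tree; the memory threshold, alpha, and the body of f(v, r) are stand-ins for details the talk does not spell out:

    // Hypothetical join plan chooser following the decision tree above.
    public class JoinPlanGenerator {
        enum Plan { DEFAULT_MAP_JOIN, REVERSED_MAP_JOIN, DEFAULT_REDUCE_JOIN }

        static Plan choose(double v, long n, long m, long r,
                           long s, long l, long d,
                           double alpha, long heapBytes) {
            // Rule: if the small table fits into memory entirely, just do it.
            if (s < heapBytes) return Plan.DEFAULT_MAP_JOIN;

            // Otherwise compare RMJ (with or without DC) against DRJ.
            boolean dc = (double) n * s / v + alpha * d < (double) m * s / v;
            double rmj = dc ? (double) n * s / v + alpha * d
                            : (double) m * s / v;
            double drj = (double) (s + l) / v + f(v, r);
            return rmj < drj ? Plan.REVERSED_MAP_JOIN : Plan.DEFAULT_REDUCE_JOIN;
        }

        // f(v, r): reduce-side cost term; its exact form is not given in the
        // talk, so this placeholder just scales with the number of reduce tasks.
        static double f(double v, long r) { return r / v; }

        public static void main(String[] args) {
            System.out.println(choose(1e8, 64, 277, 32,
                    150L << 20, 24L << 30, 1, 0.5, 1L << 30));
        }
    }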
Summary
• The distributed cache is a double-edged sword
• When the distributed cache is used properly, Default Map Join performs best
• The three new map join approaches extend the usability of the default map join
                Future Work
• SPJA workflow
  (selection, projection, join, aggregation)
• Better optimizer
• Multi-way join
• Build into a hybrid system
• Need a dedicated (slower) cluster…

				