

      Performance & Programming Comparison of JAQL, Hive, Pig and Java

      Robert Stewart

Document Revision 1.0
Experiment Design
   Benchmarks (Widely Used)
       ●   Word Count [3,4,5,6]
           ●   Reads a large text file; outputs a list of distinct words with their frequencies
       ●   Dataset Join [1,2,3,4]
           ●   Reads two datasets; joins on occurrences of identical items
       ●   Webserver Log Processing
           ●   Reads a webserver log file; aggregates page counts & average viewing time
           ●   Designed by the author; performs (SQL):
                   SELECT userID, AVG(timeOnSite) AS averages, COUNT(pageID)
                   GROUP BY userID;
       ●   Convergence
           ●   A graph algorithm that normally requires the looping constructs of a Turing complete language
       All implementations can be found in the MEng dissertation.
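
As a rough illustration of what the webserver log processing query computes (this is not one of the dissertation's implementations), the grouping and aggregation can be sketched in plain Python. The record layout `(userID, pageID, timeOnSite)`, the function name `aggregate_log` and the sample rows are assumptions made for this sketch.

```python
from collections import defaultdict


def aggregate_log(records):
    """Group log records by userID.

    Returns {userID: (average timeOnSite, number of page rows)}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # userID -> [sum of timeOnSite, rows seen]
    for user_id, page_id, time_on_site in records:
        totals[user_id][0] += time_on_site
        totals[user_id][1] += 1  # COUNT(pageID): every row carries a pageID here
    return {user: (total / count, count) for user, (total, count) in totals.items()}


# Hypothetical sample rows: (userID, pageID, timeOnSite in seconds)
sample_log = [
    ("u1", "p1", 10.0),
    ("u1", "p2", 30.0),
    ("u2", "p1", 5.0),
]
print(aggregate_log(sample_log))
```

On a Hadoop cluster the same grouping would be spread across map and reduce tasks; the sketch only shows the result each implementation is expected to produce.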

   Performance Metrics
       ●   Scale Up
           ●       Fix cluster size; increase computation size
       ●   Scale Out
           ●       Increase cluster size; increase computation size (proportionally)
       ●   Runtime
           ●       Increase cluster size; fix computation size

Experiment Design

   Ease of Programming
    ●   Source Lines of Code Review
        ●   Programming effort required for each implementation
    ●   Language Computability
        ●   Expressive power; Completeness of each language

   Benchmark Platform
    ●   Beowulf cluster of 32 CentOS 5.4 machines
        ●   Dual-core Intel Pentium 4, 3 GHz CPUs
        ●   Cache size: 1 MB (per core)
        ●   1 GB main memory, 1 GB swap memory
        ●   40 GB on each node for the Hadoop FileSystem

    ●   Programs were run three times, and mean results are reported

Scale Up: Uniformly Distributed Keys

         Fig 1: Word Count                                     Fig 2: Log Processing

  ●   Hive and Java achieve similar performance for word count, and both outperform JAQL and Pig,
      which perform similarly to each other

  ●   Pig versions 0.5 and 0.6 were both run for word count (figure 1), to illustrate the development of

  ●   JAQL and Pig scale up less well than Hive and Java for the log processing application

  ●   For the smallest computation (x1) in figure 2, JAQL achieves the quickest runtime of all four languages

  ●   Scale up results for the join application benchmarks can be found in the dissertation
Scale Up: Skewed Key Distribution

          Fig 3: Word Count                                         Fig 4: Join

 ●   JAQL scales up poorly relative to the other three languages in the join
     application (figure 4)

 ●   Hive and Pig have optimization techniques for handling skewed key distributions. As a result, in
     the join application between multipliers x12 and x18, Java's performance degrades faster than
     that of Hive and Pig

 ●   At multiplier x20 in the join application, both Pig and Hive achieve a quicker runtime than Java

 ●   All languages perform similarly for the word count application with both skewed and uniformly
     distributed keys, which may be due to the small dataset used for the word count application
Scale Out: Word Count

  Fig 5: Absolute Values                               Fig 6: Relative Scale Out

  ●   Inter-node communication, Hadoop coordination and merge sorting cause an increase in
      runtime when running on more than one node

  ●   In the scale up benchmark for word count (figure 1), Hive performs similarly to Java, and Pig
      similarly to JAQL. Corresponding results are seen for scale out in figure 5

  ●   Hive is more efficient than JAQL and Pig by a factor of between 2.4 and 3.1

  ●   The relative scale out graph (figure 6) shows the runtimes in relation to the runtime on one
      node for each language
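
The relative scale out values can be read as a simple ratio. The following is a minimal sketch, assuming each runtime is divided by the same language's one-node runtime; the function name `relative_scale_out` and the runtimes in the example are hypothetical and are not measurements from these experiments.

```python
def relative_scale_out(runtimes_by_nodes):
    """Divide each runtime by the runtime on one node.

    runtimes_by_nodes maps cluster size -> runtime in seconds and must
    contain an entry for a one-node cluster.
    """
    baseline = runtimes_by_nodes[1]
    return {nodes: runtime / baseline
            for nodes, runtime in sorted(runtimes_by_nodes.items())}


# Hypothetical runtimes in seconds, NOT taken from the experiments
example_runtimes = {1: 100.0, 2: 130.0, 4: 140.0}
print(relative_scale_out(example_runtimes))
```

Because scale out increases the computation in proportion to the cluster size, a value that stays close to 1.0 would indicate near-ideal scale out under this reading.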
Scale Out: Join

  Fig 7: Absolute Values                                Fig 8: Relative Scale Out

  ●   The scale out results for join show that JAQL is the least efficient implementation (figure 7)

  ●   Pig and Hive achieve similar scale out performance

  ●   Java scales out better than the other three languages (figure 8)


Runtime: Join and Log Processing

       Fig 9: Join                                        Fig 10: Log Processing

  ●   For join (figure 9)
       ● After increasing the size of the cluster beyond 4 nodes, there is no improvement for JAQL
       ● However, Hive, Pig and Java all improve up to 16 nodes, after which there is no further
         improvement

  ●   For log processing (figure 10)
       ● Runtimes improve as the cluster grows to 12 nodes. Beyond this threshold, there is negligible
         improvement, and Java and Hive performance degrades due to increased communication
         and coordination costs

Using One or Two Nodes with Small/Large Computation

The computation requirements were deliberately set at different levels:
   ●   The join application required a moderately large computation (figure 9)

   ●   The log processing application required only a small computation (figure 10)

   ●   Consequently, the log processing application achieves a better runtime when run
       wholly on one node rather than two, because of communication costs, merge
       sorting and high-latency node startup times

   ●   For the large computation in the join application, running on two nodes is clearly
       faster than running on one, because of the increased parallel processing across
       the 2 high-throughput Hadoop nodes

Controlling Reduce Tasks

 ●   Three of the languages (Pig, Hive and Java) give the programmer the additional
     expressive power to specify the number of reduce tasks a job should be split into.
     The Hadoop MapReduce documentation gives the following formula (shown here for
     a 10 node cluster):

     Reduce Tasks = <Number of Nodes> x <Max Number of Reducers per Node> x 0.9
                  = 10 x 2 x 0.9
                  = 18

 ● Note: The scale out and runtime experiments reported in this document all adhere to
 the guidelines set by this formula, set in the Hadoop configuration prior to running
 each experiment
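
The formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `recommended_reduce_tasks` is hypothetical, and rounding the product to the nearest whole task is an assumption, since the formula itself does not say how fractional results are handled.

```python
def recommended_reduce_tasks(num_nodes, max_reducers_per_node, factor=0.9):
    """Reduce Tasks = nodes x max reducers per node x factor.

    The default factor of 0.9 is the value quoted in the formula above.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(num_nodes * max_reducers_per_node * factor)


# The 10 node cluster with 2 reducers per node used in the worked example
print(recommended_reduce_tasks(10, 2))
```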

     Chart Annotation
         The following charts are annotated:
           ● Marking the use of 18 reducer tasks

           ● Runtime when using 3 reducer tasks

Controlling Reduce Tasks

         Fig 11: Join                            Fig 12: Web Log Processing

●   Pig and Java achieve their optimal runtime for the join application (figure 11) using 23
    reducers, as opposed to the 18 given by the documented formula

●   The log processing benchmark (figure 12) shows a sharp contrast: only Java is able to
    make use of the additional reducers to decrease overall runtime

●   All implementations degrade when using more than 23 reducers in both benchmarks

●   The join benchmark shows that controlling the number of reducers can give a runtime at
    least 48% quicker than using only one reducer
Ease of Programming
●   Figure 13 shows the source lines of code (SLoC) for each benchmark

●   The programs in all three high level languages (Hive, Pig and JAQL) are far shorter
    than the Java equivalent (by at least a factor of 7.5; word count: Java is 45 lines,
    JAQL is 6 lines), showing that they achieve their design goal of facilitating the abstract
    design of data queries to run on a Hadoop cluster

●   These high level languages provide a SLoC ratio as low as 1.8% (log processing:
    Java is 165 lines, JAQL is 3 lines), so programmers spend less time writing and
    debugging large applications. This is balanced by the occasional small performance
    gains apparent in the Java implementation results.
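
The SLoC figures quoted above follow from simple division. The sketch below reproduces that arithmetic from the line counts given in the text; the helper names `sloc_factor` and `sloc_ratio_percent` are hypothetical.

```python
def sloc_factor(java_lines, other_lines):
    """How many times longer the Java program is than the other program."""
    return java_lines / other_lines


def sloc_ratio_percent(java_lines, other_lines):
    """Size of the other program as a percentage of the Java program."""
    return 100.0 * other_lines / java_lines


# Line counts quoted in the text: word count (Java 45, JAQL 6),
# log processing (Java 165, JAQL 3)
word_count_factor = sloc_factor(45, 6)
log_ratio = round(sloc_ratio_percent(165, 3), 1)
log_factor = sloc_factor(165, 3)
print(word_count_factor, log_ratio, log_factor)
```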

                                 Fig 13: SLoC Comparison
Computational Power of the
High Level Data Query Languages

                                        Fig 14: Computation Power
●   Neither Pig Latin nor HiveQL provides the looping constructs required to be considered Turing complete

●   Pig Latin and HiveQL can both be extended with User Defined Functions (UDFs). These are implemented in
    Java (a Turing complete language), so the combination of Pig Latin or HiveQL with UDFs can produce
    Turing complete applications

●   Though not optimized for recursive functions, JAQL does support them, and therefore JAQL can be
    defined as Turing complete

●   Map-Reduce-Merge is a model introduced [7] to make MapReduce relationally complete, as MapReduce
    alone is not relationally complete

Convergence Application
     A dataset holds information about network node relationships:
         Directed Graph Connections = [<ParentNode>,<ChildNode>]

     Task: find all nodes with no children in the network

  ●   This is a typical example of a convergence application, and one that would
      normally require looping constructs to fulfil the application requirements
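
To make the role of the loop concrete, here is a minimal Python sketch of one way such a search could be written. It is not any of the benchmarked implementations. It assumes the dataset is a list of `(parent, child)` pairs and that the search starts from a chosen node and repeats until no new nodes are discovered; the function name `childless_nodes` and the sample graph are hypothetical.

```python
from collections import defaultdict


def childless_nodes(edges, start):
    """Follow child links from `start` until no new nodes are found.

    Returns the set of reachable nodes that have no children.
    """
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)

    visited = set()
    frontier = {start}
    leaves = set()
    while frontier:  # repeat until the frontier is empty (convergence)
        visited |= frontier
        next_frontier = set()
        for node in frontier:
            kids = children.get(node, set())
            if not kids:
                leaves.add(node)
            next_frontier |= kids
        frontier = next_frontier - visited  # skip nodes already seen (handles cycles)
    return leaves


# Hypothetical graph: a -> b, a -> c, b -> d, c -> d, c -> e
sample_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("c", "e")]
print(sorted(childless_nodes(sample_edges, "a")))
```

The number of loop iterations depends on the depth of the graph, which is not known in advance; this is why a language without looping or recursion needs outside help for this kind of task.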

                        Fig 15: Convergence Implementations
  ●   For Pig and Hive, where the program is embedded in Java, the SLoC ratio is
      less significant

  ●   As the convergence application is implementable in JAQL's core language,
      JAQL achieves the best SLoC ratio of all four languages
Performance Results
  ●   For these benchmarks
       ● Hive achieves the quickest parallel runtime for each benchmark

       ● Pig and JAQL achieve similar runtime performance, with the exception of the join
         application, where JAQL achieves a considerably slower runtime

  ●   The optimizations in Pig and Hive for skewed key distributions allow them to handle
      large computations more gracefully than a Java implementation (figure 4). JAQL
      currently has no such optimization technique

  ●   Controlling the number of reducers can improve performance by 51%, but may also degrade
      performance by 38% (figures 11 & 12)

Ease of Programming
  ●   Provided that Pig Latin, HiveQL and JAQL can express the application, they reduce
      the source lines of code relative to Java by at least a factor of 7.5, and at most a factor
      of 55

  ●   JAQL has the greatest computational power of the high level data query languages,
      as its support for recursive functions makes JAQL a Turing complete language

  ●   Pig and Hive are both extensible languages, through user defined functions (UDFs)

  ●   All languages can implement the convergence application, though Pig and Hive need
      to be extended to do so; Turing complete JAQL does not
  Dataset Join
      ●   Pig: “Pig Wiki” [1]
      ●   Hive: “Hive Language Tutorial” [2]
      ●   JAQL: “JAQL Language Overview” [3]
      ●   Java: Hadoop Packaged examples [4]

  Word Count
      ●   Pig: “A Brief, hands-on Introduction to Hadoop & Pig” [5]
      ●   Hive: “A Petabyte Scale Data Warehouse using Hadoop” [6]
      ●   JAQL: “JAQL Language Overview” [3]
      ●   Java: Hadoop Packaged examples [4]

  Webserver Log
      ●   All four implementations were designed by Rob Stewart for the purpose of these

            [1] -
            [2] -
            [3] -
            [4] - Java classes in hadoop-*-examples.jar, packaged with the Hadoop installation
            [5] -
            [6] -
            [7] - Yang, Hung-Chih; Dasdan, Ali; Hsiao, Ruey-Lung; Parker, Stott. “Map-Reduce-Merge: Simplified
            Relational Data Processing on Large Clusters”. SIGMOD 2007: ACM SIGMOD International Conference
            on Management of Data, Beijing, China
