Maximizing Network and Storage Performance for Big Data Analytics

Xiaodong Zhang
Ohio State University

Collaborators:
Rubao Lee, Ying Huai, Tian Luo, Yuan Yuan (Ohio State University)
Yongqiang He and the Data Infrastructure Team (Facebook)
Fusheng Wang (Emory University)
Zhiwei Xu (Institute of Computing Technology, Chinese Academy of Sciences)
Digital Data Explosion in Human Society

[Figure: the global storage capacity, analog vs. digital, and the amount of digital information created and replicated in a year]
• 1986: analog storage 2.62 billion GB, digital storage 0.02 billion GB
• 2007: analog storage 18.86 billion GB, digital storage 276.12 billion GB (PC hard disks alone: 123 billion GB, 44.5%)

Source: "Exabytes: Documenting the 'digital age' and huge growth in computing capacity," The Washington Post
Challenge of Big Data Management and Analytics (1)

• Existing DB technology is not prepared for the huge volume
  • Until 2007, Facebook had a 15TB data warehouse built on a major commercial DBMS
  • Now, ~70TB of compressed data is added to the Facebook data warehouse every day (4x the total capacity of its data warehouse in 2007)
  • Commercial parallel DBs rarely have 100+ nodes
  • Yahoo!'s Hadoop cluster has 4000+ nodes; Facebook's data warehouse has 2750+ nodes; Google sorted 10 PB of data on a cluster of 8,000 nodes (2011)
• Typical science and medical research examples:
  – The Large Hadron Collider at CERN generates over 15 PB of data per year
  – The Pathology Analytical Imaging Standards database at Emory has reached 7TB and is heading toward PB scale
Challenge of Big Data Management and Analytics (2)

• Big data is about all kinds of data
  • Online services (social networks, retailers, …) focus on big data of online and offline click-streams for deep analytics
  • Medical image analytics is crucial to both biomedical research and clinical diagnosis
• Complex analytics to gain deep insights from big data
  • Data mining
  • Pattern recognition
  • Data fusion and integration
  • Time series analysis
  • Goal: gain deep insights and new knowledge
Challenge of Big Data Management and Analytics (3)

• The conventional database business model is not affordable
  • Expensive software licenses (e.g. Oracle DB, $47,500 per processor, 2010)
  • High maintenance fees even for open-source DBs
  • Storing and managing data in such a system costs at least $10,000/TB*
  • In contrast, Hadoop-like systems cost only $1,500/TB**
• Increasingly more non-profit organizations work on big data
  • Hospitals, bio-research institutions
  • Social networks, online services, …
  • Low-cost software infrastructure is a key.
Challenge of Big Data Management and Analytics (4)

• The conventional parallel processing model is "scale-up" based
  • BSP model, CACM, 1990: optimizations in both hardware and software
  • Hardware: low ratio of computation to communication, fast locks, large caches and memory
  • Software: overlapping computation and communication, exploiting locality, co-scheduling, …
• The big data processing model is "scale-out" based
  • DOT model, SOCC'11: hardware-independent software design
  • Scalability: maintain sustained throughput growth by continuously adding low-cost computing and storage nodes in distributed systems
  • Constraints in computing patterns: communication- and data-sharing-free

The MapReduce programming model has become an effective data processing engine for big data analytics.

*: http://www.dbms2.com/2010/10/15/pricing-of-data-warehouse-appliances/
**: http://www.slideshare.net/jseidman/data-analysis-with-hadoop-and-hive-chicagodb-2212011
Why MapReduce?

• A simple but effective programming model designed to process huge volumes of data concurrently
• Two unique properties
  • Minimum dependency among tasks (almost sharing nothing)
  • Simple task operations in each node (low-cost machines are sufficient)
• Two strong merits for big data analytics
  • Scalability (Amdahl's Law): increase throughput by increasing the number of nodes
  • Fault tolerance (quick and low-cost recovery of failed tasks)
• Hadoop is the most widely used implementation of MapReduce
  • Used in hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, eBay, Facebook, IBM, NY Times, Yahoo!, …
An Example of a MapReduce Job on Hadoop

Calculate the average salary of each of two organizations in a huge file.
{name: (org., salary)}  →  {org.: avg. salary}

Original key/value pairs: all the person names, each associated with an org name and a salary.
Result key/value pairs: two entries showing each org name and its average salary.

    Key (Name)    Value (dept., salary)        Key (dept.)    Value (avg. salary)
    Alice         (Org-1, 3000)                Org-1          …
    Bob           (Org-2, 3500)                Org-2          …
    …             …
An Example of a MapReduce Job on Hadoop

Calculate the average salary of every organization.
{name: (org., salary)}  →  {org.: avg. salary}

[Figure: the input file is stored as HDFS blocks in the Hadoop Distributed File System (HDFS)]
An Example of a MapReduce Job on Hadoop

Calculate the average salary of every organization.
{name: (org., salary)}  →  {org.: avg. salary}

[Figure: three Map tasks reading HDFS blocks]
• Each map task takes 4 HDFS blocks as its input and extracts {org.: salary} as new key/value pairs, e.g. {Alice: (org-1, 3000)} becomes {org-1: 3000}
• 3 Map tasks concurrently process the input data, producing records of "org-1" and records of "org-2"
An Example of a MapReduce Job on Hadoop

Calculate the average salary of every organization.
{name: (org., salary)}  →  {org.: avg. salary}

[Figure: the shuffle phase]
• The map output is shuffled using org. as the Partition Key (PK): records of "org-1" and records of "org-2" are routed to different reduce tasks
An Example of a MapReduce Job on Hadoop

Calculate the average salary of every organization.
{name: (org., salary)}  →  {org.: avg. salary}

[Figure: the reduce phase]
• Two Reduce tasks compute the averages: one calculates the average salary for "org-1", the other for "org-2"
• The results are written back to HDFS
Key/Value Pairs in MapReduce

• A simple but effective programming model designed to process huge volumes of data concurrently on a cluster
• Map: (k1, v1) → (k2, v2), e.g. (name, org & salary) → (org, salary)
• Reduce: (k2, v2) → (k3, v3), e.g. (org, salary) → (org, avg. salary)
• Shuffle: Partition Key (it may or may not be the same as k2)
  • Partition Key: determines how a key/value pair in the map output is transferred to a reduce task
  • e.g. the org. name is used to partition the map output file accordingly
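As a concrete illustration, here is a minimal sketch (not code from the talk) of the average-salary job written against Hadoop's Java API, assuming each input line has the hypothetical form "name|org|salary":

// Sketch only: class and field names are illustrative, not from the talk.
package example;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgSalary {
    // Map: (name, "org|salary") -> (org, salary)
    public static class Map extends Mapper<Object, Text, Text, DoubleWritable> {
        private final Text org = new Text();
        private final DoubleWritable salary = new DoubleWritable();
        public void map(Object key, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = line.toString().split("\\|");   // [name, org, salary] (assumed format)
            org.set(tokens[1]);
            salary.set(Double.parseDouble(tokens[2]));
            context.write(org, salary);
        }
    }
    // Reduce: (org, [salaries]) -> (org, avg. salary)
    public static class Reduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        public void reduce(Text org, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable v : values) { sum += v.get(); count++; }
            context.write(org, new DoubleWritable(sum / count));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "avg-salary");
        job.setJarByClass(AvgSalary.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);          // org is the Partition Key by default
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Even this small job already shows the pattern the later slides rely on: the map output key (org) doubles as the Partition Key that drives the shuffle.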
MR (Hadoop) Job Execution Patterns

[Figure: an MR program (job) is submitted to the Master node, which assigns Map and Reduce tasks to Worker nodes]
• The execution of an MR job involves 6 steps
• Data is stored in a distributed file system (e.g. the Hadoop Distributed File System)
• Step 1: Job submission
• Step 2: Assign tasks — the Master node does control-level work, e.g. job scheduling and task assignment
• The Worker nodes do the data processing work specified by the Map or Reduce function
MR (Hadoop) Job Execution Patterns

[Figure: Map tasks run concurrently on Worker nodes and produce map output]
• Step 3: Map phase — concurrent Map tasks
• Step 4: Shuffle phase — the map output is shuffled to different reduce tasks based on Partition Keys (PKs), usually the map output keys
MR (Hadoop) Job Execution Patterns

[Figure: Reduce tasks run concurrently on Worker nodes and produce reduce output]
• Step 5: Reduce phase — concurrent Reduce tasks
• Step 6: The output is stored back to the distributed file system
MR (Hadoop) Job Execution Patterns

A MapReduce (MR) job is resource-consuming:
1: Input data scan in the Map phase => local or remote I/Os
2: Storing intermediate results of the Map output => local I/Os
3: Transferring data across nodes in the Shuffle phase => network costs
4: Storing the final results of this MR job => local I/Os + network costs (replicating data)
Two Critical Challenges in Production Systems

• Background: conventional databases have been moved to the MapReduce environment, e.g. Hive (Facebook) and Pig (Yahoo!)
• Challenge 1: How to initially store the data in distributed systems
  • subject to minimizing network and storage costs
• Challenge 2: How to automatically convert relational database queries into MapReduce jobs
  • subject to minimizing network and storage costs
• By addressing these two challenges, we maximize
  • Performance of big data analytics
  • Productivity of big data analytics
Challenge 1: Four Requirements of Data Placement

• Data loading (L)
  • the overhead of writing data to the distributed file system and local disks
• Query processing (P)
  • local storage bandwidth during query processing
  • the amount of network transfers
• Storage space utilization (S)
  • data compression ratio
  • the convenience of applying efficient compression algorithms
• Adaptivity to dynamic workload patterns (W)
  • additional overhead on certain queries

Objective: to design and implement a data placement structure meeting these requirements in MapReduce-based data warehouses.
Initial Stores of Big Data in a Distributed Environment

[Figure: HDFS blocks (Store Blocks 1-3) are placed by the NameNode (a part of the Master node) across DataNode 1, DataNode 2, and DataNode 3]

• HDFS (Hadoop Distributed File System) blocks are distributed
• Users have a limited ability to specify a customized data placement policy
  • e.g. to specify which blocks should be co-located
• Minimizing I/O costs on local disks and intra-network communication
MR programming is not that "simple"!

This complex code is for a simple MR job (the first job of TPC-H Q18) — Low Productivity!
We all want to simply write: "SELECT * FROM Book WHERE price > 100.00"?

package tpch;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Q18Job1 extends Configured implements Tool {
    // Map: tag each lineitem row with "l" and each orders row with "o", keyed by the order key
    public static class Map extends Mapper<Object, Text, IntWritable, Text> {
        private final static Text value = new Text();
        private IntWritable word = new IntWritable();
        private String inputFile;
        private boolean isLineitem = false;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
            if (inputFile.compareTo("lineitem.tbl") == 0) {
                isLineitem = true;
            }
            System.out.println("isLineitem:" + isLineitem + " inputFile:" + inputFile);
        }

        public void map(Object key, Text line, Context context) throws IOException, InterruptedException {
            String[] tokens = (line.toString()).split("\\|");
            if (isLineitem) {
                word.set(Integer.valueOf(tokens[0]));
                value.set(tokens[4] + "|l");
                context.write(word, value);
            } else {
                word.set(Integer.valueOf(tokens[0]));
                value.set(tokens[1] + "|" + tokens[4] + "|" + tokens[3] + "|o");
                context.write(word, value);
            }
        }
    }

    // Reduce: sum the quantities of the lineitem rows per order and keep the order if the sum exceeds 314
    public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
        private Text result = new Text();

        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sumQuantity = 0.0;
            IntWritable newKey = new IntWritable();
            boolean isDiscard = true;
            String thisValue = new String();
            int thisKey = 0;
            for (Text val : values) {
                String[] tokens = val.toString().split("\\|");
                if (tokens[tokens.length - 1].compareTo("l") == 0) {
                    sumQuantity += Double.parseDouble(tokens[0]);
                } else if (tokens[tokens.length - 1].compareTo("o") == 0) {
                    thisKey = Integer.valueOf(tokens[0]);
                    thisValue = key.toString() + "|" + tokens[1] + "|" + tokens[2];
                } else
                    continue;
            }
            if (sumQuantity > 314) {
                isDiscard = false;
            }
            if (!isDiscard) {
                thisValue = thisValue + "|" + sumQuantity;
                newKey.set(thisKey);
                result.set(thisValue);
                context.write(newKey, result);
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: Q18Job1 <orders> <lineitem> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "TPC-H Q18 Job1");
        job.setJarByClass(Q18Job1.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Q18Job1(), args);
        System.exit(res);
    }
}
Challenge 2: High-Quality MapReduce in Automation

[Figure: a job description in an SQL-like declarative language is fed to an SQL-to-MapReduce translator, which generates MR programs (jobs) that run on worker nodes over the Hadoop Distributed File System (HDFS)]

• Instead of hand-writing MR programs (jobs), users submit a job description in an SQL-like declarative language
• The SQL-to-MapReduce translator is an interface between users and MR programs (jobs)
Challenge 2: High-Quality MapReduce in Automation

[Figure: the same pipeline, with Hive (a data warehousing system at Facebook) and Pig (a high-level programming environment at Yahoo!) as SQL-to-MapReduce translators producing MR programs (jobs) over HDFS]

Translators improve productivity over hand-coding MapReduce programs:
• 95%+ of Hadoop jobs in Facebook are generated by Hive
• 75%+ of Hadoop jobs in Yahoo! are invoked by Pig*

* http://hadooplondon.eventbrite.com/
Outline

• RCFile: a fast and space-efficient data placement structure
  • Re-examination of existing structures
  • A mathematical model as the analytical basis for RCFile
  • Experiment results
• YSmart: a highly efficient query-to-MapReduce translator
  • Correlation-awareness is the key
  • Fundamental rules in the translation process
  • Experiment results
• Impact of RCFile and YSmart in production systems
• Conclusion
Row-Store: Merits/Limits with MapReduce

The table is stored row by row into HDFS blocks:

    A     B     C     D          HDFS Blocks
    101   201   301   401        Store Block 1
    102   202   302   402        Store Block 2
    103   203   303   403        Store Block 3
    104   204   304   404        …
    105   205   305   405

+ Data loading is fast (no additional processing)
+ All columns of a data row are located in the same HDFS block
– Not all columns are used (unnecessary storage bandwidth)
– Compression of different data types mixed in a row may add additional overhead
Distributed Row-Store Data among Nodes

[Figure: the NameNode distributes Store Blocks 1-3 to DataNode 1, DataNode 2, and DataNode 3; each block holds complete rows (columns A, B, C, D)]
Column-Store: Merits/Limits with MapReduce

The table is stored column by column into HDFS blocks:

    A     B     C     D          HDFS Blocks
    101   201   301   401        Store Block 1
    102   202   302   402        Store Block 2
    103   203   303   403        Store Block 3
    104   204   304   404        …
    105   205   305   405
    …     …     …     …
Column-Store: Merits/Limits with MapReduce

[Figure: the table is split into column groups — Column group 1 (A), Column group 2 (B), Column group 3 (C and D) — each stored in its own HDFS blocks]

+ Unnecessary I/O costs can be avoided: only needed columns are loaded, and compression is easy
– Additional network transfers are needed for column grouping (row construction)
Distributed Column-Store Data among Nodes

[Figure: the NameNode places column A on DataNode 1, column B on DataNode 2, and columns C and D on DataNode 3]
Optimization of Data Placement Structure

• Consider the four requirements comprehensively
• The optimization problem becomes:
  • In an environment with a dynamic workload (W) and with a suitable data compression algorithm (S) to improve the utilization of data storage, find a data placement structure (DPS) that minimizes the processing time of a basic operation (OP) on a table (T) with n columns
• Two basic operations
  • Write: the essential operation of data loading (L)
  • Read: the essential operation of query processing (P)
Write Operations

[Figure: a table (columns A, B, C, D) is loaded into HDFS blocks according to the chosen data placement structure; the HDFS blocks are then distributed to DataNode 1, DataNode 2, and DataNode 3]
Read Operation in Row-Store

[Figure: to project columns A and C, each DataNode reads its local rows concurrently and discards the unneeded columns B and D]
Read Operations in Column-Store

[Figure: column group 1 (A) is on DataNode 1, column group 2 (B) on DataNode 2, and column group 3 (C and D) on DataNode 3. Projecting C & D is served locally by DataNode 3, but projecting A & C requires transferring columns from different nodes to a common place via the network]
An Expected-Value-Based Statistical Model

• In probability theory, the expected value of a random variable is the weighted average of all possible values that the random variable can take on
• Big data access time consists of "read" time T(r) and "write" time T(w), each weighted by its probability p(r) or p(w), with p(r) + p(w) = 1; the estimated access time is the weighted average:

    E[T] = p(r) * T(r) + p(w) * T(w)
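For example (illustrative numbers, not from the talk): if p(r) = 0.9, T(r) = 100 s, p(w) = 0.1 and T(w) = 300 s, then E[T] = 0.9 × 100 + 0.1 × 300 = 120 s — the estimate is dominated by the read time whenever reads dominate the workload.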
Overview of the Optimization Model

    min E(op | DPS) = ω_w · E(write | DPS) + ω_r · E(read | DPS),    ω_r + ω_w = 1

where E(write | DPS) and E(read | DPS) are the processing times of a write and a read operation, and ω_w and ω_r are the probabilities (weights) of write and read operations.

• The majority of operations are reads, so ω_r >> ω_w
  • ~70TB of compressed data is added per day, while PB-level compressed data is scanned per day in Facebook (90%+ reads)
• We therefore focus on minimizing the processing time of read operations
Modeling the Processing Time of a Read Operation

• The number of column combinations a query may need from a table with n columns is up to

    C(n,1) + C(n,2) + … + C(n,n) = 2^n − 1

• The expected processing time of a read operation is

    E(read | DPS) = Σ_{i=1}^{n} Σ_{j=1}^{C(n,i)} f(i, j, n) · [ S / (ρ · B_local) · α(DPS) + λ(DPS, i, j, n) · (S / B_network) · (i / n) ]
Expected Time of a Read Operation

    E(read | DPS) = Σ_{i=1}^{n} Σ_{j=1}^{C(n,i)} f(i, j, n) · [ S / (ρ · B_local) · α(DPS) + λ(DPS, i, j, n) · (S / B_network) · (i / n) ]

The terms of the model:

• n: the number of columns of table T
• i: the number of columns needed by the query
• j: the jth column combination among all combinations with i columns
• f(i, j, n): the frequency of occurrence of the jth column combination among all combinations with i columns. We fix it as a constant value to represent an environment with a highly dynamic workload, so that

    Σ_{i=1}^{n} Σ_{j=1}^{C(n,i)} f(i, j, n) = 1

• S / (ρ · B_local) · α(DPS): the time used to read the needed columns from local disks in parallel
  • S: the size of the compressed table. We assume that, with efficient data compression algorithms and configurations, different DPSs can achieve comparable compression ratios
  • B_local: the bandwidth of local disks
  • ρ: the degree of parallelism, i.e. the total number of concurrent nodes
  • α(DPS): read efficiency, the % of columns read from local disks
    – α(DPS) = i/n for column-store (only necessary columns are read)
    – α(DPS) = 1 for row-store (all columns are read, including unnecessary ones)
• λ(DPS, i, j, n) · (S / B_network) · (i / n): the extra time spent on network transfers for row construction
  • λ(DPS, i, j, n): communication overhead, the additional network transfers needed for row construction
    – λ = 0 if all needed columns are in one node (no communication)
    – λ = β if at least two needed columns are on two different nodes, so data must be transferred via the network
  • β: the % of data transferred via the network (DPS- and workload-dependent), 0% ≤ β ≤ 100%
  • B_network: the bandwidth of the network
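To see how the model behaves numerically, here is a small sketch (illustrative only, not from the talk); the parameter values and the uniform choice of f(i, j, n) are assumptions:

// A toy evaluation of the expected-read-time model under a uniform f(i, j, n).
public class ReadModel {
    static long choose(int n, int k) {
        long r = 1;
        for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;   // C(n, k), incremental
        return r;
    }
    // columnStore = false models row-store (alpha = 1, lambda = 0);
    // columnStore = true models column-store (alpha = i/n, lambda = beta)
    static double expectedRead(int n, double S, double rho, double Blocal,
                               double Bnetwork, boolean columnStore, double beta) {
        long combos = (1L << n) - 1;          // 2^n - 1 possible column combinations
        double f = 1.0 / combos;              // uniform frequency: highly dynamic workload
        double e = 0.0;
        for (int i = 1; i <= n; i++) {
            double alpha  = columnStore ? (double) i / n : 1.0;
            double lambda = columnStore ? beta : 0.0;
            double perCombo = S / (rho * Blocal) * alpha
                            + lambda * S / Bnetwork * i / n;
            e += choose(n, i) * f * perCombo; // all C(n, i) combinations of size i cost the same here
        }
        return e;
    }
    public static void main(String[] args) {
        // assumed values: 16 columns, 100 GB table, 40 nodes, 100 MB/s disks, 1 GB/s network, beta = 0.5
        System.out.println("row-store    E(read) = " + expectedRead(16, 100e9, 40, 100e6, 1e9, false, 0.5) + " s");
        System.out.println("column-store E(read) = " + expectedRead(16, 100e9, 40, 100e6, 1e9, true, 0.5) + " s");
    }
}

The sketch only instantiates the two extremes of the next slide's table; it is meant to show that the trade-off between read efficiency (α) and communication overhead (λ) is what the model quantifies.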
Finding the Optimal Data Placement Structure

Instantiating the model for each structure:

                                Row-Store        Column-Store         Ideal
    Read efficiency α           1                i/n (optimal)        i/n (optimal)
    Communication overhead λ    0 (optimal)      β (0% ≤ β ≤ 100%)    0 (optimal)

Can we find a data placement structure with both optimal read efficiency and optimal communication overhead?
Goals of RCFile

• Eliminate unnecessary I/O costs, like column-store
  • Only read the needed columns from disks
• Eliminate network costs in row construction, like row-store
• Keep the fast data loading speed of row-store
• Allow efficient data compression algorithms to be applied conveniently, like column-store
• Eliminate all the limits of row-store and column-store
RCFile: Partitioning a Table into Row Groups

• An HDFS block consists of one or multiple row groups
• The table is first partitioned horizontally into row groups, which are stored into HDFS blocks (Store Block 1, 2, 3, 4, …)

    A Row Group
    A     B     C     D
    101   201   301   401
    102   202   302   402
    103   203   303   403
    104   204   304   404
    105   205   305   405
RCFile: Distributed Row-Group Data among Nodes

[Figure: for example, each HDFS block holds three row groups — Row Groups 1-3 on DataNode 1, Row Groups 4-6 on DataNode 2, Row Groups 7-9 on DataNode 3, placed by the NameNode]
Inside a Row Group

[Figure: within a Store Block, a row group consists of a Metadata section followed by the table data (rows 101-105 of columns A-D); the data are then reorganized column by column, as shown next]
RCFile: Inside Each Row Group

[Figure: within a row group, the data are stored column by column and compressed per column]
• Compressed Metadata
• Compressed Column A: 101 102 103 104 105
• Compressed Column B: 201 202 203 204 205
• Compressed Column C: 301 302 303 304 305
• Compressed Column D: 401 402 403 404 405
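The column-wise layout inside a row group is what lets a scan touch only the columns it needs. The following is a simplified illustration of that idea (a sketch, not the actual RCFile implementation; all names are made up):

import java.util.ArrayList;
import java.util.List;

// A toy row group: one list per column, so a projection touches only the requested columns.
class RowGroup {
    private final List<List<String>> columns;
    RowGroup(int numColumns) {
        columns = new ArrayList<>();
        for (int j = 0; j < numColumns; j++) columns.add(new ArrayList<>());
    }
    void addRow(String[] row) {
        for (int j = 0; j < row.length; j++) columns.get(j).add(row[j]);
    }
    // In RCFile, the requested columns would be the only byte ranges read
    // and decompressed from the HDFS block that holds this row group.
    List<List<String>> project(int... neededColumns) {
        List<List<String>> result = new ArrayList<>();
        for (int j : neededColumns) result.add(columns.get(j));
        return result;
    }
}

public class RowGroupDemo {
    public static void main(String[] args) {
        RowGroup rg = new RowGroup(4);
        rg.addRow(new String[]{"101", "201", "301", "401"});
        rg.addRow(new String[]{"102", "202", "302", "402"});
        System.out.println(rg.project(0, 2));   // read columns A and C only
    }
}

Because the whole row group (and hence every column of its rows) lives in one HDFS block, projecting A and C never needs data from another node.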
Benefits of RCFile

• Eliminates unnecessary I/O costs
  • Within a row group, the table is partitioned by columns
  • Only the needed columns are read from disks
• Eliminates network costs in row construction
  • All columns of a row are located in the same HDFS block
• Data loading speed comparable to row-store
  • Only adds a vertical-partitioning operation to the data loading procedure of row-store
• Efficient data compression algorithms can be applied conveniently
  • Can use the compression schemes used in column-store
Expected Time of a Read Operation

    E(read | DPS) = Σ_{i=1}^{n} Σ_{j=1}^{C(n,i)} f(i, j, n) · [ S / (ρ · B_local) · α(DPS) + λ(DPS, i, j, n) · (S / B_network) · (i / n) ]

                                Row-Store        Column-Store         RCFile
    Read efficiency α           1                i/n (optimal)        i/n (optimal)
    Communication overhead λ    0 (optimal)      β (0% ≤ β ≤ 100%)    0 (optimal)

RCFile achieves both optimal read efficiency and optimal communication overhead.
Row-Group Size

• A performance-sensitive and workload-dependent parameter
• In RCFile, the row-group size is bounded by the size of an HDFS block
  • HDFS block size: 64MB or 128MB
Row-Group Size

• What if a row group is larger than an HDFS block? (VLDB'11)
  • One column of a row group per HDFS block
  • Co-locate the columns of a row group
  • Can provide flexibility for changing schemas
  • A concern on fault tolerance:
    – If a node fails, the recovery time is longer than with row-group sizes under the current RCFile restriction
    – Needs additional maintenance
Discussion on Choosing the Right Row-Group Size

• RCFile restriction: row-group size ≤ HDFS block size
• Average row size ↑, number of columns ↑ => row-group size ↑
  • Avoid prefetching unnecessary columns caused by short columns
    – An unsuitable row-group size may cause ~10x more data to be read from disks, of which ~90% is unnecessary
  • Relatively short columns may compress inefficiently
• Query selectivity ↑ => row-group size ↓
  • High-selectivity queries (a high % of rows to read) with a large row-group size do not benefit from lazy decompression, but low-selectivity queries do
    – Lazy decompression: a column is decompressed only when it will really be useful for query execution
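The following sketch illustrates the lazy-decompression idea at row-group granularity (simplified and hypothetical, not Hive's actual code): the filter column is decompressed first, and the other columns of the row group are decompressed only if some row can satisfy the predicate.

public class LazyDecompressionDemo {
    // A column whose "compressed" bytes are expanded only on first use.
    static class LazyColumn {
        private final byte[] raw;   // stand-in for the compressed bytes in the row group
        private int[] values;       // filled only when the column is really needed
        LazyColumn(byte[] raw) { this.raw = raw; }
        int[] get() {
            if (values == null) {
                System.out.println("  decompressing a column of " + raw.length + " bytes");
                values = new int[raw.length];
                for (int i = 0; i < raw.length; i++) values[i] = raw[i]; // toy "inflate"
            }
            return values;
        }
    }

    public static void main(String[] args) {
        // e.g. SELECT b FROM t WHERE a > 100: column a is the filter, column b stays lazy
        LazyColumn a = new LazyColumn(new byte[]{1, 2, 3, 4, 5});
        LazyColumn b = new LazyColumn(new byte[]{10, 20, 30, 40, 50});
        boolean anyMatch = false;
        for (int v : a.get()) if (v > 100) { anyMatch = true; break; }
        if (anyMatch) {
            // b is decompressed only for row groups that can qualify
            System.out.println(java.util.Arrays.toString(b.get()));
        } else {
            System.out.println("row group skipped: column b never decompressed");
        }
    }
}

With a small row group, few qualifying rows trigger little wasted decompression; with a very large row group, a single qualifying row forces the whole group's columns to be decompressed, which is why high-selectivity queries favor smaller row groups.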
Evaluation Environment

• Cluster:
  • a 40-node cluster in Facebook
• Node:
  • Intel Xeon CPU with 8 cores
  • 32GB main memory
  • 12 x 1TB disks
  • Linux, kernel version 2.6
  • Hadoop 0.20.1
• The Zebra library* is used for column-store and column-group
  • Column-store: an HDFS block only contains the data of a single column
  • Column-group: some columns are grouped together and stored in the same HDFS block
• The benchmark is from [SIGMOD09]**

* http://wiki.apache.org/pig/zebra
** Pavlo et al., "A comparison of approaches to large-scale data analysis," in SIGMOD Conference, 2009, pp. 165-178.
Data Loading Evaluation

[Bar chart: data loading time (s) for loading table USERVISITS (~120GB) under Row-store, Column-store, Column-group, and RCFile]

• Row-store has the fastest data loading time
• RCFile's data loading time is comparable to row-store (7% slower)
• Zebra (column-store and column-group): each record is written to multiple HDFS blocks => higher network overhead
Storage Utilization Evaluation

[Bar chart: compressed size (GB) of table USERVISITS (~120GB) as raw data and under Row-store, Column-store, Column-group, and RCFile]

• The Gzip algorithm is used for each structure
• RCFile has the best data compression ratio (2.24)
• Zebra compresses metadata and data together, which is less efficient
A Case of Query Execution Time Evaluation

Query: SELECT pagerank, pageurl FROM RANKING WHERE pagerank < 400

[Bar chart: query execution time (s) for row-store, column-store, column-group, and RCFile. In the column-store, the two columns are stored in different HDFS blocks; in the column-group, they are in the same column group, i.e. stored in the same HDFS block]

• Column-group has the fastest query execution time
• RCFile's execution time is comparable to column-group (3% slower)
Facebook Data Analytics Workloads Managed by RCFile

• Reporting
  • e.g. daily/weekly aggregations of impression/click counts
• Ad hoc analysis
  • e.g. geographical distributions and activities of users around the world
• Machine learning
  • e.g. online advertising optimization and effectiveness studies
• Many other data analysis tasks on user behavior and patterns
• User workloads and the related analyses cannot be published
• An RCFile evaluation with publicly available workloads shows excellent performance (ICDE'11)
RCFile in Facebook

[Figure: Web Servers — the interface to 800+ million users — produce a large amount of log data (~70TB of compressed data per day); Data Loaders write it as RCFile data into the warehouse. Warehouse capacity: 21PB in May 2010, 30PB+ today]

Picture source: Visualizing Friendships, http://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
Summary of RCFile

• The data placement structure lays a foundation for MapReduce-based big data analytics
• Our optimization model shows that RCFile meets all the basic requirements
• RCFile is an operational system for daily tasks of big data analytics
  • A part of Hive, a data warehouse infrastructure on top of Hadoop
  • A default option for the Facebook data warehouse
  • Integrated into Apache Pig since version 0.7.0 (expressing data analytics tasks and producing MapReduce programs)
  • Customized RCFile systems exist for special applications
• Ongoing work: refining RCFile and the optimization model, making RCFile a standard data placement structure for big data analytics
                           Outline
 RCFile: a fast and space-efficient placement structure
   • Re-examination of existing structures
   • A mathematical model as the basis of RCFile
   • Experiment results
 YSmart: a highly efficient query-to-MapReduce translator
   • Correlation-awareness is the key
   • Fundamental Rules in the translation process
   • Experiment results
 Impact of RCFile and YSmart in production systems
 Conclusion

                                                           70
 Translating SQL-like Queries to MapReduce
           Jobs: Existing Approach
 “Sentence by sentence” translation
   • [C. Olston et al., SIGMOD 2008], [A. Gates et al., VLDB 2009] and
     [A. Thusoo et al., ICDE 2010]
   • Implementations: Hive and Pig
 Three steps
   • Identify major sentences with operations that shuffle the data
       – Such as: Join, Group by and Order by
   • For every operation in the major sentence that shuffles the data, a
     corresponding MR job is generated
       – e.g. a join op. => a join MR job
   • Add other operations, such as selection and projection, into
     corresponding MR jobs
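
To make the three steps concrete, here is a minimal in-memory sketch (toy relations and column names are assumed; this is not Hive's actual code) in which a JOIN clause and a GROUP BY clause each become their own MapReduce job:

# Minimal in-memory sketch (toy relations, not Hive's actual code) of
# sentence-by-sentence translation: one MR job per shuffle-producing clause.
from collections import defaultdict

def run_mr(records, map_fn, reduce_fn):
    """Simulate one MapReduce job: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):
            groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reduce_fn(key, values))
    return out

# MR job 1: the JOIN clause => a join job partitioned on orderkey.
def join_map(rec):
    table, row = rec
    yield row["orderkey"], (table, row)

def join_reduce(key, values):
    orders = [r for t, r in values if t == "orders"]
    items = [r for t, r in values if t == "lineitem"]
    return [{"orderkey": key, "price": i["price"]} for _ in orders for i in items]

# MR job 2: the GROUP BY clause => an aggregation job, also on orderkey.
def agg_map(row):
    yield row["orderkey"], row["price"]

def agg_reduce(key, prices):
    return [(key, sum(prices))]

records = [("orders", {"orderkey": 1}),
           ("lineitem", {"orderkey": 1, "price": 10.0}),
           ("lineitem", {"orderkey": 1, "price": 5.0})]
joined = run_mr(records, join_map, join_reduce)   # job 1 output is materialized
print(run_mr(joined, agg_map, agg_reduce))        # job 2 re-shuffles it -> [(1, 15.0)]

Because each shuffle-producing clause gets its own job, the intermediate join result is materialized and re-shuffled on the same key, which is exactly the overhead a correlation-aware translator tries to remove.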

Existing SQL-to-MapReduce translators give unacceptable
performance.                                       71
                                    An Example: TPC-H Q21

 One of the most complex and time-consuming queries in the TPC-H benchmark
  for data warehousing performance
 Optimized MR jobs vs. Hive in a Facebook production cluster

[Figure: execution time (min) of the optimized MR jobs vs. Hive; the hand-optimized MR jobs run about 3.7x faster than Hive]

What's wrong?                                                          72
     The Execution Plan of TPC-H Q21

[Figure: the query plan tree - SORT over AGG3 over Join4, which joins a Left-outer-Join sub-tree with Join3 (supplier, nation); beneath the Left-outer-Join are Join2 and AGG2 (lineitem); beneath Join2 are Join1 (lineitem, orders) and AGG1 (lineitem)]

 The only difference between Hive and the optimized MR jobs is how this
  sub-tree (the Left-outer-Join and the operators beneath it) is handled
 That sub-tree dominates execution time (~90% of the total)
                                                                        73
 However, inter-job correlations exist. Let's look at the partition key.

[Figure: the sub-tree as five MR jobs - J1 (a JOIN job over lineitem and orders), J2 and J4 (AGG jobs over lineitem), J3 (a JOIN job over J1 and J2) and J5 (a JOIN job over J3 and J4), all using partition key l_orderkey; the whole sub-tree could instead run as one composite MR job over lineitem and orders]

 J1 to J5 all use the same partition key 'l_orderkey'
 J1, J2 and J4 all need the input table 'lineitem'

What's wrong with existing SQL-to-MR translators?
Existing translators are correlation-unaware:
1. They ignore common data input
2. They ignore common data transition                                     74
              Approaches of Big Data Analytics in MR: The Landscape

[Figure: a performance-vs-productivity landscape - hand-coded MR jobs sit high on performance but low on productivity, existing SQL-to-MR translators sit high on productivity but low on performance, and a correlation-aware SQL-to-MR translator targets both]

 Hand-coding MR jobs
   • Pro: high-performance MR programs
   • Con: 1) lots of coding even for a simple job; 2) redundant coding is
     inevitable; 3) hard to debug [J. Tan et al., ICDCS 2010]
 Existing SQL-to-MR translators
   • Pro: easy programming, high productivity
   • Con: poor performance on complex queries (complex queries are common in
     daily operations)
                                                          Productivity      75
     Our Approaches and Critical Challenges

[Figure: SQL-like queries enter a correlation-aware SQL-to-MR translator, which identifies correlations among the primitive MR jobs, merges the correlated jobs, and outputs MR jobs for best performance]

Critical challenges:
1: Correlation possibilities and detection
2: Rules for automatically exploiting correlations
3: Implementing high-performance and low-overhead MR jobs
                                                                       76
             Input Correlation (IC)
 Multiple MR jobs have input correlation (IC) if their input
  relation sets are not disjoint



[Figure: J1 reads lineitem and orders, J2 reads lineitem - the two jobs share the input relation lineitem, so the map functions of MR job 1 and MR job 2 can scan the shared input together]
                                                                          78
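
As a rough illustration (toy rows and hypothetical per-job map logic, not YSmart's actual code), two input-correlated jobs can share a single scan of the common relation:

# Minimal sketch (toy data, hypothetical job logic): two jobs with input
# correlation share the relation 'lineitem', so a correlation-aware plan
# scans it once and runs both map functions on each record.
def map_job1(row):                      # e.g. the map side of a join job
    yield row["orderkey"], ("J1", row["price"])

def map_job2(row):                      # e.g. the map side of an aggregation job
    yield row["orderkey"], ("J2", row["quantity"])

def merged_map(row):
    # One pass over the shared input feeds both original map functions.
    yield from map_job1(row)
    yield from map_job2(row)

lineitem = [{"orderkey": 1, "price": 10.0, "quantity": 2},
            {"orderkey": 2, "price": 7.5, "quantity": 1}]
for row in lineitem:                    # a single scan instead of two
    for key, tagged_value in merged_map(row):
        print(key, tagged_value)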
            Transit Correlation (TC)
 Multiple MR jobs have transit correlation (TC) if
   • they have input correlation (IC), and
   • they have the same Partition Key
[Figure: J1 (over lineitem and orders) and J2 (over lineitem) share the input relation lineitem, so they have IC, and both partition their map output on l_orderkey, so the map functions of MR job 1 and MR job 2 can also share one shuffle]
                                                                          79
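
Continuing the sketch (again with toy data and hypothetical job logic), transit correlation additionally lets the two jobs share one shuffle, since their map outputs use the same partition key:

# Minimal sketch (toy data, hypothetical job logic): with transit correlation,
# both jobs partition on the same key (l_orderkey), so their map outputs can go
# through one shuffle; the merged reduce separates values by job tag.
from collections import defaultdict

def merged_map(row):
    yield row["orderkey"], ("J1", row["price"])      # job 1's map output
    yield row["orderkey"], ("J2", row["quantity"])   # job 2's map output

def merged_reduce(key, tagged_values):
    j1_vals = [v for tag, v in tagged_values if tag == "J1"]
    j2_vals = [v for tag, v in tagged_values if tag == "J2"]
    return {"orderkey": key, "J1_sum_price": sum(j1_vals),
            "J2_sum_qty": sum(j2_vals)}

lineitem = [{"orderkey": 1, "price": 10.0, "quantity": 2},
            {"orderkey": 1, "price": 5.0, "quantity": 3}]
shuffle = defaultdict(list)                          # one shuffle serves both jobs
for row in lineitem:
    for key, tv in merged_map(row):
        shuffle[key].append(tv)
for key, tvs in shuffle.items():
    print(merged_reduce(key, tvs))                   # both jobs' work per key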
          Job Flow Correlation (JFC)
 An MR job has Job Flow Correlation (JFC) with one of its child
  MR jobs if it has the same partition key as that child job



[Figure: two jobs J1 and J2, where one consumes the other's output (derived from lineitem and orders) and both use the same partition key; in the merged execution, the map and reduce functions of MR job 1 and MR job 2 run within a single MR job]
                                                                          80
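
A rough sketch of how job flow correlation is exploited (toy data, hypothetical job logic): because the parent job partitions on the same key as its child, the parent's work can run inside the child's reduce task rather than as a second MR job:

# Minimal sketch (toy data, hypothetical job logic): with job flow correlation,
# a child job and its parent share the partition key, so the parent's work can
# run inside the child's reduce task instead of launching a second MR job.
from collections import defaultdict

def child_reduce(key, prices):
    # Child job: aggregate price per orderkey.
    return {"orderkey": key, "total": sum(prices)}

def parent_reduce(key, rows):
    # Parent job: would normally re-shuffle on orderkey; here it just filters.
    return [r for r in rows if r["total"] > 12]

def merged_reduce(key, prices):
    # Same partition key => the parent's logic consumes the child's reduce
    # output directly, with no intermediate HDFS write or second shuffle.
    child_out = child_reduce(key, prices)
    return parent_reduce(key, [child_out])

shuffle = defaultdict(list)
for orderkey, price in [(1, 10.0), (1, 5.0), (2, 7.5)]:
    shuffle[orderkey].append(price)
for key, prices in shuffle.items():
    print(merged_reduce(key, prices))   # [{'orderkey': 1, 'total': 15.0}] then []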
     Query Optimization Rules for
  Automatically Exploiting Correlations
 Exploiting both Input Correlation and Transit Correlation

 Exploiting the Job Flow Correlation associated with
Aggregation jobs

 Exploiting the Job Flow Correlation associated with JOIN
jobs and their transit-correlated parent jobs

 Exploiting the Job Flow Correlation associated with JOIN
jobs


                                                              81
                      Exp1: Four Cases of TPC-H Q21

[Figure: the MR job trees over lineitem and orders for each of the four cases]

1: Sentence-to-sentence translation
   • 5 MR jobs (Join1, AGG1, AGG2, Join2 and the Left-outer-Join, each as its own job)
2: InputCorrelation + TransitCorrelation
   • 3 MR jobs
3: InputCorrelation + TransitCorrelation + JobFlowCorrelation
   • 1 MR job
4: Hand-coding (similar to Case 3)
   • In the reduce function, the code is optimized according to query semantics
                                                                               87
  Breakdowns of Execution Time (sec)

[Figure: stacked map/reduce times of Job1-Job5 for four cases - No Correlation; Input Correlation + Transit Correlation; Input Correlation + Transit Correlation + JobFlow Correlation; Hand-Coding. Annotations: "From totally 888 sec to 510 sec", "From totally 768 sec to 567 sec", "Only 17% difference" (vs. hand-coding)]
                                                                             88
          Exp2: Clickstream Analysis

 A typical query in production clickstream analysis: "what is the average
  number of pages a user visits between a page in category 'X' and a page in
  category 'Y'?"
 In YSmart, JOIN1, AGG1, AGG2, JOIN2 and AGG3 are executed in a single MR job

[Figure: execution time (min) of YSmart, Hive and Pig - Hive is 4.8x and Pig is 8.4x slower than YSmart]
                                                                             89
          YSmart in the Hadoop Ecosystem

[Figure: YSmart integrated with Hive (Hive + YSmart) on top of the Hadoop Distributed File System (HDFS); see patch HIVE-2206 at apache.org]
                                                                             90
               Summary of YSmart
 YSmart is a correlation-aware SQL-to-MapReduce translator
 YSmart can outperform Hive by 4.8x, and Pig by 8.4x
 YSmart is being integrated into Hive
 A stand-alone version of YSmart was released in January 2012




                                                             91
[Figure: the full pipeline - web servers for 800M Facebook users generate data stored as RCFile data in a Hadoop-powered data warehousing system (Hive), where YSmart translates SQL-like queries to MapReduce jobs]
                                                                             92
                       Conclusion
 We have contributed two important system components, RCFile and YSmart,
  in the critical path of the big data analytics ecosystem
 An ecosystem of Hadoop-based big data analytics has been created: Hive and
  Pig will soon merge into a unified system
 RCFile and YSmart provide two basic standards in this new ecosystem
                      Thank You!


                                                                         93