Hadoop in SIGMOD 2011



    Nova: Continuous Pig/Hadoop Workflows
    Apache Hadoop Goes Realtime at Facebook
    Emerging Trends in the Enterprise Data Analytics
    A Hadoop Based Distributed Loading Approach to
     Parallel Data Warehouses
Industrial Sessions at SIGMOD 2011

    Data Management for Feeds and Streams (2)
    Dynamic Optimization and Unstructured Content (4)
    Applying Hadoop
    Business Analytics (2)
    Support for Business Analytics and Warehousing (4)
Nova: Continuous Pig/Hadoop Workflows

            By Yahoo!
                    Nova Overview

    Ingesting and analyzing user behavior logs
    Building and updating a search index from a stream of crawled web pages
    Processing semi-structured data

Two-layer programming model (Nova over Pig)
    Continuous processing
    Independent scheduling
    Cross-module optimization
    Manageability features
                   Workflow Model

    Two kinds of vertices: tasks (processing
     steps) and channels (data containers)
    Edges connect tasks to channels and channels
     to tasks
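
The vertex and edge rules above can be sketched as a small bipartite graph. This is only an illustration, not Nova's actual API: the `Workflow` class, its `connect` method, and the task and channel names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Bipartite workflow graph: tasks and channels are the two vertex kinds."""
    tasks: set = field(default_factory=set)
    channels: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, target) name pairs

    def connect(self, source, target):
        # Only task -> channel and channel -> task edges are allowed.
        task_to_channel = source in self.tasks and target in self.channels
        channel_to_task = source in self.channels and target in self.tasks
        if not (task_to_channel or channel_to_task):
            raise ValueError(f"illegal edge: {source} -> {target}")
        self.edges.add((source, target))


# Hypothetical example: crawled pages are shingled, then de-duplicated.
wf = Workflow(
    tasks={"shingling", "de-duping"},
    channels={"crawled_pages", "shingled_pages", "unique_pages"},
)
wf.connect("crawled_pages", "shingling")
wf.connect("shingling", "shingled_pages")
wf.connect("shingled_pages", "de-duping")
wf.connect("de-duping", "unique_pages")
```

Connecting two tasks directly (or two channels) raises `ValueError`, which mirrors the rule that data always flows through a channel.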

Four common patterns of processing
    Non-incremental (template detection)
    Stateless incremental (shingling)
    Stateless incremental with lookup table
     (template tagging)
    Stateful incremental (de-duping)
              Workflow Model (Cont.)

Data and Update Model
    Blocks: A channel’s data is divided into blocks

   Base block
                        Contains a complete snapshot of data on a
                        channel as of some point in time
                        Base blocks are assigned increasing sequence
                        numbers (B0, B1, ...)

   Delta block
                        Used in conjunction with incremental
                        processing
                        Contains instructions for transforming a base
                        block into a new base block (Δi→j : Bi → Bj, i < j)
             Workflow Model (Cont.)

Task/Data Interface
    Consumption mode: all or new
    Production mode: B or Δ
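
One way to picture the two consumption modes is a reader that either takes everything on the channel or only what arrived after its cursor. The sketch below assumes a hypothetical layout (a list of `(sequence_number, records)` batches) and a hypothetical `consume` function; it is not how Nova stores or reads data.

```python
def consume(batches, mode, cursor=0):
    """Return (records, new_cursor) for one task invocation.

    batches: list of (sequence_number, records) pairs, oldest first
             (an assumed layout, chosen only for this illustration).
    mode:    "all" reads every batch, "new" reads batches after `cursor`.
    """
    if mode == "all":
        selected = batches
    elif mode == "new":
        selected = [(seq, recs) for seq, recs in batches if seq > cursor]
    else:
        raise ValueError(f"unknown consumption mode: {mode}")
    records = [rec for _, recs in selected for rec in recs]
    # After the read, the cursor moves to the newest sequence number seen.
    new_cursor = max((seq for seq, _ in batches), default=cursor)
    return records, new_cursor


batches = [(1, ["a"]), (2, ["b", "c"]), (3, ["d"])]
everything, _ = consume(batches, "all")
only_new, cursor = consume(batches, "new", cursor=2)
```

Here `everything` holds all four records, while `only_new` holds just the record from batch 3, since the cursor was already at sequence number 2.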
                 Workflow Model (Cont.)

Workflow Programming and Scheduling
    Data-based trigger
    Time-based trigger
    Cascade trigger

Data Compaction and Garbage Collection
    If a channel has blocks B0, Δ0→1, Δ1→2, Δ2→3, the
     compaction operation computes B3 and adds it to the channel
    After compaction has added B3 to the channel, if the current
     cursor is at sequence number 2, then B0, Δ0→1 and Δ1→2
     can be garbage-collected.
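
The compaction and garbage-collection example can be traced with a minimal sketch. It rests on several assumptions: a delta is modeled as a set of added and a set of removed records, there is a single consumer cursor, and the names `Base`, `Delta`, `compact` and `collectable` are hypothetical. Nova's real block format and collection policy may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Base:
    seq: int               # sequence number of this snapshot
    records: frozenset     # complete contents of the channel at `seq`


@dataclass(frozen=True)
class Delta:
    start: int             # delta transforms base `start` ...
    end: int               # ... into base `end` (start < end)
    added: frozenset = frozenset()
    removed: frozenset = frozenset()


def apply_delta(base, delta):
    """Turn base B_start into base B_end using one delta block."""
    if base.seq != delta.start:
        raise ValueError("delta does not start at this base block")
    return Base(delta.end, (base.records - delta.removed) | delta.added)


def compact(base, deltas):
    """Fold a chain of deltas into the base, yielding one new base block."""
    current = base
    for delta in sorted(deltas, key=lambda d: d.start):
        current = apply_delta(current, delta)
    return current


def collectable(blocks, cursor):
    """Blocks no longer needed: bases older than the newest base, and
    deltas that end at or before the consumer's cursor (assumed rule)."""
    newest = max(b.seq for b in blocks if isinstance(b, Base))
    return [
        b for b in blocks
        if (isinstance(b, Base) and b.seq < newest)
        or (isinstance(b, Delta) and b.end <= cursor)
    ]


b0 = Base(0, frozenset({"a", "b"}))
d01 = Delta(0, 1, added=frozenset({"c"}))
d12 = Delta(1, 2, removed=frozenset({"a"}))
d23 = Delta(2, 3, added=frozenset({"d"}))
b3 = compact(b0, [d01, d12, d23])
garbage = collectable([b0, d01, d12, d23, b3], cursor=2)
```

With the cursor at sequence number 2, `garbage` contains B0, Δ0→1 and Δ1→2, while Δ2→3 and B3 are kept, matching the example on the slide.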
Nova System Architecture
Apache Hadoop Goes Realtime at Facebook

             By Facebook
Workload Types
  Facebook Messaging
     High Write Throughput
     Large Tables
     Data Migration
  Facebook Insights
     Realtime Analytics
     High Throughput Increments
  Facebook Metrics System (ODS)
     Automatic Sharding
     Fast Reads of Recent Data and Table Scans
Why Hadoop & HBase

High write throughput
Efficient and low-latency strong consistency semantics within
 a data center
Efficient random reads from disk
High Availability and Disaster Recovery
Fault Isolation
Atomic read-modify-write primitives
Range Scans
Tolerance of network partitions within a single data center
Zero Downtime in case of individual data center failure
Active-active serving capability across different data centers
                 Realtime HDFS

High Availability - AvatarNode
Realtime HDFS (Cont.)

Hadoop RPC compatibility

Block Availability: Placement Policy
    A pluggable block placement policy
              Realtime HDFS (Cont.)

Performance Improvements for a Realtime Workload
    RPC Timeout
    Reads from Local Replicas

New Features
    HDFS sync
    Concurrent Readers
Production HBase

ACID Compliance (RWCC: Read Write Consistency Control)
    Atomicity (WALEdit)
    Consistency
Availability Improvements
    HBase Master rewrite: region assignment state moved from memory to ZooKeeper
    Online Upgrades
    Distributed Log Splitting
Performance Improvements
    Compaction (minor and major)
    Read Optimizations
Emerging Trends in the Enterprise Data Analytics:
  Connecting Hadoop and DB2 Warehouse

                        By IBM

  1. Increasing volumes of data
  2. Hadoop-based solutions in conjunction with
     data warehouses
A Hadoop Based Distributed Loading Approach to
           Parallel Data Warehouses

                By Teradata

   ETL (Extraction, Transformation, Loading) is a critical
    part of data warehousing
   While data are partitioned and replicated across all
    nodes in a parallel data warehouse, the load utilities reside
    on a single node, which becomes a bottleneck
Why Hadoop for Teradata EDW(Enterprise Data Warehouse)?

    More disk space can be easily added
    Can be used as intermediate storage
    MapReduce for transformation
    Load data in parallel
Block Assignment Problem

 –   HDFS file F on a cluster of P nodes (each node is uniquely
   identified with an integer i where 1 ≤ i ≤ P)
 –   The problem is defined by: assignment(X, Y, n, m, k, r)
      X is the set of n blocks (X = {1, . . . , n}) of F
      Y is the set of m nodes running PDBMS (called PDBMS nodes)
       (Y ⊆ {1, . . . , P})
      k is the number of copies of each block; m is the number of
       PDBMS nodes
      r is the mapping recording the replicated block locations of
       each block: r(i) returns the set of nodes that hold a copy of
       block i
Block Assignment Problem(Cont.)

• An assignment g from the blocks in X to the nodes in Y is
  denoted by a mapping from X = {1, . . . , n} to Y where g(i)
  = j (i ∈ X, j ∈ Y ) means that the block i is assigned to the
  node j.

•    An even assignment g is an assignment such that ∀ i, j ∈ Y:
    | |{x ∈ X : g(x) = i}| − |{y ∈ X : g(y) = j}| | ≤ 1,
    i.e. the block counts of any two nodes differ by at most 1.

•    The cost of an assignment g is defined to be cost(g) =
    |{i ∈ X : g(i) ∉ r(i)}|, which is the number of blocks
    assigned to remote nodes.
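
The two definitions above can be checked on a small instance. The sketch below only evaluates a given assignment; it does not search for an optimal one. The function names `is_even` and `cost` and the sample data are illustrative assumptions, with `g` and `r` represented as Python dictionaries.

```python
from collections import Counter


def is_even(g, nodes):
    """True if the block counts of any two nodes in `nodes` differ by at most 1.

    g:     dict mapping block id -> assigned node (the assignment g)
    nodes: the set Y of PDBMS nodes (must be non-empty)
    """
    counts = Counter(g.values())
    loads = [counts[node] for node in nodes]  # nodes with no blocks count as 0
    return max(loads) - min(loads) <= 1


def cost(g, r):
    """Number of blocks assigned to a node that holds no replica of them.

    r: dict mapping block id -> set of nodes holding a copy (the mapping r)
    """
    return sum(1 for block, node in g.items() if node not in r[block])


# Hypothetical instance: n = 4 blocks, k = 2 copies each, Y = {1, 2}.
Y = {1, 2}
r = {1: {1, 3}, 2: {1, 2}, 3: {2, 3}, 4: {3, 4}}
g = {1: 1, 2: 1, 3: 2, 4: 2}

even = is_even(g, Y)
remote_blocks = cost(g, r)
```

In this instance each node receives two blocks, so the assignment is even. Block 4 has no replica on either PDBMS node, so it must be read remotely wherever it is assigned, giving a cost of 1.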
  Thank You!