Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop



Namit Jain,
Facebook Data Infrastructure Team
Hive: Simplifying Hadoop – New Technology, Familiar Interfaces

hive> select key, count(1) from kv1 where key > 100
  group by key;



vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar \
    -input /user/hive/warehouse/kv1 -mapper map.sh \
    -file /tmp/reducer.sh -file /tmp/map.sh \
    -reducer reducer.sh -output /tmp/largekey \
    -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
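For reference, kv1 in both versions is just a Ctrl-A-delimited text table. A minimal sketch of how such a table might be created and populated in Hive follows; the two-column schema and the input path are assumptions, not from the slides:

hive> -- kv1 as a plain text table, fields separated by \001 (Ctrl-A)
hive> CREATE TABLE kv1 (key INT, value STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
    > STORED AS TEXTFILE;
hive> -- assumed local sample file, loaded into the table's HDFS directory
hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE kv1;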
Data Flow Architecture at Facebook

[Diagram: Web Servers log via the Scribe MidTier into the Scribe-Hadoop Cluster, alongside Filers; Hive replication moves the data into the Adhoc Hive-Hadoop Cluster and the Production Hive-Hadoop Cluster, which also exchange data with Oracle RAC and Federated MySQL]
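A typical step in this flow is publishing a freshly replicated Scribe log directory to the warehouse as a Hive partition. A minimal sketch, assuming a hypothetical partitioned table named clicks and a hypothetical landing path (neither appears in the original slides):

hive> -- register one day of Scribe-collected logs with the warehouse
hive> ALTER TABLE clicks ADD PARTITION (ds='2009-06-01')
    > LOCATION '/user/hive/warehouse/scribe/clicks/ds=2009-06-01';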

Where is this data stored?

  Hadoop/Hive Warehouse
   –  9600 cores, 12 petabytes of storage
   –  12 TB per node
   –  Two-level network topology
   –  1 Gbit/sec from node to rack switch
   –  4 Gbit/sec from rack switch to the top-level switch
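(Back of the envelope: 12 PB at 12 TB per node works out to roughly 1,000 nodes, and 9600 cores across ~1,000 nodes to roughly 8–10 cores per node; the node count is inferred, not stated on the slide.)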
Data Usage

  Statistics per day:
   –  10 TB of compressed new data added per day
   –  135 TB of compressed data scanned per day
   –  7500+ Hive jobs on the production cluster per day
   –  80K compute hours per day
  Barrier to entry is significantly reduced:
   –  New engineers go through a Hive training session
   –  ~200 people per month run jobs on Hadoop/Hive
   –  Analysts (non-engineers) use Hadoop through Hive
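(Reading 80K as core-hours, that is about 80,000 / (9,600 cores × 24 h) ≈ 35% average utilization of the warehouse; the unit is an assumption.)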
Key Challenges

  Lack of Tools
  Shared Cluster
   –  Isolation
   –  Quotas
  Metadata Discovery
  Monitoring
   –  Capacity Planning
   –  Handling Failures
   –  Throttling
  Good Query Plans
  Single Points of Failure
Current Solutions

  Shared Cluster
   –  Different clusters for different QOS/users
   –  Memory limits per job
   –  Automatic retention for tables
   –  Limit on the number of mappers per job
   –  “Dynamic Clouds”
   –  Fair Share Scheduler
   –  Pools (see the sketch after this list)
   –  Speculative execution for both mappers and reducers
  Metadata Discovery
   –  CoHive – collaborative query management
         Lineage
         Table/column descriptions
         Identifying “expert users” per table
         Ranking the most used tables
  Monitoring
   –  Capture resource usage metrics
   –  Identify cost per business unit
   –  Load average
   –  Alerts
         Space utilization
         Servers going down
  Good Query Plans
   –  Data skew
   –  Hints/configurable parameters
   –  Rule based
   –  Tools use these hints based on costs
  Single Points of Failure
   –  High availability of the NameNode
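As an illustration of how pools isolate workloads on the shared cluster, here is a minimal sketch of a Hadoop Fair Scheduler allocations file; the pool names and the specific limits are invented for the example:

<?xml version="1.0"?>
<allocations>
  <!-- guaranteed slots for scheduled production jobs -->
  <pool name="production">
    <minMaps>500</minMaps>
    <minReduces>200</minReduces>
    <weight>2.0</weight>
  </pool>
  <!-- ad hoc queries share the remainder, capped per pool -->
  <pool name="adhoc">
    <maxRunningJobs>20</maxRunningJobs>
  </pool>
  <!-- default cap on concurrent jobs per user -->
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>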

				