POSTED ON: 1/16/2012
Hive

Bryson Hori
Leonardo Nguyen
Leo Tsuchiya
Branden Ogata

What is Hive?
- Data warehouse infrastructure
- Open source
- Built on top of Hadoop
- Goals
  - Scalability
  - Extensibility
  - Fault-tolerance
  - Loose-coupling

What is Hadoop?
- Open source project
- Hadoop Distributed File System (HDFS)
  - Focus on reliability with large files
  - Designed for low-cost hardware
- Runs computations near data to reduce costs
  - Despite this, speed is not a priority
  - Queries can still take hours to run

Setting Up Hive
- Set up AMI
- Download and extract Hadoop
- Create RSA key
- Start on one node
- Format file system
- Repeat for other nodes
- Designate as master/slave nodes

Difficulties with Hadoop Setup
- Requires a lot of changes to multiple configuration files
- Default settings do not work
- Assumes prior knowledge
- Networking error messages
- Network administration

Difficulties with Hive
- Cannot do anything with Hive before getting Hadoop to work

Reasons to use Hive
- Query language similar to SQL
  - Differences
    - Subqueries only in FROM clause
    - Only equi-joins supported
- Masochism

Results
- Official results (from Hadoop wiki)
  - Sorting 9TB on 900 nodes = 1.8 hours
  - Sorting 14TB on 1400 nodes = 2.2 hours
  - Sorting 20TB on 2000 nodes = 2.5 hours

Questions?
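
Note on the Results figures

The three sort timings under Results are easier to compare as throughput. The following is a minimal Python sketch of that arithmetic, not part of the original slides. The names RESULTS, GB_PER_TB, throughput, and summarize are made up for illustration, and the conversion assumes decimal terabytes (1 TB = 1000 GB), which the slides do not specify.

```python
# Sort figures as listed under Results: (terabytes sorted, nodes, hours).
RESULTS = [
    (9, 900, 1.8),
    (14, 1400, 2.2),
    (20, 2000, 2.5),
]

GB_PER_TB = 1000  # assumption: decimal terabytes; the slides do not say


def throughput(terabytes, nodes, hours):
    """Return (cluster TB per hour, per-node GB per hour) for one sort run."""
    cluster_tb_per_hour = terabytes / hours
    node_gb_per_hour = terabytes * GB_PER_TB / nodes / hours
    return cluster_tb_per_hour, node_gb_per_hour


def summarize(results=RESULTS):
    """Compute both throughput figures for every run, rounded to two decimals."""
    rows = []
    for terabytes, nodes, hours in results:
        cluster, per_node = throughput(terabytes, nodes, hours)
        rows.append((terabytes, nodes, round(cluster, 2), round(per_node, 2)))
    return rows


if __name__ == "__main__":
    for terabytes, nodes, cluster, per_node in summarize():
        print(f"{terabytes} TB on {nodes} nodes: "
              f"{cluster} TB/h total, {per_node} GB/h per node")
```

Each run works out to 10 GB of data per node, so the comparison isolates the effect of cluster size. By this arithmetic, total throughput rises from 5.0 to 8.0 TB per hour as nodes are added, while the per-node rate falls from about 5.56 to 4.0 GB per hour.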