

                    Hive
Bryson Hori
Leonardo Nguyen
Leo Tsuchiya
Branden Ogata
               What is Hive?
Data warehouse infrastructure
Open source
Built on top of Hadoop
Goals
   Scalability
   Extensibility
   Fault-tolerance
   Loose-coupling
              What is Hadoop?
Open source project
Hadoop Distributed File System (HDFS)
   Focus on reliability with large files
   Designed for low-cost hardware
   Runs computations near the data to reduce data-transfer costs
      Despite this, speed is not a priority
      Queries can still take hours to run
             Setting Up Hive
Set up AMI
Download and extract Hadoop
Create RSA key
Start on one node
   Format file system
Repeat for other nodes
   Designate as master/slave nodes
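
The setup order above can be sketched as a dry run. This is only an illustration: it prints commands and executes nothing. It assumes a Hadoop 0.20-era directory layout (bin/ and conf/), and the archive name, directory name, and host names are hypothetical placeholders, not values from the original setup.

```python
# Dry-run sketch of the setup order: prints the commands, runs nothing.
# Assumptions: Hadoop 0.20-era layout; all names below are hypothetical.

HADOOP_ARCHIVE = "hadoop-0.20.2.tar.gz"   # hypothetical archive name
HADOOP_HOME = "hadoop-0.20.2"             # hypothetical install directory
WORKERS = ["worker-1", "worker-2"]        # hypothetical slave host names


def per_node_commands():
    """Steps repeated on every node: extract Hadoop and create an RSA key."""
    return [
        f"tar -xzf {HADOOP_ARCHIVE}",
        'ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa',
        "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys",
    ]


def master_commands():
    """Steps run once on the master: list slaves, format HDFS, start daemons."""
    worker_list = " ".join(WORKERS)
    return [
        f"printf '%s\\n' {worker_list} > {HADOOP_HOME}/conf/slaves",
        f"{HADOOP_HOME}/bin/hadoop namenode -format",
        f"{HADOOP_HOME}/bin/start-dfs.sh",
        f"{HADOOP_HOME}/bin/start-mapred.sh",
    ]


def build_plan():
    """Return (where, command) pairs in the order they would be run."""
    plan = [("every node", c) for c in per_node_commands()]
    plan += [("master only", c) for c in master_commands()]
    return plan


for where, command in build_plan():
    print(f"[{where}] {command}")
```

The file system is formatted before any daemon starts, which matches the order in the list above.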
       Difficulties with Hadoop
Setup
   Requires many changes across multiple configuration files
   Default settings do not work
Assumes prior knowledge
   Networking error messages
   Network administration
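
As a rough illustration of the configuration changes, the sketch below generates minimal contents for three site files. It assumes the Hadoop 0.20/1.x property names (later releases renamed some of them, for example fs.default.name became fs.defaultFS). The host name, ports, and replication factor are hypothetical and are not the values used in the original setup.

```python
import xml.etree.ElementTree as ET

# Hypothetical master host; replace with the real host name.
MASTER = "master-node"

# Assumed minimal overrides, using Hadoop 0.20/1.x property names.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": f"hdfs://{MASTER}:9000"},
    "hdfs-site.xml": {"dfs.replication": "2"},
    "mapred-site.xml": {"mapred.job.tracker": f"{MASTER}:9001"},
}


def render_site_file(properties):
    """Build a <configuration> document from a dict of property names to values."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


for filename, props in SITE_FILES.items():
    print(f"--- conf/{filename} ---")
    print(render_site_file(props))
```

Every node needs the same master address in these files, which is one reason a small mistake shows up as a networking error.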
       Difficulties with Hive
Hive cannot be used until Hadoop is working
          Reasons to use Hive
Query language similar to SQL
   Differences
      Subqueries only in FROM clause
      Only equi-joins supported
Masochism
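
The two differences listed above can be illustrated by rewriting a WHERE-clause subquery as a FROM-clause subquery joined with an equality condition. The sketch below uses Python's built-in sqlite3 module purely as a stand-in SQL engine, not Hive itself. The tables, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (10, 'Engineering', 'Honolulu'),
                                   (20, 'Sales', 'Hilo');
    INSERT INTO employees VALUES (1, 'Alice', 10), (2, 'Bob', 20), (3, 'Carol', 10);
""")

# Ordinary SQL: subquery in the WHERE clause (the form the slide says is unsupported).
where_form = """
    SELECT name FROM employees
    WHERE dept_id IN (SELECT id FROM departments WHERE city = 'Honolulu')
    ORDER BY name
"""

# Rewritten form: subquery moved into FROM and combined with an equi-join.
from_form = """
    SELECT e.name
    FROM employees e
    JOIN (SELECT id FROM departments WHERE city = 'Honolulu') d
      ON e.dept_id = d.id
    ORDER BY e.name
"""

where_rows = conn.execute(where_form).fetchall()
from_rows = conn.execute(from_form).fetchall()
print(where_rows)
print(from_rows)
```

Both queries return the same rows here because departments.id is unique. With duplicate keys in the subquery, the join form would need a DISTINCT to match.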
                    Results
Official results (from Hadoop wiki)
   Sorting 9TB on 900 nodes = 1.8 hours
   Sorting 14TB on 1400 nodes = 2.2 hours
   Sorting 20TB on 2000 nodes = 2.5 hours
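
A short calculation helps when reading these figures. The sketch below derives aggregate and per-node throughput from the three results above, assuming 1 TB = 1000 GB. The function name is illustrative.

```python
# (terabytes sorted, nodes, hours) taken from the three results listed above.
RESULTS = [(9, 900, 1.8), (14, 1400, 2.2), (20, 2000, 2.5)]

GB_PER_TB = 1000  # assumption: decimal units


def throughput(terabytes, nodes, hours):
    """Return (TB per hour for the cluster, GB per hour per node)."""
    total_tb_per_hour = terabytes / hours
    per_node_gb_per_hour = terabytes * GB_PER_TB / nodes / hours
    return total_tb_per_hour, per_node_gb_per_hour


for terabytes, nodes, hours in RESULTS:
    total, per_node = throughput(terabytes, nodes, hours)
    print(f"{terabytes:>2} TB on {nodes} nodes: "
          f"{total:.2f} TB/h total, {per_node:.2f} GB/h per node")
```

Each run sorts 10 GB per node, so the growing run time means aggregate throughput rises (from 5 to 8 TB/h) while per-node throughput falls (from about 5.6 to 4.0 GB/h).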
Questions?

								