Hadoop and HBase vs RDBMS by jonathangray

Hadoop/HBase vs. RDBMS
December CTO Forum
Jonathan Gray, Streamy.com

About Me
• Co-Founder of Streamy.com
• Background in computer engineering, distributed/fault-tolerant applications, relational databases, Linux
• Successfully migrated Streamy backend from PostgreSQL to Hadoop/HBase in June

Why Hadoop/HBase?
• Datasets are growing into petabytes
• Traditional databases are expensive to scale and inherently difficult to distribute
• Commodity hardware is cheap and powerful
– $1k buys you 4core/4GB/1TB
– 300GB 15k RPM SAS nearly $500

• Need for random access and batch processing
– Hadoop only supports batch/streaming

History of Hadoop/HBase
• Google solved its scale problems
– “The Google File System” published October 2003
• Hadoop DFS

– “MapReduce: Simplified Data Processing on Large Clusters” published December 2004
• Hadoop MapReduce

– “Bigtable: A Distributed Storage System for Structured Data” published November 2006
• HBase

Hadoop Introduction
• Two main components
– Hadoop DFS
• A scalable, fault-tolerant, high performance distributed file system capable of running on commodity hardware

– Hadoop MapReduce
• Software framework for distributed computation

• Significant adoption
– Used in production in hundreds of organizations
– Yahoo is primary sponsor with tens of thousands of active nodes running Hadoop

HDFS: Hadoop Distributed File System
• Reliably store petabytes of replicated data across thousands of nodes
– Data divided into 64MB blocks, each block replicated three times

• Master/Slave architecture
– Master NameNode contains block locations
– Slave DataNode manages blocks on local FS

• Built on commodity hardware
– No 15k RPM disks or RAID required

HDFS Example
• Store 1TB flat text file on 10 node cluster
– Can use Java API or command-line
./hadoop dfs -copyFromLocal ./srcFile /destFile

– File split into 64MB blocks (16,384 total)
– Each block sent to three nodes (49,152 total, 3TB)
– Has notion of racks so that the replicas of a block land on distinct racks
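
The block and replica counts above can be checked with a few lines of arithmetic. This is only a sketch: `hdfs_footprint` is a hypothetical helper, not part of Hadoop, and it assumes binary units (1TB = 1024^4 bytes), which is what makes the figures come out exact.

```python
# Binary units: 1 MB = 2**20 bytes, 1 TB = 2**40 bytes (an assumption).
MB = 1024 ** 2
TB = 1024 ** 4


def hdfs_footprint(file_bytes, block_bytes=64 * MB, replication=3):
    """Return (blocks, block_replicas, raw_bytes) for a single file."""
    blocks = -(-file_bytes // block_bytes)   # ceiling division
    block_replicas = blocks * replication    # copies stored cluster-wide
    raw_bytes = file_bytes * replication     # disk consumed across all nodes
    return blocks, block_replicas, raw_bytes


blocks, block_replicas, raw_bytes = hdfs_footprint(1 * TB)
print(blocks, block_replicas, raw_bytes // TB)  # 16384 49152 3
```

On the 10 node cluster in the example, that works out to roughly 4,915 block replicas (about 300GB) per node if placement is even.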

Hadoop MapReduce
• Distributed programming model to reliably process petabytes of data by exploiting data locality
– Built-in bindings for Java and C
– Can use with any language via HadoopStreaming

• Inspired by map and reduce functions used in functional programming
Input -> Map() -> Copy/Sort -> Reduce() -> Output

MapReduce Example
• Perform WordCount on 1TB file in HDFS
– Map task launched for each block of file
– Within each task, Map function called for each line: Map(LineNumber, LineString)
• For each word in LineString -> Output(Word, 1)

– Map output is sorted, grouped and copied to reducer
– Reduce(Word, List) called for each word
• Output(Word, Length(List))

– Final output contains total count for each word
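
The WordCount flow can be imitated in a single process to show how the phases fit together. This is a sketch of the programming model only, not Hadoop code: `map_fn`, `reduce_fn`, and `run_wordcount` are hypothetical names, and the in-memory sort stands in for the distributed copy/sort step.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line_number, line_string):
    """Map(LineNumber, LineString): emit (Word, 1) for each word."""
    for word in line_string.split():
        yield word, 1


def reduce_fn(word, values):
    """Reduce(Word, List): emit (Word, Length(List))."""
    yield word, len(values)


def run_wordcount(lines):
    # Map phase: one call per input line.
    pairs = []
    for line_number, line in enumerate(lines):
        pairs.extend(map_fn(line_number, line))

    # Copy/sort phase: bring all values for the same key together.
    pairs.sort(key=itemgetter(0))

    # Reduce phase: one call per distinct word.
    output = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        for key, total in reduce_fn(word, values):
            output[key] = total
    return output


counts = run_wordcount(["the quick brown fox", "the lazy dog", "the end"])
print(counts["the"])  # 3
```

In a real job the map calls run on the nodes that hold each block, and only the sorted map output travels over the network to the reducers.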

• Hadoop is designed to store and stream extremely large datasets in batch
• It is not intended for realtime querying
• It does not support random access

This is why we have HBase!

What is HBase?
• Distributed
• Column-Oriented
• Multi-Dimensional
• High-Availability
• High-Performance
• Storage System

Project Goals
Billions of Rows * Millions of Columns * Thousands of Versions
Petabytes across thousands of commodity servers

HBase is not…
• A SQL Database
– No joins, no query engine, no types, no SQL
– Transactions and secondary indexing possible but immature

• A drop-in replacement for your RDBMS
• You must be OK with RDBMS anti-schema
– Denormalized data
– Wide and sparsely populated tables

Just say “no” to your inner DBA

HBase Architecture
• Table is made up of any number of regions
• Region is specified by its startKey and endKey
– Empty table:
(Table, NULL, NULL)

– Two-region table:
(Table, NULL, “Streamy”) and (Table, “Streamy”, NULL)

• Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop
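
A minimal sketch of how a row key maps to a region, given the (Table, startKey, endKey) triples above. `RegionTable` and its methods are hypothetical names for illustration. The sketch assumes that the start key is inclusive and the end key exclusive, and it uses `None` in place of NULL for an open-ended boundary.

```python
from bisect import bisect_right


class RegionTable:
    """Toy model of a table split into key ranges (not HBase code)."""

    def __init__(self, name, split_keys=()):
        self.name = name
        self.split_keys = sorted(split_keys)  # boundaries between regions

    def regions(self):
        """List (table, startKey, endKey) triples; None means unbounded."""
        bounds = [None] + self.split_keys + [None]
        return [(self.name, bounds[i], bounds[i + 1])
                for i in range(len(bounds) - 1)]

    def locate(self, row_key):
        """Return the region whose range contains row_key."""
        index = bisect_right(self.split_keys, row_key)
        return self.regions()[index]


empty = RegionTable("Table")
print(empty.regions())  # [('Table', None, None)]

two = RegionTable("Table", [b"Streamy"])
print(two.locate(b"Apple"))    # ('Table', None, b'Streamy')
print(two.locate(b"Streamy"))  # ('Table', b'Streamy', None)
```

Because rows are sorted, locating a region is a binary search over the boundaries rather than a scan of the data.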

HBase Architecture Contd.
• Two types of HBase nodes: Master and RegionServer
• Special tables -ROOT- and .META. store schema information and region locations
• Master server responsible for regionserver monitoring as well as assignment and load balancing of regions

HBase Tables
• Tables are sorted by Row
• Table schema only defines its column families
– Each family consists of any number of columns
– Each column consists of any number of versions
– Columns only exist when inserted, NULLs are free
– Columns within a family are sorted and stored together
– Everything except table names is byte[]

(Table, Row, Family:Column, Timestamp) -> Value

HBase Table as Data Structures
SortedMap(RowKey, List(SortedMap(Column, List(Value, Timestamp))))
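
The nested-map view can be sketched as a toy in-memory model. This is not HBase code: `ToyTable`, `put`, and `scan` are hypothetical names, and `get` only imitates the shape of a (row, column, ts, versions) lookup. Treating `ts` as an upper bound and returning the newest versions first are assumptions made for the sketch.

```python
class ToyTable:
    """row -> column -> list of (timestamp, value), newest first."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, timestamp, value):
        # Columns come into existence on insert; absent ones cost nothing.
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column, ts, versions=1):
        # Up to `versions` values written at or before `ts`.
        cells = self.rows.get(row, {}).get(column, [])
        return [value for stamp, value in cells if stamp <= ts][:versions]

    def scan(self):
        # Rows, and columns within a row, come back in sorted order.
        for row in sorted(self.rows):
            yield row, sorted(self.rows[row])


table = ToyTable()
row = b"http://example.com/"
table.put(row, b"content:type", 1, b"text/html")
table.put(row, b"content:type", 2, b"text/plain")
print(table.get(row, b"content:type", ts=2, versions=2))
# [b'text/plain', b'text/html']
print(table.get(row, b"content:language", ts=2))  # []
```

Keys and values are `bytes` here to mirror the point that everything except the table name is byte[].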

Web Crawl Example
• Store web crawl data
– Table crawl with family content
– Row is URL with Columns
• content:data stores raw crawled data
• content:language stores HTTP language header
• content:type stores HTTP content-type header

– If processing raw data for hyperlinks and images, add families links and images
• links:<url> column for each hyperlink
• images:<url> column for each image

Web Crawl Example in RDBMS
• How would this look in a traditional DB?
– Table crawl with columns url, data, language, and type
– Table links with columns url and link
– Table images with columns url and image

• How will this scale?
– 10M documents w/ avg 10 links and 10 images
– 210M total rows versus 10M total rows
– Index bloat with links/images tables
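
The row counts follow directly from the two layouts. In the sketch below, `rdbms_rows` and `hbase_rows` are hypothetical helper names: the relational design stores one row per document plus one row per link and per image, while the HBase design keeps links and images as columns of the document's own row.

```python
def rdbms_rows(documents, avg_links, avg_images):
    """One crawl row per document, plus one row per link and per image."""
    return documents * (1 + avg_links + avg_images)


def hbase_rows(documents):
    """Links and images are columns, so each document is a single row."""
    return documents


docs = 10_000_000
print(rdbms_rows(docs, avg_links=10, avg_images=10))  # 210000000
print(hbase_rows(docs))                               # 10000000
```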

Connecting to HBase
• Native Java Client/API
– get(byte[] row, byte[] column, long ts, int versions)

• Non-Java Clients
– Thrift server (Ruby, C++, etc)
– REST server
– Native C/C++ client scheduled for 0.20 release

• TableInput/TableOutputFormat for MapReduce
– HBase as MapReduce source and/or sink

• HBase Shell
– JRuby shell to add, get, scan, and admin

HBase Extensions
• Hive, Pig, Cascading
– Hadoop-targeted MapReduce tools with upcoming HBase integration

• Pigi
– HBase ORM that includes indexing, joining, searching, paging, and ordering

• subRecord
– Provides a unified infrastructure combining storage, security, logging, metrics, and monitoring on top of HBase

• HBase-Writer
– Heritrix crawling directly to an HBase table

History of HBase
• November 2006
– Google releases paper on BigTable

• February 2007
– Initial HBase prototype created as Hadoop contrib

• October 2007
– First useable HBase

• January 2008
– Hadoop becomes TLP, HBase becomes subproject

• October 2008
– HBase 0.18.1 released

Current Project Status
• Latest stable release: HBase 0.18.1 on Hadoop 0.18.2
– 8th HBase release
– Significant stability and scalability release

• Upcoming release: HBase 0.19.0 on Hadoop 0.19.0
– HDFS appends mean little to no data loss if master fails (max of 100 edits lost versus 30k previously)
– Block-caching, scanner pre-fetching, write batching
– Vastly improved resource monitoring and memory efficiency

• Next release: HBase 0.20.0 on Hadoop 0.20.0 (March 2009)
– New HDFS file format TFile
– No more SPOF: ZooKeeper integration gives multimaster
– Order of magnitude improvements in random access
– Significant expansion of in-memory and caching capabilities

HBase in the Wild
• Streamy
• Powerset
– Birthplace of HBase
– Home of Michael Stack, project lead

• Mahalo
• The Shopping Engine @ Tokenizer.org
• Advanced Threats Research @ Trend Micro
• Wikia
• Multilingual Archive @ WorldLing
• Also being used in some capacity at:
– Yahoo, Last.fm, Videosurf, and Rapleaf


Comparison One
• System to store a shopping cart
– Customers, Products, Orders

Simple SQL Schema
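-- The slide shorthand INDEXED is written out as explicit CREATE INDEX
-- statements; the index names are illustrative.
CREATE TABLE customers (
  customerid UUID PRIMARY KEY,
  name TEXT,
  email TEXT
);

CREATE TABLE products (
  productid UUID PRIMARY KEY,
  name TEXT,
  price DOUBLE PRECISION
);

CREATE TABLE orders (
  orderid UUID PRIMARY KEY,
  customerid UUID REFERENCES customers(customerid),
  date TIMESTAMP,
  total DOUBLE PRECISION
);
CREATE INDEX orders_customerid_idx ON orders (customerid);

CREATE TABLE orderproducts (
  orderid UUID REFERENCES orders(orderid),
  productid UUID REFERENCES products(productid)
);
CREATE INDEX orderproducts_orderid_idx ON orderproducts (orderid);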

Simple HBase Schema
CREATE TABLE customers (content, orders)
CREATE TABLE products (content)
CREATE TABLE orders (content, products)

Efficient Queries with Both
• Get name, email, orders for customer
• Get name, price for product
• Get customer, stamp, total for order
• Get list of products in order
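
With the HBase schema, each of these queries is a read of a single row. The sketch below models the three tables as plain dictionaries. The column qualifiers (`content:name`, `orders:<orderid>`, `products:<productid>`, and so on), the sample data, and the `family` helper are all assumptions made for illustration; the slides only name the families.

```python
customers = {
    "c1": {"content:name": "Ada", "content:email": "ada@example.com",
           "orders:o1": "", "orders:o2": ""},
}
products = {
    "p1": {"content:name": "Widget", "content:price": "9.99"},
    "p2": {"content:name": "Gadget", "content:price": "19.99"},
}
orders = {
    "o1": {"content:customer": "c1", "content:stamp": "2008-12-01",
           "content:total": "29.98", "products:p1": "", "products:p2": ""},
}


def family(row, name):
    """Return {qualifier: value} for every column in one family of a row."""
    prefix = name + ":"
    return {column[len(prefix):]: value
            for column, value in row.items() if column.startswith(prefix)}


# Name, email, and orders for a customer: one row read.
customer = customers["c1"]
print(family(customer, "content"), sorted(family(customer, "orders")))

# Name and price for a product: one row read.
print(family(products["p1"], "content"))

# Customer, stamp, and total for an order, plus its products: one row read.
order = orders["o1"]
print(family(order, "content"), sorted(family(order, "products")))
```

The list of orders lives inside the customer row and the list of products inside the order row, which is the denormalization the HBase schema relies on.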

Where SQL Makes Life Easy
• Joining
– In a single query, get all products in an order with their product information

• Secondary Indexing
– Get customerid by e-mail

• Referential Integrity
– Deleting an order would delete links out of ‘orderproducts’
– ID updates propagate

• Realtime Analysis
– GROUP BY and ORDER BY allow for simple statistical analysis
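
Without a query engine, the joining and secondary indexing listed above have to be done by the application. A minimal sketch of what that can look like, using plain dictionaries as stand-ins for tables: `customers_by_email`, `put_customer`, and `products_in_order` are hypothetical names, and the column qualifiers and sample data are assumptions.

```python
products = {
    "p1": {"content:name": "Widget", "content:price": "9.99"},
    "p2": {"content:name": "Gadget", "content:price": "19.99"},
}
orders = {"o1": {"products:p1": "", "products:p2": ""}}
customers = {}
customers_by_email = {}  # hand-maintained secondary index: email -> customerid


def put_customer(customerid, name, email):
    """Write the customer row and keep the e-mail index in step with it."""
    customers[customerid] = {"content:name": name, "content:email": email}
    customers_by_email[email] = customerid


def products_in_order(orderid):
    """Client-side join: one read of the order row, then one read per product."""
    product_ids = sorted(column.split(":", 1)[1]
                         for column in orders[orderid]
                         if column.startswith("products:"))
    return [(pid, products[pid]["content:name"], products[pid]["content:price"])
            for pid in product_ids]


put_customer("c1", "Ada", "ada@example.com")
print(customers_by_email["ada@example.com"])  # c1
print(products_in_order("o1"))
# [('p1', 'Widget', '9.99'), ('p2', 'Gadget', '19.99')]
```

Nothing keeps the two customer writes atomic here, which is one reason the relational version is more convenient at small scale.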

Where HBase Makes Life Easy
• Dataset Scale
– We have 1M customers and 100M products
– Product information includes large text datasheets or PDF files
– Want to track every time a customer looks at a product page

• Read/Write Scale
– Tables distributed across nodes means reads/writes are fully distributed
– Writes are extremely fast and require no index updates

• Replication
– Comes for free

• Batch Analysis
– Massive and convoluted SQL queries executed serially become efficient MapReduce jobs distributed and executed in parallel

• For small instances of simple/straightforward systems, relational databases offer a much more convenient way to model and access data
– Can outsource most work to transaction and query engine
– HBase will force you to pull complexity into Application layer

• Once you need to scale, the properties and flexibility of HBase can relieve you from the headaches associated with scaling an RDBMS

Questions?

Comparison Two
• Compare key factors
– Hardware Requirements
– Scalability
– Reliability
– Ease of Use
– Cost

Hardware Requirements
• RDBMS are IO-bound
– Typically require large arrays of fast and expensive disks
– Modest production environment might have a single node with 15-30 15k RPM drives, 16 cores, and 16-64GB RAM
– Requires a backup server with similar specs
– $$$$$

• HBase is designed for commodity hardware
– Biggest factor for performance is number of nodes
– Modest production environment might have 10-20 nodes, each with two 500GB 7.2k RPM drives, 4 cores, and 4GB RAM
– Common to have one master node with RAID, dual PSU, etc as this is currently a SPOF

Scalability
• RDBMS scale achieved through
– Caching a la Memcached
– Partitioning often left up to the application or external tools
– Replication can be built-in or an add-on with most popular RDBMS
– Regardless of scale mechanisms, architecture does not allow efficient multi-master support

• HBase scales out of the box
– Random access often made faster with something similar to Memcached (built-in with 0.20 release)
– Constant performance from low to high concurrency
– Writes are distributed and there are no indexes
– Scale by plugging in more RegionServers

Reliability
• RDBMS
– Slave replication
– Warm/Hot backups
– Single node failure is often catastrophic

• HBase
– Replication is built-in
– Backups are unnecessary but available

Ease of Use
• RDBMS
– Millions are trained in SQL and relational data modeling
– Normalized schemas are well understood and have predictable performance
– However, schemas are often limiting, difficult to change, and scale poorly

• HBase and MapReduce
– Significant learning curve
– Both have excellent communities and increasing numbers of tools to help ease the initial pain
– Schemas are loosely defined so data structure is easy to change and performance is constant

Other Factors
• Operating System / Architecture
– RDBMS vary greatly on their target architectures
– HBase designed for Linux though also being run on Solaris and with some success on Windows

• Cost
– HBase is FOSS
– Plenty of mature FOSS RDBMS, but many used in enterprise are expensive

• Widespread use
– RDBMS are tried and true
– Hadoop and HBase are still in development and though production-ready are not yet in wide use

• Second verse is same as the first verse
• RDBMS provides tremendous functionality out of the box but is extremely difficult and costly to scale
• HBase provides barebones functionality out of the box but scaling is built-in and inexpensive
