Hadoop/HBase vs. RDBMS
December CTO Forum Jonathan Gray Streamy.com
About Me
• Co-Founder of Streamy.com
• Background in computer engineering, distributed/fault-tolerant applications, relational databases, Linux
• Successfully migrated Streamy backend from PostgreSQL to Hadoop/HBase in June
Why Hadoop/HBase?
• Datasets are growing into petabytes
• Traditional databases are expensive to scale and inherently difficult to distribute
• Commodity hardware is cheap and powerful
– $1k buys you a 4core/4GB/1TB machine
– A single 300GB 15k RPM SAS drive costs nearly $500
• Need for random access and batch processing
– Hadoop only supports batch/streaming
History of Hadoop/HBase
• Google solved its scale problems
– “The Google File System” published October 2003
• Hadoop DFS
– “MapReduce: Simplified Data Processing on Large Clusters” published December 2004
• Hadoop MapReduce
– “Bigtable: A Distributed Storage System for Structured Data” published November 2006
• HBase
Hadoop Introduction
• Two main components
– Hadoop DFS
• A scalable, fault-tolerant, high performance distributed file system capable of running on commodity hardware
– Hadoop MapReduce
• Software framework for distributed computation
• Significant adoption
– Used in production in hundreds of organizations
– Yahoo is primary sponsor with tens of thousands of active nodes running Hadoop
HDFS: Hadoop Distributed File System
• Reliably store petabytes of replicated data across thousands of nodes
– Data divided into 64MB blocks, each block replicated three times
• Master/Slave architecture
– Master NameNode contains block locations
– Slave DataNode manages blocks on local FS
• Built on commodity hardware
– No 15k RPM disks or RAID required
HDFS Example
• Store 1TB flat text file on 10 node cluster
– Can use Java API or command-line
./hadoop dfs -copyFromLocal ./srcFile /destFile
– File split into 64MB blocks (16,384 total)
– Each block sent to three nodes (49,152 total, 3TB)
– Has notion of racks to ensure replicas are spread across distinct racks
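The block counts above follow from simple arithmetic. A minimal sketch, assuming binary units (1TB = 1024^4 bytes) and perfectly even placement across the 10 nodes (the even spread is an assumption, not something HDFS guarantees):

```python
# Worked arithmetic for the 1TB / 10-node example, using binary units.
TB = 1024 ** 4
MB = 1024 ** 2

file_size = 1 * TB
block_size = 64 * MB
replication = 3
nodes = 10

blocks = file_size // block_size        # logical blocks in the file
replicas = blocks * replication         # block copies stored cluster-wide
raw_bytes = file_size * replication     # raw disk consumed by all copies
replicas_per_node = replicas / nodes    # average, assuming even placement

print(blocks, replicas, raw_bytes // TB, replicas_per_node)
```

With even placement each node would hold roughly 4,915 block copies, or about 300GB of raw data.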
MapReduce
• Distributed programming model to reliably process petabytes of data using its locality
– Built-in bindings for Java and C
– Can use with any language via HadoopStreaming
• Inspired by map and reduce functions used in functional programming
Input -> Map() -> Copy/Sort -> Reduce() -> Output
MapReduce Example
• Perform WordCount on 1TB file in HDFS
– Map task launched for each block of file
– Within each task, Map function called for each line: Map(LineNumber, LineString)
• For each word in LineString Output(Word, 1)
– Map output is sorted, grouped and copied to reducer
– Reduce(Word, List) called for each word
• Output(Word, Length(List))
– Final output contains total count for each word
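The WordCount flow above can be simulated in a single process. This is only a sketch of the programming model, not Hadoop code: the names map_fn, reduce_fn, and word_count are made up for illustration, and an in-memory dictionary stands in for the distributed copy/sort step.

```python
from collections import defaultdict

def map_fn(line_number, line):
    # Emit (word, 1) for every word in the line; the key (line number) is unused.
    for word in line.split():
        yield word, 1

def reduce_fn(word, values):
    # The count is the length of the list of 1s collected for this word.
    return word, len(values)

def word_count(lines):
    # Map phase: call map_fn once per line and group the output by word.
    grouped = defaultdict(list)
    for number, line in enumerate(lines):
        for word, one in map_fn(number, line):
            grouped[word].append(one)   # stands in for copy/sort/group
    # Reduce phase: one reduce_fn call per word, in sorted key order.
    return dict(reduce_fn(word, grouped[word]) for word in sorted(grouped))

counts = word_count(["the quick fox", "the lazy dog", "the end"])
print(counts)
```

On a real cluster each map task would see only its own 64MB block, and the grouping step would move data between nodes.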
Hadoop…
• …is designed to store and stream extremely large datasets in batch
• …is not intended for realtime querying
• …does not support random access
This is why we have HBase!
What is HBase?
• Distributed
• Column-Oriented
• Multi-Dimensional
• High-Availability
• High-Performance
• Storage System
Project Goals
Billions of Rows * Millions of Columns * Thousands of Versions
Petabytes across thousands of commodity servers
HBase is not…
• A SQL Database
– No joins, no query engine, no types, no SQL
– Transactions and secondary indexing possible but immature
• A drop-in replacement for your RDBMS
• You must be OK with RDBMS anti-schema
– Denormalized data
– Wide and sparsely populated tables
Just say “no” to your inner DBA
HBase Architecture
• Table is made up of any number of regions
• Region is specified by its startKey and endKey
– Empty table:
(Table, NULL, NULL)
– Two-region table:
(Table, NULL, “Streamy”) and (Table, “Streamy”, NULL)
• Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop
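Because regions are defined by key ranges, finding the region for a row is a search over sorted start keys. A minimal sketch using the two-region table from above, assuming the start key is inclusive and the end key exclusive (the slides do not say), and using Python string ordering where HBase compares raw bytes. The function name find_region is hypothetical.

```python
import bisect

# Each region is (table, start_key, end_key); None plays the role of NULL.
# Regions are kept sorted by start key, and the first one has a NULL start.
regions = [
    ("Table", None, "Streamy"),
    ("Table", "Streamy", None),
]

def find_region(regions, row_key):
    """Return the region whose [start_key, end_key) range contains row_key."""
    # Treat a NULL start key as the empty key, which sorts before everything.
    starts = [start or "" for _, start, _ in regions]
    # The rightmost region whose start key is <= row_key owns the row.
    index = bisect.bisect_right(starts, row_key) - 1
    return regions[index]

print(find_region(regions, "Apple"))
print(find_region(regions, "Streamy"))
print(find_region(regions, "Zebra"))
```

Under these assumptions "Apple" falls in the first region, while "Streamy" and "Zebra" fall in the second.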
HBase Architecture Contd.
• Two types of HBase nodes: Master and RegionServer
• Special tables -ROOT- and .META. store schema information and region locations
• Master server responsible for regionserver monitoring as well as assignment and load balancing of regions
HBase Tables
• Tables are sorted by Row
• Table schema only defines its column families
– Each family consists of any number of columns
– Each column consists of any number of versions
– Columns only exist when inserted, NULLs are free
– Columns within a family are sorted and stored together
– Everything except table names is byte[]
(Table, Row, Family:Column, Timestamp) -> Value
HBase Table as Data Structures
SortedMap(RowKey, List(SortedMap(Column, List(Value, Timestamp))))
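The nested-map view can be made concrete with a toy in-memory table. This is only a sketch of the data model: the class name ToyTable and its methods are hypothetical, values are Python strings rather than byte[], and sorting happens at read time rather than on disk.

```python
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for one table: row -> column -> [(timestamp, value)]."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        versions = self.rows[row][column]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest version first

    def get(self, row, column, versions=1):
        # Missing rows or columns cost nothing: they are simply absent.
        if row not in self.rows or column not in self.rows[row]:
            return []
        return self.rows[row][column][:versions]

    def scan(self):
        # Rows come back in sorted order, columns sorted within each row.
        for row in sorted(self.rows):
            for column in sorted(self.rows[row]):
                yield row, column, self.rows[row][column][0]

table = ToyTable()
table.put("com.example/b", "content:type", "text/html", ts=1)
table.put("com.example/a", "content:data", "<html>v1</html>", ts=1)
table.put("com.example/a", "content:data", "<html>v2</html>", ts=2)

latest = table.get("com.example/a", "content:data")
history = table.get("com.example/a", "content:data", versions=2)
scanned_rows = [row for row, _, _ in table.scan()]
print(latest, history, scanned_rows)
```

A read returns the newest version by default, and asking for a column that was never written returns nothing rather than a stored NULL.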
Web Crawl Example
• Store web crawl data
– Table crawl with family content
– Row is URL with Columns
• content:data stores raw crawled data
• content:language stores http language header
• content:type stores http content-type header
– If processing raw data for hyperlinks and images, add families links and images
• links: column for each hyperlink
• images: column for each image
Web Crawl Example in RDBMS
• How would this look in a traditional DB?
– Table crawl with columns url, data, language, and type
– Table links with columns url and link
– Table images with columns url and image
• How will this scale?
– 10M documents w/ avg 10 links and 10 images
– 210M total rows versus 10M total rows
– Index bloat with links/images tables
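The 210M figure comes from counting one row per document plus one row per link and per image. A quick check of that arithmetic, using the averages given above:

```python
documents = 10_000_000
avg_links = 10
avg_images = 10

# RDBMS: one crawl row per document, plus one row per link and per image.
rdbms_rows = documents + documents * avg_links + documents * avg_images

# HBase: links and images become columns of the document's own row.
hbase_rows = documents

print(rdbms_rows, hbase_rows)
```

The link and image data does not disappear in HBase; it moves from separate rows into columns of the URL's row, so it needs no separate index.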
Connecting to HBase
• Native Java Client/API
– get(byte[] row, byte[] column, long ts, int versions)
• Non-Java Clients
– Thrift server (Ruby, C++, etc)
– REST server
– Native C/C++ client scheduled for 0.20 release
• TableInput/TableOutputFormat for MapReduce
– HBase as MapReduce source and/or sink
• HBase Shell
– JRuby shell to add, get, scan, and admin
HBase Extensions
• Hive, Pig, Cascading
– Hadoop-targeted MapReduce tools with upcoming HBase integration
• Pigi
– HBase ORM that includes indexing, joining, searching, paging, and ordering
• subRecord
– Provides a unified infrastructure combining storage, security, logging, metrics, and monitoring on top of HBase
• HBase-Writer
– Heritrix crawling directly to an HBase table
History of HBase
• November 2006
– Google releases paper on Bigtable
• February 2007
– Initial HBase prototype created as Hadoop contrib
• October 2007
– First usable HBase
• January 2008
– Hadoop becomes TLP, HBase becomes subproject
• October 2008
– HBase 0.18.1 released
Current Project Status
• Latest stable release: HBase 0.18.1 on Hadoop 0.18.2
– 8th HBase release
– Significant stability and scalability release
• Upcoming release: HBase 0.19.0 on Hadoop 0.19.0
– HDFS appends mean little to no data loss if master fails (max of 100 edits lost versus 30k previously)
– Block-caching, scanner pre-fetching, write batching
– Vastly improved resource monitoring and memory efficiency
• Next release: HBase 0.20.0 on Hadoop 0.20.0 (March 2009)
– New HDFS file format TFile
– No more SPOF: ZooKeeper integration gives multimaster
– Order of magnitude improvements in random access
– Significant expansion of in-memory and caching capabilities
HBase in the Wild
• Streamy
• Powerset
– Birthplace of HBase
– Home of Michael Stack, project lead
• Mahalo
• The Shopping Engine @ Tokenizer.org
• Advanced Threats Research @ Trend Micro
• Wikia
• Multilingual Archive @ WorldLing
• Also being used in some capacity at:
– Yahoo, Last.fm, Videosurf, and Rapleaf
Questions?
Comparison One
• System to store a shopping cart
– Customers, Products, Orders
Simple SQL Schema
CREATE TABLE customers (
  customerid UUID PRIMARY KEY,
  name TEXT,
  email TEXT)
CREATE TABLE products (
  productid UUID PRIMARY KEY,
  name TEXT,
  price DOUBLE)
CREATE TABLE orders (
  orderid UUID PRIMARY KEY,
  customerid UUID INDEXED REFERENCES(customers.customerid),
  date TIMESTAMP,
  total DOUBLE)
CREATE TABLE orderproducts (
  orderid UUID INDEXED REFERENCES(orders.orderid),
  productid UUID REFERENCES(products.productid))
Simple HBase Schema
CREATE TABLE customers (content, orders)
CREATE TABLE products (content)
CREATE TABLE orders (content, products)
Efficient Queries with Both
• Get name, email, orders for customer
• Get name, price for product
• Get customer, stamp, total for order
• Get list of products in order
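With the HBase schema, each of these four queries is a read of one row. A rough sketch with plain dictionaries standing in for tables; the column qualifiers (content:name, orders:o1, products:p1, and so on), the sample data, and the family helper are all assumptions made for illustration, since the slides only name the column families.

```python
# Each table maps row key -> {"family:qualifier": value}.
customers = {
    "c1": {"content:name": "Ada", "content:email": "ada@example.com",
           "orders:o1": "", "orders:o2": ""},
}
products = {
    "p1": {"content:name": "Widget", "content:price": "9.99"},
    "p2": {"content:name": "Gadget", "content:price": "4.50"},
}
orders = {
    "o1": {"content:customer": "c1", "content:stamp": "2008-12-01",
           "content:total": "14.49", "products:p1": "", "products:p2": ""},
}

def family(row, name):
    """Return {qualifier: value} for one column family of a row."""
    prefix = name + ":"
    return {col[len(prefix):]: val for col, val in row.items()
            if col.startswith(prefix)}

# The four queries, each touching a single row:
customer = family(customers["c1"], "content")
customer_orders = sorted(family(customers["c1"], "orders"))
product = family(products["p1"], "content")
order = family(orders["o1"], "content")
order_products = sorted(family(orders["o1"], "products"))

print(customer, customer_orders, product, order, order_products)
```

Note that listing the products in an order returns only product IDs; fetching each product's name and price takes further reads, which is the join that HBase leaves to the application.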
Where SQL Makes Life Easy
• Joining
– In a single query, get all products in an order with their product information
• Secondary Indexing
– Get customerid by e-mail
• Referential Integrity
– Deleting an order would delete links out of ‘orderproducts’
– ID updates propagate
• Realtime Analysis
– GROUP BY and ORDER BY allow for simple statistical analysis
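The four conveniences above can be tried against a small in-memory database. A minimal sketch using Python's built-in sqlite3 as a stand-in RDBMS: the types are adapted (TEXT and REAL instead of UUID and DOUBLE), the index name and ON DELETE CASCADE clause are additions, and the sample rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces references only when asked
db.executescript("""
CREATE TABLE customers (customerid TEXT PRIMARY KEY, name TEXT, email TEXT);
CREATE INDEX customers_email ON customers (email);
CREATE TABLE products (productid TEXT PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE orders (orderid TEXT PRIMARY KEY,
    customerid TEXT REFERENCES customers (customerid), date TEXT, total REAL);
CREATE TABLE orderproducts (
    orderid TEXT REFERENCES orders (orderid) ON DELETE CASCADE,
    productid TEXT REFERENCES products (productid));
INSERT INTO customers VALUES ('c1', 'Ada', 'ada@example.com');
INSERT INTO products VALUES ('p1', 'Widget', 9.99), ('p2', 'Gadget', 4.5);
INSERT INTO orders VALUES ('o1', 'c1', '2008-12-01', 14.49),
                          ('o2', 'c1', '2008-12-02', 9.99);
INSERT INTO orderproducts VALUES ('o1', 'p1'), ('o1', 'p2'), ('o2', 'p1');
""")

# Joining: products in an order, with their product information, in one query.
items = db.execute("""SELECT p.name, p.price FROM orderproducts op
    JOIN products p ON p.productid = op.productid
    WHERE op.orderid = ? ORDER BY p.name""", ("o1",)).fetchall()

# Secondary indexing: customerid by e-mail, served by the customers_email index.
customer_id = db.execute("SELECT customerid FROM customers WHERE email = ?",
                         ("ada@example.com",)).fetchone()[0]

# Realtime analysis: order count and revenue per customer.
stats = db.execute("""SELECT customerid, COUNT(*), SUM(total) FROM orders
    GROUP BY customerid ORDER BY customerid""").fetchall()

# Referential integrity: deleting an order removes its orderproducts links.
db.execute("DELETE FROM orders WHERE orderid = ?", ("o1",))
remaining_links = db.execute("SELECT COUNT(*) FROM orderproducts").fetchone()[0]

print(items, customer_id, stats, remaining_links)
```

Each of these is one declarative statement here; with HBase the equivalent logic would have to live in application code.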
Where HBase Makes Life Easy
• Dataset Scale
– We have 1M customers and 100M products
– Product information includes large text datasheets or PDF files
– Want to track every time a customer looks at a product page
• Read/Write Scale
– Tables distributed across nodes means reads/writes are fully distributed
– Writes are extremely fast and require no index updates
• Replication
– Comes for free
• Batch Analysis
– Massive and convoluted SQL queries executed serially become efficient MapReduce jobs distributed and executed in parallel
Conclusion
• For small instances of simple/straightforward systems, relational databases offer a much more convenient way to model and access data
– Can outsource most work to transaction and query engine
– HBase will force you to pull complexity into Application layer
• Once you need to scale, the properties and flexibility of HBase can relieve you from the headaches associated with scaling an RDBMS
Questions?
Comparison Two
• Compare key factors
– Hardware Requirements
– Scalability
– Reliability
– Ease of Use
– Cost
Hardware Requirements
• RDBMS are IO-bound
– Typically require large arrays of fast and expensive disks
– Modest production environment might have a single node with 15-30 15k RPM drives, 16 cores, and 16-64GB RAM
– Requires a backup server with similar specs
– $$$$$
• HBase is designed for commodity hardware
– Biggest factor for performance is number of nodes
– Modest production environment might have 10-20 nodes, each with two 500GB 7.2k RPM drives, 4 cores, and 4GB RAM
– Common to have one master node with RAID, dual PSU, etc. as this is currently a SPOF
Scalability
• RDBMS scale achieved through
– Caching a la Memcached
– Partitioning often left up to the application or external tools
– Replication can be built-in or an add-on with most popular RDBMS
– Regardless of scale mechanisms, architecture does not allow efficient multi-master support
• HBase scales out of the box
– Random access often made faster with something similar to Memcached (built-in with 0.20 release)
– Constant performance from low to high concurrency
– Writes are distributed and there are no indexes
– Scale by plugging in more RegionServers
Reliability
• RDBMS
– Slave replication
– Warm/Hot backups
– Single node failure is often catastrophic
• HBase
– Replication is built-in
– Backups are unnecessary but available
Ease of Use
• RDBMS
– Millions are trained in SQL and relational data modeling
– Normalized schemas are well understood and have predictable performance
– However schemas are often limiting, difficult to change, and scale poorly
• HBase and MapReduce
– Significant learning curve
– Both have excellent communities and increasing numbers of tools to help ease the initial pain
– Schemas are loosely defined so data structure is easy to change and performance is constant
Other Factors
• Operating System / Architecture
– RDBMS vary greatly in their target architectures
– HBase is designed for Linux, though it is also being run on Solaris and with some success on Windows
• Cost
– HBase is FOSS
– Plenty of mature FOSS RDBMS, but many used in enterprise are expensive
• Widespread use
– RDBMS are tried and true
– Hadoop and HBase are still in development and, though production-ready, are not yet in wide use
Conclusion
• Second verse is same as the first verse
• RDBMS provides tremendous functionality out of the box but is extremely difficult and costly to scale
• HBase provides barebones functionality out of the box but scaling is built-in and inexpensive