Hadoop and HBase vs RDBMS
Presentation given to the Los Angeles CTO Forum on December 12, 2008. An introduction to Hadoop, MapReduce, and HBase, including how HBase compares to a traditional RDBMS.
Hadoop/HBase vs. RDBMS
December CTO Forum
Jonathan Gray, Streamy.com

About Me
• Co-Founder of Streamy.com
• Background in computer engineering, distributed/fault-tolerant applications, relational databases, Linux
• Successfully migrated the Streamy backend from PostgreSQL to Hadoop/HBase in June

Why Hadoop/HBase?
• Datasets are growing into petabytes
• Traditional databases are expensive to scale and inherently difficult to distribute
• Commodity hardware is cheap and powerful
– $1k buys you 4 cores / 4GB RAM / 1TB of disk
– A single 300GB 15k RPM SAS drive is nearly $500
• Need for both random access and batch processing
– Hadoop only supports batch/streaming

History of Hadoop/HBase
• Google solved its scale problems
– “The Google File System,” published October 2003 → Hadoop DFS
– “MapReduce: Simplified Data Processing on Large Clusters,” published December 2004 → Hadoop MapReduce
– “Bigtable: A Distributed Storage System for Structured Data,” published November 2006 → HBase

Hadoop Introduction
• Two main components
– Hadoop DFS: a scalable, fault-tolerant, high-performance distributed file system capable of running on commodity hardware
– Hadoop MapReduce: a software framework for distributed computation
• Significant adoption
– Used in production in hundreds of organizations
– Yahoo is the primary sponsor, with tens of thousands of active nodes running Hadoop

HDFS: Hadoop Distributed File System
• Reliably stores petabytes of replicated data across thousands of nodes
– Data is divided into 64MB blocks, and each block is replicated three times
• Master/Slave architecture
– The master NameNode tracks block locations
– Slave DataNodes manage the blocks on their local filesystems
• Built on commodity hardware
– No 15k RPM disks or RAID required

HDFS Example
• Store a 1TB flat text file on a 10-node cluster
– Can use the Java API or the command line:

./hadoop dfs -copyFromLocal ./srcFile /destFile

– The file is split into 64MB blocks (16,384 total)
– Each block is sent to three nodes (49,152 blocks total, 3TB)
– HDFS is rack-aware, ensuring blocks are replicated across distinct racks/locations

MapReduce
• Distributed programming model to reliably process petabytes of data by exploiting its locality
– Built-in bindings for Java and C++
– Can be used with any language via Hadoop Streaming
• Inspired by the map and reduce functions of functional programming

Input -> Map() -> Copy/Sort -> Reduce() -> Output

MapReduce Example
• Perform WordCount on a 1TB file in HDFS (see the sketch below)
– A Map task is launched for each block of the file
– Within each task, the Map function is called for each line: Map(LineNumber, LineString)
• For each word in LineString, Output(Word, 1)
– Map output is sorted, grouped, and copied to the reducers
– Reduce(Word, List) is called for each word
• Output(Word, Length(List))
– Final output contains the total count for each word
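A minimal sketch of that WordCount job against the original org.apache.hadoop.mapred API of the Hadoop 0.18/0.19 era; the input and output paths (/bigfile, /wordcounts) are placeholders.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map: called once per line; emits (word, 1) for every word on the line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce: called once per word with its list of 1s; emits the total count
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);  // pre-aggregate on the map side
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path("/bigfile"));     // 1TB input in HDFS
    FileOutputFormat.setOutputPath(conf, new Path("/wordcounts")); // results directory
    JobClient.runJob(conf);
  }
}

With Hadoop Streaming, the same job can instead be written as two small scripts in any language that read stdin and write stdout.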
Hadoop…
• …is designed to store and stream extremely large datasets in batch
• …is not intended for realtime querying
• …does not support random access

This is why we have HBase!

What is HBase?
• Distributed
• Column-Oriented
• Multi-Dimensional
• High-Availability
• High-Performance
• Storage System

Project Goals
• Billions of Rows * Millions of Columns * Thousands of Versions
• Petabytes across thousands of commodity servers

HBase is not…
• A SQL database
– No joins, no query engine, no types, no SQL
– Transactions and secondary indexing are possible but immature
• A drop-in replacement for your RDBMS
• You must be OK with the RDBMS anti-schema
– Denormalized data
– Wide and sparsely populated tables

Just say “no” to your inner DBA

HBase Architecture
• A table is made up of any number of regions
• A region is specified by its startKey and endKey
– Empty table: (Table, NULL, NULL)
– Two-region table: (Table, NULL, “Streamy”) and (Table, “Streamy”, NULL)
• Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop

HBase Architecture Contd.
• Two types of HBase nodes: Master and RegionServer
• Special tables -ROOT- and .META. store schema information and region locations
• The Master server is responsible for monitoring the RegionServers as well as assignment and load balancing of regions

HBase Tables
• Tables are sorted by Row
• A table schema only defines its column families
– Each family consists of any number of columns
– Each column consists of any number of versions
– Columns only exist when inserted; NULLs are free
– Columns within a family are sorted and stored together
• Everything except table names is byte[]

(Table, Row, Family:Column, Timestamp) -> Value

HBase Table as Data Structures

SortedMap(
  RowKey, List(
    SortedMap(
      Column, List(
        Value, Timestamp
      )
    )
  )
)
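To make that nesting concrete, here is a minimal Java sketch of the conceptual shape of an HBase table using plain TreeMaps. The row key and column are hypothetical, and this illustrates the data model only; it is not HBase client code.

import java.util.SortedMap;
import java.util.TreeMap;

public class ConceptualTable {
  public static void main(String[] args) {
    // RowKey -> (Family:Column -> (Timestamp -> Value)), sorted at every level
    SortedMap<String, SortedMap<String, SortedMap<Long, byte[]>>> table =
        new TreeMap<String, SortedMap<String, SortedMap<Long, byte[]>>>();

    // Insert one cell: hypothetical row "com.streamy.www", column "content:type"
    SortedMap<Long, byte[]> versions = new TreeMap<Long, byte[]>();
    versions.put(System.currentTimeMillis(), "text/html".getBytes());
    SortedMap<String, SortedMap<Long, byte[]>> row =
        new TreeMap<String, SortedMap<Long, byte[]>>();
    row.put("content:type", versions);
    table.put("com.streamy.www", row);

    // Read the newest version of the cell back (highest timestamp)
    SortedMap<Long, byte[]> cell = table.get("com.streamy.www").get("content:type");
    System.out.println(new String(cell.get(cell.lastKey())));
  }
}

Everything in the model is just sorted bytes; row gets, per-family scans, and versioned reads all fall out of this sorted nesting.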
Web Crawl Example
• Store web crawl data
– Table crawl with family content
– Row is the URL, with columns:
• content:data stores the raw crawled data
• content:language stores the HTTP language header
• content:type stores the HTTP content-type header
– If processing the raw data for hyperlinks and images, add families links and images
• A links:<url> column for each hyperlink
• An images:<url> column for each image

Web Crawl Example in RDBMS
• How would this look in a traditional DB?
– Table crawl with columns url, data, language, and type
– Table links with columns url and link
– Table images with columns url and image
• How will this scale?
– 10M documents with an average of 10 links and 10 images each
– 210M total rows versus 10M total rows
– Index bloat with the links/images tables

Connecting to HBase
• Native Java Client/API
– get(byte[] row, byte[] column, long ts, int versions)
• Non-Java clients
– Thrift server (Ruby, C++, etc.)
– REST server
– Native C/C++ client scheduled for the 0.20 release
• TableInputFormat/TableOutputFormat for MapReduce
– HBase as a MapReduce source and/or sink
• HBase Shell
– JRuby shell to add, get, scan, and administer HBase
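Putting the Java client and the crawl table together, a hedged sketch of writing and reading one crawled page with the 0.18-era client (BatchUpdate and friends); the row key and values are made up, and exact signatures shifted between releases, so treat the calls as illustrative.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.Cell;

public class CrawlClient {
  public static void main(String[] args) throws Exception {
    // Connects to the cluster named in hbase-site.xml on the classpath
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "crawl");

    // Write one crawled page; all edits in a BatchUpdate are
    // committed to the row together
    byte[] rawPage = "<html>example page</html>".getBytes();  // stand-in for real crawl data
    BatchUpdate update = new BatchUpdate("http://streamy.com/");
    update.put("content:data", rawPage);
    update.put("content:type", "text/html".getBytes());
    update.put("content:language", "en".getBytes());
    table.commit(update);

    // Random-access read of a single cell (newest version)
    Cell cell = table.get("http://streamy.com/".getBytes(),
        "content:type".getBytes());
    System.out.println(new String(cell.getValue()));
  }
}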
HBase Extensions
• Hive, Pig, Cascading
– Hadoop-targeted MapReduce tools with upcoming HBase integration
• Pigi
– An HBase ORM that includes indexing, joining, searching, paging, and ordering
• subRecord
– A unified infrastructure combining storage, security, logging, metrics, and monitoring on top of HBase
• HBase-Writer
– Heritrix crawling directly to an HBase table

History of HBase
• November 2006 – Google releases the Bigtable paper
• February 2007 – Initial HBase prototype created as a Hadoop contrib project
• October 2007 – First usable HBase
• January 2008 – Hadoop becomes an Apache top-level project; HBase becomes a subproject
• October 2008 – HBase 0.18.1 released

Current Project Status
• Latest stable release: HBase 0.18.1 on Hadoop 0.18.2
– 8th HBase release
– A significant stability and scalability release
• Upcoming release: HBase 0.19.0 on Hadoop 0.19.0
– HDFS appends mean little to no data loss if the master fails (max of 100 edits lost versus 30k previously)
– Block caching, scanner pre-fetching, write batching
– Vastly improved resource monitoring and memory efficiency
• Next release: HBase 0.20.0 on Hadoop 0.20.0 (March 2009)
– New HDFS file format, TFile
– No more SPOF: ZooKeeper integration gives multi-master
– Order-of-magnitude improvements in random access
– Significant expansion of in-memory and caching capabilities

HBase in the Wild
• Streamy
• Powerset
– Birthplace of HBase
– Home of Michael Stack, project lead
• Mahalo
• The Shopping Engine @ Tokenizer.org
• Advanced Threats Research @ Trend Micro
• Wikia
• Multilingual Archive @ WorldLingo
• Also being used in some capacity at:
– Yahoo, Last.fm, Videosurf, and Rapleaf

Questions?

Comparison One
• A system to store a shopping cart
– Customers, Products, Orders

Simple SQL Schema

CREATE TABLE customers (
  customerid UUID PRIMARY KEY,
  name TEXT,
  email TEXT);

CREATE TABLE products (
  productid UUID PRIMARY KEY,
  name TEXT,
  price DOUBLE PRECISION);

CREATE TABLE orders (
  orderid UUID PRIMARY KEY,
  customerid UUID REFERENCES customers(customerid),
  date TIMESTAMP,
  total DOUBLE PRECISION);
CREATE INDEX orders_customerid ON orders(customerid);

CREATE TABLE orderproducts (
  orderid UUID REFERENCES orders(orderid),
  productid UUID REFERENCES products(productid));
CREATE INDEX orderproducts_orderid ON orderproducts(orderid);

Simple HBase Schema

CREATE TABLE customers (content, orders)
CREATE TABLE products (content)
CREATE TABLE orders (content, products)

Efficient Queries with Both
• Get name, email, orders for a customer
• Get name, price for a product
• Get customer, stamp, total for an order
• Get the list of products in an order

Where SQL Makes Life Easy
• Joining
– In a single query, get all products in an order along with their product information (the HBase equivalent is sketched after these slides)
• Secondary Indexing
– Get customerid by e-mail
• Referential Integrity
– Deleting an order deletes its links out of orderproducts
– ID updates propagate
• Realtime Analysis
– GROUP BY and ORDER BY allow for simple statistical analysis

Where HBase Makes Life Easy
• Dataset Scale
– We have 1M customers and 100M products
– Product information includes large text datasheets or PDF files
– We want to track every time a customer looks at a product page
• Read/Write Scale
– Tables distributed across nodes mean reads/writes are fully distributed
– Writes are extremely fast and require no index updates
• Replication
– Comes for free
• Batch Analysis
– Massive, convoluted SQL queries executed serially become efficient MapReduce jobs, distributed and executed in parallel
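To make the join trade-off concrete, here is a hedged sketch of "all products in an order" when the join moves into the application, against the HBase schema above. The row key, the products:<productid> column convention, the content:name/content:price columns, and the 0.18-era getRow/RowResult calls are all illustrative assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.io.RowResult;

public class OrderProducts {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable orders = new HTable(conf, "orders");
    HTable products = new HTable(conf, "products");

    // Read the whole order row; the products family holds one
    // products:<productid> column per line item
    RowResult order = orders.getRow("order1234".getBytes());  // hypothetical order id
    for (byte[] column : order.keySet()) {
      String qualifier = new String(column);
      if (!qualifier.startsWith("products:")) {
        continue;  // skip the content: columns
      }
      String productid = qualifier.substring("products:".length());

      // The "join": one random read per product row
      Cell name = products.get(productid.getBytes(), "content:name".getBytes());
      Cell price = products.get(productid.getBytes(), "content:price".getBytes());
      System.out.println(productid + "\t" + new String(name.getValue())
          + "\t" + new String(price.getValue()));
    }
  }
}

In SQL this is a single SELECT with a join; here the application issues one random read per product, which is exactly the kind of distributed random access HBase is built to make cheap.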
Conclusion
• For small instances of simple/straightforward systems, relational databases offer a much more convenient way to model and access data
– You can outsource most of the work to the transaction and query engines
– HBase will force you to pull complexity up into the application layer
• Once you need to scale, the properties and flexibility of HBase can relieve you of the headaches associated with scaling an RDBMS

Questions?

Comparison Two
• Compare key factors
– Hardware Requirements
– Scalability
– Reliability
– Ease of Use
– Cost

Hardware Requirements
• RDBMSs are IO-bound
– Typically require large arrays of fast, expensive disks
– A modest production environment might have a single node with 15-30 15k RPM drives, 16 cores, and 16-64GB RAM
– Requires a backup server with similar specs
– $$$$$
• HBase is designed for commodity hardware
– The biggest factor for performance is the number of nodes
– A modest production environment might have 10-20 nodes, each with two 500GB 7.2k RPM drives, 4 cores, and 4GB RAM
– Common to have one master node with RAID, dual PSUs, etc., as this is currently a SPOF

Scalability
• RDBMS scale is achieved through
– Caching, a la Memcached
– Partitioning, often left up to the application or external tools
– Replication, which can be built-in or an add-on with most popular RDBMSs
– Regardless of the scale mechanisms, the architecture does not allow efficient multi-master support
• HBase scales out of the box
– Random access often made faster with something similar to Memcached (built in with the 0.20 release)
– Constant performance from low to high concurrency
– Writes are distributed and there are no indexes
– Scale by plugging in more RegionServers

Reliability
• RDBMS
– Slave replication
– Warm/hot backups
– Single-node failure is often catastrophic
• HBase
– Replication is built-in
– Backups are unnecessary but available

Ease of Use
• RDBMS
– Millions are trained in SQL and relational data modeling
– Normalized schemas are well understood and have predictable performance
– However, schemas are often limiting, difficult to change, and scale poorly
• HBase and MapReduce
– Significant learning curve
– Both have excellent communities and increasing numbers of tools to help ease the initial pain
– Schemas are loosely defined, so data structures are easy to change and performance is constant

Other Factors
• Operating System / Architecture
– RDBMSs vary greatly in their target architectures
– HBase is designed for Linux, though it is also being run on Solaris and, with some success, on Windows
• Cost
– HBase is FOSS
– There are plenty of mature FOSS RDBMSs, but many used in the enterprise are expensive
• Widespread use
– RDBMSs are tried and true
– Hadoop and HBase are still in development and, though production-ready, are not yet in wide use

Conclusion
• Second verse, same as the first
• An RDBMS provides tremendous functionality out of the box but is extremely difficult and costly to scale
• HBase provides barebones functionality out of the box, but scaling is built-in and inexpensive