
Hadoop and HBase vs RDBMS

Description

Presentation given to the Los Angeles CTO Forum on December 12, 2008. An introduction to Hadoop, MapReduce, and HBase, including how they compare to a traditional RDBMS.

Reviews
great presentation
Rated 10 out of 10

January 22, 2009 (9 months 17 days ago)
great review of HBase features

Shared by: Jonathan Gray
Posted: 12/12/2008
Language: English
Hadoop/HBase vs. RDBMS
December CTO Forum
Jonathan Gray
Streamy.com

About Me
• Co-Founder of Streamy.com
• Background in computer engineering, distributed/fault-tolerant applications, relational databases, Linux
• Successfully migrated Streamy backend from PostgreSQL to Hadoop/HBase in June

Why Hadoop/HBase?
• Datasets are growing into petabytes
• Traditional databases are expensive to scale and inherently difficult to distribute
• Commodity hardware is cheap and powerful
  – $1k buys you 4core/4GB/1TB
  – A single 300GB 15k RPM SAS drive costs nearly $500
• Need for random access and batch processing
  – Hadoop only supports batch/streaming

History of Hadoop/HBase
• Google solved its scale problems
  – “The Google File System” published October 2003
    • Hadoop DFS
  – “MapReduce: Simplified Data Processing on Large Clusters” published December 2004
    • Hadoop MapReduce
  – “Bigtable: A Distributed Storage System for Structured Data” published November 2006
    • HBase

Hadoop Introduction
• Two main components
  – Hadoop DFS
    • A scalable, fault-tolerant, high performance distributed file system capable of running on commodity hardware
  – Hadoop MapReduce
    • Software framework for distributed computation
• Significant adoption
  – Used in production in hundreds of organizations
  – Yahoo is primary sponsor with tens of thousands of active nodes running Hadoop

HDFS: Hadoop Distributed File System
• Reliably store petabytes of replicated data across thousands of nodes
  – Data divided into 64MB blocks, each block replicated three times
• Master/Slave architecture
  – Master NameNode contains block locations
  – Slave DataNode manages blocks on local FS
• Built on commodity hardware
  – No 15k RPM disks or RAID required

HDFS Example
• Store 1TB flat text file on 10 node cluster
  – Can use Java API or command-line
      ./hadoop dfs -copyFromLocal ./srcFile /destFile
  – File split into 64MB blocks (16,384 total)
  – Each block sent to three nodes (49,152 total, 3TB)
  – Has notion of racks to ensure replication across distinct clusters/geographic locations

MapReduce
• Distributed programming model to reliably process petabytes of data by exploiting data locality
  – Built-in bindings for Java and C
  – Can use with any language via HadoopStreaming
• Inspired by map and reduce functions used in functional programming
Input -> Map() -> Copy/Sort -> Reduce() -> Output

MapReduce Example
• Perform WordCount on 1TB file in HDFS
  – Map task launched for each block of file
  – Within each task, Map function called for each line: Map(LineNumber, LineString)
    • For each word in LineString → Output(Word, 1)
  – Map output is sorted, grouped and copied to reducer
  – Reduce(Word, List) called for each word
    • Output(Word, Length(List))
  – Final output contains total count for each word

Hadoop…
• …is designed to store and stream extremely large datasets in batch
• …is not intended for realtime querying
• …does not support random access
This is why we have HBase!

What is HBase?
• Distributed
• Column-Oriented
• Multi-Dimensional
• High-Availability
• High-Performance
• Storage System
Project Goals
Billions of Rows * Millions of Columns * Thousands of Versions
Petabytes across thousands of commodity servers

HBase is not…
• A SQL Database
  – No joins, no query engine, no types, no SQL
  – Transactions and secondary indexing possible but immature
• A drop-in replacement for your RDBMS
• You must be OK with RDBMS anti-schema
  – Denormalized data
  – Wide and sparsely populated tables
Just say “no” to your inner DBA

HBase Architecture
• Table is made up of any number of regions
• Region is specified by its startKey and endKey
  – Empty table: (Table, NULL, NULL)
  – Two-region table: (Table, NULL, “Streamy”) and (Table, “Streamy”, NULL)
• Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop

HBase Architecture Contd.
• Two types of HBase nodes: Master and RegionServer
• Special tables -ROOT- and .META. store schema information and region locations
• Master server responsible for RegionServer monitoring as well as assignment and load balancing of regions

HBase Tables
• Tables are sorted by Row
• Table schema only defines its column families
  – Each family consists of any number of columns
  – Each column consists of any number of versions
  – Columns only exist when inserted, NULLs are free
  – Columns within a family are sorted and stored together
  – Everything except table names is byte[]
(Table, Row, Family:Column, Timestamp) → Value

HBase Table as Data Structures
SortedMap(
  RowKey, List(
    SortedMap(
      Column, List(
        Value, Timestamp
      )
    )
  )
)

Web Crawl Example
• Store web crawl data
  – Table crawl with family content
  – Row is URL with Columns
    • content:data stores raw crawled data
    • content:language stores http language header
    • content:type stores http content-type header
  – If processing raw data for hyperlinks and images, add families links and images
    • links: column for each hyperlink
    • images: column for each image

Web Crawl Example in RDBMS
• How would this look in a traditional DB?
  – Table crawl with columns url, data, language, and type
  – Table links with columns url and link
  – Table images with columns url and image
• How will this scale?
  – 10M documents w/ avg 10 links and 10 images
  – 210M total rows (10M + 100M + 100M) versus 10M total rows
  – Index bloat with links/images tables

Connecting to HBase
• Native Java Client/API
  – get(byte[] row, byte[] column, long ts, int versions)
• Non-Java Clients
  – Thrift server (Ruby, C++, etc.)
  – REST server
  – Native C/C++ client scheduled for 0.20 release
• TableInput/TableOutputFormat for MapReduce
  – HBase as MapReduce source and/or sink
• HBase Shell
  – JRuby shell to add, get, scan, and admin

HBase Extensions
• Hive, Pig, Cascading
  – Hadoop-targeted MapReduce tools with upcoming HBase integration
• Pigi
  – HBase ORM that includes indexing, joining, searching, paging, and ordering
• subRecord
  – Provides a unified infrastructure combining storage, security, logging, metrics, and monitoring on top of HBase
• HBase-Writer
  – Heritrix crawling directly to an HBase table

History of HBase
• November 2006
  – Google releases paper on BigTable
• February 2007
  – Initial HBase prototype created as Hadoop contrib
• October 2007
  – First usable HBase
• January 2008
  – Hadoop becomes a TLP, HBase becomes a subproject
• October 2008
  – HBase 0.18.1 released

Current Project Status
• Latest stable release: HBase 0.18.1 on Hadoop 0.18.2
  – 8th HBase release
  – Significant stability and scalability release
• Upcoming release: HBase 0.19.0 on Hadoop 0.19.0
  – HDFS appends mean little to no data loss if master fails (max of 100 edits lost versus 30k previously)
  – Block-caching, scanner pre-fetching, write batching
  – Vastly improved resource monitoring and memory efficiency
• Next release: HBase 0.20.0 on Hadoop 0.20.0 (March 2009)
  – New HDFS file format TFile
  – No more SPOF: ZooKeeper integration gives multimaster
  – Order of magnitude improvements in random access
  – Significant expansion of in-memory and caching capabilities

HBase in the Wild
• Streamy
• Powerset
  – Birthplace of HBase
  – Home of Michael Stack, project lead
• Mahalo
• The Shopping Engine @ Tokenizer.org
• Advanced Threats Research @ Trend Micro
• Wikia
• Multilingual Archive @ WorldLingo
• Also being used in some capacity at:
  – Yahoo, Last.fm, Videosurf, and Rapleaf

Questions?

Comparison One
• System to store a shopping cart
  – Customers, Products, Orders

Simple SQL Schema
CREATE TABLE customers (
  customerid UUID PRIMARY KEY,
  name TEXT,
  email TEXT)
CREATE TABLE products (
  productid UUID PRIMARY KEY,
  name TEXT,
  price DOUBLE)
CREATE TABLE orders (
  orderid UUID PRIMARY KEY,
  customerid UUID INDEXED REFERENCES(customers.customerid),
  date TIMESTAMP,
  total DOUBLE)
CREATE TABLE orderproducts (
  orderid UUID INDEXED REFERENCES(orders.orderid),
  productid UUID REFERENCES(products.productid))

Simple HBase Schema
CREATE TABLE customers (content, orders)
CREATE TABLE products (content)
CREATE TABLE orders (content, products)

Efficient Queries with Both
• Get name, email, orders for customer
• Get name, price for product
• Get customer, stamp, total for order
• Get list of products in order

Where SQL Makes Life Easy
• Joining
  – In a single query, get all products in an order with their product information
• Secondary Indexing
  – Get customerid by e-mail
• Referential Integrity
  – Deleting an order would delete links out of ‘orderproducts’
  – ID updates propagate
• Realtime Analysis
  – GROUP BY and ORDER BY allow for simple statistical analysis

Where HBase Makes Life Easy
• Dataset Scale
  – We have 1M customers and 100M products
  – Product information includes large text datasheets or PDF files
  – Want to track every time a customer looks at a product page
• Read/Write Scale
  – Tables distributed across nodes means reads/writes are fully distributed
  – Writes are extremely fast and require no index updates
• Replication
  – Comes for free
• Batch Analysis
  – Massive and convoluted SQL queries executed serially become efficient MapReduce jobs distributed and executed in parallel

Conclusion
• For small instances of simple/straightforward systems, relational databases offer a much more convenient way to model and access data
  – Can outsource most work to transaction and query engine
  – HBase will force you to pull complexity into the application layer
• Once you need to scale, the properties and flexibility of HBase can relieve you from the headaches associated with scaling an RDBMS

Questions?

Comparison Two
• Compare key factors
  – Hardware Requirements
  – Scalability
  – Reliability
  – Ease of Use
  – Cost

Hardware Requirements
• RDBMS are IO-bound
  – Typically require large arrays of fast and expensive disks
  – Modest production environment might have a single node with 15-30 15k RPM drives, 16 cores, and 16-64GB RAM
  – Requires a backup server with similar specs
  – $$$$$
• HBase is designed for commodity hardware
  – Biggest factor for performance is number of nodes
  – Modest production environment might have 10-20 nodes, each with two 500GB 7.2k RPM drives, 4 cores, and 4GB RAM
  – Common to have one master node with RAID, dual PSU, etc., as this is currently a SPOF

Scalability
• RDBMS scale achieved through
  – Caching a la Memcached
  – Partitioning often left up to the application or external tools
  – Replication can be built-in or an add-on with most popular RDBMS
  – Regardless of scale mechanisms, architecture does not allow efficient multi-master support
• HBase scales out of the box
  – Random access often made faster with something similar to Memcached (built-in with 0.20 release)
  – Constant performance from low to high concurrency
  – Writes are distributed and there are no indexes
  – Scale by plugging in more RegionServers

Reliability
• RDBMS
  – Slave replication
  – Warm/Hot backups
  – Single node failure is often catastrophic
• HBase
  – Replication is built-in
  – Backups are unnecessary but available

Ease of Use
• RDBMS
  – Millions are trained in SQL and relational data modeling
  – Normalized schemas are well understood and have predictable performance
  – However, schemas are often limiting, difficult to change, and scale poorly
• HBase and MapReduce
  – Significant learning curve
  – Both have excellent communities and increasing numbers of tools to help ease the initial pain
  – Schemas are loosely defined so data structure is easy to change and performance is constant

Other Factors
• Operating System / Architecture
  – RDBMS vary greatly on their target architectures
  – HBase designed for Linux though also being run on Solaris and with some success on Windows
• Cost
  – HBase is FOSS
  – Plenty of mature FOSS RDBMS, but many used in enterprise are expensive
• Widespread use
  – RDBMS are tried and true
  – Hadoop and HBase are still in development and, though production-ready, are not yet in wide use

Conclusion
• Second verse is same as the first verse
• RDBMS provides tremendous functionality out of the box but is extremely difficult and costly to scale
• HBase provides barebones functionality out of the box but scaling is built-in and inexpensive
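
Worked example: HDFS block arithmetic

The block and replica counts quoted on the HDFS Example slide follow from simple division. The Python sketch below reproduces them, assuming binary units (1TB = 2^40 bytes, 64MB = 2^26 bytes). The function name hdfs_block_counts is made up for illustration and is not part of Hadoop.

```python
def hdfs_block_counts(file_bytes, block_bytes=64 * 2**20, replication=3):
    """Return (blocks, block_replicas, raw_bytes_stored) for one file.

    Illustrative helper only; assumes a fixed block size and a uniform
    replication factor, as on the slide.
    """
    # Integer ceiling division: a partial final block still counts as a block.
    blocks = -(-file_bytes // block_bytes)
    block_replicas = blocks * replication
    raw_bytes_stored = file_bytes * replication
    return blocks, block_replicas, raw_bytes_stored


blocks, replicas, raw = hdfs_block_counts(2**40)  # a 1TB file
print(blocks, replicas, raw // 2**40)  # 16384 49152 3
```

With the slide's defaults this gives 16,384 blocks, 49,152 block replicas, and 3TB of raw storage across the cluster.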
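
Worked example: WordCount as Map, Copy/Sort, Reduce

The MapReduce Example slide describes WordCount in pseudocode. The sketch below runs the same three phases in a single Python process so the data flow is visible. It is not the Hadoop API: the function names are illustrative, and a real job would distribute map tasks per HDFS block and send each key group to a reducer.

```python
from collections import defaultdict


def map_line(line_number, line_string):
    """Map(LineNumber, LineString): emit (word, 1) for each word."""
    for word in line_string.split():
        yield word, 1


def copy_sort(pairs):
    """Copy/Sort phase: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_word(word, ones):
    """Reduce(Word, List): the count is the length of the list."""
    return word, len(ones)


def word_count(lines):
    mapped = (
        pair
        for number, line in enumerate(lines)
        for pair in map_line(number, line)
    )
    return dict(reduce_word(word, ones) for word, ones in copy_sort(mapped))


counts = word_count(["the quick fox", "the lazy dog", "the end"])
print(counts)
# {'dog': 1, 'end': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 3}
```

Because the map step only looks at one line and the reduce step only looks at one word's list, both phases can be spread across many nodes without coordination.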
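
Worked example: an HBase table as nested sorted maps

The "HBase Table as Data Structures" slide models a table as sorted maps of rows, columns, and timestamped versions. The toy class below is one way to picture that model in Python. It is an in-memory sketch only: ToyTable and its methods are hypothetical names, not the HBase client API, and the example URLs are invented. Keys and values are bytes to mirror the slide's point that everything is byte[].

```python
import bisect


class ToyTable:
    """In-memory stand-in for (Row, Family:Column, Timestamp) -> Value."""

    def __init__(self):
        self._row_keys = []  # row keys, kept sorted
        self._rows = {}      # row -> {column -> [(timestamp, value)], newest first}

    def put(self, row, column, value, timestamp):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column, versions=1):
        """Return up to `versions` values, newest first; missing cells cost nothing."""
        cells = self._rows.get(row, {}).get(column, [])
        return [value for _, value in cells[:versions]]

    def scan(self, start_row=None, stop_row=None):
        """Yield (row, sorted column names) for start_row <= row < stop_row."""
        keys = self._row_keys
        lo = 0 if start_row is None else bisect.bisect_left(keys, start_row)
        hi = len(keys) if stop_row is None else bisect.bisect_left(keys, stop_row)
        for row in keys[lo:hi]:
            yield row, sorted(self._rows[row])


crawl = ToyTable()
crawl.put(b"http://example.com/b", b"content:type", b"text/html", 1)
crawl.put(b"http://example.com/a", b"content:data", b"<html>v1</html>", 1)
crawl.put(b"http://example.com/a", b"content:data", b"<html>v2</html>", 2)

print(crawl.get(b"http://example.com/a", b"content:data", versions=2))
# [b'<html>v2</html>', b'<html>v1</html>']
print([row for row, _ in crawl.scan()])
# [b'http://example.com/a', b'http://example.com/b']
```

Rows come back in key order regardless of insertion order, and a column that was never written simply has no entry, which is the sense in which NULLs are free.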
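
Worked example: finding the region for a row

The HBase Architecture slide defines a region by its startKey and endKey, with NULL marking an open end. The sketch below shows how a row key could be matched to a region with a binary search over sorted start keys. It assumes the start key is inclusive and the end key is exclusive, and it represents NULL as None (stored internally as an empty byte string). The function find_region is illustrative and is not how the -ROOT- and .META. lookup is actually implemented.

```python
import bisect


def find_region(start_keys, row):
    """Return (startKey, endKey) of the region containing `row`.

    `start_keys` is the sorted list of region start keys; the first entry
    must be b"" to stand for the NULL start of the first region.
    """
    index = bisect.bisect_right(start_keys, row) - 1
    start = start_keys[index] or None  # b"" means NULL
    end = start_keys[index + 1] if index + 1 < len(start_keys) else None
    return start, end


# The slide's two-region table: (NULL, "Streamy") and ("Streamy", NULL)
regions = [b"", b"Streamy"]
print(find_region(regions, b"Apple"))    # (None, b'Streamy')
print(find_region(regions, b"Streamy"))  # (b'Streamy', None)
print(find_region(regions, b"Zebra"))    # (b'Streamy', None)
```

Splitting a region only adds one more start key to the list, so lookups stay a binary search as the table grows.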
