History of Data at Streamy, from Postgres to HBase

Description

A talk given by Jonathan Gray at Hadoop World NYC 2009 on the history of data at Streamy and its migration from a PostgreSQL database to Hadoop/HBase.

Shared by: Jonathan Gray
Posted: 10/7/2009
Language: English
HBase @ Streamy
• History of Data
• RDBMS Issues
• HBase to the Rescue
• Streamy Today and Tomorrow
• Future of HBase
Tuesday, September 22, 2009

About Me
• Co-Founder and CTO of Streamy.com
• HBase Committer
• Migrated Streamy from RDBMS to HBase and Hadoop in June 2008

History of Data: The Prototype
• Streamy 1.0 built on PostgreSQL
  ‣ All of the bells and whistles
• Powered by single low-spec node
  ‣ 8 core / 8 GB / 2 TB / $4k
Functionally powerful, woefully slow

History of Data: The Alpha
• Streamy 1.5 built on optimized PostgreSQL
  ‣ Remove bells and whistles, add partitioning
• Powered by high-powered master node
  ‣ 16 core / 64 GB / 15 x 146 GB 15k RPM / $40k
Less powerful, still slow... Insanely expensive

History of Data: The Beta
• Streamy 2.0 built entirely on HBase
  ‣ Custom caches, query engines, and API
• Powered by 10 low-spec nodes
  ‣ 4 core / 4 GB / 1 TB / $10k for entire cluster
Less functional but fast, scalable, and cheap

RDBMS Issues
• Poor disk usage patterns
• Black box query engine
• Write speed degrades with table size
• Transactions/MVCC unnecessary overhead
• Expensive

The Read Problem
• View 30 newest unread stories from blogs
  ‣ Not RDBMS friendly, no early-out
  ‣ PL/Python heap-merge hack helped
  ‣ We knew what to do but DB didn’t listen

The Write Problem
• Rapidly growing items table
  ‣ Crawl index from 1k to 100k feeds
  ‣ Indexes, static content, dynamic statistics
  ‣ Solutions are imperfect

RDBMS Conclusions
• Enormous functionality and flexibility
  ‣ But you throw it out the door at scale
• Stripped down RDBMS still not attractive
• Turned entire team into DBAs
• Gets in the way of domain-specific optimizations

What We Wanted
• Transparent partitioning
• Transparent distribution
• Fast random writes
• Good data locality
• Fast random reads

What We Got
• Transparent partitioning
  ‣ Regions
• Transparent distribution
  ‣ RegionServers
• Fast random writes
  ‣ MemStore
• Good data locality
  ‣ Column Families
• Fast random reads
  ‣ HBase 0.20

What Else We Got
• Transparent replication
  ‣ HDFS
• High availability
  ‣ No SPOF
• MapReduce
  ‣ Input/OutputFormats
• Versioning
  ‣ Column Versions
• Fast Sequential Reads
  ‣ Scanners

HBase @ Streamy Today
• All data stored in HBase
• Additional caching of hot data
• Query and indexing engines
• MapReduce crawling and analytics
• Zookeeper/Katta/Lucene

HBase @ Streamy Tomorrow
• Thumbnail media server
• Slave replication for Backup/DR
• More Cascading
• Better Katta integration
• Realtime MapReduce

HBase on a Budget
• HBase works on cheap nodes
  ‣ But you need a cluster (5+ nodes)
  ‣ $10k cluster has 10X capacity of $40k node
• Multiple instances on a single cluster
• 24/7 clusters + bandwidth != EC2

Lessons Learned
• Layer of abstraction helps tremendously
  ‣ Internal Streamy Data API
  ‣ Storage of serialized types
• Schema design is about reads not writes
• What’s good for HBase is good for Streamy

What’s Next for HBase
• Inter-cluster / Inter-DC replication
  ‣ Slave and Multi-Master
• Master rewrite, more Zookeeper
• Batch operations, HDFS uploader
• No more data loss
  ‣ Need HDFS appends

HBase Information
• Home Page: http://hbase.org
• Wiki: http://wiki.apache.org/hadoop/Hbase
• Twitter: http://twitter.com/hbase
• Freenode IRC: #hbase
• Mailing List: hbase-user@hadoop.apache.org
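
Example: heap-merge with early-out

The Read Problem slide mentions a "heap-merge hack" for fetching the 30 newest unread stories, and says the query had no early-out. The slides do not show that code, so what follows is only a minimal sketch of the general technique, not Streamy's implementation. It assumes each feed is already sorted newest first and that read stories are tracked in a set; the names `newest_unread`, `feeds`, and `read_ids` are hypothetical.

```python
import heapq
from itertools import islice


def newest_unread(feeds, read_ids, limit=30):
    """Return up to `limit` newest unread stories across all feeds.

    feeds    -- iterables of (timestamp, story_id), each sorted newest first
                (an assumption of this sketch)
    read_ids -- set of story ids the user has already read (hypothetical)
    """
    # heapq.merge is lazy: it keeps one candidate per feed in a small heap
    # and only pulls the next item from a feed when it is needed.
    merged = heapq.merge(*feeds, key=lambda story: story[0], reverse=True)
    unread = (story for story in merged if story[1] not in read_ids)
    # islice stops consuming as soon as `limit` stories are found,
    # which is the early-out the slide says the RDBMS query lacked.
    return list(islice(unread, limit))


feeds = [
    [(105, "a3"), (101, "a2"), (90, "a1")],
    [(104, "b2"), (95, "b1")],
    [(103, "c1")],
]
print(newest_unread(feeds, read_ids={"a3", "b1"}, limit=3))
# prints [(104, 'b2'), (103, 'c1'), (101, 'a2')]
```

The point of the sketch is that the work done is proportional to the number of stories returned plus the number of read stories skipped, not to the total number of stories stored per feed.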
