HBase @ Streamy
• History of Data
• RDBMS Issues
• HBase to the Rescue
• Streamy Today and Tomorrow
• Future of HBase
Tuesday, September 22, 2009
About Me
• Co-Founder and CTO of Streamy.com
• HBase Committer
• Migrated Streamy from RDBMS to HBase and Hadoop in June 2008
History of Data
The Prototype
• Streamy 1.0 built on PostgreSQL
‣ All of the bells and whistles
• Powered by single low-spec node
‣ 8 core / 8 GB / 2TB / $4k
Functionally powerful, Woefully slow
History of Data
The Alpha
• Streamy 1.5 built on optimized PostgreSQL
‣ Remove bells and whistles, add partitioning
• Powered by high-powered master node
‣ 16 core / 64 GB / 15x146GB 15k RPM / $40k
Less powerful, still slow... Insanely expensive
History of Data
The Beta
• Streamy 2.0 built entirely on HBase
‣ Custom caches, query engines, and API
• Powered by 10 low-spec nodes
‣ 4 core / 4GB / 1TB / $10k for entire cluster
Less functional but fast, scalable, and cheap
RDBMS Issues
• Poor disk usage patterns
• Black box query engine
• Write speed degrades with table size
• Transactions/MVCC unnecessary overhead
• Expensive
The Read Problem
• View 30 newest unread stories from blogs
‣ Not RDBMS friendly, no early-out
‣ PL/Python heap-merge hack helped
‣ We knew what to do but DB didn’t listen
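The heap-merge with early-out mentioned above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Streamy's actual PL/Python code: the function name `newest_unread`, the `(timestamp, story_id)` tuple layout, and the `read_ids` set are all assumptions made for the example. Each feed is assumed to be sorted newest first already, so the merge can stop as soon as 30 unread stories have been produced.

```python
import heapq
from itertools import islice


def newest_unread(feeds, read_ids, limit=30):
    """Return up to `limit` newest unread stories across all feeds.

    feeds    -- iterable of per-feed sequences of (timestamp, story_id),
                each already sorted newest first (hypothetical layout)
    read_ids -- set of story ids the user has already read
    """
    # heapq.merge is lazy: it only pulls items from each feed as needed.
    merged = heapq.merge(*feeds, key=lambda story: story[0], reverse=True)
    # Filter out read stories without materializing the merged stream.
    unread = (story for story in merged if story[1] not in read_ids)
    # Early-out: stop consuming once `limit` unread stories are found.
    return list(islice(unread, limit))
```

A query planner in a general-purpose RDBMS typically sorts or scans far more rows than this before applying the limit, which is the "no early-out" complaint on this slide.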
The Write Problem
• Rapidly growing items table
‣ Crawl index from 1k to 100k feeds
‣ Indexes, static content, dynamic statistics
‣ Solutions are imperfect
RDBMS Conclusions
• Enormous functionality and flexibility
‣ But you throw it out the door at scale
• Stripped down RDBMS still not attractive
• Turned entire team into DBAs
• Gets in the way of domain-specific optimizations
What We Wanted
• Transparent partitioning
• Transparent distribution
• Fast random writes
• Good data locality
• Fast random reads
What We Got
• Transparent partitioning (Regions)
• Transparent distribution (RegionServers)
• Fast random writes (MemStore)
• Good data locality (Column Families)
• Fast random reads (HBase 0.20)
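As a rough illustration of how Regions give transparent partitioning: an HBase table is split into contiguous, sorted row-key ranges, and each range is served by one RegionServer. The sketch below is a simplified model only. The class `RegionMap`, its constructor arguments, and the server names are hypothetical, and real HBase clients locate regions through catalog tables rather than an in-memory list.

```python
import bisect


class RegionMap:
    """Toy model of a table's regions (hypothetical, not the HBase API).

    Region i covers row keys in [starts[i], starts[i + 1]); the first
    region starts at the empty key and the last region is unbounded.
    """

    def __init__(self, split_keys, servers):
        self.starts = [b""] + sorted(split_keys)
        if len(servers) != len(self.starts):
            raise ValueError("need exactly one server per region")
        self.servers = list(servers)

    def locate(self, row_key: bytes) -> str:
        # Find the last region whose start key is <= row_key.
        index = bisect.bisect_right(self.starts, row_key) - 1
        return self.servers[index]
```

For example, `RegionMap([b"g", b"p"], ["rs1", "rs2", "rs3"]).locate(b"hbase")` resolves to the server holding the `[g, p)` range. The application only ever supplies a row key; the split points and server assignment stay hidden from it.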
What Else We Got
• Transparent replication (HDFS)
• High availability (No SPOF)
• MapReduce (Input/OutputFormats)
• Versioning (Column Versions)
• Fast sequential reads (Scanners)
HBase @ Streamy
Today
• All data stored in HBase
• Additional caching of hot data
• Query and indexing engines
• MapReduce crawling and analytics
• Zookeeper/Katta/Lucene
HBase @ Streamy
Tomorrow
• Thumbnail media server
• Slave replication for Backup/DR
• More Cascading
• Better Katta integration
• Realtime MapReduce
HBase on a Budget
• HBase works on cheap nodes
‣ But you need a cluster (5+ nodes)
‣ $10k cluster has 10X capacity of $40k node
• Multiple instances on a single cluster
• 24/7 clusters + bandwidth != EC2
Lessons Learned
• Layer of abstraction helps tremendously
‣ Internal Streamy Data API
‣ Storage of serialized types
• Schema design is about reads not writes
• What’s good for HBase is good for Streamy
What’s Next for HBase
• Inter-cluster / Inter-DC replication
‣ Slave and Multi-Master
• Master rewrite, more Zookeeper
• Batch operations, HDFS uploader
• No more data loss
‣ Need HDFS appends
HBase Information
• Home Page: http://hbase.org
• Wiki: http://wiki.apache.org/hadoop/Hbase
• Twitter: http://twitter.com/hbase
• Freenode IRC: #hbase
• Mailing List: hbase-user@hadoop.apache.org