HBase @ WorldLingo “Multilingual Archive”
Lars George, CTO
www.worldlingo.com
www.larsgeorge.com
whoami
CTO @ WorldLingo
Co-wrote C64 Simons BASIC competitor at age 16 in Assembler
Co-wrote “Berlin 1948” in C on Amiga at age 19
Diplom Informatiker (FH)
Moved to Australia, then US of A
10 years at WorldLingo
WorldLingo
Co-founded 1999
Machine Translation Services
Professional Human Translations
Offices in US and UK
Microsoft Office Provider since 2001
Web based services
Customer Projects
Multilingual Archive
Multilingual Archive
SOAP API
Simple calls
◦ putDocument()
◦ getDocument()
◦ search()
◦ command()
◦ putTransformation()
◦ getTransformation()
Multilingual Archive (cont.)
Planned already, implemented as customer project
Scale:
◦ 500 million documents
◦ Random Access
◦ “100%” Uptime
Technologies?
◦ Database ◦ Zip-Archives on file system, or Hadoop
Multilingual Archive (cont.)
44 Dell PESC1435, 12GB RAM, 2 x 1TB SATA drives
Java 6
Tomcat 5.5
88 Xen domU’s
◦ Apache ◦ Hadoop/HBase ◦ Tomcat application servers
Currently split into two clusters
Lucene Search Server
43 fields indexed
166GB size
Automated merging/warm-up/swap
Looking into scalable solution
◦ Katta
◦ Hyper Estraier
◦ DLucene
◦ …
Sorting?
HBase
Distributed database modeled on Bigtable
◦ Bigtable: A Distributed Storage System for Structured Data by Chang et al.
Runs on top of Hadoop Core
◦ "Commodity" servers, replicated data, etc.
Column-oriented store
◦ Wide table costs only the data stored
◦ NULLs in row are 'free'
◦ Good compression: columns of similar type
Goal of billions of rows X millions of cells
◦ Petabytes of data across thousands of servers
Distributed, High Availability, High Performance
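The points about wide tables and free NULLs follow from storing only the cells that exist. A small sketch in plain Python, with made-up rows and column names (nothing here is HBase code): each row maps just its present columns to values, so an absent column takes no space.

```python
# Hypothetical sparse rows: row key -> {column -> value}, absent cells are simply not stored.
rows = {
    b"row1": {b"info:a": b"1"},
    b"row2": {b"info:a": b"2", b"info:b": b"3", b"info:z": b"4"},
}

# Every column name that appears in at least one row.
all_columns = sorted({column for cols in rows.values() for column in cols})

# Cells actually stored in the sparse layout.
stored_cells = sum(len(cols) for cols in rows.values())

# Cells a fixed-width table would allocate (NULLs included).
dense_cells = len(rows) * len(all_columns)
```

Adding a rarely used column widens the logical table without touching rows that do not use it.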
!HBase
A SQL Database!
◦ No joins
◦ No sophisticated query engine
◦ No transactions
◦ No column typing
◦ No SQL, no ODBC/JDBC, etc.
Not a replacement for your RDBMS...
Why HBase?
Datasets are reaching Petabytes
Traditional databases are expensive to scale and difficult to distribute
Commodity hardware is cheap and powerful
Need for random access as well as batch processing (Hadoop alone offers only the latter)
HBase Public Timeline
November 2006
◦ Google releases paper on Bigtable
February 2007
◦ Initial HBase prototype created as Hadoop contrib
October 2007
◦ First "useable" HBase (0.15.0 Hadoop)
December 2007
◦ First HBase User Group
January 2008
◦ Hadoop becomes TLP, HBase becomes subproject
October 2008
◦ HBase 0.18.1 released
January 2009
◦ HBase 0.19.0 released
HBase WorldLingo Timeline
HBase Tables
Tables are sorted by Row
Table schema defines column families
◦ Families consist of any number of columns
◦ Columns consist of any number of versions
◦ Everything except table name is byte[]
(Table, Row, Family:Column, Timestamp) → Value
HBase Table (cont.)
As a data structure
SortedMap( RowKey, List( SortedMap( Column, List( Value, Timestamp ) ) ) )
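The nested structure above can be modeled in a few lines. This is a minimal sketch in plain Python, not the HBase API: the class `ToyTable` and its methods are hypothetical, and a real table keeps rows sorted on disk and spread over regions rather than sorting on demand.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> column -> list of (timestamp, value), newest first."""

    def __init__(self):
        # Hypothetical in-memory stand-in for a single table.
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row: bytes, column: bytes, ts: int, value: bytes) -> None:
        versions = self._rows[row][column]
        versions.append((ts, value))
        # Keep the versions of a cell ordered by timestamp, newest first.
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get_latest(self, row: bytes, column: bytes):
        # Use .get() so that a read never creates an empty row.
        columns = self._rows.get(row)
        if not columns or column not in columns:
            return None
        return columns[column][0][1]

    def row_keys(self):
        # Rows come back in byte order, matching the sorted-map view.
        return sorted(self._rows)


table = ToyTable()
table.put(b"row2", b"content:data", 1, b"old")
table.put(b"row2", b"content:data", 2, b"new")
table.put(b"row1", b"content:type", 1, b"text/html")
```

Row keys, column names, and values are all `bytes` here to mirror the "everything is byte[]" rule.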
HBase - Example
Store web crawl data
◦ Table crawl with family content
◦ Row is URL with columns
    content:data stores raw crawled data
    content:language stores http language header
    content:type stores http content-type header
◦ If processing raw data for hyperlinks and images, add families links and images
    links: column for each hyperlink
    images: column for each image
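Because rows are kept sorted by key, all pages whose URL shares a prefix sit next to each other and can be read in one range scan. A rough sketch in plain Python using a sorted list and `bisect`. The URLs, the cell values, the use of the target URL as the column name inside `links`, and the helpers `scan_prefix` and `family` are assumptions made for illustration; this is not the HBase scanner API.

```python
import bisect

# Hypothetical rows of the crawl table: row key (URL) -> {family:column -> value}.
crawl = {
    b"http://example.com/a": {b"content:type": b"text/html", b"content:language": b"en"},
    b"http://example.com/b": {b"content:type": b"text/html", b"links:http://example.org/": b""},
    b"http://example.org/": {b"content:type": b"image/png"},
}

sorted_keys = sorted(crawl)  # byte-wise order, as a stand-in for row ordering


def scan_prefix(prefix: bytes):
    """Yield (row_key, columns) for every row whose key starts with prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        yield key, crawl[key]


def family(columns: dict, name: bytes) -> dict:
    """Return only the columns that belong to the given column family."""
    return {col: val for col, val in columns.items() if col.startswith(name + b":")}


hits = [key for key, _ in scan_prefix(b"http://example.com/")]
links_of_b = family(crawl[b"http://example.com/b"], b"links")
```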
HBase - Clients
Native Java client/API
◦ get(byte[] row, byte[] column, long ts, int versions)
Non-Java clients
◦ Thrift server (Ruby, C++, Erlang, etc.) ◦ REST server
TableInput/TableOutputFormat for MapReduce
HBase shell (jruby)
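The four arguments of the `get` call listed above suggest a lookup of up to `versions` values that are no newer than `ts`. A sketch of that reading in plain Python: `toy_get` and the stored data are hypothetical, and the exact semantics of the real client call should be checked against the API documentation of the HBase release in use.

```python
# Hypothetical store: (row, column) -> unordered list of (timestamp, value) pairs.
cells = {
    (b"doc1", b"content:data"): [(30, b"v3"), (10, b"v1"), (20, b"v2")],
}


def toy_get(row: bytes, column: bytes, ts: int, versions: int):
    """Return up to `versions` values with timestamp <= ts, newest first."""
    history = cells.get((row, column), [])
    eligible = sorted((tv for tv in history if tv[0] <= ts), reverse=True)
    return [value for _, value in eligible[:versions]]


latest_two = toy_get(b"doc1", b"content:data", ts=25, versions=2)
```

With the sample data, the version written at timestamp 30 is skipped because it is newer than the requested `ts` of 25.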
HBase - Multilingual Archive
5 Tables
Up to 5 column families
XML Schemas
Automated table schema updates
Stock standard options
◦ Should I tweak?
MemCached(b) layer
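A cache layer in front of a store commonly works as a read-through cache: look in the cache first, fall back to the store on a miss, then remember the result. A minimal sketch with plain dictionaries standing in for the cache and for HBase. `ReadThroughCache` and `slow_store` are hypothetical names, and the slides do not describe how the actual WorldLingo cache logic works.

```python
class ReadThroughCache:
    """Serve reads from a cache, loading from the backing store on a miss."""

    def __init__(self, load):
        self._load = load   # function that fetches a value from the backing store
        self._cache = {}    # stand-in for the cache layer
        self.misses = 0     # how often the backing store had to be asked

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.misses += 1
        value = self._load(key)
        if value is not None:
            self._cache[key] = value  # only cache keys that exist in the store
        return value

    def invalidate(self, key):
        # Call after a write so the next read goes back to the store.
        self._cache.pop(key, None)


slow_store = {b"doc1": b"<xml/>"}  # stand-in for the table
cache = ReadThroughCache(slow_store.get)
first = cache.get(b"doc1")   # miss: loaded from the store
second = cache.get(b"doc1")  # hit: served from the cache
```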
HBase - Layers
[Diagram: layered deployment, top to bottom]
Network: Firewall, LWS Director 1 … Director n
Web: Apache 1 … Apache n
App: Tomcat 1 … Tomcat n
Cache: MemCached 1 … MemCached n
Data: HBase
HBase - Map/Reduce
Backup/Restore
Index building
Cache filling
Mapping
Updates
Translation
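To illustrate the shape of a job such as index building, here is a toy map and reduce over table rows in plain Python. It only mimics the structure of a MapReduce job (map emits key/value pairs, the framework groups them by key, reduce combines each group). The row data and function names are invented, and a real job would read its rows through TableInputFormat on Hadoop.

```python
from collections import defaultdict

# Hypothetical table content: row key -> document text.
rows = {
    b"doc1": b"hello world",
    b"doc2": b"hello hbase",
}


def map_row(row_key: bytes, text: bytes):
    """Map phase: emit (word, row_key) once per distinct word in the row."""
    for word in set(text.split()):
        yield word, row_key


def reduce_word(word: bytes, row_keys: list):
    """Reduce phase: collect the sorted list of rows that contain the word."""
    return word, sorted(row_keys)


def run_job(table: dict) -> dict:
    # Shuffle step: group the mapper output by key, as the framework would.
    grouped = defaultdict(list)
    for row_key, text in table.items():
        for word, key in map_row(row_key, text):
            grouped[word].append(key)
    return dict(reduce_word(word, keys) for word, keys in grouped.items())


index = run_job(rows)
```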
HBase - Problems
Early versions
◦ Data loss ◦ Migration nightmares ◦ Slow performance
Current version
◦ RegionServer madness (region unknown) ◦ New resource limits (xceiver) ◦ Read HBase Wiki
Single point of failure (name node)
HBase - Notes
RTFM(ine)
HBase Wiki
IRC Channel
Personal Experience:
◦ Max. file handles
◦ Hadoop xceiver limits
◦ Redundant meta data (on name node)
◦ RAM
◦ Deployment strategy
HBase – Notes (cont.)
Upcoming versions:
◦ ZooKeeper removes SPOF
◦ New file format brings speed
◦ Built-in Memcache = Can drop a layer!
◦ 200ms → 20ms (do 50 get()’s per call)
Questions?
Email: lars@worldlingo.com
Blog: www.larsgeorge.com
Twitter: larsgeorge