HBase @ WorldLingo
Presentation given March 6th in Berlin at the Berlin Hadoop User Group.
Shared by: larsgeorge
HBase @ WorldLingo
“Multilingual Archive”
Lars George, CTO
www.worldlingo.com
www.larsgeorge.com

whoami
• CTO @ WorldLingo
• Co-wrote C64 Simons BASIC competitor at age 16 in Assembler
• Co-wrote “Berlin 1948” in C on Amiga at age 19
• Diplom Informatiker (FH)
• Moved to Australia, then US of A
• 10 years at WorldLingo

WorldLingo
• Co-founded 1999
• Machine Translation Services
• Professional Human Translations
• Offices in US and UK
• Microsoft Office Provider since 2001
• Web based services
• Customer Projects
• Multilingual Archive

Multilingual Archive
• SOAP API
• Simple calls
  ◦ putDocument()
  ◦ getDocument()
  ◦ search()
  ◦ command()
  ◦ putTransformation()
  ◦ getTransformation()

Multilingual Archive (cont.)
• Planned already, implemented as customer project
• Scale:
  ◦ 500 million documents
  ◦ Random Access
  ◦ “100%” Uptime
• Technologies?
  ◦ Database
  ◦ Zip-Archives on file system, or Hadoop

Multilingual Archive (cont.)
• 44 Dell PE SC1435, 12GB RAM, 2 x 1TB SATA drives
• Java 6
• Tomcat 5.5
• 88 Xen domU’s
  ◦ Apache
  ◦ Hadoop/HBase
  ◦ Tomcat application servers
• Currently split into two clusters

Lucene Search Server
• 43 fields indexed
• 166GB size
• Automated merging/warm-up/swap
• Looking into scalable solution
  ◦ Katta
  ◦ Hyper Estraier
  ◦ DLucene
  ◦ …
• Sorting?

HBase
• Distributed database modeled on Bigtable
• Runs on top of Hadoop Core
• Column-oriented store
  ◦ Bigtable: A Distributed Storage System for Structured Data by Chang et al.
  ◦ "Commodity" servers, replicated data, etc.
  ◦ Wide table costs only the data stored
  ◦ NULLs in row are 'free'
  ◦ Good compression: columns of similar type
  ◦ Petabytes of data across thousands of servers
• Goal of billions of rows X millions of cells
• Distributed, High Availability, High Performance

HBase is not a SQL Database!
  ◦ No joins
  ◦ No sophisticated query engine
  ◦ No transactions
  ◦ No column typing
  ◦ No SQL, no ODBC/JDBC, etc.
• Not a replacement for your RDBMS...

Why HBase?
• Datasets are reaching Petabytes
• Traditional databases are expensive to scale and difficult to distribute
• Commodity hardware is cheap and powerful
• Need for both random access and batch processing (Hadoop alone does not offer random access)

HBase Public Timeline
• November 2006
  ◦ Google releases paper on Bigtable
• February 2007
  ◦ Initial HBase prototype created as Hadoop contrib
• October 2007
  ◦ First "useable" HBase (0.15.0 Hadoop)
• December 2007
  ◦ First HBase User Group
• January 2008
  ◦ Hadoop becomes TLP, HBase becomes subproject
• October 2008
  ◦ HBase 0.18.1 released
• January 2009
  ◦ HBase 0.19.0 released

HBase WorldLingo Timeline

HBase Tables
• Tables are sorted by Row
• Table schema defines column families
  ◦ Families consist of any number of columns
  ◦ Columns consist of any number of versions
  ◦ Everything except table name is byte[]
• (Table, Row, Family:Column, Timestamp) → Value

HBase Table (cont.)
• As a data structure
    SortedMap(
      RowKey, List(
        SortedMap(
          Column, List(
            Value, Timestamp
          )
        )
      )
    )

HBase - Example
• Store web crawl data
  ◦ Table crawl with family content
  ◦ Row is URL with columns
      content:data stores raw crawled data
      content:language stores http language header
      content:type stores http content-type header
  ◦ If processing raw data for hyperlinks and images, add families links and images
      links:<url> column for each hyperlink
      images:<url> column for each image

HBase - Clients
• Native Java client/API
  ◦ get(byte[] row, byte[] column, long ts, int versions)
• Non-Java clients
  ◦ Thrift server (Ruby, C++, Erlang, etc.)
  ◦ REST server
• TableInput/TableOutputFormat for MapReduce
• HBase shell (jruby)

HBase - Multilingual Archive
• 5 Tables
• Up to 5 column families
• XML Schemas
• Automated table schema updates
• Stock standard options
  ◦ Should I tweak?
• MemCached(b) layer

HBase - Layers
[Diagram: layers labelled Network, Web, App, Cache, Data; components shown: Firewall, LWS Director 1 … n, Apache 1 … n, Tomcat 1 … n, MemCached 1 … n, HBase]

HBase - Map/Reduce
• Backup/Restore
• Index building
• Cache filling
• Mapping
• Updates
• Translation

HBase - Problems
• Early versions
  ◦ Data loss
  ◦ Migration nightmares
  ◦ Slow performance
• Current version
  ◦ RegionServer madness (region unknown)
  ◦ New resource limits (xceiver)
  ◦ Read HBase Wiki
• Single point of failure (name node)

HBase - Notes
• RTF(ine)M
• HBase Wiki
• IRC Channel
• Personal Experience:
  ◦ Max. file handles
  ◦ Hadoop xceiver limits
  ◦ Redundant meta data (on name node)
  ◦ RAM
  ◦ Deployment strategy

HBase - Notes (cont.)
• Upcoming versions:
  ◦ Zookeeper removes SPOF
  ◦ New file format brings speed
  ◦ Built-in Memcache = Can drop a layer!
  ◦ 200ms > 20ms (do 50 get()’s per call)

Questions?
• Email: firstname.lastname@example.org
• Blog: www.larsgeorge.com
• Twitter: larsgeorge
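
Sketch for the "HBase Table (cont.)" slide: the nested sorted-map view of a table can be modelled in plain Java as below. This is only an illustration of the data structure and not the HBase client API. The class name SortedMapTableSketch and its methods are hypothetical, and String is used for keys and values for readability where HBase itself stores byte arrays.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Plain-Java model of a table as nested sorted maps.
 * Hypothetical illustration only; this is not the HBase API.
 */
class SortedMapTableSketch {

    // row key -> (family:column -> (timestamp, newest first -> value))
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Stores one versioned cell under (row, family:column, timestamp). */
    public void put(String row, String column, long timestamp, String value) {
        NavigableMap<String, NavigableMap<Long, String>> columns =
                rows.computeIfAbsent(row, r -> new TreeMap<>());
        NavigableMap<Long, String> versions =
                columns.computeIfAbsent(column,
                        c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()));
        versions.put(timestamp, value);
    }

    /** Returns the newest value of a cell, or null if the cell does not exist. */
    public String getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        if (versions == null || versions.isEmpty()) {
            return null;
        }
        // Versions are sorted newest first, so the first entry is the latest.
        return versions.firstEntry().getValue();
    }

    /** Returns the first row key in sorted order, or null for an empty table. */
    public String firstRow() {
        return rows.isEmpty() ? null : rows.firstKey();
    }

    public static void main(String[] args) {
        // Mirrors the web crawl example: row is the URL, columns live in family "content".
        SortedMapTableSketch crawl = new SortedMapTableSketch();
        crawl.put("http://b.example/", "content:type", 1L, "text/plain");
        crawl.put("http://b.example/", "content:type", 2L, "text/html");
        crawl.put("http://a.example/", "content:data", 1L, "<html/>");

        System.out.println(crawl.getLatest("http://b.example/", "content:type"));
        System.out.println(crawl.firstRow());
    }
}
```

Because rows are kept in a sorted map, a scan over a key range is a walk over neighbouring entries, and a column that was never written for a row simply has no entry, which is the sense in which NULLs are 'free'.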