HBase @ WorldLingo


Presentation given March 6th in Berlin at the Berlin Hadoop User Group.

Shared by: Lars George
posted:
3/6/2009
HBase @ WorldLingo “Multilingual Archive”

Lars George, CTO www.worldlingo.com www.larsgeorge.com



whoami

• CTO @ WorldLingo
• Co-wrote C64 Simons BASIC competitor at age 16 in Assembler
• Co-wrote “Berlin 1948” in C on Amiga at age 19
• Diplom Informatiker (FH)
• Moved to Australia, then US of A
• 10 years at WorldLingo





WorldLingo

• Co-founded 1999
• Machine Translation Services
• Professional Human Translations
• Offices in US and UK
• Microsoft Office Provider since 2001
• Web based services
• Customer Projects
• Multilingual Archive





Multilingual Archive

• SOAP API
• Simple calls
  ◦ putDocument()
  ◦ getDocument()
  ◦ search()
  ◦ command()
  ◦ putTransformation()
  ◦ getTransformation()



Multilingual Archive (cont.)

• Planned already, implemented as customer project
• Scale:
  ◦ 500 million documents
  ◦ Random access
  ◦ “100%” uptime
• Technologies?
  ◦ Database
  ◦ Zip archives on file system, or Hadoop



Multilingual Archive (cont.)

• 44 Dell PESC1435, 12GB RAM, 2 x 1TB SATA drives
• Java 6
• Tomcat 5.5
• 88 Xen domUs
  ◦ Apache
  ◦ Hadoop/HBase
  ◦ Tomcat application servers
• Currently split into two clusters



Lucene Search Server

• 43 fields indexed
• 166GB size
• Automated merging/warm-up/swap
• Looking into scalable solution
  ◦ Katta
  ◦ Hyper Estraier
  ◦ DLucene
  ◦ …
• Sorting?



HBase





• Distributed database modeled on Bigtable
  ◦ Bigtable: A Distributed Storage System for Structured Data by Chang et al.
• Runs on top of Hadoop Core
  ◦ "Commodity" servers, replicated data, etc.
• Column-oriented store
  ◦ Wide table costs only the data stored
  ◦ NULLs in row are 'free'
  ◦ Good compression: columns of similar type
  ◦ Petabytes of data across thousands of servers
• Goal of billions of rows X millions of cells
• Distributed, High Availability, High Performance



!HBase





• A SQL Database!
  ◦ No joins
  ◦ No sophisticated query engine
  ◦ No transactions
  ◦ No column typing
  ◦ No SQL, no ODBC/JDBC, etc.
• Not a replacement for your RDBMS...



Why HBase?

• Datasets are reaching petabytes
• Traditional databases are expensive to scale and difficult to distribute
• Commodity hardware is cheap and powerful
• Need for both random access and batch processing (Hadoop alone offers only the latter)





HBase Public Timeline

• November 2006
  ◦ Google releases paper on Bigtable
• February 2007
  ◦ Initial HBase prototype created as Hadoop contrib
• October 2007
  ◦ First "useable" HBase (Hadoop 0.15.0)
• December 2007
  ◦ First HBase User Group
• January 2008
  ◦ Hadoop becomes TLP, HBase becomes subproject
• October 2008
  ◦ HBase 0.18.1 released
• January 2009
  ◦ HBase 0.19.0 released



HBase WorldLingo Timeline



HBase Tables

• Tables are sorted by Row
• Table schema defines column families
  ◦ Families consist of any number of columns
  ◦ Columns consist of any number of versions
  ◦ Everything except table name is byte[]
• (Table, Row, Family:Column, Timestamp) → Value



HBase Table (cont.)





As a data structure

SortedMap( RowKey, List( SortedMap( Column, List( Value, Timestamp ) ) ) )
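
The nested map above can be sketched in plain Java. This is only an in-memory illustration, not HBase code: the class name `VersionedTable` and its methods are hypothetical, and `String` stands in for `byte[]` to keep the example short.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory model of the nested sorted map; not part of HBase.
class VersionedTable {
    // row key -> column ("family:qualifier") -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Stores one version of a cell, creating the row and column on demand.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest value of a cell, or null if the cell does not exist.
    String getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    // Row keys come back in sorted order because the outer map is a TreeMap.
    Set<String> rowKeys() {
        return rows.navigableKeySet();
    }
}
```

Because every level is a sorted map, iterating the rows or the columns of a row always yields keys in order, which is what makes range scans cheap in this kind of layout.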



HBase - Example





Store web crawl data

  ◦ Table crawl with family content
  ◦ Row is URL with columns
    - content:data stores raw crawled data
    - content:language stores http language header
    - content:type stores http content-type header
  ◦ If processing raw data for hyperlinks and images, add families links and images
    - links: column for each hyperlink
    - images: column for each image

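
As a rough illustration of the crawl example, the sketch below models a single row (whose key would be the URL) as a sorted map of column names to values and selects all columns of one family by prefix. It uses plain Java collections, not the HBase API; the class `CrawlRowSketch`, the sample URLs, and the sample values are made up for the example, and `String` stands in for `byte[]`.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical model of one row of the crawl table; not HBase code.
class CrawlRowSketch {
    // Builds one example row: column name ("family:qualifier") -> value.
    static TreeMap<String, String> exampleRow() {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("content:data", "<html></html>");
        row.put("content:language", "en");
        row.put("content:type", "text/html");
        row.put("links:http://example.com/a", "anchor text a");
        row.put("links:http://example.com/b", "anchor text b");
        row.put("images:http://example.com/logo.png", "alt text");
        return row;
    }

    // Returns all columns of one family. Keys in ["family:", "family;") are exactly
    // those starting with "family:", because ';' directly follows ':' in ASCII.
    static SortedMap<String, String> family(TreeMap<String, String> row, String family) {
        return row.subMap(family + ":", family + ";");
    }
}
```

Calling `CrawlRowSketch.family(CrawlRowSketch.exampleRow(), "links")` returns only the two `links:` columns, one per hyperlink, without touching the `content:` or `images:` columns.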


HBase - Clients





• Native Java client/API
  ◦ get(byte[] row, byte[] column, long ts, int versions)
• Non-Java clients
  ◦ Thrift server (Ruby, C++, Erlang, etc.)
  ◦ REST server
• TableInput/TableOutputFormat for MapReduce
• HBase shell (JRuby)
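
The `ts` and `versions` parameters of the `get(...)` call above can be illustrated with a small stand-alone sketch. It assumes the call returns at most `versions` values whose timestamp is not newer than `ts`, newest first; that reading is an assumption here, not taken from the slides. The class `VersionedCellSketch` is hypothetical and models a single (row, column) cell with plain Java collections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of one versioned cell; not the HBase client API.
class VersionedCellSketch {
    // timestamp -> value for a single (row, column) cell, ascending by timestamp
    private final NavigableMap<Long, String> versions = new TreeMap<>();

    void put(long ts, String value) {
        versions.put(ts, value);
    }

    // Returns up to maxVersions values with timestamp <= ts, newest first
    // (assumed semantics of the ts/versions parameters).
    List<String> get(long ts, int maxVersions) {
        List<String> result = new ArrayList<>();
        for (String value : versions.headMap(ts, true).descendingMap().values()) {
            if (result.size() == maxVersions) {
                break;
            }
            result.add(value);
        }
        return result;
    }
}
```

With versions stored at timestamps 10, 20 and 30, `get(25, 5)` returns the values written at 20 and 10, in that order, and `get(5, 1)` returns an empty list.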





HBase - Multilingual Archive

• 5 Tables
• Up to 5 column families
• XML Schemas
• Automated table schema updates
• Stock standard options
  ◦ Should I tweak?
• MemCached(b) layer
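
One common way to put a cache layer in front of a data store is a read-through cache: look in the cache first and fall back to the store only on a miss. The sketch below shows that pattern under stated assumptions; the slides do not say how the MemCached layer is actually wired up. `ReadThroughCache` is a hypothetical class, a `HashMap` stands in for MemCached, and a `Function` stands in for an HBase `get()`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache; not the archive's actual implementation.
class ReadThroughCache {
    private final Map<String, String> cache = new HashMap<>(); // stands in for MemCached
    private final Function<String, String> backend;            // stands in for an HBase get()
    private int backendCalls = 0;

    ReadThroughCache(Function<String, String> backend) {
        this.backend = backend;
    }

    // Serves from the cache when possible; otherwise loads from the backend and caches the result.
    String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            backendCalls++;
            value = backend.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    // Drops a cached entry, e.g. after the underlying document was updated.
    void invalidate(String key) {
        cache.remove(key);
    }

    int backendCalls() {
        return backendCalls;
    }
}
```

Repeated reads of the same key hit the backend once; after `invalidate(key)` the next read goes to the backend again.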



HBase - Layers

[Diagram: layered architecture. Layer labels: Network, LWS, Web, App, Cache, Data. Components: Firewall; Director 1 … Director n; Apache 1 … Apache n; Tomcat 1 … Tomcat n (shown twice); MemCached 1 … MemCached n; HBase.]



HBase - Map/Reduce

• Backup/Restore
• Index building
• Cache filling
• Mapping
• Updates
• Translation





HBase - Problems





• Early versions
  ◦ Data loss
  ◦ Migration nightmares
  ◦ Slow performance
• Current version
  ◦ RegionServer madness (region unknown)
  ◦ New resource limits (xceiver)
  ◦ Read HBase Wiki
• Single point of failure (name node)



HBase - Notes

• RTF(ine)M
• HBase Wiki
• IRC Channel
• Personal Experience:
  ◦ Max. file handles
  ◦ Hadoop xceiver limits
  ◦ Redundant meta data (on name node)
  ◦ RAM
  ◦ Deployment strategy



HBase – Notes (cont.)





• Upcoming versions:
  ◦ ZooKeeper removes SPOF
  ◦ New file format brings speed
  ◦ Built-in Memcache = Can drop a layer!
  ◦ 200ms > 20ms (do 50 get() calls per call)



Questions?

• Email: lars@worldlingo.com
• Blog: www.larsgeorge.com
• Twitter: larsgeorge





