HBase @ WorldLingo “Multilingual Archive”
Lars George, CTO
www.worldlingo.com · www.larsgeorge.com

whoami
• CTO @ WorldLingo
• Co-wrote a C64 Simons' BASIC competitor at age 16, in Assembler
• Co-wrote “Berlin 1948” in C on the Amiga at age 19
• Diplom Informatiker (FH)
• Moved to Australia, then the US of A
• 10 years at WorldLingo


WorldLingo
• Co-founded 1999
• Machine translation services
• Professional human translations
• Offices in the US and UK
• Microsoft Office provider since 2001
• Web-based services
• Customer projects
• Multilingual Archive


Multilingual Archive
• SOAP API
• Simple calls:
  ◦ putDocument()
  ◦ getDocument()
  ◦ search()
  ◦ command()
  ◦ putTransformation()
  ◦ getTransformation()

Multilingual Archive (cont.)
• Planned already, implemented as a customer project
• Scale:
  ◦ 500 million documents
  ◦ Random access
  ◦ “100%” uptime
• Technologies?
  ◦ Database
  ◦ Zip archives on the file system, or Hadoop

Multilingual Archive (cont.)
• 44 Dell PE SC1435, 12 GB RAM, 2 x 1 TB SATA drives
• Java 6
• Tomcat 5.5
• 88 Xen domUs:
  ◦ Apache
  ◦ Hadoop/HBase
  ◦ Tomcat application servers
• Currently split into two clusters

Lucene Search Server
• 43 fields indexed
• 166 GB size
• Automated merging/warm-up/swap
• Looking into a scalable solution:
  ◦ Katta
  ◦ Hyper Estraier
  ◦ DLucene
  ◦ …
• Sorting?

HBase


• Distributed database modeled on Bigtable
  ◦ “Bigtable: A Distributed Storage System for Structured Data” by Chang et al.
• Runs on top of Hadoop Core
  ◦ “Commodity” servers, replicated data, etc.
• Column-oriented store
  ◦ Wide tables cost only the data stored
  ◦ NULLs in a row are 'free'
  ◦ Good compression: columns of similar type
  ◦ Petabytes of data across thousands of servers
• Goal of billions of rows x millions of cells
• Distributed, high availability, high performance

!HBase


• A SQL database!
  ◦ No joins
  ◦ No sophisticated query engine
  ◦ No transactions
  ◦ No column typing
  ◦ No SQL, no ODBC/JDBC, etc.
• Not a replacement for your RDBMS...

Why HBase?
• Datasets are reaching petabytes
• Traditional databases are expensive to scale and difficult to distribute
• Commodity hardware is cheap and powerful
• Need for random access and batch processing (which Hadoop does not offer)


HBase Public Timeline

• November 2006
  ◦ Google releases the Bigtable paper
• February 2007
  ◦ Initial HBase prototype created as a Hadoop contrib
• October 2007
  ◦ First "usable" HBase (Hadoop 0.15.0)
• December 2007
  ◦ First HBase User Group
• January 2008
  ◦ Hadoop becomes a TLP, HBase becomes a subproject
• October 2008
  ◦ HBase 0.18.1 released
• January 2009
  ◦ HBase 0.19.0 released

HBase WorldLingo Timeline

HBase Tables
• Tables are sorted by row
• Table schema defines column families
  ◦ Families consist of any number of columns
  ◦ Columns consist of any number of versions
  ◦ Everything except the table name is byte[]
• (Table, Row, Family:Column, Timestamp) → Value

HBase Table (cont.)


• As a data structure (see the sketch below):
  SortedMap( RowKey, List( SortedMap( Column, List( Value, Timestamp ) ) ) )
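
To make the nesting concrete, here is a rough sketch of that structure in plain Java collections (illustration only: the class and method names are invented for this example, and HBase itself stores row keys, columns, and values as raw byte[]):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Conceptual model of one HBase table: rows sorted by key, each row
    // mapping a column name to a list of timestamped value versions.
    public class ConceptualTable {

      static class VersionedValue {
        final byte[] value;
        final long timestamp;
        VersionedValue(byte[] value, long timestamp) {
          this.value = value;
          this.timestamp = timestamp;
        }
      }

      // row key -> ( column -> versions, newest first )
      private final SortedMap<String, SortedMap<String, List<VersionedValue>>> rows =
          new TreeMap<String, SortedMap<String, List<VersionedValue>>>();

      public void put(String row, String column, byte[] value, long timestamp) {
        SortedMap<String, List<VersionedValue>> columns = rows.get(row);
        if (columns == null) {
          columns = new TreeMap<String, List<VersionedValue>>();
          rows.put(row, columns);
        }
        List<VersionedValue> versions = columns.get(column);
        if (versions == null) {
          versions = new ArrayList<VersionedValue>();
          columns.put(column, versions);
        }
        versions.add(0, new VersionedValue(value, timestamp)); // newest first
      }
    }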

HBase - Example


• Store web crawl data (see the shell sketch below)
  ◦ Table crawl with family content
  ◦ Row is the URL, with columns:
    ▪ content:data stores the raw crawled data
    ▪ content:language stores the HTTP language header
    ▪ content:type stores the HTTP content-type header
  ◦ If processing the raw data for hyperlinks and images, add families links and images:
    ▪ links:<url> column for each hyperlink
    ▪ images:<url> column for each image
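
A quick sketch of how such a crawl schema could be created and filled from the HBase shell (the URL and values are made up; the shell syntax shown is the common form and may differ slightly in older releases):

    create 'crawl', 'content', 'links', 'images'

    # store the raw page and its HTTP headers under the URL as row key
    put 'crawl', 'http://www.example.com/', 'content:data', '<html>...</html>'
    put 'crawl', 'http://www.example.com/', 'content:language', 'en'
    put 'crawl', 'http://www.example.com/', 'content:type', 'text/html'

    # one column per outgoing hyperlink, named after the target URL
    put 'crawl', 'http://www.example.com/', 'links:http://www.example.org/', 'anchor text'

    # read the whole row back
    get 'crawl', 'http://www.example.com/'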

HBase - Clients


• Native Java client/API (see the sketch after this list)
  ◦ get(byte[] row, byte[] column, long ts, int versions)
• Non-Java clients
  ◦ Thrift server (Ruby, C++, Erlang, etc.)
  ◦ REST server
• TableInputFormat/TableOutputFormat for MapReduce
• HBase shell (JRuby)
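
A minimal sketch of a write and a read through the native Java client, using the 0.19-era API names as recalled (they may differ slightly between releases); the class name, table, and values reuse the made-up crawl example from the previous slide:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.BatchUpdate;
    import org.apache.hadoop.hbase.io.Cell;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CrawlClient {
      public static void main(String[] args) throws Exception {
        // picks up hbase-site.xml from the classpath
        HTable table = new HTable(new HBaseConfiguration(), "crawl");

        // write: all changes to one row are grouped into a BatchUpdate
        BatchUpdate update = new BatchUpdate("http://www.example.com/");
        update.put("content:language", Bytes.toBytes("en"));
        update.put("content:type", Bytes.toBytes("text/html"));
        table.commit(update);

        // read the latest version of a single cell
        Cell cell = table.get(Bytes.toBytes("http://www.example.com/"),
                              Bytes.toBytes("content:language"));
        System.out.println(Bytes.toString(cell.getValue()));

        // read up to 3 versions, per the get() signature listed above
        Cell[] versions = table.get(Bytes.toBytes("http://www.example.com/"),
                                    Bytes.toBytes("content:language"),
                                    Long.MAX_VALUE, 3);
        System.out.println(versions.length + " version(s) found");
      }
    }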


HBase - Multilingual Archive
• 5 tables
• Up to 5 column families
• XML Schemas
• Automated table schema updates
• Stock standard options
  ◦ Should I tweak?
• MemCached(b) layer

HBase - Layers
[Diagram: tiered request flow. Network: LWS directors 1…n and a firewall. Web: Apache 1…n. App: Tomcat 1…n. Cache: MemCached 1…n. Data: HBase.]

HBase - Map/Reduce
• Backup/restore
• Index building
• Cache filling
• Mapping
• Updates
• Translation
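
For illustration, a small map-only row-counting job over a table like the crawl example, driven by TableInputFormat. This sketch uses the newer org.apache.hadoop.hbase.mapreduce package from later HBase releases; the 0.19-era mapred classes referenced in this talk follow the same pattern under different names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class CrawlRowCounter {

      // Receives one table row per map() call; only increments a counter.
      static class RowCountMapper
          extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row,
                           Context context) {
          context.getCounter("crawl", "rows").increment(1);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "crawl-row-count");
        job.setJarByClass(CrawlRowCounter.class);

        // Full scan of the 'crawl' table; TableInputFormat creates one
        // map task per region, so the job scales with the table.
        TableMapReduceUtil.initTableMapperJob(
            "crawl", new Scan(), RowCountMapper.class,
            NullWritable.class, NullWritable.class, job);

        job.setOutputFormatClass(NullOutputFormat.class); // no file output
        job.setNumReduceTasks(0);                          // map-only job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }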


HBase - Problems


• Early versions
  ◦ Data loss
  ◦ Migration nightmares
  ◦ Slow performance
• Current version
  ◦ RegionServer madness (region unknown)
  ◦ New resource limits (xceiver)
  ◦ Read the HBase wiki
• Single point of failure (name node)

HBase - Notes
• RTF(ine)M
• HBase wiki
• IRC channel
• Personal experience:
  ◦ Max. file handles
  ◦ Hadoop xceiver limits
  ◦ Redundant meta data (on the name node)
  ◦ RAM
  ◦ Deployment strategy

HBase – Notes (cont.)


• Upcoming versions:
  ◦ ZooKeeper removes the SPOF
  ◦ New file format brings speed
  ◦ Built-in memcache = can drop a layer!
  ◦ 200 ms → 20 ms (doing 50 get()'s per call)

Questions?
• Email: lars@worldlingo.com
• Blog: www.larsgeorge.com
• Twitter: larsgeorge



				
Presentation given March 6th, 2009, in Berlin at the Berlin Hadoop User Group.