HBase @ WorldLingo

Description

Presentation given on March 6th in Berlin at the Berlin Hadoop User Group.

Shared by: Lars George
Posted: 3/6/2009
HBase @ WorldLingo
“Multilingual Archive”
Lars George, CTO
www.worldlingo.com
www.larsgeorge.com

whoami
• CTO @ WorldLingo
• Co-wrote C64 Simons BASIC competitor at age 16 in Assembler
• Co-wrote “Berlin 1948” in C on Amiga at age 19
• Diplom Informatiker (FH)
• Moved to Australia, then US of A
• 10 years at WorldLingo

WorldLingo
• Co-founded 1999
• Machine Translation Services
• Professional Human Translations
• Offices in US and UK
• Microsoft Office Provider since 2001
• Web based services
• Customer Projects
• Multilingual Archive

Multilingual Archive
• SOAP API
• Simple calls
  ◦ putDocument()
  ◦ getDocument()
  ◦ search()
  ◦ command()
  ◦ putTransformation()
  ◦ getTransformation()

Multilingual Archive (cont.)
• Planned already, implemented as customer project
• Scale:
  ◦ 500 million documents
  ◦ Random Access
  ◦ “100%” Uptime
• Technologies?
  ◦ Database
  ◦ Zip-Archives on file system, or Hadoop

Multilingual Archive (cont.)
• 44 Dell PESC1435, 12GB RAM, 2 x 1TB SATA drives
• Java 6
• Tomcat 5.5
• 88 Xen domU’s
  ◦ Apache
  ◦ Hadoop/HBase
  ◦ Tomcat application servers
• Currently split into two clusters

Lucene Search Server
• 43 fields indexed
• 166GB size
• Automated merging/warm-up/swap
• Looking into scalable solution
  ◦ Katta
  ◦ Hyper Estraier
  ◦ DLucene
  ◦ …
• Sorting?

HBase
• Distributed database modeled on Bigtable
  ◦ Bigtable: A Distributed Storage System for Structured Data by Chang et al.
• Runs on top of Hadoop Core
  ◦ "Commodity" servers, replicated data, etc.
• Column-oriented store
  ◦ Wide table costs only the data stored
  ◦ NULLs in row are 'free'
  ◦ Good compression: columns of similar type
• Goal of billions of rows X millions of cells
  ◦ Petabytes of data across thousands of servers
• Distributed, High Availability, High Performance

!HBase
• A SQL Database!
  ◦ No joins
  ◦ No sophisticated query engine
  ◦ No transactions
  ◦ No column typing
  ◦ No SQL, no ODBC/JDBC, etc.
• Not a replacement for your RDBMS...

Why HBase?
• Datasets are reaching Petabytes
• Traditional databases are expensive to scale and difficult to distribute
• Commodity hardware is cheap and powerful
• Need for random access as well as batch processing (Hadoop alone does not offer random access)

HBase Public Timeline
• November 2006
  ◦ Google releases paper on Bigtable
• February 2007
  ◦ Initial HBase prototype created as Hadoop contrib
• October 2007
  ◦ First "useable" HBase (0.15.0 Hadoop)
• December 2007
  ◦ First HBase User Group
• January 2008
  ◦ Hadoop becomes TLP, HBase becomes subproject
• October 2008
  ◦ HBase 0.18.1 released
• January 2009
  ◦ HBase 0.19.0 released

HBase WorldLingo Timeline

HBase Tables
• Tables are sorted by Row
• Table schema defines column families
  ◦ Families consist of any number of columns
  ◦ Columns consist of any number of versions
  ◦ Everything except table name is byte[]
(Table, Row, Family:Column, Timestamp) → Value

HBase Table (cont.)
• As a data structure
SortedMap(
  RowKey, List(
    SortedMap(
      Column, List(
        Value, Timestamp
      )
    )
  )
)

HBase - Example
• Store web crawl data
  ◦ Table crawl with family content
  ◦ Row is URL with columns
    ▪ content:data stores raw crawled data
    ▪ content:language stores http language header
    ▪ content:type stores http content-type header
  ◦ If processing raw data for hyperlinks and images, add families links and images
    ▪ links: column for each hyperlink
    ▪ images: column for each image

HBase - Clients
• Native Java client/API
  ◦ get(byte[] row, byte[] column, long ts, int versions)
• Non-Java clients
  ◦ Thrift server (Ruby, C++, Erlang, etc.)
  ◦ REST server
• TableInput/TableOutputFormat for MapReduce
• HBase shell (jruby)

HBase - Multilingual Archive
• 5 Tables
• Up to 5 column families
• XML Schemas
• Automated table schema updates
• Stock standard options
  ◦ Should I tweak?
• MemCached(b) layer

HBase - Layers
Layers shown in the diagram: Network, Web, App, Cache, Data
Components shown: Firewall, LWS Director 1 … Director n, Apache 1 … Apache n, Tomcat 1 … Tomcat n, MemCached 1 … MemCached n, HBase

HBase - Map/Reduce
• Backup/Restore
• Index building
• Cache filling
• Mapping
• Updates
• Translation

HBase - Problems
• Early versions
  ◦ Data loss
  ◦ Migration nightmares
  ◦ Slow performance
• Current version
  ◦ RegionServer madness (region unknown)
  ◦ New resource limits (xceiver)
  ◦ Read HBase Wiki
• Single point of failure (name node)

HBase - Notes
• RTF(ine)M
• HBase Wiki
• IRC Channel
• Personal Experience:
  ◦ Max. file handles
  ◦ Hadoop xceiver limits
  ◦ Redundant meta data (on name node)
  ◦ RAM
  ◦ Deployment strategy

HBase - Notes (cont.)
• Upcoming versions:
  ◦ Zookeeper removes SPOF
  ◦ New file format brings speed
  ◦ Built-in Memcache = Can drop a layer!
  ◦ 200ms > 20ms (do 50 get()’s per call)

Questions?
• Email: lars@worldlingo.com
• Blog: www.larsgeorge.com
• Twitter: larsgeorge
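
The nested sorted map on the "HBase Table (cont.)" slide can be made concrete with a small in-memory sketch in plain Java. This is only an illustration of the data model under stated assumptions, not the HBase client API: the class name TableModelSketch and its methods are hypothetical, row and column keys are Strings for readability (the slides say everything except the table name is byte[]), and the version list is modelled as a map sorted newest-first.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative in-memory model of one table; NOT the HBase client API. */
public class TableModelSketch {
    // row key -> (family:column -> (timestamp -> value)).
    // Rows and columns sort ascending; timestamps sort newest first.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>>();

    /** Stores one version of a cell, creating the row and column on demand. */
    public void put(String row, String column, long timestamp, byte[] value) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) {
            columns = new TreeMap<String, NavigableMap<Long, byte[]>>();
            rows.put(row, columns);
        }
        NavigableMap<Long, byte[]> versions = columns.get(column);
        if (versions == null) {
            // Reverse ordering keeps the newest timestamp at the front.
            versions = new TreeMap<Long, byte[]>(Collections.<Long>reverseOrder());
            columns.put(column, versions);
        }
        versions.put(timestamp, value);
    }

    /** Returns the newest value of a cell, or null if the cell does not exist. */
    public byte[] getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, byte[]> versions = columns.get(column);
        if (versions == null || versions.isEmpty()) {
            return null;
        }
        return versions.firstEntry().getValue();
    }

    /** Returns the first row key in sort order, or null for an empty table. */
    public String firstRow() {
        return rows.isEmpty() ? null : rows.firstKey();
    }

    /** Helper: encode a String as UTF-8 bytes. */
    public static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Helper: decode UTF-8 bytes, passing null through. */
    public static String text(byte[] b) {
        return b == null ? null : new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Mirrors the crawl example: the row is a URL, columns live in family "content".
        TableModelSketch crawl = new TableModelSketch();
        crawl.put("http://b.example/", "content:type", 1L, utf8("text/html"));
        crawl.put("http://a.example/", "content:language", 1L, utf8("de"));
        crawl.put("http://a.example/", "content:language", 2L, utf8("en"));
        System.out.println(crawl.firstRow());
        System.out.println(text(crawl.getLatest("http://a.example/", "content:language")));
    }
}
```

Running main prints the row http://a.example/ first, because rows are kept in key order regardless of insertion order, and then en, the value with the highest timestamp for that cell. A cell that was never written simply has no entry, which matches the slide's point that NULLs cost nothing in a wide table.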
