HBase at WorldLingo - Munich OpenHUG

Shared by: Lars George
Posted: 1/10/2010
Lars George, CTO
www.worldlingo.com
www.larsgeorge.com



• CTO @ WorldLingo
• Diplom Informatiker (FH)
• Moved to Australia, then US of A
• 10 years at WorldLingo
• HBase contributor and soon committer
• Nerd-factor: co-wrote C64 Simons BASIC competitor at age 16 in Assembler



• Co-founded 1999
• Machine Translation Services
• Professional Human Translations
• Offices in US and UK
• Microsoft Office Provider since 2001
• Web based services
• Customer Projects
• Multilingual Archive



• SOAP API
• Simple calls
  ◦ putDocument()
  ◦ getDocument()
  ◦ search()
  ◦ command()
  ◦ putTransformation()
  ◦ getTransformation()



• Planned already, implemented as customer project
• Scale:
  ◦ 500 million documents
  ◦ Random Access
  ◦ “100%” Uptime
• Technologies?
  ◦ Database
  ◦ Zip-Archives on file system, or Hadoop



  44



Dell PESC1435, 12GB RAM, 2 x 1TB SATA drives   Java 6   Tomcat 5.5   88 Xen domU’s

◦  Apache ◦  Hadoop/HBase ◦  Tomcat application servers

  Currently



split into two clusters



  43



fields indexed   166GB size   Automated merging/warm-up/swap   Looking into scalable solution

◦  Katta ◦  Hyper Estraier ◦  DLucene ◦  …

  Sorting?



 



• Distributed database modeled on Bigtable
  ◦ Bigtable: A Distributed Storage System for Structured Data by Chang et al.
• Runs on top of Hadoop Core
  ◦ "Commodity" servers, replicated data, etc.
• Column-oriented store
  ◦ Wide table costs only the data stored
  ◦ NULLs in row are 'free'
  ◦ Good compression: columns of similar type
• Goal of billions of rows X millions of cells
  ◦ Petabytes of data across thousands of servers
• Distributed, High Availability, High Performance



• “NoSQL” Database!
  ◦ No joins
  ◦ No sophisticated query engine
  ◦ No transactions (sort of)
  ◦ No column typing
  ◦ No SQL, no ODBC/JDBC, etc.
• Not a replacement for your RDBMS...



• Datasets are reaching Petabytes
• Traditional databases are expensive to scale and difficult to distribute
• Commodity hardware is cheap and powerful
• Need for random access alongside batch processing (Hadoop alone offers batch processing but not random access)



               



• November 2006
  ◦ Google releases paper on Bigtable
• February 2007
  ◦ Initial HBase prototype created as Hadoop contrib
• October 2007
  ◦ First "useable" HBase (Hadoop 0.15.0)
• December 2007
  ◦ First HBase User Group
• January 2008
  ◦ Hadoop becomes TLP, HBase becomes subproject
• October 2008
  ◦ HBase 0.18.1 released
• January 2009
  ◦ HBase 0.19.0 released
• September 2009
  ◦ HBase 0.20.1 released



• Tables are sorted by Row
• Table schema defines column families
  ◦ Families consist of any number of columns
  ◦ Columns consist of any number of versions
  ◦ Everything except table name is byte[]
• (Table, Row, Family:Column, Timestamp) → Value



• As a data structure

SortedMap(
  RowKey, List(
    SortedMap(
      Column, List( Value, Timestamp )
    )
  )
)
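
The nested structure above can be sketched with plain Python dicts. This is a minimal illustration of the data model only, not the HBase API: the names `put`, `get_latest`, and `scan`, and the sample rows, are hypothetical. Python dicts are not ordered by key, so the sketch sorts the keys when scanning to imitate the row ordering.

```python
from bisect import insort

# table: row key -> column ("family:qualifier") -> list of (timestamp, value),
# kept sorted by timestamp. All names here are illustrative, not HBase calls.
table = {}

def put(row, column, ts, value):
    """Store one versioned cell."""
    versions = table.setdefault(row, {}).setdefault(column, [])
    insort(versions, (ts, value))

def get_latest(row, column):
    """Return the newest value for a cell, or None if the cell is absent."""
    versions = table.get(row, {}).get(column, [])
    return versions[-1][1] if versions else None

def scan(start_row, stop_row):
    """Yield row keys in sorted order within [start_row, stop_row)."""
    for row in sorted(table):
        if start_row <= row < stop_row:
            yield row

put(b"row-b", b"fam:col", 1, b"old")
put(b"row-b", b"fam:col", 2, b"new")
put(b"row-a", b"fam:col", 1, b"x")
put(b"row-c", b"fam:other", 1, b"y")

print(get_latest(b"row-b", b"fam:col"))   # newest version wins
print(list(scan(b"row-a", b"row-c")))     # rows come back in key order
```

Because rows are visited in key order, a range of adjacent keys can be read with one scan, which is why the choice of row key matters so much in this model.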



• Store web crawl data
  ◦ Table crawl with family content
  ◦ Row is URL with columns
    • content:data stores raw crawled data
    • content:language stores http language header
    • content:type stores http content-type header
  ◦ If processing raw data for hyperlinks and images, add families links and images
    • links: column for each hyperlink
    • images: column for each image
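
The crawl table described above might look like this as an in-memory picture. It is a sketch under stated assumptions: the sample URLs and values are made up, `qualifiers` is a hypothetical helper, and using the target URL as the column name inside the links and images families is an assumption, since the slide only says there is one column per hyperlink or image.

```python
# Hypothetical picture of the "crawl" table: the row key is the page URL and
# each cell is keyed by "family:qualifier". Sample data is invented.
crawl = {
    "http://example.com/": {
        "content:data": b"<html></html>",
        "content:language": "en",
        "content:type": "text/html",
        "links:http://example.com/about": b"",
        "images:http://example.com/logo.png": b"",
    },
    # A page not yet processed for links or images simply has no cells in
    # those families; nothing is stored for the missing columns.
    "http://example.org/": {
        "content:data": b"plain text",
        "content:type": "text/plain",
    },
}

def qualifiers(row_key, family):
    """List the column qualifiers present in one family of one row."""
    prefix = family + ":"
    return sorted(
        name[len(prefix):] for name in crawl[row_key] if name.startswith(prefix)
    )

print(qualifiers("http://example.com/", "links"))
print(qualifiers("http://example.org/", "links"))
```

The second row shows the "wide table costs only the data stored" point: rows in the same table can carry entirely different sets of columns.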



• Native Java client/API
  ◦ get(byte[] row, byte[] column, long ts, int versions)
• Non-Java clients
  ◦ Thrift server (Ruby, C++, Erlang, etc.)
  ◦ REST server
• TableInput/TableOutputFormat for MapReduce
• HBase shell (jruby)
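
The `get(byte[] row, byte[] column, long ts, int versions)` signature above takes a timestamp and a version count. Below is a minimal Python sketch of what such a lookup could do, assuming it returns up to `versions` values no newer than `ts`, newest first. That reading of the parameters is an assumption, and `get_versions` and the sample data are hypothetical, not part of the HBase client.

```python
# cells: (row, column) -> list of (timestamp, value) pairs, in any order.
# Invented sample data for illustration only.
cells = {
    (b"row1", b"content:data"): [(10, b"v1"), (30, b"v3"), (20, b"v2")],
}

def get_versions(row, column, ts, versions):
    """Return up to `versions` values with timestamp <= ts, newest first."""
    history = cells.get((row, column), [])
    eligible = sorted(
        (pair for pair in history if pair[0] <= ts),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [value for _, value in eligible[:versions]]

# Versions at timestamps 10 and 20 qualify; 30 is newer than ts=25.
print(get_versions(b"row1", b"content:data", ts=25, versions=2))
```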



• 5 Tables
• Up to 5 column families
• XML Schemas
• Automated table schema updates
• Standard options tweaked over time
  ◦ Garbage Collection!
• MemCached(b) layer



[Architecture diagram. Layers: Network, LWS, Web, App, Cache, Data. Components: Firewall; Director 1..n; Apache 1..n; Tomcat 1..n (shown twice); MemCached 1..n; HBase]



• Backup/Restore
• Index building
• Cache filling
• Mapping
• Updates
• Translation



• Early versions
  ◦ Data loss
  ◦ Migration nightmares
  ◦ Slow performance
• Current version
  ◦ RegionServer madness (region unknown)
  ◦ New resource limits (xceiver)
  ◦ Read HBase Wiki
• Single point of failure (name node only!)



• RTF(ine)M
• HBase Wiki
• IRC Channel
• Personal Experience:
  ◦ Max. file handles
  ◦ Hadoop xceiver limits
  ◦ Redundant meta data (on name node)
  ◦ RAM
  ◦ Deployment strategy



• HBase 0.20.x “Performance”
  ◦ New Key Format – KeyValue
  ◦ New File Format – HFile
  ◦ New Block Cache – Concurrent LRU
  ◦ New Query and Result API
  ◦ New Scanners
  ◦ Zookeeper Integration – No SPOF in HBase
  ◦ New REST Interface
  ◦ Contrib
    • Transactional Tables
    • Secondary Indexes
    • Stargate
• HBase 0.21.x “Advanced Concepts”
  ◦ Master Rewrite – More Zookeeper
  ◦ New RPC Protocol (Avro)
  ◦ Multi-DC Replication
  ◦ Intra Row Scanning
  ◦ Further optimizations on algorithms and data structures
  ◦ Discretionary Access Control



• Email: lars@worldlingo.com
• Blog: www.larsgeorge.com
• Twitter: larsgeorge



