HBase at WorldLingo - Munich OpenHUG

Shared by: larsgeorge
posted:
1/10/2010
Lars George, CTO
www.worldlingo.com
www.larsgeorge.com

CTO @ WorldLingo
Diplom Informatiker (FH)
Moved to Australia, then US of A
10 years at WorldLingo
HBase contributor and soon committer
Nerd-factor: co-wrote C64 Simons BASIC competitor at age 16 in Assembler

Co-founded 1999
Machine Translation Services
Professional Human Translations
Offices in US and UK
Microsoft Office Provider since 2001
Web based services
Customer Projects
Multilingual Archive

SOAP API
Simple calls
  ◦ putDocument()
  ◦ getDocument()
  ◦ search()
  ◦ command()
  ◦ putTransformation()
  ◦ getTransformation()

Planned already, implemented as customer project
Scale:
  ◦ 500 million documents
  ◦ Random Access
  ◦ “100%” Uptime
Technologies?
  ◦ Database
  ◦ Zip-Archives on file system, or Hadoop

  44

Dell PESC1435, 12GB RAM, 2 x 1TB SATA drives   Java 6   Tomcat 5.5   88 Xen domU’s
◦  Apache ◦  Hadoop/HBase ◦  Tomcat application servers
  Currently

split into two clusters

  43

fields indexed   166GB size   Automated merging/warm-up/swap   Looking into scalable solution
◦  Katta ◦  Hyper Estraier ◦  DLucene ◦  …
  Sorting?

 

Distributed database modeled on Bigtable
  ◦ Bigtable: A Distributed Storage System for Structured Data by Chang et al.
Runs on top of Hadoop Core
  ◦ "Commodity" servers, replicated data, etc.
Column-oriented store
  ◦ Wide table costs only the data stored
  ◦ NULLs in row are 'free'
  ◦ Good compression: columns of similar type
Goal of billions of rows X millions of cells
  ◦ Petabytes of data across thousands of servers
Distributed, High Availability, High Performance

“NoSQL” Database!
  ◦ No joins
  ◦ No sophisticated query engine
  ◦ No transactions (sort of)
  ◦ No column typing
  ◦ No SQL, no ODBC/JDBC, etc.
Not a replacement for your RDBMS...

Datasets are reaching Petabytes
Traditional databases are expensive to scale and difficult to distribute
Commodity hardware is cheap and powerful
Need for random access as well as batch processing (Hadoop alone offers only the latter)

               

November 2006
◦  Google releases paper on Bigtable

February 2007
◦  Initial HBase prototype created as Hadoop contrib

October 2007
◦  First "useable" HBase (Hadoop 0.15.0)

December 2007
◦  First HBase User Group

January 2008
◦  Hadoop becomes TLP, HBase becomes subproject

October 2008
◦  HBase 0.18.1 released

January 2009
◦  HBase 0.19.0 released

September 2009
◦  HBase 0.20.1 released

Tables are sorted by Row
Table schema defines column families
  ◦ Families consist of any number of columns
  ◦ Columns consist of any number of versions
  ◦ Everything except table name is byte[]

(Table, Row, Family:Column, Timestamp) → Value

As a data structure

SortedMap(
  RowKey, List(
    SortedMap(
      Column, List(
        Value, Timestamp
      )
    )
  )
)
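The nested-map view above can be tried out with a small in-memory model. This is only a sketch in plain Python, not HBase code: the class name TinyTable and its methods are hypothetical, the family and column are flattened into a single "family:column" key for brevity, and sorting happens on read rather than in a true sorted map.

```python
from collections import defaultdict


class TinyTable:
    """Toy model (hypothetical): row key -> column -> list of (timestamp, value)."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row: bytes, column: bytes, ts: int, value: bytes) -> None:
        versions = self._rows[row][column]
        versions.append((ts, value))
        # Keep the newest version first, mirroring "columns consist of versions".
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def latest(self, row: bytes, column: bytes):
        versions = self._rows.get(row, {}).get(column, [])
        return versions[0][1] if versions else None

    def scan(self):
        # Rows come back in sorted key order, as in "tables are sorted by Row".
        for row in sorted(self._rows):
            for column in sorted(self._rows[row]):
                yield row, column, list(self._rows[row][column])


table = TinyTable()
table.put(b"row-b", b"content:data", 1, b"old")
table.put(b"row-b", b"content:data", 2, b"new")
table.put(b"row-a", b"content:type", 1, b"text/html")
```

Keys and values are bytes throughout, which matches the point that everything except the table name is byte[].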

Store web crawl data
  ◦ Table crawl with family content
  ◦ Row is URL with columns
      content:data stores raw crawled data
      content:language stores http language header
      content:type stores http content-type header
  ◦ If processing raw data for hyperlinks and images, add families links and images
      links:<url> column for each hyperlink
      images:<url> column for each image
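As an illustration of the crawl schema, the following sketch models one row as a plain Python dict and derives the link and image columns from content:data. It does not talk to HBase. The helper names are hypothetical, storing an empty value in each derived column is an assumption (the slides do not say what the cell holds), and the image columns are assumed to live in the images family.

```python
from html.parser import HTMLParser


class LinkAndImageParser(HTMLParser):
    """Collects href targets of <a> tags and src targets of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links, self.images = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def add_link_columns(row: dict) -> dict:
    """Return a copy of the row with one column per hyperlink and per image."""
    parser = LinkAndImageParser()
    parser.feed(row[b"content:data"].decode("utf-8", errors="replace"))
    out = dict(row)
    for url in parser.links:
        out[b"links:" + url.encode("utf-8")] = b""   # assumed empty cell value
    for url in parser.images:
        out[b"images:" + url.encode("utf-8")] = b""  # assumed empty cell value
    return out


# Table "crawl": the row key is the URL, columns are "family:qualifier".
crawl = {
    b"http://example.com/": {
        b"content:data": b'<a href="http://example.com/a">a</a>'
                         b'<img src="http://example.com/i.png">',
        b"content:language": b"en",
        b"content:type": b"text/html",
    }
}
crawl = {url: add_link_columns(cols) for url, cols in crawl.items()}
```

Because column names are data rather than schema, each row can carry a different number of link and image columns without any table change.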

Native Java client/API
  ◦ get(byte[] row, byte[] column, long ts, int versions)
Non-Java clients
  ◦ Thrift server (Ruby, C++, Erlang, etc.)
  ◦ REST server
TableInput/TableOutputFormat for MapReduce
HBase shell (jruby)
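To make the four arguments of the get() call concrete, here is a plain Python model of a versioned read. It is not the Java client. It assumes that ts acts as an upper bound on the cell timestamp and that versions caps how many cells come back, newest first; the dict layout and the function body are illustrative only.

```python
def get(table, row, column, ts, versions):
    """Return up to `versions` (timestamp, value) cells with timestamp <= ts, newest first."""
    cells = table.get(row, {}).get(column, [])
    eligible = sorted(
        (cell for cell in cells if cell[0] <= ts),
        key=lambda cell: cell[0],
        reverse=True,
    )
    return eligible[:versions]


# One row with three versions of the same column.
table = {
    b"doc-1": {
        b"content:data": [(10, b"v1"), (20, b"v2"), (30, b"v3")],
    }
}

result = get(table, b"doc-1", b"content:data", ts=25, versions=2)
```

With the sample data, the version written at timestamp 30 is excluded because it is newer than the requested ts.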

5 Tables
Up to 5 column families
XML Schemas
Automated table schema updates
Standard options tweaked over time
  ◦ Garbage Collection!
MemCached(b) layer

[Architecture diagram. Layers: Network, LWS, Web, App, Cache, Data. Components: Firewall; Director 1 … n; Apache 1 … n; Tomcat 1 … n; Tomcat 1 … n; MemCached 1 … n; HBase.]
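The diagram places a cache tier between the application servers and HBase. The slides do not describe how the cache is read or invalidated, so the following is only a generic cache-aside sketch under that assumption: the class CachedStore is hypothetical, and plain dicts stand in for both the MemCached tier and the HBase table.

```python
class CachedStore:
    """Cache-aside reads: try the cache first, fall back to the backend on a miss."""

    def __init__(self, backend: dict):
        self.backend = backend      # stand-in for the HBase table
        self.cache = {}             # stand-in for the MemCached tier
        self.backend_reads = 0      # counts how often the backend is hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.backend_reads += 1
        value = self.backend.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value

    def put(self, key, value):
        self.backend[key] = value
        self.cache.pop(key, None)    # drop any stale cached copy


store = CachedStore({"doc-1": b"hello"})
first = store.get("doc-1")
second = store.get("doc-1")
```

The second read is served from the cache, so the backend is consulted only once until the key is written again.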

Backup/Restore
Index building
Cache filling
Mapping
Updates
Translation

Early versions
  ◦ Data loss
  ◦ Migration nightmares
  ◦ Slow performance
Current version
  ◦ RegionServer madness (region unknown)
  ◦ New resource limits (xceiver)
  ◦ Read HBase Wiki
Single point of failure (name node only!)

RTF(ine)M
HBase Wiki
IRC Channel
Personal Experience:
  ◦ Max. file handles
  ◦ Hadoop xceiver limits
  ◦ Redundant meta data (on name node)
  ◦ RAM
  ◦ Deployment strategy

HBase 0.20.x “Performance”
  ◦ New Key Format – KeyValue
  ◦ New File Format – HFile
  ◦ New Block Cache – Concurrent LRU
  ◦ New Query and Result API
  ◦ New Scanners
  ◦ ZooKeeper Integration – No SPOF in HBase
  ◦ New REST Interface
  ◦ Contrib
      Transactional Tables
      Secondary Indexes
      Stargate

HBase 0.21.x “Advanced Concepts”
  ◦ Master Rewrite – More ZooKeeper
  ◦ New RPC Protocol (Avro)
  ◦ Multi-DC Replication
  ◦ Intra Row Scanning
  ◦ Further optimizations on algorithms and data structures
  ◦ Discretionary Access Control

Email: lars@worldlingo.com
Blog: www.larsgeorge.com
Twitter: larsgeorge


						