HBase at WorldLingo - Munich OpenHUG


Lars George, CTO
www.worldlingo.com
www.larsgeorge.com

CTO @ WorldLingo
Diplom Informatiker (FH)
Moved to Australia, then US of A
10 years at WorldLingo
HBase contributor and soon committer
Nerd-factor: co-wrote C64 Simons BASIC competitor at age 16 in Assembler

Co-founded 1999
Machine Translation Services
Professional Human Translations
Offices in US and UK
Microsoft Office Provider since 2001
Web based services
Customer Projects
Multilingual Archive

SOAP API
Simple calls
  ◦ putDocument()
  ◦ getDocument()
  ◦ search()
  ◦ command()
  ◦ putTransformation()
  ◦ getTransformation()

Planned already, implemented as customer project
Scale:
  ◦ 500 million documents
  ◦ Random Access
  ◦ “100%” Uptime
Technologies?
  ◦ Database
  ◦ Zip-Archives on file system, or Hadoop

44 Dell PE SC1435, 12GB RAM, 2 x 1TB SATA drives
Java 6
Tomcat 5.5
88 Xen domU’s
  ◦ Apache
  ◦ Hadoop/HBase
  ◦ Tomcat application servers
Currently split into two clusters

43 fields indexed
166GB size
Automated merging/warm-up/swap
Looking into scalable solution
  ◦ Katta
  ◦ Hyper Estraier
  ◦ DLucene
  ◦ …
Sorting?

Distributed database modeled on Bigtable
Runs on top of Hadoop Core
  ◦ "Commodity" servers, replicated data, etc.
  ◦ Bigtable: A Distributed Storage System for Structured Data by Chang et al.
Column-oriented store
  ◦ Wide table costs only the data stored
  ◦ NULLs in row are 'free'
  ◦ Good compression: columns of similar type
Goal of billions of rows x millions of cells
  ◦ Petabytes of data across thousands of servers

Distributed, High Availability, High Performance
“NoSQL” Database!
  ◦ No joins
  ◦ No sophisticated query engine
  ◦ No transactions (sort of)
  ◦ No column typing
  ◦ No SQL, no ODBC/JDBC, etc.
Not a replacement for your RDBMS...
Datasets are reaching Petabytes
Traditional databases are expensive to scale and difficult to distribute
Commodity hardware is cheap and powerful
Need for random access as well as batch processing (random access is what Hadoop does not offer)

November 2006
  ◦ Google releases paper on Bigtable
February 2007
  ◦ Initial HBase prototype created as Hadoop contrib
October 2007
  ◦ First "useable" HBase (Hadoop 0.15.0)
December 2007
  ◦ First HBase User Group
January 2008
  ◦ Hadoop becomes TLP, HBase becomes subproject
October 2008
  ◦ HBase 0.18.1 released
January 2009
  ◦ HBase 0.19.0 released
September 2009
  ◦ HBase 0.20.1 released

Tables are sorted by Row
Table schema defines column families
  ◦ Families consist of any number of columns
  ◦ Columns consist of any number of versions
  ◦ Everything except table name is byte[]
(Table, Row, Family:Column, Timestamp) → Value

As a data structure
SortedMap( RowKey, List( SortedMap( Column, List( Value, Timestamp ) ) ) )

Store web crawl data
  ◦ Table crawl with family content
  ◦ Row is URL with columns
      content:data stores raw crawled data
      content:language stores HTTP language header
      content:type stores HTTP content-type header
  ◦ If processing raw data for hyperlinks and images, add families links and images
      links:<url> column for each hyperlink
      images:<url> column for each image

Native Java client/API
  ◦ get(byte[] row, byte[] column, long ts, int versions)
Non-Java clients
  ◦ Thrift server (Ruby, C++, Erlang, etc.)
  ◦ REST server
TableInput/TableOutputFormat for MapReduce
HBase shell (JRuby)

5 Tables
Up to 5 column families
XML Schemas
Automated table schema updates
Standard options tweaked over time
  ◦ Garbage Collection!
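
The data model described earlier, (Table, Row, Family:Column, Timestamp) → Value, can be sketched with nested sorted maps from the Java standard library. This is only a toy in-memory illustration, not the HBase client API: the class name CrawlTableSketch and its methods are hypothetical, row and column names are Strings rather than byte[] for readability, and the family is folded into the "family:column" name instead of being kept as a separate list of maps per family.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of (Row, Family:Column, Timestamp) -> Value. Hypothetical, not the HBase API. */
public class CrawlTableSketch {

    // row key (sorted) -> "family:column" (sorted) -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    /** Stores one versioned cell. Columns that are never written occupy no space. */
    public void put(String row, String column, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the value with the highest timestamp, or null if the cell does not exist. */
    public byte[] getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, byte[]> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    /** Number of columns actually stored for a row (missing columns are not counted). */
    public int columnCount(String row) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        return columns == null ? 0 : columns.size();
    }

    /** Smallest row key, showing that rows are kept in sorted order. */
    public String firstRow() {
        return rows.isEmpty() ? null : rows.firstKey();
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        CrawlTableSketch crawl = new CrawlTableSketch();
        String url = "http://example.com/"; // placeholder row key
        crawl.put(url, "content:type", 1L, bytes("text/html"));
        crawl.put(url, "content:type", 2L, bytes("application/xhtml+xml"));
        crawl.put(url, "links:http://example.org/", 2L, bytes(""));

        // Prints the newest version of content:type, then the number of stored columns.
        System.out.println(new String(crawl.getLatest(url, "content:type"), StandardCharsets.UTF_8));
        System.out.println(crawl.columnCount(url));
    }
}
```

A row here carries only the columns that were written, which mirrors the "NULLs in row are 'free'" point, and the reverse-ordered timestamp map keeps the newest version first.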
MemCached(b) layer
[Architecture diagram. Layers: Network, LWS, Web App, Cache, Data. Components: Firewall, Director 1 … n, Apache 1 … n, Tomcat 1 … n (two tiers), MemCached 1 … n, HBase]

Backup/Restore
Index building
Cache filling
Mapping
Updates
Translation

Early versions
  ◦ Data loss
  ◦ Migration nightmares
  ◦ Slow performance
Current version
  ◦ RegionServer madness (region unknown)
  ◦ New resource limits (xceiver)
  ◦ Read HBase Wiki
Single point of failure (name node only!)

RTF(ine)M
HBase Wiki
IRC Channel
Personal Experience:
  ◦ Max. file handles
  ◦ Hadoop xceiver limits
  ◦ Redundant meta data (on name node)
  ◦ RAM
  ◦ Deployment strategy

HBase 0.20.x “Performance”
  ◦ New Key Format – KeyValue
  ◦ New File Format – HFile
  ◦ New Block Cache – Concurrent LRU
  ◦ New Query and Result API
  ◦ New Scanners
  ◦ ZooKeeper Integration – No SPOF in HBase
  ◦ New REST Interface
  ◦ Contrib
      Transactional Tables
      Secondary Indexes
      Stargate

HBase 0.21.x “Advanced Concepts”
  ◦ Master Rewrite – More ZooKeeper
  ◦ New RPC Protocol (Avro)
  ◦ Multi-DC Replication
  ◦ Intra Row Scanning
  ◦ Further optimizations on algorithms and data structures
  ◦ Discretionary Access Control

Email: lars@worldlingo.com
Blog: www.larsgeorge.com
Twitter: larsgeorge