hbase_intro

Shared by: Evan He
posted:
2/9/2012
Training Course









HBase Intro

王耀聰 陳威宇

Jazz@nchc.org.tw

waue@nchc.org.tw




HBase is a distributed column-oriented database built on top of HDFS.

HBase is ...

• A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage.
• Designed to operate on top of the Hadoop Distributed File System (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability.
• Integrated into the Hadoop MapReduce platform and paradigm.

Benefits

• Distributed storage
• Table-like data structure
  - Multi-dimensional map
• High scalability
• High availability
• High performance

Who Uses HBase

Backdrop

• Started by Chad Walters and Jim
• 2006.11
  - Google releases paper on Bigtable
• 2007.2
  - Initial HBase prototype created as Hadoop contrib.
• 2007.10
  - First usable HBase
• 2008.1
  - Hadoop becomes an Apache top-level project and HBase becomes a subproject
• 2008.10~
  - HBase 0.18, 0.19 released

HBase Is Not …

• Tables have one primary index, the row key.
• No join operators.
• Scans and queries can select a subset of available columns, perhaps by using a wildcard.
• There are three types of lookups:
  - Fast lookup using row key and optional timestamp.
  - Full table scan.
  - Range scan from region start to end.

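The three lookup types on a single row-key index can be illustrated with a toy in-memory table. This is only a sketch under stated assumptions: `TinyTable` and its methods are hypothetical names, not the HBase client API, and the range scan here runs over an arbitrary row-key interval rather than over a real region.

```python
from bisect import bisect_left


class TinyTable:
    """Toy table with one index, the row key (hypothetical, not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted
        self._rows = {}   # row key -> {timestamp: value}

    def put(self, row_key, value, timestamp):
        if row_key not in self._rows:
            self._keys.insert(bisect_left(self._keys, row_key), row_key)
            self._rows[row_key] = {}
        self._rows[row_key][timestamp] = value

    def get(self, row_key, timestamp=None):
        """Fast lookup by row key; newest version unless a timestamp is given."""
        versions = self._rows.get(row_key)
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self, start=None, stop=None):
        """Full scan when no bounds are given, else a range scan over [start, stop)."""
        lo = 0 if start is None else bisect_left(self._keys, start)
        hi = len(self._keys) if stop is None else bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self.get(key)


table = TinyTable()
table.put("row2", "b", timestamp=1)
table.put("row1", "a", timestamp=1)
table.put("row1", "a2", timestamp=2)
table.put("row3", "c", timestamp=1)

newest = table.get("row1")                      # lookup by row key
older = table.get("row1", timestamp=1)          # lookup by row key and timestamp
full_scan = list(table.scan())                  # full table scan
range_scan = list(table.scan("row2", "row3"))   # range scan
```

Because the keys are kept sorted, a range scan only needs two binary searches and a slice, which is why range scans are cheap compared with a lookup on any non-key attribute.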
HBase Is Not …(2)

• Limited atomicity and transaction support.
  - HBase supports multiple batched mutations of single rows only.
• Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
  - Programmatic access via Java, REST, or Thrift APIs.
  - Scripting via JRuby.

Why Bigtable?

• RDBMS performance is good for transaction processing, but for very large-scale analytic processing the available solutions are commercial, expensive, and specialized.
• Very large-scale analytic processing means:
  - Big queries, typically range or table scans.
  - Big databases (100s of TB).

Why Bigtable? (2)

• MapReduce on Bigtable, optionally with Cascading on top to support some relational algebra, may be a cost-effective solution.
• Sharding is not a solution for scaling open source RDBMS platforms:
  - Application specific
  - Labor-intensive (re)partitioning

Why HBase?

• HBase is a Bigtable clone.
• It is open source.
• It has a good community and promise for the future.
• It is developed on top of the Hadoop platform and integrates well with it, if you are using Hadoop already.
• It has a Cascading connector.

HBase Benefits over RDBMS

• No real indexes
• Automatic partitioning
• Scales linearly and automatically with new nodes
• Commodity hardware
• Fault tolerance
• Batch processing

Data Model

• Tables are sorted by row.
• The table schema defines only its column families.
  - Each family consists of any number of columns.
  - Each column consists of any number of versions.
  - Columns exist only when inserted; NULLs are free.
  - Columns within a family are sorted and stored together.
• Everything except table names is byte[].
• (Row, Family:Column, Timestamp) → Value

[Figure: table layout labeling the row key, column family, timestamp, and value]
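The mapping (Row, Family:Column, Timestamp) → Value can be pictured as a nested map. The following is a minimal sketch, not HBase code: `ToyTable` is a hypothetical class, and plain Python `bytes` stand in for byte[].

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical model of (row, family:column, timestamp) -> value."""

    def __init__(self, families):
        # The schema fixes only the column families, never the columns.
        self.families = set(families)
        # row -> b"family:column" -> {timestamp: value}
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        family = column.split(b":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        # A column comes into existence only when a value is inserted.
        self.cells[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.cells.get(row, {}).get(column, {})
        if not versions:
            return None   # an absent column takes no storage at all
        ts = max(versions) if timestamp is None else timestamp
        return versions.get(ts)

    def rows(self):
        """Row keys in sorted order, as a scan would return them."""
        return sorted(self.cells)


t = ToyTable([b"data"])
t.put(b"row2", b"data:1", 100, b"x")
t.put(b"row1", b"data:1", 100, b"old")
t.put(b"row1", b"data:1", 200, b"new")

latest = t.get(b"row1", b"data:1")          # newest version wins by default
first = t.get(b"row1", b"data:1", 100)      # an older version is still readable
missing = t.get(b"row1", b"data:2")         # never inserted, so nothing stored
```

Keeping every key as bytes mirrors the point that the store itself attaches no types; any interpretation of a value is left to the application.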

Members

• Master
  - Monitors the region servers
  - Balances regions across servers
  - Redirects clients to the correct region servers
  - Currently the single point of failure (SPOF)
• Region server slaves
  - Serve client requests (write/read/scan)
  - Send heartbeats to the master
  - Throughput and the number of regions scale with the number of region servers

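Regions

• A table is made up of one or more regions.
• A region is specified by its startKey and endKey.
• Each region may reside on several different nodes and consists of several HDFS files and blocks; Hadoop takes care of replicating them.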
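Since each region is defined by a startKey and an endKey, finding the region that holds a row key is a search over sorted start keys. The sketch below is an illustration only: the region boundaries and server names are made up, and an empty string is assumed to mark an open-ended first startKey or last endKey.

```python
from bisect import bisect_right

# Hypothetical regions of one table: (start_key, end_key, server).
# "" is assumed to mean "no lower bound" or "no upper bound".
REGIONS = [
    ("", "g", "server-a"),
    ("g", "p", "server-b"),
    ("p", "", "server-c"),
]
START_KEYS = [start for start, _end, _server in REGIONS]


def locate(row_key):
    """Return the region whose [start_key, end_key) range contains row_key."""
    index = bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index]


apple_region = locate("apple")   # falls in the first region
g_region = locate("g")           # a start key belongs to its own region
zebra_region = locate("zebra")   # falls in the last, open-ended region
```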

Case Study – Blog

• Logical data model
  - A blog entry consists of the fields title, date, author, type, and text.
  - A user consists of fields such as username and password.
  - Each blog entry can have many comments.
  - Each comment consists of a title, author, and text.
• ERD

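Blog – HBase Table Schema

• Row key
  - Composed of the type (a 2-character abbreviation) and the timestamp.
  - Rows are therefore sorted first by type and then by timestamp, which makes it convenient to read the table with scan().
• The one-to-many relationship between BLOGENTRY and COMMENT is represented by a dynamic number of columns inside the comment_title, comment_author, and comment_text column families.
• Each column is named after the timestamp of its comment, so the columns in each column family are automatically sorted by time.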
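The row-key and comment-column design above can be tried out with plain dictionaries. This is a sketch under assumptions: the `info` family, the type abbreviations `te` and `li`, and the zero-padding of timestamps (so that string order matches numeric order) are all hypothetical choices, not taken from the slides.

```python
def make_row_key(entry_type, timestamp):
    """Row key = 2-character type abbreviation + zero-padded timestamp."""
    if len(entry_type) != 2:
        raise ValueError("type abbreviation must be 2 characters")
    return f"{entry_type}{timestamp:013d}"


blog = {}  # row key -> column family -> {column name: value}


def add_entry(entry_type, timestamp, title):
    key = make_row_key(entry_type, timestamp)
    blog[key] = {
        "info": {"title": title},   # hypothetical family for the entry itself
        "comment_title": {},
        "comment_author": {},
        "comment_text": {},
    }
    return key


def add_comment(row_key, timestamp, title, author, text):
    # One new column per comment in each family, named by the comment timestamp.
    column = f"{timestamp:013d}"
    row = blog[row_key]
    row["comment_title"][column] = title
    row["comment_author"][column] = author
    row["comment_text"][column] = text


k1 = add_entry("te", 1240148040035, "Second tech post")
k2 = add_entry("te", 1240148026198, "First tech post")
k3 = add_entry("li", 1240148047497, "A life post")
add_comment(k1, 1240148050000, "Later", "bob", "second comment")
add_comment(k1, 1240148045000, "Earlier", "amy", "first comment")

all_rows = sorted(blog)                                        # scan order
tech_rows = [k for k in sorted(blog) if k.startswith("te")]    # scan one type
comment_order = sorted(blog[k1]["comment_title"])              # comments by time
```

Sorting the keys stands in for what the store does on its own: entries of one type sit next to each other, and within a type they appear in time order.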

Architecture

ZooKeeper

• HBase depends on ZooKeeper (Chapter 13) and by default it manages a ZooKeeper instance as the authority on cluster state.

Operation

• The -ROOT- table holds the list of .META. table regions.
• The .META. table holds the list of all user-space regions.
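The two-level lookup through -ROOT- and .META. can be sketched as two sorted searches. Everything in this example is a simplification made for illustration: the catalog contents, the server names, and the `table,row_key` form of the lookup key are assumptions, not the exact on-disk catalog format.

```python
from bisect import bisect_right


def find(entries, key):
    """entries: list of (start_key, target) sorted by start_key."""
    starts = [start for start, _target in entries]
    index = bisect_right(starts, key) - 1
    if index < 0:
        raise KeyError(key)
    return entries[index][1]


# Hypothetical catalog contents.
ROOT = [("", "meta-region-1"), ("blog,", "meta-region-2")]
META = {
    "meta-region-1": [("access,", "server-a"), ("access,m", "server-b")],
    "meta-region-2": [("blog,", "server-c"), ("blog,te", "server-d")],
}


def locate_region_server(table, row_key):
    lookup_key = f"{table},{row_key}"
    meta_region = find(ROOT, lookup_key)         # step 1: -ROOT- names a .META. region
    return find(META[meta_region], lookup_key)   # step 2: .META. names the user region's server


tech_server = locate_region_server("blog", "te1240148026198")
life_server = locate_region_server("blog", "li1240148047497")
access_server = locate_region_server("access", "zed")
```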

Installation (1)

Start Hadoop…

$ wget http://ftp.twaren.net/Unix/Web/apache/hadoop/hbase/hbase-0.20.2/hbase-0.20.2.tar.gz

$ sudo tar -zxvf hbase-*.tar.gz -C /opt/

$ sudo ln -sf /opt/hbase-0.20.2 /opt/hbase

$ sudo chown -R $USER:$USER /opt/hbase

$ sudo mkdir /var/hadoop/

$ sudo chmod 777 /var/hadoop

Setup (1)

$ vim /opt/hbase/conf/hbase-env.sh

export JAVA_HOME=/usr/lib/jvm/java-6-sun

export HADOOP_CONF_DIR=/opt/hadoop/conf

export HBASE_HOME=/opt/hbase

export HBASE_LOG_DIR=/var/hadoop/hbase-logs

export HBASE_PID_DIR=/var/hadoop/hbase-pids

export HBASE_MANAGES_ZK=true

export HBASE_CLASSPATH=$HBASE_CLASSPATH:/opt/hadoop/conf







$ cd /opt/hbase/conf

$ cp /opt/hadoop/conf/core-site.xml ./

$ cp /opt/hadoop/conf/hdfs-site.xml ./

$ cp /opt/hadoop/conf/mapred-site.xml ./



Setup (2)

Name                                   Value
hbase.rootdir                          hdfs://secuse.nchc.org.tw:9000/hbase
hbase.tmp.dir                          /var/hadoop/hbase-${user.name}
hbase.cluster.distributed              true
hbase.zookeeper.property.clientPort    2222
hbase.zookeeper.quorum                 Host1, Host2
hbase.zookeeper.property.dataDir       /var/hadoop/hbase-data
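Settings like these are normally written to conf/hbase-site.xml as `<property>` elements, each with a `<name>` and a `<value>`, inside a `<configuration>` root. As a hedged sketch, the script below renders the table into that layout; the helper name `render_site_xml` is hypothetical, and the values are copied from the table, including the placeholder host names.

```python
import xml.etree.ElementTree as ET

# Values copied from the table; Host1 and Host2 are placeholders.
SETTINGS = {
    "hbase.rootdir": "hdfs://secuse.nchc.org.tw:9000/hbase",
    "hbase.tmp.dir": "/var/hadoop/hbase-${user.name}",
    "hbase.cluster.distributed": "true",
    "hbase.zookeeper.property.clientPort": "2222",
    "hbase.zookeeper.quorum": "Host1, Host2",
    "hbase.zookeeper.property.dataDir": "/var/hadoop/hbase-data",
}


def render_site_xml(settings):
    """Build the <configuration><property>...</property></configuration> text."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-printing helper, needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


site_xml = render_site_xml(SETTINGS)
print(site_xml)
```

The printed text could be saved as /opt/hbase/conf/hbase-site.xml after replacing the placeholder hosts with real ones.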

Startup & Stop

$ start-hbase.sh









$ stop-hbase.sh

Testing (4)

$ hbase shell
> create 'test', 'data'
0 row(s) in 4.3066 seconds
> list
test
1 row(s) in 0.1485 seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
row3 column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
> list
0 row(s) in 2.0645 seconds

Connecting to HBase

• Java client
  - get(byte [] row, byte [] column, long timestamp, int versions);
• Non-Java clients
  - Thrift server hosting an HBase client instance
    Sample Ruby, C++, and Java (via Thrift) clients
  - REST server hosting an HBase client
• TableInput/OutputFormat for MapReduce
  - HBase as MR source or sink
• HBase Shell
  - JRuby IRB with a "DSL" to add get, scan, and admin
  - ./bin/hbase shell YOUR_SCRIPT

Thrift

$ hbase-daemon.sh start thrift
$ hbase-daemon.sh stop thrift

• Thrift is a software framework for scalable cross-language services development.
• It was created by Facebook.
• It works seamlessly between C++, Java, Python, PHP, and Ruby.
• The start command launches the Thrift server instance, by default on port 9090.
• A similar alternative is the REST server.

References

• Introduction to HBase (HBase 介紹)
  - http://www.wretch.cc/blog/trendnop09/21192672
• Hadoop: The Definitive Guide
  - Book, by Tom White
• HBase Architecture 101
  - http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html


