Guidelines for schema design for HBase - Paxcel Technologies
Shared by: fjzhangweiqun
-
Stats
- views:
- 1
- posted:
- 11/12/2012
- language:
- Unknown
- pages:
- 20
Document Sample


Schema Guidelines & Case
Studies
For non-relational databases: HBase & Hypertable with
create statements
7/15/2010
PAXCEL Technologies Pvt. Ltd.
Pushpinder Singh
Table of Contents
Guidelines for schema design for HBase ...................................................................................................... 3
Case 1: Users and their friends (1-M relation).............................................................................................. 4
Case 2: Parent Child Relation (Self-Join) Storing Locations .......................................................................... 6
Case 3: Student-Course Information (M-M relation).................................................................................... 8
Case 4: User Action (1-M Relation; High Volume Data).............................................................................. 11
Case 5: Access Log (analyzed every 5 minutes, hour, monthly, yealy) ....................................................... 12
Case 6: Stock Analysis ................................................................................................................................. 14
Case 7: Hospital-Doctor Scenario................................................................................................................ 16
Guidelines for schema design for HBase
Remember the DDI methodology:
o Denormalization: In simple terms it means we store data about a business entity and
data about the entities related to it in the same table. This aims to achieve the goal of
retrieving the data via fewer searches on the data store.
o Duplication: serves the same purpose as Denormalization, and
o Intelligent Keys: The Row key is the one according to which the rows are ordered in
HBase and a row is easily accessible using the row key value. As it is the fastest identifier
for a row, we need to take special care to design the row key as a composite key
comprising the attributes that are most often used as search criteria. This map to the
principle of Index design in relational databases.
They do not support referential integrity, thus when data is reused or duplicated, the integrity of
duplicated data needs to be maintained by the application
First get to know the type of queries you want to answer.
We do not have sophisticated query processing and optimization capabilities as we have in
relational database systems.
Let us understand the above principles through some case studies:
Case 1: Users and their friends (1-M relation)
In a scenario wherein we have to model a schema for users and their friends, a typical RDBMS schema
will look like:
Sample Query: Get the friendship information for a particular user
Decisions:
Number of Tables: 1
Row key: user id, as the query being asked is for a particular user.
Number of column families: 2
Column Family 1: Info, columns like: name, sex, age, number of versions 1
Column Family 2: Friend, each column representing a friend and value for it is the type
of the friend, number of versions=1
ROW Column Families
Info: Friend:
<user_id> Info: name Friend:<user_id>=type
Info: sex
Info: age
Creating the table in HBase:
create 'user',{NAME => 'info', VERSIONS => 1},{NAME => 'useremail', VERSIONS => 3}, {NAME =>
'friend', VERSIONS => 1}
Creating the table in Hypertable:
CREATE TABLE user(info MAX_VERSIONS=1,useremail MAX_VERSIONS=3,friend
MAX_VERSIONS=1);
Case 2: Parent Child Relation (Self-Join) Storing Locations
In a scenario wherein we have to store location information like the name of the country, state, city etc.
and also the relation information has to be maintained like country name for the state, state name for a
city etc. the general RDBMS schema will look like:
Sample Query: get me the parent for a city named X, or get me all the children for the city Y
Decisions:
Number of Tables: 1
Row key: Location ID, as the query being asked is always about any particular location.
Number of Column Families: 3
Column Family 1: name, versions=1
Column Family 2: parent, with each location id being a column and its value
representing the type i.e. nation, or state. Number of versions: 1 as there can only be
one parent allowed.
Column Family 3: child, with each location id being a column and its value representing
the type i.e. state or city. Number of versions=1
Row Column families
Name: Parent: Child:
<loc_id> Parent:<loc_id> Child:<loc_id>
1 India Child:2=state
Child:3=state
2 Punjab Parent:1=country Child:5=city
3 Haryana Parent:1=country Child:4=city
4 Pkl Parent:3=state
5 Chd Parent:2=state
Creating the table in HBase:
create 'location',{NAME => 'name', VERSIONS => 1},{NAME => 'parent', VERSIONS => 1}, {NAME
=> 'child', VERSIONS => 1}
Creating the table in Hypertable:
CREATE TABLE location(name MAX_VERSIONS=1,parent MAX_VERSIONS=1,child
MAX_VERSIONS=1);
Case 3: Student-Course Information (M-M relation)
In a scenario wherein we have to prepare a schema for storing information for the various students and
various courses for which one user can opt for, the general RDBMS schema will be:
Sample Query: get all the courses for which a particular student has opted for or get all the students
who have opted for a particular course.
Decisions:
Number of Tables: 2
Table 1: Students
Row key: student ID, as the query being asked in relation to students is always about
any particular student.
Number of Column Families: 2
Column Family 1: Info, with columns name, sex, age etc. Number of versions = 1
Column Family 2: Course, with each course id as column and its value being the type of
course for that student, i.e. elective/ optional, versions = 1
Table: students
Row Column families
Info Course
<student_id> Info:name Course:<course_id>=type
Info:sex
Info:age
Table 2: Courses
Row key: course_id, as the query being asked in relation to courses is always about any
particular course.
Number of Column Families: 2
Column Family 1: Info, with columns title, introduction, teacher_id. Number of versions
=1
Column Family 2: Student, with each student id as column and its value being the type
of student for that course, i.e. regular/ correspondence/ distant. Number of versions = 1
Table: Courses
Row Column families
Info Student
<course_id> Info: Title student:<student_id>=type
Info: Introduction
Info: teacher_id
Creating the table in HBase:
create 'students',{NAME => 'info', VERSIONS => 1},{NAME => 'course', VERSIONS => 1}
create 'courses',{NAME => 'info', VERSIONS => 1},{NAME => 'student', VERSIONS => 1 }
Creating the table in Hypertable:
CREATE TABLE students(info MAX_VERSIONS=1,course MAX_VERSIONS=1);
CREATE TABLE courses(info MAX_VERSIONS=1,student MAX_VERSIONS=1);
Case 4: User Action (1-M Relation; High Volume Data)
In a scenario wherein we have a huge number of users and each user can perform actions now and then,
we need to store every event (performing of a particular action) and also query for latest events/ actions
by a particular user. In a typical RDBMS the schema will have the form:
Sample query: get the latest 10 actions performed by a particular user or get the details of actions
performed by user in a bunch of 10 actions at a time (similar to paging of result-set)
Decisions:
Number of Tables: 1
Table 1: UserActions
Row key: <userid><time_stamp><actionid> (Intelligent Key)
Number of column families: 1
Column Family 1: Name, with only one column that stores the name of the action
Row Column families
Name:
<userid><time_stamp><actionid>
Creating the table in HBase:
create 'useraction',{NAME=>'name', VERSIONS => 1}
Creating the table in Hypertable:
CREATE TABLE useraction(name MAX_VERSIONS=1);
Case 5: Access Log (analyzed every 5 minutes, hour, monthly,
yealy)
In a scenario where we have to store the access log information like time, Ip, domain, url, referrer,
browser_cookie, login_id, etc. we generally store it as one table in RDBMS.
Sample Query: Analysis every 5 mins, per hour, day, month or year basis
Decisions:
Number of Tables: 1
Table 1: AccessLog
Row key: <time>
Number of column families: 2
Column Family 1: http, having columns like ip, domain, url, referrer, versions = 1
Column Family 2: user, having columns like browser_cookie, login_id
Row Column families
http: User
Time http: ip User: browser_cookie
http: domain User:login_id
http: url
http: referrer
Creating the table in HBase:
create 'accesslog',{NAME => 'http', VERSIONS => 1},{NAME => 'user', VERSIONS => 1}
Creating the table in Hypertable:
CREATE TABLE accesslog (http MAX_VERSIONS=1,user MAX_VERSIONS = 1);
Case 6: Stock Analysis
Input dataset: Symbol, Date, Open, High, Low, Close
Table name: Stock
Table Key: combination of the date and the stock symbol as most of the queries will be on stock symbols
for particular date/s
Column Families: 2
1) Column Family Name: FixedPrice
a. Columns: open price
b. Versions: 1, as in a day the Open price for a stock will be constant
2) Column Family Name: Variableprices
a. Columns: High Price, Low Price, Close Price.
b. Versions: x, where x can have any value as per the application requirement and business
logic.
Creating the table in HBase:
create 'stock',{NAME => 'fixedprice', VERSIONS => 1},{NAME => 'variableprices', VERSIONS =>
10000}
Creating the table in Hypertable:
CREATE TABLE stock(fixedprice MAX_VERSIONS=1,variableprices MAX_VERSIONS = 10000);
NOTE: here we have taken 10000 versions for the columns under the variableprices column family
Case 7: Hospital-Doctor Scenario
Consider the case for the data represented as a relational schema below for hospitals and the doctors
working there.
To model the same schema as non-relational we have decided:
Number of tables: 3
Table 1: Doctor
Row key: doctor_id
Number of Column Families: 2
Column Family 1: Info, with columns like name,age,sex. Number of versions=1
Column Family 2: Department, with each department id as column and its value equivalent to
the hospital id where this department is. As a doctor can work in same department for various
hospitals the number of versions for this column family can be as per the business rule.
Row Column families
info: Department:
<Doctor_id> Info:name Department:<dept_id>=<hospital_id>
Info:age
Info:sex
Info:bornin
1 Info:name = peter Department:1=1
Info:age = 32 Department:1=4
Info:sex=M Department:2=1
Info:bornin= NY Department:3=1
Table 2: Hospital
Row key: hospital_id
Number of Column Families: 2
Column Family 1: Info, with columns like name. Number of versions=1
Column Family 2: City, having value equivalent to the city name. Number of versions can be as
per the business rule of as to how many cities a hospital can be in.
Row Column families
info: City:
<hospital_id> Info:name
1 Info:name = Cosmo City=Delhi
City=Mohali
Table 3: Department
Row key: department_id
Number of Column Families: 2
Column Family 1: Info, with columns like name having values equivalent to the name of the
department. Number of versions=1
Column Family 2: Hospital, with each hospital id as column and its value equivalent to the
hospital name. Number of versions = 1
Row Column families
info: Hospital:
<department_id> Info:name Hospital:<hospital_id>=<name>
1 Info:name = Cardiology Hospital:1=Cosmo
Hospital:2=Fortis
Creating the table in HBase:
create 'doctor',{NAME => 'info',VERSIONS => 1}, { NAME => 'department', VERSIONS => 1000}
create 'hospital',{NAME => 'info', VERSIONS => 1} , { NAME => 'city', VERSIONS => 1000}
create 'department',{NAME => 'info', VERSIONS => 1}, {NAME => 'hospital', VERSIONS => 1}
Creating the table in Hypertable:
CREATE TABLE doctor(info MAX_VERSIONS=1, department MAX_VERSIONS = 1000);
CREATE TABLE hospital(info MAX_VERSIONS=1,city MAX_VERSIONS = 1000);
CREATE TABLE department(info MAX_VERSIONS=1,hospital MAX_VERSIONS = 1);
References
Cloud Data Structure Diagramming Techniques and Design Patterns
By: David Salmen, Tatiana Malyuta, Rhonda Fetters and Norbert Antunes
Large Scale Data with HADOOP
By: Galen Riley and Josh Patterson
HBase Schema Design Case Studies
By: Evan/ Qingyan Liu
Shared by: zhang xiaobin
About
Good!!!NICE!!!
The best document database!
Related docs
Other docs by fjzhangweiqun
Get documents about "