Docstoc

Guidelines for schema design for HBase - Paxcel Technologies

Document Sample
Guidelines for schema design for HBase - Paxcel Technologies Powered By Docstoc
					Schema Guidelines & Case
Studies
For non-relational databases: HBase & Hypertable with
create statements
7/15/2010
PAXCEL Technologies Pvt. Ltd.
Pushpinder Singh
Table of Contents

Guidelines for schema design for HBase ...................................................................................................... 3

Case 1: Users and their friends (1-M relation).............................................................................................. 4

Case 2: Parent Child Relation (Self-Join) Storing Locations .......................................................................... 6

Case 3: Student-Course Information (M-M relation).................................................................................... 8

Case 4: User Action (1-M Relation; High Volume Data).............................................................................. 11

Case 5: Access Log (analyzed every 5 minutes, hour, monthly, yealy) ....................................................... 12

Case 6: Stock Analysis ................................................................................................................................. 14

Case 7: Hospital-Doctor Scenario................................................................................................................ 16
Guidelines for schema design for HBase
      Remember the DDI methodology:

           o   Denormalization: In simple terms it means we store data about a business entity and
               data about the entities related to it in the same table. This aims to achieve the goal of
               retrieving the data via fewer searches on the data store.

           o   Duplication: serves the same purpose as Denormalization, and

           o   Intelligent Keys: The Row key is the one according to which the rows are ordered in
               HBase and a row is easily accessible using the row key value. As it is the fastest identifier
               for a row, we need to take special care to design the row key as a composite key
               comprising the attributes that are most often used as search criteria. This map to the
               principle of Index design in relational databases.

      They do not support referential integrity, thus when data is reused or duplicated, the integrity of
       duplicated data needs to be maintained by the application

      First get to know the type of queries you want to answer.

      We do not have sophisticated query processing and optimization capabilities as we have in
       relational database systems.

Let us understand the above principles through some case studies:
Case 1: Users and their friends (1-M relation)
In a scenario wherein we have to model a schema for users and their friends, a typical RDBMS schema
will look like:




Sample Query: Get the friendship information for a particular user

Decisions:

       Number of Tables: 1

               Row key: user id, as the query being asked is for a particular user.

               Number of column families: 2

               Column Family 1: Info, columns like: name, sex, age, number of versions 1

               Column Family 2: Friend, each column representing a friend and value for it is the type
               of the friend, number of versions=1
              ROW                                          Column Families

                                    Info:                           Friend:

<user_id>                           Info: name                      Friend:<user_id>=type

                                    Info: sex

                                    Info: age



Creating the table in HBase:

        create 'user',{NAME => 'info', VERSIONS => 1},{NAME => 'useremail', VERSIONS => 3}, {NAME =>
        'friend', VERSIONS => 1}

Creating the table in Hypertable:

        CREATE TABLE user(info MAX_VERSIONS=1,useremail MAX_VERSIONS=3,friend
        MAX_VERSIONS=1);
Case 2: Parent Child Relation (Self-Join) Storing Locations
In a scenario wherein we have to store location information like the name of the country, state, city etc.
and also the relation information has to be maintained like country name for the state, state name for a
city etc. the general RDBMS schema will look like:




Sample Query: get me the parent for a city named X, or get me all the children for the city Y

Decisions:

        Number of Tables: 1

                Row key: Location ID, as the query being asked is always about any particular location.

                Number of Column Families: 3

                Column Family 1: name, versions=1

                Column Family 2: parent, with each location id being a column and its value
                representing the type i.e. nation, or state. Number of versions: 1 as there can only be
                one parent allowed.

                Column Family 3: child, with each location id being a column and its value representing
                the type i.e. state or city. Number of versions=1
                   Row                                   Column families

                                Name:                Parent:                  Child:

        <loc_id>                                     Parent:<loc_id>          Child:<loc_id>

        1                       India                                         Child:2=state

                                                                              Child:3=state

        2                       Punjab               Parent:1=country         Child:5=city

        3                       Haryana              Parent:1=country         Child:4=city

        4                       Pkl                  Parent:3=state

        5                       Chd                  Parent:2=state



Creating the table in HBase:

        create 'location',{NAME => 'name', VERSIONS => 1},{NAME => 'parent', VERSIONS => 1}, {NAME
        => 'child', VERSIONS => 1}

Creating the table in Hypertable:

        CREATE TABLE location(name MAX_VERSIONS=1,parent MAX_VERSIONS=1,child
        MAX_VERSIONS=1);
Case 3: Student-Course Information (M-M relation)
In a scenario wherein we have to prepare a schema for storing information for the various students and
various courses for which one user can opt for, the general RDBMS schema will be:




Sample Query: get all the courses for which a particular student has opted for or get all the students
who have opted for a particular course.

Decisions:

       Number of Tables: 2

       Table 1: Students

               Row key: student ID, as the query being asked in relation to students is always about
               any particular student.

               Number of Column Families: 2

               Column Family 1: Info, with columns name, sex, age etc. Number of versions = 1

               Column Family 2: Course, with each course id as column and its value being the type of
               course for that student, i.e. elective/ optional, versions = 1
                                          Table: students

              Row                                           Column families

                               Info                                Course

<student_id>                   Info:name                           Course:<course_id>=type

                               Info:sex

                               Info:age




Table 2: Courses

       Row key: course_id, as the query being asked in relation to courses is always about any
       particular course.

       Number of Column Families: 2

       Column Family 1: Info, with columns title, introduction, teacher_id. Number of versions
       =1

       Column Family 2: Student, with each student id as column and its value being the type
       of student for that course, i.e. regular/ correspondence/ distant. Number of versions = 1



                                          Table: Courses

              Row                                           Column families

                               Info                                Student

<course_id>                    Info: Title                         student:<student_id>=type

                               Info: Introduction

                               Info: teacher_id
Creating the table in HBase:

        create 'students',{NAME => 'info', VERSIONS => 1},{NAME => 'course', VERSIONS => 1}

        create 'courses',{NAME => 'info', VERSIONS => 1},{NAME => 'student', VERSIONS => 1 }

Creating the table in Hypertable:

        CREATE TABLE students(info MAX_VERSIONS=1,course MAX_VERSIONS=1);

        CREATE TABLE courses(info MAX_VERSIONS=1,student MAX_VERSIONS=1);
Case 4: User Action (1-M Relation; High Volume Data)
In a scenario wherein we have a huge number of users and each user can perform actions now and then,
we need to store every event (performing of a particular action) and also query for latest events/ actions
by a particular user. In a typical RDBMS the schema will have the form:




Sample query: get the latest 10 actions performed by a particular user or get the details of actions
performed by user in a bunch of 10 actions at a time (similar to paging of result-set)

Decisions:

        Number of Tables: 1

        Table 1: UserActions

        Row key: <userid><time_stamp><actionid> (Intelligent Key)

        Number of column families: 1

        Column Family 1: Name, with only one column that stores the name of the action

                       Row                                             Column families

                                                      Name:

<userid><time_stamp><actionid>



Creating the table in HBase:

        create 'useraction',{NAME=>'name', VERSIONS => 1}

Creating the table in Hypertable:

        CREATE TABLE useraction(name MAX_VERSIONS=1);
Case 5: Access Log (analyzed every 5 minutes, hour, monthly,
yealy)
In a scenario where we have to store the access log information like time, Ip, domain, url, referrer,
browser_cookie, login_id, etc. we generally store it as one table in RDBMS.




Sample Query: Analysis every 5 mins, per hour, day, month or year basis

Decisions:

        Number of Tables: 1

        Table 1: AccessLog

        Row key: <time>

        Number of column families: 2

        Column Family 1: http, having columns like ip, domain, url, referrer, versions = 1

        Column Family 2: user, having columns like browser_cookie, login_id

              Row                                              Column families

                                    http:                                User

Time                                http: ip                             User: browser_cookie

                                    http: domain                         User:login_id

                                    http: url

                                    http: referrer
Creating the table in HBase:

        create 'accesslog',{NAME => 'http', VERSIONS => 1},{NAME => 'user', VERSIONS => 1}

Creating the table in Hypertable:

        CREATE TABLE accesslog (http MAX_VERSIONS=1,user MAX_VERSIONS = 1);
Case 6: Stock Analysis
Input dataset: Symbol, Date, Open, High, Low, Close
Table name: Stock

Table Key: combination of the date and the stock symbol as most of the queries will be on stock symbols
for particular date/s

Column Families: 2

    1) Column Family Name: FixedPrice
           a. Columns: open price
           b. Versions: 1, as in a day the Open price for a stock will be constant

    2) Column Family Name: Variableprices
           a. Columns: High Price, Low Price, Close Price.
           b. Versions: x, where x can have any value as per the application requirement and business
              logic.

Creating the table in HBase:

        create 'stock',{NAME => 'fixedprice', VERSIONS => 1},{NAME => 'variableprices', VERSIONS =>

        10000}

Creating the table in Hypertable:

        CREATE TABLE stock(fixedprice MAX_VERSIONS=1,variableprices MAX_VERSIONS = 10000);

NOTE: here we have taken 10000 versions for the columns under the variableprices column family
Case 7: Hospital-Doctor Scenario
Consider the case for the data represented as a relational schema below for hospitals and the doctors
working there.




To model the same schema as non-relational we have decided:

Number of tables: 3

Table 1: Doctor

        Row key: doctor_id

        Number of Column Families: 2

        Column Family 1: Info, with columns like name,age,sex. Number of versions=1

        Column Family 2: Department, with each department id as column and its value equivalent to
        the hospital id where this department is. As a doctor can work in same department for various
        hospitals the number of versions for this column family can be as per the business rule.
              Row                                           Column families

                                 info:                            Department:

<Doctor_id>                      Info:name                        Department:<dept_id>=<hospital_id>

                                 Info:age

                                 Info:sex

                                 Info:bornin

1                                Info:name = peter                Department:1=1

                                 Info:age = 32                    Department:1=4

                                 Info:sex=M                       Department:2=1

                                 Info:bornin= NY                  Department:3=1



Table 2: Hospital

        Row key: hospital_id

        Number of Column Families: 2

        Column Family 1: Info, with columns like name. Number of versions=1

        Column Family 2: City, having value equivalent to the city name. Number of versions can be as
        per the business rule of as to how many cities a hospital can be in.
                Row                                           Column families

                                   info:                               City:

<hospital_id>                      Info:name

1                                  Info:name = Cosmo                   City=Delhi

                                                                       City=Mohali



Table 3: Department

        Row key: department_id

        Number of Column Families: 2

        Column Family 1: Info, with columns like name having values equivalent to the name of the
        department. Number of versions=1

        Column Family 2: Hospital, with each hospital id as column and its value equivalent to the
        hospital name. Number of versions = 1



                Row                                           Column families

                                   info:                               Hospital:

<department_id>                    Info:name                           Hospital:<hospital_id>=<name>

1                                  Info:name = Cardiology              Hospital:1=Cosmo

                                                                       Hospital:2=Fortis




Creating the table in HBase:

        create 'doctor',{NAME => 'info',VERSIONS => 1}, { NAME => 'department', VERSIONS => 1000}

        create 'hospital',{NAME => 'info', VERSIONS => 1} , { NAME => 'city', VERSIONS => 1000}

        create 'department',{NAME => 'info', VERSIONS => 1}, {NAME => 'hospital', VERSIONS => 1}
Creating the table in Hypertable:

        CREATE TABLE doctor(info MAX_VERSIONS=1, department MAX_VERSIONS = 1000);

        CREATE TABLE hospital(info MAX_VERSIONS=1,city MAX_VERSIONS = 1000);

        CREATE TABLE department(info MAX_VERSIONS=1,hospital MAX_VERSIONS = 1);
References

Cloud Data Structure Diagramming Techniques and Design Patterns
By: David Salmen, Tatiana Malyuta, Rhonda Fetters and Norbert Antunes


Large Scale Data with HADOOP
By: Galen Riley and Josh Patterson

HBase Schema Design Case Studies
By: Evan/ Qingyan Liu

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:1
posted:11/12/2012
language:Unknown
pages:20
About Good!!!NICE!!! The best document database!