
Description

Bigtable is a non-relational database: a sparse, distributed, persistent, multi-dimensional sorted map. It is designed to reliably handle petabyte-scale data and can be deployed across thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. More than 60 Google products and projects use Bigtable, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products place very different demands on Bigtable: some need high-throughput batch processing, while others need to return data to end users with low latency. Their Bigtable cluster configurations also vary widely: some clusters have only a few servers, while others have thousands of servers storing hundreds of terabytes of data.

Shared by: Elijah Jimmy
Posted: 12/22/2011
NOSQL – Advanced Lecture, WS 10/11

Information Systems Group Prof. Dr. Jens Dittrich

Saarland University TAs: Alekh Jindal and Jorge Quiané

Exercise 3: BigTable, HBase







1 PART 1: BigTable Concepts

1. What is the order of the following components of keys in the BigTable data model:

(a) Column family

(b) Column qualifier

(c) Row key

(d) Timestamp

2. Consider stock market data from different stock exchanges, each having several listed companies in

different market sectors. How will you compose the keys for each of the following query scenarios:

(a) Get the daily change in stock prices of all companies listed in a given stock exchange.

(b) Get the daily change in stock prices in all stock exchanges of a given company.

(c) Get the stock prices of all companies belonging to a given market sector in a given stock exchange.

3. Which of the following statements about BigTable are correct:

(a) Data is maintained in lexicographic order by row key.

(b) The table row ranges are partitioned only once.

(c) Single-row transactions are supported.

(d) Data files are stored using Google File System.

4. Describe the following BigTable terms:

(a) Tablet

(b) Chubby

(c) Memtable

(d) Locality groups

5. Which data structure is used to store tablet location information?
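For question 2, it helps to remember that rows are kept in lexicographic order by row key, so the field placed first in the key decides which rows end up adjacent. The following is a minimal sketch in plain Python, not the BigTable or HBase API: a sorted list stands in for the table, and the exchange names, tickers, and the `/` separator are made-up illustrations. It shows one possible design, not the only valid answer.

```python
import bisect

def make_key(*parts):
    """Join key fields with '/' so the first field dominates the sort order."""
    return "/".join(parts)

def prefix_scan(sorted_keys, prefix):
    """Return all keys that start with prefix, via binary search on a sorted list."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

# Hypothetical listings as (exchange, company) pairs.
listings = [("NYSE", "IBM"), ("FRA", "SAP"), ("NYSE", "GE"), ("FRA", "IBM")]

# Exchange first: all companies of one exchange are contiguous.
by_exchange = sorted(make_key(ex, co) for ex, co in listings)

# Company first: all exchanges of one company are contiguous.
by_company = sorted(make_key(co, ex) for ex, co in listings)

print(prefix_scan(by_exchange, "NYSE/"))
print(prefix_scan(by_company, "IBM/"))
```

The same data supports both lookups, but each key layout turns only one of them into a single contiguous scan, which is why the key composition depends on the query scenario.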





2 PART 2: HBase Hands-on

1. Install HBase version 0.20.6 from http://hbase.apache.org/. Configure HBase to use a root directory

in HDFS. Start HBase and perform the following test using the HBase shell (/bin/hbase shell):

(a) Create a test table "university" with two column families "computersc" and "mathematics".

(b) Insert names of ten people each in "computersc" and "mathematics" column families. Use each

person’s research group name as his/her column qualifier.

(c) Run the following queries:

- Get all persons in "computersc" column family

- Get all persons in "mathematics" column family

- Get all persons in "computersc" column family in a given research group

- Get all persons in "mathematics" column family in a given research group

Hand in the input data, the HBase shell commands, and the output printed by HBase.

2. Load the TPC-H Lineitem, Customer, and Orders datasets into HBase using the Java API. Use the record

offset as Row ID and the attribute names in the schema as column family names. Write and execute the

slightly modified TPC-H query 1 and TPC-H query 3 from Exercise 1. Report the runtimes and compare

them with those in Exercise 1 and 2. Hand in the first 5 lines of the output.












3. Table Region: Recall that HBase sorts the data and creates regions based on the Row ID. We can

exploit this to improve the runtime performance of the above two queries, i.e., set the Row ID such that

it helps in query execution. Copy the TPC-H data loaded before into another HBase table with

appropriate Row IDs. Rerun the above queries and compare the results.

4. Column Family: Recall that HBase stores column families physically close on disk. We can exploit this

to improve the query performance of the above two queries, i.e., put columns which are accessed in the

queries together. Copy the TPC-H data loaded before into another HBase table with appropriate

column families. Rerun the above queries and compare the results.

5. Which distributed database concepts do Table Region and Column Family techniques correspond

to?
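The ideas behind points 3 and 4 can be sketched with a small model in plain Python. This is not the HBase API: sorted lists stand in for regions, nested dicts stand in for column families, and the key format, region size, family names, and values are all assumptions chosen for illustration. The first half counts how many regions a range scan touches when the Row ID starts with a date; the second half compares how much data is read when only the needed column family is accessed.

```python
import bisect

def split_into_regions(sorted_keys, region_size):
    """Cut a sorted key list into contiguous regions of at most region_size keys."""
    return [sorted_keys[i:i + region_size]
            for i in range(0, len(sorted_keys), region_size)]

def regions_touched(regions, start, stop):
    """Count regions that hold at least one key in the range [start, stop)."""
    count = 0
    for region in regions:
        lo = bisect.bisect_left(region, start)
        hi = bisect.bisect_left(region, stop)
        if hi > lo:
            count += 1
    return count

def bytes_read(table, families):
    """Sum the value sizes over the requested column families only."""
    return sum(len(v) for fam in families for v in table[fam].values())

# Hypothetical rows: (record offset, ship date), two rows per month.
rows = [(i, "1998-%02d-01" % (i % 12 + 1)) for i in range(24)]

# Row ID = date followed by offset, so rows of one date range are adjacent.
date_keys = sorted("%s/%04d" % (date, off) for off, date in rows)
regions = split_into_regions(date_keys, 6)
touched = regions_touched(regions, "1998-01", "1998-03")
print(touched, "of", len(regions), "regions touched")

# Hypothetical column families: small numeric columns vs. a wide text column.
table = {
    "pricing": {"row1": "17.5", "row2": "20.0"},
    "comment": {"row1": "x" * 100, "row2": "y" * 100},
}
print(bytes_read(table, ["pricing"]), "vs", bytes_read(table, ["pricing", "comment"]))
```

With the record offset as Row ID, the dates would be scattered over all regions and a date filter would have to inspect every one of them; the date-prefixed key confines the scan to the regions covering the range.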





3 PART 3: HBase with Hadoop MapReduce

1. Note that the queries in Part 2 execute on a single node. However, HBase can be easily combined

with MapReduce to parallelize the query processing.

Write MapReduce jobs to run the slightly modified TPC-H query 1 and TPC-H query 3 from Exercise

1 using HBase data tables as input and output. Consider the optimizations in points 3 and 4 of

Part 2. [Hint: Extend TableMap / TableReduce].



2. Compare the query runtimes with those in Part 2. Hand in the first 5 lines of the output.
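As a rough illustration of the map and reduce steps for a query-1-style aggregation, here is a single-process sketch in plain Python. It does not use Hadoop or the HBase TableMap / TableReduce classes, and it covers only a simplified grouping (sum of quantity per return flag and line status), not the exact modified query from Exercise 1. The dict standing in for the table and its three rows are made up; the column names follow the TPC-H Lineitem schema.

```python
from collections import defaultdict

def map_row(row_key, columns):
    """Emit ((returnflag, linestatus), quantity) for one table row."""
    yield (columns["l_returnflag"], columns["l_linestatus"]), columns["l_quantity"]

def reduce_group(key, values):
    """Sum the quantities that share one grouping key."""
    return key, sum(values)

def run_job(table):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for row_key, columns in table.items():
        for key, value in map_row(row_key, columns):
            groups[key].append(value)
    return dict(reduce_group(k, vs) for k, vs in sorted(groups.items()))

# Hypothetical input table: Row ID -> column values.
lineitem = {
    "0001": {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 17},
    "0002": {"l_returnflag": "N", "l_linestatus": "O", "l_quantity": 36},
    "0003": {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 8},
}

result = run_job(lineitem)
print(result)
```

In a real job, the framework performs the grouping between the two phases and runs many mappers in parallel, typically one per table region, which is where the speedup over the single-node queries of Part 2 would come from.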








