
Description

Bigtable is a non-relational database: a sparse, distributed, persistent, multi-dimensional sorted map. It is designed to reliably handle petabyte-scale data and can be deployed across thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. More than 60 Google products and projects use it, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products place very different demands on Bigtable: some need high-throughput batch processing, while others must return data to end users with low latency. Their cluster configurations also vary widely, from a handful of servers to thousands of servers storing hundreds of terabytes of data.

Shared by: Elijah Jimmy
Posted: 12/22/2011
NOSQL – Advanced Lecture, WS 10/11

Information Systems Group Prof. Dr. Jens Dittrich

Saarland University TAs: Alekh Jindal and Jorge Quiané

Exercise 6: Bloom Filters, BerkeleyDB, HiveQL









1 PART 1: BigTable/HBase Bloom Filter (10+10+10 points)

As explained in the lecture, in BigTable/HBase Bloom filter creation may be piggy-backed on the merge of the indexes residing on disk.
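
The idea of piggy-backing can be illustrated with a minimal Python sketch. It is not HBase's implementation: the names `BloomFilter`, `merge_with_filter`, and `bits_per_key` are hypothetical, in-memory sorted lists stand in for the on-disk indexes, and the double-hashing scheme over a BLAKE2b digest is simply one convenient choice. The point it shows is that the merge already reads every key once, so the total input size gives an upper bound on n before the filter is allocated, and the filter can be filled in the same pass.

```python
import hashlib
import heapq


class BloomFilter:
    """Plain bit-array Bloom filter with m bits and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force odd so it is never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def merge_with_filter(runs, bits_per_key, k):
    """Merge sorted runs of string keys and build the filter in the same pass."""
    # The summed run lengths bound the number of distinct keys from above,
    # so the filter can be sized before the merge starts.
    n_upper = sum(len(run) for run in runs)
    bloom = BloomFilter(max(1, n_upper * bits_per_key), k)
    merged = []
    previous = None
    for key in heapq.merge(*runs):
        if key == previous:  # drop duplicates that appear in several runs
            continue
        merged.append(key)
        bloom.add(key)
        previous = key
    return merged, bloom
```

For example, `merge_with_filter([["a", "c", "e"], ["b", "c", "d"]], bits_per_key=15, k=10)` returns the merged key list together with a filter that answers `True` for every merged key.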



1. Describe how this process may be exploited to define a Bloom filter with the optimal parameters n, m, and k.

2. Derive a formula (show the derivation steps) for the optimal values of m and k that guarantee a "false positive" probability below 0.1%.

3. What are the trade-offs of the k and m parameters in a working system?
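
As a numerical companion to these questions, the following Python sketch evaluates the textbook closed forms for a Bloom filter, assuming n is the number of inserted keys, m the number of bits, k the number of hash functions, and the usual approximation (1 - e^(-kn/m))^k for the false-positive probability. The function names are hypothetical, and the sketch is not a substitute for the derivation the exercise asks for.

```python
import math


def optimal_bloom_parameters(n, p):
    """Return (m, k) for n keys and a target false-positive probability p."""
    # Closed forms: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


def false_positive_rate(n, m, k):
    """Approximate false-positive probability (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    n = 1_000_000
    m, k = optimal_bloom_parameters(n, 0.001)
    print(m / n, k, false_positive_rate(n, m, k))
```

For p = 0.1% this gives roughly 14.4 bits per key and k = 10. Because k has to be rounded to an integer, the approximated rate can end up marginally above the target, so a strict guarantee needs m rounded up slightly further. Larger k also means more hash computations and bit probes per lookup, which is one side of the trade-off asked about in question 3.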





2 PART 2: BerkeleyDB Concepts (10+10+10 points)

Answer the following questions:



1. Briefly describe the Btree, Hash, Queue, and Recno access methods in BerkeleyDB.

2. Which data management services are provided by BerkeleyDB?

3. What are the major differences between BerkeleyDB and Relational Databases?





3 PART 3: HiveQL Hands-on (40 points)

1. Write down the equivalent HiveQL queries for the slightly modified TPC-H Query 1 and TPC-H Query 3 from Exercise 1.

2. Install Hive 0.6.0 from http://hive.apache.org/. Load the TPC-H customers, orders, and lineitem tables into HDFS. Run your HiveQL queries from above, then report the runtimes and compare them with those in Exercise 1. Also, hand in the EXPLAIN EXTENDED output for your queries.











