					NOSQL – Advanced Lecture, WS 10/11
Information Systems Group                            Prof. Dr. Jens Dittrich
Saarland University                     TAs: Alekh Jindal and Jorge Quiané
                        Exercise 6: Bloom Filters, BerkeleyDB, HiveQL




1     PART 1: BigTable/HBase Bloom Filter (10+10+10 points)
As explained in the lecture, in BigTable/HBase the creation of Bloom filters may be piggybacked on the
merge of the indexes residing on disk.

    1. Describe how this process may be exploited to define a Bloom filter with optimal values for the
       parameters n, m, and k.
    2. Derive a formula (show the derivation steps) to find the optimal value of m and k guaranteeing a
       "false positive" probability below 0.1%.
    3. What are the trade-offs of the k and m parameters in a working system?
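
As a numerical cross-check for questions 1 and 2, the following is a minimal sketch and not a model solution. It assumes the usual textbook reading of the parameters (n = number of keys inserted, m = number of bits in the filter, k = number of hash functions) and the standard approximation p ≈ (1 - e^(-kn/m))^k, which itself assumes independent, uniformly distributed hash functions. The function names are hypothetical and are not taken from the lecture or from HBase.

```python
import math


def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate false-positive probability of a Bloom filter.

    Uses p ~= (1 - exp(-k*n/m))**k, which assumes k independent,
    uniformly distributed hash functions (an idealization).
    """
    return (1.0 - math.exp(-k * n / m)) ** k


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k) so that the approximate false-positive rate is at most p.

    n: number of keys (assumed known, e.g. counted while merging)
    p: target false-positive probability, 0 < p < 1
    """
    if n <= 0 or not (0.0 < p < 1.0):
        raise ValueError("need n > 0 and 0 < p < 1")
    # Continuous optimum: m = -n ln p / (ln 2)^2 and k = (m/n) ln 2.
    m_continuous = -n * math.log(p) / (math.log(2) ** 2)
    k = max(1, round(m_continuous / n * math.log(2)))
    # k must be an integer, so re-solve the approximation for m at that k
    # and round up; this keeps the approximate rate at or below p.
    m = math.ceil(-k * n / math.log(1.0 - p ** (1.0 / k)))
    return m, k


if __name__ == "__main__":
    n = 1_000_000  # hypothetical key count
    m, k = bloom_parameters(n, 0.001)
    print(f"m = {m} bits ({m / n:.2f} bits per key), k = {k}")
    print(f"approximate false-positive rate: {false_positive_rate(n, m, k):.6f}")
```

Under these assumptions, a 0.1% target needs roughly 14.4 bits per key and k = 10 hash functions. The sketch takes n as an input; the point of question 1 is where a running system could obtain that value.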


2     PART 2: BerkeleyDB Concepts (10+10+10 points)
Answer the following questions:

    1. Briefly describe the Btree, Hash, Queue, and Recno access methods in BerkeleyDB.
    2. Which data management services are provided by BerkeleyDB?
    3. What are the major differences between BerkeleyDB and Relational Databases?


3     PART 3: HiveQL Hands-on (40 points)
    1. Write down the equivalent HiveQL queries for the slightly modified TPC-H Query 1 and TPC-H
       Query 3 from Exercise 1.
    2. Install Hive 0.6.0 from http://hive.apache.org/. Load the TPC-H customer, orders, and lineitem
       tables into HDFS. Run your HiveQL queries from above, then report the runtimes and compare them
       with those from Exercise 1. Also, hand in the EXPLAIN EXTENDED output for your queries.



