Phase 3 Report

H2 Engine Overview

We chose the H2 DB Engine over other DB engines based on the following selection
criteria:
- Open source and free-to-modify licensing: access to the source code.
- Java based: language familiarity; object oriented for ease of extension.
- Lightweight footprint: about 1 MB.
- Support for disk-based or in-memory databases and tables.
- Support for multi-dimensional indexes.
- Cost-based optimizer: genetic algorithm for complex queries, zero administration.
- Support for multiple connections, with table-level locking.
- Support for linked tables, and a built-in virtual 'range' table.
- Testing and deployment: built-in web-based console, support for embedded use.

To ensure that the current release's implementation meets or exceeds our overall
expectations, we studied the H2 DB Engine's design and implementation of storage,
indexing, record insertion, locking, linked tables, the parser, the optimizer, and
Jaqu. Our findings are as follows:

Storage:
A DiskFile object represents a file saved on disk. Each DiskFile object contains
Storage objects. Storage objects are made up of 8 KB pages and hold Record
objects, which are typically Row objects. A Row object represents a row of data in a
table. A cache of records is kept in memory, but records are ultimately written to disk
through the DiskFile.
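The containment hierarchy described above can be sketched as follows. The class and field names mirror the H2 concepts but are illustrative, not H2's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the DiskFile -> Storage -> Record containment
// hierarchy described above; not H2's actual classes.
class Record {                               // typically a Row of table data
    final Object[] values;
    Record(Object... values) { this.values = values; }
}

class Storage {                              // made up of fixed-size pages
    static final int PAGE_SIZE = 8 * 1024;   // 8 KB pages
    final List<Record> records = new ArrayList<>();
    void addRecord(Record r) { records.add(r); }
}

class DiskFile {                             // represents a file saved on disk
    final List<Storage> storages = new ArrayList<>();
    Storage allocateStorage() {
        Storage s = new Storage();
        storages.add(s);
        return s;
    }
}
```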

Indexing:
H2 supports both hash indexes and B-tree indexes. By default, tables (represented by
the TableData object) have a scanIndex, which simply reflects the order in which
rows were added. Every time a new index is created on a table, the existing rows are
sorted and added to the index. Indexes can return a Cursor to iterate over the rows.
Whenever data is accessed through an index, the index calls the Storage object
for the data, which in turn calls the DiskFile object. When a table has indexes, the
TableFilter checks its index conditions to determine the best index to use.

Record insertions:
Records are inserted through a Storage object. Space is allocated by the
Storage object before the record is added to the DiskFile object. If the table is indexed,
the record is added to the index in the proper place before being sent to the Storage
object.

Locking:
Multiple connections to the database are coordinated through table-level locking.
Before a connection retrieves data from a table, the database adds a shared lock to the
table, provided that no exclusive lock currently exists on it. Multiple connections can
then read the table under the shared lock. When a connection attempts to insert or
update data in the table, it must obtain the exclusive lock. An exclusive lock can be
obtained only if no other connection already holds a lock on that table.
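The shared/exclusive policy above can be sketched in a few lines. This is an illustration of the rule, not H2's actual implementation:

```java
// Minimal sketch of table-level shared/exclusive locking as described
// above: many readers may hold the shared lock, a writer needs the
// exclusive lock, and the exclusive lock excludes everyone else.
class TableLock {
    private int sharedHolders = 0;
    private boolean exclusive = false;

    // A reader gets a shared lock only if no exclusive lock exists.
    synchronized boolean tryLockShared() {
        if (exclusive) return false;
        sharedHolders++;
        return true;
    }

    // A writer gets the exclusive lock only if no other locks are held.
    synchronized boolean tryLockExclusive() {
        if (exclusive || sharedHolders > 0) return false;
        exclusive = true;
        return true;
    }

    synchronized void unlockShared() { sharedHolders--; }
    synchronized void unlockExclusive() { exclusive = false; }
}
```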

Linked Tables:
H2 provides access to tables that exist in another database through linked tables. To
create such a table, H2 uses a CREATE statement that specifies the database driver,
URL, username, password, and the name of the table in the other database. When the
linked table is accessed, H2 rewrites the query with the exact table name and runs it
against the linked database over JDBC. Only simple queries can be issued against the
table, and no joins are allowed on a linked table. Linked tables also have issues with
NULL values on insert: if the linked table has columns whose values are not set, those
values are updated as NULL.
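As a concrete illustration, a linked table can be created with a statement of the following shape (the driver class, URL, credentials, and table names here are placeholders):

```sql
-- Create a table linked to the CUSTOMER table in another H2 database;
-- connection details below are placeholder values.
CREATE LINKED TABLE REMOTE_CUSTOMER(
    'org.h2.Driver',
    'jdbc:h2:~/otherdb',
    'sa', '',
    'CUSTOMER');

-- Simple queries against it are then forwarded over JDBC:
SELECT * FROM REMOTE_CUSTOMER WHERE ID = 1;
```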

Parser:
The parser is implemented in the H2 database as part of the org.h2.command package.
The Parser class contains most of the parser code and is the most important part of
the parser. It takes an SQL statement and converts it into a command object.
The Parser.readIf method recognizes a token and reports it back to the prepare
method. The prepared command object is an instance of the class Prepared; this
object is the command through which data is sent to the database engine. The
org.h2.command.dml and org.h2.command.ddl packages support this execution by
managing the DML and DDL queries, respectively.
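The token-recognition step can be sketched with a toy parser whose readIf method consumes an expected keyword, in the spirit of Parser.readIf. Everything here is a simplified illustration, not H2's actual parser:

```java
import java.util.Arrays;
import java.util.List;

// Toy sketch of keyword-driven parsing: readIf consumes a token when it
// matches, and prepare dispatches to a "Prepared"-style command name.
class TinyParser {
    private final List<String> tokens;
    private int pos = 0;

    TinyParser(String sql) {
        this.tokens = Arrays.asList(sql.trim().split("\\s+"));
    }

    // Consume the current token if it matches; otherwise leave it alone.
    boolean readIf(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equalsIgnoreCase(expected)) {
            pos++;
            return true;
        }
        return false;
    }

    // Dispatch on the leading keyword to a command-object name.
    String prepare() {
        if (readIf("SELECT")) return "Select";
        if (readIf("INSERT")) return "Insert";
        if (readIf("CREATE")) return "CreateTable";
        return "Unknown";
    }
}
```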

Optimizer:
H2 uses a cost-based optimizer. Candidate query plans are evaluated with a cost
function, and the plan with the lowest cost is chosen. For simple queries, with a
small number of tables in the JOIN operation, all possible plans are analyzed: the
cheapest plan is selected in the compilation phase and run in the execution phase. For
complex queries, the algorithm analyzes all possible combinations of the first few
tables, and the remaining tables are added using a greedy algorithm. A genetic
algorithm is used to test at most 2000 plans.
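The overall shape of this search can be sketched as follows: exhaustive enumeration of join orders when the table count is small, and a capped number of randomly generated orders otherwise. The cutoff value and cost model below are illustrative assumptions, not H2's actual optimizer:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of cost-based plan selection: try every join order for small
// queries, and at most 2000 sampled orders for large ones, keeping the
// cheapest plan found. Illustrative only.
class PlanChooser {
    static final int EXHAUSTIVE_LIMIT = 4;     // assumed cutoff
    static final int MAX_SAMPLED_PLANS = 2000; // plan cap, as described above

    interface CostModel { double cost(List<String> joinOrder); }

    static List<String> bestPlan(List<String> tables, CostModel model) {
        List<List<String>> candidates = new ArrayList<>();
        if (tables.size() <= EXHAUSTIVE_LIMIT) {
            permute(new ArrayList<>(tables), 0, candidates);
        } else {
            Random rnd = new Random(42);
            for (int i = 0; i < MAX_SAMPLED_PLANS; i++) {
                List<String> order = new ArrayList<>(tables);
                Collections.shuffle(order, rnd);
                candidates.add(order);
            }
        }
        List<String> best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (List<String> order : candidates) {
            double c = model.cost(order);
            if (c < bestCost) { bestCost = c; best = order; }
        }
        return best;
    }

    private static void permute(List<String> a, int k, List<List<String>> out) {
        if (k == a.size()) { out.add(new ArrayList<>(a)); return; }
        for (int i = k; i < a.size(); i++) {
            Collections.swap(a, k, i);
            permute(a, k + 1, out);
            Collections.swap(a, k, i);
        }
    }
}
```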

Jaqu:
Jaqu is short for Java Query. Jaqu allows the H2 database to be accessed from
Java without writing SQL statements: it uses Java methods, list iterators, and other
predefined classes to access the data in the database.
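The flavor of this SQL-free access can be conveyed with a tiny fluent query over a plain Java collection. This mimics the Jaqu style but is deliberately self-contained; it is not Jaqu's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Self-contained sketch of fluent, SQL-free querying in the Jaqu
// spirit: conditions are Java method calls instead of SQL strings.
class Query<T> {
    private final List<T> rows;
    Query(List<T> rows) { this.rows = rows; }

    // Keep rows whose column value is strictly greater than the bound.
    <V extends Comparable<V>> Query<T> whereGreater(Function<T, V> col, V value) {
        List<T> out = new ArrayList<>();
        for (T row : rows) {
            if (col.apply(row).compareTo(value) > 0) out.add(row);
        }
        return new Query<>(out);
    }

    List<T> select() { return rows; }
}
```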
We are aware of the following limitations in the current release of the H2 DB Engine:
- Maximum file size for the data: 256 GB
- Maximum file size for the index: 256 GB
- BLOB and CLOB size limit: 256 GB
- Minimum main memory: 240 MB
- Durability: durability is one of the components of ACID. The H2 database supports
  atomicity, consistency, and isolation, but it does not guarantee durability: committed
  transactions may be lost when there is a power failure.

However, these limitations do not affect our primary goals of studying the design of a
database engine and extending the Engine to support features we would like to propose
in the next part.

Uncertain Index Implementation in the H2 Database Engine

       Uncertain data are data points that do not have a definite value. Rather, they
are associated with a probability distribution over the exact values of the data. Uncertain
data is becoming more prevalent in practice, and databases must be able to handle it
effectively. One component of our database project was to implement an
index for uncertain data in the H2 database engine, based on a research paper by
Agarwal at PODS 2009.

Several questions arose in implementing this type of index:
- What kinds of data types will the index support?
- How do we denote the PDF for each data point?
- How complex an index structure do we want to create?
- How do we extend the SQL syntax to create and use an uncertain index?
- How do we store this index using the H2 engine?

       The data type we used is FLOAT. We restricted the uncertain data to be
continuous numeric values because this makes sense for most situations that use
uncertain data.

       Denoting a PDF is extremely difficult, since each data point must have its own
PDF. Therefore, the table to index must either contain a representation for each PDF or
must be joined with a table containing PDF representations. Simply put, the PDFs must
be provided as data in tables; they cannot be preset by the database engine. The most
manageable type of PDF is a histogram, which contains only horizontal line segments,
as described in Agarwal's research paper. We chose to simplify this even further by
making each data point a (value, uncertainty) tuple. The value represents the median
value in the PDF, and the uncertainty is a plus-or-minus value from the median. For
example, the data point (9.7, 0.5) represents the data point with an equal probability of
appearing at any point between 9.2 and 10.2. The histogram PDF for these tuples is a
single horizontal line segment.
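Under this (value, uncertainty) representation, the probability that a point exceeds a query bound is just the fraction of the uniform interval lying above the bound. A sketch of that computation (illustrative of the math, not the project's actual code):

```java
// Probability that an uncertain point with value v and uncertainty e,
// uniform on [v - e, v + e], lies above a query bound q.
class UniformTuple {
    static double probGreater(double v, double e, double q) {
        double lo = v - e, hi = v + e;
        if (q <= lo) return 1.0;       // whole interval is above the bound
        if (q >= hi) return 0.0;       // whole interval is below the bound
        return (hi - q) / (hi - lo);   // fraction of the interval above q
    }
}
```

For the data point (9.7, 0.5) from the text, the probability of exceeding 9.7 is exactly one half, since 9.7 is the median of the interval [9.2, 10.2].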
       The complexity of the index structure is a critical factor in database performance.
Agarwal describes a few different structures in his paper, but there is consistently a
tradeoff between construction time and lookup time. The most basic structure uses the
half-plane reporting technique. The database converts each PDF into a threshold
function, and the index keeps each threshold function in a list. When a query is
executed, the range is compared to each threshold function, yielding O(n) runtime for n
threshold functions. The index is also only supported for one column of data.
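The O(n) lookup described above amounts to a linear scan that keeps each row whose probability of falling in the query range meets the threshold. A minimal sketch, using the simplified uniform (value, uncertainty) model; names are illustrative, not the project's UncertainIndex:

```java
import java.util.ArrayList;
import java.util.List;

// Linear-scan sketch of the basic uncertain-index lookup: each stored
// {value, uncertainty} pair is tested against the query range [lo, hi]
// and kept if its probability of lying in the range meets the threshold.
class UncertainScan {
    static List<double[]> find(List<double[]> points,  // {v, e} pairs
                               double lo, double hi,   // query range
                               double threshold) {
        List<double[]> hits = new ArrayList<>();
        for (double[] p : points) {
            double a = p[0] - p[1], b = p[0] + p[1];   // uniform support
            double overlap = Math.min(hi, b) - Math.max(lo, a);
            double prob = (b == a)                     // zero uncertainty
                    ? ((p[0] >= lo && p[0] <= hi) ? 1.0 : 0.0)
                    : Math.max(0.0, overlap) / (b - a);
            if (prob >= threshold) hits.add(p);
        }
        return hits;
    }
}
```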

To create an uncertain index:
CREATE UNCERTAIN INDEX <name> ON <table_name> ( <value_col>,
<uncertainty_col> ) WITH THRESHOLD <percentage_value>;

To query a table with an uncertain index:
SELECT <selection> FROM <table_name> WHERE <range> WITH UNCERTAINTY
THRESHOLD <percentage_value>;

The SELECT query only works on tables with an uncertain index. If the table being
queried does not have an uncertain index, the uncertainty qualifications are ignored. If
there is more than one uncertain index on the table, the SELECT uses the index with
the correct threshold value.

       See the parseCreate (~ line 3434) and parseSelectSimple (~ line 1580) methods in
org.h2.command.Parser for the code that parses these statements. See also
org.h2.command.ddl.CreateIndex and org.h2.command.dml.Select for changes related
to uncertain indexes.

       Storing the uncertain index in the H2 engine is the biggest challenge. The
UncertainIndex class is a subclass of BaseIndex. It stores a list of LineSegFunction
objects that represent the threshold functions. It also stores a map of references of
threshold functions to references of table rows. Adding new functions is easy thanks to
the simple data structure. The class also provides a getSQL function, which allows for
the index to be recreated every time the index is reloaded. The org.h2.table.TableData
class had to be updated to add uncertain indexes. Please see the
org.h2.index.uncertain package.

         The UncertainIndex class works within the H2 framework for queries. When a
SELECT query is parsed, if it has an uncertainty threshold, all indexes on the table are
searched to find an uncertain index with the given threshold. Once that index is found,
it is given a one-time lowest cost (see setOneTimeLowestCost and getCost methods in
UncertainIndex) to guarantee that it is picked. The find function then picks all
appropriate rows and returns an UncertainCursor object to iterate over the rows. Note
that the SELECT query for uncertain data only supports the range conditions on the
uncertain data. To apply other conditions, nested queries are required.

Here are some example queries for how to use uncertain indexes:
create table andy(v float, e float);
insert into andy values(9.5, 0.5);
insert into andy values(9.5, 0.7);
insert into andy values(9.8, 0.5);
insert into andy values(9.7, 0.5);
insert into andy values(10.1, 0.3);
insert into andy values(9.2, 0.5);
insert into andy values(9.7, 0.6);
insert into andy values(10.4, 1);
create uncertain index unix2 on andy(v, e) with threshold 0.2;
create uncertain index unix75 on andy(v, e) with threshold 0.75;
select * from andy;
select * from andy where v > 9.7 with uncertainty threshold 0.2;
select * from andy where v > 9.7 with uncertainty threshold 0.75;
select * from andy where v > 9.7 and v < 10 with uncertainty threshold 0.2;
select * from andy where v > 9.7 and v < 10 with uncertainty threshold 0.75;

Natural Language Parsing:

The technique is based on the Eliza model, which maps a particular set of
words to a set of predefined tokens in the destination language. Although very
primitive and simple in form, this technique is still widely used in natural language
parsing applications, mainly because of its simplicity and ease of implementation.
Given the limited resources we had for this project, we considered it a good choice:
it is very simple to implement and supports a good prototype of the model.


To build a natural language query engine, we need two different components:
the parsing engine and the authoring tool. The authoring tool matches the given
queries and isolates the individual tokens that belong to the non-SQL syntax. The
tricky part is deciding whether a mismatch is an error in the user's query or a gap in
the engine's vocabulary.

The authoring tool thus has to use statistical analysis to decide whether something is
an error. Since this is a very sophisticated part of the project, we decided to leave it
out of our work.

The selection operator works on the tables listed in the information schema. The
information schema is a system catalog present in the H2 database that holds
information about all the tables in the database. Since all selection occurs from the
existing tables, we use this as the destination language set. Hence we can define the
selection function

F: x -> information_schema

where x is the set of tokens that refer to the selection operation. This x can be a set
of predefined tokens, including phrases such as "are present". All these words suggest
that the current token refers to a table in the database. We then map this token against
the information_schema and rewrite the query accordingly.

For the projection operator we use

F: x -> columns of the above table.

Here, we map the columns of the table that was already identified in the previous step.
Since the table is known, we also know which columns we are supposed to work with.
If we find keywords such as "all", we use * for the projection operation.
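The selection and projection mappings above can be sketched as a small keyword-to-SQL translator. The trigger-token lists and the table/column inputs below are illustrative assumptions, not the project's actual implementation:

```java
import java.util.Arrays;
import java.util.List;

// Toy Eliza-style translator: selection tokens mark the sentence as a
// table query, and the keyword "all" triggers a * projection.
class NlToSql {
    // Tokens that suggest the sentence refers to a table's contents.
    static final List<String> SELECTION_TOKENS =
            Arrays.asList("are present", "list", "show");

    // table and columns stand in for an information_schema lookup.
    static String translate(String sentence, String table, List<String> columns) {
        String s = sentence.toLowerCase();
        boolean isSelection = SELECTION_TOKENS.stream().anyMatch(s::contains);
        if (!isSelection) return null;  // sentence not recognized
        String projection = s.contains("all") ? "*" : String.join(", ", columns);
        return "SELECT " + projection + " FROM " + table;
    }
}
```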

Nested SQL queries:
This is the trickier part. Here, at least two tables appear in the given sentence. We
must first derive the inner query and then the outer one. Once the two queries are
formed, nesting them correctly is another important part of the algorithm, and deciding
which query goes where is a tad difficult to implement.
