Docstoc

Expressing and Optimizing the Similarity-based Queries In SQL

Document Sample
Expressing and Optimizing the Similarity-based Queries In SQL Powered By Docstoc
					Expressing and Optimizing the Similarity-based Queries In SQL

by Like Gao (lgao@gmu.edu) ISE Department GMU Advisor: Prof. X. Sean Wang

What is similarity-based search?

Example: time series case
Similarity: Euclidean Distance, Correlation Coefficient
Similar: Distance=0.5; Dissimilar Distance=2.5;

t

t

Similarity-based Search

Pattern set
Query object Nearest

Near

Example: not similarity-based search!

Similarity-based Search is content based!

Query optimization in DBMS? What is available/unavailable?

Query Optimization
Select From Where And ID Employee Salary > 100K Age < 40

Query Optimization is to find the query execution plan that has the smallest cost

Employee
ID 690088 435464 034567 … Age 24 30 50 … Salary 50k 60k 110k …

Supported by current DBMS!

UDT, UDF(P) are supported
Select From Where And And Animal
ID 00002 123567 4572 … Age 2 2 5 … Photo 50k 60k 110k …

ID Animal Photo is most like ‘???.jpg’ animal in Photo is puppy Age < 4

UDP: written by users, supported by DBMS

UDT: supported by DBMS i.e., BLOB

But: query optimization?
Select From Where And And ID Animal UDP1 (similarity-based search 1) UDP2 (similarity-based search 2) Age < 4

The optimization for similarity-based queries can not be properly optimized by current DBMS! (traditional DBMS evaluates UDP as early as possible!)

Summary
RDBMS

Build-in Optimization Algorithms

?

SimilarityBased Queries (SQL) Answer

Similarity-functions (UDP), provided by users

Expressing and Optimizing the Similarity-based Queries In SQL
End of Part 1

Outline
RDBMS and SQL Query Optimization Similarity-Based Search Expressing and Optimizing SimilarityBased Queries

RDBMS and SQL
RDBMS (Relational Database Management System)
http://www.cs.yale.edu/homes/avi/db-book/index.html

Based on E. F. Codd’s relational model (theory), 1969 Venders: IBM DB2, ORACLE, Microsoft SQL Server, etc.

SQL (Structured Query Language)
http://sqlzoo.net/

The standard language used by RDBMS

A quick review of RDBMS and SQL

http://mason.gmu.edu/~xywang/infs797/lecture1.pdf

Query Optimization
Select From Where
ID 690088 435464

ID Employee Salary > 100K
Age 24 30 Salary 50k 60k

If most of employees’ salaries are very likely >100K,  sequential scan If 95% of the employees’ salaries are unlikely >100K,  using the index on Salary Optimizer ?SSCAN ?Index

034567
…

50
…

110k
…

Plan

Execution Engine

IDs

Query Optimization is to find the plan that has the smallest execution cost

Statistics About Employee

Employee Data

Similarity-Based Search
Nearest Neighbor
Given a set of pattern objects P and a query object q, an object pi in P such that sim(pi,q) < sim(pj,q), for any j≠i.

k-nearest neighbors

Near Neighbors
Given a set of pattern objects P, a query object q and a threshold TH, all object pi in P such that sim(pi,q) < TH.

Both nearest and near neighbors

Example: time series case
Similarity: Euclidean Distance, Correlation Coefficient
similar dissimilar

t

t

Similarity-based Search

Nearest

pi

P

q
Near

pi1 pi2

Example: not similarity-based search!

Similarity-Based Queries in SQL
Current DBMS support UDT and UDF.
(U:user, D: defined, T: type and F: function)

Express the similarity predicate as UDP (user-defined-predicate)
NN(query object, one given object, pattern set, K, TH)  T/F?

PatternSet

PsetID Set1 Set1 Set2 …

ID 1 10 10 …

Select From Where And

P.ID PatternSet P NN(q, P.ID, Set1, 1, 0.2) P.ID < 1000

Optimizing
Select From Where And P.ID PatternSet P NN(q, P.ID, Set1, 1, 0.2) P.ID < 1000

Only a few P.ID < 1000

Plan A: 1) P.ID < 1000 2) NN(q, P.ID, Set1, 1, 0.2) Plan B: 1) NN(q, P.ID, Set1, 1, 0.2) 2) P.ID < 1000

Many P.ID < 1000


				
DOCUMENT INFO
Shared By:
Categories:
Stats:
views:28
posted:11/20/2009
language:English
pages:19