Keyword Search in Databases
Document Sample


Keyword Search in Databases
Vagelis Hristidis
University of California, San Diego
Motivation
Keyword Search is the dominant information
discovery method in documents
Increasing amount of data stored in
databases
Motivation
Currently, information discovery in databases
requires:
Knowledge of schema
Knowledge of a query language (eg: SQL)
Knowledge of the role of the keywords
• Eliminate these requirements
Keyword Query - Semantics
Keywords are:
• in same node
• in nodes of same type (eg: same relation)
• connected through edges (eg: primary-foreign key
relationships)
Score of result:
• distance of keywords within a node
• distance between keywords in terms of edges between them
[HristVLDB02, HristICDE03]
• weighted distance
• Distance + IR techniques [HristVLDB03]
• Random Walk probability [current work]
DISCOVER [HristVLDB02]
Result of Keyword Query
Result is tree T of nodes where:
• each edge corresponds to an edge of the
database graph
• every keyword contained in a node of T
• no node of T is redundant (minimal)
Example - Schema
Subset of TPC-H schema
n:1 n:1
ORDERS CUSTOMER NATION
Example - Data
ORDERS
ORDERKEY CUSTKEY TOTALPRICE CLERK ...
o1 1000105 12312 $5,000 John Smith
o2 1000111 12312 $3,000 Mike Miller
o3 1000125 10001 $7,000 Mike Miller
o4 1000110 10002 $8,000 Keith Brown
CUSTOMER
CUSTKEY NAME NATIONKEY ...
c1 12312 Brad Lou 01
c2 10001 George Walters 01
c3 10013 John Roberts 01
NATION
NATIONKEY NAME REGIONKEY
n1 01 USA N.America
Example – Keyword Query
Query: “Smith, Miller”
ORDERS
ORDERKEY CUSTKEY TOTALPRICE CLERK ...
o1 1000105 12312 $5,000 John Smith
o2 1000111 12312 $3,000 Mike Miller
o3 1000125 10001 $7,000 Mike Miller
o4 1000110 10002 $8,000 Keith Brown
CUSTOMER
CUSTKEY NAME NATIONKEY ...
c1 12312 Brad Lou 01
c2 10001 George Walters 01
c3 10013 John Roberts 01
NATION
NATIONKEY NAME REGIONKEY
n1 01 USA N.America
Example – Keyword Query
Query: “Smith, Miller”
ORDERS
ORDERKEY CUSTKEY TOTALPRICE CLERK ...
o1 1000105 12312 $5,000 John Smith Results:
o2 1000111 12312 $3,000 Mike Miller
o3
o4
1000125 10001 $7,000 Mike Miller Size Result
1000110 10002 $8,000 Keith Brown
2 o1c1o2
CUSTOMER
CUSTKEY NAME NATIONKEY ...
c1 12312 Brad Lou 01
c2 10001 George Walters 01
c3 10013 John Roberts 01
NATION
NATIONKEY NAME REGIONKEY
n1 01 USA N.America
Example – Keyword Query
Query: “Smith, Miller”
ORDERS
o1
ORDERKEY
1000105
CUSTKEY
12312
TOTALPRICE
$5,000
CLERK
John Smith
...
Results:
o2 1000111 12312 $3,000 Mike Miller
o3 1000125 10001 $7,000 Mike Miller
o4 1000110 10002 $8,000 Keith Brown
Size Result
CUSTOMER
CUSTKEY NAME NATIONKEY ... 2 o1c1o2
c1 12312 Brad Lou 01
c2 10001 George Walters 01
c3 10013 John Roberts 01
4 o1c1 n1
c2 o3
NATION
NATIONKEY NAME REGIONKEY
n1 01 USA N.America
Architecture
User
Keywords
"Smith","Miller"
Master
Index
ORDERSSmith={o1}
Tuple Sets
ORDERSMiller={o2,o3}
Candidate
Network
Database Generator
Schema Candidate
Networks
Plan
Generator
Execution Plan
Plan
Execution CREATE TABLE T1 AS SELECT * FROM
Joining
Networks of ORDERSSmith, CUSTOMERS WHERE ...
tuples SQL queries
SELECT * FROM T1, ORDERSMiller WHERE ...
SELECT * FROM T1, NATION, CUSTOMERS,
ORDERSMiller WHERE ...
Database
Architecture
User
Keywords
"Smith","Miller"
Master
Index
ORDERSSmith={o1}
Tuple Sets
ORDERSMiller={o2,o3}
Candidate
Network
Database Generator
Schema Candidate
Networks
Plan
Generator
Execution Plan
Plan
Execution CREATE TABLE T1 AS SELECT * FROM
Joining
Networks of ORDERSSmith, CUSTOMERS WHERE ...
tuples SQL queries
SELECT * FROM T1, ORDERSMiller WHERE ...
SELECT * FROM T1, NATION, CUSTOMERS,
ORDERSMiller WHERE ...
Database
Candidate Networks Generator -
Challenges
A keyword may appear in multiple tuples
# candidate networks can be too big
(sometimes unbounded)
Candidate Network - Example
ORDERS
Smith n:1
n:1 n:1
ORDERS CUSTOMER NATION
n:1
ORDERS
Miller
Candidate Network - Example
ORDERS
Smith n:1
n:1 n:1
ORDERS CUSTOMER NATION
n:1
ORDERS
Miller
CN1: OSmith C OMiller size=2
Candidate Network - Example
ORDERS
Smith n:1
n:1 n:1
ORDERS CUSTOMER NATION
n:1
ORDERS
Miller
• some CNs are pruned, based on schema
• complete, non OMiller
CN1: OSmith C redundant size=2
• size unbounded if many-to-many relationships
CN2: OSmith C N C OMiller size=4
Architecture
User
Keywords
Master
Index
Tuple Sets
Candidate
Network
Database Generator
Schema Candidate
Networks
Plan
Generator
Execution Plan
Plan
Execution
Joining
Networks of
tuples SQL queries
Database
Execution Plan - Challenges
Generated SQL queries are expensive due to
joins
Reusability opportunities
Execution Plan
Each CN corresponds to a SQL statement
CN1: OSmith C OMiller
CN2: OSmith C N C OMiller
Execution Plan
CN1 OSmith C OMiller
CN2 OSmith C N C OMiller
Reuse Common Subexpressions -
Example
Execution Plan
CN1• OSmith Reuse of Common Subexpressions is NP-
Optimal
C OMiller
OSmith
CN2Complete C N C OMiller
• Greedy algorithm:
Optimized Execution Plan
In each iteration build intermediate result of size 1 (1
Temp OSmith C
join) that maximizes frequency a ,0 a, b 1
CN1 Temp OMiller logb (size)
CN2 Temp N C OMiller
[HristVLDB03]
Score of result T
Combining function Score combines
scores of attribute values of T
For example
Score=aiTScore(ai)/size(T)
Attribute value scores Score(ai)
calculated by off-the-self tools of DBMSs
Example – Complaints Database
Schema
Products Complaints Customers
prodId prodId custId
manufacturer custId name
model date occupation
comments
Example – Keyword Query “Netvista Maxtor”
Complaints
tupleId prodId custId date comments score
c1 p121 c3232 6-30-2002 “disk crashed after just one week of moderate 1/3
use on an IBM Netvista X41”
c2 p131 c3131 7-3-2002 “lower-end IBM Netvista= 4/3 fire, starting
Score(c3) caught 1/3
apparently with disk”
Score(p1 c1) = (1+1/3)/2 = 4/6
c3 p131 c3143 37471 “IBM Netvista unstable with Maxtor HD” 4/3
Customers Products
tupleId custId name occupation tupleId prodId manufacturer model score
u1 c3232 “John “Software p1 p121 “Maxtor” “D540X” 1
Smith” Engineer”
u2 c3131 “Jack “Architect” p2 p131 “IBM” “Netvista” 1
Score(p2 c3) Lucas”
= (1+4/3)/2 = 7/6
u3 c3143 “John “Student” p3 p141 “Tripplite” “Smart 0
Mayer” 700VA”
Results: (1) c3, (2) p2 c3, (3) p1 c1
Execution Algorithms
Users usually want top-k results.
Hence, submitting to DBMS a SQL query for
each CN (Naïve algorithm) is inefficient.
When no, or very few results, Naïve algorithm
is efficient, since it fully exploits DBMS.
Present top-k algorithm.
XKeyword [HristICDE03]
Storing XML Data (Decomposer)
Storing Data in XKeyword
Each target object is stored in a CLOB
Connections between target objects in ID Relations
part
person *
partkey name subpart
Minimal ID Relations
name nation Lineitem_Part: LPa (L_id, Pa_id)
lineitem Lineitem_Person_ref: LPref (L_id, P_id)
quantity shipdate supplier linepart Part_Part: PaPa (Pa_id1,Pa_id2)
Lineitem, 100
quantity
[10 ] shipdate
supplier linepart LPa L_id Pa_id
[Oct 14 2001]
Lineitem, 101
Part, 123 100 123
supplier linepart partkey name
quantity
[10 ]
shipdate
[Oct 15 2001]
[1005] ["TV"] 101 123
Presentation of Results
Number of results explodes due to MVDs
Example
l1: lineitem pa1: part
[quantity=10 [partkey=1008
lin
p1: person lie r
shipdate=Oct 14 2001] ep p a rt name="VCR"]
supp a rt pa3: part ub
s
[name="John" su p [partkey=1005
nation="US"] p lier l2: lineitem p a rt name="TV"]
su
[quantity=10 line bp
ar
t pa2: part
shipdate=Oct 15 2001] [partkey=1009
name="VCR & DVD"]
Results:
R1. p1-l1-pa3-pa1
R2. p1-l2-pa3-pa2
R3. p1-l2-pa3-pa1
R4. p1-l1-pa3-pa2
R3, R4 are implied by R1, R2!
Demo (in VLDB03)
• Demo on DBLP dataset available at www.db.ucsd.edu/XKeyword
Demo
Demo
Demo
Related Work
DBXplorer. S. Agrawal et al. ICDE 2002
Similar three step architecture
Incomplete solutions (relations are not re-used)
No common subexpression reusability
BANKS. G. Bhalotia et al. ICDE 2002
Database viewed as graph
Steiner tree problem approximations
Proximity searching in databases. R. Goldman et
al. VLDB 1998
Database viewed as graph
No schema info
hub nodes
Current Work
ObjectRank project
Exploit link structure of data graph for ranking
Suitable for bibliographic and other authority
transfer applications
Semantic differences from PageRank
More processing allowed due to smaller size
than Web
Demo at www.db.ucsd.edu/ObjectRank
Questions?
Related docs
Get documents about "