Efficient Keyword Search across Heterogeneous Relational Databases
Mayssam Sayyadian, AnHai Doan University of Wisconsin - Madison
Hieu LeKhac University of Illinois - Urbana Luis Gravano Columbia University
Key Message of Paper
Precise data integration is expensive But we can do IR-style data integration very cheaply, with no manual cost!
– just apply automatic schema/data matching – then do keyword search across the databases – no need to verify anything manually
Already very useful
Build upon keyword search over a single database ...
2
Keyword Search over a Single Relational Database
A growing field, numerous current works
– – – – DBXplorer [ICDE02], BANKS [ICDE02] DISCOVER [VLDB02] Efficient IR-style keyword search in databases [VLDB03], VLDB-05, SIGMOD-06, etc. XKeyword [ICDE03], XRank [Sigmod03] TeXQuery [WWW04] ObjectRank [Sigmod06] TopX [VLDB05], etc.
Many related works over XML / other types of data
– – – –
More are coming at SIGMOD-07 ...
3
A Typical Scenario
Customers
tid custid name t1 c124 t2 c533 t3 c333 Cisco IBM MSR contact Michael Jones David Long David Ross addr … … …
Complaints
tid id emp-name comments Repair didn’t work Deferred work to John Smith u1 c124 Michael Smith u2 c124 John
Foreign-Key Join Q = [Michael Smith Cisco]
Ranked list of answers
t1 c124 Cisco Michael Jones … u1 c124 Michael Smith Repair didn’t work
score=.8
t1 c124 Cisco
Michael Jones …
u2 c124 John
Deferred work to John Smith
score=.7
4
Our Proposal: Keyword Search across Multiple Databases
Complaints tid id emp-name comments tid v1 v2 v3 empid e23 e14 e37 Employees name Mike D. Smith John Brown Jack Lucas Groups contact Michael Jones addr … tid x1 x2 eid e23 e14 reports-to e37 e37
u1 c124 Michael Smith
u2 c124 John
Repair didn’t work
Deferred work to John Smith
Customers tid custid name t1 c124 Cisco
t2 c533
t3 c333
IBM
MSR
David Long
Joan Brown
…
…
Query: [Cisco Jack Lucas]
t1 c124 Cisco Michael Jones … u1 c124 Michael Smith Repair didn’t work v1 e23 Mike D. Smith x1 e23 e37 v3 e37 Jack Lucas
across databases
IR-style data integration
5
A Naive Solution
1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Now treat the combination of DBs as a single DB apply current keyword search techniques
Just like in traditional data integration, this is too much manual work
6
Kite Solution
Automatically find FK joins / matching data instances across databases no manual work is required from user
Employees emp-name comments Repair didn’t work Deferred work to John Smith tid v1 v2 v3 empid e23 e14 e37 name Mike D. Smith John Brown Jack Lucas Groups contact Michael Jones David Long Joan Brown addr … … …
7
Complaints
tid id u1 c124 Michael Smith u2 c124 John
Customers tid custid name t1 c124 t2 c533 t3 c333 Cisco IBM MSR tid x1 eid e23 e37
reports-to
x2
e14
e37
Complaints tid id
Automatically Find FK Joins across Databases
emp-name comments Repair didn’t work Deferred work to John Smith tid empid
Employees name
u1 c124 Michael Smith u2 c124 John
v1
v2 v3
e23
e14 e37
Mike D. Smith
John Brown Jack Lucas
Current solutions analyze data values (e.g., Bellman) Limited accuracy
– e.g., “waterfront” with values yes/no “electricity” with values yes/no
Our solution: data analysis + schema matching
– improve accuracy drastically (by as much as 50% F-1)
Automatic join/data matching can be wrong
incorporate confidence scores into answer scores 8
Incorporate Confidence Scores into Answer Scores
Recall: answer example in single-DB settings
t1 c124 Cisco Michael Jones … u1 c124 Michael Smith Repair didn’t work
score=.8
Recall: answer example in multiple-DB settings
score 0.7 for data matching
u1 c124 Michael Smith Repair didn’t work v1 e23 Mike D. Smith x1 e23 e37 v3 e37 Jack Lucas
t1 c124 Cisco Michael Jones …
score 0.9 for FK join
score (A, Q) =
α.score_kw (A, Q) + β.score_join (A, Q) + γ.score_data (A, Q) size (A)
9
Summary of Trade-Offs
SQL queries
Precise data integration
– the holy grail
IR-style data integration, naive way
– manually identify FK joins, matching data – still too expensive
IR-style data integration, using Kite
– automatic FK join finding / data matching – cheap – only approximates the “ideal” ranked list found by naive
10
Kite Architecture
Index Builder IR index1 … IR indexn
Q = [ Smith Cisco ]
Condensed CN Generator
Foreign key joins
– Partial
Refinement rules
Top-k Searcher Data instance matcher
– Full – Deep
Foreign-Key Join Finder
Data-based Join Finder Schema Matcher
Distributed SQL queries
D1
…
Dn
D1
…
Dn
Offline preprocessing
Online querying
11
Online Querying
Database 1
Relation 1 Relation 2
Database 2
Relation 1 Relation 2
What current solutions do: 1. Create answer templates 2. Materialize answer templates to obtain answers
12
Create Answer Templates
Service-DB
Find tuples that contain query keywords
– – Use DB’s IR index example:
Q = [Smith Cisco] Tuple sets: Service-DB: ComplaintsQ={u1, u2} CustomersQ={v1} HR-DB: EmployeesQ={t1} GroupsQ={}
Complaints u1 u2
Customers v1 v2 v3
HR-DB
Groups x1 x2 Employees t1 t2 t3
Create tuple-set graph
Schema graph:
Customers J1 Complaints J1 J4 J2 Emps J3 Customers{} J1 Complaints{} J4 J4 J1 ComplaintsQ J4 J4 Emps{} J3 J3 EmpsQ J2 Groups{} J2 Groups
Tuple set graph:
J1
CustomersQ
13
Create Answer Templates (cont.)
Search tuple-set graph to generate answer templates
– also called Candidate Networks (CNs)
Each answer template = one way to join tuples to form an answer
sample tuple set graph
J1 J4
sample CNs CN1: CustomersQ
J3 J3 J2 Groups{} J2
Customers{}
J1 J1
Complaints{}
J4 J4
Emps{}
CustomersQ
J1
ComplaintsQ
J4
EmpsQ
J1 CN2: CustomersQ Complaints{Q} J2 J2 J4 CN3: EmpsQ Groups{} Emps{} Complaints{Q} J2 J3 J4 CN4: EmpsQ Groups{} Emps{} Complaints{Q}
14
Materialize Answer Templates to Generate Answers
By generating and executing a SQL query
CN: SQL: CustomersQ ComplaintsQ
J1
(CustomersQ
= {v1} , ComplaintsQ = {u1, u2})
SELECT * FROM Customers C, Complaints P WHERE C.cust-id = P.id AND (C.tuple-id = v1) AND (P.tuple-id = u1 OR tuple-id = u2)
Naive solution
– materialize all answer templates, score, rank, then return answers
Current solutions
– find only top-k answers – materialize only certain answer templates – make decisions using refinement rules + statistics
15
Challenges for Kite Setting
More databases way too many answer templates to generate
– can take hours on just 3-4 databases
Materializing an answer template takes way too long
– requires SQL query execution across multiple databases – invoking each database incurs large overhead
Difficult to obtain reliable statistics across databases
See paper for our solutions
16
Empirical Evaluation
Domains
Domain Avg # approximate FK joins tuples Avg # Avg # tables Avg # tuples per table # DBs attributes per per DB per table schema total across DBs per pair Total size
DBLP Inventory
2 8
3 5.8
3 5.4
11 890
6 804
11 33.6
500K 2K
400M 50M
Sample Inventory Schema
AUTHOR BOOK WH2BOOK ARTIST CD WH2CD
The DBLP Schema
AR (aid, biblo) PU (aid, uid) AU (id, name) DBLP 1 CITE (id1, id2) AR (id, title) CNF (id, name) DBLP 2
WAREHOUSE Inventory 1
17
Runtime Performance (1)
runtime vs. maximum CCN size
180
180
time (sec)
120 60 0 1 2 3 4 5 6 7 8 9
time (sec)
DBLP
Inventory
120 60 0 1 2 3 4 5 6 7
max CCN size
2-keyword queries, k=10, 2 databases
2-keyword queries, k=10, 5 databases
max CCN size
runtime vs. # of databases
45
Inventory
time (sec)
30
Hybrid algorithm adapted to run over multiple databases Kite without adaptive rule selection and without rule Deep Kite without condensed CNs
15 0 1 2 3 4 5 6 7 8 # of DBs
Kite without rule Deep
maximum CCN size = 4, 2-keyword queries, k=10
Full-fledged Kite algorithm
18
Runtime Performance (2)
runtime vs. # of keywords in the query
20
time (sec)
15 10 5 0 1 2
time (sec)
DBLP
40
Inventory
30 20 10
|q|
3 4 5
0 1 2 3 4 5
|q|
max CCN=4, k=10, 5 databases
max CCN=6, k=10, 2 databases
runtime vs. # of answers requested
45 45
time (sec)
30 15 0 1 4 7 10 13 16 19 22 25 27 30
time (sec)
Inventory
30 15 0 1 4 7 10 13 16 19 22 25 27 30
k
k
2-keyword queries, max CCN=4, 5 databases
2-keyword queries, max CCN=4, |q|=2, 5 databases
19
Query Result Quality
Pr@k
1 0.8 0.6 0.4 0.2 0 1 5 10 15
Pr@k
1 0.8 0.6 0.4 0.2
k
20
0 1 5 10 15 20
k
OR-semantic queries
AND-semantic queries
Pr@k = the fraction of answers that appear in the “ideal” list
20
Summary
Kite executes IR-style data integration
– performs some automatic preprocessing – then immediately allows keyword querying
Relatively painless
– no manual work! – no need to create global schema, to understand SQL
Can be very useful in many settings: e.g., on-the-fly, best-effort, for non-technical people
– enterprises, on the Web, need only a few answers – emergency (e.g., hospital + police), need answers quickly
21
Future Directions
Incorporate user feedback interactive IR-style data integration More efficient query processing
– large # of databases, network latency
Extends to other types of data
– XML, ontologies, extracted data, Web data
IR-style data integration is feasible and useful extends current works on keyword search over DB raises many opportunities for future work
22
BACKUP
23
Other Experiments
Join Discovery Accuracy
1 0.8 0.6 0.4 0.2 0 Inventory 1 Inventory 2 Inventory 3 Inventory 4 Inventory 5
accuracy (F1)
Schema matching helps improve join discovery algorithm drastically
Join Discovery
Join Discovery + Schema Matching
Kite over single database
6
time (sec)
4 2 0 1 2 3 4 5 6 7 8
Kite also improves singledatabase keyword search algorithm mHybrid
max CCN size
24