Efficient Keyword Search across Heterogeneous Relational Databases

Reviews
Efficient Keyword Search across Heterogeneous Relational Databases Mayssam Sayyadian, AnHai Doan University of Wisconsin - Madison Hieu LeKhac University of Illinois - Urbana Luis Gravano Columbia University Key Message of Paper  Precise data integration is expensive  But we can do IR-style data integration very cheaply, with no manual cost! – just apply automatic schema/data matching – then do keyword search across the databases – no need to verify anything manually  Already very useful Build upon keyword search over a single database ... 2 Keyword Search over a Single Relational Database  A growing field, numerous current works – – – – DBXplorer [ICDE02], BANKS [ICDE02] DISCOVER [VLDB02] Efficient IR-style keyword search in databases [VLDB03], VLDB-05, SIGMOD-06, etc. XKeyword [ICDE03], XRank [Sigmod03] TeXQuery [WWW04] ObjectRank [Sigmod06] TopX [VLDB05], etc.  Many related works over XML / other types of data – – – –  More are coming at SIGMOD-07 ... 3 A Typical Scenario Customers tid custid name t1 c124 t2 c533 t3 c333 Cisco IBM MSR contact Michael Jones David Long David Ross addr … … … Complaints tid id emp-name comments Repair didn’t work Deferred work to John Smith u1 c124 Michael Smith u2 c124 John Foreign-Key Join Q = [Michael Smith Cisco] Ranked list of answers t1 c124 Cisco Michael Jones … u1 c124 Michael Smith Repair didn’t work score=.8 t1 c124 Cisco Michael Jones … u2 c124 John Deferred work to John Smith score=.7 4 Our Proposal: Keyword Search across Multiple Databases Complaints tid id emp-name comments tid v1 v2 v3 empid e23 e14 e37 Employees name Mike D. Smith John Brown Jack Lucas Groups contact Michael Jones addr … tid x1 x2 eid e23 e14 reports-to e37 e37 u1 c124 Michael Smith u2 c124 John Repair didn’t work Deferred work to John Smith Customers tid custid name t1 c124 Cisco t2 c533 t3 c333 IBM MSR David Long Joan Brown … … Query: [Cisco Jack Lucas] t1 c124 Cisco Michael Jones … u1 c124 Michael Smith Repair didn’t work v1 e23 Mike D. Smith x1 e23 e37 v3 e37 Jack Lucas across databases  IR-style data integration 5 A Naive Solution 1. Manually identify FK joins across DBs 2. Manually identify matching data instances across DBs 3. Now treat the combination of DBs as a single DB  apply current keyword search techniques Just like in traditional data integration, this is too much manual work 6 Kite Solution  Automatically find FK joins / matching data instances across databases  no manual work is required from user Employees emp-name comments Repair didn’t work Deferred work to John Smith tid v1 v2 v3 empid e23 e14 e37 name Mike D. Smith John Brown Jack Lucas Groups contact Michael Jones David Long Joan Brown addr … … … 7 Complaints tid id u1 c124 Michael Smith u2 c124 John Customers tid custid name t1 c124 t2 c533 t3 c333 Cisco IBM MSR tid x1 eid e23 e37 reports-to x2 e14 e37 Complaints tid id Automatically Find FK Joins across Databases emp-name comments Repair didn’t work Deferred work to John Smith tid empid Employees name u1 c124 Michael Smith u2 c124 John v1 v2 v3 e23 e14 e37 Mike D. Smith John Brown Jack Lucas   Current solutions analyze data values (e.g., Bellman) Limited accuracy – e.g., “waterfront” with values yes/no “electricity” with values yes/no  Our solution: data analysis + schema matching – improve accuracy drastically (by as much as 50% F-1) Automatic join/data matching can be wrong  incorporate confidence scores into answer scores 8 Incorporate Confidence Scores into Answer Scores  Recall: answer example in single-DB settings t1 c124 Cisco Michael Jones … u1 c124 Michael Smith Repair didn’t work score=.8  Recall: answer example in multiple-DB settings score 0.7 for data matching u1 c124 Michael Smith Repair didn’t work v1 e23 Mike D. Smith x1 e23 e37 v3 e37 Jack Lucas t1 c124 Cisco Michael Jones … score 0.9 for FK join score (A, Q) = α.score_kw (A, Q) + β.score_join (A, Q) + γ.score_data (A, Q) size (A) 9 Summary of Trade-Offs  SQL queries Precise data integration – the holy grail  IR-style data integration, naive way – manually identify FK joins, matching data – still too expensive  IR-style data integration, using Kite – automatic FK join finding / data matching – cheap – only approximates the “ideal” ranked list found by naive 10 Kite Architecture Index Builder IR index1 … IR indexn Q = [ Smith Cisco ] Condensed CN Generator Foreign key joins – Partial Refinement rules Top-k Searcher Data instance matcher – Full – Deep Foreign-Key Join Finder Data-based Join Finder Schema Matcher Distributed SQL queries D1 … Dn D1 … Dn Offline preprocessing Online querying 11 Online Querying Database 1 Relation 1 Relation 2 Database 2 Relation 1 Relation 2 What current solutions do: 1. Create answer templates 2. Materialize answer templates to obtain answers 12 Create Answer Templates Service-DB Find tuples that contain query keywords – – Use DB’s IR index example: Q = [Smith Cisco] Tuple sets: Service-DB: ComplaintsQ={u1, u2} CustomersQ={v1} HR-DB: EmployeesQ={t1} GroupsQ={} Complaints u1 u2 Customers v1 v2 v3 HR-DB Groups x1 x2 Employees t1 t2 t3 Create tuple-set graph Schema graph: Customers J1 Complaints J1 J4 J2 Emps J3 Customers{} J1 Complaints{} J4 J4 J1 ComplaintsQ J4 J4 Emps{} J3 J3 EmpsQ J2 Groups{} J2 Groups Tuple set graph: J1 CustomersQ 13 Create Answer Templates (cont.)   Search tuple-set graph to generate answer templates – also called Candidate Networks (CNs) Each answer template = one way to join tuples to form an answer sample tuple set graph J1 J4 sample CNs CN1: CustomersQ J3 J3 J2 Groups{} J2 Customers{} J1 J1 Complaints{} J4 J4 Emps{} CustomersQ J1 ComplaintsQ J4 EmpsQ J1 CN2: CustomersQ  Complaints{Q} J2 J2 J4 CN3: EmpsQ  Groups{}  Emps{}  Complaints{Q} J2 J3 J4 CN4: EmpsQ  Groups{}  Emps{}  Complaints{Q} 14 Materialize Answer Templates to Generate Answers  By generating and executing a SQL query CN: SQL: CustomersQ  ComplaintsQ J1 (CustomersQ = {v1} , ComplaintsQ = {u1, u2}) SELECT * FROM Customers C, Complaints P WHERE C.cust-id = P.id AND (C.tuple-id = v1) AND (P.tuple-id = u1 OR tuple-id = u2)   Naive solution – materialize all answer templates, score, rank, then return answers Current solutions – find only top-k answers – materialize only certain answer templates – make decisions using refinement rules + statistics 15 Challenges for Kite Setting  More databases  way too many answer templates to generate – can take hours on just 3-4 databases  Materializing an answer template takes way too long – requires SQL query execution across multiple databases – invoking each database incurs large overhead   Difficult to obtain reliable statistics across databases See paper for our solutions 16 Empirical Evaluation Domains Domain Avg # approximate FK joins tuples Avg # Avg # tables Avg # tuples per table # DBs attributes per per DB per table schema total across DBs per pair Total size DBLP Inventory 2 8 3 5.8 3 5.4 11 890 6 804 11 33.6 500K 2K 400M 50M Sample Inventory Schema AUTHOR BOOK WH2BOOK ARTIST CD WH2CD The DBLP Schema AR (aid, biblo) PU (aid, uid) AU (id, name) DBLP 1 CITE (id1, id2) AR (id, title) CNF (id, name) DBLP 2 WAREHOUSE Inventory 1 17 Runtime Performance (1) runtime vs. maximum CCN size 180 180 time (sec) 120 60 0 1 2 3 4 5 6 7 8 9 time (sec) DBLP Inventory 120 60 0 1 2 3 4 5 6 7 max CCN size 2-keyword queries, k=10, 2 databases 2-keyword queries, k=10, 5 databases max CCN size runtime vs. # of databases 45 Inventory time (sec) 30 Hybrid algorithm adapted to run over multiple databases Kite without adaptive rule selection and without rule Deep Kite without condensed CNs 15 0 1 2 3 4 5 6 7 8 # of DBs Kite without rule Deep maximum CCN size = 4, 2-keyword queries, k=10 Full-fledged Kite algorithm 18 Runtime Performance (2) runtime vs. # of keywords in the query 20 time (sec) 15 10 5 0 1 2 time (sec) DBLP 40 Inventory 30 20 10 |q| 3 4 5 0 1 2 3 4 5 |q| max CCN=4, k=10, 5 databases max CCN=6, k=10, 2 databases runtime vs. # of answers requested 45 45 time (sec) 30 15 0 1 4 7 10 13 16 19 22 25 27 30 time (sec) Inventory 30 15 0 1 4 7 10 13 16 19 22 25 27 30 k k 2-keyword queries, max CCN=4, 5 databases 2-keyword queries, max CCN=4, |q|=2, 5 databases 19 Query Result Quality Pr@k 1 0.8 0.6 0.4 0.2 0 1 5 10 15 Pr@k 1 0.8 0.6 0.4 0.2 k 20 0 1 5 10 15 20 k OR-semantic queries AND-semantic queries  Pr@k = the fraction of answers that appear in the “ideal” list 20 Summary  Kite executes IR-style data integration – performs some automatic preprocessing – then immediately allows keyword querying  Relatively painless – no manual work! – no need to create global schema, to understand SQL  Can be very useful in many settings: e.g., on-the-fly, best-effort, for non-technical people – enterprises, on the Web, need only a few answers – emergency (e.g., hospital + police), need answers quickly 21 Future Directions  Incorporate user feedback  interactive IR-style data integration More efficient query processing – large # of databases, network latency   Extends to other types of data – XML, ontologies, extracted data, Web data IR-style data integration is feasible and useful extends current works on keyword search over DB raises many opportunities for future work 22 BACKUP 23 Other Experiments Join Discovery Accuracy 1 0.8 0.6 0.4 0.2 0 Inventory 1 Inventory 2 Inventory 3 Inventory 4 Inventory 5 accuracy (F1)  Schema matching helps improve join discovery algorithm drastically Join Discovery Join Discovery + Schema Matching Kite over single database 6 time (sec)  4 2 0 1 2 3 4 5 6 7 8 Kite also improves singledatabase keyword search algorithm mHybrid max CCN size 24

Related docs
premium docs
Other docs by One Seven
Independent Contractor Agreement
Views: 472  |  Downloads: 31
Stock Certificate for Preferred Stock
Views: 484  |  Downloads: 20
crowell-all
Views: 211  |  Downloads: 1
Notice of Unsatisfactory Work Performance
Views: 561  |  Downloads: 25
Sample Action by Voting Members or Managers
Views: 240  |  Downloads: 1
Form 6252 Installment Sale Income
Views: 498  |  Downloads: 1
Criminal Psychology
Views: 580  |  Downloads: 56
Stock Subscription Agreement
Views: 998  |  Downloads: 52
Letter of Intent for Joint Venture
Views: 2042  |  Downloads: 220
Due Diligence Checklist
Views: 1036  |  Downloads: 63