Embed
Email

Keyword Searching in Relational Databases

Document Sample
Keyword Searching in Relational Databases
Shared by: HC111124003428
Categories
Tags
Stats
views:
6
posted:
11/23/2011
language:
English
pages:
62
Keyword Searching in Relational Databases









Esha Palta (05329017)

Kumar Gaurav Bijay (02005013)

03/13/06 Gaurav Kumar/ Esha Palta 1

Dilbert Strip 









03/13/06 Gaurav Kumar/ Esha Palta 2

Motivation

 Keyword search









 We have SQL, why keyword-querying?

 SQL - not appropriate for naive users

 So many online databases (imdb, citeseer, bseindia …)

– user cannot keep track of schema for all of these



03/13/06 Gaurav Kumar/ Esha Palta 3

Simple Approaches

 Using Form interfaces

 Require separate form for each type of query – confusing

 Not suitable for ad-hoc queries – how many forms will you

provide?





 How about Google?

 Export data from db to documents and do keyword- querying on

these

 Suffers from duplication overheads

 Google wants all keywords in one document. DB is often

normalized, so need to join tables and store as documents

 Multiple combinations of tables to join. Not scalable …





03/13/06 Gaurav Kumar/ Esha Palta 4

Differences from Web Search

 Related data split across multiple tuples due to

normalization

Cites Paper Writes Author

(Citing,

(PaperId, (AuthorId, (AuthorId,

PaperName) PaperId) AuthorName)

Cited)





The DBLP Bibliography Schema

 Different keywords may match tuples from different

relations

 What joins are to be computed can only be decided on the fly

 Need to find result containing all keywords and rank them

somehow



03/13/06 Gaurav Kumar/ Esha Palta 5

Systems for DB search



BANKS (Browsing and Keyword Search)

– IITB (ICDE ‟02)

DBXplorer – Microsoft Research (ICDE ‟02)

ObjectRank – IBM, UCSD, FIU (VLDB ‟04)

Bidirectional BANKS – IITB (VLDB ‟05)









03/13/06 Gaurav Kumar/ Esha Palta 6

Systems for DB search



BANKS (Browsing and Keyword Search)

– IITB (ICDE ‟02)

DBXplorer – Microsoft Research (ICDE ‟02)

ObjectRank – IBM, UCSD, FIU (VLDB ‟04)

Bidirectional BANKS – IITB (VLDB ‟05)

will cover

in depth









03/13/06 Gaurav Kumar/ Esha Palta 7

BANKS (ICDE ’02)









03/13/06 Gaurav Kumar/ Esha Palta 8

The BANKS system



BANKS Architecture

HTTP JDBC

User BANKS



Web-server Database



 Available on the web

 http://www.cse.iitb.ac.in/banks

 Connects to database using JDBC

 JDBC metadata features used to provide schema

browsing

 Preprocesses db

03/13/06 Gaurav Kumar/ Esha Palta 9

Basic Model

 Database: modeled as a graph

 Nodes = tuples

 Edges = references between tuples

 foreign key (assume for this talk), inclusion dependencies, ..

 Edges are directed.



PaperId:PaperName



BANKS01:Keyword Search MO:MultiQuery Optimizn paper

AuthorID:PaperId



Charuta:BANKS01 writes

AuthorId

Charuta S. Sudarshan Prasan Roy author



03/13/06

example

DBLP Kumar/ Esha Palta

Gaurav 10

The BANKS Answer Model



 Query: set of search terms {t1, t2, .., tn}

 For each search term ti we find set of nodes Si matching ti

 Eg: Query = Sudarshan Roy (t1= Sudarshan, t2 = Roy)



 Answer: rooted, directed tree connecting nodes matching

keywords

 Root node has special significance, may be restricted to some

relations

 E.g. relations representing entities, not relationships

 May include intermediate nodes not in any Si (Steiner Tree)



 Multiple answers

 Ranking based on proximity + prestige

03/13/06 Gaurav Kumar/ Esha Palta 11

Answer Example

Query: sudarshan roy Paper

MultiQuery Optimization



Writes Writes





Author Author

S. Sudarshan Prasan Roy





 We would like to find sets of (closely)

connected tuples that match all given

keywords





03/13/06 Gaurav Kumar/ Esha Palta 12

Edge Directionality

Directed tree will miss desired answers. For eg:

Query = DBXplorer ObjectRank

CitedBy Cited

BANKS Cites DBXPlorer





Cites ObjectRank



So, for each forward edge, BANKS adds a back

edge

Cited CitedBy

BANKS Cites DBXPlorer





Cites ObjectRank





03/13/06 Gaurav Kumar/ Esha Palta 13

Edge Directionality



 What if we ignore directionality?

 Some popular tuples are connected to many other

tuples

 E.g. Students -> departments -> university

 Problem: A popular tuple would create

misleading shortcuts between tuples

 E.g. every student would be closely linked with every

other student via the department/university

 Solution: define different forward and backward

edge weights

 Forward edges: In the direction of the foreign key

reference

03/13/06 Gaurav Kumar/ Esha Palta 14

Edge Weight



 Weight of forward edge based on schema

 e.g. citation link weights > “writes” link weights

 Weight of backward edge = indegree of edges

pointing to the node

3



1

3

1



3

1





03/13/06 Gaurav Kumar/ Esha Palta 15

Edge Weight Scaling



 Normalize edge score Escore(e)

 Make edge weight scale-free by dividing edge

weigth by wmin

 Problem: Some backward edges have unduly

large weights

 Depress the scale by defining Escore(e) as

log(1+w(e)/wmin )

 Overall Escore E = 1 / (1 + e Escore(e))









03/13/06 Gaurav Kumar/ Esha Palta 16

Node Weight



 Set weight of a Node = Indegree of the node

As per prestige rankings nodes with multiple

pointers to them get a higher prestige

So, higher node weight corresponds to higher

prestige

 Problem: Nodes with many in-edges result in

skewed answers

 Subdue extreme node weights by using

log(1+indegree)

 Node score Nscore = Average of node scores

(root-node-weight +  leaf-node-weights)

03/13/06 Gaurav Kumar/ Esha Palta 17

Combining Scores



 Combining two independent metrics: node

weight and edge weight

 Normalize each to 0-1

 Combine using weighting factor 

 Additive: (1- ) Escore +  Nscore



 Multiplicative: Escore * Nscore

 Performance study to compare alternatives

and to find reasonable values for 









03/13/06 Gaurav Kumar/ Esha Palta 18

First Step – Symbol Table



The first step is to build a symbol table

This table is in the db and is not normalized

Example:

Keyword List of Matching Nodes

Database {NICDE_2, NVLDB_3, …}

Search {NBANKS1, NBANKS2, NDBXPLR,…}

Rank {NOBJRNK, NXRANK, NSPHSRCH, …}





03/13/06 Gaurav Kumar/ Esha Palta 19

Searching for Best Answers



 Backward Expanding Search Algorithm:

 Assume: graph fits in memory

 Idea: find vertices from which a forward path exists to at

least one node from each Si.

 Run concurrent single source shortest path algorithm

from each node matching a keyword

 Create an iterator for each node matching a keyword

 Traverse the graph edges in reverse direction



 Output a node whenever it is on the intersection of the sets

of nodes reached from each keyword

 Answer trees may not be generated in

relevance order

03/13/06 Gaurav Kumar/ Esha Palta 20

Backward Expanding Search

Query: sudarshan roy



paper MultiQuery Optimization









writes









authors S. Sudarshan Prasan Roy



03/13/06 Gaurav Kumar/ Esha Palta

Iterators 21

BANKS Query Result Example

 Result of “Sudarshan Roy”









03/13/06 Gaurav Kumar/ Esha Palta 22

Result Ordering

Answers need not be always in Relevance order







This tree is output

Better Root

Missed

2 2







5 2

2 1









03/13/06 Gaurav Kumar/ Esha Palta 23

Result Ordering (contd…)



 Solution:

 Generate all connection trees and then sort them

 Increases computation costs and leads to a greatly increased

time to generate initial results

 Create a small heap ordered on the relevance of the

trees

 Output highest ranked tree from heap to user when heap is

full

What about duplicate results?

 Maintain a list of generated results for duplicate

detection

 Discard result according to relevance

03/13/06 Gaurav Kumar/ Esha Palta 24

Experience and Performance



 BANKS provides keyword search coupled with

extensive browsing facilities

 Schema browsing + data browsing

 Graphical display of data

 Implemented using Java + servlets

 Keyword search response times typically 1 to 3

seconds on

 DBLP database with 100,000 tuples/300,000 edges

 P3 600 MHz, 512 MB RAM





03/13/06 Gaurav Kumar/ Esha Palta 25

Anecdotes



 “Mohan”

 Returns C. Mohan at top based on prestige

(number of papers written)

 “Transaction”

 Returns Jim Gray‟s classic paper and textbook as

top answers based on prestige (number of

citations)

 “Sunita Seltzer”

 No common papers, but both have papers with

Stonebraker: system finds this connection





03/13/06 Gaurav Kumar/ Esha Palta 26

Effect of Parameters

 Log scaling of edge weights worked well

 (1- ) E +  N versus E N  made little difference

 Best with  = .2 (subdue node weights but not entirely)









03/13/06 Gaurav Kumar/ Esha Palta 27

BANKS (VLDB ’05)









03/13/06 Gaurav Kumar/ Esha Palta 28

Motivation

BANKS performs poorly if

Keyword matches lot of nodes (so lot of Dijkstra

sources)





Wastes time





Sudarshan Roy





Search hits a node with large fan – in.









03/13/06 Gaurav Kumar/ Esha Palta 29

New Ideas – Forward Search



Why only backward, lets search forward too :

How about fwd

Searching ?











Sudarshan Roy









03/13/06 Gaurav Kumar/ Esha Palta 30

New Ideas - Activation



Activation :- Cannot forward search from each

node.

Spread activation from keyword nodes to others.

Activation is like Page Rank with decay.

High Activation  close to many keywords.









03/13/06 Gaurav Kumar/ Esha Palta 31

Activation Spreading



Spreading Activation

Node with highest activation explored first

Activation spread to neighbors (μ = 0.3)









Gives low activation to neighbors of hubs



03/13/06 Gaurav Kumar/ Esha Palta 32

Modifications to Model



Graph model stays the same.

BANKS is concerned with search more than how

to tune parameters or define node – weights /

edge – weights.

BANKS code :

Tree Node – Score, N =  node  prestige(u)

uleaf  Root



Tree Edge – Score, E = 1

1  cost _ of _ path _ from _ root _ to _ t i

ti keyword







Total Score = EN ( = 0.2)

03/13/06 Gaurav Kumar/ Esha Palta 33

The New Algorithm



Need two priority queues :

Qin - do backward search from these nodes

Qout - do forward search from these nodes



Each node, n keeps 3 variables per keyword, ti

sp [i] : Node to got to from n for shortest-path to ti

distance [i] : Length of the shortest-path from n to ti

Activation [i] : Activation to n from keyword „ti‟









03/13/06 Gaurav Kumar/ Esha Palta 34

The New Algorithm – continued…

 Set initial activation keyword nodes and add to Qin for

backward-search.

 At each step, pick node with maximum activation

i.e. if (Qin.getMaxActivation > Qout. getMaxActivation))

// use node from Qin

else

// use node from Qout

 If node from Qin, do backward search and add itself to

Qout. (newly explored nodes into Qin)

 If node from Qout, do forward search

 If node has reached from all keyword, generate result-

tree. [answer is buffered as results can be out of order]

03/13/06 Gaurav Kumar/ Esha Palta 35

Explanation with example



Qin Qout

N100

N4 Roy

N1

N3 … Sudarshan



N2





Roy Sudarshan









03/13/06 Gaurav Kumar/ Esha Palta 36

Explanation with example



Qin Qout

N100

N4 Roy

N1

N1

N3 … N2

Sudarshan



N2





Roy Sudarshan









03/13/06 Gaurav Kumar/ Esha Palta 37

Explanation with example



Qin Qout

N100

N2 N1

N4

N1 N3 Roy

N3 …

… Sudarshan

N2

N100



Roy Sudarshan







Result Found !







03/13/06 Gaurav Kumar/ Esha Palta 38

Generation of top-k results



 If we know the score of next-best answer, all buffered

answers with better score can be output.









 Need upper bounds

03/13/06 Gaurav Kumar/ Esha Palta 39

Computation of upper bound



For each keyword ti, we have explored nodes

upto some length – say li.



1

So, next – best – score (approx.) =

1   li



This is not a true upper bound, but works quite

well and is simple !









03/13/06 Gaurav Kumar/ Esha Palta 40

Are we losing answers ?



BANKS – I used many Dijkstra states, BANKS – II

uses 2 only – forward and backward search-

states.



The result is that we can now lose answers !









03/13/06 Gaurav Kumar/ Esha Palta 41

Answer Loss Example



Ny





K1 Nx K2





K1





Ny

Ny



Nx K2

K1 K2

K1

This is the generated answer.

This answer is lost.

03/13/06 Gaurav Kumar/ Esha Palta 42

But, we will generate this tree rooted at Nx:



NY







K2

NX









K1





So, a rotated tree with same nodes but

different root is often generated !

03/13/06 Gaurav Kumar/ Esha Palta 43

Metrics of Performance

 Manually obtain best relevant answers.



 Determine 2 times :

1. Time taken to produce last relevant answer.

2. Time taken to output last relevant answer.



 Search algorithms

 MI-Bkwd: original backward search

 Iterator for every node matching a keyword

 SI-Bkwd: backward search with single backward iterator

 Bidirec: bidirectional search



 Datasets

 DBLP, IMDB ~ 2 million nodes, 9 million edges

 US Patent DB ~ 4 million nodes, 15 million edges

03/13/06 Gaurav Kumar/ Esha Palta 44

Graph - I



 MI-Bkwd versus SI-Bkwd









 SI-Bkwd gain increases with origin size, # keywords

03/13/06 Gaurav Kumar/ Esha Palta 45

Graph - II

 SI-Bkwd versus Bidirec









 Bidirec gain increases with origin size, # keywords

03/13/06 Gaurav Kumar/ Esha Palta 46

A Critique



BANKS needs a lot of memory.



Need to cluster and keep parts of graph on disk.



Work is in progress 









03/13/06 Gaurav Kumar/ Esha Palta 47

DBXplorer (ICDE ’02)









03/13/06 Gaurav Kumar/ Esha Palta 48

DBXplorer : (Microsoft Research)



Use symbol – table to determine which tables to

join.

Generate all possible table – join combinations :



Figure :



T1, T2, T3,

T4 and T5

are tables









03/13/06 Gaurav Kumar/ Esha Palta 49

Cool ideas in DBXplorer

Symbol table need not be at tuple level. If

column has an index, column – level symbol

table is ok.

Table Compression :

e.g. : Keywords Columns Keywords Columns

K1 K1

C1 C1

K2 K2

X

K3 C2 K3 C2



K4 K4



K5 K5

Intermediate Column



03/13/06 Gaurav Kumar/ Esha Palta 50

ObjectRank (VLDB ‘04)









03/13/06 Gaurav Kumar/ Esha Palta 51

ObjectRank (IBM, FIU, UCSD)



Creates objects in database. Object definition is

manual.

e.g. in DBLP, author, conference and paper can

be defined as objects.



Heavily inspired by PageRank.



Each node is given global ObjectRank just like

PageRank of Google.



03/13/06 Gaurav Kumar/ Esha Palta 52

ObjectRank Ideas



Keyword-level ObjectRank : for each keyword,

precompute and save object ranks of nodes [can

optimize by defining cut-off)



Score of node, n w.r.t. keyword k :

scorek(n) = f (Global-object-rank (n), Objectrankk (n))





At run time, scores are combined :

scorek1,k2,…,km(n) = scorek1(n) * scorek2(n) * …* scorekm(n)



03/13/06 Gaurav Kumar/ Esha Palta 53

ObjectRank Algorithm and answers



If graph is DAG or near DAG, topologically sort

and spread ObjectRank in this order.

Answers are single objects and not Cluster /

group as in BANKS.

Demo at :

http://teriyaki.ucsd.edu:9099/objrank/main05_new.html









03/13/06 Gaurav Kumar/ Esha Palta 54

Conclusion



Studied BANKS, both versions.



Covered cool ideas from DBXplorer and

ObjectRank.



Graph of BANKS must be made disk-resident.









03/13/06 Gaurav Kumar/ Esha Palta 55

References



 Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and

S. Sudarshan.

Keyword Searching and Browsing in Databases using BANKS.

In International Conference on Data Engineering (ICDE), pages 1083–1096,

2002.

 Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan et. al.

Bidirectional Expansion for Keyword Search on Graph Databases.

In VLDB Conference, pages 505–516, 2005.

 Sanjay Agrawal, Surajit Chaudhari, and Gautam Das.

DBXplorer: A System for Keyword-Based Search over Relational Databases.

In International Conference on Data Engineering (ICDE), pages 5–22, 2002.

 Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou.

ObjectRank: Authority-Based Keyword Search in Databases.

In VLDB Conference, pages 564–575, 2004.







03/13/06 Gaurav Kumar/ Esha Palta 56

Appendix



Extra slides









03/13/06 Gaurav Kumar/ Esha Palta 57

Browsing - May add??????

 Hyperlinks are there for all primary key foreign key

attributes

 Each table is displayed with set of tools for interacting

with data

 Projection (using drop), Selection, Join, Group-by, Sort

 Template facilities to do a variety of tasks

 Browsing data by grouping and creating crosstabs

 e.g., theses grouped by department and year

 Hierarchical views of data

 Nested XML style, even on relational data

 Graphical displays

 Bar charts, pie charts, etc

 Templates are generic and can be applied on any data

matching assumed schema

 Can be applied after applying selections

03/13/06

 New templates can be created by user, interactively

Gaurav Kumar/ Esha Palta 58

Example of Browsing in BANKS









03/13/06 Gaurav Kumar/ Esha Palta 59

Related Work



 DataSpot (DTL)/Mercado Intuifind [VLDB 98]

 Based on patent by Palmon (filed 1995, granted 1998)

 Based on hypergraph model, similar answer model to ours

 Differences: our model of backward link weights and prestige

 Proximity Search [VLDB98]

 Different model of proximity based on adding up support

 No edge weights, prestige, different evaluation algorithm

 Information units (linked Web pages) [WWW10]

 No directionality, only studied in Web context

 Microsoft DBExplorer (this conference)

 No ranking, based on SQL generation

 Addresses efficient construction of text indexes

 Microsoft English query

03/13/06 Gaurav Kumar/ Esha Palta 60

Extensions



Summarization of output

group the output tuples into sets that have same tree

structure

define the notion of similarity between two result trees

perform restricted search

Metadata queries (attribute:keyword queries)

For example: author:levy

match all the tuples of a relation

 costly

Forward searching approach



03/13/06 Gaurav Kumar/ Esha Palta 61

Proposed Conclusions and Future

Work

BANKS is an integrated browsing and keyword

querying system for relational databases

Future work:

 Keyword queries on XML



 Disambiguating queries by selecting



 Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”

 Tree structure: “coauthors” or “cites”

 Boolean queries

 Metadata queries



Summarization of output

03/13/06 Gaurav Kumar/ Esha Palta 62


Related docs
Other docs by HC111124003428
INDICE
Views: 0  |  Downloads: 0
CH 08
Views: 2  |  Downloads: 0
UNRESERVED-FEMALE-ART
Views: 0  |  Downloads: 0
Understanding Global Cultures
Views: 2  |  Downloads: 0
T 520A 09
Views: 4  |  Downloads: 0
??? 56-1
Views: 0  |  Downloads: 0
ANEXO 1
Views: 53  |  Downloads: 0
The ISO/CASCO Toolbox
Views: 3  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!