Keyword Searching in Relational Databases
Esha Palta (05329017)
Kumar Gaurav Bijay (02005013)
03/13/06 Gaurav Kumar/ Esha Palta 1
Dilbert Strip
03/13/06 Gaurav Kumar/ Esha Palta 2
Motivation
Keyword search
We have SQL, why keyword-querying?
SQL - not appropriate for naive users
So many online databases (imdb, citeseer, bseindia …)
– user cannot keep track of schema for all of these
03/13/06 Gaurav Kumar/ Esha Palta 3
Simple Approaches
Using Form interfaces
Require separate form for each type of query – confusing
Not suitable for ad-hoc queries – how many forms will you
provide?
How about Google?
Export data from db to documents and do keyword- querying on
these
Suffers from duplication overheads
Google wants all keywords in one document. DB is often
normalized, so need to join tables and store as documents
Multiple combinations of tables to join. Not scalable …
03/13/06 Gaurav Kumar/ Esha Palta 4
Differences from Web Search
Related data split across multiple tuples due to
normalization
Cites Paper Writes Author
(Citing,
(PaperId, (AuthorId, (AuthorId,
PaperName) PaperId) AuthorName)
Cited)
The DBLP Bibliography Schema
Different keywords may match tuples from different
relations
What joins are to be computed can only be decided on the fly
Need to find result containing all keywords and rank them
somehow
03/13/06 Gaurav Kumar/ Esha Palta 5
Systems for DB search
BANKS (Browsing and Keyword Search)
– IITB (ICDE ‟02)
DBXplorer – Microsoft Research (ICDE ‟02)
ObjectRank – IBM, UCSD, FIU (VLDB ‟04)
Bidirectional BANKS – IITB (VLDB ‟05)
03/13/06 Gaurav Kumar/ Esha Palta 6
Systems for DB search
BANKS (Browsing and Keyword Search)
– IITB (ICDE ‟02)
DBXplorer – Microsoft Research (ICDE ‟02)
ObjectRank – IBM, UCSD, FIU (VLDB ‟04)
Bidirectional BANKS – IITB (VLDB ‟05)
will cover
in depth
03/13/06 Gaurav Kumar/ Esha Palta 7
BANKS (ICDE ’02)
03/13/06 Gaurav Kumar/ Esha Palta 8
The BANKS system
BANKS Architecture
HTTP JDBC
User BANKS
Web-server Database
Available on the web
http://www.cse.iitb.ac.in/banks
Connects to database using JDBC
JDBC metadata features used to provide schema
browsing
Preprocesses db
03/13/06 Gaurav Kumar/ Esha Palta 9
Basic Model
Database: modeled as a graph
Nodes = tuples
Edges = references between tuples
foreign key (assume for this talk), inclusion dependencies, ..
Edges are directed.
PaperId:PaperName
BANKS01:Keyword Search MO:MultiQuery Optimizn paper
AuthorID:PaperId
Charuta:BANKS01 writes
AuthorId
Charuta S. Sudarshan Prasan Roy author
03/13/06
example
DBLP Kumar/ Esha Palta
Gaurav 10
The BANKS Answer Model
Query: set of search terms {t1, t2, .., tn}
For each search term ti we find set of nodes Si matching ti
Eg: Query = Sudarshan Roy (t1= Sudarshan, t2 = Roy)
Answer: rooted, directed tree connecting nodes matching
keywords
Root node has special significance, may be restricted to some
relations
E.g. relations representing entities, not relationships
May include intermediate nodes not in any Si (Steiner Tree)
Multiple answers
Ranking based on proximity + prestige
03/13/06 Gaurav Kumar/ Esha Palta 11
Answer Example
Query: sudarshan roy Paper
MultiQuery Optimization
Writes Writes
Author Author
S. Sudarshan Prasan Roy
We would like to find sets of (closely)
connected tuples that match all given
keywords
03/13/06 Gaurav Kumar/ Esha Palta 12
Edge Directionality
Directed tree will miss desired answers. For eg:
Query = DBXplorer ObjectRank
CitedBy Cited
BANKS Cites DBXPlorer
Cites ObjectRank
So, for each forward edge, BANKS adds a back
edge
Cited CitedBy
BANKS Cites DBXPlorer
Cites ObjectRank
03/13/06 Gaurav Kumar/ Esha Palta 13
Edge Directionality
What if we ignore directionality?
Some popular tuples are connected to many other
tuples
E.g. Students -> departments -> university
Problem: A popular tuple would create
misleading shortcuts between tuples
E.g. every student would be closely linked with every
other student via the department/university
Solution: define different forward and backward
edge weights
Forward edges: In the direction of the foreign key
reference
03/13/06 Gaurav Kumar/ Esha Palta 14
Edge Weight
Weight of forward edge based on schema
e.g. citation link weights > “writes” link weights
Weight of backward edge = indegree of edges
pointing to the node
3
1
3
1
3
1
03/13/06 Gaurav Kumar/ Esha Palta 15
Edge Weight Scaling
Normalize edge score Escore(e)
Make edge weight scale-free by dividing edge
weigth by wmin
Problem: Some backward edges have unduly
large weights
Depress the scale by defining Escore(e) as
log(1+w(e)/wmin )
Overall Escore E = 1 / (1 + e Escore(e))
03/13/06 Gaurav Kumar/ Esha Palta 16
Node Weight
Set weight of a Node = Indegree of the node
As per prestige rankings nodes with multiple
pointers to them get a higher prestige
So, higher node weight corresponds to higher
prestige
Problem: Nodes with many in-edges result in
skewed answers
Subdue extreme node weights by using
log(1+indegree)
Node score Nscore = Average of node scores
(root-node-weight + leaf-node-weights)
03/13/06 Gaurav Kumar/ Esha Palta 17
Combining Scores
Combining two independent metrics: node
weight and edge weight
Normalize each to 0-1
Combine using weighting factor
Additive: (1- ) Escore + Nscore
Multiplicative: Escore * Nscore
Performance study to compare alternatives
and to find reasonable values for
03/13/06 Gaurav Kumar/ Esha Palta 18
First Step – Symbol Table
The first step is to build a symbol table
This table is in the db and is not normalized
Example:
Keyword List of Matching Nodes
Database {NICDE_2, NVLDB_3, …}
Search {NBANKS1, NBANKS2, NDBXPLR,…}
Rank {NOBJRNK, NXRANK, NSPHSRCH, …}
…
03/13/06 Gaurav Kumar/ Esha Palta 19
Searching for Best Answers
Backward Expanding Search Algorithm:
Assume: graph fits in memory
Idea: find vertices from which a forward path exists to at
least one node from each Si.
Run concurrent single source shortest path algorithm
from each node matching a keyword
Create an iterator for each node matching a keyword
Traverse the graph edges in reverse direction
Output a node whenever it is on the intersection of the sets
of nodes reached from each keyword
Answer trees may not be generated in
relevance order
03/13/06 Gaurav Kumar/ Esha Palta 20
Backward Expanding Search
Query: sudarshan roy
paper MultiQuery Optimization
writes
authors S. Sudarshan Prasan Roy
03/13/06 Gaurav Kumar/ Esha Palta
Iterators 21
BANKS Query Result Example
Result of “Sudarshan Roy”
03/13/06 Gaurav Kumar/ Esha Palta 22
Result Ordering
Answers need not be always in Relevance order
This tree is output
Better Root
Missed
2 2
5 2
2 1
03/13/06 Gaurav Kumar/ Esha Palta 23
Result Ordering (contd…)
Solution:
Generate all connection trees and then sort them
Increases computation costs and leads to a greatly increased
time to generate initial results
Create a small heap ordered on the relevance of the
trees
Output highest ranked tree from heap to user when heap is
full
What about duplicate results?
Maintain a list of generated results for duplicate
detection
Discard result according to relevance
03/13/06 Gaurav Kumar/ Esha Palta 24
Experience and Performance
BANKS provides keyword search coupled with
extensive browsing facilities
Schema browsing + data browsing
Graphical display of data
Implemented using Java + servlets
Keyword search response times typically 1 to 3
seconds on
DBLP database with 100,000 tuples/300,000 edges
P3 600 MHz, 512 MB RAM
03/13/06 Gaurav Kumar/ Esha Palta 25
Anecdotes
“Mohan”
Returns C. Mohan at top based on prestige
(number of papers written)
“Transaction”
Returns Jim Gray‟s classic paper and textbook as
top answers based on prestige (number of
citations)
“Sunita Seltzer”
No common papers, but both have papers with
Stonebraker: system finds this connection
03/13/06 Gaurav Kumar/ Esha Palta 26
Effect of Parameters
Log scaling of edge weights worked well
(1- ) E + N versus E N made little difference
Best with = .2 (subdue node weights but not entirely)
03/13/06 Gaurav Kumar/ Esha Palta 27
BANKS (VLDB ’05)
03/13/06 Gaurav Kumar/ Esha Palta 28
Motivation
BANKS performs poorly if
Keyword matches lot of nodes (so lot of Dijkstra
sources)
…
Wastes time
Sudarshan Roy
Search hits a node with large fan – in.
03/13/06 Gaurav Kumar/ Esha Palta 29
New Ideas – Forward Search
Why only backward, lets search forward too :
How about fwd
Searching ?
…
Sudarshan Roy
03/13/06 Gaurav Kumar/ Esha Palta 30
New Ideas - Activation
Activation :- Cannot forward search from each
node.
Spread activation from keyword nodes to others.
Activation is like Page Rank with decay.
High Activation close to many keywords.
03/13/06 Gaurav Kumar/ Esha Palta 31
Activation Spreading
Spreading Activation
Node with highest activation explored first
Activation spread to neighbors (μ = 0.3)
Gives low activation to neighbors of hubs
03/13/06 Gaurav Kumar/ Esha Palta 32
Modifications to Model
Graph model stays the same.
BANKS is concerned with search more than how
to tune parameters or define node – weights /
edge – weights.
BANKS code :
Tree Node – Score, N = node prestige(u)
uleaf Root
Tree Edge – Score, E = 1
1 cost _ of _ path _ from _ root _ to _ t i
ti keyword
Total Score = EN ( = 0.2)
03/13/06 Gaurav Kumar/ Esha Palta 33
The New Algorithm
Need two priority queues :
Qin - do backward search from these nodes
Qout - do forward search from these nodes
Each node, n keeps 3 variables per keyword, ti
sp [i] : Node to got to from n for shortest-path to ti
distance [i] : Length of the shortest-path from n to ti
Activation [i] : Activation to n from keyword „ti‟
03/13/06 Gaurav Kumar/ Esha Palta 34
The New Algorithm – continued…
Set initial activation keyword nodes and add to Qin for
backward-search.
At each step, pick node with maximum activation
i.e. if (Qin.getMaxActivation > Qout. getMaxActivation))
// use node from Qin
else
// use node from Qout
If node from Qin, do backward search and add itself to
Qout. (newly explored nodes into Qin)
If node from Qout, do forward search
If node has reached from all keyword, generate result-
tree. [answer is buffered as results can be out of order]
03/13/06 Gaurav Kumar/ Esha Palta 35
Explanation with example
Qin Qout
N100
N4 Roy
N1
N3 … Sudarshan
N2
Roy Sudarshan
03/13/06 Gaurav Kumar/ Esha Palta 36
Explanation with example
Qin Qout
N100
N4 Roy
N1
N1
N3 … N2
Sudarshan
N2
Roy Sudarshan
03/13/06 Gaurav Kumar/ Esha Palta 37
Explanation with example
Qin Qout
N100
N2 N1
N4
N1 N3 Roy
N3 …
… Sudarshan
N2
N100
Roy Sudarshan
Result Found !
03/13/06 Gaurav Kumar/ Esha Palta 38
Generation of top-k results
If we know the score of next-best answer, all buffered
answers with better score can be output.
Need upper bounds
03/13/06 Gaurav Kumar/ Esha Palta 39
Computation of upper bound
For each keyword ti, we have explored nodes
upto some length – say li.
1
So, next – best – score (approx.) =
1 li
This is not a true upper bound, but works quite
well and is simple !
03/13/06 Gaurav Kumar/ Esha Palta 40
Are we losing answers ?
BANKS – I used many Dijkstra states, BANKS – II
uses 2 only – forward and backward search-
states.
The result is that we can now lose answers !
03/13/06 Gaurav Kumar/ Esha Palta 41
Answer Loss Example
Ny
K1 Nx K2
K1
Ny
Ny
Nx K2
K1 K2
K1
This is the generated answer.
This answer is lost.
03/13/06 Gaurav Kumar/ Esha Palta 42
But, we will generate this tree rooted at Nx:
NY
K2
NX
K1
So, a rotated tree with same nodes but
different root is often generated !
03/13/06 Gaurav Kumar/ Esha Palta 43
Metrics of Performance
Manually obtain best relevant answers.
Determine 2 times :
1. Time taken to produce last relevant answer.
2. Time taken to output last relevant answer.
Search algorithms
MI-Bkwd: original backward search
Iterator for every node matching a keyword
SI-Bkwd: backward search with single backward iterator
Bidirec: bidirectional search
Datasets
DBLP, IMDB ~ 2 million nodes, 9 million edges
US Patent DB ~ 4 million nodes, 15 million edges
03/13/06 Gaurav Kumar/ Esha Palta 44
Graph - I
MI-Bkwd versus SI-Bkwd
SI-Bkwd gain increases with origin size, # keywords
03/13/06 Gaurav Kumar/ Esha Palta 45
Graph - II
SI-Bkwd versus Bidirec
Bidirec gain increases with origin size, # keywords
03/13/06 Gaurav Kumar/ Esha Palta 46
A Critique
BANKS needs a lot of memory.
Need to cluster and keep parts of graph on disk.
Work is in progress
03/13/06 Gaurav Kumar/ Esha Palta 47
DBXplorer (ICDE ’02)
03/13/06 Gaurav Kumar/ Esha Palta 48
DBXplorer : (Microsoft Research)
Use symbol – table to determine which tables to
join.
Generate all possible table – join combinations :
Figure :
T1, T2, T3,
T4 and T5
are tables
03/13/06 Gaurav Kumar/ Esha Palta 49
Cool ideas in DBXplorer
Symbol table need not be at tuple level. If
column has an index, column – level symbol
table is ok.
Table Compression :
e.g. : Keywords Columns Keywords Columns
K1 K1
C1 C1
K2 K2
X
K3 C2 K3 C2
K4 K4
K5 K5
Intermediate Column
03/13/06 Gaurav Kumar/ Esha Palta 50
ObjectRank (VLDB ‘04)
03/13/06 Gaurav Kumar/ Esha Palta 51
ObjectRank (IBM, FIU, UCSD)
Creates objects in database. Object definition is
manual.
e.g. in DBLP, author, conference and paper can
be defined as objects.
Heavily inspired by PageRank.
Each node is given global ObjectRank just like
PageRank of Google.
03/13/06 Gaurav Kumar/ Esha Palta 52
ObjectRank Ideas
Keyword-level ObjectRank : for each keyword,
precompute and save object ranks of nodes [can
optimize by defining cut-off)
Score of node, n w.r.t. keyword k :
scorek(n) = f (Global-object-rank (n), Objectrankk (n))
At run time, scores are combined :
scorek1,k2,…,km(n) = scorek1(n) * scorek2(n) * …* scorekm(n)
03/13/06 Gaurav Kumar/ Esha Palta 53
ObjectRank Algorithm and answers
If graph is DAG or near DAG, topologically sort
and spread ObjectRank in this order.
Answers are single objects and not Cluster /
group as in BANKS.
Demo at :
http://teriyaki.ucsd.edu:9099/objrank/main05_new.html
03/13/06 Gaurav Kumar/ Esha Palta 54
Conclusion
Studied BANKS, both versions.
Covered cool ideas from DBXplorer and
ObjectRank.
Graph of BANKS must be made disk-resident.
03/13/06 Gaurav Kumar/ Esha Palta 55
References
Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and
S. Sudarshan.
Keyword Searching and Browsing in Databases using BANKS.
In International Conference on Data Engineering (ICDE), pages 1083–1096,
2002.
Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan et. al.
Bidirectional Expansion for Keyword Search on Graph Databases.
In VLDB Conference, pages 505–516, 2005.
Sanjay Agrawal, Surajit Chaudhari, and Gautam Das.
DBXplorer: A System for Keyword-Based Search over Relational Databases.
In International Conference on Data Engineering (ICDE), pages 5–22, 2002.
Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou.
ObjectRank: Authority-Based Keyword Search in Databases.
In VLDB Conference, pages 564–575, 2004.
03/13/06 Gaurav Kumar/ Esha Palta 56
Appendix
Extra slides
03/13/06 Gaurav Kumar/ Esha Palta 57
Browsing - May add??????
Hyperlinks are there for all primary key foreign key
attributes
Each table is displayed with set of tools for interacting
with data
Projection (using drop), Selection, Join, Group-by, Sort
Template facilities to do a variety of tasks
Browsing data by grouping and creating crosstabs
e.g., theses grouped by department and year
Hierarchical views of data
Nested XML style, even on relational data
Graphical displays
Bar charts, pie charts, etc
Templates are generic and can be applied on any data
matching assumed schema
Can be applied after applying selections
03/13/06
New templates can be created by user, interactively
Gaurav Kumar/ Esha Palta 58
Example of Browsing in BANKS
03/13/06 Gaurav Kumar/ Esha Palta 59
Related Work
DataSpot (DTL)/Mercado Intuifind [VLDB 98]
Based on patent by Palmon (filed 1995, granted 1998)
Based on hypergraph model, similar answer model to ours
Differences: our model of backward link weights and prestige
Proximity Search [VLDB98]
Different model of proximity based on adding up support
No edge weights, prestige, different evaluation algorithm
Information units (linked Web pages) [WWW10]
No directionality, only studied in Web context
Microsoft DBExplorer (this conference)
No ranking, based on SQL generation
Addresses efficient construction of text indexes
Microsoft English query
03/13/06 Gaurav Kumar/ Esha Palta 60
Extensions
Summarization of output
group the output tuples into sets that have same tree
structure
define the notion of similarity between two result trees
perform restricted search
Metadata queries (attribute:keyword queries)
For example: author:levy
match all the tuples of a relation
costly
Forward searching approach
03/13/06 Gaurav Kumar/ Esha Palta 61
Proposed Conclusions and Future
Work
BANKS is an integrated browsing and keyword
querying system for relational databases
Future work:
Keyword queries on XML
Disambiguating queries by selecting
Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”
Tree structure: “coauthors” or “cites”
Boolean queries
Metadata queries
Summarization of output
03/13/06 Gaurav Kumar/ Esha Palta 62