Keyword Searching in Relational Databases Esha Palta _05329017 by wuxiangyu

VIEWS: 8 PAGES: 62

									Keyword Searching in Relational Databases




           Esha Palta (05329017)
            Kumar Gaurav Bijay (02005013)
03/13/06     Gaurav Kumar/ Esha Palta   1
           Dilbert Strip 




03/13/06   Gaurav Kumar/ Esha Palta   2
                      Motivation
   Keyword search




   We have SQL, why keyword-querying?
       SQL - not appropriate for naive users
       So many online databases (imdb, citeseer, bseindia …)
        – user cannot keep track of schema for all of these

03/13/06              Gaurav Kumar/ Esha Palta           3
                   Simple Approaches
 Using Form interfaces
       Require separate form for each type of query – confusing
       Not suitable for ad-hoc queries – how many forms will you
        provide?


   How about Google?
       Export data from db to documents and do keyword- querying on
        these
       Suffers from duplication overheads
       Google wants all keywords in one document. DB is often
        normalized, so need to join tables and store as documents
       Multiple combinations of tables to join. Not scalable …


03/13/06                Gaurav Kumar/ Esha Palta                    4
             Differences from Web Search
  Related data split across multiple tuples due to
   normalization
Cites          Paper            Writes         Author
(Citing,
               (PaperId,        (AuthorId,     (AuthorId,
               PaperName)       PaperId)       AuthorName)
Cited)


                  The DBLP Bibliography Schema
   Different keywords may match tuples from different
    relations
          What joins are to be computed can only be decided on the fly
          Need to find result containing all keywords and rank them
           somehow

03/13/06                  Gaurav Kumar/ Esha Palta                   5
           Systems for DB search

BANKS (Browsing and Keyword Search)
– IITB (ICDE ‟02)
DBXplorer – Microsoft Research (ICDE ‟02)
ObjectRank – IBM, UCSD, FIU (VLDB ‟04)
Bidirectional BANKS – IITB (VLDB ‟05)




03/13/06        Gaurav Kumar/ Esha Palta     6
           Systems for DB search

BANKS (Browsing and Keyword Search)
– IITB (ICDE ‟02)
DBXplorer – Microsoft Research (ICDE ‟02)
ObjectRank – IBM, UCSD, FIU (VLDB ‟04)
Bidirectional BANKS – IITB (VLDB ‟05)
                                           will cover
                                           in depth




03/13/06        Gaurav Kumar/ Esha Palta          7
           BANKS (ICDE ’02)




03/13/06      Gaurav Kumar/ Esha Palta   8
                 The BANKS system

BANKS Architecture
               HTTP                         JDBC
        User                BANKS

                          Web-server               Database

   Available on the web
       http://www.cse.iitb.ac.in/banks
   Connects to database using JDBC
       JDBC metadata features used to provide schema
        browsing
   Preprocesses db
03/13/06              Gaurav Kumar/ Esha Palta                9
                                 Basic Model
     Database: modeled as a graph
          Nodes = tuples
          Edges = references between tuples
                 foreign key (assume for this talk), inclusion dependencies, ..
                 Edges are directed.

                                                               PaperId:PaperName

BANKS01:Keyword Search                MO:MultiQuery Optimizn           paper
                                                                AuthorID:PaperId

Charuta:BANKS01                                                        writes
                                                                       AuthorId
 Charuta                         S. Sudarshan          Prasan Roy      author

   03/13/06
                                      example
                              DBLP Kumar/ Esha Palta
                               Gaurav                                       10
                      The BANKS Answer Model

    Query: set of search terms {t1, t2, .., tn}
          For each search term ti we find set of nodes Si matching ti
          Eg: Query = Sudarshan Roy (t1= Sudarshan, t2 = Roy)

    Answer: rooted, directed tree connecting nodes matching
     keywords
          Root node has special significance, may be restricted to some
           relations
                  E.g. relations representing entities, not relationships
          May include intermediate nodes not in any Si (Steiner Tree)

    Multiple answers
          Ranking based on proximity + prestige
    03/13/06                       Gaurav Kumar/ Esha Palta                  11
                     Answer Example
Query: sudarshan roy                                 Paper
                      MultiQuery Optimization

           Writes                                       Writes


 Author                                                             Author
           S. Sudarshan                                Prasan Roy


    We would like to find sets of (closely)
     connected tuples that match all given
     keywords


03/13/06                  Gaurav Kumar/ Esha Palta                      12
                       Edge Directionality
Directed tree will miss desired answers. For eg:
    Query = DBXplorer ObjectRank
               CitedBy                    Cited
      BANKS              Cites                        DBXPlorer


               Cites           ObjectRank

So, for each forward edge, BANKS adds a back
 edge
                   Cited                 CitedBy
       BANKS                 Cites                     DBXPlorer


               Cites         ObjectRank


03/13/06                   Gaurav Kumar/ Esha Palta                13
                   Edge Directionality

   What if we ignore directionality?
          Some popular tuples are connected to many other
           tuples
          E.g. Students -> departments -> university
   Problem: A popular tuple would create
    misleading shortcuts between tuples
          E.g. every student would be closely linked with every
           other student via the department/university
   Solution: define different forward and backward
    edge weights
          Forward edges: In the direction of the foreign key
           reference
03/13/06                Gaurav Kumar/ Esha Palta            14
                          Edge Weight

    Weight of forward edge based on schema
          e.g. citation link weights > “writes” link weights
    Weight of backward edge = indegree of edges
     pointing to the node
                      3

                                1
                      3
                                 1

                      3
                                 1


03/13/06                  Gaurav Kumar/ Esha Palta          15
                  Edge Weight Scaling

   Normalize edge score Escore(e)
          Make edge weight scale-free by dividing edge
           weigth by wmin
   Problem: Some backward edges have unduly
    large weights
          Depress the scale by defining Escore(e) as
           log(1+w(e)/wmin )
   Overall Escore E = 1 / (1 + e Escore(e))




03/13/06               Gaurav Kumar/ Esha Palta           16
                       Node Weight

   Set weight of a Node = Indegree of the node
     As per prestige rankings nodes with multiple
      pointers to them get a higher prestige
     So, higher node weight corresponds to higher
      prestige
   Problem: Nodes with many in-edges result in
    skewed answers
          Subdue extreme node weights by using
           log(1+indegree)
   Node score Nscore = Average of node scores
    (root-node-weight +  leaf-node-weights)
03/13/06              Gaurav Kumar/ Esha Palta       17
                    Combining Scores

    Combining two independent metrics: node
     weight and edge weight
          Normalize each to 0-1
          Combine using weighting factor 
             Additive: (1- ) Escore +  Nscore

             Multiplicative:  Escore * Nscore
    Performance study to compare alternatives
     and to find reasonable values for 




03/13/06               Gaurav Kumar/ Esha Palta    18
                First Step – Symbol Table

The first step is to build a symbol table
This table is in the db and is not normalized
Example:
            Keyword    List of Matching Nodes
            Database   {NICDE_2, NVLDB_3, …}
            Search     {NBANKS1, NBANKS2, NDBXPLR,…}
            Rank       {NOBJRNK, NXRANK, NSPHSRCH, …}
            …

 03/13/06               Gaurav Kumar/ Esha Palta        19
                Searching for Best Answers

   Backward Expanding Search Algorithm:
          Assume: graph fits in memory
          Idea: find vertices from which a forward path exists to at
           least one node from each Si.
          Run concurrent single source shortest path algorithm
           from each node matching a keyword
               Create an iterator for each node matching a keyword
                   Traverse the graph edges in reverse direction

               Output a node whenever it is on the intersection of the sets
                of nodes reached from each keyword
   Answer trees may not be generated in
    relevance order
03/13/06                    Gaurav Kumar/ Esha Palta                   20
            Backward Expanding Search
 Query: sudarshan roy

   paper              MultiQuery Optimization




writes




 authors       S. Sudarshan                    Prasan Roy

 03/13/06           Gaurav Kumar/ Esha Palta
                               Iterators                    21
           BANKS Query Result Example
   Result of “Sudarshan Roy”




03/13/06          Gaurav Kumar/ Esha Palta   22
                     Result Ordering
   Answers need not be always in Relevance order



                This tree is output
  Better Root
   Missed
                                              2        2



            5                                      2
                         2                1




03/13/06                Gaurav Kumar/ Esha Palta           23
                  Result Ordering (contd…)

   Solution:
       Generate all connection trees and then sort them
              Increases computation costs and leads to a greatly increased
               time to generate initial results
       Create a small heap ordered on the relevance of the
        trees
              Output highest ranked tree from heap to user when heap is
               full
What about duplicate results?
       Maintain a list of generated results for duplicate
        detection
       Discard result according to relevance
03/13/06                    Gaurav Kumar/ Esha Palta                   24
           Experience and Performance

   BANKS provides keyword search coupled with
    extensive browsing facilities
       Schema browsing + data browsing
       Graphical display of data
   Implemented using Java + servlets
   Keyword search response times typically 1 to 3
    seconds on
       DBLP database with 100,000 tuples/300,000 edges
       P3 600 MHz, 512 MB RAM


03/13/06             Gaurav Kumar/ Esha Palta         25
                          Anecdotes

    “Mohan”
          Returns C. Mohan at top based on prestige
           (number of papers written)
    “Transaction”
          Returns Jim Gray‟s classic paper and textbook as
           top answers based on prestige (number of
           citations)
    “Sunita Seltzer”
          No common papers, but both have papers with
           Stonebraker: system finds this connection


03/13/06               Gaurav Kumar/ Esha Palta           26
                 Effect of Parameters
     Log scaling of edge weights worked well
     (1- ) E +  N versus E N  made little difference
     Best with  = .2 (subdue node weights but not entirely)




03/13/06               Gaurav Kumar/ Esha Palta                 27
           BANKS (VLDB ’05)




03/13/06        Gaurav Kumar/ Esha Palta   28
                          Motivation
BANKS performs poorly if
    Keyword matches lot of nodes (so lot of Dijkstra
     sources)

                   …
                                      Wastes time


           Sudarshan            Roy


    Search hits a node with large fan – in.




03/13/06               Gaurav Kumar/ Esha Palta         29
           New Ideas – Forward Search

Why only backward, lets search forward too :
                                      How about fwd
                                       Searching ?
                   …




           Sudarshan            Roy




03/13/06               Gaurav Kumar/ Esha Palta       30
             New Ideas - Activation

Activation :- Cannot forward search from each
 node.
    Spread activation from keyword nodes to others.
    Activation is like Page Rank with decay.
     High Activation  close to many keywords.




 03/13/06          Gaurav Kumar/ Esha Palta            31
               Activation Spreading

Spreading Activation
    Node with highest activation explored first
Activation spread to neighbors (μ = 0.3)




Gives low activation to neighbors of hubs

 03/13/06           Gaurav Kumar/ Esha Palta       32
           Modifications to Model

Graph model stays the same.
BANKS is concerned with search more than how
 to tune parameters or define node – weights /
 edge – weights.
BANKS code :
 Tree Node – Score, N =      node  prestige(u)
                                   uleaf  Root


  Tree Edge – Score, E =                 1
                1       cos t _ of _ path _ from _ root _ to _ t i
                     ti keyword



Total Score = EN          ( = 0.2)
03/13/06         Gaurav Kumar/ Esha Palta                       33
               The New Algorithm

Need two priority queues :
 Qin - do backward search from these nodes
  Qout - do forward search from these nodes

Each node, n keeps 3 variables per keyword, ti
    sp [i] : Node to got to from n for shortest-path to ti
    distance [i] : Length of the shortest-path from n to ti
    Activation [i] : Activation to n from keyword „ti‟




03/13/06           Gaurav Kumar/ Esha Palta             34
      The New Algorithm – continued…
 Set initial activation keyword nodes and add to Qin for
  backward-search.
 At each step, pick node with maximum activation
 i.e. if (Qin.getMaxActivation > Qout. getMaxActivation))
                 // use node from Qin
      else
                 // use node from Qout
 If node from Qin, do backward search and add itself to
  Qout. (newly explored nodes into Qin)
 If node from Qout, do forward search
 If node has reached from all keyword, generate result-
  tree. [answer is buffered as results can be out of order]
03/13/06            Gaurav Kumar/ Esha Palta            35
                Explanation with example

                                                    Qin              Qout
                      N100
                N4                                         Roy
                                  N1
           N3         …                                  Sudarshan

                             N2


                Roy                    Sudarshan




03/13/06                      Gaurav Kumar/ Esha Palta                      36
                Explanation with example

                                                    Qin        Qout
                      N100
                N4                                                    Roy
                                                          N1
                                  N1
           N3         …                                   N2
                                                                 Sudarshan

                             N2


                Roy                    Sudarshan




03/13/06                      Gaurav Kumar/ Esha Palta                       37
                Explanation with example

                                                    Qin         Qout
                      N100
                                                          N2           N1
                N4
                                  N1                      N3           Roy
           N3         …
                                                          …       Sudarshan
                             N2
                                                         N100

                Roy                    Sudarshan



                       Result Found !



03/13/06                      Gaurav Kumar/ Esha Palta                        38
           Generation of top-k results

 If we know the score of next-best answer, all buffered
  answers with better score can be output.




 Need upper bounds
03/13/06           Gaurav Kumar/ Esha Palta           39
            Computation of upper bound

For each keyword ti, we have explored nodes
 upto some length – say li.

                                                 1
So, next – best – score (approx.) =
                                              1   li

This is not a true upper bound, but works quite
 well and is simple !




 03/13/06          Gaurav Kumar/ Esha Palta              40
            Are we losing answers ?

BANKS – I used many Dijkstra states, BANKS – II
 uses 2 only – forward and backward search-
 states.

The result is that we can now lose answers !




 03/13/06        Gaurav Kumar/ Esha Palta       41
                 Answer Loss Example

                            Ny


                 K1        Nx                K2


                                K1


                                             Ny
                Ny

                                                 Nx       K2
           K1         K2
                                                 K1
 This is the generated answer.
                                          This answer is lost.
03/13/06              Gaurav Kumar/ Esha Palta                   42
But, we will generate this tree rooted at Nx:

           NY



                       K2
           NX




           K1


So, a rotated tree with same nodes but
 different root is often generated !
03/13/06        Gaurav Kumar/ Esha Palta         43
                Metrics of Performance
 Manually obtain best relevant answers.

 Determine 2 times :
     1. Time taken to produce last relevant answer.
     2. Time taken to output last relevant answer.

 Search algorithms
      MI-Bkwd: original backward search
            Iterator for every node matching a keyword
      SI-Bkwd: backward search with single backward iterator
      Bidirec: bidirectional search

 Datasets
      DBLP, IMDB ~ 2 million nodes, 9 million edges
      US Patent DB ~ 4 million nodes, 15 million edges
03/13/06                 Gaurav Kumar/ Esha Palta               44
                       Graph - I

    MI-Bkwd versus SI-Bkwd




 SI-Bkwd gain increases with origin size, # keywords
03/13/06           Gaurav Kumar/ Esha Palta             45
                      Graph - II
 SI-Bkwd versus Bidirec




   Bidirec gain increases with origin size, # keywords
03/13/06           Gaurav Kumar/ Esha Palta       46
                    A Critique

BANKS needs a lot of memory.

Need to cluster and keep parts of graph on disk.

Work is in progress 




 03/13/06       Gaurav Kumar/ Esha Palta      47
           DBXplorer (ICDE ’02)




03/13/06        Gaurav Kumar/ Esha Palta   48
       DBXplorer : (Microsoft Research)

Use symbol – table to determine which tables to
 join.
Generate all possible table – join combinations :

Figure :

 T1, T2, T3,
 T4 and T5
  are tables




 03/13/06        Gaurav Kumar/ Esha Palta      49
             Cool ideas in DBXplorer
Symbol table need not be at tuple level. If
 column has an index, column – level symbol
 table is ok.
Table Compression :
e.g. : Keywords Columns   Keywords        Columns
            K1                              K1
                    C1                                     C1
            K2                              K2
                                                 X
            K3      C2                  K3                 C2

            K4                              K4

            K5                              K5
                                                 Intermediate Column

 03/13/06        Gaurav Kumar/ Esha Palta                       50
      ObjectRank (VLDB ‘04)




03/13/06     Gaurav Kumar/ Esha Palta   51
            ObjectRank (IBM, FIU, UCSD)

Creates objects in database. Object definition is
 manual.
 e.g. in DBLP, author, conference and paper can
 be defined as objects.

Heavily inspired by PageRank.

Each node is given global ObjectRank just like
 PageRank of Google.

 03/13/06          Gaurav Kumar/ Esha Palta     52
                   ObjectRank Ideas

Keyword-level ObjectRank : for each keyword,
 precompute and save object ranks of nodes [can
 optimize by defining cut-off)

Score of node, n w.r.t. keyword k :
        scorek(n) = f (Global-object-rank (n), Objectrankk (n))


At run time, scores are combined :
 scorek1,k2,…,km(n) = scorek1(n) * scorek2(n) * …* scorekm(n)

 03/13/06              Gaurav Kumar/ Esha Palta                   53
ObjectRank Algorithm and answers

If graph is DAG or near DAG, topologically sort
 and spread ObjectRank in this order.
Answers are single objects and not Cluster /
 group as in BANKS.
Demo at :
  http://teriyaki.ucsd.edu:9099/objrank/main05_new.html




 03/13/06             Gaurav Kumar/ Esha Palta            54
                  Conclusion

Studied BANKS, both versions.

Covered cool ideas from DBXplorer and
 ObjectRank.

Graph of BANKS must be made disk-resident.




03/13/06       Gaurav Kumar/ Esha Palta       55
                            References

 Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and
  S. Sudarshan.
  Keyword Searching and Browsing in Databases using BANKS.
  In International Conference on Data Engineering (ICDE), pages 1083–1096,
  2002.
 Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan et. al.
  Bidirectional Expansion for Keyword Search on Graph Databases.
  In VLDB Conference, pages 505–516, 2005.
 Sanjay Agrawal, Surajit Chaudhari, and Gautam Das.
  DBXplorer: A System for Keyword-Based Search over Relational Databases.
  In International Conference on Data Engineering (ICDE), pages 5–22, 2002.
 Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou.
  ObjectRank: Authority-Based Keyword Search in Databases.
  In VLDB Conference, pages 564–575, 2004.



 03/13/06                Gaurav Kumar/ Esha Palta                     56
           Appendix

            Extra slides




03/13/06   Gaurav Kumar/ Esha Palta   57
                Browsing - May add??????
   Hyperlinks are there for all primary key foreign key
    attributes
   Each table is displayed with set of tools for interacting
    with data
          Projection (using drop), Selection, Join, Group-by, Sort
   Template facilities to do a variety of tasks
          Browsing data by grouping and creating crosstabs
               e.g., theses grouped by department and year
          Hierarchical views of data
               Nested XML style, even on relational data
          Graphical displays
               Bar charts, pie charts, etc
   Templates are generic and can be applied on any data
    matching assumed schema
          Can be applied after applying selections
03/13/06
          New templates can be created by user, interactively
                          Gaurav Kumar/ Esha Palta                    58
           Example of Browsing in BANKS




03/13/06          Gaurav Kumar/ Esha Palta   59
                             Related Work

    DataSpot (DTL)/Mercado Intuifind [VLDB 98]
          Based on patent by Palmon (filed 1995, granted 1998)
          Based on hypergraph model, similar answer model to ours
          Differences: our model of backward link weights and prestige
    Proximity Search [VLDB98]
          Different model of proximity based on adding up support
          No edge weights, prestige, different evaluation algorithm
    Information units (linked Web pages) [WWW10]
          No directionality, only studied in Web context
    Microsoft DBExplorer (this conference)
          No ranking, based on SQL generation
          Addresses efficient construction of text indexes
    Microsoft English query
    03/13/06                 Gaurav Kumar/ Esha Palta                  60
                           Extensions

Summarization of output
    group the output tuples into sets that have same tree
     structure
    define the notion of similarity between two result trees
    perform restricted search
Metadata queries (attribute:keyword queries)
    For example: author:levy
    match all the tuples of a relation
               costly
    Forward searching approach

 03/13/06                Gaurav Kumar/ Esha Palta       61
        Proposed Conclusions and Future
                    Work
BANKS is an integrated browsing and keyword
  querying system for relational databases
Future work:
 Keyword queries on XML

 Disambiguating queries by selecting

       Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”
       Tree structure: “coauthors” or “cites”
 Boolean queries
 Metadata queries

Summarization of output
03/13/06             Gaurav Kumar/ Esha Palta     62

								
To top