DBXplorer by pptfiles


									DBXplorer: A System for Keyword Based
   Search over Relational Databases
  Sanjay Agrawal                      Microsoft Research
  Surajit Chaudhuri                   Microsoft Research
  Gautam Das                          Microsoft Research
                      Presented by:
  DEEP PANCHOLI                              1000556121
 The two most common types of search are Structured Search
  and Keyword Based Search
 Example of Structured Search http://autos.yahoo.com/
 A similar example is to search for books in bookseller’s
  database e.g. Books->Travel->Maps->Asia->Russia
 We all already know what is keyword based search and one
  example can be searching for ‘Jim Gray’ on Microsoft
  Intranet to obtain matched rows
Introduction (cont)
 Problems faced with Keyword based search implementation
   Need to know schema
   Normalized databases
   Availability of indexes
 Built on the concept laid by BANKS paper explained in the
  last lecture.
 Symbol tables
 Compacting the symbol tables
 Search requirements
Overview of DBXplorer
 DBXplorer returns all rows either from single table or from
  multiple tables, using FK-joins, such that each row has all the
 Publish
   1. Identify a database and tables and columns within it that are to be
    enabled for search
   2. Create auxiliary tables (Symbol Tables)
 Search
   1. Look up the Symbol table
   2. Searching in possible subsets of tables
   3. Construct and execute SQL statement and rank the results before
     displaying to user
Different Symbol Table Designs
 We will only consider exact match problem
 Two important levels of granularities
   Column level granularity (Pub-Col)
   Cell level granularity (Pub-Cell)


   Fname             Lname

   John              Marshall

    John             Shawn

   Shawn             Archer

   Jacquelyn         Marshall
Factors that affect granularity
 Space and time requirements
   Pub-Col is faster and occupies less space
 Keyword search performance
   Pub-Col if there is an index on the column
 Ease of Symbol table maintenance
   Pub-Col is easier to maintain as it contains updates only if there is
    addition of a new distinct values
 Hence, the Pub-Col alternative is almost always better than Pub-
  Cell unless if certain columns contain no indexes
 If an index is available for column, we should use Pub-Col
Pub-Col representation
 Store simply as Keyword-ColId
 Alternative is to use Hashvalue-ColId since storing keywords
  is wasteful as strings can be long and of varying lengths
 Compression Algorithms
 FK-Comp: If column c1 is a subset of values in another
  column c2, we retain only values in c1
 CP-Comp: It is used when pairs of columns share common
  keywords but are not tied by FK
Pub-Col Algorithm
Search Component
 Common step for all kinds of granularities
 It makes use of join trees
 Hence, if we join tables that occur in the join tree the
  resulting relation will contain all potential rows containing all
  keywords specified in the query
 Example of graph tree
 Finally SQL query is generated and run
 The result is then ranked before outputting. The basic
  approach is to rank them based on the number of joins
  involved which is similar to Banks approach
Search Algorithm
Case of Token matches
 Token matches are matches in which keyword match with a
  token or a substring of attribute value

  Pub-Prefix method efficiently enables token match
   capabilities by exploiting available B+ tree indexes
  Symbol table has entry (hash(k),T.C, P)
Case of Token matches (cont)
 Pub-Prefix method result is comparable to Pub-Cell method
  when the column width is small (i.e. less than 100
 For columns where strings are greater than hundreds of
  characters, Pub-Cell outperforms Pub-Prefix significantly
 Important issue is to determine the appropriate prefix length
  stored in symbol table. However, Pub-Prefix method is still
  being researched upon
 Other research is going on in field of stemming of query
Experimental Results
 The experiments were carried out on a 450MHz 256 MB
  Intel P-3 machine. There were 4 databases used for
   TPC-H data of sizes from 100 to 500 MB
   USR is Microsoft employee address DB of 130 MB with 19
   ML is a 375 MB mailing list DB with 38 tables
   KB is a 365 MB DB with 84 tables containing information on
    articles and help manuals on various shipped products
System Architecture for DBXplorer
Experimental Results (cont)
 In particular the authors show the following:
   Pub-Col is compact compared to Pub-Cell
   Pub-Col scales linearly with data size and is independent of data
   Pub-Prefix is compact compared to Pub-Cell and has a
    significantly better performance when full text indexes are not
Pub-Col and Pub-Cell symbol table size
Symbol table publishing time
Query performance
Other Observations
 It was also noticed that search scales with number of query
  keywords. The query was varied with 2 to 10 keywords and
  still the average query time was between 1 to 1.3 seconds
 Also, it was noticed that FK-Comp and CP-Comp reduce the
  size of Pub-Col by a factor of 0.45 to 0.90 depending on size
  of original table
 However, it was noticed that compression added a negligible
  overhead on search performance
Effectiveness of Pub-Prefix method
 The Pub-prefix method was tested on workload consisting of
  100 random keywords from character column of width 64
  bytes in the KB database.
 It was noticed that the performance of Pub-Prefix increased
  with increase in Pub-Prefix length and gave the optimum
  performance at prefix-length of 8
 This is because as the length increases, beyond a certain limit
  the optimizer decides to scan the original table compared to
  index search
 Although, we discussed only about a single database query,
  this technique can be applied to search multiple databases
 DBXplorer is easy to use with any Database Management
 As mentioned before, the Pub-Col alternative is the best
  when columns have indexes on them. A hybrid table can be
  created so that if there is an index for a column, we use Pub-
  Col granularity and if there is no index, we use Pub-Cell

To top