DBXplorer _ A System For Keyword-Based Search Over

Shared by: pptfiles
Categories
Tags
-
Stats
views:
0
posted:
2/24/2013
language:
Unknown
pages:
29
Document Sample
scope of work template
							Sanjay Agrawal                      Microsoft Research
Surajit Chaudhuri                   Microsoft Research
Gautham Das                         Microsoft Research



                     Presented by
                    Anusha Mannem
   Introduction

   Overview Of DBXplorer

   Publish & Design of Symbol Table

   Finding Matches For Keyword Search

   Support for Generalized Matches(Token Matching)

   Experimental Results

   Conclusions
   Internet search engines have popularized keyword based search.

  Alternative to Keyword-Based search is Structured Search.
Ex: Searching for books in Bookseller's database
  Books Travel  Lonely planet  Asia
                   (or)
  Books Travel  Rough Guides  Europe
Other similar example is http://m.www.yahoo.com/

   Example for Keyword-Based Search is searching for “Jim Gary” on
    Microsoft intranet to obtain matched rows, i.e., rows in DB where ‘Jim
    Gary’ occur.
   DBXplorer, an efficient and scalable keyword search utility for relational
    databases.

   The goal of DBXplorer is to enable keyword searches on multiple
    databases without requiring the users to know the schema of respective
    databases.

   How it works?
    When a DBXplorer is given a set of query keywords, it returns all the rows
    (either from single table, or by joining tables connected by foreign-key
    joins) such that each row contains all the keywords.
   Enabling this keyword search requires 2 steps

(a)Publish Step

(b)Search Step
   Publish : A databases is enabled for keyword search through the following
    steps.

   Step 1 : A database is identified, along with the set of tables or columns
    within the database to be published.

   Step 2 : Auxiliary tables are created for supporting keyword searches. The
    most important structure is Symbol Table S that is used at search time to
    efficiently determine the locations of query keywords in the database
   Search: Given a query consisting of a set of keywords, it is answered as
    follows.

   Step 1: Looks up symbol table to identify the tables, and columns/rows of
    the database which contain the query keywords.

   Step 2: Identify and enumerate all possible joins (subsets of table if
    joined) whose rows might contain required keywords

   Step 3: Generate SQL statements for each join (gives rows which contain
    all keywords), and the final rows are ranked and presented to the user.
 Publish and Search components are packaged as two separate COM
objects.
     Publish Component provides interface to

(a)    Select a database

(b)    Select tables/columns within the database to publish ,and

(c)    modify/remove/maintain the publication


     For given set of keywords, the search component provides interfaces to

(a)    Retrieve matching databases from set of published databases.

(b)    Selectively identify tables, columns/rows that need to be searched within
       each database identified in step (a)
   Exact match problem is considered i.e., each keyword in the query must
    match the value of an attribute in a row of a table.

   Symbol Table S, which stores the information of the query keywords at
    different levels (row/column ) of granularity.

   The two granularity levels are :

   Column level granularity (Pub-Col) : for each keyword , we only store the
    list of columns where they occur.

   Row Level granularity (Pub-Cell) : for each keyword we keep track of all
    the rows that contains the keyword.
   Space and Time Requirements
     Pub-Col is much faster and smaller when compared to Pub-Cell.

   Keyword Search Performance
     Pub-Col is effective if there are an indexes on the published columns

   Ease of Maintenance
     Pub-Col is easier to maintain as it requires an update only if the insertions
    cause new values in column data.

   Hence, Pub-Col symbol table is always better than Pub-Cell table, unless
    certain columns do not have indexes.
   Symbol table is stored as a relation with two attributes : Keyword and
    ColId.

   Alternative way is to keep hashed values of keywords in the symbol table .

   FK-Comp:
     If the set of values in column c1 is a subset of the values in another
    column c2 due to key-foreign key relationship, we retain only a single hash
    table entry for keywords common to both as (keyval,c1), since the foreign
    key constraint can be used to infer presence of keyval in c2.
   CP-Comp :
                                                                      V2
(a)Partition H into a minimum number                                       c1
                                                                      V3        x
 of bipartite cliques (a bipartite clique                                  c2
is any subgraph of H with a                                           v4
maximal number of edges).
                                            Uncompressed hash table
(b) Compress each clique.

(map the symbol table S to a bipartite
graph H having two node sets, HashVal &
 ColId, where every row (v, c) in table S                         ColumnsMap table
corresponds to an edge in H.)
                                             Compressed hash table
   Step 1 :
           Symbol table is searched(using generated SQL) to
    identify the database tables, columns/cells that contain at least
    one query keyword.

   Step 2 :
         Enumerating Join trees

   Step 3 :
         Identifying matching rows
   Identify and enumerate all potential subsets of tables in the
    database that, if joined, might contain rows having all
    keywords.
   The resulting relation will contain all potential rows having all
    keywords specified in the query.

    If we view the schema graph G as an          keywords
        undirected graph, this step
        enumerates join trees, i.e., sub-trees
        of G such that:
    (a) the leaves belong to MatchedTables
        and
    (b) together, the leaves contain all
        keywords of the query
                                                             Join Trees
   The input to this final search step is enumerated join trees.

   Each join tree is then mapped to a single SQL statement that joins the
    tables as specified in the tree, and select those rows that contain all
    keywords.

   The retrieved rows are ranked before being output.

   Our approach is to rank the rows by number of joins involved
    E.g., documents in which keywords occur close to one another are ranked
    higher than documents in which keywords are far apart.
   Token matches - the keyword in the query matches only a token or sub-
    string of an attribute value (e.g., retrieve rows of address by specifying
    only a street name).

   Pub-Prefix method
     B+ tree indexes can be used to retrieve rows whose cell matches a given prefix
      string
     This clause is of the form
                   WHERE T.C LIKE ‘P%K%’
     During publishing of a database, for every keyword K, the entry (hash(K), T.C,
      P) is kept in the symbol table if there exists a string in column T.C which
      ▪ contains a token K, and
      ▪ has prefix P
Let the hash values of the searchable tokens i.e., ‘string’, ‘ball’
and ‘round’ be 1, 2 and 3 respectively

     RowId         C
        1    this is a string
        2    this string
        3    this is a ball
        4    no string
        5    any ball is round
                                                Pub-Prefix table
       Database table T
Consider searching keyword “string”

Pub-Prefix table returns prefixes “th” and “no” and subsequent SQL will
contain (T.C LIKE ‘th%string%’) OR (T.C LIKE ‘no%string%’)
   Setup: The Experiments are on a 450 MHz 256 MB Intel P-III machine.

   Four databases were used for evaluation:
     TPC-H data of sizes from 100 to 500 MB
     USR is Microsoft employee address DB of 130 MB with 19 tables
     ML is a 375 MB mailing list DB with 38 tables
     KB is a 365 MB DB with 84 tables containing information on articles
      and help manuals on various shipped products
   In particular it has been showed that :

   Pub-Col table is compact when compared to Pub-Cell

   Pub-Col scales linearly with data size, and is independent of data
    distribution.

   Pub-Prefix is compact compared to Pub-Cell and has a significantly better
    performance when full text indexes are not present
Publish & Search performance of Pub-Col & Pub-Cell are compared:
Fig: Matching Databases
Fig: Join Trees
Fig: Matching Rows
   DBXplorer has been implemented using a commercial relational database
    and web server and allows users to interact via a browser at front-end.

   Although the authors discussed only the case where there is a single
    database, this techniques can be extended to keyword search over multiple
    databases.

   DBXplorer support exact matches and generalized matches up to some
    extent.

   As mentioned earlier, the Pub-Col alternative is the best when columns
    have indexes on them. If a full-text index is available, use Pub-Col with
    the full-text index. Instead, if only a traditional index is available and the
    column width is small, use Pub-Prefix, otherwise use Pub-Cell.

						
Other docs by pptfiles