Keyword Search in Databases

Document Sample
scope of work template
							Keyword Search in Databases




      Vagelis Hristidis
          University of California, San Diego
Motivation

   Keyword Search is the dominant information
    discovery method in documents
   Increasing amount of data stored in
    databases
Motivation

   Currently, information discovery in databases
    requires:
       Knowledge of schema
       Knowledge of a query language (eg: SQL)
       Knowledge of the role of the keywords



    • Eliminate these requirements
Keyword Query - Semantics
Keywords are:
• in same node
• in nodes of same type (eg: same relation)
• connected through edges (eg: primary-foreign key
  relationships)
Score of result:
• distance of keywords within a node
• distance between keywords in terms of edges between them
  [HristVLDB02, HristICDE03]
• weighted distance
• Distance + IR techniques [HristVLDB03]
• Random Walk probability [current work]
DISCOVER [HristVLDB02]
Result of Keyword Query

Result is tree T of nodes where:
• each edge corresponds to an edge of the
  database graph
• every keyword contained in a node of T
• no node of T is redundant (minimal)
Example - Schema


       Subset of TPC-H schema

           n:1              n:1
  ORDERS         CUSTOMER         NATION
Example - Data
        ORDERS
        ORDERKEY     CUSTKEY       TOTALPRICE    CLERK         ...
   o1   1000105      12312         $5,000        John Smith
   o2   1000111      12312         $3,000        Mike Miller
   o3   1000125      10001         $7,000        Mike Miller
   o4   1000110      10002         $8,000        Keith Brown


        CUSTOMER
        CUSTKEY     NAME              NATIONKEY     ...
   c1   12312       Brad Lou          01
   c2   10001       George Walters    01
   c3   10013       John Roberts      01


        NATION
        NATIONKEY         NAME                  REGIONKEY
   n1   01                USA                   N.America
Example – Keyword Query
    Query: “Smith, Miller”
           ORDERS
           ORDERKEY     CUSTKEY       TOTALPRICE    CLERK         ...
      o1   1000105      12312         $5,000        John Smith
      o2   1000111      12312         $3,000        Mike Miller
      o3   1000125      10001         $7,000        Mike Miller
      o4   1000110      10002         $8,000        Keith Brown


           CUSTOMER
           CUSTKEY     NAME              NATIONKEY     ...
      c1   12312       Brad Lou          01
      c2   10001       George Walters    01
      c3   10013       John Roberts      01


           NATION
           NATIONKEY         NAME                  REGIONKEY
      n1   01                USA                   N.America
Example – Keyword Query
             Query: “Smith, Miller”
     ORDERS
     ORDERKEY     CUSTKEY       TOTALPRICE    CLERK         ...
o1   1000105      12312         $5,000        John Smith          Results:
o2   1000111      12312         $3,000        Mike Miller
o3
o4
     1000125      10001         $7,000        Mike Miller           Size Result
     1000110      10002         $8,000        Keith Brown
                                                                    2    o1c1o2
     CUSTOMER
     CUSTKEY     NAME              NATIONKEY     ...
c1   12312       Brad Lou          01
c2   10001       George Walters    01
c3   10013       John Roberts      01


     NATION
     NATIONKEY         NAME                  REGIONKEY
n1   01                USA                   N.America
Example – Keyword Query
             Query: “Smith, Miller”
     ORDERS
o1
     ORDERKEY
     1000105
                  CUSTKEY
                  12312
                                TOTALPRICE
                                $5,000
                                              CLERK
                                              John Smith
                                                            ...
                                                                  Results:
o2   1000111      12312         $3,000        Mike Miller
o3   1000125      10001         $7,000        Mike Miller
o4   1000110      10002         $8,000        Keith Brown
                                                                  Size Result

     CUSTOMER
     CUSTKEY     NAME                NATIONKEY     ...            2    o1c1o2
c1   12312       Brad Lou            01
c2   10001       George Walters      01
c3   10013       John Roberts        01
                                                                  4    o1c1 n1
                                                                       c2 o3
     NATION
     NATIONKEY         NAME                  REGIONKEY
n1   01                USA                   N.America
                               Architecture
                               User


            Keywords
                                          "Smith","Miller"
                    Master
                    Index
                                           ORDERSSmith={o1}
            Tuple Sets
                                           ORDERSMiller={o2,o3}
                   Candidate
                    Network
    Database       Generator
    Schema Candidate
             Networks
                     Plan
                  Generator
        Execution Plan

                     Plan
                   Execution          CREATE TABLE T1 AS SELECT * FROM
  Joining
Networks of                              ORDERSSmith, CUSTOMERS WHERE ...
   tuples   SQL queries
                                      SELECT * FROM T1, ORDERSMiller WHERE ...
                                      SELECT * FROM T1, NATION, CUSTOMERS,
                                           ORDERSMiller WHERE ...
                   Database
                               Architecture
                               User


            Keywords
                                          "Smith","Miller"
                    Master
                    Index
                                           ORDERSSmith={o1}
            Tuple Sets
                                           ORDERSMiller={o2,o3}
                   Candidate
                    Network
    Database       Generator
    Schema Candidate
             Networks
                     Plan
                  Generator
        Execution Plan

                     Plan
                   Execution          CREATE TABLE T1 AS SELECT * FROM
  Joining
Networks of                              ORDERSSmith, CUSTOMERS WHERE ...
   tuples   SQL queries
                                      SELECT * FROM T1, ORDERSMiller WHERE ...
                                      SELECT * FROM T1, NATION, CUSTOMERS,
                                           ORDERSMiller WHERE ...
                   Database
Candidate Networks Generator -
Challenges
   A keyword may appear in multiple tuples
   # candidate networks can be too big
    (sometimes unbounded)
Candidate Network - Example
   ORDERS
    Smith     n:1
             n:1               n:1
   ORDERS           CUSTOMER         NATION
              n:1
   ORDERS
    Miller
Candidate Network - Example
   ORDERS
    Smith       n:1
               n:1               n:1
   ORDERS             CUSTOMER         NATION
                n:1
   ORDERS
    Miller

CN1: OSmith  C  OMiller                 size=2
Candidate Network - Example
   ORDERS
    Smith       n:1
               n:1                n:1
   ORDERS             CUSTOMER          NATION
                n:1
   ORDERS
    Miller
 • some CNs are pruned, based on schema
 • complete, non  OMiller
CN1: OSmith  C redundant                    size=2
 • size unbounded if many-to-many relationships
CN2: OSmith  C  N  C  OMiller            size=4
Architecture
                                   User


                Keywords
                        Master
                        Index
                Tuple Sets
                       Candidate
                        Network
        Database       Generator
        Schema Candidate
                 Networks
                         Plan
                      Generator
            Execution Plan

                         Plan
                       Execution
      Joining
    Networks of
       tuples   SQL queries




                       Database
Execution Plan - Challenges

   Generated SQL queries are expensive due to
    joins
   Reusability opportunities
Execution Plan

   Each CN corresponds to a SQL statement
   CN1: OSmith  C  OMiller
    CN2: OSmith  C  N  C  OMiller
   Execution Plan
    CN1  OSmith  C  OMiller
    CN2  OSmith  C  N  C  OMiller
Reuse Common Subexpressions -
Example
   Execution Plan
    CN1•  OSmith Reuse of Common Subexpressions is NP-
           Optimal
                     C  OMiller
          OSmith
    CN2Complete  C  N  C  OMiller
       • Greedy algorithm:
    Optimized Execution Plan
          In each iteration build intermediate result of size 1 (1
    Temp  OSmith  C
        join) that maximizes frequency a ,0  a, b  1
    CN1  Temp  OMiller logb (size)
    CN2  Temp  N  C  OMiller
[HristVLDB03]
Score of result T
    Combining function Score combines
     scores of attribute values of T
    For example
     Score=aiTScore(ai)/size(T)
    Attribute value scores Score(ai)
     calculated by off-the-self tools of DBMSs
Example – Complaints Database
Schema

  Products       Complaints   Customers
  prodId         prodId       custId
  manufacturer   custId       name
  model          date         occupation
                 comments
  Example – Keyword Query “Netvista Maxtor”
 Complaints
 tupleId      prodId      custId      date               comments                                          score

 c1           p121        c3232       6-30-2002          “disk crashed after just one week of moderate     1/3
                                                         use on an IBM Netvista X41”

 c2           p131        c3131       7-3-2002           “lower-end IBM Netvista= 4/3 fire, starting
                                                                  Score(c3) caught                         1/3
                                                         apparently with disk”
   Score(p1 c1) = (1+1/3)/2 = 4/6
 c3           p131        c3143                 37471    “IBM Netvista unstable with Maxtor HD”            4/3


 Customers                                              Products
 tupleId   custId      name         occupation          tupleId   prodId    manufacturer     model         score
 u1        c3232       “John        “Software           p1        p121      “Maxtor”          “D540X”      1
                       Smith”       Engineer”
 u2        c3131       “Jack       “Architect”          p2        p131      “IBM”             “Netvista”   1
  Score(p2 c3)        Lucas”
                       = (1+4/3)/2 = 7/6
 u3        c3143       “John        “Student”           p3        p141      “Tripplite”      “Smart        0
                       Mayer”                                                                700VA”

Results: (1) c3, (2) p2 c3, (3) p1 c1
Execution Algorithms

   Users usually want top-k results.
   Hence, submitting to DBMS a SQL query for
    each CN (Naïve algorithm) is inefficient.
   When no, or very few results, Naïve algorithm
    is efficient, since it fully exploits DBMS.
   Present top-k algorithm.
XKeyword [HristICDE03]
Storing XML Data (Decomposer)
Storing Data in XKeyword
 Each target object is stored in a CLOB

 Connections between target objects in ID Relations

                                 part
           person                      *
                            partkey name subpart
                                                            Minimal ID Relations
   name nation                                              Lineitem_Part:       LPa (L_id, Pa_id)
                 lineitem                                   Lineitem_Person_ref: LPref (L_id, P_id)
    quantity shipdate supplier   linepart                   Part_Part:           PaPa (Pa_id1,Pa_id2)

            Lineitem, 100

  quantity
   [10 ]   shipdate
                       supplier linepart                         LPa      L_id      Pa_id
         [Oct 14 2001]
                 Lineitem, 101
                                                   Part, 123              100     123
                     supplier linepart      partkey name
quantity
 [10 ]
           shipdate
         [Oct 15 2001]
                                            [1005] ["TV"]                 101     123
Presentation of Results

   Number of results explodes due to MVDs
    Example
                                          l1: lineitem                                                                 pa1: part
                                         [quantity=10                                                               [partkey=1008
                                                            lin
      p1: person          lie   r
                                    shipdate=Oct 14 2001]         ep                               p      a rt      name="VCR"]
                     supp                                              a rt       pa3: part     ub
                                                                                                s
    [name="John"    su p                                                       [partkey=1005
     nation="US"]        p   lier         l2: lineitem             p    a rt    name="TV"]
                                                                                               su
                                         [quantity=10       line                                    bp
                                                                                                         ar
                                                                                                              t         pa2: part
                                    shipdate=Oct 15 2001]                                                            [partkey=1009
                                                                                                                  name="VCR & DVD"]

Results:
  R1.       p1-l1-pa3-pa1
  R2.       p1-l2-pa3-pa2
  R3.       p1-l2-pa3-pa1
  R4.       p1-l1-pa3-pa2
R3, R4 are implied by R1, R2!
Demo (in VLDB03)




• Demo on DBLP dataset available at www.db.ucsd.edu/XKeyword
Demo
Demo
Demo
Related Work
   DBXplorer. S. Agrawal et al. ICDE 2002
       Similar three step architecture
       Incomplete solutions (relations are not re-used)
       No common subexpression reusability
   BANKS. G. Bhalotia et al. ICDE 2002
       Database viewed as graph
       Steiner tree problem approximations
   Proximity searching in databases. R. Goldman et
    al. VLDB 1998
       Database viewed as graph
       No schema info
       hub nodes
Current Work

ObjectRank project
 Exploit link structure of data graph for ranking

 Suitable for bibliographic and other authority
  transfer applications
 Semantic differences from PageRank

 More processing allowed due to smaller size
  than Web
 Demo at www.db.ucsd.edu/ObjectRank
Questions?

						
Related docs