Effective XML Keyword Search with Relevance Oriented Ranking


Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu




                Introduction
• XML Keyword search
  – Inspired by IR style keyword search on the
    web
  – Enables users to access information in XML
    databases
  – XML data modeled as a rooted, labeled tree
  – Recent research efforts
    • Efficiency
    • Effectiveness

               Effectiveness
• Capture the user’s search intention
  – Identify the target that the user intends to search for
  – Infer the predicate constraints that the user intends to
    search via
• Result ranking
  – Rank the query results according to their
    objective relevance to user search intention



                State of the Art
• Search semantics design
  – LCA (Lowest Common Ancestor)
    • Node v is an LCA of keyword set K={w1, w2,…,wk} if the sub-tree
      rooted at v contains at least one occurrence of every keyword in K,
      after excluding the sub-elements that already contain all
      keywords in K
  – SLCA (Smallest LCA)
    • Node v is an SLCA of keyword set K={w1, w2,…,wk} if
       – (1) v is an LCA of K
       – (2) no proper descendant of v is an LCA of K
  – XSeek
    • Infers the search intention based on the concept of objects and
      an analysis of the matching between keyword and data node
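
A minimal sketch of the SLCA semantics above, assuming nodes are identified by Dewey labels (tuples of child positions) and that each keyword comes with the list of nodes matching it. The brute-force enumeration and the labels in the usage note are illustrative assumptions, not the algorithms of XKSearch or Multiway SLCA.

```python
from itertools import product

def lca(a, b):
    """Lowest common ancestor of two Dewey labels: their longest common prefix."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def slca(keyword_lists):
    """Brute-force SLCA.

    keyword_lists[i] holds the Dewey labels of the nodes matching keyword i.
    """
    # The LCA of one match per keyword is a node whose sub-tree contains all keywords.
    candidates = set()
    for combo in product(*keyword_lists):
        node = combo[0]
        for other in combo[1:]:
            node = lca(node, other)
        candidates.add(node)

    # Keep only candidates that have no proper descendant among the candidates.
    def has_descendant(v):
        return any(len(u) > len(v) and u[:len(v)] == v for u in candidates)

    return sorted(v for v in candidates if not has_descendant(v))

# Hypothetical labels: "rock" matches two nodes, "music" matches one of them.
rock = [(0, 0, 2, 2, 0), (0, 0, 3, 1)]
music = [(0, 0, 2, 2, 0)]
print(slca([rock, music]))
```

With these made-up labels the only SLCA is the single node matching both keywords; its ancestor (0, 0) also contains both keywords but is discarded because it has a qualifying descendant.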
           State of the Art (cont)
• Efficient result retrieval
  – Designed based on a certain search semantics
  – XKSearch, Multiway SLCA etc.
• Result ranking
  – XRANK, XKSearch, EASE
  – They only consider
     • Structural compactness of matching results
     • Keyword proximity
     • Similarity at node level


          Problems Unaddressed
• User search intention is not adequately addressed
  – Meaningfulness of query results
    • The SLCA result is less meaningful in many cases
  – Keyword ambiguity problems
    1. A keyword can appear both as an XML node type and as
      the text value of some other nodes
    2. A keyword can appear in the text values of different XML
      node types and carry different meanings

       Neither SLCA nor XSeek addresses keyword ambiguity well



Problems
                                Meaningfulness
     • Keyword query “rock music”
          – Search intention: find customers interested in “rock music”
            – C3
          – SLCA returns: interest node of C3
     [Figure: example XML tree rooted at storeDB, with a customers branch and a books branch.
      Customers: C1 (name “Mary Smith”, street “Art Street”, interest “fashion”),
      C2 (name “John Martin”, interest “street art”),
      C3 (name “Art Smith”, interest “rock music”),
      C4 (name “Rock Davis”, interest “art”).
      Books: B1 (title “Art of Customer Interest Care”, authors “John Williams”, “Daniel Jones”),
      B2 (authors “Edward Martin”, “Sophia Jones”, publisher name “Oxford”).]
Problems
                          Keyword Ambiguity
     • Q = “customer, interest, art”
          – Ambiguity 1: customer, interest; Ambiguity 2: art
          – Intention: find customer whose interest is art
          – Less relevant or irrelevant results are also returned: C1, C3, and B1’s title
     [Figure: the storeDB example tree (customers C1 to C4, books B1 and B2)]
Problems
                       Keyword Ambiguity (cont)
 • Q = “customer, art”
      – “art” can be the value of an interest node (C2, C4), a name node (C3), the
        street node of a customer (C1), or the title node of a book (B1)
      – “customer” can be the tag name of a customer node, or (part of) the value of
        the title of a book (B1)
      – How to rank C1 to C4 and B1?

     [Figure: the storeDB example tree (customers C1 to C4, books B1 and B2)]
           Objectives & Challenges
• Address the following as a single problem
  – Search intention identification
  – Query result retrieval
  – Result ranking
     – Extend the original TF*IDF from text databases to XML databases,
       while capturing the hierarchical structure of XML data

 • Challenges
  I. How to decide which sub-tree(s) with appropriate node types can
       capture user desired information
  II. How to return sub-trees of an appropriate size (i.e. contain enough
       but non-overwhelming information)
  III. How to rank those sub-trees by their relevance
                      Challenges
Difficulty in applying TF*IDF to XML
   • An XML DB carries semantic information while a text DB
     contains pure text information. XML TF*IDF must be
     aware of the underlying semantics.
   • All contents of XML data are stored in leaf nodes only
   • What is the analogy of a “flat document” in XML?
      o Sub-trees classified according to their prefix paths
   • The normalization factor is not simply the size of the sub-tree
      o The structure of sub-trees may also affect the ranks



                  TF*IDF Recap
• Rule 1: A keyword appearing in many documents
  should not be regarded as more important than a
  keyword appearing in a few. --- IDF

• Rule 2: A document with more occurrences of a query
  keyword should not be regarded as less important for
  that keyword than a document that has less. --- TF

• Rule 3: A normalization factor is needed to balance
  between long and short documents
   – since Rule 2 discriminates against short documents, which
     have less chance of containing many occurrences of a keyword.
                      Our Approach
– Extend IR-style keyword search techniques (like TF*IDF)
  from text databases to XML databases, in order to capture the
  hierarchical structure of XML documents
   •   by analyzing statistics of the underlying XML data

– Major Contributions
1. Identify the user’s desired search-for node and search-via node(s) in
   a heuristic way
   • Define XML TF (term frequency) and XML DF (document frequency)
   • Confidence formulas for search for/via candidates
2. Define XML TF*IDF similarity
   • Propose 3 guidelines specifically for XML keyword search
   • Take keyword ambiguity problems into account
3. Design a keyword search engine, XReal
                                    Data Model
  • Node type
       Two nodes are of the same node type if they share the same prefix path
       /storeDB/customers/customer/name vs.
       /storeDB/books/book/publisher/name
  • Value node: the text values contained in a leaf node
  • Structural node
       Single-valued node type, multi-valued node type
       Grouping type: all its children are of the same multi-valued type
     [Figure: the storeDB example tree (customers C1 to C4, books B1 and B2)]
             XML TF and IDF
• XML DF $f_k^T$ (document frequency)
  – The number of T-typed nodes that contain keyword
    k in their sub-trees in the XML database.
   • The granularity of similarity measurement is sub-trees of
     a certain node type T
• XML TF $f_{a,k}$ (term frequency)
  – The number of occurrences of a keyword k in a
    given value node a in the XML database.
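
A minimal sketch of how the two statistics above could be collected, assuming a node type is the prefix path from the root, that a keyword matches either a tag name or a token of a text value, and that tokens are lowercased alphanumeric runs. The tiny document and the helper names (`tokens`, `xml_df`, `xml_tf`) are illustrative assumptions, not part of XReal.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-customer fragment of the storeDB example.
DOC = """
<storeDB>
  <customers>
    <customer><name>Art Smith</name><interests><interest>rock music</interest></interests></customer>
    <customer><name>Rock Davis</name><interests><interest>art</interest></interests></customer>
  </customers>
</storeDB>
"""

def tokens(text):
    """Lowercased alphanumeric tokens of a tag name or text value."""
    return re.findall(r"[a-z0-9]+", (text or "").lower())

def xml_df(root):
    """df[(T, k)] = number of T-typed nodes whose sub-tree contains keyword k."""
    df = Counter()

    def visit(node, prefix):
        path = prefix + "/" + node.tag          # node type = prefix path
        words = set(tokens(node.tag)) | set(tokens(node.text))
        for child in node:
            words |= visit(child, path)         # keywords bubble up to ancestors
        for w in words:
            df[(path, w)] += 1
        return words

    visit(root, "")
    return df

def xml_tf(value_text, keyword):
    """Number of occurrences of keyword in one value node."""
    return tokens(value_text).count(keyword)

df = xml_df(ET.fromstring(DOC.strip()))
print(df[("/storeDB/customers/customer", "art")])       # both customers contain "art"
print(df[("/storeDB/customers/customer/name", "art")])  # only one name contains "art"
```

Counting per node type is what lets the same keyword ("art") carry different weights under name and under interest.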



Infer the desired search-for node
• Guidelines: A node type T is considered a desired
  search for node if
    1. T is intuitively related to every query keyword
    2. XML nodes of type T are informative enough to contain
       enough relevant information
    3. XML nodes of type T are not overwhelming, i.e. do not contain too
       much irrelevant information

$$C_{for}(T, q) = \log_e\Big(1 + \prod_{k \in q} f_k^T\Big) \cdot r^{depth(T)}$$

• Confidence of T as the search for node w.r.t. query q.
      •     product instead of sum is used to follow the 1st guideline
      •     log part designed to follow the 3rd guideline
      •     exponential part designed to follow the 2nd guideline
      •     r is a decay factor in (0,1].
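
A small sketch of the search-for confidence, assuming the XML DF values for the query keywords are already available as a list. The default decay factor 0.8 and the sample numbers are arbitrary choices for illustration.

```python
import math

def c_for(dfs, depth, r=0.8):
    """Confidence of node type T as the search-for node.

    dfs   : XML DF values f_k^T, one per query keyword
    depth : depth of T in the XML tree
    r     : decay factor in (0, 1]; 0.8 is an assumption of this sketch
    """
    return math.log(1 + math.prod(dfs)) * r ** depth

# Made-up statistics for a two-keyword query.
print(c_for([0, 5], depth=2))   # a missing keyword zeroes the product
print(c_for([3, 4], depth=2) > c_for([3, 4], depth=3))
```

Because the keyword frequencies are multiplied, a type that misses any query keyword gets zero confidence, which is how the formula enforces the first guideline.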

      Infer the Search-Via Nodes
• Infer structural node to search via
  – Structural node n is a good candidate if it is related to as many
    (but not necessarily all) keywords as possible

$$C_{via}(T, q) = \log_e\Big(1 + \sum_{k \in q} f_k^T\Big)$$
    • Search via node type normally is not unique

• Infer individual value node to search via
  – Statistics alone is not adequate to infer the likelihood of a value
    node as (part of) search via node
  – Capture keyword co-occurrence


      Capture keyword co-occurrence
    • E.g. Q = “customer, name, rock, interest, art”
        – It is easy to find that name and interest have high confidence to be the
          search via nodes
        – But it is hard to know whether rock is a value of name or of interest, and
          whether art is a value of interest or of name
        – How to distinguish customer C4 from C3?
     [Figure: the storeDB example tree (customers C1 to C4, books B1 and B2)]
 Capture keyword co-occurrence
• Proximity factors for a value node v of type kt
  containing keyword k
   – Given a query q and a certain value node v, suppose there are two
     keywords kt and k in q, s.t. kt matches the type of an
     ancestor node of v and k matches a keyword in v
   – In-Query distance
       • Distance between keyword k and node type kt in query q
       • Favors: kt appears before k
   – Structural distance
      • Depth distance between v and the nearest kt-typed
        ancestor node of v
   – Value-Type distance Dist(q, v, kt, k)
      • Max of the above two

$$C_{via}(q, v, k) = 1 + \sum_{kt \in q \cap ancType(v)} \frac{1}{Dist(q, v, kt, k)}$$
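
A sketch of the value-node confidence for the query “customer, name, rock, interest, art”. The slides do not spell out the in-query distance, so this sketch assumes it is the position gap when kt precedes k and infinite otherwise; node depths are hypothetical (storeDB at depth 0).

```python
import math

def in_query_distance(query, kt, k):
    """Assumed definition: position gap if kt precedes k in the query, else infinity."""
    i, j = query.index(kt), query.index(k)
    return j - i if i < j else math.inf

def c_via_value(query, ancestor_depths, value_depth, k):
    """1 + sum over query keywords kt that name an ancestor type of v of 1 / Dist.

    ancestor_depths: tag name -> depth of the nearest ancestor of v with that tag
    value_depth    : depth of the value node v
    """
    score = 1.0
    for kt in query:
        if kt == k or kt not in ancestor_depths:
            continue
        structural = value_depth - ancestor_depths[kt]
        dist = max(in_query_distance(query, kt, k), structural)
        score += 1.0 / dist          # 1 / inf contributes 0.0
    return score

q = ["customer", "name", "rock", "interest", "art"]
# "rock" inside a name value (such as “Rock Davis”)
as_name = c_via_value(q, {"customer": 2, "name": 3}, 4, "rock")
# "rock" inside an interest value (such as “rock music”)
as_interest = c_via_value(q, {"customer": 2, "interests": 3, "interest": 4}, 5, "rock")
print(as_name, as_interest)
```

Under these assumptions "rock" scores higher as a name value than as an interest value, because name precedes rock in the query while interest follows it.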
    Principles of XML keyword search
• Principle 1
  – When searching for D-typed nodes via a single-valued type V,
    ideally only the values and structures nested in V-typed nodes
    can affect the relevance, regardless of the size of other typed
    nodes nested in D-typed nodes.
    • However, TF*IDF similarity in IR normalizes the relevance score of
      each document w.r.t. its size

• Principle 2 – address keyword Ambiguity 2
  – When searching for nodes of type D via a multi-valued type V’,
    the relevance of a D-typed node which contains a query
    relevant V’-typed node should not be affected (i.e. normalized)
    too much by other query-irrelevant V’-typed nodes.
    • Example: query “art”   - C4 should not be less relevant than C1
 Principles of XML keyword search

• Principle 1 and 2
  – Especially useful for interpreting pure keyword query -
    find search via node correctly


• Principle 3
  – The order of keywords in a query is important to
    indicate the search intention
    • Incorporate the search via confidence Cvia we defined
      before



           XML TF*IDF Similarity
• To calculate the similarity between the search for
  node and the query q
  – Base case: similarity between value node a and q
    • Apply original TF*IDF directly since a contains keywords
      only without any structure

$$similarity(q, a) = \frac{\sum_{k \in q \cap a} W_{q,k}^{T_a} \cdot W_{a,k}}{W_q^{T_a} \cdot W_a}$$

    • IDF: $W_{q,k}^{T_a} = C_{via}(q, a, k) \cdot \ln\big(1 + N_{T_a} / (1 + f_k^{T_a})\big)$
    • TF: $W_{a,k} = 1 + \ln(f_{a,k})$
    • Normalization factor: $W_q^{T_a} = \sqrt{\sum_{k \in q} (W_{q,k}^{T_a})^2}$, $W_a = \sqrt{\sum_{k \in a} W_{a,k}^2}$

  – Recursive case: similarity between structural node n
    and q
    • Based on similarities of its children c and the confidence
      level of c as the node type to search via
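
A sketch of the base case, assuming N_{T_a} is the number of nodes of type T_a and that C_via(q, a, k) defaults to 1 when no proximity information is supplied. All statistics in the usage lines are invented for illustration.

```python
import math

def base_similarity(query, value_tf, type_df, n_type, c_via=None):
    """TF*IDF cosine-style similarity between query q and value node a.

    query    : list of query keywords
    value_tf : keyword -> f_{a,k} for every keyword occurring in a
    type_df  : keyword -> f_k^{T_a}
    n_type   : N_{T_a}, assumed to be the number of T_a-typed nodes
    c_via    : keyword -> C_via(q, a, k); missing entries default to 1.0
    """
    c_via = c_via or {}
    w_q = {k: c_via.get(k, 1.0) * math.log(1 + n_type / (1 + type_df.get(k, 0)))
           for k in query}
    w_a = {k: 1 + math.log(f) for k, f in value_tf.items()}
    norm_q = math.sqrt(sum(w * w for w in w_q.values()))
    norm_a = math.sqrt(sum(w * w for w in w_a.values()))
    if norm_q == 0 or norm_a == 0:
        return 0.0
    dot = sum(w_q[k] * w_a[k] for k in w_q if k in w_a)
    return dot / (norm_q * norm_a)

# Query "art" against two interest values, with made-up statistics.
exact = base_similarity(["art"], {"art": 1}, {"art": 2}, n_type=4)
partial = base_similarity(["art"], {"street": 1, "art": 1}, {"art": 2}, n_type=4)
print(exact > partial)
```

The value “art” outranks “street art” only through the normalization factor W_a, since both contain the keyword once.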
     XML TF*IDF Similarity (cont.)
• Recursive Case
  – Intuition 2. An internal node n is relevant to q if n has a
    child c such that the type of c has high confidence to be a
    search via node w.r.t. q (i.e. large Cvia(Tc, q)), and c is
    highly relevant to q (i.e. large sim(q, c)).
  – Intuition 3. An internal node n is more relevant to q if n
    has more query-relevant children, all else being
    equal.

$$similarity(q, n) = \frac{\sum_{c \in chd(n)} sim(q, c) \cdot C_{via}(T_c, q)}{W_n^q}$$

  – Numerator: weighted sum of all n’s children’s similarity and their
    confidence to be the search via node
  – Denominator: overall weight of node n w.r.t. query q, which essentially
    plays the role of a normalization factor
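
A sketch of the recursive case as a bottom-up traversal. The slides do not define the overall weight W_n^q, so it is passed in as a caller-supplied function; the `Node` class, the confidence values and the constant weight in the usage lines are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str
    base_sim: float = 0.0                       # used only for leaves (value level)
    children: list = field(default_factory=list)

def similarity(node, c_via, weight):
    """Recursive similarity of a node w.r.t. a fixed query.

    c_via  : node type -> C_via(T_c, q)
    weight : function node -> W_n^q (left to the caller, as it is not given here)
    """
    if not node.children:
        return node.base_sim
    total = sum(similarity(c, c_via, weight) * c_via.get(c.node_type, 0.0)
                for c in node.children)
    return total / weight(node)

# Hypothetical customer with an irrelevant name and one relevant interest.
interest = Node("interest", base_sim=1.0)
customer = Node("customer", children=[Node("name"), Node("interests", children=[interest])])
c_via = {"name": 0.5, "interests": 2.0, "interest": 2.0}
print(similarity(customer, c_via, weight=lambda n: 1.0))
```

Each relevant child adds to the numerator, which matches Intuition 3; the choice of `weight` decides how strongly large sub-trees are normalized.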
Flowchart of answering a query
1. Identify user search intention
  – Compute the confidence of all possible candidate
    node types and choose desired search for node Tfor

2. Relevance-oriented ranking
  – Compute XML TF*IDF similarity in a bottom-up
    approach from value nodes containing keywords up
    to nodes of type Tfor
  – Return a ranked list of sub-trees rooted at nodes of
    type Tfor
     • If more than one search for node type have comparable
       confidence, a ranked list for each search for node is returned
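
A sketch of step 1, assuming the confidences of all candidate node types have already been computed. What counts as “comparable” confidence is not specified here, so the relative `tolerance` below is a hypothetical threshold.

```python
def choose_search_for(confidences, tolerance=0.1):
    """Return the node types whose confidence is within `tolerance` of the best one.

    confidences: node type -> search-for confidence (non-negative)
    tolerance  : hypothetical relative margin for "comparable" confidence
    """
    best = max(confidences.values())
    chosen = [(t, c) for t, c in confidences.items() if c >= (1 - tolerance) * best]
    chosen.sort(key=lambda tc: -tc[1])          # most confident type first
    return [t for t, _ in chosen]

# Made-up confidences for three candidate types.
print(choose_search_for({"customer": 2.0, "book": 1.9, "interest": 0.5}))
```

Step 2 would then be run once per returned type, producing one ranked list for each.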

         Experimental Result
• Data set
  – DBLP, XMark, WSU, eBay
• Comparison
  – Compare XReal with SLCA and XSeek
• Equipment
  – Implemented in Java
  – Run on a 3.6GHz Pentium IV PC with 1 GB memory and
    Windows XP
  – Berkeley DB Java Edition for storing keyword inverted
    lists and the keyword frequency table
        Search Effectiveness
• Accuracy in inferring the search for node
  – Conducted by user survey
  – Tested queries contain at least one of the two
    ambiguity problems
  – Conclusion
     • XReal works well, especially when the search for
       node is not given explicitly in the query




        Search Effectiveness
• Result effectiveness
  – Measured by precision, recall, F-measure
  – Observations
     • XReal achieves higher precision than SLCA and
       XSeek for queries that contain ambiguities
     • XReal performs as well as XSeek when queries
       have no ambiguity in XML data
     • XReal: Top-100 precision is higher than overall
       precision
     • F-measure also shows good overall effectiveness
       of both XReal and XSeek
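
For reference, a sketch of the three effectiveness measures, assuming results and relevance judgments are given as collections of node identifiers and that the F-measure is the balanced F1. The identifiers in the usage lines are illustrative, not experimental data.

```python
def precision_recall_f(returned, relevant):
    """Precision, recall and balanced F-measure of one query result."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# Illustrative judgment: three nodes returned, two of them relevant.
print(precision_recall_f(["C2", "C4", "C3"], ["C2", "C4"]))
```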
      Ranking Effectiveness
• Metrics
  – Number of Top-1 answers that are relevant
  – Reciprocal Rank (R-Rank)
  – Mean Average Precision (MAP)
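
A sketch of the rank-based metrics, assuming each query yields a ranked list of boolean relevance judgments. Average precision is normalized here by the number of relevant results retrieved, which is one common convention and an assumption of this sketch; the sample lists are made up.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant answer, or 0.0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(relevance):
    """Mean of precision@rank over the ranks holding a relevant answer."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP over several queries; runs is a list of relevance lists."""
    return sum(average_precision(r) for r in runs) / len(runs)

print(reciprocal_rank([False, True, True]))
print(average_precision([True, False, True]))
```

The Top-1 metric is simply the number of queries whose first judgment is true.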




      Efficiency & Scalability
• Compare three index variants for
  XReal, and SLCA
  – Dup
    • Stores only the Dewey ID and XML TF $f_{a,k}$
  – DupType
    • Stores an extra node type (i.e. its prefix path)
  – DupTypeNorm
    • Stores an extra normalization factor $W_a$ for the value
      node

[Figure panels: XMark, DBLP]
Thank You

  Q&A


     [Figure: the storeDB example tree (customers C1 to C4, books B1 and B2)]