A Platform for Efficient Full-Text Search on the Web

                  Emiran Curtmola

                       IDAR 2007

   Search Semi-structured Data (XML)

 Growing amount of XML data available for
  processing and exchange

 Need for text predicates that go beyond simple
  keyword search

 Existing applications need to query both the
  structure and the text of documents

   Full-Text queries (FT)
      query structure + text
      complex, composable predicates on the words in the text
          window, distance, order, times etc.
  A Typical Scenario

E.g., web service discovery in P2P or Grid
   Web services typically described using XML (e.g.,
    WSDL standard)

   Autonomous service providers use non-uniform
    descriptions, with variable structure and text

   Query: “find web services providing info about
    <breaking news> on a possible tsunami in
    Asia (within 10 words)”

         Existing Approaches: DB & IR

DB community
• data centric (structure)
     • languages
     • efficient evaluation
• XPath 2.0, XQuery 1.0, XSLT 1.0

Information Retrieval (IR) community
• document centric (text)
     • indices
     • ranking methods
• Yahoo!, Google, XXL, JuruXML, Elixir etc.

[Figure: example XML document tree with elements newspaper-name, breaking news, overview, sightseeing, sailing clubs, and text leaves]
   Query Languages for Structure + Text
 Challenge: a variety of competing proposals for querying XML
  on structure + text [BAS-06], with
    variable expressive power
    scoring methods
    often fuzzy semantics

 Front-runner language: XQuery Full-Text (XQFT)
    Proposed by W3C task force
       currently in Last Call, until June 22, 2007
       could become a W3C Recommendation as early as 2008!
    Subsumes expressivity of most of the proposed FT languages
    Reference implementation: GalaTex [Curtmola et al. XIME-P 2005]

    Query in XQFT
        .//* ftcontains “tsunami” and “Asia” window <=10 words
  Need to Optimize FT Queries

Prior to our project, there was no work on FT query
 optimization; efficient evaluation was limited to
   Conjunctive keyword search (no predicates)
   Full-text predicates in isolation

Need for efficient evaluation of FT queries
   universal formal techniques to optimize


 Efficient evaluation of full-text queries
   Query optimization

   Impact of scoring methods on optimizations

   Query distributed data

 Summary and future work

   A Novel Universal Optimization Framework

 The XQFT semantics in the W3C proposal are given in a
  functional-language style
   no apparent connection to (relational) database query

 We provide an alternative (yet equivalent) semantics
  captured by
   Formalization of XML full-text languages in terms of
      keyword patterns
      pattern matches
      predicates evaluated through matches
   XFT algebra
      matches are treated as relational tuples

     XFT Algebra
 Example: query in XQFT

        .//* ftcontains “tsunami” and “Asia” window <=10 words

        window( match(“tsunami”) ⋈ match(“Asia”) )

   match(“tsunami”): all occurrences (matches) of “tsunami”
   match(“Asia”): all occurrences (matches) of “Asia”
   ⋈ (join): common ancestors of match pairs
   window: keep only ancestors of close matches
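
The plan above can be sketched over a toy match table. This is a minimal illustration, not the XFT engine: the `DOC` triples, the tuple encoding of node paths, and the helper names are all assumptions, and the join keeps only the deepest common ancestor of each match pair rather than every ancestor.

```python
from itertools import product

# Hypothetical toy document: (node_path, word_position, word) triples.
# node_path lists element ids from the root down to the containing element.
DOC = [
    ((0, 1), 3, "tsunami"),
    ((0, 1), 9, "Asia"),
    ((0, 2), 40, "tsunami"),
    ((0, 2), 95, "Asia"),
]

def match(keyword):
    """All occurrences (matches) of a keyword, as relational tuples."""
    return [(path, (pos,)) for path, pos, word in DOC if word == keyword]

def common_ancestor(a, b):
    """Deepest common ancestor of two node paths (their shared prefix)."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return tuple(shared)

def join(left, right):
    """Pair up matches and attach each pair to its common ancestor."""
    return [(common_ancestor(pl, pr), posl + posr)
            for (pl, posl), (pr, posr) in product(left, right)]

def window(matches, size):
    """Selection: keep matches whose positions fit in `size` words."""
    return [(path, pos) for path, pos in matches
            if max(pos) - min(pos) + 1 <= size]

result = window(join(match("tsunami"), match("Asia")), 10)
print(result)
```

Only the pair at positions 3 and 9 survives the selection, so the single answer is the element with path (0, 1).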
   Benefits of the Optimization Framework
   [Amer-Yahia et al. SIGMOD 2006]

 Enable leveraging the tried-and-true relational-style
  evaluation & optimization techniques, including
    Join re-ordering
    Pushing selection predicates into joins

 Concise & clean formal semantics for all FT
  languages by translation to the XFT algebra
   one-size-fits-all optimization for all FT languages

 Efficient algorithms for operator evaluation through
  a novel and successful marriage of IR & DB

 Measured speedup of at least two orders of
  magnitude over two reference XQFT engines

 Efficient evaluation of full-text queries
   Query optimization

   Impact of scoring methods on optimizations

   Query distributed data

 Summary and future work

  Integrate with Universal Scoring

Until now, scoring has been well understood for text

Challenge: score structure + text
   Non-trivial
   Many scoring proposals; sometimes hardcoded in
    the algorithm

Extend the universal optimization framework
 to accommodate universal scoring

   Requirements for Extending with Scores

Documents carry “scores”
   relevance of the query matching documents

XFT algebraic operators manipulate scores

   Generic functions, not a particular scoring function
      no scoring method is better than the others

   Avoid re-computing scores: score of a node can be
    derived solely from the scores of its descendants
   Preliminary Results: Scoring Scheme
Parameterized scoring scheme
     scoreK( k,pos,n ) = score keyword k at position pos in
      node n

     scoreM( p,m ) = score a match m with pattern p
        aggregate scores from subpatterns of a pattern for the
         same node

     scoreS( SM(n,p) ) = score a set of matches SM
      corresponding to node n and pattern p
        aggregate scores from children to parent

The score of a node depends on scoring its
 set of matches
   scoreK is used in scoring a match (scoreM)
   scoreM is used in scoring a set of matches (scoreS)
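
One way to read the parameterized scheme is as three pluggable functions composed into a node scorer. The sketch below is an assumption-laden illustration: `make_node_scorer`, the lookup table, and the choice of addition for scoreM and `max` for scoreS are hypothetical stand-ins, and scoreS is simplified to take a list of match scores. The keyword scores 10, 15 and 2 are the ones used in the deck's example.

```python
from functools import reduce
import operator

def make_node_scorer(score_k, score_m, score_s):
    """Compose scoreK, scoreM and scoreS into one node-scoring function."""
    def score_node(node, pattern, matches):
        # matches: non-empty list of position tuples, one position per keyword
        match_scores = []
        for positions in matches:
            keyword_scores = [score_k(k, pos, node)
                              for k, pos in zip(pattern, positions)]
            # scoreM folds the subpattern scores pairwise, left to right
            match_scores.append(reduce(score_m, keyword_scores))
        # scoreS aggregates the scores of the node's whole set of matches
        return score_s(match_scores)
    return score_node

# Hypothetical scoreK: a lookup table holding the deck's example values.
KEYWORD_SCORES = {
    ("tsunami", 2, "node1"): 10,
    ("Asia", 5, "node1"): 15,
    ("danger", 40, "node1"): 2,
}

def table_score_k(keyword, pos, node):
    return KEYWORD_SCORES[(keyword, pos, node)]

# Plug in one particular (illustrative) scheme: sum for scoreM, max for scoreS.
score_node = make_node_scorer(table_score_k, operator.add, max)
node1_score = score_node("node1", ("tsunami", "Asia", "danger"), [(2, 5, 40)])
print(node1_score)
```

Swapping `operator.add` or `max` for other functions changes the scheme without touching the composition, which is the point of keeping the three functions generic.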
   Example: Using the Scoring Scheme
 Query: “tsunami” and “Asia” and “danger”

 Keyword scores in node1
    “tsunami”: scoreK(tsunami, 2, node1) = 10
    “Asia”: scoreK(Asia, 5, node1) = 15
    “danger”: scoreK(danger, 40, node1) = 2

 Match (2, 5) for pattern (“tsunami”, “Asia”)
    score = scoreM(10, 15)

 Match (2, 5, 40) for pattern (“tsunami”, “Asia”, “danger”)
    score = scoreM(scoreM(10, 15), 2)

 Impact of Scores on Optimizations

 Challenge
   Scoring can break the expected relational query equivalences
      scoring intermediate nodes might generate different scores

      Pitfall: Scoring Breaks Equivalence

  Query: “tsunami” and “Asia” and “danger”
  Keyword scores: tsunami = 10, Asia = 15, danger = 2

  Plan 1: combine (tsunami, Asia) first, then danger
     scoreM(scoreM(10, 15), 2) = 7.25
  Plan 2: combine (danger, Asia) first, then tsunami
     scoreM(scoreM(2, 15), 10) = 9.25

  Different values if scoreM is the pairwise average function
  There are functions that break the relational equivalence

      Consistent scoring: same scores for equivalent plans
      Consistent ranking: same ranks for equivalent plans
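
The pitfall can be reproduced in a few lines. The sketch below takes the keyword scores from the example (tsunami = 10, Asia = 15, danger = 2) and compares two join orders; `plan_1`, `plan_2` and `pairwise_sum` are illustrative names, and the sum is only a contrast case, not a function the deck proposes.

```python
def pairwise_average(a, b):
    """scoreM as the pairwise average: not associative."""
    return (a + b) / 2

def pairwise_sum(a, b):
    """scoreM as a sum: associative and commutative."""
    return a + b

SCORES = {"tsunami": 10, "Asia": 15, "danger": 2}

def plan_1(score_m):
    # (tsunami join Asia) join danger
    return score_m(score_m(SCORES["tsunami"], SCORES["Asia"]), SCORES["danger"])

def plan_2(score_m):
    # (danger join Asia) join tsunami
    return score_m(score_m(SCORES["danger"], SCORES["Asia"]), SCORES["tsunami"])

print(plan_1(pairwise_average), plan_2(pairwise_average))  # 7.25 9.25
print(plan_1(pairwise_sum), plan_2(pairwise_sum))          # 27 27
```

With the pairwise average the two equivalent plans disagree, so the scoring is not consistent; with the sum they agree.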
       Ongoing Work

Equivalent rewriting rules and the scoring scheme
   scoreK properties?
   scoreM properties?
   scoreS properties?

   E.g., join reordering requires associative, commutative
    scoring functions

   E.g., top-K requires monotonicity

       Ongoing Work

Equivalent rewriting rules and a particular scoring scheme

     Catalog all existing scoring methods for structure and text
      w.r.t. their compatibility with rewriting optimizations
         Can we capture them in our framework?
         E.g., the vector space model is a consistent scoring for the
          relational-style rewritings
 Ongoing Work

Smart, configurable optimizer

   Plug in a particular scoring scheme at run time

   Is it consistent scoring / ranking? (are the rewritings sound?)
      If yes, use the rewritings
      If not, identify and disable all unsound rewritings
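
A rough sketch of such a check, assuming the properties named earlier in the deck (join reordering needs an associative, commutative scoring function; top-K needs monotonicity). The probe values, function names, and rewriting labels are hypothetical. Testing on samples can only refute a property, never prove it, so a real optimizer would need declared or proven properties.

```python
from itertools import product

SAMPLE_SCORES = [0.0, 1.0, 2.0, 10.0, 15.0]  # hypothetical probe values

def is_commutative(f, samples=SAMPLE_SCORES):
    return all(f(a, b) == f(b, a) for a, b in product(samples, repeat=2))

def is_associative(f, samples=SAMPLE_SCORES):
    return all(f(f(a, b), c) == f(a, f(b, c))
               for a, b, c in product(samples, repeat=3))

def is_monotone(f, samples=SAMPLE_SCORES):
    # Non-decreasing in the first argument, checked on adjacent sample pairs.
    ordered = sorted(samples)
    return all(f(low, c) <= f(high, c)
               for low, high in zip(ordered, ordered[1:]) for c in samples)

def enabled_rewritings(score_m):
    """Rewritings that the sampled properties of score_m do not rule out."""
    enabled = set()
    if is_commutative(score_m) and is_associative(score_m):
        enabled.add("join_reordering")
    if is_monotone(score_m):
        enabled.add("top_k")
    return enabled

print(enabled_rewritings(lambda a, b: a + b))
print(enabled_rewritings(lambda a, b: (a + b) / 2))
```

On these samples the sum keeps both rewritings, while the pairwise average loses join reordering because it is not associative.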


 Efficient evaluation of full-text queries
   Query optimization

   Impact of scoring methods on optimizations

   Distributed access methods

 Summary and future work

  Query on Distributed Data

Move from searching individual sources to searching
 highly distributed sources

   Consumers and producers: many, dynamic
      completely decentralized
   Users unaware of data location
      completely distributed data

Our goal: efficient distributed computation
   data discovery, evaluation, ranking of FT queries
    P2P Network with XML Sources

[Figure: numbered peers connected by network links, each with a local XML store; Query1: (tsunami, Asia) and Query2: (concerts, NYC) are posed at peers in the network]

Each node can
   • produce and store XML data
   • answer queries over its local XML store
   • initiate queries on actual content of documents

Efficient and expressive querying of the global XML data?
     Proposed Architecture

 Consumer’s side
    XFT Algebraic Engine: locally, post-process at a node
    • leverage the XFT engine
    • return the answers to the FT query

 Producers’ side
    Distributed access methods (index) to discover the relevant sources
    • answer keyword/XPath part of the queries

  Proposal: Leverage Query Dissemination Trees

Route queries: move queries, not data

Peers self-organize in query dissemination trees
   Every node holds a summary of the XML documents
    stored in its subtrees
Use the dissemination trees for query routing
   Queries always posed at the root
   If a node’s summary matches the query then
    forward query to children
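
The routing rule can be sketched as follows. This is a simplified model under stated assumptions: the `Peer` class and `route` function are hypothetical, a summary is just the set of keywords stored at a peer or anywhere below it, and a peer's store is collapsed into one keyword set instead of individual XML documents.

```python
class Peer:
    def __init__(self, name, keywords=(), children=()):
        self.name = name
        self.local = set(keywords)        # keywords in the local XML store
        self.children = list(children)
        # Summary: keywords stored here or anywhere in the subtrees below.
        self.summary = set(self.local)
        for child in self.children:
            self.summary |= child.summary

def route(peer, query, contacted):
    """Move a conjunctive keyword query down the tree, pruning by summary.

    Returns the names of peers whose local store holds every keyword;
    `contacted` records every peer the query reached.
    """
    contacted.append(peer.name)
    if not query <= peer.summary:
        return []                         # summary does not match: stop here
    hits = [peer.name] if query <= peer.local else []
    for child in peer.children:
        hits += route(child, query, contacted)
    return hits

a = Peer("a", {"tsunami", "Asia"})
b = Peer("b", {"concerts", "NYC"})
c = Peer("c", {"tsunami"})
m = Peer("m", {"news"}, [a, c])
root = Peer("root", (), [m, b])

contacted = []
hits = route(root, {"tsunami", "Asia"}, contacted)
print(hits, contacted)
```

The query is posed at the root and travels to the data: peers `c` and `b` receive it but do not forward it, and only `a` answers locally.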
      Define the Design Space

   … but the overall throughput depends on the slowest node.

   Challenge: relieve the traffic congestion

1 tree for all keywords
 • more congestion
 • less control overhead

1 tree per keyword
 • less congestion
 • more control overhead
    The Design Space To Explore

Optimal solution lies between the extremes
     Partition set of keywords into blocks
     Build one tree per keyword block
         connect all keywords from same block into one tree

[Figure: partitioning the data space, a spectrum from 1 tree for all keywords to 1 tree per keyword, with the optimal solution somewhere in between]
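
The deck does not say how the keyword blocks are chosen. As one hypothetical option, the sketch below partitions keywords greedily by an assumed per-keyword query load, always placing the next-heaviest keyword into the currently lightest block; the load figures and function names are invented for illustration.

```python
import heapq

def partition_keywords(query_load, num_blocks):
    """Greedy partition: heaviest keyword first, into the lightest block."""
    heap = [(0, i) for i in range(num_blocks)]   # (block load, block index)
    heapq.heapify(heap)
    blocks = [[] for _ in range(num_blocks)]
    ordered = sorted(query_load.items(), key=lambda kv: (-kv[1], kv[0]))
    for keyword, load in ordered:
        total, i = heapq.heappop(heap)
        blocks[i].append(keyword)
        heapq.heappush(heap, (total + load, i))
    return blocks

def block_loads(query_load, blocks):
    """Total query load routed through each block's tree."""
    return [sum(query_load[k] for k in block) for block in blocks]

# Hypothetical query loads per keyword.
LOAD = {"tsunami": 50, "Asia": 30, "concerts": 20, "NYC": 20, "danger": 10}

two_blocks = partition_keywords(LOAD, 2)
print(two_blocks, block_loads(LOAD, two_blocks))
```

Setting `num_blocks` to 1 gives the single tree for all keywords, and setting it to the number of keywords gives one tree per keyword, so the parameter spans the design space between the two extremes.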
      Forces at Cross-purposes
 Optimization problem: find the minimum number of trees to relieve congestion
  (improve the overall throughput)
    peak-to-average load within an approximation ε (acceptable ε=20%)

 Tradeoff: congestion vs. control traffic
    1 tree for all keywords: more congestion, less control overhead
    1 tree per keyword: less congestion, more control overhead

 [Figure: control traffic vs. number of trees, over the partitioning of the data space]
  Preliminary Results: Load Balancing

   A node that appears high in one tree will appear
    in lower levels in all the other trees
    guarantee that a node appears on different levels in
     different trees

Load is balanced when every node is in the
 top levels of at most one tree

Our approach: circular permutation of the
 internal nodes among the different trees
    peak load decreases drastically
    peak-to-average processing load is within 15%
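
A toy version of the rotation idea, under simplifying assumptions that are not from the deck: every tree is a binary tree laid out in level order over the same peer list, all peers are rotated rather than only the internal nodes, and the rotation step equals the number of top-level positions.

```python
def rotated_layout(peers, tree_index, shift):
    """Level-order layout of one tree: the peer list rotated circularly."""
    offset = (tree_index * shift) % len(peers)
    return peers[offset:] + peers[:offset]

def level(position):
    """Depth of a level-order position in a binary tree (root is level 0)."""
    return (position + 1).bit_length() - 1

def top_level_counts(peers, num_trees, top_levels):
    """How many trees place each peer within the first `top_levels` levels."""
    shift = 2 ** top_levels - 1           # positions inside the top levels
    counts = {p: 0 for p in peers}
    for t in range(num_trees):
        for position, peer in enumerate(rotated_layout(peers, t, shift)):
            if level(position) < top_levels:
                counts[peer] += 1
    return counts

counts = top_level_counts(list(range(12)), num_trees=4, top_levels=2)
print(counts)
```

With 12 peers, 4 trees and 2 top levels, each tree takes a different group of three peers as its top, so no peer is in the top levels of more than one tree.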
   Future Directions

For conjunctive query routing
   Query selectivity estimation

Scoring in distributed systems
   E.g., IDF is inherently global

Need an analytical cost model to better
 understand parameters for XML access
 methods in the design space
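
To illustrate why IDF is inherently global, the sketch below uses the standard definition idf = log(N / df) and assumes, hypothetically, that each peer can report its document count and per-keyword document frequencies and that peers hold disjoint documents. The function name and the statistics are invented.

```python
import math

def idf(num_docs, doc_freq):
    """Standard inverse document frequency, log(N / df)."""
    return math.log(num_docs / doc_freq) if doc_freq else 0.0

def global_idf(peer_stats, keyword):
    """IDF over the union of all peers' (assumed disjoint) collections."""
    total_docs = sum(n for n, _ in peer_stats)
    doc_freq = sum(df.get(keyword, 0) for _, df in peer_stats)
    return idf(total_docs, doc_freq)

# Hypothetical per-peer statistics: (number of documents, {keyword: df}).
PEERS = [
    (100, {"tsunami": 1}),
    (100, {"tsunami": 50}),
]

local = [idf(n, df["tsunami"]) for n, df in PEERS]
print(local, global_idf(PEERS, "tsunami"))
```

The two peers compute very different local values for the same keyword, and neither equals the global one, so scores built on local IDF are not comparable across peers without exchanging statistics.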


A formalized approach to full-text queries
 for large-scale systems
   Efficiency
      Relational-style optimizations of XFT algebraic plans
      Universal scoring
         properties of scoring functions for scoring consistency
   Distributed computation

Prototype (under construction) 

Thank You!

