A Platform for Efficient Full-Text SEARCH on the Web




                  Emiran Curtmola

                       IDAR 2007

   Search Semi-structured Data (XML)

 Growing amount of XML data available for
  processing and exchange

 Need for text predicates that go beyond simple
  keyword search

 Existing applications need to query both the
  structure and the text of documents

   Full-Text queries (FT)
      query structure + text
      complex, composable predicates on the words in the text
          window, distance, order, times etc.
  A Typical Scenario

E.g., web service discovery in P2P or Grid
   Web services typically described using XML (e.g.,
    WSDL standard)


   Autonomous service providers use non-uniform
    descriptions, with variable structure and text
    comments

   Query: “find web services providing info about
    <breaking news> on a possible tsunami in
    Asia (within 10 words)”

         Existing Approaches: DB & IR

DB community
• data centric (structure)
     • languages
     • efficient evaluation
• XPath 2.0, XQuery 1.0, XSLT 1.0

Information Retrieval (IR) community
• document centric (text)
     • indices
     • ranking methods
• Yahoo!, Google, XXL, JuruXML, Elixir etc.

[Figure: example XML document tree with element nodes doc, newspapers, newspaper, newspaper-name, breaking news, overview, entertainment, sightseeing, sailing clubs and museums, and text leaves]
   Query Languages for Structure + Text
 Challenge: a variety of competing proposals for querying XML
  on structure + text with [BAS-06]
    variable expressive power
    scoring methods
    often fuzzy semantics

 Front-runner language: XQuery Full-Text (XQFT)
    Proposed by W3C task force
       currently in Last Call until June 22, 2007
       could become a W3C Recommendation as early as 2008!
    Subsumes expressivity of most of the proposed FT languages
    Reference implementation: GalaTex [Curtmola et al. XIME-P 2005]

    Query in XQFT
      doc/newspapers/newspaper/breaking_news[
        .//* ftcontains “tsunami” and “Asia” window <=10 words]
      /overview
  Need to Optimize FT Queries

Prior to our project, there was no work on FT query
 optimization; efficient evaluation was limited to
   Conjunctive keyword search (no predicates)
   Full-text predicates in isolation


Need for efficient evaluation of FT queries
   universal formal techniques to optimize




  Outline

 Efficient evaluation of full-text queries
   Query optimization

   Impact of scoring methods on optimizations

   Query distributed data


 Summary and future work



   A Novel Universal Optimization Framework

 XQFT semantics in the W3C proposal is given in a
  functional-language style
   no apparent connection to (relational) database query
    languages


 We provide an alternative (yet equivalent) semantics
  captured by
   Formalization of XML full-text languages in terms of
      keyword patterns
      pattern matches
      predicates evaluated through matches
   XFT algebra
      matches are treated as relational tuples


     XFT Algebra
 Example: query in XQFT

        .//* ftcontains “tsunami” and “Asia” window <=10 words

 Corresponding algebra plan

        σ_window≤10 ( match("tsunami") ⋈ match("Asia") )

    match("tsunami"): all occurrences (matches) of “tsunami”
    match("Asia"): all occurrences (matches) of “Asia”
    ⋈: common ancestors of match pairs
    σ_window≤10: keep only ancestors of close matches
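
The plan on this slide can be mimicked with ordinary relational operations. The following is a minimal sketch, not the XFT engine itself: it assumes a hypothetical in-memory index, treats each match as a (node, position) tuple, uses "same node" as a stand-in for "common ancestor", and reads "window <= 10 words" as "all positions fit in a span of at most 10 words". The names `INDEX`, `join`, `select_window` and `project_nodes` are illustrative only.

```python
from itertools import product

# Hypothetical inverted index: keyword -> list of (node_id, word_position).
# Node ids and positions are made up for illustration.
INDEX = {
    "tsunami": [("n1", 2), ("n2", 7)],
    "Asia": [("n1", 5), ("n2", 40)],
}


def match(keyword):
    """All occurrences (matches) of a keyword, as relational tuples."""
    return INDEX.get(keyword, [])


def join(left, right):
    """Pair matches that share a node (simplified 'common ancestor')."""
    return [(n1, p1, p2) for (n1, p1), (n2, p2) in product(left, right) if n1 == n2]


def select_window(tuples, size):
    """Keep tuples whose positions all fit in a window of `size` words."""
    return [t for t in tuples if max(t[1:]) - min(t[1:]) + 1 <= size]


def project_nodes(tuples):
    """Distinct nodes that still have at least one surviving match."""
    return sorted({t[0] for t in tuples})


result = project_nodes(select_window(join(match("tsunami"), match("Asia")), 10))
print(result)
```

Because matches are plain tuples, the selection could equally be applied inside the join loop, which is the kind of rewriting a relational optimizer performs.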
   Benefits of the Optimization Framework
   [Amer-Yahia et al. SIGMOD 2006]

 Enable leveraging the tried-and-true relational-style
  evaluation & optimization techniques, including
    Join re-ordering
    Pushing selection predicates into joins


 Concise & clean formal semantics for all FT
  languages by translation to the XFT algebra
   one-size-fits-all optimization for all FT languages


 Efficient algorithms for operator evaluation through
  a novel and successful marriage of IR & DB

 Measured speedup of at least two orders of
  magnitude over two reference XQFT engines
  Outline

 Efficient evaluation of full-text queries
   Query optimization

   Impact of scoring methods on optimizations

   Query distributed data


 Summary and future work



  Integrate with Universal Scoring

Until now, scoring is well understood on text
 only

Challenge: score structure + text
   Non-trivial
   Many scoring proposals; sometimes hardcoded in
    the algorithm


Extend the universal optimization framework
 to accommodate universal scoring

   Requirements for Extending with Scores

Documents carry “scores”
   relevance of the documents matching the query


XFT algebraic operators manipulate scores

Requirements
   Generic functions, not a particular scoring function
      no scoring method is better than the other


   Avoid re-computing scores: score of a node can be
    derived solely from the scores of its descendants
   Preliminary Results: Scoring Scheme
Parameterized scoring scheme
     scoreK( k,pos,n ) = score of keyword k at position pos in
      node n

     scoreM( p,m ) = score of a match m for pattern p
        aggregates the scores of the subpatterns of a pattern for the
         same node

     scoreS( SM(n,p) ) = score of a set of matches SM
      corresponding to node n and pattern p
        aggregates scores from children to parent


The score of a node depends on scoring its
 set of matches
   scoreK is used in scoring a match (scoreM)
   scoreM is used in scoring a set of matches (scoreS)
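
One way to picture the parameterized scheme is as a record of three pluggable functions. The sketch below is an assumption-laden illustration, not the talk's implementation: `ScoringScheme`, `score_match` and `score_node` are hypothetical names, scoreM is modeled as a binary function over two sub-pattern scores, and the concrete functions plugged into `toy` are invented for the example.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Sequence


@dataclass(frozen=True)
class ScoringScheme:
    score_k: Callable[[str, int, str], float]    # (keyword, position, node) -> score
    score_m: Callable[[float, float], float]     # combines two sub-pattern scores
    score_s: Callable[[Sequence[float]], float]  # aggregates a set of match scores


def score_match(scheme, keywords, positions, node):
    """Score one match: scoreK per keyword, folded left to right with scoreM."""
    leaves = [scheme.score_k(k, p, node) for k, p in zip(keywords, positions)]
    return reduce(scheme.score_m, leaves)


def score_node(scheme, keywords, matches, node):
    """Score a node from the scores of its set of matches, using scoreS."""
    return scheme.score_s([score_match(scheme, keywords, m, node) for m in matches])


# Hypothetical plug-in functions, chosen only to make the example concrete.
toy = ScoringScheme(
    score_k=lambda k, pos, n: 1.0 / pos,  # earlier occurrences score higher
    score_m=lambda a, b: a + b,           # sum of sub-pattern scores
    score_s=max,                          # the best match decides the node score
)

node_score = score_node(toy, ("tsunami", "Asia"), [(2, 5), (4, 10)], "node1")
print(node_score)
```

Swapping in a different `ScoringScheme` changes the scores without touching `score_match` or `score_node`, which is the point of keeping the functions generic.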
   Example: Using the Scoring Scheme
 Query: “tsunami” and “Asia” and “danger”

 match (2, 5, 40) for pattern (“tsunami”, “Asia”, “danger”)
 = scoreM(scoreM(10, 15), 2)
    match (2, 5) for pattern (“tsunami”, “Asia”)
    = scoreM(10, 15)
       “tsunami” = scoreK(tsunami, 2, node1) = 10
       “Asia” = scoreK(Asia, 5, node1) = 15
    “danger” = scoreK(danger, 40, node1) = 2
 Impact of Scores on Optimizations

 Challenge
   Scoring breaks the expected relational “equivalent” query
    plans
      scoring intermediate nodes might generate different score
       values




      Pitfall: Scoring Breaks Equivalence

  Query: “tsunami” and “Asia” and “danger”
  Leaf scores: tsunami = 10, Asia = 15, danger = 2

  Plan 1: (tsunami, Asia) first, then danger
     scoreM(scoreM(10, 15), 2) = 7.25
  Plan 2: (danger, Asia) first, then tsunami
     scoreM(scoreM(2, 15), 10) = 9.25

  Different values if scoreM is the pairwise average function
  There are functions that break the relational equivalence

 Need
      Consistent scoring: same scores for equivalent plans
      Consistent ranking: same ranks for equivalent plans
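
The two plan scores on this slide can be reproduced directly. The sketch below takes scoreM to be the pairwise average, as the slide states, and reuses the leaf scores 10, 15 and 2; the function names and the contrast with a sum-based scoreM are additions made for illustration.

```python
def score_m_avg(a, b):
    """Pairwise average, the scoreM assumed on the slide."""
    return (a + b) / 2


def score_m_sum(a, b):
    """A sum-based scoreM, added here only for contrast."""
    return a + b


# Plan 1 combines the scores 10 and 15 first, then 2.
plan_1 = score_m_avg(score_m_avg(10, 15), 2)
# Plan 2 combines the scores 2 and 15 first, then 10.
plan_2 = score_m_avg(score_m_avg(2, 15), 10)
print(plan_1, plan_2)  # 7.25 9.25

# With a sum, the join order no longer matters.
sum_1 = score_m_sum(score_m_sum(10, 15), 2)
sum_2 = score_m_sum(score_m_sum(2, 15), 10)
print(sum_1 == sum_2)  # True
```

The average is commutative but not associative, which is why reordering the joins changes the final score.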
       Ongoing Work

 Equivalent rewriting rules (RW) vs. scoring scheme
    Which properties of scoreK, scoreM and scoreS does each rewriting need?

    E.g., join reordering requires associative, commutative
    scoring functions

    E.g., top-K requires monotonicity
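
The properties named above can be probed empirically for a candidate combining function. This is a rough sketch under stated assumptions: the checks run over a small, arbitrary sample of scores, so they can refute a property but never prove it, and the helper names and candidate functions are hypothetical.

```python
from itertools import product


def is_commutative(f, samples):
    return all(f(a, b) == f(b, a) for a, b in product(samples, repeat=2))


def is_associative(f, samples, tol=1e-9):
    return all(
        abs(f(f(a, b), c) - f(a, f(b, c))) <= tol
        for a, b, c in product(samples, repeat=3)
    )


def is_monotone(f, samples):
    """Non-decreasing in each argument on the sampled values."""
    return all(
        f(a, c) <= f(b, c) and f(c, a) <= f(c, b)
        for a, b, c in product(samples, repeat=3)
        if a <= b
    )


SAMPLES = [0, 2, 10, 15]  # arbitrary sample scores
candidates = {
    "average": lambda a, b: (a + b) / 2,
    "sum": lambda a, b: a + b,
    "max": max,
}

report = {
    name: {
        "commutative": is_commutative(f, SAMPLES),
        "associative": is_associative(f, SAMPLES),
        "monotone": is_monotone(f, SAMPLES),
    }
    for name, f in candidates.items()
}
for name, props in report.items():
    print(name, props)
```

On these samples the average fails only the associativity check, which matches the pitfall with join reordering described earlier in the talk.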




       Ongoing Work

 Equivalent rewriting rules (RW) vs. a particular scoring scheme (scoreK, scoreM, scoreS)
    Do the rewritings still hold (RW?)

     Catalog all existing scoring methods for structure and text
      w.r.t. their compatibility with rewriting optimizations
         Can we capture them in our framework?
         E.g., the vector space model is a consistent scoring for the
          relational-style rewritings
 Ongoing Work

Smart, configurable optimizer
   Plug in a particular scoring scheme at run time
   Is it consistent scoring / ranking? (are the rewritings sound?)
      If yes, use the rewritings
      If not, identify and disable all unsound rewritings
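
The decision flow above could look roughly like the following. This is a hypothetical sketch, not the optimizer described in the talk: `Rewriting` and `enabled_rewritings` are invented names, the scheme is assumed to declare its properties as a set of strings, and the two requirements listed are the ones the talk gives as examples (join reordering, top-K).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rewriting:
    name: str
    requires: frozenset  # properties the scoring functions must satisfy


# Requirements taken from the talk's examples; a real catalog would be larger.
REWRITINGS = [
    Rewriting("join reordering", frozenset({"associative", "commutative"})),
    Rewriting("top-K", frozenset({"monotone"})),
]


def enabled_rewritings(scheme_properties, rewritings=REWRITINGS):
    """Split the rewritings into (sound, disabled) for a plugged-in scheme."""
    props = frozenset(scheme_properties)
    sound = [r.name for r in rewritings if r.requires <= props]
    disabled = [r.name for r in rewritings if not r.requires <= props]
    return sound, disabled


# Example: a scheme whose scoreM is the pairwise average.
sound, disabled = enabled_rewritings({"commutative", "monotone"})
print("use:", sound)
print("disable:", disabled)
```

Keeping the requirements as data means a new scoring scheme only has to declare its properties; the optimizer code itself does not change.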




  Outline

 Efficient evaluation of full-text queries
   Query optimization

   Impact of scoring methods on optimizations

   Distributed access methods


 Summary and future work



  Query on Distributed Data

Move from searching individual sources to searching highly
 distributed sources

Challenges
   Consumers and producers: many, dynamic
      completely decentralized
   Users unaware of data location
      completely distributed data


Our goal: efficient distributed computation
   data discovery, evaluation, ranking of FT queries
    P2P Network with XML Sources

[Figure: P2P network of 12 peers (numbered 1 to 12) joined by network links, each with a local XML store; Query1: (tsunami, Asia) and Query2: (concerts, NYC) are posed at different peers]

Each node can
• produce and store XML data
• answer queries over its local XML store
• initiate queries on actual content of documents

Efficient and expressive querying of the global XML data?
     Proposed Architecture

 Consumer’s side
   Locally, post-process at a node
   • leverage the XFT engine (XFT Algebraic Engine)
   Return the answers to the FT query

 Producers’ side
   Distributed access methods (index) to discover the relevant sources
   • answer keyword/XPath part of the queries
  Proposal: Leverage Query Dissemination Trees

Route queries: move queries, not data

Peers self-organize in query dissemination
 trees
   Every node contains summary of XML documents
    stored in its subtrees


Use the dissemination trees for query routing
   Queries always posed at the root
   If a node’s summary matches the query then
    forward query to children
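
The routing rule above can be sketched as a recursive walk that prunes any subtree whose summary cannot match. This is a simplified illustration under several assumptions: `Peer` and `route` are hypothetical names, a summary is modeled as the set of keywords stored in the subtree, a conjunctive query matches when all of its keywords appear in that set, and summaries are recomputed on every call where real peers would presumably cache them.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    local_keywords: set                      # keywords in this peer's own XML data
    children: list = field(default_factory=list)

    def summary(self):
        """Keywords stored anywhere in this peer's subtree, itself included."""
        result = set(self.local_keywords)
        for child in self.children:
            result |= child.summary()
        return result


def route(peer, query, contacted):
    """Return peers whose local data matches; record every peer contacted."""
    contacted.append(peer.name)
    if not query <= peer.summary():
        return []  # nothing below can match, so do not forward
    hits = [peer.name] if query <= peer.local_keywords else []
    for child in peer.children:
        hits += route(child, query, contacted)
    return hits


c = Peer("c", {"concerts"})
b = Peer("b", {"concerts", "NYC"}, [c])
a = Peer("a", {"tsunami", "Asia"})
root = Peer("root", set(), [a, b])

contacted = []
hits = route(root, {"tsunami", "Asia"}, contacted)
print(hits, contacted)
```

In this toy tree the query reaches peer b but stops there, so peer c is never contacted: the query moves, the data stays put.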
      Define the Design Space

 … but the overall throughput depends on the slowest node.

 Challenge: relieve the traffic congestion

 1 tree for all keywords
  • more congestion
  • less control overhead

 1 tree per keyword
  • less congestion
  • more control overhead
    The Design Space To Explore

Optimal solution lies between the extremes
Proposal
     Partition set of keywords into blocks
     Build one tree per keyword block
         connect all keywords from same block into one tree

[Figure: spectrum of ways to partition the data space, from 1 tree for all keywords to 1 tree per keyword, with the optimal solution somewhere in between]
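
The talk does not say how the keyword blocks are chosen. As one possible illustration only, the sketch below partitions keywords greedily by an assumed per-keyword query load, always placing the next heaviest keyword into the currently lightest block. The function name, the load figures and the greedy heuristic are all assumptions, not the author's method.

```python
import heapq


def partition_keywords(load_by_keyword, num_blocks):
    """Greedy partition: heaviest keyword first, into the lightest block."""
    heap = [(0, i, []) for i in range(num_blocks)]  # (block load, block id, keywords)
    heapq.heapify(heap)
    ordered = sorted(load_by_keyword.items(), key=lambda kv: (-kv[1], kv[0]))
    for keyword, load in ordered:
        block_load, block_id, keywords = heapq.heappop(heap)
        keywords.append(keyword)
        heapq.heappush(heap, (block_load + load, block_id, keywords))
    return [keywords for _, _, keywords in sorted(heap, key=lambda b: b[1])]


# Hypothetical query load per keyword.
loads = {"tsunami": 50, "Asia": 30, "concerts": 20, "NYC": 20, "danger": 10}

one_tree = partition_keywords(loads, 1)            # extreme: 1 tree for all keywords
two_trees = partition_keywords(loads, 2)           # a point in between
per_keyword = partition_keywords(loads, len(loads))  # extreme: 1 tree per keyword
print(two_trees)
```

Each returned block would then get its own dissemination tree; varying `num_blocks` moves along the spectrum between the two extremes.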
      Forces at Cross-purposes

 Optimization problem: find the minimum number of trees to relieve
  congestion (improve the overall throughput)
    peak-to-average load within an approximation ε (acceptable ε=20%)

 Tradeoff: congestion vs. control traffic
    1 tree for all keywords: more congestion, less control overhead
    1 tree per keyword: less congestion, more control overhead

 [Figure: congestion and control traffic plotted against the number of trees, from 1 tree for all keywords to 1 tree per keyword (partitioning the data space)]
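
A small sketch of the optimization criterion, under stated assumptions: "within an approximation ε" is read here as peak load divided by average load being at most 1 + ε, and the per-node loads for each tree count are hypothetical numbers rather than measurements from the talk.

```python
def peak_to_average(loads):
    """Ratio of the busiest node's load to the mean load."""
    return max(loads) / (sum(loads) / len(loads))


def min_trees_within(loads_by_tree_count, eps=0.20):
    """Smallest tree count whose peak-to-average load is at most 1 + eps."""
    for count in sorted(loads_by_tree_count):
        if peak_to_average(loads_by_tree_count[count]) <= 1 + eps:
            return count
    return None  # no configuration is balanced enough


# Hypothetical per-node loads for three configurations.
measured = {
    1: [90, 10, 10, 10],  # single tree: the root is a hot spot
    2: [50, 40, 15, 15],
    4: [32, 30, 29, 29],
}

for count, loads in sorted(measured.items()):
    print(count, round(peak_to_average(loads), 3))
print(min_trees_within(measured))
```

Picking the smallest acceptable count, rather than the most balanced one, reflects the tradeoff on the slide: every extra tree adds control traffic.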
  Preliminary Results: Load Balancing

Requirement
   a node that appears high in one tree will appear
    in lower levels in all the other trees
   guarantee that a node appears on a different tree level in
    each tree


Load is balanced when every node has been
 in the top levels at most once

Our approach: circular permutation of the
 internal nodes among the different trees
    peak load decreases drastically
    peak-to-average processing load is within 15%
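
The circular-permutation idea can be illustrated with level-order layouts. This is a simplified sketch, not the talk's algorithm: it assumes every tree is stored as a level-order list, rotates the whole node list (rather than only internal nodes) by a fixed stride per tree, and treats the first `top_size` positions as the "top levels". All names are hypothetical.

```python
def rotated_layouts(nodes, num_trees, top_size):
    """One level-order layout per tree, each rotated by `top_size` positions."""
    n = len(nodes)
    return [
        [nodes[(i + t * top_size) % n] for i in range(n)]
        for t in range(num_trees)
    ]


def top_appearances(layouts, top_size):
    """How many trees place each node in their top positions."""
    counts = {}
    for layout in layouts:
        for node in layout[:top_size]:
            counts[node] = counts.get(node, 0) + 1
    return counts


nodes = list("ABCDEFGHI")  # 9 hypothetical peers
layouts = rotated_layouts(nodes, num_trees=3, top_size=3)  # top = root + 2 children

for layout in layouts:
    print(layout)
print(top_appearances(layouts, top_size=3))
```

As long as `num_trees * top_size` does not exceed the number of nodes, no node lands in the top positions of more than one tree, which is the balance condition stated above.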
   Future Directions

For conjunctive query routing
   Query selectivity estimation


Scoring in distributed systems
   E.g., IDF is inherently global


Need an analytical cost model to better
 understand parameters for XML access
 methods in the design space
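
To see why IDF is inherently global, compare per-peer values with the value obtained after aggregating counts. The sketch uses the common definition idf = log(N / df); the peers and their document counts are hypothetical.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    """Inverse document frequency, log(N / df)."""
    return math.log(num_docs / doc_freq)


# Hypothetical peers: (local document count, local document frequency per keyword).
peers = [
    (100, Counter({"tsunami": 1})),
    (100, Counter({"tsunami": 50})),
]

local_idfs = [idf(n, df["tsunami"]) for n, df in peers]

total_docs = sum(n for n, _ in peers)
total_df = sum((df for _, df in peers), Counter())
global_idf = idf(total_docs, total_df["tsunami"])

print([round(x, 3) for x in local_idfs], round(global_idf, 3))
```

Each peer on its own would weight the keyword very differently, so scores computed locally are not comparable until the document counts are combined across the network.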


  Summary

A formalized approach to full-text queries
 for large-scale systems
   Efficiency
      Relational-style optimizations of XFT algebraic plans
      Universal scoring
         properties of scoring functions for scoring consistency
   Distributed computation


Prototype (under construction)



Thank You!



