Evaluating Top-k Queries over Web-Accessible Databases

Nicolas Bruno
Luis Gravano
Amélie Marian
Columbia University
“Top-k” Queries Natural in
Many Scenarios
     Example: NYC Restaurant Recommendation
     Service.
     Goal: Find best restaurants for a user:
            Close to address: “2290 Broadway”
            Price around $25
            Good rating

   Query: Specification of Flexible Preferences

   Answer: Best k Objects for Distance Function
2/27/2002                                         2
Attributes often Handled by
External Sources
     MapQuest returns the distance between
     two addresses.
     NYTimes Review gives the price range
     of a restaurant.
     Zagat gives a food rating to the
     restaurant.



“Top-k” Query Processing
Challenges
     Attributes handled by external sources
     (e.g., MapQuest distance).
     External sources exhibit a variety of
     interfaces (e.g., NYTimes Review,
     Zagat).
     Existing algorithms do not handle all
     types of interfaces.

Processing Top-k Queries over
Web-Accessible Data Sources
     Data and query model
     Algorithms for sources with different
     interfaces
     Our new algorithm: Upper
     Experimental results



 Data Model
     Top-k Query: assignment of weights and target values to attributes.
     Target values: < $25, “2290 Broadway”, very good >
            (preferred price, close to address, preferred rating)
     Weights: <4, 1, 2>
            Price: most important attribute.
     Weights and attribute scores are combined in a scoring function.
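A minimal sketch of such a scoring function, assuming the normalized weighted sum that the example slides later in this deck use. The function name `combined_score` and the per-attribute scores in the usage line are hypothetical, chosen only for illustration.

```python
def combined_score(scores, weights):
    """Normalized weighted sum of per-attribute scores, each in [0, 1]."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per attribute score")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Hypothetical attribute scores for one restaurant, in the order
# (price, distance, rating), combined with the weights <4, 1, 2>.
example = combined_score([0.8, 0.5, 0.9], [4, 1, 2])
print(example)
```

Dividing by the sum of the weights keeps the combined score in [0, 1], so an object that matches every target value perfectly scores 1.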

Sorted Access Source S
     Return objects sorted
     by scores for a given
     query.
            GetNextS interface

     Example: Zagat

  S-Source
  Access Time: tS(S)


Random Access Source R
     Return the score of a
     given object for a given
     query.
            GetScoreR interface

     Example: MapQuest

  R-Source
  Access Time: tR(R)
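The two access interfaces can be pictured with a small in-memory sketch. The class names, the `elapsed` counters, and the toy scores are assumptions made for illustration; real sources would be remote web services, and the query argument q is omitted here because the toy scores are already query-specific.

```python
class SortedSource:
    """S-Source: returns objects in decreasing score order (GetNext)."""

    def __init__(self, scores, access_time=1.0):
        # scores: dict mapping object id -> score in [0, 1] (toy data)
        self._ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._pos = 0
        self.access_time = access_time  # plays the role of tS(S)
        self.elapsed = 0.0              # total time spent on this source

    def get_next(self):
        """Return the next (object, score) pair, or None when exhausted."""
        if self._pos >= len(self._ranked):
            return None
        self.elapsed += self.access_time
        item = self._ranked[self._pos]
        self._pos += 1
        return item


class RandomSource:
    """R-Source: returns the score of a given object (GetScore)."""

    def __init__(self, scores, access_time=1.0):
        self._scores = dict(scores)
        self.access_time = access_time  # plays the role of tR(R)
        self.elapsed = 0.0

    def get_score(self, obj):
        """Return the score of obj; each call costs one access time."""
        self.elapsed += self.access_time
        return self._scores[obj]


zagat = SortedSource({"o1": 0.9, "o2": 0.8, "o3": 0.45})
mapquest = RandomSource({"o1": 0.1, "o2": 0.7, "o3": 0.6})
first = zagat.get_next()
first_distance = mapquest.get_score(first[0])
print(first, first_distance)
```

An R-Source cannot enumerate its objects, so an object must first be discovered through sorted access before it can be probed.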


Query Model
     Attribute scores between 0 and 1.
     Sequential access to sources.
     Score ties broken arbitrarily.
     No wild guesses.
     One S-Source (or SR-Source) and
     multiple R-Sources. (More on this later.)


Query Processing Goals
     Processing top-k queries over R-Sources.
     Returning exact answer to top-k query q.
     Minimizing query response time.
     Naïve solution too expensive (access all
     sources for all objects).



Example: NYC Restaurants
     S-Source:
            Zagat: restaurants sorted by food rating.
     R-Sources:
            MapQuest: distance between two input
            addresses.
             User address: “2290 Broadway”
            NYTimes Review: price range of the input
            restaurant.
             Target Value: $25

TA Algorithm for SR-Sources
                       Fagin, Lotem, and Naor (PODS 2001)
       Perform sorted access sequentially to all SR-Sources
       Completely probe every object found for all
       attributes using random access.
       Keep best k objects.
       Stop when scores of best k objects are no less than
       maximum possible score of unseen objects
       (threshold).

        Does NOT handle R-Sources

Our Adaptation of TA Algorithm
for R-Sources: TA-Adapt
       Perform sorted access to S-Source S.
       Probe every R-Source Ri for newly found
       object.
       Keep best k objects.
       Stop when scores of best k objects are no
       less than maximum possible score of unseen
       objects (threshold).
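The four steps above can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the function name `ta_adapt`, the in-memory lists and dicts standing in for sources, and the normalized weighted-sum scoring function are assumptions.

```python
def ta_adapt(sorted_scores, random_sources, weights, k, t_s=1.0, t_r=None):
    """Sketch of TA-Adapt for one S-Source and several R-Sources.

    sorted_scores: list of (object, score) in decreasing score order (S-Source).
    random_sources: list of dicts object -> score, one per R-Source.
    weights: weight of S first, then one weight per R-Source.
    Returns (top-k list of (score, object), total access time).
    """
    t_r = t_r or [1.0] * len(random_sources)
    total_w = sum(weights)
    best, elapsed = [], 0.0
    for obj, s_score in sorted_scores:           # sorted access (GetNext)
        elapsed += t_s
        score = weights[0] * s_score
        for w, src, t in zip(weights[1:], random_sources, t_r):
            score += w * src[obj]                 # random access (GetScore)
            elapsed += t
        best.append((score / total_w, obj))
        best = sorted(best, reverse=True)[:k]     # keep best k objects
        # An unseen object scores at most s_score on S and at most 1 elsewhere.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_w
        if len(best) == k and best[-1][0] >= threshold:
            break
    return best, elapsed


# Toy data: S = Zagat, R1 = MapQuest, R2 = NYTimes Review.
zagat = [("o1", 0.9), ("o2", 0.8), ("o3", 0.45)]
mapquest = {"o1": 0.1, "o2": 0.7, "o3": 0.6}
nytimes = {"o1": 0.5, "o2": 0.7, "o3": 0.3}
top, cost = ta_adapt(zagat, [mapquest, nytimes], weights=[3, 2, 1], k=1)
print(top, cost)
```

Every object returned by the S-Source is probed on every R-Source before the stopping test runs, which is the cost the later improvements try to avoid.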



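An Example Execution of TA-Adapt
     Object   S(Zagat)   R1(MQ)   R2(NYT)   Final Score
     o1       0.9        0.1      0.5       0.57
     o2       0.8        0.7      0.7       0.75
     o3       0.45       0.6      0.3       0.475

     Accesses, in order (initial threshold = 1):
       1. GetNextS(q) returns o1; threshold = 0.95.
       2. GetScoreR1(q, o1), GetScoreR2(q, o1).
       3. GetNextS(q) returns o2; threshold = 0.9.
       4. GetScoreR1(q, o2), GetScoreR2(q, o2).
       5. GetNextS(q) returns o3; threshold = 0.725.
       6. GetScoreR1(q, o3), GetScoreR2(q, o3).
     Stop: the best score, 0.75 (o2), is no less than the threshold 0.725.
     Total Execution Time = 9

     tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
     Final Score = (3 × scoreZagat + 2 × scoreMQ + 1 × scoreNYT)/6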
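The threshold values in this example follow from the scoring function by giving an unseen object the last score s returned by the S-Source and the maximum score 1 on both R-Sources. The symbol T(s) is introduced here only for this worked arithmetic.

```latex
\begin{align*}
T(s)    &= \frac{3s + 2 \cdot 1 + 1 \cdot 1}{6} \\
T(1)    &= \frac{6}{6} = 1 \\
T(0.9)  &= \frac{5.7}{6} = 0.95 \\
T(0.8)  &= \frac{5.4}{6} = 0.9 \\
T(0.45) &= \frac{4.35}{6} = 0.725
\end{align*}
```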
Improvements over TA-Adapt
     Add a shortcut test after each random-
     access probe (TA-Opt).
     Exploit techniques for processing
     selections with expensive predicates
     (TA-EP).
            Reorder accesses to R-Sources.
            Best weight/time ratio.
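Both improvements can be sketched in one probing routine for a single object. This is an illustrative reading of the slide under stated assumptions: the function name `probe_with_shortcut`, its parameters, and the normalized weighted-sum scoring function are not taken from the paper.

```python
def probe_with_shortcut(s_score, probe, weights, times, kth_best_score):
    """Probe R-Sources for one object, stopping early when it cannot qualify.

    s_score: score already obtained from the S-Source.
    probe: function i -> score of the object on R-Source i (a random access).
    weights: weight of S first, then one weight per R-Source.
    times: access time tR of each R-Source.
    kth_best_score: current k-th best final score (0.0 if fewer than k known).
    Returns (final score, or None if the object was discarded; time spent).
    """
    total_w = sum(weights)
    r_weights = weights[1:]
    # TA-EP-style ordering: highest weight/time ratio first.
    order = sorted(range(len(r_weights)),
                   key=lambda i: r_weights[i] / times[i], reverse=True)
    known = weights[0] * s_score   # weighted sum of the scores known so far
    unknown_w = sum(r_weights)     # total weight of the unprobed R-Sources
    spent = 0.0
    for i in order:
        # TA-Opt-style shortcut: the upper bound assumes score 1 on every
        # unprobed source; below the k-th best score, the object is dropped.
        if (known + unknown_w) / total_w < kth_best_score:
            return None, spent
        known += r_weights[i] * probe(i)
        unknown_w -= r_weights[i]
        spent += times[i]
    return known / total_w, spent


# Object o1 from the example (S = 0.9, R1 = 0.1, R2 = 0.5), checked against
# a hypothetical current best score of 0.75.
o1_scores = [0.1, 0.5]
result = probe_with_shortcut(0.9, lambda i: o1_scores[i], [3, 2, 1], [1, 1], 0.75)
print(result)
```

In the usage line, the probe on R1 lowers the upper bound of o1 to 0.65, so the probe on R2 is skipped.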


The Upper Algorithm
     Selects a pair (object,source) to probe next.
     Based on the property:

        The object with the highest upper bound will
         be probed before top-k solution is reached.

        [Figure: per-object upper bounds on a score axis, in two cases:
         "Object is one of top-k objects" and
         "Object is not one of top-k objects".]



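An Example Execution of Upper
     Object   Upper Bound (successive)   S(Zagat)   R1(MQ)   R2(NYT)   Final Score
     o1       0.95, 0.65                 0.9        0.1      -         -
     o2       0.9, 0.8, 0.75             0.8        0.7      0.7       0.75
     o3       0.725                      0.45       -        -         -

     Accesses, in order (initial threshold = 1):
       1. GetNextS(q) returns o1; threshold = 0.95.
       2. GetScoreR1(q, o1); upper bound of o1 drops to 0.65.
       3. GetNextS(q) returns o2; threshold = 0.9.
       4. GetScoreR1(q, o2); upper bound of o2 drops to 0.8.
       5. GetNextS(q) returns o3; threshold = 0.725.
       6. GetScoreR2(q, o2); final score of o2 is 0.75.
     Stop: 0.75 (o2) is no less than any other upper bound or the threshold.
     Total Execution Time = 6

     tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
     Final Score = (3 × scoreZagat + 2 × scoreMQ + 1 × scoreNYT)/6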
The Upper Algorithm
       Choose object with highest upper bound.
       If some unseen object can have higher upper bound:
           Access S-Source S
       Else:
           Access best R-Source Ri for chosen object
       Keep best k objects
       If top-k objects have final values higher than
       maximum possible value of any other object, return
       top-k objects.
            Interleaves accesses on objects
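The loop above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `upper`, the in-memory sources, and the normalized weighted-sum scoring function are assumptions, and the "best R-Source" is reduced to the unprobed source with the highest weight/time ratio.

```python
def upper(sorted_scores, random_sources, weights, k, t_s=1.0, t_r=None):
    """Sketch of Upper for one S-Source and several R-Sources.

    Returns (top-k list of (score, object), total access time).
    """
    n_r = len(random_sources)
    t_r = t_r or [1.0] * n_r
    seen = {}            # object -> (S score, {R-Source index: probed score})
    pos, elapsed = 0, 0.0

    def weighted(s_score, probed):
        """Weighted score, assuming score 1 on every unprobed R-Source."""
        total = weights[0] * s_score
        for i in range(n_r):
            total += weights[i + 1] * probed.get(i, 1.0)
        return total / sum(weights)

    def upper_bound(obj):
        s_score, probed = seen[obj]
        return weighted(s_score, probed)  # the final score once fully probed

    while True:
        exhausted = pos >= len(sorted_scores)
        # Best score that an object not yet returned by S could still reach.
        last_s = sorted_scores[pos - 1][1] if pos else 1.0
        unseen = float("-inf") if exhausted else weighted(last_s, {})

        complete = sorted((o for o in seen if len(seen[o][1]) == n_r),
                          key=upper_bound, reverse=True)
        top_k = complete[:k]
        partial = [o for o in seen if len(seen[o][1]) < n_r]
        rival = max([upper_bound(o) for o in partial] + [unseen])

        settled = len(top_k) == k or (exhausted and not partial)
        if settled and (not top_k or upper_bound(top_k[-1]) >= rival):
            return [(upper_bound(o), o) for o in top_k], elapsed

        best = max(partial, key=upper_bound, default=None)
        if best is None or unseen > upper_bound(best):
            obj, s_score = sorted_scores[pos]      # sorted access (GetNext)
            seen[obj] = (s_score, {})
            pos += 1
            elapsed += t_s
        else:
            unprobed = [i for i in range(n_r) if i not in seen[best][1]]
            i = max(unprobed, key=lambda j: weights[j + 1] / t_r[j])
            seen[best][1][i] = random_sources[i][best]  # random access
            elapsed += t_r[i]


# Toy data: S = Zagat, R1 = MapQuest, R2 = NYTimes Review.
zagat = [("o1", 0.9), ("o2", 0.8), ("o3", 0.45)]
mapquest = {"o1": 0.1, "o2": 0.7, "o3": 0.6}
nytimes = {"o1": 0.5, "o2": 0.7, "o3": 0.3}
top, cost = upper(zagat, [mapquest, nytimes], weights=[3, 2, 1], k=1)
print(top, cost)
```

On this toy data the sketch performs three sorted accesses and three probes. It never probes o3, and it probes o1 on R1 only.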
Selecting the Best Source
       Upper relies on expected values to make its
       choices.
       Upper computes “best subset” of sources
       that is expected to:
      1. Compute the final score for the top k objects.
      2. Discard other objects as fast as possible.
       Upper chooses best source in “best subset”.
            Best weight/time ratio.



Experimental Setting:
Synthetic Data
     Attribute scores randomly generated (three
     data sets: uniform, Gaussian, and correlated).
     tR(Ri): integer between 1 and 10.
     tS(S) ∈ {0.1, 0.2, …, 1.0}.
     Query execution time: ttotal
     Default: k=50, 10000 objects, uniform data.
     Results: average ttotal of 100 queries.
     Optimal assumes complete knowledge
      (unrealistic, but useful performance bound)
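A minimal sketch of how one uniform-data instance with these parameters could be generated. The function name `make_uniform_instance`, the default of five R-Sources, and the seeding scheme are assumptions for illustration; the slide does not state how many R-Sources were used or how the Gaussian and correlated data sets were built.

```python
import random


def make_uniform_instance(n_objects=10000, n_r_sources=5, seed=0):
    """Generate one synthetic instance with uniformly distributed scores.

    Returns (s_scores, r_sources, t_s, t_r):
      s_scores  - list of (object, score) in decreasing score order (S-Source)
      r_sources - list of dicts object -> score, one per R-Source
      t_s       - sorted-access time, drawn from {0.1, 0.2, ..., 1.0}
      t_r       - random-access times, integers between 1 and 10
    """
    rng = random.Random(seed)
    objects = [f"o{i}" for i in range(n_objects)]
    s_scores = sorted(((obj, rng.random()) for obj in objects),
                      key=lambda pair: pair[1], reverse=True)
    r_sources = [{obj: rng.random() for obj in objects}
                 for _ in range(n_r_sources)]
    t_r = [rng.randint(1, 10) for _ in range(n_r_sources)]
    t_s = rng.randint(1, 10) / 10
    return s_scores, r_sources, t_s, t_r


s_scores, r_sources, t_s, t_r = make_uniform_instance(n_objects=1000)
print(len(s_scores), len(r_sources), t_s, t_r)
```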

Experiments: Varying Number
of Objects Requested k
     [Figure: ttotal (y-axis, 0 to 210) versus k (x-axis, 0 to 100)
      for Optimal, Upper, TA-EP, TA-Opt, and TA-Adapt.]
Experiments: Varying Number
of Database Objects N
     [Figure: ttotal (y-axis, 0 to 350) versus number of objects in
      S-Source S (x-axis, 0 to 100000) for Optimal, Upper, TA-EP, and TA-Opt.]
Experimental Setting:
Real Web Data
     S-Source: Verizon Yellow Pages (sorted by distance)
     R-Sources:
        Source               Attribute(s)
        Subway Navigator     Subway time
        Altavista            Popularity
        MapQuest             Driving time
        NYTimes Review       Food and price ratings
        Zagat                Food, service, décor, and price ratings
Experiments: Real-Web Data
     [Figure: number of random accesses nR (y-axis, 0 to 6000) for
      Query 1 through Query 7, comparing Upper, TA-EP, and TA-Opt.]
Evaluation Conclusions
     TA-EP and TA-Opt much faster than
     TA-Adapt.
     Upper significantly better than all
     versions of TA.
     Upper close to optimal.
     Real data experiments: Upper faster
     than TA adaptations.

Conclusion
     Introduced first algorithm for top-k processing
     over R-Sources.
     Adapted TA to this scenario.
     Presented new algorithms: Upper and Pick (see
     paper)
     Evaluated our new algorithms with both real
     and synthetic data.
            Upper close to optimal


Current and Future Work
     Relaxation of the Source Model
            Current source model limited
            Any number of R-Sources and SR-Sources
            Upper has good results even with only SR-Sources
     Parallelism
            Define a query model for parallel access to
            sources
            Adapt our algorithms to this model
     Approximate Queries

References
     Top-k Queries:
            S. Chaudhuri and L. Gravano. Evaluating Top-k Selection Queries.
            VLDB 1999.
     TA algorithm:
            R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms
            for Middleware. PODS 2001.
     Variations of TA:
            S. Nepal and M. V. Ramakrishna. Query Processing Issues in Image
            (Multimedia) Databases. ICDE 1999.
            U. Güntzer, W.-T. Balke, and W. Kießling. Optimizing Multi-Feature
            Queries for Image Databases. VLDB 2000.
     Expensive predicates:
            J. M. Hellerstein and M. Stonebraker. Predicate Migration:
            Optimizing Queries with Expensive Predicates. SIGMOD 1993.
Real-web Experiments
     [Figure: ttotal (y-axis, 0 to 6000) for Query 1 through Query 7,
      comparing Upper, TA-EP, and TA-Opt.]
Real-web Experiments with
Adaptive Time
     [Figure: ttotal in seconds (y-axis, 0 to 1200) for Query 1 through
      Query 4, comparing TA-Opt, TA-EP, and Upper.]
Relaxing the Source Model
     [Figure: ttotal (y-axis, 0 to 200000) versus number of SR-Sources
      (out of 6 sources; x-axis, 0 to 7). Legend: Upper_Weight,
      Upper-Relaxed, TA-Upper, TAz-EP-NODUP, TAz-EP. Curve labels:
      TA-EP, Upper.]
Upcoming Journal Paper
     Variations of Upper
            Select best source
     Data Structures
     Complexity Analysis
     Relaxing Source Model
            Adaptation of our Algorithms
            New Algorithms
     Variations of Data and Query Model to handle
     real web data

Optimality
     TA instance optimal over:
            Algorithms that do not make wild guesses.
            Databases that satisfy the distinctness property.
     TAZ instance optimal over:
            Algorithms that do not make wild guesses.
     No complexity analysis of our algorithms, but
     experimental evaluation instead


