Factorizing Complex Predicates in Queries to Exploit Indexes by yurtgc548

VIEWS: 0 PAGES: 24

									   Primitives for Workload
Summarization and Implications
           for SQL
            Prasanna Ganesan*
               Stanford University
    Surajit Chaudhuri     Vivek Narasayya
               Microsoft Research

         *Work done at Microsoft Research




                                            1
                   Motivation
• Workload: Set of SQL Statements
• Many tasks exploit workload information
  – DB Admin, Index Tuning, Statistics building,
    Approximate Query Processing
• DBMS profilers produce large workloads
  (+additional info)
• Most tasks need small workloads
• Goal: Summarization - Find a “representative”
  subset of a given, large workload.
  – Sometimes a weighted subset
                                                   2
  Why Not Random Sampling?
• One Size does not fit all
   – Different definitions of “representative subset”
   – Random sampling may lose valuable info
• Ignores additional info associated with
  statements
• Shown to work poorly, e.g., for Index Selection
  [chaudhuri02]
   – May oversample queries on some tables, while
     ignoring less frequent queries on other tables


                                                        3
                   Our Solution
•   Treat input as a relation
    •   Each SQL statement (+associated info) is a tuple
•   Extend SQL with new language primitives
    •   Allow declarative specification of desired subset
    •   Usable on arbitrary relations, not just workloads
•   Implement extensions inside query engine
    •   Why? Primitives appear widely applicable
    •   Other implementation options available



                                                            4
  The Architecture
            SELECT *, DOMSUM(Count) FROM WkldTbl
            DOMINATE WITH PARTITIONING BY FromTables, JoinConds,
            WhereCols
            (SLAVE.GroupByCols Í MASTER.GroupByCols) AND
            (SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
            REPRESENT WITH PARTITIONING BY FromTables,
            JoinConds, WhereCols MAXIMIZING SUM(DOM_Count)
            GLOBAL CONSTRAINT Count(*) ≤ 200
            LOCAL CONSTRAINT Count(*) ≥
            int(200*LOCAL.Count(*)/GLOBAL.Count(*))




Execution                                     Application
            Summary
 Engine




                                                               5
                     Outline
• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing summarization primitives in SQL
• Experiments




                                                  6
                 Dominance
• Idea: Filter and aggregate using a partial
  order on tuples
• Specify condition for one tuple to dominate
  another
  – Transitive condition
  – Encapsulates application knowledge
• Output: Keep throwing away tuples that are
  dominated
  – Retain aggregate info about dominated tuples


                                                   7
A Graphical Representation

                  Vendor   Quality   Price
                  6Buono    75
                            3         25     2


                   3
       2Cattivo   50       50




                                             8
Applying Dominance to Workloads
• Example: Index Selection
      Q1                                      Q2

  SELECT ... FROM R                    SELECT … FROM R
                        dominates
   GROUP BY A, B, C                      GROUP BY A, B


  – An index useful for Q1 likely to be useful for Q2

   MASTER.FromTables=SLAVE.FromTables AND
  MASTER.GroupByCols Ê SLAVE.GroupByCols AND
  MASTER.OrderByCols PREFIX SLAVE.OrderByCols

                                                         9
                     Outline
• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments




                                                  10
                Representation
• Dominance only gets us so far
    – Need a “lossier” way to select a subset
•      Idea: Pick a subset that solves
       a Linear Program
    – Optimize some criterion
    – Satisfy lots of constraints
    – Support concept of partitioning




                                                11
                          Details
• Partition tuples by a set of attributes



• Criterion: Maximize/Minimize Aggregate
   – E.g., Minimize Count(*)
• Global Constraints
   – E.g., Sum(B) in chosen subset > 60% Sum(B) in input
• Local Constraints - apply to each partition
   – E.g., Sum(B) in chosen subset > 40% Sum(B) in that partition



                                                                    12
   An Index Selection Example
• Partition by Tables, Join Conditions and
  attributes in WHERE clause
• Criterion: Maximize Sum(ExecutionCost)
  – Need best “coverage”
• Global Constraint: Count(*) ≤ 200
• Local Constraint: Proportionate representation
  – A partition with 20% of input should have 20% of
    output
  – Count(*) ≥int(200*LOCAL.Count(*)/GLOBAL.Count(*))

                                                   13
                Putting it all together
• Apply dominance criterion (as earlier).
• Apply representation (as earlier, but maximize
  SUM(DOM_Count) ).
• Weight each tuple by the number of tuples it dominates.



 SELECT SqlString, DOMSUM(Count) FROM WkldTbl
 DOMINATE WITH PARTITIONING BY FromTables, JoinConds, WhereCols
 (SLAVE.GroupByCols Í MASTER.GroupByCols) AND (SLAVE.OrderByCols PREFIX
 MASTER.OrderByCols)
 REPRESENT WITH PARTITIONING BY FromTables, JoinConds, WhereCols
 MAXIMIZING SUM(DOM_Count)
 GLOBAL CONSTRAINT Count(*) ≤ 200
 LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))


                                                                          14
                     Outline
• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments




                                                  15
    Implementing Summarization
         Primitives in SQL
• Assume set and sequence support in SQL
   – The mills of the standards bodies…
• Partitioning useful for both primitives
   – Hashing, Sort-based, Index-based…
• Implementing Dominance
   – Naïve O(n2) algorithm
   – Techniques from group-wise processing
   – Leverage Skyline optimizations



                                             16
              Representation
• Implementing directly is LP-hard
• Many queries are much simpler
  – Fall into one of two special cases
• Other queries are handled by a simple heuristic
  – User-guided search
• Implement as multiple operators




                                                    17
         User-Guided Search
• Scan tuples in a specific order
  – User-specified, or heuristically chosen
• Will always minimize/maximize Count(*)
  – Use ordering to transform other objectives
  – Slightly different algorithms for the two cases




                                                      18
                         F
A Minimization Example
                         E

Satisfied                D

                         C
            Output
                         B

                         A


 Violated

                         19
          Two Special Cases
• Maximize SUM(Attr)
  – All constraints are on Count(*)
  – Use partitioning and sort-order access
• Minimize Count(*)
  – Single constraint: Again easily solved
  – More special cases also solvable
  – Multiple constraints: Approximation algorithm




                                                    20
                 Experiments
• Evaluate utility for index selection
• Compare to sophisticated Wkld. Compression
  [chaudhuri02]
  – Clusters using a complex distance function
• Simple query as described earlier
  – Constrained to output same number of statements as
    Workload Compression
  – Orders of magnitude faster
• TPC-H 1GB database
  – Multiple synthetic workloads introduced in
    [chaudhuri02]


                                                     21
       Experiments (Contd.)


Workload                          Tuning
                       Compress
                                  Wizard




  Evaluate



Total Estimated Cost
                                           22
Comparing Estimated Costs




                            23
                 Conclusion
• Our contributions
  – Summarization can be expressed declaratively
  – Introduction of new operators for summarization
  – Discussion of SQL implementation
• The Future
  – An automatic monitoring and tuning infrastructure?
  – More workload-sensitive tasks?




                                                         24

								
To top