Semantic Query Optimization Techniques by ebh18594

VIEWS: 106 PAGES: 31

									Semantic Query Optimization
        Techniques


         November 16, 2005




         By : Mladen Kovacevic
Background
•   1980's, semantic information stored in dbs as
    integrity constraints could be used for query
    optimization

•   semantic: “of or relating to meaning or the
    study of meaning”
    (http://wordnet.princeton.edu)

•   integrity : preserve data consistency when
    changes made in db.

•   no extensive implementation existing today
    (1999)
Introduction
•   Key factor in relational database system’s
    improvement in query execution time, is query
    optimization.

•   Query execution can be improved by:

     – Analyzing integrity information, and rewriting
       queries exploiting this information (JE & PI)

     – Avoid expensive sorting costs (Order
       Optimization)

     – Exploiting uniqueness by knowing rows will be
       unique, thus, avoiding extra sorts. (EU)
Presentation Overview
•   Semantic Query Optimization techniques

     – Join Elimination (JE)

     – Predicate Introduction (PI)

     – Order Optimization (OO)

     – Exploiting Uniqueness (EU)
Some Motivation
•    Describing two techniques in SQO,
     demonstrated in DB2 UDB.
      – Predicate Introduction
      – Join Elimination

•    Reasons:
      – rewriting queries by hand showed that these two
        provided consistent optimization.
      – practical to implement
      – extendible to other DBMS’s.

•    Data sets used : TPC-D and APB-1 OLAP
     benchmarks

    only REFERENTIAL INTEGRITY constraints and
             CHECK CONSTRAINTS used!
       Semantic Query Optimization
           (SQO) Techniques
•   Join Elimination: Some joins need NOT be evaluated
    since the result may be known apriori (more on this
    later)

•   Join Introduction: Adding a join can help if relation is
    small compared to original relations and highly
    selective.

•   Predicate Elimination : If predicate known to be
    always true, can be eliminated from query (DISTINCT
    clause on Primary Key – Uniqueness exploitation!)

•   Predicate Introduction: New predicates on indexed
    attributes can result in a much faster access plan.

•   Detecting the Empty Answer Set : If query predicates
    inconsistent with integrity constraints, the query does
    not have answer.
Why SQO implementations not
used?
•   Deductive Databases : Many cases SQO techniques
    were designed for deductive databases, thus not
    appearing to be useful in relational database context.

•   CPU & I/O Speeds similar : When being developed,
    CPU & I/O speeds were not as dramatically different
     – (savings in I/O not worth the CPU time added)

•   Lack of Integrity Constraints : Thought that many
    integrity constraints are needed for SQO to be useful
Two-stage Optimizer
•   Examples of SQO techniques always designed for a
    two-stage optimizer

     – Stage 1 : logically equivalent queries created
       (DB2’s query rewrite optimization)

     – Stage 2 : generate plans of all these queries,
       choosing the one with lowest estimated cost.
       (DB2’s query plan optimization)
         • Join order, join methods, join site in a
           distributed database, method for accessing
           input table, etc.
Join Elimination
·   Simple : Eliminate relation where join is over tables
    related through referential integrity constraint, and
    primary key table referenced only in the join

VIEW DEFINITION
CREATE VIEW Supplier_Info (n, a, c) as
SELECT      s_name, s_address, n_name
FROM        tpcd.supplier, tpcd.nation
WHERE       s_nationkey = n_nationkey


QUERY
SELECT          s_n, s_a
FROM            Supplier_Info
Join Elimination (con’t)
·   Query can be rewritten internally as:

SELECT       s_n, s_a
FROM         tpcd.supplier


Why do such a simple rewrite?

           • User may not have access to the supplier
           table,
              and/or may only know about the view.
           • Sometimes GUI managers create these
           “dumb”
              queries so need to optimize
           • Non-programmers write queries often, and may
              not even think about this.

• Algorithm for generic redundant join removal
  provided in paper.
Example – Join Elimination
SELECT       p_name, p_retailprice, s_name, s_address
FROM         tpcd.lineitem, tpcd.partsupp,
             tpcd.part,     tpcd.supplier
WHERE        p_partkey = ps_partkey and
             s_suppkey = ps_suppkey and
             ps_partkey = l_partkey and
             ps_suppkey = l_suppkey and
             l_shipdate between '1994-01-01' and
             '1996-06-30' and l_discount >= 0.1
GROUP BY     p_name, p_retailprice, s_name, s_address
ORDER BY     p_name, s_name
    PART
  PARTKEY

                    PARTSUP         LINEITEM
                        P
                    PARTKEY         PARTKEY

                    SUPPKEY         SUPPKEY
   SUPPLIE
      R
  SUPPKEY


                                  1 – many relationship
Example : Join Elimination
•   Any immediate improvements that can be seen here?

                p_partkey    =     ps_partkey and
                s_suppkey    =     ps_suppkey and
                ps_partkey   =     l_partkey and
                ps_suppkey   =     l_suppkey

                       PS_PARTKEY = L_PARTKEY




    P_PARTKEY            PS_PARTKEY                 L_PARTKEY
          P_PARTKEY = PS_PARTKEY    PS_PARTKEY = L_PARTKEY




    S_SUPPKEY            PS_SUPPKEY                 L_SUPPKEY

          S_SUPPKEY = PS_SUPPKEY     PS_SUPPKEY = L_SUPPKEY


                       S_SUPPKEY = L_SUPPKEY
Results
•             100 MB db size
•             Execution Time : 58.5 sec -> 38.25 sec (35 % improvement)
•             I/O Cost: 4631 -> 1498 page reads (67 % improvement)

              Join Elimination Optimizing Query 1       Join Elimination Optimizing Query 1
                        Execution Time                              Pages Read


              70                                            5000
                                                                      4631
                                                            4500
              60        58.5
                                                            4000
              50                                            3500

                                        38.25               3000
                                                    Pages
              40
    Seconds




                                                            2500
              30
                                                            2000
                                                                                     1498
              20                                            1500

                                                            1000
              10
                                                             500

               0                                               0
                       Original       Optimized                      Original     Optimized
Results – OLAP Environment
•   In OLAP (online analytical processing) servers, using a
    star schema (one fact table, with several dimension
    tables) improvements ranged from 2% to 96 %.

     – In these cases, much improvement came from
       CPU cost instead of I/O, because dimension tables
       were small enough to fit into memory...
                                 Join Elimination Optimized Query in OLAP
                                                Environment
                           700

                           600

                           500
Execution Time (seconds)




                           400

                           300

                           200

                           100

                             0
                                 I1   I2   I3   I4   I5     I6    I7   I8   I9   I10
                                                     Query Name
Predicate Introduction
•   Techniques discussed :

     – Index Introduction : add new predicate on
       attribute if index exists on that attribute.
               – Assumption : index retrieval is better than
                  table scan, is this always good?

     – Scan Reduction : reduce number of tuples that
       qualify for a join.
               – Problem : Not very common; unlikely that
                  there will be any check constraints or
                  predicates with inequalities about join
                  columns

•   Detecting empty query answer set (not shown as query
    execution time essentially 0)
Example - Predicate
Introduction
SELECT       sum(l_extendedprice * l_discount) as revenue
FROM         tpcd.lineitem
WHERE        l_shipdate >= date(‘1994-01-01’)         and
             l_shipdate < date(‘1994-01-01’)+ 1 year and
             l_discount between .06 – 0.01            and
             .06 + 0.01 and l_quantity < 24;



Check Constraint : l_shipdate <= l_receiptdate
Index : l_receiptdate

•   Maintaining semantics, we can add :
     –   l_receiptdate >= date(‘1994-01-01’)
Example - Predicate
Introduction
SELECT       sum(l_extendedprice * l_discount) as revenue
FROM         tpcd.lineitem
WHERE        l_shipdate >= date(‘1994-01-01’)         and
             l_shipdate < date(‘1994-01-01’)+ 1 year and
             l_receiptdate >= date(‘1994-01-01’)      and
             l_discount between .06 – 0.01            and
             .06 + 0.01 and l_quantity < 24;


Check Constraint : l_shipdate <= l_receiptdate
Index : l_receiptdate

•   Maintaining semantics, we can add :
     –   l_receiptdate >= date(‘1994-01-01’)


•   Why would we want to do this? In order to have optimizer
    choose a plan using the index. Is this always good?
•   NO! What if most of the rows in the table need to be
    returned? We should use a tablescan instead.
Predicate Introduction - Algorithm
•   Input : set of all check constraints defined for a
    database and the set of all predicates in query

•   Output: set of all non-redundant formulas derivable
    from the input set. This answer set can then be
    added to the query, but only a few are potentially
    useful.
           • The goal in the paper was to choose
             additions that would guarantee
             improvement.

•   Conditions in paper: Conservative approach of
    introducing predicates that will have the plan
    optimizer use an index. Insist on only one index
    available with the query predicate.
Predicate Introduction -
Results
                                                Predicate Introduction Estimated Costs
                                  250000


                                  200000
Estimated Cost (internal units)




                                  150000
                                                                                              Original
                                                                                              Optimized
                                  100000


                                   50000


                                       0
                                           P1          P2        P3         P4           P5
                                                                Query
Predicate Introduction -
Results
                                      Predicate Introduction Execution Times

                           120


                           100
                                                      Why?
Execution Time (seconds)




                            80

                                                                                    Original
                            60
                                                                                    Optimized


                            40


                            20


                             0
                                 P1        P2         P3         P4            P5
                                                     Query
Why Longer Execution for P3/P5?
•    P2 and P3 are the same except for the following
P2 :
SELECT ...
FROM   ...
WHERE l_shipdate >= date ('1998-09-01') and
       l_shipdate < date ('1998-09-01') + 1 month

P3 :
SELECT ...
FROM   ...
WHERE l_shipdate >= date ('1995-09-01') and
       l_shipdate < date ('1995-09-01') + 1 month

•    Difference in table shows that P2 has 2 % of the tuples falling
     in the range while P3 has 48 % of the tuples fall in the category
     : BOTH plans will choose index scan! P3 is so large that
     tablescan is better in this case.
1.   Cost model underestimates cost of locking/unlocking index
     pages
2.   Estimated number of tuples goes down because of the
     reduction factor problem (multiply in the new predicate
     added)
Adjustments for Reduction Factor
Problem
•   Add new predicate only when it contains a major column
    of an index and a scan of that index is sufficient to answer
    the query (thus, no table scan necessary)
•   Original Index : <receiptdate, discount, quantity,
    extendedprice>
•   New Index : <receiptdate, discount, quantity,
    extendedprice, shipdate, partkey, suppkey, orderkey>

                                       Predicate Introduction Execution Times

                                  60


                                  50
       Execution Time (seconds)




                                  40

                                                                                Original
                                  30
                                                                                Optimized


                                  20


                                  10


                                   0
                                         P3                          P5
                                                      Query
Order Optimization Techniques
•   Access plan strategies exploit the physical orderings
    provided either by indexes or sorting

•   GOAL: optimize the sorting strategy

•   Techniques
     –   Pushing down sorts in joins
     –   Minimizing the number of sorting columns
     –   Detecting when sorting can be avoided because of
         predicates, keys or indexes
•   Order Optimization : detecting when indexes provide
    an interesting order, so that sorting can be either
    avoided, and used as sparingly as possible.
•   Interesting Orders : when the side effect of a join
    produces rows in sorted order, which can be taken
    advantage of later (if another join needed, ORDER
    BY, GROUP BY, DISTINCT)
Fundamental Operators
•   Order optimization requires the following operations

     –   Reduce Order
     –   Test Order
     –   Cover Order
     –   Homogenize Order
Order Optimization Results
Exploiting Uniqueness
•   Checking to see if query contains unnecessary
    DISTINCT clauses
     –   How does this make improvements?

          •   Removing duplicates is performed by SORTING, a costly
              operation.


•   Example is removing DISTINCT keyword from query if
    it is applied onto the primary key itself (since primary
    keys are, by definition, distinct)
How to exploit uniqueness?
•   Using knowledge about:
     –   Keys
     –   Table Constraints
     –   Query Predicates


•   Cannot always be tested efficiently, so we look for a
    sufficient solution.
Summary
•   Important Outcome : experimental evidence showing
    SQO can provide effective enhancement to the traditional
    query optimization.
     – Join Elimination : geared towards OLAP
        environment (where very useful)
               – Independent on existence of complex integrity
                 constraint – semantic reasoning used about
                 referential integrity constraints
               – Easy to implement and execute
     – Predicate Introduction : guaranteeing improvements
       more difficult, needing rather severe restrictions
       imposed (limits the applicability of this approach)
     – Order Optimization : utilizing functional
       dependencies and table information, we use it in
       creating a “smart” access plan, avoiding or optimizing
       sort operations.
     – Exploiting Uniqueness : uniqueness is powerful
       when it reduces the number of expensive sorts.
       Discovering true ways of exploiting this technique are
       quite tricky and specific.
References
•   Qi Cheng, Jarek Gryz, Fred Koo, et al: Implementation of Two
    Semantic Query Optimization Techniques in DB2 Universal
    Database. Proceedings of the 25th VLDB Conference, Edinburg,
    Scotland,1999.

•   David E. Simmen, Eugene J. Shekita, Timothy Malkemus:
    Fundamental
    Techniques for Order Optimization. SIGMOD Conference 1996: 57-
    67

•   G. N. Paulley, Per-ke Larson: Exploiting Uniqueness in Query
    Optimization. ICDE 1994: 68-79
The End.

								
To top