Semantic Query Optimization Techniques
November 16, 2005
By: Mladen Kovacevic

Background
• 1980s: it was shown that semantic information stored in databases as integrity constraints could be used for query optimization
• semantic: "of or relating to meaning or the study of meaning" (http://wordnet.princeton.edu)
• integrity constraint: preserves data consistency when changes are made in the database
• no extensive implementation existed at the time of the paper (1999)

Introduction
• Query optimization is a key factor in improving a relational database system's query execution time.
• Query execution can be improved by:
– analyzing integrity information and rewriting queries to exploit it (JE & PI)
– avoiding expensive sorting costs (Order Optimization)
– exploiting uniqueness: knowing rows will be unique avoids extra sorts (EU)

Presentation Overview
• Semantic Query Optimization techniques
– Join Elimination (JE)
– Predicate Introduction (PI)
– Order Optimization (OO)
– Exploiting Uniqueness (EU)

Some Motivation
• Two SQO techniques are described and demonstrated in DB2 UDB:
– Predicate Introduction
– Join Elimination
• Reasons:
– rewriting queries by hand showed that these two provided consistent optimization
– practical to implement
– extendible to other DBMSs
• Data sets used: TPC-D and APB-1 OLAP benchmarks; only REFERENTIAL INTEGRITY constraints and CHECK constraints used!

Semantic Query Optimization (SQO) Techniques
• Join Elimination: some joins need NOT be evaluated, since the result may be known a priori (more on this later)
• Join Introduction: adding a join can help if the added relation is small compared to the original relations and highly selective
• Predicate Elimination: a predicate known to be always true can be eliminated from the query (e.g. a DISTINCT clause on a primary key: uniqueness exploitation!)
• Predicate Introduction: new predicates on indexed attributes can result in a much faster access plan
• Detecting the Empty Answer Set: if the query predicates are inconsistent with the integrity constraints, the query has no answer

Why Are SQO Implementations Not Used?
• Deductive databases: many SQO techniques were designed for deductive databases, so they did not appear useful in a relational database context
• CPU and I/O speeds were similar: when the techniques were developed, CPU and I/O speeds were not as dramatically different (savings in I/O were not worth the added CPU time)
• Lack of integrity constraints: it was thought that many integrity constraints are needed for SQO to be useful

Two-stage Optimizer
• SQO techniques have always been designed for a two-stage optimizer:
– Stage 1: logically equivalent queries are created (DB2's query rewrite optimization)
– Stage 2: plans are generated for all of these queries, and the one with the lowest estimated cost is chosen (DB2's query plan optimization)
• Plan choices include join order, join methods, join site in a distributed database, method for accessing each input table, etc.

Join Elimination
• Simple case: eliminate a relation where the join is over tables related through a referential integrity constraint, and the primary-key table is referenced only in the join.

VIEW DEFINITION
CREATE VIEW Supplier_Info (n, a, c) as
SELECT s_name, s_address, n_name
FROM tpcd.supplier, tpcd.nation
WHERE s_nationkey = n_nationkey

QUERY
SELECT n, a
FROM Supplier_Info

Join Elimination (cont'd)
• The query can be rewritten internally as:
SELECT s_name, s_address
FROM tpcd.supplier

Why Do Such a Simple Rewrite?
• The user may not have access to the supplier table, and/or may only know about the view.
• GUI query managers sometimes generate such "dumb" queries, which then need to be optimized.
• Non-programmers often write queries, and may not even think about this.
• An algorithm for generic redundant join removal is provided in the paper.
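The simple join-elimination rewrite above can be sketched with Python's sqlite3 module. This is a minimal illustration, not the paper's implementation: the schema follows the TPC-D tables named in the slides, but the column subset and all data values are invented. Because every supplier row carries a non-null foreign key into nation's primary key, the join matches each supplier row exactly once, so a query that selects no nation columns returns the same rows with or without the join.

```python
import sqlite3

# Toy TPC-D-style schema; data values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nation   (n_nationkey INTEGER PRIMARY KEY, n_name TEXT);
CREATE TABLE supplier (s_suppkey   INTEGER PRIMARY KEY,
                       s_name      TEXT,
                       s_address   TEXT,
                       -- NOT NULL plus referential integrity is what
                       -- makes the join to nation redundant below
                       s_nationkey INTEGER NOT NULL
                                   REFERENCES nation(n_nationkey));
INSERT INTO nation   VALUES (1, 'CANADA'), (2, 'FRANCE');
INSERT INTO supplier VALUES (10, 'Acme',   '1 Elm St',  1),
                            (11, 'Globex', '2 Oak Ave', 2);
""")

# Query as posed through the view definition: joins supplier with nation.
original = conn.execute("""
    SELECT s_name, s_address
    FROM supplier, nation
    WHERE s_nationkey = n_nationkey
""").fetchall()

# Rewritten query: the join is eliminated because each supplier row
# matches exactly one nation row and no nation column is selected.
rewritten = conn.execute(
    "SELECT s_name, s_address FROM supplier").fetchall()

assert sorted(original) == sorted(rewritten)
```

The assertion only demonstrates result equivalence on this toy data; the optimizer's actual proof relies on the declared constraints, not on inspecting the rows.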
Example: Join Elimination

SELECT p_name, p_retailprice, s_name, s_address
FROM tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier
WHERE p_partkey = ps_partkey
  and s_suppkey = ps_suppkey
  and ps_partkey = l_partkey
  and ps_suppkey = l_suppkey
  and l_shipdate between '1994-01-01' and '1996-06-30'
  and l_discount >= 0.1
GROUP BY p_name, p_retailprice, s_name, s_address
ORDER BY p_name, s_name

[Schema diagram: PART (PARTKEY) and SUPPLIER (SUPPKEY) each have a 1-to-many relationship to PARTSUPP (PARTKEY, SUPPKEY), which in turn has a 1-to-many relationship to LINEITEM (PARTKEY, SUPPKEY).]

Example: Join Elimination (cont'd)
• Any immediate improvements that can be seen here?
• By transitivity over the join predicates:
– p_partkey = ps_partkey and ps_partkey = l_partkey imply p_partkey = l_partkey
– s_suppkey = ps_suppkey and ps_suppkey = l_suppkey imply s_suppkey = l_suppkey
• With these derived predicates, PARTSUPP (which contributes no output columns and is joined only through referential integrity constraints) can be eliminated.

Results
• 100 MB database
• Execution time: 58.5 sec -> 38.25 sec (35% improvement)
• I/O cost: 4631 -> 1498 page reads (67% improvement)

[Charts: "Join Elimination Optimizing Query 1" — execution time (seconds) and pages read, original vs. optimized.]

Results – OLAP Environment
• On OLAP (online analytical processing) servers using a star schema (one fact table with several dimension tables), improvements ranged from 2% to 96%.
– In these cases, much of the improvement came from CPU cost rather than I/O, because the dimension tables were small enough to fit into memory.

[Chart: "Join Elimination Optimized Query in OLAP Environment" — execution times (seconds) for queries I1–I10.]

Predicate Introduction
• Techniques discussed:
– Index Introduction: add a new predicate on an attribute if an index exists on that attribute.
– Assumption: index retrieval is better than a table scan. Is this always good?
– Scan Reduction: reduce the number of tuples that qualify for a join.
– Problem: not very common; it is unlikely that there will be any check constraints or predicates with inequalities on join columns.
• Detecting an empty query answer set (not shown, as query execution time is essentially 0).

Example: Predicate Introduction

SELECT sum(l_extendedprice * l_discount) as revenue
FROM tpcd.lineitem
WHERE l_shipdate >= date('1994-01-01')
  and l_shipdate < date('1994-01-01') + 1 year
  and l_discount between .06 - 0.01 and .06 + 0.01
  and l_quantity < 24;

Check constraint: l_shipdate <= l_receiptdate
Index: l_receiptdate
• Maintaining the semantics, we can add:
– l_receiptdate >= date('1994-01-01')

Example: Predicate Introduction (cont'd)

SELECT sum(l_extendedprice * l_discount) as revenue
FROM tpcd.lineitem
WHERE l_shipdate >= date('1994-01-01')
  and l_shipdate < date('1994-01-01') + 1 year
  and l_receiptdate >= date('1994-01-01')
  and l_discount between .06 - 0.01 and .06 + 0.01
  and l_quantity < 24;

• Why would we want to do this? So that the optimizer chooses a plan using the index.

Is This Always Good?
• NO! What if most of the rows in the table need to be returned? A table scan should be used instead.

Predicate Introduction – Algorithm
• Input: the set of all check constraints defined for a database and the set of all predicates in the query.
• Output: the set of all non-redundant formulas derivable from the input set. These formulas can be added to the query, but only a few are potentially useful.
• The goal in the paper was to choose additions that would guarantee improvement.
• Conditions in the paper: a conservative approach, introducing only predicates that will make the plan optimizer use an index, and insisting that only one index be usable with the query predicate.
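The derivation step behind the example above (from the check constraint l_shipdate <= l_receiptdate and the query predicate l_shipdate >= '1994-01-01', infer l_receiptdate >= '1994-01-01') is an application of transitivity. The following is a minimal sketch of just that one inference pattern; the paper's algorithm derives all non-redundant formulas, and the function name and tuple encoding here are illustrative, not the paper's.

```python
from datetime import date

def introduce_predicates(check_constraints, query_predicates):
    """From a check constraint  a <= b  and a query predicate
    a >= const, derive the new predicate  b >= const (transitivity).
    Constraints and predicates are (lhs, op, rhs) triples."""
    derived = []
    for (a, op, b) in check_constraints:      # e.g. l_shipdate <= l_receiptdate
        if op != '<=':
            continue                          # only this one pattern is handled
        for (col, qop, const) in query_predicates:
            # const <= a  and  a <= b  imply  const <= b
            if col == a and qop == '>=':
                derived.append((b, '>=', const))
    return derived

checks = [('l_shipdate', '<=', 'l_receiptdate')]
preds  = [('l_shipdate', '>=', date(1994, 1, 1))]
print(introduce_predicates(checks, preds))
# → [('l_receiptdate', '>=', datetime.date(1994, 1, 1))]
```

As the slides stress, deriving such a predicate is the easy part; deciding whether adding it actually helps (index scan vs. table scan) is where the restrictions come in.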
Predicate Introduction – Results

[Chart: "Predicate Introduction Estimated Costs" — estimated cost (internal units), original vs. optimized, for queries P1–P5.]

[Chart: "Predicate Introduction Execution Times" — execution time (seconds), original vs. optimized, for queries P1–P5. Why are the optimized P3 and P5 slower?]

Why Longer Execution for P3/P5?
• P2 and P3 are the same except for the following:
P2: SELECT ... FROM ... WHERE l_shipdate >= date('1998-09-01') and l_shipdate < date('1998-09-01') + 1 month
P3: SELECT ... FROM ... WHERE l_shipdate >= date('1995-09-01') and l_shipdate < date('1995-09-01') + 1 month
• The table shows that 2% of the tuples fall in P2's range while 48% fall in P3's: BOTH plans choose an index scan, but P3's range is so large that a table scan would be better in its case.
1. The cost model underestimates the cost of locking/unlocking index pages.
2. The estimated number of tuples goes down because of the reduction-factor problem (the selectivity of the newly added predicate is multiplied in).

Adjustments for the Reduction-Factor Problem
• Add the new predicate only when it contains a major column of an index and a scan of that index is sufficient to answer the query (thus, no table access is necessary).
• Original index: <receiptdate, discount, quantity, extendedprice>
• New index: <receiptdate, discount, quantity, extendedprice, shipdate, partkey, suppkey, orderkey>

[Chart: "Predicate Introduction Execution Times" — execution time (seconds) for P3 and P5 with the index-only adjustment, original vs. optimized.]

Order Optimization Techniques
• Access plan strategies exploit the physical orderings provided either by indexes or by sorting.
• GOAL: optimize the sorting strategy.
• Techniques:
– pushing down sorts in joins
– minimizing the number of sorting columns
– detecting when sorting can be avoided because of predicates, keys, or indexes
• Order optimization: detecting when indexes provide an interesting order, so that sorting can be avoided, or at least used as sparingly as possible.
• Interesting orders: a join may, as a side effect, produce rows in sorted order, which can be taken advantage of later (for another join, ORDER BY, GROUP BY, or DISTINCT).

Fundamental Operators
• Order optimization requires the following operations:
– Reduce Order
– Test Order
– Cover Order
– Homogenize Order

Order Optimization Results
[Chart: order optimization results (details not recoverable from the source).]

Exploiting Uniqueness
• Check whether the query contains unnecessary DISTINCT clauses.
– How does this make improvements?
• Removing duplicates is performed by SORTING, a costly operation.
• Example: removing the DISTINCT keyword from a query when it is applied to the primary key itself (since primary-key values are, by definition, distinct).

How to Exploit Uniqueness?
• Use knowledge about:
– keys
– table constraints
– query predicates
• Uniqueness cannot always be tested efficiently, so we look for a sufficient condition instead.

Summary
• Important outcome: experimental evidence showing that SQO can effectively enhance traditional query optimization.
– Join Elimination: geared towards the OLAP environment (where it is very useful); does not depend on the existence of complex integrity constraints; uses semantic reasoning about referential integrity constraints; easy to implement and execute.
– Predicate Introduction: guaranteeing improvement is more difficult and requires rather severe restrictions, which limits the applicability of this approach.
– Order Optimization: using functional dependencies and table information to create a "smart" access plan, avoiding or optimizing sort operations.
– Exploiting Uniqueness: uniqueness is powerful when it reduces the number of expensive sorts; discovering ways to truly exploit this technique is quite tricky and specific.

References
• Qi Cheng, Jarek Gryz, Fred Koo, et al.: Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
• David E. Simmen, Eugene J. Shekita, Timothy Malkemus: Fundamental Techniques for Order Optimization. SIGMOD Conference 1996: 57-67.
• G. N. Paulley, Per-Åke Larson: Exploiting Uniqueness in Query Optimization. ICDE 1994: 68-79.

The End.
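Appendix: the DISTINCT-on-primary-key case of uniqueness exploitation can be sketched with Python's sqlite3 module. This is a toy illustration (table and data are invented), showing that when DISTINCT is applied to a primary key, the duplicate-eliminating sort does no work and the keyword can be dropped.

```python
import sqlite3

# Toy table with a primary key; data values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE part (p_partkey INTEGER PRIMARY KEY, p_name TEXT);
INSERT INTO part VALUES (1, 'bolt'), (2, 'nut'), (3, 'washer');
""")

with_distinct = conn.execute(
    "SELECT DISTINCT p_partkey FROM part").fetchall()

# Rewritten: DISTINCT removed, because p_partkey is the primary key
# and its values are duplicate-free by definition.
without_distinct = conn.execute(
    "SELECT p_partkey FROM part").fetchall()

assert sorted(with_distinct) == sorted(without_distinct)
```

The rewrite saves the sort (or hash) that duplicate elimination would otherwise require; as the slides note, recognizing the more general cases requires reasoning about keys, table constraints, and query predicates together.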