VIEWS: 12 PAGES: 56 POSTED ON: 12/3/2011
Query Processing and Optimization Chapters 12,14,15 in [R] Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Basic Steps in Query Processing 1. Parsing and translation 2. Optimization 3. Evaluation Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2 Overview of Query Optimization Plan: Tree of R.A. ops, with choice of alg for each op. Each operator typically implemented using a `pull’ interface: when an operator is `pulled’ for the next output tuples, it `pulls’ on its inputs and computes them. Two main issues: For a given query, what plans are considered? • Algorithm to search plan space for cheapest (estimated) plan. How is the cost of a plan estimated? Ideally: Want to find best plan. Practically: Avoid worst plans! We will study the System R approach. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3 Introduction Alternative ways of evaluating a given query Equivalent expressions Different algorithms for each operation (Chapter 13) Cost difference between a good and a bad way of evaluating a query can be enormous Example: performing a r X s followed by a selection r.A = s.B is much slower than performing a join on the same condition Need to estimate the cost of operations Depends critically on statistical information about relations which the database must maintain • E.g. number of tuples, number of distinct values for join attributes, etc. Need to estimate statistics for intermediate results to compute cost of complex expressions Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 Introduction (Cont.) Relations generated by two equivalent expressions have the same set of attributes and contain the same set of tuples, although their attributes may be ordered differently. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5 Introduction (Cont.) Generation of query-evaluation plans for an expression involves several steps: 1. Generating logically equivalent expressions • Use equivalence rules to transform an expression into an equivalent one – Logical optimization 2. Annotating resultant expressions to get alternative query plans 3. Choosing the cheapest plan based on estimated cost – Physical optimization The overall process is called cost based optimization. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6 Query optimization – the two phases query Catalog Parse schema info Tree of RA operators Logical optimization – pushing operators down Catalog physical optimization method to implement each RA Statistics indexes info Run-time code Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7 Search space “plan” Search logical order 1 2 1 2 3 X or X or X 1 2 2 1 Take Plan 1: 4 Join methods, 3 select methods There are 3*3*4 = 36 Plans! For a large tree the # of plans is huge! The optimization problem: Find the best plan! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8 Schema for Examples Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string) Similar to old schema; rname added for variations. Reserves: Each tuple is 40 bytes long, 100 tuples per page, 1000 pages. Sailors: Each tuple is 50 bytes long, 80 tuples per page, 500 pages. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9 RA Tree: sname Motivating Example bid=100 rating > 5 SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND sid=sid R.bid=100 AND S.rating>5 Reserves Sailors Cost: 500+500*1000 I/Os (On-the-fly) By no means the worst plan! Plan: sname Misses several opportunities: selections could have been bid=100 rating > 5 (On-the-fly) `pushed’ earlier, no use is made of any available indexes, etc. (Simple Nested Loops) Goal of optimization: To find more sid=sid efficient plans that compute the same answer. Reserves Sailors Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10 (On-the-fly) Alternative Plans 1 sname (No Indexes) sid=sid (Sort-Merge Join) (Scan; (Scan; write to bid=100 rating > 5 write to temp T1) temp T2) Main difference: push selects. Reserves Sailors With 5 buffers, cost of plan: Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution). Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings). Sort T1 (2*2*10), sort T2 (2*3*250), merge (10+250) Total: 3560 page I/Os. If we used BNL join, join cost = 10+4*250, total cost = 2770. If we `push’ projections, T1 has only sid, T2 only sid and sname: T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11 (On-the-fly) sname Alternative Plans 2 With Indexes rating > 5 (On-the-fly) With clustered index on bid of sid=sid (Index Nested Loops, with pipelining ) Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages. (Use hash index; do bid=100 Sailors not write INL with pipelining (outer is not result to temp) materialized). Reserves –Projecting out unnecessary fields from outer doesn’t help. v Join column sid is a key for Sailors. –At most one matching tuple, unclustered index on sid OK. v Decision not to push rating>5 before the join is based on availability of sid index on Sailors. v Cost: Selection of Reserves tuples (10 I/Os); for each, must get matching Sailors tuple (1000*1.2); total 1210 I/Os. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12 Logical and physical optimization Logical opt. – use equivalence rules to get the “last” equivalent expression. Physical opt. – use cost estimates to select the best physical plan. Note: the two are not independent! The best physical may not be the best logical! Example: select Name from Sailors, Reserves where Reserves.Sid = Sailors.Sid and Sailors.Rating >10 and assume both sorted on Sid and index exists on Rating. Better not to push the select down! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13 Relational Algebra Equivalences Allow us to choose different join orders and to `push’ selections and projections ahead of joins. Selections: c1 ... cn R c1 . . . cn R (Cascade) c1 c 2 R c 2 c1 R (Commute) Projections: a1 R a1 . . . an R (Cascade) Joins: R (S T) (R S) T (Associative) (R S) (S R) (Commute) + Show that:R (S T) (T R) S (is T R a join?) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14 More Equivalences A projection commutes with a selection that only uses attributes retained by the projection. Selection between attributes of the two arguments of a cross-product converts cross-product to a join. A selection on just attributes of R commutes with R S. (i.e., (R S) (R) S ) Similarly, if a projection follows a join R S, we can `push’ it by retaining only attributes of R (and S) that are needed for the join or are kept by the projection. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15 Equivalence Rules 1. Conjunctive selection operations can be deconstructed into a sequence of individual selections. ( E ) ( ( E )) 1 2 1 2 2. Selection operations are commutative. ( ( E )) ( ( E )) 1 2 2 1 3. Only the last in a sequence of projection operations is needed, the others can be omitted. t1 ( t2 (( tn (E )))) t1 (E ) 4. Selections can be combined with Cartesian products and theta joins. a. (E1 X E2) = E1 E2 b. 1(E1 2 E2) = E1 1 2 E2 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16 Equivalence Rules (Cont.) 5. Theta-join operations (and natural joins) are commutative. E1 E2 = E2 E1 6. (a) Natural join operations are associative: (E1 E2) E3 = E1 (E2 E3) (b) Theta joins are associative in the following manner: (E1 1 E2) 2 3 E3 = E1 1 3 (E2 2 E3) where 2 involves attributes from only E2 and E3. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17 Equivalence Rules (Cont.) 7. The selection operation distributes over the theta join operation under the following two conditions: (a) When all the attributes in 0 involve only the attributes of one of the expressions (E1) being joined. 0E1 E2) = (0(E1)) E2 (b) When 1 involves only the attributes of E1 and 2 involves only the attributes of E2. 1 E1 E2) = (1(E1)) ( (E2)) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18 Equivalence Rules (Cont.) 8. The projections operation distributes over the theta join operation as follows: (a) if involves only attributes from L1 L2: L1 L2 ( E1....... E2 ) ( L1 ( E1 )) ...... ( L2 ( E2 )) (b) Consider a join E1 E2. Let L1 and L2 be sets of attributes from E1 and E2, respectively. Let L3 be attributes of E1 that are involved in join condition , but are not in L1 L2, and let L4 be attributes of E2 that are involved in join condition , but are not in L1 L2. L1 L2 ( E1..... E2 ) L1 L2 (( L1 L3 ( E1 )) ...... ( L2 L4 ( E2 ))) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19 Equivalence Rules (Cont.) 9. The set operations union and intersection are commutative E1 E2 = E2 E1 E1 E2 = E2 E1 n (set difference is not commutative). 10. Set union and intersection are associative. (E1 E2) E3 = E1 (E2 E3) (E1 E2) E3 = E1 (E2 E3) 11. The selection operation distributes over , and –. (E1 – E2) = (E1) – (E2) and similarly for and in place of – Also: (E1 – E2) = (E1) – E2 and similarly for in place of –, but not for Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 20 Equivalence Rules (Cont.) 12.The projection operation distributes over union L(E1 E2) = (L(E1)) (L(E2)) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 21 Transformation Example Query: Find the names of all customers who have an account at some branch located in Brooklyn. customer-name(branch-city = “Brooklyn” (branch (account depositor))) Transformation using rule 7a. customer-name ((branch-city =“Brooklyn” (branch)) (account depositor)) Performing the selection as early as possible reduces the size of the relation to be joined. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 22 Example with Multiple Transformations Query: Find the names of all customers with an account at a Brooklyn branch whose account balance is over $1000. customer-name((branch-city = “Brooklyn” balance > 1000 (branch (account depositor))) Transformation using join associatively (Rule 6a): customer-name((branch-city = “Brooklyn” balance > 1000 (branch (account)) depositor) Second form provides an opportunity to apply the “perform selections early” rule, resulting in the subexpression branch-city = “Brooklyn” (branch) balance > 1000 (account) Thus a sequence of transformations can be useful Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 23 Multiple Transformations (Cont.) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 24 Projection Operation Example customer-name((branch-city = “Brooklyn” (branch) account) depositor) When we compute (branch-city = “Brooklyn” (branch) account ) we obtain a relation whose schema is: (branch-name, branch-city, assets, account-number, balance) Push projections using equivalence rules 8a and 8b; eliminate unneeded attributes from intermediate results to get: customer-name (( account-number ( (branch-city = “Brooklyn” (branch) account )) depositor) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 25 Join Ordering Example For all relations r1, r2, and r3, (r1 r2) r3 = r1 (r2 r3 ) If r2 r3 is quite large and r1 r2 is small, we choose (r1 r2) r3 so that we compute and store a smaller temporary relation. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 26 Highlights of System R Optimizer Impact: Most widely used currently; works well for < 10 joins. Cost estimation: Approximate art at best. Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes. Considers combination of CPU and I/O costs. Plan Space: Too large, must be pruned. Only the space of left-deep plans is considered. • Left-deep plans allow output of each operator to be pipelined into the next operator without storing it in a temporary relation. Cartesian products avoided. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 27 Cost Estimation For each plan considered, must estimate cost: Must estimate cost of each operation in plan tree. • Depends on input cardinalities. • We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.) Must also estimate size of result for each operation in tree! • Use information about the input relations. • For selections and joins, assume independence of predicates. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 28 Enumeration of Alternative Plans There are two main cases: Single-relation plans Multiple-relation plans For queries over a single relation, queries consist of a combination of selects, projects, and aggregate ops: Each available access path (file scan / index) is considered, and the one with the least estimated cost is chosen. The different operations are essentially carried out together (e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate computation). Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 29 Choice of Evaluation Plans Must consider the interaction of evaluation techniques when choosing evaluation plans: choosing the cheapest algorithm for each operation independently may not yield best overall algorithm. E.g. merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer level aggregation. nested-loop join may provide opportunity for pipelining Practical query optimizers incorporate elements of the following two broad approaches: 1. Search all the plans and choose the best plan in a cost-based fashion. 2. Uses heuristics to choose a plan. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 30 Cost-Based Optimization Consider finding the best join-order for r1 r2 . . . rn. There are (2(n – 1))!/(n – 1)! different join orders for above expression. With n = 7, the number is 665280, with n = 10, the number is greater than 176 billion! No need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of {r1, r2, . . . rn} is computed only once and stored for future use. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 31 Dynamic Programming in Optimization To find best join tree for a set of n relations: To find best plan for a set S of n relations, consider all possible plans of the form: S1 (S – S1) where S1 is any non-empty subset of S. Recursively compute costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the 2n – 1 alternatives. When plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it • Dynamic programming Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 32 Join Order Optimization Algorithm procedure findbestplan(S) if (bestplan[S].cost ) return bestplan[S] // else bestplan[S] has not been computed earlier, compute it now for each non-empty subset S1 of S such that S1 S P1= findbestplan(S1) P2= findbestplan(S - S1) A = best algorithm for joining results of P1 and P2 cost = P1.cost + P2.cost + cost of A if cost < bestplan[S].cost bestplan[S].cost = cost bestplan[S].plan = “execute P1.plan; execute P2.plan; join results of P1 and P2 using A” return bestplan[S] Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 33 Left Deep Join Trees In left-deep join trees, the right-hand- side input for each join is a relation, not the result of an intermediate join. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 34 Cost of Optimization To find best left-deep join tree for a set of n relations: Consider n alternatives with one relation as right-hand side input and the other relations as left-hand side input. Using (recursively computed and stored) least-cost join order for each alternative on left-hand-side, choose the cheapest of the n alternatives. If only left-deep trees are considered, time complexity of finding best join order is O(n 2n) Space complexity remains at O(2n) Cost-based optimization is expensive, but worthwhile for queries on large datasets (typical queries have small n, generally < 10) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 35 Heuristic using Left-Deep Plans Left-deep plans differ only in the order of relations, the access method for each relation, and the join method for each join. Enumerated using N passes (if N relations joined): Pass 1: Find best 1-relation plan for each relation. Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.) Pass N: Find best way to join result of a (N-1)-relation plan (as outer) to the N’th relation. (All N-relation plans.) For each subset of relations, retain only: Cheapest plan overall, plus Cheapest plan for each interesting order of the tuples. Complexity: only O(N2 ) (but not necessarily optimal!) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 36 Interesting Orders in Cost-Based Optimization Consider the expression (r1 r2 r3) r4 r5 An interesting sort order is a particular sort order of tuples that could be useful for a later operation. Generating the result of r1 r2 r3 sorted on the attributes common with r4 or r5 may be useful, but generating it sorted on the attributes common only r1 and r2 is not useful. Using merge-join to compute r1 r2 r3 may be costlier, but may provide an output sorted in an interesting order. Not sufficient to find the best join order for each subset of the set of n given relations; must find the best join order for each subset, for each interesting sort order Simple extension of earlier dynamic programming algorithms Usually, number of interesting orders is quite small and doesn’t affect time/space complexity significantly Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 37 Heuristic Optimization Cost-based optimization is expensive, even with dynamic programming. Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion. Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance: Perform selection early (reduces the number of tuples) Perform projection early (reduces the number of attributes) Perform most restrictive selection and join operations before other similar operations. Some systems use only heuristics, others combine heuristics with partial cost-based optimization. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 38 Steps in Typical Heuristic Optimization 1. Deconstruct conjunctive selections into a sequence of single selection operations (Equiv. rule 1.). 2. Move selection operations down the query tree for the earliest possible execution (Equiv. rules 2, 7a, 7b, 11). 3. Execute first those selection and join operations that will produce the smallest relations (Equiv. rule 6). 4. Replace Cartesian product operations that are followed by a selection condition by join operations (Equiv. rule 4a). 5. Deconstruct and move as far down the tree as possible lists of projection attributes, creating new projections where needed (Equiv. rules 3, 8a, 8b, 12). 6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining). Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 39 Structure of Query Optimizers The System R/Starburst optimizer considers only left-deep join orders. This reduces optimization complexity and generates plans amenable to pipelined evaluation. System R/Starburst also uses heuristics to push selections and projections down the query tree. Heuristic optimization used in some versions of Oracle: Repeatedly pick “best” relation to join next • Starting from each of n starting points. Pick best among these. For scans using secondary indices, some optimizers take into account the probability that the page containing the tuple is in the buffer. Intricacies of SQL complicate query optimization E.g. nested subqueries Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 40 Enumeration of Plans (Contd.) ORDER BY, GROUP BY, aggregates etc. handled as a final step, using either an `interestingly ordered’ plan or an addional sorting operator. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 41 Sailors: B+ tree on rating sname Example Hash on sid Reserves: B+ tree on bid Pass1: sid=sid Sailors: B+ tree matches rating>5, and is probably cheapest. However, if this selection is expected to bid=500 rating>5 retrieve a lot of tuples, and index is unclustered, file scan may be cheaper. • Still, B+ tree plan kept (because tuples are in rating order). Reserves Sailors Reserves: B+ tree on bid matches bid=500; cheapest. So we have two plans on Sailors and one on Reserves v Pass 2: – We consider each plan retained from Pass 1 as the outer, and consider how to join it with the (only) other relation. e.g., Reserves as outer: Hash index can be used to get Sailors tuples that satisfy sid = outer tuple’s sid value. (only linear scan of SAILORS) SELECT S.sname FROM Sailors S Nested Queries WHERE EXISTS (SELECT * FROM Reserves R Nested block is optimized WHERE R.bid=103 independently, with the outer AND R.sid=S.sid) tuple considered as providing a selection condition. Nested block to optimize: SELECT * Outer block is optimized with FROM Reserves R the cost of `calling’ nested block WHERE R.bid=103 computation taken into account. AND S.sid= outer value Implicit ordering of these blocks means that some good strategies Equivalent non-nested query: are not considered. The non- SELECT S.sname nested version of the query is FROM Sailors S, Reserves R typically optimized better. WHERE S.sid=R.sid AND R.bid=103 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 43 A comprehensive example SELECT S.sid, COUNT(*) AS numres FROM boats B, Reserves R, Sailors S WHERE R.sid = S.sid AND B.bid = R.bid AND B.color = ‘red’ GROUP BY S.sid Query finds the number of red boats reserved by each sailor sid COUNT(*) AS numres GROUPBY sid Sid = sid Bid = bid sailors boats color = ‘red’ reserves Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 44 Assume the following Indexes For Reserves A B+tree on Sid A clustered B+tree on Bid For Boats Both B+tree and Hash indexes on Color For Sailors A clustered B+tree and Hash indexes on Sid Note, some indexes are especially contrived for the example… Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 45 A comprehensive example – cont. Pass 1 – best plan for each relation Reserves, sailors – file scan Boats – best is hash index but B-tree is used because of interesting order. Pass 2 – all left deep plans Reserves (outer) sid boats (inner) Boats (outer) sid reserves (inner) Reserves (outer) sid sailors (inner) Sailors (outer) sid reserves (inner) Need to consider all relevant join methods: BNL, SORT-Merge, Index,Hash Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 46 A comprehensive example – cont. Some good plans: Boats sid reserves using index join since Bid is a clustered index for reserves. Reserves sailors using sort-merge since sailors already has clustered B tree index on sid. Pass 3 – consider each plan from stage 2 as an outer. Then use sort-merge join on Sid if join with Sailors is last, otherwise must sort result on Sid. This since we need to do a GROUP-by on Sid. So because of the Group-By the join with Sailors on Sid may better be the last one… Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 47 The system R optimizer Current relational query optimizers have been greatly influenced by choices made in the design of IBM’s system R query optimizer. Important design choices in the system R optimizer include: The use of statistics about the database instance to estimate the cost of a query evaluation plan. A decision to consider only plans with binary joins in which the inner relation is a base relation (I.e. not a temporary relation). This heuristic reduces the potentially very large number of alternative plans that must be considered. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 48 The system R optimizer A decision to focus optimization on the class of SQL queries without nesting , and to treat nested queries in a relatively as hoc way. A decision not to perform duplicate elimination for projections (except as a final step in the query evaluation when required by a DISTINCT clause). A model of cost that accounted for CPU costs as well as I/O costs. Search the space of plans for an optimal plan using hill-climbing. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 49 Additional topics Index selection – general problem is NP- complete. Query with many joins – AI technique like simulated annealing. [J9] Parallel database machines. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 50 Materialized Views** A materialized view is a view whose contents are computed and stored. Consider the view create view branch_total_loan(branch_name, total_loan) as select branch_name, sum(amount) from loan group by branch_name Materializing the above view would be very useful if the total loan amount is required frequently Saves the effort of finding multiple tuples and adding up their amounts Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 51 Optimization with Materialized views (Data Warehousing) Which views to materialize? Given a query and a set of materialized views: Can the query be answered using only subset of the materialized views? If yes, what is the best sub-set? All these topic are associated with optimization in Data warehousing! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 52 Query Optimization and Materialized Views Rewriting queries to use materialized views: A materialized view v = r s is available A user submits a query r s t We can rewrite the query as v t • Whether to do so depends on cost estimates for the two alternative Replacing a use of a materialized view by the view definition: A materialized view v = r s is available, but without any index on it User submits a query A=10(v). Suppose also that s has an index on the common attribute B, and r has an index on attribute A. The best plan for this query may be to replace v by r s, which can lead to the query plan A=10(r) s Query optimizer should be extended to consider all above alternatives and choose the best overall plan Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 53 Materialized View Selection Materialized view selection: “What is the best set of views to materialize?”. Index selection: “what is the best set of indices to create” closely related, to materialized view selection • but simpler Materialized view selection and index selection based on typical system workload (queries and updates) Typical goal: minimize time to execute workload , subject to constraints on space and time taken for some critical queries/updates One of the steps in database tuning • more on tuning in later chapters Commercial database systems provide tools (called “tuning assistants” or “wizards”) to help the database administrator choose what indices and materialized views to create Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 54 Materialized View Maintenance The task of keeping a materialized view up-to-date with the underlying data is known as materialized view maintenance Materialized views can be maintained by recomputation on every update A better option is to use incremental view maintenance Changes to database relations are used to compute changes to the materialized view, which is then updated View maintenance can be done by Manually defining triggers on insert, delete, and update of each relation in the view definition Manually written code to update the view whenever database relations are updated Periodic recomputation (e.g. nightly) Above methods are directly supported by many database systems • Avoids manual effort/correctness issues Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 55 Summary Query optimization is an important task in a relational DBMS. Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). Two parts to optimizing a query: Consider a set of alternative plans. • Must prune search space; typically, left-deep plans only. Must estimate cost of each plan that is considered. • Must estimate size of result and cost for each plan node. • Key issues: Statistics, indexes, operator implementations. Using materialized views (Data Warehousing) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 56