Docstoc

optimization

Document Sample
optimization Powered By Docstoc
					         Query Processing and Optimization



                               Chapters 12,14,15 in [R]




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   1
        Basic Steps in Query Processing
            1. Parsing and translation
            2. Optimization
            3. Evaluation




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   2
          Overview of Query Optimization
    Plan: Tree of R.A. ops, with choice of alg for each op.
         Each operator typically implemented using a `pull’
          interface: when an operator is `pulled’ for the next
          output tuples, it `pulls’ on its inputs and computes them.
    Two main issues:
         For a given query, what plans are considered?
            • Algorithm to search plan space for cheapest (estimated) plan.
         How is the cost of a plan estimated?
  Ideally: Want to find best plan. Practically: Avoid
   worst plans!
  We will study the System R approach.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                3
        Introduction
           Alternative ways of evaluating a given query
              Equivalent expressions
              Different algorithms for each operation (Chapter 13)
           Cost difference between a good and a bad way of
            evaluating a query can be enormous
              Example: performing a r X s followed by a selection r.A =
               s.B is much slower than performing a join on the same
               condition
           Need to estimate the cost of operations
              Depends critically on statistical information about relations
               which the database must maintain
                  • E.g. number of tuples, number of distinct values for join
                    attributes, etc.
              Need to estimate statistics for intermediate results to
               compute cost of complex expressions
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                  4
       Introduction (Cont.)
       Relations generated by two equivalent expressions
       have the same set of attributes and contain the same
       set of tuples, although their attributes may be
       ordered differently.




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   5
    Introduction (Cont.)
           Generation of query-evaluation plans for an
            expression involves several steps:
             1. Generating logically equivalent expressions
                  • Use equivalence rules to transform an expression into an
                    equivalent one – Logical optimization
             2. Annotating resultant expressions to get alternative query
                plans
             3. Choosing the cheapest plan based on estimated cost –
                Physical optimization
           The overall process is called cost based
            optimization.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                 6
     Query optimization – the two phases
                                          query

                    Catalog               Parse
                  schema info
                                                      Tree of RA
                                                       operators

                            Logical optimization – pushing operators down



   Catalog    physical optimization method to implement each RA
   Statistics
 indexes info              Run-time code
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke              7
      Search space “plan”
   Search logical order
                        1                                    2
   1                  2                                 3
      X        or        X                    or             X


   1       2                        2                 1


    Take Plan 1: 4 Join methods, 3 select
    methods
    There are 3*3*4 = 36 Plans!
    For a large tree the # of plans is huge!

    The optimization problem:
    Find the best plan!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke     8
        Schema for Examples
        Sailors (sid: integer, sname: string, rating: integer, age: real)
        Reserves (sid: integer, bid: integer, day: dates, rname: string)

 Similar to old schema; rname added for variations.
 Reserves:
       Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
   Sailors:
       Each tuple is 50 bytes long, 80 tuples per page, 500 pages.



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke              9
                                                                 RA Tree:        sname

        Motivating Example
                                                                         bid=100     rating > 5
                        SELECT S.sname
                        FROM Reserves R, Sailors S
                        WHERE R.sid=S.sid AND                                   sid=sid
                          R.bid=100 AND S.rating>5
                                                                     Reserves             Sailors
   Cost: 500+500*1000 I/Os
                                                             (On-the-fly)
   By no means the worst plan!      Plan: sname
   Misses several opportunities:
    selections could have been          bid=100   rating > 5 (On-the-fly)
    `pushed’ earlier, no use is made
    of any available indexes, etc.
                                                      (Simple Nested Loops)
   Goal of optimization: To find more       sid=sid
    efficient plans that compute the
    same answer.
                                                     Reserves         Sailors
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                                    10
                                                                             (On-the-fly)
        Alternative Plans 1
                                                                        sname




        (No Indexes)                                                   sid=sid
                                                                                 (Sort-Merge Join)


                                                    (Scan;                                    (Scan;
                                                    write to bid=100             rating > 5   write to
                                                    temp T1)                                  temp T2)
   Main difference: push selects.
                                                                 Reserves        Sailors
   With 5 buffers, cost of plan:
       Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats,
        uniform distribution).
       Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).
       Sort T1 (2*2*10), sort T2 (2*3*250), merge (10+250)
       Total: 3560 page I/Os.
   If we used BNL join, join cost = 10+4*250, total cost = 2770.
   If we `push’ projections, T1 has only sid, T2 only sid and sname:
       T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                                     11
                                                                                          (On-the-fly)
                                                                               sname

        Alternative Plans 2
        With Indexes
                                                                               rating > 5 (On-the-fly)



   With clustered index on bid of                                            sid=sid
                                                                                         (Index Nested Loops,
                                                                                         with pipelining )
    Reserves, we get 100,000/100 =
    1000 tuples on 1000/100 = 10 pages.                      (Use hash
                                                              index; do    bid=100      Sailors
                                                              not write
   INL with pipelining (outer is not                         result to
                                                              temp)
    materialized).                                                        Reserves

         –Projecting out unnecessary fields from outer doesn’t help.
v   Join column sid is a key for Sailors.
     –At most one matching tuple, unclustered index on sid OK.
v   Decision not to push rating>5 before the join is based on
     availability of sid index on Sailors.
v   Cost: Selection of Reserves tuples (10 I/Os); for each,
     must get matching Sailors tuple (1000*1.2); total 1210 I/Os.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                                           12
        Logical and physical optimization

         Logical opt. – use equivalence rules to get the “last” equivalent
          expression.
         Physical opt. – use cost estimates to select the best physical plan.
        Note: the two are not independent!
        The best physical may not be the best logical!

        Example:
        select Name from Sailors, Reserves
            where Reserves.Sid = Sailors.Sid and Sailors.Rating >10
        and assume both sorted on Sid and index exists on Rating.
        Better not to push the select down!



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                   13
        Relational Algebra Equivalences
 Allow us to choose different join orders and to
  `push’ selections and projections ahead of joins.
 Selections:  c1 ...  cn  R    c1  . . .  cn  R   (Cascade)
                         c1  c 2  R    c 2  c1  R         (Commute)
                                                  
     Projections:  a1  R   a1 . . .  an  R                    (Cascade)

    Joins: R  (S  T)
                                       (R S)  T
                                                                     (Associative)
                 (R  S)  (S  R)
                                                                      (Commute)

   + Show that:R  (S T)
                                           (T R)  S (is T R a join?)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                    14
        More Equivalences
 A projection commutes with a selection that only
  uses attributes retained by the projection.
 Selection between attributes of the two arguments of
  a cross-product converts cross-product to a join.
 A selection on just attributes of R commutes with
  R  S. (i.e.,  (R  S)   (R)  S )
                                    
 Similarly, if a projection follows a join R  S, we can
                                               
  `push’ it by retaining only attributes of R (and S) that
  are needed for the join or are kept by the projection.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   15
       Equivalence Rules
       1. Conjunctive selection operations can be
          deconstructed into a sequence of individual
          selections.    ( E )    (  ( E ))
                                    1   2          1       2




       2. Selection operations are commutative.
                               (  ( E ))    (  ( E ))
                                1       2              2     1



       3. Only the last in a sequence of projection operations
          is needed, the others can be omitted.
                    t1 ( t2 (( tn (E ))))   t1 (E )

       4.   Selections can be combined with Cartesian
            products and theta joins.
            a. (E1 X E2) = E1  E2
            b. 1(E1 2 E2) = E1 1 2 E2

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   16
        Equivalence Rules (Cont.)

        5. Theta-join operations (and natural joins) are
           commutative.
                          E1  E2 = E2  E1
        6. (a) Natural join operations are associative:
                    (E1 E2) E3 = E1 (E2 E3)

            (b) Theta joins are associative in the following
            manner:

             (E1      1 E2)      2  3   E3 = E1      1 3   (E2   2   E3)

                where 2 involves attributes from only E2 and E3.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                     17
        Equivalence Rules (Cont.)

        7. The selection operation distributes over the theta join
           operation under the following two conditions:
           (a) When all the attributes in 0 involve only the
           attributes of one
                of the expressions (E1) being joined.

                          0E1          E2) = (0(E1))          E2

            (b) When  1 involves only the attributes of E1 and 2
            involves
                only the attributes of E2.
                       1 E1  E2) = (1(E1))  ( (E2))

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke            18
        Equivalence Rules (Cont.)
        8. The projections operation distributes over the
           theta join operation as follows:
           (a) if  involves only attributes from L1  L2:
                            L1  L2 ( E1....... E2 )  ( L1 ( E1 )) ...... ( L2 ( E2 ))


            (b) Consider a join E1                                  E2.
              Let L1 and L2 be sets of attributes from E1 and E2,
               respectively.
              Let L3 be attributes of E1 that are involved in join
               condition , but are not in L1  L2, and
              let L4 be attributes of E2 that are involved in join
               condition , but are not in L1  L2.
                L1  L2 ( E1.....  E2 )   L1  L2 ((  L1  L3 ( E1 )) ...... ( L2  L4 ( E2 )))
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                                           19
        Equivalence Rules (Cont.)
  9.    The set operations union and intersection are
        commutative
                      E1  E2 = E2  E1
                      E1  E2 = E2  E1
         n (set difference is not commutative).
  10.   Set union and intersection are associative.
                  (E1  E2)  E3 = E1  (E2  E3)
                 (E1  E2)  E3 = E1  (E2  E3)
  11.   The selection operation distributes over ,  and –.
                    (E1 – E2) =  (E1) – (E2)
                       and similarly for  and  in place of –
        Also:            (E1  – E2) = (E1) – E2
                             and similarly for  in place of –, but not
        for 
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke            20
      Equivalence Rules (Cont.)
  12.The projection operation distributes over union
               L(E1  E2) = (L(E1))  (L(E2))




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   21
        Transformation Example

           Query: Find the names of all customers who have an
            account at some branch located in Brooklyn.
            customer-name(branch-city = “Brooklyn”
                             (branch (account depositor)))
           Transformation using rule 7a.
             customer-name
                    ((branch-city =“Brooklyn” (branch))
                             (account depositor))
           Performing the selection as early as possible reduces
            the size of the relation to be joined.


Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke      22
Example with Multiple Transformations
     Query: Find the names of all customers with an account at
      a Brooklyn branch whose account balance is over $1000.
      customer-name((branch-city = “Brooklyn”  balance > 1000
                   (branch        (account        depositor)))
     Transformation using join associatively (Rule 6a):
      customer-name((branch-city = “Brooklyn”  balance > 1000
                   (branch        (account))         depositor)
     Second form provides an opportunity to apply the
      “perform selections early” rule, resulting in the
      subexpression
         branch-city = “Brooklyn” (branch)  balance > 1000 (account)
     Thus a sequence of transformations can be useful
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke           23
        Multiple Transformations (Cont.)




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   24
         Projection Operation Example
     customer-name((branch-city = “Brooklyn” (branch)          account)   depositor)

        When we compute
                   (branch-city = “Brooklyn” (branch) account )
         we obtain a relation whose schema is:
         (branch-name, branch-city, assets, account-number, balance)
        Push projections using equivalence rules 8a and 8b;
         eliminate unneeded attributes from intermediate results
         to get:
           customer-name ((
             account-number ( (branch-city = “Brooklyn” (branch) account ))
                     depositor)


Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                           25
        Join Ordering Example

         For all relations r1, r2, and r3,
                      (r1 r2) r3 = r1 (r2 r3 )
         If r2 r3 is quite large and r1 r2 is small, we
          choose

                       (r1 r2) r3
            so that we compute and store a smaller
            temporary relation.


Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   26
Highlights of System R Optimizer

   Impact:
        Most widely used currently; works well for < 10 joins.
   Cost estimation: Approximate art at best.
        Statistics, maintained in system catalogs, used to estimate
         cost of operations and result sizes.
        Considers combination of CPU and I/O costs.
   Plan Space: Too large, must be pruned.
        Only the space of left-deep plans is considered.
          • Left-deep plans allow output of each operator to be pipelined into
            the next operator without storing it in a temporary relation.
        Cartesian products avoided.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                   27
        Cost Estimation
           For each plan considered, must estimate cost:
                Must estimate cost of each operation in plan tree.
                  • Depends on input cardinalities.
                  • We’ve already discussed how to estimate the cost of
                    operations (sequential scan, index scan, joins, etc.)
                Must also estimate size of result for each operation
                 in tree!
                  • Use information about the input relations.
                  • For selections and joins, assume independence of
                    predicates.



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke              28
         Enumeration of Alternative Plans
   There are two main cases:
        Single-relation plans
        Multiple-relation plans
   For queries over a single relation, queries consist of a
    combination of selects, projects, and aggregate ops:
        Each available access path (file scan / index) is considered,
         and the one with the least estimated cost is chosen.
        The different operations are essentially carried out
         together (e.g., if an index is used for a selection, projection
         is done for each retrieved tuple, and the resulting tuples
         are pipelined into the aggregate computation).
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke        29
         Choice of Evaluation Plans
        Must consider the interaction of evaluation techniques
         when choosing evaluation plans: choosing the
         cheapest algorithm for each operation independently
         may not yield best overall algorithm. E.g.
            merge-join may be costlier than hash-join, but may provide a
             sorted output which reduces the cost for an outer level
             aggregation.
            nested-loop join may provide opportunity for pipelining
        Practical query optimizers incorporate elements of the
         following two broad approaches:
           1. Search all the plans and choose the best plan in a
              cost-based fashion.
           2. Uses heuristics to choose a plan.


Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke              30
        Cost-Based Optimization

           Consider finding the best join-order for r1 r2 . . . rn.
           There are (2(n – 1))!/(n – 1)! different join orders for
            above expression. With n = 7, the number is 665280,
            with n = 10, the number is greater than 176 billion!
           No need to generate all the join orders. Using
            dynamic programming, the least-cost join order for
            any subset of
            {r1, r2, . . . rn} is computed only once and stored for
            future use.



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke         31
           Dynamic Programming in
           Optimization
             To find best join tree for a set of n relations:
                 To find best plan for a set S of n relations,
                  consider all possible plans of the form: S1 (S
                  – S1) where S1 is any non-empty subset of S.
                 Recursively compute costs for joining subsets of
                  S to find the cost of each plan. Choose the
                  cheapest of the 2n – 1 alternatives.
                 When plan for any subset is computed, store it
                  and reuse it when it is required again, instead
                  of recomputing it
                     • Dynamic programming




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke       32
  Join Order Optimization Algorithm
     procedure findbestplan(S)
       if (bestplan[S].cost  )
             return bestplan[S]
       // else bestplan[S] has not been computed earlier,
       compute it now
       for each non-empty subset S1 of S such that S1  S
             P1= findbestplan(S1)
             P2= findbestplan(S - S1)
             A = best algorithm for joining results of P1 and P2
             cost = P1.cost + P2.cost + cost of A
             if cost < bestplan[S].cost
                     bestplan[S].cost = cost
                     bestplan[S].plan = “execute P1.plan; execute
       P2.plan;
                               join results of P1 and P2 using A”
       return bestplan[S]
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke      33
        Left Deep Join Trees
      In left-deep join trees, the right-hand-
       side input for each join is a relation, not
       the result of an intermediate join.




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   34
      Cost of Optimization
           To find best left-deep join tree for a set of n relations:
              Consider n alternatives with one relation as right-hand side
               input and the other relations as left-hand side input.
              Using (recursively computed and stored) least-cost join
               order for each alternative on left-hand-side, choose the
               cheapest of the n alternatives.
           If only left-deep trees are considered, time
            complexity of finding best join order is O(n 2n)
              Space complexity remains at O(2n)
           Cost-based optimization is expensive, but
            worthwhile for queries on large datasets (typical
            queries have small n, generally < 10)




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                35
        Heuristic using Left-Deep Plans
   Left-deep plans differ only in the order of relations, the access
    method for each relation, and the join method for each join.
   Enumerated using N passes (if N relations joined):
       Pass 1: Find best 1-relation plan for each relation.
       Pass 2: Find best way to join result of each 1-relation plan (as outer) to
        another relation. (All 2-relation plans.)
       Pass N: Find best way to join result of a (N-1)-relation plan (as outer)
        to the N’th relation. (All N-relation plans.)
   For each subset of relations, retain only:
       Cheapest plan overall, plus
       Cheapest plan for each interesting order of the tuples.
   Complexity: only O(N2 ) (but not necessarily optimal!)


Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                       36
 Interesting Orders in Cost-Based
 Optimization
             Consider the expression (r1 r2 r3) r4 r5
             An interesting sort order is a particular sort order of
              tuples that could be useful for a later operation.
                Generating the result of r1 r2 r3 sorted on the attributes
                 common with r4 or r5 may be useful, but generating it sorted
                 on the attributes common only r1 and r2 is not useful.
                Using merge-join to compute r1 r2 r3 may be costlier, but
                 may provide an output sorted in an interesting order.
             Not sufficient to find the best join order for each
              subset of the set of n given relations; must find the
              best join order for each subset, for each interesting
              sort order
                Simple extension of earlier dynamic programming
                 algorithms
                Usually, number of interesting orders is quite small and
                 doesn’t affect time/space complexity significantly
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                  37
         Heuristic Optimization
        Cost-based optimization is expensive, even with
         dynamic programming.
        Systems may use heuristics to reduce the number of
         choices that must be made in a cost-based fashion.
        Heuristic optimization transforms the query-tree by
         using a set of rules that typically (but not in all cases)
         improve execution performance:
           Perform selection early (reduces the number of tuples)
           Perform projection early (reduces the number of attributes)
           Perform most restrictive selection and join operations before
            other similar operations.
           Some systems use only heuristics, others combine heuristics
            with partial cost-based optimization.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke              38
     Steps in Typical Heuristic
     Optimization
  1. Deconstruct conjunctive selections into a sequence of
     single selection operations (Equiv. rule 1.).
  2. Move selection operations down the query tree for the
     earliest possible execution (Equiv. rules 2, 7a, 7b, 11).
  3. Execute first those selection and join operations that will
     produce the smallest relations (Equiv. rule 6).
  4. Replace Cartesian product operations that are followed by
     a selection condition by join operations (Equiv. rule 4a).
  5. Deconstruct and move as far down the tree as possible
     lists of projection attributes, creating new projections
     where needed (Equiv. rules 3, 8a, 8b, 12).
  6. Identify those subtrees whose operations can be pipelined,
     and execute them using pipelining).

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke     39
       Structure of Query Optimizers
           The System R/Starburst optimizer considers only
            left-deep join orders. This reduces optimization
            complexity and generates plans amenable to
            pipelined evaluation.
            System R/Starburst also uses heuristics to push
            selections and projections down the query tree.
           Heuristic optimization used in some versions of
            Oracle:
              Repeatedly pick “best” relation to join next
                  • Starting from each of n starting points. Pick best among these.
           For scans using secondary indices, some optimizers
            take into account the probability that the page
            containing the tuple is in the buffer.
           Intricacies of SQL complicate query optimization
              E.g. nested subqueries
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                        40
        Enumeration of Plans (Contd.)
    ORDER BY, GROUP BY, aggregates etc. handled as a
     final step, using either an `interestingly ordered’
     plan or an addional sorting operator.




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   41
                                Sailors:
                                 B+ tree on rating                             sname

         Example                 Hash on sid
                                Reserves:
                                 B+ tree on bid
   Pass1:                                                          sid=sid

      Sailors: B+ tree matches rating>5, and is probably cheapest.
       However, if this selection is expected to                bid=500 rating>5
       retrieve a lot of tuples, and index is unclustered,
       file scan may be cheaper.
          • Still, B+ tree plan kept (because tuples are in rating order). Reserves Sailors
        Reserves: B+ tree on bid matches bid=500; cheapest.
        So we have two plans on Sailors and one on Reserves
v   Pass 2:
      – We consider each plan retained from Pass 1 as the outer, and
      consider how to join it with the (only) other relation.
           e.g., Reserves as outer: Hash index can be used to get Sailors tuples
          that satisfy sid = outer tuple’s sid value. (only linear scan of SAILORS)
                                                          SELECT S.sname
                                                          FROM Sailors S
        Nested Queries                                    WHERE EXISTS
                                                            (SELECT *
                                                             FROM Reserves R
   Nested block is optimized                                WHERE R.bid=103
    independently, with the outer                            AND R.sid=S.sid)
    tuple considered as providing a
    selection condition.                                         Nested block to optimize:
                                                                 SELECT *
   Outer block is optimized with
                                                                 FROM Reserves R
    the cost of `calling’ nested block
                                                                 WHERE R.bid=103
    computation taken into account.
                                                                   AND S.sid= outer value
   Implicit ordering of these blocks
    means that some good strategies                        Equivalent non-nested query:
    are not considered. The non-                           SELECT S.sname
    nested version of the query is                         FROM Sailors S, Reserves R
    typically optimized better.                            WHERE S.sid=R.sid
                                                            AND R.bid=103
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke                          43
         A comprehensive example
         SELECT S.sid, COUNT(*) AS numres
         FROM boats B, Reserves R, Sailors S
         WHERE R.sid = S.sid AND B.bid = R.bid AND B.color = ‘red’
         GROUP BY S.sid

         Query finds the number of red boats reserved by each sailor
                                            sid COUNT(*) AS numres



                                                  GROUPBY sid




                                                     Sid = sid



                                Bid = bid
                                                                 sailors

 boats         color = ‘red’                   reserves

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke             44
        Assume the following Indexes
         For Reserves
            A B+tree on Sid
            A clustered B+tree on Bid
         For Boats
            Both B+tree and Hash indexes on Color
         For Sailors
            A clustered B+tree and Hash indexes on Sid


         Note, some indexes are especially contrived
          for the example…
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   45
        A comprehensive example – cont.
         Pass 1 – best plan for each relation
            Reserves, sailors – file scan
            Boats – best is hash index but B-tree is used
             because of interesting order.
         Pass 2 – all left deep plans
              Reserves (outer) sid boats (inner)
              Boats (outer) sid reserves (inner)
              Reserves (outer) sid sailors (inner)
              Sailors (outer) sid reserves (inner)
         Need to consider all relevant join methods:
          BNL, SORT-Merge, Index,Hash
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   46
        A comprehensive example – cont.
           Some good plans:
              Boats sid reserves using index join since Bid is a
               clustered index for reserves.
              Reserves sailors using sort-merge since sailors
               already has clustered B tree index on sid.
           Pass 3 – consider each plan from stage 2 as an
            outer. Then use sort-merge join on Sid if join
            with Sailors is last, otherwise must sort result
            on Sid. This since we need to do a GROUP-by
            on Sid. So because of the Group-By the join
            with Sailors on Sid may better be the last
            one…
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke      47
        The system R optimizer
    Current relational query optimizers have been
     greatly influenced by choices made in the
     design of IBM’s system R query optimizer.
     Important design choices in the system R
     optimizer include:
          The use of statistics about the database instance to
           estimate the cost of a query evaluation plan.
          A decision to consider only plans with binary joins in
           which the inner relation is a base relation (I.e. not a
           temporary relation). This heuristic reduces the
           potentially very large number of alternative plans
           that must be considered.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke       48
        The system R optimizer

              A decision to focus optimization on the class of
               SQL queries without nesting , and to treat nested
               queries in a relatively as hoc way.
              A decision not to perform duplicate elimination
               for projections (except as a final step in the query
               evaluation when required by a DISTINCT clause).
              A model of cost that accounted for CPU costs as
               well as I/O costs.
              Search the space of plans for an optimal plan
               using hill-climbing.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke        49
        Additional topics

         Index selection – general problem is NP-
          complete.
         Query with many joins – AI technique like
          simulated annealing. [J9]
         Parallel database machines.




Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke   50
        Materialized Views**
          A materialized view is a view whose contents are
           computed and stored.
          Consider the view
           create view branch_total_loan(branch_name, total_loan)
           as
           select branch_name, sum(amount)
           from loan
           group by branch_name
          Materializing the above view would be very useful if
           the total loan amount is required frequently
             Saves the effort of finding multiple tuples and
              adding up their amounts
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke      51
        Optimization with Materialized
        views (Data Warehousing)
         Which views to materialize?

         Given a query and a set of materialized views:
            Can the query be answered using only subset of the
             materialized views?
            If yes, what is the best sub-set?


         All these topic are associated with optimization in Data
          warehousing!



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke       52
      Query Optimization and Materialized
      Views
     Rewriting queries to use materialized views:
        A materialized view v = r     s is available
        A user submits a query r s t
        We can rewrite the query as v t
            • Whether to do so depends on cost estimates for the two alternative
     Replacing a use of a materialized view by the view definition:
        A materialized view v = r      s is available, but without any index on
         it
        User submits a query A=10(v).
        Suppose also that s has an index on the common attribute B, and r has
         an index on attribute A.
        The best plan for this query may be to replace v by r s, which can
         lead to the query plan A=10(r) s
     Query optimizer should be extended to consider all above
      alternatives and choose the best overall plan
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke               53
        Materialized View Selection
         Materialized view selection: “What is the best set of views to
          materialize?”.
         Index selection: “what is the best set of indices to create”
            closely related, to materialized view selection
                • but simpler
         Materialized view selection and index selection based on typical
          system workload (queries and updates)
            Typical goal: minimize time to execute workload , subject to
              constraints on space and time taken for some critical
              queries/updates
            One of the steps in database tuning
                • more on tuning in later chapters
         Commercial database systems provide tools (called “tuning
          assistants” or “wizards”) to help the database administrator
          choose what indices and materialized views to create
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke               54
        Materialized View Maintenance
      The task of keeping a materialized view up-to-date with the
       underlying data is known as materialized view maintenance
      Materialized views can be maintained by recomputation on every
       update
      A better option is to use incremental view maintenance
         Changes to database relations are used to compute changes to
          the materialized view, which is then updated
      View maintenance can be done by
         Manually defining triggers on insert, delete, and update of each
          relation in the view definition
         Manually written code to update the view whenever database
          relations are updated
         Periodic recomputation (e.g. nightly)
         Above methods are directly supported by many database systems
            • Avoids manual effort/correctness issues
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke               55
        Summary
   Query optimization is an important task in a relational DBMS.
   Must understand optimization in order to understand the
    performance impact of a given database design (relations,
    indexes) on a workload (set of queries).
   Two parts to optimizing a query:
      Consider a set of alternative plans.
        • Must prune search space; typically, left-deep plans only.
      Must estimate cost of each plan that is considered.
        • Must estimate size of result and cost for each plan node.
        • Key issues: Statistics, indexes, operator implementations.
   Using materialized views (Data Warehousing)

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke     56

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:11
posted:12/3/2011
language:English
pages:56