Embed
Email

optimization

Document Sample

Shared by: gegeshandong
Categories
Tags
Stats
views:
2
posted:
12/3/2011
language:
English
pages:
56
Query Processing and Optimization







Chapters 12,14,15 in [R]









Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Basic Steps in Query Processing

1. Parsing and translation

2. Optimization

3. Evaluation









Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2

Overview of Query Optimization

 Plan: Tree of R.A. ops, with choice of alg for each op.

 Each operator typically implemented using a `pull’

interface: when an operator is `pulled’ for the next

output tuples, it `pulls’ on its inputs and computes them.

 Two main issues:

 For a given query, what plans are considered?

• Algorithm to search plan space for cheapest (estimated) plan.

 How is the cost of a plan estimated?

 Ideally: Want to find best plan. Practically: Avoid

worst plans!

 We will study the System R approach.



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3

Introduction

 Alternative ways of evaluating a given query

 Equivalent expressions

 Different algorithms for each operation (Chapter 13)

 Cost difference between a good and a bad way of

evaluating a query can be enormous

 Example: performing a r X s followed by a selection r.A =

s.B is much slower than performing a join on the same

condition

 Need to estimate the cost of operations

 Depends critically on statistical information about relations

which the database must maintain

• E.g. number of tuples, number of distinct values for join

attributes, etc.

 Need to estimate statistics for intermediate results to

compute cost of complex expressions

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4

Introduction (Cont.)

Relations generated by two equivalent expressions

have the same set of attributes and contain the same

set of tuples, although their attributes may be

ordered differently.









Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5

Introduction (Cont.)

 Generation of query-evaluation plans for an

expression involves several steps:

1. Generating logically equivalent expressions

• Use equivalence rules to transform an expression into an

equivalent one – Logical optimization

2. Annotating resultant expressions to get alternative query

plans

3. Choosing the cheapest plan based on estimated cost –

Physical optimization

 The overall process is called cost based

optimization.



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6

Query optimization – the two phases

query



Catalog Parse

schema info

Tree of RA

operators



Logical optimization – pushing operators down







Catalog physical optimization method to implement each RA

Statistics

indexes info Run-time code

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7

Search space “plan”

Search logical order

1 2

1 2 3

X or X or X





1 2 2 1





Take Plan 1: 4 Join methods, 3 select

methods

There are 3*3*4 = 36 Plans!

For a large tree the # of plans is huge!



The optimization problem:

Find the best plan!

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8

Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real)

Reserves (sid: integer, bid: integer, day: dates, rname: string)



 Similar to old schema; rname added for variations.

 Reserves:

 Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.

 Sailors:

 Each tuple is 50 bytes long, 80 tuples per page, 500 pages.







Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9

RA Tree: sname



Motivating Example

bid=100 rating > 5

SELECT S.sname

FROM Reserves R, Sailors S

WHERE R.sid=S.sid AND sid=sid

R.bid=100 AND S.rating>5

Reserves Sailors

 Cost: 500+500*1000 I/Os

(On-the-fly)

 By no means the worst plan! Plan: sname

 Misses several opportunities:

selections could have been bid=100 rating > 5 (On-the-fly)

`pushed’ earlier, no use is made

of any available indexes, etc.

(Simple Nested Loops)

 Goal of optimization: To find more sid=sid

efficient plans that compute the

same answer.

Reserves Sailors

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10

(On-the-fly)

Alternative Plans 1

sname









(No Indexes) sid=sid

(Sort-Merge Join)





(Scan; (Scan;

write to bid=100 rating > 5 write to

temp T1) temp T2)

 Main difference: push selects.

Reserves Sailors

 With 5 buffers, cost of plan:

 Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats,

uniform distribution).

 Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).

 Sort T1 (2*2*10), sort T2 (2*3*250), merge (10+250)

 Total: 3560 page I/Os.

 If we used BNL join, join cost = 10+4*250, total cost = 2770.

 If we `push’ projections, T1 has only sid, T2 only sid and sname:

 T1 fits in 3 pages, cost of BNL drops to under 250 pages, total 5 (On-the-fly)







 With clustered index on bid of sid=sid

(Index Nested Loops,

with pipelining )

Reserves, we get 100,000/100 =

1000 tuples on 1000/100 = 10 pages. (Use hash

index; do bid=100 Sailors

not write

 INL with pipelining (outer is not result to

temp)

materialized). Reserves



–Projecting out unnecessary fields from outer doesn’t help.

v Join column sid is a key for Sailors.

–At most one matching tuple, unclustered index on sid OK.

v Decision not to push rating>5 before the join is based on

availability of sid index on Sailors.

v Cost: Selection of Reserves tuples (10 I/Os); for each,

must get matching Sailors tuple (1000*1.2); total 1210 I/Os.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12

Logical and physical optimization



 Logical opt. – use equivalence rules to get the “last” equivalent

expression.

 Physical opt. – use cost estimates to select the best physical plan.

Note: the two are not independent!

The best physical may not be the best logical!



Example:

select Name from Sailors, Reserves

where Reserves.Sid = Sailors.Sid and Sailors.Rating >10

and assume both sorted on Sid and index exists on Rating.

Better not to push the select down!







Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13

Relational Algebra Equivalences

 Allow us to choose different join orders and to

`push’ selections and projections ahead of joins.

 Selections:  c1 ...  cn  R    c1  . . .  cn  R   (Cascade)

 c1  c 2  R    c 2  c1  R  (Commute)

 

Projections:  a1  R   a1 . . .  an  R   (Cascade)



 Joins: R  (S  T)

   (R S)  T

 (Associative)

(R  S)  (S  R)

  (Commute)



+ Show that:R  (S T)

   (T R)  S (is T R a join?)

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14

More Equivalences

 A projection commutes with a selection that only

uses attributes retained by the projection.

 Selection between attributes of the two arguments of

a cross-product converts cross-product to a join.

 A selection on just attributes of R commutes with

R  S. (i.e.,  (R  S)   (R)  S )

  

 Similarly, if a projection follows a join R  S, we can



`push’ it by retaining only attributes of R (and S) that

are needed for the join or are kept by the projection.



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15

Equivalence Rules

1. Conjunctive selection operations can be

deconstructed into a sequence of individual

selections.    ( E )    (  ( E ))

1 2 1 2









2. Selection operations are commutative.

  (  ( E ))    (  ( E ))

1 2 2 1







3. Only the last in a sequence of projection operations

is needed, the others can be omitted.

 t1 ( t2 (( tn (E ))))   t1 (E )



4. Selections can be combined with Cartesian

products and theta joins.

a. (E1 X E2) = E1  E2

b. 1(E1 2 E2) = E1 1 2 E2



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16

Equivalence Rules (Cont.)



5. Theta-join operations (and natural joins) are

commutative.

E1  E2 = E2  E1

6. (a) Natural join operations are associative:

(E1 E2) E3 = E1 (E2 E3)



(b) Theta joins are associative in the following

manner:



(E1 1 E2) 2  3 E3 = E1 1 3 (E2 2 E3)



where 2 involves attributes from only E2 and E3.



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17

Equivalence Rules (Cont.)



7. The selection operation distributes over the theta join

operation under the following two conditions:

(a) When all the attributes in 0 involve only the

attributes of one

of the expressions (E1) being joined.



0E1  E2) = (0(E1))  E2



(b) When  1 involves only the attributes of E1 and 2

involves

only the attributes of E2.

1 E1  E2) = (1(E1))  ( (E2))



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18

Equivalence Rules (Cont.)

8. The projections operation distributes over the

theta join operation as follows:

(a) if  involves only attributes from L1  L2:

 L1  L2 ( E1....... E2 )  ( L1 ( E1 )) ...... ( L2 ( E2 ))





(b) Consider a join E1  E2.

 Let L1 and L2 be sets of attributes from E1 and E2,

respectively.

 Let L3 be attributes of E1 that are involved in join

condition , but are not in L1  L2, and

 let L4 be attributes of E2 that are involved in join

condition , but are not in L1  L2.

 L1  L2 ( E1.....  E2 )   L1  L2 ((  L1  L3 ( E1 )) ...... ( L2  L4 ( E2 )))

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19

Equivalence Rules (Cont.)

9. The set operations union and intersection are

commutative

E1  E2 = E2  E1

E1  E2 = E2  E1

n (set difference is not commutative).

10. Set union and intersection are associative.

(E1  E2)  E3 = E1  (E2  E3)

(E1  E2)  E3 = E1  (E2  E3)

11. The selection operation distributes over ,  and –.

 (E1 – E2) =  (E1) – (E2)

and similarly for  and  in place of –

Also:  (E1 – E2) = (E1) – E2

and similarly for  in place of –, but not

for 

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 20

Equivalence Rules (Cont.)

12.The projection operation distributes over union

L(E1  E2) = (L(E1))  (L(E2))









Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 21

Transformation Example



 Query: Find the names of all customers who have an

account at some branch located in Brooklyn.

customer-name(branch-city = “Brooklyn”

(branch (account depositor)))

 Transformation using rule 7a.

customer-name

((branch-city =“Brooklyn” (branch))

(account depositor))

 Performing the selection as early as possible reduces

the size of the relation to be joined.





Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 22

Example with Multiple Transformations

 Query: Find the names of all customers with an account at

a Brooklyn branch whose account balance is over $1000.

customer-name((branch-city = “Brooklyn”  balance > 1000

(branch (account depositor)))

 Transformation using join associatively (Rule 6a):

customer-name((branch-city = “Brooklyn”  balance > 1000

(branch (account)) depositor)

 Second form provides an opportunity to apply the

“perform selections early” rule, resulting in the

subexpression

branch-city = “Brooklyn” (branch)  balance > 1000 (account)

 Thus a sequence of transformations can be useful

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 23

Multiple Transformations (Cont.)









Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 24

Projection Operation Example

customer-name((branch-city = “Brooklyn” (branch) account) depositor)



 When we compute

(branch-city = “Brooklyn” (branch) account )

we obtain a relation whose schema is:

(branch-name, branch-city, assets, account-number, balance)

 Push projections using equivalence rules 8a and 8b;

eliminate unneeded attributes from intermediate results

to get:

 customer-name ((

 account-number ( (branch-city = “Brooklyn” (branch) account ))

depositor)





Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 25

Join Ordering Example



 For all relations r1, r2, and r3,

(r1 r2) r3 = r1 (r2 r3 )

 If r2 r3 is quite large and r1 r2 is small, we

choose



(r1 r2) r3

so that we compute and store a smaller

temporary relation.





Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 26

Highlights of System R Optimizer



 Impact:

 Most widely used currently; works well for 5, and is probably cheapest.

However, if this selection is expected to bid=500 rating>5

retrieve a lot of tuples, and index is unclustered,

file scan may be cheaper.

• Still, B+ tree plan kept (because tuples are in rating order). Reserves Sailors

 Reserves: B+ tree on bid matches bid=500; cheapest.

 So we have two plans on Sailors and one on Reserves

v Pass 2:

– We consider each plan retained from Pass 1 as the outer, and

consider how to join it with the (only) other relation.

e.g., Reserves as outer: Hash index can be used to get Sailors tuples

that satisfy sid = outer tuple’s sid value. (only linear scan of SAILORS)

SELECT S.sname

FROM Sailors S

Nested Queries WHERE EXISTS

(SELECT *

FROM Reserves R

 Nested block is optimized WHERE R.bid=103

independently, with the outer AND R.sid=S.sid)

tuple considered as providing a

selection condition. Nested block to optimize:

SELECT *

 Outer block is optimized with

FROM Reserves R

the cost of `calling’ nested block

WHERE R.bid=103

computation taken into account.

AND S.sid= outer value

 Implicit ordering of these blocks

means that some good strategies Equivalent non-nested query:

are not considered. The non- SELECT S.sname

nested version of the query is FROM Sailors S, Reserves R

typically optimized better. WHERE S.sid=R.sid

AND R.bid=103

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 43

A comprehensive example

SELECT S.sid, COUNT(*) AS numres

FROM boats B, Reserves R, Sailors S

WHERE R.sid = S.sid AND B.bid = R.bid AND B.color = ‘red’

GROUP BY S.sid



Query finds the number of red boats reserved by each sailor

sid COUNT(*) AS numres







GROUPBY sid









Sid = sid







Bid = bid

sailors



boats color = ‘red’ reserves



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 44

Assume the following Indexes

 For Reserves

 A B+tree on Sid

 A clustered B+tree on Bid

 For Boats

 Both B+tree and Hash indexes on Color

 For Sailors

 A clustered B+tree and Hash indexes on Sid





 Note, some indexes are especially contrived

for the example…

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 45

A comprehensive example – cont.

 Pass 1 – best plan for each relation

 Reserves, sailors – file scan

 Boats – best is hash index but B-tree is used

because of interesting order.

 Pass 2 – all left deep plans

 Reserves (outer) sid boats (inner)

 Boats (outer) sid reserves (inner)

 Reserves (outer) sid sailors (inner)

 Sailors (outer) sid reserves (inner)

 Need to consider all relevant join methods:

BNL, SORT-Merge, Index,Hash

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 46

A comprehensive example – cont.

 Some good plans:

 Boats sid reserves using index join since Bid is a

clustered index for reserves.

 Reserves sailors using sort-merge since sailors

already has clustered B tree index on sid.

 Pass 3 – consider each plan from stage 2 as an

outer. Then use sort-merge join on Sid if join

with Sailors is last, otherwise must sort result

on Sid. This since we need to do a GROUP-by

on Sid. So because of the Group-By the join

with Sailors on Sid may better be the last

one…

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 47

The system R optimizer

Current relational query optimizers have been

greatly influenced by choices made in the

design of IBM’s system R query optimizer.

Important design choices in the system R

optimizer include:

 The use of statistics about the database instance to

estimate the cost of a query evaluation plan.

 A decision to consider only plans with binary joins in

which the inner relation is a base relation (I.e. not a

temporary relation). This heuristic reduces the

potentially very large number of alternative plans

that must be considered.



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 48

The system R optimizer



 A decision to focus optimization on the class of

SQL queries without nesting , and to treat nested

queries in a relatively as hoc way.

 A decision not to perform duplicate elimination

for projections (except as a final step in the query

evaluation when required by a DISTINCT clause).

 A model of cost that accounted for CPU costs as

well as I/O costs.

 Search the space of plans for an optimal plan

using hill-climbing.



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 49

Additional topics



 Index selection – general problem is NP-

complete.

 Query with many joins – AI technique like

simulated annealing. [J9]

 Parallel database machines.









Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 50

Materialized Views**

 A materialized view is a view whose contents are

computed and stored.

 Consider the view

create view branch_total_loan(branch_name, total_loan)

as

select branch_name, sum(amount)

from loan

group by branch_name

 Materializing the above view would be very useful if

the total loan amount is required frequently

 Saves the effort of finding multiple tuples and

adding up their amounts

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 51

Optimization with Materialized

views (Data Warehousing)

 Which views to materialize?



 Given a query and a set of materialized views:

 Can the query be answered using only subset of the

materialized views?

 If yes, what is the best sub-set?





 All these topic are associated with optimization in Data

warehousing!







Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 52

Query Optimization and Materialized

Views

 Rewriting queries to use materialized views:

 A materialized view v = r s is available

 A user submits a query r s t

 We can rewrite the query as v t

• Whether to do so depends on cost estimates for the two alternative

 Replacing a use of a materialized view by the view definition:

 A materialized view v = r s is available, but without any index on

it

 User submits a query A=10(v).

 Suppose also that s has an index on the common attribute B, and r has

an index on attribute A.

 The best plan for this query may be to replace v by r s, which can

lead to the query plan A=10(r) s

 Query optimizer should be extended to consider all above

alternatives and choose the best overall plan

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 53

Materialized View Selection

 Materialized view selection: “What is the best set of views to

materialize?”.

 Index selection: “what is the best set of indices to create”

 closely related, to materialized view selection

• but simpler

 Materialized view selection and index selection based on typical

system workload (queries and updates)

 Typical goal: minimize time to execute workload , subject to

constraints on space and time taken for some critical

queries/updates

 One of the steps in database tuning

• more on tuning in later chapters

 Commercial database systems provide tools (called “tuning

assistants” or “wizards”) to help the database administrator

choose what indices and materialized views to create

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 54

Materialized View Maintenance

 The task of keeping a materialized view up-to-date with the

underlying data is known as materialized view maintenance

 Materialized views can be maintained by recomputation on every

update

 A better option is to use incremental view maintenance

 Changes to database relations are used to compute changes to

the materialized view, which is then updated

 View maintenance can be done by

 Manually defining triggers on insert, delete, and update of each

relation in the view definition

 Manually written code to update the view whenever database

relations are updated

 Periodic recomputation (e.g. nightly)

 Above methods are directly supported by many database systems

• Avoids manual effort/correctness issues

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 55

Summary

 Query optimization is an important task in a relational DBMS.

 Must understand optimization in order to understand the

performance impact of a given database design (relations,

indexes) on a workload (set of queries).

 Two parts to optimizing a query:

 Consider a set of alternative plans.

• Must prune search space; typically, left-deep plans only.

 Must estimate cost of each plan that is considered.

• Must estimate size of result and cost for each plan node.

• Key issues: Statistics, indexes, operator implementations.

 Using materialized views (Data Warehousing)



Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 56



Related docs
Other docs by gegeshandong
Chapter 10 Slides-Cavico
Views: 0  |  Downloads: 0
100 Mile Club tracking sheet
Views: 3  |  Downloads: 0
lit11-12
Views: 0  |  Downloads: 0
Terranora Primary.xlsx
Views: 0  |  Downloads: 0
Study Guide Chp 17_ 19-20
Views: 0  |  Downloads: 0
8
Views: 7  |  Downloads: 0
1735-1250240321-jh09cp_ladies_footwear_wk24
Views: 0  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!