Query Processing and Optimization
Chapters 12,14,15 in [R]
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1
Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2
Overview of Query Optimization
Plan: Tree of R.A. ops, with choice of alg for each op.
Each operator typically implemented using a `pull’
interface: when an operator is `pulled’ for the next
output tuples, it `pulls’ on its inputs and computes them.
Two main issues:
For a given query, what plans are considered?
• Algorithm to search plan space for cheapest (estimated) plan.
How is the cost of a plan estimated?
Ideally: Want to find best plan. Practically: Avoid
worst plans!
We will study the System R approach.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3
Introduction
Alternative ways of evaluating a given query
Equivalent expressions
Different algorithms for each operation (Chapter 13)
Cost difference between a good and a bad way of
evaluating a query can be enormous
Example: performing a r X s followed by a selection r.A =
s.B is much slower than performing a join on the same
condition
Need to estimate the cost of operations
Depends critically on statistical information about relations
which the database must maintain
• E.g. number of tuples, number of distinct values for join
attributes, etc.
Need to estimate statistics for intermediate results to
compute cost of complex expressions
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4
Introduction (Cont.)
Relations generated by two equivalent expressions
have the same set of attributes and contain the same
set of tuples, although their attributes may be
ordered differently.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5
Introduction (Cont.)
Generation of query-evaluation plans for an
expression involves several steps:
1. Generating logically equivalent expressions
• Use equivalence rules to transform an expression into an
equivalent one – Logical optimization
2. Annotating resultant expressions to get alternative query
plans
3. Choosing the cheapest plan based on estimated cost –
Physical optimization
The overall process is called cost based
optimization.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6
Query optimization – the two phases
query
Catalog Parse
schema info
Tree of RA
operators
Logical optimization – pushing operators down
Catalog physical optimization method to implement each RA
Statistics
indexes info Run-time code
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7
Search space “plan”
Search logical order
1 2
1 2 3
X or X or X
1 2 2 1
Take Plan 1: 4 Join methods, 3 select
methods
There are 3*3*4 = 36 Plans!
For a large tree the # of plans is huge!
The optimization problem:
Find the best plan!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8
Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to old schema; rname added for variations.
Reserves:
Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9
RA Tree: sname
Motivating Example
bid=100 rating > 5
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND sid=sid
R.bid=100 AND S.rating>5
Reserves Sailors
Cost: 500+500*1000 I/Os
(On-the-fly)
By no means the worst plan! Plan: sname
Misses several opportunities:
selections could have been bid=100 rating > 5 (On-the-fly)
`pushed’ earlier, no use is made
of any available indexes, etc.
(Simple Nested Loops)
Goal of optimization: To find more sid=sid
efficient plans that compute the
same answer.
Reserves Sailors
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10
(On-the-fly)
Alternative Plans 1
sname
(No Indexes) sid=sid
(Sort-Merge Join)
(Scan; (Scan;
write to bid=100 rating > 5 write to
temp T1) temp T2)
Main difference: push selects.
Reserves Sailors
With 5 buffers, cost of plan:
Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats,
uniform distribution).
Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).
Sort T1 (2*2*10), sort T2 (2*3*250), merge (10+250)
Total: 3560 page I/Os.
If we used BNL join, join cost = 10+4*250, total cost = 2770.
If we `push’ projections, T1 has only sid, T2 only sid and sname:
T1 fits in 3 pages, cost of BNL drops to under 250 pages, total 5 (On-the-fly)
With clustered index on bid of sid=sid
(Index Nested Loops,
with pipelining )
Reserves, we get 100,000/100 =
1000 tuples on 1000/100 = 10 pages. (Use hash
index; do bid=100 Sailors
not write
INL with pipelining (outer is not result to
temp)
materialized). Reserves
–Projecting out unnecessary fields from outer doesn’t help.
v Join column sid is a key for Sailors.
–At most one matching tuple, unclustered index on sid OK.
v Decision not to push rating>5 before the join is based on
availability of sid index on Sailors.
v Cost: Selection of Reserves tuples (10 I/Os); for each,
must get matching Sailors tuple (1000*1.2); total 1210 I/Os.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12
Logical and physical optimization
Logical opt. – use equivalence rules to get the “last” equivalent
expression.
Physical opt. – use cost estimates to select the best physical plan.
Note: the two are not independent!
The best physical may not be the best logical!
Example:
select Name from Sailors, Reserves
where Reserves.Sid = Sailors.Sid and Sailors.Rating >10
and assume both sorted on Sid and index exists on Rating.
Better not to push the select down!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13
Relational Algebra Equivalences
Allow us to choose different join orders and to
`push’ selections and projections ahead of joins.
Selections: c1 ... cn R c1 . . . cn R (Cascade)
c1 c 2 R c 2 c1 R (Commute)
Projections: a1 R a1 . . . an R (Cascade)
Joins: R (S T)
(R S) T
(Associative)
(R S) (S R)
(Commute)
+ Show that:R (S T)
(T R) S (is T R a join?)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14
More Equivalences
A projection commutes with a selection that only
uses attributes retained by the projection.
Selection between attributes of the two arguments of
a cross-product converts cross-product to a join.
A selection on just attributes of R commutes with
R S. (i.e., (R S) (R) S )
Similarly, if a projection follows a join R S, we can
`push’ it by retaining only attributes of R (and S) that
are needed for the join or are kept by the projection.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15
Equivalence Rules
1. Conjunctive selection operations can be
deconstructed into a sequence of individual
selections. ( E ) ( ( E ))
1 2 1 2
2. Selection operations are commutative.
( ( E )) ( ( E ))
1 2 2 1
3. Only the last in a sequence of projection operations
is needed, the others can be omitted.
t1 ( t2 (( tn (E )))) t1 (E )
4. Selections can be combined with Cartesian
products and theta joins.
a. (E1 X E2) = E1 E2
b. 1(E1 2 E2) = E1 1 2 E2
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16
Equivalence Rules (Cont.)
5. Theta-join operations (and natural joins) are
commutative.
E1 E2 = E2 E1
6. (a) Natural join operations are associative:
(E1 E2) E3 = E1 (E2 E3)
(b) Theta joins are associative in the following
manner:
(E1 1 E2) 2 3 E3 = E1 1 3 (E2 2 E3)
where 2 involves attributes from only E2 and E3.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17
Equivalence Rules (Cont.)
7. The selection operation distributes over the theta join
operation under the following two conditions:
(a) When all the attributes in 0 involve only the
attributes of one
of the expressions (E1) being joined.
0E1 E2) = (0(E1)) E2
(b) When 1 involves only the attributes of E1 and 2
involves
only the attributes of E2.
1 E1 E2) = (1(E1)) ( (E2))
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18
Equivalence Rules (Cont.)
8. The projections operation distributes over the
theta join operation as follows:
(a) if involves only attributes from L1 L2:
L1 L2 ( E1....... E2 ) ( L1 ( E1 )) ...... ( L2 ( E2 ))
(b) Consider a join E1 E2.
Let L1 and L2 be sets of attributes from E1 and E2,
respectively.
Let L3 be attributes of E1 that are involved in join
condition , but are not in L1 L2, and
let L4 be attributes of E2 that are involved in join
condition , but are not in L1 L2.
L1 L2 ( E1..... E2 ) L1 L2 (( L1 L3 ( E1 )) ...... ( L2 L4 ( E2 )))
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19
Equivalence Rules (Cont.)
9. The set operations union and intersection are
commutative
E1 E2 = E2 E1
E1 E2 = E2 E1
n (set difference is not commutative).
10. Set union and intersection are associative.
(E1 E2) E3 = E1 (E2 E3)
(E1 E2) E3 = E1 (E2 E3)
11. The selection operation distributes over , and –.
(E1 – E2) = (E1) – (E2)
and similarly for and in place of –
Also: (E1 – E2) = (E1) – E2
and similarly for in place of –, but not
for
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 20
Equivalence Rules (Cont.)
12.The projection operation distributes over union
L(E1 E2) = (L(E1)) (L(E2))
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 21
Transformation Example
Query: Find the names of all customers who have an
account at some branch located in Brooklyn.
customer-name(branch-city = “Brooklyn”
(branch (account depositor)))
Transformation using rule 7a.
customer-name
((branch-city =“Brooklyn” (branch))
(account depositor))
Performing the selection as early as possible reduces
the size of the relation to be joined.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 22
Example with Multiple Transformations
Query: Find the names of all customers with an account at
a Brooklyn branch whose account balance is over $1000.
customer-name((branch-city = “Brooklyn” balance > 1000
(branch (account depositor)))
Transformation using join associatively (Rule 6a):
customer-name((branch-city = “Brooklyn” balance > 1000
(branch (account)) depositor)
Second form provides an opportunity to apply the
“perform selections early” rule, resulting in the
subexpression
branch-city = “Brooklyn” (branch) balance > 1000 (account)
Thus a sequence of transformations can be useful
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 23
Multiple Transformations (Cont.)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 24
Projection Operation Example
customer-name((branch-city = “Brooklyn” (branch) account) depositor)
When we compute
(branch-city = “Brooklyn” (branch) account )
we obtain a relation whose schema is:
(branch-name, branch-city, assets, account-number, balance)
Push projections using equivalence rules 8a and 8b;
eliminate unneeded attributes from intermediate results
to get:
customer-name ((
account-number ( (branch-city = “Brooklyn” (branch) account ))
depositor)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 25
Join Ordering Example
For all relations r1, r2, and r3,
(r1 r2) r3 = r1 (r2 r3 )
If r2 r3 is quite large and r1 r2 is small, we
choose
(r1 r2) r3
so that we compute and store a smaller
temporary relation.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 26
Highlights of System R Optimizer
Impact:
Most widely used currently; works well for 5, and is probably cheapest.
However, if this selection is expected to bid=500 rating>5
retrieve a lot of tuples, and index is unclustered,
file scan may be cheaper.
• Still, B+ tree plan kept (because tuples are in rating order). Reserves Sailors
Reserves: B+ tree on bid matches bid=500; cheapest.
So we have two plans on Sailors and one on Reserves
v Pass 2:
– We consider each plan retained from Pass 1 as the outer, and
consider how to join it with the (only) other relation.
e.g., Reserves as outer: Hash index can be used to get Sailors tuples
that satisfy sid = outer tuple’s sid value. (only linear scan of SAILORS)
SELECT S.sname
FROM Sailors S
Nested Queries WHERE EXISTS
(SELECT *
FROM Reserves R
Nested block is optimized WHERE R.bid=103
independently, with the outer AND R.sid=S.sid)
tuple considered as providing a
selection condition. Nested block to optimize:
SELECT *
Outer block is optimized with
FROM Reserves R
the cost of `calling’ nested block
WHERE R.bid=103
computation taken into account.
AND S.sid= outer value
Implicit ordering of these blocks
means that some good strategies Equivalent non-nested query:
are not considered. The non- SELECT S.sname
nested version of the query is FROM Sailors S, Reserves R
typically optimized better. WHERE S.sid=R.sid
AND R.bid=103
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 43
A comprehensive example
SELECT S.sid, COUNT(*) AS numres
FROM boats B, Reserves R, Sailors S
WHERE R.sid = S.sid AND B.bid = R.bid AND B.color = ‘red’
GROUP BY S.sid
Query finds the number of red boats reserved by each sailor
sid COUNT(*) AS numres
GROUPBY sid
Sid = sid
Bid = bid
sailors
boats color = ‘red’ reserves
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 44
Assume the following Indexes
For Reserves
A B+tree on Sid
A clustered B+tree on Bid
For Boats
Both B+tree and Hash indexes on Color
For Sailors
A clustered B+tree and Hash indexes on Sid
Note, some indexes are especially contrived
for the example…
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 45
A comprehensive example – cont.
Pass 1 – best plan for each relation
Reserves, sailors – file scan
Boats – best is hash index but B-tree is used
because of interesting order.
Pass 2 – all left deep plans
Reserves (outer) sid boats (inner)
Boats (outer) sid reserves (inner)
Reserves (outer) sid sailors (inner)
Sailors (outer) sid reserves (inner)
Need to consider all relevant join methods:
BNL, SORT-Merge, Index,Hash
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 46
A comprehensive example – cont.
Some good plans:
Boats sid reserves using index join since Bid is a
clustered index for reserves.
Reserves sailors using sort-merge since sailors
already has clustered B tree index on sid.
Pass 3 – consider each plan from stage 2 as an
outer. Then use sort-merge join on Sid if join
with Sailors is last, otherwise must sort result
on Sid. This since we need to do a GROUP-by
on Sid. So because of the Group-By the join
with Sailors on Sid may better be the last
one…
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 47
The system R optimizer
Current relational query optimizers have been
greatly influenced by choices made in the
design of IBM’s system R query optimizer.
Important design choices in the system R
optimizer include:
The use of statistics about the database instance to
estimate the cost of a query evaluation plan.
A decision to consider only plans with binary joins in
which the inner relation is a base relation (I.e. not a
temporary relation). This heuristic reduces the
potentially very large number of alternative plans
that must be considered.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 48
The system R optimizer
A decision to focus optimization on the class of
SQL queries without nesting , and to treat nested
queries in a relatively as hoc way.
A decision not to perform duplicate elimination
for projections (except as a final step in the query
evaluation when required by a DISTINCT clause).
A model of cost that accounted for CPU costs as
well as I/O costs.
Search the space of plans for an optimal plan
using hill-climbing.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 49
Additional topics
Index selection – general problem is NP-
complete.
Query with many joins – AI technique like
simulated annealing. [J9]
Parallel database machines.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 50
Materialized Views**
A materialized view is a view whose contents are
computed and stored.
Consider the view
create view branch_total_loan(branch_name, total_loan)
as
select branch_name, sum(amount)
from loan
group by branch_name
Materializing the above view would be very useful if
the total loan amount is required frequently
Saves the effort of finding multiple tuples and
adding up their amounts
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 51
Optimization with Materialized
views (Data Warehousing)
Which views to materialize?
Given a query and a set of materialized views:
Can the query be answered using only subset of the
materialized views?
If yes, what is the best sub-set?
All these topic are associated with optimization in Data
warehousing!
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 52
Query Optimization and Materialized
Views
Rewriting queries to use materialized views:
A materialized view v = r s is available
A user submits a query r s t
We can rewrite the query as v t
• Whether to do so depends on cost estimates for the two alternative
Replacing a use of a materialized view by the view definition:
A materialized view v = r s is available, but without any index on
it
User submits a query A=10(v).
Suppose also that s has an index on the common attribute B, and r has
an index on attribute A.
The best plan for this query may be to replace v by r s, which can
lead to the query plan A=10(r) s
Query optimizer should be extended to consider all above
alternatives and choose the best overall plan
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 53
Materialized View Selection
Materialized view selection: “What is the best set of views to
materialize?”.
Index selection: “what is the best set of indices to create”
closely related, to materialized view selection
• but simpler
Materialized view selection and index selection based on typical
system workload (queries and updates)
Typical goal: minimize time to execute workload , subject to
constraints on space and time taken for some critical
queries/updates
One of the steps in database tuning
• more on tuning in later chapters
Commercial database systems provide tools (called “tuning
assistants” or “wizards”) to help the database administrator
choose what indices and materialized views to create
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 54
Materialized View Maintenance
The task of keeping a materialized view up-to-date with the
underlying data is known as materialized view maintenance
Materialized views can be maintained by recomputation on every
update
A better option is to use incremental view maintenance
Changes to database relations are used to compute changes to
the materialized view, which is then updated
View maintenance can be done by
Manually defining triggers on insert, delete, and update of each
relation in the view definition
Manually written code to update the view whenever database
relations are updated
Periodic recomputation (e.g. nightly)
Above methods are directly supported by many database systems
• Avoids manual effort/correctness issues
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 55
Summary
Query optimization is an important task in a relational DBMS.
Must understand optimization in order to understand the
performance impact of a given database design (relations,
indexes) on a workload (set of queries).
Two parts to optimizing a query:
Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
Must estimate cost of each plan that is considered.
• Must estimate size of result and cost for each plan node.
• Key issues: Statistics, indexes, operator implementations.
Using materialized views (Data Warehousing)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 56