# Applied Databases Relational Algebra

Document Sample

```					                                                                     Today’s lecture

Applied Databases
Relational Algebra                                    What is the (core) relational algebra?
How can we write queries using the relational algebra?
Floris Geerts                                   How powerful is the relational algebra?

14 Oct, 2009

Relational query languages                                           Formal relational query languages

Query languages allow the manipulation and retrieval of data      Relational Algebra
from a database.                                                  Simple “operational” model, useful for expressing execution plans.
The relational model supports simple, powerful query
languages:                                                        Relational Calculus
strong formal foundation; and
Logical model (declarative), useful for theoretical results.
allows for much (provably correct) optimisation.
Both languages were introduced by Codd in a series of papers.
NOTE: Query languages are not (necessarily) programming                They have equivalent expressive power.
languages.                                                             They are the key to understanding SQL query processing!
Preliminaries                                                                 Relational algebra

Basic operations:
A query is applied to relation instances, and the result of a                       Selection (σ): Selects a subset of rows from relation.
query is also a relation instance.                                                  Projection (π): Deletes unwanted columns from relation.
Cross-product (×): Allows us to combine two relations.
input      query         output
Set-diﬀerence (−): Allows us to subtract relations.
instance                  instance                                  Union (∪): Allows us to union relations.
Renaming (ρ): Allows to rename relation and ﬁeld names.
For a given query, the schema of input relations are ﬁxed.                          Intersection, join, division,
Not essential, but (very!) useful (especially join).
The query will then execute over any valid instance.
The schema of the result can also be determined (and is ﬁxed              Closure
for the given query).                                                     Since each operation returns a relation, operations can be
composed! (One says that the algebra is closed.)

Projection                                                                    Projection – continued

Choose a set of ﬁeld names A and a table R
πA (R) extracts the columns in A from the table.                              Suppose the result of a projection has a repeated value, how
do we treat it?
Example, given Munros =
MId   MName                Lat      Long            Height   Rating          πRating (Munros) is             Rating        or       Rating   ?
1     The Saddle           57.167   5.384           1010     4                                               4                      4
2     Ladhar Bheinn        57.067   5.750           1020     4                                               4                      2.5
3     Schiehallion         56.667   4.098           1083     2.5                                             2.5                    1.5
4     Ben Nevis            56.780   5.002           1343     1.5
1.5
πMId,Rating (Munros) is
In “pure” relational algebra the answer is always a set (recall
MId   Rating                                       that we deﬁned a relation instance as a set).
1     4
2     4                                            However, SQL and some other languages return a multiset for
3     2.5                                          some operations from which duplicates may be eliminated by
4     1.5
a further operation. (Why? Eliminating duplicates is
expensive in practice).
Provides the user with a view by hiding some attributes.
Selection                                                              Selection - continued

Chooses tuples that satisfy some condition.                            Conditions are built up from:
Selection σC (R) takes a table R and extracts those rows from               Comparisons on attributes: R.A = R.A , R.A = R.A
it that satisfy the condition C .                                           Comparisons on values. E.g., Height > 1000, MName =
For example,                                                                "Ben Nevis".
σHeight > 1050 (Munros) =                                                   Predicates constructed from these using ∨ (or), ∧ (and), ¬
(not).
E.g. (Lat > 57 ∧ Height > 1000) ∨ (Height=Lat) .
MId   MName             Lat       Long    Height    Rating
3     Schiehallion      56.667    4.098   1083      2.5
4     Ben Nevis         56.780    5.002   1343      1.5
A selection provides the user with a view of data by hiding tuples
What can go into a condition C ?                                   that do not satisfy the condition the user wants.

Combining selection and projection                                     Combining selection and projection

Find all names and age of climbers of age > 30.
Relational algebra query                                               Are Q1 and Q2 the same?

Q1 = πHName,Age (σAge>30 (Hikers))                        They are semantically, as they produce the same result.
But they diﬀer in terms of eﬃciency:
Q1 scans Hikers, selects some tuples, and the only scans
An equivalent relational algebra query                                      selected tuples.
Q2 scans Hikers, projects out two attributes and then scans
Q2 = σAge>30 (πHName,Age (Hikers))                             the result again.

Q1 is likely to be more eﬃcient than Q2 .
Procedural languages can be optimized....
The same declarative query can be translated into more than one
procedural query.
Set operations – union                                                                                    Set operations – set diﬀerence

If two tables have the same structure, we can perform set
operations.
Same structure means union-compatible:
Same number of ﬁelds; and                                                                            We can also take the diﬀerence of two union-compatible
Corresponding ﬁelds (taken from left to right) have the same
domains.
tables:

Example:                                                                                                  Hikers − Climbers =       HId   HName      Skill     Age
123   Edmund     EXP       80
Hikers =   HId
123
HName
Edmund
Skill
EXP
Age
80
Climbers =   HId
214
HName
Arnold
Expertise
BEG
Age
25                                     313   Bridget    EXP       33
214
313
Arnold
Bridget
BEG
EXP
25
33
898     Jane     MED         39                                     212   James      MED       27
212   James      MED           27

Again, output schema is that of the ﬁrst relation.
Hikers ∪ Climbers =      HId         HName     Skill      Age
123         Edmund    EXP        80
214         Arnold    BEG        25
313         Bridget   EXP        33
212         James     MED        27
898         Jane      MED        39

Output schema is that of the ﬁrst relation (Hikers in the
Example).

Set operations – intersection                                                                             Cross product (Cartesian product)

The basic operation is the Cartesian product, R × S, which
It turns out we can implement intersection in terms of other                                              concatenates every tuple in R with every tuple in S.
operations:                                                                                               Example:

R ∩ S = R − (R − S)
A    B    C    D
a1   b1   c1   d1
C    D
Although it is mathematically nice to have fewer operators,                                                   A B                               a1   b1   c2   d2
c1   d1
this may not be an eﬃcient way to implement intersection.                                                     a1 b1      ×                =     a1   b1   c3   d3
c2   d2
a2 b2                             a2   b2   c1   d1
Intersection is also a special case of a join, which we’ll shortly                                                             c3   d3
a2   b2   c2   d2
discuss.                                                                                                                                        a2   b2   c3   d3
Cartesian product – continued                                                Cartesian product – continued

What happens when we form a product of two tables with
columns with the same name?
If R1 has n tuples and R2 has m tuples then R1 × R2 has
Recall the schemas: Hikers(HId, HName, Skill, Age)
n × m tuples.
and Climbs(HId, MId, Date,Time). What is the schema
of Hikers × Climbs?                                                           This is an expensive operation: if R1 and R2 have both 1 000
Various possibilities including:                                              tuples (small relation) then R1 × R2 has 1 000 000 tuples
Forget the conﬂicting name (as in R&G) ((Hid),                           (large relation).
HName,Skill, Age, (HId), MId, Date, Time). Allow
positional references (by number) to columns.                            Query processors try to avoid building products - instead they
Label the conﬂicting colums with 1,2... (HId.1,                          attempt to build only subsets which contain relevant
HName,Skill, Age, HId.2, MId, Date, Time).                               information.
Neither of these is satisfactory. The product operation is no
longer commutative (a property that is useful in optimization.)

Renaming                                                                     Natural join

To avoid confusion about attribute names, one can use the                     For obvious reasons of eﬃciency we rarely use unconstrained
renaming operator ρ:                                                          cross products in practice.
A natural join ( ) produces the set of all merges of tuples
ρ(C (1 → sid1, 5 → sid2), Hikers × Climbs)
that agree on their commonly named ﬁelds.
This operator                                                                 Example:
names result relation C ; and                                                       HId   MId   Date       Time          HId   HName     Skill     Age
123   1     10/10/88   5
explicitly names ﬁelds on positions 1 and 5 into sid1 and sid2.                     123   3     11/08/87   2.5           123
214
Edmund
Arnold
EXP
BEG
80
25    =
313   1     12/08/89   4             313   Bridget   EXP       33
In general,                                                                              214
313
2
2
08/07/92
06/07/94
7
5             212   James     MED       27

HId   MId   Date        Time     HName     Skill   Age
123   1     10/10/88    5        Edmund    EXP     80
ρ(R(oldname → newname, . . . , position → newname, E ),                                      123
313
3
1
11/08/87
12/08/89
2.5
4
Edmund
Bridget
EXP
EXP
80
33
214   2     08/07/92    7        Arnold    BEG     25
313   2     06/07/94    5        Bridget   EXP     33
Where E is a relational algebra expression.
Natural Join – cont.                                                    Conditional Join

Natural join has interesting relationships with other
operations. What is R S when
R=S                                                                 Extension of natural join in which a join condition is speciﬁed:
R and S have no column names in common                              R   C   S for σC (R   C   S)
R and S have all column names in common, i.e., they are             Special case in which join condition consists of equality
union compatible
conditions is called the equijoin.
Natural join has nice properties (assuming ﬁelds are identiﬁed          A natural join is an equijoin in which equalities are speciﬁed
by names):                                                              on all common ﬁelds.
Commutative: R        S =S   R
Associative: R   (S     T ) = (R   S)   T
Hence we can always simply write R1     R2   ···    Rk .

Interaction of the relational algebra operators                         Division

Suppose we have two tables with schemas R(A, B) and S(B).
R/S is deﬁned to be the set of A values in R which are paired
πA (R ∪ S) = πA (R) ∪ πA (S)                                            (in R) with all B values in S.
σC (R ∪ S) = σC (R) ∪ σC (S)                                            That is the set of all x for which πB (S) ⊆ πB (σA=x (R)).

(R ∪ S)    T =R      T ∪S        T
A/B = πA R − πA (πA (R)     πB (S) − R)
T    (R ∪ S) = T      R ∪T       S.

The general deﬁnition of division extends this idea to more
than one attribute.
Examples                                                               Examples – cont
The highest Munro(s)
This is more tricky. We ﬁrst ﬁnd the peaks (their MIds) that
The names of people who have climbed The Saddle.                       are lower than some other peak. LowerIds =

πMId (σHeight<Height’ (Munros     πHeight’ (
πHName (σMName="The Saddle" (Munros       Hikers    Climbs))
ρ(Height → Height’, Munros))))
(we could have used × instead of here)
Note the optimization to:                                              Now we ﬁnd the MIds of peaks that are not in this set (they
must be the peaks with maximum height)
πHName (σMName="The Saddle" (Munros)      Hikers     Climbs)
MaxIds = πMId (Munros) − LowerIds

In what order would you perform the joins?
Finally we get the names:

πMName (MaxIds     Munros)

Examples – cont                                                        What we cannot compute with relational algebra
The names of hikers who have climbed all Munros
We start by ﬁnding the set of HId,MId pairs for which the              Aggregate operations. E.g. “The number of hikers who have
hiker has not climbed that peak.                                       climbed Schiehallion” or “The average age of hikers”. These
We do this by subtracting part of the Climbs table from the            are possible in SQL which has numerous extensions to the
set of all HId,MId pairs. NotClimbed=                                  relational algebra.
Recursive queries. Given a table Parent(Parent, Child)
πHId (Hikers)    πMId (Munros) − πHId,MId (Climbs)                compute the Ancestor table. This appears to call for an
arbitrary number of joins.
(we could have used × instead of here)                                 Non-relational data. For example, lists, arrays, multisets
The HIds in this table identify the hikers who have not climed         (bags); or relations that are nested. These are ruled out by
some peak. By subtraction we get the HIds of hikers who                the relational data model, but they are important and are the
have climbed all peaks:                                                province of object-oriented databases and
“complex-object”/XML query languages.
ClimbedAll = πHId (Hikers) − πHId (NotClimbed)
Of course, we can always compute such things if we can talk
A join gets us the desired information:                                to a database from a full-blown (Turing complete)
programming language.
πHName (Hikers     ClimbedAll)
Relational calculus                                                  Relational algebra: Summary

Relational algebra: the 6 basic operations.

Declarative way of writing queries;                                  Using labels vs. positions.
Query rewriting for optimization.
Ignorant of how things are computed;
Practice with relational algebra.
Equivalent to relational algebra: Every query that can be
expressed in the relational algebra can be expressed in the          Limitations of relational algebra.
calculus, and vice versa.
R& G