Design Theory of Relational Databases

Announcement (reminder): The 2nd class test questions will be given out (and will appear on the web) before Thursday this week. The solution will be due Monday 14th, 5pm. [There will be an additional bonus question too, for extra credit. Solving the bonus will require some self-study.]

This lecture:
– Functional Dependencies
Next lectures:
– Normalisation
– Various properties of decompositions

Link from previous lecture(s)
We looked at conceptual design via ER diagrams and their translation into actual tables. The result of this is a very good starting point! :-) But the resulting tables will most likely not yet be in the optimal shape to represent real-world data. The remaining problem: redundancies.
• These can lead to inefficiency and/or potential errors.
Technique to solve the problem of redundancies:
- check each table for Functional Dependencies [we learn how to do this today]
- once these are found, there is a fairly mechanical procedure (called normalisation) to optimise the tables [in the next lecture]

Functional Dependencies (FDs)
Definition: Let S=(A,B,C,D,...) be the relational schema of a table T, let U be a sub-collection of the attributes of S, and let X be a single attribute of S. We say that U → X holds in table T if the following is true for all pairs of tuples t1 and t2 of T: if t1 and t2 have the same value for every attribute in U, then they also have the same value for the attribute X.
U → X means: 'the attribute set U determines the attribute X', or 'the attribute X depends on the attribute set U'.

Example
From the 'lecturing' table, FDs are quite visible. Can you spot them?

  cid  | sid | year | numbers
-------+-----+------+---------
  6995 |  22 | 2002 |      87
  6995 |  41 | 2002 |      87
  6995 |  57 | 2002 |      87
  6995 |  76 | 2002 |      87
  6995 |  22 | 2003 |      80
  6995 |  89 | 2003 |      80
  6995 |  57 | 2003 |      80
  6995 |  76 | 2003 |      80
  8762 |  38 | 2002 |     251
  8762 |  81 | 2002 |     251
  8762 |  85 | 2002 |     251
 11580 |  14 | 2002 |     182
 11580 |  83 | 2002 |     182
 11580 |  86 | 2002 |     182
 (...)

Dependencies to spot:
cid,year → numbers: this one holds.
cid → numbers ??? No: course 6995 has 87 students in 2002 but 80 in 2003.
year → numbers ??? No.
sid → numbers ??? No.

FDs are constraints
From the real-world meaning of the 'lecturing' table, the FD that we found (i.e. cid,year → numbers) is not accidental. Any collection of rows that will ever make up the 'lecturing' table will necessarily satisfy this dependency. (Why? A given course in a given year will always have a fixed number of students, and this number does not depend on who teaches the course.) So, if the lecturing table is ever modified such that the constraint is not satisfied, then the integrity of the table is violated. Setting up this constraint at the creation of the database is a bit tricky. We will see that a better solution is to decompose such tables into smaller ones.

Another example of FD: funny but true
Primary keys give FDs: every attribute of the table is functionally dependent on the primary key! How so?
• By definition, if two rows were to have the same value for the primary key, then they would have the same value for all other attributes.
– But no two distinct rows can ever have the same value for the primary key ==> so the FD is automatically (vacuously) satisfied.

Armstrong's axioms
Once some FDs have been found to hold, other FDs can be inferred from them by the following rules (called Armstrong's axioms):
1. Trivial dependencies (also called Reflexivity): if X is a subset of U, then U → X.
2. Augmentation: if U → X, then UZ → XZ.
3. Transitivity: if U → X and X → Y, then U → Y.
It can be shown that these rules are sufficient to derive all FDs that hold as a consequence of postulating some.

Checking if an FD holds
Deriving all FDs via Armstrong's axioms requires a time-consuming algorithm (it runs in exponential time). There is a simpler method for the following special case:
Problem: Given a set F of FDs, check whether a particular FD, say U → X, follows from it.
Solution: we successively augment U with attributes that functionally depend on U. (We call the fully augmented version of U the closure of U.) The new U will contain all attributes that depend on the original U. So if X is among them, then U → X follows; otherwise it does not.

The problem with unnormalised tables
Suppose: S(A,B,C,D,...) with B → C // whenever two rows agree on B, they agree on C too
What's wrong with this?
Case 1: no two records will ever agree on B ==> nothing is wrong. [e.g. when B is the primary key]
Case 2: several problems:
* aggregation anomaly: if C is numeric
* insertion anomaly
* update anomaly
* deletion anomaly

Suppose: S(A,B,C,D,...) with B → C.
* aggregation anomaly: if C is numeric (e.g. numbers in lecturing), then GROUP BY on it can give wrong results. → Try to compute the total number of course registrations for a given year.
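To make the FD definition and the aggregation anomaly concrete, here is a small Python sketch (illustrative only: the helper `fd_holds` and the explicit row list are not part of the lecture material). It checks FDs directly on the 'lecturing' rows shown earlier, then shows how a naive per-year sum of numbers over-counts:

```python
# The 'lecturing' rows from the example, as (cid, sid, year, numbers) tuples.
rows = [
    (6995, 22, 2002, 87), (6995, 41, 2002, 87),
    (6995, 57, 2002, 87), (6995, 76, 2002, 87),
    (6995, 22, 2003, 80), (6995, 89, 2003, 80),
    (6995, 57, 2003, 80), (6995, 76, 2003, 80),
    (8762, 38, 2002, 251), (8762, 81, 2002, 251),
    (8762, 85, 2002, 251),
    (11580, 14, 2002, 182), (11580, 83, 2002, 182),
    (11580, 86, 2002, 182),
]
COLS = {"cid": 0, "sid": 1, "year": 2, "numbers": 3}

def fd_holds(rows, U, X):
    """The FD definition, checked directly: whenever two tuples agree on
    every attribute in U, they must also agree on X."""
    seen = {}
    for t in rows:
        key = tuple(t[COLS[a]] for a in U)
        x = t[COLS[X]]
        if key in seen and seen[key] != x:
            return False
        seen[key] = x
    return True

print(fd_holds(rows, ["cid", "year"], "numbers"))  # True
print(fd_holds(rows, ["cid"], "numbers"))          # False: 87 vs 80 for 6995

# Aggregation anomaly: naively summing 'numbers' per year counts each
# course once per lecturer, not once per course instance.
naive_2002 = sum(n for (c, s, y, n) in rows if y == 2002)
# Correct total: first deduplicate to one row per (cid, year).
per_instance = {(c, y): n for (c, s, y, n) in rows}
correct_2002 = sum(n for (c, y), n in per_instance.items() if y == 2002)
print(naive_2002, correct_2002)  # 1647 520
```

The naive sum counts each course once per lecturer and gives 1647 for 2002, while the deduplicated sum gives the intended total of 520 (= 87 + 251 + 182): exactly the GROUP BY pitfall described above.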
Suppose: S(A,B,C,D,...) with B → C.
* insertion anomaly: whenever we enter a new record, we need to check that it does not violate the FD. That is, check whether another record exists with the same B value, and if so, check that the new record agrees with it on the C value.
* update anomaly: whenever we need to update the C info in the table, we must replace all occurrences of that particular value.
* deletion anomaly: deleting the last record with a specific B value causes the corresponding C value to be lost too.

Example: S = items held in a shop, with supplier_name → suppliers_phone_number.
* insertion anomaly: when we enter a new record that has a new phone number for the supplier (all old items still hold the old number!)
* update anomaly: when we need to change the supplier's phone number
* deletion anomaly: when we delete the last item that was ordered from a certain supplier, the info about their contact number is lost too

Normalisation
Solution: split the table S=(A,B,C,D,...) which has U → X. Then:
* V := all other attributes outside U that depend on U [found using the closure algorithm]
* split into: UV and notV (note that notV still contains U, which serves as the join key). In the previous example: {supplier_name, suppl_phone, and all other details of the supplier} and {the rest of the attributes, plus supplier_name}
* Continue with the new tables. [NB these may still have FDs that are not a consequence of the U → X we just dealt with!]

Properties of decompositions
- Losslessness property
- Redundancy reducing property
- Dependency preserving property
A case study.

Lossless decompositions = no loss of information
Proposition. The decomposition made in the process of normalisation is always lossless.
Why?
- It is easy to see that we don't lose any rows; but in general, joining the component tables might give us more rows than we had originally. To see why this cannot happen for the decompositions made during normalisation, use:

Lemma. Suppose we have a schema S=(A,B,C,D,...) grouped into 3 non-empty subsets V, U, W. The decomposition of S into S1=VU and S2=UW is lossless provided that at least one of U → V or U → W holds on (S1 join S2).

The proposition follows from the lemma: we must have had U → V or U → W on S (that was exactly the reason to decompose), and such an FD carries over to (S1 join S2), because each joined tuple takes its V-values from a row of S1, and S1 inherits the FD from S.

Proof of the lemma. Consider (S1 join S2). If (x(V), x(U)) in S1 joins with (y(U), y(W)) in S2, then x(U)=y(U). There is a problem only if some joined tuple (x(V), x(U), y(W)) is not in S. But S1 and S2 are projections of S, so the row (x(V), x(U)) of S1 must come from some row of S; that row supplies a match (x(U), x(W)) in S2 with x(W) neq y(W) (otherwise (x(V), x(U), y(W)) would be in S). Likewise, the row (y(U), y(W)) of S2 must come from a row of S, which supplies a match (y(V), y(U)) in S1 with y(V) neq x(V). But then (S1 join S2) contains not only (x(V), x(U), y(W)), but 3 more tuples: (x(V), x(U), x(W)), (y(V), y(U), y(W)), and (y(V), y(U), x(W)). So the same U-value x(U)=y(U) appears with two different V-values and two different W-values ==> neither U → V nor U → W holds on (S1 join S2). This is exactly the contrapositive of the lemma.

Redundancy reducing property
Redundancy = repeated information. We want to remove repeated info that arises from a non-trivial FD.
Definition: The decomposition of S into S1 and S2 is redundancy reducing if either S1 or S2 is free of any key for the original S, and they have fewer rows than S.
NB. Since we only consider non-trivial dependencies: if it were the case that both U → V and U → W hold on S, then U would be a key for S, and S wouldn't need fixing.
Hence, the redundancy reducing property is also guaranteed for decompositions that arise in a correctly carried out normalisation process.

Dependency preserving property
Remember: FDs = constraints that we observe during analysis of the real-world situation. We want to guarantee and check FDs, and we want to do this via the keys (of the decomposed schemas). Usually the decomposition done in the normalisation process achieves this, except in one problem case [which fortunately arises rarely!]:
supplier → supp_address
contact_person → supp_address
Here, if we take {contact_person, supp_address} into a separate table, the first FD is lost. Likewise, if we take {supplier, supp_address} into a separate table, the second FD is lost. There is no satisfactory solution in this case.

Case study: Achim's example tables
So how do we deal with the FD cid,year → numbers? Applying what we learned, we decompose the lecturing(cid,sid,year,numbers) schema into:
course_instances(cid,year,numbers)
taught_by(cid,year,sid)
Stare at this long enough to work out what ER diagram corresponds to it...
...finally, it is good practice to revise the ER diagram, so that whenever the db needs extending you start from the best ER and do not have to normalise again for the same FDs.
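The closure computation and the splitting step used in this case study can be sketched in Python as follows (an illustrative sketch; the function names `closure` and `decompose` are ours, not from the lecture):

```python
def closure(U, fds):
    """Closure of attribute set U under a set of FDs.
    fds: iterable of (lhs_frozenset, rhs_attribute) pairs."""
    result = set(U)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left-hand side is already determined,
            # the right-hand side is determined too.
            if lhs <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return result

def decompose(attrs, U, fds):
    """Split a schema on the FDs with left-hand side U: one table gets U
    together with everything that depends on it; the other keeps the
    remaining attributes plus U (which serves as the join key)."""
    uv = closure(U, fds) & set(attrs)
    rest = (set(attrs) - uv) | set(U)
    return uv, rest

# The case-study schema lecturing(cid, sid, year, numbers)
# with the FD cid,year -> numbers:
fds = [(frozenset({"cid", "year"}), "numbers")]
course_instances, taught_by = decompose(
    {"cid", "sid", "year", "numbers"}, {"cid", "year"}, fds)
print(sorted(course_instances))  # ['cid', 'numbers', 'year']
print(sorted(taught_by))         # ['cid', 'sid', 'year']
```

Running this reproduces the decomposition above: course_instances(cid,year,numbers) and taught_by(cid,year,sid).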