dbms_design

Document Sample

Shared by: Aashish Sharma
Categories
Tags
Stats
views:
149
posted:
8/29/2009
language:
English
pages:
255
CS520: Introduction to Database Design & Engineering

Fall 2002



(C) Frieder, Grossman, & Goharian 1996, 2002



1



Database Course Differentiation

Credit not given for both classes since they are similar in content



CS 425 Prerequisite: CS401 Emphasis:

» Database Design & Use » Application Development



CS 520 Prerequisite: CS402 Emphasis:

» Database Design & Eng. » DBMS Development



Project:

» Application Development



Project:

» DBMS Development



(C) Frieder, Grossman, & Goharian 1996, 2002



2



Contents

Introduction SQL Database Design Query Optimization Recovery and Concurrency Control Integration of Structured Data and Text Distributed Database Systems 1 - 44 45 - 122 123 - 188 189 - 211 212 - 233 234 - 241 242 - end



(C) Frieder, Grossman, & Goharian 1996, 2002



3



Background

Initially, installations wrote separate applications with large amounts of repeated code to implement concurrency control, security, and recovery. This was a lot of wasted effort that was also very much error prone



(C) Frieder, Grossman, & Goharian 1996, 2002



4



Example of Common Functionality

Payroll User Interface Business Logic Concurrency Control Security Recovery

(C) Frieder, Grossman, & Goharian 1996, 2002



Inventory User Interface Business Logic Concurrency Control Security Recovery



Marketing User Interface Business Logic Concurrency Control Security Recovery

5



Database System Example

Payroll User Interface Business Logic Inventory User Interface Business Logic Database System Concurrency Control Security Recovery

(C) Frieder, Grossman, & Goharian 1996, 2002



Marketing User Interface Business Logic



6



Definitions

DBMS - a Database Management System is a set of routines that is capable of providing the following basic functions: » Add » Delete » Update » Retrieve



(C) Frieder, Grossman, & Goharian 1996, 2002



7



Primitive Functionality

Add (X) = Find (X); If not found then insert (X) else return (error_code) Delete (X) = Find (X); If found then remove (X) else return (error_code)

(C) Frieder, Grossman, & Goharian 1996, 2002



8



Primitive Functionality (cont)

Update (X, Y) = Delete (X); If not error_code then Add (Y)



Retrieve (X) = Delete (X); If not error_code then Add (X)



(C) Frieder, Grossman, & Goharian 1996, 2002



9



Database Models

Network » Any links supporting quick access Hierarchical » Links but no cycles (hierarchy) Relational » Data Independence Object – Oriented » Entity Abstraction

(C) Frieder, Grossman, & Goharian 1996, 2002



10



Inverted Index

I N D E X



Posting List

(C) Frieder, Grossman, & Goharian 1996, 2002



11



Inverted List Example

Hank Record 1 Record 3 Record 5



Query: find all occurrences of the name (value of attribute) is ‘Hank’ in the database: Hash to the value Hank in the “index” Scan the posting list for all occurrences



(C) Frieder, Grossman, & Goharian 1996, 2002



12



Inverted Index

Associates a posting list with each attribute value

abacus: abatement: … zoology: 83: 2002: (F3AB, 873A, FF32) (6A15) (D381, DA32) (F623, B001, 879D, 76AA) (AAAA, BBBB, CCCC)



Inverted because it lists for each attribute the location on disk where the value is stored.

(C) Frieder, Grossman, & Goharian 1996, 2002



13



Building an Inverted Index

For each relation r in the collection

» For each attribute t in relation r

– Find attribute t in the item dictionary – If term t exists, add a new disk location to its posting list – Otherwise,

Add attribute value t to the item dictionary Add a node to the posting list



(C) Frieder, Grossman, & Goharian 1996, 2002



14



File Organizations

Unsorted files (Heap Files)

– Good storage efficiency, fast insertion, deletion. – Slow searches



Sorted Files

– Good storage efficiency and search of range – Slow insertion and deletion – not practical (files never sorted) => B+ tree data structure



Hashed-Based Files

– – Not efficient storage Fast insertion, deletion, and equality searches

15



(C) Frieder, Grossman, & Goharian 1996, 2002



Sample Database

Hank graduated: » Michigan » IIT » MIT Hank worked: » IBM » Intel » Bellcore » Harris

(C) Frieder, Grossman, & Goharian 1996, 2002



16



Network Model

MIT IIT

Michigan



Graduated Hank Worked

IBM Intel Bellcore Harris



(C) Frieder, Grossman, & Goharian 1996, 2002



17



Hierarchical Model

Hank Graduated Worked

IBM Bellcore Harris Intel



MIT



IIT



Michigan



(C) Frieder, Grossman, & Goharian 1996, 2002



18



Relational Model

Person

Hank Hank Hank



Graduated

IIT MIT Michigan



Person

Hank Hank Hank Hank



Worked

IBM Intel Harris Bellcore



Key: (Person, Graduated)



Key: (Person, Worked)



(C) Frieder, Grossman, & Goharian 1996, 2002



19



Object Oriented Model

Type EMPLOYEE begin name : char (20); graduated : SETOF (schools); worked : SETOF (companies); end;



INSERT ( 1,Hank, {IIT, Michigan, MIT}, {IBM, Intel, Bellcore, Harris} )



(C) Frieder, Grossman, & Goharian 1996, 2002



20



Relational Model

The initial paper on the relational model was written in 1969 by Codd. System R was a relational prototype implemented in the mid 70’s by IBM. Ingres was a relational prototype implemented at UC Berkeley in the mid 70’s. Finally, commercial offerings of relational systems started with Oracle in 1979 and was quickly followed by SQL/DS and DB2 by IBM in the mid 80’s.



(C) Frieder, Grossman, & Goharian 1996, 2002



21



Data Independence

The relational model allows users to simply specify what data they require, not how to get them. This is referred to as data independence and is a key contribution of the relational model. Older models are referred to as navigational as users must navigate through the data and follow pointers from one datum to another.



(C) Frieder, Grossman, & Goharian 1996, 2002



22



Structure of RDBMS

User Query

Query Optimizer Operators File Manager Buffer Manager Disk Space Manager DB

(C) Frieder, Grossman, & Goharian 1996, 2002



Concurrency Control & Crash Recovery: Transaction Manager Lock Manager Recovery Manager



23



Relational Model

Relational algebra is used to specify the operations allowed within the relational model. Relational algebra is theoretically based on set theory. A relation can be illustrated as a table of rows and columns. The table is referred to as a relation, rows are referred to as tuples, and columns are attributes. Relational operators are closed. A relational operation applied to a relation always results in a relation.



(C) Frieder, Grossman, & Goharian 1996, 2002



24



EMPLOYEE(emp#, name, department, salary)



Example Relation:



Emp# 002 004 007



Name Jones Smith Bond Table Row Column



Department Marketing Sales Diplomacy = = = Relation Tuple Attribute



Salary 300.00 150.00 999.00



(C) Frieder, Grossman, & Goharian 1996, 2002



25



Formal Definition

IF R is the set of attributes (columns), commonly referred to as schema, then r(R) is a mapping of a set of tuples (rows ), commonly referred to as instance. Since each attribute is restricted to a limited domain, a relation is actually a subset of: dom(A1) x dom(A2) x dom(A3) where dom(X) indicates the domain or set of valid values for attribute X.

(C) Frieder, Grossman, & Goharian 1996, 2002



26



Characteristics of Relations

No inherent ordering of tuples. Conceptually, no inherent ordering of attributes. In practice, attributes are ordered based on the initial schema definition. All attribute values within a tuple should be atomic (1NF).



(C) Frieder, Grossman, & Goharian 1996, 2002



27



Key Attributes

Since a relation is merely just a set of tuples, NO DUPLICATE ELEMENTS are theoretically possible. (Unfortunately, some implementations violate this uniqueness definition) A column or set of columns must uniquely identify a tuple. Superkey - any combination of attributes that uniquely identify a tuple.



(C) Frieder, Grossman, & Goharian 1996, 2002



28



Keys (continued)

Key - a superkey from R such that the removal of any attribute results in a set of attributes that is NOT a superkey. Hence, a key is a: MINIMAL SUPERKEY Ex: (emp#, name, department, salary) is a superkey but not a key, because name, department, or salary could be removed and we would still have the superkey, emp#.



(C) Frieder, Grossman, & Goharian 1996, 2002



29



Candidate Keys

It is possible that more than one set of attributes qualify as a key. These are called candidate keys. Typically, one is chosen and referred to as the primary key. This key is underlined in the description of the relation. EMPLOYEE(emp#, name, department, salary)



(C) Frieder, Grossman, & Goharian 1996, 2002



30



Student-Grade Database

Student #

1 1 1 2 2 3 4 4 4 5 6 6



GPA

4.0 4.0 4.0 3.8 3.0 2.1 3.5 3.4 3.9 3.6 4.0 3.1



Degree

Bachelors Masters Doctorate Bachelors Masters Bachelors Bachelors Masters Doctorate Bachelors Associates Bachelors



Key: (Student#, Degree) – Assumes only one degree type per person

(C) Frieder, Grossman, & Goharian 1996, 2002



31



SELECT

SELECT - extract tuples from a relation Syntax: σ (Relation Name)



σGPA=4.0 (SG) - obtains all tuples from the Student-Grade (SG) relation where the GPA is 4.0. Student #

1 1 1 6

(C) Frieder, Grossman, & Goharian 1996, 2002



GPA

4.0 4.0 4.0 4.0



Degree

Bachelors Masters Doctorate Associates

32



Project

Project retrieves columns from a table Syntax: π (Relation Name)



πstudent#, degree(SG) Retrieves student number and degree from the StudentGrade relation.



(C) Frieder, Grossman, & Goharian 1996, 2002



33



Single – Scan Project



πstudent#, degree(SG)

Student #

1 1 1 2 2 3 4 4 4 5 6 6



Degree

Bachelors Masters Doctorate Bachelors Masters Bachelors Bachelors Masters Doctorate Bachelors Associates Bachelors

34



(C) Frieder, Grossman, & Goharian 1996, 2002



Multiple – Scan Project



πstudent#, GPA(SG)

Student #

1 1 1 2 2 3 4 4 4 5 6 6

(C) Frieder, Grossman, & Goharian 1996, 2002



GPA

4.0 4.0 4.0 3.8 3.0 2.1 3.5 3.4 3.9 3.6 4.0 3.1

35



Undergrad & Graduate Student Grade Database

Student # Graduate (GSG)

Key: (Student#, Degree) 1 1 2 4 4



GPA

4.0 4.0 3.0 3.4 3.9



Degree

Masters Doctorate Masters Masters Doctorate



Student #

1 2 3 4 5 6 6



GPA

4.0 3.8 2.1 3.5 3.6 4.0 3.1



Degree

Bachelors Bachelors Bachelors Bachelors Bachelors Associates Bachelors

36



Undergrad (USG)

(C) Frieder, Grossman, & Goharian 1996, 2002



Set Theoretic Operations: USG UNION GSG

List all information on all students



Student #

1 1 1 2 2 3 4 4 4 5 6 6

(C) Frieder, Grossman, & Goharian 1996, 2002



GPA

4.0 4.0 4.0 3.8 3.0 2.1 3.5 3.4 3.9 3.6 4.0 3.1



Degree

Bachelors Masters Doctorate Bachelors Masters Bachelors Bachelors Masters Doctorate Bachelors Associates Bachelors

37



Set Theoretic Operations:

πstudent#(USG) INTERSECTION πstudent#(GSG)

List all students who were both graduate and undergraduate students



Grad & Undergrad (GU)



Student #

1 2 4



(C) Frieder, Grossman, & Goharian 1996, 2002



38



Set Theoretic Operations:

πstudent#(USG) DIFFERENCE πstudent#(GSG)

List all students who were undergraduate but not graduate students



Only Undergrads (OU)



Student #

3 5 6



(C) Frieder, Grossman, & Goharian 1996, 2002



39



Cartesian Product

Relational Algebra includes: GU x OU For the sets:



1 GU 4



2



3 OU



5 6



GU x OU =













40



(C) Frieder, Grossman, & Goharian 1996, 2002



θ - Join

θ - Join is a Cartesian product with the addition of a



condition that determines which tuples are selected.

Can be logically viewed as: » Step 1: Cartesian Product » Step 2: Select from result of Step 1



(C) Frieder, Grossman, & Goharian 1996, 2002



41



Additional Join Types

In a θ - Join, the joining attributes are explicitly specified. An Equi - Join is the most common θ - Join with an equality as the

condition.



A Natural - Join is an Equi-Join where the joining attributes are all those attributes with a common name (implicitly specified)



(C) Frieder, Grossman, & Goharian 1996, 2002



42



πstudent#,degree(GSG) Equi-Join πstudent#,GPA(USG)

Student #

1 1 2 4 4



GPA

4.0 4.0 3.8 3.5 3.5



Degree

Masters Doctorate Masters Masters Doctorate



(C) Frieder, Grossman, & Goharian 1996, 2002



43



Relational Algebra Summary

Relations are viewed as sets. Relational operations are closed. Any operation on one or more relations yields a relation. No inherent ordering. Unary Operators: SELECT, PROJECT Binary Operators: UNION, INTERSECTION, SET DIFFERENCE, CARTESIAN PRODUCT, JOIN Single Scan: SELECT, PROJECT including key Multiple Scan: All others

(C) Frieder, Grossman, & Goharian 1996, 2002



44



Structured Query Language (SQL)



(C) Frieder, Grossman, & Goharian 1996, 2002



45



Structured Query Language

First relational query language, SQUARE was implemented in System R (1975). SQUARE was followed by SEQUEL. First commercial implementation, Oracle (1979), followed closely by SQL/DS in 1982. Major SQL DBMS vendors: Oracle, IBM (DB2), Sybase, Informix, Computer Associates, and Microsoft.



(C) Frieder, Grossman, & Goharian 1996, 2002



46



SQL Overview

Data Manipulation Language (DML)

» » » » SELECT INSERT UPDATE DELETE



Data Definition Language (DDL)

» CREATE TABLE » CREATE INDEX » GRANT



(C) Frieder, Grossman, & Goharian 1996, 2002



47



SELECT Overview

Single Relation » SELECT » Boolean Operators » IN » BETWEEN » Aggregate Operators » Calculated Attributes » Sorting » Wildcard Searches » GROUP BY » HAVING » NULLS » Varchar Multiple Relations



(C) Frieder, Grossman, & Goharian 1996, 2002



48



Syntax:

SELECT FROM Ex: SELECT p#,name,qty FROM PARTS Ex: SELECT * FROM PARTS

p# 1 2 3



PARTS

name Nut Bolt Wheel qty 42 25 15



(C) Frieder, Grossman, & Goharian 1996, 2002



49



Select with WHERE

SELECT FROM WHERE is of the form: [=,,,=] EX: SELECT * p# FROM PARTS 3 WHERE p# = 3

name Wheel qty 15



(C) Frieder, Grossman, & Goharian 1996, 2002



50



Use of Boolean Operators

Conditions can be separated by Boolean operators: AND, OR, NOT EX: “List information about parts 1 and 2” SELECT * p# name qty FROM PARTS Nut 42 WHERE p# = 1 OR p# = 2 1 2 Bolt 25 EX: “LIST information about all wheels that contain more than 20 in stock” SELECT * FROM PARTS WHERE name = ‘Wheel’ and qty > 20



(C) Frieder, Grossman, & Goharian 1996, 2002



51



Shortcut Number 1: IN

To find information for a list of values, the IN operator may be used: Ex: “List the name of all parts whose part # is 1,2, or 5” SELECT name FROM PARTS WHERE p# IN (1, 2, 5)

Name Nut Bolt



instead of: WHERE (p# = 1) OR (p# = 2) OR (p# = 5)

(C) Frieder, Grossman, & Goharian 1996, 2002



52



Shortcut Number 2: BETWEEN

To find values within a range, it is often easier to use BETWEEN. Ex: “Find all parts where the quantity on hand is greater than or equal to twenty parts, but less than or equal to fifty.” p# name qty SELECT * 1 Nut 42 FROM PARTS 2 Bolt 25 WHERE qty BETWEEN 20 and 50 instead of: WHERE qty >= 20 AND qty [DESC] can be added to SELECT to obtain sorted output. Ex: List all part names in ascending order: SELECT p#, name, qty FROM PARTS ORDER BY name

p# name 2 Bolt 1 Nut 3 Wheel qty 25 42 15



For descending order change to: ORDER BY name DESC

(C) Frieder, Grossman, & Goharian 1996, 2002



56



Sorting Calculated Attributes

To refer to a computed attribute in the ORDER BY, use the position in the list of columns following SELECT. Ex: “List all part information in descending order of a projected 20 percent reduction in quantity” SELECT p#,name,(qty-(qty*.2)) ‘qty’ FROM PARTS ORDER BY 3 DESC

(C) Frieder, Grossman, & Goharian 1996, 2002



p# 1 2 3



name qty Nut 34 Bolt 20 Wheel 12

57



Wildcard Searches of Strings

The LIKE operator is used to search parts of a string. The following wildcard characters are used: % - zero or more characters _ - exactly one character Ex: List all parts whose name starts with a ‘W’

p# name SELECT * 3 Wheel FROM PARTS WHERE name LIKE ‘W%’

(C) Frieder, Grossman, & Goharian 1996, 2002



qty 15



58



More LIKE Examples

Ex: List all parts whose name starts with a ‘W’ and ends with an ‘L’ p# name Wheel WHERE name LIKE ‘W%L’ 3 Ex: List all parts whose name is three characters long and starts with a ‘N’ name WHERE name LIKE ‘N__’ p#

1 Nut qty 15



qty 42



(C) Frieder, Grossman, & Goharian 1996, 2002



59



Review

List all parts whose name starts with a ‘W’ or whose part number is either 2,4,8,11,12,13,14,15. Sort the list in descending order by quantity. SELECT * FROM PARTS WHERE _____ LIKE _______ OR _____ IN (_,_,_) OR ______ BETWEEN __ AND __ ORDER BY ____ DESC

(C) Frieder, Grossman, & Goharian 1996, 2002



60



Review (Answer)

List all parts whose name starts with a ‘W’ or whose part number is either 2,4,8,11,12,13,14,15. Sort the list in descending order by quantity. SELECT * FROM PARTS WHERE name LIKE ‘W%’ OR p# IN (2,4,8) OR p# BETWEEN 11 AND 15 ORDER BY qty DESC

(C) Frieder, Grossman, & Goharian 1996, 2002



61



More Review

Ex: List the total number of parts that would exist if quantity was increased by 25 percent.



Ex: List all parts that would have a quantity greater than 50, if the quantity was increased by 25 percent.



(C) Frieder, Grossman, & Goharian 1996, 2002



62



More Review (Answer)

Ex: List the total number of parts that would exist if the quantity was increased by 25 percent. SELECT SUM(qty + qty * .25) FROM PARTS Sum(qty + qty * .25)

103



Ex: List all parts that would have a quantity greater than 50, if the quantity was increased by 25 percent. SELECT * p# name 1 Nut FROM PARTS WHERE qty + qty * .25 > 50

(C) Frieder, Grossman, & Goharian 1996, 2002



qty 42



63



GROUP BY

It is often necessary to review data about groups of related tuples. Consider an employee relation that contains the “Department” attribute. Assume one employee may work in only a single department. DEPARTMENT partitions the EMP set into subsets:

Sales Marketing Service Finance

(C) Frieder, Grossman, & Goharian 1996, 2002



64



GROUP BY (continued)

EMPLOYEE (emp#, name, salary, department) emp# name salary department

1 2 3 4 5 6 7 8 Fred Mike Sam Martha Juanita Steve Tom Sue 200 300 400 350 500 800 200 900 Sales Sales Sales Marketing Marketing Finance Service Service

65



(C) Frieder, Grossman, & Goharian 1996, 2002



GROUP BY (continued)

For each department, list the average salary. SELECT department, AVG(salary) FROM EMPLOYEE GROUP BY department department AVG(salary) Sales 300 Marketing 425 Finance 800 Service 550

(C) Frieder, Grossman, & Goharian 1996, 2002



66



Group By (continued)

If a WHERE clause exists, it is executed as well. Ex: For each department, list the highest salary, but exclude all employees whose name starts with a ‘S’ SELECT department, MAX(salary) FROM EMPLOYEE WHERE name NOT LIKE ‘S%’ GROUP BY department

(C) Frieder, Grossman, & Goharian 1996, 2002



67



GROUP BY (continued)

department Sales Marketing Service max(salary) 300 500 200



(C) Frieder, Grossman, & Goharian 1996, 2002



68



GROUP BY (continued)

More refined groups are obtained by using multiple attributes in GROUP BY. Add the attribute “REGION” to the employee relation. Now the department partition may be partitioned into different regions.

North West



MARKETING

South



East



(C) Frieder, Grossman, & Goharian 1996, 2002



69



GROUP BY (continued)

EMPLOYEE (emp#, name, salary, department,rgn) emp# name salary department rgn

1 2 3 4 5 6 7 8 Fred Mike Sam Martha Juanita Steve Tom Sue 200 300 400 350 500 800 200 900 Sales Sales Sales Marketing Marketing Finance Service Service north north east west west south north south

70



(C) Frieder, Grossman, & Goharian 1996, 2002



GROUP BY (multiple partitions)

To specify more than one partition, simply add more attributes after the GROUP BY: Ex: Compute the average salary for each region within each department. SELECT department, region, AVG(salary) FROM EMPLOYEE GROUP BY department, region

(C) Frieder, Grossman, & Goharian 1996, 2002



71



GROUP BY (continued)

department Sales Sales Marketing Finance Service Service reg avg(salary) north 250 east 400 west 425 south 800 north 200 south 900



(C) Frieder, Grossman, & Goharian 1996, 2002



72



HAVING (restricts groups)

Syntax: HAVING (list of conditions) Aggregate functions used (SUM, MIN, MAX, COUNT). Ex: List the average salary for all departments that have more than two employees. SELECT department, AVG(salary) FROM EMPLOYEE department avg(salary) GROUP BY department Sales 300 HAVING COUNT(*) > 2

(C) Frieder, Grossman, & Goharian 1996, 2002



73



HAVING (continued)

Ex: List the minimum salary in departments sales, marketing and service as long as the department has an average salary greater than 400. SELECT department, MIN(salary) FROM EMPLOYEE WHERE department IN (‘sales’,’marketing’,’service’) GROUP BY department HAVING AVG(salary) > 400

department min(salary) Marketing 350

(C) Frieder, Grossman, & Goharian 1996, 2002



74



Multiple Entity Relationships

Multiple tables needed to store multi-entity relationships. One to many: One parent may have many children Many to one: Many people may attend a single meeting Many to Many: Students graduate from multiple colleges Each college graduates by many students

(C) Frieder, Grossman, & Goharian 1996, 2002



75



Single Relation Design

It is tempting to try and stuff all multi-valued relationships into a single relation: emp# name salary 1 Fred 200 2 Ethel 300 3 Mike 400 college1 Harvard IIT MIT college2

Unused



college3

Unused Unused



Michigan Stanford



IIT



(C) Frieder, Grossman, & Goharian 1996, 2002



76



Problems with Poor Design

Incompleteness » Unable to store more than three colleges per individual. Many to many relationships can have an infinite number of values. Query Complexity » Queries such as “list all colleges attended by Mike” become substantially more difficult. Wasted Storage » Many entries are not used.

(C) Frieder, Grossman, & Goharian 1996, 2002



77



Multiple Relations: 2 Relations for Many-Many

EMPLOYEE emp# name salary 1 Fred 200 2 3 Ethel Mike 300 400 3 3 3 MIT Stanford IIT emp# 1 2 2 COLLEGE name Harvard IIT Michigan



emp# in EMPLOYEE is a primary key emp# in COLLEGE is a foreign key emp#,name in COLLEGE is a primary key

(C) Frieder, Grossman, & Goharian 1996, 2002



78



Preserving Relationships

Joining the two relations restores the original relation assuming a key is part of the partitioning. Poor partitioning may result in additional spurious tuples being introduced.



(C) Frieder, Grossman, & Goharian 1996, 2002



79



Equi-Join

Explicit indication of the join attribute conditions. Typically involves joining on foreign key attributes. List all colleges attended by “Mike” SELECT b.name FROM EMPLOYEE a, COLLEGE b WHERE a.emp# = b.emp# AND a.name = ‘Mike’ name

MIT Stanford IIT

(C) Frieder, Grossman, & Goharian 1996, 2002



80



Problem with only using Two Relations for Many-Many

Suppose we need to maintain the college location as well. location must be replicated many times. EMPLOYEE emp# name salary 1 Fred 200 2 3 Ethel Mike 300 400 COLLEGE emp# name 1 Harvard 2 IIT 2 Michigan 3 MIT 3 Stanford 3 IIT location Boston Chicago Ann Arbor Boston Stanford Chicago



(C) Frieder, Grossman, & Goharian 1996, 2002



81



Use of 3 Relations

To avoid needless repetition, we create a separate table for each of the two entities involved in a many-many relationship, and then a third “linking” relation to contain data about the relationship between the entities. All 1-1 information about employees: EMPLOYEE(emp#, name, salary) All 1-1 information about colleges: COLLEGE(col#, name, location) All data pertaining to a single employee attending a single college: ATTENDS(emp#, col#, gpa)



(C) Frieder, Grossman, & Goharian 1996, 2002



82



Use of Three Relations (continued)

EMPLOYEE

emp# 1 2 3 name Fred Ethel Mike salary 200 300 400 col# 11 22 33 44 55



COLLEGE

name Harvard IIT Michigan MIT Stanford gpa 2.45 3.79 3.65 2.85 2.65 4.0

83



location Boston Chicago Ann Arbor Boston Stanford



ATTENDS

emp# 1 2 2 3 3 3 col# 11 22 33 44 55 22



(C) Frieder, Grossman, & Goharian 1996, 2002



Sample Query (3 Relations)

Ex: List the names of all colleges attended by “Mike.”

SELECT b.name FROM EMPLOYEE a, COLLEGE b, ATTENDS c WHERE a.emp# = c.emp# AND name b.col# = c.col# AND Michigan MIT a.name = ‘Mike’

Stanford



(C) Frieder, Grossman, & Goharian 1996, 2002



84



Sample Query (3 Relations)

Ex: List the names of all employees who attended Harvard.

SELECT a.name FROM EMPLOYEE a, COLLEGE b, ATTENDS c WHERE a.emp# = c.emp# AND b.col# = c.col# AND name Fred b.name = ‘Harvard’



(C) Frieder, Grossman, & Goharian 1996, 2002



85



Subqueries

Instead of hard-coding the list that is used by IN, it is possible to dynamically generate the list using a subquery. Ex: SELECT * FROM EMPLOYEE WHERE emp# IN (1,2,3,4,5,7)



could be rewritten as; Ex: SELECT * FROM EMPLOYEE WHERE emp# IN (SELECT num FROM SAMPLE)



Assuming SAMPLE(num) is a relation with tuples: 1,2,3,4,5, and 7.

(C) Frieder, Grossman, & Goharian 1996, 2002



86



EXISTS

EXISTS prefaces a subquery and evaluate to TRUE if one or more tuples are present in the result set of the subquery. Ex: List all employees who attended at least one college. SELECT * FROM EMPLOYEE a WHERE EXISTS (SELECT c.emp# FROM ATTENDS c WHERE c.emp# = a.emp#)

(C) Frieder, Grossman, & Goharian 1996, 2002



87



DISTINCT

DISTINCT is used to remove duplicates. Ex: List all distinct salaries. SELECT DISTINCT(salary) FROM EMPLOYEE

distinct (salary) 200 300 400



(C) Frieder, Grossman, & Goharian 1996, 2002



88



UNION

The union of two result sets (of the same data type) is obtained via the UNION operator (duplicates are removed). The OR query can be written using UNION: Syntax: UNION Ex: Obtain billing from 2000 and 2001. SELECT * FROM BILL2000 UNION SELECT * FROM BILL2001

(C) Frieder, Grossman, & Goharian 1996, 2002



89



EXCEPT

Syntax: EXCEPT Ex: Find employees, who attended IIT but not MIT.

SELECT e.emp# FROM EMPLOYEE a, COLLEGE b, ATTENDS c WHERE a.emp# = c.emp# AND b.col# = c.col# AND b.name = ‘IIT’ EXCEPT SELECT e.emp# FROM EMPLOYEE a, COLLEGE b, ATTENDS c WHERE a.emp# = c.emp# AND b.col# = c.col# AND b.name = ‘MIT’

(C) Frieder, Grossman, & Goharian 1996, 2002



90



Other Data Manipulation

UPDATE » Modify tuples in a single relation DELETE » Remove tuples from a single relation INSERT » Add tuples to a single relation



(C) Frieder, Grossman, & Goharian 1996, 2002



91



INSERT

Syntax: INSERT INTO [list of columns] VALUES () Ex. For the relation: EMPLOYEE (emp#, name, salary) INSERT INTO EMPLOYEE (emp#, name, salary) VALUES (5, ‘Herbert’, 200) Note: If optional column list is not found, the values must be listed in the order of their initial definition



(C) Frieder, Grossman, & Goharian 1996, 2002



92



INSERT - Format 2

Syntax: INSERT INTO (select statement) Ex: Copy all tuples in the EMPLOYEE relation and place them in NEW_EMPLOYEE INSERT INTO NEW_EMPLOYEE SELECT * FROM EMPLOYEE



(C) Frieder, Grossman, & Goharian 1996, 2002



93



UPDATE

Syntax: UPDATE WHERE where is of the form: SET = Ex: Modify John’s salary to 150 UPDATE EMPLOYEE SET salary = 150.00 WHERE name = ‘John’

(C) Frieder, Grossman, & Goharian 1996, 2002



94



UPDATE (continued)

An assignment statement may contain a numeric expression: Ex: Give all employees a ten percent raise. UPDATE EMPLOYEE SET salary = salary * 1.10



(C) Frieder, Grossman, & Goharian 1996, 2002



95



DELETE

Syntax: DELETE FROM WHERE Ex: Remove all employees who work in department 5 DELETE FROM EMPLOYEE WHERE dept = 5 To remove all employees: DELETE FROM EMPLOYEE



(C) Frieder, Grossman, & Goharian 1996, 2002



96



Data Definition Language (DDL)

» » » » » » » Create Table Drop Table Create Index Drop Index GRANT REVOKE ALTER TABLE



(C) Frieder, Grossman, & Goharian 1996, 2002



97



CREATE TABLE

CREATE TABLE [ ], .... [ ] (



) typical data types are: CHAR(x), VARCHAR(x), SMALLINT, INTEGER, DATE, TIME, DECIMAL (x,y)



(C) Frieder, Grossman, & Goharian 1996, 2002



98



CREATE TABLE (example)

CREATE TABLE EMPLOYEE (emp# SMALLINT, name CHAR(20), salary DECIMAL(5,2))



(C) Frieder, Grossman, & Goharian 1996, 2002



99



Varying Length Character

VARCHAR(x) indicates that a string will be no longer than x characters. Fixed length strings are padded to fill fixed space. Varying length strings have a Length Indicator.

FIXED

200 Hank FILL 252.35 200



VARYING

4 Hank 252.35



(C) Frieder, Grossman, & Goharian 1996, 2002



100



Effect of Varying Length Columns on Performance FIXED

tuple 1 tuple 2 tuple 3 tuple 4 tuple 5



VARCHAR

tuple 1 tuple 2 (cont) tuple 3 (cont) tuple 4 tuple 5 tuple 2 tuple 3



A modification to the FIXED table only affects one tuple. A modification to VARCHAR might result in the reshuffling or copying to OVERFLOW of other tuples so that they fit on a single page.

(C) Frieder, Grossman, & Goharian 1996, 2002



101



Rule of Thumb (VARCHAR)

Avoid VARCHAR when it is not necessary. One “rule of thumb” is to avoid VARCHAR when the maximum savings is less then thirty characters. Advantage: Save storage Disadvantage: Degrades performance of UPDATE



(C) Frieder, Grossman, & Goharian 1996, 2002



102



Nulls

An attribute may be defined as null. This indicates that the value is unknown and avoids the need for user-defined special indicators. CREATE TABLE EMPLOYEE (emp# SMALLINT, name CHAR(20), salary



DECIMAL(5,2) NULL)



(C) Frieder, Grossman, & Goharian 1996, 2002



103



Effect of Nulls on Performance

Any data in a tuple that allows nulls is prefaced by a null indicator.



Null Indicator

200 Hank 252.35 Null Indicator (1 byte) 200



No Null

Hank 252.35



(C) Frieder, Grossman, & Goharian 1996, 2002



104



Effect of Nulls on Performance

A table that specifies about NULLS results in one byte added for each tuple stored in the relation. This can be a tremendous waste of storage if no data are ever NULL. Most DBMS default to allow NULLS, while many real world applications do not require NULLS. Performance of retrieval and update is slightly degraded because the null indicator must be examined before checking tuple content.



(C) Frieder, Grossman, & Goharian 1996, 2002



105



Syntax Modifications for Nulls

Allow NULL specification in INSERT » To add an employee whose salary is unknown: – INSERT INTO EMPLOYEE (3,’Hank’, null) Use of NULL in select. SELECT * FROM EMPLOYEE WHERE salary IS NULL



(C) Frieder, Grossman, & Goharian 1996, 2002



106



CREATE INDEX

The relational model does not specify how data should be accessed. To create a separate access path, SQL allows users to use CREATE INDEX to create a separate structure, called access method. Usually B+tree is used.

Ex: CREATE UNIQUE INDEX I1 ON EMPLOYEE (num) SELECT * FROM EMPLOYEE WHERE num = 25 will use a B-tree instead of a sequential scan.

(C) Frieder, Grossman, & Goharian 1996, 2002



107



DROP INDEX / TABLE

To remove an index use: DROP INDEX To remove a table use: DROP TABLE



(C) Frieder, Grossman, & Goharian 1996, 2002



108



Referential Integrity

When a primary key is modified, it is often necessary to delete the corresponding foreign keys. A single employee id might be a foreign key in many tables.

Employee



Colleges



Projects



Dependents



(C) Frieder, Grossman, & Goharian 1996, 2002



109



Specification of Primary and Foreign Key

EMPLOYEE(emp#, name, salary) and COLLEGE (emp#, col#, col_name) emp# is a primary key in the EMPLOYEE relation and emp# is a foreign key of the COLLEGE relation. CREATE TABLE EMPLOYEE CREATE TABLE COLLEGE (emp# SMALLINT, (emp# SMALLINT, name CHAR(20), col# SMALLINT, salary DECIMAL(5,2), col_name CHAR(20), PRIMARY KEY (emp#)) FOREIGN KEY K1 (emp#) REFERENCES EMPLOYEE ON DELETE CASCADE, PRIMARY KEY (emp#, col#))

(C) Frieder, Grossman, & Goharian 1996, 2002



110



Referential Integrity (continued)

ON DELETE: » [CASCADE, SET NULL, RESTRICT] CASCADE » A delete to a primary key results in a delete of all corresponding tuples that contain the foreign key. SET NULL » A delete to a primary key results in null values placed in all corresponding foreign keys. RESTRICT » A delete to primary key results in an error if a matching foreign key exists.



(C) Frieder, Grossman, & Goharian 1996, 2002



111



Views

A view is a logical relation. It is defined as a subset of tuples and attributes of a physical (base) relation. Syntax: CREATE VIEW as Ex: Create a view on the EMPLOYEE relation such that the salary attribute is omitted. CREATE VIEW V1 AS (SELECT num, name FROM EMPLOYEE) Now a user may be given access to only V1 without access to the base relation: EMPLOYEE.

(C) Frieder, Grossman, & Goharian 1996, 2002



112



Views (continued)

The tuples in the base relation may be restricted by adding a WHERE condition to the view definition. EX: Create a view that contains information only about employees in named ‘Steve’ CREATE VIEW V2 as (SELECT * FROM EMPLOYEE WHERE name = ‘Steve’) Any queries against V2 are executed by merging the view definition with the query to ensure the result set only accesses data allowed by the view.

(C) Frieder, Grossman, & Goharian 1996, 2002



113



View Insert

An insert to a view that does not contain all of the attributes in the base relation results in the additional attributes being set to NULL. This is only valid if nulls are permitted. Ex: Consider the view V1 that omits the SALARY attribute. INSERT INTO V1 (4, ‘Hank’) is equivalent to: INSERT INTO EMPLOYEE (4,’Hank’, null) A null value will be placed in the salary attribute.



(C) Frieder, Grossman, & Goharian 1996, 2002



114



Security and Authorization

Access to relations and views is controlled by the GRANT and REVOKE statements. GRANT [ALL, SELECT, INSERT, UPDATE, DELETE] ON to Ex: Give John access to all EMPLOYEE data and ensure that Mary and Sue may not look at employee salaries. GRANT ALL ON EMPLOYEE TO JOHN GRANT SELECT ON V1 TO MARY, SUE

(C) Frieder, Grossman, & Goharian 1996, 2002



115



GRANT (continued)

Optionally, “with grant option” may be specified to allow the recipient to grant access to the privileges which are being given. If user1 issues: » GRANT SELECT ON T1 TO USER2 WITH GRANT OPTION User 2 may now: » GRANT SELECT ON USER1.T1 to USER3



(C) Frieder, Grossman, & Goharian 1996, 2002



116



GRANT (continued)

A GRANT of the UPDATE operation may be restricted to only certain columns of the object. Ex: » Give John the ability to update only the SALARY attribute. » GRANT UPDATE(salary) on EMPLOYEE to JOHN » “PUBLIC” is a special user name that implies all users » GRANT ALL ON SUPPLIER TO PUBLIC gives all users access to all of the tuples in the SUPPLIER relation.

(C) Frieder, Grossman, & Goharian 1996, 2002



117



Revoke

Removes access from a user REVOKE ON FROM Ex: remove Mary’s access to look at employee data. REVOKE SELECT ON EMPLOYEE FROM MARY



(C) Frieder, Grossman, & Goharian 1996, 2002



118



System Catalogs

System catalogs contain metadata (data about data). These catalogs can be queried with any valid SQL SELECT. When a user issues a CREATE TABLE statement, the following catalogs are updated:

» A tuple is added to the TABLES relation that indicates the name of the table and who created, time of creation, etc. » One tuple is added to the COLUMNS relation for each column in the CREATE TABLE statement to indicate the name and the data type of each column. » A tuple is added to the TABLE_AUTH to indicate that the creator has access to the relation.

(C) Frieder, Grossman, & Goharian 1996, 2002



119



Performance Aspects of Catalogs

Catalogs often are a “hot spot” in that many users issuing DDL against the catalogs will result in contention. Some catalogs store information about when applications have been processed. Most installations support separate development and production systems.



(C) Frieder, Grossman, & Goharian 1996, 2002



120



Embedded SQL

Some queries can not be answered conveniently by only SQL commands. The use of SQL commands within a program in a host language is called embedded SQL. The data types of SQL might not be recognized by host language, and vise versa. Thus, casting the data values. Programming languages typically do not support set of rows. Thus use of Cursors.



(C) Frieder, Grossman, & Goharian 1996, 2002



121



Trigger

Trigger is a procedure that is invoked by DBMS as the result of a transaction to perform the following: Maintaining database integrity Another database transaction Alerting users Supporting auditing and security checks by creating logs Collecting Statistics



(C) Frieder, Grossman, & Goharian 1996, 2002



122



Database Design



(C) Frieder, Grossman, & Goharian 1996, 2002



123



Problem

Given some body of data to be represented in the database, what is the best logical structure for the data? » Identify Entities » Identify Relationships Focus of this effort is on logical design, not physical



(C) Frieder, Grossman, & Goharian 1996, 2002



124



Design Approaches

Entity Relationship Model » Identify the general entities and relationships Normalization » Refine the design



(C) Frieder, Grossman, & Goharian 1996, 2002



125



Entity Relationship Model

Developed by Peter Chen in 1976 Entity » A Distinguishable Object » Regular Entity » Weak Entity » Entities have properties Relationship



(C) Frieder, Grossman, & Goharian 1996, 2002



126



ER Diagram

Each entity is shown as a rectangle Weak entities have a double rectangle (not shown here) Properties are shown as ellipses with the name of the property in question. Properties are attached to the specific entity.

SSN Name Employee Salary

(C) Frieder, Grossman, & Goharian 1996, 2002



127



Drawing Relationships

Each relationship is labeled as a diamond. The participants are connected to the relevant relationship with a line. Each line is labeled 1 or M to indicate the type of relationship: (1-1, 1-M, M-1 or M-M).



(C) Frieder, Grossman, & Goharian 1996, 2002



128



Relationship Examples

DEPARTMENT 1 Dept-Emp M EMPLOYEE



EMPLOYEE



1



Emp-Sp



1



SPOUSE



SUPPLIER



M



Sup-part



M



PARTS



(C) Frieder, Grossman, & Goharian 1996, 2002



129



ER Diagram

Weak Entity

» has a discriminator attribute (ex: p#) » does not exist without the strong entity (ex: no payment exists without loan)



Weak entity requires the primary key of the strong entity along with the discriminator to be identified.

L# ssn L-amount P# P-date amount



Loan



1



loanpayment



M Payment



(C) Frieder, Grossman, & Goharian 1996, 2002



130



DDL From E-R diagrams

Each entity may result in a relation whose attributes are the properties for the entity Each relationship may result in a relation whose attributes link the entities described in the relationship



(C) Frieder, Grossman, & Goharian 1996, 2002



131



Example (1-M, M-1)

s# sname 1 city p# pname color



Supplier



Supl-part



M



Parts



Supplier (s#, sname, city) Parts ( p#, pname, color) Sup-part (s#, p#) OR: Supplier (s#, sname, city) Parts ( p#, pname, color,s#)

(C) Frieder, Grossman, & Goharian 1996, 2002



132



Example (1-M, M-1)

s# sname 1 city p# pname color



Supplier



Supl-part

qty date



M



Parts



Supplier (s#, sname, city) Parts ( p#, pname, color) Sup-part (s#, p#, qty,date)



(C) Frieder, Grossman, & Goharian 1996, 2002



133



Example (1-1)

ssn name 1 salary s-ssn name DOB



Employee



Emp-Sp



1



Spouse



Employee (ssn, name, salary) Spouse (s-ssn, name, DOB) OR: Emp-Sp (ssn, s-ssn) Employee (ssn, name, salary,s-ssn) OR: Spouse (s-ssn, name, DOB) Employee (ssn, name, salary) Spouse (s-ssn, name, DOB,ssn)

(C) Frieder, Grossman, & Goharian 1996, 2002



134



Example (M-M)

ssn name M salary dname city



Employee



Emp-Dept



M



Department



Employee (ssn, name, salary) Department (dname, city) Emp-Dept (ssn, dname)



(C) Frieder, Grossman, & Goharian 1996, 2002



135



Example (Weak Entity)

L# ssn L-amount P# P-date amount



Loan



1



loanpayment



M Payment



Loan (L#, ssn, L-amount) Payment (L#, P#, amount, P-date)



(C) Frieder, Grossman, & Goharian 1996, 2002



136



Example (Class Hierarchies)

ssn name



phone



Staff



Staff (ssn, name,phone) Full Time (ssn,salary,start-date) Part Time (ssn, hourly-rate) OR:



IS-A



Full Time Start-date



Part Time



Salary



Hourly-rate



Full Time (ssn,salary,start-date, name,phone) Part Time (ssn, hourly-rate, name, phone)



(C) Frieder, Grossman, & Goharian 1996, 2002



137



Example (Ternary Relationship)

ssn name M salary dname city



Employee



E-D-P

M Project



M



Department



pname



Employee (ssn, name, salary) Department (dname, city) Project (p-no,pname) E-D-P (ssn, dname,pno)

(C) Frieder, Grossman, & Goharian 1996, 2002



pno



138



Example (Aggregation)

ssn name M salary dname city M



Employee



Emp-Dept

M



Department



active Employee (ssn, name, salary) Department (dname, city) Project (p-no,pname) Emp-Dept (ssn, dname) Active (ssn, dname,pno,date)

1 Project



date pname pno



(C) Frieder, Grossman, & Goharian 1996, 2002



139



Design Scenario

Some Institute of Technology (SIT) is considering to modernize its administrative functions including is campus-wide information database. Since SIT is in the same financial shape as all universities are, it was decided to collect many free designs and evaluate them with the hope that some would be perfectly developed. That is were you come in. You have been graciously volunteered to design their database. You must state all of your assumptions. Failure to do so will lead to “general” assumptions and might violate your design. Your design must be in at least 3NF and must be able to store and query data on the following topics:



(C) Frieder, Grossman, & Goharian 1996, 2002



140



Design Scenario Requirements

• Academic Structure:

• Colleges – containing many departments, head by a Dean, etc. • Departments – containing many labs, faculty, courses, students, head by a Chair, etc. • Locations, prerequisites, offerings, professors • Names, social security numbers, children, offices, phone numbers, email addresses, salary, etc. • Prices, item selection based on meal (breakfast, lunch, and dinner), purchases (date & cost), location, manager, etc.



• Classes:



• Personnel:



• Cafeteria Offerings:



(C) Frieder, Grossman, & Goharian 1996, 2002



141



ERD for Design Scenario

Lab College 1

DC



Children M



M

DL



M

DS



1 Department 1 M

DF



FC



1 M Student M

term Sec# CP CS



M M Faculty M

loc



CD



M M M



M Course M

term



CF Sec#

142



(C) Frieder, Grossman, & Goharian 1996, 2002



Design Relation Definition

Relations (Academic) » College( college#, dean, college_nm) » Dept( dept#, chair, dept_nm) » Lab (lab_nm, bld#, room#, capacity) » Student (ssn, first_nm, last_nm) » Faculty (f_ssn, first_nm, last_nm, salary, phone,email) » Course (c#,course_nm,credit_hrs)



(C) Frieder, Grossman, & Goharian 1996, 2002



143



Design Relation Definition

Relationships (Academic) » DC (dept#, college#) » DL (lab_nm, dept#) /* ssn of student */ » DS (ssn, dept#) /* ssn of faculty */ » DF (dept#, f_ssn) » CF (c#, f_ssn, term, sec#, loc) » CD (c#, dept#) » CS (c#, ssn, term, sec#) /* Prereqs of Courses /* » CP( course#, prereq# )

(C) Frieder, Grossman, & Goharian 1996, 2002



144



Design Relation Definition

Relations (Personnel) » Children (ssn, f_nm,l_nm, age) Relationship (Personnel) /* ssn of faculty and child */ » FC (f_ssn, ssn) Cafeteria

» Meal( meal#, time ) » Café( café#, address, phone# ) » … etc.

(C) Frieder, Grossman, & Goharian 1996, 2002



145



Sample Query 1

Who is teaching CS 425 Section 051 during the fall 1999 term and where is it taught?

SELECT f_ssn, loc FROM CF WHERE (c# = ‘CS425’ AND Sec# = ‘051’ AND Term = ‘Fall99’)



(C) Frieder, Grossman, & Goharian 1996, 2002



146



Sample Query 2

What are the names of the children of the faculty who taught classes offered at the Stuart Building during the fall 1998 term?

SELECT Children.f_nm, Children.l_nm FROM Children, Faculty, FC , CF WHERE (loc = “Stuart Building” AND term = “Fall98” AND CF.f_ssn = Faculty.f_ssn AND FC.f_ssn = Faculty.f_ssn AND FC.ssn# = Children.ssn)



(C) Frieder, Grossman, & Goharian 1996, 2002



147



Sample Query 3

What are the common meal offerings that are available for both lunch and dinner?

SELECT A.meal# FROM Meal A, Meal B WHERE (A.meal# = B.meal# AND A.time = “Lunch” AND B.Time = “Dinner”)



(C) Frieder, Grossman, & Goharian 1996, 2002



148



Normalization

Different normal forms have been defined to characterize a database design. Each NF is progressively more restrictive.

U

3NF 2NF 1NF



(C) Frieder, Grossman, & Goharian 1996, 2002



149



Functional Dependency

Functional Dependency (FD) » A many-one or many-many relationship from one set of attributes to another within a given relation – ex: Supplier (S#), Part (P#) --> QTY » Functional dependency still holds even though for many combinations of S# and P#, there is only one quantity



(C) Frieder, Grossman, & Goharian 1996, 2002



150



Broader FD

A functional dependency can be defined if the attribute values for x uniquely determine those in y. A functional dependency is a statement about a relational scheme (i.e., all possible relations), and cannot be deduced from a particular relation.



(C) Frieder, Grossman, & Goharian 1996, 2002



151



Example

S# S1 S1 S2 S2 S3 S4 S4 CITY London London Paris Paris Paris London London P# P1 P2 P1 P2 P2 P4 P5 QTY 100 100 200 200 300 400 400



Functional Dependency: Whenever two tuples have the same value of x, they will also have the same value of y. Ex: S# -> CITY

(C) Frieder, Grossman, & Goharian 1996, 2002



152



FD Diagrams

Draw a line to represent irreducible FD’s

SNAME S# STATUS CITY



An arrow will always exist from the primary key to the non-key attributes. Problems usually exist when other arrows are present. Normalization may be informally defined as the process by which extra arrows are removed.

(C) Frieder, Grossman, & Goharian 1996, 2002



153



Closure Of A Set Of Dependencies



Question: Answer:



What is F+? F+ is the closure of F - the set of all functional dependencies derivable from F.



Equivalently, F+ is the set of all dependencies that follow from F by Armstrong’s axioms.

(C) Frieder, Grossman, & Goharian 1996, 2002



154



Armstrong’s Axioms

Let R be a relational scheme with attributes U and functional dependencies FD.

» Reflexivity - If Y is a subset of X, then X=>Y. » Augmentation - If X=>Y, and Z is a subset of U, then XZ=>YZ. » Transitivity - If X=>Y and Y=>Z, then X=>Z.



(C) Frieder, Grossman, & Goharian 1996, 2002



155



Completeness Proof

Given that Armstrong’s axioms are complete prove that the FD set {Reflexivity, Augmentation, and Pseudotransitivity} is complete.

Reflexivity - If Y is a subset of X, then X=>Y. Augmentation - If X=>Y then XZ=>YZ. Transitivity - If X=>Y and Y=>Z, then X=>Z. Pseudotransitivity - If X=>Y and WY=>Z, then XW=>Z.

(C) Frieder, Grossman, & Goharian 1996, 2002



156



Completeness Proof

To demonstrate a FD set A is complete given a complete FD set B, must demonstrate that FD set A derives every FD in FD set B.

FD Set A Reflexivity Augmentation => => FD Set B (Armstrong’s Axioms) Reflexivity Augmentation



To derive Transitivity, set W = {} in Pseudotransitivity, hence Pseudotransitivity simplifies to Transitivity.



(C) Frieder, Grossman, & Goharian 1996, 2002



157



Armstrong’s Axioms Extended (Projectivity)

Projectivity - If X=>YZ, then X=>Y and X=>Z. 1. X => YZ 2. YZ => Y 3. X => Y 4. YZ => Z 5. X => Z Given Reflexivity Transitivity 1, 2 Reflexivity Transitivity 1, 4



(C) Frieder, Grossman, & Goharian 1996, 2002



158



Armstrong’s Axioms Extended (Additivity)

Additivity - If X=>Y and X=>Z, then X=>YZ. 1. X => Y 2. X => Z 3. XY => YZ 4. X => XY 5. X => YZ Given Given Augment Y on 2 Augment X on 1 Transitivity 4, 3



(C) Frieder, Grossman, & Goharian 1996, 2002



159



Armstrong’s Axioms Extended (Pseudotransitivity)

Pseudotransitivity - If X=>Y and WY=>Z, then XW=>Z. 1. 2. 3. 4. X => Y WY => Z XW => YW XW => Z Given Given Augment W on 1 Transitivity 3, 2



(C) Frieder, Grossman, & Goharian 1996, 2002



160



Functional Dependency Derivation Using the Extended Armstrong’s Axioms



Given the following functional dependency set, FD = { AB=>E, AG=>J, BE=>I, E=>G, GI=>H }, prove AB=>GH.

(C) Frieder, Grossman, & Goharian 1996, 2002



161



Derivation Proof

1. 2. 3. 4. 5. 6. 7. 8. 9. 10. AB => E BE => I E => G GI => H AB => G AB => BE AB => I AB => GI AB => H AB => GH Given Given Given Given Transitivity 1, 3 Augment B to 1 Transitivity 6, 2 Additivity 5, 7 Transitivity 8, 4 Additivity 5, 9

162



(C) Frieder, Grossman, & Goharian 1996, 2002



Closure of Attributes

Without computing F+, can identify if a given FD (X => Y) is in F+. Obtaining candidate keys Algorithm: closure = a; while there are changes to closure do if there is a FD x=>y such that x is subset of closure then closure = closure U y; end;

(C) Frieder, Grossman, & Goharian 1996, 2002



163



Closure of Attributes (Example)

R = (A, B, C, D) FD {A=>B, C=>A, C=>D} A+ = A, B B+ = B C+ = C, A, D, B D+ = D



(C) Frieder, Grossman, & Goharian 1996, 2002



164



1NF

A relation is in 1NF if and only if all underlying domains contain scalar values. No multi-valued attributes within a single “cell” of a relation.



(C) Frieder, Grossman, & Goharian 1996, 2002



165



Example of 1NF

The following relation is NOT in 1NF

Student # 1 2 ... Courses {CS425, CS595, CS100} {CS525, CS548}



Key:



The following relation is in 1NF

Student # 1 1 1 2 2 ...

(C) Frieder, Grossman, & Goharian 1996, 2002



Courses CS425 CS595 CS100 CS525 CS548



Key:



166



Problems with 1NF

Consider the relation:

Professor# P1 P1 P1 P2 P2 P2 P3 P3 P4 P5 Student# S1 S2 S4 S1 S2 S3 S2 S4 S4 S5



(relation in 1NF but not 2NF)

Course 425 100 595 525 525 548 548 425 525 548 Goal M.S. Ph.D. Ph.D. M.S. Ph.D. M.S. Ph.D. Ph.D. Ph.D. M.S.



FD: {Professor#, Student# -> Course, Student# -> Goal} Key:

(C) Frieder, Grossman, & Goharian 1996, 2002



167



1NF Problems

Insertion Anomaly

» Can not insert new professors without them teaching courses



Deletion Anomaly

» Deletion of a tuple describing a particular student (S5) eliminates additional valid information (P5 exists)



Update Anomaly

» The Goal value appears many times for the same student and is redundant.



(C) Frieder, Grossman, & Goharian 1996, 2002



168



Fixing the problems with 1NF

To fix the problems, place information that is logically separate in separate relations. STUDENTS

» Facts about the individual students



PROFESSOR-STUDENTS

» Facts about where a professor and a student first met



(C) Frieder, Grossman, & Goharian 1996, 2002



169



2NF

Consider the following two relations:

STUDENTS Student# S1 S2 S3 S4 S5 Goal M.S. Ph.D. M.S. Ph.D. M.S. PROFESSOR-STUDENTS Professor# Student# Course P1 S1 425 P1 S2 100 P1 S4 595 P2 S1 525 P2 S2 525 P2 S3 548 P3 S2 548 P3 S4 425 P4 S4 525 P5 S5 548



Key:



And by shear luck… also 3NF

(C) Frieder, Grossman, & Goharian 1996, 2002



Key:



Note: Database valid for student information only!!!

170



2NF

A relation is in 2NF if and only if it is in 1NF and every non-key attribute is fully dependent on the primary key. That is, no non-key attribute is dependent on only part of the key. is the key, but Goal is dependent on only Student#



(C) Frieder, Grossman, & Goharian 1996, 2002



171



Problems with 2NF

Consider the relation in 2NF but not in 3NF:

Emp# 1 2 3 4 5 6 7 Age 30 45 52 44 50 52 49 City Chicago Washington New York Washington Chicago London Washington Experience Accounting Computer Science Physics Computer Science Accounting Accounting Computer Science



FD: {Emp# ->Age, City, Experience, City ->Experience} Key:



(C) Frieder, Grossman, & Goharian 1996, 2002



172



Problems with 2NF

Insertion Anomaly

» Can not insert fact that a city has a particular expertise until we have an employee located in the particular city.



Deletion Anomaly

» Some deletes will not only eliminate the fact that an employee exists in a given location (employee number 6), but will also remove the information about the expertise in a city (London and accounting).



Update Anomaly

» Experience is redundant

(C) Frieder, Grossman, & Goharian 1996, 2002



173



Fixing problems with 2NF

Replace with information about » EMPLOYEE – Facts about employees » OFFICE – Facts about company offices That is, create a 3NF version of the design



(C) Frieder, Grossman, & Goharian 1996, 2002



174



3NF

A relation is in 3NF if and only if it is in 2NF and every non-key attribute is nontransitively dependent on the primary key. Or in English:

» Every non-key attribute is dependent on the key and nothing but the key.



(C) Frieder, Grossman, & Goharian 1996, 2002



175



FD’s for Example

City Emp# -> City Emp# -> Experience Emp# -> Age City -> Experience Note: There is a transitive dependency Emp# -> City and City -> Experience implying Emp# -> Experience



Emp#



Experience



Age



(C) Frieder, Grossman, & Goharian 1996, 2002



176



3NF

OFFICE City Experience Chicago Accounting Washington Computer Science New York Physics London Accounting EMPLOYEE Age City 30 Chicago 45 Washington 52 New York 44 Washington 50 Chicago 52 London 49 Washington



Key:



Emp# 1 2 3 4 5 6 7



No anomalies exist!



Key:



(C) Frieder, Grossman, & Goharian 1996, 2002



177



3NF – Unfortunately!!!

Consider the following two relations:

STUDENTS Student# S1 S2 S3 S4 S5 Goal M.S. Ph.D. M.S. Ph.D. M.S. PROFESSOR-STUDENTS Professor# Student# Course P1 S1 425 P1 S2 100 P1 S4 595 P2 S1 525 P2 S2 525 P2 S3 548 P3 S2 548 P3 S4 425 P4 S4 525 P5 S5 548



Key:



Relations are in 3NF, but anomalies exist !!!

(C) Frieder, Grossman, & Goharian 1996, 2002



Key:



178



Anomalies Dependency

For student information, no anomalies exist. For the professorial information, anomalies exist !!!

» Insertion Anomaly:

– Can not insert P6 if never taught a student



» Deletion Anomaly:

– If S5 terminates the university, P5 is dismissed!!! – ( You may like it, but as a professor, I do not!!! )



(C) Frieder, Grossman, & Goharian 1996, 2002



179



3NF Example

FD: x y {trivial FD; x is a super key; y part of a key}



FD: {ssn sid, nm , sid ssn} CK: {ssn;sid} R: ssn nm sid

1 3 2 Mary 10 Joe 30 Mary 20



What is the normal form of relation R? Do you see any anomaly?



(C) Frieder, Grossman, & Goharian 1996, 2002



180



3NF Example

FD:{sid,program degree; degree program} Key: sid, program R:

sid program degree 10 U BS 10 M MS 20 U BS 30 U BA What is the normal form of relation R? why? Do you see any anomaly?



(C) Frieder, Grossman, & Goharian 1996, 2002



181



BCNF Example

FD: x y {trivial FD; x is a super key}



R: (sid, program, degree) FD:{sid,program degree; degree program} Key: sid, program

sid degree degree program BS 10 BS U R2: MS R1: 10 MS M BS 20 BA U BA 30 What is the normal form of relation R? why? Do you see any redundancy?

(C) Frieder, Grossman, & Goharian 1996, 2002



182



Boyce-Codd Normal Form (BCNF)

A relation is in BCNF if and only if it is in 3NF and every attribute is non-transitively dependent on the primary key. Or in English:

» Every attribute is dependent on the key and nothing but the key.

183



(C) Frieder, Grossman, & Goharian 1996, 2002



Loss-Less Decomposition

R1 ∩ R2 R1 or R1 ∩ R2 R2



R1 (sid,degree) and R2 (degree,program) R1 ∩ R2 R2 as we have FD: degree program Thus: loss-less.

R1 natural join R2: sid program degree 10 U BS 10 M MS 20 U BS 30 U BA

184



(C) Frieder, Grossman, & Goharian 1996, 2002



Lossy Decomposition

R1 (sid,program) and R2 (degree,program) R1 ∩ R2 : program ≠> R1 or R2 Thus: lossy decomposition.

R1 natural join R2: sid program degree program U 10 BS U R2: M R1: 10 MS M U 20 BA U U 30 sid program degree 10 U BS 10 U BS 30 U BS 10 M MS 10 U BA 20 U BA 30 U BA



(C) Frieder, Grossman, & Goharian 1996, 2002



185



Dependency Preservation

F+ = (Fx ∪ Fy)+ R: (sid, program, degree)

FD:{sid,program degree; degree program} Key: sid, program



R1 (sid,degree) and R2 (degree,program)

FD: {degree program}



F+ ≠ (Fx ∪ Fy)+ Have to join to verify FD: sid,program degree

(C) Frieder, Grossman, & Goharian 1996, 2002



186



Minimal Cover

Create the smallest set of functional dependencies by: •Removing transitive dependencies from non-key attributes to key. ssn program, degree => ssn program program degree •Union all functional dependencies that the left hand side is the same. ssn DOB ssn name => ssn DOB, name •Remove the extra attribute from the left hand-side of FD if the FD is valid after removal of that attribute. ssn address ssn, name address => ssn address

(C) Frieder, Grossman, & Goharian 1996, 2002



187



Decomposition to 3NF using Minimal Cover

IF relation R is not in 3NF then Create the Minimal Cover of FD on R. Create a Relation Ri for each FD in Minimal Cover. If the key of relation R is not in its entirety included in any of the relations Ri, then create one more relation with that key. This decomposition is loss-less and preserves dependencies.

(C) Frieder, Grossman, & Goharian 1996, 2002



188



Query Optimization



(C) Frieder, Grossman, & Goharian 1996, 2002



189



Query Optimization

A key difference in the relational model and other models is that it is entirely up to the DBMS how data are retrieved.



(C) Frieder, Grossman, & Goharian 1996, 2002



190



Query Processing Components

Each SQL statement is implemented with the following steps: » Parse - Identify tokens in the query » Develop internal representation of the query (attempt to have an internal form such that two queries with different syntax, but similar functionality will have a uniform internal representation) » Execute the optimizer to choose the best access path to the data.

(C) Frieder, Grossman, & Goharian 1996, 2002



191



Access Paths

Typical choices the optimizer must make are: » Is it better to implement a sequential scan than to use a b-tree? » Which b-tree should be used? » Given five relations that must be joined, what order should be implemented?



(C) Frieder, Grossman, & Goharian 1996, 2002



192



Selectivity

Selectivity of a condition refers to the ratio of the number of tuples that satisfy a condition to the total number of tuples in the relation. Selectivity of a primary key is 1/N where N is the number of tuples An attribute with i distinct values will have a selectivity of (N / i) / N, assuming a uniform distribution. Typically, an index is used for terms with “good” selectivity (i.e., emp#) and a scan is used for terms with “bad” selectivity (i.e., gender)



(C) Frieder, Grossman, & Goharian 1996, 2002



193



Optimization with Boolean Logic

(Conjunctive Expressions) A conjunctive (only AND) query, e.g., A and B and C, is easy to optimize since any false match terminates the search. In a conjunctive query, order the conditions according to selectivity, the lower the selectivity, the earlier is the condition evaluated. Lower selectivity values (primary keys being the lowest) are typically indexed.

(C) Frieder, Grossman, & Goharian 1996, 2002



194



Optimization with Boolean Logic

(Disjunctive Expressions) A disjunctive (only OR) query, e.g., A or B or C, is more difficult to optimize. In a disjunctive query, all conditions must be evaluated since truth in any of the conditions validates the expression. Poor access paths, if they exist, none-the-less must be evaluated.



(C) Frieder, Grossman, & Goharian 1996, 2002



195



Join Algorithms

Nested (inner-outer) Loop. For each tuple in the outer relation R, retrieve every tuple in S (inner) and test whether or not the join condition is satisfied. Merge-Scan. Sort R and S by the join attributes. Scan the tuples matching those that have match conditions that restrict R and S.



(C) Frieder, Grossman, & Goharian 1996, 2002



196



Join Order Optimization

(Cost Based)

Consider a join of EMPLOYEE with DEPARTMENT

Assume EMPLOYEE requires e data blocks to store. Assume DEPARTMENT requires d data blocks to store.



Nested Loop Join (with n+1 blocks of memory available) I/O demands are computed as:

– There are two potential algorithms:

EMPLOYEE as the outer, DEPARTMENT as inner DEPARTMENT as outer, EMPLOYEE as inner



(C) Frieder, Grossman, & Goharian 1996, 2002



197



Join Optimization (continued)

EMPLOYEE as outer, DEPARTMENT as inner

» EMPLOYEE is read into the n+1st buffer one block at a time, n blocks at a time of DEPARTMENT are read and then scanned:

– Reading EMPLOYEE requires e block reads – The nested join, requires e/n iterations as only n blocks may be placed in memory at one time. – For each of the e/n iterations, a full scan of d blocks is required.



» Total I/O = e + (e) (d / n)



Reversing the join order yields:

» Total I/O = d + (d) (e / n)

(C) Frieder, Grossman, & Goharian 1996, 2002



198



Join Order (continued)

The importance of join order is seen when values are substituted into the equations. For d = 10 and e = 50 and n = 5:

» EMPLOYEE as outer, DEPARTMENT as inner » Total I/O = e + (e) (d / n) » Total I/O = 50 + (50)(10 / 5) = 150



Reversing the join order yields:

» Total I/O = d + (d ) (e / n) » Total I/O = 10 + (10)(50 / 5) = 110



Hence, for this situation, it is better to use DEPARTMENT as the outer relation.

(C) Frieder, Grossman, & Goharian 1996, 2002



199



Join Optimization (continued)

Rule of Thumb for Nested Loop Join Computations The smaller number of relevant tuples should be the outer relation (Many other optimization techniques are available)



(C) Frieder, Grossman, & Goharian 1996, 2002



200



Cost Based Optimization

Obtain cost estimates for each execution strategy. The cost depends on: » cost to access secondary storage » storage cost

– need for temporary files



» computational cost

– cost of in-memory operations



» communication cost

– cost of shipping the query from the server to the client



(C) Frieder, Grossman, & Goharian 1996, 2002



201



Cost-Based Optimization

(Metadata Use)

System catalogs contain data that are used to estimate the cost for each access path. Metadata used includes: » number of tuples » number of blocks » existence of indexes » number of levels in a B-tree index » number of distinct values found in the index (selectivity) of an attribute.

(C) Frieder, Grossman, & Goharian 1996, 2002



202



(Rule Based Optimization)

Query Tree » Structure that corresponds to a relational algebra expression by representing relations as leaf nodes.



Query Trees



(C) Frieder, Grossman, & Goharian 1996, 2002



203



Query Tree Transformation

Ex: Find the last names of employees born after 1957 who work on a project named Aquarius EMPLOYEE (ssn, lname, fname, bdate) WORK_ON (essn, p#) PROJECTS (pno, pname) - e tuples - w tuples - p tuples



SELECT lname FROM EMPLOYEE, WORKS_ON, PROJECTS WHERE (essn = ssn) and (p# = pno) and (pname = ‘Aquarius’) and (bdate > 1957) Initial query tree will JOIN the three relations first and then perform the selections and projections. O(e w p) tuples will be accessed.

(C) Frieder, Grossman, & Goharian 1996, 2002



204



Initial Tree

PROJECT (lname) SELECT (pname = ‘Aquarius’, bdate > 1957) Join (p# = pno) PROJECTS Join (essn = ssn) EMPLOYEE

205



WORKS-ON

(C) Frieder, Grossman, & Goharian 1996, 2002



Migrate SELECT

PROJECT (lname) Join Step 1: (p# = pno) Reduce size of join by computing SELECT Join early in the process. (essn = ssn) SELECT (pname = ‘Aquarius’) PROJECTS

(C) Frieder, Grossman, & Goharian 1996, 2002



SELECT (bdate > 1957)



WORKS-ON



EMPLOYEE

206



Join Order Justification for Rules

Both pno and ssn are key attributes. Thus, they both will yield only one tuple to be retrieved. The cardinality of PROJECT is less than EMPLOYEE, thus join PROJECT first.



(C) Frieder, Grossman, & Goharian 1996, 2002



207



Reorder the Joins

PROJECT (lname) Step 2: Order joins according to lower input and result sizes Join (essn = ssn) Join (p# = pno) SELECT (bdate > 1957)



SELECT (pname = ‘Aquarius’) PROJECTS

(C) Frieder, Grossman, & Goharian 1996, 2002



WORKS-ON



EMPLOYEE

208



Final Query Tree

Step 3: Move Projects early to reduce temporary relation sizes

PROJECT (lname) Join (essn = ssn)



PROJECT (essn)



PROJECT (ssn, lname)



Join (p# = pno)



SELECT (bdate > 1957)



PROJECT (pno)



PROJECT (essn,p#)



EMPLOYEE



SELECT (Aquarius)



WORKS-ON

PROJECTS

209



(C) Frieder, Grossman, & Goharian 1996, 2002



Overview of Key Rules

Partition each select of (A and B and C) to SELECT (A), SELECT (B), and SELECT(C) Move each select as far down the tree as possible Rearrange leaf nodes so that the small answer sets are processed first. Typically, use smallest selectivity to estimate this (found in system catalog). Combine a Cartesian product and a select of joining conditions in a join Move projection as far down the tree as possible Identify sub-trees that represent groups of operations that may be executed by a single access routine.

(C) Frieder, Grossman, & Goharian 1996, 2002



210



Explain

Many commercial systems provide a utility to identify the path the optimizer will choose for a given query. Typically the utility is referred to as EXPLAIN and the syntax is:

» EXPLAIN



Results are placed into relations that may be queried. Performance tuning is often done by changing an index and examining the effect on queries by repeating the EXPLAIN statement.

(C) Frieder, Grossman, & Goharian 1996, 2002



211



Recovery and Concurrency Control



(C) Frieder, Grossman, & Goharian 1996, 2002



212



Recovery and Concurrency

Transaction

» Logical unit of work » Composed of one or more SQL statements » Either all of the transaction completes or none of the transaction is executed. Ex: Transfer money from savings account to checking.

– Step 1: Subtract from savings – Step 2: Add to checking



It is critical that either both steps complete or neither.



(C) Frieder, Grossman, & Goharian 1996, 2002



213



Commit and Rollback

Commit

» Indicates that a transaction completed. There is no BEGIN and END TRANSACTION necessary as COMMIT marks the END of the current transaction and the start of the next transaction.



Rollback

» Undo all work completed by the transaction that is currently in progress. This is implemented by saving all “in progress” work to a transaction log.

(C) Frieder, Grossman, & Goharian 1996, 2002



214



Recovery

After a failure, a DBMS checks the log to determine:

» The transactions that were in process during the time of failure. These transactions are rolled back (UNDO).



To avoid lengthy restart times, the system periodically writes all contents of main memory to disk. This is referred to as a checkpoint. After a failure, all transactions that have completed after the last checkpoint must be redone because data have not been written from memory to disk.



(C) Frieder, Grossman, & Goharian 1996, 2002



215



Recovery (continued)

In summary, after a failure a DBMS enters an UNDO phase to rollback all transactions that were in progress and a REDO phase to again execute the transactions that occurred after the last checkpoint and before the failure. Example below » Undo t3 » Redo t2 t1 t2 t3



Checkpoint

(C) Frieder, Grossman, & Goharian 1996, 2002



Failure

216



Resilience to Failure

Power Failure (only) » Undo and Redo processing Data Disk Failure » The last good data archive is restored and all logs since the point of failure are re-processed to REDO all transactions lost on the bad disk. Log Disk Failure » This is rare, but the only good means of avoiding this problem is to use dual logs in which all log writes are duplicated. » Without dual logging it is necessary to restore back to the last good data archive and all transactions since then are lost.

(C) Frieder, Grossman, & Goharian 1996, 2002



217



Concurrency Problems

Problems occur when two transactions executing simultaneously become interleaved. All proposed means of ensuring concurrency must be shown to be serializable. These algorithms must produce a result that is equivalent to some (arbitrary) serial execution of the transactions that they manage. In summary, it is correct for transaction t1 to precede t2, or follow t2, but it is not correct for transactions t1 statements to be interleaved with t2.



(C) Frieder, Grossman, & Goharian 1996, 2002



218



Lost Update Problem

Consider the following two transactions: The typical example is a husband and wife both withdrawing money from the same bank account. Husband: Withdraw $10 s1: Read Savings s3: Subtract 10 s5: Write Savings s7: COMMIT Wife: Withdraw $10 s2: Read Savings s4: Subtract 10 s6: Write Savings s8: COMMIT



Consider the execution of: s1, s2, s3, s4, s5, s6, s7, s8. If savings starts at 100, after s1 and s2, each user determines savings is equal to 100, so after s8, it ends up at 90 even though 20 has been subtracted. Hence, one update has been lost.

(C) Frieder, Grossman, & Goharian 1996, 2002



219



Uncommitted Dependency

This occurs when a transaction relies upon a value that may be rolled back by another transaction. Consider T1: s1: Add 5 to X s2: COMMIT Consider T2: s3: Read X s4: Add X to Y s5: Write Y s6: COMMIT



If s1, s3, s4, s5, s6 execute, Y now has a value that will be incorrect if T1 does not execute s2, but instead rolls back.



(C) Frieder, Grossman, & Goharian 1996, 2002



220



Inconsistent Analysis

Consider a transaction T1 that is computing the average salary of all employees; it scans all employee records. Consider a transaction T2 that updates Hank’s salary. If transaction T2 is executed and completes before T1 is finished, but after T1 has examined Hank’s salary, the result of T1 will be incorrect.

221



(C) Frieder, Grossman, & Goharian 1996, 2002



Concurrency Control

Two different themes surround concurrency control algorithms:

» pessimistic

– These algorithms are based on the premise that conflict will occur, they use locks to force conflicting requests to wait until a lock is released.



» optimistic

– These algorithms assume that conflicts are rare. Hence, a transaction executes without any waiting on others, but at the end, a check is made to determine if a conflict occurred (via timestamps). If so, the entire transaction is rolled back.

(C) Frieder, Grossman, & Goharian 1996, 2002



222



Locking

Two Phase Locking (2 PL) algorithms have two phases, growing and shrinking, and are serializable. As a transaction progresses, it only acquires new locks, growing. Upon commit, the transaction releases all locks. A transaction that adds locks, releases some, and adds again is not 2 PL and may not be serializable.

(C) Frieder, Grossman, & Goharian 1996, 2002



223



Types of Locks

Share Lock

» This is a read lock. Other users may read data have a share lock, but no users may write to data that contain a share lock.



Exclusive Lock

» This is acquired during a write. No users may read data that have an exclusive lock. For the lost update problem, the UPDATE statements result in exclusive locks which preclude the interleaving seen in the example.

(C) Frieder, Grossman, & Goharian 1996, 2002



224



Lock Granularity

Locks may be held at the row, page, or table level. A page or table level lock saves space in the system lock table as only one lock may serve to lock millions of tuples, but concurrency is sacrificed.



(C) Frieder, Grossman, & Goharian 1996, 2002



225



Lock Hierarchy

When a request for a lock on a single tuple is issued, an INTENT lock is acquired for each level of the hierarchy above it. This precludes a user issuing a page level lock and acquiring access to data already held by a row level lock.

Table Page 1 row 1

(C) Frieder, Grossman, & Goharian 1996, 2002



Page 2 row2 row 3 row 4

226



Lock Hierarchy (continued)

When user 1 requests a share lock on row1, an intent share lock is placed first on the table, and then on the page, and finally a share lock is placed on row 1. Table IS - first Page 1 IS - second row 1 IS third row2 row 3



Page 2 row 4



At this point, a request for an exclusive lock on page 1 will be denied as an intent share lock exists.

(C) Frieder, Grossman, & Goharian 1996, 2002



227



Lock Wait

When a user requests data that are currently locked, a wait ensues. Some DBMS allow a timeout to be specified that governs the longest a transaction must wait before a timeout. This timeout must be balanced with the need to run many, valid transactions. All DBMS provide a means of viewing the lock table, identifying the resources users are waiting on since this information is critical for resolving concurrency control problems.

(C) Frieder, Grossman, & Goharian 1996, 2002



228



Lock Levels

Due to the need for improved concurrency DBMS, allow users to sacrifice integrity for run-time performance. This is done via different lock levels which may be set dynamically. Some commercial systems support up to five different lock levels. We describe the two most common. Repeatable Read (RR) refers to the strongest lock level. It is the type of lock we have discussed in which all locks are held until the end of a transaction. The key to repeatable read is that if a transaction reads the same datum twice, it is ensured that it will have the same value both times.

(C) Frieder, Grossman, & Goharian 1996, 2002



229



Cursor Stability (CS)

For many applications, it not necessary to ensure maximum integrity. A transaction using repeatable read that is scanning a one billion row table effectively causes all users who are updating rows in the table to enter a lengthy lock wait. Cursor stability states that only the current rows being scanned will be locked. Immediately after the row is scanned the lock is released. For the 1 billion row table, under repeatable read, the number of locks could grow to 1 billion, with cursor stability, only one row is locked throughout the transaction.

(C) Frieder, Grossman, & Goharian 1996, 2002



230



Example

RR after one row is read Row 1 (lock) Row 2 Row 3 ... Row 1,000,000 RR after one million Row 1 (lock) Row 2 (lock) Row 3 (lock) ... Row 1,000,000 (lock)

(C) Frieder, Grossman, & Goharian 1996, 2002



CS after one row is read. Row 1 (lock) Row 2 Row 3 ... Row 1,000,000 CS after one million Row 1 Row 2 Row 3 ... Row 1,000,000 (lock)

231



Deadlock

A deadlock occurs when two users are waiting on each other to release resources. User 1: User 2: s1: read A s2: read B s3: write B s4: write A After s1 and s2 execute, user 1 and 2 each hold a share lock on A and B. With s3, user 1 begins a wait on user 2. With s4, user 2 begins a wait on user 1. All commercial systems implement deadlock detection periodically, and once detected, a victim is chosen and the transaction is rolled back.

(C) Frieder, Grossman, & Goharian 1996, 2002



232



Performance Tuning

Concurrency can often be improved by using cursor stability instead of repeatable read. Many applications do not have a requirement for repeatable read, but it is the default for most systems. Many developers do not test an application under expected workloads. A single user test does not test any concurrency. It is essential to test multiple users prior to delivering an application. Applications should include code to test for the presence of a deadlock (SQLCODE = 911) or they will abnormally terminate when one occurs.

(C) Frieder, Grossman, & Goharian 1996, 2002



233



Integrating Structured Data and Text: A Relational Approach



(C) Frieder, Grossman, & Goharian 1996, 2002



234



Mapping Text onto Relations

Relation definition: DOCUMENT (DocID, Docname, Headline, Dateline) TERM (Term, df, idf) INDEX (DocID, Termcnt, Term) All inverted index entries e.g., vehicle term vehicle vehicle vehicle D1, D3, D4 results in:



docID

D1 D3 D4

235



(C) Frieder, Grossman, & Goharian 1996, 2002



Text Retrieval Conference (TREC) Sample Document

AP881214-0028 AP-NR-12-14-88 0117EST u i BC-Japan-Stocks 12-14 0027 BC-Japan-Stocks,0026 Stocks Up In Tokyo TOKYO (AP) The Nikkei Stock Average closed at 29,754.73 points up 156.92 points on the Tokyo Stock Exchange Wednesday.



(C) Frieder, Grossman, & Goharian 1996, 2002



236



Relational Document Representation

DOCUMENT

DocID 28 Docname AP881214-0028 Headline Stocks Up In Tokyo Dateline TOKYO (AP)



INDEX

DocID 28 28 28 28 28 28 28 28 28 Termcnt 1 2 1 1 2 1 1 1 1 Term nikkei stock average closed points up tokyo exchange wednesday



TERM

Term average closed exchange nikkei points stock tokyo up wednesday df 2265 2208 2790 234 1627 2674 725 12746 6417 idf 1.08 1.08 1.00 2.07 1.23 1.00 1.58 0.30 0.60

237



(C) Frieder, Grossman, & Goharian 1996, 2002



Relational Query Representation

QUERY TERM nikkei stock exchange american TERMCNT 1 2 2 1



ORIGINAL QUERY: “nikkei stock exchange american stock exchange”



SQL:



(Query Weight * Document Weight) SELECT d.docname, SUM(a.termcnt * c.idf * b.termcnt * c.idf) FROM QUERY a, INDEX b, TERM c, DOCUMENT d WHERE a.term = b.term AND a.term = c.term AND b.docid = d.docid GROUP BY d.docname ORDER BY 2 DESC

238



(C) Frieder, Grossman, & Goharian 1996, 2002



Sample Term Query Result

(Inner/Dot Product)



Term nikkei stock exchange american



Q-Termcnt Q-Weight D-Termcnt D-Weight 1 2 2 1 _____ 2.07 2.00 2.00 0.60 1 2 1 0 2.07 2.00 1.00 0.00



Q-Wt * D-Wt 4.28 4.00 2.00 0.00



Similarity Coefficient 10.28



(C) Frieder, Grossman, & Goharian 1996, 2002



239



Similarity Coefficients

Several similarity coefficients based on the query vector X and the document vector Y are defined:



Inner Prod uct



i=1



∑ xi ⋅ yi



t



Cosine Coefficient



i=1 t i=1



∑ xiyi

i=1



t



x i2 • ∑



y i2 ∑

240



t



(C) Frieder, Grossman, & Goharian 1996, 2002



Sample Relevance Ranking Query

SELECT c.qryid, b.docid, SUM(((1+LOG(a.termcnt))/((b.logavgtf)* (229.50439 + (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf)))) FROM trec6$d5$idx a, trec6$d5$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d WHERE a.docid = b.docid AND c.qryid = d.qryid AND a.term = c.term AND c.qryid = 301 GROUP BY c.qryid, b.docid UNION SELECT c.qryid, b.docid, SUM(((1+LOG(a.termcnt))/((b.logavgtf)* (229.50439 + (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf)))) FROM trec6$d4$idx a, trec6$d4$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d WHERE a.docid = b.docid AND c.qryid = d.qryid AND a.term = c.term AND c.qryid = 301 GROUP BY c.qryid, b.docid ORDER BY 3 DESC;



(C) Frieder, Grossman, & Goharian 1996, 2002



241



Distributed Database Systems



(C) Frieder, Grossman, & Goharian 1996, 2002



242



Overview

Distributed DBMS allow data stored at multiple sites to be accessed from a single site. One query may join data from two different sites. The key motivation for a distributed DBMS is to move data closer to the users.

California DBMS 1

Data frequently accessed by California users



Chicago DBMS 2

Data frequently accessed by Chicago users



New York DBMS 3

Data frequently accessed by New York users



(C) Frieder, Grossman, & Goharian 1996, 2002



243



Overview (continued)

Homogeneous DBMS

» Each site contains the same implementation of a DBMS, (e.g., all sites are running Oracle)



Heterogeneous DBMS

» Different DBMS are used at different sites (e.g, some sites are Oracle, some are IBM, and some are Informix)



(C) Frieder, Grossman, & Goharian 1996, 2002



244



Overview (continued)

For the remainder of this discussion, assume that a homogeneous distributed DBMS is used. In practice, heterogeneous DBMS exist, but they require an additional layer of software that serves as a “global coordinator” or “mediator” of all the different DBMS. Today, many research efforts focus on both internet and intranet mediators

245



(C) Frieder, Grossman, & Goharian 1996, 2002



Distributed Concurrency Control

An update over more than one site requires careful concurrency as it is possible that one site may fail while others have committed. To avoid, this a Two Phased Commit (2PC) protocol is used. A single site from which the update originates is the controlling site.



(C) Frieder, Grossman, & Goharian 1996, 2002



246



Two-Phased Commit

Phase 1

» Send update to all sites » Receive acknowledgments



Phase 2

» After receiving all acknowledgments, send COMMIT to all sites. » If all acknowledgments are not received in a certain pre-defined time period, a ROLLBACK is sent to all sites. » Once, the COMMIT is sent, all sites commit the data. If a site fails before it receives the COMMIT, it will receive the COMMIT upon restart.



(C) Frieder, Grossman, & Goharian 1996, 2002



247



Two-Phased Commit (example)

Assume EMP exists on site A and DEPT exists on site B. Transaction T1, adds a new department and several employees who work in that department. Assume T1 originates at site C, so site C will be the controlling site. A Update EMP C Phase 1: (EMP) (start) B Update Dept (DEPT) C (start)

(C) Frieder, Grossman, & Goharian 1996, 2002



Ack



A (EMP) B (DEPT)



Ack



248



Two-Phased Commit (example)

After Phase one, site C has received all acknowledgements and is now ready to send final commit. C (start)

COMMIT



A (EMP) B (DEPT) A (EMP) B (DEPT)

249



Phase 2:



COMMIT



C (start)



Ack



Ack



(C) Frieder, Grossman, & Goharian 1996, 2002



Two-Phased Commit Failure During Phase 1

Consider a case where site B fails after receiving the request for update but site A succeeds: A Update EMP C Phase 1: (EMP) (start) B Update DEPT (DEPT) C (start)

Ack



A (EMP)



Site C receives only one acknowledgment but was waiting for two, so a rollback is sent to all sites.

(C) Frieder, Grossman, & Goharian 1996, 2002



250



Two-Phased Commit Failure During Phase 2

Consider a case where site B fails after sending the acknowledgment in Phase 1. Phase 2: C (start)

COMMIT



A (EMP) B (DEPT)



COMMIT



Site B will eventually restart, receive the COMMIT and phase two will complete. C (start)

Ack



A (EMP) B (DEPT)

251



Ack

(C) Frieder, Grossman, & Goharian 1996, 2002



Replication

Since 2PC processing is expensive, a cheaper alternative is to replicate data so that they are at each site. Many replication algorithms exist, the goal of which is to propagate an update to all replicas.

California DBMS 1

EMP Source



Chicago DBMS 2



New York DBMS 3



Replica of EMP



Replica of EMP



(C) Frieder, Grossman, & Goharian 1996, 2002



252



Write-All Copies

The Write-all copies takes an update to a table at site A and before commit sends the update to all sites that contain the replica. If any site is down, a commit does not occur.



(C) Frieder, Grossman, & Goharian 1996, 2002



253



Shadow Copies

A master copy is defined and one or more read-only replicas are created based on changes found in the log at the mater site.

Master Copy DBMS Site A Log Site A



Read-Only Replica Site B



DBMS Site B



DBMS Site C



Read-Only Replica Site C



(C) Frieder, Grossman, & Goharian 1996, 2002



254



Shadow Copies

Many replica algorithms exist that allow updates to the replicas, but they are not widely used in commercial products. All commercial vendors provide some form of read-only replicas based on changes found in the log at the master site.



(C) Frieder, Grossman, & Goharian 1996, 2002



255




Share This Document


Related docs
Other docs by Aashish Sharma
LAB0322Y
Views: 12  |  Downloads: 4
Overview of Flexfields
Views: 22  |  Downloads: 7
115finswe
Views: 0  |  Downloads: 0
115hrfdig
Views: 72  |  Downloads: 2
LAB31FDY
Views: 11  |  Downloads: 3
csg1
Views: 21  |  Downloads: 2
Oracle_DBA_FAQ
Views: 152  |  Downloads: 65
EDU23ABY
Views: 17  |  Downloads: 3
SBNewsletter2005JanuaryFebruary
Views: 2  |  Downloads: 0
by registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!