					  CMSC 424
Database Design
  Section 401
Dr. David Kuijt
                         Contact Info

Professor: David Kuijt
•   Office: AVW 3205
•   Phone: 5-0534
•   Email: kuijt@cs.umd.edu
•   Office Hours: T/Th 3:15-4:00
    (or by appointment)
TA: Debbie Heisler
•   Office: AVW 3270
•   Phone: 405-7027
•   Email: heisler@cs.umd.edu
•   Office Hours: TBA
                   Basic Information

• Required text:
   Korth & Silberschatz Database System Concepts, Fourth Edition,
     McGraw Hill 2001.
• Warnings:
   – Late homework or projects are not accepted. Hand in what you
     have finished. Exceptions will be made only for emergencies or
     medical reasons with a doctor's note.
   – No makeup exams. Exceptions as above.
   – Cheating will result in an immediate grade of XF ("failure through
     academic dishonesty" -- this goes on your permanent transcript),
      and may result in suspension or expulsion from the University. This is
     your only warning. Don't do it.
                      Motivation

We live in a database world. The simplest acts are tied to
  databases. The last time I called out for delivery pizza, it
  involved at least four enormous databases. What were
  they?
• Pizza Hut knew what I had ordered before -- they asked if I
  wanted the same pizza as last time. They probably store
  lots more information than that -- perhaps all my old order
  information. They could use this information to make
  corporate decisions (quantities of materials to order;
  forecasting pizza trends) as well as a local aid.
                    Motivation (2)

• Every delivery food place around here uses caller ID.
  That's a relatively simple database, just giving names and
  incoming telephone number, but it helps them avoid some
  types of fraudulent orders and errors when writing down
  names, addresses, and stuff.
• I used a telephone. When you use a telephone, all the
  details about the call are stored in a database. Call length,
  what number you called, time of call, billing information,
  and so on. Cell phone databases are even more complex.
                       Motivation (3)

• I paid by credit card. Huge databases are involved.
• Every time somebody runs your credit card through a
  swipe reader or types in the number, they're checking
  information in a database.
   –   Is this a valid credit account?
   –   Does it have enough money to cover the bill?
   –   Is the credit card stolen?
   –   Credit the merchant account, debit the consumer account
• Whether the transaction is accepted or rejected, all the
  details are recorded in a database somewhere.
                       More Motivation

• Every time you go to an ATM, use a credit card, buy something with a
  UPC bar code at a supermarket or department store, go to a movie,
  concert, or Caps game, register for classes, or get a parking ticket in
  Lot 4, you are working with databases.
• Everything in your wallet that isn't a photograph is an entry in a
  database somewhere
    –   medical plan cards
    –   credit cards
    –   student ID
    –   driver's license
    –   membership cards in clubs or interest groups
    –   everything. Even the currency!
                   Final Motivation

• Databases are all around us.
   – Knowledge is power.
   – Databases give us power (the ability to do things we couldn’t
     otherwise do)
   – they give other people power over us, and knowledge about what
     we are doing.
   – This class is about Databases.




                  So what is a Database?
                 What is a Database?

At the simplest level, a database architecture has two
   components.

(1) Data.
   – Usually a whole lot of it
   – Representing multiple types of different objects
   – Each type may be related to itself and to other types in multiple
     ways


(2) A set of methods to access and manipulate the data.
                            Data

• For any reasonable-size database the data may be quite
  complex.

• It is an attempt to record or model all the aspects of the real
  world that are important to one specific purpose --
  telephone calls, for example, or credit card accounts.

• Lots of different objects need to be stored as data, and they
  need to be stored in such a way as to reflect the ways that
  the objects can interact with each other.
                     Database Example

• For example, consider a local telephone system database.
  Types of data stored include:
   – Account information:
       • customers: individuals, groups, companies that have leased numbers;
       • billing addresses, payment history, calling plans and billing contracts;
   – Hardware information:
       •   network structure (call routing),
       •   hardware age, reliability, and maintenance information,
       •   system load tracking,
       •   network billing pattern (what numbers are long distance from what
           other numbers, and what ones are local)
                Database Example (2)

• Local telephone system database continued. Additional
  types of data would include:
   – Call information:
       • start and end time,
       • telephone number that initiated the call;
       • telephone number(s) that received the call


• All that information could be stored in files with much less
  fuss and bother. Why use a database? Why not just store
  the information in flat files?
                 Why Not Flat Files?
• Why use a database? Why not just use a flat file?
• Databases have a number of advantages over flat files.
   – Data Access. The set of programs that provide access to a
     database allow much more complex and flexible queries to the
     database with greater efficiency and convenience.
   – Reduced duplication and better control over data consistency.
     Data redundancy is bad. Which item to change in an update?
     How do you know that you've found all the copies? Data
     inconsistency (disagreement between various copies of the same
     data) is a serious problem.
   – Integrity Constraints can be enforced inside a database -- telephone
     numbers must all be 10 digits; phone numbers in Maryland must all
     begin with the area code 410 or 301.
           Why Not Flat Files? (2)
– Uniform access and control of data using a standard language
– Data Independence. We want the data to be independent of the
  representation chosen for it within the system. Tying the data to a
  given representation is what caused the Y2K fuss -- only two digits
  were used for a "year" field.
– Concurrency control. Multiple users on a single database is a big
  advantage.
– Recovery.
– Security. Different users of the database may need different levels
  of access to information.
– Centralized Control
– Platform independence (portability). Since the internal file
  structure and access program details are hidden from the user, it is
  much easier to use the database on multiple platforms.
                  Data Abstraction

• Most users don't need to understand all the details of the
  implementation and data design of a complex database. To
  make a database convenient to use, the system provides
  users with an abstract view of the data, limiting the
  information available to them. There are usually three
  levels of data abstraction.
   – Physical Level
   – Conceptual Level
   – View Level
               Data Abstraction (2)

• Physical level. The actual implementation details of low-
  level data structures are described at this level.

• Conceptual level. This level describes all the different data
  types that exist by defining a relatively small number of
  simple structures, including all the relationships that these
  data types have with each other. Implementation of these
  objects might be complex, but it is hidden from the user at
  this level. Database administrators are usually the only
  ones who have access at this level.
               Data Abstraction (3)

• View level. There may be multiple different views, each
  of which represents a simpler subset of the functions and
  data available at the conceptual level. Different user types
  may require different parts of the database (for example, a
  bank account database might be accessed by cashiers,
  account holders, credit card companies, and the bank's
  payroll manager. Each of them can only access a small
  part of the full database of bank account information).
  Creating a number of restricted views makes the database
  more useful for the individual user types, giving each type
  access according to the needs of that type.
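A preview in SQL terms (covered later in the course; a hypothetical
  sketch -- the view and column names here are invented):

       create view teller-view as
            select account-no, balance
            from account

  A cashier granted access only to teller-view can look up balances but
  sees nothing else of the bank's account information.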
                   Data Abstraction (4)

• Definition: a Schema is a
  specification of a particular
  database using a particular data
  model.

The three levels of data abstraction
   are often referred to as:
• External Schema(s) (for the
   view level(s)).
• Conceptual Schema (for the
   conceptual level)
• Internal Schema (for the
   physical level)
                    Database as Model
• A model represents a perception of a real system
• Models help us manage or understand the real world
  system they represent.
• When modeling a system we select aspects and
  characteristics we want to represent; we abstract them to
  form a simple(r) system
   – examples: a map, an airplane flight simulator, computer weather
     analysis program
• A database is a model of reality
     Data Models underlying the Database
• The data model is a collection of conceptual tools for
  describing data and its attributes
   – data objects
   – interrelationships of the data
   – data semantics and consistency constraints
• There are two well-established data models used in
  database design
   – Entity-Relationship (E-R) model
   – Relational model
   – older methods included the Network and Hierarchical data models
       • Each was tied closely to the underlying implementation, which made
         it more difficult to model data and to modify or update the database.
         As a result they aren’t much used any more
              Entity-Relationship Model
• Diagram based model
• Two primitives
   – Entities -- each represents a unique real-world object
   – Relationships -- each represents an association among several
     entities
   – Entities and relationships are grouped into sets of the same type
     (for example, one entity set might be customer, representing the set
     of all entities that represent customers at a given bank)
• Third important notion: Attributes
   – Entities are associated with a set of attributes
Diagram-based Model
                              Entities
• Entity: a distinguishable object we want to model
   – e.g., room CSI 3120, Celine Dion, Elizabeth I of England
• Entities have attributes (single-valued properties)
   – e.g., a person has a name, SSN#, gender, …
   – if an attribute has more than a single value, we should model it as
     an Entity
• Entity Set: a set of entities of the same type
   – e.g., CLASSROOMs, SINGERs, HISTORICAL MONARCHs
• Entity Sets may overlap
   – CSI 3120 is a member of CLASSROOMs and also a member of
     CSI BUILDING ROOMs.
                        Relationships
• Relationship is an association among entities
   – David Kuijt teaches-in CSI 3120
• Relationship Set is a collection of relationships of the same
  type
   – FACULTY teach-in CLASSROOMs
• Relationships may also have attributes
   – e.g., the relationship teach-in has an attribute “weekday” and
     another attribute “time” to store the day and time in which a given
     Entity of the set FACULTY teaches in a given Entity of the type
     CLASSROOM
             Example Database Design (1)
• Application: library database. Authors have written books
  about various subjects; different libraries in the system
  may carry these books.
• Entities (with attributes in parentheses):
   –   Authors (SS#, name, tel, birthdate)
    –   Books (ISBN, title)
   –   Subjects (sname)
   –   Libraries (lname)
• Relations [associating entities in square brackets]:
   – Wrote-on [Authors, Subjects]
   – Carry [Libraries, Subjects]
   – Index [Subjects, Books]
Diagram of Initial Database Design
                      Poor Initial Design
• Our first design is a poor model of the real-world system
  we are examining. Problems in our first design:
   – no relationship associating authors and books
   – no relationship associating libraries and books
   – common queries will be complex and difficult:
       •   Q: what libraries carry books by a given author?
       •   Q: what books has a given author written?
       •   Q: who is the author of a given book?
       •   Q: how many copies of a given book exist at each library?
       •   Q: what edition of a book does the library have?
             Example Database Design (2)
• Application: library database as before
• Entities (with attributes in parentheses):
   –   Authors (SS#, name, tel, birthdate)
    –   Books (ISBN, title)
   –   Subjects (sname)
   –   Libraries (lname)
• Relations [associating entities in square brackets]
  (attributes in parentheses):
   – Wrote [Authors, Books]
   – In-stock [Libraries, Books] (quantity, edition)
   – Index [Subjects, Books]
Diagram of Improved Database Design
                               Keys
• Fundamental concept for databases
• Must be able to uniquely identify things within a database
  (in the E-R model, Entities and Relationships)
   – Avoid duplication of results in a search; identify data redundancy
     in other operations
   – Halt search on positive results
   – Quick lookup in underlying data structures used at the Physical
     Level of abstraction
• Examples of possible keys
   – Student ID number (SS#) is used as a key for most UMD databases
     having to do with students
                            Entity Keys
• Superkey: set of attributes whose values uniquely identify the entity
• candidate key: a minimal superkey (a minimal subset of a superkey
   whose values still uniquely identify the entity)
• primary key: if there is more than one possible candidate key, one is
   chosen as the primary one used for most entity-identification purposes
• weak entity: has no primary key; instead it depends upon another
   strong entity’s primary key to exist
    – e.g., CHILDren of EMPLOYEEs are weak; the primary key of
       EMPLOYEE in addition to the attributes of the CHILD are used
       for identification
    – weak entities are “existent dependent” on a strong entity -- when
       the strong entity gets deleted, so does the weak one
                     Relationship Keys
• Depend upon the entity mapping of the relationship
   – one-one: the primary key of any of the entities can be used to
     uniquely distinguish a given relationship between two unique
     entities.
   – one-many: the primary key of the “many” entity, plus possibly a
     subset of the attributes of the relationship, will uniquely identify a
     given relationship
       • e.g., MOTHER gave-birth-to CHILD; to identify a specific gave-
         birth-to relationship requires the primary key of MOTHER and
         possibly the (date) and (time) attributes of gave-birth-to
   – many-many: the union of the primary keys of the entities
     associated, plus possibly a subset of the attributes of the
     relationship, will uniquely identify a given relationship
       • e.g., PERSON married PERSON; SS# of both and possibly date
                        Special Cases
• Relationships may associate different entities of the same
  type

• Ternary versions of the above

• M-N relationships: many-one mappings are often more
  useful in practice than many-many mappings.
   – DUMMY Entities can be used to convert an M-N mapping
     relationship to a pair of relationships, one M-1 and one N-1.
           Specialization-Generalization
                 (ISA Hierarchy)
• This is a way to represent entity complexity

• specialization: top-down refinement of entities with
  distinct attributes
   – Entity type BANK ACCOUNT might be subdivided into related
     but different types CHECKING ACCT and SAVINGS ACCT


• generalization: bottom-up abstraction of common attributes
   – Course types DATABASE, SYSTEM, and NETWORK all have
     common attribute (project). From them we can abstract a new
     course type PRACTICAL COURSE
   – other common course attributes are included (e.g., course number)
               ISA Hierarchy Example:
                Top-down Refinement
• Account entity with
  attributes balance and
  number
• additional complexity: we
  want to represent two
  subtypes of account
   – Savings Account with
     attribute Interest Rate
   – Checking Account with
     attribute Overdraft Limit
              ISA Hierarchy Example:
               Bottom-up Abstraction
• Three related entities with
  similar attribute project
• we abstract a new type of
  super entity Practical
  Course and link the three
  entities as subtypes
• other shared attributes
  (e.g., course number) are
  also promoted to the upper
  level entity
                      Aggregation
                   (Part-of Hierarchy)
• This is a way to represent relationship complexity
   – relationships among relationships are not supported by the E-R
     model
   – often we want to model lower-level relationships differently



• Groups of entities and relationships can be abstracted into
  higher level entities
                Part-of Hierarchy Example

• Entities driver, car, tires, doors,
  engine, seats, piston, valves
• Relationship drives is
  insufficient to model the
  complexity of this system
• Part-of relationships allow
  abstraction into higher level
  entities (piston and valves as
  parts of engine; engine, tires,
  doors, seats aggregated into
  car)
       Mapping an E-R Schema to Tables

• Motivation - translating E-R database designs into
  Relational designs
   – Both models are abstract, logical representations of a real-world
     enterprise
   – Both models employ similar design principles
   – Converting an E-R diagram to tables is the way we translate an E-
     R schema to a Relational schema.
   – Later on we’ll examine how to convert a Relational schema to an
     E-R schema
  Mapping an E-R Schema to Tables (2)

– Strong Entity E with primary key PK and attributes A, B, …
  ==> E(PK, A, B, …)
– Weak Entity F with (non-primary) key WK and attributes C, D, …
  depending upon E above for primary key
  ==> F(PK, WK, C, D, …)
– Relationship R with attributes L, M, … and associating Entities E
  (with primary key PK), E2 (PK2), E3 (PK3), …
  ==> R(PK, PK2, PK3, …, L, M, …)
– Relationships between weak entities and the strong one on which they are
  dependent usually do not require representation because it is usually a
  many-one relationship with no attributes on the relationship (they are on
  the weak entity) and so the resulting table R(PK, WK) is a subset of the
  weak entity itself.
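– As a concrete sketch, applying these rules to the earlier library
  example (the column types here are assumptions, not from the slides):

     Create table authors
         (ss#          char(9) not null,
         name          char(30),
         tel           char(10),
         birthdate     date,
         primary key   (ss#))

     Create table books
         (isbn         char(13) not null,
         title         char(50),
         primary key   (isbn))

     Create table wrote
         (ss#          char(9) not null,
         isbn          char(13) not null,
         primary key   (ss#, isbn),
         foreign key (ss#) references authors,
         foreign key (isbn) references books)

  The strong entities become tables keyed by their own primary keys; the
  many-many relationship Wrote becomes a table keyed by the union of the
  two primary keys.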
                       Table Details

• The whole table represents a single Entity Set or
  Relationship Set.
• Each entry (row) in the table corresponds to a single
  instance (member in that set)
• For an Entity Set each column in the table represents an
  attribute in the E-R diagram
• For a Relationship Set each column in the table represents
  either an attribute of the Relationship or one of the parts of
  the primary key of the Entity Sets it associates
 Mapping an E-R Schema to Tables (3)

– ISA relationships: choose either to
    • Represent the super class entity, then represent each subclass with the
      primary key of the super class and its own attribute set. This is very
      similar to the way weak entities are treated.
    • Or, map the subclasses to separate relations and ignore the whole
      super class. This is good when the subclasses partition the whole
      superclasses between them (the subclasses are disjoint and the union
      of the subclasses covers the whole super class).
– Aggregate (part-of) relationship
    • Translation is straightforward -- just treat the aggregate as an entity
      and use the methods defined above.
– With last week’s lecture, this covers the material of chapter 2.
              Relational Database Model

• Most popular logical data model
• Relations (also called tables) represent both Entity Sets
  and Relationship Sets.
   – Attributes form the columns of the table (column and attribute are
     synonymous)
   – Each row (also called a tuple) represents a single entity or
     relationship
• Each instance of an attribute takes values from a specific
  set called the domain of the column (the domain defines
  the type)
        Relational Database Model (cont)

• A relation schema is made up of the name and attributes of
  a relation with their underlying domains
• A database schema is a set of all relation schemas.
• The notions of keys, primary keys, superkeys are all as
  previously described
                     Query Languages

• a language in which a user requests information from the
  database
   – a higher level language than standard programming languages
• query languages may be procedural or non-procedural
   – procedural languages specify a series of operations on the database
     to generate the desired result
   – non-procedural languages do not specify how the information is
     generated
   – most commercial relational database systems offer a query
     language that includes procedural and non-procedural elements
                    Relational Algebra

• procedural query language
• set of operators that map one or more relations into another
  relation
• closed algebraic system
   –   best feature -- operations can be applied to the results of other
       operations, forming relational algebraic expressions
• two types of operations: set-theoretic and database specific
         Relational Algebra Operations

• database specific:
   –   (horizontal) selection (σ)
   –   (vertical) projection (π)
   –   join
   –   outer join
   –   semijoin
   –   division
• set operators
   –   union
   –   difference
   –   intersection
   –   cartesian (cross) product
                Example Relations

EMP   ename    salary  dept     DEPT dept    floor   mgr
      Gary     30K     toy           candy   1       Irene
      Shirley  35K     candy         toy     2       Jim
      Christos 37K     shoe          men     2       John
      Robin    22K     toy           shoe    1       George
      Uma      30K     shoe
      Tim      12K     (null)
            Database Specific Operators

• (horizontal) selection (σ)
   – picks a subset of the rows


• (vertical) projection (π)
   – picks a subset of the columns


• join
   – creates a new relation (table) out of two
         • equijoin (based upon equality of attributes)
         • natural join (equijoin plus projection to eliminate duplicated columns)
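• examples over the EMP and DEPT relations shown earlier (a sketch):
     σ dept=“toy” (EMP)         -- the EMP rows for the toy department
     π ename, salary (EMP)      -- only the ename and salary columns
     EMP ⋈ DEPT                 -- each EMP row extended with the floor
                                   and mgr of its department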
                        Set Operators

• union
   – both relations must be union-compatible -- same degree and same
     domains
• set difference
   – both relations must be union-compatible as above
• intersection
   – same deal
• cartesian (cross) product
   – note similarity to join operation; join can be defined as a cross
     product followed by a selection criteria
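• as a worked illustration of that last point, the equijoin of EMP and
  DEPT on dept can be written as a cross product followed by a selection:
        EMP ⋈ EMP.dept=DEPT.dept DEPT  =  σ EMP.dept=DEPT.dept (EMP × DEPT)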
                        More Operators

• rename ()
   – results of operations in the relational algebra do not have names
   – it is often useful to be able to name such results for use in further
     expressions later on
   – conceptually similar to an assignment operator in most
     programming languages
• semijoin
   – very useful in practical implementation of large queries
   – semijoin of R and S is equivalent to the join of R and S projected
     onto the attributes of R.
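   – example (a sketch): the semijoin of EMP with DEPT, over the relations
     shown earlier, keeps exactly the EMP rows whose dept value appears in
     DEPT -- every employee except Tim, whose dept is null -- with only
     the attributes of EMP in the result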
         Still More Operators: Outer Join

• outer join is an extension of the join operation to deal with
  missing information
   – three types: left outer join, right outer join, and full outer join
   – left outer join computes the natural join, then takes all tuples
     (rows) in the left relation that did not match on the join attribute
     and includes them in the result, with all attributes of the right
     relation padded with null values
   – right outer join is the same, except non-matching tuples in the right
     relation are included in the result padded with null values
   – full outer join includes all non-matching tuples of both relations
     appropriately padded
   – see examples in text, p108-109
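   – a worked sketch using the EMP and DEPT relations from earlier: Tim
     (dept null) matches no DEPT row, so he vanishes from the natural join;
     the left outer join keeps him as
               Tim   12K   (null)   (null)   (null)
     while the right outer join would instead keep the unmatched “men”
     department, padded with nulls on the EMP attributes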
                   Still More Operators

• division
   – R/S: given R(A,B) and S(B), then a given tuple t is in R/S if for all
     s in S there exists an r in R such that r.B=s.B and t.A=r.A.
   – So tuple t with attribute t.A is in the result if and only if R
     contained tuples (t.A, B1), (t.A, B2), (t.A, B3), … for every possible
     value Bi contained in S.
   – Note that S must be defined on a subset of the attributes of R for
     the operation to be meaningful.
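   – a small worked example: if R(A,B) contains (a1,b1), (a1,b2), (a2,b1)
     and S(B) contains (b1), (b2), then R/S = {(a1)} -- only a1 appears in
     R paired with every B value in S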
              A Short Interlude: Integrity

• the preceding slides covered chapter three up to section 3.3
• before attacking chapter 4 (SQL), we’re going to make a
  brief excursion up to chapter 6, touching sections 6.1 - 6.4
• Integrity constraints attempt to enforce data consistency
  and prevent accidental damage to the database during
  updates
• We’ve already seen two forms of integrity constraints:
   – key declarations (stipulating that certain attributes form a candidate
     key for a given entity set)
   – mapping form of a relationship (one-one, one-many, many-many)
                   Integrity Constraints

• Domain Constraints
   – simplest form of integrity constraint
   – type declarations are one such domain constraint (e.g., integer,
     floating point, double-precision, fixed length character string).
   – domains can be further restricted (e.g., a check clause in SQL can
     ensure that hourly wages are ≥ 4.00 dollars)
   – easily tested whenever a new data item is entered into the database
   – extensions like date or currency types can be easily supported in a
     strongly typed programming language
   – Null values can be useful for values to be filled in later, but some
     attributes may need to be specified as “not Null” (e.g., primary
     keys cannot have a null value)
                  Integrity Constraints (2)

• Key Constraints
   –   keys must have unique values
   –   primary key -- a candidate key declared primary
   –   unique key -- a candidate key
    –   foreign key -- a set of attributes that form the primary key of some
        other relation
        • foreign keys are an important concept because we need to treat
          foreign keys differently from other attributes (for example, protecting
           their uniqueness and ensuring referential integrity) even though they
          aren’t a primary key in the current relation
                Referential Integrity

– We often want to be able to ensure that an attribute value in a tuple
  of a relation appears in at least one tuple of another relation. For
  example:
             » EMP(eno, ename, salary)
             » DEPT(dno, dname, floor)
             » WORKS-IN(eno, dno, hours)
– note that eno is a foreign key in WORKS-IN
– We want the following to be true:
     • πeno(WORKS-IN) ⊆ πeno(EMP)          (every eno is a real employee)
     • πdno(WORKS-IN) ⊆ πdno(DEPT)         (every dno is a real department)
– SQL allows the declaration of domain/key/referential integrity
  constraints with the clause check in its DDL
     Referential Integrity: SQL DDL Example

Create table customer
    (cust-name        char(20) not null,
    cust-street       char(30),
    cust-city         char(30),
    primary key       (cust-name))

Create table branch
    (branch-name      char(15) not null,
    branch-city       char(30),
    assets            number,
    primary key       (branch-name),
    check             (assets >= 0))

Create table account
    (account-no       char(10) not null,
    branch-name       char(15),
    balance           number,
    primary key       (account-no),
    foreign key (branch-name) references branch,
    check             (balance >= 0))

Create table depositor
    (cust-name        char(20) not null,
    account-no        char(10) not null,
    primary key       (cust-name, account-no),
    foreign key (cust-name) references customer,
    foreign key (account-no) references account)
  Referential Integrity and Database Modifications

• Database modifications may violate referential integrity
• Insertion: inserting a value into the referencing relation
  that is not in the referenced relation
• Deletion: deleting the last tuple containing a given value in the
  referenced relation while that value still appears in the
  referencing one
   – proper handling may lead to cascading deletions
• Update to the referencing relation (constraints as Insertion)
• Update to the referenced relation (constraints as Deletion)
                          Assertions

• An assertion is an arbitrary expression that the database
  must always satisfy
   – e.g., student GPA > 2.8, or sum(all-charges) < credit-line
   – Domain constraints and referential integrity constraints are special
     forms of assertion that are easy to test
   – SQL supports assertions as follows:
           create assertion <assertion-name> check <predicate>
   – When an assertion is made the system checks it for validity. If it is
     validated, every future modification of the database is checked
     against the assertion and allowed only if it is not violated.
   – This can be very expensive if assertions are complex or numerous
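   – a sketch using the bank schema from the earlier DDL example (the
     assertion name is invented):
            create assertion balance-check check
                 (not exists (select * from account where balance < 0))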
                               Triggers

• A trigger is a statement that the system executes automatically as a
  side effect of an update to the database.
• A trigger has two parts:
    – condition under which it is executed
    – actions to be taken if it is executed
• Example: instead of having an assertion “balance ≥ 0” for a checking
  account, use a trigger on negative balances that sets the balance to zero
  and creates a new loan for the amount of the overdraft
• Triggers make the system reactive
• Triggers are also called active rules
• Like Assertions, Triggers can be very expensive.
                         Trigger Example

define trigger overdraft on update of account T
   (if new T.balance < 0 then (insert into borrow values
        (T.branch-name, T.account-number,
                T.customer-name, - new T.balance)
        update deposit S
        set S.balance = 0
        where S.account-number = T.account-number))

(note: SQL syntax given here is slightly different from that in the text, p235)
        SQL (Structured Query Language)
                (Astrahan, Gray, Lindsay, Selinger, …)

• Most common and influential commercial query language; well
  established as the industry standard query language for relational
  databases
• Developed (as “Sequel”) at the IBM Research Lab in San Jose in the
  early 70s
• Four basic commands
   – select
   – insert
   – delete
   – update
• Result of each query is a relation
                        SQL Example

select              e.name
from                emp e
where               e.age > 30;
• e is a tuple variable ranging over the emp relation
• a tuple variable followed by a “.” and an attribute is an indexed tuple
    variable and specifies the corresponding attribute of the tuple, much
    as in many programming languages
• what follows the keyword select is the target list
• what follows from is the tuple variable list and consists of a list of
    relations and variable names
• what follows where is the qualification clause; an arbitrary boolean
    expression
                                    SQL

• Basic format of the select command
       select [distinct] target_list
       from tuple_variable_list
       where qualification
       [order by target_list_subset];
• Semantics
    – evaluate qualification: select the subset of the cartesian product of the
      ranges of the tuple variables that satisfy the qualification
    – evaluate target list: eliminate columns that are not in the target list
    – prepare the result as a relation with columns according to the target list
    – if distinct is used, eliminate duplicate tuples
    – if order by is used, sort the result accordingly
           SQL: some example queries

• We will give a number of simple query examples using the following
  relational schema:
        sailors(sid, sname, rating)
        boats(bid, bname, colour)
        reserve(sid, bid, date)



(1) Find the names of sailors who have reserved boat #2
    select       s.sname
    from         sailors s, reserve r
    where        s.sid=r.sid and r.bid=2
             SQL: example queries (2)

(2) Find the names of sailors who have reserved a red boat
    select       s.sname
    from         sailors s, reserve r, boats b
    where        s.sid=r.sid and r.bid=b.bid and b.colour=“red”




(3) Find the colours of all boats reserved by Pat
    select        b.colour
    from          sailors s, reserve r, boats b
    where         s.sname=“Pat” and s.sid=r.sid and r.bid=b.bid
              SQL: example queries (3)

(4) Find the names of sailors who have reserved at least one boat
    select       s.sname
    from         sailors s, reserve r
    where        s.sid=r.sid


(5) Find the names of sailors who have reserved a red or a green boat
    select       s.sname
    from         sailors s, reserve r, boats b
    where        s.sid=r.sid and r.bid=b.bid and
                 (b.colour=“red” or b.colour=“green”)
              SQL: example queries (4)

(6) Find the names of sailors who have reserved a red and a green boat
    select       s.sname
    from         sailors s, reserve r, boats b, reserve r2, boats b2
    where        s.sid=r.sid and r.bid=b.bid and b.colour=“red”
                 and s.sid=r2.sid and r2.bid=b2.bid and
                 b2.colour=“green”

Note: in the above query if sailor Pat has reserved one green boat and two
  red ones, the name Pat will appear twice in the results. To avoid that,
  use the keyword distinct in the select line, as in:
  select distinct s.sname
                               SQL

• Basic format of the select command
       select [distinct] target_list
       from tuple_variable_list
       where qualification
       [order by target_list_subset];


• Simple query examples use this relational schema:
       sailors(sid, sname, rating)
       boats(bid, bname, colour)
       reserve(sid, bid, date)
                        SQL: target list

• “*” is an abbreviation for all attributes in the from list
  select         *
  from           sailors s
  order by       s.rating



• Each item in the target list can be as general as attribute_name =
  expression, where the expression is any arithmetic or string expression
  over indexed tuple variables and constants. It can also contain some
  built-in functions like sqrt, sin, mod, etc. as well as aggregates (coming
  up later)
     SQL: target list expression example

With rating an integer from 1 to 10, this query gives a rating bonus to
  sailors who sailed two different boats on the same day.
  select          s.sid, s.sname, rating=s.rating+2
  from            sailors s, reserve r, reserve r2
  where           s.sid=r.sid and s.sid=r2.sid and r.date=r2.date and
                  r.bid != r2.bid


What’s wrong with the above?
   – What happens if s.rating = 9 before this query?
   – Domain constraints might take care of this, but we need to be
      careful
                                 SQL

Qualifications: each item in a qualification (where clause) can be as
  general as expression=expression

Example:
   select         name1 = s1.sname, name2 = s2.sname
   from           sailors s1, sailors s2
   where          2*s1.rating = s2.rating-1
                                 SQL

Further elaboration:
• tuple variables can be implicit if the system can figure out which
   relation each attribute belongs to
• table names can be used as tuple variables

Example: find names, ages, and departments of employees who are over
   40 and work on the first floor.
   select       ename, age, emp.dname
   from         emp, dept
   where        age>40 and floor=1 and emp.dname=dept.dname
                                SQL

SQL provides set operators: union, intersect, and minus

Example: find the names of employees who work in the toy department
   and make at most 60K
   (select       ename
   from          emp
   where         dname=“toy”)
minus
   (select       ename
   from          emp
   where         sal>60K)
                                  SQL

Note that it is usually possible to phrase a single query in multiple ways.
  The previous query could have also been written:

    (select       ename
    from          emp
    where         dname=“toy”)
intersect
    (select       ename
    from          emp
    where         sal<=60K)
                                 SQL

Or also (even simpler):

   select         ename
   from           emp
   where          dname=“toy” and sal<=60K

Writing a query in different ways will usually change how efficient the
  query is -- the above query is very likely to be faster than the example
  using intersect, and that one is likely to be faster than the one using
  minus.
                           SQL

SQL also provides set operators: contains (a set being a
  superset of another) and exists (a set not being empty).
  Both return Boolean results, so may be negated (using
  not).
                              SQL

Example: find the names of employees who manage all the departments
    on the first floor.
select mgr
from      dept d1
where (select d2.dname
          from dept d2
          where d1.mgr=d2.mgr)
contains
          (select dname
          from dept
          where floor=1)
                                       SQL

SQL allows nested queries using the keyword in
Example: find the names of employees who work on the first floor.
   select ename
   from emp
   where dname in
                    (select dname
                    from dept
                    where floor = 1)
The same query in flat form is
   select ename
   from emp, dept
   where emp.dname=dept.dname and floor=1
                                 SQL
The connective in tests for set membership. Similar connectives are:
     – not in (set non membership)
     – op any (op relationship with some tuple in the set)
     – op all (op relationship with all tuples in the set)
     where op is one of (=, !=, <, >, <=, >=)
Example: find the names of employees who make more than everybody on the
   first floor.
   select ename
   from emp
   where sal > all
                    (select sal
                    from emp, dept
                    where emp.dname=dept.dname and floor = 1)
                                    SQL
Scoping of variables works exactly as in Pascal or C

Example: find the names of students who take a course from their advisor.
   select sname
   from student
   where s# in
          (select s#
          from enroll
          where c# in
                     (select c#
                     from class
                     where prof=student.advisor))
                Recap: SQL

Four basic commands
  –   select
  –   insert
  –   delete
  –   update
                          SQL Insert

• Insert command format:
   insert into relation_name values (value_list)
   or
   insert into relation_name select_statement
• Semantics of insert
   – format one: add the tuple corresponding to value_list into
     relation_name
   – format two: execute the select statement, then add all the resulting
     tuples into relation_name
Example:
   insert into student values (1, “Carey”, “CS”, “Stonebraker”)
                       SQL Insert

Example: relation     register(S#, name, paid)

in which registered students are recorded. After the end of
   registration week, we execute:

insert into student
         select r.s#, r.name
         from register r
         where r.paid=“yes”
                        SQL Delete

• Delete command format:
     delete relation_name where qualification



• Semantics of delete: execute the corresponding select
  command:
   select      full_target_list          (or “*”)
   from        relation_name
   where       qualification
  and then remove the resulting tuples from relation_name
                       SQL Delete

Example: with the following schema
    student(s#, name, major, advisor)
    enroll(s#, c#, grade)
    course(c#, dept)
The following command expels CS majors who received a grade of
  less than 2.5 in a CS course:
    delete   student
    where    major=“CS” and s# in
                               (select s#
                               from enroll, course
                               where enroll.s#=student.s# and grade<2.5
                               and enroll.c#=course.c# and dept=“CS”)
                         SQL Update

• Update format
   update       relation_name
   set          target_list
   where        qualification



• Semantics of update: it is equivalent to executing:
   – insert into del_temp
            select *
            from relation_name
            where qualification
                           SQL Update

• Semantics of update (cont): … then executing
   – insert into app_temp
            select ext_target_list
            from relation_name
            where qualification
     (ext_target_list is identical to target_list in the original update
     command, but augmented with tuple_variable.attribute_name for all
     attributes of the range of tuple_variable that don’t appear in
     target_list.)

   – delete the tuples in del_temp from relation_name

   – add the tuples in app_temp to relation_name
                     SQL Update

Example: give a 10% grade raise to every CS major in CS564
   update     enroll
   set        grade=1.1*grade
   where      c#=“CS564” and s# in
              (select s#
              from student
              where major=“CS”)
                          SQL Update

Which is equivalent to:

    insert into del_temp
       select s#, c#, grade
       from enroll
       where c#=“CS564” and s# in
         (select s#
          from student
          where major=“CS”)

    insert into app_temp
       select s#, c#, grade=1.1*grade
       from enroll
       where c#=“CS564” and s# in
         (select s#
          from student
          where major=“CS”)
                    SQL Aggregates

Aggregate functions are functions that take a collection of
  values as input and return a single value. SQL supports
  five built-in aggregate functions:
   – average: avg
   – minimum: min
   – maximum: max
   – total: sum
   – cardinality: count
   using distinct to aggregate only unique values is often important with
      avg, sum, and count
                      SQL Aggregates

Example: find the number of students
   select num_of_students = count(s#)
   from student
       why do we not need to use distinct in this example?


Example: find the number of employee records
   select count (*)
   from emp
       if an employee appears more than once in the emp relation, for example
          if he had switched jobs or had two jobs, then this command would
          count that employee once for each record
                    SQL Aggregates

• Qualified Aggregates:
   Example: find the average age of employees in the toy department
       select avg(age)
       from emp
       where dname=“toy”
                SQL: Group By clause
• Group aggregates: groups of tuples are computed using the
  group by clause
    – the attributes given in the clause are used to form groups
    – tuples with the same value on all attributes in the clause are placed
      in one group

Example: in each department, find the minimum age of employees who make
   more than 50K
         select     dname, min(age)
         from       emp
         where      sal>50K
         group by   dname
                    SQL: Having clause
• Sometimes it is useful to state a condition that applies to
  groups in group by rather than to tuples. We do that in
  SQL with the having clause. SQL applies predicates of
  having after groups have been formed.

Example: find the average salary for employees under 30 for each
   department having more than 10 such employees
        select     dname, avg(sal)
        from       emp
        where      age<30
        group by   dname
        having     count(*)>10
              SQL: Multiple Group Bys

Example: using relation emp(ss#, ename, dept, cat, sal)

Count the employees and average monthly salary for each
  employee category in each department

   select       dept, cat, count(*), avg(sal)/12
   from         emp
   group by     dept, cat
         SQL: Multiple Group Bys

Select … from emp group by cat
Select … from emp group by dept

(the two result tables shown on the slide are omitted)
                 SQL: Multiple Group By

Select … from emp
group by dept, cat

(result table shown on the slide is omitted)

note that some dept/cat groups are empty.
          SQL: Examples on Having

Find the average salary of employees under 30 for each
   department with more than 10 such employees

   select     dname, avg(sal)
   from       emp
   where      age<30                (employee age under 30)
   group by   dname                 (group by department)
   having     10<count(*)           (group size > 10)
            SQL: Examples on Having

Find the average salary of employees under 30 for each
   department with more than 10 employees
   select           e.dname, avg(e.sal)
   from             emp e
   where            e.age<30                      (employee age under 30)
   group by         e.dname                       (group by department)
   having           10<any
        (select     count(ee.ename)               (number of employees in group)
        from        emp ee
        where       ee.dname=e.dname)              (… from the same dept as e)
                  (why is this query different from the previous one?)
          SQL: Examples on Having

Find categories of employees whose average salary exceeds
   that of programmers
   select     cat, avg(sal)
   from       emp
   group by   cat
   having     avg(sal)> (select avg(sal)
                       from emp
                       where cat=“programmer”)
           SQL: Examples on Having

Find all departments with at least two clerks

   select      dname
   from        emp
   where       job=“clerk”
   group by    dname
   having      count(*) >= 2
                   SQL: Examples

Find the names of sailors with the highest rating

   select      sname
   from        sailors
   where       rating = (select max(rating)
                        from    sailors)
                    SQL: Examples

For each boat, find the number of sailors of rating >7 who
  have reserved this boat

   select      bid, bname, count(s.sid)
   from        sailors s, boats b, reserve r
   where       s.sid=r.sid and r.bid=b.bid and rating>7
   group by    b.bid
                   SQL: Examples

For each red boat, find the number of sailors who have
  reserved this boat

   select     bid, bname, count(s.sid)
   from       sailors s, boats b, reserve r
   where      s.sid=r.sid and r.bid=b.bid
   group by   b.bid
   having     colour=“red”
                       SQL: Examples
Difference between the last two queries?

   – First one gave a qualification on the tuples
       (take all tuples of the multijoin
       discard tuples that do not fulfill ratings>7
       then group them by boat id
       then find the cardinality of each group)
   – Second one gave a qualification for the groups
       (take all tuples of the multijoin
       group them by boat id
       discard groups representing boats that are non-red
       find the cardinality of remaining groups)
            And Now, For Something
             Completely Different...
• The recent SQL material largely covers chapter 4, at least
  sections 4.1 through 4.6 and some of 4.9.
• Earlier we examined Relational Algebra, covering sections
  3.1 through 3.3
• Now we leave chapter 4 and head back to examine sections
  3.6 and 3.7, covering Relational Calculi
   – based upon predicate calculus
   – non-procedural query languages (descriptive rather than
     prescriptive)
   – we will examine two relational calculi: tuple calculus and domain
     calculus
                      Tuple Calculus
Query: {t | P(t)}         P: is a predicate associated with some relation R
                          t: is a tuple variable ranging over the relation R
                          t[A]: is the value of attribute A in tuple t
   – students in CMSC 424
       • {t | t ∈ enroll ∧ t[course#] = CMSC424}
   – students in CMSC 424 conforming with the CMSC-420
     prerequisite
       • {t | t ∈ enroll ∧ ∃ s ∈ enroll (t[course#] = CMSC424 ∧
                  s[course#] = CMSC420 ∧ t[ss#] = s[ss#])}
                      Tuple Calculus
• Quantifiers and free variables
   – ∀, ∃ quantify the variables following them, binding them to some
     value. (in the previous slide, s was bound by ∃)
   – A tuple variable that is not quantified by ∀ or ∃ is called a free
     variable. (in the previous slide, t was a free variable)


• Atoms
   – R(t)                where t is a tuple variable
   – t[x] Θ s[y]         where t,s are tuple variables and
                           Θ ∈ {<, ≤, =, ≠, ≥, >}
                        Tuple Calculus
• Formulas
   – an Atom is a Formula
   – If P and Q are Formulas, then so are (P), ¬P, P∨Q, P∧Q, and P⇒Q
   – If P(t) is a Formula, then so are ∃t P(t) and ∀t P(t)


• Equivalences
   –   ¬(P ∧ Q) ≡ ¬P ∨ ¬Q
   –   ¬(P ∨ Q) ≡ ¬P ∧ ¬Q
   –   ∀t P(t) ≡ ¬(∃t (¬P(t)))
   –   ∃t P(t) ≡ ¬(∀t (¬P(t)))
                     Tuple Calculus
• Safety
   – Math is too powerful; we can easily phrase expressions that
     describe infinite sets
                                {t | t  enroll}
   – These expressions are called unsafe
   – When we are dealing with finite sets, unsafe expressions happen in
     expressions that involve negation (¬)
   – We can avoid this problem by using an entirely positive (non-
     negated) scope as the first operand of any conjunction where we
     use negation. The first operand establishes the scope and the
     second one filters the established scope.
                 {t | t  enroll  t[course#]  CMSC-420}
                 Domain Calculus
• Another form of relational calculus
• Uses domain variables that take values from an attribute’s
  domain, rather than values representing an entire tuple
• Closely related to tuple calculus
• Domain Calculus serves as the theoretical basis for the
  query language QBE, just as the relational algebra we
  examined earlier forms the basis for SQL
• Expressions are of the form:
         {< x1, x2, x3, ..., xn> | P( x1, x2, x3, ..., xn) }
                    Domain Calculus
• Atoms
   – < x1, x2, x3, ..., xn> ∈ R
   – x Θ y                   where x,y are domain variables and
                               Θ ∈ {<, ≤, =, ≠, ≥, >}
   – x Θ c                   where c is a constant


• Formulas
   – an atom is a formula
   – If P and Q are formulas, then so are (P), ¬P, P∨Q, P∧Q, and P⇒Q
   – If P(x) is a formula and x is a domain variable, then ∃x P(x) and
     ∀x P(x) are also formulas
                     Domain Calculus
• Queries are of the form:
{< x1, x2, x3, ..., xn> | P( x1, x2, x3, ..., xn) }

• Examples

    {<ss#, course#, semester> | Enroll(ss#, course#, semester)}

    {<x, y, z> | Enroll(x, y, z) ∧ y = CMSC-424}
Reductions of Relational Algebra and Calculi




• Relational Algebra (sections 3.2-3.5), Tuple Calculus
  (section 3.6), and Domain Calculus (section 3.7) can be
  reduced to each other: they have equivalent expressive
  power. For every expression in one, we can compute an
  equivalent expression in the others.
            Functional Dependencies
• Important concept in differentiating good database designs
  from bad ones
• FD is a generalization of the notion of keys
• An FD states that the values of one set of attributes uniquely
  determine the values of another set of attributes.
Emp(eno, ename, sal)          key FDs:       eno => ename
Dept(dno, dname, floor)                      eno => sal
Works-in(eno, dno, hours)                    (eno,dno) => hours
                                              dno => dname
                                              dno => floor
             Functional Dependencies
• If α ⊆ R and β ⊆ R, then α => β holds in the extension
  r(R) of R iff for any pair of tuples t1 and t2 of r(R) such that
  t1(α)=t2(α), it is also true that t1(β) = t2(β)
• We can use FDs as
   – constraints we wish to enforce (e.g., keys)
   – for checking to see if the FDs are satisfied within the database
   R(   A   B   C   D)
        1   1   1   1              A => B satisfied?          no
        1   2   1   2              A => C satisfied?          yes
        2   2   2   2              C => A satisfied?          no
        2   3   2   3              AB => D satisfied?         yes
        3   3   2   4
            Functional Dependencies
• Trivial dependencies:            α => α
                                   α => β if β ⊆ α

• Closure
   – we need to consider all FDs
   – some are implied by others; e.g., FDs are transitive; if A=>B and
     B=>C, then A=>C
   – Given F = set of FDs, we want to find F+ (the closure of F: the set
     of all FDs logically implied by F)
                  Armstrong’s Axioms
• Reflexivity             if β ⊆ α then α => β
• Augmentation            if α => β then γα => γβ
• Transitivity            if α => β and β => γ then α => γ

Armstrong’s Axioms can be used to derive three additional useful rules:

• Union rule              if α => β and α => γ then α => βγ
• Decomposition rule      if α => βγ then α => β and α => γ
• Pseudotransitivity rule if α => β and γβ => δ then αγ => δ
                      FD Example
R(A, B, C, G, H, I)
  F=( A=>B
        A=>C
        CG => H
        CG => I
        B => H )

   F+=( A => H            transitivity: A => B => H
        CG => HI          union rule: CG => H, CG => I
        AG => I           augmentation: A=> C, AG => CG => I
        AG => H )         augmentation: A=> C, AG => CG => H
               Closure of Attribute Sets
• Useful to test if an attribute set is a superkey
• the closure + of a set of attributes  under F is the set of all attributes
  that are functionally determined by 
• there is an algorithm to compute the closure

R(A, B, C, G, H, I)        F=(A=>B, A=>C, CG => H, CG => I, B => H )
   Example: Algorithm to compute (AG)+
   starts with             result=(AG)
   A=>B expands            result=(AGB)
   A=>C expands            result=(AGBC)
   CG=>H expands           result=(AGBCH)
   CG=>I expands           result=(AGBCHI)
   B=>H causes no more expansion
              Uses of Attribute Closure
• Testing for superkey
    – to test if α is a superkey, compute α+ and determine if α+ contains all
      attributes of R


• Testing functional dependencies
    – to check if a functional dependency α => β holds (is in F+), just check
      to see if β ⊆ α+
    – in other words, we compute α+ using attribute closure, and then check if
      the result contains β.
    – Simple, cheap, and useful test.
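    – worked example, continuing R(A, B, C, G, H, I) from the previous
      slide: we computed (AG)+ = (AGBCHI), which contains every attribute
      of R, so AG is a superkey; by contrast A+ = (ABCH), so A alone is not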
          Relational Database Design
• A major goal in designing a database is to have a schema
  that
   – makes queries simpler (easy to phrase)
   – avoids redundancies and update anomalies (about which more
     later)
       Schema and Query Simplicity (1)
Example Schema 1:         EMP(eno, ename, sal, dno)
                          DEPT(dno, dname, floor, mgr)

Query 1: find all employees that make more than their manager
        select e.ename from EMP e, EMP m, DEPT d
                   where e.dno = m.dno and d.mgr=m.eno and e.sal>m.sal

Query 2: for each department, find the maximum salary
        select d.dname, max(e.sal) from EMP e, DEPT d
                  where e.dno = d.dno group by d.dno

Q1 requires two joins; Q2 requires a join and a group-by.
       Schema and Query Simplicity (2)
Example Schema 2: (a single relation)
               ED(eno, ename, sal, dno, dname, floor, mgr)

Query 1: find all employees that make more than their manager
        select e.ename from ED e, ED m
                   where e.mgr=m.eno and e.sal>m.sal

Query 2: for each department, find the maximum salary
        select e.dname, max(e.sal) from ED e
                  group by e.dno, e.dname

Q1 requires one join; Q2 requires just a group-by.
       Schema and Query Simplicity (3)
• How did we get simpler queries?
• Schema 2 was a more complicated relation with more information; in
  essence ED was EMP and DEPT from Schema 1 with the join pre-
  computed
• Should we just precompute the joins and store bigger relations?
• Taken to the extreme, we could compute the universal relation with all
  attributes inside it and null values wherever an attribute does not apply
• Why wouldn’t we want to do that?

• Problems with too-complex relations: repetition of information (data
  redundancy) and inability to represent certain information (update
  anomalies)
DB Design: Redundancy and Anomalies

• Redundancy (repetition of information)
   – each department is repeated for each employee in it
   – great risk of inconsistencies -- suppose the department is moved to
     a new floor?
   – A simple update (change in mgr name, department floor, etc) in
     Schema 1 becomes multiple updates in Schema 2
• Anomalies (inability to represent some types of
  information)
   – departments can’t exist without employees. A department cannot
     exist until the first employee is inserted, and it can no longer exist
     when the last employee is deleted from the ED relation
   DB Design: Dealing with Anomalies

• So complex relations make for simpler queries, but have
  the disadvantages of data redundancy and creation of
  anomalies. How do we balance the two objectives? We
  want:
   – simple queries
   – no anomalies; minimize data redundancy


• If we start with Schema 2 and discover anomalies we can
  decompose the relation(s) until the problems go away.
  This process is called normalization.
               Objectives of DB Design
                   (Normalization)
• no redundancy
    – for space efficiency and to reduce the potential for inconsistencies


• update integrity
    – avoid update anomalies


• linguistic efficiency
    – simpler queries are much better for the application programmer and for
      the query optimizer


• good performance
    – smaller relations imply more joins (bad)
                 Lossy Decompositions

• Not all decompositions are reversible (lossless)

Example:
Shipment(S#, P#, J#) decomposed into SP(S#, P#) and   SJ(S#, J#)
   s1    p1       j1                s1        p1      s1       j1
   s2    p2       j1                s2        p2      s2       j1
   s2    p3       j2                s2        p3      s2       j2
   s3    p3       j3                s3        p3      s3       j3
   s4    p4       j3                s4        p4      s4       j3
                  Lossy Decompositions
Shipment(S#, P#, J#) decomposed into SP(S#, P#) and                     SJ(P#, J#)
   s1     p1        j1                  s1        p1                    p1          j1
   s2     p2        j1                  s2        p2                    p2          j1
   s2     p3        j2                  s2        p3                    p3          j2
   s3     p3        j3                  s3        p3                    p3          j3
   s4     p4        j3                  s4        p4                    p4          j3
If we join SP and SJ again into SP-PJ(S#, P#, P#, J#) we get:
   s1     p1        p1        j1
   s2     p2        p2        j1
   s2     p3        p3        j2                  from the joined tuples we cannot
   s2     p3        p3        j3                  deduce the original form of the data.
   s3     p3        p3        j2                  this is called the connection trap
   s3     p3        p3        j3                  and the decomposition is lossy
   s4     p4        p4        j3
  Example of Lossy Join Decomposition

• Lossy-join decompositions result in information loss
• Example: decomposition of R=(A,B) into R1=(A) and R2=(B)
R= (A, B)       R1=     (A)                R2=      (B)
    α   1               α                           1
    β   2               β                           2

                        R1 X R2 = (A,      B)
                                   α       1
                                   α       2
                                   β       1
                                   β       2
             Decomposition Continued

• Decompose the relation schema
• All attributes of an original schema (R) must appear in the
  decomposition (R1, R2)
• Lossless (reversible) join decomposition: for all possible relations r on
  schema R, the decomposition into (R1, R2) is lossless if
                              r = R1(r)   R2 (r)


• The decomposition of R into R1 and R2 is lossless if and only if at least
  one of the following dependencies is in F+:
                                  R1  R2  R1
                                  R1  R2  R2
       Lossless Join Decomposition and
          Functional Dependencies
• So FDs can help determine whether a decomposition is lossless

• R is a relation schema and F its FDs. Then a decomposition
                                  R = R1 ∪ R2
  is lossless if at least one of the following dependencies holds
                                 R1  R2  R1
                                 R1  R2  R2

• either of the above FDs guarantees uniqueness in the mapping (and
  therefore that the decomposition is lossless)
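
Mechanically, the test is one closure computation on the shared attributes. A
sketch, reusing the hypothetical closure helper from the attribute-closure
slide:

   def is_lossless(R1, R2, fds):
       # lossless iff (R1 ∩ R2) => R1 or (R1 ∩ R2) => R2 is in F+
       common = closure(R1 & R2, fds)
       return R1 <= common or R2 <= common

   # e.g., R = (A, B, C) with F = (A=>B, B=>C), split into (A,B) and (B,C)
   F2 = [({'A'}, {'B'}), ({'B'}, {'C'})]
   print(is_lossless({'A', 'B'}, {'B', 'C'}, F2))   # True: B => BC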
           Dependency Preservation

• Dependencies are preserved in a decomposition if we do
  not need to join in order to enforce FDs -- all FDs remain
  intra-relational and do not become inter-relational

• To check if a decomposition is dependency preserving, we
  need to examine all FDs in F+

• There is an algorithm for testing dependency preservation
  (requires the computation of F+)
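
There is also a polynomial test that avoids materializing F+, sketched below
from the textbook's approach (my own rendering; for each FD it computes what
the left-hand side determines using only one fragment at a time):

   def fd_preserved(alpha, beta, fragments, fds):
       result = set(alpha)
       while True:
           old = set(result)
           for Ri in fragments:          # grow result using one Ri at a time
               result |= closure(result & Ri, fds) & Ri
           if result == old:
               return beta <= result

   def is_dependency_preserving(fragments, fds):
       return all(fd_preserved(a, b, fragments, fds) for a, b in fds)

   # e.g., R = (J, K, L), F = (JK=>L, L=>K), BCNF split (J,L) and (L,K)
   F3 = [({'J', 'K'}, {'L'}), ({'L'}, {'K'})]
   print(is_dependency_preserving([{'J', 'L'}, {'L', 'K'}], F3))  # False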
              Goals of Normalization

• Decide whether a particular relation R is in “good” form

• if it is not in “good” form, decompose it into a set of
  relations (R1, R2, R3, …, Rn) such that:

   – each relation is in “good” form

   – the decomposition is a lossless-join decomposition, based upon
     functional dependencies
                         Normalization

• Types of FDs in R(A, B, C, D) with (A, B) a candidate
  key:
   – trivial:    AB ==> A
   – partial:    A ==> C               (C depends upon a part of the key)
       TEACH(student, teacher, subject)
        student, subject ==> teacher (a student may not take the same subject
                                     from two different teachers)
       teacher ==> subject             (each teacher teaches only one subject)
   – transitive: A ==> C ==> D
       ED(eno, ename, sal, dno, dname, floor, mgr)
       eno ==> dno ==> mgr
              Normalization using FDs
• When we decompose a relation schema R with a set of
  functional dependencies F into R1, R2, R3, …, Rn we want:
   – lossless-join decomposition: otherwise the decomposition results in
     loss of information relative to the original schema R
    – no redundancy: the relations Ri should be in either BCNF (Boyce-
      Codd Normal Form) or 3NF (Third Normal Form) (about which
      more in a slide or two)
   – Dependency preservation: let Fi be the set of dependencies in F+
     that include only attributes in Ri:
        • preferably the decomposition should be dependency preserving. That is,
          (F1 ∪ F2 ∪ F3 ∪ … ∪ Fn)+ = F+
       • Otherwise checking updates for violation of FDs may require
         computing joins, which is expensive
                    The Normal Forms
• 1NF: every attribute has an atomic value
• 2NF: 1NF and no partial dependencies
• 3NF: 2NF and no transitive dependencies.
  Equivalently (text definition): if for each FD X==> Y either
       • it is trivial, or
       • X is a superkey, or
        • each attribute in Y-X is contained in some candidate key (possibly
          different attributes in different candidate keys)
• BCNF: if for each FD X==> Y either
       • it is trivial, or
       • X is a superkey
              Distinguishing Examples
• 1NF but not 2NF:         SUPPLY(sno, pno, jno, scity, jcity, qty)
   – (sno, pno, jno) is the candidate key
   – sno ==> scity, jno ==> jcity are both partial dependencies
• 2NF but not 3NF:         ED( eno, ename, sal, dno, dname, floor, mgr)
   – transitive FD: eno ==> dno ==> dname
• 3NF but not BCNF:        TEACH(student, teacher, subject)
   – student, subject ==> teacher
   – teacher ==> subject
           Boyce-Codd Normal Form
BCNF is perhaps the most useful Normal Form for database
  design

A relation schema R is in BCNF with respect to a set F of
  functional dependencies if for all functional dependencies
  in F+ of the form X ==> Y, where X ⊆ R and Y ⊆ R, at least one
  of the following holds:

   – X ==> Y is trivial (that is, Y ⊆ X)
   – X is a superkey for R
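
The definition translates directly into a check. A sketch, again leaning on
the hypothetical closure helper from earlier (for checking R itself it
suffices to examine the FDs given in F rather than all of F+):

   def is_bcnf(R, fds):
       for X, Y in fds:
           if Y <= X:                    # trivial FD -- allowed
               continue
           if not closure(X, fds) >= R:  # X is not a superkey -- violation
               return False
       return True

   # the example on the next slide: R = (A, B, C), F = (A=>B, B=>C)
   print(is_bcnf({'A', 'B', 'C'},
                 [({'A'}, {'B'}), ({'B'}, {'C'})]))   # False (B => C)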
                    BCNF Example
• R = (A, B, C)
• F = (A==> B,
       B==> C)
• R is not in BCNF
• Decomposition R1 = (A, B), R2 = (B, C)
   – R1 and R2 are in BCNF
   – Lossless-join decomposition
   – Dependency preserving
      Third Normal Form: Motivation
• There are some situations where
   – BCNF is not dependency preserving, and
   – efficient checking for FD violation on updates is important
   – In these cases BCNF is too severe and a looser Normal Form
     would be useful
• Solution: define a weaker Normal Form, called Third
  Normal Form, where
   – FDs can be checked on individual relations without performing a
     join (no inter-relational FDs)
   – There is always a lossless-join, dependency-preserving
     decomposition
                 Third Normal Form
• A relation schema R is in 3NF with respect to a set F of
  functional dependencies if for all functional dependencies
  in F+ of the form X ==> Y, where X ⊆ R and Y ⊆ R, at least one
  of the following holds:
   – X ==> Y is trivial (that is, Y ⊆ X)
   – X is a superkey for R
   – each attribute A in Y-X is contained in a candidate key for R
                  (note: possibly in different candidate keys)
• A relation in BCNF is also in 3NF
• 3NF is a minimal relaxation of BCNF to ensure
  dependency preservation
                      3NF Example
• R = (J, K, L)
• F = (JK==> L,
       L==> K)
• Two candidate keys: JK and JL
• R is in 3NF
   – JK==>L             JK is a superkey
   – L==>K              K is contained in a candidate key
• BCNF decomposition has R1 = (L, K), R2 = (L, J)
   – testing for JK==>L requires a join
• There is some redundancy in this schema
                    Testing for 3NF
• Optimization: need to check only FDs in F, need not check
  all FDs in F+
• Use attribute closure to check, for each dependency
  X==>Y, if X is a superkey
• If X is not a superkey, we have to verify if each attribute in
  Y is contained in a candidate key of R
   – This test is rather more expensive, since it involves finding
     candidate keys
   – Testing for 3NF has been shown to be NP-hard
   – Interestingly, decomposition into 3NF can be done in polynomial
     time (testing for 3NF is harder than decomposing into 3NF!)
       Comparison of BCNF and 3NF
• It is always possible to decompose a relation into relations
  in 3NF such that:
   – the decomposition is lossless
   – the dependencies are preserved


• It is always possible to decompose a relation into relations
  in BCNF such that:
   – the decomposition is lossless
   – but it may not be possible to preserve dependencies
    BCNF and 3NF Comparison (cont.)
Example of problems due to redundancy in 3NF
   – R = (J, K, L)                          J        L        K
     F = (JK==> L, L==> K)                  j1       l1       k1
                                            j2       l1       k1
                                            j3       l1       k1
                                            null     l2       k2
A schema that is in 3NF but not BCNF has the problems of:
   – repetition of information (e.g., the relationship between l1 and k1)
   – need to use null values (e.g., to represent the relationship between
     l2 and k2 when there is no corresponding value for attribute J)
                          Design Goals
• Goal for a relational database design is:
    – BCNF
    – Lossless Join
    – Dependency Preservation
• If we cannot achieve this, we accept one of
    – lack of dependency preservation (or use of more expensive inter-relational
      methods to preserve dependencies)
    – data redundancy due to use of 3NF
• Interestingly, SQL does not provide a direct way of specifying
  functional dependencies other than superkeys
    – can specify FDs using assertions, but they are expensive to test
    – Even if we have a dependency preserving decomposition, using SQL we
      cannot efficiently test an FD whose left hand side is not a key
         BCNF and Over-normalization
• Goal is to obtain schemas that are:
    – BCNF
    – Lossless Join
    – Dependency Preserving
• but sometimes we have to look at the meaning, too
   Example:         TEACH(student, teacher, subject)
   student, subject ==> teacher       (a student may not take the same
                                        subject from two different teachers)
   teacher ==> subject                  (each teacher teaches one subject)
• This 3NF has anomalies:
    – Insertion: cannot insert a teacher until we have a student taking his subject
    – Deletion: if we delete the last student of a teacher, we lose the subject he
      teaches
     BCNF and Over-normalization (2)
• What is the problem? Schema overload. We are trying to capture two
  meanings:
    – 1) subject X can be taught by teacher Y
    – 2) student Z takes subject W from teacher V
   It makes no sense to say we lose the subject he teaches when he does
   not have a student. Who is he teaching the subject to?

• Normalizing this schema to BCNF cannot preserve dependencies, so
  we do better to stay with the 3NF TEACH and add another (BCNF) relation
  SUBJECT-TAUGHT (teacher, subject) to capture the meaning of the
  real-world environment more effectively.
     Getting Physical: Storage and File
           Structure (Chapter 11)
• Up until now we have examined database design from a
  high-level conceptual view, passing over actual
  implementation and underlying hardware.
   – Appropriate focus for database users
   – But hardware does have an influence on implementation, and
     implementation does have an influence on what conceptual designs
     will be more efficient and useful
• Now we get physical -- examine physical storage media to
  give a background for later focus on implementation of the
  data models and languages already described
                      Chapter 11

At this point we are focussing on the following sections
• 11.1 Overview of Physical Storage Media
• 11.2 Magnetic Disks
• 11.3 RAID (very briefly)
• 11.4 Tertiary Storage
• 11.5 Storage Access
• 11.6 File Organization
• 11.7 Organization of Records in Files
• 11.8 Data-Dictionary Storage
Classification of Physical Storage Media

• Media are classified according to three characteristics:
   – speed of access
   – cost per unit of data
   – reliability
       • data loss on power failure or system crash
       • physical failure of the storage device
• We can also differentiate storage as either
   – volatile storage
   – non-volatile storage
Physical Storage Media Overview (11.1)

• Typical media available are:
   –   Cache
   –   Main memory
   –   Flash memory
   –   Mag disk
   –   Optical storage (CD or DVD)
   –   Tape storage
  Physical Storage Media -- Cache and
             Main Memory
• Cache
  – fastest and most costly form of storage
  – volatile
  – managed by computer system hardware
• Main memory
  – fast access (10s to 100s of nanoseconds)
  – generally too small or expensive to hold the entire database
       • current capacities commonly used are up to a few Gigabytes
      • capacities have gone up and per-byte costs have decreased steadily,
        roughly a factor of 2 every 2-3 years
  – volatile
Physical Storage Media -- Flash Memory
• Also known as EEPROM -- Electrically Erasable Programmable Read-
  Only Memory
• non-volatile
• reading data is comparable to main memory speeds
• writing is more complex
    – can’t overwrite a single location -- a whole bank of memory must be
      erased to permit writing within that bank.
    – erasing is only supported a limited number of times -- 10,000 to one
      million erase cycles
    – writes are slow (a few microseconds), and erases are slower
• cost comparable to main memory
• widely used in computer systems embedded in other devices, such as
  digital cameras and hand-held computers
Physical Storage Media -- Magnetic Disk
•   data is stored on a spinning disk and read/written magnetically
•   primary medium for long-term storage of data
•   typically stores entire database
•   data must be moved from disk to main memory for access, and written back
    for storage
     – much slower access than main memory (about which more later)
•   direct access -- possible to read data on disk in any order, unlike magnetic tape
•   capacities up to 100 gig
     – much larger capacity and cheaper cost/byte than main memory or flash memory
     – capacity doubles every two or three years
•   survives power failures and system crashes
     – disk failure can destroy data, but this is more rare than system crashes
      Physical Storage Media -- Optical
                   Storage
• Non-volatile; data is read optically from a spinning disk using a laser
• CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
• Write-once, Read-many (WORM) optical disks used for archival
  storage (CD-R and DVD-R)
• Multiple-write versions also available (CD-RW, DVD-RW, and DVD-
  RAM)
• Reads and writes are slower than with magnetic disk
• Juke-box systems available for storing large volumes of data
    – large numbers of removable disks
    – several drives
    – mechanism for automatic loading/unloading of disks
 Physical Storage Media -- Tape Storage

• Non-volatile
• used primarily for backup (to recover from disk failure) and for
  archival data
• sequential access -- much slower than disk
• very high capacity (40-300 GB tapes available)
• tape can be removed from drive; storage costs much cheaper than disk,
  but drives are expensive
• Juke-box systems available for storing large volumes of data
    – e.g., remote sensing data, possibly hundreds of terabytes (10^12 bytes) or
      even a petabyte (10^15 bytes)
                        Storage Hierarchy
•   Primary storage: fastest media but
    volatile
     – cache
     – main memory
•   secondary storage: next level in
    hierarchy; moderately fast access
    time, non-volatile
     – also called on-line storage
     – flash memory, magnetic disks
•   tertiary storage: lowest level in
    hierarchy; slower access time, non-
    volatile
     – also called off-line storage
     – optical storage, magnetic tape
                     Magnetic Disks (11.2)
•   Read-write head
     – positioned very close to the platter
       surface (almost touching it)
     – Reads or writes magnetically
       coded information
•   Surface of platter divided into
    circular tracks
     – over 16,000 tracks per platter on
       typical hard disks
•   Each track is divided into sectors
     – a sector is the smallest unit of data
       that can be read or written
     – sector size is typically 512 bytes
     – typically 200 (on inner tracks) to
       400 (outer tracks) sectors per track
                    Magnetic Disks (cont)
•   To read/write a sector
     – disk arm swings to position head
       on the right track
     – platter spins continually; data is
       read/written as sector passes under
       head
•   Head-disk assemblies
     – multiple disk platters on a single
       spindle (typically 2 to 4)
     – one head per platter, mounted on a
       common arm
•   Cylinder i consists of ith track of all
    the platters
               Magnetic Disks (cont)

• Earlier generation disks were susceptible to head crashes

   – disk spins constantly at 60, 120, even 250 revolutions per second

   – head is very close to the surface; if it touches the surface it can
     scrape the recording medium off the surface, wiping out data and
     causing the removed medium to fly around, causing more head
     crashes

   – newer disks have less friable material; less subject to head crashes
                     Magnetic Disks (cont)
•   Disk controller -- interfaces between the computer system and the disk
    drive
     – accepts high-level commands to read or write a sector
     – initiates actions such as moving the disk arm to the right track and
       actually reading or writing the data
     – computes and attaches checksums to each sector to verify that data is
       read back correctly
     – Ensures successful writing by reading back sector after writing it
     – Performs remapping of bad sectors
•   Multiple disks are connected to a computer system through a controller
     – controller functionality (checksum, bad sector remapping) is often
       carried out by individual disks, reducing load on the controller
•   Two disk interface standards are ATA (AT attachment) and SCSI
    (Small Computer System Interconnect)
            Disk Performance Measures

• Access time -- the time it takes from when a read or write request is
  issued to when data transfer begins. Consists of:
    – Seek time -- time it takes to reposition the arm over the correct track
         • average seek time is 1/2 the worst case seek time
         • 4 to 10 milliseconds on typical disks
    – Rotational latency -- time it takes for the sector to appear under the head
         • average latency is 1/2 the worst case latency
         • 4 to 11 milliseconds on typical disk (5400 to 15000 rpm)
• Data Transfer Rate -- rate of retrieval from or storage to disk
    – 4 to 8 MB per second is typical
    – Multiple disks may share a controller, so the rate that controller can
      handle is also important. E.G., ATA-5: 66MB/sec, SCSI-3: 40MB/sec,
      Fiber Channel: 256 MB/s
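
To make the numbers concrete (illustrative values, not from the slide): with a
6 ms average seek, 7200 rpm rotation (8.33 ms per revolution, so about 4.2 ms
average rotational latency), and a 4 MB/sec transfer rate, reading one 8 KB
block costs roughly 6 + 4.2 + 2 ≈ 12 ms -- time enough for on the order of
100,000 main-memory accesses.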
     Disk Performance Measures (cont.)

• Mean time to failure (MTTF) - the average time the disk is expected to
  run continuously without any failure.
    – Typically 3-5 years
    – Sounds good, but if you have 1500 disks then 300 per year will fail, or
      about 1 per day
    – MTTF decreases as the disk ages
• RAID (Redundant Arrays of Independent Disks) (11.3)
    – disk organization techniques that manage a large number of disks,
      providing a view of a single disk of
        • high capacity and high speed by using multiple disks in parallel
        • high reliability by storing data redundantly, so that data can be recovered even
          if a disk fails
        • mean time to data loss can be as high as 500,000 to 1,000,000 hours on a RAID
    Optimization of Disk-Block Access:
                Motivation
• Requests for disk I/O are generated both by the file system and by the
  virtual memory manager
• Each request specifies the address on the disk to be referenced in the
  form of a block number
    – a block is a contiguous sequence of sectors from a single track on one
      platter
    – block sizes range from 512 bytes to several K (4 -- 16K is typical)
    – smaller blocks mean more transfers from disk; larger blocks make for
      more wasted space due to partially filled blocks
    – block is the standard unit of data transfer between disk to main memory
• Since disk access speed is much slower than main memory access,
  methods for optimizing disk-block access are important
   Optimization of Disk-Block Access:
                Methods
• Disk-arm Scheduling: requests for several blocks may be
  speeded up by requesting them in the order they will pass
  under the head.
   – If the blocks are on different cylinders, it is advantageous to ask
     for them in an order that minimizes disk-arm movement
   – Elevator algorithm -- move the disk arm in one direction until all
     requests from that direction are satisfied, then reverse and repeat
    – Sequential access patterns are 1-2 orders of magnitude faster than
      random access
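
A toy sketch of the elevator algorithm over cylinder numbers (the
queue-of-requests encoding is invented for illustration):

   def elevator_order(head, requests):
       # sweep outward over all requested cylinders at or past the head,
       # then reverse direction and sweep back over the rest
       up = sorted(c for c in requests if c >= head)
       down = sorted((c for c in requests if c < head), reverse=True)
       return up + down

   print(elevator_order(50, [98, 18, 37, 122, 14, 124, 65, 67]))
   # [65, 67, 98, 122, 124, 37, 18, 14]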
   Optimization of Disk-Block Access:
                Methods
• Non-volatile write buffers
   – store written data in a RAM buffer rather than on disk
   – write the buffer whenever it becomes full or when no other disk
     requests are pending
   – buffer must be non-volatile to protect from power failure
       • called non-volatile random-access memory (NV-RAM)
       • typically implemented with battery-backed-up RAM
   – dramatic speedup on writes; with a reasonable-sized buffer write
     latency essentially disappears
   – why can’t we do the same for reads? (hints: ESP, clustering)
   Optimization of Disk-Block Access:
                Methods
• File organization (Clustering): reduce access time by
  organizing blocks on disk in a way that corresponds
  closely to the way we expect them to be accessed
   – sequential files should be kept organized sequentially
   – hierarchical files should be organized with mothers next to
     daughters
   – for joining tables (relations) put the joining tuples next to each
     other
   – over time fragmentation can become an issue
       • restoration of disk structure (copy and rewrite, reordered) controls
         fragmentation
   Optimization of Disk-Block Access:
                Methods
• Log-based file system
   – does not update in-place, rather writes updates to a log disk
       • essentially, a disk functioning as a non-volatile RAM write buffer
   – all access in the log disk is sequential, eliminating seek time
    – eventually updates must be propagated to the original blocks
       • as with NV-RAM write buffers, this can occur at a time when no disk
         requests are pending
       • the updates can be ordered to minimize arm movement
   – this can generate a high degree of fragmentation on files that
     require constant updates
       • fragmentation increases seek time for sequential reading of files
                 Storage Access (11.5)

• Basic concepts (some already familiar):
   – block-based. A block is a contiguous sequence of sectors from a single
     track; blocks are units of both storage allocation and data transfer
   – a file is a sequence of records stored in fixed-size blocks (pages) on the
     disk
   – each block (page) has a unique address called BID
   – optimization is done by reducing I/O, seek time, etc.
   – database systems seek to minimize the number of block transfers between
     the disk and memory. We can reduce the number of disk accesses by
     keeping as many blocks as possible in main memory.
   – Buffer - portion of main memory used to store copies of disk blocks
   – buffer manager - subsystem responsible for allocating buffer space in
     main memory and handling block transfer between buffer and disk
                        Buffer Management

• The buffer pool is the part of the main memory allocated for
  temporarily storing disk blocks read from disk and made available to
  the CPU
• The buffer manager is the subsystem responsible for the allocation and
  the management of the buffer space (transparent to users)
•   On a process (user) request for a block (page) the buffer manager:
     –   checks to see if the page is already in the buffer pool
     –   if so, passes the address to the process
     –   if not, it loads the page from disk and then passes the address to the process
     –   loading a page might require clearing (writing out) a page to make space
•   Very similar to the way virtual memory managers work, although it can do a
    lot better (why?)
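
A bare-bones sketch of that request path, with LRU replacement and pinning
(class and method names are invented; a real buffer manager also tracks dirty
pages and writes them back before eviction):

   from collections import OrderedDict

   class BufferManager:
       def __init__(self, capacity):
           self.capacity = capacity
           self.pool = OrderedDict()    # block id -> page, in LRU order
           self.pinned = set()

       def get_page(self, bid):
           if bid in self.pool:
               self.pool.move_to_end(bid)       # now the most recently used
               return self.pool[bid]
           if len(self.pool) >= self.capacity:
               self._evict()                    # clear a page to make space
           self.pool[bid] = self._read_from_disk(bid)
           return self.pool[bid]

       def _evict(self):
           for victim in self.pool:             # least recently used first
               if victim not in self.pinned:
                   del self.pool[victim]        # (write back here if dirty)
                   return
           raise RuntimeError("all buffer pages are pinned")

       def _read_from_disk(self, bid):
           return f"<block {bid}>"              # stand-in for the real I/O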
         Buffer Replacement Strategies

• Most operating systems use a LRU replacement scheme. In database
  environments, MRU is better for some common operations (e.g., join)
    – LRU strategy: replace the least recently used block
    – MRU strategy: replace the most recently used block
• Sometimes it is useful to fasten or pin blocks to keep them available
  during an operation and not let the replacement strategy touch them
    – pinned block is thus a block that is not allowed to be written back to disk
• There are situations where it is necessary to write back a block to disk
  even though the buffer space it occupies is not yet needed. This write
  is called the forced output of a block; useful in recovery situations
• Toss-immediate strategy: free the space occupied by a block as soon as
  the final tuple of that block has been processed
         Buffer Replacement Strategies

• Most recently used (MRU) strategy: system must pin the block
  currently being processed. After the final tuple of that block has been
  processed the block is unpinned and becomes the most recently used
  block. This is essentially “toss-immediate” with pinning, and works
  very well with joins.
• The buffer manager can often use other information (design or
  statistical) to predict the probability that a request will reference a
  particular page
    – e.g., the data dictionary is frequently accessed -- keep the data dictionary
      blocks in main memory buffer
    – if several pages are available for overwrite; choose the one that has the
      lowest number of recent access requests to replace
             Buffer Management (cont)
• Existing OS affect DBMS operations by:
   –   read ahead, write behind
   –   wrong replacement strategies
    –   Unix is not a good platform for a DBMS to run on top of
   –   Most commercial systems implement their own I/O on a raw disk
       partition
• Variations of buffer allocation
   –   common buffer pool for all relations
   –   separate buffer pool for each relation
   –   as above but with relations borrowing space from each other
   –   prioritized buffers for very frequently accessed blocks, e.g. data
       dictionary
           Buffer Management (cont)

• For each buffer the manager keeps the following:
   – which disk and which block it is in
   – whether the block is dirty (has been modified) or not (why?)
   – information for the replacement strategy
       • last time block was accessed
       • whether it is pinned
       • possible statistical information (access frequency etc.)
   Buffer Management and Disk-block
       Access Optimization (end)
• Disk-block access methods must take care of some
  information within each block, as well as information
  about each block:
   – allocate records (tuples) within blocks
   – support record addressing by address and by value
   – support auxiliary (secondary indexing) file structures for more
     efficient processing
• These concerns lead directly into our next topic:
                     file organization.
                    File Organization

• The database is stored logically as a collection of files.
• Each file is a sequence of records
• A record is a sequence of fields
• Easy so far, but as we just finished discussing, anything
  stored on disk is stored in blocks, which are a physical
  constraint unrelated to the storage system used for files.
• So how do we organize a file into blocks and records?
    – formatting fields within a record
    – formatting records within a block
    – assigning records to blocks.
                  Fixed-Length Records

• Simplest approach. We know
  the length of each record, and
  they are all the same
    – store record i starting from
      byte n * (i - 1), where n is
      record length
    – record access is simple
                Fixed-Length Records

• Problems:
   – records may cross blocks
       • normal modification: don’t
         permit records to cross block
         boundaries
   – deletion of record i leaves a
     gap, which requires some way
     of dealing with the empty
     space.
   – E.G., record 2 (A-215) is
     deleted from the example
     block on the right
        Fixed-Length Records: Deletion

• One simple fix is to shift all the
  records down to fill the gap, as
  shown on the right.
• This involves a lot of work, so
  it might be slow
   Fixed-Length Records: more Deletion

• Another fix would be to shift
  the last record (record n) to the
  deleted position i
• Much less work
• Not useful if the records are
  stored in order (sorted)
       Fixed-Length Records: still more
                  Deletion

• Another possibility is to not
  move records at all
• Maintain a header at the
  beginning of the file
• Store a link to the list of
  addresses of deleted records
• Use each deleted record to store
  the link to the next deleted
  record
• Essentially a linked list, often
  called a free list
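
An in-memory sketch of the free-list idea over fixed-length slots (the tuple
encoding of a freed slot is invented; on disk the freed record's own bytes
would hold the link, and the file header would hold the head pointer):

   class FixedLengthFile:
       def __init__(self, n_slots):
           # chain every slot into the free list: 0 -> 1 -> ... -> None
           self.slots = [("FREE", i + 1) for i in range(n_slots - 1)]
           self.slots.append(("FREE", None))
           self.free_head = 0                   # header: first free slot

       def insert(self, record):
           slot = self.free_head
           if slot is None:
               raise RuntimeError("file is full")
           _, self.free_head = self.slots[slot] # pop the free list
           self.slots[slot] = record
           return slot

       def delete(self, slot):
           # the deleted record itself stores the link to the next one
           self.slots[slot] = ("FREE", self.free_head)
           self.free_head = slot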
             Variable Length Records

• Can occur in a database system in several ways
   – storage of multiple record types in a file
   – record types that allow variable lengths for one or more fields
   – record types that allow repeating fields
• Several ways to store variable length records
   – attach an “end of record” symbol (delimiter) to mark the end of
     each record
   – store the length of the record at the beginning of each record
   – store header information at the beginning of the file with the
     location and length of each record
   – these techniques can be applied at the file, block, or record level
                 Variable Length Records
•   Files:
     – delimit each record within the file
     – store a length field at the beginning of each record
     – store header information at the beginning of the file with the location and length of
       each record within the file
•   Blocks:
     – delimit each record within the block
     – store a length field at the beginning of each record
     – store header information at the beginning of the block with the location and length
       of each record inside the block
•   Records:
     – delimit each field within the record
     – store a length field at the beginning of each field
     – store header information at the beginning of the record with the location and length
       of each field
             Variable Length Records
Two more techniques for storing variable-length records
   – use fixed-length fields
       • reserve space -- if there is a maximum space, just reserve that, and
         mark unused space with a special null (end-of-record) symbol
       • wasteful if the maximum record length is much larger than the
         average record length
   – list representation
       • represent variable-length records by lists of fixed-length records,
         chained together by pointers
       • useful for variable-length records caused by repeating same-length
         fields
       • we don’t want a single field of the variable-length record to cross the
         boundary of two fixed-length records in its representation, so this can
         also be wasteful of space
         Organizing Records in a Block

• Two major ways we can organize records within a block
  (disk page)
   – fixed-packed (contiguous storage)
   – slotted page structure (indexed heap)


1) fixed-packed -- records are stored contiguously
   –   highly inflexible
   –   records may span over block boundary
   –   fragmentation with deletions and insertions
   –   external pointers may prevent internal block reorganization --
       records are pinned to their address in the block
        Organizing Records in a Block

2) slotted page structure
   – an initial block header storing block address and offset is used to
     reference the record
   – records are indexed within a block
   – insertions and deletions are easy (there are no assumptions about
     contiguity of records and record-address startpoints to deal with)
   – records may be rearranged within the block without concerns about
     external pointers
   – records are not pinned within the block
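
A simplified sketch of a slotted page (records kept as Python strings and the
slot directory as a plain list; a real page packs record bytes from one end of
the block and grows the directory from the other):

   class SlottedPage:
       def __init__(self):
           self.records = []        # record bodies, in arbitrary order
           self.slots = []          # slot number -> index into records

       def insert(self, record):
           self.records.append(record)
           self.slots.append(len(self.records) - 1)
           return len(self.slots) - 1   # external pointers hold (block, slot)

       def get(self, slot):
           return self.records[self.slots[slot]]

       def delete(self, slot):
           idx, last = self.slots[slot], len(self.records) - 1
           self.slots[slot] = None
           if idx != last:              # move last record into the hole and
               self.records[idx] = self.records[last]
               for s, i in enumerate(self.slots):   # redirect its slot entry
                   if i == last:
                       self.slots[s] = idx
           self.records.pop()           # external pointers are unaffected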
          Organizing Records in Files

• Given a set of records, how do we organize them in a file?
  Three possible methods are:
   – 1. Heap -- no order at all. A record can be placed anywhere in the
     file where there is space
   – 2. Sequential -- records are stored in a sorted order according to the
     value of a search key
   – 3. Hashing -- a hash function computed on some attribute of each
     record is used to specify in which block of the file the record
     should be placed
   – Records of each relation are often stored in separate files.
     Sometimes it is useful to use a clustering file organization, where
     records of several relations might be stored in a single file.
              Heap File Organization

• Heap -- no order at all. A record can be placed anywhere
  in the file where there is space
   – easy insert, easy delete
   – lack of any structure makes queries (including finding a particular
     record) very difficult
   – not usually useful for anything except very small relations
           Sequential File Organization

• Sequential -- records are stored
  in a sorted order according to
  the value of a search key
    – designed for efficient queries
      in sorted order
    – very suitable for applications
      that require sequential
      processing of the entire file
    – difficult to maintain sorted
      order with insert/delete
    – deletions can use a free list
      (pointer chain) to mark empty
      space as previously described
           Sequential File Organization

• Insertions use the following
  method:
    – locate the location to be
      inserted
    – if there is space there, insert
      with no more work
    – otherwise insert the record in
      an overflow block
    – in either case the pointer chain
      must be updated
• Every so often we need to
  reorganize the whole file to
  restore sequential order
           Clustering File Organization

• Simple file structure stores each
  relation in a separate file
    – tuples can be represented as
      fixed-length records
    – easy to implement
    – well-suited to small databases
• Large databases often attempt
  to store many relations in one
  file using a clustering file
  organization
• E.G., relations customer and
  depositor shown to the right:
          Clustering File Organization

• depositor relation stores the
  different accounts that a
  particular customer has
• customer relation stores the
  address information of a given
  customer
• Both relations use customer-
  name as a key
• Some common queries on the
  two relations join them based
  on the customer-name attribute
           Clustering File Organization

• Storing the two relations
  together, sorted on customer-
  name, allows the join to be
  computed much more quickly
• There is a price to pay -- some
  operations are now more
  expensive (slower)
• for example, consider
        select   *
        from     customer
• sequential pass through the
  customer relation is now hard
           Clustering File Organization

• To allow sequential access
  through all tuples of the
  customer relation, we chain
  together all the tuples of that
  relation using pointers
• clustering results in variable-
  size records



• Careful use of clustering can
  produce significant
  performance gains
                      Data Dictionary

The data dictionary (also called system catalog) stores
  metadata -- data about data. For example:

• Information about relations
   –   names of relations
   –   names and types of attributes of each relation
   –   names and definitions of views
   –   integrity constraints


• User and account information, including passwords
                     Data Dictionary

• Statistical and descriptive data
   – number of tuples in each relation


• Physical file organization information
   – how relation is stored (sequential, hash, clustered, etc.)
   – physical location of relation -- operating system file name or disk
     addresses of blocks containing records of the relation


• Information about indices
   – (about which more after the midterm, when we cover chapter 12)
                      Data Dictionary

• In effect, the data dictionary is a mini database. The data
  within can either be stored as:
   – special-purpose data structures and code to access it, or
   – a set of relations, using the existing database structures and code to
     access it (most common solution)
                       Data Dictionary

Example of a possible system catalog representation:

Relation-metadata =      (relation-name, number-of-attributes,
                         storage-organization, location)
Attribute-metadata =     (attribute-name, relation-name, domain-type,
                         position, length)
User-metadata =          (user-name, encrypted-password, group)
Index-metadata =         (index-name, relation-name, index-type,
                         index-attributes)
View-metadata =          (view-name, definition)
                    Reading and Review

Upcoming dates:
    – Mar 11: HW#2 is due
    – Mar 11 and 13: midterm review
    – Mar 18: midterm
Text Sections we’ve covered:
    –   Chapter 1: Introduction
    –   Chapter 2: E-R data model
    –   Chapter 3 (except 3.4 & 3.5): Relational model
    –   Chapter 4: SQL
    –   Chapter 6: Integrity and Security
    –   Chapter 7 up to and including 7.7: Relational-Database Design
    –   Chapter 11 (except 11.3 and 11.9): Storage and File Structure
                 Reading and Review
                Chapter 1: Introduction

1.1: Applications and motivation
1.2: Database systems vs. file systems
1.3: Views
1.4: Data models
1.5: Database languages
1.6: Database users and administrators
1.7-1.10: other stuff
1.11: Summary
           Reading and Review
   Chapter 2: Entity-Relationship Model
2.1: Basic concepts: entities (entity-sets), relationship (set)s, and attributes
2.2: Constraints and mapping
2.3: Keys
2.4: Design issues
2.5: E-R Diagram
2.6: Weak entity sets
2.7: Extended E-R features (specialization, aggregation)
2.8: Design of an E-R schema
2.9: Reduction of an E-R schema to tables
2.10: (ignore)
2.11: Summary
              Reading and Review
           Chapter 3: Relational Model
3.1: Structure of relational databases
3.2: The relational algebra
3.3: Extended relational-algebra operations
3.4, 3.5: (ignore)
3.6: Tuple relational calculus
3.7: Domain relational calculus
3.8: Summary
                  Reading and Review
                    Chapter 4: SQL
4.1: Background
4.2: Basic structure
4.3: Set operations
4.4: Aggregate functions
4.5: Null values
4.6: Nested subqueries
4.7, 4.8: (ignore)
4.9: Modification of the database
4.10: (ignore)
4.11-4.14: (really ignore)
4.15: Summary
            Reading and Review
       Chapter 6: Integrity and Security
6.1: Domain Constraints
6.2: Referential Integrity
6.3: Assertions
6.4: Triggers
6.5: Security and Authorization
6.6, 6.7: (ignore)
6.8: Summary
          Reading and Review
  Chapter 7: Relational-Database Design
7.1: First Normal Form
7.2: Pitfalls in Relational-Database Design
7.3: Functional Dependencies
7.4: Decomposition
7.5: Desirable Properties of Decomposition
7.6: BCNF
7.7: Third Normal Form
7.8-7.10: (ignore)
7.11: Summary
          Reading and Review
  Chapter 11: Storage and File Structure
11.1: Overview of physical storage media
11.2: Magnetic Disks
11.3: RAID (not responsible for this)
11.4: Tertiary Storage
11.5: Storage Access
11.6: File Organization
11.7: Organization of Records in Files
11.8: Data Dictionary Storage
11.9: (ignore)
11.10: Summary
