PPT

Document Sample
PPT Powered By Docstoc
					C20.0046: Database
Management Systems
Lecture #5
   M.P. Johnson
   Stern School of Business, NYU
   Spring, 2008


         M.P. Johnson, DBMS, Stern/NYU, Spring 2008   1
Agenda
    Last time: FDs
    This time:
1.   Anomalies
2.   Normalization: BCNF & 3NF
    Next time: SQL




                 M.P. Johnson, DBMS, Stern/NYU, Spring 2008   2
Review: FDs
   Definition:
            If two tuples agree on the attributes
                          A1, A2, …, An
           then they must also agree on the attributes
                          B1, B2, …, Bm


   Notation: A1, A2, …, An  B1, B2, …, Bm
   Read as: Ai functionally determines Bj


                    M.P. Johnson, DBMS, Stern/NYU, Spring 2008   3
Review: Combining FDs
  If some FDs are satisfied, then
  others are satisfied too


                                              name  color
    If all these FDs are true:                category  department
                                              color, category  price


    Then this FD also holds: name, category  price

           Why?


                  M.P. Johnson, DBMS, Stern/NYU, Spring 2008            4
Problem: find all FDs
   Given a relation instance and set of given FDs
   Find all FD’s satisfied by that instance

   Useful if we don’t get enough information from our
    users: need to reverse engineer a data instance

   Q: How long does this take?
   A: Some time for each subset of atts
   Q: How many subsets?
       powerset
    exponential time in worst-case
   But can often be smarter…
                   M.P. Johnson, DBMS, Stern/NYU, Spring 2008   5
 Closure Algorithm

Start with X={A1, …, An}.                                    Example:

Repeat:                                                       name  color
                                                              category  department
  if  B1, …, Bn  C is a FD and                               color, category  price
       B1, …, Bn are all in X
                                                            {name, category}+ =
  then add C to X.
                                                               {name, category, color,
                                                                department, price}
until X didn’t change



                        M.P. Johnson, DBMS, Stern/NYU, Spring 2008                 6
Closure alg e.g.

            Example:            A, B  C
                                A, D  B
                                B     D


Compute X+, for every set X (AB is shorthand for {A,B}):

A+ = A, B+ = BD, C+ = C, D+ = D
AB+ = ABCD, AC+ = AC, AD+ = ABCD, BC+ = BC, BD+ = BD, CD+ = CD
ABC+ = ABD+ = ACD+ = ABCD (no need to compute–why?)
BCD+ = BCD, ABCD+ = ABCD

 What are the keys?
                   M.P. Johnson, DBMS, Stern/NYU, Spring 2008   7
Closure alg e.g.

 In class:

 R(A,B,C,D,E,F)                         A, B             C
                                        A, D             E
                                        B                D
                                        A, F             B

 Compute {A,B}+       X = {A, B,                                }

 Compute {A, F}+      X = {A, F,                                }

 What are the keys?

                   M.P. Johnson, DBMS, Stern/NYU, Spring 2008       8
Closure alg e.g.
   Product(name, price, category, color)
    name, category  price
    category  color

    FDs are:
    Keys are:   {name, category}

   Enrollment(student, address, course, room, time)
    student  address
    room, time  course
    student, course  room, time

    FDs are:
    Keys are:

                     M.P. Johnson, DBMS, Stern/NYU, Spring 2008   9
Next topic: Anomalies
   Identify anomalies in existing schemata

   Decomposition by projection

   BCNF

   Lossy v. lossless

   Third Normal Form
                 M.P. Johnson, DBMS, Stern/NYU, Spring 2008   10
Types of anomalies
   Redundancy
       Repeat info unnecessarily in several tuples


   Update anomalies:
       Change info in one tuple but not in another


   Deletion anomalies:
       Delete some values & lose other values too


   Insert anomalies:
       Inserting row means having to insert other, separate info /
        null-ing it out

                       M.P. Johnson, DBMS, Stern/NYU, Spring 2008     11
Examples of anomalies
    Name              SSN         Mailing-address                        Phone
    Michael           123         NY                                     212-111-1111
    Michael           123         NY                                     917-111-1111
    Hilary            456         DC                                     202-222-2222
    Hilary            456         DC                                     914-222-2222
    Bill              789         Chappaqua                              914-222-2222
    Bill              789         Chappaqua                              212-333-3333

SSN  Name, Mailing-address                                     SSN  Phone
     Redundancy: name, maddress
     Update anomaly: Bill moves
     Delete anom.: Bill doesn’t pay bills, lose phones  lose Bill!
     Insert anom: can’t insert someone without a (non-null) phone
     Underlying cause: SSN-phone is many-many
     Effect: partial dependency ssn  name, maddress,
          Whereas key = {ssn,phone}
                            M.P. Johnson, DBMS, Stern/NYU, Spring 2008                  12
Decomposition by projection
   Soln: replace anomalous R with projections of R onto
    two subsets of attributes
   Projection: an operation in Relational Algebra
       Corresponds to SELECT command in SQL


   Projecting R onto attributes (A1,…,An) means
    removing all other attributes
       Result of projection is another relation
       Yields tuples whose fields are A1,…,An
       Resulting duplicates ignored



                        M.P. Johnson, DBMS, Stern/NYU, Spring 2008   13
Projection for decomposition
             R(A1, ..., An, B1, ..., Bm, C1, ..., Cp)



  R1(A1, ..., An, B1, ..., Bm)                   R2(A1, ..., An, C1, ..., Cp)


R1 = projection of R on A1, ..., An, B1, ..., Bm
R2 = projection of R on A1, ..., An, C1, ..., Cp
A1, ..., An  B1, ..., Bm  C1, ..., Cp = all attributes
R1 and R2 may (/not) be reassembled to produce original R


                     M.P. Johnson, DBMS, Stern/NYU, Spring 2008                 14
  Decomposition example
                  Name        SSN        Mailing-address                    Phone
  Break           Michael     123        NY                                 212-111-1111
  the             Michael     123        NY                                 917-111-1111
  relation        Hilary      456        DC                                 202-222-2222

  into            Hilary      456        DC                                 914-222-2222
                  Bill        789        Chappaqua                          914-222-2222
  two:
                  Bill        789        Chappaqua                          212-333-3333


Name               SSN         Mailing-address                     SSN              Phone
Michael            123         NY                                  123              212-111-1111
Hilary             456         DC                                  123              917-111-1111
Bill               789         Chappaqua                           456              202-222-2222

        The anomalies are gone                                    456              914-222-2222

            No more redundant data                                789              914-222-2222

            Easy to for Bill to move                              789              212-333-3333

            Okay for Bill to lose all phones
                               M.P. Johnson, DBMS, Stern/NYU, Spring 2008                          15
Thus: high-level strategy
                             name

E/R Model:                           Product        buys                Person

                             price                               name        ssn




Relational Model:
plus FD’s


Normalization:
Eliminates anomalies


                    M.P. Johnson, DBMS, Stern/NYU, Spring 2008                     16
Using FDs to produce good schemas
1.       Start with set of relations
2.       Define FDs (and keys) for them based on real
         world
3.       Transform your relations to “normal form”
         (normalize them)
         Do this using “decomposition”


        Intuitively, good design means
         No anomalies
         Can reconstruct all (and only the) original information




                        M.P. Johnson, DBMS, Stern/NYU, Spring 2008   17
Decomposition terminology
   Projection: eliminating certain atts from relation
   Decomposition: separating a relation into two by
    projection
   Join: (re)assembling two relations
       Whenever a row from R1 and a row from R2 have the same
        value for some atts A, join together to form a row of R3


   If exactly the original rows are reproduced by joining
    the relations, then the decomposition was lossless
       We join on the attributes R1 and R2 have in common (As)
   If it can’t, the decomposition was lossy

                       M.P. Johnson, DBMS, Stern/NYU, Spring 2008   18
Lossless Decompositions
 A decomposition is lossless if we can recover:
         R(A,B,C)
                                                Decompose



         R1(A,B)          R2(A,C)

                                                     Recover

             R’(A,B,C) should be the same as
                        R(A,B,C)
 R’ is in general larger than R. Must ensure R’ = R
                   M.P. Johnson, DBMS, Stern/NYU, Spring 2008   19
Lossless decomposition
   Sometimes the same set of data is reproduced:
                       Name             Price         Category
                        Word             100              WP
                       Oracle            1000             DB
                       Access            100              DB



           Name      Price                                     Name     Category
            Word      100                                      Word       WP
           Oracle     1000                                     Oracle     DB
           Access     100                                      Access     DB

   (Word, 100) + (Word, WP)  (Word, 100, WP)
   (Oracle, 1000) + (Oracle, DB)  (Oracle, 1000, DB)
   (Access, 100) + (Access, DB)  (Access, 100, DB)


                       M.P. Johnson, DBMS, Stern/NYU, Spring 2008                  20
Lossy decomposition
   Sometimes it’s not:
                          Name            Price         Category
                          Word             100             WP                  What’s
                          Oracle          1000              DB
                                                                               wrong?
                          Access           100              DB


    Category     Name                                            Category   Price

       WP        Word                                               WP      100

       DB        Oracle                                             DB      1000

       DB       Access                                              DB      100


   (Word, WP) + (100, WP)  (Word, 100, WP)
   (Oracle, DB) + (1000, DB)  (Oracle, 1000, DB)
   (Oracle, DB) + (100, DB)  (Oracle, 100, DB)
   (Access, DB) + (1000, DB)  (Access, 1000, DB)
   (Access, DB) + (100, DB)  (Access, 100, DB)

                           M.P. Johnson, DBMS, Stern/NYU, Spring 2008                   21
Ensuring lossless decomposition
               R(A1, ..., An, B1, ..., Bm, C1, ..., Cp)



    R1(A1, ..., An, B1, ..., Bm)                   R2(A1, ..., An, C1, ..., Cp)


       If A1, ..., An  B1, ..., Bm or A1, ..., An  C1, ..., Cp
       Then the decomposition is lossless

                                                   Note: don’t need both
   Examples:
   name  price, so first decomposition was lossless
   category  name and category  price, and so second
    decomposition was lossy

                       M.P. Johnson, DBMS, Stern/NYU, Spring 2008                 22
Quick lossless/lossy example

                       X           Y          Z
                       1           2          3
                       4           2          5

   At a glance: can we decompose into R1(Y,X), R2(Y,Z)?
   At a glance: can we decompose into R1(X,Y), R2(X,Z)?




                  M.P. Johnson, DBMS, Stern/NYU, Spring 2008   23
Next topic: Normal Forms
   First Normal Form = all attributes are atomic
       As opposed to set-valued
       Assumed all along

   Second Normal Form (2NF)

   Third Normal Form (3NF)

   Boyce Codd Normal Form (BCNF)

   Fourth Normal Form (4NF)

   Fifth Normal Form (5NF)
                      M.P. Johnson, DBMS, Stern/NYU, Spring 2008   24
BCNF definition
   A simple condition for removing anomalies from
    relations:
                 A relation R is in BCNF if:
                    If As  Bs is a non-trivial dependency
                    in R , then As is a superkey for R


   I.e.: The left side must always contain a key
   I.e: If a set of attributes determines other attributes,
    it must determine all the attributes

   Slogan: “In every FD, the left side is a superkey.”
                     M.P. Johnson, DBMS, Stern/NYU, Spring 2008   25
 BCNF decomposition algorithm
Repeat
 choose A1, …, Am  B1, …, Bn that violates the BNCF condition
   //Heuristic: choose Bs as large as possible
 split R into R1(A1, …, Am , B1, …, Bn) and R2(A1, …, Am , [others])
 continue with both R1 and R2
Until no more violations




                         B’s             A’s          Others


                               R1                      R2
                       M.P. Johnson, DBMS, Stern/NYU, Spring 2008      26
Boyce-Codd Normal Form
   Name/phone example is not BCNF:
        Name        SSN    Mailing-address                      Phone
        Michael     123    NY                                   212-111-1111
        Michael     123    NY                                   917-111-1111

        {ssn,phone} is key
        FD: ssn  name,mailing-address holds
              Violates BCNF: ssn is not a superkey
   Its decomposition is BCNF
        Only superkeys  anything else
Name               SSN          Mailing-address             SSN                PhoneNumber
Michael            123          NY                          123                212-111-1111
                                                            123                917-111-1111


                           M.P. Johnson, DBMS, Stern/NYU, Spring 2008                         27
Design/BCNF example
   Consider situation:
       Entities: Emp(ssn,name,lot), Dept(id,name,budg)
       Relship: Works(E,D,since)
   Draw E/R

   New rule: in each dept, everyone parks in
    same lot
   Translate to FD
   Normalize

                    M.P. Johnson, DBMS, Stern/NYU, Spring 2008   28
BCNF Decomposition
   Larger example: multiple decompositions
   {Title, Year, Studio, President, Pres-Address}
   FDs:
       Title Year  Studio
       Studio  President
       President  Pres-Address
        Studio  President, Pres-Address
   No many-many this time
   Problem cause: transitive FDs:
       Title,year  studio  president



                       M.P. Johnson, DBMS, Stern/NYU, Spring 2008   29
BCNF Decomposition
        Illegal: As  Bs, where As not a superkey
        Decompose: Studio  President, Pres-Address
         As = {studio}
         Bs = {president, pres-address}
         Cs = {title, year}
        Result:
    1.    Studios(studio, president, pres-address)
    2.    Movies(studio, title, year)
        Is (2) in BCNF? Is in (1) BCNF?
         Key: Studio
         FD: President  Pres-Address
         Q: Does president  studio? If so, president is a key
         But if not, it violates BCNF


                         M.P. Johnson, DBMS, Stern/NYU, Spring 2008   30
BCNF Decomposition
   Studios(studio, president, pres-address)
   Illegal: As  Bs, where As not a superkey
    Decompose: President  Pres-Address
       As = {president}
       Bs = {pres-address}
       Cs = {studio}
   {Studio, President, Pres-Address} becomes
       {President, Pres-Address}
       {Studio, President}




                       M.P. Johnson, DBMS, Stern/NYU, Spring 2008   31
Decomposition algorithm example
   R(N,O,R,P)    F = {N  O, O  R, R  N}
                 Name               Office                 Residence   Phone
                 George             Pres.                  WH          202-…
                 George             Pres.                  WH          486-…
                 Dick               VP                     NO          202-…
   Key: N,P      Dick    VP     NO                                    307-…

   Violations of BCNF: N  O, OR, N OR

   Pick N  OR (on board)
   Can we rejoin? (on board)
   What happens if we pick N  O instead?
   Can we rejoin? (on board)
                    M.P. Johnson, DBMS, Stern/NYU, Spring 2008                 32
An issue with BCNF
   We could lose FDs

   Relation: R(Title, Theater, Neighboorhood)
   FDs:
     Title,N’hood  Theater

           Assume a movie shouldn’t play twice in same n’hood
     Theater  N’hood
   Keys:              Title        Theater                          N’hood
     {Title, N’hood}  Aviator      Angelica                         Village
     {Theater, Title} Life Aquatic Angelica                         Village


                        M.P. Johnson, DBMS, Stern/NYU, Spring 2008             33
Losing FDs
   BCNF violation: Theater  N’hood
   Decompose:
       {Theater, N’Hood}
       {Theater, Title}


   Resulting relations:
R1                                           R2
Theater          N’hood                       Theater              Title
Angelica         Village                      Angelica             Aviator
                                              Angelica             Life Aquatic

                      M.P. Johnson, DBMS, Stern/NYU, Spring 2008              34
   Losing FDs
      Suppose we add new rows to R1 and R2:

R1 Theater            N’hood                R2 Theater                     Title
   Angelica           Village                         Angelica             Life Aquatic
   Film Forum         Village                         Angelica             Aviator
                                                      Film Forum           Life Aquatic
        R’
         Theater            N’hood                          Title
         Angelica           Village                         Life Aquatic
         Angelica           Village                         Aviator
         Film Forum         Village                         Life Aquatic
      Neither R1 nor R2 enforces FD Title,N’hood  Theater
                          M.P. Johnson, DBMS, Stern/NYU, Spring 2008                  35
Third normal form: motivation
   Sometimes
       BCNF is not dependency-preserving, and
       Efficient checking for FD violation on updates is important


   In these cases BCNF is too severe a req.
       “over-normalization”


   Solution: define a weaker normal form, 3NF
       FDs can be checked on individual relations without
        performing a join (no inter-relational FDs)
       relations can be converted, preserving both data and FDs


                        M.P. Johnson, DBMS, Stern/NYU, Spring 2008    36
BCNF lossiness
   NB: BCNF decomp. is not data-lossy
       Results can be rejoined to obtain the exact original


   But: it can lose dependencies
       After decomp, now legal to add rows whose corresponding
        rows would be illegal in (rejoined) original


   Data-lossy v. FD-lossy




                        M.P. Johnson, DBMS, Stern/NYU, Spring 2008   37
Third Normal Form
   Now define the (weaker) Third Normal Form
       Turns out: this example was already in 3NF

     A relation R is in 3rd normal form if :

     For every nontrivial dependency A1, A2, ..., An  B
     for R, {A1, A2, ..., An } is a super-key for R,
     or B is part of a key, i.e., B is prime


Tradeoff:
       BCNF = no FD anomalies, but may lose some FDs
       3NF = keeps all FDs, but may have some anomalies

                      M.P. Johnson, DBMS, Stern/NYU, Spring 2008   38
Next week
   For Monday: read ch.5.1-2 (SQL)

   Proj1 due next Monday




                M.P. Johnson, DBMS, Stern/NYU, Spring 2008   39