Normalization by dfgh4bnmu


									Normalization                                                                 Mote’s Notes

Database design is an art. Like many other art forms (painting, music, poetry), it is an art
that has a solid theoretical foundation. Some of this theory is based on formal theorem
and proof logic; some of it is based on years of practical experience. One of the most
important theoretical foundations of database design is normalization.

Database normalization is the process of reducing the data to its simplest form. This
minimizes redundancies, eliminates chances of insertion, update and deletion errors, and
provides the most flexible storage structure. Normalization is a multi-step process.

First Normal Form (1NF)

The first stage of normalization is to ensure that all data can be directly accessed by a
primary key. This means that every table must have a primary key and that all
compound and repeating fields must be eliminated. A compound field is one that
contains multiple, distinct pieces of data. Two classic examples of compound fields are
concatenated first and last names, and comma delimited addresses.

       Name         John Smith
       Address      1234 Buffalo Trail, Austin, Texas, 78701

A human may have little difficulty manually dealing with these compound fields, but the
database program will need to be a bit more sophisticated in order to process simple tasks
like sorting the records by last name, or finding all customers living in Austin. While it is
true that all database products include a substring function to allow us to parse compound
fields, simply finding all addresses containing the string “Austin” may not produce the
desired results. What happens with this second customer?

       Name         Mary Jones
       Address      4321 Austin Blvd, Buffalo, New York, 10787

The normalization solution for compound fields is to simply decompose them into
multiple fields. Instead of having two fields, Name and Address, we will have six fields.

       FirstName    Mary
       LastName     Jones
       Street       4321 Austin Blvd
       City         Buffalo
       State        New York
       Zip          18291

Repeating fields also cause processing difficulties. A classic example of a repeating field
is the recording of an employee’s family members as separate fields in the employee
master table.

       FirstName    Bob
       LastName     Brown
       Spouse       Beverly
       Child1       Ben
       Child2       Barbara
       Child3       Betty
       Child4       --
       Child5       --

In this example, Bob and his wife Beverly have three kids. No problem. The database
was designed to handle up to five kids. What happens if Beverly has triplets next month?

The problem with repeating fields is that the database has to be designed for the worst
possible case. If one employee can have ten kids, then all employees must be given the
storage space for ten kids. That’s the obvious problem, but there are a host of more subtle
problems with repeating fields. For example, how can we alphabetize the names of Bob
and Beverly’s kids?

The normalization solution for repeating fields is to create a new “child” table. In this
example, Bob (and his unique spouse) would be stored on one record in the employee
master table. The three kids would be stored on three separate records in a new Kids
table. Now, if Beverly has triplets, we just need to insert three new records. If she has a
dozen kids, then we insert a dozen records. If an employee doesn’t have any kids, then no
records are stored in the child table. What could be easier? What could be more efficient?

Second Normal Form (2NF)

The second rule of normalization is: Every non-key field must depend on the table's
entire primary key. The implication of this rule is that each table should only contain
data about a single subject.

To illustrate this, consider the following five field paycheck table:

       Date         9/30/2002
       SSN          123-45-6789
       FirstName    Bob
       LastName     Brown
       Amount       $7,500

The composite primary key is made up of the Date and the SSN fields. The other three
fields are non-key fields. Clearly, the FirstName and LastName fields do not belong in

the table. They do not depend on the entire composite primary key; they only depend on
the SSN portion of the primary key.

Third Normal Form (3NF)

The third rule of normalization is: Non-key fields must be independent of all other
non-key fields. You cannot be forced to make a change to one non-key field when a
change is made to another non-key field.

The most common violation of this rule is when computed values are stored in a table.

For example, consider this ten field employee table:

       SSN                987-65-4321
       FirstName          Jenny
       LastName           Williams
       MonthlySalary      $6,000
       AnnualSalary       $72,000
       JobTitle           Assistant Manager
       Street             1122 Maple St
       City               Austin
       State              Texas
       Zip                78712

Clearly, we don’t need to store the monthly and annual salaries. One of them, either one,
should be discarded. If one of the values is stored, then we can easily compute the other
value when needed.

Another possible example of field independence is the relationship between salary and
job title. If every assistant manager earns $72,000 per year, and if every other job title has
a fixed salary, then the salary should be removed from the employee master table. A new
Job table should be created.

       AnnualSalary       $72,000
       JobTitle           Assistant Manager

The JobTitle field will be the primary key of the new table. The JobTitle field will be a
foreign key in the original employee table.

Obviously, very few organizations have a fixed salary structure like this. In most
organizations, the salary should be left in the employee table since the salary is a function
of the individual person, not simply their job title.

A more subtle example of field independence is the relationship between city, state and
zip code. Quick, without looking back at Jenny Williams’ sample data, what is the city

and state for 78712? You didn’t need to know her SSN to answer this question, did you?
Of course not.

City and state are not independent of zip code. According to this rule of normalization,
they should not be stored in the employee master table. Instead, a new Location table
should be created. Zip will be the primary key of the new table. City and State will be
non-key fields. City and State are independent of each other. What state goes with
Austin? You don’t know for sure, since Austin Texas isn’t the only city named Austin in
the U.S.

City and state no longer appear in the employee master table. The employee’s zip field
serves as a foreign key to the new (parent) Location table.


Normalization is a multi-step process. I’ve presented the first three levels of
normalization. Some database professionals impose higher degrees of normalization;
some live with less. The first three levels, however, are arguably the most commonly
followed today.


To top