 Economic Incentives for Database Normalization




Forthcoming in INFORMATION PROCESSING AND MANAGEMENT


               J. Christopher Westland
           School of Business Administration
           University of Southern California
           Los Angeles, California 90089-1421
                    (213) 743-6742



                  December 10, 1991




                                       ABSTRACT

                    Economic Incentives for Database Normalization



Database systems are central to business information processing. The conceptual basis
for most commercial database managers is the relational model. Little research exists
concerning the cost effectiveness of relational database normalization, but there is
anecdotal evidence that normalization-induced fragmentation may create inefficiencies.
Supply of and demand for normalization are investigated given management policies
for response time, database capacity and deletion. On the supply side,
normalization reduces costs associated with insertion, deletion and change anomalies.
The expected cost of removing change anomalies is linearly proportional to both
minimum database size and database capacity. The occurrence rates of insertion and
deletion anomalies are shown to be moderate for all but microcomputer-sized
databases; but since these anomalies tend to carry high removal costs, even small
probabilities of occurrence can be economically significant. On the demand
side, normalization can create retrieval inefficiencies where a comparatively small amount
of information is sought and retrieved from the database. Both an increase in
clustering and an increase in database size exacerbate these inefficiencies, which can
result in fragmentation and information overload. It is suggested that
normalization reduces the opportunity cost associated with information retrieval from a
database by improving recall, an effect most pronounced when recall is low. Where
retrieval rates are high with respect to update events, the database fragmentation caused by
normalization costs end users through slower retrieval response.




Keywords: Economics of Information Technology; Database Normalization; Relational
Database Model; Cost-Benefit Analysis




1.     Objectives and Significance of Database Normalization

Databases are central to business information systems. Hundreds of billions of dollars
are spent annually on their operation in the U.S. (Schlender [1989]; Computer Science and
Technology Board [1990]; Westland [1990d]). Database management systems (DBMS)
consequently represent a rapidly growing segment of the information systems market
(Hamilton [1990], System Builder [1989]). The conceptual basis for almost all
commercial DBMS is the relational model. Normalization has become a focal activity in
relational database design. The relational model is limited in its ability to express
real-world phenomena, but has the advantages of well defined query and schema
generation languages (relational algebra and calculus) and a well developed theory for
avoiding schema anomalies (normalization theory). Normalization was designed to
control various anomalies that arose in relational DBMS implementations. Codd [1972]
alludes to an objective for normalization, noting that 2NF is “better” than 1NF and 3NF is
“better” than 2NF in the sense that a decomposition in which the projections are
independent is generally preferable to one in which they are not (Date [1986; p. 380]).
But Date [1986; p. 361-3] emphasizes that there may also be sound reasons for flouting
the principles of normalization, since normalization provides logical database structures
that are desirable in only a limited sense.

From a business perspective, database normalization should yield tangible economic
advantages in the use and construction of databases. Yet little research has been
conducted to determine whether normalization results in more efficient database storage
and retrieval. Indeed, the literature is silent on the particular business or economic
objectives satisfied by normalization, or information technology pricing in general
(Westland [1992a; 1992b]). The stated justifications for normalization relate to logical
characteristics of the mathematics underlying the relational model. For example Date
[1986; p. 244] says that

       “The reason for insisting that all relations be normalized is as follows.
       Basically, a normalized relation is a simpler structure, mathematically
       speaking, than an unnormalized one. As a result, the corresponding
       operators are simpler too, and there are fewer of them.”

But the mathematical model is irrelevant to managers, who are concerned with cost
justification. Management may be wary of normalization since there is anecdotal
evidence that the normalization induced database fragmentation can create significant
inefficiencies, especially where a database experiences few updates, and many retrievals
(Inmon [1987], Bunnell [1987]). Normalization leads to decreased performance because
of the cost of taking joins; the question remains as to whether the additional cost of joins
is outweighed by the savings from avoiding update anomalies.

Seminal work by Codd [1972] defines three normal forms for relational databases,
viewing normalization as a way to remove some logical problems in the relational model.
First normal form (1NF), second normal form (2NF) and third normal form (3NF) were
the only forms originally proposed by Codd [1972], and are the normal forms generally
supported by commercial CASE and reengineering tools. Anderson [1990] notes that it is
"generally accepted that 3NF is a good goal for database design." The higher forms of
normalization such as Boyce/Codd NF, 4NF, 5NF and Domain-Key NF (Fagin [1981]) are
academically important, but are not widely implemented. These address join, multivalued,
inclusion and template dependencies. The decompositions considered in this research are
presumed to possess the desirable properties of dependency preservation and lossless join,
and are presumed to be free of problems with respect to nulls or dangling tuples.

This research investigates the supply- and demand-side policies and circumstances under
which normalization is economically desirable. It proceeds as follows. Section 2
investigates the supply of updates, and the use of normalization to avoid supply costs in
the form of update anomalies. The database model is presented, and costs of change,
insert and delete anomalies are derived. Section 3 investigates the demand for data
retrieval, and determines the effect of normalization on capital expenditure per unit of
storage on database hardware and on the utility of data retrieval. Section 4 summarizes the
results and discusses applications.

2.     Update Anomalies

Database activities involving a given piece of information proceed roughly in the
sequence acquisition, insertion, storage, change, querying, usage, and deletion. Data

acquisition activities obtain information about a real-world event1 and convert it into a
computer readable form. Acquisition may constitute the costliest task in database
operation since the majority of internal controls are implemented at this point. Storage
may be perceived as the process of accumulating and moving data forward in time. This is
costly since both storage media and data have time value. Additionally, database
management encompasses the asynchronous activities of deletion or archiving to remove
obsolete information, change to correct or complete stored information, and reorganization
to optimize physical performance. All of these are necessary components of a model of
the database's economic life-cycle.

Update anomalies are problems involving the three database activities — insertion,
change and deletion (Date [1986; pp. 368-369]). When nonfull or transitive
dependencies exist, the update anomalies arising from failure to decompose a 1NF relation
either create avoidable costs, or deprive users of benefits. Thus normalization could be
economically justified either as improving the utility of the database to the end-user, or
through elimination of delays, misinformation or frustrations involved in using the
database.

Normalization was designed to control various anomalies that arose in relational DBMS
implementations. Codd [1972] alluded to an objective function for normalization, noting
that 2NF is “better” than 1NF and 3NF is “better” than 2NF in the sense that a
decomposition in which the projections are independent is generally preferable to one in
which they are not (Date [1986; p. 380]).

Normalization is depicted here as a decomposition which reflects functional dependencies.
A 1NF relation may be decomposed into two relations — a given relation and a
functionally dependent relation whenever a nonfull (for 2NF) or transitive (for 3NF)
functional dependency exists by the approach summarized in Date [1986; pp. 366-374].
The values of the X component (or subset of attributes) of a tuple in a 1NF relation
uniquely (or functionally) determine the values of the Y component if for every two tuples
t₁ and t₂ in the 1NF database t₁[X] = t₂[X] implies t₁[Y] = t₂[Y]. This is denoted
X → Y; the set of attributes X is called the left-hand side of the functional dependency, and
Y is the right-hand side; a tuple in X can be called a left-hand side tuple (LT) and a tuple in


1 These events are often called transactions in business information systems.

Y can be called a right-hand side tuple (RT). Thus in figure 1, there are many employees
in a department, and no employees in two departments — an employee (LT) uniquely
determines a department (RT).




          FIGURE 1: EXAMPLE OF DECOMPOSITION FROM 1NF TO 2NF OR 3NF

A functional dependency XY is a full functional dependency if removal of any attribute
A from X means that the dependency does not hold any more. In contrast, a functional
dependency XY is a nonfull dependency if there is some attribute AX that can be
removed from X and the dependency will still hold — these are removed in 2NF. A
functional dependency XY in a relation schema R is a transitive dependency if there is a
set of attributes Z that is not a subset of any key of R, and both XZ and ZY hold —
these are removed in 3NF. Functional dependencies between relations typically arise
when information on more than one real-world object or entity is included in a single
database transaction. In the current exposition, decomposition to remove either nonfull
(2NF) or transitive (3NF) dependencies will be assumed to create two relations whose join
is the 1NF relation, and where there are many LTs for each RT. The example in figure 1
shows a decomposition where data on two real world object types is retained in the
database — an employee object uniquely (functionally) determines a department object.
                                                                                                 7

There are many employees (LT) in a department (RT).

Date [1986; pp. 366-374] notes that update anomalies arising from insertion and deletion
are simply differing perspectives on the same anomaly. These are therefore addressed in
the current analysis via a generic insertion / deletion anomaly. An insertion / deletion
anomaly occurs when a LT is acquired and needs to be inserted to the database, but the
associated RT is not on the database. For example, in figure 1, this anomaly would
occur when an employee was hired to work in a department which did not yet exist —
thus the employee change process would be held up until a new department could be
added. The addition of a new department could incur authorization and internal control
costs, as well as delays. Change anomalies, on the other hand, reflect database search
costs associated for all the connecting tuples affected by the change. Because external
manual systems are often involved in removing insertion / deletion anomalies, their fixed
cost is expected to be significantly higher than the entirely automated search costs of a
change anomaly.

       2.1.    Database Capacity and Usage Policy

Existing normalization literature focuses on the logical design of databases rather than
database usage. In contrast, economic analysis requires knowledge of database usage
over time. The current analysis focuses on database usage and associated transaction flows
for a 1NF versus its 2NF or 3NF decomposition into two relations. Extension to more
complex organizations is discussed in the last section of the paper. Assume that two
types of independently generated real-world events are associated with the two object
types represented on the database. Let each event provide data for a single tuple in the
database. The first part of this analysis provides a characterization of the stochastic
processes which generate anomalies.

Let LT information be generated by one real world object type (e.g., an employee), and
RT information be generated by another real-world object type (e.g., a department). Let
LT information arrive for insertion to the database with successive time between arrivals
given by the random variables {X_k^i, k = 1, 2, ...}. Let W_n^i be the waiting time to the nth
acquisition of LT information determining RT i. Denote as N_i(t) the number of LT
information acquisitions for which waiting time is less than some time period t, i.e., which
satisfy 0 < W_n^i ≤ t; this has expectation E[N_i(t)] = M_i(t), where M_i(t) is called the renewal
function. In this case, the counting processes N_i(t) are nonnegative, integer-valued
                                                                                                               8

stochastic processes that register the successive acquisitions of information on an LT to be
inserted to the database during time interval (0,t], where time durations between
successive arrivals are positive independent identically distributed (i.i.d.) random
variables. This model conforms to a renewal process (Karlin and Taylor [1975; Chapter
5]).

Across all potential RTs2, interarrival times X_k^i will be independent3 but will not
necessarily be identically distributed, since differing real-world processes generate
differing distributions of transactions. The total transaction volume in a given time
period (0,t] is a superposition of all of the processes {X_k^i, k = 1, 2, ...} over all of the
potential RTs, i = 1, 2, .... Feller [1966] shows, via the renewal process counterpart to the
Central Limit Theorem, that such a superposition will result in N(t), which has a Poisson
distribution

        Pr{ N(t) = j }  =  e^(−λt) (λt)^j / j!                                (1)

where the sum of the individual arrival rates tends to a stable limit λ4. In practice λ can be
estimated by counting transactions in a period.

During the design and implementation of a DBMS, policy decisions are made concerning
the investment in storage capacity for a system, even though management may not
formulate them as such. In the short-run, the capacity of database storage (S) is limited,
e.g., by the disk space available. Increasing capacity (S) requires procurement of more
storage space, which is assumed to be a long run decision. In the short-run, insertions


2 Databases contain only a limited number of tuples providing information about the real world. Thus
there exist “potential” tuples that are not currently on the database, but whose corresponding objects exist
in the real-world, or will exist sometime in the future. For example, management may consider adding a
new department, or may have actually set up a department without any employees, where the department
tuple (RT) will not be recorded on the database until an employee (LT) update occurs. Thus the new
department is a “potential” RT.

3 Independence may be assumed in employee hiring, for example, because successive hires will be
generated by different job needs and by applicants whose motivation and timing are determined by themselves.
Although there may be exceptions, an independence assumption is reasonable for most types of events.

4 If the sum did not stabilize, then the system would most likely face capacity problems either repeatedly
exceeding peak-load, or experiencing costly idle capacity. It is difficult to make policy decisions where
the transaction streams to which the system must respond are highly variable.

take place continuously, and may outstrip available storage (S).

In order to control this, management must delete records. Deletion and archiving policies
vary and may be influenced by legal restrictions, e.g., in connection with payroll or
uncollectible accounts information, and by convenience. Deletion occurs as information
ages. Obsolescence is quite rapid in many business databases; e.g., stock prices may be
obsolete within seconds. Other databases, for example those of literature citations, may not
become obsolete for many years (Westland [1990c]). Database information to be
deleted is usually flagged as deletion transactions are processed, and then deleted
periodically during a database reorganization.5 Assume that deletion policies reduce the
amount of stored data to some amount (s < S). S and s are management’s database
policy variables.

Let N(t) be the renewal counting process corresponding to the number of inserts, and
M(t) its expectation. Then N(S−s) + 1 is the number of periods elapsed until the first
deletion occurs, at which time the database size reduces to s. This adds a second renewal
process to the original insertion process. This renewal process may be perceived as an
inversion of a stockkeeping problem. In stockkeeping, parts are requisitioned by
customers or production until a “safety stock” level s is reached, then a purchase raises
inventory to some level S determined by management policy. With databases,
information is added until “storage capacity” level S is reached, then a purging of
information reduces the database size to some level s determined by management policy.
An adaptation of Karlin and Taylor's [1975; Chapter 8] inventory model gives the limiting
distribution of database size, conditioned upon the knowledge that this size cannot exceed
S, the total storage capacity.6 The limiting c.d.f. is

        Pr{ S̃ ≤ s + x | S̃ ≤ S }  =  M(x) / (1 + M(S−s))                      (2)

where the random variable S̃ is the size of the database. Equation (1)
established that N(t) has a Poisson distribution with parameter λt. Thus the c.d.f. of



5 Flagging followed by deletion may serve internal control functions as well.

6 The adaptation of Karlin and Taylor’s inventory replenishment model to a database usage model is
straightforward, and is left to the reader.
database size may be computed to be

        Pr{ S̃ ≤ s + x | S̃ ≤ S }  =  λt x / (1 + λt (S−s)) ;   x ∈ [0, S−s]   (3)

This is the limiting conditional distribution of database size; it is uniform on (s, S), with
conditional expectation

        E[ S̃ | S̃ ≤ S ]  =  s + λt (S−s)^2 / (2 + 2λt (S−s))                  (4)
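The limiting behavior in equations (3) and (4) can be illustrated by simulation. The sketch below, a numerical illustration rather than part of the original analysis, grows the database by Poisson insertions, purges it back to s whenever capacity S is reached, and compares the long-run average size with formula (4):

```python
# Simulate the insert-until-S / purge-to-s cycle and check formula (4).
import numpy as np

rng = np.random.default_rng(1)
lam, t, s, S = 1.0, 1.0, 100, 150           # illustrative policy and rate values

size, samples = s, []
for _ in range(200_000):
    size += rng.poisson(lam * t)            # insertions over one period of length t
    if size >= S:                           # capacity reached: purge back to s
        size = s
    samples.append(size)

samples = np.array(samples[1000:])          # discard burn-in
expected = s + (lam * t * (S - s) ** 2) / (2 + 2 * lam * t * (S - s))
print(f"simulated mean size: {samples.mean():.1f}")
print(f"formula (4) value:   {expected:.1f}")
```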

         2.2.     Cost of Change Anomalies

A change anomaly occurs when an RT must be changed, and either every associated LT
must be found and changed on the 1NF, or inconsistent functionally determining
information is allowed to exist for the functionally dependent information at some cost to
the end-user. Let S̃, S and s be measured in terms of counts of LTs in
order to provide a consistent measure of size across 1NF and 2NF or 3NF forms of the
database. Let ρ̃ be the rank random variable for the number of LTs for an
RT, where the RT with rank one has the largest number of LTs. Conceptually,
this sorts all of the RTs by the number of LTs under them; the number (frequency) of LTs
will be Pr{ρ̃ = ρ} S̃ for each ρ. The Zipf
[1949] distribution provides a robust model of the rank-frequency distribution of
categorizations such as this; it has mass function

        Pr{ ρ̃ = ρ }  =  ρ^(−θ) / ζ(θ) ,    ρ = 1, 2, ... ;   θ ∈ (1, 2)      (5)

where ζ(θ) is the Riemann Zeta function and the parameter θ has been called the
temperature of discourse by Mandelbrot [1983] from an analogy in physics7.
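For concreteness, the sketch below evaluates the mass function of equation (5) and the implied LT frequencies; the value θ = 1.5 and the database size are illustrative assumptions:

```python
# Evaluate the Zipf rank-frequency model of equation (5).
from scipy.special import zeta

theta = 1.5                                  # illustrative temperature of discourse
S_bar = 10_000                               # illustrative expected size in LTs

for rho in (1, 2, 5, 10, 100):
    p = rho ** -theta / zeta(theta)          # Pr{rank = rho}
    print(f"rank {rho:>4}: Pr = {p:.4f}, expected LTs = {p * S_bar:8.1f}")
```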


7 Higher values of θ fatten the rightmost tail of a Zipfian density and indicate a low frequency of
occurrence of rare dependencies. High values of θ are associated with more complex dependency
relationships between a relation and its functionally dependent relations (Westland [1990c]). Mandelbrot
[1983] showed that when values of θ greater than or equal to 2 were encountered, they reflected constraints
or abnormalities in dependencies between groups of information. The value of θ is topic and process
specific. Methods of estimating θ have been discussed in Fedorowicz [1982a; 1982b], Servi and Griffith
[1980] and Nelson and Tague [1985]. Theoretical bases for the Zipf distribution have been provided in Hill
[1974], in Mandelbrot [1965] and in Ijiri and Simon [1975]. Empirical studies on Zipfian populations are




                     FIGURE 2: FREQUENCY OF LTS (POTENTIAL & ACTUAL)

When the expected number of LTs under a given RT falls below one, the RT is presumed
not to exist, nor will a tuple exist on the 1NF of the database. An analogous model in
Westland [1991a] shows the RT recorded on the database with the least number of LTs is
of rank

        ρ_min(S̃)  =  sup{ ρ : (ρ^(−θ)/ζ(θ)) S̃ ≥ 1 }                          (6)

This is the least upper bound (sup) of the ranks at which RT information has at least one LT.
numerous; some of the more important appear in Bradford [1948], Ijiri and Simon [1975], Eco [1976],
Rauschner [1979], Salton [1971; 1972; 1979], Salton and McGill [1983], Salton and Zhang [1986], Sparck
Jones [1971; 1972; 1979] and Swanson [1960; 1977]; van Rijsbergen [1981] provides a summary related to
the indexing literature.
The frequency of LTs is depicted in figure 2. The number of LTs for each RT
will be (from largest to smallest)

        \f(—,)\o(\s\up8(~),)           1, ... ,min (\o(\s\up8(~),                
                       

with expectation, over RTs selected in proportion to their frequency of LTs,

        (S̄/ζ(θ)^2) Σ_{ρ=1}^{ρ_min(S̃)} ρ^(−2θ)                                (8)

where S̄ is the expectation of database size computed in the last section.
For a large database, i.e., where S̄ is large, formula (8) approaches
(ζ(2θ)/ζ(θ)^2) S̄, which is nearly linear in θ for θ ∈ (1, 2), with least squares best
fit (.05306 + .34626 θ) S̄.
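A numeric companion to formulas (6) through (8) follows. It relies on the reconstruction used here, in which the large-database limit of formula (8) is (ζ(2θ)/ζ(θ)^2) S̄; all parameter values are illustrative:

```python
# Compute rho_min from formula (6) and the finite and limiting forms of (8).
from scipy.special import zeta

theta, S_bar = 1.5, 10_000                          # illustrative values
rho_min = int((S_bar / zeta(theta)) ** (1 / theta)) # formula (6)
factor = zeta(2 * theta) / zeta(theta) ** 2         # large-database limit of (8)/S_bar

partial = sum(r ** (-2 * theta) for r in range(1, rho_min + 1)) / zeta(theta) ** 2
print(f"rho_min = {rho_min}")
print(f"finite sum of (8)/S_bar = {partial:.4f}; limiting factor = {factor:.4f}")
```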

Substituting S̄ from equation (4), the expected cost of removing change
anomalies in a time period of length t is

        γ ν̄ t (ζ(2θ)/ζ(θ)^2) ( s + λt (S−s)^2 / (2 + 2λt (S−s)) )            (9)

where γ is the cost of removing a single change anomaly8 and ν̄ is the rate of
arrival of change transactions.




8 The current analysis takes the costs of anomaly removal as given. They are unlikely to be fixed, since
anomaly removal affects the costs of removal of all other anomalies that refer to the same RT. Thus γ
will be a function dependent on the frequency of LTs. An average cost may be computed in specific
situations, but is outside the scope of the current model.




FIGURE 3: COST OF CHANGE ANOMALIES WITH POLICIES S ∈ [0,150] AND s ∈ [0,100]

Formula (9), the cost of change anomalies in time period t for a 1NF relation which could
be avoided by decomposition to two 2NF or 3NF relations, is depicted in figure 3 for
λ = 1, t = 1, ν̄ = 1, θ = 1.1, S ∈ [0,150] and s ∈ [0,100]. Inspection of
figure 3 shows that the expected costs of removing change anomalies vary linearly in
policies s and S−s, and will be greatest where both s and S are large.
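Formula (9) is straightforward to evaluate over the policy grid of figure 3. The sketch below does so for the same unit parameter values, using the ζ(2θ)/ζ(θ)^2 factor of the reconstruction above; the sample (s, S) pairs are illustrative:

```python
# Evaluate formula (9), the avoidable cost of change anomalies, on sample policies.
from scipy.special import zeta

gamma_c, nu_bar, lam, t, theta = 1.0, 1.0, 1.0, 1.0, 1.1   # figure 3 settings
factor = zeta(2 * theta) / zeta(theta) ** 2

def change_cost(s, S):
    e_size = s + (lam * t * (S - s) ** 2) / (2 + 2 * lam * t * (S - s))  # formula (4)
    return gamma_c * nu_bar * t * factor * e_size                        # formula (9)

for s, S in [(0, 50), (50, 100), (100, 150)]:
    print(f"s={s:>3}, S={S:>3}: cost = {change_cost(s, S):7.3f}")
```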

       2.3.    Cost of Insertion and Deletion Anomalies

An insertion anomaly occurs when an LT's information arrives, but there is no
corresponding RT information already on the database. For large databases, this typically
means that a new RT (e.g., a new department) must be set up on the database. This occurs
if and only if the first LT requiring a particular RT is being inserted.
The superposition of renewal processes was shown to yield a Poisson N(t) with parameter
λt. Using the rank-frequency model for occurrence of functional dependencies from the
last section, assume that the independent process updating RT ρ for its LT information is
a Poisson process with parameter

        λ_ρ  =  λ ρ^(−θ) / ζ(θ) ,    ρ = 1, 2, ... ;   θ ∈ (1, 2)             (10)

This states that RT insertion transactions are acquired with a probability proportional
to the number of LTs under them. The superposition of these independent processes
results in a Poisson process for database insertions, with parameter λt (consistent with the
renewal process characterized in equation 1).

Consider just the potential RTs ρ > ρ_min to the right of the ρ_min(S̃) line in
figure 2. These are sources of insertion / deletion anomalies. The probability of a single
insertion / deletion anomaly for the ρth RT is

        Pr{ N_ρ(t) = 1 }  =  e^(−λ_ρ t) (λ_ρ t)^1 / 1!  =  λ_ρ t e^(−λ_ρ t)   (11)

and, across all potential RTs, is

        Σ_{ρ=ρ_min}^{∞}  λ_ρ t e^(−λ_ρ t)                                     (12)

in a time period of duration t. The probability of n anomalies is

        [ Σ_{ρ=ρ_min}^{∞}  λ_ρ t e^(−λ_ρ t) ]^n                               (13)

and the expected number of anomalies is

        Σ_{n=1}^{∞}  n [ Σ_{ρ=ρ_min}^{∞}  λ_ρ t e^(−λ_ρ t) ]^n                (14)

A useful approximation for this value may be derived by substituting both sums with
integrals, the first over n ∈ [1, ∞) and the remaining sum with an integral over
λ_ρ ∈ [0, λ ρ_min^(−θ)/ζ(θ)) ≡ [0, λ₀). The integral over λ_ρ is
        ∫_0^{λ₀} λ t e^(−λt) dλ  =  (1/t) ∫_0^{λ₀ t} s e^(−s) ds  =  (1 − I(λ₀t)) / t        (15)

substituting s = λt, gamma function Γ(2) = (2−1)! = 1, and incomplete gamma function
I(λ₀t); tables and formulas for the incomplete gamma function may be found in
Abramowitz and Stegun [1972]. Taking the integral over n,

        \i(,∞, n \b\bc\[(\f(1–I0t),t)))\s\up20(n) dn 
                (16)

         \f(1–I0t),t) \b\bc\[( ln–2 \b\bc\[(\f(1–I0t),t)) — ln–1
\b\bc\[(\f(1–I0t),t)   )   )

It is necessary to restate this in terms of the policy variables s and S. These impact
formula (16) through λ₀ = λ ρ_min^(−θ) / ζ(θ). From formula (6) it can be seen that
ρ_min = (S̃/ζ(θ))^(1/θ), and thus λ₀ = λ ρ_min^(−θ) / ζ(θ) = λ / S̃, with expectation

        E[λ₀]  =  λ / ( s + λt (S−s)^2 / (2 + 2λt (S−s)) )                    (17)

substituting from formula (4). Figure 4 shows the expected cost of insertion or deletion
anomalies for various values of S and s, for λ = 1, for a unit time period, and assuming the
cost of removing an anomaly is 1.
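The computation behind figure 4 can be sketched as follows, with SciPy's regularized incomplete gamma standing in for I(·) in formula (15) and the integral approximation of formula (16) supplying the expected anomaly count; the policy pairs are illustrative:

```python
# Expected insertion / deletion anomalies from formulas (15)-(17).
import numpy as np
from scipy.special import gammainc   # regularized lower incomplete gamma P(a, x)

lam, t, theta = 1.0, 1.0, 1.1        # unit-period settings of figure 4

def expected_anomalies(s, S):
    e_size = s + (lam * t * (S - s) ** 2) / (2 + 2 * lam * t * (S - s))
    lam0 = lam / e_size              # formula (17): lambda_0 = lambda / E[size]
    q = gammainc(2.0, lam0 * t) / t  # formula (15); note Gamma(2) = 1
    lnq = np.log(q)                  # q < 1, so ln q < 0
    return q * (lnq ** -2 - lnq ** -1)   # formula (16)

for s, S in [(10, 20), (50, 100), (100, 150)]:
    print(f"s={s:>3}, S={S:>3}: expected anomalies = {expected_anomalies(s, S):.4f}")
```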




 FIGURE 4: COST OF INSERTION OR DELETION ANOMALIES WITH POLICIES S ∈ [0,150] AND
                                    s ∈ [0,100]

If ι is the cost of a single insertion or deletion anomaly (the insertion cost will be
different from the deletion cost), then ι times the vertical axis on figure 4 is the net
value of normalization, i.e., the expected cost differential between incurring insertion /
deletion anomalies in a 1NF relation versus avoiding them by decomposition to two 2NF
or 3NF relations. Figure 4 suggests that occurrence rates are moderate for all but
microcomputer-sized databases. But since these anomalies can lead to significant labor
costs associated with acquisition, update and control of RT information, even small
probabilities of occurrence can yield significant costs.
3.      Retrieval

        3.1.     Retrieval Policy

Normalization affects the value received by end users from successful database retrieval.
This occurs because functional dependencies are explicitly represented in the logical
design of the database, in the physical design of the database, and in processing in the
DBMS, which may use this information to better optimize queries. This improves the
utility of queries because LT data is clustered around RT data. Clustering improves the
processing of queries in two ways: (1) the percentage of relevant database information
which is retrieved, i.e., recall, will increase; and (2) query
searches involve fewer records, and thus response time should improve.

For any query, insert, change or deletion of a database tuple to occur, the physical storage
position of the tuple must first be located. The DBMS is assumed to include optimization
routines for choice of the best query, insert, delete or change execution strategy. This
takes a tuple of the relation on physical storage and logically organizes it so that it may be
read and updated faster than if it were recorded as a flat table directly accessed from
physical storage. Similarly, other optimization routines parse queries, modifying them so
that they more closely concur with physical database organization. The expected seek time
for a storage position presumably increases with database size; but due to DBMS search
optimization, it is less than a linear increase. Search efficiency is intimately tied to
indexing and physical file structure; and to index compression (Date [1986; p. 75]). The
speedup provided by search optimization will be specific to the implemented software,
processing hardware and storage medium. Prior research has provided specific functional
forms of optimization cost for SELECT and JOIN operations (Astrahan et alii [1976]);
multiway joins (Selinger et alii [1979]); aggregate operations (Klug [1982]); and statistical
estimates derived through join operations (Chin and Ozsoyoglu [1981]). A summary of
database cost analyses appears in Elmasri and Navathe [1989; pp. 526-31].

Seek time is affected by the inherent speed of the physical storage medium, and the size S
of the database. High-speed magnetic disk media are the favorite of commercial DBMSs,
which are typically optimized for such media. In the current model, let seek time9,


9 Seek time is interpreted in the current analysis to mean the time between communicating the query to
τ(κ, S̄), for a single record depend on two parameters: (1) the capital
expenditure for one unit of storage, κ, where the more that is spent, the faster the
retrieval, i.e., ∂τ/∂κ < 0 and ∂²τ/∂κ² > 0;10 and (2) the expected amount of data retained on
the storage medium, where ∂τ/∂S̄ > 0, ∂²τ/∂S̄² < 0
and ∂²τ/∂κ∂S̄ < 0, with S̄ given in formula (4). Since
most commercial DBMS use some variation of a B-tree architecture for storage and
indexing, it is reasonable to assume that seek time is correlated to only the number of
levels of indexing which must be traversed, since any search will be limited to a subtree. If,
at each level, the B-tree splits into n branches, an X level B-tree will be of size 1 + n^(X−1).
This suggests that seek time will be proportional to the logarithm of the database size,
giving a simple functional form for seek time of

        τ(κ, S̄)  =  ln(S̄) / κ                                                (18)

where the unit of measurement for τ sets the coefficient of proportionality to unity.11 For
multiple record seeks, assume that the process is repeated for each item sought (each most
likely being on a different subtree of the B-tree) and thus for m seeks, seek time is
m τ(κ, S̄). Assume that DBMS applications development policy sets a
response time for the system of τ*. Response time policies may appear in various forms;
these are discussed at length by Westland [1990a], and Westland [1990e; 1990f; 1991c]
considers the impact of shared resources on response time. For example, in online
networked systems, or data communication, the policy objective is to get a response
(transaction) back to a terminal in "real-time" (generally less than three seconds). In a
batch system, where human factors are less important, the policy objective is to process a
certain number of transactions in a specified period of time, e.g., 10,000 transactions
overnight (meaning in three to six hours). Both batch and online systems design policy
may be captured in τ*. The design decision on how much to invest (κ) in the storage
technology becomes


the system, and retrieval of the information from the system.

10 Capital expenditure will affect both seek time and database capacity.   Seek time may be extracted by
making κ a per unit cost.

11 Published performance statistics for Oracle's database server appear in Flynn [1988]; Caniano
[1988] provides some supporting evidence for this functional form.
        max_κ { τ(κ, S̄)  :  τ(κ, S̄) ≤ τ* }                                   (19)

where S̄ = S̄(s, S) is selected through management’s
policy decisions on s and S. Thus κ = ln(S̄) / τ*, where values are in
units of measurement which allow the elimination of the coefficient of proportionality.
Policy τ* is an information system’s service level policy, and implicitly assumes that there
is some cost of poor service. The cost of change anomalies (formula 9) is a component of
that cost of poor service, but there are likely to be other service costs as well.

        3.2.     Search Costs and Capital Expenditures

When a 1NF relation is normalized to two 2NF or 3NF relations, database response
optimization is typically achieved by logically clustering the LTs around the RT. Assume
that physical clustering is derived from logical clustering. Then, from a standpoint of
search speed, the normalized database may incur a given fixed cost for finding the first LT
under a given RT, with approximately zero marginal cost of finding the second and
additional LTs under an RT.12 On the other hand, the normalized database will require an
additional seek for the RT. Thus fragmentation inefficiencies in databases resulting from
normalization may be most pronounced in normalized databases which average a small
number of LTs per RT.

In the current analysis, data is assumed to be kept at the detailed level. From equation
(8), the seek time for the normalized database is seen to be

        lim_{ρ_min→∞}  τ( κ ; (S̄/ζ(θ)^2) Σ_{ρ=1}^{ρ_min} ρ^(−2θ) )  =  τ( κ ; (ζ(2θ)/ζ(θ)^2) S̄ )        (20)

which is less than the τ(κ; S̄) seek time for the database of 1NF
relations resident on media with the same capital expenditure per unit of storage κ, since τ
is increasing in S̄. Figure 5 shows the amount by which (ζ(2θ)/ζ(θ)^2) S̄ is less than S̄
for values of θ ∈ [1, 2].



12 One can construct situations in which this assertion will be violated in practice. For example,
physical clustering options may be chosen on the 1NF database, allowing a 1NF database to perform
similarly to a normalized database. The assertion, though, captures a general effect of normalization.




    FIGURE 5: ζ(2θ)/ζ(θ)^2, THE PROPORTIONAL REDUCTION IN RECORDS PROCESSED DURING
                                             SEEK


Since \f(,) is generally less than 40%, seek time for the functionally dependent
information is expected to decline noticeably with normalization. But normalization also
demands an extra seek for the RT. Let the expected size of a retrieval be m LTs (the RT
being retrieved automatically with the LT in a 1NF database). Response time policy 
may be met when m tuples are retrieved, but may not be met when m1 are retrieved.
Thus for a normalized database, a new capital expenditure per unit of storage level
\f(mm) higher is required to satisfy policy . This implies that the capital expenditure per
unit of storage required for a normalized database is

        norm \b\bc\[(\f(ln\b\bc\[( )), ln ( \o(\s\up10(–),) ))  1) \f(mm )
\f(, ln ( \o(\s\up10(–),) ))                         (21)

versus 1NF  \f(, ln ( \o(\s\up6(–),) )) for a database of 1NF relations. The reduction
in capital expenditure per unit of storage for policy when the database is normalized is

       \b\bc\[(1 — \b\bc\[(\f(ln\b\bc\[( )), ln ( \o(\s\up10(–),) ))  1)
\f(mm ) )     \f(, ln ( \o(\s\up10(–),) ))                              (22)
It is interesting to explore values of θ and m for which equation (22) is negative, i.e.,
where it is not cost effective to normalize. The value ln(ζ(2θ)/ζ(θ)^2) ∈ (−.371, −.918)
for θ ∈ (1, 2). For S̄ large, this factor becomes more of a problem, and it is least
significant in small databases of tens or hundreds of tuples such as is often encountered in
microcomputer database software. Consider a database of S̄ = 100
LTs, with ln(ζ(2θ)/ζ(θ)^2) = −.918; then equation (22) becomes

        [ 1 − .9082 ((m+1)/m) ] (m/τ*) ln(S̄)                                 (23)

which is negative for m < 9 tuples retrieved. With a database of S̄ = 10000
LTs, equation (22) becomes

        [ 1 − .99082 ((m+1)/m) ] (m/τ*) ln(S̄)                                (24)

which is negative for m < 107 tuples retrieved. From this perspective, normalization will
create inefficiencies where a comparatively small amount of information is being sought
and retrieved from the database. Both an increase in θ (implying a greater number
of LTs per RT) and an increase in S̄ (implying a larger database) will
exacerbate these inefficiencies. This problem is not as severe for queries as it might first
seem, since as the database size increases, the number of tuples which satisfy a query will
also increase. But the problem of end user overload, discussed below, then becomes a
concern. The analysis delineates the nature of the "fragmentation" problem. It suggests
that database normalization may not be cost justified in terms of the long-run capital
expenditures required to meet response time policy goals when retrieval sets are small.
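A small sketch of the break-even retrieval size implied by equation (22) follows. Because it uses the reconstructed reduction factor ζ(2θ)/ζ(θ)^2, its break-even values differ somewhat from the figures quoted above and should be read as illustrative only:

```python
# Smallest retrieval size m for which equation (22) is positive, i.e. for which
# normalization reduces the required capital expenditure per unit of storage.
import math
from scipy.special import zeta

def break_even_m(theta, size):
    c = zeta(2 * theta) / zeta(theta) ** 2   # reconstructed reduction factor
    g = math.log(c) / math.log(size) + 1.0   # bracketed factor of (21)
    # (22) > 0 exactly when 1 > g*(m+1)/m, i.e. when m > g / (1 - g)
    return g / (1.0 - g)

for size in (100, 10_000, 1_000_000):
    m_star = break_even_m(1.9, size)         # theta = 1.9 is illustrative
    print(f"size {size:>9,}: normalization pays for m > {m_star:.1f}")
```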




       3.3.    End-User Demand

The value of any database system arises not from update activities, but from the end-user
demand which is satisfied by information retrieval. This section analyzes the degree to
which a database user's information needs are satisfied by a "successful" retrieval, i.e.,
one which completes without an anomaly. Following Westland [1989], let π be the
traditional information science measure of precision, i.e., the portion of the relations
retrieved which are relevant to the end user's query. Relevance may reflect the
decision-model used by the end user, or various other sources as discussed in Maron and
Kuhns [1960] or Robertson [1968]. Let υ be the value received from consumption of a
relevant and retrieved tuple; this is a measure of value, i.e., the end user's willingness to
pay for useful information. Let χ be the cost, e.g., of the end user's effort and time, of
consuming a tuple which is retrieved, whether or not it is relevant to the end user’s query.

The point of futility F is specific to a user, and may depend upon the user's interest in a
topic (as discussed in Westland [1990c]), expectations and reputation effects (as discussed
in Westland [1990b; 1991b]), and miscellaneous other factors. Let φ be the portion of
retrieved tuples that are actually consumed before the end user reaches his point of futility
F, and which depends on the cost of consumption (Blair [1980; 1986]). If m₁ tuples are
retrieved, then there will be π m₁ tuples that are relevant and (1−π) m₁ that are not. The
end user will inspect a maximum of φ π m₁ tuples that are relevant and φ (1−π) m₁ that are
not, with expenditure χ φ m₁. F = φ m₁ is the maximum number of tuples that the end
user will inspect. This is an implicit function because F is fixed for an individual end user,
but φ and m₁ may vary. Then the net value of information actually used from a query
is

        υ π φ m₁ − χ φ m₁  =  χ F [ ω π − 1 ]                                 (25)

where ω = υ/χ is the ratio of value from consumption to cost of consuming. Note that F and
ω reflect only the end user's personal preferences, and involve neither the database nor the
retrieval system, whereas π and m₁ may vary individually with the topic, database,
retrieval system, and so forth.

This measure of net value from query retrieval increases with increasing precision π, with
increasing point of futility F, and with increasing value to cost ratio ω. The marginal value
of consumption ∂/∂[φ m₁] should diminish as φ m₁, the number of tuples read, becomes
larger; υ will tend to level off because redundancy causes the incremental information
gained from each tuple to decrease.

The net value of choosing a particular set of tuples includes both the value of the tuples
consumed and the opportunity cost of not consuming some other possible set of tuples.
To define an opportunity cost measure, let δ be the portion of tuples in the entire database
which are relevant to the query, and let S̄, computed in the prior section, be
the size of the database. The traditional recall performance measure, denoted R, is
therefore R = π m₁ / (δ S̄). Consider the alternative of retrieving m₂ tuples
rather than m₁ tuples, S̄ ≥ m₂ ≥ m₁. Let β = m₁/m₂ ≤ 1, the proportion of
tuples actually retrieved to the number that might have been retrieved. If φ is the portion
of the m₁ retrieved tuples consumed, then φβ is the portion of the m₂ tuples which might
have been consumed. The net value from choosing an alternative set of tuples of size m₂
from the database is

     \f( m1 ,m2) \b\bc\[( m1  \f(  \o(\s\up10(–),) – m1, \o(\s\up10(–),) –
m1) (m2 – m1)) —  \f( m1 ,m2) \b\bc\[( m1 – m2+ m1) (26)

If S̄ is much greater than m₁, then

        (δ S̄ − π m₁) / (S̄ − m₁)  ≈  δ                                        (27)

Substitute and F from above to obtain

       F \b\bc\[( \b\bc\[( (1 – )) – 1)
       (28)

This is the opportunity cost; it is simply formula (25) with π replaced by a value
intermediate between π, the portion of relevant tuples in the retrieval set, and δ, the
portion of database tuples that are relevant. The incremental cost of choosing m₂ rather
than m₁ tuples is formula (28) minus (25):

        χ F [ ω (δ − π) (1 − β) ]                                             (29)

or, in terms of recall R rather than δ,

        χ F [ ω π ( m₁ / (R S̄) − 1 ) (1 − β) ]                                (30)




This incremental cost increases with increasing point of futility F; with increasing value to
cost ratio ω; with decreasing β, or increasing m₂; with increasing δ; and with decreasing
precision π. When δ < π and β < 1, this quantity is negative, implying that it costs more to
have a larger retrieval set.
Blair and Maron [1985] provide empirical evidence that
individuals adjust their queries, especially in an interactive setting, in order to maximize
precision π, at the expense of recall. If precision is much greater than recall, as
suggested by Blair and Maron [1985], then the incremental cost is increasing rapidly in
m₂, which is proportional to 1/β. In this situation, the existence of a moderate cost at
the point of futility F favors a very small retrieval set size. But the results of the last
section suggest that this same bias tends to eliminate the positive net value of database
normalization by requiring an increase in capital expenditure per unit of storage to achieve
the same response time goals. Relational databases are highly structured in comparison
to the text databases studied by Blair and Maron. Consequently, low recall may be much
less of a problem in the current situation, and the results here may only be considered
suggestive.

When a 1NF relation is normalized to two 2NF or 3NF relations, this logically clusters the
LTs around the RT. Again, assume that physical clustering is derived from logical
clustering. Consider the interpretation of the recall improvement due to normalization in
light of formulas (25) and (28). The clustering of LTs around an RT in the normalized
database will tend to increase the recall from the normalized database over the database of
1NF relations. The net values from querying (25) do not depend upon recall, and thus
normalization has no impact upon them. "Opportunity costs" (28), though, do depend upon R. A
proportional increase of some minuscule amount ε in recall reduces opportunity cost by

        ( χ ω F π m₁ / (R S̄) ) (1 − β) (ε / R)                                (31)

thus by the proportion ε/R. Unfortunately, this calculation is more difficult to
apply than the previous cost calculations. Blair and Maron [1985] showed that it is
extremely difficult to ascertain recall or the effect of query optimization on recall. As
noted previously, relational databases are more structured than the databases studied in
Blair and Maron [1985], and may not be analogous. Generally though, if recall is very
low, the clustering brought about by normalization can be expected to increase the utility
of successful queries much more than if recall is very high.
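Despite the difficulty of measuring recall, the proportional reduction of formula (31) can at least be checked against the exact change of the recall term in formula (30); the sketch below uses illustrative values:

```python
# Compare the exact reduction in the recall-dependent term of (30) with the
# approximation of formula (31).  All values are illustrative assumptions.
chi, F, omega, precision = 1.0, 50.0, 4.0, 0.40
m1, size, beta = 200, 100_000, 0.5
R, eps = 0.02, 0.002                  # recall, and a small gain in recall

def term(r):
    """Recall-dependent term of formula (30)."""
    return chi * F * omega * precision * (m1 / (r * size)) * (1.0 - beta)

exact = term(R) - term(R + eps)
approx = term(R) * (eps / R)          # formula (31)
print(f"exact reduction: {exact:.3f}; formula (31) estimate: {approx:.3f}")
```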




4.         Discussion
The prior economic analysis modelled the activities associated with a database given
management policies for response time τ* and respective minimum and maximum
database sizes s and S. Although management may not couch their policies in terms of
these parameters specifically, the argument was made that actual database policy decisions
could be mapped to these parameters.
total “net value” of normalization — such a formulation would have involved too many
parameters to yield much intuition. Analysis of component activities, though, provides
useful insights, and these are summarized in this section.

The research investigated insertion, deletion and change anomalies under the minimal
assumption that capacity S and post-purge database size s are managed. The research
showed that the expected cost of removing change anomalies is linearly proportional to both
s and S−s, and will be greatest when the capacity S is very large. Change anomaly costs
are reflected in slower response times, and will only be a problem if response times are
already slow with respect to policy τ*, or if storage capacity is being reached. The
occurrence rates for insertion or deletion anomalies are moderate for all but
microcomputer sized databases. But since these anomalies are accompanied by
significant labor costs for finding, controlling accuracy and integrity, and updating new
data, even small probabilities of occurrence can be significant.     In contrast, the change
costs will not be a consideration unless they cause the system's response to consistently
violate response time policy, thus requiring an increase in capital expenditures.

Information retrieval generates the value that end users receive from a database. In
analyzing information retrieval, an additional policy decision on response (seek) time τ*
was imposed on the model. The research showed that normalization can create retrieval
inefficiencies where a comparatively small amount of information is being sought and
retrieved from the database. Both an increase in the number of LTs per RT, and an
increase in database size will exacerbate these inefficiencies. This problem may or may
not influence value since as the database size increases, the number of tuples which satisfy
a query will also increase. This can result in information overload, or incomplete
retrievals because retrieval cannot be completed in the response time τ* allotted13. This is


13 This is commonly called the “fugitive” information problem, because the information exists, but is not
available in the time frame needed either because the end user is overloaded with information, or because
of excessive search costs and time in the information system. It is one of the major problems with current
information systems.
the "fragmentation" problem for which we have some anecdotal evidence.
“Fragmentation” can result in excessive capital expenditures to address response time
policy . It was shown that where there exists a moderate cost at the point of futility F,
database users will favor very small retrieval set sizes. But this will require an increase in
capital expenditures to achieve the same response time goals.

The analysis of user demand for information retrieval yielded only suggestive results.
Normalization reduces the opportunity cost associated with information retrieval from a
database by improving recall. It is most pronounced when recall is low. Evidence from
the information retrieval literature has shown that retrieval recall from document
databases is typically low, but must be interpreted with caution. Relational databases are
much more structured than text databases, and it may be difficult to draw conclusions on
recall for relational databases (Westland [1990a]).

The research drew conclusions only for a simple two relation database. But if the models
presented here are perceived as describing the functional dependencies between the two
relations, it is simple to address more complex database structures using the same models.
Complex schemas may be broken into pairs of relations. Response time policy will be
that associated with views using these relations. Using this approach, the models
presented here should be generalizable to most databases encountered in practice. By
parameterizing these models with empirical measurements from the environment in which
the database will operate, it will be possible to specify the cost effectiveness of
normalization in many specific situations.




5.     Bibliography

Abramowitz, M. and Stegun, I. A. (1972). Handbook of Mathematical Functions. New
     York:Dover

Anderson, J. (1990). Canonizer: A Tool Designed for Pros. UNIX Today!, April 2, 1990

Astrahan, M. (1976). System R: A Relational Approach to Data Base Management. ACM
       Transactions on Database Systems, v.1(2)

Blair, D.C. (1980). Searching Biases in Large, Interactive Document Retrieval Systems.
       Journal of the American Society for Information Science, v. 31(4), 271-277

Blair, D.C. (1986). Indeterminacy in the Subject Access to Documents. Information
        Processing and Management, v.22(2)

Blair, D.C. and Maron, M.E. (1985). An evaluation of retrieval effectiveness for a
        full-text document retrieval system. Communications of the ACM, v.28(3), 289-97

Bradford, S.C. (1948). Documentation. London: Crosby Lockwood

Bunnell,D. (1987). The Challenge of Hypermedia. PC World, November 1987, 17-26

Chin, F.Y. and Ozsoyoglu, G. (1981). Statistical Database Design. ACM Transactions on
       Database Systems, v. 6, 113-139

Codd, E.F. (1972). Further Normalization of the Data Base Relational Model, in Data
       Base Systems. Courant Computer Science Symposia Series, vol. 6, Englewood
       Cliffs: Prentice-Hall

Computer Science and Technology Board: National Research Council (1990). Keeping
     the US Computer Industry Competitive. Washington, D.C.: National Academy
       Press

Cooper, W.S. (1971). A Definition of Relevance for Information Retrieval. Information
      Storage and Retrieval, v.7(1), 19-37

Eco, U. (1976). A Theory of Semiotics. Bloomington: Indiana University Press

Elmasri, R. and Navathe, S.B. (1989). Fundamentals of Database Systems. Redwood City, CA:
       Benjamin/Cummings

Fedorowicz, J. (1982a). The Theoretical Foundation of Zipf's Law and Its Application to
       the Bibliographic Database Environment. Journal of the American Society of
       Information Science, v.33(5), 285-293

Fedorowicz, J. (1982b). A Zipfian Model of an Automatic Bibliographic System: An
      Application to MEDLINE. Journal of the American Society of Information
      Science, v. 33(5), 223-232
Feller, W. (1966). An Introduction to Probability Theory and Its Applications, v. II. New
        York:Wiley

Flynn, M.K. (1988). Oracle Unveils OLTP Version of its Relational DBMS. Datamation,
       September 1, p. 93

Hamilton, D. (1990) Application Development Stars. System Builder, April/May

Hill, B.M. (1974). The Rank Frequency Form of Zipf's Law. Journal of the American
        Statistical Association, 69(348), 1017-1026

Ijiri, Y. and Simon, H.A. (1975). Some Distributions Associated with Bose-Einstein
         Statistics. Proceedings of the National Academy of Sciences, v. 72(5), 1654-1657

Inmon, W.H. (1987). Denormalize for Efficiency. Computerworld, (March 16), v. 21(11),
       19-21

Karlin, S. and Taylor, H.M. (1975). A First Course in Stochastic Processes (2nd ed.).
        New York: Academic Press

Klug, A. (1982). Equivalence of Relational Algebra and Relational Calculus Query
       Languages Having Aggregate Functions. Journal of the ACM, v. 29(3)

Mandelbrot, B.B. (1965). Information Theory and Psycholinguistics, in B.A. Wolman and
       E.N. Nagel (eds.) Scientific Psychology: Principles and Applications. New
       York: Basic Books

Mandelbrot, B.B. (1983). The Fractal Geometry of Nature. New York:Freeman

Maron, M.E., and Kuhns, J.L. (1960). On Relevance, Probabilistic Indexing, and
      Information Retrieval. Journal of the Association of Computing Machinery, 7(3),
      216-244

Nelson, M.J., and Tague, J.M. (1985). Split Size-Rank Models for the Distribution of
       Index Terms. Journal of the American Society of Information Science, v. 36(5),
       283-295

Salton, G. (1972). Experiments in Automatic Thesaurus Construction for Information
       Retrieval. Information Processing 71. Amsterdam: North Holland, 115-123

Salton, G. (1979). Mathematics and Information Retrieval. Journal of Documentation, v.
        35(2), 145-153

Salton, G. (ed.) (1971). The SMART Retrieval System Experiments in Automatic
        Document Processing. Englewood-Cliffs:Prentice-Hall

Salton, G. and McGill, M.J. (1983). Introduction to Modern Information Retrieval. New
        York:McGraw-Hill

Salton, G. and Zhang, Y. (1986). Enhancement of Text Representations Using Related
        Document Titles. Information Processing and Management, v. 22(5), 385-394

Schlender, B.R. (1989). How to Break the Software Logjam. Fortune, September 25,
       1989, 100-112

Selinger, P. (1979). Access Path Selection in a Relational Database Management
       System. ACM/SIGMOD 1979

Servi, P.N., and Griffith, B.C. (1980). A Method for Partitioning the Journal Literature.
       Journal of the American Society for Information Science , v. 31(1), 36-50

Sparck Jones, K. (1971). Automated Keyword Classification for Information Retrieval.
       London:Butterworths

Sparck Jones, K. (1972). A Statistical Interpretation of Term Specificity and Its
       Application in Retrieval. Journal of Documentation, v. 28(1), 11-21

Sparck Jones, K. (1979). Search Term Relevance Weighting Given Little Relevance
       Information. Journal of Documentation, v.35(1), 30-49

Swanson, D.R. (1960). Searching Natural Language Text by Computer. Science, v. 132,
      1060-1104

Swanson, D.R. (1977). Information Retrieval as a Trial-and-Error Process. Library
      Quarterly, v.47(2), 129-48

System Builder (1989). The People's Choice Awards, June/July, 27-34
van Rijsbergen, C.J. (1981). Information Retrieval. London:Butterworths

Westland, J.C. (1989). A Net Benefits Approach to Measuring Retrieval Performance.
      Information Processing and Management, October 1989, v.25(5), 579-581

Westland, J.C. (1990a). Scaling Up Output Capacity and Performance Results from
      Information Systems Prototypes. ACM Transactions on Database Systems,
       September, v. 15(3), 341-58

Westland, J.C. (1990b). Assessing the Economic Benefits of Information Systems
      Auditing. Information Systems Research, v.1(3), 309-24

Westland, J.C. (1990c). Topic Specific Market Concentration in the Information Services
      Industry: Evidence from the DIALOG Group of Databases. The Information
      Society, v. 6, 127-138

Westland, J.C. (1990d). Competing in the World's Computer Market. Scientific
      American, November, v. 263(5), 152

Westland, J.C. (1990e). Collaboration and Productivity in Information Systems Research.
      The Information Society v.7(1), 33-50

Westland, J.C. (1990f). CASE as Negotiation: Systems Design Goals and the
       Management of Group Learning in an Automated Systems Design Process.
       Sistemica, v.1(2), 131-43

Westland, J.C. (1991a). Economic Constraints in Hypertext. Journal of the American
      Society of Information Science, April, v. 42(3), 178-184

Westland, J.C. (1991b). Problem Vectorizability and the Market for Vector
      Supercomputing. Information Processing and Management, v. 27(5)

Westland, J.C. (1991c). Self-Organizing Executive Information Networks. Decision
       Support Systems, in press

Westland, J.C. (1992a). Congestion and Network Externalities in the Short Run Pricing
      of Information Systems Services. Management Science, in press

Westland, J.C. (1992b) .The Marginal Analysis of Strategic Investments in Information
      Technology. in Mahmood, Banker and Kaufman (eds.) Strategic and Economic
       Impacts of Information Technology Investment: Perspectives on Organizational
       Growth and Competitive Advantage. Middletown, PA: Idea Group Publishing,
       in press

Zipf, G.K. (1949). Human Behavior and the Principle of Least Effort. New York:Hafner (facsimile
       1965)

				