Database Management System
A file consists of a collection of records. A key element in file management is the way in
which the records are organized inside the file, since this heavily affects system
performance when records must be found and accessed. Note carefully that by
``organization'' we refer here to the logical arrangement of the records in the file, not
to the physical layout of the file as stored on a storage medium. To prevent
confusion, the latter is referred to by the expression ``record blocking'', and will be
treated later on.
Choosing a file organization is a design decision, hence it must be made with the most
likely usage of the file in mind, so as to achieve good performance.
The criteria usually considered important are:
1. Fast access to a single record or to a collection of related records.
2. Easy record addition, update, and removal, without disrupting the existing organization.
3. Storage efficiency.
4. Redundancy as a safeguard against data corruption.
Needless to say, these requirements conflict with each other in all but the most
trivial situations, and it is the designer's job to find a good compromise among them,
yielding an adequate solution to the problem at hand. For example, ease of
adding and removing records is not an issue when defining the data organization of a
CD-ROM product, whereas fast access is, given the huge amount of data that this medium
can store. However, as will become apparent shortly, fast access techniques are based
on the use of additional information about the records, which in turn competes for space
with the high volumes of data to be stored.
Logical data organization is indeed the subject of whole shelves of books, in the
``Database'' section of your library. Here we'll briefly address some of the simpler
techniques in use, mainly because of their relevance to data management from the lower-level
(with respect to a database's) point of view of an OS. Five organization models will be
considered: pile, sequential, indexed, hashed, and indexed sequential.
Pile File Organization:
A heap file is also known as a random file or pile file. In heap file organization,
records are inserted at the end of the file, or in any file block with free space, so
record insertion is efficient. Data are collected in the order in which they arrive. They
are not analyzed, categorized, or forced to fit field definitions or field sizes; at best,
the order of the records may be chronological. Records may be of variable length and need
not have similar sets of data elements.
Uses of Pile File Organization:
Heap files are used in situations where data are collected prior to processing, where data
are not easy to organize, and in some research on file structures. Since much of the data
collected in real-world situations comes in the form of piles, this organization is
considered the baseline against which other organizations are evaluated.
Drawback of Pile Organization:
In heap file organization, data analysis can become very expensive because of the time
required to retrieve a statistically adequate number of sample records:
1. Searching for a record is slow: normally a linear scan of the whole file is needed
to locate it.
2. Deleting a record is also expensive, since the record must first be located with
such a scan before it can be removed.
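As a rough illustration of the points above, here is a minimal in-memory sketch of a pile file; the class and method names are illustrative, not taken from any real system:

```python
# Minimal sketch of a pile (heap) file, kept in memory as a list of records.
# Records are plain dicts of varying shape, appended in arrival order;
# the only way to find one is a linear scan.

class PileFile:
    def __init__(self):
        self.records = []            # records stored in arrival order

    def insert(self, record):
        self.records.append(record)  # efficient: always append at the end

    def find(self, predicate):
        # Linear search: records are examined one by one until a match is found.
        for rec in self.records:
            if predicate(rec):
                return rec
        return None

pile = PileFile()
pile.insert({"name": "Ada", "dept": "CS"})
pile.insert({"id": 7, "note": "no fixed fields"})   # a different shape is fine
hit = pile.find(lambda r: r.get("id") == 7)
```

Insertion is cheap, but both search and deletion pay the cost of the linear scan, exactly as listed above.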
Sequential File Organization:
This is the most common structure for large files that are typically processed in their
entirety, and it is at the heart of the more complex schemes. In this scheme, all the records
have the same size and the same field format, with the fields having fixed size as well.
The records are sorted in the file according to the content of a field of a scalar type,
called the ``key''. The key must uniquely identify a record, hence different records have
different keys. This organization is well suited for batch processing of the entire file,
without adding or deleting items: this kind of operation can take advantage of the fixed
size of records and file; moreover, this organization is easily stored both on disk and on
tape. The key ordering, along with the fixed record size, makes this organization amenable
to dichotomic (binary) search. However, adding and deleting records in this kind of file is
a tricky process: the logical sequence of records typically matches their physical layout
on the storage medium, so as to ease file navigation; hence adding a record while
maintaining the key order requires a reorganization of the whole file. The usual solution
is to make use of a ``log file'' (also called a ``transaction file''), structured as a
pile, to collect this kind of modification, and to periodically perform a batch update on
the master file.
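The dichotomic search enabled by the fixed record size can be sketched as follows; the 16-byte record layout and the big-endian integer key are illustrative assumptions, not part of any particular file format:

```python
# Sketch of dichotomic (binary) search over a sequential file: records are
# fixed-size and sorted by a unique key, so the i-th record can be located
# directly and the search space halved at each step.
import struct

RECORD_SIZE = 16   # assumed layout: 4-byte big-endian key + 12 bytes of data

def make_file(keys):
    # Build an in-memory "file" of fixed-size records sorted by key.
    return b"".join(struct.pack(">I", k).ljust(RECORD_SIZE, b"\x00")
                    for k in sorted(keys))

def binary_search(data, key):
    lo, hi = 0, len(data) // RECORD_SIZE - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        rec = data[mid * RECORD_SIZE:(mid + 1) * RECORD_SIZE]
        (k,) = struct.unpack(">I", rec[:4])
        if k == key:
            return mid            # record number inside the file
        if k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

data = make_file([30, 10, 50, 20, 40])
```

Because every record starts at a multiple of `RECORD_SIZE`, the middle record of any range can be read directly, which is what makes the halving possible.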
Advantages of Sequential File Organization:
The sequential file organization permits the economical and efficient use of
sequential processing techniques when the activity rate is high.
Binary search on the sorted keys also provides relatively quick access to
individual records.
Existing records can be updated in place, since all records share the same fixed size.
Disadvantages of Sequential file Organization:
The sequential file organization is less efficient in the use of storage space than
some other file organizations.
It requires relatively expensive hardware and software resources.
It requires unique keys.
Processing is occasionally slow.
It requires periodic reorganization of the file.
Indexed File Organization:
Each record in the file has one or more embedded keys (referred to as key data items);
each key is associated with an index. An index provides a logical path to the data records
according to the contents of the associated embedded record key data items. Indexed files
must be direct-access storage files. Records can be fixed length or variable length.
Each record in an indexed file must have an embedded prime key data item. When
records are inserted, updated, or deleted, they are identified solely by the values of their
prime keys. Thus, the value in each prime key data item must be unique and must not be
changed when the record is updated. You tell COBOL the name of the prime key data
item in the RECORD KEY clause of the file-control paragraph.
In addition, each record in an indexed file can contain one or more embedded alternate
key data items. Each alternate key provides another means of identifying which record to
retrieve. You tell COBOL the name of any alternate key data items on the ALTERNATE
RECORD KEY clause of the file-control paragraph.
Like the sequential organization, the data are stored in physically contiguous blocks.
The difference, however, is in the use of indexes. There are three areas in the disk storage:
Primary Area: - contains file records stored by key or ID numbers.
Overflow Area: - contains records that cannot be placed in the primary area.
Index Area: - contains the keys of records and their locations on the disk.
Advantages of Indexed File:
• Faster access to records when the indexed field is searched on.
Disadvantages of Indexed File:
• Inserts and deletes are slower.
• Updates to indexed fields are slower.
• Storage use increases.
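The core idea above, a separate index mapping each prime key to a record's location, can be sketched in a few lines; the class and field names are illustrative and not tied to any COBOL runtime or real file system:

```python
# Sketch of an indexed file kept in memory: records live in a data area,
# and a separate index maps each prime key to the record's position,
# giving direct access without scanning the data.

class IndexedFile:
    def __init__(self):
        self.data = []        # primary area: records in insertion order
        self.index = {}       # prime key -> position in self.data

    def insert(self, key, record):
        if key in self.index:
            raise KeyError("prime key must be unique")
        self.index[key] = len(self.data)
        self.data.append(record)

    def get(self, key):
        pos = self.index.get(key)   # one index probe instead of a scan
        return None if pos is None else self.data[pos]

f = IndexedFile()
f.insert(101, {"name": "Ada"})
f.insert(102, {"name": "Grace"})
```

An alternate key would simply be a second dictionary built the same way, mapping alternate key values to the same record positions.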
Hash Files Organization:
Hashing (hash addressing) is a technique for providing fast direct access to a specific
record on the basis of a given value of some field. If two or more key values hash to the
same disk address, we have a collision. The hash function should distribute the domain of
the key as evenly as possible over the address space of the file, to minimize the chance
of collisions. Collisions may cause a page to overflow.
1. Hashing involves computing the address of a data item by computing a function
on the search key value.
2. A hash function h is a function from the set of all search key values K to the set
of all bucket addresses B.
• We choose a number of buckets to correspond to the number of search key
values we expect to store in the database.
• To perform a lookup on a search key value Ki, we compute h(Ki) and
search the bucket with that address.
• If two search keys Ki and Kj map to the same address, because h(Ki) = h(Kj),
then the bucket at that address will contain records with both
search key values. In this case we have to check the search key value of
every record in the bucket to get the ones we want.
• Insertion and deletion are simple.
1. A good hash function gives an average-case lookup that is a small constant,
independent of the number of search keys.
2. We hope records are distributed uniformly among the buckets.
3. The worst hash function maps all keys to the same bucket.
4. The best hash function maps all keys to distinct addresses.
5. Ideally, distribution of keys to addresses is uniform and random.
6. Suppose we have 26 buckets, and map names beginning with the ith letter of the alphabet
to the ith bucket.
Problem: this does not give a uniform distribution, since many more names will be
mapped to ``A'' than to ``X''.
Typical hash functions instead perform some operation on the internal binary machine
representations of the characters in a key; for example, compute the sum, modulo the
number of buckets, of the binary representations of the characters of the search key.
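The character-sum function just described can be sketched directly; the function name and bucket count are illustrative:

```python
# Character-sum hash: add up the byte values of the key and take the
# result modulo the number of buckets.
NUM_BUCKETS = 26

def char_sum_hash(key, buckets=NUM_BUCKETS):
    return sum(key.encode()) % buckets

# Two different keys made of the same characters have the same sum,
# so they collide: 65 + 66 = 131, and 131 % 26 = 1 for both.
a = char_sum_hash("AB")
b = char_sum_hash("BA")
```

The collision between "AB" and "BA" shows why real hash functions mix in character positions as well, not just their values.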
Handling of Bucket Overflows:
1. Open hashing: records are stored in buckets that can be chained as needed. To find
a record, compute the hash function and search the corresponding bucket.
2. Closed hashing: all records are stored within a fixed set of bucket slots, and the
hash function computes addresses within that set. (Deletions are difficult.) It is not
used much in database applications.
3. A drawback of this approach is that the hash function must be chosen at implementation
time, so the number of buckets is fixed, while the database may grow:
If the number is too large, we waste space.
If the number is too small, we get too many collisions, with records of many
different search key values ending up in the same bucket.
Choosing the number of buckets to be twice the number of search key values in the
file gives a good space/performance trade-off.
4. Any search other than one on key equality is very expensive, since it involves a
linear scan of the file.
5. Predicting the total number of buckets is difficult. Possible remedies are to
allocate a large space from the start, or to estimate a ``reasonable'' size and
periodically reorganize the file.
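Open hashing with chained buckets can be sketched as follows; the class and method names are illustrative, and Python's built-in `hash` stands in for a real hash function:

```python
# Sketch of open hashing with a fixed number of buckets: colliding records
# are chained inside the same bucket, so a lookup hashes the key and then
# scans only that bucket's (hopefully short) chain.

class HashFile:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, record):
        self._bucket(key).append((key, record))  # overflow = a longer chain

    def lookup(self, key):
        # Every entry in the bucket must be checked: several keys
        # may hash to the same address.
        for k, rec in self._bucket(key):
            if k == key:
                return rec
        return None

hf = HashFile()
hf.insert("ada", 1)
hf.insert("grace", 2)
```

Note that only equality lookups benefit: a range query would still have to scan every bucket, which is the point made above.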
Indexed Sequential File Organization
An index file can be used to effectively overcome the insertion problem mentioned above,
and to speed up the key search as well. The simplest indexing structure is the single-level
one: a file whose records are (key, pointer) pairs, where the pointer gives the position in
the data file of the record with that key. Only a subset of the data records, evenly spaced
along the data file, are indexed, so as to mark intervals of data records.
A key search then proceeds as follows: the search key is compared with the index keys to
find the highest index key preceding the search one, and a linear search is performed
from the record that index entry points to onward, until the search key is matched or the
record pointed to by the next index entry is reached. In spite of the double file access
(index + data) needed by this kind of search, the decrease in access time with respect to
a purely sequential file is significant.
Consider, for example, the case of simple linear search on a file with 1,000 records. With
the sequential organization, an average of 500 key comparisons is necessary (assuming the
search keys are uniformly distributed among the data ones). Using an evenly spaced index
with 100 entries, the average drops to 50 comparisons in the index file plus 5 in the data
file, since each index interval spans only 10 data records: roughly a 9:1 reduction in the
number of operations.
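The single-level index search described above can be sketched as follows; the function names and the fixed spacing of 10 are illustrative assumptions:

```python
# Sketch of a single-level index over a sorted data file: only every
# spacing-th key is indexed, a linear scan of the index finds the right
# interval, and a short linear scan of the data finishes the search.

def build_index(keys, spacing=10):
    # (key, position) pairs for every spacing-th record
    return [(keys[i], i) for i in range(0, len(keys), spacing)]

def indexed_search(keys, index, target, spacing=10):
    # 1) find the last index entry whose key does not exceed the target
    start = 0
    for key, pos in index:
        if key <= target:
            start = pos
        else:
            break
    # 2) linear scan of at most one interval of the data file
    for i in range(start, min(start + spacing + 1, len(keys))):
        if keys[i] == target:
            return i
    return None

keys = list(range(1000))        # 1,000 sorted unique keys
index = build_index(keys)       # 100 evenly spaced index entries
```

With 100 index entries over 1,000 records, step 1 scans on average half the index and step 2 at most one 10-record interval, matching the comparison counts in the example above.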
This scheme can obviously be extended hierarchically: an index is a sequential file in
itself, amenable to being indexed in turn by a second-level index, and so on, thus
exploiting more and more the hierarchical decomposition of the searches to decrease the
access time. Obviously, if the layering of indexes is pushed too far, a point is reached
where the advantages of indexing are outweighed by the increased storage costs, and by
the index access times as well.
Advantages of Indexed Sequential File:
It combines the efficient sequential processing of a sorted file with much faster access
to individual records through the index, at the cost of the extra storage and maintenance
the index requires.
Reorganizing Indexed Files: This operation is usually done by a utility program supplied
by the manufacturer of the COBOL compiler that you use. Refer to the manuals of your
compiler for details and instructions.