; DBMS
Documents
Resources
Learning Center
Upload
Plans & pricing Sign in
Sign Out

DBMS

VIEWS: 41 PAGES: 40

  • pg 1
									DBMS

1. Introduction

DBMS - A Database is a collection of interrelated data and a Database Management System is a set of
programs to use and/or modify this data.

1. 1 Approaches to Data Management

   •   File-Based Systems

       Conventionally, before the Database systems evolved, data in software systems was stored in
       and represented using flat files.

   •   Database Systems

       Database Systems evolved in the late 1960s to address common issues in applications handling
       large volumes of data which are also data intensive. Some of these issues could be traced back
       to the following disadvantages of File-based systems.

       Drawbacks of File-Based Systems




       As shown in the figure, in a file-based system, different programs in the same application may
       be interacting with different private data files. There is no system enforcing any standardized
       control on the organization and structure of these data files.

   •   Data Redundancy and Inconsistency

       Since data resides in different private data files, there are chances of redundancy and resulting
       inconsistency. For example, in the above example shown, the same customer can have a
       savings account as well as a mortgage loan. Here the customer details may be duplicated since
       the programs for the two functions store their corresponding data in two different data files.
       This gives rise to redundancy in the customer's data. Since the same data is stored in two files,
       inconsistency arises if a change made in the data in one file is not reflected in the other.

   •   Unanticipated Queries
       In a file-based system, handling sudden/ad-hoc queries can be difficult, since it requires
       changes in the existing programs.

   •   Data Isolation

       Though data used by different programs in the application may be related, they reside in
       isolated data files.

   •   Concurrent Access Anomalies

       In large multi-user systems the same file or record may need to be accessed by multiple users
       simultaneously. Handling this in a file-based systems is difficult.

   •   Security Problems

       In data-intensive applications, security of data is a major concern. Users should be given access
       only to required data and not the whole database. In a file-based system, this can be handled
       only by additional programming in each application.

   •   Integrity Problems

       In any application, there will be certain data integrity rules which need to be maintained. These
       could be in the form of certain conditions/constraints on the elements of the data records. In
       the savings bank application, one such integrity rule could be Customer ID, which is the unique
       identifier for a customer record, should be non-empty. There can be several such integrity rules.
       In a file-based system, all these rules need to be explicitly programmed in the application
       program.

It may be noted that, we are not trying to say that handling the above issues like concurrent access, security,
integrity problems, etc., is not possible in a file-based system. The real issue was that, though all these are
common issues of concern to any data-intensive application, each application had to handle all these
problems on its own. The application programmer needs to bother not only about implementing the
application business rules but also about handling these common issues.

1.2 Advantages of Database Systems




As shown in the figure, the DBMS is a central system which provides a common interface between the
data and the various front-end programs in the application. It also provides a central location for the
whole data in the application to reside.
Due to its centralized nature, the database system can overcome the disadvantages of the file-based
system as discussed below.

   •   Minimal Data Redundancy

   Since the whole data resides in one central database, the various programs in the application can
   access data in different data files. Hence data present in one file need not be duplicated in another.
   This reduces data redundancy. However, this does not mean all redundancy can be eliminated.
   There could be business or technical reasons for having some amount of redundancy. Any such
   redundancy should be carefully controlled and the DBMS should be aware of it.

   •   Data Consistency

   Reduced data redundancy leads to better data consistency.

   •   Data Integration

   Since related data is stored in one single database, enforcing data integrity is much easier.
   Moreover, the functions in the DBMS can be used to enforce the integrity rules with minimum
   programming in the application programs.

   •   Data Sharing

   Related data can be shared across programs since the data is stored in a centralized manner. Even
   new applications can be developed to operate against the same data.

   •   Enforcement of Standards

   Enforcing standards in the organization and structure of data files is required and also easy in a
   Database System, since it is one single set of programs which is always interacting with the data
   files.

   •   Application Development Ease

   The application programmer need not build the functions for handling issues like concurrent access,
   security, data integrity, etc. The programmer only needs to implement the application business
   rules. This brings in application development ease. Adding additional functional modules is also
   easier than in file-based systems.

   •   Better Controls

   Better controls can be achieved due to the centralized nature of the system.

   •   Data Independence

   The architecture of the DBMS can be viewed as a 3-level system comprising the following:

       - The internal or the physical level where the data resides.

       - The conceptual level which is the level of the DBMS functions

       - The external level which is the level of the application programs or the end user.

   Data Independence is isolating an upper level from the changes in the organization or structure of a
   lower level. For example, if changes in the file organization of a data file do not demand for
   changes in the functions in the DBMS or in the application programs, data independence is
   achieved. Thus Data Independence can be defined as immunity of applications to change in
   physical representation and access technique. The provision of data independence is a major
   objective for database systems.
   •   Reduced Maintenance

   Maintenance is less and easy, again, due to the centralized nature of the system.

1.3 Functions of a DBMS

The functions performed by a typical DBMS are the following:

   •   Data Definition

       The DBMS provides functions to define the structure of the data in the application. These
       include defining and modifying the record structure, the type and size of fields and the various
       constraints/conditions to be satisfied by the data in each field.

   •   Data Manipulation

       Once the data structure is defined, data needs to be inserted, modified or deleted. The functions
       which perform these operations are also part of the DBMS. These functions can handle planned
       and unplanned data manipulation needs. Planned queries are those which form part of the
       application. Unplanned queries are ad-hoc queries which are performed on a need basis.

   •   Data Security & Integrity

       The DBMS contains functions which handle the security and integrity of data in the application.
       These can be easily invoked by the application and hence the application programmer need not
       code these functions in his/her programs.

   •   Data Recovery & Concurrency

       Recovery of data after a system failure and concurrent access of records by multiple users are
       also handled by the DBMS.

   •   Data Dictionary Maintenance

       Maintaining the Data Dictionary which contains the data definition of the application is also one
       of the functions of a DBMS.

   •   Performance

       Optimizing the performance of the queries is one of the important functions of a DBMS. Hence
       the DBMS has a set of programs forming the Query Optimizer which evaluates the different
       implementations of a query and chooses the best among them.

Thus the DBMS provides an environment that is both convenient and efficient to use when there is a
large volume of data and many transactions to be processed.

1.4 Role of the Database Administrator

Typically there are three types of users for a DBMS. They are :

   1. The End User who uses the application. Ultimately, this is the user who actually puts the data in
      the system into use in business. This user need not know anything about the organization of
      data in the physical level. She also need not be aware of the complete data in the system. She
      needs to have access and knowledge of only the data she is using.
   2. The Application Programmer who develops the application programs. She has more knowledge
      about the data and its structure since she has manipulate the data using her programs. She
      also need not have access and knowledge of the complete data in the system.
   3. The Database Administrator (DBA) who is like the super-user of the system. The role of the DBA
      is very important and is defined by the following functions.
•   Defining the Schema

    The DBA defines the schema which contains the structure of the data in the application. The
    DBA determines what data needs to be present in the system ad how this data has to be
    represented and organized.

•   Liaising with Users

    The DBA needs to interact continuously with the users to understand the data in the system and
    its use.

•   Defining Security & Integrity Checks

    The DBA finds about the access restrictions to be defined and defines security checks
    accordingly. Data Integrity checks are also defined by the DBA.

•   Defining Backup / Recovery Procedures

    The DBA also defines procedures for backup and recovery. Defining backup procedures includes
    specifying what data is to backed up, the periodicity of taking backups and also the medium and
    storage place for the backup data.

•   Monitoring Performance

    The DBA has to continuously monitor the performance of the queries and take measures to
optimize all the queries in the application.
1.5 Types of Database Systems

Database Systems can be catagorised according to the data structures and operators they present to
the user. The oldest systems fall into inverted list, hierarchic and network systems. These are the pre-
relational models.

   •   In the Hierarchical Model, different records are inter-related through hierarchical or tree-like
       structures. A parent record can have several children, but a child can have only one parent. In
       the figure, there are two hierarchies shown - the first storing the relations between CUSTOMER,
       ORDERS, CONTACTS and ORDER_PARTS and the second showing the relation between PARTS,
       ORDER_PARTS and SALES_HISTORY. The many-to-many relationship is implemented through
       the ORDER_PARTS segment which occurs in both the hierarchies. In practice, only one tree
       stores the ORDER_PARTS segment, while the other has a logical pointer to this segment. IMS
       (Information Management System) of IBM is an example of a Hierarchical DBMS.
•   In the Network Model, a parent can have several children and a child can also have many
    parent records. Records are physically linked through linked-lists. IDMS from Computer
    Associates International Inc. is an example of a Network DBMS.
   •   In the Relational Model, unlike the Hierarchical and Network models, there are no physical
       links. All data is maintained in the form of tables consisting of rows and columns. Data in two
       tables is related through common columns and not physical links or pointers. Operators are
       provided for operating on rows in tables. Unlike the other two type of DBMS, there is no need to
       traverse pointers in the Relational DBMS. This makes querying much more easier in a Relational
       DBMS than in the the Hierarchical or Network DBMS. This, in fact, is a major reason for the
       relational model to become more programmer friendly and much more dominant and popular in
       both industrial and academic scenarios. Oracle, Sybase, DB2, Ingres, Informix, MS-SQL Server
       are few of the popular Relational DBMSs.

       CUSTOMER

       CUST. NO. CUSTOMER NAME              ADDRESS           CITY
       15371     Nanubhai & Sons            L. J. Road        Mumbai
           ...             ...                      ...            ...
           ...             ...                      ...            ...
           ...             ...                      ...            ...

   •

       CONTACTS                                            ORDERS
                                                           ORDER                      CUSTOMER
       CUST.NO. CONTACT                 DESIGNATION                      ORDER DATE
                                                           NO.                        NO.
       15371       Nanubhai             Owner              3216          24-June-1997 15371
       15371       Rajesh Munim         Accountant              ...           ...           ...
           ...              ...               ...               ...           ...           ...
           ...              ...               ...               ...           ...           ...

       PARTS                                               ORDERS-PARTS
       PARTS                                               ORDER
                   PARTS DESC           PART PRICE                  PART NO.          QUANTITY
       NO.                                                 NO.
                   Amkette 3.5"
       S3                               400.00             3216          C1           300
                   Floppies
            ...             ...                ...         3216          S3           120
            ...             ...                ...                ...         ...            ...
            ...             ...                ...                ...         ...            ...

   •
       SALES-HISTORY

       PART NO.        REGION          YEAR           UNITS
       S3              East            1996           2000
       S3              North           1996           5500
       S3              South           1996           12000
       S3              West            1996           20000

The recent developments in the area have shown up in the form of certain object and object/relational
DBMS products. Examples of such systems are GemStone and Versant ODBMS. Research has also
proceeded on to a variety of other schemes including the multi-dimensional approach and the logic-
based approach.
3-Level Database System Architecture




   •   The External Level represents the collection of views available to different end-users.
   •   The Conceptual level is the representation of the entre information content of the database.
   •   The Internal level is the physical level which shows how the data data is stored, what are the
       representation of the fields etc.

2. The Internal Level

This chapter discusses the issues related to how the data is physically stored on the disk and some of
the access mechanisms commonly used for retrieving this data.

The Internal Level is the level which deals with the physical storage of data. While designing this layer,
the main objective is to optimize performance by minimizing the number of disk accesses during the
various database operations.
The figure shows the process of database access in general. The DBMS views the database as a
collection of records. The File Manager of the underlying Operating System views it as a set of pages
and the Disk Manager views it as a collection of physical locations on the disk.

When the DBMS makes a request for a specific record to the File Manager, the latter maps the record
to a page containing it and requests the Disk Manager for the specific page. The Disk Manager
determines the physical location on the disk and retrieves the required page.

2.1 Clustering

In the above process, if the page containing the requested record is already in the memory, retrieval
from the disk is not necessary. In such a situation, time taken for the whole operation will be less.
Thus, if records which are frequently used together are placed physically together, more records will be
in the same page. Hence the number of pages to be retrieved will be less and this reduces the number
of disk accesses which in turn gives a better performance.

This method of storing logically related records, physically together is called clustering.

Eg: Consider CUSTOMER table as shown below.


                   Cust ID           Cust Name             Cust City               ...

                    10001                Raj                  Delhi                ...

                    10002                 ...                  ...                 ...

                    10003                 ...                  ...                 ...

                    10004                 ...                  ...                 ...

                      ...                 ...                  ...                 ...

                      ...                 ...                  ...                 ...


If queries retrieving Customers with consecutive Cust_IDs frequently occur in the application,
clustering based on Cust_ID will help improving the performance of these queries. This can be
explained as follows.

Assume that the Customer record size is 128 bytes and the typical size of a page retrieved by the File
Manager is 1 Kb (1024 bytes).

If there is no clustering, it can be assumed that the Customer records are stored at random physical
locations. In the worst-case scenario, each record may be placed in a different page. Hence a query to
retrieve 100 records with consecutive Cust_Ids (say, 10001 to 10100), will require 100 pages to be
accessed which in turn translates to 100 disk accesses.

But, if the records are clustered, a page can contain 8 records. Hence the number of pages to be
accessed for retrieving the 100 consecutive records will be ceil(100/8) = 13. i.e., only 13 disk accesses
will be required to obtain the query results. Thus, in the given example, clustering improves the speed
by a factor of 7.7

Q: For what record size will clustering be of no benefit to improve performance ?

A: When the record size and page size are such that a page can contain only one record.

Q: Can a table have clustering on multiple fields simultaneously ?

A: No
Intra-file clustering - Clustered records belong to the same file (table) as in the above example.

Inter-file Clustering - Clustered records belong to different files (tables). This type of clustering may be
required to enhance the speed of queries retrieving related records from more than one tables. Here
interleaving of records is used.

2.2 Indexing

Indexing is another common method for making retrievals faster.

Consider the example of CUSTOMER table used above. The following query is based on Customer's
city.

Retrieve the records of all customers who reside in Delhi

Here a sequential search on the CUSTOMER table has to be carried out and all records with the value
'Delhi' in the Cust_City field have to be retrieved. The time taken for this operation depends on the
number of pages to be accessed. If the records are randomly stored, the page accesses depends on
the volume of data. If the records are stored physically together, the number of pages depends on the
size of each record also.

If such queries based on Cust_City field are very frequent in the application, steps can be taken to
improve the performance of these queries. Creating an Index on Cust_City is one such method. This
results in the scenario as shown below.




A new index file is created. The number of records in the index file is same as that of the data file. The
index file has two fields in each record. One field contains the value of the Cust_City field and the
second contains a pointer to the actual data record in the CUSTOMER table.

Whenever a query based on Cust_City field occurs, a search is carried out on the Index file. Here, it is
to be noted that this search will be much faster than a sequential search in the CUSTOMER table, if the
records are stored physically together. This is because of the much smaller size of the index record due
to which each page will be able to contain more number of records.

When the records with value 'Delhi' in the Cust_City field in the index file are located, the pointer in the
second field of the records can be followed to directly retrieve the corresponding CUSTOMER records.

Thus the access involves a Sequential access on the index file and a Direct access on the actual data
file.

Retrieval Speed v/s Update Speed : Though indexes help making retrievals faster, they slow down
updates on the table since updates on the base table demand update on the index field as well.
It is possible to create an index with multiple fields i.e., index on field combinations. Multiple indexes
can also be created on the same table simultaneously though there may be a limit on the maximum
number of indexes that can be created on a table.




Q: In which of the following situations will indexes be ineffective ?

a) When the percentage of rows being retrieved is large

b) When the data table is small and the index record is of almost the same size as of the actual data
record.

c) In queries involving NULL / Not NULL in the indexed field.

d)All of the above

A: d) All of the above

Q: Can a clustering based on one field and indexing on another field exist on the same table
simultaneously ?

A: Yes

2.3 Hashing

Hashing is yet another method used for making retrievals faster. This method provides direct access to
record on the basis of the value of a specific field called the hash_field. Here, when a new record is
inserted, it is physically stored at an address which is computed by applying a mathematical function
(hash function) to the value of the hash field. Thus for every new record,

hash_address = f (hash_field), where f is the hash function.

Later, when a record is to be retrieved, the same hash function is used to compute the address where
the record is stored. Retrievals are faster since a direct access is provided and there is no search
involved in the process.

An example of a typical hash function is given by a numeric hash field, say an id, modulus a very large
prime number.

Q: Can there be more than one hash fields on a file ?

A: No
As hashing relates the field value to the address of the record, multiple hash fields will map a record to
multiple addresses at the same time. Hence there can be only one hash field per file.

Collisions : Consider the example of the CUSTOMER table given earlier while discussing clustering. Let
CUST_ID be the hash field and the hash function be defined as ((CUST_ID mod 10000)*64 + 1025).
The records with CUST_ID 10001, 10002, 10003 etc. will be stored at addresses 1089, 1153, 1217 etc.
respectively.

It is possible that two records hash to the same address leading to a collision. In the above example,
records with CUST_ID values 20001, 20002, 20003 etc. will also map on to the addresses 1089, 1153,
1217 etc. respectively. And same is the case with CUST_ID values 30001, 30002, 30003 etc.

The methods to resolve a collision are by using :

1. Linear Search:

While inserting a new record, if it is found that the location at the hash address is already occupied by
a previously inserted record, search for the next free location available in the disk and store the new
record at this location. A pointer from the first record at the original hash address to the new record
will also be stored. During retrieval, the hash address is computed to locate the record. When it is seen
that the record is not available at the hash address, the pointer from the record at that address is
followed to locate the required record.




In this method, the over head incurred is the time taken for the linear search to locate the next free
location while inserting a record.

2. Collision Chain:

Here, the hash address location contains the head of a list of pointers linking together all records which
hash to that address.
In this method, an overflow area needs to be used if the number of records mapping on to the same
hash address exceeds the number of locations linked to it.

3.1 The Relational Model

Relational Databases: Terminology




                                                      Ord_Items
                                                       Ord
       Databases: Case Example                                 Item #   Qty
                                                       #

                                                       101     HW1      100
       Ord_Aug
                                                       101     HW3      50
        Ord #   OrdDate       Cust#
                                                       101     SW1      150
        101     02-08-94      002
                                                       102     HW2      10
        102     11-08-94      003
                                                       103     HW3      50
        103     21-08-94      003
                                                       104     HW2      25
        104     28-08-94      002
                                                       104     HW3      100
        105     30-08-94      005
                                                       105     SW1      100
       Items
                                                            Customers
        Item
                   Descr              Price                 Ord #    OrdDate            Cust#
        #
                                                            101      02-08-94           002
        HW1        Power Supply       4000
                                                            102      11-08-94           003
        HW2        101-Keyboard       2000
                                                            103      21-08-94           003
        HW3        Mouse              800
                                                            104      28-08-94           002
        SW1        MS-DOS 6.0         5000
                                                            105      30-08-94           005
        SW2        MS-Word 6.0        8000




            Term                     Meaning                 Eg. from the given Case Example

        Relation           A table                          Ord_Aug, Customers, Items etc.

                           A row or a record in a           A row from Customers relation is a
        Tuple
                           relation.                        Customer tuple.

                           A field or a column in a
        Attribute                                           Ord_Date, Item#, CustName etc.
                           relation.

        Cardinality        The number of tuples in a
                                                            Cardinality of Ord_Items relation is 8
        of a relation      relation.

        Degree of a        The number of attributes in a
                                                            Degree of Customers relation is 3.
        relation           relation.

                                                            Domain of Qty in Ord_Items is the set
        Domain of          The set of all values that can
                                                            of all values which can represent
        an attribute       be taken by the attribute.
                                                            quantity of an ordered item.

                                                            Primary Key of Customers relation is
                           An attribute or a combination
                                                            Cust#.
        Primary Key        of attributes that uniquely
        of a relation      defines each tuple in a
                                                            Ord# and Item# combination forms
                           relation.
                                                            the primary Key of Ord_Items

                           An attribute or a combination    Cust# in Ord_Aug relation is a foreign
                           of attributes in one relation    key creating reference from Ord_Aug
                           R1 which indicates the           to Customers. This is required to
                           relationship of R1 with          indicate the relationship between
                           another relation R2.             Orders in Ord_Aug and Customers.
        Foreign Key
                           The foreign key attributes in    Ord# and Item# in Ord_Items are
                           R1 must contain values           foreign keys creating references from
                           matching with those of the       Ord_Items to Ord_Aug and Items
                           values in R2                     respectively.


3.2 Properties of Relations

    No Duplicate Tuples “ A relation cannot contain two or more tuples which have the same values for
all the attributes. i.e., In any relation, every row is unique.
    Tuples are unordered “ The order of rows in a relation is immaterial.
    Attributes are unordered “ The order of columns in a relation is immaterial.
    Attribute Values are Atomic “ Each tuple contains exactly one value for each attribute.
It may be noted that many of the properties of relations follow the fact that the body of a relation is a
mathematical set.

3.3 Integrity Rules

The following are the integrity rules to be satisfied by any relation.

â ¢ No Component of the Primary Key can be null.

â ¢ The Database must not contain any unmatched Foreign Key values. This is called the referential
integrity rule.

Q: Can the Foreign Key accept nulls?
A: Yes, if the application business rule allows this.

How do we explain this ?

Unlike the case of Primary Keys, there is no integrity rule saying that no component of the foreign key
can be null. This can be logically explained with the help of the following example:

Consider the relations Employee and Account as given below.

                                                 Employee


                           Emp#        EmpName          EmpCity        EmpAcc#

                           X101        Shekhar          Bombay         120001

                           X102        Raj              Pune           120002

                           X103        Sharma           Nagpur         Null

                           X104        Vani             Bhopal         120003


                                                  Account


                                  ACC#        OpenDate           BalAmt

                                  120001      30-Aug-1998        5000

                                  120002      29-Oct-1998        1200

                                  120003      01-Jan-1999        3000

                                  120004      04-Mar-1999        500




EmpAcc# in Employee relation is a foreign key creating reference from Employee to Account. Here, a
Null value in EmpAcc# attribute is logically possible if an Employee does not have a bank account. If
the business rules allow an employee to exist in the system without opening an account, a Null value
can be allowed for EmpAcc# in Employee relation.

In the case example given, Cust# in Ord_Aug cannot accept Null if the business rule insists that the
Customer No. needs to be stored for every order placed.

The next issue related to foreign key reference is handling deletes / updates of parent?

In the case example, can we delete the record with Cust# value 002, 003 or 005 ?
The default answer is NO, as long as there is a foreign key reference to these records from some other
table. Here, the records are referenced from the order records in Ord_Aug relation. Hence Restrict the
deletion of the parent record.

Deletion can still be carried if we use the Cascade or Nullify strategies.

Cascade: Delete/Update all the references successively or in a cascaded fashion and finally
delete/update the parent record. In the case example, Customer record with Cust#002 can be deleted
after deleting order records with Ord# 101 and 104. But these order records, in turn, can be deleted
only after deleting those records with Ord# 101 and 104 from Ord_Items relation.

Nullify: Update the referencing to Null and then delete/update the parent record. In the above
example of Employee and Account relations, an account record may have to be deleted if the account
is to be closed. For example, if Employee Raj decides to close his account, Account record with Acc#
120002 has to be deleted. But this deletion is not possible as long as the Employee record of Raj
references it. Hence the strategy can be to update the EmpAcc# field in the employee record of Raj to
Null and then delete the Account parent record of 120002. After the deletion the data in the tables will
be as follows:

                                                Employee


                           Emp#       EmpName         EmpCity      EmpAcc#

                           X101       Shekhar         Bombay       120001

                           X102       Raj             Pune         120002 Null

                           X103       Sharma          Nagpur       Null

                           X104       Vani            Bhopal       120003


                                                 Account


                                  ACC#        OpenDate          BalAmt

                                  120001      30-Aug-1998       5000

                                  120002      29-Oct-1998       1200

                                  120003      01-Jan-1999       3000

                                  120004      04-Mar-1999       500


3.4 Relational Algebra Operators

The eight relational algebra operators are

1. SELECT - To retrieve specific tuples/rows from a relation.
                          Ord#       OrdDate         Cust#

                           101       02-08-94         002

                           104       18-09-94         002




2. PROJECT - To retrieve specific attributes/columns from a relation.




                     Descr                 Price

                     Power Supply          4000

                     101-Keyboard          2000

                     Mouse                 800

                     MS-DOS 6.0            5000

                     MS-Word 6.0           8000
3. PRODUCT - To obtain all possible combination of tuples from two relations.




      Ord#       OrdDate         O.Cust#        C.Cust#           CustName         City

      101        02-08-94        002            001               Shah             Bombay

      101        02-08-94        002            002               Srinivasan       Madras

      101        02-08-94        002            003               Gupta            Delhi

      101        02-08-94        002            004               Banerjee         Calcutta

      101        02-08-94        002            005               Apte             Bombay

      102        11-08-94        003            001               Shah             Bombay

      102        11-08-94        003            002               Srinivasan       Madras




4. UNION - To retrieve tuples appearing in either or both the relations participating in the UNION.




                   Eg: Consider the relation Ord_Jul as follows
                   (Table: Ord_Jul)
                      101        03-07-94         001

                      102        27-07-94         003

                      101        02-08-94         002

                      102        11-08-94         003

                      103        21-08-94         003

                      104        28-08-94         002

                      105        30-08-94         005



Note: The union operation shown above logically implies retrieval of records of Orders placed in July or
in August

5. INTERSECT- To retrieve tuples appearing in both the relations participating in the INTERSECT.




            Eg:
            To retrieve Cust# of Customers who've placed orders in July and in August

            Cust#

            003




6. DIFFERENCE - To retrieve tuples appearing in the first relation participating in the DIFFERENCE but
not the second.
           Eg:
           To retrieve Cust# of Customers who've placed orders in July but not in August

           Cust#

           001




7. JOIN - To retrieve combinations of tuples in two relations based on a common field in both the
relations.




             Eg:

                                                                                 ORD_A
                                                                                 UG join
                                                                                 CUSTO
              Ord#            OrdDate       Cust#     CustNames       City       MERS
                                                                                 (here,
              101             02-08-94      002       Srinivasan      Madras
                                                                                 the
              102             11-08-94      003       Gupta           Delhi      commo
                                                                                 n
              103             21-08-94      003       Gupta           Delhi      column
                                                                                 is
              104             28-08-94      002       Srinivasan      Madras     Cust#)
Note: The above join operation logically implies retrieval of details of all orders and the details of the
corresponding customers who placed the orders.

Such a join operation where only those rows having corresponding rows in the both the relations are
retrieved is called the natural join or inner join. This is the most common join operation.

Consider the example of EMPLOYEE and ACCOUNT relations.

EMPLOYEE


EMP #             EmpName                EmpCity                Acc#

X101              Shekhar                Bombay                 120001

X102              Raj                    Pune                   120002

X103              Sharma                 Nagpur                 Null

X104              Vani                   Bhopal                 120003


ACCOUNT


  Acc#              OpenDate               BalAmt

120001       30. Aug. 1998             5000

120002       29. Oct. 1998             1200

120003       1. Jan. 1999              3000

120004       4. Mar. 1999              500


A join can be formed between the two relations based on the common column Acc#. The result of the
(inner) join is :


Emp#      EmpName           EmpCity       Acc#             OpenDate           BalAmt

X101      Shekhar           Bombay        120001           30. Aug. 1998      5000

X102      Raj               Pune          120002           29. Oct. 1998      1200

X104      Vani              Bhopal        120003           1. Jan 1999        3000


Note that, from each table, only those records which have corresponding records in the other table
appear in the result set. This means that result of the inner join shows the details of those employees
who hold an account along with the account details.

The other type of join is the outer join which has three variations â “ the left outer join, the right
outer join and the full outer join. These three joins are explained as follows:

The left outer join retrieves all rows from the left-side (of the join operator) table. If there are
corresponding or related rows in the right-side table, the correspondence will be shown. Otherwise,
columns of the right-side table will take null values.
EMPLOYEE left outer join ACCOUNT gives:


Emp#      EmpName         EmpCity         Acc#             OpenDate          BalAmt

X101      Shekhar         Bombay          120001           30. Aug. 1998     5000

X102      Raj             Pune            120002           29. Oct. 1998     1200

X103      Sharma          Nagpur          NULL             NULL              NULL

X104      Vani            Bhopal          120003           1. Jan 1999       3000



The right outer join retrieves all rows from the right-side (of the join operator) table. If there are
corresponding or related rows in the left-side table, the correspondence will be shown. Otherwise,
columns of the left-side table will take null values.




EMPLOYEE right outer join ACCOUNT gives:


Emp#      EmpName         EmpCity         Acc#             OpenDate          BalAmt

X101      Shekhar         Bombay          120001           30. Aug. 1998     5000

X102      Raj             Pune            120002           29. Oct. 1998     1200

X104      Vani            Bhopal          120003           1. Jan 1999       3000

NULL      NULL            NULL            120004           4. Mar. 1999      500


(Assume that Acc# 120004 belongs to someone who is not an employee and hence the details of the
Account holder are not available here)


The full outer join retrieves all rows from both the tables. If there is a correspondence or relation
between rows from the tables of either side, the correspondence will be shown. Otherwise, related
columns will take null values.
EMPLOYEE full outer join ACCOUNT gives:


Emp#      EmpName         EmpCity         Acc#          OpenDate          BalAmt

X101      Shekhar         Bombay          120001        30. Aug. 1998     5000

X102      Raj             Pune            120002        29. Oct. 1998     1200

X103      Sharma          Nagpur          NULL          NULL              NULL

X104      Vani            Bhopal          120003        1. Jan 1999       3000

NULL      NULL            NULL            120004        4. Mar. 1999      500



Q: What will the result of a natural join operation between R1 and R2 ?
A:


a1     b1        c1

a2     b2        c2

a3     b3        c3


8. DIVIDE

Consider the following three relations:




R1 divide by R2 per R3 gives:


 a


Thus the result contains those values from R1 whose corresponding R2 values in R3 include all R2
values.
4. Structured Query Language (SQL)

4.1 SQL : An Overview

The components of SQL are

a. Data Manipulation Language â “ Consists of SQL statements for operating on the data
(Inserting, Modifying, Deleting and Retrieving Data) in tables which already exist.

b. Data Definition Language â “ Consists of SQL statements for defining the schema (Creating,
Modifying and Dropping tables, indexes, views etc.)

c. Data Control Language â “ Consists of SQL statements for providing and revoking access
permissions to users

4.2 DML - SELECT, INSERT, UPDATE and DELETE statements.

The SELECT statement

Retrieves rows from one or more tables according to given conditions.

General form:

SELECT [ ALL | DISTINCT ] <attribute (comma)list>

FROM <table (comma)list>

[ WHERE <conditional expression>]

[ ORDER BY [DESC] <attribute list>

[ GROUP BY <attribute (comma)list>]

[ HAVING <conditional expression>]

The INSERT statement

Inserts one or more tuples in a table.

General forms:

To insert a single tuple

INSERT INTO <table-name> [<attribute (comma)list>]

VALUES <value list>;

To insert multiple tuples

INSERT INTO <table-name> [<attribute (comma)list>]

SELECT [ ALL | DISTINCT ] <attribute (comma)list>

FROM <table (comma)list>*

[ WHERE <conditional expression>];

* - list of existing tables
The UPDATE statement

Updates values of one or more attributes of one or more tuples in a table.

General form:

UPDATE <table-name>

SET <attribute-1 = value-1[, attribute-2 = value-2,...attribute-n = value-n]

[ WHERE <conditional expression>];

The DELETE statement

Deletes one or more tuples in a table according to given conditions

General form:

DELETE FROM <table-name>

[ WHERE <conditional expression>];


4.3 DDL - CREATE, ALTER, and DROP statements.

DDL statements are those which are used to create, modify and drop the definitions or structures of
various tables, views, indexes and other elements of the DBMS.

The CREATE TABLE statement

Creates a new table.

General form:

CREATE TABLE <table-name>

(<table-element (comma)list>*);

* - table element may be attribute with its data-type and size or any integrity constraint on attributes.

Some CREATE TABLE statements on the Case Example

Query:

CREATE TABLE customers

( cust# NUMBER(6) NOT NULL,

custname CHAR(30) ,

city CHAR(20));

- This query Creates a table CUSTOMERS with 3 fields - cust#, custname and city. Cust# cannot be
null

Query:
CREATE TABLE ord_sep <-------------------      Creates a new table ord_sep, which has the same structure
                                               of ord_aug. The data in ord_aug is copied to the new table
AS SELECT * from ord_aug;                      ord_sep.


- This query Creates table ORD_SEP as a cpy of ORD-AUG. Copies structure as well as data.

Query:


CREATE TABLE ord_sep <------------------
                                              Creates a new table ord_sep, which has the same structure of
                                              ord_aug. No data in ord_aug is copied to the new table since
AS SELECT * from ord_aug
                                              there is no row which satisfies the 'always false' condition 1 =
                                              2.
WHERE 1 = 2;


- This query Creates table ORD_SEP as a copy of ORD_AUG, but does not copy any data as the WHERE
clause is never satisfied.

The ALTER TABLE statement

Alters the structure of an existing table.

General form:

ALTER TABLE <table-name>

ADD | MODIFY (<table-element (comma)list);

Examples of ALTER TABLE statement.

Query:



ALTER TABLE customers

MODIFY custname CHAR(35); <-------------       Modifies the data type/size of an attribute in the table



- This query changes the custname field to a character field of length 35. Used for modifying field
lengths and attributes.

Query:


ALTER TABLE customers
                                               Adds two new attributes to the Customers table.
                                               Here, for existing tuples (if any), the new attribute
ADD (phone number(8), <------------------
                                               will take NULL values since no DEFAULT value is
                                               mentioned for the attribute.
credit_rating char(1));


- This query adds two new fields - phone & credit_rating to the customers table.


The DROP TABLE statement

DROPS an existing table.
General form:

DROP TABLE <table-name>;

Example:

Query:

DROP TABLE ord_sep;

- The above query drops table ORD_SEP from the database

Creating & Dropping Views

A view is a virtual relation created with attributes from one or more base tables.

SELECT * FROM myview1; at any given time will evaluate the view-defining query in the CREATE VIEW
statement and display the result.

Query:

CREATE VIEW myview1

AS SELECT

ord#, orddate, ord_aug.cust#, custname

FROM ord_aug, customers

WHERE ord_aug.cust# = customers.cust#;

- This query defines a view consisting of ord#, cust#, and custname using a join of ORD_AUG and
CUSTOMERS tables.

Query:

CREATE VIEW myview2 (ItemNo, Quantity)

AS SELECT item#, qty

FROM ord_items;

- This query defines a view with columns item# and qty from the ORD_ITEMS table, and renames
these columns as ItemNo. and Quantity respectively.
Query:




CREATE VIEW myview3

AS SELECT item#, descr, price

FROM items

                                             WITH CHECK OPTION in a CREATE VIEW statement
WHERE price < 1000
                                             indicates that INSERTs or UPDATEs on the view will be
                                             rejected if they violate any integrity constraint implied by
WITH CHECK OPTION; <-------------------
                                             the view-defining query.



- This query defines the view as defined. WITH CHECK OPTION ensures that if this view is used for
updation, the updated values do not cause the row to fall outside the view.

Query:

DROP VIEW myview1; <---- To drop a view

- this query drops the view MYVIEW1

Creating & Dropping Indexes



Query:

CREATE INDEX i_city <--------------------   Creates a new index named i_city. The new
                                            index file(table) will have the values of city
ON customers (city);                        column of Customers table



Query:


CREATE UNIQUE INDEX i_custname <-----       Creates an index which allows only unique values for
-                                           custnames

ON customers (custname);


Query:


CREATE INDEX i_city_custname <---------
                                            Creates an index based on two fields : city and
                                            custname
ON customers (city, custname);


Query:


DROP INDEX i_city; <--------------------    Drops index i_city
4.4 DCL - GRANT and REVOKE statements.

DCL statements are those which are used to control access permissions on the tables, indexes, views
and other elements of the DBMS.

Granting & Revoking Privileges

Query:


                                   Grants all permissions on the table customers to the user who
GRANT ALL <------------------
                                   logs in as 'ashraf'.

ON customers

TO ashraf;


Query:
GRANT SELECT <--------------       Grants SELECT permission on the table customers to the user
                                   'sunil'. User 'sunil' does not have permission to insert, update,
ON customers                       delete or perform any other operation on customers table.

TO sunil;
Query:
GRANT SELECT

ON customers

TO sunil

WITH GRANT OPTION; <-------        Enables user 'sunil' to give SELECT permission on
--                                 customers table to other users.
Query:

REVOKE DELETE <-------------
                                   Takes away DELETE permission on customers table
                                   from user 'ashraf'.
ON customers

FROM ashraf;

5. Recovery and Concurrency

Recovery and Concurrency in a DBMS are part of the general topic of transaction management. Hence
we shall begin the discussion by examining the fundamental notion of a transaction.

5.1 Transaction

A transaction is a logical unit of work.

Consider the following example:

The procedure for transferring an amount of Rs. 100/- from the account of one customer to another is
given.


 EXEC SQL       WHENEVER SQLERROR GOTO
                UNDO
 EXEC SQL       UPDATE DEPOSIT
                      SET BALANCE=BALANCE-100
                        WHERE CUSTID=from_cust;
 EXEC SQL       UPDATE DEPOSIT
                      SET BALANCE=BALANCE+100
                        WHERE CUSTID=to_cust:
 EXEC SQL       COMMIT;
                      GOTO FINISH
UNDO:
 EXEC SQL       ROLLBACK;
FINISH:
 RETURN;

Here, it has to be noted that the single operation â œamount transferâ          involves two database
updates â “ updating the record of from_cust and updating the record of to_cust. In between these
two updates the database is in an inconsistent (or incorrect in this example) state. i.e., if only one of
the updates is performed, one cannot say by seeing the database contents whether the amount
transfer operation has been done or not. Hence to guarantee database consistency it has to be ensured
that either both updates are performed or none are performed. If, after one update and before the
next update, something goes wrong due to problems like a system crash, an overflow error, or a
violation of an integrity constraint etc., then the first update needs to be undone.

This is true with all transactions. Any transaction takes the database from one consistent state to
another. It need not necessarily preserve consistency of database at all intermediate points. Hence it is
important to ensure that either a transaction executes in its entirety or is totally cancelled. The set of
programs which handles this forms the transaction manager in the DBMS. The transaction manager
uses COMMIT and ROLLBACK operations for ensuring atomicity of transactions.

COMMIT â “ The COMMIT operation indicates successful completion of a transaction which means that
the database is in a consistent state and all updates made by the transaction can now be made
permanent. If a transaction successfully commits, then the system will guarantee that its updates will
be permanently installed in the database even if the system crashes immediately after the COMMIT.

ROLLBACK â “ The ROLLBACK operation indicates that the transaction has been unsuccessful which
means that all updates done by the transaction till then need to be undone to bring the database back
to a consistent state. To help undoing the updates once done, a system log or journal is maintained by
the transaction manager. The before- and after-images of the updated tuples are recorded in the log.

The properties of transaction can be summarised as ACID properties - ACID standing for atomicity,
consistency, isolation and durability.

Atomicity: A transaction is atomic. Either all operations in the transaction have to be performed or
none should be performed.

Consistency: Transactions preserve database consistency. i.e., A transaction transforms a consistent
state of the database into another without necessarily preserving consistency at all intermediate
points.

Isolation: Transactions are isolated from one another. i.e., A transaction's updates are concealed from
all others until it commits (or rolls back).

Durability: Once a transaction commits, its updates survive in the database even if there is a
subsequent system crash.

5.2 Recovery from System Failures

System failures (also called soft crashes) are those failures like power outage which affect all
transactions in progress, but do not physically damage the database.

During a system failure, the contents of the main memory are lost. Thus the contents of the database
buffers which contain the updates of transactions are lost. (Note: Transactions do not directly write on
to the database. The updates are written to database buffers and, at regular intervals, transferred to
the database.) At restart, the system has to ensure that the ACID properties of transactions are
maintained and the database remains in a consistent state. To attain this, the strategy to be followed
for recovery at restart is as follows:

   •   Transactions which were in progress at the time of failure have to be undone at the time of
       restart. This is needed because the precise state of such a transaction which was active at the
       time of failure is no longer known and hence cannot be successfully completed.
   •   Transactions which had completed prior to the crash but could not get all their updates
       transferred from the database buffers to the physical database have to redone at the time of
       restart.

This recovery procedure is carried out with the help of

An online logfile or journal - The logfile maintains the before- and after-images of the tuples updated
during a transaction. This helps in carrying out the UNDO and REDO operations as required. Typical
entries made in the logfile are :

   •   Start of Transaction Marker
   •   Transaction Identifier
   •   Record Identifier
   •   Operations Performed
   •   Previous Values of Modified Data (Before-image or Undo Log)
   •   Updated Values of Modified Records (After-image or Redo Log)
   •   Commit / Rollback Transaction Marker

Taking a checkpoint at specific intervals - This involves the following two operations:
a) physically writing the contents of the database buffers out to the physical database. Thus during a
checkpoint the updates of all transactions, including both active and committed transactions, will be
written to the physical database.
b) Physically writing a special checkpoint record to the physical log. The checkpoint record has a list of
all active transactions at the time of taking the checkpoint.

5.3 Recovery : An Example




At the time of restart, T3 and T5 must be undone and T2 and T4 must be redone.
T1 does not enter the recovery procedure at all since it updates were all written to the database at
time tc as part of the checkpoint process

5.4 Concurrency

Concurrency refers to multiple transactions accessing the same database at the same time. In a
system which allows concurrency, some kind of control mechanism has to be in place to ensure that
concurrent transactions do not interfere with each other.


Three typical problems which can occur due to concurrency are explained here.

a) Lost Update Problem




(To understand the above situation, assume that

   •   there   is a record R, with a field, say Amt, having value 1000 before time t1.
          o      Both transactions A & B fetch this value at t1 and t2 respectively.
          o      Transaction A updates the Amt field in R to 800 at time t3.
          o      Transaction B updates the Amt field in R to 1200 at time t4.

Thus after time t4, the Amt value in record R has value 1200. Update by Transaction A at time t3 is
over-written by the Transaction B at time t4.)
b) Uncommitted Dependency Problem




(To understand the above situation, assume that

   •   there   is a record R, with a field, say Amt, having value 1000 before time t1.
          o      Transaction B fetches this value and updates it to 800 at time t1.
          o      Transaction A fetches R with Amt field value 800 at time t2.
          o      Transaction B rolls back and its update is undone at time t3. The Amt field takes the
                 initial value 1000 during rollback.

Transaction A continues processing with Amt field value 800 without knowing about B's rollback.)

c) Inconsistent Analysis Problem
5.5 Locking

Locking: A solution to problems arising due to concurrency.

Locking of records can be used as a concurrency control technique to prevent the above
mentioned problems. A transaction acquires a lock on a record if it does not want the record
values to be changed by some other transaction during a period of time. The transaction
releases the lock after this time.

Locks are of two types

   1. shared (S lock)
   2. and exclusive (X Lock).

   •   A transaction acquires a shared (read) lock on a record when it wishes to retrieve or fetch the
       record.
   •   An exclusive (write) lock is acquired on a record when a transaction wishes to update the
       record. (Here update means INSERT, UPDATE or DELETE.)

The following figure shows the Lock Compatibility matrix.




Normally, locks are implicit. A FETCH request is an implicit request for a shared lock whereas an
UPDATE request is an implicit request for an exclusive lock.
Explicit lock requests need to be issued if a different kind of lock is required during an operation. For
example, if an X lock is to acquired before a FETCH it has to be explicitly requested for.

5.6 Deadlocks

Locking can be used to solve the problems of concurrency. However, locking can also introduce the
problem of deadlock as shown in the example below.
Deadlock is a situation in which two or more transactions are in a simultaneous wait state, each of
them waiting for one of the others to release a lock before it can proceed.

If a deadlock occurs, the system may detect it and break it. Detecting involves detecting a cycle in the
â œWait-For Graphâ          (a graph which shows 'who is waiting for whom'). Breaking a deadlock
implies choosing one of the deadlocked transactions as the victim and rolling it back, thereby releasing
all its locks. This may allow some other transaction(s) to proceed.

Deadlock prevention can be done by not allowing any cyclic-waits.

6. Query Optimization

6.1 Overview

When compared to other database systems, query optimization is a strength of the relational systems.
It can be said so since relational systems by themselves do optimization to a large extent unlike the
other systems which leave optimization to the programmer. Automatic optimization done by the
relational systems will be much more efficient than manual optimization due to several reasons like :

   •   uniformity in optimization across programs irrespective of the programmer's expertise in
       optimizing the programs.
   •   system's ability to make use of the knowledge of internal conditions (eg: volume of data at the
       time of querying) for optimization. For the same query, such conditions may be different at
       different times of querying. (In a manual system, this knowledge can be utilised only if the
       query is re-written each time, which is not practically possible.)
   •   system's ability to evaluate large number of alternatives to find the most efficient query
       evaluation method.

In this chapter we shall look into the process of automatic query optimization done by the relational
systems.

6.2 An Example of Query Optimization

Let us look at a query being evaluated in two different ways to see the dramatic effect of query
optimization.

Consider the following query.

Select ORDDATE, ITEM#, QTY
from ORDTBL, ORD_ITEMS
where ORDTBL.ORD# = ORD_ITEMS.ORD#
and ITEM# = 'HW3';

Assumptions:

   •   There are 100 records in ORDTBL
   •   There are 10,000 records in ORD_ITEMS
   •   There are 50 order items with item# 'HW3'

Query Evaluation Method 1

T1 = ORDTBL X ORD_ITEMS
(Perform the Product operation as the first step towards joining the two tables)

- 10000 X 100 tuple reads (1000000 tuple reads -> generates 1000000 tuples as intermediate result)
- 1000000 tuples written to disk (Assuming that 1000000 tuples in the intermediate result cannot be
held in the memory. 1000000 tuple writes to a temporary space in the disk.)
T2 =    ORDTBL.ORD# = ORD_ITEMS.ORD# & ITEM# 'HW3'(T1)
(Apply the two conditions in the query on the intermediate result obtained after the first step)

- 1000000 tuples read into memory (1000000 tuple reads)
- 50 selected (those tuples satisfying both the conditions. 50 held in the memory itself)


T3 =     ORDDATE,ITEM#,QTY (T2)
(Projection performed as the final step. No more tuple i/o s)

- 50 tuples (final result)

Total no. of tuple i/o s = 1000000 reads + 1000000 writes + 1000000 reads
= 3000000 tuple i/o s


Query Evaluation Method 2


T1 =    ITEM#='HW3'   (ORD_ITEMS) (Perform the Select operation on ORD_ITEMS as the first step)

- 10000 tuple reads (10000 tuple reads from ORD_ITEMS)
- 50 tuples selected; no disk writes (50 tuples satisfy the condition in Select. No disk writes assuming
that the 50 tuples forming the intermediate result can be held in the memory)

T2 = ORDTBL JOIN T1

- 100 tuple reads (100 tuple reads from ORDTBL)
- resulting relation with 50 tuples


T3 =     ORDDATE, ITEM#, QTY(T2)
(Projection performed as the final step. No more tuple i/o s)

- 50 tuples (final result)

Total no. of tuple i/o s = 10000 reads + 100 reads
= 10100 tuple i/o's

Comparison of the two Query Evaluation Methods

10,100 tuple I/O's (of Method 2) v/s 3,000,000 tuple I/O's (of Method 1) !

Thus by sequencing the operations differently a dramatic difference can be made in the performance of
queries.

Here it needs to be noted that in the Method 2 of evaluation, the first operation to be performed was a
'Select' which filters out 50 tuples from the 10,000 tuples in the ORD_ITEMS table. Thus this operation
causes elimination of 9950 tuples. Thus elimination in the initial steps would help optimization.

Some more examples:

   select CITY, COUNT(*) from CUSTTBL                 select CITY, COUNT(*) from CUSTTBL
1. where CITY != 'BOMBAY'                       v/s   group by CITY
   group by CITY;                                     having CITY != 'BOMBAY';

   select * from ORDTBL                               select * from ORDTBL
2. where to_char(ORDDATE,'dd-mm-yy')            v/s   where ORDDATE = to_date('11-08-94',
   = '11-08-94';                                      'dd-mm-yy');
Here the second version is faster. In the first form of the query, a function to_char is applied on an
attribute and hence needs to be evaluated for each tuple in the table. The time for this evaluation will
be thus proportional to the cardinality of the relation. In the second form, a function to_date is applied
on a constant and hence needs to be evaluated just once, irrespective of the cardinality of the relation.
Moreover, if the attribute ORDDATE is indexed, the index will not be used in the first case, since the
attribute appears in an expression and its value is not directly used.

6.3 The Query Optimization Process

The steps of query optimization are explained below.

a) Cast into some Internal Representation â “ This step involves representing each SQL query into
some internal representation which is more suitable for machine manipulation. The internal form
typically chosen is a query tree as shown below.

Query Tree for the SELECT statement discussed above:




b)Convert to Canonical Form â “ In this second step, the optimizer makes use of some transformation
laws or rules for sequencing the internal operations involved. Some examples are given below.
(Note: In all these examples the second form will be more efficient irrespective of the actual data
values and physical access paths that exist in the stored database. )

Rule 1:

(A JOIN B) WHERE restriction_A AND restriction_B




(A WHERE restriction_A) JOIN (B WHERE restriction_B)
Restrictions when applied first, cause eliminations and hence better performance.

Rule 2:

(A WHERE restriction_1) WHERE restriction_2
A WHERE restriction_1 AND restriction_2
Two restrictions applied as a single compound one instead applying the two individual restrictions
separately.

Rule 3:

(A[projection_1])[projection_2]




A[projection_2]

If there is a sequence of successive projections applied on the same relation, all but the last one can
be ignored. i.e., The entire operation is equivalent to applying the last projection alone.

Rule 4:

(A[projection]) WHERE restriction




(A WHERE restriction)[projection]
Restrictions when applied first, cause eliminations and hence better performance.

Reference [1] gives more such general transformation laws.
c)Choose Candidate Low-level Procedures â “ In this step, the optimizer decides how to execute the
transformed query. At this stage factors such as existence of indexes or other access paths, physical
clustering of records, distribution of data values etc. are considered.

The basic strategy here is to consider the query expression as a set of low-level implementation
procedures predefined for each operation. For eg., there will be a set of procedures for implementing
the restriction operation: one (say, procedure 'a') for the case where the restriction attribute is
indexed, one (say, procedure 'b') where the restriction attribute is hashed and so on.

Each such procedure has and associated cost measure indicating the cost, typically in terms of disk
I/Os.

The optimizer chooses one or more candidate procedures for each low-level operations in the query.
The information about the current state of the database (existence of indexes, current cardinalities
etc.) which is available from the system catalog will be used to make this choice of candidate
procedures.

d)Generate Query Plans and Choose the Cheapest. In this last step, query plans are generated by
combining a set of candidate implementation procedures. This can be explained with the following
example(A trivial one but illustrative enough).

Assume that there is a query expression comprising a restriction, a join and a projection. Some
examples, of implementation procedures available for each of these operations can be assumed as
given in the table below.


                                                                         Implementation
               Operation                Condition Existing
                                                                           Procedure

             Restriction       Restriction attribute is indexed                   a

             Restriction       Restriction attribute is hashed                   b

             Restriction       Restriction attribute is neither                   c
                               indexed nor hashed

             Join                                                                 d

             Join                                                                 e

             Projection                                                           f

             Projection                                                           g


Now the various query plans for the original query expression can be generated by making
permutations of implementation procedures available for different operations. Thus the query plans can
be

adf
adg
aef
aeg
bdf
...
...

It has to be noted that in reality, the number of such query plans possible can be too many and hence
generating all such plans and then choosing the cheapest will be expensive by itself. Hence a heuristic
reduction of search space rather than exhaustive search needs to be done. Considering the above
example, one such heuristic method can be as follows:

If the system knows that the restriction attribute is neither indexed nor hashed, then the query plans
involving implementation procedure 'c ' alone (and not 'a' and 'b') need to be considered and the
cheapest plan can be chosen from the reduced set of query plans.

6.4 Query Optimization in Oracle

Some of the query optimization measures used in Oracle are the following:

Indexes unnecessary for small tables. i.e., if the size of the actual data record is not much larger than
the index record, the search time in the index table and the data table will be comparable. Hence
indexes will not make much difference in the performance of queries.

Indexes/clusters when retrieving less than 25% of rows. The overhead of searching in the index file
will be more when retrieving more rows.

Multiple column WHERE clauses

evaluations causing largest number of eliminations performed first

JOIN-columns should be indexed. JOIN columns or Foreign Key columns may be indexed since queries
based on these columns can be expected to be very frequent.

Index not used in queries containing NULL / NOT NULL. Index tables will not have NULL / NOT NULL
entries. Hence need not search for these in the index table.

								
To top
;