GTU MCA PPT Full DBMS By Keval
            What is DATABASE?
• A database can be defined as a collection of interrelated,
  meaningful data.
• For example, a postal address contains:
• A building name
• A flat number in the building
• A road name
• A state name
• A PIN code
• A country name
• So the address book is a database and each postal address
  is data.
               What is DBMS?
• To be able to successfully design and maintain a
  database we have to do the following:
1) Identify which part of the world's data is of
   interest to us.
2) Identify which specific objects in that part of the
   world's data are of interest.
3) Identify the relationships between those objects.
• Software that manages the data in such a database is called
   a DBMS (Database Management System).
• A DBMS allows inserting, updating, deleting and
   processing of data.
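As a concrete sketch of these insert/update/delete/processing operations, here is a minimal example using Python's built-in sqlite3 module; the table and column names are illustrative assumptions, not part of the slides:

```python
import sqlite3

# An in-memory database standing in for the address-book example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address_book (name TEXT, city TEXT, pin_code TEXT)")

# Insert
conn.execute("INSERT INTO address_book VALUES ('Keval', 'Ahmedabad', '380001')")
# Update
conn.execute("UPDATE address_book SET city = 'Surat' WHERE name = 'Keval'")
# Process (query)
row = conn.execute("SELECT city FROM address_book WHERE name = 'Keval'").fetchone()
print(row[0])  # Surat
# Delete
conn.execute("DELETE FROM address_book WHERE name = 'Keval'")
```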
              Benefits of DBMS
•   Redundancy can be reduced
•   Inconsistency can be reduced
•   Data can be shared by single or multiple users
•   Standards can be followed
•   Data integrity can be maintained
•   Security of the data can be implemented
•   Data independence can be achieved
               What is RDBMS?
• An RDBMS is based on the relational model as introduced
  by Dr. E. F. Codd.
• It should also satisfy Codd's 12 rules, but in practice
  there is no DBMS that satisfies all of these rules.
• An RDBMS stores data in the form of tables.
• In an RDBMS, a single database can be spread across
  several tables.
• This differs from a flat-file database.
• Today's popular commercial RDBMSs for large
  databases include Oracle, Microsoft SQL Server,
  Sybase and IBM's DB2.
              Dr. E. F. Codd's Rules
• The Information rule: All data should be presented
  in table form.
• The Guaranteed Access rule: All data should be
  accessible without ambiguity.
• The Systematic Treatment of Null Values rule: A field
  should be allowed to remain empty. This involves the
  support of null values, which are distinct from an empty
  string or a number with a value of zero.
• The Dynamic Online Catalog Based on the Relational
  Model rule: A relational database must provide
  access to its structure through the same tools that are
  used to access the data.
• The Comprehensive Data Sublanguage rule: The
  database must support one clearly defined language
  that includes data definition, data manipulation,
  data integrity and database transactions.
• The View Updating rule: All views of the data which are
  theoretically updatable must be updatable in practice by the
  system.
• The High-level Insert, Update, and Delete rule: The capability
  of handling a base relation or a derived relation as a single
  operand applies not only to the retrieval of data but also to
  the insertion, update, and deletion of data.
• The Physical Data Independence rule: Application programs
  and terminal activities remain logically unimpaired whenever
  any changes are made in either storage representations or
  access methods.
• The Logical Data Independence rule: How data is viewed should not
  have to change when the logical structure of the database changes. This
  rule is particularly difficult to satisfy.
• The Integrity Independence rule: Integrity constraints must be
  definable in the RDBMS itself.
• The Distribution Independence rule: An RDBMS
  has distribution independence, which implies that users should not have
  to be aware of whether a database is distributed.
• The Nonsubversion rule: If the database has any
  means of handling a single record at a time, that
  low-level language must not be able to subvert or
  avoid the integrity rules which are expressed in a
  higher-level language that handles multiple
  records at a time.
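The Systematic Treatment of Null Values rule can be seen directly in any SQL engine: NULL is distinct from an empty string or zero, and equality comparison with NULL never succeeds. A small sketch using sqlite3 (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val)")           # untyped column keeps values as given
conn.executemany("INSERT INTO t VALUES (?)", [(None,), (0,)])
conn.execute("INSERT INTO t VALUES ('')")       # empty string, a third distinct value

# NULL is not equal to anything, including itself; it must be tested with IS NULL.
nulls = conn.execute("SELECT COUNT(*) FROM t WHERE val IS NULL").fetchone()[0]
zeros = conn.execute("SELECT COUNT(*) FROM t WHERE val = 0").fetchone()[0]
empties = conn.execute("SELECT COUNT(*) FROM t WHERE val = ''").fetchone()[0]
bad = conn.execute("SELECT COUNT(*) FROM t WHERE val = NULL").fetchone()[0]
print(nulls, zeros, empties, bad)  # 1 1 1 0
```

Each of the three rows matches exactly one predicate, and `val = NULL` matches nothing, which is why SQL provides `IS NULL`.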
                      DBMS Vs RDBMS
               DBMS                                     RDBMS
Data is handled as a file-oriented     Data is handled in the form of tables
system
DBMS does not support client-          Most RDBMSs support client-server
server architecture                    architecture
DBMS does not support distributed      RDBMS supports distributed databases
databases
There is little security of data       There are multiple levels of security
A DBMS may satisfy fewer than 7 of     An RDBMS satisfies more than 7 of
Dr. E. F. Codd's rules                 Dr. E. F. Codd's rules
                                   Naming Conventions
Field                                  Column, Attribute
Record                                 Row, Tuple, Entity
File                                   Table, Relation, Entity class
            Introduction to Oracle
• In June 1970, Dr. E. F. Codd published a paper entitled "A
  Relational Model of Data for Large Shared Data Banks".
• This relational model, sponsored by IBM, then came to be
  accepted as the definitive model for RDBMSs.
• The language developed by IBM for manipulating data is
  called Structured Query Language (SQL).
• Relational Software, Inc. later came to be known as Oracle.
• Oracle Corporation produces the most widely used,
  server-based, multi-user RDBMS, named Oracle.
            Chapter 1. Introduction
1.   Basic Concepts
     •   What is Data?
     •   What is Database?
     •   What is Database Management System?

2.   Purpose & Advantages of DBMS

3.   Data Models
4.   DBMS Architecture
     •   Three-Level Architecture
     •   Overall Architecture

5.   Various Components of DBMS
1. Basic Concepts :-

   What is Data?

  – “Data is a collection of facts from which conclusions
    may be drawn.”

  – In computer science, “data is anything in a form
    suitable for use with a computer.” Data is often
    distinguished from programs. A program is a set of
    instructions that detail a task for the computer to
    perform. In this sense, data is thus everything that
    is not program code.

What is Database?

–   “A database is a collection of data that is organized so
    that its contents can easily be accessed, managed,
    and updated.”

–   “A database is a collection of data, typically describing
    the activities of one or more related organizations.”

–   “Database is a structured collection of records or data
    that is stored in a computer system.”

What is Database Management System?

•    A Database Management System (DBMS), or simply a
     Database System (DBS), consists of :
    – A collection of interrelated and persistent data (usually
       referred to as the database (DB)).
    – A set of application programs used to access, update and
       manage that data (which form the data management
       system (MS)).

•    The goal of a DBMS is to provide an environment that is
     both convenient and efficient to use in :
    – Retrieving information from the database.
    – Storing information into the database.
•    Databases are usually designed to manage large bodies
     of information. This involves
    – Definition of structures for information storage (data
        modeling).
    – Provision of mechanisms for the manipulation of
        information (file and system structures, query
        processing).
    – Providing for the safety of information in the
        database (crash recovery and security).
    – Concurrency control if the system is shared by users.

2.1 Purpose of DBMS :-
 In the early days, database applications were built directly on top of file
 systems.
Drawbacks of using file systems to store data:

            1. Data redundancy and inconsistency
            - Multiple file formats, duplication of information in different files

            2. Difficulty in accessing data
            - Need to write a new program to carry out each new task

            3. Data isolation — multiple files and formats

            4. Integrity problems
            - Integrity constraints (e.g. account balance > 0) become “buried”
                in program code rather than being stated explicitly
            - Hard to add new constraints or change existing ones
2.1 Purpose of DBMS :- (Contd…)
         5. Atomicity of updates

         - Failures may leave database in an inconsistent state with partial
updates carried out
         - Example: Transfer of funds from one account to another should either
complete or not happen at all

         6. Concurrent access by multiple users

         - Concurrent access is needed for performance
         - Uncontrolled concurrent access can lead to inconsistencies
           Example: Two people reading a balance and updating it at the same time

         7. Security problems

         - Hard to provide user access to some, but not all, data
2.2 Advantages of DBMS :-

• A Database Management System (DBMS) aids in the storage,
  control, manipulation and retrieval of data. This section
  lists the advantages of database management systems.
• A database system is a software program used to store, delete,
  update and retrieve data. A database can be limited to
  a single desktop computer or can be stored on large
  server machines, like the IBM mainframe. There are
  various database management systems available in the
  market. Some of them are Sybase, Microsoft SQL
  Server, Oracle RDBMS, PostgreSQL, etc.
1. Data Warehouses :-

• Database management systems are warehouses of information,
  where large amounts of data can be stored. Common examples in
  commercial applications are inventory data, personnel data, etc.
  It often happens that a person uses a database management
  system without even realizing that it is being used. The best
  examples of this are the address book of a cell phone, digital
  diaries, etc. Both of these devices store data in their internal
  database.
2. Defining Attributes :-

   The unique data field in a table is assigned a primary
   key. The primary key helps in the identification of
   data. It also checks for duplicates within the same
   table, thereby reducing data redundancy. There are
   tables, which have a secondary key in addition to the
   primary key. The secondary key is also called 'foreign
   key'. The secondary key refers to the primary key of
   another table, thus establishing a relationship
   between the two tables.
3. Systematic Storage :-

  The data is stored in the form of tables. A table consists of
  rows and columns. The primary and secondary keys help to
  eliminate data redundancy, enabling systematic storage of data.

4. Changes to schema :-

  The table schema can be changed, and it is not platform
  dependent. Therefore, the tables in the system can be edited to
  add new columns and rows without hampering the applications
  that depend on that particular database.

5. No Language Dependence :-

   The database management systems are not language dependent. Therefore,
   they can be used with various languages and on various platforms.

6. Table Joins :-

   The data in two or more tables can be integrated into a single table. This
   helps reduce the size of the database and also enables easy retrieval of
   data.
7. Multiple Simultaneous Usage :-

   The database can be used simultaneously by a number of users. Various users
   can retrieve the same data simultaneously. The data in the database can also
   be modified, based on the privileges assigned to users.

8. Data Security :-
   Data is the most important asset. Therefore, there is a need for data security.
   Database management systems help to keep the data secured.

9. Privileges :-

   Different privileges can be given to different users. For example, some users
   can edit the database, but are not allowed to delete the contents of the
   database.
10. Abstract View of Data and Easy Retrieval :-

   DBMS enables easy and convenient retrieval of data. A database user can
   view only the abstract form of data; the complexities of the internal structure
   of the database are hidden from him. The data fetched is in a user-friendly
   format.
11. Data Consistency :-

  Data consistency ensures a consistent view of data to every user.
  It includes the accuracy, validity and integrity of related data. The
  data in the database must satisfy certain consistency constraints,
  for example, the age of a candidate appearing for an exam
  should be of a numeric data type and in the range 20-25. When
  the database is updated, these constraints are checked by the
  database systems.
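The age-range constraint described above can be stated declaratively so that the database system checks it on every update. A minimal sketch with sqlite3 (the table name and range follow the example in the text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE candidate (
        name TEXT NOT NULL,
        age  INTEGER CHECK (age BETWEEN 20 AND 25)
    )
""")
conn.execute("INSERT INTO candidate VALUES ('Asha', 22)")  # satisfies the constraint
try:
    conn.execute("INSERT INTO candidate VALUES ('Ravi', 30)")  # out of range
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the DBMS checked the constraint and refused the row
```

The constraint lives in the schema, not in application code, which is exactly the "integrity problems" drawback of file systems that a DBMS removes.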
3. Data Models :-
  –   A collection of tools for describing :
      •   Data
      •   Data relationships
      •   Data semantics
      •   Data constraints
  –   Relational Model
  –   Entity Relationship Model (for Database Design)
  –   Object-based Data Model (for Object-Oriented)
  –   Semistructured Data Model (XML)
  –   Other Older Models :
      •   Network Model
      •   Hierarchical Model
3.1 Relational Data Model :-

 This model uses a collection of tables to
  represent both data and the relationships among
  those data.
 Each table has multiple columns, and each
  column has a unique name.
 It is an example of a record-based model.
 The database is structured in fixed-format records of
  several types.
 This is the most widely used data model.
Table        A table is a collection of data arranged in row and column
             format. A database may contain one or more tables.
Entity       An entity is a distinguishable object of the real world.
             E.g.:- Student, Customer, Employee…..etc…
Attributes   Attributes are the set of properties possessed by an
             entity. E.g.:- Name, Address, City, Mobile….etc…
Fields       The title of the column that holds a specific type of data
             is known as a field. A table can have a maximum of 255 fields.
Tuples       Each record row in a table is a tuple.
Records      The collection of data horizontally across the fields is known
             as a record. A record is complete information about an
             entity.
Example of Simple relational Model
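A simple relational model can be sketched as follows; the STUDENT table and its attributes are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The relation: each column (roll_no, name, city) is an attribute.
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT, city TEXT)")
# Each inserted row is a tuple of the relation.
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [
    (1, "Keval", "Ahmedabad"),
    (2, "Meera", "Rajkot"),
])
rows = conn.execute(
    "SELECT roll_no, name, city FROM student ORDER BY roll_no"
).fetchall()
print(rows)  # [(1, 'Keval', 'Ahmedabad'), (2, 'Meera', 'Rajkot')]
```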
3.2 The Entity Relationship Model :-

 It is based on the real world, which consists of a
  collection of basic objects called "entities".
 An entity is a thing or object in the real world that is
  distinguishable from other objects.
 For example, a person is an entity, and a bank account
  can be considered an entity.
 Entities are described in a database by a set of
  attributes.
 A relationship is an association among several
  entities.


Shows relations between…..
1.  Between two entities
2.  Between attributes and an entity
3.3 Object-Based Data Model :-

 It can be seen as extending the (E-R) model
  with notions of:

   Encapsulation
   Methods (functions) and
   Object identity

 It combines the features of the object-oriented
  data model and the relational data model.
3.4 Semistructured Data Model :-

 It permits the specification of data
  where individual data items of the
  same type may have different sets of
  attributes.

 The Extensible Markup Language
  (XML) is widely used to represent
  semistructured data.
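The defining property, items of the same type carrying different sets of attributes, can be seen in a small XML document. A sketch with Python's standard ElementTree parser (the element names are illustrative):

```python
import xml.etree.ElementTree as ET

# Two <student> elements of the same type, each with a different set of fields:
# the first has name and city, the second has name and mobile.
doc = ET.fromstring("""
<students>
  <student><name>Keval</name><city>Ahmedabad</city></student>
  <student><name>Meera</name><mobile>9800000000</mobile></student>
</students>
""")
fields = [sorted(child.tag for child in s) for s in doc.findall("student")]
print(fields)  # [['city', 'name'], ['mobile', 'name']]
```

A relational table would force both rows into one fixed set of columns; the semistructured model does not.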
3.5 Hierarchical Data Model :-
 In this model data is organized into a tree-like structure, implying
   a single upward link in each record to describe the nesting, and a
   sort field to keep the records in a particular order in each same-
   level list.
•   A hierarchical database consists of the following:
1. It contains nodes connected by branches.
2. The top node is called the root.
3. If multiple nodes appear at the top level, the nodes are
   called root segments.
4. The parent of node nx is a node directly above nx and
   connected to nx by a branch.
5. Each node (with the exception of the root) has exactly
   one parent.
6. The child of node nx is the node directly below nx and
   connected to nx by a branch.
7. One parent may have many children.
By introducing data redundancy, complex network structures can also be
represented as hierarchical databases. This redundancy is eliminated in the
physical implementation by including a 'logical child'. The logical child
contains no data but uses a set of pointers to direct the database
management system to the physical child in which the data is actually
stored. Associated with a logical child are a physical parent and a logical
parent. The logical parent provides an alternative (and possibly more
efficient) path to retrieve logical child information.
3.6 Network Data Model :-
 This model organizes data using two fundamental constructs,
   called records and sets. Records contain fields, and sets
   define one-to-many relationships between records: one
   owner, many members.
• Access to the database was not via SQL query strings,
  but via a specific set of APIs, typically for operations
  such as FIND and CREATE.
• Each API would only access a single table (dataset), so it
  was not possible to implement a JOIN that would
  return data from several tables.
• It was not possible to provide a variable WHERE clause.
  The only selection mechanism available was
   – read all entries (a full table scan).
   – read a single entry using a specific primary key.
   – read all entries on a child table which were associated with a
     selected entry on a parent table
• Any further filtering had to be done within the
  application code.
• It was not possible to provide an ORDER BY
  clause. Data was presented in the order in which
  it existed in the database. This mechanism could
  be tuned by specifying sort criteria to be used
  when each record was inserted, but this had
  several disadvantages:
  – Only a single sort sequence could be defined for each
    path (link to a parent), so all records retrieved on that
    path would be provided in that sequence.
  – It could make inserts rather slow when attempting to
    insert into the middle of a large collection, or where a
    table had multiple paths, each with its own set of sort
    criteria.
     4. DBMS Architecture :-
•     One of the many tasks that DBAs must handle is choosing the correct DBMS to
      use for each new application being developed.

•     There are many aspects to selecting the proper type of DBMS - there are
      different types of DBMS for different architectures and purposes.

•     The final architecture must be based on the business needs of the organization;
      the choice should not be made by a single person or group, but by a team
      consisting of business experts and IT experts.

•      The DBMS selected should be appropriate for the nature and type of processing
       planned to be implemented. There are basically four levels of DBMS
       architecture that can be chosen:
     1.    Enterprise DBMS
     2.    Departmental DBMS
     3.    Personal DBMS
     4.    Mobile DBMS
1.       Enterprise DBMS :-

     –     It is designed for scalability and high performance. It must be
           capable of supporting very large databases, a large number
           of concurrent users, and multiple types of applications. The
           enterprise DBMS will run on a large-scale machine, typically a
           mainframe or a high-end Unix, Linux, or Windows NT
            machine. Furthermore, an enterprise DBMS offers all of the
            “bells and whistles” available from the DBMS vendor.
            Multiprocessor support, support for parallel queries, clustering,
            and other advanced DBMS features will be core components
            of an enterprise DBMS.
2. Departmental DBMS :-

   – sometimes referred to as a workgroup DBMS, supports small to
     medium sized workgroups within an organization, and typically
     runs on a Unix, Linux, or Windows 2000 (or NT) server. The
     dividing line between a departmental database server and an
     enterprise database server is gray. Hardware and software
     upgrades often can allow a departmental DBMS to tackle tasks
     that previously could only be performed by an enterprise
     DBMS. The steadily falling cost of departmental hardware and
     software components further contributes to lowering TCO and
     helping to enable a workgroup environment to scale up to
     serve the enterprise.
3. Personal DBMS :-

   – is designed to be used by a single user, typically on a low- to
     medium-powered PC platform. Lotus Approach, Microsoft
     Access and dBase are examples of personal database software.
     Of course, the major DBMS vendors also market personal
      versions of their more high-powered solutions, e.g. Personal Oracle
      and DB2 Everyplace. Sometimes the low cost of a
     personal DBMS causes misguided attempts to choose a
     personal DBMS for a departmental or enterprise solution. But
     do not be lured by the low cost. A personal DBMS product is
     suited only for small scale projects and should not be used to
     deploy multi-user applications.
4. Mobile DBMS :-

   – is a specialized version of a departmental or enterprise DBMS.
     It is designed to be used by remote users who are not usually
     connected to the network. The mobile DBMS enables local
     database access and modification on a laptop or handheld
     device, such as a Palm PDA or PocketPC. Furthermore, the
     mobile DBMS provides a mechanism for synchronizing remote
     database changes to a centralized, enterprise or departmental
     database server.
Three-Level Architecture :-

• A commonly used approach to views of data is the three-level
  architecture suggested by ANSI/SPARC (American National
  Standards Institute/Standards Planning and Requirements
  Committee). ANSI/SPARC produced an interim report in
  1972, followed by a final report in 1977.

• The three levels of the architecture are three different views
  of the data:
   1. External - individual user view
   2. Conceptual - community user view
   3. Internal - physical or storage view
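The separation between the conceptual level and an external level can be sketched with a database view; the EMPLOYEE table and the salary-hiding view are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Conceptual level: the full community schema, including sensitive columns.
conn.execute("CREATE TABLE employee (emp_no INTEGER, name TEXT, salary REAL)")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 50000.0)")
# External level: an individual user's view of the same data, with salary hidden.
conn.execute("CREATE VIEW emp_public AS SELECT emp_no, name FROM employee")
row = conn.execute("SELECT * FROM emp_public").fetchone()
print(row)  # (1, 'Asha')
```

The internal (storage) level is handled entirely by the engine; neither the view nor the query needs to know how the rows are physically stored.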
    Two-Tier Client-Server

• The client manages the main business and
  data processing logic and the user
  interface.
• The server manages and controls access to
  the database.
Two-Tier Client Server Architecture
Three-Tier C-S Architecture
• The client side presented two problems
  preventing true scalability:
  – ‘Fat’ client, requiring considerable resources on the
    client’s computer to run effectively.
  – Significant client-side administration overhead.

• By 1995, three layers were proposed, each
  potentially running on a different platform.
Three-Tier C-S Architecture
• User interface layer – runs on client.
• Business logic and data processing
  layer – middle tier runs on a server
  (application server).
• DBMS – stores data required by the
  middle tier. This tier may be on a
  separate server (database server).
Three-Tier C-S Architecture
• Advantages:
  – ‘Thin’ client, requiring less expensive
    hardware.
  – Application maintenance centralized.
  – Easier to modify or replace one tier without
    affecting others.
  – Separating business logic from database
    functions makes it easier to implement load
    balancing.
  – Maps quite naturally to the Web environment.
Three-Tier Architecture
 Overall Architecture :-

 • A DBMS is typically run as a back-end server
   in a (local or global) network, offering services
   to clients directly or to application servers.
 Components of DBMS :-
   – Hardware : can range from a PC to a network of computers.
   – Software : the DBMS, operating system, network software (if
     necessary) and also the application programs.
   – Data : used by the organization; a description of this data is
     called the schema.
   – People : includes database designers, DBAs, application
     programmers, and end-users.
   – Procedures : instructions and rules that should be applied to
     the design and use of the database and DBMS.
 Data Definition Language :- (DDL)
   The Data Definition Language (DDL) is used to create and destroy databases
   and database objects. These commands will primarily be used by database
   administrators during the setup and removal phases of a database project.
  – Specific notation for defining the database schema
      Example : CREATE TABLE Account ( Acc_No CHAR(10),
                                       Balance INTEGER )
  – DDL compiler generates a set of tables stored in a data dictionary
  – Data Dictionary contains Metadata (Data about Data)
      •   Database Schema
      •   Data Storage and Definition Language
          –   Specifies the storage structure and access methods used
      •   Integrity Constraints
          –   Domain Constraints
          –   Referential Integrity
          –   Assertion
      •   Authorization
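The CREATE TABLE example above, together with the data dictionary it populates, can be run directly; here sqlite3 stands in for the DBMS, and its sqlite_master catalog plays the role of the data dictionary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# DDL: create the schema object from the slide's example.
conn.execute("CREATE TABLE account (acc_no CHAR(10), balance INTEGER)")
# The DDL compiler records the table's metadata in the catalog (data dictionary).
meta = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchone()
print(meta)  # ('account',)
# DDL: destroy the object; its catalog entry disappears with it.
conn.execute("DROP TABLE account")
```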
 Data Manipulation Language :- (DML)

  –   Language for accessing and manipulating the data organized by
      the appropriate data model
      •   DML also known as query language
      •   Data Manipulation is:
               retrieval of information from the database
               insertion of new information into the database
               deletion of information in the database
               modification of information in the database
  –   Two Classes of Languages :
      •   Procedural :
          –   The user specifies what data is required
          –   and how to get that data.
      •   Declarative (Nonprocedural) :
          –   The user specifies what data is required
          –   without specifying how to get it.
  –   SQL is the most widely used Query Language.
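The four DML operations, retrieval, insertion, deletion and modification, can be sketched in SQL through sqlite3; the account data is an illustrative assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acc_no TEXT, balance INTEGER)")
# Insertion of new information into the database.
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A-101", 500), ("A-102", 1500), ("A-103", 300)])
# Retrieval, declaratively: state WHAT is wanted (balances over 400),
# not HOW to scan for it.
rich = conn.execute(
    "SELECT acc_no FROM account WHERE balance > 400 ORDER BY acc_no"
).fetchall()
print(rich)  # [('A-101',), ('A-102',)]
# Modification of existing information.
conn.execute("UPDATE account SET balance = balance - 100 WHERE acc_no = 'A-101'")
# Deletion of information.
conn.execute("DELETE FROM account WHERE acc_no = 'A-103'")
```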
: Chapter – 2 :

Relational Data Model
• The relational data model allows various data packets/files
  to be related to each other with a set of relations.
• For example, if customer names are stored in one file
  ‘CUSTOMER’, and another file contains the names of cities,
  ‘CITY’, then the CUSTOMER file can have a relation with
  CITY, which will help identify which city the customer
  belongs to.
• The data is arranged in a relation, which is visually
  represented as a two-dimensional table.
• The data is inserted into the table in the form of tuples
  (which are nothing but rows).
                 Introduction (Contd…)

• A tuple is formed by one or more attributes,
  which are used as basic building blocks in the formation
  of various expressions that are used to derive
  meaningful information.
• There can be any number of tuples in the table, but all
  the tuples contain the same fixed attributes with varying
  values.
• The relational model is implemented in a database where a
  relation is represented by a table, a tuple is represented
  by a row, an attribute is represented by a column of the
  table, and the attribute name is the name of the column.
                Introduction         (Contd…)

   All the relational algebra operations, such as Select,
    Intersection, Product, Union, Difference, Project, Join,
    Division, Merge etc. can also be performed on the
    Relational Database Model.
   Operations on the Relational Database Model are
    facilitated with the help of different conditional
    expressions, various key attributes, pre-defined
    constraints etc.
•   Candidate Key :-
    Any field or combination of fields that identifies a
    record uniquely is called a candidate key. A
    candidate key cannot contain a NULL value and should
    always contain a unique value.
•   Primary Key :-
    A primary key is nothing but a candidate key chosen to
    identify records uniquely.
•   Foreign Key :-
    A foreign key is a primary key of another table, in which
    it uniquely identifies a record. A foreign key defines a
    relation between two (or more) tables. A foreign key
    can contain a NULL value.
   Super Key :-
    Given a relation schema R with attribute list U, let K be a
    subset of U. If in every relation r of R, for any two distinct
    tuples t1 and t2, we have the constraint that t1[K] ≠ t2[K],
    then K is called a SUPER KEY of relation schema R.

    A SUPER KEY is defined in the relational model as a set
    of attributes of a relation variable (relvar) for which it
    holds that in all relations assigned to that variable there
    are no two distinct tuples (rows) that have the same
    values for the attributes in this set. Equivalently, a super
    key can also be defined as a set of attributes of a relvar
    upon which all attributes of the relvar are functionally
    dependent.
   Candidate Key :-

    A super key can have redundant attributes. A super
    key that has no such attributes is called a CANDIDATE
    KEY.

    We can define a candidate key as follows :
    Given relation schema R, K’ is a candidate key of R if
    K’ satisfies two constraints :

   K’ is a super key of R.
   There is no proper subset of K’ that is also a
    super key of R, i.e. K’ is a minimal super key.
   Since a relational schema can have more than one
   candidate key, there is a chosen candidate key
   whose values are used to identify tuples in the relation.
   Such a key is the PRIMARY KEY. The primary key is usually
   the most common candidate key.
Example :-
State Code   SrNo    RegNo          Chassis No    Vehicle
GJ3AB        9999    G1235RF452     1254GG54F     HONDA CITY
HR4BC        7       HR1546TT32     3256SDF45     HONDA CIVIC
MH4FF        1000    MH45HGJ215     458HYU789     HONDA CITY
GJ3BM        6045    GJ11HAS2125    653ASD458     PLATINA

 Super Key :- {StateCode, SrNo, RegNo}, {RegNo}, {RegNo, ChassisNo}
 Candidate Key :- {StateCode, SrNo}, {RegNo}
 Primary Key :- {StateCode, SrNo} or {RegNo}
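The defining test for a super key, that no two distinct tuples agree on all its attributes, is easy to check mechanically. A sketch in Python, using rows adapted from the table above (with a repeated state code added, a hypothetical row, to show why {StateCode} alone is not a key):

```python
def is_superkey(attrs, rows):
    """attrs: list of attribute names; rows: the tuples of the relation as dicts.
    K is a super key iff no two distinct tuples have the same values on K."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False  # two tuples agree on K, so K cannot identify tuples
        seen.add(key)
    return True

vehicles = [
    {"state": "GJ3AB", "srno": 9999, "regno": "G1235RF452"},
    {"state": "GJ3AB", "srno": 6045, "regno": "GJ11HAS2125"},  # same state, new srno
    {"state": "HR4BC", "srno": 7,    "regno": "HR1546TT32"},
]

print(is_superkey(["regno"], vehicles))           # True
print(is_superkey(["state", "srno"], vehicles))   # True
print(is_superkey(["state"], vehicles))           # False
```

{RegNo} and {StateCode, SrNo} pass the test; {StateCode} alone fails as soon as a state registers a second vehicle.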
             Fundamental Concepts in the Relational Model

•   Domain :-
    A domain D is a set of atomic values used to model data.

    By atomic, we mean that each value in the domain is indivisible
    as far as the relational model is concerned.
     For example:
•   The domain of day shift is the set of all possible days : {Mon,
    Tue, Wed…}
•   The domain of salary is the set of all floating-point numbers
    greater than 0 and less than 200,000 (say).
•   The domain of name is the set of character strings that
    represent names of persons.
• Relation :-

  A relation is a subset of the Cartesian product of a list of domains,
  characterized by a name.
  A relation can be viewed as a “table”. In that table, each row
  represents a tuple of data values and each column represents an
  attribute.

• Attribute :-

  A column of a relation, designated by a name. The name
  should be meaningful. Each attribute is associated with a domain.
   Relation Schema :-

    Denoted by R, it is a list of attributes (A1, A2, …, An).
      The degree of the relation is the number of attributes of its
       relation schema.
      The cardinality of the relation is the number of tuples in the
       relation.
                     Characteristics of Relations

•   Ordering of Tuples in a relation :-
    A tuple is a set of values. A relation is a set of tuples. Since a
    relation is a set, there is no ordering on rows.

•   Ordering of Values within a tuple :-
    The order of attributes and their values within a relation is not
    important as long as the correspondence between attributes
    and values is maintained.

•   Values and NULL Values in the tuple :-
    Each value in a tuple is atomic. That means each value cannot
    be divided into smaller components. Hence, composite and
    multivalued attributes are not allowed in a relation.
               Constraints in the Relational Data Model
• The relational model supports a well-defined theory of constraints
  on attributes or tables.
• Constraints are useful because they allow the designer to specify
  the semantics of data in the database, and they are the rules by
  which the DBMS checks that new data satisfies those semantics.
• A domain restricts the values of attributes in the relation, and it
  is a constraint of the relational model.
• We need more specific ways to state what data values are or are
  not allowed, and what format is suitable for an attribute. For
  example, an employee number must be unique, or an employee's
  age must be in the range [23, 65].
   Such information is provided in logical statements called
    integrity constraints.
   There are several kinds of integrity constraints:
       Key Constraints.
       Referential Constraints.

   Key Constraints :-
    A relation is a set of tuples. By definition, all elements in a set are
    distinct; hence all tuples in a relation must be distinct.
    In the relational model, tuples have no identity of the kind objects
    have. Tuple identity is totally value-based.
    Therefore, we need key constraints as the way of uniquely
    identifying a tuple.
   There are many key constraints, like :
       Super Key.
       Primary Key.
       Candidate Key.
       Not Null.
       Unique Key etc…

   Referential Constraints :-
    A referential constraint is specified between two relations
    and maintains the correspondence between tuples in
    those relations. It means that a reference from a tuple in
    one relation to a tuple in the other relation must be valid.
    The referential constraint in the relational model relates to
    the notion of a foreign key.
       Foreign Key Constraints :-
    A set of attributes FK in a relation schema R1 is a foreign key if :

       The attributes in FK correspond to the attributes in the
        primary key of another relation schema R2.

       The value of FK in each tuple of R1 either occurs as the value of
        the primary key of some tuple in R2 or is entirely NULL.

   In a database of many relations, there are usually many
    foreign keys. They provide the “glue” that links
    individual relations into a cohesive database structure.
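The foreign-key condition above can be sketched in plain Python. This is a minimal, hypothetical illustration (relation and attribute names are invented for the example), not how a real DBMS implements the check:

```python
# Hypothetical sketch: checking a foreign-key (referential) constraint.
# Relations are lists of dicts; all names here are illustrative only.

def fk_satisfied(r1, fk_attrs, r2, pk_attrs):
    """True if, for every tuple of r1, the FK value either matches the
    primary key of some tuple in r2 or is entirely NULL (None)."""
    pk_values = {tuple(t[a] for a in pk_attrs) for t in r2}
    for t in r1:
        fk_value = tuple(t[a] for a in fk_attrs)
        if all(v is None for v in fk_value):
            continue  # an entirely NULL foreign key is allowed
        if fk_value not in pk_values:
            return False
    return True

department = [{"dnumber": 5}, {"dnumber": 4}]
employee = [{"ssn": 1, "dno": 5}, {"ssn": 2, "dno": None}]

print(fk_satisfied(employee, ["dno"], department, ["dnumber"]))  # True
```

A tuple such as `{"ssn": 3, "dno": 9}` would make the check fail, because 9 is not the primary key of any department tuple.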
                 Entity-Relationship Model

 The Entity-Relationship (E-R) data model was developed to facilitate
  database design by allowing specification of an enterprise schema
  that represents the overall logical structure of a database.

 The E-R model is very useful in mapping the meanings and
  interactions of real-world enterprises onto a conceptual schema.

 E-R data model employs three basic notations:
    Entity Sets
    Relationship Sets
    Attributes
1. Entity Sets :-

 What is Entity ?

    An Entity is a “Thing” or “Object” in the real world that is
     distinguishable from all other objects.
     e.g. Each person in an enterprise is an entity.

    An entity has a set of properties, and the values of some set of
     properties may uniquely identify it.
     e.g. A person has a “personID” property whose value “101-
     A45” would uniquely identify that particular entity.
   What is Entity Sets?

       An Entity Set is a set of entities of the same type that
        share some properties or attributes.
        e.g. The set of all persons who are customers at a given bank
        can be defined as the entity set customer.
        Similarly, the entity set loan might represent the set of all
        loans awarded by a particular bank.

       Entity sets do not need to be disjoint.
        e.g. Consider the entity set of all employees of the bank and
        the entity set of all customers of the bank: a person entity
        may be an employee, a customer, both, or neither of these.
   An entity is represented by a set of attributes.
   Attributes are descriptive properties possessed by
    each member of an entity set.
   Designating an attribute for an entity set expresses
    that the database stores similar information
    concerning each entity.
       e.g. Possible attributes for entity set customer :
          custNo, custName, custAdd, custCity….etc…
   Thus a database includes a collection of entity sets, each
    of which contains any number of entities of the same
    type.
• A database usually contains groups of entities that are
  similar. For example, a company employing hundreds of
  employees may want to store similar information
  concerning each of the employees. These employee
  entities share the same attributes, but each entity has
  its own value(s) for each attribute.
• An entity type defines a collection (or set) of entities that
  have the same attributes.
• A few individual entities of each type are also illustrated,
  along with the values of their attributes.
• The collection of all entities of a particular entity type in the
  database at any point in time is called an entity set; the
  entity set is usually referred to using the same name as the
  entity type.
• For example :-

• EMPLOYEE refers both to a type of entity and to the
  current set of all employee entities in the database.
• An entity type describes the schema or intension for a
  set of entities that share the same structure.
• The collection of entities of a particular entity type is
  grouped into an entity set, which is also called the
  extension of the entity type.
2. Relationship Sets :-

   What is Relationship?

       A relationship is an association among several entities.
        e.g. We can define a relationship that associates “Customer
        Has Loan”.

   What is a Relationship Set?

       A relationship set is a set of relationships of the same type.
       It is a mathematical relation on n ≥ 2 entity sets.
   If E1, E2……, En are entity sets, then the relationship set R is
    a subset of:
    {(e1,e2…en) | e1 є E1, e2 є E2, …., en є En}
    e.g. Entity set Loan and entity set Branch define the
    relationship set Loan_Branch, which denotes the association
    between a Loan and the Branch in which the loan is maintained.

   The function that an entity plays in a relationship is called the
    “Entity’s Role”.
   Some entity sets participate in a relationship set more than
    once, in different roles; such a relationship set is called a
    “Recursive Relationship Set”.
e.g. Consider an entity set “employee” that records the
    information about all the employees of a bank. We may have a
    relationship “works for” that is modeled by ordered pairs of
    employee entities: the first employee of a pair plays the role of
    Worker and the second plays the role of Manager.

   A relationship may also have attributes, called “Descriptive
    Attributes”.

e.g. For the relationship Depositor between entity sets Customer
    and Account, we can associate the attribute Access_Date with
    the relationship to specify the most recent date on which a
    Customer accessed an Account.

[Figure: E-R diagram showing the Descriptive Attribute (Access_Date)]
3. Attributes :-

   For each attribute there is a set of permitted values, called the
    Domain or value set of that attribute.

   For example, the domain of attribute cust_name might be the set
    of all text strings of a certain length.

   An attribute, as used in the E-R Model, can be characterized by
    the following attribute types:

    1.   Simple and Composite Attributes.
    2.   Single-Valued and Multivalued Attributes.
    3.   Stored and Derived Attributes.
3.1. Simple and Composite Attributes :-
    The attributes used in our examples so far are simple attributes:
     they cannot be divided into subparts.
    A composite attribute, on the other hand, can be divided into
     subparts.
     For example : An attribute name could be structured as a
     composite attribute consisting of Surname, Name, F_Name.

[Figure: composite attribute trees – name divided into Surname, Name,
 F_Name; Address divided into Street_Add, City, State, Zip; Street_Add
 further divided into Street_No, Street_Nm, App_No]

   Attributes that are not divisible are called Simple or Atomic Attributes.
   If a composite attribute is referenced only as a whole, there is no need to
    subdivide it into component attributes.
   For example, if there is no need to refer to the individual components of an
    address, then the whole address can be designated as a simple attribute.
3.2. Single-Value and Multivalued Attributes :-

   Most attributes have a single value for any particular entity; such
    attributes are called Single-Valued Attributes.
   For example, Age is a single-valued attribute of a person.
   In some cases an attribute can have a set of values for the same
    entity – for example, a color attribute for a car or a college
    degree attribute for a person.
   Cars with one color have a single value, whereas two-tone cars
    have two values for color.
   Similarly, one person may have no college degree, another
    person may have one college degree, and a third person may
    have two or more college degrees; therefore different persons
    may have different numbers of values for the same attribute.
   Such attributes are called Multivalued Attributes.
     Multivalued attributes are shown in an E-R
      diagram using a double ellipse.
3.3. Stored and Derived Attributes :-

    In some cases two or more attribute values are related –
     for example, the Age and Birth_Date attributes of a
     person.
    For a particular person entity, the value of Age can be
     determined from the current date and the value of that
     person’s Birth_Date. The Age attribute is hence called a
     Derived Attribute and is said to be derivable from the
     Birth_Date attribute, which is called a Stored Attribute.
3.4. Null Value Attributes :-
   In some cases a particular entity may not have an
    applicable value for an attribute.
   For example, the Apartment_No attribute of an address
    applies only to addresses that are in apartment
    buildings, not to other types of residences. Similarly, a
    College_Degree attribute applies only to persons with a
    college degree.
   For such situations a special value called NULL is used.
   A single-family home would have a NULL value for
    Apartment_No, and a person with no college degree
    would have a NULL value for College_Degree.
   In two further cases, NULL is used when the value of an
    attribute is:
     Known (but missing)

     Unknown

   A value is known but missing when the attribute value exists
    but is not recorded.
     E.g. :- Height of a Person

   A value is unknown when it is not sure whether the value
    exists or not.
     E.g. :- Home Phone.
3.5. Complex Attributes :-
    Composite and multivalued attributes can be nested arbitrarily.
    We can represent nesting by grouping the components of a
     composite attribute between parentheses ( ), separating the
     components with commas (,), and displaying multivalued
     attributes within braces { }.
    Such attributes are called Complex Attributes.
    For example, a person can have more than one residence, and
     each residence can have one address and multiple phones:
     Address ( Street_Add ( S_Num, S_Name, App_No ), City,
         State, Zip, Phone { Add_Phone ( Code, Ph_No ),
                             Mobile } )
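The nested notation above can be mirrored directly in a Python structure. This is only an illustrative sketch with made-up values: composite attributes become dicts, and multivalued attributes (braces in the notation) become lists:

```python
# Hypothetical sketch: the complex Address attribute above as a nested
# Python structure. Composite components -> dicts, multivalued -> lists.
person = {
    "address": {
        "street_add": {"s_num": "12", "s_name": "MG Road", "app_no": "3B"},
        "city": "Surat",
        "state": "Gujarat",
        "zip": "395001",
        "phone": [  # multivalued: a person may have any number of phones
            {"code": "0261", "ph_no": "2223344"},
            {"code": "98765", "ph_no": "43210"},
        ],
    }
}
print(person["address"]["phone"][0]["code"])  # 0261
```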
4. Alternative E-R Notations :-

[Figure: diagram symbols for Entity, Weak Entity, Key Attribute,
 Derived Attribute and Composite Attribute]
Database Designing
           What is Database Designing?
• Each relation schema consists of a number of
  attributes, and the relational database schema
  consists of a number of relation schemas.
• So far we have assumed that attributes are grouped
  to form a relation schema by using the common
  sense of the database designer, or by mapping a
  database schema design from a conceptual model.
• However, we still need some formal measure of why
  one grouping of attributes into a relation schema may
  be better than another.
 There are two levels at which we can discuss the
  goodness of relation schemas.
   1.   Logical / Conceptual Level
   2.   Implementation / Storage Level

 Logical Level :-

 How users interpret the relation schemas and the
  meaning of their attributes.
 Having good relation schemas at this level enables
   users to understand clearly the meaning of the data in
   the relations, and hence to formulate their queries
   correctly.
 Implementation Level :-

 How the tuples in a base relation are stored and
  updated.
 This level applies only to schemas of base relations,
  which will be physically stored as files, whereas at the
  logical level we are interested in schemas of both base
  relations and views (virtual relations).

 As with many design problems database design may be
   performed using two approaches :
   1. Bottom-Up Design Methodology
   2. Top-Down Design Methodology
 Bottom-Up Design Methodology :-

 This methodology considers the basic relationships
  among individual attributes as a starting point and
  uses them to construct relation schemas.
 This approach is not very popular in practice
  because it suffers from the problem of having to
  collect a large number of binary relationships
  among attributes as the starting point.
 Top-Down Design Methodology :-

 Also called Design By Analysis.
 Starts with a number of groupings of attributes into
  relations that exist together naturally, for example, on
  an invoice, a form, or a report.
 The relations are then analyzed individually and
  collectively, and decomposed further until the desirable
  properties are met.

   The theory described here is applicable to both the top-down
   and bottom-up design approaches, but it is more practical
   when used with the top-down approach.
Functional Dependency
 A functional dependency is a constraint between two sets of
    attributes from the database.
   Suppose our relational database schema has N attributes
    A1, A2, A3… AN.
   Let us think of the whole database as being described by a single
    “Universal Relation Schema” R = {A1, A2, A3… AN}.
   We do not imply that we will actually store the database as a
    single table.
   That is why we divide our database into a collection of relation
    schemas.
   For retrieving the value of one attribute based on another
    attribute's value, we have to follow some constraints
    (normalization).
 Definition :-

 “A Functional Dependency, denoted by X → Y, between two
 sets of attributes X and Y that are subsets of R specifies a
 constraint on the possible tuples that can form a relation
 state r of R. The constraint is that, for any two tuples t1
 and t2 in r that have t1[X] = t2[X], they must also have
 t1[Y] = t2[Y].”

 This means that the values of the Y component of a tuple in r
  depend on, or are determined by, the values of the X component.
 Alternatively, the values of the X component uniquely
  determine the values of the Y component.
 We also say that there is a functional dependency
  from X to Y, or that Y is functionally dependent on X.
 The abbreviation for Functional Dependency is FD or f.d.
 The set of attributes X is called the Left-hand Side of the FD.
 The set of attributes Y is called the Right-hand Side
  of the FD.
 Thus X functionally determines Y in a relation
  schema if and only if, whenever two tuples of r(R)
  agree on their X values, they must agree on their Y
  values.
 If a constraint on R states that there cannot be
  more than one tuple with a given X-value in any
  relation instance r(R), then X is a candidate key of
  R. This implies X → Y for any subset of attributes Y
  of R.
 A functional dependency is a property of the
  semantics (meaning) of the attributes.
 The database designers will use their
  understanding of the semantics of the attributes of
  R, that is, how they are related to each other, to
  specify the FDs that should hold on all relation
  states r of R.
 Whenever the semantics of two sets of
  attributes in R indicate that an FD should
  hold, we specify the dependency as a
  constraint.

 The relation extensions r(R) that satisfy the
  FD constraints are called Legal Relation
  States or Legal Extensions.
 Example :-
 Consider the relation schema EMP_PROJ

       SSN      Pnumber      Hours      Ename      Pname     Plocation

A. SSN → Ename
B. Pnumber → {Pname, Plocation}
C. {SSN, Pnumber} → Hours

    These functional dependencies specify that…
    (A) The value of an Employee’s Social Security Number (SSN) uniquely
     determines the Employee Name (Ename).
    (B) The value of a Project Number (Pnumber) uniquely determines the
     Project Name (Pname) and Project Location (Plocation).
    (C) A combination of SSN and Pnumber values uniquely determines the
     number of Hours.
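Whether an FD holds on a given relation state can be checked mechanically: two tuples that agree on the left-hand side must agree on the right-hand side. The sketch below applies this test to a tiny invented EMP_PROJ instance (values are illustrative only):

```python
# Hypothetical sketch: testing whether an FD lhs -> rhs holds on a small
# relation instance, per the definition of functional dependency.

def fd_holds(rows, lhs, rhs):
    seen = {}
    for t in rows:
        key = tuple(t[a] for a in lhs)
        val = tuple(t[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False  # two tuples agree on lhs but differ on rhs
        seen[key] = val
    return True

emp_proj = [
    {"ssn": 1, "pnumber": 10, "hours": 32},
    {"ssn": 1, "pnumber": 20, "hours": 8},
    {"ssn": 2, "pnumber": 10, "hours": 40},
]
print(fd_holds(emp_proj, ["ssn", "pnumber"], ["hours"]))  # True
print(fd_holds(emp_proj, ["ssn"], ["hours"]))             # False
```

Note that such a check on one relation state can only refute an FD; whether an FD *should* hold is a semantic decision made by the designer, as the text explains.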
Inference Rules for FD :-

We denote by F the set of functional dependencies
 that are specified on relation schema R.
Typically, the schema designer specifies the FDs that are
 semantically obvious; usually, however, numerous
 other FDs hold in all legal relation instances among
 sets of attributes that can be derived from, and satisfy,
 the dependencies in F.
In real life it is impossible to specify all possible FDs
 for a given situation.
 For Example :-

 If each department has one manager, so that Dept_No uniquely determines
  Mgr_SSN (Dept_No → Mgr_SSN), and a manager has a unique phone
  number called Mgr_Phone (Mgr_SSN → Mgr_Phone), then these two
  dependencies together imply that Dept_No → Mgr_Phone.
 This is an inferred FD.
 Therefore it is useful to define a concept called closure that includes all
  possible dependencies that can be inferred from the given set F.

 Definition :-
 “Formally, the set of all dependencies that includes F as well as all
  dependencies that can be inferred from F is called the CLOSURE of F; it is
  denoted by F+.”
 For Example :-

 Suppose that we specify the following set F of obvious FDs on
 the relation schema EMP_PROJ.

F={   SSN → {Ename, Bdate, Address, Dnumber},
      Dnumber → {Dname, Dmgr_SSN} }

 Some of the additional dependencies that we can infer from F
 are the following :

SSN → {Dname, Dmgr_SSN}
Dnumber → Dname
 An FD X → Y is inferred from a set of dependencies F specified
  on R if X → Y holds in every legal relation state r of R; that is,
  whenever r satisfies all the dependencies in F, X → Y also holds
  in r.
 The closure F+ of F is the set of all FDs that can be inferred
  from F.
 To determine a systematic way to infer dependencies, we must
  discover a set of Inference Rules that can be used to infer new
  dependencies from a given set of dependencies.

 We use the notation F |= X → Y to denote that the FD X → Y is
  inferred from the set of FDs F.
 We use an abbreviated notation when discussing FDs: we
  concatenate attribute variables and drop the commas for
  convenience.
 Hence, the FD {X,Y} → Z is abbreviated to XY → Z, and the FD
     {X,Y,Z} → {U,V} is abbreviated to XYZ → UV.

 The following six rules, IR1 through IR6, are well-known inference
     rules for FDs.

       (RAT DUP)

1.    IR1 Reflexive Rule ------------- If X ⊇ Y, then X → Y
2.    IR2 Augmentation Rule ------- {X → Y} |= XZ → YZ
3.    IR3 Transitive Rule ------------- {X → Y, Y → Z} |= X → Z
4.    IR4 Decomposition Rule ------ {X → YZ} |= X → Y
5.    IR5 Union or Additive Rule --- {X → Y, X → Z} |= X → YZ
6.    IR6 Pseudotransitive Rule ---- {X → Y, WY → Z} |= WX → Z
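In practice these rules are applied through the standard attribute-closure algorithm: repeatedly add the right-hand side of any FD whose left-hand side is already covered. A minimal sketch (FD sets and names are illustrative):

```python
# Hypothetical sketch: computing the closure X+ of an attribute set X
# under a set of FDs, which is how IR1-IR3 are applied mechanically.
# Each FD is a (lhs, rhs) pair of frozensets of attribute names.

def attribute_closure(x, fds):
    closure = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # if lhs is inside the closure, transitivity adds rhs to it
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

fds = [
    (frozenset({"Dept_No"}), frozenset({"Mgr_SSN"})),
    (frozenset({"Mgr_SSN"}), frozenset({"Mgr_Phone"})),
]
# Transitivity: Dept_No -> Mgr_Phone is inferred.
print(sorted(attribute_closure({"Dept_No"}, fds)))
# ['Dept_No', 'Mgr_Phone', 'Mgr_SSN']
```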
     IR1 Reflexive Rule: If X ⊇ Y, then X → Y
• Proof of IR1 :-
  Suppose that X ⊇ Y and that two
  tuples t1 and t2 exist in some
  relation instance r of R such that
  t1[X] = t2[X]. Then t1[Y] = t2[Y]
  because X ⊇ Y; hence, X → Y must
  hold in r.
 IR2 Augmentation Rule: {X → Y} |= XZ → YZ
• Proof of IR2 :-
  Assume that X → Y holds in a relation instance r of R but that
  XZ → YZ does not hold. Then there must exist two tuples t1 and
  t2 in r such that (1) t1[X] = t2[X],        (2) t1[Y] = t2[Y], (3)
  t1[XZ] = t2[XZ], and (4) t1[YZ] ≠ t2[YZ].

  This is not possible because
  (5) t1[Z] = t2[Z], from (1) and (3),
  (6) t1[YZ] = t2[YZ], from (2) and (5), contradicting (4).
 IR3 Transitive Rule: {X → Y, Y → Z} |= X → Z
• Proof of IR3 :-

  Assume that (1) X → Y and
               (2) Y → Z both hold in a relation r.
  Then for any two tuples t1 and t2 in r such that t1[X] = t2[X] we
  must have
  (3) t1[Y] = t2[Y], from assumption (1).
  Hence, we must also have
  (4) t1[Z] = t2[Z], from (3) and assumption (2).
  Hence, X → Z must hold in r.
 IR4 Decomposition Rule: {X → YZ} |= X → Y
• Proof of IR4 :-

1. X → YZ (given)
2. YZ → Y (using IR1, knowing that YZ ⊇ Y)
3. X → Y (using IR3 on 1 and 2)
     IR5 Union Rule: {X → Y, X → Z} |= X → YZ
• Proof of IR5 :-

1.   X → Y (given)
2.   X → Z (given)
3.   X → XY (using IR2 on 1 by augmenting with X, noting that XX = X)
4.   XY → YZ (using IR2 on 2 by augmenting with Y)
5.   X → YZ (using IR3 on 3 and 4)
     IR6 Pseudotransitive Rule: {X → Y, WY → Z} |= WX → Z
• Proof of IR6 :-

1.   X → Y (given)
2.   WY → Z (given)
3.   WX → WY (using IR2 on 1 by augmenting with W)
4.   WX → Z (using IR3 on 3 and 2)
Armstrong’s Inference Rules
• It has been shown by Armstrong (1974) that inference rules IR1
  through IR3 are sound and complete.
• By Sound, we mean that, given a set of FDs F specified on a relation
  schema R, any dependency that we can infer from F by using IR1
  through IR3 holds in every relation state r of R that satisfies the
  dependencies in F.
• By Complete, we mean that using IR1 through IR3 repeatedly to infer
  dependencies, until no more dependencies can be inferred, results in
  the complete set of all possible dependencies that can be inferred from F.
• Hence, inference rules IR1 through IR3 are known as Armstrong’s Rules.
         Normalization of Relation
• Initially Dr. E. F. Codd proposed three normal
  forms, which he called First Normal Form (1NF),
  Second Normal Form (2NF), Third Normal Form
  (3NF). A stronger definition of 3NF is called Boyce-
  Codd Normal Form(BCNF).
• These normal forms are based on a single analytical
  tool: the Functional Dependency among the
  attributes of relation.
• Later the 4NF and 5NF were proposed based on
  concept of multivalued dependency and join
  dependency respectively.
            Normalization of Data
• Normalization can be considered the process of analyzing
  relation schemas, based on their FDs and primary keys, to
  achieve the desirable properties of:
    Minimizing Redundancy and
    Minimizing the Insertion, Deletion and Update Anomalies.
• Relation schemas that do not meet certain conditions (the
  Normal Form Tests) are decomposed into smaller relation
  schemas that meet the tests and hence possess the desirable
  properties.
 Thus, the normalization procedure provides database
 designers with the following:
   A formal framework for analyzing relation schemas
    based on their keys and on the functional dependencies
    among their attributes.
   A series of normal form tests that can be carried out
    on individual relation schemas so that the relational
    database can be normalized to any desired degree.

 Definition : The normal form of a relation refers to the highest
 normal-form condition that it meets, and hence indicates the
 degree to which it has been normalized.
 Normal forms, when considered in isolation from other
 factors, do not guarantee a good database design.

 It is generally not sufficient to check separately that
 each relation schema in the database is, say, in BCNF
 or in 3NF.

 Rather, the process of normalization through
 decomposition must also confirm the existence of
 additional properties that the relational schemas,
 taken together, should possess.
 These include two properties :
   1. Lossless Join or Nonadditive Join Property
      Which guarantees that the spurious tuple generation
      problem does not occur.
   2. Dependency Preservation Property
      Which ensures that each functional dependency is
      represented in some individual relation resulting
      from the decomposition.
 The nonadditive join property is extremely critical and
  must be achieved at any cost, whereas the dependency
  preservation property, although desirable, is sometimes
  sacrificed.
          First Normal Form (1NF)
• 1NF is now considered to be part of the formal
  definition of a relation in the basic relational model.
• It disallows multivalued attributes, composite
  attributes, and their combinations.
• It states that the domain of an attribute must include
  only atomic values (simple, indivisible values) and
  that the value of any attribute in a tuple must be a
  single value from the domain of that attribute.
• The only attribute values permitted by 1NF are
  single atomic or indivisible values.
 Consider the Department relation schema:

          Dname             Dnumber         Dmgr_ssn             Dlocation

 The domain of Dlocation can contain atomic values, but some
  tuples have a set of these values.
 In this case, Dlocation is not functionally dependent on the
  primary key Dnumber.

    Dname         Dnumber   Dmgr_ssn    Dlocation
    Research      5         333445555   {Surat, Baroda, Rajkot}
    Production    4         987654321   {Ahmadabad}
    Sales         1         888665555   {Bhavnagar}

 The Department relation is not in 1NF.
 There are three main techniques to achieve 1NF.
1. Remove the attribute Dlocation that violates 1NF and
   place it in a separate relation along with the primary
   key Dnumber.
2. Expand the key so that there is a separate tuple in
   the original Department relation for each location of a
   Department, as shown below…
              Dname   Dnumber   Dmgr_ssn       Dlocation
      Research           5      333445555   Surat
      Research           5      333445555   Baroda
      Research           5      333445555   Rajkot
      Production         4      987654321   Ahmadabad
      Sales              1      888665555   Bhavnagar
3.   If a maximum number of values is known for the
     attribute (for example, if it is known that at most three
     locations can exist for a department), replace the
     Dlocation attribute by three atomic attributes:
     Dlocation1, Dlocation2, Dlocation3.
•    This solution has the disadvantage of introducing NULL
     values if most departments have fewer than three
     locations.
•    Of the three solutions, the first is generally considered
     best because it does not suffer from redundancy.
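The first technique above can be sketched in a few lines of Python: the multivalued Dlocation attribute is pulled out into a separate relation keyed by Dnumber. Data values are the illustrative ones from the example:

```python
# Hypothetical sketch of 1NF technique 1: move the multivalued attribute
# Dlocation into a separate relation along with the primary key Dnumber.

department = [
    {"dname": "Research", "dnumber": 5, "dmgr_ssn": "333445555",
     "dlocation": ["Surat", "Baroda", "Rajkot"]},
    {"dname": "Sales", "dnumber": 1, "dmgr_ssn": "888665555",
     "dlocation": ["Bhavnagar"]},
]

# DEPT keeps the single-valued attributes only.
dept = [{k: t[k] for k in ("dname", "dnumber", "dmgr_ssn")}
        for t in department]

# DEPT_LOCATIONS gets one atomic tuple per (dnumber, location) pair.
dept_locations = [{"dnumber": t["dnumber"], "dlocation": loc}
                  for t in department for loc in t["dlocation"]]

print(dept_locations)
```

Both resulting relations contain only atomic values, so each is in 1NF.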
•    1NF also disallows multivalued attributes that are
     themselves composite. These are called Nested
     Relations.

         Emp_Proj
            SSN        Ename        Projs ( Pnumber, Hours )

        SSN   Ename   Pnumber   Hours
        1     Raj     1         33
                      2         15
        2     Ram     3         10
        3     Jay     2         10
                      3         12
        4     Vijay   30        16
                      10        16

     To remove the nested relation, we move its attributes into a
     separate relation and propagate the primary key into it, giving
     the two relation schemas below:

       Emp_Proj_1
              SSN           Ename

       Emp_Proj_2
                  SSN          Pnumber         Hours
         Second Normal Form (2NF)
• 2NF is based on the concept of Full Functional Dependency.
• An FD X → Y is a full functional dependency if removal of any
  attribute A from X means that the dependency no longer
  holds; that is, for any attribute A ∈ X, (X − {A}) does not
  functionally determine Y.
• A functional dependency X → Y is a Partial Dependency if
  some attribute A ∈ X can be removed from X and the
  dependency still holds; that is, (X − {A}) → Y.
 Definition :-

  “A relation schema R is in 2NF if every nonprime
  attribute A in R is fully functionally dependent on the
  primary key of R.”

 The test for 2NF involves testing for functional
  dependencies whose left-hand-side attributes are part
  of the primary key.

 If the primary key contains a single attribute, the test need
  not be applied at all.
     Emp_Proj

      SSN          Pnumber          Hours           Ename     Pname           Plocation


     2nd Normal Form :-

     Emp_Proj_1              SSN                   Pnumber           Hours

     Emp_Proj_2           SSN              Ename

     Emp_Proj_3             Pnumber                 Pname        Plocation
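The 2NF test described above can be sketched mechanically: any FD whose left-hand side is a proper subset of the primary key is a partial dependency. The FDs below are the EMP_PROJ ones from the earlier example; the helper name is invented:

```python
# Hypothetical sketch: flagging 2NF violations (partial dependencies).
# An FD whose lhs is a proper subset of the primary key means some
# nonprime attribute is not fully functionally dependent on the key.

def partial_dependencies(pk, fds):
    """fds: list of (lhs, rhs) pairs of frozensets of attribute names."""
    return [(lhs, rhs) for lhs, rhs in fds if lhs < pk]  # proper subset

pk = frozenset({"SSN", "Pnumber"})
fds = [
    (frozenset({"SSN", "Pnumber"}), frozenset({"Hours"})),          # full
    (frozenset({"SSN"}), frozenset({"Ename"})),                     # partial
    (frozenset({"Pnumber"}), frozenset({"Pname", "Plocation"})),    # partial
]
print(len(partial_dependencies(pk, fds)))  # 2
```

The two flagged dependencies are exactly the ones moved into Emp_Proj_2 and Emp_Proj_3 in the decomposition above.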
            Third Normal Form (3NF)
• 3NF is based on the concept of Transitive Dependency.
• A functional dependency X → Y in a relation schema R is a
  transitive dependency if there is a set of attributes Z that is
  neither a candidate key nor a subset of any key of R, and
  both X → Z and Z → Y hold.
• The dependency SSN → Dmgr_ssn is transitive through
  Dnumber in EMP_DEPT,
• because both the dependencies SSN → Dnumber and
  Dnumber → Dmgr_ssn hold, and Dnumber is neither a key
  itself nor a subset of the key of EMP_DEPT.
 Definition :-
  “According to Codd’s original definition, a relation schema R
  is in 3NF if it satisfies 2NF and no nonprime attribute of R is
  transitively dependent on the primary key.”
 The relation schema EMP_DEPT is in 2NF, since no partial
  dependencies on a key exist.
 However, EMP_DEPT is not in 3NF because of the transitive
  dependency of Dmgr_ssn on SSN via Dnumber.
 We can normalize EMP_DEPT by decomposing it into the two 3NF
  relations ED1 and ED2, which represent the independent entity facts
  about Employee and Department.
 A JOIN operation on ED1 and ED2 will recover the original
  relation EMP_DEPT.
     EMP_DEPT

    SSN     Ename       Bdate     Address       Dnumber    Dname     Dmgr_ssn

     3rd Normal Form :-

    ED1           SSN           Ename       Bdate          Address    Dnumber

    ED2           Dnumber               Dname             Dmgr_ssn
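The JOIN mentioned above can be sketched as a natural join on the shared attribute Dnumber; matching ED1 and ED2 tuples are glued back into EMP_DEPT rows. The sample values are invented for illustration:

```python
# Hypothetical sketch: a natural join on one shared attribute, showing
# that ED1 and ED2 recombine into the original EMP_DEPT tuples.

def natural_join(r1, r2, on):
    return [{**t1, **t2} for t1 in r1 for t2 in r2 if t1[on] == t2[on]]

ed1 = [{"ssn": 1, "ename": "Raj", "dnumber": 5}]
ed2 = [{"dnumber": 5, "dname": "Research", "dmgr_ssn": "333445555"}]

print(natural_join(ed1, ed2, "dnumber"))
```

Because Dnumber is the key of ED2, each ED1 tuple matches at most one ED2 tuple, so the join is lossless (no spurious tuples are generated).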
      LOTS (1st Normal Form)
      Property_ID#   Country_name   Lot#   Area   Price   Tax_Rate

      2nd Normal Form

      LOTS1:  Property_ID#   Country_name   Lot#   Area   Price     (FD1)
      LOTS2:  Country_name   Tax_Rate                               (FD3)

      3rd Normal Form

      LOTS1A: Property_ID#   Country_name   Lot#   Area             (FD1)
      LOTS1B: Area   Price                                          (FD3)
      LOTS2:  Country_name   Tax_Rate                               (FD4)

                          LOTS                1NF

                  LOTS1          LOTS2        2NF

          LOTS1A      LOTS1B     LOTS2        3NF
    Boyce-Codd Normal Form (BCNF)
• BCNF was proposed as a simpler form of 3NF, but it was
  found to be stricter than 3NF.
• That is, every relation in BCNF is also in 3NF; however, a
  relation in 3NF is not necessarily in BCNF.
• We can see the need for a normal form stronger than 3NF in the
  example of the LOTS relation schema with its four FDs, FD1 to FD4.
• Suppose we have thousands of lots for only two countries,
  India and the USA.
• Suppose also that the lot sizes in India are 1.0, 1.1, 1.2 up
  to 2.0,
 whereas the lot sizes in the USA are 2.1, 2.2 up to 3.0.
 In such a situation we have an additional functional
  dependency FD5 : Area → Country_name.
 The Area of a lot can determine the Country.
 This relation can generate redundancy by repeating the
  same value “India” in Country_name for Area = 1.0, 1.1, 1.2 up
  to 2.0. For all of these values, the value of Country_name
  would be “India”.
 And the same situation holds for the USA.
 So there is again a need to decompose the
  relation schema LOTS1A into two different schemas,
  LOTS1AX and LOTS1AY.
 Definition :-
  A relation schema R is in BCNF if, whenever a nontrivial
  functional dependency X → A holds in R, X is a
  superkey of R.

 This means that to check whether a schema is in BCNF,
  the schema must first be in 3NF.
 If, in addition, some FD holds in the schema in which a
  nonprime attribute determines the value of a prime
  attribute, that FD violates the second precondition for BCNF.
 So to convert such a schema into BCNF, we again have to
  decompose the relation schema, as follows:
      3rd Normal Form

      LOTS1A                                            LOTS1B

    Property_ID#    Country_name   Lot#   Area         Area        Price
FD1                                                 FD3



                                   For FD5, Area is a nonprime attribute and
                                   Country_name is a prime attribute, so FD5
                                   violates the second precondition for BCNF.
      LOTS2
                                   So once again we have to decompose the
    Country_name      Tax_Rate     relation schema LOTS1A into two different
    FD4                            relation schemas, LOTS1AX and LOTS1AY. Then
                                   LOTS1AX, LOTS1AY, LOTS1B and LOTS2 form the
                                   BCNF conversion of the LOTS relation schema.
Boyce Codd Normal Form

LOTS1AX (Property_ID#, Area, Lot#)
    FD1 : Property_ID# --> Area, Lot#

LOTS1AY (Area, Country_name)
    FD5 : Area --> Country_name

LOTS1B (Area, Price)
    FD3 : Area --> Price

LOTS2 (Country_name, Tax_Rate)
    FD4 : Country_name --> Tax_Rate
    1NF : LOTS
    2NF : LOTS1, LOTS2
    3NF : LOTS1A, LOTS1B, LOTS2

           Relational Decomposition
• A single Universal Relation Schema R = {A1, A2, …, An} includes all the attributes of the database.
• We implicitly make the universal relation assumption, which states that every attribute name is unique.
• The set of functional dependencies F that should hold on the attributes of R is specified by the database designer.
• Using the FDs, we decompose the universal relation schema R into a set of relation schemas D = {R1, R2, R3, …, Rm} that will become the relational database schema; D is called the Decomposition of R.
          Properties of Relational Decomposition
1. Dependency Preservation Property :-

• It would be useful if each FD X --> Y specified in F either appeared directly in one of the relation schemas Ri in D or could be inferred from the FDs that appear in some Ri.
• Informally, this is the Dependency Preservation Condition.
• It is not necessary that the exact dependencies specified in F appear themselves in individual relations of the decomposition.
• It is sufficient that the union of the FDs that hold on the individual relations in D be equivalent to F (the set of FDs that holds on the universal schema R).
• Definition :-

Given a set of dependencies F on R, the projection of F on Ri, where Ri is a subset of R, is the set of FDs X --> Y in F+ (the closure of F) such that the attributes in X u Y are all contained in Ri. Hence the projection of F on each relation schema Ri in the decomposition D is the set of FDs in F+ whose left-hand-side and right-hand-side attributes are all in Ri.
• We say that a decomposition D = {R1, R2, R3, …, Rm} of R is Dependency-Preserving with respect to F if the union of the projections of F on each Ri in D is equivalent to F.
• If a decomposition is not dependency-preserving, some dependency is lost in the decomposition.
• To check whether a lost dependency holds, we must take the JOIN of two or more relations.
2. Nonadditive Join / Lossless Join :-

• Another property that a decomposition D should possess is the nonadditive join property, which ensures that no spurious tuples are generated when a NATURAL JOIN operation is applied to the relations in the decomposition.

• Definition :-

    Formally, a decomposition D = {R1, R2, R3, …, Rm} of R has the Lossless Join Property with respect to the set of dependencies F on R if, for every relation state r of R that satisfies F, the following holds, where * is the NATURAL JOIN of all the relations in D: *(R1(r), R2(r), …, Rm(r)) = r
• The word loss in Lossless refers to loss of information, not to loss of tuples.
• If a decomposition does not have the lossless join property, we may get additional spurious tuples after the NATURAL JOIN operation is applied; these additional tuples represent erroneous or invalid information.
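The spurious-tuple effect can be demonstrated in a few lines of Python. The relation and attribute names below are illustrative, not from the slides; r(A, B, C) is projected onto (A, B) and (B, C), where B is not a key of either schema:

```python
# A lossy decomposition: joining the projections returns rows that
# were never in the original relation state.
r = {("a1", "b1", "c1"), ("a2", "b1", "c2")}   # relation state r(A, B, C)

ab = {(a, b) for a, b, c in r}                 # project r onto (A, B)
bc = {(b, c) for a, b, c in r}                 # project r onto (B, C)

# NATURAL JOIN of the two projections on the shared attribute B.
joined = {(a, b, c) for a, b in ab for b2, c in bc if b == b2}
spurious = joined - r
print(sorted(spurious))
```

The join yields four tuples instead of two; ("a1", "b1", "c2") and ("a2", "b1", "c1") are spurious, so this decomposition is not lossless.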
         Multivalued Dependency
• So far we have discussed only functional dependency,
  which is by far the most important type of dependency
  in relational database design theory.
• However, in many cases relations have constraints that cannot be specified as FDs.
• In this section, we discuss the concept of Multivalued Dependency (MVD) and define Fourth Normal Form, which is based on MVDs.
 If we have two or more Multivalued independent attributes in
  the same relation schema, we get into a problem of having to
  repeat every value of one of the attributes with every value of
  the other attribute to keep the relation state consistent and to
  maintain the independence among the attributes involved.
 Example :-
 Consider the relation EMP.
 A tuple in this EMP relation represents the fact that an
  employee whose name is Ename works on the project whose
  name is Pname and has a dependent whose name is Dname.
 An employee may work on several projects and may have
  several dependents, and the employee’s projects and
  dependents are independent from one another.
 To keep the relation state consistent, we must have a separate
  tuple to represent every combination of an employee’s
  dependent and an employee’s project.
• This constraint is specified as a Multivalued Dependency on the EMP relation.
    EMP :-
    Ename     Pname     Dname
    Smith     X         John
    Smith     Y         Anna
    Smith     X         Anna
    Smith     Y         John

    Two MVDs :-
    1. Ename -->> Pname
    2. Ename -->> Dname
• Formal Definition for MVD :-

    A Multivalued Dependency X -->> Y specified on relation schema R, where X and Y are both subsets of R, specifies the following constraint on any relation state r of R: if two tuples t1 and t2 exist in r such that t1[X] = t2[X], then two tuples t3 and t4 should also exist in r with the following properties, where we use Z to denote ( R – ( X u Y ) ):
    • t3[X] = t4[X] = t1[X] = t2[X]
    • t3[Y] = t1[Y] and t4[Y] = t2[Y]
    • t3[Z] = t2[Z] and t4[Z] = t1[Z]
•   The formal definition specifies that, given a particular value of X, the set of values of Y determined by this value of X is completely determined by X alone and does not depend on the values of the remaining attributes Z of R.
•   Hence, whenever two tuples exist that have distinct values of Y but the same value of X, these values of Y must be repeated in separate tuples with every distinct value of Z that occurs with that same value of X.

•   An MVD X -->> Y in R is called a Trivial MVD if:
    a) Y is a subset of X, or
    b) X u Y = R.

•   An MVD that satisfies neither (a) nor (b) is called a Nontrivial MVD.
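The formal t1/t2/t3/t4 definition above translates directly into a brute-force check. A Python sketch; using attribute positions instead of names is an implementation choice of this sketch:

```python
def satisfies_mvd(r, X, Y):
    """Check the MVD X ->> Y on relation state r (a set of tuples).

    X and Y are lists of attribute positions; Z is every other position,
    following the definition Z = R - (X u Y).
    """
    n = len(next(iter(r)))
    Z = [i for i in range(n) if i not in X and i not in Y]

    def build(t_xy, t_z):
        # Tuple taking its X and Y values from t_xy and its Z values from t_z.
        out = list(t_xy)
        for i in Z:
            out[i] = t_z[i]
        return tuple(out)

    for t1 in r:
        for t2 in r:
            if all(t1[i] == t2[i] for i in X):
                # The required tuples t3 and t4 must both exist in r.
                if build(t1, t2) not in r or build(t2, t1) not in r:
                    return False
    return True

# EMP state from the slides: attributes (Ename, Pname, Dname).
emp = {("Smith", "X", "John"), ("Smith", "Y", "Anna"),
       ("Smith", "X", "Anna"), ("Smith", "Y", "John")}
print(satisfies_mvd(emp, X=[0], Y=[1]))   # Ename ->> Pname
```

On the four-tuple EMP state the MVD holds; dropping the ("Smith", "X", "Anna") and ("Smith", "Y", "John") tuples would make the check fail, which is exactly why those combination tuples must be stored.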
         Fourth Normal Form (4NF)
• We now present the definition of 4NF, which is violated when a relation has undesirable MVDs, and hence can be used to identify and decompose such relation schemas.

• Definition :-

  A relation schema R is in 4NF with respect to a set of dependencies F (that includes FDs and MVDs) if, for every nontrivial MVD X -->> Y in F+, X is a super key for R.
    Emp :-                       4th Normal Form :-
    Ename     Pname   Dname   1.   Emp_Project:-
    Smith       X      John
    Smith       Y     Anna         Ename        Pname
    Smith       X     Anna          Smith          X
    Smith       Y      John         Smith          Y
    Brown      W       Jim          Brown        W
    Brown       X      Jim          Brown          X
    Brown       Y      Jim          Brown          Y
    Brown       Z      Jim          Brown          Z
    Brown      W       Joan
    Brown       X      Joan   2.   Emp_Dept:-
    Brown       Y      Joan        Ename      Dname
    Brown       Z      Joan        Smith        John
    Brown      W      Bond         Smith        Anna
    Brown       X     Bond         Brown        Jim
    Brown       Y     Bond         Brown        Joan
    Brown       Z     Bond         Brown        Bond
      Emp :- (Ename, Pname, Dname), with X = Ename, Y = Pname, Z = Dname

      MVD1 : X -->> Y
      MVD2 : X -->> Z

We can also write this as :
      X -->> Y | Z

      4th Normal Form :-
1.    Emp_Project :- (Ename, Pname)     MVD1 : X -->> Y
2.    Emp_Dept :-    (Ename, Dname)     MVD2 : X -->> Z
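As a sanity check, the 4NF decomposition is nonadditive: joining Emp_Project and Emp_Dept on Ename reproduces exactly the 16 tuples of the original EMP state shown in the slides. A Python sketch using that data:

```python
# Emp_Project and Emp_Dept states from the slides.
emp_project = {("Smith", "X"), ("Smith", "Y"),
               ("Brown", "W"), ("Brown", "X"), ("Brown", "Y"), ("Brown", "Z")}
emp_dept = {("Smith", "John"), ("Smith", "Anna"),
            ("Brown", "Jim"), ("Brown", "Joan"), ("Brown", "Bond")}

# NATURAL JOIN on the shared attribute Ename.
emp = {(e, p, d) for e, p in emp_project for e2, d in emp_dept if e == e2}
print(len(emp))   # 2*2 Smith tuples + 4*3 Brown tuples = 16
```

No spurious tuples appear, because Ename is a super key of each decomposed schema's nontrivial MVDs become trivial after the split.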
        Join Dependency and Fifth Normal Form

• There may be no FD in R that violates any NF up to BCNF, and no nontrivial MVD present in R that violates 4NF.
• We then resort to another dependency, called the Join Dependency, and, if it is present, carry out a multiway decomposition into 5NF.
• Such a dependency is very difficult to detect in practice; therefore normalization into 5th Normal Form is very rarely done in practice.
•   Definition :-
    A Join Dependency (JD), denoted by JD(R1, R2, R3, …, Rn), specified on relation schema R, specifies a constraint on the states r of R. The constraint states that every legal state r of R should have a nonadditive join decomposition into R1, R2, …, Rn; that is, for every such r we have:
    *(R1(r), R2(r), …, Rn(r)) = r

•   5th Normal Form :-

    A relation schema R is in 5NF (or Project-Join Normal Form, PJNF) with respect to a set F of FDs, MVDs and JDs if, for every nontrivial Join Dependency JD(R1, R2, …, Rn) in F+ (that is, implied by F), every Ri is a super key of R.
                  Design Phases
• For small applications it may be feasible for a database designer who understands the application requirements to decide directly on the relations to be created, their attributes and the constraints on the relations.
• However, such a direct design process is difficult for real-world applications. Often no one person understands the complete data needs of an application.
• The database designer must interact with the users of the application to understand the needs of the application and represent them in a high-level fashion that can be understood by the users.
• And finally, translate the requirements into the lower levels of the design.
• The data is the requirement of the users, and the database structure fulfills these requirements.
• Following are the sequential tasks to be performed while designing a database :-
1. Collect Database Requirements.
2. Decide the Number of Entities and their Attributes.
3. Make Relationships between Entities.
4. Conceptual Designing.
5. Logical Designing.
6. Physical Designing.
The main phases of database designing :

1. Requirements and Data :-

  The initial phase of database design is to characterize fully the data needs of the database users. The database designer needs to interact extensively with domain experts and users to carry out this task.
  The outcome of this phase is a specification of user requirements.
2. Choose Data Model :-

  Next, the designer chooses a data model and, by applying the concepts of that data model, translates these requirements into a conceptual schema of the database. The schema developed at this conceptual-design phase provides a detailed overview of the enterprise. The entity-relationship model is typically used to represent the conceptual design.
  Typically, the conceptual-design phase results in the creation of an E-R diagram that provides a graphic representation of the schema.
3. Functional Requirements :-

  A fully developed conceptual schema also indicates the functional requirements of the enterprise.
  In a specification of Functional Requirements, users describe the kinds of operations (transactions) that will be performed on the data, such as:
•     Modifying and updating data.
•     Searching and retrieving data.
•     Deleting data.
• The process of moving from an abstract data model to the implementation of the database proceeds in two final design phases :
   • Logical-design phase
   • Physical-design phase
1. Logical-design Phase :-
    The designer maps the high-level conceptual schema into the implementation data model of the database system that will be used.
    The implementation data model is typically the relational data model, and this step consists of mapping the conceptual schema defined using the entity-relationship model into a relational schema.
2. Physical-design Phase :-

    Finally, the database designer uses the resulting system-specific database schema in the subsequent physical-design phase, in which the physical features of the database are specified. These features include the form of file organization and the internal storage structures.

•   The physical schema of a database can be changed relatively easily after an application has been built. However, changes to the logical schema are usually harder to carry out, since they may affect a number of queries and updates scattered across the application code.
    Chapter - 4
               : Introduction :

• Indexes are used to speed up the retrieval of records in response to certain search conditions.
• The index structures typically provide secondary access paths, which offer alternative ways to access the records without affecting the physical placement of records on disk.
• Basically, any field of the file can be used to create an index, and multiple indexes on different fields can be constructed on the same file.
 To find a record or records in the file based on a
  certain selection condition on an indexing field,
  one has to initially access the index, which
  points to one or more blocks in the file where
  the required records are located.
 The most prevalent types of indexes are based
  on ordered files (Single-level Indexes) and tree
  data structures (Multi-level Indexes, B+-trees).
              Single-Level Indexes

• For a file with a given record structure consisting of several fields (or attributes), an index access structure is usually defined on a single field of the file, called an indexing field or indexing attribute.
• The index typically stores each value of the index field along with a list of pointers to all disk blocks that contain records with that field value.
• The values in the index are ordered so that we can do a binary search on the index.
• There are several types of ordered indexes.
• A Primary Index is specified on the ordering key field of an ordered file of records.
• The ordering key field is used to physically order the file records on disk, and every record has a unique value for it.
• If the ordering field is not a key field, that is, if numerous records in the file can have the same value for the ordering field, another type of index, called a clustering index, is used.
• Notice that a file can have at most one physical ordering field, so it can have at most one primary index or one clustering index, but not both.
• A third type of index, called a secondary index, can be specified on any nonordering field of a file.
               (1) Primary Index :-

• A primary index is an ordered file whose records are of fixed length, with two fields.
• The first field is of the same data type as the ordering key field, called the primary key, of the data file, and the second field is a pointer to a disk block (an address).
• Each index entry has the value of the primary key field for the first record in a block and a pointer to that block as its two field values.
• We refer to the two field values of index entry i as <K(i), P(i)>.
• The total number of entries in the index is the same as the number of disk blocks in the ordered data file.
• The first record in each block of the data file is called the anchor record of the block, or simply the block anchor.
• Indexes can also be characterized as dense or sparse.
• A dense index has an index entry for every search key value in the data file.
• A sparse (or nondense) index, on the other hand, has index entries for only some of the search values.
• Therefore, a primary index is a nondense index.
• A record whose primary key value is K lies in the block whose address is P(i), where K(i) ≤ K < K(i+1).
• The ith block in the data file contains all such records because of the physical ordering of the file records on the primary key field.
    Index File (<K(i), P(i)> entries)       Data File (Name, SSN, Bdate, Job)

    Block Anchor
    Primary Key Value   Block Pointer
    Aaryan              --> block whose anchor record is "Aaryan"
    Alex                --> block whose anchor record is "Alex"
    Allen               --> block whose anchor record is "Allen"
    :                       :
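A primary index lookup can be sketched as a binary search over the sorted <K(i), P(i)> entries for the last block anchor that is ≤ the search key. The anchor keys below are the ones from the figure; the block numbers are illustrative:

```python
import bisect

# Sparse primary index: one <K(i), P(i)> entry per data file block.
index_keys = ["Aaryan", "Alex", "Allen"]   # block anchor keys K(i)
index_blocks = [0, 1, 2]                    # block addresses P(i)

def find_block(k):
    """Return the block address P(i) with K(i) <= k < K(i+1)."""
    i = bisect.bisect_right(index_keys, k) - 1
    return index_blocks[max(i, 0)]

print(find_block("Alfred"))
```

"Alfred" sorts between the anchors "Alex" and "Allen", so the record, if present, must lie in the block anchored by "Alex"; only that one data block then needs to be read.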
            (2) Clustering Index :-

• If file records are physically ordered on a nonkey field, which does not have a distinct value for each record, that field is called the clustering field.
• We can create a different type of index, called a clustering index, to speed up retrieval of records that have the same value for the clustering field.
• A clustering index is also an index file with two fields; the first field is of the same type as the clustering field of the data file, and the second field is a block pointer.
• There is one entry in the clustering index for each distinct value of the clustering field, containing that value and a pointer to the first block in the data file that has a record with that value for its clustering field.

            (3) Secondary Index :-

• A secondary index provides a secondary means of accessing a file for which some primary access already exists.
• The secondary index may be on a field which is a candidate key and has a unique value in every record, or on a nonkey field with duplicate values.
• The first field is of the same data type as some nonordering field of the data file that is an indexing field.
• The second field is either a block pointer or a record pointer.
• Again, we refer to the two field values of index entry i as <K(i), P(i)>; the entries are ordered by K(i), so we can perform a binary search.
• Because the records of the data file are not physically ordered by the values of the secondary key field, we cannot use block anchors.
• That is why an index entry is created for each record in the data file, rather than for each block.
• A secondary index usually needs more storage space and a longer search time than a primary index, because of its larger number of entries.
• Example for binary search :-
• Number of records r = 30,000
• Block size B = 1024 bytes (record length R = 100 bytes)
• Number of records per block bfr = ⌊B/R⌋ = ⌊1024/100⌋ = 10 records per block
• Number of blocks needed for the file b = ⌈r/bfr⌉ = ⌈30,000/10⌉ = 3,000
• A binary search on the data file would need approximately ⌈log2 b⌉ = ⌈log2 3,000⌉ = 12 block accesses
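The same arithmetic, checked in Python:

```python
import math

# Slide's example: 30,000 records, 1024-byte blocks, 100-byte records.
r, B, R = 30_000, 1024, 100

bfr = B // R                         # floor(1024/100) = 10 records per block
b = math.ceil(r / bfr)               # 3,000 file blocks
accesses = math.ceil(math.log2(b))   # binary search cost in block accesses
print(bfr, b, accesses)
```

A linear scan would read 3,000 blocks on average in the worst case; the binary search on the ordered file needs only about 12.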
              Multilevel Indexing

• The indexing schemes we have described thus far involve an ordered index file.
• A binary search is applied to the index to locate pointers to a disk block or to a record in the file having a specific index field value.
• A binary search requires approximately ⌈log2 bi⌉ block accesses for an index with bi blocks, because each step of the algorithm reduces the part of the index file that we continue to search by a factor of 2.
• The idea behind a multilevel index is to reduce the part of the index that we continue to search by bfri, the blocking factor for the index, which is larger than 2.
• Hence, the search space is reduced much faster.
• The value bfri is called the fan-out of the multilevel index, and we will refer to it by the symbol fo.
• A multilevel index considers the index file, which we will now refer to as the first level (base level) of the multilevel index, as an ordered file with a distinct value for each K(i).
• Therefore we can create a primary index for the first level.
• The index to the first level is called the second level of the multilevel index.
• Because the second level is a primary index, we can use block anchors, so that the second level has one entry for each block of the first level.
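The benefit of a larger fan-out can be quantified: with fo-way branching, the number of index levels needed is roughly ⌈log_fo(first-level entries)⌉. The fo and entry counts below are illustrative, not taken from the slides:

```python
import math

fo = 34            # assumed index fan-out (entries per index block)
entries = 3_000    # assumed number of first-level index entries

# Each level divides the remaining search space by fo instead of by 2.
levels = math.ceil(math.log(entries, fo))
print(levels)
```

Three levels (so roughly 3 block accesses plus one data block read) replace the ~12 block accesses that a binary search over the same 3,000 entries would cost.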
Tree Structure
Search Tree & B-Tree
• A search tree is a special type of tree that is used to guide the search for a record.
• A multilevel index can be thought of as a variation of a search tree; each node in the multilevel index can have as many as fo pointers and fo key values, where fo is the index fan-out.
• The index field values guide us to the next node, until we reach the data file block that contains the required records.
• By following a pointer, we restrict our search at each level to a subtree of the search tree and ignore all nodes not in this subtree.
                     Search Tree
• A search tree of order p is a tree such that each node contains at most p-1 search values and p pointers in the order <P1, K1, P2, K2, …, Pq-1, Kq-1, Pq>, where q ≤ p; each Pi is a pointer to a child node, and each Ki is a search value from some ordered set of values.
• All search values are assumed to be unique.
• Two constraints must hold at all times on the search tree:
• Constraints :-

1. Within each node, K1 < K2 < K3 < … < Kq-1.
2. For all values X in the subtree pointed at by Pi, we have:
              Ki-1 < X < Ki for 1 < i < q;
              X < Ki for i = 1;
              Ki-1 < X for i = q.
• Whenever we search for a value X, we follow the appropriate pointer Pi according to the formulas in constraint 2.
• Notice that some of the pointers Pi in a node may be NULL pointers.
• We can use a search tree as a mechanism to search for records stored in a disk file.
• The values in the tree can be the values of one of the fields of the file, called the search field.
• Each key value in the tree is associated with a pointer to the record in the data file having that value. Alternatively, the pointer could be to the disk block containing that record.
• When a new record is inserted, we must update the search tree by inserting an entry in the tree containing the search field value of the new record and a pointer to the new record.
• Algorithms are necessary for inserting values into the tree and deleting values from the tree while maintaining the preceding two constraints.
• In general, these algorithms do not guarantee that the search tree is balanced, meaning that all of its leaf nodes are at the same level.
• Keeping a search tree balanced is important because it yields a uniform search speed.
• A problem with search trees is that record deletion may leave some nodes in the tree nearly empty, thus wasting storage space and increasing the number of levels.
• The B-Tree addresses both of these problems by adding more constraints on the search tree.
Search Tree Node :-

    < P1, K1, … , Ki-1, Pi, Ki, … , Pq-1, Kq-1, Pq >

    Subtree pointed at by P1 : X < K1
    Subtree pointed at by Pi : Ki-1 < X < Ki
    Subtree pointed at by Pq : Kq-1 < X
                      B - Tree
• A B-Tree has additional constraints that ensure that the tree is always balanced and that the space wasted by deletion never becomes excessive.
• The algorithms for insertion and deletion become more complex in order to maintain these constraints.
• More formally, a B-Tree of order p, when used as an access structure on a key field to search for records in a data file, can be defined as follows :
1. Each internal node in the B-Tree is of the form:
   <P1, <K1, Pr1>, P2, <K2, Pr2>, …, <Kq-1, Prq-1>, Pq>
   where q ≤ p. Each Pi is a tree pointer, a pointer to another node in the B-tree.
   Each Pri is a data pointer, a pointer to the record whose search key value is equal to Ki.
2. Within each node, K1 < K2 < K3 < … < Kq-1.
3. For each search key field value X in the subtree pointed at by Pi, we have:
     Ki-1 < X < Ki for 1 < i < q;
     X < Ki for i = 1;
     Ki-1 < X for i = q.
4. Each node has at most p tree pointers.
5. Each node, except the root and leaf nodes, has at least ⌈p/2⌉ tree pointers. The root node has at least 2 tree pointers unless it is the only node in the tree.
6. A node with q tree pointers, q ≤ p, has q-1 search key field values and hence q-1 data pointers.
7. All leaf nodes are at the same level. Leaf nodes have the same structure as internal nodes except that all of their tree pointers Pi are NULL.
•   Notice that all the search values K in the B-tree are unique, because we assumed that the tree is used as an access structure on a key field.
•   If we use a B-tree on a nonkey field, we must change the definition of the data pointer Pri to point to a block that contains pointers to the file records.
•   A B-tree starts with a single root node at level 0.
•   Once the root node is full with p-1 search values and we attempt to insert another entry into the tree, the root node splits into two nodes at level 1.
•   Only the middle value is kept in the root node, and the rest of the values are split between the other two nodes.
•   When a nonroot node is full and a new entry is inserted into it, that node is split into two nodes at the same level, and the middle entry is moved to the parent node along with two pointers to the newly split nodes.
•   Search field V = 9 bytes, block size B = 512 bytes
•   Data pointer Pr = 7 bytes, block pointer P = 6 bytes
•   (p*P) + ((p-1) * (Pr + V)) ≤ B
•   (p*6) + ((p-1) * (7 + 9)) ≤ 512
•   6p + ((p-1) * 16) ≤ 512
•   6p + 16p – 16 ≤ 512
•   22p ≤ 528, so the largest order we can choose is p = 24.
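The inequality can be solved mechanically; a small Python sketch confirming the largest order that fits in one 512-byte block:

```python
# Slide's sizes: block B, search field V, data pointer Pr, block pointer P.
B, V, Pr, P = 512, 9, 7, 6

# A node of order p needs p block pointers and (p-1) <key, data-pointer>
# pairs: p*P + (p-1)*(Pr + V) bytes. Grow p while the next order still fits.
p = 2
while (p + 1) * P + p * (Pr + V) <= B:
    p += 1
print(p)
```

Order p = 24 uses exactly 24*6 + 23*16 = 512 bytes, so it fills the block with nothing to spare; order 25 would overflow it.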
                    B+ - Tree
• Most implementations of a dynamic multilevel index use a variation of the B-tree data structure called a B+-tree.
• In a B-tree, every value of the search field appears once at some level in the tree, along with a data pointer.
• In a B+-tree, data pointers are stored only at the leaf nodes; the structure of leaf nodes therefore differs from that of internal nodes.
• The leaf nodes have an entry for every value of the search field, along with a data pointer to the record (or to the block that contains the record) if the search field is a key field.
• For a nonkey search field, the pointer points to a block containing pointers to the data file records, creating an extra level of indirection.
• The leaf nodes of a B+-tree are usually linked to provide ordered access on the search field to the records.
• These leaf nodes are similar to the first level of an index.
• Internal nodes of the B+-tree correspond to the other levels of a multilevel index.
• Some search field values from the leaf nodes are repeated in the internal nodes of the B+-tree to guide the search.
• The structure of the internal nodes of a B+-tree of order p is :

1. Each internal node is of the form:
   <P1, K1, P2, K2, …, Pq-1, Kq-1, Pq>
   where q ≤ p and each Pi is a tree pointer.
2. Within each internal node, K1 < K2 < … < Kq-1.
3. For all search field values X in the subtree pointed at by Pi, we have:
              Ki-1 < X ≤ Ki for 1 < i < q;
              X ≤ Ki for i = 1;
              Ki-1 < X for i = q.
4. Each internal node has at most p tree pointers.
5. An internal node with q tree pointers, q ≤ p, has q – 1 search field values.
•   The structure of the leaf nodes of a B+-tree of order p is :

1. Each leaf node is of the form:
   <<K1, Pr1>, <K2, Pr2>, … , <Kq-1, Prq-1>, Pnext>
   where Pnext points to the next leaf node of the B+-tree.
2. Within each leaf node, K1 < K2 < … < Kq-1.
3. Each Pri is a data pointer that points to the record whose search field value is Ki, or to a file block containing the record.
4. Each leaf node has at least ⌈p/2⌉ values.
5. All leaf nodes are at the same level.
• The pointers in internal nodes are tree pointers to blocks that are tree nodes, whereas the pointers in leaf nodes are data pointers to the data file records or blocks, except for the Pnext pointer, which is a tree pointer to the next leaf node.
• By starting at the leftmost leaf node, it is possible to traverse the leaf nodes as a linked list, using the Pnext pointers.
• This provides ordered access to the data records on the indexing field.
• A Pprevious pointer can also be included.
Difference Between B-tree & B+-tree

• In a B-tree, data pointers appear at every level of the tree; in a B+-tree, data pointers are stored only at the leaf nodes.
• In a B-tree, each search value appears exactly once; in a B+-tree, some search values from the leaf nodes are repeated in the internal nodes to guide the search.
• The leaf nodes of a B+-tree are linked (via Pnext), providing ordered access to the records; B-tree nodes have no such links.
     Chapter - 5

Relational Algebra

              Introduction :-
• In this chapter we discuss the techniques used by a DBMS to process, optimize, and execute high-level queries.
• A query expressed in a high-level query language such as SQL must first be scanned, parsed and validated.
• The Scanner identifies the language tokens, such as SQL keywords, attribute names, and relation names, in the text of the query.
            Introduction (cont…)
• The Parser checks the query syntax to determine
  whether it is formulated according to the syntax
  rules of the query language.
• The query must be validated by checking that all attribute and relation names are valid and semantically meaningful names in the schema of the particular database being queried.
               Introduction (cont…)
• An internal representation of the query is then created,
  usually as a tree data structure called Query Tree.
• It is also possible to represent the query using a graph
  data structure called a Query Graph.
• The DBMS must then devise an execution strategy for
  retrieving the result of the query from the database files.
  A query typically has many possible execution strategies
  and the process for choosing a suitable one for
  processing a query is known as Query Optimization.
Translating SQL queries into Relational Algebra
• SQL is the query language that is used in most commercial RDBMSs.
• An SQL query is first translated into an equivalent extended relational algebra expression, represented as a query tree data structure, that is then optimized.
• Typically, SQL queries are decomposed into query blocks, which form the basic units that can be translated into algebraic operators and optimized.
• A Query Block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.

SELECT Lname, Fname
FROM   Employee
WHERE  Salary > ( SELECT MAX(Salary)
                  FROM   Employee
                  WHERE  Dno = 5 );

Inner Block (store its result in C) :-

( SELECT MAX(Salary)
  FROM   Employee
  WHERE  Dno = 5 )

C ← ℑ MAX Salary ( σ Dno = 5 ( Employee ) )

Outer Block (uses the result of the Inner Block from C) :-

SELECT Lname, Fname
FROM   Employee
WHERE  Salary > C

Result ← Л Lname,Fname ( σ Salary > C ( Employee ) )
• The Query Optimizer would then choose an execution plan for each block.

• We should note that in the previous example the inner block needs to be evaluated only once to produce the maximum salary, which is then used (as the constant C) by the outer block. We call this an Uncorrelated Nested Query, since no tuple variable from the outer block appears in the WHERE clause of the inner block.
        Algorithms for External Sorting
• Sorting is one of the primary algorithms used in query processing. For example, whenever an SQL query specifies an ORDER BY clause, the query result must be sorted.
• Sorting is also a key component in the sort-merge algorithms used for JOIN and other operations such as UNION and INTERSECTION.
• External Sorting refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory, such as most database files.
• The typical external sort algorithm uses a Sort-Merge strategy.
• The typical External Sort algorithm uses a Sort-Merge strategy.
• This algorithm starts by sorting small subfiles, called Runs, of the main file and then merges the sorted runs, creating larger sorted subfiles that are merged in turn.
• The sort-merge algorithm, like other database algorithms, requires buffer space in main memory, where the actual sorting and merging of runs is performed.
• The basic algorithm consists of two phases :
     1. The sorting phase and
     2. The merging phase.
                 1. The Sorting phase :-
• In the sorting phase, runs (portions) of the file that can fit in the available buffer space are read into main memory, sorted using an internal sorting algorithm, and written back to disk as temporary sorted subfiles (runs).
• The size of the runs and the number of initial runs (nR) are dictated by the number of file blocks (b) and the available buffer space (nB).

• For example, if nB = 5 blocks and the size of the file is b = 1024 blocks, then nR = ⌈b / nB⌉ = 205 initial runs.
                2. The Merging phase :-

• In the merging phase, the sorted runs are merged during
  one or more passes.
• The degree of merging (dM) is the number of runs
  that can be merged in each pass.
• In each pass, one buffer block is needed to hold one
  block from each of the runs being merged, and one
  block is needed for containing one block of the
  merge result.

• Hence, dM is the smaller of (nB – 1) and nR.
• Number of passes = ⌈ log dM (nR) ⌉.
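These cost formulas can be checked with a short Python helper; the function name is illustrative, not part of any library.

```python
import math

def external_sort_cost(b, nB):
    """Initial runs, degree of merging, and merge passes for a
    two-phase external sort of b file blocks with nB buffer blocks."""
    nR = math.ceil(b / nB)                 # runs produced by the sort phase
    dM = min(nB - 1, nR)                   # runs merged per pass
    passes = math.ceil(math.log(nR, dM))   # number of merge passes
    return nR, dM, passes

# The worked example from the text: nB = 5, b = 1024.
print(external_sort_cost(1024, 5))
```

With 5 buffer blocks and a 1024-block file this gives 205 initial runs, a merge degree of 4, and 4 merge passes.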
                                       Algorithm
set i ← 1;
    j ← b;   // size of the file in blocks.
    k ← nB;  // size of the buffer in blocks.
    m ← ⌈ j/k ⌉;

(Sort Phase)

while ( i ≤ m )
do { read next k blocks of the file into the buffer or, if there are
     fewer than k blocks remaining, read in the remaining blocks;
     sort the records in the buffer and write as a temporary sub file;
     i ← i + 1;
   }

(Merge Phase)
set i ← 1;
    p ← ⌈ log k-1 m ⌉ ; // p is the number of passes.
    j ← m;
while ( i ≤ p )
do { n ← 1;
     q ← ⌈ j/(k-1) ⌉ ; // number of sub files to write in this pass.
     while ( n ≤ q )
     do {
          read next k-1 sub files or remaining sub files ( from
          previous pass ) one block at a time ;
          merge and write as a new sub file one block at a time;
          n ← n + 1;
        }
     j ← q;
     i ← i + 1;
   }
           Algorithms for SELECT operations :
1. Search Methods for Simple SELECTION :-

•   A number of search algorithms are possible for
    selecting records from a file.
•   These are known as File Scans, because they scan
    the records of the file to retrieve records that
    satisfy a selection condition.
•   If the search algorithm involves the use of an index, the
    index search is called an Index Scan.
•   The following search methods (S1 through S6) are
    examples of some of the search algorithms that can
    be used to implement a SELECT operation.
S1:- Linear Search.
    Retrieve every record in the file, and test whether its
    attribute values satisfy the selection condition or not.

S2:- Binary Search.
    If the selection condition involves an equality
    comparison on a key attribute on which the file is ordered,
    binary search is more efficient than linear search.
E.g.:- SSN is the ordering attribute of EMPLOYEE…
     σ SSN = ‘12345’ (EMPLOYEE)
S3:- Using a Primary Index.
    If the selection condition involves an equality comparison on
    a key attribute with a primary index, for example SSN =
    ‘12345’ in the previous example, use the primary index to retrieve
    the record. This condition retrieves ONLY A SINGLE RECORD.

S4:-Using a primary index to retrieve multiple records.
     If the comparison condition is >, <, ≥ or ≤ on a key field with a
     primary index, for example DNO < 5 in the following example, use
     the primary index to retrieve the records satisfying the condition:
     DNO < 5 retrieves all the records preceding the first record
     with DNO = 5.
E.g.:- DNO is the primary attribute of EMPLOYEE; the query uses
       the index…
      σ DNO < 5 (EMPLOYEE)
S5:-Using clustering index to retrieve multiple records.
    If the selection condition involves an equality comparison
    on a nonkey attribute with a clustering index, for
    example Mark2 = 85, use the index to retrieve all the
    records that satisfy the selection condition.
    σ Mark2 = 85 (MARKS)

S6:-Using a secondary (B+ tree) index on an equality comparison.
    This search method can be used to retrieve a single record
    if the indexing field is a key (has unique values) or to retrieve
    multiple records if the indexing field is not a key. It can
    also be used with comparisons involving <, >, ≤ or ≥.
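As a rough illustration of S1 versus S2, the sketch below contrasts a linear scan with a binary search (Python's bisect) on a file ordered by its key attribute; the EMPLOYEE data is hypothetical.

```python
import bisect

# Hypothetical EMPLOYEE file, physically ordered on the key attribute SSN.
employees = [(ssn, f"emp{ssn}") for ssn in range(0, 100, 5)]
ssns = [ssn for ssn, _ in employees]

# S1: linear search -- test every record against the condition SSN = 45.
linear = [rec for rec in employees if rec[0] == 45]

# S2: binary search on the ordering key attribute.
i = bisect.bisect_left(ssns, 45)
binary = employees[i] if i < len(ssns) and ssns[i] == 45 else None

print(linear, binary)
```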
2. Search Methods for Complex SELECTION :-

•   The condition of a SELECT operation may be a conjunctive condition,
    that is, one made up of several simple conditions connected
    with the AND logical connective.

S7:-Conjunctive selection using an individual index.

    If an attribute involved in any single simple condition in the
    conjunctive condition has an access path that permits the use of
    one of the methods S2 to S6, use that condition to retrieve the
    records and then check whether each retrieved record satisfies
    the remaining simple conditions in the conjunctive condition.
S8:-Conjunctive selection using a composite index.

   If two or more attributes are involved in equality
   conditions in the conjunctive condition and a
   composite index exists on the combined fields, for
   example if an index has been created on the
   composite key (Essn, Pno) of the WORKS_ON file,
   we can use the index directly.
    σ Essn = ‘1234’ AND Pno = 10 (WORKS_ON)
S9:-Conjunctive selection by intersection of record pointers.

   If secondary indexes are available on more than one of
   the fields involved in simple conditions in the
   conjunctive condition, and if the indexes include record
   pointers, then each index can be used to retrieve the
   set of record pointers that satisfy the individual
   condition. The intersection of these sets of record
   pointers gives the record pointers that satisfy the
   conjunctive condition, which are then used to retrieve
   those records directly.
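A minimal sketch of S9, with Python dicts standing in for two secondary indexes that map attribute values to sets of record pointers; all names and values are hypothetical.

```python
# S9 sketch: two secondary indexes map attribute values to sets of
# record pointers (record ids here); the data below is hypothetical.
salary_index = {30000: {1, 4, 7}, 40000: {2, 5}}
dno_index    = {5: {1, 2, 7}, 4: {4, 5}}

# Conjunctive condition: Salary = 30000 AND Dno = 5.
pointers = salary_index[30000] & dno_index[5]   # set intersection
print(sorted(pointers))                         # pointers satisfying both
```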
          Algorithms for JOIN operations :

• The JOIN operation is one of the most time
  consuming operations in query processing.
• A JOIN involving only two files is called a two-way
  join, and a join in which more than two files
  are involved is known as a multiway join.
• The algorithms we consider are for join
  operations of the form : R ⋈ A=B S : where A & B
  are domain-compatible attributes of R and S.

 Methods for Implementing JOIN Operation :

   J1 - Nested-loop Join :-

    For each record t in R (outer loop), retrieve every record s in S
    (inner loop) and test whether the two records satisfy the join
    condition t[A] = s[B].

   J2 – Single-loop Join :-

    If an index exists for one of the two join attributes, say B of S,
    retrieve each record t in R, one at a time (single loop), and then
    use the access structure to retrieve directly all matching records s
    from S that satisfy s[B] = t[A].
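A sketch of J1 and J2 over small in-memory lists (hypothetical data, join attribute in the first column); both methods produce the same join result, but J2 avoids re-scanning S for every tuple of R.

```python
# J1 nested-loop join vs. J2 single-loop (index-based) join.
R = [(1, "a"), (2, "b"), (2, "c")]          # join attribute A = first field
S = [(2, "x"), (3, "y"), (2, "z")]          # join attribute B = first field

# J1: for each tuple of R, scan all of S (outer and inner loop).
nested = [(r, s) for r in R for s in S if r[0] == s[0]]

# J2: build an index on S.B once, then a single loop over R.
index_on_B = {}
for s in S:
    index_on_B.setdefault(s[0], []).append(s)
single = [(r, s) for r in R for s in index_on_B.get(r[0], [])]

print(len(nested), len(single))
```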
 J3 – Sort-merge Join :-

    If the records of R and S are physically sorted by
    the value of the join attributes A and B, we can
    implement the join in the most efficient way.
    Both files are scanned concurrently in order of
    the join attributes, matching the records that
    have the same values for A and B. If the files are
    not sorted, they must be sorted first by using
    external sorting.
 J4 – Hash-Join :-

•   The records of files R and S are both hashed to the same hash
    file, using the same hashing function on the join attributes A of
    R and B of S as hash keys.
•   First, a single pass through the file (R) with fewer records hashes
    its records to the hash file buckets; this is called the Partitioning
    Phase, since the records of R are partitioned into hash buckets.
•   In the second phase, called the Probing Phase, a single pass
    through the other file (S) hashes each of its records to
    probe the appropriate bucket, and that record is combined
    with all matching records from R in that bucket.
•   This simplified description of hash-join assumes that the smaller
    of the two files fits entirely into main-memory buckets after the
    first phase.
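The two phases of J4 can be sketched with a Python dict standing in for the in-memory hash buckets; relation layout and data are hypothetical.

```python
# J4 hash-join sketch: partition the smaller file R into buckets keyed by
# the join attribute, then probe with each record of S.
R = [(1, "r1"), (2, "r2"), (2, "r3")]   # join attribute A = first field
S = [(2, "s1"), (4, "s2"), (1, "s3")]   # join attribute B = first field

# Partitioning phase: one pass over R fills the hash buckets.
buckets = {}
for r in R:
    buckets.setdefault(r[0], []).append(r)

# Probing phase: one pass over S probes the matching bucket.
result = [(r, s) for s in S for r in buckets.get(s[0], [])]
print(result)
```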
 Algorithm for Implementing T ← R ⋈ A=B S
Sort the tuples in R on A (* assume R has n tuples *)
Sort the tuples in S on B (* assume S has m tuples *)
set I ← 1, J ← 1;
while (I ≤ n) and (J ≤ m)
do { if R(I)[A] > S(J)[B]
         then set J ← J + 1;
      elseif R(I)[A] < S(J)[B]
         then set I ← I + 1;
      else { (* R(I)[A] = S(J)[B], so we output a matched tuple *)
             output the combined tuple <R(I), S(J)> to T;

             (* Output other tuples that match R(I), if any *)
             set L ← J + 1;
             while (L ≤ m) and (R(I)[A] = S(L)[B])
             do { output the combined tuple <R(I), S(L)> to T;
                  set L ← L + 1;
                }

             (* Output other tuples that match S(J), if any *)
             set K ← I + 1;
             while (K ≤ n) and (R(K)[A] = S(J)[B])
             do { output the combined tuple <R(K), S(J)> to T;
                  set K ← K + 1;
                }
             set I ← K, J ← L;
           }
    }
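A runnable Python version of the sort-merge join is sketched below. Unlike the simplified pseudocode, it also pairs every duplicate join value on both sides; the fixed-column relation layout is an assumption for illustration.

```python
def sort_merge_join(R, S, a=0, b=0):
    """Sort-merge join of lists of tuples R and S on columns a and b."""
    R = sorted(R, key=lambda t: t[a])
    S = sorted(S, key=lambda t: t[b])
    i = j = 0
    out = []
    while i < len(R) and j < len(S):
        if R[i][a] > S[j][b]:
            j += 1
        elif R[i][a] < S[j][b]:
            i += 1
        else:
            # Pair every R tuple with every S tuple sharing this join value.
            l = j
            while l < len(S) and R[i][a] == S[l][b]:
                out.append(R[i] + S[l])
                l += 1
            k = i + 1
            while k < len(R) and R[k][a] == S[j][b]:
                for l2 in range(j, l):
                    out.append(R[k] + S[l2])
                k += 1
            i, j = k, l
    return out

print(sort_merge_join([(1, "a"), (2, "b"), (2, "c")], [(2, "x"), (2, "y")]))
```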
            Algorithms for PROJECT Operation :
• A PROJECT Operation π <attribute list> (R) is straightforward to
  implement if <attribute list> includes a key of relation R, because
  in this case the result of the operation will have the same number
  of tuples as R, but only the values for the attributes in <attribute
  list> in each tuple.
• If <attribute list> does not include a key of R, duplicate records
  must be eliminated. This is usually done by sorting the result of
  the operation and then eliminating duplicate tuples.
• Hashing can also be used to eliminate duplicates: as each
  record is hashed and inserted into a bucket of the hash file in
  memory, it is checked against those already in the bucket; if it is
  a duplicate, it is not inserted.
    Algorithm for Implementing T ← π <attribute list> (R)

create a tuple t[<attribute list>] in T’ for each tuple t in R;
(*T’ contains the projection result before duplicate elimination*)
if <attribute list> includes a key of R
      then T ← T’;
else { sort the tuples in T’;
        set I ← 1, J ← 2;
        while I ≤ n
         do { output the tuple T’[I] to T;
              while T’[I] = T’[J] and J ≤ n do J ← J + 1; (*eliminate duplicates*)
              I ← J; J ← I + 1;
            }
     }
(*T contains the projection result after duplicate elimination*)
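The same projection logic can be sketched in Python, using hashing (a Python set) for duplicate elimination; the relation data and column positions are illustrative.

```python
def project(R, cols, key_included=False):
    """π cols (R): keep only the listed columns; eliminate duplicates
    unless the projected columns include a key of R."""
    T_prime = [tuple(t[c] for c in cols) for t in R]
    if key_included:
        return T_prime          # same number of tuples as R
    return sorted(set(T_prime)) # hashing (set) eliminates duplicates

R = [(1, "IT", 5), (2, "HR", 4), (3, "IT", 5)]
print(project(R, [1, 2]))       # project on columns 1 and 2 (non-key)
```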
               Algorithms for SET Operations :
  SET operations (UNION, INTERSECTION, SET DIFFERENCE and
  CARTESIAN PRODUCT) are sometimes expensive to implement.
• In particular, the CARTESIAN PRODUCT operation R X S is quite
  expensive because its result includes a record for each
  combination of records from R and S.
• Additionally, the attributes of the result include all attributes of R
  and S.
• If R has N records and J attributes, and
• S has M records and K attributes, then
• the result relation will have N * M records and J + K attributes.
• Hence, implementing the CARTESIAN PRODUCT is avoided whenever
  possible.
   The other three set operations, UNION, INTERSECTION and SET
    DIFFERENCE, apply only to union-compatible relations, which have
    the same number of attributes and the same attribute domains.
   The way to implement these operations is to use variations of the
    sort-merge technique: the two relations are sorted on the same
    attributes, and a single concurrent scan through both produces
    the result.
   We can implement the UNION operation, R ∪ S, by scanning and
    merging both sorted files, and whenever the same tuple exists in
    both relations, only one is kept in the merged result.
   For the INTERSECTION operation, R ∩ S, we keep in the merged
    result only those tuples that appear in both relations.
    Algorithm for Implementing T ← R ∪ S

Sort the tuples in R and S using the same unique sort attributes;
set I ← 1, J ← 1;
while (I ≤ n) and (J ≤ m)
do { if R(I) > S(J)
         then { output S(J) to T;
                set J ← J + 1;
              }
     elseif R(I) < S(J)
         then { output R(I) to T;
                set I ← I + 1;
              }
     else set J ← J + 1; (*R(I) = S(J) so we skip one of the duplicate tuples*)
   }
if (I ≤ n) then add tuples R(I) to R(n) to T;
if (J ≤ m) then add tuples S(J) to S(m) to T;
    Algorithm for Implementing T ← R ∩ S

Sort the tuples in R and S using the same unique sort attributes;
set I ← 1, J ← 1;
while (I ≤ n) and (J ≤ m)
do { if R(I) > S(J)
        then set J ← J + 1;
     elseif R(I) < S(J)
        then set I ← I + 1;
     else { output R(I) to T; (*R(I) = S(J) so we output the tuple*)
            set I ← I + 1, J ← J + 1;
          }
   }
    Algorithm for Implementing T ← R – S
Sort the tuples in R and S using the same unique sort attributes;
set I ← 1, J ← 1;
while (I ≤ n) and (J ≤ m)
do { if R(I) > S(J)
        then set J ← J + 1;
     elseif R(I) < S(J)
        then { output R(I) to T; (*R(I) has no matching S(J) so we
                                   output the tuple*)
               set I ← I + 1;
             }
     else set I ← I + 1, J ← J + 1;
   }
if (I ≤ n) then add tuples R(I) to R(n) to T;
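All three sorted-merge algorithms above can be folded into one Python sketch; like the pseudocode, it assumes duplicate-free sorted inputs (enforced here with set()).

```python
def merge_set_op(R, S, op):
    """UNION / INTERSECTION / DIFFERENCE of two relations, computed
    with a single concurrent scan over sorted, duplicate-free inputs."""
    R, S = sorted(set(R)), sorted(set(S))
    i = j = 0
    out = []
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            if op in ("union", "difference"):
                out.append(R[i])
            i += 1
        elif R[i] > S[j]:
            if op == "union":
                out.append(S[j])
            j += 1
        else:                       # R[i] == S[j]
            if op in ("union", "intersection"):
                out.append(R[i])
            i += 1
            j += 1
    if op in ("union", "difference"):
        out += R[i:]                # leftover R tuples
    if op == "union":
        out += S[j:]                # leftover S tuples
    return out

R, S = [1, 3, 5, 7], [3, 4, 7, 9]
print(merge_set_op(R, S, "union"))
print(merge_set_op(R, S, "intersection"))
print(merge_set_op(R, S, "difference"))
```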
          Implement Aggregate Operations :
• The aggregate operations (MIN, MAX, COUNT, SUM),
  when applied to an entire table, can be computed by
  a table scan or by using an appropriate index, if
  available. For example :
  SELECT MAX(Salary) FROM Employee;
• If an (ascending) index on Salary exists for the Employee
  relation, then the optimizer can decide to use the
  index to search for the largest value by following the
  rightmost pointer in each index node from the root to
  the rightmost leaf.
• That node would include the largest salary value as
  its last entry.
   When a GROUP BY clause is used in a query, the aggregate
    operation must be applied separately to each group of tuples.
   Hence, the table must first be partitioned into subsets
    of tuples, where each partition has the same value for the
    grouping attributes. In this case the computation is more complex.
    SELECT Dno, MAX(Salary) FROM Employee
    GROUP BY Dno;
   Such a query first uses either sorting or hashing on the
    grouping attributes to partition the file into the appropriate groups.
   Then the algorithm computes the aggregate operation for
    the tuples in each group, which have the same grouping
    attribute(s) VALUE.
   In the example query, the set of tuples for each
    department number would be grouped together in a
    partition and the maximum salary computed for each group.
   Notice that if a clustering index exists on the grouping
    attributes, then the records are already partitioned into the
    appropriate subsets. In this case it is only necessary to apply the
    computation to each group.
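The sort-then-aggregate strategy for the GROUP BY query above can be sketched with itertools.groupby over data sorted on the grouping attribute; the rows are hypothetical.

```python
from itertools import groupby

# Hypothetical (Dno, Salary) pairs for: SELECT Dno, MAX(Salary) ... GROUP BY Dno
employees = [(5, 30000), (4, 25000), (5, 40000), (4, 43000)]

rows = sorted(employees)                      # partition by sorting on Dno
result = {dno: max(sal for _, sal in grp)     # aggregate within each group
          for dno, grp in groupby(rows, key=lambda t: t[0])}
print(result)
```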
                 Implement Outer Join :
• The Outer Join Operation is introduced with three different
  variations :
   – Left Outer Join.
   – Right Outer Join.
   – Full Outer Join.
   SELECT Lname, Fname, Dname FROM
   (Employee LEFT OUTER JOIN Department ON Dno = Dnumber) ;
• The result of this query is a table of employee names and
  their associated departments. It is similar to the regular
  (inner) join result, with the exception that if an employee
  tuple does not have an associated department, the
  employee’s name will still appear in the resulting table, but the
  department name would be NULL for such a tuple in the
  query result.
   The outer join can be computed by modifying one of the join
    algorithms, such as nested-loop join or single-loop join.
   Alternatively, the outer join can be computed by executing a
    combination of relational algebra operators.

   EXAMPLE :-

1. Compute the inner join of the EMPLOYEE and DEPARTMENT relations.

TEMP1 ← π Lname,Fname,Dname (EMPLOYEE ⋈ Dno=Dnumber DEPARTMENT)

2. Find the EMPLOYEE tuples that do not appear in the inner join result.

TEMP2 ← π Lname,Fname (EMPLOYEE) - π Lname,Fname (TEMP1)

3. Pad each tuple in TEMP2 with a NULL Dname field.

4. Apply the UNION operation to TEMP1 and TEMP2 to produce the
   LEFT OUTER JOIN result.

RESULT ← TEMP1 ∪ TEMP2

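The four relational-algebra steps for the left outer join can be traced in Python, with None standing in for NULL; the sample rows are hypothetical.

```python
# LEFT OUTER JOIN of EMPLOYEE with DEPARTMENT, built from an inner join.
employees   = [("Shah", "Amit", 5), ("Patel", "Rina", None)]  # (Lname, Fname, Dno)
departments = [(5, "Research")]                               # (Dnumber, Dname)

# Step 1: inner join.
temp1 = [(ln, fn, dname) for ln, fn, dno in employees
         for dnum, dname in departments if dno == dnum]
# Step 2: EMPLOYEE tuples missing from the inner join result.
joined_names = {(ln, fn) for ln, fn, _ in temp1}
temp2 = [(ln, fn) for ln, fn, _ in employees if (ln, fn) not in joined_names]
# Step 3: pad with NULL (None) for Dname.
temp2 = [(ln, fn, None) for ln, fn in temp2]
# Step 4: union gives the LEFT OUTER JOIN result.
result = temp1 + temp2
print(result)
```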
         Using Heuristic in Query Optimization :-
• In this section we discuss query optimization
  techniques that apply heuristic rules to modify the internal
  representation of a query (which is usually in the form of a
  query tree or a query graph data structure) to improve its
  expected performance.
• The parser of a high-level query first generates an initial internal
  representation, which is then optimized according to the
  heuristic rules.
• Following that, a query execution plan is generated to
  execute groups of operations based on the access paths
  available on the files involved in the query.
 One of the main heuristic rules is to apply SELECT and
  PROJECT operations before applying the JOIN or other
  binary operations.
 The SELECT and PROJECT operations reduce the size of a
  file and hence should be applied before a join or other
  binary operation.
 A query tree is used to represent a relational algebra or
  extended relational algebra expression.
 A query graph is used to represent a relational calculus
  expression.
 Notations for Query Tree and Query Graph :-

 A query tree is a tree data structure that corresponds to a
  relational algebra expression. It represents the input relations of the
  query as leaf nodes of the tree, and represents the relational
  algebra operations as internal (parent) nodes above those leaf nodes.
 The execution terminates when the ROOT node is executed.

SELECT P.Pnumber, P.Dnum, E.Lname, E.Address, E.Bdate
FROM PROJECT P, DEPARTMENT D, EMPLOYEE E
WHERE P.Dnum = D.Dnumber AND D.Mgr_SSN = E.SSN
      AND P.Plocation = ’Rajkot’;

π Pnumber,Dnum,Lname,Address,Bdate
( ( (σ Plocation=‘Rajkot’ (PROJECT) ) ⋈ Dnum=Dnumber (DEPARTMENT) )
                                      ⋈ Mgr_SSN=SSN (EMPLOYEE) )
 The three relations PROJECT, DEPARTMENT, and EMPLOYEE are
  represented by leaf nodes P, D, and E.
• A query tree represents a specific order of operations for
  executing a query.
• A more natural representation of a query is the query graph notation.
• Relations (P, D, E) in the query graph are represented by relation
  nodes, which are displayed as single circles.
• Constant values, typically from the query selection conditions, are
  represented by constant nodes, which are displayed as double
  circles or double ovals.
• Selection and join conditions are represented by the graph edges.
   Query Graph Representation for the given SQL Query :-

[P.Pnumber, P.Dnum]                                   [E.Lname, E.Address, E.Bdate]

        P ---- P.Dnum=D.Dnumber ---- D ---- D.Mgr_SSN=E.SSN ---- E
        |
        P.Plocation=‘Rajkot’
        |
    ((‘Rajkot’))

(P, D and E are relation nodes shown as single circles; ‘Rajkot’ is a
constant node shown as a double circle; the labeled lines are the graph
edges for the selection and join conditions.)
   Initial (canonical) Query Tree for the given SQL Query :-

                   π P.Pnumber, P.Dnum, E.Lname, E.Address, E.Bdate
                                      |
        σ P.Dnum=D.Dnumber AND D.Mgr_SSN=E.SSN AND P.Plocation=‘Rajkot’
                                      |
                                      X
                                     / \
                                    X   E
                                   / \
                                  P   D
 Query Tree for the given SQL Query :-

                 π P.Pnumber, P.Dnum, E.Lname, E.Address, E.Bdate
                                   |
                          ⋈ D.Mgr_SSN = E.SSN
                           /               \
              ⋈ P.Dnum = D.Dnumber          E   EMPLOYEE
                 /            \
 σ P.Plocation=‘Rajkot’        D   DEPARTMENT
         |
         P   PROJECT
    General Transformation Rules for Relational Algebra Operations :-

1. Cascade of σ :-

σ c1 AND c2 AND … AND cn (R) = σ c1 (σ c2 (…(σ cn (R))…))

2. Commutativity of σ :-

σ C1 (σ C2 (R)) = σ C2 (σ C1 (R))

3. Cascade of π :-

π List1 (π List2 (…(π List n (R))…)) = π List1 (R)
   (provided List1 is contained in List2, …, List n)

4. Commuting σ with π :-

π A1,A2,…,An (σ C (R)) = σ C (π A1,A2,…,An (R))

5. Commutativity of ⋈ and x :-

 R ⋈c S = S ⋈c R
 R x S = S x R
6. Commuting σ with ⋈ or x :-

    If all attributes in selection condition c involve only attributes of
    one of the relation schemas (say R), then :

σ c (R ⋈ S) = (σ c (R)) ⋈ S

    Alternatively, if selection condition c can be written as (c1 AND c2),
    where condition c1 involves only attributes of R and condition c2
    involves only attributes of S, then :

σ c (R ⋈ S) = (σ c1 (R)) ⋈ (σ c2 (S))
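Rule 6 (pushing a selection below a join) can be spot-checked on tiny relations; column positions and data are illustrative.

```python
# Verify σ C>15 (R ⋈ S) = (σ C>15 (R)) ⋈ S on hypothetical relations.
R = [(1, 10), (2, 20), (3, 30)]   # schema (A, C)
S = [(1, "x"), (3, "y")]          # schema (A, D)

def join(R, S):
    """Equi-join on the first column of each relation."""
    return [r + s[1:] for r in R for s in S if r[0] == s[0]]

# Left-hand side: join first, then select on C > 15.
lhs = [t for t in join(R, S) if t[1] > 15]
# Right-hand side: selection pushed below the join.
rhs = join([r for r in R if r[1] > 15], S)
print(lhs == rhs, lhs)
```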
7. Commuting π with ⋈ or x :-

 Let the projection list be L = {A1, A2, …, An, B1, B2, …, Bm},
 where A1…An are attributes of R and B1…Bm are attributes of S.
 If the join condition C involves only attributes in L, then :

π L ( R ⋈c S ) = ( π A1,…,An (R) ) ⋈c ( π B1,…,Bm (S) )

8. Commutativity of SET Operations :-

 R ∪ S = S ∪ R and R ∩ S = S ∩ R, but
 MINUS (R - S) is not commutative.
9. Associativity of ⋈, x, ⋂, and ⋃ :-

 If ∅ stands for any one of these operations throughout the
  expression, then :
(R ∅ S) ∅ T = R ∅ (S ∅ T)

10. Commutativity of σ with SET Operations :-

 If ∅ stands for any one of the three operations (⋃, ⋂, and -),
     then :
σ c (R ∅ S) = (σ c (R)) ∅ (σ c (S))

11. The π operation commutes with ⋃ :-

π L (R ⋃ S) = (π L (R)) ⋃ (π L (S))

12. Converting a (σ, x) sequence into ⋈ :-

(σ c (R x S)) = (R ⋈c S)
                 INTRODUCTION :-
• So far, we have considered only the representation of the
  relations themselves.
• A relational database system needs to maintain data
  about the relations, such as the schema of the relations.
• This information is called a DATA DICTIONARY or SYSTEM
  CATALOG.
• In database management systems, the data dictionary is a file
  that defines the basic organization of a database. A data dictionary
  contains a list of all files in the database, the number of
  records in each file, and the names and types of each
  field.
 It is a database about data and database structures: a catalog
  of all data elements, containing their names, structures,
  and information about their usage. A central location for
  metadata.
 Among the types of information that the system must
  store are these :
1. Names of the relations.
2. Names of the attributes of each relation.
3. Domains and lengths of attributes.
4. Names of views defined on the database, and definitions of
    those views.
5. Integrity constraints (key constraints).
6. Names of the authorized users.
7. Authorization and accounting information.
8. Passwords or other information used to
   authenticate users.
9. Number of tuples in each relation.
10.Method of storage for each relation.
11.Name of the index.
12.Name of the relation being indexed.
13.Attributes on which the index is defined.
14.Type of index formed.
Defining metadata.
The term metadata is often defined as data about data:
that is, data that provides information about the tables,
views, constraints, stored procedures, etc. stored within the
database. If we take a table as an example, the dictionary
will store information such as its name, as well as :
• When it was created and when it was last accessed
• The names and data types of its attributes (i.e. structural
  information)
• Its owner, and who may read and write to it (i.e. security
  information)
• Where the data is stored (i.e. physical information)
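As a concrete example, SQLite exposes its catalog as the sqlite_master table, and column metadata through PRAGMA table_info; this sketch creates a table in an in-memory database and reads both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Ssn TEXT PRIMARY KEY, Salary REAL)")

# Names of the relations, from the system catalog:
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Names and data types of the attributes of a relation:
columns = [(r[1], r[2]) for r in conn.execute("PRAGMA table_info(Employee)")]
print(tables, columns)
```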
Database catalog :-

The database catalog of a database
instance consists of metadata in which
definitions of database objects such as
base tables, view tables, synonyms,
value ranges, indexes, users, and user
groups are stored.
Thank you...
