IDIS Database Design and Implementation

    Rationale
This document presents how the Integrated Database Information System (IDIS)
database was designed and how it is being implemented. The first part
introduces the design process, its principles and best practices. The second part
presents how these guidelines were applied in designing and implementing the
IDIS database.

  Database design principles

    Introduction
 This section presents the terms, concepts, approach, standards and best
 practices associated with the design of a database. It is important to bear these
 principles in mind when reading the next section which presents the design of
 the IDIS database. The database design phase is a very important step for all
 IT projects developing systems that rely on a database to adequately store,
 query, import and export data and support reporting. For such systems the
 operation of the database is critical; hence its design and implementation
 must be long-lasting, robust and closely tailored to the requirements of the
 system.

    Definition of terms and concepts used

      Database
   A database is a collection of data that is organized in a systematic way so
   that its contents can easily be accessed, managed and updated. The most
   prevalent type of database is the relational database, a tabular database in
   which data is defined so that it can be reorganized and accessed in a number
   of different ways. A distributed database is one that can be dispersed or
   replicated among different points in a network. The software used to manage
   and query a database is known as a database management system (DBMS).

     Database Management System
   A Database Management System is a software environment that structures
   and manipulates data, and ensures data security, recovery, and integrity. The
   Data Platform relies on a relational database management system (RDBMS)
   to store and maintain all of its data as well as execute all the associated
   queries. There are two types of RDBMS: the first group consists of single
   software packages which support only a single database, with single-user
   access, and are not scalable (i.e. cannot handle large amounts of data).
   Typical examples of this first group are MS Access and FileMaker.
   The second group is formed by DBMS composed of one or more programs
   and their associated services which support one or many databases for one
   or many users in a scalable fashion. For example an enterprise database
server can support the HR database, the accounting database and the stocks
database all at the same time. Typical examples of this second group include
MySQL, MS SQL Server, Oracle and DB2. The DBMS selected for the Data
Platform is MS SQL Server from the second group.

   Table
A table is a set of data elements that has a horizontal dimension (rows) and a
vertical dimension (columns) in a relational database system. A table has a
specified number of columns but can have any number of rows. Rows stored
in a table are structurally equivalent to records from flat files. Columns are
often referred to as attributes or fields. In a database managed by a DBMS the
format of each attribute is a fixed datatype. For example an attribute of the
datetime datatype can only contain date and time values.

    Identifier
 An identifier is an attribute that is used either as a primary key or as a
 foreign key. The integer datatype is used for identifiers. In cases where the
 number of records exceeds the range allowed by the integer datatype, a
 bigint datatype is used instead.

    Primary key
 A column in a table whose values uniquely identify the rows in the table. A
 primary key value cannot be NULL.

    Foreign key
 A column in a table that does not uniquely identify rows in that table, but is
 used as a link to matching columns in other tables.

   Relationship
A relationship is an association between two tables. For example the
relationship between the tables "hotel" and "customer" maps the customers to
the hotels they have used.
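
Building on the hypothetical customer table sketched earlier, the example
below shows primary keys, foreign keys and a many-to-many relationship
between hotels and customers (all names are illustrative assumptions):

    CREATE TABLE hotel (
        hotel_id INT          NOT NULL PRIMARY KEY,  -- primary key
        name     VARCHAR(100) NOT NULL
    );
    -- The stay table links hotels and customers; each of its key columns
    -- is a foreign key into the matching table.
    CREATE TABLE stay (
        hotel_id    INT      NOT NULL REFERENCES hotel (hotel_id),
        customer_id INT      NOT NULL REFERENCES customer (customer_id),
        stay_date   DATETIME NOT NULL,
        PRIMARY KEY (hotel_id, customer_id, stay_date)
    );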

   Index
An index is a data structure which enables a query to run in sublinear time.
Instead of having to go through all records one by one to identify those which
match its criteria, the query uses the index to filter out those which don't and
focus on those which do.
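
For example, an index on the hypothetical customer table sketched earlier
could speed up lookups by name:

    -- Queries filtering on full_name can use the index instead of
    -- scanning every row of the table.
    CREATE INDEX ix_customer_full_name ON customer (full_name);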

    View
A view is a virtual or logical table composed of the result set of a pre-compiled
query. Unlike ordinary tables in a relational database, a view is not part of the
physical schema: it is a dynamic, virtual table computed or collated from data
in the database. Changing the data through a view, where the DBMS permits
it, alters the data stored in the underlying tables.
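
A minimal sketch, reusing the hypothetical customer table:

    -- Virtual table exposing only the identifier and name columns.
    CREATE VIEW customer_names AS
        SELECT customer_id, full_name
        FROM customer;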

   Query
A query is a request to retrieve data from a database with the SQL SELECT
instruction or to manipulate data stored in tables.
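
For example, against the hypothetical customer table:

    -- Retrieve the identifier and name of every customer whose
    -- name starts with 'A'.
    SELECT customer_id, full_name
    FROM customer
    WHERE full_name LIKE 'A%';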

SQL
Structured Query Language (SQL), often pronounced "sequel", is a language that
provides an interface to relational database systems. It was developed by
IBM in the 1970s for use in System R. SQL is a de facto standard, as well as
an ISO and ANSI standard.

    Approach

    Introduction
The most difficult part of the database design activity is to propose a design
approach that efficiently accommodates storage and querying of data without
generating high costs or compromising database integrity. Two database
design approaches are used for this purpose. The first approach is called
"Entity-Relationship" and the second approach is called "dimensional model".
Both these approaches as well as their respective advantages and
shortcomings are presented in the next sections. A separate section
introduces the metamodeling approach and the last section discusses all
these approaches.

   ER
The Entity-Relationship (ER) approach is based on the ER model originally
proposed by Peter Chen in 1976 as a way to unify the network and relational
database views. Simply stated, the ER model is a conceptual data model that
views the real world as entities and relationships. A basic component of the
model is the Entity-Relationship diagram which is used to visually represent
data objects. Since Chen wrote his paper the model has been extended and
today it is commonly used for database design. For database specialists, the
utility of the ER model is:
•   It maps well to the relational model. The constructs used in the ER model
    can easily be transformed into relational tables.
•   It is simple and easy to understand with a minimum of training. Therefore,
    the model can be used by the database team to communicate the design
    to the end user.
•   In addition, the model can be used as a design plan by the database
    developer to implement a data model in specific database management
    software.
ER is a logical design technique that seeks to remove the redundancy in data.
The goal of this discipline is to illuminate the microscopic relationships among
data elements. The highest art form of ER modeling is to remove all
redundancy in the data. This is immensely beneficial to the processing of
transactions because transactions are made very simple and deterministic,
i.e. their outcome can be determined easily. Transaction-oriented databases
are typically associated with business models dealing with banking,
accounting, stock management, Customer Relationship Management (CRM),
etc. For example, the transaction of updating a customer's address may
devolve to a single record lookup in a customer address master table. This
lookup is controlled by a customer address key, which defines uniqueness of
the customer address record and allows an indexed lookup that is extremely
fast. It is safe to say that the success of transaction processing in relational
databases is mostly due to the discipline of ER modeling.
Due to the very fine-grained decomposition of data into separate tables, the
ER approach generates database models which:
•   end users cannot understand, remember or navigate;
•   cannot be usefully queried by general-purpose database query software;
    tools that attempt to do this are notorious for making the wrong choices,
    with disastrous consequences for performance;
•   are quite complex for humans to query, as a large number of often
    complex tables need to be joined to retrieve only a few fields;
•   defeat the high-performance retrieval of data.

   Dimensional Model
The dimensional model (DM) approach is a logical design technique that
seeks to present the data in a standard, intuitive framework that allows for
high-performance access. It is inherently dimensional, and it adheres to a
discipline that uses the relational database model with some important
restrictions. The DM predates the ER modeling approach as it was already
used in the late 60s. It is probably accurate to say that the DM approach was
not invented by any single person. It is an irresistible force in the design of
databases that will always appear when the design efforts put
understandability and performance as the highest goals. Due to the
shortcomings of the ER approach a large share of database specialists have
switched from ER to simpler designs which all look similar. In a natural,
almost unconscious way, they returned to the roots of the original relational
model because they knew a database cannot be effectively queried by end
users unless it is packaged simply. The DM approach has a number of
important advantages that the ER approach lacks.
•   First, the models resulting from the DM approach are predictable, i.e. it is
    simple to predict how they will evolve.
•   User interfaces, query tools and report generators can all make strong
    assumptions about the model to make the processing more efficient.
•   Rather than using a cost-based optimizer, a database engine can make
    very strong assumptions about constraints, hence greatly improving
    performance.

In the DM approach every dimensional model is composed of one table with a
multipart key, called the fact table, and a set of smaller tables called
dimension tables. Each dimension table has a single-part primary key that
corresponds exactly to one of the components of the multipart key in the fact
table. This characteristic "star-like" structure is often called a star join. The
term star join dates back to the earliest days of relational databases. A fact
table, because it has a multipart primary key made up of two or more foreign
keys, always expresses a many-to-many relationship. The most useful fact
tables also contain one or more numerical measures, or "facts," that occur for
the combination of keys that define each record. The most useful facts in a
fact table are numeric and additive. Additivity is crucial because data
warehouse applications almost never retrieve a single fact table record;
rather, they fetch back hundreds, thousands, or even millions of these records
at a time, and the only useful thing to do with so many records is to add them
up. Dimension tables, by contrast, most often contain descriptive textual
information. Dimension attributes are used as the source of most of the
interesting constraints in data warehouse queries, and they are virtually
always the source of the row headers in the SQL answer set.
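
As an illustration of the star-join structure, a minimal sketch with
hypothetical hotel-stay tables (not part of any actual schema) could be:

    -- Dimension tables: single-part primary keys, descriptive attributes.
    CREATE TABLE dim_date (
        date_id       INT      NOT NULL PRIMARY KEY,
        calendar_date DATETIME NOT NULL
    );
    CREATE TABLE dim_hotel (
        hotel_id INT          NOT NULL PRIMARY KEY,
        name     VARCHAR(100) NOT NULL,
        region   VARCHAR(50)  NOT NULL   -- typical source of row headers
    );
    -- Fact table: multipart key of foreign keys plus an additive fact.
    CREATE TABLE fact_stay (
        date_id  INT NOT NULL REFERENCES dim_date (date_id),
        hotel_id INT NOT NULL REFERENCES dim_hotel (hotel_id),
        revenue  DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (date_id, hotel_id)
    );
    -- Typical usage: constrain on dimension attributes, add up the facts.
    SELECT h.region, SUM(f.revenue) AS total_revenue
    FROM fact_stay f
    JOIN dim_hotel h ON h.hotel_id = f.hotel_id
    GROUP BY h.region;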

   Metamodeling
Metamodeling is a variant of the ER modeling approach that focuses on
elegantly solving complex modeling situations that do not seem to comply
with the ER modeling approach. Instead of focusing on real world entities a
metamodel aims at modeling the language which supports the situation. A
typical example of a situation where metamodeling is used is when a
database model is asked to establish the translation between multiple
standards that have similarities but present significant differences in
vocabulary, sections, definitions, etc. According to the ER modeling approach
each standard first needs to be modeled so that it can be operated
individually. Once each standard is modeled then the ER approach suggests
that mappings between all models are established. When such an exercise is
carried out for more than 4 or 5 models the task becomes extremely complex,
with hundreds of mappings to be established. Very rapidly it becomes obvious
that there is a fundamental flaw in the modeling approach, one that can only
be compensated for with huge efforts to maintain the model. Such a situation
is an excellent indicator of the need for a metamodeling approach. The above
modeling problem can easily and elegantly be solved by a metamodel that
focuses not on operating each individual standard and its mappings but
instead on the vocabulary mappings between the models.
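
A minimal sketch of such a metamodel, with hypothetical table names: the
standards and their terms become data rows, so adding a new standard adds
rows rather than new tables and mappings.

    CREATE TABLE standard (
        standard_id INT          NOT NULL PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );
    CREATE TABLE term (
        term_id     INT          NOT NULL PRIMARY KEY,
        standard_id INT          NOT NULL REFERENCES standard (standard_id),
        label       VARCHAR(200) NOT NULL
    );
    -- Each vocabulary mapping is a row, not a hard-coded table link.
    CREATE TABLE term_mapping (
        from_term_id INT NOT NULL REFERENCES term (term_id),
        to_term_id   INT NOT NULL REFERENCES term (term_id),
        PRIMARY KEY (from_term_id, to_term_id)
    );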

    Discussion
For experienced database designers, the design of a new database always
starts by considering which of the three approaches, i.e. ER, dimensional
and metamodeling, best applies to meeting its requirements. If the database
is targeted at supporting a transaction system then the ER approach is used
as it is the best-fit approach. Transaction systems are used by banks, retail,
real estate, etc. to manage their businesses along their business models. If
the database is targeted at establishing complex linkages between large sets
of concepts then the metamodeling approach is used. Typical examples of
metamodeling applications include standards translation, currency translation
and language translation. If the
database is targeted at supporting end-user queries then the Dimensional
Model approach is used. ER modeling defeats end-user delivery and should
not be used for this purpose.
It is important to understand that ER-based transaction databases usually
serve very well the purpose of a specific application or domain but usually fall
short of supporting multiple applications or domains at the same time. For
example, an organization uses one database to support its hydrology data and
another one to support its environmental data. In this example both
databases operate well individually but cannot be directly linked or mapped
together as their common founding concepts are implemented in different
ways. For example, both databases contain a table corresponding to the
wetland feature but the definition of this feature varies and its attribute data is
different. In this case, the two databases can hardly be merged into a single
database that would serve both purposes. This example also applies to the
more conventional accounting and HR business domains.

 Normalization

     Introduction
When designing a database the first and most important task is to
efficiently organize the data according to a set of principles called
the normal forms. This process, called ‘normalization’, ensures that the
database is consistent (reduces chances of errors and conflicts) and complies
with best practices which aim at eliminating data redundancy (for example,
storing the same data in more than one table) and ensure data dependencies
make sense (only storing related data in a table). The resulting ‘normalized’
database is hence logically stored and uses the minimum amount of storage
space. The normal form design principles are numbered from one (the lowest
form of normalization, referred to as first normal form or 1NF) through five
(fifth normal form or 5NF). In practical applications only the first 3 normal
forms are implemented (1NF, 2NF and 3NF) and occasionally the fourth normal
form (4NF) is used. Fifth normal form is very rarely used. It's important to
point out that they are guidelines and guidelines only. Occasionally, it
becomes necessary to stray from them to meet practical requirements.
However, when variations take place, it's extremely important to evaluate any
possible ramifications they could have and account for possible
inconsistencies.

   The first normal form
The first normal form (1NF) sets the very basic rules for an organized
database:
•   Eliminate duplicative columns from the same table.
•   Create separate tables for each group of related data and identify each
    row with a unique column or set of columns (the primary key).
An example is sketched below.
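
A minimal sketch, using hypothetical names: repeating phone columns violate
1NF; moving the phone numbers to their own table, identified by a primary
key, restores it.

    -- Violates 1NF: duplicative columns for the same kind of value.
    --   customer (customer_id, full_name, phone1, phone2, phone3)
    -- In 1NF: one row per phone number, identified by a primary key.
    CREATE TABLE customer_phone (
        customer_id INT         NOT NULL,
        phone       VARCHAR(20) NOT NULL,
        PRIMARY KEY (customer_id, phone)
    );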

  The second normal form
The second normal form (2NF) further addresses the concept of removing
duplicative data:
•   Meet all the requirements of the first normal form.
•   Remove subsets of data that apply to multiple rows of a table and place
    them in separate tables.
•   Create relationships between these new tables and their predecessors
    through the use of foreign keys.
An example is sketched below.
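
A minimal sketch, with hypothetical names: in a table keyed by (order_id,
product_id), the product name depends on only part of the key, so 2NF moves
it to a separate table linked by a foreign key.

    -- Violates 2NF: product_name depends on product_id alone.
    --   order_line (order_id, product_id, product_name, quantity)
    -- In 2NF: the product subset moves to its own table.
    CREATE TABLE product (
        product_id   INT          NOT NULL PRIMARY KEY,
        product_name VARCHAR(100) NOT NULL
    );
    CREATE TABLE order_line (
        order_id   INT NOT NULL,
        product_id INT NOT NULL REFERENCES product (product_id),
        quantity   INT NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );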

  The third normal form
The third normal form (3NF) goes one large step further:
•   Meet all the requirements of the second normal form.
•   Remove columns that are not directly dependent upon the primary key.
An example is sketched below.
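
A minimal sketch, with hypothetical names: a department name depends on
the department identifier rather than on the employee primary key, so 3NF
moves it to its own table.

    -- Violates 3NF: dept_name depends on dept_id, not on employee_id.
    --   employee (employee_id, full_name, dept_id, dept_name)
    -- In 3NF: the transitively dependent column moves out.
    CREATE TABLE department (
        dept_id   INT          NOT NULL PRIMARY KEY,
        dept_name VARCHAR(100) NOT NULL
    );
    CREATE TABLE employee (
        employee_id INT          NOT NULL PRIMARY KEY,
        full_name   VARCHAR(100) NOT NULL,
        dept_id     INT          NOT NULL REFERENCES department (dept_id)
    );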

  The fourth normal form
The fourth normal form (4NF) has one additional requirement:
•   Meet all the requirements of the third normal form.
•   A relation is in 4NF if it has no multi-valued dependencies.
An example is sketched below.
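
A minimal sketch, with hypothetical names: an employee's skills and spoken
languages are independent of each other, so keeping them in one table forces
every combination to be stored; 4NF splits them apart.

    -- Violates 4NF: skill and language are independent multi-valued facts.
    --   employee_skill_language (employee_id, skill, language)
    -- In 4NF: one table per independent multi-valued fact.
    CREATE TABLE employee_skill (
        employee_id INT         NOT NULL,
        skill       VARCHAR(50) NOT NULL,
        PRIMARY KEY (employee_id, skill)
    );
    CREATE TABLE employee_language (
        employee_id INT         NOT NULL,
        language    VARCHAR(50) NOT NULL,
        PRIMARY KEY (employee_id, language)
    );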

  The fifth normal form
The fifth normal form (5NF) addresses situations where a relation cannot be
losslessly decomposed into two relations but can be decomposed into three
or more:
•   Meet all the requirements of the fourth normal form.
•   A relation is in 5NF if every join dependency it contains is implied by its
    candidate keys, i.e. it cannot be further decomposed without losing
    information.
An example is sketched below.
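
A minimal sketch based on the classic supplier/part/project illustration, all
names hypothetical: under a business rule stating that whenever a supplier
offers a part, supplies a project, and that project uses that part, the supplier
necessarily supplies that part to that project, the three-column table can be
losslessly decomposed into three two-column tables.

    --   supplier_part_project (supplier_id, part_id, project_id)
    -- 5NF decomposition into three binary relations:
    CREATE TABLE supplier_part (
        supplier_id INT NOT NULL,
        part_id     INT NOT NULL,
        PRIMARY KEY (supplier_id, part_id)
    );
    CREATE TABLE part_project (
        part_id    INT NOT NULL,
        project_id INT NOT NULL,
        PRIMARY KEY (part_id, project_id)
    );
    CREATE TABLE supplier_project (
        supplier_id INT NOT NULL,
        project_id  INT NOT NULL,
        PRIMARY KEY (supplier_id, project_id)
    );
    -- Joining the three tables recovers the original rows without spurious
    -- combinations only because of the stated business rule.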

 Database Model

    Introduction
The preparation of the database model represents the main activity of the
database design phase. The database model not only acts as the repository
of the design but also as a very important medium to communicate with
stakeholders and all technical specialists involved with the implementation of
the database as well as its integration within the target application. The
design of a database is usually carried out by database architects and/or
database administrators. A Computer-Aided Software Engineering (CASE)
tool is used in almost all cases to support the design process by automating
all repetitive tasks and storing all documentation in a dictionary. Once system
requirements are analyzed and understood the database design work goes
into a 3 phase cycle which is described in the following sections. In theory the
database design will go through a single cycle of those phases. In practice it
is very common to iterate through a couple of cycles that will ensure the
outputs of all 3 phases are all in line and adequately linked. The following
sections present each of these phases.

   Conceptual Model
The first phase produces a conceptual data model which presents a
conceptual plan of how the tables, attributes and relationships will be
implemented. This model is very often used as a medium to communicate
with stakeholders and illustrate how data will be stored. This conceptual
model focuses on establishing the tables that will store the required
information (called ‘attribute’) and the relationships between the tables. The
conceptual model is always prepared according to a standard formalism
which clearly indicates how tables and their relations have to be presented
and documented. The Unified Modeling Language (UML) class diagram is the
de facto formalism being used by IT database professionals to deliver the
conceptual model of the target database. In the past two other formalisms
have been used. The first one is the Entity-Relationship (E/R) formalism
which was very popular in North America. The second one is MERISE which
was very popular in France. Once the conceptual data model tables and
relationships have been defined, each one of them is documented by filling in
the required standardized documentation in the conceptual model dictionary.
This dictionary is usually completed directly through the CASE tool, in the
background of the conceptual model. Various standards and best practices
are used to document the conceptual model dictionary. Several ISO standards
can be used to document the conceptual model into its dictionary depending
on the database context. In many cases only the mandatory fields of the
dictionary are filled in, as they contain information that will be used by the
CASE tool to transform the conceptual data model into a logical model.
Although the conceptual data model does not explicitly need to be
normalized, it is a best practice to ensure it complies with at least the first
(1NF) and second
(2NF) normal forms. In practice, experienced IT professionals often already
address in the conceptual model important issues which they know will arise
in the following logical and physical data models to ensure the smallest
number of iterations between the phases. It is important to note that the
conceptual data model is a graphical representation of the information
requirements of a system, it is not a database.

    Logical Model
The second phase produces a logical database model that translates the
conceptual model into a fully descriptive logical data model. As a result,
whereas some join tables did not appear in the conceptual model, the logical
model will include all tables. This model is very often used as a medium to
communicate with technical IT staff involved into the programming and
implementation of the database driven target system. The logical data model
represents the realistic and normalized design of the database that will be
implemented regardless of the technology that will be selected to implement
it. The logical data model must be normalized and comply with the first 3
normal forms (1NF, 2NF and 3 NF). In some rare cases it must also comply
with the fourth and fifth normal forms (4NF & 5NF). CASE tools can usually
handle and automate most of the simple work required by the conversion of a
conceptual data model into a logical data model. Following normal forms and
best practices they usually offer solutions to most of the common issues
associated with this conversion. Once these solutions are validated and
accepted the more complex cases need to be addressed independently to
ensure that the logical model complies with normal forms, is not technology
dependent and is in line with the selected approach for the database design.
It is important to note that the logical data model is still a graphical
representation of the information requirements of a system; it is not a
database.

   Physical model
The third and last phase produces the physical data model which is in fact the
database design that will be implemented. In practice this materializes as an
implemented schema, i.e. an empty set of tables linked by relationships and
ready to receive data. This design incorporates all required best practices to
adequately implement the logical database model into an operational
database schema that can be handled and operated by a database
management system (DBMS). As such the physical data model is technology
dependent. For example, if an attribute was called ‘social security number’ in
the conceptual and logical models it may be renamed ‘sosecnum’ in the
physical data model if the DBMS does not support long names containing
space characters. The physical data model is not a graphical representation
of the information requirements of a system; it is the empty database ready
to receive data.

 Database Implementation

   Optimization

    Denormalization

    Indexing

    Backup

   Hardware tuning

 Best practices

Design of the IDIS database

    Approach
 In accordance with best practices this project has selected the ‘dimensional
 model’ approach to meet those requirements. This approach is the de facto
 solution implemented for databases with similar requirements for over 25 years
 across industry and academia. It follows a very strict and very clear generic
 logic that does not depend on the domain being modeled. In short, data is
 stored in a data table called the ‘fact table’ and the attributes associated with
 this data are stored in attribute tables called ‘dimension tables’. A ‘dimension
 table’ can be linked to more than one fact table. This approach can be equally
 implemented in a single database or in many databases depending on system
 specific requirements.
 In the data platform, instead of storing all data from all basins in a single
 database, this approach is implemented as follows: a database is created for
 each basin and the basin’s data is stored in this database according to the
 dimensional model approach. Figure 1 presents Basin tables and Index tables.
 Two important motives guide this implementation approach. First, this approach
 allows basin-specific database sizes to remain manageable for the backup
 procedures; the larger the database, the more complex the backup procedure.
 Second, having basin-specific databases greatly facilitates the replication
 procedures which will be required when the data platform moves into a
 decentralized mode where local mirrors are maintained at the basin locations. If
 the data tables for all basins were all crammed into a single database, the
 replication procedures would be far more complex to establish and operate.
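
 As a sketch of this layout, assuming hypothetical basin names and a
 simplified star schema inside each basin database:

    -- One database per basin (names are illustrative only).
    CREATE DATABASE basin_blue_river;
    CREATE DATABASE basin_green_valley;
    -- Inside each basin database, data follows the dimensional model:
    -- dimension tables plus fact tables holding the measurements.
    CREATE TABLE dim_station (
        station_id INT          NOT NULL PRIMARY KEY,
        name       VARCHAR(100) NOT NULL
    );
    CREATE TABLE fact_measurement (
        station_id INT           NOT NULL REFERENCES dim_station (station_id),
        date_id    INT           NOT NULL,
        value      DECIMAL(12,4) NOT NULL   -- numeric, additive fact
    );

 Each basin database can then be backed up and replicated independently.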

 IDIS Database Model

   Conceptual Model

   Logical Model

   Physical Model

Data acquisition, loading and documentation processes

 Data extraction

 Data transformation

 Data loading

 Documentation

Data quality control and security features

How to get the data from the database

Operation and maintenance requirements

 Indexes

 Backup

   Introduction

   Backup plan

    roles & responsibilities

 Software upgrades

   DBMS upgrade

   Patches
