1. BASIC DEFINITIONS
DWH (data warehouse) is a repository of integrated information, specifically structured for
queries and analysis. Data and information are extracted from heterogeneous sources as they are
generated. This makes it much easier and more efficient to run queries over data that originally
came from different sources.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision making process”.
Subject-oriented – a DW is organized around major subjects and excludes data that is
not useful in the decision support process.
Integrated – a DW is constructed by integrating numerous data sources (relational
DB, flat files, legacy systems). The DW provides mechanisms for cleaning and
standardizing the data.
Time-variant – data is stored to provide information from a historical perspective.
Every key structure in the data warehouse contains, either implicitly or explicitly, an
element of time.
Nonvolatile – a DW is physically separated from the operational environment. Due to
this separation it does not require transaction processing, recovery, and concurrency
control mechanisms. It usually requires only
two operations: initial loading of data and access of data.
Data Warehouse is an architecture constructed by integrating data from multiple
heterogeneous sources to support structured and/or ad hoc queries, analytical
reporting and decision making.
Data Warehousing is a process of constructing and using data warehouses.
Data Warehouse is
A Multi-Subject Information Store
Typically 100s of Gigabytes to Terabytes
Data Mart :
It is a collection of subject areas organized for decision support based on the needs of
a given department, e.g. sales or marketing. The data mart is designed to suit the
needs of that department. Data mart data is much less granular than the warehouse data.
Data Mart is
A Single Subject Data Warehouse
Often Departmental or Line of Business Oriented
Typically Less Than 100 Gigabytes
Differences between DWH & Data Mart :
A DWH is used at the enterprise level, while data marts are used at the business division / department
level. Data warehouses are arranged around the corporate subject areas found in the corporate
data model. Data warehouses contain more detailed information, while most data marts contain more
summarized or aggregated data.
OLTP is Online Transaction Processing. This is a standard, normalized database
structure. OLTP is designed for transactions, which means that inserts, updates and
deletes must be fast.
OLAP is Online Analytical Processing. It works on read-only, historical, aggregated data.
Difference between OLTP and OLAP :
OLTP                                 OLAP
Current data                         Current and historical data
Short database transactions          Long database transactions
Online update/insert/delete          Batch update/insert/delete
Normalization is promoted            Denormalization is promoted
High volume transactions             Low volume transactions
Transaction recovery is necessary    Transaction recovery is not necessary
Low number of concurrent             High number of concurrent
Fact Table :
It contains the quantitative measures about the business.
Fact tables that contain aggregated facts are often called summary tables.
Dimension Table :
It contains descriptive data about the facts (business).
Aggregate tables :
Aggregate Tables are pre-stored summarized tables. Usage of Aggregates can
increase the performance of Queries by several times.
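As a rough illustration of a pre-stored summary table, the sketch below builds a monthly aggregate from a detail-level fact table using Python's built-in `sqlite3` module. The table names (`sales_fact`, `sales_monthly_agg`), columns and values are hypothetical and chosen only for this example.

```python
import sqlite3

# Hypothetical detail-level fact table; names and values are illustrative only.
agg_db = sqlite3.connect(":memory:")
agg_db.execute("CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL)")
agg_db.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("2024-01-05", "widget", 10.0),
        ("2024-01-20", "widget", 15.0),
        ("2024-01-22", "gadget", 7.5),
        ("2024-02-03", "widget", 20.0),
    ],
)

# Build the aggregate (summary) table once, at load time.
agg_db.execute(
    """
    CREATE TABLE sales_monthly_agg AS
    SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7), product
    """
)

def monthly_total(month, product):
    """Answer a monthly question from the small aggregate table, not the detail rows."""
    row = agg_db.execute(
        "SELECT total_amount FROM sales_monthly_agg WHERE month = ? AND product = ?",
        (month, product),
    ).fetchone()
    return row[0] if row else 0.0
```

Queries at the monthly grain then read one pre-summed row per month and product instead of scanning every detail row, which is where the performance gain comes from.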
Conformed dimensions :
A conformed dimension is a dimension table shared by multiple fact tables. These tables
connect separate star schemas into an enterprise star schema.
A schema is a collection of database objects, including tables, views, indexes, and
synonyms. There are a variety of ways of arranging schema objects in the schema
models designed for data warehousing. Most data warehouses use a dimensional
model.
Star Schema :
Star Schema is a set of tables comprising a single, central fact table surrounded by
de-normalized dimensions. Star schemas implement dimensional data structures with
de-normalized dimensions.
Snow Flake Schema:
Snow Flake Schema is a set of tables comprising a single, central fact table
surrounded by normalized dimension hierarchies. Snowflake schemas implement
dimensional data structures with fully normalized dimensions.
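A minimal sketch of a star schema, assuming a hypothetical sales subject area: one central fact table (`fact_sales`) joined to two de-normalized dimension tables (`dim_product`, `dim_date`). All table names, columns and rows are invented for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

star_db = sqlite3.connect(":memory:")
star_db.executescript("""
-- De-normalized dimensions: category lives directly on the product row.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- Central fact table: foreign keys to each dimension plus a numeric measure.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    sales_amount REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES (1, 20240101, 100.0), (1, 20240201, 50.0), (2, 20240101, 30.0);
""")

# A typical star query: join the fact table to a dimension and aggregate the measure.
star_sales_by_category = star_db.execute("""
    SELECT p.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

In a snowflake variant of the same sketch, `category` would move out of `dim_product` into its own table, at the cost of one more join per query.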
The DWH contains 2 types of queries. There will be
Fixed queries that are clearly defined and well understood, such as regular
Ad Hoc Query: the starting point for any analysis into a database. It is the ability
to run any query when desired and expect a reasonable response that makes the
data warehouse worthwhile and makes the design such a significant challenge.
There will also be ad hoc queries that are unpredictable, both in quantity and
The end-user access tools are capable of automatically generating the database
query that answers any question posed by the user.
Canned Queries: pre-defined queries. Canned queries contain prompts that
allow you to customize the query for your specific needs.
Kimball (Bottom up) vs Inmon (Top down) approaches :
Bottom up: According to Ralph Kimball, when you plan to design analytical solutions
for an enterprise, start by building data marts. When you have 3 or 4 such data marts,
you will have an enterprise-wide data warehouse built up automatically,
without time and effort being spent exclusively on building the EDWH, because
the time required to build a data mart is less than for an EDWH.
Top down: According to Inmon, build an enterprise-wide data warehouse first, and all the data
marts will be subsets of the EDWH. According to him, independent data marts
cannot make up an enterprise data warehouse under any circumstance; they
will remain isolated pieces of information (stovepipes).
ER Diagram :
The ER model is a conceptual data model that views the real world as entities and
relationships. A basic component of the model is the Entity-Relationship
diagram, which is used to visually represent data objects.
ETL :
Extraction, Transformation & Loading.
ETL tools in the market include, e.g., Informatica, Ascential DataStage, Acta, Oracle
Warehouse Builder (OWB), etc.
Staging Area :
It is the work place where raw data is brought in, cleaned, combined, archived and
exported to one or more data marts. The purpose of data staging area is to get data
ready for loading into a presentation layer.
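The flow through a staging area can be sketched as three small functions, one each for extraction, transformation and loading. This is only an illustrative sketch: the raw extract, the cleaning rules and the `dim_customer` target table are all hypothetical, and real ETL tools do far more.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would come from a source system.
RAW_EXTRACT = """customer_id,name,country
1, Alice ,us
2,Bob,US
2,Bob,US
3,,de
"""

def extract(raw_text):
    """Extract: read raw rows from the source into the staging area (a list of dicts)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(staged_rows):
    """Transform: trim whitespace, standardize country codes, drop blanks and duplicates."""
    seen = set()
    cleaned = []
    for row in staged_rows:
        name = row["name"].strip()
        if not name:
            continue  # reject rows that fail a basic validation rule
        record = (int(row["customer_id"]), name, row["country"].strip().upper())
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned

def load(conn, records):
    """Load: write the cleaned records into the presentation-layer table."""
    conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, name TEXT, country TEXT)")
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", records)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW_EXTRACT)))
loaded_rows = warehouse.execute(
    "SELECT * FROM dim_customer ORDER BY customer_id"
).fetchall()
```

Of the four raw rows, the duplicate and the row with a blank name are dropped in staging, so only two cleaned rows reach the target table.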
Slowly Changing Dimensions :
Dimensions are said to be slowly changing dimensions when their attributes remain
almost constant, requiring minor alterations.
E.g., marital status.
Bitmap indexes and B-tree indexes are the indexing mechanisms used for a typical data
warehouse.
OLAP, MOLAP, ROLAP, DOLAP, HOLAP :
OLAP: Online Analytical Processing.
OLAP tools in the market include, e.g., Business Objects, Brio, Cognos, MicroStrategy,
Alphablock, Crystal Reports, etc.
ROLAP: Relational OLAP; the users see cubes, but under the hood it is
pure relational tables. MicroStrategy is a ROLAP product.
MOLAP: Multidimensional OLAP; the users see cubes, and under the hood
there is a big cube. Oracle Express used to be a MOLAP product.
DOLAP: Desktop OLAP; the users see many cubes, and under the hood there
are many small cubes, e.g., Cognos PowerPlay.
HOLAP: Hybrid OLAP; combines MOLAP and ROLAP, e.g., Essbase.
Types of Facts:
Additive facts
– Can be added along all the dimensions
– Discrete numerical measures, e.g. retail sales in $
Non-additive facts
– Numeric measures that cannot be added across any dimension
– Intensity measures averaged across all dimensions, e.g. room temperature
– Textual facts - AVOID THEM
Semi-additive facts
– Snapshots, taken at a point in time
– Measures of intensity
– Not additive along the time dimension, e.g. account balance, inventory balance
– Added and divided by the number of time periods to get a time-average
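The behaviour of a snapshot measure such as an account balance can be sketched in a few lines: balances may be summed across accounts for one point in time, but across time they are added and divided by the number of periods. The snapshot data and function names below are hypothetical.

```python
# Hypothetical daily balance snapshots: (day, account, balance).
snapshots = [
    ("2024-01-01", "A", 100.0),
    ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0),
    ("2024-01-02", "B", 50.0),
]

def total_balance_on(day):
    """Balances add up across accounts for a single point in time."""
    return sum(balance for d, _, balance in snapshots if d == day)

def time_average_balance():
    """Across time, add the per-day totals and divide by the number of periods."""
    days = sorted({d for d, _, _ in snapshots})
    return sum(total_balance_on(d) for d in days) / len(days)
```

Summing every row in `snapshots` would give 320.0, which is not a meaningful balance; the time-average of the two daily totals (150.0 and 170.0) is 160.0.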
Attribute :
A field represented by a column within an object (entity). An object may be a table,
view or report. An attribute is also associated with an SGML (HTML) tag used to
further define the usage.
Business Activity Monitoring (BAM) :
BAM is a business solution that is supported by an advanced technical infrastructure
that enables rapid insight into new business strategies, the reduction of operating cost
by real-time identification of issues and improved process performance.
Business Intelligence (BI) :
Business intelligence is actually an environment in which business users receive data
that is reliable, consistent, understandable, easily manipulated and timely. With this
data, business users are able to conduct analyses that yield overall understanding of
where the business has been, where it is now and where it will be in the near future.
Business intelligence serves two main purposes. It monitors the financial and
operational health of the organization (reports, alerts, alarms, analysis tools, key
performance indicators and dashboards). It also regulates the operation of the
organization, providing two-way integration with operational systems and
information feedback analysis.
Data Integration :
Pulling together and reconciling, for analytic purposes, dispersed data that
organizations have maintained in multiple, heterogeneous systems. Data needs to be
accessed and extracted, moved and loaded, validated and cleaned, and standardized
Data Mapping :
The process of assigning a source data element to a target data element.
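In its simplest form, such a mapping can be written down as a lookup from source element names to target element names. The sketch below assumes hypothetical field names (`cust_nm`, `customer_name`, and so on) and ignores type conversion, which real mappings usually also specify.

```python
# Hypothetical source-to-target mapping; the field names are illustrative.
FIELD_MAP = {
    "cust_nm": "customer_name",
    "cust_cty": "city",
    "ord_amt": "order_amount",
}

def map_record(source_record, field_map=FIELD_MAP):
    """Rename each mapped source element to its target element; unmapped fields are dropped."""
    return {
        target: source_record[source]
        for source, target in field_map.items()
        if source in source_record
    }
```

For example, a source record with `cust_nm` and `ord_amt` comes out with `customer_name` and `order_amount`, and any field that has no mapping is left behind.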
Data Mining :
A technique using software tools geared for the user who typically does not know
exactly what he's searching for, but is looking for particular patterns or trends. Data
mining is the process of sifting through large amounts of data to produce data
content relationships. It can predict future trends and behaviors, allowing businesses
to make proactive, knowledge-driven decisions. This is also known as data surfing.
Data Modeling :
A method used to define and analyze data requirements needed to support the
business functions of an enterprise. These data requirements are recorded as a
conceptual data model with associated data definitions. Data modeling defines the
relationships between data elements and structures.
Drill Down :
A method of exploring detailed data that was used in creating a summary level of
data. Drill down levels depend on the granularity of the data in the data warehouse.
Meta Data :
Meta data is data that expresses the context or relativity of data. Examples of meta
data include data element descriptions, data type descriptions, attribute/property
descriptions, range/domain descriptions and process/method descriptions. The
repository environment encompasses all corporate meta data resources: database
catalogs, data dictionaries and navigation services. Meta data includes name, length,
valid values and description of a data element. Meta data is stored in a data dictionary
and repository. It insulates the data warehouse from changes in the schema of
operational systems.
Normalization :
The process of reducing a complex data structure into its simplest, most stable
structure. In general, the process entails the removal of redundant attributes, keys, and
relationships from a conceptual data model.
Surrogate Key :
A surrogate key is a single-part, artificially established identifier for an entity.
Surrogate key assignment is a special case of derived data - one where the primary
key is derived. A common way of deriving surrogate key values is to assign integer
values sequentially.
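A minimal sketch of surrogate key assignment, assuming the keys are sequential integers starting at 1 (that scheme, the class name and the example natural keys are all assumptions of this sketch). Each natural key from a source system receives one stable integer identifier.

```python
import itertools

class SurrogateKeyGenerator:
    """Assigns a stable integer surrogate key to each natural (business) key."""

    def __init__(self, start=1):
        self._next_key = itertools.count(start)
        self._assigned = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key if this natural key was seen before.
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._next_key)
        return self._assigned[natural_key]
```

Because the natural key can be any hashable value, a tuple such as `("CRM", "C-100")` lets identifiers from different source systems share one surrogate key space without colliding.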
MOLAP, ROLAP, and HOLAP
In the OLAP world, there are mainly two different types: Multidimensional OLAP
(MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to
technologies that combine MOLAP and ROLAP.
MOLAP :
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval and are
optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated
when the cube is created. Hence, complex calculations are not only doable, but
they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large amount of
data in the cube itself. This is not to say that the data in the cube cannot be
derived from a large amount of data. Indeed, this is possible. But in this case, only
summary-level information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology,
chances are additional investments in human and capital resources are needed.
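The idea that all calculations are pre-generated when the cube is built can be sketched with a plain dictionary standing in for the proprietary cube storage. The detail rows, the two dimensions and the `"*"` roll-up marker are assumptions made for this example.

```python
from itertools import product

# Hypothetical detail rows: (region, product, amount).
ROWS = [
    ("East", "widget", 10.0),
    ("East", "gadget", 5.0),
    ("West", "widget", 7.0),
]
ALL = "*"  # marker meaning "all members of this dimension"

def build_cube(rows):
    """Pre-compute the total for every (region, product) cell, including roll-ups."""
    cube = {}
    for region, prod, amount in rows:
        # Each detail row contributes to its own cell and to every roll-up cell above it.
        for cell in product((region, ALL), (prod, ALL)):
            cube[cell] = cube.get(cell, 0.0) + amount
    return cube

cube = build_cube(ROWS)
```

After the build step, a question such as "total for East across all products" is a single lookup, `cube[("East", ALL)]`, with no aggregation at query time. The number of cells grows quickly with the number of dimensions and members, which is one way to see why cube size becomes a limit.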
ROLAP :
This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL
statement.
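The correspondence between slicing and dicing and WHERE clauses can be sketched as a function that appends one condition per dimension the user fixes. The `sales` table, its columns and the `slice_total` helper are hypothetical; a real ROLAP tool generates considerably more elaborate SQL.

```python
import sqlite3

rolap_db = sqlite3.connect(":memory:")
rolap_db.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
rolap_db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("East", "widget", 2023, 10.0),
        ("East", "gadget", 2024, 5.0),
        ("West", "widget", 2024, 7.0),
        ("East", "widget", 2024, 3.0),
    ],
)

ALLOWED_DIMENSIONS = {"region", "product", "year"}

def slice_total(**filters):
    """Sum `amount`, adding one WHERE condition per dimension the user fixes."""
    unknown = set(filters) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    sql = "SELECT COALESCE(SUM(amount), 0) FROM sales"
    if filters:
        # Column names come from the allow-list above; values are bound as parameters.
        sql += " WHERE " + " AND ".join(f"{dim} = ?" for dim in filters)
    return rolap_db.execute(sql, tuple(filters.values())).fetchone()[0]
```

Calling `slice_total(region="East")` slices on one dimension, and `slice_total(region="East", year=2024)` dices on two; every call is answered by running a query against the relational table.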
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology
is the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, the relational
database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these
functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query
(or multiple SQL queries) in the relational database, the query time can be long if
the underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements
do not fit all needs (for example, it is difficult to perform complex calculations
using SQL), ROLAP technologies are therefore traditionally limited by what SQL
can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box
complex functions as well as the ability to allow users to define their own
functions.
HOLAP :
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance.
When detail information is needed, HOLAP can "drill through" from the cube into the
underlying relational data.