Data Warehouse Concepts - PDF by pcherukumalla


Data Warehouse Concepts
What is a Data Warehouse?
According to Bill Inmon, author of several data warehouse books, "A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision making process."

Example: Over the years, application designers in each branch have made their own decisions about how an application and its database should be built, so source systems differ in naming conventions, units of measurement, encoding structures, and physical attributes of data. Consider a bank that has several branches in several countries, millions of customers, and lines of business such as savings and loans. The following example shows how data is integrated from source systems into a target system.

Example of Source Data

System Name        Attribute Name              Column Name                 Datatype      Values
Source System 1    Customer Application Date   CUSTOMER_APPLICATION_DATE   NUMERIC(8,0)  11012005
Source System 2    Customer Application Date   CUST_APPLICATION_DATE       DATE          11012005
Source System 3    Application Date            APPLICATION_DATE            DATE          01NOV2005

In the aforementioned example, attribute name, column name, datatype and values are
entirely different from one source system to another. This inconsistency in data can be avoided
by integrating the data into a data warehouse with good standards.

Example of Target Data (Data Warehouse)

Target System   Attribute Name              Column Name                 Datatype   Values
Record #1       Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE       01112005
Record #2       Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE       01112005
Record #3       Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE       01112005

In the above example of target data, attribute names, column names, and datatypes are
consistent throughout the target system. This is how data from various source systems is
integrated and accurately stored into the data warehouse.
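The normalization step shown in the tables above can be sketched in Python. This is a minimal illustration, not the actual integration logic: the source format strings are assumptions read off the sample values (11012005 as MMDDYYYY, 01NOV2005 as DDMONYYYY), and the target DDMMYYYY format is inferred from the target value 01112005.

```python
from datetime import datetime

# Assumed formats for each source system, read off the example tables above.
SOURCE_FORMATS = {
    "source_system_1": "%m%d%Y",   # 11012005 -> MMDDYYYY
    "source_system_2": "%m%d%Y",   # 11012005 -> MMDDYYYY
    "source_system_3": "%d%b%Y",   # 01NOV2005 -> DDMONYYYY
}

def to_warehouse_date(value: str, system: str) -> str:
    """Normalize a source date string to the assumed target DDMMYYYY format."""
    parsed = datetime.strptime(value, SOURCE_FORMATS[system])
    return parsed.strftime("%d%m%Y")
```

All three source values then map to the single consistent target value, which is the point of the integration step.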

See Figure 1.12 below for Data Warehouse Architecture Diagram.
Figure 1.12 : Data Warehouse Architecture

 Data Warehouse & Data Mart
 A data warehouse is a relational/multidimensional database that is designed for query and
 analysis rather than transaction processing. A data warehouse usually contains historical data
 that is derived from transaction data. It separates analysis workload from transaction workload
 and enables a business to consolidate data from several sources.

 In addition to a relational/multidimensional database, a data warehouse environment often
 consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that
 manage the process of gathering data and delivering it to business users.

There are three types of data warehouses:
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad, enterprise-wide scope, but unlike a real enterprise data warehouse, its data is refreshed in near real time and used for routine business activity.
3. Data Mart - A data mart is a subset of the data warehouse that supports a particular region, business unit, or business function.

Data warehouses and data marts are built on dimensional data modeling, where fact tables are connected to dimension tables. This makes data easy for users to access, since the database can be visualized as a cube with several dimensions, and a data warehouse allows slicing and dicing that cube along each of its dimensions.

Data Mart: A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data is derived from an enterprise-wide data warehouse. In an independent data mart, data is collected directly from sources.

 Figure 1.12 : Data Warehouse and Datamarts

General Information
In general, an organization exists to earn money by selling a product or by providing services for a product. An organization may operate at one place or have several branches.

Consider the example of an organization selling products throughout the world: the four major dimensions are product, location, time, and organization. Dimension tables are explained in detail under the section Dimensions. With this example, we will provide a detailed explanation of the STAR SCHEMA.
What is Star Schema?
Star Schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimension and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star, with one fact table connected to multiple dimensions. The center of the star schema consists of a large fact table that points towards the dimension tables. The advantages of a star schema are slicing and dicing, better performance, and easier understanding of the data.
Steps in designing a Star Schema

•    Identify a business process for analysis (like sales).
•    Identify measures or facts (sales dollar).
•    Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
•    List the columns that describe each dimension (region name, branch name).
•    Determine the lowest level of summary in the fact table (sales dollar).
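The steps above can be sketched as schema definitions. The following is a minimal, hypothetical star schema for the sales example, built with Python's sqlite3; all table and column names are illustrative, not taken from the original figure.

```python
import sqlite3

# One fact table pointing at four dimension tables: the star shape.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product      (product_id  INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_location     (location_id INTEGER PRIMARY KEY, region_name TEXT, branch_name TEXT);
CREATE TABLE dim_time         (time_id     INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_organization (org_id      INTEGER PRIMARY KEY, org_name TEXT);

-- Lowest level of summary: one sales-dollar fact per key combination.
CREATE TABLE fact_sales (
    product_id   INTEGER REFERENCES dim_product(product_id),
    location_id  INTEGER REFERENCES dim_location(location_id),
    time_id      INTEGER REFERENCES dim_time(time_id),
    org_id       INTEGER REFERENCES dim_organization(org_id),
    sales_dollar REAL,
    PRIMARY KEY (product_id, location_id, time_id, org_id)
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

The fact table's primary key is composed entirely of the foreign keys to the dimensions, matching the fact table definition given later in this document.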

Important aspects of Star Schema & Snowflake Schema

•    In a star schema, every dimension has a primary key.
•    In a star schema, a dimension table does not have any parent table.
•    In a snowflake schema, a dimension table has one or more parent tables.
•    In a star schema, hierarchies for the dimensions are stored in the dimension table itself.
•    In a snowflake schema, hierarchies are broken into separate tables. These hierarchies help to drill down the data from the topmost level to the lowermost level.


Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, and from the Quarter level to the Year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.

Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.

Fact Table
A table in a star schema that contains facts and is connected to the dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact table
usually contains facts with the same level of aggregation.

Example of Star Schema: Figure 1.6

In the example figure 1.6, the sales fact table is connected to the dimensions location, product, time, and organization. It shows that data can be sliced across all dimensions and can also be aggregated across multiple dimensions. "Sales Dollar" in the sales fact table can be calculated across all dimensions independently or in a combined manner, as explained below:

 •    Sales Dollar value for a particular product
 •    Sales Dollar value for a product in a location
 •    Sales Dollar value for a product in a year within a location
 •    Sales Dollar value for a product in a year within a location sold or serviced by an employee
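The slices listed above can be sketched with a small aggregation helper. The fact rows below are invented for illustration; the point is that an additive measure like Sales Dollar can be summed across any subset of dimensions.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, location, year, sales_dollar).
facts = [
    ("laptop", "NY", 2005, 1200.0),
    ("laptop", "NY", 2006,  900.0),
    ("laptop", "LA", 2005,  500.0),
    ("phone",  "NY", 2005,  300.0),
]

def sales_by(*keys):
    """Aggregate the additive Sales Dollar measure across chosen dimensions."""
    index = {"product": 0, "location": 1, "year": 2}
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[index[k]] for k in keys)] += row[3]
    return dict(totals)
```

`sales_by("product")` gives Sales Dollar per product; `sales_by("product", "location")` gives it per product per location, and so on for the other combinations in the list above.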

Snowflake Schema

A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables, i.e., dimension table hierarchies are broken into simpler tables. In the star schema example we had four dimensions (location, product, time, organization) and a fact table.

In the Snowflake schema, the example diagram shown below has 4 dimension tables, 4 lookup tables, and 1 fact table. The reason is that the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and shown separately. In OLAP, this Snowflake schema approach increases the number of joins and degrades performance when retrieving data. A few organizations normalize the dimension tables to save space, but since dimension tables occupy comparatively little space, the Snowflake schema approach is often avoided.
 Example of Snowflake Schema: Figure 1.7

Fact Table
The centralized table in a star schema is called the FACT table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys.

In the example figure 1.6, "Sales Dollar" is a fact (measure) and it can be added across several dimensions. Fact tables store different types of measures: additive, non-additive, and semi-additive.
Measure Types

•   Additive - Measures that can be added across all dimensions.
•   Non-Additive - Measures that cannot be added across any dimension.
•   Semi-Additive - Measures that can be added across some dimensions but not others.
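A classic semi-additive example (not from this document, but a standard one) is an account balance: it can be summed across accounts at a single point in time, but not across the time dimension. A small sketch with hypothetical data:

```python
# Hypothetical month-end balances keyed by (account, month).
balances = {
    ("acct_A", "2005-01"): 100.0,
    ("acct_A", "2005-02"): 120.0,
    ("acct_B", "2005-01"):  50.0,
    ("acct_B", "2005-02"):  70.0,
}

def total_balance(month):
    # Valid: add across the account dimension for a single month.
    return sum(v for (acct, m), v in balances.items() if m == month)

# Invalid for a semi-additive measure: sum(balances.values()) would
# "add" the same money across the time dimension.
```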

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables).
In the real world, it is possible to have a fact table that contains no measures or facts; such tables are called Factless Fact tables.
Steps in designing a Fact Table

•   Identify a business process for analysis (like sales).
•   Identify measures or facts (sales dollar).
•   Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
•   List the columns that describe each dimension (region name, branch name).
•   Determine the lowest level of summary in the fact table (sales dollar).

Example of a Fact Table with an Additive Measure in Star Schema: Figure 1.6

In the example figure 1.6, the sales fact table is connected to the dimensions location, product, time, and organization. The measure "Sales Dollar" in the sales fact table can be added across all dimensions independently or in a combined manner, as explained below.

 •   Sales Dollar value for a particular product
 •   Sales Dollar value for a product in a location
 •   Sales Dollar value for a product in a year within a location
 •   Sales Dollar value for a product in a year within a location sold or serviced by an employee

Database - RDBMS

There are a number of relational databases for storing data. A relational database contains normalized data stored in tables. Tables contain records and columns. An RDBMS makes it easy to work with individual records; each row contains a unique instance of data for the categories defined by the columns.
RDBMSs are used very frequently in OLTP applications (e.g., ATM transactions), and sometimes data warehouses also use relational databases. Please refer to Relational Data Modeling for details on how data from a source system is normalized and stored in RDBMS databases.

Popular RDBMS Databases
RDBMS Name           Company Name
Oracle               Oracle Corporation
IBM DB2 UDB          IBM Corporation
IBM Informix         IBM Corporation
Microsoft SQL Server Microsoft
Sybase               Sybase Corporation
Teradata             NCR

What are ETL Tools?

ETL tools are meant to extract, transform, and load data into the data warehouse for decision making. Before the evolution of ETL tools, this ETL process was done manually with SQL code created by programmers. The task was tedious and cumbersome in many cases, since it involved many resources, complex coding, and long work hours. On top of that, maintaining the code was a great challenge for the programmers.

These difficulties are eliminated by ETL tools, since they are very powerful and offer many advantages in all stages of the ETL process (extraction, data cleansing, data profiling, transformation, debugging, and loading into the data warehouse) when compared to the old method.

There are a number of ETL tools available in the market to perform the ETL process according to business/technical requirements. Following are some of them.

Popular ETL Tools
Tool Name                        Company Name
Informatica                      Informatica Corporation
DT/Studio                        Embarcadero Technologies
DataStage                        IBM
Ab Initio                        Ab Initio Software Corporation
Data Junction                    Pervasive Software
Oracle Warehouse Builder         Oracle Corporation
Microsoft SQL Server Integration Services (SSIS)  Microsoft
TransformOnDemand                Solonde
Transformation Manager           ETL Solutions

ETL Tools: What to Learn?
With the help of ETL tools, we can create powerful target data warehouses without much difficulty. Following are the various things we have to know and learn in order to use ETL tools:
» How to install the ETL tool on server/client?

Working with an ETL Tool:
» How to work with various options like designer, mapping, workflow, scheduling, etc.?
» How to work with sources like DBMS, relational source databases, files, ERPs, etc., and
  import the source definitions?
» How to import data from data modeling tools, applications, etc.?
» How to work with targets like DBMS, relational databases, files, ERPs, etc., and
  import the target definitions?
» How to create target definitions?
» How to create mappings between source definitions and target definitions?
» How to create transformations?
» How to cleanse the source data?
» How to create a dimension, slowly changing dimensions, cubes, etc.?
» How to create and monitor workflows?
» How to configure, monitor and run debugger?
» How to view and generate metadata reports?

ETL Concepts
Extraction, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into a target database.

The first step in the ETL process is mapping the data between source systems and the target database (data warehouse or data mart). The second step is cleansing the source data in the staging area. The third step is transforming the cleansed source data and then loading it into the target database.
Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation,
move) are sometimes used instead of ETL.
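The three ETL steps described above can be sketched end to end in a toy pipeline. All field names, the mapping, and the cleansing rules below are invented for illustration, not taken from any real ETL tool.

```python
# Hypothetical source rows with inconsistent formatting and a bad record.
source_rows = [
    {"cust_name": "  alice ", "appl_date": "11012005"},
    {"cust_name": "BOB",      "appl_date": ""},          # missing date
]

# Step 1: mapping - rename source fields to target fields.
FIELD_MAP = {"cust_name": "customer_name", "appl_date": "application_date"}
staging = [{FIELD_MAP[k]: v for k, v in row.items()} for row in source_rows]

# Step 2: cleansing in the staging area - trim/normalize names,
# drop rows that are missing the application date.
staging = [
    {**r, "customer_name": r["customer_name"].strip().title()}
    for r in staging if r["application_date"]
]

# Step 3: transform and load into the target.
target = list(staging)
```

After the run, only the cleansed, mapped row reaches the target.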

Glossary of ETL

Source System
A database, application, file, or other storage facility from which the data in a data warehouse
is derived.

Mapping
The definition of the relationship and data flow between source and target objects.

Metadata
Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata.
Staging Area
A place where data is processed before entering the warehouse.

Cleansing
The process of resolving inconsistencies and fixing anomalies in source data, typically as part of the ETL process.

Transformation
The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.

Transportation
The process of moving copied or transformed data from a source to a data warehouse.

Target System
A database, application, file, or other storage facility to which the transformed source data is loaded in a data warehouse.
Figure 1.12 : Sample ETL Process Flow


Informatica is a powerful ETL tool from Informatica Corporation, a leading provider of enterprise data integration and ETL software.

The important Informatica components are:

 •   Power Exchange
 •   Power Center
 •   Power Center Connect
 •   Power Channel
 •   Metadata Exchange
 •   Power Analyzer
 •   Super Glue

In Informatica, all the metadata information about source systems, target systems, and transformations is stored in the Informatica repository. Informatica's Power Center Client and Repository Server access this repository to store and retrieve metadata.


Source and Target:
Consider a bank that has many branches throughout the world. In each branch, data may be stored in different source systems such as Oracle, SQL Server, Teradata, etc. When the bank decides to integrate its data from several sources for its management decisions, it may choose one or more systems such as Oracle, SQL Server, or Teradata as its data warehouse target. Many organizations prefer Informatica for this ETL process because Informatica is powerful for designing and building data warehouses. It can connect to several sources and targets, extract metadata from them, and transform and load the data into target systems.

Guidelines to work with Informatica Power Center

 •   Repository: This is where all the metadata information is stored in the Informatica suite. The Power Center Client and the Repository Server access this repository to retrieve, store, and manage metadata.
 •   Power Center Client: The Informatica client is used for managing users, identifying source and target system definitions, creating mappings and mapplets, creating sessions, running workflows, etc.
 •   Repository Server: The Repository Server takes care of all connections between the repository and the Power Center Client.
 •   Power Center Server: The Power Center Server extracts data from sources and loads it into targets.
 •   Designer: Source Analyzer, Mapping Designer, and Warehouse Designer are tools that reside within the Designer. Source Analyzer is used for extracting metadata from source systems. Mapping Designer is used to create mappings between sources and targets; a mapping is a pictorial representation of the flow of data from source to target. Warehouse Designer is used for extracting metadata from target systems, or metadata can be created in the Designer itself.
 •   Data Cleansing: PowerCenter's data cleansing technology improves data quality by validating, correcting, and standardizing address data. A person's address may not be the same in all source systems because of typos, or because the postal code and city name do not match the address. These errors can be corrected by the data cleansing process, and the standardized data can then be loaded into the target systems (data warehouse).
 •   Transformation: Transformations transform the source data according to the requirements of the target system. Sorting, filtering, aggregation, and joining are some examples of transformations. Transformations ensure the quality of the data being loaded into the target, and this is done during the mapping process from source to target.
 •   Workflow Manager: Workflows help to load the data from source to target in a sequential manner. For example, if the fact tables are loaded before the lookup tables, the target system will raise an error because the fact table violates foreign key validation. To avoid this, workflows can be created to ensure the correct flow of data from source to target.
 •   Workflow Monitor: This monitor helps in monitoring and tracking the workflows created on each Power Center Server.
 •   Power Center Connect: This component helps to extract data and metadata from ERP systems such as IBM's MQSeries, PeopleSoft, SAP, Siebel, etc. and other third-party applications.
 •   Power Center Exchange: This component helps to extract data and metadata from ERP systems such as IBM's MQSeries, PeopleSoft, SAP, Siebel, etc. and other third-party applications.


Power Exchange:
Informatica Power Exchange, as a stand-alone service or along with Power Center, helps organizations leverage data by avoiding manual coding of data extraction programs. Power Exchange supports batch, real-time, and changed data capture options on mainframe (DB2, VSAM, IMS, etc.), midrange (AS/400 DB2, etc.), and relational databases (Oracle, SQL Server, DB2, etc.), and flat files on UNIX, Linux, and Windows systems.

Power Channel:
This helps to transfer large amounts of encrypted and compressed data over LAN or WAN, through firewalls, transfer files over FTP, etc.

Meta Data Exchange:
Metadata Exchange enables organizations to take advantage of the time and effort already invested in defining data structures within their IT environment when used with Power Center. For example, an organization may be using data modeling tools such as ERwin, Embarcadero, Oracle Designer, or Sybase PowerDesigner for developing data models. Functional and technical teams spend much time and effort creating the data model's data structures (tables, columns, data types, procedures, functions, triggers, etc.). By using Metadata Exchange, these data structures can be imported into Power Center to identify source and target mappings, which saves time and effort. There is no need for an Informatica developer to create these data structures again.

Power Analyzer:
Power Analyzer provides organizations with reporting facilities. PowerAnalyzer makes accessing, analyzing, and sharing enterprise data simple and easily available to decision makers. PowerAnalyzer enables users to gain insight into business processes and develop business intelligence.

With PowerAnalyzer, an organization can extract, filter, format, and analyze corporate information from data stored in a data warehouse, data mart, operational data store, or other data storage models. PowerAnalyzer works best with a dimensional data warehouse in a relational database. It can also run reports on data in any tables in a relational database that do not conform to the dimensional model.

Super Glue:
SuperGlue is used for loading metadata into a centralized place from several sources. Reports can be run against SuperGlue to analyze metadata.

Power Mart:
Power Mart is a departmental version of Informatica for building, deploying, and managing data warehouses and data marts. Power Center is used for corporate enterprise data warehouses, and Power Mart is used for departmental data warehouses such as data marts. Power Center supports global and networked repositories and can be connected to several sources. Power Mart supports a single repository and can be connected to fewer sources than Power Center. Power Mart can grow to an enterprise implementation, and its codeless environment makes developers more productive.

Note: This is not a complete tutorial on Informatica. We will add more tips and guidelines on Informatica in the near future; please check back soon. To know more about Informatica, visit its official website.

Informatica - Transformations
[Submitted by:Radhika, Michigan, US.]

In Informatica, Transformations help to transform the source data according to the
requirements of target system and it ensures the quality of the data being loaded into target.

Transformations are of two types: Active and Passive.

Active Transformation
An active transformation can change the number of rows that pass through it from source to target, i.e., it can eliminate rows that do not meet the transformation condition.

Passive Transformation
A passive transformation does not change the number of rows that pass through it, i.e., it passes all rows through the transformation.
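The active/passive distinction above can be sketched with two plain Python functions; the row data and field names are invented for illustration, and these stand in for a Filter (active) and an Expression (passive) transformation.

```python
# Hypothetical source rows.
rows = [{"dept": 10, "salary": 1000},
        {"dept": 20, "salary": 2000},
        {"dept": 10, "salary": 1500}]

def filter_transform(rows, dept):
    # Active: may change the row count by dropping non-matching rows.
    return [r for r in rows if r["dept"] == dept]

def expression_transform(rows):
    # Passive: exactly one output row per input row; only values change.
    return [{**r, "bonus": r["salary"] * 0.1} for r in rows]
```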

Transformations can be Connected or UnConnected.

Connected Transformation
Connected transformation is connected to other transformations or directly to target table in
the mapping.

UnConnected Transformation
An unconnected transformation is not connected to other transformations in the mapping. It is
called within another transformation, and returns a value to that transformation.

Following are the list of Transformations available in Informatica:

 •   Aggregator Transformation
 •   Expression Transformation
 •   Filter Transformation
 •    Joiner Transformation
 •    Lookup Transformation
 •    Normalizer Transformation
 •    Rank Transformation
 •    Router Transformation
 •    Sequence Generator Transformation
 •    Stored Procedure Transformation
 •    Sorter Transformation
 •    Update Strategy Transformation
 •    XML Source Qualifier Transformation
 •    Advanced External Procedure Transformation
 •    External Procedure Transformation

In the following pages, we will explain all the above Informatica Transformations and their
significances in the ETL process in detail.

Aggregator Transformation
Aggregator transformation is an Active and Connected transformation. It is useful for performing calculations such as averages and sums (mainly calculations on multiple rows or groups), for example, to calculate the total of daily sales or the average of monthly or yearly sales. Aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX, SUM, etc. can be used in the Aggregator transformation.

Expression Transformation
Expression transformation is a Passive and Connected transformation. This can be used to
calculate values in a single row before writing to the target. For example, to calculate discount
of each product or to concatenate first and last names or to convert date to a string field.

Filter Transformation
Filter transformation is an Active and Connected transformation. It can be used to filter out rows in a mapping that do not meet the condition, for example, to find all the employees who are working in Department 10, or the products that fall within the price range of $500 to $1000.

Joiner Transformation
Joiner transformation is an Active and Connected transformation. It can be used to join two sources coming from two different locations or from the same location, for example, to join a flat file and a relational source, two flat files, or a relational source and an XML source.

In order to join two sources, there must be at least one matching port. While joining two sources, it is mandatory to specify one source as master and the other as detail.

The Joiner transformation supports the following types of joins:

 •    Normal
 •       Master Outer
 •       Detail Outer
 •       Full Outer

 Normal join discards all the rows of data from the master and detail source that do not
match, based on the condition.

 Master outer join discards all the unmatched rows from the master source and keeps all the
rows from the detail source and the matching rows from the master source.

 Detail outer join keeps all rows of data from the master source and the matching rows from
the detail source. It discards the unmatched rows from the detail source.

 Full outer join keeps all rows of data from both the master and detail sources.
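The four join types described above can be sketched over two tiny hypothetical sources keyed by "id". Following the definitions given here, the master outer join keeps every detail row, the detail outer join keeps every master row, and the full outer join keeps both.

```python
master = {1: "m1", 2: "m2"}           # id -> master-source payload
detail = {2: "d2", 3: "d3"}           # id -> detail-source payload

def join(kind):
    keys = set(master) & set(detail)              # normal join: matches only
    if kind in ("master_outer", "full_outer"):
        keys |= set(detail)                       # keep unmatched detail rows
    if kind in ("detail_outer", "full_outer"):
        keys |= set(master)                       # keep unmatched master rows
    return {k: (master.get(k), detail.get(k)) for k in sorted(keys)}
```

Only id 2 matches, so the normal join returns one row, the master outer adds the unmatched detail row (id 3), the detail outer adds the unmatched master row (id 1), and the full outer keeps all three.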

Lookup Transformation
Lookup transformation is Passive, and it can be either Connected or UnConnected. It is used to look up data in a relational table, view, or synonym. The lookup definition can be imported either from a source or from a target table.

For example, suppose we want to retrieve all the sales of a product with ID 10, and the sales data resides in another table. Instead of using the sales table as one more source, use a Lookup transformation to look up the data for the product with ID 10 in the sales table.

Difference between Connected and UnConnected Lookup Transformation:
A connected lookup receives input values directly from the mapping pipeline, whereas an unconnected lookup receives values from a :LKP expression in another transformation.

A connected lookup returns multiple columns from the same row, whereas an unconnected lookup has one return port and returns one column from each row.

A connected lookup supports user-defined default values, whereas an unconnected lookup does not.

Normalizer Transformation
Normalizer Transformation is an Active and Connected transformation. It is used mainly with
COBOL sources where most of the time data is stored in de-normalized format. Also,
Normalizer transformation can be used to create multiple rows from a single row of data.
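The "multiple rows from a single row" behaviour described above can be sketched as a pivot of repeating columns, which is the typical shape of a denormalized COBOL record. The quarterly column names are invented for the example.

```python
# One denormalized row with repeating quarterly sales columns.
denormalized = {"store": "S1", "q1": 100, "q2": 110, "q3": 90, "q4": 120}

def normalize(row):
    """Produce one output row per repeating column group."""
    return [{"store": row["store"], "quarter": q, "sales": row[q]}
            for q in ("q1", "q2", "q3", "q4")]

result = normalize(denormalized)
```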

Rank Transformation
Rank transformation is an Active and Connected transformation. It is used to select the top or
bottom rank of data. For example, to select top 10 Regions where the sales volume was very
high or to select 10 lowest priced products.
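Top/bottom ranking as described above can be sketched with the standard library; the region sales figures are hypothetical.

```python
import heapq

# Hypothetical sales volume per region.
region_sales = {"east": 500, "west": 900, "north": 300, "south": 700}

def top_n(n):
    # Regions with the highest sales volume, best first.
    return heapq.nlargest(n, region_sales, key=region_sales.get)

def bottom_n(n):
    # Regions with the lowest sales volume, worst first.
    return heapq.nsmallest(n, region_sales, key=region_sales.get)
```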
Router Transformation
Router is an Active and Connected transformation. It is similar to the Filter transformation; the only difference is that the Filter transformation drops the data that does not meet the condition, whereas the Router has an option to capture the data that does not meet the condition. It is useful for testing multiple conditions. It has input, output, and default groups. For example, it makes it easy to route data where State=Michigan, State=California, State=New York, and all other states to different tables.

Sequence Generator Transformation
Sequence Generator transformation is a Passive and Connected
transformation. It is used to create unique primary key values or cycle
through a sequential range of numbers or to replace missing keys.

It has two output ports to connect to transformations. By default it has two fields, CURRVAL and NEXTVAL (you cannot add ports to this transformation). The NEXTVAL port generates a sequence of numbers when connected to a transformation or target. CURRVAL is NEXTVAL plus the Increment By value (by default, NEXTVAL plus one).
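The NEXTVAL/CURRVAL semantics described above can be sketched as a small class; this is an illustration of the behaviour, not Informatica's implementation.

```python
class SequenceGenerator:
    """Toy model of a sequence generator with NEXTVAL and CURRVAL ports."""

    def __init__(self, start=1, increment=1):
        self._next = start
        self.increment = increment

    def nextval(self):
        # Emit the current sequence value and advance by the increment.
        value = self._next
        self._next += self.increment
        return value

    @property
    def currval(self):
        # CURRVAL = the most recent NEXTVAL plus the Increment By value.
        return self._next

seq = SequenceGenerator()
first = seq.nextval()   # 1
after = seq.currval     # 2 (NEXTVAL plus the default increment of 1)
```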

 Stored Procedure Transformation
 Stored Procedure transformation is a Passive transformation that can be Connected or UnConnected. It is useful for automating time-consuming tasks, and it is also used for error handling, dropping and recreating indexes, determining the space in a database, specialized calculations, etc.

 The stored procedure must exist in the database before creating a Stored Procedure transformation, and it can exist in a source, target, or any database with a valid connection to the Informatica Server. A stored procedure is an executable script with SQL statements, control statements, user-defined variables, and conditional statements.

 Sorter Transformation
 Sorter transformation is a Connected and Active transformation. It allows data to be sorted in ascending or descending order according to a specified field. It can also be configured for case-sensitive sorting and to specify whether the output rows should be distinct.

 Source Qualifier Transformation
 Source Qualifier transformation is an Active and Connected transformation. When adding a relational or flat file source definition to a mapping, it must be connected to a Source Qualifier transformation. The Source Qualifier performs tasks such as overriding the default SQL query, filtering records, and joining data from two or more tables.

 Update Strategy Transformation
 Update Strategy transformation is an Active and Connected transformation. It is used to update data in a target table, either to maintain a history of the data or only the most recent changes. You can specify how to treat source rows: insert, update, delete, or data driven.

 XML Source Qualifier Transformation
 XML Source Qualifier is a Passive and Connected transformation. XML Source Qualifier is used
  only with an XML source definition. It represents the data elements that the Informatica Server
  reads when it executes a session with XML sources.

  Advanced External Procedure Transformation
  Advanced External Procedure transformation is an Active and Connected transformation. It
  operates in conjunction with procedures created outside of the Designer interface to extend
  PowerCenter/PowerMart functionality. It is useful for creating external transformation
  applications, such as sorting and aggregation, that require all input rows to be processed
  before any output rows are emitted.

  External Procedure Transformation
  External Procedure transformation is an Active transformation and can be Connected or
  Unconnected. Sometimes the standard transformations, such as the Expression transformation,
  may not provide the functionality you need. In such cases an External Procedure is useful for
  developing complex functions within a dynamic link library (DLL) or UNIX shared library,
  instead of creating the necessary Expression transformations in a mapping.
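As an illustration of the underlying mechanism, calling a function from a shared library, here is a small Python sketch using `ctypes` to call `sqrt` from the system's C math library. This is not Informatica code; it only shows what "a function that lives in a DLL or UNIX shared library" looks like from the caller's side, and assumes a POSIX-like system where the math library can be located.

```python
import ctypes
import ctypes.util

# Load the C math library as a stand-in for a custom shared library
# holding an external procedure (falls back to the common Linux name).
libname = ctypes.util.find_library("m") or "libm.so.6"
libm = ctypes.CDLL(libname)

# Declare the C signature so ctypes marshals arguments correctly.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

result = libm.sqrt(9.0)   # → 3.0
```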

  Differences between Advanced External Procedure and External Procedure

   •  External Procedure returns a single value, whereas Advanced External Procedure returns
      multiple values.
   •  External Procedure supports COM and Informatica procedures, whereas AEP supports only
      Informatica procedures.

  The following article describing the ETL tool Transformation Manager (TM) has been sourced
  from ETL Solutions Ltd. For more information, please visit their official website.

  ETL Solutions' product "Transformation Manager (TM)" is a stand-alone Windows® or
  Linux® software suite of metadata-driven code generator programs that provides for the
  authoring, testing, and debugging of data transforms between virtually all types of data
  (XML, databases, flat files, Java™ classes, or spreadsheets), with flexible deployment
  options in both J2EE™ and Microsoft® .NET architectures.

What is Data Mining?

  Data Mining is a set of processes for analyzing and discovering useful, actionable knowledge buried
  deep beneath large volumes of data stores or data sets. This knowledge discovery involves finding patterns or
  behaviors within the data that lead to some profitable business action. Data Mining generally requires large
  volumes of data, historical as well as current, to uncover such knowledge. Once the required
  amount of data has been accumulated from various sources, it is cleaned, validated, and prepared for storage
  in the data warehouse or data mart. BI reporting tools capture the required facts from this data for use by
  the knowledge discovery process. Data Mining can be accomplished by utilizing one or more of the traditional
  knowledge discovery techniques, such as Market Basket Analysis, Clustering, Memory Based Reasoning, Link
  Analysis, Neural Networks, and so on.
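Of the techniques listed above, Market Basket Analysis is the simplest to sketch: its core is counting how often products are bought together. The following minimal Python example (the basket data is invented for illustration) counts pair co-occurrences across transactions; a real implementation would then derive support and confidence from these counts.

```python
from itertools import combinations
from collections import Counter

def pair_counts(transactions):
    """Count how often each pair of products appears in the same basket,
    the core computation behind Market Basket Analysis."""
    counts = Counter()
    for basket in transactions:
        # sort so each pair has one canonical ordering
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [["bread", "milk"], ["bread", "milk", "eggs"], ["milk", "eggs"]]
counts = pair_counts(baskets)
# ("bread", "milk") appears together in 2 of the 3 baskets
```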

Data Mining Life Cycle:

    •   Find out the Business Problem: Suppose a company's current-year sales dropped by some
        percentage compared to the previous year. Using OLAP tools, the exact sales fact can be
        determined across several dimensions like region, time, etc.
    •   Knowledge Discovery: Given this business problem, the various reasons for the decrease in
        sales have to be analyzed using one or more of the Data Mining techniques. Causes may
        include poor quality or service of the product, flaws in marketing schemes, less demand for
        the product, seasonal changes, regulations enforced by the government, pressure from
        competitors, and so on. The exact causes have to be found in order to resolve this sales
        drop; this is the knowledge discovery step.
    •   Implement the Knowledge: Based on the above discovery, proper actions should be taken to
        overcome the business problem.
    •   Analyze the Results: Once the action has been implemented, the results need to be
        monitored and measured to find out its outcomes.

OLAP vs Data Mining:

  OLAP helps organizations find out measures such as sales drop, productivity, service response time,
  inventory in hand, etc. Simply put, OLAP tells us 'What has happened', while Data Mining helps find out 'Why it
  has happened' in the first place. Data Mining can also be used to predict 'What will happen in the future' with
  the help of data patterns available within the organization and publicly available data.

  For example, if a borrower with a bad credit and employment history applies for a mortgage loan, his/her
  application may be denied by the mortgage lender, since he/she may default on the loan if it is approved. The
  mortgage lender would have come to this decision based upon historical data previously mined from borrowers
  with a similar profile.

Business Intelligence Tool Guide:

  Business Intelligence tools assist organizations in improving their overall performance by helping them plan,
  track, monitor, analyze, and report on business activities. These tools improve customer relationship
  management, thereby increasing a company's profitability significantly. Below are a few guidelines which may
  be of help while working with BI tools:
•       How to install and setup the BI software?
•       How to get license and training from BI software vendors?
•       How to create users, administrators and assign privileges to users?
•       How to connect to the different database servers from BI applications?
•       How to understand and work on BI data models or universe?
•       How to frame the select statement according to the business requirements?
    o     How to select the tables that have to be used in the report?
    o     How to select the columns that are required for reporting?
    o     How to write the join condition to join (inner join, outer join, equi join) different tables in select statement?
    o     How to write multiple select statements in a single report?
    o     How to write the filters (null, in, equal to, greater than) that are required after the where clause?
    o     How to create dimensions and facts?
    o     How to drill up and drill down?
    o     How to set user prompts for user to enter values?
    o     How to process the query and retrieve the results?
•       How to work on results?
    o     How to modify field formats?
    o     How to sort data?
    o     How to create computed items like date functions, numeric and string functions?
•       How to create pivots?
    o     How to add data?
    o     How to create totals?
    o     How to group data?
•       How to create charts?
•       How to create reports?
    o     How to work on reporting body?
    o     How to work on report group headers?
    o     How to work on report header/footer?
    o     How to work on page header/footer?
    o     How to design the report layout?
    o     How to use page breaks?
•       How to schedule, monitor, modify, delete, and refresh a job(report)?
•       How to write the report design document and report testing document, test reports, and get user acceptance?
•       How to distribute reports and results via email, printers, intranet server, and web?
•       How to export and import data?
•       How to track on scorecards, balancing scorecards, forecasting, key performance indicators and dashboards?
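Several of the checklist items above (choosing tables and columns, writing the join condition, filtering after the WHERE clause, grouping, totals, and sorting) come together in a single SELECT statement. The sketch below exercises them against a hypothetical sales/region schema using Python's built-in sqlite3 module; the table and column names are invented for the example.

```python
import sqlite3

# Hypothetical schema, just to exercise the checklist items above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE region (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (sale_id INTEGER, region_id INTEGER, amount REAL);
    INSERT INTO region VALUES (1, 'East'), (2, 'West');
    INSERT INTO sales VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0);
""")

rows = con.execute("""
    SELECT r.name, SUM(s.amount) AS total           -- columns for the report
    FROM sales s
    INNER JOIN region r ON r.region_id = s.region_id -- join condition
    WHERE s.amount > 0                               -- filter after WHERE
    GROUP BY r.name                                  -- group data / totals
    ORDER BY total DESC                              -- sort data
""").fetchall()
# → [('East', 150.0), ('West', 75.0)]
```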

         Business Intelligence & Key Performance Indicators:

            Key Performance Indicators, commonly referred to as KPIs, are measurements identified as
       critical factors in achieving the organization's goals or mission. KPIs are identified to help
       drive a business towards success and are associated with a number of business activities, such as
       Customer Relationship Management (CRM), Supply Chain Analytics, or any other activity happening within
       the organization.

Requirements of a good KPI:

  There can be a number of factors related to the success of a company, but not all of them can be chosen
  as indicators. Only those that are mission critical, strictly adhere to the organizational goals, and are
  accurately measurable should be selected as the company's KPIs. It is always better to keep the number of
  KPIs to a minimum so that greater focus can be given to each indicator. The important
  factors to be considered in selecting a KPI are as follows:

   •      Measurable: A KPI should be quantifiable in terms of numbers.
   •      Reflects the organizational goals: A KPI should drive the business towards success.
   •      Actionable: It should help managers initiate some business action as a result of the analysis and
          measures led by the KPI.
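To make "measurable" concrete, here is a minimal sketch of computing one plausible KPI, the on-time delivery rate, from order records. The KPI choice and field names are illustrative assumptions, not from the original text.

```python
def on_time_delivery_rate(orders):
    """A measurable KPI: the share of orders delivered on or before the
    promised date. Field names ('promised', 'delivered') are illustrative.
    ISO date strings compare correctly as plain strings."""
    on_time = sum(1 for o in orders if o["delivered"] <= o["promised"])
    return on_time / len(orders)

orders = [
    {"promised": "2005-11-01", "delivered": "2005-11-01"},  # on time
    {"promised": "2005-11-01", "delivered": "2005-11-03"},  # late
    {"promised": "2005-11-05", "delivered": "2005-11-04"},  # early
]
kpi = on_time_delivery_rate(orders)   # → 2/3, i.e. about 0.67
```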

Examples of KPIs:

  A KPI may reflect regional sales by sales person, supply chain statistics by supplier, productivity by units,
  customer satisfaction, customer growth, or employee turnover. In each case, it should give
  high-level, real-time information to top-level managers, enabling them to concentrate on the company's
  critical areas.

What is a Dashboard in Business terms?

  A Business Intelligence Dashboard visually represents key organizational performance data in a near-real-time,
  user-friendly manner that can be understood instantaneously. Technically speaking, a Dashboard is a
  visual representation reflecting the Key Performance Indicators (KPIs) of interest for managerial
  review, and it also enables managers to drill down further. A Business Intelligence Dashboard is similar in
  function to a car dashboard in that it displays and provides access to powerful analytical systems and key
  performance metrics in a form that enables business executives to analyze trends and more effectively manage
  their areas of responsibility.

Features of Dashboard:

  A typical web based Business Intelligence Dashboard encompasses the following features:
    •   Web based Interface: Managers can gain broad visibility into the real-time key measurements of a
        business with the help of this multi-window, intuitive, and interactive interface.
    •   Role Based View: Executives can clearly track their organization's overall performance against its goals.
    •   Reports: Configurable, user-level as well as management-level reports.
    •   Charting and Graphing: Dashboards are well known for their easy, one-click charts and graphs that
        give instant access to complex solutions.
    •   Pre-defined Performance Metrics: Dashboards are built with common pre-defined metrics by
        default, which makes it easy for business users to track regular yet important performance metrics.

Benefits of using Business Intelligence Dashboard:

  Dashboards quickly convert and communicate complex corporate data into a meaningful display of charts,
  graphs, gauges, and other formats concurrently. A dynamic, intelligent Dashboard allows managers to
  drill down into the data to go deeper into the analysis. It eliminates the need to go through several reports;
  in one shot, a Dashboard gives a clear picture of how a company is performing in its critical areas.

Dashboard Software:

  The following are a few of the well-known Business Intelligence Dashboard software products available in the current market.

  Cognos ReportNet: A product from Cognos that helps to build, distribute, and view multi-object dashboards
  with graphical elements such as gauges, charts, and clickable pictures. Connects dashboards with underlying
  tabular data, letting people go from a high-level view to a deeper understanding.

  MicroStrategy: Microstrategy Dashboards are designed to deliver maximum visual impact in a format
  optimized for quick absorption, using a combination of tables, graphics, gauges, dials and other graphical
  indicators, as well as conditional formatting, free-form labels, borders and background colors.

  BusinessObjects Dashboard Manager: Dashboard Manager lets one choose the level of detail they need -
  an overview, a specific chart or graph, or the underlying report.

  The Compliance Management Dashboard: A product from Hyperion that includes a data model with
  pre-built connectors to data sources, a set of compliance metrics, and a series of easy-to-use
  screens, meters, and stoplights designed to quickly convey critical information needed by finance
  executives.


Scorecards:

  Scorecards are similar to Dashboards in that they provide easy-to-understand, summarized, at-a-glance
  data that tells managers and top officials about their company's present and past performance.
  Scorecards thus help to monitor the Key Performance Indicators accurately and to communicate the
  goals and strategies across the organization in an efficient and elegant manner. In a Business Intelligence
  environment, Scorecards allow managers to set metrics or targets and monitor them to see their impact on
  every department.

Balanced Scorecards:

    The Balanced Scorecard, a methodology created by Drs. Robert S. Kaplan and David P. Norton in 1992,
    is a management concept that helps managers at all levels monitor results in their key areas, including
    financial performance, customer knowledge, internal business processes, and learning. It has been
    implemented in thousands of corporations, organizations, and government agencies worldwide.

    Business Intelligence Tools Directory:

•     Actuate
•     Business Objects
•     Cognos
•     HummingBird
•     Hyperion
•     Information Builders
•     MicroStrategy
•     ProClarity
•     Siebel

What is OLAP?

    OLAP, an acronym for 'Online Analytical Processing', is a technique by which the data sourced from a data
    warehouse or data mart is visualized and summarized to provide a multidimensional view across
    multiple dimensions. Generally, OLAP refers to OLAP tools (e.g. Cognos, Business Objects, etc.) that help to
    accomplish these tasks. Since a data warehouse is designed using a dimensional data model, data is
    represented in the form of data cubes, enabling us to aggregate facts and to slice and dice across several
    dimensions. OLAP tools provide options to drill down the data from one hierarchy to the next.

    For example, the sales amount can be calculated for a particular year, or it can be drilled down to lower
    hierarchies like month, week, day, etc. In the same way, data can be rolled up for summarization from
    product to product group, from product group to product sub-class, and from product sub-class to product
    class. Thus, with this cube structure, data can be viewed from multiple points, providing data analysts a
    greater insight into the business.

  There are many OLAP hybrids or variants, such as MOLAP (Multidimensional OLAP), HOLAP (Hybrid OLAP),
  ROLAP (Relational OLAP), and DOLAP (Desktop OLAP or Database OLAP), available in the market, and they
  can be used depending on the needs and requirements of an organization.
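The roll-up idea above, re-aggregating the same fact at a coarser level of a hierarchy, can be sketched with a toy fact table in Python. The products, months, and amounts are invented for the example.

```python
from collections import defaultdict

# Toy fact table: (product, month, sales amount).
facts = [
    ("soap",  "2005-10", 120.0),
    ("soap",  "2005-11",  80.0),
    ("bread", "2005-10", 200.0),
]

def roll_up(facts, key):
    """Aggregate the sales fact along the dimension level selected by `key`,
    mimicking a roll-up from a detailed cube cell to a coarser one."""
    totals = defaultdict(float)
    for product, month, sales in facts:
        totals[key(product, month)] += sales
    return dict(totals)

by_product = roll_up(facts, lambda p, m: p)   # view the fact by product
by_month   = roll_up(facts, lambda p, m: m)   # roll up to the month level
# by_product → {'soap': 200.0, 'bread': 200.0}
```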

OLAP - Examples:

  Topmost executives of an organization are really interested in aggregated facts or numbers for taking
  decisions, rather than querying several (normalized) databases to get the data and doing the comparison
  themselves. OLAP tools visualize the data in an understandable format, such as Scorecards and
  Dashboards with Key Performance Indicators, enabling managers to monitor and take immediate actions. In
  today's business life, OLAP plays a vital role by assisting decision makers in fields such as banking and
  finance, hospitals, insurance, manufacturing, and pharmaceuticals to measure facts across geography,
  demography, product, and sales.

  OLAP can be performed on data warehouses whether or not they undergo frequent updates. Following are
  some examples that show how OLAP solves complex queries involving facts measured across a
  company's dimensions of interest.

   •    Comparison of sales (fact) of a product (dimension) over years (dimension) in the same region
   •    How many members (fact) have opened a savings account (dimension), in the USA branch (dimension),
        over a period (dimension)?
   •    How many mortgage loans (fact) have been approved in fixed mortgage (dimension) or Adjustable Rate
        Mortgage (dimension) in New York City (dimension), over a period (dimension)?
   •    What is the total sales value (fact) of a particular product (dimension) in a particular grocery store
        (dimension), over a period (dimension)?
   •    What is the amount spent (fact) for a particular product promotion (dimension) in a particular branch
        (dimension) or in a particular city (dimension), over a period (dimension)?
