Documents
Resources
Learning Center
Upload
Plans & pricing Sign in
Sign Out

Data Warehousing Concepts - DOC - DOC

VIEWS: 301 PAGES: 28

Data Warehousing Concepts & Architecture

More Info
									                Data Warehousing Concepts

  Definition of a Data Warehouse

  •   Subject oriented, time variant, non-volatile integrated, granular repository of data
              Subject-oriented
                   all relevant data about a subject is gathered and stored as a single set in
                      a useful format
              Time-variant
                   long-term storage of data, 5-10 yrs as opposed to the 30 to 60 days time
                      periods of operational data.
              Non-volatile
                   data in DW is read-only
              Integrated
                   data stored with consistent naming conventions, measurements,
                      encoding structures, and physical attributes.

DWH ARCHITECTURE

  Architecture, in the context of an organization's data warehousing efforts, is a
      conceptualization of how the data warehouse is built. There is no right or wrong
      architecture, but rather there are multiple architectures that exist to support various
      environments and situations. The worthiness of the architecture can be judged from how
      the conceptualization aids in the building, maintenance, and usage of the data
      warehouse.

  One possible simple conceptualization of data warehouse architecture consists of the
     following interconnected layers:

  Operational database layer

  The source data for the data warehouse — An organization's Enterprise Resource Planning
     systems fall into this layer.

  Data access layer

  The interface between the operational and informational access layer — Tools to extract,
     transform, load data into the warehouse fall into this layer.

  Metadata layer

  The data directory - This is usually more detailed than an operational system data directory.
     There are dictionaries for the entire warehouse and sometimes dictionaries for the data
     that can be accessed by a particular reporting and analysis tool.

  Informational access layer
The data accessed for reporting and analyzing and the tools for reporting and analyzing data
   — Business intelligence tools fall into this layer. And the Inmon-Kimball differences about
   design methodology, discussed later in this article, have to do with this layer.
OLTP Vs DW

OLTP                                    Data Warehouse


Capture business transaction as it      Navigate through the data and identify what
happens                                 to do
Designed to reduce redundant data       Designed to increase analysis efficiency
Mission critical applications           Non-mission critical applications
Automation to operate the business      Query and Analysis drive the business
Data Management                         Data Consolidation
Two dimensional                         Multi-dimensional
Granular data                           Granular as well as aggregated data
Normalized data                         Usually de-normalized data
End users are operations, sales staff   End users are business analysts, executives,
                                        senior management
Predictable transaction Volume          Unpredictable ad-hoc query volume




Star Schema

In the star schema design, a single object (the fact table) sits in the middle and is radially
connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is
represented as a single table. The primary key in each dimension table is related to a forieng key
in the fact table.
                                                                            Product Dimension
 Date Dimension
                                    Sales Fact                              Product_key
 Date_key
                                    Date_key                                Description
 Day_of_week
                                    Product_key                             Brand
 Month
                                    Store_key                               Category
 Quarter
                                    Dollars_sold
 Year                                                                       Store Dimension
                                    Units_sold
 Holiday_flag                                                               Store_key
                                    Dollars_cost
                                                                            Store_name

                                                                            Address

                                                                            Floor_plan_type


                                         Sample star schema

All measures in the fact table are related to all the dimensions that fact table is related to. In other
words, they all have the same level of granularity.

example: Assume our data warehouse keeps store sales data, and the different dimensions are
time, store, product, and customer. In this case, the figure on the left repesents our star schema.
The lines between two tables indicate that there is a primary key / foreign key relationship
between the two tables. Note that different dimensions are not related to one another.

Snowflake Schema

The snowflake schema is an extension of the star schema, where each point of the star
explodes into more points. In a star schema, each dimension is represented by a single
dimensional table, whereas in a snowflake schema, that dimensional table is normalized into
multiple lookup tables, each representing a level in the dimensional hierarchy.

We are using this because for some of the dimension tables some attributes changed frequently

Ex: status that changes like LOA,COMP,ACTV,INAC
                                     Sample snowflake schema

For example, the Time Dimension that consists of 2 different hierarchies:

1. Year → Month → Day
2. Week → Day

We will have 4 lookup tables in a snowflake schema: A lookup table for year, a lookup table for
month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is
then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating
the above relationships in the Time Dimension is shown to the right.

The main advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of
the snowflake schema is the additional maintenance efforts needed due to the increase number
of lookup tables.




Slowly Changing Dimensions

Dimension attributes chnages over time
For example, customers change their names, move, have children, adjust their Incomes


Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in
the customer lookup table has the following record:

Customer Key          Name                  State
1001                  Christina             Illinois

At a later date, she moved to Los Angeles, California on January, 2003. How should ABC Inc.
now modify its customer table to reflect this change? This is the "Slowly Changing Dimension"
problem.

There are in general three ways to solve this type of problem, and they are categorized as
follows:

Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is
treated essentially as two people.

Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and how the data model and the data looks like for
each of them. Finally, we compare and contrast among the three alternatives.

SCD type1

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key          Name                   State
1001                  Christina              Illinois

After Christina moved from Illinois to California, the new information replaces the new record, and
we have the following table:

Customer Key          Name                   State
1001                  Christina              California

Advantages:

- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no
need to keep track of the old information.

Disadvantages:

- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois
before.

Usage:

About 50% of the time.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary for the data
warehouse to keep track of historical changes.

Type 2 Slowly Changing Dimension


In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new
information. Therefore, both the original and the new record will be present. The newest record
gets its own primary key.
In our example, recall we originally have the following table:

Customer Key          Name                  State
1001                  Christina             Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the
table:

Customer Key          Name                  State
1001                  Christina             Illinois
1005                  Christina             California

Advantages:

- This allows us to accurately keep all historical information.

Disadvantages:

- This will cause the size of the table to grow fast. In cases where the number of rows for the table
is very high to start with, storage and performance can become a concern.

- This necessarily complicates the ETL process.

Usage:

About 50% of the time.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to
track historical changes.

Type 3 Slowly Changing Dimension



In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular
attribute of interest, one indicating the original value, and one indicating the current value. There
will also be a column that indicates when the current value becomes active.

In our example, recall we originally have the following table:

Customer Key          Name                  State
1001                  Christina             Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

        Customer Key
        Name
        Original State
        Current State
        Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we
have the following table (assuming the effective date of change is January 15, 2003):

Customer Key         Name                   Original State   Current State          Effective Date
1001                 Christina              Illinois         California             15-JAN-2003

Advantages:

- This does not increase the size of the table, since new information is updated.

- This allows us to keep some part of history.

Disadvantages:

- Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information will
be lost.

Usage:

Type 3 is rarely used in actual practice.

When to use Type 3:

Type III slowly changing dimension should only be used when it is necessary for the data
warehouse to track historical changes, and when such changes will only occur for a finite number
of time
        Types of Facts
There are three types of facts:

       Additive: Additive facts are facts that can be summed up through all of the dimensions in
        the fact table.
       Semi-Additive: Semi-additive facts are facts that can be summed up for some of the
        dimensions in the fact table, but not the others.
       Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
        dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that
we are a retailer, and we have a fact table with the following columns:

Dat
e
Stor
e
Pro
duct
Sale
s_A
mou
nt

The purpose of this table is to record the sales amount for each product in each store on a daily
basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you
can sum up this fact along any of the three dimensions present in the fact table -- date, store, and
product. For example, the sum of Sales_Amount for all 7 days in a week represent the total
sales amount for that week.

Say we are a bank with the following fact table:

Dat
e
Acc
ount
Curr
ent_
Bala
nce
Profi
t_M
argi
n

The purpose of this table is to record the current balance for each account at the end of each day,
as well as the profit margin for each account for each day. Current_Balance and Profit_Margin
are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all
accounts (what's the total current balance for all accounts in the bank?), but it does not make
sense to add them up through time (adding up all current balances for a given account for each
day of the month does not give us any useful information). Profit_Margin is a non-additive fact,
for it does not make sense to add them up for the account level or the day level.
        Types of Fact Tables
Based on the above classifications, there are two types of fact tables:

       Cumulative: This type of fact table describes what has happened over a period of time.
        For example, this fact table may describe the total sales by product by store by day. The
        facts for this type of fact tables are mostly additive facts. The first example presented
        here is a cumulative fact table.
       Snapshot: This type of fact table describes the state of things in a particular instance of
        time, and usually includes more semi-additive and non-additive facts. The second
        example presented here is a snapshot fact table

    What is Dimensional Modeling

            oIt is a logical design technique to structure the business dimensions and the
             metrics that are analyzed along these dimensions.
           o A logical design technique that seeks to present the data in a standard
             framework that is intuitive and allows for high performance access
           o It is inherently dimensional and adheres to a discipline that uses relational model
             with some important restrictions
           o The fundamental idea of dimensional modeling is that nearly every type of
             business data can be represented as a kind of cube of data
           o The model has also proved to provide high performance for queries and analysis
       Components of Dimensional Model
           o Fact Table
                    The fact table contains facts or measurements of the business
           o Dimension Table
                    The dimension tables contain textual attributes that describe the facts

    Hierarchies in Dimensions

    Multiple Hierarchies

            ♦   Dimension tables can represent multiple hierarchies roll-ups
            ♦   For example ,Store Dimension could have
            ♦          the following hierarchies
                    ► Physical Geography
                           • Zip, City, County, State, Country
                    ► Sales Organization
                           • District, Region, Zone
                    ► Distribution Roll-up
                           • Distribution Center, Distribution Center Region

   Store Dimension
   Store_key
   Store_description
   Store_type
   Zip
   City
   State
   Sales_region
Sales_zone
Distribution Center
Distribution Center
Region
Dimensional Modeling Framework

                             Identify Subject Area, Grain
                                                                    Conceptual
                                                                      Level
                          Identify Major Dimension & Facts,
                          Conform Dimensions across Facts




  Level of                   Detail Facts with Measures
   detail

                 Detail Dimensions with Hierarchies &Attributes    Logical Level
                 Slowly changing Dimensions Policies

                             Source-Data Model Mapping



                                                                   Physical Level
                     Pre-calculations, Aggregates, Indexes, Data
                     Structures, Source-Physical Model Mapping
STEP1: Identify the Business Process

       o   A major operational process that is supported by some kind of legacy system(s)
           from which data can be collected for the purpose of the data warehouse
       o   Example: orders, invoices, shipments, inventory, sales

STEP2 : Identify the Grain

       o   Choose the level of detail
       o   The fundamental lowest level of data represented in a fact table for the business
           process
       o   Example: individual transactions, individual daily snapshots

STEP3: Identify the Dimensions & Dimension Hierarchy

           Choose the dimensions that will apply to each fact table record


                Year                                       Region                              Category



             Quarter                                         State                              Brand


               Month                                       District                            Product
                                  Week

               Date                                           City




STEP4: Identify the Facts

       o   Choose the measured facts that will populate each fact table record

STEP5 : Conforming the dimensions

       o   Common dimensions across the Facts/ data marts have to be exactly same or
           subset of the main dimension table
Option 1: Identical dimensions with the same keys, labels, definitions and
values
                                       Item Key                                DATE KEY
                                                                               ITEM KEY
      Sales                            Item Desc.
                                       Brand Desc.                             STORE KEY
     Schema                            Category                                PROMO KEY

                                       ..                                      Sales Fact




                                       Item Key                                DATE KEY
        Inventory                                                              ITEM KEY
                                       Item Desc.
         Schema                        Brand Desc.
                                                                               STORE KEY
                                                                               Inventory Fact
                                       Category
                                       ..


■   Option 2: “Subset” of base dimension




                                        Item Key                  DATE KEY                      DATE KEY
                   Sales
                                                                  ITEM KEY
                  Schema                Item Desc.                                              Day-of-week

                                        Brand Desc.               STORE KEY                     Week Desc
                                        Category                  PROMO KEY                     Month Desc
                                        Desc.                     Sales $

                                        ..

                                       Item key Item Desc                   Brand Desc          Category Desc
                                       0001     Cheerios 10oz               Cheerios             Cereal

                   Forecast             Brand Key                Month Key                      Month KEY
                   Schema               Brand Desc.              Brand Key                      Month Desc

                                        Category                 Estimate

                                        Desc.                    Sales $

                                        ..

                                       Brand key        Brand Desc           Category Desc
                                       1001             Cheerios              Cereal
■   STEP6 : Adding Attributes to Dimension Tables
       ♦ Enhance the depth of analysis
       ♦ Examples Customer Age, Address, Profession, Product color, flavor, product
          size, packaging type etc.




                                                                  Regional
     Year                               Region                    Manager

                                                                      Category
    Quarter                              State

                                                                        Brand
    Month                               District
                   Week
                                                                       Product
     Date                                 City

                      Sequenc                           Phon                       Colo
CurrentFla                              Addres           e              Siz
                         e                                               e          r
    g                                     s

■   STEP7 : Storing Pre-calculations in the Fact table
       ♦ Calculated based on one or more base measures




■   STEP8 : Choosing the Duration of the Database
       ♦ Need for analyzing the data over a period of time
■   STEP9 : Track Slowly Changing Dimensions
       ♦ Certain kinds of dimension attribute changes need to be handled differently in
          Data Warehouse
               ■ Type I – Overwrite
               ■ Type II - History
               ■ Type III – Add new column example :- Organizational changes
               Retail Sales - Case Study

   ■    Background
           ♦ Chain consists of over 100 grocery stores in five states
           ♦ Stores average 60,000 SKUs in departments such as frozen foods, dairy etc.
           ♦ Bar codes are scanned directly into the cash registers PoS system
           ♦ Items are promoted via coupons, temporary price reductions, ads and in-store
               promotions
   ■    Analytical Requirements
           ♦ Need to know what is selling in store each day in order to evaluate product
               movement, as well as to see how sales are impacted by promotions
           ♦ Need to understand the mix of items in a consumer market basket



               ■     Identify the Business Process        ■    Sales

               ■     Identify the Grain                   ■    Transaction Item (Daily Sales)



               ■     Identify the Dimensions              ■    Date, Location, Item, Promotion



               ■     Identify the Measures                ■    Quantity, Price, Amount




        DATE KEY                                                           ITEM KEY

  Date Description                                                  Item Description
                                    DATE KEY
  Week                                                              Item Size
                                    ITEM KEY
  Month                                                             Package Type
                                    STORE KEY
  Quarter                                                           Category
                                    PROMOTION
  Year
                                    KEY
                                    POS TRXN#
                                     Sales Quantity
                                    Unit Sales Price
       STORE KEY
                                    Sales $ Amount
                                                                       PROMOTION KEY
 Store Name
 City                                                           Promotion Description
 District                                                       Discount
 Zone

Resultant Sales Schema
Time Dimension
    Date Key        Date        Day of Week    Day Number in         Month        Quarter           Year        Holiday
                                                   Month                                                       Indicator
       1          1/1/1999          Friday           1               January          Q1            1999        Holiday
       2          1/2/1999         Saturday          2               January          Q1            1999      Non-Holiday
       3          1/3/1999         Sunday            3               January          Q1            1999      Non-Holiday
       4          1/4/1999         Monday            4               January          Q1            1999      Non-Holiday


Item Dimension
    Itemkey          Item       SKU Number          Dept               Size      Package           Brand       Category
                 Description                                                       Type
       1       Lasagna 6 OZ        90706287103 Grocery          6 OZ           Box              Cold Gourmet Frozen Foods
       2       Beef Stew 6 OZ      16005393282 Grocery          6 OZ           Box              Cold Gourmet Frozen Foods
       3       Extra Nougat 2      46817560065 Grocery          6 OZ           Can              Chewy        Candy
               OZ




Promotion Dimension
Promo Key           Promo Name Price                     Ad Type          Media Type        Promo $        Begin Date     End Date
                                Reduction
1                   Blue Ribbon Temporary                Daily Paper      Paper             2000           1/1/1999       1/15/1999
                    Discounts
2                   Red Carpet  Markdown                 Sunday Paper Paper                 1000           1/3/1999       1/10/1999
                    Closeout
3                   Ad Blitz    None                     Paper and        Paper and         7000           1/15/1999      1/30/1999
                                                         Radio            Radio


Sales Fact
Date Key                Item Key       Store Key         Promo Key        POS Trxn #        Sales Qty      Unit Sales     Sales $Amt
                                                                                                           Price
1                       1              1                 15               763457893         1              4.59           4.59
1                       2              1                 1                763457893         2              0.89           1.78
1                       5              11                19               763457894         1              2.56           2.56
2                       13             5                 8                763457923         1              0.33           0.33
2                       5              11                12               763457998         1              1.29           1.29
     Factless Fact Table

A factless fact table is a table that contains nothing but dimensional keys.

          Purpose
          A tracking process or collecting status can be performed by using fact less fact tables.

Example
    Identifying product promotion events (to determine promoted products that didn’t sell)
    Tracking student attendance or registration events
    Tracking insurance-related accident events
    Identifying building, facility, and equipment schedules for a hospital or university
    No of Accident for a Months\
    No of Policy has been closed this months

This type of fact table itself can be used to generate the useful reports. You can count the
number of occurrences with various criteria. For example, you can have a factless fact table to
capture the student attendance (the example used by Ralph). The following questions can be
answered:

       Which class has the least attendance?
       Which teachers taugh the most students?
       What is the average number of attendance of a given course?

All the queries are based on the COUNT() with the GROUP BY queries. I think that the
interesting metrics are the nested GROUP BY so you can first count and then apply other
aggregate functions such as AVERAGE, MAX, MIX.
Granularity

The first step in designing a fact table is to determine the granularity of the fact table. By
granularity, we mean the lowest level of information that will be stored in the fact table. This
constitutes two steps:

    1. Determine which dimensions will be included.
    2. Determine where along the hierarchy of each dimension the information will be kept.

The determining factors usually goes back to the requirements.

Which Dimensions To Include

Determining which dimensions to include is usually a straightforward process, because business
processes will often dictate clearly what are the relevant dimensions.

For example, in an off-line retail world, the dimensions for a sales fact table are usually time,
geography, and product. This list, however, is by no means a complete list for all off-line retailers.
A supermarket with a Rewards Card program, where customers provide some personal
information in exchange for a rewards card, and the supermarket would offer lower prices for
certain items for customers who present a rewards card at checkout, will also have the ability to
track the customer dimension. Whether the data warehousing system includes the customer
dimension will then be a decision that needs to be made.

What Level Within Each Dimensions To Include

Determining which part of hierarchy the information is stored along each dimension is a bit more
tricky. This is where user requirement (both stated and possibly future) plays a major role.

In the above example, will the supermarket wanting to do analysis along at the hourly level? (i.e.,
looking at how certain products may sell by different hours of the day.) If so, it makes sense to
use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient,
then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the
larger the data amount in the fact table, the granularity exercise is in essence figuring out the
sweet spot in the tradeoff between detailed level of analysis and data storage.

Note that sometimes the users will not specify certain requirements, but based on the industry
knowledge, the data warehousing team may foresee that certain requirements will be forthcoming
that may result in the need of additional details. In such cases, it is prudent for the data
warehousing team to design the fact table such that lower-level information is included. This will
avoid possibly needing to re-design the fact table in the future. On the other hand, trying to
anticipate all future requirements is an impossible and hence futile exercise, and the data
warehousing team needs to fight the urge of the "dumping the lowest level of detail into the data
warehouse" symptom, and only includes what is practically needed. Sometimes this can be more
of an art than science, and prior experience will become invaluable here.
Top-Down(DWH->DataMart)

Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An
enterprise has one data warehouse, and data marts source their information from the data
warehouse. In the data warehouse, information is stored in 3rd normal form.

Bottom up(Data Mart->DWH)
Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the
enterprise. Information is always stored in the dimensional model.
Bridge Tables

Provides many to many relationship.


In the case of a property transaction (aka a property “sale”), you are potentially dealing with several many-
to-many relationships.
First of all, you may not be buying alone (one-to-many buyers).
You may also not be buying your house from a single person (one-to-many sellers).
Your house may be on two parcels of land or no land at all if you are buying an apartment or condo (none-
to-many parcels).
A barn or in-law apartment may be included or perhaps you are just buying land to build your dream home
later (none-to-many buildings).


These and other complexities make the case for bridge tables to represent the many-to-many relationships
inherent in a complex transaction such as this.
When you buy land or a house (in the US, at least!) you must “record” the sale with the clerk of the
municipality in which the property exists.
This recording becomes part of the chain of ownership and tax record for the property.
This recording can accommodate multiple grantors (the people or organizations selling the property),
grantees (those who are buying the property), and properties (multiple parcels, land lines, lots, buildings,
outbuildings, units, etc).
Lookup Tables

Dimension tables in schema
              Top-Down                                         Bottom-Up                                          Hybrid                                       Federated
                                                                                   Major Characteristics
· Emphasizes the DW.                              · Emphasizes data marts.                        · Emphasizes DW and data marts; blends        · Emphasizes the need to integrate new and
· Starts by designing an enterprise model for     · Starts by designing a dimensional model       ―top-down‖ and ―bottom-up‖ methods.           existing heterogeneous BI environments.
a DW.                                             for a data mart.                                · Starts by designing enterprise and local    · An architecture of architectures.
                                                  · Uses a ―flat‖ architecture consisting of a    models synchronously.                         · Acknowledges the reality of change in
· Deploys multi-tier architecture comprised of    staging area and data marts.                    · Spends 2–3 weeks creating a high-level,     organizations and systems that make it
a staging area, a DW, and ―dependent‖ data        · The staging area is largely non-persistent.   normalized, enterprise model; fleshes out     difficult to implement a formalized
marts.                                            · Data marts contain both atomic and            model with initial marts. · Populates marts   architecture.
· The staging area is persistent.                 summary data.                                   with atomic and summary data via a non-       · Rationalizes the use of whatever means
· The DW is enterprise-oriented; data marts       · Data marts can provide both enterprise and    persistent staging area.                      possible to implement or integrate analytical
are function-specific.                            function-specific views.                        · Models marts as one or more star            resources to meet changing needs or
· The DW has atomic-level data; data marts        · A data mart consists of a single star         schemas.                                      business conditions.
have summary data.                                schema, logically or physically deployed.       · Uses ETL tool to populate data marts and    · Encourages organizations to share
· The DW uses an enterprise-based                 · Data marts are deployed incrementally and     exchange meta data between ETL tool and       dimensions, facts, rules, definitions, and data
normalized model; data marts use a subject-       ―integrated‖ using conformed dimensions.        data marts.                                   wherever possible, however possible.
specific dimensional model.                                                                       · Backfills a DW behind the marts when
· Users can query the data warehouse and                                                          users want views at atomic level across
data marts.                                                                                       marts; instantiates the ―fleshed out‖
                                                                                                  enterprise model, and moves atomic data to
                                                                                                  the DW.



                                                                                                  Pros
· Enforces a flexible, enterprise architecture.   · Focuses on creating user-friendly, flexible   · Provides rapid development within an        · Provides a rationale for ―band aid‖
· Once built, minimizes the possibility of        data structures.                                enterprise architecture framework.            approaches that solve real business
renegade ―independent‖ data marts.                · Minimizes ―back office‖ operations and        · Avoids creation of renegade ―independent‖   problems.
· Supports other analytical structures in an      redundant data structures to accelerate         data marts.                                   · Alleviates the guilt and stress data
architected environment, including data           deployment and reduce cost.                     · Instantiates enterprise model and           warehousing managers might experience by
mining sets, ODSs, and operational reports.       · No drill-through required since atomic data   architecture only when needed and once        not adhering to formalized architectures.
· Keeps detailed data in normalized form so       is always stored in the data marts.             data marts deliver real value.                · Provides pragmatic way to share data and
it can be flexibly re-purposed to meet new        · Creates new views by extending existing       · Synchronizes meta data and database         resources.
and unexpected needs.                             stars or building new ones within the same      models between enterprise and local
· Data warehouse eliminates redundant             logical model.                                  definitions.
extracts.                                         · Staging area eliminates redundant             · Backfilled DW eliminates redundant
                                                  extracts.                                       extracts.
                                                                                                 Cons
· Upfront modeling and platform deployment       · Few query tools can easily join data across   · Requires organizations to enforce standard   · The approach is not fully articulated.
mean the first increments take longer to         multiple, physically distinct marts.            use of entities and rules.                     · With no predefined end-state or
deploy and cost more.                            · Requires groups throughout an                 · Backfilling a DW is disruptive, requiring    architecture in mind, it may give way to
· Requires building and managing multiple        organization to consistently use dimensions     corporate commitment, funding, and             unfettered chaos.
data stores and platforms.                       and facts to ensure a consolidated view.        application rewrites.                          · It might encourage rather than reign in
· Difficult to drill through from summary data   · Not designed to support operational data      · Few query tools can dynamically query        independent development and perpetuate
in marts to detail data in DW.                   stores or operational reporting data            atomic and summary data in different           the disintegration of standards and controls.
· Might need to store detail data in data        structures or processes.                        databases.
marts anyway.



                                                                                     Major Proponents
Bill Inmon and co-authors                        Ralph Kimball and co-authors                    Many practitioners                             Doug Hackney
Junk Dimension
Junk Dimension also called as garbage dimension.
A garbage dimension is a dimension that consists of low-cardinality columns such as
codes, indicators, status,and flags. The garbage dimension is also referred to as a junk
dimension. Attributes in a garbage dimension are not related to any hierarchy

A junk dimension is a collection of random transcational codes flags and text attributes that
are unrelated to any particular dimension.The junk dimension is simply a structure that
provides the convienent place to store the junk dimension.




Degenerate dimension

Degenerate dimension is a Dimension which has only a single attribute.

This dimension is typically represented as a single field in a fact table.

The data items thar are not facts and data items that do not fit into the existing dimensions
are termed as Degenerate Dimensions.

Degenerate Dimensions are the fastest way to group similar transactions.

Degenerate Dimensions are used when fact tables represent transactional data.

They can be used as primary key for the fact table but they cannot act as foreign keys

The degenerate Dimension are keyes of a fact table which do not the corresponding dimesion
table.

Example:

it is generated at the time of transaction   like invoice no this is generated when the invoice is
raised
Surrogate key
application generated unique key value for each record, in place of the
natural key of the record.

surrogate keys are keys that are maintained within the data warehouse instead of keys taken
from source data systems. There are several reasons for the use of surrogate keys:

surrogate key is a system-generated (non-meaningful from a business perspective) primary key
for purposes of ensuring uniqueness within a database table

      Purpose

Fill in a data warehouse dimension table with data which comes from different source systems
and assign a unique record identifier (surrogate key) to each record.




Another benefit you can get from surrogate keys (SID) is :

Tracking the SCD - Slowly Changing Dimension.

classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would
be in your Employee Dimension). This employee has a turnover allocated to him on the
Business Unit 'BU1' But on the 2nd of June the Employee 'E1' is muted from Business Unit
'BU1' to Business Unit 'BU2.' All the new turnover have to belong to the new Business Unit
'BU2' but the old one should Belong to the Business Unit 'BU1.'

If you used the natural business key 'E1' for your employee within your datawarehouse
everything would be allocated to Business Unit 'BU2' even what actualy belongs to 'BU1.'

If you use surrogate keys, you could create on the 2nd of June a new record for the Employee
'E1' in your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the SID of the
Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee
'E1' + 'BU2.'

You could consider Slowly Changing Dimension as an enlargement of your natural key: natural
key of the Employee was Employee Code 'E1' but for you it becomes
Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the difference with the
natural key enlargement process, is that you might not have all part of your new key within
your fact table, so you might not be able to do the join on the new enlarge key -> so you need
another id.

CUBE
Cubes are logical representation of multidimensional data.The edge of the cube contains
dimension members and the body of the cube contains data values
What are the steps to build the data warehouse?

1. Requirement gathering (involved in face to face discussion with the
clients regarding what is their requirement)

2. Data modeling (where we fix the schema design for warehouse)

3. Data profiling

4. Data cleansing

5. ETL ( 1. Extracting the transactional data from the data sources into a
staging area 2. Transforming the transactional Data 3. Loading the transformed
data into a dimensional database )

6. Reporting 1. Building pre-calculated summary values to speed up report
generation 2. Building (or purchasing) a front-end reporting tool

								
To top