The Data Warehouse by sanmelody


									Lecture 3

               Themes in this session

 • Basics of the multidimensional data model and star-
   join schemata
 • The process of, and specific design issues in,
   multidimensional data modelling
Why use Multidimensional
 • Simple, intuitive design basis
    – easy to create a multidimensional model
    – easy to communicate the meaning of the model
    – easy to gain an overview of the model
 • A logical model which can be implemented in a variety of
    – relational databases
    – multidimensional databases
    – object-oriented databases
 • Supports the reporting and analytical needs of business
 • Tried and tested
General structure for a multidimensional
 • A central fact table, referred to as a multidimensional
   data subject
 • Surrounding dimension tables, referred to as single
   dimensional data subjects
 • Joins connecting the fact table and its surrounding
   dimension tables. Only one join per dimension table
 • A concatenated or multipart key in the fact table which
   is comprised of one key from each dimension table.

 NOTE: the multidimensional model represents a n-
  dimensional matrix, with n being the number of
The fact table

 • Measures or facts
   – reflect focal events or snapshots of states of being
    – vary continuously over time
 • All the facts have a specific granularity
 • Facts should ideally be additive but this is not
   always the case
 • A set of foreign keys constituting a concatenated
 • Contains the major volume of data
The dimension tables

 • Dimensions are often referred to as causal dimensions,
   they contain the causal factors responsible for the
   collected measures
 • The time dimension is not a causal dimension, it is
   however one of the most important dimensions for
   structuring and analysing data
 • The dimension tables contain dimensional attributes,
   these are usually textual and discrete and must have a
   relevant business meaning
 • Good dimensional attributes are stable across time
 • If the attributes are connected in one or more
   hierarchies then these are usually captured in the
   dimension table
Skinny fact tables

 • As the fact table contains the vast volume of
   records it is important that it is memory space
 • Foreign keys are usually represented in integer
   from and do not require much memory space
 • Facts too are often numeric properties and can
   usually be represented as integers (contrast to
   dimensional attributes which are usually long text
 • This space efficiency is critical to the memory
   space consumption of the data warehouse

 • Lowest level of aggregation is determined by the
   granularity of the fact table.
 • Aggregations can be created on-the-fly or by the
   process of pre-aggregation
 • Pre-aggregation demands more storage space but
   provides better query performance
 • Aggregation is easier when facts are all additive

 • The matrices, represented by multidimensional
   models are often 99% sparse.
 • Sparsity is dealt with by simply not creating
   records for the cells that are not filled in the
   matrix. If nothing has happened no record is
 • Pre-aggregation and storage of aggregates can
   however lead to sparsity failure which places large
   demands on data storage

 In 3rd normal from all mutually independent and fully
   dependent on the primary key

 • The fact table is by nature highly normalised in a
   star-join schema
 • Dimensions are however usually not normalised
Snowflakes and normalised dimension
 • “Any attempts to normalise dimension tables in
   order to save disk space are a waste of time”
 • Affects the intuitive understandability of the
 • Normalised dimension tables destroy the ability to
 • Normalised tables demand extra joins and the
   querying of snowflakes take s longer than the
   querying of standard star-join schemata
The process of multidimensional data
What to focus on in MDM

 •   Query optimised database
 •   Whole business entities
 •   Key business activities and influences
 •   Transaction history
 •   People, places and things
 •   Time
 •   Dimension and rollup
Basic steps in modelling

 • Select a business subject area
 • Identify which business process(es) is being modelled
 • Identify the basic measures or facts
 • Determine at what level of detail (granularity) active
   analysis is conducted
 • Determine what the measures have in common (identify
   the dimensions)
 • Identify the relevant attributes in the various
 • Determine if the attributes are stable or variable over
   time and if their cardinality is bounded or unbounded
Identifying facts

 • The purpose of the analyst must be supported by
   the facts in the fact table, there must be measures
   which have relevance to the business goals which
   the organisation seeks to fulfil
 • Facts are by nature dynamic and variable over time
 • They do not have a limited cardinality
 • Facts have their origins in the working of the
   organisation and the activities it performs
Identifying dimensional attributes

 • Dimensional attributes are those predictive
   variables business users believe are of significance
   to the measures in the fact table
 • Dimensional attributes are often present in
   hierarchies in the causal dimensions
 • Dimensional attributes usually have a limited
   cardinality and are non-variable across time.
Supporting multiple hierarchies in
 • Dimensions should be able to support multiple
   independent hierarchies
 • Alternate hierarchies are easily supported
 • Cyclic paths in two hierarchies demand that the
   hierarchies be split into two entirely separate
   hierarchies or even two separate dimensions
 • A dimension can also contain attributes that do not have
   any hierarchical relationships to the other attributes in
   the dimension

 NOTE: Any of the attributes, whether in a hierarchy
  or not, can be used in the drill down process
Factless fact tables

 • Some fact tables quite simply have no measured facts!
 • Often used to represent many-to-many relationships
 • The only thing they contain is a concatenated key, they
   do still however represent a focal event which is
   identified by the combination of conditions referenced
   in the dimension tables
 • There are two main types of factless fact tables:
    – event tracking tables
    – coverage tables
Dealing with many-to-many relationships
among dimensional attributes

 • Many to many relationships are difficult to deal
   with in a any database design situation. Great
   efforts should be taken to identify any in the data
 • When creating a MDM it is necessary to separate
   the two entities and capture their relationship in a
   factless fact table
Dealing with semi-additive and non-
additive facts
 • Semi-additive facts are those which are not additive
   across all dimensions
    – warn users
    – prohibit the addition of these facts across the relevant
 • Non-additive facts are not additive across any
    – most ratios and all measures depicting snapshots of a state fall
      into this class
    – in some cases other calculatory methods can be used to
      aggregate these measures
       • average over the number of time periods
       • calculate the ratio of the sums not the sum of the ratios
Degenerate dimensions

 • A dimension which has been cannibalised by other
 • Represented as a dimension key without a
   corresponding dimension
 • The key to this dimension still serves as a vital
   element for the grouping of facts
 • This often happens when the grain in the fact
   tables represents actual working documents
Special mention - handling unit prices and
 • Could usually be considered to be dimension attributes
   of a product
 • However, every price change would then result in a
   change in one of the dimensions (remember these are
   supposed to be stable)
 • Instead they are included as non-additive facts in the
   fact table. These are often used in a lot of derived
   measures which are presented to the business user in a
   specially created view
 • This is once again a snapshot of a state, the state being
   the cost or price of the goods at a specific point in time
Depicting processes and value chains as
families of star-join schemas

 • Kimball speaks of their being two sides to the value
    – the demand side - the steps needed to satisfy the customers’
      demand for the product
    – the supply side - the steps needed to manufacture the products
      from original ingredients or parts
 • The chain consists of a sequence of inventory and flow
   star-join schemata
 • joining the different star-join schemata is only possible
   when two sequential schemata have a common, identical
 • Sometimes the represented chain can be extended
   beyond the bounds of the business itself
A family of stars




     schema 1

                schema 2
                           schema 3
                                          schema 4
Creating mini-dimensions for really large
 • Many dimension attributes are used very frequently as
   browsing constraints, in big dimensions these constraints can
   be hard to “find” among the lesser used ones
 • Logical groups of often used constraints can be separated
   into small dimensions which are very well indexed and easily
   accessible for browsing
 • All variables in these mini-dimensions must be presented as
   distinct bands or classes
 • The key to the mini-dimension can be places as a foreign key
   in both the fact table and dimension table from which it has
   been broken off
 • Mini-dimensions, as their name suggests, should be kept
   small and compact
Slowly changing dimensions

 • Most dimensions are not constant over time
 • Most dimensions are however almost constant over time
 • Almost constant dimensions are referred to as slowly-
   changing dimensions
 • There are three main methods of handling slowly
   changing dimensions:
    – Overwriting
    – Creating additional dimensional records
    – Creating new current fields within the original dimension
 Note: One of the key functions of the data warehouse is to
   track events over extended periods of time. The validity of
   the data warehouse is thus dependent on how well changes in
   the its dimensions are tracked.
Slowly changing dimensions (1)

 • The dimensional attribute record is overwritten with
   the new value
 • No changes are needed elsewhere in the dimension
 • No keys are affected anywhere in the database
 • Very easy to implement but the historical data is now
 • Two basic questions need to be asked before
   overwriting a dimension attribute:
    – How important is the value to the end-users analysis needs?
    – How important is the tracking of history?
Slowly changing dimensions (2)

 • Introduce a new record for the same dimensional
   entity in order to reflect its changed state
 • A new instance of the dimensional key is created
   which references the new record
 • In order to is best dealt with by using version
   digits at the end of the key. These allow up to 100
   snapshots of a changing dimensional entity
 • All these keys need to be created, maintained and
   managed by someone and tracked in the metadata
 • The database maintains its consistency and the
   versions can be said to partition history
Slowly changing dimensions (3)

 • Use slightly different design of dimension table which
   has fields for:
    – original status of dimensional attribute
    – current status of dimensional attribute
    – an effective date of change field
 • This allows the analyst to compare the as-is and as-was
   states against each other
 • Only two states can be traced, the current and the
 • Some inconsistencies are created in the data as time is
   not properly partitioned
Special mention - Heterogeneous
 Some products have many, many distinguishing attributes
   and many possible permutations (usually on the basis of
   some customised offer). This results in immense
   product dimensions and bad browsing performance
 • In order to deal with this, fact tables with
   accompanying product dimensions can be created for
   each product type - these are known as custom fact
 • Primary core facts on the products types are kept in a
   core fact table
 • The core facts are copied in each of the customer fact
Pre-aggregated data in SJS

 • How to deal with aggregates is one of the biggest
   issues in the design of a DW
 • Choice of pre-aggregation/on-the fly aggregation
   has great relvance to data storage and query
 • Two main strategies exist for the creation of
   – the creation of new fact tables for aggregates
   – the creation of new level fields for aggregates

To top