data modeling training notes by jyotheeshct

VIEWS: 18 PAGES: 29

									                                          Data Modeling
Data Warehouse Defined

“A data warehouse is a collection of corporate information, derived directly from operational
   systems and some external data sources. Its specific purpose is to support business
   decisions, not business operations”

Characteristics of a DW
•   Subject-oriented Data
    –   collects all data for a subject, from different sources
•   Read-only Requests
    –   loaded during off-hours, read-only during day hours
•   Interactive Features, ad-hoc query
    –   flexible design to handle spontaneous user queries
•   Pre-aggregated data
    –   to improve runtime performance
•   Highly denormalized data structures
    –   fat tables with redundant columns

STAGING AREA - SOME CLARITY
•   Staging Area
    –   optional
    –   to cleanse the source data
    –   Accepts data from different sources
    –   Data model is required at staging area
    –   Multiple data models may be required for parking different sources and for transformed
        data to be pushed out to warehouse

ODS - SOME CLARITY

•   Operational Data Store
    –   Optional
    –   Granular, detailed level data
    –   May feed warehouse (eg when warehouse is aggregated)
    –   Usually a relational model
    –   May keep data for a smaller time period than warehouse

Data Modeling

WHAT IS A DATA MODEL???
A data model is an abstraction of some aspect of the real world (system).

WHY A DATA MODEL???
    –   Helps to visualize the business
    –   A model is a means of communication.
     –   Models help elicit and document requirements.
     –   Models reduce the cost of change.
     –   Model is the essence of DW architecture based on which DW will be implemented

What do we want to do with the data?
Model depends on what kind of data analysis we want to do:
•   Different Data Analysis Techniques
     –   Query and reporting
          • Display Query Results
     –   Multidimensional analysis
          • Analyze data content by looking at it in different perspectives
     –   Data mining
          • discover patterns and clustering attributes in data
Impact of Data Analysis Techniques on DM
•   Query and reporting
        • Normalized data model
        • Select associated data elements
        • summarize and group by category
        • present results
        • direct table scan
        • ER with normalized / denormalized appropriate
Query and reporting




Requirements of a Decision Support Query Environment
•   To provide a method for testing hypothesis (eg. what if ….)
•   To allow ad-hoc queries
•   To allow human input (DSS makes decisions with users )
•   Expects user knowledge of problem
•   To simulate the behavior of a real-world problem
Impact of Data Analysis Techniques on DM
•   Multidimensional analysis
          • Fast and easy access to data
          • Any number of analysis dimensions in any combinations
          • ER will mean many joins
          • Dimensional model appropriate
Multidimensional Analysis




Data Mining
•   Data Mining
         • discovers unusual patterns
         • requires low level of detail data
DW Architectures
•   Architecture Choices depend on
     –   Current infrastructure
     –   Business environment
     –   Desired management and control structure
     –   Resources
     –   Commitment …..
•   Data Warehouse/data mart

DW Architectures
•   Architecture Choices determine
     –   Where will DW reside?
          • Centrally / locally / distributed
     –   Where will it be managed from?
          • Centrally / independently
•   3 choices
         • Global
         • Independent
         • Interconnected
or a combination of the three

DW Architectures
•   Global Architecture
    –   related to scope of data access and storage
    –   does not mean centralized
    –   can be physically centralized or distributed
    –   enterprise view of data
    –   time-consuming & costly to implement

Global Architecture




DW Architectures




•   Independent Architecture
    –   stand-alone
    –   controlled by a department
    –   minimal integration
    –   no global view
    –   very fast to implement

DW Architectures
•   Interconnected Architecture
    –   distributed
    –   integrated and interconnected
    –   gives a global view of enterprise
    –   more complexity
         • who manages / controls data
         • another tier in architecture to share common data between multiple data marts
         • have a data sharing schema across data marts
Independent and Interconnected Architecture




Types of Data Warehouse
•   Enterprise Data Warehouse
•   Data Mart

Enterprise data warehouse
•   Contains data drawn from multiple operational systems
•   Supports time- series and trend analysis across different business areas
•   Can be used as a transient storage area to clean all data and ensure consistency
•   Can be used to populate data marts
•   Can be used for everyday and strategic decision making

Data Mart
•   Logical subset of enterprise data warehouse
•   Organized around a single business process
•   Based on granular data
•   May or may not contain aggregates
•   Object of analytical processing by the end user.
•   Less expensive and much smaller than a full blown corporate data warehouse.

Distributed and Centralized Data warehouses
•   DW sitting on a monolithic machine - unrealistic
•   Separate machines, different OS, different DB systems - reality

Solution
•  Share a uniform architecture to allow them to be fused coherently
Classical Architectures
•   Physical data warehouse (physical)
     –   Data warehouse --> data marts
     –   Data marts --> data warehouse
     –   Parallel data warehouse and data marts

DW Implementation Approaches
•   Top Down
•   Bottom-up
•   Combination of both
•   Choices depend on:
     –   current infrastructure
     –   resources
     –   architecture
     –   ROI
     –   Implementation speed

DW Implementation Approaches

Top Down
• More planning and design initially
• Involve people from different work-groups, departments
• Data marts may be built later form Global DW
• Overall data model to be decided up-front

Bottom Up
•  Can plan initially without waiting for global infrastructure
•  built incrementally
•  can be built before or in parallel with Global DW
•  Less complexity in design

Combined Approach
•   Determine degree of planning and design for a global approach to integrate data marts being
    built by bottom-up approach
•   Develop base level infrastructure definition for global DW at business level
•   Develop plan to handle data elements needed by multiple data marts
•   Build a common data store to be used by data marts and global DW

Levels of modeling
•   Conceptual modeling
     –   Describe data requirements from a business point of view without technical details
•   Logical modeling
     –   Refine conceptual models
     –   Data structure oriented, platform independent
•   Physical modeling
     –   Detailed specification of what is physically implemented using specific technology
Conceptual Model
•   A conceptual model shows data through business eyes.
•   All entities which have business meaning.
•   Important relationships
•   Few significant attributes in the entities.
•   Few identifiers or candidate keys.

Sample conceptual model



                                 Products




                                 Customer
                                                       Customers
                                 Invoices

     Sales Reps




                                Customer
                                Addresses


                                                           Sample
                                                          Conceptual
                                Geographic                  Model
                                Boundaries


Logical Model
•   Replaces many-to-many relationships with associative entities.
•   Defines a full population of entity attributes.
•   May use non-physical entities for domains and sub-types.
•   Establishes entity identifiers.
•   Has no specifics for any RDBMS or configuration.
                      CUSTOMER INVOICE
                      #INVOICE ID
                      #LINE ITEM SEQ
                      .INVOICE DATE




                      the bill for                                           the bill sent to          the bill purchased by


                    purchased                                               purchased at               purchased by
                        by
PRODUCT                                                CUSTOMER ADDRESS                              CUSTOMER
#PRODUCT CODE                                          #CUSTOMER ID                                  #CUSTOMER ID
.PRODUCT DESCRIPTION                                   #ADDRESS ID                                   #SNAPSHOT DATE
                                                                                                     .CUSTOMER NAME
                                                for the         for the       located within
                      sold by                  customer       customer
                                              managed by      sold to by
                   the salesman            the sales         the salesman           the general location of
                         for              manager for              for
                            SALES REP                                          GEOGRAPHIC
                            #SALES REP ID                                      BOUNDARY
                                                                               #GEO CODE




                                            Sample Logical Model

   Physical Model

   •   A Physical data model may include
        –   Referential Integrity
        –   Indexes
        –   Views
        –   Alternate keys and other constraints
        –   Tablespaces and physical storage objects.

                                                                        PRODUCTS

                                                              # PRODUCT_CODE
                                                              PRODUCT_DESCRIPTION
                                                              CATEGORY_CODE
                                                              CATEGORY_DESCRIPTION



               SALES_REPS
                                                               CUSTOMER_INVOICES                                         CUSTOMERS
        #SALES_REP_ID
                                                               #INVOICE_ID                                         #CUSTOMER_ID
        LAST_NAME
                                                               #LINE_ITEM_SEQ                                      #SNAPSHOT_DATE
        FIRST_NAME
                                                               INVOICE_DATE                                        CUSTOMER_NAME
        o MANAGER_FIRST_NAME
                                                               CUSTOMER_ID                                         o AGE
        o MANAGER_LAST_NAME
                                                               BILL_TO_ADDRESS_ID                                  o MARITAL_STATUS
                                                               SALES_REP_ID                                        CREDIT_RATING
                                                               MANAGER_REP_ID
                                                               ORGANIZATION_ID
                                                               ORG_ADDRESS_ID
                                                               PRODUCT_CODE
                                                               QUANTITY
                                                               UNIT_PRICE
                                                               AMOUNT                                                 Sample Physical
                                                               oPRODUCT_COST
                                                               LOAD_DATE                                                     Model

                                     CUSTOMER_ADDRESSES                                                GEOGRAPHIC_BOUNDARIES
                                     #CUSTOMER_ID                                                      #GEO_CODE
                                     #ADDRESS_ID                                                       CITY_NAME
                                     ADDRESS_LINE1                                                     STATE_NAME
                                     o ADDRESS_LINE2                                                   COUNTRY_NAME
                                     o POSTAL_CODE                                                     o CITY_ABBRV
                                     SALES_REP_ID                                                      o STATE_ABBRV
                                     GEO_CODE                                                          o COUNTRY_ABBRV
                                     LOAD_DATE




   Data Architecting

   What is data architecting???
Structure and locate data according to its characteristics

3 Basic types of data
     –   Real time data
     –   Derived data
     –   Reconciled data

Data Architecting-Real time data

•   Represents current status of business
•   Used by operational systems to run business
•   Changes as operational transactions are processed
•   Very detailed, high level of granularity

Data Architecting - Real time data

To use Real time data in DW:
Must be
        • Cleansed (comes from different sources, cleansed to to ensure data consistency
          quality)
        • Summarized (because it contains individual, transactional,detailed)
        • Transformed
into an easily understandable format for manipulation by analysts
Eg. Different units of measure, currency, exchange rates

Data Architecting - Derived data

•   Data created by summarizing, aggregating, averaging real-time data through some process
•   represents a view of business data at a specific time
•   Historical record of business over a period
•   Precalculate derived data elements and summarize detailed data to improve query processing

Data Architecting - Reconciled data

•   Real-time data cleansed, adjusted, enhanced to provided integrated source of data for
    analysis
•   Create and maintain historical data while reconciling
•   Normally not explicitly defined
•   Logical result of derivation operations
•   May be stored as temporary files used to transform operational data for consistency

Enterprise Data Model (EDM)

•   Consistent definition of data elements common to a business
•   High-level business view
•   Generic logical data model
•   Physical data design


EDM - The Phased Enterprise Data Model




Enterprise Data Model (EDM)

Phases
Increasing order of Information required
•   Information Planning
•   Business Analysing
•   Logical Data Modeling
•   Physical Data design

Enterprise Data Model (EDM)
Information Planning
• Consolidated view of the business
• Identify some business concepts (20-30)
• called subject areas / super entity/ business entity in which the organization is interested
   Eg.customer, product
Purpose
    – To set up scope and architecture of DW
    – To provide a single comprehensive point of view
Enterprise Data Model (EDM)

Business Analyzing
• Define contents of primary business concepts.
• Gather and arrange business requirements
• Defines business terms
Purpose
   – To set up scope and architecture of DW
   – To provide a single comprehensive point of view
Logical Data Modeling
•    Enterprise-wide in scope
•    consists of several entities, relationships, attributes
• complete model in 3rd Normal Form.
Can be divided into 2 types:
      –   Generic logical data model (enterprise level)
      –   Logical application model (application level)

Physical Data Design
•    space
•    performance
•    physical distribution of data

Purpose:
To design for the physical implementation

Is it possible to draw an EDM ???

Not always!!

Phased approach OR a simple EDM
•          list of subject areas (<25)
•          define business relationships between subject areas
•          define contents of each subject area

Granularity

    More the detail Lower the granularity

Why is it important in DW???

     Opportunity for TRADE-OFF
     performance         vs.   volume of data stored ability to access detailed data vs. cost of
     storage

Granularity
Granularity

To overcome trade-offs between data volume and query capability :
•   Divide the data in the DW
•   Create 2 levels of granularity of data
•   Detailed Raw data
     –   keep it on separate storage medium
     –   load when required
•   Summarized data

Data Partitioning Model - WHY?
To understand, maintain and navigate a DW

TYPES of Partitioning
•   Logical and Physical

Logical Partitioning - WHY?

Goals:
• Provide flexible access to data
•   Provide easy and efficient data management services
•   Ensure scalability of the data warehouse
•    Enable elements of the data warehouse to be portable. Some elements of the DW can be
    shared with other physical warehouses or archived on other storage media.

Data Partitioning Model - Logical Partitioning

•   Partition large volumes of data by splitting
•   Helps to make data easier to:
           • Restructure
           • Index
           • Sequentially scan
           • Reorganize
           • Recover
           • Monitor
Logical Partition - HOW??

Criteria
•   Time period (date, month, or quarter)
     –   almost always chosen
•   Geography (location)
•   Product (more generically, by line of business)
•   Organizational unit
•   A combination of the above

Data Partitioning Model -Subject Areas

Subject areas classified by the topics of interest to the business.
•   5W1H rule
     – when, where, who, what, why, and how
eg. ‘who’ could be customer, employee, manager, supplier, business partner, competitor.
•   Get a candidate list of subject areas
•   Decompose, rearrange, select, redefine in more detail

•   Define the business relationships among subject areas
•   This will determine the dimensions used
•   Subject Areas help define criteria like:
        • Unit of the data model
        • Unit of an implementation project
        • Unit of management of the data
        • Basis for the integration of multiple mplementations unit for analysis should be
          business process

Data Modeling - Techniques

•   STAGING AREA
     –   YES ! (maybe multiple data models are required)
•   ODS
     –   YES !
•   DATAWAREHOUSE/DATAMART
     –   YES!

Data Modeling - Techniques
•   Modeling techniques

     –   E-R Modeling
     –   Dimensional Modeling

Implementation and modeling styles
•   Modeling versus implementation
     –   Modeling: describe what should be built to non-technical folks
     –   Implementation: describe what is actually built to technical folks
•   Relational modeling
     –   Use for implementation
     –   Difficult to understand by non-technical folks
•   Dimensional modeling
     –   Use for modeling during analysis and design phases
     –   Can be implemented using other modeling styles e.g. object-oriented, relational

E-R Modeling
•   Produces a data model, using two basic concepts entities and the relationships between those
    entities.

•   Detailed ER models also contain attributes, which can be properties of either the entities or the
    relationships.


Conventions used in E-R modeling
•   Entities
•   Attributes
•   Relationships or Associations

Entities
•   Principal data objects about which information is to be collected.
•   Usually recognizable concepts such as person, things, or events.
•   Examples : EMPLOYEES, PROJECTS, INVOICES.

Attributes & Relationships
•   Attributes describe the entity of which they are associated.
•   A relationship represents an association between two or more entities. An example :
     –   Employees are assigned to projects
     –   Departments manage one or more projects.

Types of Data Relationships - Cardinality
•   One - One              1: 1
•   One - Many             1: m
•   Many - Many            m:n
•   Recursive data relationship

Normalization
•   Remove data redundancy
•   0 NF - contains repeating values
•   1 NF - No repeating values
•   2 NF - Every attribute is dependent on the key, the whole key and nothing but the key
•   3 NF - No non-key attribute is functionally dependent on another non-key attribute
•   Denormalization - carefully introduced redundancy to improve query performance

Normalization - 1NF
•    Eliminate Repeating groups

Person                                       Skills
A                                    Oracle, DB2
B                                    MS Access, Oracle
C                                    Oracle, CICS, SQL
D                                    DB2, CICS

Who are the ones who have DB2 skills???

Normalization - 2NF
•    Eliminate Redundant data

Skill ID Skill Description

S1         DB2
S2         Oracle
S3         MS Access
S4         CICS
S5         SQL


Normalization - 3NF
•    Eliminate Columns Not Dependent On Key

Memb ID             Skill ID         Comp ID.         Comp Name        Location
A                   S1               D1               Core Tech        HYD


Relational modeling
•    Represents business entities, data items associated with each entity, and the relationships of
     business interest among the entities
•    Entities are usually broken down into smallest possible units and combined using relationships
•    Diagram looks like a spiderweb

Entity Completeness Checklist
•    Name
      –   to describe the data contained
      –   to meet naming conventions/standards
•    Description
      –   to describe precisely what the entity represents
      –   required for sharing and reuse of data model components
•    Category
      –   classifies entities sharing common characteristics

Entity Completeness Checklist (contd.)
•    Category Types
      –   Fundamental entities(represents basic or core concepts)
      –   Associative/intersecting entities(to associate entities to reconcile m-m relation)
     –   Attributive entities(to describe or categorize other entity)
     –   Subtype entities(to represent a subset of occurrences of parent entity)

Entity Completeness Checklist (contd.)
•   Abbreviations
     –   document the abbreviation and full definition
•   Acronyms
     –   avoid (not understood by all, not unique)
     –   if used, document them
•   Current Number of occurrences
     –   to estimate entity statistics for all entity categories

Entity Completeness Checklist (contd.)
•   Authority
     –   Metadata authority(to approve change of entities, attributes etc.)
     –   Data authority(to change occurrences of entity)
•   Primary Key/Foreign Key/Non-key attribute names
•   Relationships to other entities
     –   no entity stands by itself

Homonyms
   Same or similar in sound or spelling as another
   BUT DIFFERENT IN MEANING!!

     Create CONFUSION!

    IDENTIFY AND ELIMINATE them for entities and attributes!!

Synonyms
Same meaning ...
Same logical concept ...
Assigned different names!!

Synonyms (contd.)
Compare -

Definition, Relationships to other entities, Key structure, attributes, domain values

Attribute Completeness Checklist
•   Name
     –   to uniquely identify the attribute
     –   to meet naming conventions/standards
•   Description
     –   to describe precisely what the attribute represents
•   Type
     –   refers to how the attribute is used in the datamodel
Completeness Checklist (contd.)
•   Key attributes
     –   primary keys in the entity that they are defined
     –   primary / foreign keys in other entities that they occur in)
     –   implemented with a unique index
•   Non-Key attributes
     –   contain the bulk of the information
     –   need not be unique
     –   candidate keys not selected as primary keys
     –   secondary keys may be selected as access paths
     –   implemented using non-unique index


Completeness Checklist (contd.)
•   Domain
    set of permitted values for the attribute
     Domain elements
     –   General Domain
          • describes the manner in which data is represented(data type)
          • alphanumeric, real, integer, boolean, sound, digital video etc.
     –   Specific Domain
          • Enumerated domain
          • specific set of values that are valid and allowed
          • static values (eg. Flat type : 2 bed, 3 bed, duplex etc)

•   Abbreviations
     –   document the abbreviation and full definition
•   Acronyms
     –   avoid (not understood by all, not unique)
     –   if used, document them
•   Key use
     –   applies only to primary keys
     –   will serve as primary or foreign key in child entity
•   Source
     –   whether attribute is primitive or derived
     –   If derived, establish the formula
     –   document formula
     –   formula should identify any other attributes required to generate value for derived attribute
•   Traceability
     –   why is the attribute there
     –   refer to source (paragraph, citation of statement, physical data structure element ...)
     –   mapped to metadata object that is maintained as part of system lifecycle (eg. Critical
         success factor, objective, physical system element like file, table
Derived Attributes
•   Created by accumulating values of multiple instances of attributes. Eg.
    Aggregation/summarization


Calculated Attributes
•   Describes a feature of a single instance of entity
•   Calculated from another single instance of related attribute

Calculated Attributes - contd..
Should Data model contain derived attributes??
YES !!
     –   represent information that management actually wants
     –   users have an opportunity to specify business rules
     –   provide an opportunity to validate that all necessary base data is captured
     –  design is made easier as requirements are already mapped
     In DSS environment - ESSENTIAL
     NEVER use derived attributes as PRIMARY keys

Attribute Names
•   Unique name representing its business meaning
•   clear, concise, self-explanatory
•   minimize use of special characters
•   length > 50 gives flexibility
     –   limitations of 32, 33 exist in some CASE tools
     –   standard documented abbreviations made
•   SHOULD NOT
     –   replace or contradict definition of attribute
     –   contain abbreviations not approved by authority

Attribute Names
•   SHOULD NOT CONTAIN
     –   possessive forms ( Individual’s birth date)
     –   articles (a, an, the)
     –   conjunctions (and, but)
     –   verbs (person owns property)
     –   prepositions (at, by, under, for, of ..)
     –   plural words (product names..)
     –   names of organizations, forms, screens, reports
          • eg. Block 61 title (refers to a specific field on a form)
Attribute Description
•   Builds on and is consistent with attribute name
•   unambiguous, clear, economically worded
•   stand alone (not dependent on another attribute definition to convey meaning. BEWARE of
    circular attribute definitions)
•   Never MISS giving a description

AVOID:
     –   restating the name of attribute and/or characteristics (eg. Length, data type, domain
         values)
     –   using technical jargon
     –   limiting description to direct extract from dictionary


Some attribute descriptions
Need improvement
• Location name - the name of a location

• order line total quantity - a six-digit integer total

• directional indicator - E, W, S, N, NE
Pretty Good
• Safety level quantity - The calculated minimum quantity of a product SKU that must be on
   hand to reduce risk of out-of-stock conditions

• operating quantity - The calculated, demand-driven quantity of a material item that must be
   maintained and replenished for use in day-to-day operations
Primary Key Attributes
•   Stable (not to change in value, cannot be null)
•   Minimal (in number of attributes.. Large composite keys not advisable)
•   Factless(should not contain intelligent groupings of data)
•   Definitive(value always exists for every occurrence)

Primary Key Attributes
•   Candidate Keys (Possible primary keys)
•   One among them is chosen as Primary key
•   The others are alternate keys
•   eg. Candidate keys for a U.S. Citizen are:
     –   driving license #
     –   passport #
     –   SS #
     –   None of them are definitive
     –
     Fingerprint ID Is DEFINITIVE
Primary Key Attributes- Surrogate Keys
 Use artificial key/surrogate key/pseudo-key/system-generated key to ensure uniqueness when:
     –   no attribute possesses all PK characteristics
     –   candidate keys are large and complex
•   ALWAYS USE IN DW Data Model
Relationships- Checklist
•    Name & Description - Optional
•   Type (identifying/non-identifying)
•   Cardinality (Degree/Nature)
     –   one-to-one 1:1
     –   many-to-one m:1
     –   one-to many 1:m
     –   many-to-many m:m(resolved using associative entities)
•   Deletion Integrity Rules (cascade/disassociate/disallow)


Limitations of E-R Modeling
•   Poor Performance
• Tend to be very complex and difficult to navigate.
Dimensional Modeling
•   Dimensional modeling uses three basic concepts : measures, facts, dimensions.
•   Is powerful in representing the requirements of the business user in the context of database
    tables.
•   Focuses on numeric data, such as values counts, weights, balances and occurences.

Dimensional modeling
•   Must identify
     –   Business process to be supported
     –   Grain (level of detail)
     –   Dimensions
     –   Facts

Conventions used in Dimensional modeling
•   Facts
•   Measures(Variables)
•   Dimensions
     –   Dimension members
     –   Dimension hierarchies

Facts
•   A fact is a collection of related data items, consisting of measures and context data.
•   Each fact typically represents a business item, a business transaction, or an event that can be
    used in analyzing the business or business process.
•   Facts are measured, “continuously valued”, rapidly changing information. Can be calculated
    and/or derived.

Fact Table
•   A table that is used to store business information (measures) that can be used in
    mathematical equations.
     –   Quantities
     –   Percentages
     –   Prices

Dimensions
•   A dimension is a collection of members or units of the same type of views.
•   Dimensions determine the contextual background for the facts.
•   Dimensions represent the way business people talk about the data resulting from a business
    process, e.g., who, what, when, where, why, how

Dimension Table
•   Table used to store qualitative data about fact records
     –   Who
     –   What
     –   When
     –   Where
     –   Why

Dimension data should be
•   verbose, descriptive
•   complete
•   no misspellings, impossible values
•   indexed
•   equally available
•   documented ( metadata to explain origin, interpretation of each attribute)

Dimensional model
•   visualise a dimensional model as a CUBE (hypercube because dimensions can be more than
    3 in number)
•   Operations for OLAP

    Drill Down :Higher level of detail
    Roll Up: summarized level of data
    (The navigation path is determined by hierarchies within dimensions.)
    Slice: cuts through the cube.Users can focus on specific perspectives
    Dice: rotates the cube to another perspective (change the dimension)
    Drill down …. Roll up
    Slice and Dice

Dimensions
•   Collection of members or units of the same type of views.
•   determine the contextual background for the facts.
•   the parameters over which we want to perform OLAP (Eg. Time, Location/region, Customers)
•   Member is a distinct name to determine data item’s position (eg. Time - Month, quarter)
•   Hierarchy arrange members into hierarchies or levels
Hierarchies
•   Allow for the ‘rollup’ of data to more summarized levels.
     –   Time
          • day
          • month
          • quarter
          • year
Hierarchies

Aggregates
•   Aggregate Tables are pre-stored summarized tables… created at a higher level of granularity
    across any or all of the dimensions.

•   If the existing granularity is Day wise sales, then creating a separate month wise sales table is
    an example of Aggregate Table.

Aggregates
•   The use of such aggregates is the single most effective tool the data warehouse designer has
    to improve query performance.

•   Usage of Aggregates can increase the performance of Queries by several times.

Measures
•   A measure is a numeric attribute of a fact, representing the performance or behaviour of the
    business relative to dimensions.
•  The actual numbers are called as variables.
Eg. sales in money, sales volume, quantity supplied, supply cost, transaction amount
•  A measure is determined by combinations of the members of the dimensions and is located
  on facts.
THE CUBE




Types of Facts
•   Additive
     –   Able to add the facts along all the dimensions
     –   Discrete numerical measures eg. Retail sales in $
•   Semi Additive
     –   Snapshot, taken at a point in time
     –   Measures of Intensity
     –   Not additive along time dimension eg. Account balance, Inventory balance
     –   Added and divided by number of time period to get a time-average

Types of Facts
•   Non Additive
     –   Numeric measures that cannot be added across any dimensions
     –   Intensity measure averaged across all dimensions eg. Room temperature
     –   Textual facts - AVOID THEM

Advantages of Dimensional Modeling
•   Allows complex multi-dimensional data structure to be defined with a very simple data model.
•   Reduces number of physical joins the query has to process
•   Simplifies the view of data model.
•   Allows DWH to expand and evolve with relatively low maintenance.



                                                                                  TIME PERIOD
                                        PRODUCT
                                                                         Invoice date
                                   Product description                   Fiscal year
                                   Category code                         Quarter
                                   Category description                  Month
                                                                         Week




                                                          CUSTOMER REP SALES
                       SALES REP                                                                           ADDRESS
                                                     Customer snapshot date
                   Last name                         Invoice date                               Address line 1
                   First name                        Gross sales                                Address line 2
                                                     Quantity                                   City name
                                                     Product cost                               State abbreviation
                                                                                                Postal code
                                                                                                Country name




                                         CUSTOMERS                 CUSTOMER DEMOGRAPHICS
                                     Customer name               Snapshot date
                                                                 Credit rating
                                                                 Marital status
                                                                 Age




                                               Sample Logical Model
                                             for Dimensional Data Mart
                                                              PRODUCTS                                              PRODUCT_SNAPSHOTS

                                   #PRODUCT_CODE                                                                #PRODUCT_CODE
                                   . PRODUCT_DESCRIPTION                                                        #SNAPSHOT_DATE
                                   . CATEGORY_CODE                                                              . MSRP
                                   . CATEGORY_DESCRIPTION                                                       . UOM
                                                                                                                . PRIMARY_SUPPLIER_NAME
                                                                                                                . SUPPLIER_CITY_NAME
                                                                                                                . SUPPLIER_STATE_ABBRV
                                                                                                                . SUPPLIER_COUNTRY_NAME
             SALES_REPS             CUSTOMER_INVOICES                   PURCHASE_INVOICES
           # SALES_REP_ID          #INVOICE_ID                        # INVOICE_ID
           . LAST_NAME             #LINE_ITEM_SEQ                     #LINE_ITEM_SEQ                           BUDGET_DETAILS
           . FIRST_NAME            . INVOICE_DATE                     . INVOICE_DATE
                                                                                                           #BUDGET_ID
           o                       . CUSTOMER_DATE                    . SUPPLIER_ID
           MANAGER_FIRST_          . BILL_TO_ADDRESS_ID               . ADDRESS_ID
                                                                                                           #REVISION_SEQ                 Sample Physical
                                                                                                           #LINE_ITEM_SEQ
           NAME
           oMANAGER_LAST
                                   . SALES_REP_ID
                                   . MANAGER_REP_ID
                                                                      . BUDGET_ID
                                                                      . REVISION_SEQ
                                                                                                           . BLI_TYPE_CODE                    Model
                                                                                                           . BLI_TYPE_DESCRIPTION
           _NAME                   . ORGANIZATION_ID                  . BUDGET_LINE_ITEM_SEQ
                                                                                                           . ORGANIZATION_ID                   for
                                   . ORG_ADDRESS_ID                   . PRODUCT_CODE
                                                                                                           . ADDRESS_ID
                                   . PRODUCT_CODE                     . QUANTITY                           . BUDGET_PERIOD
                                                                                                                                         Data Warehouse
                                   . QUANTITY                         . UNIT_PRICE
                                                                                                           . LOAD_DATE
                                   . UNIT_PRICE                       . AMOUNT
                                                                                                           . BUDGET_AMOUNT
                                   . AMOUNT                           . LOAD_DATE
                                                                                                           . EXPENDITURES
                                   o PRODUCT COST
                                                                                                           o PRODUCT_CODE
                                   . LOAD_DATE




              CUSTOMER_ADDRESSES                          CUSTOMERS                            SUPPLIER_ADDRESSES                    INTERNAL_ORG_ADDRESSES
             #CUSTOMER_ID                    #CUSTOMER_ID                              #SUPPLIER_ID                                 #ORGANIZATION_ID
             #ADDRESS_ID                     #SNAPSHOT_DATE                            #ADDRESS_ID                                  #ADDRESS_ID
             . ADDRESS_LINE1                 . CUSTOMER_NAME                           . SUPPLIER_NAME                              . ORG_TYPE
             oADDRESS_LINE2                  oAGE                                      oPOSTAL_CODE                                 . ORGANIZATION_NAME
             oPOSTAL_CODE                    oMARITAL STATUS                           . GEO_CODE                                   . ADDRESS_LINE1
             . SALES_REP_ID                  . CREDIT_RATING                           . LOAD_DATE                                  oADDRESS_LINE2
             . GEO_CODE                                                                                                             oPOSTAL_CODE
             . LOAD_DATE                                                                                                            . GEO_CODE
                                                                GEOGRAPHIC_BOUNDARIES                                               oPARENT_ORG_ID
                                                                                                                                    . LOAD_DATE
                                                             #GEO_CODE
                                                             . CITY_NAME
                                                             . STATE_NAME
                                                             . COUNTRY_NAME
                                                             oCITY_ABBRV
                                                             oSTATE_ABBRV
                                                             oCOUNTRY_ABBRV




Common structures for datamarts: Denormalize!
•   Star
    –   Single fact table surrounded by denormalised dimension tables
    –   The fact table primary key is the composite of the foreign keys (primary keys of dimension
        tables)
    –   Fact table contains transaction type information.
    –   Many star schemas in a data mart
    –   Easily understood by end users, more disk storage required

Example of Star- schema
Common structures for datamarts: Denormalize!
•   Snowflake
     –   Single fact table surrounded by normalised dimension tables
     –   Normalizes dimension table to save data storage space.
     –   When dimensions become very very large
     –   Less intuitive, slower performance due to joins

•   May want to use both approaches, especially if supporting multiple end-user tools.

Example of Snow flake schema




Snowflake - Disadvantages
•   Normalization of dimension makes it difficult for user to understand
•   Decreases the query performance because it involves more joins
•   Dimension tables are normally smaller than fact tables - space may not be a major issue to
    warrant snow flaking

Keys …..
•   Primary Keys
     –   uniquely identify a record
•   Foreign Keys
     –   primary key of another table referred here
•   Surrogate Keys
     –   system-generated key for dimensions
     –   key on its own has no meaning
     –   integer key, less space

More Keys …..
•   Smart Keys
     –   primary key out of various attributes of dimension
     –   AVOID THEM!
     –   Join to Fact table should be on single surrogate key
•    Production Keys
      –   DO NOT USE Production defined attributes
      –   Business may reuse/change them - DW cannot!

Basic Dimensional Modeling Techniques
•    Slowing changing Dimensions
•    Rapidly changing Small Dimensions
•    Large Dimensions
•    Rapidly changing Large Dimensions
•    Degenerate Dimensions
•    Junk Dimensions

Slowly Changing Dimensions

    A dimension is considered a Slowly Changing Dimension when its attributes remain almost
     constant over time, requiring relatively minor alterations to represent the evolved state.

Slowly changing Dimension- Options
Eg. Key does not change but description changes (product description)

TYPE 1
•    Overwrite dimension record with new values
      –   used when old value of attribute has no significance

Slowly changing Dimension- Options

TYPE 2
•    Create a new record using a new value of surrogate key
      –   used when history can be clearly partitioned
      –   query only on new value or only old value
      –   query on some other attributes - return all records)

Slowly changing Dimension- Options (contd..)

TYPE 3
•    Create an ‘old’ field in dimension to store immediate previous value
      –   used when change is a soft change
      –   no perfect partition in history
      –   may want to track for sometime with both old or new value
      –   do not use when there are too many such soft changes successively

Slowly Changing Dimension-

An Example
•    Slowly Changing Dimension
Rapidly Changing Small Dimensions
Eg. Rapid changes to product dimension

•   Type 2 (use surrogate key and create a new record)
•   use effective dates
•   use only until dimension table remains small

Large Dimensions - Dimensions containing several million records!!!

HOW TO SUPPORT???
•   Database to support indexing technology that support rapid browsing
•   Find and suppress duplicate entries in the dimension (eg. Name and address matching)
•   Never use Type 2 to solve changing dimensions (i.e. adding records)

Rapidly Changing Monster Dimensions
Dimensions containing > 100 million records!!!

HOW TO SUPPORT???
•   Break the Monster dimension into separate dimension tables
•   Constant information into original table
•   New dimension table can have discrete values for each attribute
•   Choose pre-defined set of values per attribute

Rapidly Changing Monster Dimensions            (contd..)
•   Build the data in this dimension with all possible combinations of values for each attribute
•   Identify each combination uniquely
•   Everytime an event occurs and is recorded in fact table, attach it with the unique combination
    ID.

Rapidly Changing Monster Dimensions            (contd..)
•   Advantages
     –   No increase in data storage everytim event occurs
•   Drawbacks
     –   Forced to use ranges of discrete values for dimensional attributes
     –   New dimension cannot be too big (not >1M)
     –   Data in new dimension can be accessed along with static data only through the fact table -
         slower
     –   Only if event occurs, link the static and changing portions of dimension - keep a dummy
         event in fact

Degenerate Dimensions
•   Occur in line item oriented fact tables
•   occur when dimension table is left only with a single key and no other fields
•   all other attributes have been moved into other dimension tables
•   Moved to fact table - not joined to anything
Junk Dimensions
•   Number of miscellaneous flags and text attributes left over after design

WHAT TO DO WITH THEM????
DO NOT
     –   Leave them behind in the fact table
     –   Make each flag and attribute into its own dimension
     –   Strip off all such flags and attributes

Junk Dimensions (contd…)
•   DO
     –   Grouping of random flags and attributes
     –   take away from fact and group them into junk dimension

eg. Open ended comments fields

Conformed Dimensions
•   Dimension that means the same thing with every possible fact table that it is joined.
•   Dimension is identically the same dimension in each data mart
•   Major responsibility of the central DWdesign team is to establish, publish, maintain and
    Enforce them
•   DW cannot function as an integrated whole without strict adherence to conformed dimensions

Conformed Dimensions (Contd.)
•   When you don’t need Conformed Dimensions
     –   Several lines of business where the customers and products are disjoint.
     –   Don’t manage these separate business lines together

Time Dimension

•   An exclusive Time dimension is required because the SQL date semantics and functions
    cannot generate several important attributes required for analytical purposes.
•   Attributes like weekdays, weekends, fiscal period, holidays, season cannot be generated by
    SQL statements.

Time Dimension
•   Moreover SQL date stamps occupy more space largely increasing the size of the fact table.
•   Joins on such SQL generated date-stamps are costly decreasing the query speed
    significantly.

Time Dimension
•   The Day of week(Monday, ...) is useful to create reports comparing for ex. Monday sales to
    Friday sales.
•   The Day number in month is useful for comparing measures for the same day in each month.
•   The last day in month flag is useful for performing payday analysis.
Time Dimension
•   The holiday flag and season attributes are useful for holiday VS non-holiday analysis and
    season business analysis.
•   Event attribute is needed to record special days like strike days, etc..

								
To top