Data Modeling Data Warehouse Defined “A data warehouse is a collection of corporate information, derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations” Characteristics of a DW • Subject-oriented Data – collects all data for a subject, from different sources • Read-only Requests – loaded during off-hours, read-only during day hours • Interactive Features, ad-hoc query – flexible design to handle spontaneous user queries • Pre-aggregated data – to improve runtime performance • Highly denormalized data structures – fat tables with redundant columns STAGING AREA - SOME CLARITY • Staging Area – optional – to cleanse the source data – Accepts data from different sources – Data model is required at staging area – Multiple data models may be required for parking different sources and for transformed data to be pushed out to warehouse ODS - SOME CLARITY • Operational Data Store – Optional – Granular, detailed level data – May feed warehouse (eg when warehouse is aggregated) – Usually a relational model – May keep data for a smaller time period than warehouse Data Modeling WHAT IS A DATA MODEL??? A data model is an abstraction of some aspect of the real world (system). WHY A DATA MODEL??? – Helps to visualize the business – A model is a means of communication. – Models help elicit and document requirements. – Models reduce the cost of change. – Model is the essence of DW architecture based on which DW will be implemented What do we want to do with the data? Model depends on what kind of data analysis we want to do: • Different Data Analysis Techniques – Query and reporting • Display Query Results – Multidimensional analysis • Analyze data content by looking at it in different perspectives – Data mining • discover patterns and clustering attributes in data Impact of Data Analysis Techniques on DM • Query and reporting • Normalized data model • Select associated data elements • summarize and group by category • present results • direct table scan • ER with normalized / denormalized appropriate Query and reporting Requirements of a Decision Support Query Environment • To provide a method for testing hypothesis (eg. what if ….) • To allow ad-hoc queries • To allow human input (DSS makes decisions with users ) • Expects user knowledge of problem • To simulate the behavior of a real-world problem Impact of Data Analysis Techniques on DM • Multidimensional analysis • Fast and easy access to data • Any number of analysis dimensions in any combinations • ER will mean many joins • Dimensional model appropriate Multidimensional Analysis Data Mining • Data Mining • discovers unusual patterns • requires low level of detail data DW Architectures • Architecture Choices depend on – Current infrastructure – Business environment – Desired management and control structure – Resources – Commitment ….. • Data Warehouse/data mart DW Architectures • Architecture Choices determine – Where will DW reside? • Centrally / locally / distributed – Where will it be managed from? • Centrally / independently • 3 choices • Global • Independent • Interconnected or a combination of the three DW Architectures • Global Architecture – related to scope of data access and storage – does not mean centralized – can be physically centralized or distributed – enterprise view of data – time-consuming & costly to implement Global Architecture DW Architectures • Independent Architecture – stand-alone – controlled by a department – minimal integration – no global view – very fast to implement DW Architectures • Interconnected Architecture – distributed – integrated and interconnected – gives a global view of enterprise – more complexity • who manages / controls data • another tier in architecture to share common data between multiple data marts • have a data sharing schema across data marts Independent and Interconnected Architecture Types of Data Warehouse • Enterprise Data Warehouse • Data Mart Enterprise data warehouse • Contains data drawn from multiple operational systems • Supports time- series and trend analysis across different business areas • Can be used as a transient storage area to clean all data and ensure consistency • Can be used to populate data marts • Can be used for everyday and strategic decision making Data Mart • Logical subset of enterprise data warehouse • Organized around a single business process • Based on granular data • May or may not contain aggregates • Object of analytical processing by the end user. • Less expensive and much smaller than a full blown corporate data warehouse. Distributed and Centralized Data warehouses • DW sitting on a monolithic machine - unrealistic • Separate machines, different OS, different DB systems - reality Solution • Share a uniform architecture to allow them to be fused coherently Classical Architectures • Physical data warehouse (physical) – Data warehouse --> data marts – Data marts --> data warehouse – Parallel data warehouse and data marts DW Implementation Approaches • Top Down • Bottom-up • Combination of both • Choices depend on: – current infrastructure – resources – architecture – ROI – Implementation speed DW Implementation Approaches Top Down • More planning and design initially • Involve people from different work-groups, departments • Data marts may be built later form Global DW • Overall data model to be decided up-front Bottom Up • Can plan initially without waiting for global infrastructure • built incrementally • can be built before or in parallel with Global DW • Less complexity in design Combined Approach • Determine degree of planning and design for a global approach to integrate data marts being built by bottom-up approach • Develop base level infrastructure definition for global DW at business level • Develop plan to handle data elements needed by multiple data marts • Build a common data store to be used by data marts and global DW Levels of modeling • Conceptual modeling – Describe data requirements from a business point of view without technical details • Logical modeling – Refine conceptual models – Data structure oriented, platform independent • Physical modeling – Detailed specification of what is physically implemented using specific technology Conceptual Model • A conceptual model shows data through business eyes. • All entities which have business meaning. • Important relationships • Few significant attributes in the entities. • Few identifiers or candidate keys. Sample conceptual model Products Customer Customers Invoices Sales Reps Customer Addresses Sample Conceptual Geographic Model Boundaries Logical Model • Replaces many-to-many relationships with associative entities. • Defines a full population of entity attributes. • May use non-physical entities for domains and sub-types. • Establishes entity identifiers. • Has no specifics for any RDBMS or configuration. CUSTOMER INVOICE #INVOICE ID #LINE ITEM SEQ .INVOICE DATE the bill for the bill sent to the bill purchased by purchased purchased at purchased by by PRODUCT CUSTOMER ADDRESS CUSTOMER #PRODUCT CODE #CUSTOMER ID #CUSTOMER ID .PRODUCT DESCRIPTION #ADDRESS ID #SNAPSHOT DATE .CUSTOMER NAME for the for the located within sold by customer customer managed by sold to by the salesman the sales the salesman the general location of for manager for for SALES REP GEOGRAPHIC #SALES REP ID BOUNDARY #GEO CODE Sample Logical Model Physical Model • A Physical data model may include – Referential Integrity – Indexes – Views – Alternate keys and other constraints – Tablespaces and physical storage objects. PRODUCTS # PRODUCT_CODE PRODUCT_DESCRIPTION CATEGORY_CODE CATEGORY_DESCRIPTION SALES_REPS CUSTOMER_INVOICES CUSTOMERS #SALES_REP_ID #INVOICE_ID #CUSTOMER_ID LAST_NAME #LINE_ITEM_SEQ #SNAPSHOT_DATE FIRST_NAME INVOICE_DATE CUSTOMER_NAME o MANAGER_FIRST_NAME CUSTOMER_ID o AGE o MANAGER_LAST_NAME BILL_TO_ADDRESS_ID o MARITAL_STATUS SALES_REP_ID CREDIT_RATING MANAGER_REP_ID ORGANIZATION_ID ORG_ADDRESS_ID PRODUCT_CODE QUANTITY UNIT_PRICE AMOUNT Sample Physical oPRODUCT_COST LOAD_DATE Model CUSTOMER_ADDRESSES GEOGRAPHIC_BOUNDARIES #CUSTOMER_ID #GEO_CODE #ADDRESS_ID CITY_NAME ADDRESS_LINE1 STATE_NAME o ADDRESS_LINE2 COUNTRY_NAME o POSTAL_CODE o CITY_ABBRV SALES_REP_ID o STATE_ABBRV GEO_CODE o COUNTRY_ABBRV LOAD_DATE Data Architecting What is data architecting??? Structure and locate data according to its characteristics 3 Basic types of data – Real time data – Derived data – Reconciled data Data Architecting-Real time data • Represents current status of business • Used by operational systems to run business • Changes as operational transactions are processed • Very detailed, high level of granularity Data Architecting - Real time data To use Real time data in DW: Must be • Cleansed (comes from different sources, cleansed to to ensure data consistency quality) • Summarized (because it contains individual, transactional,detailed) • Transformed into an easily understandable format for manipulation by analysts Eg. Different units of measure, currency, exchange rates Data Architecting - Derived data • Data created by summarizing, aggregating, averaging real-time data through some process • represents a view of business data at a specific time • Historical record of business over a period • Precalculate derived data elements and summarize detailed data to improve query processing Data Architecting - Reconciled data • Real-time data cleansed, adjusted, enhanced to provided integrated source of data for analysis • Create and maintain historical data while reconciling • Normally not explicitly defined • Logical result of derivation operations • May be stored as temporary files used to transform operational data for consistency Enterprise Data Model (EDM) • Consistent definition of data elements common to a business • High-level business view • Generic logical data model • Physical data design EDM - The Phased Enterprise Data Model Enterprise Data Model (EDM) Phases Increasing order of Information required • Information Planning • Business Analysing • Logical Data Modeling • Physical Data design Enterprise Data Model (EDM) Information Planning • Consolidated view of the business • Identify some business concepts (20-30) • called subject areas / super entity/ business entity in which the organization is interested Eg.customer, product Purpose – To set up scope and architecture of DW – To provide a single comprehensive point of view Enterprise Data Model (EDM) Business Analyzing • Define contents of primary business concepts. • Gather and arrange business requirements • Defines business terms Purpose – To set up scope and architecture of DW – To provide a single comprehensive point of view Logical Data Modeling • Enterprise-wide in scope • consists of several entities, relationships, attributes • complete model in 3rd Normal Form. Can be divided into 2 types: – Generic logical data model (enterprise level) – Logical application model (application level) Physical Data Design • space • performance • physical distribution of data Purpose: To design for the physical implementation Is it possible to draw an EDM ??? Not always!! Phased approach OR a simple EDM • list of subject areas (<25) • define business relationships between subject areas • define contents of each subject area Granularity More the detail Lower the granularity Why is it important in DW??? Opportunity for TRADE-OFF performance vs. volume of data stored ability to access detailed data vs. cost of storage Granularity Granularity To overcome trade-offs between data volume and query capability : • Divide the data in the DW • Create 2 levels of granularity of data • Detailed Raw data – keep it on separate storage medium – load when required • Summarized data Data Partitioning Model - WHY? To understand, maintain and navigate a DW TYPES of Partitioning • Logical and Physical Logical Partitioning - WHY? Goals: • Provide flexible access to data • Provide easy and efficient data management services • Ensure scalability of the data warehouse • Enable elements of the data warehouse to be portable. Some elements of the DW can be shared with other physical warehouses or archived on other storage media. Data Partitioning Model - Logical Partitioning • Partition large volumes of data by splitting • Helps to make data easier to: • Restructure • Index • Sequentially scan • Reorganize • Recover • Monitor Logical Partition - HOW?? Criteria • Time period (date, month, or quarter) – almost always chosen • Geography (location) • Product (more generically, by line of business) • Organizational unit • A combination of the above Data Partitioning Model -Subject Areas Subject areas classified by the topics of interest to the business. • 5W1H rule – when, where, who, what, why, and how eg. ‘who’ could be customer, employee, manager, supplier, business partner, competitor. • Get a candidate list of subject areas • Decompose, rearrange, select, redefine in more detail • Define the business relationships among subject areas • This will determine the dimensions used • Subject Areas help define criteria like: • Unit of the data model • Unit of an implementation project • Unit of management of the data • Basis for the integration of multiple mplementations unit for analysis should be business process Data Modeling - Techniques • STAGING AREA – YES ! (maybe multiple data models are required) • ODS – YES ! • DATAWAREHOUSE/DATAMART – YES! Data Modeling - Techniques • Modeling techniques – E-R Modeling – Dimensional Modeling Implementation and modeling styles • Modeling versus implementation – Modeling: describe what should be built to non-technical folks – Implementation: describe what is actually built to technical folks • Relational modeling – Use for implementation – Difficult to understand by non-technical folks • Dimensional modeling – Use for modeling during analysis and design phases – Can be implemented using other modeling styles e.g. object-oriented, relational E-R Modeling • Produces a data model, using two basic concepts entities and the relationships between those entities. • Detailed ER models also contain attributes, which can be properties of either the entities or the relationships. Conventions used in E-R modeling • Entities • Attributes • Relationships or Associations Entities • Principal data objects about which information is to be collected. • Usually recognizable concepts such as person, things, or events. • Examples : EMPLOYEES, PROJECTS, INVOICES. Attributes & Relationships • Attributes describe the entity of which they are associated. • A relationship represents an association between two or more entities. An example : – Employees are assigned to projects – Departments manage one or more projects. Types of Data Relationships - Cardinality • One - One 1: 1 • One - Many 1: m • Many - Many m:n • Recursive data relationship Normalization • Remove data redundancy • 0 NF - contains repeating values • 1 NF - No repeating values • 2 NF - Every attribute is dependent on the key, the whole key and nothing but the key • 3 NF - No non-key attribute is functionally dependent on another non-key attribute • Denormalization - carefully introduced redundancy to improve query performance Normalization - 1NF • Eliminate Repeating groups Person Skills A Oracle, DB2 B MS Access, Oracle C Oracle, CICS, SQL D DB2, CICS Who are the ones who have DB2 skills??? Normalization - 2NF • Eliminate Redundant data Skill ID Skill Description S1 DB2 S2 Oracle S3 MS Access S4 CICS S5 SQL Normalization - 3NF • Eliminate Columns Not Dependent On Key Memb ID Skill ID Comp ID. Comp Name Location A S1 D1 Core Tech HYD Relational modeling • Represents business entities, data items associated with each entity, and the relationships of business interest among the entities • Entities are usually broken down into smallest possible units and combined using relationships • Diagram looks like a spiderweb Entity Completeness Checklist • Name – to describe the data contained – to meet naming conventions/standards • Description – to describe precisely what the entity represents – required for sharing and reuse of data model components • Category – classifies entities sharing common characteristics Entity Completeness Checklist (contd.) • Category Types – Fundamental entities(represents basic or core concepts) – Associative/intersecting entities(to associate entities to reconcile m-m relation) – Attributive entities(to describe or categorize other entity) – Subtype entities(to represent a subset of occurrences of parent entity) Entity Completeness Checklist (contd.) • Abbreviations – document the abbreviation and full definition • Acronyms – avoid (not understood by all, not unique) – if used, document them • Current Number of occurrences – to estimate entity statistics for all entity categories Entity Completeness Checklist (contd.) • Authority – Metadata authority(to approve change of entities, attributes etc.) – Data authority(to change occurrences of entity) • Primary Key/Foreign Key/Non-key attribute names • Relationships to other entities – no entity stands by itself Homonyms Same or similar in sound or spelling as another BUT DIFFERENT IN MEANING!! Create CONFUSION! IDENTIFY AND ELIMINATE them for entities and attributes!! Synonyms Same meaning ... Same logical concept ... Assigned different names!! Synonyms (contd.) Compare - Definition, Relationships to other entities, Key structure, attributes, domain values Attribute Completeness Checklist • Name – to uniquely identify the attribute – to meet naming conventions/standards • Description – to describe precisely what the attribute represents • Type – refers to how the attribute is used in the datamodel Completeness Checklist (contd.) • Key attributes – primary keys in the entity that they are defined – primary / foreign keys in other entities that they occur in) – implemented with a unique index • Non-Key attributes – contain the bulk of the information – need not be unique – candidate keys not selected as primary keys – secondary keys may be selected as access paths – implemented using non-unique index Completeness Checklist (contd.) • Domain set of permitted values for the attribute Domain elements – General Domain • describes the manner in which data is represented(data type) • alphanumeric, real, integer, boolean, sound, digital video etc. – Specific Domain • Enumerated domain • specific set of values that are valid and allowed • static values (eg. Flat type : 2 bed, 3 bed, duplex etc) • Abbreviations – document the abbreviation and full definition • Acronyms – avoid (not understood by all, not unique) – if used, document them • Key use – applies only to primary keys – will serve as primary or foreign key in child entity • Source – whether attribute is primitive or derived – If derived, establish the formula – document formula – formula should identify any other attributes required to generate value for derived attribute • Traceability – why is the attribute there – refer to source (paragraph, citation of statement, physical data structure element ...) – mapped to metadata object that is maintained as part of system lifecycle (eg. Critical success factor, objective, physical system element like file, table Derived Attributes • Created by accumulating values of multiple instances of attributes. Eg. Aggregation/summarization Calculated Attributes • Describes a feature of a single instance of entity • Calculated from another single instance of related attribute Calculated Attributes - contd.. Should Data model contain derived attributes?? YES !! – represent information that management actually wants – users have an opportunity to specify business rules – provide an opportunity to validate that all necessary base data is captured – design is made easier as requirements are already mapped In DSS environment - ESSENTIAL NEVER use derived attributes as PRIMARY keys Attribute Names • Unique name representing its business meaning • clear, concise, self-explanatory • minimize use of special characters • length > 50 gives flexibility – limitations of 32, 33 exist in some CASE tools – standard documented abbreviations made • SHOULD NOT – replace or contradict definition of attribute – contain abbreviations not approved by authority Attribute Names • SHOULD NOT CONTAIN – possessive forms ( Individual’s birth date) – articles (a, an, the) – conjunctions (and, but) – verbs (person owns property) – prepositions (at, by, under, for, of ..) – plural words (product names..) – names of organizations, forms, screens, reports • eg. Block 61 title (refers to a specific field on a form) Attribute Description • Builds on and is consistent with attribute name • unambiguous, clear, economically worded • stand alone (not dependent on another attribute definition to convey meaning. BEWARE of circular attribute definitions) • Never MISS giving a description AVOID: – restating the name of attribute and/or characteristics (eg. Length, data type, domain values) – using technical jargon – limiting description to direct extract from dictionary Some attribute descriptions Need improvement • Location name - the name of a location • order line total quantity - a six-digit integer total • directional indicator - E, W, S, N, NE Pretty Good • Safety level quantity - The calculated minimum quantity of a product SKU that must be on hand to reduce risk of out-of-stock conditions • operating quantity - The calculated, demand-driven quantity of a material item that must be maintained and replenished for use in day-to-day operations Primary Key Attributes • Stable (not to change in value, cannot be null) • Minimal (in number of attributes.. Large composite keys not advisable) • Factless(should not contain intelligent groupings of data) • Definitive(value always exists for every occurrence) Primary Key Attributes • Candidate Keys (Possible primary keys) • One among them is chosen as Primary key • The others are alternate keys • eg. Candidate keys for a U.S. Citizen are: – driving license # – passport # – SS # – None of them are definitive – Fingerprint ID Is DEFINITIVE Primary Key Attributes- Surrogate Keys Use artificial key/surrogate key/pseudo-key/system-generated key to ensure uniqueness when: – no attribute possesses all PK characteristics – candidate keys are large and complex • ALWAYS USE IN DW Data Model Relationships- Checklist • Name & Description - Optional • Type (identifying/non-identifying) • Cardinality (Degree/Nature) – one-to-one 1:1 – many-to-one m:1 – one-to many 1:m – many-to-many m:m(resolved using associative entities) • Deletion Integrity Rules (cascade/disassociate/disallow) Limitations of E-R Modeling • Poor Performance • Tend to be very complex and difficult to navigate. Dimensional Modeling • Dimensional modeling uses three basic concepts : measures, facts, dimensions. • Is powerful in representing the requirements of the business user in the context of database tables. • Focuses on numeric data, such as values counts, weights, balances and occurences. Dimensional modeling • Must identify – Business process to be supported – Grain (level of detail) – Dimensions – Facts Conventions used in Dimensional modeling • Facts • Measures(Variables) • Dimensions – Dimension members – Dimension hierarchies Facts • A fact is a collection of related data items, consisting of measures and context data. • Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process. • Facts are measured, “continuously valued”, rapidly changing information. Can be calculated and/or derived. Fact Table • A table that is used to store business information (measures) that can be used in mathematical equations. – Quantities – Percentages – Prices Dimensions • A dimension is a collection of members or units of the same type of views. • Dimensions determine the contextual background for the facts. • Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how Dimension Table • Table used to store qualitative data about fact records – Who – What – When – Where – Why Dimension data should be • verbose, descriptive • complete • no misspellings, impossible values • indexed • equally available • documented ( metadata to explain origin, interpretation of each attribute) Dimensional model • visualise a dimensional model as a CUBE (hypercube because dimensions can be more than 3 in number) • Operations for OLAP Drill Down :Higher level of detail Roll Up: summarized level of data (The navigation path is determined by hierarchies within dimensions.) Slice: cuts through the cube.Users can focus on specific perspectives Dice: rotates the cube to another perspective (change the dimension) Drill down …. Roll up Slice and Dice Dimensions • Collection of members or units of the same type of views. • determine the contextual background for the facts. • the parameters over which we want to perform OLAP (Eg. Time, Location/region, Customers) • Member is a distinct name to determine data item’s position (eg. Time - Month, quarter) • Hierarchy arrange members into hierarchies or levels Hierarchies • Allow for the ‘rollup’ of data to more summarized levels. – Time • day • month • quarter • year Hierarchies Aggregates • Aggregate Tables are pre-stored summarized tables… created at a higher level of granularity across any or all of the dimensions. • If the existing granularity is Day wise sales, then creating a separate month wise sales table is an example of Aggregate Table. Aggregates • The use of such aggregates is the single most effective tool the data warehouse designer has to improve query performance. • Usage of Aggregates can increase the performance of Queries by several times. Measures • A measure is a numeric attribute of a fact, representing the performance or behaviour of the business relative to dimensions. • The actual numbers are called as variables. Eg. sales in money, sales volume, quantity supplied, supply cost, transaction amount • A measure is determined by combinations of the members of the dimensions and is located on facts. THE CUBE Types of Facts • Additive – Able to add the facts along all the dimensions – Discrete numerical measures eg. Retail sales in $ • Semi Additive – Snapshot, taken at a point in time – Measures of Intensity – Not additive along time dimension eg. Account balance, Inventory balance – Added and divided by number of time period to get a time-average Types of Facts • Non Additive – Numeric measures that cannot be added across any dimensions – Intensity measure averaged across all dimensions eg. Room temperature – Textual facts - AVOID THEM Advantages of Dimensional Modeling • Allows complex multi-dimensional data structure to be defined with a very simple data model. • Reduces number of physical joins the query has to process • Simplifies the view of data model. • Allows DWH to expand and evolve with relatively low maintenance. TIME PERIOD PRODUCT Invoice date Product description Fiscal year Category code Quarter Category description Month Week CUSTOMER REP SALES SALES REP ADDRESS Customer snapshot date Last name Invoice date Address line 1 First name Gross sales Address line 2 Quantity City name Product cost State abbreviation Postal code Country name CUSTOMERS CUSTOMER DEMOGRAPHICS Customer name Snapshot date Credit rating Marital status Age Sample Logical Model for Dimensional Data Mart PRODUCTS PRODUCT_SNAPSHOTS #PRODUCT_CODE #PRODUCT_CODE . PRODUCT_DESCRIPTION #SNAPSHOT_DATE . CATEGORY_CODE . MSRP . CATEGORY_DESCRIPTION . UOM . PRIMARY_SUPPLIER_NAME . SUPPLIER_CITY_NAME . SUPPLIER_STATE_ABBRV . SUPPLIER_COUNTRY_NAME SALES_REPS CUSTOMER_INVOICES PURCHASE_INVOICES # SALES_REP_ID #INVOICE_ID # INVOICE_ID . LAST_NAME #LINE_ITEM_SEQ #LINE_ITEM_SEQ BUDGET_DETAILS . FIRST_NAME . INVOICE_DATE . INVOICE_DATE #BUDGET_ID o . CUSTOMER_DATE . SUPPLIER_ID MANAGER_FIRST_ . BILL_TO_ADDRESS_ID . ADDRESS_ID #REVISION_SEQ Sample Physical #LINE_ITEM_SEQ NAME oMANAGER_LAST . SALES_REP_ID . MANAGER_REP_ID . BUDGET_ID . REVISION_SEQ . BLI_TYPE_CODE Model . BLI_TYPE_DESCRIPTION _NAME . ORGANIZATION_ID . BUDGET_LINE_ITEM_SEQ . ORGANIZATION_ID for . ORG_ADDRESS_ID . PRODUCT_CODE . ADDRESS_ID . PRODUCT_CODE . QUANTITY . BUDGET_PERIOD Data Warehouse . QUANTITY . UNIT_PRICE . LOAD_DATE . UNIT_PRICE . AMOUNT . BUDGET_AMOUNT . AMOUNT . LOAD_DATE . EXPENDITURES o PRODUCT COST o PRODUCT_CODE . LOAD_DATE CUSTOMER_ADDRESSES CUSTOMERS SUPPLIER_ADDRESSES INTERNAL_ORG_ADDRESSES #CUSTOMER_ID #CUSTOMER_ID #SUPPLIER_ID #ORGANIZATION_ID #ADDRESS_ID #SNAPSHOT_DATE #ADDRESS_ID #ADDRESS_ID . ADDRESS_LINE1 . CUSTOMER_NAME . SUPPLIER_NAME . ORG_TYPE oADDRESS_LINE2 oAGE oPOSTAL_CODE . ORGANIZATION_NAME oPOSTAL_CODE oMARITAL STATUS . GEO_CODE . ADDRESS_LINE1 . SALES_REP_ID . CREDIT_RATING . LOAD_DATE oADDRESS_LINE2 . GEO_CODE oPOSTAL_CODE . LOAD_DATE . GEO_CODE GEOGRAPHIC_BOUNDARIES oPARENT_ORG_ID . LOAD_DATE #GEO_CODE . CITY_NAME . STATE_NAME . COUNTRY_NAME oCITY_ABBRV oSTATE_ABBRV oCOUNTRY_ABBRV Common structures for datamarts: Denormalize! • Star – Single fact table surrounded by denormalised dimension tables – The fact table primary key is the composite of the foreign keys (primary keys of dimension tables) – Fact table contains transaction type information. – Many star schemas in a data mart – Easily understood by end users, more disk storage required Example of Star- schema Common structures for datamarts: Denormalize! • Snowflake – Single fact table surrounded by normalised dimension tables – Normalizes dimension table to save data storage space. – When dimensions become very very large – Less intuitive, slower performance due to joins • May want to use both approaches, especially if supporting multiple end-user tools. Example of Snow flake schema Snowflake - Disadvantages • Normalization of dimension makes it difficult for user to understand • Decreases the query performance because it involves more joins • Dimension tables are normally smaller than fact tables - space may not be a major issue to warrant snow flaking Keys ….. • Primary Keys – uniquely identify a record • Foreign Keys – primary key of another table referred here • Surrogate Keys – system-generated key for dimensions – key on its own has no meaning – integer key, less space More Keys ….. • Smart Keys – primary key out of various attributes of dimension – AVOID THEM! – Join to Fact table should be on single surrogate key • Production Keys – DO NOT USE Production defined attributes – Business may reuse/change them - DW cannot! Basic Dimensional Modeling Techniques • Slowing changing Dimensions • Rapidly changing Small Dimensions • Large Dimensions • Rapidly changing Large Dimensions • Degenerate Dimensions • Junk Dimensions Slowly Changing Dimensions A dimension is considered a Slowly Changing Dimension when its attributes remain almost constant over time, requiring relatively minor alterations to represent the evolved state. Slowly changing Dimension- Options Eg. Key does not change but description changes (product description) TYPE 1 • Overwrite dimension record with new values – used when old value of attribute has no significance Slowly changing Dimension- Options TYPE 2 • Create a new record using a new value of surrogate key – used when history can be clearly partitioned – query only on new value or only old value – query on some other attributes - return all records) Slowly changing Dimension- Options (contd..) TYPE 3 • Create an ‘old’ field in dimension to store immediate previous value – used when change is a soft change – no perfect partition in history – may want to track for sometime with both old or new value – do not use when there are too many such soft changes successively Slowly Changing Dimension- An Example • Slowly Changing Dimension Rapidly Changing Small Dimensions Eg. Rapid changes to product dimension • Type 2 (use surrogate key and create a new record) • use effective dates • use only until dimension table remains small Large Dimensions - Dimensions containing several million records!!! HOW TO SUPPORT??? • Database to support indexing technology that support rapid browsing • Find and suppress duplicate entries in the dimension (eg. Name and address matching) • Never use Type 2 to solve changing dimensions (i.e. adding records) Rapidly Changing Monster Dimensions Dimensions containing > 100 million records!!! HOW TO SUPPORT??? • Break the Monster dimension into separate dimension tables • Constant information into original table • New dimension table can have discrete values for each attribute • Choose pre-defined set of values per attribute Rapidly Changing Monster Dimensions (contd..) • Build the data in this dimension with all possible combinations of values for each attribute • Identify each combination uniquely • Everytime an event occurs and is recorded in fact table, attach it with the unique combination ID. Rapidly Changing Monster Dimensions (contd..) • Advantages – No increase in data storage everytim event occurs • Drawbacks – Forced to use ranges of discrete values for dimensional attributes – New dimension cannot be too big (not >1M) – Data in new dimension can be accessed along with static data only through the fact table - slower – Only if event occurs, link the static and changing portions of dimension - keep a dummy event in fact Degenerate Dimensions • Occur in line item oriented fact tables • occur when dimension table is left only with a single key and no other fields • all other attributes have been moved into other dimension tables • Moved to fact table - not joined to anything Junk Dimensions • Number of miscellaneous flags and text attributes left over after design WHAT TO DO WITH THEM???? DO NOT – Leave them behind in the fact table – Make each flag and attribute into its own dimension – Strip off all such flags and attributes Junk Dimensions (contd…) • DO – Grouping of random flags and attributes – take away from fact and group them into junk dimension eg. Open ended comments fields Conformed Dimensions • Dimension that means the same thing with every possible fact table that it is joined. • Dimension is identically the same dimension in each data mart • Major responsibility of the central DWdesign team is to establish, publish, maintain and Enforce them • DW cannot function as an integrated whole without strict adherence to conformed dimensions Conformed Dimensions (Contd.) • When you don’t need Conformed Dimensions – Several lines of business where the customers and products are disjoint. – Don’t manage these separate business lines together Time Dimension • An exclusive Time dimension is required because the SQL date semantics and functions cannot generate several important attributes required for analytical purposes. • Attributes like weekdays, weekends, fiscal period, holidays, season cannot be generated by SQL statements. Time Dimension • Moreover SQL date stamps occupy more space largely increasing the size of the fact table. • Joins on such SQL generated date-stamps are costly decreasing the query speed significantly. Time Dimension • The Day of week(Monday, ...) is useful to create reports comparing for ex. Monday sales to Friday sales. • The Day number in month is useful for comparing measures for the same day in each month. • The last day in month flag is useful for performing payday analysis. Time Dimension • The holiday flag and season attributes are useful for holiday VS non-holiday analysis and season business analysis. • Event attribute is needed to record special days like strike days, etc..