Lecture 13 by nouman100

VIEWS: 19 PAGES: 18

More Info
									Virtual University of Pakistan


     Data Warehousing
                  Lecture-13
           Dimensional Modeling (DM)

                         Ahsan Abdullah
                        Assoc. Prof. & Head
              Center for Agro-Informatics Research
                   www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
                    Email: ahsan@cluxing.com
                                                                  1
Dimensional Modeling (DM)




                            2
      The need for ER modeling?
 Problems with early COBOLian data
  processing systems.

 Data redundancies

 From flat file to Table, each entity ultimately
  becomes a Table in the physical schema.

 Simple O(n2) Join to work with Tables
                                                    3
Why ER Modeling has been so successful?
   Coupled with normalization drives out all
    the redundancy out of the database.

   Change (or add or delete) the data at just
    one point.

   Can be used with indexing for very fast
    access.

   Resulted in success of OLTP systems.
                                                 4
       Need for DM: Un-answered Qs
 Lets have a look at a typical ER data model first.

 Some Observations:
    All tables look-alike, as a consequence it is difficult to
     identify:

       Which table is more important ?

       Which is the largest?

       Which tables contain numerical measurements of the
        business?

       Which table contain nearly static descriptive attributes?
                                                                  5
Need for DM: Complexity of Representation
    Many topologies for the same ER
     diagram, all appearing different.
       Very hard to visualize and remember.
           12
                                7                 6
                   3                    12                7
      11                        4            8
               8
                            9
     1                                       10
                   10                                 9       11
           6                            1


                                    3        2        5
               2        5                                     4


      A large number of possible connections to
         any two (or more) tables                                  6
          Need for DM: The Paradox
 The Paradox: Trying to make information accessible using
  tables resulted in an inability to query them!

 ER and Normalization result in large number of tables which
  are:
    Hard to understand by the users (DB programmers)

    Hard to navigate optimally by DBMS software

 Real value of ER is in using tables individually or in pairs

 Too complex for queries that span multiple tables with a
  large number of records
                                                                 7
                   ER vs. DM
           ER                              DM
Constituted to optimize OLTP     Constituted to optimize DSS
        performance.                query performance.

                                       Models the macro
Models the micro relationships     relationships among data
   among data elements.             elements with an overall
                                     deterministic strategy.
                                   All dimensions serve as
   A wild variability of the
                                   equal entry points to the
   structure of ER models.
                                          fact table.
Very vulnerable to changes in    Changes in users' querying
 the user's querying habits,            habits can be
 because such schemas are             accommodated by
        asymmetrical.             automatic SQL generators.
                                                               8
How to simplify a ER data model?

 Two general methods:

  De-Normalization

  Dimensional Modeling (DM)



                                   9
               What is DM?…
 A simpler logical model optimized for decision
  support.

 Inherently dimensional in nature, with a single
  central fact table and a set of smaller
  dimensional tables.

 Multi-part key for the fact table

 Dimensional tables with a single-part PK.

 Keys are usually system generated
                                                    10
              What is DM?...
 Results in a star like structure, called star
  schema or star join.

   All relationships mandatory M-1.

   Single path between any two levels.


 Supports ROLAP operations.


                                                  11
      Dimensions have Hierarchies

                                    Items


                    Books                      Cloths

          Fiction       Text             Men        Women


               Engg            Medical




Analysts tend to look at the data through dimension at a
            particular “level” in the hierarchy

                                                            12
      The two Schemas




                   Star
Snow-flake



                          13
                          “Simplified” 3NF (Retail)
   CITY                 DISTRICT                                              M       DIVISION     PROVINCE
                                     1                    district                                                BACK
                1                                                         1
  zone                    M                                                                division
                                         M     DISTRICT               DIVISION
  ZONE                   CITY
        1
store                           M                                                 week
                                                                              1
   STORE # STREET                            ZONE               ...               DATE       WEEK
                    1                                                                          M
sale_header                                                                                                quarter
                          M                     M
RECEIPT # STORE #                            DATE         ...                                         MONTH          QTR
                                                                                               1              1
                                                                                                       M             M
            1
                                                                                        WEEK       MONTH
                    M                          sale_detail                                month                      1
        RECEIPT #               ITEM #              ...               $
                                                                                                       YEAR          QTR
                1               M        M
                                                                      1                                    year
  ITEM # CATEGORY
                                                                      ITEM #      SUPPLIER
item_x_cat                      M
                                         1                             item_x_splir
                                CATEGORY             DEPT
                                                                                                                         14
                                    cat_x_dept
      Vastly Simplified Star Schema
                                                         Product Dim
Geography Dim
                                                     1        ITEM#
    STORE#      1
                                Fact Table                   CATEGORY
     ZONE
                                RECEIPT#
                                                               DEPT
     CITY
                                 STORE#
                            M                                SUPPLIER
   DISTRICT
                                  ITEM#          M
   DIVISION
                                  DATE                       Time Dim
                                             M
   PROVINCE                         .                          DATE
                                    .                    1
                    facts           .                          WEEK

                                 Sale Rs.                     MONTH

                                                              QUARTER

                                                               YEAR

                                                                        15
     The Benefit of Simplicity



Beauty lies in close correspondence
 with the business, evident even to
           business users.



                                  16
      Features of Star Schema
Dimensional hierarchies are collapsed into a single
table for each dimension. Loss of Information?

A single fact table created with a single header from the
detail records, resulting in:

 A vastly simplified physical data model!

 Fewer tables (thousands of tables in some ERP systems).
 Fewer joins resulting in high performance.

 Some requirement of additional space.               17
   Quantifying space requirement
Quantifying use of additional space using star schema

There are about 10 million mobile phone users in Pakistan.
Say the top company has half of them = 500,000

Number of days in 1 year = 365
Number of calls recorded each day = 250,000 (assumed)
Maximum number of records in fact table = 91 billion rows
Assuming a relatively small header size = 128 bytes
Fact table storage used = 11 Tera bytes
Average length of city name = 8 characters  8 bytes
Total number of cities with telephone access = 170 (1 byte)
Space used for city name in fact table using Star = 8 x 0.091 =
0.728 TB
Space used for city code using snow-flake = 1x 0.091= 0.091 TB
Additional space used  0.637 Tera byte i.e. about 5.8%
                                                                  18

								
To top