Learning Center
Plans & pricing Sign in
Sign Out

Data Analysis


  • pg 1
									Data Analysis

  Data Warehouse
SQL for Data Analysis
    Data Mining

 Bogdan Shishedjiev Data Analysis   1
                      Data Processing
• Data processing types
   – OLTP (On Line Transaction Processing)
   – OLAP (On-line Analytical Processing )
• Database types
   – Transactional
      • Numerous users
      • Dynamic (flickering)
      • Always maintaining current state
      • Critical (very loaded)
   – Warehouses
      • Aa few users (analyzers)
      • Relatively stable
      • Maintaining the data history (all states in the time)
      • Not loaded

                        Bogdan Shishedjiev Data Analysis        2

Bogdan Shishedjiev Data Analysis   3
• Source compnents
  – Filter – separate and assure the coherency of data to be
  – Export – do the transfer of data portions in precise
    moments of time.
• Warehouse components
  –   Loader – Initial loading and preparing the warehouse.
  –   Refresh – loads the portions
  –   достъп
  –   data mining
  –   Export – to other warehouses. This creates an hierarchy
      of warehouses

                     Bogdan Shishedjiev Data Analysis          4
         Relational Schemes
• Star

           Bogdan Shishedjiev Data Analysis   5
              Relational Schemes
• Snowflake

                Bogdan Shishedjiev Data Analysis   6
             Data Warehouse Design
•   Stages
    – Choose the activity processes to model
    – Choose the granularity of the activity procesus
    – Choose the dimensions that can be applied to every
      record of the fact table.
    – Choose the facts that must be recorded in the fact

                    Bogdan Shishedjiev Data Analysis       7
            Data Warehouse Design
• Fact types – the most valuable are numerical
  continuous values
   – Additive – they can be added along all dimensions
     (Money amounts)
   – Semi-additive – they can be added along some of
     dimensions (Precipitations, Product quantities)
   – Non-additive – they cannot be added along any
     dimension (Wind speed, wind direction)

                   Bogdan Shishedjiev Data Analysis      8
             Data Warehouse Design
• Recommendations
  –   Use continuous additive numerical values
  –   The fact table is highly normalized
  –   Don’t normalize the dimensions. The gain is < 1%
  –   Design thoroughly the dimension attributes. Most often
      they are textual and discrete. They are used as headings
      and constraint sources in the answers to users

                     Bogdan Shishedjiev Data Analysis        9
  Dimension time                                                      Dimension product
      day_of_ week
                                 Example –                                    product_key
                                                                  SKU(stock keeping units )_description
                              Hypermarket Chain                              package_size
                                                                          brand subcategory
     week_no_overall                                                            category
            month                                                             departement
    month_no_overall                                                         package_ type
            quarter                                                             diet_type
       fiscal_period                                                             weight
        holiday_flag                                                    weight_unit_of_mesure
      weekday_flag                                                       units_per_retail_case
 last_day_in_month_flag               Facts - Sales                    units_per_shipping_case
           season                                                          cases_per_pallet
                                           Time_key                            shelf_width
                                         product_key                          shelf_height
                                           Store_key                          shelf_depth
Dimension shop                         promotion     key                      ..and others
             stoie_key                    dollar_sales
           store_ name                    units_sales
             store_city                                                         promotion_key
          store_county                                                        promotion_name
            store_state                                                    price__reduction_ type
           store_phone                                                           dlsplay_ type
            store_FАX                                                            coupon_type
        floor_plan_ type                                                      ad_media_name
  photo_processing_type                                                        display_provider
   finance_services_type                                                          promo_cost
      first_opened_date                                                      promo_begin_date
      last_remodel_date                                                        promo_end-date
          store_surface                                                          ..and others
frozen-surface ..and others    Bogdan Shishedjiev Data Analysis                            10
                   Hypermarket Chain
• Fact table
   – Granularity – each sell of a product (SKU – Stock
     Keeping Unit)
   – Values – Total cost (additive), SKU quantity (semi-
     additive), price (non-additive), customer count (non
• Dimensions
   –   Time
   –   Product
   –   Store
   –   Promotion

                     Bogdan Shishedjiev Data Analysis       11
               Hypermarket Chain
• Calculation of disk space needed
   – Dimension time : 2 years x 365 days = 730 days
   – Dimension shop : 300 shops, everyday records
   – Dimension product : 30.000 products in each shop;
     3000 are sold every day in each shop.
   – Dimension promotion : An article can participate in
     only one promotion in a shop during one day.
   – Elementary fact records 300 x 730 x 3000 x 1 = 657
     .106 records
   – Key field number 4; Value field number 4 ; Total
     number osf fields =8
   – Fact table size - 657 .106 x 8 fields x 4B = 21 GB
                    Bogdan Shishedjiev Data Analysis       12
     Data Operations for Data Analysis
• General form of a SQL statement
select D1.C1, ... Dn.Cn, Aggr1(F,Cl),…,
  from Fact as F, Dimension1 as D1,...
  DimensionN as Dn
  where join-condition (F, D1)
       and join-condition (F, Dn)
       and selection-condition
   group by D1.C1, ... Dn.Cn
   order by D1.C1, ... Dn.C
                Bogdan Shishedjiev Data Analysis   13
       Data Operations for Data Analysis
• Example
select Time.Month, Product.Name, sum(Qty)
from Sale, Time, Product, Promotion
where Sale.TimeCode = Time.TimeCode
         and Sale.ProductCode = Product.ProductCode
         and Sale.PromoCode = Promotion.PromoCode
         and (Product. Name = ' Pasta' or Product.Name = 'Oil')
         and Time.Month between 'Feb' and 'Apr'
         and Promotion.Name = 'SuperSaver'
group by Time.Month, Product.Name
order by Time.Month, Product.Name
pivot Time.Month
                       Feb        Mar         Apr

              Oil      5K         5K          7K

              Pasta    45K        50K         51K

                        Bogdan Shishedjiev Data Analysis          14
                            Data Cube
The cube is used to represent data along some measure of interest.
Although called a "cube", it can be 2-dimensional, 3-dimensional, or
higher-dimensional. Each dimension represents some attribute in the
database and the cells in the data cube represent the measure of

                         Bogdan Shishedjiev Data Analysis          15
                               Data Cube
• Data cube representation
 Combination                   Count       Combination                Count

 {P1, Calgary, Vance}          2
                                           {P3, Vancouver, Richard}   9
 {P2, Calgary, Vance}          4
 {P3, Calgary, Vance}          1           {P4, Vancouver, Richard}   2
 {P1, Toronto, Vance}          5
                                           {P5, Vancouver, Richard}   9
 {P3, Toronto, Vance}          8
                                           {P1, Calgary, Richard}     2
 {P5, Toronto, Vance}          2
                                           {P2, Calgary, Richard}     1
 {P5, Montreal, Vance}         5
                                           {P3, Calgary, Richard}     4
 {P1, Vancouver, Bob}          3
 {P3, Vancouver, Bob}          5           {P2, Calgary, Allison}     2

 {P5, Vancouver, Bob}          1           {P3, Calgary, Allison}     1

 {P1, Montreal, Bob}           3           {P1, Toronto, Allison}     2
 {P3, Montreal, Bob}           8           {P2, Toronto, Allison}     3
 {P4, Montreal, Bob}           7           {P3, Toronto, Allison}     6
 {P5, Montreal, Bob}           3
                                           {P4, Toronto, Allison}     2
 {P2, Vancouver, Richard}      11
                            Bogdan Shishedjiev Data Analysis              16
                   Data Cube
• Totals - the value ANY or ALL or NULL

                Bogdan Shishedjiev Data Analysis   17
                                   Data Cube
      • Drill down – adding a dimension for more detailed results
Time. Month   Product.Name   sum(Qty)
                                            TIme.Monih       Product.Name   Zone     sum(Qty)
Feb           Pasta          48K            Feb              Pasta          North      18K
Mar           Pasta          50K            Feb              Pasta          Centre     18K
Apr           Pasta          51K            Feb              Pasta          South      12K
                                            Mar              Pasta          North      18K
                                            Mar              Pasta          Centre     18K
                                            Mar              Pasta          South      14K
                                            Apr              Pasta          North      18K
                                            Apr              Pasta          Centre     17K
                                            Apr              Pasta          South      16K

                               Bogdan Shishedjiev Data Analysis                           18
                           Data Cube
• Roll-up - removing dimension
   TIme.Monih   Product.Name                Zone           sum(Qty)
   Feb          Pasta                       North              18K
   Feb          Pasta                       Centre             18K
   Feb          Pasta                       South              12K
   Mar          Pasta                       North              18K
   Mar          Pasta                       Centre             18K
   Mar          Pasta                       South              14K
   Apr          Pasta                       North              18K
   Apr          Pasta                       Centre             17K
   Apr          Pasta                       South              16K

                          Product.Name          Zone       sum(Qty)
                          Pasta                 North          54K
                          Pasta                 Centre         53K
                          Pasta                 South          42K

                        Bogdan Shishedjiev Data Analysis              19
                       Data Cube
•The whole data cube
 TIme.Monih    Product.Name                     Zone    sum(Qty)
Feb           Pasta                            North           18K
Feb           Pasta                            Centre          18K
Feb           Pasta                            South           12K
Mar           Pasta                            North           18K
Mar           Pasta                            Centre          18K
Mar           Pasta                            South           14K
Apr           Pasta                            North           18K
Apr           Pasta                            Centre          17K
Apr           Pasta                            South           16K
ALL           Pasta                            North           54K
ALL           Pasta                            Centre          53K
ALL           Pasta                            South           42K
Feb           Pasta                            ALL             48K
Mar           Pasta                            ALL             50K
Apr           Pasta                            ALL             51K
ALL           Pasta                            ALL            149K
ALL           ALL                              ALL            149K

                    Bogdan Shishedjiev Data Analysis                 20

To top