Data Warehousing Overview Issues, Terminology, Products and (1)

Document Sample
Data Warehousing Overview  Issues, Terminology, Products and (1) Powered By Docstoc
					Introduction to Data Warehousing
           CPS 196.03
            Notes 6
                  Warehousing
l   Growing industry: $30+ billion industry
l   Range from desktop to huge:
    u   Walmart: 900-CPU, 2,700 disk, 23TB
        Teradata system (numbers from earlier part
        of this decade)
l   Lots of buzzwords, hype
    u   slice & dice, rollup, MOLAP, pivot, ...



                                                     2
                 Outline
l   What is a data warehouse?
l   Why a warehouse?
l   Models & operations
l   Implementing a warehouse




                                3
          What is a Warehouse?
l   Collection of diverse data
    u   subject oriented
    u   aimed at executive, decision maker
    u   often a copy of operational data
    u   with value-added data (e.g., summaries, history)
    u   integrated
    u   time-varying
    u   non-volatile
                                                 more
                                                           4
          What is a Warehouse?
l   Collection of tools
    u   gathering data
    u   cleansing, integrating, ...
    u   querying, reporting, analysis
    u   data mining
    u   monitoring, administering warehouse




                                              5
 Warehouse Architecture
   Client                       Client

             Query & Analysis



  Metadata     Warehouse



                Integration



Source           Source           Source


                                           6
         Motivating Examples
l   Forecasting
l   Comparing performance of units
l   Monitoring, detecting fraud
l   Visualization




                                     7
            Why a Warehouse?
l   Two Approaches:
    u   Query-Driven (Lazy)
    u   Warehouse (Eager)


                                       ?

                              Source       Source


                                                    8
  Query-Driven Approach


          Client              Client


                   Mediator


Wrapper            Wrapper             Wrapper



Source             Source              Source



                                                 9
    Advantages of Warehousing
l   High query performance
l   Queries not visible outside warehouse
l   Local processing at sources unaffected
l   Can operate when sources unavailable
l   Can query data not stored in a DBMS
l   Extra information at warehouse
    u   Modify, summarize (store aggregates)
    u   Add historical information
                                               10
    Advantages of Query-Driven
l   No need to copy data
    u   less storage
    u   no need to purchase data
l   More up-to-date data
l   Query needs can be unknown
l   Only query interface needed at sources
l   May be less draining on sources


                                             11
                OLTP vs. OLAP
l   OLTP: On Line Transaction Processing
    u   Describes processing at operational sites
l   OLAP: On Line Analytical Processing
    u   Describes processing at warehouse




                                                    12
              OLTP vs. OLAP
    OLTP                          OLAP
l   Mostly updates            l   Mostly reads
l   Many small transactions   l   Queries long, complex
l   Mb-Gb of data             l   Tb-Pb of data
l   Raw data                  l   Summarized,
l   Clerical users                consolidated data
l   Up-to-date data           l   Decision-makers,
l   Consistency,                  analysts as users
    recoverability critical


                                                          13
                   Data Marts
l   Smaller warehouses
l   Spans part of organization
    u   e.g., marketing (customers, products, sales)
l   Do not require enterprise-wide consensus
    u   but long term integration problems?




                                                       14
Warehouse Models & Operators
l   Data Models
    u   relations
    u   stars & snowflakes
    u   cubes
l   Operators
    u   slice & dice
    u   roll-up, drill down
    u   pivoting
    u   other
                               15
                    Warehouse Models
l   Modeling data warehouses: dimensions, measures
    u   Star schema: A fact table in the middle connected to a set
        of dimension tables
    u   Snowflake schema: A refinement of star schema where
        some dimensional hierarchy is normalized into a set of
        smaller dimension tables, forming a shape similar to
        snowflake
    u   Fact constellations: Multiple fact tables share dimension
        tables, viewed as a collection of stars, therefore called
        galaxy schema or fact constellation
                                                                    16
Star




       Measures




                  17
Star Schema




              18
       Another Example of Star Schema
time
time_key                                            item
day                                               item_key
day_of_the_week              Sales Fact Table     item_name
month                                             brand
quarter                               time_key    type
year                                              supplier_type
                                      item_key
                                    branch_key
       branch                                     location
                                   location_key
       branch_key                                 location_key
       branch_name                   units_sold   street
       branch_type                                city
                                   dollars_sold   state_or_province
                                                  country
                                     avg_sales
                  Measures
                                                                      19
                  Terms
l   Fact table
l   Dimension tables
l   Measures




                          20
     Dimension Hierarchies
          sType
 store
           city      region




è snowflake schema
è constellations


                              21
       Example of Snowflake Schema
time
time_key                                      item
day                                         item_key         supplier
day_of_the_week        Sales Fact Table     item_name        supplier_key
month                                       brand            supplier_type
quarter                        time_key     type
year                            item_key    supplier_key

                              branch_key
  branch                                    location
                             location_key
                                            location_key
   branch_key
                               units_sold   street
   branch_name
                                            city_key
   branch_type
                             dollars_sold                  city
                                                           city_key
                               avg_sales                   city
                                                           state_or_province
            Measures                                       country
                                                                        22
              Example of Fact Constellation
time
time_key                                    item             Shipping Fact Table
day                                      item_key
day_of_the_week       Sales Fact Table   item_name               time_key
month                                    brand
quarter                   time_key       type                      item_key
year                                     supplier_type           shipper_key
                             item_key
                           branch_key                          from_location

 branch                   location_key   location                 to_location
branch_key                               location_key            dollars_cost
branch_name
                            units_sold
                                         street
branch_type               dollars_sold   city                   units_shipped
                                         province_or_state
                            avg_sales    country                    shipper
           Measures                                                 shipper_key
                                                                    shipper_name
                                                                    location_key
                                                                               23
                                                                    shipper_type
                     Cube


Fact table view:            Multi-dimensional cube:




                              dimensions = 2


           Recall counters in Apriori

                                                      24
                   3-D Cube

Fact table view:         Multi-dimensional cube:


                        day 2

                       day 1




                                dimensions = 3



                                                   25
            Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE
           WHERE date = 1



                                 81




                                      26
             Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
           GROUP BY date




                                            27
         Another Example
• Add up amounts by day, product
• In SQL: SELECT date, sum(amt) FROM SALE
           GROUP BY date, prodId




                 rollup

                drill-down

                                            28
                   Aggregates
l   Operators: sum, count, max, min,
            median, ave
l   “Having” clause
l   Using dimension hierarchy
    u   average by region (within store)
    u   maximum by month (within date)




                                           29
       Types of Measures in Data Cubes

l   Distributive: if the result derived by applying the function to
    n aggregate values is the same as that derived by applying
    the function on all the data without partitioning
        è   E.g., count(), sum(), min(), max()
l   Algebraic: if it can be computed by an algebraic function
    with M arguments (where M is a bounded integer), each of
    which is obtained by applying a distributive aggregate
    function
        è   E.g., avg(), min_N(), standard_deviation()
l   Holistic: if there is no constant bound on the storage size
    needed to describe a subaggregate.
        è   E.g., median(), mode(), rank()                            30
               Cube Aggregation
                        Example: computing sums
 day 2                     ...
day 1




                                          129
          rollup

         drill-down
                                                  31
                 Cube Operators

 day 2                     ...
day 1
                           sale(c1,*,*)




                                              129

         sale(c2,p2,*)                    sale(*,*,*)

                                                        32
            Extended Cube

             *



    day 2


day 1                       sale(*,p2,*)




                                           33
   Cube Aggregates Lattice
                     129
                           all


         city          product         date



 city, product        city, date       product, date



 day 2           city, product, date
day 1




                                                       34
Dimension Hierarchies

 all


state



city




                        35
                Dimension Hierarchies
                       all

      city           product           date


city, product       city, date        product, date


                                                      state
                city, product, date
                                                                    state, date
                                              state, product

                                                               state, product, date

  not all arcs shown...

                                                                                      36
           Interesting Hierarchy
   all

            years

weeks
            quarters


             months         conceptual
                          dimension table

    days

                                            37
Aggregation Using Hierarchies


 day 2                   customer
day 1
                          region

                          country


                (customer c1 in Region A;
                customers c2, c3 in Region B)



                                                38
                   Multidimensional Data

l       Sales volume as a function of product,
        month, and region Dimensions: Product, Location, Time
                                 Hierarchical summarization paths
              on
        gi




                                     Industry Region      Year
    Re




                                     Category Country Quarter
    Product




                                     Product    City    Month    Week

                                               Office     Day


                    Month
                                                                    39
       Typical OLAP Operations
                                                Total annual sales
                           Date                  of TV in U.S.A.
             1Qtr   2Qtr    3Qtr   4Qtr   sum
   t
  uc


        TV
  od




      PC                                        U.S.A
Pr




    VCR




                                                         Country
 sum
                                                Canada

                                                Mexico


                                                 sum




                                                                     40
          Typical OLAP Operations
l   Roll up (drill-up): summarize data
     u by climbing up hierarchy or by dimension reduction
l   Drill down (roll down): reverse of roll-up
     u from higher level summary to lower level summary or
       detailed data, or introducing new dimensions
l   Slice and dice: project and select
l   Pivot (rotate):
     u reorient the cube, visualization, 3D to series of 2D planes
l   Other operations
     u drill across: involving (across) more than one fact table
     u drill through: through the bottom level of the cube to its
       back-end relational tables (using SQL)
                                                                 41
Fig. 3.10 Typical OLAP
Operations




                         42
                     Pivoting
Fact table view:                 Multi-dimensional cube:


                                  day 2

                                 day 1




Pivot turns unique values from
one column into unique columns
in the output

                                                           43

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:6
posted:7/22/2013
language:English
pages:43
Lingjuan Ma Lingjuan Ma MS
About work for China Compulsory Certification. Some of the documents come from Internet, if you hold the copyright please contact me by huangcaijin@sohu.com