ch8

Document Sample
ch8 Powered By Docstoc
					D                     Data Warehouse
A
                        Predefined
T                       reports
                                                  Interactive
                                                  data analysis
A   Operations
    data
B                                  Daily data
A                OLTP Database
                                   transfer


S                3NF tables
                                                Data warehouse
                                                Star configuration
E                          Flat files
                                                                     1
D                    Data Warehouse Goals
A    Existing databases optimized for Online Transaction
T     Processing (OLTP)
     Online Analytical Processing (OLAP) requires fast retrievals,
A     and only bulk writes.
     Different goals require different storage, so build separate dta
      warehouse to use for queries.
B    Extraction, Transformation, Transportation (ETT)
     Data analysis
A        Ad hoc queries
         Statistical analysis
S        Data mining (specialized automated tools)


E
                                                                         2
D   Extraction, Transformation, and
         Transportation (ETT)
A               Customers

T               Convert Client
                to Customer
A               Apply standard
                product numbers
B               Convert
A               currencies


S               Fix region codes
                                   Data warehouse:
                                   All data must be
    Transaction data
                                   consistent.
E   from diverse
    systems.

                                                      3
D                         OLTP v. OLAP
A
    Category          OLTP                   OLAP
T   Data storage
    Indexes
                      3NF tables
                      Few
                                             Multidimensional cubes
                                             Many
    Joins             Many                   Minimal
A   Duplicated data   Normalized,
                      limited duplication
                                             Denormalized DBMS

    Updates           Constant, small data   Overnight, bulk
B   Queries           Specific               Ad hoc


A
S
E
                                                                      4
D              Multidimensional Cube
A                             Pet Store
                              Item Sales
                              Amount = Quantity*Sale Price
T
A
B   Customer
A   Location


S
E
                                                        5
D            Sales Date: Time Hierarchy
A                   Year

T   Levels                     Roll-up
                   Quarter     To get higher-level totals
A
B                  Month


A                               Drill-down
                                To get lower-level details
                    Week
S
E                   Day


                                                             6
D                       Star Design
A    Dimension Tables

T   Products                                   Sales Date

A                       Fact Table

                           Sales
B                         Quantity
                   Amount=SalePrice*Quantity

A   Customer

S   Location



E
                                                            7
D                      Snowflake Design                   City
                                                    CityID
A    Merchandise              Sale
                                                    ZipCode
                                                    City

T   ItemID
    Description
                           SaleID
                           SaleDate
                                                    State

    QuantityOnHand         EmployeeID
A   ListPrice
    Category
                           CustomerID
                           SalesTax
                                             Customer
                                            CustomerID
B                                           Phone
                                            FirstName

A          OLAPItems
           SaleID
                                            LastName
                                            Address
                                            ZipCode
           ItemID
S          Quantity
           SalePrice
                                            CityID

                           Dimension tables can join to
E          Amount
                           other dimension tables.

                                                                 8
D               OLAP Computation Issues
A              Quantity       Price      Quantity*Price
                3              5.00           15.00
T               2              4.00             8.00
                5              9.00      45.00 or 23.00
A
B   Compute Quantity*Price in base query, then add to get $23.00
    If you use Calculated Measure in the Cube, it will add first and
A   multiply second to get $45.00, which is wrong.

S
E
                                                                       9
D   OLAP Data Browsing
A
T
A
B
A
S
E
                         10
D   Microsoft Excel Pivot Table
A
T
A
B
A
S
E
                                  11
D                     Excel Pivot Table Reports
                                                Quarter   Month


A        LastName EmployeeID
         Carpenter
                              Data
                            8 Sum of Animal
                              Sum of Merchandise
                                                Quarter 1 Quarter 2 Quarter 3 Quarter 4 Grand Total

                                                  1,668.91
                                                    324.90
                                                              606.97
                                                               78.30
                                                                         426.39
                                                                          99.00
                                                                                       7.20
                                                                                     128.70
                                                                                               2,709.47
                                                                                                 630.90


T
         Eaton              6 Sum of Animal         522.37               341.85      562.50    1,426.72
                              Sum of Merchandise     30.60                54.90      107.10      192.60
         Farris             7 Sum of Animal       5,043.36   1,059.70                796.47    6,899.53
                              Sum of Merchandise    826.92     188.10                306.00    1,321.02
         Gibson             2 Sum of Animal       4,983.51   1,549.83              2,556.10    9,089.44


A        Hopkins

         James
                              Sum of Merchandise
                            4 Sum of Animal
                              Sum of Merchandise
                            5 Sum of Animal
                                                    668.25
                                                  3,747.96
                                                    476.91
                                                  3,282.77
                                                               238.50
                                                             1,194.88
                                                               252.90
                                                             2,373.08
                                                                         372.65
                                                                         121.50
                                                                         437.88
                                                                                     450.90
                                                                                     128.41
                                                                                       7.20
                                                                                     150.11
                                                                                               1,357.65
                                                                                               5,443.90
                                                                                                 858.51
                                                                                               6,243.84


B
                              Sum of Merchandise    505.89     693.45     99.00       99.00    1,397.34
         O'Connor           9 Sum of Animal       2,643.69     180.91    510.12                3,334.72
                              Sum of Merchandise    263.70      83.70     55.80                  403.20
         Reasoner           3 Sum of Animal       4,577.43     625.74    589.68    2,500.24    8,293.09
                              Sum of Merchandise    762.30      89.10    116.80      396.90    1,365.10


A        Reeves

         Shields
                            1 Sum of Animal
                              Sum of Merchandise
                          10 Sum of Animal
                              Sum of Merchandise
                                                  1,120.93
                                                    263.88
                                                  1,008.76
                                                     62.10
                                                                          162.15
                                                                           22.50
                                                                                               1,120.93
                                                                                                 263.88
                                                                                               1,170.91
                                                                                                  84.60
         Total Sum of Animal                     28,599.69   7,591.11   2,840.72   6,701.03   45,732.55

S        Total Sum of Merchandise                 4,185.45   1,624.05     569.50   1,495.80    7,874.80




E   Can place data in rows or columns.
    By grouping months, can instantly get quarterly or monthly totals.

                                                                                                          12
D           OLAP in SQL 99
                                          Category Month Amount
                                          Bird         1    $135.00
                                          Bird         2     $45.00
A   GROUP BY two columns                  Bird         3    $202.50
                                          Bird         6     $67.50
T   Gives you totals for each
    month within each category.
                                          Bird         7     $90.00
                                          Bird         9     $67.50

A   You do not get super-
    aggregate totals for the
                                          Cat
                                          Cat
                                                       1
                                                       2
                                                            $396.00
                                                            $113.85

B   category, or the month, or the
    overall total.
                                          Cat
                                          Cat
                                                       3
                                                       4
                                                            $443.70
                                                              $2.25

A   SELECT Category, Month(SaleDate) AS Month,
            Sum(Quantity*SalePrice) AS Amount
S   FROM Sale INNER JOIN (Merchandise INNER JOIN SaleItem
    ON Merchandise.ItemID = SaleItem.ItemID)
    ON Sale.SaleID = SaleItem.SaleID
E   GROUP BY Category, Month(SaleDate);


                                                                  13
D                        SQL ROLLUP
A   SELECT Category, Month…, Sum …
    FROM …
T   GROUP BY ROLLUP (Category, Month...)

    Category    Month    Amount
A   Bird         1       135.00

B   Bird
    …
                 2        45.00

    Bird     (null)      607.50
A   Cat
    Cat
                 1
                 2
                         396.00
                         113.85
S   …
    Cat      (null)     1293.30
    …
E   (null)   (null)     8451.79

                                           14
D            Missing Values Cause Problems
A    If there are missing values in the groups, it can be difficult
     to identify the super-aggregate rows.

T   Category     Month     Amount
    Bird          1        135.00
A   Bird
    …
                  2         45.00

                                                       Missing date
B   Bird
    Bird
              (null)
             (null)
                            32.00
                           607.50
                                                       Super-aggregate
    Cat           1        396.00
A   Cat
    …
                  2        113.85


S   Cat
    …
             (null)       1293.30

    (null)   (null)       8451.79
E
                                                                         15
D                 GROUPING Function
    SELECT Category, Month…, Sum …,
          GROUPING (Category) AS Gc,
A         GROUPING (Month) AS Gm
    FROM …
T   GROUP BY ROLLUP (Category, Month...)
    Category Month Amount       Gc   Gm
A   Bird        1   135.00      0    0
    Bird        2    45.00      0    0
B   …
    Bird    (null)   32.00       0   0
A   Bird
    Cat
           (null)
                1
                    607.50
                    396.00
                                 1
                                 0
                                     0
                                     0
    Cat         2   113.85       0   0
S   …
    Cat    (null)  1293.30       1   0
E   …
    (null) (null)  8451.79       1   1

                                           16
D                         CUBE Option
     SELECT Category, Month, Sum, GROUPING (Category) AS Gc,
           GROUPING (Month) AS Gm
A    FROM …
     GROUP BY CUBE (Category, Month...)

T   Category Month
    Bird         1
                        Amount
                        135.00
                                    Gc
                                     0
                                         Gm
                                          0
A   Bird
    …
                 2       45.00       0    0

    Bird     (null)       32.00      0    0
B   Bird    (null)       607.50      1    0
    Cat          1       396.00      0    0
A   Cat
    …
                 2       113.85      0    0

    Cat     (null)      1293.30      1    0
S   (null)       1      1358.8       0    1
    (null)       2      1508.94      0    1
E   (null)
    …
                 3      2362.68      0    1

    (null) (null)       8451.79      1    1
                                                               17
D          GROUPING SETS: Hiding Details
A   SELECT Category, Month, Sum
    FROM …
    GROUP BY GROUPING SETS
T    ( ROLLUP (Category),
       ROLLUP (Month),
       ()
A    )
    Category Month      Amount
B   Bird    (null)      607.50
    Cat     (null)     1293.30
A   …
    (null)      1      1358.8
S   (null)
    (null)
                2
                3
                       1508.94
                       2362.68
    …
E   (null) (null)      8451.79

                                           18
D       SQL OLAP Analytical Functions
A   VAR_POP
    VAR_SAMP
                     variance


T   STDDEV_POP
    STDEV_SAMP
                     standard deviation

    COVAR_POP        covariance
A   COVAR_SAMP
    CORR                 correlation
    REGR_R2          regression r-square
B   REGR_SLOPE
    REGR_INTERCEPT
                     regression data (many)


A
S
E
                                              19
D                 SQL RANK Functions
A   SELECT Employee, SalesValue
    RANK() OVER (ORDER BY SalesValue DESC) AS rank
    DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
T   FROM Sales
    ORDER BY SalesValue DESC, Employee;

A   Employee     SalesValue   rank   dense
    Jones          18,000       1      1
B   Smith          16,000       2      2

A   Black          16,000       2      2
    White          14,000       4      3
S                                            DENSE_RANK
                                             does not skip

E                                            numbers


                                                             20
D                 SQL OLAP Windows
    SELECT Category, SaleMonth, MonthAmount,
A    AVG(MonthAmount)
      OVER (PARTITION BY Category
         ORDER BY SaleMonth ASC ROWS 2 PRECEDING)
T     AS MA
    FROM qryOLAPSQL99

A   ORDER BY SaleMonth ASC;

    Category   SaleMonth   MonthAmount   MA
B   Bird
    Bird
               200101
               200102
                           1500.00
                           1700.00
    Bird       200103      2000.00       1600.00
A   Bird
    …
               200104      2500.00       1850.00


S   Cat
    Cat
               200101
               200102
                           4000.00
                           5000.00
    Cat        200103      6000.00       4500.00
E   Cat
    …
               200104      7000.00       5500.00


                                                    21
D                       Ranges: OVER
A   SELECT SaleDate, Value
    SUM(Value) OVER (ORDER BY SaleDate) AS running_sum,
T   SUM(Value) OVER (ORDER BY SaleDate RANGE
           BETWEEN UNBOUNDED PRECEDING
           AND CURRENT ROW) AS running_sum2,
A   SUM (Value) OVER (ORDER BY SaleDate RANGE
           BETWEEN CURRENT ROW

B   FROM …
           AND UNBOUNDED FOLLOWING) AS remaining_sum;



A   Sum1 computes total from beginning through current row.
S   Sum2 does the same thing, but more explicitly lists the rows.
    Sum3 computes total from current row through end of query.
E
                                                                    22
D                 LAG and LEAD Functions
A   LAG or LEAD: (Column, # rows, default)

T   SELECT SaleDate, Value,
       LAG (Value 1,0) OVER (ORDER BY SaleDate) AS prior_day
       LEAD (Value 1, 0) OVER (ORDER BY SaleDate) AS next_day
A   FROM …
    ORDER BY SaleDate
B    SaleDate    Value prior_day     next_day
                                                          Prior is 0 from
                                                          default value

A    1/1/2003
     1/2/2003
     1/3/2003
                  1000
                  1500
                  2000
                               0
                            1000
                            1500
                                         1500
                                         2000
                                         2300
S    …
     1/31/2003    3500       3200            0

E     Not part of standard yet? But are in SQL Server and Oracle.

                                                                            23
D                                Data Mining
A
     Goal: To discover unknown relationships in the data
T     that can be used to make better decisions.

A    Transactions and operations       Reports
B    Specific ad hoc questions         Queries
A
     Aggregate, compare, drill down     OLAP
S                                                   Databases
     Unknown relationships
E                                     Data Mining


                                                                24
D                   Exploratory Analysis
A
     Data Mining usually works autonomously.
T       Supervised/directed
        Unsupervised
A       Often called a bottom-up approach that scans the data to
         find relationships
B    Some statistical routines, but they are not sufficient
        Statistics relies on averages
A       Sometimes the important data lies in more detailed pairs

S
E
                                                                    25
D                  Common Techniques
A
     Classification/Prediction/Regression
T    Association Rules/Market Basket Analysis
     Clustering
A       Data points
        Hierarchies
B    Neural Networks
     Deviation Detection
A    Sequential Analysis
        Time series events
S       Websites
     Textual Analysis
E    Spatial/Geographic Analysis

                                                 26
D                Classification Examples
A
     Examples
T      Which borrowers/loans are most likely to be successful?
       Which customers are most likely to want a new item?
A      Which companies are likely to file bankruptcy?
       Which workers are likely to quit in the next six months?
       Which startup companies are likely to succeed?
B      Which tax returns are fraudulent?

A
S
E
                                                                   27
D                     Classification Process
A    Clearly identify the outcome/dependent variable.
     Identify potential variables that might affect the outcome.
T        Supervised (modeler chooses)
         Unsupervised (system scans all/most)

A    Use sample data to test and validate the model.
     System creates weights that link independent variables to

B     outcome.


A
        Income Married Credit History Job Stability Success
S       50000    Yes      Good          Good         Yes
        25000    Yes      Bad           Bad          No
E       75000    No       Good          Good         No


                                                                    28
D               Classification Techniques
A
     Regression
T    Bayesian Networks
     Decision Trees (hierarchical)
A    Neural Networks
B    Genetic Algorithms

A    Complications
        Some methods require categorical data
S       Data size is still a problem

E
                                                 29
D                Association/Market Basket
A    Examples
T       What items are customers likely to buy together?
        What Web pages are closely related?

A       Others?
     Classic (early) example:

B       Analysis of convenience store data showed customers often buy
         diapers and beer together.
        Importance: Consider putting the two together to increase cross-
A        selling.


S
E
                                                                            30
D             Association Details (two items)
A    Rule evaluation (A implies B)
T       Support for the rule is measured by the percentage of all
         transactions containing both items: P(A ∩ B)
        Confidence of the rule is measured by the transactions with A that
A        also contain B: P(B | A)
        Lift is the potential gain attributed to the rule—the effect compared
B        to other baskets without the effect. If it is greater than 1, the effect
         is positive:
            P(A ∩ B) / ( P(A) P(B) )
A           P(B|A)/P(B)
     Example: Diapers implies Beer
S       Support: P(D ∩ B) = .6         P(D) = .7       P(B) = .5
        Confidence: P(B|D) = .857      = P(D ∩ B)/P(D) = .6/.7
E       Lift: P(B|D) / P(B) = 1.714    = .857 / .5


                                                                                31
D                      Association Challenges
A    If an item is rarely purchased, any other item bought with it
      seems important. So combine items into categories.
T    Item
     1 “ nails
                 Freq.
                 2%
                                             Item
                                             Hardware
                                                             Freq.
                                                             15%
A    2” nails
     3” nails
                 1%
                 1%
                                             Dim. Lumber
                                             Plywood
                                                             20%
                                                             15%
B    4” nails
     Lumber
                 2%
                 50%
                                             Finish lumber 15%


A    Some relationships are obvious.

S        Burger and fries.
     Some relationships are meaningless.
         Hardware store found that toilet rings sell well only when a new
E         store first opens. But what does it mean?

                                                                             32
D                       Cluster Analysis
     Examples
A       Are there groups of customers? (If so, we can cross-sell.)
        Do the locations for our stores have elements in common? (So we
T        can search for similar clusters for new locations.)
        Do our employees (by department?) have common characteristics?
         (So we can hire similar, or dissimilar, people.)
A    Problem: Many dimensions and large datasets

B                                          Large
                                           intercluster
                                           distance
A
S                                                  Small
                                                   intracluster
                                                   distance
E
                                                                      33
D                  Geographic/Location
A    Examples
        Customer location and sales comparisons
T       Factory sites and cost
        Environmental effects
A    Challenge: Map data, multiple overlays

B
A
S
E
                                                   34

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:2
posted:8/26/2012
language:English
pages:34