Data Warehousing by mnmgroup

VIEWS: 35 PAGES: 73

									Data Warehouse
       &
  Data Mining




                 1
Overview

Part 1: Data Warehouses
Part 2: OLAP
Part 3: Data Mining




                           2
Part 1: Data Warehouses




                          3
Data, Data everywhere
yet ...
           I can’t find the data I need
             data is scattered over the network
             many versions, subtle differences
           I can’t get the data I need
             need an expert to get the data
           I can’t understand the data I
            found
             available data poorly documented
           I can’t use the data I found
             results are unexpected
             data needs to be transformed from
              one form to other
                                                   4
What is a Data Warehouse?

 A single, complete and
 consistent store of data
 obtained from a variety of
 different sources made
 available to end users in a
 what they can understand
 and use in a business
 context.

 [Barry Devlin]
                               5
   Why Data Warehousing?
                                 Which are our
                             lowest/highest margin
                                  customers ?
                                                     Who are my customers
      What is the most                                and what products
    effective distribution                             are they buying?
           channel?



  What product prom-                                       Which customers
-otions have the biggest                                 are most likely to go
   impact on revenue?                                    to the competition ?
                               What impact will
                             new products/services
                               have on revenue
                                 and margins?                                6
Decision Support

Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Used by managers and end-users to
 understand the business and make
 judgements


                                            7
The Evolution of Data
Warehousing

Since 1970s, organizations gained
 competitive advantage through systems
 that automate business processes to offer
 more efficient and cost-effective services
 to the customer.

This resulted in accumulation of growing
 amounts of data in operational databases.


                                              8
Data Warehousing
Concepts

A subject-oriented, integrated, time-
 variant, and non-volatile collection of
 data in support of management’s
 decision-making process (Inmon,
 1993).




                                       9
Subject-oriented Data
 The warehouse is organized around the
  major subjects of the enterprise (e.g.
  customers, products, and sales) rather
  than the major application areas (e.g.
  customer invoicing, stock control, and
  product sales).

 This is reflected in the need to store
  decision-support data rather than
  application-oriented data.

                                           10
Integrated Data

The data warehouse integrates corporate
 application-oriented data from different
 source systems, which often includes data
 that is inconsistent.

The integrated data source must be made
 consistent to present a unified view of the
 data to the users.


                                           11
Time-variant Data

Data in the warehouse is only accurate
 and valid at some point in time or over
 some time interval.

Time-variance is also shown in the
 extended time that the data is held, the
 implicit or explicit association of time with
 all data, and the fact that the data
 represents a series of snapshots.

                                             12
Non-volatile Data

Data in the warehouse is not
 updated in real-time but is refreshed
 from operational systems on a
 regular basis.

New data is always added as a
 supplement to the database, rather
 than a replacement.

                                      13
Benefits of Data
Warehousing

Potential high returns on investment

Competitive advantage

Increased productivity of corporate
 decision-makers


                                       14
Comparison of OLTP
Systems and Data
Warehousing




                     15
Data Warehouse Queries

The types of queries that a data
 warehouse is expected to answer ranges
 from the relatively simple to the highly
 complex and is dependent on the type of
 end-user access tools used.

End-user access tools include:
  Reporting, query, and application development
   tools
  Executive information systems (EIS)
  OLAP tools
  Data mining tools                           16
Examples of Typical Data
Warehouse Queries
 What was the total revenue for Scotland in the third quarter of 2004?
 What was the total revenue for property sales for each type of property in
  Great Britain in 2003?
 What are the three most popular areas in each city for the renting of
  property in 2004 and how does this compare with the figures for the
  previous two years?
 What is the monthly revenue for property sales at each branch office,
  compared with rolling 12-monthly prior figures?
 What would be the effect on property sales in the different regions of
  Britain if legal costs went up by 3.5% and Government taxes went down
  by 1.5% for properties over £100,000?
 Which type of property sells for prices above the average selling price for
  properties in the main cities of Great Britain and how does this correlate
  to demographic data?
 What is the relationship between the total annual revenue generated by
  each branch office and the total number of sales staff assigned to each
  branch office?



                         © Pearson Education Limited 1995, 2005            17
Problems of Data
Warehousing
Underestimation of resources for data
 loading

Hidden problems with source systems

Required data not captured

Increased end-user demands

Data homogenization
                                         18
Problems of Data
Warehousing

High demand for resources

Data ownership

High maintenance

Long duration projects

Complexity of integration
                             19
Typical Architecture of a
Data Warehouse




                            20
Data Mart
A subset of a data warehouse that
 supports the requirements of a particular
 department or business function.

Characteristics include
  Focuses on only the requirements of one
   department or business function.
  Do not normally contain detailed operational
   data unlike data warehouses.
  More easily understood and navigated.


                                                  21
Reasons for Creating a
Data Mart
 To give users access to the data they need to
  analyze most often.

 To provide data in a form that matches the
  collective view of the business function area.
  data by a group of users in a department or

 To improve end-user response time due to the
  reduction in the volume of data to be accessed.



                                                    22
Reasons for Creating a
Data Mart

To provide appropriately structured data
 as dictated by the requirements of the
 end-user access tools.

Building a data mart is simpler compared
 with establishing a corporate data
 warehouse.

The cost of implementing data marts is
 normally less than that required to
 establish a data warehouse.                23
Reasons for Creating a
Data Mart

The potential users of a data mart
 are more clearly defined and can be
 more easily targeted to obtain
 support for a data mart project
 rather than a corporate data
 warehouse project.



                                       24
From the Data Warehouse
to Data Marts
Information

  Individually                              Less
  Structured


    Departmentally                         History
    Structured                             Normalized
                                           Detailed


       Organizationally                      More
       Structured         Data Warehouse

Data
                                                    25
Part 2: OLAP




               26
Nature of OLAP Analysis

Aggregation -- (total sales, percent-to-
 total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization
Need interactive response to aggregate queries
                                                  27
Business Intelligence
Technologies
OLAP & Data Mining
  Accompanying the growth in data
   warehousing is an ever-increasing
   demand by users for more powerful
   access tools that provide advanced
   analytical capabilities.

  There are two main types of access
   tools available to meet this demand,
   namely Online Analytical Processing
   (OLAP) and data mining.
                                          28
Business Intelligence
Technologies
  OLAP and Data Mining differ in what
   they offer the user and because of this
   they are complementary technologies.

  An environment that includes a data
   warehouse (or more commonly one or
   more data marts) together with tools
   such as OLAP and /or data mining are
   collectively referred to as Business
   Intelligence (BI) technologies.
                                          29
Online Analytical Processing
(OLAP)
The dynamic synthesis, analysis, and
 consolidation of large volumes of
 multi-dimensional data, Codd
 (1993).

Describes a technology that uses a
 multi-dimensional view of aggregate
 data to provide quick access to
 strategic information for the
 purposes of advanced analysis.     30
Online Analytical Processing
(OLAP)
Enables users to gain a deeper
 understanding and knowledge about
 various aspects of their corporate data
 through fast, consistent, interactive
 access to a wide variety of possible views
 of the data.

Allows users to view corporate data in
 such a way that it is a better model of the
 true dimensionality of the enterprise.

                                               31
Online Analytical Processing
(OLAP)
Can easily answer ‘who?’ and ‘what?’
 questions, however, ability to answer
 ‘what if?’ and ‘why?’ type questions
 distinguishes OLAP from general-purpose
 query tools.

Types of analysis ranges from basic
 navigation and browsing (slicing and
 dicing) to calculations, to more complex
 analyses such as time series and complex
 modeling.
                                            32
Examples of OLAP
applications in various
functional areas




                          33
OLAP Applications

Although OLAP applications are
 found in widely divergent functional
 areas, they all have the following key
 features:
  multi-dimensional views of data
  support for complex calculations
  time intelligence


                                      34
OLAP Applications -
support for complex
calculations
Must provide a range of powerful
 computational methods such as that
 required by sales forecasting, which
 uses trend algorithms such as
 moving averages and percentage
 growth.



                                        35
OLAP Applications – time
intelligence
Key feature of almost any analytical
 application as performance is almost
 always judged over time.

Time hierarchy is not always used in the
 same manner as other hierarchies.

Concepts such as year-to-date and period-
 over-period comparisons should be easily
 defined.
                                            36
OLAP Benefits

Increased productivity of end-users.
Reduced backlog of applications
 development for IT staff.
Retention of organizational control over
 the integrity of corporate data.
Reduced query drag and network traffic
 on OLTP systems or on the data
 warehouse.
Improved potential revenue and
 profitability.
                                            37
Representation of Multi-
dimensional Data
Example of two-dimensional query.
    What is the total revenue generated by property
     sales in each city, in each quarter of 2004?’


Choice of representation is based on types
 of queries end-user may ask.

Compare representation - three-field
 relational table versus two-dimensional
 matrix.

                                                       38
Multi-dimensional Data as
Three-field table versus Two-
dimensional Matrix




                                39
Representation of Multi-
dimensional Data
Example of three-dimensional query.
  ‘What is the total revenue generated by
   property sales for each type of property (Flat
   or House) in each city, in each quarter of
   2004?’


Compare representation - four-field
 relational table versus three-
 dimensional cube.

                                                    40
Multi-dimensional Data as
Four-field Table versus
Three-dimensional Cube




                            41
Representation of Multi-
dimensional Data
Cube represents data as cells in an
 array.

Relational table only represents
 multi-dimensional data in two
 dimensions.



                                       42
 Multi-dimensional Data

 Measure - sales (actual, plan, variance)
                              Dimensions: Product, Region, Time
                              Hierarchical summarization paths
              W
             S
           N                  Product    Region        Time
Product




       Juice                  Industry   Country       Year
       Cola
        Milk
      Cream                   Category   Region       Quarter
   Toothpaste
       Soap
               1 2 34 5 6 7   Product    City      Month      week

                Month                    Office        Day
                                                             43
Strengths of OLAP

It is a powerful visualization
 tool
It provides fast, interactive
 response times
It is good for analyzing time
 series
It can be useful to find
 some clusters and outliners
Many vendors offer OLAP
 tools
                                  44
OLAP and Executive
Information Systems
 Andyne Computing --          Oracle -- Express
  Pablo                        Pilot -- LightShip
 Arbor Software -- Essbase    Planning Sciences --
 Cognos -- PowerPlay           Gentium
 Comshare -- Commander        Platinum Technology --
  OLAP                          ProdeaBeacon, Forest &
 Holistic Systems -- Holos     Trees
 Information Advantage --     SAS Institute -- SAS/EIS,
  AXSYS, WebOLAP                OLAP++
 Informix -- Metacube         Speedware -- Media
 Microstrategies --
  DSS/Agent
                                                        45
Part 3: Data Mining




                      46
Data Mining
The process of extracting valid, previously
 unknown, comprehensible, and actionable
 information from large databases and
 using it to make crucial business
 decisions, (Simoudis,1996).

Involves the analysis of data and the use
 of software techniques for finding hidden
 and unexpected patterns and relationships
 in sets of data.

                                           47
Data Mining
Reveals information that is hidden and
 unexpected, as little value in finding
 patterns and relationships that are already
 intuitive.

Patterns and relationships are identified
 by examining the underlying rules and
 features in the data.


                                             48
Data Mining

Most accurate results normally
 require large volumes of data to
 deliver reliable conclusions.

Starts by developing an optimal
 representation of structure of sample
 data

                                     49
Data Mining
Data mining can provide huge
 paybacks for companies who have
 made a significant investment in
 data warehousing.

Relatively new technology, however
 already used in a number of
 industries.

                                      50
Examples of Applications
of Data Mining
Retail / Marketing
  Identifying buying patterns of
   customers
  Finding associations among customer
   demographic characteristics
  Predicting response to mailing
   campaigns
  Market basket analysis

                                         51
Examples of Applications
of Data Mining

Banking
 Detecting patterns of fraudulent credit
  card use
 Identifying loyal customers
 Predicting customers likely to change
  their credit card affiliation
 Determining credit card spending by
  customer groups

                                            52
Examples of Applications
of Data Mining
Insurance
  Claims analysis
  Predicting which customers will buy new
   policies


Medicine
  Characterizing patient behavior to predict
   surgery visits
  Identifying successful medical therapies for
   different illnesses

                                                  53
Data Mining Operations
Four main operations include:
  Predictive modeling
  Database segmentation
  Link analysis
  Deviation detection

There are recognized associations
 between the applications and the
 corresponding operations.
  e.g. Direct marketing strategies use database
   segmentation.

                                               54
Data Mining Techniques

Techniques are specific
 implementations of the data mining
 operations.

Each operation has its own strengths
 and weaknesses.



                                      55
Data Mining Operations
and Associated
Techniques




                         56
Predictive Modeling
Similar to the human learning experience
  uses observations to form a model of the
   important characteristics of some
   phenomenon.


Uses generalizations of ‘real world’ and
 ability to fit new data into a general
 framework.
Can analyze a database to determine
 essential characteristics (model) about
 the data set.
                                              57
Predictive Modeling

Model is developed using a supervised
 learning approach, which has two phases:
 training and testing.
  Training builds a model using a large sample
   of historical data called a training set.
  Testing involves trying out the model on new,
   previously unseen data to determine its
   accuracy and physical performance
   characteristics.


                                                   58
Predictive Modeling

Applications of predictive modeling
 include customer retention management,
 credit approval, cross selling, and direct
 marketing.

There are two techniques associated with
 predictive modeling: classification and
 value prediction, which are distinguished
 by the nature of the variable being
 predicted.

                                              59
Example of Classification
using Tree Induction




                            60
Predictive Modeling -
Value Prediction
Used to estimate a continuous numeric
 value that is associated with a database
 record.

Uses the traditional statistical techniques
 of linear regression and nonlinear
 regression.

Relatively easy-to-use and understand.

                                               61
Predictive Modeling -
Value Prediction
Linear regression attempts to fit a straight
 line through a plot of the data, such that
 the line is the best representation of the
 average of all observations at that point in
 the plot.

Problem is that the technique only works
 well with linear data and is sensitive to
 the presence of outliers (that is, data
 values, which do not conform to the
 expected norm).
                                             62
Predictive Modeling -
Value Prediction
Data mining requires statistical
 methods that can accommodate non-
 linearity, outliers, and non-numeric
 data.

Applications of value prediction
 include credit card fraud detection or
 target mailing list identification.

                                      63
Database Segmentation
Aim is to partition a database into an
 unknown number of segments, or
 clusters, of similar records.

Uses unsupervised learning to
 discover homogeneous sub-
 populations in a database to improve
 the accuracy of the profiles.

                                      64
Database Segmentation
Less precise than other operations
 thus less sensitive to redundant
 and irrelevant features.

Applications of database
 segmentation include customer
 profiling, direct marketing, and
 cross selling.


                                      65
Example of Database
Segmentation using a
Scatterplot




                       66
Link Analysis
Aims to establish links (associations)
 between records, or sets of records, in a
 database.

There are three specializations
  Associations discovery
  Sequential pattern discovery
  Similar time sequence discovery

Applications include product affinity
 analysis, direct marketing, and stock
 price movement.                             67
Link Analysis -
Associations Discovery
Finds items that imply the presence of
 other items in the same event.

Affinities between items are represented
 by association rules.
  e.g. ‘When a customer rents property for more
   than 2 years and is more than 25 years old, in
   40% of cases, the customer will buy a
   property. This association happens in 35% of
   all customers who rent properties’.


                                                68
Link Analysis - Sequential
Pattern Discovery
Finds patterns between events such
 that the presence of one set of items
 is followed by another set of items in
 a database of events over a period of
 time.
  e.g. Used to understand long term
   customer buying behavior.


                                       69
Link Analysis - Similar Time
Sequence Discovery
Finds links between two sets of data
 that are time-dependent, and is
 based on the degree of similarity
 between the patterns that both time
 series demonstrate.
  e.g. Within three months of buying
   property, new home owners will
   purchase goods such as cookers,
   freezers, and washing machines.

                                        70
Deviation Detection
Relatively new operation in terms of
 commercially available data mining tools.

Often a source of true discovery because it
 identifies outliers, which express
 deviation from some previously known
 expectation and norm.



                                             71
Deviation Detection

Can be performed using statistics
 and visualization techniques or as a
 by-product of data mining.

Applications include fraud detection
 in the use of credit cards and
 insurance claims, quality control, and
 defects tracing.

                                        72
Example of Database
Segmentation using a
Visualization




                       73

								
To top