         Data Mining, Warehousing, and Visualization



         Prof. Rushen Chahal




1-1: The Modern Data Warehouse

  • A data warehouse is a copy of transaction data
    specifically structured for querying, analysis and
    reporting
  • Note that the data warehouse contains a copy of the
    transactions. These are not updated or changed
    later by the transaction system.
  • Also note that this data is specially structured, and
    may have been transformed when it was placed in
    the warehouse
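
The idea of a restructured, read-only copy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`ts`, `sku`, `store`, `qty`, `unit_price`) and the `WarehouseRow` layout are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a warehouse row is not updated after loading
class WarehouseRow:
    """Query-friendly copy of one transaction (hypothetical layout)."""
    sale_date: date
    product: str
    region: str
    revenue: float


def transform(raw: dict) -> WarehouseRow:
    """Restructure one raw transaction record on its way into the warehouse."""
    return WarehouseRow(
        sale_date=date.fromisoformat(raw["ts"][:10]),  # keep the date part only
        product=raw["sku"].strip().upper(),            # normalize product codes
        region=raw["store"].split("-")[0],             # derive region from store id
        revenue=raw["qty"] * raw["unit_price"],        # precompute a common measure
    )


def load(transactions: list[dict]) -> tuple[WarehouseRow, ...]:
    """Copy the transactions; the source records are left untouched."""
    return tuple(transform(t) for t in transactions)


# Hypothetical transaction records from an operational system.
transactions = [
    {"ts": "2012-02-13T09:30:00", "sku": " ab-1 ", "store": "EAST-04",
     "qty": 3, "unit_price": 2.5},
    {"ts": "2012-02-13T10:05:00", "sku": "cd-2", "store": "WEST-11",
     "qty": 1, "unit_price": 10.0},
]
warehouse = load(transactions)
```

The frozen dataclass stands in for the point that warehouse rows are not changed later by the transaction system.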


 1-2: Data Warehouse Roles and
           Structures
The DW has the following primary functions:
• It is a direct reflection of the business rules of the
  enterprise.
• It is the collection point for strategic information.
• It is the historical store of strategic information.
• It is the source of information later delivered to data
  marts.
• It is the source of stable data regardless of how the
  business processes may change.



[Figure: Position of the Data Warehouse Within the Organization]
                    Data Marts
• A data mart is a smaller, more focused data warehouse.
  It reflects the business rules of a specific business unit.
• The data mart does not need to cleanse its data because
  that was done when it went into the warehouse.
• It is a set of tables for direct access by users.
• These tables are designed for aggregation.
• It typically is not a source for traditional statistical
  analysis.
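
A data mart table "designed for aggregation" might be built as in the following sketch. It assumes the warehouse rows are already cleansed, and the row fields (`unit`, `region`, `month`, `revenue`) are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical, already-cleansed warehouse rows.
warehouse_rows = [
    {"unit": "sales", "region": "East", "month": "2012-01", "revenue": 120.0},
    {"unit": "sales", "region": "East", "month": "2012-01", "revenue": 80.0},
    {"unit": "sales", "region": "West", "month": "2012-01", "revenue": 50.0},
    {"unit": "finance", "region": "East", "month": "2012-01", "revenue": 999.0},
]


def build_data_mart(rows, unit):
    """Summarize one business unit's rows by (region, month)."""
    totals = defaultdict(float)
    for row in rows:
        if row["unit"] == unit:  # a data mart serves a single business unit
            totals[(row["region"], row["month"])] += row["revenue"]
    return dict(totals)


sales_mart = build_data_mart(warehouse_rows, "sales")
```

Users of the sales data mart then read the small summary table directly instead of scanning the detailed warehouse rows.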




Position of the Data Mart Within the Organization

[Figure: a data delivery layer feeds three data marts; each data mart supplies decision support information.]
Some of the benefits of a DW
            are:

• Immediate information delivery
• Data integration from across and even outside
  the organization
• Future vision from historical trends
• Tools for looking at data in new ways
• Freedom from IS department resource
  limitations (you don’t need programmers to use
  a data warehouse)

          Examples of Common DW Applications
Sales Analysis
• Determine real-time product sales to make vital pricing and distribution decisions.
• Analyze historical product sales to determine success or failure attributes.
• Evaluate successful products and determine key success factors.
• Use corporate data to understand the margin as well as the revenue implications of a decision.
• Rapidly identify preferred customer segments based on revenue and margin.
• Quickly isolate past preferred customers who no longer buy.
• Identify daily what product is in the manufacturing and distribution pipeline.
• Instantly determine which salespeople are performing, on both a revenue and margin basis,
  and which are behind.
Financial Analysis
• Compare actuals to budgets on an annual, monthly and month-to-date basis.
• Review past cash flow trends and forecast future needs.
• Identify and analyze key expense generators.
• Instantly generate a current set of key financial ratios and indicators.
• Receive near-real-time, interactive financial statements.
Human Resource Analysis
• Evaluate trends in benefit program use.
• Identify the wage and benefits costs to determine company-wide variation.
• Review compliance levels for EEOC and other regulated activities.
Other Areas
• Warehouses have also been applied to areas such as logistics, inventory, purchasing, detailed
  transaction analysis and load balancing.


     What Does All This Mean?
• On a daily basis, organizations turn to their data
  warehouses to answer a limitless variety of questions.
• Nothing is free, however, and these benefits do come
  with a cost.
• The value of a data warehouse is a result of the new and
  changed business processes it enables.
• There are limitations, though. A DW cannot correct
  problems with the data, although it may help to clearly
  identify them.




    Comparison of Typical DW Costs and Benefits
Costs
• Hardware, software, development personnel and consultant costs.
• Operational costs like ongoing systems maintenance.

Benefits
Added revenue
• Will the new (business objective) process generate new customers?
  (What is the estimated value?)
• Will the new (business objective) process increase the buying
  propensity of existing customers? (By how much?)
• Is the new process necessary to ensure that the competition doesn't
  offer a demanded service that you can't match?
Reduced costs
• What costs of current systems will be eliminated?
• Is the new process intended to make some operation more efficient? If
  so, how, and what is the dollar value?



 The Cost of Warehousing Data
• Expenditures can be categorized as one-time initial costs
  or as recurring, ongoing costs.
• The initial costs can be further divided into hardware
  and software costs.
• Expenditures can also be categorized as capital costs
  (associated with acquisition of the warehouse) or as
  operational costs (associated with running and
  maintaining the warehouse).




 Expenditures Associated with Building a DW

Capital, recurring costs
• Hardware maintenance
• Software maintenance
• Terminal analysis
• Middleware

Capital, one-time costs
• Hardware: disk, CPU, network, terminal analysis
• Software: DBMS, terminal analysis, middleware, network, log utility,
  processing, metadata infrastructure

Operational, recurring costs
• Ongoing refreshment
• Integration transformation
• Data model maintenance
• Record identification maintenance
• Metadata infrastructure maintenance
• Archival of data
• Data aging within the DW

Operational, one-time costs
• Integration/transformation processing specification
• Metadata infrastructure population
• System of record definition
• Data dictionary language definition
• Network transfer definition
• CASE/Repository interface
• Initial data warehouse population
• Data model definition
• Database design definition
       Costs Are Highly Variable
• A company that spends less money on its data
  warehouse is often happier with it.
• The main justification for the development expense is
  that a DW reduces the cost of accessing the information
  owned by the organization.
• Since information has to be retrieved just once (when it
  is placed in the warehouse), DW users see a lower cost
  on each report generated.
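
The retrieval-cost argument can be made concrete with a toy cost model. This is a sketch under simplifying assumptions, not a costing method from the slides: every figure is hypothetical and in arbitrary units, each source costs the same to extract, and the `loads` parameter (how often the warehouse is refreshed) is an added assumption.

```python
def multidatabase_cost(reports, sources, extract_cost):
    """Without a warehouse, every report re-extracts data from every source."""
    return reports * sources * extract_cost


def warehouse_cost(reports, sources, extract_cost, query_cost, loads=1):
    """With a warehouse, sources are extracted once per load;
    each report then pays only for a warehouse query."""
    return loads * sources * extract_cost + reports * query_cost


# Hypothetical figures in arbitrary cost units.
reports, sources = 100, 4
extract_cost, query_cost = 10.0, 1.0

without_dw = multidatabase_cost(reports, sources, extract_cost)
with_dw = warehouse_cost(reports, sources, extract_cost, query_cost)
print(without_dw / reports, with_dw / reports)  # average cost per report
```

Under these assumptions the saving grows with the number of reports, because the extraction cost is spread over all of them.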




 Typical Multidatabase Report and Screen Generation

Data download and transformation contribute to retrieval costs for every report or screen generated.

[Figure: each report or screen draws on Source Systems A, B, C and D.]
Typical DW Report and Screen Generation

Data upload and transformation costs occur just once. Retrieval costs are lower.

[Figure: Source Systems A, B, C and D feed the Organizational Data Warehouse.]
        Farmers and Explorers
• Every corporation has two types of DW users.
• Farmers know what they want before they set out to find
  it. They submit small queries and retrieve small nuggets
  of information.
• Explorers are quite unpredictable. They often submit
  large queries. Sometimes they find nothing, sometimes
  they find priceless nuggets.
• Cost justification for the DW is usually done on the basis
  of the results obtained by farmers since explorers are
  unpredictable.




     Data Marts and the Data Warehouse

Legacy systems feed data to the warehouse. The warehouse feeds specialized information to departments.

[Figure: Legacy Systems, four Operational Data Stores, the Organizational Data Warehouse, and the Finance, Sales, Marketing and Accounting Data Marts.]
      The Data Mart is More Specialized

The data mart serves the needs of one business unit, not the organization.

Organizational Data Warehouse
• Corporate
• Highly granular data
• Normalized design
• Robust historical data
• Large data volume
• Data model driven data
• Versatile
• General purpose DBMS technologies

Data Marts
• Departmentalized
• Summarized, aggregated data
• Star join design
• Limited historical data
• Limited data volume
• Requirements driven data
• Focused on departmental needs
• Multi-dimensional DBMS technologies

[Figure: the Organizational Data Warehouse alongside the Finance, Sales, Marketing and Accounting Data Marts.]
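
The "star join design" listed for data marts can be illustrated with a small in-memory SQLite database: one central fact table joined to its surrounding dimension tables. The table and column names (`fact_sales`, `dim_product`, `dim_region`) and the data are hypothetical, and SQLite is used here only for convenience, not as an example of a multi-dimensional DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
-- The fact table sits at the center of the star and points at each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 50.0),
                               (2, 1, 75.0),  (1, 1, 25.0);
""")

# Star join: the fact table joined to every dimension, then aggregated.
star_join = """
SELECT p.name, r.name, SUM(f.revenue)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_region  AS r ON r.region_id  = f.region_id
GROUP BY p.name, r.name
ORDER BY p.name, r.name
"""
summary = conn.execute(star_join).fetchall()
for product, region, revenue in summary:
    print(product, region, revenue)
```

The result is the kind of summarized, aggregated data a departmental user would query, in contrast to the normalized, highly granular tables of the warehouse.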
    Foundations of Data Mining
• Data mining is the process of using raw data to infer
  important business relationships.
• Despite a consensus on the value of data mining, a great
  deal of confusion exists about what it is.
• It is a collection of powerful techniques intended for
  analyzing large datasets.
• There is no single data mining approach, but rather a set
  of techniques that can be used in combination with each
  other.




      The Roots of Data Mining
• The approach has roots in practice dating back over 30
  years.
• In the 1960s, data mining was called statistical
  analysis, and the pioneers were statistical software
  packages such as SPSS and SAS, which first appeared in
  the late 1960s and early 1970s.
• By the 1980s, the traditional techniques had been
  augmented by new methods such as fuzzy logic,
  heuristics and neural networks.




          A General Approach
Although all data mining endeavors are unique, they
    possess a common set of process steps:
1. Infrastructure preparation – choice of hardware
    platform, the database system and one or more mining
    tools
2. Exploration – looking at summary data, sampling and
    applying intuition
3. Analysis – each discovered pattern is analyzed for
    significance and trends




A General Approach (continued)
4. Interpretation – Once patterns have been discovered
   and analyzed, the next step is to interpret them.
   Considerations include business cycles, seasonality
   and the population the pattern applies to.
5. Exploitation – this is both a business and a technical
   activity. One way to exploit a pattern is to use it for
   prediction. Others are to package, price or advertise
   the product in a different way.




1.8: The Approach to Data Exploration and Data Mining

The basis for all data mining activities is correlation.

[Figure: three panels comparing data elements A and B: A Perfect Correlation, A Strong Correlation, A Weak Correlation.]
  The Spectrum of Correlation

1 = Perfect Correlation      0.5 = Moderate Correlation      0 = No Correlation

  • In general, the strength of a relationship is shown by a
    correlation coefficient whose magnitude lies between 0
    and 1.
  • Some types of correlation are signed (±) to also show
    the direction of the relationship, so they range from
    -1 to +1.
  • Even a weak correlation can be interesting, however,
    if it shows a trend over time.
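
As one concrete case of a signed coefficient, the sketch below computes Pearson's r. The slides do not name a particular coefficient, so choosing Pearson's r is an assumption made for illustration. Its sign gives the direction of the relationship and its absolute value gives the strength on the 0 to 1 scale.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations, and sum of squared deviations for each series.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data elements A and B.
a = [1.0, 2.0, 3.0, 4.0]
b_up = [2.0, 4.0, 6.0, 8.0]    # rises with A
b_down = [8.0, 6.0, 4.0, 2.0]  # falls as A rises

r_up = pearson_r(a, b_up)
r_down = pearson_r(a, b_down)
strength = abs(r_down)  # unsigned strength of the relationship
```

Both example series are perfectly linear in A, so they differ only in sign: one relationship is positive and the other negative, with the same strength.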

     Methods to Determine Correlation

The method used depends on the type of elements being correlated.

• Data element vs. data element
• Data element vs. unit of time
• Data element vs. data element groups
• Data element vs. geography
• Data element vs. external trends
• Data element vs. demographics
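
For the "data element vs. unit of time" case, one simple way to check for a trend is the least-squares slope of the element against its period index. This is a sketch of one possible method, not the method the slides prescribe, and the monthly figures are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of values against time index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods")
    mean_t = (n - 1) / 2           # mean of 0, 1, ..., n-1
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


# Hypothetical monthly values of one data element.
monthly_sales = [10.0, 12.0, 14.0, 16.0]
slope = trend_slope(monthly_sales)  # average change per period
```

A positive slope indicates the element tends to rise from one period to the next, and a slope near zero indicates no linear trend.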
 The Data Warehouse and Data
           Mining
• Data mining does not require the use of a warehouse,
  but it may be the best foundation for mining.
• If multiple analyses are run in sequence, the data need
  to be held constant (as in a DW). In an operational
  database, data change often.
• It is also important that the data in the DW are
  integrated and stable.




 Volumes of Data – The Biggest
          Challenge
• The largest challenge a data miner may face is the sheer
  volume of data in the warehouse.
• It is quite important, then, that summary data also be
  available to get the analysis started.
• A major problem is that this sheer volume may mask the
  important relationships the analyst is interested in.
• The ability to overcome the volume and visualize the
  data becomes quite important.
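
Two common ways to get an analysis started on a very large table are to compute summary data and to work on a random sample first. The sketch below shows both with the standard library. The data are synthetic, and the fixed `seed` is an assumption added so that the sample is repeatable.

```python
import random
import statistics


def summarize(values):
    """Summary data for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
    }


def sample_rows(values, k, seed=0):
    """Simple random sample of up to k values, repeatable via the seed."""
    rng = random.Random(seed)
    return rng.sample(values, min(k, len(values)))


# Synthetic stand-in for a large warehouse column.
full_column = [float(v) for v in range(1, 1001)]

overview = summarize(full_column)        # summary of the whole column
sample = sample_rows(full_column, k=50)  # small sample to explore first
sample_overview = summarize(sample)
```

Comparing `sample_overview` with `overview` gives a rough check that the sample is representative before running heavier analyses on it.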




           Foundations of Data
              Visualization
• One of the earliest known examples of data visualization
  was in London during the 1854 cholera epidemic. A map
  (next slide) helped to identify the source of the disease.
• Modern visualization techniques grew from the twin
  technologies of computer graphics and high performance
  computing in the 1970s and 1980s.
• One computer scientist who saw this trend arising was
  Douglas Engelbart in the 1950s.




Dr. John Snow used a map to show that the source of cholera was a water pump, thus proving the disease was waterborne.

[Figure: Snow's map, with the Broad Street Pump marked.]
          Opportunity and Timing

• Alternative input devices (light pen, sketch pad and
  mouse) began to appear in the 1960s.
• In the 1970s, flight simulators became much more
  realistic when graphics replaced film.
• In the same decade, special effects computers became
  entrenched in the entertainment industry.
• In the 1980s, visualization grew more dynamic with
  applications like the animation of Los Angeles smog
  patterns.



One of today’s more useful types of visualization is in simulators (both in games and in practice).

This is the only way most of us will ever fly a Boeing 747.
It is now both cheaper and safer to train commercial pilots on simulators.

With good software, pilots can be placed in situations they may not ever see, until too late, in the cockpit.
A Sequence of Frames Animating LA Smog

[Figure: three animation frames.]
• Day 1: Swirling Winds – Light Smog Particles
• Day 2: Offshore Winds – Moderate Smog Particles
• Day 3: Head-on View of Smog Particles and Streamlines
     Number Crunching With a
           Difference
• In the 1990s, rapid advances in chip technology,
  both at the CPU and the graphics processor, put
  data visualization everywhere.
• Imagine trying to understand DNA sequences
  from just the numbers!
• On the next slide, a Mapuccino display helps us
  see where the results from a text search come
  from.



[Figure: Mapuccino display of text search results]

				
Description: Prof. Rushen's notes for MBA and BBA students