Docstoc

Class12

Document Sample
Class12 Powered By Docstoc
					                      12




The Data Warehouse
and Data Mining




MIS 304 Winter 2006
                                               12

                Class Goals

• Be able to understand and apply the concept of
  a Data Warehouse to database environments.
• Be able to describe the concept of On Line
  Analytical Processing (OLAP) and show how it
  might be used.
• Understand the seminaries and differences
  between DDS, OLAP, and Data Mining




                                                    2
                                             12

                Class Goals

• Develop and understanding of the Star Schema
  and its importance in the design of the data
  warehouse.




                                                  3
                                        12

     How much data is there?

• Remember the reading from the first
  night’s class.
• We are swimming in data.
• What can we do with it?




                                             4
                                                         12

        The Need for Data Analysis

• Constant pressure from external and internal forces
  requires prompt tactical and strategic decisions.
• The decision-making cycle time is reduced, while
  problems are increasingly complex with a growing
  number of internal and external variables.
• Managers need support systems for facilitating quick
  decision making in a complex environment.
• One solution many people use is a Decision support
  systems (DSS).




                                                              5
                                       12

      Data Warehouse Examples

• Center for Disease Control
  http://www.cdc.gov/nchs/datawh.htm
• United States Geological Survey
  http://orxddwimdn.er.usgs.gov
• Pharmacia – Upjohn
• AT&T
• Meijer




                                            6
                                                12

                    DSS

• Came about in the 1980’s based on work done
  at AT&T, IBM, Boston Consulting Group and
  others.
• Bonczek, Holsapple and Winston 1981,
  Foundations of Decision Support Systems




                                                     7
                                                                     12
             Decision Support Systems
• Decision Support is a methodology (or a series of
  methodologies) designed to extract information from data
  and to use such information as a basis for decision making.

• A decision support system (DSS) is an arrangement of
  computerized tools used to assist managerial decision
  making within a business.
   – A DSS usually requires extensive data “massaging” to produce
     information.
   – The DSS is used at all levels within an organization and is often
     tailored to focus on specific business areas or problems.
   – The DSS is interactive and provides ad hoc query tools to retrieve
     data and to display data in different formats.




                                                                          8
                                                                       12

             Four Components of a DSS
• The data store component is basically a DSS database.
• The data extraction and filtering component is used to extract and
  validate the data taken from the operational database and the
  external data sources.
• The end user query tool is used by the data analyst to create the
  queries that access the database.
• The end user presentation tool is used by the data analyst to organize
  and present the data.




                                                                            9
                                              12
Main Components Of A Decision Support System (DSS)




                                                10
                                                           12
    Operational Data vs. Decision Support Data


• Most operational data are stored in a relational database in
  which the structures tend to be highly normalized.
• The operational data storage is optimized to support
  transactions that represent daily operations.
• Whereas operational data capture daily business
  transactions, DSS data give tactical and strategic business
  meaning to the operational data.




                                                                11
                                                    12

Operational Data vs. Decision Support Data

• The design needs of the DSS and the
  Supporting data are often at odds with one
  another!
• You have two choices:
  – Modify the structure of your operational data
  – Build a separate system.


      What is the deciding factor?



                                                     12
                                                                    12
          Operational Data vs. Decision Support
                          Data
•   Time span
    – Operational data represent current (atomic) transactions.
    – DSS data tend to cover a longer time frame.
•   Granularity
    – Operational data represent specific transactions that occur at a
      given time.
    – DSS data must be presented at different levels of aggregation.
•   Dimensionality
    – Operational data focus on representing atomic transactions.
    – DSS data can be analyzed from multiple dimensions.



                                                                         13
12




 14
                                                12
Contrasting Operational And DSS Data Characteristics




                                                   15
                                                                      12

           Decision Support Systems

•   The DSS Database Requirements
    – Database Schema
        • The DSS database schema must support complex (non-
          normalized) data representations.
        • The queries must be able to extract multidimensional time
          slices.

       What is the Impact on the Database?




                                                                       16
                                       12

    Impacts on the Database

• Less normalization typically means
  more speed.
• Less normalization typically means
  bigger tables
• Less normalization typically means
  “easier” queries (SQL)




                                        17
                                                     12
        Data Extraction and Loading

• The DBMS must support advanced data extracting
  and filtering tools.
• The data extraction capabilities should support
  different data sources and multiple vendors.
• Data filtering capabilities must include the ability to
  check for inconsistent data or data validation rules.
• The DBMS must support advanced data integration,
  aggregation, and classification capabilities.



                                                       18
                                                              12
    Decision Support Systems

• End-User Analytical Interface
  – The DSS DBMS must support advanced data modeling and
    data presentation tools, data analysis tools, and query
    generation and optimization components.
  – The end user analytical interface is one of the most critical
    components.

• Database Size Requirements
  – DSS databases tend to be very large.
  – The DBMS must be capable of supporting very large
    databases (VLDB).
  – The DBMS may be required to use advanced hardware, such
    as multiple disk arrays and multiple-processor technologies.
                                                                19
                                                             12
          The Data Warehouse

• The Data Warehouse is an integrated, subject-oriented, time-
  variant, non-volatile database that provides support for
  decision making.

   – Integrated
      • The Data Warehouse is a centralized, consolidated
        database that integrates data retrieved from the entire
        organization.
   – Subject-Oriented
      • The Data Warehouse data is arranged and optimized to
        provide answers to questions coming from diverse
        functional areas within a company.



                                                                  20
                                                12

        The Data Warehouse

– Time Variant
  • The Warehouse data represent the flow of data
    through time. It can even contain projected
    data.

– Non-Volatile
  • Once data enter the Data Warehouse, they are
    never removed.
  • The Data Warehouse is always growing.

                                                    21
Comparison Of Data Warehouse And Operational   12
            Database Characteristics




                                                22
                            12
Creating A Data Warehouse




                             23
                                                 12

              The Other Option

• Find places where this filtering occurs
  naturally
   – Accounting systems
   – Distribution systems
   – Engineering systems
• Extract the data from the “filtering” system




                                                  24
                                                     12

                      Data Mart
• A data mart is a small, single-subject data
  warehouse subset that provides decision support to
  a small group of people.
• Data Marts can serve as a test vehicle for companies
  exploring the potential benefits of Data Warehouses.
• Data Marts address local or departmental problems,
  while a Data Warehouse involves a company-wide
  effort to support decision making at all levels in the
  organization.



                                                       25
                                                         12
  Twelve Rules That Define a
       Data Warehouse
1. The Data Warehouse and operational environments are
   separated.

2. The Data Warehouse data are integrated.

3. The Data Warehouse contains historical data over a long
   time horizon.

4. The Data Warehouse data are snapshot data captured at a
   given point in time.

5. The Data Warehouse data are subject-oriented.
                                                             26
                                                       12

             12 Rules cont.

6. The Data Warehouse data are mainly read-only with
periodic batch updates from operational data. No
online updates are allowed.

7. The Data Warehouse development life cycle differs
from classical systems development. The Data
Warehouse development is data driven; the classical
approach is process driven.




                                                        27
                                                              12

                      12 Rules cont.
8. The Data Warehouse contains data with several levels of
  detail; current detail data, old detail data, lightly summarized,
  and highly summarized data.

9. The Data Warehouse environment is characterized by read-
   only transactions to very large data sets. The operational
   environment is characterized by numerous update
   transactions to a few data entities at the time.

10. The Data Warehouse environment has a system that traces
   data resources, transformation, and storage.


                                                                 28
                                                        12

               12 Rules cont.

11. The Data Warehouse’s metadata are a critical
  component of this environment. The metadata identify
  and define all data elements. The metadata provide
  the source, transformation, integration, storage, usage,
  relationships, and history of each data element.

  12. The Data Warehouse contains a charge-back
  mechanism for resource usage that enforces optimal
  use of the data by end users.




                                                             29
                                                  12
       10 Mistakes to Avoid

• Mistake 1:Not gathering business requirements
• Mistake 2: Saving time by not creating a subject
  area model
• Mistake 3: Delivering normalized tables to drive
  out mart design
• Mistake 4: Designing the data staging process for
  the ease of the developers at the expense of the
  end users
• Mistake 5: Denormalizing without starting with a
  fully normalized data model


                                                      30
                                                               12

              10 Mistakes to Avoid

• Mistake 6: Allowing users to drive the level of detail
• Mistake 7: Not having a multi-tiered warehousing
  environment
• Mistake 8: Developing a data model from a list of required
  data elements
• Mistake 9: Thinking that you must choose between
  relational and dimensional models
• Mistake 10: Jumping straight into data mart design

  Karolyn Duncan, Steve Hoberman, and Laura Reeves, 10
  Mistakes to Avoid When Modeling Data Warehouses


                                                                31
                               12

OnLine Analytical Processing
          OLAP




                                32
                                                    12
      On-Line Analytical Processing
• On-Line Analytical Processing (OLAP) is an advanced
  data analysis environment that supports decision
  making, business modeling, and operations research
  activities.

• Four Main Characteristics of OLAP
  – Use multidimensional data analysis techniques
  – Provide advanced database support
  – Provide easy-to-use end user interfaces
  – Support client/server architecture


                                                     33
                                                   12
 Multidimensional Data Analysis
          Techniques


• The processing of data in which data are viewed as
  part of a multidimensional structure.
• Multidimensional view allows end users to
  consolidate or aggregate data at different levels.
• Multidimensional view allows a business analyst to
  easily switch business perspectives.




                                                       34
Operational Vs. Multidimensional View Of Sales   12




                                                  35
                                                    12

     On-Line Analytical Processing

• Additional Functions of Multidimensional Data
  Analysis Techniques
  – Advanced data presentation functions
  – Advanced data aggregation, consolidation, and
    classification functions
  – Advanced computational functions
  – Advanced data modeling functions



                                                     36
Integration Of OLAP With   12
A Spreadsheet Program




                            37
                                                              12
   Advanced Database Support

• Access to many different kinds of DBMSs, flat files, and internal
  and external data sources.
• Access to aggregated Data Warehouse data as well as to the
  detail data found in operational databases.
• Advanced data navigation features such as drill-down and roll-
  up.
• Rapid and consistent query response times.
• The ability to map end user requests, expressed in either
  business or model terms, to the appropriate data source and
  then to the proper data access language.
• Support for very large databases.


                                                                 38
                                                              12

     On-Line Analytical Processing

• Easy-to-Use End User Interface
  – Easy-to-use graphical user interfaces make
    sophisticated data extraction and analysis tools easily
    accepted and readily used.

• Client/Server Architecture
  – The client/server environment enables us to divide an
    OLAP system into several components that define its
    architecture.



                                                               39
                                                        12

     On-Line Analytical Processing

• OLAP Architecture
  – Three Main Modules
     • OLAP Graphical User Interface (GUI)
     • OLAP Analytical Processing Logic
     • OLAP Data Processing Logic

  – OLAP systems are designed to use both operational
    and Data Warehouse data.



                                                         40
OLAP Server Arrangement   12




                           41
OLAP Server With Multidimensional   12
Data Store Arrangement




                                     42
OLAP Server With Local Mini Data-Marts   12




                                          43
                                                             12
              Relational OLAP
• Relational On-Line Analytical Processing (ROLAP) provides
  OLAP functionality by using relational database and familiar
  relational query tools.

• Extensions to RDBMS
   – Multidimensional data schema support within the RDBMS
   – Data access language and query performance optimized for
     multidimensional data
   – Support for very large databases




                                                                44
                                                              12
  Multidimensional Data Schema Support

• Normalization of tables in relational technology is seen as a
  stumbling block to its use in OLAP systems.
• DSS data tend to be non-normalized, duplicated, and pre-
  aggregated.
• ROLAP uses a special design technique to enable RDBMS
  technology to support multidimensional data representations,
  known as star schema.
• Star schema creates the near equivalent of a multidimensional
  database schema from the existing relational database.




                                                                  45
                                                              12
    On-Line Analytical Processing
• Data Access Language and Query Performance Optimized for
  Multidimensional Data
   – Most decision support data requests require the use of
     multiple-pass SQL queries or multiple nested SQL
     statements.
   – ROLAP extends SQL so that it can differentiate between
     access requirements for data warehouse data and
     operational data.

• Support for Very Large Databases
   – Decision support data are normally loaded in bulk (batch)
     mode from the operational data.
   – RDBMS must have the proper tools to import, integrate, and
     populate the data warehouse with operational data.
   – The speed of the data-loading operations is important.
                                                                 46
A Typical ROLAP Client/Server Architecture   12




                                              47
                                                                              12
             Multidimensional OLAP
                    (MOLAP)
•   MOLAP extends OLAP functionality to multidimensional databases (MDBMS).
•   MDBMS end users visualize the stored data as a multidimensional cube known as a
    data cube.
•   Data cubes are created by extracting data from the operational databases or from
    the data warehouse.
•   Data cubes are static and require front-end design work.
•   To speed data access, data cubes are normally held in memory, called cube cache.
•   MOLAP is generally faster than their ROLAP counterparts. It is also more resource-
    intensive.
•   MDBMS is best suited for small and medium data sets.
•   Multidimensional data analysis is also affected by how the database system
    handles sparsity.



                                                                                 48
MOLAP Client/Server Architecture   12




                                    49
Relational Vs. Multidimensional OLAP   12




                                        50
                                                   12
                   Star Schema
• The star schema is a data-modeling technique used
  to map multidimensional decision support into a
  relational database.
• Star schemas yield an easily implemented model for
  multidimensional data analysis while still preserving
  the relational structure of the operational database.
• Four Components:
  –   Facts
  –   Dimensions
  –   Attributes
  –   Attribute hierarchies


                                                      51
A Simple Star Schema   12




                        52
                                                                        12

                         Star Schema
• Facts
  – Facts are numeric measurements (values) that represent a specific
    business aspect or activity.
  – The fact table contains facts that are linked through their
    dimensions.
  – Facts can be computed or derived at run-time (metrics).

• Dimensions
  – Dimensions are qualifying characteristics that provide additional
    perspectives to a given fact.
  – Dimensions are stored in dimension tables.



                                                                         53
                                                             12
Possible Attributes For Sales Dimensions

 • Attributes
    – Each dimension table contains attributes. Attributes
      are often used to search, filter, or classify facts.
    – Dimensions provide descriptive characteristics about
      the facts through their attributes.




                                                              54
Three Dimensional View Of Sales   12




                                   55
Slice And Dice View Of Sales   12




                                56
                                                     12

              Attribute Hierarchies

• Attributes within dimensions can be ordered in a
  well-defined attribute hierarchy.
• The attribute hierarchy provides a top-down data
  organization that is used for two main purposes:
   – Aggregation
   – Drill-down/roll-up data analysis




                                                      57
A Location Attribute Hierarchy   12




                                  58
                            12
Attribute Hierarchies In
Multidimensional Analysis




                             59
                                                          12

          Star Schema Representation

• Facts and dimensions are normally represented by physical
  tables in the data warehouse database.
• The fact table is related to each dimension table in a many-
  to-one (M:1) relationship.
• Fact and dimension tables are related by foreign keys and
  are subject to the primary/foreign key constraints.




                                                              60
Star Schema For Sales   12




                         61
Orders Star Schema   12




                      62
          12

Example




           63
                                                  12

                 Star Schema

• Performance-Improving Techniques
  – Normalization of dimensional tables
  – Multiple fact tables representing different
    aggregation levels
  – Denormalization of fact tables
  – Table partitioning and replication


                                                   64
Normalized Dimension Tables   12




                               65
           12
Multiple
Fact
Tables




            66
                                                            12
   Data Warehouse Implementation

• The Data Warehouse as an Active Decision Support
  Network
  – A Data Warehouse is a dynamic support framework.
  – Implementation of a Data Warehouse is part of a complete
    database-system-development infrastructure for company-
    wide decision support.
  – Its design and implementation must be examined in the light
    of the entire infrastructure.




                                                               67
                                                               12
  Data Warehouse Implementation
• A Company-Wide Effort that Requires User
  Involvement and Commitment at All Levels
  – For a successful design and implementation, the designer
    must:
     • Involve end users in the process.
     • Secure end users’ commitment from the beginning.
     • Create continuous end user feedback.
     • Manage end user expectations.
     • Establish procedures for conflict resolution.



                                                                68
                                                           12
Data Warehouse Implementation


• Satisfy the Trilogy: Data, Analysis, and Users
   – For a successful design and implementation, the
     designer must satisfy:
      • Data integration and loading criteria.
      • Data analysis capabilities with acceptable query
        performance.
      • End user data analysis needs.




                                                            69
                                                            12
  Apply Database Design Procedures


– Data Warehouse development is a company-wide
  effort and requires many resources: people, financial,
  and technical.
   • The sheer and often mind-boggling quantity of decision
     support data is likely to require the latest hardware and
     software.
   • It is also imperative to have very detailed procedures to
     orchestrate the flow of data from the operational
     databases to the Dare Warehouse.
   • To implement and support the Data Warehouse
     architecture, we also need people with advanced
     database design, software integration, and management
     skills.

                                                                 70
Data Warehouse Implementation Road Map   12




                                          71
                                                               12

                     Data Mining

• In contrast to the traditional (reactive) DSS tools, the
  data mining premise is proactive.
• Data mining tools automatically search the data for
  anomalies and possible relationships, thereby
  identifying problems that have not yet been identified
  by the end user.
• Data mining tools -- based on algorithms that form the
  building blocks for artificial intelligence, neural
  networks, inductive rules, and predicate logic -- initiate
  analysis to create knowledge.



                                                                72
                                12

       Data Mining Techniques

•   Neural Nets
•   Regression Analysis
•   Time Series Analysis
•   Multidimensional Analysis
•   Factor Analysis
•   Nearest Neighbor methods
•   Cluster Analysis




                                 73
                                    12
Extraction Of Knowledge From Data




                                     74
                                                                      12
                   Data Mining
• Four Phases of Data Mining
  1. Data Preparation
     • Identify and cleanse data sets.
     • Data Warehouse is usually used for data mining operations.

  2. Data Analysis and Classification
     • Identify common data characteristics or patterns using
         – Data groupings, classifications, clusters, or sequences.
         – Data dependencies, links, or relationships.
         – Data patterns, trends, and deviations.


                                                                       75
                                                                 12
                Data Mining

3. Knowledge Acquisition
   • Select the appropriate modeling or knowledge acquisition
     algorithms.
   • Examples: neural networks, decision trees, rules induction,
     genetic algorithms, classification and regression tree, memory-
     based reasoning, or nearest neighbor and data visualization.

4. Prognosis
   • Predict future behavior and forecast business outcomes using the
     data mining findings.



                                                                       76
                     12
Data-Mining Phases




                      77
                                                12

               Examples

• Credit Card usage (who buys what)
• Airline travel (where do people go, when)
• Social Services (who is not paying the tab)




                                                 78

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:0
posted:2/11/2013
language:English
pages:78