Data Warehouse by wufengmei007

VIEWS: 0 PAGES: 47

									CP611 – Week 12


    Decision Support Systems
     Using Data Warehouses


                               1
      Data versus Information

• Operational databases – used for
    managing dynamic data by allowing
    large number of users to add, access
    and edit data.
•   All operational databases share a
    common attribute: time
•   Data is stored over time and some
    DBMS gather gigabytes of data daily
                                           2
      Data versus Information

• Large amounts of data stored over
    longer periods of time can provide some
    useful information (data dependent).
•   What is information?
•   Knowledge - Patterns of data that may
    be used to confirm current knowledge or
    discover new knowledge

                                          3
What is information good for?

• We said that information means
    knowledge.
•   What is information (knowledge) good
    for?
•   For decision making...
•   E.g. You invested in shares. Which
    shares are good to keep, which to sell
    and which to keep?
                                             4
            Decision Making

• You are going outside for a walk. You
    look through your window and find that
    the sky is covered with dark, rainy
    clouds. The trees are bending towards
    northern side of the house. It is August.
•   You need to make a decision how to
    dress up.

                                                5
     Business Decision Making
• Airline management needs to improve
    its services and increase profits.
•   The management need to make
    decision on increasing or decreasing
    number of flights between two cities.
•   How? Analyse lots of accumulated data
    about the volume of passenger traffic
    between those two cities in their existing
    flights.
                                             6
     Decision Support Systems
               (DSS)
• Internal and external pressure to
    provide tactical and strategic decision
•   Management decision making require
    analysing data that has been
    accumulated over some period of time
•   DSS – Decision support systems are
    specially designed systems to aid
    managers in making decisions
                                              7
      Decision Support Systems
                (DSS)
• Methodology designed to extract
    information from data accumulated over
    time
•   Arrangement of computerized tools to
    assist decision making
•   Used at all levels – tailored to focus on
    specific areas or problems
•   Provides ability to conduct ad hoc queries
                                            8
DSS Components




                 9
Operational Data to DSS Data




                               10
            Data Warehouse
•   Data Warehouse (DW) is a broad based, shared
    database for management decision making that
    contains data that has been accumulated over
    time
•   Data must be high quality, aggregated, often
    denormalised and is not necessarily absolutely
    current




                                                 11
                    Contrast
•   Operational               •   DSS Data
    – Current transactions        – Longer time frame
    – Specific transactions       – Different levels
    – Focus on atomic             – DSS analysed from
      transactions                  multi-dimensions
    – Volume – megabytes          – Volume – gigabytes
    – Scope narrow                – Broad scope



                                                         12
Data Warehouse Concepts

  “The Data Warehouse is an integrated,
 subject-oriented, time-variant, non-volatile
database that provides support for decision
                  -making”




                                            13
     Data Warehouse Concepts
          Subject-oriented
“The Data Warehouse is an integrated, subject-oriented, time
  -variant, non-volatile database that provides support for
  decision-making”
•   DW are organised around subjects of interest.
    Subject are the major entities of concern in the
    business environment
•   For example subjects may include sales, accounts,
    employees, orders, customers and other entities that
    are CENTRAL (of interest and great importance) to
    the particular company's business

                                                          14
     Data Warehouse Concepts
            Integrated
“The Data Warehouse is an integrated, subject-oriented, time
  -variant, non-volatile database that provides support for
  decision-making”
•   Data about each of the subjects in the DW is typically
    collected from several different data sources (several
    company's transactional databases)
•   For example additional demographic data about the
    company customers may be acquired from outside
    sources (e.g. CENSUS data). All of the data about
    the subjects must be integrated in such way that
    provides overall picture of the subject
                                                          15
     Data Warehouse Concepts
           Non-volatile
“The Data Warehouse is an integrated, subject-oriented, time-
  variant, non-volatile database that provides support for
  decision-making”
•   Transactional (operational) data is normally updated
    on a regular basis. For example airline reservations
    database is updated daily (number of seats available).
•   Transactional data is volatile (dynamic) because it is
    constantly changing (deletes, updates and inserts are
    frequent)
•   The data in the DW is non-volatile (read only). Once
    data is added to DW, it does not change

                                                           16
     Data Warehouse Concepts
           Time variant
“The Data Warehouse is an integrated, subject-oriented, time
  -variant, non-volatile database that provides support for
  decision-making”
•   Transactional (operational) data is capturing the
    current state (now, this point of time). For example,
    when withdrawing money from the bank, the system
    needs to know what is your current balance (not what
    it was three days ago or month ago)
•   DW data is historic in nature. The data is stored
    historically (weeks, months, years)

                                                          17
Comparision




              18
      Types of Data Warehouses

• There are basically only two kinds of data
    warehouses; enterprise DW (EDW) and
    data mart (DM)
•   Q. What is the difference between the two?
•   A. Their size and the portion of the
    company that they service


                                           19
    Enterprise Data Warehouse

• Large scale, multi-subject data
    warehouse
•   Support the whole enterprise by
    incorporating the data from entire
    company
•   Drawn from a variety of the company's
    transactional DB as well as externally
    aquired data
                                             20
                Data Mart

• Data Marts are based on a limited
    number of subjects (possibly one)
•   Support to small group of people
•   Designed for local or departmental
    problems
•   A large company will often have several
    Data Marts.

                                          21
Creation




           22
      Data Warehouse Design

• Two main characteristics
     – Subject orientation
     – Historic nature of data
•   DW (or each major part of DW) needs
    to be designed and built around a
    subject and have a temporal (time)
    component.

                                          23
       Data Warehouse Design
• DW – referred as multi-dimensional DB
    because each occurrence of a subject is
    referenced by several subject dimensions
    (characteristics), one of which is time
•   For example, in a hospital patient
    tracking and billing system, the subject
    might be charges, and dimensions may
    include patient, date, procedure and
    doctor
                                          24
 DSS Database Requirements

• Schema
• Data extraction and loading
• End-user analytical interface
• Size



                                  25
                  Schema
• Known as Star Schema - Data-
    modelling technique to map decision
    support requirements into relational
    database
•   The name comes from the visual design
    in which the subject is in the middle and
    the dimensions radiate outwards, as the
    rays of a star.

                                            26
              Star Schema

• Yield easily implemented model for
    multidimensional type data analysis
•   Four components
    – Facts
    – Dimensions
    – Attributes
    – Attribute hierarchies

                                          27
Simple Star Schema




                     28
Example Attributes




                     29
             Data extraction
• Process of copying data from the
    transactional (operational) DB in
    preparation for loading it into DW
•   Because some data from the TDB may
    be noisy (have errors) the data needs to
    be cleaned.
•   Noisy data – missing data, questionable
    data, possible misspellings and
    impossible data.
                                           30
               Data cleaning
•   Two steps
     – Identify the problem data
     – Fix it
•   Identifying the problem is a job for programs
    since it would be time consuming (even
    impossible) for humans
•   Programs may be also designed to fix it. (e.g.
    Age 243 – impossible data - a program may
    decide to put average age)
                                                 31
                Data loading
•   Finally after data extracting, cleaning, and
    transforming, the data is ready to be loaded into
    the data warehouse
•   A schedule for regular updating (data reloading)
    must be put in place (e.g. Daily, weekly,
    monthly)
•   After the loading the data, data warehouse is
    ready for data analysis.

                                                  32
On-Line Analytical Processing
          (OLAP)
• Decision support methodology based on
    viewing data in multiple dimensions
•   OLAP is well suited for querying and multi
    time period trend analysis
•   OLAP concepts
    – Drill down (yearly->monthly->weekly)
    – Slice (subset of the data; e.g. products-
      >shoes)
    – Rotate (interchange the dimensions)         33
On-Line Analytical Processing
          (OLAP)
• Four main characteristics
  – Multidimensional data analysis
  – Advanced database support
  – Easy-to-use end user interfaces
  – Support client/server architecture




                                         34
OLAP Architecture




                    35
OLAP System




              36
   ROLAP – Relational OLAP

• Special case of OLAP
• Relational On-line Analytical Processing
  – OLAP functionality
  – Relational database and query tools
• Extension to RDBMS


                                          37
ROLAP Client/Server
   Architecture




                      38
       Multidimensional OLAP
              (MOLAP)
•   Extends OLAP functionality to
    multidimensional databases (MDBMS)
•   Data cubes created
    – Static
    – Front-end design work required
    – Normally held in memory (cube cache) for faster
      access
    – Faster than ROLAP
    – Resource intensive
                                                        39
MOLAP Architecture




                     40
ROLAP vs MOLAP




                 41
Attribute Hierarchies




                        42
            Data Mining

• DSS tools – reactive
• Data-mining – proactive
• Automatically search for anomalies and
 possible relationships




                                       43
Knowledge from Data




                      44
      Phases of Data Mining

• Data Preparation
• Data Analysis and Classification
• Knowledge Acquisition
• Prognosis



                                     45
Data-Mining Phases




                     46
     Where to from here??

• Next generation……
• Open Source??




                            47

								
To top