Data Warehouse Technologies by HC120910222322

VIEWS: 0 PAGES: 37

									An Introduction to Data Warehousing




        Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   1
In the Beginning, life was simple…




                   Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   2
But…




       Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   3
Our information needs…




                 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   4
Kept growing. (The Spider web)




                  Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   5
Purpose




          To explore and discuss the
          purpose and principles of data
          warehousing.



              Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   6
Data Warehousing/ Business Intelligence




       Data                      Information                                                             Knowledge                                      Decision




    QUANTITY                                                                                                                                         QUALITY




                  Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                 7
So What Is a Data Warehouse?



    Definition: A data warehouse is the data
     repository of an enterprise. It is generally used for
     research and decision support.
    By comparison: an OLTP (on-line transaction
     processor) or operational system is used to deal
     with the everyday running of one aspect of an
     enterprise.
    OLTP systems are usually designed
     independently of each other and it is difficult for
     them to share information.
                   Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   8
Definition



Collection    of diverse data
    subject oriented
    aimed at executive, decision maker

    often a copy of operational data

    with value-added data (e.g., summaries, history)

    integrated

    time-varying

    non-volatile




                   Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   9
Why Do We Need Data Warehouses?




   Consolidation of information resources
   Improved query performance
   Separate research and decision support functions
    from the operational systems
   Foundation for data mining, data visualization,
    advanced reporting and OLAP tools




                Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   10
What Is a Data Warehouse Used for?




    Knowledge       discovery
        Making consolidated reports
        Finding relationships and correlations

        Data mining

        Examples

           Banks identifying credit risks

           Insurance companies searching for fraud

           Medical research




                  Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   11
How Do Data Warehouses Differ From Operational
Systems?




    Goals
    Structure
    Size
    Performance optimization
    Technologies used




                 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   12
Comparison Chart of Database Types




    Data warehouse (OLAP)                                                               Operational system
                                                                                        (OLTP)
    Subject oriented                                                                    Transaction oriented

    Large (hundreds of GB up to several                                                 Small (MB up to several GB)
    TB)
    Historic data                                                                       Current data

    De-normalized table structure (few                                                  Normalized table structure (many
    tables, many columns per table)                                                     tables, few columns per table)
    Batch updates                                                                       Continuous updates

    Usually very complex queries                                                        Simple to complex queries


                        Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   13
OLTP and OLAP


               Operational DB                                                                                 Data Warehouse


User           Clerk, IT Professional                                                                         Knowledge worker

Function       Day to day operations                                                                          Decision support

DB Design      Application-oriented (E-R )                                                                    Subject-oriented (Star, snow/flake)

Data           Current, Isolated                                                                              Historical, Consolidated

View           Detailed                                                                                       Summarized

Unit of work   Short, Simple transaction                                                                      Complex query

Access         Read/write                                                                                     Read Mostly

#Users         Thousands                                                                                      Hundreds

Db size        100 MB-GB                                                                                      100GB-TB




                          Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   14
Design Differences



    Operational System                                                                                                           Data Warehouse




       ER Diagram                                                                                                                          Star Schema


                     Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.    15
Data Warehouses, Data Marts, and Operational
Data Stores

 Data   Warehouse – The queryable source of data
  in the enterprise. It is comprised of the union of all
  of its constituent data marts.
 Data Mart – A logical subset of the complete data
  warehouse. Often viewed as a restriction of the
  data warehouse to a single business process or to
  a group of related business processes targeted
  toward a particular business group.
 Operational Data Store (ODS) – A point of
  integration for operational systems that developed
  independent of each other. Since an ODS
  supports day to day operations, it needs to be
  continually updated.

                   Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   16
Building a Data Warehouse




                           Data Warehouse Lifecycle

                                                                             Analysis
                                                                             Design

                                                                             Import data

                                                                             Install front-end tools

                                                                             Test and deploy




                 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   17
Stage 1: Analysis



                                                                                                                                                           Analysis
                                                                                                                                                       –
    Identify:
                                                                                                                                                           Design
                                                                                                                                                       –   Import data
        Target Questions                                                                                                                              –   Install front-end tools
                                                                                                                                                       –
        Data needs
                                                                                                                                                           Test and deploy

        Timeliness of data

        Granularity

    Createan enterprise-level data dictionary
    Dimensional analysis
          Identify facts and dimensions


                    Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                                 18
Stage 2: Design



                                                                                                                                                     – Analysis

    Starschema
                                                                                                                                                       Design
                                                                                                                                                     – Import data
    Data Transformation                                                                                                                             – Install front-end tools
                                                                                                                                                     – Test and deploy
    Aggregates
    Pre-calculated Values                                                                                                         Dimensional Modeling

    HW/SW Architecture




                  Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                               19
Architecture



               Client                                                                                                                                      Client

                                                           Query & Analysis



               Metadata                                               Warehouse



                                                                        Integration



         Source                                                              Source                                                                          Source


                        Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.              20
Dimensional Modeling




     Fact Table – The primary table in a dimensional
      model that is meant to contain measurements of
      the business.
     Dimension Table – One of a set of companion
      tables to a fact table. Most dimension tables
      contain many textual attributes that are the basis
      for constraining and grouping within data
      warehouse queries.


                 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   21
Stage 3: Import Data



                                                                                                                                                            – Analysis
     Identify data sources                                                                                                                                 – Design
     Extract the needed data from                                                                                                                            Import data
      existing systems to a data                                                                                                                            – Install front-end tools
      staging area                                                                                                                                          – Test and deploy
     Transform and Clean the data
           Resolve data type conflicts
           Resolve naming and key conflicts
           Remove, correct, or flag bad data
           Conform Dimensions
       Load the data into the warehouse




                         Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                               22
Models and Operators



Data   Models
    relations
    stars & snowflakes

    cubes

Operators
    slice & dice
    roll-up, drill down

    pivoting

    other




                    Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   23
Star


   product    prodId    name price                                                                                                                store        storeId   city
                p1       bolt 10                                                                                                                                 c1      nyc
                p2       nut   5
                                                                                                                                                                 c2      sfo
                                                                                                                                                                 c3       la


              sale oderId date                               custId                      prodId                       storeId                         qty       amt
                    o100 1/7/97                                53                          p1                           c1                             1         12
                    o102 2/7/97                                53                          p2                           c1                             2         11
                     105 3/8/97                               111                          p1                           c3                             5         50




             customer          custId                            name                                    address                                             city
                                 53                                joe                                   10 main                                             sfo
                                 81                               fred                                   12 main                                             sfo
                                111                               sally                                  80 willow                                            la


                          Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                      24
Star Schema


                                                               sale
                                                              orderId
                                                               date                                                                   customer
         product
                                                              custId                                                                    custId
          prodId
                                                              prodId                                                                    name
          name
                                                              storeId                                                                  address
           price
                                                                qty                                                                      city
                                                                amt




                                                                store
                                                               storeId
                                                                 city




                   Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   25
Cube



       Fact table view:
                                                                                                 Multi-dimensional cube:
       sale   prodId storeId amt
                p1     c1     12                                                                                                    c1                  c2   c3
                p2     c1     11                                                                               p1                   12                       50
                p1     c3     50                                                                               p2                   11                  8
                p2     c2     8


                                                                                                           dimensions = 2




                     Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.             26
3-D Cube




  Fact table view:                                                                                 Multi-dimensional cube:
 sale   prodId   storeId          date                    amt
          p1       c1              1                       12
          p2       c1              1                       11                                                                                             c1        c2        c3
                                                                                                day 2
          p1       c3              1                       50                                                              p1                             44        4
          p2       c2              1                       8                                                               p2 c1                               c2        c3
          p1       c1              2                       44                            day 1
                                                                                                                        p1    12                                         50
          p1       c2              2                       4                                                            p2    11                               8




                                                                                                                         dimensions = 3



                       Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                            27
Cube Aggregation



                                                                                                       Example: computing sums
                      c1           c2                   c3
  day 2                                                                                                                    ...
              p1      44           4
              p2 c1        c2                 c3
 day 1
           p1    12                           50
           p2    11        8


                                                                                                                      c1                   c2                 c3
                                                                                              sum                     67                   12                 50
                c1     c2                c3
          p1    56     4                 50
          p2    11     8                                                                                                                                           129
                                                                                                                  sum
                rollup                                                                           p1               110
                                                                                                 p2                19
               drill-down

                           Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.              28
Importing Data Into the Warehouse




            OLTP 1




                                                 Data Staging Area                                                                                   Data
            OLTP 2
                                                                                                                                                   Warehouse




            OLTP 3

       Operational Systems
        (source systems)




                      Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.         29
Stage 4: Install Front-end Tools



                                                                                                                                                      – Analysis
                                                                                                                                                      – Design
    Reporting tools                                                                                                                                  – Import data
    Data mining tools                                                                                                                                  Install front-end tools
                                                                                                                                                      – Test and deploy
    GIS
    Etc.




                   Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                               30
Stage 5: Test and Deploy



                                                                                                                                                     –   Analysis
                                                                                                                                                     –
    Usability tests
                                                                                                                                                         Design
                                                                                                                                                     –   Import data
    Software installation                                                                                                                           –   Install front-end tools
                                                                                                                                                         Test and deploy
    User training
    Performance tweaking based on usage




                  Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                                 31
Special Concerns




   Timeand expense
   Managing the complexity
   Update procedures and maintenance
   Changes to source systems over time
   Changes to data needs over time




                   Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   32
Goals of the STORET Central Warehouse




    Improved    performance and faster data retrieval
    Ability to produce larger reports
    Ability to provide more data query options
    Streamlined application navigation




                 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   33
Old Web Application Flow




                  Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   34
Central Warehouse Application Flow




                                                                Search Criteria
                                                                  Selection




                                                      Report Size Feedback/
                                                      Report Customization




                                                            Report Generation




                  Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.   35
Data Warehouse Components



  Source Systems                                                                                           “The Data Warehouse”                                                  End User
                                     Data Staging Area
     (Legacy)                                                                                               Presentation Servers                                                Data Access

                                                                                                                                                                    feed   Ad Hoc Query Tools

                                                                              Populate,
      Data         extract                                                    replicate,
                                                                               recover                                                                              feed   Report Writers
                                                                                                                   Data Mart #1:

                                                                                                                                                                    feed   End User Applications

      Data         extract       Data Clean-up and
                                    Processing
                                                                              Populate,                                               Conformed dimensions
                                                                              replicate,                                              Conformed facts
                                                                               recover
                                                                                                          Data Mart #2                                              feed
      Data         extract                                                                                                                                                      Data Mining
                                                                              Populate,                                               Conformed dimensions
                                                                              replicate,                                              Conformed facts
                                                                               recover
                                                                                                          Data Mart #3




         Upload cleaned dimensions                                                                                                 Upload model results




                                        Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                           36
Data Warehouse Components – Detailed



 Source Systems                                                                                              “The Data Warehouse”                                                        End User
                                    Data Staging Area
    (Legacy)                                                                                                  Presentation Servers                                                      Data Access

                               Storage:                                                                      Data Mart #1:                                                  feed   Ad Hoc Query Tools
                               flat file (fastest);                                                          OLAP (ROLAP and/or
                               RDBMS;                                          Populate,                        MOLAP) query services;
     Data         extract      other                                           replicate,                    dimensional;
                                                                                recover                      subject oriented;                                              feed   Report Writers
                               Processing:                                                                   locally implemented;
                               clean;                                                                        user group driven;
                               prune;                                                                        may store atomic data;
                                                                                                                                                                            feed   End User Applications
                               combine;                                                                      may be frequently
                  extract      remove duplicates;                                                            refreshed;
     Data                      household;                                                                    conforms to DW Bus
                               standardize;
                               conform dimensions;                             Populate,                                                 Conformed dimensions
                               store awaiting replication;                     replicate,                                                Conformed facts                           Models
                               archive;                                         recover                                                                                            forecasting;
                               export to data marts                                                          Data Mart #2                                                   feed   scoring;
     Data         extract                                                                                                                                                          allocating;
                               No user query services                          Populate,                                                 Conformed dimensions                      data mining;
                                                                               replicate,                                                Conformed facts                           other downstream systems;
                                                                                recover                                                                                            other parameters;
                                                                                                             Data Mart #3
                                                                                                                                                                                   special UI




        Upload cleaned dimensions                                                                                                     Upload model results




                                         Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.                                      37

								
To top