Data Warehouse Overview by cuiliqing

VIEWS: 15 PAGES: 223

									 Introduction to
Data Warehouse

          (slides in this section
          are used courtesy of
          Carrig Emerging Technology
          Ph: 410-553-6760
          www.carriget.com)

                                       1
Introduction to Data Warehousing and Data
                  Mining
1) Data Warehouse Introduction
2) Engineering Conflicts
3) OLTP and DSS
4) Stovepipe vs. Integration
5) Data Warehouse Solution
6) Enterprise Information System
7) Security in a Data Warehouse
8) Moving Data to a Data Warehouse
9) Data Marts
10) Data Mining
                                        2
                      Introduction

• Key topics for this course include:
   – Data Warehouse
   – Data Mart
   – Data Mining

• Background and review of relational database
  systems
• Main focus on data warehouse and data mining




                                                 3
        Data Warehouse Introduction

• A data warehouse is a single source for key,
  corporate information needed to enable business
  decisions

• A database application is a piece of software that
  provides a user interface for users to add, delete,
  query and update data

• Typically, a database management system is
  used to actually do the work of adding, deleting,
  querying or updating data

                           Database           Data
    Application             System

                                                        4
 Engineering Conflicts, Query and Update

• It is often an engineering problem when data is
  updated and long-running queries occur at the
  same time
• In some cases, the users who are doing updates
  must wait for queries to complete
• One way to avoid this is to make a read-only
  copy of data


                             Database System
      Application
                               Data        Data
                            for update   for query




                                                     5
              OLTP and DSS Defined

• An application that updates is called an on-line
  transaction processing (OLTP) application

• An application that issues queries to the read-
  only database is called a decision support system
  (DSS)


         OLTP
       Application
                                  Database System


                                   OLTP    DSS
          DSS                      Data    Data
       Application


                                                     6
     Applications in a Typical Enterprise

• Most organizations have several disparate
  OLTP/DSS applications in several databases

  Finance            Inventory          Sales
   OLTP                OLTP             OLTP
 Application        Application       Application



  Finance           Inventory           Sales
    DSS                DSS               DSS
 Application       Application        Application


                  DATABASE SYSTEM
   Finance           Inventory           Sales
    OLTP               OLTP              OLTP
     Data              Data              Data

   Finance           Inventory           Sales
     DSS                DSS              DSS
     Data              Data              Data

                                                    7
            Stovepipe vs Integration

• When systems stand by themselves they are
  often referred to as “stovepipes”

• Systems that easily share data are called “well
  integrated systems”

          Finance
           OLTP                   Inventory
         Application                OLTP
                                 Application




         Finance
           DSS                    Inventory
        Application                  DSS
                                 Application

                                                    8
   Problems with Stovepipe Architecture

• Problems:
   – Users who wish to access data must query several different DSS to
     find it
   – Data may have fundamental conflicts between DSS
       – a department code table in one DSS may differ in another DSS
       – a measurement may be stored in meters in one DSS and yards in
         another


• Solution:
   – Use a data warehouse, where data is integrated
      from the several different stovepipe systems
   – Data warehouse is really sharing-lite -- you
      don’t have to co-ordinate as much when applications are built
     and you still reap the benefits of data sharing


                                                                         9
          Data Warehouse Solution

• A data warehouse is an attempt to integrate
  separate DSS so that users can query one place
  to find the answers to their questions


• A data warehouse has the key, corporate data in
  the organization

• A data warehouse tracks historical
     data




                                                    10
       Data Warehouse - A Success Story

• Largest data warehouse is Wal-Mart (9 TB)

• Uses for Wal-Mart data warehouse
   – Identifies where a new store should be built based on customer demand
   – Identifies how stores are performing across the nation
   – Contains every “scan” from every purchase



• Benefits Wal-Mart gained from their data warehouse
   – Provided competitive advantage over K-Mart
   – Reduced excess inventory in individual stores
   – Avoided wasted funds in building stores which would fail


                                                                     11
           Selling the Data Warehouse

• A data warehouse project will fail without
  corporate sponsorship
   – Preferably, the project should be sponsored by the CEO
   – The CEO must be sold on the value to the business to
      improve competitive advantage by deploying a data warehouse

• If an active, corporate sponsor does not exist,
  data sources will be very difficult to identify

• Only add data to the warehouse
     that will answer key,
     corporate questions asked
     by the corporate sponsor.
  Otherwise, you will have a data dump
                                                                    12
      Building a Useful Data Warehouse

• You really need:
   – strong executive sponsorship
   – good knowledge of the data
   – sound software engineering
   – stability from source systems
   – users who want a success


• A 75 percent failure rate is often cited

• It is WORTH the effort!!!



                                             13
       Enterprise Information System

• An EIS (Enterprise Information System) allows
  users to query data in a data warehouse

• Users can access key, corporate data in the data
  warehouse
                    Enterprise
                   Information
                     System




               Data Warehouse


                                                     14
Users of an Enterprise Information System

• Frequently, multiple EIS are needed to satisfy
  different types of users
   – Some users only want a system that has pre-defined reports so they
     only need to “click one button” to see data they need. These users
     want the system to be no harder to use than a “coffee pot”
   – Other users want to delve into the data and build their own queries


• Executives want a high-level, summary
  data and a simple tool
   – Must be VERY easy to use, users want to click a few
      buttons and get data they want
   – Results must be graphs
   – Users should be able to drill-down into key areas.


                                                                       15
Users of an Enterprise Information System

• Analysts want a flexible, more detailed tool
   – Often very knowledgeable about the data
   – Willing to do more work to learn about the data
   – Sometimes even learn SQL to issue their own
      ad-hoc queries


• General users want a tool that provides detailed
  data, but is very easy to use
   – Want access to the data warehouse to do
      routine tasks such as
      “Find me Hank’s phone number”, etc.
   – Simple application, but not so focused
     on large reports



                                                       16
                        Data Warehouse / EIS
                                                       Inventory
                                                         OLTP
           Finance                   Inventory           Data
            OLTP                                                     Sales
                                       OLTP                          OLTP
          Application               Application                    Application




Finance
                                                  Enterprise           Sales
                                                                       Sales
 OLTP                                            Information           OLTP
                                                                       OLTP
  Data                                                                 Data
                                                   System              Data




                                  Data Warehouse

                        Finance      Inventory         Sales
                        Subject       Subject         Subject
                          Area         Area             Area



                                                                               17
             Need for Data Warehouses

• Data warehouses provide a single place to store key
  corporate data
   – The idea is that users can go one place to find this key data using an
     enterprise information system (EIS)


• Data warehouse is also a place to store and access
  historical data
   – Users measure performance goals for their company over a period of
     time
   – Company statistics are available
   – Data not stored in the same place is difficult to locate
     and compare, easily lost
   – Single query can be used to access key data


                                                                         18
             Security in Data Warehouse

• Building a data warehouse does increase security
  risk because key, corporate information is all in
  one place

• To mitigate that risk, database system
  components can be used to protect the data
  warehouse. These include
   –   Views
   –   Access control
   –   Security Administration
   –   Encryption
   –   Audit




                                                  19
   Moving Data into the Data Warehouse

• Moving data from source OLTP systems to the
  data warehouse is the hard part of data
  warehousing

• Updates to the data warehouse are performed
  periodically
   – weekly
   – nightly
   – monthly


• Occasionally, real-time
      data is needed in a data warehouse, but this
  is not very common


                                                     20
       Using Middleware to Move Data

• Data can be moved to the warehouse via data
  migration software

• This is often called “middleware” because it sits
  between the source OLTP and the data
  warehouse


                         Data
                      Warehouse
    Source            Migration             Data
     OLTP              Software           Warehouse
    System           “Middleware”




                                                      21
                 Need for a Data Mart

• A data mart is a subset of the data warehouse
  that may make it simpler for users to access key
  corporate data
   – Sometimes, users only need a piece of data from the data
     warehouse

• The data mart is typically fed from the data
  warehouse
                                            Data Warehouse

                                 Finance               Inventory             Sales
                                 Subject                Subject             Subject
                                   Area                  Area                 Area




                                           New York                California
                                           Data Mart               Data Mart


                                                                                      22
                           Data Mart in Action
                                                         Inventory
                                                           OLTP
             Finance                   Inventory           Data
              OLTP                                                     Sales
                                         OLTP                          OLTP
            Application               Application                    Application




Finance
                                                    Enterprise           Sales
                                                                         Sales
 OLTP                                              Information           OLTP
                                                                         OLTP
  Data                                                                   Data
                                                     System              Data




                                    Data Warehouse

                          Finance      Inventory         Sales
                          Subject       Subject         Subject          California
New York                    Area         Area             Area           Data Mart
Data Mart


                                                                                 23
              Data Mining Introduction

• Data Mining is done by running software that
  examines a database and looks for patterns in the
  data

• A data warehouse by itself will respond to queries
  from users
   – It will not tell users about patterns in data that users may not have
     thought about
   – To find patterns in data, data mining is
       used to try and mine key information from
     a data warehouse




                                                                             24
             Advantages of Data Mining

• Data mining allows companies to collect
  information and make them more productive and
  beat their competition

• Data mining helps identify
   –   why customers buy certain products
   –   ideas for very direct marketing
   –   ideas for shelf placement
   –   training of employees vs. employee retention
   –   employee benefits vs. employee retention




                                                      25
             Implementing Data Mining

• Apply data mining tools to run data mining
  algorithms against data
• There are two approaches:
   – Copy data from the Data Warehouse and mine it
   – Mine the data in the Data Warehouse

• Popular tools use a variety of different data
  mining algorithms:
   –   association rules
   –   genetic algorithms
   –   decision trees
   –   neural networks



                                                     26
     Data Mining using Separate Data
• You can move data from the data warehouse to
  data mining tools
  – Advantages
     – Data mining tools may organize data so they can run faster
  – Disadvantages
     – Could be very expensive to move large
       amounts of data
                                                        Data Warehouse




                                     Data Mining Tool
        Copy of data made
              by the
         Data Mining Tool


                                                                    27
 Data Mining Against the Data Warehouse

• Data mining tools can access data directly in the
  Data Warehouse
   – Advantages
      – No copy of data is needed for data mining
   – Disadvantages
      – Data may not be organized in a way that is
        efficient for the tool
                                                     Data Warehouse




                         Data Mining Tool




                                                                  28
                Data Mining: Summary

• Data mining attempts to find patterns in data that
  we did not know about
• Often data mining is just a new buzzword for
  statistics
• Data mining differs from statistics in that large
  volumes of data are used
• Many different data mining algorithms exist and
  we will discuss them in the course
• Examples
   – identify users who are most likely to commit credit card fraud
   – identify what attributes about a person most results in them buying
     product x.
                                                                       29
SQL Review

       (slides in this section
       are used courtesy of
       Carrig Emerging Technology
       Ph: 410-553-6760
       www.carriget.com)


                                    30
            Introduction to SQL



1) Introduction to SQL
2) Data Definition Language (DDL)
3) Data Manipulation Language (DML)
4) SELECT Construct
5) SELECT Operators
6) Wildcard Searches
7) Aggregate Operators
8) Calculated Attributes
9) Sorting Results


                                      31
Introduction to Structured Query Language

• Structured Query Language (SQL) is the language
  used to communicate with a relational database
   – Industry standard
   – Based on set theory


• SQL composed of two types of constructs:
   – Data Definition Language (DDL)
       – Defines the structure of the database
   – Data Manipulation Language (DML)
       – Provides the constructs to input and retrieve data




                                                              32
                  SQL Overview - DDL

• Data Definition Language (DDL) is used to
  describe the structure of the database
   – Create tables, indexes, etc.
   – Typical Operations are:
       – CREATE TABLE defines what columns are in the table and
         establishes the table
       – CREATE INDEX defines an index for the table. Indexes are used
         to improve database performance




                                                                         33
                 SQL Overview - DML

• Data Manipulation Language (DML) is used for
  storing, updating, and retrieving data.

• Typical operations include:
   – SELECT is used to retrieve data.
       – Ex: SELECT * FROM PRODUCTS
   – INSERT is used to add new rows to the database.
       – INSERT INTO PRODUCTS VALUES ('food',
         'hardware', 'housewares')
   – UPDATE is used to change rows that already exist in the database.
       – UPDATE PRODUCTS SET PRICE = PRICE + 4
   – DELETE is used to eliminate rows of data from the database.
       – DELETE FROM PRODUCTS


                                                                         34
                   SELECT Overview

• SELECT is used to retrieve records from the
  database.

• Single table SELECT constructs:
   –   WHERE
   –   IN
   –   BETWEEN
   –   LIKE
   – Aggregate Operators
   – DISTINCT
   – ORDER BY




                                                35
             SELECT Examples

• Query Purpose: Retrieve names and prices of all
  products

SELECT ProductName, Price
 FROM TinyProducts

• Query Purpose: Retrieve all information for all
  employees from the TinyProducts table

SELECT *
 FROM TinyProducts


                                                    36
               SELECT with WHERE

• The WHERE clause is used to filter which
  information is returned from a SELECT

• Query Purpose: Retrieve all information only for
  product type of “food”

SELECT *
 FROM TinyProducts
 WHERE ProductType = ‘Food’




                                                     37
          Use of Boolean Operators

• Conditions can be separated by Boolean
  operators:
   – AND, OR, NOT


• Query Purpose: List all information about food
  products that are either cereal or fruit

SELECT *
FROM TinyProducts
WHERE (ProductName = 'Cereal')
OR (ProductName = 'Fruit')


                                                   38
          Boolean Operator Example

• Query Purpose: List the names of all products
  that the type is fruit and the price is less than
  $2.00

SELECT ProductType, ProductName
   FROM TinyProducts
 WHERE Price < 2
   AND ProductName = 'Fruit'




                                                      39
                         IN Operator

• The IN operator allows a search for records that
  match one value in a set of unordered values

• Example questions to use IN:
   – 'Find all products whose type is Food, Hardware, or Housewares'
   – 'Find all food whose type is Meat, Fish, Vegetables, or Fruit'




                                                                       40
                  IN Example
• Query Purpose: List the name of Housewares that
  are Cookware, Linens, or Dishes

SELECT ProductName, ProductType
   FROM TinyProducts
  WHERE ProductName in
   ('Cookware', 'Linens', 'Dishes')

instead of:

SELECT ProductName, ProductType
   FROM TinyProducts
 WHERE (ProductName = ’Cookware')
     OR (ProductName = 'Linens')
     OR (ProductName = 'Dishes')                41
                   BETWEEN Operator

• The BETWEEN operator allows a search for a range
  of values

• Example Queries:
   – 'Find all fruit between Bananas and Grapes'
   – 'Find all cereals whose price is between $1.50 and $4.00 a box




           1.50                                     4.00




                                                                      42
               BETWEEN Example

• Query Purpose: Find all products whose price is
  between $2.00 and $8.00

SELECT ProductName, Price
   FROM TinyProducts
  WHERE Price BETWEEN 2.00 AND 8.00

instead of:

SELECT ProductName, Hardware
   FROM TinyProducts
  WHERE (Price >= 2.00) OR (Price <= 8.00)

                                                    43
          Wildcard Searches of Strings
• The LIKE operator is used to search parts of a
  string

• The following wildcard characters are used:
   % to match any zero or more characters
   _ to match exactly one character




                                                   44
         Wildcard Search Examples

• Query Purpose: List all products whose name
  starts with an ’C'
SELECT *
   FROM TinyProducts
  WHERE ProductName LIKE 'C%'

• Query Purpose: List all products that have a SKU
  number with the last 2 characters of ’23' when
  you don't know the first character

SELECT *
   FROM TinyProducts
  WHERE SKUNumber LIKE '_23'
                                                     45
                 Aggregate Operators
• MIN, MAX, and AVERAGE are used when computing
  statistics on a range of data

• Query Examples:
  – 'What is the highest batting average on the team?'
  – 'What is the average number of hits for all the little league teams in
    the National League?'
  – 'What are the names of the players that had the lowest average on
    the little league team?'




                                                                         46
        Aggregate Operators Example

• Query Purpose: Find the minimum, maximum,
  and average batting average of all players in the
  National League of Little League

SELECT MIN(Average), MAX(Average),
       AVG(Average)
   FROM PLAYERS
  WHERE League = 'National'




                                                      47
           SUM and COUNT Operators

• Use the SUM operator to total the results of a
  query

• COUNT will count the total number of occurrences
  of an item in a search




                                                     48
            SUM And COUNT Examples
•   Query Purpose: Find the total number of
    homeruns hit by all players in the American
    League?
    SELECT SUM(HomeRuns)
    FROM PLAYERS
    WHERE League='American'

•   Query Purpose: List the names of players that
    have hit 3 home runs in the National League?

    SELECT COUNT(*)
    FROM PLAYERS
    WHERE HomeRuns = '3'
    AND League = 'National'
                                                    49
                 Calculated Attributes

• A new attribute can be obtained by using
  arithmetic operators (+,-, *, /) on other
  numeric attributes

• All operators follow standard precedence:
   – Multiplication and division are computed first left to right
   – Addition and subtraction are computed last left to right
   – Use parenthesis to override the standard precedence




(+,-, *, /)
                                                                    50
        Calculated Attributes Example

Query Purpose: List all players with their hits, at
bats, and their batting average

SELECT Name, Hits, AtBats,
     (Hits / AtBats)
FROM PLAYERS




                                                      51
              DISTINCT Operator

• DISTINCT is used to exclude duplicate
  occurrences in the result of a query

• Query Purpose: List all distinct batting averages

  SELECT DISTINCT(Average)
    FROM PLAYERS




                                                      52
            Sorting Query Results
• The ORDER BY clause is used at the end of the
  SELECT statement to sort the results of a query

• Use DESC on the end of the ORDER BY clause to
  sort the data in descending order. Otherwise, the
  result will be in ascending order




                                                      53
               Sorting Example

• Query Purpose: List all players in ascending
  order of their batting average

  SELECT Name, Average
  FROM PLAYERS
  ORDER BY Average

• For descending order add the keyword DESC

  SELECT Name, Average
  FROM PLAYERS
  ORDER BY Name DESC
                                                 54
        Sorting Calculated Attributes
• To refer to a computed attribute in the ORDER BY,
  use its position in the list of columns following
  SELECT

• Query Purpose: List all players in descending
  order of their batting average (here we assume
  batting average is computed at the time of the
  query)

  SELECT Name, Hits, AtBats,
         Hits / AtBats
  FROM PLAYERS
  ORDER BY 3 DESC
                                                      55
               More SQL



1) GROUP BY Construct
2) HAVING Filter
3) Multiple Tables
4) Joins
5) Equijoins
6) Cartesian Product
7) Nulls
8) OUTER JOIN

                          56
                  GROUP BY Clause

• GROUP BY will partition a table into multiple
  groups of related rows.

• As an example, consider the EMPLOYEE table
  where Department partitions the EMPLOYEE set
  into subsets:

                     Engineering

      Marketing                        Customer
                       Finance


                                                  57
              GROUP BY Example

• Query Purpose: For each department, list the
  average salary using the EMPLOYEE table

SELECT Department, AVG(Salary)
  FROM EMPLOYEE
  GROUP BY Department




                                                 58
            GROUP BY With WHERE
             GROUP BY With WHERE

• To filter data further, we can use the WHERE
  clause with GROUP BY clause

Query Purpose: For each department, list the
  highest salary of their administrative assistants.

  SELECT Department, MAX(Salary)
  FROM EMPLOYEE
  WHERE Title='administrative assistant'
  GROUP BY Department



                                                       59
               HAVING Construct

• HAVING is used to restrict the output of aggregate
  functions, such as SUM, MIN, MAX and AVG, to only
  those groups of rows that meet some condition.

Query Purpose: List the average salary for all
  departments that have more than three
  employees.

  SELECT Department, AVG(Salary)
     FROM EMPLOYEE
  GROUP BY Department
  HAVING COUNT(*) > 3


                                                   60
                    Multi-Table SQL

    • It is often necessary to combine data into multiple
      tables.

   EMPLOYEE                          ATTENDS
EmpID Name Salary              EmpID     Name

1         Fred    200          1           Harvard
2         Ethel 300            2           GMU
3         Mike    400          2           Yale
4         David 100            3           MIT
                               3           Stanford
                               3           GMU

                                                        61
                      Joins

• Joins are the means by which multiple tables can
  be combined.

• A join allows us to combine data from different
  tables. A join operation is done through the
  SELECT construct.

• Types of Joins: Equijoin, Outer Join, Inner Join




                                                     62
                    Equijoin

• Joins only those rows where a foreign key
  matches the primary key

• Allows information from multiple tables to be
  linked together in a single query

• Can be used to link as many tables as needed in a
  single query




                                                  63
           Equijoin Query Example

• Query Purpose: List the names of all colleges
  attended by Ethel

SELECT b.Name
 FROM EMPLOYEE a, ATTENDS b
 WHERE a.EmpID = b.EmpID
   AND a.Name = 'Ethel'




                                                  64
                  Equijoin Example

    EMPLOYEE
EmpID   Name    Salary

1       Fred    200
2       Ethel   300
3       Mike    400



                         ATTENDS
          EmpID            College   GPA

          1                Harvard   2.45
          2                GMU       3.79
          2                Nova      3.65
          3                Yale      2.85
          3                Nova      2.65
          3                GMU       4.0




                                            65
          Warning about Joining Tables

• A join is really just a subset of a cartesian
  product. When no fields are 'joined' in the WHERE
  clause, a cartesian product is produced
   – Restated in English: When the linking condition is omitted from
     the WHERE clause, you get a lot of excess garbage that you
     probably do not want.

Sample Query:

  SELECT b.Name
  FROM EMPLOYEE a, ATTENDS b
  WHERE a.Name = 'Ethel'


                                                                       66
               Cartesian Product

• Each row in one table with every other row in other
  table
a.EmpID a.Name a.Salary        b.EmpID      b.GPA

2        Ethel     300         1            3.4
2        Ethel     300         2            2.8
2        Ethel     300         3            3.7
2        Ethel     300         4            3.5
         ....




                                                    67
                       Nulls

• An attribute may be defined as null.

• This indicates that the value is unknown and
  avoids the need for user-defined special
  indicators.

• To prevent a column from having nulls, specify
  NOT NULL on the column in the CREATE TABLE
  statement when setting up the database.




                                                   68
                Nulls Examples

Statement Purpose: Add an employee whose salary
  is unknown

INSERT INTO EMPLOYEE (3,'Hank', NULL)

Query Purpose: Find all employees whose salary is
  unknown (or null)

SELECT *
 FROM EMPLOYEE
 WHERE Salary IS NULL


                                                    69
                    OUTER JOIN

• An OUTER JOIN is used when the query should
  return a result row even for rows that do not have
  corresponding data in one of the tables.

• A LEFT OUTER JOIN returns all rows from the
  'left' table.

• Nulls are returned when a row in the 'left' table
  has no corresponding rows in the right table.




                                                      70
         LEFT OUTER JOIN Example

• Query Purpose: List the college GPAs for each
  employee. Include employees who have not
  attended any colleges

SELECT a.Name, b.GPA
 FROM EMPLOYEE a
 LEFT OUTER JOIN ATTENDS b
 on a.EmpID = b.EmpID




                                                  71
           LEFT OUTER JOIN Example

• Result of the outer join
   – All employees are listed.
   – For an equijoin, only those who attended a college would be listed
   – Here, employee number 4 did not attend college, but is still
     retrieved by the outer join.

          Name                   GPA

              ---------- -----
                Fred       2.45
                Ethel      3.79
                Ethel      3.65
                Mike       2.85
                Mike       2.65
                Mike       4.00
                David      NULL                                           72
Advanced SQL

       (slides in this section
       are used courtesy of
       Carrig Emerging Technology
       Ph: 410-553-6760
       www.carriget.com)

                                    73
            Advanced SQL



1) Finding the nth element in a list
2) Finding the median
3) Correlated subquery
4) Data Definition Language Constructs




                                         74
                 Find the Nth Element

• It is very common to try to find the nth element in
  a list.
   – Examples:
       – Who makes the second highest salary in marketing
         department?
       – What is the fifth best product in sales?
   – This can be done with a program that uses SQL to access the
     database: SQL is sent to the database and the program keeps
     retrieving the result set until the threshold is crossed.

• We show another way of
  doing this using standard SQL.




                                                                   75
   Find the Nth Element: Example Table
• Consider a table, called TEST, with just one
  column, x, with the following values:

                           X
                           4
                           5
                           8




                                                 76
         Find the Nth Element: Step 1

• First join TEST with itself, this yields each
  element matched with every other element:

                   4          4
                   4          5
                   4          8
                   5          4
                   5          5
                   5          8
                   8          4
                   8          5
                   8          8
                                                  77
        Find the Nth Element: Step 2

• Next keep only those rows where the first column
  is greater than or equal the second column.
4          4                      4                    4
4          5                      5                    4
4          8                      5                    5
5          4                      8                    4
5          5                      8                    5
5          8                      8                    8
8          4
8          5      Notice the pattern that just developed, each number on the
                  list now has a certain number of values that match on the
8          8      right. This number matches the position of this value in
                  the list. For example, 4 has only one match as it is the first
                  number in the list, 5 has two matches, 8 has three matches.

                                                                               78
        Find the Nth Element: Step 3

• Now group by the column on the left and identify
  the size of each group.

• The same ideas can be applied to any SELECT
  statement output.

         4          4        4          1
         5          4        5          2
         5          5        8          3
         8          4
         8          5
         8          8

                                                     79
     Finding the Nth Element: Example

• Query Purpose: Find the information about the
  product with the second highest price.

SELECT a.ProductName, a.ProductType,
       a.Price, a.SKUNumber
  FROM TinyProducts a, TinyProducts b
 WHERE a.Price >= b.Price
 GROUP BY a.ProductName,a.ProductType,
          a.Price, a.SKUNumber
HAVING COUNT(*) =
      (SELECT COUNT(*)-1 FROM TinyProducts)




                                                  80
   Finding the Top N Elements: Example

• To ask for the top n values instead of the nth
  value, specify a range (>=) instead of just an
  equality (=) in the HAVING.

• Query Purpose: Find information about the
  products with the two highest prices.
  SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
       FROM TinyProducts a, TinyProducts b
     WHERE a.Price >= b.Price
        GROUP BY a.ProductName,a.ProductType,
                a.Price, a.SKUNumber
    HAVING COUNT(*) >=
        (SELECT COUNT(*)-1 FROM TinyProducts)
      ORDER BY a.Price


                                                              81
                   Finding the Median

• The median is defined as the element in the
  middle of the list.

• Query Purpose: Find the median price in
  TinyProducts.

SELECT   a.ProductName, a.ProductType, a.Price, a.SKUNumber
  FROM   TinyProducts a, TinyProducts b
 WHERE   a.Price >= b.Price
 GROUP   BY a.ProductName,a.ProductType, a.Price, a.SKUNumber
HAVING   COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)




                                                                82
                  Using Subqueries

• A subquery may be used in the middle of a query.

• Query Purpose: Find the information about the
  highest priced product, using a simple subquery.

SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
  FROM TinyProducts a
 WHERE Price = (SELECT MAX(PRICE) FROM TinyProducts)




                                                            83
                    Correlated Subquery

• If the subquery references a data element from
  outside of the subquery, it is called a correlated
  subquery.
    – For each row in the outer part of the query, the correlated subquery
      is executed.

The following query will indicate who makes more money than ‘Ethel’
SELECT a.Name, a.Salary
  FROM Employee a WHERE EXISTS
    (SELECT b.Salary
       FROM Employee b
      WHERE a.Salary > b.Salary
        AND b.Name = 'Ethel')



                                                                         84
           Other Data Manipulation

• INSERT
  – Add rows to a single table


• UPDATE
  – Modify rows in a single table


• DELETE
  – Remove rows from a single table




                                      85
               INSERT Examples

• Statement Purpose: Add a record for employee
  #1, ’Fred' with a salary of 200 to the EMPLOYEE
  table

INSERT INTO Employee VALUES
    (1, ’Fred', 200)

• Statement Purpose: Copy all rows in the
  EMPLOYEE table and place them in NEW_EMPLOYEE

INSERT INTO New_Employee
 SELECT * FROM Employee


                                                    86
               UPDATE Example

• Statement Purpose: Modify Fred’s salary to 150

UPDATE Employee
  SET Salary = 150.00
  WHERE Name = 'Fred'

• Statement Purpose: Give all employees a ten
  percent raise

UPDATE Employee
  SET Salary = Salary * 1.10


                                                   87
              DELETE Examples

• Statement Purpose: Remove all employees who
  have a salary higher than 100.

DELETE FROM Employee
  WHERE Salary > 100

• To remove all employees:

DELETE FROM Employee




                                                88
            CREATE TABLE Example

• Statement Purpose: Create a table to store
  employee information

  CREATE TABLE EMPLOYEE
       (EmpId       SMALLINT,
        Name        CHAR(10),
        Salary      DECIMAL(5,2))

To drop the EMPLOYEE table

      DROP TABLE EMPLOYEE


                                               89
Data Warehouse
    Security

        (slides in this section
        are used courtesy of
        Carrig Emerging Technology
        Ph: 410-553-6760
        www.carriget.com)

                                     90
      Data Warehouse Security


1) Key Security Services
2) Views
3) Access Control
4) Roles
5) Encryption
6) Audit Trails
7) Security Holes
8) Intrusion Detection
9) Misuse Detection

                                91
                        Introduction

• A key feature provided by database systems is
  good security services.
   – In a database system with good security, applications do not have
     to worry about problems that arise with security violations.


• A data warehouse also requires good security
  services because it holds key, corporate data.

                                         Database System


           EIS
                                             Security
                                             Services



                                                                         92
                Key Security Services

• Access Control
  – Controls who accesses what data


• Administration of Access Control
  – Used to give access to users as well as track who has various
    accesses and what kind of accesses are given to a user or group of
    users
  – Audit tracks the usage of the data warehouse




                                                                         93
          Security in a Data Warehouse

• A data warehouse consolidates organizations key
  data in one place.
   – A data warehouse increases the security risk that unauthorized
     users will try to obtain this data


• Security aspects of EIS applications must be
  designed and implemented very thoroughly.

• Access control and audits are
     two of the critical components
     of security.




                                                                      94
   Data Warehouse Security Components

• Database system components that can be used
  to protect a data warehouse include:
  – Views
     – Allow users to only see certain rows or columns of data
  – Access control
     – Indicate which users have access to what data
  – Administration
     – This component is used to actually give access to groups of users
       and to define the accesses given to either an individual or a group.
     – Encryption
     – Protect data from access outside of the DBMS
  – Audit
     – Track what users are doing



                                                                          95
              Views in Data Warehouse

• A view is a logical view into one or more tables.
  Users may be given access to the view without
  access to the base table.

• Views provide some security assistance because
  they can hide data from users.
       EMPLOYEE

       Name       Address             Salary
        Hank      1 South Street       $50,000
        Esther    2 North Street       $80,000
        Tom       34 Main Street       $90,000
        Sue       45 Easy Street       $28,500
        Dave      56 5th Avenue        $35,000
        Pete      7 Broadway           $60,000
        Kathy     89 Western Avenue    $85,000




                                                      96
                          View Example
• A view called SAFE_EMPLOYEE may be created
  as:
       CREATE VIEW SAFE_EMPLOYEE AS
          (SELECT name, address FROM EMPLOYEE)

   Now users of the view SAFE_EMPLOYEE will not
   even know that salary exists.

SAFE_EMPLOYEE
Name       Address               Salary
 Hank      1 South Street
 Esther    2 North Street
 Tom       34 Main Street                       VIEW (SAFE_EMPLOYEE)
 Sue       45 Easy Street
 Dave                                     “Salary” is effectively hidden
           56 5th Avenue
 Pete      7 Broadway
 Kathy     89 Western Avenue




                                                                 97
                     Updating Views

• Restrictions exist on updating views. For the
  EMPLOYEE table, it is possible to insert into the
  SAFE_EMPLOYEE view.
   – Example:
   INSERT INTO SAFE_EMPLOYEE VALUES (‘Hank’, 300)

   This will insert a NULL into the SALARY column of the base table
     EMPLOYEE.

• Other restrictions to view updates exist:
   – Cannot update a view that is defined with an aggregate
   – Cannot update a view that is defined with a GROUP BY



                                                                      98
      Data Warehouse Access Control

• Access control is implemented in a data
  warehouse with the SQL Grant and Revoke
  commands.
• Syntax
  – GRANT <ALL|UPDATE|DELETE|INSERT|SELECT> ON
    <object-name> TO <user name>
  – Example: GRANT SELECT ON EMPLOYEE TO MARY

• Access control is done by DBAs and creators of
  tables.

• To remove access the REVOKE command is used.
  – Example: REVOKE SELECT ON EMPLOYEE FROM MARY

                                                   99
                       Database Roles
• Roles provide security administration by allowing
  users to be grouped into roles. Accesses may
  then be given to a group of users.
   – As an example, some roles for a company might be:
       – Administrative assistant
       – Loan officer
       – Salesperson

• Accesses may be assigned based on roles.
   – This dramatically simplifies administration.
   – If new tables are created, it is not necessary to add thousands of
     new accesses.
   – Examples:
       CREATE ROLE loan_officer AS (Hank, John, Mike)
       GRANT SELECT ON LOAN TO LOAN_OFFICER


                                                                          100
     Example of Application-based Roles

• Consider:

                                 Database
 Users        Applications                    Data
                                  System



• If the database system controls accesses than it
  does not matter what the application does,
  accesses are controlled consistently (same for
  SALES as MARKETING)

• However, more fine-grained access control can
  be granted in the application.


                                                     101
                   Application Roles

• The application can restrict:
   – Data entry screens
   – Reports


• Care must be taken to restrict users in a
  consistent fashion so that a user cannot jump to
  a different application and avoid security set up
  by another application.




                                                      102
  Role Based Security in a Data Warehouse

• Both application and database level security are
  useful in a data warehouse.

• Database level security is needed so that users
  are only allowed to see data they
      need to see.

• Application level security can
     be used to control access to
     certain menus so that users do
     not even know what reports exist.




                                                     103
                          Encryption

• Encryption is the process of coding data so that it
  can only be read by users who have the key that
  allows them to decrypt the data.
   – Example: A message “sell 500 shares” would appear as “xyzzy”
     without the key. Once the key is paired with the encrypted string
     “xyzzy”, it can then be decrypted.
   – The size of the key is a factor in how difficult it is to attack the
     encryption scheme.

• Three places where encryption might be used in a
  data warehouse:
   – Network
   – Data
   – Tape backups


                                                                            104
                  Network Encryption

• In a data warehouse application, data and queries
  are transmitted through a network.
   – Attackers might be able to steal network traffic just by breaking
     into the network medium.

• One way to reduce the risk of this threat is to
  encrypt traffic on the network.
                                                      User



                                           Networ
                                                k


                                                               Data
                                                             Warehouse
 Application                    Database
                                 System
                                                         Tape Backup

                                                                         105
              Network Encryption

• Network encryption is critical because the
  network connects all of the key components in a
  data warehouse.

• Encrypting network traffic mitigates the risk that
  an attacker could succeed with the “man in the
  middle” attack.

• Without this, it may be possible for the
  “man in the middle” to masquerade as
  another user and circumvent existing
  application and database security.



                                                       106
                Data Encryption

• Data encryption refers to encrypting the actual
  data in the data warehouse.

• If the attackers were to retrieve data from
  the warehouse, they would have to
  decrypt it in order to read it.




     EIS              Database            Data
                       System           Warehouse




                                                    107
              Backup Encryption

• Periodically, databases are copied to some kind
  of long-term storage (usually tapes).

• If the database is encrypted, but the tapes are not
  encrypted, the risk exists of someone walking off
  with the tapes.


     EIS              Database            Data
                       System           Warehouse




                                            Tape Backup




                                                          108
                          Audit Trails

• Audit trails are a means of tracking queries,
  updates, deletes, and additions of new data to the
  data warehouse.
   – Audit trails are turned on when the DBMS is started and all
     activity that uses the data warehouse is tracked in the audit trail.

• If a user is suspected of an evil deed, the audit
  trail can be examined to identify what data has
  been accessed by users.




                                                                            109
             Details of DW Audit Trails

• An audit trail of a database system typically
  includes the following information:
   – User ID, Date, Time, Object that has been accessed (table or view),
     Action that accessed the object (INSERT, UPDATE, DELETE,
     SELECT)
   – For UPDATE, the old value and new value is tracked.

• For data warehouses, the SELECT is often used
  to track the queries that have
       been run against the warehouse.




                                                                      110
         Other Uses for DW Audit Trails

• Audit trails can be used to identify the most
  popular data in the warehouse.
   – This information can be used to optimize queries

• An additional use for audit trails is performance
  tuning of the data warehouse.
   – Administrators know where to focus their efforts
   – Reduces administrative overhead




                                                        111
     Dealing with Known Security Holes

• Commercial database systems and operating
  systems are often filled with holes that allow
  users to obtain unauthorized access.
   – To reduce the risk of these known holes, vendors often provide
     “fixes” to their products as soon as these holes become public.


• It is important to constantly keep up with known
  security holes and apply the latest fixes as soon
  as they are released.

• One of the key risks surrounding a data
  warehouse is that privileged users have the “keys
  to the kingdom”.

                                                                       112
         The Risk of “Privileged Users”

• "Privileged users" include:
   – Data warehouse administrators
   – Operating system programmers
   – Operators in the computer center
   – These users can:
       – Modify, delete and query any data in the warehouse
       – Modify the audit trail to mask their actions
       – Give other users unauthorized access


• Numbers of "privileged users" could
     be anywhere from 20 to 30 in some
  organizations.


                                                              113
   Reducing the Risk of Privileged Users

• One way to reduce the risk of privileged users is
  to separate security administration from database
  administration.
   – This would separate the task of giving accesses and managing the
     audit trail from the task of making sure the data in the warehouse
     was correct and properly optimized.




     Security Services
        Access Control                 Security             Database
            Audit
                                       Services             Services
     Database Services                Access Control
                                                         Database Tuning
       Database Tuning                                  Query Optimization
                                          Audit
      Query Optimization                                     Backups
           Backups



                                                                          114
         Information Security Attacks

• Two types of Information security attacks on data
  warehouses are:
   – Intrusion
      – An intrusion occurs when an unauthorized user gains access to the
        data warehouse.
   – Misuse
      – Misuse, often referred to as the insider
              problem occurs when a user who has access
                to the warehouse uses that access for an
              unauthorized purpose


• Audit Trails can be used to
      identify either type of attack, but
  identification of misuse is typically MUCH harder
  to do than intrusion.
                                                                       115
                  Intrusion Detection

• An intrusion is defined as an unauthorized
  access to a system. The assumption is the user is
  external to the environment (e.g.; a hacker).
• To reduce the risk of intrusion, intrusion
  detection tools are used.
   – These tools monitor access to the data warehouse and sound an
     alarm if unauthorized accesses are detected.

                          INTRUSION DETECTION SYSTEM



                          DATA
    USER                WAREHOUSE




                                                                     116
                    Misuse Detection

• Unwanted access by a user that has the ability to
  access data is referred to as misuse.
   – This is also known as the insider problem.
   – Some estimates have shown that 80 % of computer crime is a
     result of misuse.

• For data warehouses the threat of misuse is high
  especially by privileged users.




                                                                  117
                       Summary

• DBMS Security is useful for data warehouses to
  hide data from users with views and to restrict
  access to data with GRANT and REVOKE.
• Application Level Security assists EIS that
  access data warehouses by hiding certain reports
  from users.
• Encryption can be used to further protect against
  the risk of someone walking off with the data
  warehouse.
• Audit Trails are useful for:
   – Catching attackers
   – Identifying usage trends of the data warehouse
                                                      118
 Moving Data
     to the
Data Warehouse
         (slides in this section
         are used courtesy of
         Carrig Emerging Technology
         Ph: 410-553-6760
         www.carriget.com)

                                      119
 Moving Data to the Data Warehouse


1) Moving Data into the Data Warehouse
2) Updating the Data Warehouse
3) Full Refresh
4) Copy Only the Changes
5) BCP
6) Simple Transformations
7) Complex Transformations
8) Commercial ETL Tools

                                         120
   Moving Data into the Data Warehouse

• Data must be moved to the data warehouse from
  source systems.

• Some key issues:
  – Determine the frequency of data updates -- how often should data
    be moved from source systems to the data warehouse.
  – Various means of updating data in the warehouse exist:
      – SQL Commands
      – Database system load programs (e.g.; SQL Server’s BCP)
      – Commercial tools




                                                                   121
          Updating the Data Warehouse

• OLTP (On-Line Transaction Processing) Systems
  have to send their updates to the data warehouse.


  Finance           Inventory            Sales
   OLTP               OLTP               OLTP
 Application       Application         Application




                   Data Warehouse

       Finance         Inventory         Sales
       Subject          Subject         Subject
         Area            Area            Area


                                                     122
       Frequency of Updates to the Data
                 Warehouse
• Updates may occur daily, weekly, monthly, or in
  real-time.

  Finance           Inventory            Sales
   OLTP               OLTP               OLTP
 Application       Application         Application




                   Data Warehouse

       Finance          Inventory        Sales
       Subject           Subject        Subject
         Area             Area           Area


                                                     123
  Determining the Frequency of Updates

• Requirements should drive update frequency

• Range of updates runs from real-time, to
  quarterly.
   – Real time update
       – Expensive
       – Requires update of warehouse while users are
         querying
   – Daily update
       – Somewhat cheaper than real time, but significant
         maintenance required if the warehouse has lots of tables.
   – Monthly or weekly update
       – Much more manageable




                                                                     124
                 Updating the Warehouse

• Full Refresh vs. Only the Changes
                                             Sales
                       Inventory             OLTP
                         OLTP              Application
  Finance             Application
   OLTP
 Application




                       Data Warehouse

       Finance              Inventory      Sales
       Subject               Subject      Subject
         Area                 Area         Area


                                                         125
                   Full Refresh

• Copy the entire source table in the OLTP system
  to the destination table in the Data Warehouse.



     Source OLTP                  Target Data Warehouse




                                        Target
        Source                          Table
         Table




                                                          126
                        Copy Only the Changes

      • Copy only the changes to the source table in the
        OLTP system to the destination table in the data
        warehouse.

                    Source OLTP            Target Data Warehouse

                    Source
                     Table

                                                      Target
                                                      Table



Modified data
since last update
to the warehouse
                                        Data from two updates ago.
                                        Historical data no longer in
                                        source OLTP.
                                                                       127
      Full Refresh vs. Only the Changes
• Full Refresh
   – Pros
      – Much easier to implement
      – Less chance of messing up your database (good data integrity)
   – Cons
      – Can take a lot longer to actually do -- may “run out of night”
      – Can lose out on warehouse ability to track historical data.


• Only the Changes (DELTA)
   – Pros
      – Tracks historical data
   – Cons
      – Can be very hard to implement
      – Can require changes in source applications (more on this later)
                                                                          128
    Full Refresh Using INSERT-SELECT

• One way to move data from one table to another
  is via the INSERT-SELECT.
  – Syntax: INSERT INTO <target_table>
                <any sql SELECT statement>


• Example:
  INSERT INTO DW_EMPLOYEE
         SELECT *
             FROM EMPLOYEE




                  TARGET

                                                   129
  Updating Changes Using INSERT-SELECT

• Changes may be moved by adding a WHERE
  clause to the INSERT-SELECT.

• Example:
  – INSERT INTO DW_EMPLOYEE
         SELECT *
             FROM EMPLOYEE
            WHERE DATE-UPDATED =
             DATEPART(m, CURRENT_TIMESTAMP)




                                              130
                 Updating Using BCP

• BCP is the bulk copy program that comes with MS
  SQL Server.
  – Bulk copy (BCP) moves data to or from a flat file to a SQL table.


• Syntax:
  bcp <table> [in | out] <data file>

                                                  Target Data
      Source OLTP                                  Warehouse

                    Unload                Load
        Source                Temporary              Target
                                 Flat
         Table                   File                Table




                                                                        131
                 BCP Example
• To bulk copy data from the publishers table in
  the pubs database to the publishers.txt data file
  in ASCII text format, execute from the command
  prompt:
   bcp pubs..publishers out publishers.txt -c
      -Sservername -Usa -Ppassword


• To bulk copy data from the publishers.txt file
  into the pub2 table in the pubs database, execute
  from the command prompt:
   bcp pubs..pub2 in publishers.txt -c
      -Sservername -Usa -Ppassword



                                                  132
                       Simple Transformation

 • In addition to moving data from OLTP to the
   warehouse, it is often necessary to transform
   data.
      – Example: System A stores TOTAL_CLOTH in meters and system
        B stores TOTAL_CLOTH in yards. Before the data is moved from
        system A, we need to transform the data.



                                                                       Store 32
         Store 31                                                    (Pattern = 32,
      (Pattern = 31,
                                      TRANSFORMATION          Total Cloth = 20meters)
TOTAL_CLOTH = 50    yards)


                                    Data Warehouse
                             Pattern = 31, Total Cloth = 50 yards
                             Pattern = 32, Total Cloth = 70 yards

                                                                                      133
              Complex Transformation
• More complex transformations occur when a
  value in a source table must be moved to several
  locations in a data warehouse.

                                        BLUE3484
                                (Color = Blue, 34 Inches, LS)



                                           34 in

                                   CONVERT TO
                                  CENTIMETERS
                  BLUE




                                TABLE 2
      TABLE 1                                         TABLE 3    TABLE 4
                                86.36 cm
                                                        Long      Long
      COLOR              Data Warehouse                Sleeves   Sleeves




                                                                      134
                 Commercial ETL Tools

• Key tools in the marketplace
   –   Informatica
   –   Ardent
   –   DecisionBase (Platinum)
   –   Microsoft Data Transformation Services


• All provide libraries of common transformations.

• All provide the ability to
       code complex transformations.




                                                 135
Data Transformation Services




                               136
Choose a Source




                  137
Choose a Destination




                       138
Choose to use a Query for Transfer




                                     139
Enter SQL Query




                  140
Choose Destination TableName




                               141
Verify Transformation




                        142
Decide When to Run Transformation




                                    143
Final Verification




                     144
Run Transformation




                     145
                                    Check Results

select *
   from orderfact

orderid   orderdate           productid productname                  quantity unitprice   discount

10248     1996-07-04 00:00:00.000   11   Queso Cabrales                 12    14.0000          0.0
10248     1996-07-04 00:00:00.000   42   Singaporean Hokkien Fried      10     9.8000          0.0
10248     1996-07-04 00:00:00.000   72   Mozzarella di Giovanni          5    34.8000          0.0
10249     1996-07-05 00:00:00.000   14   Tofu                            9    18.6000          0.0
10249     1996-07-05 00:00:00.000   51   Manjimup Dried Apples          40    42.4000          0.0




                                                                                                 146
                     Summary

• ETL is one of the hard parts of building a data
  warehouse.

• Either full refreshes of data or just the changes
  may be done.

• Doing full refresh is easy, but historical data is
  lost and it may take a lot of time.

• Tracking changes is a tough business.

• ETL commercial tools are beginning to mature
  and can lessen the pain of this task.



                                                       147
 More Ways of
Moving Data to
      the
Data Warehouse
         (slides in this section
         are used courtesy of
         Carrig Emerging Technology
         Ph: 410-553-6760
         www.carriget.com)

                                      148
     More Ways of Moving Data
      to the Data Warehouse


1) Determining What Data Has Changed
2) Recovery Logs
3) Triggers
4) Insert Triggers
5) Delete Triggers
6) Update Triggers
7) Manual Detection



                                       149
         More Ways of Moving Data
          to the Data Warehouse
• There is a need to move data into the data
  warehouse from OLTP and DSS applications

• The problem is detecting what data needs to be
  moved into the data warehouse

• Three methods:
      – Recovery Logs
      – Triggers
      – Manual Techniques




                                                   150
   Determining What Data Has Changed

• Problem: How to get updates made to the source
  to the same information in the data warehouse?



  SOURCE      How to get updates from
                Source Table A to       DATA WAREHOUSE
              Data Warehouse Table B




                                 ?
                                                         151
Determining What Data Has Changed (cont.)
• Problem: How to get updates made to multiple
  sources to the same information in the data
  warehouse?

        SOURCE                                  DATA WAREHOUSE




   “ROW X”
           Employee
             NAME DEPT.    SALARY
                                    ?      “ROW X”              “ROW X”


             Fred Mktg     35000
                                        EmployeeCount
             Hank Sales    60000



                                    ?
                                        DEPT    COUNT
                                                         SalaryInfo
             Sue   IT      71000        Mktg    1
                                                         DEPT    AVG SAL TOT SAL
             Joe   Sales   50000        Sales   1    2
                                        IT      1        Mktg 35000       35000
    UPDATES                             HR      0        IT    71000      71000
                                                         HR        0          0
    Insert into
    Employee Values                                      Sales 60000      60000
    (‘Joe’,’Sales’,’50000)                                 55000       110000



                                                                            152
             What is the Recovery Log?

• Recovery log is used for transaction processing
   – Used to handle errors
   – Does contain before and after image.


• Recovery log can be used to
       identify the data to be updated
  in the data warehouse.
   – Change Data Capture Utility
       – This scans the database log and identifies all changes that the user
         is interested in and either writes them to a file or stores them in
         another table.




                                                                            153
  Change Data Capture Utility in Action


SOURCE


                                       DATA
                                                All changes
                                                 to DBMS
                     DBMS
                                         LOG

                                     RECOVERY LOG




          CHANGE
            DATA
          CAPTURE                         DATA WAREHOUSE
           UTILITY          WRITES




                                                        154
       Example of Using Recovery Log

• Consider an update to the Employee table
  – The information is recorded in the log
  – The change data capture reconstructs update
  – Can then be sent to the data warehouse
                                                       LOG
                                                  TABLE=EMPLOYEE
                                                      SSN=10
     UPDATE EMPLOYEE
       Where SSN=10                               OldSalary=100,
 SET Salary=Salary*2.0
                                                  NewSalary=200



                     CHANGE
                      DATA
                    CAPTURE                           DATA
                  RECONSTRUCTS                      WAREHOUSE
                     UPDATE

                                                               155
           Using the Recovery Log
• Recovery logs are usually in proprietary format.
  Use commercial tools to read the log and identify
  the changes.

• Commercial tools such as CA’s log analyzer can
  place the results of their work in a table.




                                                   156
     Summary of Change Data Capture

• Pro
  – Log exists anyway, might as well use it to find what has changed


• Con
  – Some difficult scenarios may occur where it is hard to see what the
    new update should be in the Data Warehouse.

  – Proprietary format, may not be supported in many DBMS and will
    always lag behind DBMS development.

  – Many tables will be in the source that have nothing to do with the
    data warehouse, but change data capture will process their changes
    as well.



                                                                       157
                            Triggers

• Triggers allow DBA’s to specify that when an
  “event” such as an INSERT, UPDATE, or DELETE
  occurs on a table, another event is triggered.
   – Triggers are used to identify changes that are needed by the
     warehouse.
   – A trigger can be added to a source table and whenever the source
     table is updated, an update can be placed either directly in the
     warehouse or in a staging table that tracks all updates.


• Triggers can be used to detect the
      changes and perform data
      warehouse updates.
   – A different trigger might be run on key updates so that the data
     warehouse nightly process would know what data has changed.

                                                                        158
                        Example of a Trigger

                                                                        STAGING
STEP 2
                                      When values are
                                      inserted, sets off        TRIGGER inserts
                                        the TRIGGER             values (X, Y) into
                                                                a “STAGING” area       STEP 3
         Values (X, Y) are inserted

                                                                          Nightly
                                                                          Process



                                                               Nightly Process inserts
                                                                  values (X, Y) into
STEP 1                                                          the Data Warehouse
                                                      STEP 4       DATA WAREHOUSE



          INSERT into
            TABLE A
         VALUES (X, Y)                                                 Values (X, Y)




                                                                                          159
           Real-Life Trigger Example
• OLTP/DSS Data - Employee table:
  –Employee (ssn, name, salary)


• DW Data - Summary table:
  –EmployeeStatistics (total number employees,
     total salary paid, average salary).


• When a row is inserted in the employee table, we
  need to do an insert into the EmployeeStatistics
  table.
  – Shown on the next page




                                                 160
                 Insert Trigger Example
CREATE TRIGGER EmployeeInsertTrigger
ON Employee
FOR INSERT AS
    BEGIN
    UPDATE EmployeeStatistics
            SET NoEmployee = NoEmployee +
                              (SELECT COUNT(*) FROM INSERTED)

           UPDATE EmployeeStatistics
               SET TotSalary = TotSalary +
                               (SELECT SUM(Salary) FROM INSERTED)
          UPDATE EmployeeStatistics
                SET AvgSalary = TotSalary / NoEmployee
    END




                                                                161
              Insert Trigger in Action
      COMMANDS                      RESULTS

 INSERT INTO EMPLOYEE
VALUES (1, 'John', 300)          (1 ROW(S) AFFECTED)



 INSERT INTO EMPLOYEE
VALUES (2,'Mike', 400)           (1 ROW(S) AFFECTED)



                          Employee
SELECT * FROM EMPLOYEE    EmpId Name             Salary
                          ------ --------------------------
                          1      John            300.00
                          2      Mike            400.00

                          EmployeeStatistics
     SELECT * FROM        NoEmployee TotSalary    AvgSalary
  EMPLOYEESTATISTICS      ---------- ----------   ---------
                          2              700.00      350.00

                                                        162
                   Delete Trigger Example
CREATE TRIGGER EmployeeDeleteTrigger
ON Employee
FOR DELETE AS
  BEGIN
    DECLARE @numberEmployee int

    UPDATE EmployeeStatistics
     SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED)

    UPDATE EmployeeStatistics
      SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)

   SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics

   IF @numberEmployee > 0
      BEGIN
          UPDATE EmployeeStatistics
            SET AvgSalary = TotSalary / NoEmployee
      End
    ELSE
       UPDATE EmployeeStatistics SET AvgSalary = 0.0
  END


                                                                      163
                Update Trigger Example
CREATE TRIGGER EmployeeUpdateTrigger
ON Employee

FOR UPDATE AS
BEGIN

 IF UPDATE (Salary)
     UPDATE EmployeeStatistics
       SET TotSalary = TotSalary -
           (SELECT SUM(Salary) FROM DELETED) +
           (SELECT SUM(Salary) FROM INSERTED)

            UPDATE EmployeeStatistics
               SET AvgSalary = TotSalary / NoEmployee
END




                                                        164
           Summary of Using Triggers

• Pro
  – Only needed for tables whose data is going to go to the DW


• Con
  – Additional work needed to create detailed triggers

  – Non-trivial to generate a trigger to implement appropriate action

  – May not be acceptable for commercial software on source system




                                                                        165
Other Ways to Determine What Has Changed

• There are other manual ways of detecting the
  change and doing DW updates
   – Look at each row of OLTP and the data in the warehouse
   – Compare the differences between the two files, if the data is not in
     the warehouse, add it!



            OLTP                                  DATA WAREHOUSE

     Hank                                        Hank
     John                                        John
     Mike                                        Mike
     Sam
                       ADD THE DIFFERENCES



                                                                        166
 Manually Identifying What Has Changed

• Pro
  – Flexible



• Con
  – Very expensive

  – Could take a long time




                                         167
                     Summary

• Recovery Logs
• Triggers
• Manual Detection




                               168
Data Warehouse
    Design

         (slides in this section
         are used courtesy of
         Carrig Emerging Technology
         Ph: 410-553-6760
         www.carriget.com)

                                      169
       Data Warehouse Design


1) Overview
2) Describing a Design - ER Diagrams
3) Design Normalization
4) Star Schema Design




                                       170
                        Overview

• How to describe a design
  – Entity Relationship (ER) Diagram


• Types of Designs
  – Normalized
  – Star Schema
  – Snowflake




                                       171
                  Describing a Design

• Different techniques exist, the most prevalent is
  the ER (Entity-Relationship) Diagram

• Entities
   – Things that occur in the real world, usually nouns e.g.; employee,
     part, product, etc.

• Relationships
   – How entities interact, example: one employee may attend many
     colleges -- usually verbs
   – Types of relationships
       –   1-1
       –   1-Many
       –   Many-1
       –   Many-Many

                                                                          172
         Examples of Relationships
1-1                   1-MANY




MANY-1                MANY-MANY




                                     173
                   Normalized Design

• Methodology
  – All 1-1 relationships are placed in a single table.
  – Many-many relationships require two tables that store the single-
    valued relationships and one linking table that indicates how the
    entities are related. The relationship is represented in the linking
    table by referencing keys in the two tables that represent each
    entity in the relationship.

• Checking the design
  – In a Normalized Design, there are many different normalized
    forms. Each normal form (NF) builds on the previous one so that a
    table in 2NF is, by definition, in 1NF.
      – 1NF
      – 2NF
      – 3NF

                                                                           174
  Dealing With Many-Many Relationships

• For Many-Many
  – Two 1-1 Tables (SUPPLIER, PARTS)
  – One linking table (SP)
  – Ex: Suppliers, Parts are the 1-1, SP is the linking table that says
    who sells what parts.



    SUPPLIER                                        PARTS
     S#     SNAME                                    P#     PNAME
     1    SEARS                                      1      HAMMERS
     2    OFFICE DEPOT                               2      NAILS


                            SP
                             S#   P#
                             1    1
                             1    2
                             2    1
                             2    2




                                                                          175
             Normalized Design: Example

• A store sells a product which is supplied by a
  given vendor. The product is purchased by a
  customer at a certain time.
   – Entities: Customer, Product, Store
   – Relationships: Customer buys Product
       – Product is located in Store
       – Product is supplied By a Vendor

                                            VENDOR




  CUSTOMER                  PRODUCT                    STORE




                BUYS
                                       IS-LOCATED-IN


                                                               176
        Checking a Normalized Design

• Normalization
  – Used to reduce data insertion, delete, and update anomalies caused
    by bad designs.
  – Enables users to quickly check a design and make sure there are no
    glaring holes in the design.
  – 1NF
      – All “cells” are atomic -- i.e. each entry in a column contains only
        one value
  – 2NF
      – All non-key values are functionally dependent upon the entire
        primary key -- i.e. if the primary key changes, all other columns
        change.
  – 3NF
      – No transitive dependencies -- i.e. all keys are completely
        dependent on the primary key. If the primary key changes, all
        non-key columns are affected.
                                                                            177
        Overview of Normalized Design

• Pro
  – Relatively easy to change


• Con
  – Queries can involve numerous joins
  – The massive number of tables and links between tables makes it
    hard for customers to build their own queries




                                                                     178
                        Star Schema

• Methodology
  – Single fact table in the middle describing a key event (e.g. sale)
    surrounded by dimension tables (i.e. location, time, employee)
                        D = DIMENSIONS



                                          D2
          D1


                          FACT
                                                          D3
        D5
                                     D4

                                                                         179
             Star Schema: Methodology

• Identify a key fact that occurs.
   – Usually some event creates a real fact. Selling a product in a store
     on Wednesday, patient visiting a hospital, etc.

• Identify all the dimensions of the data being used.
  Think of a dimension as a way to slice the data.
   – Ex: by time, by product, by customer, etc.

• Drill down operations are very well supported




                                                                        180
                  Star Schema: Example

• A store sells a product which is supplied by a
  given vendor. The product is purchased by a
  customer at a certain time.

• Fact
   – CustomerPurchase

• Dimensions are
   –   Customer
   –   Product
   –   Time
   –   Vendor




                                                   181
                   Star Schema: Example (cont.)

                                                                Time
                          Customer



                                                Sale                          Price

                     Store
                                                       Product




  SALE     SALE ID          CUST. ID       STORE ID    PROD. ID        PRICE          TIME

           1                3              7           4               $3.00          4/24/99


CUSTOMER       CUST. ID         NAME           PHONE        Buys Apples        Has Big Car

               3                FRED           1234         Y                  Y

                   TIME         DAY        MONTH        QTR            YEAR

                                24     4               2Q          99


                                                                                                182
                Star Schema: Overview

• Pro
  – Easy for users to navigate and understand


• Con
  – Performance
        – Can end up with one monster fact table, millions of rows
  – Flexibility
        – Not as easy for customers to change the design




                                                                     183
                            Snowflake Schema

• Several stars can be connected to form a snowflake

                                     MARKETING

                                        Distrib-            Ad
                                         ution



                                              Direct Mail
                                                                 Price

                                     Sales                                       PRODUCT
          SALES                                                                         Manu-
                                                    Location
               Marketing                                                 Parts        facturing
Revenue


                                                                                  Make
                                                                                  Chips
           Sale                                                                                    Cost
                           Product
                                                                         Price
Price
                                                                                     Labor
              Vendor




                                                                                             184
                   Summary

• Two basic types of design
  – Star Schema
  – Normalized


• Many Data Warehouse vendors sell products built
  specifically for the star schema

• Some data warehouses insist that normalization
  is the way to build the data warehouse.




                                                   185
  Building a
Data Warehouse

        (slides in this section
        are used courtesy of
        Carrig Emerging Technology
        Ph: 410-553-6760
        www.carriget.com)

                                     186
     Building a Data Warehouse



1) Top Down Approaches
2) Enterprise Data Model Approach
3) "Let Data Users Decide"
4) "Let Data Warehouse Builders Decide"
5) "Let Senior Management Decide"
6) Bottom Up Approach




                                      187
           Building the Data Warehouse

• How to decide what data goes into the data
  warehouse?

• Methods:
   – Top Down
      –   Using Enterprise Data Models
      –   "Let data users decide" approach
      –   "Let data warehouse builders decide" approach
      –   "Let senior management decide" approach

   – Bottom Up
      – Combine data marts into a data warehouse




                                                          188
          Using Enterprise Data Models

• Use the Enterprise Data Model to decide what
  data goes into the data warehouse.
   – Model key processes. This approach says let the business decide.
   – Identify key data used by these processes in an enterprise data
     model -- might be a giant Entity-Relationship diagram.

• Put data in the warehouse based on the
  enterprise data model.




                                                                        189
 An Enterprise Data Model Example

                    PUT IN                   SELL CHIPS
MAKE                 BAGS
CHIPS




                             BUY MORE
                             POTATOES
COUNT
  $$




               CHIP
             SUPPLIERS


   CHIP
  RECIPES                               INGREDIANTS



                                                          190
        "Enterprise Data Model" Approach

• Pro
  – All inclusive -- no chance of leaving key data out.


• Con
  – Very difficult to build an EDM.
  – If the business model changes, you may have to rebuild the
    Enterprise Data Model and the data warehouse.

• Ways of Avoiding the Con
  – In some cases you can buy an EDM -- if the business is common
    enough the packaged EDM might be very close and then you just
    have to modify it to fit your business.


                                                                    191
              "Let Data Users Decide"

• Let the users of the data warehouse choose what
  data will go into the warehouse.
  – The data users deciding the data warehouse data and design will
    pay for it as well.
  – Also, you can charge users who
     query the data as well.




                                            DATA WAREHOUSE

                                                                      192
        "Let Data Users Decide": An Example

                                  DATA WAREHOUSE




   DATA
                                         DATA               DATA
demographics
                                              Ethnic
                                                            budget
                         trends               group

           Advertising
                                                                  Revenue
               ?                                 Age   spending
                                  education                          ?
                                                 ?


      MARKETING                           HUMAN
                                        RESOURCES         FINANCE



                                                                     193
        "Let Data Users Decide" Approach

• Pro
   – Reduces budget problems
   – Users know best!
• Con
   – Requires marketing
   – Could end up with data in the warehouse that is meaningless to the
     people who run the place.
   – Users may not place important data in the warehouse because their
     budget is small.
   – Users who need the data may not use the DW because of budget
     concerns.
• Ways of Mitigating the Con
   – Do not just take money -- try to determine if data is really
     corporate.
                                                                     194
   Pay As You Go Warehouse Analogy




I-495




                                     195
   "Let Data Warehouse Builders Decide"

• The technical staff who is building the warehouse
  decides what data gets put in the warehouse.

              LETS PUT
          INFORMATION ON
            HOW TO BUILD
           VIRUSES IN THE
          DATA WAREHOUSE




                                  DATA WAREHOUSE




                                                   196
  "Let Data Warehouse Builders Decide"
               Approach
• Pro
   – Very easy to design
   – Does not take much time
   – Do not have to deal with users

• Con
   – Could easily result in data DUMP not data warehouse

• Ways to mitigate the con
   – Talk to lots of users to help you guess what should go in the DW




                                                                        197
      “Let Senior Management Decide”

• The senior management decides what data goes
  into the warehouse.
• Asking the senior management is the safest way
  to build a data warehouse.
• Identify the key questions on senior
  management’s mind and get the data to answer
  these questions.




                                                   198
“Let Senior Management Decide” Approach

• Pro
   – Ensures executive support for the project

• Con
   – Senior management does not have much time for this -- you will
     have to only get a few questions at a time
   – This dramatically increases visibility - if you do not move quickly
     senior management will become very angry with the DW.

• Ways to mitigate the con
   – Do your homework before talking to the senior management -- talk
     to the aides of senior management to find out what is on their
     mind.
   – Allocate resources so you can plan to move very quickly once you
     hear from the senior management.

                                                                       199
             Bottom-Up Approach

• Move data from existing OLTP Applications to
  data marts.

• Combine data marts into a data warehouse.

                       DATA
                     WAREHOUSE




            DATA        DATA       DATA
           MART         MART       MART
             25          50         200
           YARDS       METERS       CM


           OLTP        OLTP        OLTP
           APP         APP         APP


                                                 200
                 Bottom-Up Approach

• Pro
   – Data marts are much easier to build than full-fledged DW.


• Con
   – Could end up with a bunch of stove pipe data marts.


• Ways to mitigate the con
   – Develop standards for data when building the data marts so that
     you can glue data from different data marts together.




                                                                       201
Recommendations for an Approach


"Let senior management decide"




                                  202
User Interface to
       the
Data Warehouse
           (slides in this section
           are used courtesy of
           Carrig Emerging Technology
           Ph: 410-553-6760
           www.carriget.com)

                                        203
  User Interface to the Data Warehouse



1) Introduction
2) Types of Users
3) Functions Users Want to Do
4) Approaches to Building a User Interface
5) Hand Built
6) Class Libraries
7) OLAP Tools
8) Types of User Interfaces

                                         204
                          Introduction

• A User Interface (UI) is a front end application
  designed for the user that presents information in
  a simplified manner.
   – Data in a data warehouse does nothing if users cannot access it
   – Users do not want to learn SQL to drive DW applications
   Finance                      Inventory                   Sales
    OLTP                          OLTP                      OLTP
  Application                  Application                Application




                            DATA WAREHOUSE
    Finance                     Inventory                   Sales
     OLTP                         OLTP                      OLTP
      Data                        Data                      Data




                           USER INTERFACE


                                                                        205
                 Building User Interfaces

• DW applications have different types of users
  with different functionality requirements.
   – It is critical to identify the key users.
   – Once you do this, you need to identify their functional
     requirements.

• There are three main approaches to building UI’s
       – Build your own entirely
       – Use commercial Class Libraries
       – Using OLAP Tools




                                                               206
           Types of Users


                CEO


Executive     Executive     Executive


Marketing      Sales        Finance


Analysts      Analysts      Analysts


Everyone      Everyone      Everyone

                                        207
                    Types of Users (cont.)

• Executives
  –   People who run the place
  –   Need answers quickly
  –   May not be very technical
  –   Expect UI to get them what they
      want quickly and efficiently without
      any need for special training

• Analysts
  – Have time to really analyze data and think about it
  – May have strong statistical and IT background
    (i.e. Power user of Excel)
  – Expect UI to have many complex features, and
      provide the ability to generate new queries and perform statistical
    analysis of the data.

                                                                       208
                   Types of Users (cont.)

• Regular User
  – All other users
  – Just need some simple answers to simple questions like “What is
    Hank’s phone number)
  – Expect UI to be simplistic, easy to understand, and provide access
    to basic information.




                                                                     209
            Subject Matter Experts Expect

• Query data in the data warehouse
• Trend analysis
  – “show me how much money we have spent on computers in the
    last four years”
Trend



  Sales




          1995                                              1999

• Benchmark to competitors
  – “what are all our competitors charging for product X”

                                                                210
              Subject Matter Experts Expect (cont.)

• Drill Down
  – “on that chart you just showed me, I noticed that revenue was
    down in Region #4. Please drill down and show me the breakdown
    of each area in Region #4.”

                       WAL-MART
                       DRILL DOWN

              20
    REVENUE




              15
                                        Y Values
              10
                                        X Values
               5
               0
                   1   2     3      4
                       REGIONS

                                                           DRILL DOWN


                                            Revenue

                                                                   DC
                                                      MD     VA

                                                              Region 4
                                                                         211
      Approaches to Building User Interfaces

• Hand-Built
   – Write all of your own code


• Use Class Libraries
   – Use an object oriented approach and buy the CLASS libraries that
     do all the hard work


• OLAP
   – Use an On-Line Analytical Processing package to build user
     interfaces for you.




                                                                        212
          Architecture of User Interfaces (cont.)

• Hand Built

                                            DATA
   USER INTERFACE
      i.e. JAVA
                                          WAREHOUSE

                                             DBMS


                                                      Commercial
                                                      Off The Shelf
• Class Libraries                                       (COTS)



           USER                               OLAP
                                             CLASS
        INTERFACE
                                            LIBRARY



Hand
Built

                                                              213
      Architecture of User Interfaces (cont.)

• OLAP



                                    Result Cube


                   Commercial
                   Off The Shelf
                      (COTS)

                                   REVENUE
  USER INTERFACE




               DBMS


                                                  214
            Hand-Building User Interfaces

• Write all the code yourself
   – Requires many design documents, coding and testing for all of the
     code components.

• Pros

   – Very flexible

• Cons
   – Could take a long time to develop
   – Requires substantial resources
   – May need lots of testing and debugging



                                                                     215
  Using Class Libraries to Build User Interfaces

• Write initial user dialog yourself and call class
  libraries for the hard part (graphics and data
  access functionality).

• Pro
   – Many class libraries available -- avoid doing a lot of coding
     yourself

• Con
   – Not as flexible -- if the class library does not do what you want it
     to do you have to
        – Find a new class library
        – Live without the functionality
   – Can take a while to find the class library you need and learn how
     to interface to it
                                                                            216
   Using OLAP Tools to Build User Interfaces

• Many different OLAP tools
  –   Need to survey an OLAP tool
  –   Buy an OLAP tool
  –   Install it
  –   If it does not match all requirements some code may be needed to
      communicate with the OLAP tool.


• Three types multi-dimensional OLAP

  –   Relational OLAP (ROLAP)
  –   Multi-dimensional (MOLAP)
  –   Hybrid (HOLOP)
  –   Distributed (DOLAP)


                                                                     217
  Summary of Tools for UI Development of DW
• Tools that may be used include:
   – Development of in-house software
       – Do it all yourself
       – Use Class Libraries
   – OLAP
       –   ROLAP
       –   MOLAP
       –   HOLAP
       –   DOLAP
• Different tools or techniques may be useful
  depending upon what kind of user interface is
  being developed.
   – Executive Information Systems
   – Analytical Systems
   – Enterprise Information Systems
                                                  218
                     Types of User Interfaces

 • Executive Information System
     – Developed for the person who runs the place
 • Analytical System
     – Developed for business analysts
 • Enterprise Information System
     – Developed for users throughout the organization
                                     CEO
  EXECUTIVE
  INFORMATION SYSTEM

                       Executive   Executive   Executive


                       Marketing    Sales      Finance


                       Analysts    Analysts    Analysts
                                                           ANALYTICAL SYSTEM

ENTERPRISE
INFORMATION SYSTEM     Everyone    Everyone    Everyone

                                                                               219
            Executive Information System

• The Executive IS is developed specifically for
  people who run the organization.
• Development process:
   – No clean life cycle
   – Prototype constantly. Usually have to guess at
     what executives will want to see
   – Show executives let them come up with ideas
     for revisions
   – Drill down functionality required
• Tools
   – Frequently hand-built, but purchasing a class library can help
     lower the development cost.
   – May just want to use tools that allow development of a
     subscription service in which users may “Subscribe” to a few
     canned reports.
                                                                      220
                     Analytical System

• Analytical systems are user interfaces developed
  for business analysts in an organization.

• Development process:
   – Allow users to drag-and-drop data around to further the analysis of
     this data.
   – More complex interface is acceptable
   – Users may be required to know some SQL knowledge

• Tools:
   – OLAP Tools are frequently used to build the interface




                                                                      221
           Enterprise Information System

• Enterprise IS is written for the general user to
  retrieve simple, key information.
• Development process:
   – Frequently developed in-house
   – So many users around that you really cannot pick a few and ask
     what they need.
   – Simpler than Executive IS as it does not require drill down
     functionality.

• Tools
   – Place some simple, key information
     on a few screens and control
     access and then deploy.


                                                                      222
      Summary of Types of User Interfaces

• Executive Information System
  – For the senior executives
  – Use in-house development or in-house development augmented by
    class libraries

• Analytical System
  – OLAP may make sense here as the interface is more complicated,
    but OLAP has drawbacks due to:
      – Data sparseness
      – No well accepted query language

• Enterprise Information System
  – Much simpler than executive system
  – Good candidate for in-house development

                                                                 223

								
To top