Docstoc

design

Document Sample
design Powered By Docstoc
					Design of DW
Remember
item, city, year, and sales_in_Euro
                            ()



           (city)         (item)           (year)




      (city, item)    (city, year)        (item, year)



                     (city, item, year)
   2 leveles of hierarchies for each dimension
        Item(part,color)              i1,i2
        City(downtown,suburb)         c1,c2
        Year(good_year,bad_year)      y1,y2
   For a 3-dimensional data cube, where Li is the
    number of all levels (L1,2,3=2), the total number
    of cuboids that can be generated is
     3

     (2  1)  33  27
     i1
   {(),
   (c1),(c2),(i1),(i2),(y1),(y2),

   (c1,i1),(c1,i2),(c2,i1),(c2,i2),
   (c1,y1),(c1,y2),(c2,y1),(c2,y2),
   (i1,y1),(i1,y2),(i2,y1),(i2,y2),

   (c1,i1,y1),(c1,i1,y2),(c1,i2,y1),(c1,i2,y2),
   (c2,i1,y1),(c2,i1,y2),(c2,i2,y1),(c2,i2,y2)}
   DMQL Data Mining Query Language
       Relational database schema
       Translation into SQL query
   Example, star schema, and relational data base
   MDX
   Multifeature cubes
   Design of a Data Warehouse
   Lifecycle models
   Data Warehouse models
DMQL
   DMQL: A Data Mining Query Language
    for Relational Databases (Han et al,
    Simon Fraser University)

   Data warehouses and data marts can be
    defined by cube definition and dimension
    definition
DMQL
   Create and manipulate data mining models
    through a SQL-based interface (“Command-
    driven” data mining)
   Abstract away the data mining particulars
   Data mining should be performed on data in the
    database (should not need to export to a
    special-purpose environment)
   Approaches differ on what kinds of models
    should be created, and what operations we
    should be able to perform
Cube Definition Syntax in DMQL

    Cube Definition (Fact Table)
     define cube <cube_name> [<dimension_list>]:
       <measure_list>


    Dimension Definition (Dimension Table)
     define dimension <dimension_name> as
       (<attribute_or_subdimension_list>)
time
        Example of Star Schema
time_key                                            item
day                                               item_key
day_of_the_week              Sales Fact Table     item_name
month                                             brand
quarter                               time_key    type
year                                              supplier_type
                                      item_key
                                     branch_key
       branch                                     location
                                   location_key
       branch_key                                 location_key
       branch_name                   units_sold   street
       branch_type                                city
                                   dollars_sold   state_or_province
                                                  country
                                      avg_sales
                  Measures
Defining Star Schema in DMQL

  define cube sales_star [time, item, branch, location]:
         dollars_sold = sum(sales_in_dollars), units_sold = count(*)


  define dimension time as (time_key, day, day_of_week,
    month, quarter, year)
  define dimension item as (item_key, item_name, brand,
    type, supplier_type)
  define dimension branch as (branch_key, branch_name,
    branch_type)
  define dimension location as (location_key, street, city,
    province_or_state, country)
 The star schema contains two measures
 dollars_sold and units_sold


   How are the DMQL commands
    interpreted to generate a specified data
    cube?
Relational database schema
time(time_key,day_of_week,month,quater,year)

item(item_key,item_name,brand,type,supplier_type)

branch(branch_key,branch_name,branch_type)

location(location_key,street,city,province_or_state,country)

sales(time_key,item_key,branch_key,location_key,number_of
  _units_sold,price)
   The DMQL specification is translated into the following
    SQL query which generates the base cuboid


SELECT s.time_key,s.item_key,s.branch_key,s.location_key,
       SUM(s.number_of_units_sold*s.price),
       SUM(s.number_of_units_sold)
FROM time t, item i, branch b, location l, sales s,
WHERE s.time_key=t.time_key AND s.item_key=i.item_key
 AND s.branch_key=b.branch_key AND s.location_key=l.location_key
GROUP BY (s.time_key,s.item_key,s.branch_key,s.location_key);
   The granularity (resolution) of each
    dimension is at the join key level

 A join key is the key that links a fact table
  and the dimension table
 The fact table associated with a base cuboid
  is sometimes referred as base fact table
   By changing GROUP BY we may generate
    other cuboids
   The apex cuboid representing the total sum of
    dollars_sold and total count of units_sold is
    generated by GROUP BY ();
   Other cuboids may be generated by applying
    selection and projection operations on the base
    cuboid
   To generate a data cube we may as well use
    GROUP BY CUBE
    (s.time_key,s.item_key,s.branch_key,s.location_key);
Defining Snowflake Schema
in DMQL
  define cube sales_snowflake [time, item, branch, location]:
          dollars_sold = sum(sales_in_dollars), avg_sales =
            avg(sales_in_dollars), units_sold = count(*)
  define dimension time as (time_key, day, day_of_week, month,
     quarter, year)
  define dimension item as (item_key, item_name, brand, type,
     supplier(supplier_key, supplier_type))
  define dimension branch as (branch_key, branch_name,
     branch_type)
  define dimension location as (location_key, street, city(city_key,
     province_or_state, country))
     Defining Fact Constellation
     in DMQL
define cube sales [time, item, branch, location]:
         dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold
           = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
         dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in
    cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Example (Exercício 7)
   Suponha um datawarehouse que contém as seguintes
    quatro dimensões: Data, Espectador, Localizaçaõ e
    Jogo e o facto “preço” que consiste no valor, em euros
    Valor pago por um espectador quando assiste a um
    determinado jogo numa data
   Espectadores podem ser estudantes, adultos ou
    séniores, e que cada uma destas categorias tem o seu
    preço de bilhete
   Dimensão data contenha o dia, mês e ano; que a
    dimensão Localização contenha o nome do estádio, e
    que a dimensão Jogo contenha o nome das duas
    equipas defrontadas
Diagrama em estrela
para este DW

 jogo(jogoId, equipa1, equipa2)
 data (dataId, dia, mes, ano)
 espectador (espId, nome, categoria)
 localizacao(localId, estadio)
 factos(jogoId, dataId, espId, localId, preco)
   Escreva em SQL a interrogaçã o que
    devolve o preço total pago por
    espectadores estudantes para assistir
    ao jogo que se realizou no Estádio da
    Luz no dia 1 de Março de 2005
The corresponding SQL querry
  SELECT SUM(preco)
  FROM Factos F, Data D, Localizacao L, Espectador E
  WHERE F.dataId = D.dataId
  AND F.espId = E.espId
  AND F.localId = L.localId
  AND D.dia = 1
  AND D.mes = 3
  AND D.ano = 2005
  AND L.estadio = ‘Estadio da Luz’
  AND E.categoria= ‘Estudante’;
   Esquema relacional que modele a mesma informação e
    uma interrogaçã o SQL que devolva a mesma
    informação:


    jogo(jogoId, equipa1, equipa2, localId, data)
    localizacao(localId, estadio)
    espectador(espId, nome, categoriaId)
    categoria (categoriaId, nomeC, preco)
    jogoEspectador(jogoId, espId)
The corresponding SQL querry
  SELECT SUM (C.preco)
  FROM Categoria C, Espectador E, JogoEspectador JE, Jogo, J,
    Localizacao L
  WHERE C.nomeC = ‘Estudante’
  AND J.data = 1/3/2005
  AND L.estadio = ‘Estadio da Luz’
  AND C.categoriaId = E.categoriaId
  AND E.espId = JE.espId
  AND JE.jogoId = J.jogoId
  AND J.localId = L.localId;
Difference between both
approaches
  Com o modelo em estrela, existe três joins, da
   tabela Factos com cada uma das 3 dimensões
   relevantes
  Com o esquema relacional, existem 4 joins


  In the star schema has less joins, corresponding to
   the relevant dimensions
  In multidimensional model the base cuboid is
   already precomputed
MDX
   Multidimensional Expressions (MDX) as a
    Language

   MDX emerged circa 1998, when it first began to
    appear in commercial applications. MDX was
    created to query OLAP databases, and has
    become widely adopted within the realm of
    analytical applications
   Provide the total sales and total cost amounts for the years 1997
    and 1998 individually for all USA-based stores (including all
    products)
   We are asked, moreover, to provide the information in a two-
    dimensional grid, with the sales and cost amounts (called
    measures in our data warehouse) in the rows and the years (1997
    and 1998) in the columns

SELECT
{[Time].[1997],[Time].[1998]} ON COLUMNS,
{[Measures].[Warehouse Sales],[Measures].[Warehouse Cost]} ON ROWS
FROM Warehouse
WHERE ([Store].[All Stores].[USA])
   The cube that is targeted by the query (the query scope) appears in
    the FROM clause of the query
   The FROM clause in MDX works much as it does in SQL, where it
    stipulates the tables used as sources for the query
    The query syntax also uses other keywords that are common in
    SQL, such as SELECT and WHERE.

   Important difference is that the output of an MDX query, which uses
    a cuboid as a data source, is another cuboid, whereas the output of
    an SQL query (which uses a columnar table as a source) is
    typically columnar
   A query has one or more dimensions. The query above
    has two. (The first three dimensions (=axes) that are
    found in MDX queries are known as rows, columns
    and pages)

SELECT{[Time].[1997].[Q1],[Time].[1997].[Q2]}ON
COLUMNS,{[Warehouse].[All Warehouses].[USA]} ON
 ROWSFROM WarehouseWHERE ([Measures].[Warehouse
 Sales])

   Curled brackets "{}" are used in MDX to represent a set
    of members of a dimension or group of dimensions
Complex Aggregation at
Multiple Granularities
   Multifeature cubes compute complex queries
    involving multiple aggregates at multiple
    granularities (resolution)

   Example
       item is purchased in a sales region on a business
        day (year,month,day)
       The shelf life in months of a given item is stored in
        shelf
       The item price and sales is stored in price and sales
   Find the total sales in 2000, broken down by
    item, region, and month with subtotal for each
    dimension

   A data cube is constructed
   {(item,region,month),(item,region),(item,month),
    (month,region),(item),(month),(region,()}

   Simple data cube, since it does not involve any
    dependent aggregates
   What are dependent aggregates?
Example

 Grouping by all subsets (cuboids)
  {item,region,month} (=data cube)
 Find maximum price for each group
  (cuboid) in 2000
 Among the maximum price tuples find the
  minimum and maximum shelf lives
   Multifeature cube graph for the
   example query

R2 cube {=MIN(R1.shelf)}      R3 cube {=MAX(R1.shelf)}




                   R1 cube {=MAX(price}


                    R0 cube
 The multifeature graph illustrates the
  aggregate dependencies
 R0,R1,R2,R3 are the grouping variables
 The grouping variables R2,R3 are
  dependent on R1
 In extended SQL
     R2 in R1
     R3 in R1
Query in extended SQL
 SELECT
   item,region,month,MAX(price),MIN(R1.shelf),MAX(R1.shelf)
 FROM Purchases
 WHERE year=2000
 CUBE BY item,region,month:R1,R2,R3
 SUCH THAT R1.price=MAX(price) AND
              R2 IN R1 and R2.shelf=MIN(R1.shelf) AND
              R3 IN R1 and R3.shelf=MAX(R1.shelf);
    Design of Data Warehouse:
    A Business Analysis Framework
   Four views regarding the design of a data warehouse
       Top-down view
         • allows selection of the relevant information necessary for the data
           warehouse
       Data source view
         • exposes the information being captured, stored, and managed by
           operational systems
       Data warehouse view
         • consists of fact tables and dimension tables
       Business query view
         • sees the perspectives of data in the warehouse from the view of end-
           user
    Data Warehouse
    Design Process

   Top-down, bottom-up approaches or a combination of
    both
       Top-down: Starts with overall design and planning (mature)
       Bottom-up: Starts with experiments and prototypes (rapid)
   From software engineering point of view
       Waterfall: structured and systematic analysis at each step before
        proceeding to the next
       Spiral: rapid generation of increasingly functional systems, short
        turn around time, quick turn around
Lifecycle planning
   Translation from user requirements into
    software requirements
   Transformation of the software requirements
    into software design
   Implementation of the design into programming
    code
   The sequence of this steps is defined by the
    lifecyle model
   A software lifecycle model must be
    defined for every project!

   The lifecycle model you choose has as
    much influence over your project’s
    success as any other planning decision
    you make!
Pure Waterfall model
Software
Concept
      Requirements
        Analysis
                Architecural
                  Design
                               Detailed
                               Design
                                          Coding and
                                          Debugging
                                                       System
                                                       Testing
Pure Waterfal model
   Document driven model which means that the main
    work products that are carried from phase to phase are
    documents

   The disadvantage of the pure waterfall model arise from
    the difficulty of fully specifying requirements at the
    beginning of the project, before any design works has
    been done and before any code has been written
Salmon model
Software
Concept
      Requirements
        Analysis
                Architecural
                  Design
                               Detailed
                               Design
                                          Coding and
                                          Debugging
                                                       System
                                                       Testing
Code-and-Fix I


                       Code-And-Fix
System Specification                  Release
      (maybe)
Code-and-Fix II
Advantages

   No overhead, you don’t spend any time on planning,
    documentation, quality assurance, enforcement, or
    other activities than pure coding
   Since you jump right into coding, you can show signs of
    progress immediately
   It requires little expertise
Code-and-Fix III
 Maintainability and reliability decrease
  with the complexity and the time
 For any kind of project other than a tiny
  project, this model is dangerous. It might
  have no overhead, but it also provides no
  means of assessing progress, you just
  code until you’re done
Spiral I

    Determine objectives,             Risk
    Alternatives, and                 analysis
                                                                  Evaluate
    constraints                                       IV
                                           II    III Prototypes
 Review                       Start    I
                                             Simulation models
                                      Reqirements
      Development
      Plan
                                              code
                                           test
                            Realse
Spiral II
   The basic idea behind the diagram is that
    you start on a small scale in the middle of
    the spine, explore the risks, make a plan
    to handle the risks, and then commit to an
    approach of the next iteration. Each
    iteration moves your project to a larger
    scale
Spiral III
   The spiral model is a risk-oriented
    lifecycle model that breaks a software
    project up into mini projects. Each mini
    project addresses one or more major
    risks until all the major risks have been
    addressed
Spiral IV
    Determine objectives, and constraints
    Identify and resolve risks
    Evaluate alternatives
    Develop the deliverables for that iteration, and verify
     that they are correct
    Plan next iteration
    Commit to an approach for the next iteration

    One of the most important advantages of the spiral model is that as
        costs increase, risk decrease. The more time and money you
                      spend, the less risk your’re taking
Data Warehouse: A Multi-Tiered Architecture

                               Monitor
                                  &          OLAP Server
  Other          Metadata
  sources                     Integrator

                                                           Analysis
 Operational   Extract                                     Query
 DBs           Transform      Data            Serve        Reports
               Load
               Refresh
                            Warehouse                      Data mining




                             Data Marts

Data Sources          Data Storage         OLAP Engine Front-End Tools
Three Data Warehouse Models
   Enterprise warehouse
       collects all of the information about subjects spanning the entire
        organization
   Data Mart
       a subset of corporate-wide data that is of value to a specific groups
        of users. Its scope is confined to specific, selected groups, such as
        marketing data mart
         • Independent vs. dependent (directly from warehouse) data mart
   Virtual warehouse
       A set of views over operational databases
       Only some of the possible summary views may be materialized
Data Warehouse Usage
   Three kinds of data warehouse applications
       Information processing
         • supports querying, basic statistical analysis, and reporting using
           crosstabs, tables, charts and graphs
       Analytical processing
         • multidimensional analysis of data warehouse data
         • supports basic OLAP operations, slice-dice, drilling, pivoting
       Data mining
         • knowledge discovery from hidden patterns
         • supports associations, constructing analytical models, performing
           classification and prediction, and presenting the mining results using
           visualization tools
Data Warehouse Back-End Tools
and Utilities

   Data extraction
        get data from multiple, heterogeneous, and external sources
   Data cleaning
        detect errors in the data and rectify them when possible
   Data transformation
        convert data from legacy or host format to warehouse format
   Load
        sort, summarize, consolidate, compute views, check integrity, and build
         indicies and partitions
   Refresh
        propagate the updates from the data sources to the warehouse
   DMQL Data Mining Query Language
       Relational database schema
       Translation into SQL query
   Example, star schema, and relational data base
   MDX
   Multifeature cubes
   Design of a Data Warehouse
   Lifecycle models
   Data Warehouse models
   Data Cleaning

     (De)normalization(?)
     Missing Values
     ...

				
DOCUMENT INFO