Docstoc

Data modeling for data warehouse

Document Sample
Data modeling for data warehouse Powered By Docstoc
					Data Warehousing


       Dr. Yousry Taha.

 Prepared by: Ra’ed ALOsaimi.
Outline:
 What is data warehouse?
 Conceptual Structure of Data Warehouse.
 Data Warehouse vs. Heterogeneous DBMS.
 Data warehouse Characteristics .
 Relational Data Modeling of Data Warehouses.
 OLAP Operations
 Data Warehouse Models.
 OLAP Server Architectures.
 Building data warehouse.


                                                 2
What is data warehouse?
  extracting data from multiple sources (Operational
   DB) to be in separate store after restructure them to
   enable the system to get information which support
   decision maker

  A decision support database that is maintained
   separately from the organization’s operational
   database

  Data warehouses provide access to data for complex
   analysis, knowledge discovery, and decision making.

                                                           3
What is data warehouse?(Cont.)


   “A data warehouse is a subject-oriented, integrated,
    time-variant, and non volatile collection of data in
    support of management’s decision-making process.”




                                                           4
Conceptual Structure of Data Warehouse
 Applications that data warehouse supports are:

   OLAP (Online Analytical Processing) is a term used to
    describe the analysis of complex data from the data
    warehouse.

   DSS (Decision Support Systems) also known as EIS
    (Executive Information Systems) supports organization’s
    leading decision makers for making complex and important
    decisions.

   Data Mining is used for knowledge discovery, the process of
    searching data for unanticipated new knowledge.

                                                                  5
Conceptual Structure of Data Warehouse
(Cont.)


      Back Flushing
                                            Data Warehouse

                                                             OLAP
                                                 Data
                  Cleaning   Reformatting                    DSSI
      Databases                                               EIS
                                                Metadata
                                                              Data
                                                             Mining



   Other Data Inputs         Updates/New Data




                                                                      6
  Data Warehouse - Subject-Oriented
 Organized around major subjects, such as
  product, sales.
 Focusing on the modelling and analysis of data for
  decision makers, not on daily operations or a
  processing.
 Provide a simple and concise view around
  particular subject issues by excluding data that
  are not useful in the decision process.


                                                       7
 Data Warehouse - Integrated
 Constructed by integrating multiple, heterogeneous
 data sources
   relational databases, flat files, …


 Data cleaning and data integration techniques are
 applied.
   Ensure consistency in naming conventions (e.g., LastName
    and FamilyName in DB1 and DB2 have the same
    signification)
   encoding structures (e.g, Attribute User_Id is a long int in DB1
    and it is a string in DB2
   attribute measures (e.g, cm vs inch) …
   When data is moved to the warehouse, it is converted.

                                                                       8
    Data Warehouse - Time Variant

 Data warehouse data: provide information from a
 historical perspective (e.g., past 5-10 years)
   Operational database: current value data.


 Every data in the data warehouse
   Contains an element of time, explicitly or implicitly
   But the data of operational database may or may
    not contain “time element”.

                                                            9
 Data Warehouse - Non-Volatile

 A physically separate store.
 Operational update of data does not occur
 in the data warehouse environment.
   Does not require transaction processing,
    recovery, and concurrency control mechanisms
   Requires only two operations in data accessing:
       initial loading of data and querying (read)


                                                      10
Data Warehouse

  We can also describe data warehousing more
   generally as “a collection of decision support
   technologies, aimed at enabling the
  knowledge worker (executive, manager,
  analyst) to make better and faster decisions”




                                                    11
  Data Warehouse vs. Heterogeneous DBMS

 Traditional heterogeneous DB integration:
   Build wrappers/mediators on top of heterogeneous databases
   Query driven approach
       A query posed to a client site, will be transformed into queries
        appropriate for individual heterogeneous sites involved, and the
        results are integrated into a global answer set

 Data warehouse: update-driven
   Information from heterogeneous sources is integrated in advance and
    stored in warehouses for direct query and analysis




                                                                           12
   Data Warehouse vs. Operational DBMS

 OLTP (on-line transaction processing)
   Major task of traditional relational DBMS
   Day-to-day operations: purchasing, inventory,
    banking, …


 OLAP (On-Line Analytical Processing)
   Major task of data warehouse system
   Data analysis and decision making

                                                    13
           Data Warehouse vs. Operational DBMS
                 OLTP                        OLAP
users            Any one                     knowledge worker
function         day to day operations       decision support
DB design        application-oriented        subject-oriented
data             current, up-to-date         historical,
                 detailed,                   summarized, multidimensional
                                             integrated, consolidated
access           read/write                  lots of scans
                 index/hash on prim. key
unit of work     short, simple transaction   complex query
DB size          100MB-GB                    100GB-TB
metric           transaction throughput      query throughput, response




                                                                            14
Data mart
 Focus on one department or business process
 Do not contain any operational data
 Contain much less information than a data warehouse

 When you would use them:
     When data needs tailoring to a particular department or process
     Need a lower risk project
     Don’t want to spend as much money
     Have limited end user(s), allowing data to be tailored for that




                                                                        15
Data warehouse Characteristics
    Data warehouses have the following distinctive
     characteristics :

       multidimensional conceptual view
       generic dimensionality
       unlimited dimensions and aggregation levels
       unrestricted cross-dimensional operations
       dynamic sparse matrix handling
       client-server architecture
       multi-user support
       accessibility

                                                      16
Data warehouse Characteristics(Cont.)

   transparency
   intuitive data manipulation
   consistent reporting performance
   flexible reporting




                                        17
     Data modeling for data warehouse
 A data warehouse is based on a multidimensional data model
  which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and
  viewed in multiple dimensions
    Dimension tables such as
        item(item_name, type,…)
        time(day, week, month, quarter, year)
        location(location_name, country)
    Fact table contains measures and keys to related dimension tables




                                                                         18
  Data modeling for data warehouse


 Multidimensional models take advantage of inherent
 relationships in data to populate data in
 multidimensional matrices called data cubes. (These
 may be called hypercube if they have more than three
 dimensions).




                                                        19
Data modeling for data warehouse
 although more than three dimensions cannot be easily
  visualized at all or presented graphically…..
  The data can be queried directly in any combination of
  dimensions, bypassing complex database queries.

 Tools exist for viewing data according to the user's choice
  of dimensions. Changing from one dimensional hierarchy
  (orientation) to another is easily accomplished in a data
  cube by a technique called pivoting) also called
  rotation).
  In this technique the data cube can be thought of as
  rotating to show a different orientation of the axes .


                                                              20
From tables to Data Cubes
2-D view of sales cross-tabulation (pivot table)

Location = « Mekkah »
         item TV              PC         DVD       SUM
time
Q1               670         200        500   1370
Q2               400         250        300   950
Q3               800         400        500   1700
Q4               200         500        400  1100
SUM              2070        1350       1700 5120

                                                         21
From tables to Data Cubes
   3-D view of sales cube
                         Date
           1Qtr   2Qtr    3Qtr   4Qtr   sum
      TV
    PC                                        Mekkah
  DVD
sum
                                              Madinah




                                                       City
                                              Quds


                                              sum




                                                              22
Data modeling for data warehouse
 The multidimensional storage model involves two types
  of tables: dimension tables and fact tables.
 A dimension table consists of tuples of attributes of the
  dimension.
 A fact table can be thought of as having tuples, one per
  a recorded fact. This fact contains some measured or
  observed variable(s) and identifies it (them) with pointers
  to dimension tables. The fact table contains the data and
  the dimensions identify each tuple in that data .




                                                                23
         Relational Data Modeling of Data Warehouses
       Modeling data warehouses: dimensions & measures
•       Star schema: A fact table in the middle connected to a set of
        dimension tables
•       Snowflake schema: represents dimensional hierarchy by
        normalizing the dimension tables (i.e., each level of a dimension
        represented in one table)
    •      save storage
    •      reduces the effectiveness of browsing
•       Fact constellations: Multiple fact tables share dimension tables



                                                                            24
 Example of Star Schema
time
time_key                                            item
day                                               item_key
day_of_the_week              Sales Fact Table     item_name
month                                             brand
quarter                               time_key    type
year                                              supplier_type
                                      item_key
                                     branch_key
       branch                                     location
                                   location_key
       branch_key                                 location_key
       branch_name                   units_sold   street
       branch_type                                city
                                 currency_sold    state_or_province
                                                  country
                                      avg_sales
                  Measures                                        25
    Example of Snowflake Schema

time     This is not a full snowflake schema
                                                 item
time_key                                       item_key         supplier
day                    Sales Fact Table        item_name        supplier_key
day_of_the_week                                brand            supplier_type
month                          time_key        type
quarter                                        supplier_key
year
                                item_key
                              branch_key
branch                                      location
                             location_key
 branch_key                                 location_key
                               units_sold   street            city
 branch_name
 branch_type                                city_key          city_key
                           currency_sold                      city
                                                              state_or_province
       Measures                avg_sales                      country
                                                                           26
          Example of Fact Constellation
time
time_key                                        item             Shipping Fact Table
day                                          item_key
day_of_the_week       Sales Fact Table       item_name                time_key
month                                        brand
quarter                   time_key           type                       item_key
year                                         supplier_type            shipper_key
                             item_key
                                         location                  from_location
                           branch_key
                                         location_key
 branch                   location_key   street                       to_location
                                         city
branch_key                                                         currency_cost
branch_name
                            units_sold   province_or_state
                                         country                    units_shipped
branch_type             currency_sold
                                                       shipper
                            avg_sales                  shipper_key
           Measures                                    shipper_name
                                                       location_key
                                                       shipper_type              27
  Multidimensional Data
 Dimensions are : product, month, region
 Measure is sales_amount
                            Hierarchical summarization paths

                            Industry Region       Year
Product




                            Category Country Quarter

                            Product    City     Month Week

                                       Office     Day
          Month
                                                         28
  An example of Data Cube
                         Date                 Total annual sales
           1Qtr   2Qtr    3Qtr   4Qtr   sum   of TV in Mekkah
      TV
    PC                                        Mekkah
  DVD
sum
                                              Medinah




                                                      City
                                               Quds

                                               sum




                                                               29
OLAP Operations
   Roll up (drill-up): summarize data by climbing
    up hierarchy or by dimension reduction
   Drill down (roll down): reverse of roll-up
       from higher level summary to lower level summary or
        detailed data, or introducing new dimensions
   Slice and dice: project and select
   Pivot (rotate): reorient the cube, visualization,
    3D to series of 2D planes.

                                                              30
    Three Data Warehouse Models
    Enterprise warehouse: collects all information about subjects
     (customer, products, sales, assets, personnel) that span the entire
     organization
        Requires extensive business modeling
        May take years to design and build
    Data Mart: Departmental subsets that focus on selected subjects:
     Marketing data mart: customer, product, sales
        Faster roll-out
        Complex integration in the long term
    Virtual warehouse
    1. A set of views over operational databases
    2. Only some of views may be materialized

                                                                           31
    OLAP Server Architectures
    Relational OLAP (ROLAP):
       Leaves detail values in the relational fact table
       Stores aggregated values in the relational database as
        well.


    Multidimensional OLAP (MOLAP)
       Stores both detail and aggregated within the cube.

    Hybrid OLAP (HOLAP):
       Leaves detail values in the relational fact table
       Stores aggregated values in the cube.
                                                             32
OLAP Server Architectures
ROLAP        HOLAP          MOLAP

Map            Map          Map




Detail        Detail        Detail



Aggregated   Aggregated     Aggregated
  values       values         values


                                         33
Building data warehouse
 Extraction:
  The data must be extracted from multiple, heterogeneous sources, for example,
  databases or other data feeds such as those containing financial market data or
  environmental data.


 Formatting:
  Data must be formatted for consistency within the warehouse. Names,
  meanings, and domains of data from unrelated sources must be reconciled.


 Cleaning:
  The data must be cleaned to ensure validity. Data cleaning is an involved and
  complex process that has been identified as the largest labor-demanding
  component of data warehouse construction



                                                                                  34
Building data warehouse
   Fitting;
    The data must be fitted into the data model of the warehouse. Data from
    the various sources must be installed in the data model of the warehouse.
    Data may have to be converted from relational, object-oriented, or legacy
    databases (network and/or hierarchical) to a multidimensional model.
   Loading
    The data must be loaded into the warehouse. The sheer volume of data in
    the warehouse makes loading the data a significant task. Monitoring tools
    for loads as well as methods to recover from incomplete or incorrect loads
    are required .
    load data include sort, summarize, consolidate, compute views, check
    integrity, and build indices and partitions
   Refresh:propagate the updates from the data sources to the warehouse



                                                                           35
Building data warehouse
   Important design considerations include the
   following:
       Usage projections
       The fit of the data model
       Characteristics of available sources
       Design of the metadata component
       Modular component design
       Design for manageability and change
       Considerations of distributed and parallel architecture




                                                                  36
Datawarehouse Metadata
  The metadata repository is a key data warehouse
  component. The metadata repository includes both
  technical and business metadata.
    technical metadata: covers details of acquisition
     processing, storage structures, data descriptions,
     warehouse operations and maintenance, and
     access support functionality.
    business metadata: includes the relevant
     business rules and organizational details
     supporting the warehouse.


                                                          37
Database architecture
   There are two basic distributed architectures :
   distributed warehouse
   federated warehouse




                                                      38
  Warehouse vs. Data Views
 Data warehouses are different from views in the following
  ways:
   Data Warehouses exist as persistent storage instead of being
      materialized on demand.
     Data Warehouses are not usually relational, but rather multi-
      dimensional.
     Data Warehouses can be indexed for optimization.
     Data Warehouses provide specific support of functionality.
     Data Warehouses deals huge volumes of data that is contained
      generally in more than one database.



                                                                      39
    Defficult of implementing Datawarehouses

 Some significant operational issues arise with data
 warehousing :
   construction .
   administration and
   quality control.




                                                        40

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:21
posted:10/18/2012
language:English
pages:41