Datawarehouse Concept

					1. BASIC DEFINITIONS
    Datawarehousing :
   A data warehouse (DWH) is a repository of integrated information, structured specifically for
   queries and analysis. Data is extracted from heterogeneous sources as it is generated and
   brought together in one place. This makes it much easier and more efficient to run queries
   over data that originally came from different sources.
   “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
   collection of data in support of management’s decision making process”.

   Subject-oriented – a DW is organized around major subjects; excludes data that is
   not useful in the decision support process.

   Integrated – a DW is constructed by integrating numerous data sources (relational
   DB, flat files, legacy systems). The DW provides mechanisms for cleaning and
   standardizing the data.

   Time-variant – data is stored to provide information from a historical perspective.
   Every key structure in the data warehouse contains, either implicitly or explicitly, an
   element of time.

   Nonvolatile – a DW is physically separated from the operational environment. Due to
   this separation it does not require transaction processing, recovery, and concurrency
   control mechanisms. It usually requires only two operations: initial loading of data
   and access of data.

   Data Warehouse is an architecture constructed by integrating data from multiple
   heterogeneous sources to support structured and/or ad hoc queries, analytical
   reporting and decision making.
       Data Warehousing is a process of constructing and using data warehouses.

           A Multi-Subject Information Store
           Typically 100’s of Gigabytes to Terabytes

    Data Mart :




Datawarehousing Concepts
                                       Page 1 of 10
   It is a collection of subject areas organized for decision support based on the needs of
   a given department, for example sales or marketing. A data mart is designed to suit the
   needs of that department, and its data is usually much less granular (more summarized)
   than the warehouse data. A data mart is:

           A Single Subject Data Warehouse
           Often Departmental or Line of Business Oriented
           Typically Less Than 100 Gigabytes

    Differences between DWH & Data Mart :
   A DWH is used at the enterprise level, while data marts are used at the business division /
   department level. Data warehouses are arranged around the corporate subject areas found in the
   corporate data model. Data warehouses contain more detailed information, while most data marts
   contain more summarized or aggregated data.


    OLTP :
   OLTP is Online Transaction Processing. It uses a standard, normalized database
   structure. OLTP is designed for transactions, which means that inserts, updates and
   deletes must be fast.
    OLAP :
   OLAP is Online Analytical Processing. It works on read-only, historical, aggregated data.




    Difference between OLTP and OLAP :

                           OLTP                                         OLAP
         Current data                                     Current and historical data
         Short database transactions                      Long database transactions
         Online update/insert/delete                      Batch update/insert/delete
         Normalization is promoted                        Denormalization is promoted
         High volume transactions                         Low volume transactions
         Transaction recovery is necessary                Transaction recovery is not necessary
         High number of concurrent users                  Low number of concurrent users

    Fact Table :
   It contains the quantitative measures about the business.
   Fact tables that contain aggregated facts are often called summary tables.
    Dimension Table :
   It contains descriptive data about the facts (business).


    Aggregate tables :
   Aggregate tables are pre-stored summary tables. Using aggregates can make queries
   several times faster, because a report reads a few summary rows instead of many detail rows.
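
   As a minimal sketch of this idea, the snippet below uses Python's standard sqlite3
   module to pre-store monthly totals from a detail table. The table and column names
   (sales_detail, sales_by_month) and the sample rows are assumptions made up for
   illustration, not part of any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?)",
    [
        ("2010-01-05", "widget", 10.0),
        ("2010-01-20", "widget", 15.0),
        ("2010-02-03", "widget", 7.5),
        ("2010-02-10", "gadget", 20.0),
    ],
)

# Pre-store the monthly totals once, so that reports can read the summary rows
# instead of re-scanning every detail row.
conn.execute(
    """
    CREATE TABLE sales_by_month AS
    SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS total_amount
    FROM sales_detail
    GROUP BY substr(sale_date, 1, 7), product
    """
)

# A report now queries the (much smaller) aggregate table.
monthly_totals = conn.execute(
    "SELECT month, product, total_amount FROM sales_by_month ORDER BY month, product"
).fetchall()
```

   In a real warehouse the aggregate would have to be refreshed whenever new detail
   rows are loaded; that step is left out of this sketch.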

    Conformed dimensions :
   A conformed dimension is a dimension table shared by several fact tables. These tables
   connect separate star schemas into an enterprise star schema.
    Schema :

   A schema is a collection of database objects, including tables, views, indexes, and
   synonyms. There are a variety of ways of arranging schema objects in the schema
   models designed for data warehousing. Most data warehouses use a dimensional
   model.



    Star Schema :
   A star schema is a set of tables comprising a single, central fact table surrounded by
   de-normalized dimension tables. Star schemas implement dimensional data structures with
   de-normalized dimensions.
    Snow Flake Schema:
   A snowflake schema is a set of tables comprising a single, central fact table
   surrounded by normalized dimension hierarchies. Snowflake schemas implement
   dimensional data structures with fully normalized dimensions.
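
   The star layout can be sketched with Python's standard sqlite3 module as follows.
   All table names, column names and rows (fact_sales, dim_product, dim_date) are
   hypothetical. The product dimension is de-normalized here: the category name sits
   on the product row. A snowflake design would instead move it into its own
   category table referenced from dim_product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- De-normalized dimension: the category is stored on the product row itself.
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, calendar_date TEXT, year INTEGER);
    -- Central fact table: one key per dimension plus a numeric measure.
    CREATE TABLE fact_sales (product_key INTEGER, date_key INTEGER, sales_amount REAL);

    INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'manual', 'books');
    INSERT INTO dim_date VALUES (1, '2009-12-31', 2009), (2, '2010-01-01', 2010);
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 30.0), (2, 2, 5.0);
    """
)

# A typical star query: join the fact table to its dimensions, then group by
# descriptive attributes taken from the dimension tables.
sales_by_category = conn.execute(
    """
    SELECT p.category_name, d.year, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category_name, d.year
    ORDER BY p.category_name, d.year
    """
).fetchall()
```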


    Queries :
   The DWH serves two broad types of queries, fixed and ad hoc:
      Fixed queries are clearly defined and well understood, such as regular
       reports.
      Ad hoc queries are the starting point for any analysis into a database. They are
       unpredictable, both in quantity and frequency. The ability to run any query when
       desired and expect a reasonable response is what makes the data warehouse
       worthwhile, and what makes its design such a significant challenge.
       End-user access tools can automatically generate the database query that
       answers a question posed by the user.
      Canned queries are pre-defined (fixed) queries. They contain prompts that
       allow you to customize the query for your specific needs.


    Kimball (Bottom up) vs Inmon (Top down) approaches :


      Bottom up: According to Ralph Kimball, when you plan to design analytical solutions
       for an enterprise, start by building data marts. Once you have 3 or 4 such data marts,
       you effectively have an enterprise-wide data warehouse (EDWH), without spending time
       and effort exclusively on building the EDWH, because a data mart takes less time to
       build than an EDWH.
      Top down: According to Bill Inmon, you should build the enterprise-wide data warehouse
       first, and all the data marts will be subsets of the EDWH. In his view, independent
       data marts cannot make up an enterprise data warehouse under any circumstance; they
       will remain isolated pieces of information (stovepipes).
    ER Diagram :
       The ER model is a conceptual data model that views the real world as entities and
       relationships. A basic component of the model is the Entity-Relationship
       diagram, which is used to visually represent data objects.


    ETL :

       ETL stands for Extraction, Transformation & Loading.

       ETL tools on the market include, for example, Informatica, Ascential DataStage, Acta
       and Oracle Warehouse Builder (OWB).




    Staging Area :
   It is the work place where raw data is brought in, cleaned, combined, archived and
   exported to one or more data marts. The purpose of data staging area is to get data
   ready for loading into a presentation layer.

    Slowly Changing Dimensions :

Dimensions are said to be slowly changing dimensions when their attributes remain
almost constant and change only occasionally.

       E.g. a customer's marital status
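
The text above does not say how such a change should be stored. One common approach
(often called a Type 2 slowly changing dimension) keeps the old row and adds a new
one, so that history is preserved. The sketch below illustrates that approach only;
the CustomerRow class, its fields and the dates are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CustomerRow:
    customer_id: str          # business key from the source system
    marital_status: str       # the slowly changing attribute
    valid_from: str           # ISO date on which this row became current
    valid_to: Optional[str]   # None means the row is still current


def apply_change(rows: List[CustomerRow], customer_id: str,
                 new_status: str, change_date: str) -> None:
    """Close the current row and append a new one if the attribute changed."""
    for row in rows:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.marital_status == new_status:
                return  # nothing changed, keep the current row as it is
            row.valid_to = change_date  # close the old version
            break
    rows.append(CustomerRow(customer_id, new_status, change_date, None))


history = [CustomerRow("C1", "single", "2005-01-01", None)]
apply_change(history, "C1", "married", "2010-05-11")
```

After the call, the dimension holds two rows for the same customer: the closed
"single" row and the current "married" row.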

    Bitmap indexes and B-tree indexes are the indexing mechanisms used in a typical data
     warehouse.

    OLAP, MOLAP, ROLAP, DOLAP, HOLAP :

   OLAP: Online Analytical Processing.
   OLAP tools on the market include, for example, Business Objects, Brio, Cognos, MicroStrategy,
   Alphablock and Crystal Reports.
   ROLAP: Relational OLAP. The users see cubes, but under the hood the data is stored in
   plain relational tables. MicroStrategy is a ROLAP product.
   MOLAP: Multidimensional OLAP. The users see cubes, and under the hood
   there is one big cube. Oracle Express used to be a MOLAP product.
   DOLAP: Desktop OLAP. The users see many cubes, and under the hood there
   are many small cubes. Cognos PowerPlay is an example.
   HOLAP: Hybrid OLAP, which combines MOLAP and ROLAP. Essbase is an example.


    Types of Facts:

             Additive

          –   Facts that can be added along all the dimensions
          –   Discrete numerical measures, e.g. retail sales in $


             Nonadditive

          –   Numeric measures that cannot be added across any dimension
          –   Intensity measures that are averaged across all dimensions, e.g. room temperature
          –   Textual facts (avoid them)


             Semi Additive

          –   Snapshot, taken at a point in time
          –   Measures of intensity
          –   Not additive along the time dimension, e.g. account balance, inventory balance
          –   Summed and divided by the number of time periods to get a time-average
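
   The difference between additive and semi-additive facts can be shown with a few
   lines of Python. The figures below are made-up sample values: sales can simply be
   summed over days, while balances are summed across accounts but averaged over time.

```python
# Hypothetical daily figures, for illustration only.
daily_sales = [100.0, 150.0, 50.0]           # additive: dollars sold on each day
daily_balance = [1000.0, 1200.0, 1100.0]     # semi-additive: end-of-day balance of one account

# Additive fact: the sum over the time dimension is meaningful.
total_sales = sum(daily_sales)

# Semi-additive fact: summing balances over time would be meaningless, so the
# values are added and divided by the number of periods to get a time-average.
average_balance = sum(daily_balance) / len(daily_balance)

# Along a non-time dimension (here: accounts, on a single day) balances do add up.
balances_on_one_day = {"account_a": 1000.0, "account_b": 250.0}
total_balance_on_one_day = sum(balances_on_one_day.values())
```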

    Attributes :

   A field represented by a column within an object (entity). An object may be a table,
   view or report. An attribute is also associated with an SGML(HTML) tag used to
   further define the usage.

    Business Activity Monitoring (BAM) :

   BAM is a business solution that is supported by an advanced technical infrastructure
   that enables rapid insight into new business strategies, the reduction of operating cost
   by real-time identification of issues and improved process performance.

    Business Intelligence (BI) :

   Business intelligence is actually an environment in which business users receive data
   that is reliable, consistent, understandable, easily manipulated and timely. With this
   data, business users are able to conduct analyses that yield overall understanding of
   where the business has been, where it is now and where it will be in the near future.
   Business intelligence serves two main purposes. It monitors the financial and
   operational health of the organization (reports, alerts, alarms, analysis tools, key
   performance indicators and dashboards). It also regulates the operation of the
   organization, providing two-way integration with operational systems and
   information feedback analysis.

    Data Integration :

   Pulling together and reconciling, for analytic purposes, dispersed data that
   organizations have maintained in multiple, heterogeneous systems. Data needs to be
   accessed and extracted, moved and loaded, validated and cleaned, and standardized
   and transformed.
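
   These steps can be sketched as a tiny extract / transform / load pipeline in Python.
   The two sources (a CSV export and a list of billing records), their field names and
   the cleaning rules are all assumptions chosen for illustration.

```python
import csv
import io

# Two hypothetical sources that describe customers in different shapes.
crm_csv = "id,name,country\n1, Alice ,us\n2,Bob,DE\n"
billing_records = [{"cust_no": "3", "full_name": "carol", "ctry": "Us"}]


def extract():
    """Access both sources and return their raw rows."""
    crm_rows = list(csv.DictReader(io.StringIO(crm_csv)))
    return crm_rows, billing_records


def transform(crm_rows, billing_rows):
    """Reconcile the two layouts, then clean and standardize the values."""
    unified = [(r["id"], r["name"], r["country"]) for r in crm_rows]
    unified += [(r["cust_no"], r["full_name"], r["ctry"]) for r in billing_rows]
    return [
        {
            "customer_id": int(cust_id),
            "name": name.strip().title(),          # trim blanks, unify capitalization
            "country": country.strip().upper(),    # one standard country code format
        }
        for cust_id, name, country in unified
    ]


def load(rows, target):
    """Validate each row and append it to the target store."""
    for row in rows:
        if row["name"]:  # minimal validation: skip rows without a name
            target.append(row)


warehouse_customers = []
load(transform(*extract()), warehouse_customers)
```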

    Data Mapping :

   The process of assigning a source data element to a target data element.
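
   In its simplest form, such a mapping is a lookup from source element names to target
   element names. The sketch below assumes hypothetical element names (cust_no,
   customer_id and so on) and ignores any source element that has no mapping.

```python
# Hypothetical mapping: source data element -> target data element.
field_mapping = {
    "cust_no": "customer_id",
    "full_name": "customer_name",
    "ctry": "country_code",
}


def map_record(source_record, mapping):
    """Return a target record that contains only the mapped elements."""
    return {
        target: source_record[source]
        for source, target in mapping.items()
        if source in source_record
    }


target_record = map_record(
    {"cust_no": "42", "full_name": "Alice", "ignored": "x"}, field_mapping
)
```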

    Data Mining :

   A technique using software tools geared for the user who typically does not know
   exactly what he's searching for, but is looking for particular patterns or trends. Data
   mining is the process of sifting through large amounts of data to produce data
   content relationships. It can predict future trends and behaviors, allowing businesses
   to make proactive, knowledge-driven decisions. This is also known as data surfing.

    Data Modeling :

   A method used to define and analyze data requirements needed to support the
   business functions of an enterprise. These data requirements are recorded as a
   conceptual data model with associated data definitions. Data modeling defines the
   relationships between data elements and structures.

    Drill Down:

   A method of exploring detailed data that was used in creating a summary level of
   data. Drill down levels depend on the granularity of the data in the data warehouse.
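
   A minimal Python sketch of drilling down from a yearly summary to the months behind
   it is shown below. The sample rows and the helper names (summarize, drill_down) are
   made up for the example. The lowest level that can be reached is the stored grain,
   here one row per day.

```python
from collections import defaultdict

# Hypothetical detail rows at daily grain: (ISO date, amount).
sales = [
    ("2010-01-05", 10.0),
    ("2010-01-20", 15.0),
    ("2010-02-03", 7.5),
    ("2009-12-31", 4.0),
]


def summarize(rows, prefix_length):
    """Total the amounts by date prefix: 4 chars = year, 7 = month, 10 = day."""
    totals = defaultdict(float)
    for date, amount in rows:
        totals[date[:prefix_length]] += amount
    return dict(totals)


def drill_down(rows, parent, prefix_length):
    """Show the more detailed level that makes up one summary value."""
    return summarize([r for r in rows if r[0].startswith(parent)], prefix_length)


by_year = summarize(sales, 4)                    # summary level
months_of_2010 = drill_down(sales, "2010", 7)    # detail behind the 2010 total
```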

    Meta Data:

   Meta data is data that expresses the context or relativity of data. Examples of meta
   data include data element descriptions, data type descriptions, attribute/property
   descriptions, range/domain descriptions and process/method descriptions. The
   repository environment encompasses all corporate meta data resources: database
   catalogs, data dictionaries and navigation services. Meta data includes name, length,
   valid values and description of a data element. Meta data is stored in a data dictionary
   and repository. It insulates the data warehouse from changes in the schema of
   operational systems.

    Normalization:

   The process of reducing a complex data structure into its simplest, most stable
   structure. In general, the process entails the removal of redundant attributes, keys, and
   relationships from a conceptual data model.

    Surrogate Key:

   A surrogate key is a single-part, artificially established identifier for an entity.
   Surrogate key assignment is a special case of derived data - one where the primary
   key is derived. A common way of deriving surrogate key values is to assign integer
   values sequentially.
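
   Sequential assignment can be sketched in Python as follows. The class name and the
   sample natural keys are assumptions for illustration. Each distinct natural key
   receives the next integer, and asking again for the same natural key returns the
   key that was already assigned.

```python
import itertools


class SurrogateKeyGenerator:
    """Assigns sequential integer surrogate keys to natural (source) keys."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._assigned = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key if this natural key was seen before.
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._counter)
        return self._assigned[natural_key]


keys = SurrogateKeyGenerator()
first = keys.key_for(("CRM", "A-17"))            # hypothetical multi-part source key
second = keys.key_for(("BILLING", "000042"))
repeat = keys.key_for(("CRM", "A-17"))
```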




                            MOLAP, ROLAP, and HOLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP
(MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to
technologies that combine MOLAP and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats.

Advantages:

      Excellent performance: MOLAP cubes are built for fast data retrieval, and are
       optimal for slicing and dicing operations.
      Can perform complex calculations: All calculations have been pre-generated
       when the cube is created. Hence, complex calculations are not only doable, but
       they return quickly.
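
The pre-generation idea can be illustrated with a toy cube in Python: every total is
computed once when the cube is built, so answering a question later is a single
dictionary lookup. This is only a sketch of the principle, not how any real MOLAP
product stores its data; the fact rows and the "*" marker for "all members" are
assumptions.

```python
from itertools import product

# Hypothetical fact rows: (product, region, sales amount).
facts = [
    ("widget", "east", 10.0),
    ("widget", "west", 5.0),
    ("gadget", "east", 2.0),
]

ALL = "*"  # stands for "all members" of a dimension


def build_cube(rows):
    """Pre-compute the total for every (product, region) cell, including ALL."""
    cube = {}
    for prod, region, amount in rows:
        # Each fact contributes to four cells: the detail cell, the two
        # subtotal cells and the grand total.
        for cell in product((prod, ALL), (region, ALL)):
            cube[cell] = cube.get(cell, 0.0) + amount
    return cube


cube = build_cube(facts)

# Queries are now plain lookups; nothing is calculated at query time.
widget_total = cube[("widget", ALL)]
east_total = cube[(ALL, "east")]
grand_total = cube[(ALL, ALL)]
```

The number of cells grows quickly with the number of dimensions and members, which
is one reason why cube size is a concern.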

Disadvantages:

      Limited in the amount of data it can handle: Because all calculations are
       performed when the cube is built, it is not possible to include a large amount of
       data in the cube itself. This is not to say that the data in the cube cannot be
       derived from a large amount of data. Indeed, this is possible. But in this case, only
       summary-level information will be included in the cube itself.
      Requires additional investment: Cube technology is often proprietary and does not
       already exist in the organization. Therefore, to adopt MOLAP technology,
       chances are additional investments in human and capital resources are needed.

ROLAP

This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL
statement.
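
A minimal sketch of this behaviour, using Python's standard sqlite3 module, is shown
below. Each filter the user picks becomes one more condition in the WHERE clause of
the generated SQL. The sales table, its columns and the slice_total helper are
hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("widget", "east", 2009, 10.0),
        ("widget", "east", 2010, 30.0),
        ("widget", "west", 2010, 5.0),
        ("gadget", "east", 2010, 2.0),
    ],
)

ALLOWED_COLUMNS = {"product", "region", "year"}


def slice_total(filters):
    """Sum the amount for the selected slice; each filter adds a WHERE condition."""
    conditions, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown dimension: {column}")
        conditions.append(f"{column} = ?")   # column name comes from the whitelist
        params.append(value)                 # value is passed as a bound parameter
    sql = "SELECT COALESCE(SUM(amount), 0) FROM sales"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return conn.execute(sql, params).fetchone()[0]


all_sales = slice_total({})                                  # no WHERE clause
east_sales = slice_total({"region": "east"})                 # slice on one dimension
east_2010 = slice_total({"region": "east", "year": 2010})    # dice on two dimensions
```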

Advantages:

      Can handle large amounts of data: The data size limitation of ROLAP technology
       is the limitation on data size of the underlying relational database. In other words,
       ROLAP itself places no limitation on data amount.
      Can leverage functionalities inherent in the relational database: Often, the relational
       database already comes with a host of functionalities. ROLAP technologies, since
       they sit on top of the relational database, can therefore leverage these
       functionalities.

Disadvantages:

      Performance can be slow: Because each ROLAP report is essentially a SQL query
       (or multiple SQL queries) in the relational database, the query time can be long if
       the underlying data size is large.
      Limited by SQL functionalities: Because ROLAP technology mainly relies on
       generating SQL statements to query the relational database, and SQL statements
       do not fit all needs (for example, it is difficult to perform complex calculations
       using SQL), ROLAP technologies are therefore traditionally limited by what SQL
       can do. ROLAP vendors have mitigated this risk by building into the tool out-of-
       the-box complex functions as well as the ability to allow users to define their own
       functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance.
When detail information is needed, HOLAP can "drill through" from the cube into the
underlying relational data.
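
The division of labour can be sketched in Python as follows. Summary questions are
answered from pre-computed totals held in memory (standing in for the cube), while a
"drill through" runs a query against the relational detail table. The table, the
helper names and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "east", 30.0), (3, "west", 5.0)],
)

# Summary level: totals are computed once and kept in memory, like a small cube.
summary_cube = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))


def region_total(region):
    """Answer a summary question from the pre-computed cube (no SQL at query time)."""
    return summary_cube[region]


def drill_through(region):
    """Fetch the detail rows behind a summary value from the relational table."""
    return conn.execute(
        "SELECT order_id, amount FROM sales WHERE region = ? ORDER BY order_id",
        (region,),
    ).fetchall()


east_summary = region_total("east")
east_details = drill_through("east")
```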





				
Posted: 5/11/2010