Data warehouse - e Page by niusheng11


									Unit 5 A – Data warehouse

What is a data warehouse?

A simple answer to this question is that a data warehouse is managed
data situated after and outside operational systems. This answer is
based on the concept that fundamentally distinguishes an
organization's operation data (data that used to run the organization)
and information data (data used to manage the organization.) For
example, a bank's operation data are the most up-to-date financial
data for each customer, which are stored in different banking hosts.
A given customer's saving account data will be stored in a saving
account database, while his/her stock trading data will be stored in a
stock exchange database. Operation data can be dispersed amongst
different places in an organization, and they are usually overwritten
by new data as transactions go on. On the other hand, the same bank
may have a data warehouse, i.e. a single data repository that stores
all the financial data (all the transaction records) of all its customers
over a period of time (say several years). This huge amount of
accumulated archival data will serve as the bank's information data
on which business intelligence analyses can be done.

From an application point of view, a data warehouse can be defined
as any centralized data repository that can be queried for business
benefit. Warehousing makes it possible to:

      extract archived operational data;
      overcome inconsistencies between different legacy data
      integrate data throughout an enterprise, regardless of
       location, format or communication requirements; and
      incorporate additional or expert information.
Characteristics of data warehouses

As defined by Bill Inmon (1992)(an information system guru and a
well-known data warehousing expert), one can easily recognize the
following characteristics of a data warehouse:

      Subject-oriented -- data in the warehouse are organized by
       subject instead of application, e.g. an insurance company
       would organize their data by customer, premium and claim,
       instead of by different products (auto, life, etc.). A data
       warehouse contains only the information necessary for
       decision support processing.
      Integrated -- encoding of data is often inconsistent, e.g.
       gender might be coded as 'm' and 'f' or 0 and 1, but when data
       are moved from the operational environment into the data
       warehouse they assume a consistent coding convention.
      Time-variant -- a data warehouse contains a place for storing
       data that are 5 to 10 years old, or older. These data are used
       for comparisons, trends and forecasting, and are not updated.
      Non-volatile -- data are not updated or changed in any way
       once they enter a data warehouse. They are only loaded and

These unique characteristics usually pose many challenges to the
warehouse designer or architect when she considers how to
implement a data warehouse for an organization. These challenges
include the considerations of the following criteria for a data

      Load Performance -- as data warehouses require incremental
       loading of new data on a periodic basis, they must not
       artificially constrain the volume of data.
      Data Quality Management -- a data warehouse must ensure
       local consistency, global consistency and referential integrity
       despite 'dirty' sources and massive database size
      Query Performance -- a data warehouse must not be slowed
       or inhibited by the performance of its RDBMS.
      Terabyte Scalability -- data warehouse sizes are growing at
       astonishing rates so their RDBMSs must not have any
       architectural limitations. They must support modular and
       parallel management.
      Mass User Scalability -- access to warehouse data must not be
       limited to an elite few. A data warehouse has to support
       hundreds, even thousands, of concurrent users while
       maintaining acceptable query performance.
      Networked Data Warehouses -- data warehouses rarely exist
       in isolation -- users must be able to look at and work with
       multiple warehouses from a single client workstation.
      Warehouse Administration -- since a data warehouse is
       large-scale and time-cyclic, it must offer administrative ease
       and flexibility.
      Integration of dimensional analysis by RDBMS -- dimensional
       support must be inherent in the warehouse RDBMS to provide
       the highest performance for relational OLAP tools.
      Advanced Query Functionality -- end users require advanced
       analytic calculations, sequential and comparative analysis,
       and consistent access to detailed and summarized data

Data warehousing architecture

Data warehouses, as we have discussed, are different from corporate
databases in the sense that they serve an organization's information
needs instead of its operational needs. Depending on various factors
such as the availability of resources and business needs, there are a
number of different ways to structure data warehouses. In this unit,
we are not going to look into the technical details of data warehouse
architecture -- i.e. the hardware, software and physical layout.
Instead, we will look into the architecture of data warehouses from
the organization perspective as well as a business-needs perspective.
Organization architecture and implementation choices

There are a number of architecture choices for data warehouses as
seen from the organization perspective. These architecture choices
will be based on managerial factors such as:

      The organization's existing physical and management
       structure -- Is the organization highly centralized in one
       location or highly dispersed with many branch offices? Is the
       management structure highly centralized or is there relatively
       independent departmental decision making?
      Technical capability and resources availability -- Does the IS
       department control all the technical expertise and resources
       or are there departmental computing capabilities and
      Commitment and style of management -- Is senior
       management highly committed to data warehouse
       development or is it only driven by needs from frontline users?

Depending on these various organizational factors, there are three
broad types of data warehouse architecture:

      Global architecture -- this architecture is built upon a fully
       integrated data warehouse across all the departments and
       lines of business of an enterprise. The architecture is global in
       the sense of data access instead of the physical organization
       of the data warehouse. A globalized data warehouse can be
       either physically centralized or distributed. However, a global
       architecture can only be effectively managed and maintained
       by a centralized IS department.
      Independent data mart architecture -- the term 'data mart'
       refers to a data warehouse on a smaller scale, usually small,
       stand-alone data warehouses developed by workgroups,
       departments or lines of business within an organization.
       Implementation and maintenance of these data marts are
       usually drawn from a department's own resources and
      Interconnected data mart architecture -- several independent
       data marts within an organization can be integrated or
       interconnected to provide sharing of access to data marts
       across departments or workgroups.

In addition to these architectural choices, there are also several
different approaches for the implementation of data warehouses that
also depend on various organizational factors. These choices are:

      Top-down approach -- this implementation approach usually
       leads to a global architecture, and usually features more
       senior management involvement and more resources from
       the organization. More consistent data definition and business
       rules should be the result. However, the cost for
       implementation will be high and more time will be required.
      Bottom-up approach -- this approach is more flexible in the
       sense that it is usually driven by business needs in individual
       departments and lines of business. This approach does not
       necessarily end up in data mart architectures. The usual
       scenario is that a global data warehouse be built incrementally
       with initial data mart implementations expanding and being
       joined up.
      Combined approach -- this approach is in a sense the best
       approach since, with good project management control, this
       approach can be a 'balancing act' that combines the
       advantages of both the top-down and bottom-up approaches.

Data architecture for data warehouses

In the previous subsection, we talked about the architecture of data
warehouses from the organization's perspective. You learned how a
data warehouse could be organized in terms of the ownership,
management and control of data and usage. This architectural design
is also directly related to how a data warehouse is implemented. In
this subsection, we will look at the architecture of a data warehouse
from another perspective; that is, how the data contained in a data
warehouse should be organized to meet different business needs.

To begin our consideration of architecture from this business-needs
perspective, we first take a look at the different types of data that can
be stored in a data warehouse. There are three basic types of data,
              real-time data
              derived data
              reconciled data

       As you've seen, there are different methodologies for data
       architecture design in data warehouses. The Enterprise Data Model
       (EDM) is one of the methodologies that enables all the data elements
       to be defined consistently throughout the whole organization. Usually
       an EMD exercise is divided into phases. Each phase will include
       different amounts of information. A full-scale EDM exercise can be
       very resource intensive and time-consuming and may not be
       practical for some organizations. Hence, a trimmed-down version of
       EDM can be adopted in which some of the core components that are
       required for data warehouse modelling can be extracted and grouped.
       The following reading presents the details of EMD and a simplified
       version of it. The benefits and drawbacks of EMD are also presented.

 In the physical design phase of data modelling, a very important aspect of the
design is the 'granularity' of the data. This concept is concerned with the degree of
summarization of the data elements. The level of data granularity will directly affect
the ability to answer queries in future data analysis and data mining. The following
reading gives you the details of how to proceed with a choice of data granularity.

Data partitioning is very important in data architecture design. It affects the
efficiency and flexibility of accessing data, and also affects the maintenance aspects
of the data warehouse such as the scalability, portability and ease of sharing and
archiving data.

There are two different perspectives in data partitioning: physical partitioning and
logical partitioning. Physical partitioning refers to how data are structured and
grouped according to the physical design of a data warehouse. Logical partitioning
refers to how data are structured and grouped based on the characteristics of the
data, such as customer, product, time period, accounts, etc. Logical data
partitioning will facilitate access of data according to business needs. However,
physical data partitioning can overlap with logical data partitioning.

To top