Data Warehousing Architecture

• A data warehouse is an architectural
  construct of an information system that
  provides users with current and historical
  decision support information that is hard to
  access or present in traditional operational
  data stores.
• It comprises a number of components,
  illustrated in Figure 1


Figure 1: A data warehouse architecture
• Data Warehouse database
  – This is the cornerstone of the data warehouse
    (item 2 in Figure 1)
  – Almost always implemented on a relational
    database management system (RDBMS)
  – However, very large databases, ad hoc
    processing and the need for flexible user views
    (e.g. aggregates and drill-downs) call for
    different technological approaches:
     • Parallel relational database designs requiring the use
       of symmetric multiprocessors (SMP), massively parallel
       processors (MPP) or clusters
     • Speeding up traditional RDBMSs
     • The use of multidimensional databases (MDDBs),
       which are tightly coupled to on-line analytical
       processing (OLAP) tools

• Sourcing, acquisition, cleanup and transformation
  of data
   – Implementing data warehouses involves extracting data
     from operational systems including legacy systems and
     putting it into a suitable format.

   – The various tools are illustrated as item 1 in Figure 1

   – These tools perform all the conversions,
     summarisations, key changes, structural changes, and
     condensations needed to transform disparate data into
     information that can be used by decision support tools
– They produce the programs and control statements
  required to move data into the data warehouse
  from multiple operational systems
– Maintain the metadata
– Remove unwanted data
– Convert data to common names and definitions
– Calculate summaries
– Establish defaults for missing data
– Keep track of changes to source data definitions
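The transformation steps listed above can be sketched as a small Python routine. All field names, rules and defaults here are invented purely for illustration; real tools drive these steps from metadata rather than hard-coded tables.

```python
# Map source-specific column names to common warehouse names.
COMMON_NAMES = {"cust_nm": "customer_name", "amt": "amount", "dt": "order_date"}
DEFAULTS = {"region": "UNKNOWN"}          # defaults for missing data

def transform(record):
    """Rename fields, drop unwanted ones, and fill in defaults."""
    out = {COMMON_NAMES.get(k, k): v for k, v in record.items()
           if k != "internal_flag"}       # remove unwanted data
    for field, default in DEFAULTS.items():
        out.setdefault(field, default)    # establish defaults for missing data
    return out

def summarise(records):
    """Calculate a simple summary (total amount per region)."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals

rows = [{"cust_nm": "Acme", "amt": 120, "internal_flag": 1},
        {"cust_nm": "Best", "amt": 80, "region": "EU"}]
clean = [transform(r) for r in rows]
print(summarise(clean))   # {'UNKNOWN': 120, 'EU': 80}
```

The same pattern scales to key changes and structural changes: each is a pure function applied record by record before loading.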
• The tools have to deal with:
  – Database heterogeneity: DBMSs can differ widely
    in their data models and data access languages
  – Data heterogeneity: differences in the way data
    is defined and used, e.g. synonyms and different
    attributes for the same entity
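A tiny illustration of data heterogeneity: two hypothetical source systems describe the same customer entity under different names and attributes, and a per-source synonym map reconciles them. All system and field names here are assumptions.

```python
# Two operational systems describing the same entity differently.
BILLING = {"custno": 42, "name": "Acme Ltd", "credit_limit": 5000}
CRM     = {"customer_id": 42, "company": "Acme Ltd", "segment": "SME"}

# Per-source synonym maps onto a single warehouse vocabulary.
SYNONYMS = {
    "billing": {"custno": "customer_id", "name": "customer_name"},
    "crm":     {"company": "customer_name"},
}

def unify(source, record):
    """Rename a source record's fields into the common vocabulary."""
    return {SYNONYMS[source].get(k, k): v for k, v in record.items()}

merged = {**unify("billing", BILLING), **unify("crm", CRM)}
print(merged)   # one entity, one set of attribute names
```

Resolving synonyms like this is exactly the kind of mapping the acquisition and cleanup tools must maintain for every source.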

• Metadata: data about data that describes the contents of the data warehouse
   – Technical metadata: contains information for
     warehouse designers and administrators
      •   Information about data sources
      •   Transformation descriptions
      •   Rules used to perform data clean up
      •   Access authorisation, information delivery history, data
          acquisition history, data access etc.
   – Business metadata: information that gives users an
     understanding of the information stored in the data
     warehouse
      • Queries, reports, images
      • Data warehouse operational information, e.g. data
        history and ownership
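The two metadata categories above could be represented, for illustration, as simple record structures. The fields shown are assumptions, not a standard layout.

```python
from dataclasses import dataclass, field

@dataclass
class TechnicalMetadata:
    source_system: str                      # information about data sources
    transformation: str                     # transformation description
    cleanup_rules: list = field(default_factory=list)

@dataclass
class BusinessMetadata:
    description: str                        # user-facing meaning of the data
    owner: str                              # data ownership
    history: str = ""                       # e.g. load/acquisition history

@dataclass
class WarehouseObject:
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata

sales = WarehouseObject(
    name="monthly_sales",
    technical=TechnicalMetadata("orders_db", "sum of amount by month",
                                cleanup_rules=["drop cancelled orders"]),
    business=BusinessMetadata("Total sales per calendar month", "Finance"),
)
print(sales.technical.source_system, "->", sales.name)
```

Keeping both categories on the same object is one way to let administrators and end users share a single metadata store.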

• Access Tools
  – The principal purpose of the data warehouse is
    to provide information for strategic decision
    making

  – The main tools used to achieve this objective are:
     •   Data query and reporting tools
     •   Executive information system tools
     •   On-line analytical processing tools
     •   Data mining tools
       Information Delivery System
• The information delivery system (IDS) enables
  users to subscribe to data warehouse information
  and have it delivered to one or more
  destinations of choice according to a
  user-specified schedule
• Delivery may be triggered at a time of day or on
  completion of an external event.
• IDS can be implemented using a client/server
  architecture and, increasingly, via the Internet,
  intranets and the World Wide Web.
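A minimal sketch of the subscription model described above, using invented subscription records: a delivery fires either when its time of day arrives or when a named external event completes.

```python
# Each subscription names a user, a report, and a trigger that is either
# ("time", "HH:MM") or ("event", event_name). All values are illustrative.
subscriptions = [
    {"user": "alice", "report": "daily_sales", "trigger": ("time", "06:00")},
    {"user": "bob",   "report": "load_stats",  "trigger": ("event", "etl_done")},
]

def due(sub, now=None, event=None):
    """True when this subscription's trigger matches the current tick/event."""
    kind, value = sub["trigger"]
    return (kind == "time" and value == now) or (kind == "event" and value == event)

def deliver(now=None, event=None):
    """Return (user, report) pairs to send for this tick or event."""
    return [(s["user"], s["report"]) for s in subscriptions if due(s, now, event)]

print(deliver(now="06:00"))        # [('alice', 'daily_sales')]
print(deliver(event="etl_done"))   # [('bob', 'load_stats')]
```

A real IDS would persist the subscriptions and run this check from a scheduler loop or on event notifications.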
      Top 10 Data Warehousing Mistakes

  10. Pre-selecting Your Technical Environment
This is a very common trap in which many organizations find
themselves. It is traditional to select the hardware, software
and other physical, technical components of a system as one of
the earliest activities.
However, a data warehouse is an unstable environment from a
sizing perspective. How do you know the
hardware/RDBMS/end user tool is appropriate for your data
warehouse before conducting even the first round of analysis?
If at all possible, wait to select your technical environment
until after you have analyzed the business requirements for
information, data, and potential systems of record.

9. Allowing the Access Tool to Determine the Data Architecture

This is an extension of #10, but is important enough
to list by itself.
If you select an end user tool before developing your
data architecture, it is very likely that the
architecture will suffer at the hands of design
requirements imposed by the tool.
If you have to sacrifice design requirements in order
to meet functional requirements of a tool, it is
probably time to put that tool aside and select
another one.

      8. Unarchitected Data Marts
OK. Data marts are good; they are an essential part of the
data warehouse architecture. But to build only a data
mart and to ignore the rest of the data warehouse
(specifically the atomic-level data and centralized meta
data) will lead you down a path that is more expensive
and delivers lower-quality data than the alternative.
The alternative is to architect and build the data
warehouse incrementally and iteratively. Include data marts
as departmental instances of the architecture, and
populate them from the atomic-level data. This will
ensure accuracy across the architecture and reduce costs
by eliminating unnecessary population of stand-alone
data marts.
           7. Boiling the Ocean
It is more efficient to implement the data warehouse in
small, achievable and palatable chunks than to try to
implement it all at once. When I say “boil the ocean”, I
mean trying to do too many things for too many people
all at the same time.
There is an old adage: “You can't have everything;
where would you put it all?” The same holds true for a
data warehouse. If you try to design, develop and
implement a data warehouse that is all-encompassing as
your first iteration, how will the users be able to use all
that you delivered? And in the meantime, while
you’ve been trying to meet all of their needs, you have
failed to meet any needs. And users won’t forget that
for a long time.
        6. “If you build it they will come”
If you design, develop and implement an operational
system, such as an order processing system, that new
system is typically going to replace an existing system. In
other words, the old system goes away and the users have
no choice but to use the new one. Not so with the data warehouse.
“If you build it…” implies an analysis that includes only
bottom-up activities. It is crucial to the success of a data
warehouse that a top-down analysis of user requirements
for information be conducted.
After that, users must be tutored, mentored and otherwise
have their hands held as part of the implementation of the
data warehouse. Existence does not guarantee utilization
and, therefore, value.
  5. Putting ROI before RFI (Requirements for Information)
It is very difficult to quantify the intangible benefits that a data
warehouse can provide to an organization. How can you put a
price on increased customer loyalty? Somewhere, sometime,
someone has probably made this calculation. In most cases,
however, the determination of how beneficial the data warehouse
will be is based on criteria that were developed for operational
systems. Just as you cannot use operational data to meet your
strategic informational requirements, it is difficult to calculate the
return on investment (ROI) of a data warehouse.
In terms of benefits to the organization, it is more appropriate to
concentrate on how well the data warehouse addresses the target
users’ requirements for information.

     4. No Committed User Involvement
Write this down:
The success of any data warehouse is directly
proportional to the amount of end user
participation.
A data warehouse cannot be successful without
active participation on the part of the target users.
Period. If you do not have user participation, you
will find yourself in a situation where you will
build it and hope that they will come.
If there is no serious user participation in a data
warehouse project, you have to seriously question
whether or not the organization truly needs a data
warehouse.

            3. No Dedicated DBA
In many situations the lack of a dedicated database administrator
(DBA) has prevented a data warehouse project from being
completed 1) on time, or 2) successfully.
“Borrowing” a DBA from the operational “pool” will only result
in questions about the nature of the data warehouse data models
and database design. It’s too flat, not normalized properly, too
much redundancy, and other criticisms are well suited for an
operational system’s database design, but not a data warehouse.
Considering that “data” is the first word in “data warehouse”, be
sure you have a dedicated database administration resource
committed to this important project.

               2. No Meta Data
Meta data is like documentation and training:
Everyone knows it is necessary, but it usually gets
dropped somewhere along the route to implementation.
For the data warehouse, meta data is more important
than just your typical documentation. Remember, in
order to turn data into information you have to have the
data, know that you have it, be able to access it, and
trust it. Meta data is the means by which the users will
be able to understand and trust the data. A time-variant
record of where data came from, what happened to it
along the way, where it is in the data warehouse, and
what vehicles exist to access it will spell the difference
between success and frustration.
          1. Analysis Paralysis
The inability to proceed past a sticking question. Wanting to
“boil the ocean” and model/design everything before
proceeding with development. Having to resolve political
issues surrounding a “standard” or “common” definition. All of
these things (and more!) will result in analysis paralysis.
The 80/20 rule is very applicable to the development of a data
warehouse. Execute 20% effort to get 80% of the total
outcome, then move on to the next set of challenges and
opportunities. Many data warehouse failures started when the
development team stopped.
Get your hands around an idea, understand what the users’
requirements for information are, and build something that
produces something that can be evaluated. Don’t just stand
there…do something!
          Parallel Data Management
• A topic that’s closely linked to Data Warehousing is that
  of Parallel Data Management.
• The argument goes:
   – if your main problem is that your queries run too slowly, use
     more than one machine at a time to make them run faster
     (Parallel Processing).
   – Oracle uses this strategy in its warehousing products.
• There are two types of parallel processing - Symmetric
  Multiprocessing (SMP), and Massively Parallel
  Processing (MPP)

• SMP - means the O.S. runs and schedules tasks on
  more than one processor without distinction.
   – in other words, all processors are treated equally in an
     effort to get the list of jobs done.
• MPP - more varied in its design, but essentially
  consists of multiple processors, each running its
  own program.
   – the problem with MPP is to harness all these processors
     to solve a single problem.

• Regardless of the architecture used, there are still
  alternatives regarding the use of the parallel
  processing capabilities.
  – In normal Transaction processing, each transaction
    runs on a separate processor, because transactions are
    small units of work that run in a reasonable time-span.
  – However, the type of analysis carried out in data
    warehouse applications isn’t like that. Typically you
    want to run a query that looks at all the data in a set of
    tables. The problem is splitting that into chunks that
    can be assigned to the multiple processors.

• There are two possible solutions to this problem:
  Static and Dynamic Partitioning.
   – In Static Partitioning you break up the data into a
     number of sections. Each section is placed on a
     different processor with its own data storage and
     memory. The query is then run on each of the
     processors, and the results combined at the end to give
     the entire picture.
   – This is like joining a queue in a supermarket. You stay
     with it until you reach the check-out.
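Static Partitioning can be sketched like this: the data is split into fixed sections up front, the same query runs against each section, and the partial results are combined at the end. Plain Python functions stand in for the processors and their private storage; the data and query are invented.

```python
data = list(range(1, 101))                  # the full table
sections = [data[i::4] for i in range(4)]   # fixed split across 4 "processors"

def run_query(section):
    """The query each processor runs on its own section (here: sum the evens)."""
    return sum(x for x in section if x % 2 == 0)

partials = [run_query(s) for s in sections]  # one result per processor
print(sum(partials))                         # combined result: 2550
```

Note the split is decided before any work starts; if one section happens to hold most of the relevant rows, that processor becomes the bottleneck.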

– The main problem with Static Partitioning is that you can't
  tell how much processing the various sections need. If most of the
  relevant data is processed by one processor you could end up
  waiting almost as long as if you didn't use parallel processing at
  all.
– In Dynamic Partitioning the data is stored in one place, and the
  data server takes care of splitting the query into multiple tasks,
  which are allocated to processors as they become available.
– This is like the single queue in a bank. As a counter position
  becomes free, the person at the head of the queue takes that
  position.
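Dynamic Partitioning, by contrast, keeps the data in one place and feeds small tasks to whichever worker is free, like the single bank queue. Here threads and a shared queue stand in for the processors and the data server; the query is the same invented even-sum as before.

```python
import queue
import threading

data = list(range(1, 101))
tasks = queue.Queue()
for i in range(0, len(data), 10):            # split the query into 10-row tasks
    tasks.put(data[i:i + 10])

results, lock = [], threading.Lock()

def worker():
    while True:
        try:
            chunk = tasks.get_nowait()       # next free worker takes next task
        except queue.Empty:
            return
        partial = sum(x for x in chunk if x % 2 == 0)
        with lock:
            results.append(partial)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))                          # same combined result: 2550
```

Because tasks are claimed on demand, a worker that finishes early simply takes another task, which is why the load balances itself.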

• With Dynamic Partitioning the performance
  improvement can be dramatic, but the
  partitioning is out of the user's hands.