
      A TECHNICAL ARCHITECTURE FOR THE DATA WAREHOUSE

                                David Walker
                                  Consultant
           Data Management & Warehousing, Wokingham, United Kingdom


Summary

The paper aims to set out a brief description of the process of implementing a Data
Warehouse and then to look in more detail at one particular aspect of the process,
namely the Technical Architecture. Whilst it cannot be exhaustive, it does set out a
model that can be used as a guide in the implementation of a system.

Introduction

When an organisation decides to build a Data Warehouse there are four key elements
required of the process. These elements are:

       • The Business Analysis

       A process of gathering together users of information (this is not the IT
         department but other departments such as marketing, sales, finance, customer
         service, etc.) and asking them what their goals are and how they measure
         them. The measurements are the Key Performance Indicators (KPI) and will
         eventually become the answers required from the Data Warehouse. The
         constraints that they place on these KPI, such as time, location, etc., are the
         dimensions of the Data Warehouse. This basic design allows users to
         formulate natural language questions against a relational database [for
         example: what are the total sales (KPI) and average sales (KPI) by region
         (Dimension) over the past two years (Dimension)?]

          Obtaining a common understanding between departments of the KPI and
          dimensions will in itself prove a challenge and is normally approached as a
          series of workshops. Users should understand that the Data Warehouse is an
          iterative design and should therefore not expect answers to all their questions
          on the first pass. It is also likely that the required information from the system
          will change over time, as decisions based on information that has become
          available will change the course of the business and require different KPI to
          support further decisions.


       • The Database Schema Design

       The next stage of the process is to take the requirements identified by the users
         and design a schema that will support the queries. By taking the questions
         identified by users and looking at the data available on the existing systems a
         mapping can be produced. This will inevitably show a mismatch between the
         requirement and the information available. Decisions then have to be made to
         either supplement this from external sources, modify internal sources to
         provide such information, or omit the information until a later iteration.

          The inevitable consequence of this approach is that one, or a number of, ‘star
          schemas’ will be derived. These are schemas that have a large central (or
          Fact) table containing the KPI and foreign keys to a number of smaller
          dimension tables. Each dimension table will have a relatively low number of
          rows but a large number of (highly indexed) columns that are to some extent
          de-normalised. A user may see a table called TIME that has columns ‘Day of
          Year’, ‘Day of Week’, ‘Day of Month’, ‘Month’, ‘Week Number’, ‘Year’ and
          ‘Julian Date’. For five years of information this will only contain 1825
          (5*365) rows. The primary key will be ‘Julian Date’. (A sketch of such a
          schema, and of a query against it, is given at the end of this list.)

          This dimension will be one of perhaps eight that relate to the large central
          table. If each of these dimensions contains 2000 rows, then the central table
          could have up to 2.56 × 10^26 (2000^8) rows. In practice there is a high level
          of sparsity (having no sales on Saturdays or Sundays would reduce the TIME
          dimension by nearly 30% alone) and it is therefore practical to build systems
          to support the design.

          Each ‘star schema’ can be considered a Data Mart for one or more
          departments. These star schemas may also contain pre-calculated summaries,
          averages, etc. The schema design must also include Meta Data, or data about
          data. This will allow users to interrogate the database to discover where a
          particular field is and, if appropriate, how it is derived (for example how the
          gross profit was calculated, which system provided the stock levels, etc.). The
          Meta Data can also be used to enforce security on the system by determining
          what a user can or cannot see.


       • The Technical Architecture

          This is the substantive part of the paper and is covered in detail below. At this
          stage it is sufficient to say that the technical architecture is the design of a
          practical method to build, maintain and secure the Data Warehouse. As the
          system is rolled out it will grow in importance and will become mission
          critical. Whilst not required on a seven by twenty-four basis, it will need to
          be available to run very large, and possibly long-running, queries. Other
          issues will include backup and recovery, response times, accuracy and
          presentation.

       • Project Management

       As has already been stated, the building of a Data Warehouse will be both an
       inter-departmental and an iterative process. It is necessary for the process to
       have complete buy-in from management and strong project management to
       maintain the momentum over the long term. Not only does the project manager
       have to deal with the initial implementation, but also build the necessary
       infrastructure to keep the project rolling over the longer term. A failure to
       achieve this is more often the cause of the collapse of a Data Warehouse project
       than the technical hurdles encountered along the way.
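
To make this concrete before moving on, the star schema described under The Database
Schema Design, and the natural language question given under The Business Analysis, might
look as follows in Oracle SQL. This is only a minimal sketch: the table and column names
(TIME_DIM, REGION_DIM, SALES_FACT and so on) are invented for illustration and are
not taken from any particular system.

    -- TIME dimension: one row per day, keyed on the Julian date.
    CREATE TABLE time_dim (
        julian_date   NUMBER       PRIMARY KEY,
        day_of_year   NUMBER,
        day_of_week   VARCHAR2(9),
        day_of_month  NUMBER,
        week_number   NUMBER,
        month_name    VARCHAR2(9),
        year_number   NUMBER
    );

    -- REGION dimension: a small, de-normalised table.
    CREATE TABLE region_dim (
        region_id     NUMBER       PRIMARY KEY,
        region_name   VARCHAR2(40),
        country       VARCHAR2(40)
    );

    -- Central fact table: KPI columns plus foreign keys to the dimensions.
    CREATE TABLE sales_fact (
        julian_date   NUMBER REFERENCES time_dim (julian_date),
        region_id     NUMBER REFERENCES region_dim (region_id),
        customer_id   NUMBER,
        sales_value   NUMBER(12,2)
    );

    -- "What are the total sales and average sales by region over the past two years?"
    SELECT r.region_name,
           SUM(f.sales_value) AS total_sales,
           AVG(f.sales_value) AS average_sales
    FROM   sales_fact f, time_dim t, region_dim r
    WHERE  f.julian_date = t.julian_date
    AND    f.region_id   = r.region_id
    AND    t.year_number >= 1994        -- e.g. the past two years
    GROUP BY r.region_name;

In practice each dimension would be wider and heavily indexed, and the fact table would
carry several KPI, but the shape of the schema and of the query remains the same.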

The Technical Architecture
[Figure: Data Warehouse technical architecture diagram]

Data Acquisition

As can be seen at the bottom of the diagram there are a number of different types of
source system. The first problem is not only to identify the sources, but to map and
extract the data that you require from each of the systems. This may require complex
mappings to obtain all the information.

Data acquisition is made up of a number of components, each of which is required to
ensure the successful loading of data. In some cases (such as data loading via
SQL*Net) a number of these components are performed together.

               • Extraction
                 This is the physical process of extracting the data from the source
                 system, either to flat files, or tapes.

               • Transformation
                 This is the action required to change codings into common formats. A
                 common problem is that the support desk customer id may be different
                 from that used by the finance or marketing database; in order for the
                 data to be accessible by all it has to be converted into a common
                 format. (A sketch of such a mapping is given after this list.)

               • Collation
                 Once a file has been extracted and transformed it can be collated with
                 other files. It may be advantageous at this stage to pre-sort some of
                 the data.

               • Migration
                 This is the moving of the data from the source system to the Data
                 Warehouse. This can take place either via a network or tape loads.
                 Networking issues arise if the volumes to be moved cannot be
                 supported by the available bandwidth. If tapes are used then there are
                 issues of media compatibility and reliability, and of the ability of
                 operators to consistently load the tapes into the system in the correct
                 order.


               • Loading
                 Once obtained the data must be loaded into the system. This may
                 require the use of Oracle SQL*Loader. Consideration needs to be
                 given to whether the parallel and/or direct load options can be
                 supported, with the requirement to re-build the indexes afterwards.
                 Another consideration is whether users will be on the system whilst
                 the load takes place and, where aggregates have been used, how they
                 are re-calculated after loading.


               • Exception Processing
                 Despite the best efforts of the IT department there will, after loading, be
                 exceptions. Examples of this range from the simple (a new customer
                 key was not inserted and therefore all transactions associated with that
                 customer have not yet been loaded) to the complex (a particular
                 branch office was unable to generate their load file, but all other
                 branches have. How are the gaps accounted for?). This process must
                 handle both the expected and the unexpected.

               • Quality Assurance
                 Once the data has been loaded it is necessary to perform some form of
                 Quality Assurance on the data. The load and exception processing
                 may have all succeeded, but how good is the data? Some key
                 indicators need to be checked after each load to ensure that they are
                 within the expected norms (e.g. sales in Central London did not rise by
                 700% overnight). This can normally be achieved by setting high and
                 low bounds on KPI; the exceptions are then manually checked to see
                 if they are valid and, if so, the bounds are changed. If not, the changes
                 either need to be fixed or rolled back. (A check of this kind is
                 sketched after this list.)

               • Publication
                 After the completion of the Quality Assurance end users need to be
                 notified of the state of the database. This can be achieved in a number
                 of ways from a mail message, message of the day, or an automatic
                 initial query that returns the status of the system. These messages
                 should range from the fact that the Data Warehouse is either available
                 or unavailable to the date of the last load and the percentage of feeds
                 used (e.g. if one branch of twenty failed to load and a statistical
                 process was used to populate the data, then the data might be described
                 as available with 95% of source information included). This information
                 is important to the users as it may affect the way in which they use the
                 outputs from the system.
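
To illustrate the transformation step described above, the sketch below re-codes a support
desk extract onto a common warehouse customer id using a cross-reference table. The staging
and cross-reference tables (SUPPORT_STAGE, CUSTOMER_XREF, COLLATED_STAGE)
are hypothetical names used only for this example.

    -- Hypothetical staging table holding a support desk extract.
    CREATE TABLE support_stage (
        support_cust_id VARCHAR2(30),
        call_date       DATE,
        call_duration   NUMBER
    );

    -- Cross-reference table mapping each source system's local customer id
    -- onto the common warehouse customer id.
    CREATE TABLE customer_xref (
        source_system     VARCHAR2(20),
        source_cust_id    VARCHAR2(30),
        warehouse_cust_id NUMBER,
        PRIMARY KEY (source_system, source_cust_id)
    );

    -- Target table for the re-coded rows, ready to be collated with other feeds.
    CREATE TABLE collated_stage (
        warehouse_cust_id NUMBER,
        call_date         DATE,
        call_duration     NUMBER
    );

    -- Re-code the extract into the common customer id.
    INSERT INTO collated_stage (warehouse_cust_id, call_date, call_duration)
    SELECT x.warehouse_cust_id,
           s.call_date,
           s.call_duration
    FROM   support_stage s, customer_xref x
    WHERE  x.source_system  = 'SUPPORT_DESK'
    AND    x.source_cust_id = s.support_cust_id;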
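
Similarly, the quality assurance bounds check might, as a minimal sketch, be driven by a table
of expected high and low values per region. The KPI_BOUNDS table is hypothetical;
SALES_FACT is the fact table sketched earlier in the paper.

    -- Hypothetical table of expected bounds for the sales KPI in each region.
    CREATE TABLE kpi_bounds (
        region_id  NUMBER,
        low_bound  NUMBER,
        high_bound NUMBER
    );

    -- Flag any region whose sales in the latest load fall outside its bounds, so
    -- that they can be manually checked and either accepted (and the bounds
    -- widened) or fixed / rolled back.
    SELECT f.region_id,
           SUM(f.sales_value) AS loaded_sales,
           b.low_bound,
           b.high_bound
    FROM   sales_fact f, kpi_bounds b
    WHERE  f.region_id   = b.region_id
    AND    f.julian_date = (SELECT MAX(julian_date) FROM sales_fact)
    GROUP BY f.region_id, b.low_bound, b.high_bound
    HAVING SUM(f.sales_value) NOT BETWEEN b.low_bound AND b.high_bound;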

Data acquisition is performed in two stages. The first is the initial data load, which can
be effectively managed either by a series of bespoke programs that directly load data
into the database (for example PL/SQL routines that pull data from other Oracle
databases, or from databases that are accessible via SQL*Net and Oracle Gateways), or
by one of the growing number of tools on the market designed for this purpose. These
include Platinum InfoPump, Prism Warehouse Manager and HP Intelligent Warehouse.
This exercise can be considered a one-off effort and, whilst it is desirable to re-use the
technology for the second stage, the differences in volumes and change control may not
make this possible.
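
For the Oracle-to-Oracle case, an initial load of this kind can be little more than a database
link and an INSERT ... SELECT over SQL*Net. The link, account and table names below are
assumptions made purely for illustration.

    -- Hypothetical database link to a source Oracle system reachable via SQL*Net.
    CREATE DATABASE LINK order_system
        CONNECT TO readonly IDENTIFIED BY readonly
        USING 'ORDERS';

    -- Staging table in the warehouse for the extracted rows.
    CREATE TABLE orders_stage (
        order_id    NUMBER,
        customer_id NUMBER,
        order_date  DATE,
        order_value NUMBER(12,2)
    );

    -- Pull the source rows across the link into the staging table.
    INSERT INTO orders_stage (order_id, customer_id, order_date, order_value)
    SELECT order_id, customer_id, order_date, order_value
    FROM   orders@order_system;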

The second stage is the day to day updating of the system. It is unlikely that a single
tool will be able to perform this task entirely. However, one should try to use a small
number of tools rather than build a complex set of bespoke code to perform the data
acquisition. There are two reasons for this:

       The first is the changing dynamics of the systems supplying the information. It
       is unlikely that the source systems remain static, indeed they are likely to have
       changed in some form before the end of the first phase of the Data Warehouse
       project. Therefore a tool that allows the Warehouse Manager to update the
       relationships between the source and target entities without having to re-craft
       large code segments is desirable.

       The second reason is the speed of development that is required. It is
       unacceptable to say that the Data Warehouse will not be updated for six weeks
       whilst the new COBOL extraction code is written and tested. Traditional
       software development techniques therefore do not meet the needs of the Data
       Warehouse.

The Transaction Repository

The Transaction Repository is the core of the system. This is where the data that has
been loaded is held. The data here is the lowest level that will be accessible by the
users. The volumes, as have already been described, can be very large. A 250 million
row table with 100 bytes per row will be in excess of 30Gb, and this is without any
indexes, rollback space or other associated database requirements.

Further drains on disk resources will include the use of mirroring (doubling the disk
space requirement) or RAID technologies (which can add up to 20% to the disk space
requirement).

With a large Data Warehouse this may start to cause a problem as the upper bound of
most large UNIX systems (for example
Sequent, Pyramid, HP and IBM) is around 3.5Tb of raw disk. This may also not all be
usable due to system design considerations such as channel bandwidth to the disk
array, or slots available in the system cabinets. The total amount of disk space may be
expanded by the use of external mass storage devices supplied by companies such as
EMC, DG (Clariion), etc.

An alternative strategy is to distribute the Data Warehouse over a number of systems.
One might build a system that has a Fact Table and its associated Dimensions on a
single system, with the next fact table residing on another system; if the schemas shared
any dimension tables then one would use Oracle Replication Server to ensure
consistency between the copies.
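
One way of keeping such a shared dimension consistent between the two systems is a
read-only snapshot, refreshed from the system that owns the master copy. The sketch below
uses Oracle's snapshot syntax with an assumed database link name (WAREHOUSE_ONE);
the same effect could equally be configured through Replication Server itself.

    -- On the second system: a read-only copy of the TIME dimension owned by the
    -- first system (reached over the hypothetical database link warehouse_one),
    -- refreshed automatically once a day.
    CREATE SNAPSHOT time_dim
        REFRESH COMPLETE
        START WITH SYSDATE
        NEXT SYSDATE + 1
    AS SELECT * FROM time_dim@warehouse_one;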

A third solution may be to build a completely distributed system and then use a tool
such as MicroStrategy DSS Server or HP Intelligent Warehouse to co-ordinate the
query. This provides users with a single entry point, but allows the back-end power of a
distributed system.

It is important to ensure that, in the rush to build a disk system large enough to hold all
the data, one does not preclude processing power. Often SMP architectures can take
either processor or disk controller cards up to a combined limit, whilst MPP
architectures may need to dedicate nodes to managing the virtual disk array presented to
the other nodes. Finally it is important to realise that where clustering technologies and
MPP systems are utilised there is an overhead involved in returning information from
the nodes to the processor that spawned the parallel query. This means that simply
adding additional nodes does not deliver linear growth in system capacity.

The performance of the data warehouse can now also be dramatically improved by the
use of bit-mapped and hash indexes, pre-joined tables and parallel query options. These
technologies, whilst very powerful, are not panaceas and should only be used in a
controlled fashion within the Data Warehouse.
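
As a brief illustration of these options, using the fact table sketched earlier (the index name
and the degree of parallelism are arbitrary choices for the example):

    -- Bit-mapped index on a low-cardinality foreign key column of the fact table.
    CREATE BITMAP INDEX sales_fact_region_bix ON sales_fact (region_id);

    -- Ask the optimiser to scan the fact table in parallel for a large query.
    SELECT /*+ FULL(f) PARALLEL(f, 8) */
           SUM(f.sales_value)
    FROM   sales_fact f;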

Data Aggregation

It is possible to improve the performance of the system by the use of Data Aggregation.
This allows often-used bands of data (for example quarterly figures) to be pre-calculated
and stored in the database.
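
As a sketch, such an aggregate can simply be built as a summary table from the fact table and
TIME dimension introduced earlier; a quarterly table would be built in exactly the same way,
given a quarter column in the TIME dimension.

    -- Pre-calculated monthly sales held alongside the detail data.
    CREATE TABLE sales_by_month AS
    SELECT t.year_number,
           t.month_name,
           f.region_id,
           SUM(f.sales_value) AS total_sales
    FROM   sales_fact f, time_dim t
    WHERE  f.julian_date = t.julian_date
    GROUP BY t.year_number, t.month_name, f.region_id;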

The level of aggregation may also be matched to the level of the user. For example a
CEO will want to have a high level view (traditionally associated with EIS systems) and
to drill down into a lower level of data (traditional Decision Support Systems).


However the CEO’s DSS system may be the entry point for the managers who work
under him and they drill down from the DSS level into the transaction repository to
track individual items or trends on a lower scale.

The use of aggregates, whilst improving overall speed, may in itself occupy large
amounts of disk space. For example a schema with three dimensions (Region 50 rows,
Customer 10,000 rows and Time 365 rows) will have a fact table of total sales with up
to 182.5 million (50*10,000*365) rows. If data is now stored not only for each day, but
also for 52 weeks, 12 months and 4 quarters, the total size rises to 216.5 million
(50*10,000*{365+52+12+4}) rows, a growth of nearly 20% in the total database size.
This does not take into account any aggregations of the other two dimensions that will
also cause large amounts of growth in the database size.

It is therefore possible to produce a system that contains Data Marts, suitable for use by
front end tools, that query against a smaller database made up of the required aggregates
and submit jobs in batch to the transaction repository that run over a much longer time
on a separate system. This allows users to get good local response for a large proportion
of their queries and gives them an understanding that certain types of highly detailed
query will result in batch jobs.

If queries are monitored then it is possible to change the aggregation strategy over a
period of time. This is because, once the system is implemented, the users will start to
manipulate the data in ways that were not envisaged in the original design.

Meta Data & Security

The value of the Data Warehouse is defined by the ability of end users to obtain
business benefit from the information they glean from the system. This requires an
understanding of what data is there and, more importantly, how it was derived. Different
departments, even different individuals within a department, may understand the
information in different ways. When the system returns an answer that is unexpected the
user must be able not only to view the results but to pursue the unexpected element until
they are satisfied with its derivation.

The Meta Data also allows a way for new and interesting perspectives to be obtained
on data that is in the system. For example an insurance underwriter may be able to
combine factors such as location, weather and car type as a basis for new underwriting
decisions, whereas historically the information, whilst held within the organisation, may
not have been accessible from a single point and required too much effort to collate.

The value of this type of cross-functional fertilisation raises the problem of security.
The Data Warehouse is at its best when queries range across the system bringing back a
new perspective. However there are internal problems (for example sales quotas
between different departments being confidential) and external problems (for example
key queries being run by an employee who then leaves to join a competitor).

There is no easy answer to this, but defining access rights against the Meta Data does
enable some level of security to be introduced. It will be a matter of balance between
the needs of the company as a whole and the risks of allowing such broad access to
staff. The outcome of this challenge will depend on the culture of the organisation and
in particular the co-operation of end user managers.
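
As a minimal sketch of this idea, a Meta Data catalogue can record where each field comes
from, how it is derived and who may see it, and a restricted view can be granted in place of
the underlying table. All of the names below (META_CATALOGUE, SALES_READER,
UK_SALES_FACT) are illustrative assumptions; SALES_FACT and REGION_DIM are the
tables sketched earlier.

    -- Hypothetical Meta Data catalogue: source, derivation and access rights
    -- for each warehouse column.
    CREATE TABLE meta_catalogue (
        table_name    VARCHAR2(30),
        column_name   VARCHAR2(30),
        source_system VARCHAR2(30),
        derivation    VARCHAR2(200),
        required_role VARCHAR2(30)
    );

    -- A role and a restricted view granted in place of the underlying fact table.
    CREATE ROLE sales_reader;

    CREATE VIEW uk_sales_fact AS
        SELECT f.julian_date, f.region_id, f.customer_id, f.sales_value
        FROM   sales_fact f, region_dim r
        WHERE  f.region_id = r.region_id
        AND    r.country   = 'United Kingdom';

    GRANT SELECT ON uk_sales_fact TO sales_reader;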

Middle-ware and Presentation

The wide variety of tools on the market makes it almost impossible to make any
sensible comments in such a small space. What is clear is that there are a number of
different types of query tool. These can be classified into Summary Query, Trawling (or
Mining), Simple Data Access and Data Manipulation. Some of these tools will already
exist in the organisation (spreadsheets are common, and in larger organisations SAS
may well be present). Added to this is the range of user requirements about how the
data should be presented (for example graphically, as text with dimensions, etc.).

The only way to choose a tool is to build up a candidate list from current market
literature (the Internet and magazines provide countless references) and then draw up a
shortlist based on the users' stated requirements and some more traditional constraints
such as cost and support. Most vendors are more than happy to provide evaluation and
trial copies which end users can then evaluate for themselves. It is certain that
individual preference and differing requirements will mean that no one tool will suffice,
and given the cost of PC tools in general one can afford to allow a number to be tried
and accepted or rejected.

The choice of middle-ware is somewhat dependent on the final choice of tool; however,
there are two clear alternatives within the Oracle arena: SQL*Net and ODBC. Tools
such as HP Intelligent Warehouse will also provide their own connectivity that routes
the query through a query manager or co-ordinator. The choice of middle-ware should
be limited to one or two, forcing the rejection of front end tools that do not meet this
requirement.


Conclusion

The Data Warehouse is an end user tool, which the IT department must build and
support. It is a dynamic, iterative long-term project that requires different techniques
from those traditionally employed by the information systems professionals.

Staff are required to work in cross-functional teams that will discover business critical
facts that may be the key to the success of your employer (and consequently your own)
in the longer term.

The scale of the infrastructure required to support such a venture cannot, and should
not, be underestimated. It is certain that the Data Warehouse will come to involve larger
volumes of data and new querying techniques than were envisaged at the initial design
phase. It is therefore critical that the system is built with a flexible and scalable
technical architecture designed to meet the unexpected.

				