Document Sample
Architecture Powered By Docstoc
					Chapter 2: Data Warehousing Architecture

Q. What is architecture?

   •   Architecture is the combination of the science and the art of designing and
       constructing physical structures. In an information system, the architecture helps
       better communication and planning, increase the flexibility and improve learning and
       facilitate learning. The architecture of DW and the blueprint that will drive its
       construction are critical to the success or the failure of the program and its projects.

Q. What is the value of architecture?

In an information system, the architecture adds value to the system in the same way the
blueprint for a construction project.

   •   Communication: The architecture plan is a good communication tool at many levels. It
       can function as a communication tool within the team and with other information
       system teams. This will also providing them with a sense where they fir in the process
       and what they need to accomplish.

   •   Planning: The architecture brings all the details of the project in one place and shows
       how they fit in and provides a cross check for the project plan. The architecture also
       uncovers the technical requirements and dependencies that do not come out as part of
       planning process.

   •   Flexibility and Maintenance: Creating architecture is really about anticipating as many
       of the issues as possible and building a system that can handle those issues as a matter
       of course, rather than they become problems. This makes the data warehouse more
       flexible and easier to maintain.

   •   Learning: The architecture plays an important role as documentation for the system. It
       can help new members of the tem to get up to speed more quickly on the components,
       contents and connections. Building a data warehouse is has a set of standardized
       procedures and it cannot depend on personal beliefs and myths.

   •   Productivity and Reuse: Since we can understand the warehouse processes and
       database contents more quickly, it becomes easier for a developer easier for a
       developer to reuse existing processes than to build from scratch. Building a new data
       source is easier if you can use the generic load utilities and work from existing

Q. Explain the global warehouse architecture:

       A global data warehouse is considered one that will support all, or a large part, of the
       corporation with a high degree of data access and usage across departments or lines-
       of-business. That is, it is designed and constructed based on the needs of the
       enterprise as a whole. It could be considered to be a common repository for decision
       support data that is available across the entire organization, or a large subset thereof.

       The term global is used here to reflect the scope of data access and usage, not the
       physical structure. The global data warehouse can be physically centralized or
       physically distributed throughout the organization. A physically centralized global
       warehouse is to be used by the entire organization that resides in a single location. A
       distributed global warehouse is also to be used by the entire organization, but it
       distributes the data across multiple physical locations within the organization and is
       managed by the IS department.

       Figureshows the two ways that a global warehouse can be implemented. In the top
       part of the figure, you see that the data warehouse is distributed across three physical
       locations. In the bottom part of the figure, the data warehouse resides in a single,
       centralized location.

Independent data mart:

Independent data mart architecture implies stand-alone data marts that are controlled by a
particular workgroup, department, or line of business and are built solely to meet their
needs. There may, in fact, not even be any connectivity with data marts in other workgroups,
departments, or lines of business. Data for these data marts may be generated internally.
The top part of Figure depicts the independent data mart structure. Although the figure
depicts the data coming from operational or external data sources, it could also come from a
global data warehouse if one exists.
Interconnected data mart:

Interconnected data mart architecture is basically a distributed implementation. Although
separate data marts are implemented in a particular workgroup, department, or line of
business, they can be integrated, or interconnected, to provide a more enterprise-wide or
corporate-wide view of the data. In fact, at the highest level of integration, they can become
the global data warehouse. Therefore, end users in one department can access and use the
data on a data mart in another department. This architecture is depicted in the bottom of
Figure. Although the figure depicts the data coming from operational or external data
sources, it could also come from a global data warehouse if one exists.
Comprehensive data warehousing architecture:

        Data Sources                        ETL Software               Data Stores         Data Analysis           Users
                                                                                           Tools and
       Transaction Data                        S
                     IBM                       A
         Prod                                  G
                                               N                                               SQL
                                               G                                                                   ANALYSTS
         Mkt         IMS
                                               R                                               Cognos
                                Ascential      E
         HR          VSAM                                                     Data Marts
                                               A                  Teradata                     SAS
                                                                  IBM                                              MANAGERS
                                               O                                Finance
                     Oracle                    P
         Fin                                   E     Load
                                               R                    Data       Essbase       Queries,Reporting,
                                 Extract       A                  Warehouse                  DSS/EIS,
                                               T    Informatica                              Data Mining
         Acctg       Sybase                                                    Marketing
                                               O                                                                  EXECUTIVES
   Other Internal Data                         N                                              Micro Strategy
                                               A                    Meta
                                               L                    Data         Sales
         ERP         SAP         Sagent
                                               A                               Microsoft        Siebel
   Web Data                                    T
                                               A                                                Business       OPERATIONAL
                     Informix                                                                   Objects        PERSONNEL
       Clickstream                             S
                                SAS            O
   External Data                               R                                                Web
                                               E                                                Browser
       Demographic   Harte-
                     Hanks                  Clean/Scrub                                                           CUSTOMERS/
                                            Transform                                                             SUPPLIERS

This provides an example of a comprehensive data warehousing architecture. It illustrates
the large number of possible source systems, ETL processes, sample products, the data
warehouse, dependent data marts, meta data, data access tools and applications, and various
kinds of users. Inmon calls a comprehensive architecture like this one a Corporate
Information Factory.

Typical Components of a Data Warehouse Architecture

   •        Operational data sources

   •           Operational datastore(ODS)

   •        Load Manager

   •        Warehouse Manager

   •        Query Manager
   •   Detailed Data

   •   Lightly & Highly Summarized Data

   •   Archive & Back up Data

   •   Meta Data

   •   End-User Access Tools

Operational data

   •   Without source system, there would be no data

   •   The data sources for the data warehouse are supplied as follow:

          –   Operational data held in network databases

          –   Departmental data held in file systems

          –   Private data held on workstaions and private serves and external systems such
              as the Internet, commercially available DB, or DB assoicated with and
              organization’s suppliers or customers

Operational datastore(ODS)

   •   Is a repository of current and integrated operational data used for analysis. It is often
       structured and supplied with data in the same way as the data warehouse, but may in
       fact simply act as a staging area for data to be moved into the warehouse

          –   ODS objectives: to integrate information from day-to-day systems and allow
              operational lookup to relieve day-to-day systems of reporting and current-data
              analysis demands

          –   ODS can be helpful step towards building a data warehouse because ODS can
              supply data that has been already extracted from the source systems and

Load Manager

   •   Called the frontend component

   •   Performs all the operations associated with the extraction and loading of data into the

   •   These operations include simple transformations of the data to prepare the data for
       entry into the warehouse

   •   The data is extracted from the operational systems directly or from the operational
       data store (more common) and then to the data warehouse
   •   Size and complexity will vary between data warehouses and may be constructed using
       a combination of vendor data loading tools and custom-built programs.

Warehouse Manager

   •   Performs all the operations associated with the management of the data in the
       warehouse as follows:

          –   Analysis of data to ensure consistency

          –   Transformation and merging of source data from temporary storage into the
              data warehouse tables

          –   Creation of indexes and views

          –   Backing-up and archiving data

   •   Constructed using vendor data management tools and custom-built programs.

   •   Generates query profiles to determine which indexes and aggregations are appropriate

Query Manager

   •   Called backend component

   •   Performs all the operations associated with the management of user queries

          –   Directing queries to the appropriate tables and scheduling the execution of

   •   Constructed using vendor end-user access tools, data warehousing monitoring tools,
       database facilities and custom built programs

   •   Query manager complexity depends on the end-user access tools and database

Detailed Data

   •   Stores all the detailed data in the database schema

   •   On a regular basis, detailed data is added to the warehouse to supplement the
       aggregated data

Lightly and Hightly Summarized Data
   •   Stores all the pre-defined lightly and highly aggregated data generated by the
       warehouse manager

   •   Transient as it will be subject to change on an on-going basis in order to respond to
       changing query profiles

   •   The purpose of summary information is to speed up the performance of queries -

   •   On the other hand, it removes the requirement to continually perform summary
       operations (such as sort or group by) in answering user queries

   •   The summarized data is updated continuously as new data is loaded into the

Archive/Backup Data

   •   Stores detailed and summarized data for the purposes of archiving and backup

   •   May be necessary to backup online summary data if this data is kept beyond the
       retention period for detailed data

   •   The data is transferred to storage archives such as magnetic tape or optical disk

Meta Data

   •   This area of the warehouse stores all the metadata definitions used by all the
       processes in the warehouse

   •   Meta-Data is used for a variety of purposes:

          –   Extraction and loading processes

                  •   Metadata is used to map data sources to a common view of information
                      within the warehouse

          –   Warehouse management process

                  •   Used to automate the production of summary tables

          –   Query management process

                  •   Used to direct a query to the most appropriate data source

   •   End-user access tools use metadata to understand how to build a query

End-user Access Tools

   •   Users interact with the warehouse using end-user access tools
   •     Can be categorized into five main groups

            –   Data reporting and query tools – (Query by Example –MS Access DBMS)

            –   Application development tools (application used to access major DBS –Oracle,

            –   Executive information system (EIS) tools (For sales, marketing and finance)

            –   Online analytical processing (OLAP) tools (Allow users to analyze the data using
                complex and multidimentional views-from multiple databases)

            –   Data mining tools (allow the discovery of new patterns and trend by mining a
                large amount of data using statistical, mathematical tools)

Data Warehousing: Data flows


   •     The processes associated with the extraction, cleansing, and loading of the data from
         the source systems into the data warehouse

   •     Cleaning include removing inconsistencies, adding missing fields, and cross-checking
         for data integrity
   •   Transformation include adding date/time stamp fields, summarizing detailed data,
       deriving new fields to store calculated data

   •   Extract the relevant data from multiple, heterogeneous, and external sources
       (commercial tools are used)

   •   Then mapped and loaded into the warehouse


   •   The process associated with adding value to the data in the warehouse through
       summarizing, packaging, and distribution of the data

   •   Summarizing the data works by choosing, projecting, joining, and grouping relational
       data into views that are more convenient and useful to the end users. Summarizing
       data goes beyond simple relational operations to involves sophistacated statistical
       analysis including identifying trends, clustering, and sampling the data

   •   Packeging the data involves converting the detailed or summarized information into
       more useful formats, such as spreadsheets, test documents, charts, other graphical
       presentations, private databases, and animation.

   •   Distribute the data in appropiate groups to increase its availability and accessibility


   •   The processes associated with archiving and backing-up of data in the warehouse

   •   Archiving the effectiveness and performace maintanance is achieved by transferring
       the older data of limited value to storage archivers such as magnetic tapes, optical
       disk or digital storage devices

   •   If the databases in a warehouse are very big, partitioning is a useful design option
       which enables the fragmentation of a table storing enournous number of records into
       smaller tables. Thus, preserving data warehouse performance

   •   The downflow of data includes the processes to ensure that the current state of the
       data warehouse can be rebuilt following data loss, or software/hardware failures.
       Archived data should be stored in a way that allows the re-establishement of the data
       in the warehouse when required


   •   Involves the process associated with making the data availabe to the end-users

   •   This involves two activities such as data accessing and delivering
   •   Data accessing is concerned with satisfying the end users’s requests for the data they
       need. The main problem here is the creation of an environment so that the users can
       effectively use the query tools to access the most appropiate data source.

   •   Delivering activity makes possible the information delivery to the user’s
       systems/workstations. This activity is referred to as a type of ’’publish-and-
       subscribe” process. Data warehouse publishes several ’business objects’ that are
       revised periodically by monitoring usage patterns. Users subcriber to the set of
       business objects that best meets their needs.

Implementation Choices

   •   Top Down Implementation

   •   Bottom Up Implementation

   •   A Combined Approach

Top Down Implementation

       A top down implementation requires more planning and design work to be completed
       at the beginning of the project. This brings with it the need to involve people from
       each of the workgroups, departments, or lines of business that will be participating in
       the data warehouse implementation. Decisions concerning data sources to be used,
       security, data structure, data quality, data standards, and an overall data model will
       typically need to be completed before actual implementation begins. The top down
       implementation can also imply more of a need for an enterprisewide or corporatewide
       data warehouse with a higher degree of cross workgroup, department, or line of
       business access to the data.

       This approach is depicted in the figure. As shown, with this approach, it is more
       typical to structure a global data warehouse. If data marts are included in the
       configuration, they are typically built afterward. And, they are more typically
       populated from the global data warehouse rather than directly from the operational or
       external data sources.
      A top down implementation can result in more consistent data definitions and the
      enforcement of business rules across the organization, from the beginning. However,
      the cost of the initial planning and design can be significant. It is a time-consuming
      process and can delay actual implementation, benefits, and return-on-investment. For
      example, it is difficult and time consuming to determine, and get agreement on, the
      data definitions and business rules among all the different workgroups, departments,
      and lines of business participating. The top down implementation approach can work
      well when there is a good centralized IS organization that is responsible for all
      hardware and other computer resources. Top down implementation will also be
      difficult to implement in organizations where the workgroup, department, or line of
      business has its own IS resources. They are typically unwilling to wait for a more global
      infrastructure to be put in place.

Bottom Up Implementation

      A bottom up implementation involves the planning and designing of data marts
      without waiting for a more global infrastructure to be put in place. This approach is
      more widely accepted today than the top down approach because immediate results
      from the data marts can be realized and used as justification for expanding to a more
      global implementation.

       In contrast to the top down approach, data marts can be built before, or in parallel
      with, a global data warehouse. And as the figure shows, data marts can be populated
      either from a global data warehouse or directly from the operational or external data

      The bottom up implementation approach has become the choice of many
      organizations, especially business management, because of the faster payback. It
      enables faster results because data marts have a less complex design than a global
      data warehouse. In addition, the initial implementation is usually less expensive in
      terms of hardware and other resources than deploying the global data warehouse.

      With careful planning, monitoring, and design guidelines, the date redundancy among
      the data marts can be minimized. Multiple data marts may bring with them an
      increased load on operational systems because more data extract operations are

A Combined Approach

      As we have seen, there are both positive and negative considerations when
      implementing with the top down and the bottom up approach. In many cases the best
      approach may be a combination of the two. As a first step simply identify the lines of
      business that will be participating. A high level view of the business processes and
      data areas of interest to them will provide the elements for a plan for implementation
      of the data marts.

      As data marts are implemented, develop a plan for how to handle the data elements
      that are needed by multiple data marts. This could be the start of a more global data
      warehouse structure or simply a common data store accessible by all the data marts.
      It some cases it may be appropriate to duplicate the data across multiple data marts.
      This is a trade-off decision between storage space, ease of access, and the impact of
      data redundancy along with the requirement to keep the data in the multiple data
      marts at the same level of consistency.
      There are many issues to be resolved in any data warehousing implementation. Using
      the combined approach can enable resolution of these issues as they are encountered,
      and in the smaller scope of a data mart rather than a global data warehouse. Careful
      monitoring of the implementation processes and management of the issues could
      result in gaining the best benefits of both implementation techniques.

Shared By: