Docstoc

Data_Warehouse

Document Sample
Data_Warehouse Powered By Docstoc
					         DataWarehouse
             Topics
   DATAWAREHOUSE DO’S & DON’TS

   THE IMPORTANCE OF DATA MOVEMENT IN YOUR
    DATAWAREHOUSE

   DATA STEWARDSHIP: THE KEY TO BUSINESS INVOLVEMENT IN
    DATAWAREHOUSING




                                    Correlated by:
                                    Nach Bernard - May 1, 2002
Data Warehouse Do's and Don'ts                                              Page 1 of 2

DO try to leverage existing investments in software and hardware. Too many companies
believe the vendor hype that they cannot accomplish their objective without purchasing the
latest and greatest software and hardware. My greatest success stories revolve around using
database software on the mainframe and using a combination of mainframe and PC-based
query and reporting tools. The customer does not need to purchase more items; they
already have trained personnel in the existing tools, and the existing tools work great!
Current mainframe tools usually are more mature and have processes in place for proper
backup and security. The moral here is look at what you have first before leaping.

DON'T involve negative forces in the project. A warehouse team needs customer-
focused people - - not people who have attitudes, who hate their jobs or who think they
are smarter than anyone else is. The project team should consist of open-minded business
and technical individuals that will work well as a team.

DO appoint a data trainer. This is extremely important if you want your warehouse used
correctly. This person would be from a business area, and hopefully already has a good
understanding of the company data. These people are usually found in Marketing
Information departments, who are used to developing reports for other departments. This
person or people will need to set class times as and provide structured training as well as
provide one on one support. Having this in place could make or break a data warehouse.

DON'T make it difficult for the user. After all, the warehouse is for the business user,
not IT. Data modeling for a warehouse is different that modeling operational systems.

There are data modelers who believe warehouse tables must be normalized to the 9th
degree, or they haven't done their job right (those people do not belong on the team).
Warehouse tables are for business users who in most cases are not familiar with relational
database technology and normalization. The moral here - Keep it simple! If the design
forces complex extract and load programming, then do it! Why are we here otherwise?

DO build a simple, easy to use data dictionary. This is really hard to mess up. As the
project starts, start documenting business definitions in a database. This way, if there is
no data dictionary available, you can provide one to your users through either client-
server screens or hardcopy. This always seems to be done as a last task or not done at all.
It is very important that this be started right away. As soon as your users start to use the
warehouse (or even test it), they will need to know what the data means.

DON'T design the warehouse for the wrong audience. You must involve the customer
from the start and during the whole process. The "customer" is ultimately the group of
people who will be using the warehouse on a daily basis. Designing a warehouse for an
audience that won't be using the warehouse will prove to be a costly mistake, which you
will find out as the project progresses. As mentioned above about purchasing software
                                                                           Page 2 of 2

and hardware for the sake of technology, don't try to build a warehouse because
management read about it in a magazine and thinks the company needs one. There needs
to be a definite business need that your warehouse will solve.

DO develop naming standards RIGHT AWAY to reduce the rework involved with
renaming tables and columns later. And while you are at it, make the names easy to use.
WHSE_CUSTOMER is much easier to understand hat ABC001_CUST_TB. Develop
standard abbreviations for words that will be too long to use. Every column must end
with a description of what it is, such as date, text, amount, quantity, flag, code, etc.

DON'T forget that the hardest part of data warehousing is maintaining it after it has been
built. The project team that builds it most likely will not be around to maintain it. It is
important to select the right people for the job. The group that will maintain it must be
customer service oriented, who will aspire for quick turnaround and insist on quality.
These people are extremely hard to find, but the right people can be trained.
The Importance of Data Movement in Your Data Warehouse                     Page 1 of 4

Data Warehouses have been a hot topic lately because they help an enterprise decipher its
information to make better decisions. There is no shortage of articles on many of the
interrelated components that go into a data warehouse. Technical journals are filled with
advice on key elements of these systems: acquisition of operational data; cleansing or
conversion of that data; DBMS on both the client and server; CPUs and storage resources
to be used; or data analysis.

But one area that has been under-emphasized, yet can have a dramatic impact on the
success of the overall system, is the network infrastructure that performs the extraction,
transport, and loading of data. Without this piece, an enterprise can only toy with the
concept of using information to make better business decisions.

Three Methods of Data Movement

The broad number of platforms and databases types serve to complicate any discussion
regarding an enterprise's data warehouse. Since the vast majority of the Fortune 500
companies have traditionally used mainframes for Data Acquisition, no real data sharing
method can omit the consideration of information movement between a mainframe and
an open system. Determining how to move this data is one of the key decisions necessary
for creating a useful data warehousing environment.

Currently, most organizations use one of three methods of data movement.
                              1)
                                   TCP/IP
              LAN                SNA Network                 LAN


                                 Disk Controller
                                                     2)



                      ESCON                     SCSI OR FC
                                    Partition

Mainframes                          Channel                        Open Systems
                          3)
       Three primary movement methods between mainframes and open systems

Method 1. Existing LAN Technology

Method 1 moves data using existing LAN technology, primarily TCP/IP and SNA. For
TCP/IP, the mainframe converts each IP packet to SNA, putting a significant strain on
available host CPU cycles. This is expensive and typically slow.
                                                                           Page 2 of 4

SNA runs efficiently on mainframes at speeds of 16 Mbits/second. Unfortunately, SNA
is proprietary and not widely used in the open systems market. Front end processors can
support TCP/IP and SNA networks, but the translation overhead between the protocols is
an issue.

The benefits LAN technologies offer for solving data sharing problems include:

          Use of existing LAN topology, typical in large enterprises, eliminates the need
           for new equipment
          Low up-front costs, allowing projects to begin quickly and economically

However, this methods has significant drawbacks:

          Long-term costs can grow, because mainframe CPU cycles are needed for data
           movement, especially with TCP/IP
          EBCDIC to ASCII data conversion is left for other layers to solve. Although
           this is not a difficult task, automation becomes increasingly difficult as
           administrators deal with this method's limited scope
          Data rates can be slow, less that 2 Mbytes/second to a mainframe in many
           cases
          Data movement competes with other needs in the network, which can further
           exacerbate network performance problems

Method 1 is viable for data warehousing where a relatively small amounts of data must be
shared or when the time to move data is not greatly restricted. Enterprises that have
evening time window for data movement could experience problems using LANs when
the amount of data exceeds 1 Gigabyte.

Method 2. "Pseudo-Shared" Environment

Method 2 uses a pseudo-shared environment. Some vendors attack the data movement
problem by trying to provide mixed platform access to a single stored file. For instance,
mainframe applications and open systems applications could both access and modify a
single file stored on the network. This approach is the best for data sharing, because only
one copy of a file is required on the network, regardless of the installed platforms.
Unfortunately, such a product doesn't exist in today's market.

What exists now is a partitioned disk environment, where one side talks to the mainframe
and the other talks to open systems. This is called resource sharing. Data sharing is
facilitated through the movement of information between two partitions, requiring more
disk space and creating contention for the drives and cache.
The benefits of Method 2 are:                                              Page 3 of 4

          Easy installation, reducing the cost of equipment setup
          Conceptually easy to understand, so the research time for equipment ROI is
           reduced
          Network bandwidth limitations are substituted for controller bandwidth on the
           storage device

Method 2 also has drawbacks:

          It's the most expensive solution. To control price, enterprises are resorting to
           a two-vendor strategy, which inherently leads to a non-enterprise solution
          Disk storage is integral to the solution. Therefore, the enterprise locks into a
           particular storage vendor, forcing acceptance of that vendor's technology and
           pricing models.
          It's faster that LANs, but slower than channel movement methods. Current
           data rates are approximately 5 MByte/second, which still leads to performance
           problems in larger data sharing environments
          There's a tendency toward using flat files for data movement, requiring the
           purchase of more storage and increasing the overall cost of implementation

Method 2 has distinct advantages over Method 1, such as a wide installation base and
faster data movement. The high cost, along with being locked into a single storage
provider, are disadvantages that make Method 3 an irresistible solution for enterprises.

Method 3. Existing Channel Technology

Method 3 uses existing channel technology as a transport architecture. This means
moving data across protocol paths specifically designed for fast, uncorrupted data
movement. Such protocols include ESCON, SCSI, and Fibre Channel. Data movement
technologies that use this medium enjoy significantly higher data rates.

The advantages of Method 3 are:

          High speed. ESCON moves data at 17MBytes/second. Multiple ESCONs can
           move data in parallel
          Scalability for additional users. Multiple data paths increase the aggregate
           data transfer rate, which is limited only by the number of paths available
          No commitment to any particular storage vendor
          Transfer speed can be maintained across a wide are link connected to a remote
           site
Method 3 drawbacks include:                                                Page 4 of 4

          The installation process can be complex, because one channel environment
           must be mapped to another
          More devices are needed and must be maintained

Large data warehousing installation need long-distance data sharing, high-speed data
movement, efficient use of current resources, and greater data accessibility. When
considering all of these, the channel movement method--Method 3--is superior. Its
exceptional speed, price-performance ratio, flexibility, scalability, storage independence,
and the versatility of constructing how and where data will be moved clearly advances
today's enterprise strategies.
                                                                             Page 1 of 3

Data Stewardship: The Key to Business Involvement in Data Warehousing

One of the distinguishing characteristics of successful data warehouse endeavors is the
involvement of the business community. It is important to keep the business constituents
involved throughout the process to avoid deleterious results arising from the business side
losing touch with the project of being unprepared or unwilling to accept their share of the
responsibility for the data warehouse. This article organizes the business community
roles into a data stewardship program and outlines ways to get the business community
involved in the data warehouse and acceding to their roles.

Effective stewardship appointment is the most important factor in the program. You need
to balance having too many stewards with having areas of assigned responsibility that are
too broad and won't be met. Generally, appointment of 1-3 stewards per subject area,
depending on the breadth of your subject areas, strikes the best balance. Be thorough.
The bottom line is to have a data steward for each element in the data warehouse.

Data Stewardship can comprise six main functions:

          Arbitrating the transformation rules
          Verifying the data after load
          Contribution of the business metadata
          Approving new users
          Supporting the user community on the data
          Participation on a corporate governance committee

Arbitrating the transformation rules

The data warehouse usually pulls together data from disparate sources. The goal with
transformation rules is to keep the data meaningful and consistent. In this process, it will
often be necessary to arbitrate differences of opinion and different interpretations of value
as to how the data will be represented in its new form. The data steward, in partnership
with IT, will be in the best position to make the call on the initial rules and subsequent
changes.

Verifying the data after load

Verifying the data after the data warehouse load involves confirming the data was loaded
and the transformation rules were properly applied. The verification becomes crucial
when operational systems are prone to change structure and content on short notice or
without notice to the data warehouse team. The sanity check can be followed up with
formally giving the approval to the greater user community for query and analysis of the
data.
                                                                            Page 2 of 3

Contribution of the business metadata

Metadata is comprised of technical and business components. I have experienced best
results when the data stewards contribute the business components such as business
definition and IT contributes the technical components such as data sources and
transformation rules applied. The data steward is in the best position to phrase the
business metadata in terms of importance to the business.

Approving new users

With detailed knowledge of the data sources, transformation rules, and uses of the data,
the data steward can broker requests for new usage of the data warehouse. The data
steward should develop expertise not only in the data but also in the workload limitations
of the data warehouse system for the area of stewardship. This combined knowledge
provides the data steward with the capacity to effectively approve new users and their
authority levels.

Supporting the user community on the data

This expertise comprises the majority of the user training and support topics that the data
steward should be made responsible for on a perpetual basis. User support can be
composed of training classes and ongoing phone (help-desk style) support to the user
community focused on how to functionally use the query tools and use the data. The
most effective presentation is a partnership where IT is responsible for the query tools and
the data steward is responsible for the data

Make it a requirement that new users to the data warehouse take this training before
receiving access. This training should have more of a business focus than a technical
focus. The query tools, though requiring explanation and a point of leadership in the
organization, should be made as easy-to-use as possible and therefore require less training
than the main attraction – the data model and the data itself.

Participation on a corporate governance committee

Once you have the data warehouse up and running, in addition to the ongoing minor
enhancements to the existing subject areas, there will be interest in major additions of
usage, subject areas, and data sources. If any of the following conditions exist, a
corporate governance committee will be an effective way to make discernment and
prioritization:
                                                                           Page 3 of 3
          Perceived return-on-investment is used as a driver for company efforts and
           therefore a forum is needed to confirm and refine these often subjective
           estimates
          There is a limited budget or people resources for data warehouse additions
          Expansion of data and data uses may strain the scalability of the environment
           and create the need for additional hardware and software

With representation from IT and the major business units, the committee’s charter is to
provide corporate governance to strategic data warehouse decisions such as new subject
areas, new data sources, new business problems solved with the data warehouse, and new
expenditures for processors and storage. The committee will be most effective when
comprised of current and potential future data stewards representing the business together
with key IT leadership personnel.

From design through implementation, no phase should be attempted without the
commitment of the business to contribute. An effectual data stewardship program
actively and thoroughly assigns specific responsibilities with clear outcomes across the
six functions. Advantages exist for project teams willing to integrate the key business
constituents of the data warehouse through implementation of a complete program.

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:8
posted:12/10/2011
language:
pages:10