DataWarehouse Topics DATAWAREHOUSE DO’S & DON’TS THE IMPORTANCE OF DATA MOVEMENT IN YOUR DATAWAREHOUSE DATA STEWARDSHIP: THE KEY TO BUSINESS INVOLVEMENT IN DATAWAREHOUSING Correlated by: Nach Bernard - May 1, 2002 Data Warehouse Do's and Don'ts Page 1 of 2 DO try to leverage existing investments in software and hardware. Too many companies believe the vendor hype that they cannot accomplish their objective without purchasing the latest and greatest software and hardware. My greatest success stories revolve around using database software on the mainframe and using a combination of mainframe and PC-based query and reporting tools. The customer does not need to purchase more items; they already have trained personnel in the existing tools, and the existing tools work great! Current mainframe tools usually are more mature and have processes in place for proper backup and security. The moral here is look at what you have first before leaping. DON'T involve negative forces in the project. A warehouse team needs customer- focused people - - not people who have attitudes, who hate their jobs or who think they are smarter than anyone else is. The project team should consist of open-minded business and technical individuals that will work well as a team. DO appoint a data trainer. This is extremely important if you want your warehouse used correctly. This person would be from a business area, and hopefully already has a good understanding of the company data. These people are usually found in Marketing Information departments, who are used to developing reports for other departments. This person or people will need to set class times as and provide structured training as well as provide one on one support. Having this in place could make or break a data warehouse. DON'T make it difficult for the user. After all, the warehouse is for the business user, not IT. Data modeling for a warehouse is different that modeling operational systems. There are data modelers who believe warehouse tables must be normalized to the 9th degree, or they haven't done their job right (those people do not belong on the team). Warehouse tables are for business users who in most cases are not familiar with relational database technology and normalization. The moral here - Keep it simple! If the design forces complex extract and load programming, then do it! Why are we here otherwise? DO build a simple, easy to use data dictionary. This is really hard to mess up. As the project starts, start documenting business definitions in a database. This way, if there is no data dictionary available, you can provide one to your users through either client- server screens or hardcopy. This always seems to be done as a last task or not done at all. It is very important that this be started right away. As soon as your users start to use the warehouse (or even test it), they will need to know what the data means. DON'T design the warehouse for the wrong audience. You must involve the customer from the start and during the whole process. The "customer" is ultimately the group of people who will be using the warehouse on a daily basis. Designing a warehouse for an audience that won't be using the warehouse will prove to be a costly mistake, which you will find out as the project progresses. As mentioned above about purchasing software Page 2 of 2 and hardware for the sake of technology, don't try to build a warehouse because management read about it in a magazine and thinks the company needs one. There needs to be a definite business need that your warehouse will solve. DO develop naming standards RIGHT AWAY to reduce the rework involved with renaming tables and columns later. And while you are at it, make the names easy to use. WHSE_CUSTOMER is much easier to understand hat ABC001_CUST_TB. Develop standard abbreviations for words that will be too long to use. Every column must end with a description of what it is, such as date, text, amount, quantity, flag, code, etc. DON'T forget that the hardest part of data warehousing is maintaining it after it has been built. The project team that builds it most likely will not be around to maintain it. It is important to select the right people for the job. The group that will maintain it must be customer service oriented, who will aspire for quick turnaround and insist on quality. These people are extremely hard to find, but the right people can be trained. The Importance of Data Movement in Your Data Warehouse Page 1 of 4 Data Warehouses have been a hot topic lately because they help an enterprise decipher its information to make better decisions. There is no shortage of articles on many of the interrelated components that go into a data warehouse. Technical journals are filled with advice on key elements of these systems: acquisition of operational data; cleansing or conversion of that data; DBMS on both the client and server; CPUs and storage resources to be used; or data analysis. But one area that has been under-emphasized, yet can have a dramatic impact on the success of the overall system, is the network infrastructure that performs the extraction, transport, and loading of data. Without this piece, an enterprise can only toy with the concept of using information to make better business decisions. Three Methods of Data Movement The broad number of platforms and databases types serve to complicate any discussion regarding an enterprise's data warehouse. Since the vast majority of the Fortune 500 companies have traditionally used mainframes for Data Acquisition, no real data sharing method can omit the consideration of information movement between a mainframe and an open system. Determining how to move this data is one of the key decisions necessary for creating a useful data warehousing environment. Currently, most organizations use one of three methods of data movement. 1) TCP/IP LAN SNA Network LAN Disk Controller 2) ESCON SCSI OR FC Partition Mainframes Channel Open Systems 3) Three primary movement methods between mainframes and open systems Method 1. Existing LAN Technology Method 1 moves data using existing LAN technology, primarily TCP/IP and SNA. For TCP/IP, the mainframe converts each IP packet to SNA, putting a significant strain on available host CPU cycles. This is expensive and typically slow. Page 2 of 4 SNA runs efficiently on mainframes at speeds of 16 Mbits/second. Unfortunately, SNA is proprietary and not widely used in the open systems market. Front end processors can support TCP/IP and SNA networks, but the translation overhead between the protocols is an issue. The benefits LAN technologies offer for solving data sharing problems include: Use of existing LAN topology, typical in large enterprises, eliminates the need for new equipment Low up-front costs, allowing projects to begin quickly and economically However, this methods has significant drawbacks: Long-term costs can grow, because mainframe CPU cycles are needed for data movement, especially with TCP/IP EBCDIC to ASCII data conversion is left for other layers to solve. Although this is not a difficult task, automation becomes increasingly difficult as administrators deal with this method's limited scope Data rates can be slow, less that 2 Mbytes/second to a mainframe in many cases Data movement competes with other needs in the network, which can further exacerbate network performance problems Method 1 is viable for data warehousing where a relatively small amounts of data must be shared or when the time to move data is not greatly restricted. Enterprises that have evening time window for data movement could experience problems using LANs when the amount of data exceeds 1 Gigabyte. Method 2. "Pseudo-Shared" Environment Method 2 uses a pseudo-shared environment. Some vendors attack the data movement problem by trying to provide mixed platform access to a single stored file. For instance, mainframe applications and open systems applications could both access and modify a single file stored on the network. This approach is the best for data sharing, because only one copy of a file is required on the network, regardless of the installed platforms. Unfortunately, such a product doesn't exist in today's market. What exists now is a partitioned disk environment, where one side talks to the mainframe and the other talks to open systems. This is called resource sharing. Data sharing is facilitated through the movement of information between two partitions, requiring more disk space and creating contention for the drives and cache. The benefits of Method 2 are: Page 3 of 4 Easy installation, reducing the cost of equipment setup Conceptually easy to understand, so the research time for equipment ROI is reduced Network bandwidth limitations are substituted for controller bandwidth on the storage device Method 2 also has drawbacks: It's the most expensive solution. To control price, enterprises are resorting to a two-vendor strategy, which inherently leads to a non-enterprise solution Disk storage is integral to the solution. Therefore, the enterprise locks into a particular storage vendor, forcing acceptance of that vendor's technology and pricing models. It's faster that LANs, but slower than channel movement methods. Current data rates are approximately 5 MByte/second, which still leads to performance problems in larger data sharing environments There's a tendency toward using flat files for data movement, requiring the purchase of more storage and increasing the overall cost of implementation Method 2 has distinct advantages over Method 1, such as a wide installation base and faster data movement. The high cost, along with being locked into a single storage provider, are disadvantages that make Method 3 an irresistible solution for enterprises. Method 3. Existing Channel Technology Method 3 uses existing channel technology as a transport architecture. This means moving data across protocol paths specifically designed for fast, uncorrupted data movement. Such protocols include ESCON, SCSI, and Fibre Channel. Data movement technologies that use this medium enjoy significantly higher data rates. The advantages of Method 3 are: High speed. ESCON moves data at 17MBytes/second. Multiple ESCONs can move data in parallel Scalability for additional users. Multiple data paths increase the aggregate data transfer rate, which is limited only by the number of paths available No commitment to any particular storage vendor Transfer speed can be maintained across a wide are link connected to a remote site Method 3 drawbacks include: Page 4 of 4 The installation process can be complex, because one channel environment must be mapped to another More devices are needed and must be maintained Large data warehousing installation need long-distance data sharing, high-speed data movement, efficient use of current resources, and greater data accessibility. When considering all of these, the channel movement method--Method 3--is superior. Its exceptional speed, price-performance ratio, flexibility, scalability, storage independence, and the versatility of constructing how and where data will be moved clearly advances today's enterprise strategies. Page 1 of 3 Data Stewardship: The Key to Business Involvement in Data Warehousing One of the distinguishing characteristics of successful data warehouse endeavors is the involvement of the business community. It is important to keep the business constituents involved throughout the process to avoid deleterious results arising from the business side losing touch with the project of being unprepared or unwilling to accept their share of the responsibility for the data warehouse. This article organizes the business community roles into a data stewardship program and outlines ways to get the business community involved in the data warehouse and acceding to their roles. Effective stewardship appointment is the most important factor in the program. You need to balance having too many stewards with having areas of assigned responsibility that are too broad and won't be met. Generally, appointment of 1-3 stewards per subject area, depending on the breadth of your subject areas, strikes the best balance. Be thorough. The bottom line is to have a data steward for each element in the data warehouse. Data Stewardship can comprise six main functions: Arbitrating the transformation rules Verifying the data after load Contribution of the business metadata Approving new users Supporting the user community on the data Participation on a corporate governance committee Arbitrating the transformation rules The data warehouse usually pulls together data from disparate sources. The goal with transformation rules is to keep the data meaningful and consistent. In this process, it will often be necessary to arbitrate differences of opinion and different interpretations of value as to how the data will be represented in its new form. The data steward, in partnership with IT, will be in the best position to make the call on the initial rules and subsequent changes. Verifying the data after load Verifying the data after the data warehouse load involves confirming the data was loaded and the transformation rules were properly applied. The verification becomes crucial when operational systems are prone to change structure and content on short notice or without notice to the data warehouse team. The sanity check can be followed up with formally giving the approval to the greater user community for query and analysis of the data. Page 2 of 3 Contribution of the business metadata Metadata is comprised of technical and business components. I have experienced best results when the data stewards contribute the business components such as business definition and IT contributes the technical components such as data sources and transformation rules applied. The data steward is in the best position to phrase the business metadata in terms of importance to the business. Approving new users With detailed knowledge of the data sources, transformation rules, and uses of the data, the data steward can broker requests for new usage of the data warehouse. The data steward should develop expertise not only in the data but also in the workload limitations of the data warehouse system for the area of stewardship. This combined knowledge provides the data steward with the capacity to effectively approve new users and their authority levels. Supporting the user community on the data This expertise comprises the majority of the user training and support topics that the data steward should be made responsible for on a perpetual basis. User support can be composed of training classes and ongoing phone (help-desk style) support to the user community focused on how to functionally use the query tools and use the data. The most effective presentation is a partnership where IT is responsible for the query tools and the data steward is responsible for the data Make it a requirement that new users to the data warehouse take this training before receiving access. This training should have more of a business focus than a technical focus. The query tools, though requiring explanation and a point of leadership in the organization, should be made as easy-to-use as possible and therefore require less training than the main attraction – the data model and the data itself. Participation on a corporate governance committee Once you have the data warehouse up and running, in addition to the ongoing minor enhancements to the existing subject areas, there will be interest in major additions of usage, subject areas, and data sources. If any of the following conditions exist, a corporate governance committee will be an effective way to make discernment and prioritization: Page 3 of 3 Perceived return-on-investment is used as a driver for company efforts and therefore a forum is needed to confirm and refine these often subjective estimates There is a limited budget or people resources for data warehouse additions Expansion of data and data uses may strain the scalability of the environment and create the need for additional hardware and software With representation from IT and the major business units, the committee’s charter is to provide corporate governance to strategic data warehouse decisions such as new subject areas, new data sources, new business problems solved with the data warehouse, and new expenditures for processors and storage. The committee will be most effective when comprised of current and potential future data stewards representing the business together with key IT leadership personnel. From design through implementation, no phase should be attempted without the commitment of the business to contribute. An effectual data stewardship program actively and thoroughly assigns specific responsibilities with clear outcomes across the six functions. Advantages exist for project teams willing to integrate the key business constituents of the data warehouse through implementation of a complete program.
Pages to are hidden for
"Data_Warehouse"Please download to view full document