Data Quality - PDF

Description

Data Quality Methodology - Informatica

Shared by: ganeshdw
Posted: 6/28/2009
Language: English
Velocity v8
Data Quality Methodology

Data Quality
Executive Summary
There are important considerations that organizations must be aware of when organizing a project that involves data quality. The goal here is to set forth principles that apply to most data quality projects. Experience has shown that adhering to these principles maximizes the potential for success in data quality projects.

Most organizations realize that data quality shortcomings have a negative effect on the organization’s planning and performance. Increasingly, organizations are instituting data quality programs to address known shortcomings, discover hidden data quality issues, and create a consistent, ongoing process for monitoring and improving data quality throughout the organization.

Data quality projects originate from a variety of sources, including:
● Development of an Integration Competency Center (ICC).
● Master data management/customer data integration (MDM/CDI) projects.
● Data integration projects driven by mergers or acquisitions.
● Data migration from older to newer systems.
● Data warehousing.
● Quality based initiatives such as Six Sigma.
● Addressing specific data quality issues that have been identified by management as adversely affecting performance.

Regardless of the impetus, data quality projects will succeed or fail based on these common factors:
● The commitment of business owners to the success of the project.
● The effort expended to discover data quality issues as the project is designed.
● The presence of a data governance process to analyze data quality issues and approve general rules for dealing with the issues.
● The effective documentation of data quality business rules.
● The establishment of scorecards and other metrics to measure data quality issues over time.
● The realization that data quality projects must have a continuing life cycle to discover new data quality issues that were not present when the project commenced and to prevent previous data quality issues from creeping into the data over time.

INFORMATICA CONFIDENTIAL

Velocity v8 Methodology - Data Quality

Business Drivers
The impetus for a data quality project has a greater impact on the scope of the project than it has on the critical aspects for project success. When the data quality initiative derives from a broader initiative such as ICC, MDM, or CDI, then the outcome of the project is likely to have an enterprise impact. In other words, the results of the project will potentially have a higher degree of visibility and financial impact. On the other hand, a data quality project that derives from a data integration project related to a merger or acquisition may be perceived as having only limited impact and perhaps less visibility. Nevertheless, even a department level project should be analyzed for its potential impact across the organization.
● Is the data that is scoped for the project used outside the confines of the project?
● Is the type of data common to other data sources not covered by the project?
● Is there a reasonable likelihood that the project will serve as a model for future efforts or similar projects?
● Are there potential side effects to the project that extend beyond the confines of the project?

If the answer to any of these questions is “yes”, then the project likely has a broader impact than what was originally perceived.

Key Success Factors

Commitment of Business Owners to the Project

A strong commitment by the project’s business owners to time, resources and focus is the most critical factor that leads to the success of a data quality project. While in some cases IT may “own” the project, IT never owns the data. The business owns the data and the business must live with the data daily. Decisions about even trivial aspects of data quality business rules can have impacts that are not necessarily easy to foresee without analysis and input from the business owners. Participation by key business personnel throughout the project lifecycle will help to avert unintended surprises and unhappy end users. This requires a commitment and a sense of priority that must be driven downward from the Project Sponsor. Gaining the appropriate level of commitment requires an understanding throughout the business that data quality issues have real costs and that correcting data quality issues and maintaining a high level of data quality has ongoing benefits. Some examples include:
● Reducing customer support call times by achieving a single view of the customer.
● Avoiding returned shipments.
● Detecting regulatory compliance risks.
● Improving the accuracy of BI metrics.
● Increasing customer satisfaction.
● Gaining a clearer picture of supplier and customer interactions.
● Creating reusable data quality business rules.

These are all demonstrable bottom line benefits.

Effort to Discover Data Quality Issues during the Design Phase
With most data quality initiatives, there are several known data quality issues at the outset of the project. However, not all issues will be known unless effort is expended to profile and analyze the data during the design phase. Informatica Data Explorer (IDE) and Informatica Data Quality (IDQ) both provide profiling and analysis capabilities critical for project success. It is almost always easier and cheaper to discover problems early in the design phase than it is to retrofit development to address problems discovered during implementation. Consequently, a significant portion of the overall project should be devoted to profiling and analysis.

Some aspects of profiling can be performed quickly and efficiently:
● Data type discovery and enforcement on a column by column basis.
● Patterns in the data (e.g., phone number fields).
● Percentage of populated records.
● Fields with a limited set of valid inputs.

Others can require more detailed analysis:
● Slugged values (e.g., “Training” as a customer name).
● Repurposed fields (e.g., including non-address information in an address field).
● End user adaptations (e.g., strings such as “#NOTE#”, “Do not ship”, etc.).
● Units of measure out of bounds.

In the hands of an experienced user, a very high percentage of these sorts of problems can be discovered during profiling and analysis. This upfront investment in analysis will reduce development and testing times, minimize load errors and reduce project risk.
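
The quick profiling checks listed above can be illustrated outside any particular tool. The following is a minimal sketch in Python, not a representation of how IDE or IDQ perform profiling. The function names, the sample phone values, and the list of suspect placeholder strings are all hypothetical; in practice the business owners would supply the list of suspect values.

```python
import re
from collections import Counter

# Hypothetical placeholder strings that often signal slugged values or
# end user adaptations; a real list would come from the business owners.
SUSPECT_VALUES = {"training", "test", "#note#", "do not ship", "n/a"}


def pattern_of(value):
    """Reduce a value to a shape string: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


def profile_column(values):
    """Return basic profiling statistics for one column of string values."""
    # Treat None, empty strings and whitespace-only strings as unpopulated.
    populated = [v.strip() for v in values if v is not None and v.strip()]
    total = len(values)
    return {
        "percent_populated": 100.0 * len(populated) / total if total else 0.0,
        "distinct_values": len(set(populated)),
        "patterns": Counter(pattern_of(v) for v in populated),
        "suspect_values": sorted(
            {v for v in populated if v.lower() in SUSPECT_VALUES}
        ),
    }


# Example: a phone number column containing an end user adaptation.
phones = ["555-123-4567", "555-987-6543", "", None, "Do not ship"]
report = profile_column(phones)
for pattern, count in report["patterns"].most_common():
    print(pattern, count)
```

A dominant pattern with a handful of outliers usually points at the records worth reviewing with the business owners. A low distinct-value count suggests a field with a limited set of valid inputs.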

Data Governance Process
A data governance committee is an ideal forum for discussing and resolving decisions about data quality business rules. The committee might consist of the Project Sponsor, the Business Project Manager, the Technical Project Manager, the Quality Assurance Manager, the Test Manager, the User Acceptance Test Lead, one or more Business Analysts, one or more Data Stewards, one or more End Users, and one or more Data Quality Developers. Other team members may be adjunct members brought into meetings as their skill sets are needed.

Data governance is critical to the success of data quality initiatives. A data governance committee facilitates communication between the business users (who best know the data and the impact of business rules) and the technical personnel charged with implementing and testing business rules.

Data governance also serves as a forum for the prioritization of data quality business rules. Nearly every data quality project has limitations on time and resources that prevent all known data quality issues from being addressed. A data governance committee is well positioned to assess both the business and the technical impacts of particular data quality issues in order to prioritize them intelligently.

Effective Documentation of Data Quality Business Rules
The data governance committee should be charged with approving and documenting all business rules developed for the project. The documentation should include a description of the field(s) governed by the rule, the detail of the rule, and a brief rationale for the rule. This will serve as input for the implementation of the necessary data quality plans. The data quality business rule documentation also plays into development of the test plans. As test plans are developed, the testing organization can flag business rules that may be too ambiguous to permit definitive testing. Clear documentation also facilitates reusability of business rules.
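
As an illustration of the three documented elements (governed fields, rule detail, rationale), a rule record might be captured in a simple structure. This is a minimal sketch in Python under stated assumptions: the class name, attribute names, rule identifier, and example rule are hypothetical and are not part of any Informatica product. The completeness check only detects missing elements; it cannot judge whether a rule is too ambiguous to test.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BusinessRule:
    rule_id: str        # hypothetical identifier scheme
    fields: List[str]   # field(s) governed by the rule
    detail: str         # what the rule requires
    rationale: str      # brief reason the rule exists


def missing_parts(rule):
    """List the documentation elements that have been left empty."""
    missing = []
    if not rule.fields:
        missing.append("fields")
    if not rule.detail.strip():
        missing.append("detail")
    if not rule.rationale.strip():
        missing.append("rationale")
    return missing


# Example: a rule submitted to the committee without a rationale.
rule = BusinessRule(
    rule_id="DQ-001",
    fields=["customer_name"],
    detail="Reject records where the customer name is a placeholder value.",
    rationale="",
)
print(rule.rule_id, "is missing:", missing_parts(rule))
```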

Using Scorecards and other Metrics to Measure Data Quality
As with data integration development, data quality development involves frequent test processing, often followed by tweaking data quality plans to implement all required business rules. One way to measure the effectiveness of this process is to create scorecards that measure key data quality indicators from one run to the next. IDQ reporting allows scorecards to be created with relative ease. These scorecards do not cease to have utility when the project reaches a “go live” state. Data quality is an ongoing discipline, not a project discipline. Continual monitoring and re-analysis are necessary if an organization is to achieve and maintain high levels of data quality.
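
The run-to-run comparison behind a scorecard can be sketched in a few lines. The example below is a minimal Python illustration and does not use or reproduce IDQ reporting. The metric names and percentages are invented for the example, and it assumes every indicator is a percentage where higher is better.

```python
def scorecard(previous, current):
    """Compare two runs of data quality indicators (percentages).

    Returns one row per indicator in the current run:
    (indicator, previous value, current value, change).
    Indicators that are new in the current run have no change value.
    """
    rows = []
    for metric in sorted(current):
        before = previous.get(metric)
        after = current[metric]
        change = None if before is None else round(after - before, 2)
        rows.append((metric, before, after, change))
    return rows


def regressions(rows):
    """Return the indicators whose value dropped since the previous run."""
    return [m for m, _, _, change in rows if change is not None and change < 0]


# Hypothetical indicator values from two consecutive runs.
run_1 = {"phone_populated": 91.5, "valid_postcode": 78.0}
run_2 = {"phone_populated": 94.0, "valid_postcode": 76.5,
         "no_duplicate_customers": 99.1}

rows = scorecard(run_1, run_2)
for row in rows:
    print(row)
print("Regressed:", regressions(rows))
```

Keeping the same comparison running after go-live is one simple way to notice when previously corrected issues begin to reappear.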

The Continuing Life Cycle of Data Quality
It is a truism with data quality projects that as soon as a data quality process has completed, the quality of the data begins to decline. Thus it is essential that a data quality program not go fallow when the project goes live. A successful data quality system must continue to maintain controls, monitoring and profiling to ensure that data quality does not deteriorate over time.

Last updated: 20-May-08 16:35

Roles
● Velocity Roles and Responsibilities
● Business Analyst
● Business Project Manager
● Data Architect
● Data Integration Developer
● Data Quality Developer
● Data Steward/Data Quality Steward
● Database Administrator (DBA)
● End User
● Project Sponsor
● Quality Assurance Manager
● Technical Project Manager
● Test Engineer
● Test Manager
● User Acceptance Test Lead

Velocity Roles and Responsibilities
The following pages describe the roles used throughout this Guide, along with the responsibilities typically associated with each. Please note that the concept of a role is distinct from that of an employee or full-time equivalent (FTE). A role encapsulates a set of responsibilities that may be fulfilled by a single person in a part-time or full-time capacity, or may be accomplished by a number of people working together.

The Velocity Guide refers to roles with an implicit assumption that there is a corresponding person in that role. For example, a task description may discuss the involvement of "the DBA" on a particular project; however, there may be one or more DBAs, or a person whose part-time responsibility is database administration.

In addition, note that there is no assumption of staffing level for each role. A small project may have one individual filling the roles of Data Integration Developer, Data Architect, and Database Administrator, while large projects may have multiple individuals assigned to each role. In cases where multiple people represent a given role, the singular role name is used, and project planners can specify the actual allocation of work among all relevant parties. For example, the methodology always refers to the Technical Architect when, in fact, there may be a team of two or more people developing the Technical Architecture for a very large development effort.

Data Integration Project - Sample Organization Chart

Last updated: 20-May-08 18:51

Business Analyst
The primary role of the Business Analyst (sometimes known as the Functional Analyst) is to represent the interests of the business in the development of the data integration solution. The secondary role is to function as an interpreter for business and technical staff, translating concepts and terminology and generally bridging gaps in understanding. Under normal circumstances, someone from the business community fills this role, since deep knowledge of the business requirement is indispensable. Ideally, familiarity with the technology and the development life-cycle allows the individual to function as the communications channel between technical and business users.

Reports to:
● Business Project Manager

Responsibilities:
● Ensures that the delivered solution fulfills the needs of the business (should be involved in decisions related to the business requirements)
● Assists in determining the data integration system project scope, time and required resources
● Provides support and analysis of data collection, mapping, aggregation and balancing functions
● Performs requirements analysis, documentation, testing, ad-hoc reporting, user support and project leadership
● Produces detailed business process flows, functional requirements specifications and data models and communicates these requirements to the design and build teams
● Conducts cost/benefit assessments of the functionality requested by end-users
● Prioritizes and balances competing priorities
● Plans and authors the user documentation set

Qualifications/Certifications
● Possesses excellent communication skills, both written and verbal
● Must be able to work effectively with both business and technical stakeholders
● Works independently with minimal supervision
● Has knowledge of the tools and technologies used in the data integration solution
● Holds certification in industry vertical knowledge (if applicable)

Recommended Training
● Interview/workshop techniques
● Project Management
● Data Analysis
● Structured analysis
● UML or other business design methodology
● Data Warehouse Development

Last updated: 09-Apr-07 15:20

Business Project Manager
The Business Project Manager has overall responsibility for the delivery of the data integration solution. As such, the Business Project Manager works with the project sponsor, technical project manager, user community, and development team to strike an appropriate balance of business needs, resource availability, project scope, schedule, and budget to deliver specified requirements and meet customer satisfaction.

Reports to:
● Project Sponsor

Responsibilities:
● Develops and manages the project work plan
● Manages project scope, time-line and budget
● Resolves budget issues
● Works with the Technical Project Manager to procure and assign the appropriate resources for the project
● Communicates project progress to Project Sponsor(s)
● Is responsible for ensuring delivery on commitments and ensuring that the delivered solution fulfills the needs of the business
● Performs requirements analysis, documentation, ad-hoc reporting and project leadership

Qualifications/Certifications
● Translates strategies into deliverables
● Prioritizes and balances competing priorities
● Possesses excellent communication skills, both written and verbal
● Results oriented team player
● Must be able to work effectively with both business and technical stakeholders
● Works independently with minimal supervision
● Has knowledge of the tools and technologies used in the data integration solution
● Holds certification in industry vertical knowledge (if applicable)

Recommended Training
● Project Management

Last updated: 06-Apr-07 17:55

Data Architect
The Data Architect is responsible for the delivery of a robust scalable data architecture that meets the business goals of the organization. The Data Architect develops the logical data models, and documents the models in Entity-Relationship Diagrams (ERD). The Data Architect must work with the Business Analysts and Data Integration Developers to translate the business requirements into a logical model. The logical model is captured in the ERD, which then feeds the work of the Database Administrator, who designs and implements the physical database. Depending on the specific structure of the development organization, the Data Architect may also be considered a Data Warehouse Architect, in cooperation with the Technical Architect. This role involves developing the overall Data Warehouse logical architecture, specifically the configuration of the data warehouse, data marts, and an operational data store or staging area if necessary. The physical implementation of the architecture is the responsibility of the Database Administrator.

Reports to:
● Technical Project Manager

Responsibilities:
● Designs an information strategy that maximizes the value of data as an enterprise asset
● Maintains logical/physical data models
● Coordinates the metadata associated with the application
● Develops technical design documents
● Develops and communicates data standards
● Maintains Data Quality metrics
● Plans architectures and infrastructures in support of data management processes and procedures
● Supports the build out of the Data Warehouse, Data Marts and operational data store
● Effectively communicates with other technology and product team members

Qualifications/Certifications
● Strong understanding of data integration concepts
● Understanding of multiple data architectures that can support a Data Warehouse
● Ability to translate functional requirements into technical design specifications
● Ability to develop technical design documents and test case documents
● Experience in optimizing data loads and data transformations
● Industry vertical experience is essential
● Project Solution experience is desired
● Has had some exposure to Project Management
● Has worked with Modeling Packages
● Has experience with at least one RDBMS
● Strong Business Analysis and problem solving skills
● Familiarity with Enterprise Architecture Structures (Zachman/TOGAF)

Recommended Training
● Modeling Packages
● Data Warehouse Development

Last updated: 01-Feb-07 18:51

Data Integration Developer
The Data Integration Developer is responsible for the design, build, and deployment of the project's data integration component. A typical data integration effort usually involves multiple Data Integration Developers developing the Informatica mappings, executing sessions, and validating the results.

Reports to:
● Technical Project Manager

Responsibilities:
● Uses the Informatica Data Integration platform to extract, transform, and load data
● Develops Informatica mapping designs
● Develops Data Integration Workflows and load processes
● Ensures adherence to locally defined standards for all developed components
● Performs data analysis for both Source and Target tables/columns
● Provides technical documentation of Source and Target mappings
● Supports the development and design of the internal data integration framework
● Participates in design and development reviews
● Works with System owners to resolve source data issues and refine transformation rules
● Ensures performance metrics are met and tracked
● Writes and maintains unit tests
● Conducts QA Reviews
● Performs production migrations

Qualifications/Certifications
● Understands data integration processes and how to tune for performance
● Has SQL experience
● Possesses excellent communications skills
● Has the ability to develop work plans and follow through on assignments with minimal guidance
● Has Informatica Data Integration Platform experience
● Is an Informatica Certified Designer
● Has RDBMS experience
● Has the ability to work with business and system owners to obtain requirements and manage expectations

Recommended Training
● Data Modeling
● PowerCenter - Level I & II Developer
● PowerCenter - Performance Tuning
● PowerCenter - Team Based Development
● PowerCenter - Advanced Mapping Techniques
● PowerCenter - Advanced Workflow Techniques
● PowerCenter - XML Support
● PowerCenter - Data Profiling
● PowerExchange

Last updated: 01-Feb-07 18:51

Data Quality Developer
The Data Quality Developer (DQ Developer) is responsible for designing, testing, deploying, and documenting the project's data quality procedures and their outputs. The DQ Developer provides the Data Integration Developer with all relevant outputs and results from the data quality procedures, including any ongoing procedures that will run in the Operate phase or after project-end. The DQ Developer must provide the Business Analyst with the summary results of data quality analysis as needed during the project. The DQ Developer must also document at a functional level how the procedures work within the data quality applications. The primary tasks associated with this role are to use Informatica Data Quality and Informatica Data Explorer to profile the project source data, define or confirm the definition of the metadata, cleanse and accuracy-check the project data, check for duplicate or redundant records, and provide the Data Integration Developer with concrete proposals on how to proceed with the ETL processes.

Reports to:
● Technical Project Manager

Responsibilities:
● Profile source data and determine all source data and metadata characteristics
● Design and execute Data Quality Audit
● Present profiling/audit results, in summary and in detail, to the business analyst, the project manager, and the data steward
● Assist the business analyst/project manager/data steward in defining or modifying the project plan based on these results
● Assist the Data Integration Developer in designing source-to-target mappings
● Design and execute the data quality plans that will cleanse, de-duplicate, and otherwise prepare the project data for the Build phase
● Test Data Quality plans for accuracy and completeness
● Assist in deploying plans that will run in a scheduled or batch environment
● Document all plans in detail and hand over documentation to the customer
● Assist in any other areas relating to the use of data quality processes, such as unit testing

Qualifications/Certifications
● Has knowledge of the tools and technologies used in the data quality solution
● Results oriented team player
● Possesses excellent communication skills, both written and verbal
● Must be able to work effectively with both business and technical stakeholders

Recommended Training
● Data Quality Workbench I & II
● Data Explorer Level I
● PowerCenter Level I Developer
● Basic RDBMS Training
● Data Warehouse Development

Last updated: 15-Feb-07 17:34

Data Steward/Data Quality Steward
The Data Steward owns the data and associated business and technical rules on behalf of the Project Sponsor. This role has responsibility for defining and maintaining business and technical rules, liaising with the business and technical communities, and resolving issues relating to the data. The Data Steward will be the primary contact for all questions relating to the data, its use, processing and quality. In essence, this role formalizes the accountability for the management of organizational data. Typically the Data Steward is a key member of a Data Stewardship Committee put into place by the Project Sponsor. This committee will include business users and technical staff such as Application Experts. There is often an arbitration element to the role where data is put to different uses by separate groups of users whose requirements have to be reconciled.

Reports to:
● Business Project Manager

Responsibilities:
● Records the business use for defined data
● Identifies opportunities to share and re-use data
● Decides upon the target data quality metrics
● Monitors the progress towards, and tuning of, data quality target metrics
● Oversees data quality strategy and remedial measures
● Participates in the enforcement of data quality standards
● Enters, maintains and verifies data changes
● Ensures the quality, completeness and accuracy of data definitions
● Communicates concerns, issues and problems with data to the individuals that can influence change
● Researches and resolves data issues

Qualifications/Certifications
● Possesses strong analytical and problem solving skills
● Has experience in managing data standardization in a large organization, including setting and executing strategy
● Previous industry vertical experience is essential
● Possesses excellent communication skills, both written and verbal
● Exhibits effective negotiating skills
● Displays meticulous attention to detail
● Must be able to work effectively with both business and technical stakeholders
● Works independently with minimal supervision
● Project solution experience is desirable

Recommended Training
● Data Quality Workbench Level I
● Data Explorer Level I

Last updated: 15-Feb-07 17:34

Database Administrator (DBA)
The Database Administrator (DBA) in a Data Integration Solution is typically responsible for translating the logical model (i.e., the ERD) into a physical model for implementation in the chosen DBMS, implementing the model, developing volume and capacity estimates, performance tuning, and general administration of the DBMS. In many cases, the project DBA also has useful knowledge of existing source database systems. In most cases, a DBA's skills are tied to a particular DBMS, such as Oracle or Sybase. As a result, an analytic solution with heterogeneous sources/targets may require the involvement of several DBAs. The Project Manager and Data Warehouse Administrator are responsible for ensuring that the DBAs are working in concert toward a common solution.

Reports to:
● Technical Project Manager

Responsibilities:
● Plans, implements and supports enterprise databases
● Establishes and maintains database security and integrity controls
● Delivers database services while managing to policies, procedures and standards
● Tests and implements new technical solutions
● Monitors and supports the database infrastructure (including clients)
● Develops volume and capacity estimates
● Proposes and implements enhancements to improve performance and reliability
● Provides operational support of databases, including backup and recovery
● Develops programs to migrate data between systems
● Works to resolve technical issues
● Contributes to technical and system architectural planning
● Supports data integration developers in troubleshooting performance issues
● Collaborates with other Departments (e.g., Network Administrators) to identify and resolve performance issues

Qualifications/Certifications
● Experience in database administration, backup and recovery
● Expertise in database configuration and tuning
● Appreciation of DI tool-set and associated tools
● Experience in developing and supporting ETL real-time and batch processes
● Strategic planning and system analysis
● Strong analytical and communication skills
● Able to work effectively with both business and technical stakeholders
● Ability to work independently with minimal supervision

Recommended Training
● DBMS Administration

Last updated: 01-Feb-07 18:51

End User
The End User is the ultimate "consumer" of the data in the data warehouse and/or data marts. As such, the end user represents a key customer constituent (management is another), and must therefore be heavily involved in the development of a data integration solution. Specifically, a representative of the End User community must be involved in gathering and clarifying the business requirements, developing the solution and User Acceptance Testing (if applicable).

Reports to:
● Business Project Manager

Responsibilities:
● Gathers and clarifies business requirements
● Reviews technical design proposals
● Participates in User Acceptance testing
● Provides feedback on the user experience

Qualifications/Certifications
● Strong understanding of the business' processes
● Good communication skills

Recommended Training
● Data Analyzer - Quickstart
● Data Analyzer - Report Development

Last updated: 01-Feb-07 18:51

Project Sponsor
The Project Sponsor is typically a member of the business community rather than an IT/IS resource. This is important because the lack of business sponsorship is often a contributing cause of systems implementation failure. The Project Sponsor often initiates the effort, serves as project champion, guides the Project Managers in understanding business priorities, and reports status of the implementation to executive leadership. Once an implementation is complete, the Project Sponsor may also serve as "chief evangelist", bringing word of the successful implementation to other areas within the organization.

Reports to:
● Executive Leadership

Responsibilities:
● Provides the business sponsorship for the project
● Champions the project within the business
● Initiates the project effort
● Guides the Project Managers in understanding business requirements and priorities
● Assists in determining the data integration system project scope, time, budget and required resources
● Reports status of the implementation to executive leadership

Qualifications/Certifications
● Has industry vertical knowledge

Recommended Training
● N/A

Last updated: 01-Feb-07 18:51

Quality Assurance Manager
The Quality Assurance (QA) Manager ensures that the original intent of the business case is achieved in the actual implementation of the analytic solution. This involves leading the efforts to validate the integrity of the data throughout the data integration processes, and ensuring that the ultimate data target has been accurately derived from the source data. The QA Manager can be a member of the IT organization, but serves as a liaison to the business community (i.e., the Business Analysts and End Users). In situations where issues arise with regard to the quality of the solution, the QA Manager works with project management and the development team to resolve them. Depending upon the test approach taken by the project team, the QA Manager may also serve as the Test Manager.

Reports to:
●

Technical Project Manager

Responsibilities:
● Leads the effort to validate the integrity of the data through the data integration processes
● Ensures that the data contained in the data integration solution has been accurately derived from the source data
● Develops and maintains quality assurance plans and test requirements documentation
● Verifies compliance to commitments contained in quality plans
● Works with the project management and development teams to resolve issues
● Participates in the enforcement of data quality standards
● Communicates concerns, issues and problems with data
● Participates in the testing and post-production verification
● Together with the Technical Lead and the Repository Administrator, articulates the development standards
● Advises on the development methods to ensure that quality is built in
● Designs the QA and standards enforcement strategy
● Together with the Test Manager, coordinates the QA and Test strategies
● Manages the implementation of the QA strategy

Qualifications/Certifications
● Industry vertical knowledge
● Solid understanding of the Software Development Life Cycle
● Experience in quality assurance performance, auditing processes, best practices and procedures
● Experience with automated testing tools
● Knowledge of Data Warehouse and Data Integration enterprise environments
● Able to work effectively with both business and technical stakeholders

Recommended Training
● PowerCenter Level I Developer
● Informatica Data Explorer
● Informatica Data Quality Workbench
● Project Management

Last updated: 01-Feb-07 18:51

Technical Project Manager
The Technical Project Manager has overall responsibility for managing the technical resources within a project. As such, he/she works with the project sponsor, business project manager and development team to assign the appropriate resources for a project within the scope, schedule, and budget and to ensure that project deliverables are met.

Reports to:
● Project Sponsor or Business Project Manager

Responsibilities:
● Defines and implements the methodology adopted for the project
● Liaises with the Project Sponsor and Business Project Manager
● Manages project resources within the project scope, time-line and budget
● Ensures all business requirements are accurate
● Communicates project progress to Project Sponsor(s)
● Is responsible for ensuring delivery on commitments and ensuring that the delivered solution fulfills the needs of the business
● Performs requirements analysis, documentation, ad-hoc reporting and resource leadership

Qualifications/Certifications
● Translates strategies into deliverables
● Prioritizes and balances competing priorities
● Must be able to work effectively with both business and technical stakeholders
● Has knowledge of the tools and technologies used in the data integration solution
● Holds certification in industry vertical knowledge (if applicable)

Recommended Training
● Project Management Techniques
● PowerCenter Developer Level I
● PowerCenter Administrator Level I
● Data Analyzer Introduction

Last updated: 01-Feb-07 18:51

Test Engineer
The Test Engineer is responsible for completing the test plans and executing them. During test planning, the Test Engineer works with the Test Manager/Quality Assurance Manager to finalize the test plans and to ensure that the requirements are testable. The Test Engineer is also responsible for complete test execution, including designing and implementing test scripts, test suites, test cases, and test data. The Test Engineer should be able to demonstrate knowledge of testing techniques and to provide feedback to developers. He/she uses the procedures defined in the test strategy to execute tests, report results and progress, and escalate testing issues as appropriate.

Reports to:
● Test Manager (or Quality Assurance Manager)

Responsibilities:
● Provides input to the test plan and executes it
● Carries out requested procedures to ensure that Data Integration systems and services meet organization standards and business requirements
● Develops and maintains test plans, test requirements documentation, test cases and test scripts
● Verifies compliance to commitments contained in the test plans
● Escalates issues and works to resolve them
● Participates in testing and post-production verification efforts
● Executes test scripts and documents and provides the results to the test manager
● Provides feedback to developers
● Investigates and resolves test failures

Qualifications/Certifications
● Solid understanding of the Software Development Life Cycle
● Experience with automated testing tools
● Strong knowledge of Data Warehouse and Data Integration enterprise environments
● Experience in a quality assurance and testing environment
● Experience in developing and executing test cases and in setting up complex test environments
● Industry vertical knowledge

Recommended Training
● PowerCenter Developer Level I & II
● Data Analyzer Introduction
● SQL Basics
● Data Quality Workbench

Last updated: 01-Feb-07 18:51

Test Manager
The Test Manager is responsible for coordinating all aspects of test planning and execution. During test planning, the Test Manager becomes familiar with the business requirements in order to develop sufficient test coverage for all planned functionality. He/she also develops a test schedule that fits into the overall project plan. Typically, the Test Manager works with a development counterpart during test execution; the development manager schedules and oversees the completion of fixes for bugs found during testing.

The Test Manager is also responsible for the creation of the test data set. An integrated test data set is a valuable project resource in its own right; apart from its obvious role in testing, the test data set is very useful to the developers of integration and presentation components. In general, separate functional and volume test data sets will be required. In most cases, these should be derived from the production environment. It may also be necessary to manufacture a data set which triggers all the business rules and transformations specified for the application.

Finally, the Test Manager must continually advocate adherence to the Test Plans. Projects at risk of delayed completion often sacrifice testing at the expense of a high-quality end result.

Reports to:
● Technical Project Manager (or Quality Assurance Manager)

Responsibilities:
● Coordinates all aspects of test planning and execution
● Carries out procedures to ensure that Data Integration systems and services meet organization standards and business requirements
● Develops and maintains test plans, test requirements documentation, test cases and test scripts
● Develops and maintains test data sets
● Verifies compliance to commitments contained in the test plans
● Works with the project management and development teams to resolve issues
● Communicates concerns, issues and problems with data
● Leads testing and post-production verification efforts
● Executes test scripts and documents and publishes the results
● Investigates and resolves test failures

Qualifications/Certifications
● Solid understanding of the Software Development Life Cycle
● Experience with automated testing tools
● Strong knowledge of Data Warehouse and Data Integration enterprise environments
● Experience in a quality assurance and testing environment
● Experience in developing and executing test cases and in setting up complex test environments
● Experience in classifying, tracking and verifying bug fixes
● Industry vertical knowledge
● Able to work effectively with both business and technical stakeholders
● Project management

Recommended Training
● PowerCenter Developer Level I
● Data Analyzer Introduction
● Data Explorer

Last updated: 01-Feb-07 18:51

User Acceptance Test Lead
The User Acceptance Test Lead is responsible for leading the final testing and gaining final approval from the business users. The User Acceptance Test Lead interacts with the End Users and the design team during the development effort to ensure the inclusion of all the user requirements within the original defined scope. He/she then validates that the deployed solution meets the final user requirements.

Reports to:
● Business Project Manager

Responsibilities:
● Gathers and clarifies business requirements
● Interacts with the design team and end users during the development efforts to ensure inclusion of user requirements within the defined scope
● Reviews technical design proposals
● Schedules and leads the user acceptance test effort
● Provides test script/case training to the user acceptance test team
● Reports on test activities and results
● Validates that the deployed solution meets the final user requirements

Qualifications/Certifications
● Experience planning and executing user acceptance testing
● Strong understanding of the business' processes
● Knowledge of the project solution
● Excellent communication skills

Recommended Training
● N/A

Last updated: 12-Jun-07 16:06

Phase 1: Manage
1 Manage
● 1.1 Define Project
   ○ 1.1.2 Build Business Case
● 1.2 Plan and Manage Project
   ○ 1.2.1 Establish Project Roles
   ○ 1.2.2 Develop Project Estimate
   ○ 1.2.3 Develop Project Plan
   ○ 1.2.4 Manage Project

Phase 1: Manage
Description

Managing the development of a data integration solution requires extensive planning. A well-defined, comprehensive plan provides the foundation from which to build a project solution. The goal of this phase is to address the key elements required for a solid project foundation. These elements include:
● Scope - Clearly defined business objectives. The measurable, business-relevant outcomes expected from the project should be established early in the development effort. Then, an estimate of the expected Return on Investment (ROI) can be developed to gauge the level of investment and anticipated return. The business objectives should also spell out a complete inventory of business processes to facilitate a collective understanding of these processes among project team members.
● Planning/Managing - The project plan should detail the project scope as well as its objectives, required work efforts, risks, and assumptions. A thorough, comprehensive scope can be used to develop a work breakdown structure (WBS) and establish project roles for summary task assignments. The plan should also spell out the change and control process that will be used for the project.
● Project Close/Wrap-Up - At the end of each project, the final step is to obtain project closure. Part of this closure is to ensure the completeness of the effort and obtain sign-off for the project. Additionally, a project evaluation will help in retaining lessons learned and assessing the success of the overall effort.

Prerequisites
None

Roles
Business Project Manager (Primary)
Data Integration Developer (Secondary)
Data Quality Developer (Secondary)
Data Transformation Developer (Secondary)
Presentation Layer Developer (Secondary)
Production Supervisor (Approve)
Project Sponsor (Primary)
Quality Assurance Manager (Approve)
Technical Architect (Primary)
Technical Project Manager (Primary)

Considerations
None

Best Practices
None

Sample Deliverables
None
Last updated: 20-May-08 18:53

Phase 1: Manage
Task 1.1 Define Project
Description
This task entails constructing the business context for the project, defining in business terms the purpose and scope of the project as well as the value to the business (i.e., the business case).

Prerequisites
None

Roles
Business Analyst (Primary)
Business Project Manager (Primary)
Project Sponsor (Primary)

Considerations
There are no technical considerations during this task; in fact, any discussion of implementation specifics should be avoided at this time. The focus here is on defining the project deliverable in business terms with no regard for technical feasibility. Any discussion of technologies is likely to sidetrack the strategic thinking needed to develop the project objectives.

Best Practices
None

Sample Deliverables
Project Definition

Last updated: 01-Feb-07 18:43

Phase 1: Manage
Subtask 1.1.2 Build Business Case
Description
Building support and funding for a data integration solution nearly always requires convincing executive IT management of its value to the business. The best way to do this, if possible, is to build a business case that calculates the project's estimated return on investment (ROI). ROI modeling is valuable because it:
● Supplies a fundamental cost-justification framework for evaluating a data integration project.
● Mandates advance planning among all appropriate parties, including IT team members, business users, and executive management.
● Helps organizations clarify and agree on the benefits they expect, and in that process, helps them set realistic expectations for the data integration solution or the data quality initiative.

In addition to traditional ROI modeling on data integration initiatives, quantitative and qualitative ROI assessments should also include assessments of data quality. Poor data quality costs organizations vast sums in lost revenues. Defective data leads to breakdowns in the supply chain, poor business decisions, and inferior customer relationship management. Moreover, poor quality data can lead to failures in compliance with industry regulations and even to outright project failure at the IT level.

It is vital to acknowledge data quality issues at an early stage in the project. Consider a data integration project that is planned and resourced meticulously but that is undertaken on a dataset where the data is of a poorer quality than anyone realized. This can lead to the classic “code-load-explode” scenario, wherein the data breaks down in the target system due to a poor understanding of the data and metadata. What is worse, a data integration project can succeed from an IT perspective but deliver little if any business value if the data within the system is faulty. For example, a CRM system containing a dataset with a large quantity of redundant or inaccurate records is likely to be of little value to the business.

Often an organization does not realize it has data quality issues until it is too late. For this reason, data quality should be a consideration in ROI modeling for all data integration projects – from the beginning. For more details on how to quantify business value and associated data integration project cost, please see Assessing the Business Case.

Prerequisites
1.1.1 Establish Business Project Scope

Roles
Business Project Manager (Secondary)

Considerations
The Business Case must focus on business value and, as much as possible, quantify that value. The business beneficiaries are primarily responsible for assessing the project benefits, while technical considerations drive the cost assessments. These two assessments - benefits and costs - form the basis for determining overall ROI to the business.

Building the Business Case
Step 1 – Business Benefits
When creating your ROI model, it is best to start by looking at the expected business benefit of implementing the data integration solution. Common business imperatives include:
● Improving decision-making and ensuring regulatory compliance.
● Modernizing the business to reduce costs.
● Merging and acquiring other organizations.
● Increasing business profitability.
● Outsourcing non-core business functions to be able to focus on your company’s core value proposition.

Each of these business imperatives requires support via substantial IT initiatives. Common IT initiatives include:
● Business intelligence initiatives.
● Retirement of legacy systems.
● Application consolidation initiatives.
● Establishment of data hubs for customer, supplier, and/or product data.
● Business process outsourcing (BPO) and/or Software as a Service (SaaS).

For these IT initiatives to be successful, you must be able to integrate data from a variety of disparate systems. The form of those data integration projects may vary. You may have a:
● Data Warehousing project, which enables new business insight usually through business intelligence.
● Data Migration project, where data sources are moved to enable a new application or system.
● Data Consolidation project, where certain data sources or applications are retired in favor of another.
● Master Data Management project, where multiple data sources come together to form a more complex, master view of the data.
● Data Synchronization project, where data between two source systems need to stay perfectly consistent to enable different applications or systems.
● B2B Data Transformation project, where data from external partners is transformed to internal formats for processing by internal systems and responses are transformed back to partner appropriate formats.
● Data Quality project, where the goals are to cleanse data and to correct errors such as duplicates, missing information, mistyped information and other data deficiencies.

Once you have traced your data integration project back to its origins in the business imperatives, it is important to estimate the value derived from the data integration project. You can estimate the value by asking questions such as:
● What is the business goal of this project? Is this relevant?
● What are the business metrics or key performance indicators associated with this goal?
● How will the business measure the success of this initiative?
● How does data accessibility affect the business initiative? Does having access to all of your data improve the business initiative?
● How does data availability affect the business initiative? Does having data available when it’s needed improve the business initiative?
● How does data quality affect the business initiative? Does having good data quality improve the business initiative? Conversely, what is the potential negative impact of having poor data quality on the business initiative?
● How does data auditability affect the business? Does having an audit trail of your data improve the business initiative from a compliance perspective?
● How does data security affect the business? Does ensuring secure data improve the business initiative?

After asking the questions above, you can begin to express the business value of the data integration project as a monetary figure. Remember to estimate the business value not only over the first year after implementation, but also over the longer term. Most business cases and associated ROI models factor in expected business value for at least three years. If you are still struggling to estimate the business value of the data integration initiative, see the table below, which outlines common business value categories and how they relate to various data integration initiatives:

Columns: Business Value Category; Explanation; Typical Metrics; Data Integration Examples

INCREASE REVENUE

New Customer Acquisition
Explanation: Lower the costs of acquiring new customers
Typical Metrics:
- cost per new customer acquisition
- cost per lead
- # new customers acquired/month per sales rep or per office/store
Data Integration Examples:
- Marketing analytics
- Integration of third party data (from credit bureaus, directory services, salesforce.com, etc.)

Cross-Sell / Up-Sell
Explanation: Increase penetration and sales within existing customers
Typical Metrics:
- % cross-sell rate
- # products/customer
- % share of wallet
- customer lifetime value
Data Integration Examples:
- Single view of customer across all products, channels
- Marketing analytics & customer segmentation
- Customer lifetime value analysis

Sales and Channel Management
Explanation: Increase sales productivity, and improve visibility into demand
Typical Metrics:
- sales per rep or per employee
- close rate
- revenue per transaction
Data Integration Examples:
- Sales/agent productivity dashboard
- Sales & demand analytics
- Customer master data integration
- Demand chain synchronization

New Product / Service Delivery
Explanation: Accelerate new product/service introductions, and improve "hit rate" of new offerings
Typical Metrics:
- # new products launched/year
- new product/service launch time
- new product/service adoption rate
Data Integration Examples:
- Data sharing across design, development, production and marketing/sales teams
- Data sharing with third parties, e.g. contract manufacturers, channels, marketing agencies

Pricing / Promotions
Explanation: Set pricing and promotions to stimulate demand while improving margins
Typical Metrics:
- margins
- profitability per segment
- cost-per-impression, cost-per-action
Data Integration Examples:
- Cross-geography/cross-channel pricing visibility
- Differential pricing analysis and tracking
- Promotions effectiveness analysis

LOWER COSTS

Supply Chain Management
Explanation: Lower procurement costs, increase supply chain visibility, and improve inventory management
Typical Metrics:
- purchasing discounts
- inventory turns
- quote-to-cash cycle time
- demand forecast accuracy
Data Integration Examples:
- product master data integration
- demand analysis
- cross-supplier purchasing history

Production & Service Delivery
Explanation: Lower the costs to manufacture products and/or deliver services
Typical Metrics:
- production cycle times
- cost per unit (product)
- cost per transaction (service)
- straight-through-processing rate
Data Integration Examples:
- cross-enterprise inventory rollup
- scheduling and production synchronization

Logistics & Distribution
Explanation: Lower distribution costs and improve visibility into distribution chain
Typical Metrics:
- distribution costs per unit
- average delivery times
- delivery date reliability
Data Integration Examples:
- integration with third party logistics management and distribution partners

Invoicing, Collections and Fraud Prevention
Explanation: Improve invoicing and collections efficiency, and detect/prevent fraud
Typical Metrics:
- # invoicing errors
- DSO (days sales outstanding)
- % uncollectible
- % fraudulent transactions
Data Integration Examples:
- invoicing/collections reconciliation
- fraud detection

Financial Management
Explanation: Streamline financial management and reporting
Typical Metrics:
- End-of-quarter days to close
- Financial reporting efficiency
- Asset utilization rates
Data Integration Examples:
- Financial data warehouse/reporting
- Financial reconciliation
- Asset management/tracking

MANAGE RISK

Compliance Risk (e.g. SEC/SOX/Basel II/PCI)
Explanation: Prevent compliance outages to avoid investigations, penalties, and negative impact on brand
Typical Metrics:
- # negative audit/inspection findings
- probability of compliance lapse
- cost of compliance lapses (fines, recovery costs, lost business)
- audit/oversight costs
Data Integration Examples:
- Financial reporting
- Compliance monitoring & reporting

Financial/Asset Risk Management
Explanation: Improve risk management of key assets, including financial, commodity, energy or capital assets
Typical Metrics:
- errors & omissions
- probability of loss
- expected loss
- safeguard and control costs
Data Integration Examples:
- Risk management data warehouse
- Reference data integration
- Scenario analysis
- Corporate performance management

Business Continuity/Disaster Recovery Risk
Explanation: Reduce downtime and lost business, prevent loss of key data, and lower recovery costs
Typical Metrics:
- mean time between failure (MTBF)
- mean time to recover (MTTR)
- recovery time objective (RTO)
- recovery point objective (RPO; data loss)
Data Integration Examples:
- Resiliency and automatic failover/recovery for all data integration processes

Step 2 – Calculating the Costs
Now that you have estimated the monetary business value of the data integration project in Step 1, you need to calculate the costs associated with that project in Step 2. In most cases, the data integration project is inevitable – one way or another the business initiative is going to be accomplished – so it is best to compare two alternative cost scenarios. One scenario is implementing the data integration project with tools from Informatica; the other is implementing it without Informatica’s toolset. Some benchmarks supporting the case that Informatica lowers the total cost of ownership (TCO) of data integration and data quality projects are outlined below:

Benchmarks from Industry Analysts, Consultants, and Authors

Forrester Research, "The Total Economic Impact of Deploying Informatica PowerCenter", 2004
The average savings of using a data integration/ETL tool vs. hand coding:
• 31% in development costs
• 32% in operations costs
• 32% in maintenance costs
• 35% in overall project life-cycle costs

Gartner, "Integration Competency Center: Where Are Companies Today?", 2005
• The top-performing third of Integration Competency Centers (ICCs) will save an average of:
   • 30% in data interface development time and costs
   • 20% in maintenance costs
• The top-performing third of ICCs will achieve 25% reuse of integration components

Larry English, Improving Data Warehouse and Business Information Quality, Wiley Computer Publishing, 1999
• "The business costs of non-quality data, including irrecoverable costs, rework of products and services, workarounds, and lost and missed revenue may be as high as 10 to 25 percent of revenue or total budget of an organization."
• "Invalid data values in the typical customer database averages around 15 to 20 percent… Actual data errors, even though the values may be valid, may be 25 to 30 percent or more in those same databases."
• "Large organizations often have data redundantly stored 10 times or more."

Ponemon Institute, study of costs incurred by 14 companies that had security breaches affecting between 1,500 and 900,000 consumer records
• Total costs to recover from a breach averaged $14 million per company, or $140 per lost customer record
• Direct costs for incremental, out-of-pocket, unbudgeted spending averaged $5 million per company, or $50 per lost customer for outside legal counsel, mail notification letters, calls to individual customers, increased call center costs and discounted product offers
• Indirect costs for lost employee productivity averaged $1.5 million per company, or $15 per customer record
• Opportunity costs covering loss of existing customers and increased difficulty in recruiting new customers averaged $7.5 million per company, or $75 per lost customer record
• Overall customer loss averaged 2.6 percent of all customers and ranged as high as 11 percent
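
As a rough illustration of the two-scenario cost comparison, the Forrester percentages quoted above can be applied to a hand-coded cost baseline. The sketch below is not part of the Velocity methodology: the baseline dollar figures and the function names are hypothetical, and only the per-category savings rates come from the benchmark. The 35% overall life-cycle figure is left out because it is a separate aggregate, not a sum of the three categories.

```python
# Hypothetical hand-coded cost baseline in USD (illustrative assumption,
# not taken from the benchmarks).
HAND_CODED_BASELINE = {
    "development": 600_000,
    "operations": 250_000,
    "maintenance": 150_000,
}

# Average savings (in percent) of a data integration/ETL tool vs. hand
# coding, as quoted in the Forrester Research benchmark.
FORRESTER_SAVINGS_PCT = {
    "development": 31,
    "operations": 32,
    "maintenance": 32,
}


def tool_based_costs(baseline, savings_pct):
    """Reduce each baseline cost category by its savings percentage."""
    return {
        category: cost * (100 - savings_pct[category]) / 100
        for category, cost in baseline.items()
    }


def estimated_savings(baseline, savings_pct):
    """Return the total cost difference between the two scenarios."""
    reduced = tool_based_costs(baseline, savings_pct)
    return sum(baseline.values()) - sum(reduced.values())


if __name__ == "__main__":
    reduced = tool_based_costs(HAND_CODED_BASELINE, FORRESTER_SAVINGS_PCT)
    for category, cost in reduced.items():
        print(f"{category:<12} {cost:>12,.0f}")
    total = estimated_savings(HAND_CODED_BASELINE, FORRESTER_SAVINGS_PCT)
    print(f"{'savings':<12} {total:>12,.0f}")
```

In a real business case the baseline would come from the organization's own estimates, and the benchmark rates would be treated as industry averages rather than guarantees.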

In addition to lowering cost of implementing a data integration solution, Informatica adds value to the ROI model by mitigating risk in the data integration project. In order to quantify the value of risk mitigation, you should consider the cost of project overrun and the associated likelihood of overrun when using Informatica vs. when you don’t use Informatica for your data integration project. An example analysis of risk mitigation value is below:
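
One simple way to put a number on risk mitigation is to weight the cost of an overrun by its likelihood in each scenario and compare the results. The sketch below assumes this expected-value approach, which the source does not prescribe; the class name, function name, probabilities and dollar amounts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OverrunRisk:
    """Likelihood and cost of a project overrun for one scenario."""
    probability: float   # chance of an overrun, between 0 and 1
    overrun_cost: float  # cost in USD if the overrun happens

    def expected_cost(self) -> float:
        """Probability-weighted cost of the overrun."""
        if not 0 <= self.probability <= 1:
            raise ValueError("probability must be between 0 and 1")
        return self.probability * self.overrun_cost


def risk_mitigation_value(without_tool: OverrunRisk, with_tool: OverrunRisk) -> float:
    """Difference in expected overrun cost between the two scenarios."""
    return without_tool.expected_cost() - with_tool.expected_cost()


# Hypothetical figures for illustration only.
without_informatica = OverrunRisk(probability=0.40, overrun_cost=500_000)
with_informatica = OverrunRisk(probability=0.20, overrun_cost=300_000)

value = risk_mitigation_value(without_informatica, with_informatica)
print(f"Risk mitigation value: ${value:,.0f}")
```

The resulting figure can be added to the benefit side of the ROI model, provided the probability estimates are documented and agreed with the project stakeholders.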

Step 3 – Putting it all Together
Once you have calculated the three-year business/IT benefits and the three-year costs of using PowerCenter vs. not using PowerCenter, put all of this information into a format that is easy for IT and line-of-business executive management to read. The following is a sample summary of an ROI model:
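
A minimal sketch of how such a three-year summary might be computed is shown here. It assumes the common definition ROI = (total benefit - total cost) / total cost, which the source does not spell out, and every yearly figure is hypothetical. The `Scenario` class and `summary_lines` function are illustrative names, not part of any Informatica product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scenario:
    """Three-year benefit and cost estimates for one implementation option."""
    name: str
    yearly_benefits: List[int]  # USD per year
    yearly_costs: List[int]     # USD per year

    @property
    def total_benefit(self) -> int:
        return sum(self.yearly_benefits)

    @property
    def total_cost(self) -> int:
        return sum(self.yearly_costs)

    @property
    def net_benefit(self) -> int:
        return self.total_benefit - self.total_cost

    @property
    def roi(self) -> float:
        # Assumed definition: net benefit divided by total cost.
        return self.net_benefit / self.total_cost


def summary_lines(scenarios):
    """Format one header row plus one row per scenario."""
    header = f"{'Scenario':<22}{'Benefit':>12}{'Cost':>12}{'Net':>12}{'ROI':>8}"
    rows = [
        f"{s.name:<22}{s.total_benefit:>12,}{s.total_cost:>12,}"
        f"{s.net_benefit:>12,}{s.roi:>8.0%}"
        for s in scenarios
    ]
    return [header, *rows]


# Hypothetical three-year estimates for illustration only.
with_tool = Scenario(
    "With PowerCenter",
    yearly_benefits=[500_000, 800_000, 900_000],
    yearly_costs=[700_000, 200_000, 200_000],
)
without_tool = Scenario(
    "Without PowerCenter",
    yearly_benefits=[300_000, 700_000, 900_000],
    yearly_costs=[900_000, 300_000, 320_000],
)

for line in summary_lines([with_tool, without_tool]):
    print(line)
```

Keeping benefits and costs per year, rather than as single totals, makes it easy to extend the summary with payback period or discounted figures if executive management asks for them.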

For data migration projects it is frequently necessary to prove that using Informatica technology for the data migration efforts has benefits over traditional means. To prove the value, three areas should be considered:
1. Informatica Software can reduce the overall project timeline by accelerating migration development efforts.
2. Informatica delivered migrations will have lower risk due to ease of maintenance, less development effort, higher quality of data, and increased project management tools with the metadata driven solution.
3. Availability of lineage reports as to how the data was manipulated by the data migration process and by whom.

Best Practices
None

Sample Deliverables
None
Last updated: 20-May-08 19:09

Phase 1: Manage
Task 1.2 Plan and Manage Project
Description
This task incorporates the initial project planning and management activities as well as project management activities that occur throughout the project lifecycle. It includes the initial structure of the project team and the project work steps based on the business objectives and the project scope, and the continuing management of expectations through status reporting, issue tracking and change management.

Prerequisites
None

Roles
Business Project Manager (Primary)
Data Integration Developer (Secondary)
Data Quality Developer (Secondary)
Presentation Layer Developer (Secondary)
Project Sponsor (Approve)
Technical Architect (Primary)
Technical Project Manager (Primary)

Considerations
In general, project management activities involve reconciling trade-offs between business requests as to functionality and timing with technical feasibility and budget considerations. This often means balancing between sensitivity to project goals and concerns ("being a good listener") on the one hand, and maintaining a firm grasp of what is feasible ("telling the truth") on the other.

The tools of the trade, apart from strong people skills (especially interpersonal communication skills), are detailed documentation and frequent review of the status of the project effort against plan, of the unresolved issues, and of the risks regarding enlargement of scope ("change management"). Successful project management is predicated on regular communication of these project aspects with the project manager, and with other management and project personnel.

For data migration projects there is often a project management office (PMO) in place. The PMO is typically found in high-dollar, high-profile projects, such as implementing a new ERP system, that often cost millions of dollars. It is important to identify the roles and gain the understanding of the PMO as to how these roles are needed and will intersect with the broader system implementation. More specifically, these roles will have responsibility beyond the data migration, so the resource requirements for the data migration must be understood and guaranteed as part of the larger effort overseen by the PMO.

For B2B projects, technical considerations typically play an important role. The format of data received from partners (and replies sent to partners) forms a key consideration in overall business operations and has a direct impact on the planning and scoping of changes. Informatica recommends having the Technical Architect directly involved throughout the process.

Best Practices
None

Sample Deliverables
None
Last updated: 20-May-08 19:13

Phase 1: Manage
Subtask 1.2.1 Establish Project Roles
Description
This subtask involves defining the roles/skill sets that will be required to complete the project. This is a precursor to building the project team and making resource assignments to specific tasks.

Prerequisites
None

Roles
Business Project Manager (Primary)
Project Sponsor (Approve)
Technical Project Manager (Primary)

Considerations
The Business Project Scope established in 1.1.1 Establish Business Project Scope provides a primary indication of the required roles and skill sets. The following types of questions are useful discussion topics and help to validate the initial indicators:
● What are the main tasks/activities of the project and what skills/roles are needed to accomplish them? How complex or broad in scope are these tasks? This can indicate the level of skills needed.
● What responsibilities will fall to the company resources and which are offloaded to a consultant? Who (i.e. company resource or consultant) will provide the project management?
● Who will have primary responsibility for infrastructure requirements? ...for data architecture? ...for documentation? ...for testing? ...for deployment/training/support?
● How much development and testing will be involved?

This is a definitional activity and very distinct from the later assignment of resources. These roles should be defined as generally as possible rather than attempting to match a requirement with a resource at hand.

After the project scope and required roles have been defined, there is often pressure to combine roles due to limited funding or availability of resources. Some roles inherently provide a healthy balance with one another, and if one person fills both of these roles, project quality may suffer. The classic conflict is between development roles and highly procedural or operational roles. For example, a QA Manager, Test Manager, or Lead should not be the same person as a Project Manager or a member of the development team. The QA Manager is responsible for determining the criteria for acceptance of project quality and managing quality-related procedures. These responsibilities directly conflict with the developer's need to meet a tight development schedule. For similar reasons, development personnel are not ideal choices for filling such operational roles as Metadata Manager, DBA, Network Administrator, Repository Administrator, or Production Supervisor. Those roles require operational diligence and adherence to procedure as opposed to ad hoc development. When development roles are mixed with operational roles, the resulting "shortcuts" often lead to quality problems in production systems.

Tip: Involve the Project Sponsor.
Before defining any roles, be sure that the Project Sponsor is in agreement as to the project scope and major activities, as well as the level of involvement expected from company personnel and consultant personnel. If this agreement has not been explicitly reached, review the project scope with the Project Sponsor to resolve any remaining questions.
In defining the necessary roles, be sure to provide the Sponsor with a full description of all roles, indicating which will rely on company personnel and which will use consultant personnel. This sets clear expectations for company involvement and indicates whether additional roles need to be filled with consultant personnel if the company does not have personnel available in accordance with the project timing.

The Role Descriptions in Roles provide typical role definitions. The Project Role Matrix can serve as a starting point for completing the project-specific roles matrix.
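Once a draft role matrix exists, the separation-of-roles guidance above can be checked mechanically. The following is a minimal Python sketch, not part of Velocity: the role groupings are taken from the examples in this subtask, while the person names and the `roles_conflict` and `find_role_conflicts` helpers are hypothetical.

```python
from itertools import combinations

# Hypothetical role groupings based on the conflicts described in this subtask.
QA_ROLES = {"QA Manager", "Test Manager"}
DELIVERY_ROLES = {"Project Manager", "Developer"}
OPERATIONAL_ROLES = {"Metadata Manager", "DBA", "Network Administrator",
                     "Repository Administrator", "Production Supervisor"}


def roles_conflict(role_a, role_b):
    """Return True if one person should not hold both roles."""
    pair = {role_a, role_b}
    # QA/test leadership should be independent of project management and development.
    if pair & QA_ROLES and pair & DELIVERY_ROLES:
        return True
    # Development should not be mixed with operational roles.
    if "Developer" in pair and pair & OPERATIONAL_ROLES:
        return True
    return False


def find_role_conflicts(assignments):
    """Map each person to the list of conflicting role pairs they hold."""
    conflicts = {}
    for person, roles in assignments.items():
        bad = [(a, b) for a, b in combinations(sorted(roles), 2)
               if roles_conflict(a, b)]
        if bad:
            conflicts[person] = bad
    return conflicts


# Illustrative draft role matrix (names are made up).
draft = {
    "Alex": {"Developer", "DBA"},
    "Sam": {"QA Manager"},
    "Kim": {"Project Manager", "Test Manager"},
}
conflicts = find_role_conflicts(draft)
print(conflicts)
```

A check like this only flags combinations for discussion with the Project Sponsor; whether a combined role is acceptable remains a project decision.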


Best Practices
None

Sample Deliverables
Project Definition
Project Role Matrix
Work Breakdown Structure

Last updated: 01-Feb-07 18:43


Phase 1: Manage
Subtask 1.2.2 Develop Project Estimate
Description
Once the overall project scope and roles have been defined, details on project execution must be developed. These details should answer the questions of what must be done, who will do it, how long it will take, and how much it will cost. The objective of this subtask is to develop a complete WBS and, subsequently, a solid project estimate. Two important documents required for project execution are the:
● Work Breakdown Structure (WBS), which can be viewed as a list of tasks that must be completed to achieve the desired project results. (See Developing a Work Breakdown Structure (WBS) for more details.)
● Project Estimate, which, at this time, focuses solely on development costs without consideration for hardware and software liabilities.

Estimating a project is never an easy task, and often becomes more difficult as project visibility increases and there is an increasing demand for an "exact estimate". It is important to understand that estimates are never exact. However, estimates are useful for providing a close approximation of the level of effort required by the project. Factors such as project complexity, team skills, and external dependencies always have an impact on the actual effort required.

The accuracy of an estimate largely depends on the experience of the estimator (or estimators). For example, an experienced traveller who frequently travels the route between his/her home or office and the airport can easily provide an accurate estimate of the time required for the trip. When the same traveller is asked to estimate travel time to or from an unfamiliar airport, however, the estimation process becomes much more complex, requiring consideration of numerous factors such as distance to the airport, means of transportation, speed of available transportation, time of day that the travel will occur, expected weather conditions, and so on. The traveller can arrive at a valid overall estimate by assigning time estimates to each factor, then summing the whole. The resulting estimate, however, is not likely to be nearly as accurate as the one based on knowledge gained through experience. The same holds true for estimating the time and resources required to complete development on a data integration solution project.
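The factor-by-factor approach in the traveller analogy corresponds to a bottom-up estimate built from WBS tasks. The following Python sketch illustrates the arithmetic only; the task names, hour values, and the 20% contingency are hypothetical assumptions, not Velocity guidance.

```python
# Hypothetical WBS tasks with effort estimates in person-hours (illustrative values only).
wbs_estimates = {
    "Profile source data": 40,
    "Design mappings": 60,
    "Build mappings": 120,
    "Unit test": 50,
}


def total_effort(estimates, contingency=0.0):
    """Sum per-task estimates and add a contingency allowance (given as a fraction)."""
    base = sum(estimates.values())
    return base * (1 + contingency)


base_hours = total_effort(wbs_estimates)
# The 20% contingency is an assumption standing in for the extra
# uncertainty of an unfamiliar project.
padded_hours = total_effort(wbs_estimates, contingency=0.20)
print(base_hours, padded_hours)
```

An estimator with direct experience of similar projects might use a smaller allowance; the point is that the total is only as reliable as the individual task estimates and the completeness of the task list.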

Prerequisites
None

Roles
Business Project Manager (Primary)
Data Integration Developer (Secondary)
Data Quality Developer (Secondary)
Data Transformation Developer (Secondary)
Presentation Layer Developer (Secondary)
Project Sponsor (Approve)
Technical Architect (Secondary)
Technical Project Manager (Secondary)

Considerations
An accurate estimate depends greatly on a complete and accurate Work Breakdown Structure. Having the entire project team review the WBS when it is near completion helps to ensure that it includes all necessary project tasks. Project deadlines often slip because some tasks are overlooked and, therefore, not included in the initial estimates.

Sample Data Requirements for B2B Projects
For B2B projects (and non-B2B projects that have significant unstructured or semi-structured data transformation requirements), the actual creation and subsequent QA of transformations relies on having sufficient samples of input and output data, as well as specifications for the data formats.


When estimating for projects that use Informatica’s B2B Data Transformation, estimates should include sufficient time to allow for the collection and assembly of sample data, any cleansing of sample data required (for example to conform to HIPAA or financial privacy regulations), and for any data analysis or metadata discovery to be performed on the sample data. By their nature, the full authoring of B2B data transformations cannot be completed (or in some cases proceed) without the availability of adequate sample data both for input to transformations and for comparison purposes during the quality assurance process.

Best Practices
None

Sample Deliverables
None
Last updated: 20-May-08 19:17


Phase 1: Manage
Subtask 1.2.3 Develop Project Plan
Description
In this subtask, the Project Manager develops a schedule for the project using the agreed-upon business project scope to determine the major tasks that need to be accomplished and estimates of the amount of effort and resources required.

Prerequisites
None

Roles
Business Project Manager (Primary)
Project Sponsor (Approve)
Technical Project Manager (Secondary)

Considerations
The initial project plan is based on agreements-to-date with the Project Sponsor regarding project scope, estimation of effort, roles, project timelines and any understanding of requirements. Updates to the plan (as described in Developing and Maintaining the Project Plan) are typically based on changes to scope, approach, priorities, or simply on more precise determinations of effort and of start and/or completion dates as the project unfolds. In some cases, later phases of the project, like System Test (or "alpha"), Beta Test and Deployment, are represented in the initial plan as a single set of activities, and will be more fully defined as the project progresses. Major activities (e.g., System Test, Deployment, etc.) typically involve their own full-fledged planning processes once the technical design is completed. At that time, additional activities may be added to the project plan to allow for more detailed tracking of those project activities.


Perhaps the most significant message here is that an up-to-date plan is critical for satisfactory management of the project and for timely completion of its tasks. Keeping the plan updated as events occur and as the client's understanding, needs, and expectations change requires an ongoing effort. The sooner the plan is updated and changes are communicated to the Project Sponsor and/or company management, the less likely it is that expectations will be frustrated to a problematic level.

Best Practices
Data Migration Velocity Approach

Sample Deliverables
Project Roadmap
Work Breakdown Structure

Last updated: 01-Feb-07 18:43


Phase 1: Manage
Subtask 1.2.4 Manage Project
Description
In the broadest sense, project management begins before the project starts and continues until its completion and perhaps beyond. The management effort includes:
● Managing the project beneficiary relationship(s), expectations and involvement
● Managing the project team, its make-up, involvement, priorities, activities and schedule
● Managing all project issues as they arise, whether technical, logistical, procedural, or personal

In a more specific sense, project management involves being constantly aware of, or preparing for, anything that needs to be accomplished or dealt with to further the project objectives, and making sure that someone accepts responsibility for such occurrences and delivers in a timely fashion. Project management begins with pre-engagement preparation and includes:
● Project Kick-off, including the initial project scope, project organization, and project plan
● Project Status and reviews of the plan and scope
● Project Content Reviews, including business requirements reviews and technical reviews
● Change Management as scope changes are proposed, including changes to staffing or priorities
● Issues Management
● Project Acceptance and Close

Prerequisites
None


Roles
Business Project Manager (Primary)
Project Sponsor (Review Only)
Technical Project Manager (Primary)

Considerations
In all management activities and actions, the Project Manager must balance the needs and expectations of the Project Sponsor and project beneficiaries with the needs, limitations and morale of the project team. Limitations and specific needs of the team must be communicated clearly and early to the Project Sponsor and/or company management to mitigate unwarranted expectations and avoid an escalation of expectation-frustration that can have a dire effect on the project outcome. Issues that affect the ability to deliver in any sense, and potential changes to scope, must be brought to the Project Sponsor's attention as soon as possible and managed to satisfactory resolution.

In addition to "expectation management", project management includes Quality Assurance for the project deliverables. This involves soliciting specific requirements and subsequently reviewing the deliverables, which include, in addition to the data integration solution itself, documentation, user interfaces, knowledge-transfer and testing procedures.

Best Practices
None

Sample Deliverables
Issues Tracking
Project Review Meeting Agenda
Project Status Report
Scope Change Assessment


Last updated: 01-Feb-07 18:43


Phase 2: Analyze
2 Analyze
● 2.1 Define Business Drivers, Objectives and Goals
● 2.2 Define Business Requirements
    ○ 2.2.1 Define Business Rules and Definitions
    ○ 2.2.2 Establish Data Stewardship
● 2.3 Define Business Scope
    ○ 2.3.1 Identify Source Data Systems
    ○ 2.3.2 Determine Sourcing Feasibility
    ○ 2.3.3 Determine Target Requirements
● 2.6 Determine Technical Readiness
● 2.7 Determine Regulatory Requirements
● 2.8 Perform Data Quality Audit
    ○ 2.8.1 Perform Data Quality Analysis of Source Data
    ○ 2.8.2 Report Analysis Results to the Business


Phase 2: Analyze
Description
Increasingly, organizations demand faster, better, and cheaper delivery of data integration and business intelligence solutions. Many development failures and project cancellations can be traced to an absence of adequate upfront planning and scope definition. Inadequately defined or prioritized objectives and project requirements foster scenarios where project scope becomes a moving target as requirements may change late in the game, requiring repeated rework of design or even development tasks. The purpose of the Analyze Phase is to build a solid foundation for project scope through a deliberate determination of the business drivers, requirements, and priorities that will form the basis of the project design and development. Once the business case for a data integration or business intelligence solution is accepted and key stakeholders are identified, the process of detailing and prioritizing objectives and requirements can begin - with the ultimate goal of defining project scope and, if appropriate, a roadmap for major project stages.

Prerequisites
None

Roles
Application Specialist (Primary)
Business Analyst (Primary)
Business Project Manager (Primary)
Data Architect (Primary)
Data Integration Developer (Primary)
Data Quality Developer (Primary)
Data Steward/Data Quality Steward (Primary)
Database Administrator (DBA) (Primary)
Legal Expert (Primary)
Metadata Manager (Primary)
Project Sponsor (Secondary)
Security Manager (Primary)
System Administrator (Primary)
Technical Architect (Primary)
Technical Project Manager (Primary)

Considerations
Functional and technical requirements must focus on the business goals and objectives of the stakeholders, and must be based on commonly agreed-upon definitions of business information. The initial business requirements are then compared to feasibility studies of the source systems to help the prioritization process that will result in a project roadmap and rough timeline. This sets the stage for incremental delivery of the requirements so that some important needs are met as soon as possible, thereby providing value to the business even though there may be a much longer timeline to complete the entire project. In addition, during this phase it can be valuable to identify the available technical metadata as a way to accelerate the design and improve its quality. A successful Analyze Phase can serve as a foundation for a successful project.

Best Practices
None

Sample Deliverables
None


Last updated: 01-Feb-07 18:43


Phase 2: Analyze
Task 2.1 Define Business Drivers, Objectives and Goals
Description
In many ways, the potential for success of any data integration/business intelligence solution correlates directly to the clarity and focus of its business scope. If the business objectives are vague, there is a much higher risk of failure or, at least, of a less-than-direct path to likely limited success.

Business Drivers
The business drivers explain why the solution is needed and is being recommended at a particular time by identifying the specific business problems, issues, or increased business value that the project is likely to resolve or deliver. Business drivers may include background information necessary to understand the problems and/or needs. There should be clear links between the project’s business drivers and the company’s underlying business strategies.

Business Objectives
Objectives are concrete statements describing what the project is trying to achieve. Objectives should be explicitly defined so that they can be evaluated at the conclusion of a project to determine if they were achieved. Objectives written for a goal statement are nothing more than a deconstruction of the goal statement into a set of necessary and sufficient objective statements. That is, every objective must be accomplished to reach the goal, and no objective is superfluous. Objectives are important because they establish a consensus between the project sponsor and the project beneficiaries regarding the project outcome. The specific deliverables of an IT project, for instance, may or may not make sense to the project sponsor. However, the business objectives should be written so they are understandable by all of the project stakeholders.


Business Goals
Goal statements provide the overall context for what the project is trying to accomplish. They should align with the company's stated business goals and strategies. Project context is established in a goal statement by stating the project's object of study, its purpose, its quality focus, and its viewpoint. A well-defined goal should reference the project's business benefits in terms of cost, time, and/or quality.

Because goals are high-level statements, it may take more than one project to achieve a stated goal. If the goal's achievement can be measured, it is probably defined at too low a level and may actually be an objective. If the goal is not achievable through any combination of projects, it is probably too abstract and may be a vision statement.

Every project should have at least one goal. It is the agreement between the company and the project sponsor about what is going to be accomplished by the project. The goal provides focus and serves as the compass for determining if the project outcomes are appropriate. In the project management life cycle, the goal is bound by a number of objective statements. These objective statements clarify the fuzzy boundary of the goal statement. Taken as a pair, the goal and objectives statements define the project. They are the foundation for project planning and scope definition.

Prerequisites
None

Roles
Business Project Manager (Review Only)
Project Sponsor (Review Only)

Considerations
Business Drivers
The business drivers must be defined using business language. Identify how the project is going to resolve or address specific business problems. Key components when identifying business drivers include:
● Describe facts, figures, and other pertinent background information to support the existence of a problem.
● Explain how the project resolves or helps to resolve the problem in terms familiar to the business.
● Show any links to business goals, strategies, and principles.

Large projects often have significant business and technical requirements that drive the project's development. Consider explaining the origins of the significant requirements as a way of explaining why the project is needed.

Business Objectives
Before the project starts, define and agree on the project objectives and the business goals they define. The deliverables of the project are created based on the objectives, not the other way around. A meeting between all major stakeholders is the best way to create the objectives and gain a consensus on them at the same time. This type of meeting encourages discussion among participants and minimizes the amount of time involved in defining business objectives and goals. It may not be possible to gather all the project beneficiaries and the project sponsor together at the same time, so multiple meetings may have to be arranged and the results summarized.

While goal statements are designed to be vague, a well-worded objective is Specific, Measurable, Attainable/Achievable, Realistic and Time-bound (SMART).
● Specific: An objective should address a specific target or accomplishment.
● Measurable: Establish a metric that indicates that an objective has been met.
● Attainable: If an objective cannot be achieved, then it's probably a goal.
● Realistic: Limit objectives to what can realistically be done with available resources.
● Time-bound: Achieve objectives within a specified time frame.

At a minimum, make sure each objective contains four parts, as follows:
● An outcome - describe what the project will accomplish.
● A time frame - the expected completion date of the project.
● A measure - metric(s) that will measure success of the project.
● An action - how to meet the objective.
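The four parts listed above can be captured in a small record so that incomplete objectives are easy to spot during review. The following Python sketch is illustrative only; the `Objective` class, its field names, and the sample wording and dates are assumptions rather than a Velocity template.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Objective:
    """One business objective with the four minimum parts listed above."""
    outcome: str                 # what the project will accomplish
    time_frame: Optional[date]   # expected completion date
    measure: str                 # metric(s) that indicate success
    action: str                  # how the objective will be met

    def missing_parts(self):
        """Return the names of any parts that were left blank."""
        parts = {"outcome": self.outcome, "measure": self.measure,
                 "action": self.action}
        missing = [name for name, value in parts.items() if not value.strip()]
        if self.time_frame is None:
            missing.append("time_frame")
        return missing


# Illustrative objectives; the wording and date are made up.
complete = Objective(
    outcome="Reduce duplicate customer records in the CRM",
    time_frame=date(2009, 6, 30),
    measure="Duplicate rate below 1% of active customer records",
    action="Apply matching and consolidation rules during the nightly load",
)
incomplete = Objective(
    outcome="Improve address quality",
    time_frame=None,
    measure="",
    action="Standardize addresses on entry",
)
print(complete.missing_parts(), incomplete.missing_parts())
```

A completeness check of this kind does not judge whether an objective is realistic or attainable; that still requires stakeholder review.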

The business objectives should take into account the results of any data quality investigations carried out before or during the project. If the project source data quality is low, then the project's ability to achieve its objectives may be compromised. If the project has specific data-related objectives, such as regulatory compliance objectives, then a high degree of data quality may be an objective in its own right. For this reason, data quality investigations (such as a Data Quality Audit) should be carried out as early as is feasible in the project life-cycle. See 2.8 Perform Data Quality Audit.

Generally speaking, the number of objectives comes down to how much business investment is going to be made in pursuit of the project's goals. High investment projects generally have many objectives. Low investment projects must be more modest in the objectives they pursue. There is considerable discretion in how granular a project manager may get in defining objectives. High-level objectives generally need a more detailed explanation and often lead to more definition in the project's deliverables to obtain the objective. Lower level, detailed objectives tend to require less descriptive narrative and deconstruct into fewer deliverables to obtain.

Regardless of the number of objectives identified, the priority should be established by ranking the objectives with their respective impacts, costs, and risks.
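One simple way to rank objectives by impact, cost, and risk is to score each objective and sort on a combined value. The Python sketch below is a hypothetical illustration: the 1-to-5 scales, the sample objectives, and the scoring formula are all assumptions, and a real project would agree on its own weighting with the stakeholders.

```python
# Hypothetical objectives scored from 1 (low) to 5 (high); all values are illustrative.
objectives = [
    {"name": "Regulatory compliance reporting", "impact": 5, "cost": 3, "risk": 2},
    {"name": "Consolidated customer view", "impact": 4, "cost": 4, "risk": 4},
    {"name": "Faster month-end close", "impact": 3, "cost": 1, "risk": 1},
]


def priority_score(obj):
    """Assumed formula: higher impact raises priority; higher cost and risk lower it."""
    return obj["impact"] / (obj["cost"] + obj["risk"])


# Sort so that the highest-scoring objective comes first.
ranked = sorted(objectives, key=priority_score, reverse=True)
for rank, obj in enumerate(ranked, start=1):
    print(rank, obj["name"], round(priority_score(obj), 2))
```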

Business Goals
The goal statement must also be written in business language so that anyone who reads it can understand it without further explanation. The goal statement should:
● Be short and to the point.
● Provide overall context for what the project is trying to accomplish.
● Be aligned to business goals in terms of cost, time and quality.

Smaller projects generally have a single goal. Larger projects may have more than one goal, which should also be prioritized. Since the goal statement is meant to be succinct, regardless of the number of goals a project has, the goal statement should always be brief and to the point.

Best Practices
None

Sample Deliverables
None
Last updated: 18-May-08 17:36


Phase 2: Analyze
Task 2.2 Define Business Requirements
Description
A data integration/business intelligence solution development project typically originates from a company's need to provide management and/or customers with business analytics or to provide business application integration. As with any technical engagement, the first task is to determine clear and focused business requirements to drive the technology implementation. This requires determining what information is critical to support the project objectives and its relation to important strategic and operational business processes. Project success will be based on clearly identifying and accurately resolving these informational needs with the proper timing. The goal of this task is to ensure the participation and consensus of the project sponsor and key beneficiaries during the discovery and prioritization of these information requirements.

Prerequisites
None

Roles
Business Project Manager (Primary)
Data Quality Developer (Secondary)
Data Steward/Data Quality Steward (Primary)
Legal Expert (Approve)
Metadata Manager (Primary)
Project Sponsor (Approve)


Considerations
In a data warehouse/business intelligence project, there can be strategic or tactical requirements.

Strategic Requirements
● The customer's management is typically interested in strategic questions that often cover a significant timeframe. For example, "How has the turnover of product 'x' increased over the last year?" or "What is the revenue of area 'a' in January of this year as compared to last year?" Answers to strategic questions provide company executives with the information required to build on the company's strengths and/or to eliminate weaknesses.

Strategic requirements are typically implemented through a data warehouse type project with appropriate visualization tools.

Tactical Requirements
● The tactical requirements serve the day-to-day business. Operational-level employees want solutions that enable them to manage their ongoing work and solve immediate problems. For instance, a distributor running a fleet of trucks has an unavailable driver on a particular day. They would want to answer questions such as, "How can the delivery schedule be altered in order to meet the delivery time of the highest priority customer?" Answers to these questions are valid and pertinent for only a short period of time in comparison to the strategic requirements.

Tactical requirements are often implemented via operational data integration.

Best Practices
None

Sample Deliverables
None
Last updated: 02-May-08 12:05


Phase 2: Analyze
Subtask 2.2.1 Define Business Rules and Definitions
Description
A business rule is a compact and simple statement that represents some important aspect of a business process or policy. By capturing the rules of the business—the logic that governs its operation—systems can be created that are fully aligned with the needs of the organization. Business rules stem from the knowledge of business personnel and constrain some aspect of the business. From a technical perspective, a business rule expresses specific constraints on the creation, updating, and removal of persistent data in an information system. For example, a new bank account cannot be created unless the customer has provided an adequate proof of identification and address.

Prerequisites
None

Roles
Data Quality Developer (Secondary)
Data Steward/Data Quality Steward (Primary)
Legal Expert (Approve)
Metadata Manager (Primary)
Security Manager (Approve)

Considerations
Formulating business rules is an iterative process, often stemming from statements of policy in an organization. Rules are expressed in natural language. The following guidelines follow best practices and provide practical instructions on how to formulate business rules:
● Start with a well-defined and agreed-upon set of unambiguous definitions captured in a definitions repository. Re-use existing definitions if available.
● Use meaningful and precise verbs to connect the definitions captured above.
● Use standard expressions to constrain business rules, such as "must", "must not", "only if", "no more than", etc. For example, the total commission paid to broker ABC can be no more than xy% of the total revenue received for the sale of widgets.
● Use standard expressions for derivation business rules, such as "x is calculated from", "summed from", etc. For example, "the departmental commission paid is calculated as the total commission multiplied by the departmental rollup rate."

The aim is to define atomic business rules, that is, rules that cannot be decomposed further. Each atomic business rule is a specific, formal statement of a single term, fact, derivation, or constraint on the business. The components of business rules, once formulated, provide direct inputs to a subsequent conceptual data modeling and analysis phase. In this approach, definitions and connections can eventually be mapped onto a data model and constraints and derivations can be mapped onto a set of rules that are enforced in the data model.
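To show how atomic rules can later map onto enforceable logic, the two examples from the guidelines (a constraint and a derivation) can be sketched in Python. This is an illustration under stated assumptions: the function names are hypothetical, and the 10% cap is a placeholder because the text leaves the percentage unspecified ("xy%").

```python
from decimal import Decimal

# Placeholder cap; the source text gives the percentage only as "xy%".
MAX_COMMISSION_RATE = Decimal("0.10")


def commission_within_cap(total_commission, total_revenue, cap=MAX_COMMISSION_RATE):
    """Constraint rule: commission paid can be no more than cap * revenue."""
    return total_commission <= cap * total_revenue


def departmental_commission(total_commission, rollup_rate):
    """Derivation rule: total commission multiplied by the departmental rollup rate."""
    return total_commission * rollup_rate


# Illustrative amounts only.
ok = commission_within_cap(Decimal("900"), Decimal("10000"))
too_high = commission_within_cap(Decimal("1200"), Decimal("10000"))
dept = departmental_commission(Decimal("900"), Decimal("0.25"))
print(ok, too_high, dept)
```

Each function expresses exactly one rule, which mirrors the aim of keeping business rules atomic; `Decimal` is used here to avoid floating-point rounding in monetary amounts.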

Best Practices
None

Sample Deliverables
Business Requirements Specification

Last updated: 01-Feb-07 18:43


Phase 2: Analyze
Subtask 2.2.2 Establish Data Stewardship
Description
Data stewardship is about keeping the business community involved and focused on the goals of the project being undertaken. This subtask outlines the roles and responsibilities that key personnel can assume within the framework of an overall stewardship program. This participation should be regarded as ongoing because stewardship activities need to be performed at all stages of a project lifecycle and continue through the operational phase.

Prerequisites
None

Roles
Business Analyst (Secondary)
Business Project Manager (Primary)
Data Steward/Data Quality Steward (Secondary)
Project Sponsor (Approve)

Considerations
A useful mix of personnel to staff a stewardship committee may include:
● An executive sponsor
● A business steward
● A technical steward
● A data steward

Executive Sponsor


● Chair of the data stewardship committee
● Ultimate point of arbitration
● Liaison to management for setting and reporting objectives
● Should be recruited from project sponsors or management

Technical Steward
● Member of the data stewardship committee
● Liaison with technical community
● Reference point for technical-related issues and arbitration
● Should be recruited from the technical community with a good knowledge of the business and operational processes

Business Steward
● Member of the data stewardship committee
● Liaison with business users
● Reference point for business-related issues and arbitration
● Should be recruited from the business community

Data Steward
● Member of the data stewardship committee
● Balances data and quality targets set by the business with IT/project parameters
● Responsible for all issues relating to the data, including defining and maintaining business and technical rules and liaising with the business and technical communities
● Reference point for arbitration where data is put to different uses by separate groups of users whose requirements have to be reconciled

The mix of personnel for a particular activity should be adequate to provide expertise in each of the major business areas that will be undertaken in the project. The success of the stewardship function relies on the early establishment and distribution of standardized documentation and procedures. These should be distributed to all of the team members working on stewardship activities.


The data stewardship committee should be involved in the following activities:
● Arbitration
● Sanity checking
● Preparation of metadata
● Support

Arbitration
Arbitration means resolving data contention issues, deciding which is the best data to use, and determining how this data should best be transformed and interpreted so that it remains meaningful and consistent. This is particularly important during the phases where ambiguity needs to be resolved, for example, when conformed dimensions and standardized facts are being formulated by the analysis teams.

Sanity Checking
The data stewardship committee also has a role in checking results to ensure that the transformation rules and processes have been applied correctly. This is a key verification task and is particularly important in evaluating prototypes developed in the Analyze Phase, during testing, and after the project goes live.

Preparation of Metadata
The data stewardship committee should be actively involved in the preparation and verification of technical and business metadata. Specific tasks are:
● Determining the structure and contents of the metadata
● Determining how the metadata is to be collected
● Determining where the metadata is to reside
● Determining who is likely to use the metadata
● Determining what business benefits are provided
● Determining how the metadata is to be acquired

Depending on the tools used to determine the metadata (for example, PowerCenter Profiling option, Informatica Data Explorer), the Data Steward may take a lead role in this activity.

● Business metadata - The purpose of maintaining this type of information is to clarify context, aid understanding, and provide business users with the ability to perform high level searches for information. Business metadata is used to answer questions such as: “How does this division of the enterprise calculate revenue?”
● Technical metadata - The purpose of maintaining this type of information is for impact analysis, auditing, and source-target analysis. Technical metadata is used to perform analysis such as: “What would be the impact of changing the length of a field from 20 to 30 characters and what systems would be affected?”
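
The impact-analysis question above can be illustrated with a small sketch. It assumes, purely for illustration, that the technical metadata is available as a source-to-target field mapping; the field names, the LINEAGE structure, and both functions are hypothetical and do not represent any Informatica metadata API.

```python
from collections import deque

# Hypothetical technical metadata: each field maps to the fields loaded from it.
# All system, table, and field names are illustrative only.
LINEAGE = {
    "crm.customer.name": ["staging.customer.name"],
    "staging.customer.name": ["dw.dim_customer.name", "mart.mailing.name"],
    "dw.dim_customer.name": ["report.top_customers.name"],
}


def downstream_fields(lineage, changed_field):
    """Return every field fed by changed_field, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([changed_field])
    while queue:
        current = queue.popleft()
        for target in lineage.get(current, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


def affected_systems(lineage, changed_field):
    """Reduce the affected fields to system names (the text before the first dot)."""
    fields = downstream_fields(lineage, changed_field)
    return sorted({field.split(".")[0] for field in fields})
```

Calling affected_systems(LINEAGE, "crm.customer.name") lists the systems that would need review if the length of that field changed. A real metadata repository would hold far richer information (data types, lengths, transformation logic), so this is only a starting point for discussion.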

Support
The data stewardship committee should be involved in the inception and preparation of training of the user community by answering questions about data and the tools available to perform analytics. During the Analyze Phase the team would provide inputs to induction training programs prepared for system users when the project goes live. Such programs should include, for example, technical information about how to query the system and semantic information about the data that is retrieved.

New Functionality
The data stewardship committee needs to assess any major additions to functionality. The assessment should consider return on investment, priority, and scalability in terms of new hardware/software requirements. There may be a need to perform this activity during the Analyze Phase if functionality that was initially overlooked is to be included in the scope of the project. After the project has gone live, this activity is of key importance because new functionality needs to be assessed for ongoing development.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 17:55

Phase 2: Analyze
Task 2.3 Define Business Scope

Description
The business scope forms the boundary that defines where the project begins and ends. Throughout the project discussions about the business requirements and objectives, it may appear that everyone views the project scope in the same way. However, there is commonly confusion about what falls inside the boundary of a specific project and what does not. Developing a detailed project scope and socializing it with your project team, sponsors, and key stakeholders is critical.

Prerequisites
None

Roles
Informatica Velocity v6 (Primary)
Data Architect (Primary)
Data Integration Developer (Primary)
Data Quality Developer (Primary)
Data Steward/Data Quality Steward (Secondary)
Metadata Manager (Primary)
Project Sponsor (Secondary)
Technical Architect (Primary)
Technical Project Manager (Primary)

Considerations
The primary consideration in developing the business scope is balancing the high-priority needs of the key beneficiaries with the need to provide results in the near term. The Project Manager and Business Analysts need to determine the key business needs and the feasibility of meeting them in order to establish a scope that provides value, typically within a 60 to 120 day time frame.

Quick WINS (Ways to Implement New Solutions) are accomplishments achieved in a relatively short time, without great expense, and with a positive outcome. They can be included in the business scope.

Tip
As a general rule, involve as many project beneficiaries as possible in the needs assessment and goal definition. A "forum" type of meeting may be the most efficient way to gather the necessary information since it minimizes the amount of time involved in individual interviews and often encourages useful dialog among the participants. However, it is often difficult to gather all of the project beneficiaries and the project sponsor together for any single meeting, so you may have to arrange multiple meetings and summarize the input for the various participants.

A common mistake made by project teams is to define the project scope only in general terms. This lack of definition leads managers and key beneficiaries throughout the company to make assumptions about whether their own processes or systems fall inside or outside the scope of the project. Later, after the project team has completed significant work, some managers are surprised to learn that their assumptions were not correct, resulting in problems for the project team. Other project teams report problems with "scope creep" as their project gradually takes on more and more work. The safest rule is “the more detail, the better,” including details of which related elements are not within scope or will be deferred to a later effort.

Best Practices
None

Sample Deliverables
None
Last updated: 18-May-08 17:35

Phase 2: Analyze
Subtask 2.3.1 Identify Source Data Systems

Description
Before beginning any work with the data, it is necessary to determine precisely what data is required to support the data integration solution. The developers must also determine what source systems house the data, where the data resides in the source systems, and how the data is accessed.

In this subtask, the development project team needs to validate the initial list of source systems and source formats and obtain documentation from the source system owners describing the source system schemas. For relational systems, the documentation should include Entity-Relationship diagrams (E-R diagrams) and data dictionaries, if available. For file-based data sources (e.g., unstructured, semi-structured, and complex XML), documentation may also include data format specifications for both internal and public formats (the latter in the case of open data format standards) and any deviations from public standards.

The development team needs to carefully review the source system documentation to ensure that it is complete (i.e., specifies data owners and dependencies) and current. The team also needs to ensure that the data is fully accessible to the developers and analysts who are building the data integration solution.

Prerequisites
None

Roles
Business Analyst (Primary)
Data Architect (Primary)
Data Integration Developer (Primary)
Data Quality Developer (Primary)
Data Transformation Developer (Primary)

Considerations
In determining the source systems for data elements, it is important to request copies of the source system data to serve as samples for further analysis. This is a requirement in 2.8.1 Perform Data Quality Analysis of Source Data, but is also important at this stage of development. As data volumes in the production environment are often large, it is advisable to request a subset of the data for evaluation purposes. However, requesting too small a subset can be dangerous in that it fails to provide a complete picture of the data and may hide quality issues that exist.

Another important element of the source system analysis is to determine the life expectancy of the source system itself. Try to determine if the source system is likely to be replaced or phased out in the foreseeable future. As companies merge, or technologies and processes improve, many companies upgrade or replace their systems. This can present challenges to the team as the primary knowledge of those systems may be replaced as well. Understanding the life expectancy of the source system plays a crucial part in the design process.

For example, assume you are building a customer data warehouse for a small bank. The primary source of customer data is a system called Shucks, and you will be building a staging area in the warehouse to act as a landing area for all of the source data. After your project starts, you discover that the bank is being bought out by a larger bank and that Shucks will be replaced within three months by the larger bank's source of customer data: a system called Grins. Instead of having to redesign your entire data warehouse to handle the new source system, it may be possible to design a generic staging area that could fit any customer source system instead of building a staging area based on one specific source system.

Assuming that the bulk of your processing occurs after the data has landed in the staging area, you can minimize the impact of replacing source systems by designing a generic staging area that would essentially allow you to plug in the new source system. Designing this type of staging area takes a large amount of planning and adds time to the schedule, but is well worth the effort because the warehouse is then able to handle source system changes.

For Data Migration, the source systems that are in scope should be understood at the start of the project. During the Analyze Phase these systems should be confirmed and communicated to all key stakeholders. If there is a disconnect about which systems are in and out of scope, it is important to document and analyze the impact. Identifying new source systems may dramatically increase the amount of resources needed on the project and require re-planning. Make a point of over-communicating which systems are in scope.
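
The generic staging area idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed design: the StagedCustomer columns and the column layouts assumed for the Shucks and Grins systems (CUST_NO, first, postcode, and so on) are hypothetical and were invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class StagedCustomer:
    """Source-independent staging record; the columns are illustrative."""
    source_system: str
    customer_id: str
    full_name: str
    postal_code: str


def from_shucks(row: Dict[str, str]) -> StagedCustomer:
    # Assumed Shucks layout: CUST_NO, CUST_NAME, ZIP (hypothetical).
    return StagedCustomer("Shucks", row["CUST_NO"], row["CUST_NAME"].strip(), row["ZIP"])


def from_grins(row: Dict[str, str]) -> StagedCustomer:
    # Assumed Grins layout: id, first, last, postcode (hypothetical).
    full_name = f'{row["first"]} {row["last"]}'.strip()
    return StagedCustomer("Grins", str(row["id"]), full_name, row["postcode"])


# Plugging in a new source system means registering one more adapter here.
ADAPTERS: Dict[str, Callable[[Dict[str, str]], StagedCustomer]] = {
    "Shucks": from_shucks,
    "Grins": from_grins,
}


def stage(source_system: str, rows: Iterable[Dict[str, str]]) -> List[StagedCustomer]:
    """Convert raw source rows into the generic staging layout."""
    adapter = ADAPTERS[source_system]
    return [adapter(row) for row in rows]
```

Because all processing after staging reads only StagedCustomer records, replacing Shucks with Grins in this sketch touches the adapter table and nothing downstream.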

Best Practices
None

Sample Deliverables
None
Last updated: 20-May-08 19:28

Phase 2: Analyze
Subtask 2.3.2 Determine Sourcing Feasibility

Description
Before beginning to work with the data, it is necessary to determine precisely what data is required to support the data integration solution. In addition, the developers must determine:
● what source systems house the data.
● where the data resides in the source systems.
● how the data is accessed.

Take care to focus only on data that is within the scope of the requirements. Involvement of the business community is important in order to prioritize the business data needs based upon how effectively the data supports the users' top priority business problems. Determining sourcing feasibility is a two-stage process, requiring:
● A thorough and high-level understanding of the candidate source systems.
● A detailed analysis of the data sources within these source systems.

Prerequisites
None

Roles
Application Specialist (Primary)
Business Analyst (Primary)
Data Architect (Primary)
Data Quality Developer (Primary)
Metadata Manager (Primary)

Considerations
In determining the source systems for data elements, it is important to request copies of the source system data to serve as samples for further analysis. Because data volumes in the production environment are often large, it is advisable to request a subset of the data for evaluation purposes. However, requesting too small a subset can be dangerous in that it fails to provide a complete picture of the data and may hide quality issues that exist.

Particular care needs to be taken when archived historical data (e.g., data archived on tapes) or syndicated data sets (i.e., externally provided data such as market research) are required as a source to the data integration application. Additional resources and procedures may be required to sample and analyze these data sources.

Candidate Source System Analysis
A list of business data sources should have been prepared during the business requirements phase. This list typically identifies 20 or more types of data that are required to support the data integration solution and may include, for example, sales forecasts, customer demographic data, product information (e.g., categories and classifiers), and financial information (e.g., revenues, commissions, and budgets). The candidate source systems (i.e., where the required data can be found) can be identified based on this list. There may be a single source or multiple sources for the required data. Types of source include:
● Operational sources - The systems an organization uses to run its business. These may be any combination of ERP and legacy operational systems.
● Strategic sources - The data may be sourced from existing strategic decision support systems; for example, executive information systems.
● External sources - Any information source provided to the organization by an external entity, such as Nielsen marketing data or Dun & Bradstreet.

The following checklist can help to evaluate the suitability of data sources, which can be particularly important for resolving contention amongst the various sources.
● Infrastructure.
● Services.
● Networks.
● Hardware, software, operational limitations.
● Best Practices.
● Migration strategies.
● External data sources.
● Security criteria.

For B2B solutions, solutions with significant file-based data sources, and other solutions with complex data transformation requirements, it is also necessary to assess data sizes, volumes, and the frequency of data updates with respect to the ability to parse and transform the data, along with the implications for hardware and software requirements.

technical content, and business meaning of the source data. A complete set of technical documentation and application source code should be available for this step. Documentation should include Entity-Relationship diagrams (E-R diagrams) for the source systems; these diagrams then serve as the blueprints for extracting data from the source systems.

It is important not to rely solely on the technical documentation to obtain accurate descriptions of the source data, since this documentation may be out of date and inaccurate. Data profiling is a useful technique to determine the structure and integrity of the data sources, particularly when used in conjunction with the technical documentation. The data profiling process involves analyzing the source data, taking an inventory of available data elements, and checking the format of those data elements.
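
The profiling steps described above (inventory the data elements, then check their formats) can be sketched in a few lines. This is an illustrative stand-in, not a description of how the Informatica profiling tools work; the function names and the format-mask convention (9 for a digit, A for a letter) are assumptions made for the example.

```python
import re
from collections import Counter


def format_pattern(value):
    """Reduce a value to a format mask: ASCII digits become 9, ASCII letters become A."""
    masked = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", masked)


def profile(rows):
    """Inventory the data elements in rows (a list of dicts) and summarize each one."""
    columns = sorted({name for row in rows for name in row})
    report = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in (None, "")]
        report[name] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            # Count how many populated values share each format mask.
            "patterns": Counter(format_pattern(str(v)) for v in present),
        }
    return report
```

A postal code column that mostly matches the mask 99999 but occasionally shows 9999 is the kind of discrepancy between documentation and actual data that profiling is meant to surface.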

Determine Source Availability
The next step is to determine when all source systems are likely to be available for data extraction. This is necessary in order to determine realistic start and end times for the load window. The developers need to work closely with the source system administrators during this step because the administrators can provide specific information about the hours of operation for their systems.

The Source Availability Matrix lists all the sources that are being used for data extraction and specifies the systems' downtimes during a 24-hour period. This matrix should contain details of the availability of the systems on different days of the week, including weekends and holidays.

For Data Migration projects, access to data is not normally a problem given the premise of the solution. Typically, data migration projects have high-level sponsorship and whatever is needed is provided. However, for smaller-impact projects it is important that direct access is provided to all systems that are in scope. If direct access is not available, timelines should be extended and risk items should be added to the project. Historically, most projects without direct access run over time because key resources are not available to provide extracted data. Where providing direct access can prevent this, it should be provided.
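
As a rough illustration of how a Source Availability Matrix can feed the load window decision, the sketch below finds the longest stretch of hours on a given day when every source is available. The matrix layout (system name, then day, then a list of downtime intervals in whole hours), the sample systems, and the function names are all assumptions made for this example; windows that cross midnight are not handled.

```python
# Hypothetical matrix: system -> day -> list of (start_hour, end_hour) downtimes,
# with end_hour exclusive. The systems and times are illustrative only.
MATRIX = {
    "billing": {"Mon": [(0, 2)], "Sat": [(0, 24)]},
    "crm": {"Mon": [(22, 24)]},
    "orders": {"Mon": [(1, 4), (12, 13)]},
}


def available_hours(downtimes):
    """Hours of the day (0-23) not covered by any downtime interval."""
    down = set()
    for start, end in downtimes:
        down.update(range(start, end))
    return set(range(24)) - down


def common_window(matrix, day):
    """Longest run of consecutive hours on `day` when every source is available.

    Returns (start_hour, end_hour) with end exclusive, or None if no hour is shared.
    """
    shared = set(range(24))
    for schedule in matrix.values():
        shared &= available_hours(schedule.get(day, []))
    best = None
    run_start = None
    for hour in range(25):  # hour 24 is a sentinel that closes a run ending at midnight
        if hour in shared:
            if run_start is None:
                run_start = hour
        elif run_start is not None:
            if best is None or hour - run_start > best[1] - best[0]:
                best = (run_start, hour)
            run_start = None
    return best
```

With the sample matrix, Monday yields two candidate windows and the function returns the longer one, while Saturday yields none because one source is down all day. Whether the longest shared window is actually the right load window also depends on load duration and business priorities, which this sketch ignores.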

Determine File Transformation Constraints
For solutions with complex data transformation requirements, the final step is to determine the feasibility of transforming the data to target formats and any implications this will have on the eventual system design. Very large flat file formats often require splitting processes to be introduced into the design in order to split the data into manageably sized chunks for subsequent processing. This requires identifying appropriate boundaries for splitting and may require additional steps to convert the data into formats that are suitable for splitting. For example, large PDF-based data sources may require conversion into some other format, such as XML, before the data can be split.
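
The point about splitting at appropriate boundaries can be shown with a short sketch. It assumes a flat file in which each logical record starts with a recognizable line (here, hypothetically, a line beginning with "HDR") and may span several lines; the function name and the record layout are illustrative assumptions.

```python
def split_at_boundaries(lines, is_record_start, max_records):
    """Yield chunks (lists of lines) that each hold at most max_records whole records.

    A chunk is only closed when a new record begins, so no record is ever
    divided between two chunks.
    """
    if max_records < 1:
        raise ValueError("max_records must be at least 1")
    chunk = []
    records_in_chunk = 0
    for line in lines:
        if is_record_start(line):
            if records_in_chunk == max_records:
                yield chunk
                chunk = []
                records_in_chunk = 0
            records_in_chunk += 1
        chunk.append(line)
    if chunk:
        yield chunk
```

Because the function consumes any iterable of lines, it can be fed an open file object and will hold only one chunk in memory at a time. Formats without a line-level boundary marker would first need the kind of conversion step described above.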

Best Practices
None

Sample Deliverables
None
Last updated: 20-May-08 19:37

Phase 2: Analyze
Subtask 2.3.3 Determine Target Requirements

Description
This subtask provides detailed business requirements that lead to design of the target data structures for a data integration project. For Operational Data Integration projects, this may involve identifying a subject area or transaction set within an existing operational schema or a new data store. For Data Warehousing / Business Intelligence projects, this typically involves putting some structure to the informational requirements. The preceding business requirements tasks (see Prerequisites) provide a high-level assessment of the organization's business initiative and provide business definitions for the information desired. Note that if the project involves enterprise-wide data integration, it is important that the requirements process involve representatives from all interested departments and that those parties reach a semantic consensus early in the process.

Prerequisites
None

Roles
Application Specialist (Secondary)
Business Analyst (Primary)
Data Architect (Primary)
Data Steward/Data Quality Steward (Secondary)
Data Transformation Developer (Secondary)
Metadata Manager (Primary)
Technical Architect (Primary)

Considerations

Operational Data Integration
For an operational data integration project, requirements should be based on existing or defined business processes. However, for data warehousing projects, strategic information needs must be explored to determine the metrics and dimensions desired.

Metrics
Metrics should indicate an actionable business measurement. An example for a consultancy might be: "Compare the utilization rate of consultants for period x, segmented by industry, for each of the major geographies as compared to the prior period."

Often a mix of financial (e.g., budget targets) and operational (e.g., trends in customer satisfaction) key performance metrics is required to achieve a balanced measure of organizational performance. The key performance metrics may be directly sourced from an existing operational system or may require integration of data from various systems. Market analytics may indicate a requirement for metrics to be compared to external industry performance criteria.

The key performance metrics should be agreed upon through a consensus of the business users to provide common and meaningful definitions. This facilitates the design of processes to treat source metrics that may arrive in a variety of formats from various source systems.

Dimensions
The key to determining dimension requirements is to formulate a business-oriented description for the segmentation requirements for each of the desired metrics. This may involve an iterative process of interaction with the business community during requirements gathering sessions, paying attention to words such as “by” and “where”.

For example, a Pay-TV operator may be interested in monitoring the effectiveness of a new campaign geared at enrolling new subscribers. In a simple case, the number of new subscribers would be an essential metric; it may, however, be important to the business community to perform an analysis based on the dimensions (e.g., by demography, by group, or by time).

A technical consideration at this stage is to understand whether the dimensions are likely to be rapidly changing or slowly changing, since this can affect the structure of an eventual data model built from this analysis. Rapidly changing dimensions are those whose values may change frequently over their lifecycle (e.g., a customer attribute that changes many times a year), as opposed to a slowly changing dimension such as an organization, which may only change when a reorganization occurs.

It is also important at this stage to determine as many likely summarization levels of a dimension as possible. For example, time may have a hierarchical structure comprising year, quarter, month, and day, while geography may be broken down into Major Region, Area, Subregion, etc. It is also important to clarify the lowest level of detail that is required for reporting.

The metric and dimension requirements should be prioritized according to perceived business value to aid in the discussion of project scope in case there are choices to make regarding what to include or exclude.
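
To make the idea of summarization levels concrete, the sketch below rolls a daily count of new subscribers up the year, quarter, month, and day hierarchy mentioned above. The function names, the key formats (such as "2008-Q2"), and the sample figures used to exercise it are assumptions for illustration only.

```python
from collections import Counter
from datetime import date


def time_levels(day):
    """Map a date to its key at each level of the year > quarter > month > day hierarchy."""
    quarter = (day.month - 1) // 3 + 1
    return {
        "year": f"{day.year}",
        "quarter": f"{day.year}-Q{quarter}",
        "month": f"{day.year}-{day.month:02d}",
        "day": day.isoformat(),
    }


def roll_up(daily_counts, level):
    """Sum a {date: count} mapping at the requested level of the time hierarchy."""
    totals = Counter()
    for day, count in daily_counts.items():
        totals[time_levels(day)[level]] += count
    return dict(totals)
```

Storing the metric at the lowest required level of detail (here, the day) keeps every higher summarization level derivable, which is why it helps to settle that lowest level during requirements gathering.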

Data Migration Projects
Data migration projects should be exclusively driven by the target system needs, not by what is available in the source systems. Therefore, it is recommended to identify the target system needs early in the Analyze Phase and focus the analysis activities on those objects.

B2B Projects
For B2B and non B2B projects that have significant flat file based data targets, consideration needs to be given to the target data to be generated. Considerations include:
● What are target file and data formats?
● What sizes of target files need to be supported? Will they require recombination of multiple intermediate data formats?
● Are there applicable intermediate or target canonical formats that can be created or leveraged?
● What XML schemas are needed to support the generation of the target formats?
● Do target formats conform to well known proprietary or open data format standards?
● Does target data generation need to be accomplished within specific time or other performance related thresholds?
● How are errors both in data received and in overall B2B operation communicated back to the internal operations staff and to external trading partners?
● What mechanisms are used to send data back to external partners?
● What applicable middleware, communications and enterprise application software is used in the overall B2B operation? What data transformation implications does the choice of middleware and infrastructure software impose?
● How is overall B2B interaction governed? What process flows are involved in the system and how are they managed (for example via B2B Data Exchange, external BPM software etc.)?
● Are there machine readable specifications that can be leveraged directly or on modification to support “Specification driven transformation” based creation of data transformation scripts?
● Is sample data available for testing and verification of any data transformation scripts created?
At a higher level, the number and complexity of data sources, the number and complexity of data targets and the number and complexity of intermediate data formats and schemas determine the overall scope of the data transformation and integration aspects of B2B data integration projects as a whole.

Best Practices
None

Sample Deliverables
None
Last updated: 20-May-08 19:46

Phase 2: Analyze
Task 2.6 Determine Technical Readiness

Description
The goal of this task is to determine the readiness of an IT organization with respect to its technical architecture, implementation of said architecture, and the associated staffing required to support the technical solution. Conducting this analysis, through interviews with the existing IT team members (such as those noted in the Roles section), provides evidence as to whether or not the critical technologies and associated support system are sufficiently mature as to not present significant risk to the endeavor.

Prerequisites
None

Roles
Business Project Manager (Primary)
Database Administrator (DBA) (Primary)
System Administrator (Primary)
Technical Architect (Primary)

Considerations
Carefully consider the following questions when evaluating the technical readiness of a given enterprise:
● Has the architecture team been staffed and trained in the assessment of critical technologies?
● Have all of the decisions been made regarding the various components of the infrastructure, including: network, servers, and software?
● Has a schedule been established regarding the ordering, installing, and deployment of the servers and network?
● If in place, what are the availability, capacity, scalability, and reliability of the infrastructure?
● Has the project team been fully staffed and trained, including but not limited to: a Project Manager, Technical Architect, System Administrator, Developer(s), and DBA(s)? (See 1.2.1 Establish Project Roles).
● Are proven implementation practices and approaches in place to ensure a successful project? (See 2.5.3 Assess Technical Strategies and Policies).
● Has the Technical Architect evaluated and verified the Informatica PowerCenter Quickstart configuration requirements?
● Has the repository database been installed and configured?

By gaining a better understanding of questions such as these, developers can achieve a clearer picture of whether or not that organization is sufficiently ready to move forward with the project effort. This information also helps to develop a more accurate and reliable project plan.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:44

Phase 2: Analyze
Task 2.7 Determine Regulatory Requirements

Description
Many organizations must now comply with a range of regulatory requirements such as financial services regulation, data protection, Sarbanes-Oxley, retention of data for potential criminal investigations, and interchange of data between organizations. Some industries may also be required to complete specialized reports for government regulatory bodies. This can mean prescribed reporting, detailed auditing of data, and specific controls over actions and processing of the data.

These requirements differ from the "normal" business requirements in that they are imposed by legislation and/or external bodies. The penalties for not precisely meeting the requirements can be severe. However, there is a "carrot and stick" element to regulatory compliance. Regulatory requirements and industry standards can also present the business with an opportunity to improve its data processes and update the quality of its data in key areas. Successful compliance (for example, in the banking sector, with the Basel II Accord) brings the potential for more productive and profitable uses of data.

As data is prepared for the later stages in a project, the project personnel must establish what government or industry standards the project data must adhere to and devise a plan to meet these standards. These steps include establishing a catalog of all reporting and auditing required, including any prescribed content, formats, processes, and controls. The definitions of content (e.g., inclusion/exclusion rules, timescales, units, etc.) and any metrics or calculations are likely to be particularly important.

Prerequisites
None

Roles
Business Analyst (Primary)
Business Project Manager (Review Only)
Legal Expert (Primary)

Considerations
Areas where requirements arise include the following:
● Sarbanes-Oxley regulations in the U.S. mean a proliferation of controls on processes and data. Developers need to work closely with an organization’s Finance Department to ascertain exactly how Sarbanes-Oxley affects the project. There may be implications for how environments are set up and controls for migration between environments (e.g., between Development, Test, and Production), as well as for sign-offs, specified verification, etc.
● Another regulatory system applicable to financial companies is the Basel II Accord. While Basel II does not have the force of law, it is a de facto requirement within the international financial community.
● Other industries are demanding adherence to new data standards, both communally, by coming together around common data models such as bar codes and RFID (radio frequency identification), and individually, as enterprises realize the benefits of synchronizing their data storage conventions with suppliers and customers. Such initiatives are sometimes gathered under the umbrella of Global Data Synchronization (GDS); the key benefit of GDS is that it is not a compliance chore but a positive and profitable initiative for a business.

If your project must comply with a government or industry regulation, or if the business simply insists on high standards for its data (for example, to establish a “single version of the truth” for items in the business chain), then you must increase your focus on data quality in the project. Task 2.8 Perform Data Quality Audit provides the project stakeholders with a detailed picture of the strengths and weaknesses of the project data in key compliance areas such as accuracy, completeness, and duplication. For example, compliance with a request for data under Section 314 of the USA PATRIOT Act is likely to be difficult for a business that finds it has large numbers of duplicate records, or records that contain empty fields, or fields populated with default values. Such problems should be identified and addressed before the data is moved downstream in the project.

Regulatory requirements often require the ability to clearly audit the processes affecting the data. This may require a metadata reporting system that can provide viewing and reporting of data lineage and ‘where-used.’ Remember, such a system can produce spin-off benefits for IT in terms of automated project documentation and impact analysis.

Industry and regulatory standards for data interchange may also affect data model and ETL designs. HIPAA and HL7 compliance may dictate transaction definitions that affect healthcare-related projects, as may SWIFT or Basel II for finance-related data.

Potentially there are now two areas to investigate in more detail: data and metadata.
● Map the requirements back to the data and/or metadata required using a standard modeling approach.
● Use data models and the metadata catalog to assess the availability and quality of the required data and metadata. Use the data models of the systems and data sources involved, along with the inventory of metadata.
● Verify that the target data models meet the regulatory requirements.

Processes and Auditing Controls
It is important that data can be audited at every stage of processing where it is necessary. To this end, review any proposed processes and audit controls to verify that the regulatory requirements can be met and that any gaps are filled. Also, ensure that reporting requirements can be met, again filling any gaps. It is important to check that the format, content, and delivery mechanisms for all reports comply with the regulatory requirements.

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 18:13

Phase 2: Analyze
Task 2.8 Perform Data Quality Audit
Description
Data Quality is a key factor for several tasks and subtasks in the Analyze Phase. The quality of the proposed project source data, in terms of both its structure and content, is a key determinant of the specifics of the business scope and of the success of the project in general. For information on issues relating primarily to data structure, see subtask 2.3.2 Determine Sourcing Feasibility; this task focuses on the quality of the data content.

Problems with the data content must be communicated to senior project personnel as soon as they are discovered. Poor data quality can impede the proper execution of later steps in the project, such as data transformation and load operations, and can also compromise the business's ability to generate a return on the project investment. This is compounded by the fact that most businesses underestimate the extent of their data quality problems. There is little point in performing a data warehouse, migration, or integration project if the underlying data is in bad shape.

The Data Quality Audit is designed to analyze representative samples of the source data and discover their data quality characteristics so that these can be articulated to all relevant project personnel. The project leaders can then decide what actions, if any, are necessary to correct data quality issues and ensure that the successful completion of the project is not in jeopardy.

Prerequisites
None

Roles
Business Project Manager (Secondary)
Data Quality Developer (Primary)
Data Steward/Data Quality Steward (Primary)
Technical Project Manager (Secondary)

Considerations
The Data Quality Audit can typically be conducted very quickly, but the actual time required is determined by the starting condition of the data and the success criteria defined at the beginning of the audit. The main steps are as follows:
● Representative samples of source data from all main areas are provided to the Data Quality Developer.
● The Data Quality Developer uses a data analysis tool to determine the quality of the data according to several criteria.
● The Data Quality Developer generates summary reports on the data and distributes these to the relevant roles for discussion and next steps.

Two important aspects of the audit are (1) the data quality criteria used, and (2) the type of report generated.

Data Quality Criteria
You can define any number and type of criteria for your data quality. However, there are six standard criteria:
● Accuracy is concerned with whether the data values in a dataset are correct. It is often determined by comparing the dataset with a reliable reference source, for example, a dictionary file containing product reference data.
● Completeness is concerned with missing data, that is, fields in the dataset that have been left empty or whose default values have been left unchanged. For example, many data input fields have a default date setting of 01/01/1900. If a record includes 01/01/1900 as a date of birth, it is highly likely that the field was never populated.
● Conformity is concerned with data values of a similar type that have been entered in a confusing or unusable manner, for example, telephone numbers that inconsistently include or omit area codes.
● Consistency is concerned with the occurrence of disparate types of data records in a dataset created for a single data type (e.g., the combination of personal and business information in a dataset intended for business data only).
● Integrity is concerned with the recognition of meaningful associations between records in a dataset. For example, a dataset may contain records for two or more family members in a household but without any means for the organization to recognize or use this information.
● Duplication is concerned with data records that duplicate one another’s information, that is, with identifying redundant records in the dataset or records with meaningful information in common. For example:
    ○ A dataset may contain user-entered records for “Batch No. 12345” and “Batch 12345”, where both records describe the same batch.
    ○ A dataset may contain several records with common surnames and street addresses, indicating that the records refer to a single household; this type of information is relevant to marketing personnel.
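The completeness and duplication examples above can be sketched in a few lines of Python. This is only an illustration, not how IDQ implements these checks: the field names, the set of placeholder values, and the digits-only normalization of batch labels are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Values treated as "never populated"; this set is an assumption for the sketch.
PLACEHOLDER_VALUES = {"", "01/01/1900"}


def is_incomplete(value):
    """Return True if a field is missing, empty, or still holds a default value."""
    return value is None or value.strip() in PLACEHOLDER_VALUES


def batch_key(label):
    """Reduce a free-text batch label to its digits: 'Batch No. 12345' -> '12345'."""
    return "".join(re.findall(r"\d+", label))


def find_duplicates(records, field):
    """Map each normalized key to the indexes of the records that share it."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[batch_key(record[field])].append(index)
    return {key: indexes for key, indexes in groups.items() if len(indexes) > 1}


# Hypothetical sample records.
records = [
    {"batch": "Batch No. 12345", "date_of_birth": "01/01/1900"},
    {"batch": "Batch 12345", "date_of_birth": "14/03/1975"},
    {"batch": "Batch 67890", "date_of_birth": ""},
]

incomplete_rows = [i for i, r in enumerate(records) if is_incomplete(r["date_of_birth"])]
duplicate_groups = find_duplicates(records, "batch")

print(incomplete_rows)   # [0, 2]
print(duplicate_groups)  # {'12345': [0, 1]}
```

A digits-only key is deliberately crude: labels that contain no digits all collapse to the same empty key, so a real rule would need to be agreed with the business before it is used to merge records.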

This list is not absolute; the characteristics above are sometimes described with other terminology, such as redundancy or timeliness. Every organization’s data needs are different, and the prevalence and relative priority of data quality issues differ from one organization and one project to the next. Note that the accuracy factor differs from the other five factors in the following respect: whereas, for example, a pair of duplicate records may be visible to the naked eye, it can be difficult to tell simply by “eyeballing” whether a given data record is inaccurate. Accuracy can be determined by applying fuzzy logic to the data or by validating the records against a verified reference data set.
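As a minimal sketch of validating values against a reference data set, the Python below classifies each value as an exact match, a near match, or unknown. It uses the standard library's difflib as a stand-in for "fuzzy logic"; the reference list, the function name, and the 0.8 similarity cutoff are assumptions for illustration, and a data quality tool may use quite different matching algorithms.

```python
from difflib import get_close_matches

# Hypothetical verified reference data, e.g. loaded from a dictionary file.
REFERENCE_PRODUCTS = ["Widget A", "Widget B", "Gasket 10mm"]


def check_accuracy(value, reference, cutoff=0.8):
    """Classify a value as 'exact', 'near' (fuzzy match), or 'unknown'."""
    if value in reference:
        return ("exact", value)
    # get_close_matches returns the best candidates at or above the cutoff.
    matches = get_close_matches(value, reference, n=1, cutoff=cutoff)
    if matches:
        return ("near", matches[0])
    return ("unknown", None)


print(check_accuracy("Widget A", REFERENCE_PRODUCTS))  # ('exact', 'Widget A')
print(check_accuracy("Widgt A", REFERENCE_PRODUCTS))   # ('near', 'Widget A')
print(check_accuracy("Sprocket", REFERENCE_PRODUCTS))  # ('unknown', None)
```

Near matches are candidates for review rather than automatic corrections; the cutoff that separates "near" from "unknown" is a business decision.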

Best Practices
Developing the Data Quality Business Case

Sample Deliverables
None
Last updated: 21-Aug-07 14:06

Phase 2: Analyze
Subtask 2.8.1 Perform Data Quality Analysis of Source Data
Description
The data quality audit is a business rules-based approach that aims to help define project expectations through the use of data quality processes (or plans) and data quality scorecards. It involves conducting a data analysis on the project data, or on a representative sample of the data, and producing an accurate and qualified summary of the data’s quality. This subtask focuses on data quality analysis. The results are processed and presented to the business users in the next subtask 2.8.2 Report Analysis Results to the Business.

Prerequisites
None

Roles
Business Analyst (Secondary)
Data Quality Developer (Primary)
Data Steward/Data Quality Steward (Secondary)

Considerations
There are three key steps in the process:

1. Select Target Data
The main objective of this step is to meet with the data steward and business owners to identify the data sources to be analyzed. For each data source, the Data Quality Developer will need all available information on the data format, content, and structure, as well as input on known data quality issues. The result of this step is a list of the sources of data to be analyzed, along with the identification of all known issues. These define the initial scope of the audit. The following figure illustrates selecting target data from multiple sources.

2. Run Data Quality Analysis
This step identifies and quantifies data quality issues in the source data. Data quality analysis plans are configured in Informatica Data Quality (IDQ) Workbench. (The plans should be configured in a manner that enables the production of scorecards in the next subtask. A scorecard is a graphical representation of the levels of data quality in the dataset.) The plans designed at this stage identify cases of incomplete or absent data values. Using IDQ, the Data Quality Developer can identify all such data content issues. Data analysis provides detailed metrics to guide the next steps of the audit. For example:
● For character data, analysis identifies all distinct values (such as code values) and their frequency distribution.
● For numeric data, analysis provides statistics on the highest, lowest, average, and total, as well as the number of positive values, negative values, zero/null values, and any non-numeric values.
● For dates, analysis identifies the highest and lowest dates, the number of blank/null fields, as well as any invalid date values.
● For consumer packaging data, analysis can detect issues such as bar codes with correct/incorrect numbers of digits.
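The kinds of metrics listed above can be approximated outside IDQ for a quick look at a sample file. The following Python sketch is an assumption-laden illustration only: the function names, the day/month/year date format, and the choice to count blank values separately from zeros are all choices made for this example, not IDQ behavior.

```python
from collections import Counter
from datetime import datetime


def is_blank(value):
    """Treat None and whitespace-only strings as blank/null."""
    return value is None or value.strip() == ""


def profile_text(values):
    """Frequency distribution of the distinct values in a character column."""
    return Counter(values)


def profile_numbers(values):
    """Summary statistics for a numeric column supplied as text."""
    numbers, blank, non_numeric = [], 0, 0
    for raw in values:
        if is_blank(raw):
            blank += 1
            continue
        try:
            numbers.append(float(raw))
        except ValueError:
            non_numeric += 1
    return {
        "highest": max(numbers, default=None),
        "lowest": min(numbers, default=None),
        "total": sum(numbers),
        "average": sum(numbers) / len(numbers) if numbers else None,
        "positive": sum(1 for n in numbers if n > 0),
        "negative": sum(1 for n in numbers if n < 0),
        "zero": sum(1 for n in numbers if n == 0),
        "blank": blank,
        "non_numeric": non_numeric,
    }


def profile_dates(values, fmt="%d/%m/%Y"):
    """Lowest and highest valid dates, plus counts of blank and invalid values."""
    dates, blank, invalid = [], 0, 0
    for raw in values:
        if is_blank(raw):
            blank += 1
            continue
        try:
            dates.append(datetime.strptime(raw.strip(), fmt).date())
        except ValueError:
            invalid += 1
    return {
        "lowest": min(dates, default=None),
        "highest": max(dates, default=None),
        "blank": blank,
        "invalid": invalid,
    }


# Hypothetical sample columns.
print(profile_text(["A", "B", "A", "C", "A"]).most_common())
print(profile_numbers(["10", "-2.5", "0", "", "abc", None]))
print(profile_dates(["01/02/2007", "31/12/1999", "", "31/02/2001"]))
```

Here "31/02/2001" is counted as invalid because it matches the expected pattern but is not a real calendar date, which is exactly the sort of issue that pattern checks alone tend to miss.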

The figure below shows sample IDQ report output.

3. Define Business Rules
The key objectives of this step are to identify issues in the areas of completeness, conformity, and consistency, to prioritize data quality issues, and to define customized data quality rules. These objectives involve:
● Discussions of data quality analyses with business users to define completeness, conformity, and consistency rules for each data element.
● Tuning and re-running the analysis plans with these business rules.

For each data set, a set of base rules must be established to test the conformity of the attributes' data values against basic rule definitions. For example, if an attribute has a date type, then that attribute should only have date information stored. At a minimum, all the necessary fields must be tested against the base rule sets. The following figure illustrates business rule evaluation.
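A base rule set of the kind described above can be sketched as a mapping from attribute name to a validation function. The Python below is a hypothetical illustration: the attribute names, the ISO date format, and the rule functions are assumptions, and in practice these rules would be configured as IDQ plans rather than hand-coded.

```python
from datetime import datetime


def is_date(value, fmt="%Y-%m-%d"):
    """Base rule for date-typed attributes: the value must parse as a date."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False


def is_integer(value):
    """Base rule for integer-typed attributes."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical base rule set: one rule per attribute under test.
BASE_RULES = {"order_date": is_date, "quantity": is_integer}


def evaluate_base_rules(rows, rules):
    """Return, per attribute, the indexes of the rows that fail its base rule."""
    failures = {field: [] for field in rules}
    for index, row in enumerate(rows):
        for field, rule in rules.items():
            if not rule(row.get(field)):
                failures[field].append(index)
    return failures


rows = [
    {"order_date": "2007-02-15", "quantity": "3"},
    {"order_date": "15/02/2007", "quantity": "three"},
    {"order_date": None, "quantity": "10"},
]
print(evaluate_base_rules(rows, BASE_RULES))
# {'order_date': [1, 2], 'quantity': [1]}
```

Keeping the rules in a single mapping makes it straightforward to add the business-specific completeness, conformity, and consistency rules agreed with the business users and then re-run the same evaluation.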

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 18:17

Phase 2: Analyze
Subtask 2.8.2 Report Analysis Results to the Business
Description
The steps outlined in subtask 2.8.1 lead to the preparation of the Data Quality Audit Report, which is delivered in this subtask. The Data Quality Audit Report highlights the state of the data analyzed in an easy-to-read, high-impact fashion. The report can include the following types of files:
● Data quality scorecards - charts and graphs of data quality that can be pre-set to present and compare data quality across key fields and data types
● Drill-down reports that permit reviewers to access the raw data underlying the summary information
● Exception files

In this subtask, potential risk areas are identified and alternative solutions are evaluated. The Data Quality Audit concludes with a presentation of these findings to the business and project stakeholders and agreement on recommended next steps.

Prerequisites
None

Roles
Business Analyst (Secondary)
Business Project Manager (Secondary)
Data Quality Developer (Primary)
Data Steward/Data Quality Steward (Primary)
Technical Project Manager (Secondary)

Considerations
There are two key activities in this subtask: delivering the report, and framing a discussion for the business about what actions to take based on the report conclusions.

Delivering the report involves formatting the analysis results from subtask 2.8.1 into a framework that can be easily understood by the business. This includes building data quality scorecards, preparing the data sources for the scorecards, and possibly creating audit summary documentation such as a Microsoft Word document or a PowerPoint slideshow. The data quality issues can then be evaluated, recommendations made, and project targets set.

Creating Scorecards
Informatica Data Quality (IDQ) is used to identify, measure, and categorize data quality issues according to business criteria. IDQ reports information in several formats, including database tables, CSV files, HTML files, and graphical displays. (Graphical displays, or scorecards, are linked to the underlying data so that viewers can move from high-level to low-level views of the data.)

Part of the report creation process is agreeing on pass/fail scores for the data and assigning weights to the data performance for different criteria. For example, the business may state that at least 98 percent of values in address data fields must be accurate and weight the ZIP+4 field as most important.

Once the scorecards are defined, the data quality plans can be re-used to track data quality progress over time and throughout the organization. The data quality scorecard can also be presented through a dashboard framework, which adds value to the scorecard by grouping graphical information in business-intelligent ways.

As can be seen in the above figure, a dashboard can present measurements in a “traffic light” manner (color-coded green/amber/red) to provide quick visual cues as to the quality of and actions needed for the data.
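The scoring, weighting, and traffic-light ideas above can be sketched as follows. This is a rough illustration rather than the IDQ scorecard model: the field names, the weights, the 98 percent pass mark, the 90 percent amber threshold, and the ZIP+4 pattern check are all assumptions chosen for the example.

```python
import re

ZIP_PLUS_FOUR = re.compile(r"^\d{5}-\d{4}$")  # pattern check only, not a postal lookup


def percent_valid(values, is_valid):
    """Percentage of values that satisfy a validity rule."""
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if is_valid(v)) / len(values)


def traffic_light(score, green_at=98.0, amber_at=90.0):
    """Map a percentage score to a dashboard color (thresholds are assumptions)."""
    if score >= green_at:
        return "green"
    if score >= amber_at:
        return "amber"
    return "red"


def build_scorecard(field_scores, weights, pass_mark):
    """Combine per-field scores into a weighted overall score with pass/fail flags."""
    total_weight = sum(weights[f] for f in field_scores)
    overall = sum(s * weights[f] for f, s in field_scores.items()) / total_weight
    return {
        "fields": {
            f: {"score": s, "passed": s >= pass_mark, "light": traffic_light(s)}
            for f, s in field_scores.items()
        },
        "overall": overall,
        "passed": overall >= pass_mark,
        "light": traffic_light(overall),
    }


# Hypothetical measurements; the ZIP+4 field carries the heaviest weight.
zip_values = ["12345-6789", "98765-4321", "12345", "54321-0000"]
field_scores = {
    "zip_plus_four": percent_valid(zip_values, lambda v: bool(ZIP_PLUS_FOUR.match(v))),
    "city": 100.0,
    "street": 95.0,
}
weights = {"zip_plus_four": 3, "city": 1, "street": 1}

scorecard = build_scorecard(field_scores, weights, pass_mark=98.0)
print(scorecard["fields"]["zip_plus_four"])  # score 75.0, failed, red
print(scorecard["overall"], scorecard["light"])
```

Because the ZIP+4 field is weighted three times as heavily as the others, its poor score drags the overall result into the red even though the city field is perfect, which mirrors the intent of letting the business decide which criteria matter most.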

Reviewing the Audit Results and Deciding the Next Step
By integrating various data analysis results within the dashboard application, the stakeholders can review the current state of data quality and decide on appropriate actions within the project. The set of stakeholders should include one or more members of the data stewardship committee, the project manager, data experts, a Data Quality Developer, and representatives of the business. Together, these stakeholders can review the data quality audit conclusions and conduct a cost-benefit comparison of the desired data quality levels versus the impact on the project of the steps to achieve these levels.

In some projects (for example, when the data must comply with government or industry regulations) the data quality levels are non-negotiable, and the project stakeholders must work to those regulations. In other cases, the business objectives may be achieved by data quality levels that are less than 100 percent. In all cases, the project data must attain a minimum quality level in order to pass through the project processes and be accepted by the target data source.

For these reasons, it is necessary to discuss data quality as early as possible in project planning.

Ongoing Audits and Data Quality Monitoring
Conducting a data quality audit one time provides insight into the then-current state of the data, but does not reflect how project activity can change data quality over time. Tracking levels of data quality over time, as part of an ongoing monitoring process, provides a historical view of when and how much the quality of data has improved. The following figure illustrates how ongoing audits can chart progress in data quality.

As part of a statistical control process, data quality levels can be tracked on a periodic basis and charted to show if the measured levels of data quality reach and remain in an acceptable range, or whether some event has caused the measured level to fall below what is acceptable. Statistical control charts can help in notifying data stewards when an exception event impacts data quality and can help to identify the offending information process. Historical statistical tracking and charting capabilities are available within a data quality scorecard, and scorecards can be easily updated; once configured, the scorecard typically does not need to be re-created for successive data quality analyses.
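A simple form of the statistical control described above derives control limits from a baseline period and flags later measurements that fall outside them. The Python sketch below assumes three-sigma limits around the baseline mean and uses made-up scores; the function names and the choice of limits are assumptions, not features of any particular scorecard product.

```python
from statistics import mean, pstdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) control limits from baseline scores."""
    center = mean(baseline)
    spread = pstdev(baseline)
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(scores, lower, upper):
    """Indexes of the measurements that fall outside the control limits."""
    return [i for i, s in enumerate(scores) if s < lower or s > upper]


# Hypothetical periodic data quality scores (percent of records passing).
baseline = [97.0, 98.0, 99.0, 98.0, 97.0, 99.0]
recent = [98.5, 97.2, 91.0, 98.1]

lower, center, upper = control_limits(baseline)
print(round(lower, 2), center, round(upper, 2))
print(out_of_control(recent, lower, upper))  # [2]
```

The flagged measurement marks the period in which a data steward would start looking for the event or information process that caused the drop.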

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 17:29

Phase 3: Architect
3 Architect
● 3.1 Develop Solution Architecture
    ○ 3.1.1 Define Technical Requirements
    ○ 3.1.2 Develop Architecture Logical View
    ○ 3.1.3 Develop Configuration Recommendations
    ○ 3.1.4 Develop Architecture Physical View
    ○ 3.1.5 Estimate Volume Requirements
● 3.2 Design Development Architecture
    ○ 3.2.1 Develop Quality Assurance Strategy
    ○ 3.2.2 Define Development Environments
    ○ 3.2.3 Develop Change Control Procedures
    ○ 3.2.5 Develop Change Management Process
● 3.3 Implement Technical Architecture
    ○ 3.3.1 Procure Hardware and Software
    ○ 3.3.2 Install/Configure Software

Phase 3: Architect
Description
During this phase of the project, the technical requirements are defined, the project infrastructure is developed, and the development standards and strategies are defined. The conceptual architecture is designed, which forms the basis for determining capacity requirements and configuration recommendations. The environments and strategies for the entire development process are defined. The strategies include development standards, quality assurance, change control processes, and metadata strategy.

It is critical that the architecture decisions made during this phase are guided by an understanding of the business needs. As Data Integration architectures become more real-time and mission-critical, good architecture decisions become increasingly important to the success of the overall effort. This phase should culminate in the implementation of the hardware and software that will allow the Design Phase and the Build Phase of the project to begin.

Proper execution during the Architect Phase is especially important for Data Migration projects. In the Architect Phase, a series of key tasks are undertaken to accelerate development, ensure consistency, and expedite completion of the data migration.

Prerequisites
None

Roles
Business Analyst (Primary)
Business Project Manager (Primary)
Data Architect (Primary)
Data Integration Developer (Secondary)
Data Quality Developer (Primary)
Data Warehouse Administrator (Review Only)
Database Administrator (DBA) (Primary)
Metadata Manager (Primary)
Presentation Layer Developer (Secondary)
Project Sponsor (Approve)
Quality Assurance Manager (Primary)
Repository Administrator (Primary)
Security Manager (Secondary)
System Administrator (Primary)
Technical Architect (Primary)
Technical Project Manager (Primary)

Considerations
None

Best Practices
None

Sample Deliverables
None
Last updated: 25-May-08 16:13

Phase 3: Architect
Task 3.1 Develop Solution Architecture
Description
The scope of solution architecture in a data integration or an enterprise data warehouse project is quite broad and involves careful consideration of many disparate factors. Data integration solutions have grown in scope as well as in the amount of data they process. This necessitates careful consideration of architectural issues across a number of architectural domains.

A well-designed solution architecture is crucial to any data integration effort, and can be the most influential, visible part of the whole effort. A robust solution architecture not only meets the business requirements but also exceeds the expectations of the business community. Given the continuous state of change that has become a trademark of information technology, it is prudent to have an architecture that is not only easy to implement and manage, but also flexible enough to accommodate changes in the future, easily extendable, reliable (with minimal or no downtime), and highly scalable.

This task approaches the development of the architecture as a series of stepwise refinements:
● First, reviewing the requirements.
● Then developing a logical model of the architecture for consideration.
● Refining the logical model into a physical model.
● Validating the physical model.

In addition, because the architecture must consider anticipated data volumes, it is necessary to develop a thorough set of estimates. The Technical Architect is responsible for ensuring that the proposed architecture can support the estimated volumes.

Prerequisites
None

Roles
Business Analyst (Primary)
Data Architect (Primary)
Data Quality Developer (Primary)
Data Warehouse Administrator (Review Only)
Database Administrator (DBA) (Primary)
System Administrator (Primary)
Technical Architect (Primary)
Technical Project Manager (Review Only)

Considerations
A holistic view of architecture encompasses three realms: the development architecture, the execution architecture, and the operations architecture. These three areas of concern provide a framework for considering how any system is built, how it runs, and how it is operated. Although there may be some argument about whether an integration solution is a "system," it is clear that it has all the elements of a software system, including databases, executable programs, end users, maintenance releases, and so forth. Of course, all of these elements must be considered in the design and development of the enterprise solution. Each of these architectural areas involves specific responsibilities and concerns:
● Development Architecture, which incorporates technology standards, tools, and the techniques and services required in the development of the enterprise solution. This may include many of the services described in the execution architecture, but also involves services that are unique to development environments such as security mechanisms for controlling access to development objects, change control tools and procedures, and migration capabilities.
● Execution Architecture, which includes the entire supporting infrastructure required to run an application or set of applications. In the context of an enterprise-wide integration solution, this includes client and server hardware, operating systems, database management systems, network infrastructure, and any other technology services employed in the runtime delivery of the solution.
● Operations Architecture, which is a unified collection of technology services, tools, standards, and controls required to keep a business application production or development environment operating at the designed service level. This differs from the execution architecture in that its primary users are system administrators and production support personnel.

The specific activities that comprise this task focus primarily on the Execution Architecture. 3.2 Design Development Architecture focuses on the development architecture and the Operate Phase discusses the important aspects of operating a data integration solution. Refer to the Operate Phase for more information on the operations architecture.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:44

Phase 3: Architect
Subtask 3.1.1 Define Technical Requirements
Description
In anticipation of architectural design and subsequent detailed technical design steps, the business requirements and functional requirements must be reviewed and a high-level specification of the technical requirements developed. The technical requirements will drive these design steps by clarifying what technologies will be employed and, at a high level, how they will satisfy the business and functional requirements.

Prerequisites
None

Roles
Business Analyst (Primary)
Data Quality Developer (Secondary)
Technical Architect (Primary)
Technical Project Manager (Review Only)

Considerations
The technical requirements should address, at least at a conceptual level, implementation specifications based on the findings to date (regarding data rules, source analysis, strategic decisions, etc.) such as:
● Technical definitions of business rule derivations (including levels of summarization).
● Definitions of source and target schema, at least at a logical/conceptual level.
● Data acquisition and data flow requirements.
● Data quality requirements (at least at a high level).
● Data consolidation/integration requirements (at least at a high level).
● Report delivery and access specifications.
● Performance requirements (both “back-end” and presentation performance).
● Security requirements and structures (access, domain, administration, etc.).
● Connectivity specifications and constraints (especially limits of access to operational systems).
● Specific technologies required (if requirements clearly indicate such).

For Data Migration projects, the technical requirements are fairly consistent and well known. They will require processes to:
● Populate the reference data structures
● Acquire the data from source systems
● Convert to target definitions
● Load to the target application
● Meet the necessary audit functionalities

The details will be covered in a data migration strategy.
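To make the five migration processes above concrete, the Python sketch below strings them together for a toy customer feed. Everything in it is hypothetical: the field names, the country reference table, and the shape of the audit record are assumptions for illustration, and a real migration would implement these steps as PowerCenter mappings against staging and audit schemas.

```python
def run_migration(source_rows, country_codes):
    """Convert source rows to target definitions; return loaded rows, rejects, and audit counts."""
    loaded, rejected = [], []
    for row in source_rows:
        # Convert: look up the target code in the reference data structure.
        code = country_codes.get(row["country"].strip().upper())
        if code is None:
            rejected.append(row)
            continue
        # Load: here the "target application" is simply a list.
        loaded.append({"customer_name": row["name"].strip(), "country_code": code})
    # Audit: every row read must be accounted for as loaded or rejected.
    audit = {"read": len(source_rows), "loaded": len(loaded), "rejected": len(rejected)}
    return loaded, rejected, audit


# Populate the reference data (hypothetical values).
reference = {"UNITED STATES": "US", "IRELAND": "IE"}

# Acquire the source data (hypothetical rows).
source = [
    {"name": " Acme Corp ", "country": "United States"},
    {"name": "Baile Ltd", "country": "ireland"},
    {"name": "Unknown Co", "country": "Atlantis"},
]

loaded, rejected, audit = run_migration(source, reference)
print(audit)  # {'read': 3, 'loaded': 2, 'rejected': 1}
```

The audit record is the part most often overlooked: reconciling rows read against rows loaded and rejected is what allows the migration to be signed off.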

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:44

Phase 3: Architect
Subtask 3.1.2 Develop Architecture Logical View
Description
Much like a logical data model, a logical view of the architecture provides a high-level depiction of the various entities and relationships as an architectural blueprint of the entire data integration solution. The logical architecture helps people to visualize the solution and shows how all the components work together. The major purposes of the logical view are:
● To describe how the various solution elements work together (i.e., databases, ETL, reporting, and metadata).
● To communicate the conceptual architecture to project participants to validate the architecture.
● To serve as a blueprint for developing the more detailed physical view.

The logical diagram provides a road map of the enterprise initiative and an opportunity for the architects and project planners to define and describe, in some detail, the individual components. The logical view should show relationships in the data flow and among the functional components, indicating, for example, how local repositories relate to the global repository (if applicable). The logical view must take into consideration all of the source systems required to support the solution, the repositories that will contain the runtime metadata, and all known data marts and reports. This is a “living” architectural diagram, to be refined as you implement or grow the solution. The logical view does not contain detailed physical information such as server names, IP addresses, hardware specifications, etc. These details will be fleshed out in the development of the physical view.
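One lightweight way to keep a "living" logical view checkable is to record the components and data flows as a simple graph alongside the diagram. The Python sketch below is an assumption-based illustration: the component names and flows are invented, and the check shown (finding components that no data flow reaches from the source systems) is just one example of a validation the architects might want.

```python
from collections import deque

# Hypothetical logical view: each component maps to the components it feeds.
FLOWS = {
    "source_systems": ["etl_server"],
    "etl_server": ["data_warehouse", "metadata_repository"],
    "data_warehouse": ["data_mart"],
    "data_mart": ["bi_reports"],
    "metadata_repository": [],
    "bi_reports": [],
    "data_quality_tools": [],  # listed as a component but not yet wired in
}


def unreachable_components(flows, start):
    """Return the components that no data flow reaches from the start component."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in flows.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(set(flows) - seen)


print(unreachable_components(FLOWS, "source_systems"))  # ['data_quality_tools']
```

Note that the model holds only logical names and relationships; server names, IP addresses, and hardware details stay out of it, in keeping with the logical view.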

Prerequisites
None

Roles
Data Architect (Secondary)
Technical Architect (Primary)

Considerations
The logical architecture should address reliability, availability, scalability, performance, usability, extensibility, interoperability, security, and QA. It should incorporate all of the high-level components of the information architecture, including but not limited to:
● All relevant source systems
● ETL repositories; BI repositories
● Metadata Management, Metadata Reporting
● Real-time Messaging, Web Services, XML Server
● Data Quality tools, Data Modeling tools
● PowerCenter Servers, Repository Server
● Target data structures, e.g., data warehouse, data marts, ODS
● Web Application Servers
● ROLAP engines, Portals, MOLAP cubes, Data Mining

For Data Migration projects a key component is the documentation of the various utility database schemas. This will likely include legacy staging, pre-load staging, reference data, and audit database schemas. Additionally, database schemas for Informatica Data Quality and Informatica Data Explorer will also be included.

Best Practices
Designing Data Integration Architectures
PowerCenter Enterprise Grid Option

Sample Deliverables
None

Last updated: 06-Dec-07 15:36

Phase 3: Architect
Subtask 3.1.3 Develop Configuration Recommendations
Description
Using the Architecture Logical View as a guide, and considering any corporate standards or preferences, develop a set of recommendations for how to technically configure the analytic solution. These recommendations will serve as the basis for discussion with the appropriate parties, including project management, the Project Sponsor, system administrators, and potentially the user community.

At this point, the recommendations of the Data Architect and Technical Architect should be very well formed, based on their understanding of the business requirements and the current and planned technical standards. The recommendations will be formally documented in the next subtask, 3.1.4 Develop Architecture Physical View, but are not documented at this stage since they are still considered open to debate.

Discussions with interested constituents should focus on the recommended architecture, not on protracted debate over the business requirements. It is critical that the scope of the project be set, and agreed upon, prior to developing and documenting the technical configuration recommendations. Changes in the requirements at this point can have a definite impact on the project delivery date. (Refer back to the Manage Phase for a discussion of scope setting and control issues.)

Prerequisites
None

Roles
Data Architect (Secondary)
Technical Architect (Primary)

Considerations
The configuration recommendations must balance a number of factors in order to be adopted:
● Technical solution - The recommended configuration must, of course, solve the technical challenges posed by the analytic solution. In particular, it must consider data capacity and volume throughput requirements.
● Conformity - The recommended solution should work well within the context of the organization's existing infrastructure and conform to the organization's future infrastructure direction.
● Cost - The incremental cost of the solution must fit within whatever budgetary parameters have been established by project management. In many cases, incremental costs can be reduced by leveraging existing available hardware resources and leveraging PowerCenter’s server grid technology.

The primary areas to consider in developing the recommendations include, but are not necessarily limited to:
● Server Hardware and Operating System - Many IT organizations mandate, or strongly encourage, the choice of server hardware and operating system to fit into the corporate standards. Depending on the size and throughput requirements, the server may be UNIX, Linux, or NT-based. The technical architecture should also recommend either a 32-bit or a 64-bit architecture based on the cost/benefit of each. It is advisable to consider the advantages of a 64-bit OS and PowerCenter, as this is likely to provide increased resources and enable faster processing speeds. It is also likely to support the handling of larger numbers in data. It is also important to ensure the hardware is built for OLAP applications, which typically tend to be computationally intensive, as compared to OLTP systems, which require hyper-threading. This determination is important for ensuring improved performance. Also make sure the RAM size is determined in accordance with the systems to be built. In many cases RAM disks can be used in place of RAM when increased RAM availability is an issue. This is especially important when the PowerCenter application creates huge cache files. Consult the Platform Availability Matrix at my.informatica.com for specifics on the applications under consideration for the project. Bear in mind that not all applications have the same level of availability on every platform. This is also true for database connectivity (see Database Management System below).
● Disk Storage Systems – The architecture of the disk storage system should also be included in the architecture configuration. Some organizations leverage a Storage Area Network (SAN) to store all data, while other organizations opt for local storage. In any case, careful consideration should be given to disk array and striping configuration in order to optimize performance for the related systems (i.e., database, ETL, and BI).
● Database Management System – Similar to organizational standards that mandate hardware or operating system choices, many organizations also mandate the choice of a database management system. In instances where a choice of DBMS is available, it is important to remember that PowerCenter and Data Analyzer support a vast array of DBMSs on a variety of platforms (refer to the PowerCenter Installation Guide and Data Analyzer Installation Guide for specifics). Choose a DBMS that is supported by all components in the technical infrastructure, such as the OS, ETL, and BI tools.
● PowerCenter Server – The PowerCenter server should, of course, be considered when developing the architecture recommendations. Considerations should include network traffic (between the repository server, PowerCenter server, database server, and client machines), the location of the PowerCenter repository database, and the physical storage that will contain the PowerCenter executables as well as source, target, and cache files.
● Data Analyzer or other Business Intelligence Data Integration Platforms – Whether using Data Analyzer or a different BI tool for analytics, the goal is to develop configuration recommendations that result in a high-performance application passing data efficiently between source system, ETL server, database tables, and BI end-user reports. For Web-based analytic tools such as Data Analyzer, also consider user requirements that may dictate a secure Web-server infrastructure to provide reporting access outside of the corporate firewall, for example to enable reporting access from a mobile device. Typically, a secure Web-server infrastructure that utilizes a demilitarized zone (DMZ) will result in a different technical architecture configuration than an infrastructure that simply supports reporting from within the corporate firewall.

TIP Use the Architecture Logical View as a starting point for discussing the technical configuration recommendations. As drafts of the physical view are developed, they will be helpful for explaining the planned architecture.

Best Practices
PowerCenter Enterprise Grid Option

Sample Deliverables
None
Last updated: 06-Dec-07 14:55

Phase 3: Architect
Subtask 3.1.4 Develop Architecture Physical View
Description
The physical view of the architecture is a refinement of the logical view, but takes into account the actual hardware and software resources necessary to build the architecture. Much like a physical data model, this view of the architecture depicts physical entities (i.e., servers, workstations, and networks) and their attributes (i.e., hardware model, operating system, server name, IP address). In addition, each entity should show the elements of the logical model supported by it. For example, a UNIX server may be serving as a PowerCenter server engine, Data Analyzer server engine, and may also be running Oracle to store the associated PowerCenter repositories. The physical view is the summarized planning document for the architecture implementation. The physical view is unlikely to explicitly show all of the technical information necessary to configure the system, but should provide enough information for domain experts to proceed with their specific responsibilities. In essence, this view is a common blueprint that the system's general contractor (i.e. the Technical Architect) can use to communicate to each of the subcontractors (i.e. UNIX Administrator, Mainframe Administrator, Network Administrator, Application Server Administrator, DBAs, etc).
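
As an illustration of the paragraph above, the physical view can be recorded as a list of physical entities, each carrying its attributes and the logical components it supports. This is a minimal Python sketch, not part of the Velocity methodology or of any Informatica product: the class name, field names, server name, hardware model, and IP address are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhysicalEntity:
    """One physical entity in the physical view (hypothetical structure)."""
    server_name: str
    hardware_model: str
    operating_system: str
    ip_address: str
    # Elements of the logical view that this entity supports.
    logical_components: List[str] = field(default_factory=list)


def entities_supporting(entities, component):
    """Return the names of all entities that host the given logical component."""
    return [e.server_name for e in entities if component in e.logical_components]


# Placeholder values only; they mirror the UNIX server example in the text.
physical_view = [
    PhysicalEntity(
        server_name="unix-etl-01",
        hardware_model="example-model",
        operating_system="UNIX",
        ip_address="192.0.2.10",
        logical_components=[
            "PowerCenter server engine",
            "Data Analyzer server engine",
            "Oracle (PowerCenter repositories)",
        ],
    ),
]
```

A structure like this lets each domain expert filter the blueprint for the entities relevant to their own responsibilities, for example `entities_supporting(physical_view, "PowerCenter server engine")`.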

Prerequisites
None

Roles
Data Warehouse Administrator (Approve)
Database Administrator (DBA) (Primary)
System Administrator (Primary)
Technical Architect (Review Only)

Considerations
None

Best Practices
PowerCenter Enterprise Grid Option

Sample Deliverables
None
Last updated: 06-Dec-07 15:35

Phase 3: Architect
Subtask 3.1.5 Estimate Volume Requirements
Description
Estimating the data volume and physical storage requirements of a data integration project is a critical step in the architecture planning process. This subtask represents a starting point for analyzing data volumes, but does not include a definitive discussion of capacity planning. Due to the varying complexity and data volumes associated with data integration solutions, it is crucial to review each technical area of the proposed solution with the appropriate experts (e.g., DBAs, Network Administrators, and Server System Administrators).

Prerequisites
None

Roles
Data Architect (Primary)
Data Quality Developer (Primary)
Database Administrator (DBA) (Primary)
System Administrator (Primary)
Technical Architect (Secondary)

Considerations
Capacity planning and volume estimation should focus on several key areas that are likely to become system bottlenecks or to strain system capacity, specifically:

Disk Space Considerations

Database size is the most likely factor to affect disk space usage in the data integration solution. As the typical data integration solution does not alter the source systems, there is usually no need to consider their size. However, the target databases, and any ODS or staging areas demand disk storage over and above the existing operational systems. A Database Sizing Model workbook is one effective means for estimating these sizes. During the Architect Phase only a rough volume estimate is required. After the Design Phase is completed, the database sizing model should be updated to reflect the data model and any changes to the known business requirements. The basic techniques for database sizing are well understood by experienced DBAs. Estimates of database size must factor in:
● Determine the upper bound of the width of each table row. This can be affected by certain DBMS data types, so be sure to take into account each physical byte consumed. The documentation for the DBMS should specify storage requirements for all supported data types. After the physical data model has been developed, the row width can be calculated.
● Estimate the number of rows in each table. Depending on the type of table, this number may be vastly different for a "young" warehouse than for one at "maturity". For example, if the database is designed to store three years of historical sales data, and there is an average daily volume of 5,000 sales, the table will contain 150,000 rows after the first month, but will have swelled to nearly 5.5 million rows at full maturity. Beyond the third year, there should be a process in place for archiving data off the table, thus limiting the size to roughly 5.5 million rows.
● Indexing can add a significant disk usage penalty to a database. Depending on the overall size of the indexed table, and the size of the keys used in the index, an index may require 30 to 80 percent additional disk space. Again, the DBMS documentation should contain specifics about calculating index size.
● Partitioning the physical target can greatly increase the efficiency and organization of the load process. However, it does increase the number of physical units to be maintained. Be sure to discuss with the DBAs the most intelligent structuring of the database partitions.

Using these basic factors, it is possible to construct a database sizing model (typically in spreadsheet form) that lists all database tables and indexes, their row widths, and estimated number of rows. Once the row number estimates have been validated, the estimating model should produce a fairly accurate estimate of database size. Note that the model will provide an estimate of raw data size. Be sure to consult the DBAs to understand how to factor in the physical storage characteristics relative to the DBMS being used, such as block parameter sizes.
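
The arithmetic behind such a sizing model can be sketched in a few lines of Python. This is only an illustration of the factors listed above, not the Database Sizing Model workbook itself: the class and field names are hypothetical, the 48-byte row width and 50 percent index overhead are assumed values, and a month is taken as 30 days. The result is raw data size only; DBMS block overhead is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class TableEstimate:
    """One line of a hypothetical database sizing model."""
    name: str
    row_width_bytes: int   # upper bound of the physical row width (assumed)
    rows_per_day: int      # average daily volume
    retention_days: int    # history kept before rows are archived
    index_overhead: float  # extra space for indexes, e.g. 0.30 to 0.80

    def rows_after(self, days: int) -> int:
        """Row count after `days` of loading, capped at the retention limit."""
        return self.rows_per_day * min(days, self.retention_days)

    def raw_bytes_after(self, days: int) -> int:
        """Raw data size, excluding indexes and DBMS block overhead."""
        return self.rows_after(days) * self.row_width_bytes

    def bytes_with_indexes_after(self, days: int) -> float:
        """Raw data size plus the assumed index overhead."""
        return self.raw_bytes_after(days) * (1 + self.index_overhead)


# The sales example from the text: 5,000 sales per day, three years of history.
sales = TableEstimate(
    name="sales_fact",
    row_width_bytes=48,
    rows_per_day=5_000,
    retention_days=3 * 365,
    index_overhead=0.5,
)
```

With these inputs, `sales.rows_after(30)` gives the 150,000 rows of the first month and `sales.rows_after(3 * 365)` gives 5,475,000 rows, the "nearly 5.5 million" figure at maturity. Later days do not increase the count because archiving is assumed to cap the table at the retention limit.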

The estimating process also provides a good opportunity to validate the star schema data model. For example, fact tables should contain only composite keys and discrete facts. If a fact table is wider than 32-64 bytes, it may be wise to re-evaluate what is being stored. The width of the fact table is very important, since a warehouse can contain millions, tens of millions, or even hundreds of millions of fact records. The dimension tables, on the other hand, will typically be wider than the fact tables, and may contain redundant data (e.g., names, addresses, etc.), but will have far fewer rows. As a result, the size of the dimension tables is rarely a major contributor to the overall target database size.

Since there is the possibility of unstructured data being sourced, transformed and stored, it is important to factor in any conversion in data size, either up or down, from source to target.

It is important to remember that Business Intelligence (BI) tools may consume significant storage space, depending on the extent to which they pre-aggregate data and how that data is stored. Because this may be an important factor in the overall disk space requirements, be sure to consider storage techniques carefully during the BI platform selection process.

TIP If you have determined that the star schema is the right model to use for the data integration solution, be sure that the DBAs who are responsible for the target data model understand its advantages. A DBA who is unfamiliar with the star schema may seek to normalize the data model in order to save space. Firmly resist this tendency to normalize.

Data Processing Volume
Data processing volume refers to the amount of data being processed by a given PowerCenter server within a specified timeframe. In most data integration implementations, a load window is allotted representing clock time. This window is determined by the availability of the source systems for extracts and the end-user requirements for access to the target data sources. Maintenance jobs that run on a regular basis may further limit the length of the load window. As a result of the limited load window, the PowerCenter server engine must be able to perform its operations on all data in a given time period. The ability to do so is constrained by three factors:
● Time it takes to extract the data (potentially including network transfer time, if the data is on a remote server)
● Transformation time within PowerCenter
● Load time (which is also potentially impacted by network latency)

The biggest factors affecting extract and load times are, however, related to database tuning. Refer to Performance Tuning Databases (Oracle) for suggestions on improving database performance. The throughput of the PowerCenter Server engine is typically the last option for improved performance. Refer to the Velocity Best Practice Tuning Sessions for Better Performance, which includes suggestions on tuning mappings and sessions to optimize performance.

From an estimating standpoint, however, it is impossible to accurately project the throughput (in terms of rows per second) of a mapping due to the high variability in mapping complexity, the quantity and complexity of transformations, and the nature of the data being transformed. It is more accurate to estimate in terms of clock time to ensure processing within the given load window.

If the project includes steps dedicated to improving data quality (for example, as described in Task 4.6), then a related performance factor is the time taken to perform data matching (that is, record de-duplication) operations. Depending on the size of the dataset concerned, data matching operations in Informatica Data Quality can take several hours of processor time to complete. Data matching processes can be tuned and executed on remote machines on the network to significantly reduce record processing time. Refer to the Best Practice Effective Data Matching Techniques for more information.
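
A clock-time check against the load window can be sketched as follows. This is a simplified illustration that assumes the extract, transformation, load, and optional data matching steps run one after another with no overlap; the function name and the sample durations are hypothetical, and real durations should come from test runs rather than projections.

```python
from datetime import timedelta


def fits_load_window(extract, transform, load, window, matching=timedelta(0)):
    """Compare total clock time with the load window.

    Returns a (fits, slack) pair, where slack is the unused part of the
    window and is negative if the window is exceeded. Steps are assumed
    to run sequentially.
    """
    total = extract + transform + load + matching
    return total <= window, window - total


# Hypothetical measured durations for one nightly run.
fits, slack = fits_load_window(
    extract=timedelta(minutes=45),
    transform=timedelta(hours=1, minutes=30),
    load=timedelta(hours=1),
    matching=timedelta(hours=2),
    window=timedelta(hours=6),
)
```

Tracking the slack over successive runs gives an early warning when growing volumes start to consume the load window.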

Network Throughput
Once the physical data row sizes and volumes have been estimated, it is possible to estimate the required network capacity. It is important to remember the network overhead associated with packet headers, as this can have an effect on the total volume of data being transmitted. The Technical Architect should work closely with a Network Administrator to examine network capacity between the different components involved in the solution. The initial estimate is likely to be rough, but should provide a sense of whether the existing capacity is sufficient and whether the solution should be architected differently (e.g., moving source or target data prior to session execution, or re-locating server engines). The Network Administrator can thoroughly analyze network throughput during system and/or performance testing, and apply the appropriate tuning techniques. It is important to involve the network specialists early in the Architect Phase so that they are not surprised by additional network requirements when the system goes into production.
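
A rough first estimate of transfer time can be derived from the row counts and row widths already collected. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, the 10 percent packet-header overhead and the 50 percent share of usable bandwidth are placeholder values, and actual figures should be supplied by the Network Administrator.

```python
def transfer_seconds(rows, row_width_bytes, bandwidth_mbps,
                     overhead=0.10, utilization=0.5):
    """Estimate the seconds needed to move `rows` rows across the network.

    overhead    -- assumed fraction added by packet headers
    utilization -- assumed fraction of the link available to this transfer
    """
    payload_bits = rows * row_width_bytes * 8
    wire_bits = payload_bits * (1 + overhead)
    usable_bits_per_second = bandwidth_mbps * 1_000_000 * utilization
    return wire_bits / usable_bits_per_second


# Hypothetical case: 5 million rows of 100 bytes over a 100 Mbps link.
estimate = transfer_seconds(rows=5_000_000, row_width_bytes=100, bandwidth_mbps=100)
```

Under these assumptions the transfer takes about a minute and a half. The value of such a calculation lies less in the exact number than in showing whether network time is a negligible or a dominant share of the load window.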

TIP Informatica generally recommends having either the source or target database co-located with the PowerCenter Server engine because this can significantly reduce network traffic. If such co-location is not possible, it may be advisable to FTP data from a remote source machine to the PowerCenter Server as this is a very efficient way of transporting the data across the network.

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 18:19

Phase 3: Architect
Task 3.2 Design Development Architecture
Description

The Development Architecture is the collection of technology standards, tools, techniques, and services required to develop a solution. This task involves developing a testing approach, defining the development environments, and determining the metadata strategy. The benefits of defining the development architecture are achieved later in the project, and include good communication and change controls as well as controllable migration procedures. Ignoring proper controls is likely to lead to issues later on in the project. Although the various subtasks that compose this task are described here in linear fashion, all of these subtasks relate to the others, so it is important to approach the overall body of work in this task as a whole.

Prerequisites
None

Roles
Business Project Manager (Primary)
Data Architect (Secondary)
Data Integration Developer (Secondary)
Database Administrator (DBA) (Primary)
Metadata Manager (Primary)
Presentation Layer Developer (Secondary)
Project Sponsor (Review Only)
Quality Assurance Manager (Primary)
Repository Administrator (Primary)
Security Manager (Secondary)
System Administrator (Primary)
Technical Architect (Primary)
Technical Project Manager (Primary)

Considerations
The Development Architecture should be designed prior to the actual start of development because many of the decisions made at the beginning of the project may have unforeseen implications once the development team has reached its full size. The design of the Development Architecture must consider numerous factors including the development environment(s), naming standards, developer security, change control procedures, and more.

The scope of a typical PowerCenter implementation, possibly covering more than one project, is much broader than a departmentally-scoped solution. It is important to consider this statement fully, because it has implications for the planned deployment of a solution, as well as the requisite planning associated with the development environment. The main difference is that a departmental data mart type project can be created with only two or three developers in a very short time period. By contrast, a full integration solution involving the creation of an ICC (Integration Competency Center) or an analytic solution that approaches enterprise scale requires more of a "big team" approach. This is because many more organizational groups are involved, adherence to standards is much more important, and testing must be more rigorous, since the results will be visible to a larger audience.

The following paragraphs outline some of the key differences between a departmental development effort and an enterprise effort:

With a small development team, the environment may be simplistic:
● Communication between developers is easy; it may literally consist of shouting over a cubicle partition.
● Only one or two repository folders may be necessary, since there is little risk of the developers "stepping on" each other's work.
● Naming standards are not rigidly enforced.
● Migration procedures are loose; development objects are moved into production without undue emphasis on impact analysis and change control procedures.
● Developer security is ignored; typically, all developers use similar, often highly privileged, user IDs.

However, as the development team grows and the project becomes more complex, this simplified environment leads to serious development issues:
● Developers accustomed to informal communication may not thoroughly inform the entire development team of important changes to shared objects.
● Repository folders originally named to correspond to individual developers will not adequately support subject area- or release-based development groups.
● Developers maintaining others' mappings are likely to spend unnecessary time and effort trying to decipher unfamiliar names.
● Failure to understand the dependencies of shared objects leads to unknown impacts on the dependent objects. The lack of rigor in testing and migrating objects into production leads to runtime bugs and errors in the warehouse loading process.
● Sharing a single developer ID among multiple developers makes it impossible to determine which developer locked a development object, or who made the last change to an object. More importantly, failure to define secured development groups allows all developers to access all folders, leading to the possibility of untested changes being made in test environments.

These factors represent only a subset of the issues that may occur when the development architecture is haphazardly constructed, or "organically" grown. As is the case with the execution environment, a departmental data mart development effort can "get away with" minimal architectural planning. But any serious effort to develop an enterprise-scale analytic solution must be based on well-planned architecture, including both the development and execution environments.

In Data Migration projects it is common to build out a set of reference data tables to support the effort. These often include tables to hold configuration details (valid values), cross-reference specifics, default values, data control structures, and table-driven parameters. These structures will be a key component in the development of reusable objects.
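
The way such reference tables drive reusable logic can be sketched in Python. This is a hypothetical illustration rather than PowerCenter functionality: in a real migration the three structures would be database tables, and the table names, field name, and country codes here are invented for the example.

```python
# Hypothetical reference data; in practice these would be database tables.
VALID_VALUES = {"country_code": {"US", "GB", "DE"}}
CROSS_REFERENCE = {"country_code": {"USA": "US", "UK": "GB"}}
DEFAULT_VALUES = {"country_code": "US"}


def conform(field_name, source_value):
    """Map a source value to a valid target value using the reference tables."""
    # 1. Translate legacy codes through the cross-reference table.
    value = CROSS_REFERENCE.get(field_name, {}).get(source_value, source_value)
    # 2. Accept the value only if it appears in the valid-values table.
    if value in VALID_VALUES.get(field_name, set()):
        return value
    # 3. Otherwise fall back to the configured default (None if there is none).
    return DEFAULT_VALUES.get(field_name)
```

Because the rules live in data rather than in the function, the same routine can be reused for any field simply by adding rows to the reference tables.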

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:45

Phase 3: Architect
Subtask 3.2.1 Develop Quality Assurance Strategy
Description
Although actual testing starts with unit testing during the build phase followed by the project’s Test Phase, there is far more involved in producing a high quality project. The QA Strategy includes definition of key QA roles, key verification processes and key QA assignments involved in detailing all of the validation procedures for the project.

Prerequisites
None

Roles
Quality Assurance Manager (Primary)
Security Manager (Secondary)
Test Manager (Primary)

Considerations
In determining what project steps will require verification, the QA Manager, or “owner” of the project’s QA processes, should consider the business requirements and the project methodology. Although it may take a “sales” effort to win over management to a QA process that is highly involved throughout the project, the benefits can be proven historically in the success rates of projects and their ongoing maintenance costs. However, the trade-offs of cost vs. value will likely affect the scope of QA.

Potential areas of verification to be considered for QA processes:
● Formal business requirements reviews with key business stakeholders and sign-off
● Formal technical requirements reviews with IT stakeholders and sign-off
● Formal review of environments and architectures with key technical personnel
● Peer reviews of logic designs
● Peer walkthroughs of data integration logic (mappings, code, etc.)
● Unit Testing: definition of procedures, review of test plans, formal sign-off for unit tests
● Gatekeeping for migration out of Development environment (into QA and/or Production)
● Regression testing: definition of procedures, review of test plans, formal sign-off
● System Tests: review of Test Plans, formal acceptance process
● Defect Management: review of procedures, validation of resolution
● User Acceptance Test: review of Test Plans, formal acceptance process
● Documentation review
● Training materials review
● Review of Deployment Plan; sign-off for deployment completion

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:45

Phase 3: Architect
Subtask 3.2.2 Define Development Environments
Description
In the early days of computer system development, the development environment was relatively simple: a mainframe-based development project typically involved one or more isolated regions connected to one or more database instances. Distributed systems, such as federated data warehouses, involve much more complex development environments and many more "moving parts." The basic concept of isolating developers from testers, and both from the production system, is still critical to development success. However, relative to a centralized development effort, there are many more technical issues, hardware platforms, database instances, and specialized personnel to deal with. The task of defining the development environment is, therefore, extremely important and very difficult.

Because of the wide variance in corporate technical environments, standards, and objectives, there is no "optimal" development environment. Rather, there are key areas of consideration and decisions that must be made with respect to them. After the development environment has been defined, it is important to document its configuration, including (most importantly) the information the developers need to use the environments. For example, developers need to understand what systems they are logging into, what databases they are accessing, what repository (or repositories) they are accessing, and where sources and targets reside.

An important aspect of any development environment is to configure it as close to the test and production environments as time and budget allow. This can significantly ease the development and integration efforts downstream and will ultimately save time and cost during the testing phases.

Prerequisites
3.1.1 Define Technical Requirements

Roles
Database Administrator (DBA) (Primary)
Repository Administrator (Primary)
System Administrator (Primary)
Technical Architect (Primary)
Technical Project Manager (Review Only)

Considerations
The development environment for any data integration solution must consider many of the same issues as a "traditional" development project. The major differences are that the development approach is "repository-centric" (as opposed to code-based), there are multiple sources and targets (unlike a typical system development project, which deals with a single database), and few (if any) hand-coded objects to build and maintain. In addition, because of the repository-based development approach, the development environment must consider all of the following key areas:
● Repository Configuration. This involves critical decisions, such as whether to use local repositories, a global repository, or both, as well as determining an overall metadata strategy (see 3.2.4 Determine Metadata Strategy).
● Folder structure. Within each repository, folders are used to group and organize work units or report objects. To be effective, the folder structure must consider the organization of the development team(s), as well as the change control/migration approach.
● Developer security. Both PowerCenter and Data Analyzer have built-in security features that allow an administrative user (i.e., the Repository Administrator) to define the access rights of all other users to objects in the repository. The organization of security groups should be carefully planned and implemented prior to the start of development. As an additional option, LDAP can be used to assist in simplifying the organization of users and permissions.

Repository Configuration
Informatica's data integration platform, PowerCenter, provides capabilities for integrating multiple heterogeneous sources and targets. The requirements of the development team should dictate to what extent the PowerCenter capabilities are exploited, if at all. In a simple data integration development effort, source data may be extracted from a single database or set of flat files, and then transformed and loaded into a single target database. More complex data integration development efforts involve multiple source and target systems. Some of these may include mainframe legacy systems as well as third-party ERP providers.

Most data integration solutions currently being developed involve data from multiple sources, target multiple data marts, and include the participation of developers from multiple areas within the corporate organization. In order to develop a cohesive analytic solution, with shared concepts of the business entities, transformation rules, and end results, a PowerCenter-based development environment is required.

There are basically three ways to configure an Informatica-based data integration solution, although variations on these three options are certainly possible, particularly with the addition of PowerExchange products (e.g., PowerExchange for SAP NetWeaver, PowerExchange for PeopleSoft Enterprise) and Data Analyzer for front-end reporting. From a development environment standpoint, however, the following three configurations serve as the basis for determining how to best configure the environment for developers:
● Standalone PowerCenter. In this configuration, there is a single repository that cannot be shared with any others within the enterprise. This type of repository is referred to as a local repository and is typically used for small, independent, departmental data marts. Many of the capabilities within PowerCenter are available, including developer security, folder structures, and shareable objects. The primary development restrictions are that the objects in the repository can't be shared with other repositories, and this repository cannot access objects in any other repositories. Multiple developers, working on multiple projects, can still use this repository; folders can be configured to restrict access to specified developers (or groups); and a repository administrator with SuperUser authority can control production objects. This means that there would be a separate repository instance for development, testing, and production. Some companies manage to co-locate development and testing on one repository by segregating code through folder strategies.
● PowerCenter Data Integration Hub with Networked Local Repositories. This configuration combines a centralized, shared global repository with one or more distributed local repositories. The strength of this solution is that multiple development groups can work semi-autonomously, while sharing common development objects. In the production environment, distributing the server load across the PowerCenter server engines can leverage this same configuration. This option can dramatically affect the definition of the development environment.
● PowerCenter as a Data Integration Hub with a Data Analyzer Front-End to the Reporting Warehouse. This configuration provides an end-to-end suite of products that allow developers to build the entire data integration solution from data loads to end-user reporting.

PowerCenter Data Integration Hub with Networked Local Repositories
In this advanced repository configuration, the Technical Architect must pay careful attention to the sharing of development objects and the use of multiple repositories. Again, there is no single "correct" solution, only general guidelines for consideration. In most cases, the PowerCenter Global Repository becomes a development focal point. Departmental developers wishing to access enterprise definitions of sources, targets, and shareable objects connect to the Global Repository to do so. The layout of this repository, and its contents, must be thoroughly planned and executed. The Global Repository may include shareable folders containing:
● Source definitions. Because many source systems may be shared, it is important to have a single "true" version of their schemas resident in the Global Repository.
● Target definitions. Apply the same logic regarding source definitions.
● Shareable objects. Shared objects should be created and maintained in a single place; the Global Repository is the place.

TIP It is very important to house all globally-shared database schemas in the Global Repository. Because most IT organizations prefer to maintain their database schemas in a CASE/data modeling tool, the procedures for updating the PowerCenter definitions of source/target schemas must include importing these schemas from tools such as ERwin. It is far easier to develop these procedures for a single (global) repository than for each of the (independent) local repositories that may be using the schemas.

Of course, even if the overall development environment includes a PowerCenter Data Integration Hub, there may still be non-shared sources, targets, and development objects. In these cases, it is perfectly acceptable to house the definitions within a local repository. If necessary, these objects may eventually be migrated into the shared Global Repository. It may also still make sense to do local development and unit testing in a local repository, even for shared objects, since they shouldn't be shared until they have been fully tested.

During the Architect Phase and the Design Phase the Technical Architect should work closely with Project Management and the development lead(s) to determine the appropriate repository placement of development objects in a PowerCenter-based environment. After the initial configuration is determined, the Technical Architect can limit his/her involvement in this area.

For example, any data quality steps taken with Informatica Data Quality (IDQ) applications (such as those implemented in 2.8 Perform Data Quality Audit or 5.3 Design and Build Data Quality Process) are performed using processes saved to a discrete IDQ repository. These processes (called plans in IDQ parlance) can be added to PowerCenter transformations and subsequently saved with those transformations in the PowerCenter repository. As indicated above, data quality plans can be designed and tested within an IDQ repository before deployment in PowerCenter. Moreover, depending on their purpose, plans may remain in an IDQ server repository, from which they can be distributed as needed across the enterprise, for the life of the project.

In addition to the sharing advantages provided by the PowerCenter Data Integration Hub approach, the global repository also serves as a centralized entry point for viewing all repositories linked to it via networked local repositories. This mechanism allows a global repository administrator to oversee multiple development projects without having to separately log in to each of the individual local repositories. This capability is useful for ensuring that individual project teams are adhering to enterprise standards and may also be used by centralized QA teams, where appropriate.

Folder Architecture Options and Alternatives
Repository folders provide development teams with a simple method for grouping and organizing work units. The process for creating and administering folders is quite simple, and thoroughly explained in Informatica’s product documentation. The main area for consideration is the determination of an appropriate folder structure within one or more repositories.

TIP If the migration approach adopted by the Technical Architect involves migrating from a development repository to another repository (test or production), it may make sense for the "target" repository to mirror the folder structure within the development repository. This simplifies the repository-to-repository migration procedures. Another possible approach is to assign the same names to corresponding database connections in both the "source" and "target" repositories. This is particularly useful when performing folder copies from one environment to another because it eliminates the need to change database connection settings after the folder copy has been completed.

The most commonly employed general approaches to folder structure are:
● Folders by Subject (Target) Area. The Subject Area Division method provides a solid infrastructure for large data warehouse or data mart developments by organizing work by key business area. This strategy is particularly suitable for large projects populating numerous target tables. For example, folder names may be SALES, DISTRIBUTION, etc.
● Folder Division by Environment. This method is easier to establish and maintain than Folders by Subject Area, but is suitable only for small development teams working with a minimal number of mappings. As each developer completes unit tests in his/her individual work folders, the mappings or objects are consolidated as they are migrated to test or QA. Migration to production is significantly simplified, with the maximum number of required folder copies limited to the number of environments. Eventually, however, the number of mappings in a single folder may become too large to easily maintain. Folder names may be DEV1, DEV2, DEV3, TEST, QA, etc.
● Folder Division by Source Area. The Source Area Division method is attractive to some development teams, particularly if development is centralized around the source systems. In these situations, the promotion and deployment process can be quite complex depending on the load strategy. Folder names may be ERP, BILLING, etc.

In addition to these basic approaches, many PowerCenter development environments also include developer folders that are used as "sandboxes," allowing for unrestricted freedom in development and testing. Data Analyzer creates Personal Folders for each user name which can be used as a sandbox area for report development and test. Once the developer has completed the initial development and unit testing within his/her own sandbox folder, he/she can migrate the results to the appropriate folder.

TIP PowerCenter does not support nested folder hierarchies, which makes it a challenge to logically group development objects in different folders. A common technique for logically grouping folders is to use standardized naming conventions, typically prefixing folder names with a brief, unique identifier. For example, suppose three developers are working on the development of a Marketing department data mart. Concurrently, in the same repository, another group of developers is working on a Sales data mart. In order to allow each developer to work in his/her own folder, while logically grouping them together, the folders may be named SALES_DEV1, SALES_DEV2, SALES_DEV3, MRKT_DEV1, etc. Because the folders are arranged alphabetically, all of the SALES-related folders will sort together, as will the MRKT folders.
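
The naming convention described in this tip can be illustrated with a minimal Python sketch. The helper name folder_name and the sample prefixes are assumptions made for illustration only; nothing here calls PowerCenter. The sketch simply shows that a shared prefix keeps related folders together in an alphabetical listing.

```python
# Illustrative only: folder_name is a hypothetical helper, not a PowerCenter API.
def folder_name(prefix: str, role: str, index: int) -> str:
    """Build a folder name such as SALES_DEV1 from a project prefix."""
    return f"{prefix.upper()}_{role.upper()}{index}"


folders = [
    folder_name("mrkt", "dev", 2),
    folder_name("sales", "dev", 1),
    folder_name("mrkt", "dev", 1),
    folder_name("sales", "dev", 3),
    folder_name("sales", "dev", 2),
]

# An alphabetical listing groups the folders that share a prefix.
for name in sorted(folders):
    print(name)
```

Choosing short, unique prefixes up front matters more than the exact format, since renaming folders after mappings have been migrated is disruptive.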

Finally, it is also important to consider the migration process in the design of the folder structures. The migration process depends largely on the folder structure that is established, and the type of repository environment. In earlier versions of PowerCenter, the most efficient method to migrate an object was to perform a complete folder copy. This involves grouping mappings meaningfully within a folder, since all mappings within the folder migrate together. However, if individual objects need to be migrated, the migration process can become very cumbersome, since each object needs to be "manually" migrated. PowerCenter 7.x introduced the concept of team-based development and object versioning, which integrated a true version-control tool within PowerCenter. Objects can be treated as individual elements and can be checked out for development and checked in for testing. Objects can also be linked together to facilitate their deployment to downstream repositories. Data Analyzer 4.x uses the export and import of repository objects for the migration process among environments. Objects are exported and imported as individual pieces and cannot be linked together in a deployment group as they can in PowerCenter 7.x or migrated as a complete folder as they can in earlier versions of PowerCenter.

Developer Security
The security features built into PowerCenter and Data Analyzer allow the development team to be grouped according to the functions and responsibilities of each member. One common, but risky, approach is to give all developers access to the default Administrator ID provided upon installation of the PowerCenter or Data Analyzer software. Many projects use this approach because it allows developers to begin developing mappings and sessions as soon as the software is installed. INFORMATICA STRONGLY DISCOURAGES THIS PRACTICE. The following paragraphs offer some recommendations for configuring security profiles for a development team.

The security approach in PowerCenter and Data Analyzer is similar to that of database security environments. PowerCenter's security management is performed through the Repository Manager, and Data Analyzer's security is managed through tasks on the Administrator tab. The internal security enables multi-user development through management of users, groups, privileges, and folders. Despite the similarities, PowerCenter user IDs are distinct from database user IDs, and they are created, managed, and maintained via administrative functions provided by the PowerCenter Repository Manager or Data Analyzer Administrator.

Although privileges can be assigned to users or groups, it is more common to assign privileges to groups only, and then add users to each group. This approach is simpler than assigning privileges on a user-by-user basis since there are generally a few groups and many users, and any user can belong to more than one group. Every user must be assigned to at least one group.

For companies that have the capabilities to do so, LDAP integration is an available option that can minimize the separate administration of user names and passwords. If you use LDAP authentication for repository users, the repository maintains an association between repository user names and external login names. When you create a user, you can select the login name from the external directory. For additional information on PowerCenter and Data Analyzer security, including suggestions for configuring user privileges and folder-level privileges, see Configuring Security.

As development objects migrate closer to the production environment, security privileges should be tightened.
For example, the testing group is typically granted Execute permissions in order to run mappings, but should not be given Write access to the mappings. When the testing team identifies necessary changes, it can communicate those changes (via a Change Request or bug report) to the development group, which fixes the error and re-migrates the result to the test area. The tightest security of all is reserved for promoting development objects into production. In some environments, no member of the development team is permitted to move anything into production. In these cases, a System Owner or other system representative outside the development group must be given the appropriate repository privileges to complete the migration process. The Technical Architect and Repository Administrator must understand these conditions while designing an appropriate security solution.
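
The group-based model described above can be sketched in a few lines of Python. This is a simplified illustration, not the PowerCenter or Data Analyzer security API: the class names, the group names, and the privilege strings ("read", "write", "execute") are all assumptions chosen to mirror the text.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    privileges: set[str] = field(default_factory=set)


@dataclass
class User:
    name: str
    groups: list[Group]

    def __post_init__(self) -> None:
        # The text requires every user to belong to at least one group.
        if not self.groups:
            raise ValueError(f"user {self.name!r} must belong to a group")

    def can(self, privilege: str) -> bool:
        # A user holds a privilege if any of their groups grants it.
        return any(privilege in g.privileges for g in self.groups)


# Hypothetical groups: privileges are attached to groups, never to users.
developers = Group("developers", {"read", "write", "execute"})
testers = Group("testers", {"read", "execute"})  # testers run but do not edit

alice = User("alice", [developers])
bob = User("bob", [testers])

print(alice.can("write"), bob.can("write"), bob.can("execute"))
```

Because privileges live on the group, tightening security for an environment means editing one group definition rather than every user in it.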

Best Practices
Configuring Security

Sample Deliverables
None
Last updated: 19-Dec-07 16:54

Phase 3: Architect
Subtask 3.2.3 Develop Change Control Procedures
Description
Changes are inevitable during the initial development and maintenance stages of any project. Wherever and whenever the changes occur - in the logical and physical data models, extract programs, business rules, or deployment plans - they must be controlled.

Change control procedures include formal procedures to be followed when requesting a change to the developed system (such as sources, targets, mappings, mapplets, shared transformations, sessions, or batches for PowerCenter and schemas, global variables, reports, or shared objects for Data Analyzer). The primary purpose of a change control process is to facilitate the coordination among the various organizations involved with effecting this change (i.e., development, test, deployment, and operations). This change control process controls the timing, impact, and method by which development changes are migrated through the promotion hierarchy. However, the change control process must not be so cumbersome as to hinder speed of deployment. The procedures should be thorough and rigid, without imposing undue restrictions on the development team's goal of getting its solution into production in a timely manner.

This subtask addresses many of the factors influencing the design of the change control procedures. The procedures themselves should be a well-documented series of steps, describing what happens to a development object once it has been modified (or created) and unit tested by the developer. The change control procedures document should also provide background contextual information, including the configuration of the environment, repositories, and databases.

Prerequisites
None

Roles
Data Integration Developer (Secondary)
Database Administrator (DBA) (Secondary)
Presentation Layer Developer (Secondary)
Quality Assurance Manager (Approve)
Repository Administrator (Secondary)
System Administrator (Secondary)
Technical Project Manager (Primary)

Considerations
It is important to recognize that the change control procedures and the organization of the development environment are heavily dependent upon each other. It is impossible to thoroughly design one without considering the other. The following development environment factors influence the approach taken to change control:

Repository Configuration
Subtask 3.2.2 Define Development Environments discusses the two basic approaches to repository configuration. The first one, Stand-Alone PowerCenter, is the simplest configuration in that it involves a single repository. If that single repository supports both development and production (although this is not generally advisable), then the change control process is fairly straightforward; migrations involve copying the relevant object from a development folder to a production folder, or performing a complete folder copy. However, because of the many advantages gained by isolating development from production environments, Informatica recommends physically separating repositories whenever technically and fiscally feasible. This decision complicates the change control procedures somewhat, but provides a more stable solution.

The general approach for migration is similar regardless of whether the environment is a single repository or multiple repository approach. In either case, logical groupings of development objects have been created, representing the various promotion levels within the promotion hierarchy (e.g., DEV, TEST, QA, PROD). In the single repository approach, the logical grouping is accomplished through the use of folders named accordingly. In the multiple repository approach, an entire repository may be used for one (or more) promotion levels. Whenever possible, the production repository should be independent of the others. A typical configuration would be a shared repository supporting both DEV and TEST, and a separate PROD repository.
● If the object is a global object (reusable or not reusable), the change must be applied to the global repository.
● If the object is shared, the shortcuts referencing this object automatically reflect the change from any location in the global or local architecture. Therefore, only the "original" object must be migrated.
● If the object is stored in both repositories (i.e., global and local), the change must be made in both repositories.
● Finally, if the object is only stored locally, the change is only implemented in the local repository.
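
The placement rules in the list above reduce to a small decision function. The following Python sketch is only an illustration of that logic; the function name and the "global"/"local" labels are assumptions, and no repository is actually contacted.

```python
def repositories_to_change(in_global: bool, in_local: bool) -> list[str]:
    """Return the repositories in which a change must be applied.

    Shortcuts to a shared object are not listed because, per the rules
    above, they reflect a change to the original object automatically.
    """
    targets = []
    if in_global:
        targets.append("global")
    if in_local:
        targets.append("local")
    if not targets:
        raise ValueError("the object is not stored in any repository")
    return targets


print(repositories_to_change(in_global=True, in_local=False))
print(repositories_to_change(in_global=True, in_local=True))
print(repositories_to_change(in_global=False, in_local=True))
```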

TIP With a PowerCenter Data Integration Hub implementation, global repositories can register local repositories. This provides access to both repositories through one "console", simplifying the administrative tasks for completing change requests. In this case, the global Repository Administrator can perform all repository migration tasks.

Regardless of the repository configuration, however, the following questions must be considered in the change control procedures:
● What PowerCenter or Data Analyzer objects does this change affect?
● What other system objects are affected by the change?
● What processes (migration/promotion, load) does this change impact?
● What processes does the client have in place to handle and track changes?
● Who else uses the data affected by the change and are they involved in the change request?
● How will this change be promoted to other environments in a timely manner?
● What is the effort involved in making this change?
● Is there time in the project schedule for this change?
● Is there sufficient time to fully test the change?

Change Request Tracking Method
The change procedures must include a means for tracking change requests and their migration schedules, as well as a procedure for backing out changes, if necessary. The Change Request Form should include information about the nature of the change, the developer making the change, the timing of the request for migration, and enough technical information about the change that it can be reversed if necessary.

There are a number of ways to back out a changed development object. It is important to note, however, that prior to PowerCenter 7.x, reversing a change to a single object in the repository is very tedious and error-prone, and should be considered a last resort. The time to plan for this occurrence is during the implementation of the development environment, not after an incorrect change has been migrated into Production. Backing out a change in PowerCenter 7.x, by contrast, is as simple as reverting to a previous version of the object(s).
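
As a rough illustration of the information the Change Request Form should carry, the following Python sketch models it as a record. The class name, field names, and sample values are hypothetical; the fields simply mirror the items listed above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChangeRequest:
    description: str            # nature of the change
    developer: str              # developer making the change
    requested_migration: date   # timing of the request for migration
    objects_changed: list[str]  # development objects touched by the change
    backout_notes: str          # technical detail needed to reverse the change

    def is_reversible(self) -> bool:
        # A request can be backed out only if it names the affected
        # objects and records how to restore them.
        return bool(self.objects_changed) and bool(self.backout_notes.strip())


# Hypothetical example values.
request = ChangeRequest(
    description="Change target of s_load_orders from flat file to table",
    developer="jdoe",
    requested_migration=date(2007, 3, 1),
    objects_changed=["s_load_orders"],
    backout_notes="Restore the previous session target definition.",
)
print(request.is_reversible())
```

A check like is_reversible can be applied when a request is submitted, so that incomplete forms are returned before any migration is scheduled.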

Team Based Development, Tracking and Reverting to Previous Version
The team-based development option provides functionality in two areas: versioning and deployment. However, other features, such as repository queries and labeling, are necessary to ensure optimal use of versioning and deployment. The following sections describe this functionality at a general level. For a more detailed explanation of any of the capabilities of the team-based development features of PowerCenter, refer to the appropriate sections of the PowerCenter documentation.

While the functionality provided via team-based development is quite powerful, some ways of using it are clearly more effective than others in achieving the expected goals. The activities of coordinating development in a team environment, tracking finished work that needs to be reviewed or migrated, managing migrations, and ensuring minimal errors can be quite complex. The process requires a combination of PowerCenter functionality and user process to implement effectively.

Data Migration Projects
For Data Migration projects, change control is critical for success. The target system commonly undergoes continual change during the life of the data migration project. These changes cause changes to specifications, which in turn require changes to the mappings, sessions, workflows, and scripts that make up the data migration project. Change control is important because it allows project management to understand the scope of change and to limit the impact that process changes have on related processes. For data migration, the key to change control is the communication of changes to ensure that testing activities are integrated.

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 18:48

Phase 3: Architect
Subtask 3.2.5 Develop Change Management Process
Description
Change Management is the process for managing the implementation of changes to a project (e.g., data warehouse or data integration) including hardware, software, services, or related documentation. Its purpose is to minimize the disruption to services caused by change and to ensure that records of hardware, software, services, and documentation are kept up to date. The Change Management process enables the actual change to take place. Elements of the process include identifying the change, creating a request for change, impact assessment, approval, scheduling, and implementation.

Prerequisites
None

Roles
Business Project Manager (Primary)
Project Sponsor (Review Only)
Technical Project Manager (Primary)

Considerations
Identify Change
Change Management is necessary in any of the following situations:
● A problem arises that requires a change that will affect more than one business user or a user group such as sales, marketing, etc.
● A new requirement is identified as a result of advances in technology (e.g., a software upgrade) or a change in needs (for new functionality).
● A change is required to fulfill a change in business strategy as identified by a business leader or developer.

Request for Change
A request for change should be completed for each proposed change, with a checklist of items to be considered and approved before implementing the change. The change procedures must include a means for tracking change requests and their migration schedules, as well as a procedure for backing out changes, if necessary. The Change Request Form should include information about the nature of the change, the developer making the change, the timing of the request for migration, and enough technical information about the change that it can be reversed if necessary. Before implementing a change request in the PowerCenter environment, it is advisable to create an additional back-up repository. Using this back-up, the repository can be restored to a 'spare' repository database. After a successful restore, the original object can be retrieved via object copy. In addition, be sure to:
● Track changes manually (electronic or paper change request form), then change the object back to its original form by referring to the change request form.
● Create one to 'x' number of version folders, where 'x' is the number of versions back that repository information is maintained. If a change needs to be reversed, the object simply needs to be copied to the original development folder from this versioning folder. The number of 'versions' to maintain is at the discretion of the PowerCenter Administrator. Note, however, that this approach has the disadvantage of being very time consuming and may also greatly increase the size of the repository databases.
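
The version-folder approach in the second item behaves like a bounded history: only the last 'x' copies are kept, and the oldest is discarded when a new one is saved. The Python sketch below models that behavior under stated assumptions; the class name VersionFolders and the string "definitions" are hypothetical stand-ins for folder copies, not PowerCenter objects.

```python
from collections import deque


class VersionFolders:
    """Keep at most max_versions prior copies of an object definition."""

    def __init__(self, max_versions: int) -> None:
        # A deque with maxlen silently drops the oldest copy when full.
        self._versions = deque(maxlen=max_versions)

    def save(self, definition: str) -> None:
        self._versions.append(definition)

    def restore(self, versions_back: int = 1) -> str:
        # versions_back=1 returns the most recently saved copy.
        if not 1 <= versions_back <= len(self._versions):
            raise IndexError("requested version is no longer retained")
        return self._versions[-versions_back]


history = VersionFolders(max_versions=2)
history.save("mapping v1")
history.save("mapping v2")
history.save("mapping v3")  # "mapping v1" is discarded at this point

print(history.restore(1))
print(history.restore(2))
```

The sketch also makes the stated disadvantage visible: every retained version is a full copy, so storage grows in proportion to 'x'.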

PowerCenter Versions 7.X and 8.X
The team-based development option provides functionality in two areas: versioning and deployment. However, other features, such as repository queries and labeling, are required to ensure optimal use of versioning and deployment. The following sections describe this functionality at a general level. For a more detailed explanation of any of the capabilities of the team-based development features of PowerCenter, please refer to the appropriate sections of the PowerCenter documentation.

For clients using Data Analyzer for front-end reporting, certain considerations need to be addressed with the migration of objects:
● Data Analyzer's repository database contains user profiles in addition to reporting objects. If users are synchronized from outside sources (like an LDAP directory or via Data Analyzer's API), then a repository restore from one environment to another may delete user profiles (once the repository is linked to LDAP).
● When reports containing references to dashboards are migrated, the dashboards also need to be migrated to reflect the link to the report.
● In a clustered Data Analyzer configuration, certain objects that are migrated via XML imports may only be reflected on the node on which the import operation was performed. It may be necessary to stop and restart the other nodes to refresh them with these changes.

Approval to Proceed
An initial review of the Change Request form should assess the cost and value of proceeding with the change. If sufficient information is not provided on the request form to enable the initial reviewer to thoroughly assess the change, he or she should return the request form to the originator for further details. The originator can then resubmit the change request with the requested information. The change request must be tracked through all stages of the change request process, with thorough documentation regarding approval or rejection and resubmission.

Plan and Prepare Change
Once approval to proceed has been granted, the originator may plan and prepare the change in earnest. The following sections on the request for change must be completed at this stage:
● Full details of change – Inform Administrator, backup repository and backup database.
● Impact on services and users – Inform business users in advance about any anticipated outage.
● Assessment of risk of the change failing.
● Fallback plan in case of failure – Includes reverting to old version using TBD.
● Date and time of change – Migration / Promotion plan Test-Dev and Dev-Prod.

Impact Analysis
The Change Control Process must include a formalized approach to completing impact analysis. Any implemented change has some planned downstream impact (e.g., the values on a report will change, additional data will be included, a new target file will be populated, etc.). The importance of the impact analysis process is in recognizing unforeseen downstream effects prior to implementing the change.

In many cases, the impact is easy to define. For example, if a requested change is limited to changing the target of a particular session from a flat file to a table, the impact is obvious. However, most changes occur within mappings or within databases, and the hidden impacts can be worrisome. For example, if a business rule change is made, how will the end results of the mapping be affected? If a target table schema needs to be modified within the repository, the corresponding target database must also be changed, and it must be done in sync with the migration of the repository change.

An assessment must be completed to determine how a change request affects other objects in the analytic solution architecture. In many development projects, the initial analysis is performed, and then communicated to all affected parties (e.g., Repository Administrator, DBAs, etc.) at a regularly scheduled meeting. This ensures that everyone who needs to be notified is, and that all approve the change request. For PowerCenter, the Repository Manager can be used to identify object interdependencies. An impact analysis must answer the following questions:
● What PowerCenter or Data Analyzer objects does this change affect?
● What other system objects are affected by the change?
● What processes (i.e., migration/promotion, load) does this change impact?
● What processes does the client have in place to handle and track changes?
● Who else uses the data affected by the change and are they involved in the change request?
● How will this change be promoted to other environments in a timely manner?
● What is the effort involved in making this change?
● Is there time in the project schedule for this change?
● Is there sufficient time to fully test the change?
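
The first two questions amount to walking a dependency graph from the changed object to everything downstream of it. The Python sketch below shows one way to do that with a breadth-first traversal. It is not how the Repository Manager works internally; the DEPENDENTS map and all object names in it are invented for illustration.

```python
from collections import deque

# Hypothetical dependency map: each key is used by the objects in its list.
DEPENDENTS = {
    "src_ORDERS": ["m_load_orders"],
    "m_load_orders": ["s_load_orders"],
    "s_load_orders": ["wf_nightly"],
    "tgt_ORDERS_FACT": ["m_load_orders", "rpt_daily_sales"],
    "rpt_daily_sales": ["dash_sales"],
}


def downstream_impact(changed: str, dependents: dict[str, list[str]]) -> list[str]:
    """List every object reachable from the changed object, nearest first."""
    seen = {changed}
    impacted = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, []):
            if dep not in seen:
                seen.add(dep)
                impacted.append(dep)
                queue.append(dep)
    return impacted


print(downstream_impact("tgt_ORDERS_FACT", DEPENDENTS))
```

In this example, a schema change to the target table reaches a mapping, a session, a workflow, a report, and a dashboard, which is the kind of hidden impact the analysis is meant to surface before migration.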

Implementation
Following final approval and after relevant and timely communications have been issued, the change may be implemented in accordance with the plan and the scheduled date and time.

After implementation, the change request form should indicate whether the change was successful or unsuccessful so as to maintain a clear record of the outcome of the request.

Change Control and Migration/Promotion Process
Identifying the most efficient method for applying change to all environments is essential. Within the PowerCenter and Data Analyzer environments, the types of objects to manage are:
● Source definitions
● Target definitions
● Mappings and mapplets
● Reusable transformations
● Sessions
● Batches
● Reports
● Schemas
● Global variables
● Dashboards
● Schedules

In addition, there are objects outside of the Informatica architecture that are directly linked to these objects, so the appropriate procedures need to be established to ensure that all items are synchronized.

When a change request is submitted, the following steps should occur:
1. Perform impact analysis on the request. List all objects affected by the change, including development objects and databases.
2. Approve or reject the change or migration request. The Project Manager has authority to approve/reject change requests.
3. If approved, pass the request to the PowerCenter Administrator for processing.
4. Migrate the change to the test environment.
5. Test the requested change. If the change does not pass testing, the process will need to start over for this object.
6. Submit the promotion request for migration to QA and/or production environments.
7. If appropriate, the Project Manager approves the request.
8. The Repository Administrator promotes the object to appropriate environments.
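
The steps above can be read as a small state machine in which each request may only move to certain next stages. The following Python sketch encodes that reading. The stage names are assumptions, steps 3 and 4 are collapsed into a single transition, and allowing a rejection at the promotion approval step is an interpretation of "if appropriate" rather than something the procedure states.

```python
from enum import Enum


class Stage(Enum):
    SUBMITTED = "submitted"
    IMPACT_ANALYZED = "impact analyzed"
    APPROVED = "approved"
    REJECTED = "rejected"
    IN_TEST = "migrated to test"
    TEST_PASSED = "test passed"
    PROMOTION_REQUESTED = "promotion requested"
    PROMOTION_APPROVED = "promotion approved"
    PROMOTED = "promoted"


ALLOWED = {
    Stage.SUBMITTED: {Stage.IMPACT_ANALYZED},
    Stage.IMPACT_ANALYZED: {Stage.APPROVED, Stage.REJECTED},
    Stage.APPROVED: {Stage.IN_TEST},
    # A failed test sends the object back to the start of the process.
    Stage.IN_TEST: {Stage.TEST_PASSED, Stage.SUBMITTED},
    Stage.TEST_PASSED: {Stage.PROMOTION_REQUESTED},
    # Assumption: the Project Manager may also decline the promotion.
    Stage.PROMOTION_REQUESTED: {Stage.PROMOTION_APPROVED, Stage.REJECTED},
    Stage.PROMOTION_APPROVED: {Stage.PROMOTED},
    Stage.REJECTED: set(),
    Stage.PROMOTED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move a change request to the next stage, refusing skipped steps."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


stage = Stage.SUBMITTED
for nxt in (Stage.IMPACT_ANALYZED, Stage.APPROVED, Stage.IN_TEST):
    stage = advance(stage, nxt)
print(stage.value)
```

Encoding the allowed transitions in one table makes it straightforward to reject shortcuts, such as promoting an object that never passed testing.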

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 18:51

Phase 3: Architect
Task 3.3 Implement Technical Architecture
Description
While it is crucial to design and implement a technical architecture as part of the data integration project development effort, most of the implementation work is beyond the scope of this document. Specifically, the acquisition and installation of hardware and system software is generally handled by internal resources, and is accomplished by following pre-established procedures. This section touches on these topics, but is not meant to be a step-by-step guide to the acquisition and implementation process. After determining an appropriate technical architecture for the solution (3.1 Develop Solution Architecture), the next step is to physically implement that architecture. This includes procuring and installing the hardware and software required to support the data integration processes.

Prerequisites
3.2 Design Development Architecture

Roles
Database Administrator (DBA) (Secondary)
Project Sponsor (Approve)
Repository Administrator (Primary)
System Administrator (Primary)
Technical Architect (Primary)
Technical Project Manager (Primary)

Considerations
The project schedule should be the focus of the hardware and software implementation process. The entire procurement process, which may require a significant amount of time, must begin as soon as possible to keep the project moving forward. Delays in this step can cause serious delays to the project as a whole. There are, however, a number of proven methods for expediting the procurement and installation processes, as described in the related subtasks.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:45

Phase 3: Architect
Subtask 3.3.1 Procure Hardware and Software
Description
This is the first step in implementing the technical architecture. The procurement process varies widely among organizations, but is often based on a purchase request (i.e., Request for Purchase or RFP) generated by the Project Manager after the project architecture is planned and configuration recommendations are approved by IT management. An RFP is usually mandatory for procuring any new hardware or software. Although the forms vary widely among companies, an RFP typically lists what products need to be purchased, when they will be needed, and why they are necessary for the project. The document is then reviewed and approved by appropriate management and the organization's "buyer". It is critical to begin the procurement process well in advance of the start of development.

Prerequisites
3.2 Design Development Architecture

Roles
Database Administrator (DBA) (Secondary)
Project Sponsor (Approve)
Repository Administrator (Secondary)
System Administrator (Secondary)
Technical Architect (Primary)
Technical Project Manager (Primary)

Considerations
Frequently, the Project Manager does not control the purchase of new hardware and software. Approval must be received from another group or individual within the organization, often referred to as a "buyer". Even before product purchase decisions are finalized, it is a good idea to notify the buyer of impending purchases, providing a brief overview of the types of products that are likely to be required and the reasons for them. It may also be possible to begin the procurement process before all of the prerequisite steps are complete (see 2.2 Define Business Requirements, 3.1.2 Develop Architecture Logical View, and 3.1.3 Develop Configuration Recommendations). The Technical Architect should have a good idea of at least some of the software and hardware choices before a physical architecture and configuration recommendations are solidified. Finally, if development is ready to begin and the hardware procurement process is not yet complete, it may be worthwhile to get started on a temporary server with the intention of moving the work to the new server when it is available.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:45

Phase 3: Architect
Subtask 3.3.2 Install/Configure Software
Description
Installing, configuring, and deploying new hardware and software should not affect the progress of a data integration project. The entire development team depends on a properly configured technical environment. Incorrect installation or delays can have serious negative effects on the project schedule. Establishing and following a detailed installation plan can help avoid unnecessary delays in development. (See 3.1.2 Develop Architecture Logical View).

Prerequisites
3.2 Design Development Architecture

Roles
Database Administrator (DBA) (Primary)
Repository Administrator (Primary)
System Administrator (Primary)
Technical Architect (Review Only)
Technical Project Manager (Review Only)

Considerations
When installing and configuring hardware and software for a typical data warehousing project, the following Informatica software components should be considered:
● PowerCenter Services – The PowerCenter services, including the repository, integration, log, and domain services, should be installed and configured on a server machine.
● PowerCenter Client – The client tools for the PowerCenter engine must be installed and configured on the client machines for developers. The DataDirect ODBC drivers should also be installed on the client machines. The PowerCenter client tools allow a developer to interact with the repository through an easy-to-use GUI.
● PowerCenter Reports – PowerCenter Reports (PCR) is a reporting tool that enables users to browse and analyze PowerCenter metadata, allowing users to view PowerCenter operational load statistics and perform impact analysis. PCR is based on Informatica Data Analyzer, running on an included JBoss application server, to manage and distribute these reports via an internet browser interface.
● PowerCenter Reports Client – The PCR client is a web-based, thin-client tool that uses Microsoft Internet Explorer 6 as the client. Additional client tool installation for PCR is usually not necessary, although the proper version of Internet Explorer should be verified on client workstations.
● Data Analyzer Server – The analytics server engine for Data Analyzer should be installed and configured on a server.
● Data Analyzer Client – Data Analyzer is a web-based, thin-client tool that uses Microsoft Internet Explorer 6 as the client. Additional client tool installation for Data Analyzer is usually not necessary, although the proper version of Internet Explorer should be verified on the client machines of business users to ensure that minimum requirements are met.
● PowerExchange – PowerExchange has components that must be installed on the source system, PowerCenter server, and client.

In addition to considering the Informatica software components that should be installed, the preferred database for the data integration project should be selected and installed, keeping these important database size considerations in mind:
● PowerCenter Metadata Repository – Although you can create a PowerCenter metadata repository with a minimum of 100MB of database space, Informatica recommends allocating up to 150MB for PowerCenter repositories. Additional space should be added for versioned repositories. The database user should have privileges to create tables, views, and indexes.
● Data Analyzer Metadata Repository – Although you can create a Data Analyzer repository with a minimum of 60MB of database space, Informatica recommends allocating up to 150MB for Data Analyzer repositories. The database user should have privileges to create tables, views, and indexes.
● Metadata Manager Repository – Although you can create a Metadata Manager repository with a minimum of 550MB of database space, you may choose to allocate more space in order to plan for future growth. The database user should have privileges to create tables, views, and indexes.
● Data Warehouse Database – Allow ample space and plan for rapid growth.
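The sizing figures above can be collected into a small planning aid. The following is a minimal sketch in Python, not an Informatica utility: the dictionary keys, function names, and the growth factor are assumptions introduced for illustration, and only the MB figures come from the text.

```python
# Figures in MB, taken from the sizing considerations above.
# The keys and function names are hypothetical, chosen for this sketch.
REPOSITORY_SPACE_MB = {
    "powercenter": {"minimum": 100, "planned": 150},
    "data_analyzer": {"minimum": 60, "planned": 150},
    # The text gives only a minimum for Metadata Manager, so it is reused here.
    "metadata_manager": {"minimum": 550, "planned": 550},
}


def planned_space_mb(components, growth_factor=1.0):
    """Sum the planned allocation for the chosen repositories, scaled for growth."""
    if growth_factor < 1.0:
        raise ValueError("growth_factor must be at least 1.0")
    total = sum(REPOSITORY_SPACE_MB[name]["planned"] for name in components)
    return total * growth_factor


def below_minimum(allocations_mb):
    """Return the repositories whose proposed allocation is under the stated minimum."""
    return sorted(
        name
        for name, size in allocations_mb.items()
        if size < REPOSITORY_SPACE_MB[name]["minimum"]
    )
```

A DBA could call `below_minimum` with the proposed allocations before creating the schemas. Versioned repositories and the data warehouse itself are deliberately left out, since their sizes depend on the project.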

PowerCenter Server Installation
The PowerCenter services need to be installed and configured, along with any necessary database connectivity drivers, such as native drivers or ODBC. Connectivity needs to be established among all the platforms before the Informatica applications can be used. The recommended configuration for the PowerCenter environment is to install the PowerCenter services and the repository and target databases on the same multiprocessor machine. This approach minimizes network interference when the server is writing to the target database. Use this approach when available CPU and memory resources on the multiprocessor machine allow all software processes to operate efficiently without “pegging” the server. If available hardware dictates that the PowerCenter Server is separated physically from the target database server, Informatica recommends placing a high-speed network connection between the two servers. Some organizations house the repository database on a separate database server if they are running OLAP servers and want to consolidate metadata repositories. Because the repository tables are typically very small in comparison to the data mart tables, and storage parameters are set at the database level, it may be advisable to keep the repository in a separate database. For step-by-step instructions for installing the PowerCenter services, refer to the Informatica PowerCenter Installation Guide. The following list is intended to complement the installation guide when installing PowerCenter:
● Network Protocol – TCP/IP and IPX/SPX are the supported protocols for communication between the PowerCenter services and PowerCenter client tools. To improve repository performance, consider installing the Repository service on a machine with a fast network connection. To optimize performance, do not install the Repository service on a Primary Domain Controller (PDC) or a Backup Domain Controller (BDC).
● Native Database Drivers (or ODBC in some instances) are used by the Server to connect to the source, target, and repository databases. Ensure that all appropriate database drivers (and most recent patch levels) are installed on the PowerCenter server to access source, target, and repository databases.
● Operating System Patches – Prior to installing PowerCenter, refer to the PowerCenter Release Notes documentation to ensure that all required patches have been applied to the operating system. This step is often overlooked and can result in operating system errors and/or failures when running the PowerCenter Server.
● Data Movement Mode – The DataMovementMode option is set in the PowerCenter Integration Service configuration. The DataMovementMode can be set to ASCII or Unicode. Unicode is an international character set standard that supports all major languages (including US, European, and Asian), as well as common technical symbols. Unicode uses a fixed-width encoding of 16 bits for every character. ASCII is a single-byte code page that encodes character data with 7 bits. Although actual performance results depend on the nature of the application, if international code page support (i.e., Unicode) is not required, set the DataMovementMode to ASCII because the 7-bit storage of character data results in smaller cache sizes for string data, resulting in more efficient data movement.
● Versioning – If Versioning is enabled for a PowerCenter Repository, developers can save multiple copies of any PowerCenter object to the repository. Although this feature provides developers with a seamless way to manage changes during the course of a project, it also results in larger metadata repositories. If Versioning is enabled for a repository, Informatica recommends allocating a minimum of 500MB of space in the database for the PowerCenter repository.
● Lightweight Directory Access Protocol (LDAP) – If you use PowerCenter default authentication, you create users and maintain passwords in the PowerCenter metadata repository using Repository Manager. The Repository service verifies users against these user names and passwords.
If you use Lightweight Directory Access Protocol (LDAP), the Repository service passes a user login to the external directory for authentication, allowing synchronization of PowerCenter user names and passwords with network/corporate user names and passwords. The repository maintains an association between repository user names and external login names. You must create the user name-login associations, but you do not maintain user passwords in the repository. Informatica provides a PowerCenter plug-in that you can use to interface between PowerCenter and an LDAP server. To install the plug-in, perform the following steps:
1. Configure the LDAP module connection information from the Administration Console.
2. Register the package with each repository that you want to use it with.
3. Set up users in each repository.
For more information on configuring LDAP authentication, refer to the Informatica PowerCenter Repository Guide.
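The Data Movement Mode consideration above turns on whether any character data falls outside 7-bit ASCII. As a minimal sketch, assuming a representative sample of source values has already been extracted into Python strings, the check below suggests a mode. The function name is hypothetical and the result is only a hint; the actual setting is made in the Integration Service configuration as described above.

```python
def suggest_data_movement_mode(samples):
    """Suggest 'ASCII' when every sampled value is 7-bit ASCII, else 'Unicode'.

    `samples` is any iterable of strings drawn from the source data.
    A sample can miss rare characters, so treat 'ASCII' as provisional.
    """
    for value in samples:
        if not value.isascii():  # str.isascii() is available from Python 3.7
            return "Unicode"
    return "ASCII"
```

For example, a sample containing only values such as "ACME Corp" suggests ASCII, while a single value such as "Zürich" is enough to suggest Unicode.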

PowerCenter Client Installation
The PowerCenter Client needs to be installed on all developer workstations, along with any necessary drivers, including database connectivity drivers such as ODBC. Before you begin the installation, verify that you have enough disk space for the PowerCenter Client. You must have 300MB of disk space to install the PowerCenter 8 Client tools. Also, make sure you have 30MB of temporary file space available for the PowerCenter Setup. When installing PowerCenter Client tools via a standard installation, choose to install the “Client tools” and “ODBC” components.
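The disk space check described above can be scripted before a rollout to many workstations. The following Python sketch uses only the standard library; the function names are assumptions made for illustration, and the two thresholds are the figures quoted in the text.

```python
import shutil

CLIENT_INSTALL_MB = 300  # from the text: PowerCenter 8 Client tools
SETUP_TEMP_MB = 30       # from the text: temporary space for PowerCenter Setup


def has_enough_space(free_bytes, required_mb):
    """Return True when the free space covers the requirement in megabytes."""
    return free_bytes >= required_mb * 1024 * 1024


def check_client_prereqs(install_dir, temp_dir):
    """Check free space on the volumes holding the install and temp directories."""
    install_free = shutil.disk_usage(install_dir).free
    temp_free = shutil.disk_usage(temp_dir).free
    return {
        "install_ok": has_enough_space(install_free, CLIENT_INSTALL_MB),
        "temp_ok": has_enough_space(temp_free, SETUP_TEMP_MB),
    }
```

If both directories are on the same volume, the two requirements add up, so a stricter check against 330MB would be appropriate in that case.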

TIP You can install the PowerCenter Client tools in standard mode or silent mode. You may want to perform a silent installation if you need to install the PowerCenter Client on several machines on the network, or if you want to standardize the installation across all machines in the environment. When you perform a silent installation, the installation program uses information in a response file to locate the installation directory. You can also perform a silent installation for remote machines on the network.
When adding an ODBC data source name (DSN) to client workstations, it is a good idea to keep the DSN consistent among all workstations. Aside from eliminating the potential for confusion on individual developer machines, this is important when importing and exporting repository registries. The Repository Manager saves repository connection information in the registry. To simplify the process of setting up client systems, it is possible to export that information, and then import it for a new client. The registry references the data source names used in the exporting machine. If a registry is imported containing a DSN that does not exist on the client system, the connection will fail at runtime.
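One way to catch the DSN mismatch described above before importing a registry is to compare each workstation's DSN names against an agreed reference list. The sketch below assumes the DSN names have already been collected per workstation by some other means; the function name and the sample host and DSN names in the usage note are hypothetical.

```python
from typing import Dict, Set


def find_missing_dsns(reference: Set[str], workstations: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each workstation to the reference DSNs it lacks.

    Workstations that already define every reference DSN are omitted.
    Names are compared exactly as given.
    """
    missing = {}
    for host, dsns in workstations.items():
        gap = reference - dsns
        if gap:
            missing[host] = gap
    return missing
```

With a reference of `{"DW_DEV", "SRC_ORA"}` and a workstation that defines only `DW_DEV`, the result flags `SRC_ORA` as missing on that machine, which is the case in which an imported connection would fail.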

PowerCenter Reports Installation
PowerCenter Reports (PCR) replaces the PowerCenter Metadata Reporter. The reports are built on the Data Analyzer infrastructure. Data Analyzer must be installed and configured, along with the application server foundation software. Currently, PCR is shipped with the PowerCenter installation (both Standard and Advanced Editions).

The recommended configuration for the PCR environment is to place the PCR/Data Analyzer server, application server, and repository databases on the same multiprocessor machine. This approach minimizes network input/output as the PCR server reads from the PowerCenter repository database. Use this approach when available CPU and memory resources on the multiprocessor machine allow all software processes to operate efficiently without “pegging” the server. If available hardware dictates that the PCR server be physically separated from the PowerCenter repository database server, Informatica recommends placing a high-speed network connection between the two servers. For step-by-step instructions for installing the PowerCenter Reports, refer to the Informatica PowerCenter Installation Guide. The following list of considerations is intended to complement the installation guide when installing PCR:
● Operating System Patch Levels – Prior to installing PCR, be sure to refer to the Data Analyzer Release Notes documentation to ensure that all required patches have been applied to the operating system. This step is often overlooked and can result in operating system errors and/or failures if the correct patches are not applied.
● Lightweight Directory Access Protocol (LDAP) – If you use default authentication, you create users and maintain passwords in the Data Analyzer metadata repository. Data Analyzer verifies users against these user names and passwords. However, if you use Lightweight Directory Access Protocol (LDAP), Data Analyzer passes a user login to the external directory for authentication, allowing synchronization of Data Analyzer user names and passwords with network/corporate user names and passwords, as well as PowerCenter user names and passwords. The repository maintains an association between repository user names and external login names. You must create the user name-login associations, but you do not have to maintain user passwords in the repository. In order to enable LDAP, you must configure the IAS.properties and ldaprealm.properties files. For more information on configuring LDAP authentication, see the Data Analyzer Administration Guide.

PowerCenter Reports Client Installation
The PCR client is a web-based, thin-client tool that uses Microsoft Internet Explorer 6 as the client. The proper version of Internet Explorer should be verified on client machines, ensuring that Internet Explorer 6 is the default web browser, and the minimum system requirements should be validated. In order to use PCR, the client workstation should have at least a 300MHz processor and 128MB of RAM. Please note that these are the minimum requirements for the PCR client, and that if other applications are running on the client workstation, additional CPU and memory are required. In most situations, users are likely to be multi-tasking using multiple applications, so this should be taken into consideration.
Certain interactive features in PCR require third-party plug-in software to work correctly. Users must download and install the plug-in software on their workstation before they can use these features. PCR uses the following third-party plug-in software:
● Microsoft SOAP Toolkit – In PCR, you can export a report to an Excel file and refresh the data in Excel directly from the cached data in PCR or from data in the data warehouse through PCR. To use the data refresh feature, you must first install the Microsoft SOAP Toolkit. For information on downloading the Microsoft SOAP Toolkit, see “Working with Reports” in the Data Analyzer User Guide.
● Adobe SVG Viewer – In PCR, you can display interactive report charts and chart indicators. You can click on an interactive chart to drill into the report data and view details and select sections of the chart. To view interactive charts, you must install Adobe SVG Viewer. For more information on downloading Adobe SVG Viewer, see “Managing Account Information” in the Data Analyzer User Guide.

Lastly, for PCR to display its application windows correctly, Informatica recommends disabling any pop-up blocking utility on your browser. If a pop-up blocker is running while you are working with PCR, the PCR windows may not display properly.

Data Analyzer Server Installation
The Data Analyzer Server needs to be installed and configured along with the application server foundation software. Currently, Data Analyzer is certified on the following application servers:
● BEA WebLogic
● IBM WebSphere
● JBoss Application Server

Refer to the PowerCenter Installation Guide for the current list of supported application servers and exact version numbers.

TIP When installing IBM WebSphere Application Server, avoid using spaces in the installation directory path name for the application server, http server, or messaging server.
The recommended configuration for the Data Analyzer environment is to put the Data Analyzer Server, application server, repository, and data warehouse databases on the same multiprocessor machine. This approach minimizes network input/output as the Data Analyzer Server reads from the data warehouse database. Use this approach when available CPU and memory resources on the multiprocessor machine allow all software processes to operate efficiently without “pegging” the server. If available hardware dictates that the Data Analyzer Server is separated physically from the data warehouse database server, Informatica recommends placing a high-speed network connection between the two servers. For step-by-step instructions for installing the Data Analyzer Server components, refer to the Informatica Data Analyzer Installation Guide. The following list of considerations is intended to complement the installation guide when installing Data Analyzer:
● Operating System Patch Levels – Prior to installing Data Analyzer, refer to the Data Analyzer Release Notes documentation to ensure that all required patches have been applied to the operating system. This step is often overlooked and can result in operating system errors and/or failures if the correct patches are not applied.
● Lightweight Directory Access Protocol (LDAP) – If you use Data Analyzer default authentication, you create users and maintain passwords in the Data Analyzer metadata repository. Data Analyzer verifies users against these user names and passwords. However, if you use Lightweight Directory Access Protocol (LDAP), Data Analyzer passes a user login to the external directory for authentication, allowing synchronization of Data Analyzer user names and passwords with network/corporate user names and passwords, as well as PowerCenter user names and passwords. The repository maintains an association between repository user names and external login names. You must create the user name-login associations, but you do not maintain user passwords in the repository. In order to enable LDAP, you must configure the IAS.properties and ldaprealm.properties files. For more information on configuring LDAP authentication, refer to the Informatica Data Analyzer Administrator Guide.

TIP After installing Data Analyzer on the JBoss application server, set the minimum pool size to 0 in the file <JBOSS_HOME>/server/informatica/deploy/hsqldb-ds.xml. This ensures that the managed connections in JBoss will be configured properly. Without this setting, it is possible that email alert messages will not be sent properly.
TIP Repository Preparation: Before you install Data Analyzer, be sure to clear the database transaction log for the repository database. If the transaction log is full or runs out of space when the Data Analyzer installation program creates the Data Analyzer repository, the installation program will fail.
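The JBoss pool-size tip above can be applied by hand in a text editor. For repeated installations, a script along the following lines may help. This is a sketch under assumptions: it assumes the setting is held in a `min-pool-size` element, as is conventional in JBoss datasource descriptors, and the function name is hypothetical. Verify the element name against the actual hsqldb-ds.xml file and back it up before changing it.

```python
import xml.etree.ElementTree as ET


def set_min_pool_size(xml_text, size=0):
    """Return the datasource XML with every min-pool-size element set to `size`.

    Assumes the descriptor stores the value in <min-pool-size> elements.
    If no such element exists, the XML is returned without that setting.
    """
    root = ET.fromstring(xml_text)
    for node in root.iter("min-pool-size"):
        node.text = str(size)
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree discards XML comments when parsing with the default parser, so the rewritten file loses any explanatory comments present in the original. When those comments matter, editing the single value by hand is the safer choice.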

Data Analyzer Client Installation
The Data Analyzer Client is a web-based, thin-client tool that uses Microsoft Internet Explorer 6 as the client. The proper version of Internet Explorer should be verified on client machines, ensuring that Internet Explorer 6 is the default web browser, and the minimum system requirements should be validated. In order to use the Data Analyzer Client, the client workstation should have at least a 300MHz processor and 128MB of RAM. Please note that these are the minimum requirements for the Data Analyzer Client, and that if other applications are running on the client workstation, additional CPU and memory is required. In most situations, users are likely to be multi-tasking using multiple applications, so this should be taken into consideration. Certain interactive features in Data Analyzer require third-party plug-in software to work correctly. Users must download and install the plug-in software on their workstation before they can use these features. Data Analyzer uses the following third-party plug-in software:
● Microsoft SOAP Toolkit – In Data Analyzer, you can export a report to an Excel file and refresh the data in Excel directly from the cached data in Data Analyzer or from data in the data warehouse through Data Analyzer. To use the data refresh feature, you must first install the Microsoft SOAP Toolkit. For information on downloading the Microsoft SOAP Toolkit, see “Working with Reports” in the Data Analyzer User Guide.
● Adobe SVG Viewer – In Data Analyzer, you can display interactive report charts and chart indicators. You can click on an interactive chart to drill into the report data and view details and select sections of the chart. To view interactive charts, you must install Adobe SVG Viewer. For more information on downloading Adobe SVG Viewer, see “Managing Account Information” in the Data Analyzer User Guide.

Lastly, for Data Analyzer to display its application windows correctly, Informatica recommends disabling any pop-up blocking utility on your browser. If a pop-up blocker is running while you are working with Data Analyzer, the Data Analyzer windows may not display properly.

Metadata Manager Installation
Metadata Manager software can be installed after the development environment configuration has been completed and approved.
Metadata Manager requires a web server and a Java 2 Enterprise Edition (J2EE)-compliant application server. Metadata Manager works with BEA WebLogic Server, IBM WebSphere Application Server, and JBoss Application Server. If you choose to use BEA WebLogic or IBM WebSphere, they must be installed prior to the Metadata Manager installation. The JBoss Application Server can be installed from the Metadata Manager installation process. Informatica recommends that a system administrator who is familiar with application and web servers, LDAP servers, and the J2EE platform install the required software. For complete information on the Metadata Manager installation process, refer to the PowerCenter Installation Guide.
The following high-level steps are involved in the Metadata Manager installation process:
1. Install BEA WebLogic Server or IBM WebSphere Application Server on the machine where you plan to install Metadata Manager. You must install the application server and other required software before you install Metadata Manager.
2. You can install Metadata Manager on a machine with a Windows or UNIX operating system. Metadata Manager includes the following installation components:
● Metadata Manager
● Limited edition of PowerCenter
● Metadata Manager documentation in PDF format
● Metadata Manager and Data Analyzer integrated online help
● Configuration Console online help

Be sure to refer to the Metadata Manager Release Notes for information regarding the supported versions of each application. To install Metadata Manager for the first time, complete each of the following tasks in the order listed below:
1. Create database user accounts. Create one database user account for the Metadata Manager Warehouse and Metadata Manager Server repository and another for the Integration repository.
2. Install the application server. Install BEA WebLogic Server or IBM WebSphere Application Server.
3. Install PowerCenter 8. Install PowerCenter 8 to manage metadata extract and load tasks.
4. Install Metadata Manager. When installing Metadata Manager, provide the connection information for the database user accounts for the Integration repository and the Metadata Manager Warehouse and Metadata Manager Server repository. The Metadata Manager installation creates both repositories and installs other Metadata Manager components, such as the Configuration Console, documentation, and XConnects.
5. Optionally, run the pre-compile utility (for BEA WebLogic Server and IBM WebSphere). If you are using the BEA WebLogic Server as your application server, optionally pre-compile the JSP scripts to display the Metadata Manager web pages faster when they are accessed for the first time.
6. Apply the product license. Apply the application server license, as well as the PowerCenter and Metadata Manager licenses.
7. Configure the PowerCenter Server. Assign the Integration repository to the PowerCenter Server to enable running of prepackaged XConnect workflows. The workflow for each XConnect extracts metadata from the metadata source repository and loads it into the Metadata Manager Warehouse.
Note: For more information about installing Metadata Manager, see the “Installing Metadata Manager” chapter of the PowerCenter Installation Guide.
After the software has been installed and tested, the Metadata Manager Administrator can begin creating security groups, users, and the repositories. The following are some of the initial steps for the Metadata Manager Administrator once Metadata Manager is installed. For more information on any of these steps, refer to the Metadata Manager Administration Guide.
1. After completing the Metadata Manager installation, configure XConnects to extract metadata. Configure an XConnect for each source repository, and then load metadata from the source repositories into the Metadata Manager Warehouse.
2. Repository registration / creation in the Metadata Manager. Add each source repository to Metadata Manager. This action adds the corresponding XConnect for this repository in the Configuration Console.
3. Set up the Configuration Console. Verify the Integration repository, PowerCenter Server, and PowerCenter Repository Server connections in the Configuration Console. Also, specify the PowerCenter source files directory in the Configuration Console.
4. Set up and run the XConnect for each source repository using the Configuration Console.
5. To limit the tasks that users can perform and the type of source repository metadata objects that users can view and modify, set user privileges and object access permissions.

PowerExchange Installation
Before beginning the installation, take time to read the PowerExchange Installation Guide as well as the documentation for the specific PowerExchange products you have licensed and plan to install. Take time to identify and notify resources you are going to need to complete the installation. Depending on the specific product, you could need any or all of the following:
● Database Administrator
● PowerCenter Administrator
● MVS Systems Administrator
● UNIX Systems Administrator
● Security Administrator
● Network Administrator
● Desktop (PC) Support

Installing the PowerExchange Listener on Source Systems
The process for installing PowerExchange on the source system varies greatly depending on the source system. Take care to read through the installation documentation prior to attempting the installation. The PowerExchange Installation Guide has step by step instructions for installing PowerExchange on all supported platforms.

Installing the PowerExchange Navigator on the PC
The Navigator allows you to create and edit data maps and tables. To install PowerExchange on the desktop (PC) for the first time, complete each of the following tasks in the order listed below:
1. Install the PowerExchange Navigator. Administrator access may be required to install the software.
2. Modify the dbmover.cfg file. Depending on your installation, modifications may not be required. Refer to the PowerExchange Reference Manual for information on the parameters in dbmover.cfg.

Installing PowerExchange Client for the PowerCenter Server
The PowerExchange client for the PowerCenter server allows PowerCenter to read data from PowerExchange data sources. The PowerCenter Administrator should perform the installation with the assistance of a server administrator. It is recommended that a separate user account be created to run the required processes. A PowerCenter Administrator needs to register the PowerExchange plug-in with the PowerExchange repository. Informatica recommends that the installation be performed in one environment and tested from end-to-end (from data map creation to running workflows) before attempting to install the product in other environments.

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 18:58

Phase 4: Design
4 Design
● 4.2 Analyze Data Sources
    ○ 4.2.1 Develop Source to Target Relationships
    ○ 4.2.2 Determine Source Availability

Phase 4: Design
Description
The Design Phase lays the foundation for the upcoming Build Phase. In the Design Phase, all data models are developed, source systems are analyzed, and physical databases are designed. The presentation layer is designed and a prototype is constructed. Each task, if done thoroughly, enables the data integration solution to perform properly and provides an infrastructure that allows for growth and change. Each task in the Design Phase provides the functional architecture for the development process using PowerCenter. The design of the target data store may include data warehouses and data marts, star schemas, web services, message queues, or custom databases to drive specific applications or effect a data migration. The Design Phase requires that several preparatory tasks be completed before beginning the development work of building and testing mappings, sessions, and workflows within PowerCenter.

Prerequisites
3 Architect

Roles
Application Specialist (Primary)
Business Analyst (Primary)
Data Architect (Primary)
Data Integration Developer (Primary)
Data Quality Developer (Primary)
Database Administrator (DBA) (Primary)
Presentation Layer Developer (Primary)
System Administrator (Primary)
Technical Project Manager (Review Only)

Considerations
None

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:45

Phase 4: Design
Task 4.2 Analyze Data Sources
Description
The goal of this task is to understand the various data sources that will be feeding the solution. Completing this task successfully increases the understanding needed to efficiently map data using PowerCenter. It is important to understand all of the data elements from a business perspective, including the data values and dependencies on other data elements. It is also important to understand where the data comes from, how the data is related, and how much data there is to deal with (i.e., volume estimates).

Prerequisites
None

Roles
Application Specialist (Primary)
Business Analyst (Primary)
Data Architect (Primary)
Data Integration Developer (Primary)
Database Administrator (DBA) (Secondary)
System Administrator (Primary)
Technical Project Manager (Review Only)

Considerations
None

Best Practices
Using Data Explorer for Data Discovery and Analysis

Sample Deliverables
None
Last updated: 01-Feb-07 18:46

Phase 4: Design
Subtask 4.2.1 Develop Source to Target Relationships
Description
The third step in analyzing data sources is to determine the relationship between the sources and targets and to identify any rework or target redesign that may be required if specific data elements are not available. This step defines the relationships between the data elements and clearly illuminates possible data issues, such as incompatible data types or unavailable data elements.

Prerequisites
None

Roles
Application Specialist (Secondary)
Business Analyst (Primary)
Data Architect (Primary)
Data Integration Developer (Primary)
Technical Project Manager (Review Only)

Considerations
Creating the relationships between the sources and targets is a critical task in the design process. It is important to map all of the data elements from the source data to an appropriate counterpart in the target schema. Taking the necessary care in this effort should result in the following:
● Identification of any data elements in the target schema that are not currently available from the source. The first step determines what data is not currently available from the source. When the source data is not available, the Data Architect may need to re-evaluate and redesign the target schema or determine where the necessary data can be acquired.
● Identification of any data elements that can be removed from source records because they are not needed in the target. This step eliminates any data elements that are not required in the target. In many cases, unnecessary data is moved through the extraction process. Regardless of whether the data is coming from flat files or relational sources, it is best to eliminate as much unnecessary data as possible, as early in the process as possible.
● Determination of the data flow required for moving the data from the source to the target. This can serve as a preliminary design specification for work to be performed during the Build Phase. Any data modifications or translations should be noted during this determination process as the source-to-target relationships are established.
● Determination of the quality of the data in the source. This ensures that data in the target is of high quality and serves its purpose. All source data should be analyzed in a data quality application to assess its current data quality levels. During the Design Phase, data quality processes can be introduced to fix identified issues and/or enrich data using reference information. Data quality should also be incorporated as an on-going process to be leveraged by the target data source.

The next step in this subtask produces a Target-Source Matrix, which provides a framework for matching the business requirements to the essential data elements and defining how the source and target elements are paired. The matrix lists each of the target tables from the data mart in the rows of the matrix and lists descriptions of the source systems in the columns, to provide the following data:
● Operational (transactional) system in the organization
● Operational data store
● External data provider
● Operating system
● DBMS
● Data fields
● Data descriptions
● Data profiling/analysis results
● Data quality operations, where applicable
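
As an illustration only, the following Python sketch shows one way a Target-Source Matrix could be held as data and checked for target columns that no source supplies. The table, system, and field names are hypothetical, and the structure is an assumption made for this example rather than a Velocity or PowerCenter artifact.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    """One cell of the matrix: what a source system contributes to a target table."""
    system: str   # operational system, ODS, or external provider (hypothetical names)
    dbms: str
    fields: dict[str, str] = field(default_factory=dict)  # target column -> source field


def unmapped_target_columns(target_columns, entries):
    """Return the target columns that none of the listed source systems supplies."""
    covered = set()
    for entry in entries:
        covered.update(entry.fields)
    return sorted(set(target_columns) - covered)


# Hypothetical target table (matrix row) and its candidate sources (matrix columns).
matrix = {
    "CUSTOMER_DIM": [
        SourceEntry("CRM", "Oracle",
                    {"customer_id": "CUST.ID", "name": "CUST.FULL_NAME"}),
        SourceEntry("Billing ODS", "DB2",
                    {"customer_id": "ACCT.CUST_NO", "credit_limit": "ACCT.LIMIT"}),
    ]
}
target_columns = ["customer_id", "name", "credit_limit", "segment"]

gaps = unmapped_target_columns(target_columns, matrix["CUSTOMER_DIM"])
print(gaps)
```

Any column reported in `gaps` is a candidate for the target redesign or alternative sourcing discussed above.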

One objective of the data integration solution is to provide an integrated view of key business data. Therefore, for each target table one or more source systems must exist. The matrix should show all of the possible sources for this particular initiative. After this matrix is completed, the data elements must be checked for correctness and validated with both the Business Analyst(s) and the user community. The Project Manager is responsible for ensuring that these parties agree that the data relationships defined in the Target-Source Matrix are correct and meet the needs of the data integration solution. Prior to any mapping development work, the Project Manager should obtain sign-off from the Business Analysts and user community.

Undefined Data
In some cases the Data Architect cannot locate or access the data required to establish a rule defined by the Business Analyst. When this occurs, the Business Analyst may need to revalidate the particular rule or requirement to ensure that it meets the end users' needs. If it does not, the Business Analyst and Data Architect must determine if there is another way to use the available data elements to enforce the rule. Enlisting the services of the System Administrator or another knowledgeable source system resource may be helpful. If no solution is found, or if the data meets requirements but is not available, the Project Manager should communicate with the end-user community and propose an alternative business rule. Choosing to eliminate data too early in the process due to inaccessibility, however, may cause problems further down the road.

The Project Manager should meet with the Business Analyst and the Data Architect to determine what rules or requirements can be changed and which must remain as originally defined. The Data Architect can propose data elements that can be safely dropped or changed without compromising the integrity of the user requirements. The Project Manager must then identify any risks inherent in eliminating or changing the data elements and decide which are acceptable to the project. Some of the potential risks involved in eliminating or changing data elements are:
● Losing a critical piece of data required for a business rule that was not originally defined but is likely to be needed in the future. Such data loss may require a substantial amount of rework and can potentially affect project timelines.
● Any change in data that needs to be incorporated in the Source or Target data models requires substantial time to rework and could significantly delay development. Such a change would also push back all tasks defined and require a change in the Project Plan.
● Changes in the Source system model may drop secondary relationships that were not initially visible.

Source Changes after Initial Assessment
When a source changes after the initial assessment, the corresponding Target-Source Matrix must also change. The Data Architect needs to outline everything that has changed, including the data types, names, and definitions. Then, the various risks involved in changing or eliminating data elements must be re-evaluated. The Data Architect should also decide which risks are acceptable. Once again, the System Administrator may provide useful information about the reasons for any changes to the source system and their effect on data relationships.
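
A minimal sketch, assuming the source definition can be captured as a simple mapping of field name to data type, of how the changed names and data types might be listed before the Target-Source Matrix is updated. The field names and types shown are hypothetical.

```python
def diff_source_schema(before, after):
    """Compare two {field_name: data_type} snapshots of the same source."""
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    retyped = sorted(
        name for name in set(before) & set(after) if before[name] != after[name]
    )
    return {"removed": removed, "added": added, "retyped": retyped}


# Hypothetical snapshots: one from the initial assessment, one after the change.
initial = {"CUST_NO": "NUMBER(10)", "FULL_NAME": "VARCHAR(60)", "DOB": "DATE"}
current = {"CUST_NO": "VARCHAR(12)", "FULL_NAME": "VARCHAR(60)", "BIRTH_DATE": "DATE"}

changes = diff_source_schema(initial, current)
print(changes)
```

A renamed field appears here as one removal plus one addition; deciding whether `DOB` and `BIRTH_DATE` are the same element still requires someone who knows the source system.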

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:05

Phase 4: Design
Subtask 4.2.2 Determine Source Availability
Description
The final step in the 4.2 Analyze Data Sources task is to determine when all source systems are likely to be available for data extraction. This is necessary in order to determine realistic start and end times for the load window. The developers need to work closely with the source system administrators during this step because the administrators can provide specific information about the hours of operations for their systems. The final deliverable in this subtask, the Source Availability Matrix, lists all the sources that are being used for data extraction and specifies the systems' downtimes during a 24-hour period. This matrix should contain details of the availability of the systems on different days of the week, including weekends and holidays.

Prerequisites
None

Roles
Application Specialist (Primary)
Data Integration Developer (Secondary)
Database Administrator (DBA) (Secondary)
System Administrator (Primary)
Technical Project Manager (Review Only)

Considerations
The information generated in this step will be crucial later in the development process for determining load windows and availability of source data. In many multi-national companies, source systems are distributed globally and, therefore, may not be available for extraction concurrently. This can pose problems when trying to extract data with minimal (or no) disruption of users' day-to-day activities. Determining the source availability can go a long way in determining when the load window for a regularly scheduled extraction can run. This information is also helpful for determining whether an Operational Data Store (ODS) is needed. Sometimes, the extraction times can be so varied among necessary source systems that an ODS or staging area is required purely for logistical reasons.
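
The following sketch is one possible way to read a Source Availability Matrix programmatically: it intersects the hours in which each system can be extracted to find a shared load window. It assumes availability has already been converted to whole hours in UTC; the system names and hours are hypothetical.

```python
def common_extraction_hours(availability):
    """Intersect the per-system hours (0-23, UTC) that are open for extraction."""
    hour_sets = [set(hours) for hours in availability.values()]
    if not hour_sets:
        return []
    return sorted(set.intersection(*hour_sets))


# Hypothetical weekday availability for three globally distributed systems.
weekday = {
    "ERP (New York)": range(2, 8),                            # 02:00-07:59 UTC
    "CRM (London)": range(0, 5),                              # 00:00-04:59 UTC
    "Billing (Singapore)": list(range(16, 24)) + [0, 1, 2],   # 16:00-02:59 UTC
}

shared = common_extraction_hours(weekday)
print(shared)
```

A very narrow or empty shared window, as in this example, is the kind of result that points toward extracting each source on its own schedule into an ODS or staging area.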

Best Practices
None

Sample Deliverables
Source Availability Matrix
Target-Source Matrix

Last updated: 01-Feb-07 18:46

Phase 5: Build
5 Build
● 5.1 Launch Build Phase
    ○ 5.1.1 Review Project Scope and Plan
    ○ 5.1.3 Define Defect Tracking Process
● 5.3 Design and Build Data Quality Process
    ○ 5.3.1 Design Data Quality Technical Rules
    ○ 5.3.2 Determine Dictionary and Reference Data Requirements
    ○ 5.3.3 Design and Execute Data Enhancement Processes
    ○ 5.3.4 Design Run-time and Real-time Processes for Operate Phase Execution
    ○ 5.3.5 Develop Inventory of Data Quality Processes
    ○ 5.3.6 Review and Package Data Transformation Specification Processes and Documents

Phase 5: Build
Description
The Build Phase uses the design work completed in the Architect Phase and the Design Phase as inputs to physically create the data integration solution including data quality and data transformation development efforts. At this point, the project scope, plan, and business requirements defined in the Manage Phase should be re-evaluated to ensure that the project can deliver the appropriate value at an appropriate time.

Prerequisites
None

Roles
Business Analyst (Primary)
Business Project Manager (Secondary)
Data Architect (Primary)
Data Integration Developer (Primary)
Data Quality Developer (Primary)
Data Steward/Data Quality Steward (Secondary)
Data Warehouse Administrator (Secondary)
Database Administrator (DBA) (Primary)
Presentation Layer Developer (Primary)
Project Sponsor (Approve)
Quality Assurance Manager (Primary)
Repository Administrator (Secondary)
System Administrator (Secondary)
Technical Project Manager (Primary)
Test Manager (Primary)

Considerations
PowerCenter serves as a complete data integration platform to move data from source to target databases, perform data transformations, and automate the extract, transform, and load (ETL) processes. As a project progresses from the Design Phase to the Build Phase, it is helpful to review the activities involved in each of these processes.
● Extract - PowerCenter extracts data from a broad array of heterogeneous sources. Data can be accessed from sources including IBM mainframe and AS400 systems, MQ Series, and TIBCO; ERP systems from SAP, PeopleSoft, and Siebel; relational databases; HIPAA sources; flat files; web log sources; and direct parsing of XML data files through DTDs or XML schemas. PowerCenter interfaces mask the complexities of the underlying DBMS for the developer, enabling the build process to focus on implementing the business logic of the solution.
● Transform - The majority of the work in the Build Phase focuses on developing and testing data transformations. These transformations apply the business rules, cleanse the data, and enforce data consistency from disparate sources as data is moved from source to target.
● Load - PowerCenter automates much of the load process. To increase performance and throughput, loads can be multi-threaded, pipelined, streamed (concurrent execution of the extract, transform, and load steps), or serviced by more than one server. In addition, DB2, Oracle, Sybase IQ, and Teradata external loaders can be used to increase performance. Data can be delivered to EAI queues for enterprise applications. Data loads can also take advantage of Open Database Connectivity (ODBC) or use native database drivers to optimize performance. Pushdown optimization can even allow some or all of the transformation work to occur in the target database itself.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:46

Phase 5: Build
Task 5.1 Launch Build Phase
Description
In order to begin the Build phase, all analysis performed in previous phases of the project needs to be compiled, reviewed and disseminated to the members of the Build team. Attention should be given to project schedule, scope, and risk factors. The team should be provided with:
● Project background
● Business objectives for the overall solution effort
● Project schedule, complete with key milestones, important deliverables, dependencies, and critical risk factors
● Overview of the technical design including external dependencies
● Mechanism for tracking scope changes, problem resolution, and other business issues

A series of meetings may be required to transfer the knowledge from the Design team to the Build team, ensuring that the appropriate staff is provided with relevant information. Some or all of the following types of meetings may be required to get development under way:
● Kick-off meeting to introduce all parties and staff involved in the Build phase
● Functional design review to discuss the purpose of the project and the benefits expected and review the project plan
● Technical design review to discuss the source to target mappings, architecture design, and any other technical documentation

Information provided in these meetings should enable members of the data integration team to immediately begin development. As a result of these meetings, the integration team should have a clear understanding of the environment in which they are to work, including databases, operating systems, database/SQL tools available in the environment, file systems within the repository and file structures within the organization relating to the project, and all necessary user logons and passwords.

The team should be provided with points of contact for all facets of the environment (e.g., DBA, UNIX/NT Administrator, PowerCenter Administrator). The team should also be aware of the appropriate problem escalation plan. When team members encounter design problems or technical problems, there must be an appropriate path for problem escalation. The Project Manager should establish a specific mechanism for problem escalation along with a problem tracking report.

Prerequisites
None

Roles
Business Analyst (Secondary)
Data Architect (Primary)
Data Integration Developer (Secondary)
Data Warehouse Administrator (Secondary)
Database Administrator (DBA) (Secondary)
Presentation Layer Developer (Primary)
Quality Assurance Manager (Primary)
Repository Administrator (Review Only)
System Administrator (Review Only)
Technical Project Manager (Primary)
Test Manager (Primary)

Considerations
It is important to include all relevant parties in the launch activities. If all points of discussion cannot be resolved during the kick-off meeting, the key personnel in each area should be present to reschedule quickly, so as not to affect the overall schedule. Because of the nature of the development process, there are often bottlenecks in the development flow. The Project Manager should be aware of the risk factors that emanate from outside the project and should be able to anticipate where bottlenecks are likely to occur. The Project Manager also needs to be aware of the external factors that create project dependencies, and should avoid having meetings prematurely when external dependencies have not been resolved. Having meetings prior to resolving these issues can result in significant downtime for the developers while they wait to have their sources in place and finalized.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:46

Phase 5: Build
Subtask 5.1.1 Review Project Scope and Plan
Description
The Build team needs to understand the project's objectives, scope, and plan in order to prepare themselves for the Build Phase. There is often a tendency to waste time developing non-critical features or functions. The team should review the project plan and identify the critical success factors and key deliverables to avoid focusing on relatively unimportant tasks. This helps to ensure that the project stays on its original track and avoids much unnecessary effort. The team should be provided with:
● Detailed descriptions of deliverables and timetables.
● Dependencies that affect deliverables.
● Critical success factors.
● Risk assessments made by the design team.

With this information, the Build team should be able to enhance the project plan to navigate through the risk areas, dependencies, and tasks to reach its goal of developing an effective solution.

Prerequisites
None

Roles
Business Analyst (Review Only)
Data Architect (Review Only)
Data Integration Developer (Review Only)
Data Warehouse Administrator (Review Only)
Database Administrator (DBA) (Review Only)
Presentation Layer Developer (Review Only)
Quality Assurance Manager (Review Only)
Technical Project Manager (Primary)

Considerations
With the Design Phase complete, this is the first opportunity for the team to review what it has learned during the Architect Phase and the Design Phase about the sources of data for the solution. It is also a good time to review and update the project plan, which was created before these findings, to incorporate the knowledge gained during the earlier phases. For example, the team may have learned that the source of data for marketing campaign programs is a spreadsheet that is not easily accessible by the network on which the data integration platform resides. In this case, the team may need to plan additional tasks and time to build a method for accessing the data. This is also an appropriate time to review data profiling and analysis results to ensure all data quality requirements have been taken into consideration.

During the project scope and plan review, significant effort should be made to identify upcoming Build Phase risks and assess their potential impact on project schedule and/or cost. Because the design is complete, risk management at this point tends to be more tactical than strategic; however, the team leadership must be fully aware of key risk factors that remain. Team members are responsible for identifying the risk factors in their respective areas and notifying project management during the review process.

Best Practices
None

Sample Deliverables
Project Review Meeting Agenda

Last updated: 01-Feb-07 18:46

Phase 5: Build
Subtask 5.1.3 Define Defect Tracking Process
Description
Since testing is designed to uncover defects, it is crucial to properly record the defects as they are identified, along with their resolution process. This requires a ‘defect tracking system’ that may be entirely manual, based on shared documents such as spreadsheets, or automated using, say, a database with a web browser front-end. Whatever tool is chosen, sufficient details of the problem must be recorded to allow proper investigation of the root cause and then the tracking of the resolution process. The success of a defect tracking system depends on:
● Formal test plans and schedules being in place, to ensure that defects are discovered, and that their resolutions can be retested.
● Sufficient details being recorded to ensure that any problems reported are repeatable and can be properly investigated.

Prerequisites
None

Roles
Data Integration Developer (Review Only)
Data Warehouse Administrator (Review Only)
Database Administrator (DBA) (Review Only)
Presentation Layer Developer (Review Only)
Quality Assurance Manager (Primary)
Repository Administrator (Review Only)
System Administrator (Review Only)
Technical Project Manager (Primary)
Test Manager (Primary)

Considerations
The defect tracking process should encompass these steps:
● Testers prepare Problem Reports to describe defects identified.
● Test Manager reviews these reports and assigns priorities on an Urgent/High/Medium/Low basis (‘Urgent’ should only be used for problems that will prevent or severely delay further testing).
● Urgent problems are immediately passed to the Project Manager for review/action.
● Non-urgent problems are reviewed by the Test Manager and Project Manager on a regular basis (this can be daily at a critical development time, but is usually less frequent) to agree on priorities for all outstanding problems.
● The Project Manager assigns problems for investigation according to the agreed-upon priorities.
● The ‘investigator’ attempts to determine the root cause of the defect and to define the changes needed to rectify the defect.
● The Project Manager reviews the results of investigations and assigns rectification work to ‘fixers’ according to priorities and effective use of resources.
● The ‘fixer’ makes the required changes and conducts unit testing. Regression testing is also typically conducted. The Project Manager may decide to group a number of fixes together to make effective use of resources.
● The Project Manager and Test Manager review the test results at their next meeting and agree on closure, if appropriate.
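
For teams that automate defect tracking, the triage portion of the steps above could be sketched as follows. This is a hypothetical illustration in Python, not a prescribed tool: the class names, status values, and sample reports are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    URGENT = 0   # prevents or severely delays further testing
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class ProblemReport:
    report_id: int
    summary: str
    priority: Priority
    status: str = "open"   # assumed lifecycle: open, investigating, fixing, retest, closed


def triage(reports):
    """Split open reports into urgent items and a priority-ordered review queue."""
    open_reports = [r for r in reports if r.status == "open"]
    urgent = [r for r in open_reports if r.priority == Priority.URGENT]
    review_queue = sorted(
        (r for r in open_reports if r.priority != Priority.URGENT),
        key=lambda r: (r.priority, r.report_id),
    )
    return urgent, review_queue


# Hypothetical Problem Reports raised by testers.
reports = [
    ProblemReport(1, "Row counts differ between source and target", Priority.MEDIUM),
    ProblemReport(2, "Session fails on start and blocks all testing", Priority.URGENT),
    ProblemReport(3, "Date field truncated in target", Priority.HIGH),
    ProblemReport(4, "Typo in report heading", Priority.LOW, status="closed"),
]

urgent, review_queue = triage(reports)
print([r.report_id for r in urgent], [r.report_id for r in review_queue])
```

Urgent items go straight to the Project Manager, while the ordered queue feeds the regular review between the Test Manager and Project Manager.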

Best Practices
None

Sample Deliverables
Issues Tracking

Last updated: 01-Feb-07 18:46

Phase 5: Build
Task 5.3 Design and Build Data Quality Process
Description
Follow the steps in this task to design and build the data quality enhancement processes that ensure the project data meets the standards of data quality required for progress through the rest of the project. The processes designed in this task are based on the results of 2.8 Perform Data Quality Audit. Both the design and build components are captured in the Build Phase because much of this work is iterative: as intermediate builds of the data quality process are reviewed, the design is further expanded and enhanced.

Note: If the results of the Data Quality Audit indicate that the project data already meets all required levels of data quality, then you can skip this task. However, this is unlikely to occur.

Here again (as in subtask 2.3.1 Identify Source Data Systems) it is important to work as far as is practicable with the actual source data. Using data derived from the actual source systems, either the complete dataset or a subset, was essential in identifying quality issues during the Data Quality Audit and determining if the data meets the business requirements (i.e., if it answers the business questions identified in the Manage Phase). The data quality enhancement processes designed in the subtasks of this task must operate on as much of the project dataset(s) as deemed necessary, and possibly the entire dataset.

Data quality checks can be of two types: one covers the metadata characteristics of the data, and the other covers the quality of the data contents from a business perspective. In the case of complex ERP systems like SAP, where implementation has a high degree of variation from the base product, a thorough data quality check should be performed to consider the customizations.

Prerequisites
None

Roles
Business Analyst (Primary)
Business Project Manager (Secondary)
Data Integration Developer (Secondary)
Data Quality Developer (Primary)
Data Steward/Data Quality Steward (Secondary)
Technical Project Manager (Approve)

Considerations
Because the quality of the source system data has a major effect on the correctness of all downstream data, it is imperative to resolve as many of the data issues as possible, as early as possible. Making the necessary corrections at this stage eliminates many of the questions that may otherwise arise later during testing and validation. If the data is flawed, the development initiative faces a very real danger of failing. In addition, eliminating errors in the source data makes it far easier to determine the nature of any problems that may arise in the final data outputs.

If data comes from different sources, it is mandatory to correct data for each source as well as for the integrated data. If data comes from a mainframe, it is necessary to use the proper access method to interpret data correctly. Note, however, that Informatica Data Quality (IDQ) applications do not read data directly from the mainframe.

As indicated above, the issue of data quality covers far more than simply whether the source and target data definitions are compatible. From the business perspective, data quality processes seek to answer the following questions: what standard has the data achieved in areas that are important to the business, and what standards are required in these areas? There are six main areas of data quality performance: Accuracy, Completeness, Conformity, Consistency, Integrity, and Duplication. These are fully explained in task 2.8 Perform Data Quality Audit.

The Data Quality Developer uses the results of the Data Quality Audit as the benchmark for the data quality enhancement steps to be applied in the current task. Before beginning to design the data quality processes, the Data Quality Developer, Business Analyst, Project Sponsor, and other interested parties must meet to review the outcome of the Data Quality Audit and agree on the extent of remedial action needed for the project data. The first step is to agree on the business rules to be applied to the data. (See Subtask 5.3.1 Design Data Quality Technical Rules.)

The tasks that follow are written from the perspective of Informatica Data Quality, Informatica’s dedicated data quality application suite.

Best Practices
Data Cleansing

Sample Deliverables
None
Last updated: 01-Feb-07 18:46

Phase 5: Build
Subtask 5.3.1 Design Data Quality Technical Rules
Description
Business rules are a key driver of data enhancement processes. A business rule is a condition of the data that must be true if the data is to be valid and, in a larger sense, for a specific business objective to succeed. In many cases, poor data quality is directly related to the data’s failure to satisfy a business rule. In this subtask the Data Quality Developer and the Business Analyst, and optionally other personnel representing the business, establish the business rules to be applied to the data. An important factor in completing this task is proper documentation of the business rules.

Prerequisites
None

Roles
Business Analyst (Primary)
Data Quality Developer (Primary)
Data Steward/Data Quality Steward (Secondary)

Considerations
All areas of data quality can be affected by business rules, and business rules can be defined at high and low levels and at varying levels of complexity. Some business rules can be tested mathematically using simple processes, whereas others may require complex processes or reference data assistance. For example, consider a financial institution that must store several types of information for account holders in order to comply with the Sarbanes-Oxley Act or the USA PATRIOT Act. It defines several business rules for its database data, including:
● Field 1-Field n must not be null or populated with default values.
● Date of Birth field must contain dates within certain ranges (e.g., to indicate that the account holder is between 18 and 100 years old).
● All account holder addresses are validated as postally correct.

These three rules are equally easy to express, but they are implemented in different ways. All three rules can be checked in a straightforward manner using Informatica Data Quality (IDQ), although the third rule, concerning address validation, requires reference data verification. The decision to use external reference data is covered in subtask 5.3.2 Determine Dictionary and Reference Data Requirements.

When defining business rules, the Data Quality Developer must consider the following questions:
● How to document the rules.
● How to build the data quality processes to validate the rules.
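
To make the difference in implementation concrete, the first two example rules (populated fields and a date-of-birth range) can be sketched in plain Python. This is an illustration only and does not represent how an IDQ plan is built; the field names, the list of default placeholder values, and the function names are assumptions. The third rule, address validation, is not shown because it depends on external reference data.

```python
from datetime import date

# Assumed placeholder values; the real list must be agreed with the business.
DEFAULT_VALUES = {"", "N/A", "UNKNOWN"}


def is_populated(value):
    """Rule 1: the field is neither null nor a default placeholder."""
    return value is not None and str(value).strip().upper() not in DEFAULT_VALUES


def age_in_range(date_of_birth, as_of, min_age=18, max_age=100):
    """Rule 2: the account holder is between min_age and max_age on the as_of date."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    age = as_of.year - date_of_birth.year - (0 if had_birthday else 1)
    return min_age <= age <= max_age


def check_record(record, required_fields, as_of):
    """Return the list of rule violations for one account-holder record."""
    failures = [
        f"{name}: missing or default"
        for name in required_fields
        if not is_populated(record.get(name))
    ]
    dob = record.get("date_of_birth")
    if dob is not None and not age_in_range(dob, as_of):
        failures.append("date_of_birth: outside permitted age range")
    return failures


# Hypothetical record with a default tax ID and an under-age account holder.
record = {"name": "A. Jones", "tax_id": "N/A", "date_of_birth": date(2010, 5, 1)}
failures = check_record(record, ["name", "tax_id"], as_of=date(2024, 1, 1))
print(failures)
```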

Documenting Business Rules
Documenting rules is essential as a means of tracking the implementation of the business requirements. When documenting business rules, the following information must be provided:
● A unique ID should be provided for each rule. This can be as simple as an incremented number or a project code assigned to each rule.
● A text description of the rule. This should be as complete as possible; however, if the description becomes too lengthy or complex, it may be advisable to break it down into multiple rules.
● The name of the data source containing the records affected by the rule.
● The data headers or field names containing the values affected by the rule. The Data Quality Developer and the Business Analyst can refer back to the results of the Data Quality Audit to identify this information.
● Add columns for the plan name and the results of implementing the rule. The Data Quality Developer can provide this information later.
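
One possible way to keep such a rule register is a simple table written out as CSV. The sketch below is an assumption about layout only: the column names and the example entry are hypothetical, and the plan name and result columns are left blank for the Data Quality Developer to fill in later.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class BusinessRule:
    rule_id: str          # unique ID, e.g. an incremented number or project code
    description: str      # text description of the rule
    data_source: str      # data source containing the affected records
    field_names: str      # data headers or field names affected by the rule
    plan_name: str = ""   # filled in later by the Data Quality Developer
    result: str = ""      # result of implementing the rule, filled in later


def write_register(rules):
    """Serialize the rules as CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(BusinessRule)])
    writer.writeheader()
    for rule in rules:
        writer.writerow(asdict(rule))
    return buffer.getvalue()


# Hypothetical example entry.
rules = [BusinessRule("DQ-001", "Date of Birth must indicate age 18 to 100",
                      "ACCOUNT_HOLDERS", "DATE_OF_BIRTH")]
print(write_register(rules))
```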

Note: In IDQ, a discrete data quality process is called a plan. A plan has inputs, outputs, and analysis or enhancement algorithms and is analogous to a PowerCenter mapping. It is important to understand that a data quality plan can be added to a PowerCenter custom transformation and run within a PowerCenter mapping.

Assigning Business Rules to Data Quality Plans
When the Data Quality Developer and Business Analyst have agreed on the business rules to apply to the data, the Data Quality Developer must decide how to convert the rules into data quality plans. (The Data Quality Developer need not create the plans at this stage.) The Data Quality Developer may create a plan for each rule, or may incorporate several rules into a single plan. This decision is taken on a rule-by-rule basis.

There is a trade-off between simplicity in plan design, wherein each plan contains a single rule, and efficiency in plan design, wherein a single plan addresses several rules. Typically, a plan handles more than one rule. One advantage of this approach is that the Data Quality Developer does not need to define and maintain multiple instances of input and output data, each covering a small increment of data quality progress, where a single set of inputs and outputs can do the same job in a more sophisticated plan.

It is also worth considering whether the plan will be run from within IDQ or added to a PowerCenter mapping for execution in a workflow. Bear in mind that the Data Quality Integration transformation in PowerCenter accepts information from one plan. To add several plans to a mapping, you must add the same number of transformations.

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:12

Phase 5: Build
Subtask 5.3.2 Determine Dictionary and Reference Data Requirements
Description
Many data quality plans make use of reference data files to validate and improve the quality of the input data. The main purposes of reference data are:
● To validate the accuracy of the data in question. For example, in cases where input data is verified against tables of known-correct data.
● To enrich data records with new data or enhance partially-correct data values. For example, in cases of address records that contain usable but incomplete postal information. (Typos can be identified and fixed; Plus-4 information can be added to ZIP codes.)

When preparing to build data quality plans, the Data Quality Developer must determine the types of dictionary and reference files that may be used in the data quality plans, obtain approval to use third-party data, if necessary, and define a strategy for maintaining and distributing reference files. An important factor in completing this task is the proper documentation of the required dictionary or reference files.
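
The two purposes of reference data described above, validation and enrichment, can be sketched in a few lines of Python. This is not how IDQ components work internally: the city list, the ZIP+4 lookup table, and the similarity cutoff are made-up stand-ins for real dictionary and postal reference data, and the standard-library `difflib` module is used only as a simple substitute for typo detection.

```python
import difflib

# Hypothetical reference data; real projects would load dictionary files
# or licensed postal data instead.
KNOWN_CITIES = ["Boston", "Chicago", "Denver", "Seattle"]
ZIP_PLUS4 = {("100 Main St", "02108"): "02108-1234"}


def validate_city(value):
    """Validation: is the value present in the known-correct list?"""
    return value in KNOWN_CITIES


def correct_city(value, cutoff=0.8):
    """Enrichment: replace a near-miss spelling with the closest reference
    value, or return the input unchanged if nothing is close enough."""
    matches = difflib.get_close_matches(value, KNOWN_CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else value


def add_plus4(street, zip_code):
    """Enrichment: extend a five-digit ZIP code when the lookup has an entry."""
    return ZIP_PLUS4.get((street, zip_code), zip_code)
```

Both enrichment functions return the input unchanged when the reference data has no answer, so a record is never made worse by a missing reference entry.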

Prerequisites
None

Roles
Business Analyst (Secondary)
Business Project Manager (Secondary)
Data Quality Developer (Primary)
Data Steward/Data Quality Steward (Secondary)

Considerations
Data quality plans can make use of three types of reference data.
● Standard dictionary files. These files are installed with Informatica Data Quality (IDQ) and can be used by several types of components in Workbench. All dictionaries installed with IDQ are text dictionaries. These are plain-text files saved in .DIC file format. They can be created and edited manually. IDQ installs with a set of dictionary files in generic business information areas including forenames, city and town names, units of measurement, and gender identification. Informatica also provides and supports reference data of external origin, such as postal address data endorsed by national postal carriers.
● Database dictionaries. Users with database expertise can create and specify dictionaries that are linked to database tables, and can, therefore, be updated dynamically when the underlying data is updated. Database dictionaries are useful when the reference data originated for other purposes and is likely to change independently of IDQ. By making use of a dynamic connection, data quality plans can always point to the current version of the reference data. Database dictionaries are stored as SELECT statements that query the database at the time of plan execution. IDQ does not install any database dictionaries.
● Third-party reference data. These data files originate from third-party vendors and are provided by Informatica as premium product options. The reference data provided by third-party vendors is typically in database format. If the Data Quality Developer feels that externally-derived reference data files are necessary, he or she must inform the Project Manager or other business personnel as soon as possible, as this is likely to affect (1) the project budget and (2) the software architecture implementation.
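
To illustrate the idea of a dictionary that is stored as a SELECT statement and resolved at execution time, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name, column names, and query are hypothetical, and IDQ's own database dictionary mechanism is not shown here.

```python
import sqlite3

# Hypothetical dictionary definition: only the query is stored, not the data.
COUNTRY_DICTIONARY_SQL = "SELECT code, name FROM country_codes"


def load_dictionary(connection, select_statement):
    """Run the stored SELECT and return a {key: value} lookup.

    Because the query runs at call time, the lookup reflects the current
    contents of the underlying table.
    """
    return dict(connection.execute(select_statement).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country_codes (code TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO country_codes VALUES ('IE', 'Ireland')")

before = load_dictionary(conn, COUNTRY_DICTIONARY_SQL)

# The reference table changes independently of the dictionary definition.
conn.execute("INSERT INTO country_codes VALUES ('FR', 'France')")
after = load_dictionary(conn, COUNTRY_DICTIONARY_SQL)

print(before)
print(after)
```

The second lookup picks up the new row without any change to the stored query, which is the property that makes a dynamic connection useful when reference data changes independently.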

Managing and Distributing Reference Files
Managing standard-installed dictionaries is straightforward, as long as the Data Quality Developer does not move the designed plans to non-standard locations. What is a non-standard location? One where the plans cannot see the dictionary files.

IDQ recognizes set locations for dictionary and reference data files. A Standard (i.e., client-only) install of IDQ looks for its dictionary files in the \Dictionaries folder of the installation. An Enterprise (i.e., client-server) install looks in this location, and also looks in the logged-on user’s \Dictionaries folder on the server if the plan is executed on the server. These locations are specified in IDQ’s config.xml file. If the relevant dictionary files are moved out of these locations, the plan cannot run unless the config.xml file has been edited. Conversely, if the user has created new or modified dictionaries within the standard dictionary format, and wishes to copy (publish) plans to a server or another IDQ installation, the user must also copy the new dictionary files to a location recognized by that server or installation.

Third-party reference data adds another set of actions. The third-party data currently available from Informatica is packaged in a manner that installs to locations recognized by IDQ. (Again, these locations are defined in the config.xml file.) However, copying these files to other locations is not as simple, because their installation is more involved and because the files are licensed and delivered separately from IDQ. The business must agree to license these files before the Data Quality Developer can assume he or she can develop plans using third-party files, and the system administrator must understand that the reference data will be installed in the required locations.

Note: Informatica customers license third-party data on a subscription basis. Informatica provides regular updates to the reference data, and the customer (possibly the system administrator) must perform the updates.

Whenever you add a dictionary or reference data file to a plan, you must document exactly how you have done so: record the plan name, the reference file name, and the component instance that uses the reference file. Make sure you pass the inventory of reference data to all other personnel who are going to use the plan.

Data migration projects have additional reference data requirements, which include a need to determine the valid values for key code fields and to ensure that all input data aligns with these codes. It is advisable to build valid-value processes to perform this validation, and to use a table-driven approach to populate hard-coded values, which allows for easy changes if the specific hard-coded values change over time. Additionally, a large number of basic cross-references are required for data migration projects. These data types are examples of reference data that should be planned for by using a specific approach to populate and maintain them with input from the business community. These needs can be met with a variety of Informatica products, but to expedite development, they must be addressed prior to building data integration processes.
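
A table-driven valid-value check combined with a cross-reference lookup for a migration might look like the following Python sketch. The code values and the in-memory table are hypothetical; in practice the table would be maintained with input from the business community and stored where it can be changed without touching the logic.

```python
# Hypothetical cross-reference table: legacy code -> target system code.
# Keeping it as data (not hard-coded in the logic) lets values change over time.
STATUS_XREF = {
    "A": "ACTIVE",
    "I": "INACTIVE",
    "C": "CLOSED",
}


def translate_codes(records, field, xref):
    """Split records into (translated, rejected) using the cross-reference.

    A record is rejected when its code is not a key of the table, i.e. it
    is not a valid value for the key code field.
    """
    translated, rejected = [], []
    for record in records:
        code = record.get(field)
        if code in xref:
            # Build a new dict so the input record is left unmodified.
            translated.append({**record, field: xref[code]})
        else:
            rejected.append(record)
    return translated, rejected
```

Rejected records are returned rather than silently dropped, so they can be reviewed and the table extended if a code turns out to be legitimate.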

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:14

Phase 5: Build
Subtask 5.3.3 Design and Execute Data Enhancement Processes
Description
This subtask, along with subtask 5.3.4 Design Run-time and Real-time Processes for Operate Phase Execution, concerns the design and execution of the data quality plans that will prepare the project data for the Data Integration Design and Development in the Build Phase. While this subtask describes the creation and execution of plans through Informatica Data Quality (IDQ) Workbench, subtask 5.3.4 focuses on the steps to deploy plans in a runtime or scheduled environment. All plans are created in Workbench. However, there are several aspects to creating plans primarily for runtime use, and these are covered in 5.3.4. Users who are creating plans should read both subtasks.

Note: IDQ provides a user interface, the Data Quality Workbench, within which plans can be designed, tested, and deployed to other Data Quality engines across the network. Workbench is an intuitive user interface; however, the plans that users construct in Workbench can grow in size and complexity, and Workbench, like all software applications, requires user training. These subtasks are not a substitute for that training. Instead, they describe the rudiments of plan construction, the elements required for various types of plans, and the next steps to plan deployment. Both subtasks assume that the Data Quality Developer will have received formal training in IDQ.

Prerequisites
None

Roles
Data Quality Developer (Primary)
Technical Project Manager (Approve)

Considerations
A data quality plan is a discrete set of data analysis and/or enhancement operations with a data source and a data target (or sink). At a high level, the design of a plan is not dissimilar to the design of a PowerCenter mapping. The data sources, sinks, and analysis/enhancement components are represented on-screen by icons, much like the sources, targets, and transformations in a mapping. Sources, sinks, and other components can be configured through a tabbed dialog box in the same way as PowerCenter transformations. One difference between PowerCenter and Workbench is that users cannot define workflows that contain serial data quality plans, although this functionality is available in a runtime/batch scenario.

Data quality plans can read source data from, and write data to, files and databases. Most delimited, flat, or fixed-width file types are usable, as are DB2, Oracle, and SQL Server databases and any database accessible via ODBC. Informatica Data Quality (IDQ) stores plan data in its own MySQL data repository. The following figure illustrates a simple data quality plan.

This data quality plan shows a data source reading from a SQL database, an operational component analyzing the data, and a data sink component that receives the data available as plan output. A plan can have any number of operational components. Plans can be designed to fulfill several data quality requirements, including data analysis, parsing, cleansing and standardization, enrichment, validation, matching, and consolidation. These are described in detail in the Best Practice Data Cleansing. When designing data quality plans, the questions to consider include:

● What types of plan are necessary to meet the needs of the project? The business should have already signed off on specific data quality goals as a part of agreeing on the overall project objectives, and the Data Quality Audit should have indicated the areas where the project data requires improvement. For example, the audit may indicate that the project data contains a high percentage of duplicate records, and therefore matching and pre-match grouping plans may be necessary.
● What test cycles are appropriate for the plans? Testing and tuning plans in Workbench is a normal part of plan development. In many cases, testing a plan in Workbench is akin to validating a mapping in PowerCenter, and need not be part of a formal test scenario. However, the Data Quality Developer must be able to sign off on each plan as valid and executable.
● What source data will be used for the plans? This is related to the testing issue mentioned above. The final plans that operate on the project data are likely to operate on the complete project dataset; in every case, the plans will effect changes in the customer data. Ideally, a complete ‘clone’ of the project dataset should be available to the Data Quality Developer, so that the plans can be designed and tested on a fully faithful version of the project data. At the minimum, a meaningful sample of the dataset should be replicated and made available for plan design and test purposes. Bear in mind that a plan that is published to a service domain repository will translate the data source locations set at design time into new locations local to the new computer on which it resides. See subtask 5.3.4 Design Run-time and Real-time Processes for Operate Phase Execution and the Informatica Data Quality User Guide for more information.
● Where will the plans be deployed? IDQ can be installed in a client-server configuration, with multiple Workbench installations acting as clients to the IDQ server. The server employs service domain architecture, so that a connected Workbench user can run a plan from a local or domain repository on any Execution Service on the service domain. Likewise, the Data Quality Developer may publish plans from Workbench to a remote repository on the IDQ service domain for execution by other Data Quality Developers. An important consideration here is whether the plans will be deployed as runtime plans. A plan is considered a runtime plan if it is deployed in a scheduled or batch operation with other plans. In such cases, the plan is run using a command line instruction. See subtask 5.3.4 Design Run-time and Real-time Processes for Operate Phase Execution for details.

Bear in mind also that it is possible to add a plan to a PowerCenter mapping if the Data Quality Integration plugin has been installed client-side and server-side to PowerCenter. The Integration enables the following types of interaction:
● It enables you to browse the Data Quality repository and add a data quality plan to the Data Quality Integration transformation. The functional details of the plan are saved as XML in the PowerCenter repository.
● It enables the PowerCenter Integration Service to send data quality plan XML to the Data Quality engine when a session containing a Data Quality Integration transformation is run.

A plan designed for use in a PowerCenter mapping must set its data source and data sink components to process data in realtime. A subset of the source and sink components can be configured in this way (six out of twenty-one components). Note that plans with realtime capabilities are also suitable for use in a request-response environment, such as a point of data entry environment. These realtime plans can be called by a third-party application to analyze keyboard data inputs and correct human error.

Best Practices
None

Sample Deliverables
None
Last updated: 12-Feb-07 15:05

Phase 5: Build
Subtask 5.3.4 Design Runtime and Real-time Processes for Operate Phase Execution
Description
This subtask, along with subtask 5.3.3 Design and Execute Data Enhancement Processes, concerns the design and execution of the data quality plans to prepare the project data for the Data Integration component of the Build Phase and possibly later phases. While subtask 5.3.3 describes the creation and execution of plans through Data Quality Workbench, this subtask focuses on the steps to deploy plans in a runtime or scheduled environment. All data quality plans are created in Workbench. However, there are several aspects to creating plans primarily for runtime use, which are described in this subtask. Users who are creating plans should read both subtasks.

Because they can be scheduled and run in a batch, runtime plans present two opportunities for the Data Quality Developer and the data project as a whole:
● A plan that may take several hours to run (such as a large-scale data matching plan) can be scheduled to run overnight as a runtime plan.
● A runtime plan can be scheduled to run at regular intervals on the dataset to analyze dataset quality; such plans can outlive the project in which they are designed and provide a method for ongoing data monitoring in the enterprise.

Because runtime plans need not be run from a user interface, they are commonly published or moved to a computer where higher performance is available. When publishing or moving a runtime plan, consider the issues discussed in this subtask.

Prerequisites
None

Roles

Business Analyst (Review Only)
Data Quality Developer (Primary)
Technical Project Manager (Review Only)

Considerations
The two main factors to consider when planning to use runtime plans are:
● What data sources will the plan use?
● What reference files will the plan use?

In both cases, the source data and reference files must reside in locations that are visible to Informatica Data Quality (IDQ). This is pertinent because the runtime plan will typically be moved from its design-time computer to another computer for execution.

Data source locations are set in the plan at design time. If the plan connects to a file, the name and path to the file(s) are set in the data source component. If the source data is stored in a database, the same database connection must be available on the machine to which the plans are moved. If the plan is run on the machine on which it was designed, the data locations can remain static, so long as the data source details do not change. However, if the plan is moved to another machine, consider the following questions:

Will the plan be run in an IDQ service domain?
A plan moved to another machine may be run through Data Quality Server (specifically, by a machine hosting a Data Quality Execution Service). In this case, the Data Quality engine can run the plan from the repository, and you can publish the plan to the repository from the Workbench client. When you publish a plan, bear in mind that IDQ recognizes a specific set of folders as valid source file locations. If a Data Quality Developer defines a plan with a source file stored in the following location on the Workbench computer:

C:\Myfiles\File.txt

A Data Quality Server on Windows will look for the file here:

C:\Program Files\Data Quality\users\user.name\Files\Myfiles

And a Data Quality Server on UNIX installed at /home/informatica/dataquality/ will look for the file here:

/home/informatica/dataquality/users/user.name/Files/Myfiles

where user.name is the logged-on Data Quality Developer name. (The Data Quality Developer must be working on a Workbench machine that has a client connection to the Data Quality Server.) Path translations are platform-independent; that is, a Windows path will be mapped to a UNIX path.

Are the source files in a non-standard location on the runtime computer?
If a Data Quality Developer publishes a plan to a service domain repository for runtime execution, and the plan source file is located in a non-standard location on the executing computer, the Data Quality Developer can add a parameter file to the run command, translating the location set in the plan into the required file location.

Will the plan be deployed to IDQ machines outside the service domain?
If so, the plans must be saved as .xml files for runtime deployment. (Plans can also be saved as .pln files for use in another instance of Workbench.) The Data Quality Developer can set the run command to distinguish between plans stored in the Data Quality repository and plans saved on the file system.

Do the plans use non-standard dictionary files, or dictionary/reference files in non-standard locations?
The Data Quality Developer must check that any dictionary or reference files added to a plan at design time are also available at the runtime location. If a plan uses standard dictionary files (i.e., the files that installed with the product), then IDQ takes care of this automatically, as long as the plan resides on a service domain. If a plan is published or copied to a network location and uses non-standard reference files, these files must be copied to a location that is recognizable to the IDQ installation that will run the deployed plans.

For more information on valid dictionary and reference data files, see the Informatica Data Quality User Guide.

Implications for Plan Design
The above settings can have a significant bearing on plan design. When the Data Quality Developer designs a plan in Workbench, he or she should ensure that the folders created for file resources can map efficiently to the server folder structure.

For example, let’s say the Developer creates a data source file folder on a Workbench installation at the following location:

C:\Program Files\Data Quality\Sources

When the plan runs on the server side, the Data Quality Server looks for the source file in the following location:

C:\Program Files\Data Quality\users\user.name\Files\Program Files\Data Quality\Sources

Note that the folder path Program Files\Data Quality is repeated here: in this case, good plan design suggests the creation of folders under C:\ that can be recreated efficiently on the server.
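
The folder translation described in this subtask can be mimicked with a small Python helper, which may be useful when planning a folder layout before publishing. This is only a sketch that reproduces the examples given in the text; the function name and parameters are hypothetical, and IDQ's actual translation logic may differ.

```python
from pathlib import PurePosixPath, PureWindowsPath


def translate_source_folder(design_folder, server_root, user_name, unix=False):
    """Map a Workbench folder to the per-user Files folder on the server.

    The drive letter is dropped and the remaining folder path is appended
    beneath <server_root>/users/<user_name>/Files.
    """
    design_path = PureWindowsPath(design_folder)
    # Folder parts after the drive/root anchor, e.g. ("Myfiles",) for C:\Myfiles.
    relative_parts = design_path.relative_to(design_path.anchor).parts
    path_type = PurePosixPath if unix else PureWindowsPath
    return str(path_type(server_root, "users", user_name, "Files", *relative_parts))


print(translate_source_folder(r"C:\Program Files\Data Quality\Sources",
                              r"C:\Program Files\Data Quality", "user.name"))
print(translate_source_folder(r"C:\Myfiles", "/home/informatica/dataquality",
                              "user.name", unix=True))
```

Running the helper against a proposed design-time folder makes the repeated Program Files\Data Quality segment visible early, before the folder structure is committed to.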

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:16

Phase 5: Build
Subtask 5.3.5 Develop Inventory of Data Quality Processes
Description
When the Data Quality Developer has designed and tested the plans to be used later in the project, he or she must then create an inventory of the plans. This inventory should be as exhaustive as possible. Data quality plans, once they achieve any size, can be hard for personnel other than the Data Quality Developer to read. Moreover, other project personnel and business users are likely to rely on the inventory to identify where the plan functioned in the project.

Prerequisites
None

Roles
Data Quality Developer (Primary)

Considerations
For each plan created for use in the project (or for use in the Operate Phase and post-project scenarios), the inventory document should answer the following questions. The questions can be divided into two sections: one relating to the plan’s place and function relative to the project and its objectives, and the other relating to the plan design itself. The questions below are a subset of those included in the sample deliverable document Data Quality Plan Documentation and Handover.

Project-related Questions
● What is the name of the plan?
● What project is the plan part of? Where does the plan fit in the overall project?
● What particular aspect of the project does the plan address?
● What are the objectives of the plan?
● What issues, if any, apply to the plan or its data?
● What department or group uses the plan output?
● What are the predicted ‘before and after’ states of the plan data?
● Where is the plan located (include machine details and folder location) and when was it executed?
● Is the plan version-controlled? What are the creation/metadata details for the plan?
● What steps were taken or should be taken following plan execution?

Plan Design-related Questions
● What are the specific data or business objectives of the plan?
● Who ran (or should run) the plan, and when?
● In what version of IDQ was the plan designed?
● What Informatica application will run the plan, and on which applications will the plan run?
● Provide a screengrab of the plan layout in the Workbench user interface.
● What data source(s) are used? Where is the source located?
● What are the format and origin of the database table?
● Is the source data an output from another IDQ plan, and if so, which one?
● Describe the activity of each component in the plan. Component functionality can be described at a high level or low level, as appropriate.
● What reference files or dictionaries are applied?
● What business rules are defined? This question can refer to the documented business rules from subtask 5.3.1 Design Data Quality Technical Rules. Provide the logical statements, if appropriate.
● What are the outputs for the instance, and how are they named?
● Where is the output written: report, database table, or file?
● Are there exception files? If so, where are they written?
● What is the next step in the project?
● Will the plan(s) be re-used (e.g., in a runtime environment)?
● Who receives the plan output data, and what actions are they likely to take?

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:19

Phase 5: Build
Subtask 5.3.6 Review and Package Data Transformation Specification Processes and Documents
Description
In this subtask the Data Quality Developer collates all the documentation produced for the data quality operations thus far in the project and makes it available to the Project Manager, Project Sponsor, and Data Integration Developers; in short, to all personnel who need it. The Data Quality Developer must also ensure that the data quality plans themselves are stored in locations known to and usable by the Data Integration Developers.

Prerequisites
None

Roles
Data Integration Developer (Secondary)
Data Quality Developer (Primary)
Technical Project Manager (Review Only)

Considerations
After the Data Quality Developer verifies that all data quality-related materials produced in the project are complete, he or she should hand them all over to other interested parties in the project. The Data Quality Developer should either arrange a handover meeting with all relevant project roles or ask the Data Steward to arrange such a meeting.

The Data Quality Developer should consider making a formal presentation at the meeting and should prepare for a Q&A session before the meeting ends. The presentation may take the form of a PowerPoint slide show and may include dashboard reports from data quality plans. The presentation should cover the following areas:
● Progress in treating the quality of the project data (‘before and after’ states of the data in the key data quality areas)
● Success stories, lessons learned
● Data quality targets: met or missed?
● Recommended next steps for project data

Regarding data quality targets met or missed, the Data Quality Developer must be able to say whether the data operated on is now in a position to proceed through the rest of the project. If the Data Quality Developer believes that there are “show stopper” issues in the data quality, he or she must inform the business managers and provide an estimate of the work necessary to remedy the data issues. The business managers can then decide if the data can pass to the next stage of the project or if remedial action is appropriate. The materials that the Data Quality Developer must assemble include:
● Inventory of data quality plans (prepared in subtask 5.3.5 Develop Inventory of Data Quality Processes).
● Data Quality plan files (.pln or .xml files), or locations of the Data Quality repositories containing the plans.
● Details of backup data quality plans. (All Data Quality repositories containing final plans should be backed up.)
● Inventory of business rules used in the plans (prepared in subtask 5.3.1 Design Data Quality Technical Rules).
● Inventory of dictionary and reference files used in the plans (prepared in subtask 5.3.2 Determine Dictionary and Reference Data Requirements).
● Data Quality Audit results (prepared in task 2.8 Perform Data Quality Audit).
● Summary of task 5.3 Design and Build Data Quality Process.

Best Practices
Build Data Audit/Balancing Processes

Sample Deliverables
Data Quality Plan Design

Last updated: 01-Feb-07 18:47

Phase 6: Test
6 Test
● 6.1 Define Overall Test Strategy
   ○ 6.1.1 Define Test Data Strategy
   ○ 6.1.2 Define Unit Test Plan
   ○ 6.1.3 Define System Test Plan
   ○ 6.1.5 Define Test Scenarios
   ○ 6.1.6 Build/Maintain Test Source Data Set
● 6.2 Prepare for Testing Process
   ○ 6.2.1 Prepare Environments
   ○ 6.2.2 Prepare Defect Management Processes
● 6.3 Execute System Test
   ○ 6.3.1 Prepare for System Test
   ○ 6.3.2 Execute Complete System Test
   ○ 6.3.3 Perform Data Validation
   ○ 6.3.5 Conduct Volume Testing
● 6.4 Conduct User Acceptance Testing
● 6.5 Tune System Performance
   ○ 6.5.1 Benchmark
   ○ 6.5.2 Identify Areas for Improvement

Phase 6: Test
Description
The diligence with which you pursue the Test Phase of your project largely determines its acceptance by its end-users, and therefore its success against its business objectives. During the Test Phase you must validate that your system accomplishes everything that the project objectives and requirements specified and that all the resulting data and reports are accurate. Testing is also a critical preparation against any eventuality that could impact your project, whether that be radical changes to data volumes, disasters that disrupt service for the system in some way, or spikes in concurrent usage.

The Test Phase includes the full design of your testing plans and infrastructure as well as two categories of comprehensive system-wide verification procedures: the System Test and the User Acceptance Test (UAT). The System Test is conducted after all elements of the system have been integrated into the test environment. It includes a number of detailed, technically-oriented verifications that are managed as processes by the technical team with primarily technical criteria for acceptance. UAT is a detailed, user-oriented set of verifications with User Acceptance as the objective. It is typically managed by end-users with participation from the technical team.

No test can be considered complete until there is verification that it has met the agreed-upon Acceptance Criteria. Because of the natural tension that exists between completion of the preset project timeline and completion of Acceptance Criteria (which may take longer than expected), the Test Phase schedule is often owned by a QA Manager or Project Sponsor rather than the Project Manager.

As a final step in the Test Phase, Velocity includes activities related to tuning system performance. Satisfactory performance and system responsiveness can be a critical element of user acceptance.

Prerequisites
None

Roles
Business Analyst (Primary)
Data Integration Developer (Primary)
Data Warehouse Administrator (Primary)
Database Administrator (DBA) (Primary)
End User (Primary)
Network Administrator (Primary)
Presentation Layer Developer (Primary)
Project Sponsor (Review Only)
Quality Assurance Manager (Primary)
Repository Administrator (Primary)
System Administrator (Primary)
System Operator (Primary)
Technical Project Manager (Secondary)
Test Manager (Primary)
User Acceptance Test Lead (Primary)

Considerations
To ensure that the Test Phase is successful, it must be preceded by diligent planning and preparation. Early on, project leadership and project sponsors should establish test strategies and begin building plans for System Test and UAT. Velocity recommends that this planning process begin, at the latest, during the Design Phase, and that it include descriptions of timelines, participation, test tools, guidelines and scenarios, as well as detailed Acceptance Criteria.

The Test Phase includes the development of test plans and procedures. It is intended to overlap with the Build Phase, which includes the individual design reviews and unit test procedures. It is difficult to determine your final testing strategy until detailed design and build decisions have been made in the Build Phase. Thus, from a planning perspective, some tasks and subtasks in the Test Phase are expected to overlap with those in the Build Phase and possibly the Design Phase.

The Test Phase includes other important activities in addition to testing. Any defects or deficiencies discovered must be categorized (severity, criticality, priority), recorded, and weighed against the Acceptance Criteria (AC). The technical team should repair them within the guidelines of the AC, and the results must be retested, including satisfactory regression testing. This process requires some type of Defect Tracking System; Velocity recommends that it be developed during the Build Phase.

Although formal user acceptance signals the completion of the Test Phase, some of its activities will be revisited, perhaps many times, throughout the operation of the system. Performance tuning is recommended as a recurrent process. As data volume grows and the profile of the data changes, performance and responsiveness may degrade. You may want to plan for regular periods of benchmarking and tuning rather than reacting to end-user complaints.

By its nature, software development is not always perfect, so some repair and retest should be expected. The Defect Tracking System must be maintained to record defects and enhancements for as long as the system is supported and used. Test scenarios, regression test procedures, and other testing aids must also be retained for this purpose.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:48

Phase 6: Test
Task 6.1 Define Overall Test Strategy
Description
The purpose of testing is to verify that the software has been developed according to the requirements and design specifications. Although the major testing actually occurs at the end of the Build Phase, determining the amount and types of testing to be performed should occur early in the development lifecycle. This enables project management to allocate adequate time and resources to this activity, and it enables the project to build the appropriate testing infrastructure before the Test Phase begins. Thus, while all of the testing-related activities have been consolidated in the Test Phase, these activities often begin as early as the Design Phase. The detailed object-level testing plans are continually updated and modified as development continues, since any change to development work is likely to create a new scenario to test.

Planning should include the following components:
● resource requirements and schedule
● construction and maintenance of the test data
● preparation of test materials
● preparation of test environments
● preparation of the methods and control procedures for each of the major tests

Typically, there are three levels of testing:

Testing Level: Unit
Description: Testing of each individual function. For example, with data integration this includes testing individual mappings, UNIX scripts, stored procedures, or other external programs. Ideally, the developer tests all error conditions and logic branches within the code.
Performed By: Developer

Testing Level: System or Integration
Description: Testing performed to review the system as a whole as well as its points of integration. Testing may include, but is not limited to, data integrity, reliability, and performance.
Performed By: System Test Team

Testing Level: User Acceptance
Description: As most data integration solutions do not directly touch end users, User Acceptance Testing should focus on the front-end applications and reports, rather than the load processes themselves.
Performed By: User Acceptance Testing Team

Prerequisites
None

Roles
Business Analyst (Primary)
Data Integration Developer (Primary)
End User (Primary)
Presentation Layer Developer (Primary)
Quality Assurance Manager (Approve)
Technical Project Manager (Approve)

Considerations

None

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.1.1 Define Test Data Strategy
Description
Ideally, actual data from the production environment will be available for testing so that tests can cover the full range of possible values and states in the data. However, the full set of production data is often not available. Additionally, there is sometimes a risk of sensitive information migrating from production to less-controlled environments (such as test); in some circumstances, this may even be illegal. There is also the chicken-and-egg problem of requiring the load of production source data in order to test the load of production source data. It is therefore important to understand that, with any set of data used for testing, there is no guarantee that all possible exception cases and value ranges will occur in the subset used.

If generated data is used, the main challenge is to ensure that it accurately reflects the production environment. Theoretically, generated data can be made representative and engineered to test all of the project functionality. While the actual record counts in generated tables are likely to differ from production environments, the ratios between tables should be maintained; for example, if there is a one-to-ten ratio between products and customers in the live environment, care should be taken to retain this same ratio in the test environment.

The deliverable from this subtask is a description and schedule for how test data will be derived, stored, and migrated to testing environments. Adequate test data can be important for proper unit testing and is critical for satisfactory system and user acceptance tests.
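
As a minimal sketch of the ratio point above, the following Python generates product and customer rows at a reduced scale while keeping the one-to-ten ratio from the example. The column names and the `generate_test_tables` function are hypothetical illustrations and are not part of Velocity or PowerCenter.

```python
import random

def generate_test_tables(n_products, customers_per_product=10, seed=42):
    """Generate hypothetical product and customer rows at a fixed ratio."""
    rng = random.Random(seed)  # fixed seed keeps test runs repeatable
    products = [{"product_id": p, "name": f"product_{p}"}
                for p in range(1, n_products + 1)]
    customers = []
    for c in range(1, n_products * customers_per_product + 1):
        customers.append({
            "customer_id": c,
            # every customer references an existing product
            "product_id": rng.randint(1, n_products),
        })
    return products, customers

products, customers = generate_test_tables(n_products=50)
print(len(products), len(customers))  # 50 500
```

The record counts are far smaller than production would be, but the ratio between the two tables is the one the load logic will meet in the live environment.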

Prerequisites
None

Roles
Business Analyst (Primary)
Data Integration Developer (Secondary)
End User (Primary)
Presentation Layer Developer (Primary)
Quality Assurance Manager (Approve)
Technical Project Manager (Approve)
Test Manager (Primary)

Considerations
In stable environments, there is less of a premium on flexible maintenance of test data structures; the overhead of developing software to load test data may not be justified. In dynamic environments (i.e., where source and/or target data structures are not finalized), the availability of a data movement tool such as PowerCenter greatly expands the range of options for test data storage and movement.

Usually, data for testing purposes is stored in the same structure as the source in the data flow. However, it is also possible to store test data in a format that is geared toward ease of maintenance and to use PowerCenter to transfer the data to the source system format. So if the source is a database with a constantly changing structure, it may be easier to store test data in XML or CSV formats, where it can easily be maintained with a text editor. The PowerCenter mappings that load the test data from this source can insulate the logic (to some degree) from schema changes by including pass-through transformations after source qualifiers and before targets.

For Data Migration, the test data strategy should focus on how much source data to use rather than on how to manufacture test data. It is strongly recommended that the data used for testing be real production data, although most likely of lower volume than the production system. Using real production data makes the final testing more meaningful and increases the business community's level of confidence, which makes 'go/no-go' decisions easier.
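
The idea of keeping test data in an easily edited format and loading it in a way that tolerates schema changes can be illustrated outside PowerCenter. The sketch below is an assumption-laden Python analogue, not a PowerCenter mapping: it loads CSV test data into a SQLite table by column name, so that columns added to or dropped from the target do not break the load. The table, the columns, and the `load_test_data` function are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical test data kept in CSV so it can be edited with a text editor.
CSV_TEXT = """customer_id,name,postal_code
1,Alice,12345
2,Bob,
"""

def load_test_data(conn, table, csv_file):
    """Load CSV rows into `table`, matching columns by name, not position."""
    # Columns that the target table currently has.
    target_cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    reader = csv.DictReader(csv_file)
    # Only columns present in both the CSV and the target are loaded;
    # target columns missing from the CSV are left NULL.
    shared = [c for c in reader.fieldnames if c in target_cols]
    placeholders = ", ".join("?" for _ in shared)
    sql = f"INSERT INTO {table} ({', '.join(shared)}) VALUES ({placeholders})"
    rows = [[r[c] or None for c in shared] for r in reader]
    conn.executemany(sql, rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
# The target has gained a column (region) and lost one (postal_code).
conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT, region TEXT)")
loaded = load_test_data(conn, "customer", io.StringIO(CSV_TEXT))
```

Matching by name plays a role loosely comparable to the pass-through transformations mentioned above: the stored test data does not have to be rewritten each time the target structure changes.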

Best Practices
None

Sample Deliverables
Critical Test Parameters

Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.1.2 Define Unit Test Plan
Description
Any distinct unit of development must be adequately tested by the developer before it is designated ready for system test and for integration with the rest of the project elements. This includes any element of the project that can, in any way, be tested on its own. Rather than conducting unit testing in a haphazard fashion with no means of certifying satisfactory completion, all unit testing should be measured against a specified unit test plan and its completion criteria.

Unit test plans are based on the individual business and functional requirements and on the detailed design for mappings, reports, or components of the mapping or report. The unit test plans should specify inputs, tests to verify, and expected outputs and results. The unit test is the best opportunity to discover any misinterpretation of the design as well as errors of development logic.

The creation of the unit test plan should be a collaborative effort by the designer and the developer, and the plan must be validated by the designer as meeting the business and functional requirements and design criteria. The designer should begin with a test scenario or test data descriptions and include checklists for the required functionality; the developer may add technical tests and make sure that all logic paths are covered.

The unit test plan consists of:
● Identification section: unit name, version number, date of build or change, developer, and other identification information.
● References to all applicable requirements and design documents.
● References to all applicable data quality processes (e.g., data analysis, cleansing, standardization, enrichment).
● Specification of test environment (e.g., system requirements, database/schema to be used).
● Short description of test scenarios and/or types of test runs.
● Per test run:
  ○ Purpose (what features/functionality are being verified).
  ○ Prerequisites.
  ○ Definition of test inputs.
  ○ References to test data or load-files to be used.
  ○ Test script (step-by-step guide to executing the test).
  ○ Specification (checklist) of the expected outputs, messages, error handling results, data output, etc.
● Comments and findings.

Prerequisites
None

Roles
Business Analyst (Secondary)
Data Integration Developer (Primary)
Presentation Layer Developer (Primary)
Quality Assurance Manager (Review Only)

Considerations
Reference to design documents should contain the name and location of any related requirements documents, high-level and detailed design, mock-ups, workflows, and other applicable documents.

Specification of the test environment should include such details as which reference or conversion tables must be used to translate the source data for the appropriate target (e.g., for conversion of postal codes, for key translation, or for other code translations). It should also specify any infrastructure elements or tools to be used in conjunction with the tests.

The description of test runs should include the functional coverage and any dependencies between test runs.

Prerequisites should include whatever is needed to create the correct environment for the test to take place, any dependencies the test has on completion of other logic or test runs, availability of reference data, adequate space in the database or file system, and so forth.

The input files or tables must be specified with their locations. These data must be maintained in a secure place to make repeatable tests possible.

Specifying the expected output is the main part of the test plan. It specifies in detail any output records and fields, and any functional or operational results through each step of the test run. The script should cover all of the potential logic paths and include all code translations and other transformations that are part of the unit. Comparing the output produced by the test run with this specification verifies that the build satisfies the design.

The test script specifies all the steps needed to create the correct environment for the test, to complete the actual test run itself, and to analyze the results. Analysis can be done by hand or by using compare scripts.

The Comments and Findings section is where all errors and unexpected results found in the test run should be logged. Errors in the test plan itself can be logged here as well. It is up to QA Management and/or the QA Strategy to determine whether to use a more advanced error tracking system for unit testing or to wait until system test. Some sites demand a more advanced error logging system (e.g., ClearCase) where errors can be logged along with an indication of their severity and impact, as well as information about who is assigned to resolve the problem.

One or more test runs can be specified in a single unit test plan. For example, one run may be an initial load against an empty target, with subsequent runs covering incremental loads against existing data, or tests with empty input, with duplicate input records or files, or with empty reports. Test data must contain a mix of correct and incorrect data.
Correct data can be expected to result in the specified output; incorrect data may have results according to the defined error-handling strategy, such as creating error records or aborting the process. Examples of incorrect data are:
● Value errors: value is not in the acceptable domain, or a mandatory field is empty.
● Syntax errors: incorrect date format, incorrect postal code format, or non-numeric data in numeric fields.
● Semantic errors: two values are each correct, but cannot exist in the same record.
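
To make the three categories of incorrect data concrete, the following Python sketch classifies the errors in a single test record. The field names, the status domain, the date format, and the `classify_errors` function are all hypothetical; a real project would derive its checks from its own business rules and error-handling strategy.

```python
from datetime import datetime

VALID_STATUS = {"ACTIVE", "CLOSED"}  # hypothetical acceptable domain

def classify_errors(record):
    """Return a list of (error_type, message) tuples for one test record."""
    errors = []
    # Value errors: empty mandatory field or value outside the domain.
    if not record.get("customer_id"):
        errors.append(("value", "customer_id is mandatory"))
    if record.get("status") not in VALID_STATUS:
        errors.append(("value", "status not in acceptable domain"))
    # Syntax errors: wrong date format or non-numeric data in a numeric field.
    closed_on = record.get("closed_on")
    if closed_on:
        try:
            datetime.strptime(closed_on, "%Y-%m-%d")
        except ValueError:
            errors.append(("syntax", "closed_on is not YYYY-MM-DD"))
    if not str(record.get("balance", "")).lstrip("-").isdigit():
        errors.append(("syntax", "balance is not numeric"))
    # Semantic errors: each value is valid alone, but not in combination.
    if record.get("status") == "ACTIVE" and closed_on:
        errors.append(("semantic", "active account has a closing date"))
    return errors
```

A test data set built for a unit test plan would typically include at least one record that triggers each category, together with records that trigger none.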

Note that the error handling strategy should account for any Data Quality operations built into the project. Note also that some PowerCenter transformations can make use of data quality processes, or plans, developed in Informatica Data Quality (IDQ) applications. Data quality plan instructions can be loaded into a Data Quality Integration transformation (the transformation is added to PowerCenter via a plug-in).

Data quality plans should be tested using IDQ applications before they are added to PowerCenter transformations. The results of these tests feed as prerequisites into the main unit test plan. The tests for data quality processes should follow the same guidelines as outlined in this document. A PowerCenter mapping should be validated once the Data Quality Integration transformation has been added to it and configured with a data quality plan.

Every difference between the expected output and the actual test output should be logged in the Comments and Findings section, along with information about its severity and its impact on the test process. The unit test can proceed after analysis and error correction. The unit test is complete when all test runs have completed successfully and all findings have been resolved and retested. At that point, the unit can be handed over to the next test phase.

Best Practices
Testing Data Quality Plans

Sample Deliverables
Test Case List
Unit Test Plan

Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.1.3 Define System Test Plan
Description
System Test (sometimes known as Integration Test) is crucial for ensuring that the system operates reliably as a fully integrated system and functions according to the business requirements and technical design. Success rests largely on business users' confidence in the integrity of the data. If the system has flaws that impede its functions, the data may also be flawed, or users may perceive it as flawed, which results in a loss of confidence in the system. If the system does not provide adequate performance and responsiveness, the users may abandon it (especially if it is a reporting system) because it does not meet their perceived needs.

As with the other testing processes, it is very important to begin planning for System Test early in the project to make sure that all necessary resources are scheduled and prepared ahead of time.

Prerequisites
None

Roles
Quality Assurance Manager (Review Only)
Test Manager (Primary)

Considerations
Since the system test addresses multiple areas and test types, creation of the test plan should involve several specialists. The System Test Manager is then responsible for compiling their inputs into one consistent system test plan.

All individuals participating in executing the test plan must agree on the performance indicators that are required to determine whether project goals and objectives are being met. The performance indicators must be documented, reviewed, and signed off by all participating team members.

Performance indicators are placed in the context of Test Cases, Test Levels, and Test Types, so that the test team can easily measure and monitor their evaluation criteria.

Test Cases
The test case (i.e., unit of work to be tested) must be sufficiently specific to track and improve data quality and performance.

Test Levels
Each test case is categorized as occurring on a specific level or levels. This helps to clearly define the actual extent of testing expected within a given test case. Test levels may include one or more of the following:
● System Level. Covers all "end to end" integration testing, and involves the complete validation of total system functionality and reliability through all system entry points and exit points. Typically, this test level is the highest, and the last level of testing to be completed.
● Support System Level. Involves verifying the ability of existing support systems and infrastructure to accommodate new systems or the proposed expansion of existing systems. For example, this level of testing may determine the effect of a potential increase in network traffic due to an expanded system user base on overall business operations.
● Internal Interface Level. Covers all testing that involves internal system data flow. For example, this level of testing may validate the ability of PowerCenter to successfully connect to a particular data target and load data.
● External Interface Level. Covers all testing that involves external data sources. For example, this level of testing may collect data from diverse business systems into a data warehouse.
● Hardware Component Level. Covers all testing that involves verifying the function and reliability of specific hardware components. For example, this level of testing may validate a back-up power system by removing the primary power source. This level of testing typically occurs during the development cycle.
● Software Process Level. Covers all testing that involves verifying the function and reliability of specific software applications. This level of testing typically occurs during the development cycle.
● Data Unit Level. Covers all testing that involves verifying the function and reliability of specific data items and structures. This typically occurs during the development cycle in which data types and structures are defined and tested based on the application design constraints and requirements.

Test Types
The Data Integration Developer generates a list of the required test types based on the desired level of testing. The defined test types determine what kind of tests must be performed to satisfy a given test case. Test types that may be required include:
● Critical Technical Parameters (CTPs). A worksheet of specific CTPs is established, based on the identified test types. Each CTP defines specific functional units that are tested. This should include any specific data items, components, or functional parts.
● Test Condition Requirements (TCRs). Test Condition Requirement scripts are developed to satisfy all identified CTPs. These TCRs are assigned a numeric designation and include the test objective, list of any prerequisites, test steps, actual results, expected results, tester ID, the current date, and the current iteration of the test. All TCRs are included with each Test Case Description (TCD).
● Test Execution and Progression. A detailed description of general control procedures for executing a test, such as special conditions and processes for returning a TCR to a developer in the event that it fails. This description is typically provided with each TCD.
● Test Schedule. A specific test schedule that is defined within each TCD, based upon the project plan, and maintained using MS Project or a comparable tool. The overall Test Schedule for the project is available in the TCD Test Schedule Summary, which identifies the testing start and end dates for each TCD.
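
The fields that a Test Condition Requirement carries, as listed above, can be captured in a simple record. The Python sketch below is one possible shape, offered only as an illustration: the class name `TCR`, the attribute names, and the `passed` rule are assumptions, not a Velocity-defined format.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class TCR:
    """One Test Condition Requirement with the fields listed above."""
    number: int                 # numeric designation
    objective: str
    prerequisites: list
    steps: list
    expected_results: str
    tester_id: str
    run_date: date
    iteration: int = 1          # current iteration of the test
    actual_results: str = ""    # filled in when the test is executed

    def passed(self):
        """Hypothetical pass rule: actual results match expected results."""
        return self.actual_results == self.expected_results
```

Keeping the iteration number on the record makes it straightforward to see how many times a TCR has been returned to a developer and rerun.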

As part of 6.3 Execute System Test, other specific tests should be planned for:
● 6.3.3 Perform Data Validation
● 6.3.4 Conduct Disaster Recovery Testing
● 6.3.5 Conduct Volume Testing

The system test plans should include:
● System name, version number, list of components
● Reference to design document(s) such as high-level designs, workflow designs, database model and reference, hardware descriptions, etc.
● Specification of test environment
● Overview of the test runs (coverage, interdependencies)
● Per test run:
  ○ Type and purpose of the test run (coverage, results, etc.)
  ○ Prerequisites (e.g., accurate results from other test runs, availability of reference data, space in database or file system, availability of monitoring tools, etc.)
  ○ Definition of test input
  ○ References to test data or load-files to be used (note: data must be stored in a secure place to permit repeatable tests)
  ○ Specification of the expected output and system behaviour (including record counts, error records expected, expected runtime, etc.)
  ○ Specification of expected and maximum acceptable runtime
  ○ Step-by-step guide to execute the test (including environment preparation, results recording, and analysis steps, etc.)
● Defect tracking process and tools
● Description of structure for meetings to discuss progress, issues and defect management during the test

The system test plan consists of one or more test runs, each of which must be described in detail. The interaction between the test runs must also be specified. After each run, the System Test Manager can decide, depending on the defect count and severity, whether the system test can proceed with subsequent test runs or whether errors must be corrected and the previous run repeated.

Every difference between the expected output and the actual test output should be recorded and entered into the defect tracking system with a description of its severity and its impact on the test process. These errors and the general progress of the system test should be discussed in a weekly or bi-weekly progress meeting. At this meeting, participants review the progress of the system test, any problems identified, and assignments to resolve or avoid them. The meeting should be directed by the System Test Manager and attended by the testers and other necessary specialists, such as designers, developers, systems engineers, and database administrators.

After the findings are assigned, the specialists can take the necessary actions to resolve the problems. After the solution is approved and implemented, the system test can proceed.

When all tests are run successfully and all defects are resolved and retested, the system test plan will have been completed.

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:38

Phase 6: Test
Subtask 6.1.5 Define Test Scenarios
Description
Test scenarios provide the context, the “story line”, for many of the test procedures, whether Unit Test, System Test, or UAT. How can you know that the software solution you’re developing will work within its ultimate business usage? A scenario provides the business case for testing specific functionality, enabling testers to simulate the related business activity and then measure the results against expectations. For this reason, design of the scenarios is a critical activity and one that may involve significant effort in order to provide coverage for all the functionality that needs testing. The test scenario forms the basis for development of test scripts and checklists, the source data definitions, and other details of specific test runs.

Prerequisites
None

Roles
Business Analyst (Secondary)
End User (Primary)
Quality Assurance Manager (Approve)
Test Manager (Primary)

Considerations
Test scenarios must be based on the functional and technical requirements, dividing them into specific functions that can each be treated in a single test process. Test scenarios may include:
● The purpose/objective of the test (functionality being tested) described in end-user terms.
● Description of business, functional, or technical context for the test.
● Description of the type of technologies, development objects, and/or data that should be included.
● Any known dependencies on other elements of the existing or new systems.

Typical attributes of test scenarios:
● Should be designed to represent both typical and unusual situations.
● Should include use of valid data as well as invalid or missing data.
● Test engineers may define their own unit test cases.
● Business cases and test scenarios for System and Integration Tests are developed by the test team with assistance of developers and end-users.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.1.6 Build/Maintain Test Source Data Set
Description
This subtask deals with the procedures and considerations for actually creating, storing, and maintaining the test data. The procedures for any given project are, of course, specific to its requirements and environments, but are also opportunistic. For some projects, there will exist a comprehensive set of data or at least a good start in that direction, while for other projects, the test data may need to be created from scratch. In addition to test data that allows full functional testing (i.e., functional test data), there is also a need for adequate data for volume tests (i.e., volume test data). The following paragraphs discuss each of these data types.

Functional Test Data
Creating a source data set to test the functionality of the transformation software should be the responsibility of a specialized team largely consisting of business-aware application experts. Business application skills are necessary to ensure that the test data not only reflects the eventual production environment but that it is also engineered to trigger all the functionality specified for the application. Technical skills in whatever storage format is selected are also required to facilitate data entry and/or movement. Volume is not a requirement of the functional test data set; indeed, too much data is undesirable since the time taken to load it needlessly delays the functional test. In a data integration project, while functional test data for the application sources is indispensable, the case for a predefined data set for the targets should also be considered. If available, such a data set makes it possible to develop an automated test procedure to compare the actual result set to a predicted result set (making the necessary adjustments to generated data, such as surrogate keys, timestamps, etc.). This has additional value in that the definition of a target data set in itself serves as a sort of design audit.
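
The automated comparison of an actual result set against a predicted one, with allowance for generated values, might look like the following Python sketch. The column names in `GENERATED_COLUMNS` and both function names are hypothetical, and rows are compared as sets, so duplicate rows are not detected by this simplified version.

```python
GENERATED_COLUMNS = {"customer_key", "load_timestamp"}  # hypothetical names

def strip_generated(row):
    """Drop generated columns and return a hashable, order-independent form."""
    return tuple(sorted((k, v) for k, v in row.items()
                        if k not in GENERATED_COLUMNS))

def compare_result_sets(predicted, actual):
    """Return (missing, unexpected) rows after ignoring generated columns."""
    predicted_set = {strip_generated(r) for r in predicted}
    actual_set = {strip_generated(r) for r in actual}
    missing = predicted_set - actual_set      # predicted but not produced
    unexpected = actual_set - predicted_set   # produced but not predicted
    return missing, unexpected
```

Both returned sets being empty indicates that the load produced exactly the predicted business content, regardless of the surrogate keys and timestamps assigned during the run.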

Volume Test Data

The main objective for the volume test data set is to ensure that the project satisfies any Service Level Agreements that are in place and generally meets performance expectations in the live environment. Once again, PowerCenter can be used to generate volumes of data and to modify sensitive live information in order to preserve confidentiality. There are a number of techniques to generate multiple output rows from a single source row, such as:
● Cartesian join in source qualifier
● Normalizer transformation
● Union transformation
● Java transformation

If possible, the volume test data set should also be available to developers for unit testing in order to identify problems as soon as possible.
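
The two ideas in this section, multiplying source rows to reach volume and altering sensitive values, can be sketched in Python as follows. This is an analogue of the Cartesian-join technique rather than a PowerCenter implementation; the column names and the `mask` and `multiply_rows` functions are hypothetical, and the truncated hash is shown only for illustration, not as a substitute for a proper data masking policy.

```python
import hashlib
from itertools import product

def mask(value):
    """Replace a sensitive value with a stable token derived from a hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def multiply_rows(source_rows, copies):
    """Yield `copies` output rows per source row (a Cartesian product)."""
    for row, n in product(source_rows, range(copies)):
        yield {
            "customer_id": f"{row['customer_id']}-{n}",  # keep keys unique
            "name": mask(row["name"]),                   # hide live names
            "balance": row["balance"],
        }
```

Because the function is a generator, large volumes can be streamed to a file or a loader without holding every row in memory.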

Maintenance
In addition to the initial acquisition or generation of test data, you will need a protected location for its storage and procedures for migrating it to test environments in such a fashion that the original data set is preserved (for the next test sequence). In addition, you are likely to need procedures that will enable you to rebuild or rework the test data, as required.
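
One simple way to preserve the original data set between test sequences is to test only against a disposable copy. The Python sketch below assumes that the test data lives in files under a protected master directory; the `refresh_test_data` function and the directory layout are hypothetical.

```python
import shutil
from pathlib import Path

def refresh_test_data(master_dir, test_dir):
    """Replace the working copy in test_dir with a fresh copy of master_dir."""
    master, target = Path(master_dir), Path(test_dir)
    if target.exists():
        shutil.rmtree(target)        # discard data changed by the last run
    shutil.copytree(master, target)  # the master set itself is never modified
    return sorted(p.name for p in target.iterdir())
```

Running this before each test sequence returns the test environment to a known starting point while leaving the protected master set untouched.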

Prerequisites
None

Roles
Business Analyst (Primary)
Data Integration Developer (Primary)

Considerations
Creating the source and target data sets and conducting automated testing are non-trivial and are therefore often dismissed as impractical. This is partly the result of a failure to appreciate the role that PowerCenter can play in the execution of the test strategy. At some point in the test process, it is going to be necessary to compile a schedule of expected results from a given starting point. Using PowerCenter to make this information available and to compare the actual results from the execution of the workflows can greatly facilitate the process.

Data Migration projects should have little need for generating test data. It is strongly recommended that all data migration integration and system tests use actual production data. Therefore, effort spent generating test data on a data migration project should be very limited.

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:40

Phase 6: Test
Task 6.2 Prepare for Testing Process
Description
This is the first major task of the Test Phase – general preparations for System Test and UAT. This includes preparing environments, ramping up defect management procedures, and generally making sure the test plans and all their elements are prepared and that all participants have been notified of the upcoming testing processes.

Prerequisites
None

Roles
Data Integration Developer (Secondary)
Database Administrator (DBA) (Primary)
Presentation Layer Developer (Secondary)
Quality Assurance Manager (Primary)
Repository Administrator (Primary)
System Administrator (Primary)
Test Manager (Primary)

Considerations
Prior to beginning this subtask, you will need to collect and review the documentation generated by the previous tasks and subtasks, including the test strategy, system test plan, and UAT plan. Verify that all required test data has been prepared and that the defect tracking system is operational. Ensure that all unit test certification procedures are being followed.

Based on the system test plan and UAT plan:
● Collect all relevant requirements, functional and internal design specifications, end-user documentation, and any other related documents. Develop the test procedures and documents for testers to follow from these.
● Verify that all expected participants have been notified of the applicable test schedule.
● Review the upcoming test processes with the Project Sponsor to ensure that they are consistent with the organization's existing QA culture (i.e., in terms of testing scope, approaches, and methods).
● Review the test environment requirements (e.g., hardware, software, communications, etc.) to ensure that everything is in place and ready.
● Review testware requirements (e.g., coverage analyzers, test tracking, problem/bug tracking, etc.) to ensure that everything is ready for the upcoming tests.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.2.1 Prepare Environments
Description
It is important to prepare the test environments in advance of System Test with the following objectives:
● To emulate, to the extent possible, the Production environment.
● To provide test environments that enable full integration of the system, and isolation from development.
● To provide secure environments that support the test procedures and appropriate access.
● To allow System Tests and UAT to proceed without delays and without system disruptions.

Prerequisites
None

Roles
Data Integration Developer (Secondary)
Database Administrator (DBA) (Primary)
Presentation Layer Developer (Secondary)
Repository Administrator (Primary)
System Administrator (Primary)
Test Manager (Primary)

Considerations

Plans
A formal test plan needs to be prepared by the Project Manager in conjunction with the Test Manager. This plan should cover responsibilities, tasks, time-scales, resources, training, and success criteria. It is vital that all resources, including off-project support staff, are made available for the entire testing period.

Test scripts need to be prepared, together with a definition of the data required to execute the scripts. The Test Manager is responsible for preparing these items, but is likely to delegate a large part of the work.

A formal definition of the required environment also needs to be prepared, including all necessary hardware components (i.e., server and client), software components (i.e., operating system, database, data movement, testing tools, application tools, custom application components, etc., including versions), security and access rights, and networking.

Establishing security and isolation is critical for preventing any unauthorized or unplanned migration of development objects into the test environments. The test environment administrator(s) must have specific verifications, procedures, and timing for any migrations and sufficient controls to enforce them.

Review the test plans and scenarios to determine the technical requirements for the test environments. Volume tests and disaster/recovery tests may require special system preparations. The System Test environment may evolve into the UAT environment, depending on requirements and stability.
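
One way to make the formal environment definition checkable is to record each component and its version and compare environments mechanically. The sketch below is a minimal illustration under that assumption; the component names and version strings are invented, and `version_drift` is a hypothetical helper rather than part of any Informatica tool.

```python
def version_drift(reference, candidate):
    """Return components whose version differs from, or is missing in, `candidate`."""
    drift = {}
    for component, wanted in reference.items():
        found = candidate.get(component)
        if found != wanted:
            drift[component] = (wanted, found)
    return drift


# Hypothetical component names and version strings, for illustration only.
development = {
    "operating_system": "os-5.1",
    "database": "db-10.2",
    "data_movement": "etl-8.1",
}
system_test = {
    "operating_system": "os-5.1",
    "database": "db-10.1",
}

print(version_drift(development, system_test))
```

An empty result would indicate that the test environment matches the reference definition for every listed component.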

Processes
Where possible, all processes should be supported by the use of appropriate tools. Some of the key processes related to the preparation of the environments include:
● Training testers – a series of briefings and/or training sessions should be made available. This may be any combination of formal presentations, formal training courses, computer-based tutorials, or self-study sessions.
● Recording test results – the results of each test must be recorded and cross-referenced to the defect reporting process.
● Reporting and resolution of defects (see 5.1.3 Define Defect Tracking Process) – a process for recording defects, prioritizing their resolution, and tracking the resolution process.
● Overall test management – a process for tracking the effectiveness of UAT and the likely effort and timescale remaining.

Data
The data required for testing can be derived from the test cases defined in the scripts. This should enable a full dataset to be defined, ensuring that all possible cases are tested. 'Live data' is usually not sufficient because it does not cover all the cases the system should handle, and may require some sampling to keep the data volumes at realistic levels. It is, of course, possible to use modified live data, adding the additional cases or modifying the live data to create the required cases.

The process of creating the test data needs to be defined. Some automated approach to creating all or the majority of the data is best. There is often a need to process data through a system where some form of OLTP is involved. In this case, it must be possible to roll back to a base state of data to allow reapplication of the ‘transaction’ data – as would be achieved by restoring from back-up.

Where multiple data repositories are involved, it is important to define how these datasets relate. It is also important that the data is consistent across all the repositories and that it can be restored to a known state (or states) as and when required.
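
The roll-back-to-base-state requirement can be illustrated with a small sketch. It uses Python's standard `sqlite3` module and its `Connection.backup` method purely as a stand-in for whatever backup/restore mechanism the project's database provides; the table and values are invented for the example.

```python
import sqlite3


def snapshot(conn):
    """Copy the current database into a separate in-memory base-state copy."""
    base = sqlite3.connect(":memory:")
    conn.backup(base)
    return base


def restore(base, conn):
    """Overwrite the working database with the saved base state."""
    base.backup(conn)


work = sqlite3.connect(":memory:")
work.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
work.execute("INSERT INTO account VALUES (1, 100)")
work.commit()
base = snapshot(work)

# Apply 'transaction' data for one test cycle.
work.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
work.commit()
after_cycle = work.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]

# Return to the base state so the same transactions can be reapplied.
restore(base, work)
after_restore = work.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(after_cycle, after_restore)
```

Where several repositories are involved, the same snapshot point would need to be taken across all of them so that the restored datasets remain consistent with one another.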

Environment
A properly set-up environment is critical to the success of UAT. This covers:
● Server(s) – must be available for the required duration and have sufficient disk space and processing power for the anticipated workload.
● Client workstations – must be available and sufficiently powerful to run the required client tools.
● Server and client software – all necessary software (OS, database, ETL, test tools, data quality tools, connectivity, etc.) should be installed at the version used in development (normally) with databases created as required.
● Networking – all required LAN and WAN connectivity must be set up and firewalls configured to allow appropriate access. Bandwidth must be available for any particularly large data transmissions.
● Databases – all necessary schemas must be created and populated with an appropriate backup/restore strategy in place, and access rights defined and implemented.
● Application software – correct versions should be migrated from development.

For Data Migration, the system test environment should not be limited to the Informatica environment, but should also include all source systems, target systems, reference data and staging databases, and file systems. The system tests will be a simulation of production systems so the entire process should execute like a production environment.

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:43

Phase 6: Test
Subtask 6.2.2 Prepare Defect Management Processes
Description
The key measure of software quality is, of course, the number of defects (a defect is anything that produces results other than the expected results based on the software design specification). Therefore it is essential for software projects to have a systematic approach to detecting and resolving defects early in the development life cycle.

Prerequisites
None

Roles
Quality Assurance Manager (Primary)
Test Manager (Primary)

Considerations
Personal and peer reviews are primary sources of early defect detection. Unit testing, system testing, and UAT are other key sources; however, in these later project stages, defect detection is a much more resource-intensive activity. Worse yet, change requests and trouble reports are evidence of defects that have made their way to the end users.

There are two major components of successful defect management: defect prevention and defect detection. A good defect management process should enable developers both to lower the number of defects that are introduced and to remove defects early in the life cycle, prior to testing.

Defect management begins with the design of the initial QA strategy and a good, detailed test strategy. They should clearly define methods for reviewing system requirements and design and spell out guidelines for testing processes, tracking defects, and managing each type of test. In addition, many QA strategies include specific checklists that act as gatekeepers to authorize satisfactory completion of tests, especially during unit and system testing.

To support early defect resolution, you must have a defect tracking system that is readily accessible to developers and includes the following:
● Ability to identify and type the defect, with details of its behaviour
● Means for recording the timing of the defect discovery, resolution, and retest
● Complete description of the resolution
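
The three capabilities listed above map naturally onto a defect record. The following is a minimal sketch of such a record, assuming nothing about any particular defect tracking product; the class, field names, and the closure rule are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Defect:
    defect_id: str
    defect_type: str  # hypothetical categories, e.g. "data" or "mapping logic"
    behaviour: str    # what was observed versus what was expected
    discovered_at: datetime
    resolved_at: Optional[datetime] = None
    retested_at: Optional[datetime] = None
    resolution: str = ""

    def is_closed(self) -> bool:
        """Treat a defect as closed only when it is resolved, described, and retested."""
        return bool(self.resolved_at and self.retested_at and self.resolution.strip())


defect = Defect("D-001", "data", "Null customer keys loaded to target",
                datetime(2007, 2, 1, 9, 0))
open_state = defect.is_closed()

defect.resolved_at = datetime(2007, 2, 2, 14, 0)
defect.resolution = "Added default key lookup"
defect.retested_at = datetime(2007, 2, 3, 10, 0)
closed_state = defect.is_closed()
print(open_state, closed_state)
```

Keeping the three timestamps separate makes it possible to report how long defects wait between discovery, fix, and retest.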

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:48

Phase 6: Test
Task 6.3 Execute System Test
Description
System Test (sometimes known as Integration Test) is crucial for ensuring that the system operates reliably and according to the business requirements and technical design. Success rests largely on business users' confidence in the integrity of the data. If the system has flaws that impede its function, the data may also be flawed, or users may perceive it as flawed, which results in a loss of confidence in the system. If the system does not provide adequate performance and responsiveness, the users may abandon it (especially if it is a reporting system) because it does not meet their perceived needs.

System testing follows unit testing, providing the first tests of the fully integrated system, and offers an opportunity to clarify users' performance expectations and establish realistic goals that can be used to measure actual operation after the system is placed in production. It also offers a good opportunity to refine the data volume estimates that were originally generated in the Architect Phase. This is useful for determining if existing or planned hardware will be sufficient to meet the demands on the system.

This task incorporates five steps:
1. 6.3.1 Prepare for System Test, in which the test team determines how to test the system from end-to-end to ensure a successful load as well as planning for the environments, participants, tools and timelines for the test.
2. 6.3.2 Execute Complete System Test, in which the data integration team works with the Database Administrator to run the system tests planned in the prior subtask. It is crucial to also involve end-users in the planning and review of system tests.
3. 6.3.3 Perform Data Validation, in which the QA Manager and QA team ensure that the system is capable of delivering complete, valid data to the business users.
4. 6.3.4 Conduct Disaster Recovery Testing, in which the system’s robustness and recovery in case of disasters such as network or server failure is tested.
5. 6.3.5 Conduct Volume Testing, in which the system’s capability to handle large volumes is tested.

Prerequisites
None

Roles
Business Analyst (Primary)
Data Integration Developer (Primary)
Database Administrator (DBA) (Primary)
End User (Primary)
Network Administrator (Secondary)
Presentation Layer Developer (Secondary)
Project Sponsor (Review Only)
Quality Assurance Manager (Review Only)
Repository Administrator (Secondary)
System Administrator (Primary)
Technical Project Manager (Review Only)
Test Manager (Primary)

Considerations

All involved individuals and departments should review and approve the test plans, test procedures, and test results prior to beginning this subtask. It is important to thoroughly document the system testing procedure, describing the testing strategy, acceptance criteria, scripts, and results. This information can be invaluable later on, when the system is in operation and may not be meeting performance expectations or delivering the results that users want - or expect. For Data Migration projects, system tests are important because these are essentially ‘dress-rehearsals’ for the final migration. These tests should be executed with production-level controls and be tracked and improved upon from system test cycle to system test cycle. In data migration projects these system tests are often referred to as ‘mock-runs’ or ‘trial cutovers’.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.3.1 Prepare for System Test
Description
System test preparation consists primarily of creating the environment(s) required for testing the application and staging the system integration. System Test is the first opportunity, following comprehensive unit testing, to fully integrate all the elements of the system, and to test the system by emulating how it will be used in production. For this reason, the environment should be as similar as possible to the production environment in its hardware, software, communications, and any support tools.

Prerequisites
None

Roles
Data Integration Developer (Secondary)
Database Administrator (DBA) (Secondary)
System Administrator (Secondary)
Test Manager (Primary)

Considerations
The preparations for System Test often take much more effort than expected, so they should be preceded by a detailed integration plan that describes how all of the system elements will be physically integrated within the System Test environment. The integration plan should be specific to your environment, but some of the general steps are likely to be the same. The following general steps are common to most integration plans:
● Migration of Informatica development folders to the system test environment. These folders may also include shared folders and/or shortcut folders that may have been added or modified during the development process. In versioned repositories, deployment groups may be used for this purpose. Often, flat files or parameter files reside on the development environment’s server and need to be copied to the appropriate directories on the system test environment server.
● Data consistency in the system test environment is crucial. In order to emulate the production environment, the data being sourced and targeted should be as close as possible to production data in terms of data quality and size.
● The data model of the system test environment should be very similar to the model that is going to be implemented in production. Columns, constraints, or indices often change throughout development, so it is important to system test the data model before going into production.
● Synchronization of incremental logic is key when doing system testing. In order to emulate the production environment, the variables or parameters used for incremental logic need to match the values in the system test environment database(s). If the variables or parameters don’t match, they can cause missing data or unusual amounts of data being sourced.
For Data Migration projects the system test should not just involve running Informatica Workflows, it should also include data set-up, migrating code, executing data and process validation and post-process auditing. The system test set-up should be part of the system test, not a pre-system test step.
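
The incremental-logic synchronization point in the integration steps above can be checked before each run. The sketch below assumes a simple design in which a workflow parameter holds the last extract timestamp and the target table exposes the timestamp of its newest loaded row; both inputs and the `check_incremental_sync` helper are hypothetical.

```python
from datetime import datetime


def check_incremental_sync(param_last_extract, target_max_loaded):
    """Compare the incremental-load parameter with the newest row already in the target."""
    if param_last_extract > target_max_loaded:
        # Rows between the target's newest row and the parameter would never be sourced.
        return "missing data risk"
    if param_last_extract < target_max_loaded:
        # Rows that are already loaded would be sourced again.
        return "excess data risk"
    return "in sync"


status_ok = check_incremental_sync(datetime(2007, 2, 1), datetime(2007, 2, 1))
status_gap = check_incremental_sync(datetime(2007, 2, 5), datetime(2007, 2, 1))
status_dup = check_incremental_sync(datetime(2007, 1, 20), datetime(2007, 2, 1))
print(status_ok, status_gap, status_dup, sep=" | ")
```

Running such a check as part of the system test set-up, rather than before it, keeps it inside the tested and repeatable process.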

Best Practices
None

Sample Deliverables
System Test Plan

Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.3.2 Execute Complete System Test
Description
System testing offers an opportunity to establish performance expectations and verify that the system works as designed, as well as to refine the data volume estimates generated in the Architect Phase. This subtask involves a number of guidelines for running the complete system test and resolving or escalating any issues that may arise during testing.

Prerequisites
None

Roles
Business Analyst (Secondary)
Data Integration Developer (Secondary)
Database Administrator (DBA) (Review Only)
Network Administrator (Review Only)
Presentation Layer Developer (Secondary)
Quality Assurance Manager (Review Only)
Repository Administrator (Review Only)
System Administrator (Review Only)
Technical Project Manager (Review Only)
Test Manager (Primary)

Considerations
System Test Plan
A system test plan needs to include the prerequisites for entering the system test phase, the criteria for successfully exiting the system test phase, and defect classifications. In addition, all test conditions, expected results, and test data need to be available prior to system test.

Load Routines
Ensure that the system test plan includes all types of load that may be encountered during the normal operation of the system. For example, a new data warehouse (or a new instance of a data warehouse) may include a one-off initial load step. There may also be weekly, monthly, or ad-hoc processes beyond the normal incremental load routines. System testing is a cyclical process. The project team should plan to execute multiple iterations of the most common load routines within the timeframe allowed for system testing. Applications should be run in the order specified in the test plan.

Scheduling
An understanding of dependent predecessors is crucial for the execution of end-to-end testing, as is the schedule for the testing run. Scheduling, which is the responsibility of the testing team, is generally facilitated through an application such as the PowerCenter Workflow Manager module and/or a third-party scheduling tool. Use the pmcmd command line syntax when running PowerCenter tasks and workflows with a third-party scheduler. Third-party scheduling tools can create dependencies between PowerCenter tasks and jobs that may not be possible to run on PowerCenter. In addition, the tools in PowerCenter and/or a third-party scheduling tool can be used to detect long-running sessions/tasks and alert the system test team via email. This helps to identify issues early and to manage the system test timeframe effectively.
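
As a sketch of what a third-party scheduler might hand to the operating system, the helper below assembles a pmcmd `startworkflow` call as an argument list. The option names shown (`-sv`, `-d`, `-u`, `-pv`, `-f`, `-wait`) are recalled from common pmcmd usage and should be treated as assumptions to verify against the Command Reference for the installed PowerCenter version; the service, domain, folder, and workflow names are invented.

```python
def build_pmcmd_startworkflow(service, domain, user, password_env_var, folder, workflow):
    """Assemble (but do not run) a pmcmd startworkflow command as an argv list."""
    return [
        "pmcmd", "startworkflow",
        "-sv", service,           # Integration Service name (assumed option)
        "-d", domain,             # domain name (assumed option)
        "-u", user,
        "-pv", password_env_var,  # name of an environment variable holding the password
        "-f", folder,
        "-wait",                  # block until the workflow finishes
        workflow,
    ]


argv = build_pmcmd_startworkflow(
    service="IS_SYSTEST",         # hypothetical names throughout
    domain="Domain_SysTest",
    user="test_runner",
    password_env_var="PM_TEST_PASSWORD",
    folder="SYSTEST_LOADS",
    workflow="wf_load_customers",
)
print(" ".join(argv))
```

A scheduler would typically execute such a list with something like `subprocess.run(argv)` and treat a non-zero return code as a failed predecessor, holding back dependent jobs.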

System Test Results
The team executing the system test plan is responsible for tracking the expected and actual results of each session and task run. Commercial software tools are available for logging test cases and storing test results. The details of each PowerCenter session run can be found in the Workflow Monitor. To see the results:
● Right-click the session in the Workflow Monitor and choose ‘Properties’.
● Click the Transformation Statistics tab in the Properties dialog box.

Session statistics are also available in the PowerCenter repository view REP_SESS_LOG, or through Metadata Reporter.
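
Tracking expected against actual results lends itself to a simple automated comparison. The sketch below assumes the session statistics have already been exported into plain dictionaries; the session name and metric names are hypothetical and are not actual repository view column names.

```python
def compare_session_results(expected, actual):
    """Return (session, metric, expected, actual) tuples for every mismatch."""
    mismatches = []
    for session, metrics in expected.items():
        observed = actual.get(session, {})
        for metric, want in metrics.items():
            got = observed.get(metric)
            if got != want:
                mismatches.append((session, metric, want, got))
    return mismatches


# Hypothetical session and metric names, for illustration only.
expected = {"s_load_customers": {"rows_loaded": 1000, "rows_rejected": 0}}
actual = {"s_load_customers": {"rows_loaded": 998, "rows_rejected": 2}}

for mismatch in compare_session_results(expected, actual):
    print(mismatch)
```

Each mismatch tuple carries enough detail to be cross-referenced in the defect reporting process.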

Resolution of Coding Defects
The testing team must document the specific statistical results of each run and communicate those results back to the project development team. If the results do not meet the criteria listed in the test case, or if any process fails during testing, the test team should immediately generate a change request. The change request is assigned to the developer(s) responsible for completing system modifications. In the case of a PowerCenter session failure, the test team should seek the advice of the appropriate developer and business analyst before continuing with any other dependent tests.

Ideally, all defects will be captured, fixed, and successfully retested within the system testing timeframe. In reality, this is unlikely to happen. If outstanding defects are still apparent at the end of the system testing period, the project team needs to decide how to proceed. If the system test plan contains successful system test completion criteria, those criteria must be fulfilled. Defect levels must meet established criteria for completion of the system test cycle. Defects should be judged by their number and by their impact.

Ultimately, the project team is responsible for ensuring that the tests adhere to the system test plan and the test cases within it (developed in Subtask 6.3.1 Prepare for System Test). The project team must review and sign off on the results of the tests.

For Data Migration projects, because they are usually part of a larger implementation, the system test should be integrated with the larger project system test. The results of this test should be reviewed, improved upon, and communicated to the project manager or project management office (PMO). It is common for these types of projects to have three or four full system tests, otherwise known as ‘mock runs’ or ‘trial cutovers’.
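
Judging defects "by their number and by their impact" can be expressed as per-severity limits on open defects. The following is a minimal sketch of such a completion check; the severity labels and limits are hypothetical and would come from the project's own system test plan.

```python
from collections import Counter


def exit_criteria_met(open_defects, max_open_by_severity):
    """Return (passed, breaches) given open defects and per-severity limits."""
    counts = Counter(d["severity"] for d in open_defects)
    breaches = {
        severity: counts[severity]
        for severity, limit in max_open_by_severity.items()
        if counts[severity] > limit
    }
    # A severity with no defined limit is reported rather than silently accepted.
    for severity in counts:
        if severity not in max_open_by_severity:
            breaches[severity] = counts[severity]
    return (not breaches), breaches


limits = {"critical": 0, "major": 2, "minor": 10}  # hypothetical thresholds
open_defects = [
    {"id": "D-001", "severity": "critical"},
    {"id": "D-002", "severity": "major"},
]
passed, breaches = exit_criteria_met(open_defects, limits)
print(passed, breaches)
```

In this example one open critical defect is enough to fail the cycle, even though the total number of open defects is small.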

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:46

Phase 6: Test
Subtask 6.3.3 Perform Data Validation
Description
The purpose of data validation is to ensure that source data is populated as per specification. The team responsible for completing the end-to-end test plan should be in a position to utilize the results detailed in the testing documentation (e.g., TCR, CTPs, TCD, and TCRs). Test team members should review and analyze the test results to determine if project and business expectations are being met.
● If the team concludes that the expectations are being met, it can sign off on the end-to-end testing process.
● If expectations are not met, the testing team should perform a gap analysis on the differences between the test results and the project and business expectations.

The gap analysis should list the errors and requirements not met so that a Data Integration Developer can be assigned to investigate the issue. The analysis should also include data from initial runs in production. The Data Integration Developer should assess the resources and time required to modify the data integration environment to achieve the required test results. The Project Sponsor and Project Manager should then finalize the approach for incorporating the modifications, which may include obtaining additional funding or resources, limiting the scope of the modifications, or redefining the business requirements to minimize modifications.

Prerequisites
None

Roles
Business Analyst (Primary)
Data Integration Developer (Review Only)
Presentation Layer Developer (Secondary)
Project Sponsor (Review Only)
Quality Assurance Manager (Review Only)
Technical Project Manager (Review Only)
Test Manager (Primary)

Considerations
Before performing data validation, it is important to consider these issues:
● Job Run Validation. A very high-level testing validation can be performed using dashboards or custom reports using Informatica Data Explorer. The session logs and the Workflow Monitor can be used to check whether the job has completed successfully. If relational database error logging is chosen, then the error tables can be checked for any transformation errors and session errors. The Data Integration Developer needs to resolve the errors identified in the error tables. The Integration Service generates the following tables to help you track row errors:
    ○ PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source row.
    ○ PMERR_MSG. Stores metadata about an error and the error message.
    ○ PMERR_SESS. Stores metadata about the session.
    ○ PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype, when a transformation error occurs.
● Involvement. The test team, the QA team, and, ultimately, the end-user community are all jointly responsible for ensuring the accuracy of the data. At the conclusion of system testing, all must sign off to indicate their acceptance of the data quality.
● Access To Front-End for Reviewing Results. The test team should have access to reports and/or a front-end tool to help review the results of each testing run. Before testing begins, the team should determine just how results are to be reviewed and reported, what tool(s) are to be used, and how the results are to be validated. The test team should also have access to current business reports produced in legacy and current operational systems. The current reports can be compared to those produced from data in the new system to determine whether requirements are satisfied and whether the new reports are accurate.

The Data Validation task has enormous scope and is a significant phase in any project cycle. Data validation can be either manual or automated.

Manual. This technique involves manually validating target data against the source and also ensuring that all the transformations have been correctly applied. Manual validation may be valid for a limited set of data or for master data.

Automated. This technique involves using various techniques and/or tools to validate data and ensure, at the end of the cycle, that all the requirements are met.

The following tools are very useful for data validation:
● File Diff. This utility is generally available with any testing tool and is very useful if the source(s) and target(s) are files. Otherwise, the result sets from the source and/or target systems can be saved as flat files and compared using file diff utilities.
● Data Analysis Using IDQ. The testing team can use Informatica Data Quality (IDQ) Data Analysis plans to assess the level of data quality needs. Plans can be built to identify problems with data conformity and consistency. Once the data is analyzed, scorecards can be used to generate a high-level view of the data quality. Using the results from data analysis and scorecards, new test cases can be added and new test data can be created for the testing cycle.
● Using DataProfiler In Data Validation. Full data validation can be one of the most time-consuming elements of the testing process. During the System Test phase of the data integration project, you can use data profiling technology to validate the data loaded to the target database. Data profiling allows the project team to test the requirements and assumptions that were the basis for the Design Phase and Build Phase of the project, facilitating such tests as:
    ○ Business rule validations
    ○ Domain validations
    ○ Row counts and distinct value counts
    ○ Aggregation accuracy
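
The File Diff approach from the tool list above can be sketched with Python's standard `difflib` module, standing in for whichever diff utility the testing tool provides. The sketch assumes both result sets have been saved as delimited text lines; the sample rows are invented, and sorting before comparison is a design choice so that row order alone does not register as a difference.

```python
import difflib


def diff_result_sets(source_lines, target_lines):
    """Return unified-diff lines between two result sets, ignoring row order."""
    return list(difflib.unified_diff(
        sorted(source_lines), sorted(target_lines),
        fromfile="source", tofile="target", lineterm="",
    ))


source = ["1|Smith|London", "2|Jones|Paris"]
target = ["2|Jones|Pariss", "1|Smith|London"]

for line in diff_result_sets(source, target):
    print(line)
```

An empty diff indicates that the two extracts contain the same rows; any `-`/`+` pair points at a row that differs between source and target.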

Throughout testing, it is advisable to re-profile the source data. This provides information on any source data changes that may have taken place since the Design Phase. Additionally, it can be used to verify the makeup and diversity of any data sets extracted or created for the purposes of testing. This is particularly relevant in environments where production source data was not available during design. When development data is used to develop the business rules for the mappings, surprises commonly occur when production data finally becomes available.
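
Profile-style checks such as row counts, distinct value counts, and aggregation accuracy can be approximated outside a profiling tool for a quick comparison of source and target. The sketch below is a simplified stand-in, not the profiling product itself; the column names and sample rows are hypothetical, and amounts are held as integers to avoid rounding noise.

```python
def profile(rows, key_column, amount_column):
    """Compute a tiny profile: row count, distinct key count, and amount total."""
    return {
        "row_count": len(rows),
        "distinct_keys": len({row[key_column] for row in rows}),
        "amount_total": sum(row[amount_column] for row in rows),
    }


def profile_differences(source_rows, target_rows, key_column, amount_column):
    """Return {measure: (source_value, target_value)} for measures that disagree."""
    src = profile(source_rows, key_column, amount_column)
    tgt = profile(target_rows, key_column, amount_column)
    return {name: (src[name], tgt[name]) for name in src if src[name] != tgt[name]}


source_rows = [
    {"order_id": 1, "amount_cents": 1000},
    {"order_id": 2, "amount_cents": 2050},
    {"order_id": 3, "amount_cents": 500},
]
target_rows = source_rows[:2]  # simulate one row lost during the load

print(profile_differences(source_rows, target_rows, "order_id", "amount_cents"))
```

Re-running the same profile against the source at intervals also shows whether the source data itself has shifted since the Design Phase.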

Defect Management:
The defects encountered during data validation should be organized using either a simple tool like an Excel (or comparable) spreadsheet or a more advanced tool. Advanced tools may have facilities for defect assignment, defect status changes, and/or a section for defect explanation. The Data Integration Developer and the testing team must ensure that all defects are identified and corrected before changing the defect status.

For Data Migration projects, it is important to identify a set of processes and procedures to be executed to simplify the validation process. These processes and procedures should be built into the Punch List and should focus on reliability and efficiency. For large-scale data migration projects, it is important to realize the scale of validation. A set of tools must be developed to enable the business validation personnel to quickly and accurately validate that the data migration was complete. Additionally, it is important that the run book includes steps to verify that all technical steps were completed successfully. PowerCenter Metadata Reporter should be leveraged and documented in the punch list steps, and detailed records of all interaction points should be included in operational procedures.

Best Practices
None

Sample Deliverables
None
Last updated: 15-Feb-07 19:48

Phase 6: Test
Subtask 6.3.5 Conduct Volume Testing
Description
Basic volume testing seeks to verify that the system can cope with anticipated production data levels. Taken to extremes, volume testing seeks to find the physical and logical limits of a system; this is also known as stress testing. Stress and volume testing seek to determine when and if system behavior changes as the load increases. A volume testing exercise is similar to a disaster testing exercise. The test scenarios encountered may never happen in the production environment. However, a well-planned and well-conducted test exercise provides invaluable reassurance to the business and IT communities regarding the stability and resilience of the system.

Prerequisites
None

Roles
Data Integration Developer (Secondary)
Database Administrator (DBA) (Primary)
Network Administrator (Secondary)
System Administrator (Secondary)
Test Manager (Primary)

Considerations
Understand Service Level Agreements
Before starting the volume test exercise, consider the Service Level Agreements (SLAs) for the particular system. The SLAs should set measures for system availability and projected temporal growth in the amount of data being stored by the system. The SLAs are the benchmark against which to measure the volume test results.

Estimate Projected Data Volumes Over Time and Consider Peak Load Periods
Enlist the help of the DBAs and Business Analysts to estimate the growth in projected data volume across the lifetime of the system. Remember to make allowances for any data archiving strategy that exists in the system. Data archiving helps to reduce the volume of data in the actual core production system, although of course, the net volume of data will increase over time. Use the projected data volumes to provide benchmarks for testing.

Organizations often experience higher than normal periods of activity at predictable times. For example, a retailer or credit card supplier may experience peak activity during weekends or holiday periods. A bank may have month-end or year-end processes and statements to produce. Volume testing exercises should aim to simulate throughput at peak periods as well as normal periods. Stress testing goes beyond the peak period data volumes in order to find the limits of the system.

A task such as duplicate record identification (known as data matching in Informatica Data Quality parlance) can place significant demands on system resources. Informatica Data Quality (IDQ) can perform millions or billions of comparison operations in a matching process. The time available for the completion of a matching process can have a big impact on the perception that the plan is running correctly. Bear in mind that, for these reasons, data matching operations are often scheduled for off-peak periods. Data matching is also a processor-intensive activity: the speed of the processor has a significant impact on how fast a matching process completes. If the project includes data quality operations, consult with a Data Quality Developer when estimating data volumes over time and peak load periods.
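
The arithmetic behind these estimates can be sketched briefly. The first function projects row counts under an assumed constant monthly growth rate with a fixed number of rows archived each month; the other two show why matching reaches billions of comparisons, assuming a naive all-pairs comparison and, alternatively, comparison only within pre-formed groups. The growth model, the grouping assumption, and all figures are illustrative, not a description of how any particular matching plan is configured.

```python
def projected_rows(current_rows, monthly_growth_rate, months, archived_per_month=0):
    """Project row count under constant monthly growth minus a fixed monthly archive."""
    rows = current_rows
    for _ in range(months):
        rows = rows * (1 + monthly_growth_rate) - archived_per_month
    return max(int(rows), 0)


def pairwise_comparisons(record_count):
    """Comparisons needed if every record is compared with every other record once."""
    return record_count * (record_count - 1) // 2


def grouped_comparisons(group_sizes):
    """Comparisons needed if records are only compared within their own group."""
    return sum(pairwise_comparisons(size) for size in group_sizes)


all_pairs = pairwise_comparisons(100_000)
within_groups = grouped_comparisons([1_000] * 100)  # same 100,000 records in 100 groups
print(all_pairs, within_groups)
print(projected_rows(1_000_000, 0.0, 12, archived_per_month=10_000))
```

With 100,000 records the all-pairs count is close to five billion, while the grouped count is about one hundredth of that, which is why the grouping strategy and the available processing window both belong in the volume estimate.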

Volume Test Planning
Volume test planning is similar in many ways to disaster test planning. See 6.3.4 Conduct Disaster Recovery Testing for details on disaster test planning guidelines. However, there are some volume-test specific issues to consider during the planning stage:

● Obtaining Volume Test Data and Data Scrambling. The test team responsible for completing the end-to-end test plan should ensure that the volume(s) of test data accurately reflect the production business environment. Obtaining adequate volumes of data for testing in a nonproduction environment can be time-consuming and logistically difficult, so remember to make allowances in the test plan for this. Some organizations choose to copy data from the production environment into the test system. Security protocols must be maintained if data is copied from a production environment, since the data is likely to need scrambling. Some of the popular RDBMS products contain built-in scrambling packages; third-party scrambling solutions are also available. Contact the DBA and the IT security manager for guidance on the data scrambling protocol of the department or organization. For new applications, production data probably does not exist. Some commercially available software products can generate large volumes of data. Alternatively, one of the developers may be able to build a customized suite of programs to artificially generate data.

● Hardware and Network Requirements and Test Timing. Remember to consider the hardware and network characteristics when conducting volume testing. Do they match the production environment? Be sure to make allowances in the test results if there is a shortfall in processing capacity or there are network limitations in the test environment. Volume testing may involve ensuring that testing occurs at an appropriate time of day and day of week, and taking into account any other applications that may negatively affect the database and/or network resources.

● Increasing Data Volumes. Volume testing cycles need to include normal expected volumes of data and some exceptionally high volumes of data. Incorporate peak period loads into the volume testing schedules. If stress tests are being carried out, data volumes need to be increased even further. Additional pressure can be applied to the system, for example, by adding a high number of database users or temporarily bringing down a server. Any particular stress test cases need to be logged in the test plan and the test schedules.
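As an illustration of the data scrambling idea mentioned above, the following sketch replaces sensitive values with repeatable tokens so that joins between scrambled tables still line up. It is an assumption-laden example, not the department's scrambling protocol: the column names and key are hypothetical, all values are assumed to be strings, and any real approach should be agreed with the DBA and the IT security manager.

```python
import hashlib
import hmac


def scramble(value, secret_key, length=12):
    """Replace a sensitive string with a repeatable token derived from a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def scramble_rows(rows, sensitive_columns, secret_key):
    """Return copies of the rows with the named columns scrambled."""
    return [
        {
            col: scramble(val, secret_key) if col in sensitive_columns else val
            for col, val in row.items()
        }
        for row in rows
    ]


# Hypothetical sample copied from production; the key would be kept out of source control.
key = b"example-only-key"
production_sample = [
    {"customer_id": "1001", "name": "Ann Smith", "city": "Leeds"},
    {"customer_id": "1002", "name": "Raj Patel", "city": "Leeds"},
]
test_rows = scramble_rows(production_sample, {"name"}, key)
print(test_rows)
```

Because the same input always yields the same token, data volumes and key relationships are preserved while the original values are not carried into the test system.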

Volume and Stress Test Execution
● Volume Test Results Logging. The volume testing team is responsible for capturing volume test results. Be sure to capture performance statistics for PowerCenter tasks, database throughput, server performance, and network efficiency. PowerCenter Metadata Reporter provides an excellent method of logging PowerCenter session performance over time. Run the Metadata Reporter for each test cycle to capture session and workflow elapsed time. The results can be displayed in Data Analyzer dashboards or exported to other media (e.g., PDF files). The views in the PowerCenter Repository can also be queried directly with SQL statements. In addition, collaborate with the network and server administrators regarding the option to capture additional statistics, such as those related to CPU usage, data transfer efficiency, and writing to disk. The type of statistics to capture depends on the operating system in use. If jobs and tasks are being run through a scheduling tool, use the features within the scheduling tool to capture elapsed time data. Alternatively, use shell scripts or batch file scripts to retrieve time and process data from the operating system.

● System Limits, Scalability, and Bottlenecks. If the system has been well-designed and built, the applications are more likely to perform in a predictable manner as data volumes increase. This is known as scalability and is a very desirable trait in any software system. Eventually, however, the limits of the system are likely to be exposed as data volumes reach a critical mass and other stresses are introduced into the system. Physical or user-defined limits may be reached on particular parameters. For example, exceeding the maximum file size supported on an operating system constitutes a physical limit. Alternatively, breaching sort space parameters by running a database SQL query probably constitutes a limit that has been defined by the DBA. Bottlenecks are likely to appear in the load processes before such limits are exceeded. For example, a SQL query called in a PowerCenter session may experience a sudden drop in performance when data volumes reach a threshold figure. The DBA and application developer need to investigate any sudden drop in the performance of a particular query. Volume and stress testing is intended to gradually increase the data load in order to expose weaknesses in the system as a whole.
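Where neither Metadata Reporter nor a scheduling tool is available, the scripted alternative mentioned above can be as simple as a wrapper that times a command. The sketch below is one possible shape for such a wrapper; the command it runs is a stand-in, and the field names are hypothetical rather than anything defined by PowerCenter.

```python
import subprocess
import sys
import time
from datetime import datetime, timezone


def run_and_time(command):
    """Run a command and return its start time, elapsed seconds, and exit code."""
    started = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    completed = subprocess.run(command, check=False)
    elapsed = time.perf_counter() - t0
    return {
        "command": " ".join(command),
        "started_utc": started.isoformat(timespec="seconds"),
        "elapsed_seconds": round(elapsed, 3),
        "exit_code": completed.returncode,
    }


# Stand-in for a real load step: a trivial Python command.
result = run_and_time([sys.executable, "-c", "print('load step finished')"])
print(result)
```

Appending each result to a file per test cycle gives a simple history of elapsed times to set alongside the database and server statistics.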

Conclusion
Volume and stress testing are important aspects of the overall system testing strategy. The test results provide important information that can be used to resolve issues before they occur in the live system. However, be aware that it is not possible to test all scenarios that may cause the system to crash. A sound system architecture and well-built software applications can help prevent sudden catastrophic errors.

Best Practices
None

Sample Deliverables
None
Last updated: 18-Oct-07 15:11

Phase 6: Test
Task 6.4 Conduct User Acceptance Testing
Description
User Acceptance Testing (UAT) is arguably the most important step in the project and is crucial to verifying that the system meets the users’ requirements. Being business usage-focused, it relates to the business requirements rather than to testing all the details of the technical specification. As such, UAT is considered black box testing (i.e., without knowledge of all the underlying logic) that focuses on the deliverables to the end user, primarily through the presentation layer. UAT is the responsibility of the user community in terms of organization, staffing, and final acceptance, but much of the preparation will have been undertaken by IT staff working to a plan agreed with the users. The function of user acceptance testing is to obtain final functional approval from the user community for the solution to be deployed into production. As such, every effort must be made to replicate the production conditions.

Prerequisites
None

Roles
End User (Primary)
Test Manager (Primary)
User Acceptance Test Lead (Primary)

Considerations
Plans
By this time, the user community should have precisely defined the User Acceptance Criteria, as well as the specific business objectives and requirements for the project. UAT Acceptance Criteria should include:

● tolerable bug levels, based on the defect management procedures
● report validation procedures (data audit, etc.), including “gold standard” reports to use for validation
● data quality tolerances that must be met
● validation procedures based on comparison to existing systems (esp. for validation of data migration/synchronization projects or operational integration)
● required performance tolerances, including response time and usability

As the testers may not have a technical background, the plan should include detailed procedures for testers to follow. The success of UAT depends on having certain critical items in place:
● Formal testing plan supported by detailed test scripts
● Properly configured environment, including the required test data (ideally a copy of the real, production environment and data)
● Adequately experienced test team members from the end user community
● Technical support personnel to support the testing team and to evaluate and remedy problems and defects discovered

Staffing the User Acceptance Testing
It is important that the user acceptance testers and their management are thoroughly committed to the new system and ensuring its success. There needs to be communication with the user community so that they are informed of the project’s progress and able to identify appropriate members of staff to make available to carry out the testing. These participants will become the users most equipped to adopt the new system and so should be considered “super-users” who may participate in user training thereafter.

Best Practices
None

Sample Deliverables
None
Last updated: 16-Feb-07 14:07

Phase 6: Test
Task 6.5 Tune System Performance
Description
Tuning a system can, in some cases, provide orders of magnitude performance gains. However, tuning is not something that should just be performed after the system is in production; rather, it is a concept of continual analysis and optimization. More importantly, tuning is a philosophy. The concept of performance must permeate all stages of development, testing, and deployment. Decisions made during the development process can seriously impact performance and no level of production tuning can compensate for an inefficient design that must be redeveloped. The information in this section is intended for use by Data Integration Developers, Data Quality Developers, Database Administrators, and System Administrators, but should be useful for anyone responsible for the long-term maintenance, performance, and support of PowerCenter Sessions, Data Quality Plans, PowerExchange Connectivity and Data Analyzer Reports.

Prerequisites
None

Roles
Data Integration Developer (Primary)
Data Warehouse Administrator (Primary)
Database Administrator (DBA) (Primary)
Network Administrator (Primary)
Presentation Layer Developer (Primary)
Quality Assurance Manager (Review Only)
Repository Administrator (Primary)
System Administrator (Primary)
System Operator (Primary)
Technical Project Manager (Review Only)
Test Manager (Primary)

Considerations
Tuning the performance of the data integration environment involves more than simply tuning PowerCenter or any other Informatica product. True system performance analysis requires looking at all areas of the environment to determine opportunities for better performance from relational database systems, file systems, network bandwidth, and even hardware. The tuning effort requires benchmarking, followed by small incremental tuning changes to the environment, then re-executing the benchmarked data integration processes to determine the effect of the tuning changes.

Often, tuning efforts mistakenly focus on PowerCenter as the only point of concern when there may be other areas causing the bottleneck and needing attention. If you are sourcing data from a relational database, for example, your data integration loads can never be faster than the source database can provide data. If the source database is poorly indexed, poorly implemented, or underpowered, no amount of downstream tuning in PowerCenter, hardware, network, file systems, etc. can fix the problem of slow source data access.

Throughout the tuning process, the entire end-to-end process must be considered and measured. The unit of work being baselined may be a single PowerCenter session, for example, but it is always necessary to consider the end-to-end process of that session in the tuning efforts.

Another important consideration of system tuning is the availability of an ongoing means to monitor system performance. While it is certainly important to focus on a specific area, tune, and deploy to production to gain benefit, continuously monitoring the performance of the system may reveal areas that show degradation over time and sometimes even immediate, extreme degradation for one reason or another. Quick identification of these areas allows proactive tuning and adjustments before the problems become catastrophic. A good monitoring system may involve a variety of technologies to provide a full view of the environment.
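One simple form of ongoing monitoring is to compare recent run durations with the benchmarked baseline and flag any process that has slowed beyond an agreed tolerance. The sketch below assumes the durations have already been collected by some other means; the 20% tolerance and the sample figures are illustrative only.

```python
from statistics import mean


def degradation_ratio(baseline_seconds, recent_seconds):
    """Return the mean recent duration divided by the mean baseline duration."""
    return mean(recent_seconds) / mean(baseline_seconds)


def needs_attention(baseline_seconds, recent_seconds, tolerance=0.20):
    """True when recent runs are slower than baseline by more than the tolerance."""
    return degradation_ratio(baseline_seconds, recent_seconds) > 1 + tolerance


# Hypothetical durations in seconds for one end-to-end process.
baseline = [100, 100, 100]
last_week = [130, 130]

print(degradation_ratio(baseline, last_week))
print(needs_attention(baseline, last_week))
```

Measuring the end-to-end duration rather than a single session keeps the check aligned with the point made above: the bottleneck may lie outside PowerCenter.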

Note: The PowerCenter Administrator's Guide provides extensive information on performance tuning and is an excellent reference source on this topic.

For data migration projects, performance is often an important consideration. If a data migration project is the result of the implementation of a new packaged application or operational system, downtime is usually required. Because this downtime may prevent the business from operating, the scheduled outage window must be as short as possible. Therefore, performance tuning is often addressed between system tests.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:49

Phase 6: Test
Subtask 6.5.1 Benchmark
Description
Benchmarking involves the process of running sessions or reports and collecting run statistics to set a baseline for comparison. The benchmark can be used as the standard for comparison after the session or report is tuned for performance. When determining a benchmark, the two key statistics to record are:
● session duration from start to finish, and
● rows per second throughput.

Prerequisites
None

Roles
Data Integration Developer (Primary)
Data Warehouse Administrator (Primary)
Database Administrator (DBA) (Primary)
Network Administrator (Primary)
Presentation Layer Developer (Primary)
Repository Administrator (Primary)
System Administrator (Primary)
Test Manager (Primary)

Considerations

Since the goal of this task is to improve the performance of the entire system, it is important to choose a variety of mappings to benchmark. Having a variety of mappings ensures that optimizing one session does not adversely affect the performance of another session. It is important to work with the same exact data set each time you run a session for benchmarking and performance tuning. For example, if you run 1,000 rows for the benchmark, it is important to run the exact same rows for future performance tuning tests. After choosing a set of mappings, create a set of new sessions that use the default settings. Run these sessions when no other processes are running in the background.

Tip: Tracking Results
One way to track benchmarking results is to create a reference spreadsheet. This should record the number of rows processed for each source and target, the session start time, end time, time to complete, and rows per second throughput. Track two values for rows per second throughput: rows per second as calculated by PowerCenter (from transformation statistics in the session properties), and the average rows processed per second (the number of rows loaded divided by the total duration).
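The tracking row described in the tip can be assembled as follows. This is a sketch under stated assumptions: the session name and figures are invented, and the PowerCenter-reported rate is simply passed in as a number read from the session statistics by hand.

```python
from datetime import datetime


def benchmark_row(session, rows_loaded, start, end, reported_rows_per_sec):
    """Build one spreadsheet row for a benchmark run.

    average_rows_per_sec is rows loaded divided by the total elapsed seconds.
    """
    elapsed = (end - start).total_seconds()
    return {
        "session": session,
        "rows_loaded": rows_loaded,
        "start": start.isoformat(),
        "end": end.isoformat(),
        "elapsed_seconds": elapsed,
        "reported_rows_per_sec": reported_rows_per_sec,
        "average_rows_per_sec": rows_loaded / elapsed if elapsed > 0 else None,
    }


# Hypothetical run: 120,000 rows loaded in ten minutes.
row = benchmark_row(
    "s_load_customers",
    rows_loaded=120_000,
    start=datetime(2007, 2, 1, 22, 0, 0),
    end=datetime(2007, 2, 1, 22, 10, 0),
    reported_rows_per_sec=350,
)
print(row)
```

The two throughput values often differ because the average includes session start-up and commit time, so a large gap between them can itself be a useful signal.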

If it is not possible to run the session without background processes, schedule the session to run daily at a time when few processes are running on the server. Be sure that the session runs at the same time each day or night for benchmarking, and at the same time for future tests. Track the performance results in a spreadsheet over a period of days or for several runs. After the statistics are gathered, compile the average of the results in a new spreadsheet. Once the average results are calculated, identify the sessions that have the lowest throughput or that miss their load window. These sessions are the first candidates for performance tuning.

When the benchmark is complete, the sessions should be tuned for performance. It should be possible to identify potential areas for improvement by considering the machine, network, database, and PowerCenter session and server process.

Data Analyzer benchmarking should focus on the time taken to run the source query, generate the report, and display it in the user’s browser.
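The averaging and ranking step for session benchmarks can be sketched as below. The session names, durations, and single shared load window are hypothetical; in practice each session may have its own window.

```python
from statistics import mean


def summarize(runs, load_window_seconds):
    """Average each session's runs and order the tuning candidates.

    runs maps a session name to a list of (elapsed_seconds, rows_loaded) tuples.
    Sessions that miss the load window come first, then lowest throughput first.
    """
    summary = []
    for session, results in runs.items():
        avg_elapsed = mean([elapsed for elapsed, _ in results])
        avg_rate = mean([rows / elapsed for elapsed, rows in results])
        summary.append({
            "session": session,
            "avg_elapsed_seconds": avg_elapsed,
            "avg_rows_per_sec": avg_rate,
            "misses_window": avg_elapsed > load_window_seconds,
        })
    return sorted(summary, key=lambda s: (not s["misses_window"], s["avg_rows_per_sec"]))


# Hypothetical results gathered over several nights.
runs = {
    "s_a": [(100, 10_000), (100, 10_000)],
    "s_b": [(400, 4_000)],
    "s_c": [(50, 50_000)],
}
for entry in summarize(runs, load_window_seconds=300):
    print(entry)
```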

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:49

Phase 6: Test
Subtask 6.5.2 Identify Areas for Improvement
Description
The goal of this subtask is to identify areas for improvement, based on the performance benchmarks established in Subtask 6.5.1 Benchmark.

Prerequisites
None

Roles
Data Integration Developer (Primary)
Data Warehouse Administrator (Primary)
Database Administrator (DBA) (Primary)
Network Administrator (Primary)
Presentation Layer Developer (Primary)
Repository Administrator (Primary)
System Administrator (Primary)
Test Manager (Primary)

Considerations
After performance benchmarks are established (in 6.5.1 Benchmark), careful analysis of the results can reveal areas that may be improved through tuning. It is important to consider all possible areas for improvement, including:

● Machine. Regardless of whether the system is UNIX- or NT-based.
● Network. An often-overlooked facet of system performance, network optimization can have a major effect on overall system performance. For example, if the process of moving or FTPing files from a remote server takes four hours and the PowerCenter session takes four minutes, then optimizing and tuning the network may help to shorten the overall process of data movement, session processing, and backup. Key considerations for network performance include the network card and its settings, network protocol employed, available bandwidth, packet size settings, etc.
● Database. Database tuning is, in itself, an art form and is largely dependent on the DBA's skill, finesse, and in-depth understanding of the database engine. A major consideration in tuning databases is in defining throughput versus response time. It is important to understand that analytic solutions define their performance in response time, while many OLTP systems measure their performance in throughput, and most DBAs are schooled in OLTP performance tuning rather than response time tuning. Each of the three functional areas of database tuning (i.e., memory, disk I/O, and processing) must be addressed for optimal performance, or one of the other areas will suffer.
● PowerCenter. Most systems need to tune the PowerCenter session and server process in order to achieve an acceptable level of performance. Tuning the server daemon process and individual sessions can increase performance by a factor of 2 or 3, or more. These goals can be achieved by decreasing the number of network hops between the server and the databases, and by eliminating paging of memory on the server running the PowerCenter sessions.
● Data Analyzer. Source queries and the reports themselves may need tuning if generating the report on screen takes too long.
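The network example in the list above (a four-hour file transfer against a four-minute session) can be worked through numerically to show where tuning effort pays off. This is a small sketch using only those two figures; the backup step is left out because no duration is given for it, and the stage names are illustrative.

```python
def stage_shares(stage_seconds):
    """Return each stage's fraction of the total end-to-end time."""
    total = sum(stage_seconds.values())
    return {name: secs / total for name, secs in stage_seconds.items()}


def end_to_end_after_speedup(stage_seconds, stage, speedup):
    """Total elapsed seconds if one stage runs `speedup` times faster."""
    improved = dict(stage_seconds)
    improved[stage] = stage_seconds[stage] / speedup
    return sum(improved.values())


stages = {"file_transfer": 4 * 3600, "powercenter_session": 4 * 60}

print(stage_shares(stages))
# Doubling the speed of the transfer versus doubling the speed of the session:
print(end_to_end_after_speedup(stages, "file_transfer", 2))
print(end_to_end_after_speedup(stages, "powercenter_session", 2))
```

With these figures the transfer accounts for about 98% of the elapsed time, so halving it saves two hours, whereas halving the session time saves two minutes.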

The actual tuning process can begin after the areas for improvement have been identified and documented.

For data migration projects, other considerations must be included in the performance tuning activities. Many ERP applications have two-step processes where the data is loaded through simulated on-line processes. More specifically, an API is executed that replicates, in a batch scenario, the way that on-line entry works, executing all edits. In such a case, performance will not be the same as in a scenario where a relational database is being populated. The best approach to performance tuning is to set the expectation that all data errors should be identified and corrected in the ETL layer prior to the load to the target application. This approach can improve performance by as much as 80%.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:49

Phase 7: Deploy
7 Deploy
● 7.2 Deploy Solution
    ● 7.2.2 Migrate Development to Production

Phase 7: Deploy
Description
Upon completion of the Build Phase (when both development and testing are finished), the data integration solution is ready to be installed in a production environment and submitted to the ultimate test as a viable solution that meets the users' requirements. The deployment strategy developed during the Architect Phase is now put into action. During the Build Phase, components are created that may require special initialization steps and procedures. For the production deployment, checklists and procedures are developed to ensure that crucial steps are not missed in the production cutover.

To the end user, this is where the fruits of the project are exposed and end user acceptance begins. Up to this point, developers have been developing data cleansing, data transformations, load processes, reports, and dashboards in one or more development environments. But whether a project team is developing the back-end processes for a legacy migration project or the front-end presentation layer for a metadata management system, deploying a data integration solution is the final step in the development process.

Metadata, which is the cornerstone of any data integration solution, should play an integral role in the documentation and training rollout to users. Not only is metadata critical to the current data integration effort, but it will be integral to planned metadata management projects down the road.

After the solution is actually deployed, it must be maintained to ensure stability and scalability. All data integration solutions must be designed to support change as user requirements and the needs of the business change. As data volumes grow and user interest increases, organizations face many hurdles such as software upgrades, additional functionality requests, and regular maintenance. Use the Deploy Phase as a guide to deploying an on-time, scalable, and maintainable data integration solution that provides business value to the user community.

Prerequisites
None

Roles
Business Analyst (Primary)
Business Project Manager (Primary)
Data Architect (Secondary)
Data Quality Developer (Primary)
Data Warehouse Administrator (Primary)
Database Administrator (DBA) (Primary)
End User (Secondary)
Metadata Manager (Primary)
Presentation Layer Developer (Primary)
Project Sponsor (Approve)
Quality Assurance Manager (Approve)
Repository Administrator (Primary)
System Administrator (Primary)
Technical Architect (Secondary)
Technical Project Manager (Review Only)

Considerations
None

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:49

Phase 7: Deploy
Task 7.2 Deploy Solution
Description
The challenges involved in successfully deploying a data integration solution involve managing the migration from development through production, training end-users, and providing clear and consistent documentation. These are all critical factors in determining the success (or failure) of an implementation effort. Before the deployment tasks are undertaken however, it is necessary to determine the organization's level of preparedness for the deployment and thoroughly plan end-user training materials and documentation. If all prerequisites are not satisfactorily completed, it may be advisable to delay the migration, training, and delivery of finalized documentation rather than hurrying through these tasks solely to meet a predetermined target delivery date. For data migration projects it is important to understand that some packaged applications such as SAP have their own deployment strategies. The deployment strategies for Informatica processes should take this into account and when applicable match up with those deployment strategies.

Prerequisites
None

Roles
Business Analyst (Primary)
Business Project Manager (Primary)
Data Architect (Secondary)
Data Integration Developer (Primary)
Data Warehouse Administrator (Primary)
Database Administrator (DBA) (Primary)
Presentation Layer Developer (Primary)
Production Supervisor (Approve)
Quality Assurance Manager (Approve)
Repository Administrator (Primary)
System Administrator (Primary)
Technical Architect (Secondary)
Technical Project Manager (Approve)

Considerations
None

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:50

Phase 7: Deploy
Subtask 7.2.2 Migrate Development to Production
Description
To successfully migrate PowerCenter or Data Analyzer from one environment to another (from development to production, for example), some tasks must be completed. These tasks fall into three phases:
● Pre-deployment phase
● Deployment phase
● Post-deployment phase

Each phase is detailed in the ‘Considerations’ section. While there are multiple tasks to perform in the deployment process, the actual migration phase consists of moving objects from one environment to another. A migration can include the following objects:
● PowerCenter - mappings, sessions, workflows, scripts, parameter files, stored procedures, etc.
● Data Analyzer - schemas, reports, dashboards, schedules, global variables.
● PowerExchange/CDC - datamaps and registrations.
● Data Quality - plans and dictionaries.

Prerequisites
None

Roles
Data Warehouse Administrator (Primary)
Database Administrator (DBA) (Primary)
Production Supervisor (Approve)
Quality Assurance Manager (Approve)
Repository Administrator (Primary)
System Administrator (Primary)
Technical Project Manager (Approve)

Considerations
The tasks below should be completed before, during, and after the migration to ensure a successful deployment. Failure to complete one or more of these tasks can result in an incomplete or incorrect deployment.

Pre-deployment tasks:
● Ensure all objects have been successfully migrated and tested in the Quality Assurance environment.
● Ensure the Production environment is compliant with specifications and is ready to receive the deployment.
● Obtain sign-off from the deployment team and project teams to deploy to the Production environment.
● Obtain sign-off from the business units to migrate to the Production environment.

Deployment tasks:
● Verify the consistency of the connection object names across environments to ensure that the connections are being made to the production sources/targets. If they are not consistent, manually change the connections for each incorrect session so that it sources from and targets the production environment.
● Determine the method of migration (i.e., folder copy or deployment group) to use. If you are going to use the folder copy method, make sure the shared folders are copied before the non-shared folders. If you are going to use the deployment group method, make sure all the objects to be migrated are checked in, and refresh the deployment group once this is done.
● Data Analyzer objects that reference new tables require that schemas be migrated before the reports. Make sure the new tables are associated with the proper data source and that the data connectors are plugged into the new schemas.
● Synchronize the deployment window with the maintenance window to minimize the impact on end users. If the deployment window is longer than the regular maintenance window, it may be necessary to coordinate with the business unit to minimize the impact on the end users.
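The first deployment task above, checking connection object names across environments, lends itself to a simple set comparison once the names have been listed for each environment. The sketch below assumes the two lists have been obtained by some other means (for example, copied from each repository by hand); the connection names shown are hypothetical.

```python
def compare_connection_names(dev_names, prod_names):
    """Report connection names that exist in one environment but not the other."""
    dev, prod = set(dev_names), set(prod_names)
    return {
        "missing_in_production": sorted(dev - prod),
        "only_in_production": sorted(prod - dev),
    }


# Hypothetical connection object names for each environment.
dev_connections = ["ORA_SALES_SRC", "ORA_DW_TGT", "FTP_VENDOR"]
prod_connections = ["ORA_SALES_SRC", "ORA_DW_TARGET"]

report = compare_connection_names(dev_connections, prod_connections)
print(report)
```

Any name reported on either side points to a session whose connection would need to be reviewed and corrected manually before the migration proceeds.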

Post-deployment tasks:
● Communicate with the management team members on all aspects of the migration (i.e., problems encountered, solutions, tips and tricks, etc.).
● Finalize and deliver the documentation.
● Obtain final user and project sponsor acceptance.

Finally, when deployment is complete, develop a project close document to evaluate the overall effectiveness of the project (i.e., successes, recommended improvements, lessons learned, etc.).

Best Practices
Deployment Groups
Migration Procedures - PowerCenter
Using PowerCenter Labels
Migration Procedures - PowerExchange
Deploying Data Analyzer Objects

Sample Deliverables
Project Close Report

Last updated: 01-Feb-07 18:50

Phase 8: Operate
8 Operate
● 8.2 Operate Solution
    ● 8.2.6 Monitor Data Quality
● 8.3 Maintain and Upgrade Environment
    ● 8.3.1 Maintain Repository
    ● 8.3.2 Upgrade Software

Phase 8: Operate
Description
The Operate Phase is the final step in the development of a data integration solution. This phase is sometimes referred to as production support. During its day-to-day operations the system continually faces new challenges such as increased data volumes, hardware and software upgrades, and network or other physical constraints. The goal of this phase is to keep the system operating smoothly by anticipating these challenges before they occur and planning for their resolution. Planning is probably the most important task in the Operate Phase. Often, the project team plans the system's development and deployment, but does not allow adequate time to plan and execute the turnover to day-to-day operations. Many companies have dedicated production support staff with both the necessary tools for system monitoring and a standard escalation process. This team requires only the appropriate system documentation and lead time to be ready to provide support. Thus, it is imperative for the project team to acknowledge this support capability by providing ample time to create, test, and turn over the deliverables discussed throughout this phase.

Prerequisites
None

Roles
Business Project Manager (Primary)
Data Integration Developer (Secondary)
Data Steward/Data Quality Steward (Primary)
Data Warehouse Administrator (Secondary)
Database Administrator (DBA) (Primary)
Presentation Layer Developer (Secondary)
Repository Administrator (Primary)
System Administrator (Primary)
System Operator (Primary)
Technical Project Manager (Review Only)

Considerations
None

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:50

Phase 8: Operate
Task 8.2 Operate Solution
Description
After the data integration solution has been built and deployed, the job of running it begins. For a data migration or consolidation solution, the system must be monitored to ensure that data is being loaded into the database. A data visualization or metadata reporting solution should be monitored to ensure that the system is accessible to the end users. The goal of this task is to ensure that the necessary processes are in place to facilitate the monitoring of and the reporting on the system's daily processes.

Prerequisites
None

Roles
Business Project Manager (Primary)
Data Steward/Data Quality Steward (Primary)
Data Warehouse Administrator (Secondary)
Database Administrator (DBA) (Primary)
Presentation Layer Developer (Secondary)
Project Sponsor (Primary)
Repository Administrator (Review Only)
System Administrator (Primary)
System Operator (Primary)
Technical Project Manager (Review Only)

Considerations
None

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:50


Phase 8: Operate
Subtask 8.2.6 Monitor Data Quality
Description
This subtask is concerned with data quality processes that may have been scoped into the project for late-project or post-project use. Such processes are an optional deliverable for most projects. However, there is a strong argument for building into the project plan data quality initiatives that will outlast the project: ongoing monitoring should be considered a key deliverable because it provides a means to monitor the existing data and ensure that previously identified data quality issues do not recur. For new data entering the system, monitoring provides a means to ensure that any new feeds do not compromise the integrity of the existing data. Moreover, the processes created for the Data Quality Audit task in the Analyze Phase may still be suitable for application to the data in the Operate Phase, or may be suitable with a reasonable amount of tuning. There are three types of data quality process relevant in this context:
● Processes that can be scheduled to monitor data quality on an ongoing basis
● Processes that can address or repair any data quality issues discovered
● Processes that can run at the point of data entry to prevent bad data from entering the system

This subtask is concerned with agreeing to a strategy to use any or all such processes to validate the continuing quality of the business’ data and to safeguard against lapses in data quality in the future.

Prerequisites
None

Roles
Data Steward/Data Quality Steward (Primary)
Production Supervisor (Secondary)

Considerations
Ongoing data quality initiatives bring the data quality process full-circle. This subtask is the logical conclusion to a process that began with the performance of a Data Quality Audit in the Analyze Phase and the creation of data quality processes (called plans in Informatica Data Quality terminology) in the Design Phase.

The plans created during and after the Operate Phase are likely to be runtime or real-time plans. A runtime plan is one that can be scheduled for automated, regular execution (e.g., nightly or weekly). A real-time plan is one that can accept a live data feed, for example, from a third-party application, and write output data back to a live application.

Real-time plans are useful in data entry scenarios; they can be used to capture data problems at the point of keyboard entry, before the data is saved to the data system. The real-time plan can be used to check data entries, pass them if accurate, cleanse them of error, or reject them as unusable.

Runtime plans can be used to monitor the data stored in the system; these plans can be run during periods of relative inactivity (e.g., weekends). For example, the Data Quality Developer may design a plan to identify duplicate records in the system, and the Developer or the system administrator can schedule the plan to run overnight. Any duplication issues found in the system can be addressed manually or by other data quality plans.

The Data Quality Developer must discuss the importance of ongoing data quality management with the business early in the project, so that the business can decide what data quality management steps to take within the project or outside of it. The Data Quality Developer must also consider the impact that ongoing data quality initiatives are likely to have on the business systems. Should the data quality plans be deployed to several locations or centralized? Will the reference data be updated at regular intervals, and by whom? Can plan resource files be moved easily across the enterprise? Once the project resources are unwound, these matters require a committed strategy from the business. However, the results (clean, complete, compliant data) are well worth it.
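The pass, cleanse, or reject behavior described for a check at the point of data entry can be illustrated with a minimal Python sketch. This is not Informatica Data Quality code: the function name, the result type, and the rule that a valid phone number has exactly ten digits are all assumptions made for illustration.

```python
import re
from typing import NamedTuple


class CheckResult(NamedTuple):
    status: str  # "pass", "cleansed", or "reject"
    value: str


def check_phone_entry(raw: str) -> CheckResult:
    """Hypothetical entry check: pass, cleanse, or reject a phone number.

    Assumes (for illustration only) that a valid number is exactly 10 digits.
    """
    if re.fullmatch(r"\d{10}", raw):
        return CheckResult("pass", raw)          # already accurate
    digits = re.sub(r"\D", "", raw)              # strip dashes, spaces, etc.
    if len(digits) == 10:
        return CheckResult("cleansed", digits)   # recoverable error
    return CheckResult("reject", raw)            # unusable entry
```

A calling application could save the value when the status is "pass" or "cleansed" and return the entry to the user when it is "reject".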

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:50


Phase 8: Operate
Task 8.3 Maintain and Upgrade Environment
Description
The goal of this task is to develop and implement an upgrade procedure for the hardware, software, and/or network components that support the overall analytic solution. This plan should enable both the development and operations staff to plan for and execute system upgrades in an efficient, timely manner, with as little impact on the system's end users as possible.

The deployed system incorporates multiple components, many of which are likely to undergo upgrades during the system's lifetime. Ideally, upgrading system components should be treated as a system change and, as such, use many of the techniques discussed in 8.2.4 Track Change Control Requests. After these changes are prioritized and authorized by the Project Manager, an upgrade plan should be developed and executed. This plan should include the tasks necessary to perform the upgrades as well as the tasks necessary to update system documentation and the Operations Manual, when appropriate.

Prerequisites
None

Roles
Database Administrator (DBA) (Primary) Repository Administrator (Primary) System Administrator (Secondary)

Considerations
Once the Build Phase has been completed, the development and operations staff should begin determining how upgrades should be carried out. The team should consider all aspects of the system's architecture, including any software and hardware being used. Special attention should be paid to software release schedules, hardware limitations, network limitations, and vendor release support schedules. This information will give the team an idea of how often and when various upgrades are likely to be required. When combined with knowledge of the data load windows, this will allow the operations team to schedule upgrades without adversely affecting the end users.

Upgrading the Informatica software has some special implications. Many times, the software upgrade requires a repository upgrade as well. Thus, the operations team should factor in the time required to back up the repository, along with the time to perform the upgrade itself. In addition, the development staff should be involved in order to ensure that all current sessions are running as designed after the upgrade occurs.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:50


Phase 8: Operate
Subtask 8.3.1 Maintain Repository
Description
A key operational aspect of maintaining PowerCenter repositories involves creating and implementing backup policies. These backups become invaluable if some catastrophic event occurs that requires the repository to be restored. Another key operational aspect is monitoring the size and growth of these repository databases, since daily use of these applications adds metadata to the repositories.

The Administration Console manages Repository Services and repository content, including backup and restoration. The following repository-related functions can be performed through the Administration Console:
● Enable or disable a Repository Service or service process.
● Alter the operating mode of a Repository Service.
● Create and delete repository content.
● Backup, copy, restore, or delete a repository.
● Promote a local repository to a global repository.
● Register and unregister a local repository.
● Manage user connections and locks.
● Send repository notification messages.
● Manage repository plug-ins.
● Upgrade a repository and upgrade a Repository Service to a Repository Service.

Additional information about upgrades is available in the "Upgrading PowerCenter" chapter of the PowerCenter Installation and Configuration Guide.

Prerequisites
None

Roles
Database Administrator (DBA) (Secondary)
Repository Administrator (Primary)
System Administrator (Secondary)

Considerations
Enabling and Disabling the Repository Service
A service process starts on a designated node when a Repository Service is enabled. PowerCenter's High Availability (HA) feature enables a service to fail over to another node if the original node becomes unavailable. Administrative duties can be performed through the Administration Console only when the Repository Service is enabled.

Exclusive Mode
The Repository Service executes in normal or exclusive mode. Running the Repository Service in exclusive mode allows only one user to access the repository through the Administration Console or the pmrep command line program. It is advisable to set the Repository Service mode to exclusive when performing administrative tasks that require configuration updates, such as deleting repository content, enabling version control, promoting a repository, registering plug-ins, or upgrading a repository. Running in exclusive mode requires full privileges and permissions on a Repository Service. Before switching to exclusive mode, notify users of the intent to switch and verify that they have disconnected. The Repository Service must be stopped and restarted to complete the mode switch.

Repository Backup
Although PowerCenter database tables may be included in Database Administration backup procedures, separate PowerCenter repository backup procedures and schedules should be established to prevent data loss due to hardware, software, or user mishaps. The Repository Service provides backup processing for repositories through the Administration Console or the pmrep command line program. The Repository Service backup function saves repository objects, connection information, and code page information in a file stored on the server in the backup location.

PowerCenter backup scheduling should account for repository change frequency. Because development repositories typically change more frequently than production repositories, it may be desirable to back up the development repository nightly during heavy development efforts. Production repositories, on the other hand, may only need backup processing after development promotions are registered. Preserve the backup date as part of the backup file name and, as new backups are added, delete the older ones.

TIP A simple approach to automating PowerCenter repository backups is to use the pmrep command line program. Commands can be packaged and scheduled so that backups occur on a desired schedule without manual intervention. The backup file name should minimally include repository name and backup date (yyyymmdd).
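The naming rule in the tip above (repository name plus backup date in yyyymmdd form) can be sketched in Python as follows. The helper names, the `.rep` extension, and the directory layout are assumptions for illustration. The `pmrep connect` and `pmrep backup` options shown reflect common usage as best understood here and should be verified against the Command Reference for the installed release before scheduling anything.

```python
from datetime import date
from typing import List, Optional


def backup_file_name(repository: str, on: Optional[date] = None) -> str:
    """Build '<repository>_<yyyymmdd>.rep' (the extension is an assumption)."""
    on = on or date.today()
    return f"{repository}_{on.strftime('%Y%m%d')}.rep"


def pmrep_backup_commands(repository: str, domain: str, user: str,
                          password_env_var: str, backup_dir: str,
                          on: Optional[date] = None) -> List[List[str]]:
    """Return the command lines a scheduler could run, in order.

    The flags are assumptions to confirm for your release: -r repository,
    -d domain, -n user, -X password environment variable, -o output file.
    """
    target = f"{backup_dir}/{backup_file_name(repository, on)}"
    return [
        ["pmrep", "connect", "-r", repository, "-d", domain,
         "-n", user, "-X", password_env_var],
        ["pmrep", "backup", "-o", target],
    ]
```

Each returned list can be handed to `subprocess.run` by a scheduled job; building the commands separately from running them keeps the naming logic easy to test.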

A repository backup file is invaluable for reference when, as occasionally happens, questions arise as to the integrity of the repository or users encounter problems using it. A backup file enables technical support staff to validate repository integrity to, for example, eliminate the repository as a source of user problems. In addition, if the development or production repository is corrupted, the backup repository can be used to recover quickly.

TIP Keep in mind that you cannot restore a single folder or mapping from a repository backup. If, for example, a single important mapping is deleted by accident, you need to obtain a temporary database space from the DBA in order to restore the backup to a temporary repository DB. With the PowerCenter client tools, copy the lost metadata, and then remove the temporary repository from the database and the cache. If the developers need this service often, it may be prudent to keep the temporary database around all the time and copy over the development repository to the backup repository on a daily basis in addition to backing up to a file. Only the DBA should have access to the backup repository and requests should be made through him/her.


Repository Performance
Repositories may grow in size due to the execution of workflows, especially in large projects. As the repository grows, response may become slower. Consider these techniques to maintain a repository for better performance:
● Delete Old Session/Workflow Log Information. Write a simple SQL script to delete old log information. Assuming that repository backups are taken on a consistent basis, you can always get old log information from the repository backup, if necessary.
● Perform Defragmentation. Much like any other database, repository databases should undergo periodic "housecleaning" through statistics and defragmentation. Work with the DBAs to schedule this as a regular job.
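As a rough illustration of deleting old log information by age, the following Python sketch purges rows older than a retention window. The `session_log` table and its columns are hypothetical and do not represent the actual repository schema; an in-memory SQLite database stands in for the repository database. Confirm the supported way to remove log data for your release before touching real repository tables.

```python
import sqlite3
from datetime import datetime, timedelta


def purge_old_logs(conn, retention_days, now=None):
    """Delete rows from the hypothetical session_log table older than the cutoff."""
    now = now or datetime.now()
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    cur = conn.execute("DELETE FROM session_log WHERE end_time < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # number of log rows removed


# Demonstration against an in-memory database with made-up log rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session_log (session_name TEXT, end_time TEXT)")
conn.executemany("INSERT INTO session_log VALUES (?, ?)", [
    ("s_load_customers", "2007-01-15 02:00:00"),
    ("s_load_customers", "2007-11-30 02:00:00"),
])
deleted = purge_old_logs(conn, retention_days=90, now=datetime(2007, 12, 4))
remaining = conn.execute("SELECT COUNT(*) FROM session_log").fetchone()[0]
```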

Audit Trail
The SecurityAuditTrail configuration option in the Repository Service properties in the Administration Console allows tracking changes to repository users, groups, privileges, and permissions. Enabling the audit trail causes the Repository Service to record security changes to the Repository Service log. Security audit changes logged include owner, owner's group or folder permissions, password changes of another user, user maintenance, group maintenance, global object permissions, and privileges.

Best Practices
Disaster Recovery Planning with PowerCenter HA Option

Sample Deliverables
None
Last updated: 04-Dec-07 18:21


Phase 8: Operate
Subtask 8.3.2 Upgrade Software
Description
Upgrading the application software of a data integration solution to a new release is a continuous operations task, as new releases are offered periodically by every software vendor. New software releases offer expanded functionality, new capabilities, and fixes to existing functionality that can benefit the data integration environment and future integration work. However, an upgrade can be a disruptive event since project work may halt while the upgrade process is in progress. Data integration environments often contain a host of different applications, including Informatica software, database systems, operating systems, EAI tools, BI tools, and other related technologies, so an upgrade in any one of these technologies may require an upgrade in any number of other software programs for the full system to function properly. System architects and administrators must continually evaluate the new software offerings across the various products in their data integration environment and balance the desire to upgrade with the impact of an upgrade.

Software upgrades require a continuous assessment and planning process. A regular schedule should be defined where new releases are evaluated on functionality and need in the environment. Once approved, upgrades must be coordinated with ongoing development work and ongoing production data integration. Appropriate planning and coordination of software upgrades allow a data integration environment to stay current on its technology stack with minimal disruptions to production data integration efforts and development projects.

Prerequisites
None

Roles
Database Administrator (DBA) (Secondary) Repository Administrator (Primary) System Administrator (Secondary)

Considerations
When faced with a new software release, the first consideration is to decide whether the upgrade is appropriate for the data integration environment. The pros and cons of every upgrade decision typically include the following:

Pro
● New functionality and features
● Bug fixes and refinements of existing functionality
● Often provides enhanced performance
● Support for older releases of software is dropped, forcing an upgrade to maintain support
● May be required to support newer releases of other software in the environment

Con
● Disruptive to development environment
● Disruptive to production environment
● May require new training and adversely affect productivity
● May require other pieces of software to be upgraded to function properly


The upgrade decision can be to:
● Upgrade to the latest software release immediately.
● Upgrade at some time in the future.
● Do not upgrade to this software version at all.

Architects sometimes decide to forgo a particular software version and skip ahead to a future release if the current release does not provide enough benefit to warrant the disruption to the environment. It is not uncommon for data integration teams to skip minor releases (and sometimes even major releases) if they aren’t appropriate for their environment or when the upgrade effort outweighs the benefits.

Whether you are in a production environment or still in development mode, an upgrade requires careful planning to ensure a successful transition and minimal disruption. The following issues need to be factored into the overall upgrade plan:
● Training - New releases of software often include new features and functionality that are likely to require some level of training for administrators and developers. Proper planning of the necessary training can ensure that employees are trained ahead of the upgrade so that productivity does not suffer once the new software is in place. Because it is impossible to properly estimate and plan the upgrade effort without knowledge of the new features and potential environment changes, best practice dictates training a core set of architects and system administrators early in the upgrade process so they can assist in the upgrade planning process.
● Environment Assessment - A future release of software may involve anything from minimal architectural changes to major changes in the overall data integration architecture. Investigation and strategy around potential architecture changes should occur early. In PowerCenter, for example, as the architecture has moved to a service-oriented architecture with high availability and failover, the underlying physical setup and location of software components has changed from release to release. Planning for these architecture changes allows users to take full advantage of the new features when the software upgrade is deployed. Often these changes provide an opportunity to redesign and improve the existing architecture in coordination with the software upgrade.
● Testing - Often more than 60 percent of the total upgrade time is devoted to testing the data integration environment with the new software release. Ensuring that data continues to flow correctly, software versions are compatible, and new features do not cause unexpected results requires detailed testing. Developing a well thought-out test plan is crucial to a successful upgrade.
● New Features - A new software release likely includes new and expanded features that may create a need to alter the current data integration processes. During the upgrade process, existing processes may be altered to incorporate and implement the new features. Time is required to make and test these changes as well. Reviewing the new features and assessing the impact on the upgrade process is a key preplanning step.
● Sandbox Upgrade - In environments with production systems, it is advisable to copy the production environment to a ‘sandbox’ instance. The ‘sandbox’ environment should be as close to an exact copy of production as possible, including production data. A software upgrade is then performed on the ‘sandbox’ instance, and data integration processes are run on both the current production and the sandbox instance for a period of time. In this way, results can be compared over time to ensure that no unforeseen differences occur in the new software version. If differences do occur, they can be investigated, resolved, and accounted for in the final upgrade plan.
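The comparison of production and sandbox results described under Sandbox Upgrade can be sketched as a simple check of per-table row counts and totals. This is a minimal Python illustration, not a prescribed method: the metric structure, the function name, and the tolerance value are assumptions, and a real comparison would likely cover more measures.

```python
from typing import Dict, List, Tuple

# target table name -> (row count, amount total) captured after a load run
Metrics = Dict[str, Tuple[int, float]]


def compare_runs(production: Metrics, sandbox: Metrics,
                 tolerance: float = 0.005) -> List[str]:
    """List differences between the production run and the sandbox run."""
    differences = []
    for table in sorted(set(production) | set(sandbox)):
        if table not in production or table not in sandbox:
            differences.append(f"{table}: present in only one environment")
            continue
        p_rows, p_sum = production[table]
        s_rows, s_sum = sandbox[table]
        if p_rows != s_rows:
            differences.append(f"{table}: row count {p_rows} vs {s_rows}")
        if abs(p_sum - s_sum) > tolerance:
            differences.append(f"{table}: total {p_sum:.2f} vs {s_sum:.2f}")
    return differences
```

An empty list after each parallel run suggests the upgraded sandbox is producing the same results as production; any entries are candidates for investigation before the final upgrade plan is fixed.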

Once a comprehensive plan for the upgrade is in place, the time comes to perform the actual upgrade on the development, test, and production environments. The Installation Guides for each of the Informatica products and online help provide instructions on upgrading and the step-by-step process for applying the new version of the software. However, there are a few important steps to emphasize in the upgrade process:
● Make a copy of the current database instance housing the repository prior to any upgrade.
● In addition to the copy, ALWAYS make multiple backups of the current version of the repository before attempting the upgrade. Upgrades have been known to fail in production environments, making the partially upgraded repositories unusable. The only recourse at that point is to restore from the backup. The backups created using the Repository Manager are reliable and can be used to successfully restore the original repository. Restoring from backups may be slower than restoring from the copy, but provides a failsafe insurance policy.
● Always remove all repository locks through Repository Manager before attempting an upgrade.
● Carefully monitor the upgraded systems for a period of time after the upgrade to ensure the success of the upgrade.

A well-planned upgrade process is key to ensuring success during the transition from the current version to a new version, with minimal disruption to the development and production environments. A smooth upgrade process enables data integration teams to take advantage of the latest technologies and advances in data integration.

Best Practices
None

Sample Deliverables
None
Last updated: 01-Feb-07 18:51


Best Practices

● Data Quality and Profiling
    ○ Build Data Audit/Balancing Processes
    ○ Data Cleansing
    ○ Data Profiling
    ○ Data Quality Mapping Rules
    ○ Data Quality Project Estimation and Scheduling Factors
    ○ Developing the Data Quality Business Case
    ○ Effective Data Matching Techniques
    ○ Effective Data Standardizing Techniques
    ○ Integrating Data Quality Plans with PowerCenter
    ○ Managing Internal and External Reference Data
    ○ Real-Time Matching Using PowerCenter
    ○ Testing Data Quality Plans
    ○ Tuning Data Quality Plans
    ○ Using Data Explorer for Data Discovery and Analysis
    ○ Working with Pre-Built Plans in Data Cleanse and Match
● Development Techniques
    ○ Naming Conventions - Data Quality

Build Data Audit/Balancing Processes
Challenge
Data Migration and Data Integration projects are often challenged to verify that the data in an application is complete; more specifically, to confirm that all the appropriate data was extracted from a source system and propagated to its final target. This best practice illustrates how to do this in an efficient and repeatable fashion for increased productivity and reliability. This is particularly important in businesses that are highly regulated, internally or externally, or that have to comply with a host of government regulations such as Sarbanes-Oxley, Basel II, HIPAA, the Patriot Act, and many others.

Description
The common practice for audit and balancing solutions is to produce a set of common tables that can hold various control metrics regarding the data integration process. Ultimately, business intelligence reports provide insight at a glance to verify that the correct data has been pulled from the source and completely loaded to the target. Each control measure that is being tracked will require development of a corresponding PowerCenter process to load the metrics to the Audit/Balancing Detail table. To drive out this type of solution, execute the following tasks:
1. Work with business users to identify what audit/balancing processes are needed. Some examples of this may be:
   a. Customers - (Number of Customers or Number of Customers by Country)
   b. Orders - (Qty of Units Sold or Net Sales Amount)
   c. Deliveries - (Number of Shipments or Qty of Units Shipped or Value of All Shipments)
   d. Accounts Receivable - (Number of Accounts Receivable Shipments or Total Accounts Receivable Outstanding)
2. Define, for each process identified in #1, which columns should be used for tracking purposes for both the source and target system.
3. Develop a data integration process that will read from the source system and populate the detail audit/balancing table with the control totals.
4. Develop a data integration process that will read from the target system and populate the detail audit/balancing table with the control totals.
5. Develop a reporting mechanism that will query the audit/balancing table and identify whether the source and target entries match or whether there is a discrepancy.

An example audit/balance table definition looks like this:
Audit/Balancing Details

Column Name        Data Type      Size
AUDIT_KEY          NUMBER         10
CONTROL_AREA       VARCHAR2       50
CONTROL_SUB_AREA   VARCHAR2       50
CONTROL_COUNT_1    NUMBER         10
CONTROL_COUNT_2    NUMBER         10
CONTROL_COUNT_3    NUMBER         10
CONTROL_COUNT_4    NUMBER         10
CONTROL_COUNT_5    NUMBER         10
CONTROL_SUM_1      NUMBER (p,s)   10,2
CONTROL_SUM_2      NUMBER (p,s)   10,2
CONTROL_SUM_3      NUMBER (p,s)   10,2
CONTROL_SUM_4      NUMBER (p,s)   10,2
CONTROL_SUM_5      NUMBER (p,s)   10,2
UPDATE_TIMESTAMP   TIMESTAMP
UPDATE_PROCESS     VARCHAR2       50

Control Column Definition by Control Area/Control Sub Area

Column Name        Data Type      Size
CONTROL_AREA       VARCHAR2       50
CONTROL_SUB_AREA   VARCHAR2       50
CONTROL_COUNT_1    VARCHAR2       50
CONTROL_COUNT_2    VARCHAR2       50
CONTROL_COUNT_3    VARCHAR2       50
CONTROL_COUNT_4    VARCHAR2       50
CONTROL_COUNT_5    VARCHAR2       50
CONTROL_SUM_1      VARCHAR2       50
CONTROL_SUM_2      VARCHAR2       50
CONTROL_SUM_3      VARCHAR2       50
CONTROL_SUM_4      VARCHAR2       50
CONTROL_SUM_5      VARCHAR2       50
UPDATE_TIMESTAMP   TIMESTAMP
UPDATE_PROCESS     VARCHAR2       50
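Tasks 3 through 5 above can be sketched end to end in Python with an in-memory SQLite database standing in for the audit schema. This is a simplified illustration under several assumptions: only one count column and one sum column are created, SQLite types replace the NUMBER/VARCHAR2 types, the sub-area column is used to distinguish source from target entries, and the process names and figures are illustrative. In the approach described here, PowerCenter mappings would load the table; plain inserts stand in for them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced version of the Audit/Balancing Details table (one count, one sum).
conn.execute("""
    CREATE TABLE audit_balancing_detail (
        audit_key        INTEGER PRIMARY KEY,
        control_area     TEXT,
        control_sub_area TEXT,     -- assumption: holds 'SOURCE' or 'TARGET'
        control_count_1  INTEGER,
        control_sum_1    NUMERIC,
        update_timestamp TEXT DEFAULT CURRENT_TIMESTAMP,
        update_process   TEXT
    )
""")


def record_totals(area, sub_area, count, total, process):
    """Store one set of control totals (stands in for a load process)."""
    conn.execute(
        "INSERT INTO audit_balancing_detail "
        "(control_area, control_sub_area, control_count_1, control_sum_1, update_process) "
        "VALUES (?, ?, ?, ?, ?)",
        (area, sub_area, count, total, process),
    )


# Illustrative control totals; the process names are hypothetical.
record_totals("Orders", "SOURCE", 9827, 11230.21, "m_audit_orders_src")
record_totals("Orders", "TARGET", 9827, 11230.21, "m_audit_orders_tgt")
record_totals("Deliveries", "SOURCE", 1298, 21294.22, "m_audit_deliv_src")
record_totals("Deliveries", "TARGET", 1288, 21011.21, "m_audit_deliv_tgt")


def balancing_report():
    """Return (area, count difference, sum difference) per control area."""
    return conn.execute("""
        SELECT s.control_area,
               s.control_count_1 - t.control_count_1 AS count_diff,
               ROUND(s.control_sum_1 - t.control_sum_1, 2) AS sum_diff
        FROM audit_balancing_detail s
        JOIN audit_balancing_detail t ON t.control_area = s.control_area
        WHERE s.control_sub_area = 'SOURCE' AND t.control_sub_area = 'TARGET'
        ORDER BY s.control_area
    """).fetchall()


report = balancing_report()
```

Rows with a non-zero difference identify control areas where the source and target are out of balance.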

The following is a screenshot of a single mapping that will populate both the source and target values in a single mapping:


The following two screenshots show how two mappings could be used to provide the same results:


Note: One key challenge is how to capture the appropriate control values from the source system if it is continually being updated. The first example with one mapping will not work due to the changes that occur in the time between the extraction of the data from the source and the completion of the load to the target application. In those cases you may want to take advantage of an aggregator transformation to collect the appropriate control totals as illustrated in this screenshot:
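The idea behind collecting control totals during the extraction itself, so that the totals describe exactly the rows that were read, can be illustrated outside PowerCenter with a short Python sketch. The function name, the row structure, and the `net_sales` field are hypothetical; the Aggregator transformation plays this role inside a mapping.

```python
from typing import Dict, Iterable, Iterator


def extract_with_control_totals(rows: Iterable[Dict], totals: Dict[str, float],
                                amount_field: str = "net_sales") -> Iterator[Dict]:
    """Pass rows through unchanged while accumulating a count and a sum."""
    totals["count"] = 0
    totals["sum"] = 0.0
    for row in rows:
        totals["count"] += 1
        totals["sum"] += row[amount_field]
        yield row  # the same row continues on to the target load


# Made-up source rows; in practice these would stream from the source system.
source_rows = [
    {"order_id": 1, "net_sales": 100.0},
    {"order_id": 2, "net_sales": 250.5},
]
totals: Dict[str, float] = {}
loaded = list(extract_with_control_totals(source_rows, totals))
```

Because the totals are taken from the rows as they flow through, later changes in the source system do not affect the control values recorded for this load.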

The following are two straw-man examples of an Audit/Balancing Report, which is the end result of this type of process:
Data Area Leg count TT count Diff Leg amt TT amt Customer 11000 Orders 9827 10099 9827 1288 1 0 0 0 11230.21 11230.21 0 21294.22 21011.21 283.01

Deliveries 1298

In summary, there are two big challenges in building audit/balancing processes:
1. Identifying what the control totals should be
2. Building processes that will collect the correct information at the correct granularity

There is also a set of basic tasks that can be leveraged and shared across any audit/balancing needs. By building a common model for meeting audit/balancing needs, projects can lower the time needed to develop these solutions and still reduce risk by having this type of solution in place.


Last updated: 04-Jun-08 18:17


Data Cleansing
Challenge
Poor data quality is one of the biggest obstacles to the success of many data integration projects. A 2005 study by the Gartner Group stated that the majority of currently planned data warehouse projects will suffer limited acceptance or fail outright. Gartner declared that the main cause of project problems was a lack of attention to data quality. Moreover, once in the system, poor data quality can cost organizations vast sums in lost revenues. Defective data leads to breakdowns in the supply chain, poor business decisions, and inferior customer relationship management. It is essential that data quality issues are tackled during any large-scale data project to enable project success and future organizational success. Therefore, the challenge is twofold: to cleanse project data, so that the project succeeds, and to ensure that all data entering the organizational data stores provides for consistent and reliable decision-making.

Description
A significant portion of time in the project development process should be dedicated to data quality, including the implementation of data cleansing processes. In a production environment, data quality reports should be generated after each data warehouse implementation or when new source systems are integrated into the environment. There should also be provision for rolling back if data quality testing indicates that the data is unacceptable. Informatica offers two application suites for tackling data quality issues: Informatica Data Explorer (IDE) and Informatica Data Quality (IDQ). IDE focuses on data profiling, and its results can feed into the data integration process. However, its unique strength is its metadata profiling and discovery capability. IDQ has been developed as a data analysis, cleansing, correction, and de-duplication tool, one that provides a complete solution for identifying and resolving all types of data quality problems and preparing data for the consolidation and load processes.

Concepts
Following are some key concepts in the field of data quality. These data quality concepts provide a foundation that helps to develop a clear picture of the subject data, which can improve both efficiency and effectiveness. The list of concepts can be read as a process, leading from profiling and analysis to consolidation.

Profiling and Analysis - whereas data profiling and data analysis are often synonymous terms, in Informatica terminology these tasks are assigned to IDE and IDQ respectively. Thus, profiling is primarily concerned with metadata discovery and definition, and IDE is ideally suited to these tasks. IDQ can discover data quality issues at a record and field level, and Velocity best practices recommend the use of IDQ for such purposes. Note: The remaining items in this document therefore focus on IDQ usage.

Parsing - the process of extracting individual elements within the records, files, or data entry forms in order to check the structure and content of each field and to create discrete fields devoted to specific information types. Examples may include: name, title, company name, phone number, and SSN.

Cleansing and Standardization - refers to arranging information in a consistent manner or preferred format. Examples include the removal of dashes from phone numbers or SSNs. For more information, see the Best Practice Effective Data Standardizing Techniques.

Enhancement - refers to adding useful, but optional, information to existing data or complete data. Examples may include: sales volume, number of employees for a given business, and zip+4 codes.

Validation - the process of correcting data using algorithmic components and secondary reference data sources, to check and validate information. Example: validating addresses with postal directories.

Matching and de-duplication - refers to removing, or flagging for removal, redundant or poor-quality records where high-quality records of the same information exist. Use matching components and business rules to identify records that may refer, for example, to the same customer. For more information, see the Best Practice Effective Data Matching Techniques.

Consolidation - using the data sets defined during the matching process to combine all cleansed or approved data into a single, consolidated view. Examples are building best record, master record, or house-holding.
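Two of the concepts above, standardization and matching, can be illustrated together in a minimal Python sketch: records are first put into a consistent format (dashes removed from phone numbers, names normalized) and then grouped on the standardized values to flag likely duplicates. The field names and sample records are made up, and exact matching on a standardized key is a simplification; it is not how IDQ matching components work.

```python
import re
from collections import defaultdict


def standardize(record):
    """Return a copy with a normalized name and a digits-only phone number."""
    return {
        "name": " ".join(record["name"].split()).upper(),
        "phone": re.sub(r"\D", "", record["phone"]),
    }


def find_duplicates(records):
    """Group record indexes that share the same standardized name and phone."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        clean = standardize(record)
        groups[(clean["name"], clean["phone"])].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


# Made-up customer records for illustration.
customers = [
    {"name": "Ann  Lee", "phone": "555-123-4567"},
    {"name": "ann lee", "phone": "5551234567"},
    {"name": "Bob Ray", "phone": "555-987-6543"},
]
duplicate_groups = find_duplicates(customers)
```

The first two records differ only in formatting, so they fall into the same group once standardized; the groups could then feed a consolidation step that builds a single best record.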

Informatica Applications
The Informatica Data Quality software suite has been developed to resolve a wide range of data quality issues, including data cleansing. The suite comprises the following elements:

● IDQ Workbench - a stand-alone desktop tool that provides a complete set of data quality functionality on a single computer (Windows only).
● IDQ Server - a set of processes that enables the deployment and management of data quality procedures and resources across a network of any size through TCP/IP.
● IDQ Integration - a plug-in component that integrates Workbench with PowerCenter, enabling PowerCenter users to embed data quality procedures defined in IDQ in their mappings.
● IDQ stores all its processes as XML in the Data Quality Repository (MySQL). IDQ Server enables the creation and management of multiple repositories.

Using IDQ in Data Projects
IDQ can be used effectively alongside PowerCenter in data projects, to run data quality procedures in its own applications or to provide them for addition to PowerCenter transformations. Through its Workbench user-interface tool, IDQ tackles data quality in a modular fashion. That is, Workbench enables you to build discrete procedures (called plans in Workbench) which contain data input components, output components, and operational components. Plans can perform analysis, parsing, standardization, enhancement, validation, matching, and consolidation operations on the specified data. Plans are saved into projects that can provide a structure and sequence to your data quality endeavors. The following figure illustrates how data quality processes can function in a project setting:


In stage 1, you analyze the quality of the project data according to several metrics, in consultation with the business or project sponsor. This stage is performed in Workbench, which enables the creation of versatile and easy to use dashboards to communicate data quality metrics to all interested parties.

In stage 2, you verify the target levels of quality for the business according to the data quality measurements taken in stage 1, and in accordance with project resourcing and scheduling.

In stage 3, you use Workbench to design the data quality plans and projects to achieve the targets. Capturing business rules and testing the plans are also covered in this stage.

In stage 4, you deploy the data quality plans. If you are using IDQ Workbench and Server, you can deploy plans and resources to remote repositories and file systems through the user interface. If you are running Workbench alone on remote computers, you can export your plans as XML. Stage 4 is the phase in which data cleansing and other data quality tasks are performed on the project data.

In stage 5, you’ll test and measure the results of the plans and compare them to the initial data quality assessment to verify that targets have been met. If targets have not been met, this information feeds into another iteration of data quality operations in which the plans are tuned and optimized.

In a large data project, you may find that data quality processes of varying sizes and impact are necessary at many points in the project plan. At a high level, stages 1 and 2 ideally occur very early in the project, at a point defined as the Manage Phase within Velocity. Stages 3 and 4 typically occur during the Design Phase of Velocity. Stage 5 can occur during the Design and/or Build Phase of Velocity, depending on the level of unit testing required.

Using the IDQ Integration


Data Quality Integration is a plug-in component that enables PowerCenter to connect to the Data Quality repository and import data quality plans to a PowerCenter transformation. With the Integration component, you can apply IDQ plans to your data without necessarily interacting with or being aware of IDQ Workbench or Server. The Integration interacts with PowerCenter in two ways:
● On the PowerCenter client side, it enables you to browse the Data Quality repository and add data quality plans to custom transformations. The data quality plans’ functional details are saved as XML in the PowerCenter repository.
● On the PowerCenter server side, it enables the PowerCenter Server (or Integration service) to send data quality plan XML to the Data Quality engine for execution.

The Integration requires that at least the following IDQ components are available to PowerCenter:
● Client side: PowerCenter needs to access a Data Quality repository from which to import plans.
● Server side: PowerCenter needs an instance of the Data Quality engine to execute the plan instructions.

An IDQ-trained consultant can build the data quality plans, or you can use the pre-built plans provided by Informatica. Currently, Informatica provides a set of plans dedicated to cleansing and de-duplicating North American name and postal address records. The Integration component enables the following process:
● Data quality plans are built in Data Quality Workbench and saved from there to the Data Quality repository.
● The PowerCenter Designer user opens a Data Quality Integration transformation and configures it to read from the Data Quality repository. Next, the user selects a plan from the Data Quality repository and adds it to the transformation.
● The PowerCenter Designer user saves the transformation and the mapping containing it to the PowerCenter repository. The plan information is saved with the transformation as XML.

The PowerCenter Integration service can then run a workflow containing the saved mapping. The relevant source data and plan information will be sent to the Data Quality engine, which processes the data (in conjunction with any reference data files used by the plan) and returns the results to PowerCenter.

Last updated: 06-Feb-07 12:43


Data Profiling

Challenge
Data profiling is an option in PowerCenter version 7.0 and later that leverages existing PowerCenter functionality and a data profiling GUI front-end to provide a wizard-driven approach to creating data profiling mappings, sessions, and workflows. This Best Practice is intended to provide an introduction on usage for new users. Bear in mind that Informatica’s Data Quality (IDQ) applications also provide data profiling capabilities. Consult the following Velocity Best Practice documents for more information:
● Data Cleansing
● Using Data Explorer for Data Discovery and Analysis

Description
Creating a Custom or Auto Profile
The data profiling option provides visibility into the data contained in source systems and enables users to measure changes in the source data over time. This information can help to improve the quality of the source data.

An auto profile is particularly valuable when you are profiling a source for the first time, since auto profiling offers a good overall perspective of a source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source level, and domain inference, distinct value and null value counts, and minimum, maximum, and average values (if numeric) at the column level. Creating and running an auto profile is quick and helps you gain a reasonably thorough understanding of a source in a short amount of time.

A custom data profile is useful when there is a specific question about a source. For example, use custom profiling if you have a business rule that you want to validate, or if you want to test whether data matches a particular pattern.
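To make the column-level measures concrete, the following Python sketch computes a few of the statistics named above (row count, null count, distinct count, and minimum, maximum, and average for numeric columns) plus a simple candidate key check. It is an illustration only, not the PowerCenter profiling implementation; the function names and the in-memory list input are assumptions.

```python
from statistics import mean


def profile_column(values):
    """Compute basic column-level statistics for a list of values."""
    non_null = [v for v in values if v is not None]
    profile = {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
    }
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Min, max, and average are reported only when every non-null value is numeric.
    if non_null and len(numeric) == len(non_null):
        profile.update(min=min(numeric), max=max(numeric), avg=mean(numeric))
    return profile


def is_candidate_key(values):
    """A column is a candidate key if it has no nulls and no repeated values."""
    return None not in values and len(set(values)) == len(values)
```

For example, `profile_column([10, 20, None, 30])` reports four rows, one null, three distinct values, and an average of 20.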

Setting Up the Profile Wizard
To customize the profile wizard for your preferences:
● Open the Profile Manager and choose Tools > Options.
● If you are profiling data using a database user that is not the owner of the tables to be sourced, check the “Use source owner name during profile mapping generation” option.
● If you are in the analysis phase of your project, choose “Always run profile interactively” since most of your data-profiling tasks will be interactive. (In later phases of the project, uncheck this option because more permanent data profiles are useful in these phases.)

Running and Monitoring Profiles
Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode by checking or unchecking “Configure Session” on the “Function-Level Operations” tab of the wizard.

● Use Interactive to create quick, single-use data profiles. The sessions are created with default configuration parameters.
● For data-profiling tasks that are likely to be reused on a regular basis, create the sessions manually in Workflow Manager and configure and schedule them appropriately.

Generating and Viewing Profile Reports
Use Profile Manager to view profile reports. Right-click on a profile and choose View Report.

For greater flexibility, you can also use Data Analyzer to view reports. Each PowerCenter client includes a Data Analyzer schema and reports XML file. The XML files are located in the \Extensions\DataProfile\IPAReports subdirectory of the client installation. You can create additional metrics, attributes, and reports in Data Analyzer to meet specific business requirements. You can also schedule Data Analyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.

Sampling Techniques
Four types of sampling techniques are available with the PowerCenter data profiling option:

Technique | Description | Usage
No sampling | Uses all source data | Relatively small data sources
Automatic random sampling | PowerCenter determines the appropriate percentage to sample, then samples random rows. | Larger data sources where you want a statistically significant data analysis
Manual random sampling | PowerCenter samples random rows of the source data based on a user-specified percentage. | Samples more or fewer rows than the automatic option chooses.
Sample first N rows | Samples the number of user-selected rows | Provides a quick readout of a source (e.g., first 200 rows)

Profile Warehouse Administration

Updating Data Profiling Repository Statistics
The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To ensure that queries run optimally, be sure to keep database statistics up to date. Run the query below as appropriate for your database type, then capture the script that is generated and run it.

ORACLE
select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';
select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where index_name like 'DP%';

Microsoft SQL Server
select 'update statistics ' + name from sysobjects where name like 'PMDP%'

SYBASE
select 'update statistics ' + name from sysobjects where name like 'PMDP%'

INFORMIX
select 'update statistics low for table ', tabname, ' ; ' from systables where tabname like 'PMDP%'

IBM DB2
select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all; ' from syscat.tables where tabname like 'PMDP%'

TERADATA
select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where tablename like 'PMDP%' and databasename = 'database_name'

where database_name is the name of the repository database.

Purging Old Data Profiles
Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose Target Warehouse>Connect and connect to the profiling warehouse. Choose Target Warehouse>Purge to open the purging tool.

Last updated: 01-Feb-07 18:52


Data Quality Mapping Rules

Challenge
Use PowerCenter to create data quality mapping rules to enhance the usability of the data in your system.

Description
The issue of poor data quality is one that frequently hinders the success of data integration projects. It can produce inconsistent or faulty results and ruin the credibility of the system with the business users. This Best Practice focuses on techniques for use with PowerCenter and third-party or add-on software. Comments that are specific to the use of PowerCenter are enclosed in brackets. Bear in mind that you can augment or supplant the data quality handling capabilities of PowerCenter with Informatica Data Quality (IDQ), the Informatica application suite dedicated to data quality issues. Data analysis and data enhancement processes, or plans, defined in IDQ can deliver significant data quality improvements to your project data. A data project that has built-in data quality steps, such as those described in the Analyze and Design phases of Velocity, enjoys a significant advantage over a project that has not audited and resolved issues of poor data quality. If you have added these data quality steps to your project, you are likely to avoid the issues described below. A description of the range of IDQ capabilities is beyond the scope of this document. For a summary of Informatica’s data quality methodology, as embodied in IDQ, consult the Best Practice Data Cleansing.

Common Questions to Consider
Data integration/warehousing projects often encounter general data problems that may not merit a full-blown data quality project, but which nonetheless must be addressed. This document discusses some methods to ensure a base level of data quality; much of the content discusses specific strategies to use with PowerCenter.

The quality of data is important in all types of projects, whether it be data warehousing, data synchronization, or data migration. Certain questions need to be considered for all of these projects, with the answers driven by the project’s requirements and the business users that are being serviced. Ideally, these questions should be addressed during the Design and Analyze Phases of the project because they can require a significant amount of re-coding if identified later. Some of the areas to consider are:

Text Formatting
The most common hurdle here is capitalization and trimming of spaces. Often, users want to see data in its “raw” format without any capitalization, trimming, or formatting applied to it. This is easily achievable as it is the default behavior, but there is danger in taking this requirement literally since it can lead to duplicate records when some of these fields are used to identify uniqueness and the system is combining data from various source systems.

One solution to this issue is to create additional fields that act as a unique key to a given table, but which are formatted in a standard way. Since the “raw” data is stored in the table, users can still see it in this format, but the additional columns mitigate the risk of duplication. Another possibility is to explain to the users that “raw” data in unique, identifying fields is not as clean and consistent as data in a common format. In other words, push back on this requirement.

This issue can be particularly troublesome in data migration projects where matching the source data is a high priority. Failing to trim leading/trailing spaces from data can often lead to mismatched results since the spaces are stored as part of the data value. The project team must understand how spaces are handled from the source systems to determine the amount of coding required to correct this. (When using PowerCenter and sourcing flat files, the options provided while configuring the File Properties may be sufficient.)

Remember that certain RDBMS products use the data type CHAR, which then stores the data with trailing blanks. These blanks need to be trimmed before matching can occur. It is usually only advisable to use CHAR for 1-character flag fields.


Note that many fixed-width files do not use a null as space. Therefore, developers must put one space beside the text radio button, and also tell the product that the space is repeating to fill out the rest of the precision of the column. The strip trailing blanks facility then strips off any remaining spaces from the end of the data value. Embedding database text manipulation functions in lookup transformations is not recommended because a developer must then cache the lookup table due to the presence of a SQL override. (In PowerCenter, avoid embedding database text manipulation functions in lookup transformations.) On very large tables, caching is not always realistic or feasible.
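The idea of keeping the “raw” value while adding a standardized key column, as discussed in this section, can be sketched in Python as follows. This is an illustration rather than PowerCenter expression code; the function name, the column names, and the choice of upper-casing and collapsing whitespace are assumptions that a project would need to agree with its users.

```python
def match_key(*fields):
    """Build a standardized key: trim, collapse inner spaces, and upper-case."""
    return "|".join(" ".join((f or "").split()).upper() for f in fields)


# Hypothetical rows from two source systems. The first name carries a
# trailing blank, as a CHAR column would store it.
rows = [
    {"name": "Acme Corp ", "city": "boston"},
    {"name": "ACME  CORP", "city": " Boston"},
]
for row in rows:
    # The raw columns stay as delivered; only the added key is standardized.
    row["name_city_key"] = match_key(row["name"], row["city"])
```

Both rows receive the key `ACME CORP|BOSTON`, so a uniqueness check on the key column catches the duplicate even though the raw values differ.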

Datatype Conversions
It is advisable to use explicit tool functions when converting the data type of a particular data value. [In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is performed, and 15 digits are carried forward, even when they are not needed or desired. PowerCenter can handle some conversions without function calls (these are detailed in the product documentation), but this may cause subsequent support or maintenance headaches.]

Dates
Dates can cause many problems when moving and transforming data from one place to another because an assumption must be made that all data values are in a designated format. [Informatica recommends first checking a piece of data to ensure it is in the proper format before trying to convert it to a Date data type. If the check is not performed first, then a developer increases the risk of transformation errors, which can cause data to be lost.] An example piece of code would be:

IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)

If the majority of the dates coming from a source system arrive in the same format, then it is often wise to create a reusable expression that handles dates, so that the proper checks are made.

It is also advisable to determine if any default dates should be defined, such as a low date or high date. These should then be used throughout the system for consistency. However, do not fall into the trap of always using default dates, as some dates are meant to be NULL until the appropriate time (e.g., birth date or death date). The NULL in the example above could be changed to one of the standard default dates described here.
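For readers who want to prototype the same check-then-convert logic outside PowerCenter, the following Python sketch mirrors the expression above. The function name and the HIGH_DATE value are assumptions for illustration; the length and digit check is included because Python's strptime would otherwise accept some shorter strings.

```python
from datetime import date, datetime

HIGH_DATE = date(9999, 12, 31)  # hypothetical project-wide default "high date"


def parse_yyyymmdd(text, default=None):
    """Return a date if text is a valid YYYYMMDD value, else the default."""
    text = (text or "").strip()
    # Check the shape first so that malformed values never reach the conversion.
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return default
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:  # e.g. month 13 or day 30 in February
        return default
```

Calling `parse_yyyymmdd(value)` leaves invalid dates as None, which corresponds to the NULL branch of the expression, while `parse_yyyymmdd(value, default=HIGH_DATE)` substitutes a standard default date instead.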

Decimal Precision
With numeric data columns, developers must determine the expected or required precisions of the columns. (By default, to increase performance, PowerCenter treats all numeric columns as 15-digit floating point decimals, regardless of how they are defined in the transformations. The maximum numeric precision in PowerCenter is 28 digits.) If it is determined that a column realistically needs a higher precision, then the Enable Decimal Arithmetic option in the Session Properties needs to be checked. However, be aware that enabling this option can slow performance by as much as 15 percent. The Enable Decimal Arithmetic option must also be enabled when comparing two numbers for equality.
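The reason floating point values are unsafe for high-precision columns and equality comparisons can be seen in any language that uses double-precision numbers. The Python sketch below is an analogy, not a reproduction of PowerCenter's internal arithmetic; the sample values are chosen purely for illustration.

```python
from decimal import Decimal

# A value with 18 significant digits cannot be held exactly in a double.
as_float = float("1234567890123456.78")
as_decimal = Decimal("1234567890123456.78")
float_lost_precision = Decimal(as_float) != as_decimal

# Equality comparisons on floating point sums can fail unexpectedly,
# while decimal arithmetic compares the exact values.
float_sum_equal = (0.1 + 0.2 == 0.3)
decimal_sum_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```

Here `float_lost_precision` is True, `float_sum_equal` is False, and `decimal_sum_equal` is True, which is the same trade-off described above: exact decimal arithmetic costs performance but preserves digits and makes equality tests reliable.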

Trapping Poor Data Quality Techniques


The most important technique for ensuring good data quality is to prevent incorrect, inconsistent, or incomplete data from ever reaching the target system. This goal may be difficult to achieve in a data synchronization or data migration project, but it is very relevant when discussing data warehousing or ODS. This section discusses techniques that you can use to prevent bad data from reaching the system.

Checking Data for Completeness Before Loading
When requesting a data feed from an upstream system, be sure to request an audit file or report that contains a summary of what to expect within the feed. Common requests here are record counts or summaries of numeric data fields. If you have performed a data quality audit, as specified in the Analyze Phase, these metrics and others should be readily available. Assuming that the metrics can be obtained from the source system, it is advisable to then create a pre-process step that ensures your input source matches the audit file. If the values do not match, stop the overall process from loading into your target system. The source system can then be alerted to verify where the problem exists in its feed.
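A pre-process step of this kind can be sketched in Python as shown below. The audit structure (a record count and a control total), the field names, and the function name are assumptions for illustration; an actual feed would define its own audit layout.

```python
from decimal import Decimal


def audit_mismatches(rows, audit, amount_field="amount"):
    """Compare a feed against its audit summary and return any mismatches."""
    problems = []
    actual_count = len(rows)
    # Sum with Decimal so the control total is compared exactly.
    actual_total = sum((Decimal(r[amount_field]) for r in rows), Decimal("0"))
    if actual_count != audit["record_count"]:
        problems.append(
            f"record count {actual_count} != expected {audit['record_count']}"
        )
    if actual_total != Decimal(audit["amount_total"]):
        problems.append(
            f"amount total {actual_total} != expected {audit['amount_total']}"
        )
    return problems


# Hypothetical feed and audit file contents.
feed = [{"amount": "10.50"}, {"amount": "4.50"}]
audit = {"record_count": 2, "amount_total": "15.00"}
mismatches = audit_mismatches(feed, audit)
```

An empty result means the load can proceed; a non-empty result is the signal to stop the process and alert the source system.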

Enforcing Rules During Mapping
Another method of filtering bad data is to have a set of clearly defined data rules built into the load job. The records are then evaluated against these rules and routed to an Error or Bad Table for further re-processing accordingly. An example of this is to check all incoming Country Codes against a Valid Values table. If the code is not found, then the record is flagged as an Error record and written to the Error table.

A pitfall of this method is that you must determine what happens to the record once it has been loaded to the Error table. If the record is pushed back to the source system to be fixed, then a delay may occur until the record can be successfully loaded to the target system. In fact, if the proper governance is not in place, the source system may refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the data manually and risk not matching with the source system; or 2) relax the business rule to allow the record to be loaded.

Often, in the absence of an enterprise data steward, it is a good idea to assign a team member the role of data steward. It is this person’s responsibility to patrol these tables and push back to the appropriate systems as necessary, as well as help to make decisions about fixing or filtering bad data. A data steward should have a good command of the metadata, and he/she should also understand the consequences to the user community of data decisions.


Another solution applicable in cases with a small number of code values is to try to anticipate any mistyped error codes and translate them back to the correct codes. The cross-reference translation data can be accumulated over time. Each time an error is corrected, both the incorrect and correct values should be put into the table and used to correct future errors automatically.
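The two techniques in this section, validating codes against a valid-values list and translating known mistyped codes through a cross-reference table, can be combined as in the Python sketch below. The code lists, field names, and function name are hypothetical; in a PowerCenter mapping the same logic would typically be built with lookups and a router.

```python
VALID_COUNTRY_CODES = {"US", "CA", "MX"}  # hypothetical Valid Values table
# Hypothetical cross-reference table, accumulated as errors are corrected.
CODE_CORRECTIONS = {"USA": "US", "U.S.": "US", "CAN": "CA"}


def route_records(records):
    """Split records into those to load and those for the Error table."""
    loaded, errors = [], []
    for rec in records:
        code = rec["country_code"].strip().upper()
        code = CODE_CORRECTIONS.get(code, code)  # translate known bad codes
        if code in VALID_COUNTRY_CODES:
            loaded.append({**rec, "country_code": code})
        else:
            errors.append({**rec, "error_reason": "invalid country code"})
    return loaded, errors


incoming = [
    {"id": 1, "country_code": "us"},
    {"id": 2, "country_code": "USA"},
    {"id": 3, "country_code": "ZZ"},
]
loaded, errors = route_records(incoming)
```

Records 1 and 2 load with the code US, while record 3 is routed to the error list with its original value intact, so that a data steward can decide whether to correct it, push it back, or add a new cross-reference entry.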

Dimension Not Found While Loading Fact
The majority of current data warehouses are built using a dimensional model. A dimensional model relies on dimension records existing before the fact tables are loaded. This can usually be accomplished by loading the dimension tables before loading the fact tables. However, there are some cases where a corresponding dimension record is not present at the time of the fact load. When this occurs, consistent rules are needed to handle it so that data is not improperly exposed to, or hidden from, the users.

One solution is to continue to load the data to the fact table, but assign the foreign key a value that represents Not Found or Not Available in the dimension. These keys must also exist in the dimension tables to satisfy referential integrity, but they provide a clear and easy way to identify records that may need to be reprocessed at a later date.

Another solution is to filter the record from processing since it may no longer be relevant to the fact table. The team will most likely want to flag the row through the use of either error tables or process codes so that it can be reprocessed at a later time.

A third solution is to use dynamic caches and load the dimensions when a record is not found there, even while loading the fact table. This should be done very carefully since it may add unwanted or junk values to the dimension table. One occasion when this may be advisable is in cases where dimensions are simply made up of the distinct combination values in a data set. Thus, this dimension may require a new record if a new combination occurs.

It is imperative that all of these solutions be discussed with the users before making any decisions since they will eventually be the ones making decisions based on the reports.
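The first solution, assigning a Not Found key when the dimension lookup fails, can be sketched in Python as follows. The surrogate key value of -1, the table contents, and the function name are assumptions for illustration; the dimension table itself would need a matching Not Found row to satisfy referential integrity.

```python
NOT_FOUND_KEY = -1  # hypothetical surrogate key for the "Not Found" dimension row


def resolve_dimension_keys(facts, customer_dim):
    """Attach a dimension key to each fact, falling back to NOT_FOUND_KEY."""
    resolved = []
    for fact in facts:
        key = customer_dim.get(fact["customer_id"], NOT_FOUND_KEY)
        resolved.append({
            "customer_key": key,
            "amount": fact["amount"],
            # Flag the row so it can be found and reprocessed later.
            "needs_reprocessing": key == NOT_FOUND_KEY,
        })
    return resolved


# Hypothetical dimension lookup (natural key to surrogate key) and fact rows.
customer_dim = {"C001": 101, "C002": 102}
facts = [
    {"customer_id": "C001", "amount": 50},
    {"customer_id": "C999", "amount": 75},
]
fact_rows = resolve_dimension_keys(facts, customer_dim)
```

The second fact row is still loaded, but with the Not Found key and a reprocessing flag, so it is neither silently dropped nor attributed to the wrong customer.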

Last updated: 01-Feb-07 18:52


Data Quality Project Estimation and Scheduling Factors

Challenge
This Best Practice is intended to assist project managers who must estimate the time and resources necessary to address data quality issues within data integration or other data-dependent projects. Its primary concerns are the project estimation issues that arise when you add a discrete data quality stage to your data project. However, it also examines the factors that determine when, or whether, you need to build a larger data quality element into your project.

Description
At a high level, there are three ways to add data quality to your project:
● Add a discrete and self-contained data quality stage, such as that enabled by using pre-built Informatica Data Quality (IDQ) processes, or plans, in conjunction with Informatica Data Cleanse and Match.
● Add an expanded but finite set of data quality actions to the project, for example in cases where pre-built plans do not fit the project parameters.
● Incorporate data quality actions throughout the project.

This document should help you decide which of these methods best suits your project and assist in estimating the time and resources needed for the first and second methods.

Using Pre-Built Plans with Informatica Data Cleanse and Match
Informatica Data Cleanse and Match is a cross-application solution that enables PowerCenter users to add data quality processes defined in IDQ to custom transformations in PowerCenter. It incorporates the following components:
● Data Quality Workbench, a user-interface application for building and executing data quality processes, or plans.
● Data Quality Integration, a plug-in component for PowerCenter that integrates PowerCenter and IDQ.
● At least one set of reference data files that can be read by data quality plans to validate and enrich certain types of project data. For example, Data Cleanse and Match can be used with the North America Content Pack, which includes pre-built data quality plans and complete address reference datasets for the United States and Canada.

Data Quality Engagement Scenarios
Data Cleanse and Match delivers its data quality capabilities “out of the box”; a PowerCenter user can select data quality plans and add them to a Data Quality transformation without leaving PowerCenter. In this way, Data Cleanse and Match capabilities can be added into a project plan as a relatively short and discrete stage.

In a more complex scenario, a Data Quality Developer may wish to modify the underlying data quality plans or create new plans to focus on quality analysis or enhancements in particular areas. This expansion of the data quality operations beyond the pre-built plans can also be handled within a discrete data quality stage.

The Project Manager may decide to implement a more thorough approach to data quality and integrate data quality actions throughout the project plan. In many cases, a convincing case can be made for enlarging the data quality aspect to encompass the full data project. (Velocity contains several tasks and subtasks concerned with such an endeavor.) This is well worth considering. Often, businesses do not realize the extent to which their business and project goals depend on the quality of their data.

The project impact of these three types of data quality activity can be summarized as follows:

DQ approach | Estimated project impact
Simple stage | 10 days, 1-2 Data Quality Developers
Expanded data quality stage | 15-20 days, 2 Data Quality Developers, high visibility to business
Data quality integrated with data project | Duration of data project, 2 or more project roles, impact on business and project objectives

Note: The actual time that should be allotted to the data quality stages noted above depends on the factors discussed in the remainder of this document.

Factors Influencing Project Estimation
The factors influencing project estimation for a data quality stage range from high-level project parameters to lower-level data characteristics. The main factors are listed below and explained in detail later in this document.
● Base and target levels of data quality
● Overall project duration/budget
● Overlap of sources/Complexity of data joins
● Quantity of data sources
● Matching requirements
● Data volumes
● Complexity and quantity of data rules
● Geography

Determine which scenario (out of the box with Data Cleanse and Match, expanded Data Cleanse and Match, or a thorough data quality integration) best fits your data project by considering the project’s overall objectives and its mix of factors.

The Simple Data Quality Stage


Project managers can consider the use of pre-built plans with Data Cleanse and Match as a simple scenario with a predictable number of function points that can be added to the project plan as a single package. Similar metrics apply to other types of pre-built plans. You can add the North America Content Pack plans to your project if the project meets most of the following criteria:

● Baseline functionality of the pre-built data quality plans meets 80 percent of the project needs.
● Complexity of data rules is relatively low.
● Business rules present in pre-built plans need minimum fine-tuning.
● Target data quality level is achievable (i.e., <100 percent).
● Quantity of data sources is relatively low.
● Overlap of data sources/complexity of database table joins is relatively low.
● Matching requirements and targets are straightforward.
● Overall project duration is relatively short.
● The project relates to a single country.

Note that the source data quality level is not a major concern.

Implementing the Simple Data Quality Stage
The out-of-the-box scenario is designed to deliver significant increases in data quality in those areas for which the plans were designed (i.e., North American name and address data) in a short time frame. As indicated above, it does not anticipate major changes to the underlying data quality plans. It involves the following three steps:

1. Run pre-built plans.
2. Review plan results.
3. Transfer data to the next stage in the project and (optionally) add data quality plans to PowerCenter transformations.

While every project is different, a single iteration of the simple model may take approximately five days, as indicated below:

● Run pre-built plans (2 days)
● Review plan results (1 day)
● Pass data to the next stage in the project and add plans to PowerCenter transformations (2 days)

Note that these estimates fit neatly into a five-day week but may be conservative in some cases. Note also that a Data Quality Developer can tune plans on an ad-hoc basis to suit the project. Therefore, you should plan for a two-week “simple” data quality stage.

Step - Simple Stage Run pre-built plans

Days, week 1 2

Days, week 2


Review plan results Fine-tune pre-built plans if necessary Re-run pre-built plans Review plan results with stakeholders Add plans to PowerCenter transformations and define mappings Run PowerCenter workflows Review results/obtain approval from stakeholders Approve and pass all files to the next project stage

1 2 2 1 1 1

Expanding the Simple Data Quality Stage
Although the simple scenario above allows the data quality components to be treated as a “black box,” it also allows for modifications to the data quality plans. The types of plan tuning that developers can undertake in this time frame include changing the reference dictionaries used by the plans, editing these dictionaries, and re-selecting the data fields used by the plans as keys to identify data matches. The above time frame does not guarantee that a developer can build or re-build a plan from scratch.

The gap between base and target levels of data quality is an important area to consider when expanding the data quality stage. The Developer and Project Manager may decide to add a data analysis step in this stage, or even decide to split these activities across the project plan by conducting a data quality audit early in the project, so that issues can be revealed to the business in advance of the formal data quality stage.

The schedule should allow sufficient time for testing the data quality plans and for contact with the business managers in order to define data quality expectations and targets. In addition:

● If a data quality audit is added early in the project, the data quality stage grows into a project-length endeavor.
● If the data quality audit is included in the discrete data quality stage, the expanded, three-week Data Quality stage may look like this:

Step - Enhanced DQ Stage Set up and run data analysis plans Review plan results Conduct advance tuning of pre-built plans Run pre-built plans Review plan results with stakeholders Modify pre-built plans or build new plans from scratch Re-run the plans

Days, week 1 1-2 2

Days, week 2

Days, week 3

1 2 2

Review plan results/obtain approval from stakeholders Add approved plans to PowerCenter transformations, define mappings Run PowerCenter workflows Review results/obtain approval from stakeholders Approve and pass all files to the next project stage

1 2 1 1 1

Sizing Your Data Quality Initiatives
The following section describes the factors that affect the estimated time that the data quality endeavors may add to a project. Estimating the specific impact that a single factor is likely to have on a project plan is difficult, as a single data factor rarely exists in isolation from others. If one or two of these factors apply to your data, you may be able to treat them within the scope of a discrete DQ stage. If several factors apply, you are moving into a complex scenario and must design your project plan accordingly.

Base and Target Levels of Data Quality
The rigor of your data quality stage depends in large part on the current (i.e., “base”) levels of data quality in your dataset and the target levels that you want to achieve. As part of your data project, you should run a set of data analysis plans and determine the strengths and weaknesses of the proposed project data. If your data is already of a high quality relative to project and business goals, then your data quality stage is likely to be a short one!

If possible, you should conduct this analysis at an early stage in the data project (i.e., well in advance of the data quality stage). Depending on your overall project parameters, you may have already scoped a Data Quality Audit into your project. However, if your overall project is short in duration, you may have to tailor your data quality analysis actions to the time available.

Action: If there is a wide gap between base and target data quality levels, determine whether a short data quality stage can bridge the gap. If a data quality audit is conducted early in the project, you have latitude to discuss this with the business managers in the context of the overall project timeline. In general, it is good practice to agree with the business to incorporate time into the project plan for a dedicated Data Quality Audit. (See Task 2.8 in the Velocity Work Breakdown Structure.)

If the aggregated data quality percentage for your project’s source data is greater than 60 percent, and your target percentage level for the data quality stage is less than 95 percent, then you are in the zone of effectiveness for Data Cleanse and Match.

Note: You can assess data quality according to at least six criteria. Your business may need to improve data quality levels with respect to one criterion but not another. See the Best Practice document Data Cleansing.
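The 60 percent and 95 percent thresholds above can be expressed as a simple check. The following is a minimal sketch only: the function and constant names are hypothetical, and it assumes that both percentages have already been aggregated from your data analysis plans.

```python
# Thresholds quoted in the text above; the names are illustrative assumptions.
MIN_BASE_PCT = 60.0    # base quality must be greater than this
MAX_TARGET_PCT = 95.0  # target quality must be less than this


def in_cleanse_and_match_zone(base_pct: float, target_pct: float) -> bool:
    """Return True when the base/target pair falls in the zone of effectiveness."""
    return base_pct > MIN_BASE_PCT and target_pct < MAX_TARGET_PCT


if __name__ == "__main__":
    print(in_cleanse_and_match_zone(72.0, 90.0))  # base and target both inside the zone
    print(in_cleanse_and_match_zone(45.0, 90.0))  # base level too low
    print(in_cleanse_and_match_zone(72.0, 99.0))  # target level too ambitious
```

A result of False suggests that the project is moving outside the scope of a simple data quality stage and that the gap should be discussed with the business.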

Overall Project Duration/Budget

A data project with a short duration may not have the means to accommodate a complex data quality stage, regardless of the potential or need to enhance the quality of the data involved. In such a case, you may have to incorporate a finite data quality stage. Conversely, a data project with a long time line may have scope for a larger data quality initiative. In large data projects with major business and IT targets, good data quality may be a significant issue. For example, poor data quality can affect the ability to cleanly and quickly load data into target systems. Major data projects typically have a genuine need for high-quality data if they are to avoid unforeseen problems. Action: Evaluate the project schedule parameters and expectations put forward by the business and evaluate how data quality fits into these parameters. You must also determine if there are any data quality issues that may jeopardize project success, such as a poor understanding of the data structure. These issues may already be visible to the business community. If not, they should be raised with the management. Bear in mind that data quality is not simply concerned with the accuracy of the data values — it can encompass the project metadata also.

Overlap of Sources/Complexity of Data Joins
When data sources overlap, data quality issues can be spread across several sources. The relationships among the variables within the sources can be complex, difficult to join together, and difficult to resolve, all adding to project time. If the joins between the data are simple, then this task may be straightforward. However, if the data joins use complex keys or exist over many hierarchies, then the data modeling stage can be time-consuming, and the process of resolving the indices may be prolonged. Action: You can tackle complexity in data sources and in required database joins within a data quality stage, but in doing so, you step outside the scope of the simple data quality stage.

Quantity of Data Sources
This issue is similar to that of data source overlap and complexity (above). The greater the quantity of sources, the greater the opportunity for data quality issues to arise. The number of data sources has a particular impact on the time required to set up the data quality solution. (The source data setup in PowerCenter can facilitate the data setup in the data quality stage.) Action: You may find that the number of data sources correlates with the number of data sites covered by the project. If your project includes data from multiple geographies, you step outside the scope of a simple data quality stage.

Matching Requirements
Data matching plans are the most performance-intensive type of data quality plan. Moreover, matching plans are often coupled to a type of data standardization plan (i.e., grouping plan) that prepares the data for match analysis. Matching plans are not necessarily more complex to design than other types of plans, although they may contain sophisticated business rules. However, the time taken to execute a matching plan grows much faster than the volume of data records passed through the plan, because the records in each group are compared with one another. (Specifically, the time taken is proportional to the size and number of data groups created in the grouping plans.) Action: Consult the Best Practice on Effective Data Matching Techniques and determine how long your matching plans may take to run.

Data Volumes
Data matching requirements and data volumes are closely related. As stated above, the time taken to execute a matching plan grows much faster than the volume of data records passed through it. In other types of plans, this steep relationship does not exist. However, the general rule applies: the larger your data volumes, the longer it takes for plans to execute. Action: Although IDQ can handle data volumes measurable in eight figures, a dataset of more than 1.5 million records is considered larger than average. If your dataset is measurable in millions of records, and high levels of matching/de-duplication are required, consult the Best Practice on Effective Data Matching Techniques.

Complexity and Quantity of Data Rules
This is a key factor in determining the complexity of your data quality stage. If the Data Quality Developer is likely to write a large number of business rules for the data quality plans — as may be the case if data quality target levels are very high or relate to precise data objectives — then the project is de facto moving out of Data Cleanse and Match capability and you need to add rule-creation and rule-review elements to the data quality effort. Action: If the business requires multiple complex rules, you must scope additional time for rule creation and for multiple iterations of the data quality stage. Bear in mind that, as well as writing and adding these rules to data quality plans, the rules must be tested and passed by the business.

Geography
Geography affects the project plan in two ways:
● First, the geographical spread of data sites is likely to affect the time needed to run plans, collate data, and engage with key business personnel. Working hours in different time zones can mean that one site is starting its business day while others are ending theirs, and this can affect the tight scheduling of the simple data quality stage.
● Secondly, project data that is sourced from several countries typically means multiple data sources, with opportunities for data quality issues to arise that may be specific to the country or the division of the organization providing the data source.

There is also a high correlation between the scale of the data project and the scale of the enterprise in which the project will take place. For multi-national corporations, there is rarely such a thing as a small data project! Action: Consider the geographical spread of your source data. If the data sites are spread across several time zones or countries, you may need to factor in time lags to your data quality planning.

Last updated: 04-Jun-08 18:40

Developing the Data Quality Business Case

Challenge
When a potential data quality issue has been identified, it is imperative to develop a business case that details the severity of the issue along with the benefits to be gained by implementing a data quality strategy. A strong business case can help to build the necessary organizational support for funding a data quality initiative.

Description
Building a business case around data quality often necessitates starting with a pilot project. The purpose of the pilot project is to document the anticipated return on investment (ROI). It is important to ensure that the pilot is both manageable and achievable in a relatively short period of time. Build the business case by conducting a Data Quality Audit on a representative sample set of data, but set a reasonable scope so that the audit can be accomplished within a three- to four-week period.

At the conclusion of the Data Quality Audit, a report should be prepared that captures the results of the investigation (i.e., invalid data, duplicate records, etc.) and extrapolates the expected cost savings that can be gained if an Enterprise data quality initiative is pursued.

Below are the five key steps necessary to develop a business case for a Data Quality Audit. Following these steps also provides a solid foundation for detailing the business requirements for an Enterprise data quality initiative.

1. Identify a Test Source

a. What source file(s) are to be considered? A representative sample set of data should be evaluated. This can be a cross-section of an enterprise data set or data from a specific department in which a potential data quality issue is expected to be found.

b. What data within those files (priority, obsolete, dormant, incorrect) will be used? Prior to conducting the Data Quality Audit, the type of data within each file should be documented. The results generated during the Audit should be tracked against the anticipated data types. For example, if 10% of the records are incorrectly flagged as priority (when they should be marked obsolete or dormant), any reporting based upon the results of this data will be skewed.

2. Identify Issues

a. What data needs to be fixed? Any anticipated issues with the data should be identified prior to conducting the Audit in order to ensure that the specific use cases are investigated.

b. What data needs to be changed or enhanced? A data dictionary should be created or made available to capture any anticipated values that should reside within a given data field. These values will be utilized via a reference lookup to analyze the level of conformity between the actual value and the recorded value in the reference dictionary. Additionally, any missing values should be updated based upon the documented data dictionary value.

c. What is a representative set of business rules to demonstrate functionality? Prior to conducting the Audit, a discussion should be held regarding the business rules that should be enforced in the provided data set. The intent is to use the expected business rules as a starting point for validation of the data during the Audit. As new rules are likely to be identified during the Audit, having a starting point ensures that initial results can be quickly disseminated to key stakeholders via an initial data quality iteration that leverages the previously documented business rules.

3. Define Scope

a. What can be achieved with which resources in the time available? The scope of the Audit should be defined in order to ensure that a business case can be made for a data quality initiative within weeks, not months. The project should be seen as a pilot in order to validate the anticipated ROI if an Enterprise initiative is pursued.
Just as the scope should be well defined, commitments should be agreed upon prior to starting the project that the required resources (i.e., data steward, IT representative, business user) will be available as needed for the duration of the project. This will ensure that activities such as the data and business rule review remain on schedule.

b. What milestones are critical to other parts of the project? Any relationships between the outcome of the project and other initiatives within the organization should be identified up front. Although the Audit is a pilot project, the data quality results should be reusable on other projects within the organization. If there are specific milestones for the delivery of results, these should be incorporated into the project plan in order to ensure that other projects are not adversely impacted.

4. Highlight Resulting Issues

a. Highlight typical issues for the Business, Data Owners, the Governance Team and Senior Management. Upon conclusion of the Audit, the issues uncovered during the project should be summarized and presented to key stakeholders in a workshop setting. During the workshop, the results should be highlighted, along with any anticipated impact to the business if a data quality initiative is not enacted within the organization.

b. Test the execution resolution of issues. During the Audit, the resolution of identified issues should occur by leveraging Informatica Data Quality. During the workshop, the means to resolve the issues and the end results should be presented. The types of issues typically resolved include address validation, ensuring conformity of data through the use of reference dictionaries, and the identification and resolution of duplicate data.

5. Build Knowledge

a. Gain confidence and knowledge of data quality management strategies, conference room pilots, migrations, etc. To reiterate, the intent of the Audit is to quantify the anticipated ROI within an organization if a data quality strategy is implemented. Additionally, knowledge about the data, the business rules and the potential strategy that can be leveraged throughout the entire organization should be captured.

b. The rules employed will form the basis of an ongoing DQM Strategy in the target systems. The identified rules should be incorporated into an existing data quality management strategy or utilized as the starting point for a new strategy moving forward.

The above steps are intended as a starting point for developing a framework for conducting a Data Quality Audit. From this Audit, the key stakeholders in an organization should have definitive proof as to the extent of the types of data quality issues within their organization and the anticipated ROI that can be achieved through the introduction of data quality throughout the organization.

Last updated: 21-Aug-07 11:48

Effective Data Matching Techniques

Challenge
Identifying and eliminating duplicates is a cornerstone of effective marketing efforts and customer resource management initiatives, and it is an increasingly important driver of cost-efficient compliance with regulatory initiatives such as KYC (Know Your Customer). Once duplicate records are identified, you can remove them from your dataset, and better recognize key relationships among data records (such as customer records from a common household). You can also match records or values against reference data to ensure data accuracy and validity. This Best Practice is targeted toward Informatica Data Quality (IDQ) users familiar with Informatica's matching approach. It has two high-level objectives:
● To identify the key performance variables that affect the design and execution of IDQ matching plans.
● To describe plan design and plan execution actions that will optimize plan performance and results.

To optimize your data matching operations in IDQ, you must be aware of the factors that are discussed below.

Description
All too often, an organization's datasets contain duplicate data in spite of numerous attempts to cleanse the data or prevent duplicates from occurring. In other scenarios, the datasets may lack common keys (such as customer numbers or product ID fields) that, if present, would allow clear ‘joins’ between the datasets and improve business knowledge. Identifying and eliminating duplicates in datasets can serve several purposes. It enables the creation of a single view of customers; it can help control costs associated with mailing lists by preventing multiple pieces of mail from being sent to the same person or household; and it can assist marketing efforts by identifying households or individuals who are heavy users of a product or service. Data can be enriched by matching across production data and reference data sources. Business intelligence operations can be improved by identifying links between two or more systems to provide a more complete picture of how customers interact with a business. IDQ’s matching capabilities can help to resolve dataset duplications and deliver business results. However, a user’s ability to design and execute a matching plan that meets the key requirements of performance and match quality depends on understanding the best-practice approaches described in this document. An integrated approach to data matching involves several steps that prepare the data for matching and improve the overall quality of the matches. The following table outlines the processes in each step.

Step | Description
Profiling | Typically the first stage of the data quality process, profiling generates a picture of the data and indicates the data elements that can comprise effective group keys. It also highlights the data elements that require standardizing to improve match scores.
Standardization | Removes noise, excess punctuation, variant spellings, and other extraneous data elements. Standardization reduces the likelihood that match quality will be affected by data elements that are not relevant to match determination.
Grouping | A post-standardization function in which the groups' key fields identified in the profiling stage are used to segment data into logical groups that facilitate matching plan performance.
Matching | The process whereby the data values in the created groups are compared against one another and record matches are identified according to user-defined criteria.
Consolidation | The process whereby duplicate records are cleansed. It identifies the master record in a duplicate cluster and permits the creation of a new dataset or the elimination of subordinate records. Any child data associated with subordinate records is linked to the master record.

The sections below identify the key factors that affect the performance (or speed) of a matching plan and the quality of the matches identified. They also outline the best practices that ensure that each matching plan is implemented with the highest probability of success. (This document does not make any recommendations on profiling, standardization or consolidation strategies. Its focus is grouping and matching.) The following table identifies the key variables that affect matching plan performance and the quality of matches identified.

Factor | Impact | Impact summary
Group size | Plan performance | The number and size of groups have a significant impact on plan execution speed.
Group keys | Quality of matches | The proper selection of group keys ensures that the maximum number of possible matches are identified in the plan.
Hardware resources | Plan performance | Processors, disk performance, and memory require consideration.
Size of dataset(s) | Plan performance | This is not a high-priority issue. However, it should be considered when designing the plan.
Informatica Data Quality components | Plan performance | The plan designer must weigh file-based versus database matching approaches when considering plan requirements.
Time window and frequency of execution | Plan performance | The time taken for a matching plan to complete execution depends on its scale. Timing requirements must be understood up-front.
Match identification | Quality of matches | The plan designer must weigh deterministic versus probabilistic approaches.

Group Size
Grouping breaks large datasets down into smaller ones to reduce the number of record-to-record comparisons performed in the plan, which directly impacts the speed of plan execution. When matching on grouped data, a matching plan compares the records within each group with one another. When grouping is implemented properly, plan execution speed is increased significantly, with no meaningful effect on match quality. The most important determinant of plan execution speed is the size of the groups to be processed — that is, the number of data records in each group. For example, consider a dataset of 1,000,000 records, for which a grouping strategy generates 10,000 groups. If 9,999 of these groups have an average of 50 records each, the remaining group will contain more than 500,000 records; based on this one large group, the matching plan would require 87 days to complete, processing 1,000,000 comparisons a minute! In comparison, the remaining 9,999 groups could be matched in about 12 minutes if the group sizes were evenly distributed. Group size can also have an impact on the quality of the matches returned in the matching plan. Large groups perform more record comparisons, so more likely matches are potentially identified. The reverse is true for small groups. As groups get smaller, fewer comparisons are possible, and the potential for missing good matches is increased. The goal of grouping is to optimize performance while minimizing the possibility that valid matches will be overlooked because like records are assigned to different groups. Therefore, groups must be defined intelligently through the use of group keys.
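The figures in the example above can be reproduced with a short calculation. This is a sketch under one stated assumption: every record in a group is compared once with every other record in the same group, giving n(n-1)/2 comparisons for a group of n records. The function names are illustrative and are not part of IDQ.

```python
def pairwise_comparisons(group_size: int) -> int:
    """Comparisons needed to compare each record in a group with every other (assumed model)."""
    return group_size * (group_size - 1) // 2


def minutes_to_match(group_sizes, comparisons_per_minute=1_000_000):
    """Estimated run time in minutes at the given comparison throughput."""
    total = sum(pairwise_comparisons(n) for n in group_sizes)
    return total / comparisons_per_minute


# 1,000,000 records in 10,000 groups: 9,999 groups of 50 records, plus one large group.
small_groups = [50] * 9_999
large_group = [1_000_000 - sum(small_groups)]  # the remaining 500,050 records

if __name__ == "__main__":
    print(f"Small groups: {minutes_to_match(small_groups):.1f} minutes")
    print(f"Large group:  {minutes_to_match(large_group) / (60 * 24):.1f} days")
```

Under this assumption, the 9,999 small groups need roughly 12 minutes in total, while the single large group needs roughly 87 days, which is why the largest group dominates plan execution time.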

Group Keys
Group keys determine which records are assigned to which groups. Group key selection, therefore, has a significant effect on the success of matching operations. Grouping splits data into logical chunks and thereby reduces the total number of comparisons performed by the plan. The selection of group keys, based on key data fields, is critical to ensuring that relevant records are compared against one another. When selecting a group key, two main criteria apply:

● Candidate group keys should represent a logical separation of the data into distinct units where there is a low probability that matches exist between records in different units. This can be determined by profiling the data and uncovering the structure and quality of the content prior to grouping.
● Candidate group keys should also have high scores in three key areas of data quality: completeness, conformity, and accuracy. Problems in these data areas can be improved by standardizing the data prior to grouping.

For example, geography is a logical separation criterion when comparing name and address data. A record for a person living in Canada is unlikely to match someone living in Ireland. Thus, the country-identifier field can provide a useful group key. However, if you are working with national data (e.g., Swiss data), duplicate data may exist for an individual living in Geneva, who may also be recorded as living in Genf or Geneve. If the group key in this case is based on city name, records for Geneva, Genf, and Geneve will be written to different groups and never compared unless variant city names are standardized.
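The Geneva example can be illustrated outside IDQ with a small sketch. The variant dictionary, field names, and sample records below are hypothetical; in practice the variants would come from a reference dictionary maintained for the project.

```python
from collections import defaultdict

# Hypothetical reference dictionary mapping variant city names to a standard form.
CITY_VARIANTS = {"genf": "Geneva", "geneve": "Geneva", "genève": "Geneva"}


def standardize_city(city: str) -> str:
    """Map a known variant to its standard form; otherwise just tidy the value."""
    cleaned = city.strip()
    return CITY_VARIANTS.get(cleaned.lower(), cleaned.title())


def group_by_city(records, standardize=True):
    """Assign records to groups using the city field as the group key."""
    groups = defaultdict(list)
    for record in records:
        key = standardize_city(record["city"]) if standardize else record["city"]
        groups[key].append(record)
    return dict(groups)


# Illustrative records only.
records = [
    {"name": "A. Muller", "city": "Geneva"},
    {"name": "A. Mueller", "city": "Genf"},
    {"name": "A. Muller", "city": "Geneve"},
]

if __name__ == "__main__":
    print(len(group_by_city(records, standardize=False)))  # records are split across groups
    print(len(group_by_city(records)))                     # records share one group
```

Without standardization the three records fall into three separate groups and are never compared; with standardization they share one group and become candidates for matching.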

Size of Dataset
In matching, the size of the dataset typically does not have as significant an impact on plan performance as the definition of the groups within the plan. However, in general terms, the larger the dataset, the more time required to produce a matching plan — both in terms of the preparation of the data and the plan execution.

IDQ Components
All IDQ components serve specific purposes, and very little functionality is duplicated across the components. However, there are performance implications for certain component types, combinations of components, and the quantity of components used in the plan. Several tests have been conducted on IDQ (version 2.11) to test source/sink combinations and various operational components. In tests comparing file-based matching against database matching, file-based matching outperformed database matching in UNIX and Windows environments for plans containing up to 100,000 groups. Also, matching plans that wrote output to a CSV Sink outperformed plans with a DB Sink or Match Key Sink. Plans with a Mixed Field Matcher component performed more slowly than plans without a Mixed Field Matcher. Raw performance should not be the only consideration when selecting the components to use in a matching plan. Different components serve different needs and may offer advantages in a given scenario.

Time Window
IDQ can perform millions or billions of comparison operations in a single matching plan. The time available for the completion of a matching plan can have a significant impact on the perception that the plan is running correctly. Knowing the time window for plan completion helps to determine the hardware configuration choices, grouping strategy, and the IDQ components to employ.

Frequency of Execution
The frequency with which plans are executed is linked to the time window available. Matching plans may need to be tuned to fit within the cycle in which they are run. The more frequently a matching plan is run, the more the execution time will have to be considered.

Match Identification
The method used by IDQ to identify good matches has a significant effect on the success of the plan. Two key methods for assessing matches are:
● deterministic matching
● probabilistic matching

Deterministic matching applies a series of checks to determine if a match can be found between two records. IDQ’s fuzzy matching algorithms can be combined with this method. For example, a deterministic check may first check if the last name comparison score was greater than 85 percent. If this is true, it next checks the address. If an 80 percent match is found, it then checks the first name. If a 90 percent match is found on the first name, then the entire record is considered successfully matched.

The advantages of deterministic matching are: (1) it follows a logical path that can be easily communicated to others, and (2) it is similar to the methods employed when manually checking for matches. The disadvantages of this method are its rigidity and its requirement that each dependency be true. This can result in matches being missed, or can require several different rule checks to cover all likely combinations.

Probabilistic matching takes the match scores from fuzzy matching components and assigns weights to them in order to calculate a weighted average that indicates the degree of similarity between two pieces of information. The advantage of probabilistic matching is that it is less rigid than deterministic matching. There are no dependencies on certain data elements matching in order for a full match to be found. Weights assigned to individual components can place emphasis on different fields or areas in a record. However, even if a heavily weighted score falls below a defined threshold, match scores from less heavily weighted components may still produce a match.

The disadvantage of this method is that it requires more tweaking on the user’s part to get the right balance of weights in order to optimize successful matches. This can be difficult for users to understand and communicate to one another. Also, the cut-off mark for good matches versus bad matches can be difficult to assess. For example, a matching plan with 95 to 100 percent success may have found all good matches, but matching plan success between 90 and 94 percent may map to only 85 percent genuine matches. Matches between 85 and 89 percent may correspond to only 65 percent genuine matches, and so on.
The following table illustrates this principle.

Close analysis of the match results is required because match quality depends on the match threshold scores assigned, and there may not be a one-to-one mapping between the plan’s weighted score and the number of records that can be considered genuine matches.
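The difference between the two methods can be sketched in a few lines. This is an illustration, not IDQ's implementation: the field scores are assumed to come from fuzzy matching components and are expressed as fractions between 0 and 1, the deterministic thresholds are those quoted in the example above, and the weights and the 0.85 cut-off are hypothetical values.

```python
def deterministic_match(scores: dict) -> bool:
    """Apply the checks in sequence; every check must pass for a match."""
    if scores["last_name"] <= 0.85:
        return False
    if scores["address"] < 0.80:
        return False
    return scores["first_name"] >= 0.90


def probabilistic_match(scores: dict, weights: dict, threshold: float):
    """Return (is_match, weighted_average) using a weighted average of field scores."""
    total_weight = sum(weights.values())
    weighted = sum(scores[field] * w for field, w in weights.items()) / total_weight
    return weighted >= threshold, weighted


# Hypothetical field scores for one pair of records.
scores = {"last_name": 0.95, "address": 0.90, "first_name": 0.70}
# Hypothetical weights that emphasize last name over address and first name.
weights = {"last_name": 5, "address": 3, "first_name": 2}

if __name__ == "__main__":
    print(deterministic_match(scores))
    print(probabilistic_match(scores, weights, threshold=0.85))
```

With these values the deterministic check rejects the pair because the first name score is below 90 percent, whereas the weighted average of about 0.885 clears the assumed cut-off. Whether such a pair is a genuine match is precisely what the close analysis of results must establish.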

Best Practice Operations
The following section outlines best practices for matching with IDQ.

Capturing Client Requirements
Capturing client requirements is key to understanding how successful and relevant your matching plans are likely to be. As a best practice, be sure to answer the following questions, as a minimum, before designing and implementing a matching plan:
● How large is the dataset to be matched?
● How often will the matching plans be executed?
● When will the match process need to be completed?
● Are there any other dependent processes?
● What are the rules for determining a match?
● What process is required to sign-off on the quality of match results?
● What processes exist for merging records?

Test Results
Performance tests demonstrate the following:
● IDQ has near-linear scalability in a multi-processor environment.
● Scalability in standard installations, as achieved in the allocation of matching plans to multiple processors, will eventually level off.

Performance is the key to success in high-volume matching solutions. IDQ’s architecture supports massive scalability by allowing large jobs to be subdivided and executed across several processors. This scalability greatly enhances IDQ’s ability to meet the service levels required by users without sacrificing quality or requiring an overly complex solution. If IDQ is integrated with PowerCenter, matching scalability can be achieved using PowerCenter's partitioning capabilities.

Managing Group Sizes
As stated earlier, group sizes have a significant effect on the speed of matching plan execution. Also, the quantity of small groups should be minimized to ensure that the greatest number of comparisons are captured. Keep the following parameters in mind when designing a grouping plan.

Condition | Best practice | Exceptions
Maximum group size | 5,000 records | Large datasets over 2M records with uniform data. Minimize the number of groups containing more than 5,000 records.
Minimum number of single-record groups | 1,000 groups per one million record dataset. |
Optimum number of comparisons | 500,000,000 comparisons +/- 20 percent per 1 million records |


In cases where the datasets are large, multiple group keys may be required to segment the data to ensure that best practice guidelines are followed. Informatica Corporation can provide sample grouping plans that automate these requirements as far as is practicable.
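The grouping parameters above can be checked mechanically before a matching plan is run. The following Python sketch is illustrative only and is not part of IDQ: the function names are hypothetical, and it assumes that every record in a group is compared once with every other record in the same group, so a group of n records costs n(n-1)/2 comparisons.

```python
from collections import Counter

# Best-practice figures taken from the table above.
MAX_GROUP_SIZE = 5_000
TARGET_COMPARISONS_PER_MILLION = 500_000_000
TOLERANCE = 0.20  # +/- 20 percent


def comparisons_in_group(size: int) -> int:
    """Pairwise comparisons inside one group (assumed n * (n - 1) / 2)."""
    return size * (size - 1) // 2


def review_grouping(group_keys):
    """Summarize a grouping plan, given the group key assigned to each record."""
    sizes = Counter(group_keys)
    records = sum(sizes.values())
    comparisons = sum(comparisons_in_group(n) for n in sizes.values())
    # Scale the per-million target to the actual record count.
    target = TARGET_COMPARISONS_PER_MILLION * records / 1_000_000
    return {
        "records": records,
        "groups": len(sizes),
        "oversized_groups": sum(1 for n in sizes.values() if n > MAX_GROUP_SIZE),
        "single_record_groups": sum(1 for n in sizes.values() if n == 1),
        "comparisons": comparisons,
        "within_target": abs(comparisons - target) <= TOLERANCE * target,
    }
```

A summary with many oversized or single-record groups suggests that a different group key, or an additional one, is worth trying.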

Group Key Identification
Identifying appropriate group keys is essential to the success of a matching plan. Ideally, any dataset that is about to be matched has been profiled and standardized to identify candidate keys. Group keys act as a “first pass” or high-level summary of the shape of the dataset(s). Remember that only data records within a given group are compared with one another. Therefore, it is vital to select group keys that have high data quality scores for completeness, conformity, consistency, and accuracy. Group key selection depends on the type of data in the dataset, for example whether it contains name and address data or other data types such as product codes.

Hardware Specifications
Matching is a resource-intensive operation, especially in terms of processor capability. Three key variables determine the effect of hardware on a matching plan: processor speed, disk performance, and memory.

The majority of the activity required in matching is tied to the processor. Therefore, the speed of the processor has a significant effect on how fast a matching plan completes. Although the average computational speed for IDQ is one million comparisons per minute, the speed can range from as low as 250,000 comparisons per minute to as high as 6.5 million comparisons per minute, depending on the hardware specification, background processes running, and components used. As a best practice, higher-specification processors (e.g., 1.5 GHz minimum) should be used for high-volume matching plans.

Hard disk capacity and available memory can also determine how fast a plan completes. The hard disk reads and writes data required by IDQ sources and sinks. The speed of the disk and the level of defragmentation affect how quickly data can be read from, and written to, the hard disk. Information that cannot be stored in memory during plan execution must be temporarily written to the hard disk. This increases the time required to retrieve information that otherwise could be stored in memory, and also increases the load on the hard disk. A RAID drive may be appropriate for datasets of 3 to 4 million records, and a minimum of 512MB of memory should be available.

The following table is a rough guide for hardware estimates based on IDQ Runtime on Windows platforms. Specifications for UNIX-based systems vary.

Match volumes | Suggested hardware specification
< 1,500,000 records | 1.5 GHz computer, 512MB RAM
1,500,000 to 3 million records | Multi-processor server, 1GB RAM
> 3 million records | Multi-processor server, 2GB RAM, RAID 5 hard disk
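To see what the quoted throughput range means in practice, the short Python sketch below converts a comparison count into an estimated run time. The function name is hypothetical, and the 500,000,000-comparison workload is simply the best-practice figure for a one-million-record dataset used for illustration.

```python
def match_minutes(comparisons: int, comparisons_per_minute: float = 1_000_000) -> float:
    """Estimated matching time in minutes at a given throughput."""
    return comparisons / comparisons_per_minute


WORKLOAD = 500_000_000  # comparisons, illustrative

for label, rate in [("slowest", 250_000), ("average", 1_000_000), ("fastest", 6_500_000)]:
    hours = match_minutes(WORKLOAD, rate) / 60
    print(f"{label:8s} {rate:>9,d} comparisons/min -> {hours:5.1f} hours")
```

The spread between the slowest and fastest rates is a factor of 26, which is why hardware choice matters for high-volume plans.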

Single Processor vs. Multi-Processor
With IDQ Runtime, it is possible to run multiple processes in parallel. Matching plans, whether they are file-based or database-based, can be split into multiple plans to take advantage of multiple processors on a server. Be aware, however, that this requires additional effort to create the groups and consolidate the match output. Also, matching plans split across four processors do not run four times faster than a single-processor matching plan. As a result, multi-processor matching may not significantly improve performance in every case.


Using IDQ with PowerCenter and taking advantage of PowerCenter's partitioning capabilities may also improve throughput. This approach has the advantage that splitting plans into multiple independent plans is not typically required. The following table can help in comparing estimated execution times for single-processor and multi-processor match plans.

Plan Type | Single Processor | Multiprocessor
Standardization/grouping | Depends on operations and size of data set. (Time equals Y) | Single processor time plus 20 percent. (Time equals Y * 1.20)
Matching | Est. 1 million comparisons a minute. (Time equals X) | Time for single-processor matching divided by the number of processors (NP), plus 25 percent. (Time equals [(X / NP) * 1.25])

For example, if a single-processor plan takes one hour to group and standardize the data and eight hours to match, a four-processor match plan should require approximately one hour and 12 minutes to group and standardize (60 minutes * 1.20) and two and one half hours to match ([480 minutes / 4] * 1.25). The time difference between a single- and multi-processor plan in this case would be more than five hours (i.e., nine hours for the single-processor plan versus three hours and 42 minutes for the quad-processor plan).
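The estimate can also be worked through in code. The Python sketch below is a minimal illustration with a hypothetical function name; it takes the table's multipliers literally (grouping time plus 20 percent, matching time divided by the number of processors plus 25 percent), so 20 percent of a 60-minute grouping step is 12 minutes.

```python
def estimate_minutes(group_minutes: float, match_minutes: float, processors: int):
    """Return (grouping, matching) minutes using the table's rules of thumb."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    if processors == 1:
        # Single-processor plan: no splitting or consolidation overhead.
        return group_minutes, match_minutes
    grouping = group_minutes * 1.20                 # Y * 1.20
    matching = (match_minutes / processors) * 1.25  # (X / NP) * 1.25
    return grouping, matching


# One hour of grouping, eight hours of matching, four processors.
grouping, matching = estimate_minutes(60, 480, 4)
total = grouping + matching
print(f"grouping {grouping:.0f} min, matching {matching:.0f} min, total {total:.0f} min")
```

With these inputs the grouping step comes to about 72 minutes and the matching step to 150 minutes, roughly 222 minutes in total against 540 minutes on a single processor.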

Deterministic vs. Probabilistic Comparisons
No best-practice research has yet been completed on which type of comparison is most effective at determining a match. Each method has strengths and weaknesses. A 2006 article by Forrester Research stated a preference for deterministic comparisons since they remove the burden of identifying a universal match threshold from the user. Bear in mind that IDQ supports deterministic matching operations only. However, IDQ’s Weight Based Analyzer component lets plan designers calculate weighted match scores for matched fields.

Database vs. File-Based Matching
File-based matching and database matching perform essentially the same operations. The major differences between the two methods revolve around how data is stored and how the outputs can be manipulated after matching is complete. With regard to selecting one method or the other, there are no best practice recommendations since this is largely defined by requirements. The following table outlines the strengths and weaknesses of each method:

Criterion | File-Based Method | Database Method
Ease of implementation | Easy to implement | Requires SQL knowledge
Performance | Fastest method | Slower than file-based method
Space utilization | Requires more hard-disk space | Lower hard-disk space requirement
Operating system restrictions | Possible limit to number of groups that can be created | None
Ability to control/manipulate output | Low | High

High-Volume Data Matching Techniques
This section discusses the challenges facing IDQ matching plan designers in optimizing their plans for speed of execution and quality of results. It highlights the key factors affecting matching performance and discusses the results of IDQ performance testing in single and multi-processor environments.

Checking for duplicate records where no clear connection exists among data elements is a resource-intensive activity. In order to detect matching information, a record must be compared against every other record in a dataset. For a single data source, the quantity of comparisons required to check an entire dataset grows roughly with the square of the number of records. A similar situation arises when matching between two datasets, where the number of comparisons required is the product of the volumes of data in each dataset. When the volume of data increases into the tens of millions, the number of comparisons required to identify matches, and consequently the amount of time required to check for matches, reaches impractical levels.
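The growth in comparisons can be made concrete with a short calculation. The Python sketch below uses hypothetical function names and assumes that, without grouping, each record is compared once with every other record, which gives n(n-1)/2 comparisons for a single source of n records and n * m comparisons between two sources.

```python
def single_source_comparisons(n: int) -> int:
    """Comparisons needed to check every record against every other record once."""
    return n * (n - 1) // 2


def two_source_comparisons(n: int, m: int) -> int:
    """Comparisons needed to check every record in one dataset against another."""
    return n * m


for records in (100_000, 1_000_000, 10_000_000):
    print(f"{records:>10,d} records -> {single_source_comparisons(records):>18,d} comparisons")
```

At an assumed rate of one million comparisons per minute, the one-million-record case alone would take roughly 347 days, which is why ungrouped matching is impractical at this scale.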

Approaches to High-Volume Matching
Two key factors control the time it takes to match a dataset:
● The number of comparisons required to check the data.
● The number of comparisons that can be performed per minute.

The first factor can be controlled in IDQ through grouping, which involves logically segmenting the dataset into distinct elements, or groups, so that there is a high probability that records within a group are not duplicates of records outside of the group. Grouping data greatly reduces the total number of required comparisons without affecting match accuracy. IDQ affects the number of comparisons per minute in two ways:
● Its matching components maximize the comparison activities assigned to the computer processor. This reduces the amount of disk I/O communication in the system and increases the number of comparisons per minute. Therefore, hardware with higher processor speeds has higher match throughputs.
● IDQ architecture also allows matching tasks to be broken into smaller tasks and shared across multiple processors. The use of multiple processors to handle matching operations greatly enhances IDQ scalability with regard to high-volume matching problems.
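The effect of grouping on the first factor can be illustrated numerically. The Python sketch below is an assumption-laden illustration, not IDQ behavior: it treats the cost of a group of n records as n(n-1)/2 comparisons and uses an idealized split of one million records into 1,000 equal groups.

```python
def grouped_comparisons(group_sizes) -> int:
    """Total comparisons when records are compared only within their own group."""
    return sum(n * (n - 1) // 2 for n in group_sizes)


ungrouped = grouped_comparisons([1_000_000])    # one group holding every record
grouped = grouped_comparisons([1_000] * 1_000)  # 1,000 groups of 1,000 records

print(f"ungrouped: {ungrouped:,d}")
print(f"grouped:   {grouped:,d}")
print(f"reduction: {ungrouped // grouped:,d}x")
```

Real group sizes are rarely uniform, so the reduction achieved in practice depends on how evenly the chosen group key distributes the records.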

The following section outlines how a multi-processor matching solution can be implemented and illustrates the results obtained in Informatica Corporation testing.

Multi-Processor Matching: Solution Overview
IDQ does not automatically distribute its load across multiple processors. To scale a matching plan to take advantage of a multi-processor environment, the plan designer must develop multiple plans for execution in parallel. To develop this solution, the plan designer first groups the data to prevent the plan from running low-probability comparisons. Groups are then subdivided into one or more subgroups (the number of subgroups depends on the plan being run and the number of processors in use). Each subgroup is assigned to a discrete matching plan, and the plans are executed in parallel.

The following diagram outlines how multi-processor matching can be implemented in a database model. Source data is first grouped and then subgrouped according to the number of processors available to the job. Each subgroup of data is loaded into a separate staging area, and the discrete match plans are run in parallel against each table. Results from each plan are consolidated to generate a single match result for the original source data.
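One way to decide which groups go to which parallel plan is sketched below in Python. This is a hypothetical helper, not an IDQ feature: it uses a simple greedy heuristic (largest group first, assigned to the currently least-loaded plan) and assumes the cost of a group of n records is n(n-1)/2 comparisons.

```python
import heapq


def assign_groups(group_sizes: dict[str, int], processors: int) -> list[list[str]]:
    """Distribute group keys across parallel plans, balancing estimated comparisons."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    # Heap of (estimated comparisons so far, plan index).
    heap = [(0, i) for i in range(processors)]
    plans: list[list[str]] = [[] for _ in range(processors)]
    # Place the largest groups first so small groups can even out the load.
    for key, size in sorted(group_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, index = heapq.heappop(heap)
        plans[index].append(key)
        heapq.heappush(heap, (load + size * (size - 1) // 2, index))
    return plans


# Illustrative group keys and sizes.
print(assign_groups({"A": 4, "B": 3, "C": 3, "D": 2}, processors=2))
```

Each returned list would correspond to one staging area and one discrete match plan; the match results still need to be consolidated afterwards.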

Informatica Corporation Match Plan Tests
Informatica Corporation performed match plan tests on a 2GHz Intel Xeon dual-processor server running Windows 2003 (Server edition). Two gigabytes of RAM were available. The hyper-threading ability of the Xeon processors effectively provided four CPUs on which to run the tests.

Several tests were performed using file-based and database-based matching methods and single and multiple processor methods. The tests were performed on one million rows of data. Grouping of the data limited the total number of comparisons to approximately 500,000,000.

Test results using file-based and database-based methods showed near-linear scalability as the number of available processors increased. As the number of processors increased, so too did the demand on disk I/O resources. As the processor capacity began to scale upward, disk I/O in this configuration eventually limited the benefits of adding processor capacity. This is demonstrated in the graph below.


Execution times for multiple processors were based on the longest execution time of the jobs run in parallel. Therefore, having an even distribution of records across all processors was important to maintaining scalability. When the data was not evenly distributed, some match plans ran longer than others, and the benefits of scaling over multiple processors were not as evident.

Last updated: 26-May-08 17:52


Effective Data Standardizing Techniques

Challenge
To enable users to streamline their data cleansing and standardization processes (or plans) with Informatica Data Quality (IDQ). The intent is to shorten development timelines and ensure a consistent and methodological approach to cleansing and standardizing project data.

Description
Data cleansing refers to operations that remove non-relevant information and “noise” from the content of the data. Examples of cleansing operations include the removal of person names, “care of” information, excess character spaces, or punctuation from postal addresses. Data standardization refers to operations that modify the appearance of the data so that it takes on a more uniform structure, and that enrich the data by deriving additional details from existing content.

Cleansing and Standardization Operations
Data can be transformed into a “standard” format appropriate for its business type. This is typically performed on complex data types such as name and address or product data. A data standardization operation typically profiles data by type (e.g., word, number, code) and parses data strings into discrete components. This reveals the content of the elements within the data as well as standardizing the data itself. For best results, the Data Quality Developer should carry out these steps in consultation with a member of the business. Often, this individual is the data steward, the person who best understands the nature of the data within the business scenario.
● Within IDQ, the Profile Standardizer is a powerful tool for parsing unsorted data into the correct fields. However, when using the Profile Standardizer, be aware that there is a finite number of profiles (500) that can be contained within a cleansing plan. Users can extend the number of profiles by using the first 500 profiles within one component and then feeding the data overflow into a second Profile Standardizer via the Token Parser component.

After the data is parsed and labeled, it should be evident if reference dictionaries will be needed to further standardize the data. It may take several iterations of dictionary construction and review before the data is standardized to an acceptable level. Once acceptable standardization has been achieved, data quality scorecard or dashboard reporting can be introduced. For information on dashboard reporting, see the Report Viewer chapter of the Informatica Data Quality 3.1 User Guide.


Discovering Business Rules
At this point, the business user may discover and define business rules applicable to the data. These rules should be documented and converted to logic that can be contained within a data quality plan. When building a data quality plan, be sure to group related business rules together in a single rules component whenever possible; otherwise the plan may become very difficult to read. If there are rules that do not lend themselves easily to regular IDQ components (e.g., when standardizing product data information), it may be necessary to perform some custom scripting using IDQ’s scripting component. This requirement may arise when a string or an element within a string needs to be treated as an array.

Standard and Third-Party Reference Data
Reference data can be a useful tool when standardizing data. Terms with variant formats or spellings can be standardized to a single form. IDQ installs with several reference dictionary files that cover common name and address and business terms. The illustration below shows part of a dictionary of street address suffixes.

Common Issues when Cleansing and Standardizing Data
If the customer has expectations of a bureau-style service, it may be advisable to re-emphasize the score-carding and graded-data approach to cleansing and standardizing. This helps to ensure that the customer develops reasonable expectations of what can be achieved with the data set within an agreed-upon timeframe.


Standardizing Ambiguous Data
Data values can often appear ambiguous, particularly in name and address data where name, address, and premise values can be interchangeable. For example, Hill, Park, and Church are all common surnames. In some cases, the position of the value is important. “ST” can be a suffix for street or a prefix for Saint, and sometimes they can both occur in the same string. The address string “St Patrick’s Church, Main St” can reasonably be interpreted as “Saint Patrick’s Church, Main Street.” In this case, if the delimiter is a space (thus ignoring any commas and periods), the string has five tokens. You may need to write business rules using the IDQ Scripting component, as you are treating the string as an array. St with position 1 within the string would be standardized to meaning_1, whereas St with position 5 would be standardized to meaning_2. Each data value can then be compared to a discrete prefix and suffix dictionary.
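The position-based rule described above can be sketched outside IDQ as follows. This Python example is not IDQ Scripting component code; the two small dictionaries and the rule "first token is a prefix, last token is a suffix" are simplifying assumptions made for illustration.

```python
import re

# Hypothetical one-entry prefix and suffix dictionaries.
PREFIXES = {"ST": "Saint"}
SUFFIXES = {"ST": "Street"}


def standardize_address(address: str) -> str:
    """Expand an ambiguous token according to its position in the string."""
    # Treat commas and periods as spaces, then split into an array of tokens.
    tokens = re.sub(r"[,.]", " ", address).split()
    result = []
    for position, token in enumerate(tokens, start=1):
        key = token.upper()
        if position == 1 and key in PREFIXES:
            result.append(PREFIXES[key])   # leading "St" is read as a prefix
        elif position == len(tokens) and key in SUFFIXES:
            result.append(SUFFIXES[key])   # trailing "St" is read as a suffix
        else:
            result.append(token)
    return " ".join(result)


print(standardize_address("St Patrick's Church, Main St"))
```

Because punctuation is discarded during tokenizing, the output does not preserve the original comma; a production rule set would also need to handle prefixes and suffixes that occur in the middle of a string.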

Conclusion
Using the data cleansing and standardization techniques described in this Best Practice can help an organization to recognize the value of incorporating IDQ into their development methodology. Because data quality is an iterative process, the business rules initially developed may require ongoing modification, as the results produced by IDQ will be affected by the starting condition of the data and the requirements of the business users.

When data arrives in multiple languages, it is worth creating similar IDQ plans for each country and applying the same rules across these plans. The data would typically be staged in a database, and the plans developed using a SQL statement as input, with a “where country_code= ‘DE’” clause, for example. Country dictionaries are identifiable by country code to facilitate such statements. Remember that IDQ installs with a large set of reference dictionaries and additional dictionaries are available from Informatica.

IDQ provides several components that focus on verifying and correcting the accuracy of name and postal address data. These components leverage address reference data that originates from national postal carriers such as the United States Postal Service. Such datasets enable IDQ to validate an address to premise level. Please note, the reference datasets are licensed and installed as discrete Informatica products, and thus it is important to discuss their inclusion in the project with the business in advance so as to avoid budget and installation issues. Several types of reference data, with differing levels of address granularity, are available from Informatica. Pricing for the licensing of these components may vary and should be discussed with the Informatica Account Manager.

Last updated: 01-Feb-07 18:52


Integrating Data Quality Plans with PowerCenter

Challenge
This Best Practice outlines the steps to integrate an Informatica Data Quality (IDQ) plan into a PowerCenter mapping. This document assumes that the appropriate setup and configuration of IDQ and PowerCenter have been completed as part of the software installation process and these steps are not included in this document.

Description
Preparing IDQ Plans for PowerCenter Integration
IDQ plans are typically developed and tested by executing them from Workbench. Plans running locally from Workbench can use any of the available IDQ Source and Sink components. This is not true for plans that are integrated into PowerCenter, as they can only use Source and Sink components that contain the “Enable Real-time processing” check box. Specifically, those components are CSV Source, CSV Match Source, CSV Sink, and CSV Match Sink. In addition, the Real-time Source and Sink can be used; however, they require additional setup as each field name and length must be defined. Database sources and sinks are not allowed in PowerCenter (PC) integration. When IDQ plans are integrated within a PowerCenter mapping, the source and sink need to be enabled by setting the Enable Real-time processing option on them. Consider the following points when developing a plan for integration in PC.
● If the IDQ plan was developed using a database source and/or sink, you must replace them with CSV Sink/Source or CSV Match Sink/Source.
● If the IDQ plan was developed using group sink/source (or dual group sink), you must replace them with either CSV Sink/Source or CSV Match Sink/Source, depending on the functionality you are replacing. When replacing a group sink, you must also add functionality to the PC mapping to replicate the grouping. This is done by placing a join and sort prior to the IDQ plan containing the match.
● PowerCenter only sees the input and output ports of the IDQ plan from within the PC mapping. This is driven by the input file used for the Workbench plan and the fields selected as output in the sink. If you don’t see a field after the plan is integrated in PowerCenter, it means the field is not in the input file or not selected as output.
● PowerCenter integration does not allow input ports to be selected as output if the IDQ transformation is defined as a passive transformation. If the IDQ transformation is configured as active, this is not an issue, as you must select all fields needed as output from the IDQ transformation within the sink transformation of the IDQ plan. Passive and active IDQ transformations follow the general restrictions and rules for active and passive transformations in PowerCenter.
● The delimiter of the Source and Sink must be a comma for integrated IDQ plans. Other delimiters, such as pipe, will cause an error within the PowerCenter Designer. If you encounter this error, go back to Workbench, change the delimiter to comma, save the plan, and then go back to PowerCenter Designer and perform the import of the plan again.
● For reusability of IDQ plans, use generic naming conventions for the input and output ports. For example, rather than naming the fields customer address1, customer address2, and customer city, name them address1, address2, city, etc. Thus, if the same standardization and cleansing is needed by multiple sources, you can integrate the same IDQ plan, which will reduce development time as well as ongoing maintenance.
● Use only necessary fields as input to each mapping plan. If you are working with an input file that has 50 fields and you only really need 10 fields for the IDQ plan, create a file that contains only the necessary field names, save it as a comma-delimited file, and then point to that newly created file from the source of the IDQ plan. This changes the input field reference to only those fields that must be visible in the PowerCenter integration.
● Once the source and sink are converted to real time, you cannot run the plan within Workbench, only within the PowerCenter mapping. However, you may change the check box at any time to revert to standalone processing. Be careful not to refresh the IDQ plan in the mapping within PowerCenter while real time is not enabled. If you do so, the PowerCenter mapping will display an error message and will not allow that mapping to be integrated until the real-time option is enabled again.
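Two of the points above, the comma delimiter requirement and the advice to supply only the necessary fields, can be handled when preparing the plan's input file. The Python sketch below is one possible way to do this outside IDQ using the standard csv module; the function name, file names, and the pipe-delimited source format are assumptions for illustration.

```python
import csv


def write_plan_input(src_path: str, dst_path: str, fields: list[str],
                     src_delimiter: str = "|") -> None:
    """Copy only the named fields from a delimited file into a comma-delimited file."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter=src_delimiter)
        missing = [f for f in fields if f not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"fields not found in source file: {missing}")
        # csv.DictWriter writes comma-delimited output by default;
        # extrasaction="ignore" drops every column not listed in `fields`.
        writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)


# Example call with hypothetical file names:
# write_plan_input("customers_full.txt", "customers_plan_input.csv",
#                  ["address1", "address2", "city"])
```

The generic field names in the example call follow the naming advice above, so the resulting file can serve as the source for a reusable plan.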

Integrating IDQ Plans into PowerCenter Mappings
After the IDQ Plans are converted to real time-enabled, they are ready to integrate into a PowerCenter mapping. Integrating into PowerCenter requires proper installation and configuration of the IDQ/PowerCenter integration, including:
● Making appropriate changes to environment variables (to .profile for UNIX)
● Installing IDQ on the PowerCenter server
● Running IDQ Integration and Content install on the server
● Registering IDQ plug-in via the PowerCenter Admin console
  Note: The plug-in must be registered in each repository from which an IDQ transformation is to be developed.
● Installing IDQ workbench on the workstation
● Installing IDQ Integration and Content on the workstation using the PowerCenter Designer

When all of the above steps are executed correctly, the IDQ transformation icon, shown below, is visible in the PowerCenter repository.

To integrate an IDQ plan, open the mapping, and click on the IDQ icon. Then click in the mapping workspace to insert the transformation into the mapping. The following dialog box appears:

Select Active or Passive, as appropriate. Typically, an active transformation is necessary only for a matching plan. If selecting Active, IDQ plan input needs to have all input fields passed through, as typical PowerCenter rules apply to Active and Passive transformation processing.

As the following figure illustrates, the IDQ transformation is “empty” in its initial, un-configured state. Notice all ports are currently blank; they will be populated upon import/integration of the IDQ plan.

Double-click on the title bar for the IDQ transformation to open it for editing.


Then select the far right tab, “Configuration”.

When first integrating an IDQ plan, the connection and repository displays are blank. Click the Connect button to establish a connection to the appropriate IDQ repository.

In the Host Name box, specify the name of the computer on which the IDQ repository is installed. This is usually the PowerCenter server. If the default Port Number (3306) was changed during installation, specify the correct value. Next, click Test Connection.

Note: In some cases, if the User Name has not been granted privileges on the Host server, you will not be allowed to connect. The procedure for granting privileges to the IDQ (MySQL) repository is explained at the end of this document.

When the connection is established, click the down arrow to the right of the Plan Name box, and the following dialog is displayed:

Browse to the plan you want to import, then click on the Validate button. If there is an error in the plan, a dialog box appears. For example, if the Source and Sink have not been configured correctly, the following dialog box appears.

If the plan is valid for PowerCenter integration, the following dialog is displayed.


After a valid plan has been configured, the PowerCenter ports (equivalent to the IDQ Source and Sink fields) are visible and can be connected just as in any other PowerCenter transformation.

Refreshing IDQ Plans for PowerCenter Integration
After Data Quality Plans are integrated in PowerCenter, changes made to the IDQ plan in Workbench are not reflected in the PowerCenter mapping until the plan is manually refreshed in the PowerCenter mapping. When you save an IDQ plan, it is saved in the MySQL repository. When you integrate that plan into PowerCenter, a copy of that plan is then integrated in the PowerCenter metadata; the MySQL repository and the PowerCenter repository do not communicate updates automatically. The following paragraphs detail the process for refreshing integrated IDQ plans when necessary to reflect changes made in workbench.
● Double-click the IDQ transformation in the PowerCenter mapping.
● Select the Configuration tab.
● Select Refresh. This reads the current version of the plan and refreshes it within PowerCenter.
● Select Apply. If any PowerCenter-specific errors were created when the plan was modified, an error dialog is displayed.
● Update input, output, and pass-through ports as necessary, then save the mapping in PowerCenter, and test the changes.

Saving IDQ Plans to the Appropriate Repository – MySQL Permissions
Plans that are to be integrated into PowerCenter mappings must be saved to an IDQ Repository that is visible to the PowerCenter Designer prior to integration. The usual practice is to save the plan to the IDQ repository located on the PowerCenter server.

In order for a Workbench client to save a plan to that repository, the client machine must be granted permissions to the MySQL on the server. If the client machine has not been granted access, the client receives an error message when attempting to access the server repository. The person at your organization who has login rights to the server on which IDQ is installed needs to perform this task for all users who will need to save or retrieve plans from the IDQ Server. This procedure is detailed below.
● Identify the IP address for any client machine that needs to be granted access.
● Log in to the server on which the MySQL repository is located, and log in to MySQL:
    mysql -u root
● For a user to connect to the IDQ server and to save and retrieve plans, enter the following command:
    grant all privileges on *.* to 'admin'@'<idq_client_ip>';
● For a user to integrate an IDQ plan into PowerCenter, grant the following privilege:
    grant all privileges on *.* to 'root'@'<powercenter_client_ip>';
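When several client machines need access, the grant statements can be generated rather than typed by hand. The Python sketch below is a hypothetical helper that only builds the statement text in the form shown above; it does not connect to MySQL, and the IP addresses in the example are made up.

```python
import ipaddress


def grant_statement(user: str, client_ip: str) -> str:
    """Build a grant statement for one client machine, validating the inputs first."""
    # ip_address() raises ValueError if the string is not a valid IP address.
    ipaddress.ip_address(client_ip)
    if not user.isalnum():
        raise ValueError(f"unexpected characters in user name: {user!r}")
    return f"grant all privileges on *.* to '{user}'@'{client_ip}';"


# Illustrative client addresses for Workbench users.
for ip in ["10.0.0.15", "10.0.0.16"]:
    print(grant_statement("admin", ip))
```

The printed statements can then be reviewed and run in the MySQL session by the person who holds login rights on the server.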

Last updated: 20-May-08 23:18


Managing Internal and External Reference Data

Challenge
To provide guidelines for the development and management of the reference data sources that can be used with data quality plans in Informatica Data Quality (IDQ). The goal is to ensure the smooth transition from development to production for reference data files and the plans with which they are associated.

Description
Reference data files can be used by a plan to verify or enhance the accuracy of the data inputs to the plan. A reference data file is a list of verified-correct terms and, where appropriate, acceptable variants on those terms. It may be a list of employees, package measurements, or valid postal addresses: any data set that provides an objective reference against which project data sources can be checked or corrected. Reference files are essential to some, but not all, data quality processes.

Reference data can be internal or external in origin. Internal data is specific to a particular project or client. Such data is typically generated from internal company information. It may be custom-built for the project.

External data has been sourced or purchased from outside the organization. External data is used when authoritative, independently-verified data is needed to provide the desired level of data quality to a particular aspect of the source data. Examples include the dictionary files that install with IDQ, postal address data sets that have been verified as current and complete by a national postal carrier, such as United States Postal Service, or company registration and identification information from an industry-standard source such as Dun & Bradstreet.

Reference data can be stored in a file format recognizable to Informatica Data Quality or in a format that requires intermediary (third-party) software in order to be read by Informatica applications. Internal data files, as they are often created specifically for data quality projects, are typically saved in the dictionary file format or as delimited text files, which are easily portable into dictionary format. Databases can also be used as a source for internal data. External files are more likely to remain in their original format. For example, external data may be contained in a database or in a library whose files cannot be edited or opened on the desktop to reveal discrete data values.

Working with Internal Data

Obtaining Reference Data
Most organizations already possess much information that can be used as reference data — for example, employee tax numbers or customer names. These forms of data may or may not be part of the project source data, and they may be stored in different parts of the organization.


The question arises, are internal data sources sufficiently reliable for use as reference? Bear in mind that in some cases the reference data does not need to be 100 percent accurate. It can be good enough to compare project data against reference data and to flag inconsistencies between them, particularly in cases where both sets of data are highly unlikely to share common errors.

Saving the Data in .DIC File Format
IDQ installs with a set of reference dictionaries that have been created to handle many types of business data. These dictionaries use the proprietary .DIC file name extension. DIC is an abbreviation of dictionary, and dictionary files are essentially comma-delimited text files. You can create a new dictionary in three ways:
● You can save an appropriately formatted delimited file as a .DIC file into the Dictionaries folders of your IDQ (client or server) installation.
● You can use the Dictionary Manager within Data Quality Workbench. This method allows you to create text and database dictionaries.
● You can write from plan files directly to a dictionary using the IDQ Report Viewer (see below).
The figure below shows a dictionary file open in IDQ Workbench and its underlying .DIC file open in a text editor. Note that the dictionary file has at least two columns of data. The Label column contains the correct or standardized form of each datum from the dictionary’s perspective. The Item columns contain versions of each datum that the dictionary recognizes as identical to or coterminous with the Label entry. Therefore, each datum in the dictionary must have at least two entries in the DIC file (see the text editor illustration below). A dictionary can have multiple Item columns.

To edit a dictionary value, open the DIC file and make your changes. You can make changes either through a text editor or by opening the dictionary in the Dictionary Manager.

To add a value to a dictionary, open the DIC file in Dictionary Manager, place the cursor in an empty row, and add a Label string and at least one Item string. You can also add values in a text editor by placing the cursor on a new line and typing Label and Item values separated by commas. Once saved, the dictionary is ready for use in IDQ. Note: IDQ users with database expertise can create and specify dictionaries that are linked to database tables, and that thus can be updated dynamically when the underlying data is updated. Database dictionaries are useful when the reference data has been originated for other purposes and is likely to change independently of data quality. By making use of a dynamic connection, data quality plans can always point to the current version of the reference data.

Sharing Reference Data Across the Organization
As you can publish or export plans from a local Data Quality repository to server repositories, so you can copy dictionaries across the network. The File Manager within IDQ Workbench provides an Explorer-like mechanism for moving files to other machines across the network. Bear in mind that Data Quality looks for .DIC files in pre-set locations within the IDQ installation when running a plan. By default, Data Quality relies on dictionaries being located in the following locations:
● The Dictionaries folders installed with Workbench and Server.
● The user’s file space in the Data Quality service domain.

IDQ does not recognize a dictionary file that is not in such a location, even if you can browse to the file when designing the data quality plan. Thus, any plan that uses a dictionary in a non-standard location will fail. This is most relevant when you publish or export a plan to another machine on the network. You must ensure that copies of any dictionary files used in the local plan are available in a suitable location on the service domain — in the user space on the server, or at a location in the server’s Dictionaries folders that corresponds to the dictionaries’ location on Workbench — when the plan is copied to the server-side repository. Note: You can change the locations in which IDQ looks for plan dictionaries by editing the config.xml file. However, this is the master configuration file for the product and you should not edit it without consulting Informatica Support. Bear in mind that Data Quality looks only in the locations set in the config.xml file.

Version Controlling Updates and Managing Rollout from Development to Production
Plans can be version-controlled during development in Workbench and when published to a domain repository. You can create and annotate multiple versions of a plan, and review or roll back to earlier versions when necessary. Dictionary files are not version controlled by IDQ, however. You should define a process to log changes and back up your dictionaries, using version control software if possible or a manual method otherwise. If modifications are to be made to the versions of dictionary files installed by the software, it is recommended that these modifications be made to a copy of the original file, renamed or relocated as desired. This approach avoids the risk that a subsequent installation might overwrite changes.

Database reference data can also be version controlled, although this presents difficulties if the database is very large in size. Bear in mind that third-party reference data, such as postal address data, should not ordinarily be changed, and so the need for a versioning strategy for these files is debatable.

Working with External Data

Formatting Data into Dictionary Format
External data may or may not permit the copying of data into text format — for example, external data contained in a database or in library files. Currently, third-party postal address validation data is provided to Informatica users in this manner, and IDQ leverages software from the vendor to read these files. (The third-party software has a very small footprint.) However, some software files can be amenable to data extraction to file.

Obtaining Updates for External Reference Data
External data vendors produce regular data updates, and it’s vital to refresh your external reference data when updates become available. The key advantage of external data — its reliability — is lost if you do not apply the latest files from the vendor. If you obtained third-party data through Informatica, you will be kept up to date with the latest data as it becomes available for as long as your data subscription warrants. You can check that you possess the latest versions of third-party data by contacting your Informatica Account Manager.

Managing Reference Updates and Rolling Out Across the Organization
If your organization has a reference data subscription, you will receive either regular data files on compact disc or regular information on how to download data from Informatica or vendor web sites. You must develop a strategy for distributing these updates to all parties who run plans with the external data. This may involve installing the data on machines in a service domain. Bear in mind that postal address data vendors update their offerings every two or three months, and that a significant percentage of postal addresses can change in such time periods. You should plan for the task of obtaining and distributing updates in your organization at frequent intervals. Depending on the number of IDQ installations that must be updated, updating your organization with third-party reference data can be a sizable task.

Strategies for Managing Internal and External Reference Data
Experience working with reference data leads to a series of best practice tips for creating and managing reference data files.

Using Workbench to Build Dictionaries
With IDQ Workbench, you can select data fields or columns from a dataset and save them in a dictionary-compatible format.

Let’s say you have designed a data quality plan that identifies invalid or anomalous records in a customer database. Using IDQ, you can create an exception file of these bad records, and subsequently use this file to create a dictionary-compatible file. For example, let’s say you have an exception file containing suspect or invalid customer account records. Using a very simple data quality plan, you can quickly parse the account numbers from this file to create a new text file containing the account serial numbers only. This file effectively constitutes the labels column of your dictionary. By opening this file in Microsoft Excel or a comparable program and copying the contents of Column A into Column B, and then saving the spreadsheet as a CSV file, you create a file with Label and Item1 columns. Rename the file with a .DIC suffix and add it to the Dictionaries folder of your IDQ installation: the dictionary is now visible to the IDQ Dictionary Manager. You now have a dictionary file of bad account numbers that you can use in any plans checking the validity of the organization's account records.
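The spreadsheet step above (copying Column A into Column B and saving as CSV) can also be scripted. The following is a minimal Python sketch, assuming a plain comma-delimited layout with one Label and one Item1 value per row and no header row. The function name, the file name, and the UTF-8 encoding are illustrative assumptions, not part of IDQ.

```python
import csv

def write_dictionary(labels, dic_path):
    """Write a two-column Label,Item1 file, one row per distinct label.

    Assumption: the dictionary is plain comma-delimited text with no header
    row, and Item1 simply repeats the Label (as in the Excel step above).
    """
    seen = set()
    rows = []
    for raw in labels:
        label = raw.strip()
        # Skip blank lines and repeated account numbers.
        if label and label not in seen:
            seen.add(label)
            rows.append((label, label))
    with open(dic_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return rows
```

For example, `write_dictionary(account_numbers, "bad_accounts.DIC")` would produce a file that still has to be placed in a Dictionaries folder of the IDQ installation, as described above, before a plan can use it.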

Using Report Viewer to Build Dictionaries
The IDQ Report Viewer allows you to create exception files and dictionaries on-the-fly from report data. The figure below illustrates how you can drill-down into report data, right-click on a column, and save the column data as a dictionary file. This file will be populated with Label and Item1 entries corresponding to the column data. In this case, the dictionary created is a list of serial numbers from invalid customer records (specifically, records containing bad zip codes). The plan designer can now create plans to check customer databases against these serial numbers. You can also append data to an existing dictionary file in this manner.

As a general rule, it is a best practice to follow the dictionary organization structure installed by the application, adding to that structure as necessary to accommodate specialized and supplemental dictionaries. Subsequent users are then relieved of the need to examine the config.xml file for possible modifications, thereby lowering the risk of accidental errors during migration. When following the original dictionary organization structure is not practical or contravenes other requirements, take care to document the customizations.

Since external data may be obtained from third parties and may not be in file format, the most efficient way to share its content across the organization is to locate it on the Data Quality Server machine. (Specifically, this is the machine that hosts the Execution Service.)

Moving Dictionary Files After IDQ Plans are Built
This is a similar issue to that of sharing reference data across the organization. If you must move or relocate your reference data files post-plan development, you have three options:
● You can reset the location to which IDQ looks by default for dictionary files.
● You can reconfigure the plan components that employ the dictionaries to point to the new location. Depending on the complexity of the plan concerned, this can be very labor-intensive.
● If deploying plans in a batch or scheduled task, you can append the new location to the plan execution command. You can do this by appending a parameter file to the plan execution instructions on the command line. The parameter file is an XML file that can contain a simple command to use one file path instead of another.

Last updated: 08-Feb-07 17:09

Real-Time Matching Using PowerCenter

Challenge

This Best Practice describes the rationale for matching in real-time along with the concepts and strategies used in planning for and developing a real-time matching solution. It also provides step-by-step instructions on how to build this process using Informatica’s PowerCenter and Data Quality.

The cheapest and most effective way to eliminate duplicate records from a system is to prevent them from ever being entered in the first place. Whether the data is coming from a website, an application entry, EDI feeds, messages on a queue, changes captured from a database, or other common data feeds, matching these records against existing master data allows only the new, unique records to be added.

Benefits of preventing duplicate records include:
● Better ability to service customers, with the most accurate and complete information readily available
● Reduced risk of fraud or over-exposure
● Trusted information at the source
● Less effort in BI, data warehouse, and/or migration projects

Description
Performing effective real-time matching involves multiple puzzle pieces.
1. There is a master data set (or possibly multiple master data sets) that contains clean and unique customers, prospects, suppliers, products, and/or many other types of data.
2. To interact with the master data set, there is an incoming transaction, typically thought to be a new item. This transaction can be anything from a new customer signing up on the web to a list of new products; it is anything that is assumed to be new and intended to be added to the master.
3. There must be a process to determine if a “new” item really is new or if it already exists within the master data set.

In a perfect world of consistent IDs, spellings, and representations of data across all companies and systems, checking for duplicates would simply be some sort of exact lookup into the master to see if the item already exists. Unfortunately, this is not the case, and even being creative and using %LIKE% syntax does not provide thorough results. For example, comparing Bob to Robert or GRN to Green requires a more sophisticated approach.
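To make the Bob/Robert and GRN/Green point concrete, here is a small Python sketch comparing three checks: an exact comparison, a substring test that roughly mimics a %LIKE% lookup, and a comparison after standardizing through an alias table. The alias table and function names are hypothetical stand-ins for real reference data and are not part of any Informatica product.

```python
# Hypothetical alias table; a real project would load verified reference data.
ALIASES = {"bob": "robert", "grn": "green"}

def exact_match(a, b):
    """Case-insensitive exact comparison."""
    return a.strip().lower() == b.strip().lower()

def substring_match(a, b):
    """Rough analogue of a %LIKE% test: one value contained in the other."""
    a, b = a.strip().lower(), b.strip().lower()
    return a in b or b in a

def standardized_match(a, b):
    """Map each value to its standard form first, then compare exactly."""
    def canon(value):
        value = value.strip().lower()
        return ALIASES.get(value, value)
    return canon(a) == canon(b)
```

Both `exact_match` and `substring_match` reject the pairs Bob/Robert and GRN/Green, while `standardized_match` accepts them, which is why standardization against reference data precedes matching in the approach described here.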

Standardizing Data in Advance of Matching
The first prerequisite for successful matching is to cleanse and standardize the master data set. This process requires well-defined rules for important attributes. Applying these rules to the data should result in complete, consistent, conformant, valid data, which really means trusted data. These rules should also be reusable so they can be used with the incoming transaction data prior to matching. The more compromises made in the quality of master data by failing to cleanse and standardize, the more effort will need to be put into the matching logic, and the less value the organization will derive from it. There will be many more chances of missed matches allowing duplicates to enter the system.

Once the master data is cleansed, the next step is to develop criteria for candidate selection. For efficient matching, there is no need to compare records that are so dissimilar that they cannot meet the business rules for matching. On the other hand, the set of candidates must be sufficiently broad to minimize the chance that similar records will not be compared. For example, when matching consumer data on name and address, it may be sensible to limit the candidate records pulled to those having the same zip code and the same first letter of the last name, because we can reason that if those elements are different between two records, those two records will not match.

There also may be cases where multiple candidate sets are needed. This would be the case if there are multiple sets of match rules that the two records will be compared against. Adding to the previous example, think of matching on name and address for one set of match rules and name and phone for a second. The second rule set would require selecting records from the master that have the same phone number and first letter of the last name.

Once the candidate selection process is resolved, the matching logic can be developed. This can consist of matching one to many elements of the input record to each candidate pulled from the master. Once the data is compared, each pair of records, one input and one candidate, will have a match score or a series of match scores. Scores below a certain threshold can then be discarded and potential matches can be output or displayed.

The full real-time match process flow includes:
1. The input record comes into the server.
2. The server standardizes the incoming record and retrieves candidate records from the master data source that could match the incoming record.
3. Match pairs are generated, one for each candidate, consisting of the incoming record and the candidate.
4. The match pairs go through the matching logic, resulting in a match score.
5. Records with a match score below a given threshold are discarded.
6. The returned result set consists of the candidates that are potential matches to the incoming record.
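The six-step flow above can be sketched outside of PowerCenter to show how the pieces fit together. This Python outline is an illustration only: the field names, the upper-casing "standardization", and the use of `difflib.SequenceMatcher` as a scoring function are assumptions standing in for the real cleansing rules and IDQ match components.

```python
from difflib import SequenceMatcher

def standardize(record):
    """Trim and upper-case every field (stand-in for real cleansing rules)."""
    return {key: str(value).strip().upper() for key, value in record.items()}

def candidate_key(record):
    """Zip code plus first letter of the last name, as in the earlier example."""
    return record["zip"] + record["last_name"][:1]

def match_score(left, right):
    """Average string similarity over a few fields, between 0.0 and 1.0."""
    fields = ("first_name", "last_name", "address")
    ratios = [SequenceMatcher(None, left[f], right[f]).ratio() for f in fields]
    return sum(ratios) / len(ratios)

def find_matches(incoming, master, threshold=0.80):
    # Steps 1-2: receive and standardize the input, then pull candidates by key.
    incoming = standardize(incoming)
    key = candidate_key(incoming)
    candidates = [standardize(row) for row in master]
    candidates = [row for row in candidates if candidate_key(row) == key]
    # Step 3: one match pair per candidate.
    pairs = [(incoming, row) for row in candidates]
    # Step 4: score each pair.
    scored = [(match_score(a, b), b) for a, b in pairs]
    # Step 5: discard pairs below the threshold.
    kept = [(score, row) for score, row in scored if score >= threshold]
    # Step 6: return the potential matches, best first.
    return sorted(kept, key=lambda item: item[0], reverse=True)
```

In a production design the candidate lookup would be an indexed database query rather than a scan of an in-memory list; the list is used here only to keep the sketch self-contained.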

Developing an Effective Candidate Selection Strategy
Determining which records from the master should be compared with the incoming record is a critical decision in an effective real-time matching system. For most organizations it is not realistic to match an incoming record to all master records. Consider even a modest customer master data set with one million records; the amount of processing, and thus the wait in real-time, would be unacceptable.

Candidate selection for real-time matching is analogous to grouping or blocking for batch matching. The goal of candidate selection is to select only that subset of the records from the master that are definitively related by a field, part of a field, or combination of multiple parts/fields. The selection is done using a candidate key or group key. Ideally this key would be constructed and stored in an indexed field within the master table(s), allowing for the quickest retrieval. There are many instances where multiple keys are used to allow for one key to be missing or different, while another pulls in the record as a candidate.

What specific data elements the candidate key should consist of very much depends on the scenario and the match rules. The one common theme with candidate keys is that the data elements used should have the highest levels of completeness and validity possible. It is also best to use elements that can be verified as valid, such as a postal code or a National ID. The table below lists multiple common matching elements and how group keys could be used around the data.

The ideal size of the candidate record sets, for sub-second response times, should be under 300 records. For acceptable two to three second response times, candidate record counts should be kept under 5000 records.
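As an illustration of storing the group key in an indexed column, the following Python sketch uses the standard-library `sqlite3` module with an in-memory database. The table name, column names, and key definition (zip code plus first letter of the last name) are hypothetical, and SQLite simply stands in for whatever database hosts the master data.

```python
import sqlite3

def build_master(rows):
    """Create an in-memory master table with an indexed group_key column.

    rows is an iterable of (last_name, zip_code) tuples; the key definition
    used here is an assumption made for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_master ("
        "id INTEGER PRIMARY KEY, last_name TEXT, zip TEXT, group_key TEXT)"
    )
    conn.executemany(
        "INSERT INTO customer_master (last_name, zip, group_key) VALUES (?, ?, ?)",
        [(last, zip_code, zip_code + last[:1].upper()) for last, zip_code in rows],
    )
    # The index is what keeps candidate retrieval fast as the master grows.
    conn.execute("CREATE INDEX idx_group_key ON customer_master (group_key)")
    return conn

def fetch_candidates(conn, group_key, limit=300):
    """Return candidate rows for one group key, capped at `limit` rows."""
    cursor = conn.execute(
        "SELECT id, last_name, zip FROM customer_master "
        "WHERE group_key = ? ORDER BY id LIMIT ?",
        (group_key, limit),
    )
    return cursor.fetchall()
```

The default cap of 300 echoes the sub-second guidance above, but a hard cap silently drops candidates; if a key routinely returns more rows than the cap, a narrower key definition is the better remedy.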

Step by Step Development
The following instructions further explain the steps for building a solution to real-time matching using the Informatica suite. They involve the following applications:
● Informatica PowerCenter 8.5.1 - utilizing Web Services Hub
● Informatica Data Explorer 5.0 SP4
● Informatica Data Quality 8.5 SP1 - utilizing North American Country Pack
● SQL Server 2000

Scenario:
● A customer master file is provided with the following structure
● In this scenario, we are performing a name and address match
● Because address is part of the match, we will use the recommended address grouping strategy for our candidate key (see Table 1)
● The desire is that different applications from the business will be able to make a web service call to determine if the data entry represents a new customer or an existing customer

Solution:
1. The first step is to analyze the customer master file. Assume that this analysis shows the postcode field is complete for all records and the majority of it is of high accuracy. Assume also that neither the first name nor the last name field is completely populated; thus the match rules must account for blank names.
2. The next step is to load the customer master file into the database. Below is a list of tasks that should be implemented in the mapping that loads the customer master data into the database:
● Standardize and validate the address, outputting the discrete address components such as house number, street name, street type, directional, and suite number. (Pre-built mapplet to do this; country pack)
● Generate the candidate key field, populate that with the selected strategy (assume it is the first 3 characters of the zip, house number, and the first character of street name), and generate an index on that field. (Expression, output of previous mapplet, hint: substr(in_ZIPCODE, 0, 3)||in_HOUSE_NUMBER||substr(in_STREET_NAME, 0, 1))
● Standardize the phone number. (Pre-built mapplet to do this; country pack)
● Parse the name field into individual fields. Although the data structure indicates names are already parsed into first, middle, and last, assume there are examples where the names are not properly fielded. Also remember to output a value to handle nicknames. (Pre-built mapplet to do this; country pack)
● Once complete, your customer master table should look something like this:
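Separately from the table layout, the candidate key strategy from step 2 (first 3 characters of the zip, then the house number, then the first character of the street name) can be prototyped in Python to check expected key values before building the Expression transformation. The function name, the trimming, the upper-casing, and the treatment of missing values as empty strings are assumptions of this sketch, not behavior taken from the PowerCenter expression.

```python
def build_candidate_key(zip_code, house_number, street_name):
    """First 3 characters of the zip + house number + first character of street name.

    Assumptions: inputs are trimmed, the street initial is upper-cased, and a
    missing value (None) contributes nothing to the key.
    """
    zip_part = (zip_code or "").strip()[:3]
    house_part = (house_number or "").strip()
    street_part = (street_name or "").strip()[:1].upper()
    return f"{zip_part}{house_part}{street_part}"
```

Whatever rules are chosen, the same logic has to be applied both when loading the master and when processing the incoming record, otherwise the keys will not line up.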

3. Now that the customer master has been loaded, a Web Service mapping must be created to handle real-time matching. For this project, assume that the incoming record will include a full name field, address, city, state, zip, and a phone number. All fields will be free-form text. Since we are providing the Service, we will be using a Web Service Provider source and target. Follow these steps to build the source and target definitions.
● Within PowerCenter Designer, go to the source analyzer and select the source menu. From there select Web Service Provider and then Create Web Service Definition.
● You will see a screen like the one below where the Service can be named and input and output ports can be created. Since this is a matching scenario, the potential that multiple records will be returned must be taken into account. Select the Multiple Occurring Elements checkbox for the output ports section. Also add a match score output field to return the percentage at which the input record matches the different potential matching records from the master.
● Both the source and target should now be present in the project folder.

4. An IDQ match plan must be built to use within the mapping. In developing a plan for real-time, the most significant difference from a similar match plan designed for use in IDQ standalone is the use of a CSV source and CSV sink, both enabled for real-time. The source will have the _1 and the _2 fields that a Group Source would supply built into it, e.g. Firstname_1 & Firstname_2. Another difference from batch matching in PowerCenter is that the DQ transformation can be set to passive. The following steps illustrate converting the North America Country Pack’s Individual Name and Address Match Plan from a plan built for use in a batch mapping to a plan built for use in a real-time mapping.
● Open the DCM_NorthAmerica project and from within the Match folder make a copy of the “Individual Name and Address Match” plan. Rename it to “RT Individual Name and Address Match”.
● Create a new stub CSV file with only the header row. This will be used to generate a new CSV Source within the plan. This header must use all of the input fields used by the plan before modification. For convenience, a sample stub header is listed below. The header for the stub file will duplicate all of the fields, with one set having a suffix of _1 and the other _2.
IN_GROUP_KEY_1,IN_FIRSTNAME_1,IN_FIRSTNAME_ALT_1,IN_MIDNAME_1,IN_LASTNAME_1,IN_POSTNAME_1,IN_HOUSE_NUM_1,IN_STREET_NAME_1,IN_DIRECTIONAL_1,IN_ADDRESS2_1,IN_SUITE_NUM_1,IN_CITY_1,IN_STATE_1,IN_POSTAL_CODE_1,IN_GROUP_KEY_2,IN_FIRSTNAME_2,IN_FIRSTNAME_ALT_2,IN_MIDNAME_2,IN_LASTNAME_2,IN_POSTNAME_2,IN_HOUSE_NUM_2,IN_STREET_NAME_2,IN_DIRECTIONAL_2,IN_ADDRESS2_2,IN_CITY_2,IN_STATE_2,IN_POSTAL_CODE_2
● Now delete the CSV Match Source from the plan and add a new CSV Source, and point it at the new stub file.
● Because the components were originally mapped to the CSV Match Source and that was deleted, the fields within your plan need to be reselected. As you open the different match components and RBAs, you can see the different instances that need to be reselected as they appear with a red diamond, as seen below.
● Also delete the CSV Match Sink and replace it with a CSV Sink. Only the match score field(s) must be selected for output. This plan will be imported into a passive transformation. Consequently, data can be passed around it and does not need to be carried through the transformation. With this implementation you can output multiple match scores so it is possible to see why two records matched or didn’t match on a field by field basis.
● Select the check box for Enable Real-time Processing in both the source and the sink and the plan will be ready to be imported into PowerCenter.
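The stub CSV file described in step 4 (a header row only, with every plan input field duplicated under _1 and _2 suffixes) can be generated rather than typed by hand. The Python sketch below assumes the caller supplies the list of base field names used by the plan; the function names and the output file name in the usage note are illustrative.

```python
import csv

def stub_header(fields):
    """Return every field twice: first with suffix _1, then with suffix _2."""
    return [f"{name}_1" for name in fields] + [f"{name}_2" for name in fields]

def write_stub_csv(path, fields):
    """Write a CSV file containing only the header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(stub_header(fields))
```

For example, `write_stub_csv("rt_match_stub.csv", ["IN_GROUP_KEY", "IN_FIRSTNAME", "IN_LASTNAME"])` writes a single line listing the three _1 fields followed by the three _2 fields. The field list passed in should be taken from the plan itself, since the header must cover all inputs the plan used before modification.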

5. The mapping will consist of:
a. The source and target previously generated
b. An IDQ transformation importing the plan just built
c. The same IDQ cleansing and standardization transformations used to load the master data (refer to step 2 for specifics)
d. An Expression transformation to generate the group key and build a single directional field
e. A SQL transformation to get the candidate records from the master table
f. A Filter transformation to filter out those records whose match score falls below a certain threshold
g. A Sequence Generator transformation to build a unique key for each matching record returned in the SOAP response

● Within PowerCenter Designer, create a new mapping and drag the web service source and target previously created into the mapping.
● Add the following country pack mapplets to standardize and validate the incoming record from the web service:
  ○ mplt_dq_p_Personal_Name_Standardization_FML
  ○ mplt_dq_p_USA_Address_Validation
  ○ mplt_dq_p_USA_Phone_Standardization_Validation
● Add an Expression Transformation and build the candidate key from the Address Validation mapplet output fields. Remember to use the same logic as in the mapping that loaded the customer master. Also within the expression, concatenate the pre and post directional field into a single directional field for matching purposes.
● Add a SQL transformation to the mapping. The SQL transform will present a dialog box with a few questions related to the SQL transformation. For this example select Query mode, MS SQL Server (change as desired), and a Static connection. For details on the other options refer to the PowerCenter help.
● Connect all necessary fields from the source qualifier, DQ mapplets, and Expression transformation to the SQL transformation. These fields should include:
  ○ XPK_n4_Envelope (This is the Web Service message key)
  ○ Parsed name elements
  ○ Standardized and parsed address elements, which will be used for matching
  ○ Standardized phone number
● The next step is to build the query from within the SQL transformation to select the candidate records. Make sure that the output fields agree with the query in number, name, and type.

The output of the SQL transform will be the incoming customer record along with the candidate record. These will be stacked records where the Input/Output fields will represent the input record and the Output only fields will represent the Candidate record. A simple example of this is shown in the table below where a single incoming record will be paired with two candidate records:

● Comparing the new record to the candidates is done by embedding the IDQ plan converted in step 4 into the mapping through the use of the Data Quality transformation. When this transformation is created, select passive as the transformation type.
● The output of the Data Quality transformation will be a match score. This match score will be in a float type format between 0.0 and 1.0.
● Using a filter transformation, all records that have a match score below a certain threshold will get filtered off. For this scenario, the cut-off will be 80%. (Hint: TO_FLOAT(out_match_score) >= .80)
● Any record coming out of the filter transformation is a potential match that meets or exceeds the specified threshold, and the record will be included in the response. Each of these records needs a new Unique ID so the Sequence Generator transformation will be used.
● To complete the mapping, the outputs of the Filter and Sequence Generator transformations need to be mapped to the target. Make sure to map the input primary key field (XPK_n4_Envelope_output) to the primary key field of the envelope group in the target (XPK_n4_Envelope) and to the foreign key of the response element group in the target (FK_n4_Envelope). Map the output of the Sequence Generator to the primary key field of the response element group. The mapping should look like this:

6. Before testing the mapping, create a workflow.
● Using the Workflow Manager, generate a new workflow and session for this mapping using all the defaults.
● Once created, edit the session task. On the Mapping tab select the SQL transformation and make sure the connection type is relational. Also make sure to select the proper connection. For more advanced tweaking and web service settings see the PowerCenter documentation.
● The final step is to expose this workflow as a Web Service. This is done by editing the Workflow. The workflow needs to be Web Services enabled and this is done by selecting the enabled checkbox for Web Services. Once the Web Service is enabled, it should be configured. For all the specific details of this please refer to the PowerCenter documentation, but for the purpose of this scenario:
a. Give the service the name you would like to see exposed to the outside world

Last updated: 26-May-08 12:57

Testing Data Quality Plans

Challenge
To provide a guide for testing data quality processes or plans created using Informatica Data Quality (IDQ) and to manage some of the unique complexities associated with data quality plans.

Description
Testing data quality plans is an iterative process that occurs as part of the Design Phase of Velocity. Plan testing often precedes the project’s main testing activities, as the tested plan outputs will be used as inputs in the Build Phase. It is not necessary to formally test the plans used in the Analyze Phase of Velocity.

The development of data quality plans typically follows a prototyping methodology of create, execute, analyze. Testing is performed as part of the third step, in order to determine that the plans are being developed in accordance with design and project requirements. This method of iterative testing helps support rapid identification and resolution of bugs.

Bear in mind that data quality plans are designed to analyze and resolve data content issues. These are not typically cut-and-dried problems, but more often represent a continuum of data improvement issues where it is possible that every data instance is unique and there is a target level of data quality rather than a “right or wrong answer”. Data quality plans tend to resolve problems in terms of percentages and probabilities that a problem is fixed. For example, the project may set a target of 95 percent accuracy in its customer addresses. The acceptable level of inaccuracy is also likely to change over time, based upon the importance of a given data field to the underlying business process. In addition, accuracy should continuously improve as the data quality rules are applied and the existing data sets adhere to a higher standard of quality.

Common Questions in Data Quality Plan Testing
● What dataset will you use to test the plans? While the ideal situation is to use a data set that exactly mimics the project production data, you may not gain access to this data. If you obtain a full cloned set of the project data for testing purposes, bear in mind that some plans (specifically some data matching plans) can take several hours to complete. Consider testing data matching plans overnight.

● Are the plans using reference dictionaries? Reference dictionary management is an important factor since it is possible to make changes to a reference dictionary independently of IDQ and without making any changes to the plan itself. When you pass an IDQ plan as tested, you must ensure that no additional work is carried out on any dictionaries referenced in the plan. Moreover, you must ensure that the dictionary files reside in locations that are valid for IDQ.

● How will the plans be executed? Will they be executed on a remote IDQ Server and/or via a scheduler? In cases like these, it’s vital to ensure that your plan resources, including source data files and reference data files, are in valid locations for use by the Data Quality engine. For details on the local and remote locations in which IDQ looks for source and reference data files, refer to the Informatica Data Quality 8.5 User Guide.

● Will the plans be integrated into a PowerCenter transformation? If so, the plans must have real-time enabled data source and sink components.

Strategies for Testing Data Quality Plans
The best practice steps for testing plans can be grouped under two headings.

Testing to Validate Rules
1. Identify a small, representative sample of source data.
2. To determine the results to expect when the plans are run, manually process the data based on the rules for profiling, standardization or matching that the plans will apply.
3. Execute the plans on the test dataset and validate the plan results against the manually-derived results.
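
The three steps above can be sketched as a simple comparison of plan output against manually derived results. This is only an illustration under stated assumptions, not an IDQ feature: the record layout, the `match_rate` helper, and the sample values are hypothetical, and the 95 percent figure reuses the example target mentioned earlier in this best practice.

```python
def match_rate(expected, actual):
    """Return the fraction of record IDs whose plan output equals the manual result."""
    if not expected:
        raise ValueError("expected results must not be empty")
    matches = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matches / len(expected)

# Hypothetical manually derived results (step 2), keyed by record ID.
expected = {
    1: "100 CARDINAL WAY",
    2: "12 MAIN ST",
    3: "7 OAK AVE",
    4: "55 ELM RD",
}
# Hypothetical plan output for the same records (step 3).
actual = {
    1: "100 CARDINAL WAY",
    2: "12 MAIN ST",
    3: "7 OAK AVENUE",
    4: "55 ELM RD",
}

TARGET = 0.95  # assumed project target, taken from the 95 percent example
rate = match_rate(expected, actual)
mismatches = sorted(k for k, v in expected.items() if actual.get(k) != v)
print(f"match rate: {rate:.0%}, target met: {rate >= TARGET}, review records: {mismatches}")
```

Listing the mismatched record IDs, rather than only the rate, makes it easier to decide whether the plan or the manually derived expectation needs to change.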

Testing to Validate Plan Effectiveness
This process is concerned with establishing that a data enhancement plan has been properly designed; that is, that the plan delivers the required improvements in data quality. This is largely a matter of comparing the business and project requirements for data quality and establishing if the plans are on course to deliver these. If not, the plans may need a thorough redesign, or the business and project targets may need to be revised. In either case, discussions should be held with the key business stakeholders to review the results of the IDQ plan and determine the appropriate course of action. In addition, once the entire data set is processed against the business rules, there may be other data anomalies that were unaccounted for that may require additional modifications to the underlying business rules and IDQ plans.

Last updated: 05-Dec-07 16:02


Tuning Data Quality Plans

Challenge
This document describes the considerations and issues a user needs to be aware of when making changes to data quality processes defined in Informatica Data Quality (IDQ). In IDQ, data quality processes are called plans. The principal focus of this best practice is how to tune your plans without adversely affecting the plan logic. This best practice is not intended to replace training materials but to serve as a guide for decision making in the areas of adding, removing or changing the operational components that comprise a data quality plan.

Description
You should consider the following questions prior to making changes to a data quality plan:
● What is the purpose of changing the plan? You should consider changing a plan if you believe the plan is not optimally configured, or the plan is not functioning properly and there is a problem at execution time, or the plan is not delivering expected results as per the plan design principles.

● Are you trained to change the plan? Data quality plans can be complex. You should not alter a plan unless you have been trained or are highly experienced with IDQ methodology.

● Is the plan properly documented? You should ensure all plan documentation on the data flow and the data components is up-to-date. For guidelines on documenting IDQ plans, see the Sample Deliverable Data Quality Plan Design.

● Have you backed up the plan before editing? If you are using IDQ in a client-server environment, you can create a baseline version of the plan using IDQ version control functionality. In addition, you should copy the plan to a new project folder (viz., Work_Folder) in the Workbench for changing and testing, and leave the original plan untouched during testing.

● Is the plan operating directly on production data? This applies especially to standardization plans. When editing a plan, always work on staged data (database or flat-file). You can later migrate the plan to the production environment after complete and thorough testing.

You should have a clear goal whenever you plan to change an existing plan. An event may prompt the change: for example, input data changing (in format or content), or changes in business rules or business/project targets. You should take into account all current change-management procedures, and the updated plans should be thoroughly tested before production processes are updated. This includes integration and regression testing too. (See also Testing Data Quality Plans.) Bear in mind that at a high level there are two types of data quality plans: data analysis and data enhancement plans.
● Data analysis plans produce reports on data patterns and data quality across the input data. The key objective in data analysis is to determine the levels of completeness, conformity, and consistency in the dataset. In pursuing these objectives, data analysis plans can also identify cases of missing, inaccurate or “noisy” data.

● Data enhancement plans correct completeness, conformity and consistency problems; they can also identify duplicate data entries and fix accuracy issues through the use of reference data.

Your goal in a data analysis plan is to discover the quality and usability of your data. It is not necessarily your goal to obtain the best scores for your data. Your goal in a data enhancement plan is to resolve the data quality issues discovered in the data analysis.

Adding Components
In general, simply adding a component to a plan is not likely to directly affect results if no further changes are made to the plan. However, once the outputs from the new component are integrated into existing components, the data process flow is changed and the plan must be re-tested and results reviewed in detail before migrating the plan into production.

Bear in mind, particularly in data analysis plans, that improved plan statistics do not always mean that the plan is performing better. It is possible to configure a plan that moves “beyond the point of truth” by focusing on certain data elements and excluding others.

When added to existing plans, some components have a larger impact than others. For example, adding a “To Upper” component to convert text into upper case may not cause the plan results to change meaningfully, although the presentation of the output data will change. However, adding and integrating a Rule Based Analyzer component (designed to apply business rules) may cause a severe impact, as the rules are likely to change the plan logic.

As well as adding a new component (that is, a new icon) to the plan, you can add a new instance to an existing component. This can have the same effect as adding and integrating a new component icon. To avoid overloading a plan with too many components, it is a good practice to add multiple instances to a single component, within reason. Good plan design suggests that instances within a single component should be logically similar and work on the selected inputs in similar ways. The overall name for the component should also be changed to reflect the logic of the instances contained in the component.

If you add a new instance to a component, and that instance behaves very differently from the other instances in that component (for example, if it acts on an unrelated set of outputs or performs an unrelated type of action on the data), you should probably add a new component for this instance. This will also help you keep track of your changes onscreen.

To avoid making plans over-complicated, it is often a good practice to split tasks into multiple plans where a large number of data quality measures need to be checked. This makes plans and business rules easier to maintain and provides a good framework for future development. For example, in an environment where a large number of attributes must be evaluated against the six standard data quality criteria (i.e., completeness, conformity, consistency, accuracy, duplication and consolidation), using one plan per data quality criterion may be a good way to move forward. Alternatively, splitting plans up by data entity may be advantageous. Similarly, during standardization, you can create plans for specific function areas (e.g., address, product, or name) as opposed to adding all standardization tasks to a single large plan. For more information on the six standard data quality criteria, see Data Cleansing.

Removing Components
Removing a component from a plan is likely to have a major impact since, in most cases, data flow in the plan will be broken. If you remove an integrated component, configuration changes will be required to all components that use the outputs from the component. The plan cannot run without these configuration changes being completed. The only exceptions to this case are when the output(s) of the removed component are used solely by a CSV Sink component or by a frequency component. However, in these cases, you must note that the plan output changes since the column(s) no longer appear in the result set.
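
Before removing a component, it can help to list which other components consume its outputs. The sketch below models a plan as a plain dictionary purely for illustration; IDQ does not expose plans in this form, and the component names, field names, and the `dependents_of` helper are all assumptions.

```python
# Hypothetical model of a plan: each component lists the outputs it produces
# and the inputs it consumes. Names are illustrative, not IDQ identifiers.
plan = {
    "CSV Source": {"inputs": [], "outputs": ["name", "address"]},
    "To Upper": {"inputs": ["address"], "outputs": ["address_upper"]},
    "Rule Based Analyzer": {"inputs": ["address_upper"], "outputs": ["address_score"]},
    "CSV Sink": {"inputs": ["name", "address_upper", "address_score"], "outputs": []},
}

def dependents_of(plan, removed):
    """List components that consume at least one output of the removed component."""
    outputs = set(plan[removed]["outputs"])
    return sorted(
        name
        for name, component in plan.items()
        if name != removed and outputs & set(component["inputs"])
    )

# Components that would need reconfiguring if "To Upper" were removed.
affected = dependents_of(plan, "To Upper")
print(affected)
```

In this toy plan, every component returned would need a configuration change, or at least a review of its output columns, before the plan could be run again.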


Editing Component Configurations
Changing the configuration of a component can have a comparable impact on the overall plan as adding or removing a component – the plan’s logic changes, and therefore, so do the results that it produces. However, although adding or removing a component may make a plan non-executable, changing the configuration of a component can impact the results in more subtle ways. For example, changing the reference dictionary used by a parsing component does not “break” a plan, but may have a major impact on the resulting output. Similarly, changing the name of a component instance output does not break a plan. By default, component output names “cascade” through the other components in the plan, so when you change an output name, all subsequent components automatically update with the new output name. It is not necessary to change the configuration of dependent components.

Last updated: 26-May-08 11:12


Using Data Explorer for Data Discovery and Analysis

Challenge
To understand and make full use of Informatica Data Explorer’s potential to profile and define mappings for your project data. Data profiling and mapping provide a firm foundation for virtually any project involving data movement, migration, consolidation or integration, from data warehouse/data mart development, ERP migrations, and enterprise application integration to CRM initiatives and B2B integration. These types of projects rely on an accurate understanding of the true structure of the source data in order to correctly transform the data for a given target database design. However, the data’s actual form rarely coincides with its documented or supposed form. The key to success for data-related projects is to fully understand the data as it actually is, before attempting to cleanse, transform, integrate, mine, or otherwise operate on it. Informatica Data Explorer is a key tool for this purpose. This Best Practice describes how to use Informatica Data Explorer (IDE) in data profiling and mapping scenarios.

Description
Data profiling and data mapping involve a combination of automated and human analyses to reveal the quality, content and structure of project data sources. Data profiling analyzes several aspects of data structure and content, including characteristics of each column or field, the relationships between fields, and the commonality of data values between fields, which is often an indicator of redundant data.

Data Profiling
Data profiling involves the explicit analysis of source data and the comparison of observed data characteristics against data quality standards. Data quality and integrity issues include invalid values, multiple formats within a field, non-atomic fields (such as long address strings), duplicate entities, cryptic field names, and others. Quality standards may either be the native rules expressed in the source data’s metadata, or an external standard (e.g., corporate, industry, or government) to which the source data must be mapped in order to be assessed.


Data profiling in IDE is based on two main processes:
● Inference of characteristics from the data
● Comparison of those characteristics with specified standards, as an assessment of data quality

Data mapping involves establishing relationships among data elements in various data structures or sources, in terms of how the same information is expressed or stored in different ways in different sources. By performing these processes early in a data project, IT organizations can preempt the “code/load/explode” syndrome, wherein a project fails at the load stage because the data is not in the anticipated form. Data profiling and mapping are fundamental techniques applicable to virtually any project. The following figure summarizes and abstracts these scenarios into a single depiction of the IDE solution.

The overall process flow for the IDE Solution is as follows:

1. Data and metadata are prepared and imported into IDE.
2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas.
3. The resultant metadata are exported to and managed in the IDE Repository.
4. In a derived-target scenario, the project team designs the target database by modeling the existing data sources and then modifying the model as required to meet current business and performance requirements. In this scenario, IDE is used to develop the normalized schema into a target database. The normalized and target schemas are then exported to IDE’s FTM/XML tool, which documents transformation requirements between fields in the source, normalized, and target schemas.
OR
5. In a fixed-target scenario, the design of the target database is a given (i.e., because another organization is responsible for developing it, or because an off-the-shelf package or industry standard is to be used). In this scenario, the schema development process is bypassed. Instead, FTM/XML is used to map the source data fields to the corresponding fields in an externally-specified target schema, and to document transformation requirements between fields in the normalized and target schemas. FTM is used for SQL-based metadata structures, and FTM/XML is used to map SQL and/or XML-based metadata structures. Externally specified targets are typical for ERP package migrations, business-to-business integration projects, or situations where a data modeling team is independently designing the target schema.
6. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE applications.

IDE's Methods of Data Profiling
IDE employs three methods of data profiling:

● Column profiling - infers metadata from the data for a column or set of columns. IDE infers both the most likely metadata and alternate metadata which is consistent with the data.

● Table Structural profiling - uses the sample data to infer relationships among the columns in a table. This process can discover primary and foreign keys, functional dependencies, and sub-tables.

● Cross-Table profiling - determines the overlap of values across a set of columns, which may come from multiple tables.
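
To make two of these methods concrete, the sketch below shows a very reduced form of column profiling (inferring metadata from observed values) and cross-table profiling (measuring value overlap between two columns). It is not how IDE implements profiling; the function names, the inferred properties, and the sample values are assumptions chosen for illustration, and table structural profiling is not covered.

```python
import re

def profile_column(values):
    """Infer simple metadata for one column from its observed values."""
    non_null = [v for v in values if v not in (None, "")]
    if non_null and all(re.fullmatch(r"-?\d+", v) for v in non_null):
        inferred = "integer"
    else:
        inferred = "string"
    return {
        "inferred_type": inferred,
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "max_length": max((len(v) for v in non_null), default=0),
        "is_unique": len(set(non_null)) == len(non_null),  # candidate key hint
    }

def value_overlap(left, right):
    """Return the fraction of distinct values in `left` that also occur in `right`."""
    left_set, right_set = set(left), set(right)
    if not left_set:
        return 0.0
    return len(left_set & right_set) / len(left_set)

# Hypothetical columns from two tables.
customer_ids = ["101", "102", "103", "104"]
order_customer_ids = ["101", "101", "103", "999"]

print(profile_column(customer_ids))
print(value_overlap(order_customer_ids, customer_ids))
```

A high overlap between a column in one table and a unique column in another hints at a foreign key relationship, while values that do not overlap (such as `999` here) are candidates for cleansing or further investigation.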


Profiling against external standards requires that the data source be mapped to the standard before being assessed (as shown in the following figure). Note that the mapping is performed by IDE’s Fixed Target Mapping tool (FTM). IDE can also be used in the development and application of corporate standards, making them relevant to existing systems as well as to new systems.

Data profiling projects may also involve iterative profiling and cleansing, since data cleansing may improve the quality of the results obtained through dependency and redundancy profiling. Note that Informatica Data Quality should be considered as an alternative tool for data cleansing.

IDE and Fixed-Target Migration
Fixed-target migration projects involve the conversion and migration of data from one or more sources to an externally defined or fixed-target. IDE is used to profile the data and develop a normalized schema representing the data source(s), while IDE’s Fixed Target Mapping tool (FTM) is used to map from the normalized schema to the fixed target. The general sequence of activities for a fixed-target migration project, as shown in the figure below, is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.
3. The resultant metadata are exported to and managed by the IDE Repository.
4. FTM maps the source data fields to the corresponding fields in an externally specified target schema, and documents transformation requirements between fields in the normalized and target schemas. Externally specified targets are typical for ERP migrations or projects where a data modeling team is independently designing the target schema.
5. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE and FTM.
6. The cleansing, transformation, and formatting specs can be used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

The following screen shot shows how IDE can be used to generate a suggested normalized schema, which may discover ‘hidden’ tables within tables.


Depending on the staging architecture used, IDE can generate the data definition language (DDL) needed to establish several of the staging databases between the sources and target, as shown below:

Derived-Target Migration
Derived-target migration projects involve the conversion and migration of data from one or more sources to a target database defined by the migration team. IDE is used to profile the data and develop a normalized schema representing the data source(s), and to further develop the normalized schema into a target schema by adding tables and/or fields, eliminating unused tables and/or fields, changing the relational structure, and/or denormalizing the schema to enhance performance. When the target schema is developed from the normalized schema within IDE, the product automatically maintains the mappings from the source to normalized schema, and from the normalized to target schemas. The figure below shows that the general sequence of activities for a derived-target migration project is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE is used to profile the data, generate accurate metadata (including a normalized schema), and document cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.
3. IDE is used to modify and develop the normalized schema into a target schema. This generally involves removing obsolete or spurious data elements, incorporating new business requirements and data elements, adapting to corporate data standards, and denormalizing to enhance performance.
4. The resultant metadata are exported to and managed by the IDE Repository.
5. FTM is used to develop and document transformation requirements between the normalized and target schemas. The mappings between the data elements are automatically carried over from the IDE-based schema development process.
6. The IDE Repository is used to export an XSLT document containing the transformation and the formatting specs developed with IDE and FTM/XML.
7. The cleansing, transformation, and formatting specs are used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

Last updated: 09-Feb-07 12:55


Working with Pre-Built Plans in Data Cleanse and Match

Challenge
To provide a set of best practices for users of the pre-built data quality processes designed for use with the Informatica Data Cleanse and Match (DC&M) product offering. Informatica Data Cleanse and Match is a cross-application data quality solution that installs two components to the PowerCenter system:
● Data Cleanse and Match Workbench, the desktop application in which data quality processes (or plans) can be designed, tested, and executed. Workbench installs with its own Data Quality repository, where plans are stored until needed.

● Data Quality Integration, a plug-in component that integrates Informatica Data Quality and PowerCenter. The plug-in adds a transformation to PowerCenter, called the Data Quality Integration transformation; PowerCenter Designer users can connect to the Data Quality repository and read data quality plan information into this transformation.

Informatica Data Cleanse and Match has been developed to work with Content Packs developed by Informatica. This document focuses on the plans that install with the North America Content Pack, which was developed in conjunction with the components of Data Cleanse and Match. The North America Content Pack delivers data parsing, cleansing, standardization, and de-duplication functionality to United States and Canadian name and address data through a series of pre-built data quality plans and address reference data files. This document focuses on the following areas:
● When to use one plan vs. another for data cleansing.
● What behavior to expect from the plans.
● How best to manage exception data.

Description
The North America Content Pack installs several plans to the Data Quality Repository:
● Plans 01-04 are designed to parse, standardize, and validate United States name and address data.
● Plans 05-07 are designed to enable single-source matching operations (identifying duplicates within a data set) or dual-source matching operations (identifying matching records between two datasets).

The processing logic for data matching is split between PowerCenter and Informatica Data Quality (IDQ) applications.

Plans 01-04: Parsing, Cleansing, and Validation
These plans provide modular solutions for name and address data. The plans can operate on both highly unstructured and well-structured data sources. The level of structure contained in a given data set determines the plan to be used. The following diagram demonstrates how the level of structure in address data maps to the plans required to standardize and validate an address.


In cases where the address is well structured and specific data elements (i.e., city, state, and zip) are mapped to specific fields, only the address validation plan may be required. Where the city, state, and zip are mapped to address fields, but not specifically labeled as such (e.g., as Address1 through Address5), a combination of the address standardization and validation plans is required. In extreme cases, where the data is not mapped to any address columns, a combination of the general parser, address standardization, and validation plans may be required to obtain meaning from the data. The purpose of making the plans modular is twofold:
● It is possible to apply these plans on an individual basis to the data. There is no requirement that the plans be run in sequence with each other. For example, the address validation plan (plan 03) can be run successfully to validate input addresses discretely from the other plans. In fact, the Data Quality Developer will not run all seven plans consecutively on the same dataset. Plans 01 and 02 are not designed to operate in sequence, nor are plans 06 and 07.

● Modular plans facilitate faster performance. Designing a single plan to perform all the processing tasks contained in the seven plans, even if it were desirable from a functional point of view, would result in significant performance degradation and extremely complex plan logic that would be difficult to modify and maintain.
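
The mapping from address structure to required plans described above can be summarized as a small lookup. This is only a reading aid: the structure labels (`labeled`, `address_fields`, `unmapped`) and the function name are invented for this sketch, and the plans are referred to by descriptive name rather than by plan number.

```python
def plans_for_address(structure):
    """Return the plans suggested for a given level of address structure."""
    if structure == "labeled":
        # City, state, and zip are mapped to specific, labeled fields.
        return ["Address Validation"]
    if structure == "address_fields":
        # Data sits in generic address fields, e.g. Address1 through Address5.
        return ["Address Standardization", "Address Validation"]
    if structure == "unmapped":
        # Data is not mapped to any address columns at all.
        return ["General Parser", "Address Standardization", "Address Validation"]
    raise ValueError(f"unknown structure level: {structure!r}")

for level in ("labeled", "address_fields", "unmapped"):
    print(level, "->", plans_for_address(level))
```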

01 General Parser
The General Parser plan was developed to handle highly unstructured data and to parse it into type-specific fields. For example, consider data stored in the following format:

Field1           | Field2           | Field3           | Field4               | Field5
100 Cardinal Way | Informatica Corp | CA 94063         | info@informatica.com | Redwood City
Redwood City     | 38725            | 100 Cardinal Way | CA 94063             | info@informatica.com

While it is unusual to see data fragmented and spread across a number of fields in this way, it can and does happen. In cases such as this, data is not stored in any specific fields. Street addresses, email addresses, company names, and dates are scattered throughout the data. Using a combination of dictionaries and pattern recognition, the General Parser plan sorts such data into type-specific fields of address, names, company names, Social Security Numbers, dates, telephone numbers, and email addresses, depending on the profile of the content. As a result, the above data will be parsed into the following format:

Address1         | Address2         | Address3     | E-mail               | Date       | Company
100 Cardinal Way | CA 94063         | Redwood City | info@informatica.com |            | Informatica Corp
Redwood City     | 100 Cardinal Way | CA 94063     | info@informatica.com | 08/01/2006 |

The General Parser does not attempt to apply any structure or meaning to the data. Its purpose is to identify and sort data by information type. As the address fields in the above example demonstrate, values are labeled as addresses but their contents are not arranged in a standard address format; they are flagged as addresses in the order in which they were processed in the file.

The General Parser does not attempt to validate the correctness of a field. For example, the dates are accepted as valid because they have a structure of symbols and numbers that represents a date. A value of 99/99/9999 would also be parsed as a date.

The General Parser does not attempt to handle multiple information types in a single field. For example, if a person name and address element are contained in the same field, the General Parser would label the entire field either a name or an address, or leave it unparsed, depending on the elements in the field it can identify first (if any). While the General Parser does not make any assumption about the data prior to parsing, it parses based on the elements of data that it can make sense of first. In cases where no elements of information can be labeled, the field is left in a pipe-delimited form containing unparsed data.

The effectiveness of the General Parser in recognizing various information types is a function of the dictionaries used to identify that data and the rules used to sort them. Adding or deleting dictionary entries can greatly affect the effectiveness of this plan.

Overall, the General Parser is likely to be used only in limited cases, where certain types of information may be mixed together (e.g., telephone and email in the same contact field), or in cases where the data has been badly managed, such as when several files of differing structures have been merged into a single file.
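
The behavior described above (labeling a whole field by information type using dictionaries and patterns, without validating it) can be illustrated with a small sketch. This is not the General Parser's actual logic: the tiny dictionaries, the regular expressions, the fixed order of checks, and the `classify` function are all assumptions made for the example.

```python
import re

# Hypothetical mini-dictionaries; the real plan relies on much larger reference dictionaries.
STREET_WORDS = {"way", "street", "st", "avenue", "ave", "road", "rd"}
COMPANY_SUFFIXES = {"corp", "inc", "ltd", "llc"}

EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
DATE = re.compile(r"\d{2}/\d{2}/\d{4}")

def classify(field):
    """Label a whole field with a single information type, or 'unparsed'."""
    text = field.strip()
    if EMAIL.fullmatch(text):
        return "email"
    if DATE.fullmatch(text):
        return "date"  # structure only, no validation: 99/99/9999 also passes
    words = {w.strip(".,").lower() for w in text.split()}
    if words & COMPANY_SUFFIXES:
        return "company"
    if words & STREET_WORDS:
        return "address"
    return "unparsed"

for value in ["100 Cardinal Way", "Informatica Corp", "info@informatica.com",
              "99/99/9999", "Redwood City"]:
    print(value, "->", classify(value))
```

Because the dictionaries here contain no city names, "Redwood City" is left unparsed, which mirrors the point that dictionary content largely determines how effective this kind of parsing is.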

02 Name Standardization
The Name Standardization plan is designed to take in person name or company name information and apply parsing and standardization logic to it. Name Standardization follows two different tracks: one for person names and one for company names.

The plan input fields include two inputs for company names. Data entered in these fields is assumed to be a valid company name, and no additional tests are performed to validate that the data is an existing company name. Any combination of letters, numbers, and symbols can represent a company; therefore, in the absence of an external reference data source, further tests to validate a company name are not likely to yield usable results. Any data entered into the company name fields is subjected to two processes. First, the company name is standardized using the Word Manager component, standardizing any company suffixes included in the field. Second, the standardized company name is matched against the company_names.dic dictionary, which returns the standardized Dun & Bradstreet company name, if found.

The second track for name standardization is person name standardization. While this track is dedicated to standardizing person names, it does not necessarily assume that all data entered here is a person name. Person names in North America tend to follow a set structure and typically do not contain company suffixes or digits. Therefore, values entered in this field that contain a company suffix or a company name are taken out of the person name track and moved to the company name track. Additional logic is applied to identify people whose last name is similar (or equal) to a valid company name (for example, John Sears); inputs that contain an identified first name and a company name are treated as a person name. If the company name track inputs are already fully populated for the record in question, then any company name detected in a person name column is moved to a field for unparsed company name output.
If the name is not recognized as a company name (e. g., by the presence of a company suffix) but contains digits, the data is parsed into the non-name data output field. Any remaining data is accepted as being a valid person name and parsed as such. North American person names are typically entered in one of two different styles: either in a “firstname middlename surname” format or “surname, firstname middlename” format. Name parsing algorithms have been built using this assumption. Name parsing occurs in two passes. The first pass applies a series of dictionaries to the name fields, attempting to parse out name
prefixes, name suffixes, first names, and any extraneous data (“noise”) present. Any remaining details are assumed to be middle name or surname details. A rule is applied to the parsed details to check if the name has been parsed correctly. If not, “best guess” parsing is applied to the field based on the possible assumed formats.

When name details have been parsed into first, last, and middle name formats, the first name is used to derive additional details including gender and the name prefix. Finally, using all parsed and derived name elements, salutations are generated. In cases where no clear gender can be generated from the first name, the gender field is typically left blank or indeterminate. The salutation field is generated according to the derived gender information. This can be easily replicated outside the data quality plan if the salutation is not immediately needed as an output from the process (assuming the gender field is an output).

Depending on the data entered in the person name fields, certain companies may be treated as person names and parsed according to person name processing rules. Likewise, some person names may be identified as companies and standardized according to company name processing logic. This is typically a result of the dictionary content. If this is a significant problem when working with name data, some adjustments to the dictionaries and the rule logic for the plan may be required.

Non-name data encountered in the name standardization plan may be standardized as names depending on the contents of the fields. For example, an address datum such as “Corporate Parkway” may be standardized as a business name, as “Corporate” is also a business suffix. Any text data that is entered in a person name field is always treated as a person or company, depending on whether or not the field contains a recognizable company suffix in the text.
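The track selection and the two assumed person name formats described above can be illustrated with a minimal Python sketch. It is not the plan's actual logic: the suffix sets are hypothetical stand-ins for the plan's dictionaries, the function names are invented for illustration, and the dictionary-driven first pass and “best guess” rule are reduced to simple token checks.

```python
import re

# Hypothetical stand-ins for the plan's reference dictionaries (assumption).
COMPANY_SUFFIXES = {"inc", "corp", "ltd", "llc", "co"}
NAME_SUFFIXES = {"jr", "sr", "ii", "iii"}


def _norm(token):
    """Lower-case a token and drop surrounding periods and commas."""
    return token.strip(".,").lower()


def route_name(value):
    """Pick a track for a person name input: company, non-name, or person."""
    tokens = [_norm(t) for t in value.split()]
    if any(t in COMPANY_SUFFIXES for t in tokens):
        return "company"      # company suffix found: move to company track
    if re.search(r"\d", value):
        return "non-name"     # digits but no company suffix
    return "person"           # everything else is accepted as a person name


def parse_person(value):
    """Parse 'first middle surname' or 'surname, first middle' into parts."""
    parts = [p.strip() for p in value.split(",") if p.strip()]
    suffix = ""
    # "King, Jr." style: a trailing comma-separated suffix is not a first name.
    if len(parts) > 1 and _norm(parts[-1]) in NAME_SUFFIXES:
        suffix = parts.pop()
    if len(parts) > 1:
        # "surname, firstname middlename"
        surname, given = parts[0], parts[1].split()
    else:
        # "firstname middlename surname", optionally followed by a suffix
        words = parts[0].split() if parts else []
        if not suffix and len(words) > 1 and _norm(words[-1]) in NAME_SUFFIXES:
            suffix = words.pop()
        surname = words[-1] if words else ""
        given = words[:-1]
    return {
        "first": given[0] if given else "",
        "middle": " ".join(given[1:]),
        "last": surname,
        "suffix": suffix,
    }
```

A real plan resolves ambiguous inputs such as “Thomas Staples” through dictionary content, which this sketch does not attempt.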
To ensure that the name standardization plan is delivering adequate results, Informatica strongly recommends pre- and post-execution analysis of the data. Based on the following input:

ROW ID   IN NAME1
1        Steven King
2        Chris Pope Jr.
3        Shannon C. Prince
4        Dean Jones
5        Mike Judge
6        Thomas Staples
7        Eugene F. Sears
8        Roy Jones Jr.
9        Thomas Smith, Sr
10       Eddie Martin III
11       Martin Luther King, Jr.
12       Staples Corner
13       Sears Chicago
14       Robert Tyre
15       Chris News

The following outputs are produced by the Name Standardization plan:


The last entry (Chris News) is identified as a company in the current plan configuration – such results can be refined by changing the underlying dictionary entries used to identify company and person names.

03 US Canada Standardization
This plan is designed to apply basic standardization processes to city, state/province, and zip/postal code information for United States and Canadian postal address data. The purpose of the plan is to deliver basic standardization to address elements where processing time is critical and one hundred percent validation is not possible due to time constraints. The plan also organizes key search elements into discrete fields, thereby speeding up the validation process. The plan accepts up to six generic address fields and attempts to parse out city, state/province, and zip/postal code information. All remaining information is assumed to be address information and is absorbed into the address line 1-3 fields. Any information that cannot be parsed into the remaining fields is merged into the non-address data field. The plan makes a number of assumptions that may or may not suit your data:
● When parsing city, state, and zip details, the address standardization dictionaries assume that these data elements are spelled correctly. Variation in town/city names is very limited, and in cases where punctuation differences exist or where town names are commonly misspelled, the standardization plan may not correctly parse the information.

● Zip codes are all assumed to be five-digit. In some files, zip codes that begin with “0” may lack this first number and so appear as four-digit codes, and these may be missed during parsing. Adding four-digit zips to the dictionary is not recommended, as these will conflict with the “Plus 4” element of a zip code. Zip codes may also be confused with other five-digit numbers in an address line such as street numbers.

● City names are also commonly found in street names and other address elements. For example, “United” is part of a country (United States of America) and is also a town name in the U.S. Bear in mind that the dictionary parsing operates from right to left across the data, so that country name and zip code fields are analyzed before city names and street addresses. Therefore, the word “United” may be parsed and written as the town name for a given address before the actual town name datum is reached.

● The plan appends a country code to the end of a parsed address if it can identify it as U.S. or Canadian. Therefore, there is no need to include any country code field in the address inputs when configuring the plan.

Most of these issues can be dealt with, if necessary, by minor adjustments to the plan logic or to the dictionaries, or by adding some pre-processing logic to a workflow prior to passing the data into the plan. The plan assumes that all data entered into it are valid address elements. Therefore, once city, state, and zip details have been parsed out, the plan assumes all remaining elements are street address lines and parses them in the order they occurred as address lines 1-3.
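As one illustration of such pre-processing, the following Python sketch restores a missing leading zero before the data reaches the plan. It assumes the value comes from a field already known to hold only a zip code; applied to free-form address lines it would run into the “Plus 4” conflict noted above. The function name is hypothetical.

```python
import re


def restore_leading_zero(zip_field):
    """Left-pad a four-digit zip (optionally followed by -NNNN) to five digits.

    Assumption: zip_field holds only a zip code, never a full address line.
    """
    value = zip_field.strip()
    if re.fullmatch(r"\d{4}(-\d{4})?", value):
        return "0" + value
    return value
```

Values that already have five digits, or that do not look like a zip code at all, pass through unchanged.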

04 NA Address Validation
The purposes of the North America Address Validation plan are:
● To match input addresses against known valid addresses in an address database, and
● To parse, standardize, and enrich the input addresses.

Performing these operations is a resource-intensive process. Using the US Canada Standardization plan before the NA Address Validation plan helps to improve validation plan results in cases where city, state, and zip code information are not already in discrete fields. City, state, and zip are key search criteria for the address validation engine, and they need to be mapped into discrete fields. Not having these fields correctly mapped prior to plan execution leads to poor results and slow execution times. The address validation APIs store specific area information in memory and continue to use that information from one record to the next, when applicable. Therefore, when running validation plans, it is advisable to sort address data by zip/postal code in order to maximize the usage of data in memory. In cases where status codes, error codes, or invalid results are generated as plan outputs, refer to the Informatica Data Quality 3.1 User Guide for information on how to interpret them.

Plans 05-07: Pre-Match Standardization, Grouping, and Matching
These plans take advantage of PowerCenter and IDQ capabilities and are commonly used in pairs. Users will use either plans 05 and 06 or plans 05 and 07. These plans work as follows:

● 05 Match Standardization and Grouping. This plan is used to perform basic standardization and grouping operations on the data prior to matching.
● 06 Single Source Matching. Single source matching seeks to identify duplicate records within a single data set.
● 07 Dual Source Matching. Dual source matching seeks to identify duplicate records between two datasets.

Note that the matching plans are designed for use within a PowerCenter mapping and do not deliver optimal results when executed directly from IDQ Workbench. Note also that the Standardization and Matching plans are geared towards North American English data. Although they work with datasets in other languages, the results may be sub-optimal.

Matching Concepts
To ensure the best possible matching results and performance, match plans usually use a pre-processing step to standardize and group the data. The aim of standardization here is different from that of a classic standardization plan: the intent is to ensure that different spellings, abbreviations, etc. are as similar to each other as possible so that matching returns a better match set. For example, 123 Main Rd. and 123 Main Road will obtain an imperfect match score, although they clearly refer to the same street address.

Grouping, in a matching context, means sorting input records based on identical values in one or more user-selected fields. When a matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing time while minimizing the likelihood of missed matches in the dataset.

Grouping performs two functions. It sorts the records in a dataset to increase matching plan performance, and it creates new data columns to provide group key options for the matching plan. (In PowerCenter, the Sorter transformation can organize the data to facilitate matching performance. Therefore, the main function of grouping in a PowerCenter context is to create candidate group keys. In both Data Quality and PowerCenter, grouping operations do not affect the source dataset itself.)

Matching on un-grouped data involves a large number of comparisons that realistically will not generate a meaningful quantity of additional matches. For example, when looking for duplicates in a customer list, there is little value in comparing the record for John Smith with the record for Angela Murphy as they are obviously not going to be considered as duplicate entries.

The type of grouping used depends on the type of information being matched; in general, productive fields for grouping name and address data are location-based (e.g., city name, zip codes) or person/company based (surname and company name composites). For more information on grouping strategies for the best result/performance relationship, see the Best Practice Effective Data Matching Techniques.

Plan 05 (Match Standardization and Grouping) performs cleansing and standardization operations on the data before group keys are generated. It offers a number of grouping options. The plan generates the following group keys:


● OUT_ZIP_GROUP: first 5 digits of ZIP code
● OUT_ZIP_NAME3_GROUP: first 5 digits of ZIP code and the first 3 characters of the last name
● OUT_ZIP_NAME5_GROUP: first 5 digits of ZIP code and the first 5 characters of the last name
● OUT_ZIP_COMPANY3_GROUP: first 5 digits of ZIP code and the first 3 characters of the cleansed company name
● OUT_ZIP_COMPANY5_GROUP: first 5 digits of ZIP code and the first 5 characters of the cleansed company name

The grouping output used depends on the data contents and data volume.
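To make the effect of the group keys concrete, here is a small Python sketch that derives the five keys listed above and counts the record pairs a matcher would have to compare under a chosen key. The record layout, the plain concatenation of zip and name fragments, and the upper-casing are assumptions made for illustration; the plan's exact key format is not documented here, and company names are assumed to be cleansed already.

```python
from collections import defaultdict
from itertools import combinations


def group_keys(zip_code, last_name, company):
    """Build the five group keys; concatenation format is an assumption."""
    zip5 = zip_code[:5]
    name = last_name.upper()
    comp = company.upper()  # assumed to be cleansed upstream
    return {
        "OUT_ZIP_GROUP": zip5,
        "OUT_ZIP_NAME3_GROUP": zip5 + name[:3],
        "OUT_ZIP_NAME5_GROUP": zip5 + name[:5],
        "OUT_ZIP_COMPANY3_GROUP": zip5 + comp[:3],
        "OUT_ZIP_COMPANY5_GROUP": zip5 + comp[:5],
    }


def candidate_pairs(records, key_name):
    """Return the ID pairs compared when matching runs group by group."""
    groups = defaultdict(list)
    for rec in records:
        keys = group_keys(rec["zip"], rec["last_name"], rec["company"])
        groups[keys[key_name]].append(rec["id"])
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(ids, 2))  # only within a group
    return pairs


records = [
    {"id": 1, "zip": "02134-1234", "last_name": "Smith", "company": "Informatica"},
    {"id": 2, "zip": "02134", "last_name": "Smith", "company": "Informatica"},
    {"id": 3, "zip": "02134", "last_name": "Murphy", "company": "Informatica"},
    {"id": 4, "zip": "10001", "last_name": "Murphy", "company": "Informatica"},
]
```

With these four sample records, ungrouped matching compares six pairs, OUT_ZIP_GROUP reduces that to three, and OUT_ZIP_NAME3_GROUP to one. A narrower key saves more comparisons but raises the risk of missing a match whose key fields differ.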

Plans 06 Single Source Matching and 07 Dual Source Matching
Plans 06 and 07 are set up in similar ways and assume that person name, company name, and address data inputs will be used. However, in PowerCenter, plan 07 requires the additional input of a Source tag, typically generated by an Expression transform upstream in the PowerCenter mapping. A number of matching algorithms are applied to the address and name elements. To ensure the best possible result, a weight-based component and a custom rule are applied to the outputs from the matching components. For further information on IDQ matching components, consult the Informatica Data Quality 3.1 User Guide. By default, the plans are configured to write as output all records that match with an 85 percent or higher degree of certainty. The Data Quality Developer can easily adjust this figure in each plan.
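The idea of combining several field-level match scores into one weighted score and applying the 85 percent cut-off can be sketched in Python as follows. This is only an illustration: the weights and field names are hypothetical, and difflib's SequenceMatcher is swapped in for the IDQ matching components, whose algorithms and the plan's custom rule are not reproduced here.

```python
from difflib import SequenceMatcher

# Hypothetical field weights (assumption); they sum to 1.0.
WEIGHTS = {"name": 0.5, "address": 0.3, "company": 0.2}
THRESHOLD = 0.85  # default output threshold described in the text


def similarity(a, b):
    """Stand-in field comparison; not an IDQ matching algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Weighted combination of the per-field similarity scores."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def is_match(rec_a, rec_b, threshold=THRESHOLD):
    """True when the weighted score reaches the configured threshold."""
    return match_score(rec_a, rec_b) >= threshold
```

Lowering the threshold writes more candidate pairs to the output at the cost of more false positives; raising it does the opposite.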

PowerCenter Mappings
When configuring the Data Quality Integration transformation for the matching plan, the Developer must select a valid grouping field.

To ensure best matching results, the PowerCenter mapping that contains plan 05 should include a Sorter transformation that sorts data according to the group key to be used during matching. This transformation should follow standardization and grouping operations. Note that a single mapping can contain multiple Data Quality Integration transformations, so that the Data Quality Developer or Data Integration Developer can add plan 05 to one Integration transformation and plan 06 or 07 to another in the same mapping. The standardization plan requires a passive transformation, whereas the matching plan requires an active transformation.

The developer can add a Sequencer transformation to the mapping to generate a unique identifier for each input record if these are not present in the source data. (Note that a unique identifier is not required for matching processes.)

When working with the dual source matching plan, additional PowerCenter transformations are required to pre-process the data for the Integration transformation. Expression transformations are used to label each input with a source tag of A and B respectively. The data from the two sources is then joined together using a Union transformation, before being passed to the Integration transformation containing the standardization and grouping plan. From here on, the mapping has the same design as the single source version.
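Outside PowerCenter, the tag-then-union step can be pictured with a short Python sketch. The record layout and function names are hypothetical, and restricting comparisons to pairs with different source tags is an assumption based on the stated goal of finding duplicates between two datasets.

```python
from itertools import combinations


def tag(records, source):
    """Mimic the Expression transformation: add a source tag to each record."""
    return [dict(rec, source=source) for rec in records]


def union(source_a, source_b):
    """Mimic the Union transformation: one stream with A and B tagged rows."""
    return tag(source_a, "A") + tag(source_b, "B")


def cross_source_pairs(tagged):
    """Candidate pairs whose records come from different sources (assumption)."""
    return [
        (x["id"], y["id"])
        for x, y in combinations(tagged, 2)
        if x["source"] != y["source"]
    ]
```

For example, two records from source A and one from source B yield two cross-source candidate pairs rather than three.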

Last updated: 09-Feb-07 13:18


Naming Conventions - Data Quality
Challenge
As with any other development process, the use of clear, consistent, and documented naming conventions contributes to the effective use of Informatica Data Quality (IDQ). This Best Practice provides suggested naming conventions for the major structural elements of the IDQ Designer and IDQ Plans.

Description
IDQ Designer
The IDQ Designer is the user interface for the development of IDQ plans. Each IDQ plan holds the business rules and operations for a distinct process. IDQ plans may be constructed for use inside the IDQ Designer (a runtime plan), using the athanor-rt command line utility (also runtime), or within an integration with PowerCenter (a real-time plan). IDQ requires that each IDQ plan belong to a project. Optionally, plans may be organized in folders within a project. Folders may be nested to span more than one level. The organizational structure of IDQ is summarized below.

Element      Parent
Repository   None. This is the top level organization structure.
Project      Repository. There may be multiple projects in a repository.
Folder       Project or Folder. Folders may be nested.
Plan         Project or Folder.

At any common level of visibility, IDQ requires that all elements have distinct names. Thus no two projects within a repository may share the same name. Likewise, no two folders at the same level within a project may share the same name. The rule also applies to plans within the same folder. IDQ will not permit an element to be renamed if the new name would conflict with an existing element at the same level. A dialog will explain the error.


To prevent naming conflicts when an element is copied, its name will be prefixed with “Copy of ” if it is pasted at the same level as the source of the copy. If the new name is longer than the length allowed for that type of element, the name will be truncated.

Naming Projects
When a project is created, it will by default have the name “New Project”.

Project naming should be clear and consistent within a repository. The exact approach to naming will vary depending on an organization’s needs. Suggested naming rules include:
1. Limit project names to 22 characters if possible. The limit imposed by the repository is 30 characters. Limiting project names to 22 characters allows “Copy of” to be prefixed to copies of a project without truncating characters.
2. Include enough descriptive information within the project name so an unfamiliar user will have a reasonable idea of what plans may be included in the project.
3. If plans within a project will operate on only one data source, including the data source in the project name may be helpful.
4. If abbreviations are used, they should be consistent and documented.
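The arithmetic behind the 22-character suggestion is that the “Copy of ” prefix, including its trailing space, occupies 8 of the 30 available characters. The Python sketch below illustrates this; the function name is hypothetical, and truncating from the end of the name is an assumption, since the text only states that over-long names are truncated.

```python
COPY_PREFIX = "Copy of "  # 8 characters, including the trailing space


def copied_name(name, max_len):
    """Name given to a pasted copy, cut to the repository limit.

    Assumption: truncation drops characters from the end of the name.
    """
    return (COPY_PREFIX + name)[:max_len]


PROJECT_LIMIT = 30  # repository limit for project names, per the text
short = copied_name("Customer Cleanse 2008", PROJECT_LIMIT)  # 21 chars: fits
long_ = copied_name("A" * 30, PROJECT_LIMIT)                 # loses 8 chars
```

A 22-character project name is therefore the longest that survives one copy without losing characters.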

Naming Folders
When a new project is created, by default it will contain four folders, named “Consolidation”, “Matching”, “Profiling”, and “Standardization”.


This naming convention for folders tracks the major types of IDQ plans. While the default naming convention may prove satisfactory in many cases, it imposes an organizational structure for plans that may not be optimal. Therefore, another naming convention may make more sense in a particular circumstance. Naming guidelines for folders include:
1. Limit folder names to 42 characters if possible. The limit imposed by the repository is 50 characters. Limiting folder names to 42 characters allows “Copy of” to be prefixed to copies of a folder without truncating characters.
2. Include enough descriptive information within the folder name so an unfamiliar user will have a reasonable idea of what plans may be included in the folder.
3. If abbreviations are used, they should be consistent and documented.

Naming Plans
When a new plan is created, the user is required to select from one of the four main plan classifications, “Analysis”, “Matching”, “Standardization”, or “Consolidation”. By default, the new plan name will correspond to the option selected.


Including the plan type as part of the plan name is helpful in describing what the plan does. Other suggested naming rules include:
1. Limit plan names to 42 characters if possible. The limit imposed by the repository is 50 characters. Limiting plan names to 42 characters allows “Copy of” to be prefixed to copies of a plan without truncating characters.
2. Include enough descriptive information within the plan name so an unfamiliar user will have a reasonable idea of what the plan does at a high level.
3. While the project and folder structure will be visible within the IDQ Designer and will be required when using athanor-rt, it is not as readily visible within PowerCenter. Therefore, repetition of the information conveyed by the project and folder names may be advisable.
4. If abbreviations are used, they should be consistent and documented.

Naming Components
Within the Designer, component types may be identified by their unique icons as well as by hovering over a component with a mouse.


However, the component has no visible name at this level. It is only after opening a component for viewing that the component’s name becomes visible.

It is suggested that component names be prefixed with an acronym identifying the component type. While less critical than field naming, as discussed below, using a prefix allows for consistent naming, for clarity, and it makes field naming more efficient in some cases. Suggested prefixes are listed below.

Component               Prefix
Address Validator       AV_
Bigram                  BG_
Character Labeller      CL_
Context Parser          CP_
Edit Distance           ED_
Hamming Distance        HD_
Jaro Distance           JD_
Merge                   MG_
Mixed Field Matcher     MFM_
Nysiis                  NYS_
Profile Standardizer    PS_
Rule Based Analyzer     RBA_
Scripting               SC_
Search Replace          SR_
Soundex                 SX_
Splitter                SPL_
To Upper                TU_
Token Labeller          TL_
Token Parser            TP_
Weight Based Analyzer   WBA_
Word Manager            WM_

In addition, names for components should take into account the following suggested rules:
1. Limit names to a reasonably short length. A limit of 32 characters is suggested. In many cases, component names are also useful for field names, and databases limit field lengths at varying sizes.
2. Consider using the name of the input field or at least the field type.
3. Consider limiting names to alphabetic characters, spaces, underscores, and numbers. This will make the corresponding field names compatible with most likely output destinations.
4. If the component type abbreviation itself is not sufficient to identify what the component does, include an identifier for the function of the component in its name.
5. If abbreviations are used, they should be consistent and documented.
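A team that wants to review names against these suggestions could use a small checker along the following lines. This Python sketch is not an IDQ feature; the function name is hypothetical, only a few prefixes from the table are included, and it covers only the mechanically checkable rules (prefix, 32-character limit, permitted characters).

```python
import re

# A few of the suggested prefixes from the table above.
PREFIXES = {
    "Address Validator": "AV_",
    "Search Replace": "SR_",
    "Token Labeller": "TL_",
    "Word Manager": "WM_",
}
MAX_LEN = 32  # suggested limit, not enforced by IDQ


def check_component_name(name, component_type):
    """Return a list of suggestions the name does not follow."""
    problems = []
    prefix = PREFIXES[component_type]
    if not name.startswith(prefix):
        problems.append("missing prefix " + prefix)
    if len(name) > MAX_LEN:
        problems.append("longer than %d characters" % MAX_LEN)
    if not re.fullmatch(r"[A-Za-z0-9_ ]+", name):
        problems.append("contains characters other than letters, digits, spaces, underscores")
    return problems
```

An empty result means the name follows the checked suggestions; whether the name is descriptive enough remains a matter for review.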


Naming Dictionaries
Dictionaries may be given any name suitable for the operating system on which they will be used. It is suggested that dictionary naming consider the following rules:
1. Limit dictionary names to characters permitted by the operating system. If a dictionary is to be used on both Windows and UNIX, avoid using spaces.
2. If a dictionary supplied by Informatica is to be modified, it is suggested that the dictionary be renamed and/or moved to a new folder. This will avoid accidentally overwriting the modifications when an update is installed.
3. If abbreviations are used, they should be consistent and documented.

Naming Fields
Careful field naming is probably the most critical standard to follow when using IDQ.
● IDQ requires that all fields output by components have unique names; a name cannot be carried through from component to component.
● The power of IDQ leads to complex plans with many components.
● IDQ does not have the data lineage feature of PowerCenter, so the component name is the clearest indicator of the source of an input component when a plan is being examined.

With those considerations in mind, the following naming rules are suggested:
1. Prefix each output field name with the type of component.

Component               Prefix
Address Validator       AV_
Bigram                  BG_
Character Labeller      CL_
Context Parser          CP_
Edit Distance           ED_
Hamming Distance        HD_
Jaro Distance           JD_
Merge                   MG_
Mixed Field Matcher     MFM_
Nysiis                  NYS_
Profile Standardizer    PS_
Rule Based Analyzer     RBA_
Scripting               SC_
Search Replace          SR_
Soundex                 SX_
Splitter                SPL_
To Upper                TU_
Token Labeller          TL_
Token Parser            TP_
Weight Based Analyzer   WBA_
Word Manager            WM_

2. Use meaningful field names, with consistent, documented abbreviations.
3. Use consistent casing.
4. While it is possible to rename output fields in sink components, this practice should be avoided when practical, since there is no convenient way to determine which source field provides data to the renamed output field.

Last updated: 04-Jun-08 18:50


Sample Deliverables
● Business Requirements Specification
● Critical Test Parameters
● Data Quality Plan Design
● Defect Log
● Defect Report
● Issues Tracking
● Project Definition
● Project Plan
● Project Status Report
● Source Availability Matrix

VELOCITY
SAMPLE DELIVERABLE Business Requirements Specification
DOCUMENT AUTHOR: DOCUMENT OWNER: DATE CREATED: LAST UPDATED: PROJECT: COMPANY:

BUSINESS REQUIREMENTS SPECIFICATION
DOCUMENT OVERVIEW
This document presents a brief description of the business of CompanyX and specific business requirements applicable to the project. Any subsequent changes, additions, or deletions are not part of this document and will be submitted to CompanyX separately for acceptance and inclusion as an additional requirement for the project.

1.1 BUSINESS OVERVIEW

<Describe high-level view of business environment, strategy, reason for system implementation, etc>

1.2 BUSINESS REQUIREMENTS MATRIX

1. Business Requirement: Data from log files and the application database will be extracted and distributed into central repository. (Priority 1)
   1a. Functional Requirement: The central repository will be a flat file database. (Priority 1)
   1b. Functional Requirement: The data in the central repository will be partitioned by cluster, time period, portal and merchant such that the data will be readily accessible by reading the fewest number of rows. (Priority 1)
   1c. Functional Requirement: Data from the log files will be gathered from each cluster machine and distributed into central repository. (Priority 1)
   1d.

Informatica Velocity – Sample Deliverable

   2b. Functional Requirement: The data will be aggregated on at least a monthly basis. If possible, daily or hourly basis. (Priority 1)
   2c. Functional Requirement: The data will be aggregated by merchant/portal combination by hour, day, month, year and cumulative. (Priority 1)

3. Business Requirement: A flat file will be extracted from the central repository for loading into the billing system. (Priority 1)
   3a. Functional Requirement: The billing extract will be performed on (at least) a monthly basis. (Priority 1)
   3b. Functional Requirement: Data file(s) will be formatted for the billing system. These requirements will be defined at a later time. (Priority 1)

4. Business Requirement: The architecture will support the simultaneous use of multiple versions of the log files. (Priority 2)
   4a. Functional Requirement: The system will handle extracting from and loading to multiple versions of the log files. (Priority 2)

5. Business Requirement: FUTURE REQ: Users will be able to build sessions to extract selected data from the flat file database. (Priority 2)
   5a. Functional Requirement: The architecture will support the generation of required data files for reporting needs. (Priority 2)

1.3 BUSINESS REQUIREMENTS DETAIL
1.3.1 REQUIREMENT 1 DETAIL

Business Requirement: Data from log files and the application database will be extracted and distributed into central repository.
Constraints: The log files will be transported from each cluster machine and placed in the repository. The application server must be available for data extract.
Inputs: Hourly log files from the cluster processing machines. Tables from the application database server.
Outputs: Flat file database built from the log files and tables using a hierarchical directory/file structure.
Dependencies: Each processing machine in each cluster will correctly generate the log files. The log files will be transferred to the repository machine intact. The database is operational.
Hardware / Software Requirements: Central Repository Machine: SUN E4500, 4 CPU, 4GB RAM, 70GB HDD, Sun Solaris 2.6. PowerCenter will be used to extract data from the application database. PERL scripts will be used to extract data from the log files.

1.3.2 REQUIREMENT 2 DETAIL

Business Requirement:
Constraints:
Inputs:
Outputs:
Dependencies:
Hardware / Software Requirements:

1.3.3 REQUIREMENT 3 DETAIL

Business Requirement:
Constraints:
Inputs:
Outputs:
Dependencies:
Hardware / Software Requirements:


VELOCITY
SAMPLE DELIVERABLE Critical Test Parameters
DOCUMENT AUTHOR: DOCUMENT OWNER: DATE CREATED: LAST UPDATED: PROJECT: COMPANY:

CRITICAL TEST PARAMETERS

Report 1
Unit of Functionality Test Type Description: (Data type, format, test config.) Check If Correct

CTP #

TCRS

Source

Comments

1

2

1 2 3 4 1 2 3 4

TEST CONDITION REQUIREMENTS SAMPLE SCRIPT TEMPLATE
Test Case: TCR #: Date Tested: Tested By: Test Iteration: Location:

TEST DESCRIPTION
Support Documents: Prerequisites: Test Setup:

Test Steps:

Expected Results

Actual Results

Comments

TCR Outcome: (Pass/Fail)


VELOCITY
SAMPLE DELIVERABLE Data Quality Plan Design
DOCUMENT AUTHOR: DOCUMENT OWNER: DATE CREATED: LAST UPDATED: PROJECT: COMPANY:

DATA QUALITY PLAN DESIGN

DATA QUALITY PLAN DESIGN SAMPLE DELIVERABLE
A Data Quality Plan Design describes the design and operation of one or more data quality plans within a consultancy project in a manner that business users related to the project can understand. The document should serve as a plan handover document for business users and be written so that a user trained in IDQ can understand and update the plan design unaided. We recommend that you build a plan design document as you build the plan. A Plan Design document should contain the following sections:
● Introduction
● Document scope and readership
● Document history
● Plan heading [plan name].pln
● Overview
● Inputs
● Component descriptions
● Dictionaries
● Outputs
● Next steps

THE INTRODUCTION
The introduction should describe the data quality objectives of the plan and its relationship to the parent project. When writing the introduction, consider these questions:
● What is the name of the plan?
● What project is the plan part of?
● Where does the plan fit in the overall project?
● What particular aspect of the project does the plan address?
● What are the objectives of the plan?
● What issues, if any, apply to the plan or its data?
● What business rules are used in the plan? What is the origin of these rules?
● What department or group uses the plan output?
● What are the before and after states of the plan data?
● Where are the plans located (include machine details and folder location) and when were the plans executed?
● What steps were taken or should be taken following plan execution?

PROJECT NAME/PLAN NAME
This section describes the Informatica Data Quality plan in technical detail. Include the path to the plan within the IDQ Project Manager or on the file system in the heading or in the first paragraph below the heading. Note that the plan name may contain a suffix such as .pln or .xml if it has been saved out from the Data Quality Repository. This section has four main sub-sections:

OVERVIEW
The overview provides the following information:
● The plan type (e.g. standardization, matching).
● The data or business objective of the plan.
● Who ran (or should run) the plan, and when.
● The version of IDQ in which the plan was designed, the Informatica application that will run the plan (e.g. Data Quality Server), and the platform on which the plan will run.
● A screengrab of the plan layout in the Workbench user interface.
● Any other relevant information.

INPUTS
This section identifies the source data for the plan. Consider these questions:
● To what data file/table do the plan’s source components connect?
● Where is the source file located? What are the format and origin of the database table?
● What data source(s) are used? If a data file, what type? Are any parameters set at this stage (e.g. Unicode)?
● If a table, what operations are performed at data source level? Provide SQL statements if appropriate.
● Is the source data an output from another Informatica Data Quality plan, and if so, which one?

COMPONENT DESCRIPTIONS
This section describes at a low level the operational components and business rules used in the plan. Where possible, these should be listed in order of their interaction with the data. How much detail you go into depends on the audience for the document and what their needs are. It also depends on whether the business rules are documented elsewhere.

Component functionality can be described at a high level as shown in the examples below:
● Search Replace component takes Addr Line1 from CSV Source and removes spaces anywhere and full stops from end.
● Output from Search Replace component is put through the Word Manager and Addr Line 1 is standardized using ‘Address Prefix’ and ‘Address Suffix’ dictionaries.
● Output from Word Manager is used as input to Token Labeller and profiled using the following dictionaries in this order: A.dic, Bb.dic, C.dic.

Continue stepping through each component in this manner.

For fine-grained plan description, consider these questions:
● What is the component name?
● What instances are defined, and how are they named?
● For each instance:
  ● What input fields are selected?
  ● What parameters are set?
  ● What filters are defined?
  ● What reference dictionaries are applied?
  ● What business rules are defined? (Provide the logical statements if appropriate.)
  ● Are the dictionaries/business rules specified by the client?
  ● What are the outputs for the instance, and how are they named?

DICTIONARIES
List dictionaries and other reference content used, and their file locations. Cross-reference each dictionary to the component(s) that use it.

OUTPUTS
In this section, describe the plan output and identify its file or database destination. Consider these questions: What is the sink name? Where is the sink output written: report, database table, file? What output fields are selected for the sink? Are there exception files? If so, where are they written to? Provide SQL statements if appropriate.

[PLAN NAME.PLN]
This section is optional. If this document describes another plan, use this section in the same manner as the previous plan section and its subsections.

NEXT STEPS
This section is relevant if there are other actions dependent on the plan and if the plan output is to be used elsewhere in the project, as is typically the case. Consider these questions: What is the next step in the project? Will the plan(s) be re-used? Who receives the plan output data, and what actions will they take? What steps, if any, will the Informatica consultant take in connection with these plans?

Informatica Velocity – Sample Deliverable

DEFECT LOG
PROJECT NAME:

Columns: Problem No., Date Raised, Priority*, Summary, Investigation To, Date, Comments, Assign Fix To, Date, Re-test Date

Sample entry:
Problem No.: 1
Date Raised: 10-Dec-04
Priority*: U
Summary: Reinstated invoice lines not being included in totals
Investigation To: DJP
Date: 15-Dec-04
Comments: Reinstatement not specif
Assign Fix To: JMC
Date: 18-Dec-04
Re-test Date: 20-Dec-04

Problem Nos. 2 to 36 are blank.

* Priority: U (Urgent, technically cannot go live without fix and issue has impact on further testing), H (High, project requires defect to be fixed befo (Medium) problem that will not prevent project going live but could push a feature scheduled for delivery into a subsequent phase, L (Low, minor pro
** Cause Code: N/A (No Error), SE (Specification Error), CE (Coding Error), DE (Data Error), PP (Product Problem), TP (Test Problem), CR (Change R


DEFECT REPORT
PROJECT NAME:
Raised by:
Login ID:
Date Raised:
Test Plan/Schedule:
Module/Function:
Test Step:
Problem Description: <describe problem, including steps necessary to replicate>
Priority:
Software/system release/version:

Investigation/ Proposed solution: <describe investigations made, conclusions about cause and steps required to rectify>

Impact Analysis: <list other components affected, the impact including re-installation and re-test tasks, and any dependencies>

Investigated by: Date:

Changes approved by: Date:

Summary of Fix: <describe changes made and list all files /components changed, including, where appropriate, servers/directories/versions/

Fixed and Unit Tested by:

Date:

Regression tests: <outline the tests made, referring to pre-prepared test scripts/schedules, where appropriate>

Tested by:

Date closed:


ISSUES TRACKING

Columns: Issue #, Assign To, Status, Priority, Severity, ID'd Date, ID'd By, Description, Work Around, Investigation, Solution, Short Description, Resolution Date

Issue # rows 1 to 15 are blank.


VELOCITY
SAMPLE DELIVERABLE PROJECT DEFINITION
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:

PROJECT DEFINITION

PROJECT OBJECTIVES
These are the key business “drivers” for the project—business-focused goals and objectives. <Objective 1> <Objective 2> <Objective 3> …

PROJECT TIMING
Key Milestone or Deliverable Target Date(s)

TECHNICAL ENVIRONMENT
Informatica Products, versions
Platforms / OS
Source systems / data characteristics
Target systems / DBMS
Target architectures

PROJECT BACKGROUND AND STATUS
What has been accomplished on the project? What documents have been completed? (Requirements, design, models, etc.) What source analyses, what target models have been completed?
Document / Model Description Name / Location

EXPECTATIONS FOR CONSULTANT INVOLVEMENT
Role Primary Activities


PROJECT PERSONNEL
Personnel that you are likely to interact with (e.g., DBA, Business Analyst, System Administrator).
Name Phone(s) E-mail Role

PRIMARY TASKS, ACTIVITIES AND DELIVERABLES
Task/Deliverable Description Effort (days) Target Date(s)


Velocity Project Plan
Project: Velocity_Project_Plan.mpp Date: Mon 7/28/08

Every task in the sample plan carries placeholder schedule values: Duration 1 day?, Start Mon 1/1/07, Finish Mon 1/1/07.

ID    Task Name
1     1 Manage
2     1.1 Define Project
3     1.1.1 Establish Business Project Scope
4     1.1.2 Build Business Case
5     1.1.3 Assess Centralized Resources
6     1.2 Plan and Manage Project
7     1.2.1 Establish Project Roles
8     1.2.2 Develop Project Estimate
9     1.2.3 Develop Project Plan
10    1.2.4 Manage Project
11    1.3 Perform Project Close
12    2 Analyze
13    2.1 Define Business Drivers, Objectives and Goals
14    2.2 Define Business Requirements
15    2.2.1 Define Business Rules and Definitions
16    2.2.2 Establish Data Stewardship
17    2.3 Define Business Scope
18    2.3.1 Identify Source Data Systems
19    2.3.2 Determine Sourcing Feasibility
20    2.3.3 Determine Target Requirements
21    2.3.4 Determine Business Process Data Flows
22    2.3.5 Build Roadmap for Incremental Delivery
23    2.4 Define Functional Requirements
24    2.5 Define Metadata Requirements
25    2.5.1 Establish Inventory of Technical Metadata
26    2.5.2 Review Metadata Sourcing Requirements
27    2.5.3 Assess Technical Strategies and Policies
28    2.6 Determine Technical Readiness
29    2.7 Determine Regulatory Requirements

30    2.8 Perform Data Quality Audit
31    2.8.1 Perform Data Analysis of Source Data
32    2.8.2 Report Analysis Results to the Business
33    3 Architect
34    3.1 Develop Solution Architecture
35    3.1.1 Define Technical Requirements
36    3.1.2 Develop Architecture Logical View
37    3.1.3 Develop Configuration Recommendations
38    3.1.4 Develop Architecture Physical View
39    3.1.5 Estimate Volume Requirements
40    3.2 Design Development Architecture
41    3.2.1 Develop Quality Assurance Strategy
42    3.2.2 Define Development Environments
43    3.2.3 Develop Change Control Procedures
44    3.2.4 Determine Metadata Strategy
45    3.2.5 Develop Change Management Process
46    3.3 Implement Technical Architecture
47    3.3.1 Procure Hardware and Software
48    3.3.2 Install/Configure Software
49    4 Design
50    4.1 Develop Data Model(s)
51    4.1.1 Develop Enterprise Data Warehouse Model
52    4.1.2 Develop Data Mart Model(s)
53    4.2 Analyze Data Sources
54    4.2.1 Develop Source to Target Relationships
55    4.2.2 Determine Source Availability
56    4.3 Design Physical Database
57    4.3.1 Develop Physical Database Design
58    4.4 Design Presentation Layer

59    4.4.1 Design Presentation Layer Prototype
60    4.4.2 Present Prototype to Business Analysts
61    4.4.3 Develop Presentation Layout Design
62    5 Build
63    5.1 Launch Build Phase
64    5.1.1 Review Project Scope and Plan
65    5.1.2 Review Physical Model
66    5.1.3 Define Defect Tracking Process
67    5.2 Implement Physical Database
68    5.3 Design and Build Data Quality Process
69    5.3.1 Design Data Quality Technical Rules
70    5.3.2 Determine Dictionary and Reference Data Requirements
71    5.3.3 Design and Execute Data Enhancement Processes
72    5.3.4 Design Run-time and Real-Time Processes for Operate Phase Execution
73    5.3.5 Develop Inventory of Data Quality Processes
74    5.3.6 Review and Package Data Transformation Specification Processes and Documents
75    5.4 Design and Develop Data Integration Processes
76    5.4.1 Design High Level Load Process
77    5.4.2 Develop Error Handling Strategy
78    5.4.3 Plan Restartability Process
79    5.4.4 Develop Inventory of Mappings & Reusable Objects
80    5.4.5 Design Individual Mappings & Reusable Objects
81    5.4.6 Build Mappings & Reusable Objects
82    5.4.7 Perform Unit Test
83    5.4.8 Conduct Peer Reviews
84    5.5 Populate and Validate Database
85    5.5.1 Build Load Process
86    5.5.2 Perform Integrated ETL Testing
87    5.6 Build Presentation Layer

88    5.6.1 Develop Presentation Layer
89    5.6.2 Demonstrate Presentation Layer to Business Analysts
90    6 Test
91    6.1 Define Overall Test Strategy
92    6.1.1 Define Test Data Strategy
93    6.1.2 Define Unit Test Plan
94    6.1.3 Define System Test Plan
95    6.1.4 Define User Acceptance Test Plan
96    6.1.5 Define Test Scenarios
97    6.1.6 Build/Maintain Test Source Data Set
98    6.2 Prepare for Testing Process
99    6.2.1 Prepare Environments
100   6.2.2 Prepare Defect Management Processes
101   6.3 Execute System Test
102   6.3.1 Prepare for System Test
103   6.3.2 Execute Complete System Test
104   6.3.3 Perform Data Validation
105   6.3.4 Conduct Disaster Recovery Testing
106   6.3.5 Conduct Volume Testing
107   6.4 Conduct User Acceptance Testing
108   6.5 Tune System Performance
109   6.5.1 Benchmark
110   6.5.2 Identify Areas for Improvement
111   6.5.3 Tune Data Integration Performance
112   6.5.4 Tune Reporting Performance
113   7 Deploy
114   7.1 Plan Deployment
115   7.1.1 Plan User Training
116   7.1.2 Plan Metadata Documentation and Rollout

117   7.1.3 Plan User Documentation Rollout
118   7.1.4 Develop Punch List
119   7.1.5 Develop Communication Plan
120   7.1.6 Develop Run Book
121   7.2 Deploy Solution
122   7.2.1 Train Users
123   7.2.2 Migrate Development to Production
124   7.2.3 Package Documentation
125   8 Operate
126   8.1 Define Production Support Procedures
127   8.1.1 Develop Operations Manual
128   8.2 Operate Solution
129   8.2.1 Execute First Production Run
130   8.2.2 Monitor Load Volume
131   8.2.3 Monitor Load Processes
132   8.2.4 Track Change Control Requests
133   8.2.5 Monitor Usage
134   8.2.6 Monitor Data Quality
135   8.3 Maintain and Upgrade Environment
136   8.3.1 Maintain Repository
137   8.3.2 Upgrade Software

VELOCITY
SAMPLE DELIVERABLE Project Status Report
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:

PROJECT STATUS REPORT

FOR WEEK ENDING: <DATE>

COMPLETED THIS WEEK
<Activity 1>
<Activity 2>
<Activity 3>

PLANNED FOR NEXT WEEK
<Activity 1>
<Activity 2>
<Activity 3>

DELIVERABLES AND MILESTONES
Owner Deliverable / Milestone Key Date(s)

ISSUES <Issue 1> <Issue 2> <Issue 3>


SOURCE AVAILABILITY MATRIX: Weekday Schedule

Columns: one per hour of the day, from 12-1 AM through 11-12 PM
Rows (System): ERP Accounting Manufacturing Asia Accounting HR Europe Manufacturing
Legend: In Use; Available for Extraction


SOURCE AVAILABILITY MATRIX: Weekend Schedule

Columns: one per hour of the day, from 12-1 AM through 11-12 PM
Rows (System): ERP Accounting Manufacturing Asia Accounting HR Europe Manufacturing
Legend: In Use; Available for Extraction
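One way to put a source availability matrix to work is to turn each system's in-use hours into extraction windows. The sketch below is a minimal Python illustration of that idea. The function name extraction_windows and the in-use hours are hypothetical; the actual values would come from the completed matrix, which is blank in this template.

```python
def extraction_windows(in_use_hours):
    """Return contiguous (start, end) hour windows, end exclusive,
    during which a system is not in use (24-hour clock)."""
    windows = []
    for hour in range(24):
        if hour in in_use_hours:
            continue
        if windows and windows[-1][1] == hour:
            # Extend the current window by one hour.
            windows[-1] = (windows[-1][0], hour + 1)
        else:
            # Start a new window at this free hour.
            windows.append((hour, hour + 1))
    return windows


# Hypothetical in-use hours per system; not taken from the template.
IN_USE = {
    "HR": set(range(7, 19)),
    "Manufacturing": set(range(0, 22)),
}

if __name__ == "__main__":
    for system, hours in IN_USE.items():
        print(system, extraction_windows(hours))
```

Keeping weekday and weekend schedules as separate inputs mirrors the two pages of the matrix and lets load windows be planned for each independently.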
