Informatica Best Practices
Informatica Training Day 5
01/04/2011


Configuration Management and Security

Challenge
Configuring a PowerCenter security scheme to prevent unauthorized access to mappings, folders,
sessions, workflows, repositories, and data in order to ensure system integrity and data
confidentiality.

Description
Security is an often overlooked area within the Informatica ETL domain. However, without paying
close attention to the repository security, one ignores a crucial component of ETL code
management. Determining an optimal security configuration for a PowerCenter environment
requires a thorough understanding of business requirements, data content, and end-user access
requirements. Knowledge of PowerCenter's security functionality and facilities is also a prerequisite
to security design.
Implement security with the goals of easy maintenance and scalability. When establishing
repository security, keep it simple. Although PowerCenter includes the utilities for a complex web of
security, the simpler the configuration, the easier it is to maintain. Securing the PowerCenter
environment involves the following basic principles:

       Create users and groups
       Define access requirements
       Grant privileges and permissions

Before implementing security measures, ask and answer the following questions:
       Who will administer the repository?
       How many projects need to be administered? Will the administrator be able to manage
        security for all PowerCenter projects or just a select few?
       How many environments will be supported in the repository?
       Who needs access to the repository? What do they need the ability to do?
       How will the metadata be organized in the repository? How many folders will be required?
       Where can we limit repository privileges by granting folder permissions instead?
       Who will need Administrator or Super User-type access?

After you evaluate the needs of the repository users, you can create appropriate user groups and
assign repository privileges and folder permissions. In most implementations, the administrator
takes care of maintaining the repository. Limit the number of administrator accounts for
PowerCenter. While this concept is important in a development/unit test environment, it is critical for
protecting the production environment.

Repository Security Overview

A security system needs to properly control access to all sources, targets, mappings, reusable
transformations, tasks, and workflows in both the test and production repositories. A successful
security model needs to support all groups in the project lifecycle and also consider the repository
structure. Informatica offers multiple layers of security, which enables you to customize the security
within your data warehouse environment. Metadata level security controls access to PowerCenter
repositories, which contain objects grouped by folders. Access to metadata is determined by the
privileges granted to the user or to a group of users and the access permissions granted on each
folder. Some privileges do not apply by folder, as they are granted by privilege alone (i.e.,
repository-level tasks).




Just beyond PowerCenter authentication is the connection to the repository database. All client
connectivity to the repository is handled by the PowerCenter Repository Service over a TCP/IP
connection. The database account and password are specified at installation and during
the configuration of the Repository Service. Developers need not have knowledge of this database
account and password; they should only use their individual repository user ids and passwords.
This information should be restricted to the administrator.
Other forms of security available in PowerCenter include permissions for connections. Connections
include database, FTP, and external loader connections. These permissions are useful when you
want to limit access to schemas in a relational database; they can be set up in the Workflow
Manager when source and target connections are defined.
Occasionally, you may want to restrict changes to source and target definitions in the repository. A
common way to approach this security issue is to use shared folders, which are owned by an
Administrator or Super User. Granting read access to developers on these folders allows them to
create read-only copies in their work folders.

Informatica Security Architecture
The following diagram, Informatica PowerCenter Security, depicts PowerCenter security, including
access to the repository, Repository Service, Integration Service and the command-line utilities
pmrep and pmcmd.
As shown in the diagram, the Repository Service is the central component when using default
security. It sits between the PowerCenter repository and all client applications, including GUI tools,
command line tools, and the Integration Service. Each application must be authenticated against
metadata stored in several tables within the repository. Each Repository Service manages a single
repository database where all security data is stored as part of its metadata; this is a second layer
of security. Only the Repository Service has access to this database; it authenticates all client
applications against this metadata.








Repository Service Security

Connection to the PowerCenter repository database is one level of security. The Repository Service
uses native drivers to communicate with the repository database. PowerCenter Client tools and the
Integration Service communicate with the Repository Service over TCP/IP. When a client
application connects to the repository, it connects directly to the Repository Service process. You
can configure a Repository Service to run on multiple machines, or nodes, in the domain. Each
instance running on a node is called a Repository Service process.
This process accesses the database tables and performs most repository-related tasks.
When the Repository Service is installed, the database connection information is entered for the
metadata repository. At this time you need to know the database user id and password to access
the metadata repository.
The database user id must be able to read and write to all tables in the database. As a developer
creates, modifies, and executes mappings and sessions, this information continuously updates the
metadata in the repository. Actual database security should be controlled by the DBA responsible
for that database, in conjunction with the PowerCenter Repository Administrator. After the
Repository Service is installed and started, all subsequent client connectivity is automatic. The
database user id and password are transparent at this point.

Integration Service Security

Like the Repository Service, the Integration Service communicates with the metadata repository
when it executes workflows or when users are using Workflow Monitor. During configuration of the
Integration Service, the repository database is identified with the appropriate user id and password.
Connectivity to the repository is made using native drivers supplied by Informatica.
Certain permissions are also required to use the pmrep and pmcmd command line utilities.

Encrypting Repository Passwords

You can encrypt passwords and create an environment variable to use with pmcmd and pmrep. For
example, you can encrypt the repository and database passwords for pmrep to maintain security
when using pmrep in scripts. In addition, you can create an environment variable to store the
encrypted password.
Use the following steps as a guideline to use an encrypted password as an environment variable:

1. Use the command line program pmpasswd to encrypt the repository password.
2. Configure the password environment variable to set the encrypted value.

To configure a password as an environment variable on UNIX:

1. At the command line, type:

pmpasswd <repository password>
pmpasswd returns the encrypted password.

2. In a UNIX C shell environment, type:

setenv <Password_Environment_Variable> <encrypted password>

In a UNIX Bourne shell environment, type:

<Password_Environment_Variable>=<encrypted password>
export <Password_Environment_Variable>

You can assign the environment variable any valid UNIX name.


To configure a password as an environment variable on Windows:

1. At the command line, type:

pmpasswd <repository password>
pmpasswd returns the encrypted password.

2. Enter the password environment variable in the Variable field. Enter the encrypted password in
the Value field.
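
For example, once the encrypted value is stored in an environment variable, scripts can pass the
variable name to pmrep rather than a clear-text password. The following is a minimal sketch only;
the variable, repository, domain, and user names (PMREP_PWD, REP_DEV, DOM_DEV, repo_user) are
hypothetical, and the exact connect options may vary by PowerCenter version (check pmrep help):

pmpasswd <repository password>
setenv PMREP_PWD <encrypted string returned by pmpasswd>
pmrep connect -r REP_DEV -d DOM_DEV -n repo_user -X PMREP_PWD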

Setting the Repository User Name

For pmcmd and pmrep, you can create an environment variable to store the repository user name.
To configure a user name as an environment variable on UNIX:

In a UNIX C shell environment, type:

setenv <User_Name_Environment_Variable> <user name>

In a UNIX Bourne shell environment, type:

<User_Name_Environment_Variable>=<user name>
export <User_Name_Environment_Variable>

You can assign the environment variable any valid UNIX name.

To configure a user name as an environment variable on Windows:

1. Enter the user name environment variable in the Variable field.
2. Enter the repository user name in the Value field.
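
As an illustration, a scheduling script can then reference both environment variables when calling
pmcmd, so that no credentials appear in the script itself. The names below (PMUSER, PMPASSWD,
INT_SVC, DOM_DEV, MY_FOLDER, wf_load_sales) are hypothetical, and the -uv and -pv options should
be verified against your pmcmd version:

setenv PMUSER repo_user
setenv PMPASSWD <encrypted password returned by pmpasswd>
pmcmd startworkflow -sv INT_SVC -d DOM_DEV -uv PMUSER -pv PMPASSWD -f MY_FOLDER wf_load_sales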

Connection Object Permissions

Within Workflow Manager, you can grant read, write, and execute permissions to groups and/or
users for all types of connection objects. This controls who can create, view, change, and execute
workflow tasks that use those specific connections, providing another level of security for these
global repository objects.
Users with the 'Use Workflow Manager' permission can create and modify connection objects.
Connection objects allow the PowerCenter server to read and write to source and target databases.
Any database the server can access requires a connection definition. As shown below, connection
information is stored in the repository. Users executing workflows need execution permission on all
connections used by the workflow. The PowerCenter server looks up the connection information in
the repository, and verifies permission for the required action. If permissions are properly granted,
the server reads and writes to the defined databases, as specified by the workflow.








Users

Users are the fundamental objects of security in a PowerCenter environment. Each individual
logging into the PowerCenter repository should have a unique user account. Informatica does not
recommend creating shared accounts; unique accounts should be created for each user. Each
repository user needs a user name and password, provided by the PowerCenter Repository
Administrator, to access the repository.
Users are created and managed through Repository Manager. Users should change their
passwords from the default immediately after receiving the initial user id from the Administrator.
Passwords can be reset by the user if they are granted the privilege 'Use Repository Manager'.
When you create the repository, the repository automatically creates two default users:

       Administrator - The default password for Administrator is Administrator.
       Database user - The username and password used when you created the repository.

These default users are in the Administrators user group, with full privileges within the repository.
They cannot be deleted from the repository, nor can their group affiliation be changed.
To administer repository users, you must have one of the following privileges:

       Administer Repository
       Super User

LDAP (Lightweight Directory Access Protocol)

In addition to default repository user authentication, LDAP can be used to authenticate users. Using
LDAP authentication, the repository maintains an association between the repository user and the
external login name. When a user logs into the repository, the security module authenticates the
user name and password against the external directory. The repository maintains a status for each
user. Users can be enabled or disabled by modifying this status.
Prior to implementing LDAP, the administrator must know:

       Repository server username and password
       An administrator or superuser user name and password for the repository
       An external login name and password





To configure LDAP, follow these steps:

1. Edit ldap_authen.xml and modify the following attributes:
       NAME – the .dll that implements the authentication
       OSTYPE – Host operating system

2. Register ldap_authen.xml in the Repository Server Administration Console.
3. In the Repository Server Administration Console, configure the authentication module.

User Groups

When you create a repository, the Repository Manager creates two repository user groups. These
two groups exist so you can immediately create users and begin developing repository objects.
These groups cannot be deleted from the repository, nor can their configured privileges be changed.
The default repository user groups are:

       Administrators - which has super-user access
       Public - which has a subset of default repository privileges

You should create custom user groups to manage users and repository privileges effectively. The
number and types of groups that you create should reflect the needs of your development teams,
administrators, and operations group. Informatica recommends minimizing the number of custom
user groups that you create in order to facilitate the maintenance process.
A starting point is to create a group for each combination of privileges needed to support the
development cycle and production process. This is the recommended method for assigning
privileges. After creating a user group, you assign a set of privileges for that group. Each repository
user must be assigned to at least one user group. When you assign a user to a group, the user:

       Receives all group privileges.
       Inherits any changes to group privileges.
       Loses or gains privileges if you change the user's group membership.

You can also assign users to multiple groups, which grants the user the privileges of each group.
Use the Repository Manager to create and edit repository user groups.

Folder Permissions

When you create or edit a folder, you define permissions for the folder. The permissions can be set
at three different levels:

       owner
       owner's group
       repository - remainder of users within the repository

First, choose an owner (i.e., user) and group for the folder. If the owner belongs to more than one
group, you must select one of the groups listed. Once the folder is defined and the owner is
selected, determine what level of permissions you would like to grant to the users within the group.
Then determine the permission level for the remainder of the repository users. The permissions that
can be set include: read, write, and execute. Any combination of these can be granted to the owner,
group or repository.
Be sure to consider folder permissions very carefully. They offer the easiest way to restrict users
and/or groups from accessing folders, or to limit the type of access they have. The following table
gives some examples of folders, their type, and recommended ownership.






Repository Privileges

Repository privileges work in conjunction with folder permissions to give a user or group authority to
perform tasks. Repository privileges are the most granular way of controlling a user's activity.
Consider the privileges that each user group requires, as well as folder permissions, when
determining the breakdown of users into groups. Informatica recommends creating one group for
each distinct combination of folder permissions and privileges.
When you assign a user to a user group, the user receives all privileges granted to the group. You
can also assign privileges to users individually. When you grant a privilege to an individual user, the
user retains that privilege, even if his or her user group affiliation changes. For example, you have a
user in a Developer group who has limited group privileges, and you want this user to act as a
backup administrator when you are not available. For the user to perform every task in every folder
in the repository, and to administer the Integration Service, the user must have the Super User
privilege. For tighter security, grant the Super User privilege to the individual user, not the entire
Developer group. This limits the number of users with the Super User privilege, and ensures that
the user retains the privilege even if you remove the user from the Developer group.
The Repository Manager grants a default set of privileges to each new user and group for working
within the repository. You can add or remove privileges from any user or group except:

       Administrators and Public (the default read-only repository groups)
       Administrator and the database user who created the repository (the users automatically
        created in the Administrators group)

The Repository Manager automatically grants each new user and new group the default privileges.
These privileges allow you to perform basic tasks in Designer, Repository Manager, Workflow
Manager, and Workflow Monitor. The following table lists the default repository privileges.





Extended Privileges
In addition to the default privileges listed above, Repository Manager provides extended privileges
that you can assign to users and groups. These privileges are granted to the Administrators group by
default. The following table lists the extended repository privileges.




Extended privileges allow you to perform more tasks and expand the access you have to repository
objects.
It is recommended that you reserve extended privileges for individual users and grant default
privileges to groups.





Audit Trails

You can track changes to Repository users, groups, privileges, and permissions by selecting the
SecurityAuditTrail configuration option in the Repository Service properties in the PowerCenter
Administration Console. When you enable the audit trail, the Repository Service logs security
changes to the Repository Service log.
The audit trail logs the following operations:

      Changing the owner, owner's group, or permissions for a folder.
      Changing the password of another user.
      Adding or removing a user.
      Adding or removing a group.
      Adding or removing users from a group.
      Changing global object permissions.
       Adding or removing user and group privileges.

Sample Security Implementation

The following steps provide an example of how to establish users, groups, permissions, and
privileges in your environment. Again, the requirements of your projects and production systems
should dictate how security is established.

1. Identify users and the environments they will support (e.g., Development, UAT, QA, Production,
Production Support, etc.).
2. Identify the PowerCenter repositories in your environment (this may be similar to the basic
groups listed in Step 1; for example, Development, UAT, QA, Production, etc.).
3. Identify which users need to exist in each repository.
4. Define the groups that will exist in each PowerCenter Repository.
5. Assign users to groups.
6. Define privileges for each group.

The following table provides an example of groups and privileges that may exist in the PowerCenter
repository. This example assumes one PowerCenter project with three environments co-existing in
one PowerCenter repository.








Informatica PowerCenter Security Administration

As mentioned earlier, one individual should be identified as the Informatica Administrator. This
individual is responsible for a number of tasks in the Informatica environment, including security. To
summarize, here are the security-related tasks an administrator is responsible for:

       Creating user accounts.
       Defining and creating groups.
       Defining and granting folder permissions.
       Defining and granting repository privileges.
       Enforcing changes in passwords.
       Controlling requests for changes in privileges.
       Creating and maintaining database, FTP, and external loader connections in conjunction
        with the database administrator.
       Working with the operations group to ensure tight security in the production environment.

Remember, you must have one of the following privileges to administer repository users:

       Administer Repository
       Super User

Summary of Recommendations

When implementing your security model, keep the following recommendations in mind:
       Create groups with limited privileges.
       Do not use shared accounts.
       Limit user and group access to multiple repositories.
       Customize user privileges.
       Limit the Super User privilege.
       Limit the Administer Repository privilege.
       Restrict the Workflow Operator privilege.
       Follow a naming convention for user accounts and group names.
       For more secure environments, turn Audit Trail logging on.






Running Sessions in Recovery Mode

Challenge
Use the Load Manager architecture for manual error recovery by suspending and resuming the
workflows and worklets when an error is encountered.

Description
When a task in the workflow fails at any point, one option is to truncate the target and run the
workflow again from the beginning. Load Manager architecture offers an alternative to this scenario:
the workflow can be suspended and the user can fix the error, rather than re-processing the portion
of the workflow that completed without errors. This option, "Suspend on Error", results in accurate
and complete target data, as if the session had completed successfully in a single run.

Configure Mapping for Recovery
For consistent recovery, the mapping needs to produce the same result, and in the same order, in
the recovery execution as in the failed execution. This can be achieved by sorting the input data
using either the sorted ports option in the Source Qualifier (or Application Source Qualifier) or a
Sorter transformation with the distinct rows option immediately after the Source Qualifier
transformation. Additionally, ensure that all targets receive data from transformations that produce
repeatable data.

Configure Session for Recovery
Enable the session for recovery by selecting one of the following three Recovery Strategies:

Resume from the last checkpoint

       The Integration Service saves session recovery information and updates recovery tables in
        the target database.
       If the session is interrupted, the Integration Service uses the saved recovery information to
        recover it.

Restart task

       The Integration Service does not save session recovery information.
       If the session is interrupted, the Integration Service reruns the session during recovery.

Fail task and continue workflow

       The session is not recovered (default).

Configure Workflow for Recovery
The Suspend on Error option directs the PowerCenter Server to suspend the workflow while the
user fixes the error, and then resumes the workflow.
The server suspends the workflow when any of the following tasks fail:

        Session
        Command
        Worklet
        Email




When a task fails in the workflow, the Integration Service stops running tasks in the path. The
Integration Service does not evaluate the output link of the failed task. If no other task is running in
the workflow, the Workflow Monitor displays the status of the workflow as "Suspended."
If one or more tasks are still running in the workflow when a task fails, the Integration Service stops
running the failed task and continues running tasks in other paths. The Workflow Monitor displays
the status of the workflow as "Suspending."
When the status of the workflow is "Suspended" or "Suspending," you can fix the error, such as a
target database error, and recover the workflow in the Workflow Monitor.
When you recover a workflow, the Integration Service restarts the failed tasks and continues
evaluating the rest of the tasks in the workflow. The Integration Service does not run any task that
already completed successfully.


Truncate Target Table
If the truncate table option is enabled in a recovery-enabled session, the target table is not
truncated during the recovery process.

Session Logs
In a suspended workflow scenario, the Integration Service uses the existing session log when it
resumes the workflow from the point of suspension. However, the earlier runs that caused the
suspension are recorded in the historical run information in the repository.

Suspension Email
The workflow can be configured to send an email when the Integration Service suspends the
workflow. When a task fails, the server suspends the workflow and sends the suspension email.
The user can then fix the error and resume the workflow. If another task fails while the Integration
Service is suspending the workflow, the server does not send another suspension email. The
Integration Service only sends out another suspension email if another task fails after the workflow
resumes. Click the "Browse Emails" button on the General tab of the Workflow Designer Edit
sheet to configure the suspension email.

Suspending Worklets
When the "Suspend On Error" option is enabled for the parent workflow, the Integration Service
also suspends the worklet if a task within the worklet fails. When a task in the worklet fails, the
server stops executing the failed task and other tasks in its path. If no other task is running in the
worklet, the status of the worklet is "Suspended". If other tasks are still running in the worklet, the
status of the worklet is "Suspending". The parent workflow is also suspended when the worklet is
"Suspended" or "Suspending".

Starting Recovery
The recovery process can be started using the Workflow Manager or Workflow Monitor client tools.
Alternatively, the recovery process can be started with pmcmd from the command line or from a
script.
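
For example, a recovery script might use pmcmd's recoverworkflow command. The service, domain,
folder, and workflow names below are hypothetical, the user and password environment variables
are those described in the security section, and the option syntax should be verified against your
pmcmd version:

pmcmd recoverworkflow -sv INT_SVC -d DOM_DEV -uv PMUSER -pv PMPASSWD -f MY_FOLDER wf_load_sales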



Recovery Tables and Recovery Process


When the Integration Service runs a session that has a resume recovery strategy, it writes to
recovery tables on the target database system. When the Integration Service recovers the session,
it uses information in the recovery tables to determine where to begin loading data to target tables.
If you want the Integration Service to create the recovery tables, grant table creation privilege to the
database user name for the target database connection. If you do not want the Integration Service
to create the recovery tables, create the recovery tables manually.
The Integration Service creates the following recovery tables in the target database:

       PM_RECOVERY. Contains target load information for the session run. The Integration
        Service removes the information from this table after each successful session and initializes
        the information at the beginning of subsequent sessions.
       PM_TGT_RUN_ID. Contains information the Integration Service uses to identify each target
        on the database. The information remains in the table between session runs. If you
        manually create this table, you must create a row and enter a value other than zero for
        LAST_TGT_RUN_ID to ensure that the session recovers successfully.

Do not edit or drop the recovery tables before you recover a session. If you disable recovery, the
Integration Service does not remove the recovery tables from the target database. You must
manually remove the recovery tables.
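
If you choose to create PM_TGT_RUN_ID manually, the seed row described above can be inserted
with any query tool. The following is a sketch only; it assumes the table has already been created
with the column layout documented by Informatica, that LAST_TGT_RUN_ID is the only column that
needs to be seeded, and that the connect string is a placeholder for your own target database:

sqlplus target_user/target_pwd@TGT_DB

INSERT INTO PM_TGT_RUN_ID (LAST_TGT_RUN_ID) VALUES (1);
COMMIT;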

Unrecoverable Sessions
The following options affect whether the session is incrementally recoverable:

       Output is deterministic. A property that determines if the transformation generates the
        same set of data for each session run. You can set this property for SDK sources and
        Custom transformations.
       Output is repeatable. A property that determines if the transformation generates the data
        in the same order for each session run. You can set this property for Custom
        transformations.
       Lookup source is static. A Lookup transformation property that determines if the lookup
        source is the same between the session and recovery. The Integration Service uses this
        property to determine if the output is deterministic.

Inconsistent Data During Recovery Process
For recovery to be effective, the recovery session must produce the same set of rows and in the
same order. Any change after the initial failure – to the mapping, the session, and/or the server –
that changes the ability to produce repeatable data results in inconsistent data during the recovery
process.
The following cases may produce inconsistent data during a recovery session:

       Session performs incremental aggregation and server stops unexpectedly.
       Mapping uses sequence generator transformation.
       Mapping uses a normalizer transformation.
       Source and/or target changes after initial session failure.
       Data movement mode change after initial session failure.
       Code page (server, source or target) changes, after initial session failure.
       Mapping changes in a way that causes server to distribute or filter or aggregate rows
        differently.
       The session uses a configuration that PowerCenter does not support for session recovery.
       Mapping uses a lookup table and the data in the lookup table changes between session
        runs.
       Session sort order changes, when server is running in Unicode mode.






HA Recovery
Highly available recovery allows the workflow to resume automatically when the Integration
Service fails over. The following options are available on the Properties tab of the workflow:

       Enable HA recovery. Allows the workflow to be configured for high availability.
       Automatically recover terminated tasks. Recovers terminated Session or Command tasks
        without user intervention. You must have high availability and the workflow must still be
        running.
       Maximum automatic recovery attempts. When you automatically recover terminated
        tasks, you can choose the number of times the Integration Service attempts to recover the
        task. The default is 5.

Complex Mappings and Recovery
In the case of complex mappings that load more than one related target (i.e., a primary key –
foreign key relationship), a session failure and subsequent recovery may lead to data integrity
issues. In such cases, the integrity of the target tables must be checked and fixed prior to starting
the recovery process.








Performance and Tuning: Determining Bottlenecks

Challenge
Because there are many variables involved in identifying and rectifying performance bottlenecks, an
efficient method for determining where bottlenecks exist is crucial to good data warehouse
management.

Description
The first step in performance tuning is to identify performance bottlenecks. Carefully consider the
following five areas to determine where bottlenecks exist, using a process of elimination and
investigating each area in the order indicated:

1. Target
2. Source
3. Mapping
4. Session
5. System

Use Thread Statistics to Identify Target, Source, and Mapping
Bottlenecks
Use thread statistics to identify source, target or mapping (transformation) bottlenecks. By default,
an Integration Service uses one reader, one transformation, and one target thread to process a
session. Within each session log, the following thread statistics are available:

       Run time – Amount of time the thread was running
       Idle time – Amount of time the thread was idle while waiting on other threads within the
        application or Integration Service. This value does not include time the thread is blocked by
        the operating system.
       Busy – Percentage of the overall run time the thread is not idle. This percentage is
        calculated using the following formula:

                 (run time – idle time) / run time x 100

By analyzing the thread statistics found in an Integration Service session log, it is possible to
determine which thread is being used the most.
If a transformation thread is 100 percent busy and there are additional resources (e.g. CPU cycles
and memory) available on the Integration Service server, add a partition point in the segment.
If the reader or writer thread is 100 percent busy, consider using string data types in the source or target
ports since non-string ports require more processing.
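
For example, suppose the session log shows a transformation thread with a run time of 600
seconds and an idle time of 90 seconds, and a writer thread with a run time of 600 seconds and an
idle time of 540 seconds. The transformation thread is (600 - 90) / 600 x 100 = 85 percent busy,
while the writer thread is only 10 percent busy, so the transformation stage, not the target, is the
first place to look for a bottleneck.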

Use the Swap Method to Test Changes in Isolation
Attempt to isolate performance problems by running test sessions. You should be able to compare
the session's original performance with the tuned session's performance.
The swap method is very useful for determining the most common bottlenecks. It involves the
following five steps:





1. Make a temporary copy of the mapping, session and/or workflow that is to be tuned, then tune
the copy before making changes to the original.
2. Implement only one change at a time and test for any performance improvements to gauge which
tuning methods work most effectively in the environment.
3. Document the change made to the mapping, session and/or workflow and the performance
metrics achieved as a result of the change. The actual execution time may be used as a
performance metric.
4. Delete the temporary mapping, session and/or workflow upon completion of performance tuning.
5. Make appropriate tuning changes to mappings, sessions and/or workflows.

Evaluating the Five Areas of Consideration

Target Bottlenecks

Relational Targets
The most common performance bottleneck occurs when the Integration Service writes to a target
database. This type of bottleneck can easily be identified with the following procedure:

1. Make a copy of the original workflow.
2. Configure the session in the test workflow to write to a flat file and run the session.
3. Read the thread statistics in the session log.

If session performance increases significantly when writing to a flat file, you have a write bottleneck.
Consider performing the following tasks to improve performance:

       Drop indexes and key constraints (see the sketch after this list)
       Increase checkpoint intervals
       Use bulk loading
       Use external loading
       Minimize deadlocks
       Increase database network packet size
       Optimize target databases
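
For the first suggestion in the list above, the index and constraint handling is often scripted as
pre- and post-load SQL (for example, in the session's pre-SQL and post-SQL properties). The
following is a minimal sketch only, assuming an Oracle target; the table, constraint, index, and
column names (SALES_FACT, SALES_FACT_PK, SALES_FACT_IX1, SALE_DATE) are hypothetical:

-- before the load
ALTER TABLE SALES_FACT DISABLE CONSTRAINT SALES_FACT_PK;
DROP INDEX SALES_FACT_IX1;

-- after the load
CREATE INDEX SALES_FACT_IX1 ON SALES_FACT (SALE_DATE);
ALTER TABLE SALES_FACT ENABLE CONSTRAINT SALES_FACT_PK;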

Flat file targets
If the session targets a flat file, you probably do not have a write bottleneck. If the session is writing
to a SAN or a non-local file system, performance may be slower than writing to a local file system. If
possible, a session can be optimized by writing to a flat file target local to the Integration Service. If
the local flat file is very large, you can optimize the write process by dividing it among several
physical drives.
If the SAN or non-local file system is significantly slower than the local file system, work with the
appropriate network/storage group to determine if there are configuration issues within the SAN.

Source Bottlenecks

Relational sources
If the session reads from a relational source, you can use a filter transformation, a read test
mapping, or a database query to identify source bottlenecks.

Using a Filter Transformation.
Add a filter transformation in the mapping after each source qualifier. Set the filter condition to false
so that no data is processed past the filter transformation. If the time it takes to run
the new session remains about the same, then you have a source bottleneck.

Using a Read Test Session.
You can create a read test mapping to identify source bottlenecks. A read test mapping isolates the
read query by removing any transformation logic from the mapping. Use the following steps to
create a read test mapping:

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or
queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.

Use the read test mapping in a test session. If the test session performance is similar to the original
session, you have a source bottleneck.

Using a Database Query
You can also identify source bottlenecks by executing a read query directly against the source
database. To do so, perform the following steps:

       Copy the read query directly from the session log.
       Run the query against the source database with a query tool such as SQL Plus.
       Measure the query execution time and the time it takes for the query to return the first row.

If there is a long delay between the two time measurements, you have a source bottleneck.
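
For example, in SQL*Plus the elapsed time can be captured directly, and watching when the first
rows appear gives a rough feel for the time to the first row. The connect string below is a
placeholder, and the query itself should be pasted unchanged from the session log:

sqlplus source_user/source_pwd@SRC_DB

SET TIMING ON
<read query copied from the session log>;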
If your session reads from a relational source and is constrained by a source bottleneck, review the
following suggestions for improving performance:

       Optimize the query.
       Create tempdb as in-memory database.
       Use conditional filters.
       Increase database network packet size.
       Connect to Oracle databases using IPC protocol.


Flat file sources
If your session reads from a flat file source, you probably do not have a read bottleneck. Tuning the
line sequential buffer length to a size large enough to hold approximately four to eight rows of data
at a time (for flat files) may improve performance when reading flat file sources. Also, ensure the flat
file source is local to the Integration Service.

Mapping Bottlenecks
If you have eliminated the reading and writing of data as bottlenecks, you may have a mapping
bottleneck. Use the swap method to determine if the bottleneck is in the mapping.
Begin by adding a Filter transformation in the mapping immediately before each target definition.
Set the filter condition to false so that no data is loaded into the target tables.
If the time it takes to run the new session is the same as the original session, you have a mapping
bottleneck. You can also use the performance details to identify mapping bottlenecks: high
Rowsinlookupcache and Errorrows counters indicate mapping bottlenecks.





Follow these steps to identify mapping bottlenecks:

Create a test mapping without transformations
1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or
queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.

Check for High Rowsinlookupcache counters
Multiple lookups can slow the session. You may improve session performance by locating the
largest lookup tables and tuning those lookup expressions.

Check for High Errorrows counters
Transformation errors affect session performance. If a session has large numbers in any of the
Transformation_errorrows counters, you may improve performance by eliminating the errors.

Session Bottlenecks
Session performance details can be used to flag other problem areas. Create performance details
by selecting “Collect Performance Data” in the session properties before running the session.
View the performance details through the Workflow Monitor as the session runs, or view the
resulting file. The performance details provide counters about each source qualifier, target
definition, and individual transformation within the mapping to help you understand session and
mapping efficiency.
To view the performance details during the session run:

       Right-click the session in the Workflow Monitor.
       Choose Properties.
       Click the Properties tab in the details dialog box.

To view the resulting performance data file, look for the file session_name.perf in the same directory
as the session log and open the file in any text editor.
All transformations have basic counters that indicate the number of input rows, output rows, and
error rows. Source qualifiers, normalizers, and targets have additional counters indicating the
efficiency of data moving into and out of buffers. Some transformations have counters specific to
their functionality. When reading performance details, the first column displays the transformation
name as it appears in the mapping, the second column contains the counter name, and the third
column holds the resulting number or efficiency percentage.

Low buffer input and buffer output counters
If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all sources and
targets, increasing the session DTM buffer pool size may improve performance.

Aggregator, Rank, and Joiner readfromdisk and writetodisk counters
If a session contains Aggregator, Rank, or Joiner transformations, examine each
Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any
number other than zero, you can improve session performance by increasing the index and data
cache sizes.
If the session performs incremental aggregation, the Aggregator_readfromdisk and writetodisk
counters display a number other than zero because the Integration Service reads historical
aggregate data from the local disk during the session and writes to disk when saving historical data.
Evaluate the incremental Aggregator_readfromdisk and writetodisk counters during the session. If the
counters show any numbers other than zero during the session run, you can increase performance
by tuning the index and data cache sizes.



System Bottlenecks
After tuning the source, target, mapping, and session, you may also consider tuning the system
hosting the Integration Service.
The Integration Service uses system resources to process transformations, session execution, and
the reading and writing of data. The Integration Service also uses system memory for other data
tasks such as creating aggregator, joiner, rank, and lookup table caches.
You can use system performance monitoring tools to monitor the amount of system resources the
Integration Service uses and to identify system bottlenecks.

      Windows NT/2000. Use system tools such as the Performance and Processes tab in the
       Task Manager to view CPU usage and total memory usage. You can also view more
       detailed performance information by using the Performance Monitor in the Administrative
       Tools on Windows.
      UNIX. Use the following system tools to monitor system performance and identify system
       bottlenecks:

            -   lsattr -E -l sys0 - To view current system settings
           -   iostat - To monitor loading operation for every disk attached to the database server
           -   vmstat or sar –w - To monitor disk swapping actions
           -   sar –u - To monitor CPU loading.






Performance Tuning UNIX Systems

Challenge
Identify opportunities for performance improvement within the complexities of the UNIX operating
environment.

Description
This section provides an overview of the subject area, followed by discussion of the use of specific
tools.

Overview
All system performance issues are fundamentally resource contention issues. In any computer
system, there are three essential resources: CPU, memory, and I/O – namely disk and network I/O.
From this standpoint, performance tuning for PowerCenter means ensuring that PowerCenter and
its sub-processes have adequate resources to execute in a timely and efficient manner.
Each resource has its own particular set of problems. Resource problems are complicated because
all resources interact with each other. Performance tuning is about identifying bottlenecks and
making trade-offs to improve the situation. Your best approach is to initially take a baseline
measurement and obtain a good understanding of how the system behaves, then evaluate the
bottlenecks revealed on each system resource during your load window and remove whichever
resource contention offers the greatest opportunity for performance enhancement.
Here is a summary of each system resource area and the problems it can have.

CPU
        On any multiprocessing and multi-user system, many processes want to use the CPUs at
         the same time. The UNIX kernel is responsible for allocation of a finite number of CPU
         cycles across all running processes. If the total demand on the CPU exceeds its finite
         capacity, then all processing is likely to reflect a negative impact on performance; the
         system scheduler puts each process in a queue to wait for CPU availability.
       An average of the count of active processes in the system for the last 1, 5, and 15 minutes
        is reported as the load average when you execute the command uptime. The load average
        provides a basic indicator of the number of contenders for CPU time. Likewise, the vmstat
        command provides an average usage of all the CPUs along with the number of processes
        contending for CPU (the value under the r column).
       On SMP (symmetric multiprocessing) architecture servers, watch for even utilization of all
        the CPUs. How well all the CPUs are utilized depends on how well an application can be
        parallelized. If a process is incurring a high degree of involuntary context switching by the
        kernel, binding the process to a specific CPU may improve performance.

Memory
       Memory contention arises when the memory requirements of the active processes exceed
        the physical memory available on the system; at this point, the system is out of memory. To
        handle this lack of memory, the system starts paging, or moving portions of active
        processes to disk in order to reclaim physical memory. When this happens, performance
        decreases dramatically.
       Paging is distinguished from swapping, which means moving entire processes to disk and
        reclaiming their space. Paging and excessive swapping indicate that the system can't
        provide enough memory for the processes that are currently running.
       Commands such as vmstat and pstat show whether the system is paging; ps, prstat and sar
        can report the memory requirements of each process.


Disk I/O
       The I/O subsystem is a common source of resource contention problems. A finite amount
        of I/O bandwidth must be shared by all the programs (including the UNIX kernel) that
        currently run. The system's I/O buses can transfer only so many megabytes per second;
        individual devices are even more limited. Each type of device has its own peculiarities and,
        therefore, its own problems.
Tools are available to evaluate specific parts of the I/O subsystem:
     iostat can give you information about the transfer rates for each disk drive.
     ps and vmstat can give some information about how many processes are blocked waiting
        for I/O.
     sar can provide voluminous information about I/O efficiency.
     sadp can give detailed information about disk access patterns.

Network I/O

       The source data, the target data, or both the source and target data are likely to be
        connected through an Ethernet channel to the system where PowerCenter resides. Be
        sure to consider the number of Ethernet channels and bandwidth available to avoid
        congestion.
               - netstat shows packet activity on a network; watch for a high collision rate of output
               packets on each interface.
               - nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not from the
               NFS server); watch for a high timeout rate relative to total calls and for "not
               responding" messages.

Given that these issues all boil down to access to some computing resource, mitigation of each
issue consists of making some adjustment to the environment to provide more (or preferential)
access to the resource; for instance:

       Adjusting execution schedules to allow leverage of low usage times may improve
        availability of memory, disk, network bandwidth, CPU cycles, etc.
       Migrating other applications to other hardware is likely to reduce demand on the hardware
        hosting PowerCenter.
       For CPU intensive sessions, raising CPU priority (or lowering priority for competing
        processes) provides more CPU time to the PowerCenter sessions.
       Adding hardware resources, such as adding memory, can make more resource available to
        all processes.
       Re-configuring existing resources may provide for more efficient usage, such as assigning
        different disk devices for input and output, striping disk devices, or adjusting network packet
        sizes.

Detailed Usage
The following tips have proven useful in performance tuning UNIX-based machines. While some of
these tips are likely to be more helpful than others in a particular environment, all are worthy of
consideration. Availability, syntax and format of each varies across UNIX versions.

Running ps -axu
Run ps -axu to check for the following items:

       Are there any processes waiting for disk access or for paging? If so check the I/O and
        memory subsystems.



       What processes are using most of the CPU? This may help to distribute the workload
        better.
       What processes are using most of the memory? This may help to distribute the workload
        better.
       Does ps show that your system is running many memory-intensive jobs? Look for jobs with
        a large resident set size (RSS) or a high storage integral.

Identifying and Resolving Memory Issues
Use vmstat or sar to check for paging/swapping actions. Check the system to ensure that
excessive paging/swapping does not occur at any time during the session processing. By using sar
5 10 or vmstat 1 10, you can get a snapshot of paging/swapping. If paging or excessive swapping
does occur at any time, increase memory to prevent it. Paging/swapping, on any database system,
causes a major performance decrease and increased I/O. On a memory-starved and I/O-bound
server, this can effectively shut down the PowerCenter process and any databases running on the
server.
Some swapping may occur normally regardless of the tuning settings. This occurs because some
processes use the swap space by their design. To check swap space availability, use pstat and
swap. If the swap space is too small for the intended applications, it should be increased.
Run vmstat 5 (sar -wpgr for SunOS) or vmstat -S 5 to detect and confirm memory problems and
check for the following:

       Are pages-outs occurring consistently? If so, you are short of memory.
       Are there a high number of address translation faults? (System V only). This suggests a
        memory shortage.
       Are swap-outs occurring consistently? If so, you are extremely short of memory. Occasional
        swap-outs are normal; BSD systems swap-out inactive jobs. Long bursts of swap-outs
        mean that active jobs are probably falling victim and indicate extreme memory shortage. If
        you don‟t have vmstat S, look at the w and de fields of vmstat. These should always be
        zero.
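A minimal sketch of such a snapshot (option letters and column names such as po, so, w, and de differ
by UNIX flavor, so check the local man pages):

# Ten 5-second samples; watch the page-out and swap-out columns
vmstat -S 5 10

# System V equivalent: paging, swapping, and memory activity via sar
sar -wpgr 5 10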

If memory seems to be the bottleneck, try the following remedial steps:

       Reduce the size of the buffer cache (if your system has one) by decreasing BUFPAGES.
       If you have statically allocated STREAMS buffers, reduce the number of large (e.g., 2048-
        and 4096-byte) buffers. This may reduce network performance, but netstat -m should give
        you an idea of how many buffers you really need.
       Reduce the size of your kernel's tables. This may limit the system's capacity (i.e., number of
        files, number of processes, etc.).
       Try running jobs requiring a lot of memory at night. This may not help the memory
        problems, but you may not care about them as much.
       Try running jobs requiring a lot of memory in a batch queue. If only one memory-intensive
        job is running at a time, your system may perform satisfactorily.
       Try to limit the time spent running sendmail, which is a memory hog.
       If you don't see any significant improvement, add more memory.

Identifying and Resolving Disk I/O Issues
Use iostat to check I/O load and utilization, as well as CPU load. iostat can be used to monitor the
I/O load on specific disks on the UNIX server. Take note of how evenly disk activity is distributed
among the system disks. If it is not, are the most active disks also the fastest disks?
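For example (a minimal sketch; the exact options differ between System V and BSD versions of iostat):

# Five-second samples of per-disk throughput and CPU utilization
iostat 5

# Extended per-device statistics, where supported
iostat -x 5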





Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area of the disk
(good), spread evenly across the disk (tolerable), or in two well-defined peaks at opposite ends
(bad)?

       Reorganize your file systems and disks to distribute I/O activity as evenly as possible.
       Using symbolic links helps to keep the directory structure the same throughout while still
        moving the data files that are causing I/O contention.
       Use your fastest disk drive and controller for your root file system; this almost certainly has
        the heaviest activity. Alternatively, if single-file throughput is important, put performance-
        critical files into one file system and use the fastest drive for that file system.
       Put performance-critical files on a file system with a large block size: 16KB or 32KB (BSD).
       Increase the size of the buffer cache by increasing BUFPAGES (BSD). This may hurt your
        system's memory performance.
       Rebuild your file systems periodically to eliminate fragmentation (i.e., backup, build a new
        file system, and restore).
       If you are using NFS and using remote files, look at your network situation. You don't have
        local disk I/O problems.
       Check memory statistics again by running vmstat 5 (sar -rwpg). If your system is paging or
        swapping consistently, you have memory problems; fix the memory problem first. Swapping
        makes performance worse.

If your system has a disk capacity problem and is constantly running out of disk space, try the
following actions:

       Write a find script that detects old core dumps, editor backup and auto-save files, and other
        trash, and deletes them automatically. Run the script through cron (a minimal sketch follows
        this list).
       Use the disk quota system (if your system has one) to prevent individual users from
        gathering too much storage.
       Use a smaller block size on file systems that are mostly small files (e.g., source code files,
        object modules, and small data files).
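A minimal sketch of such a cleanup script, assuming core dumps and editor backup files older than
seven days under /home and /tmp can safely be removed (adjust the paths, patterns, and retention to
your site's standards):

#!/bin/sh
# clean_trash.sh - delete old core dumps and editor backup/auto-save files
# Example cron entry:  0 2 * * * /usr/local/bin/clean_trash.sh
find /home /tmp -type f \( -name core -o -name '*.bak' -o -name '*~' -o -name '.*.swp' \) \
    -mtime +7 -exec rm -f {} \; 2>/dev/null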

Identifying and Resolving CPU Overload Issues
Use uptime or sar -u to check for CPU loading. sar provides more detail, including %usr (user),
%sys (system), %wio (waiting on I/O), and %idle (percent of idle time). A target goal should be
%usr + %sys = 80 and %wio = 10, leaving %idle at 10.
If %wio is higher, disk and I/O contention should be investigated to eliminate the I/O bottleneck on
the UNIX server. If the system shows a heavy %sys load together with a high %idle, this typically
indicates memory contention and swapping/paging problems. In this case, make memory changes to
reduce the load on the server.
When you run iostat 5, also watch for CPU idle time. Is the idle time always 0, without letup? It is
good for the CPU to be busy, but if it is always busy 100 percent of the time, work must be piling up
somewhere. This points to CPU overload.

       Eliminate unnecessary daemon processes. rwhod and routed are particularly likely to be
        performance problems, but any savings will help.
       Get users to run jobs at night with at or any queuing system that's available. You may not
        care if the CPU (or the memory or I/O system) is overloaded at night, provided the work is
        done in the morning.
       Using nice to lower the priority of CPU-bound jobs improves interactive performance, and
        using nice to raise the priority of CPU-bound jobs expedites them but may hurt interactive
        performance (see the sketch after this list). In general, though, nice is only a temporary
        solution; if your workload grows, it will soon become insufficient. Consider upgrading your
        system, replacing it, or buying another system to share the load.
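For instance, a competing CPU-bound job can be started at a lower priority, or an already-running one
re-prioritized (a minimal sketch; the job path and PID shown are hypothetical, and renice may require
root):

# Start a competing batch job at a lower priority (higher nice value)
nice -n 10 /path/to/competing_batch_job.sh

# Lower the priority of a process that is already running (PID 12345 is hypothetical)
renice 10 -p 12345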



Identifying and Resolving Network I/O Issues
Suspect problems with network capacity or with data integrity if users experience slow
performance when they are using rlogin or when they are accessing files via NFS.
Look at netstat -i. If the number of collisions is large, suspect an overloaded network. If the number
of input or output errors is large, suspect hardware problems. A large number of input errors
indicates problems somewhere on the network. A large number of output errors suggests problems
with your system and its interface to the network.
If collisions and network hardware are not a problem, figure out which system appears to be
slow. Use spray to send a large burst of packets to the slow system. If the number of dropped
packets is large, the remote system most likely cannot respond to incoming data fast enough. Look
to see if there are CPU, memory or disk I/O problems on the remote system. If not, the system may
just not be able to tolerate heavy network workloads. Try to reorganize the network so that this
system isn't a file server.
A large number of dropped packets may also indicate data corruption. Run netstat -s on the
remote system, then spray the remote system from the local system and run netstat -s again. If the
increase in UDP socket full drops (as indicated by netstat) is equal to or greater than the number of
dropped packets that spray reports, the remote system is a slow network server. If the increase in
socket full drops is less than the number of dropped packets, look for network errors.
Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent of calls, the
network or an NFS server is overloaded. If timeout is high, at least one NFS server is overloaded,
the network may be faulty, or one or more servers may have crashed. If badxid is roughly equal to
timeout, at least one NFS server is overloaded. If timeout and retrans are high, but badxid is low,
some part of the network between the NFS client and server is overloaded and dropping packets.
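For example (a minimal sketch; field names vary slightly across UNIX versions, and slow_host is a
hypothetical server name):

# Client-side RPC statistics; watch the retrans, timeout, and badxid fields
nfsstat -c

# Send a burst of packets to a suspect server and see how many it drops
spray slow_host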
Try to prevent users from running I/O-intensive programs across the network. The grep utility
is a good example of an I/O-intensive program. Instead, have users log into the remote system to
do their work.
Reorganize the computers and disks on your network so that as many users as possible can do
as much work as possible on a local system.
Use systems with good network performance as file servers. lsattr -E -l sys0 can be used to
determine some current settings in some UNIX environments. (In Solaris, you execute prtenv.) Of
particular interest is maxuproc, the setting that determines the maximum number of user background
processes. On most UNIX environments this defaults to 40, but it should be increased to 250 on
most systems.
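On AIX, for instance, the setting can be checked and raised roughly as follows (a hedged sketch; other
UNIX flavors expose the equivalent limit through their own kernel or resource-control parameters):

# Show the current maxuproc value
lsattr -E -l sys0 -a maxuproc

# Raise it to 250 (requires root)
chdev -l sys0 -a maxuproc=250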
Choose a file system. Be sure to check the database vendor documentation to determine the best
file system for the specific machine. Typical choices include: s5, the UNIX System V file system;
ufs, the UNIX file system derived from Berkeley (BSD); vxfs, the Veritas file system; and lastly raw
devices, which in reality are not a file system at all. Additionally, for the PowerCenter Enterprise Grid
Option cluster file system (CFS), products such as GFS for RedHat Linux, Veritas CFS, and GPFS
for IBM AIX are some of the available choices.

Cluster File System Tuning
In order to take full advantage of the PowerCenter Enterprise Grid Option, a cluster file system (CFS)
is recommended. The PowerCenter Grid Option requires the directories for each Integration Service
to be shared with other servers. This allows Integration Services to share files such as cache files
between different session runs. CFS performance is a result of tuning parameters and tuning the
infrastructure. Therefore, using the parameters recommended by each CFS vendor is the best
approach for CFS tuning.






Recommended Performance Tuning Procedures
Challenge
To optimize PowerCenter load times by employing a series of performance tuning procedures.

Description
When a PowerCenter session or workflow is not performing at the expected or desired speed, there
is a methodology that can help to diagnose problems that may be adversely affecting various
components of the data integration architecture. While PowerCenter has its own performance
settings that can be tuned, you must consider the entire data integration architecture, including the
UNIX/Windows servers, network, disk array, and the source and target databases to achieve
optimal performance. More often than not, an issue external to PowerCenter is the cause of the
performance problem. In order to correctly and scientifically determine the most logical cause of the
performance problem, you need to execute the performance tuning steps in a specific order. This
enables you to methodically rule out individual pieces and narrow down the specific areas on which
to focus your tuning efforts.

1. Perform Benchmarking
You should always have a baseline of current load times for a given workflow or session with a
similar row count. Perhaps you are not achieving your required load window, or you simply believe
your processes could run more efficiently based on comparison with other similar tasks that run faster.
Use the benchmark to estimate what your desired performance goal should be and tune to that
goal. Begin with the problem mapping that you created, along with a session and workflow that use
all default settings. This helps to identify which changes have a positive impact on performance.

2. Identify the Performance Bottleneck Area
This step helps to narrow down the areas on which to focus further. Follow the areas and sequence
below when attempting to identify the bottleneck:

       Target
       Source
       Mapping
       Session/Workflow
       System.

The methodology steps you through a series of tests using PowerCenter to identify trends that point
where next to focus. Remember to go through these tests in a scientific manner, running them
multiple times before reaching any conclusion, and always keep in mind that fixing one bottleneck
area may create a different bottleneck.

3. "Inside" or "Outside" PowerCenter
Depending on the results of the bottleneck tests, optimize “inside” or “outside” PowerCenter. Be
sure to perform the bottleneck test in the order prescribed in Determining Bottlenecks, since this is
also the order in which you should make any performance changes.
Problems “outside” PowerCenter refer to anything indicating that the source of the performance
problem is external to PowerCenter. The most common performance problems “outside”
PowerCenter are source/target database problems, network bottlenecks, and server or operating
system problems.





       For source database related bottlenecks, refer to Tuning SQL Overrides and Environment
        for Better Performance
       For target database related problems, refer to Performance Tuning Databases - Oracle,
        SQL Server, or Teradata
       For operating system problems, refer to Performance Tuning UNIX Systems or
        Performance Tuning Windows 2000/2003 Systems for more information.

Problems “inside” PowerCenter refer to anything that PowerCenter controls, such as the actual
transformation logic and the PowerCenter Workflow/Session settings. The session settings contain
quite a few memory settings and partitioning options that can greatly improve performance. Refer to
the Tuning Sessions for Better Performance for more information.
Although there are certain procedures to follow to optimize mappings, keep in mind that, in most
cases, the mapping design is dictated by business logic; there may be a more efficient way to
perform the business logic within the mapping, but you cannot ignore the necessary business logic
to improve performance.

4. Re-Execute the Problem Workflow or Session
After you have completed the recommended steps for each relevant performance bottleneck, re-run
the problem workflow or session and compare its load performance against the benchmark
baseline. This step is iterative, and should be performed after any
performance-based setting is changed. You are trying to answer the question, “Did the performance
change have a positive impact?” If so, move on to the next bottleneck. Be sure to prepare detailed
documentation at every step along the way so you have a clear record of what was and wasn't tried.
While it may seem like there are an enormous number of areas where a performance problem can
arise, if you follow the steps for finding the bottleneck(s), and apply the tuning techniques specific to
it, you are likely to improve performance and achieve your desired goals.






Tuning Mappings for Better Performance
Challenge
In general, mapping-level optimization takes time to implement, but can significantly boost
performance.
Sometimes the mapping is the biggest bottleneck in the load process because business rules
determine the number and complexity of transformations in a mapping.
Before deciding on the best route to optimize the mapping architecture, you need to resolve some
basic issues. Tuning mappings is a grouped approach. The first group can be of assistance almost
universally, bringing about a performance increase in all scenarios. The second group of tuning
processes may yield only small performance increase, or can be of significant value, depending on
the situation.
Some factors to consider when choosing tuning processes at the mapping level include the specific
environment, software/ hardware limitations, and the number of rows going through a mapping. This
Best Practice offers some guidelines for tuning mappings.

Description
Analyze mappings for tuning only after you have tuned the target and source for peak performance.
To optimize mappings, you generally reduce the number of transformations in the mapping and
delete unnecessary links between transformations.
For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup
transformations), limit connected input/output or output ports. Doing so can reduce the amount of
data the transformations store in the data cache. Having too many Lookups and Aggregators can
encumber performance because each requires index cache and data cache. Since both are fighting
for memory space, decreasing the number of these transformations in a mapping can help improve
speed. Splitting them up into different mappings is another option.
Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O
activity on the cache directory. Unless the seek/access time is fast on the directory itself, having too
many Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes
contention of disk and memory, which can lead to thrashing, leaving insufficient memory to run a
mapping efficiently.

Consider Single-Pass Reading

If several mappings use the same data source, consider a single-pass reading. If you have several
sessions that use the same sources, consolidate the separate mappings with either a single Source
Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the
separate data flows.
Similarly, if a function is used in several mappings, a single-pass reading reduces the number of
times that function is called in the session. For example, if you need to subtract percentage from the
PRICE ports for both the Aggregator and Rank transformations, you can minimize work by
subtracting the percentage before splitting the pipeline.

Optimize SQL Overrides
When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in the update
override of a target object, be sure the SQL statement is tuned. The extent to which and how SQL
can be tuned depends on the underlying source or target database system.

Scrutinize Datatype Conversions
PowerCenter Server automatically makes conversions between compatible datatypes. When these
conversions are performed unnecessarily, performance slows. For example, if a mapping moves
data from an integer port to a decimal port, then back to an integer port, the conversion may be
unnecessary.




In some instances however, datatype conversions can help improve performance. This is especially
true when integer values are used in place of other datatypes for performing comparisons using
Lookup and Filter transformations.

Eliminate Transformation Errors
Large numbers of evaluation errors significantly slow performance of the PowerCenter Server.
During transformation errors, the PowerCenter Server engine pauses to determine the cause of the
error, removes the row causing the error from the data flow, and logs the error in the session log.
Transformation errors can be caused by many things including: conversion errors, conflicting
mapping logic, any condition that is specifically set up as an error, and so on. The session log can
help point out the cause of these errors. If errors recur consistently for certain transformations, re-
evaluate the constraints for these transformations. If you need to run a session that generates a
large number of transformation errors, you might improve performance by setting a lower tracing
level. However, this is not a long-term response to transformation errors. Any source of errors
should be traced and eliminated.

Optimize Lookup Transformations
There are several ways to optimize Lookup transformations that are set up in a mapping.

When to Cache Lookups
Cache small lookup tables. When caching is enabled, the PowerCenter Server caches the lookup
table and queries the lookup cache during the session. When this option is not enabled, the
PowerCenter Server queries the lookup table on a row-by-row basis.
All of the tuning options mentioned in this Best Practice assume that memory and cache sizing for
lookups are sufficient to ensure that caches will not page to disks. Information regarding memory
and cache sizing for Lookup transformations is covered in the Best Practice: Tuning Sessions for
Better Performance.
A better rule of thumb than memory size is to determine the size of the potential lookup cache with
regard to the number of rows expected to be processed.
For example, consider Mapping X, in which the source and the candidate lookup tables contain the
following numbers of records:

Table                     Number of rows
Source (Mapping X)        5,000
MANUFACTURER (lookup)     200
DIM_ITEMS (lookup)        100,000

Number of disk reads:

Lookup table       Cached lookup       Un-cached lookup
MANUFACTURER       5,200               10,000
DIM_ITEMS          105,000             10,000




Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will
take a total of 5200 disk reads to build the cache and execute the lookup. If the lookup table is not
cached, then it will take a total of 10,000 total disk reads to execute the lookup. In this case, the
number of records in the lookup table is small in comparison with the number of times the lookup is
executed. So this lookup should be cached. This is the more likely scenario.
Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result
in 105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then
the disk reads would total 10,000. In this case the number of records in the lookup table is not small
in comparison with the number of times the lookup will be executed. Thus, the lookup should not be
cached.
Use the following eight-step method to determine whether a lookup should be cached:

1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a "where" clause on a relational
source to load a sample of 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name than the log
created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note
this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds
and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND =
NRS.
7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and
divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.



8. Use the following formula to find the breakeven row point:
(LS*NRS*CRS)/(CRS-NRS) = X

Where X is the breakeven point. If the expected number of source records is less than X, it is better
not to cache the lookup; if it is more than X, it is better to cache the lookup.
For example:

Assume the lookup takes 166 seconds to cache (LS=166).
Assume with a cached lookup the load is 232 rows per second (CRS=232).
Assume with a non-cached lookup the load is 147 rows per second (NRS = 147).
The formula would result in: (166*147*232)/(232-147) = 66,603.

Thus, if the source has less than 66,603 records, the lookup should not be cached. If it has more
than 66,603 records, then the lookup should be cached.
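The same arithmetic is easy to recompute from new session logs (a minimal sketch using awk; LS,
NRS, and CRS are the values measured in steps 5 through 7):

awk -v LS=166 -v NRS=147 -v CRS=232 'BEGIN { printf "break-even rows: %.0f\n", (LS*NRS*CRS)/(CRS-NRS) }'
# Prints: break-even rows: 66603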

Sharing Lookup Caches
There are a number of methods for sharing lookup caches:

       Within a specific session run for a mapping, if the same lookup is used multiple times in
        a mapping, the PowerCenter Server will re-use the cache for the multiple instances of the
        lookup.
        Using the same lookup multiple times in the mapping will be more resource-intensive with
        each successive instance. If multiple cached lookups are from the same table but are
        expected to return different columns of data, it may be better to set up the multiple lookups
        to bring back the same columns, even though not all return ports are used in all lookups.
        Bringing back a common set of columns may reduce the number of disk reads.
       Across sessions of the same mapping, the use of an unnamed persistent cache allows
        multiple runs to use an existing cache file stored on the PowerCenter Server. If the option of
        creating a persistent cache is set in the lookup properties, the memory cache created for
        the lookup during the initial run is saved to the PowerCenter Server. This can improve
        performance because the Server builds the memory cache from cache files instead of the
        database. This feature should only be used when the lookup table is not expected to
        change between session runs.
       Across different mappings and sessions, the use of a named persistent cache allows
        sharing an existing cache file.


Reducing the Number of Cached Rows
There is an option to use a SQL override in the creation of a lookup cache. Options can be added to
the WHERE clause to reduce the set of records included in the resulting cache.
If you use a SQL override in a lookup, the lookup must be cached.

Optimizing the Lookup Condition
In the case where a lookup uses more than one lookup condition, set the conditions with an equal
sign first in order to optimize lookup performance.

Indexing the Lookup Table
The PowerCenter Server must query, sort, and compare values in the lookup condition columns. As
a result, indexes on the database table should include every column used in a lookup condition.
This can improve performance for both cached and un-cached lookups.




       In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement
        used to create the cache. Columns used in the ORDER BY condition should be indexed.
        The session log will contain the ORDER BY statement.
       In the case of an un-cached lookup, since a SQL statement is created for each row passing
        into the lookup transformation, performance can be helped by indexing columns in the
        lookup condition.

Use a Persistent Lookup Cache for Static Lookups
If the lookup source does not change between sessions, configure the Lookup transformation to
use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from
session to session, eliminating the time required to read the lookup source.

Optimize Filter and Router Transformations
Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead
of using a Filter Transformation to remove a sizeable number of rows in the middle or end of a
mapping, use a filter on the Source Qualifier or a Filter Transformation immediately after the source
qualifier to improve performance.
Avoid complex expressions when creating the filter condition. Filter transformations are most
effective when a simple integer or TRUE/FALSE expression is used in the filter condition.
Filters or routers should also be used to drop rejected rows from an Update Strategy
transformation if rejected rows do not need to be saved.
Replace multiple filter transformations with a router transformation. This reduces the number
of transformations in the mapping and makes the mapping easier to follow.

Optimize Aggregator Transformations

Aggregator Transformations often slow performance because they must group data before
processing it.
Use simple columns in the group by condition to make the Aggregator Transformation more
efficient. When possible, use numbers instead of strings or dates in the GROUP BY columns. Also
avoid complex expressions in the Aggregator expressions, especially in GROUP BY ports.
Use the Sorted Input option in the Aggregator. This option requires that data sent to the
Aggregator be sorted in the order in which the ports are used in the Aggregator's group by. The
Sorted Input option decreases the use of aggregate caches. When it is used, the PowerCenter
Server assumes all data is sorted by group and, as a group is passed through an Aggregator,
calculations can be performed and information passed on to the next transformation. Without sorted
input, the Server must wait for all rows of data before processing aggregate calculations. Use of the
Sorted Inputs option is usually accompanied by a Source Qualifier which uses the Number of
Sorted Ports option.
Use an Expression and Update Strategy instead of an Aggregator Transformation. This
technique can only be used if the source data can be sorted. Further, using this option assumes
that a mapping is using an Aggregator with Sorted Input option. In the Expression Transformation,
the use of variable ports is required to hold data from the previous row of data processed. The
premise is to use the previous row of data to determine whether the current row is a part of the
current group or is the beginning of a new group.
Thus, if the row is a part of the current group, then its data would be used to continue calculating
the current group function. An Update Strategy Transformation would follow the Expression
Transformation and set the first row of a new group to insert, and the following rows to update.
Use incremental aggregation if the changes you capture from the source affect less than half the
target. When using incremental aggregation, you apply captured changes in the source to
aggregate calculations in a session. The PowerCenter Server updates your target incrementally,
rather than processing the entire source and recalculating the same calculations every time you run
the session.


Joiner Transformation

Joining Data from the Same Source

You can join data from the same source in the following ways:

       Join two branches of the same pipeline.
       Create two instances of the same source and join pipelines from these source instances.

You may want to join data from the same source if you want to perform a calculation on part of the
data and join the transformed data with the original data. When you join the data using this method,
you can maintain the original data and transform parts of that data within one mapping.
When you join data from the same source, you can create two branches of the pipeline. When you
branch a pipeline, you must add a transformation between the Source Qualifier and the Joiner
transformation in at least one branch of the pipeline. You must join sorted data and configure the
Joiner transformation for sorted input.
If you want to join unsorted data, you must create two instances of the same source and join the
pipelines.
For example, you may have a source with the following ports:

       Employee
       Department
       Total Sales

In the target table, you want to view the employees who generated sales that were greater than the
average sales for their respective departments. To accomplish this, you create a mapping with the
following transformations:

       Sorter transformation. Sort the data.
       Sorted Aggregator transformation. Average the sales data and group by department.
        When you perform this aggregation, you lose the data for individual employees. To maintain
        employee data, you must pass a branch of the pipeline to the Aggregator transformation
        and pass a branch with the same data to the Joiner transformation to maintain the original
        data. When you join both branches of the pipeline, you join the aggregated data with the
        original data.
       Sorted Joiner transformation. Use a sorted Joiner transformation to join the sorted
        aggregated data with the original data.
       Filter transformation. Compare the average sales data against sales data for each
        employee and filter out employees whose sales are not above the department average.

Joining two branches can affect performance if the Joiner transformation receives data from one
branch much later than the other branch. The Joiner transformation caches all the data from the first
branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the
data from disk when it receives the data from the second branch. This can slow processing.
You can also join same source data by creating a second instance of the source. After you create
the second source instance, you can join the pipelines from the two source instances.

Use the following guidelines when deciding whether to join branches of a pipeline or join two
instances of a source:

       Join two branches of a pipeline when you have a large source or if you can read the source
        data only once. For example, you can only read source data from a message queue once.





       Join two branches of a pipeline when you use sorted data. If the source data is unsorted
        and you use a Sorter transformation to sort the data, branch the pipeline after you sort the
        data.
       Join two instances of a source when you need to add a blocking transformation to the
        pipeline between the source and the Joiner transformation.
       Join two instances of a source if one pipeline may process much more slowly than the other
        pipeline.

Performance Tips
Use the database to do the join when sourcing data from the same database schema. Database
systems usually can perform the join more quickly than the PowerCenter Server, so a SQL override
or a join condition should be used when joining multiple tables from the same database schema.
Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting
set of data is also smaller.
Join sorted data when possible. You can improve session performance by configuring the Joiner
transformation to use sorted input. When you configure the Joiner transformation to use sorted
data, the PowerCenter Server improves performance by minimizing disk input and output. You see
the greatest performance improvement when you work with large data sets.
For an unsorted Joiner transformation, designate as the master source the source with
fewer rows. For optimal performance and disk storage, designate the master source as the source
with the fewer rows. During a session, the Joiner transformation compares each row of the master
source against the detail source. The fewer unique rows in the master, the fewer iterations of the
join comparison occur, which speeds the join process.
For a sorted Joiner transformation, designate as the master source the source with fewer
duplicate key values. For optimal performance and disk storage, designate the master source as
the source with fewer duplicate key values. When the PowerCenter Server processes a sorted
Joiner transformation, it caches rows for one hundred keys at a time. If the master source contains
many rows with the same key value, the PowerCenter Server must cache more rows, and
performance can be slowed.
Optimizing sorted joiner transformations with partitions. When you use partitions with a sorted
Joiner transformation, you may optimize performance by grouping data and using n:n partitions.

Add a hash auto-keys partition upstream of the sort origin
To obtain expected results and get best performance when partitioning a sorted Joiner
transformation, you must group and sort data. To group data, ensure that rows with the same key
value are routed to the same partition. The best way to ensure that data is grouped and distributed
evenly among partitions is to add a hash auto-keys or key-range partition point before the sort
origin. Placing the partition point before you sort the data ensures that you maintain grouping and
sort the data within each group.

Use n:n partitions
You may be able to improve performance for a sorted Joiner transformation by using n:n partitions.
When you use n:n partitions, the Joiner transformation reads master and detail rows concurrently
and does not need to cache all of the master data. This reduces memory usage and speeds
processing. When you use 1:n partitions, the Joiner transformation caches all the data from the
master pipeline and writes the cache to disk if the memory cache fills. When the Joiner
transformation receives the data from the detail pipeline, it must then read the data from disk to
compare the master and detail pipelines.

Optimize Sequence Generator Transformations




Sequence Generator transformations need to determine the next available sequence number; thus,
increasing the Number of Cached Values property can increase performance. This property
determines the number of values the PowerCenter Server caches at one time. If it is set to cache no
values, then the PowerCenter Server must query the repository each time to determine the next
number to be used. You may consider configuring the Number of Cached Values to a value greater
than 1000. Note that any cached values not used in the course of a session are lost, since the
sequence value stored in the repository is advanced to the start of the next block as soon as a set of
values is cached.

Avoid External Procedure Transformations
For the most part, making calls to external procedures slows a session. If possible, avoid the use of
these Transformations, which include Stored Procedures, External Procedures, and Advanced
External Procedures.

Field-Level Transformation Optimization
As a final step in the tuning process, you can tune expressions used in transformations. When
examining expressions, focus on complex expressions and try to simplify them when possible.
To help isolate slow expressions, do the following:

1. Time the session with the original expression.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a
constant.
5. Run and time the edited session.
Processing field-level transformations takes time. If the transformation expressions are complex,
then processing is even slower. It's often possible to get a 10 to 20 percent performance
improvement by optimizing complex field-level transformations. Use the target table mapping
reports or the Metadata Reporter to examine the transformations. Likely candidates for optimization
are the fields with the most complex expressions. Keep in mind that there may be more than one
field causing performance problems.

Factoring Out Common Logic
Factoring out common logic can reduce the number of times a mapping performs the same logic. If
a mapping performs the same logic multiple times, moving the task upstream in the mapping may
allow the logic to be performed just once. For example, a mapping has five target tables. Each
target requires a Social Security Number lookup. Instead of performing the lookup right before each
target, move the lookup to a position before the data flow splits.

Minimize Function Calls
Anytime a function is called it takes resources to process. There are several common examples
where function calls can be reduced or eliminated.
Aggregate function calls can sometimes be reduced. In the case of each aggregate function call,
the PowerCenter Server must search and group the data. Thus, the following expression:

SUM(Column A) + SUM(Column B)

Can be optimized to:

SUM(Column A + Column B)




In general, operators are faster than functions, so operators should be used whenever possible.
For example, if you have an expression that involves a CONCAT function such as:

CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)

It can be optimized to:

FIRST_NAME || ' ' || LAST_NAME

Remember that IIF() is a function that returns a value, not just a logical test. This allows many
logical statements to be written in a more compact fashion. For example:

IIF(FLG_A='Y' AND FLG_B='Y' AND FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' AND FLG_B='Y' AND FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' AND FLG_B='N' AND FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' AND FLG_B='N' AND FLG_C='N', VAL_A,
IIF(FLG_A='N' AND FLG_B='Y' AND FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' AND FLG_B='Y' AND FLG_C='N', VAL_B,
IIF(FLG_A='N' AND FLG_B='N' AND FLG_C='Y', VAL_C,
IIF(FLG_A='N' AND FLG_B='N' AND FLG_C='N', 0.0))))))))

Can be optimized to:

IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)

The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized expression results
in three IIFs, three comparisons, and two additions.
Be creative in making expressions more efficient. The following is an example of reworking an
expression to reduce three comparisons to one:

IIF(X=1 OR X=5 OR X=9, 'yes', 'no')

Can be optimized to:

IIF(MOD(X, 4) = 1, 'yes', 'no')

Calculate Once, Use Many Times
Avoid calculating or testing the same value multiple times. If the same sub-expression is used
several times in a transformation, consider making the sub-expression a local variable. The local
variable can be used only within the transformation in which it was created. Calculating the variable
only once and then referencing the variable in following sub-expressions improves performance.

Choose Numeric vs. String Operations
The PowerCenter Server processes numeric operations faster than string operations. For example,
if a lookup is performed on a large amount of data on two columns, EMPLOYEE_NAME and
EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.

Optimizing Char-Char and Char-Varchar Comparisons
When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it
slows each time it finds trailing blank spaces in the row. To resolve this, set the Treat CHAR as CHAR
On Read option in the PowerCenter Server setup so that the server does not trim trailing spaces
from the end of CHAR source fields.




Use DECODE Instead of LOOKUP
When a LOOKUP function is used, the PowerCenter Server must look up a table in the database.
When a DECODE function is used, the lookup values are incorporated into the expression itself, so
the server does not need to look up a separate table. Thus, when looking up a small set of
unchanging values, using DECODE may improve performance.

Reduce the Number of Transformations in a Mapping
Because there is always overhead involved in moving data among transformations, try, whenever
possible, to reduce the number of transformations. Also, remove unnecessary links between
transformations to minimize the amount of data moved. This is especially important with data being
pulled from the Source Qualifier Transformation.

Use Pre- and Post-Session SQL Commands
You can specify pre- and post-session SQL commands in the Properties tab of the Source Qualifier
transformation and in the Properties tab of the target instance in a mapping. To increase the load
speed, use these commands to drop indexes on the target before the session runs, then recreate
them when the session completes.
Apply the following guidelines when using SQL statements:

       You can use any command that is valid for the database type. However, the PowerCenter
        Server does not allow nested comments, even though the database may.
       You can use mapping parameters and variables in SQL executed against the source, but
        not against the target.
       Use a semi-colon (;) to separate multiple statements.
       The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within
        /* ...*/.
       If you need to use a semi-colon outside of quotes or comments, you can escape it with a
        back slash (\).
       The Workflow Manager does not validate the SQL.

Use Environmental SQL
For relational databases, you can execute SQL commands in the database environment when
connecting to the database. You can use this for source, target, lookup, and stored procedure
connections. For instance, you can set isolation levels on the source and target systems to avoid
deadlocks. Follow the guidelines listed above for using the SQL statements.

Use Local Variables
You can use local variables in Aggregator, Expression, and Rank transformations.

Temporarily Store Data and Simplify Complex Expressions
Rather than parsing and validating the same expression each time, you can define these
components as variables. This also allows you to simplify complex expressions. For example, the
following expressions:

AVG( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND ( OFFICE_ID = 1000 ) ) )
SUM( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND ( OFFICE_ID = 1000 ) ) )

can use variables to simplify complex expressions and temporarily store data:
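For instance, the shared condition can be held in a variable port (shown here as a hypothetical port
named V_CONDITION) that both aggregate expressions then reference:

V_CONDITION = ( JOB_STATUS = 'Full-time' ) AND ( OFFICE_ID = 1000 )
AVG( SALARY, V_CONDITION )
SUM( SALARY, V_CONDITION )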






Store Values Across Rows

You can use variables to store data from prior rows. This can help you perform procedural
calculations. To compare the previous state to the state just read:

IIF( PREVIOUS_STATE = STATE, STATE_COUNTER + 1, 1 )

Capture Values from Stored Procedures
Variables also provide a way to capture multiple columns of return values from stored procedures.






Tuning Sessions for Better Performance
Challenge
Running sessions is where the pedal hits the metal. A common misconception is that this is the
area where most tuning should occur. While it is true that various specific session options can be
modified to improve performance, PowerCenter 8 also provides the PowerCenter Enterprise Grid
Option and Pushdown Optimization, which can improve performance tremendously.

Description
Once you have optimized the source database, target database, and mapping, you can focus on optimizing the
session. The greatest area for improvement at the session level usually involves tweaking memory
cache settings. The Aggregator (without sorted ports), Joiner, Rank, Sorter and Lookup
transformations (with caching enabled) use caches.
The PowerCenter Server uses index and data caches for each of these transformations. If the
allocated data or index cache is not large enough to store the data, the PowerCenter Server stores
the data in a temporary disk file as it processes the session data. Each time the PowerCenter
Server pages to the temporary file, performance slows.
You can see when the PowerCenter Server pages to the temporary file by examining the
performance details. The transformation_readfromdisk or transformation_writetodisk counters for
any Aggregator, Rank, Lookup, Sorter, or Joiner transformation indicate the number of times the
PowerCenter Server must page to disk to process the transformation. Index and data caches
should both be sized according to the requirements of the individual lookup. The sizing can be done
using the estimation tools provided in the Transformation Guide, or through observation of
actual cache sizes in the session caching directory.
The PowerCenter Server creates the index and data cache files by default in the PowerCenter
Server variable directory, $PMCacheDir. The naming convention used by the PowerCenter Server
for these files is PM [type of transformation] [generated session instance id number] _
[transformation instance id number] _ [partition index].dat or .idx. For example, an aggregate data
cache file would be named PMAGG31_19.dat. The cache directory may be changed, however, if
disk space is a constraint.
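For example, actual cache file sizes can be observed while a session runs (a minimal sketch;
substitute the configured cache directory path for $PMCacheDir):

# List index (.idx) and data (.dat) cache files, largest last
ls -l $PMCacheDir/*.idx $PMCacheDir/*.dat 2>/dev/null | sort -n -k 5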
Informatica recommends that the cache directory be local to the PowerCenter Server. A RAID 0
arrangement that gives maximum performance with no redundancy is recommended for volatile
cache file directories (i.e., no persistent caches).
If the PowerCenter Server requires more memory than the configured cache size, it stores the
overflow values in these cache files. Since paging to disk can slow session performance, the RAM
allocated needs to be available on the server. If the server doesn't have available RAM and uses
paged memory, your session is again accessing the hard disk. In this case, it is more efficient to
allow PowerCenter to page the data rather than the operating system. Adding additional memory to
the server is, of course, the best solution.
The PowerCenter Server writes to the index and data cache files during a session in the following
cases:

       The mapping contains one or more Aggregator transformations, and the session is
        configured for incremental aggregation.
       The mapping contains a Lookup transformation that is configured to use a persistent lookup
        cache, and the PowerCenter Server runs the session for the first time.
       The mapping contains a Lookup transformation that is configured to initialize the persistent
        lookup cache.
       The Data Transformation Manager (DTM) process in a session runs out of cache memory
        and pages to the local cache files. The DTM may create multiple files when processing
        large amounts of data. The session fails if the local directory runs out of disk space.





When a session is running, the PowerCenter Server writes a message in the session log indicating
the cache file name and the transformation name. When a session completes, the DTM generally
deletes the overflow index and data cache files.
However, index and data files may exist in the cache directory if the session is configured for either
incremental aggregation or to use a persistent lookup cache.
Cache files may also remain if the session does not complete successfully.

Configuring Automatic Memory Settings
PowerCenter 8 allows you to configure the amount of cache memory. Alternatively, you can
configure the Integration Service to automatically calculate cache memory settings at run time.
When you run a session, the Integration Service allocates buffer memory to the session to move the
data from the source to the target. It also creates session caches in memory. Session caches
include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as
well as Sorter and XML target caches.
The values stored in the data and index caches depend upon the requirements of the
transformation. For example, the Aggregator index cache stores group values as configured in the
group by ports, and the data cache stores calculations based on the group by ports. When the
Integration Service processes a Sorter transformation or writes data to an XML target, it also
creates a cache.

Configuring Session Cache Memory
The Integration Service can determine cache memory requirements for the Lookup, Aggregator,
Rank, Joiner, Sorter, and XML target caches.
You can configure Auto for the index and data cache sizes in the transformation properties or on the
Mapping tab of the session properties.

Max Memory Limits
Configuring maximum memory limits allows you to ensure that you reserve a designated amount or
percentage of memory for other processes. You can configure the memory limit as a numeric value
and as a percent of total memory. Because available memory varies, the Integration Service bases
the percentage value on the total memory on the Integration Service process machine.
For example, you configure automatic caching for three Lookup transformations in a session. Then,
you configure a maximum memory limit of 500MB for the session. When you run the session, the
Integration Service divides the 500MB of allocated memory among the index and data caches for
the Lookup transformations.
When you configure a maximum memory value, the Integration Service divides memory among
transformation caches based on the transformation type.
When you configure both a numeric value and a percentage, the Integration Service compares the
values and uses the lower value as the maximum memory limit.
When you configure automatic memory settings, the Integration Service specifies a minimum
memory allocation for the index and data caches. The Integration Service allocates 1,000,000 bytes
to the index cache and 2,000,000 bytes to the data cache for each transformation instance. If you
configure a maximum memory limit that is less than the minimum value for an index or data cache,
the Integration Service overrides this value. For example, if you configure a maximum memory
value of 500 bytes for a session containing a Lookup transformation, the Integration Service overrides
or disables the automatic memory settings and uses the default values.
When you run a session on a grid and you configure Maximum Memory Allowed for Auto Memory
Attributes, the Integration Service divides the allocated memory among all the nodes in the grid.
When you configure Maximum Percentage of Total Memory Allowed for Auto Memory Attributes,
the Integration Service allocates the specified percentage of memory on each node in the grid.




Aggregator Caches
Keep the following items in mind when configuring the aggregate memory cache sizes:

       Allocate enough space to hold at least one row in each aggregate group.
      Remember that you only need to configure cache memory for an Aggregator transformation
       that does not use sorted ports. The PowerCenter Server uses Session Process memory to
       process an Aggregator transformation with sorted ports, not cache memory.
      Incremental aggregation can improve session performance. When it is used, the
       PowerCenter Server saves index and data cache information to disk at the end of the
       session. The next time the session runs, the PowerCenter Server uses this historical
       information to perform the incremental aggregation. The PowerCenter Server names these
       files PMAGG*.dat and PMAGG*.idx and saves them to the cache directory. Mappings that
       have sessions which use incremental aggregation should be set up so that only new detail
       records are read with each subsequent run.
      When configuring Aggregate data cache size, remember that the data cache holds row data
       for variable ports and connected output ports only. As a result, the data cache is generally
       larger than the index cache. To reduce the data cache size, connect only the necessary
       output ports to subsequent transformations.

Joiner Caches
When a session is run with a Joiner transformation, the PowerCenter Server reads from master and
detail sources concurrently and builds index and data caches based on the master rows. The
PowerCenter Server then performs the join based on the detail source data and the cache data.
The number of rows the PowerCenter Server stores in the cache depends on the partitioning
scheme, the data in the master source, and whether or not you use sorted input.
After the memory caches are built, the PowerCenter Server reads the rows from the detail source
and performs the joins. The PowerCenter Server uses the index cache to test the join condition.
When it finds source data and cache data that match, it retrieves row values from the data cache.

Lookup Caches
Several options can be explored when dealing with Lookup transformation caches.

      Persistent caches should be used when lookup data is not expected to change often.
       Lookup cache files are saved after a session with a persistent cache lookup is run for the
       first time. These files are reused for subsequent runs, bypassing the querying of the
       database for the lookup. If the lookup table changes, you must be sure to set the Recache
       from Database option to ensure that the lookup cache files are rebuilt. You can also delete
       the cache files before the session run to force the session to rebuild the caches.
      Lookup caching should be enabled for relatively small tables. Refer to the Best Practice
       Tuning Mappings for Better Performance to determine when lookups should be cached.
       When the Lookup transformation is not configured for caching, the PowerCenter Server
       queries the lookup table for each input row. The result of the lookup query and processing
       is the same, regardless of whether the lookup table is cached or not. However, when the
       transformation is configured to not cache, the PowerCenter Server queries the lookup table
       instead of the lookup cache. Using a lookup cache can usually increase session
       performance.
       Just like for a Joiner, the PowerCenter Server aligns all data for lookup caches on an eight-
        byte boundary, which helps increase the performance of the lookup.

Allocating Buffer Memory


The Integration Service can determine the memory requirements for the buffer memory:

       DTM Buffer Size
       Default Buffer Block Size

You can also configure DTM buffer size and the default buffer block size in the session properties.
When the PowerCenter Server initializes a session, it allocates blocks of memory to hold source
and target data. Sessions that use a large number of sources and targets may require additional
memory blocks.
To configure these settings, first determine the number of memory blocks the PowerCenter Server
requires to initialize the session. Then you can calculate the buffer size and/or the buffer block size
based on the default settings, to create the required number of session blocks.
If there are XML sources or targets in the mappings, use the number of groups in the XML source
or target in the total calculation for the total number of sources and targets.
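
As a rough sizing sketch (these formulas are a commonly cited rule of thumb rather than figures
taken from this document, so treat the exact factors as assumptions to validate against the session
log):

        \[ \text{blocks required} \approx (\#\text{sources} + \#\text{targets}) \times 2 \]
        \[ \text{DTM Buffer Size} \gtrsim \text{blocks required} \times \text{Default Buffer Block Size} \times \#\text{partitions} \]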

Increasing the DTM Buffer Pool Size
The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter Server uses as
DTM buffer memory. The PowerCenter Server uses DTM buffer memory to create the internal data
structures and buffer blocks used to bring data into and out of the server. When the DTM buffer
memory is increased, the PowerCenter Server creates more buffer blocks, which can improve
performance during momentary slowdowns.
If a session's performance details show low numbers for your source and target
BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer pool size
may improve performance.
Increasing the DTM buffer memory allocation generally causes performance to improve initially and
then level off. (It may have no impact at all on source- or target-bottlenecked sessions, and may not
have an impact on DTM-bottlenecked sessions.) When the DTM buffer memory allocation
is increased, you need to evaluate the total memory available on the PowerCenter Server. If a
session is part of a concurrent batch, the combined DTM buffer memory allocated for the sessions
or batches must not exceed the total memory for the PowerCenter Server system. You can increase
the DTM buffer size in the Performance settings of the Properties tab.

Running Workflows and Sessions Concurrently
The PowerCenter Server can process multiple sessions in parallel and can also process multiple
partitions of a pipeline within a session. If you have a symmetric multiprocessing (SMP) platform,
you can use multiple CPUs to concurrently process session data or partitions of data. This provides
improved performance since true parallelism is achieved. On a single processor platform, these
tasks share the CPU, so there is no parallelism.
To achieve better performance, you can create a workflow that runs several sessions in parallel on
one PowerCenter Server. This technique should only be employed on servers with multiple CPUs
available.

Partitioning Sessions
Performance can be improved by processing data in parallel in a single session by creating multiple
partitions of the pipeline. If you have PowerCenter partitioning available, you can increase the
number of partitions in a pipeline to improve session performance. Increasing the number of
partitions allows the PowerCenter Server to create multiple connections to sources and process
partitions of source data concurrently.
When you create or edit a session, you can change the partitioning information for each pipeline in
a mapping. If the mapping contains multiple pipelines, you can specify multiple partitions in some
pipelines and single partitions in others. Keep the following attributes in mind when specifying
partitioning information for a pipeline:

       Location of partition points. The PowerCenter Server sets partition points at several
        transformations in a pipeline by default. If you have PowerCenter partitioning available, you
        can define other partition points. Select those transformations where you think redistributing
        the rows in a different way is likely to increase the performance considerably.
       Number of partitions. By default, the PowerCenter Server sets the number of partitions to
        one. You can generally define up to 64 partitions at any partition point. When you increase
        the number of partitions, you increase the number of processing threads, which can
        improve session performance. Increasing the number of partitions or partition points also
        increases the load on the server. If the server contains ample CPU bandwidth, processing
        rows of data in a session concurrently can increase session performance. However, if you
        create a large number of partitions or partition points in a session that processes large
        amounts of data, you can overload the system. You can also overload source and target
        systems, so that is another consideration.
       Partition types. The partition type determines how the PowerCenter Server redistributes
        data across partition points. The Workflow Manager allows you to specify the following
        partition types:

            1. Round-robin partitioning. PowerCenter distributes rows of data evenly to all
            partitions. Each partition processes approximately the same number of rows. In a
            pipeline that reads data from file sources of different sizes, you can use round-robin
            partitioning to ensure that each partition receives approximately the same number of
            rows.
            2. Hash keys. The PowerCenter Server uses a hash function to group rows of data
                among partitions. The Server groups the data based on a partition key. There are
                two types of hash partitioning:

                        - Hash auto-keys. The PowerCenter Server uses all grouped or sorted
                        ports as a compound partition key. You can use hash auto-keys partitioning
                        at or before Rank, Sorter, and unsorted Aggregator transformations to
                        ensure that rows are grouped properly before they enter these
                        transformations.
                        - Hash user keys. The PowerCenter Server uses a hash function to group
                        rows of data among partitions based on a user-defined partition key. You
                        choose the ports that define the partition key.

            3. Key range. The PowerCenter Server distributes rows of data based on a port or
               set of ports that you specify as the partition key. For each port, you define a range
               of values. The PowerCenter Server uses the key and ranges to send rows to the
               appropriate partition. Choose key range partitioning where the sources or targets in
               the pipeline are partitioned by key range.
            4. Pass-through partitioning. The PowerCenter Server processes data without
               redistributing rows among partitions. Therefore, all rows in a single partition stay in
               that partition after crossing a pass-through partition point.
            5. Database partitioning. You can optimize session performance by using the
               database partitioning partition type instead of the pass-through partition type for
               IBM DB2 targets.

If you find that your system is under-utilized after you have tuned the application, databases, and
system for maximum single-partition performance, you can reconfigure your session to have two or
more partitions to make your session utilize more of the hardware.

Use the following tips when you add partitions to a session:





      Add one partition at a time. To best monitor performance, add one partition at a time, and
       note your session settings before you add each partition.
      Set DTM buffer memory. For a session with n partitions, this value should be at least n
       times the value for the session with one partition.
      Set cached values for Sequence Generator. For a session with n partitions, there should
       be no need to use the number of cached values property of the Sequence Generator
       transformation. If you must set this value to a value greater than zero, make sure it is at
       least n times the original value for the session with one partition.
       Partition the source data evenly. Configure each partition to extract the same number of
        rows, or redistribute the data among partitions early using a round-robin partition point.
        This is also a good way to avoid hammering the source system: you can create a session
        with multiple partitions in which one partition returns all the data while the SQL override in
        the other partitions returns zero rows (a condition such as 1 = 2 in the WHERE clause
        prevents any rows from being returned; see the sketch after this list). Some source systems
        react better to multiple concurrent SQL queries; others prefer a smaller number of queries.
      Monitor the system while running the session. If there are CPU cycles available (twenty
       percent or more idle time), then performance may improve for this session by adding a
       partition.
      Monitor the system after adding a partition. If the CPU utilization does not go up, the
       wait for I/O time goes up, or the total data transformation rate goes down, then there is
       probably a hardware or software bottleneck. If the wait for I/O time goes up a significant
       amount, then check the system for hardware bottlenecks. Otherwise, check the database
       configuration.
      Tune databases and system. Make sure that your databases are tuned properly for
       parallel ETL and that your system has no bottlenecks.
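
The following is a minimal sketch of the one-partition-returns-all-rows technique mentioned in the
tips above. The table, columns, and the $$LAST_LOAD_DATE mapping parameter are hypothetical;
adapt them to the actual source and parameter file.

                -- Partition 1: SQL override returns all qualifying rows
                SELECT ORDER_ID, CUSTOMER_ID, AMOUNT
                FROM ORDERS
                WHERE LOAD_DATE > $$LAST_LOAD_DATE

                -- Partitions 2..n: same override, but 1 = 2 guarantees zero rows
                SELECT ORDER_ID, CUSTOMER_ID, AMOUNT
                FROM ORDERS
                WHERE LOAD_DATE > $$LAST_LOAD_DATE
                AND 1 = 2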

Increasing the Target Commit Interval
One method of resolving target database bottlenecks is to increase the commit interval.
Each time the target database commits, performance slows. If you increase the commit interval, the
number of times the PowerCenter Server commits decreases and performance may improve.
When increasing the commit interval at the session level, you must remember to increase the size
of the database rollback segments to accommodate the larger number of rows. One of the major
reasons that Informatica set the default commit interval to 10,000 is to accommodate the default
rollback segment / extent size of most databases. If you increase both the commit interval and the
database rollback segments, you should see an increase in performance. In some cases though,
just increasing the commit interval without making the appropriate database changes may cause
the session to fail part way through (i.e., you may get a database error like "unable to extend
rollback segments" in Oracle).
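
As an illustration only (the segment name and sizes are hypothetical, and the syntax applies to
Oracle dictionary-managed rollback segments rather than automatic undo management), the DBA
might review and enlarge a rollback segment with statements such as:

                -- Inspect the existing rollback segments
                SELECT SEGMENT_NAME, TABLESPACE_NAME, STATUS
                FROM DBA_ROLLBACK_SEGS;

                -- Allow a segment to grow with the larger commit interval
                ALTER ROLLBACK SEGMENT RBS_LARGE
                STORAGE (NEXT 10M MAXEXTENTS UNLIMITED);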

Disabling High Precision
If a session runs with high precision enabled, disabling high precision may improve session
performance.
The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high-
precision Decimal datatype in a session, you must configure it so that the PowerCenter Server
recognizes this datatype by selecting Enable High Precision in the session property sheet.
However, since reading and manipulating a high-precision datatype (i.e., those with a precision of
greater than 28) can slow the PowerCenter Server down, session performance may be improved by
disabling decimal arithmetic. When you disable high precision, the PowerCenter Server reverts to
using a datatype of Double.

Reducing Error Tracking



If a session contains a large number of transformation errors, you may be able to improve
performance by reducing the amount of data the PowerCenter Server writes to the session log.
To reduce the amount of time spent writing to the session log file, set the tracing level to Terse. At
this tracing level, the PowerCenter Server does not write error messages or row-level information
for reject data. However, if terse is not an acceptable level of detail, you may want to consider
leaving the tracing level at Normal and focus your efforts on reducing the number of transformation
errors. Note that the tracing level must be set to Normal in order to use the reject loading utility.
As an additional debug option (beyond the PowerCenter Debugger), you may set the tracing level
to verbose initialization or verbose data.

       Verbose initialization logs initialization details in addition to normal, names of index and
        data files used, and detailed transformation statistics.
       Verbose data logs each row that passes into the mapping. It also notes where the
        PowerCenter Server truncates string data to fit the precision of a column and provides
        detailed transformation statistics. When you configure the tracing level to verbose data, the
        PowerCenter Server writes row data for all rows in a block when it processes a
        transformation.

However, the verbose initialization and verbose data logging options significantly affect the session
performance. Do not use Verbose tracing options except when testing sessions. Always remember
to switch tracing back to Normal after the testing is complete.
The session tracing level overrides any transformation-specific tracing levels within the mapping.
Informatica does not recommend reducing error tracing as a long-term response to high levels of
transformation errors. Because there are only a handful of reasons why transformation errors occur,
it makes sense to fix and prevent any recurring transformation errors. PowerCenter uses the
mapping tracing level when the session tracing level is set to none.

Pushdown Optimization
You can push transformation logic to the source or target database using pushdown optimization.
The amount of work you can push to the database depends on the pushdown optimization
configuration, the transformation logic, and the mapping and session configuration.
When you run a session configured for pushdown optimization, the Integration Service analyzes the
mapping and writes one or more SQL statements based on the mapping transformation logic. The
Integration Service analyzes the transformation logic, mapping, and session configuration to
determine the transformation logic it can push to the database. At run time, the Integration Service
executes any SQL statement generated against the source or target tables, and it processes any
transformation logic that it cannot push to the database.
Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the
Integration Service can push to the source or target database. You can also use the Pushdown
Optimization Viewer to view the messages related to Pushdown Optimization.

Source-Side Pushdown Optimization Sessions
In source-side pushdown optimization, the Integration Service analyzes the mapping from the
source to the target until it reaches a downstream transformation that cannot be pushed to the
database.
The Integration Service generates a SELECT statement based on the transformation logic up to the
last transformation it can push to the database, executes the generated SQL statement at run time,
and then reads the results and continues to run the session. Similarly, when a source qualifier
contains a SQL override, the Integration Service creates a view for the override, generates a
SELECT statement, and runs that statement against the view. When the session completes, the
Integration Service drops the view from the database.
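
As a hypothetical illustration (the table, ports, and filter condition are invented; the actual SQL is
generated by the Integration Service and can be previewed in the Pushdown Optimization Viewer), a
Source Qualifier followed by a Filter (AMOUNT > 0) and an Expression (upper-casing the customer
name) might be pushed to the source as a single statement similar to:

                SELECT ORDER_ID,
                UPPER(CUSTOMER_NAME) AS CUSTOMER_NAME,
                AMOUNT
                FROM ORDERS
                WHERE AMOUNT > 0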



Target-Side Pushdown Optimization Sessions
When you run a session configured for target-side pushdown optimization, the Integration Service
analyzes the mapping from the target to the source or until it reaches an upstream transformation it
cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on
the transformation logic for each transformation it can push to the database, starting with the first
transformation in the pipeline it can push to the database. The Integration Service processes the
transformation logic up to the point that it can push the transformation logic to the target database.
Then, it executes the generated SQL.

Full Pushdown Optimization Sessions
To use full pushdown optimization, the source and target must be on the same database. When you
run a session configured for full pushdown optimization, the Integration Service analyzes the
mapping from source to target, analyzing each transformation in the pipeline until it reaches the
target. It then generates and executes the SQL against the source and target databases.
When you run a session for full pushdown optimization, the database must run a long transaction if
the session contains a large quantity of data. Consider the following database performance issues
when you generate a long transaction:

       A long transaction uses more database resources.
       A long transaction locks the database for longer periods of time, and thereby reduces the
        database concurrency and increases the likelihood of deadlock.
       A long transaction can increase the likelihood that an unexpected event may occur.

For example, consider a mapping that contains a Source Qualifier, an Aggregator, a Rank, and an
Expression transformation, in that order. The Rank transformation cannot be pushed to the
database. If you configure the session for full pushdown optimization, the Integration Service
pushes the Source Qualifier transformation and the Aggregator transformation to the source
database, pushes the Expression transformation and target to the target database, and processes
the Rank transformation itself. The Integration Service does not fail the session if it can push only
part of the transformation logic to the database when the session is configured for full optimization.

Using a Grid
You can use a grid to increase session and workflow performance. A grid is an alias assigned to a
group of nodes that allows you to automate the distribution of workflows and sessions across
nodes.
When you use a grid, the Integration Service distributes workflow tasks and session threads across
multiple nodes. Running workflows and sessions on the nodes of a grid provides the following
performance gains:

       Balances the Integration Service workload.
       Processes concurrent sessions faster.
       Processes partitions faster.

When you run a session on a grid, you improve scalability and performance by distributing session
threads to multiple DTM processes running on nodes in the grid.
To run a workflow or session on a grid, you assign resources to nodes, create and configure the
grid, and configure the Integration Service to run on a grid.

Running a Session on Grid
When you run a session on a grid, the master service process runs the workflow and workflow
tasks, including the Scheduler. Because it runs on the master service process node, the Scheduler
uses the date and time for the master service process node to start scheduled workflows. The Load
Balancer distributes Command tasks as it does when you run a workflow on a grid. In addition,
when the Load Balancer dispatches a Session task, it distributes the session threads to separate
DTM processes.
The master service process starts a temporary preparer DTM process that fetches the session and
prepares it to run. After the preparer DTM process prepares the session, it acts as the master DTM
process, which monitors the DTM processes running on other nodes.
The worker service processes start the worker DTM processes on other nodes. The worker DTM
runs the session. Multiple worker DTM processes running on a node might be running multiple
sessions or multiple partition groups from a single session depending on the session configuration.
For example, you run a workflow on a grid that contains one Session task and one Command task.
You also configure the session to run on the grid.
When the Integration Service process runs the session on a grid, it performs the following tasks:

       On Node 1, the master service process runs workflow tasks. It also starts a temporary
        preparer DTM process, which becomes the master DTM process.
        The Load Balancer dispatches the Command task and session threads to nodes in the grid.
       On Node 2, the worker service process runs the Command task and starts the worker DTM
        processes that run the session threads.
       On Node 3, the worker service process starts the worker DTM processes that run the
        session threads.






Tuning SQL Overrides and Environment for Better Performance
Challenge
Tuning SQL Overrides and SQL queries within the source qualifier objects can improve
performance in selecting data from source database tables, which positively impacts the overall
session performance. This Best Practice explores ways to optimize a SQL query within the source
qualifier object. The tips here can be applied to any PowerCenter mapping. The SQL discussed here
was executed on Oracle 8 and above; the techniques are generally applicable, but specifics for
other RDBMS products (e.g., SQL Server, Sybase, etc.) are not included.

Description
SQL Queries Performing Data Extractions
Optimizing SQL queries is perhaps the most complex portion of performance tuning. When tuning
SQL, the developer must look at the type of execution being forced by hints, the execution plan,
the indexes on the tables referenced in the SQL, the logic of the SQL statement itself, and the SQL
syntax. The following paragraphs discuss each of these areas in more detail.

DB2 Coalesce and Oracle NVL
When examining data with NULLs, it is often necessary to substitute a value to make comparisons
and joins work. In Oracle, the NVL function is used, while in DB2, the COALESCE function is used.
Here is an example of the Oracle NVL function:

                SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
                FROM exp.exp_bio_result bio, sar.sar_data_load_log log
                WHERE bio.update_date BETWEEN log.start_time AND log.end_time
                AND NVL(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
                AND log.seq_no = (SELECT MAX(seq_no) FROM sar.sar_data_load_log
                WHERE load_status = 'P')

Here is the same query in DB2:

                SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
                FROM bio_result bio, data_load_log log
                WHERE bio.update_date BETWEEN log.start_time AND log.end_time
                AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
                AND log.seq_no = (SELECT MAX(seq_no) FROM data_load_log
                WHERE load_status = 'P')

Surmounting the Single SQL Statement Limitation in Oracle or
DB2: In-line Views
In source qualifiers and lookup objects, you are limited to a single SQL statement. There are
several ways to get around this limitation.
You can create views in the database and use them as you would tables, either as source tables, or
in the FROM clause of the SELECT statement. This can simplify the SQL and make it easier to
understand, but it also makes it harder to maintain.
The logic is now in two places: in an Informatica mapping and in a database view.
You can also use in-line views, which are SELECT statements in the FROM or WHERE clause. These
can help focus the query on a subset of the data in the table and can work more efficiently than a
traditional join. Here is
an example of an in-line view in the FROM clause:



               SELECT N.DOSE_REGIMEN_TEXT as DOSE_REGIMEN_TEXT,
               N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,
               N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,
               N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
               FROM DOSE_REGIMEN N,
               (SELECT DISTINCT R.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
               FROM EXPERIMENT_PARAMETER R,
               NEW_GROUP_TMP TMP
                WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID
                AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID) X
                WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID
               ORDER BY N.DOSE_REGIMEN_ID

Surmounting the Single SQL Statement Limitation in DB2: Using
the Common Table Expression temp tables and the WITH Clause
The Common Table Expression (CTE) stores data in temp tables during the execution of the SQL
statement. The WITH clause lets you assign a name to a CTE block. You can then reference the
CTE block multiple places in the query by specifying the query name. For example:

               WITH maxseq AS
                (SELECT MAX(seq_no) AS seq_no FROM data_load_log WHERE load_status = 'P')
                SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
                FROM bio_result bio, data_load_log log, maxseq
                WHERE bio.update_date BETWEEN log.start_time AND log.end_time
                AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
                AND log.seq_no = maxseq.seq_no

Here is another example using a WITH clause that uses recursive SQL:

                WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID, LVL) AS
                (SELECT PERSON_ID, NAME, PARENT_ID, 1
                FROM PARENT_CHILD
                WHERE NAME IN ('FRED', 'SALLY', 'JIM')
                UNION ALL
                SELECT C.PERSON_ID, C.NAME, C.PARENT_ID, RECURS.LVL + 1
                FROM PARENT_CHILD C, PERSON_TEMP RECURS
                WHERE C.PARENT_ID = RECURS.PERSON_ID
                AND RECURS.LVL < 5)
                SELECT PERSON_ID, NAME, PARENT_ID FROM PERSON_TEMP

The PARENT_ID in any particular row refers to the PERSON_ID of the parent (the model allows only
one parent per row, but it illustrates the technique). The LVL counter prevents infinite recursion.

CASE (DB2) vs. DECODE (Oracle)
The CASE syntax is allowed in ORACLE, but you are much more likely to see the DECODE logic,
even for a single case since it was the only legal way to test a condition in earlier versions.
DECODE is not allowed in DB2. In Oracle:

                SELECT EMPLOYEE, FNAME, LNAME,
                -- DECODE tests for equality, so SIGN() converts the range test into -1 / 0 / +1
                DECODE(SIGN(SALARY - 10000), -1, 'NEED RAISE',
                       DECODE(SIGN(SALARY - 1000000), 1, 'OVERPAID',
                              'THE REST OF US')) AS SALARY_COMMENT
                FROM EMPLOYEE

In DB2:

                SELECT EMPLOYEE, FNAME, LNAME,
                CASE
                WHEN SALARY < 10000 THEN 'NEED RAISE'
                WHEN SALARY > 1000000 THEN 'OVERPAID'
                ELSE 'THE REST OF US'
                END AS SALARY_COMMENT
                FROM EMPLOYEE

Debugging Tip: Obtaining a Sample Subset
It is often useful to get a small sample of the data from a long running query that returns a large set
of data. The logic can be commented out or removed after it is put in general use.
DB2 uses the FETCH FIRST n ROWS ONLY clause to do this as follows:

                SELECT EMPLOYEE, FNAME, LNAME
                FROM EMPLOYEE
                WHERE JOB_TITLE = 'WORKERBEE'
                FETCH FIRST 12 ROWS ONLY

Oracle does it this way using the ROWNUM variable:

                SELECT EMPLOYEE, FNAME, LNAME
                FROM EMPLOYEE
                WHERE JOB_TITLE = 'WORKERBEE'
                AND ROWNUM <= 12

INTERSECT, INTERSECT ALL, UNION, UNION ALL
Remember that both the UNION and INTERSECT operators return distinct rows, while UNION ALL
and INTERSECT ALL return all rows.

System Dates in Oracle and DB2
Oracle uses the system variable SYSDATE for the current time and date, and allows you to display
either the time and/or the date however you want with date functions.
Here is an example that returns yesterday's date in Oracle (in the session's default date format):

                SELECT TRUNC(SYSDATE) - 1 FROM DUAL

DB2 uses system variables, here called special registers: CURRENT DATE, CURRENT TIME, and
CURRENT TIMESTAMP.
Here is an example for DB2:

                SELECT FNAME, LNAME, CURRENT DATE AS TODAY
                FROM EMPLOYEE





Oracle: Using Hints
Hints affect the way a query or sub-query is executed and can, therefore, provide a significant
performance increase in queries. Hints cause the database engine to relinquish control over how a
query is executed, thereby giving the developer control over the execution. Hints are always
honored unless execution is not possible. Because the database engine does not evaluate whether
the hint makes sense, developers must be careful in implementing hints. Oracle has many types of
hints: optimizer hints, access method hints, join order hints, join operation hints, and parallel
execution hints. Optimizer and access method hints are the most common.
In the latest versions of Oracle, the Cost-based query analysis is built-in and Rule-based analysis is
no longer possible. It was in Rule-based Oracle systems that hints mentioning specific indexes were
most helpful. In Oracle version 9.2, however, the use of /*+ INDEX */ hints may actually decrease
performance significantly in many cases. If you are using older versions of Oracle however, the use
of the proper INDEX hints should help performance.
The optimizer hint allows the developer to change the optimizer's goals when creating the execution
plan. The table below provides a partial list of optimizer hints and descriptions.

Optimizer hints: Choosing the best join method
Sort/merge and hash joins are in the same group, but nested loop joins are very different.
Sort/merge involves two sorts while the nested loop involves no sorts. The hash join also requires
memory to build the hash table.
Hash joins are most effective when the amount of data is large and one table is much larger than
the other.
Here is an example of a select that performs best as a hash join:

                        SELECT COUNT(*) FROM CUSTOMERS C, MANAGERS M
                        WHERE C.CUST_ID = M.MANAGER_ID
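
A hedged sketch of requesting this plan explicitly with a join-operation hint (USE_HASH is standard
Oracle hint syntax; whether it actually helps depends on the data volumes described above):

                        SELECT /*+ USE_HASH(M) */ COUNT(*)
                        FROM CUSTOMERS C, MANAGERS M
                        WHERE C.CUST_ID = M.MANAGER_ID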








Access method hints
Access method hints control how data is accessed. These hints are used to force the database
engine to use indexes, hash scans, or row id scans. The following table provides a partial list of
access method hints.




The syntax for using a hint in a SQL statement is as follows:

                        Select /*+ FIRST_ROWS */ empno, ename
                        From emp;
                        Select /*+ USE_CONCAT */ empno, ename
                        From emp
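
As an illustrative sketch of an access method hint (the table alias and the index name
EMP_ENAME_IDX are hypothetical), an INDEX hint asks the optimizer to use a specific index:

                        Select /*+ INDEX(e EMP_ENAME_IDX) */ empno, ename
                        From emp e
                        Where ename = 'SMITH'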





SQL Execution and Explain Plan
The simplest change is forcing the SQL to choose either rule-based or cost-based execution. This
change can be accomplished without changing the logic of the SQL query. While cost-based
execution is typically considered the best SQL execution, it relies upon optimization of the Oracle
parameters and updated database statistics. If these statistics are not maintained, cost-based query
execution can suffer over time. When that happens, rule-based execution can actually provide
better execution time.
The developer can determine which type of execution is being used by running an explain plan on
the SQL query in question. Note that the step in the explain plan that is indented the most is the
statement that is executed first. The results of that statement are then used as input by the next
level statement.
Typically, the developer should attempt to eliminate any full table scans and index range scans
whenever possible. Full table scans cause degradation in performance.
Information provided by the Explain Plan can be enhanced using the SQL Trace utility. This utility
provides the following additional information:

       The number of executions
       The elapsed time of the statement execution
       The CPU time used to execute the statement

The SQL Trace Utility adds value because it definitively shows the statements that are using the
most resources, and can immediately show the change in resource consumption after the statement
has been tuned and a new explain plan has been run.
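
As a minimal sketch (the query is hypothetical, and DBMS_XPLAN assumes a reasonably recent
Oracle release; on older releases, query PLAN_TABLE directly instead):

                EXPLAIN PLAN FOR
                SELECT C.NAME_ID FROM CUSTOMERS C WHERE C.REGION_CODE = 'EU';

                SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

                -- Enable the SQL Trace utility for the current session
                ALTER SESSION SET SQL_TRACE = TRUE;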

Using Indexes
The explain plan also shows whether indexes are being used to facilitate execution. The data
warehouse team should compare the indexes being used to those available. If necessary, the
administrative staff should identify new indexes that are needed to improve execution and ask the
database administration team to add them to the appropriate tables. Once implemented, the explain
plan should be executed again to ensure that the indexes are being used. If an index is not being
used, it is possible to force the query to use it by using an access method hint, as described earlier.

Reviewing SQL Logic
The final step in SQL optimization involves reviewing the SQL logic itself. The purpose of this
review is to determine whether the logic is efficiently capturing the data needed for processing.
Review of the logic may uncover the need for additional filters to select only certain data, as well as
the need to restructure the where clause to use indexes. In extreme cases, the entire SQL
statement may need to be re-written to become more efficient.

Reviewing SQL Syntax
SQL Syntax can also have a great impact on query performance. Certain operators can slow
performance, for example:

        EXISTS clauses are almost always used in correlated sub-queries. They are executed for
         each row of the parent query and cannot take advantage of indexes, while the IN clause is
         executed once and does use indexes, and may be translated to a JOIN by the optimizer. If
         possible, replace EXISTS with an IN clause. For example:

                SELECT * FROM DEPARTMENTS WHERE DEPT_ID IN
                (SELECT DISTINCT DEPT_ID FROM MANAGERS) -- Faster

                SELECT * FROM DEPARTMENTS D WHERE EXISTS
                (SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID) -- Slower




        Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply
         modifying the query in this way can improve performance by more than 100 percent (see
         the sketch after this list).
        Where possible, limit the use of outer joins on tables. Remove the outer joins from the
         query and create lookup objects within the mapping to fill in the optional information.
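
A minimal sketch of the INTERSECT-to-EXISTS rewrite mentioned above, reusing the tables from the
earlier examples (the two forms are equivalent only when NULL handling is not a concern):

                -- INTERSECT form
                SELECT NAME_ID FROM CUSTOMERS
                INTERSECT
                SELECT NAME_ID FROM EMPLOYEES

                -- EXISTS form (often faster)
                SELECT DISTINCT C.NAME_ID FROM CUSTOMERS C
                WHERE EXISTS
                (SELECT 1 FROM EMPLOYEES E WHERE E.NAME_ID = C.NAME_ID)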

Choosing the Best Join Order
Place the smallest table first in the join order. This is often a staging table holding the IDs identifying
the data in the incremental ETL load.
Always put the small table column on the right side of the join. Use the driving table first in the
WHERE clause, and work from it outward. In other words, be consistent and orderly about placing
columns in the WHERE clause.
Outer joins limit the join order that the optimizer can use. Don't use them needlessly.
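
A hedged sketch of these guidelines (the staging and fact table names are hypothetical; the
ORDERED hint, which asks Oracle to join tables in the order they appear in the FROM clause, is
optional):

                SELECT /*+ ORDERED */ F.ORDER_ID, F.AMOUNT
                FROM STG_INCREMENTAL_IDS S, SALES_FACT F
                WHERE F.ORDER_ID = S.ORDER_ID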

Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER
JOIN
        Avoid use of the NOT IN clause. This clause causes the database engine to perform a full
         table scan. While this may not be a problem on small tables, it can become a performance
         drain on large tables.

                 SELECT NAME_ID FROM CUSTOMERS
                 WHERE NAME_ID NOT IN
                 (SELECT NAME_ID FROM EMPLOYEES)





        Avoid use of the NOT EXISTS clause. This clause is better than the NOT IN, but still may
         cause a full table scan.

                 SELECT C.NAME_ID FROM CUSTOMERS C
                 WHERE NOT EXISTS
                 (SELECT * FROM EMPLOYEES E
                 WHERE C.NAME_ID = E.NAME_ID)

        In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the
         equivalent EXCEPT operator.

                 SELECT C.NAME_ID FROM CUSTOMERS C
                 MINUS
                 SELECT E.NAME_ID FROM EMPLOYEES E

        Also consider using outer joins with IS NULL conditions for anti-joins.

                 SELECT C.NAME_ID FROM CUSTOMERS C, EMPLOYEES E
                 WHERE C.NAME_ID = E.NAME_ID (+)
                 AND E.NAME_ID IS NULL

Review the database SQL manuals to determine the cost benefits or liabilities of certain SQL
clauses, as they may change based on the database engine.

       In lookups from large tables, try to limit the rows returned to the set of rows matching the
        set in the source qualifier by adding WHERE clause conditions to the lookup. For example, if
        the source qualifier selects sales orders entered into the system since the previous load of
        the database, then, in the product information lookup, only select the products that match
        the distinct product IDs in the incremental sales orders (a sketch of such a lookup override
        follows the range-lookup example below).
       Avoid range lookups. A range lookup is a SELECT that uses a BETWEEN in the WHERE
        clause with limit values retrieved from another table. Here is an example:

                 SELECT
                 R.BATCH_TRACKING_NO,
                 R.SUPPLIER_DESC,
                 R.SUPPLIER_REG_NO,
                 R.SUPPLIER_REF_CODE,
                 R.GCW_LOAD_DATE
                 FROM CDS_SUPPLIER R,
                 (SELECT L.LOAD_DATE_PREV AS LOAD_DATE_PREV,
                 L.LOAD_DATE AS LOAD_DATE
                 FROM ETL_AUDIT_LOG L
                 WHERE L.LOAD_DATE_PREV IN
                 (SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
                 FROM ETL_AUDIT_LOG Y)
                 )Z
                 WHERE
                 R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE

The work-around is to use an in-line view to get the lower limit in the FROM clause and join it to
the main query, which limits the higher date range in its WHERE clause. Use an ORDER BY on the
lower limit in the in-line view. This is likely to reduce the throughput time from hours to seconds.
Here is the improved SQL:

                 SELECT
                 R.BATCH_TRACKING_NO,
                 R.SUPPLIER_DESC,
                 R.SUPPLIER_REG_NO,
                 R.SUPPLIER_REF_CODE,
                 R.LOAD_DATE
                 FROM
                 /* In-line view for lower limit */
                 (SELECT
                 R1.BATCH_TRACKING_NO,
                 R1.SUPPLIER_DESC,
                 R1.SUPPLIER_REG_NO,
                 R1.SUPPLIER_REF_CODE,
                 R1.LOAD_DATE
                 FROM CDS_SUPPLIER R1,
                 (SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
                 FROM ETL_AUDIT_LOG Y) Z
                 WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV
                 ORDER BY R1.LOAD_DATE) R,
                 /* end in-line view for lower limit */
                 (SELECT MAX(D.LOAD_DATE) AS LOAD_DATE
                 FROM ETL_AUDIT_LOG D) A /* upper limit */
                 WHERE R.LOAD_DATE <= A.LOAD_DATE
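
Returning to the earlier tip on limiting lookup rows, the following is a minimal sketch of a lookup
SQL override that restricts the lookup to the product IDs present in the incremental load; the table
and column names are hypothetical:

                 SELECT P.PRODUCT_ID, P.PRODUCT_NAME, P.PRODUCT_CATEGORY
                 FROM PRODUCT_DIM P
                 WHERE P.PRODUCT_ID IN
                 (SELECT DISTINCT S.PRODUCT_ID FROM STG_INCREMENTAL_ORDERS S)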


Tuning System Architecture
Use the following steps to improve the performance of any system:

1. Establish performance boundaries (baseline).
2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If objectives are
met, consider reducing the number of measurements because performance monitoring itself uses
system resources. Otherwise continue with Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can bear additional
load.
8. Adjust the configuration of the system. If it is feasible to change more than one tuning option,
implement one at a time. If there are no options left at any level, this indicates that the system has
reached its limits and hardware upgrades may be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.

System Resources
The PowerCenter Server uses the following system resources:

        CPU
        Load Manager shared memory
        DTM buffer memory
        Cache memory

When tuning the system, evaluate the following considerations during the implementation process.





      Determine if the network is running at an optimal speed. Recommended best practice is to
       minimize the number of network hops between the PowerCenter Server and the databases.
      Use multiple PowerCenter Servers on separate systems to potentially improve session
       performance.
      When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC,
       configure the PowerCenter Server for ASCII data movement mode. In ASCII mode, the
       PowerCenter Server uses one byte to store each character. In Unicode mode, the
       PowerCenter Server uses two bytes for each character, which can potentially slow session
        performance.
      Check hard disks on related machines. Slow disk access on source and target databases,
       source and target file systems, as well as the PowerCenter Server and repository machines
       can slow session performance.
      When an operating system runs out of physical memory, it starts paging to disk to free
       physical memory. Configure the physical memory for the PowerCenter Server machine to
       minimize paging to disk. Increase system memory when sessions use large cached lookups
       or sessions have many partitions.
      In a multi-processor UNIX environment, the PowerCenter Server may use a large amount
       of system resources. Use processor binding to control processor usage by the
       PowerCenter Server.
      In a Sun Solaris environment, use the psrset command to create and manage a processor
       set. After creating a processor set, use the pbind command to bind the PowerCenter Server
        to the processor set so that the processor set only runs the PowerCenter Server. For details,
       see project system administrator and Sun Solaris documentation.
      In an HP-UX environment, use the Process Resource Manager utility to control CPU usage
       in the system. The Process Resource Manager allocates minimum system resources and
       uses a maximum cap of resources.
      In an AIX environment, use the Workload Manager in AIX 5L to manage system resources
       during peak demands.

The Workload Manager can allocate resources and manage CPU, memory, and disk I/O bandwidth.

Database Performance Features

Nearly everything is a trade-off in the physical database implementation. Work with the DBA in
determining which of the many available alternatives is the best implementation choice for the
particular database. The project team must have a thorough understanding of the data, database,
and desired use of the database by the end-user community prior to beginning the physical
implementation process. Evaluate the following considerations during the implementation process.

       Denormalization. The DBA can use denormalization to improve performance by
        eliminating constraints and primary key-to-foreign key relationships, and by eliminating
        join tables.
      Indexes. Proper indexing can significantly improve query response time. The trade-off of
        heavy indexing is a degradation of the time required to load data rows into the target
        tables. Carefully written pre-session scripts are recommended to drop indexes before the
        load and rebuild them after the load using post-session scripts.
       Constraints. Avoid constraints if possible; instead, enforce integrity by incorporating the
        additional logic in the mappings.
      Rollback and Temporary Segments. Rollback and temporary segments are primarily
       used to store data for queries (temporary) and INSERTs and UPDATES (rollback). The
       rollback area must be large enough to hold all the data prior to a COMMIT. Proper sizing
       can be crucial to ensuring successful completion of load sessions, particularly on initial
       loads.



   OS Priority. The priority of background processes is an often-overlooked problem that can
    be difficult to determine after the fact. DBAs must work with the System Administrator to
    ensure all the database processes have the same priority.
    Striping. Database performance can be increased significantly by implementing either
     RAID 0 (striping) or RAID 5 (striping with distributed parity) to improve disk I/O throughput.
   Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by
    separating the disk controllers.






Advanced Server Configuration Options
Challenge
Correctly configuring Advanced Integration Service properties, Integration Service process
variables, and automatic memory settings; using custom properties to write service logs to files; and
adjusting semaphore and shared memory settings in the UNIX environment.

Description
Configuring Advanced Integration Service Properties
Use the Administration Console to configure the advanced properties, such as the character set of
the Integration Service logs. To edit the advanced properties, select the Integration Service in the
Navigator, and click the Properties tab > Advanced Properties > Edit.

Configuring Integration Service Process Variables
One configuration best practice is to properly configure and leverage the Integration Service (IS)
process variables. All of these variables are related to directory paths used by a given Integration
Service. The benefits include:

       Ease of deployment across environments (DEV > TEST > PRD).
       Ease of switching sessions from one IS to another without manually editing all the sessions
        to change directory paths.

You must specify the paths for Integration Service files for each Integration Service process.
Examples of Integration Service files include run-time files, state of operation files, and session log
files.
Each Integration Service process uses run-time files to process workflows and sessions. If you
configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must
be stored in a shared location. Each node must have access to the runtime files used to process a
session or workflow. This includes files such as parameter files, cache files, input files, and output
files.
State of operation files must be accessible by all Integration Service processes. When you enable
an Integration Service, it creates files to store the state of operations for the service. The state of
operations includes information such as the active service requests, scheduled tasks, and
completed and running processes. If the service fails, the Integration Service can restore the state
and recover operations from the point of interruption.
All Integration Service processes associated with an Integration Service must use the same shared
location. However, each Integration Service can use a separate location.
By default, the installation program creates a set of Integration Service directories in the
server\infa_shared directory. You can set the shared location for these directories by configuring
the process variable $PMRootDir to point to the same location for each Integration Service process.
You must specify the directory path for each type of file; you specify these directories using service
process variables. Each registered Integration Service has its own set of variables; the list is fixed,
not user-extensible.





Adjusting Semaphore Settings on UNIX Platforms

When PowerCenter runs on a UNIX platform, it uses operating system semaphores to keep
processes synchronized and to prevent collisions when accessing shared data structures. You may
need to increase these semaphore settings before installing the server.
Seven semaphores are required to run a session. Most installations require between 64 and 128
available semaphores, depending on the number of sessions the server runs concurrently. This is in
addition to any semaphores required by other software, such as database servers.
The total number of available operating system semaphores is an operating system configuration
parameter, with a limit per user and system. The method used to change the parameter depends on
the operating system:

      HP/UX: Use sam (1M) to change the parameters.
      Solaris: Use admintool or edit /etc/system to change the parameters.
      AIX: Use smit to change the parameters.

Setting Shared Memory and Semaphore Parameters on UNIX
Platforms
Informatica recommends setting the following parameters as high as possible for the UNIX
operating system. However, if you set these parameters too high, the machine may not boot.
Always refer to the operating system documentation for parameter limits. Note that different UNIX
operating systems set these variables in different ways or may be self tuning. Always reboot the
system after configuring the UNIX kernel.

HP-UX
For HP-UX release 11i the CDLIMIT and NOFILES parameters are not implemented. In some
versions, SEMMSL is hard-coded to 500. NCALL is referred to as NCALLOUT.
Use the HP System V IPC Shared-Memory Subsystem to update parameters.
To change a value, perform the following steps:

1. Enter the /usr/sbin/sam command to start the System Administration Manager (SAM) program.
2. Double click the Kernel Configuration icon.
3. Double click the Configurable Parameters icon.
4. Double click the parameter you want to change and enter the new value in the Formula/Value
field.
5. Click OK.
6. Repeat these steps for all kernel configuration parameters that you want to change.
7. When you are finished setting all of the kernel configuration parameters, select Process New
Kernel from the Action menu.

The HP-UX operating system automatically reboots after you change the values for the kernel
configuration parameters.

Configuring Automatic Memory Settings

With Informatica PowerCenter 8, you can configure the Integration Service to determine buffer
memory size and session cache size at runtime. When you run a session, the Integration Service
allocates buffer memory to the session to move the data from the source to the target. It also
creates session caches in memory. Session caches include index and data caches for the
Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches.



Configure buffer memory and cache memory settings in the Transformation and Session Properties.
When you configure buffer memory and cache memory settings, consider the overall memory
usage for best performance.
Enable automatic memory settings by configuring a value for the Maximum Memory Allowed for
Auto Memory Attributes or the Maximum Percentage of Total Memory Allowed for Auto Memory
Attributes. If the value is set to zero for either of these attributes, the Integration Service disables
automatic memory settings and uses default values.
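As an illustration, a session might be configured with values like the following (the figures are assumptions, not recommendations; tune them against the overall memory available on the node, and note that the DTM Buffer Size and Default Buffer Block Size attribute names may vary slightly by version):

            Maximum Memory Allowed for Auto Memory Attributes                       512MB
            Maximum Percentage of Total Memory Allowed for Auto Memory Attributes   5
            DTM Buffer Size                                                         Auto
            Default Buffer Block Size                                               Auto

With both maximums set to non-zero values, the Integration Service calculates buffer and cache sizes at run time; as noted above, setting either attribute to zero disables automatic memory settings.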





Causes and Analysis of UNIX Core Files Challenge
This Best Practice explains what UNIX core files are and why they are created, and offers some tips
on analyzing them.

Description
Fatal run-time errors in UNIX programs usually result in the termination of the UNIX process by the
operating system. Usually, when the operating system terminates a process, a "core dump" file is
also created, which can be used to analyze the reason for the abnormal termination.

What is a Core File and What Causes it to be Created?

UNIX operating systems may terminate a process before its normal, expected exit for several
reasons. These terminations are typically triggered by bad behavior on the part of the program,
including attempts to execute illegal or incorrect machine instructions, attempts to allocate memory
outside the memory space allocated to the program, attempts to write to memory marked read-only
by the operating system, and other similar incorrect low-level operations. Most of these behaviors
are caused by errors in the program's logic.
UNIX may also terminate a process for some reasons that are not caused by programming errors.
The main examples of this type of termination are when a process exceeds its CPU time limit, and
when a process exceeds its memory limit.
When UNIX terminates a process in this way, it normally writes an image of the process's memory
to disk in a single file. These files are called "core files" and are intended to help a programmer
determine the cause of the failure. Depending on the UNIX version, the name of the file may be
"core", or in more recent UNIX versions, "core.nnnn", where nnnn is the UNIX process ID of the
process that was terminated.
Core files are not created for "normal" runtime errors such as incorrect file permissions, lack of disk
space, inability to open a file or network connection, and other errors that a program is expected to
detect and handle. However, under certain error conditions a program may not handle the error
conditions correctly and may follow a path of execution that causes the OS to terminate it and
cause a core dump.
Mixing incompatible versions of UNIX, vendor, and database libraries can often trigger behavior that
causes unexpected core dumps. For example, using an ODBC driver library from one vendor and an
ODBC driver manager from another vendor may result in a core dump if the libraries are not
compatible. A similar situation can occur if a process is using libraries from different versions of a
database client, such as a mixed installation of Oracle 8i and 9i. An installation like this should not
exist, but if it does, core dumps are often the result.

Core File Locations and Size Limits
A core file is written to the current working directory of the process that was terminated. For
PowerCenter, this is always the directory the services were started from. For other applications, this
may not be true.
UNIX also implements a per-user resource limit on the maximum size of core files, controlled by the
ulimit command. If the limit is 0, core files are not created. If the limit is less than the total memory
size of the process, a partial core file is written.
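To check and, if necessary, raise the core file limit for the shell that starts the PowerCenter services, a sequence like the following can be used (a minimal sketch; flag letters can vary slightly between shells):

            ulimit -c              # display the current core file size limit (in blocks)
            ulimit -c unlimited    # allow full core files for this shell and its children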




Analyzing Core Files
Core files provide valuable insight into the state and condition the process was in just before it was
terminated. A core file also contains the history of the routines the process passed through before the
fatal call; this history is known as the stack trace. There is little information in a core file that is
relevant to an end user; most of the contents of a core file are only relevant to a developer, or to
someone who understands the internals of the program that generated the core file. However, there
are a few things that an end user can do with a core file in the way of initial analysis. The most
important part of analyzing a core file is extracting the stack trace from the core dump. Debuggers
are the tools that retrieve the stack trace and other vital information from the core. Informatica
recommends using the pmstack utility.
The first step is to save the core file under a new name so that it is not overwritten by a later crash
of the same application. One option is to append a timestamp to the core file name, but it can be
renamed to anything:

                          mv core core.ddmmyyhhmi

The second step is to log in with the same UNIX user id that started the process that crashed.
This sets the debugger's environment to be the same as that of the process at startup time.
The third step is to go to the directory where the program is installed and run the "file" command on
the core file. This returns the name of the process that created the core file.

                          file <fullpathtocorefile>/core.ddmmyyhhmi
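The output looks something like the following (illustrative only; the exact wording depends on the platform and on which PowerCenter executable produced the core):

            $ file core.290311
            core.290311: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), from 'pmdtm'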

Core files can be generated by the PowerCenter executables (i.e., pmserver, infaservices, and
pmdtm) as well as by other UNIX commands executed by the Integration Service, typically from
command tasks and pre- or post-session commands.
If a PowerCenter process is terminated by the OS and a core is generated, the session or server
log typically shows 'Process terminating on Signal/Exception' as its last entry.

Using the pmstack Utility
Informatica provides a 'pmstack' utility that can automatically analyze a core file. If the core file is
from PowerCenter, it generates a complete stack trace from the core file, which can be sent to
Informatica Customer Support for further analysis. The trace contains everything necessary to
further diagnose the problem. Core files themselves are normally not useful on a system other than
the one where they were generated. The pmstack utility can be downloaded from the Informatica
Support knowledge base as article 13652, and from the support ftp server at tsftp.informatica.com.
Once downloaded, run pmstack with the -c option, followed by the name of the core file:

$ pmstack -c core.21896
=================================
SSG pmstack ver 2.0 073004
=================================
Core info :
-rw------- 1 pr_pc_d pr_pc_d 58806272 Mar 29 16:28 core.21896
core.21896: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-
style, from 'pmdtm'
Process name used for analyzing the core : pmdtm
Generating stack trace, please wait..
Pmstack completed successfully
Please send file core.21896.trace to Informatica Technical Support

You can then look at the generated trace file or send it to support.




Pmstack also supports a -p option, which can be used to extract a stack trace from a running
process. This is sometimes useful when a process appears to be hung, to determine what it is
actually doing.
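A minimal sketch of how this might be used, assuming the -p option takes the process ID of the running process (the ID shown is only an example):

            $ ps -ef | grep pmdtm     # find the process ID of the session that appears hung
            $ pmstack -p 21896        # attach to that process and capture its current stack trace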





Managing Repository Size Challenge
The PowerCenter repository grows over time as new development and production runs occur.
Eventually it can reach a size that slows repository performance or makes backups increasingly
difficult. This Best Practice discusses methods to manage the size of the repository.
The release of PowerCenter version 8.x added several features that aid in managing repository
size. Although the repository is slightly larger with version 8.x than with previous versions, the client
tools have additional functionality that reduces the impact of repository size. PowerCenter versions
earlier than 8.x require more administration to keep repository sizes manageable.

Description
Why should we manage the size of the repository?
Repository size affects the following:

       DB backups and restores. If database backups are being performed, the size required for
        the backup can be reduced. If PowerCenter backups are being used, you can limit what
        gets backed up.
       Overall query time of the repository. A large repository slows down over time; analyzing the
        repository tables on a regular basis helps maintain their performance.
       Migrations (i.e., copying from one repository to the next). Limit data transfer between
        repositories to avoid locking up the repository for a long period of time. Options are
        available to avoid transferring all run statistics when migrating.

A typical repository starts off small (i.e., 50MB to 60MB for an empty repository) and can grow to
upwards of 1GB for a large repository. The type of information stored in the repository includes:

                         -   Versions
                         -   Objects
                         -   Run statistics
                         -   Scheduling information
                         -   Variables

Tips for Managing Repository Size
Versions and Objects

Delete old versions and purged objects from the repository. Use the repository query facility in the
client tools to build reusable queries that identify out-of-date versions and objects for removal.
Use the Query Browser to run object queries on both versioned and non-versioned repositories.
Old versions and objects not only increase the size of the repository, but also make it more difficult
to manage further into the development cycle. Cleaning up the folders makes it easier to determine
what is valid and what is not.
One way to keep the repository small is to use shortcuts: create shared folders when the same
source/target definitions or reusable transformations are used in multiple folders.

Folders
Remove folders and objects that are no longer used or referenced. Unnecessary folders increase
the size of the repository backups. These folders should not be a part of production but they may
exist in development or test repositories.




Run Statistics
Remove old run statistics from the repository if you no longer need them. History is important for
determining trending, scaling, and performance-tuning needs, but you can generate reports with the
PowerCenter Metadata Reporter and save the data you need before removing the statistics. To
remove the run statistics, go to Repository Manager and truncate the logs based on date.
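Log truncation can also be scripted. The sketch below assumes the pmrep command line is available and uses placeholder connection details; option names can differ between PowerCenter releases, so check the pmrep reference for your version:

            pmrep connect -r DEV_REP -d DEV_DOMAIN -n Administrator -x <password>
            pmrep truncatelog -t "01/01/2011 00:00:00"    # remove log entries older than this date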

Recommendations
Informatica strongly recommends upgrading to the latest version of PowerCenter, since the most
recent releases include repository backup options such as skipping workflow and session logs,
skipping deployment group history, skipping MX data, and so forth. The repository in version 8.x and
above is larger than in previous versions of PowerCenter, but the added size does not significantly
affect repository performance. It is still advisable to analyze the repository tables or update their
statistics to keep them optimized.
Informatica does not recommend directly querying the repository tables or performing deletes on
them. Use the client tools unless otherwise advised by Informatica technical support personnel.





Platform Sizing Challenge
Determining the appropriate platform size to support the PowerCenter environment based on
customer environments and requirements.

Description
The required platform size to support PowerCenter depends on each customer's unique
environment and processing requirements. The Integration Service allocates resources for
individual extraction, transformation, and load (ETL) jobs or sessions.
Each session has its own resource requirements. The resources required for the Integration Service
depend on the number of sessions, what each session does while moving data, and how many
sessions run concurrently. This Best Practice discusses the relevant questions pertinent to
estimating the platform requirements.

TIP

An important concept regarding platform sizing is not to size your environment too soon in the
project lifecycle.
Too often, clients size their machines before any ETL is designed or developed, and in many cases
these platforms are too small for the resultant system. Thus, it is better to analyze sizing
requirements after the data transformation processes have been well defined during the design and
development phases.

Environment Questions
To determine platform size, consider the following questions regarding your environment:

        What sources do you plan to access?
        How do you currently access those sources?
        Have you decided on the target environment (i.e., database, hardware, operating system)?
         If so, what is it?
        Have you decided on the PowerCenter environment (i.e., hardware, operating system)?
        Is it possible for the PowerCenter services to be on the same machine as the target?
        How do you plan to access your information (i.e., cube, ad-hoc query tool) and what tools
         will you use to do this?
        What other applications or services, if any, run on the PowerCenter server?
        What are the latency requirements for the PowerCenter loads?

Engine Sizing Questions
To determine engine size, consider the following questions:

        Is the overall ETL task currently being performed? If so, how is it being done, and how long
         does it take?
        What is the total volume of data to move?
        What is the largest table (i.e., bytes and rows)? Is there any key on this table that can be
         used to partition load sessions, if needed?
        How often does the refresh occur?
        Will refresh be scheduled at a certain time, or driven by external events?
        Is there a "modified" timestamp on the source table rows?
        What is the batch window available for the load?
        Are you doing a load of detail data, aggregations, or both?

        If you are doing aggregations, what is the ratio of source to target rows for the largest result
         set? How large is the result set (bytes and rows)?

The answers to these questions provide an approximate guide to the factors that affect
PowerCenter's resource requirements. To simplify the analysis, focus on the large jobs that drive the
resource requirement.

Engine Resource Consumption
The following sections summarize some recommendations on the PowerCenter engine resource
consumption.

Processor
1 to 1.5 CPUs per concurrent non-partitioned session or transformation job.

Memory
       20 to 30MB of memory for the main engine for session coordination.
       20 to 30MB of memory per session, if there are no aggregations, lookups, or
        heterogeneous data joins. Note that 32-bit systems have an operating system limitation of
        2GB per session.
       Caches for aggregation, lookups, or joins use additional memory (a rough worked example
        follows this list):
            -   Lookup tables are cached in full; the memory consumed depends on the size of the tables.
            -   Aggregate caches store the individual groups; more memory is used if there are more groups.
            -   Sorting the input to aggregations greatly reduces the need for memory.
            -   Joins cache the master table; the memory consumed depends on the size of the master.
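As an illustration only (the figures are assumptions, not measurements), the engine memory for a small server running eight concurrent, non-partitioned sessions might be estimated as follows:

            engine coordination                        ~  30 MB
            session overhead         8 x 30 MB         ~ 240 MB
            two cached lookups       2 x 250 MB        ~ 500 MB
            one aggregate cache                        ~ 300 MB
            --------------------------------------------------
            estimated engine total                     ~ 1.1 GB  (excluding the OS and database client)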

System Recommendations
PowerCenter has a service-oriented architecture that provides the ability to scale services and
share resources across multiple machines. Below are the recommendations for the system.
Minimum server

        1 node with 4 CPUs and 8GB of memory (rather than the minimum requirement of 4GB of RAM).

Disk Space
Disk space is not a factor if the machine is used only for PowerCenter services, unless the following
conditions exist:

       Data is staged to flat files on the PowerCenter machine.
       Data is stored in incremental aggregation files for adding data to aggregates. The space
        consumed is about the size of the data aggregated.
       Temporary space is needed for paging for transformations that require large caches that
        cannot be entirely held in system memory.
       Session logs are saved by timestamp.

If any of these conditions applies, Informatica recommends monitoring disk space on a regular basis
and maintaining a script to purge unused files, along the lines of the sketch below.
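A minimal sketch of such a housekeeping script, assuming the shared directory layout used earlier in this document and purely illustrative retention periods (adjust the paths, file patterns, and ages to your environment before use):

            #!/bin/sh
            # Remove session and workflow logs older than 30 days
            find /shared/infa_shared/SessLogs     -type f -mtime +30 -exec rm -f {} \;
            find /shared/infa_shared/WorkflowLogs -type f -mtime +30 -exec rm -f {} \;
            # Remove orphaned cache files older than 7 days
            find /shared/infa_shared/Cache -type f \( -name "*.idx" -o -name "*.dat" \) -mtime +7 -exec rm -f {} \;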



Sizing Analysis
The basic goal is to size the machine so that all jobs can complete within the specified load window.
You should consider the answers to the questions in the "Environment" and "Engine Sizing"
sections to estimate the required number of sessions, the volume of data that each session moves,
and its lookup table, aggregation, and heterogeneous join caching requirements. Use these
estimates together with the recommendations in the "Engine Resource Consumption" section to
determine the number of processors, the amount of memory, and the disk space needed to meet
the load window.
Note that the deployment environment often creates performance constraints that hardware
capacity cannot overcome. The engine throughput is usually constrained by one or more of the
environmental factors addressed by the questions in the "Environment" section. For example, if the
data sources and target are both remote from the PowerCenter machine, the network is often the
constraining factor. At some point, additional sessions, processors, and memory may not yield
faster execution because the network (not the PowerCenter services) imposes the performance
limit. The hardware sizing analysis is highly dependent on the environment in which the server is
deployed. You need to understand the performance characteristics of the environment before
making any sizing conclusions.
It is also vitally important to remember that other applications (in addition to PowerCenter) are likely
to use the platform. PowerCenter often runs on a server with a database engine and query/analysis
tools. In fact, in an environment where PowerCenter, the target database, and query/analysis tools
all run on the same machine, the query/analysis tool often drives the hardware requirements.
However, if the loading is performed after business hours, the query/analysis tools requirements
may not be a sizing limitation.





PowerCenter Admin Console Challenge
Using the PowerCenter Administration Console to administer PowerCenter domain and services.

Description
PowerCenter has a service-oriented architecture that provides the ability to scale services and
share resources across multiple machines. The PowerCenter domain is the fundamental
administrative unit in PowerCenter. A domain is a collection of nodes and services that you can
group in folders based on administration ownership.
The Administration Console consolidates administrative tasks for domain objects such as services,
nodes, licenses, and grids. For more information on domain configuration, refer to the Best Practice
on Domain Configuration.

Folders and Security
It is a good practice to create folders in the domain in order to organize objects and manage
security. Folders can contain nodes, services, grids, licenses and other folders. Folders can be
created based on functionality type, object type, or environment type.

       Functionality-type folders group services based on a functional area such as Sales or
        Marketing.
       Object-type folders group objects based on the service type; for example, a folder
        containing all Integration Services.
       Environment-type folders group objects based on the environment; for example, if
        development and testing share the same domain, group the services according to the
        environment.

Create user accounts in the Administration Console, then grant permissions and privileges on the
folders the users need to access. It is a good practice for the administrator to monitor user activity
in the domain periodically and save the reports for audit purposes.

Nodes, Services, and Grids
A node is the logical representation of a machine in a domain. One node in the domain acts as a
gateway to receive service requests from clients and route them to the appropriate service and
node. Node properties can be set and modified using the Administration Console. Note that the
property that limits the number of sessions/tasks a node can run is "Maximum Processes"; set this
threshold high enough for your workload (for example, 200 is a reasonable value). If you are using
Adaptive Dispatch mode, it is a good practice to recalculate the CPU profile only when the node is
idle, since the calculation uses 100 percent of the CPU.
The Administration Console also lets you manage application services; you can access the
properties of all services from a single window.





Understanding and Setting UNIX Resources for
PowerCenter Installations Challenge
This Best Practice explains what UNIX resource limits are, and how to control and manage them.

Description
UNIX systems impose per-process limits on resources such as processor usage, memory, and file
handles. Understanding and setting these resources correctly is essential for PowerCenter
installations.

Understanding UNIX Resource Limits
UNIX systems impose limits on several different resources. The resources that can be limited
depend on the actual operating system (e.g., Solaris, AIX, Linux, or HPUX) and the version of the
operating system. In general, all UNIX systems implement per-process limits on the following
resources (this is the typical set reported by the ulimit command); there may be additional resource
limits, depending on the operating system:

        core file size
        data segment size
        file size
        open files (file descriptors)
        stack size
        CPU time
        virtual memory

These limits are implemented on an individual process basis. The limits are also 'inherited' by child
processes when they are created.
In practice, this means that the resource limits are typically set at log-on time and apply to all
processes started from the login shell. In the case of PowerCenter, any limits in effect before the
Integration Service is started also apply to all sessions (pmdtm) started from that node. Any limits in
effect when the Repository Service is started also apply to all pmrepagents started from that
repository service (a repository service process is an instance of the Repository Service running on
a particular machine or node).
When a process exceeds a resource limit, UNIX fails the operation that caused the limit to be
exceeded. Depending on the limit that is reached, memory allocations fail, files cannot be opened,
or the process is terminated when it exceeds its processor time.
Since PowerCenter sessions often use a large amount of processor time, open many files, and can
use large amounts of memory, it is important to set resource limits correctly, so that the operating
system does not deny PowerCenter the resources it needs while still protecting the machine from
runaway processes.


Hard and Soft Limits
Each resource that can be limited actually allows two limits to be specified: a 'soft' limit and a
'hard' limit. Hard and soft limits can be confusing.
From a practical point of view, the difference between hard and soft limits does not matter to
PowerCenter or any other process; the lower value is enforced when it is reached, whether it is a
hard or soft limit.
The difference between hard and soft limits really only matters when changing resource limits. The
hard limits are the absolute maximums set by the System Administrator and can only be changed
by the System Administrator. The soft limits are 'recommended' values set by the System
Administrator, and can be increased by the user, up to the hard limits.

UNIX Resource Limit Commands
The standard interface to UNIX resource limits is the 'ulimit' shell command, which displays and
sets resource limits. The C shell implements a variation of this command called 'limit', which has
different syntax but the same functions.

       ulimit -a        Displays all soft limits currently in effect
       ulimit -a -H     Displays all hard limits currently in effect

Recommended ulimit settings for a PowerCenter server are illustrated by the typical sequence
shown in the Setting Resource Limits section below: an unlimited core file size, generous data
segment and file size limits, at least 1024 open files, and unlimited CPU time and virtual memory.
Setting Resource Limits
Resource limits are normally set in the log-in script, either .profile for the Korn shell or .bash_profile
for the bash shell. One ulimit command is required for each resource being set, and usually the soft
limit is set. A typical sequence is:

                         ulimit -S -c unlimited
                         ulimit -S -d 1232896
                         ulimit -S -s 32768
                         ulimit -S -t unlimited
                         ulimit -S -f 2097152
                         ulimit -S -n 1024
                         ulimit -S -v unlimited

After running these commands, the limits are changed:

                         % ulimit -S -a
                         core file size (blocks, -c) unlimited
                         data seg size (kbytes, -d) 1232896
                         file size (blocks, -f) 2097152
                         max memory size (kbytes, -m) unlimited
                         open files (-n) 1024
                         stack size (kbytes, -s) 32768
                         cpu time (seconds, -t) unlimited
                         virtual memory (kbytes, -v) unlimited

Setting or Changing Hard Resource Limits
Setting or changing hard resource limits varies across UNIX types. Most current UNIX systems set
the initial hard limits in the file /etc/profile, which must be changed by a System Administrator. In
some cases, it is necessary to run a system utility such as smit on AIX to change the global system
limits.
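For illustration, a System Administrator might raise a hard limit by adding lines such as the following to /etc/profile (the values are assumptions; the correct file and mechanism vary by UNIX flavor, e.g., smit on AIX as noted above):

            ulimit -H -n 4096      # hard limit on open file descriptors
            ulimit -H -s 65536     # hard limit on stack size, in KB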



