Autosys Diagnose Tips by sebalopez


Troubleshooting Tips for Unicenter AutoSys 4.5.x
This document provides tips and references for troubleshooting your Unicenter AutoSys 4.5.x implementation.

Additional Resources
The Unicenter AutoSys User Guide provides helpful troubleshooting information, particularly in Chapter 14 “Troubleshooting”, Appendix B “Troubleshooting CCI” and Appendix C “General Debugging.” You should also consult the following links on the CA Support Online website:

– Techdoc Index for AutoSys
Notable troubleshooting topics include the following:
   – TEC424427: “I am getting the following message at my command prompt: Unrecognized job type”
   – TEC411871: “AutoSys 4.5.1 will not execute commands with a recent 12.5.1 Sybase Client installed on Remote Agents. When running command (autorep -G all) on a Remote Agent you get following message; This application failed to start because libsybdb.dll not found.”
   – TEC411389: “Error message received when trying to save a new job: "Unexpected Response from Listener" with ATSYS 4.5/ UWCC”
   – TEC406557: “Trying to start the eventor processor by running $AUTOSYS/bin/eventor. The event_demon fails to start and no real explanation or error is displayed as to why.”
   – TEC405710: “eTrust Access Control Locks out the Telnet in Unicenter AutoSys 4.5”
   – TEC428525: “Running "autoping" from the event processor machine to a newly installed Windows remote agent fails stating "read failed win socket error =10054". Attempts to telnet from the EP machine to the Remote Agent machine also fail.”

– Implementation CD AutoSys section
oSys/Autosys_Frame.htm
Note that, although this link contains older information, the basic AutoSys tips still apply.


Troubleshooting Steps
In troubleshooting your Unicenter AutoSys implementation (or any product implementation, for that matter), take care that your approach is consistent, repeatable, inquisitive and well documented. In general you will need to:

1. Define the problem
Clearly state what happened that should not have happened, or what did not happen that should have happened. Note the scope of the problem, including the date/time of occurrence, the affected jobs/machines/users/network, and any preceding jobs or recent activity on the machines involved in the transaction.

2. Identify versions/patch level
Do this for all affected components and note any security/firewall settings that may be in effect.

3. Confirm communication
Verify that the affected machines can “talk” to one another to determine whether the fault lies with a network/firewall/permissions error.

4. Execute the job manually
Verify that the job syntax is correct. This helps determine whether the fault lies with the scheduling system or with the job itself.

5. Check logs and system date/clock
Check what happened on the affected machine(s) and ensure that the system date and clock are correct, especially when using date/time related job parameters such as start_times and run_window. Use the autosyslog command to view either the event processor log file or the Remote Agent log file for a specified job. Both the Remote Agent and the Event Processor write diagnostic messages to their respective logs, as part of their normal operations and in response to detected error conditions. The syntax for this command is:

autosyslog [-e | -J job_name] [-p]

The event processor logs all events it processes and provides a detailed trace of its activities. The Remote Agent’s log displays the log for the specific job’s most recent run. Although the Remote Agent’s log file is deleted by default after a successful job run, it is not deleted at job completion if the job ended with a FAILURE status. The event processor log also contains a timestamped history of each event that occurs. Viewing this log is an alternative to monitoring “all jobs” and “all events.”

For more information on autosyslog, consult the Unicenter AutoSys Reference Guide for Windows and Unix.

6. Document the solution
Once the solution has been applied, take steps to prevent a recurrence. Typically, this involves education, development/documentation of standardized processes and conventions (e.g., naming conventions), or application of necessary patches (and establishing a process so that patching is not allowed to lapse).
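The autosyslog syntax shown in step 5 can be wrapped in a small helper when the same check is repeated for many jobs. The following is only a sketch: it builds the argument list and does not run the command. The function name and the job name "nightly_load" are hypothetical, and the meaning of -p is not described in this document, so it is simply passed through.

```python
from typing import List, Optional


def autosyslog_args(job_name: Optional[str] = None, with_p: bool = False) -> List[str]:
    """Build an argument list for: autosyslog [-e | -J job_name] [-p]."""
    args = ["autosyslog"]
    if job_name is None:
        # No job given: ask for the event processor log.
        args.append("-e")
    else:
        # Job given: ask for the Remote Agent log of that job.
        args.extend(["-J", job_name])
    if with_p:
        # Optional flag from the syntax line; see the Reference Guide for its meaning.
        args.append("-p")
    return args


print(" ".join(autosyslog_args()))                 # event processor log
print(" ".join(autosyslog_args("nightly_load")))   # hypothetical job name
```

The resulting list could be handed to subprocess.run on a machine where the AutoSys environment is already set up. Choosing -e when no job name is supplied is a design choice of this sketch, not documented product behavior.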

Unicenter AutoSys provides a highly flexible, very customizable scheduling tool for managing your workload environment; however, you should be very aware of the implications and restrictions inherent in any of the scheduling options before employing them. For example, if a predecessor job regularly exceeds its term_run_time, it can cause a ripple effect felt through the full chain of its dependent jobs.

Ensure that you (and anyone else who will be scheduling jobs through AutoSys) understand the architecture (e.g., components and their relationships, firewall requirements, job submission authorizations, etc.) and follow agreed upon standards for defining and submitting jobs, including file naming conventions and calendars.

Note: Naming conventions for jobs, calendars, variables and views should be clearly established as part of the initial architecture and consistently enforced throughout the implementation.

Job Related Errors
In some cases, a job’s failure to execute has more to do with the job itself than with the scheduling system. Therefore, one of your first troubleshooting steps should be to verify the validity of the job, including its syntax and access to required resources. If the job failed because the command being executed by the job returned an error, run the AutoSys autorep -J jobname -d command and investigate why the job abended:


In the example above, the command executed by the Job returned an exit code of “1” upon completion (see the “Pri/Xit” column). Notice that AutoSys attempted to run the Job twice (as seen in the Ntry column, which notes the number of restart attempts). At first, the job failed because the Remote Agent was not running (“Connect to socket FAILED”). However, that was corrected and AutoSys resubmitted the Job, which failed again for the same reason.

Make sure that the correct syntax is provided to enable the command, executable, UNIX shell script, application, or batch file (and its parameters) to run on the Remote Agent Client (when all necessary conditions are met). Keep the following in mind when using the command attribute in Job definitions:

– You cannot redirect standard input, output, and error files in the command attribute. Use other job attributes, such as std_in_file for standard input, to provide the necessary functionality.
– Environment variables for the command are defined by a default profile or the profile specified in the job definition. Although system environment variables are automatically set in the command’s environment, user environment variables are not. You must define all other required environment variables in the job’s profile.
– If a command works properly when issued at the command prompt, but fails to run properly when specified as a command attribute, the necessary user-defined environment variables and the variables defined in the job profile are probably different. If this is the case, compare the variables to verify that all required user environment variables are defined in the job profile. Information on how to do this can be found in the User Guides.
– When specifying drive letters in job definitions, you must enclose the colon character with quotation marks or backslashes. For example, C\:\tmp or "C:\tmp" is valid; C:\tmp is not.
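The drive-letter rule above is easy to miss when reviewing many job definitions. The following sketch flags drive-letter colons that are neither inside double quotes nor preceded by a backslash. The function name and the sample paths are hypothetical, and the check is a simple heuristic, not the product's own parser.

```python
from typing import List


def find_unescaped_drive_colons(value: str) -> List[int]:
    """Return the index of each drive-letter colon that is neither quoted nor escaped."""
    positions = []
    in_quotes = False
    for i, ch in enumerate(value):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == ":" and not in_quotes:
            # A drive colon follows a single letter that starts a token ...
            prev_is_letter = i >= 1 and value[i - 1].isalpha()
            starts_token = i < 2 or not value[i - 2].isalnum()
            # ... and is followed by a path separator.
            followed_by_sep = value[i + 1:i + 2] in ("\\", "/")
            if prev_is_letter and starts_token and followed_by_sep:
                positions.append(i)
    return positions


# Hypothetical command values for illustration.
for sample in (r"C:\tmp\run.bat", r"C\:\tmp\run.bat", r'"C:\tmp\run.bat"'):
    status = "needs escaping" if find_unescaped_drive_colons(sample) else "ok"
    print(sample, "->", status)
```

In the escaped form C\:\tmp the character before the colon is a backslash rather than a letter, so it is not flagged.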


Job Runs on Command Prompt but not through AutoSys
If a command runs manually at a Windows command prompt but fails with “job returned =-1” when run via AutoSys, check the system's PATH variable to see whether the path location of the command contains spaces. If it does, place the command's bin location at the beginning of the PATH variable in the Administrator GUI "System" Environment Variables, either surrounded by double quotes or using "~1" in place of the portion of the PATH definition that contains spaces. This allows the command to be found.
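A quick way to spot PATH entries of the kind described above is to split the variable and list the entries that contain a space and are not wrapped in double quotes. This is a minimal sketch with a hypothetical function name and hypothetical sample paths. Note that it inspects whatever PATH value it is given (by default, that of the process running the script), which may differ from the "System" value the Remote Agent sees.

```python
import os
from typing import List, Optional


def path_entries_with_spaces(path_value: Optional[str] = None,
                             sep: str = os.pathsep) -> List[str]:
    """Return PATH entries that contain a space and are not double-quoted."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    flagged = []
    for entry in path_value.split(sep):
        entry = entry.strip()
        quoted = len(entry) >= 2 and entry.startswith('"') and entry.endswith('"')
        if " " in entry and not quoted:
            flagged.append(entry)
    return flagged


# Hypothetical Windows-style PATH value; the separator is passed explicitly
# so the example behaves the same on any platform.
sample = r'C:\Program Files\CA\bin;C:\PROGRA~1\CA\bin;"C:\Program Files\Tools";C:\Windows'
for entry in path_entries_with_spaces(sample, sep=";"):
    print("contains spaces:", entry)
```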

Password and Permission Errors (Windows Only)
Jobs can also fail because the job owner’s ID and/or password have not been defined to AutoSys security, or because the owner does not have permission to start a job on a Client. When an Agent runs a job on a computer, it logs on as the user who owns the job. To enable the Agent to do this, the Scheduler passes both the job information and an encrypted version of the job owner’s password from the database to the Agent. You must therefore ensure that the password you provide is valid.

The EDIT Superuser can use autosys_secure interactively or from the command line to enter these IDs and/or passwords. After the EDIT Superuser enters the IDs and passwords, any user who knows an existing user ID and password can change the password or delete that user ID and password.

In the following example, the job could not run because user “Autosys” or its password had not been defined to AutoSys security.


To remedy this, first log on as the EDIT Superuser and run autosys_secure:

Select option 5: Manage AutoSys User@Host users. Then, select 1: Create AutoSys User@Host or Domain Password.

autosys_secure will prompt for credentials. Enter the correct user name, host or domain, and password:


autosys_secure can also be executed fully at the command prompt without requiring interaction.

Scheduling Problems in the Job Definition
If you include scheduling options, such as max_run_alarm, term_run_time or run_window, it is critical that you understand how these parameters work and how long the job typically takes to run, particularly when there are many dependencies spanning multiple platforms and machines.


If a job’s starting conditions have not been met, run the AutoSys job_depends -J jobname -d command to see why it could not start at its start time. For example:

In the example above, the job’s starting conditions had not been met because it can only run if its predecessor returns a 0 (exitcode=0). However, since the predecessor job was still running (and, therefore, had not yet returned a “0”) when the job’s date condition was met, it could not start. To avoid this type of problem, make sure that the job’s start_times attribute is set appropriately.

In the following example, the output of the job_depends -J jobname -d command shows that the job’s starting conditions have not been met because it can only run if its predecessor runs successfully. Since its predecessor failed, it cannot be started.


Maximum and Minimum Run Time Errors
If the job failed because it exceeded its maximum run time (specified through term_run_time), the job is taking longer than the specified time to finish, which might indicate that it is stuck in a loop or is waiting for additional data. Therefore, you should:

– Make sure that the job is not stuck in a loop or waiting for data that never arrived.
– Make sure that the maximum run time threshold is adequate.

Note: Keep in mind that if you used the max_run_alarm attribute, exceeding the limit will send an alarm; it will not cause the job to terminate.

Conversely, a job might fail to meet its minimum run time, finishing sooner than expected, which could also indicate that it is not running properly. In this case you should:

– Make sure that the job is getting all the data it needs to run properly.
– Verify that the minimum run time threshold is adequate.

Missed Run Window
The run_window attribute controls only when the job starts — not when it stops. If a job definition contains the run_window attribute, once the job becomes eligible to run (based on its starting conditions), Unicenter AutoSys JM verifies whether the specified run window includes the current time. If it does, the job starts. If it does not, the product determines whether to run the job based on the end of the previous run window and the beginning of the next run window. To see what happened, execute the following command:
autorep -J jobname -d

For example:


The run_window attribute is not, in itself, a starting condition; it is an additional control over when a job may start after its starting conditions are satisfied. This attribute is especially useful, for example, when you do not know when a monitored file may arrive and there are specific times when a job dependent on the monitored file should not run.

Therefore, make sure that the job’s condition (starting conditions), date_conditions (date/time conditions), and run_window attributes are all set appropriately (for example, a run window cannot span more than 24 hours). Then, if the job is on hold, make sure to run sendevent -E JOB_OFF_HOLD -J jobname before the end of the run window.

You should also consider the availability of resources required by the job. For example, notice that the Job below is queued and that it has a short run window.


This job may not start before the end of the run window because its load (job_load attribute) added to the load of the running job may exceed the max_load attribute of the machine they run on. In fact, that is exactly what happened in the example above:

Here you can see that the job did not run because there were not enough resources available before the end of its run window.
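The basic test described in this section, whether a run window includes the current time, can be sketched as follows. The "HH:MM-HH:MM" window format, the inclusive boundaries and the function names are assumptions made for illustration. The sketch does not model how the product weighs the previous and next run windows when the current time falls outside the window.

```python
from datetime import time


def parse_hhmm(text: str) -> time:
    """Parse a 24-hour "HH:MM" string into a time object."""
    hours, minutes = text.strip().split(":")
    return time(int(hours), int(minutes))


def in_run_window(window: str, now: time) -> bool:
    """True if `now` falls inside a window written as "HH:MM-HH:MM"."""
    start_text, end_text = window.split("-")
    start, end = parse_hhmm(start_text), parse_hhmm(end_text)
    if start <= end:
        return start <= now <= end
    # The window crosses midnight, for example 23:00-02:00.
    return now >= start or now <= end


print(in_run_window("23:00-02:00", time(1, 30)))   # inside a window that crosses midnight
print(in_run_window("08:00-10:00", time(10, 1)))   # just after the window has closed
```

A queued job with a short window, as in the example above, is at risk precisely because a check like this can turn false while the job is still waiting for machine load to free up.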


Retries Limit
When a job exceeds the maximum number of retries specified by n_retrys in the job definition or Max Restart Trys in the instance configuration, it exits with a failure status. The n_retrys attribute applies to application failures – for example, where AutoSys is unable to find a file or command, or where permissions were not properly set. It does not apply to system or network failures such as when a computer is unavailable or a socket connection has timed out. Job restarts after system or network failures are controlled by the Max Restart Trys setting on the Unicenter AutoSys JM Administrator Scheduler window. The delay between restarts is determined by the Restart Constant and Restart Factor settings on the Unicenter AutoSys JM Administrator Scheduler window which are limited by the maximum specified by the Max Restart Wait setting.

The following formula is used to calculate wait time:

Wait Time = Restart Constant + (Max Restart Trys * Restart Factor)
If Wait Time > Max Restart Wait, then Wait Time = Max Restart Wait.

If necessary, define the number of times to attempt to restart the job after it exits with a failure status. The n_retrys value can be set to any integer ranging from 0 to 20 (default: 0, meaning the job will not restart). For example:

n_retrys: 3

specifies that the job will automatically restart up to three times after an application failure. This means that the job would start as scheduled and, if it fails, restart up to three times for a total of four attempts.
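The wait-time formula and the attempt count in this section can be worked through with a few lines of code. This is a sketch of the arithmetic only: the function names and the sample values are hypothetical, and the unit is whatever unit the Scheduler settings use.

```python
def restart_wait_time(restart_constant: int, restart_factor: int,
                      max_restart_trys: int, max_restart_wait: int) -> int:
    """Restart Constant + (Max Restart Trys * Restart Factor), capped at Max Restart Wait."""
    wait_time = restart_constant + (max_restart_trys * restart_factor)
    return min(wait_time, max_restart_wait)


def total_attempts(n_retrys: int) -> int:
    """The scheduled run plus up to n_retrys restarts after application failures."""
    if not 0 <= n_retrys <= 20:
        raise ValueError("n_retrys must be between 0 and 20")
    return 1 + n_retrys


# Hypothetical settings: constant 60, factor 5, 10 tries.
print(restart_wait_time(60, 5, 10, 300))  # 60 + 10 * 5 = 110, below the cap
print(restart_wait_time(60, 5, 10, 100))  # 110 exceeds the cap, so 100
print(total_attempts(3))                  # 1 scheduled run + 3 restarts = 4
```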

Job Date\Time Conditions Not Met
Make sure the job is scheduled according to its date/time conditions. These are specified by the days_of_week, start_times, start_mins, and run_calendar attributes. Attempting to start the Job via sendevent -E STARTJOB -J jobname -T "MM/DD/YYYY HH:MM" will result in a date/time condition failure. The Job report will show:

In the example above, you can see that the job is scheduled to run on 08/07/2007 at 21:46 (Job definition). However, it was manually scheduled to run on the present date at 22:55. The Event State (ES) is Processed (PD), but the Job Status (ST) is Inactive (IN).

Term Run Time Limit Exceeded
A Job may terminate either because it exceeded its term_run_time attribute, which designates the maximum run time (in minutes) that the job should require to finish normally, or because it was killed with a command such as sendevent -E KILLJOB -J jobname. When a job runs longer than the time specified by the term_run_time attribute, it is terminated by Unicenter AutoSys JM.


Note: Under Windows, processes launched by user applications or batch (*.bat) files are not terminated. Unicenter AutoSys JM only terminates the CMD.EXE process that it used to launch the job. Otherwise, Unicenter AutoSys JM kills the process specified in the command definition. In UNIX, all child processes associated with the command process are killed.

If necessary, define the maximum number of minutes the job should ever require to finish normally. term_run_time can be set to any integer (default: 0, meaning the job is allowed to run forever). For example:
term_run_time: 15

specifies that the job will terminate if it runs for more than 15 minutes.
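The difference between term_run_time (the job is terminated) and max_run_alarm (an alarm is sent, but the job keeps running) can be summarized in a short sketch. The function name is hypothetical. Treating max_run_alarm as a value in minutes where 0 means "not set" is an assumption carried over from the description of term_run_time.

```python
def run_time_action(elapsed_minutes: float, max_run_alarm: int = 0,
                    term_run_time: int = 0) -> str:
    """Return "terminate", "alarm" or "none" for a job that has run for elapsed_minutes."""
    # term_run_time: 0 means the job is allowed to run forever.
    if term_run_time > 0 and elapsed_minutes > term_run_time:
        return "terminate"
    # max_run_alarm: exceeding it raises an alarm but does not stop the job.
    if max_run_alarm > 0 and elapsed_minutes > max_run_alarm:
        return "alarm"
    return "none"


print(run_time_action(20, term_run_time=15))                   # past the 15-minute limit
print(run_time_action(20, max_run_alarm=15))                   # alarm only, job continues
print(run_time_action(10, max_run_alarm=5, term_run_time=15))  # alarm first, not yet terminated
```

Setting max_run_alarm below term_run_time, as in the last call, gives operators a warning before the job is actually terminated.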

Unable to Contact Machine
In some cases a job does not execute because network problems, such as name resolution errors, or firewall configurations prevent the AutoSys Scheduler from reaching the Job Management Agent in the first place. Use diagnostic tools, such as tracert and pathping, to help determine problems such as broken links.
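Alongside tracert and pathping, a plain TCP connection attempt from the Scheduler machine shows whether name resolution and any firewall in between allow a connection at all. The following is a minimal sketch. The function name is hypothetical, and the host name and port to test depend on your configuration and are not given in this document.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name resolution errors, refused connections and timeouts.
        return False
```

For example, can_connect("agent-host", port) would be called with the Remote Agent machine's name and the port that agent is configured to listen on. A False result narrows the fault to the network, name resolution or a firewall rather than to the job definition.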

Tools and Verification Checks
CCI is used to facilitate communications between components and this is particularly critical when cross platform scheduling is in effect. To verify that the necessary CCI components are running, execute the following command:
ccicntrl status

Here you can see an example of the results:

Depending upon the exact configuration of the machines in your environment, NR-Client may be running instead of NR-Server. Usually NR-Server is installed and remote is used for persistent connections. Therefore, at least two CCI components must be running: Transport and NR-Server. You should also make sure CCI is sending and receiving by using CCIR and CCIS utilities. For example:


If the required CCI components are running and there are no network related issues, verify that the Event Management components are running by executing the following commands:
unifstat -c evtd
unifstat -c evtr

For example:


In the example above, the Event Management components, which are essential to remediation, have in fact stopped. Restart Event Management by running the following command:
unicntrl start opr

Diagnostic tools such as tracert and pathping can help determine problems such as broken links.

