
Oracle RAC Cluster Troubleshooting


The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.

CRS & RAC Troubleshooting

Krishnadev Telikicherla
Cluster & Parallel Storage Technology
Oracle Corporation



Topics:

• Defining the Issue
• Creating a Timeline
• Hang or Slowdown
• Performance Issues
• Gathering Data
• Testcases
• Rediscovery
• Engaging Oracle Support
• Examples



Defining the Issue: Layers

What layers are involved in the issue:

• Oracle Clusterware
  – CRS daemon
  – CSS daemon
  – HangCheckTimer [Linux] / Oprocd (not Linux)
  – EVM
  – OCR
  – Voting
• General RDBMS
• Operating System
• Hardware



Defining the Issue: Cause vs. Effects

Causes:

– Resource issues
– Oracle issues
– OS issues

Effects:

– Hangs/Spins
– Instance Crashes and Evictions
– Node Reboots and Evictions
– Oracle Errors (ORA-600, ORA-7445, ORA-29740)



Defining the Issue: Description

• When describing the problem while creating the SR via Metalink, use phrases that will help identify known issues, either in bugs or in Metalink content.
• In the body of the SR, be as detailed as possible about the environment.
• Nobody knows the system better than you.
• Talk to the sys-admin as well regarding OS/network related issues.



Creating a Timeline

• A timeline helps identify the times to concentrate on when reviewing files
• A timeline can be built by reviewing the files themselves once they are provided to support, but this will only slow resolution down
• Timelines should include an ordering of causes and effects and cover all participating nodes
• Include specific times, e.g.:
  – At 3:00am PST we noticed that node2 was hanging.
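The timeline advice above can be sketched in a few lines of Python. This is only an illustration: the `Event` type, the function names, and the timestamp format are assumptions, and it presumes the node clocks are synchronized and that all times are given in the same timezone.

```python
from datetime import datetime
from typing import NamedTuple

# Assumed timestamp format for the notes taken on each node.
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


class Event(NamedTuple):
    when: datetime  # time the event was observed
    node: str       # node that reported it
    what: str       # short description of the cause or effect


def build_timeline(events_by_node):
    """Merge {node: [(timestamp, description), ...]} into one ordered list."""
    merged = []
    for node, entries in events_by_node.items():
        for stamp, what in entries:
            merged.append(Event(datetime.strptime(stamp, STAMP_FORMAT), node, what))
    # Order by time first, then node name so simultaneous events sort stably.
    return sorted(merged, key=lambda e: (e.when, e.node))


def format_timeline(timeline):
    """Render the merged events as one line each, ready to paste into an SR."""
    return [f"{e.when:{STAMP_FORMAT}} {e.node}: {e.what}" for e in timeline]
```

Feeding it the notes from every participating node yields a single cause-and-effect ordering, which is the form the slide recommends.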



Hang or Slowdown

• Differentiate between a database hang and a database slowdown
• Identify the extent of a hang



Is it a Hang or a Slowdown?

Check:

– System states to see if there is any change over a short period of time
– V$SESSION_WAIT where wait_time=0
– Overall machine load, including cpu, memory, swap, I/O
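One way to apply the "any change over a short period of time" check is to compare two snapshots of V$SESSION_WAIT taken a short interval apart. The sketch below assumes the rows have already been fetched and keyed by SID, with each value holding the EVENT, SEQ# and WAIT_TIME columns. The function name and the snapshot layout are illustrative, not part of any Oracle tool.

```python
def classify_sessions(first, second):
    """Compare two snapshots of {sid: (event, seq, wait_time)}.

    A session is reported as stuck when it is waiting in both snapshots
    (wait_time == 0) on the same event with the same SEQ#, meaning the wait
    has not completed in between. Everything else is reported as moving.
    """
    stuck, moving = [], []
    for sid, (event, seq, wait_time) in second.items():
        before = first.get(sid)
        if before is None:
            continue  # session did not exist in the first snapshot
        waiting_both_times = before[2] == 0 and wait_time == 0
        if waiting_both_times and before[0] == event and before[1] == seq:
            stuck.append(sid)
        else:
            moving.append(sid)
    return sorted(stuck), sorted(moving)
```

If every non-idle session lands in the stuck list, the symptom looks like a hang; if most sessions are moving, it is more likely a slowdown. Idle waits (for example a session waiting for its client) would need to be filtered out first, since they also appear unchanged between snapshots.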



Is it a Hang or a Slowdown?

Single or multiprocess hang:

– Usually characterized by a particular job hanging or not completing
– Essentially the same as in single instance unless it’s internode parallel query.

Instance hang:

– A single instance is unusable.

Multi-instance or full database hang:

– Entire database is hung or not responding



Performance

• Single process or statement
• Instance
• Multi-Instance



Single Process or Single Statement

• Find the wait event
• 10046 level 12
  - oradebug setorapid
  - oradebug event 10046 trace name context forever, level 12
  - oradebug tracefile_name
• Explain plan
• 10053 if plan problems are found
• V$SESSTAT
• Truss/trace/dbx/pstack if OS-related problems are suspected
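Once a 10046 trace has been collected, finding the dominant wait event amounts to totalling the wait lines. The following is a minimal sketch, assuming wait lines of the form `WAIT #<cursor>: nam='<event>' ela= <elapsed> ...`. The regular expression and the function name are illustrative, and the unit of `ela` depends on the database version.

```python
import re
from collections import defaultdict

# Assumed shape of a wait line in the trace file, for example:
#   WAIT #3: nam='db file sequential read' ela= 1200 file#=4 block#=17 ...
WAIT_LINE = re.compile(r"^WAIT #\d+: nam='(?P<event>[^']+)' ela= *(?P<ela>\d+)")


def summarize_waits(lines):
    """Return [(event, total_elapsed, count)] sorted by total elapsed time."""
    totals = defaultdict(lambda: [0, 0])  # event -> [total elapsed, count]
    for line in lines:
        match = WAIT_LINE.match(line)
        if match:
            entry = totals[match.group("event")]
            entry[0] += int(match.group("ela"))
            entry[1] += 1
    rows = [(event, total, count) for event, (total, count) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Passing it the lines of the trace file named by `oradebug tracefile_name` gives a quick ranking of where the traced process spent its wait time.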



Instance Slowdown

• Statspack / AWR
• OS performance statistics - cpu, memory, and I/O
• Characteristics:
  – Related to a particular job?
  – Certain time of day?
  – What’s changed?



Multi-Instance Slowdowns

• AWR from each node can be of use:
  – AWR collects instance specific data
  – Examine and correlate the reports
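Correlating the per-instance reports can start with something as simple as asking which wait events rank near the top on every instance. The sketch below assumes the top wait events have already been copied out of each instance's report into a dictionary. The function name, the input layout, and the `top_n` default are all illustrative.

```python
def common_top_events(per_instance, top_n=5):
    """Find events that rank in the top_n on every instance.

    per_instance maps instance name -> {event name: time waited}.
    The result is ordered by total time waited across all instances.
    """
    top_sets = []
    for waits in per_instance.values():
        ranked = sorted(waits, key=waits.get, reverse=True)[:top_n]
        top_sets.append(set(ranked))
    if not top_sets:
        return []
    shared = set.intersection(*top_sets)
    return sorted(
        shared,
        key=lambda event: sum(waits[event] for waits in per_instance.values()),
        reverse=True,
    )
```

An event that dominates on all instances points toward a cluster-wide cause, while one that appears on a single instance suggests the problem is local to that node.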



Multi-Instance Slowdowns

In cases of extreme slowdowns:

– systemstates on all nodes
– V$SESSION_WAIT
– Alert logs and any trace files
– Process states, or stack traces if determined and applicable



Debugging Techniques

• v$session_wait
• System states from all nodes
• 10046 level 12 trace of the hung process
• ORADEBUG
• Lock layer and DLM tracing
• Get any traces:
  – DLM traces
  – Background processes, alert logs, and init.ora
  – User traces



Debugging and Diagnostics

Performance issues or hangs:

– Identify the resource being requested.
– Identify who holds the resource.
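The two steps above (what is being requested, and who holds it) are usually repeated until a holder is found that is not itself waiting. A minimal sketch of that chain walk follows. It assumes the waiter-to-holder pairs have already been extracted from whatever diagnostic output is available, and the function name and session labels are hypothetical.

```python
def find_root_blockers(blocked_by):
    """Group waiting sessions under the session at the head of their chain.

    blocked_by maps a waiting session -> the session holding the resource it
    requested. Returns {root holder: sorted list of sessions behind it}.
    Sessions caught in a cycle (a deadlock) have no root and are omitted.
    """
    roots = {}
    for waiter in blocked_by:
        seen = {waiter}
        current = waiter
        while current in blocked_by:
            current = blocked_by[current]
            if current in seen:  # walked back onto the chain: a cycle
                current = None
                break
            seen.add(current)
        if current is not None:
            roots.setdefault(current, []).append(waiter)
    return {root: sorted(waiters) for root, waiters in roots.items()}
```

In a RAC environment the holder may be on a different node from the waiter, so the labels in this sketch would need to carry both the instance and the session identifier.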



ORADEBUG and Tools

Hang analyze:

– hanganalyze

• Note: 301137.1 – OS Watcher User Guide
• Note: 135714.1 – Script to Collect RAC Diagnostic Information (racdiag.sql)



Gathering Data: Best Practices

• Single most important step
• There is never too much data, but including lots of useless data increases both the download time and the time needed to process the data.
• Always err on the side of getting too much data, but be aware of the impact on the resolution time.
• Too little data increases resolution time more than too much data.
• Always include a readme.txt file that explains the contents of the provided files
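The readme.txt can be generated alongside the upload so that no file goes unexplained. The sketch below is one possible approach. The function name, the layout of the readme, and the "NO DESCRIPTION PROVIDED" marker are assumptions made for illustration.

```python
from pathlib import Path


def write_readme(directory, descriptions, summary):
    """Write readme.txt in `directory`, listing each file with its description.

    descriptions maps file name -> one-line explanation. Files without an
    entry are flagged so they can be described or dropped before upload.
    """
    root = Path(directory)
    lines = [summary, "", "Files:"]
    files = sorted(p for p in root.iterdir() if p.is_file() and p.name != "readme.txt")
    for path in files:
        note = descriptions.get(path.name, "NO DESCRIPTION PROVIDED")
        lines.append(f"  {path.name} ({path.stat().st_size} bytes): {note}")
    readme = root / "readme.txt"
    readme.write_text("\n".join(lines) + "\n")
    return readme
```

Flagging undescribed files is a cheap way to catch the "lots of useless data" case before it adds to the download and processing time.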



Gathering Data: Processes

Always get stacks from processes that seem to be spinning, hanging or unresponsive:

– oradebug
– gdb
– pstack

ps and top info can be very useful when trying to determine whether a process exhibits issues such as memory leaks, spinning or hanging
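Repeated ps or top samples for one process can be screened with two simple checks. This is a rough sketch under stated assumptions: the samples are already parsed into lists of numbers, and the thresholds (1024 KB of growth, 95% CPU, at least three samples) are arbitrary illustrative values, not recommendations.

```python
def looks_like_leak(rss_kb_samples, min_growth_kb=1024):
    """True if resident memory never shrinks and grows by min_growth_kb or more."""
    if len(rss_kb_samples) < 3:
        return False  # too few samples to call it a trend
    pairs = zip(rss_kb_samples, rss_kb_samples[1:])
    never_shrinks = all(later >= earlier for earlier, later in pairs)
    growth = rss_kb_samples[-1] - rss_kb_samples[0]
    return never_shrinks and growth >= min_growth_kb


def looks_like_spin(cpu_pct_samples, threshold=95.0):
    """True if every sample shows the process at or above the CPU threshold."""
    if len(cpu_pct_samples) < 3:
        return False
    return all(sample >= threshold for sample in cpu_pct_samples)
```

A process flagged by either check is a candidate for collecting stacks with the tools listed above. The checks only narrow down where to look and do not diagnose anything by themselves.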



Gathering Data: RAC

• For instance evictions please review Metalink note 219361.1
• See Metalink note 203226.1: RAC Survival Kit: Real Application Clusters Troubleshooting and Information
• See Metalink note 289690.1: Data Gathering for Troubleshooting RAC and CRS issues



Gathering Data: Tools

• RDA – system and Oracle configuration information
• racdiag – modifiable SQL script for gathering RAC data. See Metalink note 135714.1 “Script to Collect RAC Diagnostic Information”
• OSW – OS Watcher gathers top, slabinfo, netstat and ps data over programmable intervals. See Metalink note 301137.1 “OS Watcher User Guide”



Gathering Data: CRS 10.2.0.x (continued)

CRS and other resource issues:

• ORA_CRS_HOME
  – log//cssd/oclsmon
  – log//cssd
  – log//client
  – log//crsd
  – log//evmd
  – log//racg
• ORACLE_HOME (rdbms)
  – racg/dump
• ORACLE_BASE//hdump



Gathering Data: Tools (continued)

Starting with 10.2.0.1, $ORA_CRS_HOME/bin/diagcollection.pl collects all RAC relevant files (run as root):

oracle10@stnsp010>./diagcollection.pl
Production Copyright 2004, 2005, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
diagcollection
    --collect
        [--crs] For collecting crs diag information
        [--oh] For collecting oracle home diag information
        [--ob] For collecting oracle base diag information
        [--all] Default.For collecting all diag information
        NOTE:
        1. You can also do the following
           ./diagcollection.pl --collect --crs --oh
        2. ORA_CRS_HOME,ORACLE_HOME and ORACLE_BASE env variables need to be set.
    --clean cleans up the diagnosability information gathered by this script
    --coreanalyze extracts information from core files and stores it in a text file
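Because the tool expects ORA_CRS_HOME, ORACLE_HOME and ORACLE_BASE to be set, a small wrapper can check them before building the command line. The sketch below only assembles the argument list from the options shown in the usage text. The function name is hypothetical, and actually running the command (as root) is left to the caller.

```python
import posixpath  # the tool runs on Unix hosts, so join with "/" explicitly

REQUIRED_VARS = ("ORA_CRS_HOME", "ORACLE_HOME", "ORACLE_BASE")


def build_diagcollection_cmd(env, crs=False, oh=False, ob=False):
    """Return the argument list for a --collect run.

    env is a mapping such as os.environ. Raises ValueError if any of the
    variables the usage text requires is unset or empty.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("unset environment variables: " + ", ".join(missing))
    script = posixpath.join(env["ORA_CRS_HOME"], "bin", "diagcollection.pl")
    cmd = [script, "--collect"]
    # With no component selected, the usage text says --all is the default.
    for flag, wanted in (("--crs", crs), ("--oh", oh), ("--ob", ob)):
        if wanted:
            cmd.append(flag)
    return cmd
```

The resulting list can be handed to `subprocess.run` once the environment has been verified. Failing early on a missing variable avoids a collection run that silently skips one of the homes.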



Testcases

• Not always feasible
• If provided, can greatly influence resolution time
• When providing a testcase:
  – Include a readme file
  – Try to strip the testcase down to the minimal elements that are needed to reproduce the problem
• If at all possible, always try to build a testcase
• Testcases are your friends!



Rediscovery

• Expensive for a support organization
• Issue rediscovery is not always obvious
• Use Metalink to identify possible causes for issues as well as workarounds and patch availability
• Communicate new issues between DBAs



Engaging Oracle Support

Try to be responsive to all TARs when they are set to CUS status. Delays inherently cause two problems:

1. The issue loses momentum
2. A new engineer may have to take over the issue



Examples

• 10.2.0.2
• HP-UX/Itanium
• ServiceGuard, CRS, CFS and RAC
• Delays in reconfiguration



Examples

• 10.2.0.2
• Linux
• CRS, RAC and ASM
• ORA-600[2103] and one instance crashed



Questions?


