CRS & RAC
Troubleshooting
Krishnadev Telikicherla
Cluster & Parallel Storage Technology
Oracle Corporation
Topics:
Defining the Issue
Creating a Timeline
Hang or Slowdown
Performance Issues
Gathering Data
Testcases
Rediscovery
Engaging Oracle Support
Examples
Defining the Issue
Layers
What layers are involved in the issue:
• Oracle Clusterware
• CRS daemon
• CSS daemon
• HangCheckTimer (Linux) / Oprocd (non-Linux)
• EVM
• OCR
• Voting
• General RDBMS
• Operating System
• Hardware
Defining the Issue
Cause vs. Effects
Causes:
– Resource issues
– Oracle issues
– OS issues
Effects:
– Hangs/Spins
– Instance Crashes and Evictions
– Node Reboots and Evictions
– Oracle Errors (ORA-600, ORA-7445, ORA-29740)
Defining the Issue
Description
When describing the problem in the SR you create via
Metalink, use phrases that will help match it against
known issues, whether in bugs or in Metalink content.
In the body of the SR, be as detailed as possible
about the environment.
Nobody knows the system better than you.
Talk to the sysadmin as well about OS- and
network-related issues.
Creating a Timeline
A timeline helps identify the times to concentrate on
when reviewing files.
Support can build a timeline from the files once they
are provided, but doing so slows resolution.
Timelines should order causes and effects and should
cover all participating nodes.
Include specific times, e.g.:
– At 3:00am PST we noticed that node2 was hanging.
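The ordering step can be sketched in a few lines of Python. This is only an illustration: the function name `build_timeline`, the tuple layout and the sample events are hypothetical, and the sketch assumes each observation was noted with an ISO-8601 timestamp that carries its UTC offset so entries from different nodes compare correctly.

```python
from datetime import datetime

def build_timeline(events):
    """Merge (timestamp, node, description) tuples into one ordered list.

    Timestamps are ISO-8601 strings that include a UTC offset, so entries
    recorded on different nodes sort correctly against each other.
    """
    parsed = [(datetime.fromisoformat(ts), node, text) for ts, node, text in events]
    parsed.sort(key=lambda entry: entry[0])
    return [f"{ts.isoformat()} [{node}] {text}" for ts, node, text in parsed]

# Hypothetical observations, for illustration only.
events = [
    ("2007-03-01T03:05:00-08:00", "node1", "node2 evicted from the cluster"),
    ("2007-03-01T03:00:00-08:00", "node2", "node2 noticed hanging"),
    ("2007-03-01T03:02:00-08:00", "node1", "users report slow response"),
]
timeline = build_timeline(events)
for line in timeline:
    print(line)
```

Keeping the node name on every line makes it easier to see which node showed the first symptom and which nodes only showed effects.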
Hang or slowdown
Differentiate between a database hang and a
database slowdown
Identify the extent of a hang
Is it a Hang or a Slowdown?
Check:
System states to see if there is any change
over a short period of time
V$SESSION_WAIT where wait_time=0
Overall machine load, including cpu,
memory, swap, I/O
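One way to apply the V$SESSION_WAIT check is to take two snapshots a short time apart and compare them. The Python sketch below is a heuristic, not an Oracle-provided tool: it assumes you have already fetched SID, SEQ# and EVENT for the sessions with wait_time=0 into dictionaries, and that idle waits have been filtered out beforehand. The function names and sample rows are hypothetical.

```python
def classify_sessions(first, second):
    """Compare two V$SESSION_WAIT snapshots taken a short time apart.

    Each snapshot maps SID -> (SEQ#, EVENT). SEQ# is incremented for every
    new wait, so an unchanged (SEQ#, EVENT) pair means the session has been
    in the same wait for the whole interval.
    """
    result = {}
    for sid, (seq, event) in second.items():
        if sid not in first:
            result[sid] = "new"
        elif first[sid] == (seq, event):
            result[sid] = "stuck"
        else:
            result[sid] = "progressing"
    return result

def looks_like_hang(classification):
    """True when every session present in both snapshots made no progress."""
    known = [state for state in classification.values() if state != "new"]
    return bool(known) and all(state == "stuck" for state in known)

# Hypothetical snapshot data, for illustration only.
snap1 = {101: (4521, "enq: TX - row lock contention"),
         102: (88, "db file sequential read")}
snap2 = {101: (4521, "enq: TX - row lock contention"),
         102: (97, "db file sequential read")}

states = classify_sessions(snap1, snap2)
print(states)
print("hang" if looks_like_hang(states) else "slowdown or normal activity")
```

If some sessions still advance while others stay in the same wait, the problem is more likely a slowdown or a localized hang than a full database hang.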
Is it a Hang or a Slowdown?
Single or multiprocess hang:
– Usually characterized by a particular job
hanging or not completing
– Essentially the same as in single instance
unless it’s internode parallel query.
Instance hang: A single instance is
unusable.
Multi-instance or full database hang: Entire
database is hung or not responding
Performance
Single process or statement
Instance
Multi-Instance
Single Process or Single
Statement
Find the wait event
10046 level 12
- oradebug setorapid <Oracle PID of the process to trace>
- oradebug event 10046 trace name context forever, level 12
- oradebug tracefile_name
Explain plan
10053 if plan problems are found
V$SESSTAT
Truss/trace/dbx/pstack if OS-related
problems are suspected
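When the same trace has to be started on several processes, it can help to generate the oradebug command sequence instead of typing it. The following Python sketch only builds the text of the commands listed above; the function names are hypothetical, and the resulting lines would still have to be run by hand in a SQL*Plus session connected as SYSDBA.

```python
VALID_LEVELS = (1, 4, 8, 12)  # 4 adds bind values, 8 adds waits, 12 adds both

def trace_10046_script(orapid, level=12):
    """Return the oradebug commands that start a 10046 trace on one process."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported 10046 level: {level}")
    return "\n".join([
        f"oradebug setorapid {int(orapid)}",
        f"oradebug event 10046 trace name context forever, level {level}",
        "oradebug tracefile_name",
    ])

def stop_10046_script(orapid):
    """Return the oradebug commands that switch the trace off again."""
    return "\n".join([
        f"oradebug setorapid {int(orapid)}",
        "oradebug event 10046 trace name context off",
    ])

# Hypothetical Oracle PID, for illustration only.
print(trace_10046_script(23))
print(stop_10046_script(23))
```

Remember to switch the trace off once enough data has been captured, since level 12 traces grow quickly.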
Instance Slowdown
Statspack / AWR
OS performance statistics - cpu, memory,
and I/O
Characteristics:
– Related to a particular job?
– Certain time of day?
– What’s changed?
Multi-Instance Slowdowns
AWR from each node can be of use:
AWR collects instance specific data
Examine and correlate the reports
Multi-Instance Slowdowns
In cases of extreme slowdowns:
systemstates on all nodes
V$SESSION_WAIT
Alert logs and any trace files
Process states or stack traces, where a
specific process has been identified
Debugging Techniques
v$session_wait
System states from all nodes
10046 level 12 trace of the hung process
ORADEBUG
Lock layer and DLM tracing
Get any traces:
DLM traces
Background processes, alert logs, and init.ora
User traces
Debugging and Diagnostics
Performance issues or hangs:
Identify the resource being requested.
Identify who holds the resource.
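The two questions above amount to walking a wait-for chain until you reach a session that holds a resource but is not itself waiting. The Python sketch below assumes you have already extracted waiter/holder pairs from your diagnostics (for example from a system state or hang analysis); the data layout, with sessions written as (instance, SID) tuples and one holder per waiter, and the function names are hypothetical.

```python
def root_blockers(waits):
    """waits maps waiter -> holder. Return holders that are not waiting themselves."""
    return sorted({holder for holder in waits.values() if holder not in waits})

def wait_chain(waits, session):
    """Follow the chain from one waiting session to its final holder.

    If the chain loops back on itself (a deadlock), the repeated session is
    appended once and the walk stops.
    """
    path = [session]
    seen = {session}
    while path[-1] in waits:
        nxt = waits[path[-1]]
        path.append(nxt)
        if nxt in seen:
            break
        seen.add(nxt)
    return path

# Hypothetical waiter -> holder pairs across two instances.
waits = {
    (1, 101): (2, 55),
    (2, 55): (2, 70),
    (1, 130): (2, 70),
}
print(root_blockers(waits))
print(wait_chain(waits, (1, 101)))
```

The root blocker is usually the process to collect stacks and traces from first.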
ORADEBUG and Tools
Hang analyze:
– hanganalyze
Note: 301137.1 – OS Watcher User Guide
Note: 135714.1 – Script to Collect RAC
Diagnostic Information (racdiag.sql)
Gathering Data
Best Practices
Single most important step
There is never too much data, but lots of irrelevant
data increases both the download time and the time
needed to process it.
Always err on the side of gathering too much data, but
be aware of the impact on resolution time.
Too little data increases resolution time more than too
much data.
Always include a readme.txt file that explains the
contents of the provided files.
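The readme can be generated so that no file is left undescribed. This Python sketch is one possible approach, not a prescribed format: the function name `write_readme`, the layout of the file and the placeholder text for missing descriptions are all assumptions.

```python
import os

def write_readme(directory, descriptions, readme_name="readme.txt"):
    """Write a readme listing every regular file in `directory`.

    `descriptions` maps file name -> one-line explanation. Files without an
    entry are flagged so the gap is visible before the upload.
    """
    lines = ["Contents of this upload:", ""]
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name == readme_name or not os.path.isfile(path):
            continue
        size = os.path.getsize(path)
        note = descriptions.get(name, "NO DESCRIPTION PROVIDED")
        lines.append(f"{name} ({size} bytes): {note}")
    readme_path = os.path.join(directory, readme_name)
    with open(readme_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return readme_path
```

A call such as `write_readme("/tmp/upload", {"alert_node1.log": "alert log from node1, covers the hang"})` (paths and names hypothetical) produces one line per file, including the node and time range each file covers if you put that in the description.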
Gathering Data
Processes
Always get stacks from processes that seem
to be spinning, hanging or unresponsive:
– oradebug
– gdb
– pstack
ps and top output can be very useful when
trying to determine whether a process exhibits
issues such as memory leaks, spinning or
hanging.
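As an illustration of how repeated ps samples can point to a memory leak, the Python sketch below checks whether a process's resident set size only ever grows across samples. It is a rough heuristic under stated assumptions: the samples are (elapsed seconds, RSS in kB) pairs you collected yourself, and the function name and the 10 MB growth threshold are arbitrary choices, not Oracle guidance.

```python
def looks_like_leak(samples, min_growth_kb=10240):
    """Flag a process whose RSS never shrinks and grows past a threshold.

    `samples` is a time-ordered list of (elapsed_seconds, rss_kb) pairs.
    At least three samples are required before drawing any conclusion.
    """
    rss = [kb for _, kb in samples]
    if len(rss) < 3:
        return False
    never_shrinks = all(later >= earlier for earlier, later in zip(rss, rss[1:]))
    return never_shrinks and (rss[-1] - rss[0]) >= min_growth_kb

# Hypothetical samples taken one minute apart.
growing = [(0, 100000), (60, 120000), (120, 145000)]
steady = [(0, 100000), (60, 100400), (120, 100100)]
print(looks_like_leak(growing))
print(looks_like_leak(steady))
```

A positive result is only a hint; growth during startup or a large sort looks the same over a short window, so longer sampling periods give a more reliable picture.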
Gathering Data
RAC
For instance evictions please review Metalink
note 219361.1
See Metalink note 203226.1 : RAC Survival
Kit: Real Application Clusters Troubleshooting
and Information
See Metalink note 289690.1 : Data Gathering
for Troubleshooting RAC and CRS issues
Gathering Data
Tools
RDA – system and Oracle configuration information
racdiag – modifiable SQL script for gathering RAC data. See
Metalink note 135714.1, “Script to Collect RAC Diagnostic
Information”
OSW – OS Watcher gathers top, slabinfo, netstat and ps data
over programmable intervals. See Metalink note 301137.1,
“OS Watcher User Guide”
Gathering Data
CRS 10.2.0.x (continued)
CRS and other resource issues:
– ORA_CRS_HOME
log//cssd/oclsmon
log//cssd
log//client
log//crsd
log//evmd
log//racg
– ORACLE_HOME (rdbms)
racg/dump
ORACLE_BASE//hdump
Gathering Data
Tools (continued)
Starting with 10.2.0.1, $ORA_CRS_HOME/bin/diagcollection.pl collects all
RAC-relevant files (run as root)
oracle10@stnsp010>./diagcollection.pl
Production Copyright 2004, 2005, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
diagcollection
--collect
[--crs] For collecting crs diag information
[--oh] For collecting oracle home diag information
[--ob] For collecting oracle base diag information
[--all] Default.For collecting all diag information
NOTE:
1. You can also do the following
./diagcollection.pl --collect --crs --oh
2. ORA_CRS_HOME,ORACLE_HOME and ORACLE_BASE env variables
need to be set.
--clean cleans up the diagnosability
information gathered by this script
--coreanalyze extracts information from core files
and stores it in a text file
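The usage text above names three environment variables that must be set and a small set of collection flags. A pre-flight check can be sketched in Python as follows; the helper name `diagcollection_command` and the example paths are hypothetical, and the sketch only assembles the command line from the flags shown in the usage output without running anything.

```python
import posixpath

REQUIRED_ENV = ("ORA_CRS_HOME", "ORACLE_HOME", "ORACLE_BASE")
VALID_PARTS = ("crs", "oh", "ob", "all")

def diagcollection_command(env, parts=("all",)):
    """Build the diagcollection.pl argument list after checking prerequisites.

    `env` is a mapping such as os.environ. `parts` selects which of the
    --crs, --oh, --ob or --all flags to pass after --collect.
    """
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        raise ValueError("environment variables not set: " + ", ".join(missing))
    unknown = [part for part in parts if part not in VALID_PARTS]
    if unknown:
        raise ValueError("unknown collection option: " + ", ".join(unknown))
    script = posixpath.join(env["ORA_CRS_HOME"], "bin", "diagcollection.pl")
    return [script, "--collect"] + [f"--{part}" for part in parts]

# Hypothetical installation paths, for illustration only.
env = {"ORA_CRS_HOME": "/u01/crs", "ORACLE_HOME": "/u01/db", "ORACLE_BASE": "/u01"}
print(" ".join(diagcollection_command(env, ("crs", "oh"))))
```

Failing early on a missing variable avoids an incomplete collection that has to be repeated later.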
Testcases
Not always feasible
If provided, can greatly influence resolution time
When providing a testcase:
– Include a readme file
– Try to strip the testcase down to the minimal elements that
are needed to reproduce the problem
If at all possible, always try to build a testcase
Testcases are your friends!
Rediscovery
Expensive for a support organization
Issue rediscovery is not always obvious
Use Metalink to identify possible causes for
issues as well as workarounds and patch
availability
Communicate new issues between DBAs
Engaging Oracle Support
Try to be responsive to all TARs when they
are set to CUS status. Delays inherently
cause two problems:
1. The issue loses momentum
2. A new engineer may have to take over the issue
Examples
10.2.0.2 HP-UX/Itanium ServiceGuard, CRS,
CFS and RAC
Delays in reconfiguration
Examples
10.2.0.2 Linux CRS, RAC and ASM
ORA-600[2103] and one instance crashed
Questions?