
Oracle Clusterware Troubleshooting

This document covers performance concepts, testcases, and hang and slowdown troubleshooting for Oracle Clusterware (CRS, Cluster Ready Services).

Shared by: Arun Mahendran
Posted: 8/17/2009
CRS & RAC

Troubleshooting





Krishnadev Telikicherla

Cluster & Parallel Storage Technology

Oracle Corporation








Topics:



Defining the Issue

Creating a Timeline

Hang or Slowdown

Performance Issues

Gathering Data

Testcases

Rediscovery

Engaging Oracle Support

Examples






Defining the Issue

Layers

What layers are involved in the issue:



• Oracle Clusterware

• CRS daemon

• CSS daemon

• hangcheck-timer (Linux) / oprocd (other platforms)

• EVM

• OCR

• Voting

• General RDBMS

• Operating System

• Hardware






Defining the Issue

Cause vs. Effects

Causes:

– Resource issues

– Oracle issues

– OS issues

Effects:

– Hangs/Spins

– Instance Crashes and Evictions

– Node Reboots and Evictions

– Oracle Errors (ORA-600, ORA-7445, ORA-29740)








Defining the Issue

Description

When describing the problem while creating the SR via Metalink, use phrases that will help identify known issues in either bugs or Metalink content.

In the body of the SR, be as detailed as possible about the environment.

Nobody knows the system better than you.

Talk to the system administrator as well about OS- and network-related issues.








Creating a Timeline



A timeline helps identify the times to concentrate on when reviewing files.

Support can build a timeline from the files once they are provided, but doing so slows resolution.

Timelines should order causes and effects and include all participating nodes.

Include specific times, for example:

– At 3:00am PST we noticed that node2 was hanging.
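
As a rough illustration of such a timeline, the sketch below merges hand-collected observations from several nodes into one time-ordered list. The node names, timestamps, and notes are made-up sample data, and `build_timeline` is a hypothetical helper, not an Oracle tool.

```python
from datetime import datetime

def build_timeline(events):
    """Sort (timestamp, node, note) tuples by time and format one line per event."""
    ordered = sorted(
        events,
        key=lambda e: datetime.strptime(e[0], "%Y-%m-%d %H:%M:%S"),
    )
    return [f"{ts} [{node}] {note}" for ts, node, note in ordered]

# Hypothetical sample observations; all times are assumed to share one time zone.
events = [
    ("2009-08-17 03:05:10", "node1", "node1 logs report node2 evicted"),
    ("2009-08-17 03:00:00", "node2", "node2 noticed hanging"),
    ("2009-08-17 03:07:45", "node2", "node2 rebooted"),
]

timeline = build_timeline(events)
for line in timeline:
    print(line)
```

Normalizing every node's observations to one time zone before sorting keeps cause and effect in the right order when node clocks are configured differently.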










Hang or slowdown



Differentiate between a database hang and a database slowdown

Identify the extent of a hang










Is it a Hang or a Slowdown?



Check:

System states, to see if there is any change over a short period of time

V$SESSION_WAIT where wait_time = 0 (sessions that are currently waiting)

Overall machine load, including CPU, memory, swap, and I/O
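
One way to look for "any change over a short period of time" is to compare two snapshots of V$SESSION_WAIT taken a minute or so apart. The sketch below is only a heuristic under stated assumptions: each snapshot is a dict of SID to (EVENT, P1, P2), the rows are invented sample data, the idle-event list is a short assumed list, and `unchanged_waiters` is a hypothetical helper. A session that shows the same event on the same parameters in both snapshots is a candidate for a closer look, not proof of a hang.

```python
# Idle events to ignore; this short list is an assumption, extend it as needed.
IDLE_EVENTS = {"SQL*Net message from client", "rdbms ipc message"}

def unchanged_waiters(first, second):
    """Return SIDs waiting on the same event and parameters in both snapshots.

    Each snapshot maps SID -> (event, p1, p2), as could be selected from
    V$SESSION_WAIT where wait_time = 0.
    """
    suspects = []
    for sid, wait in first.items():
        event = wait[0]
        if event in IDLE_EVENTS:
            continue
        if second.get(sid) == wait:
            suspects.append(sid)
    return sorted(suspects)

# Hypothetical snapshots taken about a minute apart.
snap1 = {
    101: ("enq: TX - row lock contention", 1415053318, 655370),
    102: ("db file sequential read", 4, 120),
    103: ("SQL*Net message from client", 1650815232, 1),
}
snap2 = {
    101: ("enq: TX - row lock contention", 1415053318, 655370),
    102: ("db file sequential read", 7, 9312),
    103: ("SQL*Net message from client", 1650815232, 1),
}

suspects = unchanged_waiters(snap1, snap2)
print(suspects)
```

If every non-idle session is unchanged between snapshots, a hang is more likely; if waits keep moving but slowly, the problem looks more like a slowdown.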










Is it a Hang or a Slowdown?



Single or multiprocess hang:

– Usually characterized by a particular job hanging or not completing

– Essentially the same as in single instance unless it’s internode parallel query

Instance hang: A single instance is unusable.

Multi-instance or full database hang: The entire database is hung or not responding.




Performance



Single process or statement

Instance

Multi-Instance










Single Process or Single Statement

Find the wait event

10046 level 12

- oradebug setorapid <Oracle PID of the target process>

- oradebug event 10046 trace name context forever, level 12

- oradebug tracefile_name

Explain plan

10053 if plan problems are found

V$SESSTAT

truss/trace/dbx/pstack if OS-related problems are suspected
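
Once a 10046 level 12 trace exists, the wait event can be found by totalling the WAIT lines per event. The sketch below assumes the usual `WAIT #n: nam='...' ela= ...` line layout of 10g trace files; the sample lines are invented, and `summarize_waits` is a hypothetical helper rather than a replacement for tkprof.

```python
import re
from collections import defaultdict

# Matches lines such as: WAIT #3: nam='db file sequential read' ela= 1234 ...
WAIT_RE = re.compile(r"^WAIT #\d+: nam='([^']+)' ela=\s*(\d+)")

def summarize_waits(lines):
    """Return {event: (wait_count, total_ela)} for the WAIT lines in a trace."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = WAIT_RE.match(line)
        if match:
            event, ela = match.group(1), int(match.group(2))
            totals[event][0] += 1
            totals[event][1] += ela
    return {event: tuple(values) for event, values in totals.items()}

# Hypothetical excerpt; real trace files contain many other line types.
sample = [
    "WAIT #3: nam='db file sequential read' ela= 1200 file#=4 block#=120 blocks=1 obj#=-1 tim=100",
    "WAIT #3: nam='gc cr request' ela= 5300 file#=4 block#=121 class#=1 obj#=-1 tim=200",
    "WAIT #3: nam='db file sequential read' ela= 800 file#=4 block#=122 blocks=1 obj#=-1 tim=300",
    "FETCH #3:c=0,e=7400,p=2,cr=3,cu=0,mis=0,r=1,dep=0,og=1,tim=400",
]

summary = summarize_waits(sample)
# Print events with the largest total elapsed time first.
for event, (count, ela) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
    print(f"{event}: {count} waits, ela total {ela}")
```

To use it on a real trace, pass the open file returned for the path that `oradebug tracefile_name` reports.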


Instance Slowdown



Statspack / AWR

OS performance statistics: CPU, memory, and I/O

Characteristics:

– Related to a particular job?

– Certain time of day?

– What’s changed?










Multi-Instance Slowdowns



AWR from each node can be of use:

AWR collects instance specific data

Examine and correlate the reports










Multi-Instance Slowdowns



In cases of extreme slowdowns:

systemstates on all nodes

V$SESSION_WAIT

Alert logs and any trace files

Process states or stack traces, where they can be determined and are applicable










Debugging Techniques



v$session_wait

System states from all nodes

10046 level 12 trace of the hung process

ORADEBUG

Lock layer and DLM tracing

Get any traces:

DLM traces

Background processes, alert logs, and init.ora

User traces




Debugging and Diagnostics



Performance issues or hangs:

Identify the resource being requested.

Identify who holds the resource.
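
The two steps above amount to walking a chain from each waiter to the session that ultimately holds the resource. The sketch below assumes the waiter-to-holder pairs have already been worked out (for example from hanganalyze output or system states); the session numbers are invented and `root_blockers` is a hypothetical helper.

```python
def root_blockers(waits_on):
    """Follow waiter -> holder links and return the session ending each chain.

    waits_on maps a waiting session to the session holding the resource it
    is requesting. A session that is not a key in waits_on is not waiting.
    """
    roots = {}
    for waiter in waits_on:
        seen = {waiter}
        current = waiter
        while current in waits_on:
            current = waits_on[current]
            if current in seen:
                # Cycle: these sessions are waiting on each other (deadlock).
                break
            seen.add(current)
        roots[waiter] = current
    return roots

# Hypothetical chains: 101 waits on 205, 205 and 412 wait on 310; 310 is not waiting.
waits_on = {101: 205, 205: 310, 412: 310}
roots = root_blockers(waits_on)
print(roots)
```

Here every waiter resolves to session 310, which is the one whose process state and stack are worth collecting first.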










ORADEBUG and Tools



Hang analyze:

– hanganalyze

Note: 301137.1 – OS Watcher User Guide

Note: 135714.1 – Script to Collect RAC Diagnostic Information (racdiag.sql)










Gathering Data

Best Practices

Single most important step

There is never too much data, but including lots of useless data increases both the download time and the time needed to process the data.

Always err on the side of gathering too much data, but be aware of the impact on resolution time.

Too little data increases resolution time more than too much data.

Always include a readme.txt file that explains the contents of the provided files.






Gathering Data

Processes

Always get stacks from processes that seem to be spinning, hanging, or unresponsive:

– oradebug

– gdb

– pstack

ps and top output can be very useful when trying to determine whether a process exhibits issues such as memory leaks, spinning, or hanging.
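
As a minimal sketch of how repeated ps or top samples can hint at a leak or a spin, the functions below look at a series of resident-memory and CPU readings for one process. The thresholds and sample values are assumptions chosen for illustration, and both helpers are hypothetical; they flag candidates for stack collection and do not diagnose anything on their own.

```python
def looks_like_leak(rss_samples_kb, min_growth_kb=1024):
    """True if resident memory never shrinks and grows by at least min_growth_kb."""
    if len(rss_samples_kb) < 2:
        return False
    pairs = zip(rss_samples_kb, rss_samples_kb[1:])
    never_shrinks = all(later >= earlier for earlier, later in pairs)
    growth = rss_samples_kb[-1] - rss_samples_kb[0]
    return never_shrinks and growth >= min_growth_kb

def looks_like_spin(cpu_samples_pct, threshold_pct=95.0):
    """True if every CPU sample is at or above threshold_pct."""
    return bool(cpu_samples_pct) and all(c >= threshold_pct for c in cpu_samples_pct)

# Hypothetical readings taken at fixed intervals for a single process.
rss_kb = [204800, 209920, 215040, 221184]
cpu_pct = [99.1, 98.7, 99.9, 99.4]

print(looks_like_leak(rss_kb), looks_like_spin(cpu_pct))
```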




Gathering Data

RAC

For instance evictions, please review Metalink note 219361.1

See Metalink note 203226.1: RAC Survival Kit: Real Application Clusters Troubleshooting and Information

See Metalink note 289690.1: Data Gathering for Troubleshooting RAC and CRS issues










Gathering Data

Tools

RDA – system and Oracle configuration information

racdiag – modifiable SQL script for gathering RAC data. See Metalink note 135714.1, “Script to Collect RAC Diagnostic Information”

OSW – OS Watcher gathers top, slabinfo, netstat, and ps data over programmable intervals. See Metalink note 301137.1, “OS Watcher User Guide”










Gathering Data

CRS 10.2.0.x (continued)

CRS and other resource issues:

– ORA_CRS_HOME

log//cssd/oclsmon

log//cssd

log//client

log//crsd

log//evmd

log//racg

– ORACLE_HOME (rdbms)

racg/dump

ORACLE_BASE//hdump








Gathering Data

Tools (continued)

Starting with 10.2.0.1, $ORA_CRS_HOME/bin/diagcollection.pl collects all RAC-relevant files (run as root)

oracle10@stnsp010>./diagcollection.pl
Production Copyright 2004, 2005, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
diagcollection
    --collect
        [--crs] For collecting crs diag information
        [--oh]  For collecting oracle home diag information
        [--ob]  For collecting oracle base diag information
        [--all] Default.For collecting all diag information
        NOTE:
        1. You can also do the following
           ./diagcollection.pl --collect --crs --oh
        2. ORA_CRS_HOME,ORACLE_HOME and ORACLE_BASE env variables
           need to be set.
    --clean        cleans up the diagnosability
                   information gathered by this script
    --coreanalyze  extracts information from core files
                   and stores it in a text file
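
The usage text above implies two checks before running the tool: the three environment variables must be set, and only the listed collection options are valid. The sketch below builds the command line accordingly without executing it. `diagcollection_argv` is a hypothetical wrapper, the directory paths are invented, and only the flags shown in the usage output are used.

```python
import posixpath

REQUIRED_ENV = ("ORA_CRS_HOME", "ORACLE_HOME", "ORACLE_BASE")
COLLECT_SCOPES = ("crs", "oh", "ob", "all")  # options listed under --collect

def diagcollection_argv(env, scopes=("all",)):
    """Return the argv list for a diagcollection.pl --collect run.

    Raises ValueError if a required variable is unset or a scope is unknown.
    """
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        raise ValueError("unset environment variables: " + ", ".join(missing))
    unknown = [scope for scope in scopes if scope not in COLLECT_SCOPES]
    if unknown:
        raise ValueError("unknown collection scopes: " + ", ".join(unknown))
    script = posixpath.join(env["ORA_CRS_HOME"], "bin", "diagcollection.pl")
    return [script, "--collect"] + ["--" + scope for scope in scopes]

# Hypothetical environment; substitute the real homes for your installation.
sample_env = {
    "ORA_CRS_HOME": "/u01/crs",
    "ORACLE_HOME": "/u01/db",
    "ORACLE_BASE": "/u01",
}

argv = diagcollection_argv(sample_env, ("crs", "oh"))
print(" ".join(argv))
```

In practice the env argument would be `os.environ`, and the resulting list would be run as root, as the slide notes.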










Testcases



Not always feasible

If provided, can greatly influence resolution time

When providing a testcase:

– Include a readme file

– Try to strip the testcase down to the minimal elements needed to reproduce the problem

If at all possible, always try to build a testcase

Testcases are your friends!










Rediscovery



Expensive for a support organization

Issue rediscovery is not always obvious

Use Metalink to identify possible causes for issues as well as workarounds and patch availability

Communicate new issues between DBAs










Engaging Oracle Support



Try to respond promptly to all TARs when they are set to CUS status. Delays inherently cause two problems:

1. The issue loses momentum

2. A new engineer may have to take over the issue










Examples



10.2.0.2 HP-UX/Itanium ServiceGuard, CRS, CFS and RAC

Delays in reconfiguration










Examples



10.2.0.2 Linux CRS, RAC and ASM

ORA-600[2103] and one instance crashed










Questions?








