Oracle Clusterware Troubleshooting

					         CRS & RAC
       Troubleshooting


       Krishnadev Telikicherla
Cluster & Parallel Storage Technology
         Oracle Corporation



                Topics:

Defining the Issue
Creating a Timeline
Hang or Slowdown
Performance Issues
Gathering Data
Testcases
Rediscovery
Engaging Oracle Support
Examples


        Defining the Issue
              Layers
What layers are involved in the issue:

•   Oracle Clusterware
     • CRS daemon
     • CSS daemon
     • HangCheckTimer (Linux) / Oprocd (platforms other than Linux)
     • EVM
     • OCR
     • Voting disk
•   General RDBMS
•   Operating System
•   Hardware


        Defining the Issue
        Cause vs. Effects
Causes:
 –   Resource issues
 –   Oracle issues
 –   OS issues
Effects:
 –   Hangs/Spins
 –   Instance Crashes and Evictions
 –   Node Reboots and Evictions
 –   Oracle Errors (ORA-600, ORA-7445, ORA-29740)



        Defining the Issue
           Description
When describing the problem while creating the SR
via Metalink, use phrases that will help match the
issue against known bugs or Metalink content.
In the body of the SR, be as detailed as possible
about the environment.
Nobody knows the system better than you do.
Talk to the sys-admin as well about OS- and
network-related issues.



        Creating a Timeline

A timeline helps identify the times to concentrate on
when reviewing files.
Support can build a timeline from the files once they
are provided, but doing so slows resolution down.
Timelines should order causes and effects and
include all participating nodes.
Include specific times, e.g.:
  –   At 3:00am PST we noticed that node2 was hanging.
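
One way to assemble such a timeline is to merge the observations from every node into a single time-ordered list. The sketch below is only an illustration: the function name, the tuple layout and the sample events are hypothetical, and it assumes all timestamps have already been converted to the same time zone.

```python
from datetime import datetime

def build_timeline(events):
    """Merge (timestamp, node, description) tuples from all nodes into one ordered list."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), node, text)
        for ts, node, text in events
    ]
    # Sort by time first, then node name, so simultaneous events keep a stable order.
    parsed.sort(key=lambda e: (e[0], e[1]))
    return [f"{ts:%Y-%m-%d %H:%M:%S} [{node}] {text}" for ts, node, text in parsed]

# Hypothetical events, for illustration only.
events = [
    ("2000-01-01 03:00:00", "node2", "node2 observed hanging"),
    ("2000-01-01 02:58:30", "node1", "network errors reported by sys-admin"),
    ("2000-01-01 03:04:10", "node1", "node2 evicted from the cluster"),
]
for line in build_timeline(events):
    print(line)
```

Putting the node name on every line keeps cause and effect readable when events from several nodes interleave.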




      Hang or slowdown

Differentiate between a database hang and a
database slowdown
Identify the extent of a hang




Is it a Hang or a Slowdown?

  Check:
  System states to see if there is any change
  over a short period of time
  V$SESSION_WAIT where wait_time=0
  Overall machine load, including cpu,
  memory, swap, I/O
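
The "any change over a short period" check can be sketched as a comparison of two V$SESSION_WAIT snapshots taken a short time apart. This is a rough heuristic, not an Oracle-provided method. It assumes each snapshot was reduced beforehand to sessions with wait_time=0, that idle waits were filtered out, and that each session is represented by its EVENT and SEQ# values. The function name and return strings are made up for the example.

```python
def classify(snapshot_a, snapshot_b):
    """Compare two snapshots mapping sid -> (event, seq) for sessions with wait_time = 0.

    A session whose event and SEQ# are identical in both snapshots has not
    completed a wait in between, so it is treated as stuck.
    """
    common = snapshot_a.keys() & snapshot_b.keys()
    if not common:
        return "inconclusive"
    stuck = [sid for sid in common if snapshot_a[sid] == snapshot_b[sid]]
    if len(stuck) == len(common):
        return "possible hang"
    return "possible slowdown"

# Hypothetical snapshots: session 12 made progress, session 31 did not.
first = {12: ("db file sequential read", 100), 31: ("enq: TX - row lock contention", 7)}
second = {12: ("db file sequential read", 180), 31: ("enq: TX - row lock contention", 7)}
print(classify(first, second))
```

The result is only a hint; system states and machine load still need to be checked before calling something a hang.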




Is it a Hang or a Slowdown?

  Single or multiprocess hang:
   –   Usually characterized by a particular job
       hanging or not completing
   –   Essentially the same as in single instance
       unless it’s internode parallel query.
  Instance hang: A single instance is
  unusable.
  Multi-instance or full database hang: Entire
  database is hung or not responding

Performance

 Single process or statement
 Instance
 Multi-Instance




Single Process or Single
Statement
  Find the wait event
  10046 level 12
     - oradebug setorapid <orapid>
     - oradebug event 10046 trace name context forever, level 12
     - oradebug tracefile_name

  Explain plan
  10053 if plan problems are found
  V$SESSTAT
  Truss/trace/dbx/pstack if OS-related
  problems are suspected
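
The 10046 steps listed above can be generated for a given process with a small helper. This is a sketch under two assumptions: the Oracle PID is taken from V$PROCESS.PID, and the resulting lines are pasted into a SQL*Plus session connected as SYSDBA. The helper names are hypothetical; only the oradebug commands themselves are standard.

```python
def trace_10046_commands(orapid, level=12):
    """Return the SQL*Plus lines that enable a 10046 trace for one Oracle process."""
    return [
        f"oradebug setorapid {orapid}",
        f"oradebug event 10046 trace name context forever, level {level}",
        "oradebug tracefile_name",
    ]

def trace_10046_off_command():
    """Return the line that switches the 10046 trace off again."""
    return "oradebug event 10046 trace name context off"

# Hypothetical Oracle PID 42, for illustration only.
for line in trace_10046_commands(42):
    print(line)
print(trace_10046_off_command())
```

Remember to turn the event off once enough data is captured, since level 12 traces grow quickly.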

Instance Slowdown

  Statspack / AWR
  OS performance statistics - cpu, memory,
  and I/O
  Characteristics:
   –   Related to a particular job?
   –   Certain time of day?
   –   What’s changed?




Multi-Instance Slowdowns

  AWR reports from each node can be of use:
  AWR collects instance-specific data
  Examine and correlate the reports across instances
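
Correlating the reports can be as simple as lining up the top wait events of each instance side by side. The sketch below assumes the wait times were copied by hand from each instance's AWR report into a dictionary. The function name, the data layout and the numbers are hypothetical.

```python
def rank_events(per_instance):
    """per_instance maps instance name -> {wait event: seconds waited}.

    Returns (event, {instance: seconds}) pairs, largest cluster-wide total first.
    """
    totals = {}
    for inst, events in per_instance.items():
        for event, secs in events.items():
            totals.setdefault(event, {})[inst] = secs
    return sorted(totals.items(), key=lambda kv: sum(kv[1].values()), reverse=True)

# Hypothetical figures, for illustration only.
reports = {
    "inst1": {"gc buffer busy": 120.0, "db file sequential read": 30.0},
    "inst2": {"gc buffer busy": 5.0},
}
for event, by_instance in rank_events(reports):
    print(event, by_instance)
```

An event that dominates on one instance but is negligible on the others points to an instance-local problem; one that is high everywhere suggests a shared resource.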




Multi-Instance Slowdowns

  In cases of extreme slowdowns:
  systemstates on all nodes
  V$SESSION_WAIT
  Alert logs and any trace files
  Process states or stack traces, if the relevant
  processes have been determined




Debugging Techniques

  v$session_wait
  System states from all nodes
  10046 level 12 trace of the hung process
  ORADEBUG
  Lock layer and DLM tracing
  Get any traces:
        DLM traces
        Background processes, alert logs, and init.ora
        User traces

Debugging and Diagnostics

 Performance issues or hangs:
 Identify the resource being requested.
 Identify who holds the resource.
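
Once each waiter has been matched to the holder of the resource it wants, following the chain leads to the session at the root of the problem. The sketch below assumes that mapping has already been extracted, for example from system states or lock views. The function name and data layout are hypothetical. In RAC the keys could just as well be (instance, sid) pairs.

```python
def find_root_blocker(waits_on, sid):
    """Follow waiter -> holder links starting at `sid`.

    Returns (root, chain). `root` is the session that holds a resource without
    waiting on anyone else, or None if the chain loops back on itself.
    """
    seen = [sid]
    while sid in waits_on:
        sid = waits_on[sid]
        if sid in seen:
            return None, seen  # cycle: a deadlock rather than a single root holder
        seen.append(sid)
    return sid, seen

# Hypothetical chain: session 10 waits on 20, which waits on 30.
print(find_root_blocker({10: 20, 20: 30}, 10))
```

The root session is usually the first one whose process state or stack is worth collecting.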




ORADEBUG and Tools

 Hang analyze:
  –   oradebug hanganalyze <level>
 Note: 301137.1 – OS Watcher User Guide
 Note: 135714.1 – Script to Collect RAC
 Diagnostic Information (racdiag.sql)




           Gathering Data
           Best Practices
Single most important step
There is never too much data, but large amounts of
irrelevant data increase both the download time and
the time needed to process the data.
Always err on the side of gathering too much data, but
be aware of the impact on resolution time.
Too little data increases resolution time more than too
much data.
Always include a readme.txt file that explains the
contents of the provided files.
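
A readme.txt is easier to keep complete if it is generated from the upload directory itself. The following is a minimal sketch: the function name, the tab-separated layout and the placeholder text for undescribed files are all assumptions made for the example.

```python
import os

def write_readme(directory, descriptions):
    """Write readme.txt listing every file in `directory` with its size and a description."""
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name == "readme.txt" or not os.path.isfile(path):
            continue
        # Flag files nobody described so they are explained before the upload.
        note = descriptions.get(name, "NO DESCRIPTION PROVIDED")
        lines.append(f"{name}\t{os.path.getsize(path)} bytes\t{note}")
    readme = os.path.join(directory, "readme.txt")
    with open(readme, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return readme
```

Calling `write_readme("/tmp/sr_upload", {"alert_node1.log": "alert log from node1"})` on a hypothetical upload directory would list that file with its size and description, and mark every other file as undescribed.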


          Gathering Data
            Processes
Always get stacks from processes that seem
to be spinning, hanging or unresponsive:
 –   oradebug
 –   gdb
 –   pstack
ps and top output can be very useful when
trying to determine whether a process exhibits
issues such as memory leaks, spinning or
hanging
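
Repeated ps or top samples of one process can be checked for these patterns with very little code. The sketch below uses two simple heuristics that are assumptions of this example, not Oracle guidance: resident memory that never shrinks and grows overall hints at a leak, and CPU time that advances by nearly the whole sampling interval every time hints at a spin. The thresholds are arbitrary.

```python
def looks_like_leak(rss_samples_kb, min_samples=4):
    """True if resident memory never shrinks and grows overall across the samples."""
    if len(rss_samples_kb) < min_samples:
        return False
    never_shrinks = all(b >= a for a, b in zip(rss_samples_kb, rss_samples_kb[1:]))
    return never_shrinks and rss_samples_kb[-1] > rss_samples_kb[0]

def looks_like_spin(cpu_seconds, interval_seconds, threshold=0.9):
    """True if the process used nearly a full CPU in every sampling interval."""
    deltas = [b - a for a, b in zip(cpu_seconds, cpu_seconds[1:])]
    return bool(deltas) and all(d >= threshold * interval_seconds for d in deltas)

# Hypothetical samples taken every 10 seconds.
print(looks_like_leak([1000, 1200, 1500, 1800]))
print(looks_like_spin([0, 10, 20, 30], interval_seconds=10))
```

A process flagged by either check is a good candidate for the stack collection described above.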

         Gathering Data
             RAC
For instance evictions please review Metalink
note 219361.1
See Metalink note 203226.1 : RAC Survival
Kit: Real Application Clusters Troubleshooting
and Information
See Metalink note 289690.1 : Data Gathering
for Troubleshooting RAC and CRS issues




            Gathering Data
                Tools
RDA – system and Oracle configuration information
racdiag – modifiable SQL script for gathering RAC data. See
Metalink note 135714.1 “Script to Collect RAC Diagnostic
Information”
OSW – OS Watcher gathers top, slabinfo, netstat and ps data
over programmable intervals. See Metalink note 301137.1
“OS Watcher User Guide”




     Gathering Data
 CRS 10.2.0.x (continued)
CRS and other resource issues:
 –   ORA_CRS_HOME
       log/<hostname>/cssd/oclsmon
       log/<hostname>/cssd
       log/<hostname>/client
       log/<hostname>/crsd
       log/<hostname>/evmd
       log/<hostname>/racg
 –   ORACLE_HOME (rdbms)
       racg/dump
       ORACLE_BASE/<db_name>/hdump
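
The Clusterware log locations listed above can be expanded into full paths for a given node, which makes it easy to confirm nothing is missing before packaging files. This sketch covers only the directories under ORA_CRS_HOME. The function names and the sample home and hostname are hypothetical.

```python
import os
import posixpath

# Sub-directories of $ORA_CRS_HOME/log/<hostname>, as listed above.
CRS_LOG_SUBDIRS = ["cssd/oclsmon", "cssd", "client", "crsd", "evmd", "racg"]

def crs_log_dirs(crs_home, hostname):
    """Return the full path of each CRS log directory for one node."""
    return [posixpath.join(crs_home, "log", hostname, sub) for sub in CRS_LOG_SUBDIRS]

def missing_dirs(paths):
    """Return the paths that do not exist as directories on this machine."""
    return [p for p in paths if not os.path.isdir(p)]

# Hypothetical CRS home and hostname, for illustration only.
for path in crs_log_dirs("/u01/crs", "node1"):
    print(path)
```

Running `missing_dirs` on each node before collecting logs shows quickly whether ORA_CRS_HOME or the hostname was mistyped.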



              Gathering Data
             Tools (continued)
Starting with 10.2.0.1, $ORA_CRS_HOME/bin/diagcollection.pl collects all
RAC-relevant files (run as root)
     oracle10@stnsp010>./diagcollection.pl
     Production Copyright 2004, 2005, Oracle. All rights reserved
     Cluster Ready Services (CRS) diagnostic collection tool
     diagcollection
        --collect
               [--crs] For collecting crs diag information
               [--oh] For collecting oracle home diag information
               [--ob] For collecting oracle base diag information
               [--all] Default.For collecting all diag information
               NOTE:
               1. You can also do the following
                  ./diagcollection.pl --collect --crs --oh
               2. ORA_CRS_HOME,ORACLE_HOME and ORACLE_BASE env variables
                                                  need to be set.
         --clean        cleans up the diagnosability
                      information gathered by this script
         --coreanalyze extracts information from core files
                     and stores it in a text file
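
The usage text above translates into a small wrapper that checks the three environment variables and assembles the command line. This is a sketch only: the function name is hypothetical, and it returns the argument list without executing anything. The options are those shown in the usage output.

```python
import os

REQUIRED_ENV = ("ORA_CRS_HOME", "ORACLE_HOME", "ORACLE_BASE")
SCOPES = ("crs", "oh", "ob", "all")

def diagcollection_argv(scopes=("all",), env=None):
    """Build the diagcollection.pl --collect command line as a list of arguments."""
    env = os.environ if env is None else env
    # The usage text states that all three variables need to be set.
    missing = [v for v in REQUIRED_ENV if not env.get(v)]
    if missing:
        raise RuntimeError("unset environment variables: " + ", ".join(missing))
    unknown = [s for s in scopes if s not in SCOPES]
    if unknown:
        raise ValueError("unknown scope(s): " + ", ".join(unknown))
    script = os.path.join(env["ORA_CRS_HOME"], "bin", "diagcollection.pl")
    return [script, "--collect"] + ["--" + s for s in scopes]

# Hypothetical environment, for illustration only.
env = {"ORA_CRS_HOME": "/u01/crs", "ORACLE_HOME": "/u01/db", "ORACLE_BASE": "/u01"}
print(diagcollection_argv(("crs", "oh"), env))
```

The returned list can be handed to `subprocess.run` from a root session on each node.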




                  Testcases

Not always feasible
If provided, can greatly influence resolution time
When providing a testcase:
  –   Include a readme file
  –   Try to strip the testcase down to the minimal elements that
      are needed to reproduce the problem
If at all possible, always try to build a testcase
Testcases are your friends!




           Rediscovery

Expensive for a support organization
Issue rediscovery is not always obvious
Use Metalink to identify possible causes for
issues as well as workarounds and patch
availability
Communicate new issues between DBAs




Engaging Oracle Support

Try to be responsive to all TARs when they
are set to CUS status. Delays inherently
cause two problems:
1.   The issue loses momentum
2.   A new engineer may have to take over the issue




           Examples

10.2.0.2 HP-UX/Itanium ServiceGuard, CRS,
CFS and RAC
Delays in reconfiguration




          Examples

10.2.0.2 Linux CRS, RAC and ASM
ORA-600[2103] and one instance crashed




Questions?



