1
Monitoring Your RAC 10g Cluster Environment V2.0
Gary McGalliard
RAC Pack - Technical Manager
2
Subjects to Discuss
• Why we Monitor
• What to Monitor
• How to Monitor
• Questions
3
Why Monitor?
4
Common Oracle DBA Tasks
• Installing Oracle software
• Creating Oracle databases
• Performing upgrades
• Starting up and shutting down the database
• Managing the database’s storage structures
• Managing users and security
• Managing schema objects
• Making database backups and performing recovery when necessary
• Proactively monitoring the database’s health and taking preventive or corrective action as required
• Monitoring and tuning performance
5
Track Application Usage
• What are the busy periods?
• Is the workload as expected?
• Has the disk usage gone up?
• Is the avg transaction length increasing?
• Are we using more CPU?
• Are there more users than last month, last quarter, last year?
• Are we meeting user expectations?
  – Service Level Agreement/Objectives (SLA/SLO)
  – This is the ultimate measure of IT success
• Same questions as single instance monitoring
6
Evaluate Changes
• Did the last change …
  – Help lower CPU usage?
  – Increase the read rate?
  – Reduce the write rate?
  – Change the average transaction profile?
  – Improve the user’s perception of response time?
• Same questions as single instance
7
Capacity Planning
• When should another machine be ordered?
• How long will the current storage unit last?
• Network performance still within limits?
• Can the systems handle the next change?
• Are additional resources needed before increasing application users by X%?
• Same questions as single instance
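The storage question above can be approximated with a simple linear projection. The sketch below is only an illustration under stated assumptions: growth is taken to be linear, the function names are hypothetical, and the usage figures are made up.

```python
def daily_growth(samples):
    """Average daily growth from (day_number, used_gb) samples, first to last."""
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    return (last_used - first_used) / (last_day - first_day)


def days_until_full(capacity_gb, used_gb, growth_gb_per_day):
    """Linear projection of the days remaining before the storage unit is full."""
    if growth_gb_per_day <= 0:
        return None  # no growth observed, so a linear model predicts no fill date
    free_gb = max(capacity_gb - used_gb, 0)
    return free_gb / growth_gb_per_day


# Illustrative figures only: three readings taken 30 days apart.
usage = [(0, 400.0), (30, 430.0), (60, 460.0)]
rate = daily_growth(usage)                    # 1.0 GB per day
print(days_until_full(1000.0, 460.0, rate))   # 540.0
```

A real projection would normally use more samples and allow for seasonal peaks; the point is only that the answer comes from recorded history, not from a guess.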
8
Prevent Unplanned Outages
• Use effective management practices
• Check logs for error messages
• Review application testing reports
• Adhere to capacity planning standards
• Unplanned downtime drains business bottom lines
• Same methods as single instance
• Service Level Agreements/Objectives (SLA/SLO) define outage types
9
Service Level Agreements
Clearly define SLOs
• Sufficiently granular
  – Cannot architect, design, OR manage a system without clearly understanding the SLOs
  – 24x7 is NOT an SLO
• Define HA/recovery time objectives, throughput, response time, data loss, etc.
  – Need to be established with an understanding of the cost of downtime for the system
  – RTO and RPO are key availability metrics
  – Response time and throughput are key performance metrics
• Must address different failure conditions
  – Planned vs unplanned
  – Localized vs site-wide
• Must be linked to the business requirements
  – Response time and resolution time
• Must be realistic
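One way to see why "24x7" is not an SLO is to turn an availability target into an explicit downtime budget. The following is a minimal sketch; the function name and the chosen targets are assumptions for illustration, and a real SLO would also separate planned from unplanned outages.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100.0)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
```

Each extra "nine" cuts the yearly budget by a factor of ten, which is why the target has to be weighed against the cost of downtime for the system.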
10
Why Monitor? - Summary
• Part of DBA’s Common Task List
• Track Application Usage/trends
• Evaluate Changes (relative to SLA/SLOs)
• Capacity Planning
• Prevent Unplanned Outages
• Meeting Service Level Agreements/Objectives
11
What to Monitor?
EVERYTHING
• Same resources as single instance
• For RAC:
  – Each instance carrying planned load (balanced?).
  – Shared storage access is equal.
  – Interconnect
      Load
      Latency
  – High CPU usage - Oracle processes getting enough resources.
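The "balanced?" check can be sketched as a comparison of each instance's share of the total load against the planned share. This example assumes the plan is an even split across instances; the instance names, the session counts, and the 10-point tolerance are all hypothetical.

```python
def unbalanced_instances(load_by_instance, tolerance=0.10):
    """Return instances whose share of total load differs from an even
    split by more than `tolerance` (expressed as a fraction of the total)."""
    total = sum(load_by_instance.values())
    if total == 0:
        return []  # nothing is running, so there is nothing to balance
    even_share = 1.0 / len(load_by_instance)
    return sorted(
        name
        for name, value in load_by_instance.items()
        if abs(value / total - even_share) > tolerance
    )


# Made-up session counts for a three-node cluster.
sessions = {"rac1": 120, "rac2": 115, "rac3": 35}
print(unbalanced_instances(sessions))  # ['rac1', 'rac3']
```

If services are deliberately pinned to particular instances, the even-split assumption does not hold and the planned shares should be compared instead.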
12
What to Monitor?
Performance Statistics, Logs, Errors at ALL Levels
Application Level
Database Level
OS Level
13
OS Level Statistics
Each cluster member, check usage
  – CPU – blocked queue length, %idle
  – IO – queue length, response times
      Storage
      Network - Public - Private Interconnect (RAC)
  – Memory – paging, swapping, scan rates
  – Log: /var/log/messages – error messages
14
What to Monitor? CRS 10.2.0.x
ORA_CRS_HOME
  – CRS alert log - log//alert.log
  – CRS logs - log//crsd/
  – CSS logs - log//cssd/
  – EVM logs – log//evmd & evm/log/
  – SRVM logs - log//client
  – OPMN logs - opmn/logs
  – Resource specific logs – log//racg
  – Cluster Network Communication logs - log
ORACLE_HOME (rdbms)
  – Resource specific logs – log//racg
  – SRVM logs - log//client
Note 331168.1 - Oracle Clusterware consolidated logging in 10gR2
15
What to Monitor? ASM
• alert_.log
  – Default: ORACLE_HOME/rdbms/log
• Trace Files
  – Default: ORACLE_HOME/rdbms/log
  – bdump - background_dump_dest
  – cdump - core_dump_dest
  – udump - user_dump_dest
16
What to Monitor? RDBMS
• alert_.log
  – Default: ORACLE_HOME/rdbms/log
• Trace Files
  – Default: ORACLE_HOME/rdbms/log
  – bdump - background_dump_dest
  – cdump - core_dump_dest
  – udump – user_dump_dest
• AWR / Statspack (each node for RAC)
  – retain for one full business cycle
• listener_.log
  – Default: ORACLE_HOME/network/log
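Checking the alert log usually starts with counting the Oracle error codes it contains. The sketch below assumes errors appear as ORA- followed by five digits; the sample lines are synthetic placeholders, not real alert log output, and the function name is hypothetical.

```python
import re
from collections import Counter

# Oracle error codes are written as ORA- followed by a five-digit number.
ORA_ERROR = re.compile(r"\bORA-\d{5}\b")


def count_ora_errors(lines):
    """Count each ORA-nnnnn code found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(ORA_ERROR.findall(line))
    return counts


# Synthetic placeholder lines, used only to exercise the function.
sample = [
    "placeholder line mentioning ORA-01555",
    "placeholder line with no error code",
    "placeholder line mentioning ORA-00600 and again ORA-00600",
]
for code, hits in sorted(count_ora_errors(sample).items()):
    print(code, hits)
```

Against a real file the same function can be fed an open file handle, one instance at a time, so that counts can be compared across the RAC nodes.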
17
What to Monitor? Application
• Must be designed and coded into the application.
• Mid-tier server OS level monitoring can use the same methods as the database server.
• Remember, monitoring is about identifying deviations from “normal” processing expectations.
  – Establish baselines at all levels
• The deviations are then investigated as possible problems.
18
How to Monitor?
19
What Are Baselines?
• Baselines are time-lagged calculations (usually averages of one sort or another).
• They provide a basis for comparing past performance to current performance.
  – Compare past Mondays to this Monday, past weeks to this week, etc.
• May also be forward-looking.
  – Determining whether the trends show you're likely to meet an established goal.
• Be aware of how your systems perform.
• Record baseline information and review on a regular schedule.
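A baseline comparison of the "past Mondays to this Monday" kind can be sketched in a few lines. This is an illustration under assumptions: the baseline is a plain average, a deviation is anything more than two standard deviations away, and the readings are made up.

```python
from statistics import mean, pstdev


def baseline(history):
    """Return (average, population standard deviation) of past readings."""
    return mean(history), pstdev(history)


def deviates(current, history, n_sigmas=2.0):
    """True when the current reading is more than n_sigmas from the baseline."""
    average, spread = baseline(history)
    return abs(current - average) > n_sigmas * spread


# Made-up readings, e.g. peak transactions/second on the last four Mondays.
past_mondays = [410.0, 395.0, 405.0, 390.0]
print(deviates(402.0, past_mondays))  # False: within the normal range
print(deviates(520.0, past_mondays))  # True: worth investigating
```

The threshold is a judgment call; what matters is that the comparison is made against recorded history for the same kind of period.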
20
OS (Unix) Tools
• top – Top Processes
• ps – Process Status
• iostat - I/O Statistics
• netstat - Network Statistics
• vmstat - Virtual Memory Statistics
• ping - Checks network host connectivity
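Output from these tools is easier to review once it has been parsed into named columns. The sketch below parses vmstat-style text; it assumes the Linux column layout (a banner line, a line of column names, then rows of integers), which differs on other platforms, and the sample figures are made up.

```python
def parse_vmstat(text):
    """Parse vmstat-style text into a list of {column_name: int} rows.

    Assumes the Linux layout: the column-name line contains both "r" and
    "id", and every sample row holds one integer per column.
    """
    lines = [line.split() for line in text.strip().splitlines()]
    header_index = next(
        (i for i, cols in enumerate(lines) if "r" in cols and "id" in cols),
        None,
    )
    if header_index is None:
        return []
    names = lines[header_index]
    rows = []
    for cols in lines[header_index + 1:]:
        # Skip repeated banners/headers and anything that is not all numeric.
        if len(cols) == len(names) and all(c.lstrip("-").isdigit() for c in cols):
            rows.append(dict(zip(names, map(int, cols))))
    return rows


# Made-up sample in the assumed Linux layout.
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812345  10240 204800    0    0     5    12  150  300  3  1 95  1  0
 6  2      0 801234  10240 204900    0    0   900  1500  900 2500 60 25  5 10  0
"""

rows = parse_vmstat(SAMPLE)
busy = [row for row in rows if row["id"] < 10 or row["b"] > 0]
print(len(rows), len(busy))  # 2 1
```

The filter flags samples with little idle CPU or with blocked processes, the two CPU indicators named on the OS Level Statistics slide.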
21
OS Watcher (OSW)
• A collection of UNIX shell scripts intended to collect and archive operating system and network metrics.
• Supports diagnosing:
  – complex RAC issues
  – generic performance issues
• OSW operates as a set of background processes, gathering OS data on a regular basis using Unix utilities.
• OSW can be installed and run standalone.
• Data collection intervals are configurable by the user.
22
OS Watcher (OSW)
• OSW is certified on the following platforms:
  – AIX, Tru64, Solaris, HP-UX, Linux
• OSW invokes distinct OS utilities
  – ps, top, mpstat, iostat, netstat, traceroute, vmstat
• startOSW.sh - start OSW processes
  – arg1 = snapshot interval in seconds.
  – arg2 = number of hours of archive data to store.
• stopOSW.sh - terminate all OSW processes
• Metalink Note 301137.1
23
How to Monitor? – OS Level Summary
There are many tools which collect statistics at the OS level.
  – Pick one/several you like
  – Collect the information
  – Review the results
  – Review the methods used on a regular basis
      Change as needed - e.g. New tools are available
24
Automatic Workload Repository
• Superior to Any Other Data Collection Tool
• Automatic, Self-Managing, More Efficient
• Set-up Out-of-Box
• Pre-Calculated Metrics
  – E.g. transactions/second, logon/second, etc.
• Foundation of Self-Management
• Enables Historical Performance Analysis
  – My user complained about poor performance at 3 AM last night. What was going on then?
  – Who was using the system at any given time in the past and what exactly were they doing?
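Metrics such as transactions/second come from the difference between two snapshots of cumulative counters divided by the elapsed time. The sketch below shows that general idea only; it is not AWR's internal implementation, and the class, the counter names, and the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    taken_at_s: float   # snapshot time in seconds since an arbitrary epoch
    counters: dict      # cumulative counter name -> value at snapshot time


def rates_per_second(earlier, later):
    """Per-second rate of each counter present in both snapshots."""
    elapsed = later.taken_at_s - earlier.taken_at_s
    if elapsed <= 0:
        raise ValueError("snapshots must be given in time order")
    return {
        name: (later.counters[name] - earlier.counters[name]) / elapsed
        for name in later.counters
        if name in earlier.counters
    }


# Made-up counters one hour apart.
snap_2am = Snapshot(0.0, {"commits": 1_000_000, "logons": 5_000})
snap_3am = Snapshot(3600.0, {"commits": 1_180_000, "logons": 5_360})
print(rates_per_second(snap_2am, snap_3am))  # commits: 50.0/s, logons: 0.1/s
```

A negative rate would indicate that the counters were reset between the two snapshots, for example by an instance restart, so that interval should be discarded.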
25
Automatic Workload Repository
Regularly Monitor
• Load Profile
• Top 5 Timed Events
• RAC Statistics
  – Global Cache Load Profile
  – Global Cache Efficiency Percentages
  – Global Cache and Enqueue Services
26
Oracle Enterprise Manager 10g
• Enables management of RAC environments as single system image
• Cluster Database page provides RAC-wide view
  – Aggregated status, performance data across all instances
  – Supports operations on database and services
  – Drill down to pages for specific instances
  – Drill up to cluster page
• Cluster page
  – Shows hardware and operating system configuration, performance, and status across cluster
  – Drill down to pages for specific nodes
27
RAC Administration
• Single system image
• Cluster Database page provides RAC-wide view
  – Aggregated status
  – Performance data across all instances
  – Database Operations
  – Drill down to instances
  – Drill up to cluster
• Cluster page
  – Hardware
  – OS configuration
  – Performance
  – Status
  – Drill down to nodes
28
RAC Monitoring
• CRS Monitoring
• RAC DB and Instance monitoring
• Interconnect monitoring
• Cluster cache diagnostics
User transparency
Cluster awareness
  – Database
  – Hosts (OS) e.g. storage alerts
Database-level alerts
Cluster-aware EM jobs
RAC-specific performance management
Service Assurance Management
29
Cluster Database Performance
30
RAC Interconnect Monitoring
• Monitor private and public interconnects
• Identify interconnects used
• Traffic generated
• Interconnect alerts
31
RAC Cluster Cache Diagnostics
• Monitor inter-instance communication
• Identify performance problems due to object contention
32
Comprehensive System Monitoring
• Integrated Database and OS Monitoring
• Comprehensive Performance Monitoring for All Supported Database Versions
  – Well Defined, Intuitive, Performance Management Workflow
  – Detailed Wait, Session, SQL Drilldowns
  – Historical Performance Data
      Event, Metric History
  – Full Integration with New Oracle10g Data Sources
      AWR, ASH
33
Monitor your system
• Define key metrics and monitor them actively
  – Establish a (performance) baseline
• Learn how to use Oracle-provided tools
  – RDA (+ RACDDT)
  – AWR/ADDM
  – Active Session History
  – OSWatcher
  – Enterprise Manager
• Coordinate monitoring and collection of OS level stats as well as db-level stats
  – Problems observed at one layer are often just symptoms of problems that exist at a different layer
• Don’t jump to conclusions
34
References
• Metalink Note: 301137.1 - “OS Watcher User Guide”
• Metalink Note: 175853.1 - “Remote Diagnostics Agent (RDA)”
• Metalink Note: 250655.1 - “How to use the Automatic Database Diagnostic Monitor”
• Metalink Note: 243132.1 - “10g New Feature Active Session History (Ash) And Analysis Of Ash Online And Offline”
• OTN - Enterprise Manager 10g Grid Control: ScreenWatch Demos (Monitoring)
  – http://www.oracle.com/technology/products/oem/htdocs/demos.html
• Oracle® Database 2 Day DBA
• Oracle® Database PL/SQL Packages and Types Reference
• Oracle® Database Performance Tuning
• “Service Level Agreement in the Data Center” by Edward Wustenhoff, Sun Professional Services
35
QUESTIONS ANSWERS
36
Thank You!