
RAC and ASM Best Practices


The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.

RAC & ASM Best Practices

“You Probably Need More than just RAC”



Kirk McGowan
Technical Director – RAC Pack
Oracle Server Technologies
Cluster and Parallel Storage Development



Agenda

Operational Best Practices (IT MGMT 101)

Background
– Requirements
– Why RAC Implementations Fail
– Case Study
– Criticality of IT Service Management (ITIL) process
– People, Process, AND Technology

Best Practices





Why do people buy RAC?

Low cost scalability
– Cost reduction, consolidation, infrastructure that can grow with the business

High Availability
– Growing expectations for uninterrupted service





Why do RAC Implementations fail?

RAC, scale-out clustering is “new” technology
– Insufficient budget and effort are put towards filling the knowledge gap

HA is difficult to do, and cannot be done with technology alone
– Operational processes and discipline are critical success factors, but are not addressed sufficiently



Case Study

Based on true stories. Any resemblance, in full or in part, to your own experiences is intentional and expected. Names have been changed to protect the innocent.



Case Study

Background
– 8-12 months spent implementing 2 systems: somewhat different architectures, very different workloads, identical tech stacks
– Oracle expertise (Development) engaged to help flatten tech learning curve
– Non-mission critical systems, but important elements of a larger enterprise re-architecture effort

Many technology issues encountered across the stack, and resolved over the 8-12 month implementation
– Hw, OS, storage, network, rdbms, cluster, and application



Case Study

Situation
– New mission critical deployment using same technology stack
– Distinct architecture, applications development teams, and operations teams
– Large staff turnover
– Major escalation, post production

CIO: “Oracle products do not meet our business requirements”
– “RAC is unstable”
– “DG doesn’t handle the workload”
– “JDBC connections don’t failover”



Case Study

Operational Issues

Requirements, aka SLOs, were not defined
– e.g. Claim of 20s failover time; application logic included 80s failover time, cluster failure detection time alone set to 120s.

Inadequate test environments
– Problems encountered first in production, including the fact that SLOs could not be met

Inadequate change control
– Lessons learned in previous deployments were not applied to new deployment: rediscovery of same problems
– Some changes implemented in test, but never rolled into production: recurring problems (outages) in production
– No process for confirming a change actually fixes the problem prior to implementing in production
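The failover mismatch in the first issue above is the kind of thing a simple pre-production check can catch. The following is a minimal sketch, not a tool from the presentation: the component names are hypothetical, and the three timings are the ones quoted in the case study. It flags every configured timeout that, on its own, already exceeds the claimed failover objective.

```python
def find_slo_violations(target_seconds, configured_seconds):
    """Return (name, seconds) pairs whose configured time exceeds the SLO target."""
    return [
        (name, seconds)
        for name, seconds in configured_seconds.items()
        if seconds > target_seconds
    ]


# Timings from the case study; the key names are illustrative assumptions.
claimed_failover_slo = 20
configured = {
    "application_failover_logic": 80,
    "cluster_failure_detection": 120,
}

violations = find_slo_violations(claimed_failover_slo, configured)
for name, seconds in violations:
    print(f"{name}: {seconds}s exceeds the {claimed_failover_slo}s failover target")
```

Any non-empty result means the stated objective cannot be met regardless of how well the rest of the stack behaves, which is a conversation to have before production rather than after.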



Case Study

More Operational Issues

Poor knowledge transfer between internal teams
– Configuration recommendations, patches, fixes identified in previous deployments were not communicated.
– Evictions are a symptom, not the problem.

Inadequate system monitoring
– OS level statistics (CPU, IO, memory) were not being captured.
– Impossible to perform root cause analysis (RCA) on many problems without ability to correlate cluster / database symptoms with system level activity.

Inadequate Support procedures
– Inconsistent data capture
– No on-site vendor support consistent with criticality of system
– No operations manual
  - Managing and responding to outages
  - Responding and restoring service after outages



Overview of Operational Process Requirements

What are “ITIL Guidelines”?

“ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems.”



IT Service Management

IT Service Management = Service Delivery + Service Support

Service Delivery: partially concerned with setting up agreements and monitoring the targets within these agreements.

Service Support: processes can be viewed as delivering services as laid down in these agreements.



Provisioning of IT Service Mgmt

In all organizations, the provisioning of IT Service Management must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively. To meet this objective, three areas need to be considered:
– People with the right skills, appropriate training and the right service culture
– Effective and efficient Service Management processes
– Good IT Infrastructure in terms of tools and technology

Unless People, Processes and Technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.



Service Delivery

Financial Management

Service Level Management
– Severity/priority definitions (e.g. Sev1, Sev2, Sev3, Sev4)
– Response time guidelines
– SLAs

Capacity Management

IT Service Continuity Management

Availability Management



Service Support

Incident Management
– Incident documentation & reporting, incident handling, escalation procedures

Problem Management
– RCAs, QA & process improvement

Configuration Management
– Standard configs, gold images, CEMLIs

Change Management
– Risk assessment, backout, sw maintenance, decommission

Release Management
– New deployments, upgrades, emergency release, component release





BP: Set & Manage Expectations

Why is this important?
– Expectations with RAC are different at the outset
– HA is as much (if not more so) about the processes and procedures as it is about the technology
– No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs

Must communicate what the technology can AND can’t do

Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met.

HA isn’t cheap!



BP: Clearly define SLO’s

Sufficiently granular
– Cannot architect, design, OR manage a system without clearly understanding the SLOs
– 24x7 is NOT an SLO

Define HA/recovery time objectives, throughput, response time, data loss, etc
– Need to be established with an understanding of the cost of downtime for the system
– RTO and RPO are key availability metrics
– Response time and throughput are key performance metrics

Must address different failure conditions
– Planned vs unplanned
– Localized vs site-wide

Must be linked to the business requirements
– Response time and resolution time

Must be realistic



Manage to the SLO’s

Definitions of problem severity levels

Documented targets for both incident response time and resolution time, based on severity

Classification of applications w.r.t. business criticality

Establish SLA with business
– Negotiated response and resolution times
– Definition of metrics
  - E.g. Application Availability shall be measured using the following formula: (Total Minutes In A Calendar Month minus Unscheduled Outage Minutes minus Scheduled Outage Minutes in such month) divided by Total Minutes In A Calendar Month
– Negotiated SLOs
– Effectively documents expectations between IT and business

Incident log: date, time, description, duration, resolution
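The availability formula and the incident log fields above fit together naturally. The sketch below is one illustrative way to combine them, not a prescribed implementation: the `Incident` class, the `scheduled` flag, and the sample entries are all assumptions made for the example, and an outage is attributed entirely to the month in which it started.

```python
import calendar
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    """One incident-log entry: date/time, description, duration, resolution."""
    started: datetime
    description: str
    duration_minutes: int
    resolution: str
    scheduled: bool = False  # assumption: planned outages are logged as well


def application_availability(year, month, incidents):
    """(total - unscheduled - scheduled outage minutes) / total minutes in the month."""
    total_minutes = calendar.monthrange(year, month)[1] * 24 * 60
    in_month = [i for i in incidents
                if (i.started.year, i.started.month) == (year, month)]
    unscheduled = sum(i.duration_minutes for i in in_month if not i.scheduled)
    scheduled = sum(i.duration_minutes for i in in_month if i.scheduled)
    return (total_minutes - unscheduled - scheduled) / total_minutes


# Hypothetical log entries, for illustration only.
log = [
    Incident(datetime(2009, 6, 3, 2, 15), "node eviction", 90, "interconnect fix"),
    Incident(datetime(2009, 6, 20, 23, 0), "patch window", 126, "patched", scheduled=True),
]
availability = application_availability(2009, 6, log)
print(f"June availability: {availability:.3%}")
```

With these sample entries, 216 outage minutes against the 43,200 minutes in a 30-day month works out to 99.5% availability. Note that this formula counts scheduled outages against availability; whether that is the right definition is exactly what should be negotiated in the SLA.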



Example Resolution Time Matrix

SR class                           Resolution time
Severity 1 Priority 1 and 2 SRs    < 1 hour
Severity 1 Priority 3 SRs          < 13 hours
Severity 2 Priority 1 SRs          < 14 hours
Severity 2 SRs                     < 132 hours
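A resolution time matrix like this one is only useful if incidents are actually checked against it. As a minimal sketch, assuming hypothetical short keys for the four SR classes, the targets can be held in a lookup table and compared with the observed time to resolve:

```python
from datetime import timedelta

# Targets from the example matrix; the dictionary keys are illustrative labels.
RESOLUTION_TARGETS = {
    "Sev1/P1-P2": timedelta(hours=1),
    "Sev1/P3": timedelta(hours=13),
    "Sev2/P1": timedelta(hours=14),
    "Sev2": timedelta(hours=132),
}


def met_resolution_target(sr_class, time_to_resolve):
    """True if the SR was resolved strictly inside its target ("< N hours")."""
    return time_to_resolve < RESOLUTION_TARGETS[sr_class]


print(met_resolution_target("Sev1/P3", timedelta(hours=9)))
print(met_resolution_target("Sev1/P1-P2", timedelta(minutes=75)))
```

The first call is inside its 13-hour target; the second misses the 1-hour target by 15 minutes.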



Example Response Time Matrix

Status     Sev1/P1   Sev1/P2   Sev2/P1   Sev2   Sev3/Sev4
New,XFR    15        30        15        30     3 hrs
ASG        15        60        15        60     18 hrs
IRR,2CB    15        30        15        30     4 days
RVW,1CB    15        60        15        60     120 min
PCR,RDV    60        N/A       60        60     3 hrs
WIP        60        60        60        120    2 days
INT        60        2         60        60     4 days
LMS,CUS    4         4         4         120    3 days
DEV        4         4         4         120    10 days



BP: TEST, TEST, TEST

Testing is a shared responsibility

Functional, destructive, and stress testing

Test environments must be representative of production
– Both in terms of configuration, and capacity
– Separate from Production
– Building a test harness to mimic production workload is a necessary, but non-trivial effort

Ideally, problems would never be encountered first in production
– If they are, the first question should be: Why didn’t we catch the problem in test?
  - Exceeding some threshold
  - Unique timing or race condition
– What can we do so we catch this type of problem in the future?
  - Build a test case that can be reused as part of pre-production testing.



BP: Define, document, and adhere to Change Control Processes

This amounts to self discipline

Applies to all changes at all levels of the tech stack
– Hw changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload.
– If no changes are introduced, system will reach a steady state, and function forever.

A well designed system will be able to tolerate some fluctuations, and faults.

A well managed system will meet service levels
– If a problem (that was fixed) is encountered again elsewhere, it is a change management process problem, not a technology problem. I.e. rediscovery should not happen.
– Ensure fixes are applied across all nodes in a cluster, and all environments to which the fix applies.
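Checking that a fix has reached every node and environment can be reduced to a set comparison once the applied fixes are recorded per node. The sketch below assumes such an inventory already exists as a mapping from node name to a set of fix identifiers; the node names and fix IDs are hypothetical, and how the inventory is collected is outside its scope.

```python
def missing_fixes(applied_by_node):
    """Map each node to the sorted fixes that some other node has but it lacks."""
    all_fixes = set().union(*applied_by_node.values())
    return {
        node: sorted(all_fixes - fixes)
        for node, fixes in applied_by_node.items()
        if all_fixes - fixes
    }


# Hypothetical inventory: node name -> fix identifiers applied on that node.
inventory = {
    "prod-node1": {"FIX-101", "FIX-102"},
    "prod-node2": {"FIX-101"},
    "test-node1": {"FIX-101", "FIX-102", "FIX-103"},
}

gaps = missing_fixes(inventory)
for node, fixes in sorted(gaps.items()):
    print(f"{node} is missing: {', '.join(fixes)}")
```

Here the test node carries a fix that never reached production, and the two production nodes disagree with each other, which are the two failure patterns described in the case study. A real check would also need to account for fixes that legitimately apply to only some environments.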



BP: Plan for, and execute Knowledge Xfer

New technology has a learning curve.

10g, RAC, and ASM cross traditional job boundaries so knowledge transfer must be executed across all affected groups
– Architecture, development, and operations
– Network admin, sysadmin, storage admin, DBA
– e.g. evictions are not a problem, they are a symptom
– Learn how to use the various tools and interpret output

Learn how to identify and diagnose problems
– Hanganalyze, system state dumps, truss, etc…
– Understand behaviour: distinction between cause and symptom

Needs to occur pre-production
– Operational Readiness



BP: Monitor your system

Define key metrics and monitor them actively
– Establish a (performance) baseline

Learn how to use Oracle-provided tools
– RDA (+ RACDDT)
– AWR/ADDM
– Active Session History
– OSWatcher

Coordinate monitoring and collection of OS level stats as well as db-level stats
– Problems observed at one layer are often just symptoms of problems that exist at a different layer
– Don’t jump to conclusions
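Correlating a database-level symptom with OS-level activity mostly comes down to lining up timestamps. The sketch below is illustrative only: the sample structure, field names, and values are assumptions and do not reflect the output format of any particular tool. It pulls out the OS samples taken within a time window around a cluster or database event.

```python
from datetime import datetime, timedelta


def samples_near(event_time, os_samples, window=timedelta(minutes=2)):
    """Return the OS samples whose timestamp lies within `window` of the event."""
    return [s for s in os_samples if abs(s["time"] - event_time) <= window]


# Hypothetical OS-level samples captured once a minute.
os_samples = [
    {"time": datetime(2009, 6, 3, 2, 0), "cpu_pct": 35, "free_mem_mb": 4096},
    {"time": datetime(2009, 6, 3, 2, 1), "cpu_pct": 97, "free_mem_mb": 210},
    {"time": datetime(2009, 6, 3, 2, 2), "cpu_pct": 99, "free_mem_mb": 64},
    {"time": datetime(2009, 6, 3, 2, 10), "cpu_pct": 30, "free_mem_mb": 4000},
]

# Hypothetical time at which a node eviction was logged.
eviction_time = datetime(2009, 6, 3, 2, 1, 30)

context = samples_near(eviction_time, os_samples)
for sample in context:
    print(sample["time"], sample["cpu_pct"], sample["free_mem_mb"])
```

In this made-up data the eviction coincides with CPU saturation and memory exhaustion, which points the investigation at the OS layer. None of this is possible if the OS statistics were never captured in the first place.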



BP: Define, Document, and communicate Support procedures

Define corrective procedures for outages
– Routinely test corrective procedures

HA process:
– Prevent, detect, capture, resume, analyze, fix
– Classify high priority systems, and the steps that need to be taken in each phase
– Keep an active log of every outage

If we don’t provide sufficient tools to get to root cause, then shame on us. If you don’t implement the diagnostic capabilities that are provided to help get to root cause, then shame on you.

Serious outages should never happen more than once.
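An active outage log makes it possible to verify that serious outages really do not recur. As a minimal sketch, assuming each log entry records a root cause once analysis is complete (the field names and entries here are hypothetical), repeated root causes can be surfaced with a simple count:

```python
from collections import Counter


def recurring_root_causes(outage_log):
    """Return the root causes that appear in more than one outage, sorted by name."""
    counts = Counter(entry["root_cause"] for entry in outage_log)
    return sorted(cause for cause, n in counts.items() if n > 1)


# Hypothetical outage log entries.
outage_log = [
    {"date": "2009-06-03", "system": "orders", "root_cause": "interconnect misconfiguration"},
    {"date": "2009-06-20", "system": "billing", "root_cause": "missing patch"},
    {"date": "2009-07-11", "system": "orders", "root_cause": "missing patch"},
]

repeats = recurring_root_causes(outage_log)
print(repeats)
```

Any entry in the result is a signal that a fix was not carried through the change management process, not that the technology failed a second time.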



Summary

Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations
– Address these, and you dramatically increase your chances of a successful RAC deployment, and will save yourself a lot of future pain

Additional areas of challenge
– Configuration Management: initial install and config, standardized “gold image” deployment
– Incident Management: diagnosing cluster-related problems









