RAC & ASM Best Practices
“You Probably Need More than just RAC”
Kirk McGowan, Technical Director – RAC Pack, Oracle Server Technologies, Cluster and Parallel Storage Development
Operational Best Practices (IT MGMT 101)

Background
– Requirements
– Why RAC Implementations Fail – Case Study
– Criticality of IT Service Management (ITIL) process
– People, Process, AND Technology
Why do people buy RAC?
– Low cost scalability
– Cost reduction, consolidation, infrastructure that can grow with the business
– Growing expectations for uninterrupted service
Why do RAC Implementations fail?
– RAC, scale-out clustering, is “new” technology
– Insufficient budget and effort are put toward filling the knowledge gap
– HA is difficult to do, and cannot be done with technology alone
– Operational processes and discipline are critical success factors, but are not addressed sufficiently
Based on true stories. Any resemblance, in full or in part, to your own experiences is intentional and expected. Names have been changed to protect the innocent
– 8-12 months spent implementing 2 systems – somewhat different architectures, very different workloads, identical tech stacks
– Oracle expertise (Development) engaged to help flatten the tech learning curve
– Non-mission-critical systems, but important elements of a larger enterprise re-architecture effort
– Many technology issues encountered across the stack, and resolved over the 8-12 month implementation
  - Hw, OS, storage, network, rdbms, cluster, and application
– New mission-critical deployment using the same technology stack
– Distinct architecture, applications development teams, and operations teams
– Large staff turnover
– Major escalation, post production

CIO: “Oracle products do not meet our business requirements”
– “RAC is unstable”
– “DG doesn’t handle the workload”
– “JDBC connections don’t failover”
Requirements, a.k.a. SLOs, were not defined
– e.g. a claimed failover time of 20 s, while the application logic included an 80 s failover timeout and the cluster failure detection time alone was set to 120 s
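The arithmetic in this case generalizes: a failover SLO is only achievable if the detection and failover timeouts configured at each layer of the stack fit inside it. A minimal sketch of that sanity check — the helper name and component labels are illustrative, not from the original; the numbers are the ones cited in the case study:

```python
# Sanity-check a failover-time SLO against the stack's configured
# timeout budget. Component names here are hypothetical labels.
def failover_budget_ok(slo_seconds, component_timeouts):
    """Return (ok, total): the SLO is achievable only if the sum of
    serial detection/failover timeouts fits inside it."""
    total = sum(component_timeouts.values())
    return total <= slo_seconds, total

# The case study: a 20 s failover claim, against layers configured far above it.
timeouts = {
    "cluster_failure_detection": 120,   # cluster-level failure detection time
    "application_failover_logic": 80,   # failover timeout built into the app
}
ok, total = failover_budget_ok(20, timeouts)
print(ok, total)  # the 20 s SLO cannot be met against 200 s of configured timeouts
```

Running the same check at design time, before any hardware is bought, is far cheaper than discovering the mismatch in production.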
Inadequate test environments
Problems were encountered first in production – including the fact that the SLOs could not be met
Inadequate change control
– Lessons learned in previous deployments were not applied to the new deployment – rediscovery of the same problems
– Some changes were implemented in test but never rolled into production – recurring problems (outages) in production
– No process for confirming a change actually fixes the problem prior to implementing it in production
More Operational Issues
Poor knowledge xfer between internal teams
– Configuration recommendations, patches, and fixes identified in previous deployments were not communicated.
– Evictions are a symptom, not the problem.
Inadequate system monitoring
– OS-level statistics (CPU, IO, memory) were not being captured.
– Impossible to RCA many problems without the ability to correlate cluster/database symptoms with system-level activity.
Inadequate Support procedures
– Inconsistent data capture
– No on-site vendor support consistent with the criticality of the system
– No operations manual
  - Managing and responding to outages
  - Responding to, and restoring service after, outages
Overview of Operational Process Requirements
What are “ITIL Guidelines”?
“ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems.”
IT Service Management
IT Service Management = Service Delivery + Service Support
– Service Delivery is partly concerned with setting up agreements and monitoring the targets within those agreements.
– Service Support processes can be viewed as delivering the services laid down in those agreements.
Provisioning of IT Service Mgmt
In all organizations, the provision of IT service management must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively. To meet this objective, three areas need to be considered:
– People with the right skills, appropriate training, and the right service culture
– Effective and efficient Service Management processes
– Good IT infrastructure in terms of tools and technology
Unless People, Processes and Technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.
Service Delivery processes
– Financial Management
– Service Level Management
  - e.g. Sev1, Sev2, Sev3, Sev4
  - Response time guidelines, SLAs
– Capacity Management
– IT Service Continuity Management
– Availability Management

Service Support processes
– Incident documentation & reporting, incident handling, escalation procedures
– RCAs, QA & process improvement
– Standard configs, gold images, CEMLIs
– Risk assessment, backout, sw maintenance, decommission
– New deployments, upgrades, emergency release, component release
BP: Set & Manage Expectations
Why is this important?
– Expectations with RAC are different at the outset
– HA is as much (if not more so) about the processes and procedures as it is about the technology
– No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs

– Must communicate what the technology can AND can’t do
– Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met
HA isn’t cheap!
BP: Clearly define SLO’s
– Cannot architect, design, OR manage a system without clearly understanding the SLOs
– 24x7 is NOT an SLO
Define HA/recovery time objectives, throughput, response time, data loss, etc
– Need to be established with an understanding of the cost of downtime for the system
– RTO and RPO are key availability metrics
– Response time and throughput are key performance metrics
Must address different failure conditions
– Planned vs unplanned
– Localized vs site-wide
– Response time and resolution time
Must be linked to the business requirements
Must be realistic
Manage to the SLOs
– Definitions of problem severity levels
– Documented targets for both incident response time and resolution time, based on severity
– Classification of applications w.r.t. business criticality
– Establish an SLA with the business
  - Negotiated response and resolution times
  - Definition of metrics, e.g. Application Availability shall be measured as: (Total Minutes in a Calendar Month − Unscheduled Outage Minutes − Scheduled Outage Minutes) ÷ Total Minutes in a Calendar Month
  - Negotiated SLOs
  - Effectively documents expectations between IT and the business
– Incident log: date, time, description, duration, resolution
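The availability formula above is simple to compute and worth automating so IT and the business always report the same number. A minimal sketch — the function name and the sample downtime figures are illustrative:

```python
# Availability for a calendar month, per the SLA formula:
# (Total - Unscheduled Outage - Scheduled Outage) / Total, as a percentage.
def monthly_availability(total_min, unscheduled_out_min, scheduled_out_min):
    return 100.0 * (total_min - unscheduled_out_min - scheduled_out_min) / total_min

# A 30-day month with 90 min of unplanned and 120 min of planned downtime:
avail = monthly_availability(30 * 24 * 60, 90, 120)
print(f"{avail:.3f}%")  # 99.514%
```

Note that the formula counts scheduled outages against availability too, so the negotiated SLO must leave room for planned maintenance windows.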
Example Resolution Time Matrix
– Severity 1, Priority 1 and 2 SRs: < 1 hour
– Severity 1, Priority 3 SRs: < 13 hours
– Severity 2, Priority 1 SRs: < 14 hours
– Severity 2 SRs: < 132 hrs
Example Response Time Matrix
[Table: response-time targets for each SR status (New/XFR, ASG, IRR/2CB, RVW/1CB, PCR/RDV, WIP, INT, LMS/CUS, DEV) by severity/priority class (Sev1/P1, Sev1/P2, Sev2/P1, Sev2, Sev3/Sev4); targets range from 15 minutes to 10 days. The original table layout is not recoverable.]
BP: TEST, TEST, TEST
Testing is a shared responsibility
Functional, destructive, and stress testing
Test environments must be representative of production
– Both in terms of configuration and capacity
– Separate from production
– Building a test harness to mimic the production workload is a necessary, but non-trivial, effort
Ideally, problems would never be encountered first in production
– If they are, the first question should be: why didn’t we catch the problem in test?
  - Exceeding some threshold
  - Unique timing or race condition
– What can we do so we catch this type of problem in the future?
  - Build a test case that can be reused as part of pre-production testing
BP: Define, document, and adhere to Change Control Processes
This amounts to self-discipline
Applies to all changes at all levels of the tech stack
– Hw changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload
– If no changes are introduced, a system will reach a steady state and function forever
A well-designed system will be able to tolerate some fluctuations and faults. A well-managed system will meet service levels.
If a problem (that was fixed) is encountered again elsewhere, it is a change management process problem, not a technology problem – i.e., rediscovery should not happen. Ensure fixes are applied across all nodes in a cluster, and in all environments to which the fix applies.
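One mechanical way to catch the "fix applied on some nodes but not others" failure mode is to diff patch inventories across nodes and environments. A sketch, assuming the inventories have already been parsed into sets of patch IDs (e.g. from `opatch lsinventory` output); the node names and patch numbers below are hypothetical:

```python
# Detect patch drift: flag every node/environment missing a patch that
# is present somewhere else in the estate.
def patch_drift(inventories):
    """Given {node: set of patch IDs}, return {node: missing patches}
    relative to the union of all patches seen anywhere (empty if none)."""
    union = set().union(*inventories.values())
    return {node: union - patches
            for node, patches in inventories.items()
            if union - patches}

inventories = {                                    # hypothetical data
    "prod-node1": {"5556081", "5557242", "6455161"},
    "prod-node2": {"5556081", "5557242"},          # missing one fix
    "test-node1": {"5556081", "6455161"},
}
print(patch_drift(inventories))
```

Run as part of change control, a report like this turns "re-occurring outage" into "inventory mismatch found before rollout".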
BP: Plan for, and execute Knowledge Xfer
New technology has a learning curve. 10g, RAC, and ASM cross traditional job boundaries, so knowledge xfer must be executed across all affected groups
– Architecture, development, and operations
– Network admin, sysadmin, storage admin, DBA
– e.g. evictions are not a problem, they are a symptom
– Learn how to use the various tools and interpret their output
Learn how to identify and diagnose problems
– Hanganalyze, system state dumps, truss, etc.
– Understand behaviour – the distinction between cause and symptom
Needs to occur pre-production
BP: Monitor your system
Define key metrics and monitor them actively
– Establish a (performance) baseline
Learn how to use Oracle-provided tools
– RDA (+ RACDDT)
– AWR/ADDM
– Active Session History
– OSWatcher
Coordinate monitoring and collection of OS level stats as well as db-level stats
Problems observed at one layer are often just symptoms of problems that exist at a different layer
Don’t jump to conclusions
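Monitoring against a baseline can start very simply: flag any sample that deviates from the baseline period by more than a few standard deviations, then investigate before concluding anything. A minimal sketch — in practice the baseline would come from AWR/OSWatcher history; the samples below are made up:

```python
import statistics

# Flag a sample that deviates from the baseline mean by more than
# n_sigmas standard deviations.
def deviates_from_baseline(baseline, sample, n_sigmas=3.0):
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    return abs(sample - mean) > n_sigmas * sd

cpu_baseline = [22, 25, 24, 23, 26, 24, 25, 23]   # % busy during baseline period
print(deviates_from_baseline(cpu_baseline, 24))   # False: within normal variation
print(deviates_from_baseline(cpu_baseline, 95))   # True: alert-worthy spike
```

The same check applies to I/O waits, interconnect latency, or any other key metric — the point is that "abnormal" is only meaningful relative to a captured baseline.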
BP: Define, Document, and communicate Support procedures
Define corrective procedures for outages
– Routinely test corrective procedures
– Phases: Prevent, Detect, Capture, Resume, Analyze, Fix
– Classify high-priority systems, and the steps that need to be taken in each phase
– Keep an active log of every outage
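The "active log of every outage" can start as something very simple, provided it captures date/time, description, duration, and resolution consistently for later RCA and SLA reporting. A sketch with illustrative field names and data:

```python
from datetime import datetime

# Append one outage record to an incident log, computing duration
# from the start/end timestamps. Field names are illustrative.
def log_incident(log, start, end, severity, description, resolution):
    entry = {
        "start": start,
        "duration_min": (end - start).total_seconds() / 60,
        "severity": severity,
        "description": description,
        "resolution": resolution,
    }
    log.append(entry)
    return entry

log = []
e = log_incident(log,
                 datetime(2007, 3, 1, 2, 15), datetime(2007, 3, 1, 3, 0),
                 severity=1, description="node 2 evicted under IO load",
                 resolution="corrected storage multipath configuration")
print(e["duration_min"])  # 45.0
```

A log in this shape feeds directly into the availability metric and the response/resolution-time targets negotiated in the SLA.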
If we don’t provide sufficient tools to get to root cause, then shame on us. If you don’t implement the diagnostic capabilities that are provided to help get to root cause, then shame on you. Serious outages should never happen more than once.
– Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations
– Address these, and you dramatically increase your chances of a successful RAC deployment – and save yourself a lot of future pain
– Configuration Management – initial install and config, standardized “gold image” deployment
– Incident Management – diagnosing cluster-related problems
Additional areas of challenge