RAC & ASM Best Practices
“You Probably Need More than just RAC”

Kirk McGowan, Technical Director – RAC Pack, Oracle Server Technologies, Cluster and Parallel Storage Development

Agenda
– Background
    – Requirements
    – Why RAC Implementations Fail
    – Case Study
    – Criticality of IT Service Management (ITIL) process
    – People, Process, AND Technology
– Best Practices – Operational Best Practices (IT MGMT 101)

Why do people buy RAC?
– Low cost scalability
    – Cost reduction, consolidation, infrastructure that can grow with the business
– High Availability
    – Growing expectations for uninterrupted service

Why do RAC Implementations fail?
– RAC, scale-out clustering, is “new” technology
    – Insufficient budget and effort are put towards filling the knowledge gap
– HA is difficult to do, and cannot be done with technology alone
    – Operational processes and discipline are critical success factors, but are not addressed sufficiently

Case Study
Based on true stories. Any resemblance, in full or in part, to your own experiences is intentional and expected. Names have been changed to protect the innocent.

Case Study
Background
– 8-12 months spent implementing 2 systems – somewhat different architectures, very different workloads, identical tech stacks
– Oracle expertise (Development) engaged to help flatten the tech learning curve
– Non-mission-critical systems, but important elements of a larger enterprise re-architecture effort
– Many technology issues encountered across the stack, and resolved over the 8-12 month implementation
    – HW, OS, storage, network, RDBMS, cluster, and application

Case Study
Situation
– New mission-critical deployment using the same technology stack
– Distinct architecture, application development teams, and operations teams
– Large staff turnover
– Major escalation, post-production
    – CIO: “Oracle products do not meet our business requirements”
    – “RAC is unstable”
    – “DG doesn’t handle the workload”
    – “JDBC connections don’t failover”

Case Study
Operational Issues
– Requirements, a.k.a. SLOs, were not defined
    – e.g. a 20s failover time was claimed, yet the application logic assumed an 80s failover and the cluster failure detection timeout alone was set to 120s – the stated objective could never be met
– Inadequate test environments
    – Problems encountered first in production – including the fact that SLOs could not be met
– Inadequate change control
    – Lessons learned in previous deployments were not applied to the new deployment – rediscovery of the same problems
    – Some changes implemented in test, but never rolled into production – recurring problems (outages) in production
    – No process for confirming a change actually fixes the problem prior to implementing it in production

Case Study
More Operational Issues
– Poor knowledge transfer between internal teams
    – Configuration recommendations, patches, and fixes identified in previous deployments were not communicated
    – Evictions are a symptom, not the problem
– Inadequate system monitoring
    – OS-level statistics (CPU, IO, memory) were not being captured
    – Impossible to perform RCA on many problems without the ability to correlate cluster/database symptoms with system-level activity
– Inadequate support procedures
    – Inconsistent data capture
    – No on-site vendor support consistent with the criticality of the system
    – No operations manual – managing and responding to outages, responding and restoring service after outages

Overview of Operational Process Requirements
What are “ITIL Guidelines”?
“ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems.”

IT Service Management
IT Service Management = Service Delivery + Service Support
– Service Delivery: partially concerned with setting up agreements and monitoring the targets within these agreements
– Service Support: its processes can be viewed as delivering the services laid down in these agreements

Provisioning of IT Service Mgmt
In all organizations, IT service provisioning must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively. To meet this objective, three areas need to be considered:
– People with the right skills, appropriate training and the right service culture
– Effective and efficient Service Management processes
– Good IT infrastructure in terms of tools and technology

Unless People, Processes and Technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.

Service Delivery
– Financial Management
– Service Level Management
    – Severity/priority definitions (e.g. Sev1, Sev2, Sev3, Sev4)
    – Response time guidelines
    – SLAs
– Capacity Management
– IT Service Continuity Management
– Availability Management

Service Support
– Incident Management
    – Incident documentation & reporting, incident handling, escalation procedures
– Problem Management
    – RCAs, QA & process improvement
– Configuration Management
    – Standard configs, gold images, CEMLIs
– Change Management
    – Risk assessment, backout, SW maintenance, decommission
– Release Management
    – New deployments, upgrades, emergency release, component release

BP: Set & Manage Expectations
Why is this important?
– Expectations with RAC are different at the outset
– HA is as much (if not more so) about the processes and procedures as it is about the technology
    – No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs
– Must communicate what the technology can AND can’t do
– Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met
    – HA isn’t cheap!

BP: Clearly define SLOs
– Sufficiently granular
    – Cannot architect, design, OR manage a system without clearly understanding the SLOs
    – 24x7 is NOT an SLO
– Define HA/recovery time objectives, throughput, response time, data loss, etc. (a sketch of recording these follows this slide)
    – Need to be established with an understanding of the cost of downtime for the system
    – RTO and RPO are key availability metrics
    – Response time and throughput are key performance metrics
– Must address different failure conditions
    – Planned vs unplanned
    – Localized vs site-wide
– Must be linked to the business requirements
    – Response time and resolution time
– Must be realistic
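
A minimal sketch (not from the deck; the scenario names and numbers are illustrative assumptions) of what "sufficiently granular" SLOs can look like when written down per failure condition rather than as "24x7":

# Granular SLOs, one per failure condition (planned vs unplanned, localized vs
# site-wide), capturing the availability and performance metrics named above.
# All values are placeholders for illustration.
from dataclasses import dataclass

@dataclass
class ServiceLevelObjective:
    scenario: str            # failure condition, e.g. "node failure (unplanned, localized)"
    rto_seconds: int         # recovery time objective
    rpo_seconds: int         # recovery point objective (tolerated data loss)
    response_time_ms: int    # response time target under this scenario
    min_throughput_tps: int  # throughput target under this scenario

slos = [
    ServiceLevelObjective("node failure (unplanned, localized)", 120, 0, 500, 800),
    ServiceLevelObjective("site failover (unplanned, site-wide)", 1800, 60, 800, 400),
    ServiceLevelObjective("rolling patch (planned, localized)", 0, 0, 600, 600),
]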

Manage to the SLOs
– Definitions of problem severity levels
– Documented targets for both incident response time and resolution time, based on severity
– Classification of applications w.r.t. business criticality
– Establish SLA with business
    – Negotiated response and resolution times
    – Definition of metrics
        – e.g. Application Availability shall be measured as: Total Minutes in a Calendar Month, minus Unscheduled Outage Minutes, minus Scheduled Outage Minutes in such month, divided by Total Minutes in a Calendar Month (worked example below)
    – Negotiated SLOs
    – Effectively documents expectations between IT and the business
– Incident log: date, time, description, duration, resolution
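
A worked illustration of the availability metric defined above (the outage minutes are assumed for the example, not taken from the deck):

\[ \text{Availability} = \frac{T_{\text{total}} - T_{\text{unscheduled}} - T_{\text{scheduled}}}{T_{\text{total}}} \]

For a 30-day month, T_total = 43,200 minutes; with 90 minutes of unscheduled outage and 240 minutes of scheduled outage, availability = (43,200 − 90 − 240) / 43,200 ≈ 99.24%.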

Example Resolution Time Matrix
SR class                            Resolution time target
Severity 1, Priority 1 and 2 SRs    < 1 hour
Severity 1, Priority 3 SRs          < 13 hours
Severity 2, Priority 1 SRs          < 14 hours
Severity 2 SRs                      < 132 hrs

Example Response Time Matrix
Status     Sev1/P1   Sev1/P2   Sev2/P1   Sev2   Sev3/Sev4
New,XFR    15        30        15        30     3 hrs
ASG        15        60        15        60     18 hrs
IRR,2CB    15        30        15        30     4 days
RVW,1CB    15        60        15        60     120 min
PCR,RDV    60        N/A       60        60     3 hrs
WIP        60        60        60        120    2 days
INT        60        2         60        60     4 days
LMS,CUS    4         4         4         120    3 days
DEV        4         4         4         120    10 days

BP: TEST, TEST, TEST
– Testing is a shared responsibility
    – Functional, destructive, and stress testing
– Test environments must be representative of production
    – Both in terms of configuration and capacity
    – Separate from production
    – Building a test harness to mimic production workload is a necessary, but non-trivial, effort (a probe sketch follows this slide)
– Ideally, problems would never be encountered first in production
    – If they are, the first question should be: why didn’t we catch the problem in test?
        – Exceeding some threshold
        – Unique timing or race condition
    – What can we do so we catch this type of problem in the future?
        – Build a test case that can be reused as part of pre-production testing
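
One way to make destructive testing concrete is a probe that keeps a trivial workload running while a node is killed, then measures how long service takes to resume against the recovery SLO. The sketch below is an assumption, not part of the deck: the SCAN address, service name, and 20s threshold are placeholders, and it uses the python-oracledb driver.

# Destructive-test probe: run a trivial query in a loop, and when the connection
# fails (e.g. a node is killed), keep retrying and report how long the workload
# was interrupted. Connection details and the SLO value are illustrative.
import time
import oracledb  # python-oracledb thin driver

DSN = "rac-scan.example.com:1521/svc_test"   # hypothetical SCAN address/service
SLO_RECOVERY_SECONDS = 20                    # recovery-time objective under test

def run_probe(user: str, password: str) -> None:
    outage_started = None
    while True:
        try:
            with oracledb.connect(user=user, password=password, dsn=DSN) as conn:
                with conn.cursor() as cur:
                    while True:
                        cur.execute("select sysdate from dual")
                        cur.fetchone()
                        if outage_started is not None:
                            downtime = time.time() - outage_started
                            print(f"service resumed after {downtime:.1f}s "
                                  f"(SLO: {SLO_RECOVERY_SECONDS}s)")
                            outage_started = None
                        time.sleep(1)
        except oracledb.Error as exc:
            if outage_started is None:
                outage_started = time.time()
                print(f"workload interrupted: {exc}")
            time.sleep(1)  # retry until a surviving node accepts connections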

BP: Define, document, and adhere to Change Control Processes
– This amounts to self-discipline
– Applies to all changes at all levels of the tech stack
    – HW changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload
    – If no changes are introduced, a system will reach a steady state and function forever
– A well-designed system will be able to tolerate some fluctuations and faults; a well-managed system will meet service levels
    – If a problem (that was fixed) is encountered again elsewhere, it is a change management process problem, not a technology problem – i.e. rediscovery should not happen
    – Ensure fixes are applied across all nodes in a cluster, and all environments to which the fix applies (a consistency-check sketch follows this slide)
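
A minimal sketch of the "apply fixes everywhere" check described above: compare a patch inventory across all nodes and environments and flag divergence. The host names and the use of OPatch's lspatches output here are assumptions for illustration only.

# Compare the set of applied patches reported on every node/environment against a
# reference host, and print any differences.
import subprocess

HOSTS = ["prodrac1", "prodrac2", "testrac1", "testrac2"]   # hypothetical nodes
INVENTORY_CMD = "$ORACLE_HOME/OPatch/opatch lspatches"     # assumed to be available remotely

def patch_inventory(host: str) -> set[str]:
    """Collect the set of applied patches reported on one host (via ssh)."""
    out = subprocess.run(["ssh", host, INVENTORY_CMD],
                         capture_output=True, text=True, check=True).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}

def check_consistency() -> None:
    reference_host = HOSTS[0]
    reference = patch_inventory(reference_host)
    for host in HOSTS[1:]:
        diff = reference.symmetric_difference(patch_inventory(host))
        if diff:
            print(f"{host}: inventory differs from {reference_host}: {sorted(diff)}")
        else:
            print(f"{host}: consistent with {reference_host}")

if __name__ == "__main__":
    check_consistency()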

BP: Plan for, and execute Knowledge Xfer
– New technology has a learning curve. 10g, RAC, and ASM cross traditional job boundaries, so knowledge transfer must be executed across all affected groups
    – Architecture, development, and operations
    – Network admin, sysadmin, storage admin, DBA
– Learn how to identify and diagnose problems
    – e.g. evictions are not a problem, they are a symptom
    – Learn how to use the various tools and interpret output
    – Hanganalyze, system state dumps, truss, etc.
    – Understand behaviour – the distinction between cause and symptom
– Needs to occur pre-production
    – Operational Readiness

BP: Monitor your system
– Define key metrics and monitor them actively
    – Establish a (performance) baseline
– Learn how to use Oracle-provided tools
    – RDA (+ RACDDT)
    – AWR/ADDM
    – Active Session History
    – OSWatcher
– Coordinate monitoring and collection of OS-level stats as well as db-level stats (a collection sketch follows this slide)
    – Problems observed at one layer are often just symptoms of problems that exist at a different layer
– Don’t jump to conclusions
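
A minimal sketch of coordinated OS-level collection (an illustration, not OSWatcher itself): sample CPU, memory, and disk I/O with UTC timestamps so the figures can later be lined up against AWR/ASH snapshots. It relies on the third-party psutil package; the file name and sampling interval are arbitrary choices.

# Append timestamped OS statistics (CPU %, memory %, cumulative disk I/O) to a CSV.
import csv
import time
from datetime import datetime, timezone

import psutil

SAMPLE_SECONDS = 30
OUTPUT_FILE = "os_stats.csv"   # hypothetical output location

def collect(samples: int) -> None:
    with open(OUTPUT_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["utc_time", "cpu_pct", "mem_used_pct",
                         "disk_read_mb", "disk_write_mb"])
        for _ in range(samples):
            io = psutil.disk_io_counters()
            writer.writerow([
                datetime.now(timezone.utc).isoformat(),
                psutil.cpu_percent(interval=1),     # % CPU over a 1s window
                psutil.virtual_memory().percent,    # % memory in use
                round(io.read_bytes / 2**20, 1),    # cumulative MB read
                round(io.write_bytes / 2**20, 1),   # cumulative MB written
            ])
            f.flush()
            time.sleep(SAMPLE_SECONDS)

if __name__ == "__main__":
    collect(samples=10)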

BP: Define, Document, and communicate Support procedures
– Define corrective procedures for outages
    – Routinely test corrective procedures
– HA process:
    – Prevent → Detect → Capture → Resume → Analyze → Fix
    – Classify high-priority systems, and the steps that need to be taken in each phase
    – Keep an active log of every outage (a logging sketch follows this slide)
– If we don’t provide sufficient tools to get to root cause, then shame on us. If you don’t implement the diagnostic capabilities that are provided to help get to root cause, then shame on you
– Serious outages should never happen more than once
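
A minimal sketch of the "active log of every outage" item above, capturing the fields the deck lists for an incident log (date/time, description, duration, resolution). The file name and the sample entry are invented for illustration.

# Append-only outage log as a CSV, one row per incident.
import csv
import os
from dataclasses import dataclass, asdict
from datetime import datetime

LOG_FILE = "outage_log.csv"   # hypothetical location
FIELDS = ["started_at", "description", "duration_minutes", "resolution"]

@dataclass
class OutageRecord:
    started_at: str
    description: str
    duration_minutes: int
    resolution: str

def log_outage(record: OutageRecord) -> None:
    """Append one outage record, writing a header the first time."""
    new_file = not os.path.exists(LOG_FILE) or os.path.getsize(LOG_FILE) == 0
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(asdict(record))

# Illustrative entry only; the incident itself is invented for the example.
log_outage(OutageRecord(
    started_at=datetime(2007, 3, 14, 2, 5).isoformat(),
    description="Node 2 evicted; workload failed over to the remaining instance",
    duration_minutes=18,
    resolution="Faulty interconnect NIC replaced; fix verified in test, then applied to all nodes",
))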

Summary
– Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations
    – Address these and you dramatically increase your chances of a successful RAC deployment, and will save yourself a lot of future pain
– Additional areas of challenge
    – Configuration Management – initial install and config, standardized “gold image” deployment
    – Incident Management – diagnosing cluster-related problems

				
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.