Lessons Learned From A Migration Project Oracle to - PDF by ozb45831


									      Oracle RAC

Lessons Learned From A Migration Project

   Oracle to on Linux
   Project Overview
   New Hardware Overview
   RDBMS and Patch
   Testing
   Implementation Weekend
   Problems past Go-Live
   Conclusion
   Q&A
               Who is e-DBA?
   e-DBA – Oracle Certified Partner
   Managed services and consultancy
   Specialised in Oracle Database, Application
    Server, Unix and Linux
   End-to-End management and delivery of the
    whole project including hardware, operating
    system and Oracle technology stack
                    Project Overview
   Old production system reached capacity limit
    −   Oracle RAC Standard Edition; e-DBA
        asked to support original 3rd party implementation
    −   SuSE Linux Enterprise Server 9 SP3
    −   Hardware: HP ProLiant DL360
            1 single-core CPU per node
            3GB memory
            MSA 1000 for storage
   Forms and Reports based application
    −   2 OAS Forms & Reports standalone
    −   Zeus load balancer doing DNS round robin
        Development Over Past Year
   Project started out small but rapidly growing
    −   Originally intended to port MySQL-based
        application to a more resilient platform
   Problems encountered
    −   I/O subsystem too slow
    −   Constant memory problems
    −   Run queue would sometimes exceed 12 for
        extended periods
    −   Slow interconnect LMSn processes desperate for
Node 1 CPU Usage Early October
Node 2 CPU Usage Early October
          New Hardware – Database
   HP ProLiant DL385 G2
    −   More processing power: 2x dual core Opteron 2800
    −   More memory: 14 GB RAM
   SAN storage
    −   Overall much better performance; HP EVA 4000
    −   HBA multipathing (Eumulex)
    −   Striped & mirrored disk group available for Oracle
    −   Finally more than 2 disks!
Points for Discussion-Software Stack
   Operating System
    −   SLES 9 or 10?
    −   RHEL 4 or 5/OEL 4 or 5?
   Oracle Database
    −   Upgrade to Enterprise Edition?
    −   Stay with 10g Release 2 and patch?
    −   Go for Oracle 11g?
   Oracle Application Server
    −   Stay with base release or patch to
                  Oracle RDBMS
   Decided to apply latest patchset to existing installation
    −   Not enough testing resources available to ensure
        application works well with 11g, and
    −   Patching to deemed less risky than 11g
    −   No upgrade to Enterprise Edition
    −   No significant advantage offered by 11g Standard
        Edition without additional options
    −   No Critical Patch Update
   No RAC test environment - again!
               RDBMS Installation
   RAW devices for OCR, voting disk and ASM
    −   No ASMLib
    −   External redundancy for OCR and Voting Disk
    −   Q: What's your experience with this?
    −   Single raw device for ASM; external redundancy
   Followed Best Practices: 3 Oracle Homes
    −   Clusterware
    −   ASM
    −   Database
              Patchset 2 -
   Problems!!!
    −   Many Alerts on Metalink
    −   Note 415665.1: /var/log/messages flooded, patch
        5679560 to fix problem
    −   Oracle Upgrade Companion: valuable resource
   Short note about patch 5679560
    −   Rolling patch the manual way (opatch apply -local)
    −   srvctl stop instance -d RACDB -i RACDB1
    −   Then stop local ASM instance and nodeapps
    −   crs_relocate ora.crs.node1.vip to avoid timeouts
           Oracle Application Server
   Decision made to upgrade to
    −   Patch 4960210 ( to is 1.6 GB in
    −   OAS on Linux is 32bit only
    −   Lengthy note 415222.1 details OAS 10.1.2.x
        maintenance, needs to be followed carefully
    −   Main problems arising from using 64bit Linux with
        32bit Oracle software: mainly relinking problems
    −   Font problems with Forms, fixed as per Note
   Customer wanted to push the whole migration
    −   Initially only two weeks allocated for the migration
        including testing
    −   Initial testing resources needed to be reinforced
    −   No load testing due to lack of time/resources
   Hardware testing
   Application failover testing
    Tests Performed Prior to Go-Live
   Multitude of tests have been performed:
    −   Multipathing/SAN connectivity
    −   Single Application Server failure
    −   Single Load Balancer failure
    −   Single RAC node failure
    −   Database listener termination on one node
    −   Clusterware failure on one node
    −   Database instance termination
    −   ASM instance termination on one node
                         Test Results
   All in all, system worked as expected
   Problems encountered:
    −   Messages in user_dump_dest
            Applied patch 5452672 per Oracle recommendation
    −   OEM DB Control
            Took 10 minutes to load for recreated repository, dbsmp
             used 100% CPU reproducibly
            Fix : delete table statistics for SYS.X$KCCRSR and lock
             that object (see Metalink Note 420200.1)
RMAN Restore to New Environment
set dbid 571120117
run {
    set controlfile autobackup format for device type disk
    to '/u01/oraback/RACDB/rman/controlfile/ctl_%d_%F.ctlbak';
    restore controlfile from autobackup;
# mount db here
run {
   set until time "to_date('20.10.2007 08:40','dd.mm.yyyy hh24:mi')";
   restore database;

executing command: SET until clause

Starting restore at 20-OCT-07
using channel ORA_DISK_1

RMAN-00571:   ===========================================================
RMAN-00569:   =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571:   ===========================================================
RMAN-03002:   failure of restore command at 10/20/2007 09:54:26
RMAN-06026:   some targets not found - aborting restore
RMAN-06023:   no backup or copy of datafile 4 found to restore
RMAN-06023:   no backup or copy of datafile 3 found to restore
RMAN-06023:   no backup or copy of datafile 2 found to restore
RMAN-06023:   no backup or copy of datafile 1 found to restore
RMAN Restore Problem Explained
   Second restore failed: RMAN-06023
    −   No backup or copy of datafile found, but why?
    −   Cleared out test database using ASMCMD
            +DATA/RACDB/{datafile,controlfile,logs} purged
            FRA untouched – this turned out to be the problem
    −   Restore of database implicitly registered new
            Most likely thanks to the log_archive_format parameter
            No backup for that incarnation found – RMAN aborting
RMAN Restore Problem Explained
rman target /

connected to target database: RACDB (DBID=571120117, not open)

RMAN> run {
2> set until time = "to_date('20.10.2007 08:30','dd.mm.yyyy hh24:mi')";
3> restore database ;
4> }

executing command: SET until clause
using target database control file instead of recovery catalog

Starting restore at 20-OCT-07
Starting implicit crosscheck backup at 20-OCT-07
allocated channel: ORA_DISK_1
Crosschecked 80 objects
Finished implicit crosscheck backup at 20-OCT-07

Starting implicit crosscheck copy at 20-OCT-07
using channel ORA_DISK_1
Finished implicit crosscheck copy at 20-OCT-07

searching for all files in the recovery area
cataloging files...
cataloging done

List of Cataloged Files
File Name: +data/RACDB/ARCHIVELOG/2007_10_20/thread_2_seq_152.592.636422705
            Problems Post Go-Live I
   Many trace files RACDB1_PZ99_xxx
    −   Rediscovery of bug 5743827 according to Oracle,
        which is in state 33 (suspended)
    −   Bug not visible to us – no further information
    −   Workaround: set _optim_peek_user_binds=FALSE
        on all nodes
    −   Impact: high – cannot afford setting this parameter
        without extensive testing.
    −   Downside of not implementing the workaround is
        large number of trace files which is acceptable
           Problems Post Go-Live II
   ORA-604 and ORA-1461 in SMON trace
    −   System stayed up, no users affected
    −   Multiple (5) bugs associated with this problem, it is
        documented in Note 461911.1
    −   Invalid child cursors executed in recursive SQL
    −   Workarounds not known to help:
           Flush shared pool

           Set session_cached_cursors to 0

    −   Recommendation: apply patch 6602742
    −   Status: pending – no test environment yet
             Problems Post Go-Live III
   DBMS scheduler not working as expected
    −   Had not been tested beforehand!
            Insufficient customer resources
            We had not been informed
    −   Symptom: some critical scheduled jobs on node 2
        always start 5-45 minutes too late
    −   Mission-critical problem for the customer
    −   No suitable and timely help from Oracle
    −   Revert to DBMS_JOB for critical jobs
    −   No solution! It has not been a problem in
          Problems Post Go-Live IV
   Oracle process running out of kernel IO
    resources – again
    −   DBW0 reported this, users did not notice at all
    −   Hoped to have fixed it prior to go live
    −   Another bug number: 6087207 in addition to
    −   Better analyst this time
    −   Merge label request for 2 patches: 6343214
    −   Not yet applied – still no test environment!!
           Problems Post Go-Live V
   Oracle Application Servers
    −   dmesg output lists many segfaults for frmweb
    −   No user noticed or seemed to be affected, but our
        monitoring tools picked it up
    −   Raised with Oracle on low priority, support analyst
        suggested JRE problem (x86-64 platform)
    −   Workaround: noexec=1 in kernel command line
    −   ... Had no effect
             Problems Post Go-Live VI
   ORA-4031 errors in 2nd instance
    −   Analysis of v$sgastat showed plenty of free space
    −   ORA-4031 always accompanied by ORA-313
        and/or ORA-312: “online log file not found” - scary!
    −   Shared pool size undersized by default
            Set it to 48M on 32bit systems (default is 32MB)
            Set it to 150M on 64bit systems (default is 88M)
    −   Internal Bug 4181069
    −   See also Metalink Note 419166.1
   All in all a very smooth migration despite the
    past 6 slides
   One node now can support the whole workload
   Customer still has not approved rebuild of old
    production environment for testing
   Lessons learned
    −   Always research alerts before applying a patchset
    −   Make sure to have a test environment
    −   Allocate enough resources for testing
    −   If possible, do a load test (hammerora, swingbench)

Any questions?

To top