Repl Check

Document Sample
Repl Check Powered By Docstoc
					1. Overview of how data moves between Replicated Sites ............................................................ 2
2. Checking the replication propagation environment is configured correctly ................................. 2
   2.1 Replication group status ........................................................................................................ 2
   2.2 Replicated object triggers, packages and status .................................................................. 2
   2.3 Status of existing admin requests ......................................................................................... 3
   2.4 Check database links ............................................................................................................ 3
3. Checking the automatic propagation mechanism is working ...................................................... 4
   3.1 Checking for errors ................................................................................................................ 4
   3.2 Check job_queue_processes is set ...................................................................................... 4
   3.3 Has a push / purge job been scheduled ............................................................................... 5
   3.4 Check the status of the propagation jobs .............................................................................. 6
   3.5 Check the propagator and their private database links ........................................................ 6
   3.6 Check if the push job is currently running ............................................................................. 7
   3.7 Terminating a deferred queue push job that is currently running ......................................... 8
4. Diagnosing the progress and status of replicated transactions ................................................... 8
   4.1 Verify transactions can be manually pushed ........................................................................ 8
   4.2 Verifying transactions are being propagated ........................................................................ 9
   4.3 Deferred transaction propagated but not applied to receiving site ..................................... 11
      4.3.1 Checking for deferred errors in the queue ................................................................... 11
      4.3.2 Stop on Error ................................................................................................................ 12
   4.4 Verify the purge transaction operation ................................................................................ 12
      4.4.1 Verify transactions can be manually purged ................................................................ 12
      4.4.2 Verifying transactions are being purged ...................................................................... 13
   4.5 Large deferred transactions and slow propagation ............................................................. 14
5. Diagnosing hanging propagation ............................................................................................... 15
   5.1 Check for locks .................................................................................................................... 16
   5.2 Check the wait events ......................................................................................................... 16
   5.3 Check if a large error is being queued ................................................................................ 17
   5.4 Advanced Analysis .............................................................................................................. 18
6. How to clear down large deferred queues ................................................................................. 18
   7. Example of how deferred transactions are propagated ........................................................ 19
   7.1 Data is inserted into the table: ............................................................................................. 19
   7.2 Interrogating the deferred queue:........................................................................................ 19
   7.3 Identify the next transaction to be pushed: ......................................................................... 20
   7.4 Manually push transactions to REP901 and interrogate the deferred queue ..................... 20
   7.5 Identify unpurged transactions ............................................................................................ 21
1. Overview of how data moves between Replicated Sites
Oracle uses internalised triggers to capture and store any data changes made
to tables that have been defined as replicated. It stores the data changes as
remote procedure calls (RPC's) in a deferred transaction queue table (defcall)
for propagation at a predefined time and interval. Propagation to remote master
sites is automated using the database job queue mechanism. The job either
applies the transaction to the remote master or if a transaction fails it is
placed in the remote deferred transaction error queue for analysis at a later
date.


2. Checking the replication propagation environment is configured correctly
A large majority of the problems encountered by replication users are due
to incorrect setup and configuration of the replicated environment. Refer
to section 1 of [NOTE:122039.1] and [NOTE:117434.1] before checking the
following :

2.1 Replication group status
For propagation to be successful the status of the replication groups must be
normal:
 select gname, status from dba_repgroup;
 GNAME                            STATUS
 ------------------------------ ---------
 GROUP1                            NORMAL

2.2 Replicated object triggers, packages and status
For the propagation and queuing of data changes to be successful to remote
sites, replicated tables (objects) must display as valid at all replication
sites. They must also have the associated internalised triggers and packages
defined. Run the following query to check the replicated tables (objects), in
releases prior to Oracle 8.1.x:

 select oname, status, generation_status
 from dba_repobject
 where status != 'VALID';

Run the following query to check the replicated tables (objects), in Oracle
8.1.x and above:

 select oname, status, generation_status,
      replication_trigger_exists, internal_package_exists
 from dba_repobject
 where status != 'VALID'
   or replication_trigger_exists != 'Y'
   or internal_package_exists != 'Y';

If the above query returns replication objects that do not have the associated
internalised triggers or packages, it may be necessary to re-generate
replication support. If after re-generation the replication objects still show
as invalid, run the following statement to ensure that all dependant SYS and
SYSTEM objects are valid:

 column owner format a25
 column object_name format a30
 column object_type format a15

 select owner, object_name, object_type
 from dba_objects
 where status != 'VALID'
 and owner in ('SYS', 'SYSTEM');

If problems persist, contact Oracle Support Services.

2.3 Status of existing admin requests
Data cannot move between replicated sites if there are outstanding admin
requests for the associated master group and there should not be any such
requests if the group displays with status normal. Use the following query
to check for admin requests:

 column gname format a18

 select gname, request, status, errnum
 from dba_repcatlog
 order by id, gname;

For additional analysis of the replication administrative requests queue
please refer to [NOTE:180014.1] and [NOTE:122039.1].


2.4 Check database links
The correct configuration of database link is essential for replication
propagation to operate correctly. The following articles cover the links
required by replication:

 [NOTE:50593.1] Initial steps required to create Multi Master
         and Snapshot Replication v8.0
 [NOTE:117434.1] Initial steps required to a create Multi Master
         Replication environment v8.1

Check the existence of the public and private links with the following query
at each site involved to propagation:

 column owner format a15
 column db_link format a45
 column username format a15
 select owner, db_link, username from dba_db_links;

 OWNER                DB_LINK                                   USERNAME
 --------------- ----------------------------------------- ---------------
 PUBLIC             DB2
 REPADMIN               DB2                                   REPADMIN

Test each of the links and ensure that the global name matches the link name
with the following query:

 connect as database link owner
 select * from global_name@<db link name>;
It is important that the correct links exist for the user who owns the job
that performs the replication push job.



3. Checking the automatic propagation mechanism is working
Oracle Replication uses the job queue to automate the propagation and purging
of deferred transactions. If the jobs are not configured properly, they will
not run as expected and their associated tasks will not be completed.

Administrators usually become aware that there may be a problem with the job
queue mechanism when they discover that the replication deferred transaction
queue is building up. To check if the queue is growing run :

 select count(*) from deftrandest ;

If the queue appears to be growing to an unusual size, use the following 3.x
sections to ensure that the automated jobs are not the cause. Do not attempt
to quiesce the replication system with SUSPEND_MASTER_ACTIVITY, as that will
just try to push the queue first.


3.1 Checking for errors
If a job fails while attempting to push or purge replication data changes,
errors will be written to the alertSID.log. Additional and more detailed
information will go into the following files referred to by the alert.log:

 - Pre V9    : SID_snpx_nnnnn.TRC
 - V9 and above : SID_cjq0_nnnnn.TRC and SID_jnnn_nnnn.TRC

The format of these files may vary between operating systems, their location
can be determined by running the following from SQLPLUS:

 SQL> show parameter dump_dest


3.2 Check job_queue_processes is set
The init.ora parameters that control the Oracle job queues must be set,
execute the following statement to check that they are correctly set at
each replication site:

 SQL> show parameter job

 NAME                             TYPE VALUE
 -------------------------------- ------- --------
 job_queue_interval                    integer 30
 job_queue_processes                      integer 4   <= must be > 0

job_queue_processes : A job queue process executes a single job at a time and
this parameter determines the maximum concurrent number of these. In most
replication environments configure this to be:

 = number of replicated destinations (or connection qualifier destinations)
  + 1 (for administrative requests)
  + enough to service other concurrent non replication jobs if they exist
job_queue_interval : This parameter only applies to versions prior to Oracle9i,
it specifies in seconds how frequently each SNPn background process of the
instance wakes up and checks if there is a job to run. Its default value is 60
and if jobs need to execute more frequently reduce this value.


3.3 Has a push / purge job been scheduled
For transactions to be propagated to remote master sites automatically, there
needs to be a job in the job queue to push them. There should be one push
job per remote master site, check the job exists with:

 column what format a64
 select job, what from dba_jobs where upper(what) like '%DBMS_DEFER_SYS%';

     JOB WHAT
 -------- ----------------------------------------------------------------
      43 declare rc binary_integer; begin rc := sys.dbms_defer_sys.purge(
         delay_seconds=>0); end;

     44 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(d
      estination=>'DB2.WORLD', stop_on_error=>FALSE, delay_seconds=>0,
       parallelism=>2); end;

     45 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(d
      estination=>'DB3.WORLD', stop_on_error=>FALSE, delay_seconds=>0,
       parallelism=>2); end;

If the push job exists it is important to know when it last ran and when it is
next scheduled to run, check the jobs schedule with:

 column dblink format a24
 column last_date format a9
 column next_date format a9
 column interval format a22

 select job push_job, dblink,
     SUBSTR(TO_CHAR(LAST_DATE,'MM/DD/YY HH24:MI:SS'),1,20) LAST_DATE,
     SUBSTR(TO_CHAR(NEXT_DATE,'MM/DD/YY HH24:MI:SS'),1,20) NEXT_DATE,
     interval
 from defschedule
 where job in (select job from dba_jobs
          where upper(what) like '%DBMS_DEFER_SYS.PUSH%');

   PUSH_JOB DBLINK                          LAST_DATE NEXT_DATE INTERVAL
 ---------- --------------------- --------- --------- ---------------------
        44 DB2.WORLD                     05/15/02 05/15/02 sysdate + 10/(60*24)
                             14:27:22 14:37:22

      45 DB3.WORLD       05/15/02 05/15/02 sysdate + 1/1440
                 14:30:33 14:31:33

There may be two reasons that the next_date has been passed. The first is the
job has failed, to check this see section 3.4. The second is that the job is
still running, to check this see section 3.6. It is important to note that the
next_date is set when the job completes, so if the interval is 10 minutes and
the job took 9 minutes to run the next_date will be 19 minutes from the time
the job started. Please refer to DBMS_JOB.CHANGE in [NOTE:61730.1] if the next
date of the job needs to be altered.

Oracle uses a lazy algorithm to purge deferred transactions from the local
queue. It is important that the purge job runs regularly to clear down this
queue because the same underlying table is used for transactions waiting to go
to remote sites as for those which have been pushed but not purged. If there
is a large number of transactions to be purged it can affect the performance
of propagation, check the scheduled purge job exists with:

 column last_date format a9
 column next_date format a9
 column interval format a40

 select job purge_job,
     SUBSTR(TO_CHAR(LAST_DATE,'MM/DD/YY HH24:MI:SS'),1,20) LAST_DATE,
     SUBSTR(TO_CHAR(NEXT_DATE,'MM/DD/YY HH24:MI:SS'),1,20) NEXT_DATE,
     interval
 from defschedule
 where job in (select job from dba_jobs
          where upper(what) like '%DBMS_DEFER_SYS.PURGE%');

  PURGE_JOB LAST_DATE NEXT_DATE INTERVAL
 ---------- --------- --------- ----------------------------------------
        43 05/15/02 05/15/02 /*1:Hr*/ sysdate + 1/24
          14:03:16 15:03:16


3.4 Check the status of the propagation jobs
If a job returns an error during execution, Oracle will try it again. The
first attempt is made after one minute, the second attempt after two minutes,
the third after four minutes, etc. This happens up to the original interval,
so if the interval is two minutes, its 1 minute, 2 minutes, 2 minutes, ... If
the job fails 16 times it is marked as broken, use the following to check how
many failures there have been and if the job has been marked as broken:

 column broken format a6
 select job, broken, failures from dba_jobs
  where upper(what) like '%DBMS_DEFER_SYS%';

       JOB BROKEN FAILURES
 ---------- ------ ----------
        26 N               0
        25 N               3

If the job is showing failures follow the advice in section 3.1, once the
underlying problem has been resolved unbreak the job with DBMS_JOB.BROKEN
(see [NOTE:1018453.102]). Please note that the job may show failures = 0
and broken = Y is the job was manually broken with DBMS_JOB.BROKEN.


3.5 Check the propagator and their private database links
The owner of the job that performs the push must be the replication propagator
and that user must have a private database link to the site where the job
is pushing data to. See section 2.4 for details of the required database links
and how to check they are working correctly.

Use the following SQL to check the propagator, links and push jobs all match
up:

 select * from defpropagator;

 USERNAME                                USERID STATUS CREATED
 ------------------------------ ---------- ------- ---------
 REPADMIN                                  31 VALID 18-APR-02

 select db_link from dba_db_links
 where owner = (select username from defpropagator);

 DB_LINK
 ----------------------------------
 DB2.WORLD
 DB3.WORLD


 column pushed_site_by_propagator format a25
 select job, dblink pushed_site_by_propagator
 from defschedule
 where job in (select job from dba_jobs
          where log_user in (select username from defpropagator)
            and upper(what) like '%DBMS_DEFER_SYS.PUSH%');

       JOB PUSHED_SITE_BY_PROPAGATOR
 ---------- -------------------------
        43 DB2.WORLD
        44 DB3.WORLD


3.6 Check if the push job is currently running
Oracle replication only allows a single push operation to run to a master site
at a time, although multiple push operations can occur concurrently but they
have to be to different remote master sites. If connection qualifiers are being
used then there can be multiple push jobs to the same master site, but with
different database links (see [NOTE:1024982.6]).

It may also be difficult to know if data is moving between replicated sites by
using the deftrandest table because new transactions will be added all the time
and for transactions with many calls they may take some time to process. The
following query identifies push jobs that are currently running:

 column dblink format a30
 select /*+ ORDERED */ j.job, j.sid, d.dblink,
     SUBSTR(TO_CHAR(J.THIS_DATE,'MM/DD/RRRR HH24:MI:SS'),1,20) START_DATE
 from defschedule d, dba_jobs_running j
  where j.job in (select job from dba_jobs
            where upper(what) like '%DBMS_DEFER_SYS.PUSH%')
   and j.job = d.job;

       JOB         SID DBLINK                            START_DATE
 ---------- ---------- ------------------------------ -------------------
       44          9 DB2.WORLD                          05/16/2002 12:14:47

When a job runs it create a Job Queue Lock to protect it from being run more
than once (i.e. run manually from a users session), there have been conditions
observed where the lock is still held by the Job Queue Process due to network
failures. Use the following query to identify Job Queue (JQ) lock:

 select s.sid, s.serial#, s.username
 from v$lock l, v$session s
 where l.type = 'JQ'
 and l.sid = s.sid;

       SID SERIAL# USERNAME
 ---------- ---------- ------------------------------
        9        164 REPADMIN

It is important to note that the above queries will not identify manual push
operations that have been initiated from a users session, see section 4.1
for more details of identifying these. After evaluating the above and section
4 it may be necessary to terminate the push process, please follow the steps
defined in section 3.7 to do this.


3.7 Terminating a deferred queue push job that is currently running
There will be situations where the running push job needs to be terminated
and prevented from running again, until the current problem that is being
encountered is resolved. Perform the following steps:

 - Break the job with :

  execute dbms_job.broken(<jobno>, true);

 - Kill the Job Queue Process from the Operating System. To do this use
   the sid from section 3.6 to identify the process in v$session, v$process
   and v$bgprocess. The process will generally be named SNPx or Jxxx.

 - After killing the process, wait approximately 1 minute, to ensure the
   job is removed from dba_jobs_running.

Killing the background process from the operating system will release the
Job Queue lock and the User Lock used to protect the push operation. Once
the underlying problem has been resolved by working through sections 4, 5
and 6, restart the jobs with:

 execute dbms_job.broken(<jobno>, false);



4. Diagnosing the progress and status of replicated transactions
The following section provides database administrators with a step by step
guide to diagnosing the progress of deferred transactions. Prior to following
these steps ensure the job queue is correctly configured, see section 3 above.

4.1 Verify transactions can be manually pushed
If it has been established that the job queue is configured correctly, but
still there are problems with propagation, the first step should be to analyse
the problem by trying a manual push of the deferred transaction queue.

Remember to make sure that the job that would normally perform the push is not
currently running and has been prevented from running when the manual push is
being tested (see sections 3.6 and 3.7). If this is not the case the manual
push will normally return immediately without pushing any rows.

Log on as the propagation user and run the following:

 declare x integer;
 begin
   x := sys.dbms_defer_sys.push('<destination>',...);
 end;
 /

If you get errors during the manual push, generally rows will not be pushed,
use MetaLink to search for known problems that relate to these errors.

If the manual push completes without errors but the entries in deftrandest
remain unchanged, the following could be the cause:

 - Another users session is performing the same push operation, when a
   manual push starts it allocates a User Lock to ensure there is only
   one push at a time, use [NOTE:1059290.6] to identify the blocking
   session.
 - BUG:734902 (fixed in 8.0.6), which may manifest itself as a hang.

If neither of the above are the cause, check through the following 4.x
sections and if the cause cannot be identified, raise a call with Oracle
Support Services.

If the manual push hangs check through the following 4.x sections and if they
do not identify the cause use section 5 to diagnose the problem.

4.2 Verifying transactions are being propagated
In Oracle release 7.x, the replication routines deleted all transactions from
the deferred transaction queue once they had been pushed. Where as from Oracle
8.x onwards transactions (and calls) are no longer immediately deleted after
application at the remote site, they are instead purged from the local queue
on a regular basis (interval).

There are two different views of the deferred transaction queue:

 - DEFTRAN : contains all unpurged transactions

 - DEFTRANDEST : contains all transaction that have not yet been pushed
   to a remote master site. Transactions appear once per master site they
   have to be pushed to.

To view all unpurged transactions:

 select * from deftran order by delivery_order;

 DEFERRED_TRAN_ID                            DELIVERY_ORDER D START_TIM
 ------------------------------ -------------- - ---------
 9.13.904                            280185 R 16-MAY-02
 1.5.559                            280654 R 16-MAY-02
 3.11.549                            281112 R 16-MAY-02

** Note the deftran view also includes transactions from the deferred error
  queue, see sections 4.3.1 and 4.4 for additional information.

To view all transactions yet to be propagated:

 column deferred_tran_id format a16
 column dblink format a30
 select * from deftrandest order by delivery_order;

 DEFERRED_TRAN_ID DELIVERY_ORDER DBLINK
 ---------------- -------------- ------------------------------
 1.5.559                  280654 DB2.WORLD
 3.11.549                  281112 DB2.WORLD

The transaction with the lowest delivery order will be the next transaction
to be pushed to the remote replication site. Prior to Oracle9 the DEFTRANDEST
view was the only way to identify how propagation was progressing. From
Oracle8 onwards use the following to investigate the current push:

 - Transactions that are currently being pushed appear in the target sites
   DEF$_ORIGIN table.
 - A transaction has been pushed if system.def$_destination.last_delivered
   is greater than system.def$_aqcall.cscn.

Oracle9 includes a mechanism for identifying how far through the current
transaction the current push is. In Oracle9 and above, use the following
query to identify if a transaction is being pushed and how many rows have
been pushed:

 column xid format a12
 column dblink format a30

 select /*+ ORDERED */ p.xid, p.dblink,
     (MAX(c.callno) + 1) "Calls in Tran",
     (p.sequence/(MAX(c.callno) + 1)) * 100 "% Processed Calls"
 from v$replprop p, defcall c
 where p.name like '%Slave%'
   and c.deferred_tran_id = p.xid
 group by p.xid, p.dblink, p.sequence;

 XID          DBLINK                          Calls in Tran % Processed Calls
 ------------ ------------------------------ ------------- -----------------
 10.20.556 DB2.WORLD                                          9999           32.446489
 6.6.581        DB2.WORLD                                   7000          15.153031

The current implementation of v$replprop only applies to transactions that are
pushed using parallel propagation, however Oracle recommends all customers use
parallel propagation. On systems that are CPU bound pushing the deferred queue,
it may be better to run the following query:

 column xid format a12
 column dblink format a30
 select p.xid, p.dblink,
     p.sequence "Processed Calls"
 from v$replprop p
 where p.name like '%Slave%';

In Oracle9 and above running the following query should assist database
administrators is monitoring the overall activity in the deferred transaction
queue:

 select * from v$replqueue;

 TXNS_ENQUEUED CALLS_ENQUEUED TXNS_PURGED LAST_ENQU LAST_PURG
 ------------- -------------- ----------- --------- ---------
        6543            21299           300 17-MAY-02 17-MAY-02

See Section 7 for an example of a transaction being propagated to a remote
site. If the current transaction being pushed appears to be hung or running
very slowly, check the following 4.x sections of this document, if they do
not help section 5 contains a more detailed analysis method.


4.3 Deferred transaction propagated but not applied to receiving site
If deferred transactions are leaving the local master site (being removed from
the DEFTRANDEST view) and the data changes do not appear to be applied at the
remote master site. The most likely cause is they are failing at the remote
site; the most common reasons for this are:

 - Database space management or rollback problems.
 - Data conflicts, where another user has updated a row that is being
   pushed to this replication site.

When the replication propagation mechanism encounters an error in applying a
transaction to a remote master site, it rolls back the transaction and writes
the whole transaction (all rows/calls) to the deferred error queue at the site
to which the transactions was being pushed. The whole transaction is copied
even if only one row causes the problem.


4.3.1 Checking for deferred errors in the queue
To obtain information of a transaction in error query deferror with:

 column deferred_tran_id heading 'Deferred|Transaction|ID' format a11
 column callno heading 'Call|Number' format 99999999
 column origin_tran_db heading 'Origin |Database' format a16
 column destination heading 'Destination|Database' format a16
 column time_of_error heading 'Date Of|Error' format a9
 column error_number heading 'Oracle|Error|Number' format 999999

 select deferred_tran_id, callno, origin_tran_db, destination,
     TO_CHAR(start_time, 'MM/DD/YY HH24:MI:SS') TIME_OF_ERROR,
     error_number
 from deferror
 order by start_time;
 Deferred                                                      Oracle
 Transaction          Call Origin            Destination          Date Of Error
 ID            Number Database                 Database             Error     Number
 ----------- --------- ---------------- ---------------- --------- -------
 5.11.152              0 DB3.WORLD                DB4.WORLD                05/16/02 1403
                                                20:03:07
 8.10.153              2 DB3.WORLD                DB4.WORLD                05/16/02 1403
                                                20:08:01

In general to resolve the errors that the above query returns, search MetaLink
for ORA-<Oracle Error Number>. If the error is an ORA-1403 as in the above
example data between the replication sites tables has diverged, this data
will have to be manually resynchronised and Oracle would recommend customers
implement a conflict resolution mechanism.

Use [NOTE:2065172.102] to discover details of the call that is in error.


4.3.2 Stop on Error
The majority of customers configure their push jobs with stop_on_error = false,
the consequence of this is:

 - If one transaction fails and is written to the error queue, then the
   following transactions succeed, there is a possibility that data at the
   remote master site could be logically inconsistent, because the earlier
   transaction has been overtaken.

If a customer has configured stop_on_error = true, then the first transaction
to fail will be written to the error queue and subsequent transactions will not
be pushed. This makes it much easier to resolve divergent data.

Use the following query to identify what stop_on_error has been set to:

 column what format a64
 select job, what from dba_jobs where upper(what) like '%STOP_ON_ERROR%';

       JOB WHAT
 ---------- ----------------------------------------------------------------
        44 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(d
          estination=>'DB2.WORLD', stop_on_error=>FALSE, delay_seconds=>0,
           parallelism=>2); end;



4.4 Verify the purge transaction operation
As stated in section 4.2 deferred transactions are not immediately purged from
the local replication site after they are propagated. It is very important
that the purge happens at a regular interval, such that the queue to purge
does not grow too large. If the queue to purge gets too large it will start to
impact the performance of propagation.


4.4.1 Verify transactions can be manually purged
Use section 3.3 to check that the purge job is correctly scheduled and section
3.4 to identify if the job is failing. If the purge job is failing or the
number of transactions to be purged does not appear to be decreasing follow
the steps in this section.

The purge job should be owned (log_user) by a replication administrator, so
log on as a replication administrator user (normally repadmin) and run the
following:

 declare x integer;
 begin
   x := sys.dbms_defer_sys.purge(delay_seconds=>0);
 end;
 /

If the manual purge raises errors, check MetaLink for likely causes, address
the errors and run the purge again. If the manual purge returns without error
follow the steps described in section 4.4.2 to check that transactions are
being correctly purged.


4.4.2 Verifying transactions are being purged
The easiest way to check if the queue of deferred transactions to be purged
is decreasing is to use the following query, which eliminates deferror entries
in the local deferred queue:

 select min(delivery_order) "Oldest Unpurged"
 from deftran
 where destination_list != 'D';

 Oldest Unpurged
 ---------------
        781633

In Oracle9 and above it may be better to run the following query particularly
if the deferred queue is very large, it shows the number of transactions that
have been purged since the database was last started:

 select txns_purged from v$replqueue;

 TXNS_PURGED
 -----------
        356

There are two types of purge that Oracle can perform:

 - Lazy purge (the default).
 - Precise purge.

The lazy purge will purge transactions with a system.def$_aqcall.cscn lower
than the local low water mark for propagated transactions (this is calculated
based on the minimum last_delivered in the local system.def$_destination).
This low water mark can be lower than some cscn numbers of some previously
pushed transactions, so they will not be purged immediately. This can happen
if not all the push jobs have run and still have active transactions. In this
case, the transactions will remain in def$_aqcall until the low water mark
rises above the cscn for the transaction.
A precise purge will purge transactions with a cscn lower than the low water
mark for propagated transactions to it's specific destination. This means that
the purge will query the last_delivered for each dblink destination. All
transactions that have been pushed from the local site to that destination
will usually fall below the low water mark and be purged. To perform the
precise purge execute:

 declare x integer;
 begin
   x := sys.dbms_defer_sys.purge(purge_method=>0);
 end;
 /


4.5 Large deferred transactions and slow propagation
The two main reasons for slow propagation of the deferred queue are:

 - Limited systems resources, usually caused by CPU or network bottlenecks.
 - Large deferred transaction or error queues.

For the replication propagation mechanism to achieve maximum throughput the
deferred queue needs to be kept as small as possible and transactions need to
be propagated at regular intervals. There are two types of large queue:

 - Queues with one or two transactions with tens or hundreds of thousands
   of calls, usually cause by bulk update operations (DML) or SQL loads.
 - Queues with tens or hundreds of thousands of transactions, usually
   caused by a failure in the propagation job due to network outage or
   space management issues at the remote site.

In the majority of cases the queue will have already become large before the
database administrator becomes aware of the problem and running queries against
the DEF.... views will prove difficult because with enormous queues the views
are slow.

The following queries should help the database administrator to make a decision
about what to do with the large deferred queue, please note on some systems it
may not be practical to run these queries.

Check how many rows are in the current or next transaction to be pushed:

 select d.deferred_tran_id, count(*) calls
 from defcall c, deftrandest d
 where d.deferred_tran_id = c.deferred_tran_id
 and d.delivery_order = (select min(delivery_order) from deftrandest)
 group by d.deferred_tran_id;

 DEFERRED_TRAN_ID                            CALLS
 ------------------------------ ----------
 1.0.704                            4999

Check how many rows exist in transactions to come (ordered by rows):

 select deferred_tran_id, count(callno) calls
 from defcall
 group by deferred_tran_id
 order by calls;

 DEFERRED_TRAN_ID                            CALLS
 ------------------------------ ----------
 1.0.704                            4999
 1.21.691                            3430
 1.7.725                             112
 10.0.669                             102

Oracle 9.x and above, check how many rows have been propagated from the
current transaction so far:

 column xid format a12
 column dblink format a30

 select p.xid, p.dblink,
     p.sequence "Processed Calls"
 from v$replprop p
 where p.name like '%Slave%';

 XID          DBLINK                          Processed Calls
 ------------ ------------------------------ ---------------
 9.13.1016 DB2.WORLD                                         4999
 4.10.628 DB2.WORLD                                          636

Oracle 9.x and above, check the overall number of transactions and rows that
have been queued since the instance was last started:

 select * from v$replqueue;

 TXNS_ENQUEUED CALLS_ENQUEUED TXNS_PURGED LAST_ENQU LAST_PURG
 ------------- -------------- ----------- --------- ---------
        72438          2179964           2400 22-MAY-02

After making the above assessment it may be the case that the only real option
is to clear down the deferred queue and manually resynchronise the data. See
Section 6 for details of how to perform this operation.

To avoid large deferred queues building up in the future:

 - Ensure conflict resolution handlers are defined for tables that receive
   large updates, by handling the conflicts we avoid the overhead of rolling
   back the transaction and re-pushing it into the remote error queue. This
   operation can take considerably longer than the original transaction.
 - Monitor the push and purge jobs with Enterprise Manager events, as soon as
   a failure occurs the database administrator will then be alerted and can
   address the problem before the queues build up.
 - There is no easy way to avoid large transactions that are generated by
   mistake or adhoc user access, but for planned batch operations consider
   using procedural replication.



5. Diagnosing hanging propagation
If after completing the analysis described in section 4, rows do not appear to
be moving between replication sites, perform the diagnostic steps described in
the section.


5.1 Check for locks
If a row being pushed to a remote master is locked by a users session for an
extended period of time, propagation will appear to be hung. Use the following
select statement to identify sessions blocking propagation at the site where
propagation is pushing data too:

 column holder format 99999
 column holder_name format a16
 column rowno format 9999999999
 column object_name format a25
 column owner format a16

 select w.holding_session holder,
     h.username holder_name,
     s.row_wait_row# rowno,
     o.object_name,
     o.owner
 from dba_waiters w, dba_objects o, v$session s, v$session h
 where w.waiting_session = s.sid
 and s.username = 'REPADMIN'
 and w.holding_session = h.sid
 and s.row_wait_obj# = o.object_id;

 HOLDER HOLDER_NAME                             ROWNO OBJECT_NAME               OWNER
 ------ ---------------- ----------- ------------------------- --------------
    13 SCOTT                         32 T2                      REPDBA

For this query to be executed successfully; replace 'REPADMIN' with the user
that the pushing site's replication propagator pushes to (receiver user) and
make sure CATBLOCK.SQL has been run.

If the ROWNO / OBJECT_NAME column does not change the blocking users session
will have to be killed to allow propagation to continue. Under some
circumstances more than one row may be returned because each query slave used
by parallel propagation opens a separate session at the remote site.

[NOTE:62354.1] and UTLLOCKT.SQL contain additional information about
identifying locks.


5.2 Check the wait events
If propagation is not waiting for a lock then the next step in diagnosing
the hang is to find out what the pushing and receiving sessions are waiting
for. The easiest way to do this is to use the V$SESSION_WAIT view; customers
should note that Job Queue processes (DBMS_JOB) are not always recorded in
this view.

Run the following query at the pushing site if propagation is being executed
from DBMS_JOB:

 column event format a44
 select sw.sid, sw.event, sw.p1, sw.p2
 from v$session_wait sw
 where sw.sid in ( select qs.sid
            from dba_jobs_running jr, dba_jobs j,
                v$px_session qs
             where jr.job = j.job
            and jr.sid = qs.qcsid
            and upper(j.what) like '%DBMS_DEFER_SYS.PUSH%')
 and sw.wait_time = 0;

Run the following query at the pushing site if propagation is being executed
by a users session, replace REPADMIN with the propagation user:

 column event format a44

 select sw.sid, sw.event, sw.p1, sw.p2
 from v$session_wait sw, v$session s
 where sw.sid = s.sid
 and sw.sid in (select qs.sid from v$px_session qs)
 and sw.wait_time = 0
 and s.username = 'REPADMIN';

Run the following query at the site where data is being pushed to and replace
'REPADMIN' with the user that the propagation user pushes to (receiver user),
note each parallel propagation slave will appear as a separate session:

 column event format a44

 select sw.sid, sw.event, sw.p1, sw.p2
 from v$session_wait sw, v$session s
 where sw.sid = s.sid
 and s.username = 'REPADMIN'
 and sw.wait_time = 0;

If the sessions appear to be stuck on the same event for a long period of time,
consult [NOTE:61998.1] or raise a call with Oracle Support Services.


5.3 Check if a large error is being queued
Writing failed replicated transactions to the remote deferred transaction
error queue is one of the longest operations replication can perform, failed
transactions often take four or more times as long to write to the error queue
as they did to create.

Section 4.3 can be used to identify when small transactions are being written
to the error queue, however it will not assist in identifying if there is a
single transaction with thousands of rows being written to the error queue. The
following query can be used to identify if a large error is being queued at
the remote master site:

 select count(*)
 from v$sqlarea a, v$session s
 where s.sql_address = a.address
   and s.sql_hash_value = a.hash_value
   and a.sql_text like '%DEF$_AQERROR%'
   and s.username = 'REPADMIN';
Replace 'REPADMIN' with the user that the pushing sites replication propagator
pushes to (receiver user).


5.4 Advanced Analysis
If after following all other sections in this article the push still seems to
be hung or stuck, collect the information described in this section and supply
it to Oracle Support Services. It may be necessary to raise a call with Oracle
Support Services to assist in collecting the required information.

The first step is to collect an errorstack from the pushing session, this must
include all query slaves used by the push. Use the query defined in section
5.2 to identify these.

The second step is to collect an errorstack from all sessions that are applying
changes at the remote site. Use the query defined in section 5.2 to identify
these.

Collect the errorstack by running the following from SQLPLUS for each session:

 connect .... as sysdba
 oradebug setospid <process ID>
 oradebug unlimit
 oradebug dump errorstack 3

As hung or stuck push operations may in fact be spinning or looping operations,
it may also be necessary to collect in depth sql_trace of all sessions listed
above. Collect the required trace by running the following from SQLPLUS for
each session:

 connect .... as sysdba
 oradebug setospid <process ID>
 oradebug unlimit
 oradebug dump 10046 12

If after analysis the sessions need to be killed refer to section 3.7.



6. How to clear down large deferred queues
Database administrators frequently have to make the decision to terminate
propagation and manually resynchronise their replicated environments when the
deferred queue has become large and or slow. See section 4.5 for additional
information.

Once the decision has been made to clear down the queue and resynchronise the
data, follow :

 [NOTE:190885.1] How to Clear Down the Deferred Queue and
         DBMS_DEFER_SYS.DELETE_TRAN

Oracle normally recommends customers use dbms_defer_sys.delete_tran with all
arguments set to null, to remove transactions from the deferred queue. However
when the queue is very large this may not be the most efficient mechanism for
clearing the queue.

The basic steps to clear down the queue are listed below, if the queue is
large, slow or hanging refer to [NOTE:190885.1]:

 - Terminate the current push operation and prevent it from re-running,
   if the push is running from dbms_job see section 3.7.
 - Stop remote sites from replicating to the local site.
 - DO NOT attempt to suspend or quiesce the replicated environment, that
   will try to push the queue again and introduce admin requests that also
   need to be cleaned out.
 - If the system schema has optimiser statistics defined, make sure they
   are up to date, by analyzing the schema with compute statistics. In
   general Oracle does not recommend adding statistics to the system
   schema tables.
 - From SQLPLUS :

   execute dbms_defer_sys.delete_tran(null, null);



7. Example of how deferred transactions are propagated
The following is an example of a transaction being replicated from ORA9I.WORLD
to REP9I.WORLD.


7.1 Data is inserted into the table:
 insert into scott.dept values (80,'MARKETING','ORLANDO');
 commit;
 insert into scott.dept values (90,'MARKETING','LONDON');
 commit;


7.2 Interrogating the deferred queue:
After applying DML to replicated tables, modifications are logged in the
system.def$_aqcall table. These changes can be seen through the deftran and
deftrandest views:

 select * from deftran;

 DEFERRED_TRAN_ID                            DELIVERY_ORDER D START_TIM
 ------------------------------ -------------- - ---------
 4.49.495                              479001 R 22-JAN-02
 3.0.494                              478972 R 22-JAN-02

 column deferred_tran_id format a10
 column dblink format a30
 select * from deftrandest order by delivery_order;

--OR--

 select enq_tid, cscn
 from system.def$_aqcall
 where cscn is not null
 order by cscn;
 DEFERRED_T DELIVERY_ORDER DBLINK
 ---------- -------------- ------------------------------
 4.49.495             479001 REP9I.WORLD


7.3 Identify the next transaction to be pushed:
Find out what the current LAST_DELIVERED value is for SYSTEM.DEF$_DESTINATION:

 select dblink, last_delivered from system.def$_destination;

 DBLINK                          LAST_DELIVERED
 ------------------------------ --------------
 REP9I.WORLD                                 478992
 ORA9I.WORLD                                 478878

At this point, DEFERRED_TRAN_ID OR ENQ_TID 4.49.495 has not yet been pushed
because its DELIVERY_ORDER OR CSCN of 479001 is greater than the current
LAST_DELIVERED value for all transactions going to REP9I.WORLD which is 478992.

DEF$_AQCALL.CSCN > DEF$_DESTINATION.LAST_DELIVERED means the transaction has
NOT been pushed.


7.4 Manually push transactions to REP901 and interrogate the deferred queue
Push the queue:

 ..... dbms_defer_sys.push(destination=>'REP9I.WORLD');

Check the views:

 select * from deftrandest;

 no rows selected

 select * from deftran;

 DEFERRED_TRAN_ID                            DELIVERY_ORDER D START_TIM
 ------------------------------ -------------- - ---------
 4.49.495                              479001 R 22-JAN-02
 3.0.494                              478972 R 22-JAN-02

 select enq_tid, cscn from system.def$_aqcall;

 ENQ_TID                               CSCN
 ------------------------------ ----------
 4.49.495                          479001
 3.0.494                          478972

As expected, the unpurged queue view deftran shows all transactions and the
deftrandest view is emply because all transactions have been pushed.

 select dblink, last_delivered from system.def$_destination;

 DBLINK                      LAST_DELIVERED
 ---------------------------- --------------
 REP9I.WORLD                                479017
 ORA9I.WORLD                                478878

The LAST_DELIVERED column has increased and all transactions with
DEF$_AQCALL.CSCN < DEF$_DESTINATION.LAST_DELIVERED have been pushed,
which is all of the transactions currently in deftran.


7.5 Identify unpurged transactions
The following query identifies the number of transactions that have been
pushed but not yet purged:

 select count(*)
 from system.def$_aqcall
 where cscn < (select last_delivered from
          system.def$_destination where dblink ='<DBLINK>');

   COUNT(*)
 ----------
        2

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:11
posted:11/4/2012
language:English
pages:21