Troubleshooting Streams Capture when status is Paused For Flow Control [ID 746247.1]

Posted: 九月 15, 2011 in oracle, oracle document notes



Modified
09-JAN-2011     Type PROBLEM     Status PUBLISHED

In this Document
Symptoms
Changes
Cause
Solution
References


Applies to:

Oracle Server – Enterprise Edition – Version:
9.2.0.1 to 11.1.0.7 – Release: 9.2 to 11.1
Information in this document
applies to any platform.

Symptoms

For DBA and Support to diagnose
and resolve Streams Capture is stuck with Paused for Flow Control

Definition

This warning message typically
occurs when the target site is unable to keep up
with the rate of messages flowing from the source site.In 10g Flow control is
automatically enabled to reduce the rate at which events are being enqueued into
the staging area. But this message also indicate other problems which will be
covered in this note.

If “Streams AQ: enqueue blocked due to flow
control” occurs,it simply means that we cannot queue messages until the
components that had fallen behind catch up .As long as you see CAPTURE getting
in and out of FLOWCONTROl status,there should not be any
problem.

Implement Flow Control (9.2 source site
only):

When the threshold for memory of the buffer queue is
exceeded, Streams will write the messages to disk.This is sometimes referred to
as “spillover”. When this occurs, Streams can no longer take advantage of the
in-memory queue optimization. One technique to minimize this spillover is to
implement a form of flow control.
The manual scripts of Note 259609.1
Script to Prevent Excessive Spill of Message from the Streams Buffer Queue to
disk should only be used on Oracle 9.2 Streams source
databases (ie, databases running 9.2 streams capture processes).

In 9i
,implement a “heartbeat” table. To ensure that the applied_scn of the
DBA_CAPTURE view is updated periodically, implement a “heart beat” table.
Implementing a heartbeat table ensures that the metadata is updated frequently.
Additionally, the heartbeat table provides quick feedback as to the health of
the streams replication.
In Oracle10g source and target databases, Streams flow
control is automatic. However, if Streams is used in a 9iR2/10g combination with
9iR2 as the source site, it is imperative that the manual 9iR2 flow control
scripts be used at the source site.

//

Changes

Where and which view to
check the symptoms

a) The healthcheck script will show
capture status enabled and capture statistics as PAUSED FOR FLOW
CONTROL
. Example from SOURCE Healthcheck Script

Capture statistics:

C001 CAPUSR_CAP TURE 03:32:25 09/13/08
193 404 PAUSED FOR FLOW CONTROL

or from

SELECT CAPTURE_NAME,
STATE FROM V$STREAMS_CAPTURE;

CAPTURE_NAME
STATE
————– ————————
STREAM_CAPTURE PAUSED FOR
FLOW CONTROL

b) V$BUFFERED_SUBSCRIBERS may show TOO MANY
UNBROWSED MESSAGES
Example from SOURCE Healthcheck Script(HC)

++ BUFFERED PUBLISHERS ++

Queue Name:
STREAMS_QUEUE
SENDER_NAME: STREAM_CAPTURE
Current Number of Msgs in Queue:
15521
MEMORY_USAGE: 15
UNBROWSED_MSGS: 5001
PUBLISHER_STATE: IN FLOW
CONTROL: TOO MANY UNBROWSED MESSAGES

or from Flow Control Query:

COLUMN queue_schema
FORMAT A12
COLUMN queue_name FORMAT A20
SELECT queue_schema, queue_name,
unbrowsed_msgs, overspilled_msgs, memory_usage, publisher_state
FROM
V$BUFFERED_PUBLISHERS;

NOTE: The Flow Control query will
show:

* the queue owner and name
* the number of messages that have
been enqueued, but not browsed
* the number of messages that have been
spilled, but not browsed
* the percentage of the streams pool that is being
used (or 0 if there is no streams pool)
* the state of the publisher.
Possible values are:

– PUBLISHING MESSAGES (normal)
– IN FLOW CONTROL:
TOO MANY UNBROWSED MESSAGES
– IN FLOW CONTROL: OVERSPILLED MESSAGES
– IN
FLOW CONTROL: INSUFFICIENT MEMORY AND UNBROWSED MESSAGES

This last column
will really help to determine why flow control is being enabled for a system and
what can
be done to resolve the matter.

c) Propagation
schedule may throw Ora-25307 error
Example from SOURCE Healthcheck
Script(HC)

++ SCHEDULE FOR EACH PROPAGATION++
Propagation Name:
STREAM_PROPAGATE
Status : Enabled
Error Message: ORA-25307: Enqueue rate
too high, flow control enabled

Cause

The capture
process is unable to enqueue LCRs either because of low memory or because
propagation and apply process are consuming messages slower than the capture
process is creating them.

1) INIT.ORA parameter
problem

Inadequate Memory esp., Streams pool size is small or not
set

2) If AQ_TM_PROCESSES parameter is set too high
in 9i and 10g

For 10g/11g unset AQ_TM_PROCESSES  and QMON is automatically
configured (auto-tune).

For 9i and10g/11g if the parameter is explicitly
set to zero, then QMON processes are not created.  You should never disable the
Queue Monitor processes by setting aq_tm_processes=0

To know more on
this parameter See
Note:305662.1
– Queue Monitor Process: Architecture and Known
Issues

3) Capture parameter setting have
not been done properly .

4) Publisher has enqueued a lot more
messages than have been browsed.

(>1000 unbrowsed messages)

5) High
latency

High Latency can be due to long-running transactions,many dependent
transactions, or  slow Capture,Propagation, or Apply processes.In some
situations, propagation may become disabled (if the number of failures is (16).
Also check Propagation latency from
dba_queue_schedules;

6) Check Apply
latency

select
(apply_time-applied_message_create_time)*86400
"latency_in_seconds",
(sysdate-apply_time)*86400"latency_in_seconds",
to_char(applied_message_create_time,'hh24:mi:ss
mm/dd/yy') "event_creation",
to_char(apply_time,'hh24:mi:ss mm/dd/yy')
"apply_time"
from dba_apply_progress;

7)
Long running Transactions
Check that there is no occurrence
for long running transaction entries with the alert log file,
it will be
something like that C002: long running txn detected, xid:
0x0008.010.00084261

8) Some known bugs

a) Bug 7206332
PROPAGATION NOT HAPPENING–ASSOCIATED J00X PROCESSES STUCK (Suspended, Req’d
Info not Available)

b) Bug 6054573
ENQUEUE: MEMORY FLOW CONTROL ; CAPTURE PERFORMANCE DEGRADATION ON CDC
Closed
as a duplicate of Unpublished Bug 5093060
STREAMS: 5000 LCR LIMIT IS CAUSING UNNECESSARY FLOW CONTROL AT APPLY SITE

Fixed-Releases: A204 B106

c) Streams pool memory accounting can
underflow, leading to capture staying in ‘PAUSED FOR FLOW CONTROL’ state
indefinitely.
If you see capture in ‘PAUSED FOR FLOW CONTROL’ even in a low
workload environment, and x$knlasg has fields such as memory_allocated_knlasg
that are very large, you may have hit this issue. The information in x$knlasg is
recorded in the Streams healthcheck section entitled ++ Streams Pool memory
Information ++ as per Note 273674.1 for 10.2 onwards. The cause has been
identifed and fixed in Bug 7719668
STREAMS MEMORY ACCOUNTING INCORRECT, LEADING TO CAPTURE ‘PAUSED IN FLOW
CONTROL’  Fixed In Ver: 11.2

9) No proper
patches for specific database version installed
.

10) One off case: When the capture process was
recreated, one of the customer did not recreate the
associated queues
and queue tables, which retained the messages from the previous capture process.
This is the reason for CAPTURE to go into FLOW CONTROL.

11)
In a scenario with Multiple Destinations  we can get the ‘pause’
message when one destination is down.This is the correct behaviour. There are
two common reasons for source-queue growth:

Suppose a source queue is
propagating captured messages to multiple destination queues, and one     or
more destination databases acknowledge successful propagation of messages much
more slowly   than the other queues. In this case, the source queue can grow
because the slower destination databases create a backlog of messages that have
already been acknowledged by the faster       destination databases. In such an
environment, consider creating more than one capture process to      capture
changes at the source database. Doing so lets you use one source queue for the
slower
destination databases and another source queue for the faster
destination databases.

12) Capture process is stopped
on PAUSED FOR FLOW CONTROL and latency is increasing continuously.Capture
process is not enabled for Combined Capture Apply (CCA) (only available between
11G releases onwards).

Propagation job is dying as soon as it starts
without no error or a small log on alert.log file:

Tue Sep 07 12:55:35
2010
i_sched_prop: DEBUG : removed job
Tue Sep 07 12:55:55
2010
i_sched_prop: DEBUG : removed job

Running following
query several time, it is possible to observe this propagation job
behavior:

SELECT
p.propagation_name,
DECODE(s.SCHEDULE_DISABLED,
'Y', 'Disabled',
'N',
'Enabled') SCHEDULE_DISABLED,
s.PROCESS_NAME,
s.total_bytes,
s.FAILURES,
s.message_delivery_mode,
s.LAST_ERROR_MSG
FROM
DBA_QUEUE_SCHEDULES s, DBA_PROPAGATION p
WHERE p.DESTINATION_DBLINK
=
NVL(REGEXP_SUBSTR(s.destination, '[^@]+', 1, 2), s.destination)
AND
s.SCHEMA = p.SOURCE_QUEUE_OWNER
AND s.QNAME = p.SOURCE_QUEUE_NAME order by
message_delivery_mode, propagation_name;

Solution

Diagnostics and
Resolution

1) INIT.ora
changes

Increase the STREAMS_POOL_SIZE
on capture and apply site It is required to have a value of 200MB at least as
per recommendation For example:

CONNECT / AS SYSDBA
ALTER SYSTEM SET
STREAMS_POOL_SIZE=500M;

This will re-enable the capture from
the ‘PAUSED FOR FLOW CONTROL’ status.

2)
Unset AQ_TM_PROCESSES

In 10g
onwards this parameter should be unset on both the sites.

ALTER
SYSTEM RESET AQ_TM_PROCESSES SCOPE=SPFILE SID=’*’;

High AQ_TM_PROCESSES
value causes the acknowledgments not to receive from apply site.


3) Follow
Section 5 “Capture Process Configuration”of NOTE:
298877.1
– 10gR1 Streams Recommended Configuration

Example
at SOURCE site:

execute
dbms_capture_adm.set_parameter('capture_name','_CHECKPOINT_FREQUENCY','100');
execute
dbms_capture_adm.set_parameter('capture_name','_SGA_SIZE','50');

Restart(stop
and start) the apply then capture.
If you have a lot of memory to work you
could even set _SGA_SIZE’ to something like 200.

4) If Status is
IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES

This situation
could be a propagation problem

a) Stop capture that is in Flow Control
via DBMS_CAPTURE_ADM.STOP_CAPTURE

b) Determine the J process associated
with propagation via DBA_QUEUE_SCHEDULES, ie:

select schema, qname,
destination, schedule_disabled, process_name from
dba_queue_schedules;

c) Stop propagation via
DBMS_PROPAGATION_ADM.STOP_PROPAGATION (Set FORCE option to TRUE if
needed)

d) Check DBA_QUEUE_SCHEDULES / DBA_PROPAGATION to make sure the
propagation has been stopped

e) Wait few minutes and make sure that the J
Process associated to the Propagation that has been stopped is gone .If not use
use kill -9 to kill the process at OS.

f) Set job_queue_processes to 0,
wait few minutes, make sure the CJQ process and other J processes are gone (at
OS level), then set JOB_QUEUE_PROCESSES back to the original
value

g) Use DBMS_PROPAGATION_ADM.START_PROPAGATION to start
Propagation

h) Check DBA_QUEUE_SCHEDULES / DBA_PROPAGATION to make sure
the propagation has been started and it has a J process associated.

i)
Start CAPTURE via DBMS_CAPTURE_ADM.START_CAPTURE

j) Check
DBA_QUEUE_SCHEDULES columns TOTAL_NUMBER, TOTAL_BYTES …

If propagation is happening, the numbers would be
changing..

k) At SOURCE and TARGET databases, check
V$BUFFERED_QUEUES.

l) In case PROPAGATION once more seems hanging and not
propagating LCRs then find the associated  J process and upload to Support along
with HC scripts:

– errostacks with 3 minutes interval
– 10046 and 24040
propagation level 10 trace .

From sqlplus as any AS SYSDBA
user :

spool
&spoolfilename

set echo on
set pagesize 1000

column
destination format a25
column session_id format a10
column
destination_dblink format a25
column source_queue_owner format a20
column
source_queue_name format a25

select
schema,qname,destination,latency,process_name,session_id,instance,message_delivery_mode
from
dba_queue_schedules order by message_delivery_mode;
SELECT
source_queue_owner,source_queue_name,propagation_name,
destination_dblink,queue_to_queue,status
from dba_propagation order by propagation_name;

oradebug setospid
<ospid_of_hung_propagation_job>
oradebug unlimit

oradebug dump
errorstack 10
<< wait one minute >>
oradebug dump errorstack
10
<< wait one minute >>
oradebug dump errorstack
10

oradebug tracefile_name

oradebug Event 10046 trace name context
forever, level 12
oradebug Event 24040 trace name context forever, level
10

<< wait 10-15 minutes >>

oradebug dump errorstack
10

oradebug short_stack

-- repeat the short_stack command 10-15
times in rapid succession

oradebug dump errorstack 10
oradebug Event
24040 trace name context off
oradebug Event 10046 trace name context
off
exit;

If the state is IN FLOW
CONTROL: TOO MANY UNBROWSED MESSAGES, then you can run the following query to
determine which subscriber is not dequeuing messagesor lagging behind from the
queue:

SELECT subscriber_name,
cnum_msgs, total_dequeued_msg, total_spilled_msg
FROM
V$BUFFERED_SUBSCRIBERS;

A comparison of the total
number of messages enqueued for the subscriber (CNUM_MSG)
and the number of
messages dequeued by that subscriber (TOTAL_DEQUEUED_MSG) should indicate which
subscriber has fallen behind.

5) Resolve latency of
Propagation

If the propagation is disabled(broken), re-enable it
manually.
To reduce the latency of propagation between jobs, set the hidden
parameter_job_queue_interval = 1. This should be done as an init.ora parameter
or an spfile parameter.Because it is a hidden parameter, the database must be
restarted in order for the value to take effect. The default for
_job_queue_interval is 5 seconds.

In some situations, when propagation
becomes disabled (if the number of failures is 16),re-enable the propagation
manually.For example, if the queue name is STREAMS_QUEUE owned by STRMADMIN and
the destination database link is ORCL2.WORLD, syntax would be:

execute
dbms_aqadm.disable_propagation_schedule
('STRMADMIN.STREAMS_QUEUE','ORCL2.WORLD');
execute
dbms.aqadm.enable_propagation_schedule ('STRMADMIN.STREAMS_QUEUE','ORCL2.WORLD'
);
execute dbms_aqadm.alter_propagation_schedule
('STRMADMIN.STREAMS_QUEUE','ORCL2.WORLD',latency=>5);

Propagation
and the capture process will be resumed automatically when the target site is able to accept
more messages.

6) Check Apply parameters

Set the
appropriate parameters for apply process.For 10.2 set the following apply
parameters.
Disable the capture process and the apply process as well

execute DBMS_APPLY_ADM.Set_parameter(‘applyName’,’parallelism’,’4’)
execute DBMS_APPLY_ADM.Set_parameter(‘applyName’,’_dynamic_stmts’,’Y’)
execute DBMS_APPLY_ADM.Set_parameter(‘applyName’,’_hash_table_size’,’1000000’)
execute DBMS_APPLY_ADM.Set_parameter(‘applyName’,’disable_on_error’,’N’)
execute DBMS_APPLY_ADM.Set_parameter(‘applyName’,’_txn_buffer_size’,10*parallelism’);

NOTE:315666.1
– How To Set The _TXN_BUFFER_SIZE Parameter For a Streams Apply

SELECT
APPLY_NAME,PARAMETER,VALUE FROM
ALL_APPLY_PARAMETERS;

Enable the apply processes
followed by Capture process


7) Check for long running
Transactions:
Identify the transaction id for that long running
transaction using the following query and ignore the
transaction if it
causes high latency

prompt
prompt ++
Current Long Running Transactions ++
prompt Current transactions open for
more than 20 minutes
prompt

col runlength HEAD 'Txn Open|Minutes'
format 9999.99
col sid HEAD 'Session' format a13
col xid HEAD
'Transaction|ID' format a18
col terminal HEAD 'Terminal' format a10
col
program HEAD 'Program' format a27 wrap

select t.inst_id,
sid||','||serial# sid,xidusn||'.'||xidslot||'.'||xidsqn xid, (sysdate -
start_date )* 1440
runlength ,terminal,program from gv$transaction t,
gv$session s where t.addr=s.taddr
and (sysdate - start_date) * 1440 >
20;

For ignoring transaction at capture , use
the following steps on source site:

execute dbms_capture_adm.stop_capture('CAPTURE_NAME');
execute dbms_capture_adm.set_parameter('CAPTURE_NAME','_ignore_transaction','TXN_ID');
execute dbms_capture_adm.start_capture('CAPTURE_NAME');


8)

Solutions for some known bugs

    ORA-25307:
Enqueue rate too high, flow control enabled is reported after enqueueing
5000
user-enqueued buffered messages.

By default flow
control kicks for user-enqueued events when there are 5000 unbrowsed messages in
a
queue. The default behaviour can be changed in 10.2 by applying Patch 5093060 which allows
you to manipulate when
buffer publisher flow control is activated

The fix is event
based in 10.2.0.3 and it introduces two new events:
a) 10867 : This
event controls flow control threshold values for any buffered message
publisher.
The level of the event decides the threshold
value.
b) 10868 : This event controls flow control threshold values for
only capture publisher (the Capture    process).The level of the event decides
the threshold value.
When any of the above events are not set or reset
using “context off” command, the threshold values
falls  back to
default values which are:
1) Buffered publisher : 5000
2)
Capture publisher : 15000

The fix for unpublished bug
5093060 is present in 10.2.0.4 onwards and can be activated by setting the

following hidden parameters, i.e.,

alter system set
"_capture_publisher_flow_control_threshold"=10000 scope=memory;
alter system
set "_buffered_publisher_flow_control_threshold"=10000 scope=memory;

From 10.2.0.1 to 10.2.0.3 ,enable the flow control fix for
user enqueued messages.

connect / as
sysdba
--Setting the value dynamically
alter system set events '10867
trace name context forever, level 15000';
--Setting back to the default
5000
alter system set events '10867 trace name context off';
--Setting in
the spfile
alter system set event='10867 trace name context forever, level
15000' scope=spfile;
--Unsetting in the spfile
alter system reset event
scope=spfile sid='*';
Setting in a pfile

event=’10867 trace name
context forever, level 15000′

Setting any of these will require a bounce
of the database.

Download and apply Patch 7719668 if you are hitting the Bug 7719668 which is
fixed in  11.2
Backport can be made availalble on request for the latest
patchset

9) Apply recommended
patches

Finally make sure you have all the
patches applied for the database version as per Note:437838.1
“Recommended Patches for Streams”

10) The following should be the order
in which you can drop these components.
1  Drop apply.
2. Drop
capture
3. Drop propagation
4. Drop queue tables at APPLY and
CAPTURE sites using the following command

exec
dbms_aqadm.drop_queue_Table('<owner.queue_Table_name>',force=>TRUE);

5. Then recreate the components in the following. sequence

a) recreate the queue
table/queue at apply and capture sites
b) recreate APPLY.
c) recreate
capture
d) recreate propagation.

11)  Create Separate queues

The recommendation for sending LCRs to multiple destinations,and one
destination is constantly
slower than the others in applying LCRs
(or has intermittent network problems), then we  should
configure
a separate queue for that destination to avoid impacting other sites.

12) On Stream Health Check report generated using Doc Id. 273674.1,
following can be observed

a) That there is no propagation process
associated to your propagation although propagation is enabled. On Streams HC
see:
Statistics > Propagation > SCHEDULE FOR EACH
PROPAGATION

b)  That queue where propagation propagates from are not
recovered properly.
On Streams HC you can see a message like the
following:

Analysis > Notification > SYS Checks

+ ERROR: Queue
'<SCHEMA>.<QUEUE_NAME> has not been recovered normally
+ Force
recovery by altering the queue ownership to another instance.
+ Use the
DBMS_AQADM.ALTER_QUEUE_TABLE procedure to specify a different
instance.

This problem is not applicable to Combined Capture
Apply(CCA) propagation issues.

To solve the problem, it has been tried to
recover the queue manually :

On RAC, by altering queue table ownership
using DBMS_AQADM.ALTER_QUEUE_TABLE API as notification explains.

On both
(RAC and single instance), by :

* Stopping associated Capture and
Propagations
* Stopping and starting the queue using DBMS_AQADM
procedures
* Starting associated Capture and Propagations

But despite
of this queue table is still in a not recovered state. It is possible to check
this by running the query

select queue_name,
flags from x$buffered_queues where flags=1 and
queue_schema='<OWNER>';

If a your queue is returned then
the queue has not been recovered properly.
Recover the queue as
follows:

1. Stop all Streams processes (capture, propagation etc)
associated to the queue.
2. ALTER SYSTEM RESET aq_tm_processes SCOPE=SPFILE
SID=’*’ ;
3. Bounce the instance
4. Start all streams processes (capture,
propagation etc) associated to the queue.
5. Check if queues are recovered by
doing:
select queue_name, flags from x$buffered_queues where flags=1 and
queue_schema='<OWNER>’;

If a queue is returned then the queue
has not been recovered properl
Contact Oracle Support for further
diagnosis

//

References

NOTE:298877.1
– 10gR1 Streams Recommended Configuration
NOTE:315666.1
– How To Set The _TXN_BUFFER_SIZE Parameter For a Streams Apply Process
NOTE:335516.1
– Master Note for Streams Performance Recommendations
NOTE:437838.1
– Streams Specific Patches
NOTE:471713.1
– Different States of Capture & Apply Process
NOTE:729860.1
– Troubleshooting Queries in Streams
NOTE:730036.1
– Overview for Troubleshooting Streams Performance Issues
http://st-doc.us.oracle.com/9.0/920/server.920/a96571/prop.htm#46370
NOTE:1277775.1
– Streams Capture process is paused for flow control. Alert log shows ORA-4031
(“streams pool”,”unknown object”,”streams pool”,”sob_kgqmrec”)” followed by
ORA-600[4450

Show Related Information
Related


Products


  • Oracle Database Products > Oracle Database > Oracle Database >
    Oracle Server – Enterprise Edition

Keywords


V$BUFFERED_SUBSCRIBERS; HANGING; PROPAGATE; LATENCY;
AQ_TM_PROCESSES; FLOW CONTROL; ORA-25307

Errors


ORA-25307; 25307 ERROR

Back to topBack to
top

Copyright (c) 2007, 2010, Oracle. All rights reserved.
Legal Notices and Terms of Use | Privacy Statement
Article Rating
Rate this document
Excellent
Good
Poor
Did this document
help you?
Yes
No
Just browsing
How easy was it to
find this document?
Very easy
Somewhat easy
Not easy
Comments

Provide feedback for this article. Please use ‘Contact Us’ for other feedback.Important Note: this feedback may be
anonymously visible to other customers until processed by Oracle
Support.

Cancel

//  

<!–

746247.1
–>

发表评论

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / 更改 )

Twitter picture

You are commenting using your Twitter account. Log Out / 更改 )

Facebook photo

You are commenting using your Facebook account. Log Out / 更改 )

Google+ photo

You are commenting using your Google+ account. Log Out / 更改 )

Connecting to %s