pmon长期持有cache buffers chains导致实例hang住一例

前几日,有客户报一个备用库实例中有多个回话hang住的问题,在hang住前本地有维护人员执行了truncate表的操作。
同事前往客户现场进一步确认了问题,并传回了当时hang状况下的266级systemstate文件。该实例的并发回话数量较少,所以实例状态抓取后总的信息量并不多,这种情形中通过分析systemstate信息往往要好于分析hanganalyze信息。
通过著名的源自于metalink的ass awk脚本可以很快找出各进程的状态,以及重要资源的持有者:
awk -f ass109.awk systemstate.txt

Starting Systemstate 1
…………………………………..
Ass.Awk Version 1.0.9 – Processing systemstate.txt

System State 1
~~~~~~~~~~~~~~~~
1:
2: last wait for ‘pmon timer’
3: waiting for ‘rdbms ipc message’ wait
4: waiting for ‘rdbms ipc message’ wait
5: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait DBW0
6: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait DBW1
7: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait DBW2
8: waiting for ‘rdbms ipc message’ wait
9: waiting for ‘rdbms ipc message’ wait
10: waiting for ‘rdbms ipc message’ wait
11: waiting for ‘enq: RO – fast object reuse'[Enqueue RO-0001000B-00000001] wait
12: waiting for ‘rdbms ipc message’ wait
13: waiting for ‘rdbms ipc message’ wait
14: waiting for ‘rdbms ipc message’ wait
15: waiting for ‘rdbms ipc message’ wait
16: waiting for ‘SQL*Net message from client’ wait
17: waiting for ‘SQL*Net message from client’ wait
18: waiting for ‘SQL*Net message from client’ wait
19: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait
20: waiting for ‘SQL*Net message from client’ wait
21: waiting for ‘SQL*Net message from client’ wait
22: last wait for ‘SQL*Net message from client’ [DEAD]
Cmd: Delete
23: waiting for ‘SQL*Net message from client’ wait
24: waiting for ‘SQL*Net message from client’ wait
25:
26: waiting for ‘SQL*Net message from client’ wait
27: last wait for ‘ksdxexeotherwait’
28: waiting for ‘enq: TX – row lock contention'[Enqueue TX-00120003-00002CC0] wait
Cmd: Update
29: waiting for ‘SQL*Net message from client’ wait
30: waiting for ‘SQL*Net message from client’ wait
31: waiting for ‘SQL*Net message from client’ wait
32: waiting for ‘SQL*Net message from client’ wait
33: waiting for ‘SQL*Net message from client’ wait
34: waiting for ‘Streams AQ: qmn coordinator idle wait’ wait
35: for ‘Streams AQ: waiting for time management or cleanup tasks’ wait
36: waiting for ‘Streams AQ: qmn slave idle wait’ wait
37: waiting for ‘enq: RO – fast object reuse'[Enqueue RO-00010025-00000001] wait
38: waiting for ‘SQL*Net message from client’ wait
39: waiting for ‘SQL*Net message from client’ wait
41: waiting for ‘SQL*Net message from client’ wait
42: last wait for ‘enq: TX – row lock contention’ wait
Blockers
~~~~~~~~

Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of ‘???’ implies that the holder was not found in the
systemstate.

Resource Holder State
Latch 70000048c8c66b0 ??? Blocker
Enqueue RO-0001000B-00000001 10: waiting for ‘rdbms ipc message’
Enqueue RO-0001000B-00000001 11: 11: is waiting for 10: 11:
Enqueue TX-00120003-00002CC0 22: last wait for ‘SQL*Net message from client’
Enqueue RO-00010025-00000001 10: waiting for ‘rdbms ipc message’
Enqueue RO-00010025-00000001 37: 37: is waiting for 10: 37:

Object Names
~~~~~~~~~~~~
Latch 70000048c8c66b0 holding (efd=14) 70000048c8c66b0 Child ca
Enqueue RO-0001000B-00000001
Enqueue TX-00120003-00002CC0
Enqueue RO-00010025-00000001

首先注意到的是RO队列锁,RO即(REUSE OBJECT),该锁用以协调前台进程与后台进程DBWR和CKPT之间的工作,该队列一般只在drop或truncate对象时可见到。可以看到pid为11和37的进程均在等待pid为10的进程,因为pid=10的进程持有着它们锁需要的RO-0001000B-00000001和RO-00010025-00000001,分析dump文件可以发现该持有进程正是CKPT后台进程;

而该检查点进程也处于BUSY的非空闲等待中,其等待事件为’rdbms ipc message’,即它在等待另一个后台进程给它发送信息。这个时候我们来观察其他忙碌的后台进程可以发现,pid为5,6,7的进程均在等待同一个栓’latch: cache buffers chains'[Latch 70000048c8c66b0];这个三个进程均为DBWR进程,此外还有一个DBW3进程处于’rdbms ipc message’等待中,多个DBWR进程是由于设置了db_writer_processes参数;看起来是ckpt进程准备对需要truncate的对象做对象级别的检查点,以保证该对象所有脏块均已写到磁盘上,所以对dbwr进程发出需要写出的message,继而进入’rdbms ipc message’等待直到dbwr进程完成写出任务,但由于dbwr进程长期无法获取某脏块对应的latch: cache buffers chains,故写出工作一直处于pending状态,这样一个hang链就十分清晰了。

我们来分析’latch: cache buffers chains'[Latch 70000048c8c66b0]这个栓,ass分析systemstate dump时将该栓的holder归为’???’,即无法自dump文件中找到该栓的持有者;进一步直接分析dump文件可以发现:
PROCESS 2:
—————————————-
SO: 70000048f529d40, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: 70000047d0e75a0/70000048f8956d0, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 504403177803376304 122 2
last post received-location: kslges
last process to post me: 70000048e533a88 1 6
last post sent: 0 0 148
last post sent-location: ktmpsm
last process posted by me: 70000048e535228 1 22
(latch info) wait_event=0 bits=2
holding (efd=14) 70000048c8c66b0 Child cache buffers chains level=1 child#=61214
Location from where latch is held: kcbgcur: kslbegin:
Context saved from call: 336311247
state=busy(exclusive) (val=0x2000000000000002) holder orapid = 2
waiters [orapid (seconds since: put on list, posted, alive check)]:
5 (480, 1278471465, 0)
6 (480, 1278471465, 0)
19 (480, 1278471465, 0)
7 (216, 1278471465, 0)
waiter count=4
Process Group: DEFAULT, pseudo proc: 70000048e5f91d8
O/S info: user: oracle, term: UNKNOWN, ospid: 74476
OSD pid info: Unix process pid: 74476, image: oracle@DR_570 (PMON)

可以看到pid=2的PMON进程holding 该cache buffers chains子栓,而dbwx进程即waiters 5,6,7;

一般情况下pmon是不会去持有类似于cache buffers chains子栓这样的低级栓的,除非在cleanup失败会话或关闭实例情况下。从ass分析信息来看,当时确实有一个进程处于DEAD状态,即22: last wait for ‘SQL*Net message from client’ [DEAD],而该会话最后所做的是delete from “LINC”.”MSDB_ACCGL” t操作。

虽然无法证明,但极有可能是22号进程在进行delete过程中发生会话失败,PMON进程尝试清理该进程,并获取了相关栓。但该进程始终没有被杀死,即便使用OS 命令:kill -9 处理该进程后仍可以在systemstate中找到该进程的信息。据同事分析,当时之前曾有一度PMON的CPU使用率达到100%,之后PMON进程进入’pmon timer’空闲等待,且一直没有释放对应子栓,令DBWR进程处于长期无法获得栓资源的进而hang住的状态。

在MOS上搜索PMON+cache buffers chains可以发现几个PMON长期持有该类子栓且从不释放的Bug,但版本为Oracle 8等较老版本,且都是不能reproduceable的case。谨以录之:

Hdr: 4126734 8.1.7.4.0 RDBMS 8.1.7.4.0 BUFFER CACHE PRODID-5 PORTID-59
Abstract: ORACLE PROCESS GOES TO CPU LOOP WHILE HOLDING “CHILD CACHE BUFFERS CHAINS” LATCH

*** 01/17/05 07:08 pm ***
TAR:
—-
4249496.999

PROBLEM:
——–
1. Clear description of the problem encountered:

Oracle process goes to CPU loop while holding “Child cache buffers chains

level=1” latch. This condition causes other processes to wait on
either “latch free” or “Child cache buffers chains” which completely hang

the instance.

When looping Oracle process is killed, PMON also goes to CPU loop
cleaning
up that process. PMON also hold the same “Child cache buffers chains
level=1” latch while instance continues to hang.

Hang disappear only after instance is bounced.

2. Pertinent configuration information (MTS/OPS/distributed/etc)

11i Apps Install (version 11.5.9.) running on Oracle 8.1.7.4.0 database.

3. Indication of the frequency and predictability of the problem

Process running pl/sql procedure doing dbms_lob.instr() and
dbms_lob.read()
seem to encounter this problem.

4. Sequence of events leading to the problem

5. Technical impact on the customer. Include persistent after effects.

In a bad day, ct encounters 2 occurence of this problem a day causing
loss
of service in peak processing period.

DIAGNOSTIC ANALYSIS:
——————–
1. Got 3 systemstate dumps while hang is occuring
2. Got 3 errorstack dumps of PMON while it is looping in CPU

WORKAROUND:
———–
bounce instance

RELATED BUGS:
————-
bug 2361194

REPRODUCIBILITY:
—————-
– always reproducible on test system at ct site.

TEST CASE:
———-
– unable to reproduce on 8.1.7.4.0 test system in-house

STACK TRACE:
————
PMON stack while looping:
ksedmp ksdxfdmp ksdxcb sspuser _sigreturn kggchk kcbso1 kcbpsod kcbsod
kssxdl kssdch ksudlc kssxdl ksudlp kssxdl ksuxdl ksuxda ksucln ksbrdp
opirip opidrv sou2o main $START$

SUPPORTING INFORMATION:
———————–
Customers running Release 11i of the E-Business Suite will be entitled
to complimentary Extended Maintenance Support through July 31, 2005.

Comments

  1. admin says

    提交SR后,MOS的回复:
    Note: This is INTERNAL ONLY research. No action should be taken by the customer on this information.
    This is research only, and may NOT be applicable to your specific situation.

    KM SEARCH
    ————————–
    Keywords: pmon ‘cache buffer chains latch’ 10.2.0.3

    Bug 6113745: APPSST DSS 10G:LATCH: CACHE BUFFERS CHAINS CAUSE PMON TO STUCK 10.2.0.3
    Base Bug Bug 6110331: ORA-600 [KCBGET_37] / ORA-600 [1100] AND INSTANCE CRASH
    REDISCOVERY INFORMATION:
    If you are seeing any of the following –

    ORA-600 [kcbget_37]
    ORA-600 [1100] from pmon
    loop in kcbzchkpin
    hangs due to wait on cache buffer chains latch

    and the tracefile shows the kcbbh user chain is corrupted you are
    probably hitting this bug.

    KM SEARCH
    ————————–
    Keywords: ckpt “enq: RO – fast object reuse”

    Note:785232.1 CKPT causes another process to wait too long for ‘enq: RO – fast object reuse’

    Verified the issue in the file as noted below:

    HANGANALYZE
    ————————–
    Filename=oradebug hanganalyze 4.txt

    ==============
    HANG ANALYSIS:
    ==============
    Open chains found:
    Chain 1 : :
    <0/867/44431/0x8f52ec00/103586/No Wait>
    — <0/871/10764/0x8f5303a0/165738/enq: TX - row lock contention>
    Chain 2 : :
    <0/877/1/0x8f52bcc0/90552/rdbms ipc message>
    — <0/868/630/0x8e53b888/103120/enq: RO - fast object reuse>
    Chain 3 : :
    <0/885/1/0x8f529d40/74476/No Wait>
    — <0/844/45259/0x8e5371a8/205292/latch: cache buffers chains>
    Other chains found:
    Chain 4 : :
    <0/850/7151/0x8e539128/200996/No Wait>
    Chain 5 : :
    <0/852/30992/0x8e5f91d8//No Wait>
    Chain 6 : :
    <0/856/3/0x8f532320/189248/Streams AQ: qmn slave idle wait>
    Chain 7 : :
    <0/860/3/0x8f531b40/160672/Streams AQ: qmn coordinator idle>
    Chain 8 : :
    <0/870/6/0x8e53b0a8/200760/Streams AQ: waiting for time man>
    Chain 9 : :
    <0/876/1/0x8e535228/267588/enq: RO - fast object reuse>
    Chain 10 : :
    <0/880/1/0x8e534268/255314/latch: cache buffers chains>
    Chain 11 : :
    <0/881/1/0x8f52ad00/110806/latch: cache buffers chains>
    Chain 12 : :
    <0/882/1/0x8e533a88/243052/latch: cache buffers chains>

    Oracle Support – 08 July 2010 11:10:03 GMT+08:00
    === ODM Issue Clarification ===

    On : 10.2.0.3 version, RDBMS

    Instance hung, PMON holding a cache buffers chains , blocked DBW0, DBW1,DBW2.

    EXPECTED BEHAVIOR
    ———————————-
    Expect the instance does not hang.

    STEPS
    ———————–
    The issue can not be reproduced at will.

    BUSINESS IMPACT
    ———————–
    The issue has the following business impact:
    Due to this issue, the availability of the database is affected.

    Oracle Support – 08 July 2010 09:46:57 GMT+08:00
    Name
    ——–
    === ODM Data Collection ===

    === ODM Data Collection ===

    TRACE FILE
    ————————
    Filename=ms_pmon_74476.trc

    —– Call Stack Trace —–
    ksedst_tracecb <- ksdxfdmp <- ksdxcb <- sspuser <- 000044C0 <- kcbgcur <- ktbgcur <- kdiulk <- kcoubk <- ktundo <- ktubko <- ktuabt <- ktcrab <- ktcsod <- kssxdl <- kssdch_stage <- kssdch <- ksuxds <- kssxdl <- kssdch_stage <- kssdch <- ksudlp <- kssxdl <- ksuxdl <- ksuxda <- ksucln <- ksbrdp <- opirip <- opidrv <- sou2o <- opimai_real <- main <- start COMMENTS ------------------- From above stack, PMON is cleaning up dead process and latches, rolling back transaction. ALERT LOG ------------------------ Filename=alert_MS_full.log Tue Jun 22 22:22:37 2010 Errors in file /oracle/product/10.2.0/admin/MS/udump/ms_ora_127724.trc: ORA-00600: internal error code, arguments: [kcbget_37], [1], [], [], [], [], [], [] Tue Jun 22 22:22:59 2010 Errors in file /oracle/product/10.2.0/admin/MS/udump/ms_ora_127724.trc: ORA-00600: internal error code, arguments: [kcbso1_1], [], [], [], [], [], [], [] ORA-00600: internal error code, arguments: [kcbget_37], [1], [], [], [], [], [], [] Tue Jun 22 22:22:59 2010 Errors in file /oracle/product/10.2.0/admin/MS/udump/ms_ora_127724.trc: ORA-00600: internal error code, arguments: [kcbso1_1], [], [], [], [], [], [], [] ORA-00600: internal error code, arguments: [kcbso1_1], [], [], [], [], [], [], [] ORA-00600: internal error code, arguments: [kcbget_37], [1], [], [], [], [], [], [] Tue Jun 22 22:23:00 2010 Errors in file /oracle/product/10.2.0/admin/MS/udump/ms_ora_127724.trc: ORA-00600: internal error code, arguments: [kcbso1_1], [], [], [], [], [], [], [] ORA-00600: internal error code, arguments: [kcbso1_1], [], [], [], [], [], [], [] ORA-00600: internal error code, arguments: [kcbget_37], [1], [], [], [], [], [], [] Tue Jun 22 22:23:04 2010 Errors in file /oracle/product/10.2.0/admin/MS/bdump/ms_pmon_119002.trc: ORA-00600: internal error code, arguments: [kcbso1_1], [], [], [], [], [], [], [] Tue Jun 22 22:23:05 2010 Errors in file /oracle/product/10.2.0/admin/MS/bdump/ms_pmon_119002.trc: ORA-00600: internal error code, arguments: [kcbso1_1], [], [], [], [], [], [], [] Tue Jun 22 22:23:05 2010 PMON: terminating instance due to error 472 Instance terminated by PMON, pid = 119002 COMMENTS ---------------------- This database has crashed once due to the same error in Note:6110331.8 This issue is described in My Oracle Support Note: Note:6110331.8 Bug 6110331 - OERI[kcbget_37] / OERI[1100] / spin possible This customer hits Bug 6110331, which can be fixed on 10.2.0.3 on AIX 5L 64-bit by one-off patch 6110331. This Bug is also fixed on 10.2.0.4. Oracle Support - 08 July 2010 11:33:20 GMT+08:00 === ODM Proposed Solution === 1. Apply one-off patch 6110331. ~OR~ 2. Upgrade to 10.2.0.4. Oracle Support - 08 July 2010 11:32:37 GMT+08:00 === ODM Research === Note: This is INTERNAL ONLY research. No action should be taken by the customer on this information. This is research only, and may NOT be applicable to your specific situation. Bug Conflict Check --------------------------- Bug 6110331 has no conflicts with already applied patches. Oracle Support - 08 July 2010 11:30:20 GMT+08:00 === ODM Cause Justification === This issue is described in My Oracle Support's Note : Note:6110331.8 Bug 6110331 - OERI[kcbget_37] / OERI[1100] / spin possible For this customer, PMON stuck and holding a cache buffers chains , blocked DBW0, DBW1,DBW2. This symptom matches above note. Customer is on 10.2.0.3, which can be affected by Bug 6110331. Oracle Support - 08 July 2010 11:27:52 GMT+08:00 === ODM Cause Determination === The issue is caused by Bug 6110331.Corruption of a list of users for a buffer can occur leading to an ORA-600 [kcbget_37], ORA-600 [1100], spinning and/or hangs due to waits on cache buffer chains latch or other similar ORA-600 errors. Your issue is caused by Bug 6110331, to fix the issue, you can apply patch 6110331. To implement the solution, please execute the following steps: 1. Download and review the readme and pre-requisites for Patch.6110331. Download link: http://updates.oracle.com/download/6110331.html

    Choose 10.2.0.3
    Choose AIX 5L (64-bit)

    2. Ensure that you have taken a backup of your system before applying
    the recommended patch.

    3. Apply the patch in a test environment.

    4. Retest the issue.

    5. Migrate the solution as appropriate to other environments.

    MOS上这位support的观点是可以应用one-off patch 6110331 可以解决该问题,但因为该问题不是十分reproduceable ,所以很难验证可以完全解决问题。

Comment

*

沪ICP备14014813号-2

沪公网安备 31010802001379号