巧用close_trace命令释放误删trace文件

可能很多朋友都遇到过这样的情况,在UNIX/Linux上定期清理Oracle日志文件夹时可能删除到仍被后台进程open着的trace文件,即某些后台进程一直持有着这些”被已经误删了的“打开文件的描述符(fd),这种情况下文件系统上该文件实际占用的空间是不会被释放的,这就造成使用df命令查看文件系统剩余空间和用du命令查看文件夹空间使用量时数值不一致的问题。此外因为是后台进程持有这些打开文件描述符,所以我们无法像kill服务进程一样来解决该问题(部分后台进程是可以kill的,不建议这样做)。oradebug是sqlplus中威力强大的debug命令,我们可以通过该命令发起多种trace/dump,其中也包括了close_trace事件;close_trace事件可以让指定进程关闭其正持有的trace文件。

下面我们就来演示下相关操作:

[maclean@rh2 ~]$ ps -ef|grep ora_|grep -v grep
maclean   7281     1  0 16:35 ?        00:00:00 ora_pmon_PROD
maclean   7283     1  0 16:35 ?        00:00:00 ora_psp0_PROD
maclean   7285     1  0 16:35 ?        00:00:00 ora_mman_PROD
maclean   7287     1  0 16:35 ?        00:00:00 ora_dbw0_PROD
maclean   7289     1  0 16:35 ?        00:00:00 ora_lgwr_PROD
maclean   7291     1  0 16:35 ?        00:00:00 ora_ckpt_PROD
maclean   7293     1  0 16:35 ?        00:00:00 ora_smon_PROD
maclean   7295     1  0 16:35 ?        00:00:00 ora_reco_PROD
maclean   7297     1  0 16:35 ?        00:00:00 ora_cjq0_PROD
maclean   7299     1  0 16:35 ?        00:00:00 ora_mmon_PROD
maclean   7301     1  0 16:35 ?        00:00:00 ora_mmnl_PROD
maclean   7303     1  0 16:35 ?        00:00:00 ora_d000_PROD
maclean   7305     1  0 16:35 ?        00:00:00 ora_s000_PROD
maclean   7313     1  0 16:35 ?        00:00:00 ora_qmnc_PROD
maclean   7430     1  0 16:35 ?        00:00:00 ora_q000_PROD
maclean   7438     1  0 16:36 ?        00:00:00 ora_q001_PROD

/* lgwr是著名的Oracle后台进程,在这个启动的实例中其系统进程号为7289*/

[maclean@rh2 ~]$ ls -l /proc/7289/fd        /* linux上的proc文件系统可以很方便我们探测进程信息*/
total 0
lr-x------ 1 maclean oinstall 64 Jul 26 16:38 0 -> /dev/null
lr-x------ 1 maclean oinstall 64 Jul 26 16:38 1 -> /dev/null
lr-x------ 1 maclean oinstall 64 Jul 26 16:38 10 -> /dev/zero
lr-x------ 1 maclean oinstall 64 Jul 26 16:38 11 -> /dev/zero
lr-x------ 1 maclean oinstall 64 Jul 26 16:38 12 -> /s01/rac10g/rdbms/mesg/oraus.msb
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 13 -> /s01/rac10g/dbs/hc_PROD.dat
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 14 -> /s01/rac10g/dbs/lkPROD
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 15 -> /s01/rac10g/oradata/PROD/controlfile/o1_mf_64q6xphj_.ctl
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 16 -> /s01/rac10g/flash_recovery_area/PROD/controlfile/o1_mf_64q6xpms_.ctl
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 17 -> /s01/rac10g/oradata/PROD/onlinelog/o1_mf_1_64q6xrsr_.log
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 18 -> /s01/rac10g/flash_recovery_area/PROD/onlinelog/o1_mf_1_64q6xsoy_.log
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 19 -> /s01/rac10g/oradata/PROD/onlinelog/o1_mf_2_64q6xths_.log
l-wx------ 1 maclean oinstall 64 Jul 26 16:38 2 -> /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 20 -> /s01/rac10g/flash_recovery_area/PROD/onlinelog/o1_mf_2_64q6xv9o_.log
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 21 -> /s01/rac10g/oradata/PROD/onlinelog/o1_mf_3_64q6xw1b_.log
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 22 -> /s01/rac10g/flash_recovery_area/PROD/onlinelog/o1_mf_3_64q6xwv0_.log
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 23 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_system_64q6wd5j_.dbf
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 24 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_undotbs1_64q6wd7f_.dbf
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 25 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_sysaux_64q6wd5m_.dbf
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 26 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_users_64q6wd89_.dbf
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 27 -> /s01/rac10g/oradata/PROD/datafile/o1_mf_temp_64q6xyox_.tmp
lr-x------ 1 maclean oinstall 64 Jul 26 16:38 28 -> /s01/rac10g/rdbms/mesg/oraus.msb
lr-x------ 1 maclean oinstall 64 Jul 26 16:38 3 -> /dev/null
lr-x------ 1 maclean oinstall 64 Jul 26 16:38 4 -> /dev/null
l-wx------ 1 maclean oinstall 64 Jul 26 16:38 5 -> /s01/rac10g/admin/PROD/udump/prod_ora_7279.trc
l-wx------ 1 maclean oinstall 64 Jul 26 16:38 6 -> /s01/rac10g/admin/PROD/bdump/alert_PROD.log
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 7 -> /s01/rac10g/dbs/lkinstPROD (deleted)
lrwx------ 1 maclean oinstall 64 Jul 26 16:38 8 -> /s01/rac10g/dbs/hc_PROD.dat
l-wx------ 1 maclean oinstall 64 Jul 26 16:38 9 -> /s01/rac10g/admin/PROD/bdump/alert_PROD.log

/*可以看到lgwr进程相关trace文件为/s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc,对应打开文件描述符为2*/

[maclean@rh2 ~]$ ls -lh /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc
-rw-r----- 1 maclean oinstall 1.7M Jul 26 16:37 /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc

[maclean@rh2 ~]$ rm -f /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc

/*尝试删除该trace文件*/

[maclean@rh2 ~]$ ls -l /proc/7289/fd|grep lgwr
l-wx------ 1 maclean oinstall 64 Jul 26 16:38 2 -> /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc (deleted)

/*文件已处在deleted状态,但lgwr进程仍持有该文件相关的文件描述符,这个时候该文件占有的空间并不会被释放*/

[maclean@rh2 ~]$ lsof|grep lgwr
oracle 7289   maclean    2w   REG 8,2   1702391 3867134 /s01/rac10g/admin/PROD/bdump/prod_lgwr_7289.trc (deleted)

[maclean@rh2 ~]$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.5.0 - Production on Mon Jul 26 17:03:04 2010

Copyright (c) 1982, 2010, Oracle.  All Rights Reserved.


Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> oradebug setospid 7289;
Oracle pid: 6, Unix process pid: 7289, image: oracle@rh2 (LGWR)
SQL> oradebug flush;             /*写出trace buffer内容到trace文件*/
Statement processed.
SQL> oradebug close_trace;
Statement processed.
/*close_trace能够释放指定Oracle进程正打开着的文件,To close the current trace file use*/
SQL> host
[maclean@rh2 ~]$ lsof|grep lgwr

[maclean@rh2 ~]$ ls -l /proc/7289/fd/|grep lgwr
[maclean@rh2 ~]$
/* 从进程相关的fd文件夹中查找不到原来的trace文件;close_trace命令成功释放了该文件,并回收了磁盘空间。*/

ora-00600:[17281], [1001]一例

检查告警日志发现出现ora-600:[17281],[1001]记录,该数据库版本为10.2.0.4:

ORA-00600: internal error code, arguments: [17281], [1001], [0x70000042F5E54F8], [], [], [], [], []
ORA-01001: invalid cursor

分析该600错误产生的trace文件,发现当时运行的语句是一段匿名块:

Current SQL statement for this session:
declare
t_owner varchar2(30);
t_name  varchar2(30);
procedure check_mview is
dummy integer;
begin
if :object_type = ‘TABLE’ then
select 1 into dummy
from sys.all_objects
where owner = :object_owner
and object_name = :object_name
and object_type = ‘MATERIALIZED VIEW’
and rownum = 1;
:object_type := ‘MATERIALIZED VIEW’;
end if;
exception
when others then null;
end;
begin
:sub_object := null;
if :deep != 0 then
begin
if :part2 is null then
select constraint_type, owner, constraint_name
into :object_type, :object_owner, :object_name
from sys.all_constraints c
where c.constraint_name = :part1 and c.owner = user
and rownum = 1;
else
select constraint_type, owner, constraint_name, :part3
into :object_type, :object_owner, :object_name, :sub_object
from sys.all_constraints c
where c.constraint_name = :part2 and c.owner = :part1
and rownum = 1;
end if;
if :object_type = ‘P’ then :object_type := ‘PRIMARY KEY’; end if;
if :object_type = ‘U’ then :object_type := ‘UNIQUE KEY’; end if;
if :object_type = ‘R’ then :object_type := ‘FOREIGN KEY’; end if;
if :object_type = ‘C’ then :object_type := ‘CHECK CONSTRAINT’; end if;
return;
exception
when no_data_found then null;
end;
end if;
:sub_object := :part2;
if (:part2 is null) or (:part1 != user) then
begin
select object_type, user, :part1
into :object_type, :object_owner, :object_name
from sys.all_objects
where owner = user
and object_name = :part1
and object_type in (‘MATERIALIZED VIEW’, ‘TABLE’, ‘VIEW’, ‘SEQUENCE’, ‘PROCEDURE’, ‘FUNCTION’, ‘PACKAGE’, ‘TYPE’, ‘TRIGGER’, ‘SYNONYM’)
and rownum = 1;
if :object_type = ‘SYNONYM’ then
select s.table_owner, s.table_name
into t_owner, t_name
from sys.all_synonyms s
where s.synonym_name = :part1
and s.owner = user
and rownum = 1;
select o.object_type, o.owner, o.object_name
into :object_type, :object_owner, :object_name
from sys.all_objects o
where o.owner = t_owner
and o.object_name = t_name
and object_type in (‘MATERIALIZED VIEW’, ‘TABLE’, ‘VIEW’, ‘SEQUENCE’, ‘PROCEDURE’, ‘FUNCTION’, ‘PACKAGE’, ‘TYPE’, ‘TRIGGER’, ‘SYNONYM’)
and rownum = 1;
end if;
:sub_object := :part2;
if :part3 is not null then
:sub_object := :sub_object || ‘.’ || :part3;
end if;
check_mview;
return;
exception
when no_data_found then null;
end;
end if;
begin
select s.table_owner, s.table_name
into t_owner, t_name
from sys.all_synonyms s
where s.synonym_name = :part1
and s.owner = ‘PUBLIC’
and rownum = 1;
select o.object_type, o.owner, o.object_name
into :object_type, :object_owner, :object_name
from sys.all_objects o
where o.owner = t_owner
and o.object_name = t_name
and object_type in (‘MATERIALIZED VIEW’, ‘TABLE’, ‘VIEW’, ‘SEQUENCE’, ‘PROCEDURE’, ‘FUNCTION’, ‘PACKAGE’, ‘TYPE’, ‘TRIGGER’, ‘SYNONYM’)
and rownum = 1;
check_mview;
return;
exception
when no_data_found then null;
end;
:sub_object := :part3;
begin
select o.object_type, o.owner, o.object_name
into :object_type, :object_owner, :object_name
from sys.all_objects o
where o.owner = :part1
and o.object_name = :part2
and object_type in (‘MATERIALIZED VIEW’, ‘TABLE’, ‘VIEW’, ‘SEQUENCE’, ‘PROCEDURE’, ‘FUNCTION’, ‘PACKAGE’, ‘TYPE’, ‘TRIGGER’, ‘SYNONYM’)
and rownum = 1;
check_mview;
return;
exception
when no_data_found then null;
end;
begin
if :part2 is null and :part3 is null
then
select ‘USER’, null, :part1
into :object_type, :object_owner, :object_name
from sys.all_users u
where u.username = :part1
and rownum = 1;
return;
end if;
exception
when no_data_found then null;
end;
begin
if :part2 is null and :part3 is null and :deep != 0
then
select ‘ROLE’, null, :part1
into :object_type, :object_owner, :object_name
from sys.session_roles r
where r.role = :part1
and rownum = 1;
return;
end if;
exception
when no_data_found then null;
end;
:object_owner := null;
:object_type := null;
:object_name := null;
:sub_object := null;
end;
—– Call Stack Trace —–
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
——————– ——– ——————– —————————-
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgeriv+0108          bl       _ptrgl
kgeasi+0118          bl       kgeriv               1104722C8 ? 1101B87C0 ?
104AFA0FC ? 7000000100067F8 ?
000000000 ?
kgicli+0188          bl       kgeasi               110195A58 ? 110447310 ?
438100004381 ? 200000002 ?
200000002 ? 000000000 ?
0000003E9 ? 000000002 ?
kgidlt+0398          bl       kgicli               110450A70 ? 104C734E0 ?
kgidel+0018          bl       kgidlt               FFFFFFFFFFF90B8 ? 000000000 ?
000000003 ? 000000000 ?
FFFFFFFFFFF9468 ?
perabo+00ac          bl       kgidel               FFFFFFFFFFF8D20 ? 0000000FF ?
perdcs+0050          bl       perabo               000000000 ? 000000820 ?
000000000 ?
peidcs+01dc          bl       perdcs               110477618 ? 000000000 ?
kkxcls+00a4          bl       peidcs               FFFFFFFFFFF9468 ? 110477618 ?
kxsClean+0044        bl       kkxcls               1100DD338 ?
kxsCloseXsc+0444     bl       kxsClean             FFFFFFFFFFF9760 ?
kksCloseCursor+031c  bl       kxsCloseXsc          110478688 ? 110281FB0 ?
opicca+00c4          bl       kksCloseCursor       104BD9640 ?
opiclo+00a0          bl       opicca               10013D940 ?
kpoclsa+0050         bl       03F32B00
opiodr+0ae0          bl       _ptrgl
ttcpip+1020          bl       _ptrgl
opitsk+1124          bl       01F9F2A0
opiino+0990          bl       opitsk               000000000 ? 000000000 ?
opiodr+0ae0          bl       _ptrgl
opidrv+0484          bl       01F9E0E8
sou2o+0090           bl       opidrv               3C02DC1BBC ? 44065F000 ?
FFFFFFFFFFFF3A0 ?
opimai_real+01bc     bl       01F9B9F4
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0098         bl       main                 000000000 ? 000000000 ?



first argument为17281,该代码对应为在关闭游标时发生错误事件。
发生错误时的调用栈为:kkxcls->peidcs->perdcs->perabo->kgidel->kgidlt->kgicli,通过以上调用栈与argument信息在600 lookup工具中查询,可以发现bug:[6051353]:

Hdr: 6051353 10.1.0.45 THIN 10.1.0.5 PRODID-972 PORTID-212 ORA-600
Abstract: ORA-600[17281] ORA-[1001]

*** 05/14/07 07:09 am ***
TAR:
----
17450130.600

PROBLEM:
--------
Oracle 10.1.0.5 64-bit
AIX5L 64-bit server

Following the application of CPUJAN2007 patch, the database is giving
internal errors.
Tha alert log sjows:
  ORA-600: internal error code, arguments: [17281], [1001],
[0x70000001E792DC8], [], [], [], [], []
  ORA-1001: invalid cursor

Trace file shows the failing statement is an insert.

INSERT INTO V_RPMORGRESOURCEPOSITION
(ORGID, RESOURCEID, EFFECTIVEDAY, TERMINATIONDAY, OWNEDMW,
COMMITTEDMW, AVAILABLEMW, UNOFFEREDMW, FRRCOMMITTEDMW )
SELECT :B12 , :B11 , EFFECTIVEDAY, LEAST(:B10 , :B9 ),
NVL(OWNEDMW,0) + NVL(:B8 ,0), NVL(COMMITTEDMW,0) + NVL(:B7 ,0),
NVL(AVAILABLEMW,0) + NVL(:B6 ,0), NVL(UNOFFEREDMW,0) + NVL(:B5 ,0),
NVL(FRRCOMMITTEDMW,0) + NVL(:B4 ,0)
FROM RPMORGRESOURCEPOSITION
WHERE ORGID = :B3 AND RESOURCEID = :B2 AND EFFECTIVEDAY = :B1

    Other information:
        O/S info: user: , term: , ospid: 1234, machine: esu03als
        client info: smartino@PJM
        application name: eRpm
        action name: QueryZonalLoadObligationDetail
        last wait for 'SQL*Net message from client'

DIAGNOSTIC ANALYSIS:
--------------------
the function stack exactly matches 4359111
kgicli kgidlt kgidel perabo
    Bug 4359111 - STRESS ORA-600[17281] WHEN RUNNING GMT APPLICATION13
But this is already fixed in 10.1.0.5

The 'client' (esu03als) is a Linux server and there is no oracle running
there.

The AIX SysAdmin. who manages the esu03als server says this about the
application:

    "The way RPM connects to our database is by using DBCP (Apache Jakarta
     Project -- see
     As far as I can tell, the release (jar) contained within the RPM
     delivery is version 1.1 of DBCP which was released on 2003-10-20."

This couls also be Bug 5392685/5413487 but tis doesn't seem to have a
resolution.

Also could be Bug 5366763
  possible workaround - set session_cached_cursors to 0
  But this is already set to 0.

Originally the Customer thought this error started after applying CPUJAN2007,
but it is appearing on another database where the CPU patch is not applied.

WORKAROUND:
-----------
none

RELATED BUGS:
-------------
4359111
5413487
5366763

REPRODUCIBILITY:
----------------
intermittent

TEST CASE:
----------
none

STACK TRACE:
------------
          ksedmp ksfdmp kgeriv kgeasi
          kgicli kgidlt kgidel perabo perdcs peidcs kkxcls2 kxtcln
          kxsClean kksCloseCursor opicca opiclo kpoclsa opiodr
          ttcpip opitsk opiino opiodr opidrv sou2o
          main

其调用栈完全一致,可以基本确定2者的关联性。但该文档叙述bug发生在10.1.0.5的AIX版本上,且据称该Bug之前已在“Bug 4359111 – STRESS ORA-600[17281] WHEN RUNNING GMT APPLICATION13”中声明并在10.1.0.5上修复,看起来又是一个伪修复的漏洞。另外一个文档叙述了同样的错误发生在10.2.0.4上:

Hdr: 8337808 10.2.0.4 RDBMS 10.2.0.4 PRG INTERFACE PRODID-5 PORTID-46 ORA-600 4359111
Abstract: ORA-600 [17281] [1001] EVEN AFTER APPLYING THE PATCH 4359111

In prod_ora_12985.trc we see that:
 O/S info: user: Arun?Sharma, term: ARUN, ospid: 2556:3300,
 machine: BVM-EDP\ARUN
Got the ORA-600 at 2009-04-17 13:27:21 .

From the trace this was likely because the PLSQL block in
cursor #2 has an instantiation entry indicating that it
has cursor #3 open:
 INSTANTIATION OBJECT: object=0xf60e9ed0
 type="PL/SQL"[0] lock=0xa383a928 handle=0xb48def6c body=(nil)
 flags=[40] executions=0
 CURSORS: size=4 count=1 next=3
 index cursor      tag  context flags
 ----- ------ -------- -------- ---------------
     2      3 0xf60d56f4 0xf60f832c LRU/PRS/[03]
            ^here

But there is no cursor#3 so it has likely been closed independently

So the immediate thing to do would be to find out what this OS
user (Arun Sharma) was doing at 13.27 on 17th April from the
ARUN machine under OS pid 2556:3300 ?
 - what was the client program ?
 - what is this clients Oracle RSF version ?
 - what was being done in that client at that time.

It is unlikely that the user will remember such fine detail
so you may want to track the alert log closely and as soon
as you seen the ORA-600 find the O/S user , machine etc..
and try to contact them ASAP to confirm what they were doing
etc.. If we can get a handle on the client version / program /
actions that may help. Beyond that the next step is likely
to need a diagnostic on the server side to note cursor close
operations from the client without extranoues additional trace.

From the above update his client is TOAD using 8.0.6.0 on Windows.
TOAD is known to be affected by bug 4359111 so
you should upgrade this client to a version where
bug 4359111 is fixed. (4359111 is a CLIENT SIDE fix)

Marking as an unconfirmed duplicate of bug 4359111
as it looks like some specific client may be connecting
which does not have that patch in place.

与以上文档描述相同,trace中存在以下记录:

INSTANTIATION OBJECT: object=1105e4fe8
type=”PL/SQL”[0] lock=70000044155a8b0 handle=70000042f5e54f8 body=0 level=0
flags=[40] executions=0
CURSORS: size=4 count=3 next=5
index cursor      tag  context flags
—– —— ——– ——– —————
2      4 11049ecc8 110525f60 LRU/PRS/[03]
3      6 11049ecc8 1105261f0 LRU/PRS/[03]
4      5 11049ecc8 110526338 LRU/PRS/[03]

但实际上这里cursor 3所打开的cursor#:4,5,6均不存在,所以cursor# 3也被单独关闭了。文档中问题是由toad引起的,首先toad连接数据库是不需要安装Oracle client的,它通过一些客制化过的c/c++的接口连接到DB;如文档所述可以确定toad V8.0.6.0受到 Bug4359111的影响,而我们的环境中是通过PL/SQL developer连接到数据库的,该工具需要用到Oracle client,而开发人员安装的Oracle client一般为9.2.0.1,极有可能是这一较低版本的客户端软件造成了问题发生,到这里触发Bug的条件基本清晰了。

8i/9i的oracle client虽然仍能够连接到10g,但难保不发生一些兼容性问题或者将早期版本中的Bug再次代入,Oracle对这些连接形式或已不提供技术支持,或提供扩展模式(可能收费)的技术支持。以下列表列出了各版本Server-client的兼容性:



  • #1 – See Note 207319.1
  • #2 – An ORA-3134 error is incorrectly reported if a 10g client tries to connect to an 8.1.7.3 or lower server. See Note 3437884.8 .
  • #3 – An ORA-3134 error is correctly reported when attempting to connect to this version.
  • #4 – There are problems connecting from a 10g client to 8i/9i where one is EBCDIC based. See Note 3564573.8
  • #5 – For connections between 10.2 (or higher) and 9.2 the 9.2 end MUST be at 9.2.0.4 or higher. Connections between 10.2 (or higher) and 9.2.0.1, 9.2.0.2 or 9.2.0.3 are not supported.
  • #6 – For connections between 11.1 (or higher) and 10.1 / 10.2 the 10g end MUST be at 10.1.0.5 / 10.2.0.2 (or higher) respectively in order to use PLSQL between those versions. See Note 4511371.8 for more details.

其实在我们升级或迁移Oracle数据库的时候就因该考虑到客户端软件也需要升级到合适版本才能满足今后兼容性及应用程序健壮度的要求,当然客户端软件并不一定只是oracle client,它可能是jdbc,也许是odbc,也许是dbi等等。

索引使用空间异常增长一例

客户的某套系统上有一个表空间近日使用率异常增长,该表空间用以存储索引段,经过定位发现一个原本只有200M左右的索引使用将近30+G的空间,而且表现为绝大多数是未格式化的数据块。以下为通过 show_space脚本收集的段信息:

Unformatted Blocks = 1772568
Blocks with 00-25% free space = 0
Blocks with 26-50% free space = 2173
Blocks with 51-75% free space = 0
Blocks with 76-100% free space = 0
Full Blocks = 60762

Unformatted Blocks总数1772568,共使用1772568*16k=27G. 一个索引占用如此多的未格式化块似乎不可思议,首先想到的可能是一个威力惊人的Bug。

探索MOS,可以发现以下这个有趣的note:

Applies to:

Oracle Server – Enterprise Edition – Version: 10.2.0.3
Information in this document applies to any platform.

Symptoms

The problem as observed is that the space allocated for a table/index(partition) is far too high and keeps growing.

Using the AVG_ROW_SIZE * NUM_ROWS shows that the space as needed would be far less that the allocated space (in some cases allocated xxGb while xxMb would be expected).

Cause

Bug 5987262 TABLESPACE IS ABNORMALLY INCREASED BY UNFORMATTED BLOCKS.   This bug was closed as duplicate of unpublished Bug 5890312 HANG OBSERVED WHILE CREATING CTXCAT INDEX

The following PL/SQL block will show the space usage in the (partitioned) table

set serveroutput on
exec dbms_output.enable(1000000);

declare

unf number;
unfb number;
fs1 number;
fs1b number;
fs2 number;
fs2b number;
fs3 number;
fs3b number;
fs4 number;
fs4b number;
full number;
fullb number;

own dba_tables.owner%type;
tab dba_tables.table_name%type;

yesno varchar2(3);

type parts is table of dba_tab_partitions%rowtype;
partlist parts;

type cursor_ref is ref cursor;
c_cur cursor_ref;

begin

own:=upper(‘&owner’);
tab:=upper(‘&table_name’);
dbms_output.put_line(‘——————————————————————————–‘);

open c_cur for select partitioned from dba_tables
where owner=own and table_name=tab;
fetch c_cur into yesno;
close c_cur;

dbms_output.put_line(‘Owner:     ‘||own);
dbms_output.put_line(‘Table:     ‘||tab);

dbms_output.put_line(‘————————————————‘);

if yesno=’NO’
then
dbms_space.space_usage(own,tab,’TABLE’,unf,unfb,fs1,fs1b,fs2,fs2b,fs3,fs3b,fs4,fs4b,full,fullb);
dbms_output.put_line(‘unf: ‘||unf||’ fs1: ‘||fs1||’ fs2: ‘||fs2||’ fs3: ‘||fs3||’ fs4: ‘||fs4||’ full: ‘||full);

else
open c_cur for select * from dba_tab_partitions
where table_owner=own and table_name=tab;
fetch c_cur bulk collect into partlist;
close c_cur;

for i in partlist.first .. partlist.last
loop
dbms_space.space_usage(partlist(i).table_owner,partlist(i).table_name,’TABLE PARTITION’,unf,unfb,fs1,fs1b,fs2,fs2b,fs3,fs3b,fs4,fs4b,full,fullb,partlist(i).partition_name);
dbms_output.put_line(‘Partition: ‘||partlist(i).partition_name);
dbms_output.put_line(‘unf: ‘||unf||’ fs1: ‘||fs1||’ fs2: ‘||fs2||’ fs3: ‘||fs3||’ fs4: ‘||fs4||’ full: ‘||full);
end loop;

end if;
dbms_output.put_line(‘——————————————————————————–‘);

end;
/

In case there is a need to check a single table partition:

set serveroutput on
exec dbms_output.enable(1000000);

declare

unf number;
unfb number;
fs1 number;
fs1b number;
fs2 number;
fs2b number;
fs3 number;
fs3b number;
fs4 number;
fs4b number;
full number;
fullb number;

own dba_tables.owner%type;
tab dba_tables.table_name%type;
par dba_tab_partitions.partition_name%type;

begin

own:=upper(‘&owner’);
tab:=upper(‘&table_name’);
par:=upper(‘&partition_name’);

dbms_output.put_line(‘——————————————————————————–‘);

dbms_output.put_line(‘Owner:     ‘||own);
dbms_output.put_line(‘Table:     ‘||tab);
dbms_output.put_line(‘Partition: ‘||par);
dbms_output.put_line(‘————————————————‘);

dbms_space.space_usage(own,tab,’TABLE PARTITION’,unf,unfb,fs1,fs1b,fs2,fs2b,fs3,fs3b,fs4,fs4b,full,fullb,par);
dbms_output.put_line(‘unf: ‘||unf||’ fs1: ‘||fs1||’ fs2: ‘||fs2||’ fs3: ‘||fs3||’ fs4: ‘||fs4||’ full: ‘||full);

dbms_output.put_line(‘——————————————————————————–‘);

end;
/

The following PL/SQL block will show the space usage in the (partitioned) index:

set serveroutput on
exec dbms_output.enable(1000000);

declare

unf number;
unfb number;
fs1 number;
fs1b number;
fs2 number;
fs2b number;
fs3 number;
fs3b number;
fs4 number;
fs4b number;
full number;
fullb number;

own dba_indexes.owner%type;
ind dba_indexes.index_name%type;

yesno varchar2(3);

type parts is table of dba_ind_partitions%rowtype;
partlist parts;

type cursor_ref is ref cursor;
c_cur cursor_ref;

begin

own:=upper(‘&owner’);
ind:=upper(‘&index_name’);
dbms_output.put_line(‘——————————————————————————–‘);

open c_cur for select partitioned from dba_indexes
where owner=own and index_name=ind;
fetch c_cur into yesno;
close c_cur;

dbms_output.put_line(‘Owner: ‘||own);
dbms_output.put_line(‘Index: ‘||ind);

dbms_output.put_line(‘————————————————‘);

if yesno=’NO’
then
dbms_space.space_usage(own,ind,’INDEX’,unf,unfb,fs1,fs1b,fs2,fs2b,fs3,fs3b,fs4,fs4b,full,fullb);
dbms_output.put_line(‘unf: ‘||unf||’ fs1: ‘||fs1||’ fs2: ‘||fs2||’ fs3: ‘||fs3||’ fs4: ‘||fs4||’ full: ‘||full);

else
open c_cur for select * from dba_ind_partitions
where index_owner=own and index_name=ind;
fetch c_cur bulk collect into partlist;
close c_cur;

for i in partlist.first .. partlist.last
loop
dbms_space.space_usage(partlist(i).index_owner,partlist(i).index_name,’INDEX PARTITION’,unf,unfb,fs1,fs1b,fs2,fs2b,fs3,fs3b,fs4,fs4b,full,fullb,partlist(i).partition_name);
dbms_output.put_line(‘Partition: ‘||partlist(i).partition_name);
dbms_output.put_line(‘unf: ‘||unf||’ fs1: ‘||fs1||’ fs2: ‘||fs2||’ fs3: ‘||fs3||’ fs4: ‘||fs4||’ full: ‘||full);
end loop;

end if;
dbms_output.put_line(‘——————————————————————————–‘);

end;
/

In case there is a need to check a single index partition:

set serveroutput on
exec dbms_output.enable(1000000);

declare

unf number;
unfb number;
fs1 number;
fs1b number;
fs2 number;
fs2b number;
fs3 number;
fs3b number;
fs4 number;
fs4b number;
full number;
fullb number;

own dba_indexes.owner%type;
ind dba_indexes.index_name%type;
par dba_ind_partitions.partition_name%type;

begin

own:=upper(‘&owner’);
ind:=upper(‘&index_name’);
par:=upper(‘&partition_name’);

dbms_output.put_line(‘——————————————————————————–‘);

dbms_output.put_line(‘Owner: ‘||own);
dbms_output.put_line(‘Index: ‘||ind);
dbms_output.put_line(‘Partition: ‘||par);
dbms_output.put_line(‘————————————————‘);

dbms_space.space_usage(own,ind,’INDEX PARTITION’,unf,unfb,fs1,fs1b,fs2,fs2b,fs3,fs3b,fs4,fs4b,full,fullb,par);
dbms_output.put_line(‘unf: ‘||unf||’ fs1: ‘||fs1||’ fs2: ‘||fs2||’ fs3: ‘||fs3||’ fs4: ‘||fs4||’ full: ‘||full);

dbms_output.put_line(‘——————————————————————————–‘);

end;
/

If the value of UNF makes up the major part of the space as allocated,  unpublished bug 5890312 is hit.
In a specific case a table had a value of 4GB unformatted blocks out of 4.3GB allocated blocks.

Solution

The actions as needed to resolve this issue are:

1) Install the patch for unpublished Bug 5890312 in order to prevent future occurrences

2) Resolve the current space allocation, options:
– shrink the table (partition)
– recreate the table (partition)
– recreate the index (partition)

以上关于表空间异常增长且表(或表分区)和索引(或索引分区)中分配了大量未格式化的块的描述与我们的问题十分相似,但该文档的指出可能发生该问题的版本是10.2.0.3,而我们的版本已经是10.2.0.4了,是否又一次是Oracle的”伪修复”呢?怀着疑虑我们提交了SR,希望MOS帮助我们确认是否”hit Bug 729149.1″。

来自MOS的回复:

“The bug is fixed in 10.2.0.4.3 as well as 11.1.0.6.
You might as well apply the lastest PSU available for your platform: 10.2.0.4.4 Patch:9352164

I’m afraid there is no one off (interim) patch available for 10.2.0.4 on Solaris SPARC.
It was only provided for 9.2.0.8 and 10.2.0.3.
The one off patch 5890312 has been made available in Recommended Patch Sets (PSU): 10.2.0.4.3 and 10.2.0.4.4.

More information about Oracle Recommended Patches and their availability can be found in:
NOTE.756671.1 Oracle Recommended Patches — Oracle Database

If you have trouble applying any of the recommended PSUs, please let me know and i will log a one off backport request on your behalf  to ask development is it would be possible to provide you with the one off patch on 10.2.0.4.”

看起来我们并不走运,在10.2.0.4的基础上甚至没有one-off补丁,如果要彻底解决只能打PSU。

这个Bug最早出现是在9.2.0.8上,有一个相关的说明:

Hdr: 5987262 9.2.0.8.0 RDBMS 9.2.0.8.0 SPACE PRODID-5 PORTID-197 5890312
Abstract: TABLESPACE IS ABNORMALLY INCREASED  BY UNFORMATTED BLOCKS

*** 04/12/07 12:32 am ***
TAR:
----
6265975.992

PROBLEM:
--------
SQL>  select count(*) from pb.PBTAJUMT;

  COUNT(*)
----------
    897807

SQL> select sum(bytes)/1024/1024 from dba_extents where
segment_name='PBTAJUMT';

SUM(BYTES)/1024/1024
--------------------
               22008

SQL> select count(*) from dba_extents where segment_name='PBTAJUMT';

  COUNT(*)
----------
      5502

SQL>
       UNF
----------
   1404407

SQL>
       FS4
----------
        87

SQL>
       FS3
----------
         8

SQL>
       FS2
----------
         1

SQL>
       FS1
----------
        99

SQL>
       FULL
----------
      2291

AFTER 18 rows inserted >>

SQL> select count(*) from pb.PBTAJUMT;

  COUNT(*)
----------
    897825

SQL> select sum(bytes)/1024/1024 from dba_extents where
segment_name='PBTAJUMT';

SUM(BYTES)/1024/1024
--------------------
               23004

SQL> select count(*) from dba_extents where segment_name='PBTAJUMT';

  COUNT(*)
----------
      5751

SQL>
       UNF
----------
   1468072

SQL>
       FS4
----------
       103

SQL>
       FS3
----------
         7

SQL>
       FS2
----------
         2

SQL>
       FS1
----------
        56

SQL>
      FULL
----------
      2334

After 18 rows inserted, extents increased about 1G btyes.

DIAGNOSTIC ANALYSIS:
--------------------
ALERT LOG : node2 >

Thu Apr 12 09:40:15 2007
ORA-1653: unable to extend table PB.PBTAJUMT by 256 in
tablespace PB
Thu Apr 12 09:42:48 2007
ORA-1653: unable to extend table PB.PBTAJUMT by 256 in
tablespace PB
..
Thu Apr 12 09:44:37 2007
ORA-1653: unable to extend table PB.PBTAJUMT by 256 in
tablespace PB
Thu Apr 12 09:48:06 2007
ORA-1653: unable to extend table PB.PBTAJUMT by 256 in
tablespace PB
Thu Apr 12 09:49:49 2007
alter database datafile '/dev/oravg05/roravg05_04' resize 1000m
Thu Apr 12 09:50:21 2007
Completed: alter database datafile '/dev/oravg05/roravg05_04'
Thu Apr 12 09:50:39 2007
ORA-1653: unable to extend table PB.PBTAJUMT by 256 in
tablespace PB
...
Thu Apr 12 11:43:07 2007
 alter database datafile '/dev/oravg05/roravg05_04' resize 3300m
Completed:  alter database datafile '/dev/oravg05/roravg05_04
Thu Apr 12 11:43:51 2007
ORA-1653: unable to extend table PB.PBTAJUMT by 256 in
tablespace PB
Thu Apr 12 11:44:35 2007
alter database datafile '/dev/oravg05/roravg05_04' resize 3500
Thu Apr 12 11:44:37 2007
ORA-3214 signalled during: alter database datafile
'/dev/oravg05/roravg05_04'...
Thu Apr 12 11:44:54 2007
alter database datafile '/dev/oravg05/roravg05_04' resize 3500m
Thu Apr 12 11:45:05 2007
Completed: alter database datafile '/dev/oravg05/roravg05_04'

kjbdb1_ora_23026.trc :: segment dump >>

After a specific  extents , we can find owner is only inst 2 .
...
 0x05073607  Free: 2 Inst: 2
   0x05073608  Free: 1 Inst: 2
   0x05482905  Free: 2 Inst: 2
   0x05482906  Free: 1 Inst: 1
   0x05482907  Free: 5 Inst: 1
   0x05482908  Free: 5 Inst: 1
   0x11400405  Free: 5 Inst: 2 <======
   0x11400406  Free: 5 Inst: 2
   0x11400407  Free: 5 Inst: 2
   0x11400408  Free: 5 Inst: 2
   0x11400505  Free: 5 Inst: 2
   0x11400506  Free: 5 Inst: 2
   0x11400507  Free: 5 Inst: 2
   0x11400508  Free: 5 Inst: 2
   0x11400605  Free: 5 Inst: 2
   0x11400606  Free: 5 Inst: 2
   0x11400607  Free: 5 Inst: 2
   0x11400608  Free: 5 Inst: 2
   0x11400705  Free: 5 Inst: 2
   0x11400706  Free: 5 Inst: 2
   0x11400707  Free: 5 Inst: 2
   0x11400708  Free: 5 Inst: 2
   0x11400805  Free: 5 Inst: 2
   0x11400806  Free: 5 Inst: 2
   0x11400807  Free: 5 Inst: 2
   0x11400808  Free: 5 Inst: 2
   0x11400905  Free: 5 Inst: 2
   0x11400a05  Free: 5 Inst: 2
   0x11400b05  Free: 5 Inst: 2
   0x11400c05  Free: 5 Inst: 2
   0x11400d05  Free: 5 Inst: 2
   0x11400e05  Free: 5 Inst: 2
   0x11400f05  Free: 5 Inst: 2
   0x11401005  Free: 5 Inst: 2
   0x11401105  Free: 5 Inst: 2
   0x11401205  Free: 5 Inst: 2
   0x11401305  Free: 5 Inst: 2
   0x11401405  Free: 5 Inst: 2
   0x11401505  Free: 5 Inst: 2
   0x11401605  Free: 5 Inst: 2
   0x11401705  Free: 5 Inst: 2
   0x11401805  Free: 5 Inst: 2
   0x11401905  Free: 5 Inst: 2
   0x11401a05  Free: 5 Inst: 2
   0x11401b05  Free: 5 Inst: 2
   0x11401c05  Free: 5 Inst: 2
   0x11401d05  Free: 5 Inst: 2
   0x11401e05  Free: 5 Inst: 2
   0x11401f05  Free: 5 Inst: 2
   0x11402005  Free: 5 Inst: 2
   0x11402105  Free: 5 Inst: 2
   0x11402205  Free: 5 Inst: 2
   0x11402305  Free: 5 Inst: 2
   0x11402405  Free: 5 Inst: 2
   0x11402505  Free: 5 Inst: 2
   0x11402605  Free: 5 Inst: 2
   0x11402705  Free: 5 Inst: 2
   0x11402805  Free: 5 Inst: 2
   0x11402905  Free: 5 Inst: 2
   0x11402a05  Free: 5 Inst: 2
   0x11402b05  Free: 5 Inst: 2
   0x11402c05  Free: 5 Inst: 2
   0x11403605  Free: 5 Inst: 2
   0x11403705  Free: 5 Inst: 2

WORKAROUND:
-----------
none

RELATED BUGS:
-------------
bug.5945288, bug.4887955, bug.2447046
REPRODUCIBILITY:
----------------
No, cannot reproduce intentionally. But happened 4 times on product machine.

TEST CASE:
----------

STACK TRACE:
------------

SUPPORTING INFORMATION:
-----------------------
This system migrated on 19th of Mar. It happened 4 times from 20th of Mar.
24 HOUR CONTACT INFORMATION FOR P1 BUGS:

MOS无法重现该Bug,但可以确认该Bug的存在,具体触发的条件也不得而知。

如果您的环境无法在短期内应用补丁的话,可以通过 shrink/recreate 表(或表分区)和索引(或索引分区)来workaround这个问题,事实上我感觉这个Bug的触发条件并不简单,也就是说出现概率并不高;但如果您觉得自己不是那么幸运,那么还尽快将您的产品数据库升级到10.2.0.4.4以上或者10.2.0.5(我们是激进派)吧!

Fail to queue the whole FAL gap in dataguard一例

近日告警日志中出现以下记录:
FAL[server]: Fail to queue the whole FAL gap
GAP – thread 1 sequence 180-180
DBID 3731271451 branch 689955035

这是一个10.2.0.3的dataguard环境,采用物理备库,归档传输模式;查询metalink发现相关note:

Symptoms

When using ARCH transport, gaps could be flagged in the alert log even though the single log gap was for a log that had not been written at the primary yet.

alert.log on primary shows:

FAL[server]: Fail to queue the whole FAL gap

GAP – thread 1 sequence 63962-63962

DBID 1243807152 branch 631898097

or alert.log on standby shows:

Fetching gap sequence in thread 1, gap sequence 63962-63962

Thu Jan 24 14:36:30 2008

FAL[client]: Failed to request gap sequence

GAP – thread 1 sequence 63962-63962

DBID 2004523329 branch 594300676

FAL[client]: All defined FAL servers have been attempted.

v$archive_gap returns no rows

SQL> select * from v$archive_gap;

no rows selected

Cause

Bug 5526409 – FAL gaps reported at standby for log not yet written at primary

Solution

Bug 5526409 is fixed in 10.2.0.4 and 11.1.

Upgrade to 10.2.0.4 as Bug 5526409 is fixed in 10.2.0.4.

Their is no impact of these messages on the database. You can safely ignore these messages.

One-off Patch for Bug 5526409 on top of 10.2.0.3 is available for some platforms. Please check Patch 5526409 for your platform.

该note描述在10.1.0.2-10.2.0.3版本中,在ARCH传输的DataGuard环境中可能出现日志传输gap为单个在primary库中尚未写出的日志,该gap可能会在告警日志中以以上形式标示。
该bug(问题)在版本10.2.0.4和11.1中得到了修复,在10.2.0.3版本中部分平台上有one-off补丁。但实际上该bug(问题)对于主备库不会有任何影响,我们也可以将之忽略。

7月最新发布11.2.0.1.2 Patch set update

7月13日,11g release 2 的第二个补丁集更新发布了;9i的最终版本为9.2.0.8,10g上10.2.0.5很有可能成为最终版本,我们预期今后(11g,12g)中Patch set数量会有效减少,而patch set update数量可能大幅增加;这样的更新形式可以为Oracle Database提升一定的软件形象。可以猜想11gr2的最终版本号可能是11.2.0.2/3.x。

附该psu的readme note:

Released: July 13, 2010

This document is accurate at the time of release. For any changes and additional information regarding PSU 11.2.0.1.2, see these related documents that are available at My Oracle Support (http://support.oracle.com/):

  • Note 854428.1 Patch Set Updates for Oracle Products
  • Note 1089071.1 Oracle Database Patch Set Update 11.2.0.1.2 Known Issues

This document includes the following sections:

1 Patch Information

Patch Set Update (PSU) patches are cumulative. That is, the content of all previous PSUs is included in the latest PSU patch.

PSU 11.2.0.1.2 includes the fixes listed in Section 5, “Bugs Fixed by This Patch”.

Table 1 describes installation types and security content. For each installation type, it indicates the most recent PSU patch to include new security fixes that are pertinent to that installation type. If there are no security fixes to be applied to an installation type, then “None” is indicated. If a specific PSU is listed, then apply that or any later PSU patch to be current with security fixes.

Table 1 Installation Types and Security Content

Installation Type Latest PSU with Security Fixes
Server homes PSU 11.2.0.1.2


Client-Only Installations None
Instant Client Installations None

(The Instant Client installation is not the same as the client-only Installation. For additional information about Instant Client installations, see Oracle Database Concepts.)

2 Patch Installation and Deinstallation

This section includes the following sections:

2.1 Platforms for PSU 11.2.0.1.2

For a list of platforms that are supported in this Patch Set Update, see My Oracle Support Note 1060989.1 Critical Patch Update July 2010 Patch Availability Document for Oracle Products.

2.2 OPatch Utility Information

You must use the OPatch utility version 11.2.0.1.0 or later to apply this patch. Oracle recommends that you use the latest released OPatch 11.2, which is available for download from My Oracle Support patch 6880880 by selecting the 11.2.0.0.0 release.

For information about OPatch documentation, including any known issues, see My Oracle Support Note 293369.1 OPatch documentation list.

2.3 Patch Installation

These instructions are for all Oracle Database installations.

2.3.1 Patch Pre-Installation Instructions

Before you install PSU 11.2.0.1.2, perform the following actions to check the environment and to detect and resolve any one-off patch conflicts.

2.3.1.1 Environments with ASM

If you are installing the PSU to an environment that has Automatic Storage Management (ASM), note the following:

  • For Linux x86 and Linux x86-64 platforms, install either (A) the bug fix for 8898852 and the Database PSU patch 9654983, or (B) the Grid Infrastructure PSU patch 9343627.
  • For all other platforms, no action is required. The fix for 8898852 was included in the base 11.2.0.1.0 release.

2.3.1.2 Environment Checks
  1. Ensure that the $PATH definition has the following executables: make, ar, ld, and nm.The location of these executables depends on your operating system. On many operating systems, they are located in /usr/ccs/bin, in which case you can set your PATH definition as follows:
    export PATH=$PATH:/usr/ccs/bin
    

2.3.1.3 One-off Patch Conflict Detection and Resolution

For an introduction to the PSU one-off patch concepts, see “Patch Set Updates Patch Conflict Resolution” in My Oracle Support Note 854428.1 Patch Set Updates for Oracle Products.

The fastest and easiest way to determine whether you have one-off patches in the Oracle home that conflict with the PSU, and to get the necessary conflict resolution patches, is to use the Patch Recommendations and Patch Plans features on the Patches & Updates tab in My Oracle Support. These features work in conjunction with the My Oracle Support Configuration Manager. Recorded training sessions on these features can be found in Note 603505.1.

However, if you are not using My Oracle Support Patch Plans, follow these steps:

  1. Determine whether any currently installed one-off patches conflict with the PSU patch as follows:
    unzip p9654983_11201_<platform>.zip
    opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir ./9654983
    
  2. The report will indicate the patches that conflict with PSU 9654983 and the patches for which PSU 9654983 is a superset.Note that Oracle proactively provides PSU 11.2.0.1.2 one-off patches for common conflicts.
  3. Use My Oracle Support Note 1061295.1 Patch Set Updates – One-off Patch Conflict Resolution to determine, for each conflicting patch, whether a conflict resolution patch is already available, and if you need to request a new conflict resolution patch or if the conflict may be ignored.
  4. When all the one-off patches that you have requested are available at My Oracle Support, proceed with Section 2.3.2, “Patch Installation Instructions”.

2.3.2 Patch Installation Instructions

Follow these steps:

  1. If you are using a Data Guard Physical Standby database, you must first install this patch on the primary database before installing the patch on the physical standby database. It is not supported to install this patch on the physical standby database before installing the patch on the primary database. For more information, see My Oracle Support Note 278641.1.
  2. Do one of the following, depending on whether this is a RAC environment:
    • If this is a RAC environment, choose one of the patch installation methods provided by OPatch (rolling, all node, or minimum downtime), and shutdown instances and listeners as appropriate for the installation method selected.This PSU patch is rolling RAC installable. Refer to My Oracle Support Note 244241.1 Rolling Patch – OPatch Support for RAC.
    • If this is not a RAC environment, shut down all instances and listeners associated with the Oracle home that you are updating. For more information, see Oracle Database Administrator’s Guide.
  3. Set your current directory to the directory where the patch is located and then run the OPatch utility by entering the following commands:
    unzip p9654983_11201_<platform>.zip
    cd 9654983
    opatch apply
    
  4. If there are errors, refer to Section 3, “Known Issues”.

2.3.3 Patch Post-Installation Instructions

After installing the patch, perform the following actions:

  1. Apply conflict resolution patches as explained in Section 2.3.3.1.
  2. Load modified SQL files into the database, as explained in Section 2.3.3.2.

2.3.3.1 Applying Conflict Resolution Patches

Apply the patch conflict resolution one-off patches that were determined to be needed when you performed the steps in Section 2.3.1.3, “One-off Patch Conflict Detection and Resolution”.

2.3.3.2 Loading Modified SQL Files into the Database

The following steps load modified SQL files into the database. For a RAC environment, perform these steps on only one node.

  1. For each database instance running on the Oracle home being patched, connect to the database using SQL*Plus. Connect as SYSDBA and run the catbundle.sql script as follows:
    cd $ORACLE_HOME/rdbms/admin
    sqlplus /nolog
    SQL> CONNECT / AS SYSDBA
    SQL> STARTUP
    SQL> @catbundle.sql psu apply
    SQL> QUIT
    

    The catbundle.sql execution is reflected in the dba_registry_history view by a row associated with bundle series PSU.

    For information about the catbundle.sql script, see My Oracle Support Note 605795.1 Introduction to Oracle Database catbundle.sql.

  2. Check the following log files in $ORACLE_HOME/cfgtoollogs/catbundle for any errors:
    catbundle_PSU_<database SID>_APPLY_<TIMESTAMP>.log
    catbundle_PSU_<database SID>_GENERATE_<TIMESTAMP>.log
    

    where TIMESTAMP is of the form YYYYMMMDD_HH_MM_SS. If there are errors, refer to Section 3, “Known Issues”.

2.3.4 Patch Post-Installation Instructions for Databases Created or Upgraded after Installation of PSU 11.2.0.1.2 in the Oracle Home

These instructions are for a database that is created or upgraded after the installation of PSU 11.2.0.1.2.

You must execute the steps in Section 2.3.3.2, “Loading Modified SQL Files into the Database” for any new database only if it was created by any of the following methods:

  • Using DBCA (Database Configuration Assistant) to select a sample database (General, Data Warehouse, Transaction Processing)
  • Using a script that was created by DBCA that creates a database from a sample database

2.4 Patch Deinstallation

These instructions apply if you need to deinstall the patch.

2.4.1 Patch Deinstallation Instructions for a Non-RAC Environment

Follow these steps:

  1. Verify that an $ORACLE_HOME/rdbms/admin/catbundle_PSU_<database SID>_ROLLBACK.sql file exists for each database associated with this ORACLE_HOME. If this is not the case, you must execute the steps in Section 2.3.3.2, “Loading Modified SQL Files into the Database” against the database before deinstalling the PSU.
  2. Shut down all instances and listeners associated with the Oracle home that you are updating. For more information, see Oracle Database Administrator’s Guide.
  3. Run the OPatch utility specifying the rollback argument as follows.
    opatch rollback -id 9654983
    
  4. If there are errors, refer to Section 3, “Known Issues”.

2.4.2 Patch Post-Deinstallation Instructions for a Non-RAC Environment

Follow these steps:

  1. Start all database instances running from the Oracle home. (For more information, see Oracle Database Administrator’s Guide.)
  2. For each database instance running out of the ORACLE_HOME, connect to the database using SQL*Plus as SYSDBA and run the rollback script as follows:
    cd $ORACLE_HOME/rdbms/admin
    sqlplus /nolog
    SQL> CONNECT / AS SYSDBA
    SQL> STARTUP
    SQL> @catbundle_PSU_<database SID>_ROLLBACK.sql
    SQL> QUIT
    

    In a RAC environment, the name of the rollback script will have the format catbundle_PSU_<database SID PREFIX>_ROLLBACK.sql.

  3. Check the log file for any errors. The log file is found in $ORACLE_HOME/cfgtoollogs/catbundle and is named catbundle_PSU_<database SID>_ROLLBACK_<TIMESTAMP>.log where TIMESTAMP is of the form YYYYMMMDD_HH_MM_SS. If there are errors, refer to Section 3, “Known Issues”.

2.4.3 Patch Deinstallation Instructions for a RAC Environment

Follow these steps for each node in the cluster, one node at a time:

  1. Shut down the instance on the node.
  2. Run the OPatch utility specifying the rollback argument as follows.
    opatch rollback -id 9654983
    

    If there are errors, refer to Section 3, “Known Issues”.

  3. Start the instance on the node as follows:
    srvctl start instance
    

2.4.4 Patch Post-Deinstallation Instructions for a RAC Environment

Follow the instructions listed in Section Section 2.4.2, “Patch Post-Deinstallation Instructions for a Non-RAC Environment” only on the node for which the steps in Section 2.3.3.2, “Loading Modified SQL Files into the Database” were executed during the patch application.

All other instances can be started and accessed as usual while you are executing the deinstallation steps.

3 Known Issues

For information about OPatch issues, see My Oracle Support Note 293369.1 OPatch documentation list.

For issues documented after the release of this PSU, see My Oracle Support Note 1089071.1 Oracle Database Patch Set Update 11.2.0.1.2 Known Issues.

Other known issues are as follows.

Issue 1
The following ignorable errors may be encountered while running the catbundle.sql script or its rollback script:

ORA-29809: cannot drop an operator with dependent objects
ORA-29931: specified association does not exist
ORA-29830: operator does not exist
ORA-00942: table or view does not exist
ORA-00955: name is already used by an existing object
ORA-01430: column being added already exists in table
ORA-01432: public synonym to be dropped does not exist
ORA-01434: private synonym to be dropped does not exist
ORA-01435: user does not exist
ORA-01917: user or role 'XDB' does not exist
ORA-01920: user name '<user-name>' conflicts with another user or role name
ORA-01921: role name '<role name>' conflicts with another user or role name
ORA-01952: system privileges not granted to 'WKSYS'
ORA-02303: cannot drop or replace a type with type or table dependents
ORA-02443: Cannot drop constraint - nonexistent constraint
ORA-04043: object <object-name> does not exist
ORA-29832: cannot drop or replace an indextype with dependent indexes
ORA-29844: duplicate operator name specified
ORA-14452: attempt to create, alter or drop an index on temporary table already in use
ORA-06512: at line <line number>. If this error follow any of above errors, then can be safely ignored.
ORA-01927: cannot REVOKE privileges you did not grant

4 References

The following documents are references for this patch.

Note 293369.1 OPatch documentation list

Note 360870.1 Impact of Java Security Vulnerabilities on Oracle Products

Note 468959.1 Enterprise Manager Grid Control Known Issues

Note 9352237.8 Bug 9352237 – 11.2.0.1.1 Patch Set Update (PSU)

5 Bugs Fixed by This Patch

This patch includes the following bug fixes.

5.1 CPU Molecules

CPU molecules in PSU 11.2.0.1.2:

PSU 11.2.0.1.2 contains the following new CPU molecules:

9676419 – DB-11.2.0.1-MOLECULE-004-CPUJUL2010

9676420 – DB-11.2.0.1-MOLECULE-005-CPUJUL2010

5.2 Bug Fixes

PSU 11.2.0.1.2 contains the following new fixes:

Automatic Storage Management

8755082 – ORA-00600: [KCFIS_TRANSLATE4:VOLUME LOOKUP], [2], [WRONG DEVICE NAME], [], [], [

8890026 – ASM PARTNERING CREATES IMBALANCED PARTNERSHIPS

9170608 – STBH:DD BLOCKS PINNED FOR QUERIES THAT DO NOT REQUEST USED SPACE

9363145 – STBH:DB INSTANCES TERMINATED BY ASMB DUE TO ORA-00600 [KFDSKALLOC0]

Buffer Cache

8330783 – HANGING DB WITH “CACHE BUFFER CHAINS” AND “BUFFER DEADLOCK” WAITS DURING INSERT

8822531 – TAKING AWR SNAP HANGS

Data Guard Broker

8918433 – UNPERSISTED FSFO STATE BITS CAN GET PERSISTED

9363384 – PHYSICAL STANDBY SERVICES NOT STARTED AFTER CONVERT FROM SNAPSHOT

9467635 – BROKER’S METADATA FILE UPGRADE TO 11.2 IS BROKEN

9467727 – GETSTATUS DOC YIELDS INCORRECT RESULT IF DBRESOURCE_ID PROP VALUE IS USED

Data Guard Logical

8774868 – LGSBFSFO: ORA-600 [3020], [3], [138] RAISED IN RECOVERY SLAVE

8822832 – V$ARCHIVE_DEST_STATUS HAS INCORRECT VALUE FOR APPLIED_SEQ#

DataGuard Redo Transport

8872096 – ARCHIVING FORCED DURING CLOSE WHEN NO STANDBY IS PRESENT

9399090 – STBH: CONSTANT/HIGH FREQUENT LOG SWITCHES ON BEEHIVE DATABASE IN THE LAST 3 DAYS

Shared Cursors

8865718 – RECURSIVE CURSORS CONTAINING “AS OF SNAPSHOT” CLAUSE ARE NOT SHARED

8981059 – HIGH VERSION COUNT:BIND_MISMATCH,USER_BIND_PEEK_MISMATCH,OPTIMIZER_MODE_MISMATCH

9010222 – APPS ST 11G ORA-00600 [KKSFBC-REPARSE-INFINITE-LOOP]

9067282 – TB:SH:ORA-00600:[KKSFBC-WRONG-KKSCSFLGS] WHILE RUNNING TPC-H

DML Drivers

9255542 – ARRAY INSERT TO PARTITIONED TABLE LOOSES ROWS DUE TO CONCURRENT DDL (ORA-14403)

9488887 – FORIEGN KEY VIOLATION WITH ARRAY-INSERT AND ONLINE IDX REBUILD AFTER BUG-9255542

Flashback Database

8834425 – ORA-240 IN RVWR PROCESS CAUSING 5MIN TRANSACTIONAL HANG

PLSQL

9210925 – AFTER MANUAL UPGRADE TO 11.1.0.7 PL/SQL CALLS INCORRECT FUNCTION

Automatic Memory Management

8505803 – PRE_PAGE_SGA RESULTS IN EXCESSIVE PAGE TABLE SIZE WHEN USING MEMORY_TARGET [AMM]

Partitioning

9165206 – PARTITIONING ORA-600 [KKPOLLS1] / [KKDOILSF1] – DURING PARTITION MAINTANANCE

Real Application Cluster

8875671 – LX64: ORA-600 ARGS [KJPNP_CHK:!MASTER_READY],

9093300 – LOTS OF REPEATED KJXOCDR: DROP DUPLICATE OPEN MESSAGE IN LMD TRACE

Row Access Method

8544696 – TABLE GROWTH – BLOCKS ARE NOT REUSED

Streams

8650719 – DOWNSTREAM CAPTURE ABORTS WITH ORA-26766

Secure Files

8856478 – RAM SECUREFILE PERF DEGRADATION WITH SF COMPRESSION ON SMALL LOBS DURING ATB MOVE

9272086 – STBH: DATA PUMP WRITER SEEMS TO BE WAITING ON WAIT FOR UNREAD MESSAGE ON BROADCA

DB Recovery

8909984 – APPSST GSI 11G: GAPS IN AWR SNAPSHOTS

9068088 – MEDIA RECOVERY WAS HUNG ON STANDBY

9145541 – ORA-600 [25027] / ORA-600 [4097] FOR ACTIVE TX IN A PLUGGED TABLESPACE

9167285 – PKT-BUGOLTP: ORA-07445: [KCRALC()+87]

Space Management

7519406 – ‘J000’ TRACE FILE REGARDING GATHER_STATS_JOB INTERMITTENTLY SINCE 10.2.0.4

8815639 – [11GR2-LNX-090813] MULTIPLE INSERT CAUSE DATA ALLOCATION ABOVE HHWM

9216806 – HIGH “ENQ: TS – CONTENTION” FOR TEMPORARY SEGMENT WHILE SQLLDR DIRECT PATH LOAD

9242411 – STRESS-BIGBH: LOTS OF OR-3113S IN BIGBH STRESS TEST

9461782 – ORA-7445 [KTSLF_SUMFSG()+54] [SIGSEGV] AND KTSLFSUM_CFS ON CALL STACK

Compression

9011088 – [11GR2]ADDING COLUMN TO COMPRESSED TABLE, DATA LOSS OCCURED.

9275072 – APPSST GSI 11G : BUFFER BUSY WAITS INSERTING INTO TABLES

9341448 – APPSST GSI 11G : BUFFER BUSY WAITS AND LATCH: CACHE BUFFERS WAITS WHEN INSERTING

9637033 – ORA-07445[KDR9IR2RST0] INSERT AS SELECT IN A COMPRESSED TABLE WITH > 255 COLUMNS

SQL Execution

8664189 – ORA-00600 [KDISS_UNCOMPRESS: BUFFER LENGTH]

9119194 – PSRC: DISTRIBUTED QUERY SLOWER IN 10.2.0.4 COMPARED TO 10.2.0.3

Transaction Management

8268775 – PERF: HIGH US ENQUEUE CONTENTION DURING A LOGIN STORM OR SESSION FAILOVER

8803762 – ORA-00600 [KDSGRP1] BLOCK CORRUPTION ON 11G DATABASE UPGRADE

Memory Management

8431487 – INSTANCE CRASH ORA-07445 [KGGHSTFEL()+192] ORA-07445[KGGHSTMAP()+241]

Message

9713537 – ENHANCE CAUSE/ACTION FIELDS OF THE INTERNAL ERROR ORA-00600

9714832 – ENHANCE CAUSE/ACTION FIELDS OF THE INTERNAL ERROR ORA-07445

pmon长期持有cache buffers chains导致实例hang住一例

前几日,有客户报一个备用库实例中有多个回话hang住的问题,在hang住前本地有维护人员执行了truncate表的操作。
同事前往客户现场进一步确认了问题,并传回了当时hang状况下的266级systemstate文件。该实例的并发回话数量较少,所以实例状态抓取后总的信息量并不多,这种情形中通过分析systemstate信息往往要好于分析hanganalyze信息。
通过著名的源自于metalink的ass awk脚本可以很快找出各进程的状态,以及重要资源的持有者:
awk -f ass109.awk systemstate.txt

Starting Systemstate 1
…………………………………..
Ass.Awk Version 1.0.9 – Processing systemstate.txt

System State 1
~~~~~~~~~~~~~~~~
1:
2: last wait for ‘pmon timer’
3: waiting for ‘rdbms ipc message’ wait
4: waiting for ‘rdbms ipc message’ wait
5: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait DBW0
6: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait DBW1
7: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait DBW2
8: waiting for ‘rdbms ipc message’ wait
9: waiting for ‘rdbms ipc message’ wait
10: waiting for ‘rdbms ipc message’ wait
11: waiting for ‘enq: RO – fast object reuse'[Enqueue RO-0001000B-00000001] wait
12: waiting for ‘rdbms ipc message’ wait
13: waiting for ‘rdbms ipc message’ wait
14: waiting for ‘rdbms ipc message’ wait
15: waiting for ‘rdbms ipc message’ wait
16: waiting for ‘SQL*Net message from client’ wait
17: waiting for ‘SQL*Net message from client’ wait
18: waiting for ‘SQL*Net message from client’ wait
19: waiting for ‘latch: cache buffers chains'[Latch 70000048c8c66b0] wait
20: waiting for ‘SQL*Net message from client’ wait
21: waiting for ‘SQL*Net message from client’ wait
22: last wait for ‘SQL*Net message from client’ [DEAD]
Cmd: Delete
23: waiting for ‘SQL*Net message from client’ wait
24: waiting for ‘SQL*Net message from client’ wait
25:
26: waiting for ‘SQL*Net message from client’ wait
27: last wait for ‘ksdxexeotherwait’
28: waiting for ‘enq: TX – row lock contention'[Enqueue TX-00120003-00002CC0] wait
Cmd: Update
29: waiting for ‘SQL*Net message from client’ wait
30: waiting for ‘SQL*Net message from client’ wait
31: waiting for ‘SQL*Net message from client’ wait
32: waiting for ‘SQL*Net message from client’ wait
33: waiting for ‘SQL*Net message from client’ wait
34: waiting for ‘Streams AQ: qmn coordinator idle wait’ wait
35: for ‘Streams AQ: waiting for time management or cleanup tasks’ wait
36: waiting for ‘Streams AQ: qmn slave idle wait’ wait
37: waiting for ‘enq: RO – fast object reuse'[Enqueue RO-00010025-00000001] wait
38: waiting for ‘SQL*Net message from client’ wait
39: waiting for ‘SQL*Net message from client’ wait
41: waiting for ‘SQL*Net message from client’ wait
42: last wait for ‘enq: TX – row lock contention’ wait
Blockers
~~~~~~~~

Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of ‘???’ implies that the holder was not found in the
systemstate.

Resource Holder State
Latch 70000048c8c66b0 ??? Blocker
Enqueue RO-0001000B-00000001 10: waiting for ‘rdbms ipc message’
Enqueue RO-0001000B-00000001 11: 11: is waiting for 10: 11:
Enqueue TX-00120003-00002CC0 22: last wait for ‘SQL*Net message from client’
Enqueue RO-00010025-00000001 10: waiting for ‘rdbms ipc message’
Enqueue RO-00010025-00000001 37: 37: is waiting for 10: 37:

Object Names
~~~~~~~~~~~~
Latch 70000048c8c66b0 holding (efd=14) 70000048c8c66b0 Child ca
Enqueue RO-0001000B-00000001
Enqueue TX-00120003-00002CC0
Enqueue RO-00010025-00000001

首先注意到的是RO队列锁,RO即(REUSE OBJECT),该锁用以协调前台进程与后台进程DBWR和CKPT之间的工作,该队列一般只在drop或truncate对象时可见到。可以看到pid为11和37的进程均在等待pid为10的进程,因为pid=10的进程持有着它们锁需要的RO-0001000B-00000001和RO-00010025-00000001,分析dump文件可以发现该持有进程正是CKPT后台进程;

而该检查点进程也处于BUSY的非空闲等待中,其等待事件为’rdbms ipc message’,即它在等待另一个后台进程给它发送信息。这个时候我们来观察其他忙碌的后台进程可以发现,pid为5,6,7的进程均在等待同一个栓’latch: cache buffers chains'[Latch 70000048c8c66b0];这个三个进程均为DBWR进程,此外还有一个DBW3进程处于’rdbms ipc message’等待中,多个DBWR进程是由于设置了db_writer_processes参数;看起来是ckpt进程准备对需要truncate的对象做对象级别的检查点,以保证该对象所有脏块均已写到磁盘上,所以对dbwr进程发出需要写出的message,继而进入’rdbms ipc message’等待直到dbwr进程完成写出任务,但由于dbwr进程长期无法获取某脏块对应的latch: cache buffers chains,故写出工作一直处于pending状态,这样一个hang链就十分清晰了。

我们来分析’latch: cache buffers chains'[Latch 70000048c8c66b0]这个栓,ass分析systemstate dump时将该栓的holder归为’???’,即无法自dump文件中找到该栓的持有者;进一步直接分析dump文件可以发现:
PROCESS 2:
—————————————-
SO: 70000048f529d40, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: 70000047d0e75a0/70000048f8956d0, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 504403177803376304 122 2
last post received-location: kslges
last process to post me: 70000048e533a88 1 6
last post sent: 0 0 148
last post sent-location: ktmpsm
last process posted by me: 70000048e535228 1 22
(latch info) wait_event=0 bits=2
holding (efd=14) 70000048c8c66b0 Child cache buffers chains level=1 child#=61214
Location from where latch is held: kcbgcur: kslbegin:
Context saved from call: 336311247
state=busy(exclusive) (val=0x2000000000000002) holder orapid = 2
waiters [orapid (seconds since: put on list, posted, alive check)]:
5 (480, 1278471465, 0)
6 (480, 1278471465, 0)
19 (480, 1278471465, 0)
7 (216, 1278471465, 0)
waiter count=4
Process Group: DEFAULT, pseudo proc: 70000048e5f91d8
O/S info: user: oracle, term: UNKNOWN, ospid: 74476
OSD pid info: Unix process pid: 74476, image: oracle@DR_570 (PMON)

可以看到pid=2的PMON进程holding 该cache buffers chains子栓,而dbwx进程即waiters 5,6,7;

一般情况下pmon是不会去持有类似于cache buffers chains子栓这样的低级栓的,除非在cleanup失败会话或关闭实例情况下。从ass分析信息来看,当时确实有一个进程处于DEAD状态,即22: last wait for ‘SQL*Net message from client’ [DEAD],而该会话最后所做的是delete from “LINC”.”MSDB_ACCGL” t操作。

虽然无法证明,但极有可能是22号进程在进行delete过程中发生会话失败,PMON进程尝试清理该进程,并获取了相关栓。但该进程始终没有被杀死,即便使用OS 命令:kill -9 处理该进程后仍可以在systemstate中找到该进程的信息。据同事分析,当时之前曾有一度PMON的CPU使用率达到100%,之后PMON进程进入’pmon timer’空闲等待,且一直没有释放对应子栓,令DBWR进程处于长期无法获得栓资源的进而hang住的状态。

在MOS上搜索PMON+cache buffers chains可以发现几个PMON长期持有该类子栓且从不释放的Bug,但版本为Oracle 8等较老版本,且都是不能reproduceable的case。谨以录之:

Hdr: 4126734 8.1.7.4.0 RDBMS 8.1.7.4.0 BUFFER CACHE PRODID-5 PORTID-59
Abstract: ORACLE PROCESS GOES TO CPU LOOP WHILE HOLDING “CHILD CACHE BUFFERS CHAINS” LATCH

*** 01/17/05 07:08 pm ***
TAR:
—-
4249496.999

PROBLEM:
——–
1. Clear description of the problem encountered:

Oracle process goes to CPU loop while holding “Child cache buffers chains

level=1” latch. This condition causes other processes to wait on
either “latch free” or “Child cache buffers chains” which completely hang

the instance.

When looping Oracle process is killed, PMON also goes to CPU loop
cleaning
up that process. PMON also hold the same “Child cache buffers chains
level=1” latch while instance continues to hang.

Hang disappear only after instance is bounced.

2. Pertinent configuration information (MTS/OPS/distributed/etc)

11i Apps Install (version 11.5.9.) running on Oracle 8.1.7.4.0 database.

3. Indication of the frequency and predictability of the problem

Process running pl/sql procedure doing dbms_lob.instr() and
dbms_lob.read()
seem to encounter this problem.

4. Sequence of events leading to the problem

5. Technical impact on the customer. Include persistent after effects.

In a bad day, ct encounters 2 occurence of this problem a day causing
loss
of service in peak processing period.

DIAGNOSTIC ANALYSIS:
——————–
1. Got 3 systemstate dumps while hang is occuring
2. Got 3 errorstack dumps of PMON while it is looping in CPU

WORKAROUND:
———–
bounce instance

RELATED BUGS:
————-
bug 2361194

REPRODUCIBILITY:
—————-
– always reproducible on test system at ct site.

TEST CASE:
———-
– unable to reproduce on 8.1.7.4.0 test system in-house

STACK TRACE:
————
PMON stack while looping:
ksedmp ksdxfdmp ksdxcb sspuser _sigreturn kggchk kcbso1 kcbpsod kcbsod
kssxdl kssdch ksudlc kssxdl ksudlp kssxdl ksuxdl ksuxda ksucln ksbrdp
opirip opidrv sou2o main $START$

SUPPORTING INFORMATION:
———————–
Customers running Release 11i of the E-Business Suite will be entitled
to complimentary Extended Maintenance Support through July 31, 2005.

ORA-00600 [4400][48]错误一例

5月26日某客户告警日志中出现的ORA-00600 [4400],[48]错误记录,并产生了trace文件:,

*** SERVICE NAME:(ETL) 2010-05-26 16:45:45.930
*** SESSION ID:(262.12024) 2010-05-26 16:45:45.930
*** 2010-05-26 16:45:45.930
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [4400], [48], [], [], [], [], [], []
ORA-10387: parallel query server interrupt (normal)
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              0FFFFFFFF ? 000000020 ?
ksedmp+0290          bl       ksedst               1048DFFC0 ?
ksfdmp+0018          bl       03F5B014
kgeriv+0108          bl       _ptrgl
kgeasi+0118          bl       kgeriv               10496FEC8 ? 000000002 ?
                                                   700000010008000 ? 110000AD0 ?
                                                   110190290 ?
ktcddt+012c          bl       kgeasi               110190110 ? 110450040 ?
                                                   113000001130 ? 200000002 ?
                                                   100000001 ? 000000000 ?
                                                   000000030 ? 000000013 ?
ktcsod+0384          bl       ktcddt               0000003C0 ? 000000018 ?
                                                   1048D4808 ?
kssdch_stage+0758    bl       _ptrgl
....................................................
----- End of Call Stack Trace -----
******************* Dumping process map ****************
377144 : ora_p001_ETL
100000000      93710K  read/exec         oracle
1100007ef       2372K  read/write        oracle
9fffffff0000000        44K  read/exec         /usr/ccs/bin/usla64
9fffffff000b30a         0K  read/write        /usr/ccs/bin/usla64
900000000375880       484K  read/exec         /usr/lib/liblvm.a[shr_64.o]
9001000a0121ac8       135K  read/write        /usr/lib/liblvm.a[shr_64.o]
900000000426180        74K  read/exec         /usr/lib/libcfg.a[shr_64.o]
9001000a00f5ed0        26K  read/write        /usr/lib/libcfg.a[shr_64.o]
90000000034e280         2K  read/exec         /usr/lib/libcrypt.a[shr_64.o]
9001000a00d4760         0K  read/write        /usr/lib/libcrypt.a[shr_64.o]
90000000069a7b0         4K  read/exec         /usr/lib/libc.a[aio_64.o]
9001000a022a340         0K  read/write        /usr/lib/libc.a[aio_64.o]
90000000035f300        85K  read/exec         /usr/lib/libodm.a[shr_64.o]
9001000a00d5c08        35K  read/write        /usr/lib/libodm.a[shr_64.o]
9000000003ef080        83K  read/exec         /usr/lib/libperfstat.a[shr_64.o]
9001000a01d0818         9K  read/write        /usr/lib/libperfstat.a[shr_64.o]
900000000358000         0K  read/exec         /usr/lib/libdl.a[shr_64.o]
9001000a0219000         0K  read/write        /usr/lib/libdl.a[shr_64.o]
9000000007ff100      8588K  read/exec         /oracle/product/10.2.0/lib/libjox10.a[shr.o]
8001000a0000038       585K  read/write        /oracle/product/10.2.0/lib/libjox10.a[shr.o]
9000000004a0000       228K  read/exec         /usr/lib/libpthreads.a[shr_xpg5_64.o]
9001000a0144000       558K  read/write        /usr/lib/libpthreads.a[shr_xpg5_64.o]
900000000045500      2966K  read/exec         /usr/lib/libc.a[shr_64.o]
9001000a0000788       844K  read/write        /usr/lib/libc.a[shr_64.o]
   Total      110843K
******************* End of process map dump ****************
===================================================
PROCESS STATE
-------------
Process global information:
     process: 7000000cf48bf98, call: 7000000b40ebdf0, xact: 7000000cdece7a8, curses: 7000000cf666540, usrses: 7000000cf666540
  ----------------------------------------
  SO: 7000000cf48bf98, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=43, calls cur/top: 7000000b40ebdf0/7000000b40ebdf0, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 115 0 4
              last post received-location: kslpsr
              last process to post me: 7000000cf479c38 1 6
              last post sent: 0 0 251
              last post sent-location: kxfpqr: QC
              last process posted by me: 7000000cf48a7f8 10 0
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 7000000cf50bd50
    O/S info: user: oracle, term: UNKNOWN, ospid: 377144
    OSD pid info: Unix process pid: 377144, image: oracle@etl_a (P001)
   SO: 7000000cf666540, type: 4, owner: 7000000cf48bf98, flag: INIT/-/-/0x00
    (session) sid: 262 trans: 7000000cdece7a8, creator: 7000000cf48bf98, flag: (c0100041) USR/- BSY/-/-/DEL/-/-
              DID: 0001-0028-000FFB1E, short-term DID: 0001-002B-00151DA1
              txn branch: 7000000cdf412c8
              oct: 0, prv: 0, sql: 0, psql: 0, user: 26/ETL
    O/S info: user: bo, term: pts/0, ospid: 377144, machine: etl_a
              program: oracle@etl_a (P001)
    last wait for 'SQL*Net message from dblink' blocking sess=0x0 seq=45 wait_time=476 seconds since wait started=0
                driver id=0, #bytes=1, =0
    Dumping Session Wait History
     for 'SQL*Net message from dblink' count=1 wait_time=476
                driver id=0, #bytes=1, =0
     for 'SQL*Net message to dblink' count=1 wait_time=1
                driver id=0, #bytes=1, =0
     for 'log file sync' count=1 wait_time=7483
                buffer#=ca9, =0, =0
     for 'log file sync' count=1 wait_time=358
                buffer#=ca9, =0, =0
     for 'PX Deq: Execution Msg' count=1 wait_time=15527
                sleeptime/senderid=1001ffff, passes=1, =0
     for 'log file sync' count=1 wait_time=11855
                buffer#=c57, =0, =0
     for 'PX Deq: Execution Msg' count=1 wait_time=277142
                sleeptime/senderid=1001ffff, passes=1, =0
     for 'direct path write' count=1 wait_time=25
                file number=5, first dba=103ca, block cnt=3
     for 'direct path write' count=1 wait_time=6
                file number=4, first dba=1abe1, block cnt=8
     for 'row cache lock' count=1 wait_time=3307
                cache id=5, mode=0, request=5
    temporary object counter: 0
      ----------------------------------------
      Virtual Thread:
      kgskvt: 7000000c0e86b58, sess: 7000000cf666540, vc: 0, proc: 7000000cf48bf98
      consumer group cur: OTHER_GROUPS (upd? 0), mapped: DEFAULT_CONSUMER_GROUP, orig:
      vt_state: 0x200, vt_flags: 0x30, blkrun: 0
      is_assigned: 1, in_sched: 0 (0)
      vt_active: 0 (pending: 0)
      used quanta: 0 (cg: 0)
      cpu start time: 0, quantum status: 0x0
      quantum checks to skip: 0, check thresh: 0
      idle time: 0, active time: 0 (cg: 0)
      cpu yields: 0 (cg: 0), waits: 0 (cg: 0), wait time: 0 (cg: 0)
      queued time outs: 0, time: 0 (cur 0, cg 0)
      calls aborted: 0, num est exec limit hit: 0
      undo current: 0k max: 0k
      ----------------------------------------
      UOL used : 0 locks(used=0, free=2)
      KGX Atomic Operation Log 7000000bdcf0568
       Mutex 0(0, 0) idn 0 oper NONE
       Cursor Parent uid 262 efd 9 whr 11 slp 0
       oper=NONE pt1=7000000c69d0878 pt2=7000000bef37090 pt3=7000000b47119c8
       pt4=0 u41=1 stt=0
      KGX Atomic Operation Log 7000000bdcf05b0
       Mutex 7000000bdc2d028(0, 2) idn 0 oper NONE
       Cursor Stat uid 262 efd 12 whr 1 slp 0
       oper=NONE pt1=7000000bdc2cef8 pt2=0 pt3=0
       pt4=0 u41=0 stt=0
      KGX Atomic Operation Log 7000000bdcf05f8
       Mutex 0(0, 0) idn 0 oper NONE
       Library Cache uid 262 efd 0 whr 0 slp 0
      ----------------------------------------
      SO: 7000000cf7e3ec8, type: 5, owner: 7000000cf666540, flag: INIT/-/-/0x00
      (enqueue) DX-00000010-00000000	DID: 0001-002B-00151DA9
      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      res: 0x7000000cf8bd0b8, mode: X, lock_flag: 0x0
      own: 0x7000000cf666540, sess: 0x7000000cf666540, proc: 0x7000000cf48bf98, prv: 0x7000000cf8bd0c8

从trace文件名就可以看出出错的是某个并行子进程,其最近调用堆栈为ktcsod->ktcddt->kgeasi(报错),ktcsod与ktcddt均为Oracle中内核事务控制模块函数 (Kernel Transaction Control),又该并行子进程最近等待为”SQL*Net message from dblink”事件,共等待476s(wait_time=476 seconds)。trace显示该并行会话包含分布式事务锁:

      SO: 7000000cf7e3ec8, type: 5, owner: 7000000cf666540, flag: INIT/-/-/0x00
      (enqueue) DX-00000010-00000000	DID: 0001-002B-00151DA9
      lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      res: 0x7000000cf8bd0b8, mode: X, lock_flag: 0x0
      own: 0x7000000cf666540, sess: 0x7000000cf666540, proc: 0x7000000cf48bf98, prv: 0x7000000cf8bd0c8

即DX(Distributed transaction entry)锁;可以猜测该并行会话在等待远程事务被清理,但等待超时,故报错。

查询metalink可以发现Bug 5223587与该错误一致,该Bug的特征为:
1. 出现ORA-600[4400],且调用栈为ktcddt ,说明涉及到了分布式事务
2. trace显示当时无打开的游标
3. 可以看到当时的等待为’SQL*Net message from dblink’事件
Metalink文档[ID  444108.1]描 述该错误会在涉及远程数据库的并行DML操作中出现,但该错误仅会在清理阶段出现,故实际无影响(It is not critical as such, since it is happening during the cleanup of the operation.)。

该Bug可以通过实施Patch 5223587修复, 目前该补丁只有对应基础版本10.2.0.2和10.2.0.3的版本,该补丁在 10.2.0.3 上有AIX power(64 bit)和Solaris Sparc(64 bit)的版本,10.2.0.2上只有linux-64 与HP-UX的版本。因为该Bug一般不会产生不良影响,故实际上不建议应用补丁。

ora-600[qesmmCValStat4]一例

在之前的文章中提到过一个有趣的绘制五角星的SQL,具体SQL语句如下:
with a as
(select distinct round(sum(x) over(order by n)) x,
round(sum(y) over(order by n)) y
from (select n,
cos(trunc(n / 20) * (1 – 1 / 5) * 3.1415926) * 2 x,
sin(trunc(n / 20) * (1 – 1 / 5) * 3.1415926) y
from (select rownum – 1 n from all_objects where rownum <= 20 * 5)))
select replace (sys_connect_by_path(point, ‘/’), ‘/’, null) star
from (select b.y, b.x, decode(a.x, null, ‘ ‘, ‘*’) point
from a,
(select *
from (select rownum – 1 + (select min(x) from a) x
from all_objects
where rownum <= (select max(x) – min(x) + 1 from a)),
(select rownum – 1 + (select min(y) from a) y
from all_objects
where rownum <= (select max(y) – min(y) + 1 from a))) b
where a.x(+) = b.x
and a.y(+) = b.y)
where x = (select max(x) from a)
start with x = (select min(x) from a)
connect by y = prior y and x = prior x + 1;

在10.2.0.1版本输入以上SQL可能会出现ORA-00600: internal error code, arguments: [qesmmCValStat4], [3], [1], [], [],[], [], []错误,g10r21_ora_15473.trc

其调用栈为:qercoFetch->qerhjFetch->qerhjInitializeManagementComponents->

qesmmCStartWorkArea->qesmmCValidateStatus->kgeasnmierr(报错)

metalink文档ID 360811.1对该 Bug 4926357进行了描述,该bug可能在9.2.07~10.2.0.1版本中出现,一般由带start with … … connect by prior ..子句的查询语句引起;典型的调用栈为:qesmmCValidateStatus<- qesmmCStartWorkArea <-qerhjInitializeManagementComponents <-qerhjFetch …

未公布的Bug 4401437是Bug 4926357的一个复制品avatar,该Bug已在10.1.0.5, 10.2.0.2, 11.1等版本中修复了。

Oracle support建议的四种解决方式:

1. 打上该Bug的one-off补丁;

2.升级到该Bug已修复的大版本中,例如从10.2.0.1升级到10.2.0.2;

3.设置参数hash_join_enabled(9i中),_hash_join_enabled(10g中),实例级别的或者会话级别的均可,如:

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bi
PL/SQL Release 10.2.0.1.0 - Production
CORE    10.2.0.1.0      Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

SQL> alter session set "_hash_join_enabled"=false;

Session altered.

SQL> set pagesize 1400;
SQL> with a as
2   (select distinct round(sum(x) over(order by n)) x,
3                    round(sum(y) over(order by n)) y
4      from (select n,
5                   cos(trunc(n / 20) * (1 - 1 / 5) * 3.1415926) * 2 x,
6                   sin(trunc(n / 20) * (1 - 1 / 5) * 3.1415926) y
7              from (select rownum - 1 n from all_objects where rownum <= 20 * 5)))
8  select replace (sys_connect_by_path(point, '/'), '/', null) star
9    from (select b.y, b.x, decode(a.x, null, ' ', '*') point
10            from a,
11                 (select *
12                    from (select rownum - 1 + (select min(x) from a) x
13                            from all_objects
14                           where rownum <= (select max(x) - min(x) + 1 from a)),
15                         (select rownum - 1 + (select min(y) from a) y
16                            from all_objects
17                           where rownum <= (select max(y) - min(y) + 1 from a))) b
18           where a.x(+) = b.x
19             and a.y(+) = b.y)
20   where x = (select max(x) from a)
21   start with x = (select min(x) from a)
22  connect by y = prior y and x = prior x + 1;

STAR
--------------------------------------------------------------------------------
.................
20 rows selected.

4.设置’no_filtering’ 提示,如:
with a as
(select distinct round(sum(x) over(order by n)) x,
round(sum(y) over(order by n)) y
from (select n,
cos(trunc(n / 20) * (1 – 1 / 5) * 3.1415926) * 2 x,
sin(trunc(n / 20) * (1 – 1 / 5) * 3.1415926) y
from (select rownum – 1 n from all_objects where rownum <= 20 * 5)))
select  /*+ no_filtering */ replace (sys_connect_by_path(point, ‘/’), ‘/’, null) star
from (select b.y, b.x, decode(a.x, null, ‘ ‘, ‘*’) point
from a,
(select *
from (select rownum – 1 + (select min(x) from a) x
from all_objects
where rownum <= (select max(x) – min(x) + 1 from a)),
(select rownum – 1 + (select min(y) from a) y
from all_objects
where rownum <= (select max(y) – min(y) + 1 from a))) b
where a.x(+) = b.x
and a.y(+) = b.y)
where x = (select max(x) from a)
start with x = (select min(x) from a)
connect by y = prior y and x = prior x + 1;

但是该hint在9.2.0.7版本中可能因为Bug 4752555 ‘Wrong results from CONNECT BY query’而导致返回错误行数;并且这是一个undocumented hint,所以强烈不推荐使用!

ORA-600 internal error[kqrfrpo]一例

3月17日某客户主机上出现了文件系统空间不足的问题,经查发现是Oracle在1点左右产生了大量trace跟踪文件以致耗尽了磁盘空间,这些trace跟踪文件均是由Oracle服务进程遭遇错误“ORA-600: : internal error code, arguments: [kqrfrpo], [0x700000279FF98E0], [11], [], [], [], [], []”后在用户udump目录下所产生。

据客户工作人员称,在当时情况下(1点20分左右)以系统DBA权限本地方式登录数据库缓慢,而在1点25分时,登录缓慢现象消失,当时检查数据库,发现没有异常等待(如latch free等)和资源锁,数据库恢复正常。

Ora-600错误代码代表发生的错误是Oracle内部错误,一般是由于Oracle bug,操作系统bug或不当设置等问题引起的;具体发生的问题细节需要通过错误附加项来了解,本次错误中出现的第一位错误附加项是kqrfrpo。

通过METALINK相关文档我们发现ORA-600 kqrfrpo错误极有可能是Oracle 9i中的bug 3835429 OERI[kqrfrpo] / DB hang after killing a user process 所引起的,该bug跨越版本9.2.0.1.0至9.2.0.6.0,在版本9.2.0.7.0中得到了修正。

该bug的产生原因简述为,当一个用户进程在某个不恰当的时机被杀死,那么字典缓存latch可能无限期地被挂起,当其他进程无法获得该闩,则可能引起数据库级别的挂起(database wide hang)。

进一步分析ORA-600 kqrfrpo错误可能造成的影响,METALINK文档中就该错误可能造成影响的分类如下:
* 实例意外终止,即Oracle数据库crash(如pmon进程发生ora-600错误)
* 进程意外终止,在数据库层面表现为会话级的失败
* 内存块损坏
* 可能导致磁盘上的数据损坏
* 无任何影响

幸运的是,本次的ORA-600[kqrfrp]错误没有发生在数据库后台进程(pmon等)中,因此没有发生实例意外终止的现象,但出现登录数据库缓慢的现象,并且伴随用户进程因ora-600错误而异常中止。仔细观察600错误的trace文档可以发现,其中部分数据库服务进程的应用客户端为JDBC THIN CLIENT即java瘦客户端应用,若该类应用服务在活动情况下遭遇上述错误可能导致SQL执行出现问题,进而使得应用逻辑在数据库层面未得到实现。实际的情况仍需要得到应用方面的确认。

回顾该系统之前的情况,于3月16日夜间因通过cics连接的数据库服务进程遭遇ORA-600[4454]错误,在当时情况下无法在数据库级别杀死session,故在操作系统级别杀死了上述遭遇ORA-600[445]错误的服务进程;联系到以上情况,有可能是杀死进程触发了BUG,使得ORA-600[kqrfrpo]错误出现。从杀死用户进程到ora-600错误出现,期间跨越了4个小时。

针对ORA-600 kqrfrpo错误,因考虑到该错误是通过杀死用户服务进程的操作触发,故可以将之视为在特殊操作情况下才可能发生的隐性错误,实际数据库运行周期内需要杀死服务进程的情况并不常见,故该错误发生的概率较低。建议:
* 优先使用alter system kill session的命令来清除相关会话和进程
* 应用针对该bug的补丁3835429以彻底解决该问题。

11gr2 Pseudo-error debugging events

so different from 10gr2,more debugging events added.

for example , event 10949:

10949, 00000, “Disable autotune direct path read for full table scan”

AND

10978, 00000, “general event for materialized view logs”

you can view the event list from HERE.

沪ICP备14014813号-2

沪公网安备 31010802001379号