ORA-00600 [KCBZPB_1], [59033077], [4], [1], [] example

below is the 600 entry in the alertlog:

alert.log:
Hex dump of Absolute File 14, Block 312821 in trace file /u01/ORAHOME/app/oracle/admin/TIGERS7/bdump/tigers7_dbw0_10999.trc
***
Corrupt block relative dba: 0x0384c5f5 (file 14, block 312821)
Bad header found during preparing block for write
Data in bad block -
type: 6 format: 1 rdba: 0x00000384
last change scn: 0xf90b.c5f55f7c seq: 0x9 flg: 0x72
consistency value in tail: 0x0001f90b
check value in block header: 0x102, block checksum disabled
spare1: 0x6, spare2: 0x2, spare3: 0x0
***
Thu Apr 16 18:32:48 2009
Errors in file /u01/ORAHOME/app/oracle/admin/TIGERS7/bdump/tigers7_dbw0_10999.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [59033077], [4], [1], [], [], [], []
Thu Apr 16 18:32:49 2009
Errors in file /u01/ORAHOME/app/oracle/admin/TIGERS7/bdump/tigers7_dbw0_10999.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [59033077], [4], [1], [], [], [], []
DBW0: terminating instance due to error 600
Instance terminated by DBW0, pid = 10999
Thu Apr 16 19:04:58 2009

After that, We have executed dbverify against the identified file and it produced no errors:

DBVERIFY: Release 9.2.0.8.0 - Production on Thu Apr 16 19:31:32 2009

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.


DBVERIFY - Verification starting : FILE = /u32/ORAINDX/oradata/TIGERS7/indx01.dbf

DBVERIFY - Verification complete

Total Pages Examined : 1280000
Total Pages Processed (Data) : 0
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 1262823
Total Pages Failing (Index): 0
Total Pages Processed (Other): 8751
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 8426
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 10386833124905 (2418.1602203177)

we do open a sr ,and oracle support suggest to do below query:
ACTION PLAN
===========
1) please describe the sequence of events leading up to the problem

2) please upload the alert.log. ZIP if >2MB. Dot not use RAR.

3) please describe your backup strategy:
a) when was your last valid backup?
b) are you using RMAN to perform this backup?
c) do you have all archivelogs from the last backup to now?
d) was this a hot or cold backup?

4) even if you’re not using RMAN, run the following in RMAN:
$ rman target /
RMAN> backup validate check logical database;

5) Once RMAN validate is completed, run the following in SQL*Plus as SYSDBA:
SQL> select * from v$database_block_corruption;

6) Please run the following query in SQL*Plus as SYSDBA
— db must be in either MOUNT or OPEN mode
— Save the queries to a file, eg. rec_query1.sql, then run it in SQL*Plus
—————– start ——————
set echo on
set pagesize 2000 linesize 200 trimspool on
col name form a60
col status form a10
col dbname form a15
col member form a60
col inst_id form 999
col resetlogs_time form a25
col created form a25
col DB_UNIQUE_NAME form a15
col stat form 9999999999
col thr form 99999
col “Uptime” form a80

spool rec_query1.out
show user
alter session set nls_date_format=’DD-MM-RR hh24:mi:ss’;

select inst_id, instance_name, status,
to_char(STARTUP_TIME,’dd-Mon-yyyy hh24:mi’) || ‘ – ‘ ||
trunc(SYSDATE-(STARTUP_TIME) ) || ‘ day(s), ‘ ||
trunc(24*((SYSDATE-STARTUP_TIME) – trunc(SYSDATE-STARTUP_TIME)))||’ hour(s), ‘ ||
mod(trunc(1440*((SYSDATE-STARTUP_TIME) – trunc(SYSDATE-STARTUP_TIME))), 60) ||’ minute(s), ‘ ||
mod(trunc(86400*((SYSDATE-STARTUP_TIME) – trunc(SYSDATE-STARTUP_TIME))), 60) ||’ seconds’
“Uptime”
from gv$instance
order by inst_id
/

select dbid, name dbname, open_mode, database_role,
to_char(created,’dd-Mon-YYYY hh24:mi:ss’) created,
to_char(resetlogs_time,’dd-Mon-YYYY hh24:mi:ss’) resetlogs_time
from v$database;

archive log list;

select count(*) from v$backup where status = ‘ACTIVE’;

select * from v$log;
select * from v$logfile;
select * from v$recover_file order by 1;

select distinct(status)from v$datafile;
select FILE#,TS# , status, NAME from v$datafile
where status not in (‘SYSTEM’,’ONLINE’)
order by 1;

select fhsta, count(*)
from X$KCVFH group by fhsta;

select min(fhrba_Seq), max(fhrba_Seq)
from X$KCVFH;
select hxfil FILE#,fhsta STAT,fhscn SCN,
fhthr thr, fhrba_Seq SEQUENCE,fhtnm TABLESPACE
from x$kcvfh order by 1;

7) dump the block. Run the following as SYSDBA in SQL*Plus:
SQL> alter session set max_dump_file_size=unlimited;
SQL> oradebug setmypid;
SQL> alter system dump datafile ‘full pathname for file 14’ block 312821;
SQL> oradebug tracefile_name;

==> upload the said trace file

8) run dbv against datafile 14:
$ dbv file= logfile=

spool off
—————– end ——————

RESEARCH
===============
ORA-600 [4519] “Block Corruption Detected – Cache type wrong”
We found a corrupted block when trying to read a block using
consistent read. An invalid block type was found.
Possible Block Corruption in Memory.

ORA-600 [kcbzpb_1] A block has been read cleanly from disk and updated successfully by the
clients of the cache layer.
Before the cache layer writes the block back to disk it does a health
check on the cache header.
If requested to do so (default), it generates a checksum for the block.
The health check is failing.
MEMORY CORRUPTION

ORA-600 [kcbzpb_1] was raised because DBA 59033077 => 14,312821 was found corrupted when read in the cache before we writ eit to disk.
Alert.log shows same block as corrupted, BAD HEADER, meaning blocks was overwriten.
Now DBV doesn’t show any corruption in file 14.

ACTION PLAN
====================

Hi,

I reviewed the information and the crash was caused by in memory corruption.
If restarted your database should be fine.

RESEARCH
================
Db crashed with ORA-600 [KCBZPB_1] because of corrupted block in memory:

STACK: kcbbxsv kcbbwlru kcbbdrv ksbabs ksbrdp

Bug.5866883/5845232 (36) INSTANCE GOES DOWN DUE TO ORA-600 [KCBZPB_1] V9208:
Bug.5845843/5845232 (96) DATABASE CRASH BY ORA-00600 [2032] , ORA-00600 [KCBZPB_1]

Bug:5845232: Block corruption / errors from concurrent dequeue operations
Tags: AQ CORR/PHY DUMP OERI R9208 REGRESSION SUPERCEEDED
Details:
This problem is introduced in 9.2.0.8 by the fix for bug 4144683.
Concurrent dequeue operations can lead to block corruption
/ memory corruption with varying symptoms such as ORA-600 [6033],
ORA-600 [6101] and ORA-600 [kcoapl_blkchk] if DB_BLOCK_CHECKING is enabled.
The fix for this bug is Patch 6401576.

Bug:6401576 ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / CORRUPTION MESSAGES –> DB CRASH
Abstract: OERI[ktbair1] / ORA-600 [6101] index corruption possible
Fixed-Releases: WIN:9208P22
Tags: CORR/IND OERI
Details:
Note: This fix replaces the fix in bug 5845232.
Certain index operations can lead to block corruption
/ memory corruption with varying symptoms such as ORA-600 [6033],
ORA-600 [6101] , ORA-600 [ktbair1] , ORA-600 [kcbzpb_1],
ORA-600 [4519] and ORA-600 [kcoapl_blkchk] if DB_BLOCK_CHECKING is enabled.

ISSUE CLARIFICATION
====================
Db crashed with ORA-600 [KCBZPB_1]

ISSUE VERIFICATION
===================
alert.log and trace file

CAUSE DETERMINATION
======================
in memory corruption

CAUSE JUSTIFICATION
====================
Bug:6401576 ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / CORRUPTION MESSAGES –> DB CRASH

POTENTIAL SOLUTION(S)
======================
apply patch for Bug:6401576

POTENTIAL SOLUTION JUSTIFICATION(S)
====================================
to fi x the issue

SOLUTION / ACTION PLAN
=======================

Hi,

These errors looks very similar to Bug:6401576 ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / CORRUPTION MESSAGES –> DB CRASH

Please download and apply one-off patch for Bug:6401576 from

Metalink->Patches->patch#=6401576 ->Platform=Hp_UX

Thanks, Rodica

ORA-00600 [kcbz_check_objd_typ_3]错误一例

5月26日凌晨某客户实例警告日志中出现”ORA-00600: internal error code, arguments: [kcbz_check_objd_typ_3], [0], [0], [1], [], [], [], []“,经过分析发现与之前CR实例发生的600错误情况症状相同。

从跟踪文件m1_m0001_4209.trc中可以看到当时的执行的SQL语句,如附件。

PL/SQL的调用堆栈为:

—– PL/SQL Call Stack —–

object      line  object

handle    number  name

3f3e89300        10  package body SYS.PRVT_HDM

3f5e9d3d8        16  SYS.WRI$_ADV_HDM_T

3f3f90898      1535  package body SYS.PRVT_ADVISOR

3f3f90898      1618  package body SYS.PRVT_ADVISOR

3f3e89300       106  package body SYS.PRVT_HDM

3eb69a3f8         1  anonymous block

函数调用堆栈为:

kgerinv kgeasnmierr kcbassertbd3 kcbz_check_objd_typ kcbzib kcbgtcr ktrget kdsgrp kdsfbr qertbFetchByRowid rwsfcd…….

以上信息与metalink Bug 4430244 中的描述完全一致,原因为Segment advisor的代码错误的将已被drop的对象数据块加载到缓存区中,导致后续的操作出现错误。

通过之前的调查已明确当前系统中已应用过Bug 4430244的补丁,认为可能是ORACLE没有解决但存在的未知BUG,或者是以前 提供的4430244补丁存在缺陷,原因如下:

1. 已确定目前所应用的小补丁没有冲突,即各小补丁所实现的功 能不受影响;

2. 当前数据库系统已经使用了与本次错误信息完全符合的bug 4430244的小补丁;

3. 当相关的小补丁都应用之后,在metalink上仍有客户提交完全相同的错误(详见bug 7032704和bug 6818725)。

4. 本次错误相关的数据块与以前(2008年7月)出现的不一致,因此不是物理存储上的损坏。

5. 本次错误与Bug 6388743 “ORA-00600 [KCBZ_CHECK_OBJD_TYP_3],[0],[0],[1],[],[],[],[] OCCURRED”中的描述完全一致,ORACLE猜测是但没有确定是BUG 4430244, 只是建议使用4430244的小补丁或10.2.0.4的补丁包, 最后此bug以客户应用10.2.0.4补丁包而被视 为结束。真实原因不明。

Metalink在之前该实例出现错误的SR提出以下解决方法:

1. 在RAC的所有实例中冲 刷buffer_cache

使用命令刷数据缓存区后,会使缓存区的数据块都标记为free,即以前缓存的数据都被移出内存。影响主要是在一段时间内使物理读增加,因 此建议在业务空闲时段操作,避免在刷缓存的同时有高负载应用在申请缓存空间。该方法可临时防止相关错误抛出。

2. 停用Segment advisor job

该job相关功能是进行 对象段(如表,索引等)的存储空间状态收集,识别 是否适合根据其可用空间大小进行收缩,并提供建议,例如是否存在较大的浪费空间,表的高水位线是否太高,是否需要进行表的回缩等。如 不需要或不关心对象段的存储空间状态,可不做相关操作。该方法可以长远解决问题的发生。

3. 升级数据库版本至10.2.0.4

author: maclean
permanent link:https://www.askmac.cn/2010/05/31/ora-00600-kcbz_check_objd_typ_3%E9%94%99%E8%AF%AF%E4%B8%80%E4%BE%8B/
date:2010-05-31
All rights reserved.

ora-4031 and "obj stat memory" component in Shared Pool

My customer has an application which using a lots of “temporary” table. They create so many tables but only operate once.And they are using automatic SGA management,all this cause large memory usage in shared pool, especially the “obj stat memo”.
Recently they encountered  the error 4031 , there are still 1G space available in shared pool, so i believe the root cause is too many obj stat fragment .
Verified the issue in the log file as noted below:
LOG FILE
-----------------------
Filename  =gpnms4_j000_22234.trc.log
See the following error:

***  2010-05-22 21:47:37.388
ORA-12012: error on auto execute of job 2
ORA-04031:  unable to allocate 4064 bytes of shared memory ("shared pool","select  SYSDATE+(30 / (24*60)...","sga heap(4,0)","kglsim heap")
***  2010-05-22 21:47:37.410
ORA-00604: error occurred at recursive SQL  level 1
ORA-04031: unable to allocate 4064 bytes of shared memory  ("shared pool","select sysdate+(64)/(24*60) ...","sga heap(6,0)","kglsim  heap")
ORA-12012: error on auto execute of job 2
ORA-04031:  unable to allocate 4064 bytes of shared memory ("shared pool","select  SYSDATE+(30 / (24*60)...","sga heap(4,0)","kglsim heap")
4031 specified trace as noted below:

TRACE FILE gpnms4_j000_22234.trc

------------
*** SERVICE  NAME:(SYS$USERS) 2010-05-22 21:47:37.240
*** SESSION ID:(121.20137)  2010-05-22 21:47:37.240
*** 2010-05-22 21:47:37.240
=================================
Begin  4031 Diagnostic Information
=================================
The  following information assists Oracle in diagnosing
causes of  ORA-4031 errors.  This trace may be disabled
by setting the init.ora  _4031_dump_bitvec = 0
=====================================
Allocation  Request Summary Informaton
=====================================
Current  information setting:  04014fff
SGA Heap Dump Interval=3600 seconds
Dump Interval=300 seconds
Last Dump Time=05/22/2010 21:47:37
Dump Count=1
Allocation request for:      kglsim heap
Heap:  380048e48, size: 4064
******************************************************
HEAP  DUMP heap name="sga heap(4,0)"  desc=380048e48
extent sz=0xfe0  alt=216 het=32767 rec=9 flg=-125 opc=2
parent=0 owner=0 nex=0  xsz=0x1000000
latch set 4 of 6
durations enabled for this heap
reserved granules for root 0 (granule size 16777216)
====================
Process  State Object
====================
----------------------------------------
SO:  5761a3750, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle  pid=37, calls cur/top: 56e79bdc8/572cd9160, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info)  last post received: 1359 0 4
last post  received-location: kslpsr
last process to post me:  57f156848 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by  me: 57f156848 1 6
(latch info) wait_event=0 bits=60
holding     (efd=19) 5731bf3a0 Child shared pool sim alloc level=6 child#=12
Location from where latch is held: kglsim_chk_heaplist: alloc:
Context saved from call: 0
state=busy, wlstate=free
holding    (efd=19) 5711d3de0 Child library cache level=5 child#=15
Location from where latch is held: kglobpn: child:: latch
Context saved from call: 22
state=busy, wlstate=free
Process Group: DEFAULT, pseudo proc: 57416ad08
O/S info:  user: oracle, term: UNKNOWN, ospid: 22234
OSD pid info: Unix  process pid: 22234, image: oracle@SHUDBa1 (J000)

SO:  5761d48b0, type: 4, owner: 5761a3750, flag: INIT/-/-/0x00
(session)  sid: 121 trans: 0, creator: 5761a3750, flag: (40000041) USR/-  BSY/-/-/-/-/-
DID: 0001-0025-0060BC7E, short-term DID:  0000-0000-00000000
txn branch: 0
oct: 0,  prv: 0, sql: 0, psql: 0, user: 93/WUBIAO
O/S info: user: oracle,  term: UNKNOWN, ospid: 22234, machine: SHUDBa1
program:  oracle@SHUDBa1 (J000)
last wait for 'SGA: allocation forcing  component growth' blocking sess=0x0 seq=2 wait_time=10349 seconds since  wait started=0
=0, =0, =0
Dumping Session Wait History
for 'SGA: allocation forcing component growth' count=1  wait_time=10349
=0, =0, =0
for 'db file sequential  read' count=1 wait_time=89
file#=7, block#=6f767, blocks=1
temporary object counter: 0

----- Call Stack Trace -----
ksm_4031_dump  <- ksmasg <- kghnospc <- kghalo <- kglsim_chk_heaplist
<- kglsim_upd_newhp <- 3076 <- kglUpdateSimulator <-  kglobpn <- kglpim
<- qcdlgtd <- qcsfplob <-  qcsprfro <- qcsprfro_tree <- qcsprfro_tree
<-  qcspafq <- qcspqbDescendents <- qcspqb <- kkmdrv <- opiSem
<- opiprs <- kksParseChildCursor <- rpiswu2 <-  kksLoadChild <- kxsGetRuntimeLock
<- kksfbc <-  kkspsc0 <- kksParseCursor <- opiosq0 <- opiodr
<- rpidrus <- skgmstack <- rpidru <- rpiswu2 <- rpidrv
<- rpisplu <- rpispl <- kkjfnd <- kkjex1e  <- kkjsexe
<- kkjrdp <- opirip <- opidrv  <- sou2o <- opimai_real
<- main

==============================
Memory  Utilization of Subpool 1
================================
Allocation Name          Size
_________________________   __________
"free memory              "   390197664
"obj stat  memo            "   403349328
"KGH: NO ACCESS           "     16695456

==============================
Memory Utilization  of Subpool 2
================================
Allocation Name           Size
_________________________  __________
"free  memory              "   319286368
"obj stat memo            "    380580336
"KGH: NO ACCESS           "    16728160

==============================
Memory  Utilization of Subpool 3
================================
Allocation Name          Size
_________________________   __________
"free memory              "   325262592
"obj stat  memo            "   351782568
"KGH: NO ACCESS           "     16728160

==============================
Memory Utilization  of Subpool 4
================================
Allocation Name           Size
_________________________  __________
"free  memory              "   556035112
"obj stat memo            "    551594016
"KGH: NO ACCESS           "    32394336

==============================
Memory  Utilization of Subpool 5
================================
Allocation Name          Size
_________________________   __________
"free memory              "   577033336
"obj stat  memo            "   432672408
"KGH: NO ACCESS           "     16736320

==============================
Memory Utilization  of Subpool 6
================================
Allocation Name           Size
_________________________  __________
"free  memory              "   333050648
"obj stat memo            "    406731024
"KGH: NO ACCESS           "    16719968

LIBRARY  CACHE STATISTICS:
namespace           gets hit ratio      pins hit  ratio    reloads   invalids
-------------- --------- ---------  --------- --------- ---------- ----------
CRSR           2214572558      0.355 3977649054     0.403   81251674   34891116
TABL             69900473     0.844 438948912     0.885   11952398          0
BODY              1126452     0.970  86610514     0.999      50956          0
TRGR            104735463     0.994 156443548     0.994     369130          0
INDX             23588405     0.667  47086339     0.801    1478850          0
CLST              2746262     0.997   4303977     0.994      15628          0
At first, i think this “obj stat memory”  subheap can be cleaned up by flush shared pool, but i ‘m wrong ; The component is permanent in shared pool.
SQL>  select  * from v$sgastat where name='obj stat memo';

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  obj stat memo                 1454448

SQL> alter system flush shared_pool;

System altered.
SQL> select  * from v$sgastat where name='obj stat memo';

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  obj stat memo                 1454448
After that I  read the metalink note “Bug 5573238  Shared pool memory use / ORA-4031 due to “obj stat memo” in one subpool”, oracle support advise to workaround this problem by setting STATISTICS_LEVEL=BASIC or “_object_statistics”=false. But STATISTICS_LEVEL=BASIC will disable many automatic features like ASMM, Automatic statistics gathering;If you set “_object_statistics”=false, then as  _object_statistics control the population of segment stats in the v$segstat andv$segment_statistics views. So setting this to false will result in no data populating these v$views.Both the parameters will require a database restart.
And i have opened a new tar , the oracle support advise as below:

Trace file shows indeed a high size for the ‘obj stat memo’ component but also some imbalance between the subpools.

Bug 5573238 is not an issue here as fixed as from 10.2.0.4 but the workaround can be used here as well. So, setting
“statistics_level”=basic or “_object_statistics”=false should also workaround this issue.

The imbalance of the subpools on the other hand might be caused by bug 6271590 which is fixed as from PSU patch
10.2.0.4.2.

I would suggest to apply the latest PSU patch available right now and that is 10.2.0.4.4.

For the ‘obj stat memo’ issue, you have to choose between further investigating the issue or using the workaround by
setting “statistics_level”=basic or “_object_statistics”=false.

If you want to further investigate the issue, we will need a more detailed heapdump of the ORA-4031.

To Do:
1. Apply PSU patch 10.2.0.4.4 (patch 9352164) on top of patchset 10.2.0.4

2. Regarding the ‘obj stat memo’ issue:

a) Workaround the issue by setting “statistics_level”=basic or “_object_statistics”=false
+
Bounce the instance

-OR-

b) Further investigate the issue:

Set following parameters in the init.ora file (SPFILE/PFILE):
SQL> alter system set max_dump_file_size = unlimited scope=spfile;
SQL> alter system set events ‘4031 trace name heapdump level 536870914’ scope=spfile;

Bounce the instance

Once the ORA-4031 reoccurs, provide alert+trace file

So imbalance between subpool can be fixed , and will reduce the likelihood of  4031 occurrence. But we  can never flush huge memory used by “obj stat memory”,i think it’s awful .

ORA-600 [kddummy_blkchk] [18038] 一例

一位客户的Oracle告警日志中出现了ORA-600 [kddummy_blkchk] [18038]故障,alert中的具体信息:

Errors in file /u01/app/oracle/admin/prdw014a/udump/prdw014a_ora_4377.trc:
ORA-00600: internal error code, arguments: [kddummy_blkchk], [222], [5792], [18038], [], [], [], []
Mon May 17 15:27:53 2010
Trace dumping is performing id=[cdmp_20100517152753]
Mon May 17 15:27:53 2010
Doing block recovery for file 2 block 504365
Block recovery from logseq 159276, block 166357 to scn 10934615778284
Mon May 17 15:27:53 2010
Recovery of Online Redo Log: Thread 1 Group 4 Seq 159276 Reading mem 0
Mem# 0: /u01/app/oracle/dataPRDW014/redo04a_1.log
Mem# 1: /u01/app/oracle/dataPRDW014/redo04a_2.log
Block recovery completed at rba 159276.167277.16, scn 2545.3924010007
Doing block recovery for file 222 block 5792
Block recovery from logseq 159276, block 84741 to scn 10934615778283
Mon May 17 15:27:53 2010
Recovery of Online Redo Log: Thread 1 Group 4 Seq 159276 Reading mem 0
Mem# 0: /u01/app/oracle/dataPRDW014/redo04a_1.log
Mem# 1: /u01/app/oracle/dataPRDW014/redo04a_2.log
Block recovery completed at rba 159276.167277.16, scn 2545.3924009964
Mon May 17 15:27:55 2010
Block recovery completed at rba 159276.167277.16, scn 2545.3924009964
Mon May 17 15:27:55 2010
Corrupt Block Found
TSN = 67, TSNAME = OBA_DATA
RFN = 222, BLK = 5792, RDBA = 931141280
OBJN = 1657288, OBJD = 1699775, OBJECT = W_ORG_DS, SUBOBJECT =
SEGMENT OWNER = BMS_OBA_DW, SEGMENT TYPE = Table Segment
Mon May 17 15:32:56 2010
Trace dumping is performing id=[cdmp_20100517153255]

附600错误产生的trace信息:
prdw014a_ora_4377.trc

/u01/app/oracle/admin/prdw014a/udump/prdw014a_ora_4377.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine
and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/102prdw014
System name: SunOS
Node name: v08k405
Release: 5.9
Version: Generic_122300-29
Machine: sun4u
Instance name: prdw014a
Redo thread mounted by this instance: 1
Oracle process number: 109
Unix process pid: 4377, image: oracle@v08k405

*** 2010-05-17 15:23:15.391
*** ACTION NAME:() 2010-05-17 15:23:15.389
*** MODULE NAME:(pmdtm@v04k413 (TNS V1-V3)) 2010-05-17 15:23:15.389
*** SERVICE NAME:(prdw014_taf) 2010-05-17 15:23:15.389
*** SESSION ID:(789.48811) 2010-05-17 15:23:15.389
TYP:0 CLS: 4 AFN:222 DBA:0x378016a0 OBJ:1699775 SCN:0x09f1.e9e3a3eb SEQ: 2 OP:14.4
kteop redo - redo operation on extent map
RESIZE: entry:0 delta:

...
..
..
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kddummy_blkchk], [222], [5792], [18038], [], [], [], []
Current SQL statement for this session:
INSERT /*+ SYS_DL_CURSOR */ INTO bms_oba_dw.W_ORG_DS ("W_CUSTOMER_CLASS","NAME","ST_ADDRESS","CITY","STATE","ZIPCODE","COUNTRY","CUST_TYPE_CODE","CUST_TYPE_NAME","ACTIVE_FLG","DOM_ULT_DUNS_NUM","DUNS_NUM","EMP_COUNT","FORMED_DT","GLBLULT_DUNS_NUM","ANNUAL_REVENUE","BRANCH_FLG","BIRTH_DT","NO_OF_CHILDREN","LEGAL_NAME","FAMILY_NAME","OTHER_NAME","PREFERRED_NAME","INDV_ADDNL_TITLE","INDV_TITLE","INDV_MARITAL_STATE","INDV_GENDER","EMAIL_ADDRESS","RELATIONSHIP_STATE","INDV_EMP_STATUS","FAX_NUM","PAGER_NUM","MOBILE_NUM","LIFE_CYCLE_STATE","CUST_CAT_CODE","CUST_CAT_NAME","SIC_CODE","SIC_NAME","GOVT_ID_TYPE","GOVT_ID_VALUE","DUNNS_SITE_NAME","DUNNS_GLOBAL_NAME","DUNNS_LEGAL_NAME","CUSTOMER_NUM","ALT_CUSTOMER_NUM","ALT_PHONE_NUM","INTERNET_HOME_PAGE","LEGAL_STRUCT_CODE","LEGAL_STRUCT_NAME","DIRECT_MKTG_FLG","SOLICITATION_FLG","CUSTOMER_HIER1_CODE","CUSTOMER_HIER1_NAME","CUSTOMER_HIER2_CODE","CUSTOMER_HIER2_NAME","CUSTOMER_HIER3_CODE","CUSTOMER_HIER3_NAME","CUSTOMER_HIER4_CODE","CUSTOMER_HIER4_NAME","CUSTOMER_HIER5_CODE","CUSTOMER_HIER5_NAME","CUSTOMER_HIER6_CODE","CREATED_BY_ID","CHANGED_BY_ID","CREATED_ON_DT","CHANGED_ON_DT","AUX1_CHANGED_ON_DT","AUX2_CHANGED_ON_DT","AUX3_CHANGED_ON_DT","AUX4_CHANGED_ON_DT","SRC_EFF_FROM_DT","SRC_EFF_TO_DT","DELETE_FLG","DATASOURCE_NUM_ID","INTEGRATION_ID","TENANT_ID","X_CUSTOM","MOT_ATTRIBUTE1","MOT_ATTRIBUTE2","MOT_ATTRIBUTE3","MOT_ATTRIBUTE4","MOT_ATTRIBUTE5","MOT_ATTRIBUTE6","MOT_ATTRIBUTE7","MOT_ATTRIBUTE8","MOT_ATTRIBUTE9","MOT_ATTRIBUTE10","MOT_ATTRIBUTE11","MOT_ATTRIBUTE12","MOT_ATTRIBUTE13","MOT_ATTRIBUTE14","MOT_ATTRIBUTE15","MOT_ATTRIBUTE16","MOT_ATTRIBUTE17","MOT_ATTRIBUTE18","MOT_ATTRIBUTE19","MOT_ATTRIBUTE20","MOT_PARTY_TYPE","MOT_PHONE_AREA_CODE","MOT_ORIG_SYSTEM_REFERENCE","MOT_PER_EMAIL_ADDR","MOT_PERSON_FIRST_NAME","MOT_PHONE_EXTENSION","MOT_ALTERNATE_NAME","MOT_TELEPHONE_TYPE","MOT_SALES_CHANNEL_CODE","MOT_ACCOUNT_NAME","MOT_ATTRIBUTE_CATEGORY","MOT_INTERCOMPANY_FLAG","MOT_PARTY_NUMBER","MOT_PARTY_ID","MOT_LAST_UPDATE_LOGIN","MOT_CUST_CLASS_DESC","MOT_RECEIPT_METHOD_NAME","MOT_PHONE_NUMBER","MOT_CONTACT_POINT_PURPOSE","MOT_SALESREP_NAME","MOT_PAY_TERMS_CODE","MOT_PAY_TERMS_NAME") VALUES (NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL)
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp()+744         CALL     ksedst()             000000840 ?
                                                   FFFFFFFF7FFF620C ?
                                                   000000000 ?
                                                   FFFFFFFF7FFF2D00 ?
                                                   FFFFFFFF7FFF1A68 ?
                                                   FFFFFFFF7FFF2468 ?
kgerinv()+200        PTR_CALL 0000000000000000     000106800 ? 10681C1C4 ?
                                                   10681C000 ? 00010681C ?
                                                   000106800 ? 10681C1C4 ?
kseinpre()+96        CALL     kgerinv()            106816B18 ? 000000000 ?
                                                   1064564C0 ? 000000003 ?
                                                   FFFFFFFF7FFF6750 ?
                                                   000001430 ?
ksesin()+52          CALL     kseinpre()           000106800 ? 000000003 ?
                                                   00000025F ? 10681C1B8 ?
                                                   FFFFFFFF7FFF6750 ?
                                                   1068167D8 ?
kco_blkchk()+2568    CALL     ksesin()             1064564C0 ? 000000003 ?
                                                   000106800 ? 0000000DE ?
                                                   000000000 ? 000106800 ?
kcoapl()+1284        CALL     kco_blkchk()         0001900DE ? 0378016A0 ?
                                                   0000016A0 ? 00000FC00 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFF89F8 ?
kcbapl()+412         CALL     kcoapl()             000000002 ? 000002300 ?
                                                   000105800 ? 583DBC000 ?
                                                   106816C98 ? 00010598F ?
kcrfw_redo_gen()+16  CALL     kcbapl()             FFFFFFFF7FFF89B8 ?
376                                                583FB7870 ?
                                                   FFFFFFFF7AF3AA3C ?
                                                   B6E9FABD0 ? 000000000 ?
                                                   583DBC000 ?
kcbchg1_main()+1363  CALL     kcrfw_redo_gen()     000000000 ?
2                                                  FFFFFFFF7FFF76C8 ?
                                                   B693A9998 ? 000000000 ?
                                                   3800135A0 ?
                                                   FFFFFFFF7FFF7700 ?
kcbchg1()+1324       CALL     kcbchg1_main()       000100C00 ?
                                                   FFFFFFFF7FFF7850 ?
                                                   000000000 ? 583FB7870 ?
                                                   000000000 ? 00000FFFF ?
ktuchg()+968         CALL     kcbchg1()            000106819 ? 1068195B8 ?
                                                   1068195C8 ? 106819000 ?
                                                   000000000 ? 106819000 ?
ktbchg2nt()+104      CALL     ktuchg()             000000002 ? 000000001 ?
                                                   FFFFFFFF7FFF8928 ?
                                                   B67A76DD8 ? 000000000 ?
                                                   000000000 ?
kteopgen()+728       CALL     ktbchg2nt()          FFFFFFFF7FFF89B8 ?
                                                   FFFFFFFF7FFF87C4 ?
                                                   000000000 ? 000000000 ?
                                                   FFFFFFFF7FFF8928 ?
                                                   FFFFFFFF7FFF9D98 ?
kteopresize()+2276   CALL     kteopgen()           FFFFFFFF7FFF89B8 ?
                                                   000000006 ? 000106800 ?
                                                   000000002 ? 10682247C ?
                                                   106816B18 ?
ktsxbmdelext1()+968  CALL     kteopresize()        FFFFFFFF7FFF9D98 ?
8                                                  FFFFFFFF7FFF9E88 ?
                                                   000000004 ? 000000002 ?
                                                   000000000 ? 000000000 ?
ktsstrm_segment()+6  CALL     ktsxbmdelext1()      FFFFFFFF7AD33A78 ?
308                                                0000016A0 ? 0003FFFFF ?
                                                   FFFFFFFF7AD33A78 ?
                                                   106822000 ? 000000043 ?
ktsmg_trimf()+1208   CALL     ktsstrm_segment()    000000000 ? 000000003 ?
                                                   000000001 ? 000100C00 ?
                                                   106819000 ? 000000000 ?
kdbltrmt()+1916      CALL     ktsmg_trimf()        00010598F ? 0000010E2 ?
                                                   106822478 ? 000000005 ?
                                                   10682247C ? 106816B18 ?
kdblfpl()+96         CALL     kdbltrmt()           000000006 ? 000000000 ?
                                                   FFFFFFFF7AD33918 ?
                                                   000000180 ? 0000010E4 ?
                                                   000000008 ?
kdblfl()+1948        CALL     kdblfpl()            FFFFFFFF7FFFB0AC ?
                                                   FFFFFFFF7AD33918 ?
                                                   000000000 ?
                                                   FFFFFFFF7AD33AE0 ?
                                                   FFFFFFFF7AD33A68 ?
                                                   000000000 ?
klafin()+160         CALL     kdblfl()             FFFFFFFF7FFFB0AC ?
                                                   FFFFFFFF7AD33918 ?
                                                   000000000 ? 000000001 ?
                                                   000000008 ? 000106800 ?
kpodpfin()+76        CALL     klafin()             FFFFFFFF7AF35C40 ?
                                                   1059BF2B8 ? 000000321 ?
                                                   FFFFFFFF7AD33918 ?
                                                   000000000 ? 000400000 ?
kpodpmop()+320       CALL     kpodpfin()           FFFFFFFF7AF35C40 ?
                                                   000106816 ? 000106800 ?
                                                   000000321 ? 000000001 ?
                                                   FFFFFFFF7AF35BC8 ?
opiodr()+1496        PTR_CALL 0000000000000000     000000301 ? 000000321 ?

进过与Oracle support确认,定位为Bug 5386204 – Block corruption / OERI[kddummy_blkchk] after direct load of ASSM segment [ID 5386204.8].
“kteop redo – redo operation on extent map” 记录是确定该Bug的一个重要依据。

该Bug的Oracle note:

Bug 5386204  Block corruption / OERI[kddummy_blkchk] after direct load of ASSM segment
This note gives a brief overview of bug 5386204.
The content was last updated on: 08-FEB-2010
Click here for details of each of the sections below.
This bug is alerted in Note:580561.1
Affects:

Product (Component)    Oracle Server (Rdbms)
Range of versions believed to be affected    Versions < 11
Versions confirmed as being affected

* 9.2.0.8
* 10.2.0.1
* 10.2.0.2
* 10.2.0.3
* 10.2.0.4

Platforms affected    Generic (all / most platforms affected)

Fixed:

This issue is fixed in

* 9.2.0.8 Patch 15 on Windows Platforms
* 10.2.0.2 Patch 15 on Windows Platforms
* 10.2.0.3 Patch 5 on Windows Platforms
* 10.2.0.4.1 (Patch Set Update)
* 10.2.0.4 Patch 2 on Windows Platforms
* 10.2.0.5 (Server Patch Set)
* 11.1.0.6 (Base Release)

Symptoms:

Related To:

* Internal Error May Occur (ORA-600)
* Corruption (Logical)
* ORA-600 [kddummy_blkchk]

* Direct Path Operations
* ASSM Space Management (Bitmap Managed Segments)

Description

Block corruption / ORA-600 [kddummy_blkchk][file#] [block#] [18038]
can occur on a segment which has been direct loaded.

(The corruption shows as a PAGETABLE SEGMENT HEADER
having blocks in the “Auxillary Map” outside of the “Extent Map”
range)

Note:
This bug was previously incorrectly listed as fixed in 10.2.0.4

Further details on this issue can be found in Note:580561.1
ORA-600 [kddummy_blkchk][][][18038] during extent operations like TRUNCATE on ASSM tablespaces [ID 580561.1]

Applies to:
Oracle Server – Enterprise Edition – Version: 9.2.0.8 to 10.2.0.4
Information in this document applies to any platform.
Description

This alert describes the problem in Bug 5386204 / Note 5386204.8.

Block corruption with error ORA-600 [kddummy_blkchk] [file#] [block#] [18038]
may be reported during a DROP/TRUNCATE

The corruption shows as a PAGETABLE SEGMENT HEADER having blocks in the
“Auxillary Map” outside of the “Extent Map” range.

The same operation terminated without any error in previous RDBMS versions
like Oracle9i.

Likelihood of Occurrence

The object is populated by direct path operations such as SQL*Loader using DIRECT=Y for example.
The object is stored in a Locally Managed Tablespace (LMT) that is using ASSM (dba_tablespaces.segment_space_management=’AUTO’).
Bug 5386204 is mostly hit when db_block_size=16384.

Possible Symptoms

One evidence of hitting this bug might be the value 18038 in the third argument of
ORA-600 [kddummy_blkchk] where [18038] is a check error code.

@Error check code 18038 means that the “Data dba” stored in “Auxiliary Map” is out of range

@TYP:0 CLS: 4 AFN:234 DBA:0x3a801554 OBJ:0 SCN:0x000b.290f5e0d SEQ:  1 OP:14.2

@In this case “Data dba: 0x3a801555” stored in the “Auxiliary Map” is equal to 0x3a801551 + 4 which is out of the extent 0, hence the error.
@Note that extent 0 is 4 blocks, so extent 0 starts from 0x3a801551 to 0x3a801554.

Workaround or Resolution

In order to identify objects that are affected by the corruption, use the procedure
DBMS_SPACE_ADMIN.ASSM_TABLESPACE_VERIFY

@DBMS_SPACE_ADMIN.ASSM_SEGMENT_VERIFY is also an option but it requires patch for Bug 6760697 is needed)

How to execute DBMS_SPACE_ADMIN.ASSM_TABLESPACE_VERIFY:

alter system set DB_BLOCK_CHECKSUM = OFF;
— open a new session and run :
exec DBMS_SPACE_ADMIN.assm_tablespace_verify(‘<Tablespace Name>’, DBMS_SPACE_ADMIN.TS_VERIFY_DEEP, DBMS_SPACE_ADMIN.SEGMENT_VERIFY_DEEP);

See if any trace file is generated in the directory defined by user_dump_dest.
The absence of a trace file means that no corrupt segments were found.

Note: DB_BLOCK_CHECKSUM has to be disabled; otherwise the same ORA-600 error may be produced

@Oracle check block type 0x23=PAGETABLE SEGMENT HEADER even if DB_BLOCK_CHECKING is not set.

Example of output from DBMS_SPACE_ADMIN.ASSM_TABLESPACE_VERIFY

Segment header [dba: 0x003a801554, (file 234,block 5460)]
Segment object id: 7825838; inc. no.: 0
*********

verifying extent map and tablespace bitmap consistency
———
Verifying extent map and  auxilliary extent map   consistency in   the segment
Block Corruption in seg hdr / ext map block:                rdba: 0x3a801554, err code: 18038

Identifying the object using the segment header information.

Segment header [dba: 0x003a801554, (file 234,block 5460)]

select *
from DBA_EXTENTS
where FILE_ID = 234
and 5460 between block_id and block_id + blocks – 1;

Identifying the object using the Segment object id information.

Segment object id: 7825838; inc. no.: 0

select *
from DBA_OBJECTS
where DATA_OBJECT_ID = 7825838;

@How to execute DBMS_SPACE_ADMIN.ASSM_SEGMENT_VERIFY

WORKAROUNDs:

Disable DB_BLOCK_CHECKSUM for any action taken.

Note: DB_BLOCK_CHECKSUM has to be disabled; otherwise the same ORA-600 error may be produced

alter system set DB_BLOCK_CHECKSUM = OFF;
— open a new session

DROP TABLE .. PURGE;
ALTER TABLE .. MOVE ..;
Create table as select (CTAS)
export/import, etc

Patches

The patch prevents the corruption from taking place. Affected objects will have to be recreated.

This bug was previously incorrectly listed as fixed in 10.2.0.4.

@This problem is fixed in the 10.2.0.5 Patch Set (not available yet and still without a due date).
This problem is fixed in the 11.1.0.6 rdbms release.

One off patches for this issue are available for some platforms / versions.

See Patch 5386204 for patch availability.

Modification History

03-JUN-2008 – Initial Alert version
04-JUN-2008 – Implemented correction
11-JUN-2008 – Added info about DB_BLOCK_CHECKSUM
13-JUN-2008 – Published

References

BUG:5386204 – ORA-600 [KDDUMMY_BLKCHK] ERRORS WITH CODE 18038

NOTE:5386204.8 – Bug 5386204 – Block corruption / OERI[kddummy_blkchk] after direct load of ASSM segment

Bug 5386204: ORA-600 [KDDUMMY_BLKCHK] ERRORS WITH CODE 18038

Show Bug Attributes Bug Attributes
Type     B – Defect     Fixed in Product Version     11.1
Severity     1 – Complete Loss of Service     Product Version     10.2.0.2
Status     80 – Development to Q/A     Platform     226 – Linux x86-64
Created     12-Jul-2006     Platform Version     2.6.5-7.191-SMP
Updated     20-May-2010     Base Bug     –
Database Version     10.2.0.2
Affects Platforms     Generic
Product Source     Oracle

Show Related Products Related Products
Line     Oracle Database Products     Family     Oracle Database
Area     Oracle Database     Product     5 – Oracle Server – Enterprise Edition

Hdr: 5386204 10.2.0.2 RDBMS 10.2.0.2 SPACE PRODID-5 PORTID-226 ORA-600
Abstract: ORA-600 [KDDUMMY_BLKCHK] ERRORS WITH CODE 18038

*** 07/12/06 12:59 am ***
TAR:
—-

PROBLEM:
——–
1. Clear description of the problem encountered
Customer is getting repeated ORA-600 [kddummy_blkchk] errors reported with
internal check code 18038 on tables which have had bulk deletions made.  This
has occurred on both production and test instances.

2. Pertinent configuration information (MTS/OPS/distributed/etc)
RAC, ASM

3. Indication of the frequency and predictability of the problem
Problem is intermittent but occurs several times a day impacting the
customers ability to work.

4. Sequence of events leading to the problem
Error is typically signalled on a COMMIT most likely following a deletion
from the tables.

5. Technical impact on the customer. Include persistent after effects.
Severe, as it occurs multiple times per day, and corrupt the underlying
tables preventing further data loads.

DIAGNOSTIC ANALYSIS:
——————–
The trace files show that the problem occurs following a bulk deletion from
the underlying tables, which appear to corrupt the extent map, as the segment
header dump shows 1 extent of 4 blks, but the deleteion entry in the redo
stream shows one extent of 8 blks, e.g.:

REDO RECORD – Thread:1 RBA: 0x0005da.000e5e34.01c0 LEN: 0x00fc VLD: 0x01
SCN: 0x000d.37eacce9 SUBSCN:  5 07/11/2006 10:29:53
CHANGE #1 TYP:0 CLS:60 AFN:39 DBA:0x09c322e0 OBJ:4294967295
SCN:0x000d.37eacce9 SEQ:  2 OP:5.1
ktudb redo: siz: 112 spc: 15940 flg: 0x0022 seq: 0x011d rec: 0x06
xid:  0x0016.020.000005b6
ktubu redo: slt: 32 rci: 5 opc: 14.5 objn: 2 objd: 93662 tsn: 12
Undo type:  Regular undo       Undo type:  Last buffer split:  No
Tablespace Undo:  Yes
0x00000000
kteopu undo – undo operation on extent map
segdba: 0x87e3cc  class: 4  mapdba:0x87e3cc  offset: 3
rbr extent – dba: 0x0  nbk: 0x0
kteop redo – redo operation on extent map
ADD: dba:0x803673d len:8 at offset:1
DEFAULT: ???
SETSTAT: exts:2 blks:16 lastmap:0x0 mapcnt:0
CHANGE #2 TYP:0 CLS: 4 AFN:2 DBA:0x0087e3cc OBJ:93662 SCN:0x000d.37eacce9
SEQ:  1 OP:14.4
kteop redo – redo operation on extent map
DELETE: entry:1
shift back: dba:0x0 len:0
SETSTAT: exts:1 blks:8 lastmap:0x0 mapcnt:0

WORKAROUND:
———–
None

RELATED BUGS:
————-
Bug 4949123 – ORA-600: [KDDUMMY_BLKCHK], [541], [147050], [18038]

REPRODUCIBILITY:
—————-
Consistently occurring at customers site.

TEST CASE:
———-
n/a

STACK TRACE:
————
ksedst ksedmp ksfdmp kgerinv kseinpre ksesin kco_blkchk kcoapl kcbapl
kcrfw_redo_gen kcbchg1_main kcbchg1 ktuchg ktbchg2nt kteopgen kteopresize
ktsxbmdelext1 ktsstrm_segment ktsmg_icmt_prepare ktcifc ktucmt ktpcmt ktcrcm
ktdcmt k2lcom k2send xctctl xctcom_with_options kksExecuteCommand opiexe
opipls opiodr rpidrus skgmstack rpidru rpiswu2 rpidrv psddr0 psdnal
pevm_EXECC pfrinstr_EXECC pfrrun_no_tool pfrrun plsql_run peicnt kkxexe
opiexe kpoal8 opiodr ttcpip opitsk opiino opiodr opidrv sou2o opimai_real
main __libc_start_main _start

SUPPORTING INFORMATION:
———————–
alertlogs and trace files

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-
n/a

DIAL-IN INFORMATION:
——————–
n/a

IMPACT DATE:
————
21-JUL-2006

*** 07/12/06 02:34 am *** (CHG: Asg->NEW OWNER OWNER)

A redo dump of the segment header during the entire procedure execution was
requested on 06 Aug and supplied on 09 Aug so why are you asking for this
information again when you already have it?  Please check that file
(redo_1.trc in bug5386204_07Aug.zip), and let me know if you need anything
else.
*** 09/19/06 02:39 am *** (CHG: Sta->30)
Uploaded the requested information in file bug5386204_Oct02.zip.

*** 11/27/06 11:13 am ***
Here is one theory we (space group) have on this bug so far:
During direct load one of the segments does not get loaded with any data. The
segment is empty and the first extent has 8 blocks (this is 16k block size).
However it goes through the usual high water mark movement phase (even though
the hwm does not move). During the hwm movement phase, the segment is trimmed
close to 64k boundary. For ASSM segment with 16k block size, this means the
segment will be left with no data blocks after the trim- 4 blocks after the
trim would represent bitmaps and segment header only.

There are two issues here:
(1) Why was ktsstrm_segment called on an empty (or unloaded) segment at first
place?
(2) Even if it was called, why is segment trimmed to 64k boundary?

I’m working on the 2nd issue and will give an update soon.
*** 11/29/06 03:18 am *** (CHG: Pri->1)
*** 11/29/06 03:18 am ***
*** 11/29/06 03:25 am *** -> CLOSED
*** 11/29/06 05:31 pm ***
*** 11/30/06 10:36 pm ***
We ran into some issues (bugs) while testing the code for the diagnostic
patch. I was hoping to have it finished by today but it seems it’ll take some
more time and I’m pretty hopeful of having it ready to go by tomorrow evening
(PST). I’m really sorry for the delay.
*** 12/01/06 07:16 pm ***
*** 12/02/06 05:05 pm ***
Sorry for the delay in replying. I would expect the long regressions to be
complete by sunday afternoon PST. I should be ready to release the patch by
sunday evening if things go fine. Will keep this page updated on my progress.
*** 12/03/06 05:30 pm ***
*** 12/04/06 05:35 pm ***
It seems most of major issues with the long regressions have been taken care
of and I hope to get a clean run on the farm soon, by tomorrow end of day and
the patch should be on its way soonafter.

I had a question though, that will help me in getting the patch out faster. I
wanted to know if the customer has had any diagnosibility patches installed
on their 10.2.0.2.0 release version.

Another thing which I would like to mention here is that my patch modifies
only one file (ktss.c) in the RDBMS code.
*** 12/05/06 02:19 am ***
*** 12/05/06 04:42 pm ***
*** 12/05/06 05:06 pm ***
I was hoping to have all the farm regressions (and the patch) done by today
evening but it seems farm is taking a bit long to finish the regressions.
I’ll work on the patch as soon as I have the regressions done. Sorry for the
delay. I’ll provide an update on that in the next few hours.
*** 12/05/06 09:07 pm ***
My regressions are still moving very slowly through the queue on the farm.
The farm seems to be busy with 11g Beta 4 deadline round the corner. My
regressions have been on the farm for more than a day now. I’ll work on the
patch as soon as I have a clean farm run.
*** 12/06/06 06:15 pm ***
Still waiting on the clean farm runs. Fortunately, I’ve been able to get a
high priority on the farm jobs. So, I expect things to run clean soon. Will
keep things updated here.
*** 12/07/06 05:54 pm ***
Got my farm runs completed last night but got a small number of diffs. Have
been trying to isolate them and hopefully soon, everything should be clean.
Farm has been giving those diffs over and over again though those look
unrelated to my change. Currently, verifying them on my linux workstation.
*** 12/08/06 06:01 pm ***
*** 12/08/06 06:23 pm ***
Have been able to run almost all the long regressions locally and things look
clean. There’s just a couple of long regressions which I’m still running and
I should expect to be ready to go as soon as they are completed. Should be
able to start the patch building soon.
*** 12/11/06 01:07 am ***
There’s one long regression which seems to be broken. I’m currently working
on that to have it run clean. Will update as soon as I have it running clean.
*** 12/11/06 01:23 pm ***
Everything is clean now. Working on starting the patch building process.
*** 12/11/06 02:46 pm ***

The customer has confirmed that following application of the suplied patch
the error no longer occurred when running the testcase, which ran through to
completion after about 8 hours.  They are resetting the testcase, and will
run it again to verify this, but the initial response is that this looks to
have resolved the problem.

Can you confirm if the patch would need to be rebuilt as a permananent fix,
ie. any diagnostics to be removed etc. or is it actually the full fix anyway?
*** 12/13/06 07:19 am ***
The customer has confirmed the following:

1. Rerun the test for the 2nd time with patched rdbms: completed quickly and
without any problems.
2. Rollbacked the patch: the test failed as expected within 30 minutes.
3. Re-applied the patch and ran the test once again: completed ok.

This appears to confirm that the patch resolves the problem so could we have
an answer to the previous update?
*** 12/13/06 07:47 pm ***
That is good news.
No additional diagnostics have been added to the patch. So, it’s not needed
to be rebuilt. I guess the supplied patch should be complete in itself.
*** 12/14/06 12:40 am ***
Thanks for the update.

该文档描述当使用直接路径方式导入数据时一定概率导致该Bug产生,譬如使用Sql loader且DIRECT=Y;
该Bug只会由存贮在本地管理方式(LTM)并自动段管理(ASSM)的对象引发, 并且当标准块大小为16k时出现概率较高(Bug 5386204 is mostly hit when db_block_size=16384.)
一般数据库都会启用db_block_checksum,该参数控制Oracle在读入块时做检验操作,[18038]是kddummy_blkchk的一种错误代码,出现该错误代码说明存储在段头中的辅助区间图中的Data dba越界, 我们举一个段头来看:

Start dump data blocks tsn: 4 file#: 4 minblk 139 maxblk 139
buffer tsn: 4 rdba: 0x0100008b (4/139)
scn: 0x0000.000f327e seq: 0x01 flg: 0x04 tail: 0x327e2301
frmt: 0x02 chkval: 0x619e type: 0x23=PAGETABLE SEGMENT HEADER
Hex dump of block: st=0, typ_found=1
.......
  Extent Control Header
  -----------------------------------------------------------------
  Extent Header:: spare1: 0      spare2: 0      #extents: 9      #blocks: 72
                  last map  0x00000000  #maps: 0      offset: 2716
      Highwater::  0x0101e1f1  ext#: 8      blk#: 8      ext size: 8
  #blocks in seg. hdr's freelists: 0
  #blocks below: 65
  mapblk  0x00000000  offset: 8
                   Unlocked
  --------------------------------------------------------
  Low HighWater Mark :
      Highwater::  0x0101e1f1  ext#: 8      blk#: 8      ext size: 8
  #blocks in seg. hdr's freelists: 0
  #blocks below: 65
  mapblk  0x00000000  offset: 8
  Level 1 BMB for High HWM block: 0x0101e1e9
  Level 1 BMB for Low HWM block: 0x0101e1e9
  --------------------------------------------------------
  Segment Type: 1 nl2: 1      blksz: 8192   fbsz: 0
  L2 Array start offset:  0x00001434
  First Level 3 BMB:  0x00000000
  L2 Hint for inserts:  0x0100008a
  Last Level 1 BMB:  0x0101e1e9
  Last Level II BMB:  0x0100008a
  Last Level III BMB:  0x00000000
     Map Header:: next  0x00000000  #extents: 9    obj#: 51806  flag: 0x10000000
  Inc # 0
  Extent Map
  -----------------------------------------------------------------
   0x01000089  length: 8
   0x0101e1a1  length: 8
   0x0101e1a9  length: 8
   0x0101e1b9  length: 8
   0x0101e1c1  length: 8
   0x0101e1c9  length: 8
   0x0101e1d9  length: 8
   0x0101e1e1  length: 8
   0x0101e1e9  length: 8

  Auxillary Map
  --------------------------------------------------------
   Extent 0     :  L1 dba:  0x01000089 Data dba:  0x0100008c
   Extent 1     :  L1 dba:  0x01000089 Data dba:  0x0101e1a1
   Extent 2     :  L1 dba:  0x0101e1a9 Data dba:  0x0101e1aa
   Extent 3     :  L1 dba:  0x0101e1a9 Data dba:  0x0101e1b9
   Extent 4     :  L1 dba:  0x0101e1c1 Data dba:  0x0101e1c2
   Extent 5     :  L1 dba:  0x0101e1c1 Data dba:  0x0101e1c9
   Extent 6     :  L1 dba:  0x0101e1d9 Data dba:  0x0101e1da
   Extent 7     :  L1 dba:  0x0101e1d9 Data dba:  0x0101e1e1
   Extent 8     :  L1 dba:  0x0101e1e9 Data dba:  0x0101e1ea
  --------------------------------------------------------

   Second Level Bitmap block DBAs
   --------------------------------------------------------
   DBA 1:   0x0100008a

其中辅助区间图( Auxillary Map)列出了该段每个区间(Extent)的一级位图块以及区间中实际数据开始的data block address (Data dba).譬如Extent 0 中的Data dba应在
(0x0100008A ~0x01000090)之间,否则即越界。
DROP或TRUNCATE是触发该Bug的主要操作,原因是这2个操作都需要使用到Pagetable segment header中的Auxiliary Map。
Oracle建议的WorkAround方式主要是通过MOVE TABLESPACE 来”REBUILD”这个PAGETABLE SEGMENT HEADER。
这个Case中Oracle support给出Workaround建议:

1-. Make sure the below query will return the table mentioned above:

SQL> select owner, object_name, object_type, SUBOBJECT_NAME, OBJECT_ID,
DATA_OBJECT_ID, CREATED,LAST_DDL_TIME,TIMESTAMP
from DBA_OBJECTS
where DATA_OBJECT_ID =1699775;

If so continue:

SQL>alter system set DB_BLOCK_CHECKSUM = OFF;

Find all indexes for W_ORG_DS table.

SQL> select owner, index_name, index_type, table_name , table_owner from dba_indexes
Where table_owner = ‘BMS_OBA_DW’ and
Table_name = ‘W_ORG_DS’;

connect as BMS_OBA_DW

SQL> desc W_ORG_DS

if this table does not have LONG column, then Alter table table_name move is like a CTAS but better since is using the same name of the object plus keeping any related object like index, etc. If it has Long column then export/truncate/import need to be use;

SQL>Alter table W_ORG_DS Move;

Then rebuild all indexes for W_ORG_DS table as per above query: .i.e.

SQL>Alter index rebuild

To avoid problem, please apply patch for bug 5386204, see note 580561.1 for further information.

Oracle文档宣称其已在10.2.0.4的第一个patch set update及10.2.0.5中修复了该Bug.

注:最早认为该Bug在10.2.0.4中就已经修复了,但后来确认“This bug was previously incorrectly listed as fixed in 10.2.0.4”。

发一个jd:System Engineer 博君一笑

   1.  Goal and mission:

A System Engineer has excellent and in-depth knowledge of a family of products (e.g. NagraVision CAS, iDTV & SMS, MediaGuard CAS). He is the recognised expert on that range of systems and is able to analyse and resolve complex problems under severe pressure. He recognises the importance of excellent customer relationships and is a good team player. He is highly motivated to develop his skills.

   2. Responsibility:

    * Is assigned to one or more accounts. Is responsible for the technical solution (installation, customer support, maintenance, etc) and works colsely with the program manager in charge of the account.
    * Creatively develops workaround solutions using broad knowledge of NagraVision products when no other solution is viable
    * Maintains a can-do attitude
    * Is fascinated by new technology and products
    * Handles most crisis situations confidently, is aware of customer consequences and knows when a call should be escalated
    * Recognises possible product design faults or potential enhancements and readily communicates this feedback to senior System Engineers for validation
    * Takes pride in solving customer problems effectively and is intolerant of poor quality work
    * Lives up to customer expectations and updates them on changes
    * Is respectful of and helpful to other engineers and enjoys gaining and sharing knowledge
    * Attends training on products, services and behavioural skills related to the position
    * Provides tuition on NagraVision products to customers and to NagraVision Engineers, under the managerial responsibility of the Head of System Engineering.
    * Participates in Hotline turns, under the managerial responsibility of the Head of System Engineering.
    * Execute security measures on the systems.
    * Provide quick response time to meet customer satisfaction.

   3. Reporting line

Reporting to: Head of System Engineering

Direct Report: Igor Ferigutti

   4. Interfaces

    * Daily contact with Program/Deployment Manager to inform on current status and plan the next steps of a project
    * Information to and from Customers using live communication, conference calls and e-mail reports
    * Collaboration with Central Services’ staff for problem solving.
    * Collaboration with Product Development for bug reporting and problem solving
    * Regular communication with Head of System Engineering to keep the resource planning up-to-date.

   5. Skills

    * Is recognised to have excellent knowledge of a set of NagraVision products
    * Mostly self-reliant, but requires direction under some situations.
    * Has a good telephone manner and remote diagnosis skills.
    * Is fluent in English, any other Asian language is an asset.
    * Recognised expertise in the UNIX platforms and Windows.
    * Ability to perform complex queries on Oracle or Sybase databases.
    * Recognised ability to analyse TCP/IP networking problems.
    * Ability to demonstrate certain NagraVision products or to provide training on them.
    * Demonstrates good communication skills and reasonable inter-personal skills.
    * Working knowledge of the Ticketing and Remote access tools.
    * Ability to complete problem analysis and do troubleshooting of simple multi-cluster systems.
    * Demonstrates high customer service awareness.
    * Works effectively under pressure.

这就是D&D中所谓的万精油职业吗?

ora-7445 [kghalp+0500] [SIGSEGV]错误

今天没有外出(似乎人不到现场就特别容易出问题),早上10点左右接到电话被告知crm11实例上出现了7445错误,准备用web vpn拨上去查看一下,赫然发觉windows 7 不支持这种vpn(准确说ie8和firefox都不支持);无奈无奈只好用拨号。
发现alert log中出现大量 7445错误记录:

Fri Mar 26 09:24:53 2010
Errors in file /oravl01/oracle/admin/CRMDB1/udump/crmdb11_ora_6754320.trc:
ORA-07445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] []
Fri Mar 26 09:24:55 2010
Trace dumping is performing id=[cdmp_20100326092455]
Fri Mar 26 09:31:16 2010
Errors in file /oravl01/oracle/admin/CRMDB1/udump/crmdb11_ora_2994552.trc:
ORA-07445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] []

看到kghalp函数第一印象 ,是Oracle中堆管理使用的函数;
让我们猜猜字面意思? k -> kernel g -> generic h-> heap a-> allocation p-> point
再让我们来看一下当时的call stack:

Exception signal: 11 (SIGSEGV), code: 51 (Invalid permissions for mapped object), addr: 0x3b, PC: [0x1000973e0, kghalp+0500]
Registers:
iar: 00000001000973e0, msr: a00000000000d0b2
 lr: 00000001013a6df8,  cr: 0000000022292484
r00: 0000000000000010, r01: 0ffffffffffcb160, r02: 000000011022a9c0,
r03: 0000000000000002, r04: 0000000000000000, r05: 0000000000000100,
r06: 0000000000000001, r07: 0000000000000000, r08: 0000000000000000,
r09: 0000000000000000, r10: 00000000101b60d8, r11: 0000000000000004,
r12: 0000000024592484, r13: 000000011026bfe0, r14: 0000000000000000,
r15: 0000000000009000, r16: 0000000110195b2c, r17: 0000000000000000,
r18: 0000000000000001, r19: 0000000000000000, r20: 0000000000001000,
r21: 0000000000000000, r22: 0000000000000100, r23: 0000000000000001,
r24: 0000000000000000, r25: 0000000000000000, r26: 0000000000000001,
r27: 0000000104c7fd44, r28: 0000000000000000, r29: 0000000000000100,
r30: 0000000000000000, r31: 0000000110195a58,
*** 2010-03-26 09:57:28.679
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [kghalp+0500] [SIGSEGV] [Invalid permissions for mapped object] [0x00000003B] [] []
Current SQL statement for this session:
INSERT INTO AUDIT_DDL_LOG (DDL_TIME, SESSION_ID, OS_USER, IP_ADDRESS, TERMINAL, HOST, USER_NAME, DDL_TYPE, OBJECT_TYPE, OWNER, OBJECT_NAME, SQL_TEXT) VALUES (SYSDATE, SYS_CONTEXT('USERENV','SESSIONID'), SYS_CONTEXT('USERENV','OS_USER'), SYS_CONTEXT('USERENV','IP_ADDRESS'), SYS_CONTEXT('USERENV','TERMINAL'), SYS_CONTEXT('USERENV','HOST'), ORA_LOGIN_USER, ORA_SYSEVENT, ORA_DICT_OBJ_TYPE, ORA_DICT_OBJ_OWNER, ORA_DICT_OBJ_NAME, :B1 )
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
70000043da500d0        10  anonymous block
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              000000000 ? 104A54EED ?
ksedmp+0290          bl       ksedst               104A54870 ?
ssexhd+03e0          bl       ksedmp               300001D15 ?
000044C0             ?        00000000
parchk+01f4          bl       kghalp               000000000 ?
                                                   2842288200000001 ?
                                                   000000000 ? 000000000 ?
                                                   000001040 ? 110195B2C ?
ptmak+0168           bl       parchk               FFFFFFFFFFCB560 ?
                                                   FFFFFFFFFFCB430 ?
                                                   FFFFFFFFFFCB430 ?
pdybF00_Init+0244    bl       ptmak                10008049C ? 000000000 ?
                                                   FFFFFFFFFFCB4F0 ? 07FFFFFFF ?
pdy1F79_Init+00c8    bl       pdybF00_Init         110BEB1D0 ?
pdy1F01_Driver+0048  bl       pdy1F79_Init         FFFFFFFFFFCBC40 ?
pdli_new_cog+00f0    bl       pdy1F01_Driver       FFFFFFFFFFCBCE0 ? 000000000 ?
pdlifu+0264          bl       pdli_new_cog         1013885F4 ? FFFFFFFFFFCCB00 ?
                                                   7000004383E7680 ?
phpcog+0010          bl       pdlifu               FFFFFFFFFFCD958 ?
                                                   7000004383E7680 ? 104C95048 ?
phpcmp+0f80          bl       phpcog               FFFFFFFFFFCC4F0 ? 000000000 ?
pcicms2+02d4         bl       phpcmp               FFFFFFFFFFCD958 ?


发生错误的最上层 kghalp 函数由 parchk 调用, 这似乎是一个package check函数(猜测,呵呵). 我们来整理一下思路, parchk 函数调用了 kghalp函数以帮其分配内存,但却得到了一个非法的低地址[[0x00000003B],正常情况下正文段使用的空间; 这看起来显然是一个bug。
让我们来查查support.oracle.com , 键入7445 kghalp 和sigsegv 关键字 (很多时候不需要使用ora 600/7445 lookup tools).
bug 8244533 赫然显目:

Bug 8244533: ORA-07445 [KGHALP] ERRORS COMPILING PACKAGE WITH DEBUG
    STACK TRACE:
    ------------
       ksedst <- ksedmp <- ssexhd <- 000044BC <- parchk        <- ptmak <-
    pdybF00_Init <- pdy1F79_Init <- pdy1F01_Driver <- pdli_new_cog         <-
    pdlifu <- phpcog <- phpcmp <- pcicms2 <- pcicms          <- kkxcms <- kkxswcm
    <- kkxmpbms <- kkxmesu <- xtypls           <- qctopls <- qctcopn <- qctcopn

    Exception signal: 11 (SIGSEGV), code: 51 (Invalid permissions for mapped
    object),
    addr: 0x3b, PC: [0x1000973e0, kghalp+0500]
    Registers:
    iar: 00000001000973e0, msr: a00000000000d0b2
    lr: 000000010139ffb8,  cr: 00000000222a2484
    r00: 0000000000000010, r01: 0ffffffffffe2980, r02: 00000001101e5ab8,
    r03: 0000000000000002, r04: 0000000000000000, r05: 0000000000000100,
    r06: 0000000000000001, r07: 0000000000000000, r08: 0000000000000000,
    r09: 0000000000000000, r10: 0000000010171200, r11: 0000000000000004,
    r12: 00000000245a2484, r13: 000000011021fbc0, r14: 0000000000000000,
    r15: 0000000000009000, r16: 0000000110150c54, r17: 0000000000000000,
    r18: 0000000000000001, r19: 0000000000000000, r20: 0000000000001000,
    r21: 0000000000000000, r22: 0000000000000100, r23: 0000000000000001,
    r24: 0000000000000000, r25: 0000000000000000, r26: 0000000000000001,
    r27: 0000000104c5983c, r28: 0000000000000000, r29: 0000000000000100,
    r30: 0000000000000000, r31: 0000000110150b80,
    *** 16:37:14.603
    ksedmp: internal or fatal error
    ORA-7445: exception encountered: core dump [kghalp+0500] [SIGSEGV]
    [Invalid permissions for mapped object] [0x00000003B] [] []
    Current SQL statement for this session:
    select dummy from dual where  ora_dict_obj_type = 'TABLE'
----- Call Stack Trace -----ptmak pdybF00_Init pdy1F79_Init pdy1F01_Driver pdli_new_cog pdlifuphpcog phpcmp pcicms2 pcicms kkxcms kkxswcm kkxmpbms kkxmesu xtyplsTo Filer.Based on this call stack this would appear a likely match forbug 6951953 Abstract: ORA-7445 [PTMAK] IMPORTING PACKAGE COMPILED DEBUG.This bug is fixed on 10.2.0.5 and there is a 10.2.0.4 patch available for IBM AIX Based Systems (64-bit).It maybe worth while to have the customer apply the patch to seeif it resolves the issue.Also the uploaded files included test.sql is this a reproducable testcase?

这个bug 似乎仅在 IBM AIX on POWER Systems (64-bit) 发生,当以DEBUG 模式编译包时有一定几率出现。
好了,既然已经了解了可能发生的诱因,我们可以进一步分析了,接下来看看 errorstack trace信息中 的SO 记录。

      SO: 70000043d217668, type: 53, owner: 70000048cee2238, flag: INIT/-/-/0x00
      LIBRARY OBJECT LOCK: lock=70000043d217668 handle=700000446261588 mode=N
      call pin=0 session pin=0 hpc=0000 hlc=0000
      htl=70000043d2176e8[70000042b52b368,70000042bb9a808] htb=70000044929b460 ssga=70000044929ad68
      user=70000048cee2238 session=70000048eb33010 count=1 flags=[0000] savepoint=0x4bac1488
      LIBRARY OBJECT HANDLE: handle=700000446261588 mtx=7000004462616b8(1) cdp=1
      name=ALTER TRIGGER "SHUCRM3O"."TRI_PRODUCT_INSTANCE_RELATED" COMPILE DEBUG REUSE SETTINGS
      hash=164e6a8942406cee159f8943a1a3c85e timestamp=03-26-2010 09:52:12
      namespace=CRSR flags=RON/KGHP/TIM/PN0/SML/KST/DBN/MTX/[120100d0]
      kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=16 hpc=0002 hlc=0002
      lwt=700000446261630[700000446261630,700000446261630] ltm=700000446261640[700000446261640,700000446261640]
      pwt=7000004462615f8[7000004462615f8,7000004462615f8] ptm=700000446261608[700000446261608,700000446261608]
      ref=700000446261660[700000446261660,700000446261660] lnd=700000446261678[700000446261678,700000446261678]
        LIBRARY OBJECT: object=70000045adbc1e8
        type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
        CHILDREN: size=16
        child#    table reference   handle
             5 70000041776f5c0 70000045ae44720 70000042bfa3a20
        DATA BLOCKS:
        data#     heap  pointer    status pins change whr
            0 70000043d9fed20 70000045adbc300 I/P/A/-/-    0 NONE   00

的确有以debug 模式编译对象的语句,不过对象不是包而是trigger ; 看起来只要是可以以debug 模式compile 的对象都有可能引发该问题。
好了,问题到这里已经比较明确了: 应用端以DEBUG模式重新编译包引发了 Oracle bug 8244533,从而导致了对应服务进程的崩溃;总算是虚惊一场,之后通过trace内的machine和user信息找到了实施变更的应用方人员并教育之。

对Oracle中索引叶块分裂而引起延迟情况的测试和分析

在版本10.2.0.4未打上相关one-off补丁的情况下,分别对ASSM和MSSM管理模式表空间进行索引分裂测试,经过测试的结论如下:

l  在10gr2版本中MSSM方式是不能避免索引分裂引起交易超时问题;

l  10.2.0.4上的one-off补丁因为目前仅存在Linux版本,可以考虑声请补丁后具体测试(因目前没有补丁所以处于未知状态)。

l  合并索引是目前最具可行性的解决方案(alter index coalesce)。

l  最新的11gr2中经测试仍存在该问题。

具体测试过程如下:

1.    自动段管理模式下的索引块分裂

SQL> drop tablespace idx1 including contents and datafiles;

Tablespace dropped.

SQL> create tablespace idx1 datafile ‘?/dbs/idx1.dbf’ size 500M

2  segment space management AUTO

3  extent management local uniform size 10M;

创建自动段管理的表空间

Tablespace created.

SQL> create table idx1(a number) tablespace idx1;

Table created.

create index idx1_idx on idx1 (a) tablespace idx1 pctfree 0;

Index created.         创建实验对象表及索引

SQL> insert into idx1 select rownum from all_objects, all_objects where rownum <= 250000;           插入25万条记录

250000 rows created.

SQL> commit;

Commit complete.

SQL>create table idx2 tablespace idx1 as select * from idx1 where 1=2;

Table created.

insert into idx2

select * from idx1 where rowid in

(select rid from

(select rid, rownum rn from

(select rowid rid from idx1 where a between 10127 and 243625 order by a)                    取出后端部分记录,即每250条取一条

)

where mod(rn, 250) = 0

)

/

933 rows created.

SQL> commit;

Commit complete.

SQL> analyze index idx1_idx validate structure; 分析原索引

select blocks,lf_blks,del_lf_rows from index_stats;

Index analyzed.

SQL>

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

1280        499           0               未删除情况下499个叶块

SQL> delete from idx1 where a between 10127 and 243625;                             大量删除

commit;

233499 rows deleted.

SQL> SQL>

Commit complete.

SQL> analyze index idx1_idx validate structure;

select blocks,lf_blks,del_lf_rows from index_stats;

Index analyzed.

SQL>

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

1280        499      233499            删除后叶块数量不变

SQL> insert into idx1 select * from idx2;                   令那些empty 不再empty,但每个块中只有一到二条记录,空闲率仍为75-100%

commit;

933 rows created.

Commit complete.

SQL> insert into idx1 select 250000+rownum from all_objects where rownum <= 126;          造成leaf块分裂前提

SQL> select ss.value,sy.name from v$sesstat ss ,v$sysstat sy where ss.statistic#=sy.statistic# and name like ‘%split%’  and sid=(select distinct sid from v$mystat);

VALUE NAME

———- —————————————————————-

997 leaf node splits

997 leaf node 90-10 splits

0 branch node splits

0 queue splits                 找出当前会话目前的叶块分裂次数

SQL>insert into idx1 values (251000);                                        此处确实叶块分裂

1 row created.

SQL> commit;

Commit complete.

SQL> select ss.value,sy.name from v$sesstat ss ,v$sysstat sy where ss.statistic#=sy.statistic# and name like ‘%split%’  and sid=(select distinct sid from v$mystat);

VALUE NAME

———- —————————————————————-

998 leaf node splits

998 leaf node 90-10 splits

0 branch node splits

0 queue splits         可以看到对比之前的查询多了一个叶块分裂

SQL> set linesize 200 pagesize 1500;

SQL> select  executions, buffer_gets, disk_reads, cpu_time, elapsed_time, rows_processed, sql_text from v$sql

2  where sql_text like ‘%insert%idx1%’ and sql_text not like ‘%v$sql%’;

EXECUTIONS BUFFER_GETS DISK_READS   CPU_TIME ELAPSED_TIME ROWS_PROCESSED

———- ———– ———- ———- ———— ————–

SQL_TEXT

——————————————————————————————————————————————————————————————————–

1        1603          0     271601       271601            933

insert into idx2 select * from idx1 where rowid in (select rid from (select rid, rownum rn from (select rowid rid from idx1 where a between 10127 and 243625 order by a) ) where mod(rn, 250) = 0 )

1         156          0      82803        82803            126

insert into idx1 select 250000+rownum from all_objects where rownum <= 126

1         177 0       3728         3728              1

insert into idx1 values (251000)     读了那些实际不空的块,较多buffer_get

1        1409          0      40293        40293            933

insert into idx1 select * from idx2

1      240842          0    3478341      3478341         250000

SQL> insert into idx1 values (251001);                                  不分裂的插入

1 row created.

SQL> commit;

Commit complete.

SQL> select  executions, buffer_gets, disk_reads, cpu_time, elapsed_time, rows_processed, sql_text from v$sql

2  where sql_text like ‘%insert%idx1%’ and sql_text not like ‘%v$sql%’;

EXECUTIONS BUFFER_GETS DISK_READS   CPU_TIME ELAPSED_TIME ROWS_PROCESSED

———- ———– ———- ———- ———— ————–

SQL_TEXT

——————————————————————————————————————————————————————————————————–

1        1603          0     271601       271601            933

insert into idx2 select * from idx1 where rowid in (select rid from (select rid, rownum rn from (select rowid rid from idx1 where a between 10127 and 243625 order by a) ) where mod(rn, 250) = 0 )

1         156          0      82803        82803            126

insert into idx1 select 250000+rownum from all_objects where rownum <= 126

1           9          0       1640         1640              1

insert into idx1 values (251001) 不分裂的插入,少量buffer_gets

1         177          0       3728         3728              1

insert into idx1 values (251000)

1        1409          0      40293        40293            933

insert into idx1 select * from idx2

1      240842          0    3478341      3478341         250000

insert into idx1 select rownum from all_objects, all_objects where rownum <= 250000

如演示1所示,在自动段管理模式下大量删除后插入造成许多块为75%-100%空闲率且不完全为空,此后叶块分裂时将引起插入操作的相关前台进程扫描大量“空块“,若这些块不在内存中(引发物理读)且可能需要延迟块清除等原因时,减缓了该扫描操作的速度,造成叶块分裂缓慢,最终导致了其他insert操作被split操作所阻塞,出现enq:tx index contention等待事件。

2.  手动段管理模式下的索引块分裂

SQL> drop tablespace idx1 including contents and datafiles;

Tablespace dropped.

SQL> create tablespace idx1 datafile ‘?/dbs/idx1.dbf’ size 500M

2  segment space management MANUAL                                      — MSSM的情况

3  extent management local uniform size 10M;

Tablespace created.

SQL> create table idx1(a number) tablespace idx1;

create index idx1_idx on idx1 (a) tablespace idx1 pctfree 0;

Table created.

SQL> SQL> insert into idx1 select rownum from all_objects, all_objects where rownum <= 250

Index created.

SQL> SQL> 000;

commit;

create table idx2 tablespace idx1 as select * from idx1 where 1=2;

insert into idx2

select * from idx1 where rowid in

(select rid from

(select rid, rownum rn from

(select rowid rid from idx1 where a between 10127 and 243625 order by a)

)

where mod(rn, 250) = 0

)

/

commit;

250000 rows created.

SQL> SQL>

Commit complete.

SQL> SQL>

Table created.

SQL> SQL>   2    3    4    5    6    7    8    9

933 rows created.

SQL> SQL>

Commit complete.

SQL> analyze index idx1_idx validate structure;

select blocks,lf_blks,del_lf_rows from index_stats;

Index analyzed.

SQL>

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

1280        499           0

SQL> delete from idx1 where a between 10127 and 243625;

233499 rows deleted.

SQL> commit;

Commit complete.

SQL> insert into idx1 select * from idx2;

commit;

933 rows created.

SQL> SQL>

Commit complete.

SQL> SQL> insert into idx1 select 250000+rownum from all_objects where rownum <= 126;

commit;

126 rows created.

SQL> SQL>

Commit complete.

SQL>

SQL> select ss.value,sy.name from v$sesstat ss ,v$sysstat sy where ss.statistic#=sy.statistic# and name like ‘%split%’  and sid=(select distinct sid from v$mystat);

VALUE NAME

———- —————————————————————-

1496 leaf node splits

1496 leaf node 90-10 splits

0 branch node splits

0 queue splits

SQL> insert into idx1 values (251000);                                  确实分裂

1 row created.

SQL> commit;

Commit complete.

SQL> select ss.value,sy.name from v$sesstat ss ,v$sysstat sy where ss.statistic#=sy.statistic# and name like ‘%split%’  and sid=(select distinct sid from v$mystat);

VALUE NAME

———- —————————————————————-

1497 leaf node splits

1497 leaf node 90-10 splits

0 branch node splits

0 queue splits

以上与ASSM时完全一致

SQL> select  executions, buffer_gets, disk_reads, cpu_time, elapsed_time, rows_processed, sql_text from v$sql

2  where sql_text like ‘%insert%idx1%’ and sql_text not like ‘%v$sql%’;

EXECUTIONS BUFFER_GETS DISK_READS   CPU_TIME ELAPSED_TIME ROWS_PROCESSED

———- ———– ———- ———- ———— ————–

SQL_TEXT

——————————————————————————————————————————————————————————————————–

1        1553          0     283301       283301            933

insert into idx2 select * from idx1 where rowid in (select rid from (select rid, rownum rn from (select rowid rid from idx1 where a between 10127 and 243625 order by a) ) where mod(rn, 250) = 0 )

1         153          0      78465        78465            126

insert into idx1 select 250000+rownum from all_objects where rownum <= 126

1        963 0      10422        10422              1              ASSM模式下更大量的空块

insert into idx1 values (251000)

1         984          0      35615        35615            933

insert into idx1 select * from idx2

1      238579          0    3468326      3469984         250000

insert into idx1 select rownum from all_objects, all_objects where rownum <= 250000

SQL> insert into idx1 values (251001);

1 row created.

SQL> commit;

Commit complete.

SQL> select  executions, buffer_gets, disk_reads, cpu_time, elapsed_time, rows_processed, sql_text from v$sql

2  where sql_text like ‘%insert%idx1%’ and sql_text not like ‘%v$sql%’;

EXECUTIONS BUFFER_GETS DISK_READS   CPU_TIME ELAPSED_TIME ROWS_PROCESSED

———- ———– ———- ———- ———— ————–

SQL_TEXT

——————————————————————————————————————————————————————————————————–

1        1553          0     283301       283301            933

insert into idx2 select * from idx1 where rowid in (select rid from (select rid, rownum rn from (select rowid rid from idx1 where a between 10127 and 243625 order by a) ) where mod(rn, 250) = 0 )

1         153          0      78465        78465            126

insert into idx1 select 250000+rownum from all_objects where rownum <= 126

1           7 0       1476         1476              1

insert into idx1 values (251001)    —不分裂的情况与ASSM时一致

1         963 0      10422        10422              1

insert into idx1 values (251000)

1         984          0      35615        35615            933

insert into idx1 select * from idx2

1      238579          0    3468326      3469984         250000

insert into idx1 select rownum from all_objects, all_objects where rownum <= 250000

6 rows selected.

如演示2所示,MSSM情况下叶块分裂读取了比ASSM模式下更多的“空块“;MSSM并不能解决大量删除后叶块分裂需要扫描大量非空块的问题,实际上可能更糟糕。从理论上讲MSSM的freelist只能指出那些未达到pctfree和曾经到达pctfree后来删除记录后使用空间下降到pctused的块(doc:A free list is a list of free data blocks that usually includes blocks existing in a number of different extents within the segment. Free lists are composed of blocks in which free space has not yet reached PCTFREE or used space has shrunk below PCTUSED.),换而言之MSSM模式下”空块“会更多。

3.  自动段管理模式下coalesce后的索引块分裂

SQL> drop tablespace idx1 including contents and datafiles;

Tablespace dropped.

SQL> create tablespace idx1 datafile ‘?/dbs/idx1.dbf’ size 500M

2  segment space management AUTO                                       — ASSM coalesce情况

3  extent management local uniform size 10M;

Tablespace created.

SQL> create table idx1(a number) tablespace idx1;

create index idx1_idx on idx1 (a) tablespace idx1 pctfree 0;

Table created.

SQL> SQL>

Index created.

SQL> SQL> insert into idx1 select rownum from all_objects, all_objects where rownum <= 250000;

commit;

create table idx2 tablespace idx1 as select * from idx1 where 1=2;

insert into idx2

select * from idx1 where rowid in

(select rid from

(select rid, rownum rn from

(select rowid rid from idx1 where a between 10127 and 243625 order by a)

)

where mod(rn, 250) = 0

)

/

commit;

250000 rows created.

SQL> SQL>

Commit complete.

SQL> SQL>

Table created.

SQL> SQL>   2    3    4    5    6    7    8    9

933 rows created.

SQL> SQL>

Commit complete.

SQL> SQL> SQL>

SQL>

SQL> analyze index idx1_idx validate structure;

select blocks,lf_blks,del_lf_rows from index_stats;

Index analyzed.

SQL>

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

1280        499           0

SQL> delete from idx1 where a between 10127 and 243625;

commit;

233499 rows deleted.

SQL> SQL>

Commit complete.

SQL> alter index idx1_idx coalesce;

Index altered.

SQL> analyze index idx1_idx validate structure;

select blocks,lf_blks,del_lf_rows from index_stats;

Index analyzed.

SQL>

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

1280         33           0 — coalesc lf块合并了

SQL> insert into idx1 select * from idx2;

933 rows created.

SQL> SQL> commit;

Commit complete.

SQL>

SQL> insert into idx1 select 250000+rownum from all_objects where rownum <= 126;

commit;

126 rows created.

SQL> SQL>

Commit complete.

SQL> select ss.value,sy.name from v$sesstat ss ,v$sysstat sy where ss.statistic#=sy.statistic# and name like ‘%split%’  and sid=(select distinct sid from v$mystat);

VALUE NAME

———- —————————————————————-

1999 leaf node splits

1995 leaf node 90-10 splits

0 branch node splits

0 queue splits

SQL> insert into idx1 values (251000);                                       确实分裂

1 row created.

SQL> commit;

Commit complete.

SQL> select ss.value,sy.name from v$sesstat ss ,v$sysstat sy where ss.statistic#=sy.statistic# and name like ‘%split%’  and sid=(select distinct sid from v$mystat);

VALUE NAME

———- —————————————————————-

2000 leaf node splits

1996 leaf node 90-10 splits

0 branch node splits

0 queue splits

SQL> select  executions, buffer_gets, disk_reads, cpu_time, elapsed_time, rows_processed, sql_text from v$sql

2  where sql_text like ‘%insert%idx1%’ and sql_text not like ‘%v$sql%’;

EXECUTIONS BUFFER_GETS DISK_READS   CPU_TIME ELAPSED_TIME ROWS_PROCESSED

———- ———– ———- ———- ———— ————–

SQL_TEXT

——————————————————————————————————————————————————————————————————–

1        1603          0     268924       268924            933

insert into idx2 select * from idx1 where rowid in (select rid from (select rid, rownum rn from (select rowid rid from idx1 where a between 10127 and 243625 order by a) ) where mod(rn, 250) = 0 )

1         156          0      78349        78349            126

insert into idx1 select 250000+rownum from all_objects where rownum <= 126

1          23 0       2218         2218              1                             少量buffer gets

insert into idx1 values (251000)

1         191          0      15596        15596            933

insert into idx1 select * from idx2

1      240852          0    3206130      3206130         250000

insert into idx1 select rownum from all_objects, all_objects where rownum <= 250000

SQL> insert into idx1 values (251001);

1 row created.

SQL> commit;

Commit complete.

SQL> select  executions, buffer_gets, disk_reads, cpu_time, elapsed_time, rows_processed, sql_text from v$sql

2  where sql_text like ‘%insert%idx1%’ and sql_text not like ‘%v$sql%’;

EXECUTIONS BUFFER_GETS DISK_READS   CPU_TIME ELAPSED_TIME ROWS_PROCESSED

———- ———– ———- ———- ———— ————–

SQL_TEXT

——————————————————————————————————————————————————————————————————–

1        1603          0     268924       268924            933

insert into idx2 select * from idx1 where rowid in (select rid from (select rid, rownum rn from (select rowid rid from idx1 where a between 10127 and 243625 order by a) ) where mod(rn, 250) = 0 )

1         156          0      78349        78349            126

insert into idx1 select 250000+rownum from all_objects where rownum <= 126

1           9 0       1574         1574              1

insert into idx1 values (251001)

1          23 0       2218         2218              1

insert into idx1 values (251000)

1         191          0      15596        15596            933

insert into idx1 select * from idx2

1      240852          0    3206130      3206130         250000

insert into idx1 select rownum from all_objects, all_objects where rownum <= 250000

6 rows selected.

如演示三所示在删除后进行coalesce操作,合并操作将大量空块分离出了索引结构(move empty out of index structure),之后的叶块分裂仅读取了少量必要的块。

4.  手动段管理模式下coalesce后的索引块分裂

SQL> drop tablespace idx1 including contents and datafiles;

Tablespace dropped.

SQL> create tablespace idx1 datafile ‘?/dbs/idx1.dbf’ size 500M

2  segment space management MANUAL                               — mssm情况下 coalesce

3  extent management local uniform size 10M;

Tablespace created.

SQL> create table idx1(a number) tablespace idx1;

create index idx1_idx on idx1 (a) tablespace idx1 pctfree 0;

Table created.

SQL> SQL> insert into idx1 select rownum from all_objects, all_objects where rownum <= 250

Index created.

SQL> SQL> 000;

commit;

create table idx2 tablespace idx1 as select * from idx1 where 1=2;

insert into idx2

select * from idx1 where rowid in

(select rid from

(select rid, rownum rn from

(select rowid rid from idx1 where a between 10127 and 243625 order by a)

)

where mod(rn, 250) = 0

)

/

commit;

250000 rows created.

SQL> SQL>

Commit complete.

SQL> SQL>

Table created.

SQL> SQL>   2    3    4    5    6    7    8    9

933 rows created.

SQL> SQL>

Commit complete.

SQL> SQL> SQL>

SQL>

SQL> analyze index idx1_idx validate structure;

select blocks,lf_blks,del_lf_rows from index_stats;

Index analyzed.

SQL>

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

1280        499           0

SQL> delete from idx1 where a between 10127 and 243625;

commit;

233499 rows deleted.

SQL> SQL>

Commit complete.

SQL> analyze index idx1_idx validate structure;

select blocks,lf_blks,del_lf_rows from index_stats;

Index analyzed.

SQL>

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

1280        499      233499

SQL> alter index idx1_idx coalesce;

Index altered.

SQL> analyze index idx1_idx validate structure;

select blocks,lf_blks,del_lf_rows from index_stats;

Index analyzed.

SQL>

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

1280         33           0

SQL> insert into idx1 select * from idx2;

933 rows created.

SQL> SQL> commit;

Commit complete.

SQL>

SQL> insert into idx1 select 250000+rownum from all_objects where rownum <= 126;

commit;

126 rows created.

SQL> SQL>

Commit complete.

SQL> select ss.value,sy.name from v$sesstat ss ,v$sysstat sy where ss.statistic#=sy.statistic# and name like ‘%split%’  and sid=(select distinct sid from v$mystat);

VALUE NAME

———- —————————————————————-

2502 leaf node splits

2494 leaf node 90-10 splits

0 branch node splits

0 queue splits

SQL> insert into idx1 values (251000);                       确实分裂

1 row created.

SQL> commit;

Commit complete.

SQL> select ss.value,sy.name from v$sesstat ss ,v$sysstat sy where ss.statistic#=sy.statistic# and name like ‘%split%’  and sid=(select distinct sid from v$mystat);

VALUE NAME

———- —————————————————————-

2503 leaf node splits

2495 leaf node 90-10 splits

0 branch node splits

0 queue splits

SQL> select  executions, buffer_gets, disk_reads, cpu_time, elapsed_time, rows_processed, sql_text from v$sql

2  where sql_text like ‘%insert%idx1%’ and sql_text not like ‘%v$sql%’;

EXECUTIONS BUFFER_GETS DISK_READS   CPU_TIME ELAPSED_TIME ROWS_PROCESSED

———- ———– ———- ———- ———— ————–

SQL_TEXT

——————————————————————————————————————————————————————————————————–

1        1553          0     281059       281059            933

insert into idx2 select * from idx1 where rowid in (select rid from (select rid, rownum rn from (select rowid rid from idx1 where a between 10127 and 243625 order by a) ) where mod(rn, 250) = 0 )

1         153          0      77817        77817            126

insert into idx1 select 250000+rownum from all_objects where rownum <= 126

1          19          0       2010         2010              1                       少量buffer get

insert into idx1 values (251000)

1         126          0      15364        15364            933

insert into idx1 select * from idx2

1      238644          0    3229737      3230569         250000

insert into idx1 select rownum from all_objects, all_objects where rownum <= 250000

SQL> insert into idx1 values (251001);

1 row created.

SQL> commit;

Commit complete.

SQL> select  executions, buffer_gets, disk_reads, cpu_time, elapsed_time, rows_processed, sql_text from v$sql

2  where sql_text like ‘%insert%idx1%’ and sql_text not like ‘%v$sql%’;

EXECUTIONS BUFFER_GETS DISK_READS   CPU_TIME ELAPSED_TIME ROWS_PROCESSED

———- ———– ———- ———- ———— ————–

SQL_TEXT

——————————————————————————————————————————————————————————————————–

1        1553          0     281059       281059            933

insert into idx2 select * from idx1 where rowid in (select rid from (select rid, rownum rn from (select rowid rid from idx1 where a between 10127 and 243625 order by a) ) where mod(rn, 250) = 0 )

1         153          0      77817        77817            126

insert into idx1 select 250000+rownum from all_objects where rownum <= 126

1          7 0       1460         1460              1

insert into idx1 values (251001)

1          19 0       2010         2010              1

insert into idx1 values (251000)

1         126          0      15364        15364            933

insert into idx1 select * from idx2

1      238644          0    3229737      3230569         250000

insert into idx1 select rownum from all_objects, all_objects where rownum <= 250000

6 rows selected.

如演示4所示,MSSM模式下合并操作与ASSM情况下大致一样,合并操作可以有效解决该问题。

5.  Coalesce合并操作的锁影响

SQL> create table coal (t1 int);

Table created.

SQL> create index pk_t1 on coal(t1);

Index created.

SQL> begin

2    for i in 1..3000 loop

3      insert into coal values(i);

4      commit;

5      end loop;

6      end;

7  /

PL/SQL procedure successfully completed.

SQL> delete coal where t1>500;

2500 rows deleted.

SQL> commit;

Commit complete.

SQL> analyze index pk_t1 validate structure;

Index analyzed.    注意analyze validate操作会block一切dml操作

SQL> select blocks,lf_blks,del_lf_rows from index_stats;

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

8          6        2500          删除后的状态

此时另开一个会话,开始dml操作:

SQL> update coal set t1=t1+1;

500 rows updated.

回到原会话

SQL> alter index pk_T1 coalesce;             — coalesce 未被阻塞

Index altered.

在另一个会话中commit,以便执行validate structure

SQL> analyze index pk_t1 validate structure;

Index analyzed.

SQL> select blocks,lf_blks,del_lf_rows from index_stats;

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

8          3         500

显然coalesce的操作没有涉及有dml操作的块

在没有dml操作的情况下:

SQL> truncate table coal;

Table truncated.

SQL> begin

2    for i in 1..3000 loop

3      insert into coal values(i);

4      commit;

5      end loop;

6      end;

7  /

PL/SQL procedure successfully completed.

SQL> analyze index pk_t1 validate structure;

Index analyzed.

SQL> select blocks,lf_blks,del_lf_rows from index_stats;

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

8          6           0

SQL> delete coal where t1>500;

2500 rows deleted.

SQL> commit;

Commit complete.

SQL> analyze index pk_t1 validate structure;

Index analyzed.

SQL> select blocks,lf_blks,del_lf_rows from index_stats;

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

8          6        2500

SQL> alter index pk_t1 coalesce;

Index altered.

SQL> analyze index pk_t1 validate structure;

Index analyzed.

SQL> select blocks,lf_blks,del_lf_rows from index_stats;

BLOCKS    LF_BLKS DEL_LF_ROWS

———- ———- ———–

8          1           0

没有dml时,coalesce 操作涉及了所有块

如演示5所示coalesce会避开dml操作涉及的块,但在coalesec的短暂间歇出现在索引上有事务的块不会太多。且coalesce操作不会降低索引高度。

附件是关于rebuild及coalesce索引操作的详细描述:

6.  Coalesce操作总结

优点:

l  是一种快速的操作,对整体性能影响最小(not performance sensitive)。

l  不会锁表,绕过有事务的索引块。

l  可以有效解决现有的问题。

l  不会降低索引高度,引起再次的root split

缺点:

l  需要针对个别对象,定期执行合并操作;无法一劳永逸地全局地解决该问题。

7.  Linux 10.2.0.4上相关补丁的技术交流

Metalink bug 8286901 note中叙述了一位用户遇到相同的问题并提交了SR,当时oracle support给出了one-off补丁,但该用户在apply了该补丁后仍未解决问题。

以下为note 原文:

It is similar to bug8286901, but after applied patch8286901, still see enq tx
contentiona with high “failed probes on index block reclamation”

Issue encountered by customer and Oracle developer (Stefan Pommerenk).


He describes is thus:


"Space search performed by the index splitter can't find space in neighboring


blocks, and then instead of allocating new space, we go and continue to


search for space elsewhere, which manifests itself in block reads from disk,


block cleanouts, and subsequent blocks written due to aggressive MTTR


setting."




"To clarify: the cleanouts are not the problem per se. The culprit seems to


be that the space search performed by the index splitter can't find space in


neighboring blocks, and then instead of allocating new space, we go and


continue to search for space elsewhere, which manifests itself in block reads


from disk, block cleanouts, and subsequent blocks written due to aggressive


MTTR setting. This action has caused other sessions to get blocked on TX


enqueue contention, blocked on the splitting session. Advice was to set 10224


trace event for the splitter for a short time only in order to get


diagnostics as to why the space search rejected most blocks.


> A secondary symptom are the bitmap level 1 block updates, which may or may


not be related to the space search; I've not seen them before, maybe because


I didn't really pay attention :P , but the symptoms seen in the ASH trace


indicate it's the same problem. Someone in space mgmt has to look at it to


confirm it is the same problem."

与该用户进行了mail私下交流,他的回复:

I still have a case open with Oracle. I believe that this is a bug in the Oracle code. The problem is that it has been difficult to create a reproducible test case for Oracle support. My specific issue was basically put on hold pending the results of another customer’s service request that appeared to have had the same issue, (9034788). Unfortunately they couldn’t reproduce the issue in that case either.

I believe that there is a correlation between the enq TX – index contention wait event and a spike in the number of ‘failed probes on index block reclamation. I have specifically asked Oracle to explain why there is a spike in the ‘failed probes on index block reclamation’ during the same time frame as the enq TX index contention wait event, but they have not answered my question.

I was hoping that some investigation by Oracle Support into the failed probes metric might get someone on the right track to discovering the bug. That hasn’t happened though.

Hi ,

Thanks for your sharing .  The bug (or specific ktsp behave) is fatal in response time sensitive  OLTP env.

I would like to ask my customer to coalesce those index where massive deleted regularly.

Thanks for your help again!

Yes, I saw that. I have applied patch 8286901 and set the event for version 10.2.0.4, but the problem still occurs periodically. And as I mentioned before, we see a correlation between enq TX waits and the failed probes on index block reclamation. Which is why I still think that it is a bug. I agree that trying to rebuild or coalesce the indexes are simply attempts to workaround the issue and not solve the root cause.

Early on when I started on this issue I did do some index dumps and could clearly see that we had lots of blocks with only 1 or 2 records after our mass delete jobs. I have provided Oracle Support with this information as well as oradump files while the problem is occurring, but they don’t seem to be able to find anything wrong so far.

If you are interested in seeing if you are experiencing a high ‘failed probes on index block reclamation’ event run the query below.

select SS.snap_id,
SS.stat_name,
TO_CHAR(S.BEGIN_INTERVAL_TIME, ‘DAY’) DAY,
S.BEGIN_INTERVAL_TIME,
S.END_INTERVAL_TIME,
SS.value,
SS.value – LAG(SS.VALUE, 1, ss.value) OVER (ORDER BY SS.SNAP_ID) AS DIFF
from DBA_HIST_SYSSTAT SS,
DBA_HIST_SNAPSHOT S
where S.SNAP_ID = SS.SNAP_ID
AND SS.stat_NAME = ‘failed probes on index block reclamation’
ORDER BY SS.SNAP_ID ;

  1. 在11gr2上的测试

在最新的11gr2中进行了测试,仍可以重现该问题(如图单条insert引起了6675buffer_gets,这是在更大量数据的情况下)。

我们可以猜测Oracle提供的one-off补丁中可能是为叶块分裂所会扫描的“空块”附加了一个上限,在未达到上限的情况下扫描仍会发生。而在主流的公开的发行版本中Oracle不会引入该补丁的内容。尝试在没有缓存的情况下引起分裂问题,分裂引起了大约4000个块的物理读,但该操作仍在0.12秒(有缓存是0.02秒,如图)内完成了(该测试使用普通ata硬盘,读取速度在100MB/S: Timing buffered disk reads:  306 MB in  3.00 seconds = 101.93 MB/sec);从1月21日的ash视图中可以看到引起split的260会话处于单块读等待(db file sequential read)中,且已等待了43950us约等于44ms;这与良好io的经验值10ms左右有较大出入;我们可以确信io性能问题也是引发此叶块分裂延迟如此显性的一个重要因素。

具体结论

综上所述,在之前讨论的几个方案中,MSSM方式是不能避免索引分裂引起交易超时问题的;不删除数据的方案在许多对象上不可行;10.2.0.4上的one-off补丁因为目前仅存在Linux版本,可以考虑声请补丁后具体测试(因目前没有补丁所以处于未知状态)。Coalesce合并索引是目前既有的最具可操作性且无副作用的解决方案。

Oracle内部错误ORA-07445[kpopfr()+339] [SIGFPE]一例

当所有列长度综合超过1048576时可能引发的一个dump错误,session会自动关闭。一般只有列很多且单列较“宽”时可能出现该错误。

已经测试的在10.2.0.1,以及10.2.0.3上均可以再现该问题,测试方法:

create table test

( c000 char(2000),

c001 char(2000),

c523 char(2000),

c524 char(576));

— sum of all column size is 1048576(0x100000).

Run next shell script.

while [ 1 ]

do

echo “set feedback off”

echo “select * from test where c001 = ‘A’;”

done | sqlplus -s scott/tiger

Note 245840.1 Information on the sections in this article

以上循环执行一段时间后session会被关闭,告警日志中出现

ORA-07445: exception encountered: core dump [kpopfr()+339] [SIGFPE] [Integer divide by zero][0x002327FF5] [] []的记录。没有在9i版本上测试,不能确定其影响。

该bug在10.2.0.4 patch set中已被修复,也可以通过小补丁形式修复,Oracle发布的小布丁只针对10.2.0.3版本,即10.2.0.1上是不能打的。

附bug描述原文:
Subject:     Bug 5753629 – Query may dump [in kpopfr / kposdi]

Doc ID:     5753629.8     Type:     PATCH

Modified Date :     03-APR-2009     Status:     PUBLISHED

@ Note to support: do not edit this note – it is auto generated
Bug 5753629  Query may dump [in kpopfr / kposdi]

This note gives a brief overview of bug 5753629.
The content was last updated on: 03-APR-2009
Click here for details of each of the sections below.
Affects:

Product (Component)     Oracle Server (Rdbms)
Range of versions believed to be affected     Versions < 11
Versions confirmed as being affected

* 10.2.0.3

Platforms affected     Generic (all / most platforms affected)

Fixed:

This issue is fixed in

* 10.2.0.3 Patch 9 on Windows Platforms
* 10.2.0.4 (Server Patch Set)
* 11.1.0.6 (Base Release)
* Process May Dump (ORA-7445) / Abend / Abort
* Dump in or under kpopfr / kposdi
* (None Specified)

Symptoms:

Related To:

Description

Repeatedly executing a query can lead to a dump in kpopfr.

eg:

create table test

( c000 char(2000),

c001 char(2000),

c523 char(2000),

c524 char(576));

— sum of all column size is 1048576(0x100000).

Run next shell script.

while [ 1 ]

do

echo “set feedback off”

echo “select * from test where c001 = ‘A’;”

done | sqlplus -s scott/tiger

^

Dump occurs

Please note: The above is a summary description only. Actual symptoms can vary. Matching to any

symptoms here does not confirm that you are encountering this problem. Always consult with Oracle

Support for advice.

References

Bug 5753629 (This link will only work for PUBLISHED bugs)

如何使用gdb工具对Oracle系统状态(systemstate)做trace

当Oracle系统hang住 ,无法使用一切方法登录时 (包括 sqlplus -prelim / as sysdba),我们可以使用gdb调试工具来对 Oracle做系统 dump ,通过 系统 dump信息 判断 具体hang的原因 。 若直接 将 进程 kill 掉,则将失去现场 无法帮助今后避免 这样的hang情况。

要使用gdb 外部工具, 就需要知道目前实例中后台进程的进程号。

我们一般通过 以下命令列出 Oracle 进程:ps -ef|grep <SID>

[oracle@rh2 ~]$ ps -ef|grep oraclewebmoney
oracle   16996 16995  0 21:55 ?        00:00:00 oraclewebmoney (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

然后启动gdb ,指定Oracle软件中二进制文件 oracle的位置和 进程id

[oracle@rh2 udump]$ gdb $ORACLE_HOME/bin/oracle  16996
GNU gdb Red Hat Linux (6.3.0.0-1.159.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type “show copying” to see the conditions.
There is absolutely no warranty for GDB.  Type “show warranty” for details.
This GDB was configured as “x86_64-redhat-linux-gnu”…
(no debugging symbols found)
Using host libthread_db library “/lib64/tls/libthread_db.so.1″.

Attaching to program: /u01/oracle/product/10.2.0/db_1/bin/oracle, process 14594
Reading symbols from /u01/oracle/product/10.2.0/db_1/lib/libskgxp10.so…(no debugging symbols found)…done.
Loaded symbols for /u01/oracle/product/10.2.0/db_1/lib/libskgxp10.so
Reading symbols from /u01/oracle/product/10.2.0/db_1/lib/libhasgen10.so…done.
Loaded symbols for /u01/oracle/product/10.2.0/db_1/lib/libhasgen10.so
Reading symbols from /u01/oracle/product/10.2.0/db_1/lib/libskgxn2.so…done.
Loaded symbols for /u01/oracle/product/10.2.0/db_1/lib/libskgxn2.so
Reading symbols from /u01/oracle/product/10.2.0/db_1/lib/libocr10.so…done.
Loaded symbols for /u01/oracle/product/10.2.0/db_1/lib/libocr10.so
Reading symbols from /u01/oracle/product/10.2.0/db_1/lib/libocrb10.so…done.
Loaded symbols for /u01/oracle/product/10.2.0/db_1/lib/libocrb10.so
Reading symbols from /u01/oracle/product/10.2.0/db_1/lib/libocrutl10.so…done.
Loaded symbols for /u01/oracle/product/10.2.0/db_1/lib/libocrutl10.so
Reading symbols from /u01/oracle/product/10.2.0/db_1/lib/libjox10.so…

在gdb 提示行中 输入 print ksudss(10),即

(gdb) print ksudss(10)

之后将在udump目录中产生相关<SID>_ora_<pid>的trace文件,我们通过分析trace可以发现hang的主要原因。

trace文件示例如下:

System name:    Linux
Node name:      rh2
Release:        2.6.9-78.ELsmp
Version:        #1 SMP Wed Jul 9 15:46:26 EDT 2008
Machine:        x86_64
Instance name: webmoney
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 16996, image: oracle@rh2 (TNS V1-V3)

*** 2009-09-07 21:57:14.100
*** SERVICE NAME:(SYS$USERS) 2009-09-07 21:57:14.100
*** SESSION ID:(528.2041) 2009-09-07 21:57:14.100
===================================================
SYSTEM STATE
————
System global information:
processes: base 0x91637c30, size 500, cleanup 0x9167a2e0
allocation: free sessions 0x91779840, free calls (nil)
control alloc errors: 0 (process), 0 (session), 0 (call)
PMON latch cleanup depth: 0
seconds since PMON’s last scan for dead processes: 45
system statistics:

[gview file=”http://askmac.cn/wp-content/uploads/resource/gdb.pdf”]

参数cluster_interconnect详细介绍

以下文本摘自metalink doc:

This note attempts to clarify the cluster_interconnects parameter and the
platforms on which the implementation has been made. A brief explanation on
the workings of the parameter has also been presented in this note.
This is also one of the most frequently questions related to cluster and RAC
installations on most sites and forms a part of the prerequisite as well.

ORACLE 9I RAC – Parameter CLUSTER_INTERCONNECTS
———————————————–

FREQUENTLY ASKED QUESTIONS
————————–
November 2002

CONTENTS
——–
1.  What is the parameter CLUSTER_INTERCONNECTS for ?
2.  Is the parameter CLUSTER_INTERCONNECTS available for all platforms ?
3.  How is the Interconnect recognized on Linux ?
4.  Where could I find more information on this parameter ?
5.  How to detect which interconnect is used ?
6.  Cluster_Interconnects is mentioned in the 9i RAC administration
    guide as a Solaris specific parameter, is this the only platform
    where this parameter is available ?
7.  Are there any side effects for this parameter, namely affecting normal
    operations ?
8.  Is the parameter OPS_INTERCONNECTS which was available in 8i similar
    to this parameter ?
9.  Does Cluster_interconnect allow failover from one Interconnect to another
    Interconnect ?
10. Is the size of messages limited on the Interconnect ?
11. How can you see which protocoll is being used by the instances ?
12. Can the parameter CLUSTER_INTERCONNECTS be changed dynamically during runtime ?

 
QUESTIONS & ANSWERS
——————-
1. What is the parameter CLUSTER_INTERCONNECTS for ?

Answer
——
This parameter is used to influence the selection of the network interface
for Global Cache Service (GCS) and Global Enqueue Service (GES) processing.

This note does not compare the other elements of 8i OPS with 9i RAC
because of substantial differences in the behaviour of both architectures.
Oracle 9i RAC has certain optimizations which attempt to transfer most of
the information required via the interconnects so that the number of disk
reads are minimized. This behaviour known as Cache fusion phase 2 is summarised
in Note 139436.1
The definition of the interconnnect is a private network which
will be used to transfer the cluster traffic and Oracle Resource directory
information and blocks to satisfy queries. The technical term for that is
cache fusion.

The CLUSTER_INTERCONNECTS should be used when
– you want to override the default network selection
– bandwith of a single interconnect does not meet the bandwith requirements of
  a Real Application Cluster database

The syntax of the parameter is:

CLUSTER_INTERCONNECTS = if1:if2:…:ifn
Where if<n> is an IP address in standard dotted-decimal format, for example,
144.25.16.214. Subsequent platform implementations may specify interconnects
with different syntaxes.
2. Is the parameter CLUSTER_INTERCONNECTS available for all platforms ?

Answer
——

This parameter is configurable on most platforms.
This parameter can not be used on Linux.

The following Matrix shows when the parameter was introduced on which platform:

Operating System    Available since
AIX                   9.2.0
HP/UX                 9.0.1
HP Tru64              9.0.1
HP OPenVMS            9.0.1
Sun Solaris           9.0.1

References
———-
Bug <2119403> ORACLE9I RAC ADMINISTRATION SAYS CLUSTER_INTERCONNECTS IS SOLARIS ONLY.
Bug <2359300> ENHANCE CLUSTER_INTERCONNECTS TO WORK WITH 9I RAC ON IBM
3.  How is the Interconnect recognized on Linux ?

Answer
——
Since Oracle9i 9.2.0.8 CLUSTER_INTECONNETCS can be used to change the interconnect.
A patch is also available for 9.2.0.7 under Patch 4751660.
Before 9.2.0.8 the Oracle implementation for the interface selection reads the ‘private hostname’
in the cmcfg.ora file and uses the corresponding ip-address for the interconnect.
If no private hostname is available the public hostname will be used.
4.  Where could I find information on this parameter ?

Answer
——

The parameter is documented in the following books:
Oracle9i Database Reference Release 2 (9.2)
Oracle9i Release 1 (9.0.1) New Features in Oracle9i Database Reference –
                   What’s New in Oracle9i Database Reference?
Oracle9i Real Application Clusters Administration Release 2 (9.2)
Oracle9i Real Application Clusters Deployment and Performance Release 2 (9.2)

Also port specific documentation may contain information about the usage of
the cluster_interconnects parameter.

Documentation can be viewed on
    http://tahiti.oracle.com
    http://otn.oracle.com/documentation/content.html
References:
———–
Note 162725.1: OPS/RAC VMS: Using alternate TCP Interconnects on 8i OPS
               and 9i RAC on OpenVMS

Note 151051.1: Init.ora Parameter “CLUSTER_INTERCONNECTS” Reference Note

5. How to detect which interconnect is used ?
    The following commands show which interconnect is used for UDP or TCP:
    sqlplus> connect / as sysdba
             oradebug setmypid
             oradebug ipc
             exit

    The corresponding trace can be found in the user_dump_dest directory and for
    example contains the following information in the last couple of lines:

           SKGXPCTX: 0x32911a8 ctx
           admno 0x12f7150d admport:
           SSKGXPT 0x3291db8 flags SSKGXPT_READPENDING     info for network 0
                 socket no 9     IP 172.16.193.1         UDP 43307
                 sflags SSKGXPT_WRITESSKGXPT_UP
                 info for network 1
                 socket no 0     IP 0.0.0.0      UDP 0
                 sflags SSKGXPT_DOWN
           context timestamp 0x1ca5
                 no ports
   Please note that on some platforms and versions (Oracle9i 9.2.0.1 on Windows)
   you might see an ORA-70 when the command oradebug ipc has not been
   implemented.

   When  other protocols such as LLT, HMP or RDG are used, then the trace file will not
   reveal an IP address.
6.  Cluster_Interconnects is mentioned in the 9i RAC administration
    guide as a Solaris specific parameter, is this the only platform
    where this parameter is available ?

Answer
—– 

This information that this parameter works on Solaris only is incorrect. Please
check the answer for question number 2 for the complete list of platforms for the same.

References:
———–
bug <2119403> ORACLE9I RAC ADMINISTRATION SAYS CLUSTER_INTERCONNECTS IS SOLARIS ONLY.
7.  Are there any side effects for this parameter, namely affecting normal
    operations ?

Answer
—–
When you set CLUSTER_INTERCONNECTS in cluster configurations, the
interconnect high availability features are not available. In other words,
an interconnect failure that is normally unnoticeable would instead cause
an Oracle cluster failure as Oracle still attempts to access the network
interface which has gone down. Using this parameter you are explicitly
specifying the interface or list of interfaces to be used.
 

8.  Is the parameter OPS_INTERCONNECTS which was available in 8i similar
    to this parameter ?

Answer
——
Yes, the parameter OPS_INTERCONNECTS was used to influence the network selection
for the Oracle 8i Parallel Server.

Reference
———
Note <120650.1> Init.ora Parameter “OPS_INTERCONNECTS” Reference Note
9.  Does Cluster_interconnect allow failover from one Interconnect to another
    Interconnect ?

Answer
——
Failover capability is not implemented at the Oracle level. In general this
functionality is delivered by hardware and/or Software of the operating system.
For platform details please see Oracle platform specific documentation
and the operating system documentation.
10. Is the size of messages limited on the Interconnect ?

Answer
——
The message size depends on the protocoll and platform.
UDP: In Oracle9i Release 2 (9.2.0.1) message size for UDP was limited to 32K.
     Oracle9i 9.2.0.2 allows to use bigger UDP message sizes depending on the
     platform. To increase throughput on an interconnect you have to adjust
     udp kernel parameters.
TCP: There is no need to set the message size for TCP.
RDG: The recommendations for RDG are documented in
        Oracle9i Administrator’s Reference – Part No. A97297-01
References
———-
Bug <2475236> RAC multiblock read performance issue using UDP IPC
11. How can you see which protocoll is being used by the instances ?

Answer
——
Please see the alert-file(s) of your RAC instances. During startup you’ll
   find a message in the alert-file that shows the protocoll being used.

      Wed Oct 30 05:28:55 2002
      cluster interconnect IPC version:Oracle UDP/IP with Sun RSM disabled
      IPC Vendor 1 proto 2 Version 1.0
12. Can the parameter CLUSTER_INTERCONNECT be changed dynamically during runtime ?

Answer
——
    No. Cluster_interconnects is a static parameter and can only be set in the
    spfile or pfile (init.ora)

沪ICP备14014813号-2

沪公网安备 31010802001379号