ORA-07445: [__lwp_kill()+8] [SIGIOT]错误一例

这是一套SunOS 5.10上的10.2.0.3的RAC系统,8月初告警日志中陆续出现以下记录:

Tue Aug  3 15:17:04 2010
Errors in file /u01/app/oracle/admin/prsi061/udump/prsi061a_ora_27774.trc:
ORA-07445: exception encountered: core dump [__lwp_kill()+8] [SIGIOT] [unknown code] [0x6C7E00000000] [] []

SIGIOT信号伴随7445错误出现并不多见,因为该信号一般是用来实现相关的硬件异常的。
我们可以欣赏一下这个trace文件
trace文件中的堆栈信息如下:

ksedmp()+744         CALL     ksedst()             000000840 ? 1066C60CC ?
                                                   000000000 ? 1066C2BC0 ?
                                                   1066C1928 ? 1066C2328 ?
ssexhd()+1240        CALL     ksedmp()             000106400 ? 106530764 ?
                                                   106530000 ? 000106530 ?
                                                   000106400 ? 106530764 ?
__sighndlr()+12      PTR_CALL 0000000000000000     10652D000 ? 1066C9EF0 ?
                                                   10652A72C ? 00010652D ?
                                                   000000006 ? 000000067 ?
call_user_handler()  CALL     __sighndlr()         000000006 ? 1066C9EF0 ?
+992                                               1066C9C10 ? 10033B1C0 ?
                                                   000000000 ? 000000005 ?
sigacthandler()+84   CALL     call_user_handler()  FFFFFFFF7D500200 ?
                                                   FFFFFFFF7D500200 ?
                                                   1066C9C10 ? 000000009 ?
                                                   000000000 ? 000000000 ?
__lwp_kill()+8       PTR_CALL 0000000000000000     000000000 ? 1066C9EF0 ?
                                                   1066C9C10 ?
                                                   FFFFFFFF7D500200 ?
                                                   000000000 ?
                                                   FFFFFFFF7C73C000 ?
raise()+16           FRM_LESS _pthread_kill()      000000000 ? 000000006 ?
                                                   FFFFFFFF7F60AC48 ?
                                                   FFFFFFFF7C54B048 ?
                                                   000000005 ?
                                                   FFFFFFFF7C74CB50 ?
abort()+208          CALL     raise()              000000006 ? 000000006 ?
                                                   000000005 ?
                                                   FFFFFFFF7C748500 ?
                                                   FFFFFFFF7D500200 ?
                                                   000000005 ?
vcsipc_poll()+1724   CALL     FFFFFFFF7F5477E0     000001DA0 ?
                                                   FFFFFFFF7F550D00 ?
                                                   FFFFFFFF7F4205A8 ?
                                                   0001F107C ? 000000001 ?
                                                   000000000 ?
skgxpwait()+5604     CALL     vcsipc_poll()        FFFFFFFF7FFECE90 ?
                                                   106747378 ? 000001FD0 ?
                                                   FFFFFFFF7FFE78C8 ?
                                                   000001C00 ? 000200000 ?
ksxpwait()+1804      CALL     0000000106524980     FFFFFFFF7F54FC28 ?
                                                   106747378 ? 000000000 ?
                                                   FFFFFFFF7FFECF68 ?
                                                   0000004E2 ?
                                                   FFFFFFFF7FFECE90 ?
ksliwat()+2952       CALL     ksxpwait()           000000000 ? 000101000 ?
                                                   000000000 ? 10652DB98 ?
                                                   000001000 ? 106533FC8 ?
kslwaitns_timed()+4  CALL     ksliwat()            000000000 ? 000000002 ?
8                                                  00000007D ? 5798B6C18 ?
                                                   5798B6BA0 ? 000032033 ?
kskthbwt()+232       CALL     kslwaitns_timed()    00000007D ? 000000001 ?
                                                   00000007C ? 000000000 ?
                                                   FFFFFFFF7FFED3B8 ?
                                                   000000001 ?
kslwait()+116        CALL     kskthbwt()           00000007D ? 00000007C ?
                                                   000000000 ? 000000007 ?
                                                   000032033 ? 000000001 ?
ksxprcv()+916        CALL     kslwait()            0925A2B0A ? 000000000 ?
                                                   00000000A ? 00000000A ?
                                                   000032033 ? 000000001 ?
kclwcrs()+960        CALL     ksxprcv()            0001056DE ? 10652B118 ?
                                                   00000007D ? 1056DE598 ?
                                                   00010652A ? 1056DE000 ?
kclgclk()+10052      CALL     kclwcrs()            3800143A8 ? 000000000 ?
                                                   000000000 ? 519F716A0 ?
                                                   000000007 ? 000106535 ?
kcbzib()+19288       CALL     kclgclk()            000106400 ? 00000000C ?
                                                   FFFFFFFF7FFF5EB8 ?
                                                   000000000 ? 000000000 ?
                                                   000105400 ?
kcbgtcr()+10528      CALL     kcbzib()             5665FB520 ?
                                                   FFFFFFFF7C058170 ?
                                                   000105C00 ? 000000000 ?
                                                   000000006 ?
                                                   FFFFFFFF7FFF44A0 ?
ktrget()+260         CALL     kcbgtcr()            FFFFFFFF7FFF5028 ?
                                                   FFFFFFFF7FFF502C ?
                                                   5665FB520 ? 000000000 ?
                                                   000000000 ? 57FF6DA18 ?
kdst_fetch()+872     CALL     ktrget()             FFFFFFFF7C058160 ?
                                                   FFFFFFFF7C0580E0 ?
                                                   00000023F ? 000000000 ?
                                                   FFFFFFFF7C058170 ?
                                                   3800172B8 ?
kdstf0100101km()+50  CALL     kdst_fetch()         FFFFFFFF7C058158 ?
4                                                  000000000 ?
                                                   FFFFFFFF7FFF5688 ?
                                                   000106528 ? 00000023F ?
                                                   00000FC00 ?
kdsttgr()+27872      CALL     kdstf0100101km()     FFFFFFFF7C058158 ?
                                                   4E6AB809E ? 000000001 ?
                                                   000000000 ? 54C540BB8 ?
                                                   FFFFFFFF7C058038 ?
qertbFetch()+720     CALL     kdsttgr()            000000000 ? 000000000 ?
                                                   FFFFFFFF7C054EA8 ?
                                                   FFFFFFFF7C058158 ?
                                                   000000004 ? 1032109C0 ?
qerflFetch()+172     PTR_CALL 0000000000000000     000000001 ? 000000001 ?
                                                   1056DE068 ?
                                                   FFFFFFFF7FFF6348 ?
                                                   10652B298 ? 000000002 ?
opifch2()+8204       PTR_CALL 0000000000000000     FFFFFFFF7C058898 ?
                                                   102527EE0 ?
                                                   FFFFFFFF7FFF69D0 ?
                                                   000000001 ? 103210000 ?
                                                   10320CA00 ?
opifch()+52          CALL     opifch2()            FFFFFFFF7FFF6878 ?
                                                   000000090 ? 000000000 ?
                                                   000000001 ? 000000000 ?
                                                   105A2C000 ?
opipls()+3532        CALL     opifch()             000000005 ? 000000002 ?
                                                   FFFFFFFF7FFF6F20 ?
                                                   000000002 ? 000000000 ?
                                                   000000001 ?
opiodr()+1548        PTR_CALL 0000000000000000     000106400 ? 10653A000 ?
                                                   000105800 ? 000000010 ?
                                                   00010653A ?
                                                   FFFFFFFF7B6392D8 ?
rpidrus()+196        CALL     opiodr()             10576DC08 ? 000000066 ?
                                                   10652B000 ? 000000001 ?
                                                   FFFFFFFF7C03A830 ?
                                                   00010652D ?
skgmstack()+168      PTR_CALL 0000000000000000     FFFFFFFF7FFF8350 ?
                                                   000000006 ?
                                                   FFFFFFFF7FFF8100 ?
                                                   10652A000 ? 000000066 ?
                                                   1056DE000 ?
rpidru()+172         CALL     skgmstack()          10034D1E0 ?
                                                   FFFFFFFF7FFF8350 ?
                                                   00000F618 ? 10034D1E0 ?
                                                   FFFFFFFF7FFF8350 ?
                                                   FFFFFFFF7FFF8328 ?
rpiswu2()+500        PTR_CALL 0000000000000000     FFFFFFFF7FFF8B18 ?
                                                   1056C3000 ? 1056C2B90 ?
                                                   1056C0F50 ? 000000C10 ?
                                                   000000182 ?
rpidrv()+1696        CALL     rpiswu2()            000000000 ? 10652B298 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFF84E8 ?
                                                   1056DE000 ? 00010652A ?
psddr0()+516         CALL     rpidrv()             FFFFFFFF7FFF8EC0 ?
                                                   000105C00 ?
                                                   FFFFFFFF7FFF89C4 ?
                                                   000000002 ?
                                                   FFFFFFFF7B615F60 ?
                                                   00010652D ?
psdnal()+512         CALL     psddr0()             106541CE0 ? 10652B298 ?
                                                   000000066 ? 1056DE068 ?
                                                   000000008 ? 00000000A ?
pevm_BFTCHC()+308    PTR_CALL 0000000000000000     FFFFFFFF7FFF9CA8 ?
                                                   00000000A ? 000000000 ?
                                                   FFFFFFFF7B6396F8 ?
                                                   106537000 ? 10652B000 ?
pfrinstr_FTCHC()+18  CALL     pevm_BFTCHC()        000000000 ? 105AE7600 ?
0                                                  555E62580 ?
                                                   FFFFFFFF7C069EE8 ?
                                                   FFFFFFFF7B6396F8 ?
                                                   000000000 ?
pfrrun_no_tool()+72  PTR_CALL 0000000000000000     000000000 ? 000000000 ?
                                                   FFFFFFFF7C069F50 ?
                                                   FFFFFFFF7C069EE8 ?
                                                   0000001EE ? 555E62892 ?
pfrrun()+832         CALL     pfrrun_no_tool()     FFFFFFFF7C069EE8 ?
                                                   555E6288E ?
                                                   FFFFFFFF7C069F50 ?
                                                   105B1BF50 ? 000002001 ?
                                                   000002001 ?
plsql_run()+696      CALL     pfrrun()             FFFFFFFF7C035420 ?
                                                   FFFFFFFF7C069EE8 ?
                                                   000002001 ? 000200000 ?
                                                   FFFFFFFF7C069EE8 ?
                                                   0001056DE ?
peicnt()+260         CALL     plsql_run()          000000006 ? 000000000 ?
                                                   FFFFFFFF7B63BBF8 ?
                                                   FFFFFFFF7FFF9888 ?
                                                   000000180 ? 000000007 ?
kkxexe()+616         CALL     peicnt()             FFFFFFFF7FFFA808 ?
                                                   10652B298 ? 106541CE0 ?
                                                   106762258 ? 10652B000 ?
                                                   10652B000 ?
opiexe()+12736       CALL     kkxexe()             FFFFFFFF7B63F4B8 ?
                                                   106537000 ? 000106537 ?
                                                   FFFFFFFF7FFF9CA8 ?
                                                   000000000 ? 54C0616F0 ?
kpoal8()+1912        CALL     opiexe()             000106400 ?
                                                   FFFFFFFF7C056EC0 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 57FDB9250 ?
opiodr()+1548        PTR_CALL 0000000000000000     0BFFFFC00 ? 000040008 ?
                                                   000000000 ? 000000820 ?
                                                   000105800 ? 106538260 ?
ttcpip()+1284        PTR_CALL 0000000000000000     10576DC08 ? 00000005E ?
                                                   10652B000 ? 000000001 ?
                                                   FFFFFFFF7C03A830 ?
                                                   00010652D ?
opitsk()+1432        CALL     ttcpip()             000000017 ?
                                                   FFFFFFFF7FFFCFB0 ?
                                                   1056C3F6C ? 1056C1750 ?
                                                   000000000 ? 10652B118 ?
opiino()+1128        CALL     opitsk()             106538268 ? 000000001 ?
                                                   000000000 ? 106538260 ?
                                                   1058884D0 ? 0FFFFFFFD ?
opiodr()+1548        PTR_CALL 0000000000000000     000106400 ? 10652DB98 ?
                                                   000106400 ? 10652D000 ?
                                                   000106400 ? 106538260 ?
opidrv()+896         CALL     opiodr()             1065373D8 ? 00000003C ?
                                                   000106400 ? 1065381E0 ?
                                                   000106538 ? 00010652D ?
sou2o()+80           CALL     opidrv()             10653A960 ? 000000000 ?
                                                   00000003C ? 106537698 ?
                                                   00000003C ? 000000000 ?
opimai_real()+124    CALL     sou2o()              FFFFFFFF7FFFF708 ?
                                                   00000003C ? 000000004 ?
                                                   FFFFFFFF7FFFF730 ?
                                                   105E12000 ? 000105E12 ?
main()+152           CALL     opimai_real()        000000002 ?
                                                   FFFFFFFF7FFFF808 ?
                                                   104054D6C ? 1064D3220 ?
                                                   00247E3B4 ? 000014800 ?
_start()+380         CALL     main()               000000002 ? 000000008 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFFF818 ?
                                                   FFFFFFFF7FFFF928 ?
                                                   FFFFFFFF7D500200 ?

经过和MOS确认,认为是apply了Patch 5165885后引起的新问题:

I have checked our internal bug database and issue seems to be occuring due to fix for Bug.5165885.

Action plan:
=============

Apply patch for 6678154

https://updates.oracle.com/download/6678154.html

Workaround:
————–

Remove the patch for 5165885 .

Yes, Symptoms are pointing finger towards this bug. I would recommend to apply the patch rather than going for workarounds.

这个case目前实施了补丁6678154,仍在观察期。
录以记之!

ORA-00600:[qctcte1]内部错误一例

一套AIX上的4节点10.2.0.4 RAC系统在1月份出现实例hang住的现象,并伴随有ORA-00600:[qctcte1]内部错误,trace文件内容如下:

siposrc1_ora_102944.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /oracle/product/10.2.0/db_1
System name: AIX
Node name: jszydb1
Release: 3
Version: 5
Machine: 00CE31834C00
Instance name: siposrc1
Redo thread mounted by this instance: 1
Oracle process number: 34
Unix process pid: 102944, image: oracle@jszydb1

*** ACTION NAME:() 2010-01-18 15:53:11.530
*** MODULE NAME:(JDBC Thin Client) 2010-01-18 15:53:11.530
*** SERVICE NAME:(siposrc) 2010-01-18 15:53:11.530
*** SESSION ID:(2175.6953) 2010-01-18 15:53:11.530
*** 2010-01-18 15:53:11.530
ksedmp: internal or fatal error
ORA-00600: 内部错误代码, 参数: [qctcte1], [0], [], [], [], [], [], []
Current SQL statement for this session:
SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false') NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),:"SYS_B_00000"), NVL(SUM(C2),:"SYS_B_00001") FROM (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("SIPO_ECLA$_TEMP") FULL("SIPO_ECLA$_TEMP") NO_PARALLEL_INDEX("SIPO_ECLA$_TEMP") */ :"SYS_B_00002" AS C1, CASE WHEN "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00003" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00004" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00005" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00006" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00007" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00008" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00009" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00010" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00011" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00012" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00013" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00014" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00015" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00016" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00017" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00018" OR "SIPO_ECLA
.

.
$_TEMP"."SEQ_ID"=:"SYS_B_40000" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_40001" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_40002" THEN :"SYS_B_40003" ELSE :"SYS_B_40004" END AS C2 FROM "SIPO_ECLA$_TEMP" SAMPLE BLOCK (:"SYS_B_40005" , :"SYS_B_40006") SEED (:"SYS_B_40007") "SIPO_ECLA$_TEM

STACK
qctcte qctocssm qctcopn qctcopn xtyxcssr xtyopncb qctcopn qctcpqb <- qctcpqbl <- xtydrv <- opitca <- kksFullTypeCheck <- rpiswu2 <- kksSetBindType <- kksfbc <- opiexe <- opiall0 <- opikpr <- opiodr <- rpidrus <- skgmstack <- rpidru <- rpiswu2 <- kprball <- IPRA <- IPRA <- kkedsSel <- kkecdn <- kkotap <- kkoiqb <- kkooqb <- kkoqbc <- apakkoqb <- apaqbdDescendents <- 3d4 <- apaqbdListReverse <- 06c <- apaqbd <- apadrv <- opitca <- kksLoadChild <- kxsGetRuntimeLock <- 810 <- kksfbc <- kkspsc0 <- kksParseCursor <- opiosq0 <- opiall0 <- opial7 <- opiodr <- ttcpip <- opitsk <- opiino <- opiodr <- opidrv

提交SR,MOS认为可能是Bug 6666870,给出以下方案:
There are a large number of possible bugs but Bug 6666870 is the most likely culprit. There is no one off patch available. 10205 is not yet available.

ACTION PLAN:
=============
1.Please disable Cost Based Transformation as a workaround, this can be done in the init.ora/spfile or at the session level, for example:
SQL> alter session set "_optimizer_cost_based_transformation"=off ;

2.Apply the 10205 patch set when it becomes available.

3.If the optimiser change fails to resolve your issue, please advise us of any recent changes to your DB or server?

4.In particular, did you recently apply the CPUJAN2008 Patch?
If so, please see Note.558901.1 Ext/Mod ORA-00600 internal error code, arguments [qctcte1] After Applying CPUJAN2008 Patch

5.If there is a function based index involved, please see;
Note.788124.1 Ext/Pub ORA-00600 [qctcte1] With Function Based Index Access

6.Changing your code to eliminate the parallel clauses may also act as a workaround.

ddl操作是否会产生undo?

ddl是否会产生undo?
这可能是每一个初学Oracle的人都会有的疑问;ddl操作又不能rollback回滚,要什么undo数据呢?
事实是几乎每个ddl操作都会产生undo,我们来探究一下:

SQL> select vs.name, ms.value
  2    from v$mystat ms, v$sysstat vs
  3   where ms.statistic# = vs.statistic#
  4     and name = 'undo change vector size';
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
undo change vector size                                                   0

SQL> create table YOUYUS (t1 int);
Table created.

SQL>   select vs.name, ms.value
  2      from v$mystat ms, v$sysstat vs
  3     where ms.statistic# = vs.statistic#
  4       and name = 'undo change vector size';
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
undo change vector size                                                1992

/*create table的ddl语句产生了大约1992 bytes的撤销变化向量*/

SQL> drop table YOUYUS;
Table dropped.

SQL>   select vs.name, ms.value
  2      from v$mystat ms, v$sysstat vs
  3     where ms.statistic# = vs.statistic#
  4       and name = 'undo change vector size';
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
undo change vector size                                                4528

/* drop table语句产生2563 bytes的undo数据,多于create table;我们可以猜测create table时Oracle需要向基表中insert数据,而drop table时则需要delete/update数据,显然后者产生更多的undo*/

/*我们尝试创建一个由254个列组成的表*/

SQL>     select vs.name, ms.value
  2        from v$mystat ms, v$sysstat vs
  3       where ms.statistic# = vs.statistic#
  4         and name = 'undo change vector size';
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
undo change vector size                                                   0

create table YOUYUS (
t1 int,
t2 char(4) default 'oooo',
t3 char(4) default 'oooo',
t4 char(4) default 'oooo',
t5 char(4) default 'oooo',
t6 char(4) default 'oooo',
t7 char(4) default 'oooo',
t8 char(4) default 'oooo',
t9 char(4) default 'oooo',
............................
t248 char(4) default 'oooo',
t249 char(4) default 'oooo',
t250 char(4) default 'oooo',
t251 char(4) default 'oooo',
t252 char(4) default 'oooo',
t253 char(4) default 'oooo',
t254 char(4) default 'oooo'
);

SQL>     select vs.name, ms.value
  2        from v$mystat ms, v$sysstat vs
  3       where ms.statistic# = vs.statistic#
  4         and name = 'undo change vector size';
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
undo change vector size                                               85832

/*产生了83k的undo,ddl所产生的undo量视乎其所要维护数据字典的操作类型和操作量*/

SQL> oradebug setmypid;
Statement processed.

SQL> oradebug event 10046 trace name context forever,level 1;
Statement processed.

SQL> drop table YOUYUS;
Table dropped.

SQL>     select vs.name, ms.value
  2        from v$mystat ms, v$sysstat vs
  3       where ms.statistic# = vs.statistic#
  4         and name = 'undo change vector size';
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
undo change vector size                                              214020

/*drop 产生了125k的undo*/

SQL> oradebug tracefile_name;
/home/maclean/app/maclean/diag/rdbms/prod/PROD/trace/PROD_ora_5433.trc

/* 我们来看看drop table 到底做了哪些递归操作? */

[maclean@rh2 ~]$ cat PROD_ora_5433.trc|egrep "delete|update"
         'Need use delete_topo_geometry_layer() to deregister table '
select decode(u.type#, 2, u.ext_username, u.name), o.name,        t.update$, t.insert$, t.delete$, t.enabled,        decode(bitand(t.property, 8192),8192, 1, 0),        decode(bitand(t.property, 65536), 65536, 1, 0),       decode(bitand(t.property, 131072), 131072, 1, 0),       (select o.name from obj$ o          where o.obj# = u.spare2 and o.type# =57)  from sys.obj$ o, sys.user$ u, sys.trigger$ t, sys.obj$ bo where t.baseobject=bo.obj# and bo.name = :1 and bo.spare3 = :2  and bo.namespace = 1  and t.obj#=o.obj# and o.owner#=u.user#  and o.type# = 12 and bitand(property,16)=0 and bitand(property,8)=0  order by o.obj#
delete from object_usage where obj# in  (select a.obj# from object_usage a, ind$ b where  a.obj# = b.obj# and b.bo# = :1)
delete from sys.cache_stats_1$ where dataobj# = :1
delete com$ where obj#=:1
delete from hist_head$ where obj# = :1
delete from dependency$ where d_obj#=:1
delete from source$ where obj#=:1
delete from compression$ where obj#=:1
   m_stmt:='delete from sdo_geor_ddl__table$$ where id=2';
   m_stmt:='delete from sdo_geor_ddl__table$$';
delete from sdo_geor_ddl__table$$ where id=2
delete from col$ where obj#=:1
delete from icol$ where bo#=:1
delete from icoldep$ where obj# in (select obj# from ind$ where bo#=:1)
delete from jijoin$ where obj# in ( select obj# from jijoin$ where tab1obj# = :1 or tab2obj# = :1)
delete from jirefreshsql$ where iobj# in ( select iobj# from jirefreshsql$ where tobj# = :1)
delete from ccol$ where obj#=:1
delete from ind$ where bo#=:1
delete from cdef$ where obj#=:1
delete ecol$ where tabobj# = :1
delete from tab$ where obj#=:1
delete from idl_ub1$ where obj#=:1 and part=:2
delete from idl_char$ where obj#=:1 and part=:2
delete from idl_ub2$ where obj#=:1 and part=:2
delete from idl_sb4$ where obj#=:1 and part=:2
delete from ncomp_dll$ where obj#=:1 returning dllname into :2
delete from idl_ub1$ where obj#=:1 and part=:2
delete from idl_char$ where obj#=:1 and part=:2
delete from idl_ub2$ where obj#=:1 and part=:2
delete from idl_sb4$ where obj#=:1 and part=:2
delete from ncomp_dll$ where obj#=:1 returning dllname into :2
delete from idl_ub1$ where obj#=:1 and part=:2
delete from idl_char$ where obj#=:1 and part=:2
delete from idl_ub2$ where obj#=:1 and part=:2
delete from idl_sb4$ where obj#=:1 and part=:2
delete from ncomp_dll$ where obj#=:1 returning dllname into :2
delete from col$ where obj#=:1
delete coltype$ where obj#=:1
delete from subcoltype$ where obj#=:1
delete ntab$ where obj#=:1
delete lob$ where obj#=:1
delete refcon$ where obj#=:1
delete from opqtype$ where obj#=:1
delete from cdef$ where obj#=:1
delete from objauth$ where obj#=:1
delete from obj$ where obj# = :1
update seg$ set type#=:4,blocks=:5,extents=:6,minexts=:7,maxexts=:8,extsize=:9,extpct=:10,user#=:11,iniexts=:12,lists=decode(:13, 65535, NULL, :13),groups=decode(:14, 65535, NULL, :14), cachehint=:15, hwmincr=:16, spare1=DECODE(:17,0,NULL,:17),scanhint=:18, bitmapranges=:19 where ts#=:1 and file#=:2 and block#=:3
update seg$ set type#=:4,blocks=:5,extents=:6,minexts=:7,maxexts=:8,extsize=:9,extpct=:10,user#=:11,iniexts=:12,lists=decode(:13, 65535, NULL, :13),groups=decode(:14, 65535, NULL, :14), cachehint=:15, hwmincr=:16, spare1=DECODE(:17,0,NULL,:17),scanhint=:18, bitmapranges=:19 where ts#=:1 and file#=:2 and block#=:3
delete from seg$ where ts#=:1 and file#=:2 and block#=:3

/*如果ddl操作执行失败又会如何呢?*/

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug event 10046 trace name context forever,level 1;
Statement processed.

SQL>     select vs.name, ms.value
  2        from v$mystat ms, v$sysstat vs
  3       where ms.statistic# = vs.statistic#
  4         and name = 'undo change vector size';
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
undo change vector size                                                   0

SQL> drop table YOUYUS;
drop table YOUYUS
           *
ERROR at line 1:
ORA-00942: table or view does not exist

SQL>     select vs.name, ms.value
  2        from v$mystat ms, v$sysstat vs
  3       where ms.statistic# = vs.statistic#
  4         and name = 'undo change vector size';
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
undo change vector size                                                 264
/*同样产生了undo,量较少*/


SQL> oradebug tracefile_name;
/home/maclean/app/maclean/diag/rdbms/prod/PROD/trace/PROD_ora_5494.trc
[maclean@rh2 trace]$ cat PROD_ora_5494.trc|egrep "update|insert|delete"
         'Need use delete_topo_geometry_layer() to deregister table '
   m_stmt:='insert into sdo_geor_ddl__table$$ values (1)';
   m_stmt:='insert into sdo_geor_ddl__table$$ values (2)';
insert into sdo_geor_ddl__table$$ values (2)
   m_stmt:='delete from sdo_geor_ddl__table$$';
delete from sdo_geor_ddl__table$$

/*执行少量递归操作后,Oracle发现所要drop的对象并不存在,将会rollback之前的"部分"递归dml操作*/

其实我们可以把ddl操作分解为以下步骤:

begin
commit;
--编译ddl
begin
--实现ddl,包括一系列递归的数据字典维护操作及其他操作
commit;
exception
when others then
rollback;
end;
end;

ddl操作无需也不允许手动commit或rollback参与,但这并不代表ddl操作不产生undo。

ora-600 [17182]错误一例

这是一套古老的系统,SUNOS 5.8,Oracle 8.1.7.4。最近老革命途遇新问题,告警日志烽烟掠起:

Errors in file /u01/app/oracle/admin/CULPRODB/udump/culprodb_ora_7913.trc:
ORA-00600: internal error code, arguments: [17182], [32438472], [], [], [], [], [], []
Thu Jul 15 16:19:29 2010
Errors in file /u01/app/oracle/admin/CULPRODB/udump/culprodb_ora_7913.trc:
ORA-00600: internal error code, arguments: [17182], [32438472], [], [], [], [], [], []
Thu Jul 15 16:19:30 2010
Errors in file /u01/app/oracle/admin/CULPRODB/udump/culprodb_ora_7913.trc:
ORA-00600: internal error code, arguments: [17182], [32438472], [], [], [], [], [], []

如果你像我一样对600着迷,那么点击这里欣赏一下这个trace文件。报错期间运行的SQL及调用栈信息:

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [17182], [32438472], [], [], [], [], [], []
Current SQL statement for this session:
select * from olsuser.cardmaster where cm_card_no between '2336330010201570013' and '2336330010201580004' union
select * from olsuser.cardmaster where cm_card_no between '2336330012402300018' and '2336330012402310009' union
select * from olsuser.cardmaster where cm_card_no between '2336330052400220016' and '2336330052400230007' union
select * from olsuser.cardmaster where cm_card_no between '2336330015103900012' and '2336330015138100032' union
select * from olsuser.cardmaster where cm_card_no between '2336330055100910018' and '2336330055100920009'
----- Call Stack Trace -----
calling                   call     entry
location                  type     point
--------------------      -------- --------------------
ksedmp()+220              CALL     ksedst()+0
kgeriv()+268              PTR_CALL 0000000000000000
kgesiv()+140              CALL     kgeriv()+0
kgesic1()+32              CALL     kgesiv()+0
kghfrf()+204              CALL     kgherror()+0
kkscls()+1592             CALL     kghfrf()+0
opicca()+248              CALL     kkscls()+0
opiclo()+8                CALL     opicca()+0
kpoclsa()+60              CALL     opiclo()+0
opiodr()+2540             PTR_CALL 0000000000000000
ttcpip()+5676             PTR_CALL 0000000000000000
opitsk()+2408             CALL     ttcpip()+0
opiino()+2080             CALL     opitsk()+0
opiodr()+2540             PTR_CALL 0000000000000000
opidrv()+1656             CALL     opiodr()+0
sou2o()+16                CALL     opidrv()+0
main()+172                CALL     sou2o()+0
_start()+380              CALL     main()+0
/*8.1.7中stack trace还附带着寄存器信息,但我们可读不懂:)  */

opicca->kkscls->kghfrf->kgherror(heap层报错)->kgesic1。问题主要发生在调用kghfrf函数的时候,《famous summary stack trace from Oracle Version 8.1.7.4.0 Bug Note》 一文罗列了Oracle的一些stack summary,其中kghfrx函数的作用是”Free extent. This is called when a heap is unpinned to request that it”;可以猜测kghfrf函数是用来释放某种内存结构的。在MOS上输入”kghfrf 8.1.7.4″关键词,可以找到Note 291936.1:

ORA-00600 [17182] on Oracle 8.1.7.4.0 After a CTRL-C or Client Termination
Applies to:
Oracle Server – Enterprise Edition – Version: 8.1.7.4
This problem can occur on any platform.
Checked for relevance on 06-Mar-2007

Oracle RDBMS Server Versions prior to 9i
Symptoms
1. Intermittent heap corruptions errors like ORA-00600 [17182] are reported in the alert.log file.

2. There is no impact to the database other than the process which encounters the errors getting killed.

3. From the trace file generated for this ORA-00600 error, check if the top few functions are :

kgherror kghfrf kkscls opicca

Cause
If the trace file shows that kkscls calls kghfrf, then it is related to:

Bug 2281320 — ORA-600[17182] POSSIBLE AFTER CTRL-C OR CLIENT DEATH
Solution
The problem is when we call kghfrf to free a chunk of memory, we expect that this chunk to have been allocated from the Heap Memory and hence have a valid header, although internally we have used Frame Memory managed chunk. As a result, kghfrf errors out with the “Bagic Magic Number” in the Memory Chunk header error message.

If you are running Oracle 8174, encounter this ORA-00600 [17182], and the call stack indicates the following functions { kgherror kghfrf kkscls }, then download and apply Patch 2281320 from MetaLink.

This issue has been fixed in Oracle Server 8.1.7.5 and later versions.

Note 2281320.8 is not limited to dblinks and can occur during normal database operation as well.

该文档叙述描述在9i以前版本中可能因堆损坏而出现该ORA-00600 [17182]错误,该错误不会导致致命问题或数据库损坏,最坏的情况是遭遇该错误的服务进程被杀死。与该问题匹配的主要依据是stack trace为kgherror kghfrf kkscls opicca,同我们的实际情况一致。可以通过打上one-off patch 2281320或者升级到8.1.7.5来避免该内部错误的发生,当然也可以置之不理,显然它不会造成太大的麻烦。
此外kghfrf函数用以释放内存chunk,Oracle development起初以为所有这些可能被释放的chunk都是从堆内存中分配而来,因此都该有一个有效的header;而实际上它们可能是以帧式内存管理的chunk。kghfrf因读取到这种chunk header中的错误幻数(Bagic Magic Number)而误入歧途了。

如何公开Oracle trace文件?

隐式参数_trace_files_public决定了Oracle产生的trace文件是否公开,该参数默认值为FALSE,也就是非DBA/OINSTALL组的用户是没有权限读取数据库产生的trace文件的;在某些场合中我们需要让非DBA组的用户也能访问trace文件,就可以通过修改该参数实现。请看下面的例子:

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> col name for a20
SQL> col value for a20
SQL> col describ for a40
SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
  2   FROM SYS.x$ksppi x, SYS.x$ksppcv y
  3   WHERE x.inst_id = USERENV ('Instance')
  4   AND y.inst_id = USERENV ('Instance')
  5   AND x.indx = y.indx
  6  AND x.ksppinm LIKE '%_trace_files_public%'
  7  order by x.ksppinm;

NAME                 VALUE                DESCRIB
-------------------- -------------------- ----------------------------------------
_trace_files_public  FALSE                Create publicly accessible trace files

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug ipc;
Information written to trace file.
SQL> oradebug tracefile_name;
/s01/10gdb/admin/YOUYUS/udump/youyus_ora_10268.trc
SQL> !ls -l /s01/10gdb/admin/YOUYUS/udump/youyus_ora_10268.trc
-rw-r----- 1 maclean oinstall 4206 Aug 11 20:51 /s01/10gdb/admin/YOUYUS/udump/youyus_ora_10268.trc
/*所产生的trace文件权限为640,非oinstall组用户无权限读取该文件*/

SQL> alter system set "_trace_files_public"=true;
alter system set "_trace_files_public"=true
                 *
ERROR at line 1:
ORA-02095: specified initialization parameter cannot be modified
/*修改该参数需要重启实例*/

SQL> alter system set "_trace_files_public"=true scope=spfile;

System altered.

SQL> startup force;
ORACLE instance started.

Total System Global Area 1577058304 bytes
Fixed Size                  2084264 bytes
Variable Size             922747480 bytes
Database Buffers          637534208 bytes
Redo Buffers               14692352 bytes
Database mounted.
Database opened.

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug ipc;
Information written to trace file.
SQL> oradebug tracefile_name;
/s01/10gdb/admin/YOUYUS/udump/youyus_ora_10430.trc
SQL> ! ls -l /s01/10gdb/admin/YOUYUS/udump/youyus_ora_10430.trc
-rw-r--r-- 1 maclean oinstall 5471 Aug 11 20:54 /s01/10gdb/admin/YOUYUS/udump/youyus_ora_10430.trc
/*other组用户也具有了读权限*/

SQL> ! ls -l /s01/10gdb/admin/YOUYUS/
total 24
drwxr-x--- 2 maclean oinstall 4096 Aug 11 20:56 adump
drwxr-x--- 2 maclean oinstall 4096 Aug 11 20:54 bdump
drwxr-x--- 2 maclean oinstall 4096 Aug  5 21:35 cdump
drwxr-x--- 2 maclean oinstall 4096 Aug  5 21:36 dpdump
drwxr-x--- 2 maclean oinstall 4096 Aug  5 21:37 pfile
drwxr-x--- 2 maclean oinstall 4096 Aug 11 20:54 udump
/*请注意修改_trace_files_public为true,并不会修改trace所在目录的权限,Oracle默认建立bdump/udump等trace目录时分配的权限为750,other组用户无法进入这些目录,需要修改目录权限为755,即o+r+x*/

SQL> ! chmod o+r+x /s01/10gdb/admin/YOUYUS/*dump

SQL>  ! ls -l /s01/10gdb/admin/YOUYUS/
total 24
drwxr-xr-x 2 maclean oinstall 4096 Aug 11 20:56 adump
drwxr-xr-x 2 maclean oinstall 4096 Aug 11 20:54 bdump
drwxr-xr-x 2 maclean oinstall 4096 Aug  5 21:35 cdump
drwxr-xr-x 2 maclean oinstall 4096 Aug  5 21:36 dpdump
drwxr-x--- 2 maclean oinstall 4096 Aug  5 21:37 pfile
drwxr-xr-x 2 maclean oinstall 4096 Aug 11 20:54 udump

/*需要注意的另一点是修改_trace_files_public参数并不会引起既有的trace文件的权限被修改,典型的例子是alert log告警日志*/
[maclean@rh2 bdump]$ ls -l
total 20
-rw-r----- 1 maclean oinstall 12971 Aug 11 21:17 alert_YOUYUS.log
-rw-r--r-- 1 maclean oinstall   690 Aug 11 21:12 youyus_lgwr_10514.trc

SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
  2   FROM SYS.x$ksppi x, SYS.x$ksppcv y
  3   WHERE x.inst_id = USERENV ('Instance')
  4   AND y.inst_id = USERENV ('Instance')
  5   AND x.indx = y.indx
  6  AND x.ksppinm LIKE '%_trace_files_public%'
  7  order by x.ksppinm;

NAME                 VALUE                DESCRIB
-------------------- -------------------- ----------------------------------------
_trace_files_public  FALSE                Create publicly accessible trace files

SQL> alter system set "_trace_files_public"=true scope=spfile;

System altered.

SQL> startup force;
ORACLE instance started.

Total System Global Area 1577058304 bytes
Fixed Size                  2084264 bytes
Variable Size             922747480 bytes
Database Buffers          637534208 bytes
Redo Buffers               14692352 bytes
Database mounted.
Database opened.
SQL> !ls -l
total 32
-rw-r----- 1 maclean oinstall 21189 Aug 11 21:20 alert_YOUYUS.log
-rw-r--r-- 1 maclean oinstall   690 Aug 11 21:12 youyus_lgwr_10514.trc
-rw-r--r-- 1 maclean oinstall   690 Aug 11 21:20 youyus_lgwr_11136.trc

UNION ALL returning wrong results?

有应用人员反映某套Linux上的11.2.0.1数据库系统中出现了UNION ALL后返回的结果集不正确的问题,我们具体分析下出现问题的其中一条语句:

SELECT MTL_SECONDARY_INVENTORIES.SECONDARY_INVENTORY_NAME,
       MTL_SECONDARY_INVENTORIES.ORGANIZATION_ID,
       MTL_SECONDARY_INVENTORIES.DESCRIPTION,
       MTL_SECONDARY_INVENTORIES.AVAILABILITY_TYPE,
       MTL_SECONDARY_INVENTORIES.MATERIAL_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.MATERIAL_OVERHEAD_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.RESOURCE_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.OVERHEAD_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.OUTSIDE_PROCESSING_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.ASSET_INVENTORY,
       MTL_SECONDARY_INVENTORIES.EXPENSE_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.ENCUMBRANCE_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.ATTRIBUTE3,
       MTL_SECONDARY_INVENTORIES.ATTRIBUTE5,
       WORKFLOW_START_TIMES.WORKFLOW_START_TIME
  FROM REPEMEAERP.MTL_SECONDARY_INVENTORIES,
       REPEMEAERP.WORKFLOW_START_TIMES
 WHERE MTL_SECONDARY_INVENTORIES.DW_UPDATE_DT >
       TO_DATE('01/01/1900 00:00:00', 'MM/DD/YYYY HH24:MI:SS')
   AND MTL_SECONDARY_INVENTORIES.DW_UPDATE_DT <=
       WORKFLOW_START_TIMES.WORKFLOW_START_TIME
   AND WORKFLOW_START_TIMES.WORKFLOW_NAME =
       LTRIM(RTRIM('w_int_FreqBatch_EMEA'))
/*以上是QUERY A*/
UNION ALL
/*以下是QUERY B*/
SELECT DISTINCT 'WORKORDERS',
                MTL_SECONDARY_INVENTORIES.ORGANIZATION_ID,
                'WORK ORDERS WITH WIP AS CATEGORY VALUE',
                1,
                0,
                0,
                0,
                0,
                0,
                1,
                0,
                0,
                'MOI',
                '0',
                WORKFLOW_START_TIMES.WORKFLOW_START_TIME
  FROM REPEMEAERP.MTL_SECONDARY_INVENTORIES, EIMMAINT.WORKFLOW_START_TIMES
 WHERE MTL_SECONDARY_INVENTORIES.DW_UPDATE_DT >
       TO_DATE('01/01/1900 00:00:00', 'MM/DD/YYYY HH24:MI:SS')
   AND MTL_SECONDARY_INVENTORIES.DW_UPDATE_DT <=
       WORKFLOW_START_TIMES.WORKFLOW_START_TIME
   AND WORKFLOW_START_TIMES.WORKFLOW_NAME =
       LTRIM(RTRIM('w_int_FreqBatch_EMEA'))
/
138 rows selected.

以上查询语句中,QUERY A部分(也就是UNION ALL之前的SELECT语句)单独查询时返回返回69条记录,QUERY B部分单独查询时返回15记录,UNION ALL后返回的结果却是138条记录,而非84条记录。实际上这套系统也是最近才从10g迁移到11gr2上,之前在10g中同样的应用没有出过类似的问题,可以猜测是11g中新引入的某种特性存在可能引发wrong result的Bug。

具体思路虽然有了,但仍无法确定问题的关键所在;我们来看看该SQL的执行计划:

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                          | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                               |     7 |  2443 |    52   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                   |                               |     7 |  2443 |    52   (0)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL             | WORKFLOW_START_TIMES          |     1 |    29 |    48   (0)| 00:00:01 |
|   3 |   VIEW                          | VW_JF_SET$9BAED2EA            |     1 |   320 |     4   (0)| 00:00:01 |
|   4 |    UNION ALL PUSHED PREDICATE   |                               |       |       |            |          |
|*  5 |     FILTER                      |                               |       |       |            |          |
|   6 |      TABLE ACCESS BY INDEX ROWID| MTL_SECONDARY_INVENTORIES     |     3 |   336 |     2   (0)| 00:00:01 |
|*  7 |       INDEX RANGE SCAN          | IDX_MTL_SECONDARY_INVENTORIES |     1 |       |     1   (0)| 00:00:01 |
|*  8 |     FILTER                      |                               |       |       |            |          |
|   9 |      TABLE ACCESS BY INDEX ROWID| MTL_SECONDARY_INVENTORIES     |     3 |    36 |     2   (0)| 00:00:01 |
|* 10 |       INDEX RANGE SCAN          | IDX_MTL_SECONDARY_INVENTORIES |     1 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("WORKFLOW_START_TIMES"."WORKFLOW_NAME"='w_int_FreqBatch_EMEA')
5 - filter(TO_DATE(' 1900-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss')<"WORKFLOW_START_TIMES"."WORKFLOW_START_TIME") 7 - access("MTL_SECONDARY_INVENTORIES"."DW_UPDATE_DT">TO_DATE(' 1900-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "MTL_SECONDARY_INVENTORIES"."DW_UPDATE_DT"<="WORKFLOW_START_TIMES"."WORKFLOW_START_TIME"
)
8 - filter(TO_DATE(' 1900-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss')<"WORKFLOW_START_TIMES"."WORKFLOW_START_TIME") 10 - access("MTL_SECONDARY_INVENTORIES"."DW_UPDATE_DT">TO_DATE(' 1900-01-01 00:00:00', 'syyyy-mm-dd
hh24:mi:ss') AND "MTL_SECONDARY_INVENTORIES"."DW_UPDATE_DT"<="WORKFLOW_START_TIMES"."WORKFLOW_START_TIME"
)

你可能从以上执行计划中发现了两处十分陌生的字眼:UNION ALL  PUSHED PREDICATE和VW_JF_SET$。它们是什么!?

先来说说JF,JF是join factorization的缩写,你可以把它翻译作链接因式分解,如果你学过离散数学或者数据库原理的话,那么这种在11.2.0.1中最新推出的基于成本的变换操作对你来说并不陌生。用公式的样式来表达大概是下面这样:

YYA,YYB和YYC是3个关联的数据对象亦或者是3个关联的结果集;
(YYA JOIN YYB) UNION [ALL] (YYA JOIN YYC)
可以转换成为:
YYA JOIN (YYB UNION [ALL] YYC)

这样做YYA部分只需要读取一次,还可以少做一次JOIN,听上去是挺不错的吧!
下面我们来看一个Oracle使用join factorization的十分简单的实例:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for Linux: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

SQL> drop table yya;

drop table yya

           *

ERROR at line 1:

ORA-00942: table or view does not exist

SQL> drop table yyb;

drop table yyb

           *

ERROR at line 1:

ORA-00942: table or view does not exist

SQL> create table yya as select rownum id1,rownum id2,rownum id3 from dual connect by level<=20000;
Table created.
SQL> create table yyb as select rownum id1,rownum id2,rownum id3 from dual connect by level<=20000;
Table created.

SQL> explain plan for
2  select * from yya ,yyb where yya.id1=yyb.id1
3  union all
4  select * from yya, yyb where yya.id1=yyb.id1;

Explained.

SQL> set linesize 100 pagesize 1400;

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 744914999

-------------------------------------------------------------------------------------------
| Id  | Operation            | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                    | 40000 |  2500K|    49   (3)| 00:00:01 |
|*  1 |  HASH JOIN           |                    | 40000 |  2500K|    49   (3)| 00:00:01 |
|   2 |   TABLE ACCESS FULL  | YYA                | 20000 |   234K|    16   (0)| 00:00:01 |
|   3 |   VIEW               | VW_JF_SET$6E3F6682 | 40000 |  2031K|    32   (0)| 00:00:01 |
|   4 |    UNION-ALL         |                    |       |       |            |          |
|   5 |     TABLE ACCESS FULL| YYB                | 20000 |   761K|    16   (0)| 00:00:01 |
|   6 |     TABLE ACCESS FULL| YYB                | 20000 |   761K|    16   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("YYA"."ID1"="ITEM_1")

/*执行计划中出现了VW_JF_SET$F22B2A93,Oracle选择了使用join factorization,该执行计划总成本49*/

SQL> alter session set "_optimizer_join_factorization"=false;

Session altered.

/*隐藏参数_optimizer_join_factorization决定了优化器是否可以选用join factorization,现在我们禁用它*/
SQL> explain plan for
  2  select * from yya join yyb on yya.id1=yyb.id1
  3  union all
  4  select * from yya join yyb on yya.id1=yyb.id1;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3439541885

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      | 40000 |  1992K|    66  (52)| 00:00:01 |
|   1 |  UNION-ALL          |      |       |       |            |          |
|*  2 |   HASH JOIN         |      | 20000 |   996K|    33   (4)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| YYA  | 20000 |   234K|    16   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| YYB  | 20000 |   761K|    16   (0)| 00:00:01 |
|*  5 |   HASH JOIN         |      | 20000 |   996K|    33   (4)| 00:00:01 |
|   6 |    TABLE ACCESS FULL| YYA  | 20000 |   234K|    16   (0)| 00:00:01 |
|   7 |    TABLE ACCESS FULL| YYB  | 20000 |   761K|    16   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("YYA"."ID1"="YYB"."ID1")
   5 - access("YYA"."ID1"="YYB"."ID1")
/*禁用链接因式分解后,Oracle使用了常规的"笨办法",成本上升到66*/

/*有趣的是下面的测试*/

SQL> alter session set "_optimizer_join_factorization"=true;

Session altered.

SQL> create table yyc as select * from yyb;

Table created.

SQL> explain plan for
  2  select * from yya,yyc where yya.id1=yyc.id1
  3  union all
  4  select * from yya,yyb where yya.id1=yyb.id1;

Explained.

SQL>  select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 4240055274

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      | 40000 |  1992K|    66  (52)| 00:00:01 |
|   1 |  UNION-ALL          |      |       |       |            |          |
|*  2 |   HASH JOIN         |      | 20000 |   996K|    33   (4)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| YYA  | 20000 |   234K|    16   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| YYC  | 20000 |   761K|    16   (0)| 00:00:01 |
|*  5 |   HASH JOIN         |      | 20000 |   996K|    33   (4)| 00:00:01 |
|   6 |    TABLE ACCESS FULL| YYA  | 20000 |   234K|    16   (0)| 00:00:01 |
|   7 |    TABLE ACCESS FULL| YYB  | 20000 |   761K|    16   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("YYA"."ID1"="YYC"."ID1")
   5 - access("YYA"."ID1"="YYB"."ID1")
/*confused,Oracle有什么理由在这里反而不用join factorization了呢?看起来短期内join factorization的实际应用还有待"商榷"
*/

/*10053事件能解释这一问题吗?*/
SQL> alter system flush shared_pool;

System altered.

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug event 10053 trace name context forever,level 1;
Statement processed.
SQL> explain plan for
  2  select * from yya join yyb on yya.id1=yyb.id1
  3  union all
  4  select * from yya join yyc on yya.id1=yyc.id1;

Explained.

SQL> oradebug event 10053 trace name context off;
Statement processed.
SQL> oradebug tracefile_name;
/home/maclean/app/maclean/diag/rdbms/prod/PROD/trace/PROD_ora_7907.trc

view /home/maclean/app/maclean/diag/rdbms/prod/PROD/trace/PROD_ora_7907.trc
***********************************
Cost-Based Join Factorization
***********************************
Join-Factorization on query block SET$1 (#1)
JF: Using search type: exhaustive
JF: Generate basic transformation units
Validating JF unit: (branch: {2, 3} table: {YYA, YYA})
  rejected: join predicates do not match

JF: Generate transformation units from basic units
JF: No state generated.
/*优化器认为其链接谓词不符合使用join  factorization的条件,JF题案被驳回,"悬案"!*/

join factorization是很棒的新技术,这点没错,但新技术往往又是horrible(可怕的),最近我常用这个词。我们的问题是不是这个新来的引起的呢?通过join factorization关键字检索MOS,可以发现一个今年(2010)3月出现的Bug 9504322,quote:

Hdr: 9504322 11.2.0.1 RDBMS 11.2.0.1 QRY OPTIMIZER PRODID-5 PORTID-226
Abstract: WRONG RESULTS WITH UNION_ALL AND INLINE VIEWS

*** 03/24/10 05:38 am ***

PROBLEM:
--------
Wrong results on 11.2 for queries of type:

SELECT * FROM
(
SELECT ... FROM view, table WHERE ...
UNION ALL
SELECT ... FROM view, table WHERE NOT ...
);

DIAGNOSTIC ANALYSIS:
--------------------
Problem seen between 10.2.0.4 and 11.2.0.1.
If we remove the use of inline view the correct results are returned.

WORKAROUND:
-----------
N/A

RELATED BUGS:
-------------

REPRODUCIBILITY:
----------------
It is reproducing on generic 11.2.0.1

呵呵,似乎有点眉目了,不过实践是检验真理的唯一标准:


SQL> alter session set "_optimizer_join_factorization"=true;

Session altered.

SELECT MTL_SECONDARY_INVENTORIES.SECONDARY_INVENTORY_NAME,
       MTL_SECONDARY_INVENTORIES.ORGANIZATION_ID,
       MTL_SECONDARY_INVENTORIES.DESCRIPTION,
       MTL_SECONDARY_INVENTORIES.AVAILABILITY_TYPE,
       MTL_SECONDARY_INVENTORIES.MATERIAL_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.MATERIAL_OVERHEAD_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.RESOURCE_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.OVERHEAD_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.OUTSIDE_PROCESSING_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.ASSET_INVENTORY,
       MTL_SECONDARY_INVENTORIES.EXPENSE_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.ENCUMBRANCE_ACCOUNT,
       MTL_SECONDARY_INVENTORIES.ATTRIBUTE3,
       MTL_SECONDARY_INVENTORIES.ATTRIBUTE5,
       WORKFLOW_START_TIMES.WORKFLOW_START_TIME
  FROM REPEMEAERP.MTL_SECONDARY_INVENTORIES,
       REPEMEAERP.WORKFLOW_START_TIMES
 WHERE MTL_SECONDARY_INVENTORIES.DW_UPDATE_DT >
       TO_DATE('01/01/1900 00:00:00', 'MM/DD/YYYY HH24:MI:SS')
   AND MTL_SECONDARY_INVENTORIES.DW_UPDATE_DT <=
       WORKFLOW_START_TIMES.WORKFLOW_START_TIME
   AND WORKFLOW_START_TIMES.WORKFLOW_NAME =
       LTRIM(RTRIM('w_int_FreqBatch_EMEA'))
/*以上是QUERY A*/
UNION ALL
/*以下是QUERY B*/
SELECT DISTINCT 'WORKORDERS',
                MTL_SECONDARY_INVENTORIES.ORGANIZATION_ID,
                'WORK ORDERS WITH WIP AS CATEGORY VALUE',
                1,
                0,
                0,
                0,
                0,
                0,
                1,
                0,
                0,
                'MOI',
                '0',
                WORKFLOW_START_TIMES.WORKFLOW_START_TIME
  FROM REPEMEAERP.MTL_SECONDARY_INVENTORIES, EIMMAINT.WORKFLOW_START_TIMES
 WHERE MTL_SECONDARY_INVENTORIES.DW_UPDATE_DT >
       TO_DATE('01/01/1900 00:00:00', 'MM/DD/YYYY HH24:MI:SS')
   AND MTL_SECONDARY_INVENTORIES.DW_UPDATE_DT <=
       WORKFLOW_START_TIMES.WORKFLOW_START_TIME
   AND WORKFLOW_START_TIMES.WORKFLOW_NAME =
       LTRIM(RTRIM('w_int_FreqBatch_EMEA'))
/

138 rows selected.

结果和我们猜想的大相径庭,join factorization并非罪魁,找不到终点让我们回到原点。
至此UNION ALL PUSHED PREDICATE有了极大的嫌疑,什么是PUSH PREDICATE?我把它叫做谓词前推,这玩样最早出现在10g上,但一直问题多多!它到底是何种OPERATION呢?让我们来看看下面的例子:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for Linux: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

SQL> create table youyus (t1 int,t2 varchar2(20));

Table created.

SQL> alter table youyus add primary key(t1);

Table altered.

SQL> explain plan for
  2  select *
  3    from youyus
  4  union all
  5  select * from youyus;

Explained.
/*在之后的语句中将用到这个子查询*/
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 1959159425

-----------------------------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |     2 |    50 |     4  (50)| 00:00:01 |
|   1 |  UNION-ALL         |        |       |       |            |          |
|   2 |   TABLE ACCESS FULL| YOUYUS |     1 |    25 |     2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| YOUYUS |     1 |    25 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------
/*在之后的语句中将用到这个子查询,这里它的"原始"执行计划十分简单*/

SQL> explain plan for
  2  select v2.t1, v2.t2
  3    from (select t1 from youyus where rownum=1) v1,
  4         (select *
  5            from youyus
  6          union all
  7          select * from youyus) v2
  8   where v1.t1 = v2.t1;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2456530141

-----------------------------------------------------------------------------------------------
| Id  | Operation                      | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |              |     1 |    27 |     1   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                  |              |     1 |    27 |     1   (0)| 00:00:01 |
|   2 |   VIEW                         |              |     1 |    13 |     1   (0)| 00:00:01 |
|*  3 |    COUNT STOPKEY               |              |       |       |            |          |
|   4 |     INDEX FULL SCAN            | SYS_C0010819 |     1 |    13 |     1   (0)| 00:00:01 |
|   5 |   VIEW                         |              |     1 |    14 |     0   (0)| 00:00:01 |
|   6 |    UNION ALL PUSHED PREDICATE  |              |       |       |            |          |
|   7 |     TABLE ACCESS BY INDEX ROWID| YOUYUS       |     1 |    25 |     0   (0)| 00:00:01 |
|*  8 |      INDEX UNIQUE SCAN         | SYS_C0010819 |     1 |       |     0   (0)| 00:00:01 |
|   9 |     TABLE ACCESS BY INDEX ROWID| YOUYUS       |     1 |    25 |     0   (0)| 00:00:01 |
|* 10 |      INDEX UNIQUE SCAN         | SYS_C0010819 |     1 |       |     0   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(ROWNUM=1)
   8 - access("YOUYUS"."T1"="V1"."T1")
  10 - access("YOUYUS"."T1"="V1"."T1")
/* PUSHED PREDICATE将谓词逻辑前推到UNION ALL的子查询中,其优势在于可以避免全表扫描,利用索引*/

SQL> set linesize 100 pagesize 1400;
SQL>
SQL> explain plan for
  2  select /*+ no_push_pred(v2) */ v2.t1, v2.t2
  3    from (select t1 from youyus where rownum=1) v1,
  4         (select *
  5            from youyus
  6          union all
  7          select * from youyus) v2
  8   where v1.t1 = v2.t1;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2769827061

-------------------------------------------------------------------------------------
| Id  | Operation            | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |              |     1 |    38 |     6  (17)| 00:00:01 |
|*  1 |  HASH JOIN           |              |     1 |    38 |     6  (17)| 00:00:01 |
|   2 |   VIEW               |              |     1 |    13 |     1   (0)| 00:00:01 |
|*  3 |    COUNT STOPKEY     |              |       |       |            |          |
|   4 |     INDEX FULL SCAN  | SYS_C0010819 |     1 |    13 |     1   (0)| 00:00:01 |
|   5 |   VIEW               |              |     2 |    50 |     4   (0)| 00:00:01 |
|   6 |    UNION-ALL         |              |       |       |            |          |
|   7 |     TABLE ACCESS FULL| YOUYUS       |     1 |    25 |     2   (0)| 00:00:01 |
|   8 |     TABLE ACCESS FULL| YOUYUS       |     1 |    25 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("V1"."T1"="V2"."T1")
   3 - filter(ROWNUM=1)
/*no_push_pred hint让Oracle 放弃使用PUSHED PREDICATE,使用常规UNION-ALL操作后,子查询执行计划回归成全表扫描,整个计划成本上升*/

Oracle中清除游标缓存的几种方法

9i以后引入了bind peeking绑定变量窥视特性,但该特性常有帮当忙之嫌,所以有了11g的自适应游标特性。排除因绑定变量窥视造成的因素外,统计信息讹误也会造成执行计划偏差,这时我们就可能需要清除指定游标的缓存信息,从而达到重新解析的目的。

下面我们列举几种可以达到清除游标缓存的方法,权作抛砖引玉:

1. alter system flush shared_pool;              /* 最简单最粗暴的方法,清除所有游标缓存,可能造成短期内大量解析,不推荐*/

2. dbms_shared_pool 包很早就有了,但该包名下的purge过程却要到10.2.0.4才出现,Bug 5614566最早在2006年描述了需要清除游标缓存接口的要求:

Hdr: 5614566 10.2.0.2 RDBMS 10.2.0.2 DICTIONARY PRODID-5 PORTID-176
Abstract: WE NEED A FLUSH CURSOR INTERFACE
*** 10/20/06 07:48 am ***

而且该过程在10.2.0.4中默认是无法正常使用的,需要通过设置event 5614566或者打上5614566补丁来启用;具体设置方法如下:

alter system set events ‘5614566 trace name context forever’;

该存储过程的具体argument如下:

PROCEDURE PURGE
参数名称                       类型                    输入/输出默认值?
------------------------------ ----------------------- ------ --------
NAME                           VARCHAR2                IN
FLAG                           CHAR                    IN     DEFAULT
HEAPS                          NUMBER                  IN     DEFAULT

其中NAME指定了需要清除的对象名,这里分成2种。PL/SQL对象,触发器,序列,类型和JAVA对象以其命名指定;SQL游标对象通过该SQL的address与hash_value组合指定。FLAG指定了对象的类型,若没有指定该参数,Oracle将认为之前代入的NAME参数对应到包/存储过程/函数的命名空间, 需要注意的是该参数是大小写敏感的,包括了以下各类型:

FLAG值 对应对象类型
P 包/存储过程/函数
Q 序列
R 触发器
T 类型
JS Java源程序
JC Java类程序
JR Java资源
JD Java共享数据
C cursor

HEAP参数指定了清除对象的哪些堆信息,以SQL游标为例,其最主要的信息包括在HEAP 0和HEAP 6中,HEAP 0包括了游标自身的大多数信息,而HEAP 6则存放了游标相关的执行计划。如果我们想要清除HEAP 0和HEAP 6中的信息,则2的0次方+2的6次方=1+64=65,那么我们在代入HEAP参数为65 即可;如果我们只想清除游标的执行计划则清除HEAP 6即可,代入HEAP参数为2的6次方即64。该参数的默认值为1,清除HEAP 0将会导致整个对象的缓存信息被清除掉。

下面我们来演示如何利用该存储过程来清除SQL缓存:

SQL> alter system flush shared_pool;

系统已更改。

SQL> select /* cache_me */  count(*)  from youyus;

  COUNT(*)

----------

         9

SQL> select sql_id,
 2         address,
 3         hash_value,
 4         executions,
 5         loads,
 6         version_count,
 7         invalidations,
 8         parse_calls
 9    from v$sqlarea
 10   where sql_text like '%cache_me%'
 11     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS
------------- -------- ---------- ---------- ---------- ------------- ------------- -----------
25asu5a76nqmn 2F51508C 2389334644          3          1             1             0           3

SQL> select address, plan_hash_value
 2    from v$sql_plan
 3   where sql_id = '25asu5a76nqmn';

ADDRESS  PLAN_HASH_VALUE
-------- ---------------
2F51508C      2542806819
2F51508C      2542806819
2F51508C      2542806819

SQL> exec dbms_shared_pool.purge('2F51508C,2389334644','C',64);

PL/SQL 过程已成功完成。

SQL> select sql_id,
  2         address,
  3         hash_value,
  4         executions,
  5         loads,
  6         version_count,
  7         invalidations,
  8         parse_calls,
  9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS PLAN_HASH_VALUE
------------- -------- ---------- ---------- ---------- ------------- ------------- ----------- ---------------
25asu5a76nqmn 2F51508C 2389334644          4          1             1             0           4      2542806819

SQL> select * from v$sql_plan where plan_hash_value= 2542806819;
未选定行

/*执行计划消失了,而游标主体信息仍在*/
SQL> select /* cache_me */  count(*)  from youyus;

 COUNT(*)
----------
 9

SQL> select sql_id,
 2         address,
 3         hash_value,
 4         executions,
 5         loads,
 6         version_count,
 7         invalidations,
 8         parse_calls,
 9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS PLAN_HASH_VALUE
------------- -------- ---------- ---------- ---------- ------------- ------------- ----------- ---------------
25asu5a76nqmn 2F51508C 2389334644          5          1             1             0           5      2542806819
/*这里新增的一次parse call是硬解析*/

SQL>  select address,operation from v$sql_plan where plan_hash_value= 2542806819;

ADDRESS  OPERATION
-------- ------------------------------------------------------------
2F51508C SELECT STATEMENT
2F51508C SORT
2F51508C TABLE ACCESS

SQL> exec dbms_shared_pool.purge('2F51508C,2389334644','C',1);

PL/SQL 过程已成功完成。

SQL> select sql_id,
 2         address,
 3         hash_value,
 4         executions,
 5         loads,
 6         version_count,
 7         invalidations,
 8         parse_calls,
 9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

未选定行
SQL> select address,operation from v$sql_plan where plan_hash_value= 2542806819;

未选定行
SQL> select /* cache_me */  count(*)  from youyus;

 COUNT(*)
----------
 9

SQL> select sql_id,
 2         address,
 3         hash_value,
 4         executions,
 5         loads,
 6         version_count,
 7         invalidations,
 8         parse_calls,
 9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS PLAN_HASH_VALUE
------------- -------- ---------- ---------- ---------- ------------- ------------- ----------- ---------------
25asu5a76nqmn 2F51508C 2389334644          1          2             1             1           1      2542806819

/*清除游标heap 0后,包括执行计划的所有信息都被清除了,甚至于simulator中的信息*/

3.如果您的环境中恰好无法利用dbms_shared_pool.purge存储过程,我们也可以采用一些折中的方法来清除游标缓存;譬如通过一个无关紧要的grant/revoke操作,但这样也会造成所有该授权/撤职对象相关SQL的执行计划失效:

SQL> alter system flush shared_pool;

系统已更改。

SQL> select /* cache_me */  count(*)  from youyus;

  COUNT(*)
----------
         9

SQL> select sql_id,
  2         address,
  3         hash_value,
  4         executions,
  5         loads,
  6         version_count,
  7         invalidations,
  8         parse_calls,
  9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS PLAN_HASH_VALUE
------------- -------- ---------- ---------- ---------- ------------- ------------- ----------- ---------------
25asu5a76nqmn 2F540EA0 2389334644          1          1             1             0           1      2542806819

SQL> select address,operation,to_char(timestamp,'HH24:MI:SS') from v$sql_plan where plan_hash_value= 2542806819;

ADDRESS  OPERATION                                                    TO_CHAR(
-------- ------------------------------------------------------------ --------
2F540EA0 SELECT STATEMENT                                             13:39:28
2F540EA0 SORT                                                         13:39:28
2F540EA0 TABLE ACCESS                                                 13:39:28

SQL> revoke select on youyus from scott;

撤销成功。

SQL> select sql_id,
  2         address,
  3         hash_value,
  4         executions,
  5         loads,
  6         version_count,
  7         invalidations,
  8         parse_calls,
  9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS PLAN_HASH_VALUE
------------- -------- ---------- ---------- ---------- ------------- ------------- ----------- ---------------
25asu5a76nqmn 2F540EA0 2389334644          1          1             1             1           1      2542806819

/*授权/撤销会造成执行计划invalid,此处 INVALIDATIONS上升到1*/

SQL> select /* cache_me */  count(*)  from youyus;

  COUNT(*)
----------
         9
/*重新执行SQL,将引发一次硬解析*/
SQL> select address,operation,to_char(timestamp,'HH24:MI:SS') from v$sql_plan where plan_hash_value= 2542806819;

ADDRESS  OPERATION                                                    TO_CHAR(
-------- ------------------------------------------------------------ --------
2F540EA0 SELECT STATEMENT                                             13:40:23
2F540EA0 SORT                                                         13:40:23
2F540EA0 TABLE ACCESS                                                 13:40:23

/*执行计划的时间戳发生了变化,达到了重新解析游标的目的*/

4.或许你不是一个位高权重的DBA,无法执行授权/撤职命令,但如果你能分析游标所涉及对象的统计信息或者执行其他一些ddl操作,那么也可以达到同样的目的:

SQL> alter system flush shared_pool;

系统已更改。

SQL>
SQL> select /* cache_me */  count(*)  from youyus;

  COUNT(*)
----------
         9

SQL> select sql_id,
  2         address,
  3         hash_value,
  4         executions,
  5         loads,
  6         version_count,
  7         invalidations,
  8         parse_calls,
  9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS PLAN_HASH_VALUE
------------- -------- ---------- ---------- ---------- ------------- ------------- ----------- ---------------
25asu5a76nqmn 2F540EA0 2389334644          1          1             1             0           1      2542806819

SQL> analyze table  youyus compute statistics;

表已分析。

SQL> select sql_id,
  2         address,
  3         hash_value,
  4         executions,
  5         loads,
  6         version_count,
  7         invalidations,
  8         parse_calls,
  9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS PLAN_HASH_VALUE
------------- -------- ---------- ---------- ---------- ------------- ------------- ----------- ---------------
25asu5a76nqmn 2F540EA0 2389334644          1          1             1             1           1      2542806819
/*统计信息更新,造成了invalid*/

SQL>    create index ind_youyus on youyus(t1);

索引已创建。

SQL> alter system flush shared_pool;

系统已更改。

SQL> select /* cache_me */  count(*)  from youyus;

  COUNT(*)
----------
         9

SQL> select sql_id,
  2         address,
  3         hash_value,
  4         executions,
  5         loads,
  6         version_count,
  7         invalidations,
  8         parse_calls,
  9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS PLAN_HASH_VALUE
------------- -------- ---------- ---------- ---------- ------------- ------------- ----------- ---------------
25asu5a76nqmn 2F464EA0 2389334644          1          1             1             0           1      2542806819

SQL> alter index ind_youyus rebuild online;

索引已更改。

SQL> select sql_id,
  2         address,
  3         hash_value,
  4         executions,
  5         loads,
  6         version_count,
  7         invalidations,
  8         parse_calls,
  9         plan_hash_value
 10    from v$sqlarea
 11   where sql_text like '%cache_me%'
 12     and sql_text not like '%v$sqlarea%';

SQL_ID        ADDRESS  HASH_VALUE EXECUTIONS      LOADS VERSION_COUNT INVALIDATIONS PARSE_CALLS PLAN_HASH_VALUE
------------- -------- ---------- ---------- ---------- ------------- ------------- ----------- ---------------
25asu5a76nqmn 2F464EA0 2389334644          1          1             1             1           1      2542806819
/*在线重建索引也可以达到同样的目的*/

That's Great!

重做日志浪费(redo wastage)

Oracle中联机日志文件(online redo log)在大多平台上以512 字节为一个标准块。

(HPUX,Tru64 Unix上是1024bytes,SCO UNIX,Reliant UNIX上是2048bytes,而MVS,MPE/ix上是4096bytes,虽然以上许多UNIX已经不再流行,实际情况可以通过

select max(l.lebsz) log_block_size_kccle

from sys.x$kccle l

where l.inst_id = userenv(‘Instance’)   语句查询到)

LGWR后台进程写出REDO时未必能填满最后的当前日志块。举例而言,假设redo buffer中有1025字节的内容需要写出,则1025=512+512+1 共占用三个重做日志标准块,前2个标准块被填满而第三个标准块只使用了1个字节。在LGWR完成写出前,需要释放”redo allocation”闩,在此之前SGA中索引”redo buffer”信息的变量将指向未被填满块后面的一个重做块,换而言之有511字节的空间被LGWR跳过了,这就是我们说的redo wastage;我们可以通过分析v$sysstat动态视图中的redo wastage统计信息了解实例生命周期中的重做浪费量。

SQL> col name for a25
SQL> select name,value from v$sysstat where name like '%wastage%';

NAME                           VALUE
------------------------- ----------
redo wastage                  132032

redo wastage的一个图示:

为什么要浪费这部分空间呢?实际上,这种做法十分有益于LGWR的串行I/O模式。redo wastage并不是问题或者Bug,而是Oracle故意为之的。当然过量的redo wastage也不是什么好事,一般是LGWR写出过于频繁的症状表现。9i以后很少有因为隐式参数_log_io_size过小而导致的LGWR过载了,如果在您的系统中发现redo wastage的问题不小,那么无限制地滥用commit操作往往是引起问题的罪魁祸首,减少不必要的commit语句,把commit从循环中移除都将利于减少redo wastage。

我们来看一下关于redo wastage的演示:

SQL> select distinct bytes/1024/1024 from v$log;

BYTES/1024/1024
---------------
             50                          /*确认联机日志文件大小为50MB*/
SQL> archive log list;                 /*确认数据库处于归档状态*/
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /s01/arch
SQL> set time on;
19:49:45 SQL> alter system switch logfile;           /*切换日志,清理现场*/
System altered.
19:51:07 SQL> col name for a25
19:51:16 SQL> select name,value from v$sysstat where name in ('redo size','redo wastage');

NAME                           VALUE
------------------------- ----------
redo size                 1418793324
redo wastage                88286544               /*演示开始时的基础统计值*/
19:51:19 SQL> begin
19:52:10   2  for i in 1..550000 loop
19:52:10   3  insert into tv values(1,'a');
19:52:10   4   commit;
19:52:10   5   end loop;
19:52:10   6   end;
19:52:11   7  /
/*匿名块中commit操作位于loop循环内,将导致大量redo wastage*/
PL/SQL procedure successfully completed.

19:53:07 SQL> select name,value from v$sysstat where name in ('redo size','redo wastage');

NAME                           VALUE
------------------------- ----------
redo size                 1689225404
redo wastage               112011352
/*频繁提交的匿名块产生了 1689225404-1418793324=257MB的redo,其中存在112011352-88286544=22MB的redo wastage*/

19:53:14 SQL>  begin
19:53:33   2  for i in 1..550000 loop
19:53:33   3  insert into tv values(1,'a');
19:53:33   4  end loop;
19:53:33   5    commit;
19:53:33   6   end;
19:53:34   7  /
/* 此匿名块中commit操作被移除loop循环中,批量修改数据后仅在最后提交一次*/
PL/SQL procedure successfully completed.

19:53:59 SQL> select name,value from v$sysstat where name in ('redo size','redo wastage');

NAME                           VALUE
------------------------- ----------
redo size                 1828546240
redo wastage               112061296
/*稀疏提交的匿名块最后产生了1828546240-1689225404=132MB的重做,而redo wastage为112061296-112011352=48k*/

可能您会很奇怪前者不是只比后者多出22MB的redo浪费吗,为什么总的redo量差了那么多?

我们需要注意到commit本身也是要产生redo的,而且其所产生的还不少!就以上演示来看频繁提交的过程中,commit所占用的redo空间几乎接近一半(257-132-22)/257=40%,而每次commit的平均redo量为(257-132-22)*1024*1024/550000=196 bytes。

commit操作是事务ACID的基础之一,合理运用commit可以帮我们构建健壮可靠的应用,而滥用它必将是另一场灾难!

latch free:cache buffer handles造成的SQL性能问题

数月之前,一位新疆的同事反映客户的某套单节点数据库存在性能问题,某个SQL长时间运行没有结束,在该SQL运行期间大量回话处于latch free等待中。因为相隔遥遥千里,同事之间是通过Email交流的;他首先给我发来了该问题SQL的相关explain_log。其中该语句的执行计划如下:

-----------------------------------------------------------------------------------------------
| Id  | Operation                               |  Name               | Rows  | Bytes | Cost  |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                     |     1 |  1956 |    13 |
|*  1 |  FILTER                                 |                     |       |       |       |
|   2 |   NESTED LOOPS OUTER                    |                     |     1 |  1956 |    13 |
|   3 |    NESTED LOOPS OUTER                   |                     |     1 |  1942 |    12 |
|   4 |     NESTED LOOPS OUTER                  |                     |     1 |  1924 |    11 |
|   5 |      NESTED LOOPS                       |                     |     1 |  1906 |    10 |
|   6 |       NESTED LOOPS                      |                     |     1 |  1856 |     9 |
|   7 |        NESTED LOOPS                     |                     |     5 |  9055 |     8 |
|   8 |         NESTED LOOPS OUTER              |                     |     5 |  8575 |     7 |
|   9 |          NESTED LOOPS OUTER             |                     |     5 |  8280 |     6 |
|  10 |           NESTED LOOPS OUTER            |                     |     5 |  7785 |     5 |
|  11 |            NESTED LOOPS OUTER           |                     |     5 |  7290 |     4 |
|  12 |             NESTED LOOPS OUTER          |                     |     5 |  4130 |     3 |
|* 13 |              TABLE ACCESS BY INDEX ROWID| S_PROD_INT          |     5 |  2960 |     2 |
|* 14 |               INDEX RANGE SCAN          | S_PROD_INT_M53      |   326 |       |     5 |
|  15 |              TABLE ACCESS BY INDEX ROWID| S_PROD_LN           |     1 |   234 |     1 |
|* 16 |               INDEX UNIQUE SCAN         | S_PROD_LN_P1        |     1 |       |       |
|  17 |             TABLE ACCESS BY INDEX ROWID | S_PROD_INT_CRSE     |     1 |   632 |     1 |
|* 18 |              INDEX RANGE SCAN           | S_PROD_INT_CRSE_U1  |     1 |       |     1 |
|  19 |            TABLE ACCESS BY INDEX ROWID  | S_LIT               |     1 |    99 |     1 |
|* 20 |             INDEX UNIQUE SCAN           | S_LIT_P1            |     1 |       |       |
|  21 |           TABLE ACCESS BY INDEX ROWID   | S_LIT               |     1 |    99 |     1 |
|* 22 |            INDEX UNIQUE SCAN            | S_LIT_P1            |     1 |       |       |
|  23 |          TABLE ACCESS BY INDEX ROWID    | S_PROD_INT_TNTX     |     1 |    59 |     1 |
|* 24 |           INDEX RANGE SCAN              | S_PROD_INT_TNTX_U1  |     1 |       |     1 |
|  25 |         TABLE ACCESS BY INDEX ROWID     | S_VOD               |     1 |    96 |     1 |
|* 26 |          INDEX RANGE SCAN               | S_VOD_U2            |     1 |       |     2 |
|* 27 |        TABLE ACCESS BY INDEX ROWID      | S_VOD_VER           |     1 |    45 |     1 |
|* 28 |         INDEX RANGE SCAN                | S_VOD_VER_U1        |     2 |       |     4 |
|  29 |       TABLE ACCESS BY INDEX ROWID       | S_ISS_OBJ_DEF       |     1 |    50 |     1 |
|* 30 |        INDEX RANGE SCAN                 | S_ISS_OBJ_DEF_M3    |     1 |       |     2 |
|  31 |      TABLE ACCESS BY INDEX ROWID        | S_CTLG_CAT          |     1 |    18 |     1 |
|* 32 |       INDEX UNIQUE SCAN                 | S_CTLG_CAT_P1       |     1 |       |       |
|  33 |     TABLE ACCESS BY INDEX ROWID         | S_CTLG_CAT          |     1 |    18 |     1 |
|* 34 |      INDEX UNIQUE SCAN                  | S_CTLG_CAT_P1       |     1 |       |       |
|  35 |    TABLE ACCESS BY INDEX ROWID          | S_CTLG_CAT          |     1 |    14 |     1 |
|* 36 |     INDEX UNIQUE SCAN                   | S_CTLG_CAT_P1       |     1 |       |       |
|  37 |   NESTED LOOPS                          |                     |     1 |   137 |     7 |
|  38 |    NESTED LOOPS                         |                     |     4 |   456 |     6 |
|  39 |     NESTED LOOPS                        |                     |     4 |   364 |     5 |
|  40 |      NESTED LOOPS                       |                     |     2 |   142 |     4 |
|  41 |       NESTED LOOPS                      |                     |     1 |    53 |     3 |
|  42 |        NESTED LOOPS                     |                     |     3 |   120 |     2 |
|* 43 |         INDEX RANGE SCAN                | S_CTLG_CAT_PROD_U1  |     3 |    57 |     2 |
|* 44 |         TABLE ACCESS BY INDEX ROWID     | S_CTLG_CAT          |     1 |    21 |     1 |
|* 45 |          INDEX UNIQUE SCAN              | S_CTLG_CAT_P1       |     1 |       |       |
|* 46 |        TABLE ACCESS BY INDEX ROWID      | S_CTLG              |     1 |    13 |     1 |
|* 47 |         INDEX UNIQUE SCAN               | S_CTLG_P1           |     1 |       |       |
|  48 |       TABLE ACCESS BY INDEX ROWID       | S_PARTY_CTGRY       |     2 |    36 |     1 |
|* 49 |        INDEX RANGE SCAN                 | S_PARTY_CTGRY_F1    |     2 |       |     1 |
|  50 |      TABLE ACCESS BY INDEX ROWID        | S_PARTY_RPT_REL     |     2 |    40 |     1 |
|* 51 |       INDEX RANGE SCAN                  | S_PARTY_RPTREL_F50  |     2 |       |     1 |
|  52 |     TABLE ACCESS BY INDEX ROWID         | S_PARTY_PER         |     1 |    23 |     1 |
|* 53 |      INDEX RANGE SCAN                   | S_PARTY_PER_F1      |     1 |       |     2 |
|* 54 |    INDEX RANGE SCAN                     | S_PARTY_PER_U1      |     1 |    23 |     3 |
|  55 |   NESTED LOOPS                          |                     |     1 |   157 |     8 |
|  56 |    NESTED LOOPS                         |                     |     1 |   144 |     7 |
|  57 |     NESTED LOOPS                        |                     |     1 |   125 |     6 |
|  58 |      NESTED LOOPS                       |                     |     1 |   104 |     5 |
|  59 |       NESTED LOOPS                      |                     |     1 |    86 |     4 |
|  60 |        NESTED LOOPS                     |                     |     4 |   264 |     3 |
|  61 |         NESTED LOOPS                    |                     |     2 |    86 |     2 |
|* 62 |          INDEX RANGE SCAN               | S_PARTY_PER_U1      |     2 |    46 |     4 |
|  63 |          TABLE ACCESS BY INDEX ROWID    | S_POSTN             |     1 |    20 |     1 |
|* 64 |           INDEX UNIQUE SCAN             | S_POSTN_P1          |     1 |       |       |
|* 65 |         INDEX RANGE SCAN                | S_PARTY_PER_U1      |     2 |    46 |     3 |
|* 66 |        INDEX RANGE SCAN                 | S_PARTY_RPT_REL_U2  |     1 |    20 |     1 |
|* 67 |       INDEX RANGE SCAN                  | S_PARTY_CTGRY_U1    |     1 |    18 |     1 |
|* 68 |      TABLE ACCESS BY INDEX ROWID        | S_CTLG_CAT          |     1 |    21 |     1 |
|* 69 |       INDEX UNIQUE SCAN                 | S_CTLG_CAT_P1       |     1 |       |       |
|* 70 |     INDEX RANGE SCAN                    | S_CTLG_CAT_PROD_U1  |     1 |    19 |     1 |
|* 71 |    TABLE ACCESS BY INDEX ROWID          | S_CTLG              |     1 |    13 |     1 |
|* 72 |     INDEX UNIQUE SCAN                   | S_CTLG_P1           |     1 |       |       |
|  73 |   NESTED LOOPS                          |                     |     1 |    58 |     3 |
|  74 |    NESTED LOOPS                         |                     |     3 |   135 |     2 |
|* 75 |     INDEX RANGE SCAN                    | S_CTLG_CAT_PROD_U1  |     2 |    38 |     2 |
|* 76 |     TABLE ACCESS BY INDEX ROWID         | S_CTLG_CAT          |     1 |    26 |     1 |
|* 77 |      INDEX UNIQUE SCAN                  | S_CTLG_CAT_P1       |     1 |       |       |
|* 78 |    TABLE ACCESS BY INDEX ROWID          | S_CTLG              |     1 |    13 |     1 |
|* 79 |     INDEX UNIQUE SCAN                   | S_CTLG_P1           |     1 |       |       |
-----------------------------------------------------------------------------------------------

该SQL涉及近四十个表和索引的链接操作,而且全部采用嵌套循环链接模式,是典型的siebel应用。这个SQL的执行计划十分复杂,我甚至不愿意去看它;既然从执行计划上进一步分析过于违背“惰性”,我首先想到的还是从wait interface入手,毕竟OWI是”useless”的。因为无法远程操作,反复收集信息会让整个过程十分缓慢,一口气问对方要了10046,errorstack,latches的dump信息,以及该数据库的完整RDA报告。

既然是latch free等待,那么总要先定位到相关的latch才好进一步分析,我们首先看10046 trace:

cat    siebdb1_ora_15465.trc |grep "nam='latch free'"|awk '{print $7}' |sort|uniq    -c
1 p1=-4611685984823284696
169 p1=-4611685984823317640
1 p1=-4611685988591416672
1 p1=-4611685988592487696
2 p1=-4611685988692060472
1 p1=-4611685988694684144
1 p1=-4611685988705526816
1 p1=-4611685988733017136
1 p1=-4611685988779484112
1 p1=-4611685988784059264
1 p1=-4611685988787329592
1 p1=-4611685988788323152
1 p1=-4611685988800312448
2 p1=-4611685988805347840
24 p1=-4611685988806714976
2 p1=-4611685988837854328
7 p1=-4611685988841028520
1 p1=-4611685988869432648
1 p1=-4611685988871320920

通过以上脚本处理,我们可以看到在执行过程中出现了各种latch free,出现最频繁的是p1=-4611685984823317640的栓。latch free等待事件中p1所代表的是会话所等待栓的地址。-4611685984823317640换算到16进制为C0000007D2F56B78。我们看看这个栓到底是何方神圣:

 c0000007d2f56b78 cache buffer handles level=3
        Location from where latch is held: kcbzfs:
        Context saved from call: 0
        state=busy
  gotten 2734921346 times wait, failed first 656272521 sleeps 44808579
  gotten 0 times nowait, failed: 0

当会话需要pin住buffer header时它首先要获去buffer handle,得到buffer handle的过程中首先要抢占cache buffer handles栓,为了避免对于cache buffer handles栓的过度争用,每个会话被允许cache一小撮buffer handles,也叫保留集(reserved set)。该保留集的上限由隐式参数_db_handles_cached(默认为5)所控制,在此基础上会话在执行不是十分复杂的SQL时不必反复申请栓。
同时存在一些十分复杂的语句,需要同时pin住大量的缓存,此时若存在空闲的handle,会话被允许索要手头5个cached buffer handles以外的handle。也为了限制单个会话的资源占用量,Oracle通过内部算法总是尽量让每个进程能够pin住”合理份额“的buffer,由此单个会话同一时间所能持有的总handle数等于db_block_buffers/processes,隐式参数_cursor_db_buffers_pinned指出了该值。另cache buffer handles并没有子栓,仅有父栓本身,所以如果系统中有大量复杂SQL,会话所能cache的buffer handles远不足以满足需求时,该栓就成为整个系统性能的瓶颈,甚至可能出现潜在的死锁/饿死问题。
结合我们从trace中获取的实际信息,cache buffer handles的命中率为1-656272521/2734921346=76%;一般来说如果系统中某个栓的命中率低于95%时,我们就可以认为该栓的争用较为激烈了;而此处一个父栓的命中率到了80%以下,我们很难想象其性能恶化状态。
在同事发给我的RDA中探索,从resource limit栏目中可以看到该实例的会话高水位从未超过2500,而实例设定的processes值则为6000,这是一个很多初级DBA或施工方会犯的错误,认为设定高一些的processes值可以避免出现ora-00020错误从而高枕无忧,殊不知Oracle中许多隐式参数的分配是针对process数来计算的,资源总数是有限的,process数多则单个会话所能操控的理论资源数相应减少,这不仅在Oracle中,几乎是放之四海皆准的原理。
极有可能是过高的processes参数影响了_cursor_db_buffers_pinned,而该问题SQL除了十分复杂外还有全部采用nested loop链接方式的问题,这就要求同时时刻需pin住大量不同表和索引的buffer,于是在SQL执行伊始会话尝试pin住了一部分的buffer,但因为最终达到了_cursor_db_buffers_pinned所指定的上限,该会话在持有cache buffer handles栓的情况下处于hang状态,进一步导致了大量其他会话因latch free等待而阻塞。
当时给同事的最主要建议就是降低processes参数,从该系统的V$resource_limit动态视图来看,将processes设到3000就绰绰有余了;同时我们可以增大隐式参数_db_handles_cached以避免系统中对该栓的争用。

几天后,同事回信告诉我开发人员通过其他途径绕过了该问题,但我还是建议他在未来可能的offline计划中加入修改processes和_db_handles_cached参数的操作。

这个case发生几个月后,我无意中从MOS中发现以下note:

Hdr: 9328032 10.2.0.3 RDBMS 10.2.0.3 BUFFER CACHE PRODID-5 PORTID-23
Abstract: INCREASE IN CACHE BUFFER CHAINS/HANDLES WAITS

*** 01/29/10 05:41 am ***
TAR:
----
3-1375394871

PROBLEM:
--------
Today, and despite no significant change in workload we have increased waits
on these two latches:

29Jan2010_1030_1045 : Top 5 Timed Events
   Avg %Total
29Jan2010_1030_1045 : ~~~~~~~~~~~~~~~~~~
  wait   Call
29Jan2010_1030_1045 : Event                                 Waits    Time (s)
  (ms)   Time Wait Class
29Jan2010_1030_1045 : ------------------------------ ------------ -----------
------ ------ ----------
29Jan2010_1030_1045 : db file sequential read           4,912,142      50,169
    10   40.7   User I/O
29Jan2010_1030_1045 : CPU time                                         45,456
         36.9
29Jan2010_1030_1045 : latch: cache buffers chains       1,347,911      12,255
     9   10.0 Concurrenc
29Jan2010_1030_1045 : latch: cache buffer handles       1,364,489      10,662
     8    8.7      Other
29Jan2010_1030_1045 : log file sync                       482,281       7,101
    15    5.8     Commit
29Jan2010_1030_1045 :
-------------------------------------------------------------     

DIAGNOSTIC ANALYSIS:
--------------------
See below

WORKAROUND:
-----------
None

RELATED BUGS:
-------------

REPRODUCIBILITY:
----------------

TEST CASE:
----------

STACK TRACE:
------------

SUPPORTING INFORMATION:
-----------------------

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------

DIAL-IN INFORMATION:
--------------------

IMPACT DATE:
------------ 
Notes so far:

There seems to be 5 - 12% increase across a number of statistics
between the OK sample and the problem sample.
eg:                          Total           per second  per TX
  26th> execute count       6,322,503        7,025.2      14.6
  29th> execute count       6,828,911        7,552.3      14.7
        8% increase in execute count

  26th> data blocks consistent reads - u 16,567,225  18,408.4  38.2
  29th> data blocks consistent reads - u 18,700,073  20,681.1  40.3
        12.6% increase in consistent read undo records applied

cache buffer handles
~~~~~~~~~~~~~~~~~~~~
 There is about an 8% increase in gets of this latch between
 the two reports:
                               Gets        miss%   wait time
  26th> cache buffer handles  80,320,092   21.0        56
  29th> cache buffer handles  87,158,750   55.9     10661

 And the latch was already showing 21% miss rate before so was
 showing signs of being close to becoming a problem.

 The latch misses stats show:
                            Location      Sleeps   Waiter Sleeps
   cache buffer handles     kcbzgs        726,520  832,967
   cache buffer handles     kcbzfs        638,106  531,875

 This latch is used just to get / release buffer state objects.

 Each process caches up to "_db_handles_cached" of these against
 its own process so once we are warmed up and have processes
 then we should only use this latch if we are after pinning more
 than 5 buffers in the process state.

 As the misses seem balanced across kcbzgs and kcbzfs then the
 gets do not appear to be due to warming up so this points at
 SQLs on the system needing to pin more than 5 buffers at a
 time. If we increase "_db_handles_cached" then we should be able
 to alleviate the load on this latch.

 eg: See PID 4174 from yesterday we see it has 9 buffers pinned
     and is running:
       SELECT   /*+ LEADING (PROD, ITM, ORDX)
          INDEX (ITM S_ORDER_ITEM_BT_W1_X )*/ ...
     So we can see there are SQLs needing many more than 5 buffers
     pinned at a time.
*** 01/29/10 10:06 am ***
cache buffers chains
~~~~~~~~~~~~~~~~~~~~            GETS           Miss%
 26th>  cache buffers chains    866,844,547    1.0    0.0     83
 29th>  cache buffers chains    900,769,954    1.4    0.1  12258
        4% increase in gets , large increase in wait time.

 Previous miss rate was 1%, now 1.4%.

 James already checked and mostly the spread looks reasonable
  across child latches.
                                           per second
     26th>     Transactions:                481.56
     29th>     Transactions:                513.79
               6.5% more transactions processed.

 This is looking like increased load is starting to see the
 latch contention. It may be sensible here to look at the
 higher "buffer gets" SQLs to see if they really are well
 tuned. eg: 8y9msp8p3gv23 in the 1130-1145 reports are
 showing increased executions but also increased buffers
 per execution. It would seem sensible to check out the
 plans and use of the top buffer gets SQLs.

 It would also be wise to confirm we are not hitting bug 6455161
 as that affects this release and I cannot see that patch in place.
 ie: Verify:
     select owner||'.'||Object_name||'.'||subobject_name
       from dba_objects
      where object_id<>data_object_id
        and object_type like 'INDEX%'
      order by owner,object_name,subobject_name;
 That would not be a difference from 26th but if this is hit
 you gets may be far higher than needed for certains types of
 query.
*** 01/29/10 02:14 pm *** (CHG: Sta->11)
*** 01/29/10 02:14 pm ***
*** 01/29/10 02:40 pm *** (CHG: Sta->16)
*** 01/29/10 03:03 pm ***
*** 01/29/10 11:33 pm ***
Many of the indexes are on SIEBEL objects so
bug 6455161 looks like it could occur here.
*** 01/30/10 12:02 am ***
*** 01/30/10 12:41 am ***
I checked a few top buffer gets SQLs from older AWR that we have
awrrpt_1_67673_67677 . It looks like the heaviest SQLs are
probably not using indexes which are exposed to 6455161 but a couple
may be:

 av5skvp0mkxm1
  SELECT /*+ ALL_ROWS */ T3.CONFLICT_ID, T3.LAST_UPD ...
  FROM SIEBEL.S_LST_OF_VAL T1, SIEBEL.S_SRM_ACTION T2, SIEBEL.S_SRM_ACTION T3

   ...

  Indexes on the q1.log list which might be used by that query in NL:
    SIEBEL.S_SRM_ACTION_M1    SIEBEL.S_SRM_ACTION_P1
    SIEBEL.S_SRM_ACTION_U1
    SIEBEL.S_SRM_ACT_PARAM_P1 SIEBEL.S_SRM_ACT_PARAM_U1

 7khg9mx4cv5c5
  Queries from SIEBEL.S_ORG_EXT with lots of other tables

  Might use SIEBEL.S_ORG_EXT_M1 from the list
*** 01/30/10 12:59 am ***
=====
>>> Side issue unrelated to the issue here:
      From patch list patches_28012010.lis
      customer has 6868080 installed. This was superceded by 8244734
=====
*** 01/30/10 07:06 am *** (CHG: Sta->10)
*** 01/30/10 07:06 am ***
Suggestions for this:

a> Set "_db_handles_cached" >= 10

b> Get and install a patch for bug 6455161
    This seems like this is minimal impact at present
    as problem indexes are not used in the top buffer
    gets SQLs BUT one index rebuild could lead to mass
    problems, and it may give some small reductions
    in buffer gets on certain SQLs, just not the top
    ones.
    It is a small focused fix with no clashes on current
    list.

c> Double check some of the top buffer gets SQL plans.
    They may be the best they could be but are worth
    checking as there is little in the AWRs to show
    any special reason for cache buffers chains increase
    other than the slightly higher workload.
*** 02/01/10 07:32 am ***

>> SQL "av5skvp0mkxm1" may be susceptible to 6455161.
*** 02/01/10 08:03 am ***
Reviewing level of "_db_handles_cached" as some of the top
SQLs seem very deep NL outer.

Testing this here.

*** 02/01/10 11:05 am ***
For repeatedly executed deep NL SQLs higher handle caching should
improve loading on the "cache buffer handles" latch, but it
has no impact on the chains latch as NLs and NL outers keep buffers
pinned anyway.

So suggestions are still>>
 Set "_db_handles_cached" >= 40 (given nested depth of some SQL)
   See if extra memory should be added to shared pool to accomodate that

 Install 6455161

 Drill more into SQL to see if any can be improved esp wrt buffer
  gets needed .
 Also
   Recheck V$latch_children spread for cache buffers chains
   and recheck X$BH blocks on warmest children
   See if there is any session skew on the latch waits
   (any sessions more prone to waits than others)

该note所述case与我们的情况相似,经过排查发现最有”嫌疑“的SQL语句中也存在大量的Nested Loop外连接。这似乎是siebel应用一个普遍存在的问题,note中的客户最初发现的症结也是cache buffer handles栓等待问题,Oracle MOS最后给出的建议是:

1.增大db_handles_cache 到10以上,最后又建议增加到40以上

2. 这个note中客户的数据库版本是sparc(64 bit) 10.2.0.3, MOS建议在此基础上应用Bug 6455161的one-off patch

3.检查有大量buffer gets的SQL语句的执行计划

转储控制文件信息

当前控制文件的内容可以通过CONTROLF dump以文本形式转储到实例参数指定的目录下。

各级别CONTROLF转储的区别:

转储级别 转储内容
1 仅包含文件头信息
2 包括文件头,数据库信息记录,检查点进程记录
3 所有记录类型,针对循环重用的记录类型仅保留最早及最近的记录
4 如上,包含4条最新的可重用类型的记录
5+ 如上,包含的可重用类型记录为2*level条
SQL> oradebug setmypid;
Statement processed.
SQL> oradebug dump controlf 3;
Statement processed.
SQL> oradebug tracefile_name;
/s01/rac10g/admin/PROD/udump/prod_ora_3710.trc

以下例而言,使用oradebug命令启动转储,将会把当前控制文件信息以文本形式转储到该服务进程对应的用户转储文件中,文本中记录了level 3所对应的内容。

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug dump controlf 3;
Statement processed.
SQL> oradebug tracefile_name;
/s01/rac10g/admin/PROD/udump/prod_ora_3710.trc

oradebug必须以SYSDBA形式登陆后方能使用,你也可以以非SYSDBA的身份通过alter session形式启动转储:

SQL> alter session set events 'immediate trace name controlf level 3';

Session altered.

[gview file=”http://askmac.cn/wp-content/uploads/resource/Controlfile.doc”]

沪ICP备14014813号-2

沪公网安备 31010802001379号