Know more about RAC statistics and wait events

The main RAC statistics and wait events are listed below:

 

1. Statistics:

1.1 V$SYSSTAT, V$SESSTAT (join to V$STATNAME)

  • gc cr blocks served
  • gc cr block build time
  • gc cr block flush time
  • gc cr block send time
  • gc current blocks served
  • gc current block pin time
  • gc current block flush time
  • gc current block send time
  • gc cr blocks received
  • gc cr block receive time
  • gc current blocks received
  • gc current block receive time
  • gc blocks lost
  • gc claim blocks lost
  • gc blocks corrupt
  • global enqueue gets sync
  • global enqueue gets async
  • global enqueue get time
  • global enqueue releases
  • gcs messages sent
  • ges messages sent
  • global enqueue CPU used by this session
  • gc CPU used by this session
  • IPC CPU used by this session
  • global undo segment hints helped (10.2)
  • global undo segment hints were stale (10.2)
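For instance, the per-session values can be read by joining V$SESSTAT to V$STATNAME; a minimal sketch (the SID 123 and the 'gc%' filter are only placeholders):

select n.name, s.value
  from v$sesstat s, v$statname n
 where s.statistic# = n.statistic#
   and s.sid = 123
   and n.name like 'gc%'
 order by s.value desc;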

1.2 V$SYSMETRIC

  • GC CR Block Received Per Second
  • GC CR Block Received Per Txn
  • GC Current Block Received Per Second
  • GC Current Block Received Per Txn
  • Global Cache Average CR Get Time
  • Global Cache Average Current Get Time
  • Global Cache Blocks Corrupted
  • Global Cache Blocks Lost
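These metrics can be read directly from V$SYSMETRIC, for example (a sketch using the metric names listed above; each metric typically shows up once per sampling interval):

select metric_name, value, metric_unit
  from v$sysmetric
 where metric_name in ('GC CR Block Received Per Second',
                       'GC Current Block Received Per Second',
                       'Global Cache Average CR Get Time',
                       'Global Cache Average Current Get Time');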

1.3 V$SEGMENT_STATISTICS, V$SEGSTAT

  • gc buffer busy
  • gc cr blocks received
  • gc current blocks received
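To find the segments that generate the most interconnect traffic, the segment-level statistics can be ranked, e.g. (a minimal sketch):

select owner, object_name, statistic_name, value
  from v$segment_statistics
 where statistic_name in ('gc buffer busy',
                          'gc cr blocks received',
                          'gc current blocks received')
 order by value desc;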

1.4 DBA_HIST_SEG_STAT

  • GC_CR_BLOCKS_SERVED_TOTAL
  • GC_CR_BLOCKS_SERVED_DELTA
  • GC_CU_BLOCKS_SERVED_TOTAL
  • GC_CU_BLOCKS_SERVED_DELTA
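AWR keeps the same counters per snapshot; a sketch of reading them (DBA_HIST_SEG_STAT_OBJ is used here only to resolve object names):

select o.owner, o.object_name, s.snap_id,
       s.gc_cr_blocks_served_delta, s.gc_cu_blocks_served_delta
  from dba_hist_seg_stat s, dba_hist_seg_stat_obj o
 where s.dbid = o.dbid
   and s.obj# = o.obj#
 order by s.gc_cr_blocks_served_delta desc;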

1.5 V$GES_STATISTICS

  • acks for commit broadcast(actual)
  • acks for commit broadcast(logical)
  • broadcast msgs on commit(actual)
  • broadcast msgs on commit(logical)
  • broadcast msgs on commit(wasted)
  • dynamically allocated gcs resources
  • dynamically allocated gcs shadows
  • false posts waiting for scn acks
  • flow control messages received
  • flow control messages sent
  • gcs assume cvt
  • gcs assume no cvt
  • gcs ast xid
  • gcs blocked converts
  • gcs blocked cr converts
  • gcs compatible basts
  • gcs compatible cr basts (global)
  • gcs compatible cr basts (local)
  • gcs cr basts to PIs
  • gcs cr serve without current lock
  • gcs dbwr flush pi msgs
  • gcs dbwr write request msgs
  • gcs error msgs
  • gcs forward cr to pinged instance
  • gcs immediate (compatible) converts
  • gcs immediate (null) converts
  • gcs immediate cr (compatible) converts
  • gcs immediate cr (null) converts
  • gcs indirect ast
  • gcs lms flush pi msgs
  • gcs lms write request msgs
  • gcs msgs process time(ms)
  • gcs msgs received
  • gcs out-of-order msgs
  • gcs pings refused
  • gcs queued converts
  • gcs recovery claim msgs
  • gcs refuse xid
  • gcs regular cr
  • gcs retry convert request
  • gcs side channel msgs actual
  • gcs side channel msgs logical
  • gcs undo cr
  • gcs write notification msgs
  • gcs writes refused
  • ges msgs process time(ms)
  • ges msgs received
  • global posts dropped
  • global posts queue time
  • global posts queued
  • global posts requested
  • global posts sent
  • implicit batch messages received
  • implicit batch messages sent
  • lmd msg send time(ms)
  • lms(s) msg send time(ms)
  • messages flow controlled
  • messages queue sent actual
  • messages queue sent logical
  • messages received actual
  • messages received logical
  • messages sent directly
  • messages sent indirectly
  • messages sent not implicit batched
  • messages sent pbatched
  • msgs causing lmd to send msgs
  • msgs causing lms(s) to send msgs
  • msgs received queue time (ms)
  • msgs received queued
  • msgs sent queue time (ms)
  • msgs sent queue time on ksxp (ms)
  • msgs sent queued
  • msgs sent queued on ksxp
  • process batch messages received
  • process batch messages sent
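V$GES_STATISTICS itself is just a name/value list, so the counters above can be browsed directly, e.g. (a sketch):

select name, value
  from v$ges_statistics
 where name like 'gcs msgs%'
    or name like 'ges msgs%'
 order by value desc;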

2. Wait Events:

In 10g the RAC wait events can be divided into three categories. The main wait events are listed below, including some undocumented ones.

2.1 Real time only:

These wait events are defined only while the process is waiting; after the wait is over, they are reclassified according to the outcome of the global cache operation. They should appear only in the following views: V$SESSION_WAIT, V$ACTIVE_SESSION_HISTORY, V$EVENT_HISTOGRAM.

 

  • gc cr request
  • gc current request
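Because these two events exist only while the wait is in progress, they are best caught in V$SESSION_WAIT or ASH. A minimal sketch (for these events P1/P2 are generally the file# and block# being requested, so grouping on them highlights hot blocks):

select session_id, event, p1 file#, p2 block#, count(*) samples
  from v$active_session_history
 where event in ('gc cr request', 'gc current request')
 group by session_id, event, p1, p2
 order by samples desc;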

2.2 Historical only:

These events represent the outcome of a GC request (fix-up events). They can appear in the following views: V$SESSION_EVENT, V$SYSTEM_EVENT, DBA_HIST_SYSTEM_EVENT, V$EVENTMETRIC.

  • gc cr block 2-way
  • gc cr block 3-way
  • gc cr block busy
  • gc cr block congested
  • gc cr block lost
  • gc cr block unknown
  • gc cr grant 2-way
  • gc cr grant busy
  • gc cr grant congested
  • gc cr grant unknown
  • gc current block 2-way
  • gc current block 3-way
  • gc current block busy
  • gc current block congested
  • gc current block lost
  • gc current block unknown
  • gc current grant 2-way
  • gc current grant busy
  • gc current grant congested
  • gc current grant unknown
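The system-wide totals and average latencies of these fix-up events come from V$SYSTEM_EVENT; a sketch (TIME_WAITED and AVERAGE_WAIT are in centiseconds):

select event, total_waits, time_waited, average_wait
  from v$system_event
 where event like 'gc cr%'
    or event like 'gc current%'
 order by time_waited desc;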

2.3 Other events:

The remaining wait events may appear in any of the views listed before, namely: V$SESSION_WAIT, V$ACTIVE_SESSION_HISTORY, V$EVENT_HISTOGRAM, V$SESSION_EVENT, V$SYSTEM_EVENT, DBA_HIST_SYSTEM_EVENT, V$EVENTMETRIC.

  • LMON global data update
  • cr request retry
  • gc assume
  • gc block recovery request
  • gc buffer busy
  • gc claim
  • gc cr cancel
  • gc cr disk read
  • gc cr disk request
  • gc cr failure
  • gc cr multi block request
  • gc current cancel
  • gc current multi block request
  • gc current retry
  • gc current split
  • gc domain validation
  • gc freelist
  • gc prepare
  • gc quiesce wait
  • gc recovery free
  • gc recovery quiesce
  • gc remaster
  • gcs ddet enter server mode
  • gcs domain validation
  • gcs drm freeze begin
  • gcs drm freeze in enter server mode
  • gcs enter server mode
  • gcs log flush sync
  • gcs remastering wait for read latch
  • gcs remastering wait for write latch
  • gcs remote message
  • gcs resource directory to be unfrozen
  • gcs to be enabled
  • ges LMD suspend for testing event
  • ges LMD to inherit communication channels
  • ges LMD to shutdown
  • ges LMON for send queues
  • ges LMON to get to FTDONE
  • ges LMON to join CGS group
  • ges cached resource cleanup
  • ges cancel
  • ges cgs registration
  • ges enter server mode
  • ges generic event
  • ges global resource directory to be frozen
  • ges inquiry response
  • ges lmd and pmon to attach
  • ges lmd/lmses to freeze in rcfg – mrcvr
  • ges lmd/lmses to unfreeze in rcfg – mrcvr
  • ges master to get established for SCN op
  • ges performance test completion
  • ges pmon to exit
  • ges process with outstanding i/o
  • ges reconfiguration to start
  • ges remote message
  • ges resource cleanout during enqueue open
  • ges resource cleanout during enqueue open-cvt
  • ges resource directory to be unfrozen
  • ges retry query node
  • ges reusing os pid
  • ges user error
  • ges wait for lmon to be ready
  • ges1 LMON to wake up LMD – mrcvr
  • ges2 LMON to wake up LMD – mrcvr
  • ges2 LMON to wake up lms – mrcvr 2
  • ges2 LMON to wake up lms – mrcvr 3
  • ges2 proc latch in rm latch get 1
  • ges2 proc latch in rm latch get 2
  • global cache busy
  • global enqueue expand wait
  • latch: KCL gc element parent latch
  • latch: gcs resource hash
  • latch: ges resource hash list
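For the busier of these events (gcs log flush sync, for example), the latency distribution in V$EVENT_HISTOGRAM is often more telling than a single average; a sketch:

select event, wait_time_milli, wait_count
  from v$event_histogram
 where event = 'gcs log flush sync'
 order by wait_time_milli;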

The "ASM file metadata operation" wait event

What is the "ASM file metadata operation" wait event?

It is an internal, undocumented wait event:

“ASM file metadata operation” event is associated with database (instance) communication with ASM (instance). If the database has files in ASM disk group(s) it needs to access disk group(s), get extent maps for files that it already has, get updated extent info (e.g. after a rebalance), create new files in ASM etc.
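On the database side, a quick way to see who is currently stuck on this event is a simple query against V$SESSION (a minimal sketch):

select sid, program, event, state, seconds_in_wait
  from v$session
 where event = 'ASM file metadata operation';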

High waits for 'ksv master wait' while doing an ASM file metadata operation were reported while a data migration utility was running. The wait was also seen when dropping a tablespace.

Many waits for "ASM file metadata operation" may be seen under kfnConnect when sessions query ASM disk group statistics in a DB instance.

Rediscovery Notes:
Waits occur under repeated kfnConnect calls when querying certain
views based on fixed (X$) views, such as V$ASM_DISKGROUP_STAT.

At that time, only a few processes (one or two) on the ASM instance will be on non-idle wait events such as "ASM file metadata operation".

WAITING FOR ‘ASM FILE METADATA OPERATION’ DURING RMAN BACKUP & ASM HANGS

The Oracle wait event kfk: async disk IO

The kfk: async disk IO wait event is the asynchronous System I/O wait event under ASM; the kfk kernel layer is activated when disk_asynch_io=true. RBAL and other ASM-related background processes may enter the kfk: async disk IO wait while maintaining ASM disk groups.

SQL> col name for a20
SQL> col PARAMETER1 for a10
SQL> col PARAMETER2 for a10
SQL> col PARAMETER3 for a10
SQL> col WAIT_CLASS for a15

SQL> select name,parameter1,parameter2,parameter3,wait_class from v$event_name where name='kfk: async disk IO';

NAME                 PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
-------------------- ---------- ---------- ---------- ---------------
kfk: async disk IO   count      intr       timeout    System I/O

SQL> select * from v$version;    

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

SQL> select name,value from v$system_parameter where name in ('instance_type','asm_power_limit');

NAME                 VALUE
-------------------- ----------
instance_type        asm
asm_power_limit      10

SQL> conn / as sysasm
Connected.

SQL> oradebug setmypid;
Statement processed.
SQL> oradebug event 10046 trace name context forever,level 8;
Statement processed.
SQL> alter diskgroup data check all;

Diskgroup altered.

SQL> oradebug event 10046 trace name context off;
Statement processed.

SQL> oradebug tracefile_name;
/s01/orabase/diag/asm/+asm/+ASM1/trace/+ASM1_ora_29405.trc

=====================trace=====================
PARSING IN CURSOR #140442102181424 len=30 dep=0 uid=0 oct=193 
lid=0 tim=1313673029551496 hv=2849532521 ad='6bd58b50' sqlid='ft5h7dunxhum9'
alter diskgroup data check all
END OF STMT
PARSE #140442102181424:c=1999,e=14171,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=0,tim=1313673029551493
WAIT #140442102181424: nam='Disk file operations I/O' ela= 573 FileOperation=2 fileno=0 filetype=15 obj#=-1 
WAIT #140442102181424: nam='Disk file operations I/O' ela= 33 FileOperation=2 fileno=0 filetype=15 obj#=-1 
WAIT #140442102181424: nam='Disk file operations I/O' ela= 29 FileOperation=2 fileno=0 filetype=15 obj#=-1 
WAIT #140442102181424: nam='kfk: async disk IO' ela= 941 count=1 intr=0 timeout=4294967295 obj#=-1

kfdp_checkDsk(): 20
----- Abridged Call Stack Trace -----
ksedsts()+461<-kfdp_checkDsk()+476<-kfdCheck()+1649<-kfgCheck()+477<-kfxdrvAl
terOne()+5976<-kfxdrvAlter()+2287<-kfxdrvEntry()+1306<-opiexe()+20028<-opiosq
0()+3993<-kpooprx()+274<-kpoal8()+800<-opiodr()+910<-ttcpip()+2289<-opitsk()+
1670<-opiino()+966<-opiodr()+910
<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+252<-main()+201<
-__libc_start_main()+244<-_start()+36
----- End of Abridged Call Stack Trace -----
WAIT #140442102181424: nam='rdbms ipc reply' ela= 1610 from_process=19 timeou
t=2147483647 p3=0 obj#=-1 tim=1313673029798048
kfdp_checkDsk(): 21
----- Abridged Call Stack Trace -----
ksedsts()+461<-kfdp_checkDsk()+476<-kfdCheck()+1649<-kfgCheck()+477<-kfxdrvAl
terOne()+5976<-kfxdrvAlter()+2287<-kfxdrvEntry()+1306<-opiexe()+20028<-opiosq
0()+3993<-kpooprx()+274<-kpoal8()+800<-opiodr()+910<-ttcpip()+2289<-opitsk()+
1670<-opiino()+966<-opiodr()+910
<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+252<-main()+201<
-__libc_start_main()+244<-_start()+36
----- End of Abridged Call Stack Trace -----
WAIT #140442102181424: nam='rdbms ipc reply' ela= 1677 from_process=19 timeou
t=2147483647 p3=0 obj#=-1 tim=1313673029885645
kfdp_checkDsk(): 22
----- Abridged Call Stack Trace -----
ksedsts()+461<-kfdp_checkDsk()+476<-kfdCheck()+1649<-kfgCheck()+477<-kfxdrvAl
terOne()+5976<-kfxdrvAlter()+2287<-kfxdrvEntry()+1306<-opiexe()+20028<-opiosq
0()+3993<-kpooprx()+274<-kpoal8()+800<-opiodr()+910<-ttcpip()+2289<-opitsk()+
1670<-opiino()+966<-opiodr()+910
<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+252<-main()+201<
-__libc_start_main()+244<-_start()+36
----- End of Abridged Call Stack Trace -----
WAIT #140442102181424: nam='rdbms ipc reply' ela= 1350 from_process=19 timeou
t=2147483647 p3=0 obj#=-1 tim=1313673029970397
kfdp_checkDsk(): 23
----- Abridged Call Stack Trace -----
kfdp_checkDsk(): 24
----- Abridged Call Stack Trace -----
ksedsts()+461<-kfdp_checkDsk()+476<-kfdCheck()+1649<-kfgCheck()+477<-kfxdrvAl
terOne()+5976<-kfxdrvAlter()+2287<-kfxdrvEntry()+1306<-opiexe()+20028<-opiosq
0()+3993<-kpooprx()+274<-kpoal8()+800<-opiodr()+910<-ttcpip()+2289<-opitsk()+
1670<-opiino()+966<-opiodr()+910
<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+252<-main()+201<
-__libc_start_main()+244<-_start()+36
----- End of Abridged Call Stack Trace -----

The Oracle wait event: Data file init write

The Data file init write wait event appeared while adding a CLOB column with a default value to a large table. This wait event was newly introduced in 10gR2, yet neither the 10gR2 documentation nor the 11g documentation lists it. It generally appears as a foreground wait event while Oracle auto-extends a data file and serially formats the newly allocated space in that file:

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> show parameter compatible

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
compatible                           string      10.2.0.3

/* The database version is 10.2.0.4 and compatible is 10.2.0.3 */

SQL> alter table tv add spare6 clob default '........';
SQL> col event for a30;

SQL> select event, time_waited, current_file#, current_block#
  2    from V$active_Session_History
  3   where event like 'Data file init write'
  4   order by sample_time desc;

EVENT                          TIME_WAITED CURRENT_FILE# CURRENT_BLOCK#
------------------------------ ----------- ------------- --------------
Data file init write                533810             4          54783
Data file init write                442042             4          54783
Data file init write                 47286             4          54783
Data file init write                 42932             4          54783
Data file init write                413477             4          54783
Data file init write                153503             4          54783
Data file init write                 33520             4          54783
Data file init write                307616             4          54783
Data file init write                214404             4          54783
Data file init write                  3400             4          54783
Data file init write                212386             4          54783
Data file init write                192631             4          54783
..............
Data file init write                611157             4          54775
Data file init write                  1379             4          54775
Data file init write                223541             4          54775

Along with the data file extension (which is the direct trigger of the Data file init write wait event), Oracle has to maintain the corresponding data dictionary through the following recursive DML statements:

1. Query the dictionary table file$ to determine the data file numbers of the tablespace:
select file# from file$ where ts#=:1
This operation may be accompanied by the Data file init write wait event:
PARSING IN CURSOR #3 len=36 dep=2 uid=0 oct=3 lid=0 tim=1261083587010014 hv=1570213724 ad='8f7d4210'
select file# from file$ where ts#=:1
END OF STMT
.............
WAIT #14: nam='Data file init write' ela= 1091 count=1 intr=256 timeout=4294967295 obj#=57314 
WAIT #14: nam='Data file init write' ela= 1078 count=1 intr=256 timeout=4294967295 obj#=57314 
WAIT #14: nam='Data file init write' ela= 1102 count=1 intr=256 timeout=4294967295 obj#=57314 
WAIT #14: nam='Data file init write' ela= 1156 count=1 intr=256 timeout=4294967295 obj#=57314 
WAIT #14: nam='Data file init write' ela= 1870 count=1 intr=256 timeout=4294967295 obj#=57314 
WAIT #14: nam='Data file init write' ela= 37 count=1 intr=256 timeout=4294967295 obj#=57314
WAIT #14: nam='Data file init write' ela= 4 count=4294967295 intr=32 timeout=2147483647 obj#=57314 

2. If a bigfile tablespace is used, the following insert into seg$ may appear:
insert into seg$ (file#,block#,type#,ts#,blocks,extents,minexts,maxexts,
extsize,extpct,user#,iniexts,lists,groups,cachehint,bitmapranges,hwmincr, 
spare1, scanhint) values (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13,:14,:15,0,:16,
DECODE(:17,0,NULL,:17),:18)
PARSE #15:c=0,e=368,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1261083586909081
BINDS #15:
kkscoacd
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=2ba79a0e1330  bln=22  avl=03  flg=05
  value=1024

SQL> select file#,block#,blocks from sys.seg$ where file#=1024;

     FILE#     BLOCK#     BLOCKS
---------- ---------- ----------
      1024         82   13434880
      1024   13434962      16384
      1024   13451346      16384
      1024   13467730      16384
      1024   13484114      16384

/* The odd-looking file number 1024 appears */

3. Maintain the user's tablespace quota dictionary data:
update tsq$
   set blocks    = :3,
       maxblocks = :4,
       grantor#  = :5,
       priv1     = :6,
       priv2     = :7,
       priv3     = :8
 where ts# = :1
   and user# = :2

4. Update the seg$ data dictionary base table to extend the data segment:
update seg$
   set type#     = :4,
       blocks    = :5,
       extents   = :6,
       minexts   = :7,
       maxexts   = :8,
       extsize   = :9,
       extpct    = :10,
       user#     = :11,
       iniexts   = :12,
       lists     = decode(:13, 65535, NULL, :13),
       groups    = decode(:14, 65535, NULL, :14),
       cachehint = :15,
       hwmincr   = :16,
       spare1    = DECODE(:17, 0, NULL, :17),
       scanhint  = :18
 where ts# = :1
   and file# = :2
   and block# = :3

In the data file space extension flow above, only the query "select file# from file$ where ts#=:1" is accompanied by the "Data file init write" wait.
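Since the wait is tied to data file autoextension, it is worth checking which files can autoextend and by how much; pre-extending the file or enlarging the increment reduces how often newly added space has to be formatted. A sketch (the tablespace name is just a placeholder; INCREMENT_BY is expressed in blocks):

select file_name, autoextensible, increment_by, bytes/1024/1024 size_mb
  from dba_data_files
 where tablespace_name = 'USERS';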

The SQL*Net break/reset to client wait event

Normally this wait event cannot be seen in the dynamic views (v$session/v$session_wait) because it is extremely short-lived. Taken literally, its essence is a network wait (a network issue);
for example, if the running code contains a possible error and that error is triggered during the call, the local server process is obliged to notify the remote client. While doing so, the server process sits in the SQL*Net break/reset to client wait until the client has received the error information. Just as in the ordinary server-client model, this wait event can also appear when a dblink is used.
Let us try to demonstrate this wait event:

SQL>  create table tv (t1 int unique);

Table created.

SQL> insert into tv values(1);

1 row created.

SQL> commit;

Commit complete.

SQL> oradebug setmypid;
Statement processed.
SQL>  oradebug event 10046 trace name context forever,level 8;
Statement processed.
SQL> insert into tv values(1);
insert into tv values(1)
*
ERROR at line 1:
ORA-00001: unique constraint (SYS.SYS_C009343) violated

SQL>  oradebug event 10046 trace name context off;
Statement processed.
SQL> select event, total_waits from v$session_event where event like '%break%' and sid =(select distinct sid from v$mystat);

EVENT                                                            TOTAL_WAITS
---------------------------------------------------------------- -----------
SQL*Net break/reset to client                                              2

We can see that two SQL*Net break/reset to client waits occurred. Further analysis of the trace file shows the following records:

EXEC #2:c=8000,e=9020,p=0,cr=10,cu=9,mis=0,r=0,dep=0,og=1,plh=0,tim=1279117896316991
ERROR #2:err=1 tim=1279117896317039
STAT #2 id=1 cnt=0 pid=0 pos=1 obj=0 op='LOAD TABLE CONVENTIONAL  (cr=0 pr=0 pw=0 time=0 us)'
WAIT #2: nam='SQL*Net break/reset to client' ela= 33 driver id=1413697536 break?=1 p3=0 obj#=-1 tim=1279117896317323
WAIT #2: nam='SQL*Net break/reset to client' ela= 521 driver id=1413697536 break?=0 p3=0 obj#=-1 tim=1279117896317915
WAIT #2: nam='SQL*Net message to client' ela= 5 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1279117896317964

The two SQL*Net break/reset to client waits took 33us and 521us respectively. Poorly designed applications, or programs with many unhandled exceptions, can both lead to the SQL*Net break/reset to client wait event.
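To find which sessions (and therefore which applications) keep hitting the error path, the per-session totals can be ranked; a minimal sketch:

select se.sid, s.program, se.total_waits
  from v$session_event se, v$session s
 where se.sid = s.sid
   and se.event = 'SQL*Net break/reset to client'
 order by se.total_waits desc;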

single-task message and cursor pin S on X

Recently I found many sessions waiting on the event 'cursor: pin S wait on X' in a customer's AIX 10.2.0.4 instance. It is a new event in 10gR2, introduced after mutexes took the place of latches.

When a session is going to parse or reparse a cursor, it needs to pin that cursor in X (exclusive) mode first. Other sessions that want to execute this child cursor then have to wait until the X pin is released.
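A common way to find the session holding the X pin is to decode P2RAW of the waiters; a hedged sketch, assuming a 64-bit platform where the high-order four bytes of P2RAW carry the holder's SID (on 32-bit platforms only the top two bytes are used):

select sid, event, p1 idn,
       to_number(substr(rawtohex(p2raw), 1, 8), 'XXXXXXXX') blocking_sid
  from v$session
 where event = 'cursor: pin S wait on X';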

I looked for the session that was pinning that cursor in X mode, and saw it was waiting on the event single-task message:
the mutex idn is d322e445 and it is held in EXCL mode; it corresponds to the cursor whose hash is "ff2c2fd459ac5ee188586781d322e445", as shown below:

PROCESS 1753:
----------------------------------------
SO: 70000048e9d5e68, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=1753, calls cur/top: 70000043308cb08/70000043798a6e0, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 109 0 4
last post received-location: kslpsr
last process to post me: 70000048f8deaf0 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 70000048f8deaf0 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000048ca877c0
O/S info: user: orauser, term: UNKNOWN, ospid: 3318124
OSD pid info: Unix process pid: 3318124, image: oracle@p595crm1
.........................
----------------------------------------
SO: 70000048eed0d30, type: 4, owner: 70000048e9d5e68, flag: INIT/-/-/0x00
(session) sid: 3311 trans: 70000046be86948, creator: 70000048e9d5e68, flag: (100041) USR/- BSY/-/-/-/-/-
DID: 0001-06D9-008347C7, short-term DID: 0001-06D9-008347C8
txn branch: 70000046a261488
oct: 3, prv: 0, sql: 700000414e4f570, psql: 700000415f47a90, user: 50/SHUCRM1C
service name: CRMDB1
O/S info: user: pausr12, term: unknown, ospid: 1234, machine: p570web2
program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
waiting for 'single-task message' blocking sess=0x0 seq=22963 wait_time=0 seconds since wait started=6
=0, =0, =0
Dumping Session Wait History
for 'SQL*Net message from dblink' count=1 wait_time=325
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=383
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=265
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=677
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=0
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=237
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
temporary object counter: 0
SO:  70000043e695968, type: 53, owner: 70000048eed0d30, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000043e695968 handle=700000410dd46c8 mode=N                           --the child cursor
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000043e6959e8[70000041e9eb830,70000042411c178] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=CBK[0020]  savepoint=0x0
LIBRARY OBJECT HANDLE: handle=700000410dd46c8 mtx=700000410dd47f8(0)  cdp=0
namespace=CRSR flags=RON/KGHP/PN0/EXP/[10010100]
kkkk-dddd-llll=0000-0001-0001  lock=N pin=X latch#=23 hpc=fffc  hlc=fffc                                     --lock  in NULL mode,pin on X mode
lwt=700000410dd4770[700000410dd4770,700000410dd4770]  ltm=700000410dd4780[700000445169a08,700000453b69228]
pwt=700000410dd4738[700000410dd4738,700000410dd4738]  ptm=700000410dd4748[700000410dd4748,700000410dd4748]
ref=700000410dd47a0[70000043d1b5e58,70000043d1b5e58]  lnd=700000410dd47b8[700000410dd47b8,700000410dd47b8]
LIBRARY OBJECT: object=7000004354983b0
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
DEPENDENCIES: count=4 size=16
TRANSLATIONS: count=2 size=16
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004359e4768 7000004354984c8 I/P/A/-/-    0 NONE   00
6 700000414b0e4d0 700000450c0bc28 I/P/A/-/E    0 NONE   00

----------------------------------------

KGX Atomic  Operation Log 70000041853c068
Mutex 700000414b0e3d8(3311, 0) idn d322e445 oper EXCL
Cursor Pin uid 3311 efd 0 whr 1 slp  0      -- pin sid 3311
opr=3 pso=70000043e695968  flg=0            -- operate code is 3
pcs=700000414b0e3d8 nxt=0 flg=35 cld=0 hd=700000410dd46c8   par=70000041d78b1e0    -- child cursor handle address is 700000410dd46c8
ct=0 hsh=0 unp=0 unn=0 hvl=1d78b4b8 nhv=1  ses=70000048eed0d30                              -- heap 0 pointer address 70000041d78b1e0
hep=700000414b0e458 flg=80 ld=1 ob=7000004354983b0 ptr=700000450c0bc28  fex=700000450c0af38
----------------------------------------
SO: 70000041e9eb7b0, type: 53, owner: 70000048eed0d30, flag:  INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000041e9eb7b0 handle=700000414e4f570 mode=N
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000041e9eb830[70000044590d030,70000043e6959e8] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=[0000]  savepoint=0x4c08a856
LIBRARY OBJECT HANDLE: handle=700000414e4f570 mtx=700000414e4f6a0(0)  cdp=1
name=
SELECT * FROM(SELECT A.*, rownum r FROM( select  account_id,billing_nbr,calling_nbr,called_nbr,to_date,call_duration,charge_item_name,to_char(rate  /100, '99999999999990.99') rate, charge,channel_id from ct_05
where  1=1 and account_id ='300187744'  and key_source_type ='52001'  order by to_date ) A WHERE rownum <= 15 ) B WHERE r > 0
hash=ff2c2fd459ac5ee188586781d322e445 timestamp=06-04-2010 15:14:33
namespace=CRSR flags=RON/KGHP/TIM/PN0/MED/DBN/[50010040]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=23 hpc=0018 hlc=0018
lwt=700000414e4f618[700000414e4f618,700000414e4f618]  ltm=700000414e4f628[700000414e4f628,700000414e4f628]
pwt=700000414e4f5e0[700000414e4f5e0,700000414e4f5e0]  ptm=700000414e4f5f0[700000414e4f5f0,700000414e4f5f0]
ref=700000414e4f648[700000414e4f648,700000414e4f648]  lnd=700000414e4f660[700000414e4f660,700000414e4f660]
LIBRARY OBJECT: object=70000041d78b0c8
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
CHILDREN: size=16
child#    table reference   handle
------ -------- --------- --------
0 70000043d1b61e8 70000043d1b5e58 700000410dd46c8                                 -- the child cursor
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004467f6078 70000041d78b1e0  I/P/A/-/-    0 NONE   00                                      --  heap 0

The session is active, and its wait history indicates that there are dblink operations inside the running SQL. When this session went to reparse the only child cursor for that SQL, it first pinned the child cursor in X mode and then needed more information from the remote DB, but got no response until the SQL*Net message from dblink timed out; it had been waiting on single-task message for more than 6 seconds at the time of tracing.

In this scenario all the other sessions that wanted to execute this child cursor were left waiting, which is unfortunate.

The Oracle documentation describes single-task message as: when running single task, this event indicates that the session waits for the client side of the executable.

A little odd: the blocking session also did some single-task work when it lost a dblink connection.

I searched MetaLink with the keyword 'single-task message'. Few documents were found, but fortunately there was the following note:

Hdr: 7757687 10.2.0.3.0 RDBMS 10.2.0.3.0 PRG INT/DISTR PRODID-5 PORTID-212
Abstract: CURSOR: PIN S WAIT ON X BLOCKER “SINGLE-TASK MESSAGE”

PROBLEM:
——–
The database has multiple processes waiting on "CURSOR: PIN S WAIT ON X".

The holder of the mutex is waiting on "single-task message" for almost 745687 seconds.
The query run by the holder does contain a dblink.

DIAGNOSTIC ANALYSIS:
——————–
All the processes are waiting on the same mutex:

waiting for ‘cursor: pin S wait on X’ blocking sess=0x0
idn=5d779f21, value=7af00000000, where|sleeps=57a0dfdb6

Holder
PROCESS 26:
KGX Atomic Operation Log 700000061490300
Mutex 70000006226e0f0(1967, 0) idn 5d779f21 oper EXCL
Cursor Pin uid 1967 efd 0 whr 1 slp 0
opr=3 pso=7000000670f93a0 flg=0
pcs=70000006226e0f0 nxt=0 flg=35 cld=0 hd=7000000606838f0
par=7000000629006a8
ct=0 hsh=0 unp=0 unn=0 hvl=62900980 nhv=1 ses=700000079543fe0
hep=70000006226e170 flg=80 ld=1 ob=700000062fe69a8
ptr=700000065f124f8 fex=700000065f11808

Session object for this process:-

SO: 700000079543fe0, type: 4, owner: 70000007a6ad538, flag: INIT/-/-/0x00
(session) sid: 1967 trans: 0, creator: 70000007a6ad538, flag: (41)
USR/- BSY/-/-/-/-/-
DID: 0001-001A-00000A38, short-term DID: 0001-001A-00000A39
txn branch: 0
oct: 3, prv: 0, sql: 700000062900e18, psql:
70000005d797bc8, user: 21/V500
O/S info: user: d_c11, term: , ospid: 6271140, machine: aruput5
program: cpm_srvscript@aruput5 (TNS V1-V3)
application name: cpm_srvscript@aruput5 (TNS V1-V3), hash value=0
waiting for ‘single-task message’ blocking sess=0x0 seq=7071
wait_time=0 seconds since wait
started=745687

This bug, which occurs on version 10.2.0.3, is very similar to my case on AIX 10.2.0.4, and so is the stack trace:

ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttrd+0120<-nsprecv+07a0<-
nscon+0218<-nsdo+157c<-nscall3+012c<-nscall+0778<-niotns+0888<-nigcall+0028<-osnco
n+0540<-kpkiadef+006c<-upiini+03fc<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+011
4<-kpnconn+02a4<-npicon0+042c<-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn
+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree
+0318<-qcsprfro_tree+01c8<-qcspafq+0068<-qcspqb+0470<-kkmdrv+003c<-opiSem+13b4<-opiDe
ferredSem+0234<-opitca+01e8<-kksFullTypeCheck+001c<-rpiswu2+034c<-kksLoadChild+30b0<-
kxsGetRuntimeLock+0810<-kksfbc+2930<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0b30<
-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1020<-opitsk+10b0<-opiino+0990<-opiod
r+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098

My case stack trace:
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttcni+01b8<-nttcon+04f4<-ntconn+0160<
-nsopen+0960<-nscall1+01b0<-nscall+049c<-niotns+0880<-nigcall+0028<-osncon+0540<-kpkiadef+006c<
-upiini+0408<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+0114<-kpnconn+02a4<-npicon0+042c<
-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<
-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<-qcspafq+0068<
-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-kkmevw+0a58<-kkmfcbvw+0178<
-kkmfcblo+0e38<-kkmpfcbk+0250<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-kkmdrv+003c<-opiSem+13c0<
-opiprs+01ac<-kksParseChildCursor+05e0<-rpiswu2+034c<-kksLoadChild+1d88<
-kxsGetRuntimeLock+0810<-kksfbc+28b0<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0ae0<
-kpooprx+0168<-kpoal8+0400<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<
-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098

Both of them called OCIKDBLinkConn->OCIKDBLinkConn2->kpndbcon->npicon0->kpnconn->OCIServerAttach->kpuatch and onward.
So far this bug has no patch to apply and no workaround, which is awful.

As “Bug 5111335: STREAMS PROPAGATION STUCK ON “SINGLE-TASK MESSAGE” PRODUCE LIBRARY CACHE LOCK CO” described:

Hdr: 5111335 10.2.0.2 RDBMS 10.2.0.2 STREAMS PRODID-5 PORTID-226
Abstract: STREAMS PROPAGATION STUCK ON “SINGLE-TASK MESSAGE” PRODUCE LIBRARY CACHE LOCK CO
This is a three cluster environment. Two nodes per cluster and on that
cluster running RAC with 2 instances.
The databases involved are:

GCTGCMU is the “GCT” database in GCM = Greenwich USA, will be named USA from
now on
GCTHKGU is the “GCT” database in HKG = Hong Kong, will be named HKG from now
on
GCTLONU is the “GCT” database in LON = London, will be named LON from now on

On this environment there is a Streams bi-directional replication environment
defined between the 3 databases.

USA database is propagating changes captured to LON and HKG.
There are two propagations defined from this site to every site. 4
propagations as a total.

GCTHKGU_BATCH_PROPAGATE
GCTLONU_BATCH_PROPAGATE
GCTHKGU_DAILY_PROPAGATE
GCTLONU_DAILY_PROPAGATE

Daily propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_DAILY.
Batch propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_BATCH

This databases have been upgraded to 10.2.0.2 from 10.2.0.1 last 13-Mar-2006,
and no problems were reported.

The customer reported that propagations from USA to HKG were stuck on library cache lock.
This situation was resolved by the customer by bouncing the USA instances.

Analyzing the systemstate dumps we could see that there were propagations stuck in this situation, and QMON slaves too.
The library cache lock was on the queue STRMADMIN.STREAMSOUT_DAILY and the owner of the lock was process J000, which was stuck on 'single-task message'.

From the information on the systemstate dump we can see that this process has
been on that situation for more than 3 days and 20 hours.
Based on the call stack it seems that the process was about or in the middle
of a call to a listener.

DIAGNOSTIC ANALYSIS:
——————–
Files to be uploaded:

USA:
– RDA.zip , RDA node 1
– GCTGCMU1_healthchk.html, streams healthcheck node 1
– GCTGCMU1_healthchk.html, streams healthcheck node 2
– gctgcmu2_ora_21349.trc, hanganalyze and systemstate dumps while the hang is occurring. Node 2.
– gctgcmu1_ora_858.trc, hanganalyze and systemstate dumps while the hang is occurring. Node 1.

HKG:
– alert_GCTHKGU1.log, alert log file node 1
– alert_GCTHKGU2.log, alert log file node 2
– listener_hkg0223xus.log_20060320, listener log file node 1.
– listener_hkg0224xus.log_20060319, listener log file node 2.

LON:
– alert_GCTLONU1.log, alert log file node 1
– alert_GCTLONU2.log, alert log file node 2
– listener_lon0223xus.log_20060318, listener log file node 1.
– listener_lon3166xus.log_20060320, listener log file node 2.

Hang was identified at USA on node 1.

Analysis of gctgcmu1_ora_858.trc:

System State 1
~~~~~~~~~~~~~~
21: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=21
42: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=404
51: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=17
67: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=90
71: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=56
76: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=87

System State 2
~~~~~~~~~~~~~~
21: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=21
42: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=404
51: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=17
67: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=90
71: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=56
76: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=87

LOCK: handle=25ea4afc8 QUEU:STRMADMIN.STREAMSOUT_DAILY

Resource Holder State
LOCK: handle=25ea4afc8 50: waiting for ‘single-task message’

PROCESS 50:
~~~~~~~~~~~

OSD pid info: Unix process pid: 12949, image: oracle@slodb001 (J000)

From the call stack it looks like the process was doing a SQL*Net connection to a listener, and during this connection it got hung, because the stack trace is the same for the two process dumps shown in both system states.

(session)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 92/STRMADMIN
waiting for ‘single-task message’ blocking sess=0x(nil) seq=5 wait_time=0
seconds since wait
started=332308
=0, =0, =0

It is running a propagation job: next_date :=
sys.dbms_aqadm.aq$_propaq(job);

Knowing that the seconds in wait are 332308 and the dump of the process
occurred on 2006-03-21 08:15:12.640,
this process is waiting on this event:

1. On USA the process is waiting since 17-MAR-2006 11:56:44
2. That time on London is 17-MAR-2006 16:56:44
3. That time on HONG KONG is 18-MAR-2006 01:56:44

At that time nothing relevant was found on alert log files on listener log
files.

Processes waiting on library cache lock
~~~~~~~~~~~~~~~

21 (PZ99) : Slave of J002
42 (q004) : Slave of process 44 (QMNC)
51 (J001) :
67 (q003) : Slave of process 44 (QMNC)
71 (q005) : Slave of process 44 (QMNC)
51 (TNS shadow) :
77 (q007) : Slave of process 44 (QMNC)

WORKAROUND:
———–
No workaround found so far.
Killing the process or bouncing instance should solve the problem.

RELATED BUGS:
————-
No similar bugs found.

This environment is also involved on bug 5089630

REPRODUCIBILITY:
—————-
Problem has reproduced just one time on customer site.

TEST CASE:
———-

STACK TRACE:
————
For processes waiting on library cache lock we can see different call stacks,
here is the one for process 21:

ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
kslwait kjusuc ksipgetctx kqlmli kgllkal kglget
kglgob kwqicgob kwqicrqo0 kwqmsnrmsg kwqsldqs kwqsif
qerfxFetch qervwFetch qertqoFetch qerpxSlaveFetch qerpxFetch
opiexe kpoal8 opiodr kpoodr upirtrc kpurcsc kpuexecv8 kpuexec OCIStmtExecute

kxfxsStmtExecute kxfxsExecute kxfxsp kxfxmai kxfprdp opirip opidrv sou2o
opimai_real main __libc_start_main _start

For process waiting on single-task message, we can see the following call
stack:

ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
clntudp_call do_ypcall yp_match internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname snlpcgthstbynm snlinGetAddrInfo nttbnd2addr ntacbbnd2addr
ntacbnd2addr
nsc2addr nscall1 nscall niotns nigcall osncon
kpkiadef upiini upiah0 kpuatch OCIServerAttach kpnconn
npicon0 kpndbcon OCIKDBLinkConn2 OCIKDBLinkConn kwqpcon kwqpgps
spefmccallstd pextproc peftrusted psdexsp rpiswu2 psdextp pefccal pefca
pevm_FCAL pfrinstr_FCAL pfrrun_no_tool pfrrun plsql_run peicnt kkxexe opiexe
opiodr rpidrus skgmstack rpidru rpiswu2 rpidrv rpiexe kkjex1e kkjsexe kkjrdp

opirip opidrv sou2o opimai_real main libc_start_main start

I do not see any Oracle bug here but it is hard to be
certain without the SQLNET.ORA settings and multiple
snapshots of the problem process. It is quite valid
for an Oracle session to be holding a library cache lock
when it makes an outbound DBlink call. If another session
wants that lock in an incompatible mode then it has to wait.

In this case the blocking session making the outbound link
call appears from the one stack dump to be in OS code
under gethostbyname().
However it could have been spinning down the stack in
Oracle code but I cannot comment on the info we have.

The “single-task message” wait event is started / ended
in kpnconn() so if there was a “spin” it would have to be
above that point in the stack.

In 10.2 there is the option to configure NET to timeout
on outbound connect attempts by setting
SQLNET.OUTBOUND_CONNECT_TIMEOUT = N seconds.
The timeout occurs in niotns() in the stack so if this
is set you would expect a session in this state to then
timeout which may be useful if there are underlying issues with
TCP hostname resolution.
(Watch out for bug 4933023 if you use this SQLNET parameter).

If the issue recurs then get CPU usage info, multiple stacks
and you may want to attach with gdb and see if the code ever
returns above gethostbyname or not to get an idea if this

Sorry , my explanation is probably not clear .

Here is the stack you have but annotated:

Callback to get the short stack. Confirms we are not HUNG in a device
driver.
ksdxfstk
ksdxcb
sspuser
pthread_sighandler_rt
GI___libc_sigaction

Inside OS supplied code executing gethostbyname() C call.
GI___poll
clntudp_call
do_ypcall
yp_match
internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname

Oracle Net code:
snlpcgthstbynm
snlinGetAddrInfo
nttbnd2addr
ntacbbnd2addr
ntacbnd2addr
nsc2addr
nscall1
nscall
niotns <<< This is where SQLNET.OUTBOUND_CONNECT_TIMEOUT
is implemented IF it has been configured.
nigcall

Oracle RDBMS code:
osncon
kpkiadef
upiini
upiah0
kpuatch
OCIServerAttach
kpnconn <<< This is where “single-task message” starts/ends
npicon0

If we just get two stack snapshots of the process, as was the case, and
both have the above stack then we can say:
As the “seq=” of both snaps is the same (5) then we did not return
below kpnconn() in the call stack.

But as poll() is typically a blocking call, even if only for a short
time, then we cannot be sure if this process was blocked permanently
in the poll() waiting on some large timout or similar, or if the
code was executing and looping. If it was looping it could have done so
at any point on the above stack down to kpnconn() where we would have
changed the seq# on the wait event if we exited here.

We can say that as the short stack dump triggered a callout then the
Linux kernel was not stuck inside a device driver.

We can also say it looks like the gethostbyname is in YP code doing a UDP
call so at OS level you may want to check out the name resolution
configuration. You probably don't want to be falling all the way back to
UDP requests to get hostnames resolved.

So if the problem recurs it would help to get:
ps information of the process (to see if it uses CPU at all)

strace for a few minutes – this should help show if the
process is entering / leaving any system call (eg: poll)

A full call stack with arguments from either gdb or errorstack
and these should help show the arguments passed up the stack.

In gdb you can also do something like attach to the problem process
and :
break osncon
break nigcall
break niotns
etc.. up the stack to gethostbyname and then “cont”
and it should stop if it enters into those functions from below
helping indicate a spin/loop point below that point.

It may also be worth getting Linux “crash” information for the
process to see what the process looks like from the Linux side
but follow up with the OS team on that side.

SQLNET.OUTBOUND_CONNECT_TIMEOUT is a backup mechanism to help prevent

This note claims that a session may hold a library cache lock and wait on single-task message after it makes an outbound dblink call. If another session wants that lock in an incompatible mode, it has to wait.
The "single-task message" wait event is started/ended in kpnconn(), so if there was a "spin" it would have to be above that point in the stack. Oracle Support advises setting SQLNET.OUTBOUND_CONNECT_TIMEOUT so that a session in this state eventually times out, which may be useful if there are underlying issues with TCP hostname resolution. The principle in these cases is identical.
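A minimal sqlnet.ora sketch of that advice (the 60-second value is only an illustration; the parameter goes in the sqlnet.ora of the instance making the outbound dblink call):

# sqlnet.ora on the local (calling) database server
SQLNET.OUTBOUND_CONNECT_TIMEOUT = 60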
Digging deeper, you can find this note:

Hdr: 8427478 10.2.0.3 RDBMS 10.2.0.3 PI/DISTRIB PRODID-5 PORTID-23 ORA-12170
Abstract: WHEN USING DBLINK,IT CONNECT TWICE AND TAKE DOBULE-TIME FOR TIME-OUT( ORA-12170)
PROBLEM:
——–
When a client connects to the database using TCP/IP, if there are problems in the network or on the server, the client receives ORA-12170 (ETIMEDOUT).
The timeout depends on OS TCP parameters (e.g. the tcp_syn_retries parameter on Linux).

ERROR:
ORA-12170 : TNS:Connect timeout occurred

** sqlnet.log ***************************************************
VERSION INFORMATION:
TNS for Linux: Version 10.2.0.3.0 – Production
TCP/IP NT Protocol Adapter for Linux: Version 10.2.0.3.0 – Production
Time: 10-APR-2009 21:38:41
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-505: Operation timed out
nt secondary err code: 110 <= ETIMEDOUT nt OS err code: 0 Client address:
** sqlnet.log ***************************************************

client server
| |
connect() |
|——-x–>|——–
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |when setting tcp_syn_retries=5(default)
|——-x–>| | It takes about 189 seconds for time-out.
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
———————
ORA-12170 occur.

But a connection via DBLINK sometimes tries to connect twice, and it then takes double the time to time out.

local site remote site
| |
connect() |
|——-x–>|——–
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
ORA-12170 occurs, but it is not returned to the client,
and the server process on the local site automatically tries to connect again (2nd connect).
connect() | |
|——-x–>| |
| SYN | |
|——-x–>| | It takes about 378 seconds for time-out.
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
———————
ORA-12170 occur.

The problem is that this behavior doubles the time-out when a DBLINK is used.

DIAGNOSTIC ANALYSIS:
——————–
We got the stack traces while trying to connect.

1st connect
connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335<- nscall1()+449<- nscall()+609<-
niotns()+1678<- nigcall()+112<- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104<-
kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463<- npicon0()+1250<- kpndbcon()+909<-
OCIKDBLinkConn2()+40<- OCIKDBLinkConn()+36<- ddfnet3Share()+151<- kksarm()+480<- kksauc()+481<-
kksCheckCriteria()+1711<- kksCheckCursor()+226<- kksSearchChildList()+1496<- kksfbc()+10276<-
kkspsc0()+1903<- kksParseCursor()+132<- opiosq0()+1349<- kpooprx()+215<- kpoal8()+673<-
opiodr()+985<- ttcpip()+1093<- opitsk()+1031<- opiino()+821<- opiodr()+985<- opidrv()+466<-
sou2o()+91<- opimai_real()+117<- main()+111<- __libc_start_main()+161

2nd connect
connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335<- nscall1()+449<- nscall()+609<-
niotns()+1678<- nigcall()+112<- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104<-
kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463<- npicon0()+1250<- kpndbcon()+909<-
OCIKDBLinkConn2()+40<- OCIKDBLinkConn()+36<- ddfnet2Normal()+204<- kkmfcbrm()+132<-
__PGOSF172_kkmpfcbk()+473<- qcsprfro()+503<- qcsprfro_tree()+302<- qcsprfro_tree()+118<-
qcspafq()+102<- qcspqb()+347<- kkmdrv()+60<- opiSem()+1044<- opiDeferredSem()+283<-
opitca()+275<- __PGOSF302_kksFullTypeCheck()+20<- rpiswu2()+334<- kksLoadChild()+5852<-
kxsGetRuntimeLock()+1314<- kksfbc()+12270<- kkspsc0()+1903<- kksParseCursor()+132<-
opiosq0()+1349<- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093<-
opitsk()+1031<- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<- opimai_real()+117<-
main()+111<- __libc_start_main()+161

WORKAROUND:
-----------
Execute the following command before connecting:
SQL> alter system flush shared_pool;

You can find the kpnconn call in the stack trace, so the single-task message wait may span two dblink connection attempts and take twice as long to time out.
The document offers one workaround: flushing the shared pool before connecting, which I am skeptical about.
In summary, there is still no mature fix for this problem; all we can do is pray.
In my opinion, version 10.2.0.4 is stable enough in most layers, but quite a few bugs and unexpected behaviors still reside in K2 (the Kernel Distributed Execution Layer). Distributed systems are a genuinely complicated subject in computer science.
