Oracle RAC内部错误:ORA-00600[keltnfy-ldmInit]一例

一套SUNOS上的2节点10.2.0.2 RAC系统日前出现ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []内部错误,错误发生时系统操作人员误使用hostname命令修改了1号主机的主机名,之后陆续出现以上ora-00600错误,同时操作系统日志显示RAC CSS进程意外终止,具体日志如下:

================== OS Message=====================
Jan 10 11:15:10 cupd25k-a root: [ID 702911 user.error] Cluster Ready Services completed waiting on dependencies.
Jan 10 11:15:16 cupd25k-a root: [ID 702911 user.error] Duplicate Oracle CLSMON found. Killing and restarting it.
Jan 10 11:15:16 cupd25k-a root: [ID 702911 user.error] Oracle CSS daemon failed to start up. Check CRS logs for diagnostics.
Jan 10 11:15:16 cupd25k-a root: [ID 702911 user.error] Oracle CLSMON terminated with unexpected status 137. Respawning

/* 这里的Duplicate Oracle CLSMON found 因该指的是OCLSMON进程,
"In Oracle 10.2.0.2 and above there is an additional process called OCLSOMON
which monitors the CSS daemon for hangs or scheduling issues and can reboot a
node if there is a perceived hang. OCLSOMON is spawned in init.cssd and runs
as the Oracle user."
   oclsmon进程在10.2.0.2以后版本被引入,用以监视css进程,
   若发生hang或操作系统调度问题时该进程可能会reboot节点,
   oclsmon进程会被init.cssd脚本spawned.  */

==================oclsmon.log======================
2011-01-10 11:15:11.376
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
2011-01-10 11:15:11.479
Internal Error Information:
  Category: 8
  Operation: skgxnreg: the member number is i
  Location: skgxnreg_7
  Other:
  Dep: 1
2011-01-10 11:15:11.737
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
2011-01-10 11:15:11.751
Internal Error Information:
  Category: 8
  Operation: skgxnreg: the member number is i
  Location: skgxnreg_7
  Other:
  Dep: 1
2011-01-10 11:15:12.006
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
2011-01-10 11:15:12.023
Internal Error Information:
  Category: 8
  Operation: skgxnreg: the member number is i
  Location: skgxnreg_7
  Other:
  Dep: 1
2011-01-10 11:15:12.278
unspecified member number is (1)
Member 1 group OCLSMON_ in use. Is oclsmon already up?
2011-01-10 11:15:12.293
Internal Error Information:
  Category: 8
  Operation: skgxnreg: the member number is i
  Location: skgxnreg_7
  Other:
  Dep: 1

/*  skgxn是Oracle Clusterware用以监视skgxn事件(即第三方CLUSTERWARE相关的事宜,他们应该有用sun的cluster);
    似乎是修改hostname导致了Oracle CSS出现了fatal error,并启动了一个以上的OCLSMON进程(Duplicate Oracle CLSMON found),
    最后"Oracle CSS daemon failed to start up. Check CRS logs for diagnostics",
    在Oracle instance启动的情况下25k-a节点的CSS进程意外终止,
    可能导致该节点上的所有实例的LMD(global Enqueue Service daemon)、LMON无法正常工作而导致实例hang住。*/

==========================alert.log====================
Errors in file /oracle/oracle/admin/BOCPCS/udump/bocpcs1_ora_12320.trc:
ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []

=========================part of trace file===============
*** 2011-01-10 11:11:02.957
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []
Current SQL information unavailable - no session.
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp()+716         CALL     ksedst()             FFFFFFFF7FFF9D40 ?
                                                   000000000 ? 0FFFFFFFF ?
                                                   FFFFFFFF7FFF8EE8 ?
                                                   FFFFFFFF7FFFA640 ?
                                                   000000008 ?
kgerinv()+200        PTR_CALL 0000000000000000     000000002 ? 10638A1CC ?
                                                   000000001 ? 000000000 ?
                                                   10638A000 ? 10638A1CC ?
kgeasnmierr()+28     CALL     kgerinv()            106384B98 ? 000000000 ?
                                                   105D3B940 ? 000000002 ?
                                                   FFFFFFFF7FFFDFF0 ?
                                                   000001430 ?
keltnfy()+784        CALL     kgeasnmierr()        106384B98 ? 1064DCBF0 ?
                                                   105D3B940 ? 000000002 ?
                                                   000000000 ? 00000002E ?
kscnfy()+552         PTR_CALL 0000000000000000     10639B498 ? 38001E7A8 ?
                                                   1055AC5D0 ? 10639B498 ?
                                                   000102C00 ? 10638A1C0 ?
ksucrp()+2436        CALL     kscnfy()             000008000 ? 000808214 ?
                                                   100C4C220 ? 1055C6680 ?
                                                   00000000F ? 000000001 ?
opiino()+2056        CALL     ksucrp()             000106387 ? 380007608 ?
                                                   000000000 ? 000380000 ?
                                                   000106000 ? 106387618 ?
opiodr()+1488        PTR_CALL 0000000000000000     10555A000 ?
                                                   FFFFFFFF7FFFF1C8 ?
                                                   00010555A ? 000106000 ?
                                                   105C83000 ? 000000001 ?
opidrv()+828         CALL     opiodr()             106391000 ? 000000000 ?
                                                   106390DD8 ? 106390000 ?
                                                   106391BD0 ? 000106000 ?
sou2o()+80           CALL     opidrv()             106394358 ? 000000001 ?
                                                   00000003C ? 000000000 ?
                                                   00000003C ? 000106000 ?
opimai_real()+124    CALL     sou2o()              FFFFFFFF7FFFF788 ?
                                                   00000003C ? 000000004 ?
                                                   FFFFFFFF7FFFF7B0 ?
                                                   105C82000 ? 000105C82 ?
main()+152           CALL     opimai_real()        000000002 ?
                                                   FFFFFFFF7FFFF888 ?
                                                   103F1BBCC ? 10632DB10 ?
                                                   002411E44 ? 000014400 ?
_start()+380         CALL     main()               000000002 ? 000000008 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFFF898 ?
                                                   FFFFFFFF7FFFF9A8 ?
                                                   FFFFFFFF7C700200 ?

/* 可以看到以上trace文件指出了no session,
    在服务进程启动阶段遭遇了该keltnfy-ldmInit内部错误*/

metalink文档Startup Database Produces Ora-00600: [Keltnfy-Ldminit] [ID 336447.1]
介绍了该内部错误一般由主机上的不当网络配置引起,很显然使用hostname命令修改了一个无法解析的
主机名时可能引发该ORA-00600[keltnfy-ldmInit]内部错误。

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.3 - Release: 10.2 to 10.2
Information in this document applies to any platform.
***Checked for relevance on 09-Jun-2010***
Symptoms

An startup nomount on Oracle 10g Release 2 database produces the following exception in alert log

Starting up ORACLE RDBMS Version: 10.2.0.1.0.
Errors in file /opt/oracle/10.2/admin/ORCL/udump/ORCL_ora_535.trc:
ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []
USER: terminating instance due to error 600
Instance terminated by USER, pid = 535
Cause

The problem is related to getting host information.
In this case, ldmInit()/sldmInit() is failing with error 46 : LDMERR_HOST_NOT_FOUND

The following exception may also occur :

LDMERR_SOSD_INIT         OSD init failed to be specific in these OSD failures
 LDMERR_BAD_ADDR         bad address when system call gethostname failed
 LDMERR_HOST_NOT_FOUND   gethostbyname system call fails
 LDMERR_NO_SUPPORT       when specific address type is not supported

Development has fixed two bugs so far regarding this issue

Bug:5438154 - Abstract: ORA-600[KELTNFY-LDMINIT]  STARTING THE DB
Release Notes:
ldmInit returned LDMERR_HOST_NOT_FOUND for the machine huge alias list/address list
Workaround:
reduce the alais list of the machine

Bug:5486074 - Abstract: ORA-600 [KELTNFY-LDMINIT] WHEN DNS IS NOT AVAILABLE
Release Notes:
Internal error is raised by the Server Generated Alert subsystem when it can not determine Host Name or
Network Address. This can be caused by DNS server being unaavilable. 

Solution

The fix for 5486074 will not fix any underlying error from gethostbyname(), it just change the internal error to a warning message :

 "Warning: keltnfy call to ldmInit failed with error 46"

You will still need to fix the network config issue.  

These are the check you can do verify the host information 

      Check permission on /etc/hosts 

$ ls -l /etc/hosts
-rw-r--r--  2 root root 194 Oct 17  2006 /etc/hosts

      Check if /etc/hosts file is correctly configured

              ( all of this on one line ). 

Check the hostname:
$ hostname
$ ping `hostname`

Make sure you are able to ping the hostname
      Check if /etc/nodename is correctly configured

If you have DNS setup, ping is not a tool to diagnose DNS problem. A better tool to use is nslookup, dnsquery, or dig.

$ nslookup
$ nslookup
$ nslookup 

The forward and reverse lookup should succeed and return consistent address/info.  

 Check nsswitch.conf

$ more nsswitch.conf
hosts:      files dns
Make sure host lookup is also done through the /etc/hosts file and not just dns.  It is recommended that FILES come first before DNS.
Also, check the resolv.conf. This makes sure that the DNS is working properly.

显然在生产主机上使用hostname命令是危险的,因为你很难保证你在打字的时候不会因为同事的一下拍击而输错,有人说在生产环境中rm命令因该被禁用,那么这种特殊待遇对hostname命令也适用,我们可以用什么来代替hostname查看主机名呢?选择可以有非常多,这里我推荐一种:

-bash-3.00$ oslevel -r 
5300-07

-bash-3.00$ hostname
askmac.cn

-bash-3.00$ uname -n
askmac.cn

/* uname -n完全可以满足你的需要! */
That's great!

沪ICP备14014813号-2

沪公网安备 31010802001379号