Grid Infrastructure (GI) Fails to Start in a Flex ASM Environment Because crsd Cannot Start (Doc ID 2504902.1)

Applies to:

Oracle Database Cloud Service - Version N/A and later
Oracle Database - Enterprise Edition - Version 12.2.0.1 and later
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Information in this document applies to any platform.

 

 

Symptoms

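In a Flex ASM environment, Grid Infrastructure (GI) fails to start on a node while GI is running on one or more other nodes.
The output of "crsctl stat res -t -init" shows that every resource except ora.crsd is up.
ora.crsd itself is in the OFFLINE or INTERMEDIATE state.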
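The state of ora.crsd can also be read from a script. The following is a minimal sketch, not part of the original note: the helper name crsd_state is hypothetical, and it assumes that "crsctl stat res ora.crsd -init" prints its attributes one per line, including a line of the form STATE=... .

```shell
# crsd_state: hypothetical helper, not an Oracle-supplied tool.
# Reads the attribute listing of ora.crsd on stdin and prints the first
# word of the STATE= line (for example ONLINE, OFFLINE or INTERMEDIATE).
crsd_state() {
    awk -F= '$1 == "STATE" { split($2, w, " "); print w[1]; exit }'
}

# Usage on the affected node:
#   crsctl stat res ora.crsd -init | crsd_state
```

Anything other than ONLINE printed here, while the remaining -init resources are up, matches the symptom described above.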

The cluster alert.log reports the following errors:

2018-04-05 15:16:53.918 [CRSD(2697)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 2697
2018-04-05 15:17:00.608 [CRSD(2697)]CRS-1013: The OCR location in an ASM disk group is inaccessible. Details in /u01/app/grid/diag/crs/sp1frhodb102/crs/trace/crsd.trc.
2018-04-05 15:17:00.615 [CRSD(2697)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage Storage layer error [Insufficient quorum to open OCR devices] [0]]. Details at (:CRSD00111:) in /u01/app/grid/diag/crs/sp1frhodb102/crs/trace/crsd.trc.

 

The trace file crsd.trc reports the following errors:

2018-04-05 15:17:01.732 : CLSCRED:2919112768: (:CLSCRED0101:)clsCredDomInitRootDom: Using user given storage context for repository access.
2018-04-05 15:17:01.757 : OCRRAW:2919112768: 8033 Error 4 querying length of attr ASM_DISCOVERY_ADDRESS
2018-04-05 15:17:01.761 : OCRRAW:2919112768: 8033 Error 4 querying length of attr ASM_STATIC_DISCOVERY_ADDRESS

2018-04-05 15:17:01.798 : CLSCRED:2919112768: (:CLSCRED1079:)clsCredOcrKeyExists: Obj dom : SYSTEM.credentials.domains.root.ASM.Self.076fa97b2ac84f70ff7035254e98f38d.root not found
2018-04-05 15:17:01.798 : OCRRAW:2919112768: 7755 Error 4 opening dom root in 0x4d37e30

2018-04-05 15:17:01.816 : OCRRAW:2919112768: kgfnConnect2: kgfnGetBeqData failed

2018-04-05 15:17:01.816*:kgfn.c@4933: kgfnConnect2: kgfnGetBeqData failed
2018-04-05 15:17:01.816 : CSSCLNT:2919112768: clsssinit: initialized context: (0x4f23d50) flags 0x104
2018-04-05 15:17:01.821 : CSSCLNT:2919112768: clsssterm: terminating context (0x4f23d50)
2018-04-05 15:17:01.862 : OCRRAW:2919112768: kgfnConnect2Int: cstr=(DESCRIPTION=(TCP_USER_TIMEOUT=1)(TRANSPORT_CONNECT_TIMEOUT=60)(EXPIRE_TIME=1)(ADDRESS_LIST=(LOAD_BALANCE=ON)(ADDRESS=(PROTOCOL=tcp)(HOST=nn.nn.255.13))(PORT=1526)))(CONNECT_DATA=(SERVICE_NAME=+ASM)))

2018-04-05 15:17:01.862*:kgfn.c@6685: kgfnConnect2Int: cstr=(DESCRIPTION=(TCP_USER_TIMEOUT=1)(TRANSPORT_CONNECT_TIMEOUT=60)(EXPIRE_TIME=1)(ADDRESS_LIST=(LOAD_BALANCE=ON)(ADDRESS=(PROTOCOL=tcp)(HOST=nn.nn.255.13)(PORT=1526)))(CONNECT_DATA=(SERVICE_NAME=+ASM)))
2018-04-05 15:17:01.862*:kgfn.c@6853: kgfnConnect2Int: OCISessionBegin failed
2018-04-05 15:17:03.139 : OCRRAW:2919112768: kgfnRecordErr 1017 OCI error:
ORA-01017: invalid username/password; logon denied

2018-04-05 15:17:03.139*:kgfn.c@1707: kgfnRecordErrPriv: 1017 error=ORA-01017: invalid username/password; logon denied

2018-04-05 15:17:03.140 : default:2919112768: clsCredDomClose: Credctx deleted 0x4d45890
2018-04-05 15:17:03.140 : OCRRAW:2919112768: kgfnConnect2: failed to connect

2018-04-05 15:17:03.140*:kgfn.c@5253: kgfnConnect2: failed to connect
2018-04-05 15:17:03.140 : OCRRAW:2919112768: kgfnConnect2Retry: failed to connect connect after 1 attempts, 143s elapsed

2018-04-05 15:17:03.140 : OCRRAW:2919112768: kgfo_kge2slos error stack at kgfoAl06: ORA-01017: invalid username/password; logon denied
ORA-27300: OS system dependent operation:sslssunreghdlr failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: sskgpreset1
ORA-15077: could not locate ASM instance serving a required diskgroup
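To confirm quickly that a trace file shows this failure signature, the two key errors in the excerpt above can be searched for. This is a sketch under assumptions: the helper name has_asm_logon_failure is hypothetical, and it treats the combination of ORA-01017 and ORA-15077 as the signature, which is an interpretation of the excerpt rather than an Oracle-documented rule.

```shell
# has_asm_logon_failure: hypothetical helper.
# Succeeds (exit status 0) only when the given trace file contains both
# ORA-01017 and ORA-15077, the pair of errors shown in the excerpt.
has_asm_logon_failure() {
    trace_file=$1
    grep -q 'ORA-01017' "$trace_file" && grep -q 'ORA-15077' "$trace_file"
}

# Usage, with the trace path named in the CRS-1013 message:
#   has_asm_logon_failure /u01/app/grid/diag/crs/sp1frhodb102/crs/trace/crsd.trc \
#       && echo "crsd could not log on to a remote ASM instance"
```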

 

 

Cause

 

The usual causes are:

1) sqlnet.ora contains SQLNET.AUTHENTICATION_SERVICES=none.
This setting disables any OS authentication that crsd needs in order to connect to a remote ASM instance on another node.

2) The ASM password is incorrect.

3) The ASM listener subnet does not match the subnet configured for the private interconnect.
Run "oifcfg getif" to see the subnet used by the private interconnect (cluster interconnect).

The same errors also occur when SQLNET.AUTHENTICATION_SERVICES=all is set.
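The first cause can be checked without opening the file by hand. The sketch below is illustrative only: the helper names auth_services_value and auth_services_suspect are hypothetical, and the parsing assumes the parameter sits on a single line, optionally with spaces and parentheses around the value.

```shell
# auth_services_value: hypothetical helper.
# Prints the lower-cased value of SQLNET.AUTHENTICATION_SERVICES from the
# given sqlnet.ora with spaces, tabs and parentheses removed. Prints
# nothing if the parameter is absent or commented out.
auth_services_value() {
    grep -i '^[[:space:]]*SQLNET\.AUTHENTICATION_SERVICES[[:space:]]*=' "$1" |
        tail -n 1 |
        sed 's/^[^=]*=//' |
        tr -d ' \t()' |
        tr '[:upper:]' '[:lower:]'
}

# auth_services_suspect: succeeds when the value is one of the two
# settings this note associates with the failure (none or all).
auth_services_suspect() {
    case "$(auth_services_value "$1")" in
        none|all) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage, with ORACLE_HOME pointing at the Grid home:
#   auth_services_suspect "$ORACLE_HOME/network/admin/sqlnet.ora" \
#       && echo "SQLNET.AUTHENTICATION_SERVICES needs attention"
```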

 

 

Solution

 

1) If sqlnet.ora contains SQLNET.AUTHENTICATION_SERVICES=none or SQLNET.AUTHENTICATION_SERVICES=all

1) Remove "SQLNET.AUTHENTICATION_SERVICES=none" or "SQLNET.AUTHENTICATION_SERVICES=all" from the Grid home sqlnet.ora file (located in $ORACLE_HOME/network/admin)

2) Restart CRS with the force option

crsctl stop crs -f
crsctl start crs

See "Unable to startup CRS as ASM failed to startup with ORA-01017: invalid username/password; logon denied" (Document 1681849.1).
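The edit to sqlnet.ora in this first solution could be scripted roughly as follows. This is a sketch, not an Oracle-supplied procedure: the helper name remove_auth_services and the ".bak" backup suffix are assumptions, and only single-line settings whose value is none or all are removed.

```shell
# remove_auth_services: hypothetical helper.
# Keeps a copy of the file as <file>.bak, then rewrites the file without
# any line that sets SQLNET.AUTHENTICATION_SERVICES to none or all
# (case-insensitive, optional spaces and parentheses).
remove_auth_services() {
    file=$1
    cp -p "$file" "$file.bak" || return 1
    grep -v -i -E \
        '^[[:space:]]*SQLNET\.AUTHENTICATION_SERVICES[[:space:]]*=[[:space:]]*\(?[[:space:]]*(none|all)[[:space:]]*\)?[[:space:]]*$' \
        "$file.bak" > "$file"
    # grep exits with 1 when no lines remain, which is not an error here.
    [ $? -le 1 ]
}

# Usage, with ORACLE_HOME pointing at the Grid home, before restarting CRS:
#   remove_auth_services "$ORACLE_HOME/network/admin/sqlnet.ora"
```

Keeping the backup makes it easy to compare the file before and after the change if the restart does not resolve the problem.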

2) If the ASM password is incorrect (a possible cause when the sqlnet.ora file is fine)

1) Recreate the ASM password file as described in the MOS note "How to recreate shared ASM password file in 12c GI cluster" (Document 1929673.1)

2) Restart CRS with the force option
crsctl stop crs -f
crsctl start crs

3) If the ASM listener subnet does not match the configured interconnect

Recreate the ASM listener by following step 3 of the section "C. For 12c and 18c Oracle Clusterware with Flex ASM" in Document 283684.1.
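Before recreating the listener, it can help to extract the private interconnect subnet from "oifcfg getif" so that it can be compared with the subnet the ASM listener is configured on. The sketch below is an illustration: the helper name interconnect_subnets is hypothetical, and it assumes the usual four-column output (interface, subnet, scope, type) with cluster_interconnect appearing in the type column.

```shell
# interconnect_subnets: hypothetical helper.
# Reads "oifcfg getif" output on stdin and prints the subnet (second
# column) of every interface whose type column includes
# cluster_interconnect.
interconnect_subnets() {
    awk '$4 ~ /cluster_interconnect/ { print $2 }'
}

# Usage:
#   oifcfg getif | interconnect_subnets
```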

 

A quick workaround is to start ASM manually with sqlplus on the local node.
If ora.crsd is still not online a few minutes after ASM has been started manually, run "crsctl start res ora.crsd -init" as root.
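The waiting step of this workaround could be automated along the following lines. This is a sketch under assumptions: the helper name start_crsd_if_offline, the default wait of 300 seconds and the 30-second polling interval are all illustrative choices (the note only says "a few minutes"), and the check assumes the resource listing contains a line beginning with STATE=ONLINE once crsd is up.

```shell
# start_crsd_if_offline: hypothetical helper; run as root after ASM has
# been started manually.
# Polls ora.crsd until it reports ONLINE or max_wait seconds have passed.
# If it is still not ONLINE, issues the start command given in the note.
start_crsd_if_offline() {
    max_wait=${1:-300}   # assumed default: 5 minutes
    interval=${2:-30}    # assumed polling interval in seconds
    waited=0
    while [ "$waited" -lt "$max_wait" ]; do
        if crsctl stat res ora.crsd -init | grep -q '^STATE=ONLINE'; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    crsctl start res ora.crsd -init
}

# Usage:
#   start_crsd_if_offline          # wait up to 5 minutes
#   start_crsd_if_offline 600 60   # wait up to 10 minutes, poll every minute
```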
