Oracle RAC 12.1+: improvements to the split-brain node eviction algorithm

Starting with Oracle 12.1, the node eviction algorithm used when a RAC split brain occurs has changed. Previously, in a two-node RAC the node that was not the lowest-numbered node (the master node) was always evicted; in practice that meant node 2 was killed. From 12.1 onward, the clusterware maintains a weight for each node (or sub-cluster), calculated mainly from the services it runs and the workload connected to those services. When a split brain occurs, the node with the lower weight, i.e. the more lightly loaded node, is evicted, so that the node serving more users survives. See 12c: Which Node Will Survive when Split Brain Takes Place (Doc ID 1951726.1) and the post "Split Brain: What's new in Oracle Database 12.1.0.2?" for details.

 

Understanding the node survival policy after a split brain, starting with 12.1.0.2.

In 11.2 and earlier, the node with the lower node number survives a split brain. Starting with 12.1.0.2, the concept of node weight is introduced, and during split-brain resolution the node with the higher weight survives.

 

 

The function that checks the weight here is clssnmrCheckNodeWeight. The clssnm prefix refers to Node Monitoring (clssnm.c) – node monitoring (NM) is used to verify the health of all members of the cluster, and it maintains consistency with vendor clusterware (if it exists) via skgxn.

 

12c: Which Node Will Survive when Split Brain Takes Place (Doc ID 1951726.1)

 

PURPOSE
To understand the new behavior, from 12.1.0.2, of which node will survive when split brain takes place.

DETAILS
In 11.2 or older versions, the lowest-numbered node will survive when a split brain takes place; however, this has changed in 12.1.0.2 with the introduction of node weight. Starting from 12.1.0.2, during split brain resolution the node with the higher weight will survive:

2014-11-24 14:25:41.140603 : CSSD:1117321536: clssnmrCheckNodeWeight: node(1) has weight stamp(0), pebble(0)
2014-11-24 14:25:41.140609 : CSSD:1117321536: clssnmrCheckNodeWeight: node(2) has weight stamp(311972654), pebble(3)
2014-11-24 14:25:41.140612 : CSSD:1117321536: clssnmrCheckNodeWeight: stamp(311972654), completed(1/2)
2014-11-24 14:25:41.140615 : CSSD:1117321536: clssnmrCheckSplit: Waiting for node weights, stamp(311972654)
2014-11-24 14:25:41.188880 : CSSD:1084811584: clssnmvDiskKillCheck: not evicted, file /dev/raw/raw2 flags 0x00000000, kill block unique 0, my unique 1416805718
2014-11-24 14:25:41.558921 : CSSD:1114167616: clssnmvDiskPing: Writing with status 0x3, timestamp 1416810341/1022717334
2014-11-24 14:25:41.731912 : CSSD:1086388544: clssnmvDHBValidateNCopy: node 1, node1, has a disk HB, but no network HB, DHB has rcfg 311972655, wrtcnt, 9527468, LATS 1022717514, lastSeqNo 9527467, uniqueness 1416808381, timestamp 1416810341/1022722074
2014-11-24 14:25:41.731928 : CSSD:1086388544: clssnmvReadDskHeartbeat: manual shutdown of nodename node1, nodenum 1 epoch 1416810341 msec 1022722074
2014-11-24 14:25:41.732266 : CSSD:1117321536: clssnmrCheckNodeWeight: node(2) has weight stamp(311972654), pebble(3)
2014-11-24 14:25:41.732273 : CSSD:1117321536: clssnmrCheckNodeWeight: stamp(311972654), completed(1/1)
2014-11-24 14:25:41.732294 : CSSD:1117321536: clssnmCheckDskInfo: My cohort: 2
2014-11-24 14:25:41.732299 : CSSD:1117321536: clssnmRemove: Start
2014-11-24 14:25:41.732306 : CSSD:1117321536: (:CSSNM00007:)clssnmrRemoveNode: Evicting node 1, node1, from the cluster in incarnation 311972655, node birth incarnation 311972654, death incarnation 311972655, stateflags 0x225000 uniqueness value 1416808381

The number of resources executing on each node, among other factors, is taken into account by the weight.
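Besides the automatically computed workload weight, 12.1.0.2 also lets an administrator bias the outcome directly. A minimal sketch, assuming the CSS_CRITICAL server attribute available in 12.1.0.2 and later, run as root on the node you want to favour (it only takes effect after the clusterware stack on that node is restarted):

# mark this server as critical so its sub-cluster is preferred during split-brain resolution
$GRID_HOME/bin/crsctl set server css_critical yes
# verify the setting
$GRID_HOME/bin/crsctl get server css_critical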

Some issues when installing Oracle 11.2.0.4 RAC on RHEL 7.4

cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

 

1. The compat-libcap1.x86_64 package must be installed:

yum install compat-libcap1.x86_64

Otherwise root.sh fails with an error about the missing library:

clscfg.bin: error while loading shared libraries: libcap.so.1: cannot open shared object file: No such file or directory
Failed to create keys in the OLR, rc = 127, Message:
  Failed to write the checkpoint: with status:FAIL. Error code is 256
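Before running root.sh it is worth verifying the library is really there on every node; a quick check (a sketch, assuming the standard /lib64 location used by compat-libcap1):

rpm -q compat-libcap1          # should return the installed package, not "is not installed"
ls -l /lib64/libcap.so.1       # the exact library clscfg.bin complains about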

 

2. Before running root.sh, patch GRID_HOME with patch 18370031; otherwise the installation fails because RHEL 7 changed some of the rc scripts.

The error looks like this:

 

The following environment variables are set as:
ORACLE_OWNER= oracle
ORACLE_HOME=  /u01/app/11.2.0/grid
 
Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of “dbhome” have not changed. No need to overwrite.
The contents of “oraenv” have not changed. No need to overwrite.
The contents of “coraenv” have not changed. No need to overwrite.
 
Creating /etc/oratab file…
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
Creating trace directory
User ignored Prerequisites during installation
Installing Trace File Analyzer
OLR initialization – successful
root wallet
root wallet cert
root cert export
peer wallet
profile reader wallet
pa wallet
peer wallet keys
pa wallet keys
peer cert request
pa cert request
peer cert
pa cert
peer root cert TP
profile reader root cert TP
pa root cert TP
peer pa cert TP
pa peer cert TP
profile reader pa cert TP
profile reader peer cert TP
peer user cert
pa user cert
Adding Clusterware entries to inittab
ohasd failed to start
Failed to start the Clusterware. Last 20 lines of the alert log follow:
2018-03-08 10:20:24.544:
[client(17856)]CRS-2101:The OLR was formatted using version 3.

 

Patch 18370031 (Patch 18370031: RC SCRIPTS (/ETC/RC.D/RC.*, /ETC/INIT.D/*) ON OL7 FOR CLUSTERWARE) has been rolled into the merge patch 24333766 (Patch 24333766: MERGE REQUEST ON TOP OF 11.2.0.4.0 FOR BUGS 18370031 20954311; its unique patch ID is 20433339, which is the number that shows up in the opatch output below), so currently only 24333766 can be downloaded.

 

 

 

opatch napply -local 20433339



opatch lsinventory
Oracle Interim Patch Installer version 11.2.0.3.23
Copyright (c) 2020, Oracle Corporation.  All rights reserved.


Oracle Home       : /u01/app/11.2.0/grid
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/11.2.0/grid/oraInst.loc
OPatch version    : 11.2.0.3.23
OUI version       : 11.2.0.4.0
Log file location : /u01/app/11.2.0/grid/cfgtoollogs/opatch/opatch2020-03-06_09-37-02AM_1.log

Lsinventory Output file location : /u01/app/11.2.0/grid/cfgtoollogs/opatch/lsinv/lsinventory2020-03-06_09-37-02AM.txt
--------------------------------------------------------------------------------
Local Machine Information::
Hostname: rac1
ARU platform id: 226
ARU platform description:: Linux x86-64

Installed Top-level Products (1): 

Oracle Grid Infrastructure 11g                                       11.2.0.4.0
There are 1 products installed in this Oracle Home.


Interim patches (1) :

Patch  24333766     : applied on Fri Mar 06 08:15:04 EST 2020
Unique Patch ID:  20433339
Patch description:  "OCW Interim patch for 24333766"
   Created on 30 Nov 2016, 12:56:34 hrs PST8PDT
   Bugs fixed:
     18370031, 20954311



--------------------------------------------------------------------------------

OPatch succeeded.

 

Note that before applying the patch above, OPatch must first be upgraded to the latest version, so download two packages from support.oracle.com before installing: p6880880_112000_Linux-x86-64.zip and p24333766_112040_Linux-x86-64.zip.
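A minimal sketch of the OPatch upgrade itself, assuming the grid home is /u01/app/11.2.0/grid, the grid software owner runs the commands, and both zip files were copied to /tmp:

cd /u01/app/11.2.0/grid
mv OPatch OPatch.orig                               # keep the old OPatch just in case
unzip -q /tmp/p6880880_112000_Linux-x86-64.zip      # unpacks a fresh OPatch directory into the grid home
OPatch/opatch version                               # should now report the newer OPatch version

After that, unzip p24333766_112040_Linux-x86-64.zip and apply it with opatch napply -local as shown above, following the patch README instructions for a not-yet-configured grid home.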

3. Install the fuser command

yum install psmisc

 

 

4. For the udev configuration on RHEL 7, see this article: https://gruffdba.wordpress.com/2017/02/20/udev-rules-for-asm-disks-on-rhel7/

It is reproduced here for convenience:

 

 

 

On this blog and elsewhere you will find UDEV rules examples for setting device ownership and naming consistency on older versions of Linux.

With RHEL7 some of the syntax has changed slightly.

This example was created using OEL7 with the Red Hat kernel, but should also work on Red Hat and CentOS.


First, log in as root and check the block device is visible on the Linux host:

[root@unirac02 ~]# ls /dev/sd*
/dev/sda /dev/sda1 /dev/sda2 /dev/sdb /dev/sdb1
In this example I have created a device sdb, and as you can see I have created a partition header on it.

Next, make sure we can see the device’s SCSI ID:

[root@unirac02 ~]# /lib/udev/scsi_id -g -u /dev/sdb
36006016004503e0017f99d58603c7c1e
Next, we are going to create a UDEV rule for this SCSI ID in the file /etc/udev/rules.d/99-oracleasm.rules.

[root@unirac01 ~]# cat /etc/udev/rules.d/99-oracleasm.rules
KERNEL=="sd?", ENV{ID_SERIAL}=="36006016004503e0017f99d58603c7c1e", SYMLINK+="oracleasm/grid1", OWNER="oracle", GROUP="oinstall", MODE="0660"
If you have several devices to add, you can use the following script to automate the rule generation.

[root@unirac02 ~]# mydevs="sdb sdc sdd" ; export count=0 ; for mydev in $mydevs; do ((count+=1)) ; /lib/udev/scsi_id -g -u /dev/$mydev | awk '{print "KERNEL==\"sd?\", ENV{ID_SERIAL}==\""$1"\", SYMLINK+=\"oracleasm/disk"ENVIRON["count"]"\", OWNER=\"oracle\", GROUP=\"oinstall\", MODE=\"0660\""}' ; done
KERNEL=="sd?", ENV{ID_SERIAL}=="36006016004503e0017f99d58603c7c1e", SYMLINK+="oracleasm/disk1", OWNER="oracle", GROUP="oinstall", MODE="0660"
KERNEL=="sd?", ENV{ID_SERIAL}=="36006016004503e0017f99d58603d1a87", SYMLINK+="oracleasm/disk2", OWNER="oracle", GROUP="oinstall", MODE="0660"
KERNEL=="sd?", ENV{ID_SERIAL}=="36006016004503e0017f99d58603d246a", SYMLINK+="oracleasm/disk3", OWNER="oracle", GROUP="oinstall", MODE="0660"
With RHEL7, restarting the UDEV rules is slightly different from previous releases:

[root@unirac02 ~]# /sbin/udevadm control --reload-rules
[root@unirac02 ~]# /sbin/udevadm trigger
Now check, and a new device should be visible under /dev/oracleasm

[root@unirac02 ~]# ls -al /dev/oracleasm/*
lrwxrwxrwx. 1 root root 6 Feb 20 18:38 /dev/oracleasm/grid1 -> ../sdb
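The symlink itself stays owned by root; what matters is that the underlying block device ends up with the grid owner's permissions. A quick check (assuming the rule above matched /dev/sdb):

ls -lL /dev/oracleasm/grid1    # -L dereferences the symlink; expect brw-rw---- oracle oinstall
ls -l /dev/sdb                 # the same check directly against the block device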


Some related documents:

 

Requirements for Installing Oracle 11.2.0.4 RDBMS on OL7 or RHEL7 64-bit (x86-64) (Doc ID 1962100.1)

 

APPLIES TO:

Oracle Database – Standard Edition – Version 11.2.0.4 to 11.2.0.4 [Release 11.2]
Oracle Database – Enterprise Edition – Version 11.2.0.4 to 11.2.0.4 [Release 11.2]
Oracle Database Cloud Schema Service – Version N/A and later
Oracle Database Exadata Cloud Machine – Version N/A and later
Oracle Cloud Infrastructure – Database Service – Version N/A and later
Linux x86-64

PURPOSE

This note explains the requirements that need to be met for a successful installation of Oracle 11gR2 RDBMS release 11.2.0.4 on Red Hat Enterprise Linux 7.0 (or higher 7.x version), 64-bit (x86-64).  These guidelines apply to cluster (RAC) or standalone / single instances.

It is NOT the purpose of this NOTE to repeat every “how-to” step that is presented in the 11gR2 Installation Guide manual. For example this NOTE does not include how to create the Linux OS account named “oracle”, nor does it cover how to set environment variables. Both are adequately covered in Chapter 2 “Oracle Database Pre-installation Requirements” of the 11gR2 Installation Guide manual.

You can download Oracle 11.2.0.4 software from My Oracle Support (patch 13390677)

SCOPE

This procedure is meant for those planning/installing Oracle 11gR2 RDBMS release 11.2.0.4.0 (or higher 11.2.0.x version) on Red Hat Enterprise Linux 7.0 (or higher 7.x version) on the 64-bit (x86-64) platform. Since it is the expressed goal to keep Oracle Linux (OL) functionally IDENTICAL to RHEL, this NOTE is also completely applicable to 64-bit (x86-64) OL 7.0 (or higher 7.x version).

This procedure is not meant for those planning/installing Grid Infrastructure (GI) or any other Oracle products.

DETAILS

Requirements for installing Oracle 11gR2 RDBMS release 11.2.0.4 64-bit on RHEL7 or OL7 64-bit (x86_64)

I. Hardware:
===========
1. Minimum Hardware Requirements
a.) At least 1.0 GB (1024MB) of physical RAM
b.) Swap disk space proportional to the system’s physical memory as follows:

 

RAM                       Swap Space
Between 1 GB and 2 GB     1.5 times the size of RAM
Between 2 GB and 16 GB    Equal to the size of RAM
More than 16 GB           16 GB

 

NOTE: The above recommendations (from the 11.2 Database installation guide) are MINIMUM recommendations for installations. Further RAM and swap space may be required to tune/improve RDBMS performance.

c.) 1.0 GB (1024MB) of disk space (and less than 2TB of disk space) in the /tmp directory.
d.) approximately 4.4 GB of local disk space for the database software.
e.) approximately 1.7 GB of disk space for a preconfigured database that uses file system storage (optional).

2. Refer Note:236826.1 for details on certified filesystems for Oracle Database.

II. Software:
============
1. As is specified in section 1.3.2 of the Oracle Database Installation Guide for 11gR2 on Linux (part number E24321-02), Oracle recommends that you install the Linux operating system with the default software packages (RPMs) and do not customize the RPMs during installation. For additional information on “default-RPMs”, please see Note 376183.1, “Defining a “default RPMs” installation of the RHEL OS” or Note 401167.1, “Defining a “default RPMs” installation of the Oracle Enterprise Linux (OEL) OS”.

2.Linux Kernel Requirements

Oracle Linux 7.0 

  • Oracle Linux 7 with Unbreakable Enterprise Kernel : 3.8.13-33.el7uek.x86_64 or later
  • Oracle Linux 7 with the Red Hat Compatible kernel : 3.10.0-54.0.1.el7.x86_64 or later

Red Hat Enterprise Linux Server 7.0

  • Red Hat Enterprise Linux 7 : 3.10.0-54.0.1.el7.x86_64 or later

NOTE:

  • RHEL7 servers must be running Red Hat kernel 3.10.0-54.0.1.el7 (x86_64) or higher, or UEK kernel 3.8.13-33.el7uek (x86_64) or higher. OL7 servers must also be running kernel 3.8.13-33.el7uek (x86_64) or higher. RHEL does not ship the UEK kernel; only OL7 includes both the UEK and RHCK kernels.
  • Hang issues have been observed in RHEL 7 on servers with many CPU cores and large amounts of RAM when NUMA is enabled. As a workaround it is recommended to turn off NUMA.

3. Required OS Components (per Release Notes, and Install Guide)

a.) The exact version number details of this list are based upon 64-bit (x86_64) RHEL 7.0. When a higher “update” level is used, the RPM release numbers (such as 4.4.4-13) may be slightly different. Since updates of RHEL 7 are certified, this is fine so long as you are still using 64-bit Linux (x86_64) RHEL 7 RPMs.
b.) Some of the Install Guide requirements will already be present from the “default-RPMs” foundation of Linux that you started with:

 

compat-libstdc++-33-3.2.3
binutils-2.23.52.0.1-12.el7.x86_64
compat-libcap1-1.10-3.el7.x86_64
gcc-4.8.2-3.el7.x86_64
gcc-c++-4.8.2-3.el7.x86_64
glibc-2.17-36.el7.x86_64
glibc-devel-2.17-36.el7.x86_64
ksh
libaio-0.3.109-9.el7.x86_64
libaio-devel-0.3.109-9.el7.x86_64
libgcc-4.8.2-3.el7.x86_64
libstdc++-4.8.2-3.el7.x86_64
libstdc++-devel-4.8.2-3.el7.x86_64
libXi-1.7.2-1.el7.x86_64
libXtst-1.2.2-1.el7.x86_64
make-3.82-19.el7.x86_64
sysstat-10.1.5-1.el7.x86_64

4. Additional Required OS Components (per the runInstaller OUI)
a.) intentionally blank

5. Additional Required OS Components (per this NOTE)
a.) Please do not rush, skip, or minimize this critical step. This list is based upon a "default-RPMs" installation of 64-bit (x86_64) RHEL 7. Additional RPMs (beyond anything known to Oracle) may be needed if a "less-than-default-RPMs" installation of 64-bit (x86_64) RHEL Server 7 is performed. For more information, please refer to Note 376183.1, "Defining a "default RPMs" installation of the RHEL OS" or Note 401167.1, "Defining a "default RPMs" installation of the Oracle Enterprise Linux (OEL) OS".
b.) Several RPMs will be required as prerequisites to those listed in section II.3.c:  

cpp-4.8.2-16.el7.x86_64
glibc-headers-2.17-55.el7.x86_64
mpfr-3.1.1-4.el7.x86_64

6. Oracle Global Customer Support has noticed a recent trend of install problems that originate from installing too many RPMs. For example:
a.) installing your own JDK version (prior to executing the Oracle Software runInstaller) is not needed on Linux, and is not recommended on Linux. A pre-existing JDK often interferes with the correct JDK that the Linux Oracle Software runInstaller will place and use.
b.) installing more than the required version of the gcc / g++ RPMs often leads to accidentally using (aka enabling or activating) the incorrect one. If you have multiple RDBMS versions installed on the same Linux machine, then you will likely have to manage multiple versions of gcc / g++. For more information, please see Note 444084.1, "Multiple gcc / g++ Versions in Linux"

7. All of the RPMs in section II. are on the Red Hat Enterprise Linux 7 64-bit (x86_64) distribution media.

III. Environment:
================
1. Modify your kernel settings in /etc/sysctl.conf (RedHat) as follows. If the current value for any parameter is higher than the value listed in this table, do not change the value of that parameter. Range values (such as net.ipv4.ip_local_port_range) must match exactly. 

kernel.shmall = physical RAM size / pagesize. For most systems, this will be the value 2097152. See Note 301830.1 for more information.
kernel.shmmax = 1/2 of physical RAM. This would be the value 2147483648 for a system with 4GB of physical RAM. See Note 567506.1 for more information.
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 512 x processes (for example 6815744 for 13312 processes)
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576

2. To activate these new settings into the running kernel space, run the “sysctl -p” command as root.

3. Set Shell Limits for the oracle User. Assuming that the “oracle” Unix user will perform the installation, do the following:

a.) Add the following settings to /etc/security/limits.conf

oracle soft nproc 2047
oracle hard nproc 16384
oracle soft nofile 1024
oracle hard nofile 65536
oracle soft stack 10240

b.) Verify the latest version of PAM is loaded, then add or edit the following line in the /etc/pam.d/login file, if it does not already exist: 

session required pam_limits.so

c.) Verify the current ulimits, and raise if needed.  This can be done many ways…adding the following lines to /etc/profile is the recommended method: 

if [ "$USER" = "oracle" ]; then
  if [ "$SHELL" = "/bin/ksh" ]; then
    ulimit -u 16384
    ulimit -n 65536
  else
    ulimit -u 16384 -n 65536
  fi
fi
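To confirm a fresh login shell actually picks these limits up, a quick check (assuming the install owner is the oracle user):

su - oracle -c 'ulimit -u; ulimit -n'    # expect 16384 and 65536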

 

4. The gcc-4.8.2 and gcc-c++-4.8.2 RPM items above will ensure that the correct gcc / g++ versions are installed. It is also required that you ensure that these correct gcc / g++ versions are active, and in use. Ensure that the commands "gcc --version" and "g++ --version" each return "4.8.2".

 

5. If any Java packages are installed on the system, unset the Java environment variables, for example JAVA_HOME.

6. The oracle account that is used to install Oracle 11.2.0.4 should not have the Oracle install related variables set by default. For example setting ORACLE_HOME, PATH, LD_LIBRARY_PATH to include Oracle binaries in .profile, .login file and /etc/profile.d should be completely avoided.
a.) Setting $ORACLE_BASE (not $ORACLE_HOME) is recommended, since it eases a few prompts in the OUI runInstaller tool.
b.) Following the successful install, it is recommended to set $ORACLE_HOME, and to set $PATH to include $ORACLE_HOME/bin at the beginning of the $PATH string.

7. By default, RHEL 7 x86_64 Linux is installed with SELinux as “enforcing”. This is fine for the 11gR2 installation process. However, to subsequently run “sqlplus”, switch SELinux to the “Permissive” mode. See NOTE 454196.1, “./sqlplus: error on libnnz11.so: cannot restore segment prot after reloc” for more details.

UPDATE: Internal testing suggests that there is no problem running “sqlplus” with SELinux in “enforcing” mode on RHEL7/OL7. The problem only affects RHEL5/OL5.

8. Log in as Oracle user and start the installation as follows: 

$ ./runInstaller -ignorePrereq

a.) It is best practice not to use any form of “su” to start the runInstaller, in order to avoid potential display-related problems.
b.) When performing the 11.2.0.4 installation, make sure to use the “runInstaller” version that comes with 11.2.0.4 software.
c.) When performing any subsequent 11.2.0.x patchset, make sure to use the “runInstaller” version that comes with the patchset.

Known Issue :

01) The installer needs to be launched with “-ignorePrereq” option due to unpublished bug 19947777. This issue occurs since Oracle Linux 7 was not released when Oracle database 11.2.0.4 was made available and hence was not certified. However, Oracle 11.2.0.4 is now certified on OL7. Refer Note 1962046.1 for details.

02) Compilation for target 'relink_exe' fails with an "undefined reference to symbol 'B_DestroyKeyObject'" error, as reported in unpublished bug 19692824. The solution is to install patch 19692824 as documented in Note 1965691.1.

ADDITIONAL NOTES
—————-
1. Supported distributions of the 32-bit (x86) Linux OS can run on AMD64/EM64T and Intel processor chips that adhere to the x86_64 architecture
a.) Oracle 32-bit Database Server running on AMD64/EM64T with 32-bit OS is supported, but is NOT covered by this NOTE.
b.) Oracle 32-bit Database Server running on AMD64/EM64T with 64-bit OS is not certified and is not supported.
c.) Oracle 32-bit Database Client running on AMD64/EM64T with 64-bit OS is expected to be supported, but is NOT covered by this NOTE.

2. Asynchronous I/O on ext2 and ext3 file systems is supported if your scsi/fc driver supports that functionality. 

Note : Asynchronous I/O on Ext4 file system is supported with Oracle 10g onwards on OEL5.6 and later.
Reference : Oracle Linux, Filesystem & I/O Type Supportability (Note 279069.1)

3. No extra patch is required for the DIRECTIO support for x86_64.

4. No LD_ASSUME_KERNEL value should be used with the 11gR2 product.

5. The following rpm command can be used to distinguish between a 32-bit or 64-bit package.   

# rpm -qa --queryformat "%{NAME}-%{VERSION}-%{RELEASE} (%{ARCH})\n" | grep glibc-devel
glibc-devel-2.12-1.7 (x86_64)
glibc-devel-2.12-1.7 (i686)

Installation walk-through – Oracle Grid/RAC 11.2.0.4 on Oracle Linux 7 (Doc ID 1951613.1)

APPLIES TO:
Oracle Database – Enterprise Edition – Version 11.2.0.4 to 11.2.0.4 [Release 11.2]
Oracle Database Cloud Schema Service – Version N/A and later
Oracle Database Exadata Cloud Machine – Version N/A and later
Oracle Cloud Infrastructure – Database Service – Version N/A and later
Oracle Database Backup Service – Version N/A and later
Linux x86-64

PURPOSE

This document aims to provide clarity on the installation/patching processes required while installing Oracle Grid 11.2.0.4.0 and Oracle RAC 11.2.0.4.0 on Oracle Linux 7 by providing details on the steps taken to complete an example installation. For general recommendations, refer to Note 1962100.1 “Requirements for Installing Oracle 11.2.0.4 RDBMS on OL7 or RHEL7 64-bit (x86-64)”

SCOPE
This document is intended to complement the official Oracle documentation. If there are any incompatibilities between this document and the official Oracle documentation, they are unintentional, and should be ignored. This document is not meant to be a substitute for official documentation; care should be taken to ensure that all official documentation is reviewed thoroughly.

DETAILS
Operating System Installation & Setup – Recommendations

Yum Repository
Set up public-yum repository and enable the latest AddOns channels, e.g.

# cat /etc/yum.repos.d/public-yum-ol7.repo
[ol7_latest]
name=Oracle Linux $releasever Latest ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL7/latest/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

[ol7_addons]
name=Oracle Linux $releasever Add ons ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL7/addons/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
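A quick sanity check that both channels are active before pulling any packages (assuming the node can reach public-yum.oracle.com):

yum repolist enabled | grep ol7    # ol7_latest and ol7_addons should both appear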

 

11gR2 preinstall RPM
Install 11gR2 preinstall rpm. The preinstall rpm installs all dependencies for the Oracle RDBMS server installation, and creates the oracle user and the dba and oinstall groups.
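The note does not spell out the package name; on OL7 the preinstall package is typically installed like this (name assumed from the standard Oracle Linux channels):

yum install oracle-rdbms-server-11gR2-preinstall    # pulls in the dependent RPMs and creates the oracle user plus the oinstall/dba groups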

Oracle ASMLib
Download the certified oracleasmlib package from OTN (http://www.oracle.com/technetwork/server-storage/linux/asmlib/ol7-2352094.html)

Install oracleasm packages if oracleasmlib is to be used

 

yum install oracleasm-support.x86_64 oracleasmlib-2.0.8-2.el7.x86_64.rpm

 

Oracle Automatic Storage Management Cluster File System (Oracle ACFS)
For details on ACFS support, including required patches, refer to Note 1369107.1 (ACFS Support On OS Platforms (Certification Matrix)

Disk Naming Consistency
For consistent disk naming, install device-mapper-multipath along with associated dependencies (device-mapper-multipath-libs.x86_64)

 

yum install device-mapper-multipath

 

Multipathing

Start multipathd, and verify status e.g.

 

[root@xxxx ~]# systemctl start multipathd
[root@xxxx ~]# systemctl status multipathd
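To keep the daemon across reboots and confirm it actually sees the paths, a short follow-up (a sketch; on RHEL7 a default /etc/multipath.conf may need to be generated first):

mpathconf --enable --with_multipathd y    # creates a default /etc/multipath.conf and starts multipathd if needed
systemctl enable multipathd               # start automatically at boot
multipath -ll                             # list the multipath maps and their underlying paths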


SELinux
Disable, e.g.

setenforce 0
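setenforce 0 only lasts until the next reboot; to make the change persistent the config file needs editing as well (a sketch, assuming the stock /etc/selinux/config layout):

sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
getenforce    # should report Permissive after the setenforce 0 above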

Oracle Grid Infrastructure – Installation Notes
Patch 19404309

Note: It is presumed that the user has already reviewed the Oracle Grid Infrastructure Installation Guide and associated Release Notes; instructions and/or recommendations from those documents will not be repeated here.

After downloading the Oracle Grid Infrastructure software, and before attempting any installation, download Patch 19404309 from My Oracle Support, and apply the patch using the instructions in the patch README.

Patch 18370031

Download Patch 18370031 from My Oracle Support. Then, start an interactive Oracle Grid Infrastructure installation through the Oracle Universal Installer (OUI), but do not execute root.sh on any node until after the application of Patch 18370031. When the OUI prompts the user to execute the root.sh scripts*, Patch 18370031 should be applied by following the instructions in Section 2.3, Case 5 – Patching a Software Only GI Home Installation or Before the GI Home Is Configured – of the patch README. Note: The README should be reviewed in full, as it contains other requirements (e.g. upgrading OPatch, etc.).

* If executing a software-only installation, the patch should be applied after the installation concludes, but before any configuration is attempted.

Once Patch 18370031 has been applied, proceed with the remainder of the installation (or configuration).

Oracle Database/RAC – Installation Notes
Note: As the title suggests, this section applies both to installations of Oracle Database and Oracle Real Application Clusters (RAC).

Patch 19404309
Note: It is presumed that the user has already reviewed the Oracle Database, Oracle RAC Installation Guides and associated Release Notes; instructions and/or recommendations from those documents will not be repeated here.

After downloading the Oracle Database/RAC software, and before attempting any installation, download Patch 19404309 from My Oracle Support, and apply the patch using the instructions in the patch README.

Patch 19692824
During installation of Oracle Database or Oracle RAC on OL7, the following linking error may be encountered:

 

Error in invoking target ‘agent nmhs’ of makefile ‘<ORACLE_HOME>/sysman/lib/ins_emagent.mk’. See ‘<installation log>’ for details.
If this error is encountered, the user should select Continue. Then, after the installation has completed, the user must download Patch 19692824 from My Oracle Support and apply it per the instructions included in the patch README.

Installation/Home Cloning
Note: It may be possible to perform the above steps once, then use Oracle’s cloning technology to clone the installation/home. Further details are available in the cloning sections of the relevant Administration and Deployment guides:

Cloning Oracle Clusterware

Cloning Oracle RAC to Nodes in a New Cluster

Cloning Oracle Software

Oracle RAC集群Grid Infrastructure 启动的五大问题(Doc ID 1526147.1)

适用于:

Oracle Database – Enterprise Edition – 版本 11.2.0.1 和更高版本
Oracle Database Cloud Schema Service – 版本 N/A 和更高版本
Oracle Database Exadata Cloud Machine – 版本 N/A 和更高版本
Oracle Cloud Infrastructure – Database Service – 版本 N/A 和更高版本
Oracle Database Cloud Exadata Service – 版本 N/A 和更高版本
本文档所含信息适用于所有平台

 

用途

本文档的目的是总结可能阻止 Grid Infrastructure (GI) 成功启动的 5 大问题。

 

 

适用范围

本文档仅适用于 11gR2 Grid Infrastructure。

 

要确定 GI 的状态,请运行以下命令:

 

1. $GRID_HOME/bin/crsctl check crs
2. $GRID_HOME/bin/crsctl stat res -t -init
3. $GRID_HOME/bin/crsctl stat res -t
4. ps -ef | egrep 'init|d.bin'

详细信息

 

问题 1:CRS-4639:无法连接 Oracle 高可用性服务,ohasd.bin 未运行或 ohasd.bin 虽在运行但无 init.ohasd 或其他进程

症状:

 

1. 命令“$GRID_HOME/bin/crsctl check crs”返回错误:

 

CRS-4639: Could not contact Oracle High Availability Services

 

 

2. 命令“ps -ef | grep init”不显示类似于如下所示的行:

 

root 4878 1 0 Sep12 ? 00:00:02 /bin/sh /etc/init.d/init.ohasd run

 

 

3. 命令“ps -ef | grep d.bin”不显示类似于如下所示的行:

 

root 21350 1 6 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ohasd.bin reboot


或者它只显示 "ohasd.bin reboot" 进程而没有其他进程

4. 日志 ohasd.log 中出现以下信息:

 

2013-11-04 09:09:15.541: [ default][2609911536] Created alert : (:OHAS00117:) : TIMED OUT WAITING FOR OHASD MONITOR

 

5. 日志 ohasOUT.log 中出现以下信息:

 

2013-11-04 08:59:14 Changing directory to /u01/app/11.2.0/grid/log/lc1n1/ohasd
OHASD starting Timed out waiting for init.ohasd script to start; posting an alert

 

 

6. ohasd.bin 一直处于启动状态,ohasd.log 信息:

 

2014-08-31 15:00:25.132: [  CRSSEC][733177600]{0:0:2} Exception: PrimaryGroupEntry constructor failed to validate group name with error: 0 groupId: 0x7f8df8022450 acl_string: pgrp:spec:r-x
2014-08-31 15:00:25.132: [  CRSSEC][733177600]{0:0:2} Exception: ACL entry creation failed for: pgrp:spec:r-x
2014-08-31 15:00:25.132: [    INIT][733177600]{0:0:2} Dump State Starting ...

 

 

7. 只有ohasd.bin运行,但是ohasd.log没有任何信息。 OS 日志/var/log/messages显示

 

2015-07-12 racnode1 logger: autorun file for ohasd is missing

 

可能的原因:

 

1. 文件“/etc/inittab”并不包含行(对于OL5/RHEL5以及以下版本,内容也会因版本的不同而不同)
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
2. 未达到运行级别 3,一些 rc3 脚本挂起
3. Init 进程 (pid 1) 并未衍生 /etc/inittab (h1) 中定义的进程,或 init.ohasd 之前的不当输入,如 xx:wait:<process> 阻碍了 init.ohasd 的启动
4. CRS 自动启动已禁用
5. Oracle 本地注册表 ($GRID_HOME/cdata/<node>.olr) 丢失或损坏(root用户执行命令检查 “ocrdump -local /tmp/olr.log”, 文件 /tmp/olr.log 应该包含所有GI进程有关信息,对比一个正常工作的集群环境)
6. root用户之前在”spec”组,但是现在”spec”组被删除,但是旧组仍然记录在OLR中,可以通过OLR dump验证。
7. 节点重启后当init.ohasd启动时 HOSTNAME 为空。


解决方案:

 

1. 将以下行添加至 /etc/inittab (对于OL5/RHEL5以及以下版本)
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
并以 root 用户身份运行“init q”。 对于Linux OL6/RHEL6, 请参考文档 note 1607600.1
2. 运行命令“ps -ef | grep rc”,并kill看起来受阻的所有 rc3 脚本。
3. 删除 init.ohasd 前的不当输入。如果“init q”未衍生“init.ohasd run”进程,请咨询 OS 供应商
4. 启用 CRS 自动启动:
# crsctl enable crs
# crsctl start crs
5. 以 root 用户身份从备份中恢复 OLR(Oracle 本地注册表):(参考Note 1193643.1)
# crsctl stop crs -f
# touch $GRID_HOME/cdata/<node>.olr
# chown root:oinstall $GRID_HOME/cdata/<node>.olr
# ocrconfig -local -restore $GRID_HOME/cdata/<node>/backup_<date>_<num>.olr
# crsctl start crs
If for some reason the OLR backup does not exist, rebuilding the OLR requires deconfiguring as root and re-running root.sh:
# $GRID_HOME/crs/install/rootcrs.pl -deconfig -force
# $GRID_HOME/root.sh

6. 需要重新初始化/创建OLR, 使用命令与前面创建OLR命令相同。

 

7. 重启init.ohasd进程或者在init.ohasd中添加”sleep 30″,这样允许在启动集群前输出hostname信息,参考Note 1427234.1.

 

8. 如果上面方法不能解决问题,请检查OS messages中有关ohasd.bin日志信息,按照OS message中提示信息,

设置LD_LIBRARY_PATH = <GRID_HOME>/lib,并且手动执行crswrapexece.pl命令。

 

问题 2:CRS-4530:联系集群同步服务守护进程时出现通信故障,ocssd.bin 未运行

 

症状:

 

1. 命令“$GRID_HOME/bin/crsctl check crs”返回错误:

CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager

 

2. 命令“ps -ef | grep d.bin”不显示类似于如下所示的行:

 

 

oragrid 21543 1 1 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ocssd.bin

 

3. ocssd.bin 正在运行,但在 ocssd.log 中显示消息“CLSGPNP_CALL_AGAIN”后又中止运行

 

4. ocssd.log 显示如下内容:

 

   2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209,
lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065

 

5. 对于 3 个或更多节点的情况,2 个节点形成的集群一切正常,但是,当第 3 个节点加入时就出现故障,ocssd.log 显示如下内容:

 

   2012-02-09 11:33:53.048: [ CSSD][1120926016](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 2 nodes with leader 2, racnode2, is smaller than
cohort of 2 nodes led by node 1, racnode1, based on map type 2
2012-02-09 11:33:53.048: [ CSSD][1120926016]###################################
2012-02-09 11:33:53.048: [ CSSD][1120926016]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

 

6. 10 分钟后 ocssd.bin 启动超时

 

   2012-04-08 12:04:33.153: [    CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333911873
......
2012-04-08 12:14:31.994: [    CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
2012-04-08 12:14:31.994: [    CSSD][5]###################################
2012-04-08 12:14:31.994: [    CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
2012-04-08 12:14:31.994: [    CSSD][5]###################################
2012-04-08 12:14:31.994: [    CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

 

 

7. alert<node>.log 显示:

 

 

2014-02-05 06:16:56.815
[cssd(3361)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/bdprod2/cssd/ocssd.log
...
2014-02-05 06:27:01.707
[ohasd(2252)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'bdprod2'.
2014-02-05 06:27:02.075
[ohasd(2252)]CRS-2771:Maximum restart attempts reached for resource 'ora.cssd'; will not restart.
>

 

可能的原因:

 

1. 表决磁盘丢失或无法访问
2. 多播未正常工作(对于版本11.2.0.2,这是正常的情况。对于 11.2.0.3 PSU5/PSU6/PSU7 和 12.1.0.1 版本,是由于Bug 16547309)
3. 私网未工作,ping 或 traceroute <private host> 显示无法访问目标。或虽然 ping/traceroute 正常工作,但是在私网中启用了防火墙
4. gpnpd 未出现,卡在 dispatch 线程中, Bug 10105195
5. 通过 asm_diskstring 发现的磁盘太多,或由于 Bug 13454354 导致扫描太慢(仅在 Solaris 11.2.0.3 上出现)


解决方案:

 

1. 通过检查存储存取性、磁盘权限等恢复表决磁盘存取。如果表决盘在 OS 级别无法访问,请敦促操作系统管理员恢复磁盘访问。
如果 OCR ASM 磁盘组中的 voting disk已经丢失,以独占模式启动 CRS,并重建表决磁盘:
# crsctl start crs -excl
# crsctl replace votedisk <+OCRVOTE diskgroup>
2. 请参考 Document 1212703.1 ,了解多播功能的测试及修正。对于版本 11.2.0.3 PSU5/PSU6/PSU7 和12.1.0.1, 您可以为集群私网启用多播或者应用补丁16547309 或最新的PSU。更多信息请参考Document 1564555.1
3. 咨询网络管理员,恢复私网访问或禁用私网防火墙(对于 Linux,请检查服务 iptables 状态和服务 ip6tables 状态)
4. 终止正常运行节点上的 gpnpd.bin 进程,请参考 Document 10105195.8
一旦以上问题得以解决,请重新启动 Grid Infrastructure。
如果 ping/traceroute 对私网均可用,但是问题发生在从 11.2.0.1 至 11.2.0.2 升级过程中,请检查Bug 13416559 获取解决方法。
5. 通过提供更加具体的 asm_diskstring,限制 ASM 扫描磁盘的数量,请参考 bug 13583387
对于 Solaris 11.2.0.3,请应用补丁 13250497,请参阅 Document 1451367.1.

 

问题 3:CRS-4535:无法与集群就绪服务通信,crsd.bin 未运行

 

症状:

 

1. 命令“$GRID_HOME/bin/crsctl check crs”返回错误:

 

CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager

 

2. 命令“ps -ef | grep d.bin”不显示类似于如下所示的行:

 

root 23017 1 1 22:34 ? 00:00:00 /u01/app/11.2.0/grid/bin/crsd.bin reboot

 

3. Even if the crsd.bin process exists, the command "crsctl stat res -t -init" still shows:

 

ora.crsd
1    ONLINE     INTERMEDIATE

 

可能的原因:

 

1. ocssd.bin 未运行,或资源 ora.cssd 不在线
2. +ASM<n> 实例无法启动
3. OCR 无法访问
4. 网络配置已改变,导致 gpnp profile.xml 不匹配
5. Crsd 的 $GRID_HOME/crs/init/<host>.pid 文件已被手动删除或重命名,crsd.log 显示:“Error3 -2 writing PID to the file”
6. ocr.loc 内容与其他集群节点不匹配。crsd.log 显示:“Shutdown CacheLocal. my hash ids don’t match”
7.当巨帧(Jumbo Frame)在集群私网被启用时,节点私网能够通过“ping”命令互相联通,但是无法通过巨帧尺寸ping通(例如:ping -s 8900 <私网 ip>)或者
集群中的其他节点已经配置巨帧(MTU: 9000),而出现问题的节点没有配置巨帧(MTU:1500)。
8.对于平台 AIX 6.1 TL08 SP01 和 AIX 7.1 TL02 SP01,由于多播数据包被截断。

解决方案:

 

 

1. 检查问题 2 的解决方案,确保 ocssd.bin 运行且 ora.cssd 在线
2. 对于 11.2.0.2 以上版本,确保资源 ora.cluster_interconnect.haip 在线,请参考 Document 1383737.1 了解和HAIP相关的,ASM无法启动的问题。
3. 确保 OCR 磁盘可用且可以访问。如果由于某种原因丢失 OCR,请参考 Document 1062983.1 了解如何恢复OCR。
4. 恢复网络配置,与 $GRID_HOME/gpnp/<node>/profiles/peer/profile.xml 中定义的接口相同,请参考Document 283684.1 了解如何修改私网配置。
5. 请使用 touch 命令,在 $GRID_HOME/crs/init 目录下创建名为 <host>.pid 的文件。
对于 11.2.0.1,该文件归 <grid> 用户所有。
对于 11.2.0.2,该文件归 root 用户所有。
6. 使用 ocrconfig 命令修正 ocr.loc 内容:
例如,作为 root 用户:
# ocrconfig -repair -add +OCR2 (添加条目)
# ocrconfig -repair -delete +OCR2 (删除条目)
以上命令需要 ohasd.bin 启动并运行 。一旦以上问题得以解决,请通过以下命令重新启动 GI 或启动 crsd.bin:
# crsctl start res ora.crsd -init
7. 如果巨帧只是在网卡层面配置了巨帧,请敦促网络管理员在交换机层面启动巨帧。如果您不需要使用巨帧,请将集群中所有节点的私网MTU值设置为1500,之后重启所有节点。
8. 对于平台 AIX 6.1 TL08 SP01 和 AIX 7.1 TL02 SP01,根据下面的note应用对应的 AIX 补丁
Document 1528452.1 AIX 6.1 TL8 or 7.1 TL2: 11gR2 GI Second Node Fails to Join the Cluster as CRSD and EVMD are in INTERMEDIATE State

 

 

问题 4:Agent 或者 mdnsd.bin, gpnpd.bin, gipcd.bin 未运行

症状:

 

1. orarootagent 未运行. ohasd.log 显示:

2012-12-21 02:14:05.071: [    AGFW][24] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /grid/11.2.0/grid_2/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/grid/11.2.0/grid_2/bin/orarootagent]
2. mdnsd.bin, gpnpd.bin 或者 gipcd.bin 未运行, 以下是 mdnsd log中显示的一个例子:
2012-12-31 21:37:27.601: [  clsdmt][1088776512]Creating PID [4526] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:37:27.602: [  clsdmt][1088776512]Error3 -2 writing PID [4526] to the file []
2012-12-31 21:37:27.602: [  clsdmt][1088776512]Failed to record pid for MDNSD

或者

2012-12-31 21:39:52.656: [  clsdmt][1099217216]Creating PID [4645] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:39:52.656: [  clsdmt][1099217216]Writing PID [4645] to the file [/u01/app/11.2.0/grid/mdns/init/lc1n1.pid]
2012-12-31 21:39:52.656: [  clsdmt][1099217216]Failed to record pid for MDNSD

3. oraagent 或 appagent 未运行, 日志crsd.log显示:

 

2012-12-01 00:06:24.462: [    AGFW][1164069184] {0:2:27} Created alert : (:CRSAGF00130:) :  Failed to start the agent /u01/app/grid/11.2.0/bin/appagent_oracle

 

 

可能的原因:

 

1. orarootagent 缺少执行权限
2. 缺少进程相关的 <node>.pid 文件或者这个文件的所有者/权限不对
3. GRID_HOME 所有者/权限不对


解决方案:

1. 和一个好的GRID_HOME比较所有者/权限,并做相应的改正,或者以root用户执行:
# cd <GRID_HOME>/crs/install
# ./rootcrs.pl -unlock
# ./rootcrs.pl -patch


这将停止集群软件,对需要的文件的所有者/权限设置为root用户,并且重启集群软件。
2. 如果对应的 <node>.pid 不存在, 就用touch命令创建一个具有相应所有者/权限的文件, 否则就按要求改正文件<node>.pid的所有者/权限, 然后重启集群软件.
这里是<GRID_HOME>下,所有者属于root:root 权限 644的<node>.pid 文件列表:

./ologgerd/init/<node>.pid
./osysmond/init/<node>.pid
./ctss/init/<node>.pid
./ohasd/init/<node>.pid
./crs/init/<node>.pid
所有者属于<grid>:oinstall,权限644
./mdns/init/<node>.pid
./evm/init/<node>.pid
./gipc/init/<node>.pid
./gpnp/init/<node>.pid

3. For cause 3, refer to solution 1.

 

问题 5:ASM 实例未启动,ora.asm 不在线

症状:

1. 命令“ps -ef | grep asm”不显示 ASM 进程

2. The command "crsctl stat res -t -init" shows:

 

ora.asm
1    ONLINE    OFFLINE


可能的原因:

 

1. ASM spfile 损坏
2. ASM discovery string不正确,因此无法发现 voting disk/OCR
3. ASMlib 配置问题
4. ASM实例使用不同的cluster_interconnect, 第一个节点 HAIP OFFLINE 导致第二个节点ASM实例无法启动


Solutions:

1. Create a temporary pfile to start the ASM instance, then recreate the spfile; see Document 1095214.1 for details. (A minimal pfile sketch follows this list.)
2. See Document 1077094.1 to correct the ASM discovery string.
3. See Document 1050164.1 to fix the ASMlib configuration.
4. See Document 1383737.1 for the solution, and Document 1210883.1 for more information on HAIP.
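A minimal temporary pfile for solution 1 might look like the following. This is only a sketch: the instance name +ASM1, the /tmp path and the asm_diskstring value are assumptions and must be adjusted to the actual environment (run as the grid owner):

export ORACLE_SID=+ASM1
export ORACLE_HOME=/u01/app/11.2.0/grid
cat > /tmp/init+ASM1.ora <<'EOF'
*.instance_type='asm'
*.asm_diskstring='/dev/oracleasm/*'
*.asm_power_limit=1
*.memory_target=1G
EOF
$ORACLE_HOME/bin/sqlplus / as sysasm <<'EOF'
startup pfile='/tmp/init+ASM1.ora';
EOF

Once the instance is up and the diskgroups mount, recreate the spfile as described in Document 1095214.1.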

 

要进一步调试 GI 启动问题,请参考 Document 1050908.1 Troubleshoot Grid Infrastructure Startup Issues.

诊断 Oracle RAC集群Grid Infrastructure 启动问题 (Doc ID 1623340.1)

适用于:

Oracle Database – Enterprise Edition – 版本 11.2.0.1 和更高版本
Oracle Database Cloud Schema Service – 版本 N/A 和更高版本
Oracle Database Exadata Cloud Machine – 版本 N/A 和更高版本
Oracle Cloud Infrastructure – Database Service – 版本 N/A 和更高版本
Oracle Database Backup Service – 版本 N/A 和更高版本
本文档所含信息适用于所有平台

 

 

用途

 

本文提供了诊断 11GR2 和 12C Grid Infrastructure 启动问题的方法。对于新安装的环境(root.sh 和 rootupgrade.sh 执行过程中)和有故障的旧环境都适用。针对 root.sh 的问题,我们可以参考 note 1053970.1 来获取更多的信息。

 

 

适用范围

 

本文适用于集群/RAC数据库管理员和 Oracle 支持工程师。

 

详细信息

 

启动顺序:

 

简而言之,操作系统负责启动 ohasd 进程,ohasd 进程启动 agents 用来启动守护进程(gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd ,asm …) ,crsd 启动 agents 用来启动用户资源(database,SCAN,Listener 等)。

 

如果需要了解更详细的 Grid Infrastructure Cluster 启动顺序,请参阅 note 1053147.1。

 

集群状态

查询集群和守护进程的状态:

 

 

$GRID_HOME/bin/crsctl check crs

CRS-4638: Oracle High Availability Services is online

CRS-4537: Cluster Ready Services is online

CRS-4529: Cluster Synchronization Services is online

CRS-4533: Event Manager is online

$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1        ONLINE  ONLINE       rac1                  Started
ora.crsd
1        ONLINE  ONLINE       rac1
ora.cssd
1        ONLINE  ONLINE       rac1
ora.cssdmonitor
1        ONLINE  ONLINE       rac1
ora.ctssd
1        ONLINE  ONLINE       rac1                  OBSERVER
ora.diskmon
1        ONLINE  ONLINE       rac1
ora.drivers.acfs
1        ONLINE  ONLINE       rac1
ora.evmd
1        ONLINE  ONLINE       rac1
ora.gipcd
1        ONLINE  ONLINE       rac1
ora.gpnpd
1        ONLINE  ONLINE       rac1
ora.mdnsd
1        ONLINE  ONLINE       rac1

 

 

对于11.2.0.2 和以上的版本,会有以下两个额外的进程:

 

ora.cluster_interconnect.haip
1        ONLINE  ONLINE       rac1
ora.crf
1        ONLINE  ONLINE       rac1

 

 

对于11.2.0.3 以上的非EXADATA的系统,ora.diskmon会处于offline的状态,如下:

 

ora.diskmon
1        OFFLINE  OFFLINE       rac1

 

 

 

对于 12c 以上的版本, 会出现ora.storage资源:

 

ora.storage
1 ONLINE ONLINE racnode1 STABLE

 

 

如果守护进程 offline 我们可以通过以下命令启动:

 

$GRID_HOME/bin/crsctl start res ora.crsd -init

 

问题 1: OHASD 无法启动

 

由于 ohasd.bin 的责任是直接或者间接的启动集群所有的其它进程,所以只有这个进程正常启动了,其它的进程才能起来,如果 ohasd.bin 的进程没有起来,当我们检查资源状态的时候会报错 CRS-4639 (Could not contact Oracle High Availability Services); 如果 ohasd.bin 已经启动了,而再次尝试启时,错误 CRS-4640 会出现;如果它启动失败了,那么我们会看到以下的错误信息:

 

CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.

自动启动 ohasd.bin 依赖于以下的配置:

1. 操作系统配置了正确的 run level:

OS 需要在 CRS 启动之前设置成指定的 run level 来确保 CRS 的正常启动。

我们可以通过以下方式找到 CRS 需要 OS 设置的 run level:

 

 

cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

注意:Oracle Linux 6 (OL6) 和 Red Hat Linux 6 (RHEL6) 上已经不使用inittab,init.ohasd已经被/etc/init/oracle-ohasd.conf替代,不过 /etc/init.d/init.ohasd run 应该仍然可用。Oracle Linux 7 (以及 Red Hat Linux 7) 使用 systemd 来启动/关闭服务 (比如: /etc/systemd/system/oracle-ohasd.service)

以上例子展示了,CRS 需要 OS 运行在 run level 3 或 5;请注意,由于操作系统的不同,CRS 启动需要的 OS 的 run level 也会不同。

找到当前 OS 正在运行的 run level:

 

who -r

 

 

2. “init.ohasd run” 启动

 

在 Linux/Unix 平台上,由于”init.ohasd run” 是配置在 /etc/inittab中,进程 init(进程id 1,linux,Solars和HP-UX上为/sbin/init ,Aix上为/usr/sbin/init)会启动并且产生”init.ohasd run”进程,如果这个过程失败了,就不会有”init.ohasd run”的启动和运行,ohasd.bin 也是无法启动的:

 

ps -ef|grep init.ohasd|grep -v grep
root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run

注意:Oracle Linux 6 (OL6) 和 Red Hat Linux 6 (RHEL6) 上已经不使用inittab,init.ohasd已经被/etc/init/oracle-ohasd.conf替代,不过 /etc/init.d/init.ohasd run 应该仍然可用。Oracle Linux 7 (以及 Red Hat Linux 7) 使用 systemd 来启动/关闭服务 (比如: /etc/systemd/system/oracle-ohasd.service)

如果任何 rc Snncommand 的脚本(在 rcn.d 中,如 S98gcstartup)在启动的过程中挂死,此时 init 的进程可能无法启动”/etc/init.d/init.ohasd run”;您需要寻求 OS 厂商的帮助,找到为什么 Snncommand 脚本挂死或者无法正常启动的原因;

错误”[ohasd(<pid>)] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started.” 可能会在 init.ohasd 无法在指定时间内启动后出现

如果系统管理员无法在短期内找到 init.ohasd 无法启动的原因,以下办法可以作为一个临时的解决办法:

 

cd <location-of-init.ohasd>
nohup ./init.ohasd run &

3. Clusterware 自动启动;–自动启动默认是开启的

默认情况下 CRS 自动启动是开启的,我们可以通过以下方式开启:

 

$GRID_HOME/bin/crsctl enable crs

 

检查这个功能是否被开启:

 

$GRID_HOME/bin/crsctl config crs

如果以下信息被输出在OS的日志中

 

Feb 29 16:20:36 racnode1 logger: Oracle Cluster Ready Services startup disabled.
Feb 29 16:20:36 racnode1 logger: Could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr

原因是由于这个文件不存在或者不可访问,产生这个问题的原因一般是人为的修改或者是打 GI 补丁的过程中使用了错误的 opatch (如:使用 Solaris 平台上的 opatch 在 Linux 上打补丁)

4. syslogd 启动并且 OS 能够执行 init 脚本 S96ohasd

 

节点启动之后,OS 可能停滞在一些其它的 Snn 的脚本上,所以可能没有机会执行到脚本 S96ohasd;如果是这种情况,我们不会在 OS 日志中看到以下信息

 

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.

如果在 OS 日志里看不到上面的信息,还有一种可能是 syslogd((/usr/sbin/syslogd)没有被完全启动。GRID 在这种情况下也是无法正常启动的,这种情况不适用于 AIX 的平台。

为了了解 OS 启动之后是否能够执行 S96ohasd 脚本,可以按照以下的方法修改该脚本:

 

From:

    case `$CAT $AUTOSTARTFILE` in
enable*)
$LOGERR "Oracle HA daemon is enabled for autostart."

To:

    case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/touch /tmp/ohasd.start."`date`"
$LOGERR "Oracle HA daemon is enabled for autostart."

重启节点后,如果您没有看到文件 /tmp/ohasd.start.timestamp 被创建,那么就是说 OS 停滞在其它的 Snn 的脚本上。如果您能看到 /tmp/ohasd.start.timestamp 生成了,但是”Oracle HA daemon is enabled for autostart”没有写入到messages 文件里,就是 syslogd 没有被完全启动了。以上的两种情况,您都需要寻求系统管理员的帮助,从 OS 的层面找到问题的原因,对于后一种情况,有个临时的解决办法是“休眠”2分钟, 按照以下的方法修改 ohasd 脚本:

 

 

From:

    case `$CAT $AUTOSTARTFILE` in
enable*)
$LOGERR "Oracle HA daemon is enabled for autostart."

To:

    case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/sleep 120
$LOGERR "Oracle HA daemon is enabled for autostart."


5.
 GRID_HOME 所在的文件系统在执行初始化脚本 S96ohasd 的时候在线;正常情况下一旦 S96ohasd 执行结束,我们会在 OS message 里看到以下信息:

 

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"

 

 

如果您只看到了第一行,没有看到最后一行的信息,很可能是 GRID_HOME 所在的文件系统在脚本 S96ohasd 执行的时候还没有正常挂载。

 

6. Oracle Local Registry  (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) 有效并可以正常读写

 

 

ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr

如果 OLR 是不可读写的或者损坏的,我们会在 ohasd.log 中看到以下的相关信息

 

..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR

 

或者

 

..
2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR

 

或者

 

..
2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user

 

或者

 

ohasd.bin comes up but the output of "crsctl stat res -t -init" shows no resources, and "ocrconfig -local -manualbackup" fails

 

或者
..

2010-08-04 13:13:11.102: [   CRSPE][35] Resources parsed
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has been registered with the PE data model
2010-08-04 13:13:11.103: [   CRSPE][35] STARTUPCMD_REQ = false:
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]
2010-08-04 13:13:11.103: [  CRSOCR][31] Multi Write Batch processing...
2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...

..

2010-08-04 13:13:11.112: [   CRSPE][35] SERVERS:

:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool

------------- SERVER POOLS:
Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED

2010-08-04 13:13:11.113: [   CRSPE][35] Dumping ICE contents...:ICE operation count: 0
2010-08-04 13:13:11.113: [ default][35] Dump State Done.

 

 

解决办法就是使用下面的命令,恢复一个好的备份 “ocrconfig -local -restore <ocr_backup_name>”。

默认情况下,OLR 在系统安装结束后会自动的备份在 $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr 。

 

7. ohasd.bin可以正常的访问到网络的 socket 文件:

 

2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))

2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state

In a Grid Infrastructure environment, the socket files related to ohasd should be owned by the root user, whereas in an Oracle Restart environment they belong to the grid user. For more on network socket file permissions and ownership, see the examples in the section "Network socket files, ownership and permissions".
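On Linux these socket files normally live under /var/tmp/.oracle (some platforms use /tmp/.oracle); a quick way to inspect them (paths assumed, adjust to your platform):

ls -l /var/tmp/.oracle/ | grep -i ohasd    # in a GI cluster the ohasd-related sockets should be owned by root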

 

8. ohasd.bin 能够访问日志文件的位置:

OS messages/syslog 显示以下信息:

 

Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.

 

请参考章节”日志位置, 属主和权限”部分的例子,并确定这些必要的目录是否有丢失的,并且是按照正确的权限和属主创建的。

 

9. 节点启动后,在 SUSE Linux 的系统上,ohasd 可能无法启动,此问题请参考 note 1325718.1 – OHASD not Starting After Reboot on SLES

 

10. OHASD 无法启动,使用 “ps -ef| grep ohasd.bin” 显示 ohasd.bin 的进程已经启动,但是 $GRID_HOME/log/<node>/ohasd/ohasd.log 在好几分钟之后都没有任何信息更新,使用 OS 的 truss 工具 可以看到该进程一致在循环的执行关闭从未被打开的文件句柄的操作:

 

 

..
15058/1:         0.1995 close(2147483646)                               Err#9 EBADF
15058/1:         0.1996 close(2147483645)                               Err#9 EBADF
..

通过 ohasd.bin 的 Call stack ,可以看到以下信息:

 

_close  sclssutl_closefiledescriptors  main ..

这是由于 bug 11834289 导致的, 该问题在 11.2.0.3 和之上的版本已经被修复,该 bug 的其它症状还有:集群的进程无法启动,而且做 call stack 和 truss 查看的时候也会看到相同的情况(循环的执行 OS 函数 “close”) . 如果该 bug 发生在启动其它的资源时,我们会看到错误信息: “CRS-5802: Unable to start the agent process” 提示。

 

11. 其它的一些潜在的原因和解决办法请参见 note 1069182.1 – OHASD Failed to Start: Inappropriate ioctl for device

 

12. ohasd.bin 正常启动,但是, “crsctl check crs” 只显示以下一行信息:

 

CRS-4638: Oracle High Availability Services is online

并且命令 “crsctl stat res -p -init” 无法显示任何信息

这个问题是由于 OLR 损坏导致的,请参考 note 1193643.1 进行恢复。

 

13. EL7/OL7上: note 1959008.1 – Install of Clusterware fails while running root.sh on OL7 – ohasd fails to start

 

14. 对于 EL7/OL7, patch 25606616 is needed: TRACKING BUG TO PROVIDE GI FIXES FOR OL7

 

15. 如果 ohasd 仍然无法启动,请参见 ohasd 的日志 <grid-home>/log/<nodename>/ohasd/ohasd.log 和 ohasdOUT.log 来获取更多的信息;

 

问题 2: OHASD Agents  未启动

 

OHASD.BIN 会启动 4 个 agents/monitors 来启动其它的资源:

 

oraagent: 负责启动  ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd 等
orarootagent: 负责启动 ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs 等
cssdagent / cssdmonitor: 负责启动 ora.cssd(对应 ocssd.bin) 和 ora.cssdmonitor(对应 cssdmonitor)

如果 ohasd.bin 不能正常地启动以上任何一个 agents,集群都无法运行在正常的状态。

 

1. 通常情况下,agents 无法启动的原因是 agent 的日志或者日志所在的目录没有正确设置属主和权限。

关于日志文件和文件夹的权限和属主设置,请参见章节 “日志文件位置, 属主和权限” 中的介绍。

一个例子是在手工打补丁时忘记执行 “rootcrs.pl -patch/postpatch” 会导致agent启动失败:

 

2015-02-25 15:43:54.350806 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/orarootagent]

2015-02-25 15:43:54.382154 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]

2015-02-25 15:43:54.384105 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]

 

解决方案是执行缺失的步骤。

 

2. 如果 agent 的二进制文件(oraagent.bin 或者 orarootagent.bin 等)损坏, agent 也将无法启动,从而导致相关的资源也无法启动:

 

2011-05-03 11:11:13.189

[ohasd(25303)]CRS-5828:Could not start agent '/ocw/grid/bin/orarootagent_grid'. Details at (:CRSAGF00130:) {0:0:2} in /ocw/grid/log/racnode1/ohasd/ohasd.log.

2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Created alert : (:CRSAGF00130:) :  Failed to start the agent /ocw/grid/bin/orarootagent_grid
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.diskmon 1 1] ID 4098:403
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Can not stop the agent: /ocw/grid/bin/orarootagent_grid because pid is not initialized
..
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} Fatal Error from AGFW Proxy: Unable to start the agent process
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} CRS-2674: Start of 'ora.diskmon' on 'racnode1' failed

..

2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00126:) :  Agent start failed
..
2011-06-27 22:34:57.806: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]

 

解决办法: 您可以和正常节点上的 agent 文件进行比较,并且恢复一个好的副本回来。

 

3. Agent 可能会因为 bug 11834289 而启动失败,伴有错误 “CRS-5802: Unable to start the agent process”, 参考 “OHASD 无法启动”的第10条

4. 参考: note 1964240.1 – CRS-5823:Could not initialize agent framework

 

 

问题 3: OCSSD.BIN 无法启动

 

cssd.bin 的正常启动依赖于以下几个必要的条件:

1. GPnP profile 可正常读写 – gpnpd  需要完全正常启动来为profile服务。

如果 ocssd.bin 能够正常的获取 profile,通常情况下,我们会在 ocssd.log 中看到以下类似的信息:

 

2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "ipc://GPNPD_rac1", try 4 of 500...
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is '6'=6
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_rac1" disco ""

 

 

否则,我们会看到以下信息显示在 ocssd.log 中。

 

2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon
2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed, rc(13)

解决方案是确保 gpnpd 是启动并且正常运行的。

 

2. Voting Disk 可以正常读写

 

在 11gR2 的版本中, ocssd.bin 通过 GPnP profile 中的记录获取 Voting disk 的信息, 如果没有足够多的选举盘是可读写的,那么 ocssd.bin 会终止掉自己。

 

2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)
..
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
2010-02-03 22:37:22.228: [    CSSD][1145538880]###################################
2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread

 

如果所有节点上的 ocssd.bin 因为以下错误无法启动,这是因为 voting file 正在被修改:

 

2010-05-02 03:11:19.033: [    CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0

解决的办法是,参照 note 1364971.1 中的步骤,以 exclusive 模式启动 ocssd.bin。

 

如果选举盘的位置是非 ASM 的设备,它的权限和属主应该是如下显示:

 

-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1
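
下面是一个检查表决盘配置和权限的简单示例（需在 CSS 能够启动的节点上执行，设备路径为占位符，仅作为示意）：

$ $GRID_HOME/bin/crsctl query css votedisk       >> 列出当前配置的表决盘及其状态
$ ls -l <表决盘设备路径>                         >> 确认非 ASM 表决盘的权限和属主与上面的要求一致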

 

3. 网络功能是正常的,并且域名解析能够正常工作:

 

如果 ocssd.bin 无法正常的绑定到任何网络上,我们会在 ocssd.log 中看到以下类似的日志信息:

 

2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp

如果私网出现了连通性故障(包括多播功能被关闭)，我们会在 ocssd.log 中看到以下类似的日志信息：

 

 

2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894
2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
..  >>>> after a long delay
2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870)
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client.
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################

验证网络是否正常,请参见:note 1054902.1

如果在网络修改后 CSSD 不能启动，请使用 "gpnptool get" 命令检查 gpnp profile 里定义的 cluster_interconnect 和真正的网卡名字是否一致（可参考下面的示例）。
在 11.2.0.1 上，如果私网不可用，ocssd.bin 可能会使用公网。
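
下面是一个核对私网配置的简单示例（输出因环境而异，仅作为示意；如果 CRS 完全没有启动，oifcfg 可能无法使用）：

$ $GRID_HOME/bin/gpnptool get 2>/dev/null | grep -i cluster_interconnect     >> 查看 profile 中记录的私网网卡(输出为 XML)
$ $GRID_HOME/bin/oifcfg getif                                                >> 查看已注册的网卡用途(public/cluster_interconnect)
$ /sbin/ifconfig -a                                                          >> 与操作系统上实际的网卡名和 IP 进行比对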

 

4. 第三方的集群管理软件已经启动 (如果使用了第三方 clusterware)

 

Grid Infrastructure 可以提供所有的集群功能,不需要安装第三方集群软件。但是如果在您的环境里GI是基于第三方的集群管理软件的,那么需要确保CRS启动前第三方的集群管理软件已经启动了,可以使用grid用户执行下面的命令来验证

 

$GRID_HOME/bin/lsnodes -n
racnode1    1
racnode2    0

如果第三方的集群管理软件没有完全正常启动，我们会在 ocssd.log 中看到以下类似的日志信息：

 

2010-08-30 18:28:13.207: [    CSSD][36]clssnm_skgxninit: skgxncin failed, will retry
2010-08-30 18:28:14.207: [    CSSD][36]clssnm_skgxnmon: skgxn init failed
2010-08-30 18:28:14.208: [    CSSD][36]###################################
2010-08-30 18:28:14.208: [    CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon

在 Grid Infrastructure 尚未安装时，请使用 grid 用户执行以下命令进行验证：

 

$INSTALL_SOURCE/install/lsnodes -v

hp-ux上的一个案例: note 2130230.1 – Grid infrastructure startup fails due to vendor Clusterware did not start (HP-UX Service guard)

 

5. 在错误的 GRID_HOME 下执行命令”crsctl”

 

命令”crsctl” 必须在正确的 GRID_HOME 下执行,才能正常启动其它进程,否则我们会看到以下的错误信息提示:

 

2012-11-14 10:21:44.014: [    CSSD][1086675264]ASSERT clssnm1.c 3248
2012-11-14 10:21:44.014: [    CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0) that the cluster is at
2012-11-14 10:21:44.014: [    CSSD][1086675264]###################################
2012-11-14 10:21:44.014: [    CSSD][1086675264]clssscExit: CSSD aborting from thread clssnmvDDiscThread#

 

 

 

问题 4: CRSD.BIN 无法启动

 

 

如果”crsctl stat res -t -init”显示 ora.crsd 处于intermediate状态,并且另一个节点正在运行着,那么很可能原因是当前这个节点的crsd.bin无法和另一个节点的master crsd.bin通信。
此时,master crsd.bin很可能有异常,杀掉那个master crsd.bin很可能解决问题。
执行 “grep MASTER crsd.trc” 来找到master crsd.bin在哪个节点运行,杀掉那个节点的crsd.bin
之后crsd.bin会被自动启动,不过其它节点的crsd.bin会变成master crsd.bin

 

crsd.bin 的正常启动依赖于以下几个必要的条件:

 

1. ocssd 已经完全正常启动

如果 ocssd.bin 没有完全正常启动,我们会在 crsd.log 中看到以下提示信息:

 

 

2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16)
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29
2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2. OCR 可以正常读写

 

如果 OCR 保存在 ASM 中,那么 ora.asm 资源(ASM 实例) 必须已经启动而且 OCR 所在的磁盘组必须已经被挂载,否则我们在 crsd.log 会看到以下的类似信息:

 

2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI]

[  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge

ORA-15077: could not locate ASM instance serving a required diskgroup

2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7]
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26]
2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup] [7]

2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26

 

 

注意:在11.2 的版本中 ASM 会比 crsd.bin 先启动,并且会把含有 OCR 的磁盘组自动挂载。
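
可以用下面的简单方法确认 ASM 实例已经启动、且 OCR 所在的磁盘组已经挂载（磁盘组名以实际环境为准；asmcmd 需要先设置 ASM 实例的环境变量，例如 ORACLE_SID=+ASM1，此处仅作为示意）：

$ $GRID_HOME/bin/crsctl stat res ora.asm -init        >> 确认 ora.asm 资源为 ONLINE
$ $GRID_HOME/bin/asmcmd lsdg                          >> 确认包含 OCR 的磁盘组已经 MOUNTED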

如果您的 OCR 在非 ASM 的存储中,该文件的属主和权限如下:

 

-rw-r----- 1 root  oinstall  272756736 Feb  3 23:24 ocr

 

 

如果 OCR 是在非 ASM 的存储中,并且不能被正常访问,在 crsd.log 会看到以下的类似信息

 

 

2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m':failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26

如果 OCR 是坏掉了,在 crsd.log 会看到以下的类似信息:

 

2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2: INVALID FORMAT
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device
2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26

如果您的 grid 用户的权限或者所在组发生了变化，那么即使 ASM 仍然可以访问，crsd.log 中也会看到以下类似的信息：

 

 

2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG]

[  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge

ORA-01031: insufficient privileges

2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7]
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26]
2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges] [7]

 

 

 

如果 grid 用户无法写 ORACLE_BASE 目录，或者 GRID_HOME 下的 oracle 二进制文件的属主或权限错误，那么即使 ASM 正常启动并运行，crsd.log 中也会看到以下类似的信息：

 

 

2012-03-04 21:34:23.139: [  OCRASM][3301265904]proprasmo: Error in open/create file in dg [OCR]

[  OCRASM][3301265904]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=12547, loc=kgfokge

2012-03-04 21:34:23.139: [  OCRASM][3301265904]ASM Error Stack : ORA-12547: TNS:lost contact

2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: kgfoCheckMount returned [7]
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: The ASM instance is down
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: Failed to open [+OCR]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: No OCR/OLR devices are usable
2012-03-04 21:34:23.635: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.636: [    GIPC][3301265904] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5326]
2012-03-04 21:34:23.639: [ default][3301265904]clsvactversion:4: Retrieving Active Version from local storage.
2012-03-04 21:34:23.643: [  OCRRAW][3301265904]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2012-03-04 21:34:23.645: [  OCRRAW][3301265904]proprinit: Could not open raw device
2012-03-04 21:34:23.646: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.650: [  OCRAPI][3301265904]a_init:16!: Backend init unsuccessful : [26]
2012-03-04 21:34:23.651: [  CRSOCR][3301265904] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [ CRSMAIN][3301265904] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [    CRSD][3301265904][PANIC] CRSD exiting: Could not init OCR, code: 26

 

 

正常的 GRID_HOME 下该文件的属主和权限应该是如下显示:

 

-rwsr-s--x 1 grid oinstall 184431149 Feb  2 20:37 /ocw/grid/bin/oracle
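
如果发现该文件的属主或权限被改动过，可以参考下面的示例恢复（6751 即对应上面的 -rwsr-s--x；修改前建议先与正常节点比对确认，此处仅作为示意）：

# cd /ocw/grid/bin
# chown grid:oinstall oracle        >> 恢复属主为 grid:oinstall
# chmod 6751 oracle                 >> 恢复 setuid/setgid 位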

如果 OCR 文件或者它的镜像文件无法正常访问 (可能是 ASM 已经启动, 但是 OCR/mirror 所在的磁盘组没有挂载),在 crsd.log 会看到以下的类似信息:

 

 

2010-05-11 11:16:38.578: [  OCRASM][18]proprasmo: Error in open/create file in dg [OCRMIR]
[  OCRASM][18]SLOS : SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +OCRMIR.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +OCRMIR.255.4294967295
ORA-15001: diskgroup "OCRMIR
..
2010-05-11 11:16:38.647: [  OCRASM][18]proprasmo: kgfoCheckMount returned [6]
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmo: The ASM disk group OCRMIR is not found or not mounted
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmdvch: Failed to open OCR location [+OCRMIR] error [26]
2010-05-11 11:16:38.648: [  OCRRAW][18]propriodvch: Error  [8] returned device check for [+OCRMIR]
2010-05-11 11:16:38.648: [  OCRRAW][18]dev_replace: non-master could not verify the new disk (8)
[  OCRSRV][18]proath_invalidate_action: Failed to replace [+OCRMIR] [8]
[  OCRAPI][18]procr_ctx_set_invalid_no_abort: ctx set to invalid
..
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91: Comparing device hash ids between local and master failed
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Master dev (1862408427, 1859478705, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:9: Shutdown CacheLocal. my hash ids don't match
[  OCRAPI][19]procr_ctx_set_invalid_no_abort: ctx set to invalid
[  OCRAPI][19]procr_ctx_set_invalid: aborting...
2010-05-11 11:16:46.587: [    CRSD][19] Dump State Starting ...

 

 

3. crsd.bin 的进程号文件(<GRID_HOME>/crs/init/<节点名>.pid)丢失，或者存在但指向了其它的进程

如果进程号文件不存在,在日志 $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log 我们会看到以下的提示信息:

 

 

2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn't exist.
..
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Error3 -2 writing PID [30269] to the file []
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Failed to record pid for CRSD
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Terminating process
2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai

解决办法：使用 grid 用户执行 "touch" 命令手工创建一个空的进程号文件，然后重新启动 ora.crsd 资源，可参考下面的示例。
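
一个简单的操作示例如下（假设节点名为 racnode1，CRS 属主为 grid，仅作为示意）：

$ touch $GRID_HOME/crs/init/racnode1.pid              >> 以 grid 用户创建空的进程号文件
# $GRID_HOME/bin/crsctl start res ora.crsd -init      >> 以 root 用户重新启动 ora.crsd 资源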

如果进程号文件存在,但是记录的 PID 是指向了其它的进程,而不是 crsd.bin 的进程,在日志 $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log 我们会看到以下的提示信息:

 

 

2011-04-06 15:53:38.777: [ora.crsd][1160390976] [check] PID will be looked for in /ocw/grid/crs/init/racnode1.pid
2011-04-06 15:53:38.778: [ora.crsd][1160390976] [check] PID which will be monitored will be 1535                               >> 1535 is output of "cat /ocw/grid/crs/init/racnode1.pid"
2011-04-06 15:53:38.965: [ COMMCRS][1191860544]clsc_connect: (0x2aaab400b0b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD))
[  clsdmc][1160390976]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) with status 9
2011-04-06 15:53:38.966: [ora.crsd][1160390976] [check] Error = error 9 encountered when connecting to CRSD
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Calling PID check for daemon
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Trying to check PID = 1535
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] PID check returned ONLINE CLSDM returned OFFLINE
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] DaemonAgent::check returned 5
2011-04-06 15:53:39.203: [    AGFW][1160390976] check for resource: ora.crsd 1 1 completed with status: FAILED
2011-04-06 15:53:39.203: [    AGFW][1170880832] ora.crsd 1 1 state changed from: UNKNOWN to: FAILED
..
2011-04-06 15:54:10.511: [    AGFW][1167522112] ora.crsd 1 1 state changed from: UNKNOWN to: CLEANING
..
2011-04-06 15:54:10.513: [ora.crsd][1146542400] [clean] Trying to stop PID = 1535
..
2011-04-06 15:54:11.514: [ora.crsd][1146542400] [clean] Trying to check PID = 1535

在 OS 层面检查该问题:

 

 

ls -l /ocw/grid/crs/init/*pid
-rwxr-xr-x 1 ogrid oinstall 5 Feb 17 11:00 /ocw/grid/crs/init/racnode1.pid
cat /ocw/grid/crs/init/*pid
1535
ps -ef| grep 1535
root      1535     1  0 Mar30 ?        00:00:00 iscsid                  >> 注意:进程 1535 不是 crsd.bin

解决办法是,使用 root 用户,创建一个空的进程号文件,然后重启资源 ora.crsd:

 

 

# > $GRID_HOME/crs/init/<racnode1>.pid
# $GRID_HOME/bin/crsctl stop res ora.crsd -init
# $GRID_HOME/bin/crsctl start res ora.crsd -init

4. 网络功能是正常的,并且域名解析能够正常工作:

 

如果网络功能不正常,ocssd.bin 进程仍然可能被启动, 但是 crsd.bin 可能会失败,同时在 crsd.log 中会提示以下信息:

 

 

2010-02-03 23:34:28.412: [    GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0
2010-02-03 23:34:28.428: [  OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found.
..
2010-02-03 23:34:28.434: [  OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44]
2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-02-03 23:34:28.434: [    CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44

 

 

或者:

 

 

2009-12-10 06:28:31.974: [  OCRMAS][20]proath_connect_master:1: could not connect to master  clsc_ret1 = 9, clsc_ret2 = 9
2009-12-10 06:28:31.974: [  OCRMAS][20]th_master:11: Could not connect to the new master
2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet!
2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet!

 

或者:

 

2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0)

关于网络和域名解析的验证,请参考:note 1054902.1

 

 

5. crsd 可执行文件(GRID_HOME/bin 下的 crsd 和 crsd.bin)的权限和属主正确，并且没有被手工修改过。一个简单可行的检查办法是对比正常节点和故障节点上以下命令的输出："ls -l <grid-home>/bin/crsd <grid-home>/bin/crsd.bin"。

6. crsd可能因为下面的原因无法启动

 

 

note 1552472.1 -CRSD Will Not Start Following a Node Reboot: crsd.log reports: clsclisten: op 65 failed and/or Unable to get E2E port
note 1684332.1 - GI crsd Fails to Start: clsclisten: op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)

 

7. 关于CRSD进程启动问题的进一步深入诊断,请参考 note 1323698.1 – Troubleshooting CRSD Start up Issue

问题 5: GPNPD.BIN 无法启动

 

1. 网络的域名解析不正常

gpnpd.bin 进程启动失败,以下信息提示在 gpnpd.log 中:

 

 

2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "tcp://node2:9393", try 1 of 3...
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393)
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "tcp://node2:9393"

针对以上的例子，请确认当前节点能够正常地 ping 到 "node2"，并且确认两个节点之间没有任何防火墙拦截（可参考下面的示例）。
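
一个简单的连通性检查示例如下（端口号 9393 取自上面的日志，具体端口请以实际环境为准）：

$ ping -c 3 node2            >> 确认名字解析和基本连通性
$ telnet node2 9393          >> 确认 gpnpd 端口没有被防火墙拦截(能建立连接即可)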

2. bug 10105195

由于 bug 10105195，gpnp 的调度线程(dispatch thread)可能被阻塞(例如被网络扫描阻塞)。这个 bug 在 11.2.0.2 GI PSU2、11.2.0.3 及以上版本中被修复，具体信息请参见 note 10105195.8

 

问题 6: 其它的一些守护进程无法启动

 

常见原因:

1. 守护进程的日志文件或者日志所在的路径权限或者属主不正确。

如果日志文件或者日志文件所在的路径权限或者属主设置有问题,通常我们会看到进程尝试启动,但是日志里的信息却始终没有更新.

关于日志位置和权限属主的限制,请参见 “日志文件位置, 属主和权限” 获取更多的信息。

2. 网络的 socket 文件权限或者属主错误

这种情况下,守护进程的日志会显示以下信息:

 

2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

2010-02-02 12:55:20.485: [  clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

 

3. OLR 文件损坏

这种情况下,守护进程的日志会显示以下信息(以下是个 ora.ctssd 无法启动的例子):

 

 

2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [    CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [    CTSS][1]CTSS daemon aborting

解决办法：请恢复一个好的 OLR 副本，具体办法请参见 note 1193643.1，也可参考下面的示例。
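
恢复 OLR 的一个简单示例如下（备份文件名仅为假设，实际备份文件请查看本机 $GRID_HOME/cdata 目录并参照 note 1193643.1）：

# $GRID_HOME/bin/crsctl stop crs -f                                        >> 先停止本节点的 GI
# $GRID_HOME/bin/ocrconfig -local -restore $GRID_HOME/cdata/rac1/backup_20120722_001128.olr
# $GRID_HOME/bin/crsctl start crs                                          >> 重新启动 GI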

4. 其它情况:

note 1087521.1 – CTSS Daemon Aborting With “op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)”

 

问题 7: CRSD Agents 无法启动

 

 

CRSD.BIN 会负责衍生出两个 agents 进程来启动用户的资源,这两个 agents 的名字和 ohasd.bin 的 agents 的名字相同:

orarootagent: 负责启动 ora.netn.network, ora.nodename.vip, ora.scann.vip 和 ora.gns
oraagent: 负责启动 ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service 等资源

我们可以通过以下命令查看用户的资源状态:

 

 

$GRID_HOME/bin/crsctl stat res -t

 

如果 crsd.bin 无法正常启动以上任何一个 agent,用户的资源都将无法正常启动.

1. 通常这些 agent 无法启动的常见原因是 agent 的日志或者日志所在的路径没有设置合适的权限或者属主。

请参见以下 “日志文件位置, 属主和权限” 部分关于日志权限的设置。

2. agent 可能因为 bug 11834289 无法启动,此时我们会看到 “CRS-5802: Unable to start the agent process”错误信息,请参见 “OHASD 无法启动”  #10 获取更多信息。

 

问题 8: HAIP 无法启动

 

HAIP 无法启动的原因有很多,例如:

[ohasd(891)]CRS-2807:Resource ‘ora.cluster_interconnect.haip’ failed to start automatically.

请参见 note 1210883.1 获取更多关于 HAIP 的信息。

 

 

网络和域名解析的验证

CRS 的启动,依赖于网络功能和域名解析的正常工作,如果网络功能或者域名解析不能正常工作,CRS 将无法正常启动。

关于网络和域名解析的验证,请参考: note 1054902.1

 

日志文件位置, 属主和权限

正确设置 $GRID_HOME/log 及其子目录和文件的权限与属主，对 CRS 组件的正常启动是至关重要的。

 

在 Grid Infrastructure 的环境中:

我们假设一个 Grid Infrastructure 环境,节点名字为 rac1, CRS 的属主是 grid, 并且有两个单独的 RDBMS 属主分别为: rdbmsap 和 rdbmsar,以下是 $GRID_HOME/log 中正常的设置情况:

 

drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
drwxr-xr-x  2 grid oinstall 4096 Dec  6 08:36 crs
drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 rac1
drwxr-x--- 2 grid oinstall  4096 Dec  6 09:20 admin
drwxrwxr-t 4 root   oinstall  4096 Dec  6 09:20 agent
drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd
drwxr-xr-t 2 grid  oinstall 4096 Dec  6 09:40 application_grid
drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 oraagent_grid
drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap
drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar
drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid
drwxr-xr-t 2 root    root     4096 Jan 26 20:09 orarootagent_root
drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd
drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid
drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdagent_root
drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdmonitor_root
drwxr-xr-t 2 root   root     4096 Jan 26 18:14 orarootagent_root
-rw-rw-r-- 1 root root     12931 Jan 26 21:30 alertrac1.log
drwxr-x--- 2 grid oinstall  4096 Jan 26 20:44 client
drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 crsd
drwxr-x--- 2 grid oinstall  4096 Dec  6 09:24 cssd
drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 ctssd
drwxr-x--- 2 grid oinstall  4096 Jan 26 18:14 diskmon
drwxr-x--- 2 grid oinstall  4096 Dec  6 09:25 evmd
drwxr-x--- 2 grid oinstall  4096 Jan 26 21:20 gipcd
drwxr-x--- 2 root oinstall  4096 Dec  6 09:20 gnsd
drwxr-x--- 2 grid oinstall  4096 Jan 26 20:58 gpnpd
drwxr-x--- 2 grid oinstall  4096 Jan 26 21:19 mdnsd
drwxr-x--- 2 root oinstall  4096 Jan 26 21:20 ohasd
drwxrwxr-t 5 grid oinstall  4096 Dec  6 09:34 racg
drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut
drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf
drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain
drwxr-x--- 2 grid oinstall  4096 Jan 26 20:57 srvm

请注意,绝大部分的子目录都继承了父目录的属主和权限,以上仅作为一个参考,来判断 CRS HOME 中是否有一些递归的权限和属主改变,如果您已经有一个相同版本的正在运行的工作节点,您可以把该运行的节点作为参考。

 

在 Oracle Restart 的环境中:

这里显示了在 Oracle Restart 环境中 $GRID_HOME/log 目录下的权限和属主设置:

 

drwxrwxr-x 5 grid oinstall 4096 Oct 31  2009 log
drwxr-xr-x  2 grid oinstall 4096 Oct 31  2009 crs
drwxr-xr-x  3 grid oinstall 4096 Oct 31  2009 diag
drwxr-xr-t 17 root   oinstall 4096 Oct 31  2009 rac1
drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 admin
drwxrwxr-t 4 root   oinstall  4096 Oct 31  2009 agent
drwxrwxrwt 2 root oinstall 4096 Oct 31  2009 crsd
drwxrwxr-t 8 root oinstall 4096 Jul 14 08:15 ohasd
drwxr-xr-x 2 grid oinstall 4096 Aug  5 13:40 oraagent_grid
drwxr-xr-x 2 grid oinstall 4096 Aug  2 07:11 oracssdagent_grid
drwxr-xr-x 2 grid oinstall 4096 Aug  3 21:13 orarootagent_grid
-rwxr-xr-x 1 grid oinstall 13782 Aug  1 17:23 alertrac1.log
drwxr-x--- 2 grid oinstall  4096 Nov  2  2009 client
drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 crsd
drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 cssd
drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 ctssd
drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 diskmon
drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 evmd
drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gipcd
drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 gnsd
drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gpnpd
drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 mdnsd
drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 ohasd
drwxrwxr-t 5 grid oinstall  4096 Oct 31  2009 racg
drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgeut
drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgevtf
drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgmain
drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 srvm

 

对于12.1.0.2及以上版本,参考 note 1915729.1 – Oracle Clusterware Diagnostic and Alert Log Moved to ADR

 

网络socket文件的位置,属主和权限

网络的 socket 文件可能位于目录 /tmp/.oracle、/var/tmp/.oracle 或 /usr/tmp/.oracle 中。

当网络的 socket 文件权限或者属主设置不正确的时候,我们通常会在守护进程的日志中看到以下类似的信息:

 

2011-06-18 14:07:28.545: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_EVMD))

2011-06-18 14:07:28.545: [  clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=lena042DBG_EVMD))
2011-06-18 14:07:28.545: [  clsdmt][515]Terminating process
2011-06-18 14:07:28.559: [ default][515] EVMD exiting on stop request from clsdms_thdmai

 

以下错误也有可能提示:

 

CRS-5017: The resource action "ora.evmd start" encountered the following error:
CRS-2674: Start of 'ora.evmd' on 'racnode1' failed
..

解决的办法:请使用 root 用户停掉 GI,删除这些 socket 文件,并重新启动 GI。
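
一个简单的操作示例如下（以 Linux 上的 /tmp/.oracle 和 /var/tmp/.oracle 为例，请确认 GI 已完全停止后再删除，仅作为示意）：

# $GRID_HOME/bin/crsctl stop crs -f          >> 以 root 用户停止本节点 GI
# rm -rf /tmp/.oracle /var/tmp/.oracle       >> 删除 socket 文件所在目录
# $GRID_HOME/bin/crsctl start crs            >> 重新启动 GI, socket 文件会被重建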

我们假设一个 Grid Infrastructure 环境,节点名为 rac1, CRS 的属主是 grid,以下是 socket 文件夹(../.oracle)正常的设置情况:

 

在 Grid Infrastructure cluster 环境中:

以下例子是集群环境中的例子:

 

 

drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
drwxrwxrwt 2 root  oinstall 4096 Feb  2 21:25 .
srwxrwx--- 1 grid oinstall    0 Feb  2 18:00 master_diskmon
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 mdnsd
-rw-r--r-- 1 grid oinstall    5 Feb  2 18:00 mdnsd.pid
prw-r--r-- 1 root  root        0 Feb  2 13:33 npohasd
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 ora_gipc_GPNPD_rac1
-rw-r--r-- 1 grid oinstall    0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sAevm
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sCevm
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_UI_SOCKET
srwxrwxrwx 1 root  root        0 Feb  2 21:25 srac1DBG_CRSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_CTSSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_EVMD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GIPCD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GPNPD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_MDNSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN3
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sora_crsqs
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROL
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth

 

 

在 Oracle Restart 环境中:

以下是 Oracle Restart 环境中的输出例子:

 

drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
srwxrwx--- 1 grid oinstall 0 Aug  1 17:23 master_diskmon
prw-r--r-- 1 grid oinstall 0 Oct 31  2009 npohasd
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.1
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.2
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.1
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.2
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.1
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.2
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.1
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.2
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.1
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.2
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_CSSD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sEXTPROC1521
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost_lock
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1_lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sprocr_local_conn_0_PROL

 

诊断文件收集

如果通过本文没有找到问题原因，请使用 root 用户，在所有的节点上执行 $GRID_HOME/bin/diagcollection.sh，并上传在当前目录下生成的所有 .gz 压缩文件来做进一步诊断。
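
收集诊断信息的一个简单示例如下（仅作为示意）：

# cd /tmp
# $GRID_HOME/bin/diagcollection.sh           >> 以 root 用户在每个节点上执行
# ls -l *.gz                                 >> 当前目录下生成的压缩包即需要上传的文件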

Oracle 11g OCM考试考点分析 RAC数据库安装

 本文永久链接地址:https://www.askmac.cn/archives/oracle-11g-ocm-rac-install.html

 

16 RAC数据库安装

 

16.1 目标

 

在完成这个课程后,你应该可以:

  • 安装数据库软件
  • 创建一个集群数据库
  • 执行创建数据库后的任务

 


Oracle 11g OCM考试考点分析 管理Oracle 集群

本文永久链接地址:https://www.askmac.cn/archives/oracle-11g-ocm-manage-clusterware.html

 

 

15 管理Oracle 集群

 

15.1 目标

在完成这个课程后,你应该可以:

  • 熟练的描述集群管理
  • 演示OCR备份和恢复技术

 

15.2 管理Oracle 集群

  • 命令行工具

    – crsctl 管理集群相关的操作:

      - 启动和关闭Oracle集群

      - 启用和禁用Oracle集群后台进程

      - 注册集群资源

    – srvctl 管理Oracle 资源相关操作:

      - 启动和关闭数据库实例和服务

在Oracle Grid安装的home路径下的命令行工具crsctl和srvctl用来管理Oracle集群。使用crsctl可以监控和管理任何集群节点的集群组件和资源。srvctl工具提供了类似的功能，来监控和管理Oracle相关的资源，例如数据库实例和数据库服务。crsctl命令只能由集群管理员运行，srvctl命令也可以由其他用户（例如数据库管理员）使用。


Oracle 11g OCM考试考点分析 Oracle Grid 安装

本文永久链接地址:https://www.askmac.cn/archives/oracle-11g-install-grid.html

 

Grid 安装

14.1 目标

 

在这个课程之后,你应该能够:

  • 完成grid 预安装任务
  • 安装grid
  • 验证安装
  • 配置ASM磁盘组

 

14.2  grid预安装任务

 

1.共享存储

  • 这里有3种方式来存储grid文件:

-一个支持的集群文件系统(CFS)

-一个验证的网络文件系统(NFS)

-自动存储管理(ASM)

 

存储选项                         Voting/OCR      Oracle 软件
ASM                              yes             no
ASM 集群文件系统(ACFS)           no              yes
Oracle 集群文件系统(OCFS2)       yes             yes
NFS(仅限经过验证的NFS)           yes             yes
共享磁盘分区(块设备或裸设备)     no              no

 

  • 使用DBCA或者OUI不能把数据库全新安装到块设备或者裸设备上
  • 当升级一个现存的RAC数据库时，可以继续使用现有的裸设备和块设备分区，并执行滚动升级。


clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1

cssd.log中的报错信息如下:
2013-09-25 08:46:03.739: [    CSSD][2834](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
2013-09-25 08:46:03.749: [    CSSD][2834]###################################
2013-09-25 08:46:03.749: [    CSSD][2834]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
2013-09-25 08:46:03.749: [    CSSD][2834]###################################
2013-09-25 08:46:03.749: [    CSSD][2834](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2013-09-25 08:46:03.753: [    CSSD][2834]
 
 
该报错与BUG 13869978  较为一致
 
该BUG 13869978的特点是 当仅有一个votedisk时才会触发, 当使用ASM DISKGROUP存放votedisk时:
external redundancy  => 1份votedisk
normal   redundancy  => 3份
High     redundancy  => 5份
当使用external redundancy 时由于只有1份votedisk 所以可能触发该BUG
该bug 目前在11.2.0.3.3 +AIX上有补丁可以打
建议:
考虑使用 normal redundancy 的 diskgroup 存放 votedisk，或者应用补丁 13869978（检查和调整 votedisk 的示例见下）
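
检查和调整 votedisk 的一个简单示例如下（磁盘组名 +CRSDG 仅为假设）：

$ crsctl query css votedisk            >> 查看当前 votedisk 的数量、状态和所在磁盘组
# crsctl replace votedisk +CRSDG       >> 以 root 用户将 votedisk 迁移到 normal redundancy 的磁盘组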
 
 

Bug 13869978  OCSSD reports that the voting file is offline without reporting the reason

 

Affects:

Product (Component): Oracle Server (PCW)
Range of versions believed to be affected: Versions BELOW 12.1
Versions confirmed as being affected:
  • 11.2.0.3
  • 11.2.0.2
Platforms affected: Generic (all / most platforms affected)

Fixed:

This issue is fixed in
  • 12.1.0.1 (Base Release)
  • 11.2.0.4 (Future Patch Set)
  • 11.2.0.3.4 Grid Infrastructure Patch Set Update (GI PSU)
  • 11.2.0.3 Patch 11 on Windows Platforms

Symptoms:

Related To:

  • (None Specified)
  • Cluster Ready Services / Parallel Server Management

Description

When we have a single voting file, CSSD reports the file offline, but there is no IO error or hung condition
prior to taking the voting file offline:

[    CSSD][29](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
[    CSSD][29]###################################
[    CSSD][29]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
[    CSSD][29]###################################
[    CSSD][29](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

Rediscovery Notes:

If there is no IO error or IO hang that caused the voting file to be set offline, then we may be hitting this bug.

Workaround
Use more than 1 voting file.

Oracle RAC/Clusterware 多种心跳heartbeat机制介绍 RAC超时机制分析

ORACLE RAC中最主要存在2种clusterware集群件心跳 &  RAC超时机制分析:

1、Network Heartbeat 网络心跳：每秒发生一次；10.2.0.4 以后网络心跳超时 misscount 为 60s；11.2 以后网络心跳超时 misscount 为 30s。

2、Disk Heartbeat 磁盘心跳：每秒发生一次；10.2.0.4 以后磁盘心跳超时 DiskTimeout 为 200s。

注意不管是磁盘心跳还是网络心跳都依赖于 cssd.bin 进程来实施这些操作，在真实世界中任何造成 cssd.bin 这个普通用户进程无法正常工作的原因均可能造成上述 2 种心跳超时，原因包括但不局限于 CPU 无法分配足够的时间片、内存不足、SWAP、网络问题、Votedisk IO 问题、本地磁盘 IO 问题等等(askmac.cn)。

 

此外在使用 ASM 的情况下，DB 作为 ASM 实例的客户端(Client)；ASM 实例会对 DB 实例的 ASMB 等进程进行监控，以保证 DB 与 ASM 之间通信正常。若 DB 的 ASMB 进程长期无响应(大约为 200s)，则 ASM 实例将考虑 KILL DB 的 ASMB 进程；由于 ASMB 是关键后台进程，所以这将导致 DB 实例重启。

也存在其他可能的情况,例如由于ASMB 被某些latch block, 会阻塞其他进程,导致PMON进行强制清理。

 

综上所述，不管是 Clusterware 的 cssd.bin 进程还是 ASMB 进程，它们都是 OS 上的普通用户进程，OS 本身出现的问题、超时、延迟均可能造成它们无法正常工作。对于确认会造成 OS 长时间网络或 IO 延迟的维护操作，建议考虑先停止节点上的 Clusterware 之后再实施。

另可以考虑将 misscount、Disktimeout 等心跳超时参数修改为更大的值（示例见下），但修改这些值并不能保证就可以不触发 Node Eviction。
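
查看和修改这些超时参数的简单示例如下（修改需以 root 用户执行，且建议在 Oracle Support 指导下进行，此处仅作为示意）：

$ crsctl get css misscount          >> 查看当前网络心跳超时(秒)
$ crsctl get css disktimeout        >> 查看当前磁盘心跳超时(秒)
# crsctl set css misscount 60       >> 示例: 将 misscount 修改为 60 秒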

 

关于RAC /CRS对于本地盘的问题,详见如下的SR回复:

Does RAC/CRS monitor Local Disk IO ?

 

Oracle software use local ORACLE_HOME / GRID_HOME library files for main process operations.

 

 

There are some socket files under /tmp or /var/tmp needed for CRS communication.

 

Also, the init processes are all depending on the /etc directory to spawn the child processes.

 

Again, this is a complicated design for a cluster software which mainly rely on the OS stability including local file system.

 

Any changes to storage / OS are all recommended to stop CRS services since those are out of our release Q/A tests.

 

由于10.2的环境已经超出我们开发的支持服务期限,建议考虑升级到11.2.0.3来获得更全面的技术支持。

Oracle Acs资深顾问罗敏 老罗技术核心感悟:Clusterware是成熟产品吗?

作者为: 

SHOUG成员 – ORACLE ACS高级顾问罗敏

 

Oracle公司自10g版本开始就推出了集群管理软件CRS,以后又升级改造成Clusterware,到11g版本之后更是大动干戈,内部架构进行了大幅度改造,并与ASM技术整合在一起,称之为GI(Grid Infrastructure)。Clusterware替代了硬件厂商和第三方厂商的集群软件功能,也使得Oracle RAC与Clusterware集成为一体,在产品的整体性、服务支持一体化等方面具有显著优势。

作为新产品、新技术,稳定性、成熟性略差,情有可原。但到了11g仍然如此,则让人难以理解了。

本人最近在Windows 2012平台实施了2节点11.2.0.4 RAC,并通过增加节点方式扩展到了4节点RAC,在国内实属罕见案例。期间一些波折,表明 Clusterware产品仍然不成熟。

话说那天我在实施节点扩展操作之前，先花费了半天时间进行了新节点的环境准备，然后通过如下命令进行了环境检查：

cluvfy stage -pre nodeadd -n hsedb3 -verbose

 

… …

节点 “hsedb3” 上的共享存储检查成功

 

硬件和操作系统设置 的后期检查成功。

哟,一切都“成功”,开练了!于是,我按照Oracle文档标准流程在节点1开始运行AddNode.bat脚本了,一切“正常”!我继续在节点3运行了gridconfig.bat等脚本。

待所有脚本顺利运行完之后检查环境时，却发现节点3根本没有加入到集群环境中，节点3上的Clusterware服务也根本没有启动。—— 这就是产品的严重不成熟，明明出问题了，所有脚本却不显示任何一条返回错误，显示一切正常！更可气的是，AddNode.bat脚本的日志文件(addNodeActions2014-09-07_04-52-22PM.log)也居然显示一切正常，最后还来一句：

*** 安装 结束 页***

C:\app\11.2.0\grid 的 添加集群节点 已成功。

我知道Oracle支持在Windows平台进行RAC加节点操作,但现在没有成功,一定是我犯什么错误了,也肯定知道有什么错误信息藏在什么鸟日志文件里了。无奈天色已晚,忙乎一天了,于是先打道回府了。

隔日,待我回到现场仔细分析各类Clusterware日志文件信息时,首先在alerthsedb3.log文件中大海捞针般地发现了出错信息:

[cssd(4484)]CRS-1649:表决文件出现 I/O 错误: \\.\ORCLDISKORADG0; 详细信息见 (:CSSNM00059:) (位于 C:\app\11.2.0\grid\log\hsedb3\cssd\ocssd.log)。

于是,按图索骥继续去查询ocssd.log文件中的信息。又像侦探一样,在ocssd.log文件8千多行的日志信息中发现了如下错误信息:

2014-09-07 17:00:06.192: [   SKGFD][4484]ERROR: -9(Error 27070, OS Error (OSD-04016: 异步 I/O 请求排队时出错。

O/S-Error: (OS 19) 介质受写入保护。)

此时，其实本人已经觉察出问题了：可能是节点3对存储设备只有读权限，连表决盘(Voting Disk)都没有写入功能，从而导致失败了。为保险起见，还是根据上述出错信息在Metalink中进行了一番搜索，果然如此！《Tablespace (Datafile) Creation On ASM Diskgroup Fails With "[ORA-15081: Failed To Submit An I/O Operation To A Disk] : [ O/S-Error: (OS 19) The media Is Write Protected]" On Windows. ( Doc ID 1551766.1 )》详细描述了原委和解决方案。于是，按照该文档的建议，我将节点3对所有共享存储设备的权限从只读状态修改为可读、可写的联机状态。也明白了一个细节：新节点对共享存储设备的权限缺省为只读状态。无论如何，安装之前没有仔细检查共享存储设备的权限是我犯的一个错误。

接下来该是重新进行节点增加操作了。且慢!因为前面已经错误地进行了节点增加操作,而且居然显示成功了,那么运行AddNode.bat脚本的节点1肯定已经在OCR、Voting Disk等集群文件中写入节点3不正确的信息了。因此,需要先实施从集群中删除节点3的操作,但是发现Oracle标准文档中的删除节点操作的如下第一条命令有错误!

C:\>Grid_home\perl\bin\perl -I$Grid_home\perl\lib -I$Grid_home\crs\install

Grid_home\crs\install\rootcrs.pl -deconfig -force

 

又是一番折腾,将上述命令修改如下:

cd \app\11.2.0\grid

 

C:\>perl\bin\perl -I perl\lib -I crs\install crs\install\rootcrs.pl -deconfig -force

终于顺利删除了节点3!

现在可以重新来一遍了。这次一马平川地成功增加了节点3的Clusterware以及RAC,还有节点4的Clusterware和RAC。

 

感悟之一:明明节点3对共享存储只有读权限,而cluvfy却说:节点 “hsedb3” 上的共享存储检查成功!一定是cluvfy只检查了读权限,而没有检查写权限。很可能是cluvfy的Bug!

感悟之二:明明增加节点3的操作失败了。但不仅AddNode.bat没有在命令行及时显示错误,而且对应的日志文件还显示“添加集群节点已成功”。极大地误导客户!罪不可恕!

感悟之三:诊断Clusterware问题太难了!Oracle公司没有告诉客户Clusterware问题的诊断思路,特别是日志文件太多了,不知道先看哪个日志文件,后看哪个日志文件。此次本人完全是凭经验,先看了alerthsedb3.log文件,才找到问题的蛛丝马迹,进而逐步确认问题并加以解决。

… …

总之,Clusterware仍然是一个非常不成熟的产品!

沪公网安备 31010802001379号