How to Recover from a Failed root.sh on 11.2 Grid Infrastructure

From 10g Clusterware to 11g Release 2 Grid Infrastructure, Oracle has packed more and more into the RAC stack. Even if you follow the Step by Step Installation guide to the letter when installing 11.2.0.1 GI, running the root.sh script can still throw up one error or another. Here is one example:

[root@rh3 grid]# ./root.sh
Running Oracle 11g root.sh script...

The following environment variables are set as:
    ORACLE_OWNER= maclean
    ORACLE_HOME=  /u01/app/11.2.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]: 

The file "dbhome" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]: The file "oraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]:
The file "coraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]: 

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root.sh script.
Now product-specific root actions will be performed.
2011-03-28 20:43:13: Parsing the host name
2011-03-28 20:43:13: Checking for super user privileges
2011-03-28 20:43:13: User has super user privileges
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Adding daemon to inittab
CRS-4123: Oracle High Availability Services has been started.
ohasd is starting

ADVM/ACFS is not supported on oraclelinux-release-5-5.0.2

The root.sh run on one node actually claimed that ADVM/ACFS is not supported on OEL 5.5, yet Red Hat 5 and OEL 5 are among the few platforms that do support ACFS (The ACFS install would be on a supported Linux release – either Oracle Enterprise Linux 5 or Red Hat 5).

A search on Metalink shows this is a bug on the Linux platform, Bug 9474252: ‘ACFSLOAD START’ RETURNS “ADVM/ACFS IS NOT SUPPORTED ON DHL-RELEASE-…”

Since the same Not Supported message did not appear when root.sh was run on the other node (also Enterprise Linux Enterprise Linux Server release 5.5 (Carthage)), finding the differences between the two nodes should generally solve the problem:

Release-related rpm packages on the node that did not fail:

[maclean@rh6 tmp]$ cat /etc/issue
Enterprise Linux Enterprise Linux Server release 5.5 (Carthage)
Kernel \r on an \m

[maclean@rh6 tmp]$ rpm -qa|grep release
enterprise-release-notes-5Server-17
enterprise-release-5-0.0.22

Release-related rpm packages on the failing node:

[root@rh3 tmp]# rpm -qa | grep release
oraclelinux-release-5-5.0.2
enterprise-release-5-0.0.22
enterprise-release-notes-5Server-17

Compared with the healthy node, the failing node has one extra rpm installed, named oraclelinux-release-5-5.0.2, so we try uninstalling it to see whether that fixes the problem. (In fact the issue can also be resolved by changing the contents of /tmp/.linux_release to enterprise-release-5-0.0.17, without uninstalling the oraclelinux-release-5* package as we do here.)
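For reference, a minimal sketch of that file-edit workaround (it rewrites the release string checked during the ACFS installation; back the file up first and verify against the MOS note before relying on it):

cp /tmp/.linux_release /tmp/.linux_release.bak
echo "enterprise-release-5-0.0.17" > /tmp/.linux_release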

[root@rh3 install]# rpm -e oraclelinux-release-5-5.0.2

[root@rh3 grid]# ./root.sh
Running Oracle 11g root.sh script...

The following environment variables are set as:
    ORACLE_OWNER= maclean
    ORACLE_HOME=  /u01/app/11.2.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The file "dbhome" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]:
The file "oraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]:
The file "coraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]: 

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root.sh script.
Now product-specific root actions will be performed.
2011-03-28 20:57:21: Parsing the host name
2011-03-28 20:57:21: Checking for super user privileges
2011-03-28 20:57:21: User has super user privileges
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
CRS is already configured on this node for crshome=0
Cannot configure two CRS instances on the same cluster.
Please deconfigure before proceeding with the configuration of new home.

Running root.sh on the failed node again, we are told to deconfigure first before configuring the new home. The official <Oracle Grid Infrastructure Installation Guide 11g Release 2> describes how to deconfigure Grid Infrastructure in 11g Release 2 (Deconfiguring Oracle Clusterware Without Removing Binaries):

/* As with everything that manages Grid Infrastructure, the following must still be run as root */

[root@rh3 grid]# pwd
/u01/app/11.2.0/grid

/* We are currently in the GRID_HOME directory */

[root@rh3 grid]# cd crs/install

/* Run the script named rootcrs.pl with the -deconfig option */

[root@rh3 install]# ./rootcrs.pl -deconfig
2011-03-28 21:03:05: Parsing the host name
2011-03-28 21:03:05: Checking for super user privileges
2011-03-28 21:03:05: User has super user privileges
Using configuration parameter file: ./crsconfig_params
VIP exists.:rh3
VIP exists.: //192.168.1.105/255.255.255.0/eth0
VIP exists.:rh6
VIP exists.: //192.168.1.103/255.255.255.0/eth0
GSD exists.
ONS daemon exists. Local port 6100, remote port 6200
eONS daemon exists. Multicast port 20796, multicast IP address 234.227.83.81, listening port 2016
Please confirm that you intend to remove the VIPs rh3 (y/[n]) y
ACFS-9200: Supported
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rh3'
CRS-2673: Attempting to stop 'ora.crsd' on 'rh3'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rh3'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'rh3'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'rh3' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'rh3' has completed
CRS-2677: Stop of 'ora.crsd' on 'rh3' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rh3'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rh3'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'rh3'
CRS-2673: Attempting to stop 'ora.ctssd' on 'rh3'
CRS-2673: Attempting to stop 'ora.evmd' on 'rh3'
CRS-2677: Stop of 'ora.cssdmonitor' on 'rh3' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'rh3' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'rh3' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rh3' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'rh3' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rh3'
CRS-2677: Stop of 'ora.cssd' on 'rh3' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'rh3'
CRS-2673: Attempting to stop 'ora.gipcd' on 'rh3'
CRS-2677: Stop of 'ora.gipcd' on 'rh3' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'rh3' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rh3' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully deconfigured Oracle clusterware stack on this node

/* If the deconfig above does not succeed, rerun rootcrs.pl with the -force option */

[root@rh3 install]# ./rootcrs.pl -deconfig -force
2011-03-28 21:41:00: Parsing the host name
2011-03-28 21:41:00: Checking for super user privileges
2011-03-28 21:41:00: User has super user privileges
Using configuration parameter file: ./crsconfig_params
VIP exists.:rh3
VIP exists.: //192.168.1.105/255.255.255.0/eth0
VIP exists.:rh6
VIP exists.: //192.168.1.103/255.255.255.0/eth0
GSD exists.
ONS daemon exists. Local port 6100, remote port 6200
eONS daemon exists. Multicast port 20796, multicast IP address 234.227.83.81, listening port 2016
ACFS-9200: Supported
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rh3'
CRS-2673: Attempting to stop 'ora.crsd' on 'rh3'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rh3'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'rh3'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'rh3' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'rh3' has completed
CRS-2677: Stop of 'ora.crsd' on 'rh3' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rh3'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rh3'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'rh3'
CRS-2673: Attempting to stop 'ora.ctssd' on 'rh3'
CRS-2673: Attempting to stop 'ora.evmd' on 'rh3'
CRS-2677: Stop of 'ora.cssdmonitor' on 'rh3' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'rh3' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'rh3' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rh3' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'rh3' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rh3'
CRS-2677: Stop of 'ora.cssd' on 'rh3' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'rh3'
CRS-2673: Attempting to stop 'ora.gipcd' on 'rh3'
CRS-2677: Stop of 'ora.gipcd' on 'rh3' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'rh3' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rh3' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully deconfigured Oracle clusterware stack on this node

/* Fortunately this trick always does the job; otherwise we would have to completely uninstall and reinstall GI every time */

With CRS successfully deconfigured, we can try running the ill-fated root.sh once more:

[root@rh3 grid]# pwd
/u01/app/11.2.0/grid

[root@rh3 grid]# ./root.sh
Running Oracle 11g root.sh script...

The following environment variables are set as:
    ORACLE_OWNER= maclean
    ORACLE_HOME=  /u01/app/11.2.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The file "dbhome" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]:
The file "oraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]:
The file "coraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]: 

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root.sh script.
Now product-specific root actions will be performed.
2011-03-28 21:07:29: Parsing the host name
2011-03-28 21:07:29: Checking for super user privileges
2011-03-28 21:07:29: User has super user privileges
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Adding daemon to inittab
CRS-4123: Oracle High Availability Services has been started.
ohasd is starting
FATAL: Module oracleoks not found.
FATAL: Module oracleadvm not found.
FATAL: Module oracleacfs not found.
acfsroot: ACFS-9121: Failed to detect /dev/asm/.asm_ctl_spec.

acfsroot: ACFS-9310: ADVM/ACFS installation failed.

acfsroot: ACFS-9311: not all components were detected after the installation.

CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node rh6, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
CRS-2672: Attempting to start 'ora.mdnsd' on 'rh3'
CRS-2676: Start of 'ora.mdnsd' on 'rh3' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'rh3'
CRS-2676: Start of 'ora.gipcd' on 'rh3' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rh3'
CRS-2676: Start of 'ora.gpnpd' on 'rh3' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rh3'
CRS-2676: Start of 'ora.cssdmonitor' on 'rh3' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rh3'
CRS-2672: Attempting to start 'ora.diskmon' on 'rh3'
CRS-2676: Start of 'ora.diskmon' on 'rh3' succeeded
CRS-2676: Start of 'ora.cssd' on 'rh3' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'rh3'
CRS-2676: Start of 'ora.ctssd' on 'rh3' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rh3'
CRS-2676: Start of 'ora.crsd' on 'rh3' succeeded
CRS-2672: Attempting to start 'ora.evmd' on 'rh3'
CRS-2676: Start of 'ora.evmd' on 'rh3' succeeded
/u01/app/11.2.0/grid/bin/srvctl start vip -i rh3 ... failed
Preparing packages for installation...
cvuqdisk-1.0.7-1
Configure Oracle Grid Infrastructure for a Cluster ... failed
Updating inventory properties for clusterware
Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB.   Actual 5023 MB    Passed
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /s01/oraInventory
'UpdateNodeList' was successful.

Although the above gets around the "ADVM/ACFS is not supported" problem, we now hit "FATAL: Module oracleoks/oracleadvm/oracleacfs not found": the ACFS/ADVM kernel modules cannot be found on Linux. A Metalink search shows these are two confirmed bugs in GI 11.2.0.2, bug 10252497 and bug 10266447, even though what I installed is the 11.2.0.1 GI... Fortunately my environment sits on NFS storage, so storage options such as ADVM/ACFS can safely be ignored here; I plan to test this again on 11.2.0.2.
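As a quick sanity check (plain Linux tooling, not taken from the MOS notes), lsmod shows whether the ACFS/ADVM kernel modules were ever loaded; empty output here would match the FATAL messages above:

lsmod | grep -E 'oracleoks|oracleadvm|oracleacfs'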

It has to be said that the 11.2.0.1 GI installation has so many problems that Oracle Support has had to write quite a few troubleshooting documents for it, for example <Troubleshooting 11.2 Grid Infrastructure Installation Root.sh Issues [ID 1053970.1]> and <How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation [ID 942166.1]>. I have not yet tried the 11.2.0.2 GI; let's hope it is not as bad as its predecessor!

Fast incremental backup failed on standby database

On an 11.1.0.7 physical standby database on Linux, incremental backups above level 0 using fast incremental backup fail with ORA-19648. The error log looks like this:

RMAN>  backup incremental level 1 database;

Starting backup at 22-MAR-11
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting incremental level 1 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/standby/oradata/SBDB2/datafile/o1_mf_sysaux_22m6ov92_.dbf
input datafile file number=00003 name=/standby/oradata/SBDB2/datafile/o1_mf_undotbs1_23m6ovap_.dbf
input datafile file number=00006 name=/standby/oradata/SBDB2/datafile/o1_mf_enc_25m6ovba_.dbf
channel ORA_DISK_1: starting piece 1 at 22-MAR-11
channel ORA_DISK_2: starting incremental level 1 datafile backup set
channel ORA_DISK_2: specifying datafile(s) in backup set
input datafile file number=00001 name=/standby/oradata/SBDB2/datafile/o1_mf_system_21m6ov92_.dbf
input datafile file number=00005 name=/standby/oradata/SBDB2/datafile/o1_mf_example_24m6ovar_.dbf
input datafile file number=00004 name=/standby/oradata/SBDB2/datafile/o1_mf_users_26m6ovba_.dbf
channel ORA_DISK_2: starting piece 1 at 22-MAR-11
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 03/22/2011 20:20:28
ORA-19648: datafile 2: incremental-start SCN equals checkpoint SCN
ORA-19640: datafile checkpoint is SCN 1249353 time 03/09/2011 05:50:27
continuing other job steps, job failed will not be re-run
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_DISK_2 channel at 03/22/2011 20:22:33
ORA-19648: datafile 1: incremental-start SCN equals checkpoint SCN
ORA-19640: datafile checkpoint is SCN 1249352 time 03/09/2011 05:50:26

Analysis shows that the ORA-19640/ORA-19648 pair is caused by a bug in 11.1.0.7, confirmed on MOS as Bug 9288598:

Affects:
    Product (Component):                        Oracle Server (Rdbms)
    Range of versions believed to be affected:  Versions BELOW 12.1
    Versions confirmed as being affected:       11.1.0.7
    Platforms affected:                         Generic (all / most platforms affected)

Fixed:
    This issue is fixed in:
        12.1 (Future Release)
        11.2.0.2 (Server Patch Set)
An RMAN Backup incremental level 1 can fail with ORA-19648 / ORA-19640 if the standby database is open read only rather than just skipping the datafile.
PROBLEM:
--------
Following error occurring during incremental backup:

Starting backup at 02-DEC-2009 20:37:44
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=154 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: Veritas NetBackup for Oracle - Release 6.5 
(2008052301)
channel ORA_SBT_TAPE_1: starting incremental level 1 datafile backup set
channel ORA_SBT_TAPE_1: specifying datafile(s) in backup set
input datafile file number=00002 
name=/data/ora_data01/KAHCB4P/KAHCB4P_sysaux_01.dbf
input datafile file number=00001 
name=/data/ora_data01/KAHCB4P/KAHCB4P_system_01.dbf
input datafile file number=00003 
name=/data/ora_data01/KAHCB4P/KAHCB4P_undo_01.dbf
input datafile file number=00004 
name=/data/ora_data01/KAHCB4P/KAHCB4P_users_01.dbf
input datafile file number=00005 
name=/data/ora_data01/KAHCB4P/KAHCB4P_tools_01.dbf
input datafile file number=00006 
name=/data/ora_data01/KAHCB4P/KAHCB4P_backup_test_01.dbf
channel ORA_SBT_TAPE_1: starting piece 1 at 02-DEC-2009 20:38:01
RMAN-3009: failure of backup command on ORA_SBT_TAPE_1 channel at 12/02/2009 
20:38:02
ORA-19648: datafile 2: incremental-start SCN equals checkpoint SCN
ORA-19640: datafile checkpoint is SCN 3911695 time 11/27/2009 12:36:48
continuing other job steps, job failed will not be re-run
channel ORA_SBT_TAPE_1: starting incremental level 1 datafile backup set
channel ORA_SBT_TAPE_1: specifying datafile(s) in backup set
including current control file in backup set
including current SPFILE in backup set

DIAGNOSTIC ANALYSIS:
--------------------
Same problem as Bug 6903819
Requested backport, but this hasn't fixed the problem.
Confirmed that the patch was installed correctly and relinked successfully.

RELEASE NOTES:
]]Backup incremental level 1 fails with ORA-19648 and ORA-19640 when standby
]]database opened for read only.
*** 02/14/10 02:43 pm *** (ADD: Impact/Symptom->FEATURE UNUSABLE )

REDISCOVERY INFORMATION:
Backup incremental level 1 fails with ORA-19648 and the standby database is
open read-only.
WORKAROUND:
None

As the note above shows, this bug is not fixed until 11.2.0.2, so what can we do on 11.1.0.7 to solve or bypass the problem?
Through testing I arrived at two workarounds.
First, the ORA-19648 error can be worked around by stopping the media recovery process (MRP) on the physical standby:

SQL>  alter database recover managed standby database cancel;
Database altered.

RMAN> backup incremental level 1 database;

Starting backup at 22-MAR-11
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=141 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=140 device type=DISK
channel ORA_DISK_1: starting incremental level 1 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/standby/oradata/SBDB2/datafile/o1_mf_sysaux_22m6ov92_.dbf
input datafile file number=00004 name=/standby/oradata/SBDB2/datafile/o1_mf_users_26m6ovba_.dbf
input datafile file number=00006 name=/standby/oradata/SBDB2/datafile/o1_mf_enc_25m6ovba_.dbf
channel ORA_DISK_1: starting piece 1 at 22-MAR-11
channel ORA_DISK_2: starting incremental level 1 datafile backup set
channel ORA_DISK_2: specifying datafile(s) in backup set
input datafile file number=00001 name=/standby/oradata/SBDB2/datafile/o1_mf_system_21m6ov92_.dbf
input datafile file number=00003 name=/standby/oradata/SBDB2/datafile/o1_mf_undotbs1_23m6ovap_.dbf
input datafile file number=00005 name=/standby/oradata/SBDB2/datafile/o1_mf_example_24m6ovar_.dbf
channel ORA_DISK_2: starting piece 1 at 22-MAR-11
channel ORA_DISK_1: finished piece 1 at 22-MAR-11
piece handle=/standby/backup/2km7suui_1_1.bak tag=TAG20110322T212538 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:33
channel ORA_DISK_2: finished piece 1 at 22-MAR-11
piece handle=/standby/backup/2lm7suv1_1_1.bak tag=TAG20110322T212538 comment=NONE
channel ORA_DISK_2: backup set complete, elapsed time: 00:00:27
Finished backup at 22-MAR-11

Starting Control File and SPFILE Autobackup at 22-MAR-11
piece handle=/standby/backup/c-157018592-20110322-01.bak comment=NONE
Finished Control File and SPFILE Autobackup at 22-MAR-11

The ORA-19648 error can also be bypassed by disabling the block change tracking feature on which fast incremental backup relies. Where possible, though, the first method is preferable, since it keeps the fast incremental backup feature available:

SQL> alter database recover managed standby database using current logfile disconnect;
Database altered.

SQL> select process,status from v$managed_standby;

PROCESS   STATUS
--------- ------------
ARCH	  CONNECTED
ARCH	  CONNECTED
ARCH	  CONNECTED
ARCH	  CONNECTED
MRP0	  APPLYING_LOG

SQL> select status from v$block_change_tracking;

STATUS
----------
ENABLED


SQL> alter database disable block change tracking;
Database altered.


RMAN>  backup incremental level 2 database;

Starting backup at 22-MAR-11
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting incremental level 2 datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/standby/oradata/SBDB2/datafile/o1_mf_sysaux_22m6ov92_.dbf
input datafile file number=00004 name=/standby/oradata/SBDB2/datafile/o1_mf_users_26m6ovba_.dbf
input datafile file number=00006 name=/standby/oradata/SBDB2/datafile/o1_mf_enc_25m6ovba_.dbf
channel ORA_DISK_1: starting piece 1 at 22-MAR-11
channel ORA_DISK_2: starting incremental level 2 datafile backup set
channel ORA_DISK_2: specifying datafile(s) in backup set
input datafile file number=00001 name=/standby/oradata/SBDB2/datafile/o1_mf_system_21m6ov92_.dbf
input datafile file number=00003 name=/standby/oradata/SBDB2/datafile/o1_mf_undotbs1_23m6ovap_.dbf
input datafile file number=00005 name=/standby/oradata/SBDB2/datafile/o1_mf_example_24m6ovar_.dbf
channel ORA_DISK_2: starting piece 1 at 22-MAR-11
channel ORA_DISK_1: finished piece 1 at 22-MAR-11
piece handle=/standby/backup/2nm7sv8d_1_1.bak tag=TAG20110322T213052 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
channel ORA_DISK_2: finished piece 1 at 22-MAR-11
piece handle=/standby/backup/2om7sv8s_1_1.bak tag=TAG20110322T213052 comment=NONE
channel ORA_DISK_2: backup set complete, elapsed time: 00:00:15
Finished backup at 22-MAR-11

Starting Control File and SPFILE Autobackup at 22-MAR-11
piece handle=/standby/backup/c-157018592-20110322-02.bak comment=NONE
Finished Control File and SPFILE Autobackup at 22-MAR-11
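If you adopt this second workaround, remember to switch block change tracking back on afterwards. A minimal sketch (this form assumes db_create_file_dest is set; otherwise append USING FILE '<path>'):

alter database enable block change tracking;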

DEFERRED_SEGMENT_CREATION

Deferred segment creation is a new feature in 11.2: when an empty table or empty partition is created, allocation of the actual table segment is deferred until the first row is inserted.
The feature is controlled by the DEFERRED_SEGMENT_CREATION parameter, which defaults to TRUE.

The advantages of the allocation behaviour enabled by DEFERRED_SEGMENT_CREATION are:

If a system after installation contains a large number of empty tables that will never be loaded with any data, each empty table still costs 64 KB (one extent with db_block_size=8k); ten thousand such tables would waste about 625 MB of space.

When an application's initial installation needs to create a large number of empty tables, deferring segment creation saves a considerable amount of time.

When the first row is inserted into such an empty table or partition, the segments for the table, its LOB columns and its indexes are created. During segment creation, cursors on the affected table are invalidated; Oracle considers the performance impact of this invalidation to be small.

Also note that with deferred_segment_creation=true you should watch tablespace usage more closely: if all those empty tables start receiving data one day, space consumption can jump unexpectedly.

You can check whether an object's segment has been created via the SEGMENT_CREATED column of the dba_tables and dba_indexes views.

SQL> select distinct segment_created from dba_tables;

SEG

NO
YES
N/A

Users can control segment creation in two ways:

1. By setting the DEFERRED_SEGMENT_CREATION parameter at session or system level:

SQL> alter system set DEFERRED_SEGMENT_CREATION=false;

System altered.

SQL> alter session set DEFERRED_SEGMENT_CREATION=false;

Session altered.
2. By specifying the SEGMENT CREATION clause at create table time:

SEGMENT CREATION DEFERRED defers segment creation; this is the default behaviour in 11gR2.
SEGMENT CREATION IMMEDIATE does not defer segment creation, reverting to the pre-11gR2 default behaviour.
Note that a table created with deferred segment creation can be made to generate its segment immediately with alter table ... move (a sketch follows below).
However, there is currently no way to control segment creation for dependent objects such as indexes; these child objects inherit the behaviour of their parent table.
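A minimal sketch of the clause and of the alter table move trick (the table names here are hypothetical):

create table t_imm (t1 int) segment creation immediate;
create table t_def (t1 int) segment creation deferred;
alter table t_def move;    -- forces the deferred segment to be created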

You can also trigger segment creation with alter table ... allocate extent:

SQL> conn maclean/oracle
Connected.
SQL>
SQL> create table mac_defer(t1 int) tablespace users;

Table created.

SQL> select segment_created from dba_tables where table_name='MAC_DEFER';

SEG

NO

SQL> alter table mac_defer allocate extent (size 1M);

Table altered.

SQL> select segment_created from dba_tables where table_name='MAC_DEFER';

SEG

YES

The drawbacks of the deferred_segment_creation feature are:

1. Because the segment is only created at the first INSERT (or other data load), and segment creation must allocate space in the tablespace, a burst of inserts into a large number of empty tables can cause short-term space allocation contention.

2. deferred_segment_creation has introduced a small number of bugs.

Below are some bugs present in 11.2.0.3, some of which are fixed in PSUs or bundle patches (BPs):

  • Bug 15866428, fixed in 11.2.0.4, 12.1.0.0: ORA-14766 / ORA-14403 during concurrent partition maintenance
  • Bug 14252187, fixed in 12.1.0.0: ORA-600 [qesmaGetTblSeg1] from deferred segment creation in RAC
  • Bug 13986244, fixed in 11.2.0.3.BP14, 11.2.0.4, 12.1.0.0: Various ORA-600 seen with deferred segment creation in RAC
  • Bug 13611310, fixed in 12.1.0.0: Parallel DML with LOBs fails with ORA-7445 [qesmaGetFromLocalOrQCCache]
  • Bug 12614714, fixed in 11.2.0.4, 12.1.0.0: ORA-1950 occurs when executing DML after EXCHANGE PARTITION and DROP USER
  • Bug 13649031, fixed in 11.2.0.3.4, 11.2.0.3.BP06, 11.2.0.4, 12.1.0.0: ORA-10637 occurs on SHRINK of a partitioned table with deferred segments
  • Bug 13497523, fixed in 11.2.0.3.BP15, 11.2.0.4, 12.1.0.0: Errors from SQLLDR loads into non-partitioned tables with deferred segment creation
  • Bug 13326736 (*), fixed in 11.2.0.2.8, 11.2.0.2.BP18, 11.2.0.3.3, 11.2.0.3.BP05, 11.2.0.4, 12.1.0.0: Dictionary corruption / ORA-959 due to DROP TABLESPACE
  • Bug 12535346, fixed in 11.2.0.3.3, 11.2.0.3.BP07, 12.1.0.0: ORA-7445 [kxccexi] using referential integrity constraints with deferred segments or interval partitions
  • Bug 12358753, fixed in 12.1.0.0: INDEX_STATS has wrong values for ANALYZE of deferred segment indexes
  • Bug 11930350, fixed in 12.1.0.0: Deadlock / undetected FK violation from DML on REFERENCE partitioned table
Recommendations:

1. For systems with many empty tables or empty partitions, under space pressure, and without strict performance or response-time requirements, this feature is worth considering.
2. For databases with strict performance or response-time requirements, it is advisable to disable the feature: deferred_segment_creation=false.

SQL> show parameter defer

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
deferred_segment_creation            boolean     TRUE

SQL>
SQL> create table Maclean_nonpartition ("OBJECT_ID" NUMBER);

Table created.

SQL> create table Maclean_partition ("OBJECT_ID" NUMBER) partition by range(object_id)
2 (partition p1 values less than (99999) tablespace users,
3 partition p2 values less than (maxvalue) tablespace users);

Table created.

SQL> select segment_name,partition_name from dba_segments where segment_name like '%MACLEAN%' and owner='MACLEAN';

no rows selected

SQL> insert into Maclean_nonpartition values(1);

1 row created.

SQL> insert into Maclean_partition values(1);

1 row created.

SQL> commit;

Commit complete.

SQL> select segment_name,partition_name from dba_segments where segment_name like '%MACLEAN%' and owner='MACLEAN';

SEGMENT_NAME                   PARTITION_NAME
------------------------------ ------------------------------
MACLEAN_PARTITION              P1
MACLEAN_NONPARTITION

Setup Oracle Direct NFS Client

Oracle 11g introduces the Direct Network File System (Oracle Direct NFS) feature: an NFS client packaged inside the Oracle kernel that improves instance performance over NFS and further rounds out the NFS-based RAC solution. Conventional NFS client software is generally supplied by the operating system vendor and is not specifically optimized for Oracle data file I/O. With the built-in Oracle Direct NFS, the database accesses files on the NFS server directly, avoiding the extra overhead of the OS kernel NFS layer. Oracle claims performance gains from this optimization of over 40% in DSS environments and over 10% in OLTP environments (see the white paper <Oracle Database 11g Direct NFS Client>).

Next we will walk through setting up the Oracle Direct NFS client, which is actually not difficult. The Direct NFS client looks up mount point settings in the following order:

  • $ORACLE_HOME/dbs/oranfstab, scoped to the databases under that $ORACLE_HOME
  • /etc/oranfstab, scoped to all databases on the host
  • finally the /etc/mtab configuration file, to determine the available NFS mount points

We generally recommend configuring the Direct NFS client through $ORACLE_HOME/dbs/oranfstab. The oranfstab file can contain server, path, export and mount parameters, with the following meanings (a hypothetical multi-path example follows the list):

  • Server: the NFS server name
  • Path: up to four network paths to the NFS server, specified as IP addresses or hostnames
  • Export: the exported path from the NFS server
  • Mount: the local mount point of the NFS file system
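For illustration, a hypothetical oranfstab with two network paths to the same filer might look like this (the second address is invented; the actual single-path file used in this test appears further below):

server: nas
path: 192.168.1.188
path: 192.168.2.188
export: /d01  mount: /d01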

Note that in the scenario where NFS is used to build RAC, the /etc/oranfstab file must be used, and it should be kept in sync across all nodes.
Before Direct NFS is formally enabled, the NFS file system should already be mounted and usable through regular kernel NFS. To enable the Direct NFS client we also need to replace the standard Oracle Disk Manager (ODM) library with the NFS ODM library that supports Direct NFS. This is done by creating a symbolic link from the standard ODM library to the NFS ODM library; note that this can only be done, and only takes effect, while the instance is shut down.

SQL> shutdown immediate;

[maclean@rh2 ~]$ cd $ORACLE_HOME/lib
[maclean@rh2 lib]$ mv libodm11.so libodm11.so.old
[maclean@rh2 lib]$ ln -s libnfsodm11.so libodm11.so
[maclean@rh2 lib]$ ls -l libodm11.so
lrwxrwxrwx 1 maclean oinstall 14 Feb 18 19:27 libodm11.so -> libnfsodm11.so

Now we will enable the Direct NFS client for real and run a quick performance test:

[root@rh2 ~]# showmount -e  nas
Export list for nas:
/d01 rh2

/* Mount the NFS file system in the conventional way */
[root@rh2 ~]# mount -t nfs nas:/d01 /d01

[maclean@rh2 ~]$ cat $ORACLE_HOME/dbs/oranfstab
server: nas
path: 192.168.1.188
export: /d01  mount: /d01

SQL> startup;

During startup the alert log shows the following:
Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 3.0
.................
Direct NFS: attempting to mount /d01 on filer nas defined in oranfstab
Direct NFS: channel config is:
     channel id [0] local [] path [192.168.1.188]
Direct NFS: mount complete dir /d01 on nas mntport 998 nfsport 2049

SQL> desc v$dnfs_servers;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                                 NUMBER
 SVRNAME                                            VARCHAR2(255)
 DIRNAME                                            VARCHAR2(1024)
 MNTPORT                                            NUMBER
 NFSPORT                                            NUMBER
 WTMAX                                              NUMBER
 RTMAX                                              NUMBER

/* Query the V$DNFS_SERVERS dynamic view to see what the Direct NFS client has mounted */

SQL> col svrname for a10
SQL> col dirname for a10
SQL> select svrname,dirname,mntport,nfsport from v$dnfs_servers;

SVRNAME    DIRNAME       MNTPORT    NFSPORT
---------- ---------- ---------- ----------
nas        /d01              998       2049

/* In addition, the V$DNFS_FILES view shows the files currently open through Direct NFS,
   the V$DNFS_CHANNELS view shows the network paths Direct NFS has opened to the servers,
   and the V$DNFS_STATS view shows Direct NFS performance statistics */
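For example, a quick query against the open-files view (a sketch; the view stays empty until the instance opens files on a Direct NFS mount):

select filename, filesize from v$dnfs_files;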

SQL> set timing on;
SQL> create tablespace nfs_perf datafile '/d01/perf01.dbf' size 2000M;
Tablespace created.

Elapsed: 00:00:30.91

SQL> shutdown immediate;

/* Direct NFS can be disabled by restoring the libodm11.so library and removing the oranfstab file */

[maclean@rh2 lib]$ cp libodm11.so.old libodm11.so
[maclean@rh2 lib]$ rm $ORACLE_HOME/dbs/oranfstab

SQL> select svrname,dirname,mntport,nfsport from v$dnfs_servers;
no rows selected

SQL> set timing on;
SQL> create tablespace nfs_kernel datafile '/d01/kernel01.dbf' size 2000M;
Tablespace created.

Elapsed: 00:00:29.45

/* It seems a simple datafile creation timing does not show off the advantage of Oracle Direct NFS... */

That's ok 

Applying an online patch on 11gR2

Oracle 11g introduces the online patch (also called hot patch) feature. Hot patching lets us install, enable or disable a bug-fix or diagnostic patch while the instance stays online throughout. For 7*24 production systems, hot patches offer a way to apply a fix while avoiding downtime. In 11g we can apply an online patch to any database with the OPatch command-line tool (provided the patch actually is a hot patch). Generally speaking, only patches whose code changes are small in scope and low in complexity can be shipped as online patches; typically they are diagnostic patches or small bug fixes. Note that hot patching consumes extra memory, and how much depends on two factors: 1. the size of the patch itself, and 2. the total number of processes in the instance. For example, if a patch is exactly one OS page (usually 4 KB) and the instance runs 1000 processes, the hot patch consumes an extra 4 KB * 1000 = 4 MB of memory.

Compared with conventional patches, hot patches can be applied online and install quickly, as the figure below illustrates:
[Figure: online patching]

In real production environments I suspect few people have applied a hot patch: for one thing 11g is not yet in widespread use here, and for another, hot patches make up only a tiny fraction of all interim patches. I have long wanted to write a post about hot patching; the one thing stopping me was that I could not find an online interim patch applicable to 11.2.0.1. With the current layout of MOS -> Patches & Updates it is hard to list the hot patches available for a given base release (which does not stop the patch section from being useful). Frustrated by this, I filed a Service Request and received the following reply:

I have tried to find the patches which support online patching on 11.2.0.1 version,
but I also can not find them because there are too many patches and there is no catalog for the patches
which support online patching, and I can only check the patch readme to confirm whether that patch supports online patching.

I found one patch which supports online patching, but this patch is for 11.2.0.2 version.
The patch no. is 10188727.

Sorry for the inconvenience brought to you. Hope the above update can help you.
If the above patch is not what you want, then please update the SR and I will continue for your issue.

The patch 10188727 mentioned there appears in Note <RDBMS Online Patching Aka Hot Patching [ID 761111.1]>. The other hot patch I could find is 6198642 on 11.1.0.6, <DUMMY PATCH FOR TESTING DB11 PATCHING>, but unfortunately it exists only for the Linux x86 platform. So I had no choice but to first upgrade my 11.2.0.1 test database to 11.2.0.2:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

[maclean@rh2 OPatch]$ ps -ef|grep pmon|grep -v grep
maclean  22481     1  0 19:19 ?        00:00:00 ora_pmon_PROD

[maclean@rh2 OPatch]$ pmap -d 22481
22481:   ora_pmon_PROD
Address           Kbytes Mode  Offset           Device    Mapping
0000000000400000  180232 r-x-- 0000000000000000 008:00002 oracle
000000000b602000    1820 rwx-- 000000000b002000 008:00002 oracle
000000000b7c9000     300 rwx-- 000000000b7c9000 000:00000   [ anon ]
000000000dbef000     436 rwx-- 000000000dbef000 000:00000   [ anon ]
0000000060000000 2050048 rwxs- 0000000000000000 000:00009   [ shmid=0x550001 ]
0000003e09a00000     112 r-x-- 0000000000000000 008:00001 ld-2.5.so
0000003e09c1b000       4 r-x-- 000000000001b000 008:00001 ld-2.5.so
0000003e09c1c000       4 rwx-- 000000000001c000 008:00001 ld-2.5.so
0000003e09e00000    1336 r-x-- 0000000000000000 008:00001 libc-2.5.so
0000003e09f4e000    2044 ----- 000000000014e000 008:00001 libc-2.5.so
0000003e0a14d000      16 r-x-- 000000000014d000 008:00001 libc-2.5.so
0000003e0a151000       4 rwx-- 0000000000151000 008:00001 libc-2.5.so
0000003e0a152000      20 rwx-- 0000003e0a152000 000:00000   [ anon ]
0000003e0a200000     520 r-x-- 0000000000000000 008:00001 libm-2.5.so
0000003e0a282000    2044 ----- 0000000000082000 008:00001 libm-2.5.so
0000003e0a481000       4 r-x-- 0000000000081000 008:00001 libm-2.5.so
0000003e0a482000       4 rwx-- 0000000000082000 008:00001 libm-2.5.so
0000003e0a600000       8 r-x-- 0000000000000000 008:00001 libdl-2.5.so
0000003e0a602000    2048 ----- 0000000000002000 008:00001 libdl-2.5.so
0000003e0a802000       4 r-x-- 0000000000002000 008:00001 libdl-2.5.so
0000003e0a803000       4 rwx-- 0000000000003000 008:00001 libdl-2.5.so
0000003e0aa00000      88 r-x-- 0000000000000000 008:00001 libpthread-2.5.so
0000003e0aa16000    2044 ----- 0000000000016000 008:00001 libpthread-2.5.so
0000003e0ac15000       4 r-x-- 0000000000015000 008:00001 libpthread-2.5.so
0000003e0ac16000       4 rwx-- 0000000000016000 008:00001 libpthread-2.5.so
0000003e0ac17000      16 rwx-- 0000003e0ac17000 000:00000   [ anon ]
0000003e0ae00000      28 r-x-- 0000000000000000 008:00001 librt-2.5.so
0000003e0ae07000    2048 ----- 0000000000007000 008:00001 librt-2.5.so
0000003e0b007000       4 r-x-- 0000000000007000 008:00001 librt-2.5.so
0000003e0b008000       4 rwx-- 0000000000008000 008:00001 librt-2.5.so
0000003e0da00000      84 r-x-- 0000000000000000 008:00001 libnsl-2.5.so
0000003e0da15000    2044 ----- 0000000000015000 008:00001 libnsl-2.5.so
0000003e0dc14000       4 r-x-- 0000000000014000 008:00001 libnsl-2.5.so
0000003e0dc15000       4 rwx-- 0000000000015000 008:00001 libnsl-2.5.so
0000003e0dc16000       8 rwx-- 0000003e0dc16000 000:00000   [ anon ]
00002abec920e000       8 rwx-- 00002abec920e000 000:00000   [ anon ]
00002abec9210000       4 r-x-- 0000000000000000 008:00002 libodmd11.so
00002abec9211000    1024 ----- 0000000000001000 008:00002 libodmd11.so
00002abec9311000       4 rwx-- 0000000000001000 008:00002 libodmd11.so
00002abec9312000     360 r-x-- 0000000000000000 008:00002 libcell11.so
00002abec936c000    1020 ----- 000000000005a000 008:00002 libcell11.so
00002abec946b000      36 rwx-- 0000000000059000 008:00002 libcell11.so
00002abec9474000       4 rwx-- 00002abec9474000 000:00000   [ anon ]
00002abec9475000     848 r-x-- 0000000000000000 008:00002 libskgxp11.so
00002abec9549000    1024 ----- 00000000000d4000 008:00002 libskgxp11.so
00002abec9649000       8 rwx-- 00000000000d4000 008:00002 libskgxp11.so
00002abec9665000       4 rwx-- 00002abec9665000 000:00000   [ anon ]
00002abec9666000    2580 r-x-- 0000000000000000 008:00002 libnnz11.so
00002abec98eb000    1020 ----- 0000000000285000 008:00002 libnnz11.so
00002abec99ea000     264 rwx-- 0000000000284000 008:00002 libnnz11.so
00002abec9a2c000       8 rwx-- 00002abec9a2c000 000:00000   [ anon ]
00002abec9a2e000      96 r-x-- 0000000000000000 008:00002 libclsra11.so
00002abec9a46000    1020 ----- 0000000000018000 008:00002 libclsra11.so
00002abec9b45000       4 rwx-- 0000000000017000 008:00002 libclsra11.so
00002abec9b46000       4 rwx-- 00002abec9b46000 000:00000   [ anon ]
00002abec9b47000     136 r-x-- 0000000000000000 008:00002 libdbcfg11.so
00002abec9b69000    1020 ----- 0000000000022000 008:00002 libdbcfg11.so
00002abec9c68000       8 rwx-- 0000000000021000 008:00002 libdbcfg11.so
00002abec9c6a000    6832 r-x-- 0000000000000000 008:00002 libhasgen11.so
00002abeca316000    1020 ----- 00000000006ac000 008:00002 libhasgen11.so
00002abeca415000     136 rwx-- 00000000006ab000 008:00002 libhasgen11.so
00002abeca437000      24 rwx-- 00002abeca437000 000:00000   [ anon ]
00002abeca43d000       8 r-x-- 0000000000000000 008:00002 libskgxn2.so
00002abeca43f000    1020 ----- 0000000000002000 008:00002 libskgxn2.so
00002abeca53e000       4 rwx-- 0000000000001000 008:00002 libskgxn2.so
00002abeca53f000       4 rwx-- 00002abeca53f000 000:00000   [ anon ]
00002abeca540000     656 r-x-- 0000000000000000 008:00002 libocr11.so
00002abeca5e4000    1020 ----- 00000000000a4000 008:00002 libocr11.so
00002abeca6e3000      12 rwx-- 00000000000a3000 008:00002 libocr11.so
00002abeca6e6000     628 r-x-- 0000000000000000 008:00002 libocrb11.so
00002abeca783000    1024 ----- 000000000009d000 008:00002 libocrb11.so
00002abeca883000       8 rwx-- 000000000009d000 008:00002 libocrb11.so
00002abeca885000      44 r-x-- 0000000000000000 008:00002 libocrutl11.so
00002abeca890000    1020 ----- 000000000000b000 008:00002 libocrutl11.so
00002abeca98f000       4 rwx-- 000000000000a000 008:00002 libocrutl11.so
00002abeca990000       4 rwx-- 00002abeca990000 000:00000   [ anon ]
00002abeca991000       4 r-x-- 0000000000000000 008:00001 libaio.so.1.0.1
00002abeca992000    2044 ----- 0000000000001000 008:00001 libaio.so.1.0.1
00002abecab91000       4 rwx-- 0000000000000000 008:00001 libaio.so.1.0.1
00002abecab92000      16 rwx-- 00002abecab92000 000:00000   [ anon ]
00002abecab96000      20 r-x-- 0000000000000000 008:00001 libnuma.so.1
00002abecab9b000    2044 ----- 0000000000005000 008:00001 libnuma.so.1
00002abecad9a000       4 rwx-- 0000000000004000 008:00001 libnuma.so.1
00002abecad9b000    1280 rwx-- 00002abecad9b000 000:00000   [ anon ]
00002abecaef5000      40 r-x-- 0000000000000000 008:00001 libnss_files-2.5.so
00002abecaeff000    2044 ----- 000000000000a000 008:00001 libnss_files-2.5.so
00002abecb0fe000       4 r-x-- 0000000000009000 008:00001 libnss_files-2.5.so
00002abecb0ff000       4 rwx-- 000000000000a000 008:00001 libnss_files-2.5.so
00002abecb100000    1700 rwx-- 00002abecb100000 000:00000   [ anon ]
00002abecb2a9000      28 rwx-- 0000000000000000 000:00011 zero
00002abecb2b0000      64 rwx-- 0000000000000000 000:00011 zero
00002abecb2c0000      64 rwx-- 0000000000000000 000:00011 zero
00002abecb2d0000      64 rwx-- 0000000000000000 000:00011 zero
00002abecb2e0000      64 rwx-- 0000000000000000 000:00011 zero
00002abecb2f0000      64 rwx-- 0000000000000000 000:00011 zero
00002abecb300000     164 rwx-- 0000000000057000 000:00011 zero
00002abecb329000       8 rwx-- 00002abecb329000 000:00000   [ anon ]
00002abecb32b000       4 rwxs- 0000000000000000 008:00002 hc_PROD.dat
00002abecb32c000      40 r-x-- 0000000000000000 008:00002 libnque11.so
00002abecb336000    1020 ----- 000000000000a000 008:00002 libnque11.so
00002abecb435000       4 rwx-- 0000000000009000 008:00002 libnque11.so
00002abecb436000    1048 rwx-- 00002abecb436000 000:00000   [ anon ]
00007fff3342d000      84 rwx-- 00007ffffffea000 000:00000   [ stack ]
ffffffffff600000    8192 ----- 0000000000000000 000:00000   [ anon ]
mapped: 2291488K    writeable/private: 7840K    shared: 2050052K

[maclean@rh2 ~]$  cd $ORACLE_HOME/OPatch
[maclean@rh2 OPatch]$ unzip p10188727_112020_Linux-x86-64.zip
Archive:  p10188727_112020_Linux-x86-64.zip
   creating: 10188727/
   creating: 10188727/files/
   creating: 10188727/files/lib/
   creating: 10188727/files/lib/libserver11.a/
  inflating: 10188727/files/lib/libserver11.a/kkopq.o
   creating: 10188727/etc/
   creating: 10188727/etc/config/
  inflating: 10188727/etc/config/inventory.xml
  inflating: 10188727/etc/config/actions.xml
  inflating: 10188727/etc/config/deploy.xml
   creating: 10188727/etc/xml/
  inflating: 10188727/etc/xml/GenericActions.xml
  inflating: 10188727/etc/xml/ShiphomeDirectoryStructure.xml
  inflating: 10188727/README.txt
   creating: 10188727/online/
   creating: 10188727/online/files/
   creating: 10188727/online/files/hpatch/
  inflating: 10188727/online/files/hpatch/bug10188727.pch
   creating: 10188727/online/etc/
   creating: 10188727/online/etc/config/
  inflating: 10188727/online/etc/config/inventory.xml
  inflating: 10188727/online/etc/config/actions.xml
  inflating: 10188727/online/etc/config/deploy.xml
   creating: 10188727/online/etc/xml/
  inflating: 10188727/online/etc/xml/GenericActions.xml
  inflating: 10188727/online/etc/xml/ShiphomeDirectoryStructure.xml 

[maclean@rh2 OPatch]$ opatch query 10188727 -all
Invoking OPatch 11.2.0.1.1

Oracle Interim Patch Installer version 11.2.0.1.1
Copyright (c) 2009, Oracle Corporation.  All rights reserved.

Oracle Home       : /s01/product/11.2.0/dbhome_2
Central Inventory : /s01/oraInventory
   from           : /etc/oraInst.loc
OPatch version    : 11.2.0.1.1
OUI version       : 11.2.0.2.0
OUI location      : /s01/product/11.2.0/dbhome_2/oui
Log file location : /s01/product/11.2.0/dbhome_2/cfgtoollogs/opatch/opatch2011-02-17_20-05-21PM.log

Patch history file: /s01/product/11.2.0/dbhome_2/cfgtoollogs/opatch/opatch_history.txt

--------------------------------------------------------------------------------
 Patch created on 2 Dec 2010, 01:44:13 hrs PST8PDT
 Need to shutdown Oracle instances: true
 Patch is roll-backable: true
 Patch is a "Patchset Update": false
 Patch is a rolling patch: true
 Patch has sql related actions: false
 Patch is an online patch: false
 Patch is a portal patch: false
 Patch is an "auto-enabled" patch: false

 List of platforms supported:
   226: Linux x86-64

 List of bugs to be fixed:
   10188727: AFTER UPGRADING TO 11.2.0.2 SOME SQLS FAIL WITH ORA-7445 [KKEIDC()+180] ERROR

 This patch is a "singleton" patch.

 This patch belongs to the "db" product family 

 List of executables affected:
   ORACLE_HOME/bin/oracle

 List of optional components:
   oracle.rdbms:  11.2.0.2.0

 List of optional actions:
   Update /s01/product/11.2.0/dbhome_2/lib/libserver11.a with /kkopq.o
   cd /s01/product/11.2.0/dbhome_2/rdbms/lib
     ; make -f ins_rdbms.mk ioracle ORACLE_HOME=/s01/product/11.2.0/dbhome_2

  Possible XML representation of the patch:

     10188727

--------------------------------------------------------------------------------
OPatch succeeded.

[maclean@rh2 OPatch]$ ./opatch query -is_online_patch 10188727
Invoking OPatch 11.2.0.1.1

Oracle Interim Patch Installer version 11.2.0.1.1
Copyright (c) 2009, Oracle Corporation.  All rights reserved.

Oracle Home       : /s01/product/11.2.0/dbhome_2
Central Inventory : /s01/oraInventory
   from           : /etc/oraInst.loc
OPatch version    : 11.2.0.1.1
OUI version       : 11.2.0.2.0
OUI location      : /s01/product/11.2.0/dbhome_2/oui
Log file location : /s01/product/11.2.0/dbhome_2/cfgtoollogs/opatch/opatch2011-02-17_19-45-33PM.log

Patch history file: /s01/product/11.2.0/dbhome_2/cfgtoollogs/opatch/opatch_history.txt

--------------------------------------------------------------------------------
 Patch is an online patch: false

OPatch succeeded.

/* Although OPatch reports "online patch: false", this actually is an online patch;
   the likely cause of the wrong answer is an OPatch version that is too old */

[maclean@rh2 OPatch]$ tree 10188727
10188727
|-- README.txt
|-- etc
|   |-- config
|   |   |-- actions.xml
|   |   |-- deploy.xml
|   |   `-- inventory.xml
|   `-- xml
|       |-- GenericActions.xml
|       `-- ShiphomeDirectoryStructure.xml
|-- files
|   `-- lib
|       `-- libserver11.a
|           `-- kkopq.o
`-- online
    |-- etc
    |   |-- config
    |   |   |-- actions.xml
    |   |   |-- deploy.xml
    |   |   `-- inventory.xml
    |   `-- xml
    |       |-- GenericActions.xml
    |       `-- ShiphomeDirectoryStructure.xml
    `-- files
        `-- hpatch
            `-- bug10188727.pch

/* The directory tree above contains an online subdirectory; we can inspect its inventory.xml directly */

[maclean@rh2 OPatch]$ cat 10188727/online/etc/config/inventory.xml |grep instance
 <instance_shutdown>false</instance_shutdown>
 <instance_shutdown_message></instance_shutdown_message>

/* instance_shutdown set to false above means the patch can be applied online */

[maclean@rh2 OPatch]$ cd 10188727

/* The exact syntax for OPatch online patching is as follows */
opatch apply online -connectString  <SID>:<USERNAME>:<PASSWORD>:<NODE1>, \
<SID2>:<USERNAME>:<PASSWORD>:<NODE2>,... 

[maclean@rh2 10188727]$ opatch apply online -connectString PROD:sys:password
Invoking OPatch 11.2.0.1.1

Oracle Interim Patch Installer version 11.2.0.1.1
Copyright (c) 2009, Oracle Corporation.  All rights reserved.

Oracle Home       : /s01/product/11.2.0/dbhome_2
Central Inventory : /s01/oraInventory
   from           : /etc/oraInst.loc
OPatch version    : 11.2.0.1.1
OUI version       : 11.2.0.2.0
OUI location      : /s01/product/11.2.0/dbhome_2/oui
Log file location : /s01/product/11.2.0/dbhome_2/cfgtoollogs/opatch/opatch2011-02-17_19-49-44PM.log

Patch history file: /s01/product/11.2.0/dbhome_2/cfgtoollogs/opatch/opatch_history.txt

The patch should be applied/rolled back in '-all_nodes' mode only.
Converting the RAC mode to '-all_nodes' mode.
ApplySession applying interim patch '10188727' to OH '/s01/product/11.2.0/dbhome_2'

Running prerequisite checks...

OPatch detected non-cluster Oracle Home from the inventory and will patch the local system only.

Backing up files and inventory (not for auto-rollback) for the Oracle Home
Backing up files affected by the patch '10188727' for restore. This might take a while...
Backing up files affected by the patch '10188727' for rollback. This might take a while...

Patching component oracle.rdbms, 11.2.0.2.0...
The patch will be installed on active database instances.
Installing and enabling the online patch 'bug10188727.pch', on database 'PROD'.

ApplySession adding interim patch '10188727' to inventory

Verifying the update...
Inventory check OK: Patch ID 10188727 is registered in Oracle Home inventory with proper meta-data.
Files check OK: Files from Patch ID 10188727 are present in Oracle Home.
OPatch succeeded.

[maclean@rh2 OPatch]$ opatch lsinventory -detail|tail -21
Interim patches (1) :

Patch (online) 10188727: applied on Thu Feb 17 19:49:52 CST 2011
Unique Patch ID:  13202318
   Created on 2 Dec 2010, 01:44:15 hrs PST8PDT
   Bugs fixed:
     10188727
   Files Touched:
     bug10188727.pch --> ORACLE_HOME/hpatch/bug10188727.pch
   Instances Patched:
     PROD
   Patch Location in Inventory:
     /s01/product/11.2.0/dbhome_2/inventory/oneoffs/10188727
   Patch Location in Storage area:
     /s01/product/11.2.0/dbhome_2/.patch_storage/10188727_Dec_2_2010_01_44_15

Messages in the alert log:
Patch file bug10188727.pch is out of sync with oracle binary; performing fixup
Patch file bug10188727.pch has been synced with oracle binary
Patch bug10188727.pch Installed - Update #1
Patch bug10188727.pch Enabled - Update #2

[maclean@rh2 OPatch]$ ps -ef|grep pmon|grep -v grep
maclean  22481     1  0 19:19 ?        00:00:00 ora_pmon_PROD

[maclean@rh2 OPatch]$ pmap -d 22481|tail -10
00002abecb435000       4 rwx-- 0000000000009000 008:00002 libnque11.so
00002abecb436000    1048 rwx-- 00002abecb436000 000:00000   [ anon ]
00002abecb53c000       8 r-x-- 000000000c64e000 008:00002 oracle
00002abecb53e000    5052 r-x-- 00000000000bd000 008:00002 oracle
00002abecba2d000     140 r-x-- 0000000000000000 008:00002 bug10188727.so
00002abecba50000    1024 ----- 0000000000023000 008:00002 bug10188727.so
00002abecbb50000       8 rwx-- 0000000000023000 008:00002 bug10188727.so
00007fff3342d000      84 rwx-- 00007ffffffea000 000:00000   [ stack ]
ffffffffff600000    8192 ----- 0000000000000000 000:00000   [ anon ]
mapped: 2297720K    writeable/private: 7848K    shared: 2050052K

/* Examining the pmon process again, the pmap output now contains two small additional oracle text mappings and
   a shared library named bug10188727.so, and the background process's private memory has grown from 7840 KB
   to 7848 KB, an actual increase of 8 KB */

/* The online patch can also be disabled with the oradebug command, although this is not recommended */

SQL> oradebug patch disable  bug10188727.pch;
Statement processed.

SQL>  oradebug patch disable bug10188727.pch;
Patch file already disabled

[maclean@rh2 hpatch]$ pmap -d 22481|tail -8
00002abecb53c000       8 r-x-- 000000000c64e000 008:00002 oracle
00002abecb53e000    5052 r-x-- 00000000000bd000 008:00002 oracle
00002abecba2d000     140 r-x-- 0000000000000000 008:00002 bug10188727.so
00002abecba50000    1024 ----- 0000000000023000 008:00002 bug10188727.so
00002abecbb50000       8 rwx-- 0000000000023000 008:00002 bug10188727.so
00007fff3342d000      84 rwx-- 00007ffffffea000 000:00000   [ stack ]
ffffffffff600000    8192 ----- 0000000000000000 000:00000   [ anon ]
mapped: 2297720K    writeable/private: 7848K    shared: 2050052K

/* Disabling the online patch does not cause the extra memory it consumed to be reclaimed */

/* Of course we can just as easily re-enable it */

SQL> oradebug patch enable bug10188727.pch;
Statement processed.
SQL>  oradebug patch enable bug10188727.pch;
Patch file already enabled

[maclean@rh2 ~]$ cd $ORACLE_HOME/hpatch

[maclean@rh2 hpatch]$ ls -l
total 368
-rw-r--r-- 1 maclean oinstall 177874 Feb 17 19:49 bug10188727.pch
-rwx------ 1 maclean oinstall      1 Feb 17 19:49 bug10188727.pchPROD.fixup
-rwx------ 1 maclean oinstall 176850 Feb 17 19:49 bug10188727.so
-rw------- 1 maclean oinstall    712 Feb 17 20:13 orapatchPROD.cfg

/* Do not delete the hpatch directory above, or any file in it, while the instance is up; doing so may cause unexpected instance problems */

[maclean@rh2 hpatch]$ cd $ORACLE_HOME/OPatch

/* We can also roll the online patch back, with the following syntax */
opatch rollback -id <patchID> -connectString  <SID>:<USERNAME>:<PASSWORD>:<NODE1>, \
<SID2>:<USERNAME>:<PASSWORD>:<NODE2>,  ...

[maclean@rh2 OPatch]$ opatch rollback -id 10188727 -connectString PROD:sys:password -invPtrLoc /s01/product/11.2.0/dbhome_2/oraInst.loc
Invoking OPatch 11.2.0.1.1

Oracle Interim Patch Installer version 11.2.0.1.1
Copyright (c) 2009, Oracle Corporation.  All rights reserved.

Oracle Home       : /s01/product/11.2.0/dbhome_2
Central Inventory : /s01/oraInventory
   from           : /s01/product/11.2.0/dbhome_2/oraInst.loc
OPatch version    : 11.2.0.1.1
OUI version       : 11.2.0.2.0
OUI location      : /s01/product/11.2.0/dbhome_2/oui
Log file location : /s01/product/11.2.0/dbhome_2/cfgtoollogs/opatch/opatch2011-02-17_20-18-00PM.log

Patch history file: /s01/product/11.2.0/dbhome_2/cfgtoollogs/opatch/opatch_history.txt

RollbackSession rolling back interim patch '10188727' from OH '/s01/product/11.2.0/dbhome_2'

The patch should be applied/rolled back in '-all_nodes' mode only.
Converting the RAC mode to '-all_nodes' mode.

Running prerequisite checks...

OPatch detected non-cluster Oracle Home from the inventory and will patch the local system only.

Backing up files affected by the patch '10188727' for restore. This might take a while...

Patching component oracle.rdbms, 11.2.0.2.0...
The patch will be removed from active database instances.
Disabling and removing online patch 'bug10188727.pch', on database 'PROD'

RollbackSession removing interim patch '10188727' from inventory

OPatch succeeded.

Removal messages in the alert log:
Patch bug10188727.pch Disabled - Update #5
Patch bug10188727.pch Removed - Update #6
Thu Feb 17 20:18:07 2011
Online patch bug10188727.pch has been disabled
Online patch bug10188727.pch has been removed

[maclean@rh2 trace]$ pmap -d 22481|tail -8
00002abecb53c000       8 r-x-- 000000000c64e000 008:00002 oracle
00002abecb53e000    5052 r-x-- 00000000000bd000 008:00002 oracle
00002abecba2d000     140 r-x-- 0000000000000000 008:00002 bug10188727.so
00002abecba50000    1024 ----- 0000000000023000 008:00002 bug10188727.so
00002abecbb50000       8 rwx-- 0000000000023000 008:00002 bug10188727.so
00007fff3342d000      84 rwx-- 00007ffffffea000 000:00000   [ stack ]
ffffffffff600000    8192 ----- 0000000000000000 000:00000   [ anon ]
mapped: 2297720K    writeable/private: 7848K    shared: 2050052K

/* Clearly, rolling back the online interim patch is not enough to reclaim the memory either; the only way is to restart the instance */

SQL> startup force;

[maclean@rh2 trace]$ ps -ef|grep pmon|grep -v grep
maclean  25563     1  0 20:22 ?        00:00:00 ora_pmon_PROD

[maclean@rh2 trace]$ pmap -d 25563|tail -8
00002aaf2f324000       4 rwxs- 0000000000000000 008:00002 hc_PROD.dat
00002aaf2f325000      40 r-x-- 0000000000000000 008:00002 libnque11.so
00002aaf2f32f000    1020 ----- 000000000000a000 008:00002 libnque11.so
00002aaf2f42e000       4 rwx-- 0000000000009000 008:00002 libnque11.so
00002aaf2f42f000    1048 rwx-- 00002aaf2f42f000 000:00000   [ anon ]
00007fffdfa84000      84 rwx-- 00007ffffffea000 000:00000   [ stack ]
ffffffffff600000    8192 ----- 0000000000000000 000:00000   [ anon ]
mapped: 2291488K    writeable/private: 7840K    shared: 2050052K

/* That's ok!
    ash to ash, dust to dust! */

References: <RDBMS Online Patching Aka Hot Patching [ID 761111.1]>

Oracle Advanced Security: Column Encryption Overhead

Oracle 10g introduced the column encryption feature, which achieves data security by encrypting the data in a column. This encryption naturally comes at a price: the encrypted column occupies more bytes of disk space per row, and more CPU and memory are consumed.

When Oracle TDE (transparent data encryption) encrypts a table column, each row of the table grows by roughly 33 to 51 bytes. Those extra bytes serve the following purposes:

  • 20 bytes are used for an integrity check of the encrypted value; this part can be omitted with the 'nomac' option
  • encryption also pads the value out to 16 bytes (when the column itself is shorter); for example, encrypting a 9-byte NUMBER pads the data to 16 bytes, an extra 7 bytes
  • by default encryption uses the salt option: a random 16-byte string that is added to the column value before it is encrypted, which stops attackers from matching known ciphertext patterns (steal patterns of ciphertext to known ciphertext); the salt always sits at the end of the encrypted data, and can be omitted with the no salt option.

Note that a column encrypted with the default salt option cannot be indexed, so when indexing is needed it is strongly recommended to encrypt the column with no salt! Indexes on encrypted columns do not support range scan operations (Range scans on encrypted columns can't use index); encrypted tablespaces (tablespace encryption) have no such restriction.

SQL> create table enctab (t1 int encrypt);
Table created.

SQL>  create index ind_enc on enctab(t1);
create index ind_enc on enctab(t1)
*
ERROR at line 1:
ORA-28338: Column(s) cannot be both indexed and encrypted with salt

SQL> create table news (t1 varchar2(1) encrypt);
Table created.

/* Create a demo table with the default salt and MAC options */

SQL> insert into news values('1');
1 row created.

SQL> commit;
Commit complete.

/* The column value itself is 1 byte long */

SQL> select dump(t1,16) from news;

DUMP(T1,16)
--------------------------------------------------------------------------------
Typ=1 Len=1: 31

/* Below is the row's dump information; the encrypted value has grown to 52 bytes */
col  0: [52]
de 76 08 74 2a c0 e3 94 89 e6 a8 3b 22 54 ca e5 af 4d eb a0 26 a7 e5 c2 f5
c0 e5 3a a0 09 9a 08 fa 56 2a 92 a0 83 b3 7f 0b 99 03 ad 12 78 d4 03 ec 6e
b3 c2
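Those 52 bytes line up exactly with the overhead components listed above:

/* 52 = 16 (the 1-byte value padded up to a 16-byte block)
      + 16 (salt appended to the value)
      + 20 (MAC integrity check) */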

The 'nomac' and no salt options can be used to reduce the overhead on encrypted columns. The 'nomac' option lets Oracle skip generating and storing the message authentication code (MAC) in the encrypted data; as described above, this code performs the integrity check on the encrypted value and occupies 20 bytes. Using 'nomac' saves the CPU cycles spent on encryption and subsequent operations, as well as 20 bytes of storage per encrypted value.
The no salt option omits the random 16-byte string added during encryption. Where column values are guaranteed to be unique (so an attacker cannot decrypt by comparing known ciphertext), this option saves CPU cycles and 16 bytes of space per value.

askmac.cn>create table Maclean (t1 varchar2(16) encrypt no salt 'nomac');

/* Note that nomac here must be enclosed in single quotes */

askmac.cn>alter table table_name modify column_name encrypt [using ] [no salt] ['nomac'];

In addition, column encryption currently does not support foreign key constraints; the reason for this restriction is that every table has its own unique encryption key. Tablespace encryption has no such restriction, even when a child table does not reside in an encrypted tablespace.

SQL> create table man (t1 int primary key );
Table created.

SQL> create table woman(t1 int encrypt);
Table created.

SQL>  alter table woman add constraint fk foreign key(t1) references man(t1);
 alter table woman add constraint fk foreign key(t1) references man(t1)
                                                 *
ERROR at line 1:
ORA-28335: referenced or referencing FK constraint column cannot be encrypted

The column encryption feature is transparent to table joins, even when the join-condition columns are encrypted (joining tables is transparent, even if the columns in the join condition are encrypted). A partitioning key, however, cannot be an encrypted column; attempting this raises ORA-28346: an encrypted column cannot serve as a partitioning column, as sketched below.
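A minimal sketch of that partitioning restriction (the table name is hypothetical):

create table enc_part (t1 int encrypt no salt)
partition by range(t1)
(partition p1 values less than (100));
-- expected to fail with ORA-28346: an encrypted column cannot serve as a partitioning column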
Indexes on encrypted columns also carry quite a few restrictions (Indexes On Encrypted Columns), summarized here:

  1. An index can only be created on a column encrypted with the no salt option (see the sketch after this list)
  2. Bitmap indexes are not supported on encrypted columns
  3. Encrypted columns do not support foreign keys
  4. An index on an encrypted column can only serve equality queries; range scans are impossible, so non-equality predicates such as BETWEEN and LIKE are not supported. The restriction exists because the data in the index is itself encrypted, so it is effectively sorted by ciphertext; any non-equality predicate therefore cannot use the index and falls back to a full table scan.
  5. If the application does not use equality queries, do not create an index on the encrypted column: it will not help performance and only adds overhead.
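A quick sketch of restriction 1 (hypothetical table names): with no salt, the index creation that failed in the earlier salted example becomes legal:

create table enc_ns (t1 varchar2(30) encrypt no salt);
create index ind_ns on enc_ns(t1);    -- allowed, since the column is encrypted with no salt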

Encrypted tablespaces (TDE Tablespace Encryption) have none of these index restrictions; an index on a table in an encrypted tablespace keeps working even when the index itself lives in an unencrypted tablespace, including for range scans. This clearly comes as a surprise to many people:

SQL> select tablespace_name,ENCRYPTED from dba_tablespaces  where tablespace_name in ('ENC','USERS');
TABLESPACE_NAME                ENC
------------------------------ ---
ENC                            YES
USERS                          NO

SQL> create table tv tablespace enc as select * from dba_objects;
Table created.

SQL> create index pk_tv on tv(object_id) tablespace enc;
Index created.

SQL> set autotrace on;
SQL> select 1 from tv where object_id=9999;
         1
----------
         1

Execution Plan
----------------------------------------------------------
Plan hash value: 2009574168
--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |     1 |    13 |     1   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| PK_TV |     1 |    13 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("OBJECT_ID"=9999)

/* Move the index to an unencrypted tablespace */

SQL> alter index pk_tv rebuild tablespace users;
Index altered.

SQL> select 1 from tv where object_id=9999;

         1
----------
         1

Execution Plan
----------------------------------------------------------
Plan hash value: 2009574168
--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |     1 |    13 |     1   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| PK_TV |     1 |    13 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("OBJECT_ID"=9999)

/* The index has been moved to an unencrypted tablespace, yet it is still used, and with a range scan */

SQL> create bitmap index obj_typ on tv (object_type) tablespace users;
Index created.

/* Creating a bitmap index works as well */

Of course, tablespace encryption still has a few limitations of its own:

  • An encrypted tablespace cannot be exported/imported with the traditional exp/imp utilities; only the Data Pump tools can be used.
  • External Large Objects (BFILEs), being stored outside the database, are naturally not supported either.
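For completeness, an encrypted tablespace like the ENC one used in the demo above can be created along these lines (a sketch: the datafile clause and the AES128 algorithm are assumptions, and the wallet must already be open):

SQL> create tablespace enc
  2  datafile '+DATA' size 100M
  3  encryption using 'AES128'
  4  default storage (encrypt);

Tablespace created.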

A case of ORA-19808

On a Linux 11.2.0.2 RAC system, one node hit ORA-19808 during startup mount; the log follows:


SQL> alter database mount;
alter database mount
*
ERROR at line 1:
ORA-01105: mount is incompatible with mounts by other instances
ORA-19808: recovery destination parameter mismatch


SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options


[oracle@rh3 ~]$ oerr ora 19808
19808, 00000, "recovery destination parameter mismatch"
// *Cause:  The value of parameters DB_RECOVERY_FILE_DEST and
//          DB_RECOVERY_FILE_DEST_SIZE must be same in all instances.
//          instance. All databases must have same recovery destination
//          parameters.
// *Action: Check DB_RECOVERY_FILE_DEST and DB_RECOVERY_FILE_DEST_SIZE
//          values in all instances.

SQL> col name for a30
SQL> col value for a30

SQL> select name ,value,inst_id from gv$system_parameter  where name='db_recovery_file_dest_size';

NAME                           VALUE                             INST_ID
------------------------------ ------------------------------ ----------
db_recovery_file_dest_size     0                                       1
db_recovery_file_dest_size     42278584320                             2

The error occurred because the starting node used a pfile, and the db_recovery_file_dest_size in that pfile did not match the db_recovery_file_dest_size of the instance that was already running.

This cannot happen when a shared server parameter file is used; alternatively, the problem can be resolved by adjusting db_recovery_file_dest_size on the starting node.
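A minimal sketch of the in-memory fix on the failing node, reusing the surviving instance's value from the query above (with a pfile the new value must also be written back to the pfile, or it is lost at the next restart):

SQL> alter system set db_recovery_file_dest_size=42278584320 scope=memory;
System altered.

SQL> alter database mount;
Database altered.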

Resolving Oracle error ORA-15061: a case study

On a Linux 11.2.0.1 system, the following errors appeared in the alert log:

ORA-00202: control file: '+DATA/controlfile/current.256.7446483424'
ORA-17505: ksfdrsz:1 Failed to resize file to size 612 blocks
ORA-15061: ASM operation not supported [41]
WARNING: Oracle Managed File +FRA in the recovery area is orphaned by the control file.
The control file can not keep all recovery area files due to space limitation.
krse.c
Archived Log entry 200 added for thread 1 sequence 200 ID 0x3739c2f0 dest 1:

Meanwhile RMAN backups failed with:
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 
ORA-19510: failed to set size of 6800 blocks for file "+FRA" (block size=8192)
ORA-17505: ksfdrsz:1 Failed to resize file to size 6800 blocks
ORA-15061: ASM operation not supported [41]

[oracle@rh2 ~]$ oerr ora 15061
15061, 00000, "ASM operation not supported [%s]"
// *Cause:  An ASM operation was attempted that is invalid or not supported
//          by this version of the ASM instance.
// *Action: This is an internal error code that is used for maintaining
//          compatibility between software versions and should never be
//          visible to the user; contact Oracle support Services.
//

After an SR was filed, Oracle GCS confirmed the issue as Bug 9788316:

1. The following error indicates that it failed to resize the controlfile to 612 blocks. If the DB_BLOCK_SIZE is 8192,
then 612 blocks is not more than 5MB. According to the results of the query on V$ASM_DISKGROUP in 'results01.txt' file, 
the ASM diskgroup +DATA has 108605 MB free space. So, the ASM diskgroup +DATA has enough space for the 612 blocks.

2. By the way, please confirm whether you have recently applied PSU #1. Anyway, 
please try to relink the Oracle executables, as shown here. Before you run the "relink" command, 
make sure to shutdown both the ASM instance and target database.

$ORACLE_HOME/bin/relink all

3. After relinking the Oracle executables, please confirm whether you are still
experiencing the same ORA-15061 error.

ORA-15061: ASM Operation Not Supported [41] After Apply PSU #1 (Doc ID 1126113.1)
ORA-15061 reported while doing a file operation with 11.1 or 11.2 ASM after PSU applied in database home (Doc ID 1070880.1)

Hdr: 9788316 11.2.0.1.1 RDBMS 11.2.0.1.0 ASM PRODID-5 PORTID-267
Abstract: AFTER APPLY PSU 1 (11.2.0.1.1) ON RDBMS HOME UNABLE TO RESIZE ASM DATAFILE.

*** 06/07/10 02:09 pm ***
----
3-1827355081

PROBLEM:
--------
1) After applying PSU 1 (11.2.0.1.1) on the 11.2.0.1.0 RDBMS Oracle Home
customer is unable to resize an ASM datafile:
============================================================
SQL> alter database datafile
'+DATA/brg13ed/datafile/ts_brg_users.1279.711064345' resize 560M;
alter database datafile '+DATA/brg13ed/datafile/ts_brg_users.1279.711064345'
resize 560M
*
ERROR at line 1:
ORA-1237: cannot extend datafile 4
ORA-1110: data file 4: '+DATA/brg13ed/datafile/ts_brg_users.1279.711064345'
ORA-17505: ksfdrsz:1 Failed to resize file to size 71680 blocks
ORA-15061: ASM operation not supported [41]

============================================================

2) DB alertlog reports (alert_brg13ed1.log):
============================================================

Mon Jun 07 11:27:57 2010
Stopping background process CJQ0
Mon Jun 07 12:30:28 2010
alter database datafile '+DATA/brg13ed/datafile/ts_brg_users.1279.711064345'
resize 536870915
ORA-1237 signalled during: alter database datafile
'+DATA/brg13ed/datafile/ts_brg_users.1279.711064345' resize 536870915...
Mon Jun 07 13:07:40 2010
alter database datafile '+DATA/brg13ed/datafile/ts_brg_users.1279.711064345'
resize 560m
ORA-1237 signalled during: alter database datafile
'+DATA/brg13ed/datafile/ts_brg_users.1279.711064345' resize 560m...
Mon Jun 07 13:49:50 2010
alter database datafile '+DATA/brg13ed/datafile/ts_brg_users.1279.711064345'
resize 800M
ORA-1237 signalled during: alter database datafile
'+DATA/brg13ed/datafile/ts_brg_users.1279.711064345' resize 800M...
Mon Jun 07 13:58:51 2010
alter database datafile '+DATA/brg13ed/datafile/ts_brg_users.1279.711064345'
resize 560M
ORA-1237 signalled during: alter database datafile
'+DATA/brg13ed/datafile/ts_brg_users.1279.711064345' resize 560M...
Mon Jun 07 14:25:43 2010

============================================================

3) ASM alert.log does not report any problem.

DIAGNOSTIC ANALYSIS:
--------------------
(see below)

WORKAROUND:
-----------
None


*** 06/07/10 02:09 pm ***
4) I can confirm that at (Jun 07 10:55:59 EDT 2010) customer installed the
Patch: 9355126 (on the Grid OH), which includes the fix for Bug: 8898852:
============================================================

Invoking OPatch 11.2.0.1.2

Oracle Interim Patch Installer version 11.2.0.1.2

Oracle Home : /u01/grid/oracle/product/grid
Central Inventory : /u01/app/oracle/oraInventory
from : /var/opt/oracle/oraInst.loc
OPatch version : 11.2.0.1.2
OUI version : 11.2.0.1.0
OUI location : /u01/grid/oracle/product/grid/oui
——————————————————————————

Installed Top-level Products (1):

Oracle Grid Infrastructure
11.2.0.1.0
There are 1 products installed in this Oracle Home.

Installed Products (87):

There are 87 products installed in this Oracle Home.

Interim patches (1) :

Patch 9355126 : applied on Mon Jun 07 10:55:59 EDT 2010
Unique Patch ID: 12175902
Created on 5 Feb 2010, 07:38:30 hrs PST8PDT
Bugs fixed:
8974548, 8898852
============================================================

5) Also, I can confirm that at (Mon May 03 00:01:14 EDT 2010) customer
installed the 11.2.0.1.1 Patch Set Update (PSU #1) which include as well the
patch for Bug: 8898852 (on the RDBMS OH):
============================================================
Invoking OPatch 11.2.0.1.2

Oracle Interim Patch Installer version 11.2.0.1.2

Oracle Home : /u01/app/oracle/product/11.2.0/db1
Central Inventory : /u01/app/oracle/oraInventory
from : /var/opt/oracle/oraInst.loc
OPatch version : 11.2.0.1.2
OUI version : 11.2.0.1.0
OUI location : /u01/app/oracle/product/11.2.0/db1//oui

——————————————————————————

Installed Top-level Products (1):

Oracle Database 11g
11.2.0.1.0
There are 1 products installed in this Oracle Home.

Installed Products (134):

. 11.2.0.1.0
There are 134 products installed in this Oracle Home.

Interim patches (1) :

Patch 9352237 : applied on Mon May 03 00:01:14 EDT 2010
Unique Patch ID: 12366369
Created on 6 Apr 2010, 05:03:41 hrs PST8PDT
Bugs fixed:
8661168, 8769239, 8898852, 8801119, 9054253, 8706590, 8725286, 8974548
8778277, 8780372, 8769569, 9027691, 9454036, 9454037, 9454038, 8761974
7705591, 8496830, 8702892, 8639114, 8723477, 8729793, 8919682, 8818983
9001453, 8475069, 9328668, 8891929, 8798317, 8820324, 8733749, 8702535
8565708, 9036013, 8735201, 8684517, 8870559, 8773383, 8933870, 8812705
8405205, 8822365, 8813366, 8761260, 8790767, 8795418, 8913269, 8897784
8760714, 8717461, 8671349, 8775569, 8898589, 8861700, 8607693, 8642202
8780281, 9369797, 8780711, 8784929, 8834636, 9015983, 8891037, 8828328
8570322, 8832205, 8665189, 8717031, 8685253, 8718952, 8799099, 8633358
9032717, 9321701, 8588519, 8783738, 8796511, 8782971, 8756598, 9454385
8856497, 8703064, 9066116, 9007102, 8721315, 8818175, 8674263, 9352237
8753903, 8720447, 9057443, 8790561, 8733225, 9197917, 8928276, 8991997,
8837736
============================================================

6) But the problem persists:
============================================================
SQL> alter database datafile
'+DATA/brg13ed/datafile/ts_brg_users.1279.711064345' resize 560M;
alter database datafile '+DATA/brg13ed/datafile/ts_brg_users.1279.711064345'
resize 560M
*
ERROR at line 1:
ORA-1237: cannot extend datafile 4
ORA-1110: data file 4: '+DATA/brg13ed/datafile/ts_brg_users.1279.711064345'
ORA-17505: ksfdrsz:1 Failed to resize file to size 71680 blocks
ORA-15061: ASM operation not supported [41]

============================================================

7) 11.2.0.1.1 Patch Set Update (PSU #1) should include the fix for bug:
8898852 (ORA-15061: ASM operation not supported [41]).
This PSU contains a fix that adds an operation to the ASM protocol.
The problem is that the PSU has been applied correctly to the RDBMS
home, but the fix is not correctly applied to the ASM HOME.

So either install that patch to the ASM HOME, or if that has been done
check why it failed.

A common reason for failure is a permission problem. Check bug 9711074
and its duplicates for more background. And check documentation bug 8629483
for the solution.

How do I know the fix is not missing? This error message is the key:
ORA-15061: ASM operation not supported [41]
(especially with operation 41)

If the patch was applied to the grid home, then probably the relink failed
due to permission problems.

ORA-15061: ASM Operation Not Supported [41] After Apply PSU #1 [ID 1126113.1]

Applies to:
Oracle Server – Enterprise Edition – Version: 11.2.0.1 and later [Release: 11.2 and later ]
Information in this document applies to any platform.
Symptoms
1) After apply PSU 1 (11.2.0.1.1 ) on the Grid Infrastructure Oracle Home
customer is unable to resize an ASM datafile:

(The SQL error and alert-log symptoms are identical to the bug text quoted above and are omitted here; the ASM alert.log again reported no problem.)

Changes
PSU 1 (11.2.0.1.1 ) patchset was installed on the Grid Infrastructure Oracle Home
Cause

The PSU #1 (11.2.0.1.1 ) patch on the Grid Infrastructure Oracle Home was not correctly linked by opatch.

The PSU #1 (11.2.0.1.1 ) patch on the Grid Infrastructure Oracle Home needs to be relinked.
Solution
Relink the PSU #1 patch on the Grid Infrastructure Oracle Home as follows:

1) Shutdown the database instances.

2) Shutdown the ASM instance.

3) Relink the Grid Infrastructure Oracle Home as follows:

script /tmp/relink_GI.txt

env | sort

$ORACLE_HOME/bin/relink all

exit

4) Startup the ASM instance.

5) Startup the databases.

6) Resize the datafile:

SQL> alter database datafile '+DATA/brg13ed/datafile/ts_brg_users.1279.711064345' resize 560M;

This bug generally occurs when PSU 1 (11.2.0.1.1) has been applied to the Grid Infrastructure home but the Oracle binaries were not linked correctly. The practical fix is a relink all, but be aware that it requires restarting the instances!
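To at least confirm that the fix for bug 8898852 is registered in a given home, the opatch inventory can be grepped, once against the Grid home and once against the RDBMS home (this only shows the patch is recorded, not that the relink succeeded):

$ $ORACLE_HOME/OPatch/opatch lsinventory | grep 8898852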

Advice on upgrading to 11.2.0.2 and converting to RAC

Question: Problem Description: Currently we are running 11.2.0.1 Grid on 2 nodes. We are facing DB crash issues and various bugs in the current environment. One of the options Oracle provided is to upgrade Grid to 11.2.0.2 and convert to real RAC.

There are 11.2.0.1 databases under the clusterware. These DBs currently run actively on one node, with cold-failover HA enabled by Oracle's perl scripts. We also have 10g standalone DBs running under the same clusterware.

We are planning the upgrade of Grid from 11.2.0.1 to 11.2.0.2, and also converting the cold-failover HA DBs to real 2-node RAC or RAC One Node.

We will need Oracle's advice on the technical road map and procedures for this, as well as the issues we might face. The upgrade is planned to start one week from now. We would really appreciate Oracle getting back to us as quickly as possible. Thanks.

Answer:
Please follow the documents below for a manual upgrade, or for upgrading with DBUA:
Complete Checklist for Manual Upgrades to 11gR2 (Doc ID 837570.1)
Complete checklist to upgrade the database to 11g R2 using DBUA (Doc ID 870814.1)
Please follow this document to convert a single instance to RAC:
How to Convert 10g Single-Instance database to 10g RAC using Manual Conversion procedure (Doc ID 747457.1)

1. Our current version of Grid/clusterware is 11.2.0.1. Two versions of RDBMS are currently running on the cluster: 11.2.0.1 and 10.2.0.5. They are all on ASM. +ASM1 runs on node 1 while +ASM2 runs on node 2. Some DBs are standalone DBs and some are Oracle active-passive HA with failover controlled by the Oracle perl scripts.
We are planning to upgrade Grid to 11.2.0.2 and convert the Oracle HA setups to real 2-node RAC or RAC One Node. At the same time, we plan to move another system's DBs onto the same cluster. The current DB versions for that system include 10.2.0.4, 10.2.0.1, 9.2.0.7 and 9.2.0.5.

Ans:

Oracle 9i databases were already not manageable by Oracle Clusterware 11.1; the clusterware has moved past that version limit. So we kindly request that you upgrade the 9i databases ASAP if these databases are to be managed by the 11gR2 clusterware.

2. Does Oracle have any suggestion on the order of the upgrade to 11.2.0.2 and the conversion from Oracle HA under clusterware to real RAC?

Ans:

There is no documentation for this. I assume that you have installed the clusterware, and that ASM and RDBMS have been installed on the active-passive (Oracle HA) nodes. You need to disable the perl script and bring up the clusterware, ASM and RDBMS instances on the passive node.

3. In note 1212703.1 that Oracle provided, patch 9974223 is mentioned for the case where multicast is enabled only on the 224.0.0.251 address. I am confused by the instructions in the readme of patch 9974223: it mentions applying it to both the CRS_HOME and the RDBMS HOME. If we upgrade the RDBMS to 11.2.0.2 as well, besides the Grid 11.2.0.2 upgrade, is applying the patch to the RDBMS HOME required or just optional? If required, does the patch only need to go into the RDBMS home that runs a RAC DB? Does it need to be applied for a standalone DB on the Grid/ASM?

Ans:

The patch 9974223 is applicable to both CRS and RDBMS. You can apply this patch to both homes; yes, please apply the patch to the RDBMS home as well. You do not need to apply it for a standalone DB on the Grid/ASM.

1. Per Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip [ID 1210883.1], Redundant Interconnect without any 3rd-party IP failover technology (bond, IPMP or similar) is supported natively by Grid Infrastructure starting from 11.2.0.2. That note covers the interconnect.
What about the Oracle VIP? We already bound the 2nd public interface of IPMP to the node VIP (nodeapps) in order to resolve the DB listener-down issue when IPMP fails over to the 2nd interface. Do we have to undo this before the 11.2.0.2 upgrade?

Ans: No, there is no need to undo the IPMP configuration that is already in place.

2. In the note Oracle provided: http://download.oracle.com/docs/cd/E11882_01/install.112/e17213/procstop.htm#CEGHBJDB, It says:
=========
– To upgrade existing 11.2.0.1 Oracle Grid Infrastructure installations to Oracle Grid Infrastructure 11.2.0.2, you must first do at least one of the following:

Patch the release 11.2.0.1 Oracle Grid Infrastructure home with the 9413827 and 9706490 patches.

Install Oracle Grid Infrastructure Patch Set 1 (GI PSU1) or Oracle Grid Infrastructure Patch Set 2 (GI PSU2).

– Oracle Clusterware and Oracle ASM upgrades are always out-of-place upgrades. With 11g release 2 (11.2), you cannot perform an in-place upgrade of Oracle Clusterware and Oracle ASM to existing homes.
==========

If "Oracle Clusterware and Oracle ASM upgrades are always out-of-place upgrades" in 11.2, why do we still need to patch the existing 11.2.0.1 Grid CRS HOME as mentioned above? If it is out of place, then Grid will get a new CRS HOME.

Ans: Yes, you still need to apply the recommended patches before the upgrade. The reason is that, due to known issues, it is recommended to apply the patches before upgrading.
Of course the Grid Infrastructure needs to be installed into a new home. Please go through the Grid installation documentation for more detail.

1. Our Grid is 11.2.0.1.0, so we only need to apply the one patch 9413827?

Yes. Please apply 9413827 only.

2. Do we need to apply 9413827 to the RDBMS Home for 2 HA DBs that are under CRS control right before we upgrade the Grid to 11.2.0.2?

Yes. Please apply it.

Please refer to the document below to check whether the RAC option is enabled:
How to check RAC Option is currently linked into the Oracle Binary (Doc ID 284785.1)
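Per that note, on Unix platforms the check boils down to which object was linked into libknlopt.a: kcsm.o means the RAC option is on, while ksnkcs.o means it is off. A quick sketch:

$ ar -t $ORACLE_HOME/rdbms/lib/libknlopt.a | grep -E 'kcsm.o|ksnkcs.o'
kcsm.o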

RAC One Node is only supported with an 11g database under 11.2.0.2 Grid on Solaris 10, correct?

Yes. It is supported. Please find the certification matrix below:

OS  Product       Certified With                             Version  Status
10  11gR2 64-bit  Oracle Clusterware                         11g      Certified
10  11gR2 64-bit  Oracle Solaris Cluster                     3.3      Certified
10  11gR2 64-bit  Oracle Solaris Cluster                     3.2      Certified
10  11gR2 64-bit  Veritas Storage Foundation for Oracle RAC  5.1      Certified
10  11gR2 64-bit  Veritas Storage Foundation for Oracle RAC  5.0      Certified

We have some DBs on 10g. Right now they are standalone DBs and cannot be upgraded for now. Does Oracle support a 10g database as a 2-node RAC under 11.2.0.2 Grid on Solaris 10?

Yes. It is supported.

More on the 11gR2 diskmon

The figure below shows the dependencies among all the resources/daemons managed by ohasd:

[Figure: dependency diagram of the resources/daemons managed by ohasd - image not preserved]

Diskmon

 

Master diskmon

• Monitors CELLSRVs and Network using Heartbeats
• Propagates Cell/Network state to ASM/RDBMS processes (dskm)
• Maintains a cluster-wide global view of Cells with other DISKMONs in the cluster
• Accepts fencing requests from CRS and delivers them to the Cells
• Accepts intradatabase IORM plans from RDBMS and sends them to the Cells
• Provides communication with the cells

 

 

The diskmon daemon monitors the cell servers in Exadata, so it is only useful in an Exadata environment. In releases 11.2.0.1 through 11.2.0.2, however, the daemon is started by default even in non-Exadata environments. 11.2.0.3 refined this detail: in a non-Exadata environment diskmon is no longer started.

11.2.0.3 Grid Infrastructure diskmon Will be Offline by Default in Non-Exadata Environment

What is being announced?

As the Grid Infrastructure daemon diskmon.bin is used for Exadata fencing, starting from 11.2.0.3 the resource ora.diskmon will be offline in non-Exadata environments. This is an expected behaviour change.

Prior to 11.2.0.3:

ps -ef| grep diskmon.bin
grid      3361  3166  0 22:57 ?        00:00:00 /ocw/grid/bin/diskmon.bin -d -f

On 11.2.0.3:

ps -ef| grep diskmon.bin

>> no more diskmon.bin
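The resource state can also be confirmed with crsctl (illustrative output; on a non-Exadata 11.2.0.3 system both TARGET and STATE should show OFFLINE):

$ crsctl stat res ora.diskmon -init
NAME=ora.diskmon
TYPE=ora.diskmon.type
TARGET=OFFLINE
STATE=OFFLINE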

 

 

Some log excerpts from the diskmon process:

 

[ CSSD]2009-07-27 10:27:36.419 [20] >TRACE: kgzf_dskm_conn4: unable to connect to master
diskmon in 60174 msec

[ CSSD]2009-07-27 10:27:36.419 [20] >TRACE: kgzf_send_main1: connection to master diskmon
timed out

[ CSSD]2009-07-27 10:27:36.421 [22] >TRACE: KGZF: Fatal diskmon condition, IO fencing is
not available. For additional error info look at the master diskmon log file (diskmon.log)

[ CSSD]2009-07-27 10:27:36.421 [22] >ERROR: ASSERT clsssc.c 2471
[ CSSD]2009-07-27 10:27:36.421 [22] >ERROR: clssscSAGEInitFenceCompl: Fence completion
failed, rc 56859

It seems that the new diskmon process registered with Oracle Clusterware is not able to communicate properly.

setsid: failed with -1/1
dskm_getenv_oracle_user: calling getpwnam_r for user oracle
dskm_getenv_oracle_user: info for user oracle complete
dskm_set_user: unable to change ownership for the log directory
/optware/oracle/11.1.0.7/crs/log/shplab01/diskmon to user oracle, id 1101, errno 1
07/27/09 10:27:37: Master Diskmon starting

The tusc output for the CSSD process gives the following information:
...
1248953770.528145 [/optware/ora][20944]{2992772}
unlink("/var/spool/sockets/pwgr/client20944") ERR#2 ENOENT
1248953770.612485 [/optware/ora][20944]{2992772}
unlink("/tmp/.oracle_master_diskmon") ERR#1 EPERM
1248953770.649479 [/optware/ora][20944]{2992772}
unlink("/tmp/.oracle_master_diskmon") ERR#1 EPERM
1248953770.656719 [/optware/ora][20944]{2992772}
unlink("/var/spool/sockets/pwgr/client20944") ERR#1 EPERM
...

There is a permission error on the file /tmp/.oracle_master_diskmon.

Solution

The resolution is to correct the ownership of the file /tmp/.oracle_master_diskmon, which should be owned by oracle.
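A minimal sketch of the fix (the oinstall group is an assumption; substitute the oracle user's actual primary group):

# ls -l /tmp/.oracle_master_diskmon
# chown oracle:oinstall /tmp/.oracle_master_diskmon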

diskmon.log
============
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.6]{0}:
(14036 -> 12265) SKGXP_CHECK_HEART_BEAT_RESP_EXPIRE: NO PATH to Monitor
entry: 0x17161490
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.7]{0}:
  Subnet: 0
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.8]{0}:
   Remote endpoint [192.168.10.3/44538] is DOWN
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] SKGXP:[386927568.9]{0}:
   Local endpoint [192.168.10.1/45530] is UP
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512]
SKGXP:[386927568.10]{0}: SKGXP_DO_HEART_BEAT_RESP: Matching Monitor Entry Not
Found
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512]
SKGXP:[386927568.11]{0}:   SKGXPGPID Internet address 192.168.10.3 RDS port
number 44538
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512] dskm_hb_thrd_main11:
got status change
2011-12-01 22:14:49.510: [ DISKMON][14036:1093384512]
dskm_ant_rsc_monitor_start: rscnam: o/192.168.10.3 rsc: 0x171609c0 state:
UNREACHABLE reconn_attempts: 0 last_reconn_ts: 1322773921
2011-12-01 22:14:49.649: [ DISKMON][14036:1093384512]
dskm_node_guids_are_offline: query SM done. retcode = 56891(REACHABLE)
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] dskm_oss_get_net_info5:
oss_get_net_info for device o/192.168.10.3 returned skgxpid
040302010001894cb5afca0419ed706ae92f000008000000000000000000000001030000c0a80a
03000000000000000000000000adfa00000000000016000000 and the following 1 ip
adresess. known_reid: Yes
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]     192.168.10.1
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]
dskm_ant_rsc_monitor_start6.5:Cell does support TCP monitor, and does support
SM Query, cell incarnation is 1, guid num is 2
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] GUID-0 =
0x0021280001a0af15
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512] GUID-1 =
0x0021280001a0af16
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]
dskm_ant_rsc_monitor_start2: Connected to Same_Inc OSS device: o/192.168.10.3
numIP: 1
2011-12-01 22:14:49.657: [ DISKMON][14036:1093384512]     192.168.10.1

2011-12-01 22:15:07.501: [ DISKMON][14036:1108523328] dskm_slave_thrd_main3:
peer disconnected
2011-12-01 22:15:07.501: [ DISKMON][14036:1108523328] dskm_slave_thrd_main5:
client +ASM1/ASM/15374 disconnected, reid
cid=14e1b2b4de58ef1eff5487b58dccc906,icin=188142332,nmn=1,lnid=188142332,gid=7
,gin=1,gmn=0,umemid=0,opid=8,opsn=1,lvl=process hdr=0x       0

2011-12-01 22:15:08.440: [ CSSCLNT]clsssRecvMsg: got a disconnect from the
server while waiting for message type 1
2011-12-01 22:15:08.440: [ CSSCLNT]clssgsGroupGetStatus:  communications
failed (0/3/-1)

2011-12-01 22:15:08.440: [ CSSCLNT]clssgsGroupGetStatus: returning 8

2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] CRITICAL: Diskmon
exiting: dskm_rac_thrd_main10: Diskmon is shutting down due to CSSD ABORT
event
2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] SHUTDOWN FORCE due to
CSSD ABORT
2011-12-01 22:15:08.440: [ DISKMON][14036:1102219584] dskm_rac_thrd_main:
exiting
2011-12-01 22:15:08.754: [ DISKMON][14036:1104320832] dskm_slave_thrd_main5:
client orarootagent/13701 disconnected, reid
cid=DUMMY,icin=-1,nmn=-1,lnid=-1,gid=-1,gin=-1,gmn=-1,umemid=-1,opid=-1,opsn=-
1,lvl=process hdr=0xfece0100
2011-12-01 22:15:09.988: [ DISKMON][14036:1191118288] dskm_cleanup_thrds:
cleaning up the rac event handling thread tid 1102219584
[ DISKMON][13016]

I/O Fencing and SKGXP HA monitoring daemon -- Version 1.2.0.0
Process 13016 started on 2011-12-01 at 22:15:39.863

2011-12-01 22:15:39.867: [ DISKMON][13016] dskm main: starting up

ocssd.log
==========
2011-12-01 22:15:04.223: [    CSSD][1127139648]clssgmmkLocalKillThread: Time
up. Timeout 60500 Start time 369099698 End time 369160198 Current time
369160198
2011-12-01 22:15:04.223: [    CSSD][1127139648]clssgmmkLocalKillResults:
Replying to kill request from remote node 2 kill id 1 Success map 0x00000000
Fail map 0x00000000
2011-12-01 22:15:04.224: [GIPCHAUP][1094015296] gipchaUpperProcessDisconnect:
processing DISCONNECT for hendp 0x2aa5550 [00000000000092e5] { gipchaEndpoint
: port 'nm2_gts-cluster/af9c-724c-2e3f-3946', peer
'gts1db02:205f-3cac-025e-c962', srcCid 00000000-000092e5,  dstCid
00000000-000009d9, numSend 0, maxSend 100, usrFlags 0x4000, flags 0x204 }
2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmeventhndlr:
Disconnecting endp 0x932d ninf 0x1c3a2c0
2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmDiscHelper: gts1db02,
node(2) connection failed, endp (0x932d), probe(0x3000000000), ninf->endp
0x932d
2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmDiscHelper: node 2 clean
up, endp (0x932d), init state 3, cur state 3
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcInternalDissociate: obj
0x2e99290 [000000000000932d] { gipcEndpoint : localAddr
'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
0x0 } not associated with any container, ret gipcretFail (1)
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcDissociateF
[clssnmDiscHelper : clssnm.c : 3284]: EXCEPTION[ ret gipcretFail (1) ]  
failed to dissociate obj 0x2e99290 [000000000000932d] { gipcEndpoint :
localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
0x0 }, flags 0x0
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcInternalDissociate: obj
0x2e99290 [000000000000932d] { gipcEndpoint : localAddr
'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
0x0 } not associated with any container, ret gipcretFail (1)
2011-12-01 22:15:04.224: [GIPCXCPT][1122408768] gipcDissociateF
[clssnmDiscHelper : clssnm.c : 3430]: EXCEPTION[ ret gipcretFail (1) ]  
failed to dissociate obj 0x2e99290 [000000000000932d] { gipcEndpoint :
localAddr 'gipcha://gts1db01:nm2_gts-cluster/af9c-724c-2e3f-394', remoteAddr
'gipcha://gts1db02:205f-3cac-025e-c96', numPend 0, numReady 0, numDone 0,
numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x13860e, usrFlags
0x0 }, flags 0x0
2011-12-01 22:15:04.224: [    CSSD][1122408768]clssnmDiscEndp: gipcDestroy
0x932d
2011-12-01 22:15:04.603: [    
CSSD][1104976192](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by
node gts1db02, number 2, sync 188142334, stamp 393990918
2011-12-01 22:15:04.603: [    
CSSD][1104976192]###################################
2011-12-01 22:15:04.603: [    CSSD][1104976192]clssscExit: CSSD aborting from
thread clssnmvKillBlockThread
2011-12-01 22:15:04.603: [    
CSSD][1104976192]###################################
2011-12-01 22:15:04.603: [    CSSD][1104976192](:CSSSC00012:)clssscExit: A
fatal error occurred and the CSS daemon is terminating abnormally
gts1db01, number 1, has experienced a failure in thread number 10 and is
shutting down
2011-12-01 22:15:04.603: [    CSSD][1104976192]clssscExit: Starting CRSD
cleanup

2011-12-01 22:15:04.737: [    CSSD][1103399232]clssgmDiscEndpcl: gipcDestroy
0xa2ea2f7
2011-12-01 22:15:04.925: [    
CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
3472942430 ms for voting file o/192.168.10.5/DBFS_DG_CD_04_gts1cel03)
2011-12-01 22:15:04.925: [    CSSD][1112947008]clssnmvDiskAvailabilityChange:
voting file o/192.168.10.5/DBFS_DG_CD_04_gts1cel03 now offline
2011-12-01 22:15:04.925: [    
CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
3472942450 ms for voting file o/192.168.10.4/DBFS_DG_CD_02_gts1cel02)
2011-12-01 22:15:04.925: [    CSSD][1112947008]clssnmvDiskAvailabilityChange:
voting file o/192.168.10.4/DBFS_DG_CD_02_gts1cel02 now offline
2011-12-01 22:15:04.925: [    
CSSD][1112947008](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
3472942480 ms for voting file o/192.168.10.3/DBFS_DG_CD_02_gts1cel01)
2011-12-01 22:15:04.926: [    CSSD][1112947008]clssnmvDiskAvailabilityChange:
voting file o/192.168.10.3/DBFS_DG_CD_02_gts1cel01 now offline
2011-12-01 22:15:04.926: [    
CSSD][1112947008](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured
voting disks available, need 2
2011-12-01 22:15:04.926: [    CSSD][1112947008]clssscExit: abort already set
1
2011-12-01 22:15:04.926: [   SKGFD][1109793088]Lib :OSS:: closing handle
0x2538e70 for disk :o/192.168.10.5/DBFS_DG_CD_04_gts1cel03:

2011-12-01 22:15:04.926: [   SKGFD][1098676544]Lib :OSS:: closing handle
0x2aaaac0d7cb0 for disk :o/192.168.10.3/DBFS_DG_CD_02_gts1cel01:

Heartbeat timeout logic may fail to detect dead cells if diskmon
has been running for over 40 days.

Rediscovery Notes:
 If diskmon has been running for over 40 days and DB processes start to hang
 after a cell death, you may have hit this bug.

 All nodes may hang due to one of the heartbeat threads in diskmon
getting stuck trying to notify the instance(s) that it reconnected
to the cell. However if this occurs there is insufficient diagnostic
data collected to help confirm why the hang occurred.
This fix is a diagnostic enhancement for this scenario.

If diskmon/DSKM processes are hung/stuck this fix may help collect
additional useful diagnostics.
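A quick way to check how long diskmon has been up (ps prints the elapsed time as [[dd-]hh:]mm:ss, so anything beyond 40-00:00:00 is in the exposure window; the bracketed [d] keeps grep from matching its own process):

$ ps -eo pid,etime,args | grep [d]iskmon.bin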

PROBLEM:
--------
Diskmon logs fill up quickly causing their disk / volume /u00  to become full

DIAGNOSTIC ANALYSIS:
--------------------
1)  The following are consistently and repeatedly logged:

2010-12-17 03:22:25.848: [ DISKMON][17796:1089268032] dskm_rac_ini13: calling
clssgsqgrp
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_rac_ini80: called
clssgsqgrp:
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv:
vers: 0 flags: 0x0 confInc: 0 My confInc: 0
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv:
CSS Msg Hdr: vers: 0 type: UNKNOWN (0) chunks: NO MORE CHUNKS (0) transport:
UNKNOWN (0) mSize: 0
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_dump_group_priv:
Group Private Data is not of type DSKM_MSG_SS_REQ. Not proceeding with msg
dump
2010-12-17 03:22:25.849: [ DISKMON][17796:1089268032] dskm_rac_ini15: Found
my member number 1 to be busy. Waiting (attempts: 598) for OCSSD to clean up
previous incarnation

2) Core files are generated, and the many stack dumps of diskmon further enlarge the
diskmon.log

   The following is frequently seen in the diskmon.log

2010-12-17 03:22:28.855: [ DISKMON][17796:1089268032] dskm_rac_ini16: OCSSD
has notified that another diskmon is currently running in this node.
This might be a duplicate startup. If not consult OCSSD log for additional
information.
2010-12-17 03:22:28.855: [ DISKMON][17796] INCIDENT : EXCEPTION (SIGNAL: 6)
in [gsignal()]
2010-12-17 03:22:28.855: [ DISKMON][17796] Thread 1089268032 got exception 6
2010-12-17 03:22:28.855: [ DISKMON][17796] Stack dump for thread 1089268032
[ DISKMON][17796]
....
