Storage options for the Oracle Cluster Registry (OCR) in 11.2

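Starting with 11.2, the OCR and the voting disks can be stored in ASM. This release lists the supported storage types for the Oracle Cluster Registry (OCR) in the following clusterware log. The log was written because the configured OCR location was not a supported storage type: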

2011-10-29 00:13:18.828: [  OCROSD][1087046128]utstoragetypecommon:
Oracle Cluster Registry does not support the storage type configured.
OCR can be configured on: ASM, OCFS, OCFS2, NFS, Block Device, Character Device, VxFS
2011-10-29 00:13:18.829: [  OCROSD][1087046128]utopen:6m'': OCR location # [0] [/g01/ocrlocal]
configured is not a valid storage type. Return code [37].
2011-10-29 00:13:18.829: [  OCROSD][1087046128]utopen:7: failed to open any OCR file/disk,
errno=11, os err string=Resource temporarily unavailable
2011-10-29 00:13:18.829: [  OCRRAW][1087046128]proprinit: Could not open raw device
2011-10-29 00:13:18.829: [ default][1087046128]a_init:7!: Backend init unsuccessful : [26]
2011-10-29 00:13:19.831: [  OCROSD][1087046128]utstoragetypecommon:
Oracle Cluster Registry does not support the storage type configured.
OCR can be configured on: ASM, OCFS, OCFS2, NFS, Block Device, Character Device, VxFS

ASM, OCFS, OCFS2, NFS, block devices, character devices and VxFS can all hold the OCR file. NFS is a supported option, but a network file system is not really recommended in production. MOS Note <Mount Options for Oracle files when used with NAS devices [ID 359515.1]> covers what to watch out for when mounting NFS.
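You can roughly pre-check which of these categories a candidate OCR location falls into. The location is either an ASM disk group, a block or character device, or a file on a mount point that uses one of the listed file systems. The Python sketch below does this under two assumptions. First, ASM locations are written as `+DISKGROUP`. Second, the file-system names `ocfs`, `ocfs2`, `nfs`, `nfs4` and `vxfs` match what Linux reports in `/proc/mounts`. This is only a heuristic. It is not Oracle's own validation, which is the `utstoragetypecommon` check in the log above.

```python
import os
import stat

# Assumption: fstype names as Linux reports them in /proc/mounts for the
# file systems listed in the OCR log message.
SUPPORTED_FS = {"ocfs": "OCFS", "ocfs2": "OCFS2", "nfs": "NFS", "nfs4": "NFS", "vxfs": "VxFS"}

def read_mounts(path="/proc/mounts"):
    """Return (mount_point, fs_type) pairs from a Linux mount table."""
    with open(path) as f:
        return [tuple(line.split()[1:3]) for line in f if line.strip()]

def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point that contains `path`."""
    best_mnt, best_type = "", None
    for mnt, fstype in mounts:
        prefix = mnt.rstrip("/") + "/"
        if (path == mnt or path.startswith(prefix)) and len(mnt) > len(best_mnt):
            best_mnt, best_type = mnt, fstype
    return best_type

def classify_ocr_location(path, mounts, mode=None):
    """Map an OCR location to a storage type from the log, or None if unsupported."""
    if path.startswith("+"):
        return "ASM"  # assumption: ASM locations are disk group names like +DATA
    if mode is None:
        mode = os.stat(path).st_mode
    if stat.S_ISBLK(mode):
        return "Block Device"
    if stat.S_ISCHR(mode):
        return "Character Device"
    return SUPPORTED_FS.get(fs_type_for(path, mounts))

if __name__ == "__main__":
    # Hypothetical mount table: /g01 on ext3, /ocfs on ocfs2.
    demo = [("/", "ext4"), ("/g01", "ext3"), ("/ocfs", "ocfs2")]
    print(classify_ocr_location("/g01/ocrlocal", demo, mode=stat.S_IFREG))  # None
    print(classify_ocr_location("/ocfs/ocr", demo, mode=stat.S_IFREG))      # OCFS2
```

On a real host you would pass `read_mounts()` and omit `mode`, so that the function stats the actual path.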

Understanding how ocssd.bin controls RAC node reboots

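ocssd.bin is one of the key background processes of RAC Clusterware. Its many complex functions are not covered here. This section looks only at some details of how ocssd.bin reboots a node.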

Note that in an 11gR2 standalone environment, a crash or panic of ocssd.bin does not cause a node reboot, and neither does killing it manually:

[oracle@mlab1 ~]$ crsctl  stat res  -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       mlab1                                        
ora.FRA.dg
               ONLINE  ONLINE       mlab1                                        
ora.LISTENER.lsnr
               ONLINE  ONLINE       mlab1                                        
ora.asm
               ONLINE  ONLINE       mlab1                    Started             
ora.ons
               OFFLINE OFFLINE      mlab1                                        
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       mlab1                                        
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.evmd
      1        ONLINE  ONLINE       mlab1                                        
ora.proda.db
      1        ONLINE  ONLINE       mlab1                    Open                

First, raise the CSSD log level to 2 to get more detailed CSSD logging:

[oracle@mlab1 ~]$ crsctl debug log css CSSD:2
CRS-4151: DEPRECATED: use crsctl set log {css|crs|evm}
Set CSSD Module: CSSD  Log Level: 2

In 11g, crsctl debug log is deprecated (see the CRS-4151 message above). Use the crsctl set log css syntax instead:

[oracle@mlab1 ~]$ crsctl set log css CSSD:2
Set CSSD Module: CSSD  Log Level: 2

[oracle@mlab1 ~]$ crsctl get log css CSSD
Get CSSD Module: CSSD  Log Level: 2

[oracle@mlab1 ~]$ ps -ef|grep cssd.bin
oracle   17797     1  0 Oct19 ?        00:00:11 /g01/oracle/app/oracle/product/11.2.0/grid/bin/ocssd.bin 
oracle   29016 28865  0 21:47 pts/1    00:00:00 grep cssd.bin

[oracle@mlab1 ~]$ kill -9 17797

[oracle@mlab1 ~]$ ps -ef|grep cssd.bin
oracle   29128     1  0 21:48 ?        00:00:00 /g01/oracle/app/oracle/product/11.2.0/grid/bin/ocssd.bin 
oracle   29144 28865  0 21:49 pts/1    00:00:00 grep cssd.bin

[oracle@mlab1 ~]$ uptime 
 21:49:13 up 28 days, 22:24,  3 users,  load average: 0.16, 0.06, 0.01

tail -f ocssd.log 

2012-10-21 09:45:06.853: [    CSSD][1105594688]clssgmClientConnectMsg: properties of cmProc 0x7f270c1617e0 - 1,2,3,4,5
2012-10-21 09:45:06.853: [    CSSD][1105594688]clssgmClientConnectMsg: Connect from con(0x20b8) proc(0x7f270c1617e0) pid(28935) version 11:2:1:4, properties: 1,2,3,4,5
2012-10-21 09:45:06.853: [    CSSD][1105594688]clssgmClientConnectMsg: msg flags 0x0000
2012-10-21 09:45:06.856: [    CSSD][1105594688]clssgmDeadProc: proc 0x7f270c1617e0
2012-10-21 09:45:06.856: [    CSSD][1105594688]clssgmDestroyProc: cleaning up proc(0x7f270c1617e0) con(0x20b8) skgpid  ospid 28935 with 0 clients, refcount 0
2012-10-21 09:45:06.856: [    CSSD][1105594688]clssgmDiscEndpcl: gipcDestroy 0x20b8
2012-10-21 09:48:57.641: [    CSSD][2632525536]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2012-10-21 09:48:57.641: [    CSSD][2632525536]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2012-10-21 09:48:57.642: [    CSSD][2632525536]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2012-10-21 09:48:57.642: [    CSSD][2632525536]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = 2, TraceLevel = 0
2012-10-21 09:48:57.642: [    CSSD][2632525536]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2012-10-21 09:48:57.642: [    CSSD][2632525536]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0
2012-10-21 09:48:57.642: [    CSSD][2632525536]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
[    CSSD][2632525536]clsugetconf : Configuration type [3].
2012-10-21 09:48:57.642: [    CSSD][2632525536]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (local-only) mode with uniqueness value 1350827337
2012-10-21 09:48:57.642: [    CSSD][2632525536]clssscmain: Environment is production
2012-10-21 09:48:57.642: [    CSSD][2632525536]clssscmain: Core file size limit extended
2012-10-21 09:48:57.654: [    CSSD][2632525536]clssscmain: GIPCHA down 0
2012-10-21 09:48:57.655: [    CSSD][2632525536]clssscGetParameterOLR: OLR fetch for parameter logsize (8) failed with rc 21
2012-10-21 09:48:57.655: [    CSSD][2632525536]clssscExtendLimits: The current soft limit for file descriptors is 65536, hard limit is 65536
2012-10-21 09:48:57.655: [    CSSD][2632525536]clssscExtendLimits: The current soft limit for locked memory is 3955359744, hard limit is 3955359744
2012-10-21 09:48:57.655: [    CSSD][2632525536]clssscmain: Running as user oracle
2012-10-21 09:48:57.656: [    CSSD][2632525536]clssscmain: RT queue setting is at default value
2012-10-21 09:48:57.657: [    CSSD][2632525536]clssscGetParameterOLR: OLR fetch for parameter auth rep (9) failed with rc 21
2012-10-21 09:48:57.657: [    CSSD][2632525536]clssgmInitCMInfoMin: clsmonJoined set via localonly
[  clsdmt][1097894208]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=mlab1DBG_CSSD))
2012-10-21 09:48:57.658: [  clsdmt][1097894208]PID for the Process [29128], connkey 4
2012-10-21 09:48:57.658: [    CSSD][2632525536]clssscGetParameterOLR: OLR fetch for parameter diagwait (14) failed with rc 21
2012-10-21 09:48:57.662: [    CSSD][2632525536]clssnmInitNMInfoMin: Initializing first-reconfig to (0)
2012-10-21 09:48:57.662: [    CSSD][2632525536]clssscmain: initgminfo done
2012-10-21 09:48:57.662: [    CSSD][1082157376]clssgmclientlsnr: Spawned
2012-10-21 09:48:57.662: [    CSSD][1082157376]clssgmEvtInformation: reqtype (13) cmProc ((nil)) client ((nil))
2012-10-21 09:48:57.662: [    CSSD][1082157376]clssgmEvtInformation: reqtype (13) req (0x12e9900)
2012-10-21 09:48:57.663: [    CSSD][1082157376]clssgmclientlsnr: listening on clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_mlab1_)(GIPCID=00000000-00000000-29128))
2012-10-21 09:48:57.665: [    CSSD][2632525536]clssscmain:read clusterguid 5f0de5b55b586f17bfc26fd1c7c638a0 from OLR
2012-10-21 09:48:57.665: [    CSSD][2632525536]clssscmain: Cluster GUID is 5f0de5b55b586f17bfc26fd1c7c638a0
2012-10-21 09:48:57.665: [    CSSD][2632525536]clssnmNotifyReq: type (12)
2012-10-21 09:48:57.665: [    CSSD][2632525536]clssscmain: Skipping voting device init for local_only
2012-10-21 09:48:57.665: [    CSSD][2632525536]clssnmInitNodeDB: Initializing with OCR id 0
2012-10-21 09:48:58.612: [    CSSD][1082157376]clssscSelect: cookie accept request 0x7fe79802a2d0
2012-10-21 09:48:58.612: [    CSSD][1082157376]clssgmAllocProc: (0x139f4f0) allocated
2012-10-21 09:48:58.612: [    CSSD][1082157376]clssgmClientConnectMsg: properties of cmProc 0x139f4f0 - 1,2,3,4,5
2012-10-21 09:48:58.613: [    CSSD][1082157376]clssgmClientConnectMsg: Connect from con(0xd2) proc(0x139f4f0) pid(29114) version 11:2:1:4, properties: 1,2,3,4,5
2012-10-21 09:48:58.613: [    CSSD][1082157376]clssgmClientConnectMsg: The CSSD agent is process (0x139f4f0), number 1
2012-10-21 09:48:58.613: [    CSSD][1082157376]clssgmEvtInformation: reqtype (11) cmProc (0x139f4f0) client ((nil))
2012-10-21 09:48:58.613: [    CSSD][1082157376]clssgmEvtInformation: reqtype (11) req (0x139a730)
2012-10-21 09:48:58.613: [    CSSD][1082157376]clssnmQueueNotification: type (11) 0x139a730
2012-10-21 09:48:58.665: [    CSSD][1078692160]clssnm_skgxnmon: Compatible vendor clusterware not in use
2012-10-21 09:48:58.665: [    CSSD][2632525536]clssnmNotifyReq: type (20)
2012-10-21 09:48:58.665: [    CSSD][2632525536][INFO]clssnmInitNodeDB: local only, no IPMI allowed
2012-10-21 09:48:58.665: [    CSSD][1078692160]clssgmDeathChkThread: Spawned
2012-10-21 09:48:58.666: [    CSSD][2632525536]clssnmNotifyReq: type (11)
2012-10-21 09:48:58.666: [    CSSD][2632525536]clssnmNotifyReq: found 1 for type (11) 0x139a730
2012-10-21 09:48:58.666: [    CSSD][2632525536]clssnmCompleteGMReq: Completed request type 11 with status 1
2012-10-21 09:48:58.666: [    CSSD][2632525536]clssgmDoneQEle: re-queueing req 0x12e5390 status 1
2012-10-21 09:48:58.666: [    CSSD][2632525536]clssnmNotifyReq: type (13)
2012-10-21 09:48:58.666: [    CSSD][2632525536]clssnmNotifyReq: found 1 for type (13) 0x12e9900
2012-10-21 09:48:58.666: [    CSSD][2632525536]clssnmCompleteGMReq: Completed request type 13 with status 1
2012-10-21 09:48:58.666: [    CSSD][2632525536]clssgmDoneQEle: re-queueing req 0x12cdb80 status 1
2012-10-21 09:48:58.666: [    CSSD][1083734336]clssgmPeerListener: Spawned for node
2012-10-21 09:48:58.667: [    CSSD][1083734336]clssgmPeerListener: physical hostname mlab1 privname mlab1
2012-10-21 09:48:58.667: [    CSSD][1083734336]clssgmPeerListener: gipc addr gipc://mlab1:gm_
2012-10-21 09:48:58.667: [    CSSD][1083734336]clssgmPeerListener: gipc addr gipcha://mlab1:gm2_
2012-10-21 09:48:58.667: [    CSSD][1082157376]clssgmclientOpenEndp: listening on clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_localhost_1)(GIPCID=00000000-00000000-29128))
2012-10-21 09:48:58.667: [    CSSD][1100257600]clssnmPollingThread: Spawned, poll interval 1000
2012-10-21 09:48:58.667: [    CSSD][1107290432]clssnmRcfgMgrThread: Spawned
2012-10-21 09:48:58.667: [    CSSD][1108867392]clssnmClusterListener: Spawned
2012-10-21 09:48:58.667: [    CSSD][1108867392]clssnmOpenEndp: Not opening endp for localonly mode
2012-10-21 09:48:58.668: [    CSSD][1082157376]clssgmclientOpenEndp: listening on clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_mlab1_localhost)(GIPCID=00000000-00000000-29128))
2012-10-21 09:48:58.668: [    CSSD][1082157376]clssgmCheckReqNMCompletion: Completing request type 11 for proc (0x139f4f0), operation status 1, client status 0
2012-10-21 09:48:58.668: [    CSSD][1091864896]clssnmSendingThread: Spawned
2012-10-21 09:48:58.669: [    CSSD][1082157376]clssgmEvtInformation: reqtype (22) cmProc (0x139f4f0) client ((nil))
2012-10-21 09:48:58.669: [    CSSD][1082157376]clssgmEvtInformation: reqtype (22) req (0x7fe798207480)
2012-10-21 09:48:58.669: [    CSSD][1082157376]clssnmQueueNotification: type (22) 0x7fe798207480
2012-10-21 09:48:59.667: [    CSSD][1083734336]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2012-10-21 09:49:00.667: [    CSSD][1083734336]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2012-10-21 09:49:01.667: [    CSSD][1083734336]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2012-10-21 09:49:02.667: [    CSSD][1083734336]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2012-10-21 09:49:03.667: [    CSSD][1083734336]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2012-10-21 09:49:04.668: [    CSSD][1083734336]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2012-10-21 09:49:05.668: [    CSSD][1083734336]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2012-10-21 09:49:06.168: [    CSSD][1107290432]clssnmRcfgMgrThread: Local Join
2012-10-21 09:49:06.168: [    CSSD][1107290432]clssnmLocalJoinEvent: begin on node(1), waittime 193000
2012-10-21 09:49:06.168: [    CSSD][1107290432]clssnmLocalJoinEvent: scanning 2 nodes
2012-10-21 09:49:06.168: [    CSSD][1107290432]clssnmLocalJoinEvent: Starting initial cluster reconfig
2012-10-21 09:49:06.168: [    CSSD][1107290432]clssnmDoSyncUpdate: Initiating sync 0
2012-10-21 09:49:06.168: [    CSSD][1107290432]clssscCompareSwapEventValue: changed NMReconfigInProgress  val 1, from -1, changes 1
2012-10-21 09:49:06.168: [    CSSD][1107290432]clssnmDoSyncUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000
2012-10-21 09:49:06.168: [    CSSD][1107290432]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2012-10-21 09:49:06.170: [    CSSD][1107290432]clssnmSetFirstIncarn: got incarnation 243778226 from OLR
2012-10-21 09:49:06.174: [    CSSD][1107290432]clssnmSetFirstIncarn: Incarnation set to 243778227
2012-10-21 09:49:06.174: [    CSSD][1107290432]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 243778227
2012-10-21 09:49:06.174: [    CSSD][1107290432]clssnmSetupAckWait: Ack message type (11)
2012-10-21 09:49:06.174: [    CSSD][1107290432]clssnmSetupAckWait: node(1) is ALIVE
2012-10-21 09:49:06.174: [    CSSD][1107290432]clssnmSendSync: syncSeqNo(243778227), indicating EXADATA fence initialization incomplete
2012-10-21 09:49:06.174: [    CSSD][1107290432]List of nodes that have ACKed my sync: NULL
2012-10-21 09:49:06.174: [    CSSD][1107290432]clssnmSendSync: syncSeqNo(243778227)
2012-10-21 09:49:06.174: [    CSSD][1107290432]clssnmWaitForAcks: Ack message type(11), ackCount(1)
2012-10-21 09:49:06.174: [    CSSD][1108867392]clssscUpdateEventValue: NMReconfigInProgress  val 1, changes 2
2012-10-21 09:49:06.174: [    CSSD][1108867392]clssnmHandleSync: local disk timeout set to 200000 ms, remote disk timeout set to 200000
2012-10-21 09:49:06.174: [    CSSD][1108867392]clssnmHandleSync: initleader 1 newleader 1
2012-10-21 09:49:06.174: [    CSSD][1108867392]clssnmQueueClientEvent:  Sending Event(2), type 2, incarn 0
2012-10-21 09:49:06.174: [    CSSD][1108867392]clssnmQueueClientEvent: Node[1] state = 1, birth = 0, unique = 1350827337
2012-10-21 09:49:06.174: [    CSSD][1108867392]clssnmHandleSync: Acknowledging sync: src[1] srcName[mlab1] seq[1] sync[243778227]
2012-10-21 09:49:06.174: [    CSSD][1108867392]clssnmSendAck: node 1, mlab1, syncSeqNo(243778227) type(11)
2012-10-21 09:49:06.174: [    CSSD][2632525536]NMEVENT_SUSPEND [00][00][00][00]
2012-10-21 09:49:06.175: [    CSSD][2632525536]clssgmUpdateEventValue: CmInfo State  val 5, changes 1
2012-10-21 09:49:06.175: [    CSSD][2632525536]clssgmSuspendAllGrocks: Issue SUSPEND
2012-10-21 09:49:06.175: [    CSSD][2632525536]clssgmSuspendAllGrocks: done
2012-10-21 09:49:06.175: [    CSSD][1108867392]clssnmHandleAck: src[1] dest[1] dom[0] seq[0] sync[243778227] type[11] ackCount(0)
2012-10-21 09:49:06.175: [    CSSD][2632525536]clssgmUpdateEventValue: CmInfo State  val 2, changes 2
2012-10-21 09:49:06.175: [    CSSD][2632525536]clssgmUpdateEventValue: ConnectedNodes  val 0, changes 1
2012-10-21 09:49:06.175: [    CSSD][2632525536]clssgmCleanupNodeContexts():  cleaning up nodes, rcfg(0)
2012-10-21 09:49:06.175: [    CSSD][2632525536]clssgmCleanupNodeContexts():  successful cleanup of nodes rcfg(0)
2012-10-21 09:49:06.175: [    CSSD][2632525536]clssgmStartNMMon:  completed node cleanup
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmSendSync: syncSeqNo(243778227), indicating EXADATA fence initialization incomplete
2012-10-21 09:49:06.175: [    CSSD][1107290432]List of nodes that have ACKed my sync: 1
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmWaitForAcks: done, syncseq(243778227), msg type(11)
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmSetMinMaxVersion:node1  product/protocol (11.2/1.4)
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmSetMinMaxVersion: properties common to all nodes: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmSetMinMaxVersion: min product/protocol (11.2/1.4)
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmSetMinMaxVersion: max product/protocol (11.2/1.4)
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmNeedConfReq: No configuration to change
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmDoSyncUpdate: node(1) is transitioning from joining state to active state
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmCheckDskInfo: Checking disk info...
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmRemove: Start
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmWaitOnEvictions: Start
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmBldSendUpdate: syncSeqNo(243778227)
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmBldSendUpdate: using msg version 4
2012-10-21 09:49:06.175: [    CSSD][1107290432]clssnmDoSyncUpdate: Sync 243778227 complete!
2012-10-21 09:49:06.175: [    CSSD][1108867392]clssnmHandleUpdate: sync[243778227] src[1], msgvers 4 icin 243778227
2012-10-21 09:49:06.175: [    CSSD][1108867392]clssnmHandleUpdate: common properties are 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
2012-10-21 09:49:06.175: [    CSSD][1108867392]clssscSAGEInitFencing: kgzf fence initialization starting with GUID 5f0de5b55b586f17bfc26fd1c7c638a0, icin 243778227, node number 1, uniqueness 243778227
2012-10-21 09:49:06.176: [ default][1108867392]CELL communication is configured to use 0 interface(s):

2012-10-21 09:49:06.176: [ default][1108867392]Kgzf_ini_begin: diskmon is disabled

2012-10-21 09:49:06.176: [    CSSD][1108867392]clssscSAGEInitFencing: kgzf fence initialization successfully started
2012-10-21 09:49:06.176: [    CSSD][1108867392]clssnmUpdateNodeState: node mlab1, number 1, current state 2, proposed state 3, current unique 1350827337, proposed unique 1350827337, prevConuni 0, birth 243778227
2012-10-21 09:49:06.176: [    CSSD][1108867392]clssnmSendAck: node 1, mlab1, syncSeqNo(243778227) type(15)
2012-10-21 09:49:06.176: [    CSSD][1108867392]clssnmQueueClientEvent:  Sending Event(1), type 1, incarn 243778227
2012-10-21 09:49:06.176: [    CSSD][1108867392]clssnmQueueClientEvent: Node[1] state = 3, birth = 243778227, unique = 1350827337
2012-10-21 09:49:06.176: [    CSSD][1108867392]clssnmHandleUpdate: SYNC(243778227) from node(1) completed
2012-10-21 09:49:06.177: [    CSSD][2632525536]clssgmStartNMMon: node 1 active, birth 243778227
2012-10-21 09:49:06.177: [    CSSD][1108867392]clssnmHandleUpdate: NODE 1 (mlab1) IS ACTIVE MEMBER OF CLUSTER
2012-10-21 09:49:06.177: [    CSSD][2632525536]clssgmUpdateEventValue: Reconfig Event  val 1, changes 1
2012-10-21 09:49:06.177: [    CSSD][1108867392]clssscUpdateEventValue: NMReconfigInProgress  val -1, changes 3
2012-10-21 09:49:06.177: [    CSSD][1108867392]clssnmHandleUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000
2012-10-21 09:49:06.177: [    CSSD][2632525536]clssgmUpdateEventValue: CmInfo State  val 3, changes 3
2012-10-21 09:49:06.177: [    CSSD][1083734336]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 3 waited 510
2012-10-21 09:49:06.177: [    CSSD][1083734336]clssgmUpdateEventValue: HoldRequest  val 1, changes 1
2012-10-21 09:49:06.177: [    CSSD][1085311296]clssgmReconfigThread:  started for reconfig (243778227)
2012-10-21 09:49:06.177: [    CSSD][1085311296]NMEVENT_RECONFIG [00][00][00][02]

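As the log shows, in a standalone environment a crashed or killed ocssd.bin does not reboot the node. Because this is not a cluster, a reboot is genuinely unnecessary. Only the ocssd.bin process is restarted (new PID 29128, while uptime still reports 28 days), and it comes back up in local-only mode.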

In a RAC cluster environment, however, the behavior is different:

[root@vrh1 ~]# crsctl set log css CSSD:2
Set CSSD Module: CSSD  Log Level: 2

[root@vrh1 ~]# 
[root@vrh1 ~]# ps -ef|grep ocssd.bin
grid      3929     1  1 Oct19 ?        00:57:01 /g01/11.2.0/grid/bin/ocssd.bin 
root     27297 26827  0 10:00 pts/0    00:00:00 grep ocssd.bin

[root@vrh1 ~]# kill -19 3929

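Signal 19 is SIGSTOP on Linux x86. Unlike kill -9, it does not terminate ocssd.bin. It freezes the process, which still exists but stops doing any work.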
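This article uses two kill commands, and you can reproduce the difference between them on any Linux host. After SIGKILL (kill -9) the process is gone, which is the case the standalone test above recovered from by restarting ocssd.bin. After SIGSTOP (kill -19) the process still exists but is frozen. A parent that waits only for exit does not see it as dead.

The minimal Python sketch below illustrates this on a POSIX system. It forks a dummy child that stands in for a daemon; it is not ocssd.bin. It then reports what `os.waitpid` observes after each signal. It uses `signal.SIGSTOP` instead of the literal 19 because signal numbers differ between architectures.

```python
import os
import signal
import time

def child_state_after(sig):
    """Fork a dummy busy child, send it `sig`, and report what waitpid observes."""
    pid = os.fork()
    if pid == 0:
        # Child: stand-in for a daemon that keeps running (e.g. sending heartbeats).
        while True:
            time.sleep(0.1)
    os.kill(pid, sig)
    # WUNTRACED makes waitpid also return for stopped children, not only dead ones.
    _, status = os.waitpid(pid, os.WUNTRACED)
    if os.WIFSTOPPED(status):
        state = "stopped by " + signal.Signals(os.WSTOPSIG(status)).name
        os.kill(pid, signal.SIGKILL)  # SIGKILL also reaches stopped processes; clean up
        os.waitpid(pid, 0)
    elif os.WIFSIGNALED(status):
        state = "killed by " + signal.Signals(os.WTERMSIG(status)).name
    else:
        state = "exited"
    return state

if __name__ == "__main__":
    print(child_state_after(signal.SIGSTOP))  # stopped by SIGSTOP
    print(child_state_after(signal.SIGKILL))  # killed by SIGKILL
```

Without `os.WUNTRACED`, the first `waitpid` would block indefinitely on the stopped child. That is the "alive but silent" state the cluster has to detect through missed heartbeats instead.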

ocssd.log on node vrh1, whose ocssd.bin received kill -19:

2012-10-21 10:01:31.068: [    CSSD][1086265664]clssnmvDiskPing: Writing with status 0x3, timestamp 199229204/1350828091
2012-10-21 10:01:31.068: [    CSSD][1103239488]clssnmvDiskPing: Writing with status 0x3, timestamp 199229204/1350828091
2012-10-21 10:01:31.068: [    CSSD][1092921664]clssnmvDiskPing: Writing with status 0x3, timestamp 199229204/1350828091
2012-10-21 10:01:31.195: [    CSSD][1096075584]clssnmvDiskPing: Writing with status 0x3, timestamp 199229324/1350828091
2012-10-21 10:01:31.196: [    CSSD][1111820608]clssnmvDiskPing: Writing with status 0x3, timestamp 199229324/1350828091
2012-10-21 10:01:31.220: [    CSSD][1113397568]clssnmvDiskKillCheck: not evicted, file /dev/asm-diskh flags 0x00000000, kill block unique 0, my unique 1350627944
2012-10-21 10:01:31.220: [    CSSD][1099512128]clssnmvDiskKillCheck: not evicted, file /dev/asm-diskg flags 0x00000000, kill block unique 0, my unique 1350627944
2012-10-21 10:01:31.220: [    CSSD][1077381440]clssnmvDiskKillCheck: not evicted, file /dev/asm-diski flags 0x00000000, kill block unique 0, my unique 1350627944

Because ocssd.bin is frozen, nothing is written to this log after 10:01:31.220.
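From outside the process, one rough way to spot this kind of stall is to compare the newest timestamp in ocssd.log with the current time. The minimal Python sketch below makes two assumptions. It expects the `YYYY-MM-DD HH:MM:SS.mmm:` prefix seen in the excerpts in this article. It skips lines without that prefix, such as the untimestamped `clsugetconf` line earlier. The 5-second threshold is an arbitrary illustration, not an Oracle setting. How often a healthy ocssd.log is written also depends on the log level.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp prefix, fitted to the ocssd.log excerpts in this article.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): ")

def last_timestamp(lines):
    """Return the datetime of the last timestamped line, or None if there is none."""
    last = None
    for line in lines:
        m = TS_RE.match(line)
        if m:
            last = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return last

def is_stalled(lines, now, threshold=timedelta(seconds=5)):
    """True if no timestamped line is newer than `threshold` at time `now`."""
    last = last_timestamp(lines)
    return last is None or now - last > threshold
```

For vrh1, the last timestamped line is 10:01:31.220. A check at 10:01:45, when vrh2 reports 50% heartbeat fatal, would therefore flag the log as stalled.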

ocssd.log on the other node, vrh2:

2012-10-21 10:01:45.623: [    CSSD][1124575552]clssnmPollingThread: node vrh1 (1) at 50% heartbeat fatal, removal in 14.940 seconds
2012-10-21 10:01:45.623: [    CSSD][1124575552]clssnmPollingThread: node vrh1 (1) is impending reconfig, flag 2491404, misstime 15060
2012-10-21 10:01:45.625: [    CSSD][1085704512]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 239944581, wrtcnt, 9550714, LATS 344798954, lastSeqNo 7825639, uniqueness 1350627944, timestamp 1350828091/199229204
2012-10-21 10:01:45.625: [    CSSD][1089829184]clssnmvDHBValidateNCopy: node 1, vrh1, has a disk HB, but no network HB, DHB has rcfg 239944581, wrtcnt, 9550715, LATS 344798954, lastSeqNo 6103632, uniqueness 1350627944, timestamp 1350828091/199229324
2012-10-21 10:01:45.684: [    CSSD][1107228992]clssnmvDiskKillCheck: not evicted, file /dev/asm-diskk flags 0x00000000, kill block unique 0, my unique 1350482491
2012-10-21 10:01:46.105: [    CSSD][1111959872]clssnmvDiskKillCheck: not evicted, file /dev/asm-diskh flags 0x00000000, kill block unique 0, my unique 1350482491
2012-10-21 10:01:46.105: [    CSSD][1118267712]clssnmvDiskKillCheck: not evicted, file /dev/asm-diskj flags 0x00000000, kill block unique 0, my unique 1350482491
2012-10-21 10:01:46.105: [    CSSD][1108805952]clssnmvDiskKillCheck: not evicted, file /dev/asm-diski flags 0x00000000, kill block unique 0, my unique 1350482491
2012-10-21 10:01:46.105: [    CSSD][1116690752]clssnmvDiskKillCheck: not evicted, file /dev/asm-diskg flags 0x00000000, kill block unique 0, my unique 1350482491
2012-10-21 10:01:46.587: [    CSSD][1126152512]clssnmSendingThread: sending status msg to all nodes
2012-10-21 10:01:46.587: [    CSSD][1126152512]clssnmSendingThread: sent 8 status msgs to all nodes

2012-10-21 10:01:53.627: [    CSSD][1124575552]clssnmPollingThread: node vrh1 (1) at 75% heartbeat fatal, removal in 6.930 seconds

2012-10-21 10:01:55.102: [    CSSD][1126152512]clssnmSendingThread: sending status msg to all nodes
2012-10-21 10:01:55.102: [    CSSD][1126152512]clssnmSendingThread: sent 8 status msgs to all nodes

2012-10-21 10:01:57.628: [    CSSD][1124575552]clssnmPollingThread: node vrh1 (1) at 90% heartbeat fatal, removal in 2.930 seconds, seedhbimpd 1

2012-10-21 10:01:59.608: [    CSSD][1126152512]clssnmSendingThread: sending status msg to all nodes
2012-10-21 10:01:59.608: [    CSSD][1126152512]clssnmSendingThread: sent 9 status msgs to all nodes

2012-10-21 10:02:00.560: [    CSSD][1124575552]clssnmPollingThread: Removal started for node vrh1 (1), flags 0x26040c, state 3, wt4c 0
2012-10-21 10:02:00.560: [    CSSD][1124575552]clssnmMarkNodeForRemoval: node 1, vrh1 marked for removal
2012-10-21 10:02:00.560: [    CSSD][1124575552]clssnmDiscHelper: vrh1, node(1) connection failed, endp (0x98f947), probe((nil)), ninf->endp 0x98f947
2012-10-21 10:02:00.560: [    CSSD][1124575552]clssnmDiscHelper: node 1 clean up, endp (0x98f947), init state 5, cur state 5
2012-10-21 10:02:00.560: [GIPCXCPT][1124575552] gipcInternalDissociate: obj 0x7f3bc80d8020 [000000000098f947] { gipcEndpoint : localAddr 'gipcha://vrh2:nm2_vrh-cluster/9dc0-9546-c12b-e74', remoteAddr 'gipcha://vrh1:2cf2-b3ca-7399-111', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2012-10-21 10:02:00.560: [GIPCXCPT][1124575552] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 0x7f3bc80d8020 [000000000098f947] { gipcEndpoint : localAddr 'gipcha://vrh2:nm2_vrh-cluster/9dc0-9546-c12b-e74', remoteAddr 'gipcha://vrh1:2cf2-b3ca-7399-111', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 }, flags 0x0
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmRcfgMgrThread: Reconfig in progress...
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmRcfgMgrThread: sync leader(1) failed, misstime(30000) unique(1350627944)
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssscCompareSwapEventValue: changed NMReconfigInProgress  val -1, from 1, changes 18
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmDoSyncUpdate: Initiating sync 239944581
2012-10-21 10:02:00.560: [    CSSD][1129306432]clssnmDiscEndp: gipcDestroy 0x98f947
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssscCompareSwapEventValue: changed NMReconfigInProgress  val 2, from -1, changes 19
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 239944581
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmSetupAckWait: Ack message type (11)
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmSetupAckWait: node(2) is ALIVE
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmSendSync: syncSeqNo(239944581), indicating EXADATA fence initialization complete
2012-10-21 10:02:00.560: [    CSSD][1127729472]List of nodes that have ACKed my sync: NULL
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmSendSync: syncSeqNo(239944581)
2012-10-21 10:02:00.560: [    CSSD][1127729472]clssnmWaitForAcks: Ack message type(11), ackCount(1)
2012-10-21 10:02:00.560: [    CSSD][1129306432]clssnmHandleSync: Node vrh2, number 2, is EXADATA fence capable
2012-10-21 10:02:00.561: [    CSSD][1129306432]clssscUpdateEventValue: NMReconfigInProgress  val 2, changes 20
2012-10-21 10:02:00.561: [    CSSD][1129306432]clssnmHandleSync: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2012-10-21 10:02:00.561: [    CSSD][1129306432]clssnmHandleSync: initleader 2 newleader 2
2012-10-21 10:02:00.561: [    CSSD][1129306432]clssnmQueueClientEvent:  Sending Event(2), type 2, incarn 239944580
2012-10-21 10:02:00.561: [    CSSD][1129306432]clssnmQueueClientEvent: Node[1] state = 5, birth = 239944580, unique = 1350627944
2012-10-21 10:02:00.561: [    CSSD][1129306432]clssnmQueueClientEvent: Node[2] state = 3, birth = 239944577, unique = 1350482491
2012-10-21 10:02:00.561: [    CSSD][1129306432]clssnmHandleSync: Acknowledging sync: src[2] srcName[vrh2] seq[16] sync[239944581]
2012-10-21 10:02:00.561: [    CSSD][1129306432]clssnmSendAck: node 2, vrh2, syncSeqNo(239944581) type(11)
2012-10-21 10:02:00.561: [    CSSD][1129306432]clssnmHandleAck: src[2] dest[2] dom[0] seq[0] sync[239944581] type[11] ackCount(0)
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmStartNMMon: node 1 active, birth 239944580
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmStartNMMon: node 2 active, birth 239944577
2012-10-21 10:02:00.561: [    CSSD][3611592416]NMEVENT_SUSPEND [00][00][00][06]
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmUpdateEventValue: CmInfo State  val 5, changes 50
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmSuspendAllGrocks: Issue SUSPEND
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmSendSync: syncSeqNo(239944581), indicating EXADATA fence initialization complete
2012-10-21 10:02:00.561: [    CSSD][1127729472]List of nodes that have ACKed my sync: 2
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(IG+ASMSYS$USERS) count(2) master(2) event(2), incarn 4, mbrc 2, to member 2, events 0x0, state 0x0
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmWaitForAcks: done, syncseq(239944581), msg type(11)
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmSetMinMaxVersion:node2  product/protocol (11.2/1.4)
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmSetMinMaxVersion: properties common to all nodes: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(crs_version) count(2) master(2) event(2), incarn 6, mbrc 2, to member 2, events 0x0, state 0x0
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmSetMinMaxVersion: min product/protocol (11.2/1.4)
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmSetMinMaxVersion: max product/protocol (11.2/1.4)
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmNeedConfReq: No configuration to change
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(crs_version) count(2) master(2) event(2), incarn 6, mbrc 2, to member 0, events 0x20, state 0x0
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmDoSyncUpdate: Terminating node 1, vrh1, misstime(30000) state(5)
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmCheckDskInfo: Checking disk info...
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmRemove: Start
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(2) event(2), incarn 11, mbrc 3, to member 2, events 0x38, state 0x0
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmrRemoveNode: Removing node 1, vrh1, from the cluster in incarnation 239944581, node birth incarnation 239944580, death incarnation 239944581, stateflags 0x260000 uniqueness value 1350627944
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(CLSN.ONSPROC.MASTER) count(1) master(2) event(2), incarn 1, mbrc 1, to member 2, events 0xa0, state 0x0
2012-10-21 10:02:00.561: [ default][1127729472]kgzf_gen_node_reid2: generated reid cid=58a8249042c37f94bf844767ea0ae255,icin=239944576,nmn=1,lnid=239944580,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100

2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmrFenceSage: Fenced node vrh1, number 1, with EXADATA, handle 0
2012-10-21 10:02:00.561: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(DB+ASM) count(2) master(1) event(2), incarn 4, mbrc 2, to member 1, events 0x68, state 0x0
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmSendShutdown: req to node 1, kill time 344813894
2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmsendmsg: not connected to node 1

2012-10-21 10:02:00.562: [    CSSD][1127729472]clssnmSendShutdown: Send to node 1 failed
2012-10-21 10:02:00.562: [    CSSD][1127729472]clssnmWaitOnEvictions: Start
2012-10-21 10:02:00.562: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(DG+ASM) count(2) master(1) event(2), incarn 4, mbrc 2, to member 1, events 0x0, state 0x0
2012-10-21 10:02:00.562: [    CSSD][1127729472]clssnmWaitOnEvictions: node 1, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1350828091, 199229324, 399572), seedhbimpd TRUE
2012-10-21 10:02:00.562: [    CSSD][1127729472]clssnmCheckKillStatus: Node 1, vrh1, down, LATS(344798954),timeout(14940)
2012-10-21 10:02:00.562: [    CSSD][1127729472]clssnmBldSendUpdate: syncSeqNo(239944581)
2012-10-21 10:02:00.562: [    CSSD][1127729472]clssnmBldSendUpdate: using msg version 4
2012-10-21 10:02:00.562: [    CSSD][1127729472]clssnmDoSyncUpdate: Sync 239944581 complete!
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmHandleUpdate: sync[239944581] src[2], msgvers 4 icin 239944576
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmHandleUpdate: common properties are 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmUpdateNodeState: node vrh1, number 1, current state 5, proposed state 0, current unique 1350627944, proposed unique 1350627944, prevConuni 1350627944, birth 239944580
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmDeactivateNode: node 1, state 5
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmDeactivateNode: node 1 (vrh1) left cluster
2012-10-21 10:02:00.563: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(IG+ASMSYS$BACKGROUND) count(2) master(2) event(2), incarn 4, mbrc 2, to member 2, events 0x0, state 0x0
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmUpdateNodeState: node vrh2, number 2, current state 3, proposed state 3, current unique 1350482491, proposed unique 1350482491, prevConuni 0, birth 239944577
2012-10-21 10:02:00.563: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(VT+ASM) count(2) master(2) event(2), incarn 12, mbrc 2, to member 2, events 0x60, state 0x0
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmSendAck: node 2, vrh2, syncSeqNo(239944581) type(15)
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmQueueClientEvent:  Sending Event(1), type 1, incarn 239944581
2012-10-21 10:02:00.563: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(DG+ASM0) count(2) master(1) event(2), incarn 4, mbrc 2, to member 1, events 0x0, state 0x0
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmQueueClientEvent: Node[1] state = 0, birth = 0, unique = 0
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmQueueClientEvent: Node[2] state = 3, birth = 239944577, unique = 1350482491
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmHandleUpdate: SYNC(239944581) from node(2) completed
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssscUpdateEventValue: NMReconfigInProgress  val -1, changes 21
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmHandleUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000
2012-10-21 10:02:00.563: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(GR+GCR1) count(2) master(1) event(2), incarn 6, mbrc 2, to member 1, events 0x280, state 0x0
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmStartPendingConfigChange: New configuration request for CIN 0:1350628006:0
2012-10-21 10:02:00.563: [    CSSD][1129306432]  misscount          30    reboot latency      3
2012-10-21 10:02:00.563: [    CSSD][1129306432]  long I/O timeout  200    short I/O timeout  27
2012-10-21 10:02:00.563: [    CSSD][1129306432]  diagnostic wait    13  active version 11.2.0.3.0
2012-10-21 10:02:00.563: [    CSSD][1129306432]  Listing unique IDs for 5 voting files:
2012-10-21 10:02:00.563: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(DG_DATA) count(1) master(1) event(2), incarn 3, mbrc 1, to member 1, events 0x4, state 0x0
2012-10-21 10:02:00.563: [    CSSD][1129306432]    voting file 1: 85edc0e8-2d274f78-bfc58cdc-73b8c68a
2012-10-21 10:02:00.563: [    CSSD][1129306432]    voting file 2: 201ffffc-8ba44faa-bfe2efec-2aa75840
2012-10-21 10:02:00.563: [    CSSD][1129306432]    voting file 3: 6f2a25c5-89964faa-bf6980f7-c5f621ce
2012-10-21 10:02:00.563: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(ocr_vrh-cluster) count(1) master(2) event(2), incarn 3, mbrc 1, to member 2, events 0x78, state 0x0
2012-10-21 10:02:00.563: [    CSSD][1129306432]    voting file 4: 93eb3156-48454f25-bf3717df-1a2c73d5
2012-10-21 10:02:00.563: [    CSSD][1129306432]    voting file 5: 37372406-78964f88-bfbfbd31-d8b3829f
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmStartPendingConfigChange: Initiating configuration change reconfig for CIN 1350628006
2012-10-21 10:02:00.563: [    CSSD][1129306432]clssnmStartCINUpdate: Starting CIN update for type 8
2012-10-21 10:02:00.563: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(CLSFRAME) count(1) master(2) event(2), incarn 3, mbrc 1, to member 2, events 0x8, state 0x0
2012-10-21 10:02:00.563: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(CRSDMAIN) count(1) master(2) event(2), incarn 3, mbrc 1, to member 2, events 0x8, state 0x0
2012-10-21 10:02:00.563: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(EVMDMAIN) count(1) master(2) event(2), incarn 3, mbrc 1, to member 2, events 0x8, state 0x0
2012-10-21 10:02:00.564: [    CSSD][1119844672]clssnmvDiskEvict: preconuni is NULL skipping the eviction write for node 1, state 0
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(EVMDMAIN2) count(1) master(2) event(2), incarn 3, mbrc 1, to member 2, events 0x8, state 0x0
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(CTSSGROUP) count(2) master(2) event(2), incarn 4, mbrc 2, to member 2, events 0x8, state 0x0
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(DG_BACKUPDG) count(1) master(1) event(2), incarn 3, mbrc 1, to member 1, events 0x4, state 0x0
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmQueueGrockEvent: groupName(DG_SYSTEMDG) count(1) master(1) event(2), incarn 3, mbrc 1, to member 1, events 0x4, state 0x0
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmSuspendAllGrocks: done
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmUpdateEventValue: CmInfo State  val 2, changes 51
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmUpdateEventValue: ConnectedNodes  val 239944580, changes 18
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmCleanupNodeContexts():  cleaning up nodes, rcfg(239944580)
2012-10-21 10:02:00.564: [    CSSD][1096816960]clssnmvDiskEvict: preconuni is NULL skipping the eviction write for node 1, state 0
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmCleanupNodeContexts():  successful cleanup of nodes rcfg(239944580)
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmStartNMMon:  completed node cleanup
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmStartNMMon: node 1 failed, birth (239944580, 0) (old/new)
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmStartNMMon: node 2 active, birth 239944577
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmUpdateEventValue: Reconfig Event  val 1, changes 13
2012-10-21 10:02:00.564: [    CSSD][3611592416]clssgmUpdateEventValue: CmInfo State  val 3, changes 52
2012-10-21 10:02:00.564: [    CSSD][1113536832]clssnmvDiskEvict: preconuni is NULL skipping the eviction write for node 1, state 0
2012-10-21 10:02:00.564: [    CSSD][1085704512]clssnmvDiskEvict: preconuni is NULL skipping the eviction write for node 1, state 0
2012-10-21 10:02:00.565: [    CSSD][1089829184]clssnmvDiskEvict: preconuni is NULL skipping the eviction write for node 1, state 0
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmReconfigThread:  started for reconfig (239944581)
2012-10-21 10:02:00.565: [    CSSD][1130883392]NMEVENT_RECONFIG [00][00][00][04]
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmWaitOnEventValue: after HoldRequest  val 1, eval 1 waited 0
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmCompareSwapEventValue: changed CmInfo State  val 4, from 3, changes 53
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmCleanupNodeContexts():  cleaning up nodes, rcfg(239944580)
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmDisconnectNodes: Closing connection 0x98fa69 for node vrh1, number 1, in incarnation 239944581; state flags 0x80000001, conn state flags 0x000a
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmCleanupNodeContexts():  successful cleanup of nodes rcfg(239944581)
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmUpdateEventValue: ReadyPeers  val 1, changes 7
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmCompareSwapEventValue: changed CmInfo State  val 6, from 4, changes 54
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmEstablishConnections: 1 nodes in cluster incarn 239944581
2012-10-21 10:02:00.565: [    CSSD][1130883392]clssgmUpdateEventValue: ConnectedNodes  val 0, changes 19
2012-10-21 10:02:00.565: [    CSSD][1085704512]clssnmCINUpdateComplete: Pending CIN update completed, config state 1
2012-10-21 10:02:00.566: [    CSSD][1122998592]clssgmPeerListener: new incarn 239944581. old 239944580
2012-10-21 10:02:00.566: [    CSSD][1124575552]clssnmPollingThread: signaling reconfig for config change
2012-10-21 10:02:00.566: [    CSSD][1122998592]clssgmPeerDeactivate: node 1 (vrh1), death 239944581, state 0x80000000 connstate 0xa
2012-10-21 10:02:00.566: [ GIPCLIB][1122998592] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x183d130, idxPtr:0x7f3bd6f6f8c0, key:0x42ef7140, flags:0x0
2012-10-21 10:02:00.566: [GIPCXCPT][1122998592] gipcObjectLookupF [gipcDissociateF : gipc.c : 2175]: search found no matching oid 0000000000000000, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2012-10-21 10:02:00.566: [GIPCXCPT][1122998592] gipcDissociateF [clssgmPeerDeactivate : clssgmp.c : 3525]: EXCEPTION[ ret gipcretInvalidObject (3) ]  failed to dissociate obj 0000000000000000, flags 0x0
2012-10-21 10:02:00.566: [    CSSD][1122998592]clssgmCleanFuture: discarded 0 future msgs for 1
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssnmDoSyncUpdate: Initiating sync 239944582
2012-10-21 10:02:00.566: [    CSSD][1122998592]clssgmPeerListener: connects done (1/1)
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssscCompareSwapEventValue: changed NMReconfigInProgress  val 2, from -1, changes 22
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssscCompareSwapEventValue: changed NMReconfigInProgress  val 2, from -1, changes 22
2012-10-21 10:02:00.566: [    CSSD][1122998592]clssgmUpdateEventValue: ConnectedNodes  val 239944581, changes 20
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssnmDoSyncUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 239944582
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssnmSetupAckWait: Ack message type (11)
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssnmSetupAckWait: node(2) is ALIVE
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssnmSendSync: syncSeqNo(239944582), indicating EXADATA fence initialization complete
2012-10-21 10:02:00.566: [    CSSD][1127729472]List of nodes that have ACKed my sync: NULL
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssnmSendSync: syncSeqNo(239944582)
2012-10-21 10:02:00.566: [    CSSD][1127729472]clssnmWaitForAcks: Ack message type(11), ackCount(1)
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmWaitChangeEventValue: ev(ConnectedNodes) changed to 239944581
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmEstablishConnections: Sending STATUS message to all nodes for incarnation 239944581
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmEstablishConnections: (1/1) connected, incarn(239944581)
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmCompareSwapEventValue: changed CmInfo State  val 7, from 6, changes 55
2012-10-21 10:02:00.566: [    CSSD][1129306432]clssnmHandleSync: Node vrh2, number 2, is EXADATA fence capable
2012-10-21 10:02:00.566: [    CSSD][1129306432]clssscUpdateEventValue: NMReconfigInProgress  val 2, changes 23
2012-10-21 10:02:00.566: [    CSSD][1129306432]clssnmHandleSync: local disk timeout set to 200000 ms, remote disk timeout set to 200000
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmSetVersions: properties common to all peers: 1,2,3,4,5,6,7,8,9,10,11
2012-10-21 10:02:00.566: [    CSSD][1129306432]clssnmHandleSync: initleader 2 newleader 2
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmEstablishMasterNode: MASTER for 239944581 is node(2) birth(239944577)
2012-10-21 10:02:00.566: [    CSSD][1129306432]clssnmQueueClientEvent:  Sending Event(2), type 2, incarn 239944581
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmCompareSwapEventValue: changed CmInfo State  val 8, from 7, changes 56
2012-10-21 10:02:00.566: [    CSSD][1129306432]clssnmQueueClientEvent: Node[1] state = 0, birth = 0, unique = 0
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmMasterCMSync: Synchronizing group/lock status, replay-mode=0
2012-10-21 10:02:00.566: [    CSSD][1129306432]clssnmQueueClientEvent: Node[2] state = 3, birth = 239944577, unique = 1350482491
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmCompareSwapEventValue: changed CmInfo State  val 9, from 8, changes 57
2012-10-21 10:02:00.566: [    CSSD][1129306432]clssnmHandleSync: Acknowledging sync: src[2] srcName[vrh2] seq[20] sync[239944582]
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmMasterCMSync: processing grock(IG+ASMSYS$USERS) type(2)
2012-10-21 10:02:00.566: [    CSSD][1130883392]clssgmCleanupOrphanMembers: orphan member(1/IG+ASMSYS$USERS), birth(239944580) on node(1), birth(0/239944581)
2012-10-21 10:02:00.566: [    CSSD][1129306432]clssnmSendAck: node 2, vrh2, syncSeqNo(239944582) type(11)
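To pick the eviction sequence out of a noisy ocssd.log like the one above, a small filter helps. This is a minimal sketch that assumes the line layout shown above (timestamp, `[component]`, `[thread id]`, message). The marker list is only the set of message prefixes that appear in this excerpt. It is not an official list of eviction messages.

```python
import re
from typing import List, Tuple

# Layout of an ocssd.log line, e.g.
# "2012-10-21 10:02:00.561: [    CSSD][1127729472]clssnmRemove: Start"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*(?P<comp>[^\]]+?)\]\[(?P<tid>\d+)\](?P<msg>.*)$"
)

# Message prefixes taken from the eviction sequence in the excerpt above
# (an assumption for illustration, not an exhaustive list).
EVICTION_MARKERS = (
    "clssnmDoSyncUpdate: Terminating node",
    "clssnmrRemoveNode: Removing node",
    "clssnmrFenceSage: Fenced node",
    "clssnmSendShutdown:",
    "clssnmDeactivateNode:",
    "clssgmStartNMMon: node",
)


def eviction_events(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, message) pairs for CSSD lines that mark an eviction step."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("comp") != "CSSD":
            continue  # not a timestamped CSSD line
        msg = m.group("msg").strip()
        if msg.startswith(EVICTION_MARKERS):
            events.append((m.group("ts"), msg))
    return events
```

For example, `eviction_events(open("ocssd.log").read().splitlines())` lists the terminate, remove, fence, shutdown and deactivate steps in order. The file path is hypothetical.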

 

In the test above, sending SIGSTOP (kill -19) to the ocssd.bin process also caused the node to reboot.


Next we tried kill -9 on ocssd.bin, and the node rebooted in the same way:


[root@vrh1 ~]# ps -ef|grep ocssd
grid      3900     1  1 10:03 ?        00:00:37 /g01/11.2.0/grid/bin/ocssd.bin 
grid      6019  4287  0 10:39 pts/1    00:00:00 tail -f ocssd.log
root      6028  4331  0 10:39 pts/0    00:00:00 grep ocssd

[root@vrh1 ~]# kill -9 3900

2012-10-21 10:39:22.075: [    CSSD][1121757504]clssnmPollingThread: signaling reconfig for config change
2012-10-21 10:40:49.822: [    CSSD][1441830624]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2012-10-21 10:40:49.822: [    CSSD][1441830624]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2012-10-21 10:40:49.822: [    CSSD][1441830624]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2012-10-21 10:40:49.822: [    CSSD][1441830624]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = 2, TraceLevel = 0
2012-10-21 10:40:49.822: [    CSSD][1441830624]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2012-10-21 10:40:49.822: [    CSSD][1441830624]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0
2012-10-21 10:40:49.822: [    CSSD][1441830624]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
[    CSSD][1441830624]clsugetconf : Configuration type [4].
2012-10-21 10:40:49.823: [    CSSD][1441830624]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1350830449
2012-10-21 10:40:49.823: [    CSSD][1441830624]clssscmain: Environment is production
2012-10-21 10:40:49.823: [    CSSD][1441830624]clssscmain: Core file size limit extended
2012-10-21 10:40:49.835: [    CSSD][1441830624]clssscmain: GIPCHA down 0

Purpose of the RAC CSS diagwait parameter and how to set it

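The diagwait parameter delays the node reboot so that the RAC background processes can write the necessary diagnostic information to their respective logs.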
Diagwait:

Delay the node reboot for a short time to write all diagnostic messages to the logs.

Doesn’t increase the probability of data corruption

Setup steps: shut down CRS

crsctl set css diagwait 13 -force
restart CRS
Set diagwait to 13
Will set the oprocd margin to 10 seconds instead of 500ms
Will prevent unnecessary node evictions under high load
Is a general recommendation for servers under high load, not specific to Oracle VM

Please review Metalink note 559365.1 on how to set this

If diagwait > reboottime then OPROCD_DEFAULT_MARGIN := (diagwait - reboottime) * 1000
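The margin rule above turns a diagwait of 13 into the "10 seconds instead of 500ms" quoted earlier. The following is a minimal sketch of that arithmetic. It assumes reboottime is 3 seconds, which matches the "reboot latency 3" reported in the ocssd.log above. It also assumes the 500 ms default stays in effect when diagwait does not exceed reboottime, because the notes state only the diagwait > reboottime branch.

```python
# Sketch of the oprocd margin rule quoted above (inputs in seconds, result in ms).
# Assumptions: reboottime defaults to 3 s, and the 500 ms default applies
# when diagwait <= reboottime (the notes only give the other branch).
DEFAULT_MARGIN_MS = 500


def oprocd_margin_ms(diagwait_s: int, reboottime_s: int = 3) -> int:
    """Return the oprocd margin in milliseconds for a given diagwait."""
    if diagwait_s > reboottime_s:
        return (diagwait_s - reboottime_s) * 1000
    return DEFAULT_MARGIN_MS


print(oprocd_margin_ms(13))  # (13 - 3) * 1000 = 10000 ms, i.e. 10 seconds
```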

Steps to set CSS diagwait:

<1>. As root, stop the Clusterware stack:
#crsctl stop crs
#<CRS_HOME>/bin/oprocd stop

<2>. Ensure that the Clusterware stack is down on all nodes by executing:
#ps -ef | egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
This should return no processes. If there are clusterware processes running and you proceed to the next step, you will corrupt your OCR. Do not continue until the clusterware processes are down on all the nodes of the cluster.

<3>. From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the following command as root:
#crsctl set css diagwait 13 -force

<4>. Check whether diagwait was set successfully by executing the following command. It should return 13. If diagwait is not set, the following message is returned: "Configuration parameter diagwait is not defined"
#crsctl get css diagwait

<5>. Restart the Oracle Clusterware on all the nodes by executing:
#crsctl start crs

 

<6>. Validate that the node is running by executing:
#crsctl check crs
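Step <2> can also be scripted. The following is a minimal Python sketch for Linux that assumes /proc is available. It applies the same daemon names as the egrep in step <2> to each local process command line, with the dots escaped so that they match literally. It inspects only the host it runs on, so run it on every node before step <3>.

```python
import os
import re
from typing import Iterable, List

# Daemon names from the egrep in step <2>, with dots escaped.
CLUSTERWARE_RE = re.compile(r"crsd\.bin|ocssd\.bin|evmd\.bin|oprocd")


def clusterware_processes(cmdlines: Iterable[str]) -> List[str]:
    """Return the command lines that look like Clusterware daemons."""
    return [c for c in cmdlines if CLUSTERWARE_RE.search(c)]


def local_cmdlines() -> List[str]:
    """Collect command lines of local processes from /proc (Linux only)."""
    if not os.path.isdir("/proc"):
        return []
    cmdlines = []
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(os.path.join("/proc", pid, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is not readable
        if raw:  # kernel threads have an empty cmdline
            cmdlines.append(raw.replace(b"\0", b" ").decode(errors="replace").strip())
    return cmdlines


if __name__ == "__main__":
    running = clusterware_processes(local_cmdlines())
    if running:
        print("Clusterware still running, do NOT continue:")
        for cmd in running:
            print("  " + cmd)
    else:
        print("No Clusterware processes found on this node.")
```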

 
Removing the diagwait setting

crsctl unset css diagwait

Understanding Oracle RAC Brain Split Resolution

Upgrade GI/CRS 11.1.0.7 to 11.2.0.2. Rootupgrade.sh Hanging

Upgrade grid 11.1.0.7 to 11.2.0.2. Rootupgrade.sh Hanging

We installed the 11gR2 GI software and applied the PSU2 patches when prompted to run rootupgrade.sh. rootupgrade.sh hung on the first node.

[root@vrh8 client]# uname -a
Linux vrh8 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
cluvfy passed with 2 ignorable errors:

[root@vrh8 vrh8]# cd /tmp
[root@vrh8 tmp]# df -lh .
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-tmp 992M 263M 679M 28% /tmp

[root@vrh8 grid]# grep fail cluvfy_during_inst.log
/tmp l118464lwap1049 /tmp 713MB 1GB failed
Result: Free disk space check failed for “l118464lwap1049:/tmp”
/tmp vrh8 /tmp 692.131MB 1GB failed
Result: Free disk space check failed for “vrh8:/tmp”
Result: Check for multiple users with UID value 0 failed

We followed section 1A of “How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure on Linux/Unix [ID 969254.1]”, but it didn’t help.

[root@vrh8 bin]# ./crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [11.1.0.7.0]

rootupgrade.sh output:

[root@vrh8 11.2.0.2]# ./rootupgrade.sh
Running Oracle 11g root script…

The following environment variables are set as:
ORACLE_OWNER= oracrs
ORACLE_HOME= /d22/oracrs/11.2.0.2

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of “dbhome” have not changed. No need to overwrite.
The contents of “oraenv” have not changed. No need to overwrite.
The contents of “coraenv” have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /d22/oracrs/11.2.0.2/crs/install/crsconfig_params
LOCAL ADD MODE
Creating OCR keys for user ‘root’, privgrp ‘root’..
Operation successful.
OLR initialization – successful
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9312: Existing ADVM/ACFS installation detected.
ACFS-9314: Removing previous ADVM/ACFS installation.
ACFS-9315: Previous ADVM/ACFS components successfully removed.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies – this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.

**** Hung here for more than 2 hours, so we cancelled it

INT at /d22/oracrs/11.2.0.2/crs/install/crsconfig_lib.pm line 1173.
/d22/oracrs/11.2.0.2/perl/bin/perl -I/d22/oracrs/11.2.0.2/perl/lib -I/d22/oracrs/11.2.0.2/crs/install /d22/oracrs/11.2.0.2/crs/install/rootcrs.pl execution failed
Oracle root script execution aborted!

1. The logs below are required to analyze this issue.

NEW_GRID_HOME/cfgtoollogs/crsconfig/*.*
NEW_GRID_HOME/log/<nodename>/*.*

Please upload the logs under the above directories. Zip and upload the files including the subdirectories.

2. While rootupgrade.sh was hanging, did you check the usage of /tmp? Was free space being exhausted?

=== ODM Research ===

There have been multiple root script runs for the upgrade. I have taken the first incident from the file
rootcrs_vrh8.log:
-----------------------------------------

2011-02-13 13:07:55: Successfully started requested Oracle stack daemons
2011-02-13 13:07:55: Upgrading the existing voting disks!
2011-02-13 13:07:55: Executing /d22/oracrs/11.2.0.2/bin/cssvfupgd
2011-02-13 13:07:55: Executing cmd: /d22/oracrs/11.2.0.2/bin/cssvfupgd <<<<<<<<<<<<<<< The root script seems to hang at this point.
2011-02-13 15:01:16: ###### Begin DIE Stack Trace ######
2011-02-13 15:01:16: Package File Line Calling
2011-02-13 15:01:16: --------------- -------------------- ---- ----------
2011-02-13 15:01:16: 1: main rootcrs.pl 325 crsconfig_lib::dietrap
2011-02-13 15:01:16: 2: crsconfig_lib crsconfig_lib.pm 9301 main::__ANON__
2011-02-13 15:01:16: 3: crsconfig_lib crsconfig_lib.pm 9301 (eval)
2011-02-13 15:01:16: 4: crsconfig_lib crsconfig_lib.pm 9260 crsconfig_lib::system_cmd_capture1
2011-02-13 15:01:16: 5: crsconfig_lib crsconfig_lib.pm 9247 crsconfig_lib::system_cmd_capture
2011-02-13 15:01:16: 6: crsconfig_lib crsconfig_lib.pm 924 crsconfig_lib::system_cmd
2011-02-13 15:01:16: 7: oracss oracss.pm 275 crsconfig_lib::run_crs_cmd
2011-02-13 15:01:16: 8: crsconfig_lib crsconfig_lib.pm 1019 oracss::CSS_upgrade
2011-02-13 15:01:16: 9: crsconfig_lib crsconfig_lib.pm 1006 crsconfig_lib::start_cluster
2011-02-13 15:01:16: 10: main rootcrs.pl 697 crsconfig_lib::perform_start_cluster
2011-02-13 15:01:16: ####### End DIE Stack Trace #######

cssvfupgd.log:
--------------------
Oracle Database 11g Clusterware Release 11.2.0.2.0 – Production Copyright 1996, 2010 Oracle. All rights reserved.
2011-02-13 13:07:55.356: [ OCRRAW][3605955376]prgval:buffer passed is too small
2011-02-13 13:07:55.361: [CSSVFUPG][3605955376]cssvfupgd_GetVFList: found voting file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat
2011-02-13 13:07:55.365: [ OCRRAW][3605955376]prgval:buffer passed is too small
2011-02-13 13:07:55.369: [CSSVFUPG][3605955376]cssvfupgd_GetVFList: found voting file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk2.dat
2011-02-13 13:07:55.373: [ OCRRAW][3605955376]prgval:buffer passed is too small
2011-02-13 13:07:55.377: [CSSVFUPG][3605955376]cssvfupgd_GetVFList: found voting file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk3.dat
2011-02-13 13:07:55.402: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.misscount
2011-02-13 13:07:55.404: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.disktimeout
2011-02-13 13:07:55.406: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.reboottime
2011-02-13 13:07:55.408: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.diagwait
2011-02-13 13:07:55.414: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.pollinterval
2011-02-13 13:07:55.416: [CSSVFUPG][3605955376]cssvfupgd_GetGUID: Fetching GUID for /s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat
2011-02-13 13:07:55.419: [ SKGFD][3605955376]NOTE: No asm libraries found in the system

2011-02-13 13:07:55.419: [ CLSF][3605955376]Allocated CLSF context
2011-02-13 13:07:55.419: [ SKGFD][3605955376]Discovery with str:/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 13:07:55.419: [ SKGFD][3605955376]UFS discovery with :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 13:07:55.420: [ SKGFD][3605955376]Fetching UFS disk :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 13:07:55.420: [ SKGFD][3605955376]OSS discovery with :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 13:07:55.421: [ SKGFD][3605955376]Handle 0x124de360 from lib :UFS:: for disk :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 14:19:31.132: [ SKGFD][3605955376]WARNING:io_getevents timed out 2226 sec >>>>>>>>>>>>>>>>>>>> After about 70 minutes it shows a timeout error.

2011-02-13 14:19:31.132: [ SKGFD][3605955376]WARNING:io_getevents timed out 2226 sec
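Comparing timestamps shows how long the stall lasted. The following is a minimal Python sketch that assumes the `YYYY-MM-DD HH:MM:SS.mmm` timestamp layout used in these logs. It measures the gap between the last SKGFD "Handle ... for disk" line and the first io_getevents warning. That gap is roughly 4296 seconds, a little over 70 minutes.

```python
from datetime import datetime

# Timestamp layout of the Clusterware logs above, e.g. "2011-02-13 13:07:55.421".
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds elapsed between two Clusterware log timestamps."""
    t0 = datetime.strptime(start, TS_FORMAT)
    t1 = datetime.strptime(end, TS_FORMAT)
    return (t1 - t0).total_seconds()


# Last "Handle ... for disk" line vs. the first io_getevents warning in cssvfupgd.log.
gap = elapsed_seconds("2011-02-13 13:07:55.421", "2011-02-13 14:19:31.132")
print(f"{gap:.1f} s (about {gap / 60:.0f} minutes)")
```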

The script has stalled at the voting disk upgrade phase. Please provide the details below.

1. What cluster file system are you using for the voting files? Provide its details and the mount options used.

For OCFS, get its mount options:
mount | grep ocfs

2. Voting disk details:
ls -l /s01/app/ocrvot/VOTEDISK/UAT2_vdisk*

3. Get the diagwait detail:
OLD_CRS_HOME/bin/crsctl get css diagwait

1. What cluster file system are you using for the voting files? provide its details and the mount options used
/dev/emcpowera1 on /s01/app/ocrvot type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)

2. Voting disks details

[root@vrh8 11.2.0.2]# ls -l /s01/app/ocrvot/VOTEDISK/UAT2_vdisk*
-rw-r—– 1 oracrs oinstall 21004288 Jun 11 07:31 /s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat
-rw-r—– 1 oracrs oinstall 21004288 Jun 11 07:31 /s01/app/ocrvot/VOTEDISK/UAT2_vdisk2.dat
-rw-r—– 1 oracrs oinstall 21004288 Jun 11 07:31 /s01/app/ocrvot/VOTEDISK/UAT2_vdisk3.dat

 

3. Get the diagwait detail

crsctl get css diagwait
Failure 33 in main Oracle Cluster Registry context initialization: PROC-33: Oracle Cluster Registry is not configured Operating System error [No such file or directory] [2]

owc may not be required now as the issue we face is clear.

The diagwait query should not error out, as explained in the following note:
11gR2 rootupgrade.sh Fails as cssvfupgd Can not Upgrade Voting Disk (Doc ID 1102283.1)

Make sure you are running ‘crsctl get css diagwait’ from the old CRS home. You can also check it on multiple nodes. If it errors out, this has to be fixed as explained in the above note.

Following that note, running ./oprocd stop returned an error:
[root@l118464lwap1049 bin]# ./oprocd stop
Jun 16 23:24:42.966 | ERR | failed to connect to daemon, errno(111)

ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies – this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.

cssvfupgd.log
2011-02-13 23:36:49.311: [ OCRRAW][3394941744]prgval:buffer passed is too small
2011-02-13 23:36:49.315: [CSSVFUPG][3394941744]cssvfupgd_GetVFList: found voting file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk2.dat
2011-02-13 23:36:49.319: [ OCRRAW][3394941744]prgval:buffer passed is too small
2011-02-13 23:36:49.323: [CSSVFUPG][3394941744]cssvfupgd_GetVFList: found voting file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk3.dat
2011-02-13 23:36:49.351: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYSTEM.css.misscount
2011-02-13 23:36:49.354: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYSTEM.css.disktimeout
2011-02-13 23:36:49.356: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYSTEM.css.reboottime
2011-02-13 23:36:49.358: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYSTEM.css.diagwait
2011-02-13 23:36:49.367: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYSTEM.css.pollinterval
2011-02-13 23:36:49.369: [CSSVFUPG][3394941744]cssvfupgd_GetGUID: Fetching GUID for /s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat
2011-02-13 23:36:49.371: [ SKGFD][3394941744]NOTE: No asm libraries found in the system

2011-02-13 23:36:49.372: [ CLSF][3394941744]Allocated CLSF context
2011-02-13 23:36:49.372: [ SKGFD][3394941744]Discovery with str:/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]UFS discovery with :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]Fetching UFS disk :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]OSS discovery with :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]Handle 0x98c4360 from lib :UFS:: for disk :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

Question:
In your update about cssvfupgd.log you stated it was hanging there.
Is there an entry about a timeout in that log file after about 70 minutes, like:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]Handle 0x98c4360 from lib :UFS:: for disk :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:
2011-02-17 0:48:19.372: [ SKGFD][3394941744]WARNING:io_getevents timed out 4294 sec <<<< present ???

Please provide the following outputs:
rpm -qa|grep ocfs2
uname -a
cat /etc/redhat-release

[root@vrh8 ~]# rpm -qa|grep ocfs2
ocfs2console-1.4.4-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-238.5.1.el5-1.4.7-1.el5
[root@vrh8 ~]# uname -a
Linux vrh8 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@vrh8 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
[root@vrh8 ~]#

Combinations that installed successfully:

OEL5.4+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
OEL5.6+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
OEL5.6+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
RHEL5.6+OEL kernel (Red Hat compatible kernel)+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
RHEL5.6+OEL kernel (Red Hat compatible kernel)+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
RHEL5.4

Combinations that failed:
RHEL5.6 (Red Hat kernel)+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
RHEL5.6 (Red Hat kernel)+ocfs2-1.4.8-1+ocfs2-tools-1.6.3

The problem reproduces with the Red Hat kernel: RHEL 5.6 with 2.6.18-2xx kernels.
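The matrix above reduces to one failing pattern: RHEL 5.6 running a Red Hat-built 2.6.18-2xx kernel. The following Python sketch encodes that pattern as a pre-upgrade check. It assumes you pass in the contents of /etc/redhat-release and the output of `uname -r`. It also takes a flag saying whether the kernel is Red Hat's own build. You must determine that flag yourself, because the sketch does not try to tell Red Hat and OEL kernels apart from the version string. The function names are hypothetical.

```python
import re


def kernel_abi(release: str) -> int:
    """Return NNN from a '2.6.18-NNN...' kernel release string (as printed by uname -r)."""
    m = re.match(r"2\.6\.18-(\d+)", release)
    if not m:
        raise ValueError(f"not a 2.6.18 kernel release: {release!r}")
    return int(m.group(1))


def matches_failed_combination(os_release: str, kernel_release: str,
                               redhat_built_kernel: bool) -> bool:
    """True if the host matches the failing pattern in the matrix above:
    RHEL 5.6 running a Red Hat-built 2.6.18-2xx kernel."""
    is_rhel56 = "Red Hat Enterprise Linux Server release 5.6" in os_release
    return is_rhel56 and redhat_built_kernel and 200 <= kernel_abi(kernel_release) < 300
```

With the values from vrh8 ("release 5.6 (Tikanga)", 2.6.18-238.5.1.el5, Red Hat kernel), the check returns True. The production node's 2.6.18-194.8.1.el5 kernel falls outside the 2xx range.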

Please review the following Note to change the location of your voting disk
Note 428681.1
Title: How to ADD/REMOVE/REPLACE/MOVE Oracle Cluster Registry (OCR) and Voting Disk

Pasting info from:
Oracle® Clusterware Administration and Deployment Guide
11g Release 2 (11.2)

3 Managing Oracle Cluster Registry and Voting Disks
Oracle Universal Installer for Oracle Clusterware 11g release 2 (11.2) does not support the use of raw or block devices. However, if you upgrade from a previous Oracle Clusterware release, then you can continue to use raw or block devices.

[oracrs@vrh8 grid]$ grep fail cluvfy_during_inst_061711.log
/tmp l118464lwap1049 /tmp 706MB 1GB failed
Result: Free disk space check failed for “l118464lwap1049:/tmp”
/tmp vrh8 /tmp 927.1312MB 1GB failed
Result: Free disk space check failed for “vrh8:/tmp”
Result: Check for multiple users with UID value 0 failed
PRVF-5431 : Oracle Cluster Voting Disk configuration check failed

[oracrs@vrh8 grid]$ ./runcluvfy.sh stage -pre crsinst -n vrh8,l118464lwap1049 -verbose|tee cluvfy_during_inst.log

Please upload the following Cluvfy trace log:
$ORA_CRS_HOME/cv/log/cvutrace.log.0

Please download the latest CVU from OTN:
http://www.oracle.com/technetwork/database/clustering/downloads/cvu-download-homepage-099973.html

Please upload
/s02/app/crs/11.2.0.2/log/vrh8/agent/ohasd/oraagent_oracrs/oraagent_oracrs.log

In addition pls upload
/s02/app/crs/11.2.0.2/log/vrh8/agent/ohasd/oracssdagent_root/oracssdagent_root.log

Please run this command on both the new setup and your existing production setup for a quick comparison:
rpm -qa|grep ocfs2

Server with issue:
[root@vrh8 ohasd]# rpm -qa|grep ocfs2
ocfs2console-1.4.4-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-238.5.1.el5-1.4.7-1.el5

Prod:

[root@vrh9  bin]# rpm -qa|grep ocfs2
ocfs2-2.6.18-194.el5-1.4.7-1.el5
ocfs2console-1.4.4-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-194.8.1.el5-1.4.7-1.el5

[root@vrh8 ~]# uname -a
Linux vrh8 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

[root@vrh8 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)

rpm -qa|grep ocfs2
ocfs2console-1.4.4-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-238.5.1.el5-1.4.7-1.el5

@ . from Bug 11876815 (Doc ID 1321757.1)
@ combinations that install SUCCESSFUL:
@ .
@ OEL5.4+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
@ OEL5.6+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
@ OEL5.6+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
@ RHEL5.6+OEL kernel(redhat compatible kernel)+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
@ RHEL5.6+OEL kernel(redhat compatible kernel)+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
@ RHEL5.4
@ .
@ combinations that failed:
@ RHEL5.6(redhat kernel)+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
@ RHEL5.6(redhat kernel)+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
@ .
@ .
@ So it is clear that this is a Red Hat kernel problem. Because the RHEL5.6
@ 2.6.18-2xx kernels are provided by Red Hat, we cannot fix them; please use the
@ Oracle Enterprise kernel (Red Hat compatible) for installation.
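
The comparison requested above reduces to two checks per node: the ocfs2 kernel-module package has to match the running kernel (the package name embeds the kernel release, e.g. ocfs2-2.6.18-238.5.1.el5-1.4.7-1.el5), and, per Bug 11876815, RHEL 5.6 with a Red Hat 2.6.18-2xx kernel is a known-failing combination. A minimal Python sketch, assuming the output of rpm -qa, uname -r and /etc/redhat-release has already been captured as strings. Whether the running kernel is Red Hat's own build or the OEL Red Hat compatible kernel is passed in as a flag rather than detected, and the function names are hypothetical:

```python
import re

# ocfs2 kernel-module packages embed the kernel release they were built for,
# e.g. "ocfs2-2.6.18-238.5.1.el5-1.4.7-1.el5" -> kernel "2.6.18-238.5.1.el5".
_MODULE_PKG = re.compile(r"^ocfs2-(2\.6\.\d+-[\d.]+\.el5)-(\d+\.\d+\.\d+-\d+)\.el5$")


def module_kernels(rpm_lines):
    """Map kernel release -> ocfs2 module version for each ocfs2 module package."""
    found = {}
    for line in rpm_lines:
        m = _MODULE_PKG.match(line.strip())
        if m:
            found[m.group(1)] = m.group(2)
    return found


def review(rpm_lines, uname_r, release_text, redhat_kernel):
    """Return a list of findings for one node (empty list: nothing flagged).

    redhat_kernel must be supplied by the caller: True for Red Hat's own kernel,
    False for the OEL Red Hat compatible kernel. This sketch does not derive it.
    """
    findings = []
    if uname_r not in module_kernels(rpm_lines):
        findings.append(f"no ocfs2 module package for running kernel {uname_r}")
    rhel56 = ("Red Hat Enterprise Linux" in release_text
              and "release 5.6" in release_text)
    if rhel56 and redhat_kernel and re.match(r"2\.6\.18-2\d\d\.", uname_r):
        findings.append("RHEL 5.6 + Red Hat 2.6.18-2xx kernel: "
                        "failing combination per Bug 11876815")
    return findings


vrh8 = ["ocfs2console-1.4.4-1.el5", "ocfs2-tools-1.4.4-1.el5",
        "ocfs2-2.6.18-238.5.1.el5-1.4.7-1.el5"]
print(review(vrh8, "2.6.18-238.5.1.el5",
             "Red Hat Enterprise Linux Server release 5.6 (Tikanga)", True))
```

On the vrh8 data from this SR the module does match the running kernel, so the only finding is the known-failing RHEL 5.6 combination.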

As per the last action plan, you need to contact Red Hat support to find out the cause of this issue. The workaround is to not use OCFS and to move to a raw device so that the upgrade succeeds.
An Oracle bug, 11876815, was logged internally for this hang issue. A few combinations of OEL, RHEL and OCFS2 were tried and tested, and the combination you are using did not work for us either (per the internal bug updates given above).
The solution provided by the Oracle bug developer is to use OEL instead of RHEL, or to contact RHEL support to identify the cause and a solution (in case they have already tested this setup).
Let me know if RHEL support is already engaged and provide the case ID, so that I can open an internal SR for an Oracle/Red Hat Joint Escalation Team (JET) engagement, in which both vendors work together internally.

+ The SR issue of the grid upgrade from 11.1 to 11.2.0.2.2 is resolved:
– the voting disk was moved from OCFS to a raw device as a workaround for Bug 11876815
– TMP and TEMP were set to a new directory with available space before running the installer, so that the prechecks succeed
– GI PSU #2 was applied before the rootupgrade.sh step
– rootupgrade.sh completed successfully on all nodes
– post-upgrade checks and logs were verified to confirm that the GI upgrade succeeded

+ The DB upgrade to 11.2.0.2 plus PSU #2 will be resumed shortly

Slide: Upgrade 11.2.0.1 GI/CRS to 11.2.0.2 in Linux

Know more about RAC statistics and wait events

The following lists the main statistics and wait events in RAC:

1. Statistics:

1.1 V$SYSSTAT, V$SESSTAT (join to V$STATNAME)

  • gc cr blocks served
  • gc cr block build time
  • gc cr block flush time
  • gc cr block send time
  • gc current blocks served
  • gc current block pin time
  • gc current block flush time
  • gc current block send time
  • gc cr blocks received
  • gc cr block receive time
  • gc current blocks received
  • gc current block receive time
  • gc blocks lost
  • gc claim blocks lost
  • gc blocks corrupt
  • global enqueue gets sync
  • global enqueue gets async
  • global enqueue get time
  • global enqueue releases
  • gcs messages sent
  • ges messages sent
  • global enqueue CPU used by this session
  • gc CPU used by this session
  • IPC CPU used by this session
  • global undo segment hints helped (10.2)
  • global undo segment hints were stale (10.2)

1.2 V$SYSMETRIC

  • GC CR Block Received Per Second
  • GC CR Block Received Per Txn
  • GC Current Block Received Per Second
  • GC Current Block Received Per Txn
  • Global Cache Average CR Get Time
  • Global Cache Average Current Get Time
  • Global Cache Blocks Corrupted
  • Global Cache Blocks Lost

1.3 V$SEGMENT_STATISTICS, V$SEGSTAT

  • gc buffer busy
  • gc cr blocks received
  • gc current blocks received

1.4 DBA_HIST_SEG_STAT

  • GC_CR_BLOCKS_SERVED_TOTAL
  • GC_CR_BLOCKS_SERVED_DELTA
  • GC_CU_BLOCKS_SERVED_TOTAL
  • GC_CU_BLOCKS_SERVED_DELTA

1.5 V$GES_STATISTICS

  • acks for commit broadcast(actual)
  • acks for commit broadcast(logical)
  • broadcast msgs on commit(actual)
  • broadcast msgs on commit(logical)
  • broadcast msgs on commit(wasted)
  • dynamically allocated gcs resources
  • dynamically allocated gcs shadows
  • false posts waiting for scn acks
  • flow control messages received
  • flow control messages sent
  • gcs assume cvt
  • gcs assume no cvt
  • gcs ast xid
  • gcs blocked converts
  • gcs blocked cr converts
  • gcs compatible basts
  • gcs compatible cr basts (global)
  • gcs compatible cr basts (local)
  • gcs cr basts to PIs
  • gcs cr serve without current lock
  • gcs dbwr flush pi msgs
  • gcs dbwr write request msgs
  • gcs error msgs
  • gcs forward cr to pinged instance
  • gcs immediate (compatible) converts
  • gcs immediate (null) converts
  • gcs immediate cr (compatible) converts
  • gcs immediate cr (null) converts
  • gcs indirect ast
  • gcs lms flush pi msgs
  • gcs lms write request msgs
  • gcs msgs process time(ms)
  • gcs msgs received
  • gcs out-of-order msgs
  • gcs pings refused
  • gcs queued converts
  • gcs recovery claim msgs
  • gcs refuse xid
  • gcs regular cr
  • gcs retry convert request
  • gcs side channel msgs actual
  • gcs side channel msgs logical
  • gcs undo cr
  • gcs write notification msgs
  • gcs writes refused
  • ges msgs process time(ms)
  • ges msgs received
  • global posts dropped
  • global posts queue time
  • global posts queued
  • global posts requested
  • global posts sent
  • implicit batch messages received
  • implicit batch messages sent
  • lmd msg send time(ms)
  • lms(s) msg send time(ms)
  • messages flow controlled
  • messages queue sent actual
  • messages queue sent logical
  • messages received actual
  • messages received logical
  • messages sent directly
  • messages sent indirectly
  • messages sent not implicit batched
  • messages sent pbatched
  • msgs causing lmd to send msgs
  • msgs causing lms(s) to send msgs
  • msgs received queue time (ms)
  • msgs received queued
  • msgs sent queue time (ms)
  • msgs sent queue time on ksxp (ms)
  • msgs sent queued
  • msgs sent queued on ksxp
  • process batch messages received
  • process batch messages sent
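
The V$SYSSTAT counters above are usually read as ratios rather than raw totals. A common derived figure is the average global cache block receive time: receive time divided by blocks received. The receive-time statistics are recorded in centiseconds, so the ratio is multiplied by 10 to get milliseconds. A minimal Python sketch, assuming the values have already been fetched into a dict keyed by statistic name (for example from SELECT name, value FROM v$sysstat); the function name and the numbers in the example are hypothetical, and for trend analysis the delta between two snapshots is more meaningful than the cumulative values:

```python
def avg_receive_ms(stats, kind):
    """Average gc <kind> block receive time in milliseconds.

    stats: dict of V$SYSSTAT name -> value (cumulative, or a snapshot delta).
    kind:  "cr" or "current".
    The receive-time statistics are in centiseconds, hence the factor of 10.
    """
    blocks = stats[f"gc {kind} blocks received"]
    time_cs = stats[f"gc {kind} block receive time"]
    if blocks == 0:
        return 0.0
    return 10.0 * time_cs / blocks


# Hypothetical delta between two snapshots, for illustration only.
delta = {
    "gc cr blocks received": 50_000,
    "gc cr block receive time": 4_000,
    "gc current blocks received": 20_000,
    "gc current block receive time": 3_000,
}
print(avg_receive_ms(delta, "cr"))       # 0.8
print(avg_receive_ms(delta, "current"))  # 1.5
```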

2. Wait Events:

In 10g, RAC wait events fall into 3 categories. The main ones are listed below, including some undocumented wait events.

2.1. Real time only:

These wait events are only defined while the process is waiting; after the wait is over, they are reclassified according to the outcome of the global cache operation. They should appear only in the following views: V$SESSION_WAIT, V$ACTIVE_SESSION_HISTORY, V$EVENT_HISTOGRAM.

  • gc cr request
  • gc current request

2.2. Historical only:

These events represent the outcome of a GC request (fixup events). They can appear in the following views: V$SESSION_EVENT, V$SYSTEM_EVENT, DBA_HIST_SYSTEM_EVENT, V$EVENTMETRIC.

  • gc cr block 2-way
  • gc cr block 3-way
  • gc cr block busy
  • gc cr block congested
  • gc cr block lost
  • gc cr block unknown
  • gc cr grant 2-way
  • gc cr grant busy
  • gc cr grant congested
  • gc cr grant unknown
  • gc current block 2-way
  • gc current block 3-way
  • gc current block busy
  • gc current block congested
  • gc current block lost
  • gc current block unknown
  • gc current grant 2-way
  • gc current grant busy
  • gc current grant congested
  • gc current grant unknown

2.3. Other events:

The remaining wait events may appear in any of the views listed before, namely: V$SESSION_WAIT, V$ACTIVE_SESSION_HISTORY, V$EVENT_HISTOGRAM, V$SESSION_EVENT, V$SYSTEM_EVENT, DBA_HIST_SYSTEM_EVENT, V$EVENTMETRIC.

  • LMON global data update
  • cr request retry
  • gc assume
  • gc block recovery request
  • gc buffer busy
  • gc claim
  • gc cr cancel
  • gc cr disk read
  • gc cr disk request
  • gc cr failure
  • gc cr multi block request
  • gc current cancel
  • gc current multi block request
  • gc current retry
  • gc current split
  • gc domain validation
  • gc freelist
  • gc prepare
  • gc quiesce wait
  • gc recovery free
  • gc recovery quiesce
  • gc remaster
  • gcs ddet enter server mode
  • gcs domain validation
  • gcs drm freeze begin
  • gcs drm freeze in enter server mode
  • gcs enter server mode
  • gcs log flush sync
  • gcs remastering wait for read latch
  • gcs remastering wait for write latch
  • gcs remote message
  • gcs resource directory to be unfrozen
  • gcs to be enabled
  • ges LMD suspend for testing event
  • ges LMD to inherit communication channels
  • ges LMD to shutdown
  • ges LMON for send queues
  • ges LMON to get to FTDONE
  • ges LMON to join CGS group
  • ges cached resource cleanup
  • ges cancel
  • ges cgs registration
  • ges enter server mode
  • ges generic event
  • ges global resource directory to be frozen
  • ges inquiry response
  • ges lmd and pmon to attach
  • ges lmd/lmses to freeze in rcfg - mrcvr
  • ges lmd/lmses to unfreeze in rcfg - mrcvr
  • ges master to get established for SCN op
  • ges performance test completion
  • ges pmon to exit
  • ges process with outstanding i/o
  • ges reconfiguration to start
  • ges remote message
  • ges resource cleanout during enqueue open
  • ges resource cleanout during enqueue open-cvt
  • ges resource directory to be unfrozen
  • ges retry query node
  • ges reusing os pid
  • ges user error
  • ges wait for lmon to be ready
  • ges1 LMON to wake up LMD - mrcvr
  • ges2 LMON to wake up LMD - mrcvr
  • ges2 LMON to wake up lms - mrcvr 2
  • ges2 LMON to wake up lms - mrcvr 3
  • ges2 proc latch in rm latch get 1
  • ges2 proc latch in rm latch get 2
  • global cache busy
  • global enqueue expand wait
  • latch: KCL gc element parent latch
  • latch: gcs resource hash
  • latch: ges resource hash list
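
The three categories determine which views are worth querying for a given event. A minimal Python sketch encoding the lists above: the historical-only set is generated from the block/grant outcome pattern in section 2.2 (it yields exactly the 20 events listed there), and anything not in the first two sets is treated as section 2.3. That fallback is only meaningful for the RAC events listed in this article; the names are hypothetical:

```python
from itertools import product

REAL_TIME_VIEWS = ("V$SESSION_WAIT", "V$ACTIVE_SESSION_HISTORY", "V$EVENT_HISTOGRAM")
HISTORICAL_VIEWS = ("V$SESSION_EVENT", "V$SYSTEM_EVENT",
                    "DBA_HIST_SYSTEM_EVENT", "V$EVENTMETRIC")

# Section 2.1: placeholders that exist only while the wait is in progress.
REAL_TIME_ONLY = {"gc cr request", "gc current request"}

# Section 2.2: outcomes of a cr/current request, for blocks and for grants.
HISTORICAL_ONLY = {
    f"gc {mode} block {outcome}"
    for mode, outcome in product(("cr", "current"),
                                 ("2-way", "3-way", "busy", "congested", "lost", "unknown"))
} | {
    f"gc {mode} grant {outcome}"
    for mode, outcome in product(("cr", "current"),
                                 ("2-way", "busy", "congested", "unknown"))
}


def views_for(event):
    """Return (category, views) for a RAC wait event from the lists above."""
    if event in REAL_TIME_ONLY:
        return "real time only", REAL_TIME_VIEWS
    if event in HISTORICAL_ONLY:
        return "historical only", HISTORICAL_VIEWS
    # Section 2.3: the remaining RAC events may appear in any of the views.
    return "other", REAL_TIME_VIEWS + HISTORICAL_VIEWS


print(views_for("gc current block 3-way")[0])  # historical only
```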

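Slide: Upgrade 11.2.0.1 RAC DB/RDBMS to 11.2.0.2 in Linux By Maclean

Upgrade 11.2.0.1 DB/RDBMS to 11.2.0.2 in Linux

In <Upgrade 11.2.0.1 GI/CRS to 11.2.0.2 in Linux> we described in detail how to upgrade 11.2.0.1 GI/CRS to 11.2.0.2. Because the GI/CRS version must always be at least as high as the DB/RDBMS version, that upgrade is a prerequisite for upgrading the RDBMS software.

Next we describe the detailed steps for upgrading 11.2.0.1 DB/RDBMS to 11.2.0.2:

I. Download the patch media

There is currently no public download location for the 11.2.0.2 patch set. Because updates.oracle.com no longer offers FTP downloads, the only way is to log in to My Oracle Support, open the Patches section, search for the patch ID and obtain an encrypted download link.

The full name of the 11.2.0.2 patch set is 11.2.0.2.0 PATCH SET FOR ORACLE DATABASE SERVER (Patchset) (patch ID 10098816). Search for 10098816 in the Patches section and pick the zip files for your platform, for example on Linux x86-64: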

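(Screenshot: Patch 10098816 11.2.0.2.0 PATCH SET FOR ORACLE DATABASE SERVER download page)

Of these, p10098816_112020_Linux-x86-64_1of7.zip and p10098816_112020_Linux-x86-64_2of7.zip contain the Database/RDBMS software. We do not need to download all 7 zip files; these 2 are enough to upgrade the database software.

After downloading the 2 files, unzip each of them: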

unzip p10098816_112020_Linux-x86-64_1of7.zip -d  $PATCHHOME
unzip p10098816_112020_Linux-x86-64_2of7.zip -d  $PATCHHOME
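
Before unzipping, it may be worth confirming that both downloaded parts are intact, since a truncated download can otherwise surface only later as an unzip or installer error. A minimal Python sketch using the standard zipfile module; the file-name pattern is an assumption based on the Linux x86-64 names above and the function names are hypothetical. It only checks the archives and leaves the extraction to unzip, which, unlike Python's zipfile.extractall, restores the execute bits that runInstaller needs:

```python
import re
import zipfile
from pathlib import Path

# Assumed naming: parts 1 and 2 of the 7-part patch set hold the
# Database/RDBMS software (see the Linux x86-64 file names above).
_DB_PART = re.compile(r"^p10098816_112020_.+_([12])of7\.zip$")


def db_parts(names):
    """Return the 1of7 and 2of7 zip names from an iterable of file names, sorted."""
    return sorted(n for n in names if _DB_PART.match(n))


def check_db_parts(src_dir):
    """Map each DB zip part in src_dir to None (intact) or a problem description."""
    src = Path(src_dir)
    report = {}
    for name in db_parts(p.name for p in src.iterdir()):
        try:
            with zipfile.ZipFile(src / name) as zf:
                bad = zf.testzip()  # first member with a bad CRC, or None
            report[name] = None if bad is None else f"bad CRC in member {bad}"
        except zipfile.BadZipFile as exc:  # truncated or not a zip archive
            report[name] = f"unreadable: {exc}"
    return report

# Usage (the directory is illustrative): check_db_parts("/path/to/downloads")
```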

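II. Install the 11.2.0.2 DB software out of place

Starting with 11.2.0.2, patch sets are installed out of place, so unlike releases before 11gR2 we no longer have to apply the patch set on top of the existing lower-version software; instead we can do a completely new installation in a different location.

Note that this step does not require stopping the database instances and can be done ahead of time.

As the owner of the DB/RDBMS software (the oracle user), start the OUI installer from the directory the patch set was just unzipped into: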

su - oracle

(oracle)$ unset ORACLE_HOME ORACLE_BASE ORACLE_SID
(oracle)$ export DISPLAY=:0
(oracle)$ cd $PATCHHOME/database
(oracle)$ ./runInstaller

On the Select Installation Option screen of Oracle Universal Installer, choose Install database software only.

On the Grid Installation Options screen, choose Oracle Real Application Clusters database installation for a RAC database. If this screen reports [FATAL] [INS-35354] The system on which you are attempting to install Oracle RAC is not part of a valid cluster, the inventory was probably not updated correctly (Update Inventory) during the earlier Grid installation; see <An INS-35354 issue during 11gR2 RAC installation>.

For a single-node database, choose Single instance database installation.

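On the Specify Installation Location screen, OUI usually proposes a new directory under $ORACLE_BASE that differs from the existing database software home. Make sure this location has enough disk space; to be safe, more than 10 GB. Because this is an out-of-place installation, never enter the existing installation path.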
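
One quick way to confirm the 10 GB margin before starting OUI is to check the free space on the file system that will hold the new home. A minimal Python sketch using the standard shutil.disk_usage; the path in the usage comment is the new home used later in this article, 10 GB is treated as 10 GiB (an assumption), and the function name is hypothetical:

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3  # treat "10GB" as 10 GiB


def has_room(path, min_gib=10):
    """True if the file system that will hold `path` has at least min_gib GiB free.

    The new ORACLE_HOME usually does not exist yet, so walk up to the nearest
    existing parent directory and check that file system instead.
    """
    p = Path(path)
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free >= min_gib * GIB

# Usage: has_room("/s01/orabase/product/11.2.0/dbhome_2")
```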

 


 

After the installation completes, OUI prompts you to run the root.sh script as root on all nodes:

su - root
(root #) /s01/orabase/product/11.2.0/dbhome_2/root.sh

Running Oracle 11g root script...

The following environment variables are set as:
    ORACLE_OWNER= oracle
    ORACLE_HOME=  /s01/orabase/product/11.2.0/dbhome_2

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Finished product-specific root actions.

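III. Pre-upgrade preparation
At this point the 11.2.0.2 database software is installed, but the instances and the data dictionary have not been upgraded yet.
Before the actual upgrade it is essential to complete a series of backups and preparation steps; for details see my article <Necessary preparations before upgrading an Oracle database>.

1. Clean up unneeded data in the data dictionary, including the audit trail and the recycle bin, since they can slow down the dictionary upgrade:

TRUNCATE TABLE SYS.AUD$;
purge DBA_RECYCLEBIN;

2. If circumstances allow, take a full RMAN backup of the database, provided it is not multi-terabyte in size.

rman target / catalog rman/rman@cata

backup as compressed backupset incremental level 0 database ;

3. Gather data dictionary statistics; if the dictionary statistics are inaccurate, the catupgrd.sql dictionary upgrade script may run for a long time: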

SQL> set timing on;

SQL> EXECUTE dbms_stats.gather_dictionary_stats;

PL/SQL procedure successfully completed.

Elapsed: 00:00:27.81

4. Run the dbupgdiag.sql upgrade diagnostic script, which reports version and component information for the database. Sample output:

cat db_upg_diag_VPROD_07-Sep-2011_0737.log

                          *** Start of LogFile ***

  Oracle Database Upgrade Diagnostic Utility       09-07-2011 19:37:23

===============
Database Uptime
===============

19:32 07-SEP-11

=================
Database Wordsize
=================

This is a 64-bit database

================
Software Version
================

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for Linux: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

=============
Compatibility
=============

Compatibility is set as 11.2.0.0.0

================
Component Status
================

Comp ID Component                          Status    Version        Org_Version    Prv_Version
------- ---------------------------------- --------- -------------- -------------- --------------
CATALOG Oracle Database Catalog Views      VALID     11.2.0.1.0
CATPROC Oracle Database Packages and Types VALID     11.2.0.1.0
OWM     Oracle Workspace Manager           VALID     11.2.0.1.0
RAC     Oracle Real Application Clusters   VALID     11.2.0.1.0

======================================================
List of Invalid Database Objects Owned by SYS / SYSTEM
======================================================

Number of Invalid Objects
------------------------------------------------------------------
There are no Invalid Objects

DOC>################################################################
DOC>
DOC> If there are no Invalid objects below will result in zero rows.
DOC>
DOC>################################################################
DOC>#

no rows selected

================================
List of Invalid Database Objects
================================

Number of Invalid Objects
------------------------------------------------------------------
There are no Invalid Objects

DOC>################################################################
DOC>
DOC> If there are no Invalid objects below will result in zero rows.
DOC>
DOC>################################################################
DOC>#

no rows selected

==============================================================
Identifying whether a database was created as 32-bit or 64-bit
==============================================================

DOC>###########################################################################
DOC>
DOC> Result referencing the string 'B023' ==> Database was created as 32-bit
DOC> Result referencing the string 'B047' ==> Database was created as 64-bit
DOC> When String results in 'B023' and when upgrading database to 10.2.0.3.0
DOC> (64-bit) , For known issue refer below articles
DOC>
DOC> Note 412271.1 ORA-600 [22635] and ORA-600 [KOKEIIX1] Reported While
DOC> Upgrading Or Patching Databases To 10.2.0.3
DOC> Note 579523.1 ORA-600 [22635], ORA-600 [KOKEIIX1], ORA-7445 [KOPESIZ] and
DOC> OCI-21500 [KOXSIHREAD1] Reported While Upgrading To 11.1.0.6
DOC>
DOC>###########################################################################
DOC>#

Metadata Initial DB Creation Info
-------- -----------------------------------
B047     Database was created as 64-bit

===================================================
Number of Duplicate Objects Owned by SYS and SYSTEM
===================================================

Counting duplicate objects ....

  COUNT(1)
----------
         4

=========================================
Duplicate Objects Owned by SYS and SYSTEM
=========================================

Querying duplicate objects ....

OBJECT_NAME                              OBJECT_TYPE
---------------------------------------- ----------------------------------------
AQ$_SCHEDULES                            TABLE
AQ$_SCHEDULES_PRIMARY                    INDEX
DBMS_REPCAT_AUTH                         PACKAGE BODY
DBMS_REPCAT_AUTH                         PACKAGE

DOC>
DOC>################################################################################
DOC>
DOC> If any objects found please follow below article.
DOC> Note 1030426.6 How to Clean Up Duplicate Objects Owned by SYS and SYSTEM schema
DOC> Read the Exceptions carefully before taking actions.
DOC>
DOC>################################################################################
DOC>#

================
JVM Verification
================

JAVAVM - NOT Installed. Below results can be ignored

================================================
Checking Existence of Java-Based Users and Roles
================================================

DOC>
DOC>################################################################################
DOC>
DOC> There should not be any Java Based users for database version 9.0.1 and above.
DOC> If any users found, it is faulty JVM.
DOC>
DOC>################################################################################
DOC>#

User Existence
---------------------------
No Java Based Users

DOC>
DOC>###############################################################
DOC>
DOC> Healthy JVM Should contain Six Roles.
DOC> If there are more or less than six role, JVM is inconsistent.
DOC>
DOC>###############################################################
DOC>#

Role
------------------------------
No JAVA related Roles

Roles

=========================================
List of Invalid Java Objects owned by SYS
=========================================

There are no SYS owned invalid JAVA objects

DOC>
DOC>#################################################################
DOC>
DOC> Check the status of the main JVM interface packages DBMS_JAVA
DOC> and INITJVMAUX and make sure it is VALID.
DOC> If there are no Invalid objects below will result in zero rows.
DOC>
DOC>#################################################################
DOC>#

no rows selected

INFO: Below query should succeed with 'foo' as result.
select dbms_java.longname('foo') "JAVAVM TESTING" from dual
       *
ERROR at line 1:
ORA-00904: "DBMS_JAVA"."LONGNAME": invalid identifier

                            *** End of LogFile ***

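The spooled output above shows that the database to be upgraded currently has the CATALOG, CATPROC, OWM and RAC components, and that JVM is not installed (upgrading the JVM component's data dictionary takes a comparatively long time).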
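
The Component Status section is the part worth re-checking before and after the upgrade. A minimal Python sketch that extracts it from a saved dbupgdiag log and lists any component that is not VALID; it assumes the column layout shown above, where Org_Version and Prv_Version are empty so the last two fields on each row are Status and Version, and the function names are hypothetical:

```python
def parse_components(text):
    """Parse the 'Component Status' table of a dbupgdiag.sql log.

    Returns {comp_id: (status, version)}.
    """
    lines = text.splitlines()
    header = next((i for i, l in enumerate(lines) if l.startswith("Comp ID")), None)
    comps = {}
    if header is None:
        return comps
    for line in lines[header + 2:]:  # skip the header and its dashed underline
        if not line.strip():
            break  # a blank line ends the table
        fields = line.split()
        # Assumes Org_Version/Prv_Version are empty, as in the output above.
        comps[fields[0]] = (fields[-2], fields[-1])
    return comps


def not_valid(comps):
    """Component IDs whose status is anything other than VALID."""
    return sorted(cid for cid, (status, _version) in comps.items() if status != "VALID")


SAMPLE = """\
Comp ID Component                          Status    Version        Org_Version    Prv_Version
------- ---------------------------------- --------- -------------- -------------- --------------
CATALOG Oracle Database Catalog Views      VALID     11.2.0.1.0
CATPROC Oracle Database Packages and Types VALID     11.2.0.1.0
OWM     Oracle Workspace Manager           VALID     11.2.0.1.0
RAC     Oracle Real Application Clusters   VALID     11.2.0.1.0
"""

comps = parse_components(SAMPLE)
print(not_valid(comps))  # []
```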

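Another script worth running is utlu112i.sql, located in the rdbms/admin directory of the newly installed $ORACLE_HOME.

It gives pre-upgrade recommendations, including ensuring that the SYSTEM tablespace and the flash recovery area have enough space and gathering data dictionary statistics, as in the following output: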

SQL> @/s01/orabase/product/11.2.0/dbhome_2/rdbms/admin/utlu112i.sql
Oracle Database 11.2 Pre-Upgrade Information Tool 09-07-2011 20:02:30
Script Version: 11.2.0.2.0 Build: 001
.
**********************************************************************
Database:
**********************************************************************
--> name:          VPROD
--> version:       11.2.0.1.0
--> compatible:    11.2.0.0.0
--> blocksize:     8192
--> platform:      Linux x86 64-bit
--> timezone file: V11
.
**********************************************************************
Tablespaces: [make adjustments in the current environment]
**********************************************************************
--> SYSTEM tablespace is adequate for the upgrade.
.... minimum required size: 267 MB
--> SYSAUX tablespace is adequate for the upgrade.
.... minimum required size: 150 MB
--> UNDOTBS1 tablespace is adequate for the upgrade.
.... minimum required size: 253 MB
--> TEMP tablespace is adequate for the upgrade.
.... minimum required size: 61 MB
.
**********************************************************************
Flashback: ON
**********************************************************************
FlashbackInfo:
--> name:          +SYSTEMDG
--> limit:         4977 MB
--> used:          264 MB
--> size:          4977 MB
--> reclaim:       0 MB
--> files:         7
WARNING: --> Flashback Recovery Area Set.  Please ensure adequate disk space in recovery areas before performing an upgrade.
.
**********************************************************************
Update Parameters: [Update Oracle Database 11.2 init.ora or spfile]
Note: Pre-upgrade tool was run on a lower version 64-bit database.
**********************************************************************
--> If Target Oracle is 32-Bit, refer here for Update Parameters:
-- No update parameter changes are required.
.

--> If Target Oracle is 64-Bit, refer here for Update Parameters:
-- No update parameter changes are required.
.
**********************************************************************
Renamed Parameters: [Update Oracle Database 11.2 init.ora or spfile]
**********************************************************************
-- No renamed parameters found. No changes are required.
.
**********************************************************************
Obsolete/Deprecated Parameters: [Update Oracle Database 11.2 init.ora or spfile]
**********************************************************************
-- No obsolete parameters found. No changes are required
.

**********************************************************************
Components: [The following database components will be upgraded or installed]
**********************************************************************
--> Oracle Catalog Views         [upgrade]  VALID
--> Oracle Packages and Types    [upgrade]  VALID
--> Real Application Clusters    [upgrade]  VALID
--> Oracle Workspace Manager     [upgrade]  VALID
.
**********************************************************************
Miscellaneous Warnings
**********************************************************************
WARNING: --> The "cluster_database" parameter is currently "TRUE"
.... and must be set to "FALSE" prior to running a manual upgrade.
WARNING: --> Database is using a timezone file older than version 14.
.... After the release migration, it is recommended that DBMS_DST package
.... be used to upgrade the 11.2.0.1.0 database timezone version
.... to the latest version which comes with the new release.
WARNING: --> Your recycle bin is turned on and currently contains no objects.
.... Because it is REQUIRED that the recycle bin be empty prior to upgrading
.... and your recycle bin is turned on, you may need to execute the command:
        PURGE DBA_RECYCLEBIN
.... prior to executing your upgrade to confirm the recycle bin is empty.
.
**********************************************************************
Recommendations
**********************************************************************
Oracle recommends gathering dictionary statistics prior to
upgrading the database.
To gather dictionary statistics execute the following command
while connected as SYSDBA:

    EXECUTE dbms_stats.gather_dictionary_stats;

**********************************************************************
Oracle recommends removing all hidden parameters prior to upgrading.

To view existing hidden parameters execute the following command
while connected AS SYSDBA:

    SELECT name,description from SYS.V$PARAMETER WHERE name
        LIKE '\_%' ESCAPE '\'

Changes will need to be made in the init.ora or spfile.

**********************************************************************
Oracle recommends reviewing any defined events prior to upgrading.

To view existing non-default events execute the following commands
while connected AS SYSDBA:
  Events:
    SELECT (translate(value,chr(13)||chr(10),' ')) FROM sys.v$parameter2
      WHERE  UPPER(name) ='EVENT' AND  isdefault='FALSE'

  Trace Events:
    SELECT (translate(value,chr(13)||chr(10),' ')) from sys.v$parameter2
      WHERE UPPER(name) = '_TRACE_EVENTS' AND isdefault='FALSE'

Changes will need to be made in the init.ora or spfile.

**********************************************************************
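
The utlu112i.sql output is long, and the items that need action before the upgrade are its WARNING entries. A minimal Python sketch that pulls those entries out of a spooled copy of the output and joins each warning's continuation lines; it assumes the layout shown above (warnings start with "WARNING: -->", continuation lines usually start with "....", and sections are separated by "." or asterisk lines), and the function name is hypothetical:

```python
def upgrade_warnings(text):
    """Collect the 'WARNING: -->' entries from utlu112i.sql output.

    Continuation lines (usually prefixed with '....') are joined onto the
    warning they follow; a blank line, a lone '.' or a line of asterisks
    ends the current warning.
    """
    marker = "WARNING: -->"
    result, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(marker):
            if current is not None:
                result.append(current)
            current = line[len(marker):].strip()
        elif line in ("", ".") or line.startswith("*"):
            if current is not None:
                result.append(current)
            current = None
        elif current is not None:
            # Continuation line: drop the leading '....' marker if present.
            current += " " + line.lstrip(".").strip()
    if current is not None:
        result.append(current)
    return result


SAMPLE = """\
WARNING: --> The "cluster_database" parameter is currently "TRUE"
.... and must be set to "FALSE" prior to running a manual upgrade.
WARNING: --> Database is using a timezone file older than version 14.
.... After the release migration, it is recommended that DBMS_DST package
.... be used to upgrade the 11.2.0.1.0 database timezone version
.... to the latest version which comes with the new release.
.
"""

for w in upgrade_warnings(SAMPLE):
    print("-", w)
```

Applied to the full output above, it would return four entries: the flash recovery area, cluster_database, timezone file and recycle bin warnings.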

 

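5. If the database is large, it is recommended to enable flashback database and create a restore point, which can greatly shorten the time needed to roll back.

The following query shows whether flashback database is enabled: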

SQL> select FLASHBACK_ON from v$database;

FLASHBACK_ON
------------------
NO

NO means flashback database has not been enabled before. Enabling it requires a short database outage:

Shut down all database instances

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

Start one of the instances in MOUNT state

SQL> startup mount;
ORACLE instance started.

Total System Global Area 1252663296 bytes
Fixed Size                  2212936 bytes
Variable Size             603982776 bytes
Database Buffers          637534208 bytes
Redo Buffers                8933376 bytes
Database mounted.

SQL> alter database flashback on;

Database altered.

Open the database on this node, then start the instances on all nodes

SQL> alter database open;

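Database altered.

Flashback database is now enabled at the database level.

Next we need to stop the applications. All preparation up to this point can be done online, but this step requires disconnecting all application connections, shutting down the database and restarting it in RESTRICT mode, so that a restore point can be created for a possible rollback of the upgrade. RESTRICT mode prevents ordinary users from connecting.

Shut down the database instances on all nodes, and start the database in RESTRICT mode on a single node.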

startup restrict;

ORACLE instance started.

Total System Global Area 1252663296 bytes
Fixed Size 2212936 bytes
Variable Size 603982776 bytes
Database Buffers 637534208 bytes
Redo Buffers 8933376 bytes
Database mounted.
Database opened.

SQL> conn maclean/maclean
ERROR:
ORA-01035: ORACLE only available to users with RESTRICTED SESSION privilege

Warning: You are no longer connected to ORACLE.

conn / as sysdba

SQL> create restore point maclean_rollback guarantee flashback database;

Restore point created.

SQL> select * from v$restore_point;

       SCN DATABASE_INCARNATION# GUA STORAGE_SIZE
---------- --------------------- --- ------------
TIME
---------------------------------------------------------------------------
RESTORE_POINT_TIME                                                          PRE
--------------------------------------------------------------------------- ---
NAME
--------------------------------------------------------------------------------
    601958                     1 YES     15941632
07-SEP-11 07.52.59.000000000 PM
                                                                            YES
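MACLEAN_ROLLBACK

IV. Upgrade the database instances and data dictionary

1. Shut down all database instances

2. Copy the pfile or spfile parameter settings to the new ORACLE_HOME. Here we assume a shared spfile stored in ASM, so on each node we only need to copy the init$SID.ora stub file that points to it: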

(oracle $) cat $ORACLE_HOME/dbs/initVPROD1.ora
SPFILE='+SYSTEMDG/VPROD/spfileVPROD.ora'

(oracle $) cp $ORACLE_HOME/dbs/initVPROD1.ora /s01/orabase/product/11.2.0/dbhome_2/dbs

Set ORACLE_HOME and PATH to point to the new 11.2.0.2 database software

(oracle $) export ORACLE_HOME=/s01/orabase/product/11.2.0/dbhome_2
(oracle $) export PATH=/s01/orabase/product/11.2.0/dbhome_2/bin:$PATH

Set the correct ORACLE_SID

(oracle $) export ORACLE_SID=VPROD1
(oracle $) unset LD_LIBRARY_PATH
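
The stub copy has to be repeated on each node with that node's SID (VPROD1, VPROD2 and so on). A minimal Python sketch that copies the stub and returns the SPFILE location it points at, so you can confirm that both homes reference the same shared spfile; the function names are hypothetical, and the old home path in the usage comment is also hypothetical because the text does not name it:

```python
import shutil
from pathlib import Path


def spfile_location(init_text):
    """Return the SPFILE value from an init<SID>.ora stub, or None if absent."""
    for line in init_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().upper() == "SPFILE":
            return value.strip().strip("'\"")
    return None


def copy_init_stub(old_home, new_home, sid):
    """Copy dbs/init<SID>.ora from old_home to new_home; return its SPFILE location."""
    name = f"init{sid}.ora"
    src = Path(old_home) / "dbs" / name
    dst = Path(new_home) / "dbs" / name
    shutil.copy2(src, dst)  # also copies permission bits and timestamps
    return spfile_location(dst.read_text())

# Usage on node 1 (the old home path is hypothetical):
# copy_init_stub("/s01/orabase/product/11.2.0/dbhome_1",
#                "/s01/orabase/product/11.2.0/dbhome_2", "VPROD1")
```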

 

3. Start the instance in NOMOUNT state and set the cluster_database parameter to FALSE in the spfile:

SQL> startup nomount;
ORACLE instance started.

Total System Global Area 1252663296 bytes
Fixed Size                  2226072 bytes
Variable Size             402655336 bytes
Database Buffers          838860800 bytes
Redo Buffers                8921088 bytes

SQL> alter system set cluster_database=false scope=spfile;

System altered.

4. Restart the instance in UPGRADE mode and upgrade the data dictionary by running the $ORACLE_HOME/rdbms/admin/catupgrd.sql script:

SQL> shutdown immediate;
ORA-01507: database not mounted

ORACLE instance shut down.
SQL> startup upgrade;
ORACLE instance started.

Total System Global Area 1252663296 bytes
Fixed Size                  2226072 bytes
Variable Size             402655336 bytes
Database Buffers          838860800 bytes
Redo Buffers                8921088 bytes
Database mounted.
Database opened.

SQL> set echo on  

SQL> SPOOL /tmp/upgrade.log

SQL> set time on; 

20:40:40 SQL> @/s01/orabase/product/11.2.0/dbhome_2/rdbms/admin/catupgrd.sql 

While catupgrd.sql is running, you can follow the progress of each component's dictionary upgrade through the DBA_SERVER_REGISTRY view:

SQL> select * from DBA_SERVER_REGISTRY;
select * from DBA_SERVER_REGISTRY
              *
ERROR at line 1:
ORA-04063: view "SYS.DBA_SERVER_REGISTRY" has errors
or
ERROR at line 1:
ORA-04063: package body "SYS.DBMS_REGISTRY" has errors

At first the query reports that the view has errors. This does not matter; just wait a little while.

SQL> select comp_name,status,version from dba_server_registry;

COMP_NAME                                          STATUS                           VERSION
-------------------------------------------------- --------------------------       ------------------------------
Oracle Workspace Manager                           UPGRADING                        11.2.0.1.0
Oracle Database Catalog Views                      VALID                            11.2.0.2.0
Oracle Database Packages and Types                 VALID                            11.2.0.2.0
Oracle Real Application Clusters                   VALID                            11.2.0.2.0

20:50:40 SQL>
20:50:40 SQL> Rem *********************************************************************
20:50:40 SQL> Rem END catupgrd.sql
20:50:40 SQL> Rem *********************************************************************
20:50:40 SQL> 

The catupgrd.sql script above ran for about 10 minutes.

Restart the instance and run the utlrp.sql script to recompile invalid objects

sqlplus  / as sysdba
startup;

@?/rdbms/admin/utlrp

TIMESTAMP
--------------------------------------------------------------------------------
COMP_TIMESTAMP UTLRP_BGN  2011-09-07 20:53:38

The script automatically chooses its degree of parallelism based on the number of CPUs:

DOC>   This script automatically chooses serial or parallel recompilation
DOC>   based on the number of CPUs available (parameter cpu_count) multiplied
DOC>   by the number of threads per CPU (parameter parallel_threads_per_cpu).
DOC>   On RAC, this number is added across all RAC nodes.

TIMESTAMP
--------------------------------------------------------------------------------
COMP_TIMESTAMP UTLRP_END  2011-09-07 20:55:09

The script took about 1.5 minutes (from 20:53:38 to 20:55:09).
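
The durations quoted above can be read straight off the spool timestamps. A minimal Python sketch of that arithmetic using the timestamps shown above (the function name is hypothetical); the catupgrd.sql timestamps are 10 minutes apart and the utlrp.sql ones 1 minute 31 seconds apart:

```python
from datetime import datetime, timedelta


def elapsed(start, end, fmt="%H:%M:%S"):
    """Elapsed time between two timestamps copied from a SQL*Plus session."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0) and "%d" not in fmt:
        # Clock-only timestamps: the run crossed midnight.
        delta += timedelta(days=1)
    return delta


# catupgrd.sql: from the "set time on" prompt that started it to the END banner
print(elapsed("20:40:40", "20:50:40"))  # 0:10:00
# utlrp.sql: UTLRP_BGN and UTLRP_END timestamps
print(elapsed("2011-09-07 20:53:38", "2011-09-07 20:55:09", "%Y-%m-%d %H:%M:%S"))  # 0:01:31
```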

Set cluster_database back to TRUE and restart the instances on all nodes

SQL> alter system set cluster_database=true scope=spfile;

System altered.

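As shown above, with only four components installed (CATALOG, CATPROC, OWM and RAC), the catupgrd.sql dictionary upgrade took only about 10 minutes. A real production database may have many more components installed, and some of them, such as JVM, take considerably longer.

The following tables summarize typical dictionary upgrade times for each Oracle component and are a very useful reference when estimating upgrade time: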

DB Sample Upgrade Time

With fewer components:

Component                        HH:MM:SS
Oracle Server                    00:16:17
JServer JAVA Virtual Machine     00:05:19
Oracle XDK                       00:00:48
Oracle Text                      00:00:58
Oracle XML Database              00:04:09
Oracle Database Java Packages    00:00:33
Gathering Statistics             00:02:43
Total Upgrade Time               00:30:47

With more components:

Component                        HH:MM:SS
Oracle Server                    00:16:17
JServer JAVA Virtual Machine     00:05:19
Oracle Workspace Manager         00:01:01
Oracle Enterprise Manager        00:10:13
Oracle XDK                       00:00:48
Oracle Text                      00:00:58
Oracle XML Database              00:04:09
Oracle Database Java Packages    00:00:33
Oracle Multimedia                00:07:43
Oracle Expression Filter         00:00:18
Oracle Rule Manager              00:00:12
Gathering Statistics             00:04:53
Total Upgrade Time               00:52:31
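
To estimate your own upgrade window, one approach is to add up the rows for the components actually installed in your database. A minimal Python sketch of that sum over the two sample tables (the names are hypothetical). In the first table the rows add up exactly to the listed total, while in the second they add up to 00:52:24, 7 seconds less than the listed 00:52:31, presumably time not attributed to any single component:

```python
def to_seconds(hms):
    """'HH:MM:SS' -> seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds):
    """seconds -> 'HH:MM:SS'."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def total(table):
    """Sum the per-component times of one table, in seconds."""
    return sum(to_seconds(t) for t in table.values())


FEWER = {
    "Oracle Server": "00:16:17",
    "JServer JAVA Virtual Machine": "00:05:19",
    "Oracle XDK": "00:00:48",
    "Oracle Text": "00:00:58",
    "Oracle XML Database": "00:04:09",
    "Oracle Database Java Packages": "00:00:33",
    "Gathering Statistics": "00:02:43",
}
MORE = {
    "Oracle Server": "00:16:17",
    "JServer JAVA Virtual Machine": "00:05:19",
    "Oracle Workspace Manager": "00:01:01",
    "Oracle Enterprise Manager": "00:10:13",
    "Oracle XDK": "00:00:48",
    "Oracle Text": "00:00:58",
    "Oracle XML Database": "00:04:09",
    "Oracle Database Java Packages": "00:00:33",
    "Oracle Multimedia": "00:07:43",
    "Oracle Expression Filter": "00:00:18",
    "Oracle Rule Manager": "00:00:12",
    "Gathering Statistics": "00:04:53",
}

print(to_hms(total(FEWER)))  # 00:30:47
print(to_hms(total(MORE)))   # 00:52:24
```

For a real estimate, build the dictionary from the components your database actually has (for example, as listed in DBA_SERVER_REGISTRY).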

 

5. Use the srvctl command to update the database home information stored in the OCR:

su  - oracle

srvctl upgrade database -d VPROD -o $NEW_ORACLE_HOME

srvctl upgrade database -d VPROD -o /s01/orabase/product/11.2.0/dbhome_2

[oracle@vrh1 ~]$ srvctl config database -d VPROD
Database unique name: VPROD
Database name: VPROD
Oracle home: /s01/orabase/product/11.2.0/dbhome_2
Oracle user: oracle
Spfile: +SYSTEMDG/VPROD/spfileVPROD.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: VPROD
Database instances: VPROD1,VPROD2
Disk Groups: SYSTEMDG
Mount point paths:
Services:
Type: RAC
Database is administrator managed

[oracle@vrh1 ~]$ srvctl stop database -d VPROD
PRCC-1016 : VPROD was already stopped
[oracle@vrh1 ~]$ srvctl start database -d VPROD  

[oracle@vrh1 ~]$ srvctl status  database -d VPROD
Instance VPROD1 is running on node vrh1
Instance VPROD2 is running on node vrh2

6. Update the environment variables in the oracle user's profile:

cat .bash_profile 

# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs

ORACLE_HOME=/s01/orabase/product/11.2.0/dbhome_2
ORACLE_SID=VPROD1
ORACLE_BASE=/s01/orabase
PATH=$ORACLE_HOME/bin:$ORACLE_HOME/OPatch:$PATH:$HOME/bin

export PATH ORACLE_HOME ORACLE_SID ORACLE_BASE
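The profile edit can also be scripted, which helps because every node's profile needs it, and a rollback needs the reverse. A minimal sketch, assuming the profile assigns ORACLE_HOME on a line of its own as above; switch_oracle_home is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the ORACLE_HOME= line of a profile.
# sed -i.bak works with GNU and BSD sed and keeps the original as <file>.bak.
switch_oracle_home() {
  local profile=$1 new_home=$2
  # '#' is used as the sed delimiter because the home path contains '/'.
  sed -i.bak "s#^ORACLE_HOME=.*#ORACLE_HOME=${new_home}#" "$profile"
}

# Usage on each node, then log in again so PATH is rebuilt from the new home:
#   switch_oracle_home ~/.bash_profile /s01/orabase/product/11.2.0/dbhome_2
```

Because PATH in this profile is built from $ORACLE_HOME, only the ORACLE_HOME line has to change; ORACLE_SID stays node-specific (VPROD1, VPROD2).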

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmac.cn

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

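7. After the upgrade the database enters a "pending" period. For at least two weeks, it is advisable neither to raise the compatible parameter nor to drop the restore point, because once COMPATIBLE has been raised the database can no longer be downgraded or flashed back to the restore point.

Once you are sure no rollback will be needed, raise compatible and drop the restore point: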

alter system set compatible='11.2.0.2.0' scope=spfile;

drop restore point  MACLEAN_ROLLBACK;

srvctl stop database -d VPROD 

srvctl start database -d VPROD

This completes the upgrade of the RAC database from 11.2.0.1 to 11.2.0.2.

V. Rolling back the upgrade (Database Downgrade)

There are two ways to roll back:

  1. Flash back to the 11.2.0.1 database through the restore point
  2. Run catdwgrd.sql to downgrade the data dictionary

Method 1:

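Shut down the instances on all nodes. Then, on one node only, switch the environment back to the old 11.2.0.1 home and start that instance in MOUNT mode (FLASHBACK DATABASE requires a mounted, not open, database; the startup step is not shown in the session below):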

srvctl stop database -d VPROD

export ORACLE_HOME=$OLD_ORACLE_HOME
export PATH=$OLD_ORACLE_HOME/bin:$PATH
unset LD_LIBRARY_PATH

sqlplus  / as sysdba

SQL> select * from v$restore_point;

       SCN DATABASE_INCARNATION# GUA STORAGE_SIZE
---------- --------------------- --- ------------
TIME
---------------------------------------------------------------------------
RESTORE_POINT_TIME                                                          PRE
--------------------------------------------------------------------------- ---
NAME
--------------------------------------------------------------------------------
    601958                     1 YES    462307328
07-SEP-11 07.52.59.000000000 PM
                                                                            YES
MACLEAN_ROLLBACK

SQL> flashback database to restore point MACLEAN_ROLLBACK;

Flashback complete.

How long FLASHBACK DATABASE takes depends on the volume of flashback logs to apply; it is usually fast, typically under a minute.

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open

SQL> alter database open resetlogs;

Database altered.

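The restore point method above is the one I recommend: it is simple, saves time and effort, is efficient, and rarely runs into problems, a clean solution. Do not forget to also use srvctl upgrade to restore the DBHOME information in the OCR, and to restore the profile file.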
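The SQL*Plus part of method 1 can be kept in a small wrapper so the steps always run in the same order. The bash sketch below only prints the commands, so they can be reviewed first or piped into sqlplus. rollback_sql is a hypothetical name, and the startup mount step is an assumption based on FLASHBACK DATABASE needing a mounted database; run it on one node, after srvctl stop database and with ORACLE_HOME/PATH pointing at the old home.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the SQL*Plus commands that flash the database
# back to a guaranteed restore point and open it with RESETLOGS.
rollback_sql() {
  local rp=$1
  # Accept only simple identifiers so nothing unexpected reaches SQL*Plus.
  if ! [[ $rp =~ ^[A-Za-z][A-Za-z0-9_]*$ ]]; then
    echo "invalid restore point name: $rp" >&2
    return 1
  fi
  cat <<EOF
startup mount;
flashback database to restore point ${rp};
alter database open resetlogs;
EOF
}

# Usage on a cluster node (not run here):
#   rollback_sql MACLEAN_ROLLBACK | sqlplus -s / as sysdba
```

Afterwards the OCR entry and the profile still have to be pointed back at the old home, as described above.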

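Method 2:
catdwgrd.sql comes with many restrictions, may take somewhat longer to run than catupgrd.sql, and can hit various errors while it runs, so this method is not recommended.

For downgrading a database from 11.2.0.2 to 11.2.0.1 with catdwgrd.sql, see MOS note <How To Downgrade From Database 11.2 To Previous Release (includes 11.2.0.2-11.2.0.1) [ID 883335.1]>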

Know the size of GCS and GES structures in the shared pool

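In a RAC environment a large part of the shared pool is taken up by GCS and GES resources. These resource objects are generally permanent (perm), so we cannot expect LRU aging or a flush of the shared pool to free them.

In RAC instances with a large buffer cache, querying the v$sgastat dynamic performance view typically shows components such as 'gcs resources', 'gcs shadows', 'ges resource' and 'ges enqueues' occupying a large amount of shared pool memory. To avoid the well-known ORA-04031 error in the shared pool, Oracle recommends setting a larger shared_pool_size initialization parameter in RAC environments. Explicitly setting a larger initial allocation (INITIAL_ALLOCATION) for the GCS and GES resource structures also helps avoid ORA-4031.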

The parameters that control the initial allocation of GES and GCS resource structures mainly include:

  • _gcs_resources: number of gcs resources to be allocated (GCS resources)
    The number of GCS resource structures is determined by the _gcs_resources parameter
    Stored in a segmented array
    Externalized in X$KJBR
    Number of free GCS resource structures in X$KJBRFX
  • _gcs_shadow_locks: number of pcm shadow locks to be allocated (GCS enqueues, i.e. shadows/clients)
    The number of GCS enqueue structures is determined by the _gcs_shadow_locks parameter
    Stored in a segmented array
    Externalized in X$KJBL
    Number of free GCS enqueue structures in X$KJBLFX
  • _lm_ress: number of resources configured for cluster database
    LM_RESS controls the number of resources that can be locked by each lock manager instance. These resources include lock resources allocated for DML, DDL (data dictionary locks), data dictionary and library cache locks, plus the file and log management locks.
    Stored in a heap
    Externalized in X$KJIRFT
  • _lm_locks: number of enqueues configured for cluster database
    Stored in a segmented array
    Externalized in X$KJILKFT

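To size the shared pool sensibly in a RAC environment (setting shared_pool_size explicitly does not disable AMM or ASMM; under automatic memory management it acts as a lower bound for the shared pool), we need to estimate how much shared pool space the large initial allocation of global resources will itself occupy.

The v$resource_limit view shows how these GES and GCS global resources are allocated: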

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmac.cn



SQL> select * from v$resource_limit where resource_name in ('gcs_resources', 'gcs_shadows','ges_ress','ges_locks'); 

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION        LIMIT_VALUE
------------------------------ ------------------- --------------- ------------------------- ------------------
ges_ress                                      7223            7486    1000000                 UNLIMITED
ges_locks                                     4944            5027    1000000                 UNLIMITED
gcs_resources                                 4021            4021     114466                    114466
gcs_shadows                                   3925            3925     114466                    114466
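To judge whether an INITIAL_ALLOCATION needs raising, compare MAX_UTILIZATION with it. The bash sketch below parses v$resource_limit output in the column layout shown above and prints each resource's peak usage as a percentage of its initial allocation. peak_pct is a hypothetical helper name, and the 80% default threshold is an arbitrary assumption, not an Oracle recommendation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read v$resource_limit rows
# (RESOURCE_NAME CURRENT MAX INITIAL LIMIT) on stdin and print
# MAX_UTILIZATION / INITIAL_ALLOCATION as a percentage.
# Header and dash lines are skipped because column 4 is not numeric there.
# Optional argument: flag threshold in percent (default 80, an assumption).
peak_pct() {
  awk -v thr="${1:-80}" 'NF >= 5 && $4 ~ /^[0-9]+$/ {
    pct = 100 * $3 / $4
    printf "%-15s %6.2f%%%s\n", $1, pct, (pct >= thr ? "  <-- near initial allocation" : "")
  }'
}

# Typical use: spool the query above to a file, then
#   peak_pct < resource_limit.txt
```

For the output above every resource peaks below 4% of its initial allocation (gcs_resources is highest at 3.51%), so there is no sign on this instance that these allocations need raising.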

The v$sgastat view shows how much space these global resources occupy:

select *
  from v$sgastat
 where name in
       ('ges resource ', 'ges enqueues', 'gcs resources', 'gcs shadows');

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  gcs resources                16483232
shared pool  gcs shadows                  11904560
shared pool  ges enqueues                 47809680
shared pool  ges resource                288405768

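A single gcs_resources structure takes about 120 bytes
A single gcs_shadows structure takes about 72 bytes
A single ges_resource structure takes about 288 bytes

The following formulas give a rough estimate of the minimum shared pool space the GES and GCS resource structures will occupy: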

gcs_resources = initial_allocation * 120 bytes = _gcs_resources * 120 bytes
gcs_shadows   = initial_allocation * 72 bytes  = _gcs_shadow_locks * 72 bytes
ges_resource  = initial_allocation * 288 bytes = _lm_ress * 288 bytes

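Note that these figures are only theoretical minimums; because of the way memory is allocated, actual usage is always larger, sometimes considerably so.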

For the example above:
gcs resources = 114466 * 120  = 13735920, well below the actual 16483232
gcs_shadows   = 114466 * 72   = 8241552, well below the actual 11904560
ges_resource  = 1000000 * 288 = 288000000, slightly below the actual 288405768

As a rule of thumb, multiplying the calculated value by 160% gives a more realistic estimate.
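The three formulas and the 160% rule of thumb can be combined into a quick calculator. A bash sketch, assuming the approximate per-structure sizes quoted above (120, 72 and 288 bytes) and integer arithmetic, so the 160% figure is rounded down; est_bytes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Approximate per-structure sizes quoted in the text (observed on 11.2.0.2).
GCS_RESOURCE_BYTES=120
GCS_SHADOW_BYTES=72
GES_RESOURCE_BYTES=288

# est_bytes <initial_allocation> <bytes_per_structure>
# Prints "<theoretical minimum> <minimum * 160%>" in bytes.
est_bytes() {
  local n=$1 size=$2
  local min=$(( n * size ))
  echo "$min $(( min * 16 / 10 ))"
}

est_bytes 114466  "$GCS_RESOURCE_BYTES"   # gcs resources (INITIAL_ALLOCATION from v$resource_limit)
est_bytes 114466  "$GCS_SHADOW_BYTES"     # gcs shadows
est_bytes 1000000 "$GES_RESOURCE_BYTES"   # ges resource
```

For the sample instance this gives 21977472, 13186483 and 460800000 bytes. The rule of thumb covers the observed GCS figures (16483232 and 11904560) but overshoots the observed ges resource (288405768), so treat the result as a conservative figure rather than an exact prediction.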

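Note that these formulas only provide a reference for tuning the shared pool size in a RAC environment. When v$resource_limit suggests that the initial allocation of GES or GCS resources should be increased, you can use the approach above to estimate the shared_pool_size or sga_target required.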
