latch free: SQL memory manager latch

On a 10.2.0.4 RAC system running on HP-UX, one of the nodes reported a large number of "latch free: SQL memory manager latch" wait events. The related diagnostic output and ADDM performance information follow:


SELECT KSMCHCLS CLASS,
       COUNT(KSMCHCLS) NUM,
       SUM(KSMCHSIZ) SIZ,
       TO_CHAR(SUM(KSMCHSIZ)/COUNT(KSMCHCLS)/1024, '999,999.00') || 'k' "AVG SIZE"
  FROM X$KSMSP
 GROUP BY KSMCHCLS;
CLASS         NUM        SIZ AVG SIZE
-------- -------- ---------- ------------
R-freea       927    9481080 9.99k
freeabl   1037478 2001975328 1.88k
R-free        593  170107192 280.14k
recr       515080  595805056 1.13k
R-perm        498   35345056 69.31k
R-recr          2       2224 1.09k
perm        97848 1320357168 13.18k
free       251577  161871664 .63k

DETAILED ADDM REPORT FOR TASK 'TASK_21286' WITH ID 21286
--------------------------------------------------------

Analysis Period: 26-OCT-2009 from 10:00:52 to 11:00:12
Database ID/Instance: 2429423994/1
Database/Instance Names: VPROD/VPROD
Host Name: crmdb1
Database Version: 10.2.0.4.0
Snapshot Range: from 10253 to 10254
Database Time: 420176 seconds
Average Database Load: 118 active sessions

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

FINDING 1: 75% impact (313574 seconds)
--------------------------------------
Database latches in the "Other" wait class were consuming significant database
time.

RECOMMENDATION 1: DB Configuration, 75% benefit (313574 seconds)
RATIONALE: The latch "SQL memory manager latch" with latch id 289 was
one of the top latches contributing to high "latch free" waits.
RELEVANT OBJECT: database latch# 289

SYMPTOMS THAT LED TO THE FINDING:
SYMPTOM: Wait class "Other" was consuming significant database time.
(75% impact [313620 seconds])
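
Before drilling into the AWR sections below, the latch can also be confirmed directly on the affected instance. A minimal sketch against the standard v$ views (an illustration added here, not part of the ADDM output):

-- which latch is hot, and how often waiters had to sleep on it
SELECT latch#, name, gets, misses, sleeps, wait_time
  FROM v$latch
 WHERE name = 'SQL memory manager latch';

-- system-wide "latch free" waits for comparison
SELECT event, total_waits, time_waited
  FROM v$system_event
 WHERE event = 'latch free';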

FINDINGS
========
Load Profile

                         Per Second    Per Transaction
Redo size:               534,863.07           8,586.10
Logical reads:           527,828.57           8,473.18
Block changes:            13,065.49             209.74
Physical reads:            3,999.96              64.21
Physical writes:             617.50               9.91
User calls:                9,869.10             158.43
Parses:                    3,287.89              52.78
Hard parses:                  27.90               0.45
Sorts:                     1,564.66              25.12
Logons:                        4.35               0.07
Executes:                  5,423.19              87.06

COMMENTS
==========
Load in terms of GETS is tremendous and no particular SQL is responsible for a
large percentage of the load. It may simply be too much for the hardware/configuration.
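
A quick way to sanity-check the "no single SQL dominates" claim on the live instance is to rank cursors by buffer gets; a rough sketch (cumulative figures from the shared pool, illustration only):

SELECT *
  FROM (SELECT sql_id, buffer_gets, executions,
               ROUND(RATIO_TO_REPORT(buffer_gets) OVER () * 100, 2) pct_of_total
          FROM v$sqlarea
         ORDER BY buffer_gets DESC)
 WHERE ROWNUM <= 10;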

HOWEVER
=========
I/O, while moderate in volume, is very slow.

Tablespace IO Stats

* ordered by IOs (Reads + Writes) desc

Tablespace          Reads  Av Reads/s  Av Rd(ms)  Av Blks/Rd   Writes  Av Writes/s  Buffer Waits  Av Buf Wt(ms)
--------------- ---------  ----------  ---------  ----------  -------  -----------  ------------  -------------
TBS_CRM_TS_S1   8,340,219         772       4.48        2.95  356,810           33    17,284,247           2.15
TBS_CRM_IDX_S1  1,643,104         152       7.89        1.22  298,265           28        22,684           4.34
TEMP              762,343          71       1.09        6.67  411,652           38             0           0.00
TBS_BAKDATA       358,137          33       6.00        4.11   93,672            9       335,632           6.32
TBS_RM_TS_S1      400,528          37       9.57        1.09   39,000            4         1,546          10.49
TBS_RM_IDX_S1     399,888          37       8.71        1.01   33,342            3         1,577           9.74
TBS_PROD          422,348          39       3.68        1.00      941            0             5          12.00
TBS_SP_TS_S1      341,104          32       6.26       12.19      504            0            44           5.68
TBS_BFBAKDATA     187,013          17       2.34        7.49      909            0       104,797           2.75
TBS_PF            141,810          13       8.76        1.79    8,571            1           563          11.39
TBS_BSS_SYS       118,965          11       2.26       14.72    2,171            0             7           0.00

So, in terms of Av Rd(ms), the storage is delivering only about one third of typical performance.
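
The same read-latency picture can also be pulled straight from the instance; a sketch using the cumulative file statistics (READTIM is in centiseconds, hence the *10 to convert to milliseconds):

SELECT d.name,
       f.phyrds,
       ROUND(f.readtim * 10 / NULLIF(f.phyrds, 0), 2) avg_rd_ms
  FROM v$filestat f, v$datafile d
 WHERE f.file# = d.file#
 ORDER BY avg_rd_ms DESC;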

SEGMENTS
==========
Segments by Logical Reads

* Total Logical Reads: 5,699,755,248
* Captured Segments account for 72.1% of Total

Owner  Tablespace Name  Object Name            Subobject Name  Obj. Type  Logical Reads  %Total
-----  ---------------  ---------------------  --------------  ---------  -------------  ------
CRM    TBS_CRM_TS_S1    PROD_2_TD                              TABLE        454,967,408    7.98
RM     TBS_RM_TS_S1     PHONE_NUMBER                           TABLE        435,121,200    7.63
EVENT  TBS_CRM_TS_S1    CUST_INTERACTION                       TABLE        386,838,512    6.79
CRM    TBS_CRM_TS_S1    CO_2_TD                                TABLE        383,932,304    6.74
CRM    TBS_CRM_TS_S1    PARTY_IDENTIFY_EXTEND                  TABLE        298,659,184    5.24

So these 5 individual segments alone are the target of roughly a third of all gets (captured segments as a whole account for 72%).

ACTION PLAN
============
Please do:

(1) Disk I/O
-----------------
Please ask your SA why Disk I/O is so slow.

(2) Segments
---------------------
Please consider partitioning these 5 objects to spread out GETS and READS (see the sketch after the list):

Owner  Tablespace Name  Object Name
---------------------------------------------
CRM    TBS_CRM_TS_S1    PROD_2_TD
RM     TBS_RM_TS_S1     PHONE_NUMBER
EVENT  TBS_CRM_TS_S1    CUST_INTERACTION
CRM    TBS_CRM_TS_S1    CO_2_TD
CRM    TBS_CRM_TS_S1    PARTY_IDENTIFY_EXTEND
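
As a rough illustration of that recommendation (a sketch only, not a tested design): hash partitioning spreads the gets and reads of a hot table across several segments. The partition key PROD_ID and the partition count below are hypothetical; on a production system an online rebuild with DBMS_REDEFINITION, plus rebuilding the indexes as local partitioned indexes, would normally be preferred.

-- hypothetical key column and partition count
CREATE TABLE CRM.PROD_2_TD_NEW
PARTITION BY HASH (PROD_ID)
PARTITIONS 16
AS SELECT * FROM CRM.PROD_2_TD;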


The diagnosis found that the shared pool still had a fair amount of free memory; the real anomaly was the I/O response time, with Av Rd(ms) slower than the typical 1 ms to 5 ms.
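
The free-memory figure can be confirmed without querying X$KSMSP (which itself takes shared pool latches and can be expensive on a large shared pool); a sketch against the standard gv$ views:

SELECT inst_id, pool, name, ROUND(bytes / 1024 / 1024) mb
  FROM gv$sgastat
 WHERE pool = 'shared pool'
   AND name = 'free memory'
 ORDER BY inst_id;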


The system administrator (SA) needs to confirm further whether there is a performance problem with the storage I/O.


MOS Note 457063.1, "Slow I/O On HP Unix", describes several ways to address I/O performance bottlenecks on the HP-UX platform; it is quoted below:


 Slow I/O On HP Unix

Applies to:
Oracle Server - Enterprise Edition - Version: 8.1.7.0 to 10.2.0.3 - Release: 8.1.7 to 10.2
HP OpenVMS VAX
HP-UX PA-RISC (64-bit)
HP-UX PA-RISC (32-bit)
HP OpenVMS Alpha
HP-UX Itanium
HP Tru64 UNIX
Symptoms

Disk utility output shows 100% usage for the disk continuously:

ProcList CPU Rpt Mem Rpt Disk Rpt NextKeys SlctProc Help Exit
GlancePlus C.04.50.00 19:15:39 bplita3 ia64 Current Avg High
----------------------------------------------------------------------
CPU  Util S SN NU U      |  90%  90%  90%
Disk Util F       F      | 100% 100% 100%   <=== Disk too heavily loaded

In a Statspack or AWR report, high 'free buffer waits' are seen even after considerably increasing db_cache_size and the number of database writers.

Statspack Report

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   99.99       Redo NoWait %:     99.99
            Buffer  Hit   %:   58.92    In-memory Sort %:     99.82
            Library Hit   %:   89.19        Soft Parse %:     83.51
         Execute to Parse %:   50.76         Latch Hit %:    100.00
Parse CPU to Parse Elapsd %:   81.48     % Non-Parse CPU:     93.21

….

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
free buffer waits                                  76,390      74,807    55.82
enqueue                                             4,071      11,625     8.67
…..

Cause

The problem is related to Disk I/O.
Solution

1. Use the O/S striping software

 Try to use the O/S striping software to distribute database files over as many disks as you can.

2. Use Direct IO
Mount the filesystem with the direct I/O options.
For example:
% mount -F vxfs -o remount,nodatainlog,mincache=direct,convosync=direct /dev/vg00/lv_ora /soft/oracle

mincache and convosync
“mincache=direct” => bypass buffer cache on read
“convosync=direct” => force direct I/O for DB writers

Mincache=direct and convosync=direct allow data to be transferred directly from Oracle buffer cache to disk and disk to Oracle buffer cache. This avoids double buffering by bypassing the file system buffer cache and can improve physical read/write performance. However, cases where the disk read could have been avoided because a required block was in file system buffer cache may be negatively impacted.

If your filesystem is mounted with these options, then the FILESYSTEMIO_OPTIONS parameter needs to be set to "SETALL" in order to use direct I/O.
Parameters in Oracle influencing the use of Direct IO

    FILESYSTEMIO_OPTIONS defines the I/O operations on filesystem files. This parameter should not normally be set by the user.
    The value may be any of the following:
        asynch - Set by default on HP. This allows asynchronous I/O to be used where supported by the OS.
        directIO - This allows direct I/O to be used where supported by the OS. Direct I/O bypasses any Unix buffer cache.
        setall - Enables both asynchronous and direct I/O.
        none - This disables asynchronous I/O and direct I/O so that Oracle uses normal synchronous writes, without any direct I/O options.

See
Document 120697.1 Init.ora Parameter “FILESYSTEMIO_OPTIONS” Reference Note

Oracle recommends that you use the value SETALL instead of the value DIRECTIO, because the DIRECTIO value disables asynchronous I/O.

    DISK_ASYNCH_IO controls whether I/O to datafiles, control files, and logfiles is asynchronous (that is, whether parallel server processes can overlap I/O requests with CPU processing during table scans). If your platform supports asynchronous I/O to disk, Oracle recommends that you leave this parameter set to its default value. However, if the asynchronous I/O implementation is not stable, you can set this parameter to false to disable asynchronous I/O. If your platform does not support asynchronous I/O to disk, this parameter has no effect.

    If you set DISK_ASYNCH_IO to false, then you should also set DBWR_IO_SLAVES or DB_WRITER_PROCESSES to a value other than its default of zero in order to simulate asynchronous I/O.

    DB_WRITER_PROCESSES or DBWR_IO_SLAVES
    see comments in DISK_ASYNCH_IO
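
As an aside to the quoted note, a minimal sketch of checking and changing these parameters from SQL*Plus; FILESYSTEMIO_OPTIONS is not dynamic, so the new value only takes effect after an instance restart:

SHOW PARAMETER filesystemio_options
SHOW PARAMETER disk_asynch_io

ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE = SPFILE;
-- restart the instance for the change to take effect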

3. Concurrent I/O
An alternative solution to Direct I/O is to use Concurrent I/O. Concurrent I/O is available in OnlineJFS 5.0.1.

To enable Concurrent I/O, the filesystem must be mounted with “-o cio”.
Eg:

mount -F vxfs -o nodatainlog,cio /soevxfs/redos /oracle/mnt/redos

Please note that remount should not be used to enable Concurrent I/O on mounted filesystems.

“-o cio”
Concurrent I/O allows multiple processes to read from or write to the same file without blocking other read(2) or write(2) calls.With Concurrent I/O, the read and write operations are not serialized.  This advisory is generally used by applications that require high performance for accessing data and do not perform overlapping writes to the same file. It is the responsibility of the application or the running threads to coordinate the write activities to the same file. It also avoids double buffering by bypassing the filesystem buffer cache and thus improves physical read/write performance significantly. Concurrent I/O performs very close to that of raw logical volumes.

Oracle Internal Error ORA-00600 [OSDEP_INTERNAL]: A Case Study

A 9.2.0.5 system on HP-UX hit the internal error "ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []" during a shutdown abort, accompanied by "ORA-27302: failure occurred at: skgpwinit4" and "ORA-27303: additional information: attach to invalid skgp shared ctx". The trace log is as follows:

/opt/oracle/product/9.2.0.5/rdbms/log/ngende_ora_7669.trc
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.5.0 - Production
ORACLE_HOME = /opt/oracle/product/9.2.0.5
System name: HP-UX
Node name: yictngd3
Release: B.11.23
Version: U
Machine: ia64
Instance name: nGende
Redo thread mounted by this instance: 0 
Oracle process number: 0
7669

*** 2010-09-08 00:10:02.985
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwinit4
ORA-27303: additional information: attach to invalid skgp shared ctx
Current SQL information unavailable - no session.

Call stack
--------------
ksedmp <- ksfdmp <- kgerinv <- kgerin <- kgerecoserr <- ksucrp <- ksucresg <- kpolna 
<- kpogsk <- opiodr <- ttcpip <- opitsk <- Cannot <- Cannot <- Cannot <- Cannot <- opiino 
<- opiodr <- opidrv <- sou2o <- main <- main_opd_entry

Investigation showed that this internal error is related to operating-system shared memory. The relevant MOS notes are:

Ora-00600: Internal Error Code, Arguments: [Osdep_internal] [ID 304027.1]
Applies to:
Oracle Server - Enterprise Edition - Version: 9.2.0.2 to 10.2.0.3 - Release: 9.2 to 10.2
Information in this document applies to any platform.
***Checked for relevance on 03-NOV-2010***

Getting ORA-600 [OSDEP_INTERNAL] errors while starting up the database:

ORA-00600: internal error code, arguments: [OSDEP_INTERNAL],
[], [], [], [], [], [], []
ORA-27302: failure occurred at: skgpwreset1
ORA-27303: additional information: invalid shared ctx
ORA-27146: post/wait initialization failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpsemsper
Symptoms
Getting ORA-600 [OSDEP_INTERNAL]
Accompanied by the following errors:
ORA-27302: failure occurred at: skgpwreset1
ORA-27303: additional information: invalid shared ctx
ORA-27146: post/wait initialization failed
ORA-27300: OS system dependent operation: semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpsemsper

Cause
The functions in the generated trace file point to the semaphore settings.
SEMMNS is set too low.

Solution
Set semmns to 32767.
Make the change persistent as appropriate for the operating system, then restart the server and check that the change persists.
eg: Linux /etc/sysctl.conf

sem = semmsl semmns semopm semmni
kernel.sem = 256 32768 100 228
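
A related cross-check from the database side (an added sketch, not part of the note): the number of semaphores an instance requests at startup is driven largely by its PROCESSES parameter, so SEMMNS should comfortably cover the sum of PROCESSES across all instances on the host.

SELECT name, value
  FROM v$parameter
 WHERE name = 'processes';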

Getting ORA-00600 [OSDEP_INTERNAL]: Internal Error While Trying To Connect / As Sysdba [ID 253885.1]

Applies to:
Oracle Server - Enterprise Edition - Version: 9.2.0.3 and later   [Release: 9.2 and later ]
HP-UX PA-RISC (64-bit)
Symptoms
Getting following error while trying to connect as sysdba using sqlplus:

SQL> conn / as sysdba
ERROR:
ORA-01041: internal error. hostdef extension doesn't exist

Alert.log shows:

ORA-00600: internal error code, arguments: [OSDEP_INTERNAL], [], [], [], [], [],[], []
ORA-27302: failure occurred at: skgpwinit4
ORA-27303: additional information: attach to invalid skgp shared ctx
Cause
- Database was shut down using the "shutdown abort" option.
- Shared memory segment was not removed even though the instance was down.
Solution
+ Check which shared memory segments are owned by the oracle owner

Use the ipcs -bm command:

% ipcs -bm

m 34034336 0xf8f18468 --rw-r----- ORACLE dba 16777216

+ Delete the 'orphan' shared memory segments:

% ipcrm -m 34034336

If there is more than one instance running on the server and you are not sure how to identify the shared
memory segments then please contact support.

Improperly set OS VM kernel parameters can cause this problem. On the HP-UX PA-RISC platform, the internal error can also appear after a 'shutdown abort' because the shared memory segment was not removed properly. Since the instance was closed with 'abort' anyway and only the shared memory segment failed to be released, it is enough to free the corresponding segment with OS commands such as ipcs and ipcrm; this has no other side effects.

Oracle Recommended Kernel Parameter settings for HP Itanium v3 11.31

Siebel Performance Tuning Guide > Tuning Operating Systems for Performance >
Tuning Siebel Business Applications for HP-UX > Tuning Kernel Settings for HP-UX --
http://download.oracle.com/docs/cd/E14004_01/books/PerformTun/PerformTunOS11.html#wp1243470
Modify the HP-UX kernel parameters to values like those shown below (suggested guidelines).
Use the HP-UX System Administration Manager (SAM) tool to make these changes.

nproc                       4096 - 4096
ksi_alloc_max              32768 - (NPROC*8)
max_thread_proc             4096 - 4096
maxdsiz               0x90000000 - 0X90000000
maxdsiz_64bit         2147483648 - 2147483648
maxfiles                    4000 - 4000
maxssiz                401604608 - 401604608
maxssiz_64bit         1073741824 - 1073741824
maxtsiz               0x40000000 - 0X40000000
maxusers                     128 - 128
msgmap                      4098 - (NPROC+2)
msgmni                      4096 - (NPROC)
msgtql                      4096 - (NPROC)
ncallout                    8000 - 8000
nclist                      2148 - (100+16*MAXUSERS)
ncsize                     35840 - (8*NPROC+2048+VX_NCSIZE)
nfile                      67584 - (16*NPROC+2048)
ninode                     34816 - (8*NPROC+2048)
nkthread                    7184 - (((NPROC*7)/4)+16)
nproc                       4096 - 4096
nsysmap                     8192 - ((NPROC)>800?2*(NPROC):800)
nsysmap64                   8192 - ((NPROC)>800?2*(NPROC):800)
semmap                      1026 - 1026
semmni                      1024 - 1024
semmns                     16384 - ((NPROC*2)*2)
semmnu                      2048 - 2048
semume                       256 - 256
shmmax                0x40000000 Y 0X40000000
shmmni                      1024 - 1024
shmseg                      1024 Y 1024
vps_ceiling                    64 - 64

------------------------------------------------------------------
maxusers
nclist
ncsize
nfile
semmap
semmnu
semume

This short note is to let you know that I started a new thread in the support-engineering forum on the topic you raised.

For the record, in my initial research, I found that only 4 of the 8 Kernel parameters you specified have been rendered obsolete.

nfile                     maximum number of open files (system-wide) http://docs.hp.com/en/B2355-60105/nfile.5.html
nfile is obsolete --    http://docs.hp.com/en/B2355-60130/nfile.5.html

nsysmap             number of entries in a kernel dynamic memory allocation map http://docs.hp.com/en/B2355-60105/nsysmap.5.html
nsysmap is obsolete -- http://docs.hp.com/en/B2355-60130/nsysmap.5.html

nsysmap64         number of entries in a kernel dynamic memory allocation map http://docs.hp.com/en/B2355-60105/nsysmap.5.html
nsysmap64 is obsolete -- http://docs.hp.com/en/B2355-60130/nsysmap.5.html

semmap               "specifies the size of the free-space resource map used for allocating new System V IPC semaphores in shared memory." http://docs.hp.com/en/939/KCParms/KCparam.SemMap.html
semmap is obsolete -- per the 'HP-UX 11i Version 2 December 2007 Release Notes, Operating Environments Update Release': "System-V IPC Kernel Tunable Parameter (semmap) (Obsolete): The memory allocation of semaphore sets previously controlled by semmap is now done dynamically by the kernel. The semmap kernel parameter is no longer tunable."

nclist                   number of cblocks for pty and tty data transfers http://docs.hp.com/en/939/KCParms/KCparam.Ncallout.html
nclist is not obsolete -- http://docs.hp.com/en/B2355-60130/nclist.5.html

ncsize                  number of Directory Name Lookup Cache (DNLC) entries http://docs.hp.com/en/B2355-60105/ncsize.5.html
ncsize is not obsolete -- http://docs.hp.com/en/B2355-60130/ncsize.5.html

semmnu             maximum number of System V IPC undo structures for processes http://docs.hp.com/en/B2355-60105/semmnu.5.html
semmnu is not obsolete -- http://docs.hp.com/en/B2355-60130/semmnu.5.html

semume             maximum number of System V IPC undo entries per process http://docs.hp.com/en/B3921-90010/semume.5.html
semume is not obsolete-- http://docs.hp.com/en/B2355-60130/semume.5.html

According to HP, the file table, system memory map, and semaphore map are
all dynamically sized in the 11iv3 kernel, so tunables specifying their maximum size are no longer relevant.
That Oracle's document still contains recommendations for "nfile" is mostly due to
the Feb 2007 initial release of 11iv3, which had a VERY POOR implementation of the file table structures.
Please note this has now been fixed.

As HP-UX 11iv3 has also finally implemented man pages for all the kernel tunables (or at least most of them),
this should be considered a good starting point for further exploration of the available options.


Rule:        A00170
Description: This rule verifies that the kernel parameters are set according to the Oracle Database Installation Guide 10g Release 2 (10.2) for HP-UX.
Result:      SHMMAX=137438953472 too small
Action:      The kernel parameters are not set according to the Oracle Database Installation Guide 10g Release 2 (10.2) for HP-UX. For the recommended values, see the Oracle Database Installation Guide 10g Release 2 (10.2) for HP-UX.
             One or more kernel parameters are set to less than the minimum requirement. Refer to the following minimum parameter values:
ksi_alloc_max=32768
executable_stack=0
max_thread_proc=1024
maxswapchunks=16384
maxuprc=3686
msgmap=4098
msgmni=4096
msgseg=32767
msgtql=4096
ncsize=35840
nfile=63488
nflocks=4096
ninode=34816
nkthread=7184
nproc=4096
semmap=4098
semmni=4096
semmns=8192
semmnu=4092
semvmx=32767
shmmax=size of RAM
shmmni=512
shmseg=120
vps_ceiling=64
maxdsiz=1073741824
maxdsiz_64bit=2147483648
maxssiz=134217728
maxssiz_64bit=1073741824
Rule:          A00175
Description:   This rule checks that the OS links and libraries exist as required before the installation of Oracle Database 10g.
Result/Action: 11.31 Some of the required OS libraries could not be found. Refer to the following:
