RAC中增大db_cache_size引发的ORA-04031错误

几个礼拜前, 有一套10.2.0.2 的 二节点RAC 数据库因为增大db_cache_size , 引发其中一个实例发生著名的ORA-04031 错误,日志如下:

 

Errors in file /oracle/oracle/admin/maclean/udump/u1_ora_13757.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 1048 bytes of shared memory
("shared pool","select name,online$,contents...","Typecheck","kgghteInit")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 1048 bytes of shared memory
("shared pool","select name,online$,contents...","Typecheck","seg:kggfaAllocSeg")
Thu Oct 13 08:25:05 2011
Log from www.askmac.cn  & www.askmac.cn
Errors in file /oracle/oracle/admin/maclean/udump/u1_ora_1444.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4120 bytes of shared memory
("shared pool","select name,online$,contents...","Typecheck","kgghtInit")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4120 bytes of shared memory
("shared pool","select name,online$,contents...","Typecheck","kgghtInit")

 

以上错误出现的同时实例出现大量的row cache lock字典缓存和cursor:pin S wait on X等待事件,说明共享池中的row cache字典缓存和SQL area 执行计划因为Free Memory不足而被不断换出,导致硬解析增多并SQL解析性能下降,进一步造成了应用程序挂起,赶到现场后对该ORA-04031错误进行了分析。

SGA中的内存池包含不同大小的内存块。当数据库启动时,会有一块大的内存被分配并使用Free list的空闲列表追踪。随着时间推移,这些内存被不断分配和释放,内存块(chunk)被按照其大小在不同的Fress list中移动,当SGA里任何一个内存池出现不能满足内部分配一整块连续的内存块请求时,就可能出现ORA-04031错误。实际使用中造成ORA-04031错误的原因可能是Oracle软件bug、产品缺陷、应用程序设计不当、Oracle内存参数设置不当。

 

这里出现ORA-04031错误的内存池是shared pool即共享池,为了搞清楚ORA-04031错误发生的实际原因,我们通过AWR报告分析共享池的使用情况。

 

以下是 ORA-04031 问题发生前一天AWR报告中的共享池内存使用情况:

 

Pool Name Begin MB End MB % Diff
large free memory 112.00 112.00 0.00
shared ASH buffers 25.60 25.60 0.00
shared CCursor 19.44 20.16 3.70
shared Checkpoint queue 5.87 5.87 0.00
shared PCursor 10.57 11.14 5.38
shared event statistics per sess 7.72 7.72 0.00
shared free memory 32.99 33.00 0.02
shared gcs resources 78.75 78.75 0.00
shared gcs shadows 49.61 49.61 0.00
shared ges big msg buffers 15.03 15.03 0.00
shared ges reserved msg buffers 7.86 7.86 0.00
shared ges resource 5.28 5.28 0.00
shared kglsim heap 16.63 16.63 0.00
shared kglsim object batch 25.63 25.63 0.00
shared library cache 21.32 22.01 3.23
shared row cache 7.13 7.13 0.00
shared sql area 64.06 61.55 -3.91
streams free memory 64.00 64.00 0.00
buffer_cache 3,936.00 3,936.00 0.00
fixed_sga 2.08 2.08 0.00
log_buffer 3.09 3.09 0.00

 

以下是 ORA-04031 问题发生时AWR报告中的共享池内存使用情况:

 

Pool Name Begin MB End MB % Diff
large free memory 112.00 112.00 0.00
shared ASH buffers 25.60 25.60 0.00
shared Checkpoint queue 5.87 5.87 0.00
shared KCL name table 9.00 9.00 0.00
shared event statistics per sess 7.72 7.72 0.00
shared free memory 25.56 25.52 -0.12
shared gcs resources 143.39 143.39 0.00
shared gcs shadows 90.33 90.33 0.00
shared ges big msg buffers 15.03 15.03 0.00
shared ges reserved msg buffers 7.86 7.86 0.00
shared library cache 7.59 7.65 0.80
shared row cache 7.13 7.13 0.00
shared sql area 8.70 7.35 -15.57
streams free memory 64.00 64.00 0.00
buffer_cache 7,168.00 7,168.00 0.00
fixed_sga 2.09 2.09 0.00
log_buffer 3.09 3.09 0.00

 

红色部分标注了2个报告中差异最大的地方,在问题发生时共享池中gcs resources和gcs shadows 2种资源对比前一天增长了169M。 gcs资源在共享池中享有较高的优先级, 而普通的SQL语句或执行计划享有较低的优先级,因为gcs资源所占用空间的大量膨胀,导致在没有调大共享池大小的情况下sql area和row cache内存资源被换出进而引发SQL解析性能下降和ORA-04031问题。

 

gcs resources和gcs shadow资源均是Oracle RAC中特有的全局缓存服务资源,这些资源负责处理RAC中的全局buffer cache。 同时这些资源所占用共享池的空间视乎Oracle实例所使用高速缓存的大小而决定,Metalink文档说明了该问题:

 

“The ‘gcs resources’ and ‘gcs shadows’ structures are used for handling buffer caches in RAC, so their memory usages are depending on buffer cache size. We can use V$RESOURCE_LIMIT to monitor them.”

 

当实例高速缓存buffer cache被大小时gcs资源所占用的空间也相应增长,具体算法如下:

 

‘gcs_resources’ = initial_allocation * 120 bytes = “_gcs_resources parameter” * 120 bytes
‘gcs_shadows’ = initial_allocation * 72 bytes = “_gcs_shadow_locks parameter” * 72 bytes

select * from v$resource_limit where resource_name like '%gcs%';

RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
------------------------------ ------------------- --------------- ---
gcs_resources 507772 514607 976083 976083
gcs_shadows 133862 139927 976083 976083

 

我们可以通过现有的v$resource_limit视图中的INITIAL_ALLOCATION估算Buffer cache增加后的INITIAL_ALLOCATION数量,例如我们准备将db_cache_size从10g增加到20g,那么可以通过下列公式算出有必要增加的共享池大小:

 

add_to_shared_pool_size= 140 * Buffer_cache增加的兆数 * 192 bytes * 1.6

= 140 * 10* 1024 * 192 * 1.6 = 440401920 = 420M

 

问题总结

 

由于RAC环境中Oracle 使用共享池中的gcs resource/shadow 资源管理 全局缓存 , 当实例的Buffer Cache总量增加时gcs resource/shadow 这些资源的数目也会相应上升 , 这导致共享池中可用的剩余空间大幅下降,又因为 gcs 全局缓存资源在共享池中享有较高的优先级( perm ,且在10.2中 gcs资源不能和其他如row cache或library cache 共享一个Extent的内存区间) , 引发了大量的row/dictionary cache字典缓存和SQL执行计划被换出共享池, 引发大量的解析等待: cursor pin s on x 和 row cache lock ,  最终还是 没有避免ORA-04031 的错误被触发。 这里要补充一句, 因为这套10g 的系统没有 启用ASMM(Automatic Shared Memory Management) 特性 且 共享池本身设置地较小(shared_pool_size=512MB) , 都是导致该ORA-04031 错误较为显性地被触发的因素 。

 

<深入了解ASMM>中我介绍了ASMM的一些优点, 这里再罗列一下:

手动管理SGA的缺点在于:

  • 个别组件如shared pool、default buffer pool的大小存在最优值,但组件之间无法交换内存
  • 在9i中就提供了多种内存建议(advisor),但都要求人工手动干预
  • 无法适应工作负载存在变化的环境
  • 往往会导致内存浪费,没有用到实处
  • 若设置不当,引发著名的ORA-04031错误的可能性大大提高

ASMM自动管理SGA的优势在于:

  • 全自动的共享内存管理
  • 无需再配置每一个内存组件大小参数
  • 仅使用一个参数sga_target驱动
  • 有效利用所有可用的内存,极大程度上减少内存浪费
  • 对比MSMM其内存管理模式:
    • 更加动态
    • 更加灵活
    • 并具备适应性
  • 易于使用
  • 一定程度上增强了性能,因为内存分配更为合理了
  • 当某个组件急需更多内存时可以有效提供,因此可以一定程度避免ORA-04031的发生

 

解决方案

 

可以通过以下2种方式避免该RAC环境中特有的由增大Buffer_Cache导致GCS资源空间膨胀造成的ORA-04031问题:

  1. 在增加Buffer_Cache的同时估算所需相应增长的shared pool共享池大小
  2. 使用10g的SGA自动管理内存ASMM方式可以一定程度上避免ORA-04031错误的发生,但是自动管理方式存在实际使用时存在一些缺点,建议在启用ASMM:SGA_Target的同时,设置 shared_pool_size和 db_cache_size的最小大小,以最大可能避免因resize造成的问题; 同时建议设置_enabled_shared_pool_duration=false,禁用shared pool duration特性,也可以一定程度上减少ORA-04031发生的概率。

_enabled_shared_pool_duration:该参数控制是否启用10g中特有的shared pool duration特性,当我们设置sga_target为0时该参数为false;同时在10.2.0.5前若cursor_space_for_time设置为true时该参数也为false,不过在10.2.0.5以后cursor_space_for_time参数被废弃

Know GCS AND GES structure size in shared pool

RAC环境中共享池很大一部分被gcs和ges资源所占用,一般来说这些资源对象都是永久的(perm)的,所以我们无法期待LRU或flush shared_pool操作能够清理这些资源。

在使用大缓存(large buffer cache)的RAC实例环境中,查询v$sgastat内存动态性能视图时总是能发现’gcs resources’、’gcs shadows’、’ ges resource’、’ges enqueues ‘这些组件占用了共享池中的大量内存,为了避免shared pool出现著名的ORA-04031错误,Oracle推荐在RAC环境中设置较大的shared_pool_size初始化参数,此外显示地设置较大的GCS和GES资源结构的初始化分配数(INITIAL_ALLOCATION)也有利于避免ORA-4031。

这些控制GES和GCS资源结构初始化分配数量的参数主要包括:

  • _gcs_resources  number of gcs resources to be allocated GCS Resources Number of GCS resource structures determined by
    _gcs_resources parameter
    Stored in segmented array
    Externalized in X$KJBR
    Number of free GCS resource structures in X$KJBRFX
  • _gcs_shadow_locks number of pcm shadow locks to be allocated GCS Enqueues (Shadows/Clients) Number of GCS enqueue structures determined by  _gcs_shadow_locks parameter Stored in segmented array
    Externalized in X$KJBL
    Number of free GCS enqueue structures in X$KJBLFX
  • _lm_ress number of resources configured for cluster database LM_RESS controls the number of resources that can be locked by each lock manager instance. These resources include lock resources allocated for DML, DDL (data dictionary locks), data dictionary, and library cache locks plus the file and log management locks. Stored in heap
    Externalized in X$KJIRFT
  • _lm_locks number of enqueues configured for cluster database Stored in segmented array
    Externalized in X$KJILKFT

为了更好地在RAC环境中设置shared_pool_size共享池的大小(手动设置该参数并不会disable AMM or ASMM),我们很有必要评估大量初始化分配的全局资源本身将占用shared pool多大的空间。

我们可以通过v$resource_limit视图了解这些GES、GCS全局资源的分配情况:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmac.cn



SQL> select * from v$resource_limit where resource_name in ('gcs_resources', 'gcs_shadows','ges_ress','ges_locks'); 

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION        LIMIT_VALUE
------------------------------ ------------------- --------------- ------------------------- ------------------
ges_ress                                      7223            7486    1000000                 UNLIMITED
ges_locks                                     4944            5027    1000000                 UNLIMITED
gcs_resources                                 4021            4021     114466                    114466
gcs_shadows                                   3925            3925     114466                    114466

可以通过v$sgastat视图了解这些全局资源占用了多少空间:

select *
  from v$sgastat
 where name in
       ('ges resource ', 'ges enqueues', 'gcs resources', 'gcs shadows');

POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  gcs resources                16483232
shared pool  gcs shadows                  11904560
shared pool  ges enqueues                 47809680
shared pool  ges resource                288405768

单个gcs_resources结构大约占用120 bytes
单个gcs_shadows 结构大约占用72 bytes
单个ges_resource 结构大约占用288 bytes

我们可以使用一下初步估算GES、GCS资源结构将至少占用多大的共享池资源:

‘gcs_resources’ = initial_allocation * 120 bytes = “_gcs_resources parameter” * 120 bytes
‘gcs_shadows’ = initial_allocation * 72 bytes = “_gcs_shadow_locks parameter” * 72 bytes
‘ges_resource’= initial_allocation * 288 bytes = “_lm_ress parameter ” * 288 bytes

注意这里计算出的仅仅是理论的最小值,实际值因为内存分配的机制所以必然会远大于计算值

如上例中 gcs resources = 114466 * 120 =13735920 << 实际值的16483232
gcs_shadows = 114466 * 72 = 8241552 << 实际值的11904560
ges_resource = 1000000 * 288 = 288000000 < 实际的288405768

一般来说我们将计算值 * 160% 后可以得出一个较为客观的估算值。

注意以上公式只是为我们在RAC环境中调优共享池的大小提供参考的依据。当我们观察v$resource_limit视图并认为需要提高GES、GSC资源的初始化分配数目时,可以参照上述方式估算出必要的shared_pool_size或sga_target大小。

解决UDE-31623错误一例

一套Linux x86-64上的11.2.0.2数据库在使用data pump数据泵expdp工具导出时遭遇了ORA-31623错误,详细日志如下:

[oracle@rh2 dbs]$ expdp system/password dumpfile=s01:sh.dmp schemas=sh

Export: Release 11.2.0.2.0 - Production on Mon May 2 22:46:57 2011

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

UDE-31623: operation generated ORACLE error 31623
ORA-31623: a job is not attached to this session via the specified handle
ORA-06512: at "SYS.DBMS_DATAPUMP", line 3326
ORA-06512: at "SYS.DBMS_DATAPUMP", line 4551
ORA-06512: at line 1

/* 伴随以上ORA-31623错误在告警日志中出现了以下记录 */

2011-05-02 22:47:05.792000 +08:00
Errors in file /s01/orabase/diag/rdbms/prod/PROD1/trace/PROD1_ora_22060.trc  (incident=54525):
ORA-04031: unable to allocate 56 bytes of shared memory 
("streams pool","unknown object","streams pool","fixed allocation callback")
Incident details in: /s01/orabase/diag/rdbms/prod/PROD1/incident/incdir_54525/PROD1_ora_22060_i54525.trc
2011-05-02 22:47:08.189000 +08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Dumping diagnostic data in directory=[cdmp_20110502224708], requested by (instance=1, osid=22060), summary=[incident=54525].
2011-05-02 22:47:10.502000 +08:00
Sweep [inc][54525]: completed
Sweep [inc2][54525]: completed

SQL> show parameter memory

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
hi_shared_memory_address             integer     0
memory_max_target                    big integer 912M
memory_target                        big integer 912M
shared_memory_address                integer     0

从以上日志可以看到导出进程在为”fixed allocation callback”从streams pool流池中分配内存时遇到了ORA-04031错误,进而导致了UDE-31623错误。为什么expdp要从streams pool中分配内存呢?data pump数据泵在11g中开始Advanced Queue高级队列来控制其job作业的启动、停止和重启了。如果streams pool的当前size为0,那么显然无法分配到任何内存;而当前系统中就只设置了AMM特性的memory_target内存参数,并没有显式地设置streams_pool_size参数,这就导致了问题的发生。
我们可以通过上症状在MOS上搜索可以找到Note”UDE-31623 Error with Data Pump Export”:

Applies to:

Oracle Server - Enterprise Edition - Version: 11.2.0.0. and later   [Release: 11.2 and later ]
Information in this document applies to any platform.
Goal

Data Pump Export does not start:

$ expdp system/ directory=test dumpfile=system.dmp logfile=abcd.log full=y

Export: Release 11.2.0.1.0 - Production on Thu Mar 25 12:11:31 2010

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

UDE-31623: operation generated ORACLE error 31623
ORA-31623: a job is not attached to this session via the specified handle
ORA-06512: at "SYS.DBMS_DATAPUMP", line 3263
ORA-06512: at "SYS.DBMS_DATAPUMP", line 4488
ORA-06512: at line 1

Solution

For this Data Pump error, you will need to configure the database with some Streams Pool.

Perform the following query to find out how much STREAMS_POOL_SIZE has been allocated

 select * from v$sgainfo;
 ...
 Streams Pool Size                               0 Yes

Data Pump now uses Advanced Queuing (AQ) internally, so it can handle job stops/starts/re-starts.  
However, AQ using the Streams Pool, where the database currently has the STREAMS_POOL_SIZE set to ZERO.

Manually set the STREAMS_POOL_SIZE, re-start the database and re-attempt the Data Pump Export.

Alternately, turn on Automatic Shared Memory management (ASMM) by setting SGA_TARGET or MEMORY_TARGET.

显然上述Note在有一点上说错了,那就是设置memory_target参数无益于解决该UDE-31623问题;最好的做法还是显式地去设置streams_pool_size参数,但遗憾的是streams_pool_size虽然是一个可以动态修改的参数,但实际上是无法动态修改的:

SQL> col name for a20
SQL>
SQL> select name,isinstance_modifiable from v$system_parameter where name='streams_pool_size';

NAME                 ISINS
-------------------- -----
streams_pool_size    TRUE

SQL> alter system set streams_pool_size=60M sid='PROD1';
alter system set streams_pool_size=60M sid='PROD1'
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-04033: Insufficient memory to grow pool

SQL> alter system set streams_pool_size=60M sid='PROD1' scope=spfile;
System altered.

Restart Instance .................

[oracle@rh2 dbs]$ expdp system/password dumpfile=s01:sh1.dmp schemas=sh

Export: Release 11.2.0.2.0 - Production on Mon May 2 23:28:33 2011

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
Starting "SYSTEM"."SYS_EXPORT_SCHEMA_01":  system/******** dumpfile=s01:sh1.dmp schemas=sh
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 273.8 MB
Processing object type SCHEMA_EXPORT/USER
Processing object type SCHEMA_EXPORT/SYSTEM_GRANT
Processing object type SCHEMA_EXPORT/ROLE_GRANT
Processing object type SCHEMA_EXPORT/DEFAULT_ROLE
Processing object type SCHEMA_EXPORT/PRE_SCHEMA/PROCACT_SCHEMA
Processing object type SCHEMA_EXPORT/TABLE/TABLE
Processing object type SCHEMA_EXPORT/TABLE/GRANT/OWNER_GRANT/OBJECT_GRANT
Processing object type SCHEMA_EXPORT/TABLE/INDEX/INDEX
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type SCHEMA_EXPORT/TABLE/COMMENT
Processing object type SCHEMA_EXPORT/VIEW/VIEW
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/REF_CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/INDEX/FUNCTIONAL_AND_BITMAP/INDEX
Processing object type SCHEMA_EXPORT/TABLE/INDEX/STATISTICS/FUNCTIONAL_AND_BITMAP/INDEX_STATISTICS
Processing object type SCHEMA_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type SCHEMA_EXPORT/TABLE/INDEX/DOMAIN_INDEX/INDEX
Processing object type SCHEMA_EXPORT/MATERIALIZED_VIEW
Processing object type SCHEMA_EXPORT/DIMENSION
. . exported "SH"."CUSTOMERS"                            9.853 MB   55500 rows
. . exported "SH"."COSTS":"COSTS_Q1_1998"                139.5 KB    4411 rows
. . exported "SH"."COSTS":"COSTS_Q1_1999"                183.5 KB    5884 rows
. . exported "SH"."COSTS":"COSTS_Q1_2000"                120.6 KB    3772 rows
. . exported "SH"."COSTS":"COSTS_Q1_2001"                227.8 KB    7328 rows
. . exported "SH"."COSTS":"COSTS_Q2_1998"                79.52 KB    2397 rows
. . exported "SH"."COSTS":"COSTS_Q2_1999"                132.5 KB    4179 rows
. . exported "SH"."COSTS":"COSTS_Q2_2000"                119.0 KB    3715 rows
. . exported "SH"."COSTS":"COSTS_Q2_2001"                184.5 KB    5882 rows
. . exported "SH"."COSTS":"COSTS_Q3_1998"                131.1 KB    4129 rows
. . exported "SH"."COSTS":"COSTS_Q3_1999"                137.3 KB    4336 rows
. . exported "SH"."COSTS":"COSTS_Q3_2000"                151.4 KB    4798 rows
. . exported "SH"."COSTS":"COSTS_Q3_2001"                234.4 KB    7545 rows
. . exported "SH"."COSTS":"COSTS_Q4_1998"                144.7 KB    4577 rows
. . exported "SH"."COSTS":"COSTS_Q4_1999"                159.0 KB    5060 rows
. . exported "SH"."COSTS":"COSTS_Q4_2000"                160.2 KB    5088 rows
. . exported "SH"."COSTS":"COSTS_Q4_2001"                278.4 KB    9011 rows
. . exported "SH"."SALES":"SALES_Q1_1998"                1.412 MB   43687 rows
. . exported "SH"."SALES":"SALES_Q1_1999"                2.071 MB   64186 rows
. . exported "SH"."SALES":"SALES_Q1_2000"                2.012 MB   62197 rows
. . exported "SH"."SALES":"SALES_Q1_2001"                1.965 MB   60608 rows
. . exported "SH"."SALES":"SALES_Q2_1998"                1.160 MB   35758 rows
. . exported "SH"."SALES":"SALES_Q2_1999"                1.754 MB   54233 rows
. . exported "SH"."SALES":"SALES_Q2_2000"                1.802 MB   55515 rows
. . exported "SH"."SALES":"SALES_Q2_2001"                2.051 MB   63292 rows
. . exported "SH"."SALES":"SALES_Q3_1998"                1.633 MB   50515 rows
. . exported "SH"."SALES":"SALES_Q3_1999"                2.166 MB   67138 rows
. . exported "SH"."SALES":"SALES_Q3_2000"                1.909 MB   58950 rows
. . exported "SH"."SALES":"SALES_Q3_2001"                2.130 MB   65769 rows
. . exported "SH"."SALES":"SALES_Q4_1998"                1.581 MB   48874 rows
. . exported "SH"."SALES":"SALES_Q4_1999"                2.014 MB   62388 rows
. . exported "SH"."SALES":"SALES_Q4_2000"                1.814 MB   55984 rows
. . exported "SH"."SALES":"SALES_Q4_2001"                2.257 MB   69749 rows
. . exported "SH"."SUPPLEMENTARY_DEMOGRAPHICS"           697.3 KB    4500 rows
. . exported "SH"."FWEEK_PSCAT_SALES_MV"                 419.8 KB   11266 rows
. . exported "SH"."PROMOTIONS"                           58.89 KB     503 rows
. . exported "SH"."TIMES"                                380.8 KB    1826 rows
. . exported "SH"."CAL_MONTH_SALES_MV"                   6.312 KB      48 rows
. . exported "SH"."CHANNELS"                              7.25 KB       5 rows
. . exported "SH"."COUNTRIES"                            10.20 KB      23 rows
. . exported "SH"."PRODUCTS"                             26.17 KB      72 rows
. . exported "SH"."COSTS":"COSTS_1995"                       0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_1996"                       0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_H1_1997"                    0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_H2_1997"                    0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_Q1_2002"                    0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_Q1_2003"                    0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_Q2_2002"                    0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_Q2_2003"                    0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_Q3_2002"                    0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_Q3_2003"                    0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_Q4_2002"                    0 KB       0 rows
. . exported "SH"."COSTS":"COSTS_Q4_2003"                    0 KB       0 rows
. . exported "SH"."DIMENSION_EXCEPTIONS"                     0 KB       0 rows
. . exported "SH"."SALES":"SALES_1995"                       0 KB       0 rows
. . exported "SH"."SALES":"SALES_1996"                       0 KB       0 rows
. . exported "SH"."SALES":"SALES_H1_1997"                    0 KB       0 rows
. . exported "SH"."SALES":"SALES_H2_1997"                    0 KB       0 rows
. . exported "SH"."SALES":"SALES_Q1_2002"                    0 KB       0 rows
. . exported "SH"."SALES":"SALES_Q1_2003"                    0 KB       0 rows
. . exported "SH"."SALES":"SALES_Q2_2002"                    0 KB       0 rows
. . exported "SH"."SALES":"SALES_Q2_2003"                    0 KB       0 rows
. . exported "SH"."SALES":"SALES_Q3_2002"                    0 KB       0 rows
. . exported "SH"."SALES":"SALES_Q3_2003"                    0 KB       0 rows
. . exported "SH"."SALES":"SALES_Q4_2002"                    0 KB       0 rows
. . exported "SH"."SALES":"SALES_Q4_2003"                    0 KB       0 rows
Master table "SYSTEM"."SYS_EXPORT_SCHEMA_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYSTEM.SYS_EXPORT_SCHEMA_01 is:
  /s01/sh1.dmp
Job "SYSTEM"."SYS_EXPORT_SCHEMA_01" successfully completed at 23:30:20

/* 11g中加强了data pump在告警日志中的记录信息,可以看到这里成功启动了DM进程*/

DM00 started with pid=55, OS id=25347, job SYSTEM.SYS_EXPORT_SCHEMA_01
2011-05-02 23:28:35.814000 +08:00
Starting background process SMCO
SMCO started with pid=56, OS id=25349
DW00 started with pid=57, OS id=25351, wid=1, job SYSTEM.SYS_EXPORT_SCHEMA_01
2011-05-02 23:28:54.029000 +08:00
XDB installed.
XDB initialized.

解决ORA-27103:internal error错误一例

一套Linux x86-64上的11.2.0.2数据库在startup启动阶段遭遇了ORA-27103:internal error内部错误,其出错日志如下:

SQL> startup nomount;
ORA-27103: internal error
Linux-x86_64 Error: 2: No such file or directory
Additional information: 9404423
Additional information: 2

oerr 27103
Usage: oerr facility error

Facility is identified by the prefix string in the error message.
For example, if you get ORA-7300, "ora" is the facility and "7300"
is the error.  So you should type "oerr ora 7300".

If you get LCD-111, type "oerr lcd 111", and so on.

================= alert.log ====================
This instance was first to mount
2011-05-02 21:49:47.009000 +08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /s01/orabase/diag/rdbms/prod/PROD1/trace/PROD1_asmb_14386.trc:
ORA-04031: unable to allocate 393240 bytes of shared memory 
("large pool","unknown object","large pool","ASM map operations hashtable")
ASMB (ospid: 14386): terminating the instance due to error 4031

System state dump requested by (instance=1, osid=14386 (ASMB)), summary=[abnormal instance termination].

System State dumped to trace file /s01/orabase/diag/rdbms/prod/PROD1/trace/PROD1_diag_14346.trc

Dumping diagnostic data in directory=[cdmp_20110502214947], requested by 
(instance=1, osid=14386 (ASMB)), summary=[abnormal instance termination].

Instance terminated by ASMB, pid = 14386

=============================system state dump============================

PROCESS 24: ASMB
  ----------------------------------------
  SO: 0x92c955c8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x92c955c8, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:24, ser:1, calls cur/top: 0x9288c778/0x9288c778
            flags : (0x6) SYSTEM
            flags2: (0x0),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 2296 0 2
              last post received-location: ksl2.h LINE:2293 ID:kslpsr
              last process to post me: 92c8e248 1 6
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: 92c8e248 1 6
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 0x92d24ae0
    O/S info: user: oracle, term: UNKNOWN, ospid: 14386
    OSD pid info: Unix process pid: 14386, image: oracle@rh2.oracle.com (ASMB)
    ----------------------------------------
    SO: 0x92e80a58, type: 4, owner: 0x92c955c8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x92c955c8, name=session, file=ksu.h LINE:12459, pg=0
    (session) sid: 13 ser: 1 trans: (nil), creator: 0x92c955c8
              flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x408) -/-
              DID: , short-term DID:
              txn branch: (nil)
              oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$BACKGROUND
    Current Wait Stack:
      Not in wait; last wait ended 1.501399 sec ago
    Wait State:
      fixed_waits=0 flags=0x21 boundary=(nil)/-1
    Session Wait History:
        elapsed time of 1.501469 sec since last wait
     0: waited for 'SGA: allocation forcing component growth'
        =0x0, =0x0, =0x0
        wait_id=14 seq_num=27 snap_id=7
        wait times: snap=0.000000 sec, exc=0.305374 sec, total=0.305390 sec
        wait times: max=infinite
        wait counts: calls=6 os=6
        occurred after 0.000000 sec of elapsed time
     1: waited for 'SGA: allocation forcing component growth'
        =0x0, =0x0, =0x0
        wait_id=20 seq_num=26 snap_id=1
        wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
        wait times: max=infinite
        wait counts: calls=1 os=1
        occurred after 0.000000 sec of elapsed time
     2: waited for 'SGA: allocation forcing component growth'
        =0x0, =0x0, =0x0
        wait_id=14 seq_num=25 snap_id=6

在没有阅读告警日志前我的第一反应可能是上次shutdown时Oracle进程没有被清理干净,导致shared memory segments一直没有释放,从而造成了以上ORA-27103错误。

不过其实这个问题告警日志里有明确的信息,即RDBMS Instance数据库实例在mount阶段asmb进程(负责db instance与asm instance的交互)试图从large pool大池中分配390k的空间,但遭遇了ORA-04031错误,如果asmb后台进程无法正常工作将直接导致db实例无法找到asm存储上的必要Extent,因此导致出出现了”ORA-27103: internal error:Linux-x86_64 Error: 2: No such file or directory”。

换而言之ORA-04031错误才是罪魁祸首,我们来是看该实例初始化的内存参数:

[oracle@rh2 dbs]$ strings spfilePROD1.ora |egrep "sga|memory|pool"
PROD1.__large_pool_size=16777216
*.memory_target=943718400
*.shared_pool_size=314572800
*.streams_pool_size=0

因为是11g的实例所以采用了automatic memory management特性管理直接设置了memory_target参数为900M,并设置了1号实例的large pool最小为16M,900M的大小对于10g的实例而言仍是绰绰有余的,但是显然在11gr2中设置memory_target为900M是不足以驱动这样一个”庞然大物”的。我们需要配置更多的内存,亦或者可以通过设置更大的large pool来解决令人郁闷的ORA-04031错误:

[oracle@rh2 dbs]$ strings spfilePROD1.ora > initPROD1.ora
[oracle@rh2 dbs]$ rm spfilePROD1.ora 
[oracle@rh2 dbs]$ vi initPROD1.ora

/* 修改memory_target为至少912M */

*.memory_target=1200M

/* 成功启动!  */
SQL> startup ;
ORACLE instance started.

Total System Global Area 1252663296 bytes
Fixed Size                  2226072 bytes
Variable Size             687868008 bytes
Database Buffers          553648128 bytes
Redo Buffers                8921088 bytes
Database mounted.
Database opened.

Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.
  [name='eth1:1', type=1, ip=169.254.236.169, mac=94-0c-6d-71-8c-c2, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'eth0' configured from GPnP for use as a public interface.
  [name='eth0', type=1, ip=192.168.1.121, mac=6c-f0-49-03-5f-99, net=192.168.1.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:1' configured from GPnP for use as a public interface.
  [name='eth0:1', type=1, ip=192.168.1.133, mac=6c-f0-49-03-5f-99, net=192.168.1.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eth0:2' configured from GPnP for use as a public interface.
  [name='eth0:2', type=1, ip=192.168.1.122, mac=6c-f0-49-03-5f-99, net=192.168.1.0/24, mask=255.255.255.0, use=public/1]
Picked latch-free SCN scheme 3
2011-05-02 22:28:04.408000 +08:00
WARNING: db_recovery_file_dest is same as db_create_file_dest
Autotune of undo retention is turned on. 
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
Using parameter settings in server-side pfile /s01/oracle/product/11.2.0/dbhome_1/dbs/initPROD1.ora
System parameters with non-default values:
  processes                = 150
  shared_pool_size         = 304M
  streams_pool_size        = 0
  memory_target            = 1200M
  control_files            = "+DATA/prod/controlfile/current.261.747100215"
  control_files            = "+DATA/prod/controlfile/current.260.747100215"
  db_block_size            = 8192
  db_flash_cache_file      = "/flashcard/prod1cache.dsk"
  db_flash_cache_size      = 20G
  compatible               = "11.2.0.0.0"
  log_archive_dest_1       = "location=+DATA"
  cluster_database         = TRUE
  db_create_file_dest      = "+DATA"
  db_recovery_file_dest    = "+DATA"
  db_recovery_file_dest_size= 40320M
  thread                   = 1
  undo_tablespace          = "UNDOTBS1"
  instance_number          = 1
  db_domain                = ""
  dispatchers              = "(PROTOCOL=TCP) (SERVICE=PRODXDB)"
  remote_listener          = "rh-cluster-scan:1521"
  remote_listener          = "*.remote_login_pas"
  audit_file_dest          = "/s01/orabase/admin/PROD/adump"
  audit_trail              = "DB"
  db_name                  = "PROD"
  open_cursors             = 300
  diagnostic_dest          = "/s01/orabase"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.236.169
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
2011-05-02 22:28:07.675000 +08:00
ORA-00132: syntax error or unresolved network name '*.remote_login_pas'
PMON started with pid=2, OS id=19807 
PSP0 started with pid=3, OS id=19809 
2011-05-02 22:28:08.754000 +08:00
VKTM started with pid=4, OS id=19811 at elevated priority
GEN0 started with pid=5, OS id=19815 
VKTM running at (1)millisec precision with DBRM quantum (100)ms
DIAG started with pid=6, OS id=19817 
DBRM started with pid=7, OS id=19819 
PING started with pid=8, OS id=19821 
ACMS started with pid=9, OS id=19823 
DIA0 started with pid=10, OS id=19825 
LMON started with pid=11, OS id=19827 
LMD0 started with pid=12, OS id=19829 
LMS0 started with pid=13, OS id=19831 at elevated priority
RMS0 started with pid=14, OS id=19835 
LMHB started with pid=15, OS id=19837 
MMAN started with pid=16, OS id=19839 
* Load Monitor used for high load check 
* New Low - High Load Threshold Range = [1920 - 2560] 
LGWR started with pid=18, OS id=19843 
DBW0 started with pid=17, OS id=19841 
CKPT started with pid=19, OS id=19845 
SMON started with pid=20, OS id=19847 
RECO started with pid=21, OS id=19849 
RBAL started with pid=22, OS id=19851 
ASMB started with pid=23, OS id=19853 
MMON started with pid=24, OS id=19855 
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=25, OS id=19857 
starting up 1 shared server(s) ...
lmon registered with NM - instance number 1 (internal mem no 0)
2011-05-02 22:28:09.825000 +08:00
NOTE: initiating MARK startup 
Starting background process MARK
MARK started with pid=28, OS id=19866 
NOTE: MARK has subscribed 
Reconfiguration started (old inc 0, new inc 2)
List of instances:
 1 (myinst: 1) 
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
LCK0 started with pid=30, OS id=19872 
Starting background process RSMN
RSMN started with pid=31, OS id=19874 
ORACLE_BASE from environment = /s01/orabase
2011-05-02 22:28:12.112000 +08:00
ALTER DATABASE   MOUNT
This instance was first to mount
2011-05-02 22:28:13.202000 +08:00
NOTE: Loaded library: System 
ALTER SYSTEM SET 
local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.122)(PORT=1521))))' 
SCOPE=MEMORY SID='PROD1';
SUCCESS: diskgroup DATA was mounted
NOTE: dependency between database PROD and diskgroup resource ora.DATA.dg is established


/* 也可以直接增大large_pool_size来解决上述问题 */

large_pool_size=30M
memory_target=912M

沪ICP备14014813号-2

沪公网安备 31010802001379号