Warning:Even Exadata has a wrong memlock setting

事情要从大约2个月前的一起事故说起,有一套部署在Oracle-Sun Exadata V2 Database Machine上的4节点11.2.0.1 RAC数据库,其中一个节点的RAC关键后台进程LMS报ORA-00600[kjbmprlst:shadow]错误,随后LMS后台进程将该节点上的实例终止。其他节点上的CRS软件检测到该意外终止后,数据库进入全局资源的重新配置过程(Reconfiguration),Reconfiguration在所有剩余节点上都顺利完成了。

但是随后其中一个节点的告警日志中持续出现”Process W000 died, see its trace file”,似乎是实例无法得到分配新进程的必要资源,同时应用程序出现无法登陆该节点上实例的情况,本来4节点的RAC数据库,因为ORA-00600挂了一个,现在又有一个节点登不上,一下变得只剩下一半性能。

随后我赶到了问题现场,继续诊断问题,并发现了以下症状,在此一一列举:

1.尝试远程登录该实例,但是失败,出现ORA-12516 TNS:listener could not find available handler with matching protocol stack”错误。反复登录会出现以下信息:

Linux Error: 12: Cannot allocate memory
 Additional information: 1
 ORA-01034: ORACLE not available

 

2.确认过ORACLE_SID、ORACLE_HOME等多环境变量后使用”sqlplus / as sysdba”登录却返回”Connected to an idle instance.”(这一点最为蹊跷),无法以sysdba登录就无法收集必要的诊断信息,这个虽然可以通过gdb等手段做systemstate dump,但是暂时绕过

 

3. 后台进程W000由SMCO进程启动, SMCO进程的日志如下,所报状态为KSOSP_SPAWNED:

Process W000 is dead (pid=2648 req_ver=3812 cur_ver=3812 state=KSOSP_SPAWNED).
 *** 2011-07-08 02:44:32.971
 Process W000 is dead (pid=2650 req_ver=3813 cur_ver=3813 state=KSOSP_SPAWNED).

 

4. 确认组成instance的内存和后台进程均存活,且仍有日志产生

[oracle@maclean04 trace]$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xac5ffd78 491524     oracle    660        4096       0
0x96c5992c 1409029    oracle    660        4096       0  

[oracle@maclean04 trace]$ ls -l /dev/shm
total 34839780
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_0
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_1
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_10
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_100
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_101
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_102
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_103
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_104
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_105
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_106
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_107
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_108
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_109
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_11
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_110
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_111
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_112
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_113
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_114
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_115
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_116
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_117
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_118
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_119
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_12
.......................

[oracle@maclean04 trace]$ ps -ef|grep ora_
oracle    5466     1  0 Jul03 ?        00:00:18 ora_pz99_maclean4
oracle   14842 10564  0 19:54 pts/9    00:00:00 grep ora_
oracle   18641     1  0 Jun08 ?        00:00:02 ora_q002_maclean4
oracle   23932     1  0 Jun07 ?        00:04:26 ora_pmon_maclean4
oracle   23934     1  0 Jun07 ?        00:00:06 ora_vktm_maclean4
oracle   23938     1  0 Jun07 ?        00:00:00 ora_gen0_maclean4
oracle   23940     1  0 Jun07 ?        00:00:06 ora_diag_maclean4
oracle   23942     1  0 Jun07 ?        00:00:00 ora_dbrm_maclean4
oracle   23944     1  0 Jun07 ?        00:01:01 ora_ping_maclean4
oracle   23946     1  0 Jun07 ?        00:00:16 ora_psp0_maclean4
oracle   23948     1  0 Jun07 ?        00:00:00 ora_acms_maclean4
oracle   23950     1  0 Jun07 ?        02:27:29 ora_dia0_maclean4
oracle   23952     1  0 Jun07 ?        01:19:42 ora_lmon_maclean4
oracle   23954     1  0 Jun07 ?        02:23:59 ora_lmd0_maclean4
oracle   23956     1  5 Jun07 ?        1-13:50:36 ora_lms0_maclean4
oracle   23960     1  4 Jun07 ?        1-12:44:25 ora_lms1_maclean4
oracle   23964     1  0 Jun07 ?        00:00:00 ora_rms0_maclean4
oracle   23966     1  0 Jun07 ?        00:00:00 ora_lmhb_maclean4
oracle   23968     1  0 Jun07 ?        01:58:35 ora_mman_maclean4
oracle   23970     1  0 Jun07 ?        06:28:39 ora_dbw0_maclean4
oracle   23972     1  0 Jun07 ?        06:27:08 ora_dbw1_maclean4
oracle   23974     1  2 Jun07 ?        16:49:56 ora_lgwr_maclean4
oracle   23976     1  0 Jun07 ?        00:20:48 ora_ckpt_maclean4
oracle   23978     1  0 Jun07 ?        00:07:03 ora_smon_maclean4
oracle   23980     1  0 Jun07 ?        00:00:00 ora_reco_maclean4
oracle   23982     1  0 Jun07 ?        00:00:00 ora_rbal_maclean4
oracle   23984     1  0 Jun07 ?        00:01:00 ora_asmb_maclean4
oracle   23986     1  0 Jun07 ?        00:08:15 ora_mmon_maclean4
oracle   23988     1  0 Jun07 ?        00:18:19 ora_mmnl_maclean4
oracle   23992     1  0 Jun07 ?        00:00:00 ora_d000_maclean4
oracle   23994     1  0 Jun07 ?        00:00:00 ora_s000_maclean4
oracle   23996     1  0 Jun07 ?        00:00:00 ora_mark_maclean4
oracle   24065     1  0 Jun07 ?        01:16:54 ora_lck0_maclean4
oracle   24067     1  0 Jun07 ?        00:00:00 ora_rsmn_maclean4
oracle   24079     1  0 Jun07 ?        00:01:02 ora_dskm_maclean4
oracle   24174     1  0 Jun07 ?        00:08:18 ora_arc0_maclean4
oracle   24188     1  0 Jun07 ?        00:08:19 ora_arc1_maclean4
oracle   24190     1  0 Jun07 ?        00:00:59 ora_arc2_maclean4
oracle   24192     1  0 Jun07 ?        00:08:12 ora_arc3_maclean4
oracle   24235     1  0 Jun07 ?        00:00:00 ora_gtx0_maclean4
oracle   24237     1  0 Jun07 ?        00:00:00 ora_rcbg_maclean4
oracle   24241     1  0 Jun07 ?        00:00:00 ora_qmnc_maclean4
oracle   24245     1  0 Jun07 ?        00:00:00 ora_q001_maclean4
oracle   24264     1  0 Jun07 ?        00:08:28 ora_cjq0_maclean4
oracle   25782     1  0 Jun07 ?        00:00:00 ora_smco_maclean4

 

5.确认在问题发生时系统中仍有大量的空闲内存且未发生大量的SWAP,此外/dev/shm共享内存目录仍有27G的空闲。

 

6.在其他节点上查询全局动态性能视图gv$resource_limit发现当前故障节点上的登录进程总数上限仅为404,并不多。

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for Linux: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmac.cn

SQL> select * from gv$resource_limit where inst_id=4;

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION             LIMIT_VALUE
------------------------------ ------------------- --------------- ------------------------------ -------------
processes                                       50             404       1500                           1500
sessions                                        61             616       2272                           2272
enqueue_locks                                  849            1599      31062                          31062
enqueue_resources                              846            1007      15016                      UNLIMITED
ges_procs                                       47             399       1503                           1503
ges_ress                                     65943          109281      67416                      UNLIMITED
ges_locks                                    23448           37966      92350                      UNLIMITED
ges_cache_ress                                7347           14716          0                      UNLIMITED
ges_reg_msgs                                   337            5040       3730                      UNLIMITED
ges_big_msgs                                    26             502       3730                      UNLIMITED
ges_rsv_msgs                                     0               1       1000                           1000
gcs_resources                              2008435         2876561    3446548                        3446548
gcs_shadows                                1888276         2392064    3446548                        3446548
dml_locks                                        0               0       9996                      UNLIMITED
temporary_table_locks                            0              45  UNLIMITED                      UNLIMITED
transactions                                     0               0       2499                      UNLIMITED
branches                                         0               2       2499                      UNLIMITED
cmtcallbk                                        0               3       2499                      UNLIMITED
max_rollback_segments                          109             129       2499                          65535
sort_segment_locks                               0              14  UNLIMITED                      UNLIMITED
k2q_locks                                        0               2       4544                      UNLIMITED
max_shared_servers                               1               1  UNLIMITED                      UNLIMITED
parallel_max_servers                             1              19        160                           3600

 

7. Exadata节点系统内核参数文件sysctl.conf中的配置正确:

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

########### BEGIN DO NOT REMOVE Added by Oracle Exadata ###########
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
# bug 8311668 file-max and aio-max-nr
fs.file-max = 6815744
# DB install guide says the above
fs.aio-max-nr = 1048576
# 8976963
net.ipv4.neigh.bond0.locktime=0
net.ipv4.ip_local_port_range = 9000 65500
# DB install guide says the above
net.core.rmem_default = 4194304
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 2097152
# The original DB deployment was net.core.wmem_max = 1048586 but IB works
# best for Exadata at the above net.core settings
# bug 8268393 remove vm.nr_hugepages = 2048
# bug 8778821 system reboots after 60 sec on panic
kernel.panic=60
########### END DO NOT REMOVE Added by Oracle Exadata ###########

########### BEGIN DO NOT REMOVE Added by Oracle Exadata ###########
kernel.shmmax = 64547735961
kernel.shmall = 15758724
########### END DO NOT REMOVE Added by Oracle Exadata ###########

 

8. 至此问题还是显得扑朔迷离,主要后台进程和SGA内存的完好,而且操作系统上也仍有大量空闲内存,实例上的资源也没有达到一个临界点。到底是什么造成了无法分配新进程!?

出于谨慎我最后还是检查了系统上的/etc/security/limits.conf参数文件,该参数文件控制了shell的一些ulimit的上限。因为Exadata一体机是由Oracle安装配置后直接交付使用的,我最初的认识是这些配置文件都毫无疑问都应当是最佳配置,遵循Oracle的Best Practices。

但是当我实际打开这个文件后我立即意识到这个配置有问题,似乎少了点什么,以下为该Exadata上的limits.conf文件:

########### BEGIN DO NOT REMOVE Added by Oracle Deployment Scripts ###########

oracle     soft    nproc       2047
oracle     hard    nproc       16384
oracle     soft    nofile      65536
oracle     hard    nofile      65536

########### END DO NOT REMOVE Added by Oracle Deployment Scripts ###########

显然上述limits.conf中缺少了对memlock参数的设置,在不设置memlock参数的情况下使用缺省的memlock为32,以下为Exadata host上的ulimit输出:

[oracle@maclean4 shm]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 606208
max locked memory (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2047
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

可以观察到这里的max locked memory确实是缺省的32,而Oracle所推荐的memlock参数却要远大于32。

在Oracle validated Configuration中经过验证的memlock推荐值为50000000,关于Oracle Validated Configuration详见拙作<Understand Oracle Validated Configurations>

[oracle@rh2 ~]$ cat /etc/security/limits.conf

# Oracle-Validated setting for nofile soft limit is 131072
oracle   soft   nofile    131072

# Oracle-Validated setting for nofile hard limit is 131072
oracle   hard   nofile    131072

# Oracle-Validated setting for nproc soft limit is 131072
oracle   soft   nproc    131072

# Oracle-Validated setting for nproc hard limit is 131072
oracle   hard   nproc    131072

# Oracle-Validated setting for core soft limit is unlimited
oracle   soft   core    unlimited

# Oracle-Validated setting for core hard limit is unlimited
oracle   hard   core    unlimited

# Oracle-Validated setting for memlock soft limit is 50000000
oracle   soft   memlock    50000000 

# Oracle-Validated setting for memlock hard limit is 50000000
oracle   hard   memlock    50000000

搜索Mos可以发现Note[Ora-27102: Out Of Memory: Linux Error: 12: Cannot Allocate Memory with LOCK_SGA=TRUE [ID 401077.1]:指出了因max locked memory过小可能引发Linux Error: 12: Cannot Allocate Memory内存无法分配的问题。

因为修改limits.conf配置文件对已经启动的实例是无效的,所以我们无法通过纠正参数来解决现有的问题。

实际我采用了释放一些资源的方法来workaround了这个问题,通过以下脚本将实例内的所有前台服务进程杀死以释放资源。

ps -ef|grep $SID|grep LOCAL=NO|grep -v grep| awk ‘{print $2}’|xargs kill -9

完成以上命令后出现了终端有点卡的现象,之后恢复正常。尝试使用sysdba本地和远程登录实例均成功,应用的链接也恢复正常。

虽然修复了问题,但是还需要和客户做详尽的说明。我在邮件中阐明了该Exadata一体机上配置文件存在的问题,并提出了几点建议:

1.要求Oracle Support确认该/etc/security/limits.conf中的配置是否合理,是否需要修改
2.设置vm.min_free_kbytes = 51200 内核参数,避免因空闲内存不足引起的性能问题
3.安装OSWatcher监控软件,监控必要的系统资源

客户对我的说法也比较信服,但还是将邮件抄送了原厂Exadata一体机的售前人员。

之后售前人员也曾联系过我,我也做了相同的说明。但原厂售前认为在Exadata一体机是在Oracle美国原厂进行配置安装的,在配置上肯定是最优的,而且该limits.conf中的memlock参数的当前值(32)和推荐值(50000000)之间有如此大的差距,他们认为美国原厂的部署人员不可能犯这么低级的错误。

所以实际他们对我对该memlock参数的说明持一种怀疑的态度,我的建议是就该memlock参数和MOS进行进一步的沟通,以确认该问题。当然这不是我需要完成的工作了。因为对该memlock参数存在分歧,所以短期内也没有修改该参数。

这个case就这样过去了,时间过得很快,转眼已经2个月了。恰巧最近有升级Exadata上数据库到11.2.0.2的项目,所以翻阅了相关patch的readme文档,因为升级RAC到11.2.0.2的前提是Exadata Storage Server Software、InfiniBand Switch Software Version软件版本能够兼容,所以查阅了其兼容列表:

Version Compatibility

The following table lists the Exadata Storage Server software versions that are compatible with each supported Oracle Database 11g Release 2 software version.

Oracle Database Software version Required Exadata Storage Server Software version
11g Release 2 (11.2.0.2.0) Patch Set 1 11.2.2.x
11g Release 2 (11.2.0.1.0) 11.2.2.x
11.2.1.x

The following table lists the InfiniBand Switch software versions that are compatible with each supported Exadata Storage Server software version.

Exadata Storage Server Software version Required InfiniBand Switch software version
11.2.2.2.2 and later Exadata Database Machine – Sun Datacenter InfiniBand Switch 36
Switch software version 1.1.3-2 or laterHP Oracle Database Machine – Voltaire ISR 9024D-M and ISR 9024D
Switch software 5.1.1 build ID 872 (ISR 9024D-M only)
Switch firmware 1.0.0 or higher
11.2.2.2.0 or earlier Exadata Database Machine – Sun Datacenter InfiniBand Switch 36
Switch software version 1.0.1-1 or laterHP Oracle Database Machine – Voltaire ISR 9024D-M and ISR 9024D
Switch software 5.1.1 build ID 872 (ISR 9024D-M only)
Switch firmware 1.0.0 or higher

 

为了将Exadata上的RAC数据库升级到11.2.0.2,首先要将Exadata Storage Server Software升级到11.2.2.x,Oracle官方目前推荐的版本是11.2.2.3.2。

所以随后我也翻阅了Exadata Storage Server Software 11.2.2.3.2 的update readme文档,即<Oracle Exadata Database Machine README for patch 12577723 (Support note 1323958.1)>。

该Patch的升级主要分成”Applying the Patch to Exadata Cells”和”Applying the Patch to the Database Server” 2个阶段,即不仅需要在Exadata Cell上实施补丁,还需要在Database节点上实施一个小补丁。

 

查看”Applying the Patch to the Database Server”章节可以发现存在这样一个步骤:

Repeat the following steps for each database host. If you are taking deployment-wide downtime for the patching, then these steps may be performed in parallel on all database hosts.

  1. Update the resource limits for the database and the grid users:
    Note:

    • This step does not apply if you have customized the values for your specific deployment and database requirements.
    WARNING:

    • Do not run this step if you have specific customized values in use for your deployment.
    1. Calculate 75% of the physical memory on the machine using the following command:.
      let -i x=($((`cat /proc/meminfo | grep 'MemTotal:' | awk '{print $2}'` * 3 / 4))); echo $x
    2. Edit the /etc/security/limits.conf file to update or add following limits for the database owner (orauser) and the grid infrastructure user (griduser). Your deployment may use the same operating system user for both and it may be named as oracle user. Adjust the following as needed.
      ########## BEGIN DO NOT REMOVE Added by Oracle ###########
      orauser     soft    core       unlimited
      orauser     hard    core       unlimited
      orauser     soft    nproc       131072
      orauser     hard    nproc       131072
      orauser     soft    nofile      131072
      orauser     hard    nofile      131072
      orauser     soft    memlock     <value of x from step 01.a>
      orauser     hard    memlock     <value of x from step 01.a>
      
      griduser     soft    core       unlimited
      griduser     hard    core       unlimited
      griduser     soft    nproc       131072
      griduser     hard    nproc       131072
      griduser     soft    nofile      131072
      griduser     hard    nofile      131072
      griduser     soft    memlock     <value of x from step 01.a>
      griduser     hard    memlock     <value of x from step 01.a>
      
      ########### END DO NOT REMOVE Added by Oracle ###########

 

以上可以看到在正式实施Patch to Database server前做了一个补救措施,那就是为oracle和grid用户添加memlock参数,这里的memlock参数是通过将/proc/meminfo中的MemTotal取75%获得,在<Exadata Server Hardware Details>中我列出了Exadata Database Host的一些硬件参数,其中总内存MemTotal一般为70GB(74027752 kB),换算过来74027752*75%=55520814,也就是说Oracle实际推荐在Exadata上使用的memlock参数应当为55520814,甚至要高于我之前所说的50000000的验证值。

至此该问题终于真相大白!而我们也可以从中学到很多东西:

1.首先我大胆的猜测,实际部署Sun Exadata Machine的因该是Oracle硬件部门,也就是以前Sun的部门。实际在部署过程中,部门与部门之间的充分交流是很重要的,而这里09年匆匆上线的Oracle-Sun Exadata V2显然没有做好,而直到2011 5月发布的Oracle Exadata Database Machine 11g Release 2 (11.2) 11.2.2.3.2 patch 12577723中才反应并解决了该问题

2.IT始终是以人为本,不管是多么高端的服务器、多么先进的技术,如果没有与之相匹配的人和团队来驾驭的话,那么至多只能发挥出50%的效益,在人员对先进技术极端不熟悉的情况下,智能化只是空谈!

Oracle在硬件领域正式向IBM宣战

在对原SUN客户的公示中,Larry Ellison火药味十足地表示将在硬件领域与龙头老大IBM争霸。

公示向原SUN客户表示一下四点:

  1. 在Sparc架构上花费比SUN更多的开发费用。(当然光钱多,似乎有点形而上)
  2. 在Solaris操作系统上花费比SUN更多的开发费用。(Solaris已经十分健全,Oracle可能让其更自动化–auto tuning,更好的支持?大metalink计划?)
  3. 投入超过2倍于SUN的销售和技术专家服务于Sparc和Solaris的业务。(销售当然要多,这是Oracle的老路)
  4. 整合Oracle软件与Sun硬件,将明显提升原Sun的硬件性能。

公示结尾为larry Ellison的结语:

We’re in it to win it. IBM,we ‘re looking foward to competing with you in hardware business.

oracle com iobm

这份开战宣言表明了Oracle在硬件领域发展的决心,以及同IBM争霸的勇气。

两家同样以销售著名的公司(相对SUN等技术著名公司),将在未来几年内为我们展示新一轮的IT航母兼并战。

沪ICP备14014813号-2

沪公网安备 31010802001379号