Exadata、Greenplum、Netezza配置大对比

Exadata vs Greenplum vs Netezza

 

Exadata实体图

Oracle-Sun  Exadata X2 数据库一体机实体图:

 

 

 

Exadata Smart Flash Logging新特性

Exadata Storage Software 11.2.2.4开始引入了Exadata Smart Flash Logging新特性。该特性允许LGWR进程将redo并行写入flash cache和disk controller cache,只要其中任意一个率先写完成,就会通知RDBMS数据库继续工作。该特性着眼于改善Exadata上redo写入的响应时间和吞吐量(The feature allows redo writes to be written to both flash cache and disk controller cache, with an acknowledgement sent to the RDBMS as soon as either of these writes completes; this improves response times and throughput.),特别有利于log file sync等待事件是主要性能瓶颈的系统。

当频繁的redo重做日志写IO成为Exadata一体机的主要性能瓶颈时,Exadata开发部门自然想到了利用DBM上已有的flash cache来减少响应时间的办法,但同时又要保证不会因为flash cache的引入而导致redo关键日志信息的丢失:

 

The main problem is that the number of redo log writes is very high, even when there is no activity in the database, so a flash disk would reach its write endurance limit very quickly – within days or months (I have not seen exact test results). In that case you would lose the flash disk and all data on it, and losing redo logs is a very unpleasant cause of database unavailability, which can mean long downtime and possible data loss.
As you already know, 11.2.2.4.0 introduced the Smart Flash Log feature. For customers that are not on 11.2.2.4.0, we do not suggest putting the redo logs on a diskgroup that uses griddisks carved from flashdisks. There are various issues with placing redo logs on the flash cache in earlier versions, and those should be avoided.

 

 

只要是安装了Exadata Storage Software 11.2.2.4补丁的系统,都会隐式地启用Exadata Smart Flash Logging特性,但它同时要求数据库版本不低于Database 11.2.0.2 Bundle Patch 11。
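
可以在存储节点上用imageinfo命令确认cell当前的存储软件版本(以下输出为示意,具体版本串以实际环境为准):

# imageinfo | grep "Active image version"
Active image version: 11.2.2.4.0.110929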

Metalink目前没有介绍如何在已经启用Exadata Smart Flash Logging的DBM上禁用(disable)该特性。

 

实际每个cell会分配512MB的flash空间用于Smart Flash Logging,因此现在每个cell可用的flash cache空间为364.75GB。

 

不仅Online Redo Log可以受益于Smart Flash Logging,Standby Redo Log也可以从该特性中获得性能提升,前提是满足必要的软件版本组合:cell patch 11.2.2.4以及Database 11.2.0.2 Bundle Patch 11 or greater。

可以通过CellCLI 命令行了解现有的Smart Flash Logging配置,若有输出则说明配置了Smart Flash Logging。

 

CellCLI> LIST FLASHLOG DETAIL
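
以下为该命令的一个示意输出(cell名等字段为虚构,取值因环境与版本而异):

         name:                   cell01_FLASHLOG
         cellDisk:               FD_00_cell01,FD_01_cell01,FD_02_cell01,FD_03_cell01
         creationTime:           2012-09-12T06:00:00+08:00
         effectiveSize:          512M
         efficiency:             100.0
         size:                   512M
         status:                 normal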

 

更多信息可以参考文档”Exadata Smart Flash Logging Explained”,引用如下:

 

Smart Flash Logging works as follows. When receiving a redo log write request, Exadata will do
parallel writes to the on-disk redo logs as well as a small amount of space reserved in the flash
hardware. When either of these writes has successfully completed the database will be
immediately notified of completion. If the disk drives hosting the logs experience slow response
times, then the Exadata Smart Flash Cache will provide a faster log write response time.
Conversely, if the Exadata Smart Flash Cache is temporarily experiencing slow response times
(e.g., due to wear leveling algorithms), then the disk drive will provide a faster response time.
Given the speed advantage the Exadata flash hardware has over disk drives, log writes should be
written to Exadata Smart Flash Cache, almost all of the time, resulting in very fast redo write
performance. This algorithm will significantly smooth out redo write response times and provide
overall better database performance.

The Exadata Smart Flash Cache is not used as a permanent store for redo data – it is just a
temporary store for the purpose of providing fast redo write response time. The Exadata Smart
Flash Cache is a cache for storing redo data until this data is safely written to disk. The Exadata
Storage Server comes with a substantial amount of flash storage. A small amount is allocated for
database logging and the remainder will be used for caching user data. The best practices and
configuration of redo log sizing, duplexing and mirroring do not change when using Exadata
Smart Flash Logging. Smart Flash Logging handles all crash and recovery scenarios without
requiring any additional or special administrator intervention beyond what would normally be
needed for recovery of the database from redo logs. From an end user perspective, the system
behaves in a completely transparent manner and the user need not be aware that flash is being
used as a temporary store for redo. The only behavioral difference will be consistently low
latencies for redo log writes.

By default, 512 MB of the Exadata flash is allocated to Smart Flash Logging. Relative to the 384
GB of flash in each Exadata cell this is an insignificant investment for a huge performance
benefit. This default allocation will be sufficient for most situations. Statistics are maintained to
indicate the number and frequency of redo writes serviced by flash and those that could not be
serviced, due to, for example, insufficient flash space being allocated for Smart Flash Logging.
For a database with a high redo generation rate, or when many databases are consolidated on to
one Exadata Database Machine, the size of the flash allocated to Smart Flash Logging may need
to be enlarged. In addition, for consolidated deployments, the Exadata I/O Resource Manager
(IORM) has been enhanced to enable or disable Smart Flash Logging for the different databases
running on the Database Machine, reserving flash for the most performance critical databases.
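
对于引文中提到的统计信息与IORM控制,可以在cell上通过CellCLI观察FLASHLOG相关的metric,或在IORM计划中按数据库启用/禁用Smart Flash Logging。以下命令仅为示意(数据库名proddb/testdb为虚构,metric名称与flashLog指令的具体语法以所用版本的文档为准):

CellCLI> LIST METRICCURRENT WHERE objectType = 'FLASHLOG'
CellCLI> ALTER IORMPLAN dbplan=((name=proddb, flashLog=on), (name=testdb, flashLog=off))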

 

以及<Exadata Smart Flash Cache Features and the Oracle Exadata Database Machine>官方白皮书,公开的文档地址如下:

 

 

http://www.oracle.com/technetwork/database/exadata/exadata-smart-flash-cache-366203.pdf

Exadata X2-2 vs EMC Greenplum DCA vs Netezza TwinFin 12主要配置大对比

下图列出了Oracle Exadata X2-2 vs EMC Greenplum DCA vs Netezza TwinFin 12 三种一体机的主要配置对比:

 

Oracle Sun Exadata V2、X2-2、X2-8 主要配置对比

| 配置项 | V2 Full Rack | X2-2 Full Rack | X2-8 Full Rack |
| Database servers | 8 x Sun Fire X4170 1U | 8 x Sun Fire X4170 M2 1U | 2 x Sun Fire X4800 5U |
| Database CPUs | Xeon E5540 quad core 2.53GHz | Xeon X5670 six cores 2.93GHz | Xeon X7560 eight cores 2.26GHz |
| Database cores | 64 | 96 | 128 |
| Database RAM | 576GB | 768GB | 2TB |
| Storage cells | 14 x SunFire X4275 | 14 x SunFire X4270 M2 | 14 x SunFire X4270 M2 |
| Storage cell CPUs | Xeon E5540 quad core 2.53GHz | Xeon L5640 six cores 2.26GHz | Xeon L5640 six cores 2.26GHz |
| Storage cell CPU cores | 112 | 168 | 168 |
| IO performance & capacity | 15K RPM 600GB SAS or 2TB SATA 7.2K RPM disks | 15K RPM 600GB SAS (HP model, high performance) or 2TB SAS 7.2K RPM disks (HC model, high capacity) | 15K RPM 600GB SAS (HP model, high performance) or 2TB SAS 7.2K RPM disks (HC model, high capacity) |
| Flash Cache | 5.3TB | 5.3TB | 5.3TB |
| Database servers networking | 4 x 1GbE x 8 servers = 32 x 1GbE | 4 x 1GbE x 8 servers + 2 x 10GbE x 8 servers = 32 x 1GbE + 16 x 10GbE | 8 x 1GbE x 2 servers + 8 x 10GbE x 2 servers = 16 x 1GbE + 16 x 10GbE |
| InfiniBand Switches | QDR 40Gbit/s wire | QDR 40Gbit/s wire | QDR 40Gbit/s wire |
| InfiniBand ports on database servers (total) | 2 ports x 8 servers = 16 ports | 2 ports x 8 servers = 16 ports | 8 ports x 2 servers = 16 ports |
| Database Servers OS | Oracle Linux only | Oracle Linux (possibly Solaris later, still unclear) | Oracle Linux or Solaris x86 |

注:X2-2/X2-8上的2TB SAS盘实为原来的2TB盘换用新的SAS电子部件(thanks Kevin Closson for the reference)。

Exadata Database Machine Host的操作系统OS版本

之前有同事问我Exadata用的是什么操作系统这个问题?

最早Oracle与HP合作的Exadata V1采用的是Oracle Enterprise Linux;而Oracle-Sun的Exadata V2目前也仅提供OEL,但Solaris 11 Express已经通过了在Exadata V2上的测试,所以很快Exadata V2将会有Solaris可选。

目前现有的Exadata X2-2和X2-8绝大多数采用以下2个OEL 5小版本:

较早出厂的使用OEL 5.3
# cat /etc/enterprise-release
Enterprise Linux Enterprise Linux Server release 5.3 (Carthage)

近期出厂的使用OEL 5.5

# cat /etc/enterprise-release
Enterprise Linux Enterprise Linux Server release 5.5 (Carthage)

# uname -a
Linux vrh1.us.oracle.com 2.6.18-128.1.16.0.1.el5 #1 SMP Tue x86_64 x86_64 x86_64 GNU/Linux

 

The IB should be one of the compatible cards specified in Note 888828.1
If you build a backup server machine, it is best to build as close a clone of the Exadata compute nodes as you can get.
I.e. install OEL 5 Update 5 and one of the IB cards specified in the note, and you will have the correct OFED versions and kernel.
This will guarantee interoperability and correct operation with the kernel and OFED drivers.
From the doc:
InfiniBand OFED Software
Exadata Storage Servers and database servers will interoperate with different InfiniBand OFED software versions, however, Oracle recommends that all versions be the same unless performing a rolling upgrade. Review Note 1262380.1 for database server software and firmware guidelines.

InfiniBand HCA
Exadata Storage Servers and database servers will interoperate with different InfiniBand host channel adapter (HCA) firmware versions, however, Oracle recommends that all versions be the same unless performing a rolling upgrade. Review Note 1262380.1 for database server software and firmware guidelines.

For a complete list of the Oracle QDR InfiniBand adapters see here:

http://www.oracle.com/technetwork/documentation/oracle-net-sec-hw-190016.html#infinibandadp

For the compute nodes all firmware updates must be done via the bundle patches described in Doc 888828.1,
so I would advise upgrading to the latest supported bundle patch.

For your backup server choose the same model card that came with the X2 compute nodes.
Install Oracle Enterprise Linux Release 5 Update 5.
Upgrade the firmware to the same firmware as on the X2, or higher, if not already the same.

Database Machine and Exadata Storage Server 11g Release 2 (11.2) Supported Versions [ID 888828.1]

Exadata混合列压缩如何处理INSERT和UPDATE

Hybrid Columnar Compression混合列压缩是Exadata数据库一体机的核心功能之一。与普通的高级压缩特性(Advanced Compression)不同,Hybrid Columnar Compression(HCC)仅在Exadata平台上可用。使用HCC时,数据压缩存放在CU(compression unit,压缩单元)中,一个CU包括多个数据库块;这是因为单个数据块不利于实现列式压缩算法,当一个CU包含多个block时才能取得较优的列值压缩效果。

同时,普通的INSERT/UPDATE操作会造成行级数据的压缩降级,即经历UPDATE/INSERT后,原本以HCC压缩的行可能降至普通高级压缩的水平。

 

Hybrid Columnar Compression适用于数据仓库场景的批量初始化导入(bulk initial load)和直接路径导入(direct load),例如ALTER TABLE MOVE、IMPDP或直接路径插入(APPEND INSERT);使用HCC的前提是这些数据不会被频繁修改或从不被修改。

 

当你更新启用了混合列压缩的表中的数据行时,该行所在的整个压缩单元(CU)中的数据都将被锁住。被更新的数据行将从原HCC压缩级别降级到例如无压缩或OLTP压缩的水平。

 

我们来看以下例子:

 

 

SQL*Plus: Release 11.2.0.2.0 Production on Wed Sep 12 06:14:53 2012

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

SQL> grant dba to scott;

Grant succeeded.

SQL> conn scott/oracle
Connected.
SQL> 
SQL> create table hcc_maclean tablespace users compress for query high as select * from dba_objects;

Table created.

  1* select rowid,owner,object_name,dbms_rowid.rowid_block_number(rowid) from hcc_maclean where owner='MACLEAN'
SQL> /

ROWID                          OWNER                          OBJECT_NAME          DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
------------------------------ ------------------------------ -------------------- ------------------------------------
AAAThuAAEAAAHTJAOI             MACLEAN                        SALES                                               29897
AAAThuAAEAAAHTJAOJ             MACLEAN                        MYCUSTOMERS                                         29897
AAAThuAAEAAAHTJAOK             MACLEAN                        MYCUST_ARCHIVE                                      29897
AAAThuAAEAAAHTJAOL             MACLEAN                        MYCUST_QUERY                                        29897
AAAThuAAEAAAHTJAOh             MACLEAN                        COMPRESS_QUERY                                      29897
AAAThuAAEAAAHTJAOi             MACLEAN                        UNCOMPRESS                                          29897
AAAThuAAEAAAHTJAOj             MACLEAN                        CHAINED_ROWS                                        29897
AAAThuAAEAAAHTJAOk             MACLEAN                        COMPRESS_QUERY1                                     29897

8 rows selected.

select dbms_rowid.rowid_block_number(rowid),dbms_rowid.rowid_relative_fno(rowid) from hcc_maclean where owner='MACLEAN';

session A:

update hcc_maclean set OBJECT_NAME=OBJECT_NAME||'DBM' where rowid='AAAThuAAEAAAHTJAOI';

session B:

update hcc_maclean set OBJECT_NAME=OBJECT_NAME||'DBM' where rowid='AAAThuAAEAAAHTJAOJ';

SQL> select sid,wait_event_text,BLOCKER_SID from v$wait_chains;

       SID WAIT_EVENT_TEXT                                                  BLOCKER_SID
---------- ---------------------------------------------------------------- -----------
        13 enq: TX - row lock contention                                            136
       136 SQL*Net message from client

可以看到session A阻塞了session B,这验证了HCC压缩下update某一行会导致该行所在的整个CU被锁住的说法。

SQL> alter system checkpoint;

System altered.

SQL> /     

System altered.

SQL> alter system dump datafile 4 block 29897
  2  ;

  Block header dump:  0x010074c9
 Object id on Block? Y
 seg/obj: 0x1386e  csc: 0x00.1cad7e  itc: 3  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x10074c8 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn 0x0000.001cabfa
0x02   0x000a.00a.00000430  0x00c051a7.0169.17  ----    1  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000

avsp=0x14
tosp=0x14
        r0_9ir2=0x0
        mec_kdbh9ir2=0x0
                      76543210
        shcf_kdbh9ir2=----------
                  76543210
        flag_9ir2=--R-----      Archive compression: Y
                fcls_9ir2[0]={ }
0x16:pti[0]     nrow=1  offs=0
0x1a:pri[0]     offs=0x30
block_row_dump:
tab 0, row 0, @0x30
tl: 8016 fb: --H-F--N lb: 0x2  cc: 1          ==>整个CU指向ITL 0x02
nrid:  0x010074ca.0
col  0: [8004]
Compression level: 02 (Query High)
 Length of CU row: 8004
kdzhrh: ------PC CBLK: 1 Start Slot: 00
 NUMP: 01
 PNUM: 00 POFF: 7984 PRID: 0x010074ca.0
CU header:
CU version: 0   CU magic number: 0x4b445a30
CU checksum: 0xf8faf86e
CU total length: 8694
CU flags: NC-U-CRD-OP
ncols: 15
nrows: 995
algo: 0
CU decomp length: 8487   len/value length: 100111
row pieces per row: 1
num deleted rows: 1
deleted rows: 904,
START_CU:

 

 

我们可以使用如下方式衡量row的压缩情况:

 

 

SQL> select DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN','AAAThuAAEAAAHTJAOk') from dual;

DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN','AAATHUAAEAAAHTJAOK'
--------------------------------------------------------------------------------
                                                                               4

 

COMP_NOCOMPRESS CONSTANT NUMBER := 1;
COMP_FOR_OLTP CONSTANT NUMBER := 2;
COMP_FOR_QUERY_HIGH CONSTANT NUMBER := 4;
COMP_FOR_QUERY_LOW CONSTANT NUMBER := 8;
COMP_FOR_ARCHIVE_HIGH CONSTANT NUMBER := 16;
COMP_FOR_ARCHIVE_LOW CONSTANT NUMBER := 32;

COMP_RATIO_MINROWS CONSTANT NUMBER := 1000000;
COMP_RATIO_ALLROWS CONSTANT NUMBER := -1;

上面列出了压缩类型对应的常量值,例如COMP_FOR_QUERY_HIGH是4,COMP_FOR_QUERY_LOW是8。

这里通过GET_COMPRESSION_TYPE查询指定rowid得到的结果是4,说明该行是以COMP_FOR_QUERY_HIGH形式压缩的:



SQL>  update hcc_maclean set OBJECT_NAME=OBJECT_NAME||'DBM' where owner='MACLEAN';

8 rows updated.

SQL> commit;

Commit complete.




SQL>  select DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN',rowid) from HCC_MACLEAN where owner='MACLEAN';

DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN',ROWID)
------------------------------------------------------------------
                                                                 1
                                                                 1
                                                                 1
                                                                 1
                                                                 1
                                                                 1
                                                                 1
                                                                 1

8 rows selected.

以上更新一定量数据后,可以看到这些行的COMPRESSION_TYPE由COMP_FOR_QUERY_HIGH(4)降级为COMP_NOCOMPRESS(1),这说明这张表虽然定义为compress for query high,但部分数据在更新后实际不再被压缩。
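
如果想统计全表中各压缩级别的行数分布,可以使用类似下面的查询(仅为示意;注意它会对每一行调用一次GET_COMPRESSION_TYPE,在大表上开销很大,只适合小表或抽样验证):

SQL> select comp_type, count(*)
       from (select dbms_compression.get_compression_type('SCOTT', 'HCC_MACLEAN', rowid) comp_type
               from hcc_maclean)
      group by comp_type;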

在11g中,这些恢复为非压缩态的数据行不会自动重新升级成HCC状态。必要时可手动执行ALTER TABLE MOVE或在线重定义,以便将非压缩态的数据转换回HCC状态。



SQL>  ALTER TABLE hcc_MACLEAN move COMPRESS FOR ARCHIVE HIGH;

Table altered.

SQL> select DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN',rowid) from HCC_MACLEAN where owner='MACLEAN';

DBMS_COMPRESSION.GET_COMPRESSION_TYPE('SCOTT','HCC_MACLEAN',ROWID)
------------------------------------------------------------------
                                                                16
                                                                16
                                                                16
                                                                16
                                                                16
                                                                16
                                                                16
                                                                16

8 rows selected.
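
除了ALTER TABLE MOVE之外,也可以通过在线重定义将降级的行重新压缩为HCC,期间原表保持可读写。以下为一个简要示意(中间表HCC_INTERIM为笔者虚构;因示例表无主键,这里采用ROWID方式):

create table scott.hcc_interim compress for query high
as select * from scott.hcc_maclean where 1=0;

declare
  num_errors pls_integer;
begin
  -- 以ROWID方式开始在线重定义
  dbms_redefinition.start_redef_table('SCOTT', 'HCC_MACLEAN', 'HCC_INTERIM',
      options_flag => dbms_redefinition.cons_use_rowid);
  -- 复制索引、触发器、约束和权限等依赖对象
  dbms_redefinition.copy_table_dependents('SCOTT', 'HCC_MACLEAN', 'HCC_INTERIM',
      dbms_redefinition.cons_orig_params, true, true, true, false, num_errors);
  -- 完成重定义后, HCC_MACLEAN中的数据即以QUERY HIGH重新压缩
  dbms_redefinition.finish_redef_table('SCOTT', 'HCC_MACLEAN', 'HCC_INTERIM');
end;
/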

数据仓库:Oracle Exadata和Netezza的比较

虽然都是王婆卖瓜,但胜在他山之石可以攻玉,了解一下Exadata 之外的 一体机选择。

 

(图:oracle_exadata_netezzaTwinfin_compared,Oracle Exadata与Netezza TwinFin对比)

 

 

 

Exadata用户的困境

当然,这是Oracle-Sun Exadata V2数据库一体机的竞争者Netezza公司制作的一个小视频,但内容并非空穴来风!

 

 

很多人告诉我有了Exadata这么智能的一体机后都可以不需要dba了,那么这是事实吗?

 

就我个人的体验,部署了Exadata之后DBA花在管理上的时间更多了。

 

Exadata上复杂的数据库升级步骤:

Infiniband Switch – > Exadata Storage Software – > Exadata Host Minimal Patch -> Grid Infrastructure – > Database

客户视角的Exadata 详细升级过程:

  1. Exadata Cells from 11.2.1.3.1 to 11.2.2.2.0
  2. Exadata database nodes from 11.2.1.3.1 to 11.2.2.2.0
  3. Applying a number of pre and post patching fixes (addendums to the standard patching instructions, probably included due to customer-reported issues with the patching)
  4. Upgrade kernel on database nodes
  5. Upgrade GI Home to 11.2.0.2
  6. Upgrade ALL RDBMS Homes to 11.2.0.2
  7. Upgrade the database to 11.2.0.2
  8. Apply 11.2.0.2 BP2 to ALL Homes (part of this needs to be done manually, as opatch auto doesn't work when the installation has different GI and RDBMS owners)

 

升级所涉及的多层Component:

这升级步骤不是人写的,有木有?!

Exadata上的 Bundle Patch (Replaces recalled patch 8626760) 是可以recall 召回的有木有?!

#别去搜这个patch,说了recall了就是不会让你找到的!

号称出厂即最优的配置,结果连简单的ulimit memlock参数设置都有问题,有木有?

 

昨天客户语重心长地告诉我,他们准备把Exadata V2上的核心应用迁移走。客户在09年就开始用Exadata,是不是国内第一家我不知道,但至少应该是用于生产的第一批。但是这2年来因为Exadata折腾了无数次,现在终于不想折腾了。实际是用户在测试Exadata时总能看到其优点,而实际使用时只看到缺点了!

历数Exadata V2数据库一体机的几大致命缺点

  1. 软硬件一体,耦合过于紧密
  2. 升级操作十分复杂,客户看到升级流程后几乎绝望
  3. 国内使用经验贫乏
  4. support只有Oracle自己一家,没有第三方可选
  5. 就以往的表现来看美国支撑的X team并不给力。。。。。。
  6. 贵,Exadata很贵!具体报价可参考Exadata V2 Pricing

Warning: Even Exadata has a wrong memlock setting

事情要从大约2个月前的一起事故说起。有一套部署在Oracle-Sun Exadata V2 Database Machine上的4节点11.2.0.1 RAC数据库,其中一个节点的RAC关键后台进程LMS报出ORA-00600[kjbmprlst:shadow]错误,随后LMS后台进程终止了该节点上的实例。其他节点上的CRS软件检测到该意外终止后,数据库进入全局资源的重新配置过程(Reconfiguration),Reconfiguration在所有剩余节点上都顺利完成了。

但是随后其中一个节点的告警日志中持续出现"Process W000 died, see its trace file",似乎是实例无法获得分配新进程所需的资源;同时应用程序出现无法登录该节点实例的情况。本来4节点的RAC数据库,因为ORA-00600挂了一个,现在又有一个节点登不上,一下只剩下一半性能。

随后我赶到了问题现场,继续诊断问题,并发现了以下症状,在此一一列举:

1.尝试远程登录该实例失败,出现"ORA-12516 TNS:listener could not find available handler with matching protocol stack"错误,反复登录会出现以下信息:

Linux Error: 12: Cannot allocate memory
 Additional information: 1
 ORA-01034: ORACLE not available

 

2.确认过ORACLE_SID、ORACLE_HOME等多个环境变量后,使用"sqlplus / as sysdba"登录却返回"Connected to an idle instance."(这一点最为蹊跷)。无法以sysdba登录就无法收集必要的诊断信息,虽然可以通过gdb等手段做systemstate dump,但暂时绕过。
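
顺带一提,当无法以sysdba登录时,可以借助gdb附加到某个Oracle服务进程上,调用内部函数触发systemstate dump。以下仅为示意(ksudss为Oracle内部函数,参数与可用性因版本而异,生产环境请谨慎使用):

# 先找到一个oracle服务进程的OS进程号<ospid>, 再用gdb附加
gdb $ORACLE_HOME/bin/oracle <ospid>
(gdb) call ksudss(10)    # 触发level 10的systemstate dump, 写入该进程的trace文件
(gdb) detach
(gdb) quit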

 

3. 后台进程W000由SMCO进程启动, SMCO进程的日志如下,所报状态为KSOSP_SPAWNED:

Process W000 is dead (pid=2648 req_ver=3812 cur_ver=3812 state=KSOSP_SPAWNED).
 *** 2011-07-08 02:44:32.971
 Process W000 is dead (pid=2650 req_ver=3813 cur_ver=3813 state=KSOSP_SPAWNED).

 

4. 确认组成instance的内存和后台进程均存活,且仍有日志产生

[oracle@maclean04 trace]$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0xac5ffd78 491524     oracle    660        4096       0
0x96c5992c 1409029    oracle    660        4096       0  

[oracle@maclean04 trace]$ ls -l /dev/shm
total 34839780
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_0
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_1
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_10
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_100
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_101
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_102
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_103
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_104
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_105
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_106
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_107
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_108
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_109
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_11
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_110
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_111
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_112
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_113
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_114
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_115
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_116
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_117
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_118
-rw-r----- 1 oracle oinstall 268435456 Jun  7 07:19 ora_maclean4_1409029_119
-rw-r----- 1 oracle oinstall         0 Jun  7 07:19 ora_maclean4_1409029_12
.......................

[oracle@maclean04 trace]$ ps -ef|grep ora_
oracle    5466     1  0 Jul03 ?        00:00:18 ora_pz99_maclean4
oracle   14842 10564  0 19:54 pts/9    00:00:00 grep ora_
oracle   18641     1  0 Jun08 ?        00:00:02 ora_q002_maclean4
oracle   23932     1  0 Jun07 ?        00:04:26 ora_pmon_maclean4
oracle   23934     1  0 Jun07 ?        00:00:06 ora_vktm_maclean4
oracle   23938     1  0 Jun07 ?        00:00:00 ora_gen0_maclean4
oracle   23940     1  0 Jun07 ?        00:00:06 ora_diag_maclean4
oracle   23942     1  0 Jun07 ?        00:00:00 ora_dbrm_maclean4
oracle   23944     1  0 Jun07 ?        00:01:01 ora_ping_maclean4
oracle   23946     1  0 Jun07 ?        00:00:16 ora_psp0_maclean4
oracle   23948     1  0 Jun07 ?        00:00:00 ora_acms_maclean4
oracle   23950     1  0 Jun07 ?        02:27:29 ora_dia0_maclean4
oracle   23952     1  0 Jun07 ?        01:19:42 ora_lmon_maclean4
oracle   23954     1  0 Jun07 ?        02:23:59 ora_lmd0_maclean4
oracle   23956     1  5 Jun07 ?        1-13:50:36 ora_lms0_maclean4
oracle   23960     1  4 Jun07 ?        1-12:44:25 ora_lms1_maclean4
oracle   23964     1  0 Jun07 ?        00:00:00 ora_rms0_maclean4
oracle   23966     1  0 Jun07 ?        00:00:00 ora_lmhb_maclean4
oracle   23968     1  0 Jun07 ?        01:58:35 ora_mman_maclean4
oracle   23970     1  0 Jun07 ?        06:28:39 ora_dbw0_maclean4
oracle   23972     1  0 Jun07 ?        06:27:08 ora_dbw1_maclean4
oracle   23974     1  2 Jun07 ?        16:49:56 ora_lgwr_maclean4
oracle   23976     1  0 Jun07 ?        00:20:48 ora_ckpt_maclean4
oracle   23978     1  0 Jun07 ?        00:07:03 ora_smon_maclean4
oracle   23980     1  0 Jun07 ?        00:00:00 ora_reco_maclean4
oracle   23982     1  0 Jun07 ?        00:00:00 ora_rbal_maclean4
oracle   23984     1  0 Jun07 ?        00:01:00 ora_asmb_maclean4
oracle   23986     1  0 Jun07 ?        00:08:15 ora_mmon_maclean4
oracle   23988     1  0 Jun07 ?        00:18:19 ora_mmnl_maclean4
oracle   23992     1  0 Jun07 ?        00:00:00 ora_d000_maclean4
oracle   23994     1  0 Jun07 ?        00:00:00 ora_s000_maclean4
oracle   23996     1  0 Jun07 ?        00:00:00 ora_mark_maclean4
oracle   24065     1  0 Jun07 ?        01:16:54 ora_lck0_maclean4
oracle   24067     1  0 Jun07 ?        00:00:00 ora_rsmn_maclean4
oracle   24079     1  0 Jun07 ?        00:01:02 ora_dskm_maclean4
oracle   24174     1  0 Jun07 ?        00:08:18 ora_arc0_maclean4
oracle   24188     1  0 Jun07 ?        00:08:19 ora_arc1_maclean4
oracle   24190     1  0 Jun07 ?        00:00:59 ora_arc2_maclean4
oracle   24192     1  0 Jun07 ?        00:08:12 ora_arc3_maclean4
oracle   24235     1  0 Jun07 ?        00:00:00 ora_gtx0_maclean4
oracle   24237     1  0 Jun07 ?        00:00:00 ora_rcbg_maclean4
oracle   24241     1  0 Jun07 ?        00:00:00 ora_qmnc_maclean4
oracle   24245     1  0 Jun07 ?        00:00:00 ora_q001_maclean4
oracle   24264     1  0 Jun07 ?        00:08:28 ora_cjq0_maclean4
oracle   25782     1  0 Jun07 ?        00:00:00 ora_smco_maclean4

 

5.确认问题发生时系统中仍有大量空闲内存且未发生大量SWAP,此外/dev/shm共享内存目录仍有27G空闲。
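
可以用类似下面的命令确认空闲内存、SWAP与/dev/shm的使用情况(示意):

$ free -g
$ vmstat 5 3
$ df -h /dev/shm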

 

6.在其他节点上查询全局动态性能视图gv$resource_limit,发现故障节点上processes的历史最大使用量(MAX_UTILIZATION)仅为404,远低于1500的上限,并不算多。

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for Linux: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmac.cn

SQL> select * from gv$resource_limit where inst_id=4;

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION             LIMIT_VALUE
------------------------------ ------------------- --------------- ------------------------------ -------------
processes                                       50             404       1500                           1500
sessions                                        61             616       2272                           2272
enqueue_locks                                  849            1599      31062                          31062
enqueue_resources                              846            1007      15016                      UNLIMITED
ges_procs                                       47             399       1503                           1503
ges_ress                                     65943          109281      67416                      UNLIMITED
ges_locks                                    23448           37966      92350                      UNLIMITED
ges_cache_ress                                7347           14716          0                      UNLIMITED
ges_reg_msgs                                   337            5040       3730                      UNLIMITED
ges_big_msgs                                    26             502       3730                      UNLIMITED
ges_rsv_msgs                                     0               1       1000                           1000
gcs_resources                              2008435         2876561    3446548                        3446548
gcs_shadows                                1888276         2392064    3446548                        3446548
dml_locks                                        0               0       9996                      UNLIMITED
temporary_table_locks                            0              45  UNLIMITED                      UNLIMITED
transactions                                     0               0       2499                      UNLIMITED
branches                                         0               2       2499                      UNLIMITED
cmtcallbk                                        0               3       2499                      UNLIMITED
max_rollback_segments                          109             129       2499                          65535
sort_segment_locks                               0              14  UNLIMITED                      UNLIMITED
k2q_locks                                        0               2       4544                      UNLIMITED
max_shared_servers                               1               1  UNLIMITED                      UNLIMITED
parallel_max_servers                             1              19        160                           3600

 

7. Exadata节点系统内核参数文件sysctl.conf中的配置正确:

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

########### BEGIN DO NOT REMOVE Added by Oracle Exadata ###########
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
# bug 8311668 file-max and aio-max-nr
fs.file-max = 6815744
# DB install guide says the above
fs.aio-max-nr = 1048576
# 8976963
net.ipv4.neigh.bond0.locktime=0
net.ipv4.ip_local_port_range = 9000 65500
# DB install guide says the above
net.core.rmem_default = 4194304
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 2097152
# The original DB deployment was net.core.wmem_max = 1048586 but IB works
# best for Exadata at the above net.core settings
# bug 8268393 remove vm.nr_hugepages = 2048
# bug 8778821 system reboots after 60 sec on panic
kernel.panic=60
########### END DO NOT REMOVE Added by Oracle Exadata ###########

########### BEGIN DO NOT REMOVE Added by Oracle Exadata ###########
kernel.shmmax = 64547735961
kernel.shmall = 15758724
########### END DO NOT REMOVE Added by Oracle Exadata ###########

 

8. 至此问题仍显得扑朔迷离:主要后台进程和SGA内存均完好,操作系统上仍有大量空闲内存,实例上的资源也没有达到临界点。到底是什么造成了无法分配新进程!?

出于谨慎,我最后还是检查了系统上的/etc/security/limits.conf参数文件,该文件控制着shell的一些ulimit上限。因为Exadata一体机是由Oracle安装配置后直接交付使用的,我最初的认识是:这些配置文件毫无疑问都应当是最佳配置,遵循Oracle的Best Practices。

但当我实际打开这个文件后,立即意识到这个配置有问题,似乎少了点什么。以下为该Exadata上的limits.conf文件:

########### BEGIN DO NOT REMOVE Added by Oracle Deployment Scripts ###########

oracle     soft    nproc       2047
oracle     hard    nproc       16384
oracle     soft    nofile      65536
oracle     hard    nofile      65536

########### END DO NOT REMOVE Added by Oracle Deployment Scripts ###########

显然上述limits.conf中缺少了对memlock参数的设置,在不设置memlock的情况下将使用缺省值32(KB)。以下为Exadata host上的ulimit输出:

[oracle@maclean4 shm]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 606208
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2047
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

可以观察到这里的max locked memory确实是缺省的32KB,而Oracle推荐的memlock参数要远大于该值。

在Oracle Validated Configuration中经过验证的memlock推荐值为50000000(KB)。关于Oracle Validated Configuration,详见拙作<Understand Oracle Validated Configurations>:

[oracle@rh2 ~]$ cat /etc/security/limits.conf

# Oracle-Validated setting for nofile soft limit is 131072
oracle   soft   nofile    131072

# Oracle-Validated setting for nofile hard limit is 131072
oracle   hard   nofile    131072

# Oracle-Validated setting for nproc soft limit is 131072
oracle   soft   nproc    131072

# Oracle-Validated setting for nproc hard limit is 131072
oracle   hard   nproc    131072

# Oracle-Validated setting for core soft limit is unlimited
oracle   soft   core    unlimited

# Oracle-Validated setting for core hard limit is unlimited
oracle   hard   core    unlimited

# Oracle-Validated setting for memlock soft limit is 50000000
oracle   soft   memlock    50000000 

# Oracle-Validated setting for memlock hard limit is 50000000
oracle   hard   memlock    50000000

搜索MOS可以发现Note <Ora-27102: Out Of Memory: Linux Error: 12: Cannot Allocate Memory with LOCK_SGA=TRUE> [ID 401077.1],该文档指出了因max locked memory过小可能引发"Linux Error: 12: Cannot Allocate Memory"内存无法分配的问题。

因为修改limits.conf配置文件对已经启动的进程和实例无效,所以我们无法通过纠正参数来立即解决现有的问题。
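
可以通过/proc文件系统确认某个已运行进程实际生效的limits(示意,<pid>为实例后台进程的OS进程号;输出单位为bytes,32KB即32768):

$ grep "Max locked memory" /proc/<pid>/limits
Max locked memory         32768                32768                bytes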

实际上我采用了释放资源的方法workaround了这个问题:通过以下脚本将实例上所有(LOCAL=NO)的前台服务进程杀掉以释放资源。

ps -ef|grep $SID|grep LOCAL=NO|grep -v grep| awk '{print $2}'|xargs kill -9

完成以上命令后终端出现了短暂卡顿的现象,之后恢复正常。尝试使用sysdba本地和远程登录实例均成功,应用的连接也恢复正常。

虽然修复了问题,但是还需要和客户做详尽的说明。我在邮件中阐明了该Exadata一体机上配置文件存在的问题,并提出了几点建议:

1.要求Oracle Support确认该/etc/security/limits.conf中的配置是否合理,是否需要修改
2.设置vm.min_free_kbytes = 51200 内核参数,避免因空闲内存不足引起的性能问题
3.安装OSWatcher监控软件,监控必要的系统资源

客户对我的说法也比较信服,但还是将邮件抄送了原厂Exadata一体机的售前人员。

之后售前人员也曾联系过我,我也做了相同的说明。但原厂售前认为Exadata一体机是在Oracle美国原厂进行配置安装的,配置上肯定是最优的;而且该limits.conf中memlock参数的当前值(32)和推荐值(50000000)之间差距如此之大,他们认为美国原厂的部署人员不可能犯这么低级的错误。

所以实际上他们对我关于memlock参数的说明持怀疑态度。我的建议是就该参数与MOS进一步沟通以确认问题,当然这已不是我需要完成的工作了。因为对该memlock参数存在分歧,所以短期内也没有修改该参数。

这个case就这样过去了,时间过得很快,转眼已经2个月了。恰巧最近有将Exadata上数据库升级到11.2.0.2的项目,所以翻阅了相关patch的readme文档。因为升级RAC到11.2.0.2的前提是Exadata Storage Server Software、InfiniBand Switch Software版本能够兼容,所以查阅了其兼容列表:

Version Compatibility

The following table lists the Exadata Storage Server software versions that are compatible with each supported Oracle Database 11g Release 2 software version.

| Oracle Database software version | Required Exadata Storage Server software version |
| 11g Release 2 (11.2.0.2.0) Patch Set 1 | 11.2.2.x |
| 11g Release 2 (11.2.0.1.0) | 11.2.2.x or 11.2.1.x |

The following table lists the InfiniBand Switch software versions that are compatible with each supported Exadata Storage Server software version.

| Exadata Storage Server software version | Required InfiniBand Switch software version |
| 11.2.2.2.2 and later | Exadata Database Machine (Sun Datacenter InfiniBand Switch 36): switch software version 1.1.3-2 or later. HP Oracle Database Machine (Voltaire ISR 9024D-M and ISR 9024D): switch software 5.1.1 build ID 872 (ISR 9024D-M only), switch firmware 1.0.0 or higher |
| 11.2.2.2.0 or earlier | Exadata Database Machine (Sun Datacenter InfiniBand Switch 36): switch software version 1.0.1-1 or later. HP Oracle Database Machine (Voltaire ISR 9024D-M and ISR 9024D): switch software 5.1.1 build ID 872 (ISR 9024D-M only), switch firmware 1.0.0 or higher |

 

为了将Exadata上的RAC数据库升级到11.2.0.2,首先要将Exadata Storage Server Software升级到11.2.2.x,Oracle官方目前推荐的版本是11.2.2.3.2。

所以随后我也翻阅了Exadata Storage Server Software 11.2.2.3.2 的update readme文档,即<Oracle Exadata Database Machine README for patch 12577723 (Support note 1323958.1)>。

该Patch的升级主要分成”Applying the Patch to Exadata Cells”和”Applying the Patch to the Database Server” 2个阶段,即不仅需要在Exadata Cell上实施补丁,还需要在Database节点上实施一个小补丁。

 

查看”Applying the Patch to the Database Server”章节可以发现存在这样一个步骤:

Repeat the following steps for each database host. If you are taking deployment-wide downtime for the patching, then these steps may be performed in parallel on all database hosts.

  1. Update the resource limits for the database and the grid users:
    Note:

    • This step does not apply if you have customized the values for your specific deployment and database requirements.
    WARNING:

    • Do not run this step if you have specific customized values in use for your deployment.
    1. Calculate 75% of the physical memory on the machine using the following command:
      let -i x=($((`cat /proc/meminfo | grep 'MemTotal:' | awk '{print $2}'` * 3 / 4))); echo $x
    2. Edit the /etc/security/limits.conf file to update or add following limits for the database owner (orauser) and the grid infrastructure user (griduser). Your deployment may use the same operating system user for both and it may be named as oracle user. Adjust the following as needed.
      ########## BEGIN DO NOT REMOVE Added by Oracle ###########
      orauser     soft    core       unlimited
      orauser     hard    core       unlimited
      orauser     soft    nproc       131072
      orauser     hard    nproc       131072
      orauser     soft    nofile      131072
      orauser     hard    nofile      131072
      orauser     soft    memlock     <value of x from step 01.a>
      orauser     hard    memlock     <value of x from step 01.a>
      
      griduser     soft    core       unlimited
      griduser     hard    core       unlimited
      griduser     soft    nproc       131072
      griduser     hard    nproc       131072
      griduser     soft    nofile      131072
      griduser     hard    nofile      131072
      griduser     soft    memlock     <value of x from step 01.a>
      griduser     hard    memlock     <value of x from step 01.a>
      
      ########### END DO NOT REMOVE Added by Oracle ###########

 

以上可以看到,在正式实施Patch to Database Server前做了一个补救措施,那就是为oracle和grid用户添加memlock参数。这里的memlock参数是将/proc/meminfo中的MemTotal取75%获得。在<Exadata Server Hardware Details>中我列出过Exadata Database Host的一些硬件参数,其中总内存MemTotal一般为70GB(74027752 kB),换算过来74027752*75%=55520814,也就是说Oracle实际推荐在Exadata上使用的memlock参数应当为55520814(KB),甚至高于我之前所说的50000000验证值。

至此该问题终于真相大白!而我们也可以从中学到很多东西:

1.首先我大胆猜测,实际部署Sun Exadata Machine的应该是Oracle硬件部门,也就是以前Sun的部门。实际在部署过程中,部门与部门之间的充分交流是很重要的,而09年匆匆上线的Oracle-Sun Exadata V2显然没有做好这一点,直到2011年5月发布的Oracle Exadata Database Machine 11g Release 2 (11.2) 11.2.2.3.2 patch 12577723中才反映并解决了该问题。

2.IT始终是以人为本。不管是多么高端的服务器、多么先进的技术,如果没有与之相匹配的人和团队来驾驭,那么至多只能发挥出50%的效益;在人员对先进技术极端不熟悉的情况下,智能化只是空谈!
