AIX平台上大型OLTP数据库的shutdown问题

客户的新系统上线已经一年有余,核心系统硬件采用IBM P595,操作系统版本AIX 5300-09,存储使用DS6000,数据库版本为10.2.0.4,没有打额外的patch set update.此套系统平时会话数量在2000-3000的水平,每秒新建会话在10个左右。

客户这套系统一直有一个问题,即每次准备关闭实例进行一些维护工作时,在多次手动执行检查点(确保脏块被写出)后,shudown immediate命令仍需要非常长的时间才完成关闭数据库动作;之前客户一直使用在shutdown之前将大部分应用服务进程杀死的方法,可以缩短 shutdown immediate命令完成的时间。

实际上服务进程在2000-3000的OLTP系统在当前已经很普及了,而在其他平台上(譬如:Linux,SUN OS)上则不会出现一个shutdown操作持续半个小时以上的情况。

通过查询My Oracle support发现一个9i上shutdown immediate费时半个小时的note:

Hdr: 3484589 9.2.0.4.0 RDBMS 9.2.0.4.0 PRODID-5 PORTID-212
Abstract: BUG:3046394  WHICH IS A REWORK OF BUG :2674297 DOES NOT  STILL FIX THE PROBLEM.
PROBLEM:
--------
Shutdown abort on AIX5L takes 7 min. and Shutdown immediate takes 30 min.
Patch for the bug 3046394( which is a rework of Bug 2674297) is applied ,but
there is no improvement in the time taken ot shutdown.
The problem is same as addressed in these two bugs but has not been fixed.

2. Pertinent configuration information
none

3. Indication of the frequency and predictability of the problem
Consistent.

4. Sequence of events leading to the problem
Shutdown abort

5. Technical impact on the customer. Include persistent after effects.
Shutdown taking a long time which is unacceptable to the  Customer.

DIAGNOSTIC ANALYSIS:
--------------------
From the alert log we see time taken for shutdown is around 7 min.
The patch 3046394 is applie,which is confirmed form the output of
Apply_3046394_02-20-2004_19-11-35.log

The shutdown abort was tried again now,it still takes the same amount of time.

The system calls made by the shutdown abort include:-
(These are a subset of the total Kill calls made)
196022:    kill(359980, 9)                    = 0
196022:    kill(359980, 9)                    = 0
196022:    kill(359980, 0)                    = 0
196022:    _nsleep(0x0FFFFFFFFFFF9D80, 0x0FFFFFFFFFFF9E50)    = 0
196022:    kill(359980, 0)                    = 0
196022:    _nsleep(0x0FFFFFFFFFFF9D80, 0x0FFFFFFFFFFF9E50)    = 0
196022:    kill(359980, 0)                    = 0
196022:    _nsleep(0x0FFFFFFFFFFF9D80, 0x0FFFFFFFFFFF9E50)    = 0
196022:    kill(359980, 0)                    = 0
196022:    _nsleep(0x0FFFFFFFFFFF9D80, 0x0FFFFFFFFFFF9E50)    = 0

WORKAROUND:
-----------
No Workaround available.

RELATED BUGS:
-------------
Bug:3046394 , Bug :2674297

另一个文档指出,shutdown abort慢的主要原因Oracle关闭进程使用的是system call kill()函数,Bug 3484589的补丁中已经将Oracle 在shutdown实例时可能的等待问题移除了。

Shutdown abort taking long time on AIX (patch 3046394 applied) [ID 274399.1]
Applies To

Oracle Server - Enterprise Edition - Version: 9.2.0.1 to 9.2.0.4
AIX5L Based Systems (64-bit)
Symptoms

Shutdown abort taking a long time to complete on AIX.
Cause
This has been filed as bug 2674297. The wait time was reduced to shorten the shutdown time. This still 
takes around 50ms for each process to be terminated. This has been re-worked in Bug 3046394 to remove the 
delay during the shutdown and compensate for it during database startup.
But we still see the delay in shutdown abort on AIX. The delay during shutdown abort persists after applying 
the above patch (3046394) due to OS level handling of the shutdown command.

Fix
There is no fix. The problem still persists at the OS level. This is because of the way the kill command is 
processed at the OS level. Bug 3484589, has clearly mentioned that all the waits from the Oracle side have 
been removed. There is no further code that can be implemented to reduce the time at shutdown.

另一个文档则指出了这种连shutdown abort都慢的情况仅出现在AIX平台上:

Hdr: 3485690 9.2.0.3 RDBMS 9.2.0.3 OSD PRODID-5 PORTID-319 3046394
Abstract: SHUTDOWN ABORT TAKES A LONG TIME ON IBM AIX
PROBLEM:
--------
Ct has up to  8000 connections on their database. When they issue a shutdown
abort, it takes 8 minutes to complete. With 175 connections, it took 2.5
minutes to shutdown abort. With no one attached it takes 14 seconds.

When I test this on our database in house, AIX took 12 seconds to shutdown.
Linux takes .2 seconds, as does Solaris, and Windows 2000.

DIAGNOSTIC ANALYSIS:
--------------------
Applied patch for bug 2674297, and shutdown abort with 5000 connections took 5
minutes.

看起来这是一个操作系统调用引起的问题,且因为是操作系统的问题,Oracle无法提供进一步彻底解决方法;

总结起来当服务进程数量较多时,在AIX平台上shutdown immediate的缓慢程度往往是用户无法接受的,我们可以采取一个折中的方案,使用shutdown abort来缩短关闭实例所用的时间:

1.  alter system checkpoint;             -- 手动做检查点三次,保证脏块写出

2.  shutdown abort;                      -- abort实例,相当于断电

3.  startup restrict ;                   -- 以限制模式开启实例,普通应用此时无法连接数据库,故不会产生额外的服务进程

4.  shutdown normal;                     -- 以普通模式关闭实例,会再次进行完全检查点,并回滚事务

其中第二步的shutdown abort,可以以杀死pmon进程的方式替代(这样其实会更快);采用以上折中方式前,请确认您的online redo log处于复用或已镜像的状态下。

Comments

  1. noDBA says

    shutdown abort有些情况能乱用吗?你这边文章有问题.

    • 任何命令工具 包括shutdown abort都有其适用的范围, 请注意这篇文章的前提:

      1. shutdown immediate持续几个小时无法完成,结果还是需要abort或者KILL PMON之类的关键进程来关闭实例

      2.在shutdown abort之前完成多个”ALTER SYSTEM CHECKPOINT” 的FULL CHECK,保证绝大多数脏块都已经被写出了 ,前滚(ROLLING FORWARD) 需要完成的工作很少

      3.确保REDO LOG多路复用,不被损坏, 这保证了前滚可以顺利完成

      以上三点前提下 我认为使用SHUTDOWN ABORT是合理的,而非”乱用”。

  2. hctech says

    我这边也有shutdown immediate很久(20分钟左右)的情况,但我觉得是因为SGA过大的原因(80G)左右~

    后来我先用alter system flush buffer_cache,然后再checkpoint,shutdown就快很多了~

Comment

*

沪ICP备14014813号-2

沪公网安备 31010802001379号