【Oracle数据恢复】如何重建SYSAUX表空间上的对象

SQL> select file_id,file_name from dba_data_files where tablespace_name=’SYSAUX’;

FILE_ID
———-
FILE_NAME
——————————————————————————————————————————————–
2
/s01/oradata/FIXIT/datafile/o1_mf_sysaux_85wkhmrk_.dbf
SQL> alter database datafile 2 offline;

Database altered.

 
假设这里datafile 2即SYSAUX所有的数据文件都丢失了 且没有任何备份,此时我们无法利用任何备份还原该表空间
但是SYSAUX表空间却是数据库必要的系统表空间之一,存放了 AWR等重要数据和其他一些组件的辅助数据

 

 
SQL> exec dbms_workload_repository.create_snapshot;
BEGIN dbms_workload_repository.create_snapshot; END;

*
ERROR at line 1:
ORA-13509: error encountered during updates to a AWR table
ORA-00376: file ORA-00376: file 2 cannot be read at this time
ORA-01110: data file 2: ‘/s01/oradata/FIXIT/datafile/o1_mf_sysaux_85wkhmrk_.dbf’
cannot be read at this time
ORA-06512: at “SYS.DBMS_WORKLOAD_REPOSITORY”, line 99
ORA-06512: at “SYS.DBMS_WORKLOAD_REPOSITORY”, line 122
ORA-06512: at line 1

 

我们需要在 丢失SYSAUX所有数据文件及其备份的情况下 重建该表空间上的对象

 

除了以上彻底丢失 数据文件的情况, SYSAUX 数据文件存在某些逻辑或物理坏块也可能是重建该表空间上对象的理由

例如以下错误:

ORA-00600: [kcbz_check_objd_typ] from MMON slave or its process
ORA-00600: [kdsgrp1] while querying WR% tables from SYSAUX

对应不同的AWR负载基表, $ORACLE_HOME/rdbms/admin/目录下存有相关这些系统表的创建脚本,例如:

WRI$_OPTSTAT catost.sql – Optimizer Statistics Tables
WRI$_ALERT catalrt.sql – Catalog script for server ALeRT
WRH$_* catawrtb.sql – Catalog script for AWR Tables
catawrvw.sql – Catalog script for AWR Views

接下来我们开始着手重建SYSAUX上的对象,请注意recreate sysaux上的对象是在正常恢复之外的非常规手段,也是恢复SYSAUX的最后一步,不要在产品环境中滥用以下手段:

鉴于该操作过于危险,只适合于深入了解Oracle的人士,所以具体操作步骤移除。

如果不能自行解决该问题,那么也可以联系MACLEAN专业数据库修复团队。

 

 

Oracle数据恢复:解决ORA-00600:[4000] ORA-00704: bootstrap process failure错误一例

节前帮网友打开了一套断电后无法正常启动的10.2.0.1数据库,该库使用非归档模式且没有任何备份。

主机断电后,该网友尝试重启数据库实例,但是遇到了ORA-00600:[kccpb_sanity_check_2]内部错误:

 

如果自己搞不定可以找诗檀软件专业ORACLE数据库修复团队成员帮您恢复!

诗檀软件专业数据库修复团队

服务热线 : 13764045638   QQ号:47079569    邮箱:service@parnassusdata.com

 

SQL> select status from v$instance;                                             

STATUS
------------
STARTED                                                                         

SQL>
SQL> shutdown immediate;
ORA-01507: database not mounted                                                 

ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.                                                        

Total System Global Area 2147483648 bytes
Fixed Size                  1220432 bytes
Variable Size             486539440 bytes
Database Buffers         1644167168 bytes
Redo Buffers               15556608 bytes
ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [8198],
[8175], [0x0], [], [], [], []

 

该kccpb_sanity_check_2内部错误常由不正确的control file中seq#记录引起,MOS note <ORA-00600: [kccpb_sanity_check_2] During Instance Startup>简单描述了该问题:

 

ORA-00600: [kccpb_sanity_check_2] During Instance Startup

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 and later   [Release: 10.2 and later ]
Information in this document applies to any platform.
Symptoms
The database is getting the following errors on Startup:

ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [3621501], [3621462], [0x000000000]

Changes
In this case, the customer moved the box from one data center to another.
Cause
ORA-600 [kccpb_sanity_check_2] indicates that the seq# of the last read block is
higher than the seq# of the control file header block. This is indication of
the lost write of the header block during commit of the previous cf
transaction.

Solution

1) restore a backup of a controlfile and recover

OR

2) recreate the controlfile

OR

3) restore the database from last good backup and recover

NOTE:  If you do not have any special backup of control file to restore and you are
using Multiple Control File copies in your pfile/init.ora/spfile you can attempt to mount the
database using each control file one by one.
If you are able to mount the database with any of these control file copies
you can then issue 'alter database backup controlfile to trace' to recreate controlfile.

针对该由control file corruption引起的ORA-00600: [kccpb_sanity_check_2] ,可以尝试一下三种解决方法:

  1. restore controlfile from backup
  2. 若有多路复用controlfile,则尝试使用每个单独的control file来mount实例,若能成功则使用该可用controlfile复制到其余位置
  3. 使用controlfile backup trace来重建控制文件,若没有trace那么也可以手动写create controlfile语句来重建

 

接手这个问题后尝试第二种方法失败,只好到udump/bdump目录下搜索可用的controlfile backup trace,所幸找到了,重建控制文件并成功:

Create controlfile reuse set database "kmcdb"
MAXINSTANCES 8
MAXLOGHISTORY 1
MAXLOGFILES 16
MAXLOGMEMBERS 3
MAXDATAFILES 100
Datafile
'/u01/oracle/oradata/kmcdb/system01.dbf',
'/u01/oracle/oradata/kmcdb/undotbs01.dbf',
'/u01/oracle/oradata/kmcdb/sysaux01.dbf',
'/u01/oracle/oradata/kmcdb/users01.dbf',
'/u01/oracle/oradata/kmcdb/example01.dbf'
LOGFILE GROUP 1 ('/u01/oracle/oradata/kmcdb/redo01.log') SIZE 51200K,
GROUP 2 ('/u01/oracle/oradata/kmcdb/redo02.log') SIZE 51200K,
GROUP 3 ('/u01/oracle/oradata/kmcdb/redo03.log') SIZE 51200K RESETLOGS
Wed Sep 28 16:57:47 2011
WARNING: Default Temporary Tablespace not specified in CREATE DATABASE command
Default Temporary Tablespace will be necessary for a locally managed database in future release
Setting recovery target incarnation to 1
Wed Sep 28 16:57:48 2011
Successful mount of redo thread 1, with mount id 523482251
Wed Sep 28 16:57:48 2011
Completed: Create controlfile reuse set database "kmcdb"
MAXINSTANCES 8
MAXLOGHISTORY 1
MAXLOGFILES 16
MAXLOGMEMBERS 3
MAXDATAFILES 100
Datafile
'/u01/oracle/oradata/kmcdb/system01.dbf',
'/u01/oracle/oradata/kmcdb/undotbs01.dbf',
'/u01/oracle/oradata/kmcdb/sysaux01.dbf',
'/u01/oracle/oradata/kmcdb/users01.dbf',
'/u01/oracle/oradata/kmcdb/example01.dbf'
LOGFILE GROUP 1 ('/u01/oracle/oradata/kmcdb/redo01.log') SIZE 51200K,
GROUP 2 ('/u01/oracle/oradata/kmcdb/redo02.log') SIZE 51200K,
GROUP 3 ('/u01/oracle/oradata/kmcdb/redo03.log') SIZE 51200K RESETLOGS
Wed Sep 28 16:57:57 2011
alter database mount

 

成功mount实例后,尝试recover database 并 open 数据库 :

 

 recover database using backup controlfile until cancel;

因为controlfile是重建出来的,所以当前的controlfile并不知道那个在线日志是current的,需要手动指定

ALTER DATABASE RECOVER  database using backup controlfile until cancel
Wed Sep 28 17:27:09 2011
Media Recovery Start
WARNING! Recovering data file 1 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 2 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 3 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 4 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 5 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
 parallel recovery started with 7 processes
ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Wed Sep 28 17:27:20 2011
ALTER DATABASE RECOVER    LOGFILE '/u01/oracle/oradata/kmcdb/redo02.log'
Wed Sep 28 17:27:20 2011
Media Recovery Log /u01/oracle/oradata/kmcdb/redo02.log
Errors with log /u01/oracle/oradata/kmcdb/redo02.log
ORA-339 signalled during: ALTER DATABASE RECOVER    LOGFILE '/u01/oracle/oradata/kmcdb/redo02.log'  ...
Wed Sep 28 17:27:20 2011
ALTER DATABASE RECOVER CANCEL
ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...
Wed Sep 28 17:27:37 2011
ALTER DATABASE RECOVER  database using backup controlfile until cancel
Wed Sep 28 17:27:37 2011
Media Recovery Start
WARNING! Recovering data file 1 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 2 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 3 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 4 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
WARNING! Recovering data file 5 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.
 parallel recovery started with 7 processes
ORA-279 signalled during: ALTER DATABASE RECOVER  database using backup controlfile until cancel  ...
Wed Sep 28 17:27:52 2011
ALTER DATABASE RECOVER    LOGFILE '/u01/oracle/oradata/kmcdb/redo03.log'
Wed Sep 28 17:27:52 2011
Media Recovery Log /u01/oracle/oradata/kmcdb/redo03.log
Wed Sep 28 17:27:52 2011
Incomplete recovery applied all redo ever generated.
Recovery completed through change 16228646
Wed Sep 28 17:27:52 2011
Media Recovery Complete (kmcdb)
Completed: ALTER DATABASE RECOVER    LOGFILE '/u01/oracle/oradata/kmcdb/redo03.log'
Wed Sep 28 17:28:12 2011

 

多次尝试后发现redo03.log是所需要的current group member,尝试alter database open resetlogs,但是出现ORA-00600: internal error code, arguments: [4000], [7], [], [], [], [], [], []错误:

 

alter database open resetlogs
Wed Sep 28 17:28:12 2011
RESETLOGS is being done without consistancy checks. This may result
in a corrupted database. The database should be recreated.
RESETLOGS after incomplete recovery UNTIL CHANGE 16228646
Resetting resetlogs activation ID 523468297 (0x1f337e09)
Online log /u01/oracle/oradata/kmcdb/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u01/oracle/oradata/kmcdb/redo02.log: Thread 1 Group 2 was previously cleared
Wed Sep 28 17:28:12 2011
Setting recovery target incarnation to 5
Wed Sep 28 17:28:12 2011
Assigning activation ID 523485789 (0x1f33c25d)
Thread 1 opened at log sequence 1
  Current log# 3 seq# 1 mem# 0: /u01/oracle/oradata/kmcdb/redo03.log
Successful open of redo thread 1
Wed Sep 28 17:28:12 2011
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Sep 28 17:28:12 2011
SMON: enabling cache recovery
Wed Sep 28 17:28:12 2011
Errors in file /u01/oracle/admin/kmcdb/udump/kmcdb_ora_31798.trc:
ORA-00600: internal error code, arguments: [4000], [7], [], [], [], [], [], []
Wed Sep 28 17:28:14 2011
Errors in file /u01/oracle/admin/kmcdb/udump/kmcdb_ora_31798.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [4000], [7], [], [], [], [], [], []

以上ORA-00600: [4000], [7],说明在使用usn=7的回滚段rollback数据块时发现rollback segment存在讹误(注意不要和ora600 [2662]错误混淆],且该ORA-00600: [4000]伴随有ORA-00704: bootstrap process failure错误,说明需要回滚的数据块是bootstrap重要的自举对象。

一般来说bootstrap object需要做rollback或cleanup而apply undo数据时,我们是无法使用_corrupted_rollback_segments,_offline_rollback_segments或10513事件来阻止ORA-00600: [4000]的发生。

 

但还是仍心存侥幸,修改_corrupted_rollback_segments,_offline_rollback_segments和event参数再次尝试:

alter system set event='10513 trace name context forever,level 2 :
10512 trace name context forever,level 1: 10511 trace name context forever,level 2:
10510 trace name context forever,level 1' scope=spfile;

ALTER SYSTEM SET _offline_rollback_segments='(_SYSSMU7$)' SCOPE=SPFILE;

ALTER SYSTEM SET _corrupted_rollback_segments='(_SYSSMU7$)' SCOPE=SPFILE;

再次open database

alter database open resetlogs
Wed Sep 28 17:47:05 2011
RESETLOGS after complete recovery through change 16228668
Resetting resetlogs activation ID 523471415 (0x1f338a37)
Online log /u01/oracle/oradata/kmcdb/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u01/oracle/oradata/kmcdb/redo02.log: Thread 1 Group 2 was previously cleared
Wed Sep 28 17:47:05 2011
Setting recovery target incarnation to 11
Wed Sep 28 17:47:05 2011
Assigning activation ID 523466708 (0x1f3377d4)
Thread 1 opened at log sequence 1
  Current log# 3 seq# 1 mem# 0: /u01/oracle/oradata/kmcdb/redo03.log
Successful open of redo thread 1
Wed Sep 28 17:47:05 2011
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Sep 28 17:47:05 2011
SMON: enabling cache recovery
Wed Sep 28 17:47:05 2011
Errors in file /u01/oracle/admin/kmcdb/udump/kmcdb_ora_32261.trc:
ORA-00600: internal error code, arguments: [4000], [7], [], [], [], [], [], []
Wed Sep 28 17:47:06 2011
Errors in file /u01/oracle/admin/kmcdb/udump/kmcdb_ora_32261.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [4000], [7], [], [], [], [], [], []
Wed Sep 28 17:47:06 2011
Error 704 happened during db open, shutting down database
USER: terminating instance due to error 704
Instance terminated by USER, pid = 32261
ORA-1092 signalled during: alter database open resetlogs...

无法避免bootstrap对象发生ORA-00600:[4000],而这对打开数据库来说是致命的,要想恢复这个数据库就必须要解决该bootstrap对象。

怎么办呢?

我们先来读一读ORA-00600:[4000]内部错误的trace日志:

 

*** SERVICE NAME:() 2011-09-28 17:46:09.022
*** SESSION ID:(160.3) 2011-09-28 17:46:09.022
Recovery target incarnation = 10, activation ID = 0
Influx buffer limit = 98098 (50% x 196196)
Successfully allocated 7 recovery slaves
Using 156 overflow buffers per recovery slave
Start recovery at thread 1 ckpt scn 16228666 logseq 1 block 2
*** 2011-09-28 17:46:09.236
Media Recovery add redo thread 1
*** 2011-09-28 17:46:14.752
Media Recovery drop redo thread 1
*** 2011-09-28 17:46:41.788
Recovery target incarnation = 10, activation ID = 0
Influx buffer limit = 98098 (50% x 196196)
Successfully allocated 7 recovery slaves
Using 156 overflow buffers per recovery slave
Start recovery at thread 1 ckpt scn 16228666 logseq 1 block 2
*** 2011-09-28 17:46:41.842
Media Recovery add redo thread 1
*** 2011-09-28 17:46:52.799
Media Recovery Log /u01/oracle/oradata/kmcdb/redo03.log
----- Redo read statistics for thread 1 -----
Read rate (ASYNC): 0Kb in 10.99s => 0.00 Mb/sec
Total physical reads: 4096Kb
Longest record: 0Kb, moves: 0/1 (0%)
Change moves: 0/1 (0%), moved: 0Mb
Longest LWN: 0Kb, moves: 0/1 (0%), moved: 0Mb
Last redo scn: 0x0000.00f7a13b (16228667)
----------------------------------------------
*** 2011-09-28 17:46:52.835
Media Recovery drop redo thread 1
File 1 (stop scn 16228668) completed recovery at checkpoint scn 16228668
File 2 (stop scn 16228668) completed recovery at checkpoint scn 16228668
File 3 (stop scn 16228668) completed recovery at checkpoint scn 16228668
File 4 (stop scn 16228668) completed recovery at checkpoint scn 16228668
File 5 (stop scn 16228668) completed recovery at checkpoint scn 16228668
ARCH: Connecting to console port...
*** 2011-09-28 17:47:05.351
Prior to RESETLOGS processing...
ALTER SYSTEM ARCHIVE LOG ALL USING BACKUP CONTROLFILE start
Database is not in archivelog mode
ALTER SYSTEM ARCHIVE LOG ALL USING BACKUP CONTROLFILE complete
*** 2011-09-28 17:47:05.353
*** 2011-09-28 17:47:05.707
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [4000], [7], [], [], [], [], [], []
Current SQL statement for this session:
select ctime, mtime, stime from obj$ where obj# = :1
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+27          call     ksedst1()            0 ? 1 ?
ksedmp()+557         call     ksedst()             0 ? 9EF2E6D0 ? 0 ? 2A ?
                                                   9CCB06F4 ? 70000 ?
ksfdmp()+19          call     ksedmp()             3 ? BFAA18D0 ? AC152A0 ?
                                                   CBD2D40 ? 3 ? CB843B8 ?
kgeriv()+188         call     00000000             CBD2D40 ? 3 ?
kgeasi()+113         call     kgeriv()             CBD2D40 ? B7EB0020 ? FA0 ?
                                                   1 ? BFAA190C ?
ktudba()+264         call     kgeasi()             CBD2D40 ? B7EB0020 ? FA0 ?
                                                   2 ? 1 ? 0 ? 7 ? 0 ?
ktrgcm()+6207        call     ktudba()             7 ? BFAA1DEC ? 0 ? 0 ?
ktrgtc()+941         call     ktrgcm()             B7EFF4F4 ? 0 ? B7EFE054 ?
                                                   96F1A0B4 ? 96F10CE8 ? 198 ?
kdsgrp()+107         call     ktrgtc()             B7EFF4F4 ? B7EFF49C ?
                                                   9C22142 ? BFAA1F08 ? 240 ?
                                                   9C24DC4 ? 9C21D7C ?
kdsfbrcb()+513       call     kdsgrp()             B7EFF4F0 ? 0 ? B7EFF4F0 ?
qertbFetchByRowID()  call     kdsfbrcb()           B7EFF4F0 ? B7EFDFEC ? 0 ? 1 ?
+2052                                              0 ? 0 ?
opifch2()+5157       call     00000000             96F10A8C ? A11CDE4 ?
                                                   BFAA2534 ? 1 ?

引发ORA-00600:[4000]错误的数据块头信息:

Block header dump:  0x0040007a
 Object id on Block? Y
 seg/obj: 0x12  csc: 0x00.f84ef0  itc: 1  flg: -  typ: 1 - DATA
     fsl: 0  fnx: 0x0 ver: 0x01

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0007.01f.000011cf  0x00800057.0faf.07  --U-    1  fsc 0x0000.00f84ef1

data_block_dump,data header at 0x82bba044

以上的trace文件信息量极大,我们来抽丝剥茧的罗列这些信息量:

1.
File 1 (stop scn 16228668) completed recovery at checkpoint scn 16228668, 这条日志指出了1号数据文件最后恢复到的checkpoint scn,当尝试open database时数据库当前的scn将从16228668开始

2.
ORA-00600: internal error code, arguments: [4000], [7], [], [], [], [], [], []
Current SQL statement for this session:
select ctime, mtime, stime from obj$ where obj# = :1

引发ORA-00600:[4000], [7]错误的语句是”select ctime, mtime, stime from obj$ where obj# = :1″,这是一条常用的递归SQL语句,查询的对象是非常critical的bootstrap自举对象OBJ$基表,这说明需要cleanup的块时OBJ$表上的

3. 引发ORA-00600:[4000] raised是ktudba函数,由它触发了oracle 错误机制,其stack call为kdsgrp-> ktrgtc -> ktrgcm -> ktudba -> kgeasi(错误处理函数)

4. 引发ORA-00600:[4000], [7]错误的数据块是1号数据文件的122块,seg/obj为0x12,块类型为Data,且存在有一条ITL entry:

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0007.01f.000011cf 0x00800057.0faf.07 –U- 1 fsc 0x0000.00f84ef1

该ITL 的Scn为00f84ef1= 16273137 > current_scn=16228668 且lck=1

这里存在一个问题:
为什么这个”select ctime, mtime, stime from obj$ where obj# = :1″查询语句要针对该OBJ$的数据块访问该块上ITL相关的rollback segment呢?

这里要牵扯到Oracle中查询语句构建CR一致性读块和块延迟清除的一些概念:

1. 当现有的buffer block无法满足查询的SCN要求时,查询进程会通过现有buffer和undo信息构造出一个CR块来。譬如说cache中现有的一个block的SCN是10,而我的查询语句其实是从Scn=1时就开始的,那么这个查询语句的Snap_scn就是1,为了满足其SCN的需求,就需要访问相关的rollback segment 并对 buffer block做”rollback”。

2. 另一方面当查询语句访问块时,可能需要对块做cleanup,也就是我们说的延迟块清除(deferred block cleanout),当这种Block Cleanout发生时查询进程也需要去访问块上ITL相关的 rollback segment ,那么具体当访问到哪些块时需要做cleanout呢?这个将会在今后的文章中详述。

回归到上面的ORA-00600:[4000]中来,因为该致命内部错误涉及到的对象是重要的Bootstrap表OBJ$,所以无法使用传统的例如_corrupted_rollback_segments,_offline_rollback_segments或10513事件来阻止ORA-00600: [4000]的发生,必须使用块修改工具BBED来修改存在问题的数据块将ITL事务槽的FLAG从U修改为C(Commit)。

TRANSACTION_COMMITED = 0x08;
TRANSACTION_UPBOUND = 0x02;
TRANSACTION_ACTIVE = 0x01;

Flag= –U- 即TRANSACTION_UPBOUND时flag所占字节为0x02,需要将该字节修改为TRANSACTION_COMMITED = 0x08;

十分不巧的是,通过QQ远程协助帮助这位网友使用BBED打开了数据块,但因为这位网友的Terminal工具存在问题导致日志没有记全,特别是BBED这部分的:

[oracle@DBserver1 lib]$ cd /u01/oracle/oradata/kmcdb/
[oracle@DBserver1 kmcdb]$ ls
control01.ctl      control02.ctl.bak  ctl_backup     redo02.log    system01.dbf   users01.dbf
control01.ctl.bak  control03.ctl      example01.dbf  redo03.log    temp01.dbf
control02.ctl      control03.ctl.bak  redo01.log     sysaux01.dbf  undotbs01.dbf
[oracle@DBserver1 kmcdb]$ cp system01.dbf ctl_backup/
[oracle@DBserver1 kmcdb]$ cp system01.dbf system01.dbf.bak           

在正式BBED之前一定要备份所有的数据文件

oracle@DBserver1 kmcdb]$ bbed filename=system01.dbf password=blockedit mode=edit                                         

BBED: Release 2.0.0.0.0 - Limited Production on Wed Sep 28 21:22:41 2011                                        

Copyright (c) 1982, 2005, Oracle.  All rights reserved.                                                         

BBED> set blocksize 8192
        BLOCKSIZE       8192                                                                                    

BBED> set block 123
        BLOCK#          123                                                                                     

BBED> map /v
 File: system01.dbf (0)
 Block: 123                                   Dba:0x00000000
------------------------------------------------------------
 KTB Data Block (Table/Cluster)                                                                                 

 struct kcbh, 20 bytes                      @0
    ub1 type_kcbh                           @0
    ub1 frmt_kcbh                           @1
    ub1 spare1_kcbh                         @2
    ub1 spare2_kcbh                         @3
    ub4 rdba_kcbh                           @4
    ub4 bas_kcbh                            @8
    ub2 wrp_kcbh                            @12
    ub1 seq_kcbh                            @14
    ub1 flg_kcbh                            @15
    ub2 chkval_kcbh                         @16
    ub2 spare3_kcbh                         @18                                                                 

 struct ktbbh, 48 bytes                     @20
    ub1 ktbbhtyp                            @20
    union ktbbhsid, 4 bytes                 @24
    struct ktbbhcsc, 8 bytes                @28
    b2 ktbbhict                             @36
    ub1 ktbbhflg                            @38
    ub1 ktbbhfsl                            @39
    ub4 ktbbhfnx                            @40
    struct ktbbhitl[1], 24 bytes            @44                                                                 

 struct kdbh, 14 bytes                      @68
    ub1 kdbhflag                            @68
    b1 kdbhntab                             @69
    b2 kdbhnrow                             @70
    sb2 kdbhfrre                            @72
    sb2 kdbhfsbo                            @74
    sb2 kdbhfseo                            @76
    b2 kdbhavsp                             @78
    b2 kdbhtosp                             @80                                                                 

 struct kdbt[1], 4 bytes                    @82
    b2 kdbtoffs                             @82
    b2 kdbtnrow                             @84                                                                 

 sb2 kdbr[108]                              @86                                                                 

 ub1 freespace[873]                         @302                                                                

 ub1 rowdata[7013]                          @1175                                                               

 ub4 tailchk                                @8188      

ITL记录从    struct ktbbhitl[1], 24 bytes            @44  开始

BBED> set offset 61
        OFFSET          61                                                                                      

BBED> d
 File: system01.dbf (0)
 Block: 123              Offsets:   61 to  572           Dba:0x00000000
------------------------------------------------------------------------
 200000f1 4ef80000 016c00ff ffea0053 04690369 0300006c 007b1f3a 1ffe1ec1
 1e801e40 1ef81db8 1d771d29 1dec1cb0 1c681c27 1ce71ba7 1b681b27 1beb1aaf
 1a701a2f 1af319b4 19731937 19f118ae 18671827 18e717a7 1766172b 17e316a2
 1661161f 16e315a3 15661529 15e914a8 146c142c 14ec13b0 13731332 13f112b0
 126f122f 12ee11a0 11601121 11de109b 105f1023 10e60fa3 0f620f1d 0fd80e9c
 0e5a0e14 0ed30d94 0d540d13 0dd30c93 0c570c17 0cd90b98 0b580b15 0bd00a8b
 0a460a01 0ac10981 09410901 09c0087d 083e0801 08bc077e 074207fc 06bb0674
 063206f0 05aa0564 051e05d8 04950453 04000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000      

BBED> modify /x 0x80
#modify /x 20 filename 'system01.dbf' block 123. offset 61.
sum apply
verify

因为该ITL事务槽的lck=1,因此还需要将lb: 0x1的row piece记录修改为lb: 0x0

在trace中,该记录为
tab 0, row 26, @0x18f1
tl: 70 fb: --H-FL-- lb: 0x1  cc: 17
col  0: [ 2]  c1 02
col  1: [ 4]  c3 08 61 2c
col  2: [ 1]  80
col  3: [12]  5f 4e 45 58 54 5f 4f 42 4a 45 43 54
col  4: [ 2]  c1 02
col  5: *NULL*

set offset 6454

modify /x 0x00
#modify /x 01 filename 'system01.dbf' block 123. offset 6454.
sum apply
verify

以上完成了对OBJ$上数据块的修改

 

完成以上BBED手动修改操作后,再次打开数据库成功,但是发现有很多坏块:

 

alter database open resetlogs
Wed Sep 28 21:42:53 2011
RESETLOGS after complete recovery through change 16228672
Resetting resetlogs activation ID 523466708 (0x1f3377d4)
Online log /u01/oracle/oradata/kmcdb/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u01/oracle/oradata/kmcdb/redo02.log: Thread 1 Group 2 was previously cleared
Wed Sep 28 21:42:54 2011
Setting recovery target incarnation to 12
Wed Sep 28 21:42:54 2011
Assigning activation ID 523426694 (0x1f32db86)
Thread 1 opened at log sequence 1
  Current log# 3 seq# 1 mem# 0: /u01/oracle/oradata/kmcdb/redo03.log
Successful open of redo thread 1
Wed Sep 28 21:42:54 2011
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Sep 28 21:42:54 2011
SMON: enabling cache recovery
Wed Sep 28 21:42:54 2011
Dictionary check beginning
Tablespace 'TEMP' #3 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Corrected file 5 plugged in read-only status in control file
Dictionary check complete
Wed Sep 28 21:42:54 2011
SMON: enabling tx recovery
Wed Sep 28 21:42:54 2011
*********************************************************************
WARNING: The following temporary tablespaces contain no files.
         This condition can occur when a backup controlfile has
         been restored.  It may be necessary to add files to these
         tablespaces.  That can be done using the SQL statement:

         ALTER TABLESPACE  ADD TEMPFILE

         Alternatively, if these temporary tablespaces are no longer
         needed, then they can be dropped.
           Empty temporary tablespace: TEMP
*********************************************************************
Updating character set in controlfile to ZHS16GBK
Wed Sep 28 21:42:54 2011
Errors in file /u01/oracle/admin/kmcdb/bdump/kmcdb_smon_32531.trc:
ORA-00600: internal error code, arguments: [2662], [0], [16273152], [0], [16273158], [4220957], [], []
**********************************************************
WARNING: Files may exists in db_recovery_file_dest
that are not known to the database. Use the RMAN command
CATALOG RECOVERY AREA to re-catalog any such files.
One of the following events caused this:
1. A backup controlfile was restored.
2. A standby controlfile was restored.
3. The controlfile was re-created.
4. db_recovery_file_dest had previously been enabled and
   then disabled.
**********************************************************
replication_dependency_tracking turned off (no async multimaster replication found)
WARNING: AQ_TM_PROCESSES is set to 0. System operation might be adversely affected.
Wed Sep 28 21:42:55 2011
LOGSTDBY: Validating controlfile with logical metadata
Wed Sep 28 21:42:55 2011
LOGSTDBY: Validation complete
ORA-01555 caused by SQL statement below (SQL ID: 96g93hntrzjtr, SCN: 0x0000.00f84f0a):
Wed Sep 28 21:42:55 2011
select /*+ rule */ bucket_cnt, row_cnt, cache_cnt, null_cnt, timestamp#, sample_size, minimum,
maximum, distcnt, lowval, hival, density, col#, spare1, spare2, avgcln from hist_head$ where obj#=:1 and intcol#=:2
Global Name changed to KMCDB
Wed Sep 28 21:42:55 2011
Non-fatal internal error happenned while SMON was doing non-existent object cleanup.
SMON encountered 1 out of maximum 100 non-fatal internal errors.
Wed Sep 28 21:42:56 2011
Completed: alter database open resetlogs
Wed Sep 28 21:42:57 2011
Hex dump of (file 1, block 61479) in trace file /u01/oracle/admin/kmcdb/bdump/kmcdb_smon_32531.trc
Corrupt block relative dba: 0x0040f027 (file 1, block 61479)
Fractured block found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x0040f027
 last change scn: 0x0000.00f80906 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x36a70601
 check value in block header: 0xe5d2
 computed block checksum: 0x3fa1
Reread of rdba: 0x0040f027 (file 1, block 61479) found same corrupted data
Wed Sep 28 21:42:57 2011
Errors in file /u01/oracle/admin/kmcdb/bdump/kmcdb_smon_32531.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01578: ORACLE data block corrupted (file # 1, block # 61479)
ORA-01110: data file 1: '/u01/oracle/oradata/kmcdb/system01.dbf'
Wed Sep 28 21:42:58 2011
Starting background process CJQ0
CJQ0 started with pid=24, OS id=724
Wed Sep 28 21:43:01 2011
Errors in file /u01/oracle/admin/kmcdb/bdump/kmcdb_m000_720.trc:
ORA-00600: internal error code, arguments: [2662], [0], [16273222], [0], [16273348], [12624044], [], []
Wed Sep 28 21:43:02 2011
Errors in file /u01/oracle/admin/kmcdb/bdump/kmcdb_m000_720.trc:
ORA-00600: internal error code, arguments: [2662], [0], [16273288], [0], [16273348], [12624044], [], []
ORA-00600: internal error code, arguments: [2662], [0], [16273222], [0], [16273348], [12624044], [], []
Wed Sep 28 21:44:35 2011
 create temporary tablespace temp02 tempfile '/u01/oracle/oradata/temp02.dbf' size 400M
Wed Sep 28 21:44:35 2011
Completed:  create temporary tablespace temp02 tempfile '/u01/oracle/oradata/temp02.dbf' size 400M
Wed Sep 28 21:45:52 2011
create undo tablespace undo03 datafile '/u01/oracle/oradata/undo03.dbf' size 500M
Hex dump of (file 3, block 3839) in trace file /u01/oracle/admin/kmcdb/udump/kmcdb_ora_749.trc
Corrupt block relative dba: 0x00c00eff (file 3, block 3839)
Fractured block found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x00c00eff
 last change scn: 0x0000.00f7be10 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x49b20601
 check value in block header: 0xe0a3
 computed block checksum: 0xf7a2
Reread of rdba: 0x00c00eff (file 3, block 3839) found same corrupted data
Completed: create undo tablespace undo03 datafile '/u01/oracle/oradata/undo03.dbf' size 500M
Wed Sep 28 21:45:58 2011
Errors in file /u01/oracle/admin/kmcdb/bdump/kmcdb_smon_32531.trc:
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
Wed Sep 28 21:45:58 2011
Non-fatal internal error happenned while SMON was doing Corrupt Block Seg Info.
SMON encountered 2 out of maximum 100 non-fatal internal errors.
Wed Sep 28 21:47:15 2011
ALTER SYSTEM SET undo_management='AUTO' SCOPE=SPFILE;
Wed Sep 28 21:47:19 2011
Starting background process EMN0
EMN0 started with pid=22, OS id=757
Wed Sep 28 21:47:19 2011
Shutting down instance: further logons disabled
Wed Sep 28 21:47:19 2011
Stopping background process CJQ0
Wed Sep 28 21:47:19 2011

 

尝试创建并切换到新的Undo表空间,可能可以避免一些后续出现的ORA-600问题。

以上alert.log中2662和4097内部错误暂时不导致数据库crash,但是因为是用非常手段打开的数据库所以这套数据库已不再稳定,所以要第一时间逻辑和物理备份重要数据, 并着手重建数据库:

 

run{
set maxcorrupt for datafile 1,2,3,4,5 to 100;
backup as compressed backupset incremental level 0 database;
}

exp system/oracle full=y file=full.dmp

 

虽然该网友和我一再强调他的数据库中的数据十分重要,但是他的行为是在无声的诉说这套数据库里的数据是无关紧要的。

如果你觉得数据是important的,那么请善待数据。 善待数据的第一条准则是 数据备份重于一切,如果没有备份 那么请不要奢望数据是持久的!

Script:检查数据库当前是否有备份操作在执行中

以下脚本可以用于检测数据库当前是否有备份操作在执行中:

SELECT DECODE(os_backup.backup + rman_backup.backup, 0, 'FALSE', 'TRUE') backup
  FROM (SELECT COUNT(*) backup FROM gv$backup WHERE status = 'ACTIVE') os_backup,
       (SELECT COUNT(*) backup
          FROM gv$session
         WHERE status = 'ACTIVE'
           AND client_info like '%rman%') rman_backup
/

数据恢复:模拟2个逻辑坏块

TESTA:

 

Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL>
SQL>
SQL> create table corrupt_id (t1 int) tablespace users;

Table created.

SQL>
SQL> insert into corrupt_id values(1);

1 row created.

SQL> commit;

Commit complete.

SQL> alter system checkpoint;

System altered.
SQL> select dbms_rowid.rowid_block_number(rowid),dbms_rowid.rowid_relative_fno(rowid) from corrupt_id;

DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID)
———————————— ————————————
527 4
SQL> oradebug setmypid
Statement processed.

SQL> oradebug tracefile_name
/s01/diag/rdbms/fixit/FIXIT/trace/FIXIT_ora_7227.trc
Block header dump: 0x0100020f
Object id on Block? Y
seg/obj: 0x12c20 csc: 0x00.10d097 itc: 2 flg: E typ: 1 – DATA
brn: 0 bdba: 0x1000208 ver: 0x01 opc: 0
inc: 0 exflg: 0

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0006.01f.000003e8 0x00c00c61.00a9.07 –U- 1 fsc 0x0000.0010d098
0x02 0x0000.000.00000000 0x00000000.0000.00 —- 0 fsc 0x0000.00000000
bdba: 0x0100020f
data_block_dump,data header at 0x7f15d62b2264
===============
tsiz: 0x1f98
hsiz: 0x14
pbl: 0x7f15d62b2264
76543210
flag=——–
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f92
avsp=0x1f7b
tosp=0x1f7b
0xe:pti[0] nrow=1 offs=0
0x12:pri[0] offs=0x1f92
block_row_dump:
tab 0, row 0, @0x1f92
tl: 6 fb: –H-FL– lb: 0x1 cc: 1
col 0: [ 2] c1 02
end_of_block_dump

fsbo=0x14

FSBO Free Space Begin offset 除去row dict 后面的可以放数据的空间的起始位置,也可以看成是从这个区域的开始”flag”到最后一个 “row offs”占用的空间

fseo=0x1f92

FSEO Free Space End offset ( 9.2.0 )参与db_block_checking的计算剩余空间,select 的时候oracle不是简单的根据offset定位row.这个值也是参与了定位row的

 

 

FSBO 总是小于FSEO ,若Oracle进程读取数据块 发现FSBO>FSEO 且blocksum正确则认为出现了逻辑讹误

 
SQL> update corrupt_id set t1=t1+1;

1 row updated.

SQL> alter system checkpoint;

System altered.

SQL> alter system flush buffer_cache;

System altered.

SQL> /

System altered.

 

 

 

 
[oracle@lab1 ~]$ bbed filename=/s01/oradata/FIXIT/datafile/o1_mf_users_85ztlh77_.dbf blocksize=8192 mode=edit
Password:

BBED: Release 2.0.0.0.0 – Limited Production on Mon Sep 24 01:40:46 2012

Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.

************* !!! For Oracle Internal Use only !!! ***************

BBED> set block 527
BLOCK# 527

BBED> map
File: /s01/oradata/FIXIT/datafile/o1_mf_users_85ztlh77_.dbf (0)
Block: 527 Dba:0x00000000
————————————————————
KTB Data Block (Table/Cluster)

struct kcbh, 20 bytes @0

struct ktbbh, 72 bytes @20

struct kdbh, 14 bytes @100

struct kdbt[1], 4 bytes @114

sb2 kdbr[1] @118

ub1 freespace[8062] @120

ub1 rowdata[6] @8182

ub4 tailchk @8188
BBED> p *kdbr[0]
rowdata[0]
———-
ub1 rowdata[0] @8182 0x2c

BBED> set offset 8182
OFFSET 8182

BBED> x /rn
rowdata[0] @8182
———-
flag@8182: 0x2c (KDRHFL, KDRHFF, KDRHFH)
lock@8183: 0x01
cols@8184: 1

col 0[2] @8185: 3

 

 

 
BBED> p kdbh
struct kdbh, 14 bytes @100
ub1 kdbhflag @100 0x00 (NONE)
sb1 kdbhntab @101 1
sb2 kdbhnrow @102 1
sb2 kdbhfrre @104 -1
sb2 kdbhfsbo @106 20
sb2 kdbhfseo @108 8082
sb2 kdbhavsp @110 8059
sb2 kdbhtosp @112 8059

BBED> set offset 108
OFFSET 108

BBED> d
File: /s01/oradata/FIXIT/datafile/o1_mf_users_85ztlh77_.dbf (0)
Block: 527 Offsets: 108 to 619 Dba:0x00000000
————————————————————————
921f7b1f 7b1f0000 0100921f 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

<32 bytes per line>

BBED> x
kdbh.kdbhfseo @108
————-
0x92

BBED> modify /x 0x1000
File: /s01/oradata/FIXIT/datafile/o1_mf_users_85ztlh77_.dbf (0)
Block: 527 Offsets: 108 to 619 Dba:0x00000000
————————————————————————
10007b1f 7b1f0000 0100921f 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

<32 bytes per line>

BBED> p kdbh
struct kdbh, 14 bytes @100
ub1 kdbhflag @100 0x00 (NONE)
sb1 kdbhntab @101 1
sb2 kdbhnrow @102 1
sb2 kdbhfrre @104 -1
sb2 kdbhfsbo @106 20
sb2 kdbhfseo @108 16
sb2 kdbhavsp @110 8059
sb2 kdbhtosp @112 8059

BBED> sum
Check value for File 0, Block 527:
current = 0xd30a, required = 0xcc88

BBED> sum apply
Check value for File 0, Block 527:
current = 0xcc88, required = 0xcc88

 

 

 
[oracle@lab1 ~]$ dbv file=/s01/oradata/FIXIT/datafile/o1_mf_users_85ztlh77_.dbf start=526 end=528

DBVERIFY: Release 11.2.0.3.0 – Production on Mon Sep 24 01:44:36 2012

Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.

DBVERIFY – Verification starting : FILE = /s01/oradata/FIXIT/datafile/o1_mf_users_85ztlh77_.dbf
Block Checking: DBA = 16777743, Block Type = KTB-managed data block
data header at 0x7fa1288a9064
kdbchk: fseo(16) < fsbo(20)
Page 527 failed with check code 6130
DBVERIFY – Verification complete

Total Pages Examined : 3
Total Pages Processed (Data) : 2
Total Pages Failing (Data) : 1
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 1
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 0
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 1102645 (0.1102645)
SQL> shutdown abort;
ORACLE instance shut down.
RMAN> backup validate check logical datafile 4;

Starting backup at 24-SEP-12
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=139 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00004 name=/s01/oradata/FIXIT/datafile/o1_mf_users_85ztlh77_.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
—- —— ————– ———— ————— ———-
4 FAILED 0 18 668 1101975
File Name: /s01/oradata/FIXIT/datafile/o1_mf_users_85ztlh77_.dbf
Block Type Blocks Failing Blocks Processed
———- ————– —————-
Data 1 96
Index 0 39
Other 0 487

validate found one or more corrupt blocks
See trace file /s01/diag/rdbms/fixit/FIXIT/trace/FIXIT_ora_8036.trc for details
Finished backup at 24-SEP-12

Block Checking: DBA = 16777743, Block Type = KTB-managed data block
data header at 0x7f00bc1f6064
kdbchk: fseo(16) < fsbo(20)
Error backing up file 4, block 527: logical corruption
dbkh_create_finding: BEGIN
dbkhu_prepare_default_msgobj: BEGIN
dbkhu_prepare_default_msgobj:; name_id=25, type=2, flags=1
dbkhu_get_default_msg_def: BEGIN
dbkhu_get_default_msg_def: END
dbkhu_prepare_default_msgobj:: MSG PARAMS-1; i=2
dbkhu_prepare_default_msgobj: END
dbkhu_prepare_default_msgobj: BEGIN
dbkhu_prepare_default_msgobj:; name_id=25, type=2, flags=2
dbkhu_get_default_msg_def: BEGIN
dbkhu_get_default_msg_def: END
dbkhu_prepare_default_msgobj:: MSG PARAMS-2; i=1
dbkhu_prepare_default_msgobj: END

*** 2012-09-24 01:48:25.079
dbkhu_prepare_default_msgobj: BEGIN
dbkhu_prepare_default_msgobj:; name_id=38, type=2, flags=1
dbkhu_get_default_msg_def: BEGIN
dbkhu_get_default_msg_def: END
dbkhu_prepare_default_msgobj:: MSG PARAMS-1; i=3
dbkhu_prepare_default_msgobj: END
dbkhu_prepare_default_msgobj: BEGIN
dbkhu_prepare_default_msgobj:; name_id=38, type=2, flags=2
dbkhu_get_default_msg_def: BEGIN
dbkhu_get_default_msg_def: END
dbkhu_prepare_default_msgobj:: MSG PARAMS-2; i=2
dbkhu_prepare_default_msgobj: END
dbkh_create_finding: END

 

kdbchk: fseo(16) < fsbo(20)
Page 527 failed with check code 6130 可能引起

以下错误:

ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []

 

 

TESTB:

 

SQL*Plus: Release 10.2.0.5.0 – Production on Mon Sep 24 09:30:18 2012

Copyright (c) 1982, 2010, Oracle. All Rights Reserved.

Connected to an idle instance.

SQL> startup;
ORACLE instance started.

Total System Global Area 1048576000 bytes
Fixed Size 2101544 bytes
Variable Size 738201304 bytes
Database Buffers 301989888 bytes
Redo Buffers 6283264 bytes
Database mounted.
Database opened.

 

SQL> create table corrupt_id (t1 int) tablespace users;

Table created.

SQL> insert into corrupt_id values(1);

1 row created.

SQL> commit;

Commit complete.

SQL> alter system checkpoint;

System altered.

SQL> select dbms_rowid.rowid_block_number(rowid),dbms_rowid.rowid_relative_fno(rowid) from corrupt_id;

DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID)
———————————— ————————————
917060 4

SQL> alter system dump datafile 4 block 917060;

System altered.

SQL> oradebug setmypid
Statement processed.
SQL> oradebug tracefile_name
/s01/admin/G10R25/udump/g10r25_ora_4497.trc
Block header dump: 0x010dfe44
Object id on Block? Y
seg/obj: 0xf0e4 csc: 0x00.17d1aac itc: 2 flg: E typ: 1 – DATA
brn: 0 bdba: 0x10dfe41 ver: 0x01 opc: 0
inc: 0 exflg: 0

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0009.00c.0000515d 0x008001c2.1138.08 –U- 1 fsc 0x0000.017d1ab2
0x02 0x0000.000.00000000 0x00000000.0000.00 —- 0 fsc 0x0000.00000000

data_block_dump,data header at 0x82d8864
===============
tsiz: 0x1f98
hsiz: 0x14
pbl: 0x082d8864
bdba: 0x010dfe44
76543210
flag=——–
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f92
avsp=0x1f7b
tosp=0x1f7b
0xe:pti[0] nrow=1 offs=0
0x12:pri[0] offs=0x1f92
block_row_dump:
tab 0, row 0, @0x1f92
tl: 6 fb: –H-FL– lb: 0x1 cc: 1
col 0: [ 2] c1 02
end_of_block_dump
End dump data blocks tsn: 4 file#: 4 minblk 917060 maxblk 917060

 

fsbo=0x14

FSBO Free Space Begin offset 除去row dict 后面的可以放数据的空间的起始位置,也可以看成是从这个区域的开始”flag”到最后一个 “row offs”占用的空间

fseo=0x1f92

FSEO Free Space End offset ( 9.2.0 )参与db_block_checking的计算剩余空间,select 的时候oracle不是简单的根据offset定位row.这个值也是参与了定位row的

 

 

 

 

FSBO 总是小于FSEO ,若Oracle进程读取数据块 发现FSBO>FSEO 且blocksum正确则认为出现了逻辑讹误

 

 

SQL> update corrupt_id set t1=t1+1;

1 row updated.

SQL> alter system checkpoint;

System altered.

SQL> alter system flush buffer_cache;

System altered.

SQL> /

System altered.

 

 

 

 

 

 

[oracle@vrh8 ~]$ bbed filename=/s01/oradata/G10R25/datafile/o1_mf_users_7ch7d4nx_.dbf blocksize=8192 mode=edit
Password:

 

BBED> set block 917060
BLOCK# 917060

BBED> map
File: /s01/oradata/G10R25/datafile/o1_mf_users_7ch7d4nx_.dbf (0)
Block: 917060 Dba:0x00000000
————————————————————
KTB Data Block (Table/Cluster)

struct kcbh, 20 bytes @0

struct ktbbh, 72 bytes @20

struct kdbh, 14 bytes @100

struct kdbt[1], 4 bytes @114

sb2 kdbr[1] @118

ub1 freespace[8062] @120

ub1 rowdata[6] @8182

ub4 tailchk @8188
BBED> p kdbh
struct kdbh, 14 bytes @100
ub1 kdbhflag @100 0x00 (NONE)
b1 kdbhntab @101 1
b2 kdbhnrow @102 1
sb2 kdbhfrre @104 -1
sb2 kdbhfsbo @106 20
sb2 kdbhfseo @108 8082
b2 kdbhavsp @110 8059
b2 kdbhtosp @112 8059

BBED> p *kdbr[0]
rowdata[0]
———-
ub1 rowdata[0] @8182 0x2c

BBED> set offset 8182
OFFSET 8182

BBED> x /rn
rowdata[0] @8182
———-
flag@8182: 0x2c (KDRHFL, KDRHFF, KDRHFH)
lock@8183: 0x02
cols@8184: 1

col 0[2] @8185: 2
BBED> set offset 108
OFFSET 108

BBED> d
File: /s01/oradata/G10R25/datafile/o1_mf_users_7ch7d4nx_.dbf (0)
Block: 917060 Offsets: 108 to 619 Dba:0x00000000
————————————————————————
921f7b1f 7b1f0000 0100921f 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

<32 bytes per line>

BBED> modify /x 0x3d00
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
File: /s01/oradata/G10R25/datafile/o1_mf_users_7ch7d4nx_.dbf (0)
Block: 917060 Offsets: 108 to 619 Dba:0x00000000
————————————————————————
3d007b1f 7b1f0000 0100921f 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

<32 bytes per line>

BBED> p kdbh
struct kdbh, 14 bytes @100
ub1 kdbhflag @100 0x00 (NONE)
b1 kdbhntab @101 1
b2 kdbhnrow @102 1
sb2 kdbhfrre @104 -1
sb2 kdbhfsbo @106 20
sb2 kdbhfseo @108 61
b2 kdbhavsp @110 8059
b2 kdbhtosp @112 8059

BBED> set offset 106
OFFSET 106

BBED> modify /x 0x4000
File: /s01/oradata/G10R25/datafile/o1_mf_users_7ch7d4nx_.dbf (0)
Block: 917060 Offsets: 106 to 617 Dba:0x00000000
————————————————————————
40003d00 7b1f7b1f 00000100 921f0000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

<32 bytes per line>

BBED> p kdbh
struct kdbh, 14 bytes @100
ub1 kdbhflag @100 0x00 (NONE)
b1 kdbhntab @101 1
b2 kdbhnrow @102 1
sb2 kdbhfrre @104 -1
sb2 kdbhfsbo @106 64
sb2 kdbhfseo @108 61
b2 kdbhavsp @110 8059
b2 kdbhtosp @112 8059

BBED> sum
Check value for File 0, Block 917060:
current = 0x52c9, required = 0x4d32

BBED> sum apply
Check value for File 0, Block 917060:
current = 0x4d32, required = 0x4d32

[oracle@vrh8 ~]$ dbv file=/s01/oradata/G10R25/datafile/o1_mf_users_7ch7d4nx_.dbf start=917059 end=917060

DBVERIFY: Release 10.2.0.5.0 – Production on Mon Sep 24 10:11:16 2012

Copyright (c) 1982, 2007, Oracle. All rights reserved.

DBVERIFY – Verification starting : FILE = /s01/oradata/G10R25/datafile/o1_mf_users_7ch7d4nx_.dbf
Block Checking: DBA = 17694276, Block Type = KTB-managed data block
data header at 0x7f49e04e8064
kdbchk: fsbo(64) wrong, (hsz 20)
Page 917060 failed with check code 6129
DBVERIFY – Verification complete

Total Pages Examined : 2
Total Pages Processed (Data) : 1
Total Pages Failing (Data) : 1
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 1
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 0
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 24976162 (0.24976162)

 

 

Does Duplicate Target Database need Pre-existing DB backup?

之前有网友问我在10g中通过RMAN 的duplicate target database命令复制数据库时是否需要先完成全库的备份。

实际我在10g中并不常用duplicate target database 来帮助创建DataGuard Standby Database,所以虽然记忆中仍有些印象,却不能十分确定地回答了。

今天查了一下资料,发现原来Active database duplication 和 Backup-based duplication 是11g才引入的特性,换句话说10g中duplication是要求预先完成数据库的RMAN backup备份的。

具体关于以上2个特性见文档<RMAN ‘Duplicate Database’ Feature in 11G>,引文如下:

RMAN 'Duplicate Database' Feature in 11G

You can create a duplicate database using the RMAN duplicate command.
The duplicate database has a different DBID from the source database and functions
entirely independently.Starting from 11g you can do duplicate database in 2 ways.

1. Active database duplication
2. Backup-based duplication

Active database duplication copies the live target database over the network to the
auxiliary destination and then creates the duplicate database.Only difference is that you
don't need to have the pre-existing RMAN backups and copies.

The duplication work is performed by an auxiliary channel.
This channel corresponds to a server session on the auxiliary instance on the auxiliary host.

As part of the duplicating operation, RMAN automates the following steps:

1. Creates a control file for the duplicate database
2. Restarts the auxiliary instance and mounts the duplicate control file
3. Creates the duplicate datafiles and recovers them with incremental backups and archived redo logs.
4. Opens the duplicate database with the RESETLOGS option

For the active database duplication, RMAN does one extra step .i.e. copy the
target database datafiles over the network to the auxiliary instance

A RAC TARGET database can be duplicated as well. The procedure is the same as below.
If the auxiliary instance needs to be a RAC-database as well,
than start the duplicate procedure for to a single instance and convert
the auxiliary to RAC after the duplicate has succeeded.

 

而在10g 中不仅需要对目标数据库进行备份,还需要手动将备份集(backupset)拷贝至目标主机上,这确实过于繁琐了:

 

Oracle10G RMAN Database Duplication
 If you are using a disk backup solution and duplicate to a
remote node you must first copy the backupsets from the original hosts backup
location to the same mount and path on the remote server. Because duplication
uses auxiliary channels the files must be where the IO pipe is allocated. So the
IO will take place on the remote node and disk backups must be locally available.

Restore Production Database to Development Platform A step-by-step example

Overview of setup and configuration

  • Oracle Enterprise Edition 11gR2. 11.2.0.1.0 Linux RH5 NetBackup 6.5.5
  • Identical environments:
    • Identical versions of Oracle.
    • Same ORACLE_HOME and ORACLE_BASE.
    • The archive log directory is defined to be the Flash Recovery Area:
  • You know the DBID.
  • Control file autobackup feature enabled with the default format.

NetBackup MML considerations

Restoring from tape using the NetBackup MML
MANDITORY requirement on the NetBackup Master Server.

On the NBU Master server the following directory and file must exist:
/[netbackup install]/db/altnames/No.Restrictions

Restore

Restore the spfile to pfile.

RMAN> run {
allocate channel t1 device type sbt;
send ‘NB_ORA_SERV=NBU_MasterServer.com, NB_ORA_CLIENT=Original_Client.com’;
set controlfile autobackup format for device type sbt to ‘%F’;
restore spfile
to pfile ‘?/dbs/initprod.ora’
from autobackup;
shutdown abort;
}

Backup Script:Expdp Schema to ASM Storage

Below is a shell script  which backup dmpfile to ASM storage everyday:

#!/bin/bash

#asmcmd mkdir DATA/ASM_DATAPUMP_BACKUP
#create directory ASM_DATAPUMP_BACKUP as '+DATA/ASM_DATAPUMP_BACKUP';
#grant read,write on directory ASM_DATAPUMP_BACKUP to system;
#create directory LOGDIR as '/s01/logdir';
#grant read,write on directory LOGDIR to system;
#create directory DMPBACKUP as '/s01/dmpbackup';
#grant read,write on directory DMPBACKUP to system;

export ORACLE_HOME=/s01/oracle/product/11.2.0/dbhome_1;
export ORA_CRS_HOME=/s01/grid;
export ORACLE_SID=PROD1;
export PATH=$ORACLE_HOME/bin:/s01/grid:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/oracle/bin:/home/oracle/bin:$ORACLE_HOME/OPatch;
thisday=`date "+%Y""%m""%d"`;
expfilename='maclean'$thisday'.dmp';
explogname='maclean'$thisday'.log';
yesterday=`date -d "1 days ago" +%Y%m%d`;
yesterdayfile='maclean'$yesterday'.dmp';
expdp system/system schemas=maclean directory=ASM_DATAPUMP_BACKUP dumpfile=$expfilename logfile=LOGDIR:$explogname;

#TRANSFER FILE FROM ASM TO FILESYSTEM

sqlplus / as sysdba <<EOF
exec DBMS_FILE_TRANSFER.COPY_FILE('ASM_DATAPUMP_BACKUP','$expfilename','DMPBACKUP','$expfilename');
EOF

export ORACLE_SID=+ASM1;
export ORACLE_HOME=/s01/grid;
$ORACLE_HOME/bin/asmcmd rm DATA/ASM_DATAPUMP_BACKUP/$yesterdayfile;
#/usr/bin/find /s01/dmpbackup -mtime 1 -name 'maclean*.dmp'  -exec /bin/rm -rf {} \;
#asmcmd cp DATA/ASM_DATAPUMP_BACKUP/$expfilename /s01/dmpbackup;

Script:收集Oracle备份恢复信息

我们在诊断Oracle backup restore问题时总是希望能获得足够的诊断信息,一般来说RDA会是一个最好的诊断信息收集工具,但是有时候客户会很反感使用RDA(不信任感),这里我们提供一段专门用来收集oracle备份恢复信息的脚本。

运行以下脚本需要设置合理的”ORACLE_HOME、ORACLE_SID”环境变量,并设置NLS_DATE_FORMAT环境变量,如

NLS_DATE_FORMAT="DD-MON-RRRR HH24:MI:SS"
export NLS_DATE_FORMAT

以”rman target /”登陆并运行:

spool log to rman_report.log
set echo on
show all;
report schema;
list incarnation;
list backup summary;
list backup;
list copy;
report need backup;
report obsolete;
restore database preview;
spool log off

以下脚本在sqlplus中以sysdba身份执行,执行要求数据库至少处于mounted已加载状态下;注意该原始脚本是只读readonly的,它仅仅是读取数据字典,不会造成危害,当然请确保你的脚本来源!!

spool results01.txt
set echo on feedback on time on timing on pagesize 100 linesize 80 numwidth 13
show user
alter session set nls_date_format = 'DD-MON-YYYY HH24:MI:SS';
select * from v$version;
select to_char(sysdate, 'DD-MON-YYYY HH24:MI:SS') as current_date from dual;
column name format a30
column value format a49
select name, value from v$parameter where isdefault='FALSE' order by 1;
column parameter format a30
column value format a49
select * from v$nls_parameters order by parameter;
column name format a10
select dbid, name,
       to_char(created, 'DD-MON-YYYY HH24:MI:SS') created,
       open_mode, log_mode,
       to_char(checkpoint_change#, '999999999999999') as checkpoint_change#,
       controlfile_type,
       to_char(controlfile_change#, '999999999999999') as controlfile_change#,
       to_char(controlfile_time, 'DD-MON-YYYY HH24:MI:SS') controlfile_time,
       to_char(resetlogs_change#, '999999999999999') as resetlogs_change#,
       to_char(resetlogs_time, 'DD-MON-YYYY HH24:MI:SS') resetlogs_time
from v$database;
select * from v$instance;
archive log list;
select * from v$thread order by thread#;
select * from v$log order by first_change#;
column member format a45
select * from v$logfile;
column name format a79
select '#' || ts.name || '#' as tablespace_name, ts.ts#,
       '#' || df.name || '#' as filename, df.file#, df.status, df.enabled, df.creation_change#,
       to_char(df.creation_time, 'DD-MON-YYYY HH24:MI:SS') as creation_time,
       to_char(df.checkpoint_change#, '999999999999999') as checkpoint_change#,
       to_char(df.checkpoint_time, 'DD-MON-YYYY HH24:MI:SS') as checkpoint_time,
       to_char(df.offline_change#, '999999999999999') as offline_change#,
       to_char(df.online_change#, '999999999999999') as online_change#,
       to_char(df.online_time, 'DD-MON-YYYY HH24:MI:SS') as online_time,
       to_char(df.unrecoverable_change#, '999999999999999') as online_change#,
       to_char(df.unrecoverable_time, 'DD-MON-YYYY HH24:MI:SS') as online_time,
       to_char(df.bytes, '9,999,999,999,990') as bytes, block_size
from v$datafile df, v$tablespace ts
where ts.ts# = df.ts#
and ( df.status <> 'ONLINE'
or    df.checkpoint_change# <> (select checkpoint_change# from v$database) );
select '#' || ts.name || '#' as tablespace_name, ts.ts#,
       '#' || dh.name || '#' as filename, dh.file#, dh.status, dh.error, dh.
fuzzy, dh.creation_change#,
       to_char(dh.creation_time, 'DD-MON-YYYY HH24:MI:SS') as creation_time,
       to_char(dh.checkpoint_change#, '999999999999999') as checkpoint_change#,
       to_char(dh.checkpoint_time, 'DD-MON-YYYY HH24:MI:SS') as checkpoint_time,
       to_char(dh.resetlogs_change#, '999999999999999') as resetlogs_change#,
       to_char(dh.resetlogs_time, 'DD-MON-YYYY HH24:MI:SS') as resetlogs_time,
       to_char(dh.bytes, '9,999,999,999,990') as bytes
from v$datafile_header dh, v$tablespace ts
where ts.ts# = dh.ts#
and ( dh.status <> 'ONLINE'
or    dh.checkpoint_change# <> (select checkpoint_change# from v$database) );
select * from v$tempfile;
select HXFIL File_num,substr(HXFNM,1,60) file_name, FHTNM tablespace_name,
       FHTYP type, HXERR validity,
       FHSCN SCN, FHTIM SCN_Time, FHSTA status,
       FHTHR Thread, FHRBA_SEQ Sequence
from X$KCVFH
--where HXERR > 0
order by HXERR, FHSTA, FHSCN, HXFIL;
column error format a15
select error, fuzzy, status, checkpoint_change#,
       to_char(checkpoint_time, 'DD-MON-YYYY HH24:MI:SS') as checkpoint_time,
       count(*)
from v$datafile_header
group by error, fuzzy, status, checkpoint_change#, checkpoint_time
order by checkpoint_change#, checkpoint_time;
select * from V$INSTANCE_RECOVERY;
select * from v$recover_file order by change#;
select * from dba_tablespaces where status <> 'ONLINE';
SELECT * FROM database_properties order by property_name;
select *
from X$KCCLH, (select min(checkpoint_change#) df_min_scn,
min(checkpoint_change#) df_max_scn
               from v$datafile_header
               where status='ONLINE') df
where LHLOS in (select first_change# from v$log)
or df.df_min_scn between LHLOS and LHNXS
or df.df_max_scn between LHLOS and LHNXS;
select * from v$backup where status <> 'NOT ACTIVE';
select ADDR, XIDUSN, XIDSLOT, XIDSQN,
       UBAFIL, UBABLK, UBASQN,
       START_UBAFIL, START_UBABLK, START_UBASQN,
       USED_UBLK, STATUS
from   v$transaction;
select * from v$archive_gap;
select * from v$archive_dest_status where recovery_mode <> 'IDLE';
column USED_GB format 999,990.999
column USED% format 990.99
column RECLAIM_GB format 999,990.999
column RECLAIMABLE% format 990.99
column LIMIT_GB format 999,990.999
select frau.file_type as type,
       frau.percent_space_used/100 * rfd.space_limit /1024/1024/1024 "USED_GB",
       frau.percent_space_used "USED%",
       frau.percent_space_reclaimable "RECLAIMABLE%",
       frau.percent_space_reclaimable/100 * rfd.space_limit /1024/1024/1024 "RECLAIM_GB",
       frau.number_of_files "FILES#"
from   v$flash_recovery_area_usage frau,
       v$recovery_file_dest rfd
order by file_type;
select name,
       space_limit/1024/1024/1024 "LIMIT_GB",
       space_used/1024/1024/1024 "USED_GB",
       space_used/space_limit*100 "USED%",
       space_reclaimable/1024/1024/1024 "RECLAIM_GB",
       number_of_files "FILE#"
from   v$recovery_file_dest;
select * from v$backup_corruption;
select * from v$copy_corruption order by file#, block#;
select * from v$database_block_corruption order by file#, block#;
SELECT f.file#, f.name,
       e.tablespace_name, e.segment_type, e.owner, e.segment_name,
       c.file#, c.block#, c.blocks, c.corruption_change#, c.corruption_type
FROM dba_extents e, V$database_block_corruption c, v$datafile f
WHERE c.file# = f.file#
and   e.file_id = c.file#
and   c.block# between e.block_id AND e.block_id + e.blocks - 1;
select * from v$database_incarnation;
select * from v$rman_configuration;
select s.recid as bs_key, p.recid as bp_key, p.status, p.tag, p.device_type,
       p.handle, p.media, p.completion_time, p.bytes
from   v$backup_piece p, v$backup_set s
where  p.set_stamp = s.set_stamp
and    s.controlfile_included='YES'
order by p.completion_time;
select s.recid as bs_key, p.recid as bp_key, p.status, p.tag, p.device_type,
       p.handle, p.media, p.completion_time, f.absolute_fuzzy_change#, p.bytes
from   v$backup_datafile f, v$backup_piece p, v$backup_set s
where  p.set_stamp = s.set_stamp
and    f.set_stamp = s.set_stamp
and    p.handle is not null
and    f.file# = 1
order by p.completion_time;
SELECT
  session_recid,
  input_bytes_per_sec_display,
  output_bytes_per_sec_display,
  time_taken_display,
  end_time
FROM v$rman_backup_job_details
ORDER BY end_time;
select * from v$filestat;
column EBS_MB format 9,990.99
column TOTAL_MB format 999,990.99
select SID, SERIAL, FILENAME, EFFECTIVE_BYTES_PER_SECOND/1024/1024 as EBS_MB,
      OPEN_TIME, CLOSE_TIME, ELAPSED_TIME, TOTAL_BYTES/1024/1024 as TOTAL_MB,
      STATUS, MAXOPENFILES, buffer_size, buffer_count
from v$backup_async_io
where close_time >= sysdate-3
order by close_time;
select SID, SERIAL, FILENAME, EFFECTIVE_BYTES_PER_SECOND/1024/1024 as EBS_MB,
      OPEN_TIME, CLOSE_TIME, ELAPSED_TIME, TOTAL_BYTES/1024/1024 as TOTAL_MB,
      STATUS, MAXOPENFILES, buffer_size, buffer_count
from v$backup_sync_io
where close_time >= sysdate-3;
select * from v$controlfile_record_section order by type;
select to_char(rownum) || '. ' || output rman_output from v$rman_output;
select * from v$rman_status where trunc(end_time) > trunc(sysdate)-3;
select protection_mode, protection_level from v$database;
select * from v$recovery_progress;
select s.client_info,
       sl.message,
       sl.sid, sl.serial#, p.spid,
       round(sl.sofar/sl.totalwork*100,2) "% Complete"
from   v$session_longops sl, v$session s, v$process p
where  p.addr = s.paddr
and    sl.sid=s.sid
and    sl.serial#=s.serial#
and    opname LIKE 'RMAN%'
and    opname NOT LIKE '%aggregate%'
and    totalwork != 0
and    sofar <> totalwork;
select AL.*,
       DF.min_checkpoint_change#, DF.min_checkpoint_time
from v$archived_log AL,
     (select min(checkpoint_change#) min_checkpoint_change#,
             min(checkpoint_time) min_checkpoint_time
      from v$datafile_header
      where status='ONLINE') DF
where DF.min_checkpoint_change# between AL.first_change# and AL.next_change#
order by AL.first_change#;
select * from v$asm_diskgroup;
select * from v$asm_disk;
select * from v$flashback_database_log;
select * from v$flashback_database_logfile order by first_change# desc;
select * from v$flashback_database_stat order by begin_time desc;
select * from v$restore_point;
select * from v$rollname;
select * from v$undostat;
select * from dba_rollback_segs;
spool off

Does Rman Backup benefit from Large Pool?

我们在学习Oracle的过程中,或多或少会存在个人对概念的理解错误、误解或者根本是教材编写存在不严谨的地方,这样或以讹传讹或三人言虎,导致在Oracle圈子存在着一些古老相传的迷信(superstition),因为这些迷信已经深入人心了,所以我们几乎很难纠正过来;这其实很有意思,IT作为一个高科技的领域也会出现迷信,说明我们在IT技术的”教学”和”思考”上存在问题,这一点值得深思。

这里我列出几个最为常见的迷信,算作抛砖引玉:

1.几乎所有的Oracle入门教程都会在介绍Large pool的时候这样描述:”RMAN 备份使用large pool作为磁盘I/O缓冲区,配置Large pool有助于提高RMAN备份性能”

Truth:除非你启用了slaves IO,否则rman并不使用large pool

RMAN I/O可以分成三种模式:

Mode Disk tape
Asynchronous I/O 绝大多数操作系统支持AIO,默认disk_asynch_io为TRUE,即默认启用磁盘异步IO。如果磁盘设备不支持AIO,那么会使用synchronous I/O。磁盘异步模式下RMAN I/O缓冲区域从PGA中分配,相关IO性能信息存放在V$backup_async_io视图中 磁带设备本身不支持AIO(tape I/O is always synchronous),虽然默认tape_asynch_io为TRUE,但磁带设备只能通过IO slaves模拟异步IO,所以启用磁带AIO需要另外设置backup_tape_io_slaves=TRUE。此模式下RMAN I/O缓冲区从shared pool或者large pool中分配,相关IO性能信息存放在V$backup_async_io视图中
Synchronous I/O 若disk_asynch_io设置为false,或操作系统不支持异步IO,且dbwr_io_slaves=0时启用Synchronous I/O。此时RMAN I/O缓冲区从PGA中分配,相关IO性能信息存放在V$backup_sync_io视图中 默认backup_tape_io_slaves为false,即磁带设备默认不启用AIO而使用Synchronous I/O。此时RMAN I/O缓冲区从PGA中分配,相关性能信息存放在V$backup_sync_io视图中
Slaves I/O 启用disk slaves I/O,要求设置disk_asynch_io=false且dbwr_io_slaves>0。此模式下RMAN I/O缓冲区从shared pool或者large pool中分配,相关IO性能信息存放在V$backup_async_io视图中 设置tape_asynch_io=true且backup_tape_io_slaves=true时启用,磁带的AIO模式其实就是使用slaves Io模拟获得的。所以此模式下的一切和tape AIO完全一样

我们在使用RMAN备份数据库时无论是磁盘备份还是磁带备份总是优先期望使用AIO异步IO特性(tape aio比较特殊,见上表),使用AIO的前提是设置合理的初始化参数以及操作系统支持AIO,如果我们使用的操作系统不支持AIO那么我们将不得不使用Synchronous IO同步IO。这并不是世界末日,因为Oracle提供了IO从属进程(slaves IO)来模拟AIO,当然这是退而求其次的。为了启用slaves IO,我们需要手动设置backup_tape_io_slaves或dbwr_io_slaves参数来启用IO从属特性,当使用磁带备份时设置backup_tape_io_slaves(此时tape_asynch_io应当为true)为true,当使用磁盘设备时设置dbwr_io_slaves(此时disk_asynch_io应当为false)为非零值。在启用slaves IO的前提下RMAN才会从Large pool当中分配内存并加以利用,如果没有配置large pool(注意如果启用了ASMM,那么Oracle会自动为large pool分配一个granule大小的空间)或者large pool过小,那么RMAN的内存缓冲区将从shared pool中分配。如果Oracle仍不能获得足够内存,那么将本地进程获取足够的IO缓存。若我们启用了I/O slaves,那么很有必要配置一个足够大的Large pool(一般60-100M就足够了),这样RMAN的I/O缓存区可以从large pool中分配,避免了RMAN的I/O buffer和shared pool中的library cache等其他组件发生竞争。

If I/O slaves are used, I/O buffers are obtained from the SGA ,or the large pool, if configured.If LARGE_POOL_SIZE is set, then Oracle attempts to get memory from the large pool. If this value is not large enough, then Oracle does not try to get buffers from the shared pool.If Oracle cannot get enough memory, then it obtains I/O buffer memory from local process memory and writes a message to the alert.log file indicating that synchronous I/O is used for this backup.

在默认情况下Oracle对于磁盘设备使用AIO模式(disk_asynch_io=true & dbwr_io_slaves=0 by default),而对于磁带设备使用synchronous I/O(tape_asynch_io=true & backup_tape_io_slaves=false by default),都不会启用slaves I/O,所以默认情况下RMAN总是从PGA中分配缓存。换而言之在默认情况下,即便配置了较大的Large pool也不会为RMAN所用。

RMAN allocates the tape buffers in the SGA or the PGA, depending on whether I/O slaves are used. If you set the initialization parameter BACKUP_TAPE_IO_SLAVES = true, then RMAN allocates tape buffers from the SGA or the large pool if the LARGE_POOL_SIZE initialization parameter is set. If you set the parameter to false, then RMAN allocates the buffers from the PGA.

我们来通过以下演示,进一步验证AIO/Slave IO环境下RMAN内存缓冲区从哪里分配,并加强印象:

SQL> select * From v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> show parameter async

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
disk_asynch_io                       boolean     TRUE
tape_asynch_io                       boolean     TRUE

SQL> select * From v$sgastat where pool='large pool';

POOL         NAME                            BYTES
------------ -------------------------- ----------
large pool   PX msg pool                    903840
large pool   free memory                  15873376

backup as backupset database skip offline;

SQL> select * From v$sgastat where pool='large pool';

POOL         NAME                            BYTES
------------ -------------------------- ----------
large pool   PX msg pool                    903840
large pool   free memory                  15873376

/* 在AIO模式下,全库备份后发现large pool未发生变化 */

SQL> alter system set disk_asynch_io=false scope=spfile;
System altered.

SQL> alter system set dbwr_io_slaves=2 scope=spfile;
System altered.

/* 以上启用了磁盘I/O Slave特性 */

SQL> startup force;

[oracle@rh2 ~]$ ps -ef|grep i10|grep -v grep
oracle   20761     1  0 20:44 ?        00:00:00 ora_i101_G10R2
oracle   20763     1  0 20:44 ?        00:00:00 ora_i102_G10R2

/* 启用I/O Slave后会出现ora_ixxx_SID这样的后台进程 */

SQL> select * From v$sgastat where pool='large pool';

POOL         NAME                            BYTES
------------ -------------------------- ----------
large pool   PX msg pool                    903840
large pool   free memory                  15873376

RMAN> backup as backupset database skip offline;

SQL> select * From v$sgastat where pool='large pool';

POOL         NAME                            BYTES
------------ -------------------------- ----------
large pool   PX msg pool                    903840
large pool   free memory                  24151392
large pool   KSFQ Buffers                 25276416

SQL> /

POOL         NAME                            BYTES
------------ -------------------------- ----------
large pool   PX msg pool                    903840
large pool   free memory                  41006432
large pool   KSFQ Buffers                  8421376

/* 启用了I/O Slave后执行备份操作,
    large pool中出现了KSFQ Buffers,这个KSFQ buffer就是RMAN所使用的缓冲区,
    实际决定该buffer大小的是隐藏参数_backup_ksfq_bufsz和_backup_ksfq_bufcnt */

SQL> col name for a30
SQL> col describ for a70
SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
  2   FROM SYS.x$ksppi x, SYS.x$ksppcv y
  3   WHERE x.inst_id = USERENV ('Instance')
  4   AND y.inst_id = USERENV ('Instance')
  5   AND x.indx = y.indx
  6  AND x.ksppinm LIKE '%ksfq%';

NAME                           VALUE      DESCRIB
------------------------------ ---------- ----------------------------------------------------------------------
_backup_ksfq_bufsz             0          size of the ksfq buffer used for backup/restore
_backup_ksfq_bufcnt            0          number of the ksfq buffers used for backup/restore

/* 在10g中似乎Oracle会自动调控以上2个参数 */

SQL> alter system set "_backup_ksfq_bufsz"=131072;
System altered.

SQL> alter system set "_backup_ksfq_bufcnt"=1;
System altered.

RMAN> backup tablespace data01;

/*  I/O slaves的IO统计信息仍存放在V$backup_sync_io视图中,
    而非可能是你所预期的v$backup_sync_io视图  */

SQL> select type,buffer_size,buffer_count from v$backup_async_io;

TYPE      BUFFER_SIZE BUFFER_COUNT
--------- ----------- ------------
AGGREGATE           0            0
INPUT          131072            1
OUTPUT        1048576            4

另外large pool的使用量可以通过下列公式来估算:
LARGE_POOL_SIZE =
(4 * {RMAN Channels} * {DB_BLOCK_SIZE} * {DB_DIRECT_IO_COUNT} * {Multiplexing Level})
+
(4 * {RMAN Channels} * {Tape Buffer Size})

事实上如果你probe过PGA的内存使用情况,那么你或许会在PGA headdump中看到过”KSFQ heap”的相关信息。显然当在非slaves IO模式下,RMAN会从PGA HEAP->KSFQ heap这个subheap子堆中分配必要的buffer。

我们在磁盘AIO模式下执行必要的backup操作,之后找出RMAN相关的shadow process并对其做heapdump,分析其pga内存使用情况

SQL> select spid,pga_used_mem,pga_max_mem from v$process where addr in
  2  (select paddr from v$session where program like '%rman%')
  3  order by pga_used_mem desc ;

SPID         PGA_USED_MEM PGA_MAX_MEM
------------ ------------ -----------
24424             5750341    14410829
24425             4717957    12134125
24413             3308341     9626701
24423              435773      993005

SQL> oradebug setospid 24424;
Oracle pid: 25, Unix process pid: 24424, image: oracle@rh2.oracle.com (TNS V1-V3)

SQL> oradebug dump heapdump 536870917;
Statement processed.

SQL> oradebug tracefile_name;
/s01/admin/G10R2/udump/g10r2_ora_24424.trc

==========================heapdump details==============================

FIVE LARGEST SUB HEAPS for heap name="pga heap"   desc=0x68d3ec0
  Subheap ds=0x87c83e8  heap name=       KSFQ heap  size=         4205296
   owner=(nil)  latch=(nil)

******************************************************
HEAP DUMP heap name="KSFQ heap"  desc=0x87c83e8
 extent sz=0x1040 alt=32767 het=32767 rec=0 flg=2 opc=2
 parent=0x68d3ec0 owner=(nil) nex=(nil) xsz=0x20228
EXTENT 0 addr=0x7f86bf788dd8
  Chunk     7f86bf788de8 sz=  1049112    freeable  "KSFQ Buffers   "
EXTENT 1 addr=0x7f86bf988dd8
  Chunk     7f86bf988de8 sz=  1049112    freeable  "KSFQ Buffers   "
EXTENT 2 addr=0x7f86bfb88dd8
  Chunk     7f86bfb88de8 sz=  1049112    freeable  "KSFQ Buffers   "
EXTENT 3 addr=0x7f86bfc98dd8
  Chunk     7f86bfc98de8 sz=  1049112    freeable  "KSFQ Buffers   "
EXTENT 4 addr=0x7f86bfddf358
  Chunk     7f86bfddf368 sz=     5192    freeable  "KSFQ ctx       "
EXTENT 5 addr=0x87c7680
  Chunk        0087c7690 sz=      984    perm      "perm           "  alo=984
  Chunk        0087c7a68 sz=     1944    free      "               "
  Chunk        0087c8200 sz=      464    freeable  "KSFQ buffer poo"
Total heap size    =  4205032
FREE LISTS:
 Bucket 0 size=0
  Chunk        0087c7a68 sz=     1944    free      "               "
Total free space   =     1944
UNPINNED RECREATABLE CHUNKS (lru first):
PERMANENT CHUNKS:
  Chunk        0087c7690 sz=      984    perm      "perm           "  alo=984
Permanent space    =      984

/* 以上可以看到KSFQ Heap子堆共占用了4205296=4M内存,
    而该服务进程的pga_used_memory总量为5750341 bytes,即KSFQ占该进程PGA的73%
    另外这里KSFQ Buffer的大多Chunk是freeable的,仅少量为perm                    */

另外磁带I/O缓冲区的大小可以在配置通道时指定,其默认值由操作系统决定,一般为64k。我们使用allocate channel命令设置磁带I/O缓冲区,为了达到最佳性能,可以将磁带I/O缓冲区大小设置为256KB或更大,如:

allocate channel maclean1 device type sbt
parms="blksize=262144,ENV=(NB_ORA_SERV=nas,NB_ORA_POLICY=racdb,NB_ORA_CLIENT=rh2)";

结论:

  1. 在默认情况下(即disk backup使用AIO而tape backup使用sync IO),RMAN backup并不会从Large Pool中获益,而是从PGA中的KSFQ heap中分配必要的I/O内存缓冲区。当然我们还是推荐设置Large_pool_size为100M左右,因为即使是PC服务器也不会缺这一点内存
  2. 仅当启用I/O slaves时RMAN backup从Large Pool中分配ksfq buffer(ksfq buffer used for backup/restore),在9i/10g中常有因large pool过小而引起ORA-04031错误的案例;若large pool大小为0,那么ksfq buffer会从shared pool中分配,一方面ORA-04031:(“shared pool”,”unknown object”,”sga heap(1,0)”,”KSFQ Buffers”)错误的概率将大大增加,另一个方面将造成KSFQ与library cache间的竞争,不利于性能。这种情况下RMAN backup的确从Large pool中得到收益,设置large_pool_size为100M仍是被推荐的
  3. 在非slaves IO模式下RMAN从PGA的KSFQ heap子堆中分配I/O缓冲区,因为在非slaves IO模式下该内存缓冲区没有共享的必要

待修订!!

Mode Disk tape
Asynchronous I/O 绝大多数磁盘设备支持AIO,默认disk_asynch_io为TRUE,即默认启用磁盘异步IO。如果磁盘设备不支持AIO,那么会使用synchronous I/O。磁盘异步模式下RMAN I/O缓冲区域从PGA中分配,相关IO性能信息存放在V$backup_async_io视图中 磁带设备本身不支持AIO,虽然默认tape_asynch_io为TRUE,但磁带设备只能通过IO slaves模拟异步IO,所以启用磁带AIO需要另外设置backup_tape_io_slaves=TRUE。此模式下RMAN I/O缓冲区从shared pool或者large pool中分配,相关IO性能信息存放在V$backup_async_io视图中
Synchronous I/O 若disk_asynch_io设置为false,或磁盘设备不支持异步IO,且dbwr_io_slaves=0时启用Synchronous I/O。此时RMAN I/O缓冲区从PGA中分配,相关IO性能信息存放在V$backup_sync_io视图中 默认backup_tape_io_slaves为false,即磁带设备默认不启用AIO而使用Synchronous I/O。此时RMAN I/O缓冲区从PGA中分配,相关性能信息存放在V$backup_sync_io视图中
Slaves I/O 启用disk slaves I/O,要求设置disk_asynch_io=false且dbwr_io_slaves>0。此模式下RMAN I/O缓冲区从shared pool或者large pool中分配,相关IO性能信息存放在V$backup_async_io视图中 同tape AIO时一样

Oracle备份恢复:Rman Backup缓慢问题一例

近日客户报多套10g的数据库在使用NBU磁带备份时出现RMAN FULL BACKUP十分缓慢的问题,这些数据库中最大一个的达到2.61T,该数据库在一个月前地全库0级备份耗时在3-4个小时,而在最近猛涨到17个小时。客户之前已经向Symantec提交了服务请求,但暂时没有得到结论。希望我们从Oracle角度分析该备份速度变慢问题。

我们首先分析了备份信息的动态视图V$rman_backup_job_details:

OUTPUT_DEVICE INPUT_TYPE ELAPSED_SECONDS INPUT_BYTES_DISPLAY INPUT_BYTES_PER_SEC OUTPUT_BYTES_PER_SEC
17 SBT_TAPE DB INCR 62078 2.61T 44.08M 18.10M

以上可以确认在对2.61T大小的数据库执行全库磁带备份时耗费了62078s,这里还显示了backup时每秒的读取IO为44M,每秒的写出IO为18M;这里不能因为简单的output io per second<input io per second,而断论写出能力存在瓶颈;备份时对数据文件的读取和写出backup piece到备份介质上的操作是一个整体,CPU、Input IO、Output IO任何一环都可能成为备份的瓶颈;譬如因为对数据文件的读取IO存在瓶颈,那么相应的写出IO也会慢下来;又譬如当RMAN与备份服务器之间的IO带宽存在瓶颈,那么相应的读取IO也会不得不慢下来。具体是哪一个环节出现了问题,我们需要求助于其他的RMAN动态性能视图,如:

V$BACKUP_SYNC_IO
Displays rows when the I/O is synchronous to the process (or thread on some platforms) performing the backup.

V$BACKUP_ASYNC_IO
Displays rows when the I/O is asynchronous to the process (or thread on some platforms) performing the backup.

以上2个视图的区别在于一个汇聚了使用同步IO执行RMAN备份恢复操作的性能信息,而另一个是异步IO的。

因为客户使用默认配置disk_async_io为true,故而我们首先关注input IO的性能信息,这些信息存放在V$backup_async_io视图中;而对于磁带设备未启用slaves IO模拟的异步IO(tape_asynch_io=true但是backup_tape_io_slaves为默认的false),所以与磁带设备相关的output IO的性能信息存放在v$backup_sync_io视图中。

DEVICE OPEN_TIME ELAPSED BYTES/S IO_COUNT READY long_waits LONG_WAIT_TIME_TOTAL LONG_WAIT_TIME_MAX
DISK 4/4 388900 372827681 2765564 2074114 90028 231181 53
DISK 4/5 753900 192323498 2765564 2074114 90028 178548 41
DISK 4/6 369000 392934106 2765564 2074114 90028 243507 41
DISK 4/7 405100 357918255 2765564 2074114 90028 268117 73
DISK 4/8 347900 416765407 2765564 2074114 90028 183543 77
DISK 4/9 395800 366328159 2765564 2074114 90028 255399 48
DISK 4/10 428400 338451646 2765564 2074114 90028 268544 45
DISK 4/11 413200 350901949 2765564 2074114 90028 269857 42
DISK 4/12 735400 197161661 2765564 2074114 90028 169016 34
DISK 4/13 410000 353640696 2765564 2074114 90028 283607 60
DISK 4/14 408300 355113116 2765564 2074114 90028 279012 38
DISK 4/15 442700 327519054 2765564 2074114 90028 308744 605
DISK 4/16 393000 368938130 2765564 2074114 90028 251509 205
DISK 4/17 423100 342691291 2765564 2074114 90028 273868 42
DISK 4/18 375600 386029513 2765564 2074114 90028 230859 328
DISK 4/19 721200 201043657 2765564 2074114 90028 191753 162
DISK 4/20 401000 361577769 2765564 2074114 90028 272852 147
DISK 4/21 346600 418328578 2765564 2074114 90028 209569 109
DISK 4/22 400500 362029177 2765564 2074114 90028 265060 112
DISK 4/23 402400 360319794 2765564 2074114 90028 259008 594
DISK 4/24 449600 322492627 2765564 2074114 90028 274202 64
DISK 4/25 413900 350308493 2765564 2074114 90028 269380 230
DISK 4/26 748600 193685126 2765564 2074114 90028 177477 105
DISK 4/27 389900 371871468 2765564 2074114 90028 261200 38
DISK 4/28 383800 377781879 2765564 2074114 90028 241870 158
DISK 4/29 403700 359159488 2765564 2074114 90028 266135 212
DISK 4/30 390600 371205031 2765564 2074114 90028 248161 340
DISK 5/1 463600 312753851 2765564 2074114 90028 271247 39
DISK 5/2 419900 345302894 2765564 2074114 90028 255042 117
DISK 5/3 705700 205459381 2765564 2074114 90028 173043 189
DISK 5/4 418400 346540835 2765564 2074114 90028 291614 47
DISK 5/5 357400 405687424 2765564 2074114 90028 222676 188
DISK 5/6 421400 344073767 2765564 2074114 90028 285860 95
DISK 5/7 405100 357918255 2765564 2074114 90028 263761 38
DISK 5/8 381500 380059463 2765564 2074114 90028 203510 210
DISK 5/9 918400 157875311 2765564 2074114 90028 221293 37
DISK 5/10 3378600 42915020 2765564 2074114 90028 142211 36
DISK 5/11 559900 258961753 2765564 2074114 90028 252911 174
DISK 5/12 622500 232919976 2765564 2074114 90028 241495 40
DISK 5/13 626700 231359000 2765564 2074114 90028 237973 41
DISK 5/14 802000 180788884 2765564 2074114 90028 231283 42
DISK 5/15 601200 241172131 2765564 2074114 90028 209418 40
DISK 5/16 1387800 104476643 2765564 2074114 90028 211878 36

这里我们关心的几个时间指标包括:ELAPSED(Input IO的总耗时)、LONG_WAIT_TIME_TOTAL(长IO的总等待时间)、LONG_WAIT_TIME_MAX(最长一次的IO等待时间)的单位均为百分之一秒,该视图的具体定义参考这里
从以上chart中(由于列宽的原因只截取了部分数据)我们可以看到从4/4到5/16之间备份的Input IO总耗时(elapsed)虽然间歇性的暴涨到过33786s,但其他IO指标:IO总数、READY IO总数、LONG IO WAIT的次数、LONG IO WAIT的总耗时、最长LONG IO WAIT均没有出现大的变化;基本可以判定在备份期间对数据文件的读取不存在瓶颈,为了进一步诊断需要分析备份输出的IO性能状况:

DEVICE date ELAPSED BYTES BYTES/S IO_COUNT IO_TIME_TOTAL IO_TIME_MAX DISCRETE_BYTES_PER_SECOND
SBT_TAPE 4/5 754900 584663433216 77449123 2230314 440365 2600 132767916
SBT_TAPE 4/5 703900 553432907776 78623797 2111179 381135 5800 145206530
SBT_TAPE 4/12 736400 588200542208 79875142 2243807 433298 3400 135749655
SBT_TAPE 4/12 692300 556839731200 80433299 2124175 369237 2600 150808216
SBT_TAPE 4/19 722200 591873179648 81954193 2257817 395904 3400 149499166
SBT_TAPE 4/19 829000 561043210240 67677106 2140210 510746 2801 109847793
SBT_TAPE 4/26 749600 596010598400 79510485 2273600 435940 2600 136718493
SBT_TAPE 4/26 700300 565171191808 80704154 2155957 377019 2800 149905228
SBT_TAPE 5/3 706800 600177377280 84914739 2289495 396965 5800 151191510
SBT_TAPE 5/3 712500 569155518464 79881476 2171156 392324 5800 145072827
SBT_TAPE 5/10 3379700 604452159488 17884787 2305802 3093781 2802 19537652
SBT_TAPE 5/10 2798400 573396746240 20490164 2187335 2524296 2801 22715115
SBT_TAPE 5/17 428095307776 1633054 2216291 5800 19315844

可以从以上chart中可以发现到5/3为止的output io总耗时还处于合理范围内(为7068+7125)s,约为4小时。而到了5/10时输出IO的总耗时达到了(33797+29784)s,约为17.6小时。研究其他IO统计信息可以发现5/10的IO_TIME_TOTAL总数为(30937+25242)s,IO_TIME_TOTAL代表了某个IO等待的总耗时,单位为百分之一秒。从另一个IO性能指标DISCRETE_BYTES_PER_SECOND来看,在5/10备份文件的平均传输率急剧下降。

综合上述现象,诱发客户环境中的RMAN全库备份缓慢的主要原因是备份文件输出IO性能在一段时间内急剧下降,瓶颈应当存在于RMAN与NBU备份服务器之间,与数据库读取性能没有关系。给客户的建议是复查数据库服务器到NBU备份服务器间的网络带宽是否正常,NBU服务器的性能是否正常,磁带库的写出性能是否正常。

这个case后续经过对NBU的复查发现原因是虚拟磁带库VTL的存储电池过期,导致读写数据时直接从磁盘上进行,而不经过缓存,故影响了数据库全备份的速度。由于VTL存储电池过期信息需要从串口直接访问其存储才能确认问题,所以在问题发生之初无法从VTL的管理界面中获取该信息。

这个case告诉我们不仅是raid卡的冲放电、存储的ups电池过期会引发严重的IO性能问题,VTL的存储电池过期也可能给数据库备份造成麻烦,要让系统7*24时刻正常运行,有太多指标需要我们去关注,良好的规范很重要,同时很多时候我们不得不和硬件打交道。

沪ICP备14014813号-2

沪公网安备 31010802001379号