[Oracle Data Recovery] ORA-01115, ORA-01110, ORA-27091, ORA-27070, OSD-04006, O/S-Error

Professional Oracle Database recovery services
Mobile: +86 13764045638 Email: service@parnassusdata.com

 

A user's database on Windows 2003 suffered a storage failure, and the SYSTEM tablespace data file system.dbf began returning I/O errors. Starting the database failed with the following errors:

 

 

ORA-01115: IO error reading block from file 15
ORA-01110: data file …
ORA-27091: unable to queue I/O
ORA-27070: async read/write failed
OSD-04006: ReadFile() failure, unable to read from file
O/S-Error: (OS 121) The semaphore timeout period has expired.

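The errors above (ORA-01115, ORA-01110, ORA-27091, ORA-27070, OSD-04006, O/S-Error) are not fundamentally Oracle database problems. The underlying problem is that Windows cannot read the file on the affected disk drive, which points to either an OS bug or a physical fault on that disk. Look for a fix at the OS level first; if that does not work, special recovery measures will have to be considered.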
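
As a rough illustration of reading the stack from the OS side, the following Python sketch pulls the Windows error code out of the O/S-Error line. The function name `parse_error_stack` is a hypothetical helper, not part of any Oracle tool, and the sample text is simply the error stack quoted above.

```python
import re

# Final line of the stack, e.g.
# "O/S-Error: (OS 121) The semaphore timeout period has expired."
OS_ERROR_RE = re.compile(r"O/S-Error:\s*\(OS\s+(\d+)\)\s*(.*)")
# Oracle-side codes such as ORA-27091 or OSD-04006.
ORACLE_CODE_RE = re.compile(r"\b(ORA|OSD)-(\d{5})\b")


def parse_error_stack(text):
    """Return the Oracle codes and the underlying OS error found in text."""
    codes = [f"{prefix}-{num}" for prefix, num in ORACLE_CODE_RE.findall(text)]
    match = OS_ERROR_RE.search(text)
    os_code = int(match.group(1)) if match else None
    os_text = match.group(2).strip() if match else None
    return {"oracle_codes": codes, "os_code": os_code, "os_text": os_text}


# Sample copied from the error stack in this article (the elided
# data file name is left elided).
SAMPLE = """\
ORA-01115: IO error reading block from file 15
ORA-01110: data file …
ORA-27091: unable to queue I/O
ORA-27070: async read/write failed
OSD-04006: ReadFile() failure, unable to read from file
O/S-Error: (OS 121) The semaphore timeout period has expired.
"""

if __name__ == "__main__":
    info = parse_error_stack(SAMPLE)
    print(info["oracle_codes"])
    print(info["os_code"], info["os_text"])
```

The OS code at the bottom of the stack is the value to investigate on the Windows side; the ORA- and OSD- lines only report that the read failed.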

 

Error: OSD 4006
Text: ReadFile() failure, unable to read from file
—————————————————————————
Cause: Unexpected return from Windows NT system service ReadFile()
Action: Check OS error code and consult Windows NT documentation

This is due to a problem in Windows such that when Oracle attempted to access the data file on that device, it could not because the device timed out. This suggests that Windows has run out of asynchronous I/O buffers or there is a communications delay on the device.

There is nothing you’re going to be able to do at the database level to resolve this error, unless you move the data files to another drive. Ask the O/S system administrator to run diagnostics tools to check for possible faulty hardware and disk corruption on the disk device where the error is showing in the loader log. If the error persists, then log a call with Microsoft Support.
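
One simple OS-level check is to read the affected file from start to end and note where reads fail. The sketch below is a hypothetical helper (`scan_file` is not an Oracle or Windows utility). It assumes the database is shut down or that a copy of the file is being read, and the 8192-byte block size is only an assumed default. It does not replace the hardware vendor's disk diagnostics.

```python
import os


def scan_file(path, block_size=8192):
    """Read path block by block; return (offset, error text) for failed reads."""
    failures = []
    size = os.path.getsize(path)
    # buffering=0 so that each read() is passed to the OS rather than
    # being served from a Python-level buffer.
    with open(path, "rb", buffering=0) as f:
        offset = 0
        while offset < size:
            try:
                f.seek(offset)
                f.read(block_size)
            except OSError as exc:
                # Record the failure and carry on with the next block.
                failures.append((offset, str(exc)))
            offset += block_size
    return failures
```

Calling `scan_file` on a data file path returns an empty list when every block is readable. A non-empty result lists the byte offsets where the OS refused the read, which can be passed on to whoever is diagnosing the disk.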

Oracle processes may encounter various (OS 1117) errors on a Windows 2003 Server. The text of the (OS 1117) error can be seen as follows:

C:\>net helpmsg 1117
The request could not be performed because of an I/O device error.

This error may manifest itself in different ways, depending on which Oracle process encounters the error:

Oracle RDBMS Instance Encounters (OS 1117) error

  1. If an Oracle RDBMS instance encounters the error, you may see messages such as the following in the alert log for the RDBMS instance:

==========================================================================

Fri Jul 13 01:21:33 2007
Errors in file d:\oracle\db\product\admin\mydb\bdump\mydb1_lmon_4608.trc:
ORA-27091 : unable to queue I/O
ORA-27070 : async read/write failed
OSD-04006: ReadFile() failure, unable to read from file
O/S-Error: (OS 1117) The request could not be performed because of an I/O device error.
==========================================================================

Oracle ASM Instance Encounters (OS 1117) errors

  2. If an Oracle ASM instance encounters the error, you may see similar errors in the ASM instance's alert log:

============================================
Fri Jul 13 01:22:10 2007
Errors in file d:\oracle\asm\product\admin\+asm\bdump\+asm1_gmon_3836.trc:
ORA-27091 : unable to queue I/O
ORA-27070 : async read/write failed
OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 1117) The request could not be performed because of an I/O device error.
============================================
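
The alert log excerpts above share a regular shape: a timestamp line, an "Errors in file" line, then the error stack. A minimal sketch, assuming that shape holds in your logs, groups each O/S-Error with its timestamp and trace file. The function name `find_io_errors` is hypothetical, and the sample text is taken from the two excerpts above.

```python
import re

# e.g. "Fri Jul 13 01:21:33 2007"
TIMESTAMP_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$"
)
TRACE_RE = re.compile(r"^Errors in file (.+?):$")
OS_ERROR_RE = re.compile(r"^O/S-Error: \(OS (\d+)\)")


def find_io_errors(log_text):
    """Return a list of dicts, one per O/S-Error line in the log text."""
    timestamp = None
    trace = None
    results = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if TIMESTAMP_RE.match(line):
            # A new entry starts; forget the previous trace file.
            timestamp, trace = line, None
            continue
        trace_match = TRACE_RE.match(line)
        if trace_match:
            trace = trace_match.group(1)
            continue
        os_match = OS_ERROR_RE.match(line)
        if os_match:
            results.append(
                {"time": timestamp, "trace": trace, "os_code": int(os_match.group(1))}
            )
    return results


SAMPLE_LOG = r"""
Fri Jul 13 01:21:33 2007
Errors in file d:\oracle\db\product\admin\mydb\bdump\mydb1_lmon_4608.trc:
ORA-27091 : unable to queue I/O
ORA-27070 : async read/write failed
OSD-04006: ReadFile() failure, unable to read from file
O/S-Error: (OS 1117) The request could not be performed because of an I/O device error.
Fri Jul 13 01:22:10 2007
Errors in file d:\oracle\asm\product\admin\+asm\bdump\+asm1_gmon_3836.trc:
ORA-27091 : unable to queue I/O
ORA-27070 : async read/write failed
OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 1117) The request could not be performed because of an I/O device error.
"""

if __name__ == "__main__":
    for hit in find_io_errors(SAMPLE_LOG):
        print(hit["time"], hit["os_code"], hit["trace"])
```

Seeing the same OS code from both the RDBMS and ASM instances within a short window is a hint that the fault lies in the shared storage path rather than in either instance.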

CRS Daemon (crsd.exe) encounters 1117 errors

  3. If you are running in an Oracle Clusterware environment, then you may also see errors in the crsd.log and/or certain resource logs, indicating a problem accessing the OCR (Oracle Cluster Registry). An example of those errors would be:

================================================
2007-07-13 01:21:51.766: [ OCROSD][4272]utwrite:4: Problem writing the buffer phy offset 184320 and oserror 1117
2007-07-13 01:21:51.766: [ OCROSD][4352]utwrite:4: Problem writing the buffer phy offset 184320 and oserror 1117
2007-07-13 01:21:51.766: [ OCRRAW][4352]beginlog: problem 26 clearing the log metadata buffer
2007-07-13 01:21:51.766: [ OCRRAW][4352]proprdkey: Problem in begin log
2007-07-13 01:21:51.766: [ OCRRAW][4352]proprseterror: Error in accessing physical storage [26] Marking context invalid.
================================================

CSS Daemon (ocssd.exe) encounters 1117 errors

  4. Also, in an Oracle Clusterware environment, the Cluster Synchronization Services daemon (ocssd.exe) may experience problems accessing the voting disk. If this occurs, you will see an error in the ocssd.log similar to the following:

============================================
[ CSSD]2007-07-13 01:22:12.501 [4052] >ERROR: Internal Error Information:
Category: 1234
Operation: scls_block_write
Location: WriteFile
Other: unable to write block(s)
Dep: 1117

[ CSSD]2007-07-13 01:22:12.501 [4052] >ERROR: clssnmvReadBlocks: read failed 1 at offset 533 of \\.\votedsk2
[ CSSD]2007-07-13 01:22:12.501 [4052] >TRACE: clssnmDiskStateChange: state from 4 to 3 disk (1/\\.\votedsk2)
[ CSSD]2007-07-13 01:22:12.501 [2200] >TRACE: clssnmDiskPMT: disk offline (1/\\.\votedsk2)
[ CSSD]2007-07-13 01:22:12.501 [2200] >ERROR: clssnmDiskPMT: Aborting, 1 of 2 voting disks unavailable
[ CSSD]2007-07-13 01:22:12.501 [2200] >ERROR: ###################################
[ CSSD]2007-07-13 01:22:12.501 [2200] >ERROR: clssscExit: CSSD aborting
[ CSSD]2007-07-13 01:22:12.501 [2200] >ERROR: ###################################
==============================================
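
The Clusterware logs report the same OS code in two different spellings: `oserror 1117` in crsd.log and `Dep: 1117` in ocssd.log. The sketch below, which assumes the line formats shown in the excerpts above, collects those codes and notes whether CSSD aborted. The function name is hypothetical, and reading the `Dep:` field as the OS error code is an assumption based only on these excerpts.

```python
import re

OSERROR_RE = re.compile(r"\boserror (\d+)")                # crsd.log spelling
DEP_RE = re.compile(r"^\s*Dep:\s*(\d+)", re.MULTILINE)     # ocssd.log spelling
ABORT_RE = re.compile(r"clssscExit: CSSD aborting")


def summarize_clusterware_log(text):
    """Return the OS error codes seen and whether CSSD reported an abort."""
    codes = {int(c) for c in OSERROR_RE.findall(text)}
    codes |= {int(c) for c in DEP_RE.findall(text)}
    return {"os_codes": sorted(codes), "cssd_aborted": bool(ABORT_RE.search(text))}


# Lines copied from the crsd.log excerpt above.
CRSD_SAMPLE = """\
2007-07-13 01:21:51.766: [ OCROSD][4272]utwrite:4: Problem writing the buffer phy offset 184320 and oserror 1117
2007-07-13 01:21:51.766: [ OCRRAW][4352]beginlog: problem 26 clearing the log metadata buffer
"""

# Lines copied from the ocssd.log excerpt above.
OCSSD_SAMPLE = """\
[ CSSD]2007-07-13 01:22:12.501 [4052] >ERROR: Internal Error Information:
Category: 1234
Operation: scls_block_write
Location: WriteFile
Other: unable to write block(s)
Dep: 1117
[ CSSD]2007-07-13 01:22:12.501 [2200] >ERROR: clssscExit: CSSD aborting
"""

if __name__ == "__main__":
    print(summarize_clusterware_log(CRSD_SAMPLE))
    print(summarize_clusterware_log(OCSSD_SAMPLE))
```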

  5. When you are running in an Oracle Clusterware environment, if the ocssd process encounters an I/O error when accessing the Voting Disk, the CSS daemon will evict the node from the cluster. This is done by signalling the Oracle Fence Driver (OraFencedrv.sys) to reboot the machine. When the fence driver reboots the machine, this will be seen as a bugcheck with stop code 0x0000ffff. You will be able to see this in the System Log with a message such as:

The computer has rebooted from a bugcheck.
The bugcheck was: 0x0000ffff (0x0000000000000000, 0x0000000000000000,
0x0000000000000000, 0x0000000000000000).
A dump was saved in: C:\WINDOWS\MEMORY.DMP.

Note that the bugcheck is expected behavior when ocssd.exe (the Cluster Synchronization Services daemon) encounters an I/O error when accessing the voting disk. The node experiencing the I/O error is intentionally rebooted to avoid a split-brain and possible data corruption when access to the voting disk is lost.

CHANGES

You may encounter this error after upgrading the Microsoft Storport driver to version 5.2.3790.4021 or later.

CAUSE

See Microsoft KB article 932755, available at the following URL:

http://support.microsoft.com/default.aspx?scid=kb;EN-US;932755

Per that article, one of the changes introduced in this version of the Storport driver is the following:

=========================================================
If a target returns a SCSI status of BUSY or Task Set Full, the port driver retries the command immediately. Storport retries the command an unlimited number of times. Therefore, if the busy status continues, the system could eventually experience problems.

This update configures the following behavior:

  • It limits the number of retries. The default is 20.
  • If the target returns a status of BUSY, the Storport driver performs a time-based pause before the Storport driver retries the command.
  • If the target returns a status of Task Set Full, the Storport driver performs an I/O completion-based pause before the Storport driver retries the command.
=========================================================

Therefore, prior to upgrading the Storport driver, if a storage path had become saturated, the Storport driver would immediately continue to retry – indefinitely. This would result in slow I/O and perhaps a hang or spin scenario, but no error would be returned.

With the later version of the Storport driver, the retries are limited to 20 retries by default, with a pause between each retry. After 20 failures with a device busy status, the (OS 1117) error is returned to applications waiting on I/O. For more information on changes to the Storport driver, you must contact Microsoft.
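
The difference between the two driver behaviors can be illustrated with a toy simulation. This is not Storport code: the function name `submit_io`, the `statuses` iterator standing in for a storage target, and the pause length are all assumptions made for illustration. Only the default limit of 20 and the resulting (OS 1117) error come from the text above. The sketch treats the limit as the number of BUSY responses tolerated, following "After 20 failures"; whether the first attempt counts toward the limit is not stated in the source.

```python
import itertools
import time

ERROR_IO_DEVICE = 1117  # the (OS 1117) code discussed in this article
BUSY, GOOD = "BUSY", "GOOD"


def submit_io(statuses, retry_limit=20, pause_seconds=0.0):
    """Send one command and retry while the simulated target reports BUSY.

    statuses: iterator yielding the target's status for each attempt.
    retry_limit: BUSY responses tolerated; None models unlimited retries.
    Returns (result_code, busy_count), where result_code 0 means success.
    """
    busy_count = 0
    for status in statuses:
        if status == GOOD:
            return 0, busy_count
        busy_count += 1
        if retry_limit is not None and busy_count >= retry_limit:
            # Limit reached: surface the error to the waiting application.
            return ERROR_IO_DEVICE, busy_count
        time.sleep(pause_seconds)  # pause before the next retry
    raise RuntimeError("status source ended before the command completed")


def saturated_target(busy_responses):
    """Simulated target: BUSY for busy_responses attempts, then GOOD."""
    return itertools.chain(itertools.repeat(BUSY, busy_responses), [GOOD])


if __name__ == "__main__":
    # Limited retries: a target that stays BUSY 50 times yields error 1117.
    print(submit_io(saturated_target(50)))
    # Unlimited retries: the same target eventually succeeds, only slowly.
    print(submit_io(saturated_target(50), retry_limit=None))
```

In this model a target that stays busy for 50 attempts eventually succeeds under unlimited retries, but returns 1117 once the limit of 20 is in force, which matches the change in symptoms described above: slow I/O before the update, an explicit error after it.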

SOLUTION

This is an I/O performance problem. You will need to increase the performance/capacity of the storage system to avoid the prolonged BUSY status. Specific solutions will vary, depending on your storage vendor, so the storage vendor may need to be contacted to assist with tuning the storage. One potential solution includes implementing multi-pathing technology to improve the throughput of the storage.

 
