Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 1006649.1: Sun StorEdge[TM] 3000 Arrays: 3.2x and 4.x Firmware Differences in Handling Media Errors
PreviouslyPublishedAs 209273

This document will clarify the behavior of the firmware in the event it encounters a bad block on a disk, which is also known as a media error or an "Unrecoverable Read Error".

Applies to:
Sun Storage 3511 SATA Array - Version Not Applicable and later
Sun Storage 3310 Array - Version Not Applicable and later
Sun Storage 3320 SCSI Array - Version Not Applicable and later
Sun Storage 3510 FC Array - Version Not Applicable and later
All Platforms

Symptoms
To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage 2000, 3000, 6000 RAID Arrays & JBODs Community.
The following event is seen in the array event log:
[1113]: StorEdge Array SN#xxxx CH2 ID10: SCSI Drive ALERT: bad block encountered (02h, 03h,11/00)
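As a reading aid for the event above, the following minimal sketch pulls the channel, target ID and the three codes out of the log line. It assumes, without confirmation from this document, that the codes are the SCSI status byte, the sense key and the ASC/ASCQ pair, in that order. The function name and the lookup tables are hypothetical and deliberately partial; the sense key and ASC/ASCQ meanings follow standard SCSI usage and agree with the Solaris messages in this document.

```python
import re

# ASSUMPTION: the triple "(02h, 03h,11/00)" is status byte, sense key, ASC/ASCQ.
# Only the codes that appear in this document are listed.
SCSI_STATUS = {0x02: "CHECK CONDITION"}
SENSE_KEYS = {0x03: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}

EVENT_RE = re.compile(
    r"CH(?P<channel>\d+) ID(?P<target>\d+): SCSI Drive ALERT: "
    r"bad block encountered \((?P<status>[0-9A-Fa-f]{2})h,\s*"
    r"(?P<key>[0-9A-Fa-f]{2})h,\s*"
    r"(?P<asc>[0-9A-Fa-f]{2})/(?P<ascq>[0-9A-Fa-f]{2})\)"
)


def decode_bad_block_event(line):
    """Return a dict describing a drive bad-block event, or None if no match."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    status = int(m.group("status"), 16)
    key = int(m.group("key"), 16)
    asc = int(m.group("asc"), 16)
    ascq = int(m.group("ascq"), 16)
    return {
        "channel": int(m.group("channel")),
        "target": int(m.group("target")),
        "status": SCSI_STATUS.get(status, "unknown"),
        "sense_key": SENSE_KEYS.get(key, "unknown"),
        "additional_sense": ASC_ASCQ.get((asc, ascq), "unknown"),
    }


EVENT = ("[1113]: StorEdge Array SN#xxxx CH2 ID10: SCSI Drive ALERT: "
         "bad block encountered (02h, 03h,11/00)")
print(decode_bad_block_event(EVENT))
```

Unlike the logical drive events discussed later, this event names a channel and target ID, so it points at one physical drive.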
Below is an example of a "read error" seen on a server running Solaris that might be found in the "/var/adm/messages*" files. Similar messages for "write errors" may also be found.
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.warning] WARNING: /pci@1e,600000/scsi@3/sd@0,1 (sd72):
Sep 11 20:37:25 server1 Error for Command: read(10) Error Level: Retryable
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice] Requested Block: 8144 Error Block: -803274752
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice] Vendor: SUN Serial Number:05AXXXXX-00
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice] Sense Key: Media Error
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.warning] WARNING: /pci@1e,600000/scsi@3/sd@0,1 (sd72):
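When many such reports accumulate in the messages files, it can help to extract the key fields programmatically. The sketch below is one possible approach, assuming the messages keep the one-field-group-per-line layout shown in the example. The function name, pattern table and field names are hypothetical and not part of any Solaris tool.

```python
import re

# Hypothetical field patterns, written against the example messages only.
PATTERNS = {
    "device_path": r"WARNING:\s+(\S+)\s+\(sd\d+\)",
    "instance": r"\((sd\d+)\)",
    "command": r"Error for Command:\s+(\S+)",
    "error_level": r"Error Level:\s+(\w+)",
    "requested_block": r"Requested Block:\s+(-?\d+)",
    "error_block": r"Error Block:\s+(-?\d+)",
    "sense_key": r"Sense Key:\s+([^\n]+)",
    "asc": r"ASC:\s+(0x[0-9a-fA-F]+)",
    "ascq": r"ASCQ:\s+(0x[0-9a-fA-F]+)",
}


def parse_sd_error(text):
    """Extract selected fields from one sd driver error report."""
    fields = {}
    for name, pattern in PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            fields[name] = m.group(1).strip()
    # Block numbers are logged in decimal, ASC/ASCQ in hexadecimal.
    for name in ("requested_block", "error_block"):
        if name in fields:
            fields[name] = int(fields[name])
    for name in ("asc", "ascq"):
        if name in fields:
            fields[name] = int(fields[name], 16)
    return fields


SAMPLE = """\
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.warning] WARNING: /pci@1e,600000/scsi@3/sd@0,1 (sd72):
Sep 11 20:37:25 server1 Error for Command: read(10) Error Level: Retryable
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice] Requested Block: 8144 Error Block: -803274752
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice] Sense Key: Media Error
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
"""

report = parse_sd_error(SAMPLE)
print(report["instance"], report["sense_key"], hex(report["asc"]))
```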
This is applicable for all the products under the Sun StorEdge[TM] 3000 family.
Changes
As drive capacity grows to meet increasing demand, the density of the stored data also increases, so there is a greater chance of encountering bad blocks, and array vendors handle them in their own ways. This document applies to redundant RAID implementations, primarily RAID 5, and describes how the firmware handles bad blocks on the drives.
Cause
For StorEdge[TM] 3000 Arrays with 3.2x Firmware:
LG:2 NOTIFY:Logical Drive BAD Block Encountered 000000200
Notice that no specific disk is mentioned, only the Logical Drive that contains that disk. To recover from this, the host has to issue a write to that area. If there is a file system on this logical drive, one option is to run fsck and see whether that works. If there is no file system, the Logical Drive bad block can be located with a dd to /dev/null. After the file or block is located, take the appropriate recovery steps (for example, recover from backup or re-write the data).
Explanation of Controller Behavior:
Solution
Case study:
As an example, consider the following events, which are taken from a customer case. The customer is running 4.15F firmware on a StorEdge[TM] 3510 and the following messages are logged in the event logs:
Wed Jul 5 14:26:13 2006
Wed Jul 5 14:26:13 2006
...
Notice that no specific drive is reporting the error, so this should NOT be confused with a media error on a particular drive. It is a bad block on the LD, and the host should also get a read error when accessing this block. We can check this by running format->analyze->read on this LD, which shows:
analyze> read
pass 0
Medium error during read: block 948949760 (0x388fd300) (948949760)
ASC: 0x11 ASCQ: 0x0
Medium error during read: block 948949760 (0x388fd300) (948949760)
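To compare block numbers from the analyze output with other logs, or to turn them into byte offsets, a small helper such as the following sketch may be useful. It assumes 512-byte blocks, which this document does not state, and the function name is hypothetical. It also cross-checks the decimal and hexadecimal forms printed on each line.

```python
import re

MEDIUM_ERROR_RE = re.compile(
    r"Medium error during read: block (\d+) \((0x[0-9a-fA-F]+)\)"
)


def medium_error_blocks(output, block_size=512):
    """Map each failing block number to its byte offset on the logical drive.

    block_size=512 is an ASSUMPTION; adjust it to the real sector size.
    """
    offsets = {}
    for m in MEDIUM_ERROR_RE.finditer(output):
        block = int(m.group(1))
        # The decimal and hexadecimal forms on the line should agree.
        if block != int(m.group(2), 16):
            raise ValueError("inconsistent block number: " + m.group(0))
        offsets[block] = block * block_size
    return offsets


SAMPLE_OUTPUT = """\
analyze> read
pass 0
Medium error during read: block 948949760 (0x388fd300) (948949760)
ASC: 0x11 ASCQ: 0x0
Medium error during read: block 948949760 (0x388fd300) (948949760)
"""

for block, offset in medium_error_blocks(SAMPLE_OUTPUT).items():
    print(block, hex(block), offset)
```

Repeated reports of the same block collapse into a single entry, so the result lists each bad block once.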
Please note that the block number reported by the format->analyze->read is the same as the block number reported by the 3510 in the event log. To recover from this, we need to find the file residing on this block and restore that file. If the application is a database, the DBA should be able to tell us the table residing on this block and we just need to restore that table. In short,
Note: The host needs to write to this block in order to make this block reusable.
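Where there is no file system on the logical drive, the sequential read used to locate a bad block (the dd to /dev/null approach mentioned under Cause) can be sketched as below. This is a read-only illustration under stated assumptions: 512-byte blocks, a device whose size can be found by seeking to its end (which may not hold for every raw device), and hypothetical function names. It does not perform the write that the note above requires.

```python
import os


def scan_blocks(read_at, total_blocks, block_size=512, blocks_per_read=128):
    """Return the block numbers for which read_at raises OSError.

    read_at(offset, length) must return bytes or raise OSError.
    """
    bad = []
    block = 0
    while block < total_blocks:
        count = min(blocks_per_read, total_blocks - block)
        try:
            read_at(block * block_size, count * block_size)
        except OSError:
            # Re-read the failed chunk one block at a time to isolate bad blocks.
            for b in range(block, block + count):
                try:
                    read_at(b * block_size, block_size)
                except OSError:
                    bad.append(b)
        block += count
    return bad


def scan_path(path, block_size=512, blocks_per_read=128):
    """Scan a file or device read-only and return its unreadable blocks."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # ASSUMPTION: seeking to the end reports the size of the target.
        size = os.lseek(fd, 0, os.SEEK_END)
        total_blocks = size // block_size
        return scan_blocks(
            lambda offset, length: os.pread(fd, length, offset),
            total_blocks, block_size, blocks_per_read,
        )
    finally:
        os.close(fd)
```

Reading in large chunks and falling back to single blocks only on failure keeps the scan fast on healthy regions while still reporting exact block numbers.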
Typically, a drive has latent disk errors that can only be detected when the affected disk sector is accessed. These latent disk errors can be avoided if the drives are accessed continuously, which can be accomplished by enabling media-scan to scrub the disks continuously.
For NRAID or RAID 0, if a bad block is encountered, the LD is effectively dead and there is no way of recovering other than having the host issue a "write" to that block, or restoring the file that sits on that bad block.
Attachments
This solution has no attachment