Sun StorEdge[TM] 3000 Arrays: 3.2x and 4.x Firmware Differences in Handling Media Errors

Asset ID:	1-72-1006649.1
Update Date:	2016-04-14
Keywords:

Solution Type Problem Resolution Sure

Solution 1006649.1 : Sun StorEdge[TM] 3000 Arrays: 3.2x and 4.x Firmware Differences in Handling Media Errors

Applies to:

Sun Storage 3511 SATA Array - Version Not Applicable and later
Sun Storage 3310 Array - Version Not Applicable and later
Sun Storage 3320 SCSI Array - Version Not Applicable and later
Sun Storage 3510 FC Array - Version Not Applicable and later
All Platforms

Symptoms

The following event is seen in the array event log:

[1113]: StorEdge Array SN#xxxx CH2 ID10: SCSI Drive ALERT: bad block encountered (02h, 03h,11/00)

Below is an example of a "read error" seen on a server running Solaris that might be found in the "/var/adm/messages*" files. Similar messages for "write errors" may also be found.

Sep 11 20:37:25 server1 scsi: [ID 107833 kern.warning] WARNING: /pci@1e,600000/scsi@3/sd@0,1 (sd72):
Sep 11 20:37:25 server1   Error for Command: read(10)                Error Level: Retryable
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice]     Requested Block: 8144                      Error Block: -803274752
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice]     Vendor: SUN                                Serial Number:05AXXXXX-00
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice]     Sense Key: Media Error
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.notice]     ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
Sep 11 20:37:25 server1 scsi: [ID 107833 kern.warning] WARNING: /pci@1e,600000/scsi@3/sd@0,1 (sd72):

This is applicable for all the products under the Sun StorEdge[TM] 3000 family.
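
When many such messages accumulate, it can help to pull out only the entries whose sense key is "Media Error". The following Python sketch is one possible way to do that. The function name find_media_errors and the regular expressions are assumptions derived only from the sample excerpt above; other Solaris releases or drivers may format these lines differently.

```python
import re

# Patterns derived from the sample excerpt only (assumption, not a
# documented log format).
DEVICE_RE = re.compile(r"WARNING: (\S+) \((\w+)\):")
BLOCK_RE = re.compile(r"Requested Block: (-?\d+)\s+Error Block: (-?\d+)")
SENSE_RE = re.compile(r"Sense Key: (.+?)\s*$")
ASC_RE = re.compile(r"ASC: (0x[0-9a-fA-F]+).*?ASCQ: (0x[0-9a-fA-F]+)")


def find_media_errors(lines):
    """Return one dict per logged error whose sense key is 'Media Error'."""
    results = []
    current = None
    for line in lines:
        m = DEVICE_RE.search(line)
        if m:
            # A WARNING line names the device and starts a new record.
            current = {"path": m.group(1), "instance": m.group(2)}
            continue
        if current is None:
            continue
        m = BLOCK_RE.search(line)
        if m:
            current["requested_block"] = int(m.group(1))
            current["error_block"] = int(m.group(2))
            continue
        m = SENSE_RE.search(line)
        if m:
            current["sense_key"] = m.group(1)
            continue
        m = ASC_RE.search(line)
        if m:
            # The ASC/ASCQ line closes the record in the sample above.
            current["asc"] = int(m.group(1), 16)
            current["ascq"] = int(m.group(2), 16)
            if current.get("sense_key") == "Media Error":
                results.append(current)
            current = None
    return results
```

A typical use would be to pass the lines of a messages file to find_media_errors() and list the device instances and requested blocks that come back.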

Changes

As drive capacities grow to meet demand, the density of the stored data also increases, which raises the chance of encountering bad blocks. Array vendors each handle such bad blocks in their own way.

This document applies to redundant RAID implementations, primarily RAID 5, and describes how the firmware handles bad blocks on the drives.

Consider the following scenario:

On a RAID 5 Logical drive:

1. One drive fails.
2. This triggers the hot spare, and a rebuild starts.
3. The rebuild finds a media error on another member drive.

Cause

For StorEdge[TM] 3000 Arrays with 3.2x Firmware:

The first time a bad block is encountered on a member disk while a rebuild is in progress, the rebuild fails. If the serial/telnet menu is in use when this happens, the firmware prompts the user to continue the rebuild even though there is a bad block. If the answer is yes, the rebuild continues to completion, provided there are no other error exceptions. For the block that has the "unrecoverable media error", the firmware zeroes out the ECC of that block, writes a special pattern there, and then continues the rebuild until it completes.

For StorEdge[TM] 3000 Arrays with 4.x Firmware:

The firmware automatically continues the rebuild when a bad block is encountered on a member drive while a rebuild is in progress. In addition, on 4.x firmware this "specially marked bad sector of the individual disk" represents a "Logical Drive Bad Block", which is reported the next time the host tries to access that area of the Logical Drive.

If the host tries to read this block, the following event is recorded in the event log:

LG:2 NOTIFY:Logical Drive BAD Block Encountered 000000200.

Notice that no specific disk is mentioned, only the Logical Drive that contains that disk. To recover from this, the host has to issue a write to that area. If there is a filesystem on this logical drive, one option is to run fsck and see whether it resolves the problem. If there is no filesystem, the Logical Drive Bad Block can be located by reading the device with dd to /dev/null. After the file/block is located, take the appropriate recovery steps (e.g., recover from backup, re-write the data, etc.).
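
As an alternative to a plain dd run, which stops at the first error unless told otherwise, the same read-only scan can be sketched in Python so that every failing block number is collected. This is a minimal sketch, not a supported tool. The names find_unreadable_blocks and device_reader are hypothetical, the 512-byte block size is an assumption, and the caller must supply the number of blocks to scan.

```python
import os


def find_unreadable_blocks(read_at, total_blocks, block_size=512, start_block=0):
    """Read blocks one at a time and return the numbers of those that fail.

    read_at(offset, length) must return bytes, or raise OSError when the
    read fails (for example EIO on a medium error).
    """
    bad = []
    for block in range(start_block, start_block + total_blocks):
        try:
            data = read_at(block * block_size, block_size)
        except OSError:
            bad.append(block)  # record the failure and keep scanning
            continue
        if not data:           # end of device or file reached
            break
    return bad


def device_reader(path):
    """Open path read-only; return (fd, read_at) for find_unreadable_blocks."""
    fd = os.open(path, os.O_RDONLY)

    def read_at(offset, length):
        return os.pread(fd, length, offset)

    return fd, read_at
```

To use it, call device_reader() on the raw device path of the affected logical drive (the path is site-specific and not given here), pass the returned read_at function and the block count to find_unreadable_blocks(), and close the file descriptor with os.close() afterwards. The scan only reads, so it does not clear the bad block; the host still has to write to that area.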

Explanation of Controller Behavior:

For bad blocks encountered on a member drive while a rebuild is in progress, the controller erases the ECC bytes for that block so that any subsequent read results in an unrecoverable ECC error. The controller also writes a unique pattern in the block so that the firmware can identify it as a controller-generated bad block. Before this feature was implemented in 4.x, an unrecoverable media error on a surviving disk in an LD would result in a Rebuild Failure or require active intervention to allow the rebuild to continue past the bad block.
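
The difference between the two behaviors can be illustrated with a toy model of a RAID 5 rebuild. This is only a sketch under simplifying assumptions and is not the firmware's implementation: each drive is a Python list of equal-sized blocks, parity is a plain XOR, and an unreadable block is represented by None. The names xor_blocks, rebuild and continue_on_bad_block are hypothetical. Passing continue_on_bad_block=False loosely corresponds to a 3.2x rebuild that fails at the first bad block; passing True loosely corresponds to 4.x, or to answering yes at the 3.2x prompt.

```python
from functools import reduce

BAD = None  # stands in for a block that returns an unrecoverable media error


def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID 5 parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(survivors, continue_on_bad_block):
    """Rebuild the failed member from the surviving members, stripe by stripe.

    survivors: one list of blocks per surviving drive; BAD marks a media error.
    Returns (rebuilt_blocks, logical_bad_blocks). Raises RuntimeError when a
    bad block is hit and continue_on_bad_block is False.
    """
    rebuilt, logical_bad = [], []
    for stripe, blocks in enumerate(zip(*survivors)):
        if any(block is BAD for block in blocks):
            if not continue_on_bad_block:
                raise RuntimeError(f"rebuild failed at stripe {stripe}")
            # The missing data cannot be regenerated for this stripe, so the
            # block is marked instead and reported as a logical bad block.
            rebuilt.append(BAD)
            logical_bad.append(stripe)
            continue
        rebuilt.append(xor_blocks(blocks))
    return rebuilt, logical_bad
```

In this model, the stripes returned in logical_bad_blocks play the role of the "Logical Drive Bad Block" entries: the rebuild completes, but the data in those stripes is lost until the host rewrites it.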

Solution

Case study:

As an example, consider the following events which are taken from a customer case.

The customer is running 4.15F firmware on a StorEdge[TM] 3510, and the following messages are logged in the event log:

Wed Jul 5 14:26:13 2006
[Primary] Alert
LG:0 NOTIFY:Logical Drive BAD Block Encountered 0388FD300

...

Notice that no specific drive is reporting the error, so this should NOT be confused with a media error on a particular drive. It is a bad block on the LD, and the host should also get a read error when accessing this block. This can be checked by running format->analyze->read on this LD:

analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? y

       pass 0
Medium error during read: block 948949760 (0x388fd300) (948949760)
ASC: 0x11   ASCQ: 0x0

Medium error during read: block 948949760 (0x388fd300) (948949760)
ASC: 0x11 ASCQ: 0x0

Note that the block number reported by format->analyze->read is the same as the block number reported by the 3510 in the event log. To recover from this, find the file residing on this block and restore that file. If the application is a database, the DBA should be able to identify the table residing on this block, and only that table needs to be restored.
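
The event log gives the block number in hexadecimal, while format prints it in both decimal and hexadecimal. The conversion can be sketched in Python as follows. The function names are hypothetical, the regular expression is based only on the two event lines quoted in this document, and the 512-byte block size used for the byte offset is an assumption.

```python
import re

# Pattern based on the event lines quoted in this document (assumption).
EVENT_RE = re.compile(
    r"LG:(\d+) NOTIFY:Logical Drive BAD Block Encountered ([0-9A-Fa-f]+)"
)


def parse_bad_block_event(line):
    """Return (logical_drive, block_number) or None if the line does not match."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16)


def block_to_byte_offset(block, block_size=512):
    """Byte offset of a block, assuming block_size bytes per block."""
    return block * block_size
```

For the event above, parse_bad_block_event() yields logical drive 0 and block 948949760, which matches the block reported by format->analyze->read.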

Note: The host needs to write to this block in order to make this block reusable.

A drive can have latent disk errors that are detected only when the affected disk sector is accessed. These latent errors can be caught early if the drives are accessed continuously, which can be accomplished by enabling media-scan to scrub the disks continuously.

For NRAID or RAID 0, if a bad block is encountered, the LD is effectively dead and there is no way of recovering other than having the host issue a "write" to that block or restoring the file residing on that bad block.

Sense Key:0x03, Sense Code:0x11, rebuild, double, drive, failure, 3510, 3310, 3320, 3511, 4.11, 4.13, 4.15, 3.25, 3.27, 4.21, firmware, bad, block, media, scan, 4.15, parity, regenerate, RAID, disk
Previously Published As
85181

Change History
Date: 2010-11-11
User Name: sue.copeland@sun.com
Action: Currency & Update
Date: 2007-11-13
User Name: 7058
Action: Approved

Attachments

This solution has no attachment