Critical Faults for Unreadable Sectors on a Sun Storage 2500, 2500-M2, 6000 or Flexline RAID Array

Asset ID:	1-72-1019826.1
Update Date:	2017-05-05
Keywords:

Solution Type Problem Resolution Sure

Solution 1019826.1 : Critical Faults for Unreadable Sectors on a Sun Storage 2500, 2500-M2, 6000 or Flexline RAID Array

Applies to:

Sun Storage Common Array Manager (CAM) - Version 5.0 and later
Sun Storage Flexline 240 Array - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 2530-M2 Array - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage Flexline 280 Array - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 6140 Array - Version Not Applicable to Not Applicable [Release N/A]
All Platforms

Symptoms

A critical fault is generated similar to the following:

Alarm ID   : alarm1
Description: The unreadable sectors database is full. Sector count is 1000
Severity   : Critical
Element    :
GridCode   : 57.66.1074
Date       : 2008-12-03 12:33:53

Alarm ID   : alarm2
Description: Unreadable sectors exist. Current count is 1024
Severity   : Critical
Element    :
GridCode   : 57.66.1075
Date       : 2008-12-03 12:33:55

Cause

The term "unreadable sector" refers to a volume logical block address (LBA) that has been rendered completely unreadable by a disk media-related double fault on a redundant volume, or by a disk media-related single fault on a non-redundant volume (RAID 0). Any user data contained in an unreadable sector is unrecoverable and should be considered lost. (These faults are more commonly described as "a two-disk failure in a RAID 5" or "a read error on a RAID 0".)

Once an unreadable sector is detected and an entry for it is placed in the database, every subsequent read of that sector returns a media error. The entries in the database persist until the affected sectors are written by a host or internal write command, or are explicitly cleared by a user action.
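
The behaviour described above can be illustrated with a small toy model. This is only a sketch of the stated rules (reads of listed sectors fail, writes or an explicit clear remove entries); the class and method names are hypothetical and do not correspond to any controller firmware interface.

```python
class MediaError(Exception):
    """Raised when a read hits a sector listed as unreadable."""


class UnreadableSectorModel:
    """Toy model of the rules described above; not controller code."""

    def __init__(self):
        # Volume LBAs that currently have an entry in the database.
        self._unreadable = set()

    def mark_unreadable(self, lba):
        # An unreadable sector was detected: add an entry for it.
        self._unreadable.add(lba)

    def read(self, lba):
        # Reads of a listed sector always return a media error.
        if lba in self._unreadable:
            raise MediaError(f"media error at LBA {lba}")
        return "data"

    def write(self, lba):
        # A host or internal write removes the entry for that sector.
        self._unreadable.discard(lba)

    def clear(self):
        # Explicit user action: remove every entry.
        self._unreadable.clear()

    def count(self):
        return len(self._unreadable)
```

In this model, a sector stays unreadable no matter how often it is read; only `write` or `clear` changes the entry count, which mirrors why a restore (Step 3) or an explicit clear (Step 4) is needed to remove the fault.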

Solution

The unreadable sector database records the logical block addresses (LBAs) that have become unreadable on a given volume. It can hold only about 1024 entries in total.

1. Verify the Critical Fault.

Refer to <Document 1021057.1> Sun Storage Common Array Manager (CAM): How to Verify Critical Faults for Sun Storage 2500, 2500-M2, 6000 and J4000 Arrays.

If you have a 66.1074 or 66.1075 critical fault, continue to Step 2.
If you do not have one of these faults, then the array has not detected any data loss, and no further work is required.
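
As an illustration of the check in this step, the following sketch scans alarm text in the format shown under Symptoms and returns the GridCode values that end in 66.1074 or 66.1075. It assumes the alarm text has already been saved to a string; the function name is hypothetical, and treating the leading "57." as a prefix that may vary is an assumption.

```python
import re

# Fault codes named in Step 1. The sample alarms show them with a leading
# "57." prefix, so the match is done on the trailing part (an assumption).
UNREADABLE_SECTOR_CODES = ("66.1074", "66.1075")


def unreadable_sector_alarms(alarm_text):
    """Return the GridCode values in alarm_text that match Step 1."""
    codes = re.findall(r"^\s*GridCode\s*:\s*(\S+)", alarm_text, flags=re.MULTILINE)
    matches = []
    for code in codes:
        for wanted in UNREADABLE_SECTOR_CODES:
            # Accept the bare code or the code preceded by a dotted prefix.
            if code == wanted or code.endswith("." + wanted):
                matches.append(code)
                break
    return matches


sample = """Alarm ID   : alarm1
Description: The unreadable sectors database is full. Sector count is 1000
Severity   : Critical
Element    :
GridCode   : 57.66.1074
Date       : 2008-12-03 12:33:53
"""

if __name__ == "__main__":
    print(unreadable_sector_alarms(sample))
```

An empty result corresponds to the case where neither fault is present and no further work is required.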

2. Identify the Volume(s) that have the unreadable sectors.

This information is only available in a file contained in a supportdata collection. Collect the supportdata and unzip it.

Refer to <Document 1014074.1> Collecting Support Data for Arrays Using Sun StorageTek[TM] SANtricity Storage Manager.
Refer to <Document 1002514.1> Collecting Sun StorageTek[TM] Common Array Manager Array Support Data.

This first example comes from a supportdata created with SANtricity. The unzipped file is called unreadableSectors.txt. It shows a PHYSICAL error on the disk drive in tray 85, slot 5.

Volume     Date/Time                       Volume LBA    Tray,Slot    Drive LBA    Failure Type
-------    ----------------------------    ----------    ---------    ---------    ------------
ora-vol    Sun Mar 13 02:59:57 GMT 2011    276637252     85,5         276637252    PHYSICAL

This second example comes from a supportdata created by CAM. The unzipped file is called badBlocksData.txt. This excerpt shows a LOGICAL error against the disk in tray 85, slot 5 and a PHYSICAL error against the disk in tray 85, slot 12.

Volume                Date/Time                       Volume LBA    Tray,Slot    Drive LBA    Failure Type
------------------    ----------------------------    ----------    ---------    ---------    ------------
NTINV10CLUST2-vol2    Sat Sep 17 12:11:16 BST 2016    347605573     85,5         173803077    LOGICAL
NTINV10CLUST2-vol2    Sat Sep 17 12:11:15 BST 2016    347606597     85,12        173803077    PHYSICAL

A PHYSICAL error is reported by the disk drive itself.
A LOGICAL error is discovered and reported by the controller against the identified disk during rebuild/reconstruction of a volume. Refer to <Document 1021055.1> Troubleshooting Sun Storage[TM] 2500 and 6000 RAID Array Disk Failures to identify whether any disk needs to be replaced.
A disk identified with a PHYSICAL error needs to be replaced, and data integrity checks (including a potential restore) need to be performed.
A LOGICAL error does not indicate that the identified disk drive has a hardware error or needs to be replaced. Further investigation of the major event log (MEL) is needed to see which disks were involved at the time the error occurred.
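
To sort the entries by failure type, the file can be parsed with a short script. The following is a minimal sketch that assumes the column layout shown in the two examples above and that volume names contain no whitespace; the function names are hypothetical and are not part of CAM or SANtricity.

```python
from collections import namedtuple

Entry = namedtuple(
    "Entry", "volume date volume_lba tray slot drive_lba failure_type"
)


def parse_unreadable_sectors(text):
    """Parse the rows of unreadableSectors.txt or badBlocksData.txt."""
    entries = []
    for line in text.splitlines():
        # The last four columns never contain spaces, so split from the right.
        parts = line.rsplit(None, 4)
        if len(parts) != 5 or parts[4] not in ("PHYSICAL", "LOGICAL"):
            continue  # header, separator, or blank line
        head, volume_lba, tray_slot, drive_lba, failure_type = parts
        head_parts = head.split(None, 1)
        if len(head_parts) != 2:
            continue  # no date column, not a data row
        volume, date = head_parts
        tray, slot = tray_slot.split(",")
        entries.append(
            Entry(volume, date.strip(), int(volume_lba), int(tray),
                  int(slot), int(drive_lba), failure_type)
        )
    return entries


def disks_by_failure_type(entries):
    """Group the (tray, slot) pairs by PHYSICAL or LOGICAL failure type."""
    result = {"PHYSICAL": set(), "LOGICAL": set()}
    for entry in entries:
        result[entry.failure_type].add((entry.tray, entry.slot))
    return result


sample = """\
Volume                Date/Time                       Volume LBA    Tray,Slot    Drive LBA    Failure Type
------------------    ----------------------------    ----------    ---------    ---------    ------------
NTINV10CLUST2-vol2    Sat Sep 17 12:11:16 BST 2016    347605573     85,5         173803077    LOGICAL
NTINV10CLUST2-vol2    Sat Sep 17 12:11:15 BST 2016    347606597     85,12        173803077    PHYSICAL
"""

if __name__ == "__main__":
    print(disks_by_failure_type(parse_unreadable_sectors(sample)))
```

Disks listed under PHYSICAL are the replacement candidates; disks listed only under LOGICAL still require the MEL investigation described above.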

If the badBlocksData.txt or unreadableSectors.txt file does not have any entries, collect a supportdata and contact Oracle Support for further instruction. Otherwise, proceed to Step 3.

There are certain circumstances where the badBlocksData.txt or unreadableSectors.txt files will be empty, but the alarm still exists. When this is the case, use the following shell procedure to clear the alarm from the array.

Important: The instructions in this document must only be used by an Oracle support engineer who has received the required NetApp advanced training to access the shell. If you are not one of these engineers, you are not authorized to use these commands without guidance from one of them. In that case, please open a collaboration SR with a TSC L2 engineer.

This first command reports all alarms on the array. It is possible that only one controller is reporting the alarm; subsequent repairs should be run from that controller.

-> getRecoveryFailureList_MT

See if there are stale entries in the unreadable sectors database.

06.xx.xx.xx firmware

-> vdAll usmShowUnreadableSectorTable

07.xx.xx.xx firmware

-> readUnreadableSectorDatabase_MT

Clear the entries from the database.

-> clearUnreadableSectors_MT

If the problem persists, fix any alarms for Volumes Not On Preferred Path and repeat the procedure. You may also need to reboot the controller reporting the alarm.

3. Recover and Restore the volume(s) that have the LBA errors

The data in these volumes is considered corrupt and should be restored. Whether any application has been impacted is unknown; however, the data needs to be restored to recover from the fault. Once the affected blocks have been written again, the entries are removed from the list and the fault clears.

If there are other faults for drive failures on the same volumes, handle these before restoring the data. Any subsequent writes will cause drives that may be marginal to fail, so that they can be replaced.

Once a restore has been completed for all volumes affected, continue to Step 4.

In theory, the unreadable sectors may lie in a part of the RAID volume that holds no data, in which case the data may actually be intact. Where a restore is not a viable option, a server-side check of the data may be an alternative. For example, a successful Solaris fsck of a UFS or VxFS file system, or a successful zpool scrub of a ZFS pool, may be sufficient proof that the data is valid. If so, a restore is not required.

4. Clear the unreadable sectors list if it has not been cleared by Step 3

Clearing the alarm with the CAM GUI

Start Common Array Manager and select the array.
Launch the Service Advisor utility, found in the upper right-hand corner.
Open up -> Recovering from Unreadable Sectors
Select -> The volume
Follow the instructions, including the item to -> Clear Unreadable Sector Database.
Exit Service Advisor.

Clearing the alarm with the CLI

Location for sscs:

Solaris: /opt/SUNWstkcam/bin/
Linux: /opt/sun/cam/bin/
Windows: C:\Program Files\Sun\Common Array Manager\bin

Location for service:

Solaris: /opt/SUNWsefms/bin/
Linux: /opt/sun/cam/private/fms/bin/
Windows: C:\Program Files\Sun\Common Array Manager\Component\fms\bin\

Get the array name.

Example:

# sscs list array
Array: myarray
#

Clear the unreadable sectors list.

service -d <arrayname> -c reset -q usm -t <volume name>

Example:

# service -d myarray -c reset -q usm -t myvol
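
When the command has to be issued for several volumes, the argument list can be assembled programmatically. The following sketch only builds the command line from the directories and options given in this document; it does not run anything. The function name and the platform keys are hypothetical, and the executable name on Windows may carry an extension that this sketch does not guess.

```python
# Directories for the "service" command, as listed in this document.
SERVICE_DIRS = {
    "solaris": "/opt/SUNWsefms/bin/",
    "linux": "/opt/sun/cam/private/fms/bin/",
    "windows": "C:\\Program Files\\Sun\\Common Array Manager\\Component\\fms\\bin\\",
}


def build_clear_usm_command(platform, array_name, volume_name):
    """Return the argument list for clearing a volume's unreadable sectors."""
    if platform not in SERVICE_DIRS:
        raise ValueError(f"unknown platform: {platform!r}")
    if not array_name or not volume_name:
        raise ValueError("array name and volume name are required")
    return [
        SERVICE_DIRS[platform] + "service",
        "-d", array_name,   # array name, as reported by "sscs list array"
        "-c", "reset",
        "-q", "usm",
        "-t", volume_name,  # volume that has the unreadable sectors
    ]


if __name__ == "__main__":
    print(" ".join(build_clear_usm_command("solaris", "myarray", "myvol")))
```

Returning a list rather than a single string keeps array and volume names with unusual characters intact if the list is later passed to a process launcher on the management host.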

Clearing the alarm with SANtricity

Open the Array Management Window.
Open the Advanced menu.
Open the Recovery submenu.
Select "Unreadable Sectors".
Select all and clear them.

If the issue persists, collect a supportdata and contact Oracle Support:

Refer to <Document 1014074.1> Collecting Support Data for Arrays Using Sun StorageTek[TM] SANtricity Storage Manager.
Refer to <Document 1002514.1> Collecting Sun StorageTek[TM] Common Array Manager Array Support Data.


References

<NOTE:1021057.1> - Sun Storage Common Array Manager (CAM): How to Verify Critical Faults for Sun Storage 2500, 2500-M2, 6000 and J4000 Arrays
<NOTE:1021055.1> - Troubleshooting Sun Storage[TM] 2500 and 6000 RAID Array Disk Failures
<NOTE:1014074.1> - Collecting Support Data for Arrays Using Sun StorageTek SANtricity Storage Manager

Attachments

This solution has no attachment