
Asset ID: 1-72-1004737.1
Update Date: 2016-03-01

Solution Type: Problem Resolution Sure Solution

Solution 1004737.1: Sun StorEdge[TM] A1000/A3000/A3500/A3500FC array: Array may crash after 828 days of uptime resulting in possible data loss.


Related Items
  • Sun Storage A3000 Array
  • Sun Storage A3500 FC Array
  • Sun Storage A3500 SCSI Array
  • Sun Storage A1000 Array
Related Categories
  • PLA-Support>Sun Systems>DISK>Arrays>SN-DK: A1000_A3xxx

Previously Published As: 206579


Applies to:

Sun Storage A3000 Array - Version All Versions and later
Sun Storage A3500 SCSI Array - Version All Versions and later
Sun Storage A1000 Array - Version All Versions and later
Sun Storage A3500 FC Array - Version Not Applicable and later
All Platforms

Symptoms

Sun StorEdge[TM] A1000/A3000/A3500/A3500FC arrays running Raid Manager controller firmware version 03.01.04.75 or earlier are at risk of losing data, or at least temporarily losing access to data, if the array has been running for 828 days without being reset or restarted.

After the array has been running continuously for more than two years, some LUNs suddenly disappear from the host because multiple disks are marked as failed. This can look like a hardware fault and lead to replacement of the array controller or drives, which does not solve the problem.

No warning errors are logged beforehand. Without notice, some LUNs disappear and the host applications using them fail with SCSI reset/transport errors.
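
As an illustration only (these commands are not part of the original article), one way to confirm from the host that LUNs have disappeared is to compare what Raid Manager and Solaris still see. This assumes the standard RM6 utilities under /usr/lib/osa/bin are installed; the controller name c5t5d0 is a placeholder:

# /usr/lib/osa/bin/lad                 (list the array controllers and LUNs that RM6 can still see)
# /usr/lib/osa/bin/healthck -a         (report the health of all attached arrays)
# format                               (check which LUN devices remain visible to Solaris)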

 

Changes

There is no change that precedes this problem, other than the array controller having been running continuously, without a restart, for 828 days.

 

Cause

The problem was found to be in the handling of the internal clock tick counter maintained by the array controller. The counter overflows after the controller has been running continuously for approximately 828 days and 12 hours. If the overflow happens while a write to a LUN is in progress, the array controller fails all the drives in that LUN.
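
For reference (this calculation is not part of the original article), the 828-day figure is consistent with a 32-bit tick counter wrapping at the VxWorks default clock rate of 60 ticks per second; the 60 ticks-per-second rate is an assumption based on the vxAbsTicks counter mentioned below. The arithmetic can be checked with bc:

# echo 'scale=1; 2^32 / 60 / 86400' | bc
828.5

which matches the approximately 828 days and 12 hours quoted above.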

CAUTION: If this problem occurs, do not reset or power-cycle the array, and do not reboot the attached host (for SCSI-attached arrays), as doing so will result in data loss. Once the drives have been failed, resetting or power-cycling the array (or, for SCSI-attached arrays, rebooting the host) loses the data on the LUNs using those drives, and the only way to recover is to rebuild the whole array configuration and restore the data from backup.

 

Solution

This issue has been identified and fixed in array controller firmware 03.01.04.81, which is delivered with Raid Manager 6.22.1 (RM 6.22.1) in the following patches:

  • <Patch 112125-08> or higher for hosts running Solaris 2.6 and Solaris 7
  • <Patch 112126-08> or higher for hosts running Solaris 8 and Solaris 9 OS


There are no patches containing this fix for earlier versions of Raid Manager, so customers running an earlier version must upgrade to RM 6.22.1 and then apply one of the patches above to obtain the fix.
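
As a hedged illustration (not part of the original article), on a Solaris 8/9 host you can confirm that the patch is installed with showrev, and check the controller firmware revision with the RM6 raidutil command; the exact raidutil output format varies by release, and the controller name c5t5d0 is a placeholder:

# showrev -p | grep 112126                     (confirm patch 112126-08 or higher is installed)
# /usr/lib/osa/bin/raidutil -c c5t5d0 -i       (display controller inquiry data, including the firmware revision)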

For SCSI-attached arrays (A1000/A3000/A3500), the array controller is reset when the attached host reboots. Rebooting the attached host at least once every two years (before the array reaches 828 days of uptime) will therefore prevent this issue on those arrays.

However, this workaround does not help on the StorEdge[TM] A3500FC array, because rebooting the attached host does not reset that type of array.

If an affected array has not been reset for more than 828 days and suddenly reports failed drives, and you are confident the failures are caused by this issue (because you know the array has not been reset for 828 days), DO NOT REBOOT the array or the attached server. Doing so WILL result in data loss. Instead, use the Raid Manager command drivutil -u to unfail all of the drives in the affected LUN, as sketched below. If the LUN is still not optimal afterwards, contact your support provider for further assistance.
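
The following is a minimal sketch of that recovery step; the controller name c5t5d0 and the drive address 1,0 (channel,SCSI id) are placeholders, and the argument order shown is an assumption about the RM 6.22.1 drivutil syntax, so verify it against the drivutil man page before touching a degraded LUN:

# /usr/lib/osa/bin/drivutil -u 1,0 c5t5d0      (unfail the drive at channel 1, id 0; repeat for each failed drive in the LUN)

The addresses of the failed drives can be read from the Raid Manager GUI or the drivutil drive listing; after unfailing all of them, check that the LUN has returned to an optimal state before allowing application I/O to resume.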

This problem may be less likely to be seen on dual-controller arrays such as the Sun StorEdge A3000/A3500/A3500FC, which are more likely to have had a controller reset at some point (for example, following a controller problem) and are therefore less likely to reach 828 days of continuous controller uptime. However, it has been seen on multiple single-controller Sun StorEdge A1000 arrays at a customer site.

Please see Bug ID 4874507 and escalations 545577 and 546371 for details.

Some common questions about this problem:

1. How do you determine whether your controller has been up for 828 days?

You can use the "vxAbsTicks" command on the controller's serial port, or run the "/usr/lib/osa/bin/perfutil" command on the host:

# perfutil -c cXtXdX

Then run the "drive_stats_u1.pl" script against the saved output of "perfutil".

For example,

# ./drive_stats_u1.pl /net/sslab09/var/tmp/tfuku/perfutil-c_c5t5d0.out

drive_stats.pl version 1.1

Controller = c5t5d0    Host Time/Date: 10:25:45 08/07/2003
min of runtime = 26.2716666666667   <-- uptime
total_recovered_errors = 0
total_unrecovered_errors = 0
total_request_time_outs = 0
total_retried_requests = 24
total_drive_bus_resets = 0
#

The uptime shown is 26.2716666666667, in minutes for this example. The script converts the value from minutes to hours when it exceeds 60 minutes, and likewise from hours to days when it exceeds 24 hours.
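
If you instead read the raw "vxAbsTicks" counter via the serial port, it can be converted to days by hand. This conversion assumes the 60 ticks-per-second clock rate discussed in the Cause section above, and the tick value below is only an example:

# echo 'scale=1; 4200000000 / 60 / 86400' | bc     (replace 4200000000 with the vxAbsTicks value read from the controller)
810.1

A result approaching 828.5 days means the array is close to the overflow point described in the Cause section.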

2. Where can the drive_stats.pl Perl script be obtained?

The script can be downloaded from the URL below:

http://cpre-emea.uk/tools/sonoma-info/sonoma_info.html

Alternatively, please see Technical Instruction <Document 1010352.1> : "Sun StorEdge[TM] Axx00:Tech Tip:Finding a Failed Disk that RM6 Reports as Optimal" which also has an alternative link to the script.

The full text of the recovery actions from LSI, for an array that has hit this bug and failed several drives, is reproduced here. Some of the suggested actions (e.g. using "vdShow" via the serial port) are not customer actions and have therefore been removed from the customer-viewable section of this article.

"To recover, DO NOT REBOOT the controller or the server, doing so WILL result in data loss. Use drivutil -u (lower case) to unfail all of the drives in the lun, and verify the lun configuration with vdShow."


Attachments
This solution has no attachment