Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 1004737.1: Sun StorEdge[TM] A1000/A3000/A3500/A3500FC array: Array may crash after 828 days of uptime resulting in possible data loss.
Previously Published As: 206579

Applies to:
Sun Storage A3000 Array - Version All Versions and later
Sun Storage A3500 SCSI Array - Version All Versions and later
Sun Storage A1000 Array - Version All Versions and later
Sun Storage A3500 FC Array - Version Not Applicable and later
All Platforms

Symptoms
Sun[TM] StorEdge A1000/A3000/A3500/A3500FC arrays running Raid Manager firmware version 03.01.04.75 or earlier risk losing data, or at least temporarily losing access to data, if the array has been running for 828 days without being reset or restarted. After the array has been running continuously for more than 2 years, some LUNs suddenly disappear from the host because multiple disks are marked as failed. This may appear to be a hardware problem and prompt replacement of the array controller or drives, which does not solve the problem.
Changes
No change precedes this problem, other than the array controller having run continuously for 828 days without being restarted.
Cause
The problem was found to be in the handling of the internal clock tick counter, which the array controller maintains. The counter overflows after the controller has been running continuously for approximately 828 days and 12 hours. If the overflow happens while a write to a LUN is in progress, the array controller fails all the drives in that LUN.

CAUTION: If this problem occurs, it is important not to reset or power-cycle the array, or reboot the attached host (for SCSI-attached arrays), as this will result in data loss. Once the array has been reset or power-cycled (or, for SCSI-attached arrays, the host has been rebooted), the data on the LUNs using those drives is lost, and the only way to recover is to reset the whole array configuration and restore the data to the array from backup.
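The 828 days and 12 hours figure can be checked with simple arithmetic. The sketch below assumes a 32-bit unsigned tick counter incremented 60 times per second. Neither the counter width nor the tick rate is stated in this article; they are chosen here only because they reproduce the quoted overflow time. The function names are illustrative.

```python
from typing import Tuple

# Assumptions (not stated in the article): 32-bit unsigned counter, 60 ticks/second.
COUNTER_BITS = 32
TICKS_PER_SECOND = 60

SECONDS_PER_DAY = 86_400


def overflow_period_seconds(bits: int = COUNTER_BITS,
                            rate_hz: int = TICKS_PER_SECOND) -> float:
    """Seconds until a counter of the given width wraps at the given tick rate."""
    return (2 ** bits) / rate_hz


def split_days_hours(seconds: float) -> Tuple[int, float]:
    """Split a duration in seconds into whole days and remaining hours."""
    days, remainder = divmod(seconds, SECONDS_PER_DAY)
    return int(days), remainder / 3600


def days_until_overflow(uptime_days: float) -> float:
    """Days left before the counter wraps; negative if the wrap point has passed."""
    return overflow_period_seconds() / SECONDS_PER_DAY - uptime_days


if __name__ == "__main__":
    days, hours = split_days_hours(overflow_period_seconds())
    print(f"counter wraps after {days} days and {hours:.1f} hours")
    print(f"days left at 800 days of uptime: {days_until_overflow(800):.1f}")
```

Under these assumptions the wrap point falls a little past 828 days and 12 hours, which matches the figure given above. `days_until_overflow` can be used to judge how soon a planned controller restart or firmware upgrade is needed.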
Solution
This issue has been identified and fixed in array controller firmware 03.01.04.81 in Raid Manager 6.22.1 (RM 6.22.1).
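As a rough aid for triage, the two firmware levels named in this article can be compared against a controller's reported firmware version. The sketch below assumes that versions order as dotted integer fields, which the article does not confirm. It covers only the comparison, not how the version is obtained from the array. Versions between 03.01.04.75 and 03.01.04.81 are reported as "unknown" because the article does not say whether they are affected. All names are illustrative.

```python
from typing import Tuple

LAST_KNOWN_AFFECTED = "03.01.04.75"  # "or earlier", per the Symptoms section
FIRST_FIXED = "03.01.04.81"          # per the Solution section


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '03.01.04.75' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def classify_firmware(version: str) -> str:
    """Return 'affected', 'fixed', or 'unknown' for a firmware version string."""
    parsed = parse_version(version)
    if parsed <= parse_version(LAST_KNOWN_AFFECTED):
        return "affected"
    if parsed >= parse_version(FIRST_FIXED):
        return "fixed"
    # The article gives no information about versions between the two levels.
    return "unknown"


if __name__ == "__main__":
    for fw in ("03.01.02.35", "03.01.04.75", "03.01.04.78", "03.01.04.81"):
        print(fw, "->", classify_firmware(fw))
```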
This problem may be less likely to be seen on dual-controller arrays such as the Sun StorEdge A3000/A3500/A3500FC, which could have suffered from controller problem(s) and so would be less likely to have run for so many days without a controller reset. However, it was seen on multiple single-controller Sun StorEdge A1000 arrays at a customer site. Please see Bug ID 4874507 and escalations# 545577 and 546371 for details.

Some queries on this problem:

1. How do you determine that your controller has been up for 828 days?

You can use the serial port command "vxAbsTicks", or, from the command line, the "/usr/lib/osa/bin/perfutil" command.

    # perfutil -c cXtXdX

On the output of "perfutil", run "drive_stats_u1.pl". For example:

    # ./drive_stats_u1.pl /net/sslab09/var/tmp/tfuku/perfutil-c_c5t5d0.out
    drive_stats.pl version 1.1
    Controller = c5t5d0
    Host Time/Date: 10:25:45 08/07/2003
    min of runtime = 26.2716666666667 <-- uptime
    total_recovered_errors = 0
    total_unrecovered_errors = 0
    total_request_time_outs = 0
    total_retried_requests = 24
    total_drive_bus_resets = 0
    #

Uptime is 26.2716666666667. The time shown is in minutes in this example. The script converts the ticks from min to hr when they exceed 60 min, and likewise from hr to days when they exceed 24 hr.

2. Where do we get the drive_stats.pl perl script from?

The script can be downloaded from the URL below:
http://cpre-emea.uk/tools/sonoma-info/sonoma_info.html

Alternatively, please see Technical Instruction <Document 1010352.1>: "Sun StorEdge[TM] Axx00:Tech Tip:Finding a Failed Disk that RM6 Reports as Optimal", which also has an alternative link to the script.

The full text of the recovery actions from LSI for an array that has hit the bug and failed several drives is reproduced here. Some of their suggested actions (e.g. using "vdShow" via the serial port) are not customer actions and hence have been removed from the customer-viewable section of the article.
"To recover, DO NOT REBOOT the controller or the server, doing so WILL result in data loss. Use drivutil -u (lower case) to unfail all of the drives in the lun, and verify the lun configuration with vdShow.
Attachments
This solution has no attachment