Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1314607.1: Troubleshooting Controller Unrecoverable Errors in Sun Storage SE3000 Arrays
Applies to:
Sun Storage 3310 Array - Version Not Applicable and later
Sun Storage 3510 FC Array - Version Not Applicable and later
Sun Storage 3511 SATA Array - Version Not Applicable and later
Sun Storage 3320 SCSI Array - Version Not Applicable and later
Information in this document applies to any platform.

Purpose
Controller Unrecoverable Error codes may be reported in the event log of SE3000 arrays, starting with firmware version 4.21.

Wed June 4 13:12:22 2012 [Secondary] Alert ALERT: Controller Unrecoverable Error 0001 00000000 00000000 45754677
Troubleshooting Steps

1. Check whether the event log contains any "Controller Unrecoverable Error" entries. Use sccli to generate a complete list of events, with the sccli invocation appropriate to your management connection to the array.
For example (parsed output):

sccli> show events
Wed June 4 13:12:22 2012 [Secondary] Alert ALERT: Controller Unrecoverable Error 0001 00000000 00000000 45754677
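The check in step 1 can be scripted against saved "show events" output. The following is a minimal Python sketch, not an Oracle tool: the function name find_unrecoverable_errors and the regular expression are illustrative assumptions, and the pattern is modelled only on the sample event lines in this document, so other firmware releases may format events differently.

```python
import re

# Pattern modelled on the sample event lines in this document (assumption);
# it expects a timestamp, a [Primary]/[Secondary] role tag, and four hex fields.
EVENT_RE = re.compile(
    r"(?P<when>\w{3} \w+ +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s+"
    r"\[(?P<role>Primary|Secondary)\]\s+Alert\s+ALERT:\s+"
    r"Controller Unrecoverable Error\s+"
    r"(?P<f1>[0-9A-Fa-f]{4})\s+(?P<f2>[0-9A-Fa-f]{8})\s+"
    r"(?P<f3>[0-9A-Fa-f]{8})\s+(?P<f4>[0-9A-Fa-f]{8})"
)


def find_unrecoverable_errors(log_text):
    """Return one dict per Controller Unrecoverable Error found in log_text."""
    return [
        {
            "when": m["when"],          # timestamp text, left unparsed
            "role": m["role"],          # role of the controller when the event was posted
            "first_field": m["f1"],     # the field the troubleshooting steps branch on
            "fields": (m["f1"], m["f2"], m["f3"], m["f4"]),
        }
        for m in EVENT_RE.finditer(log_text)
    ]
```

A non-empty result corresponds to a "yes" for the check in step 1; the first_field value is the one that the later steps compare against 0001 and 0002.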
If yes, continue to step 2.

Here is an example from an array with dual RAID controllers:

sccli> show redundancy
Primary controller serial number: 8014068
Primary controller location: Upper
Redundancy mode: Active-Active
Redundancy status: Enabled
Secondary controller serial number: 8012814
If one of the dual RAID controllers has failed, the output may look like this:

sccli> show redundancy
Primary controller serial number: 8014068
Primary controller location: Upper
Redundancy mode: Active-Active
Redundancy status: Failed
Secondary controller serial number: 8012814

sccli> show redundancy
Primary controller serial number: 8014068
Primary controller location: Upper
Redundancy mode: Active-Active
Redundancy status: Scanning
Secondary controller serial number: 0
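The redundancy output shown above is a set of "label: value" lines, so it can be read into a dictionary for scripted checks. This is a minimal Python sketch under that assumption; parse_redundancy and is_redundant are hypothetical helper names, and treating only "Enabled" as the healthy status reflects the examples in this document rather than a complete list of status values.

```python
def parse_redundancy(output):
    """Turn the 'label: value' lines of 'show redundancy' output into a dict."""
    info = {}
    for line in output.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as the 'sccli>' prompt line
            info[label.strip()] = value.strip()
    return info


def is_redundant(info):
    """True only when the redundancy status is 'Enabled' (both controllers active).

    Assumption: any other status (the examples show 'Failed' and 'Scanning')
    means the pair is not currently redundant.
    """
    return info.get("Redundancy status", "").lower() == "enabled"
```

For the "Failed" example above, parse_redundancy returns the primary controller location "Upper" and is_redundant returns False.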
If the first field contains: 0001, or 0002,
after the controller was replaced;

sccli> show redundancy
Primary controller serial number: 8014068
Primary controller location: Upper
Redundancy mode: Active-Active
Redundancy status: Enabled
Secondary controller serial number: 8012814

after the fail primary command;

sccli> show redundancy
Primary controller serial number: 8012814
Primary controller location: Lower
Redundancy mode: Active-Active
Redundancy status: Failed
Secondary controller serial number: 8014068

after the unfail command;

sccli> show redundancy
Primary controller serial number: 8012814
Primary controller location: Lower
Redundancy mode: Active-Active
Redundancy status: Enabled
Secondary controller serial number: 8014068
and then clear the core and monitor to see if the error reoccurs; otherwise, there is no need to clear core, because it only resides in NVSRAM on the controller that is being replaced. See <Document 1305700.1> How to Remove and Replace a Sun Storage 3510/3511 FC Array Single Controller
If the first field contains 0001 or 0002, go to Step 5.

5. Identify the affected controller.

Finding which controller reported the Controller Unrecoverable Error needs extra care, depending on what has been done since that event was logged. The event itself shows [Primary] or [Secondary], which is the affected controller's functional role when the event was logged. Unfortunately, these roles can change.

sccli> show events (parsed)

The affected controller's functional role was Secondary at the time the event was logged. However, an unfail, reset, or power cycle of the array can switch the controller roles. If there is an event showing an Initialization Completed, such as

sccli> show events
Mon Nov 5 10:17:08 2012 [Secondary] Notification Controller Initialization Completed

proceed to step 6. Otherwise, proceed with replacement of the controller that logged the event. Ensure that the replacement controller boots, and check for any unrecoverable errors; these could be left over from a previous issue and not cleared. If an unrecoverable error is reported when the replacement controller boots, clear the core and monitor to see if the error recurs. Otherwise, there is no need to clear the core, because it resides only in NVSRAM on the controller that is being replaced. See <Document 1305676.1> How to Remove and Replace a Sun Storage 3510/3511 FC Array Dual Controller.

Identifying the controller that has the fault requires some analysis. Because the event is not posted at the time of the failure, the user will not normally see this event. The typical corrective action for a controller failure is to replace the faulted unit. At that time, identifying the failed part is straightforward; it will be the controller with the amber fault LED. The sccli “show redundancy-mode” command will also identify the failed controller:
sccli> show redundancy-mode
In this case, the faulted controller is down, and is being held in reset by the primary. The primary is in the upper bay, so the faulted controller is in the lower bay.

If the failed controller is not restarted, it will not post a controller unrecoverable error event. If the controller is restarted, and does not post an unrecoverable event, the failure was not the result of a hardware or code trap. For example, pulling a controller or manually failing a controller does not encounter a trap.

The controller unrecoverable error event message will be labeled with the functional role that the faulted controller has assumed at restart. That is when the message is posted. For example, this event message indicates that the faulted controller came up as secondary on the restart:

Wed June 4 13:12:22 2008 [Secondary] Alert ALERT: Controller Unrecoverable Error 0000 00000000 00000000 46BA15C3
Again, the primary controller can be identified with the “show redundancy-mode” information, or with the fault light on the controller back panel. The primary controller LED will be blinking green.

Note that the “Redundant Controller Failure Detected” message will always be posted by the primary controller, even when the primary controller fails. When either controller fails, the survivor becomes primary, and it is that controller that posts the event message. Primary and secondary in this system are used to describe supervisory functions (ownership of the metadata updates, the active Ethernet port, and such). From a data handling perspective, each controller acts as a standby for the other.

To summarize: determine the current redundancy mode (if a controller is currently failed, it is probably the one with the fault). If both controllers are active (redundancy status “enabled”), the controller with the unrecoverable error can be identified by the role it assumed at restart.
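The summary above can be expressed as a small decision function. This is only an illustrative Python sketch: suspect_bay and other_bay are hypothetical names, the "Upper" and "Lower" bay values are taken from the redundancy output shown earlier, and the result is valid only under the assumption that the controller roles have not changed since the event was posted (an unfail, reset, or power cycle can switch them).

```python
def other_bay(bay):
    """Return the opposite controller bay in a two-controller array."""
    return {"Upper": "Lower", "Lower": "Upper"}[bay]


def suspect_bay(primary_location, redundancy_status, event_role):
    """Suggest which bay holds the controller that logged the unrecoverable error.

    primary_location:  'Upper' or 'Lower', from the redundancy output
    redundancy_status: for example 'Enabled' or 'Failed'
    event_role:        'Primary' or 'Secondary', from the event message
    Returns a bay name, or None when this simple rule does not apply.
    """
    status = redundancy_status.strip().lower()
    if status == "failed":
        # The survivor becomes primary, so the failed unit is in the other bay.
        return other_bay(primary_location)
    if status == "enabled":
        # Both controllers are active: use the role the controller assumed at restart.
        if event_role == "Primary":
            return primary_location
        if event_role == "Secondary":
            return other_bay(primary_location)
    return None  # any other status or role: needs manual analysis
```

With the primary in the upper bay, redundancy status "Enabled", and an event labeled [Secondary], the function points at the lower bay. A None result means the case falls outside the summary rule and should be analyzed by hand.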
6. Engage Oracle Support
Do you still have questions? You can use My Oracle Support Communities. Communities put you in touch with industry professionals like yourself. They are monitored by Oracle support engineers, so you can expect reliable and correct answers. Ask questions and see what others are asking about in the Disk Storage 2000, 3000, 6000 RAID Arrays & JBODs Community.
Attachments
This solution has no attachment