Troubleshooting Controller Unrecoverable Errors in Sun Storage SE3000 Arrays

Asset ID:	1-75-1314607.1
Update Date:	2017-10-01

Solution Type Troubleshooting Sure

Solution 1314607.1 : Troubleshooting Controller Unrecoverable Errors in Sun Storage SE3000 Arrays

Applies to:

Sun Storage 3310 Array - Version Not Applicable and later
Sun Storage 3510 FC Array - Version Not Applicable and later
Sun Storage 3511 SATA Array - Version Not Applicable and later
Sun Storage 3320 SCSI Array - Version Not Applicable and later
Information in this document applies to any platform.

Purpose

Controller Unrecoverable Error codes may be reported in the event log of SE3000 arrays, starting with firmware version 4.21.

The event log will report an entry similar to the following:

Wed June 4 13:12:22 2012
[Secondary] Alert
ALERT: Controller Unrecoverable Error 0001 00000000 00000000 45754677

The first field, 0001, indicates a multi-bit memory error.
The last field, 45754677, is a date field in hexadecimal.
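
The field layout described above can be illustrated with a short Python sketch that splits the message line into its four fields. This is only an illustration run against saved log text, not part of sccli. The function name and dictionary keys are hypothetical, and interpreting the last field as seconds since the Unix epoch is an assumption: this document states only that it is a date field in hexadecimal.

```python
import re
from datetime import datetime, timezone

# Matches the message line shown in the event log example above:
# a 4-digit error code followed by three 8-digit hexadecimal fields.
EVENT_RE = re.compile(
    r"Controller Unrecoverable Error\s+"
    r"([0-9A-Fa-f]{4})\s+([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})"
)


def parse_unrecoverable_error(line):
    """Return the fields of a Controller Unrecoverable Error line, or None."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    code, field2, field3, date_hex = match.groups()
    value = int(date_hex, 16)  # hexadecimal text to integer
    return {
        "code": code,
        "field2": field2,
        "field3": field3,
        "date_hex": date_hex.upper(),
        "date_value": value,
        # ASSUMPTION: the value is seconds since the Unix epoch (UTC).
        # The document does not specify the encoding of the date field.
        "date_assumed_utc": datetime.fromtimestamp(value, tz=timezone.utc),
    }


if __name__ == "__main__":
    sample = "ALERT: Controller Unrecoverable Error 0001 00000000 00000000 45754677"
    print(parse_unrecoverable_error(sample))
```

If the decoded date looks implausible for a given array, the epoch assumption does not hold for that firmware and only the raw hexadecimal value should be relied on.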

The Controller Unrecoverable Error event containing the error codes is written to the array event log when the affected controller restarts. Because there may be multiple reboots, the same event may be written to the event log multiple times.

First Field Error Code	Error
0000	PCI Parity Error
0001	Multi-bit memory Error
0002	Hardware Error
0003	DSI exception
0004	Illegal Request

The error codes above indicate why the controller rebooted.
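
For scripted log review, the table above can be kept as a simple lookup. The following Python sketch is illustrative only. The names ERROR_CODES and describe_error_code are hypothetical, and the descriptions are copied from the table in this document.

```python
# First-field error codes, as listed in the table above.
ERROR_CODES = {
    "0000": "PCI Parity Error",
    "0001": "Multi-bit memory Error",
    "0002": "Hardware Error",
    "0003": "DSI exception",
    "0004": "Illegal Request",
}


def describe_error_code(code):
    """Return the description for a first-field error code."""
    # Codes outside the table are reported as unknown rather than guessed.
    return ERROR_CODES.get(code.strip(), "Unknown code (not listed in this document)")


if __name__ == "__main__":
    print(describe_error_code("0001"))
```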

Troubleshooting Steps

1. Check the event log for any "Controller Unrecoverable Error" events. Use sccli to generate a complete list of events, choosing the command that matches your management connection to the array.

sccli> show events (if you only have in-band management)
sccli> show persistent (if you have out of band management configured)

For example (abridged output):

sccli> show events
   Wed June 4 13:12:22 2012
   [Secondary] Alert
   ALERT: Controller Unrecoverable Error 0001 00000000 00000000 45754677

If the event log contains a Controller Unrecoverable Error event, continue to step 2.
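
When the event list is long, the check in step 1 can be done on a saved copy of the output. The Python sketch below is a minimal example that assumes the three-line layout shown in the example above (timestamp, then role and severity, then message). The function name is hypothetical, and the sketch reads text only; it does not invoke sccli.

```python
def find_unrecoverable_events(log_text):
    """Return a list of Controller Unrecoverable Error events found in log_text."""
    # Drop blank lines and indentation so records sit on consecutive lines.
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    events = []
    for index, line in enumerate(lines):
        if "Controller Unrecoverable Error" not in line:
            continue
        # ASSUMPTION: the two preceding lines hold the role/severity and
        # the timestamp, as in the example output in this document.
        role_line = lines[index - 1] if index >= 1 else ""
        timestamp = lines[index - 2] if index >= 2 else ""
        role = "unknown"
        if role_line.startswith("[") and "]" in role_line:
            role = role_line[1:role_line.index("]")]
        events.append({"timestamp": timestamp, "role": role, "message": line})
    return events


if __name__ == "__main__":
    sample = """sccli> show events
       Wed June 4 13:12:22 2012
       [Secondary] Alert
       ALERT: Controller Unrecoverable Error 0001 00000000 00000000 45754677
    """
    for event in find_unrecoverable_events(sample):
        print(event)
```

An empty result means no such event was found in the saved text, and the remaining steps do not apply.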

2. Determine whether the array is a single or dual RAID controller array. You can use the sccli utility, or visually inspect the array.

Here is an example of dual RAID controllers.

sccli> show redundancy
Primary controller serial number: 8014068
Primary controller location: Upper
Redundancy mode: Active-Active
Redundancy status: Enabled
Secondary controller serial number: 8012814

If one of the dual RAID controllers has failed, the output may look like this.

sccli> show redundancy
Primary controller serial number: 8014068
Primary controller location: Upper
Redundancy mode: Active-Active
Redundancy status: Failed
Secondary controller serial number: 8012814

Here is an example of a single RAID controller.

sccli> show redundancy
Primary controller serial number: 8014068
Primary controller location: Upper
Redundancy mode: Active-Active
Redundancy status: Scanning
Secondary controller serial number: 0

If the Redundancy status is Scanning, the array is a single controller array; continue to step 3.
If the Redundancy status is Failed or Enabled, the array is a dual controller array; go to step 4.
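
The rule above can be applied to saved show redundancy output with a small Python sketch. The function names are hypothetical, and the parser assumes the "label: value" layout shown in the examples in this step.

```python
def parse_redundancy(output):
    """Turn 'label: value' lines from show redundancy output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, separator, value = line.partition(":")
        if separator:  # lines without a colon (such as the prompt) are skipped
            fields[key.strip()] = value.strip()
    return fields


def classify_array(fields):
    """Apply the step 2 rule: Scanning means single, Enabled or Failed means dual."""
    status = fields.get("Redundancy status", "")
    if status == "Scanning":
        return "single"
    if status in ("Enabled", "Failed"):
        return "dual"
    return "unknown"  # any other status is not covered by this document


if __name__ == "__main__":
    sample = """sccli> show redundancy
    Primary controller serial number: 8014068
    Primary controller location: Upper
    Redundancy mode: Active-Active
    Redundancy status: Scanning
    Secondary controller serial number: 0
    """
    print(classify_array(parse_redundancy(sample)))
```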

3. Evaluate the Error Code for a Single Controller Array

For a single-controller array, the controller resets automatically, and the event is added to the event log.

If the first field contains 0000, 0003, or 0004:

Permanently clear the persistent event. You must use the tip/telnet interface to perform this task.
System Functions -> Controller Maintenance -> Clear Core.
Release to production.

If the first field contains 0001 or 0002:

Replace the controller. Ensure that the replacement controller boots, then check for any unrecoverable errors. These could be left over from a previous issue that was never cleared.

If an unrecoverable error is reported when the replacement controller boots and the array is a single RAID controller array, simply clear the core and monitor to see whether the error recurs.

If the array is a dual RAID controller array, first promote the newly replaced controller to primary. Run the sccli command fail primary to fail the primary controller, then run the sccli command unfail to bring the failed controller back online. Finally, check that the newly replaced controller is now the primary and the original controller is now the secondary.

After the controller was replaced:

sccli> show redundancy
Primary controller serial number: 8014068
Primary controller location: Upper
Redundancy mode: Active-Active
Redundancy status: Enabled
Secondary controller serial number: 8012814

After the fail primary command:

sccli> show redundancy
Primary controller serial number: 8012814
Primary controller location: Lower
Redundancy mode: Active-Active
Redundancy status: Failed
Secondary controller serial number: 8014068

After the unfail command:

sccli> show redundancy
Primary controller serial number: 8012814
Primary controller location: Lower
Redundancy mode: Active-Active
Redundancy status: Enabled
Secondary controller serial number: 8014068

Then clear the core and monitor to see whether the error recurs.

If no unrecoverable error is reported when the replacement controller boots, there is no need to clear the core, because the logged error resides only in NVSRAM on the controller that is being replaced.

See <Document 1305700.1> How to Remove and Replace a Sun Storage 3510/3511 FC Array Single Controller

4. Evaluate the Error Code for a Dual Controller Array

Typically, the controller reporting the unrecoverable error will be failed.

If the first field contains 0000, 0003, or 0004:

Permanently clear the persistent event. You must use the tip/telnet interface to perform this task.
System Functions -> Controller Maintenance -> Clear Core.
Use sccli to unfail the failed controller (if needed).

sccli> unfail
Are you sure?
Are you sure? yes
sccli>
Confirm that the controllers are operational again. The sccli command show redundancy must report a Redundancy mode of Active-Active and a Redundancy status of Enabled. If it does not, proceed to step 6.
Otherwise, release the array to production.

If the first field contains 0001 or 0002, go to step 5.
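
The branching in steps 3 and 4 can be summarized as a small decision function. The Python sketch below only restates the rules in this document; the function name and the wording of the returned strings are hypothetical, and it does not replace the detailed instructions in each step.

```python
# Codes handled by clearing the core versus codes that call for replacement,
# as described in steps 3 and 4.
CLEAR_CORE_CODES = {"0000", "0003", "0004"}
REPLACE_CODES = {"0001", "0002"}


def next_action(code, array_type):
    """Return the next action for an error code on a 'single' or 'dual' array."""
    if array_type not in ("single", "dual"):
        raise ValueError("array_type must be 'single' or 'dual'")
    if code in CLEAR_CORE_CODES:
        if array_type == "single":
            return "Clear Core via tip/telnet, then release to production"
        return ("Clear Core via tip/telnet, unfail the failed controller if needed, "
                "then confirm redundancy is Active-Active and Enabled")
    if code in REPLACE_CODES:
        if array_type == "single":
            return "Replace the controller (step 3)"
        return "Identify the affected controller (step 5)"
    return "Code not covered by this document"


if __name__ == "__main__":
    print(next_action("0001", "dual"))
```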

5. Identify the affected controller. Finding which controller reported the Controller Unrecoverable Error needs extra care, depending on what has been done since that event was logged. The event itself will show [Primary] or [Secondary], which is the affected controller's functional role when that event was logged. Unfortunately, these roles can change.

sccli> show events (abridged)
Wed June 4 13:12:22 2012
[Secondary] Alert
ALERT: Controller Unrecoverable Error 0001 00000000 00000000 45754677

The affected controller's functional role was Secondary at the time the event was logged. However, an unfail, reset, or power cycle of the array can switch the controller roles. If there is an event showing Controller Initialization Completed, such as

sccli> show events
Mon Nov 5 10:17:08 2012
[Secondary] Notification
Controller Initialization Completed

Proceed to step 6. Otherwise, proceed with replacement of the controller that logged the event. Ensure that the replacement controller boots, then check for any unrecoverable errors; these could be left over from a previous issue that was never cleared. If an unrecoverable error is reported when the replacement controller boots, clear the core and monitor to see whether the error recurs. Otherwise, there is no need to clear the core, because the logged error resides only in NVSRAM on the controller that is being replaced.

See <Document 1305676.1> How to Remove and Replace a Sun Storage 3510/3511 FC Array Dual Controller

Identifying the faulted Controller:

Identifying the controller that has the fault requires some analysis. Because the event is not posted at the time of the failure, the user will not normally see this event. The typical corrective action for a controller failure is to replace the faulted unit. At that time, identifying the failed part is straightforward; it will be the controller with the amber fault LED. The sccli “show redundancy-mode” command will also identify the failed controller:

sccli> show redundancy-mode
Primary controller serial number: 8000640
Primary controller location: Upper
Redundancy mode: Active-Active
Redundancy status: Failed
Secondary controller serial number: 8001418

In this case, the faulted controller is down, and is being held in reset by the primary. The primary is in the upper bay, so the faulted controller is in the lower bay.

If the failed controller is not restarted, it will not post a controller unrecoverable error event. If the controller is restarted, and does not post an unrecoverable event, the failure was not the result of a hardware or code trap. For example, pulling a controller or manually failing a controller does not encounter a trap.

The controller unrecoverable error event message will be labeled with the functional role that the faulted controller has assumed at restart. That is when the message is posted. For example, this event message indicates that the faulted controller came up as secondary on the restart:

Wed June 4 13:12:22 2008
[Secondary] Alert
ALERT: Controller Unrecoverable Error 0000 00000000 00000000 46BA15C3

Again, the primary controller can be identified with the “show redundancy-mode” information, or with the fault light on the controller back panel. The primary controller LED will be blinking green. Note that the “Redundant Controller Failure Detected” message will always be posted by the primary controller, even when the primary controller fails. When either controller fails, the survivor becomes primary, and it is that controller that posts the event message. Primary and secondary in this system are used to describe supervisory functions (ownership of the metadata updates, the active Ethernet port, and such). From a data handling perspective, each controller acts as a standby for the other.

To summarize: determine the current redundancy mode (if a controller is currently failed, it is probably the one with the fault). If both controllers are active (redundancy status “enabled”), the controller with the unrecoverable error can be identified by the role it assumed at restart.
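
The summary above can be sketched as a small Python helper that returns the bay of the suspected controller. The function name and parameters are hypothetical. The sketch assumes that the inputs come from the show redundancy-mode output and from the role label on the event, and that controller roles have not switched again since the faulted controller restarted. If they may have switched, the result is not reliable.

```python
OTHER_BAY = {"Upper": "Lower", "Lower": "Upper"}


def locate_faulted_controller(redundancy_status, primary_location, event_role):
    """Return 'Upper' or 'Lower' for the suspected controller, or None if unclear."""
    if primary_location not in OTHER_BAY:
        raise ValueError("primary_location must be 'Upper' or 'Lower'")
    status = redundancy_status.strip().lower()
    if status == "failed":
        # The survivor is primary, so the failed controller is in the other bay.
        return OTHER_BAY[primary_location]
    if status == "enabled":
        # Both controllers are active: use the role the controller assumed
        # at restart, which is the role shown on the event message.
        role = event_role.strip().lower()
        if role == "primary":
            return primary_location
        if role == "secondary":
            return OTHER_BAY[primary_location]
    return None  # case not covered by the summary above


if __name__ == "__main__":
    print(locate_faulted_controller("Failed", "Upper", "Secondary"))
```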

6. Engage Oracle Support

Collect an Oracle Explorer using <Document 1010987.1> Oracle Explorer Data Collector: How to Get se3kxtr in Out-of-Band Status
Open an Oracle Service Request for further assistance.

