Asset ID: 1-71-1437350.1
Update Date: 2018-01-08
Solution Type: Technical Instruction
Solution 1437350.1: Pillar Axiom: Handling Brick RAID Controller Fault
Related Items
- Pillar Axiom 500 Storage System
- Pillar Axiom 300 Storage System
- Pillar Axiom 600 Storage System
Related Categories
- PLA-Support>Sun Systems>DISK>Axiom>SN-DK: Ax600
- _Old GCS Categories>Sun Microsystems>Storage - Disk>Pillar Axiom
Created from <SR 3-5456018913>
Applies to:
Pillar Axiom 500 Storage System - Version Not Applicable and later
Pillar Axiom 300 Storage System - Version Not Applicable and later
Pillar Axiom 600 Storage System - Version Not Applicable and later
Information in this document applies to any platform.
Goal
What are a customer's options when receiving events related to a RAID Controller Fault?
Solution
A Brick's RAID controller has encountered a fault. This is usually a recoverable event, and the RAID controller should return to Normal status. Because the faults can vary, logs need to be obtained to determine the root cause.
The Axiom has built-in redundancy to handle a failed RAID controller even if the controller is not able to recover from this fault; data serving is typically taken over by the surviving controller.
The handling of RAID Controller Fault Callhomes is done in two steps:
- Check the state of the affected Brick and ensure that the RAID Controllers are Normal in the GUI and that the LED fault lights are off.
- If Normal, this is a healthy indication that the RAID Controller successfully recovered from the fault.
- If other than Normal, further analysis of the logs might be required to determine whether a replacement RAID Controller needs to be shipped.
- The customer may opt to have root-cause analysis performed on the fault. Factors that help determine whether this analysis is needed include the following:
- If the RAID Controller recovered, the need for analysis may be less critical, because the RAID Controller is designed to recover from such faults.
- Factors such as recurrence (over a short period of time) and unexpected loss of data access weigh in favor of further analysis; one way to check for recurrence in a collected logset is sketched below.
- On occasion the corresponding Brick logs are not bundled with the Event-driven Callhome logs received by Customer Support. In that case a manual logset (including the appropriate Brick logs) will be requested. Please refer to <Document 1906876.1> Pillar Axiom: How to Collect a System Information Log and Transfer it to Oracle.
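For illustration only, the sketch below shows one way a collected logset could be scanned for RAID Controller fault events so that recurrence over a short period can be judged. The directory layout, the *.log file extension, and the event strings (BrickRAIDControllerFault, Cache Parity Error) are assumptions rather than a documented Axiom log format; adjust them to match the actual contents of the extracted logset.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: scan an extracted logset for RAID Controller
fault events and summarize how often they occur per log file.

The file extension and event strings below are assumptions; match them to
the actual contents of the collected logset before relying on the output."""
import re
import sys
from collections import Counter
from pathlib import Path

# Hypothetical patterns; adjust to the strings present in the real logs.
FAULT_PATTERNS = [
    re.compile(r"BrickRAIDControllerFault"),
    re.compile(r"Cache Parity Error"),
]

def scan_logset(logset_dir: str) -> Counter:
    """Count fault-event lines in each *.log file under logset_dir."""
    counts = Counter()
    for log_file in Path(logset_dir).rglob("*.log"):  # assumed extension
        with open(log_file, errors="replace") as fh:
            for line in fh:
                if any(p.search(line) for p in FAULT_PATTERNS):
                    counts[log_file.name] += 1
    return counts

if __name__ == "__main__":
    for name, count in scan_logset(sys.argv[1]).most_common():
        print(f"{name}: {count} fault event(s)")
```

Run it against the directory produced by extracting the collected logset, for example: python scan_logset.py /path/to/extracted/logset. Several hits clustered in a short time window would suggest the recurrence factor described above.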
A number of Brick RAID controller reboots are due to Cache Parity Errors. What is a Cache Parity Error?
The RAID Controller uses an Intel X-Scale microprocessor, which has internal cache memory. This error indicates that a parity error has occurred in the cache memory inside the X-Scale processor. Extensive testing has indicated that these highly intermittent errors are not the result of heat or voltage conditions at the chip. The X-Scale microprocessor, on encountering this error, must reset itself, which causes the RAID Controller to fail over to the companion (which continues to provide data access), reboot, and then fail back.
Typically, a RAID Controller Unit (CU) will experience no more than one of these events in its service life. There is no benefit to replacing a RAID CU after a single instance, as the replacement CU would have the same probability of failure.
If a RAID CU experiences multiple failures of this type, it is replaced.
The Serial Number of the individual RAID CU is kept in an internal Oracle database to track any repeat failures.
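As a purely illustrative sketch, and not the internal Oracle database or its schema, the snippet below applies the replace-on-repeat policy described above: a single Cache Parity Error does not trigger replacement, while a CU serial number recorded more than once is flagged. The record format and serial numbers are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical fault history: (RAID CU serial number, event timestamp).
# These records stand in for whatever the internal tracking database holds.
fault_history = [
    ("CU-SN-0001", datetime(2017, 3, 2)),
    ("CU-SN-0001", datetime(2017, 11, 20)),
    ("CU-SN-0042", datetime(2017, 6, 9)),
]

def replacement_candidates(history):
    """Return CU serial numbers recorded with more than one fault.

    A single fault does not justify replacement, because a replacement CU
    would have the same probability of encountering this intermittent error.
    """
    counts = defaultdict(int)
    for serial, _timestamp in history:
        counts[serial] += 1
    return sorted(serial for serial, n in counts.items() if n > 1)

print(replacement_candidates(fault_history))  # ['CU-SN-0001']
```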
This section is Oracle Internal Only
Customer Support Facing Description (Internal Only)
Description:
The issue is that there is a product defect in an Intel chip within the RAID Controller such that, when certain conditions are met, the RAID Controller faults and is reset by the companion Controller. This is not a defect specific to the Axiom, but rather a defect in the OEM part (the Intel 80321 X-Scale microprocessor). Over the course of product development with the Intel 80321 (X-Scale microprocessor), and now the Intel 80331, Pillar Data Systems has seen an occasional condition known as a “Cache Parity Error - FSR:0x408”. More recent reports of the condition have also been “Data Abort: Data Cache Parity Error Exception - FS:0x408”. Pillar Data Systems has been unable to reproduce this issue on the same boards during ongoing FA (Failure Analysis).
Containment:
This error occurs inside the X-Scale microprocessor. There is a different error code for an External Data Abort, so this error is contained within the X-Scale. This is consistent with our experimental data: we cannot induce a 408-type error by margining the X-Scale's external PCI-X bus or the CPU voltage. We have reported the problem to Intel, but they have thus far not been able to duplicate it. Given the infrequent nature of these errors and the inability to generate a recurrence, we are working to produce a test code version that will loop and trap errors, in order to generate enough information for Intel to debug the problem at the chip level.
Solution:
There is currently no firmware work-around at the RAID controller level to address this problem. The CPU core inside the X-Scale has received bad data and must abort and reset. Until this problem is fixed at the chip level, we accommodate an occasional failure by performing a fail-over and a fail-back and continuing to run, while logging that the controller has experienced a fault.
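The following minimal sketch restates that accommodate-and-continue behavior as ordinary function calls and log messages. The function and controller names are hypothetical and do not correspond to Axiom firmware interfaces; the sketch only illustrates the order of operations (fail over to the companion, reboot the faulted controller, fail back, and log the fault).

```python
import logging

# Purely illustrative; names do not correspond to Axiom firmware interfaces.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("brick")

def handle_cache_parity_error(faulted_cu, companion_cu):
    """Illustrate the sequence: fail over, reboot the faulted CU, fail back."""
    log.warning("Cache Parity Error (FSR:0x408) on %s; CPU core must reset", faulted_cu)
    log.info("Failing over to companion %s; data access continues", companion_cu)
    log.info("Rebooting %s", faulted_cu)
    log.info("Failing back to %s after recovery", faulted_cu)

handle_cache_parity_error("RAID CU 0", "RAID CU 1")
```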
Recommendations:
We have entered the SSN of this RAID Controller in our tracking database to reflect this instance of the Cache Parity Error. Should this Controller encounter this condition again, Pillar Support will make arrangements for replacement of the Controller.
References
<NOTE:1906876.1> - Pillar Axiom: How to Collect a System Information Log and Transfer it to Oracle
<NOTE:1456657.1> - Pillar Axiom: Handling Brick RAID Controller Fault (BrickRAIDControllerFault) due to Cache Parity Error
Attachments
This solution has no attachment