Oracle ZFS Storage Appliance: How to address False Clustron FMA Faults

Asset ID:	1-71-2216800.1
Update Date:	2018-04-17
Keywords:

Solution Type Technical Instruction Sure

Solution 2216800.1 : Oracle ZFS Storage Appliance: How to address False Clustron FMA Faults

Applies to:

Sun ZFS Storage 7420 - Version All Versions and later
Sun ZFS Storage 7320 - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS3-2 - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
7000 Appliance OS (Fishworks)

An FMA fault on Clustron devices may be reported even though the hardware is functioning correctly. This document explains the background, the circumstances in which the false fault occurs, and a workaround.

Goal

Release 8.6 introduced new functionality that provides some visibility into Clustron connectivity faults, so that faulted Clustron links can be identified.

However, in certain situations the Clustron link is correctly marked offline, but the subsequent online event is not recognized, so the Clustron link is incorrectly left marked as faulted. This can occur after a reboot, or at any other time that akd restarts.

Faults caused by this bug do not impact the functionality of the clustering subsystem. However, genuine hardware faults are flagged in the same way, so the user should not assume that every such fault is due to this bug.

Solution

We provide here a workaround for this issue.

To clear the faults, first verify the state of the Clustron hardware by running the “configuration cluster links show” command and confirming that all 3 links report AKCIOS_ACTIVE. If they do, the fault can be cleared by selecting the problem under “maintenance problems” and running “markrepaired”.

This is an example of a case where hardware is good, and any fault may be cleared:

hostname:> configuration cluster links show

clustron2_embedded:0/clustron_uart:0 = AKCIOS_ACTIVE
clustron2_embedded:0/clustron_uart:1 = AKCIOS_ACTIVE
clustron2_embedded:0/dlpi:0 = AKCIOS_ACTIVE

In this example, one link reports AKCIOS_TIMEDOUT, so the fault is genuine and should not be marked repaired:

hostname:> configuration cluster links show

clustron2_embedded:0/clustron_uart:0 = AKCIOS_TIMEDOUT
clustron2_embedded:0/clustron_uart:1 = AKCIOS_ACTIVE
clustron2_embedded:0/dlpi:0 = AKCIOS_ACTIVE
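
The check illustrated by the two examples above can also be scripted. The following is a minimal Python sketch, assuming the output of "configuration cluster links show" has already been captured as text in the format shown above. The function names parse_cluster_links and all_links_active are hypothetical and are not part of the appliance software.

```python
import re

# Matches lines such as "clustron2_embedded:0/clustron_uart:0 = AKCIOS_ACTIVE".
# Prompt lines and blank lines do not match and are ignored.
LINK_LINE = re.compile(r"^\s*(\S+)\s*=\s*(AKCIOS_\w+)\s*$")

EXPECTED_LINKS = 3  # this article states that all 3 links must be active


def parse_cluster_links(output):
    """Return a dict mapping each link name to its reported state."""
    links = {}
    for line in output.splitlines():
        match = LINK_LINE.match(line)
        if match:
            links[match.group(1)] = match.group(2)
    return links


def all_links_active(output):
    """True only if exactly 3 links are listed and all report AKCIOS_ACTIVE."""
    links = parse_cluster_links(output)
    return len(links) == EXPECTED_LINKS and all(
        state == "AKCIOS_ACTIVE" for state in links.values()
    )


healthy = """
hostname:> configuration cluster links show

clustron2_embedded:0/clustron_uart:0 = AKCIOS_ACTIVE
clustron2_embedded:0/clustron_uart:1 = AKCIOS_ACTIVE
clustron2_embedded:0/dlpi:0 = AKCIOS_ACTIVE
"""
# Same output with the first serial link timed out, as in the second example.
faulted = healthy.replace("uart:0 = AKCIOS_ACTIVE", "uart:0 = AKCIOS_TIMEDOUT")

print(all_links_active(healthy))   # fault may be cleared if True
print(all_links_active(faulted))   # fault must not be cleared if False
```

Requiring exactly three matching lines means that truncated or unexpected output is treated as "not verified" rather than as healthy.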

When the hardware is good, use the 'markrepaired' command to clear the fault:

hostname:> maintenance problems select problem-000 markrepaired

This should be done only for problems of the following types:

Communication with the cluster peer via the serial link is lost.

Communication with the cluster peer via the Ethernet port is lost.
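
Only problems with these two descriptions are candidates for 'markrepaired'. As an illustration, the Python sketch below filters a list of problems down to those matching the descriptions above, and returns nothing unless the links have been verified as active. The (identifier, description) pair format and the function name clearable_problems are assumptions made for this example; they do not reflect the appliance's actual output format.

```python
# The two problem descriptions named in this article.
CLUSTRON_LINK_DESCRIPTIONS = (
    "Communication with the cluster peer via the serial link is lost.",
    "Communication with the cluster peer via the Ethernet port is lost.",
)


def clearable_problems(problems, links_all_active):
    """Return the identifiers of problems that may be marked repaired.

    problems: iterable of (identifier, description) pairs (hypothetical format).
    links_all_active: outcome of the link-state verification; if False,
    no problem is considered clearable.
    """
    if not links_all_active:
        return []
    return [
        ident
        for ident, description in problems
        if description.strip() in CLUSTRON_LINK_DESCRIPTIONS
    ]


problems = [
    ("problem-000", "Communication with the cluster peer via the serial link is lost."),
    ("problem-001", "Some unrelated fault."),  # made-up entry for illustration
    ("problem-002", "Communication with the cluster peer via the Ethernet port is lost."),
]

print(clearable_problems(problems, links_all_active=True))
print(clearable_problems(problems, links_all_active=False))
```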

If the problem persists after the fault has been marked repaired, gather a support bundle from each cluster node and engage Oracle Support.

If an AKCIOS_TIMEDOUT state is observed on any of the links, Oracle Support must also be engaged.

Bug 23092294 describes this issue (Fixed in Micro Release 2013.1.6.13).
