Oracle ZFS Storage Appliance: FMA fault "cluster.link.down"

Asset ID:	1-75-2186614.1
Update Date:	2018-01-05
Keywords:

Solution Type Troubleshooting Sure

Solution 2186614.1 : Oracle ZFS Storage Appliance: FMA fault "cluster.link.down"

Applies to:

Oracle ZFS Storage ZS3-2 - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
Sun Storage 7720 Unified Storage System - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
7000 Appliance OS (Fishworks)
The FMA fault fault.ak.xmlrpc.cluster.link.down is generated when the cluster peer cannot be reached over the cluster interconnect (clustron) links.

Purpose

This document explains the conditions under which the FMA fault "fault.ak.xmlrpc.cluster.link.down" is generated.

Troubleshooting Steps

In the general case, since OS8.6 (2013.1.6), an unresolved clustron link-down state is tracked as a fault if the link remains disconnected for 10 seconds after the alert is generated.
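The 10-second rule above can be illustrated with a minimal sketch. This is not the appliance's implementation: the class name LinkTracker, the observe() method, and the assumption that the alert is generated at the moment the link is first seen disconnected are all hypothetical and chosen only to show the timing logic.

```python
from dataclasses import dataclass
from typing import Optional

# Delay taken from the article: the link must stay down for 10 seconds
# after the alert before the condition is tracked as a fault.
FAULT_DELAY_SECONDS = 10.0


@dataclass
class LinkTracker:
    """Hypothetical tracker for one clustron link (illustration only)."""
    name: str
    down_since: Optional[float] = None  # time the link-down alert was raised
    faulted: bool = False

    def observe(self, connected: bool, now: float) -> bool:
        """Record one link-state sample and return the current fault state."""
        if connected:
            # Link is active again: the fault clears automatically.
            self.down_since = None
            self.faulted = False
        elif self.down_since is None:
            # First disconnected sample: assumed to be when the alert fires.
            self.down_since = now
        elif now - self.down_since >= FAULT_DELAY_SECONDS:
            # Still disconnected 10 seconds after the alert: raise the fault.
            self.faulted = True
        return self.faulted
```

In this sketch a link that recovers within 10 seconds never reaches the faulted state, which matches the intent of tracking only unresolved link-down conditions.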

An FM ereport is generated for this condition with the FMA FRU/resource.

The ereport is diagnosed as a fault by the AK fault diagnostic engine and is also transmitted to the ILOM.

When the problem is rectified and the link is active again, the system automatically tracks this state and clears the fault. The ereport remains logged for further troubleshooting.

The FM fault will remain unresolved as long as the cluster peer is switched off.

This is the designed behaviour for reporting clustron link-down states on the ZFSSA from OS8.6 (2013.1.6) onwards.

In some very specific cases, though, one cluster head cannot reach its peer, for example because the peer node fails to boot due to a hardware or software issue. The surviving head may then raise an FMA fault of type:

fault.ak.xmlrpc.cluster.link.down

This occurs because a connection cannot be established between the cluster peers through the clustron links.

In this very specific case, other relevant alerts and faults will be reported to indicate what the kernel is doing, or which hardware issue has been encountered.

If the cluster peer cannot be reached through any of the three cluster links (that is, all three links are down), a fault of this type is reported for each of the three links:

UART0 - fault.ak.xmlrpc.cluster.link.down

UART1 - fault.ak.xmlrpc.cluster.link.down

DLPI0 - fault.ak.xmlrpc.cluster.link.down

This may be confusing to the user: this fault should be interpreted as a failure of the clustering subsystem to reach its cluster peer through the clustron link.

It does not necessarily indicate a physical link failure and should not lead to a cable replacement in every case.

From the clustering subsystem's standpoint it is a fault, but the TSC engineer (engaged on this issue) should be able to distinguish between the different subcases when this fault is reported.
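As a rough aid to that distinction, the sketch below classifies a set of faulted links. It is a heuristic under stated assumptions, not an Oracle diagnostic procedure: the function name triage and its return labels are hypothetical, and the rule that faults on all three links point toward an unreachable peer rather than a cabling problem is an inference from the description above.

```python
# The three cluster links named in this document.
CLUSTER_LINKS = frozenset({"UART0", "UART1", "DLPI0"})


def triage(faulted_links):
    """Return a coarse label for a set of links reporting cluster.link.down.

    Labels (hypothetical, for illustration only):
      "none"             - no link reports the fault
      "peer-unreachable" - every link is down; check the peer head first
      "link-specific"    - only some links are down; inspect those links
    """
    faulted = {name.upper() for name in faulted_links}
    unknown = faulted - CLUSTER_LINKS
    if unknown:
        raise ValueError(f"unknown cluster link(s): {sorted(unknown)}")
    if not faulted:
        return "none"
    if faulted == CLUSTER_LINKS:
        # All links down at once: the peer is likely off, panicked,
        # or failing to boot, so cable replacement is not the first step.
        return "peer-unreachable"
    return "link-specific"
```

For example, triage(["UART0", "UART1", "DLPI0"]) returns "peer-unreachable", while triage(["DLPI0"]) returns "link-specific". Either result is only a starting point; the other alerts and faults reported at the same time remain the primary evidence.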

Additional reasons why this situation could arise:

- Power on/off, OS shutdown/restart (these now occur as part of the new 8.6 update procedure during a cluster OS update)
- Panic, application core, core dump
- High system load
- Ethernet cable not properly seated

See also Doc ID 1542550.1 (Sun Storage 7000 Unified Storage System: Communication with the cluster peer via a cluster interconnect link has been lost)

Related bugs:

15538608 - dlpi cluster links flap spuriously
15726659 - dlpi clustron2 links flap spuriously under load

Relevant ASR Event messages:

AK-8002-70 = ereport.ak.xmlrpc.cluster.link.uart0.down
AK-8002-88 = ereport.ak.xmlrpc.cluster.link.uart1.down
AK-8002-9M = ereport.ak.xmlrpc.cluster.link.dlpi.down
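When processing ASR notifications in a script, the three message IDs above can be kept in a small lookup table. The sketch below uses only the codes and ereport classes listed in this document; the names ASR_EVENTS and link_for_event are hypothetical helpers, not part of any Oracle tool.

```python
# ASR message IDs and ereport classes exactly as listed in this document.
ASR_EVENTS = {
    "AK-8002-70": "ereport.ak.xmlrpc.cluster.link.uart0.down",
    "AK-8002-88": "ereport.ak.xmlrpc.cluster.link.uart1.down",
    "AK-8002-9M": "ereport.ak.xmlrpc.cluster.link.dlpi.down",
}


def link_for_event(message_id):
    """Return the link name embedded in the ereport class for an ASR ID.

    Raises KeyError for message IDs not covered by this document.
    """
    ereport = ASR_EVENTS[message_id]
    # The link name is the second-to-last dotted component,
    # e.g. "...cluster.link.uart0.down" -> "UART0".
    return ereport.split(".")[-2].upper()
```

Calling link_for_event("AK-8002-88") yields "UART1", which identifies the affected link directly from the ASR message ID.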

If this FMA fault is raised, open a Service Request with Oracle Support for further assistance.

Please include all the relevant details and information along with an accurate problem description in the SR notes.

If possible, a current supportbundle (from both heads, if this is a cluster system) should also be obtained and uploaded to Oracle.

The following links will provide more information:

Document 1019887.1 - Sun Storage 7000 Unified Storage System: How to collect a supportbundle using the BUI or CLI

Document 2021771.1 - Oracle ZFS Storage Appliance: Software Updates

Attachments

This solution has no attachment