Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution

2186614.1 : Oracle ZFS Storage Appliance: FMA fault "cluster.link.down"
This document explains when the FMA fault fault.ak.xmlrpc.cluster.link.down is generated.
Applies to:
Oracle ZFS Storage ZS3-2 - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
Sun Storage 7720 Unified Storage System - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
7000 Appliance OS (Fishworks)

The FMA fault fault.ak.xmlrpc.cluster.link.down is generated when the cluster peer cannot be joined.

Purpose
This document explains the condition in which the FMA fault "fault.ak.xmlrpc.cluster.link.down" is generated.
Troubleshooting Steps
In the general case, since OS8.6 (2013.1.6), an unresolved clustron link down state is tracked as a fault if the link remains disconnected for 10 seconds after the alert is generated. An FM ereport is generated for this condition with the FMA FRU/resource. The ereport is diagnosed as a fault by the AK fault diagnostic engine and is also transmitted to the ILOM.
When the problem is rectified and the link is active again, the system automatically detects this and clears the fault. The ereport remains logged for further troubleshooting. The FM fault will remain unresolved as long as the cluster peer is switched off. This is the designed behaviour for reporting clustron link down states on the ZFSSA as of OS8.6 (2013.1.6).
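The raise-and-clear behaviour described above can be modelled as a small state machine. The following is only an illustrative sketch, not the appliance's actual implementation: the class `LinkFaultTracker`, its fields, and the `observe` method are hypothetical names, and the only value taken from this document is the 10-second threshold.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# From the text: a fault is raised if the link is still down
# 10 seconds after the link-down alert was generated.
HOLD_DOWN_SECONDS = 10.0


@dataclass
class LinkFaultTracker:
    """Toy model of one clustron link (hypothetical, for illustration only)."""
    name: str
    down_since: Optional[float] = None      # time the link-down alert was raised
    fault_active: bool = False
    # Logged ereports are kept even after the fault clears.
    ereports: List[Tuple[float, str]] = field(default_factory=list)

    def observe(self, link_up: bool, now: float) -> None:
        if link_up:
            # Link is active again: the fault clears automatically.
            self.down_since = None
            self.fault_active = False
            return
        if self.down_since is None:
            # First sighting of the down state: alert only, no fault yet.
            self.down_since = now
            return
        if not self.fault_active and now - self.down_since >= HOLD_DOWN_SECONDS:
            # Still down after the hold-down period: log an ereport, raise the fault.
            self.ereports.append((now, self.name))
            self.fault_active = True


tracker = LinkFaultTracker("UART0")
tracker.observe(link_up=False, now=0.0)    # alert raised, no fault yet
tracker.observe(link_up=False, now=10.0)   # still down after 10 s: fault raised
tracker.observe(link_up=True, now=12.0)    # link restored: fault cleared, ereport kept
```

In this model a peer that stays switched off never produces a `link_up=True` observation, so the fault stays active, which matches the behaviour described above.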
In some very specific cases, though, if one cluster head cannot reach its peer (e.g. if the peer node refuses to boot because of a hardware or software issue), the surviving head may raise an FMA fault of type fault.ak.xmlrpc.cluster.link.down.
This occurs because a connection cannot be established between the cluster peers through the clustron links. In this very specific case, other relevant alerts and faults will be reported to indicate what the kernel is doing or which hardware issue is being hit. If the cluster peer cannot be joined through any of the three cluster links, a fault of this type will be reported for each of the three links:
UART0 - fault.ak.xmlrpc.cluster.link.down
UART1 - fault.ak.xmlrpc.cluster.link.down
DLPI0 - fault.ak.xmlrpc.cluster.link.down

This may be confusing to the user: the fault should be interpreted as a failure of the clustering subsystem to join its cluster peer through the clustron link. It does not necessarily indicate a link failure and should not lead to a cable replacement in every case. From the clustering subsystem's standpoint it is a fault, but the TSC engineer engaged on the issue should be able to distinguish between the different subcases in which this fault is reported.
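As a rough aid for telling these subcases apart, the sketch below groups active faults by link. It is a hypothetical helper, not an Oracle tool: the `triage` function and the `(link_name, fault_class)` input format are assumptions made for illustration, and the reading that a fault on only some links points to a link-specific problem is likewise an assumption rather than a statement from this document.

```python
# Link names and fault class as listed in this document.
CLUSTER_LINKS = ("UART0", "UART1", "DLPI0")
FAULT_CLASS = "fault.ak.xmlrpc.cluster.link.down"


def triage(active_faults):
    """Classify cluster link faults.

    active_faults: iterable of (link_name, fault_class) pairs.
    This input format is hypothetical, chosen only for the example.
    """
    down = {
        link
        for link, fault_class in active_faults
        if fault_class == FAULT_CLASS and link in CLUSTER_LINKS
    }
    if not down:
        return "no cluster link fault"
    if down == set(CLUSTER_LINKS):
        # Every link reports the fault: the peer itself is probably
        # unreachable (powered off, not booting), so check it before cabling.
        return "all links down: check peer state first"
    # Assumption: a fault on only some links suggests a link-specific issue.
    return "partial: inspect links " + ", ".join(sorted(down))


faults = [(link, FAULT_CLASS) for link in CLUSTER_LINKS]
print(triage(faults))
```

Either result is only a starting point for investigation; the other alerts and faults reported alongside remain the primary evidence.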
Additional reasons why this situation could arise:
- Power on/off, OS shutdown/restart (now part of the new 8.6 update procedure during a Cluster OS Update)

See also Doc ID 1542550.1 (Sun Storage 7000 Unified Storage System: Communication with the cluster peer via a cluster interconnect link has been lost)

Related bugs:
15538608 - dlpi cluster links flap spuriously

Relevant ASR Event messages:
AK-8002-70 = ereport.ak.xmlrpc.cluster.link.uart0.down
Such an FMA fault should lead to engagement of Oracle Support by opening a Service Request for further assistance. Please include all relevant details and an accurate problem description in the SR notes. If possible, a current supportbundle (from both heads, if this is a cluster system) should also be obtained and uploaded to Oracle.
The following links will provide more information:
Document 1019887.1 - Sun Storage 7000 Unified Storage System: How to collect a supportbundle using the BUI or CLI
Document 2021771.1 - Oracle ZFS Storage Appliance: Software Updates
Attachments
This solution has no attachment