Sun Storage 7000 Unified Storage System: Communication with the cluster peer via a cluster interconnect link has been lost

Asset ID:	1-72-1542550.1
Update Date:	2018-05-02

Solution Type: Problem Resolution Sure

Solution 1542550.1 : Sun Storage 7000 Unified Storage System: Communication with the cluster peer via a cluster interconnect link has been lost

Applies to:

Oracle ZFS Storage ZS3-4 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-BA - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS4-4 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage Appliance Racked System ZS4-4 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7320 - Version All Versions to All Versions [Release All Releases]
7000 Appliance OS (Fishworks)

Symptoms


NOTE: To confirm that the cluster 'links' cabling is correctly configured, see Document ID 2081179.1.

The alert below may be seen on a 7000 Series ZFS Storage Appliance running in a cluster configuration.

Issue reported on AK firmware version: 2011.1.x

SUNW-MSG-ID: AK-8001-RK, TYPE: alert, VER: 1, SEVERITY: Minor
EVENT-TIME: Thu Jan 31 15:42:59 2013
PLATFORM: i86pc, CSN: <serialno>, HOSTNAME: <hostname>
SOURCE: svc:/appliance/kit/akd:default, REV: 1.0
EVENT-ID: 665840e4-94f8-6516-eb1a-f90c0edd0c59
DESC: Communication with the cluster peer via a cluster interconnect link has been lost.
AUTO-RESPONSE: None.
IMPACT: Cluster reliability is impaired. If the cluster peer is functioning normally but no cluster interconnects remain active, arbitrary and unwanted cluster takeover may occur.
REC-ACTION: Check the cluster interconnect cables and the state of the cluster peer. Contact your vendor for support if an interconnect link remains inexplicably down.

However, no takeover, reboot, or other cluster interconnect related issue is seen.

The peer head appears to be working normally.

PLEASE NOTE: After upgrade to 2013.1.6.0, 'cluster link down' alerts are reported even on a 'normal' reboot.

See MOS Doc ID 2195659.1 (See also - Bug 23092294 clustron component fault shows up in problems while links are still active)

Changes

Cause

Please check if any support bundle was being generated on the cluster peer head at the time the alert was generated.

One possible cause is 'gcore' taking more than 30 seconds to collect the akd core while a support bundle is being generated.

Collecting a support bundle (or a manual 'gcore' executed by a Technical Support Engineer) may trigger this alert on the peer head.

Support bundle collection runs 'gcore' against the akd process. gcore freezes akd so that it can capture a consistent memory image while creating the core file.

Once the core file is created, gcore unfreezes akd so that normal operation can resume.

Heartbeats over the dlpi (Ethernet) links are stopped while akd is frozen. Serial port heartbeats continue through the clustron kernel driver and are not affected by the akd process.

The peer head notices that heartbeats on the dlpi link have stopped and, after cio_alert_delay (30 seconds by default), posts an alert (alert.ak.xmlrpc.cluster.link.down).
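
The timing relationship described above can be sketched as a small shell check. This is an illustration only, not appliance code: the function name link_down_alert_expected and its arguments are hypothetical, and the 30-second default is simply the cio_alert_delay default quoted above.

```shell
# Hypothetical helper, not part of the appliance software.
# Arguments: $1 = seconds that akd stayed frozen by gcore
#            $2 = alert delay in seconds (optional; defaults to 30,
#                 the cio_alert_delay default quoted in this note)
link_down_alert_expected() {
    frozen=$1
    delay=${2:-30}
    # dlpi heartbeats stop for as long as akd is frozen. The peer is
    # expected to post the link-down alert only when that silence
    # lasts longer than the alert delay.
    if [ "$frozen" -gt "$delay" ]; then
        echo "alert"
    else
        echo "no alert"
    fi
}
```

For example, link_down_alert_expected 45 reports that a 45-second gcore would be expected to raise the alert on the peer, while a 10-second gcore would not.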

Bug:15726659 SUNBT7063308 dlpi clustron2 links flap spuriously under load
Bug:15538608 SUNBT6799505 dlpi cluster links flap spuriously
Bug:16083259 - gcore of akd can cause alert.ak.xmlrpc.cluster.link.down alert on peer node

When you need to collect an akd core file on a live system, we now recommend the following procedure. Writing large files to /tmp is much faster than writing to disk, so the lock is held for much less time.

cd /tmp; gcore -o akd `pgrep -ox akd` && mv /tmp/akd.`pgrep -ox akd` /var/ak/dropbox
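
The same steps can also be written so that the akd PID is looked up only once and the move is skipped if any earlier step fails. The following is a minimal sketch under stated assumptions, not an Oracle-supplied script: the function name collect_akd_core and the DRY_RUN switch are hypothetical, and only the gcore, pgrep and mv invocations come from the command above.

```shell
# Sketch only: function name and DRY_RUN switch are illustrative.
# Argument: $1 = PID of akd (for example from: pgrep -ox akd)
# The body runs in a subshell so the caller's working directory is unchanged.
collect_akd_core() (
    pid=$1
    run=""
    # With DRY_RUN=1 the commands are printed instead of executed.
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    # gcore -o akd writes ./akd.<pid>, so stage the core in /tmp first.
    cd /tmp || exit 1
    $run gcore -o akd "$pid" || exit 1
    # Move the finished core to the dropbox only after gcore has completed.
    $run mv "/tmp/akd.$pid" /var/ak/dropbox
)
```

A typical call would be collect_akd_core "$(pgrep -ox akd)". Running it first with DRY_RUN=1 prints the two commands without touching akd.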

Solution

If the alert is identified as being caused by the support bundle or gcore activity described in the Cause section, it is known not to affect cluster functionality or reliability and can be ignored.
This issue is fixed in Appliance firmware 2013.1.6.2 and later.

If you see these alerts and they were not triggered by support bundle collection, contact Oracle Support to diagnose the issue further.

Bug 16083259 was closed as a duplicate of Bug 21224255, which is fixed in 2013.1.6.0.

References

<BUG:16083259> - GCORE OF AKD SLOWER IN ZFS. CAUSES ALERT.AK.XMLRPC.CLUSTER.LINK.DOWN ON PEER
<NOTE:1402545.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems
<NOTE:2021771.1> - Oracle ZFS Storage Appliance: Software Updates
<BUG:21224255> - UNRESOLVED CLUSTRON LINK DOWN STATE SHOULD BE TRACKED AS A FMA FAULT
<NOTE:2262465.1> - Oracle ZFS Storage Appliance: Reboot "Unexpectedly found SAS zone locks held"
