
Asset ID: 1-72-2331228.1
Update Date: 2017-11-30
Solution Type: Problem Resolution (Sure Solution)

Solution  2331228.1 :   VSM7 - Nodes rebooting due to internal network communication failure "failing_proc_for_rnf:Cluster failed to send a remote node failure to process"  


Related Items
  • StorageTek Virtual Storage Manager System 7 (VSM7)
Related Categories
  • PLA-Support>Sun Systems>TAPE>Virtual Tape>SN-TP: VSM7




In this Document
Symptoms
Changes
Cause
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Available to partners and Oracle
Created from <SR 3-15569293110>

Applies to:

StorageTek Virtual Storage Manager System 7 (VSM7) - Version 7.0.0 to 7.1.2 [Release 7.0]
Information in this document applies to any platform.

Symptoms

FSC0010000F_CLUSTER_FAILED_LOCKING1

failing_proc_for_rnf:Cluster failed to send a remote node failure to process:59020 2249

FSC00031001_SERVER_TERMINATED Process:lock

One of the nodes will not join the cluster and come online

FAILURE: IPMP2 (ngxe2/igb10/ixgbe5) cable removed/disconnected, link down.

Changes

 None

Cause

Nodes rebooting because one of the net3 port links goes down can be the result of a failed CAT-6 cable or a failed motherboard (MB) Network Interface Card (NIC), since net3 resides on the MB. The net3 cable should be checked for seating, or replaced, before performing any further fault isolation.  The net3 ports on the VSM7 are a direct node-to-node connection and do not connect to the Oracle switch.

In the scenario below, the net3 port link down condition was intermittent; the link would run fine at 1G but would fail at 10G.  The following steps were taken to isolate the issue.

1. To identify whether net3 is causing the node reboots, grep for ixgbe3 or net3 link down events in the messages files with the following commands.  These events can also be found in the Support File Bundle (SFB) at /system_logfiles/var_adm_messages.txt, or in the statesave messages log from the time the event occurred.  If possible, log in to each node and grep the /var/adm/messages logs for link down messages. Older messages logs may need to be obtained and reviewed, depending on when the events occurred.

grep "ixgbe3 link down" /var/adm/messages

and

grep "net3 link down" /var/adm/messages

 

If the ixgbe3 port is failing, the following message(s) will be seen in the messages logs for both nodes.

Sep 1 10:10:27 vsmpriv1 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down

 

The complete link down event sequence, as seen in the messages log in the SFB /system_logfiles/var_adm_messages.txt:

Sep 1 10:10:27 vsmpriv1 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
Sep 1 10:10:27 vsmpriv1 in.mpathd[98]: [ID 215189 daemon.error] The link has gone down on net3
Sep 1 10:10:27 vsmpriv1 in.mpathd[98]: [ID 968981 daemon.error] IP interface failure detected on net3 of group sc_ipmp0
Sep 1 10:10:27 vsmpriv1 mac: [ID 486395 kern.info] NOTICE: vlanIC1 link down
Sep 1 10:10:27 vsmpriv1 in.mpathd[98]: [ID 215189 daemon.error] The link has gone down on net3
Sep 1 10:10:27 vsmpriv1 in.mpathd[98]: [ID 968981 daemon.error] IP interface failure detected on net3 of group sc_ipmp0
Sep 1 10:10:27 vsmpriv1 in.routed[618]: [ID 238047 daemon.warning] interface vlanIC1 to 172.16.0.193 turned off
Sep 1 10:10:27 vsmpriv1 in.routed[618]: [ID 238047 daemon.warning] interface vlanIC1 to 172.16.0.193 turned off

Events reported in the major events log

2017/09/06 09:06:28.187138 2269:I:VSM:000f:scvsm_utils_rs.c:1008:failing_proc_for_rnf:Cluster failed to send a remote node failure to process:59020 2249
2017/09/06 09:06:28.188723 2249:E:VSM:0001:lock_server_utils.c:96:SNO:FSC00031001_SERVER_TERMINATED Process:lock

 

2. If link down events are occurring, check the net3 port cable seating on both nodes and/or replace the net3 cable (7069031 - Ethernet Cable, Category 6A, RJ45 to RJ45, 10-Foot, Black).  In theory, reseating or replacing the net3 cable should not require any system downtime, as net5 is in the same IPMP group; however, it is highly recommended to have the VTSS placed offline before reseating, replacing, or reconfiguring the net3 cable(s), as the nodes could otherwise reboot.

Steps to perform prior to reseating and/or replacing the net3 cable.

a) Verify the customer has varied the VTSS offline.

b) Connect to Node 1 (vsmpriv1) and run the following command:

$ /opt/vsm/bin/killtikka.pl

Note: This stops the VSM processes on both nodes.

c) Verify all processes have stopped by issuing the following command (this can take 15+ minutes):

$ /usr/cluster/bin/scstat -g

Note: The command returns no output once all processes have stopped.

d) Reseat and/or replace the cable.

e) Log into VSM7 node 1 and reboot it with the following command. The same command with the same parameters (resetNode 2 1) will be run on node 2 once node 1 is verified to be Fully Operational.

resetNode 2 1

f) Verify node 1 is Fully Operational by issuing the following command.

scstat | grep vsm[12]-rs

g) Once node 1 is Fully Operational, log into VSM7 node 2 and reboot with the following command.

resetNode 2 1

h) Verify both nodes are Fully Operational by issuing the following command.

scstat | grep vsm[12]-rs

i) Run the following command on both nodes to confirm the net3 link is up and running at 10G (10000).

dladm show-phys
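The net3 line in the output is expected to resemble the following (illustrative only, based on the 1G example shown in step 3d, with the speed reported as 10000 instead of 1000):

net3 Ethernet up 10000 full ixgbe3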

j) Let this run for a few hours or a day and then recheck the messages logs for link down events on both nodes.

 

3. As a temporary test and/or fix, the port speed can be changed to run at 1G, as follows, to see whether the ixgbe3 link stops going down.

a) Check the net3 port link properties (linkprop) as shown.

sudo dladm show-linkprop net3 | grep en_ | grep fdx

vsmadm@vsmpriv1:~$ sudo dladm show-linkprop net3 | grep en_ | grep fdx
net3 en_10gfdx_cap rw 1 1 1 1,0
net3 en_1000fdx_cap rw 1 1 1 1,0
net3 en_100fdx_cap rw 1 1 1 1,0
net3 en_10fdx_cap -- -- -- 0 1,0

b) Check the IPMP group as shown. This is the normal state.

ipmpstat -i

INTERFACE ACTIVE GROUP FLAGS LINK PROBE STATE
net5 no sc_ipmp0 is----- up disabled ok
net3 yes sc_ipmp0 --mbM-- up disabled ok

c) Run the following command to turn off the net3 port 10G speed capability. To re-enable 10G, change the cap value back to 1 (sudo dladm set-linkprop -p en_10gfdx_cap=1 net3).  This should be performed on both nodes for consistency, but it will need to be changed back to 10G once testing is complete.

sudo dladm set-linkprop -p en_10gfdx_cap=0 net3
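Optionally, the property change can be confirmed before checking the negotiated speed. A minimal check, assuming the same linkprop name shown in step 3a:

sudo dladm show-linkprop -p en_10gfdx_cap net3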

d) Run the dladm command below to verify the net3 port speed is running at 1G (1000) as shown.

dladm show-phys net3

net3 Ethernet up 1000 full ixgbe3

e) Recheck the IPMP group.

ipmpstat -i

INTERFACE ACTIVE GROUP FLAGS LINK PROBE STATE
net5 no sc_ipmp0 is----- up disabled ok
net3 yes sc_ipmp0 --mbM-- up disabled ok

f) Let this run at 1G for a few hours or a day and then recheck the messages logs for link down events on both nodes. 

 

4. Additional testing and fault isolation may be necessary if the cable seating or replacement does not resolve the issue.

If setting the net3 ports to 1G proves successful, it may be necessary to use ports 11 and 12 on Oracle switch 2 to connect the net3 ports for further fault isolation. The replacement CAT-6 cable from step 2 can be used; since replacing it did not resolve the issue, it should be a good cable. The end result of the fault isolation will have the node 1 net3 port connected to switch port 11 and the node 2 net3 port connected to switch port 12. The switch ports should already be configured from manufacturing.

a) Verify the customer has varied the VTSS offline and the second net3 CAT-6 cable is available.

b) Connect to Node 1 (vsmpriv1) and run the following command:

$ /opt/vsm/bin/killtikka.pl

Note: This stops the VSM processes on both nodes.

c) Verify all processes have stopped by issuing the following command (this can take 15+ minutes):

$ /usr/cluster/bin/scstat -g

Note: The command returns no output once all processes have stopped.

d) Disconnect the node 2 end of the existing CAT-6 cable from the node 2 net3 port and connect it to port 11 on Oracle switch 2. This completes the node 1 net3 to switch port 11 connection.

e) Take the second CAT-6 cable, connect one end to the node 2 net3 port, and connect the other end to Oracle switch 2 port 12. This completes the node 2 net3 to switch port 12 connection.

f) Log into VSM7 node 1 and reboot it with the following command. The same command with the same parameters (resetNode 2 1) will be run on node 2 once node 1 is verified to be Fully Operational.

resetNode 2 1

g) Verify node 1 is Fully Operational by issuing the following command.

scstat | grep vsm[12]-rs

h) Once node 1 is Fully Operational, log into VSM7 node 2 and reboot with the following command.

resetNode 2 1

i) Verify both nodes are Fully Operational by issuing the following command.

scstat | grep vsm[12]-rs

j) Run the following command on both nodes to confirm the net3 link is up and running at 10G (10000).

dladm show-phys

k) Turn the VSM7 over to the customer to have them vary the VTSS back online.

 

5. After reconfiguring the net3 ports to Oracle switch 2, make note of the date and time for reference.  Then, after a few hours or a day, recheck each node's messages logs for link down events.

grep "ixgbe3 link down" messages

grep "net3 link down" messages

Node 2 messages log

Nov 21 02:29:38 vsmpriv2 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
Nov 21 02:48:16 vsmpriv2 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
Nov 21 03:17:12 vsmpriv2 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
Nov 21 03:35:55 vsmpriv2 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
Nov 21 04:15:18 vsmpriv2 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
Nov 21 04:21:08 vsmpriv2 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
Nov 21 06:34:21 vsmpriv2 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down
Nov 21 07:02:18 vsmpriv2 mac: [ID 486395 kern.info] NOTICE: ixgbe3 link down

 

6. If link down events are still being seen, but only on one node, then the MB for that node needs to be replaced.  If they are seen on both nodes, further investigation needs to be performed.

 

 

 

Solution

Reseat the net3 CAT-6 cable

Replace the net3 CAT-6 cable

Replace the MB

References

<BUG:26650397> - FSC0010000F_CLUSTER_FAILED_LOCKING1
<NOTE:2067247.1> - How To Run Explorer on a VLE, VSM6 or VSM7
<NOTE:1510362.1> - VSM6, VSM7 - How to initiate a manual ASR to collect Support File Bundle (SFB)
<NOTE:1586216.1> - VSM6 - How to collect additional diagnostic statesave files

Attachments
This solution has no attachment