FS System: Improper Controller Motherboard Replacement in an FS1-2 Controller Results in 3rd Controller

Asset ID:	1-72-2017890.1
Update Date:	2016-09-16
Keywords:

Solution Type Problem Resolution Sure

Solution 2017890.1 : FS System: Improper Controller Motherboard Replacement in an FS1-2 Controller Results in 3rd Controller

Applies to:

Oracle FS1-2 Flash Storage System - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

The Oracle FS System Manager GUI shows 3 Controllers and only one of them is working.

Changes

This has only been observed after replacement of a Controller motherboard.

Cause

When a Controller motherboard is replaced, information from the new motherboard is introduced into the FS1 configuration when that Controller attempts to boot. During a transition period, information from both the previous and the replacement Controller exists. Under normal circumstances this information is consolidated and the "third Controller" is removed during the boot process, so only 2 Controllers are seen by the time the boot sequence completes.

If, on the other hand, the boot sequence is not allowed to complete, the third Controller will still exist:

=====================================

Controllers (3)

=====================================
Name                Model               Type          Status
----                -----               ----          ------
/CONTROLLER-01      FS1_CONTROLLER      SAN           NORMAL
/CONTROLLER-02      FS1_CONTROLLER      SAN           --> CRITICAL <--
/CONTROLLER-03      FS1_CONTROLLER      UNKNOWN       --> CRITICAL <--
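
As a rough aid for spotting this condition in a saved copy of the listing above, the following sketch counts the Controller rows and reports any whose last field is not NORMAL. The function name check_controllers is hypothetical and not part of any FS1 tool, and the sketch assumes the listing has exactly the column layout shown above and is supplied on standard input.

```shell
#!/bin/sh
# Hypothetical helper (not an FS1 command): summarize a saved Controller listing.
# Assumes each Controller row starts with "/CONTROLLER-" and that a healthy
# row ends with the word NORMAL, as in the listing shown in this article.
check_controllers() {
    awk '
        /^\/CONTROLLER-/ {
            total++
            # Any row whose last field is not NORMAL is reported by name.
            if ($NF != "NORMAL") bad = bad " " $1
        }
        END {
            printf "controllers=%d\n", total
            if (total > 2) print "WARNING: more than 2 Controllers listed"
            if (bad != "") print "not NORMAL:" bad
        }
    '
}
```

For example, if the listing has been saved to a file, it can be passed in with check_controllers < saved_listing.txt, where saved_listing.txt is a placeholder name.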

The most common cause is that the NVDIMM cables are not connected to the correct ESM connectors on the Disk Backplane. The same phenomenon has also been seen with motherboards that have an incompatible SPBIOS; see Document 2173777.1 FS System: How to Manually Downgrade SPBIOS on an FS1 Controller. Other conditions that prevent the Controller from booting properly may also be responsible.

Solution

Correct the problem based on the Controller Service Label located inside the top cover.
Clear the failure history of the Controller.
Boot the Controller.
Reboot both Pilots.
If a provisioning mismatch System Alert is observed, right-click it and select Accept.

If the problem persists after this, escalate the issue to Engineering.

The following steps were used to recover a system with a 3rd Controller for SR 3-13144950761. They require a system outage and, if attempted, must be performed in conjunction with an escalation to Engineering.

Note: The Controller WWNs in these steps have been sanitized with six X's. The rightmost digit determines the Controller number: a 2 equates to Controller 3.
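
The mapping in the note can be sketched as a small shell function. This is an illustration only: the function name controller_number is hypothetical, and the rule "last digit plus one" is an assumption generalized from the single example in the note (a 2 equates to Controller 3) and from the /etc/nodenames output in this article. Suffixes other than a decimal digit are rejected rather than guessed at.

```shell
#!/bin/sh
# Hypothetical helper: derive the Controller number from a Controller WWN.
# ASSUMPTION: Controller number = rightmost digit + 1 (a 2 equates to Controller 3).
controller_number() {
    wwn=$1
    # POSIX parameter expansion: strip everything except the final character.
    last=${wwn#"${wwn%?}"}
    case $last in
        [0-9]) echo $((last + 1)) ;;
        *)     echo "unexpected WWN suffix: '$last'" >&2
               return 1 ;;
    esac
}
```

Calling controller_number 508002000XXXXXX2 prints 3 under that assumption.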

System restart - Controller 1 & 2 Normal, but Controller 3 still shows missing in GUI and System remains in Critical state:

# fscli system -restart -emergencyPreserveFbm

Attempted to use pcli command to clear Controller 3:

# pcli sub -u pillar -p pillar -H <Public IP of FS1> RemoveController Identity.Id="508002000XXXXXX2"

Controller 3 no longer appears in GUI and System status changed from Critical to Warning.
Pilots still had a Controller 3 entry:

[root@pilot1 ~]# ls -ltr /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDisplayAttributes/
508002000XXXXXX0.xml
508002000XXXXXX1.xml
508002000XXXXXX2.xml

But /etc/nodenames was correct:

[root@pilot1 ~]# cat /etc/nodenames
172.30.80.2 WN2008fffffffffff2 WN2008000101000000 mgmtnode
172.30.80.3 WN2009fffffffffffa
172.30.80.128 WN508002000XXXXXX0 WN2008000101000001
172.30.80.129 WN508002000XXXXXX1
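
The comparison made above, persistence entries on one side and /etc/nodenames on the other, can be sketched as a read-only check. The function name find_stale_entries is hypothetical. The sketch assumes that each persistence file is named <WWN>.xml and that a valid Controller appears in the nodenames file as a whitespace-separated token WN<WWN>, as in the output shown in this article. It only prints candidates and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: print persistence files whose WWN is not in nodenames.
# $1: directory holding <WWN>.xml files
#     (e.g. the ControllerDisplayAttributes directory shown above)
# $2: nodenames file (e.g. /etc/nodenames)
find_stale_entries() {
    dir=$1
    nodenames=$2
    for f in "$dir"/*.xml; do
        # If the glob matched nothing, the literal pattern is skipped here.
        [ -e "$f" ] || continue
        wwn=$(basename "$f" .xml)
        # Exact token match on "WN<wwn>" in any field of the nodenames file.
        if ! awk -v t="WN$wwn" \
            '{ for (i = 1; i <= NF; i++) if ($i == t) found = 1 }
             END { exit !found }' "$nodenames"
        then
            echo "$f"
        fi
    done
}
```

With the directory listing and nodenames content shown above, this would report only the 508002000XXXXXX2.xml file.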

Stop the Pilot configuration service on both Pilots:

[root@pilot1 ~]# service pilotcfg stop
Shutting down pcp_monitor:
Shutting down pilotcfg:

[root@pilot2 ~]# service pilotcfg stop
Shutting down pcp_monitor:
Shutting down pilotcfg:

Remove the erroneous entry from /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDisplayAttributes on both Pilots:

[root@pilot1 ~]# rm /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDisplayAttributes/508002000XXXXXX2.xml

[root@pilot2 ~]# rm /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDisplayAttributes/508002000XXXXXX2.xml

Check the following directories, if present, for Controller 3 entries and remove them:

/var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDiagnosticRecordInfo/
/var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDiagnosticsInProgressCookie/

Delete /var/lib/pillar/pcp/node-info.xml on both Pilots (this is the manual equivalent of clearing the failure history).

Clear the Controller cache on both Controllers 1 & 2:

[root@pilot1 ~]# ssh 172.30.80.128 psg_shutdown -c &
[root@pilot1 ~]# ssh 172.30.80.129 psg_shutdown -c &

Restart the Pilot configuration service on both Pilots:

[root@pilot1 ~]# service pilotcfg start

[root@pilot2 ~]# service pilotcfg start

Monitor the node matrix (/var/log/pcp.log) on the active Pilot.
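
Before running the removal steps above, it may help to list what would be affected. The following read-only sketch prints the files for one Controller WWN in the three persistence directories named in this article. The function name list_controller_entries is hypothetical. The assumption that file names in the two diagnostic directories begin with the WWN is not confirmed by this article; only the ControllerDisplayAttributes naming is shown here.

```shell
#!/bin/sh
# Hypothetical helper: list (never delete) persistence files for one WWN.
# $1: persistence root (this article uses /var/PillarPilotPersistence)
# $2: Controller WWN, e.g. 508002000XXXXXX2
# ASSUMPTION: file names in every directory start with the WWN.
list_controller_entries() {
    root=$1
    wwn=$2
    for sub in \
        com.pillardata.pmi.message.ControllerDisplayAttributes \
        com.pillardata.pmi.message.ControllerDiagnosticRecordInfo \
        com.pillardata.pmi.message.ControllerDiagnosticsInProgressCookie
    do
        # The diagnostic directories may be absent; skip any that are.
        [ -d "$root/$sub" ] || continue
        for f in "$root/$sub/$wwn"*; do
            if [ -e "$f" ]; then
                echo "$f"
            fi
        done
    done
    return 0
}
```

Running it on both Pilots and reviewing the output with Engineering gives a record of the entries before anything is removed.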

Attachments

This solution has no attachment