Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2017890.1
Update Date: 2016-09-16
Keywords:

Solution Type  Problem Resolution Sure

Solution  2017890.1 :   FS System: Improper Controller Motherboard Replacement in an FS1-2 Controller Results in 3rd Controller  


Related Items
  • Oracle FS1-2 Flash Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx

In this Document
Symptoms
Changes
Cause
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Improper FRU replacement is the only way this will be seen
Created from <SR 3-10847218241>

Applies to:

Oracle FS1-2 Flash Storage System - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

The Oracle FS System Manager GUI shows 3 Controllers, only one of which is working.

Changes

This has only been observed after replacement of a Controller motherboard.

Cause

When a Controller motherboard is replaced, information from the new motherboard is introduced into the FS1 configuration when that Controller attempts to boot.  During a transition period, information from both the previous and the replacement Controllers exists.  Under normal circumstances this information is consolidated and the "third Controller" is removed during the boot process, so only 2 Controllers are seen by the time the boot sequence completes.

If, on the other hand, the boot sequence is not allowed to complete, the third Controller will still exist:

=====================================

Controllers (3)

=====================================
Name                Model               Type          Status
----                -----               ----          ------
/CONTROLLER-01      FS1_CONTROLLER      SAN           NORMAL
/CONTROLLER-02      FS1_CONTROLLER      SAN           --> CRITICAL <--
/CONTROLLER-03      FS1_CONTROLLER      UNKNOWN       --> CRITICAL <--

The most common cause is NVDIMM cables that are not connected to the correct ESM connectors on the Disk Backplane, but the same phenomenon has also been seen with motherboards that have an incompatible SPBIOS. Please see Document 2173777.1 FS System: How to Manually Downgrade SPBIOS on an FS1 Controller. There may also be other conditions that prevent the Controller from booting properly.
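
When a Controller listing like the one shown above has been saved to a file, the abnormal entries can be picked out with a small filter. The following is a minimal sketch, not an Oracle-supplied tool: the function name list_abnormal_controllers and the file name controllers.txt are hypothetical, and it assumes that every Controller row starts with "/CONTROLLER-" and that the status is the last column once the "-->" and "<--" markers are stripped.

```shell
#!/bin/sh
# Hypothetical helper: print "<name> <status>" for every Controller row
# whose status is not NORMAL. Reads the saved listing on standard input.
# ASSUMPTION: rows begin with "/CONTROLLER-" and the status is the last
# column after the "-->" / "<--" highlight markers are removed.
list_abnormal_controllers() {
    sed -e 's/-->//g' -e 's/<--//g' | awk '
        $1 ~ /^\/CONTROLLER-/ && $NF != "NORMAL" { print $1, $NF }
    '
}
```

Usage, assuming the listing was saved as controllers.txt: list_abnormal_controllers < controllers.txt. A third row with Type UNKNOWN in the result is the symptom described in this document.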

Solution

  1. Correct the problem based on the Controller Service Label located inside the top cover.
  2. Clear the failure history of the Controller.
  3. Boot the Controller.
  4. Reboot both Pilots.
  5. If a provisioning mismatch System Alert is observed, right-click on it and Accept it.

If the problem persists after this, escalate the issue to Engineering.

The following steps were used to recover a system with a 3rd Controller for SR 3-13144950761.  They require a system outage and, if attempted, must be done in conjunction with an escalation to Engineering.

Note: The Controller WWNs in these steps have been sanitized with six X's.  The rightmost digit determines the Controller number.  A 2 equates to Controller 3.
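
The mapping in the note above can be written as a one-line rule. This is only a sketch under an assumption: the note confirms that a rightmost digit of 2 means Controller 3, and the function below extrapolates that to "Controller number = rightmost digit + 1". The function name wwn_to_controller_number is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive the Controller number from a Controller WWN.
# ASSUMPTION: Controller number = rightmost digit + 1. The source note only
# confirms the case 2 -> Controller 3; other digits are extrapolated.
wwn_to_controller_number() {
    wwn=$1
    last=${wwn#"${wwn%?}"}    # rightmost character of the WWN
    case $last in
        [0-9]) echo $((last + 1)) ;;
        *) echo "unsupported WWN suffix: '$last'" >&2; return 1 ;;
    esac
}
```

For example, wwn_to_controller_number 508002000XXXXXX2 prints 3, which matches the note.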

 

  1. Restart the system. Controllers 1 & 2 were Normal, but Controller 3 still showed as missing in the GUI and the System remained in a Critical state:
    # fscli system -restart -emergencyPreserveFbm
     
  2. Attempted to use pcli command to clear Controller 3:
    # pcli sub -u pillar -p pillar -H <Public IP of FS1> RemoveController Identity.Id="508002000XXXXXX2"
     
    Controller 3 no longer appears in GUI and System status changed from Critical to Warning.
     
  3. Pilots still had a Controller 3 entry:
    [root@pilot1 ~]# ls -ltr /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDisplayAttributes/
    508002000XXXXXX0.xml
    508002000XXXXXX1.xml
    508002000XXXXXX2.xml


    But /etc/nodenames was correct:
    [root@pilot1 ~]# cat /etc/nodenames
    172.30.80.2 WN2008fffffffffff2 WN2008000101000000 mgmtnode
    172.30.80.3 WN2009fffffffffffa
    172.30.80.128 WN508002000XXXXXX0 WN2008000101000001
    172.30.80.129 WN508002000XXXXXX1
     
  4. Stop the Pilot configuration service on both Pilots:
    [root@pilot1 ~]# service pilotcfg stop
    Shutting down pcp_monitor:
    Shutting down pilotcfg:

    [root@pilot2 ~]# service pilotcfg stop
    Shutting down pcp_monitor:
    Shutting down pilotcfg:
     
  5. Remove the erroneous entry from /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDisplayAttributes from both Pilots:
    [root@pilot1 ~]# rm /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDisplayAttributes/508002000XXXXXX2.xml

    [root@pilot2 ~]# rm /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDisplayAttributes/508002000XXXXXX2.xml
     
  6. Check the following directories for Controller 3 entries and, if any are present, remove them:
    /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDiagnosticRecordInfo/
    /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDiagnosticsInProgressCookie/
     
  7. Delete /var/lib/pillar/pcp/node-info.xml on both Pilots (this is the manual equivalent of clearing the failure history).
  8. Clear the Controller cache on both Controllers 1 & 2:
    [root@pilot1 ~]# ssh 172.30.80.128 psg_shutdown -c &
    [root@pilot1 ~]# ssh 172.30.80.129 psg_shutdown -c &
     
  9. Restart Pilot configuration service on both Pilots:
    [root@pilot1 ~]# service pilotcfg start

    [root@pilot2 ~]# service pilotcfg start
     
  10. Monitor the node matrix (/var/log/pcp.log) on the active Pilot.
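
Before removing anything in steps 5 and 6, it can help to list which persistence entries have no counterpart in /etc/nodenames, as observed in step 3. The following is a read-only sketch, not part of the documented procedure: the function name find_stale_controller_entries is hypothetical, and it assumes that a file named <WWN>.xml corresponds to a WN<WWN> token in /etc/nodenames, which is how the two listings in step 3 line up. It prints candidates only and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: print every <WWN>.xml file in a persistence directory
# whose WWN has no matching "WN<WWN>" token in the nodenames file.
# Read-only: nothing is removed.
# ASSUMPTION: file names are Controller WWNs and nodenames prefixes them
# with "WN", as in the step 3 listings.
find_stale_controller_entries() {
    persist_dir=$1    # directory holding the <WWN>.xml files
    nodenames=$2      # path to the nodenames file
    for f in "$persist_dir"/*.xml; do
        [ -e "$f" ] || continue             # no .xml files present
        wwn=$(basename "$f" .xml)
        if ! grep -qwF -- "WN$wwn" "$nodenames"; then
            echo "$f"
        fi
    done
}
```

Usage on a Pilot: find_stale_controller_entries /var/PillarPilotPersistence/com.pillardata.pmi.message.ControllerDisplayAttributes /etc/nodenames. Any file it reports should still be confirmed with Engineering before removal, since the whole procedure requires an escalation.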

 


  Copyright © 2018 Oracle, Inc.  All rights reserved.