FS System: Controller Critical After Restart

Asset ID:	1-72-2098654.1
Update Date:	2018-01-04
Keywords:

Solution Type Problem Resolution Sure

Solution 2098654.1 : FS System: Controller Critical After Restart

Applies to:

Oracle FS1-2 Flash Storage System - Version 6.1 to 6.2 [Release 6.1 to 6.2]
Information in this document applies to any platform.

Symptoms

The Controller fails to boot normally and may eventually become disabled. The GUI shows the affected Controller as red, critical, and disabled. Subsequent power cycle attempts fail, even after clearing the failure history. To confirm the issue, examine the EEL logs.

EEL logs can be found in the slammer directory of a log bundle if the system has cold started.
EEL logs of a problem Controller can be found in the slammer directory of a log bundle if the Controller is unresponsive.
EEL logs can also be gathered manually from the ILOM:
-> show /SP/logs/event/list

Look for the mrc event in the SP Event logs:

# find . -name "Event*" | xargs grep -Hn mrc-failed
./Event_buddy_20160118190744_np.log:438: Fault fault.chassis.domain.boot.intel.mrc-failed on component /SYS cleare
./Event_buddy_20160118190744_np.log:449: /SYS has fault.chassis.domain.boot.intel.mrc-failed with probability=100
./508002000158BE91/var_log/eel/Event_buddy_20160118183629_:420: Fault fault.chassis.domain.boot.intel.mrc-failed on component /SYS cleare
./508002000158BE91/var_log/eel/Event_buddy_20160118183629_:431: /SYS has fault.chassis.domain.boot.intel.mrc-failed with probability=100
./508002000158BE91/var_log/eel/Event_buddy_20160118190744_np.log:438: Fault fault.chassis.domain.boot.intel.mrc-failed on component /SYS cleare
./508002000158BE91/var_log/eel/Event_buddy_20160118190744_np.log:449: /SYS has fault.chassis.domain.boot.intel.mrc-failed with probability=100

Using the line number from the grep output (438 in this example), open the file in vi at that line to see the EEL data with the mrc-failed error:

# vi +438 ./Event_buddy_20160118190744_np.log
1006 Mon Jan 18 13:34:07 2016 Fault Repair minor
     Fault fault.chassis.domain.boot.intel.mrc-failed on component /SYS cleare
     d
1005 Mon Jan 18 13:34:07 2016 HOST Log minor
     IPMI sensor HOST/STATUS state transition to 0x0
1004 Mon Jan 18 13:34:07 2016 HOST Log critical
     Critical Alarm is On.
1003 Mon Jan 18 13:34:07 2016 Power Cycle major
     /SYS has been cycled by: SP, Reason: Fault, UUID:2bad2c41-654a-c52e-fc83
     -db5f537fae24
1002 Mon Jan 18 13:34:01 2016 Fault Fault critical
     Fault detected at time = Mon Jan 18 13:34:01 2016. The suspect component
     /SYS has fault.chassis.domain.boot.intel.mrc-failed with probability= 100
     . Refer to http://www.sun.com/msg/SPX86-8002-5C for details.
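
As a non-interactive alternative to opening the file in vi, the lines around a grep hit can be printed directly. The following is a minimal sketch, not part of the original procedure; the function name show_context and its default context of 6 lines are assumptions chosen for illustration.

```shell
# show_context FILE LINE [CONTEXT]
# Print LINE plus CONTEXT lines on either side of it (default 6),
# each prefixed with its line number. Hypothetical helper, not an Oracle tool.
show_context() {
    file=$1
    line=$2
    ctx=${3:-6}
    start=$((line - ctx))
    # Clamp the window so it never starts before the first line.
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    end=$((line + ctx))
    awk -v s="$start" -v e="$end" '
        NR >= s && NR <= e { printf "%d: %s\n", NR, $0 }
        NR > e             { exit }
    ' "$file"
}
```

For example, "show_context ./Event_buddy_20160118190744_np.log 438" would print lines 432 through 444 of that log, which is convenient when the output needs to be pasted into a service request.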
Then check the EEL* logs (host console logs) for a timestamp about 4 minutes before the mrc-failed timestamp. The entries at that time should show that the Controller booted normally. When the mrc event hits, the Controller reboots, so that reboot should also appear in the EEL* log.

1001 Mon Jan 18 13:29:33 2016 HOST Log minor
     Critical Alarm is Off.
1000 Mon Jan 18 13:29:33 2016 HOST Log minor
     IPMI sensor HOST/STATUS state transition to 0x40
999 Mon Jan 18 13:29:33 2016 IPMI Log minor
     ID = 2fd : 01/18/2016 : 13:29:33 : System Firmware Progress : SMI Handle
     r : System boot initiated : Asserted
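
The gap between the boot event and the mrc-failed fault can be checked with a small calculation rather than by eye. This is a sketch only: it assumes GNU date (for the -d option), and the function name seconds_between is hypothetical. Both timestamps are interpreted as UTC so that a daylight saving change on the analysis host does not skew the difference.

```shell
# seconds_between START END
# Print END minus START in seconds. Both arguments use the ILOM event log
# timestamp layout, for example "Mon Jan 18 13:29:33 2016".
# Assumes GNU date; hypothetical helper, not an Oracle tool.
seconds_between() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $((end - start))
}
```

With the events shown above, seconds_between "Mon Jan 18 13:29:33 2016" "Mon Jan 18 13:34:01 2016" prints 268, which is roughly four and a half minutes between "System boot initiated" and the detected fault.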

Changes

This issue could potentially happen during any Controller reboot such as after an ESM replacement.

Cause

NVDIMM image corruption due to an ILOM code timing issue. The root cause is a known issue in the ILOM code and will be fixed in version 3.2.4.58. To check the current version, run the "version" command from the ILOM.

-> version
SP firmware 3.2.4.42
SP firmware build number: 99377
SP firmware date: Wed Apr 29 18:07:29 CST 2015
SP filesystem version: 0.2.10
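
If the "version" output has been saved to a file, the comparison against 3.2.4.58 can be scripted. The sketch below assumes GNU sort (for the -V version sort option); the function names version_lt and sp_firmware_version are hypothetical and are not ILOM or fscli commands.

```shell
# version_lt A B
# Succeed (exit status 0) when dotted version A is strictly older than B.
# Assumes GNU sort with -V; hypothetical helper.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# sp_firmware_version
# Read saved ILOM "version" output on stdin and print the version number
# from the "SP firmware X.Y.Z.W" line, skipping the build number and date lines.
sp_firmware_version() {
    awk '$1 == "SP" && $2 == "firmware" && $3 ~ /^[0-9]+(\.[0-9]+)+$/ { print $3; exit }'
}
```

A possible use, where version.txt is an assumed file holding the captured output:

```shell
current=$(sp_firmware_version < version.txt)
if version_lt "$current" 3.2.4.58; then
    echo "SP firmware $current is older than 3.2.4.58"
fi
```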

Solution

The solution is to clear the Fbm memory of the affected Controller. There are three methods to do this, listed in order of preference, balancing ease of execution against impact to the system.

Method 1: SSH to Pilot

NOTE: Due to the sensitivity of the commands used in this method, Engineering approval is required to execute it.

Clear the Controller failure history using fscli. Reference <Document 2093580.1> FS System: How to Clear the Controller Failure History:

# fscli controller -reenable -controller <Controller FQN or unique identifier(ID)>
SSH to the active Pilot. Reference <Document 2029847.1> FS System How to Enable SSH Access to the Pilot.
Determine the IP address of the problem Controller:

[root@pilot2 ~]# cat /etc/nodenames
172.30.80.3 WN2009fffffffffffa WN2008000101000000 mgmtnode
172.30.80.129 WN508002000158ba51 WN2008000101000001
172.30.80.2 WN2008fffffffffff2
[root@pilot2 ~]#

NOTE: Controller IP addresses are 172.30.80.128 and 172.30.80.129 for Controllers 1 and 2, respectively. In the above output, IP address 172.30.80.128 is missing, indicating that Controller 1 is the Controller that won't boot.
Using the IP address of the problem Controller, run the psg_shutdown utility:

[root@pilot2 ~]# ssh 172.30.80.128 psg_shutdown -c
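
The check against /etc/nodenames described in the NOTE above can be sketched as a small read-only function. It relies only on the two Controller addresses stated in this document and on the IP address being the first column of each line; the function name missing_controller_ips is hypothetical. It reports the missing address and does not run psg_shutdown.

```shell
# missing_controller_ips FILE
# Print each expected Controller IP address that has no entry in FILE,
# where FILE uses the /etc/nodenames layout (IP address in the first column).
# Hypothetical helper; it only reads the file.
missing_controller_ips() {
    for ip in 172.30.80.128 172.30.80.129; do
        if ! awk -v ip="$ip" '$1 == ip { found = 1 } END { exit !found }' "$1"; then
            echo "$ip"
        fi
    done
}
```

Run as "missing_controller_ips /etc/nodenames" on the Pilot. With the sample output shown above, it would print 172.30.80.128.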

Method 2: Use the fscli (requires an outage)

NOTE: Due to the sensitivity of the commands used in this method, Engineering approval is required to execute it.

Clear the Controller failure history using fscli. Reference <Document 2093580.1> FS System: How to Clear the Controller Failure History.

# fscli controller -reenable -controller <Controller FQN or unique identifier(ID)>
Use FSCLI to issue an emergency restart and clear Fbm. While logged in as pillar or as an administrator with the support role, issue the following command:

# fscli.exe system -restart -emergencyClearFbm

Method 3: Access the Controller BIOS.

NOTE: Due to the sensitivity of the commands used in this method, Engineering approval is required to execute it.

NOTE: Only use this method if the first two methods are not possible. See <Document 2070735.1> FS System: How to Access FS1-2 ILOMs Using a Serial Connection. The SP LED on the front of the Controller must be solid, indicating that the ILOM has booted.

Clear the Controller failure history using fscli. Reference <Document 2093580.1> FS System: How to Clear the Controller Failure History:

# fscli controller -reenable -controller <Controller FQN or unique identifier(ID)>
Using the initial steps in Document 2070735.1, log in at the console prompt:

ORACLESP-1315FM2009 login: root <------------------------------------- Login as user "root"
Password: changeme <----------------------------------------------------Password is "changeme"

Oracle(R) Integrated Lights Out Manager

Version 3.1.2.40 r93718

Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.

Warning: password is set to factory default.

->
Start the host console:

-> start /host/console <---------------------------------------------------------- Type "start /host/console" here.
Are you sure you want to start /HOST/console (y/n)? y <--------------------------- Select "y" here
Serial console started. To stop, type ESC
****Press enter a couple of times to get console prompt****
WN508002000158BA50 #
Restart the Controller:

WN508002000158BA50 # reboot
While the Controller is rebooting, watch for all of the NVD (NVDIMM portion of the BIOS) messages to scroll by. After a couple of minutes, a BIOS screen message appears about pressing function keys (F2, F8, F12) to enter Setup, as on a normal PC. However, because this is a serial connection, use "ctrl-e" to enter Setup: press the "Control" and "E" keys simultaneously when this message appears.
Once the BIOS is accessed, press "ctrl-u" to enter "expert mode". This is a hidden and undocumented feature.
Using the arrow keys, navigate to the "Memory" menu, where one of the options should be "NVDIMM Release Images". Select this option and choose "enabled". Enabling this option forces all subsequent boots to be PO (Power On without data recovery), so it must be disabled again once the node has cleared memory. In other words, let the Controller boot up to the host login prompt, then type "reboot" or "psg_shutdown -r":

WN508002000158BA50 # reboot
The Controller will now reboot. During boot, break into the BIOS as before, change the "NVDIMM Release Images" option back to "disabled", then save and exit. Now let the Controller boot normally.

References

<BUG:22561997> - FS1-2, CONTROLLER-01 CRITICAL, WILL NOT BOOT AFTER FCO ESM REPLACMENTS
<BUG:22301155> - NEED 3.2.4.58 SP TO FIX MRC EVENT BUG SEEN AFTER ESM REPLACEMENT

Attachments

This solution has no attachment