Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2098654.1
Update Date:2018-01-04
Keywords:

Solution Type  Problem Resolution Sure

Solution  2098654.1 :   FS System: Controller Critical After Restart  


Related Items
  • Oracle FS1-2 Flash Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx




In this Document
Symptoms
Changes
Cause
Solution
 Method 1: SSH to Pilot
 Method 2: Use the fscli (requires an outage)
 Method 3: Access the Controller BIOS.
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: complex commands
Created from <SR 3-12029889841>

Applies to:

Oracle FS1-2 Flash Storage System - Version 6.1 to 6.2 [Release 6.1 to 6.2]
Information in this document applies to any platform.

Symptoms

The Controller fails to boot normally and may eventually become disabled. The GUI shows the affected Controller as red, critical, and disabled. Subsequent power cycle attempts fail, even after clearing the failure history. To confirm the issue, examine the EEL logs.

  • EEL logs can be found in the slammer directory of a log bundle if the system has cold started. 
  • EEL logs of a problem Controller can be found in the slammer directory of a log bundle if the Controller is unresponsive. 
  • EEL logs can also be gathered manually from the ILOM
    -> show /SP/logs/event/list
     
  1. Look for the mrc event in the SP Event logs:
    # find . -name "Event*" | xargs grep -Hn mrc-failed
    ./Event_buddy_20160118190744_np.log:438: Fault fault.chassis.domain.boot.intel.mrc-failed on component /SYS cleare
    ./Event_buddy_20160118190744_np.log:449: /SYS has fault.chassis.domain.boot.intel.mrc-failed with probability=100
    ./508002000158BE91/var_log/eel/Event_buddy_20160118183629_:420: Fault fault.chassis.domain.boot.intel.mrc-failed on component /SYS cleare
    ./508002000158BE91/var_log/eel/Event_buddy_20160118183629_:431: /SYS has fault.chassis.domain.boot.intel.mrc-failed with probability=100
    ./508002000158BE91/var_log/eel/Event_buddy_20160118190744_np.log:438: Fault fault.chassis.domain.boot.intel.mrc-failed on component /SYS cleare
    ./508002000158BE91/var_log/eel/Event_buddy_20160118190744_np.log:449: /SYS has fault.chassis.domain.boot.intel.mrc-failed with probability=100
     
  2. Using the line number from the grep output (438), open the file in vi at that line to see the EEL data with the mrc-failed error:
    # vi +438 ./Event_buddy_20160118190744_np.log
    1006 Mon Jan 18 13:34:07 2016 Fault Repair minor
         Fault fault.chassis.domain.boot.intel.mrc-failed on component /SYS cleare
         d
    1005 Mon Jan 18 13:34:07 2016 HOST Log minor
         IPMI sensor HOST/STATUS state transition to 0x0
    1004 Mon Jan 18 13:34:07 2016 HOST Log critical
         Critical Alarm is On.
    1003 Mon Jan 18 13:34:07 2016 Power Cycle major
         /SYS has been cycled by: SP, Reason: Fault, UUID:2bad2c41-654a-c52e-fc83
         -db5f537fae24
    1002 Mon Jan 18 13:34:01 2016 Fault Fault critical
         Fault detected at time = Mon Jan 18 13:34:01 2016. The suspect component
         /SYS has fault.chassis.domain.boot.intel.mrc-failed with probability= 100
         . Refer to http://www.sun.com/msg/SPX86-8002-5C for details.
     
  3. Then check the EEL* logs (host console logs) for a timestamp about 4 minutes before the mrc timestamp; the entries there should show that the Controller booted normally. When the mrc event occurs, the Controller reboots, so that reboot should also appear in the EEL* log.
    1001 Mon Jan 18 13:29:33 2016 HOST Log minor
         Critical Alarm is Off.
    1000 Mon Jan 18 13:29:33 2016 HOST Log minor
         IPMI sensor HOST/STATUS state transition to 0x40
    999  Mon Jan 18 13:29:33 2016 IPMI Log minor
         ID = 2fd : 01/18/2016 : 13:29:33 : System Firmware Progress : SMI Handle
         r : System boot initiated : Asserted
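
The check in steps 1 to 3 can be sketched as a small shell helper. This is a minimal sketch, not an Oracle tool: the function name eel_timeline is hypothetical, and it assumes every log entry begins with a header line of the form "<ID> <weekday> <month> <day> <HH:MM:SS> <year> ..." followed by indented detail lines, as in the excerpts above. It prints one line per mrc-failed fault and per "System boot initiated" entry so the two timestamps can be compared.

```shell
# Hypothetical helper: reads an SP Event log on stdin and prints
# "<ID> <timestamp> <MRC_FAULT|BOOT_START>" for the entries of interest.
# Limitation: a phrase that is wrapped across two detail lines is not matched.
eel_timeline() {
  awk '
    # Header line, e.g. "1002 Mon Jan 18 13:34:01 2016 Fault Fault critical"
    $1 ~ /^[0-9]+$/ && $5 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
      id = $1
      stamp = $2 " " $3 " " $4 " " $5 " " $6
      next
    }
    # Detail line of the fault entry (not the later "cleared" repair entry)
    /has fault\..*mrc-failed/ { print id, stamp, "MRC_FAULT" }
    # Detail line marking the start of a host boot
    /System boot initiated/   { print id, stamp, "BOOT_START" }
  '
}

# Example invocation (file name taken from the grep output above):
#   eel_timeline < ./Event_buddy_20160118190744_np.log
```

Run against the excerpts above, this would list event 1002 (mrc fault at 13:34:01) and event 999 (boot initiated at 13:29:33), which are about four and a half minutes apart.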
     

Changes

 This issue can occur during any Controller reboot, for example after an ESM replacement.

Cause

 NVDIMM image corruption caused by a timing issue in the ILOM code. This is a known ILOM issue that will be fixed in version 3.2.4.58. To check the current version, run the "version" command from the ILOM:

-> version
SP firmware 3.2.4.42
SP firmware build number: 99377
SP firmware date: Wed Apr 29 18:07:29 CST 2015
SP filesystem version: 0.2.10
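
The version comparison can be scripted once the "version" output has been captured to a file or a pipe. The following is a minimal sketch under stated assumptions: the function names ilom_sp_version and version_lt are hypothetical, and treating any version lower than 3.2.4.58 as lacking the fix is inferred from the statement above rather than confirmed per release.

```shell
# Hypothetical helper: extract the dotted SP firmware version
# (for example 3.2.4.42) from "version" output supplied on stdin.
ilom_sp_version() {
  awk '$1 == "SP" && $2 == "firmware" && $3 ~ /^[0-9]+(\.[0-9]+)+$/ { print $3; exit }'
}

# Hypothetical helper: succeed (exit status 0) when dotted version $1
# is strictly lower than dotted version $2, comparing field by field.
version_lt() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, ".")
    nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = (i <= na) ? x[i] + 0 : 0   # missing fields count as 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi < yi) exit 0
      if (xi > yi) exit 1
    }
    exit 1                            # equal versions are not "lower"
  }'
}

# Example invocation (version.txt is an assumed capture of the ILOM output):
#   v=$(ilom_sp_version < version.txt)
#   version_lt "$v" 3.2.4.58 && echo "SP firmware $v predates the fix"
```

The comparison is numeric per field, so 3.2.4.42 sorts below 3.2.4.58 and 3.2.10.1 sorts above it, which a plain string comparison would get wrong.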

Solution

The solution is to clear the Fbm memory of the affected Controller. There are 3 methods to do this, listed in order of preference, balancing ease of execution against impact to the system.

Method 1: SSH to Pilot

NOTE: Due to the sensitivity of the commands used in this method, Engineering approval is required to execute it.
  1. Clear the Controller failure history using fscli. Reference <Document 2093580.1> FS System: How to Clear the Controller Failure History:
    # fscli controller -reenable -controller <Controller FQN or unique identifier(ID)>
     
  2. ssh to the active Pilot. Reference <Document 2029847.1> FS System How to Enable SSH Access to the Pilot.
  3. Determine the IP address of the problem Controller:
    [root@pilot2 ~]# cat /etc/nodenames
    172.30.80.3 WN2009fffffffffffa WN2008000101000000 mgmtnode
    172.30.80.129 WN508002000158ba51 WN2008000101000001
    172.30.80.2 WN2008fffffffffff2
    [root@pilot2 ~]#
     
    NOTE: Controller IP addresses are 172.30.80.128 and 172.30.80.129 for Controllers 1 and 2 respectively. In the above output, IP address 172.30.80.128 is missing, indicating that Controller 1 is the Controller that won't boot.
     
  4. Using the IP address of the problem Controller, run the psg_shutdown utility: 

    [root@pilot2 ~]# ssh 172.30.80.128 psg_shutdown -c
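
The lookup in step 3 can be sketched as a short shell function. This is only an illustration: the function name missing_controller_ips is hypothetical, and it relies on the address assignment stated in the NOTE above (172.30.80.128 for Controller 1, 172.30.80.129 for Controller 2). It reads the contents of /etc/nodenames on stdin and prints each Controller address that has no entry.

```shell
# Hypothetical helper: report Controller IP addresses that are absent
# from the /etc/nodenames contents supplied on stdin.
missing_controller_ips() {
  awk '
    { seen[$1] = 1 }    # first field of each line is the IP address
    END {
      if (!("172.30.80.128" in seen)) print "172.30.80.128 (Controller 1)"
      if (!("172.30.80.129" in seen)) print "172.30.80.129 (Controller 2)"
    }
  '
}

# Example invocation on the Pilot:
#   missing_controller_ips < /etc/nodenames
```

With the /etc/nodenames output shown in step 3, only the Controller 1 address is reported, matching the conclusion in the NOTE.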

Method 2: Use the fscli (requires an outage)

NOTE: Due to the sensitivity of the commands used in this method, Engineering approval is required to execute it.
  1. Clear the Controller failure history using fscli. Reference <Document 2093580.1> FS System: How to Clear the Controller Failure History.
    # fscli controller -reenable -controller <Controller FQN or unique identifier(ID)>
      
  2. Use fscli to issue an emergency restart and clear Fbm. While logged in as pillar or as an administrator with the support role, issue the following command:
    # fscli.exe system -restart -emergencyClearFbm
      

Method 3: Access the Controller BIOS.

NOTE: Due to the sensitivity of the commands used in this method, Engineering approval is required to execute it.

 

NOTE: Only use this method if the first two methods are not possible.  See <Document 2070735.1> FS System: How to Access FS1-2 ILOMs Using a Serial Connection. The Controller must have a solid SP LED on in front indicating the ILOM is booted.
  1. Clear the Controller failure history using fscli.  Reference <Document 2093580.1> FS System: How to Clear the Controller Failure History:
    # fscli controller -reenable -controller <Controller FQN or unique identifier(ID)>
     
  2. Using the initial steps in Document 2070735.1, log in at the console prompt:
    ORACLESP-1315FM2009 login: root  <------------------------------------- Login as user "root"
    Password: changeme <----------------------------------------------------Password is "changeme"

    Oracle(R) Integrated Lights Out Manager

    Version 3.1.2.40 r93718

    Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.

    Warning: password is set to factory default.

    ->
     
  3. Start the host console:
    -> start /host/console <---------------------------------------------------------- Type "start /host/console" here.
    Are you sure you want to start /HOST/console (y/n)? y <--------------------------- Select "y" here
    Serial console started. To stop, type ESC
    ****Press enter a couple of times to get console prompt****
    WN508002000158BA50 #
     
  4. Restart the Controller:
    WN508002000158BA50 # reboot
     
  5. While the Controller is rebooting, watch for all of the NVD (NVDIMM portion of the BIOS) messages to scroll by. After a couple of minutes, a BIOS screen message appears about pressing function keys (F2, F8, F12) to enter Setup, as on a normal PC. Because this is a serial connection, use "ctrl-e" to enter Setup instead: press the "Control" and "E" keys simultaneously when this message appears:
    [Image: entering BIOS]

  6. Once the BIOS is accessed, press "ctrl-u" to enter "expert mode". This is a hidden and undocumented feature:
    [Image: entering expert mode]

  7. Using the arrow keys, navigate to the "Memory" menu, where one of the options should be "NVDIMM Release Images". Select this option and choose "enabled". Enabling this option forces all subsequent boots to be PO (Power On without data recovery), so it will need to be disabled again once the node has cleared memory. In other words, let the Controller boot up to the host login prompt, then type "reboot" or "psg_shutdown -r":
    WN508002000158BA50 # reboot
     
  8. The Controller will now reboot. During boot, break into the BIOS like before and change the NVDIMM Release Images option back to "disabled", save and exit. Now let the Controller boot normally:
    [Image: nvdimm_release_image]

 

References

<BUG:22561997> - FS1-2, CONTROLLER-01 CRITICAL, WILL NOT BOOT AFTER FCO ESM REPLACMENTS
<BUG:22301155> - NEED 3.2.4.58 SP TO FIX MRC EVENT BUG SEEN AFTER ESM REPLACEMENT

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.