Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-77-2380939.1
Update Date: 2018-03-30
Keywords:

Solution Type  Sun Alert Sure

Solution  2380939.1 :   SPARC M5, M6, and M7 Servers Physical Domain(s) may Unexpectedly Reset Following the Reset of the Active Service Processor  


Related Items
  • SPARC M5-32
  • SPARC M7-8
  • SPARC - Sun System Firmware
  • SPARC M6-32
  • SPARC M7-16
  • Sun Hardware - Generic
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun Alert




In this Document
Description
Occurrence
Symptoms
Workaround
Resolution
History
References


Applies to:

SPARC - Sun System Firmware
Sun Hardware - Generic
SPARC M5-32
SPARC M6-32
SPARC M7-8
Information in this document applies to any platform.
SPARC
___________________________________________



Date of Resolved Release: 30-Mar-2018
___________________________________________

Description

Physical Domains on SPARC M5, M6, and M7 servers running certain firmware versions (as listed below) may unexpectedly reset following a reset of the active Service Processor (SP). This can occur either when a user explicitly requests the SP reset (e.g., 'reset /SP') or during a System Firmware upgrade that initiates a reset of the active SP.

Occurrence

This issue can occur on the following platforms:

SPARC Platform

  • SPARC M5/M6 Servers with Firmware version prior to 9.6.20.b
  • SPARC M7 Servers with Firmware version prior to 9.8.0.d

Note: To determine the firmware version installed on the system, use the following ILOM command:

      -> show /HOST sysfw_version

      /HOST
      Properties:
      sysfw_version = Sun System Firmware 9.6.20.b 2017/07/10 15:31
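
The version check above can be automated once the command output has been captured. The following Python is a minimal sketch, not an Oracle tool: the function names are hypothetical, and the comparison rule (numeric fields compared as integers, then the trailing letter compared alphabetically) is an assumption about how these firmware versions are ordered.

```python
import re

# Earliest fixed firmware per platform, as listed in this document.
FIXED_FIRMWARE = {"M5": "9.6.20.b", "M6": "9.6.20.b", "M7": "9.8.0.d"}


def parse_sysfw_version(ilom_output):
    """Pull a version string such as '9.6.20.b' out of 'show /HOST sysfw_version' output."""
    match = re.search(r"Sun System Firmware\s+(\d+(?:\.\d+)*(?:\.[a-z]+)?)", ilom_output)
    if match is None:
        raise ValueError("no 'Sun System Firmware <version>' string found")
    return match.group(1)


def version_key(version):
    """Split '9.6.20.b' into ((9, 6, 20), 'b') so versions compare field by field."""
    fields = version.split(".")
    # Assumption: a trailing alphabetic field is a build letter that sorts alphabetically.
    suffix = fields.pop() if fields[-1].isalpha() else ""
    return tuple(int(field) for field in fields), suffix


def is_affected(platform, ilom_output):
    """True if the reported firmware is older than the fixed release for the platform."""
    installed = parse_sysfw_version(ilom_output)
    return version_key(installed) < version_key(FIXED_FIRMWARE[platform])
```

With the sample output shown above, is_affected("M5", output) evaluates to False, because 9.6.20.b is the first fixed release for SPARC M5/M6.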

Symptoms

When a 'reset /SP' command is issued on the Service Processor, one or more of the hosts may reset and reboot. During a firmware upgrade, this issue can occur if the firmware version being upgraded from is earlier than 9.6.20.b (on SPARC M5/M6) or 9.8.0.d (on SPARC M7).

This issue only occurs if one of the following has previously happened:

  1. During a Host reset, the Host panicked and rebooted.
  2. The Host was stopped (stop /HOSTx) and restarted (start /HOSTx) while it was already rebooting.

In both of the above failure scenarios the Host never reaches a stopped state.  A pending shutdown is retained within the SP Host state transition. When the SP recovers following its reset, it resumes the Host reset and an unexpected domain outage occurs.  The aborted state transition can be weeks or months old.
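
The mechanism described above can be pictured with a toy model. The Python below is an illustration only and is not ILOM code: the class, its attribute, and its method names are all hypothetical, and clearing the flag after the resumed reset is an assumption made to keep the sketch short.

```python
class ToyServiceProcessor:
    """Illustrative model of the retained state transition; not ILOM code."""

    def __init__(self):
        # True while a Host shutdown has been started but never completed.
        self.pending_shutdown = False

    def reset_host(self, host_reaches_stopped):
        """Start a Host reset; the transition completes only if the Host stops."""
        self.pending_shutdown = True
        if host_reaches_stopped:
            # 'Host stopped' was observed, so nothing is left pending.
            self.pending_shutdown = False

    def reset_sp(self):
        """Reset the SP; return True if the Host is reset as a side effect."""
        if self.pending_shutdown:
            # The SP resumes the old, aborted transition after it recovers.
            # Assumption: the resumed reset then completes and clears the flag.
            self.pending_shutdown = False
            return True
        return False
```

In this model, a Host reset during which the Host panics and reboots (host_reaches_stopped=False) leaves the flag set for any length of time, and the next reset_sp() call returns True, which corresponds to the unexpected domain outage.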

During a normal reset or stop/start of the Host, a 'Host stopped' message appears in the Host status log and the SP event log. To verify whether the Host actually stopped, use the following commands to check the Host status log and the SP event log:

    -> show /HOST0/status_history/list
    20180316 11:57:35: status='Host shutting down'
    20180316 11:58:34: status='Solaris panicking'
    20180316 11:58:55: status='Solaris rebooting'
    20180316 11:58:56: status='Host stopped'  <<< Host stopped indication
    20180316 11:58:59: status='Standby'
    20180316 11:59:00: Shutdown Host in progress

SP event log:

    -> show /SP/logs/event/list
    719    Fri Mar 16 11:58:59 2018  System    Log       minor
           Host ID 0: Standby
    718    Fri Mar 16 11:58:56 2018  System    Log       minor
           Host ID 0: Host stopped
    717    Fri Mar 16 11:58:55 2018  System    Log       minor
           Host ID 0: Solaris rebooting
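
For systems with long event logs, the listing can be captured to a file and searched programmatically. The following Python is a rough sketch that assumes the two-line entry layout shown above (a header line with ID, date, class, type, and severity, followed by the message line); the function names and dictionary keys are hypothetical.

```python
import re

# Header line of an event, e.g. "719    Fri Mar 16 11:58:59 2018  System    Log       minor"
HEADER_RE = re.compile(
    r"^\s*(\d+)\s+(\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)


def parse_event_log(log_text):
    """Return events as dicts, oldest first (the listing itself is newest first)."""
    events = []
    current = None
    for line in log_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = {
                "id": int(header.group(1)),
                "time": " ".join(header.group(2).split()),
                "class": header.group(3),
                "type": header.group(4),
                "severity": header.group(5),
                "message": "",
            }
            events.append(current)
        elif current is not None and line.strip():
            # Continuation line: the event text that follows its header.
            current["message"] = (current["message"] + " " + line.strip()).strip()
    return sorted(events, key=lambda event: event["id"])


def host_messages(log_text, host_id):
    """Chronological list of 'Host ID <n>: ...' messages for one host."""
    prefix = f"Host ID {host_id}: "
    return [
        event["message"][len(prefix):]
        for event in parse_event_log(log_text)
        if event["message"].startswith(prefix)
    ]
```

A check such as "Host stopped" in host_messages(log_text, 0) then indicates whether the SP recorded the stop for Host 0 in the captured portion of the log.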

Note: If the 'Host stopped' message is absent from the Host status list, the SP is unaware that the Host was reset or restarted, so it still has a pending reboot action in effect. The next time the SP is reset, it issues a command to reset the Host.

In the following failed scenario, the 'Host stopped' message is missing from the Host status log and SP event log.

Host status log:

    20180316 09:57:14: status='Host shutting down'
    20180316 09:57:59: status='Solaris panicking'
    20180316 09:58:26: status='Solaris rebooting'
    20180316 09:58:41: status='Solaris rebooting'
    20180316 09:58:44: status='OpenBoot initializing'
    20180316 09:59:00: status='OpenBoot Primary Boot Loader'
    20180316 09:59:06: status='OpenBoot Primary Boot Loader'
    20180316 09:59:27: status='OpenBoot Running OS Boot'
    20180316 10:01:10: status='Solaris running' 

SP event log:

    63816  Fri Mar 16 09:58:42 2018  System    Log       minor
           Host ID 0: Solaris rebooting
    63815  Fri Mar 16 09:58:26 2018  System    Log       minor
           Host ID 0: Solaris rebooting
    63814  Fri Mar 16 09:58:00 2018  System    Log       minor
           Host ID 0: Solaris panicking
    63813  Fri Mar 16 09:57:14 2018  System    Log       minor
           Host ID 0: Host shutting down
    63812  Fri Mar 16 09:57:12 2018  Reset     Log       major
           Reset of /HOST0 by root succeeded.

In the above example, it is evident that during the Host reset, the Host panicked and rebooted, so it never stopped before coming back up.
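
The comparison between the normal and failed status logs can be expressed as a simple scan. The Python below is a sketch under stated assumptions: the status history is listed oldest first, as in the examples in this document, and the set of statuses treated as "booting again" contains only strings that appear in those examples. The function name is hypothetical.

```python
import re

STATUS_RE = re.compile(r"^\s*(\d{8} \d{2}:\d{2}:\d{2}): status='([^']*)'")

# Statuses that show the Host is coming back up. Both strings appear in the
# logs quoted in this document; a real log may use others as well.
BOOTING_STATUSES = {"OpenBoot initializing", "Solaris running"}


def find_incomplete_shutdowns(status_history):
    """Return timestamps of 'Host shutting down' entries where the Host booted
    again without a 'Host stopped' entry in between."""
    incomplete = []
    pending_since = None
    for line in status_history.splitlines():
        match = STATUS_RE.match(line)
        if match is None:
            continue  # skip the command echo and lines without status='...'
        timestamp, status = match.groups()
        if status == "Host shutting down":
            pending_since = timestamp
        elif status == "Host stopped":
            pending_since = None
        elif status in BOOTING_STATUSES and pending_since is not None:
            incomplete.append(pending_since)
            pending_since = None
    return incomplete
```

Applied to the failed Host status log above, the function returns the timestamp of the 'Host shutting down' entry (20180316 09:57:14); applied to the normal log it returns an empty list. A shutdown that is still in progress at the end of the captured log is not reported.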

Workaround

If the Host status log or SP event log does not show the 'Host stopped' message during a Host reset or Host stop/start, schedule downtime at the earliest convenience to stop and start the Host so that the SP is aware of the Host's proper state.

Resolution

This issue is addressed in the following releases:

SPARC Platform

  • SPARC M5/M6 Servers with Firmware version 9.6.20.b (as delivered in patch 27043440) or later
  • SPARC M7 Servers with Firmware version 9.8.0.d (as delivered in patch 27185996) or later

History

30-Mar-2018: Document released, status is Resolved

This issue is not seen on M8 servers since the minimum ILOM version on the M8 servers is 4.0.0.1.c. The bug listed here addresses only part of the solution; other changes to SP states address the whole issue.

Questions regarding this document should be addressed to
sunalertpublication_us_grp@oracle.com, with a copy to the
submitter/responsible engineer listed below:

Internal Contributor/Submitter: Pious.Kallarackal@oracle.com
Internal Eng Responsible Engineer: Pious.Kallarackal@oracle.com
Oracle Knowledge Analyst: jeff.folla@oracle.com
Internal Eng Business Unit Group: Systems Server OS
Internal Associated SRs: 3-16985067319
Internal Resolution Patches:

References

<BUG:23309265> - FAILED HOST SHUTDOWN MAY LEAD TO UNEXPECTED HOST SHUTDOWN AFTER SP REBOOT

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.