Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2342857.1
Update Date:2018-04-20
Keywords:

Solution Type  Problem Resolution Sure

Solution  2342857.1 :   SPARC M7 HOST may panic upon forced SP failover (e.g., reset by Standby)


Related Items
  • Oracle SuperCluster M7 Hardware
  • SPARC M7-8
  • SPARC M7-16
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: M7




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-16418793701>

Applies to:

SPARC M7-8 - Version All Versions to All Versions [Release All Releases]
SPARC M7-16 - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster M7 Hardware - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (64-bit)

Symptoms

A HOST may suffer an unexpected drop to the OBP debugger upon a forced SP failover, either through explicit user action or catastrophic SP reset.

The HOST console will show the drop to the OBP debugger.  Upon selecting 's' to sync, the domain will panic.
e.g.,
Dec 14 12:15:44 hostname     sshd[59354]: fatal: Read from socket failed: Connection reset by peer

c)ontinue, s)ync, r)eset? s <------------host dropped to debugger
c)ontinue, s)ync, r)eset? s

panic[cpu37]/thread=2a10a3b9b80: sync initiated
sched: trap type = 0x0
pid=0, pc=0x0, sp=0x0, tstate=0x0, context=0x0
o0-o7: 0, 0, 0, 0, 0, 0, 0, 0
g1-g7: 0, 0, 0, 0, 0, 0, 0

 

Domain messages immediately preceding the drop to the debugger will include messages such as those below.  The key messages to recognize are "Link retraining detected" and "Surprise removal of mga0 detected".

Dec 14 12:38:09 hostname pcie: [ID 297812 kern.info] NOTICE: Live Suspend: port pci.0,0: child dev mga#0(400417ccab8) and descendants
Dec 14 12:38:09 hostname pcie: [ID 286789 kern.info] NOTICE: Live Suspend: mga0 suspended successfully
Dec 14 12:38:09 hostname pcie: [ID 486281 kern.info] NOTICE: IOR dev:////pci@304/pci@1/pci@0/pci@4/display@0, Reason: device has been surprise removed, Action: Hotplug LSR Suspend, Result: success, Current state: suspended
Dec 14 12:38:09 hostname pcie: [ID 833280 kern.notice] NOTICE: Suspend of mga0 succeeded.
Dec 14 12:38:09 hostname pcie: [ID 958946 kern.warning] WARNING: Link retraining detected in port pcieb7
Dec 14 12:38:09 hostname pcie: [ID 965590 kern.warning] WARNING: Surprise removal of mga0 detected
Dec 14 12:38:09 hostname mac: [ID 486395 kern.info] NOTICE: usbecm2 link down
Dec 14 12:38:15 hostname genunix: [ID 408114 kern.info] /pci@304/pci@1/pci@0/pci@2/usb@0/communications@1 (usbecm2) online
Dec 14 12:38:19 hostname fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-3S, TYPE: Fault, VER: 1, SEVERITY: Critical
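As an illustration only, the two key warnings can be picked out of saved domain messages with a short script. This is a minimal sketch, not an Oracle tool: the function name find_key_messages and the pattern names are hypothetical, and the regular expressions are derived solely from the sample messages above, so the device and port names (mga0, pcieb7) are matched loosely.

```python
import re

# Patterns derived from the sample messages in this document (an assumption:
# other systems may report different device or port names).
KEY_PATTERNS = {
    "link_retraining": re.compile(r"WARNING: Link retraining detected in port (\S+)"),
    "surprise_removal": re.compile(r"WARNING: Surprise removal of (\S+) detected"),
}


def find_key_messages(lines):
    """Return (kind, name, line) for every line containing a key warning."""
    hits = []
    for line in lines:
        for kind, pattern in KEY_PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((kind, match.group(1), line.rstrip("\n")))
    return hits


# Two lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "Dec 14 12:38:09 hostname pcie: [ID 958946 kern.warning] WARNING: Link retraining detected in port pcieb7",
    "Dec 14 12:38:09 hostname pcie: [ID 965590 kern.warning] WARNING: Surprise removal of mga0 detected",
]

for kind, name, _line in find_key_messages(SAMPLE):
    print(kind, name)
```

Seeing both warnings close together in time, shortly before the drop to the debugger, matches the signature described here; either one alone is not conclusive.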

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Various issues may cause a forced SP failover.  One condition known to trigger the failover is SP memory exhaustion.  This condition can only be diagnosed by an Oracle engineer when an SP snapshot is provided.

The SP failover will be evident in the event log.  Use the following command to display the event log and look for messages such as those appearing below.

 -> show -script  /SP/logs/event/list

 281 Thu Dec 14 12:47:57 2017 System Log minor
Host ID 0: Solaris running
280 Thu Dec 14 12:43:39 2017 Reset Log minor
/Servers/PDomains/PDomain_0 is now managed by PDomain SPP /SYS/SP1/SPM0.
279 Thu Dec 14 12:43:39 2017 Reset Log minor
/System/DCUs/DCU_0 is now managed by /SYS/SP1/SPM0.
278 Thu Dec 14 12:43:38 2017 Reset Log minor
Failover completed. Active SP is /SYS/SP1/SPM0.
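If the event log output has been saved to a file, the failover entries can be located with a small script. The sketch below is hedged: find_failovers is a hypothetical helper, and the pattern assumes the message wording is exactly as in the excerpt above.

```python
import re

# Message wording taken from the event log excerpt in this document.
FAILOVER_RE = re.compile(r"Failover completed\. Active SP is (\S+)")


def find_failovers(event_log_text):
    """Return the SP path reported as active after each failover, in log order."""
    # The log message ends with a period, which is not part of the SP path.
    return [m.group(1).rstrip(".") for m in FAILOVER_RE.finditer(event_log_text)]


# Sample input copied from the excerpt above.
SAMPLE_LOG = """279 Thu Dec 14 12:43:39 2017 Reset Log minor
/System/DCUs/DCU_0 is now managed by /SYS/SP1/SPM0.
278 Thu Dec 14 12:43:38 2017 Reset Log minor
Failover completed. Active SP is /SYS/SP1/SPM0.
"""

print(find_failovers(SAMPLE_LOG))
```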

 

SR owners can confirm the forced failover from trace messages in the SSM ring.

#SP Trace logs#

SSM 2017-12-14 12:38:44.860240 1382 ssm_priv_link.c:703 [4] no message received in 61 secs
SSM 2017-12-14 12:38:44.860293 1382 ssm_priv_link.c:615 [4] priv link is DOWN
SSM 2017-12-14 12:39:15.129572 1385 ssm_util.c:417 [10] couldn't open socket to peer xxx.xxx.xx.x:9738, rc=-1, errno=113
SSM 2017-12-14 12:39:15.129636 1385 ssm_heartbeat.c:470 [10] Active SP appears to be down
SSM 2017-12-14 12:39:48.441329 1385 ssm_util.c:417 [10] couldn't open socket to peer xxx.xxx.xx.x:9738, rc=-1, errno=113
SSM 2017-12-14 12:39:48.441416 1385 ssm_heartbeat.c:470 [10] Active SP appears to be down
SSM 2017-12-14 12:40:21.753137 1385 ssm_util.c:417 [10] couldn't open socket to peer xxx.xxx.xx.x:9738, rc=-1, errno=113
SSM 2017-12-14 12:40:21.753199 1385 ssm_heartbeat.c:470 [10] Active SP appears to be down
SSM 2017-12-14 12:40:55.064898 1385 ssm_util.c:417 [10] couldn't open socket to peer xxx.xxx.xx.x:9738, rc=-1, errno=113
SSM 2017-12-14 12:40:55.064962 1385 ssm_heartbeat.c:470 [10] Active SP appears to be down
SSM 2017-12-14 12:41:28.376902 1385 ssm_util.c:417 [10] couldn't open socket to peer xxx.xxx.xx.x:9738, rc=-1, errno=113
SSM 2017-12-14 12:41:28.376965 1385 ssm_heartbeat.c:470 [10] Active SP appears to be down
SSM 2017-12-14 12:41:28.378280 1381 ssmd.c:335 [14] androproc_event_received: class="124" type=2
SSM 2017-12-14 12:41:28.379037 1379 ssm_failover.c:106 [2] executing command 3 (Standby->Master forced)        <==============FORCED FAILOVER IS EVIDENT HERE
SSM 2017-12-14 12:41:28.400000 1379 ssm_failover.c:403 [2] state machine : 2 (Standby) -> 3 (Stand-alone)
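The trace pattern above (repeated "Active SP appears to be down" messages followed by a forced Standby->Master command) can be summarized programmatically. This is a sketch under stated assumptions: summarize_ssm_trace is a hypothetical name, and the line layout (timestamp, process ID, source file and line, bracketed number, message) is inferred only from the sample trace shown here.

```python
import re
from datetime import datetime

# Layout inferred from the sample trace lines in this document (an assumption).
TRACE_RE = re.compile(
    r"^SSM (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \d+ (\S+):(\d+) \[\d+\] (.*)$"
)


def summarize_ssm_trace(lines):
    """Return (count of 'Active SP appears to be down', time of forced failover or None)."""
    down_count = 0
    forced_at = None
    for line in lines:
        match = TRACE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        stamp, _source, _lineno, message = match.groups()
        if "Active SP appears to be down" in message:
            down_count += 1
        elif "executing command" in message and "forced" in message:
            forced_at = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    return down_count, forced_at


# Sample input copied from the trace excerpt above.
SAMPLE_TRACE = [
    "SSM 2017-12-14 12:40:55.064962 1385 ssm_heartbeat.c:470 [10] Active SP appears to be down",
    "SSM 2017-12-14 12:41:28.376965 1385 ssm_heartbeat.c:470 [10] Active SP appears to be down",
    "SSM 2017-12-14 12:41:28.379037 1379 ssm_failover.c:106 [2] executing command 3 (Standby->Master forced)",
]

print(summarize_ssm_trace(SAMPLE_TRACE))
```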

 

 ----------------------------------------------------------------------------------------------------------------------------------------------

The issue impacts all System Firmware releases earlier than 9.8.0.d.  The SysFW version can be displayed with this command:

-> show /System/Firmware system_fw_version

 /System/Firmware
    Properties:
        system_fw_version = Sun System Firmware 9.7.4 2016/12/08 07:51
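To check a saved copy of this output against the fixed release, a comparison along the following lines could be used. It is a sketch, not an official version-ordering rule: is_affected and version_key are hypothetical helpers, and the code assumes that versions are dot-separated numbers with an optional trailing letter, and that a version without a letter sorts before the same version with one.

```python
import re

FIXED_VERSION = "9.8.0.d"  # first fixed release, per this document

# Output format taken from the sample shown above.
FW_RE = re.compile(r"system_fw_version\s*=\s*Sun System Firmware (\S+)")


def version_key(version):
    """Build a sortable key: numeric components first, then the letter suffix.

    Assumption: a missing letter suffix sorts before any letter.
    """
    numbers = []
    suffix = ""
    for part in version.split("."):
        if part.isdigit():
            numbers.append(int(part))
        else:
            suffix = part.lower()
    while len(numbers) < 3:
        numbers.append(0)  # pad short versions such as "9.8"
    return (tuple(numbers), suffix)


def is_affected(show_output):
    """Return True if the reported firmware is earlier than FIXED_VERSION."""
    match = FW_RE.search(show_output)
    if match is None:
        raise ValueError("system_fw_version not found in output")
    return version_key(match.group(1)) < version_key(FIXED_VERSION)


# Sample input copied from the output above.
SAMPLE_OUTPUT = "        system_fw_version = Sun System Firmware 9.7.4 2016/12/08 07:51"
print(is_affected(SAMPLE_OUTPUT))
```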

Changes

 

Cause

Forced SP failover causes the unintended loss of the HOST.

The host encountered the known issue described in <BUG 23621056> - Host dropped to Debugger when running SP force failover.

In the reported case, the SP failover was triggered by a watchdog timeout caused by SP memory exhaustion.

 

Solution

Workaround:  Clear any faults.  Do not replace any hardware.

Fix:  Install SysFW 9.8.0.d or higher

Important Note : For Oracle SuperCluster M7 Hardware (SuperCluster Patch Policy)

QFSDP release is the supported vehicle for SysFW deployment on SuperCluster.  See Doc ID 1567979.1 for details.  It may be necessary to seek exception approval for SysFW upgrade outside a QFSDP release.

Never tell a SuperCluster customer to patch an individual component in isolation.  SysFW 9.8.0.d or higher is not yet in a QFSDP and must receive exception approval on a case-by-case basis for SuperCluster.

Reactive patching is only allowed for critical issues with no easy/viable workaround.  For approval, always check with the SuperCluster Maintenance Group first - ssc_maintenance_grp@oracle.com.

 

 

References

<NOTE:2064922.1> - ILOM-8000-F7 - the link between the Service Processor and host has a heartbeat failure
<NOTE:2063349.1> - SPARC M7 Series Servers : Interconnect - EoUSB
<NOTE:1967027.1> - SPARC M8 and SPARC M7 Series Servers : Current Issues Page
<BUG:23621056> - HOST DROPPED TO DEBUGGER WHEN RUNNING SP FORCE FAILOVER
<NOTE:1567979.1> - Oracle SuperCluster Supported Software Versions - All Hardware Types

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.