Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-77-2188787.1
Update Date:2017-07-25
Keywords:

Solution Type  Sun Alert Sure

Solution  2188787.1 :   SPARC M5-32 and M6-32 Systems With System Firmware Versions 9.5.3 through 9.6.5.a May Experience Solaris Panic Due to Certain PCIe Fabric Errors  


Related Items
  • SPARC M5-32
  • Sun Software - Generic
  • SPARC M6-32
  • Sun Hardware - Generic
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun Alert




In this Document
Description
Occurrence
Symptoms
Workaround
Resolution
Patches
History
References


Applies to:

SPARC M6-32
SPARC M5-32
Sun Hardware - Generic
Sun Software - Generic
SPARC
_____________________________________________



Date of Resolved Release: 06-Oct-2016
_____________________________________________

Description

On SPARC M5-32 and M6-32 systems, a problem in system firmware versions 9.5.3 through 9.6.5.a may allow the hardware to enter a state that it interprets as a "surprise removal" of an entire PCIe bus. This condition will cause Solaris to panic.

A system and all its hosts become vulnerable to this issue if one or more of the following events has occurred after an affected version of the firmware was installed:

    Power was removed from the entire chassis (an AC power cycle).
    A Service Processor (SP) was reset while its host was powered down.

Once the system is vulnerable, the panic may occur at any time; in observed cases it has taken months to be triggered.

Occurrence

This issue can occur on the following platforms:

SPARC Platform

SPARC M5-32/M6-32 Systems with any of the following Sun System Firmware versions:

  • Firmware 9.5.3 (patch 22270916)
  • Firmware 9.5.4.b (patch 22982110)
  • Firmware 9.6.5 (patch 23763460)
  • Firmware 9.6.5.a (patch 24441089)

Notes:

1. The x86 Platform is not affected by this issue.
2. No other SPARC systems are affected by this issue.

To determine the firmware version installed on the system, use the following ILOM command:

      -> show /System system_fw_version
          /System
            Properties:
              system_fw_version = Sun System Firmware 9.5.4.b 2016/03/24 22:18
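
The version check above can be sketched in Python for sites that collect ILOM output from many chassis. This is a minimal illustration, not an Oracle tool: the function names are hypothetical, the input is assumed to be the captured text of `show /System system_fw_version`, and the comparison is an exact match against the four versions listed in this alert, so any other build would need to be checked by hand.

```python
import re

# Affected versions copied from the list in this alert.
AFFECTED_VERSIONS = {"9.5.3", "9.5.4.b", "9.6.5", "9.6.5.a"}

# Matches the property line printed by `show /System system_fw_version`.
FW_LINE = re.compile(r"system_fw_version\s*=\s*Sun System Firmware\s+(\S+)")


def parse_fw_version(ilom_output):
    """Return the version string (e.g. '9.5.4.b'), or None if not found."""
    match = FW_LINE.search(ilom_output)
    return match.group(1) if match else None


def is_affected(version):
    """True only for versions this alert lists explicitly (exact match)."""
    return version in AFFECTED_VERSIONS


# Sample text taken from the ILOM output shown in this alert.
sample = """-> show /System system_fw_version
    /System
      Properties:
        system_fw_version = Sun System Firmware 9.5.4.b 2016/03/24 22:18
"""

version = parse_fw_version(sample)
print(version, is_affected(version))
```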

Symptoms

If the described issue occurs, the operating system panics as shown below. The panic is preceded by a PCIe 'surprise removal' warning message, which appears on the console and in the console history file (/hostX/console/history), where X is the host that panicked:

      WARNING: Link retraining detected in SP port pcieb222
      WARNING: Driver for pcieb223 does not support surprise removal
      WARNING: pf_process_nr_children: Cannot suspend surprise-removed device tree below pcieb222

      SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
      EVENT-TIME: 0x5778fff4.0x304801a9 (0x7e75ef2407693)
      PLATFORM: sun4v, CSN: -, HOSTNAME: xxxx
      SOURCE: SunOS, REV: 5.11 11.2
      DESC: Errors have been detected that require a reboot to ensure system
      integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
      AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
      IMPACT: The system will sync files, save a crash dump if needed, and reboot
      REC-ACTION: Save the error summary below in case telemetry cannot be saved

      panic[cpu0]/thread=2a10009dc20: Fatal error has occured in: PCIe fabric.(0x1)(0x105)

      000002a10009d590 px:px_err_panic+1c4 (208ea000, 1, 105, 1223ec00, 1, 208e7018)
      %l0-3: 000002a10009d640 0000000000000016 00000000208ea400 000000000000005f
      %l4-7: 0000000000000000 00000000204c6c00 ffffffffffffffff 0000000000000000
      000002a10009d6a0 px:px_err_fabric_intr+1a8 (c427cf9b4000, 1, c427cf9b7960, 1, 401ce974a70, 105)
      %l0-3: 0000000000000260 000000001223ec68 0000000000000000 0000000000000260
      %l4-7: 0000000000000001 000000001223ec70 000000001223ec00 0000000000000001
      000002a10009d820 pxvec:vec_msiq_intr+220 (c427cf999a40, 6, 9, 12231ebc, 0, c427cf9a1d00)
      %l0-3: 0000000000000002 00000000436a0000 0000000000000000 0000000000000001
      %l4-7: 00000401ce974a70 0000c427cf9b4000 0000c427cf9c0cb0 0000c427cf8acb60
      000002a10009d950 unix:dispatch_handler+1cc (20010000, 200, 0, 7008da40, 2a10009dc20, 9)
      %l0-3: 000000001225d710 0000000020010048 0000000000000000 0000c427cf999a40
      %l4-7: 0000000020010090 000000000000000f 0000000000000001 0000000020010040

      syncing file systems...
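
As a rough aid for checking a saved console history against the output above, the following Python sketch searches for the two markers: a surprise-removal warning and a PCIe fabric panic line. The regular expressions are assumptions derived only from the sample text in this alert, and the function names are hypothetical. A match suggests this issue but does not confirm it.

```python
import re

# Patterns derived from the sample console output in this alert (assumption:
# other occurrences use the same wording).
SURPRISE_REMOVAL = re.compile(r"WARNING: .*surprise[- ]remov", re.IGNORECASE)
FABRIC_PANIC = re.compile(r"panic\[cpu\d+\]/thread=[0-9a-f]+: .*PCIe fabric")


def scan_console_log(text):
    """Return (warning_lines, panic_lines) found in a console history capture."""
    warnings, panics = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if SURPRISE_REMOVAL.search(stripped):
            warnings.append(stripped)
        elif FABRIC_PANIC.search(stripped):
            panics.append(stripped)
    return warnings, panics


def matches_signature(text):
    """True when both a surprise-removal warning and a fabric panic appear."""
    warnings, panics = scan_console_log(text)
    return bool(warnings) and bool(panics)


# Lines copied from the console sample in this alert.
sample = """WARNING: Link retraining detected in SP port pcieb222
WARNING: Driver for pcieb223 does not support surprise removal
WARNING: pf_process_nr_children: Cannot suspend surprise-removed device tree below pcieb222
panic[cpu0]/thread=2a10009dc20: Fatal error has occured in: PCIe fabric.(0x1)(0x105)
"""

print(matches_signature(sample))
```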

Workaround

There is no workaround for this issue. Please see the "Resolution" section below.

Resolution

This issue is addressed in the following release:

  • SPARC M5-32/M6-32 Systems With Sun System Firmware 9.6.6.a (patch 24736423) or later

Note: After loading the firmware, it is important to power off each physical domain and reset the Active SP.

Either of the following two procedures can be used:

A) Update all Physical Domains (PDOMs) at the same time:

  1) Power off all running PDOMs:

      -> stop /System

  2) Load the new system firmware version 9.6.6.a as described in "Mx-32 - How to update System Firmware (ILOM / HC / POST / HV / OBP / GM )" - <Document:1981675.1>

  3) Then power up each physical domain:

      -> start /HOSTx

      (where x corresponds to 0, 1, 2, or 3)

OR:

B) If it is desirable to keep one or more of the physical domains running during the firmware upgrade:

  1) Load the new system firmware version 9.6.6.a as described in "Mx-32 - How to update System Firmware (ILOM / HC / POST / HV / OBP / GM )" - <Document:1981675.1>

  2) Then, when possible, power down the PDOM:

      -> stop /HOSTx

  3) Then reset the SP:

      -> reset /SP

  4) After the SP reset completes, it is safe to power the PDOM back up:

      -> start /HOSTx

      (where x corresponds to 0, 1, 2, or 3)

Notes:

    (a) The SP reset takes approximately 20 minutes to complete; wait for it to finish before starting the HOST.

    (b) If running POST at the max level is not desired because of the longer downtime, temporarily set 'hw_change_level' to 'min' before resetting the SP. This can be done as follows:

        -> set /HOSTx/diag hw_change_level=min

As stated above, the system remains vulnerable until all PDOMs have been powered down, the SP has been reset, and power has been restored to the physical domains.
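
For a runbook, the post-update steps of option B can be written out per host with a small Python helper. This is only a sketch: `option_b_commands` is a hypothetical name, the helper builds command strings and does not connect to ILOM, and placing the optional `hw_change_level` setting after the stop is an assumption (the note above only requires it before the SP reset). It assumes the new firmware has already been loaded.

```python
def option_b_commands(host_index, min_post=False):
    """Build the ILOM commands for option B, steps 2 through 4, for one PDOM.

    host_index: 0, 1, 2, or 3, matching /HOSTx in this alert.
    min_post:   include the optional hw_change_level=min setting from note (b).
    """
    if host_index not in (0, 1, 2, 3):
        raise ValueError("host index must be 0, 1, 2, or 3")
    host = f"/HOST{host_index}"
    commands = [f"stop {host}"]
    if min_post:
        # Assumption: set after the stop; the alert only says "before resetting the SP".
        commands.append(f"set {host}/diag hw_change_level=min")
    commands.append("reset /SP")
    # Allow approximately 20 minutes for the SP reset before the start, per note (a).
    commands.append(f"start {host}")
    return commands


for command in option_b_commands(1, min_post=True):
    print("->", command)
```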

Patches

 <Patch:24736423>

History

06-Oct-2016: Document released, status is Resolved
13-Dec-2016: Updated for clarification in Resolution section (Option B first statement)

This issue was caused by a regression introduced by bug 21424759. That change
required the FRUID capability to be set for each FRU; however, it was not set
for the IOB, so the wrong parameter file was used. As a result, a temperature
sensor was enabled without averaging of its readings, which can cause an
over-temperature condition that shuts down the IOB, resulting in an OS panic.
Correcting this is considered the resolution to the problem.

Comments regarding any portion of this document should be submitted to
sunalertpublication_us_grp@oracle.com, with a copy to the submitter/responsible
engineer listed below.

Internal Contributor/Submitter: marcel.widjaja@oracle.com
Internal Eng Responsible Engineer: david.arneson@oracle.com
Oracle Knowledge Analyst: david.mariotto@oracle.com
Internal Eng Business Unit Group: Systems RPE
Internal Escalation ID: 3-12962043641, 3-13064075261, 3-13310427991, 3-13312770771
Internal Resolution Patches: 24736423

References

<BUG:24351722> - MULTIPLE PANIC FATAL ERROR HAS OCCURRED IN: PCIE FABRIC.

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.