M12-io.pciex.switch.se - An error was detected within a PCIe switch

Asset ID:	1-79-2218459.1
Update Date:	2017-08-14
Keywords:

Solution Type Predictive Self-Healing Sure

Solution 2218459.1 : M12-io.pciex.switch.se - An error was detected within a PCIe switch

Applies to:

Fujitsu SPARC M12-2
Fujitsu M10 PCI Expansion Unit
Fujitsu SPARC M12-2S
Fujitsu SPARC M12-1
SPARC

Purpose

Provide additional information for message ID: M12-io.pciex.switch.se

Fujitsu error code:

03000001

Details

Type

: Hardware Fault; io.pciex.switch.se

Severity

: Major

Description

: Fault due to a serious error detected on a PCIe switch chip, onboard device or a card in a PCI slot.

Automated Response

: When the failure is detected, the domain is rebooted. The platform administrator should make sure that the FRU containing the faulty component is replaced.

Impact

: All the IO behind the PCIe switch chip is deconfigured.

NOTE: When the failure is on a PCI card, the fault information for this fault is not stored in the FMA resource cache, nor is it stored on the XSCF's persistent storage. Instead, the fault information is stored only in the hardware descriptor (HWD) of the domain that the device belongs to. The HWD itself is cleared of all information about faulty devices when the domain is powered down (this includes platform resets and platform power-downs). The HWD information about this device being faulty is also cleared when a hot-plug operation is performed on the faulty PCIe card from within Solaris running on the domain. However, even though the fault information is not stored in the FMA resource cache or XSCF persistent storage, the fault occurrence is logged in the relevant error logs and fault logs.

NOTE: OBP stops using the PCI card when the failure is detected. However, after the domain is powered off and on again, the domain will start using the card again.

Indicted Hardware

If the fault is detected on a PCI card, the PCI card should be replaced.
For M12-1 systems, the MBU should be replaced.
For M12-2/M12-2S systems, the CMUL should be replaced.
For the PCI expansion box, the link card should be replaced.
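
The replacement rules above can be written down as a small lookup. The following Python sketch is illustrative only: the names `FRU_BY_PLATFORM` and `fru_to_replace` are hypothetical and are not part of any Fujitsu or Oracle tool. It also assumes that the per-platform FRU applies when the fault is not on a PCI card, which is one reading of the list above.

```python
# Hypothetical lookup, transcribed from the "Indicted Hardware" list.
# Assumption: the platform FRU applies only when the fault is NOT on a PCI card.
FRU_BY_PLATFORM = {
    "M12-1": "MBU",
    "M12-2": "CMUL",
    "M12-2S": "CMUL",
    "PCI expansion box": "link card",
}


def fru_to_replace(platform: str, fault_on_pci_card: bool) -> str:
    """Return the FRU this article names for the given case."""
    if fault_on_pci_card:
        return "PCI card"
    try:
        return FRU_BY_PLATFORM[platform]
    except KeyError:
        raise ValueError(f"platform not covered by this article: {platform!r}") from None
```

For example, `fru_to_replace("M12-2S", fault_on_pci_card=False)` returns `"CMUL"`.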

If the fault was detected while running POST or while running OBP, then the fault may be detected on the following:

- For M12-1 systems, the PCIe switches are on the MBU.
- For M12-2/M12-2S systems, the PCIe switches are on the CMUL.
- For the PCI expansion box, the PCIe switch is on the link card.

For M12-1 systems:

- If this is PCIe switch chip 0, then built-in disks, built-in USB i/f, built-in GbE port#0 and #1, and the first PCIe slot are deconfigured.

- If this is PCIe switch chip 1, then built-in GbE port#2 and #3 and the second and third PCIe slots are deconfigured.

For M12-2/M12-2S systems:

- If the fault was detected on PCIe switch chip 0 on CMUL, then all the built-in IO behind the PCIe switch chip is deconfigured (built-in disks, built-in GbE, and built-in USB i/f).

- If the fault was detected on PCIe switch chip 1, then the first, second and third PCIe slots are deconfigured.

- If the fault was detected on PCIe switch chip 2, then the fourth, fifth, sixth and seventh PCIe slots are deconfigured.

- If the fault was detected on PCIe switch chip 3 on M12-2, then the eighth, ninth, tenth and eleventh PCIe slots are deconfigured.

- If the fault was detected on PCIe switch chip 3 on M12-2S, then the eighth PCIe slot is deconfigured.

- If the fault was detected on a PCIe switch chip on a link card, then all the slots behind the PCIe switch chip are deconfigured, i.e. all the PCIe slots in the PCI expansion box connected via this link card are deconfigured.
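
The switch-chip-to-IO mapping listed above can be summarized as a lookup table, which may help when working out what a domain has lost after this fault. This is a minimal Python sketch transcribed from the lists above. The names `DECONFIGURED_IO` and `deconfigured_io` are hypothetical, and the link card case is left out because the affected slots depend on the attached PCI expansion box.

```python
# Hypothetical table: (model, PCIe switch chip number) -> deconfigured IO,
# transcribed from the lists in this article. Not a vendor-provided API.
def _slots(*ordinals):
    return [f"{o} PCIe slot" for o in ordinals]


DECONFIGURED_IO = {
    ("M12-1", 0): ["built-in disks", "built-in USB i/f",
                   "built-in GbE port#0", "built-in GbE port#1",
                   *_slots("first")],
    ("M12-1", 1): ["built-in GbE port#2", "built-in GbE port#3",
                   *_slots("second", "third")],
}

# Chips 0 to 2 behave the same on M12-2 and M12-2S; only chip 3 differs.
for model in ("M12-2", "M12-2S"):
    DECONFIGURED_IO[(model, 0)] = ["built-in disks", "built-in GbE",
                                   "built-in USB i/f"]
    DECONFIGURED_IO[(model, 1)] = _slots("first", "second", "third")
    DECONFIGURED_IO[(model, 2)] = _slots("fourth", "fifth", "sixth", "seventh")
DECONFIGURED_IO[("M12-2", 3)] = _slots("eighth", "ninth", "tenth", "eleventh")
DECONFIGURED_IO[("M12-2S", 3)] = _slots("eighth")


def deconfigured_io(model: str, chip: int) -> list:
    """Return the IO that this article lists as deconfigured for a chip."""
    try:
        return DECONFIGURED_IO[(model, chip)]
    except KeyError:
        raise ValueError(f"no entry for {model!r} chip {chip}") from None
```

For example, `deconfigured_io("M12-2S", 3)` returns only the eighth PCIe slot, while the same chip number on an M12-2 covers the eighth through eleventh slots.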

NOTE: When the code is 03000001, OBP detects the failure while probing PCIe devices. This is done by generic OBP code, and this specific portion of OBP is not capable of narrowing down the failure location. As a result, OBP notifies XSCF that the suspect is the end point (the PCIe device). This is adequate because the failure rate of a PCIe card is higher than that of the other suspects. However, there is a relatively small chance that another part of the PCIe fabric connected to the end point, such as a PCIe switch on the motherboard or the PCIe cable between CMUU and CMUL, is the root cause.

Suggested Action for System Administrator

: The recommended service action for this event is to schedule replacement of the affected component(s) at the earliest possible convenience. Although the hardware may be functioning, it is neither intended nor recommended that the faulted component(s) remain in the system for a prolonged period of time.

Refer to the following document for the latest procedures for displaying event content in preparation for submitting a service request and applying any post-repair actions that may be required.

PSH Procedural Article for Fujitsu M10 Diagnosis (Doc ID 1525156.1)

Attachments

This solution has no attachment