M12-io.pciex.switch.fe - Fatal error within a PCIe switch chip

Asset ID:	1-79-2218450.1
Update Date:	2017-08-14
Keywords:

Solution Type Predictive Self-Healing Sure

Solution 2218450.1 : M12-io.pciex.switch.fe - Fatal error within a PCIe switch chip

Applies to:

Fujitsu SPARC M12-2
Fujitsu SPARC M12-2S
Fujitsu SPARC M12-1
SPARC

Purpose

Provide additional information for message ID: M12-io.pciex.switch.fe

Fujitsu fault codes:

0200242d, 0200242e, 0200242f, 02002430, 02002434, 02002436,
02002437

Details

Type

: Hardware Fault; io.pciex.switch.fe

Severity

: Minor

Description

: Fault due to a fatal error within a PCIe switch chip.

Automated Response

: If the fault is detected by Power On Self Test (POST), no immediate action is taken. Otherwise, the domain using the related portion of the PCIe switch chip is reset. On M12-2/2S systems with two CPU chips installed in a box, one PCIe switch chip is shared by two IOCs (root complexes). Therefore, if an entire PCIe switch chip fails, the failure can be visible from both IOCs. If only a specific portion of the PCIe switch chip fails, such as the upper port that connects the IOC to the PCIe switch chip, only one of the two IOCs sharing the chip sees the failure.

Impact

: All the IO behind the PCIe switch chip is deconfigured. The fault information for this fault is not stored in the FMA resource cache, nor is it stored on the XSCF's persistent storage. Instead, the fault information is stored only in the hardware descriptor (HWD) of the domain that the device belongs to. The HWD itself is cleared of all information about faulty devices when the domain is powered down (this includes platform resets and platform power-downs). The HWD information about this device being faulty is also cleared when a hot-plug operation is performed on the faulty PCIe card from within Solaris running on the domain. However, even though the fault information is not stored in the FMA resource cache or XSCF persistent storage, the fault occurrence is logged in the relevant error logs and fault logs.

Indicted Hardware

For M12-1 systems, the MBU is marked for replacement.

For M12-2/M12-2S systems, this error could be due to an unseated or misseated PCI cable between the CMUL and the CMUU. If the error occurred immediately after maintenance, the initial action should be to check the PCI cables.
For M12-2/M12-2S systems, the CMUL is marked for replacement.
For PCI-Box, the Linkcard is marked for replacement.

If the fault was detected while running POST, such events are listed in the following categories:
- fe-linkup-err: 0200242d The PCIe linkup process failed
- fe-linkup-disconnect-err: 0200242e, 02002437 The link disconnect that occurs during the POST linkup process failed
- fe-no-access: 0200242f The PCIe device was not accessible
- fe-linkup-tlu-ue: 02002430 The TLU UE (Uncorrectable Error) status was not zero during the PCIe linkup process
- fe-reg-cmp-err: 02002434, 02002436 An error was detected when a value written to a register in the PCIe switch chip did not match the value read back
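The category list above can be expressed as a simple lookup, for example when triaging codes pulled from error logs. This is a minimal sketch, not part of any Fujitsu or Oracle tool: the names `POST_FAULT_CATEGORIES` and `classify_fault_code` are hypothetical, and the codes are the eight-digit values listed under "Fujitsu fault codes".

```python
from typing import Optional

# Fault code -> POST category, transcribed from this article.
# Names here are illustrative only; they do not come from XSCF or Solaris.
POST_FAULT_CATEGORIES = {
    "0200242d": "fe-linkup-err",
    "0200242e": "fe-linkup-disconnect-err",
    "02002437": "fe-linkup-disconnect-err",
    "0200242f": "fe-no-access",
    "02002430": "fe-linkup-tlu-ue",
    "02002434": "fe-reg-cmp-err",
    "02002436": "fe-reg-cmp-err",
}


def classify_fault_code(code: str) -> Optional[str]:
    """Return the POST category for a fault code, or None if it is not listed."""
    # Normalize whitespace and case so "0200242D" matches "0200242d".
    return POST_FAULT_CATEGORIES.get(code.strip().lower())


if __name__ == "__main__":
    for sample in ("0200242D", "02002436", "ffffffff"):
        print(sample, "->", classify_fault_code(sample))
```

A code that is not in the table returns `None`, which signals that it does not belong to this message ID's POST categories.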

For M12-1 systems:

If this is PCIe switch chip 0, then built-in disks, built-in USB i/f, built-in GbE port#0 and #1, and the first PCIe slot are deconfigured. If this is PCIe switch chip 1, then built-in GbE port#2 and #3, the second and third PCIe slots are deconfigured.
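As an illustration, the M12-1 mapping described above can be captured in a small table keyed by switch chip number. This is only a sketch under the assumption that the chip number is already known from the fault report; `M12_1_DECONFIGURED` and `deconfigured_io_m12_1` are hypothetical names, and the device labels are descriptive strings rather than actual device paths.

```python
from typing import Dict, List

# PCIe switch chip number -> IO deconfigured on an M12-1, per this article.
# Labels are descriptive only, not Solaris or XSCF device paths.
M12_1_DECONFIGURED: Dict[int, List[str]] = {
    0: [
        "built-in disks",
        "built-in USB i/f",
        "built-in GbE port#0",
        "built-in GbE port#1",
        "first PCIe slot",
    ],
    1: [
        "built-in GbE port#2",
        "built-in GbE port#3",
        "second PCIe slot",
        "third PCIe slot",
    ],
}


def deconfigured_io_m12_1(switch_chip: int) -> List[str]:
    """Return the IO deconfigured when the given M12-1 switch chip faults."""
    if switch_chip not in M12_1_DECONFIGURED:
        raise ValueError(f"M12-1 has no PCIe switch chip {switch_chip}")
    # Return a copy so callers cannot modify the reference table.
    return list(M12_1_DECONFIGURED[switch_chip])


if __name__ == "__main__":
    for chip in (0, 1):
        print(f"switch chip {chip}:", ", ".join(deconfigured_io_m12_1(chip)))
```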

For M12-2/2S systems:

If the fault was detected on PCIe switch chip 0 on the CMUL, then the first and second PCIe slots and the built-in IO behind the PCIe switch chip (built-in SAS chip#0 and built-in USB i/f) are deconfigured.
If the fault was detected on PCIe switch chip 1, then the third, fourth and ninth PCIe slots and onboard 10GbE#0 are deconfigured.
If the fault was detected on PCIe switch chip 2, then SAS chip#1 and the fifth, sixth and (if it’s M12-2) tenth PCIe slots are deconfigured.
If the fault was detected on PCIe switch chip 3, then the 10GbE#1 and the seventh and (if it’s M12-2) eighth, and eleventh PCIe slots are deconfigured.
If the fault was detected on a PCIe switch chip on a Linkcard, then all the slots behind the PCIe switch chip are deconfigured. In other words, all the PCIe slots in the PCI expansion box connected via this Linkcard are deconfigured.

For PCI-Box:

If the fault was detected on a PCIe switch chip on a Linkcard, then all the slots behind the PCIe switch chip are deconfigured, i.e., all the PCIe slots in the PCI expansion box connected via this Linkcard are deconfigured.

Note: For PCI-Box, PCIe switches on the link board and IOB are not tested by POST. They are tested by PCI expansion box firmware instead.

Suggested Action for System Administrator

: The recommended service action for this event is to schedule replacement of the affected component(s) at the earliest possible convenience. Although the hardware may be functioning, it is neither intended nor recommended that the faulted component(s) remain in the system for a prolonged period of time.

Refer to the following document for the latest procedures for displaying event content in preparation for submitting a service request and applying any post-repair actions that may be required.

PSH Procedural Article for Fujitsu M10 Diagnosis (Doc ID 1525156.1)

Attachments

This solution has no attachment