
Asset ID: 1-72-1672818.1
Update Date: 2017-10-04
Solution Type: Problem Resolution

Solution 1672818.1: Troubleshooting "Fatal error has occured in: PCIe fabric" panics on Tx-x servers


Related Items
  • SPARC T4-1
  • SPARC T3-4
  • Netra SPARC T4-2 Server
  • Netra SPARC T4-1 Server
  • SPARC T3-1B
  • Netra SPARC T3-1B
  • SPARC T3-2
  • Netra SPARC T4-1B
  • SPARC T5-4
  • SPARC T4-1B
  • SPARC T5-2
  • SPARC T4-4
  • SPARC T4-2
  • SPARC T3-1
  • Netra T3-1
  • SPARC T5-8
  • SPARC T5-1B
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T4


"Fatal error has occured in: PCIe fabric" is a common panic type with multiple possible causes, both hardware and software.  This document will assist in determining what caused the panic.

In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-8978367222>

Applies to:

SPARC T4-4 - Version All Versions to All Versions [Release All Releases]
SPARC T4-2 - Version All Versions to All Versions [Release All Releases]
SPARC T4-1 - Version All Versions to All Versions [Release All Releases]
SPARC T3-1 - Version All Versions to All Versions [Release All Releases]
SPARC T3-2 - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (64-bit)

Symptoms

Below is an example of a PCIe fabric panic:

May  9 08:21:00 xxxxxxx^Mpanic[cpu120]/thread=2a104e41c80:
May  9 08:21:00 xxxxxxx unix: [ID 198415 kern.notice] Fatal error has occured in: PCIe fabric.(0x1)(0x101)
May  9 08:21:00 xxxxxxx unix: [ID 100000 kern.notice]
May  9 08:21:00 xxxxxxx genunix: [ID 723222 kern.notice] 000002a104e416f0 px:px_err_panic+1ac (19c5400, 13aa000, 101, 2a104e417a0, 1, 0)
May  9 08:21:00 xxxxxxx genunix: [ID 179002 kern.notice]   %l0-3: 0000009980001602 00000000019c5400 0000000000000000 0000000000000001
May  9 08:21:00 xxxxxxx %l4-7: 0000000000000000 000000000190d400 0000000000000001 0000000000000000
May  9 08:21:00 xxxxxxx genunix: [ID 723222 kern.notice] 000002a104e41800 px:px_err_fabric_intr+1c0 (6012a13d280, 1, 19c5800, 1, 101, 100)
May  9 08:21:00 xxxxxxx genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000100 0000000000000001 00000000019c5960 00000000019c5800
May  9 08:21:00 xxxxxxx %l4-7: 00000000019c5958 00000000019c5800 0000000000000001 00000300147bd2e0
May  9 08:21:00 xxxxxxx genunix: [ID 723222 kern.notice] 000002a104e41970 px:px_msiq_intr+1e8 (300147c6638, 1, 30011727d38, 30011727d38, 6012a140a88, 2)
May  9 08:21:00 xxxxxxx genunix: [ID 179002 kern.notice]   %l0-3: 000006012a15f660 000006012a0654a0 000006012a15dd60 000002a104e41a80
May  9 08:21:00 xxxxxxx %l4-7: 0000000000000000 0000000003820000 0000000000000000 0000000000000030

Check the second number (in this case 0x101) against the known issues for this panic type before continuing; for example, "PCIe fabric.(0x1)(0x41)" has a known issue documented in Doc ID 1519563.1.  Since 0x101 does not have a known issue listed, we need to continue the analysis.
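
One quick way to locate the panic string and its codes is to search the message logs directly.  The sketch below is an assumption (default /var/adm/messages location, panic not yet rotated out of the logs), not a command taken from the original SR:

# Search the current and rotated message logs for the PCIe fabric panic string
# (note: the kernel message itself contains the misspelling "occured")
grep "Fatal error has occured in: PCIe fabric" /var/adm/messages*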

Next, look at the FMA data and the console logs.  In this instance the customer only provided an Explorer, so we will concentrate on FMA.
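
As a starting point on the FMA side, the standard Solaris fault management commands below give an overview before digging into individual error events (output layout varies between Solaris releases, and an Explorer typically captures equivalent output):

# Any currently diagnosed faults and their suspect FRUs
fmadm faulty

# Summary of logged fault events and of the underlying error events
fmdump -v
fmdump -e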

Run 'fmdump -eV' and look through the output to find the longest device paths, then compare them against the output of 'prtdiag -v' and Doc ID 1005907.1 to map each path to a slot or onboard device.  In this SR the following paths were found (a command sketch for extracting them follows the list):

/pci@600/pci@1/pci@0/pci@4/network@0  ---------------------> EM4:  /pci@600/pci@1/pci@0/pci@4

/pci@600/pci@2/pci@0/pci@3/network@0  ---------------------> NET2: /pci@600/pci@2/pci@0/pci@3/network@0

/pci@600/pci@2/pci@0/pci@5/pci@0/pci@2/SUNW,qlc@0,1  --> EM5:  /pci@600/pci@2/pci@0/pci@5

/pci@600/pci@2/pci@0/pci@5/pci@0/pci@3/network@0   -----> EM5:  /pci@600/pci@2/pci@0/pci@5
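
A minimal sketch of how those device paths can be pulled out of the FMA error log and mapped back to slots.  It assumes only that the error events contain the paths as text including 'pci@' and that /pci@600 is the root complex of interest in this particular example; adjust to match your own fmdump output:

# List device paths mentioned in the error events, most frequent first
fmdump -eV | grep 'pci@' | sort | uniq -c | sort -rn

# Show how the platform maps paths under this root complex to slots/devices
prtdiag -v | grep 'pci@600'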

Cause

As these paths point to multiple devices that perform the same function (network), this is most likely a driver issue.

Had there been multiple devices with different functions (network, HBA, other), then the system firmware should be checked.  If the firmware is already up to date, open an SR with Oracle Support.

Had there been only one path, the component on that path is more likely to be at fault, but the drivers should still be updated first.
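
A rough sketch of checking the installed driver and firmware levels from Solaris before deciding between a software and a hardware cause.  The placeholder <driver_name> is hypothetical, and command availability differs by release ('pkg' exists only on Solaris 11):

# Version of the loaded driver module (<driver_name> is a placeholder,
# e.g. the network or HBA driver implicated by the device paths)
modinfo | grep -i <driver_name>

# OpenBoot PROM / firmware version visible from the OS
prtconf -V

# Solaris 11 only: current package/SRU level of the system
pkg list entire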

Solution

In the SR from which this document was developed, network driver patches were missing and the customer was advised to apply them.
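
As an illustrative sketch only (not the exact commands from the SR), the presence of the relevant driver patches or package updates can be checked as below; <patch_id> and <driver_package> are placeholders:

# Solaris 10: confirm whether the driver patch is installed
showrev -p | grep <patch_id>

# Solaris 11: dry-run check for available updates to the driver package
pkg update -nv <driver_package>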

