PCIEX-8000-MH causing AMBER LED On, on T5-x Systems

Asset ID:	1-72-1604929.1
Update Date:	2017-08-16
Keywords:

Solution Type Problem Resolution Sure

Solution 1604929.1 : PCIEX-8000-MH causing AMBER LED On, on T5-x Systems

Applies to:

SPARC T5-2 - Version All Versions to All Versions [Release All Releases]
SPARC T5-4 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

PCIEX-8000-MH is reported against a network port and flags every component along its device path as faulted, each with a different certainty. The amber fault LED is on.

Cause

The FMA events look like this:

=======================================================================

'fmdump -e' repeatedly shows 'ereport.io.service.degraded' followed by 'ereport.io.service.restored', for example:

<explorer-output>/fma $more fmdump-e.out

Nov 10 17:42:23.3151 ereport.io.service.degraded
Nov 10 17:42:23.9557 ereport.io.service.restored
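
To see how often the service flapped, the two event classes can be counted in the saved 'fmdump -e' output. The following is a minimal sketch, not an Oracle tool: the function name count_service_flaps is hypothetical, and it assumes the event class is the last whitespace-separated field on each line, as in the excerpt above.

```shell
#!/bin/sh
# Count degraded/restored ereports in saved `fmdump -e` output read from stdin.
# Assumption: the event class is the last field on each line.
count_service_flaps() {
    awk '
        $NF == "ereport.io.service.degraded" { degraded++ }
        $NF == "ereport.io.service.restored" { restored++ }
        END { printf "degraded=%d restored=%d\n", degraded, restored }
    '
}
```

Usage, from the explorer fma directory: count_service_flaps < fmdump-e.out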

==========

The verbose fmdump output for errlog shows:

<explorer-output>/fma/var/fm/fmd $fmdump -V -t 11/10/2013 -T 11/11/2013 errlog
TIME CLASS
Nov 10 2013 07:42:23.315125300 ereport.io.service.degraded
nvlist version: 0
class = ereport.io.service.degraded
ena = 0x7afcf19a75a00801
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0,1
(end detector)

__ttl = 0x1
__tod = 0x527f38df 0x12c86e34

Nov 10 2013 07:42:23.955726041 ereport.io.service.restored
nvlist version: 0
class = ereport.io.service.restored
ena = 0x7aff547fd0321c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0,1
(end detector)

__ttl = 0x1
__tod = 0x527f38df 0x38f738d9
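
Both events above name the same device-path. To confirm that across a longer log, the detector paths can be tallied from saved 'fmdump -V' output. This is a rough sketch under the assumption that each detector line has the form 'device-path = <path>' as shown above; the function name count_device_paths is hypothetical.

```shell
#!/bin/sh
# Tally how many ereports name each device-path in `fmdump -V` output (stdin).
# Assumption: detector lines look like "device-path = <path>".
count_device_paths() {
    awk '
        $1 == "device-path" && $2 == "=" { n[$3]++ }
        END { for (p in n) print n[p], p }
    ' | sort -rn
}
```

Usage: save the 'fmdump -V' output shown above to a file (fmdump-V.out is a placeholder name) and run count_device_paths < fmdump-V.out. A single dominant path points at one device rather than a fabric-wide problem.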

=======================

The faults log in the ILOM snapshot shows:

<ilom-snapshot>/fma $more @persist@faultdiags@faults.log

2013-11-16/12:30:42 767966d3-3d23-6b9e-c75f-e1a128679899 PCIEX-8000-MH
list_sz = 5

fault[0] = fault.io.pciex.device-interr-unaf <<<<<<<< too many correctable events associated with the device..
certainty = 22.0 %
FRU = /SYS/MB
ASRU = dev:////pci@300/pci@1/pci@0/pci@4/pci@0/pci@8/network@0,1 <<<<<<<<<<<<< this is network controller chip on RIO (in T5-4).
RESOURCE = hc:///chassis=0/motherboard=0/cpuboard=0/chip=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0/pciexbus=2/pciexdev=4/pc
fru_part_number = xxx
fru_serial_number = xxx
chassis_serial_number = xxx

fault[1] = fault.io.pciex.device-interr-unaf
certainty = 22.0 %
FRU = /SYS/MB
ASRU = dev:////pci@300/pci@1/pci@0/pci@4/pci@0
RESOURCE = hc:///chassis=0/motherboard=0/cpuboard=0/chip=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0/pciexbus=2/pciexdev=4/pc
fru_part_number = xxx
fru_serial_number = xxx
chassis_serial_number = xxx

fault[2] = fault.io.pciex.device-interr-unaf
certainty = 22.0 %
FRU = /SYS/MB
ASRU = dev:////pci@300/pci@1/pci@0/pci@4
RESOURCE = hc:///chassis=0/motherboard=0/cpuboard=0/chip=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0/pciexbus=2/pciexdev=4/pc
fru_part_number = xxx
fru_serial_number = xxx
chassis_serial_number = xxx

fault[3] = fault.io.pciex.device-interr-unaf
certainty = 22.0 %
FRU = /SYS/MB
ASRU = dev:////pci@300/pci@1/pci@0
RESOURCE = hc:///chassis=0/motherboard=0/cpuboard=0/chip=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=0
fru_part_number = xxx
fru_serial_number = xxx
chassis_serial_number = xxx

fault[4] = fault.io.pciex.device-interr-unaf
certainty = 11.0 %
FRU = /SYS/PM0
ASRU = dev:////pci@300/pci@1
RESOURCE = hc:///chassis=0/motherboard=0/cpuboard=0/chip=0/hostbridge=0/pciexrc=0
fru_part_number = xxx
fru_serial_number = xxx
chassis_serial_number = xxx
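
The suspect list is easier to read when the certainty values are added up per FRU. The sketch below does only that arithmetic; it is a reading aid, not an FMA metric. It assumes the 'certainty = N %' and 'FRU = path' line layout shown above, with certainty preceding FRU inside each fault entry. The function name sum_certainty_by_fru is hypothetical.

```shell
#!/bin/sh
# Sum the certainty percentages per FRU from an ILOM faults.log excerpt (stdin).
# Assumption: within each fault[] entry, "certainty = N %" precedes "FRU = path".
sum_certainty_by_fru() {
    awk '
        $1 == "certainty" { c = $3 }
        $1 == "FRU"       { total[$3] += c }
        END { for (f in total) printf "%s %.1f\n", f, total[f] }
    ' | sort
}
```

For the five suspects listed above this gives /SYS/MB 88.0 and /SYS/PM0 11.0, which is why the motherboard looks like the obvious replacement candidate even though, in this case, the root cause turned out to be configuration.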

===================================

At first glance this looks like either a network port fault or a driver issue.

In this SR, we replaced the RIO, which contains the network controller, but that did not help. A driver issue as described in document 1951204.1 (published later) is another possible cause.

=======================================================================

Investigate further: check whether LDOMs are configured and whether the conditions listed in document 1593243.1 are met.

In this case the conditions are met:

<explorer-output>/sysconfig $grep extended-mapin ldm_list_-l.out
  extended-mapin-space=on
  extended-mapin-space=off <<<<<<<<<<<< this should be on, as per document 1593243.1
  extended-mapin-space=on
  extended-mapin-space=on
  extended-mapin-space=on
  extended-mapin-space=on
  extended-mapin-space=on
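
The grep output above shows that one domain has the setting off, but not which one. A small awk filter over the same 'ldm list -l' output can name it. This is a sketch under a stated assumption about the file layout: each domain section starts with an unindented header line beginning with NAME, and the next line starts with the domain name. The function name find_mapin_off is hypothetical.

```shell
#!/bin/sh
# Print the name of every domain whose extended-mapin-space is off,
# reading saved `ldm list -l` output from stdin.
# Assumption: an unindented "NAME ..." header is followed by a line
# whose first field is the domain name.
find_mapin_off() {
    awk '
        /^NAME/   { want_name = 1; next }
        want_name { domain = $1; want_name = 0 }
        /extended-mapin-space=off/ { print domain }
    '
}
```

Usage, from the explorer sysconfig directory: find_mapin_off < ldm_list_-l.out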

Solution

1) Reconfigure the affected domain (lasun318 in this example) with 'extended-mapin-space=on', as per document 1593243.1.

2) Clear the faults from Solaris FMA and ILOM FMA (document 1483194.1).

3) If the problem persists, log a service request with Oracle Support for further investigation.
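
Steps 1 and 2 can be sketched as a dry run that prints the Solaris-side commands instead of executing them. The function name print_fix_plan and its arguments are hypothetical. The fault UUID must be taken from 'fmadm faulty' on the affected system. Whether the domain has to be stopped or rebooted for the change to take effect is covered in document 1593243.1 and is not assumed here, and the ILOM-side clearing procedure in document 1483194.1 is not reproduced.

```shell
#!/bin/sh
# Dry run: print the commands for steps 1 and 2 rather than running them.
# $1 = domain name, $2 = fault UUID as reported by `fmadm faulty`.
print_fix_plan() {
    domain=$1
    uuid=$2
    echo "ldm set-domain extended-mapin-space=on $domain"   # step 1: change the setting
    echo "ldm list -l $domain | grep extended-mapin-space"  # verify the new value
    echo "fmadm faulty"                                     # step 2: list outstanding faults
    echo "fmadm repair $uuid"                               # mark the fault repaired in Solaris FMA
}
```

Usage: print_fix_plan lasun318 <uuid>. Review the printed commands against the referenced documents before running any of them.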

References

<NOTE:1593243.1> - Solaris 10 and 11 Virtual Network Switch Can Corrupt TCP Packets Or Hang Interface When 'extended-mapin-space' is Off
<NOTE:1021334.1> - PCIEX-8000-MH - PCIEX subsystem problem
<NOTE:1951204.1> - Solaris 10, Solaris 11, and ZFS Storage Appliance Software (ZFSSA) Using the ixgbe(7D) Driver may Experience a NIC chip Reset and Report FMA Errors
