Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1642066.1
Update Date:2017-07-11
Keywords:

Solution Type  Problem Resolution Sure

Solution  1642066.1 :   ixgbe interface faulted because of message "Problem: Network adapter has been stopped because it has overheated"  


Related Items
  • SPARC M5-32
  • SPARC M6-32
  • SPARC T5-8
  • SPARC T5-4
  • SPARC T5-2
Related Categories
  • PLA-Support>Sun Systems>SAND>Network>SN-SND: Sun Network Interfaces


FMA has faulted an ixgbe interface with message ID PCIEX-8000-0A. The cause is a software bug, not a hardware fault.

In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-8243904081>

Applies to:

SPARC M5-32 - Version All Versions and later
SPARC M6-32 - Version All Versions and later
SPARC T5-2 - Version All Versions and later
SPARC T5-4 - Version All Versions and later
SPARC T5-8 - Version All Versions and later
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on SPARC (64-bit)
All network interface PCIe cards and onboard interfaces that use the Intel Twinville 10G X540 dual Ethernet controller are affected:

Dual 10-Gigabit Ethernet Base-T PCIe Gen2 PN 7014776 / 7070006
Dual 10-Gigabit Base-T PCIe 2.0 ExpressModule PN 7014780 / 7069995

Symptoms

The system reports a fault against the motherboard (if the network interface is onboard) or against the PCIe card (if the network interface is on a PCIe card).

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Dec 12 02:32:52 c644edfb-b986-c3c1-9b9b-e4d3b0514a46  PCIEX-8000-0A  Critical

Problem Status    : solved
Diag Engine       : eft / 1.16
System
   Manufacturer  : Oracle-Corporation
   Name          : SPARC-T5-2
   Part_Number   : 31731774+1+1
   Serial_Number : 1234567890
   Host_ID       : 12345678

----------------------------------------
Suspect 1 of 3 :
  Fault class : fault.io.pciex.device-interr
  Certainty   : 40%
  Affects     : dev:////pci@300/pci@1/pci@0/pci@1/network@0,1
  Status      : out of service, but associated components no longer faulty

  FRU
    Location         : "/SYS/MB"
    Manufacturer     : unknown
    Name             : unknown
    Part_Number      : 7063306
    Revision         : 04
    Serial_Number    : 465769T+1317UL03NT
    Chassis
       Manufacturer  : Oracle Corporation
       Name          : SPARC T5-2
       Part_Number   : 31731774+1+1
       Serial_Number : 1234567890
       Status        : faulty
----------------------------------------
Suspect 2 of 3 :
  Fault class : fault.io.pciex.device-interr
  Certainty   : 40%
  Affects     : dev:////pci@300/pci@1/pci@0
  Status      : faulted but still in service

  FRU
    Location         : "/SYS/MB"
    Manufacturer     : unknown
    Name             : unknown
    Part_Number      : 7063306
    Revision         : 04
    Serial_Number    : 465769T+1317UL03NT
    Chassis
       Manufacturer  : Oracle Corporation
       Name          : SPARC T5-2
       Part_Number   : 31731774+1+1
       Serial_Number : 1234567890
       Status        : faulty
----------------------------------------
Suspect 3 of 3 :
  Fault class : fault.io.pciex.device-interr
  Certainty   : 20%
  Affects     : dev:////pci@300/pci@1
  Status      : faulted but still in service

  FRU
    Location         : "/SYS/MB"
    Manufacturer     : unknown
    Name             : unknown
    Part_Number      : 7063306
    Revision         : 04
    Serial_Number    : 465769T+1234567
    Chassis
       Manufacturer  : Oracle Corporation
       Name          : SPARC T5-2
       Part_Number   : 31731774+1+1
       Serial_Number : 1234567890
       Status        : faulty

Description : A problem was detected for a PCIEX device.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
             this fault

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
             Please refer to the associated reference document at
             http://support.oracle.com/msg/PCIEX-8000-0A for the latest
             service procedures and policies regarding this diagnosis.

 

"link up" messages like the following appear every hour at exactly the same minute and second, eventually followed by a message that the network adapter has "overheated":

$ grep ixgbe messages
...
Dec 11 21:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 11 21:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 11 22:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 11 22:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 11 23:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 11 23:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, full duplex
Dec 12 00:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 12 00:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 12 01:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 1000 Mbps, full duplex
Dec 12 01:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 12 02:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Problem: Network adapter has been stopped because it has overheated                                          <----
Dec 12 02:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Action: Restart the computer. If the problem persists, power off the system and replace the adapter          <----
Dec 12 02:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 12 02:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 12 02:32:52 abcde1 genunix: [ID 408114 kern.info] /pci@300/pci@1/pci@0/pci@1/network@0,1 (ixgbe1) down
Dec 12 03:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 12 03:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, full duplex
Dec 12 04:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 12 04:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, full duplex
Dec 12 05:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 1000 Mbps, full duplex
Dec 12 05:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 1000 Mbps, full duplex
Dec 12 06:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 1000 Mbps, full duplex
Dec 12 06:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 1000 Mbps, full duplex
Dec 12 07:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 1000 Mbps, full duplex
Dec 12 07:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 1000 Mbps, full duplex
Dec 12 08:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 12 08:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, full duplex
...
Dec 14 21:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 1000 Mbps, full duplex
Dec 14 22:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 14 22:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, full duplex
Dec 14 23:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Problem: Network adapter has been stopped because it has overheated                                          <----
Dec 14 23:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Action: Restart the computer. If the problem persists, power off the system and replace the adapter          <----
Dec 14 23:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 14 23:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 1000 Mbps, full duplex
Dec 14 23:32:52 abcde1 genunix: [ID 408114 kern.info] /pci@300/pci@1/pci@0/pci@1/network@0,1 (ixgbe1) down
Dec 15 00:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 15 00:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, full duplex
Dec 15 01:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 15 01:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, full duplex
Dec 15 02:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 15 02:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 15 03:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 15 03:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, full duplex
Dec 15 04:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 15 04:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 15 05:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 15 05:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 15 06:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 15 06:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 15 07:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, full duplex
Dec 15 07:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 15 08:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 15 08:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 1000 Mbps, full duplex
...
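
The hourly pattern in the excerpt above can be checked mechanically. The following is a minimal sketch, assuming syslog-style lines in the format shown; the function name link_up_offsets is hypothetical and not part of any Oracle tool. It counts "link up" notices per interface and minute:second offset, so a single offset with a high count points to a periodic trigger.

```shell
# Count "link up" notices per interface and minute:second offset.
# Reads syslog-style lines on stdin; prints "<interface> <mm:ss> <count>".
link_up_offsets() {
    awk '
        / link up,/ {
            split($3, t, ":")                 # $3 is the HH:MM:SS timestamp
            for (i = 2; i < NF; i++)          # interface name precedes "link up,"
                if ($i == "link" && $(i + 1) == "up,") ifname = $(i - 1)
            count[ifname " " t[2] ":" t[3]]++
        }
        END { for (k in count) print k, count[k] }
    ' | sort
}
```

A possible invocation, assuming the messages file is in the current directory: link_up_offsets < messages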

 

Changes

A network cable is connected, but the interface is not yet configured or in use.

Cause

At first glance this looks like a hardware problem with the network interface, but it is actually a software bug.
In this case the two 10 Gb interfaces net1 (ixgbe1) and net2 (ixgbe2) were intended to be used as a trunked interface.
The cables were already connected, but the trunk was not yet configured.
Two T5-2 systems at the same site each saw the overheated event twice.
On both systems the messages appeared at an hourly interval, precise to the second (see below).

The messages were triggered by Oracle Ops Center, which issues "dladm" commands against all network interfaces every hour, including unconfigured ones.
Bug 16743960 causes the many "link up, ... duplex" messages (see below).
This bug can also trigger the message "Network adapter has been stopped because it has overheated", which is false because the network adapter has not actually overheated.
The false message is a bug in the ixgbe driver (bug 18131062 / bug 17502286).
The "Network adapter has been stopped because it has overheated" event can in turn lead to the FMA fault PCIEX-8000-0A.

 

messages file

...
Dec 12 02:32:16 abcde1 SC Alert: [ID 438350 daemon.notice] Audit | minor: root : Open Session : object = "/SP/session/type" : value = "shell" : success
Dec 12 02:32:18 abcde1 SC Alert: [ID 665947 daemon.notice] Audit | minor: root : Close Session : object = "/SP/session/type" : value = "shell" : success
Dec 12 02:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Problem: Network adapter has been stopped because it has overheated                                          <----
Dec 12 02:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Action: Restart the computer. If the problem persists, power off the system and replace the adapter          <----
Dec 12 02:32:22 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe1 link up, 0 Mbps, unknown duplex
Dec 12 02:32:23 abcde1 SC Alert: [ID 438350 daemon.notice] Audit | minor: root : Open Session : object = "/SP/session/type" : value = "shell" : success
Dec 12 02:32:25 abcde1 mac: [ID 435574 kern.info] NOTICE: ixgbe2 link up, 0 Mbps, unknown duplex
Dec 12 02:32:28 abcde1 SC Alert: [ID 438350 daemon.notice] Audit | minor: root : Open Session : object = "/SP/session/type" : value = "shell" : success
Dec 12 02:32:30 abcde1 SC Alert: [ID 665947 daemon.notice] Audit | minor: root : Close Session : object = "/SP/session/type" : value = "shell" : success
Dec 12 02:32:30 abcde1 last message repeated 1 time
Dec 12 02:32:52 abcde1 genunix: [ID 408114 kern.info] /pci@300/pci@1/pci@0/pci@1/network@0,1 (ixgbe1) down                                                        <----
Dec 12 02:32:52 abcde1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-0A, TYPE: Fault, VER: 1, SEVERITY: Critical                                          <----
Dec 12 02:32:52 abcde1 EVENT-TIME: Thu Dec 12 02:32:52 CET 2013
Dec 12 02:32:52 abcde1 PLATFORM: SPARC-T5-2, CSN: 1234567890, HOSTNAME: abcde1
Dec 12 02:32:52 abcde1 SOURCE: eft, REV: 1.16
Dec 12 02:32:52 abcde1 EVENT-ID: c644edfb-b986-c3c1-9b9b-e4d3b0514a46
Dec 12 02:32:52 abcde1 DESC: A problem was detected for a PCIEX device.
Dec 12 02:32:52 abcde1 AUTO-RESPONSE: One or more device instances may be disabled
Dec 12 02:32:52 abcde1 IMPACT: Loss of services provided by the device instances associated with this fault
Dec 12 02:32:52 abcde1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/PCIEX-8000-0A for the latest service procedures and policies regarding this diagnosis.
Dec 12 02:32:52 abcde1 SC Alert: [ID 482699 daemon.alert] Fault | critical: Fault detected at time = Thu Dec 12 02:32:52 2013. The suspect components: /SYS/MB has fault.io.pciex.device-interr with probability=40, /SYS/MB has fault.io.pciex.device-interr with probability=40, /SYS/MB has fault.io.pciex.device-inte
Dec 12 02:32:58 abcde1 SC Alert: [ID 645471 daemon.error] Email | major: Alert rule 1: SMTP session failed with error: Code 0
Dec 12 02:32:59 abcde1 hwmgmtd[3244]: [ID 702911 daemon.notice] State change: service indicator: /SYS/SERVICE (ID: 232) changed state from "Off" (3) to "On" (4).
Dec 12 02:33:06 abcde1 SC Alert: [ID 438350 daemon.notice] Audit | minor: root : Open Session : object = "/SP/session/type" : value = "shell" : success
...

 

Both hosts were affected, and on each server the events occurred at exactly the same minute and second past the hour:

abcde1
Dec 12 02:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Problem: Network adapter has been stopped because it has overheated
Dec 12 02:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Action: Restart the computer. If the problem persists, power off the system and replace the adapter

Dec 14 23:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Problem: Network adapter has been stopped because it has overheated
Dec 14 23:32:22 abcde1 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Action: Restart the computer. If the problem persists, power off the system and replace the adapter


abcde2
Nov 21 00:38:47 abcde2 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Problem: Network adapter has been stopped because it has overheated
Nov 21 00:38:47 abcde2 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Action: Restart the computer. If the problem persists, power off the system and replace the adapter

Nov 21 07:38:47 abcde2 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Problem: Network adapter has been stopped because it has overheated
Nov 21 07:38:47 abcde2 ixgbe: [ID 611667 kern.warning] WARNING: ixgbe1: Action: Restart the computer. If the problem persists, power off the system and replace the adapter
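
The per-server comparison above can be reproduced from the messages files. This is a sketch only, assuming the log format shown; the function name overheat_offsets is hypothetical. For each host and interface it prints the number of "overheated" events and the number of distinct minute:second offsets at which they occurred. One distinct offset across several events suggests a scheduled trigger rather than a thermal condition.

```shell
# Summarize "overheated" warnings per host and interface.
# Reads syslog-style lines on stdin; prints
# "<host> <interface> <event count> <distinct mm:ss offsets>".
overheat_offsets() {
    awk '
        /Problem: Network adapter has been stopped because it has overheated/ {
            split($3, t, ":")                 # $3 is the HH:MM:SS timestamp
            ifname = $10                      # e.g. "ixgbe1:"
            sub(/:$/, "", ifname)
            key = $4 " " ifname               # $4 is the hostname
            off = t[2] ":" t[3]
            events[key]++
            if (!((key, off) in seen)) { seen[key, off] = 1; distinct[key]++ }
        }
        END { for (k in events) print k, events[k], distinct[k] }
    ' | sort
}
```

For the four events listed above, each host would show two events at one distinct offset.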


ereport of PCIEX-8000-0A event

$ more fmdump-eVu_c644edfb-b986-c3c1-9b9b-e4d3b0514a46.out

TIME CLASS
Dec 12 2013 02:32:22.247415010 ereport.io.service.lost
nvlist version: 0
class = ereport.io.service.lost
ena = 0x5ea87ea277d01001
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@300/pci@1/pci@0/pci@1/network@0,1
(end detector)

__ttl = 0x1
__tod = 0x52a91226 0xebf40e2

...


Ereport classes and their counts

  1 ereport.io.pciex.dl.btlp
  5 ereport.io.pciex.pl.re
  5 ereport.io.pciex.rc.ce-msg
  1 ereport.io.pciex.rc.mce-msg
  29 ereport.io.pci.fabric
  2 ereport.io.service.lost
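
A summary like the one above can be derived from saved "fmdump -eV" output. The sketch below assumes each ereport carries a "class = ..." line as in the excerpt; the function name ereport_class_counts is hypothetical.

```shell
# Count ereports per class from saved "fmdump -eV" output on stdin.
# Prints "<count> <class>", sorted by class name.
ereport_class_counts() {
    awk '
        $1 == "class" && $2 == "=" { n[$3]++ }   # one "class = ..." line per ereport
        END { for (c in n) print n[c], c }
    ' | sort -k2
}
```

A possible invocation against the file shown earlier: ereport_class_counts < fmdump-eVu_c644edfb-b986-c3c1-9b9b-e4d3b0514a46.out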

 
The affected path in this example is indeed ixgbe1 ...

$ grep /pci@300/pci@1/pci@0/pci@1/network@0,1 etc/path_to_inst
"/pci@300/pci@1/pci@0/pci@1/network@0,1" 1 "ixgbe"

 
...which is an onboard interface of the T5-2

# dladm show-phys -L
LINK DEVICE LOC
net0 ixgbe0 /SYS/MB
net1 ixgbe1 /SYS/MB            <----
net2 ixgbe2 /SYS/MB
net3 ixgbe3 /SYS/MB
net4 igb0 PCIE1
net5 igb1 PCIE1
net6 igb2 PCIE1
net7 igb3 PCIE1
net8 igb4 PCIE2
net9 igb5 PCIE2
net10 igb6 PCIE2
net11 igb7 PCIE2
net12 ixgbe4 PCIE3
net13 ixgbe5 PCIE3
net16 vsw0 --
net14 usbecm2 --
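
The lookup from device path to driver instance shown above can be wrapped in a small helper. This is a sketch, assuming the path_to_inst line format shown ("path" instance "driver"); the function name path_to_instance and the default file location in the second argument are assumptions.

```shell
# Map a physical device path to its driver instance name (e.g. ixgbe1).
# $1 = device path, $2 = path_to_inst file (assumed default: /etc/path_to_inst)
path_to_instance() {
    awk -v path="$1" '
        $1 == ("\"" path "\"") {              # exact match on the quoted path
            drv = $3
            gsub(/"/, "", drv)                # strip quotes around the driver name
            print drv $2                      # driver name + instance number
        }
    ' "${2:-/etc/path_to_inst}"
}
```

Unlike a plain grep, the exact comparison avoids matching longer paths that merely contain the given path as a substring.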

 
ixgbe1 has no link. No cable? A cable was connected, but nothing was configured.

# dladm show-phys -Z
LINK              ZONE      MEDIA                STATE      SPEED  DUPLEX    DEVICE
net11             global    Ethernet             unknown    0      unknown   igb7
net13             global    Ethernet             unknown    0      unknown   ixgbe5
net0              global    Ethernet             up         1000   full      ixgbe0
net3              global    Ethernet             unknown    0      unknown   ixgbe3
net9              global    Ethernet             unknown    0      unknown   igb5
net4              global    Ethernet             unknown    0      unknown   igb0
net10             global    Ethernet             unknown    0      unknown   igb6
net5              global    Ethernet             unknown    0      unknown   igb1
net1              global    Ethernet             unknown    0      unknown   ixgbe1         <---- No link
net7              global    Ethernet             unknown    0      unknown   igb3
net6              global    Ethernet             unknown    0      unknown   igb2
net12             global    Ethernet             unknown    0      unknown   ixgbe4
net8              global    Ethernet             unknown    0      unknown   igb4
net2              global    Ethernet             unknown    0      unknown   ixgbe2
net14             global    Ethernet             up         10     full      usbecm2
net16             global    Ethernet             up         1000   full      vsw0

 
Accordingly, no IP address is configured on it

# ipadm show-addr
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
net0/v4 static ok 10.xx.xx.xx/24
net14/v4 static ok 169.254.182.77/24
lo0/v6 static ok ::1/128
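
The two listings above can be cross-checked to find every ixgbe link that has no address object, which are the interfaces exposed to this issue in the scenario described. The following is a sketch that works on saved command output; the function name unconfigured_ixgbe_links is hypothetical, and the column names are taken from the "dladm show-phys" header shown above.

```shell
# List ixgbe links that have no address object configured.
# $1 = saved "dladm show-phys" output, $2 = saved "ipadm show-addr" output
# Prints "<link> <device> <state>" for each such link.
unconfigured_ixgbe_links() {
    awk '
        FILENAME == ARGV[1] {                 # first operand: ipadm show-addr
            if (FNR > 1) { split($1, a, "/"); has_addr[a[1]] = 1 }
            next
        }
        FNR == 1 {                            # dladm header: map column names
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["DEVICE"]) ~ /^ixgbe/ && !($(col["LINK"]) in has_addr) {
            print $(col["LINK"]), $(col["DEVICE"]), $(col["STATE"])
        }
    ' "$2" "$1"
}
```

Columns are located by header name, so the function should work with or without the ZONE column added by the -Z option.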

 

Solution

As a workaround, plumb the affected network interface ("ipadm create-ip <interface>").
Future Ops Center releases are reportedly planned to stop touching unused network interfaces, so upgrading Ops Center to such a release would be another workaround.

As a solution, install the fixes for the ixgbe bug 17502286 / bug 18131062 and for bug 16743960 once available.

  • Bug 17502286 / 18131062: Backport for S11.1 is Bug 18514650 which is fixed with Solaris 11.1 SRU 19.6 or later
  • Bug 16743960 is fixed with patch 150400-19 or later. The fix for Solaris 11 (Bug 16696074) is in Solaris 11.2 SRU 9 or later
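
Until the fixes are installed, the plumb workaround can be planned for all unused ixgbe links at once. The sketch below only prints the "ipadm create-ip" commands and does not run them. The function name plan_plumb_commands and both input files are assumptions: the first is expected to hold colon-separated "link:device" lines and the second one already plumbed interface name per line. On Solaris 11 such files might be produced with the parsable output of "dladm show-phys -p -o link,device" and "ipadm show-if -p -o ifname"; verify those options against the man pages of your release.

```shell
# Print (do not run) the workaround command for every ixgbe link
# that has no IP interface yet.
# $1 = file with "link:device" lines
# $2 = file with one already plumbed interface name per line
plan_plumb_commands() {
    awk -F: '
        FILENAME == ARGV[1] { plumbed[$1] = 1; next }   # first operand: plumbed list
        $2 ~ /^ixgbe/ && !($1 in plumbed) { print "ipadm create-ip " $1 }
    ' "$2" "$1"
}
```

Review the printed commands before running any of them, since plumbing an interface that is intended for a trunk or aggregation may conflict with the later configuration.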

References

<NOTE:1005907.1> - SPARC Platforms: Matrix of Recognized Device Paths
<NOTE:1467458.1> - Twinville(Intel) 10 GbE NIC's(copper ports) - Info
<BUG:18131062> - IXGBE1: PROBLEM: NETWORK ADAPTER HAS BEEN STOPPED BECAUSE IT HAS OVERHEATED
<BUG:16743960> - DLADM SHOW-LINKPROP RESPONDS SLOWER WHEN MACHINE HAS UNSET NETWORK INTERFACES
<BUG:17502286> - OVERTEMP CHECK FOR X540 NIC USES RESERVED BIT

  Copyright © 2018 Oracle, Inc.  All rights reserved.