Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1576683.1
Update Date:2014-04-02
Keywords:

Solution Type  Problem Resolution Sure

Solution  1576683.1 :   Windows BSOD WHEA_UNCORRECTABLE_ERROR(124) & correctable WHEA events when disabling network ports on X4170M2/X4270M2  


Related Items
  • Windows Server
  • Sun Fire X4270 M2 Server
  • Sun Fire X4170 M2 Server
Related Categories
  • PLA-Support>Sun Systems>x86>Server>SN-x64: MISC-SERVER




In this Document
Symptoms
Cause
Solution
References


Applies to:

Sun Fire X4270 M2 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire X4170 M2 Server - Version Not Applicable to Not Applicable [Release N/A]
Windows Server - Version 2008 x64 to 2008 x64
Windows Server - Version 2008 to 2008
Information in this document applies to any platform.

Symptoms

When unconnected network ports are disabled in Windows Server 2008 / 2008 R2 running on X4170 M2 or X4270 M2 hardware, a number of correctable WHEA events may be logged in the Windows System event log.

Under heavy network load this may also lead to uncorrectable WHEA events, which trigger a Windows Blue Screen of Death (BSOD) crash of type WHEA_UNCORRECTABLE_ERROR(124) and cause the system to reset unexpectedly.

These symptoms are not always seen and may depend on which network ports have been enabled or disabled in Windows. For example, the issue has been seen with the following configuration of the on-board network ports:


Local Area Connection 1  - ENABLED AND ACTIVE WITH CABLE
Local Area Connection 2  - DISABLED WITH NO CABLE ATTACHED
Local Area Connection 3  - DISABLED WITH NO CABLE ATTACHED
Local Area Connection 4  - ENABLED AND ACTIVE WITH CABLE

 

With this configuration, you may start to see a number of correctable WHEA events in the Windows System event log, similar to this:

 

Log Name:      System
Source:        Microsoft-Windows-WHEA-Logger
Date:          07/03/2013 19:05:16
Event ID:      17
Task Category: None
Level:         Warning
Keywords:      
User:          LOCAL SERVICE
Computer:      hostname
Description:
A corrected hardware error has occurred.

Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)

Bus:Device:Function: 0x0:0x0:0x0
Vendor ID:Device ID: 0x8086:0x3406             << !! Sometimes reported as 0x8086:0x3409 depending on the port involved
Class Code: 0x30000
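
As an illustration only, the fields of an event like the one above can be checked programmatically. The following Python sketch assumes the event description has been saved as plain text with the annotation after the device ID removed. The function names and the AFFECTED_DEVICE_IDS set are hypothetical helpers; the two device IDs are the ones quoted in this document.

```python
SAMPLE_EVENT = """\
Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)

Bus:Device:Function: 0x0:0x0:0x0
Vendor ID:Device ID: 0x8086:0x3406
Class Code: 0x30000
"""

# Device IDs this document reports for the root port involved (assumption:
# no other IDs are relevant; extend the set if your system differs).
AFFECTED_DEVICE_IDS = {0x3406, 0x3409}


def parse_whea_event(text):
    """Extract the PCIe identifiers from a WHEA-Logger event description."""
    fields = {}
    for line in text.splitlines():
        # Field names contain colons themselves, so split on the last ": ".
        name, sep, value = line.rpartition(": ")
        if sep:
            fields[name.strip()] = value.strip()
    vendor, device = (
        int(part, 16) for part in fields["Vendor ID:Device ID"].split(":")
    )
    return {
        "component": fields.get("Component"),
        "bdf": fields.get("Bus:Device:Function"),
        "vendor_id": vendor,
        "device_id": device,
    }


def matches_this_document(event):
    """True if the event looks like the corrected error described here."""
    return (
        event["component"] == "PCI Express Root Port"
        and event["vendor_id"] == 0x8086
        and event["device_id"] in AFFECTED_DEVICE_IDS
    )
```

A match only indicates that the event resembles the one shown here; it does not by itself confirm the cause.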

 

If heavy network load is placed on the active ports, an uncorrectable WHEA event is possible, causing a Windows BSOD crash similar to this:

 

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error condition.
Arguments:
Arg1: 0000000000000004, PCI Express Error
Arg2: fffffa8019d9f8d8, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000
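
The second bugcheck argument is the address that the !errrec debugger command takes. As a small sketch, assuming the bugcheck analysis text has been saved from the debugger, the Python below pulls out the arguments and builds that command. The function names are hypothetical.

```python
import re

SAMPLE_BUGCHECK = """\
WHEA_UNCORRECTABLE_ERROR (124)
Arguments:
Arg1: 0000000000000004, PCI Express Error
Arg2: fffffa8019d9f8d8, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000
"""

# Matches lines such as "Arg2: fffffa8019d9f8d8, Address of ...".
ARG_LINE = re.compile(r"^Arg(\d):\s*([0-9a-fA-F]+)", re.MULTILINE)


def bugcheck_args(text):
    """Return {argument number: integer value} parsed from bugcheck text."""
    return {int(num): int(value, 16) for num, value in ARG_LINE.findall(text)}


def errrec_command(text):
    """Build the '!errrec <address>' command from bugcheck argument 2."""
    args = bugcheck_args(text)
    return "!errrec {:x}".format(args[2])
```

In the crash shown here, Arg1 is 4, which the debugger labels as a PCI Express error, and Arg2 supplies the WHEA_ERROR_RECORD address.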

 

2: kd> !errrec fffffa8019d9f8d8
===============================================================================
Common Platform Error Record @ fffffa8019d9f8d8
-------------------------------------------------------------------------------
Record Id     : 01ce1a718f328b24
Severity      : Fatal (1)
Length        : 672
Creator       : Microsoft
Notify Type   : PCI Express Error
Timestamp     : 3/6/2013 14:40:49
Flags         : 0x00000000

===============================================================================
Section 0     : PCI Express
-------------------------------------------------------------------------------
Descriptor    @ fffffa8019d9f958
Section       @ fffffa8019d9f9e8
Offset        : 272
Length        : 208
Flags         : 0x00000001 Primary
Severity      : Fatal

Port Type     : Root Port
Version       : 1.1
Command/Status: 0x0010/0x0000
Device Id     :
 VenId:DevId : 8086:3406    << !! Again, sometimes reported as 0x8086:0x3409 depending on the port involved
 Class code  : 030000
 Function No : 0x00
 Device No   : 0x00
 Segment     : 0x0000
 Primary Bus : 0x00
 Second. Bus : 0x00
 Slot        : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ fffffa8019d9fa1c
 Device Caps : 00008020 Role-Based Error Reporting: 1
 Device Ctl  : 0000 ur fe nf ce
 Dev Status  : 0000 ur fe nf ce
  Root Ctl   : 0000 fs nfs cs

AER Information @ fffffa8019d9fa58
 Uncorrectable Error Status    : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
 Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
 Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
 Correctable Error Status      : 00000000 adv rtto rnro dllp tlp re
 Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
 Caps & Control                : 00000000 ecrcchken ecrcchkcap ecrcgenen ecrcgencap fep
 Header Log                    : 00000000 00000000 00000000 00000000
 Root Error Command            : 00000000 fen nfen cen
 Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
 Correctable Error Source ID   : 00,00,00
 Correctable Error Source ID   : 00,00,00
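
The upper-case mnemonics in the AER lines above correspond to bits set in the register value. The Python sketch below decodes such a value using the bit positions defined for the AER Uncorrectable Error registers in the PCI Express Base Specification. The mnemonics are the ones the debugger prints, and the function name is hypothetical.

```python
# Bit position -> (debugger mnemonic, meaning) for the AER Uncorrectable
# Error Status / Mask / Severity registers.
UNCORRECTABLE_BITS = {
    0: ("und", "Undefined"),
    4: ("dlp", "Data Link Protocol Error"),
    5: ("sd", "Surprise Down Error"),
    12: ("ptlp", "Poisoned TLP"),
    13: ("fcp", "Flow Control Protocol Error"),
    14: ("cto", "Completion Timeout"),
    15: ("ca", "Completer Abort"),
    16: ("uc", "Unexpected Completion"),
    17: ("rof", "Receiver Overflow"),
    18: ("mtlp", "Malformed TLP"),
    19: ("ecrc", "ECRC Error"),
    20: ("ur", "Unsupported Request Error"),
}


def decode_uncorrectable(value):
    """Return the mnemonics of all bits set in an AER uncorrectable register."""
    return [
        UNCORRECTABLE_BITS[bit][0]
        for bit in sorted(UNCORRECTABLE_BITS)
        if value & (1 << bit)
    ]
```

For the Severity value 00062010 this yields dlp, fcp, rof and mtlp, the same four mnemonics the debugger prints in upper case. The Severity register marks which error types are treated as fatal, not which error occurred. The Status register in this record reads 00000000, so the record alone does not name the specific error.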

 

Note: This issue may also be seen when disabling network ports on Intel-based PCIe network cards in these platforms, not just the on-board network ports.

Cause

The uncorrectable WHEA BSOD crash is caused by a fatal error on the PCIe bus generated by one of the Intel Kawela (82576) network ports. This fatal PCIe bus error is triggered by a firmware issue: a mismatch in the supported Max Payload Size.
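
To illustrate what a Max Payload Size (MPS) mismatch means in general, the Python sketch below decodes the MPS fields as they are encoded in the PCI Express Base Specification: bits 2:0 of Device Capabilities give the largest payload a function supports, and bits 7:5 of Device Control give the programmed value, both as 128 shifted left by the field value. The function names are hypothetical. This is a generic illustration, not a description of the exact firmware defect.

```python
def _mps_bytes(code):
    """Convert a 3-bit Max Payload Size encoding to bytes (128..4096)."""
    if code > 5:
        raise ValueError("reserved Max Payload Size encoding: %d" % code)
    return 128 << code


def mps_supported(device_caps):
    """Max_Payload_Size Supported: bits 2:0 of Device Capabilities."""
    return _mps_bytes(device_caps & 0x7)


def mps_programmed(device_ctl):
    """Max_Payload_Size: bits 7:5 of Device Control."""
    return _mps_bytes((device_ctl >> 5) & 0x7)


def has_mismatch(sender_device_ctl, receiver_device_caps):
    """True if the sender may emit payloads larger than the receiver accepts."""
    return mps_programmed(sender_device_ctl) > mps_supported(receiver_device_caps)
```

For example, mps_supported(0x00008020) evaluates to 128, using the Device Caps value of the root port in the error record above. A hypothetical sender with Device Control 0x0040 (MPS field 2, i.e. 512 bytes) would be flagged by has_mismatch against that value, while one programmed to 128 bytes would not.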

 

Solution

This is NOT a hardware fault. No parts should be replaced, as doing so will not resolve the issue.

 

This issue is fixed by using a specially created tool, MPSTool, to update the Intel NIC EEPROM. The tool can be found on the latest Tools & Drivers CD (SW 1.7.3 or later, Patch ID 18489051), which can be downloaded from My Oracle Support patch downloads.

 

On the CD image, the tool resides at the following location:

tools_and_drivers_CD:/Windows/W2K8R2/Tools/64bit/Network/MPSTools

 

How to use this tool:


MPS128W64e.exe and MPS512W64e.exe are custom utilities that update the EEPROM of the Intel 82576 devices.

Each system has two Intel 82576 devices, each with two 1 Gb NIC ports, for a total of four ports.

The tools are intended for updating both 82576 EEPROM devices on X4170 M2 or X4270 M2 systems.
Both 82576 devices (i.e. all four NIC ports) must be enabled for the tools to work.

To apply the fix, run "MPS128W64e.exe" to set MPS=128 in the 82576 EEPROM. The change takes effect after the OS reboots.
To revert, run "MPS512W64e.exe", which restores the default setting.


Note:
When you run the tool, it prints output like the example below. "4 matching devices found." means that four ports were found; the 82576 has two ports per chip.
"2 devices updated." means that two EEPROM devices were updated.

Example output to the screen:

 WARNING: DO NOT POWEROFF SYSTEM
 Updating NVM, device 1 of 2... Done
 Updating NVM, device 2 of 2... Done
 4 matching devices found.
 2 devices updated.
 You must restart the computer now.
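
If the tool is run on many systems, its output can be checked automatically. The Python sketch below assumes the output has been captured as text in exactly the form shown above. The function name and the expected counts as parameters are hypothetical.

```python
import re

SAMPLE_OUTPUT = """\
 WARNING: DO NOT POWEROFF SYSTEM
 Updating NVM, device 1 of 2... Done
 Updating NVM, device 2 of 2... Done
 4 matching devices found.
 2 devices updated.
 You must restart the computer now.
"""


def check_mps_tool_output(text, expected_ports=4, expected_devices=2):
    """Return a list of problems found in the tool output (empty if none)."""
    problems = []
    found = re.search(r"(\d+) matching devices found", text)
    updated = re.search(r"(\d+) devices updated", text)
    done = len(re.findall(r"Updating NVM, device \d+ of \d+\.\.\. Done", text))
    if found is None or int(found.group(1)) != expected_ports:
        problems.append("expected %d matching ports" % expected_ports)
    if updated is None or int(updated.group(1)) != expected_devices:
        problems.append("expected %d updated devices" % expected_devices)
    if done != expected_devices:
        problems.append("expected %d completed NVM updates" % expected_devices)
    return problems
```

An empty list means the output matches the example: four ports found and both EEPROM devices updated. Fewer matching devices may indicate that some NIC ports were disabled when the tool ran.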
 

 

 

The following workarounds are also available and should prevent both the correctable and uncorrectable WHEA events:

 

1. In Windows, leave any unused network ports enabled so that the adapter reports its status as "Network cable unplugged" rather than DISABLED.

 

OR

 

2. Change the maximum payload size in the BIOS to 256:

Press F2 during POST to enter the BIOS and then navigate to:

Advanced --> PCI Express Configuration --> Maximum Payload Size --> change to 256 instead of the default of AUTO.

 

 

 

References

<BUG:16507719> - WHEA_UNCORRECTABLE_ERROR(124) WHEN DISABLING NETWORK PORTS IN WIN2008R2 ON LYNX+

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.