Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1490545.1
Update Date:2016-07-18

Solution Type  Problem Resolution Sure

Solution  1490545.1 :   SM BIOS Uncorrectable CPU-complex Error in ILOM SEL and system hard hangs when running sosreport or commands such as lspci  


Related Items
  • Exalytics In-Memory Machine X2-4
  • Exadata Database Machine V2
  • Sun Storage 6Gb SAS PCIe RAID HBA
  • Sun Blade 6000 System
  • Exalytics In-Memory Machine X3-4
  • Exadata Database Machine X2-2 Full Rack
  • Exadata Database Machine X2-8
  • Oracle Exalogic Elastic Cloud X2-2 Full Rack
  • Sun Fire X4270 M2 Server
  • Sun Fire X4170 M2 Server
Related Categories
  • PLA-Support>Sun Systems>x86>Server>SN-x64: MISC-SERVER


A system fitted with an LSI Sun StorageTek 6Gb/s SAS PCIe RAID HBA, SGX-SAS6-R-INT-Z (and possibly the Blade REM equivalent, SGX-SAS6-R-REM-Z), may hard hang when low-level hardware commands such as lspci, or similar commands run by sosreport scripts, are executed.

In this Document
Symptoms
Cause
Solution


Applies to:

Sun Storage 6Gb SAS PCIe RAID HBA - Version Not Applicable to Not Applicable [Release N/A]
Oracle Exalogic Elastic Cloud X2-2 Full Rack - Version X2 to X4 [Release X2 to X4]
Exadata Database Machine X2-2 Full Rack - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine X2-8 - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine V2 - Version All Versions to All Versions [Release All Releases]
Linux x86-64

Note that this is only *ONE* possible cause of an "Uncorrectable CPU-complex Error"; other, unrelated triggers can produce the same error. This document refers specifically to an "Uncorrectable CPU-complex Error" and system "hard hang", followed by a system reset, triggered by running sosreport, sundiag or specific low-level hardware commands such as lspci, udevinfo, dmraid, dmidecode, x86info or lshal on Linux. The event is more likely to be triggered if these commands are run repeatedly, or if a mixture of these types of commands is run in parallel (as happens when sosreport is run). If you are seeing this error under other conditions, it may be unrelated to this issue.

When the system hangs, FMA should flag one or more CPUs in the system as faulty with a failure code of "fault.cpu.intel.internal". The CPU flagged as faulty can change on each occurrence; the CPU itself is *NOT* actually at fault and should not be replaced. The event logs on the ILOM will also report an uncorrectable MCA error. The ereport logs will show "ereport.cpu.intel.caterr" followed by "ereport.cpu.intel.internal_timer" (see the Symptoms section below for an example). If you do not see this sequence, it may be a different issue.

The issue is caused by a firmware bug on the LSI PCIe card, which causes a (ROB) time-out to occur.

This issue has been seen on systems running Oracle VM 3.1, but it may also be seen on any system running Oracle Enterprise Linux, or the Red Hat releases on which Oracle VM is based.

This issue has now also been found to affect Exadata and Exalogic nodes running Oracle Enterprise Linux or Oracle VM. It has also been found to affect Exalytics X2-4 and X3-4 systems running old LSI firmware versions.

This issue does NOT affect systems running Solaris or Windows.

Symptoms

 The following are examples of the kind of error you may see in the ILOM event logs and SEL after the system hang: 

SEL :

eda | 06/26/2012 | 15:06:47 | Processor | IERR | Asserted
edb | 06/26/2012 | 15:08:23 | System Boot Initiated | Initiated by warm reset | Asserted
edc | 06/26/2012 | 15:08:23 | System Firmware Progress | Memory initialization | Asserted
edd | 06/26/2012 | 15:08:23 | System Firmware Progress | Primary CPU initialization | Asserted
ede | 06/26/2012 | 15:08:23 | System Boot Initiated | System Restart | Asserted
edf | 06/26/2012 | 15:09:05 | System Firmware Progress | Management controller initialization | Asserted
ee0 | 06/26/2012 | 15:09:05 | System Firmware Progress | Secondary CPU Initialization | Asserted
ee1 | 06/26/2012 | 15:09:06 | Processor | SM BIOS Uncorrectable CPU-complex Error | Asserted
ee2 | 06/26/2012 | 15:09:06 | Processor | SM BIOS Uncorrectable CPU-complex Error | Asserted
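
When working through a long SEL, it can help to check mechanically for the pattern shown above: a Processor IERR event followed later by an "SM BIOS Uncorrectable CPU-complex Error" event. The following is a minimal Python sketch, assuming the SEL has been saved as text in the pipe-delimited layout of the example. The function names are hypothetical and are not part of ILOM or any Oracle tool, and a match only indicates that the log is consistent with this issue, not that this issue is the cause.

```python
def parse_sel(text):
    """Split pipe-delimited SEL lines into dicts; lines that do not fit are skipped."""
    records = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # not a SEL record in the assumed six-column layout
        rec_id, date, time, sensor, event, state = fields
        records.append({
            "id": rec_id, "date": date, "time": time,
            "sensor": sensor, "event": event, "state": state,
        })
    return records


def ierr_then_cpu_complex_error(records):
    """True if a Processor IERR is later followed by the CPU-complex error event."""
    seen_ierr = False
    for rec in records:
        if rec["sensor"] != "Processor":
            continue
        if rec["event"] == "IERR":
            seen_ierr = True
        elif seen_ierr and rec["event"] == "SM BIOS Uncorrectable CPU-complex Error":
            return True
    return False


# Three lines taken from the SEL example in this document.
SAMPLE_SEL = """\
eda | 06/26/2012 | 15:06:47 | Processor | IERR | Asserted
edb | 06/26/2012 | 15:08:23 | System Boot Initiated | Initiated by warm reset | Asserted
ee1 | 06/26/2012 | 15:09:06 | Processor | SM BIOS Uncorrectable CPU-complex Error | Asserted
"""

if __name__ == "__main__":
    print(ierr_then_cpu_complex_error(parse_sel(SAMPLE_SEL)))
```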

ILOM Event logs :

4552   Tue Jun 26 11:05:55 2012  IPMI      Log       minor  
      ID =  e35 : 06/26/2012 : 11:05:55 : Processor : BIOS : Uncorrectable MCA
      Error Node 0 : Asserted

4551   Tue Jun 26 11:05:55 2012  Fault     Fault     critical
      Fault detected at time = Tue Jun 26 12:05:55 2012. The suspect component:
       /SYS/MB/P0 has fault.cpu.intel.internal with probability=100. Refer to h
      ttp://www.sun.com/msg/SPX86-8000-F4 for details.

4484   Thu Jun 21 16:54:25 2012  IPMI      Log       minor  
      ID =  e04 : 06/21/2012 : 16:54:25 : Processor : BIOS : Uncorrectable MCA
      Error Node 1 : Asserted

4483   Thu Jun 21 16:54:25 2012  Fault     Fault     critical
      Fault detected at time = Thu Jun 21 17:54:25 2012. The suspect component:
       /SYS/MB/P1 has fault.cpu.intel.internal with probability=100. Refer to h
      ttp://www.sun.com/msg/SPX86-8000-F4 for details.

ILOM FMA logs <snapshot dir>/fma/@persist@faultdiags@faults.log (from ILOM snapshot) : 

2012-06-26/12:12:43  eb369415-d451-c8e2-f777-eba8555aff0e   SPX86-8000-F4  

    fault = fault.cpu.intel.internal@/sys/mb/p0
        certainty = 100.0 %
        FRU       = /sys/mb/p0
        ASRU      = /sys/mb/p0
        resource  = /sys/mb/p0
        chassis_serial_number = XXXXXXXXXX
        product_serial_number = XXXXXXXXXX
        fru_part_number = 060C

2012-06-26/15:09:06  0b926c32-07ab-6f7b-b888-88b39de4ef90   SPX86-8000-F4  

    fault = fault.cpu.intel.internal@/sys/mb/p1
        certainty = 100.0 %
        FRU       = /sys/mb/p1
        ASRU      = /sys/mb/p1
        resource  = /sys/mb/p1
        chassis_serial_number = XXXXXXXXXX
        product_serial_number = XXXXXXXXXX
        fru_part_number = 060C
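
Because the CPU flagged as faulty can change between occurrences, listing the distinct components blamed in faults.log gives a quick view of whether the fault is moving between CPUs. The following is a minimal Python sketch, assuming the faults.log layout shown in the example above. The function name `flagged_components` is hypothetical and not part of any Oracle tool.

```python
import re

# Matches lines such as:
#     fault = fault.cpu.intel.internal@/sys/mb/p0
FAULT_RE = re.compile(r"^\s*fault\s*=\s*(\S+)@(\S+)\s*$")


def flagged_components(text, fault_class="fault.cpu.intel.internal"):
    """Return the sorted, de-duplicated components blamed for fault_class."""
    found = set()
    for line in text.splitlines():
        m = FAULT_RE.match(line)
        if m and m.group(1) == fault_class:
            found.add(m.group(2))
    return sorted(found)


# Excerpt based on the faults.log example in this document.
SAMPLE_FAULTS = """\
2012-06-26/12:12:43  eb369415-d451-c8e2-f777-eba8555aff0e   SPX86-8000-F4

    fault = fault.cpu.intel.internal@/sys/mb/p0
        certainty = 100.0 %

2012-06-26/15:09:06  0b926c32-07ab-6f7b-b888-88b39de4ef90   SPX86-8000-F4

    fault = fault.cpu.intel.internal@/sys/mb/p1
        certainty = 100.0 %
"""

if __name__ == "__main__":
    # More than one component here is consistent with the flagged CPU
    # changing between occurrences; it is not proof of this issue.
    print(flagged_components(SAMPLE_FAULTS))
```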

 

FMA ereport logs <snapshot dir>/fma/@persist@faultdiags@ereports.log OR <snapshot dir>/fma/@usr@local@bin@fmdump_-ev.out (from ILOM snapshot) : 

2012-11-13/21:42:29 ereport.cpu.intel.caterr@/sys [unrecognized] 
2012-11-13/22:15:26 ereport.cpu.intel.internal_timer@/sys/mb/p1 
2012-11-13/22:15:27 ereport.cpu.intel.internal_timer@/sys/mb/p1
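
The ereport sequence that characterises this issue ("ereport.cpu.intel.caterr" followed by "ereport.cpu.intel.internal_timer") can be checked with a short script when triaging an ILOM snapshot. The following is a minimal Python sketch, assuming the ereport log uses the one-line-per-event layout shown in the example above. The function names are hypothetical, and a positive result only means the log is consistent with this issue.

```python
import re

# Matches lines such as:
# 2012-11-13/21:42:29 ereport.cpu.intel.caterr@/sys [unrecognized]
EREPORT_RE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+(ereport\.[\w.]+)@(\S+)"
)


def parse_ereports(text):
    """Return a list of (timestamp, ereport_class, resource) tuples in log order."""
    events = []
    for line in text.splitlines():
        m = EREPORT_RE.match(line)
        if m:
            events.append(m.groups())
    return events


def has_caterr_then_internal_timer(text):
    """True if a caterr ereport is followed later by an internal_timer ereport."""
    seen_caterr = False
    for _timestamp, ereport_class, _resource in parse_ereports(text):
        if ereport_class == "ereport.cpu.intel.caterr":
            seen_caterr = True
        elif seen_caterr and ereport_class == "ereport.cpu.intel.internal_timer":
            return True
    return False


# The ereport example from this document.
SAMPLE_EREPORTS = """\
2012-11-13/21:42:29 ereport.cpu.intel.caterr@/sys [unrecognized]
2012-11-13/22:15:26 ereport.cpu.intel.internal_timer@/sys/mb/p1
2012-11-13/22:15:27 ereport.cpu.intel.internal_timer@/sys/mb/p1
"""

if __name__ == "__main__":
    print(has_caterr_then_internal_timer(SAMPLE_EREPORTS))
```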

ILOM host_debug_err.log <snapshot dir>/ilom/@persist@host_debug_err.log (from ILOM snapshot) : 

Tue Nov 13 22:15:26 2012 ID 014e V MCA Error CPU Package 1 Core 2 MCA Bank 5
Tue Nov 13 22:15:26 2012 ID 014e : 07 01 02 05 00 00 00 00 00 00 00 00 00 04 80 00
                              16 | 00 00 00 fe cd 88 16 80 c4 02 00 00 fe 7f 00 00
                              32 | 00 00 00 00
-
Tue Nov 13 22:15:26 2012 ID 014f V MCA Error CPU Package 1 Core 2 MCA Bank 5
Tue Nov 13 22:15:26 2012 ID 014f : 07 01 02 05 00 00 00 00 00 00 00 00 00 04 80 00
                              16 | 00 00 00 fe cd 88 16 80 c4 02 00 00 fe 7f 00 00
                              32 | 00 00 00 00

Cause

The issue is a bug in LSI firmware 12.12.0-0048 (and possibly earlier versions, although this is unverified).

Solution

The issue is resolved in LSI firmware 12.12.0-0079 and later, which can be downloaded from LSI's web site (for non-Exadata/Exalogic systems only).


Exadata and Exalogic nodes should be upgraded to the latest image release which contains LSI firmware 12.12.0-0079 or higher.
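
When auditing several systems, the installed firmware version can be compared against the first fixed release with a small helper. The following is a minimal Python sketch, assuming firmware versions are reported as strings of the form 12.12.0-0079 and that comparing the numeric fields left to right reflects release order (an assumption, not something LSI documentation is quoted for here). How the version string is obtained from the HBA is outside the scope of this sketch, and the function names are hypothetical.

```python
import re

# First release in which the issue is resolved, per this document.
FIXED_VERSION = "12.12.0-0079"


def parse_fw_version(version):
    """Convert a string such as '12.12.0-0048' into a tuple of ints."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version.strip())
    if m is None:
        raise ValueError(f"unexpected firmware version format: {version!r}")
    return tuple(int(part) for part in m.groups())


def has_fix(version, fixed=FIXED_VERSION):
    """True if version is at or above the first fixed release.

    A False result means the fix is absent. Only 12.12.0-0048 is confirmed
    as affected; earlier versions are possibly affected but unverified.
    """
    return parse_fw_version(version) >= parse_fw_version(fixed)


if __name__ == "__main__":
    for fw in ("12.12.0-0048", "12.12.0-0079"):
        print(fw, "has fix" if has_fix(fw) else "needs upgrade")
```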


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.