Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1490545.1 : SM BIOS Uncorrectable CPU-complex Error in ILOM SEL and system hard hangs when running sosreport or commands such as lspci
A system running with an LSI Sun StorageTek 6Gb/s SAS PCIe RAID HBA - SGX-SAS6-R-INT-Z (and possibly the Blade REM equivalent SGX-SAS6-R-REM-Z) may experience a hard hang when running low-level hardware commands such as lspci, or similar commands run by sosreport scripts.
Applies to:
Sun Storage 6Gb SAS PCIe RAID HBA - Version Not Applicable to Not Applicable [Release N/A]
Oracle Exalogic Elastic Cloud X2-2 Full Rack - Version X2 to X4 [Release X2 to X4]
Exadata Database Machine X2-2 Full Rack - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine X2-8 - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine V2 - Version All Versions to All Versions [Release All Releases]
Linux x86-64

Note that this is just *ONE* possible cause of an "Uncorrectable CPU-complex Error"; other, unrelated triggers can cause the same kind of error.

This document refers specifically to an "Uncorrectable CPU-complex Error" and system "hard hang", followed by a system reset, triggered by running sosreport, sundiag, or specific low-level hardware commands such as lspci, udevinfo, dmraid, dmidecode, x86info, or lshal on Linux. The event is more likely to be triggered if these commands are run repeatedly, or if a mixture of these types of commands is run in parallel (as happens when sosreport is run). If you are seeing this error under other conditions, it may be unrelated to this issue.

When the system hangs, FMA should flag one or more CPUs in the system as faulty with a failure code of "fault.cpu.intel.internal". The CPU flagged as faulty can change on each occurrence; the CPU itself is *NOT* actually at fault and should not be replaced.

The event logs on the ILOM will also report an uncorrectable MCA error. The ereport logs will show "ereport.cpu.intel.caterr" followed by "ereport.cpu.intel.internal_timer" (see the example in the Symptoms section). If you do not see this sequence, it may be a different issue.
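The ereport sequence described above ("ereport.cpu.intel.caterr" followed by "ereport.cpu.intel.internal_timer") can be checked mechanically against lines copied from the ereport log in an ILOM snapshot. The following Python is only a minimal sketch: the function name `matches_hang_signature` is hypothetical, and the assumed line layout (timestamp, then event class, then `@` and a resource path) is taken from the sample log in this document rather than from any format specification.

```python
import re

# Assumed line shape, based only on the sample ereport log in this document:
#   2012-11-13/21:42:29 ereport.cpu.intel.caterr@/sys
EREPORT_RE = re.compile(r"^(\S+)\s+(ereport\.[\w.]+)@(\S+)")


def matches_hang_signature(lines):
    """Return True if a caterr ereport is later followed by an internal_timer ereport."""
    seen_caterr = False
    for line in lines:
        m = EREPORT_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not look like ereport entries
        event = m.group(2)
        if event == "ereport.cpu.intel.caterr":
            seen_caterr = True
        elif event == "ereport.cpu.intel.internal_timer" and seen_caterr:
            return True
    return False


# Lines taken from the example ereport log in the Symptoms section.
sample = [
    "2012-11-13/21:42:29 ereport.cpu.intel.caterr@/sys [unrecognized]",
    "2012-11-13/22:15:26 ereport.cpu.intel.internal_timer@/sys/mb/p1",
    "2012-11-13/22:15:27 ereport.cpu.intel.internal_timer@/sys/mb/p1",
]
print(matches_hang_signature(sample))
```

A match only indicates that the logs are consistent with this issue; the trigger conditions (sosreport, lspci and similar commands) and the installed LSI firmware version still need to be confirmed.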
The issue is caused by a firmware defect on the LSI PCIe card that causes a (ROB) time-out to occur.

This issue has been seen on systems running Oracle VM 3.1, but may also be seen on any system running Oracle Enterprise Linux, or the Red Hat releases on which Oracle VM is based. It has now also been found to affect Exadata and Exalogic nodes running Oracle Enterprise Linux or Oracle VM, and Exalytics X2-4 and X3-4 systems running old LSI firmware versions.

This issue does NOT affect systems running Solaris or Windows.

Symptoms

The following are examples of the kind of error you may see in the ILOM event logs and SEL after the system hang:

SEL :

eda | 06/26/2012 | 15:06:47 | Processor | IERR | Asserted
edb | 06/26/2012 | 15:08:23 | System Boot Initiated | Initiated by warm reset | Asserted
edc | 06/26/2012 | 15:08:23 | System Firmware Progress | Memory initialization | Asserted
edd | 06/26/2012 | 15:08:23 | System Firmware Progress | Primary CPU initialization | Asserted
ede | 06/26/2012 | 15:08:23 | System Boot Initiated | System Restart | Asserted
edf | 06/26/2012 | 15:09:05 | System Firmware Progress | Management controller initialization | Asserted
ee0 | 06/26/2012 | 15:09:05 | System Firmware Progress | Secondary CPU Initialization | Asserted
ee1 | 06/26/2012 | 15:09:06 | Processor | SM BIOS Uncorrectable CPU-complex Error | Asserted
ee2 | 06/26/2012 | 15:09:06 | Processor | SM BIOS Uncorrectable CPU-complex Error | Asserted

ILOM Event logs :

4552 Tue Jun 26 11:05:55 2012 IPMI Log minor
ID = e35 : 06/26/2012 : 11:05:55 : Processor : BIOS : Uncorrectable MCA Error Node 0 : Asserted
4551 Tue Jun 26 11:05:55 2012 Fault Fault critical
Fault detected at time = Tue Jun 26 12:05:55 2012. The suspect component: /SYS/MB/P0 has fault.cpu.intel.internal with probability=100. Refer to http://www.sun.com/msg/SPX86-8000-F4 for details.
4484 Thu Jun 21 16:54:25 2012 IPMI Log minor
ID = e04 : 06/21/2012 : 16:54:25 : Processor : BIOS : Uncorrectable MCA Error Node 1 : Asserted
4483 Thu Jun 21 16:54:25 2012 Fault Fault critical
Fault detected at time = Thu Jun 21 17:54:25 2012. The suspect component: /SYS/MB/P1 has fault.cpu.intel.internal with probability=100. Refer to http://www.sun.com/msg/SPX86-8000-F4 for details.

ILOM FMA logs <snapshot dir>/fma/@persist@faultdiags@faults.log (from ILOM snapshot) :

2012-06-26/12:12:43 eb369415-d451-c8e2-f777-eba8555aff0e SPX86-8000-F4
fault = fault.cpu.intel.internal@/sys/mb/p0
certainty = 100.0 %
FRU = /sys/mb/p0
ASRU = /sys/mb/p0
resource = /sys/mb/p0
chassis_serial_number = XXXXXXXXXX
product_serial_number = XXXXXXXXXX
fru_part_number = 060C

2012-06-26/15:09:06 0b926c32-07ab-6f7b-b888-88b39de4ef90 SPX86-8000-F4
fault = fault.cpu.intel.internal@/sys/mb/p1
certainty = 100.0 %
FRU = /sys/mb/p1
ASRU = /sys/mb/p1
resource = /sys/mb/p1
chassis_serial_number = XXXXXXXXXX
product_serial_number = XXXXXXXXXX
fru_part_number = 060C

FMA ereport logs <snapshot dir>/fma/@persist@faultdiags@ereports.log OR <snapshot dir>/fma/@usr@local@bin@fmdump_-ev.out (from ILOM snapshot) :

2012-11-13/21:42:29 ereport.cpu.intel.caterr@/sys [unrecognized]
2012-11-13/22:15:26 ereport.cpu.intel.internal_timer@/sys/mb/p1
2012-11-13/22:15:27 ereport.cpu.intel.internal_timer@/sys/mb/p1

ILOM host_debug_err.log <snapshot dir>/ilom/@persist@host_debug_err.log (from ILOM snapshot) :

Tue Nov 13 22:15:26 2012 ID 014e V MCA Error CPU Package 1 Core 2 MCA Bank 5
Tue Nov 13 22:15:26 2012 ID 014e : 07 01 02 05 00 00 00 00 00 00 00 00 00 04 80 00
  16 | 00 00 00 fe cd 88 16 80 c4 02 00 00 fe 7f 00 00
  32 | 00 00 00 00
-
Tue Nov 13 22:15:26 2012 ID 014f V MCA Error CPU Package 1 Core 2 MCA Bank 5
Tue Nov 13 22:15:26 2012 ID 014f : 07 01 02 05 00 00 00 00 00 00 00 00 00 04 80 00
  16 | 00 00 00 fe cd 88 16 80 c4 02 00 00 fe 7f 00 00
  32 | 00 00 00 00

Cause

The issue is a bug in LSI firmware 12.12.0-0048 (and possibly earlier versions, but this is unverified).

Solution

The issue is resolved in LSI firmware 12.12.0-0079 and later, which can be downloaded from LSI's web site here (only for non-Exadata/Exalogic systems).
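
To make the firmware comparison concrete, the sketch below checks whether a given LSI firmware version string is at or above the fixed release 12.12.0-0079. It is illustrative only: the helper names `parse_fw` and `has_fix` are hypothetical, the version string is assumed to be supplied by the reader, and the code assumes that versions of the form `12.12.0-0079` order numerically field by field, which this document does not confirm.

```python
def parse_fw(version):
    """Split a version such as '12.12.0-0079' into a tuple of ints for comparison.

    Assumption: the fields before the dash and the build number after it
    are all plain decimal integers.
    """
    main, _, build = version.partition("-")
    return tuple(int(p) for p in main.split(".")) + (int(build or 0),)


# Release in which this document says the issue is resolved.
FIXED_IN = parse_fw("12.12.0-0079")


def has_fix(version):
    """Return True if the version is at or above the fixed release."""
    return parse_fw(version) >= FIXED_IN


print(has_fix("12.12.0-0048"))  # the affected release named in the Cause section
print(has_fix("12.12.0-0079"))  # the fixed release named in the Solution section
```

A result of False means only that the firmware predates the fixed release; as noted under Cause, versions below 12.12.0-0048 are unverified.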

Attachments

This solution has no attachment