Asset ID: 1-71-2385601.1
Update Date: 2018-04-23
Solution Type: Technical Instruction Sure
Solution 2385601.1: Machine Check Exception (MCE) On X5-2 CPU
Related Categories: PLA-Support>Sun Systems>x86>Server>SN-x86: Oracle Server X5
Created from <SR 3-16782428081>
Applies to:
Oracle Server X5-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Goal
Machine Check Exception (MCE) on X5-2 CPU
Solution
A purple screen of death (PSOD) with an MCE occurred on 8-Jan, and the system was later reset manually:
1116 Mon Jan 8 00:07:09 2018 Power Reset major
/SYS has been reset by: Web session
1089 Sun Jan 7 23:57:37 2018 Power Reset major
/SYS has been reset by: Web session
1088 Sun Jan 7 23:57:37 2018 IPMI Log minor
ID = 3 : 01/07/2018 : 23:57:37 : System Boot Initiated : System Management Software : System Restart : Asserted
-- However, no hardware error was reported in ILOM at that time.
-- So, as a first step, we recommended asking the VMware vendor to examine the dump file.
-- From the boot file:
0:00:00:00.004 cpu0:1)Cpu: 89: Changing PAT MSR on PCPU 0 from 0x7010600070106 to 0x7010600070106
0:00:00:00.053 cpu0:1)SMP: 1034: Using ACPI for cpu information, numPCPUs=72
0:00:00:06.549 cpu1:32769)Cpu: 89: Changing PAT MSR on PCPU 1 from 0x7040600070406 to 0x7010600070106
0:00:00:06.594 cpu2:32770)Cpu: 89: Changing PAT MSR on PCPU 2 from 0x7040600070406 to 0x7010600070106
0:00:00:06.609 cpu3:32771)Cpu: 89: Changing PAT MSR on PCPU 3 from 0x7040600070406 to 0x7010600070106
~~ Cut ~~
0:00:00:08.425 cpu68:32836)Cpu: 89: Changing PAT MSR on PCPU 68 from 0x7040600070406 to 0x7010600070106
0:00:00:08.443 cpu69:32837)Cpu: 89: Changing PAT MSR on PCPU 69 from 0x7040600070406 to 0x7010600070106
0:00:00:08.460 cpu70:32838)Cpu: 89: Changing PAT MSR on PCPU 70 from 0x7040600070406 to 0x7010600070106
0:00:00:08.509 cpu71:32839)Cpu: 89: Changing PAT MSR on PCPU 71 from 0x7040600070406 to 0x7010600070106
0:00:00:10.714 cpu0:32768)SMP: 1714: ...finished booting APs, numPCPUs=72
-- The boot file reports 72 PCPUs, so on this two-socket server we can assume that MB/P0 covers PCPU0 to PCPU35 and MB/P1 covers PCPU36 to PCPU71.
-- On that basis, the MCE messages appear to be related to MB/P0.
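The PCPU-to-socket assumption above can be written down as a small lookup. This is only a sketch: the function name `socket_for_pcpu` is hypothetical, and it assumes that the 72 PCPUs are numbered contiguously and split evenly across the two sockets, which the boot log alone does not prove.

```python
def socket_for_pcpu(pcpu, num_pcpus=72, num_sockets=2):
    """Return the assumed socket label (for example 'MB/P0') for a PCPU index.

    Assumption: PCPUs are numbered contiguously per socket and split evenly,
    so PCPU0-35 map to MB/P0 and PCPU36-71 map to MB/P1.
    """
    if not 0 <= pcpu < num_pcpus:
        raise ValueError(f"PCPU {pcpu} is outside 0..{num_pcpus - 1}")
    per_socket = num_pcpus // num_sockets  # 36 on this system
    return f"MB/P{pcpu // per_socket}"


# Boundary PCPUs of each socket under the assumption above.
for pcpu in (0, 35, 36, 71):
    print(pcpu, socket_for_pcpu(pcpu))
```

A PCPU number taken from an MCE message can be passed to this function to get the socket it would belong to under that assumption.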
-- We first suggested running the Pc-Check test on the CPU, but this system provides UEFI Diagnostics rather than Pc-Check, and that test is not sufficient.
-- We also suggested swapping the CPUs and retesting, but this was difficult for the customer, so we scheduled a replacement of CPU 0.
-- After the FE, Carlos, went onsite and replaced CPU 0, the DIMMs at P0/D7 and P0/D8 reported faults.
Fault class : fault.memory.intel.mrc.dimm.training.failure
Certainty : 65%
Affects : /SYS/MB/P0/D8
Status : faulted
Fault class : fault.memory.intel.mrc.dimm.training.failure
Certainty : 32%
Affects : /SYS/MB/P0/D7
Status : faulted
-- We can confirm that this was the first time errors were reported on DIMMs P0/D7 and P0/D8.
-- The FE then swapped the CPUs, cleared the faults, and reset the SP. The error followed neither the slot nor the CPU, which means that neither the DIMMs nor the CPU had failed.
-- This was a one-time issue on the DIMMs that has been cleared and repaired; no errors are being reported now.
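The swap-test reasoning in the notes above can be summarized as a small decision function. This is a hedged sketch, not an Oracle tool: the name `classify_swap_result` and the result strings are hypothetical, and it assumes fault locations are given as paths in the form shown in the fault output (for example /SYS/MB/P0/D8).

```python
def classify_swap_result(original_socket, faults_after_swap):
    """Interpret where faults appear after the two CPUs have been swapped.

    original_socket: socket whose DIMMs faulted before the swap, e.g. 'P0'.
    faults_after_swap: fault paths seen after the swap, e.g. ['/SYS/MB/P0/D8'].
    """
    if not faults_after_swap:
        # Nothing recurred: the fault followed neither the slot nor the CPU.
        return "transient"
    # '/SYS/MB/P0/D8' -> ['SYS', 'MB', 'P0', 'D8']; the socket is the third field.
    sockets = {path.strip("/").split("/")[2] for path in faults_after_swap}
    if original_socket in sockets:
        # Same slots still fault although the CPU behind them changed.
        return "follows slot"
    # Faults moved to the socket that now holds the original CPU.
    return "follows CPU"


# The case in this document: no faults after the swap.
print(classify_swap_result("P0", []))
```

In this case no fault recurred after the swap, which is why the DIMM faults were treated as a one-time event.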
Next action:
Please monitor the system for MCE errors and check ILOM for a few days. If no further errors are reported, no further action is required.
References
<NOTE:1431330.1> - How to Collect Operating System Data to Troubleshoot Oracle X86 Platforms