Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2385601.1
Update Date: 2018-04-23
Keywords:

Solution Type  Technical Instruction Sure

Solution  2385601.1 :   Machine Check Exception (MCE) On X5-2 CPU  


Related Items
  • Oracle Server X5-2
Related Categories
  • PLA-Support>Sun Systems>x86>Server>SN-x86: Oracle Server X5




In this Document
Goal
Solution
References


Created from <SR 3-16782428081>

Applies to:

Oracle Server X5-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Goal

Machine Check Exception (MCE) on X5-2 CPU
 

Solution

A PSOD occurred with an MCE on 8-Jan, and the system was later reset manually, as the following event log shows:

1116 Mon Jan 8 00:07:09 2018 Power Reset major
  /SYS has been reset by: Web session
1089 Sun Jan 7 23:57:37 2018 Power Reset major
  /SYS has been reset by: Web session
1088 Sun Jan 7 23:57:37 2018 IPMI Log minor
  ID = 3 : 01/07/2018 : 23:57:37 : System Boot Initiated : System
  Management Software : System Restart : Asserted

-- However, no hardware error was reported in ILOM at that time.
-- We therefore first recommended asking the VMware vendor to examine the dump file.

-- From the boot file:

0:00:00:00.004 cpu0:1)Cpu: 89: Changing PAT MSR on PCPU 0 from 0x7010600070106 to 0x7010600070106
0:00:00:00.053 cpu0:1)SMP: 1034: Using ACPI for cpu information, numPCPUs=72
0:00:00:06.549 cpu1:32769)Cpu: 89: Changing PAT MSR on PCPU 1 from 0x7040600070406 to 0x7010600070106
0:00:00:06.594 cpu2:32770)Cpu: 89: Changing PAT MSR on PCPU 2 from 0x7040600070406 to 0x7010600070106
0:00:00:06.609 cpu3:32771)Cpu: 89: Changing PAT MSR on PCPU 3 from 0x7040600070406 to 0x7010600070106
  ~~ Cut ~~
0:00:00:08.425 cpu68:32836)Cpu: 89: Changing PAT MSR on PCPU 68 from 0x7040600070406 to 0x7010600070106
0:00:00:08.443 cpu69:32837)Cpu: 89: Changing PAT MSR on PCPU 69 from 0x7040600070406 to 0x7010600070106
0:00:00:08.460 cpu70:32838)Cpu: 89: Changing PAT MSR on PCPU 70 from 0x7040600070406 to 0x7010600070106
0:00:00:08.509 cpu71:32839)Cpu: 89: Changing PAT MSR on PCPU 71 from 0x7040600070406 to 0x7010600070106
0:00:00:10.714 cpu0:32768)SMP: 1714: ...finished booting APs, numPCPUs=72
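
As a rough illustration of how the PCPU count in this boot log can be related to CPU sockets, the Python sketch below extracts numPCPUs from a log line and maps a PCPU number to a socket index. It assumes two sockets with PCPUs numbered contiguously per socket (first half on socket 0, second half on socket 1). That numbering is an assumption and is not stated by the log itself; the function names are hypothetical.

```python
import re


def parse_num_pcpus(log_text):
    """Return the first numPCPUs value found in the boot log text, or None."""
    match = re.search(r"numPCPUs=(\d+)", log_text)
    return int(match.group(1)) if match else None


def pcpu_to_socket(pcpu, num_pcpus, num_sockets=2):
    """Map a PCPU number to a socket index.

    Assumption: PCPUs are split evenly and contiguously across sockets.
    """
    if num_pcpus % num_sockets != 0:
        raise ValueError("PCPU count is not evenly divisible by socket count")
    if not 0 <= pcpu < num_pcpus:
        raise ValueError("PCPU number out of range")
    per_socket = num_pcpus // num_sockets
    return pcpu // per_socket


# Line taken from the boot log above.
boot_line = ("0:00:00:00.053 cpu0:1)SMP: 1034: Using ACPI for cpu "
             "information, numPCPUs=72")
num_pcpus = parse_num_pcpus(boot_line)

for pcpu in (0, 35, 36, 71):
    print(f"PCPU {pcpu} -> socket P{pcpu_to_socket(pcpu, num_pcpus)}")
```

Under that assumption, a PCPU number quoted in an MCE message can be translated to a socket before deciding which CPU to test or replace.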

-- With 72 PCPUs across two CPUs, we can infer that MB/P0 hosts PCPU0 to PCPU35 and MB/P1 hosts PCPU36 to PCPU71.
-- On that basis, the MCE messages appear to be related to MB/P0.
-- We first suggested running a pc-check test on the CPU, but this system provides UEFI diagnostics rather than pc-check, and those are not sufficient for this purpose.
-- We also suggested swapping the CPUs and retesting, but that was difficult for the customer, so we scheduled a replacement of CPU 0.
-- After FE Carlos went onsite and replaced CPU 0, the DIMMs at P0/D7 and P0/D8 reported faults.

  Fault class : fault.memory.intel.mrc.dimm.training.failure
  Certainty : 65%
  Affects : /SYS/MB/P0/D8
  Status : faulted

  Fault class : fault.memory.intel.mrc.dimm.training.failure
  Certainty : 32%
  Affects : /SYS/MB/P0/D7
  Status : faulted
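
When several faults are listed, it can help to turn the output into structured records. The Python sketch below parses text in the "key : value" layout shown above into a list of dictionaries. It is only an illustration based on the excerpt in this document: the function name is hypothetical, and real fault output may contain additional fields or a different layout.

```python
def parse_faults(text):
    """Parse blank-line-separated 'key : value' records into dictionaries."""
    faults = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current record, if one is open.
            if current:
                faults.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        faults.append(current)
    return faults


# Sample copied from the fault listing above.
sample = """
  Fault class : fault.memory.intel.mrc.dimm.training.failure
  Certainty : 65%
  Affects : /SYS/MB/P0/D8
  Status : faulted

  Fault class : fault.memory.intel.mrc.dimm.training.failure
  Certainty : 32%
  Affects : /SYS/MB/P0/D7
  Status : faulted
"""

for fault in parse_faults(sample):
    print(fault["Affects"], fault["Certainty"], fault["Status"])
```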

-- I can confirm this is the first time errors have been reported on DIMMs P0/D7 and P0/D8.
-- The FE then swapped the CPUs, cleared the faults, and reset the SP. The error followed neither the slot nor the CPU, which indicates that neither the DIMMs nor the CPU has failed.
-- This was a one-time issue with the DIMMs that has been cleared and repaired; no further errors have been reported.
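
The swap test described above follows a general isolation pattern: move the suspect component to another slot, then check where (if anywhere) the error reappears. The Python sketch below encodes that reasoning. It is a generic illustration, not an Oracle diagnostic procedure; the function name, arguments, and result labels are hypothetical.

```python
def interpret_swap_test(original_slot, new_slot, error_slots_after):
    """Interpret a component swap test.

    original_slot:     slot where the error was first seen
    new_slot:          slot the suspect component was moved to
    error_slots_after: set of slots reporting the error after the swap
    """
    in_original = original_slot in error_slots_after
    in_new = new_slot in error_slots_after

    if not error_slots_after:
        # Error seen in neither place: no component or slot is implicated.
        return "no_recurrence"
    if in_new and not in_original:
        # Error moved with the component: suspect the component.
        return "follows_component"
    if in_original and not in_new:
        # Error stayed where it was: suspect the slot or board.
        return "follows_slot"
    return "inconclusive"


# Case described in this document: after the swap, no error was reported.
print(interpret_swap_test("P0", "P1", set()))
```

In the case described here, the result corresponds to "no_recurrence", which is why monitoring was recommended instead of further part replacement.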

Next action:

Please monitor for MCE errors and check ILOM for a few days. If no further errors are reported, no further action is required.
 

References

<NOTE:1431330.1> - How to Collect Operating System Data to Troubleshoot Oracle X86 Platforms

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.