Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2294679.1
Update Date:2018-04-05
Keywords:

Solution Type  Problem Resolution Sure

Solution  2294679.1 :   After DIMM Replacement and Node Reboot the bdacheckhw Still Reports Insufficient Memory  


Related Items
  • Oracle Server X6-2L
  • Oracle Server X5-2
  • Oracle Server X6-2
  • Big Data Appliance X5-2 Hardware
  • Big Data Appliance X6-2 Hardware
  • Oracle Server X5-2L
Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: BDA

In this Document
Symptoms
Cause
Solution
References


Applies to:

Big Data Appliance X5-2 Hardware - Version All Versions and later
Big Data Appliance X6-2 Hardware - Version All Versions and later
Oracle Server X5-2 - Version All Versions and later
Oracle Server X6-2 - Version All Versions and later
Oracle Server X5-2L - Version All Versions and later
Linux x86-64

Symptoms

After DIMM replacement, and even after subsequent reboots, the bdacheckhw command still reports insufficient memory.

This example shows an X6-2L BDA node which reported an SPX86A-8002-XM DIMM fault.

2017-06-01/16:18:48 0b9d626a-ad99-c4c5-a2bd-a785831cc376 SPX86A-8002-XM

  timestamp           ereports
  2017-06-01/16:18:48 ereport.cpu.intel.quickpath.mem_ce@/sys/mb/p0/d3

  fault = fault.memory.intel.dimm_ce@/SYS/MB/P0/D3
          certainty = 100.0 %
          FRU = /SYS/MB/P0/D3
          ASRU = /SYS/MB/P0/D3
          resource = /SYS/MB/P0/D3
          chassis_name = ORACLE SERVER X6-2L

The fault turned on the service LED for the DIMM.

P0/D3/SERV | ON

A field engineer (FE) was dispatched for repair.
The FE replaced the DIMM. Either the fault cleared automatically, or it did not clear automatically and the FE cleared it manually.

All service LEDs are now off, and there are no DIMM faults identified in ILOM fault management, or in the ILOM snapshot.

-> show faulty
Target                    | Property                      | Value
--------------------+------------------------+---------------------------------
[No faults are listed]

The bdacheckhw command still reports insufficient memory:

bdanode06: ERROR: Insufficient GB of memory : 221
bdanode06: INFO: Minimum GB of memory : 252
bdanode06: ERROR: Big Data Appliance failed hardware validation checks

You rebooted the node as described in Doc ID 2129720.1, but the error remains.
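As a rough illustration of how to read the bdacheckhw output above, the following Python sketch extracts the reported and minimum memory values and computes the shortfall. The function name parse_bdacheckhw_memory and the regular expressions are assumptions based only on the three sample lines shown in this document; the real output may vary between BDA releases.

```python
import re


def parse_bdacheckhw_memory(output):
    """Return (reported_gb, minimum_gb); a value is None if its line is absent.

    The patterns are modeled on the sample lines in this document only.
    """
    reported = re.search(r"Insufficient GB of memory\s*:\s*(\d+)", output)
    minimum = re.search(r"Minimum GB of memory\s*:\s*(\d+)", output)
    return (
        int(reported.group(1)) if reported else None,
        int(minimum.group(1)) if minimum else None,
    )


# Sample text copied from the bdacheckhw output shown above.
sample = """\
bdanode06: ERROR: Insufficient GB of memory : 221
bdanode06: INFO: Minimum GB of memory : 252
bdanode06: ERROR: Big Data Appliance failed hardware validation checks
"""

reported_gb, minimum_gb = parse_bdacheckhw_memory(sample)
shortfall_gb = minimum_gb - reported_gb
print(f"Reported {reported_gb} GB, minimum {minimum_gb} GB, shortfall {shortfall_gb} GB")
```

For the sample, the shortfall is 31 GB, which is consistent with a single 32 GB DIMM being offline on this node.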

Cause

The DIMM may still be mapped out (offline) because the DIMM fault was cleared only after the MRC (Memory Reference Code) had already completed during power-on.  A second power cycle is required to allow the MRC to enable the offline DIMM.

Solution

Using the restricted shell in the ILOM, run the HWdiag command to check whether the same DIMM location is mapped out. Do this with the node powered on and the OS up and running.

1. ssh to the ILOM of the node in question.

2. Start the restricted shell and run the 'hwdiag mem info all' command, then exit the shell.

-> set SESSION mode=restricted

WARNING: The "Restricted Shell" account is provided solely
to allow Services to perform diagnostic tasks.

[(restricted_shell) bdax62bur09node03-ilom:~]# hwdiag mem info all

HWdiag (Restricted Mode) - Build Number 107051 (Jan 24 2016, 14:40:48)
        Current Date/Time: Aug 03 2017, 18:51:37
  CPU 0 Memory Devices
    Location        Mfg      Size(GB) Rank  Width Speed(MT/s) Chan Dimm Enabled-Ranks
    /SYS/MB/P0/D0   Samsung  32.00    Dual  x4    2400        2    0    2/2
    /SYS/MB/P0/D3   Samsung  32.00    Dual  x4    -           -    -    0/0 (DIMM mapped out)
    /SYS/MB/P0/D8   Samsung  32.00    Dual  x4    2400        1    0    2/2
    /SYS/MB/P0/D11  Samsung  32.00    Dual  x4    2400        0    0    2/2

Total memory populated on CPU 0: 96.00 GB

  CPU 1 Memory Devices
    Location       Mfg      Size(GB) Rank  Width Speed(MT/s) Chan Dimm Enabled-Ranks
    /SYS/MB/P1/D0  Samsung  32.00    Dual  x4    2400        2    0    2/2
    /SYS/MB/P1/D3  Samsung  32.00    Dual  x4    2400        3    0    2/2
    /SYS/MB/P1/D8  Samsung  32.00    Dual  x4    2400        1    0    2/2
    /SYS/MB/P1/D11 Samsung  32.00    Dual  x4    2400        0    0    2/2

Total memory populated on CPU 1: 128.00 GB

Total memory populated in system: 224.00 GB

-> exit


If the same DIMM location (in this example, P0/D3) is offline or mapped out, then a power cycle of the node will bring the memory back online.
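When the 'hwdiag mem info all' output has been saved to a file or captured from a session, the check described above can be sketched in Python. This is a minimal sketch, not an Oracle tool: the function name find_mapped_out_dimms is hypothetical, and the parsing assumes the row layout shown in the example output (a /SYS/MB/Px/Dy location in the first column and a "(DIMM mapped out)" note on affected rows).

```python
import re


def find_mapped_out_dimms(hwdiag_output):
    """Return the locations of DIMM rows that hwdiag flags as mapped out."""
    mapped_out = []
    for line in hwdiag_output.splitlines():
        fields = line.split()
        # DIMM rows start with a location such as /SYS/MB/P0/D3.
        if not fields or not re.fullmatch(r"/SYS/MB/P\d+/D\d+", fields[0]):
            continue
        # Assumption: affected rows carry the "(DIMM mapped out)" note.
        if "mapped out" in line.lower():
            mapped_out.append(fields[0])
    return mapped_out


# Rows copied from the example output in this document.
sample = """\
  CPU 0 Memory Devices
    Location        Mfg      Size(GB) Rank  Width Speed(MT/s) Chan Dimm Enabled-Ranks
    /SYS/MB/P0/D0   Samsung  32.00    Dual  x4    2400        2    0    2/2
    /SYS/MB/P0/D3   Samsung  32.00    Dual  x4    -           -    -    0/0 (DIMM mapped out)
    /SYS/MB/P0/D8   Samsung  32.00    Dual  x4    2400        1    0    2/2
    /SYS/MB/P0/D11  Samsung  32.00    Dual  x4    2400        0    0    2/2
"""

print(find_mapped_out_dimms(sample))
```

If the returned list contains the location of the replaced DIMM, continue with the power cycle steps. An empty list suggests the memory shortfall has a different cause.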

3. Gracefully shut down the OS.  If needed, the steps for a graceful shutdown are described in Doc ID 2099858.1, listed in the References section below.

4. Once the OS has been shut down, power cycle the node:

-> stop /SYS

Are you sure you want to stop /SYS (y/n)? y
Stopping /SYS

   [wait several seconds]

-> show /SYS power_state

   /SYS
     Properties:
         power_state = Off

-> start /SYS

Are you sure you want to start /SYS (y/n)? y
Starting /SYS

-> exit
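The "wait several seconds" step above amounts to polling 'show /SYS power_state' until it reports Off before issuing 'start /SYS'. The Python sketch below illustrates that logic only. The run_show argument is a hypothetical callable that returns the text of 'show /SYS power_state' (how it reaches the ILOM is left open), and the attempt count and delay are arbitrary assumptions, not Oracle-recommended values.

```python
import re
import time


def parse_power_state(show_output):
    """Extract the value from a 'power_state = ...' line, or None if absent."""
    match = re.search(r"power_state\s*=\s*(\S+)", show_output)
    return match.group(1) if match else None


def wait_for_power_off(run_show, attempts=10, delay_seconds=5.0):
    """Poll until power_state is Off; return False if it never gets there.

    run_show is a hypothetical callable supplied by the caller.
    """
    for _ in range(attempts):
        if parse_power_state(run_show()) == "Off":
            return True
        time.sleep(delay_seconds)
    return False


# Demonstration with canned replies instead of a live ILOM session.
canned_replies = iter([
    "   /SYS\n     Properties:\n         power_state = On\n",
    "   /SYS\n     Properties:\n         power_state = Off\n",
])
print(wait_for_power_off(lambda: next(canned_replies), attempts=3, delay_seconds=0))
```

Separating the parsing from the polling keeps the sketch independent of how the ILOM session is actually driven.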

 
5. To verify, run the 'hwdiag mem info all' command again from the restricted shell and confirm that all memory is back online.


If needed, instructions for using HWdiag and the restricted shell are provided in the
Oracle® x86 Server Diagnostics, Applications, and Utilities Guide.

References

<NOTE:2099858.1> - Steps to Gracefully Shutdown and Power on a Single Node on Oracle Big Data Appliance Prior to Maintenance
<NOTE:1615285.1> - SPX86A-8002-XM - Memory Correctable ECC

  Copyright © 2018 Oracle, Inc.  All rights reserved.