Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1636570.1
Update Date: 2018-02-01
Keywords:

Solution Type  Problem Resolution Sure

Solution  1636570.1 :   BDA Server Inaccessible - ILOM Power Cycle Fails with "Power to server is not available due to a malfunctioning component detected by CPLD"  


Related Items
  • Big Data Appliance X3-2 Hardware
  • Big Data Appliance X5-2 Starter Rack
Related Categories
  • PLA-Support>Eng Systems>BDA>Big Data Appliance>DB: BDA_EST




In this Document
Symptoms
Cause
Solution


Created from <SR 3-8649479168>

Applies to:

Big Data Appliance X3-2 Hardware - Version All Versions and later
Big Data Appliance X5-2 Starter Rack
Linux x86-64

Symptoms

An X3-2 server in an Oracle Big Data Appliance (BDA) Oracle NoSQL DB cluster goes down. Attempting to bring the server back up by following "How to Power Cycle Oracle Big Data Appliance Node using ILOM when the node is NOT reachable using Ping/SSH (Doc ID 1550440.1)" fails with the following error on the ILOM console:

Description: Power to server is not available due to a malfunctioning component detected by CPLD

 

Other symptoms are:

1. bdacheckcluster reports an ILOM hardware error, fault.chassis.domain.boot.power- (the fault class name is truncated in this output):

# bdacheckcluster
  
...
WARNING: Hardware errors reported by ILOM : fault.chassis.domain.boot.power-
INFO: Run 'ipmitool sunoem cli "show faulty"' to see the full error
WARNING: Big Data Appliance warnings during hardware validation checks
...

All other components are reported as healthy.

2. /root/BDA_REBOOT_WARNINGS on the server that will not reboot shows the same fault, fault.chassis.domain.boot.power-:

# cat /root/BDA_REBOOT_WARNINGS
  
WARNING: Hardware errors reported by ILOM : fault.chassis.domain.boot.power-
INFO: Run 'ipmitool sunoem cli "show faulty"' to see the full error
WARNING: Big Data Appliance warnings during hardware validation checks

  

3. On the server that cannot reboot, logging in to the ILOM and running "show faulty" reports the fault.chassis.domain.boot.power-off-unexpected fault (the class value wraps onto a second line in the output):

-> show faulty
Target              | Property               | Value
--------------------+------------------------+---------------------------------
/SP/faultmgmt/0     | fru                    | /SYS/MB
/SP/faultmgmt/0/    | class                  | fault.chassis.domain.boot.power-
faults/0            |                        | off-unexpected
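ILOM wraps long cells in "show faulty", so the full fault class only appears once the two lines are joined; the bdacheckcluster and BDA_REBOOT_WARNINGS output above shows just the first line of that value. The following is a minimal Python (3.9 or later) sketch, not an Oracle tool, that rejoins wrapped rows from saved "show faulty" text. The function name parse_show_faulty is hypothetical, and the sketch assumes that a row with an empty Property cell continues the Target and Value of the row above it.

```python
# Sketch only: rejoin wrapped rows from saved ILOM "show faulty" output.
# Assumption: rows are "Target | Property | Value", and a row with an empty
# Property cell is a continuation of the previous row.


def parse_show_faulty(text: str) -> list[tuple[str, str, str]]:
    rows: list[list[str]] = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # The prompt and the separator line contain no "|" and are skipped here.
        if len(cells) != 3 or cells[0] == "Target":
            continue
        target, prop, value = cells
        if prop == "" and rows:
            # Continuation line: glue the wrapped pieces onto the previous row.
            rows[-1][0] += target
            rows[-1][2] += value
        else:
            rows.append([target, prop, value])
    return [(t, p, v) for t, p, v in rows]


SAMPLE = """\
-> show faulty
Target              | Property               | Value
--------------------+------------------------+---------------------------------
/SP/faultmgmt/0     | fru                    | /SYS/MB
/SP/faultmgmt/0/    | class                  | fault.chassis.domain.boot.power-
faults/0            |                        | off-unexpected
"""

for target, prop, value in parse_show_faulty(SAMPLE):
    print(f"{target}: {prop} = {value}")
```

Run against the sample, this prints the FRU row for /SYS/MB and the class row for /SP/faultmgmt/0/faults/0 with the joined value fault.chassis.domain.boot.power-off-unexpected.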

 

Cause

The error message on the ILOM console:

Description: Power to server is not available due to a malfunctioning component detected by CPLD

together with the other symptoms listed above, can indicate a hardware issue. The fault is logged against the motherboard FRU, /SYS/MB.


Solution

1. Clear the fault on the service processor (SP), then power cycle the node to see whether the issue recurs. Follow the steps in:

a) The "Clearing Faults for Repairs or Replacements" section of the Oracle Integrated Lights Out Manager (ILOM) 3.1 Documentation Collection.
 
b) The "fmadm Command Usage and Syntax" section of the Oracle Integrated Lights Out Manager (ILOM) 3.1 Documentation Collection. First try to clear the fault with the "repaired" command, and use the "replaced" command only if necessary.

 

2. If clearing the faults allows the node to reboot, the problem is likely a transient ILOM condition rather than a hardware issue, and no further action is required. With a genuine hardware issue, the server would most likely fail to restart and the fault would be raised again.
 

3. If the node does not restart after the faults are cleared, open an SR with Oracle Support for further investigation. In this case a "Full" ILOM snapshot will likely be required. Collect it by following "How to run an ILOM Snapshot on a Sun/Oracle X86 System (Doc ID 1448069.1)", selecting Data Set "FULL" instead of "NORMAL", and upload the snapshot to the SR.

To recap 1448069.1, from the ILOM web interface (<Node>-ilom):
Select ILOM Administration > Maintenance > Snapshot, set Data Set to "FULL" and Transfer Method to "Browser", then click Run.


You will be prompted with:
Selecting "Full Data Set" may cause the host to reset. Continue?

Clicking "OK" starts collection of the "Full" snapshot, which may or may not cause the host to reboot.

If the host does reboot and it is a non-critical server (nodes 5-18 on a full rack), the cluster will not report as healthy while that node's DataNode and TaskTracker are down, but it remains fully functional. If you need to reboot the NameNode servers, see "Steps to Reboot Oracle Big Data Appliance High Availability Name Nodes Simultaneously" (Doc ID 1573109.1) for detailed steps.

 

When investigating the "Full" ILOM snapshot, note that the following symptoms may be present:

a) fma/@usr@local@bin@fmdump_-ev.out reports a single voltage rail error such as:

<timestamp>  ereport.chassis.device.cpld.voltage-rail-error@/SYS/MB/P0


b) However, hwdiag/@usr@local@bin@hwdiag_cpld_vr_check.out confirms that all voltage rails now report OK:

HWdiag - Version 5.21.74388 (Built Jun 19 2012 at 15:39:12)
CPLD Version - 2.3
CPU 0 - Present
CPU 1 - Present
Normal operation, Host powered on
  Voltage Rail                Address:Value       Status      Condition
  ------------------------------------------------------------------------
  1.5v Standby                (0x50):0x0b         ON          OK
  1.8v Standby                (0x51):0x0b         ON          OK
  1.26v Standby               (0x52):0x0b         ON          OK
  5v Standby                  (0x53):0x0b         ON          OK
  4DBP 5v Power               (0x55):0x0b         ON          OK
  SAS Expander Power          (0x56):0x0b         ON          OK
  Rear IO Power               (0x57):0x0b         ON          OK
  NICPWR_0_0 Standby          (0x58):0x0b         ON          OK
  NICPWR_0_1 Standby          (0x59):0x0b         ON          OK
  NICPWR_1_0 Standby          (0x5a):0x0b         ON          OK
  NICPWR_1_1 Standby          (0x5b):0x0b         ON          OK
  NICPWR_2_0 Standby          (0x5c):0x0b         ON          OK
  NICPWR_2_1 Standby          (0x5d):0x0b         ON          OK
  NICPWR_3_0 Standby          (0x5e):0x0b         ON          OK
  NICPWR_3_1 Standby          (0x5f):0x0b         ON          OK
  PSU0                        (0x80):0x8f         ON          OK
  PSU1                        (0x81):0x8f         ON          OK
  3.3v HOST                   (0x83):0x0b         ON          OK
  1.1v HOST                   (0x84):0x0b         ON          OK
  5.0v HOST                   (0x85):0x0b         ON          OK
  1.5v HOST                   (0x86):0x0b         ON          OK
  3.3v PCI                    (0x87):0x0b         ON          OK
  VDDIO_0 HOST                (0x88):0x0b         ON          OK
  VDDIO_1 HOST                (0x89):0x0b         ON          OK
  VDDIO_2 HOST                (0x8a):0x0b         ON          OK
  VDDIO_3 HOST                (0x8b):0x0b         ON          OK
  VTT_0 HOST                  (0x8c):0x0b         ON          OK
  VTT_1 HOST                  (0x8d):0x0b         ON          OK
  VTT_2 HOST                  (0x8e):0x0b         ON          OK
  VTT_3 HOST                  (0x8f):0x0b         ON          OK
  VSA_0 HOST                  (0xb0):0x0b         ON          OK
  VSA_1 HOST                  (0xb1):0x0b         ON          OK
  VCCPLL_0 HOST               (0xb2):0x0b         ON          OK
  VCCPLL_1 HOST               (0xb3):0x0b         ON          OK
  VCORE_0 HOST                (0xb4):0x0b         ON          OK
  VCORE_1 HOST                (0xb5):0x0b         ON          OK
  CPUVTT_0 HOST               (0xb6):0x0b         ON          OK
  CPUVTT_1 HOST               (0xb7):0x0b         ON          OK
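The two snapshot files are most useful read together: fmdump records the voltage rail error that was logged, while hwdiag shows the state the rails report now. The following is a minimal Python (3.9 or later) sketch, assuming both files have been extracted from the snapshot as plain text in the layout shown above. The helper names rail_error_components and check_rails are hypothetical and are not part of ILOM or the snapshot.

```python
# Sketch only: summarize CPLD voltage-rail evidence from a "Full" ILOM snapshot.
# Assumptions: ereport lines contain "...cpld.voltage-rail-error@<component>",
# and hwdiag rail rows look like "<name> (0xNN):0xNN <Status> <Condition>".
import re

EREPORT = re.compile(r"ereport\.chassis\.device\.cpld\.voltage-rail-error@(\S+)")
RAIL_ROW = re.compile(
    r"^\s*(.+?)\s+\((0x[0-9a-fA-F]+)\):(0x[0-9a-fA-F]+)\s+(\S+)\s+(\S+)\s*$"
)


def rail_error_components(fmdump_text: str) -> list[str]:
    """Return the component named in each voltage-rail-error ereport, in order."""
    return EREPORT.findall(fmdump_text)


def check_rails(hwdiag_text: str) -> tuple[int, list[str]]:
    """Return (number of rail rows parsed, names of rails not reporting ON/OK)."""
    parsed, bad = 0, []
    for line in hwdiag_text.splitlines():
        m = RAIL_ROW.match(line)
        if not m:
            continue  # header, separator, or CPU/CPLD summary line
        parsed += 1
        name, _addr, _value, status, condition = m.groups()
        if status != "ON" or condition != "OK":
            bad.append(name)
    return parsed, bad


FMDUMP_SAMPLE = "<timestamp>  ereport.chassis.device.cpld.voltage-rail-error@/SYS/MB/P0\n"
HWDIAG_SAMPLE = """\
  Voltage Rail                Address:Value       Status      Condition
  ------------------------------------------------------------------------
  1.5v Standby                (0x50):0x0b         ON          OK
  PSU0                        (0x80):0x8f         ON          OK
  CPUVTT_1 HOST               (0xb7):0x0b         ON          OK
"""

print(rail_error_components(FMDUMP_SAMPLE))  # ['/SYS/MB/P0']
print(check_rails(HWDIAG_SAMPLE))            # (3, [])
```

A non-zero rail count with an empty list of bad rails, alongside a logged voltage-rail-error ereport, matches the pattern described in this note. The parsed count is returned so that a file whose layout the pattern does not match is not mistaken for "all rails OK".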

  


  Copyright © 2018 Oracle, Inc.  All rights reserved.