Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2163890.1
Update Date:2017-10-05
Keywords:

Solution Type  Problem Resolution Sure

Solution  2163890.1 :   M10-4: PPAR Cannot Power On / CMUL Faulted after ALARM critical high temperature  


Related Items
  • Fujitsu M10-4
  • Fujitsu M10-4S
  • Fujitsu M10-1
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Fujitsu M10




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-13038571301>

Applies to:

Fujitsu M10-4 - Version All Versions to All Versions [Release All Releases]
Fujitsu M10-4S - Version All Versions to All Versions [Release All Releases]
Fujitsu M10-1 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

The server cannot power on after a shutdown triggered by a critical high environmental temperature alarm.

Changes

 

Cause

The most likely cause is that the proactive, controlled PPAR shutdown caused /BB#0/CMUL, the FRU called out by the critical temperature alarm, to be marked Faulted, leaving no BB available when the platform is powered on.
The XSCF showhardconf output shows /BB#0/CMUL marked Faulted, which causes the only available building block to be deconfigured:

 

SPARC M10-4;
  + Serial:PZ01504027; Operator_Panel_Switch:Service;
  + System_Power:Off; System_Phase:Cabinet Power Off;
  Partition#0 PPAR_Status:Powered Off;
* BB#00 Status:Deconfigured; Role:Master; Ver:2240h; Serial:PZ01504027;
  + FRU-Part-Number:CA07361-D202 B1 /7087592 ;
  + Power_Supply_System: ;
  + Memory_Size:64 GB;
* CMUL Status:Faulted; Ver:0201h; Serial:PP150303ZR ;                                        <<<<==== CMUL marked Faulted
  + FRU-Part-Number:CA07361-D943 D5 /7086555 ;
  + Memory_Size:64 GB; Type: A ;
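
When the showhardconf output has been saved to a text file, the flagged units can be picked out with a short script. The following is a minimal sketch, not an Oracle or Fujitsu tool: the function name flagged_units and the regular expression are hypothetical, and the sketch assumes that problem units appear as lines starting with "*" followed by a unit name and a "Status:" field, as in the excerpt above.

```python
import re

# Assumption: flagged units look like "* CMUL Status:Faulted; Ver:0201h; ..."
FLAGGED = re.compile(r"^\s*\*\s*(?P<unit>\S+)\s+Status:(?P<status>[^;]+);")


def flagged_units(showhardconf_text):
    """Return (unit, status) pairs for every line flagged with a leading '*'."""
    found = []
    for line in showhardconf_text.splitlines():
        match = FLAGGED.match(line)
        if match:
            found.append((match.group("unit"), match.group("status").strip()))
    return found


# Excerpt copied from the showhardconf output in this document.
SAMPLE = """\
SPARC M10-4;
  + Serial:PZ01504027; Operator_Panel_Switch:Service;
  Partition#0 PPAR_Status:Powered Off;
* BB#00 Status:Deconfigured; Role:Master; Ver:2240h; Serial:PZ01504027;
  + Memory_Size:64 GB;
* CMUL Status:Faulted; Ver:0201h; Serial:PP150303ZR ;
"""

print(flagged_units(SAMPLE))
# prints [('BB#00', 'Deconfigured'), ('CMUL', 'Faulted')]
```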


Looking at the output of the XSCF showlogs error command, we can see when the ALARM against /BB#0/CMUL occurred:

Date: Jul 15 18:16:28 HKT 2016
    Code: 80002000-00570200feff0000ff-01910b110000000000000000
    Status: Alarm Occurred: Jul 15 18:16:24.548 HKT 2016
    FRU: /BB#0/CMUL,/ENVIRONMENT
    Msg: Critical high temperature at SW
    Diagnostic Code:
        00000200 00000000 0000
        00000000 00000000 0000
        00000000 00000000 0000
        23323538 33343436 31360000 00000000
        00000000 00000000 0000

The output of the XSCF showlogs monitor command confirms that no building blocks are left to power on:

Jul 18 10:26:08 xscf-mo3gpm03 Event: SCF:XSCF ready
Jul 18 10:25:57 xscf-mo3gpm03 Alarm: :SCF:no BB available
Jul 18 10:19:32 xscf-mo3gpm03 Alarm: /PPAR#0:SCF:no PSB available in PPAR
Jul 18 10:18:28 xscf-mo3gpm03 Event: SCF:PPAR-ID 0: Reset
Jul 18 10:18:25 xscf-mo3gpm03 Event: SCF:PPAR-ID 0: Delete CMI-group completed
Jul 18 10:18:21 xscf-mo3gpm03 Event: SCF:PPAR-ID 0: Delete CMI-group started
Jul 17 15:21:16 xscf-mo3gpm03 Event: SCF:power switch pushed (long)
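
If the monitor log has been captured to a file, the alarms that explain the failed power-on can be filtered out as sketched here. This is an illustrative example only: the function name power_on_blockers and the list of message strings are assumptions taken from the log lines above, not a complete list of XSCF messages.

```python
# Assumption: these two message texts, seen in the log above, indicate that
# no building block / PSB was available when the PPAR tried to power on.
BLOCKING_MESSAGES = ("no BB available", "no PSB available in PPAR")


def power_on_blockers(monitor_log_text):
    """Return the Alarm lines whose message matches a known blocking text."""
    hits = []
    for line in monitor_log_text.splitlines():
        if " Alarm: " in line and any(msg in line for msg in BLOCKING_MESSAGES):
            hits.append(line.strip())
    return hits


# Lines copied from the showlogs monitor output in this document.
SAMPLE_LOG = """\
Jul 18 10:26:08 xscf-mo3gpm03 Event: SCF:XSCF ready
Jul 18 10:25:57 xscf-mo3gpm03 Alarm: :SCF:no BB available
Jul 18 10:19:32 xscf-mo3gpm03 Alarm: /PPAR#0:SCF:no PSB available in PPAR
Jul 18 10:18:28 xscf-mo3gpm03 Event: SCF:PPAR-ID 0: Reset
"""

for hit in power_on_blockers(SAMPLE_LOG):
    print(hit)
```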

 

Solution

Check the showlogs monitor output on the active XSCF for any critical environmental alarm, and see whether the FRU(s) called out in that alarm match one or more of the components marked Faulted:

Jul 15 18:19:02 xscf-mo3gpm03 Event: SCF:System powered off
Jul 15 18:18:28 xscf-mo3gpm03 Event: SCF:PPAR-ID 0: Reset
Jul 15 18:18:18 xscf-mo3gpm03 Event: SCF:PPAR issued power-off request (PPARID 0)
Jul 15 18:18:18 xscf-mo3gpm03 Event: SCF:PPARID 0 GID 00000000 state change (Host stopped)
Jul 15 18:18:18 xscf-mo3gpm03 Event: SCF:PPARID 0 GID 00000000 state change (Solaris powering down)
Jul 15 18:16:35 xscf-mo3gpm03 Event: SCF:PPAR-ID 0:shutdown started
Jul 15 18:16:34 xscf-mo3gpm03 Alarm: /BB#0/CMUL,/ENVIRONMENT:SCF:Critical high temperature at SW                <<<<==== CMUL called out in critical alarm
Jul 10 15:41:29 xscf-mo3gpm03 Warning: /BB#0/CMUL:SCF:High temperature at SW
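
The cross-check described above can be sketched in a few lines when the log is available as text. This is a hedged illustration, not a supported procedure: the function names, the regular expression, and the rule of comparing the last path component of the FRU (for example "CMUL" in /BB#0/CMUL) with the unit name shown as Faulted in showhardconf are all assumptions made for this example.

```python
import re

# Assumption: Alarm lines have the form
# "<Mon> <day> <time> <host> Alarm: <FRU list>:SCF:<message>"
ALARM = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+Alarm:\s*(?P<frus>[^:]*):SCF:(?P<msg>.*)$"
)


def critical_temperature_frus(monitor_log_text):
    """Collect FRUs named in 'Critical high temperature' alarms."""
    frus = set()
    for line in monitor_log_text.splitlines():
        match = ALARM.match(line.strip())
        if match and "critical high temperature" in match.group("msg").lower():
            for fru in match.group("frus").split(","):
                fru = fru.strip()
                # /ENVIRONMENT is not a replaceable part, so it is skipped.
                if fru and fru != "/ENVIRONMENT":
                    frus.add(fru)
    return frus


def alarm_frus_marked_faulted(alarm_frus, faulted_units):
    """Return alarm FRUs whose last path component is in faulted_units."""
    return sorted(f for f in alarm_frus if f.rsplit("/", 1)[-1] in faulted_units)


# Lines copied from the showlogs monitor output in this document.
SAMPLE_LOG = """\
Jul 15 18:19:02 xscf-mo3gpm03 Event: SCF:System powered off
Jul 15 18:16:35 xscf-mo3gpm03 Event: SCF:PPAR-ID 0:shutdown started
Jul 15 18:16:34 xscf-mo3gpm03 Alarm: /BB#0/CMUL,/ENVIRONMENT:SCF:Critical high temperature at SW
Jul 10 15:41:29 xscf-mo3gpm03 Warning: /BB#0/CMUL:SCF:High temperature at SW
"""

frus = critical_temperature_frus(SAMPLE_LOG)
# {"CMUL"} stands for the unit names shown as Faulted in showhardconf.
print(alarm_frus_marked_faulted(frus, {"CMUL"}))
# prints ['/BB#0/CMUL']
```

Only Alarm-level entries are considered, so the earlier Warning entry for the same FRU does not count as a match.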

Use clearfault to set the status of the faulted part(s) back to normal, and monitor the system for two weeks.
The root cause investigation of this issue is tracked in "Bug 24319481 - xscf Alarm: ,/ENVIRONMENT:SCF:Criti high temp at SW renders PPAR unbootable". Once the bug has a resolution, this document will be updated with the outcome.

When the alarm is detected by the inlet temperature sensor, the external environment is the suspect.
In some cases, the Fujitsu M10 hardware/firmware cannot distinguish whether the failure is caused by the external environment (an ambient temperature problem or an issue reducing airflow through the system) or by a hardware failure. An engineer should check the system. The cause may no longer be present by the time the investigation takes place. If the engineer cannot find the root cause, clear the error state and try to start the system to see whether the problem can be reproduced. The error state of this failure can be cleared by disconnecting all the power cables from the chassis where the FRU is located. (For example, if it is a PCI expansion box, disconnect all the power cables connected to the PCI expansion box.)

 

References

<BUG:24319481> - XSCF ALARM: ,/ENVIRONMENT:SCF:CRITI HIGH TEMP AT SW RENDERS PPAR UNBOOTABLE
<NOTE:1535238.1> - M10-env.temp.over-warn - An overtemperature warning condition is detected by a temperature sensor

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.