Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1944134.1
Update Date:2018-03-05
Keywords:

Solution Type  Problem Resolution Sure

Solution  1944134.1 :   High fan speed in an M5000 server and no error status  


Related Items
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M5000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx000




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-9858886871>

Applies to:

Sun SPARC Enterprise M5000 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Enterprise M4000 Server - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

All fans in your M5000 platform run at "High speed" or "Full speed", but no faulted component can be found.

Rebooting the XSCF unit or updating the XCP firmware does not change the behavior. The output of the
showenvironment command looks as follows:

XSCF> showenvironment Fan
FAN_A#0:High speed
       FAN_A#0:  5502rpm
FAN_A#1:High speed
       FAN_A#1:  5246rpm
FAN_A#2:High speed
       FAN_A#2:  5640rpm
FAN_A#3:High speed
       FAN_A#3:  5246rpm
PSU#0
   PSU#0:Full speed
       PSU#0:  8231rpm
       PSU#0:  8035rpm
PSU#1
   PSU#1:Full speed
       PSU#1:  8035rpm
       PSU#1:  8035rpm
PSU#2
   PSU#2:High speed
       PSU#2:  5192rpm
       PSU#2:  5357rpm
PSU#3
   PSU#3:High speed
       PSU#3:  5192rpm
       PSU#3:  5192rpm
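
When comparing several captures of this output, it can help to turn the text into a small data structure. The following Python sketch is illustrative only: the function name parse_fan_status and the assumption that every relevant line has the form "NAME:value" are not part of the XSCF tooling; they are inferred from the sample output above.

```python
def parse_fan_status(text):
    """Parse captured 'showenvironment Fan' text.

    Returns {unit: {"state": str or None, "rpm": [int, ...]}}.
    Assumes the layout shown in this document; other firmware
    levels may print the output differently.
    """
    units = {}
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # blank line, prompt, or group header such as "PSU#0"
        name, value = (part.strip() for part in line.split(":", 1))
        entry = units.setdefault(name, {"state": None, "rpm": []})
        if value.endswith("rpm"):
            entry["rpm"].append(int(value[:-3]))  # e.g. "8231rpm" -> 8231
        else:
            entry["state"] = value  # e.g. "High speed" or "Full speed"
    return units


# Excerpt of the output shown above.
SAMPLE = """\
FAN_A#0:High speed
       FAN_A#0:  5502rpm
PSU#0
   PSU#0:Full speed
       PSU#0:  8231rpm
       PSU#0:  8035rpm
PSU#2
   PSU#2:High speed
       PSU#2:  5192rpm
       PSU#2:  5357rpm
"""

status = parse_fan_status(SAMPLE)
full_speed_units = [n for n, e in status.items() if e["state"] == "Full speed"]
print(full_speed_units)
```

Listing the units that report "Full speed" separately makes it easier to see which cooling group stands out from the rest.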

The environmental values that influence fan speed under normal operation are also within the normal range,
so the elevated fan speed is not caused by an abnormal inlet temperature or by the altitude setting.

Examples:

XSCF> showenvironment temp
Temperature:22.50C
MBU_B
    CPUM#0-CHIP#0:33.00C
    CPUM#0-CHIP#1:32.05C
    CPUM#2-CHIP#0:39.17C
    CPUM#2-CHIP#1:38.60C
    CPUM#3-CHIP#0:37.60C
    CPUM#3-CHIP#1:39.17C
IOU#0:27.00C
IOU#1:30.50C
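
To compare temperature readings between captures, the output above can be parsed in a similar way. This is a minimal sketch under the assumption that each reading is printed as "NAME:value" followed by "C"; the function name parse_temperatures is hypothetical, and no threshold values are implied, since the normal range depends on the platform documentation.

```python
def parse_temperatures(text):
    """Parse captured 'showenvironment temp' text into {sensor: degrees_C}.

    Assumes the layout shown in this document. Lines without a
    "NAME:value" reading (for example the "MBU_B" header) are skipped.
    """
    readings = {}
    for raw in text.splitlines():
        name, sep, value = raw.strip().rpartition(":")
        if not sep or not value.endswith("C"):
            continue
        try:
            readings[name] = float(value[:-1])  # "33.00C" -> 33.0
        except ValueError:
            continue  # not a numeric reading
    return readings


# Excerpt of the output shown above.
SAMPLE_TEMP = """\
Temperature:22.50C
MBU_B
    CPUM#0-CHIP#0:33.00C
    CPUM#2-CHIP#0:39.17C
IOU#0:27.00C
"""

readings = parse_temperatures(SAMPLE_TEMP)
print(max(readings.values()))
```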

 

XSCF> showaltitude
100m

 

 

Changes

This behavior was observed after a data center power outage in which the affected system lost power abruptly.
No relevant hardware events were found in FMA on the XSCF unit:

XSCF> fmdump
TIME                 UUID                                 MSG-ID
[ there may be events that are unrelated to the power outage ]

The system status also looks normal; 'showstatus' reports no faults:

XSCF> showstatus
No failures found in System Initialization.

 

Cause

The behavior was caused by an inconsistency between the temporary status of one or more FAN components
and the status recorded in the XSCF database (BDB), a result of the unpredictable nature of a power outage.

Document 1019147.1 describes the fan failure behavior within the cooling groups. Based on it, an issue with
PSU#0 or PSU#1 would be expected in the example above, because their fans run at full speed
while the fans of all other cooling groups run at high speed.

 

 

Solution

In the described scenario, it is recommended to verify the hardware status of
all fans in the power supplies and fan trays by running a hardware test, which resolves
the inconsistency for these components.

For the power supplies and fans, this can be done by issuing the "replacefru" command
without physically pulling any component.
Repeat this for each component in turn until the fan speed changes and the issue is resolved.
This may happen after the first test or only after the last one; it is not predictable.

The following is a complete lab example for the first fan, FAN_A#0, in an M4000:

XSCF> replacefru
----------------------------------------------------------------------
Maintenance/Replacement Menu
Please select a type of FRU to be replaced.

1. FAN        (Fan Unit)
2. PSU        (Power Supply Unit)
----------------------------------------------------------------------
Select [1,2|c:cancel] :1

----------------------------------------------------------------------
Maintenance/Replacement Menu
Please select a FAN to be replaced.

No. FRU             Status
--- --------------- ------------------
1. FAN_A#0         Normal
2. FAN_A#1         Normal
3. FAN_B#0         Normal
4. FAN_B#1         Normal
----------------------------------------------------------------------
Select [1-4|b:back] :1

You are about to replace FAN_A#0.
Do you want to continue?[r:replace|c:cancel] :r

Please confirm the Check LED is blinking.
If this is the case, please replace FAN_A#0.
After replacement has been completed, please select[f:finish] :f

Diagnostic tests for FAN_A#0 have started.
[This operation may take up to 3 minute(s)]
(progress scale reported in seconds)
  0.....  30..done

----------------------------------------------------------------------
Maintenance/Replacement Menu
Status of the replaced FRU.

FRU           Status
------------- --------
FAN_A#0       Normal
----------------------------------------------------------------------
The replacement of FAN_A#0 has completed normally.[f:finish] :f

----------------------------------------------------------------------
Maintenance/Replacement Menu
Please select a type of FRU to be replaced.

1. FAN        (Fan Unit)
2. PSU        (Power Supply Unit)
----------------------------------------------------------------------
Select [1,2|c:cancel] :

The Check LED may be lit for a short period while replacefru runs the
hardware test of the chosen component. If the fan speed returns to normal afterwards,
it can be assumed that the problem is resolved and no further components need to be tested.

References

<NOTE:1019147.1> - Sun SPARC Enterprise(R) M3000/M4000/M5000/M8000/M9000 Servers: Fan/fantray temperature and Over-temperature failure behavior

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.