Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1618516.1
Update Date:2018-05-08
Keywords:

Solution Type  Technical Instruction Sure

Solution  1618516.1 :   How to investigate the Auto Service Request "ASR: Thermal over-temperature warning" on External I/O Expansion Unit  


Related Items
  • Sun SPARC Enterprise M8000 Server
  • Sun External I/O Expansion Unit
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M5000 Server
  • Sun SPARC Enterprise M9000-64 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx000




Applies to:

Sun SPARC Enterprise M8000 Server - Version All Versions and later
Sun SPARC Enterprise M5000 Server - Version All Versions and later
Sun SPARC Enterprise M9000-32 Server - Version All Versions and later
Sun SPARC Enterprise M9000-64 Server - Version All Versions and later
Sun External I/O Expansion Unit - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Goal

This article describes the steps a System Administrator should follow to determine what action to take when an External I/O Expansion Unit encounters an over-temperature condition.

Solution

Auto Service Request (ASR) provides automatic failure detection and SR creation for Oracle SPARC systems. See https://www.oracle.com/asr for more information on ASR. This particular ASR event has been created in Auto-Close mode, i.e. unless you update the SR, it will automatically close within two weeks.

Description of the ASR Event:

Thermal events on the External I/O Expansion Unit require a physical inspection of the platform for visible signs of blocked airflow or other environmental issues.

Additional checks are needed to understand the cause of this ASR event. If a persistent failure has occurred, the Service Request needs to be assigned to a Support Engineer for further investigation. Likewise, if the thermal event or events cannot be explained by changes to airflow around the platform or by work being carried out on the machine, please update the SR with your findings and it will be assigned to a Support Engineer.

If the event was caused by a change in the site environment or a similar occurrence, no action is needed and the SR will automatically close within two weeks.

Please find an example ASR alarm at the bottom of this document. 

1) Ensure the current ambient temperature of your system is within the recommended range. Run the ioxadm command, as shown below, to get a full report of the current temperature values. If the temperature values are back in range, continue to monitor the platform for further temperature issues. If the warning proves to be an isolated event, the issue can be cleared.

Example:

XSCF> ioxadm -v env
  Location     Sensor    Min      Min Alarm Value  Max Alarm Max     Units
  IOX@X031     ACTIVE    -        -         On     -         -       LED
  IOX@X031     LOCATE    -        -         Off    -         -       LED
  IOX@X031     OVERTEMP  -        -         Off    -         -       LED
  IOX@X031     SERVICE   -        -         Off    -         -       LED
  IOX@X031/PS0 DCOK      -        -         On     -         -       LED
  IOX@X031/PS0 POWER     -        -         On     -         -       LED
  IOX@X031/PS0 RDY2RM    -        -         Off    -         -       LED
  IOX@X031/PS0 SERVICE   -        -         Off    -         -       LED
  IOX@X031/PS0 T_AMBIENT -128.000 -         26.000 37.000    127.000 C
  IOX@X031/PS0 T_CHIP    -128.000 -         27.000 37.000    127.000 C
  IOX@X031/PS0 T_HOTSPOT -128.000 -         29.000 90.000    127.000 C
  IOX@X031/PS0 SWITCH    -        -         On     -         -       SWITCH
  ...
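
When the ioxadm output has been saved to a text file, the temperature rows can be checked programmatically. The following Python sketch is illustrative only and is not part of any Oracle tooling. It assumes the eight-column layout shown in the example, and it treats the "Max Alarm" column as the warning threshold, which is an interpretation of the column header rather than a documented fact. The function names and the margin parameter are hypothetical.

```python
# Sample rows copied from the example output above.
SAMPLE = """\
  Location     Sensor    Min Min Alarm Value Max Alarm Max   Units
  IOX@X031     OVERTEMP  -        -      Off     -     -     LED
  IOX@X031/PS0 T_AMBIENT -128.000 -    26.000 37.000 127.000 C
  IOX@X031/PS0 T_CHIP    -128.000 -    27.000 37.000 127.000 C
  IOX@X031/PS0 T_HOTSPOT -128.000 -    29.000 90.000 127.000 C
"""


def parse_env(text):
    """Return one dict per temperature row (rows whose Units field is 'C')."""
    readings = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly 8 whitespace-separated fields; the header
        # has more because "Min Alarm" and "Max Alarm" contain spaces.
        if len(fields) != 8 or fields[-1] != "C":
            continue
        location, sensor, _min, _min_alarm, value, max_alarm, _max, _units = fields
        readings.append({
            "location": location,
            "sensor": sensor,
            "value": float(value),
            # A dash means no threshold is reported for this sensor.
            "max_alarm": None if max_alarm == "-" else float(max_alarm),
        })
    return readings


def near_or_over_threshold(readings, margin=0.0):
    """Return readings at or above (Max Alarm - margin). Assumption: Max Alarm is the warning level."""
    return [
        r for r in readings
        if r["max_alarm"] is not None and r["value"] >= r["max_alarm"] - margin
    ]


if __name__ == "__main__":
    for r in near_or_over_threshold(parse_env(SAMPLE), margin=5.0):
        print(r["location"], r["sensor"], r["value"], "threshold", r["max_alarm"])
```

With the sample values above, no sensor is within 5 degrees of its Max Alarm value, so the script prints nothing. This matches a unit whose temperatures are back in range.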

2) Physical inspection of the platform and surrounding area may be needed to determine whether this temperature warning is isolated to the platform or affects the area around it. If the surrounding area does not seem unusually warm, there may be debris or some other obstruction blocking airflow. Exhaust from neighboring platforms may also be raising the air temperature around the platform that reported the warning.

If other nearby platforms are also experiencing higher than expected temperatures, the data center's airflow needs to be investigated to bring the surrounding temperature under control.

3) Ensure there are no existing Fan faults.

Step 1: Collect the fault message.
A single-line fault message is displayed on the system console and logged in the /var/logs/message file. The complete message can be retrieved by using the 'fmdump' command on the XSCF console as shown below.

Example:

XSCF> fmdump -m
   MSG-ID: IOXSCF-8000-NH, TYPE: Fault, VER: 1, SEVERITY: Major
   EVENT-TIME: Tue Mar 27 05:59:59 PDT 2007
   PLATFORM: SPARC-Enterprise, CSN: BE80601000, HOSTNAME: server-0
   SOURCE: sde, REV: 1.12
   EVENT-ID: e37f42ad-946d-4e52-8952-3eb3e4c7da21
   DESC: A thermal sensor is above the high warning threshold in an External I/O Expansion Unit FRU
   Refer to http://www.sun.com/msg/IOXSCF-8000-NH for more information.
   AUTO-RESPONSE: Domains using the affected hardware may be shut down.
   IMPACT: Interruption of service to the attached domains if the domain is shut down.
   REC-ACTION: Check ambient temperatures in the environment.
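
If the fault message has been captured as text, the fields needed for the next step can be pulled out with a short script. The Python sketch below is a minimal illustration and not an Oracle-provided tool. It assumes the "KEY: value" layout shown in the example, and the function names are hypothetical. It extracts MSG-ID, SEVERITY, and EVENT-ID and builds the fmdump command line that takes the Event-ID.

```python
import re

# Sample text copied from the example fault message above.
SAMPLE = """\
   MSG-ID: IOXSCF-8000-NH, TYPE: Fault, VER: 1, SEVERITY: Major
   EVENT-TIME: Tue Mar 27 05:59:59 PDT 2007
   PLATFORM: SPARC-Enterprise, CSN: BE80601000, HOSTNAME: server-0
   SOURCE: sde, REV: 1.12
   EVENT-ID: e37f42ad-946d-4e52-8952-3eb3e4c7da21
"""

# Each value runs up to the next comma or whitespace character.
FIELD = re.compile(r"\b(MSG-ID|SEVERITY|EVENT-ID):\s*([^\s,]+)")


def extract_fields(text):
    """Return a dict holding the MSG-ID, SEVERITY and EVENT-ID values found in text."""
    return dict(FIELD.findall(text))


def followup_command(text):
    """Build the 'fmdump -vu <EVENT-ID>' command string for the captured message."""
    return "fmdump -vu " + extract_fields(text)["EVENT-ID"]


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
    print(followup_command(SAMPLE))
```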

Step 2: Collect the "fmdump" output.
Use the fmdump command with the Event-ID from Step 1 to retrieve more information on the fault that occurred.

Example:

XSCF> fmdump -vu e37f42ad-946d-4e52-8952-3eb3e4c7da21
   TIME                 UUID                                 MSG-ID
   Mar 27 05:59:59.1975 e37f42ad-946d-4e52-8952-3eb3e4c7da21 IOXSCF-8000-NH
       100% fault.chassis.iox.env.temp.over-warn
       Problem in: hc:///iox=983392/ps=0/thermctrl=0/t_ambient=0
          Affects: -
              FRU: hc://:product-id=SPARC-Enterprise:chassisid=BE80601000:server-id=server-0:serial=T00560:part=3001701:revision=02/component=IOX@X031/PS0
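
The FRU line identifies the affected component, which can be matched against the Location column of the ioxadm output. The Python sketch below splits the FRU string into its parts. It is an illustration only: it assumes the string has the shape shown in the example (colon-separated key=value pairs followed by a component= path), and the function name is hypothetical.

```python
# FRU string copied from the example output above.
SAMPLE_FRU = (
    "hc://:product-id=SPARC-Enterprise:chassisid=BE80601000:server-id=server-0"
    ":serial=T00560:part=3001701:revision=02/component=IOX@X031/PS0"
)


def parse_fru(fmri):
    """Split an hc:// FRU string into a dict of its key=value parts."""
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc:// string: " + fmri)
    # Everything before the first '/' holds the colon-separated pairs;
    # everything after it is the component path.
    authority, _, path = fmri[len(prefix):].partition("/")
    info = {}
    for pair in authority.split(":"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            info[key] = value
    if path.startswith("component="):
        info["component"] = path[len("component="):]
    return info


if __name__ == "__main__":
    fru = parse_fru(SAMPLE_FRU)
    print(fru["component"], fru["serial"], fru["part"])
```

In the example, the component value IOX@X031/PS0 is the same location that reports T_AMBIENT in the ioxadm output, so that is the unit to inspect first.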

Step 3:

If the failure is not persistent, no further action is required. The SR will automatically close within two weeks.

If the failure has been verified as persistent or is a cause of concern, please update the SR with your findings and it will be assigned to a Support Engineer.

References

<NOTE:1021477.1> - IOXSCF-8000-NH - Thermal over-temperature warning

  Copyright © 2018 Oracle, Inc.  All rights reserved.