Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1507619.1
Update Date: 2018-01-16
Keywords:

Solution Type  Technical Instruction Sure

Solution  1507619.1 :   Watchdog reboots on T1000 & T2000 servers  


Related Items
  • Sun Fire T1000 Server
  • Sun Fire T2000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: Tx000




Created from <SR 3-6472889849>

Applies to:

Sun Fire T2000 Server - Version Not Applicable and later
Sun Fire T1000 Server - Version Not Applicable and later
Information in this document applies to any platform.

Goal

 
Watchdog resets are typically caused by an OS hang on the host. They occur only when the watchdog has been enabled in the ALOM configuration.

Solution

The T2000 watchdog is enabled on the ALOM by setting the variable sys_autorestart to either ‘reset’ (the default) or ‘dumpcore’. Choose dumpcore so that a core file is captured when the host hangs. Set it as follows:

       

sc> setsc sys_autorestart dumpcore

 

The setting can be confirmed as follows:

       

sc> showsc

            ...
            sys_autorestart     dumpcore 
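
When the showsc output has been captured to a file or a terminal log, the setting can also be checked with a script. The following is a minimal Python sketch, assuming the output consists of whitespace-separated name/value lines as in the excerpt above; the function name parse_showsc is hypothetical, and any header lines in real output would simply appear as extra entries.

```python
def parse_showsc(text):
    """Return a dict of variable -> value from captured `showsc` output.

    Assumes one "name   value" pair per line; prompt lines and the
    elision marker "..." are skipped.
    """
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank line, "...", or a name with no value
        name, value = parts
        if name.startswith("sc>"):
            continue  # the echoed command itself
        settings[name] = value.strip()
    return settings


# Sample text copied from the excerpt above.
sample = """sc> showsc
...
sys_autorestart     dumpcore 
"""

settings = parse_showsc(sample)
if settings.get("sys_autorestart") != "dumpcore":
    print("sys_autorestart is not set to dumpcore")
```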

        

 

A watchdog reset produces the following message in the ALOM event logs:

        

sc> showlogs

            APR 22 08:57:28: 00060014: "SC Request to Reset Host due to Watchdog"
            APR 22 08:57:28: 00040002: "Host System has Reset"
            APR 22 08:57:29: 00060021: "CRITICAL ALARM is set"
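
For servers whose ALOM logs are collected as text, the watchdog entry can be located with a short script. This is a sketch only, assuming every log line follows the timestamp, event code, quoted message layout shown above; the function name find_watchdog_resets and the regular expression are illustrative and not part of ALOM.

```python
import re

# Assumed line layout, taken from the sample above:
#   APR 22 08:57:28: 00060014: "SC Request to Reset Host due to Watchdog"
LOG_LINE = re.compile(
    r'^\s*(?P<stamp>[A-Z]{3} \d{1,2} \d{2}:\d{2}:\d{2}): '
    r'(?P<code>[0-9A-Fa-f]{8}): "(?P<msg>.*)"\s*$'
)


def find_watchdog_resets(log_text):
    """Return (timestamp, event code) pairs for watchdog-triggered resets."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and "Reset Host due to Watchdog" in match.group("msg"):
            hits.append((match.group("stamp"), match.group("code")))
    return hits


# Sample text copied from the showlogs excerpt above.
sample_log = '''
APR 22 08:57:28: 00060014: "SC Request to Reset Host due to Watchdog"
APR 22 08:57:28: 00040002: "Host System has Reset"
APR 22 08:57:29: 00060021: "CRITICAL ALARM is set"
'''

watchdog_events = find_watchdog_resets(sample_log)
print(watchdog_events)
```

The returned timestamps give the window in which to look for faults in the next step.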



To confirm that the reset was not hardware related, check whether a fault occurred near the time of the reset:
    
        

sc> showfaults -v
            Last POST run: WED APR 22 09:57:29 2012
            POST status: Passed all devices
            No failures found in System 
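
The same check can be applied to captured showfaults output. The sketch below assumes that a clean system reports exactly the two status strings shown above ("Passed all devices" and "No failures found"); the function name post_is_clean is hypothetical, and anything other than those strings is treated as needing manual review.

```python
def post_is_clean(showfaults_text):
    """Return True only if POST passed and no failures are listed."""
    lines = [line.strip() for line in showfaults_text.splitlines()]
    post_passed = any(
        line.startswith("POST status:") and "Passed all devices" in line
        for line in lines
    )
    no_failures = any(line.startswith("No failures found") for line in lines)
    return post_passed and no_failures


# Sample text copied from the showfaults excerpt above.
sample_faults = """sc> showfaults -v
            Last POST run: WED APR 22 09:57:29 2012
            POST status: Passed all devices
            No failures found in System
"""

clean = post_is_clean(sample_faults)
print("clean" if clean else "review showfaults output manually")
```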




A watchdog reset typically indicates that the system experienced a hang and was then reset. Once the reboot is confirmed to be the result of the watchdog timer, max POST can be run on the server to confirm that all devices pass POST.

    - If any internal component failure is reported by POST, please open a service request by using the My Oracle Support interface or call 1-800-223-1711 for confirmation of the component failure and replacement.
    - If no internal component failures are reported or no further diagnostics can be performed, the deadman kernel can be configured to attempt to capture a core dump during the next system hang.

A core file is required to isolate the cause of the hang.  If one was generated by the events above, please submit it to a kernel specialist for analysis.  If not, another method to obtain one is described in Doc ID 1004530.1, 'How to Enable Deadman Kernel Code in Solaris 8 and Newer to Force a Kernel Panic During a Hang'.
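
Doc ID 1004530.1 is not reproduced here. As an assumption based on general Solaris practice rather than on this document, the deadman timer is normally enabled by a "set snooping=1" line in /etc/system followed by a reboot; verify this against the referenced document before relying on it. Under that assumption, a minimal Python sketch for checking whether a host already has the setting could look like this (the function name deadman_enabled is hypothetical):

```python
import re

# Assumed setting; not taken from this document.
SNOOPING = re.compile(r"^\s*set\s+snooping\s*=\s*1\s*$")


def deadman_enabled(etc_system_text):
    """Return True if an uncommented `set snooping=1` line is present."""
    for line in etc_system_text.splitlines():
        if line.lstrip().startswith("*"):
            continue  # /etc/system comment lines begin with an asterisk
        if SNOOPING.match(line):
            return True
    return False


# Illustrative file content, not copied from a real system.
sample_etc_system = "* enable deadman timer (assumed setting)\nset snooping=1\n"

print(deadman_enabled(sample_etc_system))
```

On a live host, the text would come from reading /etc/system instead of the sample string.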

For more information on watchdog configuration, please see:  http://docs.oracle.com/cd/E19076-01/t2k.srvr/819-7991-10/variables.html#50524208_pgfId-1011121


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.