Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1616314.1
Update Date:2017-10-11

Solution Type  Predictive Self-Healing Sure

Solution  1616314.1 :   Sun SPARC(R) Enterprise M3000/M4000/M5000/M8000/M9000 - XSCF Patrol Diagnosis behaviour  


Related Items
  • Sun SPARC Enterprise M8000 Server
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M3000 Server
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M5000 Server
  • Sun SPARC Enterprise M9000-64 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx000




In this Document
Purpose
Details
References


Applies to:

Sun SPARC Enterprise M9000-32 Server - Version All Versions and later
Sun SPARC Enterprise M5000 Server - Version All Versions and later
Sun SPARC Enterprise M8000 Server - Version All Versions and later
Sun SPARC Enterprise M4000 Server - Version All Versions and later
Sun SPARC Enterprise M3000 Server - Version All Versions and later
Information in this document applies to any platform.

Purpose

 This document describes how XSCF Patrol Diagnosis behaves when it detects a problem, and which log entries are produced in each case.

Details

XCP1115 introduced a new check called Patrol Diagnosis.
Patrol Diagnosis periodically checks specific files on the XSCFU filesystem, some BDB files, and access to the BDB.

Note: The BDB is a database on the XSCF that stores various information about the system.

When Patrol Diagnosis detects a problem with a file or with BDB access, the action taken depends on the type of problem. The two cases are described below.



1. When Patrol Diagnosis detects a file access problem, an error is logged and the XSCFU/MBU is marked as Degraded:

  • From the error logs
    Date: Jun 15 04:38:35 JST 2013     Code: 40000000-faffc208-0101000200000000
    Status: Information            Occurred: Jun 15 04:38:34.257 JST 2013
    FRU: /FIRMWARE,/MBU_A
    Msg: XSCF self diagnosis warning detection
    Diagnostic Code:
        00000000 00000000 00000000
        00000000 00000000 00000000 00000000
        00000000 00000000 00000000 00000000
    UUID: 2c510434-c43f-4cbe-b8e2-6bd0322f3257 MSG-ID: SCF-8006-WP

 

  • From the monitor logs
Jun 15 04:38:40 Information: /FIRMWARE,/MBU_A:SCF:XSCF self diagnosis warning detection

 

  • From the FMA logs
Jun 15 04:38:34.5948 ereport.chassis.SPARC-Enterprise.xscfu.fail

 

  • From the Linux messages file (available from snapshot)
Jun 14 20:22:39 HTCIF2 patrol_diagnosis: Diag File(/hcp/remcscm/REMCS/setup/trc_conf) Read Error!!
Jun 14 20:36:56 HTCIF2 patrol_diagnosis: Diag File(/hcp/scfprog/sun/var/opt/SUNWsymon/cfg/snmpd.conf) Read Error!!
Jun 14 20:40:06 HTCIF2 patrol_diagnosis: Diag File(/hcp/scfprog/sun/var/opt/SUNWsymon/cfg/agent-engine-d.dat) Read Error!!
Jun 14 20:43:16 HTCIF2 patrol_diagnosis: Diag File(/hcp/scfprog/sun/var/opt/SUNWsymon/cfg/domain-config.x) Read Error!!
Jun 14 20:46:26 HTCIF2 patrol_diagnosis: Diag File(/hcp/scfprog/sun/var/opt/SUNWsymon/cfg/platform-engine-d.dat) Read Error!!
Jun 14 20:49:36 HTCIF2 patrol_diagnosis: Diag File(/hcp/scfprog/sun/var/opt/SUNWsymon/cfg/agent-usmusertbl-d.dat) Read Error!!
Jun 14 20:52:46 HTCIF2 patrol_diagnosis: Diag File(/hcp/scfprog/sun/var/opt/SUNWsymon/cfg/platform-usmusertbl-d.dat) Read Error!!
Jun 14 20:55:56 HTCIF2 patrol_diagnosis: Diag File(/hcp/scfprog/sun/var/opt/SUNWsymon/cfg/base-modules-d.dat) Read Error!!
Jun 15 04:38:34 HTCIF2 patrol_diagnosis: Warning Patrol Diagnosis error log
Jun 15 04:38:34 HTCIF2 patrol_diagnosis: model = 1 scf_id = 0
Jun 15 04:38:35 HTCIF2 patrol_diagnosis: End Patrol Diagnosis
Jun 15 04:38:35 HTCIF2 patrol_diagnosis: Start Patrol Diagnosis hangup_check_interval = 60
Jun 15 04:38:35 HTCIF2 fmd: SOURCE: sde, REV: 1.17, CSN: PX61312031  EVENT-ID: 2c510434-c43f-4cbe-b8e2-6bd0322f3257 Refer to http://www.sun.com/msg/SCF-8006-WP for detailed information.
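
When reviewing a messages file from a snapshot, it can help to list the files that Patrol Diagnosis failed to read. The following is a minimal sketch only, not an Oracle tool: the function name find_read_errors and the regular expression are assumptions modelled on the sample lines above, and other XCP releases may word the message differently.

```python
import re

# Pattern modelled on the sample lines in this document (assumption):
# "<Mon> <day> <hh:mm:ss> <host> patrol_diagnosis: Diag File(<path>) Read Error!!"
READ_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+patrol_diagnosis:\s+"
    r"Diag File\((?P<path>[^)]+)\)\s+Read Error!!"
)


def find_read_errors(lines):
    """Return a (timestamp, path) tuple for each file read error line."""
    hits = []
    for line in lines:
        match = READ_ERROR.match(line.strip())
        if match:
            hits.append((match.group("stamp"), match.group("path")))
    return hits


# Two lines copied from the excerpt above; only the first is a read error.
sample = [
    "Jun 14 20:22:39 HTCIF2 patrol_diagnosis: "
    "Diag File(/hcp/remcscm/REMCS/setup/trc_conf) Read Error!!",
    "Jun 15 04:38:34 HTCIF2 patrol_diagnosis: Warning Patrol Diagnosis error log",
]

for stamp, path in find_read_errors(sample):
    print(stamp, path)
```

The list of affected paths is only a triage aid. Whether the XSCFU/MBU needs to be replaced is still determined from the full snapshot.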

As a result, the XSCFU (or the MBU on M3000) is marked as Degraded:

XSCF> showstatus
*   MBU_A Status:Degraded;

  

Contact Oracle Services.
From an XSCF snapshot, Oracle Services can determine the right course of action and confirm whether the XSCFU/MBU needs to be replaced.

Reference:

Sun SPARC(R) Enterprise M3000/M4000/M5000/M8000/M9000 - XSCFU/MBU incorrectly marked as Degraded on 1115 - "XSCF self diagnosis warning detection" (Doc ID 1562888.1)



2. When Patrol Diagnosis detects a BDB access problem, the XSCF is rebooted using the 'forcereboot' command, which was introduced together with Patrol Diagnosis in XCP1115.

The BDB is composed of multiple DBs.
Because the BDB contains important information about the system, Patrol Diagnosis regularly checks that the most important DBs are accessible and that no process is preventing other processes from accessing them. This is called SCDB hang-up detection.

If Patrol Diagnosis cannot access the BDB, it initiates an XSCF reboot to clear the condition.
As part of the reboot, relevant information is collected from the XSCF so that the event can be analysed afterwards.
The collected information is available in a snapshot.

An XSCF reboot initiated in this way does not indicate a bad or defective XSCFU.
It is a preventive action intended to avoid further problems.
The reboot has no impact on the running domains.
On M8000 and M9000 platforms, it may result in an XSCF failover.
The XSCFU/MBU is not reported as Degraded as a result.

  • From the error logs
    Date: Jan 13 08:55:12 CET 2014     Code: 40000000-faffc201-011d000200000000
    Status: Information            Occurred: Jan 13 08:55:10.207 CET 2014
    FRU: /FIRMWARE,/XSCFU
    Msg: XSCF process down detected
    Diagnostic Code:
        00000000 00000000 00000000
        30303030 2e726562 6f6f742e 32303134
        00000000 00000000 00000000 00000000
    UUID: 4e3e239b-dd0d-42a0-a8c7-7e6b8c04e9c2 MSG-ID: SCF-8005-NE

 

  • From the monitor logs
Jan 13 08:55:15 CACSVSDDC600 Information: /FIRMWARE,/XSCFU:SCF:XSCF process down detected

 

  • From the FMA logs
Jan 13 08:55:11.6714 ereport.chassis.software.core

 

  • From the Linux messages file (available from snapshot)
Jan 13 08:50:35 CACSVSDDC600 patrol_diagnosis: scdb Hang-up Detection is Restarting System(dbnum=0)
Jan 13 08:50:35 CACSVSDDC600 forcereboot: forcereboot command accepted.
Jan 13 08:50:35 CACSVSDDC600 root: collecting command is started.
Jan 13 08:50:49 CACSVSDDC600 root: collecting command is done.
Jan 13 08:50:55 CACSVSDDC600 forcereboot: XSCF reboot by forcereboot command.
Jan 13 07:53:00 (none) syslogd 1.4.1: restart.
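
The messages excerpt above shows a recognisable sequence: an SCDB hang-up detection line followed by the forcereboot line. The following is a minimal sketch, assuming the message wording shown in this document, of how that sequence could be located in a messages file taken from a snapshot. The function name find_hangup_reboots is hypothetical and this is not an Oracle-provided tool.

```python
import re

# Message texts copied from the excerpt above; other XCP releases may
# word them differently (assumption).
HANGUP = re.compile(
    r"patrol_diagnosis: scdb Hang-up Detection is Restarting System\(dbnum=(\d+)\)"
)
REBOOT = "forcereboot: XSCF reboot by forcereboot command."


def find_hangup_reboots(lines):
    """Return the dbnum of each hang-up detection that is followed by a forcereboot."""
    found = []
    pending = None  # dbnum of a detection still waiting for its reboot line
    for line in lines:
        match = HANGUP.search(line)
        if match:
            pending = int(match.group(1))
        elif pending is not None and REBOOT in line:
            found.append(pending)
            pending = None
    return found


sample = [
    "Jan 13 08:50:35 CACSVSDDC600 patrol_diagnosis: scdb Hang-up Detection is Restarting System(dbnum=0)",
    "Jan 13 08:50:35 CACSVSDDC600 forcereboot: forcereboot command accepted.",
    "Jan 13 08:50:35 CACSVSDDC600 root: collecting command is started.",
    "Jan 13 08:50:49 CACSVSDDC600 root: collecting command is done.",
    "Jan 13 08:50:55 CACSVSDDC600 forcereboot: XSCF reboot by forcereboot command.",
]

print(find_hangup_reboots(sample))
```

A match only shows that the reboot was triggered by SCDB hang-up detection. It does not show whether the hang-up was genuine.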



Note that a misbehaviour of the Patrol Diagnosis SCDB hang-up detection module has been identified in XCP1115 and XCP1116 and is tracked via Bug#17049708.
As a result, Patrol Diagnosis may incorrectly identify an SCDB hang-up and reboot the XSCF unnecessarily.
Apart from the XSCF reboot and a possible failover on M8000 and M9000 platforms, this has no impact on the running domains.
A fix is planned for the next XCP release.
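
The release information above can be summarised as a small lookup. This is a sketch that encodes only what this document states: Patrol Diagnosis first appears in XCP1115, and XCP1115 and XCP1116 are affected by Bug#17049708. The function name and the returned labels are hypothetical, and the status of later releases is deliberately left open because this document does not name the release that contains the fix.

```python
PATROL_DIAGNOSIS_INTRODUCED = 1115           # stated in this document
KNOWN_FALSE_HANGUP_RELEASES = {1115, 1116}   # Bug#17049708, per this document


def patrol_diagnosis_status(xcp_version):
    """Classify an XCP release number with respect to Patrol Diagnosis."""
    if xcp_version < PATROL_DIAGNOSIS_INTRODUCED:
        return "no-patrol-diagnosis"
    if xcp_version in KNOWN_FALSE_HANGUP_RELEASES:
        return "affected-by-17049708"
    # Later releases: this document does not confirm where the fix landed.
    return "check-release-notes"


for version in (1114, 1115, 1116):
    print(version, patrol_diagnosis_status(version))
```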

Contact Oracle Services.
From an XSCF snapshot, Oracle Services can determine the root cause of the reboot and of the BDB access issue.
Oracle Services can also confirm whether the situation is due to the known Patrol Diagnosis misbehaviour.
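
One detail that can be checked before opening a service request is the Diagnostic Code in the "XSCF process down detected" error log entry shown earlier in this section. Its second row consists of bytes in the printable ASCII range. The sketch below decodes such a row into text. The helper name decode_diagnostic_words is hypothetical, and the meaning of the decoded string is not documented here, so treat it as a hint only.

```python
def decode_diagnostic_words(words):
    """Decode space-separated hex words to text; non-printable bytes become '?'."""
    raw = bytes.fromhex("".join(words.split()))
    return "".join(chr(b) if 32 <= b < 127 else "?" for b in raw)


# Second Diagnostic Code row from the error log entry in this section.
row = "30303030 2e726562 6f6f742e 32303134"
print(decode_diagnostic_words(row))  # prints 0000.reboot.2014
```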


Internal Doc 1616137.1 describes how to diagnose an XSCF reboot caused by Patrol Diagnosis SCDB hang-up detection.


References

<NOTE:1008229.1> - Gathering diagnostic data for SPARC Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers

  Copyright © 2018 Oracle, Inc.  All rights reserved.