Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1007101.1 : Sun SPARC(R) Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers: Fault clearing and LEDs behavior
PreviouslyPublishedAs 209792

Oracle Confidential PARTNER - Available to partners (SUN). Reason: Internal information

Applies to:
Sun SPARC Enterprise M9000-32 Server - Version All Versions and later
Sun SPARC Enterprise M8000 Server - Version All Versions and later
Sun SPARC Enterprise M3000 Server - Version All Versions and later
Sun SPARC Enterprise M5000 Server - Version All Versions and later
Sun SPARC Enterprise M9000-64 Server - Version All Versions and later
All Platforms

Goal

The implementation of Fault Management Architecture (FMA) on Sun SPARC Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers is complex. The goal of this document is not to describe how FMA behaves on these servers, but to help identify and display the faults reported on the components of these platforms, and to explain how and when these faults can be cleared and how and when the fault LEDs are turned ON or OFF.

Solution

Faults on Sun SPARC Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers :

FMA (Fault Management Architecture) is available on Sun SPARC Enterprise platforms running Solaris[TM] 10 and beyond. For M-series systems FMA is also built into the Service Processor as part of the Service Processor (aka XSCF) software. Error reports and faults are passed between XSCF and the Solaris domain via the "Event Transport Module" (ETM) using "Domain to Service Processor Communications Protocol" (DSCP).

When a fault is diagnosed the system is usually able to identify one or more suspect components depending on the nature of the fault. The suspect or the list of suspects can be displayed using the 'fmdump' command on XSCF.

Example where a list of suspects has been identified :

XSCF> fmdump -v
TIME                 UUID                                 MSG-ID
May 25 16:02:53.0556 6070a711-49ad-4b23-a172-5524274deceb SCF-8001-KC
  66%  upset.chassis.SPARC-Enterprise.io.disk.boot
        Problem in: hc:///chassis=0/iou=8/pcislot=0/ioua=0/pci_br=0/sas=0/disk=1
           Affects: -
               FRU: hc://:product-id=SPARC-Enterprise:chassis-id=BE80601000:server-id=san-dc3-1-0/component=/IOU#8/HDD#1
          Location: /IOU#8/HDD#1
  33%  upset.chassis.SPARC-Enterprise.io.disk.boot
        Problem in: hc:///chassis=0/iou=8/pcislot=0/ioua=0/pci_br=0/sas=0
           Affects: -
               FRU: hc://:product-id=SPARC-Enterprise:chassis-id=BE80601000:server-id=san-dc3-1-0/component=/IOU#8/PCI#0/IOUA
          Location: /IOU#8/PCI#0/IOUA
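When 'fmdump -v' output has been captured to a file, the suspect list can be pulled out with a small script. The following is a minimal Python sketch, not an Oracle tool: the function name parse_suspects and the one-field-per-line layout of the sample are assumptions made for illustration, with the sample values copied from the example above.

```python
import re

# Sample text mirrors the 'fmdump -v' example above, assuming one field per line.
SAMPLE = """\
May 25 16:02:53.0556 6070a711-49ad-4b23-a172-5524274deceb SCF-8001-KC
  66%  upset.chassis.SPARC-Enterprise.io.disk.boot
        Problem in: hc:///chassis=0/iou=8/pcislot=0/ioua=0/pci_br=0/sas=0/disk=1
           Affects: -
          Location: /IOU#8/HDD#1
  33%  upset.chassis.SPARC-Enterprise.io.disk.boot
        Problem in: hc:///chassis=0/iou=8/pcislot=0/ioua=0/pci_br=0/sas=0
           Affects: -
          Location: /IOU#8/PCI#0/IOUA
"""

# A suspect line is "<certainty>% <fault class>"; a Location line names the FRU.
SUSPECT_RE = re.compile(r"^\s*(\d+)%\s+(\S+)\s*$")
LOCATION_RE = re.compile(r"^\s*Location:\s*(\S+)\s*$")


def parse_suspects(text):
    """Return suspects as dicts, highest certainty first (hypothetical helper)."""
    suspects = []
    for line in text.splitlines():
        match = SUSPECT_RE.match(line)
        if match:
            suspects.append({
                "certainty": int(match.group(1)),
                "fault_class": match.group(2),
                "location": None,
            })
            continue
        match = LOCATION_RE.match(line)
        if match and suspects:
            # Attach the location to the most recently seen suspect.
            suspects[-1]["location"] = match.group(1)
    return sorted(suspects, key=lambda s: s["certainty"], reverse=True)


if __name__ == "__main__":
    for suspect in parse_suspects(SAMPLE):
        print(suspect["certainty"], suspect["location"])
```

Sorting by certainty puts the most likely suspect first in the returned list.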
XSCF> fmdump -v -u 7d1b6fac-ff1f-4d3d-afff-faf6c0a2ed07
TIME                 UUID                                 MSG-ID
Jun 15 02:53:32.1628 7d1b6fac-ff1f-4d3d-afff-faf6c0a2ed07 SCF-8005-PX
  100%  upset.chassis.domain.panic
        Problem in: hc:///chassis=0/domain=0
           Affects: -
               FRU: hc://:product-id=SPARC-Enterprise:chassis-id=BE80601000:server-id=san-dc3-1-0/component=CHASSIS
          Location: CHASSIS

M-Series related Knowledge Article Documents (KA docs) available at https://support.oracle.com suggest collecting the output of 'fmdump -m' to aid in diagnosing these faults. The "-m" option is available only on the XSCF (not on the Solaris domain) and displays the Fault Manager syslog message contents for the event(s).

Example :

XSCF> fmdump -m
T-TIME: Fri Apr 13 08:06:05 PDT 2007
PLATFORM: SPARC-Enterprise, CSN: BE80601000, HOSTNAME: san-dc3-1-0
SOURCE: sde, REV: 1.12
EVENT-ID: cfcd90f3-5988-4707-ba8e-fdd03d417fc3
DESC: An internal fatal error within a strand on a CPU chip was detected. Refer to http://www.sun.com/msg/SCF-8000-EQ for more information.
AUTO-RESPONSE: The domain using this CPU will be reset and the strand will be deconfigured.
IMPACT: The domain using this CPU chip is reset.
REC-ACTION: Schedule a repair action to replace the affected Field Replaceable Unit (FRU), the identity of which can be determined using fmdump -v -u EVENT_ID. Please consult the detail section of the knowledge article for additional information.

XSCF user IDs with the 'platop', 'platadm', or 'fieldeng' privileges can run the 'fmdump' command.

Information about the faulty status of all components is available in the CMEM database on XSCF. Based on the 'level' of certainty of the diagnosis of any given fault, the following flags are set:
Every FRU that is marked as 'suspect' in the list will have the uncertain_secondary_status bit set; however, only the primary suspect will have either the CFF or UFF bit set as well. As a result of having detected a fault and depending on which one of the CFF or UFF bits is set, the primary suspects in a suspect list are reported as "faulted" (completely broken/not working) or "degraded" (should be replaced, but is still working with some limitations). This status can be viewed in the output of the 'showhardconf' and 'showstatus' commands on the XSCF.
The 'fmadm faulty' command on the XSCF is only available in Escalation mode, and examines the resource cache, which typically has less information about platform faults than the "CMEM" database. The CMEM database contains all of the information about faulty FRUs for OPL (M-series) platforms. For this reason the XSCF 'showstatus' command is the preferred method for customers and field engineering, as it provides the most accurate information pertaining to FMA faults.

The XSCF 'showstatus' and 'showhardconf' commands are available to user IDs with the following privileges : useradm, platadm, platop, fieldeng

Example showstatus and showhardconf command output reporting a chip on a CPU Module as "faulted".

XSCF> showstatus
    CMU#1 Status:Normal;
*       CPUM#2-CHIP#0 Status:Faulted;

XSCF> showhardconf -M
SPARC Enterprise M9000;
    + Serial:BE80601000; Operator_Panel_Switch:Locked;
    + Power_Supply_System:Dual-3Phase; Ex:Dual-3Phase; SCF-ID:XSCF#0;
    + System_Power:On;
[output omitted]
    CMU#1 Status:Normal; Ver:0101h; Serial:PP0642Z470 ;
        +FRU-Part-Number:CA06620-D001 A8 ;
        + Memory_Size:64 GB;
        CPUM#0-CHIP#0 Status:Normal; Ver:0201h; Serial:PP06447337 ;
            +FRU-Part-Number:CA06620-D021 A6 ;
            + Freq:2.280 GHz; Type:16;
            + Core:2;Strand:2;
        CPUM#1-CHIP#0 Status:Normal; Ver:0201h; Serial:PP06447340 ;
            +FRU-Part-Number:CA06620-D021 A6 ;
            + Freq:2.280 GHz; Type:16;
            + Core:2;Strand:2;
*       CPUM#2-CHIP#0 Status:Faulted; Ver:0201h; Serial:PP06447336 ;
            +FRU-Part-Number:CA06620-D021 A6 ;
            + Freq:2.280 GHz; Type:16;
            + Core:2;Strand:2;
[output omitted]

Example from an M9000 system where a CMU is reported as degraded because some DIMMs were deconfigured as a result of a fault detected on a Memory Address Controller.

XSCF> fmdump -av
TIME                 UUID                                 MSG-ID
Apr 29 20:03:02.7818 5817837d-6ee9-4ffd-af17-fee44d76da0d SCF-8005-CA
  100%  fault.chassis.SPARC-Enterprise.asic.sc.fe
        Problem in: hc:///chassis=0/cmu=6/sc=2
           Affects: hc:///chassis=0/cmu=6/mac=2/bank=0

XSCF> showstatus
*   CMU#6 Status:Degraded;
*       MEM#00A Status:Deconfigured;
*       MEM#00B Status:Deconfigured;
*       MEM#01A Status:Deconfigured;
*       MEM#01B Status:Deconfigured;
*       MEM#02A Status:Deconfigured;
*       MEM#02B Status:Deconfigured;
*       MEM#03A Status:Deconfigured;
*       MEM#03B Status:Deconfigured;
*       MEM#10A Status:Deconfigured;
*       MEM#10B Status:Deconfigured;
*       MEM#11A Status:Deconfigured;
*       MEM#11B Status:Deconfigured;
*       MEM#12A Status:Deconfigured;
*       MEM#12B Status:Deconfigured;
*       MEM#13A Status:Deconfigured;
*       MEM#13B Status:Deconfigured;

For components reported as the primary suspect with the certainly faulty (CFF) bit set, the Maintenance Action Required bit is also set. This information is available in the output of the 'fmdump -V' command.

Example :

XSCF> fmdump -Ve
TIME                           CLASS
Jun 15 2007 02:48:35.110134400 ereport.chassis.SPARC-Enterprise.cpu.SPARC64-VI.se-offlinereq
nvlist version: 0
        class = ereport.chassis.SPARC-Enterprise.cpu.SPARC64-VI.se-offlinereq
        [output omitted]
        opl_platform = DC3
        detected-by = ANALYZE
        maintenance-action-required = true
        __ttl = 0x1
        __tod = 0x46726073 0x6908480

Further OPL FMA information can be found at:
<Document 1386385.1> M3000/M4000/M5000/M8000/M9000 Server: How to Use FMA With OPL Servers
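Captured 'showstatus' output of the kind shown earlier in this section can be scanned for the components that the XSCF marks with an asterisk. The sketch below is illustrative only: the function names and the regular expression are assumptions, and it expects one component per line in the form "NAME Status:VALUE;".

```python
import re

# Sample lines follow the showstatus example for CMU#1 earlier in this section.
SAMPLE = """\
    CMU#1 Status:Normal;
*       CPUM#2-CHIP#0 Status:Faulted;
"""

# Optional leading '*', then the component name, then "Status:<value>;".
STATUS_RE = re.compile(r"^\s*(\*?)\s*(\S+)\s+Status:(\w+);")


def parse_showstatus(text):
    """Return (component, status, starred) tuples for every status line."""
    rows = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            rows.append((match.group(2), match.group(3), match.group(1) == "*"))
    return rows


def starred_components(rows):
    """Keep only the components flagged with '*' (Faulted, Degraded, Deconfigured)."""
    return [(name, status) for name, status, starred in rows if starred]


if __name__ == "__main__":
    for name, status in starred_components(parse_showstatus(SAMPLE)):
        print(f"{name}: {status}")
```

Parent components such as CMU#1 are listed without an asterisk when only a subordinate is at fault, so filtering on the asterisk rather than on the status text keeps the result to the units that actually need attention.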
Steps to Follow

Clearing Faults on Sun SPARC [TM] Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers.

Note: In XCP1115 and above the clearfault command can be performed from normal mode.

Clearing Faults (Sun SPARC Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers running XCP 1050 or later) :
1. Usual process :

When the system has identified faulty components on a platform, the correct action is normally to replace the primary suspect. There are however certain conditions, detailed further under 'Complex cases' in this document, where this might be deferred. Here we discuss the standard procedure. In order to repair a fault with single or multiple FRUs, the typical repair action will be:
2. Complex cases :

In some more complex cases, analysis of FMA events might result in the decision NOT to replace the primary FRU in the suspect list. This might be, for example, because the diagnosis engine is affected by a software issue, because analysis shows that the FRU in a single-FRU indictment appears to be wrong, or because the first FRU has already been replaced and this is a repeat fault with an identical suspect list. The decision not to replace the first FRU in the suspect list MUST be made by Service, and the entire process of clearing a fault without replacing the suspect component must be done under the supervision of TSC, preferably via a remote access shared session. The following sections describe how to handle the complex cases in more detail.

2.1 - Power cycle via NFB (Non-Fused Breaker) Off/On :

The term 'NFB Off/On' means to power cycle the system by temporarily removing, and then restoring, AC input power. For systems running firmware version XCP1050 it is possible to clear fault status for primary suspects by power-cycling the platform with the keyswitch in the 'service' position. For systems running XCP1060 and later, faults are not cleared on NFB-on, regardless of the position of the keyswitch.
Note : Whatever the faulted component (with or without FRUID), a power cycle with the keyswitch in the 'Locked' position will have no effect on the fault status of said component unless a clearfault/clearstatus/clearfru command had been invoked prior to the power cycle. See the section on clearing faults below.
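The NFB Off/On rules above can be summarized as a small decision function. This is only a sketch of the rules as stated in this document, not firmware logic: the function name, the argument names and the lowercase keyswitch strings are assumptions, and XCP versions other than 1050 and 1060 or later are rejected because the text does not describe them.

```python
def nfb_cycle_clears_fault(xcp_version, keyswitch, marked_for_clear):
    """Return True if an NFB Off/On is expected to clear a primary-suspect fault.

    xcp_version      -- integer XCP level, e.g. 1050 or 1060 (assumed encoding)
    keyswitch        -- "service" or "locked" (assumed spelling)
    marked_for_clear -- True if clearfault/clearstatus/clearfru was invoked
                        before the power cycle
    """
    position = keyswitch.lower()
    if position not in ("service", "locked"):
        raise ValueError(f"unknown keyswitch position: {keyswitch!r}")
    if xcp_version != 1050 and xcp_version < 1060:
        # The document only describes XCP1050 and XCP1060 or later.
        raise ValueError(f"XCP {xcp_version} is not covered by this document")

    if marked_for_clear:
        # A prior clear command marks the fault to be cleared on the next power cycle.
        return True
    if xcp_version == 1050:
        # XCP1050: only a power cycle in the 'service' position clears the fault.
        return position == "service"
    # XCP1060 and later: the keyswitch position makes no difference.
    return False


if __name__ == "__main__":
    print(nfb_cycle_clears_fault(1050, "service", False))
    print(nfb_cycle_clears_fault(1060, "service", False))
```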
2.2 - Commands available to clear the faults :

2.2.1 - clearfru / clearstatus :

These two commands can be used to clear the fault information of all the FRUs (clearfru) or the fault information of FRUs that have been detected as faulty units (clearstatus). You must be in Escalation mode to run the clearfru / clearstatus commands. The domains must be down and an immediate platform power cycle is required. The component is reported as faulted until the power cycle has occurred. The 'clearfru' and 'clearstatus' commands must be used *only* under direction from TSC and/or Engineering.

Example :

XSCF> showstatus
CMU#0 Status:Faulted;
service> clearstatus /CMU#0
XSCF> showstatus
No failures found in System Initialization.

2.2.2 - clearfault :

The 'clearfault' command provides a way to manage faults for primary suspects and may be used for the following actions :
As a result of executing this command, faults will be cleared on the next power-cycle. Notes :
Example : service> clearfault -l /IOU#0
Fault will be cleared after circuit breaker off and on
Example : service> clearfault /IOU#0
clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on.
Continue? [y|n]: y
Fault will be cleared after circuit breaker off and on
Example : service> clearfault /IOU#0
Unable to get maintenance lock
clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on.
Continue? [y|n]:

2.2.2.2 - M3000/M4000/M5000 :
Example : service> clearfault /FAN_B#0
Testing the hardware...
Example : service> clearfault /MBU_A/MEMB#0/MEM#0A
clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on.
Continue? [y|n]: yes
Fault will be cleared after circuit breaker off and on
Note : With XCP 1050 there is no way to clear a UFF or CFF fault in a DDCR on an M4000/M5000 IOU using the clearfault command. The only way is to invoke clearfru in escalation mode and power cycle the platform. Reference : CR#6577745. This is fixed in XCP 1060.
Example : service> clearfault /PSU#0
Testing the hardware...
2.3 - In summary, to clear a fault on a FRU :

M3000/M4000/M5000 :
M8000 / M9000 :
For more detail on accessing escalation or service mode see <Document 1002928.1>

LED behaviour :
Each M3000/M4000/M5000/M8000/M9000 system has an Operator Panel (OPNL) with 3 LEDs :
When turned ON, the Check LED, aka the System Check LED, indicates a fault on the system. See below. Most of the FRUs on the SPARC Enterprise servers have a FRU check LED which reports that the unit contains an error. However, some FRUs like DIMMs or CPUMs do not have LEDs. Refer to the SPARC Enterprise Mx000/Mx000 Servers Service Manuals for more information about LEDs. For Sun SPARC Enterprise servers running a version of XCP later than 1050, the check LEDs will be set and reset as below :
Note that the check LED for the PSUs on the M8000/M9000 may not behave as expected: it may not be set even when the PSU is the primary suspect.
Check LEDs behaviour after clearfault, clearstatus, replacefru :
Faults on IOBox :
XSCF> showstatus
    IOU#4 Status:Normal;
*       PCI#5 Status:Degraded;
    IOX@X156 Status:Normal;
*       IOB1 Status:Faulted;
*       PS0 Status:Degraded;
*       PS1 Status:Degraded;

XSCF> ioxadm env -v
Location         Sensor    Min   Min Alarm   Value   Max Alarm   Max   Units
[...]
IOX@X156/IOB1    SERVICE   -     -           On      -           -     LED

Even if a fault is reported on the IOBox and its Service LED is lit, the OPNL System Check LED is not lit.

XSCF> showstatus
    IOU#4 Status:Normal;
*       PCI#5 Status:Degraded;
    IOX@X156 Status:Normal;
*       IOB1 Status:Faulted;
*       PS0 Status:Degraded;
*       PS1 Status:Degraded;
service> clearfault IOU#4-PCI#5
service> clearfault IOX@X156/IOB1
service> clearfault IOX@X156/PS0
service> clearfault IOX@X156/PS1
XSCF> showstatus
No failures found in System Initialization.
service> clearfault IOX@X1CK/IOB0/LINK
Clearing a fault on an IOX

XSCF> showstatus
    IOU#0 Status:Normal;
        PCI#2 Status:Normal;
*           IOX@X1NW Status:Faulted; <<<=== !!!
XSCF> clearfault IOX@X1NW
This command must only be used at the request of a product support engineer.
Continue? [y|n]: y
XSCF> showstatus
No failures found in System Initialization.
XSCF>

Hierarchical fault clearing :

In certain cases, the faulted resources appear to be hierarchical.

XSCF> showstatus
*   CMU#0 Status:Faulted;
*       CPUM#0-CHIP#0 Status:Faulted;
*       MEM#03A Status:Faulted;
service> clearfault CMU#0
XSCF> showstatus
    CMU#0 Status:Normal;
*       CPUM#0-CHIP#0 Status:Faulted;
*       MEM#03A Status:Faulted;

CMU#0 remains in the output, although not marked faulted, until the subordinates are cleared:

service> clearfault CMU#0/CPUM#0
XSCF> showstatus
    CMU#0 Status:Normal;
*       MEM#03A Status:Faulted;
service> clearfault CMU#0/MEM#03A
XSCF> showstatus
No failures found in System Initialization.
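The hierarchical example above clears the parent first and then each subordinate. The order of commands can be generated as follows. This is a sketch based solely on that one example: the function name is hypothetical, and the rule that "CPUM#0-CHIP#0" maps to the FRU path "CMU#0/CPUM#0" is inferred from the commands shown, not taken from a documented naming rule.

```python
def clearfault_commands(parent, faulted_children):
    """Build clearfault commands: parent first, then each subordinate FRU.

    parent           -- faulted parent unit as shown by showstatus, e.g. "CMU#0"
    faulted_children -- subordinate names as shown by showstatus
    """
    commands = [f"clearfault {parent}"]
    seen = set()
    for child in faulted_children:
        # Assumption: a chip entry such as "CPUM#0-CHIP#0" is cleared through
        # its CPU module, so the "-CHIP#n" suffix is dropped from the path.
        fru = child.split("-CHIP#")[0]
        path = f"{parent}/{fru}"
        if path not in seen:
            seen.add(path)
            commands.append(f"clearfault {path}")
    return commands


if __name__ == "__main__":
    for command in clearfault_commands("CMU#0", ["CPUM#0-CHIP#0", "MEM#03A"]):
        print(command)
```

Two faulted chips on the same CPU module produce a single command, since both resolve to the same FRU path.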
1. M3000/M4000/M5000 :

1.1 - clearing a fault on a PSU :

XSCF> showstatus
*   PSU#1 Status:Faulted;
service> clearfault /PSU#1
Testing the hardware...
XSCF> showstatus
No failures found in System Initialization.

1.2 - clearing a fault on a DIMM :

XSCF> showstatus
    MBU_A Status:Normal;
        MEMB#0 Status:Normal;
*           MEM#0A Status:Faulted;
service> clearfault /MBU_A/MEMB#0/MEM#0A
clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on.
Continue? [y|n]: yes
Fault will be cleared after circuit breaker off and on
XSCF> showstatus
    MBU_A Status:Normal;
        MEMB#0 Status:Normal;
*           MEM#0A Status:Faulted;

Note: Normally, the change in serial number after the DIMM replacement is detected by Solaris and the fault is cleared. Due to Solaris CR 6668237 the faulted DIMM may need to be cleared manually in Solaris as well. CR 6668237 is fixed in patch 143527-01.

1.3 - clearing a fault on a CPUM :

XSCF> showstatus
    MBU_A Status:Normal;
*       CPUM#0-CHIP#0 Status:Faulted;
*       CPUM#0-CHIP#1 Status:Faulted;
service> clearfault /MBU_A/CPUM#0
clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on.
Continue? [y|n]: y
Fault will be cleared after circuit breaker off and on
XSCF> showstatus
    MBU_A Status:Normal;
*       CPUM#0-CHIP#0 Status:Faulted;
*       CPUM#0-CHIP#1 Status:Faulted;

1.4 - clearing a degraded MBU :

XSCF> showstatus
*   MBU_B Status:Degraded;
service> clearfault /MBU_B
clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on.
Continue? [y|n]: y
Fault will be cleared after circuit breaker off and on
XSCF> showstatus
*   MBU_B Status:Degraded;

2. M8000 / M9000 :

2.1 - clearing a fault on a PSU :

XSCF> showstatus
*   PSU#0 Status:Faulted;
service> clearfault /PSU#0
Testing the hardware...
XSCF> showstatus
No failures found in System Initialization.

2.2 - clearing a fault on the OPNL :

XSCF> showstatus
*   OPNL#0 Status:Faulted;
service> clearfault /OPNL
clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on.
Continue? [y|n]: y
Fault will be cleared after circuit breaker off and on
XSCF> showstatus
*   OPNL#0 Status:Faulted;

2.3 - clearing a fault on an IOU not part of a running domain :

XSCF> showstatus
*   IOU#1 Status:Faulted;
XSCF> showboards -v -a
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0 * 00(00)   Assigned    y    n    n    Unknown Normal   n
01-0 * 00(01)   Assigned    y    n    n    Unknown Faulted  n
02-0   SP       Unavailable y    n    n    Unknown Normal   n
03-0   SP       Unavailable y    n    n    Unknown Normal   n
service> clearfault /IOU#1
Testing the hardware. This may take up to six minutes
XSCF> showstatus
No failures found in System Initialization.

2.4 - clearing a fault on a CMU not part of a running domain :

service> clearfault /CMU#2/CPUM#2
Testing the hardware. This may take up to six minutes
XSCF> showstatus
No failures found in System Initialization.

2.5 - clearing a fault on a CMU which is part of a running domain :

XSCF> showstatus
    CMU#3 Status:Normal;
*       CPUM#0-CHIP#0 Status:Faulted;
*   OPNL#0 Status:Faulted;
XSCF> showboards -v -a
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0   00(00)   Assigned    y    y    y    Passed  Normal   n
01-0   00(01)   Assigned    y    y    y    Passed  Normal   n
03-0   00(03)   Assigned    y    y    y    Passed  Degraded n
service> clearfault /CMU#3/CPUM#0
FRU cannot be detached
The FRU is in an active domain. It must be removed from the domain or the domain must be powered off, before its fault status can be cleared.
clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on.
Continue? [y|n]: n

We can use DR to detach the XSB and clear the fault.

XSCF> deleteboard -c unassign 03-0
XSB#03-0 will be unassigned from domain immediately. Continue?[y|n] :y
Start unconfiguring XSB from domain.
Unconfigured XSB from domain.
XSB power off sequence started. [1200sec]
0...end
Operation has completed.
XSCF> showboards -v -a
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0   00(00)   Assigned    y    y    y    Passed  Normal   n
01-0   00(01)   Assigned    y    y    y    Passed  Normal   n
03-0   SP       Available   y    n    n    Passed  Degraded n
service> clearfault /CMU#3/CPUM#0
Testing the hardware. This may take up to six minutes
XSCF> showboards -v -a
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0   00(00)   Assigned    y    y    y    Passed  Normal   n
01-0   00(01)   Assigned    y    y    y    Passed  Normal   n
03-0   00(03)   Assigned    y    y    y    Passed  Normal   n
XSCF> showstatus
No failures found in System Initialization.

2.6 - clearing a fault on a CMU which is part of a running domain but DR cannot be used :

XSCF> showstatus
    CMU#3 Status:Normal;
*       MEM#00A Status:Faulted;
XSCF> clearfault /CMU#3/MEM#00A
FRU cannot be detached
clearfault: Fault cannot be cleared for this FRU.
FRU will be marked to clear fault on next circuit breaker off and on.
Continue? [y|n]: n
XSCF> showboards -v -a
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-2   00(00)   Assigned    y    y    y    Passed  Normal   n
03-0   00(12)   Assigned    y    y    y    Passed  Degraded n

Since DR cannot be used, the domain must be powered off prior to using clearfault :

XSCF> showdomainstatus -d 0
DID         Domain Status
00          Powered Off
service> clearfault /CMU#3/MEM#00A
Testing the hardware. This may take up to six minutes
XSCF> showstatus
No failures found in System Initialization.
XSCF> showboards -v -d 0
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-2 * 00(00)   Assigned    y    n    n    Passed  Normal   n
03-0 * 00(12)   Assigned    y    n    n    Passed  Normal   n
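Before choosing between a DR detach and a domain power-off, it helps to see which XSBs carry a fault and whether they are connected and configured into a domain. The following Python sketch parses captured 'showboards -v -a' output. The function name and dictionary keys are assumptions, and the parser relies on the ten-column layout shown in the examples above, in which the R column is either blank or '*'.

```python
import re

# Sample rows taken from the showboards example in section 2.3 above.
SAMPLE = """\
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0 * 00(00)   Assigned    y    n    n    Unknown Normal   n
01-0 * 00(01)   Assigned    y    n    n    Unknown Faulted  n
02-0   SP       Unavailable y    n    n    Unknown Normal   n
03-0   SP       Unavailable y    n    n    Unknown Normal   n
"""

KEYS = ("xsb", "did_lsb", "assignment", "pwr", "conn", "conf", "test", "fault", "cod")


def parse_showboards(text):
    """Return one dict per XSB row; header and separator lines are skipped."""
    boards = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not re.fullmatch(r"\d\d-\d", tokens[0]):
            continue
        reserved = len(tokens) > 1 and tokens[1] == "*"
        if reserved:
            tokens.pop(1)  # drop the R column marker so the fields line up
        if len(tokens) != len(KEYS):
            continue  # unexpected layout; ignore rather than guess
        board = dict(zip(KEYS, tokens))
        board["reserved"] = reserved
        boards.append(board)
    return boards


if __name__ == "__main__":
    for board in parse_showboards(SAMPLE):
        if board["fault"] != "Normal":
            print(board["xsb"], board["fault"], "conn=" + board["conn"], "conf=" + board["conf"])
```

In the examples above, clearfault ran the hardware test directly only on boards showing n in both the Conn and Conf columns; the board showing y in both had to be detached or its domain powered off first.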
References

<NOTE:1002928.1> - Sun SPARC(R) Enterprise M9000/M8000/M5000/M4000/M3000 Server: Accessing service mode
<NOTE:1386385.1> - M3000/M4000/M5000/M8000/M9000 Server: Using FMA With OPL Servers