Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1533067.1
Update Date:2018-02-28

Solution Type  Problem Resolution Sure

Solution  1533067.1 :   FMD-8000-0W defect.sunos.fmd.nosub with disabled cpumem-diagnosis module  


Related Items
  • Solaris (SPARC)
  • Sun Fire V490 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-x8x0/Ex900


fmadm faulty -a reports FMD-8000-0W defect.sunos.fmd.nosub with cpumem-diagnosis module disabled and memory errors

Created from <SR 3-6861293130>

Applies to:

Solaris (SPARC) - Version 10 and later
Sun Fire V490 Server - Version All Versions to All Versions [Release All Releases]
SPARC

Symptoms

`fmadm faulty -a` reports FMD-8000-0W defect.sunos.fmd.nosub

or

/var/adm/messages shows

fmd: [ID 441519 daemon.error] SUNW-MSG-ID: FMD-8000-0W, TYPE: Defect, VER: 1, SEVERITY: Minor
EVENT-TIME: Sat Feb 23 09:22:32 EST 2013
PLATFORM: SUNW,Netra-T12, CSN: -, HOSTNAME: myhost
SOURCE: fmd-self-diagnosis, REV: 1.0
EVENT-ID: 2bdc346f-c7e7-64e9-b5d1-a4aa40f8534c
DESC: The Solaris Fault Manager received an event from a component to which no automated diagnosis software is currently subscribed.  Refer to http://sun.com/msg/FMD-8000-0W for more information.
AUTO-RESPONSE: Error reports from the component will be logged for examination by Sun.
IMPACT: Automated diagnosis and response for these events will not occur.
REC-ACTION: Run pkgchk -n SUNWfmd to ensure that fault management software is installed properly.  Contact Sun for support.

 
and

fmdump -e reports memory errors such as

ereport.cpu.ultraSPARC-III.ce
ereport.cpu.ultraSPARC-IIIplus.ce
ereport.cpu.ultraSPARC-IV.ce

ereport.cpu.ultraSPARC-IVplus.ce
ereport.cpu.ultraSPARC-IVplus.cpc
ereport.cpu.ultraSPARC-IVplus.edc
ereport.cpu.ultraSPARC-IVplus.ucc
ereport.cpu.ultraSPARC-IVplus.wdc
ereport.cpu.ultraSPARC-IVplus.l3-cpc
ereport.cpu.ultraSPARC-IVplus.l3-edc
ereport.cpu.ultraSPARC-IVplus.l3-ucc
ereport.cpu.ultraSPARC-IVplus.l3-wdc

ereport.io.sch.ecc.drce
ereport.io.sch.ecc.dwce
ereport.io.sch.ecc.s-drce

ereport.io.xmits.ecc.drce
ereport.io.xmits.ecc.dwce
ereport.io.xmits.ecc.s-drce
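
To see at a glance which of these classes are present in the error log, the `fmdump -e` output can be tallied by class. The following is a minimal sketch, not part of the original solution; the function name `count_ereport_classes` is hypothetical, and it assumes the class is the last field on each `fmdump -e` line.

```shell
# Tally ereport classes read from stdin (expects `fmdump -e` style lines,
# where the last whitespace-separated field is the class name).
count_ereport_classes() {
    awk '$NF ~ /^ereport\./ { n[$NF]++ }
         END { for (c in n) print n[c], c }' | sort -rn
}
```

Typical use would be `fmdump -e | count_ereport_classes`, which prints one line per class with the most frequent class first.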

 

Cause

Something may have terminated the FMA cpumem-diagnosis module, which normally handles these memory events.

Check the status of the cpumem-diagnosis module with the `fmadm config` command:


# fmadm config
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-diagnosis         1.7     active  CPU/Memory Diagnosis
cpumem-retire            1.1     active  CPU/Memory Retire Agent
datapath-retire          1.0     active  Datapath Retire Agent
disk-transport           1.0     active  Disk Transport Agent
eft                      1.16    active  eft diagnosis engine
...snip...

 
If cpumem-diagnosis is missing from the list, it may need to be restarted.  Normally the module becomes active at boot time, but there is a known situation in which cpumem-diagnosis fails repeatedly.
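
The check against the `fmadm config` listing can also be scripted. This is a sketch under assumptions: the function name `module_status` is hypothetical, and it relies on the column layout shown above (module name in the first column, status in the third). On Solaris 10, `nawk` or `/usr/xpg4/bin/awk` may be needed in place of `awk`, because the default `/usr/bin/awk` does not accept `-v`.

```shell
# Print the STATUS column for the module named in $1, or "missing" if the
# module does not appear. Reads `fmadm config` output from stdin.
module_status() {
    awk -v m="$1" '
        $1 == m { print $3; found = 1 }
        END     { if (!found) print "missing" }'
}
```

For example, `fmadm config | module_status cpumem-diagnosis` would print `active` for the listing above and `missing` when the module has been disabled.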

Check the output of `fmdump -e` to see when it may have died.  Look for:

ereport.fm.fmd.module
ereport.fm.fmd.mod_init

 

Then, with `fmdump -eV`, look at the ereport.fm.fmd.module event for mod-name = cpumem-diagnosis and a message such as:

msg = Lxcache referenced by case ab0752b7-fb16-4812-fd7e-c3eefa7bcb2a does not exist in saved state
msg = Lxcache buffer referenced by case 46507d1f-5515-c557-8f89-a0190c9b8106 is 196 bytes. Expected size is 324 bytes

 
Also with fmdump -eV, look at the ereport.fm.fmd.mod_init for "msg = failed to load /usr/platform/sun4u/lib/fm/fmd/plugins/cpumem-diagnosis.so: client requested that module execution abort"




Dec 06 2011 01:17:30.099492760 ereport.fm.fmd.module
nvlist version: 0
        version = 0x0
        class = ereport.fm.fmd.module
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        product-id = SUNW,Netra-T12
                        server-id = myhost
                (end authority)

                mod-name = cpumem-diagnosis
                mod-version = 1.7
        (end detector)

        ena = 0x6e84ec9c6002401
        msg = Lxcache referenced by case ab0752b7-fb16-4812-fd7e-c3eefa7bcb2a does not exist in saved state
        __ttl = 0x1
        __tod = 0x4eddb37a 0x5ee2398

Dec 06 2011 01:17:30.128265240 ereport.fm.fmd.mod_init
nvlist version: 0
        version = 0x0
        class = ereport.fm.fmd.mod_init
        ena = 0x6e86a3a53000401
        msg = failed to load /usr/platform/sun4u/lib/fm/fmd/plugins/cpumem-diagnosis.so: client requested that module execution abort

        __ttl = 0x1
        __tod = 0x4eddb37a 0x7a52c18
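
Because `fmdump -eV` output is long, it can help to pull out only the `msg` lines that belong to these two event classes. The following is a rough sketch, assuming the `class = ...` and `msg = ...` layout shown in the sample above; the function name `fmd_module_msgs` is hypothetical.

```shell
# Read `fmdump -eV` output on stdin and print "class: message" for every
# ereport.fm.fmd.module and ereport.fm.fmd.mod_init event.
fmd_module_msgs() {
    awk '
        # Remember the class of the record currently being read.
        $1 == "class" && $2 == "=" { cls = $3 }
        # Print the msg line only for the two classes of interest.
        $1 == "msg" && $2 == "=" &&
        (cls == "ereport.fm.fmd.module" || cls == "ereport.fm.fmd.mod_init") {
            sub(/^[ \t]*msg = /, "")
            print cls ": " $0
        }'
}
```

Typical use would be `fmdump -eV | fmd_module_msgs`. The mod-name is not checked here, so messages from other failing modules would also appear.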

 

This is Bug 15465424: SUNBT6627630-SOLARIS_10U6 FMD FAILS TO LOAD CPUMEM-DIAGNOSIS MODULE

At every fmd startup (and therefore at every boot), fmd replays checkpointed cpumem-diagnosis data.  A logical inconsistency in that checkpointed data causes fmd to disable cpumem-diagnosis.

This in turn causes the FMD-8000-0W defect.sunos.fmd.nosub on the next transient memory error, which should have been handled by the cpumem-diagnosis module.  The FMD-8000-0W can be cleared, but any event that cpumem-diagnosis would normally handle will trigger another FMD-8000-0W defect.sunos.fmd.nosub.

The fix for Bug 15465424 is in patch 138052-01, KJP 137137-09, Solaris 10 10/08 (s10u6), etc., but once the bad checkpoint record has been built the damage persists.  Note that explorer does not collect the fmd checkpoint files.  The only indication is the recurring FMD-8000-0W defect.sunos.fmd.nosub in the fault log, together with the ereport.fm.fmd.module and ereport.fm.fmd.mod_init reports in the error log.

Solution

We can use the workaround for Bug 15465424 to scrub the fouled checkpoint data, namely:

First, roll the logs (to keep the history) and restart the FMA daemon:

logadm -p now -s 1b /var/fm/fmd/errlog
logadm -p now -s 1b /var/fm/fmd/fltlog
svcadm restart fmd

 
...wait two minutes...

Now scrub the checkpoint files:


svcadm disable -st fmd
find /var/fm/fmd/ckpt -type f | xargs rm
svcadm enable fmd

 
...wait 2 minutes...

Now see if everything is clear

fmadm config       - check that cpumem-diagnosis is active
fmadm faulty -a    - shouldn't return anything


Also check whether any new errors were logged on fmd startup; if there were, further investigation is needed.

fmdump -e          - should return nothing
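
The steps above can be collected into a single script. This is a sketch only, not an Oracle-supplied tool: the names `DRY_RUN`, `WAIT_SECS`, `run`, and `scrub_fmd_checkpoints` are hypothetical, and the commands are copied from the steps above. By default it only prints what it would do; review the output before setting `DRY_RUN=0`.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed, unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
# Pause between phases; the article says to wait two minutes.
WAIT_SECS=${WAIT_SECS:-120}

# Either print the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

scrub_fmd_checkpoints() {
    # Roll the error and fault logs to keep the history, then restart fmd.
    run logadm -p now -s 1b /var/fm/fmd/errlog
    run logadm -p now -s 1b /var/fm/fmd/fltlog
    run svcadm restart fmd
    run sleep "$WAIT_SECS"

    # Stop fmd, remove the checkpoint files, and start fmd again.
    run svcadm disable -st fmd
    run sh -c 'find /var/fm/fmd/ckpt -type f | xargs rm'
    run svcadm enable fmd
    run sleep "$WAIT_SECS"
}
```

Calling `scrub_fmd_checkpoints` with the defaults lists the eight commands in order. The verification with `fmadm config`, `fmadm faulty -a`, and `fmdump -e` is deliberately left as a manual step.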

References

<BUG:15465424> - SUNBT6627630-SOLARIS_10U6 FMD FAILS TO LOAD CPUMEM-DIAGNOSIS MODULE

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.