Sun SPARC Enterprise Mx000 Server: How to Troubleshoot XSCFU WDT event SCF-8005-NE due to Error "Recovery of wbuf failed due to a second write error"

Asset ID:	1-72-2329333.1
Update Date:	2017-11-20
Keywords:

Solution Type Problem Resolution Sure

Solution 2329333.1 : Sun SPARC Enterprise Mx000 Server: How to Troubleshoot XSCFU WDT event SCF-8005-NE due to Error "Recovery of wbuf failed due to a second write error"

Applies to:

Sun SPARC Enterprise M9000-64 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Enterprise M5000 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Enterprise M3000 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Enterprise M4000 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Enterprise M8000 Server - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

This document helps diagnose situations in which the XSCFU reports a process down event and the Linux messages file (spos_log/*messages*) shows the following condition:

ECC failures may be returned during readback from the write buffer (second write errors).

The XSCFU may hang, or it may simply reboot and recover.

'showlogs error -v' will report an entry like this:

Date: Sep 21 14:13:49 EST 2017 Code: 40000000-faffc201-011d000200000000
Status: Information Occurred: Sep 21 14:13:48.700 EST 2017
FRU: /FIRMWARE,/XSCFU
Msg: XSCF process down detected
Diagnostic Code:
00000000 00000000 00000000
66666666 2e736364 622e3230 31373039
00000000 00000000 00000000 00000000
UUID: 9521c18b-3686-4d7f-bf94-c62b744d86f2 MSG-ID: SCF-8005-NE

FMA reports the following signature:

XSCF> fmdump -v
TIME UUID MSG-ID
Sep 21 14:06:47.8206 fa4ada6a-29fc-4e9c-b851-1213fa94f3dd SCF-8005-NE
100% defect.chassis.software

Problem in: hc:///chassis=0/xcp=0
Affects: -
FRU: hc://:product-id=SPARC Enterprise M5000:chassis-id=BEF1010A41:server-id=##Hostname##/component=CHASSIS
Location: CHASSIS

The XSCF monitor log ('showlogs monitor') will contain an informational message similar to this:

Sep 21 14:13:58 ##Hostname## Information: /FIRMWARE,/XSCFU:SCF:XSCF process down detected

The following signature will be reported in the Linux messages file (spos_log/*messages*):

Sep 21 19:01:47 (none) kernel: JFFS2:1506020443.308518:scf_init(106):[06 /hcpc/tmp]:mtd->read(0x1facc bytes from 0x4ca0534) returned ECC error
Sep 21 19:01:47 (none) kernel: JFFS2:1506020452.820616:jffs2_gcd_mtd6(185):[06 /hcpc/tmp]:mtd->read(0x232 bytes from 0x4cb57ac) returned ECC error
Sep 21 19:01:05 (none) portmap: portmap startup succeeded
Sep 21 19:01:47 (none) kernel: JFFS2:1506020493.470935:jffs2_gcd_mtd6(185):[06 /hcpc/tmp]:mtd->read(0x1000 bytes from 0x4ca5064) returned ECC error
Sep 21 19:01:47 (none) kernel: JFFS2:1506020493.476174:jffs2_gcd_mtd6(185):[06 /hcpc/tmp]:mtd->read(0x1000 bytes from 0x4ca0840) returned ECC error
Sep 21 19:01:47 (none) kernel: JFFS2:1506020506.662663:exe(338):[06 /hcpc/tmp]:mtd->read(0x44 bytes from 0x4ca4f7c) returned ECC error
Starting pid 360, console /dev/console: '/scf/init/scf_stop'
Sep 21 19:02:39 (none) exiting on signal 15

Cause

Check the Linux messages file (spos_logs/@var@log@messages*) and the dmesg file (spos_logs/@scf@bin@*dmesg*) for ECC and CRC errors such as the following:

Sep 17 10:28:02 ##Hostname## kernel: JFFS2:1505662082.875260:dbs(379):[06 /hcpc/tmp]:Data CRC 67713f1a != calculated CRC 1fcd8730 for node at 04cb0a68
Sep 17 10:28:03 ##Hostname## kernel: JFFS2:1505662083.028197:dbs(379):[06 /hcpc/tmp]:Data CRC 67713f1a != calculated CRC 1fcd8730 for node at 04cb0a68
Sep 17 10:28:03 ##Hostname## kernel: JFFS2:1505662083.182062:dbs(379):[06 /hcpc/tmp]:Data CRC 67713f1a != calculated CRC 1fcd8730 for node at 04cb0a68
Sep 17 10:28:03 ##Hostname## kernel: JFFS2:1505662083.334375:dbs(379):[06 /hcpc/tmp]:Data CRC 67713f1a != calculated CRC 1fcd8730 for node at 04cb0a68

Sep 17 10:26:30 ##Hostname## kernel: Recovery of wbuf failed due to a second write error
Sep 17 10:26:30 ##Hostname## kernel: Write of 1381 bytes at 0x04cb6db0 failed. returned -5, retlen 0
Sep 17 10:26:30 ##Hostname## kernel: Not marking the space at 0x04cb6db0 as dirty because the flash driver returned retlen zero
Sep 17 10:26:30 ##Hostname## kernel: verify buffer:e6a1 e681
Sep 17 10:26:30 ##Hostname## kernel: jffs2_flush_wbuf(): Write failed with -5
Sep 17 10:26:30 ##Hostname## kernel: verify buffer:7288 6288
Sep 17 10:26:30 ##Hostname## kernel: Recovery of wbuf failed due to a second write error ====>>second write error
Sep 17 10:26:30 ##Hostname## kernel: Write of 1381 bytes at 0x03b20000 failed. returned -5, retlen 0
Sep 17 10:26:30 ##Hostname## kernel: Not marking the space at 0x03b20000 as dirty because the flash driver returned retlen zero
Sep 17 10:26:30 ##Hostname## kernel: verify buffer:e6a1 e681
Sep 17 10:26:30 ##Hostname## kernel: jffs2_flush_wbuf(): Write failed with -5
Sep 17 10:26:30 ##Hostname## kernel: verify buffer:7288 6288
Sep 17 10:26:30 ##Hostname## kernel: Recovery of wbuf failed due to a second write error ====>>second write error
Sep 17 10:26:30 ##Hostname## kernel: Write of 1381 bytes at 0x03b20000 failed. returned -5, retlen 0
Sep 17 10:26:30 ##Hostname## kernel: Not marking the space at 0x03b20000 as dirty because the flash driver returned retlen zero
Sep 17 10:26:30 ##Hostname## kernel: verify buffer:e6a1 e681
Sep 17 10:26:30 ##Hostname## kernel: jffs2_flush_wbuf(): Write failed with -5
Sep 17 10:26:30 ##Hostname## kernel: verify buffer:7288 6288

JFFS2:1506021043.098011:jffs2_gcd_mtd13(191):[16 /hcpc/scflog2]:start gc thread.
Data CRC failed on node at 0x10ae8b4: read 0xb2dd1590, calculated 0xffe2a6a5
Data CRC failed on node at 0x1450134: read 0xaf2d60ca, calculated 0x1842b158

JFFS2 error statistics:
jffs2_error.header_crc_failed: 3
jffs2_error.data_crc_failed: 7
jffs2_error.node_crc_failed: 0
jffs2_error.name_crc_failed: 0
jffs2_error.empty_flash: 4
jffs2_error.read_ecc_error_recovered: 0
jffs2_error.read_error: 0
jffs2_error.detect_bad_block: 4
jffs2_error.mark_bad_block: 0
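
As an aid for reading the statistics block above, the following is a minimal Python sketch that extracts the jffs2_error counters from dmesg text and lists the ones that are non-zero. The function names (parse_jffs2_stats, nonzero_counters) are hypothetical and not part of any XSCF tooling, and the sketch assumes each counter appears on its own line in the "jffs2_error.<name>: <count>" form shown above.

```python
import re

# Matches lines such as "jffs2_error.data_crc_failed: 7" (format taken
# from the statistics block shown in this document).
STAT_LINE = re.compile(r"^\s*jffs2_error\.(\w+):\s*(\d+)\s*$")


def parse_jffs2_stats(text):
    """Return a dict mapping counter name to its integer value."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def nonzero_counters(stats):
    """Keep only the counters that recorded at least one event."""
    return {name: count for name, count in stats.items() if count > 0}


if __name__ == "__main__":
    sample = """JFFS2 error statistics:
jffs2_error.header_crc_failed: 3
jffs2_error.data_crc_failed: 7
jffs2_error.node_crc_failed: 0
jffs2_error.detect_bad_block: 4
"""
    for name, count in nonzero_counters(parse_jffs2_stats(sample)).items():
        print(name, count)
```

The sketch only reports the counters; it does not judge whether a given count is significant.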

These errors may leave the system unable to recover from the failing write operation; this is again visible in the Linux messages file (spos_logs/@var@log@messages*) and the dmesg file (spos_logs/@scf@bin@*dmesg*):

-bash-3.2$ grep second @var@log@messages

Sep 17 10:28:03 ##Hostname## kernel: Recovery of wbuf failed due to a second write error
Sep 17 10:28:03 ##Hostname## kernel: Recovery of wbuf failed due to a second write error
Sep 17 10:28:03 ##Hostname## kernel: Recovery of wbuf failed due to a second write error
Sep 17 10:28:03 ##Hostname## kernel: Recovery of wbuf failed due to a second write error
Sep 17 10:28:03 ##Hostname## kernel: Recovery of wbuf failed due to a second write error

Solution

An XSCFU that logs "Recovery of wbuf failed due to a second write error" can run into a watchdog timeout (WDT) situation.
ECC or CRC errors reported alone are not necessarily indicative of an XSCFU hardware problem, but when ECC or CRC errors are reported together with "second write error", the XSCFU hardware qualifies for replacement.
In that case the NAND flash memory on the XSCFU is damaged, and the damage may have caused software trouble.

Replacing the XSCFU hardware will resolve the issue.
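
The decision rule above can be sketched in Python as a quick triage aid for an extracted messages or dmesg file. This is a minimal sketch, not an Oracle tool: the function name classify_messages and the verdict labels are hypothetical, and the match strings are taken from the log excerpts in this document. A second write error without any ECC or CRC error is not covered by this document, so the sketch reports that case separately rather than guessing.

```python
from collections import Counter

# Match strings copied from the log excerpts in this document.
SECOND_WRITE = "Recovery of wbuf failed due to a second write error"
ECC_ERROR = "returned ECC error"
CRC_ERROR = "Data CRC "


def classify_messages(lines):
    """Count the error signatures and return (verdict, counts).

    Verdict labels are illustrative only:
      replacement_candidate - ECC/CRC errors together with second write errors
      ecc_crc_only          - ECC/CRC errors alone (not necessarily hardware)
      second_write_only     - case not described in this document
      clean                 - none of the signatures found
    """
    counts = Counter()
    for line in lines:
        if SECOND_WRITE in line:
            counts["second_write"] += 1
        elif ECC_ERROR in line:
            counts["ecc"] += 1
        elif CRC_ERROR in line:
            counts["crc"] += 1

    has_ecc_or_crc = counts["ecc"] > 0 or counts["crc"] > 0
    if counts["second_write"] > 0 and has_ecc_or_crc:
        verdict = "replacement_candidate"
    elif has_ecc_or_crc:
        verdict = "ecc_crc_only"
    elif counts["second_write"] > 0:
        verdict = "second_write_only"
    else:
        verdict = "clean"
    return verdict, dict(counts)


if __name__ == "__main__":
    sample = [
        "kernel: JFFS2:...:Data CRC 67713f1a != calculated CRC 1fcd8730 for node at 04cb0a68",
        "kernel: Recovery of wbuf failed due to a second write error",
    ]
    print(classify_messages(sample))
```

To run it against a real file, pass the file's lines to classify_messages, for example with open(path) as fh: classify_messages(fh), where path points to the extracted messages file.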

Bug 15632749 : SUNBT6938935 Watchdog timeout situations due to recovery of wbuf failure

References

<NOTE:1942533.1> - M-Series Servers: XSCF watchdog timeout without auto negotiation on Ethernet port
<NOTE:1021929.1> - SCF-8005-NE - XSCF firmware is defective.
<NOTE:1339399.1> - Automated Diagnosis Requirements and Expectations for SPARC Servers
<NOTE:2097446.1> - SRDC – SPARC Mx000 and M10/M12 systems: Simple Instructions to Collect an XCP Snapshot

Attachments

This solution has no attachment