ILOM critical event: Unrecoverable error during rebuild, sun-id=/SYS/DBP/HDD

Asset ID:	1-72-2209723.1
Update Date:	2018-03-23
Keywords:

Solution Type Problem Resolution Sure

Solution 2209723.1 : ILOM critical event: Unrecoverable error during rebuild, sun-id=/SYS/DBP/HDD

Applies to:

Oracle Server X5-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

ILOM shows the following critical event:

342 Sat Oct 8 03:09:36 2016 Storage Log critical
Unrecoverable error during rebuild, sun-id=/SYS/DBP/HDD2

Cause

This event indicates that there was a read/write medium error on the disk during a "rebuild".
This event is sent by the Oracle Hardware Management Pack (OHMP) agent to the ILOM and does not mean that the disk is faulty.

In this particular case, the read/write medium error occurred during a consistency check and was corrected:

10/08/16 3:00:03: 113=Unexpected sense: PD 0a(e0xfc/s2) Path 5000cca02f096189, CDB: 28 00 00 04 46 00 00 02 00 00, Sense: 3/11/00 <-------------- Medium Error - read error
10/08/16 3:00:03: C0:Raw Sense for PD a: f0 00 03 00 04 47 01 18 00 00 00 00 11 00 2d 80 00 67 00 00 f7 2d 00 00 00 00 6e 00 04 df 00 00
10/08/16 3:00:03: C0:DM_PerformSenseDataRecovery:Medium Error DevId[a] devHandle b RDM=40c70e00 retries=0 callback=0
10/08/16 3:00:03: C0:DM_PerformSenseDataRecovery: Medium Error is for: cmdId=420, ld=0, src=6, cmd=1, lba=88e00, cnt=200, rmwOp=0
10/08/16 3:00:03: C0:ErrLBAOffset (101) LBA(44600) BadLba=44701
10/08/16 3:00:03: C0:BBM_CheckCmdTimer: TIMER_WITHIN_LIMIT

10/08/16 3:00:03: C0:EVT#04735-10/08/16 3:00:03: 57=Consistency Check corrected medium error (VD 00/0 at 88f01, PD 0a(e0xfc/s2) at 44701) <------ corrected!!
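
The numbers in this excerpt can be cross-checked by hand. The sketch below is illustrative only and is not an Oracle tool. It assumes the CDB is a SCSI READ(10) command, that the raw sense bytes are fixed-format sense data, and that the firmware prints ErrLBAOffset and the LBAs in hexadecimal. The function names are hypothetical.

```python
# Values copied from the controller firmware log excerpt above.
CDB = bytes.fromhex("28 00 00 04 46 00 00 02 00 00")
RAW_SENSE = bytes.fromhex(
    "f0 00 03 00 04 47 01 18 00 00 00 00 11 00 2d 80 "
    "00 67 00 00 f7 2d 00 00 00 00 6e 00 04 df 00 00"
)
ERR_LBA_OFFSET = 0x101  # "ErrLBAOffset (101)", assumed to be hexadecimal
VD_START_LBA = 0x88E00  # "lba=88e00" of the virtual-disk command


def decode_read10(cdb: bytes) -> dict:
    """Decode the LBA and transfer length of a SCSI READ(10) CDB."""
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8
    }


def decode_fixed_sense(sense: bytes) -> dict:
    """Decode the main fields of fixed-format SCSI sense data."""
    if (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "valid": bool(sense[0] & 0x80),  # INFORMATION field is valid
        "sense_key": sense[2] & 0x0F,    # 0x3 = MEDIUM ERROR
        "information": int.from_bytes(sense[3:7], "big"),
        "asc": sense[12],                # 0x11/0x00 = UNRECOVERED READ ERROR
        "ascq": sense[13],
    }


read = decode_read10(CDB)
sense = decode_fixed_sense(RAW_SENSE)
bad_lba = read["lba"] + ERR_LBA_OFFSET
vd_bad_lba = VD_START_LBA + ERR_LBA_OFFSET

print(hex(read["lba"]), hex(read["blocks"]))  # 0x44600 0x200
print(sense["sense_key"], hex(sense["asc"]), sense["ascq"])
print(hex(bad_lba), hex(vd_bad_lba))          # 0x44701 0x88f01
```

Under these assumptions the failing block reported in the sense data (0x44701) equals the start LBA of the read plus the error offset, and the same offset applied to the virtual-disk LBA gives 0x88f01, the address that the consistency check reports as corrected.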

Note that the disk controller firmware triggers a consistency check (cc) periodically (once per week) to validate the consistency of the RAID volume. Example:

01/30/16 9:57:53: C0:Next cc scheduled to start at 02/06/16 3:00:00
02/06/16 3:58:26: C0:Next cc scheduled to start at 02/13/16 3:00:00
02/13/16 3:58:23: C0:Next cc scheduled to start at 02/20/16 3:00:00
02/20/16 3:58:23: C0:Next cc scheduled to start at 02/27/16 3:00:00
02/27/16 3:58:28: C0:Next cc scheduled to start at 03/05/16 3:00:00
03/05/16 3:58:24: C0:Next cc scheduled to start at 03/12/16 3:00:00
03/12/16 3:58:24: C0:Next cc scheduled to start at 03/19/16 3:00:00
03/19/16 3:58:22: C0:Next cc scheduled to start at 03/26/16 3:00:00
03/26/16 3:58:23: C0:Next cc scheduled to start at 04/02/16 3:00:00
04/02/16 3:58:25: C0:Next cc scheduled to start at 04/09/16 3:00:00
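
The weekly cadence can be confirmed from the log itself. The following is a minimal sketch, assuming the firmware log lines keep the exact "Next cc scheduled to start at MM/DD/YY H:MM:SS" wording shown above. The helper names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# A few lines copied from the firmware log excerpt above.
LOG = """\
01/30/16 9:57:53: C0:Next cc scheduled to start at 02/06/16 3:00:00
02/06/16 3:58:26: C0:Next cc scheduled to start at 02/13/16 3:00:00
02/13/16 3:58:23: C0:Next cc scheduled to start at 02/20/16 3:00:00
02/20/16 3:58:23: C0:Next cc scheduled to start at 02/27/16 3:00:00
02/27/16 3:58:28: C0:Next cc scheduled to start at 03/05/16 3:00:00
"""

PATTERN = re.compile(
    r"Next cc scheduled to start at (\d{2}/\d{2}/\d{2}) (\d{1,2}:\d{2}:\d{2})"
)


def next_cc_times(log_text: str) -> list:
    """Return the scheduled consistency-check start times found in the log."""
    return [
        datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
        for date, time in PATTERN.findall(log_text)
    ]


def intervals(times: list) -> list:
    """Return the gaps between consecutive scheduled start times."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


times = next_cc_times(LOG)
gaps = intervals(times)
is_weekly = all(gap == timedelta(days=7) for gap in gaps)
print(is_weekly)
```

A critical event that lands shortly after one of these scheduled start times, as the 03:09:36 event in the Symptoms section does, is consistent with a consistency check rather than a real rebuild.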

If a read/write medium error occurs on a disk during the consistency check (or during a rebuild), even if it was corrected,
OHMP sends an event (Unrecoverable error during rebuild) to the ILOM, where it is displayed as critical. It does not mean the disk is faulty.

Let the disk controller manage the read/write errors, and replace the disk only if the disk controller flags it as faulted.

Solution

Please ignore the ILOM critical event "Unrecoverable error during rebuild" and do not replace the disk unless it is flagged as faulted by the disk controller.
If the disk controller flags the disk as faulted, OHMP also notifies the ILOM. The following are examples of ILOM events that indicate a faulted disk:

/SYS/DBP/HDDx has fault.io.disk.predictive-failure with probability=100.
SMART health status is failed, sun-id=/SYS/DBP/HDDx
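
The triage rule described in this section can be sketched as a small filter over ILOM event text. This is a hypothetical helper, not part of ILOM or OHMP. The match strings are taken only from the example events in this document, so anything else is returned as "unknown" and left for manual review.

```python
# Event text that this document says to ignore.
IGNORABLE_MARKERS = ("Unrecoverable error during rebuild",)

# Event text that this document gives as examples of a faulted disk.
FAULT_MARKERS = (
    "fault.io.disk.predictive-failure",
    "SMART health status is failed",
)


def classify_event(message: str) -> str:
    """Classify an ILOM event message according to the rule in this note."""
    # Check fault markers first so a real fault is never masked.
    if any(marker in message for marker in FAULT_MARKERS):
        return "disk-faulted"
    if any(marker in message for marker in IGNORABLE_MARKERS):
        return "ignore"
    return "unknown"


events = [
    "Unrecoverable error during rebuild, sun-id=/SYS/DBP/HDD2",
    "/SYS/DBP/HDDx has fault.io.disk.predictive-failure with probability=100.",
    "SMART health status is failed, sun-id=/SYS/DBP/HDDx",
]
for event in events:
    print(classify_event(event), "<-", event)
```

Only events classified as "disk-faulted" would justify replacing the disk under the guidance above.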

A bug has been filed to prevent OHMP from sending these confusing "critical" events to the ILOM.

References

<BUG:25075936> - OHMP REPORTING "UNRECOVERABLE ERROR DURING REBUILD" EVENTS TO ILOM

Attachments

This solution has no attachment