Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1506500.1
Update Date: 2014-12-16

Solution Type: Problem Resolution

Solution 1506500.1: Sun Storage 7000 Unified Storage System: Alerts from readzilla cannot be cleared - even after device replacement or reboot


Related Items
  • Sun Storage 7410 Unified Storage System
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7420
  • Sun ZFS Storage 7320
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-5812844634>

Applies to:

Sun ZFS Storage 7320 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7420 - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7410 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7310 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
7000 Appliance OS (Fishworks)

Symptoms

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage ZFS Storage Appliance Community

The symptoms of this problem will present as:

Alerts are received for a readzilla read-cache SSD in the head unit of a ZFS Storage Appliance, such as:

The device 'NAS7000/HDD 3' has failed or could not be opened.

However, attempting to clear this alert, or even replacing the component identified as faulted, simply causes a new alert to be generated for the same problem.

This is not the same issue as discussed in <Document 1457578.1>. A reboot does not clear the problem after replacing the device identified as faulted.

Oracle support will need to be engaged to continue diagnosis.

The pool may show its cache devices like this:

       cache
         c2t2d0                   ONLINE       0     0     0
         c2t4d0                   ONLINE       0     0     0
         c2t5d0                   ONLINE       0     0     0
         c2t2d0                   FAULTED      0     0     0  corrupted data
         c2t4d0                   FAULTED      0     0     0  corrupted data
         c2t5d0                   FAULTED      0     0     0  corrupted data
         c2t3d0                   FAULTED      0     0     0  corrupted data
         c2t3d0                   ONLINE       0     0     0

The ONLINE c2t2d0, c2t3d0, c2t4d0 and c2t5d0 entries correspond to the readzilla devices in the head that currently has the pool imported.

The FAULTED entries correspond to the readzilla devices in the partner head; they will come online should that head take over this pool.

However, in this case the alert was reporting that HDD 3 was FAULTED when in fact the zpool status shows it is ONLINE (HDD 3 = c2t3d0; see below).

To see what is happening, check the addresses of the devices in mdb, then look at the FRU details for the FAULTED device that the alert is logged against and compare them with the corresponding ONLINE device of the same name (see <Bug 15797165> for further details). In this case HDD 3 was generating the alerts; HDD 3 corresponds to c2t3d0, and there are two instances of c2t3d0, one ONLINE and one FAULTED:

> ::spa -v

ffffff823692ecc0 HEALTHY   -              /dev/dsk/c2t3d0s0
ffffff81fa1f6d40 CANT_OPEN CORRUPT_DATA   /dev/dsk/c2t3d0s0


> ffffff823692ecc0::print -t vdev_t vdev_guid vdev_fru
uint64_t vdev_guid = 0x344e7a459b00e598
char *vdev_fru = 0xffffff82366be590 "hc://:product-id=SUN-FIRE-X4170-M2-SERVER:product-sn=1102FMM0D6:server-id=adc40xstor07:chassis-id=1102FMM0D6:serial=805S1011TBUZ:part=TOSHIBA-THNS512GG8BBAA:revision=AGYA0201/chassis=0/bay=3/disk=0"

> ffffff81fa1f6d40::print -t vdev_t vdev_guid vdev_fru
uint64_t vdev_guid = 0xf242182d2d7c5787
char *vdev_fru = 0xffffff82366af688 "hc://:product-id=SUN-FIRE-X4170-M2-SERVER:product-sn=1102FMM0D6:server-id=adc40xstor07:chassis-id=1102FMM0D6:serial=805S1011TBUZ:part=TOSHIBA-THNS512GG8BBAA:revision=AGYA0201/chassis=0/bay=3/disk=0"

Here it can be seen that the FRU details of the two devices are identical - the chassis-id and serial number are the same for both, yet the devices are supposed to be in different heads.
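As an illustration of the comparison above, the two vdev_fru strings can be parsed and checked field by field; if the chassis-id and serial fields match for two vdevs of the same name, the FRU details have been duplicated across heads. A minimal sketch in Python (the parse_fru helper is illustrative, not an appliance tool; the FRU strings are copied from the mdb output above):

```python
import re

def parse_fru(fru):
    """Split an hc:// FRU string into its key=value authority fields
    plus the trailing hc path (chassis/bay/disk)."""
    body = fru[len("hc://"):] if fru.startswith("hc://") else fru
    authority, _, path = body.partition("/")
    fields = {}
    for part in authority.split(":"):
        if "=" in part:
            key, _, value = part.partition("=")
            fields[key] = value
    fields["path"] = path
    return fields

# vdev_fru strings from the two c2t3d0 vdev_t structures in the mdb output:
fru_online = ("hc://:product-id=SUN-FIRE-X4170-M2-SERVER:product-sn=1102FMM0D6"
              ":server-id=adc40xstor07:chassis-id=1102FMM0D6:serial=805S1011TBUZ"
              ":part=TOSHIBA-THNS512GG8BBAA:revision=AGYA0201"
              "/chassis=0/bay=3/disk=0")
fru_faulted = fru_online  # in this case the two strings are byte-identical

a, b = parse_fru(fru_online), parse_fru(fru_faulted)

# Two vdevs in different cluster heads should never share chassis-id + serial:
mixed_up = (a["chassis-id"] == b["chassis-id"]) and (a["serial"] == b["serial"])

# The bay number in the hc path maps to the HDD label in the alert (bay=3 -> HDD 3):
bay = re.search(r"bay=(\d+)", a["path"]).group(1)

print("chassis-id:", a["chassis-id"])
print("serial:    ", a["serial"])
print("alert device: HDD", bay)
print("FRU duplicated across heads:", mixed_up)
```

When the FRU strings come from healthy clustered heads, the chassis-id and serial fields differ and the check reports no duplication.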

Somehow the FRU details of one cache device have become mixed up with the other.

Cause

The FRU details of the ONLINE readzilla cache device in the head that currently has the pool imported have become mixed up with the FRU details of the readzilla cache device that has the same device name but sits in the other head. The readzilla device in the partner head legitimately shows up as FAULTED, because the pool cannot use the readzilla devices in the partner head. However, because the FRU details of the readzilla device in the head owning the pool are now the same as those of the readzilla device in the partner head, an alert is generated stating that the readzilla is faulted, even though it reports as ONLINE.

Solution

The alert does not cause any functional issues: I/O still goes through the device that is reported as faulted; it is only the FRU that is incorrectly reported.

A maintenance window can therefore be planned to carry out the necessary work to enable the alert to be cleared.


When a maintenance window is obtained, Oracle support will need to be engaged to run through a procedure via a remote support tool such as Oracle Shared Shell or WebEx.

This procedure will involve a takeover and a failback so there will be some interruption to service as these operations complete.

References

<BUG:15797165> - SUNBT7175772 FALSE ALERT "DEVICE HAS FAILED OR COULD NOT BE OPENED" GENERATED FO
<NOTE:1457578.1> - Sun Storage 7000 Unified Storage System: When Replacing Faulted Readzilla SSD and/or System Disks in the head unit the replacement is not recognized

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.