Sun Storage 7000 Unified Storage System: When several disks are failed, what is the replacement order ?

Asset ID:	1-71-1557769.1
Update Date:	2018-01-05
Keywords:

Solution Type Technical Instruction Sure

Solution 1557769.1 : Sun Storage 7000 Unified Storage System: When several disks are failed, what is the replacement order ?

Applies to:

Sun Storage 7110 Unified Storage System - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7120 - Version All Versions and later
Sun Storage 7720 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
7000 Appliance OS (Fishworks)
When several disks in a pool have failed, the order in which they are replaced matters. There is a possibility of data loss if we replace disks without first checking that the affected vdev still has enough RAID parity to reconstruct its data.

Goal

To replace failed disks in the correct order when multiple disks have failed in a pool, especially when the failed disks are in the same vdev.

Solution

When several disks have failed in a RAIDz vdev, the parity level of that vdev determines the method and the order of disk replacement.

We are sometimes called upon to 'fix' ZFS pools that are 'in a mess' with several failed HDDs. Things might go wrong if we simply remove failed or unavailable disks from a pool.

Looking carefully at the status of the pool, we should check how many drives are faulted in the same top-level vdev.
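As an illustration of this check, here is a minimal Python sketch that counts unhealthy disks per top-level vdev. It assumes the pool status is available as text laid out like the config section of `zpool status`. The sample text, the disk names and the function name `faulted_per_vdev` are hypothetical, and nested entries such as in-use spares are not handled.

```python
import re

# Illustrative text only (hypothetical pool and disk names), laid out like
# the config section of `zpool status`.
SAMPLE_STATUS = """\
        NAME        STATE     READ WRITE CKSUM
        pool-0      DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  FAULTED      0     0     0
            c0t2d0  FAULTED      0     0     0
            c0t3d0  ONLINE       0     0     0
          raidz2-1  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
"""

# Top-level vdev names such as raidz1-0, raidz2-3, mirror-1.
VDEV_NAME = re.compile(r"^(raidz[123]|mirror)(-\d+)?$")

# Assumption: disks in these states no longer contribute to redundancy.
BAD_STATES = {"FAULTED", "UNAVAIL", "REMOVED", "OFFLINE"}


def faulted_per_vdev(status_text):
    """Map each top-level vdev name to the list of its unhealthy disks."""
    result = {}
    current = None
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1:
            # Section headers such as "logs" or "spares" end the current vdev.
            current = None
            continue
        name, state = fields[0], fields[1]
        if VDEV_NAME.match(name):
            current = name
            result[current] = []
        elif current is not None and state in BAD_STATES:
            result[current].append(name)
    return result
```

With the sample above, `faulted_per_vdev(SAMPLE_STATUS)` reports two faulted disks in `raidz2-0` and none in `raidz2-1`, so only the first vdev needs attention.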

This is the number of failed drives each type of top-level vdev can tolerate:

raidz = 1 failed disk per vdev

raidz2 = 2 failed disks per vdev

raidz3 = 3 failed disks per vdev

Let's call N the parity level of the RAIDz vdev (N = 1, 2 or 3).

If we lose more than N disks in the same vdev, we lose the RAIDz consistency: there is not enough redundancy left to reconstruct the missing data from parity.

The same applies to a mirror: if we lose both sides of a two-way mirror, the vdev is lost.
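The tolerance rule can be sketched in a few lines of Python. This is only an illustration: the function names and the status labels are hypothetical, and the mirror case assumes that a mirror survives as long as one side remains.

```python
# Parity level N for each RAIDz vdev type.
RAIDZ_PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def tolerated_failures(vdev_type, width):
    """Number of failed disks the vdev can survive.

    width is the number of disks in the vdev; it only matters for mirrors,
    which survive as long as one side is left.
    """
    if vdev_type == "mirror":
        return width - 1
    return RAIDZ_PARITY[vdev_type]


def classify(vdev_type, width, failed):
    """Return a hypothetical status label for a vdev with `failed` bad disks."""
    limit = tolerated_failures(vdev_type, width)
    if failed > limit:
        return "lost"        # more than N failures: data cannot be rebuilt
    if failed == 0:
        return "healthy"
    if failed == limit:
        return "at limit"    # one more failure would lose the vdev
    return "degraded"
```

For example, `classify("raidz2", 8, 2)` returns `"at limit"`, while `classify("mirror", 2, 2)` returns `"lost"`.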

In any case, a Service Request should be opened with Oracle Support, and the Oracle TSC engineer will check which disk should be replaced first.

The TSC engineer will check whether any disk is faulted, whether any failures are imminent, and whether any disk can be repaired. The engineer will also check whether any other failure is threatening the pool at the moment.

The Oracle TSC engineer will try to import the pool read-only, to make the data available for backup.

When N disks are faulted (but still in service) in a raidzN vdev, the vdev has reached the maximum number of disk failures it can tolerate. In this case, the following action plan should be applied.

To avoid a further failure that would make it impossible to calculate RAID parity, we should do the following:

1. Replace N-1 disks.

2. Wait for resilvering to finish after the N-1 disk replacements.

3. Replace the last faulted disk.
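The three steps above can be sketched as a small Python helper that turns an ordered list of faulted disks into an action plan. The function name `replacement_plan` and the step wording are hypothetical. The order of the disks is an input, since the TSC engineer decides which disk goes first, and the sketch only covers the case described here (exactly N faulted disks in a raidzN vdev).

```python
def replacement_plan(faulted_disks, parity):
    """Build the ordered list of actions for N faulted disks in a raidzN vdev.

    faulted_disks: disk names in the replacement order chosen by the engineer.
    parity: the parity level N of the vdev (1, 2 or 3).
    """
    if len(faulted_disks) != parity:
        raise ValueError("sketch only covers exactly N faulted disks in raidzN")

    first_batch, last_disk = faulted_disks[:-1], faulted_disks[-1]

    # Step 1: replace N-1 disks.
    steps = [f"replace {disk}" for disk in first_batch]

    # Step 2: wait for resilvering, but only if something was replaced
    # (for raidz1, N-1 is zero and there is nothing to wait for).
    if first_batch:
        steps.append("wait for resilvering to finish")

    # Step 3: replace the last faulted disk.
    steps.append(f"replace {last_disk}")
    return steps
```

For a raidz2 vdev with disks `d1` and `d2` faulted, `replacement_plan(["d1", "d2"], 2)` gives: replace `d1`, wait for resilvering to finish, replace `d2`.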

Situations like this are critical, and we should take care to avoid losing the vdev (that is, being unable to reconstruct the data from parity), as the pool could then become unrecoverable.

This is why a Service Request must be opened.

References

<NOTE:1366035.1> - Sun Storage 7000 Unified Storage System: Troubleshooting Disk Drive Failures
<NOTE:1380045.1> - Sun Storage 7000 Unified Storage System: Resilver did not start after replacing a failed disk
<NOTE:1197903.1> - Sun Storage 7000 Unified Storage System: Spare disk may not be freed up after the data disk is online

Attachments

This solution has no attachment