Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction

Sure Solution 1569203.1: How to Replace Multiple Failed Drives in a J4000 when the Disks are Resident in a Redundant ZFS Zpool
In this Document

Goal
Solution
Applies to:

Sun Storage J4400 Array - Version Not Applicable and later
Sun Storage J4200 Array - Version Not Applicable and later
Sun Storage J4500 Array - Version Not Applicable and later
Information in this document applies to any platform.

Goal

The J4000 family of arrays provides a large-scale storage solution. The drawback to the J4000 is that it is only a JBOD (Just a Bunch Of Disks). To introduce redundancy, a software RAID solution such as ZFS is usually applied to the disks. In this working example, a multi-disk failure in a J4500 and its associated zpool will be repaired.

Solution

Start with an optimal zpool. The 5 TB pool is made up of 12 disks and 2 spares. Redundancy is built in with ZFS RAID-Z2 (also known as RAID 6). The pool is online.
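For orientation, the layout implied by the zpool status output later in this document is two 6-disk RAID-Z2 top-level vdevs plus two hot spares. A pool of that shape could be created roughly as sketched below; this is illustrative only, and the device names are placeholders rather than the devices used in this example.

# Sketch: create a pool with two 6-disk RAID-Z2 vdevs and two hot spares.
# The c6tWWNxxd0 names are placeholders; substitute the real devices.
zpool create datapool \
    raidz2 c6tWWN01d0 c6tWWN02d0 c6tWWN03d0 c6tWWN04d0 c6tWWN05d0 c6tWWN06d0 \
    raidz2 c6tWWN07d0 c6tWWN08d0 c6tWWN09d0 c6tWWN10d0 c6tWWN11d0 c6tWWN12d0 \
    spare c6tWWN13d0 c6tWWN14d0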
# zpool list
NAME       SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
datapool  5.44T  1.17T  4.22T    34%  ONLINE  -

The J4500 incurs a hard fault. At this point in time, the cause is unknown. ZFS has failed 4 of the drives. RAID-Z2 redundancy has maintained access to the data. The pool has toggled to a DEGRADED state.
# zpool status
  pool: datapool
 state: DEGRADED

        NAME                         STATE     READ WRITE CKSUM
        datapool                     DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c6t5000CCA214C48EF0d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c6t5000CCA214C48993d0  UNAVAIL      0    24     0  cannot open
              c6t5000CCA214C458B3d0  ONLINE       0     0     0
            c6t5000CCA214C48252d0    ONLINE       0     0     0
            spare-3                  DEGRADED     0     0     0
              c6t5000CCA214C48071d0  UNAVAIL      0    26     0  cannot open
              c6t5000CCA214C49ECBd0  ONLINE       0     0     0
            c6t5000CCA214C39962d0    ONLINE       0     0     0
            c6t5000CCA214C39677d0    ONLINE       0     0     0
          raidz2-1                   DEGRADED     0     0     0
            c6t5000CCA214C38062d0    ONLINE       0     0     0
            c6t5000CCA214C3897Fd0    UNAVAIL      0 4.85K     0  cannot open
            c6t5000CCA214C489B5d0    ONLINE       0     0     0
            c6t5000CCA214C488EAd0    UNAVAIL      0 4.85K     0  corrupted data
            c6t5000CCA214C486FBd0    ONLINE       0     0     0
            c6t5000CCA214C482EDd0    ONLINE       0     0     0
        spares
          c6t5000CCA214C458B3d0      INUSE     currently in use
          c6t5000CCA214C49ECBd0      INUSE     currently in use
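When triaging a degraded pool, the host-side view can be narrowed down with standard Solaris commands. The following are generic checks rather than output captured from this case:

# Show only pools that are not healthy, with per-device error detail.
zpool status -xv

# List the resources that the Fault Manager currently considers faulted.
fmadm faulty

# Review fault manager events (for example, the ZFS-8000-D3 diagnosis).
fmdump -v

# Per-device error counters (soft/hard/transport) as seen by the OS.
iostat -En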
In order to triage, diagnose and repair this problem, two data collections must be performed.
As with most J4000 disk faults, the failure is only seen at the Operating System layer; the supportdata typically reports the J4000 as healthy. Some of the more common Solaris facilities reporting the problem are fmd and syslog, as in the messages logged on the host below:
Mar 14 15:57:59 t2000-bur09-f DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
Mar 14 15:57:59 t2000-bur09-f AUTO-RESPONSE: No automated response will occur.
Mar 14 15:57:59 t2000-bur09-f IMPACT: Fault tolerance of the pool may be compromised.
Mar 14 15:57:59 t2000-bur09-f REC-ACTION: Run 'zpool status -x' and replace the bad device.
Mar 14 15:58:02 t2000-bur09-f fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major

All of these utilities need to be used together to arrive at a FRU strategy to repair the problem. In this particular instance, it has been decided to replace 3 of the UNAVAIL disks and repair the 4th.

The first step in performing this work will be to identify the disk locations in the J4500. To do this, the dataStore.txt file, collected in the supportdata, will be required. From this file, each UNAVAIL disk in the zpool can be correlated to a specific slot in the J4500. Here are the 3 candidates for replacement found in the dataStore.txt file.

Host Path:t2000:/dev/rdsk/c6t5000CCA214C48993d0s2
Slot Number:25
Drive Zoned Out:False
Type:SATA
Device ID:HDD25

Host Path:t2000:/dev/rdsk/c6t5000CCA214C48071d0s2
Slot Number:11
Drive Zoned Out:False
Type:SATA
Device ID:HDD11

Host Path:t2000:/dev/rdsk/c6t5000CCA214C3897Fd0s2
Slot Number:30
Drive Zoned Out:False
Type:SATA
Device ID:HDD30

To further identify the physical location of the disks in the array, disable and locate them with the service utility. This should toggle the fault / location indicator LED of each drive.

# /opt/SUNWsefms/bin/service -d J4500 -c disable -t Disk.25
Executing the disable command on J4500
Completion Status: Success

# /opt/SUNWsefms/bin/service -d J4500 -c disable -t Disk.11
Executing the disable command on J4500
Completion Status: Success

# /opt/SUNWsefms/bin/service -d J4500 -c disable -t Disk.30
Executing the disable command on J4500
Completion Status: Success

# /opt/SUNWsefms/bin/service -d J4500 -c locate -t Disk.25
Executing the locate command on J4500
Completion Status: Success

# /opt/SUNWsefms/bin/service -d J4500 -c locate -t Disk.11
Executing the locate command on J4500
Completion Status: Success

# /opt/SUNWsefms/bin/service -d J4500 -c locate -t Disk.30
Executing the locate command on J4500
Completion Status: Success

Using the ZFS zpool command, offline the faulted mpt devices.

# zpool offline datapool c6t5000CCA214C48993d0
# zpool offline datapool c6t5000CCA214C48071d0
# zpool offline datapool c6t5000CCA214C48071d0

Note: <Bug 20365630> zfs sets vdev to FAULTED state without always closing, preventing replacement.
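Before the drives are physically pulled, a quick sanity check (not part of the original procedure) is to confirm that the intended devices now report OFFLINE in the pool:

# Verify the devices that were just offlined are shown as OFFLINE.
zpool status datapool | grep OFFLINE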
Using the Service Adviser utility in Common Array Manager, replace the disks. The device drivers may need to be reloaded to recognise the new drives.

# drvconfig -i mpt
# drvconfig -i scsi_vhci
# devfsadm -Cv
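A simple way (again, a generic check rather than a step from the original procedure) to confirm that the host now enumerates the replacement drives is to list every disk the OS can see and look for the new WWN-based device names:

# Non-interactively list all disks visible to the OS; the replacement
# drives should appear with their new c6t...d0 names.
format < /dev/null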
A new supportdata is collected. The disk positions now reflect the new device names for the new disk drives. This information is used to repair the zpool.

Host Path:t2000:/dev/rdsk/c6t5000CCA214C482FBd0s2
Slot Number:25
Drive Zoned Out:False
Type:SATA
Device ID:HDD25

Host Path:t2000:/dev/rdsk/c6t5000CCA214C489E1d0s2
Slot Number:11
Drive Zoned Out:False
Type:SATA
Device ID:HDD11

Host Path:t2000:/dev/rdsk/c6t5000CCA214C3B635d0s2
Slot Number:30
Drive Zoned Out:False
Type:SATA
Device ID:HDD30

Four distinct operations will need to be performed on the zpool to return it to full redundancy.

The first disk to repair is c6t5000CCA214C488EAd0. As previously mentioned, there appeared to be no hardware fault on this disk. A resilver attempt back onto the same disk will be performed.

# zpool clear datapool c6t5000CCA214C488EAd0
# zpool online datapool c6t5000CCA214C488EAd0
# zpool status
  pool: datapool
 state: DEGRADED
 scrub: resilver in progress for 0h0m, 48.59% done, 0h0m to go

        NAME                         STATE     READ WRITE CKSUM
        datapool                     DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c6t5000CCA214C48EF0d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c6t5000CCA214C48993d0  UNAVAIL      0    24     0  cannot open
              c6t5000CCA214C458B3d0  ONLINE       0     0     0
            c6t5000CCA214C48252d0    ONLINE       0     0     0
            spare-3                  DEGRADED     0     0     0
              c6t5000CCA214C48071d0  UNAVAIL      0    26     0  cannot open
              c6t5000CCA214C49ECBd0  ONLINE       0     0     0
            c6t5000CCA214C39962d0    ONLINE       0     0     0
            c6t5000CCA214C39677d0    ONLINE       0     0     0
          raidz2-1                   DEGRADED     0     0     0
            c6t5000CCA214C38062d0    ONLINE       0     0     0
            c6t5000CCA214C3897Fd0    UNAVAIL      0 4.85K     0  cannot open
            c6t5000CCA214C489B5d0    ONLINE       0     0     0
            c6t5000CCA214C488EAd0    ONLINE       0     0     0  571M resilvered
            c6t5000CCA214C486FBd0    ONLINE       0     0     0
            c6t5000CCA214C482EDd0    ONLINE       0     0     0
        spares
          c6t5000CCA214C458B3d0      INUSE     currently in use
          c6t5000CCA214C49ECBd0      INUSE     currently in use
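While a resilver is running, its progress can be followed from the host. The loop below is a convenience sketch, not part of the original procedure; the 60-second interval is arbitrary:

# Print the resilver progress line every 60 seconds until it completes.
while zpool status datapool | grep 'resilver in progress' > /dev/null
do
        zpool status datapool | grep 'scrub:'
        sleep 60
done
# Show the final completion line.
zpool status datapool | grep 'scrub:'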
The next disk to repair is c6t5000CCA214C3897Fd0. From the supportdata, it is determined that this drive was replaced with c6t5000CCA214C3B635d0. Simply replace the UNAVAIL disk with the new drive inserted into slot 30.

# zpool replace datapool c6t5000CCA214C3897Fd0 c6t5000CCA214C3B635d0
# zpool status
  pool: datapool
 state: DEGRADED
 scrub: resilver completed after 0h0m with 0 errors on Wed Mar 15 06:20:00 2000

        NAME                         STATE     READ WRITE CKSUM
        datapool                     DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c6t5000CCA214C48EF0d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c6t5000CCA214C48993d0  UNAVAIL      0    24     0  cannot open
              c6t5000CCA214C458B3d0  ONLINE       0     0     0
            c6t5000CCA214C48252d0    ONLINE       0     0     0
            spare-3                  DEGRADED     0     0     0
              c6t5000CCA214C48071d0  UNAVAIL      0    26     0  cannot open
              c6t5000CCA214C49ECBd0  ONLINE       0     0     0
            c6t5000CCA214C39962d0    ONLINE       0     0     0
            c6t5000CCA214C39677d0    ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c6t5000CCA214C38062d0    ONLINE       0     0     0
            c6t5000CCA214C3B635d0    ONLINE       0     0     0  1.13G resilvered
            c6t5000CCA214C489B5d0    ONLINE       0     0     0
            c6t5000CCA214C488EAd0    ONLINE       0     0     0
            c6t5000CCA214C486FBd0    ONLINE       0     0     0
            c6t5000CCA214C482EDd0    ONLINE       0     0     0
        spares
          c6t5000CCA214C458B3d0      INUSE     currently in use
          c6t5000CCA214C49ECBd0      INUSE     currently in use

Note: Here is a snippet from zpool status during the resilver.

# zpool status
  pool: datapool
 state: DEGRADED
.............
 scrub: resilver in progress for 0h0m, 45.57% done, 0h0m to go
.............
            replacing-1              DEGRADED     0     0     0
              c6t5000CCA214C3897Fd0  UNAVAIL      0 4.85K     0  cannot open
              c6t5000CCA214C3B635d0  ONLINE       0     0     0  538M resilvered
.............

Next, copy back all the data from the spare disk, c6t5000CCA214C49ECBd0, to the new disk, c6t5000CCA214C489E1d0, and get rid of the UNAVAIL disk c6t5000CCA214C48071d0. The spare will also be returned to the hot spare pool.

# zpool replace datapool c6t5000CCA214C48071d0 c6t5000CCA214C489E1d0
# zpool status
  pool: datapool
 state: DEGRADED
 scrub: resilver completed after 0h0m with 0 errors on Wed Mar 15 06:31:45 2000

        NAME                         STATE     READ WRITE CKSUM
        datapool                     DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c6t5000CCA214C48EF0d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c6t5000CCA214C48993d0  UNAVAIL      0    24     0  cannot open
              c6t5000CCA214C458B3d0  ONLINE       0     0     0
            c6t5000CCA214C48252d0    ONLINE       0     0     0
            c6t5000CCA214C489E1d0    ONLINE       0     0     0  1.02G resilvered
            c6t5000CCA214C39962d0    ONLINE       0     0     0
            c6t5000CCA214C39677d0    ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c6t5000CCA214C38062d0    ONLINE       0     0     0
            c6t5000CCA214C3B635d0    ONLINE       0     0     0
            c6t5000CCA214C489B5d0    ONLINE       0     0     0
            c6t5000CCA214C488EAd0    ONLINE       0     0     0
            c6t5000CCA214C486FBd0    ONLINE       0     0     0
            c6t5000CCA214C482EDd0    ONLINE       0     0     0
        spares
          c6t5000CCA214C458B3d0      INUSE     currently in use
          c6t5000CCA214C49ECBd0      AVAIL

The last repair takes a different approach. No resilver will be done. Simply remove the UNAVAIL disk, c6t5000CCA214C48993d0, from the pool. As soon as this is done, spare disk c6t5000CCA214C458B3d0 becomes a permanent member of datapool. Then add the new disk c6t5000CCA214C482FBd0 in as a new spare. This approach avoids any resilver and does not impact redundancy or availability, as all the disks reside within the same enclosure.

# zpool detach datapool c6t5000CCA214C48993d0
# zpool add datapool spare c6t5000CCA214C482FBd0
# zpool status
  pool: datapool
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Wed Mar 15 06:31:45 2000

        NAME                       STATE     READ WRITE CKSUM
        datapool                   ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c6t5000CCA214C48EF0d0  ONLINE       0     0     0
            c6t5000CCA214C458B3d0  ONLINE       0     0     0
            c6t5000CCA214C48252d0  ONLINE       0     0     0
            c6t5000CCA214C489E1d0  ONLINE       0     0     0  1.02G resilvered
            c6t5000CCA214C39962d0  ONLINE       0     0     0
            c6t5000CCA214C39677d0  ONLINE       0     0     0
          raidz2-1                 ONLINE       0     0     0
            c6t5000CCA214C38062d0  ONLINE       0     0     0
            c6t5000CCA214C3B635d0  ONLINE       0     0     0
            c6t5000CCA214C489B5d0  ONLINE       0     0     0
            c6t5000CCA214C488EAd0  ONLINE       0     0     0
            c6t5000CCA214C486FBd0  ONLINE       0     0     0
            c6t5000CCA214C482EDd0  ONLINE       0     0     0
        spares
          c6t5000CCA214C49ECBd0    AVAIL
          c6t5000CCA214C482FBd0    AVAIL
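With the pool back to ONLINE, it is reasonable (though not required by this procedure) to scrub the pool so that every block is read back and verified against its checksum, confirming that the repairs left no latent damage:

# Read and verify every block in the pool, then check for any reported problems.
zpool scrub datapool
zpool status -x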
Attachments

This solution has no attachment