Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1004598.1 : Sun Fire[TM] 12K/15K/E20K/E25K: Recovering from a System Controller disk failure
Previously Published As 206377

Applies to:
Sun Fire E25K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 12K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 15K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E20K Server - Version Not Applicable to Not Applicable [Release N/A]
All Platforms

Symptoms
OS version: Solaris[TM] 8 10/01 or later

Changes
Cause
One (or both) of the internal disks of the Platform System Controller (SC, where the SMS services run) has faulted and needs to be replaced.

Solution

Scenario #1: Loss of 1 of 2 disks on SC

% showfailover -v
SC Failover Status: ACTIVE
Clock Phase Locked: .....................................Yes
HASRAM Status (by location):
HASRAM (CSB at CS1): ....................................Good
HASRAM (CSB at CS0): ....................................Good
Status of sf15k-sc1:
Role: ....................................MAIN
Status of sf15k-sc0:
Role: ...................................SPARE
d10: Mirror
Submirror 0: d11
State: Okay
Submirror 1: d12
State: Needs maintenance
…
d20: Mirror
Submirror 0: d21
State: Okay
Submirror 1: d22
State: Needs maintenance
…
d30: Mirror
Submirror 0: d31
State: Okay
Submirror 1: d32
State: Needs maintenance

3. Use the 'metadb' command to identify unavailable or unreadable state database replicas:

flags        first blk   block count
a m p luo    16          1034          /dev/dsk/c0t2d0s4
a   p luo    1050        1034          /dev/dsk/c0t2d0s4
a   p luo    2084        1034          /dev/dsk/c0t2d0s4
a   p luo    16          1034          /dev/dsk/c0t2d0s5
a   p luo    1050        1034          /dev/dsk/c0t2d0s5
a   p luo    2084        1034          /dev/dsk/c0t2d0s5
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s4
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s4
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s4
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s5
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s5
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s5

4. Use the 'metadb' command to delete the state database replicas on the bad disk:

# metadb -d -f c0t3d0s4
# metadb -d -f c0t3d0s5

Depending on how the disk has failed, this step may not succeed. If this is the case, the state database replicas will be deleted during the reboot (step 9).

5. Because the SC has to be shut down to replace the defective disk, we need to ensure that the SC will boot using the correct OBP alias. To keep the SC at the ok prompt after shutdown instead of rebooting automatically, set "auto-boot?" to false using the 'eeprom' command as superuser on the SC:

# eeprom 'auto-boot?=false'
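The check in step 3 can be partly scripted. The following is a minimal sketch, not part of the original procedure: the function name failed_replica_slices is hypothetical, and it assumes that a replica in error shows an uppercase flag (such as M) as the first whitespace-separated field of its 'metadb' line, with the slice path as the last field.

```shell
#!/bin/sh
# Sketch: list the slices whose state database replicas show an error flag.
# Assumption: in `metadb` output, a replica with an error has a single
# uppercase letter (for example M) as its first whitespace-separated field,
# and the slice path is the last field on the line.
failed_replica_slices() {
    LC_ALL=C awk '$NF ~ /^\/dev\/dsk\// && $1 ~ /^[A-Z]$/ { print $NF }' | sort -u
}

# Typical use on the SC (as superuser):
#   metadb | failed_replica_slices
# Each slice printed is a candidate for "metadb -d -f <slice>" in step 4.
```

Reading from standard input keeps the filter usable on saved output as well as on a live system.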
% setfailover force
% setfailover off

On the SPARE SC:

# init 0
To power off the SCPER, you must run poweroff SC#.
8. Boot using the correct OBP alias:

ok devalias
disk2 /pci@1f,0/pci@1,1/scsi@2/sd@2,0:a
disk3 /pci@1f,0/pci@1,1/scsi@2/sd@3,0:a

If the faulty disk was c0t2d0 (disk2), boot from disk3.
If the faulty disk was c0t3d0 (disk3), boot from disk2.

ok boot disk2
or
ok boot disk3

9. If step 4 above failed (the 'metadb -d -f' command was unsuccessful due to the nature of the disk failure), or if a reboot occurs before the disk is replaced, the current boot will fail and stop in single-user mode, because a majority (more than half) of the state database replicas must be readable. If this is the case, in single-user mode, use the 'metadb' command to delete the state database replicas (ignore any "Read-only file system" error messages), then proceed with normal startup:

# metadb -d -f c0t3d0s4
# metadb -d -f c0t3d0s5
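The replica arithmetic behind step 9 can be worked through as follows. This is a minimal sketch: the function name replica_majority_ok is hypothetical, and the rule it encodes (half of the replicas plus one must be readable for a normal boot) is the assumption stated in the comments.

```shell
#!/bin/sh
# Sketch: does the replica count allow a normal multi-user boot?
# Rule assumed here: a majority (half plus one) of all state database
# replicas must be readable.
#   $1 = total replicas, $2 = readable replicas
replica_majority_ok() {
    total=$1
    readable=$2
    needed=$(( total / 2 + 1 ))
    [ "$readable" -ge "$needed" ]
}

# The layout in this document: 12 replicas, 6 of them on the failed disk.
if replica_majority_ok 12 6; then
    echo "majority present: normal boot expected"
else
    echo "no majority: boot stops in single-user mode"
fi
```

With six replicas on each disk, losing one disk leaves exactly half readable, which is short of a majority. Deleting the six stale replicas leaves six of six readable, so the next boot can proceed.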
# metadb -a -c3 -f c0t3d0s4
# metadb -a -c3 -f c0t3d0s5

This configuration can be checked using 'metadb -i'.

# metareplace -e d10 c0t3d0s0
# metareplace -e d20 c0t3d0s1
# metareplace -e d30 c0t3d0s7

This operation takes about 20 minutes per gigabyte of file system. The configuration can be checked using 'metastat'.

# eeprom 'auto-boot?=true'

14. Enable failover using the 'setfailover' command as user sms-svc on the MAIN SC:

% setfailover on

15. Synchronize data from the MAIN SC to the SPARE SC using the 'setdatasync' command as user sms-svc on the MAIN SC:

% setdatasync backup
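For planning purposes, the resync estimate given for the 'metareplace' step can be turned into rough figures. This is a sketch only: the function name resync_minutes is hypothetical, the 20 minutes per gigabyte figure is the rule of thumb quoted above, and the slice sizes are taken from the partition tables printed later in this document.

```shell
#!/bin/sh
# Sketch: estimated resync time at the document's rule of thumb of
# about 20 minutes per gigabyte, rounded to whole minutes.
#   $1 = slice size in GB (may be fractional)
resync_minutes() {
    LC_ALL=C awk -v gb="$1" 'BEGIN { printf "%d\n", gb * 20 + 0.5 }'
}

# Slice sizes from the partition tables in this document
# (s0 = 8.00GB, s1 = 2.00GB, s7 = 6.84GB).
for size in 8.00 2.00 6.84; do
    echo "${size} GB: about $(resync_minutes "$size") minutes"
done
```

Actual resync times depend on the disks and on system load, so these numbers are only a guide to how long the mirrors may stay unprotected.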
% setfailover force
1. If disk failures have occurred, replace the defective disks in the SCPER board (see Sun Fire 15K System Service Manual 806-3512-xx). If the disks have not failed but have been corrupted, continue with step 2 below.
ok boot cdrom -s
# format
# newfs /dev/rdsk/c0t2d0s0
# newfs /dev/rdsk/c0t2d0s7
# mount /dev/dsk/c0t2d0s0 /a
# installboot /usr/platform/sun4u/lib/fs/ufs/bootblk /dev/rdsk/c0t2d0s0
* Begin MDD root info (do not edit)
forceload: misc/md_trans
forceload: misc/md_raid
forceload: misc/md_hotspares
forceload: misc/md_sp
forceload: misc/md_stripe
forceload: misc/md_mirror
forceload: drv/pcipsy
forceload: drv/simba
forceload: drv/glm
forceload: drv/sd
rootdev:/pseudo/md@0:0,10,blk
* End MDD root info (do not edit)
* Begin MDD database info (do not edit)
set md:mddb_bootlist1="sd:20:16 sd:20:1050 sd:20:2084 sd:21:16 sd:21:1050"
set md:mddb_bootlist2="sd:21:2084 sd:28:16 sd:28:1050 sd:28:2084 sd:29:16"
set md:mddb_bootlist3="sd:29:1050 sd:29:2084"
* End MDD database info (do not edit)
#device            device              mount            FS     fsck  mount    mount
#to mount          to fsck             point            type   pass  at boot  options
fd                 -                   /dev/fd          fd     -     no       -
/proc              -                   /proc            proc   -     no       -
/dev/md/dsk/d20    -                   -                swap   -     no       -
/dev/md/dsk/d10    /dev/md/rdsk/d10    /                ufs    1     no       logging
/dev/md/dsk/d30    /dev/md/rdsk/d30    /export/install  ufs    2     yes      logging
swap               -                   /tmp             tmpfs  -     yes      -

After

#device            device              mount            FS     fsck  mount    mount
#to mount          to fsck             point            type   pass  at boot  options
fd                 -                   /dev/fd          fd     -     no       -
/proc              -                   /proc            proc   -     no       -
/dev/dsk/c0t2d0s1  -                   -                swap   -     no       -
/dev/dsk/c0t2d0s0  /dev/rdsk/c0t2d0s0  /                ufs    1     no       logging
#/dev/md/dsk/d30   /dev/md/rdsk/d30    /export/install  ufs    2     yes      logging
swap               -                   /tmp             tmpfs  -     yes      -
ok boot disk2

At this time, this SC is defined as the SPARE SC.

ok devalias
disk2 /pci@1f,0/pci@1,1/scsi@2/sd@2,0:a
disk3 /pci@1f,0/pci@1,1/scsi@2/sd@3,0:a
# metadb -a -c3 -f c0t2d0s4
# metadb -a -c3 -f c0t2d0s5
# metadb -a -c3 -f c0t3d0s4
# metadb -a -c3 -f c0t3d0s5

10. Modify /etc/lvm/md.tab so that all mirrors are one-way mirrors and each one-way mirror refers to the restored side:

Before

d10 -m d11 d12
d11 1 1 /dev/dsk/c0t2d0s0
d12 1 1 /dev/dsk/c0t3d0s0
d20 -m d21 d22
d21 1 1 /dev/dsk/c0t2d0s1
d22 1 1 /dev/dsk/c0t3d0s1
d30 -m d31 d32
d31 1 1 /dev/dsk/c0t2d0s7
d32 1 1 /dev/dsk/c0t3d0s7

After

d10 -m d11
d11 1 1 /dev/dsk/c0t2d0s0
d12 1 1 /dev/dsk/c0t3d0s0
d20 -m d21
d21 1 1 /dev/dsk/c0t2d0s1
d22 1 1 /dev/dsk/c0t3d0s1
d30 -m d31
d31 1 1 /dev/dsk/c0t2d0s7
d32 1 1 /dev/dsk/c0t3d0s7

11. Create the metadevices:

# metainit -f -a

12. Set the metadevice as the root device:

# metaroot d10
#device            device              mount            FS     fsck  mount    mount
#to mount          to fsck             point            type   pass  at boot  options
fd                 -                   /dev/fd          fd     -     no       -
/proc              -                   /proc            proc   -     no       -
/dev/md/dsk/d20    -                   -                swap   -     no       -
/dev/md/dsk/d10    /dev/md/rdsk/d10    /                ufs    1     no       logging
/dev/md/dsk/d30    /dev/md/rdsk/d30    /export/install  ufs    2     yes      logging
swap               -                   /tmp             tmpfs  -     yes      -
# metattach d10 d12
# metattach d20 d22
# metattach d30 d32
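After the second submirrors are attached, the mirrors resynchronize. A minimal sketch for checking that every mirror and submirror has returned to the Okay state is shown here. The function name all_states_okay is hypothetical, and it assumes that 'metastat' reports each state on a line containing "State:", as in the metastat excerpts in this document.

```shell
#!/bin/sh
# Sketch: succeed only when every "State:" line in `metastat` output
# reads "Okay". Assumption: each mirror and submirror reports its state
# on its own line containing "State:" (for example
# "State: Needs maintenance"). Input with no "State:" line at all is
# treated as a failure.
all_states_okay() {
    awk '/State:/ { seen = 1; if ($0 !~ /State:[ \t]*Okay[ \t]*$/) bad = 1 }
         END { exit (seen && !bad) ? 0 : 1 }'
}

# Typical use on the SC (as superuser):
#   metastat | all_states_okay && echo "all mirrors in sync"
```

Running the check before re-enabling failover avoids switching roles while a resync is still in progress.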
% setfailover on
% setdatasync backup

Note: Scripts are available on the EIS-CD to set up the SC disks:

/sun/tools/SF15K/SF15k-sc-bootdisks-start.sh
/sun/tools/SF15K/SF15k-sc-bootdisks-finish.sh

After running the scripts:

# df -k
Filesystem         kbytes     used    avail  capacity  Mounted on
/dev/md/dsk/d10   8261393  1948634  6230146     24%    /
/proc                   0        0        0      0%    /proc
fd                      0        0        0      0%    /dev/fd
mnttab                  0        0        0      0%    /etc/mnttab
swap              2185528        8  2185520      1%    /var/run
swap              2187656     2136  2185520      1%    /tmp
/dev/md/dsk/d30   7061557  1370547  5620395     20%    /export/install

# metastat -p
d10 -m d11 d12 1
d11 1 1 c0t2d0s0
d12 1 1 c0t3d0s0
d20 -m d21 d22 1
d21 1 1 c0t2d0s1
d22 1 1 c0t3d0s1
d30 -m d31 d32 1
d31 1 1 c0t2d0s7
d32 1 1 c0t3d0s7

# metadb -i
flags        first blk   block count
a m p luo    16          1034          /dev/dsk/c0t2d0s4
a   p luo    1050        1034          /dev/dsk/c0t2d0s4
a   p luo    2084        1034          /dev/dsk/c0t2d0s4
a   p luo    16          1034          /dev/dsk/c0t2d0s5
a   p luo    1050        1034          /dev/dsk/c0t2d0s5
a   p luo    2084        1034          /dev/dsk/c0t2d0s5
a   p luo    16          1034          /dev/dsk/c0t3d0s4
a   p luo    1050        1034          /dev/dsk/c0t3d0s4
a   p luo    2084        1034          /dev/dsk/c0t3d0s4
a   p luo    16          1034          /dev/dsk/c0t3d0s5
a   p luo    1050        1034          /dev/dsk/c0t3d0s5
a   p luo    2084        1034          /dev/dsk/c0t3d0s5

# format / partition / print

c0t2d0
Part  Tag         Flag  Cylinders    Size     Blocks
0     root        wm    0 - 3560     8.00GB   (3561/0/0) 16779432
1     swap        wu    3561 - 4451  2.00GB   (891/0/0)   4198392
2     backup      wm    0 - 7505     16.86GB  (7506/0/0) 35368272
3     unassigned  wm    0            0        (0/0/0)           0
4     unassigned  wm    4452 - 4456  11.50MB  (5/0/0)       23560
5     unassigned  wm    4457 - 4461  11.50MB  (5/0/0)       23560
6     unassigned  wm    0            0        (0/0/0)           0
7     unassigned  wm    4462 - 7505  6.84GB   (3044/0/0) 14343328

c0t3d0
Part  Tag         Flag  Cylinders    Size     Blocks
0     root        wm    0 - 3560     8.00GB   (3561/0/0) 16779432
1     swap        wu    3561 - 4451  2.00GB   (891/0/0)   4198392
2     backup      wm    0 - 7505     16.86GB  (7506/0/0) 35368272
3     unassigned  wu    0            0        (0/0/0)           0
4     unassigned  wm    4452 - 4456  11.50MB  (5/0/0)       23560
5     unassigned  wm    4457 - 4461  11.50MB  (5/0/0)       23560
6     unassigned  wu    0            0        (0/0/0)           0
7     unassigned  wm    4462 - 7505  6.84GB   (3044/0/0) 14343328

ok printenv boot-device
boot-device=disk2 disk3

ok devalias
disk2 /pci@1f,0/pci@1,1/scsi@2/sd@2,0:a
disk3 /pci@1f,0/pci@1,1/scsi@2/sd@3,0:a

Scenario #3: Errors on drive but no failure of disk on SC

1. Because the SC has to be shut down to replace the defective disk, we need to ensure that this SC is the SPARE before shutting it down. As user sms-svc, use the 'showfailover' command to determine the status of the SCs:

% showfailover -v
SC Failover Status: ACTIVE
Clock Phase Locked: .....................................Yes
HASRAM Status (by location):
HASRAM (CSB at CS1): ....................................Good
HASRAM (CSB at CS0): ....................................Good
Status of sf15k-sc1:
Role: ....................................MAIN
Status of sf15k-sc0:
Role: ...................................SPARE
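Before shutting an SC down, it helps to confirm which SC holds which role. The following is a minimal sketch for reading that from the 'showfailover -v' output. The function name sc_with_role is hypothetical, and it assumes the layout shown above: a "Status of <name>:" line followed by a "Role:" line whose last field ends in MAIN or SPARE.

```shell
#!/bin/sh
# Sketch: print the name of the SC that currently holds the given role
# (MAIN or SPARE). Assumption: `showfailover -v` prints a
# "Status of <name>:" line followed by a "Role: ....<ROLE>" line, as in
# the output shown in this document.
#   $1 = role to look for
sc_with_role() {
    awk -v role="$1" '
        /^[ \t]*Status of / { name = $3; sub(/:$/, "", name) }
        /^[ \t]*Role:/      { r = $NF; gsub(/^\.+/, "", r); if (r == role) print name }
    '
}

# Typical use on an SC (as user sms-svc):
#   showfailover -v | sc_with_role MAIN
```

If the SC with the failing disk is reported as MAIN, the roles have to be switched with 'setfailover force' before that SC is shut down.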
d10: Mirror
Submirror 0: d11
State: Okay
Submirror 1: d12
State: Needs maintenance
...
d20: Mirror
Submirror 0: d21
State: Okay
Submirror 1: d22
State: Needs maintenance
...
d30: Mirror
Submirror 0: d31
State: Okay
Submirror 1: d32
State: Needs maintenance
...

3. Use the 'metadb' command to identify unavailable or unreadable state database replicas:

flags        first blk   block count
a m p luo    16          1034          /dev/dsk/c0t2d0s4
a   p luo    1050        1034          /dev/dsk/c0t2d0s4
a   p luo    2084        1034          /dev/dsk/c0t2d0s4
a   p luo    16          1034          /dev/dsk/c0t2d0s5
a   p luo    1050        1034          /dev/dsk/c0t2d0s5
a   p luo    2084        1034          /dev/dsk/c0t2d0s5
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s4
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s4
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s4
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s5
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s5
M   p        Unknown     Unknown       /dev/dsk/c0t3d0s5

# prtvtoc /dev/rdsk/c0t3d0s2 > /var/tmp/prtvtoc.orig
# metadb -d -f c0t3d0s4
# metadb -d -f c0t3d0s5
# metadetach d10 d12
# metadetach d20 d22
# metadetach d30 d32
# eeprom 'auto-boot?=false'
% setfailover force

6. This action forces the former MAIN SC to reset and reboot as SPARE, and transfers the role of MAIN SC to the other SC. If the disk failure is on the SPARE SC, disable failover on the MAIN SC using the 'setfailover' command as user sms-svc. The SPARE SC can then be shut down:

On the MAIN SC:

% setfailover off

On the SPARE SC:

# init 0
To power off the SCPER, you must run poweroff SC#.

ok devalias
disk2 /pci@1f,0/pci@1,1/scsi@2/sd@2,0:a
disk3 /pci@1f,0/pci@1,1/scsi@2/sd@3,0:a

If the faulty disk was c0t2d0 (disk2), boot from disk3.
If the faulty disk was c0t3d0 (disk3), boot from disk2.

ok boot disk2
or
ok boot disk3
# metadb -d -f c0t3d0s4
# metadb -d -f c0t3d0s5

then proceed with normal startup.

# fmthard -s /var/tmp/prtvtoc.orig /dev/rdsk/c0t3d0s2
# metadb -a -c3 -f c0t3d0s4
# metadb -a -c3 -f c0t3d0s5

This configuration can be checked using 'metadb -i'.

# eeprom 'auto-boot?=true'
% setfailover on
% setdatasync backup