Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure

Solution 2314065.1: Troubleshooting Disk Failures on SPARC T3-x, T4-x and T5-x Servers When an Explorer Cannot be Provided

Applies to:
SPARC T5-4
SPARC T5-8
SPARC T3-2
SPARC T3-4
SPARC T3-1
Oracle Solaris on SPARC (32-bit)
Oracle Solaris on SPARC (64-bit)

Purpose
This document provides guidance on which data to collect and how to analyze disk failures when an Explorer cannot be provided.

Troubleshooting Steps
Gather and analyze the output of the commands listed below. An Explorer gathers all of this data, but it is important to know which commands reveal disk information, what to check for, and how to interpret each output.

I. If you are not able to reach Solaris, provide the output of the following commands from OBP (ok> prompt):

ok> printenv
Determine the boot device (internal or external). If it is external and an Oracle storage array, open an SR with the Storage group.

ok> setenv auto-boot? false
Set the PROM auto-boot? parameter to false.

ok> reset-all
Reset the system before executing the probe-scsi-all command (to prevent the system from hanging).

ok> probe-scsi-all
The probe-scsi-all command transmits an inquiry command to all SCSI devices connected to the system SCSI host adapters, including any host adapters installed in PCI slots.

ok> show-disks
Lists the disks and offers options for adding one to a device alias.

II. If you are able to reach Solaris, provide the output of the following commands:

1. # prtdiag -v
Check the LED status and the environmental status.

2. # cfgadm -alv
This command displays information about a system's SCSI controllers and their attached devices.

3. # format
This command displays a list of recognized disks under AVAILABLE DISK SELECTIONS. When a disk has failed, it sometimes no longer appears in the format output.

4. # uptime
This command shows how long the system has been up.

5. # iostat -En
Search this output for hardware errors, but correlate what you see with the uptime: the iostat counters are reset only at system reboot. If a failed disk was replaced previously and the system has not been rebooted since, the old errors are still reported by iostat against the new disk. If you cannot reboot the server, check Document 1012731.1 Reset the iostat -E hard/soft/tran errors counters without rebooting.
# iostat -En
    Disk                   Size      Soft  Hard  Trans  Media  Ready  NoDev  Recov  Illeg  PFlAn
    c1t0d0                 0.00GB       0     8      0      0      0      8      0     10      0  AMI Virtual CDROM
    c0t5000CCA0123D00E8d0  600.13GB  2369    24      2     23      0      1   2369      0      0  HITACHI H106060SDSUN600G
    c0t5000CCA012569F20d0  600.13GB     0     0      0      0      0      0      0      0      0  HITACHI H106060SDSUN600G
    c0t5000CCA0123AB328d0  600.13GB     0     0      0      0      0      0      0      0      0  HITACHI H106060SDSUN600G
    c0t5000CCA01244E0E0d0  600.13GB     0     0      0      0      0      0      0      0      0  HITACHI H106060SDSUN600G
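
The check described in step 5 can be sketched as a small shell filter. This is a minimal sketch and not part of the original procedure: it assumes the condensed one-row-per-disk layout shown above (raw iostat -En output is laid out differently on many Solaris releases, so the field numbers would need adjusting), and flag_disk_errors is a hypothetical helper name. On Solaris 10, nawk or /usr/xpg4/bin/awk may be needed in place of awk.

```shell
# Hypothetical helper: reads the one-row-per-disk error summary on stdin and
# prints every disk whose Hard, Trans, or Media counter is non-zero.
# Field positions are taken from the sample header above:
#   1=Disk 2=Size 3=Soft 4=Hard 5=Trans 6=Media ... 11=PFlAn, 12..NF=model
flag_disk_errors() {
    awk '
        $1 == "Disk" { next }                       # skip the header row
        NF >= 11 && ($4 + $5 + $6) > 0 {
            model = ""
            for (i = 12; i <= NF; i++) model = model (i > 12 ? " " : "") $i
            printf "%s hard=%d trans=%d media=%d model=%s\n", $1, $4, $5, $6, model
        }
    '
}
```

Fed the sample table above, the filter would report c1t0d0 and c0t5000CCA0123D00E8d0. The first is the AMI Virtual CDROM, so each flagged row still has to be judged against the device model and the system uptime.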

6. # metastat -t
Use this command for disks configured under Solaris Volume Manager. Check whether any submirrors are in the "Needs maintenance" state. Also check Document 1002410.1 How to recover mirrored metadevice with both submirrors in a "Needs Maintenance" state caused by a bad disk block.
# metastat -t
        Submirror 1: d2
          State: Needs maintenance Sat Jun 25 00:22:43 2011
    d2: Submirror of d0
        State: Needs maintenance Sat Jun 25 00:22:43 2011
        Invoke: metareplace d0 /dev/dsk/c0t5000C500332C7E73d0s0 <new device>
        Size: 50337500 blocks (24 GB)
        Stripe 0:
            Device                            Start  Dbase  State        Reloc  Hot Spare  Time
            /dev/dsk/c0t5000C500332C7E73d0s0  0      No     Maintenance  Yes               Sat Jun 25 00:22:43 2011
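
When the metastat output is long, the "Invoke:" lines are a quick way to find the failing component. The following is a minimal sketch under that assumption; svm_failed_components is a hypothetical helper name, and it relies only on the "Invoke: metareplace ..." line format visible in the sample above.

```shell
# Hypothetical helper: reads `metastat -t` output on stdin and prints the
# mirror and component named on each "Invoke: metareplace ..." line.
svm_failed_components() {
    awk '
        $1 == "Invoke:" && $2 == "metareplace" {
            i = 3
            while (i <= NF && $i ~ /^-/) i++        # skip any option flags
            printf "mirror=%s component=%s\n", $i, $(i + 1)
        }
    '
}
```

A typical use would be `metastat -t | svm_failed_components`. An empty result only means no metareplace hint was printed, so the State lines should still be read.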

7. # zpool status
Use this command for disks configured under ZFS. Check whether any pool is in the DEGRADED state.
# zpool status
      pool: rpool
     state: ONLINE
      scan: none requested
    config:

            NAME                     STATE     READ WRITE CKSUM
            rpool                    ONLINE       0     0     0
              c0t5000CCA025244F18d0  ONLINE       0     0     0

    errors: No known data errors
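
The pool check in step 7 can be sketched as a filter over the zpool status output. This is a minimal sketch, not a definitive health check: zpool_unhealthy is a hypothetical helper name, and the list of non-healthy state keywords is an assumption based on the states zpool commonly reports. The built-in `zpool status -x`, which limits the report to pools with problems, may be enough on its own.

```shell
# Hypothetical helper: reads `zpool status` output on stdin and prints every
# pool that is not ONLINE, plus each vdev row in a non-healthy state.
zpool_unhealthy() {
    awk '
        $1 == "pool:"  { pool = $2 }
        $1 == "state:" && $2 != "ONLINE" { print "pool " pool " is " $2 }
        # config rows carry at least five columns: NAME STATE READ WRITE CKSUM
        NF >= 5 && $1 != pool && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            print "vdev " $1 " in " pool " is " $2
        }
    '
}
```

For the healthy rpool shown above the filter prints nothing, which is the expected result.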

8. Get the output from dmesg and messages
# dmesg
# more /var/adm/messages
Look for reported error blocks and for messages stating that the disk has been taken offline.

Important! If you cannot find any evidence of the fault in /var/adm/messages, you probably do not need to replace the disk. One exception is a disk that failed so long ago that the messages files have rolled over and no longer cover that time period.

If, after completing steps 1-8, you see only error blocks on a particular disk, run format > analyze > read to repair the defective blocks on that disk.

Example:
    Aug 24 22:39:56 : [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca0123d00e8 (sd5):
    Aug 24 22:39:56 Error for Command: read(10) Error Level: Fatal
    Aug 24 22:39:56 : [ID 107833 kern.notice] Requested Block: 1118816592 Error Block: 1118816592
    Aug 24 22:39:56 : [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 111555BBB
    Aug 24 22:39:56 : [ID 107833 kern.notice] Sense Key: Media_Error
    Aug 24 22:39:56 : [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x2d
    Aug 24 22:39:57 : [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca0123d00e8 (sd5):
    Aug 24 22:39:57 Error for Command: read(10) Error Level: Retryable
    Aug 24 22:39:57 : [ID 107833 kern.notice] Requested Block: 1119053824 Error Block: 1119053825
    Aug 24 22:39:57 : [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 111555BBB
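
To see at a glance which disk instance accumulates errors, the messages file can be summarized per device and error level. The sketch below is hypothetical (summarize_scsi_errors is not an Oracle tool) and assumes the message layout of the example above: a WARNING line ending in the sd or ssd instance, followed by a line carrying "Error Level:".

```shell
# Hypothetical helper: reads /var/adm/messages-style text on stdin and counts
# SCSI errors per driver instance (for example sd5) and error level.
summarize_scsi_errors() {
    awk '
        /WARNING: .*\(s?sd[0-9]+\):/ {
            dev = $NF                     # last field, e.g. "(sd5):"
            gsub(/[():]/, "", dev)        # strip the punctuation -> "sd5"
        }
        /Error Level:/ && dev != "" { count[dev " " $NF]++ }
        END { for (k in count) print k, count[k] }
    ' | sort
}
```

For the example messages above, `summarize_scsi_errors < /var/adm/messages` would report one Fatal and one Retryable error for sd5.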

9. If internal LSI hardware RAID volumes are configured using one of the following three utilities (sas2ircu, raidconfig, or FCode), check Document 1483874.1 - How to Identify a Failed Disk when Internal LSI Hardware RAID Volumes are configured (T3-x and T4-x).
If the internal disks are configured using a RAID HBA, indicated by "LSI,mrsas" in the path to the disks, check Document 1471121.1 Some or all internal disks in SPARC T3 / T4 Servers may not show up in "# format" or "ok> probe-scsi-all".

10. Displaying Disk Physical Locations:
a. Get the output of the command # prtconf -v and identify the value of 'phy-num'. Further details can be found in Document 1365089.1 - How to locate a failed internal disk drive on a T3/T4/T5/T7/T8 series system.
b. On servers that run Solaris 11, you can use the diskinfo command. Details can be found at https://docs.oracle.com/cd/E53394_01/html/E54782/disksprep-3.html.
root@t5-8:~# diskinfo
    D:devchassis-path                     c:occupant-compdev
    ------------------------------------  ---------------------
    /dev/chassis/SYS/PCIE9/F40/LUN0/disk  c0t50005160000283D8d0
    /dev/chassis/SYS/PCIE9/F40/LUN1/disk  c0t50005160000275CCd0
    /dev/chassis/SYS/PCIE9/F40/LUN2/disk  c0t5000516000028010d0
    /dev/chassis/SYS/PCIE9/F40/LUN3/disk  c0t50005160000278D0d0
    /dev/chassis/SYS/SASBP0/HDD0/disk     c0t5000CCA0162B76ACd0
    /dev/chassis/SYS/SASBP0/HDD1/disk     c0t5000CCA0162A47B0d0
    /dev/chassis/SYS/SASBP0/HDD2          -
    /dev/chassis/SYS/SASBP0/HDD3          -
    /dev/chassis/SYS/SASBP1/HDD4/disk     c0t5000CCA0162A6B4Cd0
    /dev/chassis/SYS/SASBP1/HDD5/disk     c0t5000CCA0162AAA00d0
    /dev/chassis/SYS/SASBP1/HDD6/disk     c0t5000CCA0160B4C88d0
    /dev/chassis/SYS/SASBP1/HDD7/disk     c0t5000CCA0160BF17Cd0
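
Once a suspect device name is known from the earlier steps, its physical slot can be looked up in the diskinfo listing. The following is a minimal sketch that assumes the two-column layout shown above; chassis_path_for is a hypothetical helper name.

```shell
# Hypothetical helper: diskinfo output on stdin, device name as $1.
# Prints the chassis path of the matching row; exit status 1 if none matches.
chassis_path_for() {
    awk -v dev="$1" '
        $2 == dev { print $1; found = 1 }
        END { exit !found }
    '
}
```

For example, `diskinfo | chassis_path_for c0t5000CCA0162B76ACd0` would point to slot HDD0 on SASBP0 in the listing above.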
c. If the disk is under MPXIO /scsi_vhci control, check Document 1542744.1 How to Identify the Target ID for an Internal System Disk under MPXIO/scsi_vhci Control.

11. raidconfig list all -v
raidconfig is a command available when the Hardware Management Pack (HMP) is installed. HMP is automatically installed on Solaris 11.2 and later.
Sample output from a T3-4:
root@t3-4:~# raidconfig list all -v
    CONTROLLER c0
    Disk c0d0
    Disk c0d1

12. If, after taking steps 1-9, you have gathered sufficient information proving that a disk has failed, contact Oracle support.
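
The Solaris-side data collection in section II can be scripted so that all outputs end up in one directory to attach to the SR. This is a minimal sketch under stated assumptions, not a replacement for Explorer: capture and collect_disk_data are hypothetical names, the default output directory /var/tmp/diskdata is an arbitrary choice, and format is run with its input redirected from /dev/null so that it only lists the disks and exits.

```shell
# Hypothetical helper: run one command and save stdout+stderr to
# "<outdir>/<label>.out"; record a note if the command is not installed.
capture() {
    outdir=$1; label=$2; shift 2
    mkdir -p "$outdir" || return 1
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$outdir/$label.out" 2>&1 </dev/null
    else
        echo "command not found: $1" >"$outdir/$label.out"
    fi
    return 0
}

# Hypothetical wrapper: collects the outputs requested in steps 1-8 and 10a.
collect_disk_data() {
    d=${1:-/var/tmp/diskdata}            # output directory is an assumption
    capture "$d" prtdiag  prtdiag -v
    capture "$d" cfgadm   cfgadm -alv
    capture "$d" format   format         # stdin is /dev/null: list disks only
    capture "$d" uptime   uptime
    capture "$d" iostat   iostat -En
    capture "$d" metastat metastat -t
    capture "$d" zpool    zpool status
    capture "$d" dmesg    dmesg
    capture "$d" messages cat /var/adm/messages
    capture "$d" prtconf  prtconf -v
}
```

Running `collect_disk_data /var/tmp/sr_data` as root would leave one .out file per command. Commands that do not apply to the configuration, such as metastat on a ZFS-only system, simply record their error text.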

References
<NOTE:1012731.1> - How to Reset the iostat -E Error Counters Without Rebooting
<NOTE:1002410.1> - Solaris Volume Manager (SVM) How to recover mirrored metadevice with both submirrors in a "Needs Maintenance" state caused by a bad disk block
<NOTE:1483874.1> - How to Identify a Failed Disk when Internal LSI Hardware RAID Volumes are configured (T3-x and T4-x)
<NOTE:1469821.1> - Solaris Volume Manager (SVM) SPARC How to Replace a Failed, SCSI Disk, Mirrored with SVM
<NOTE:1471121.1> - Some or all internal disks in SPARC T3 / T4 Servers may not show up in "# format" or "ok> probe-scsi-all"
<NOTE:1153444.1> - Oracle Services Tools Bundle (STB) - RDA/Explorer, SNEEP, ACT
<NOTE:1365089.1> - How to locate a failed internal disk drive on a T3/T4/T5/T7/T8 series system
<NOTE:1542744.1> - How to Identify the Target ID for an Internal System Disk under MPXIO/scsi_vhci Control

Attachments
This solution has no attachment