Troubleshooting Disk Failures on SPARC T3-x, T4-x and T5-x Servers When an Explorer Cannot be Provided

Asset ID:	1-75-2314065.1
Update Date:	2018-02-07

Solution Type Troubleshooting Sure

Solution 2314065.1 : Troubleshooting Disk Failures on SPARC T3-x, T4-x and T5-x Servers When an Explorer Cannot be Provided

Applies to:

SPARC T5-4
SPARC T5-8
SPARC T3-2
SPARC T3-4
SPARC T3-1
Oracle Solaris on SPARC (32-bit)
Oracle Solaris on SPARC (64-bit)

Purpose

This document describes which data to collect and how to analyze disk failures when an Explorer cannot be provided.

Troubleshooting Steps

Gather and analyze the output of the commands listed below.

An Explorer gathers all of this data, but it is still important to know which commands reveal disk information, what to check for, and how to interpret each output.

I. If you are not able to reach Solaris, provide the output of the following commands from OBP (ok> prompt):

ok> printenv

Determine the boot device (internal or external). If it is external and an Oracle storage array, open an SR with the Storage group.

ok> setenv auto-boot? false

Change the PROM auto-boot? parameter to false.

ok> reset-all

Reset the system before executing the probe-scsi-all command (to prevent the system from hanging).

ok> probe-scsi-all

The probe-scsi-all command transmits an inquiry command to all SCSI devices connected to the system SCSI host adapters, including any host adapters installed in PCI slots.
The first identifier listed in the display is the SCSI host adapter address in the system device tree, followed by the SCSI device identification data.

ok> devalias

Lists the device aliases and the associated paths of devices that might be connected to the system.

ok> show-disks

Lists the disk device paths and offers the option of selecting one for use in a device alias.

II. If you are able to reach Solaris, provide the output of the following commands:

1. #prtdiag -v

Check the LED status and the environmental status.

2. #cfgadm -alv

This command displays information about a system's SCSI controllers and their attached devices.

3. #format

The command displays a list of recognized disks under AVAILABLE DISK SELECTIONS.

A disk that has failed sometimes no longer appears in the format output.

4. #uptime

This command shows how long the system has been up.

5. #iostat -En

Search for hardware errors, but correlate the output you see here with the uptime.

The iostat error counters are reset at system reboot.

If a failed disk was replaced previously and the system has not been rebooted since, iostat will still report the old errors against the new disk.

If you can't reboot the server, check Document 1012731.1 Reset the iostat -E hard/soft/tran errors counters without rebooting.

#iostat -En
Disk                   Size       Soft   Hard  Trans  Media  Ready  NoDev  Recov  Illeg  PFlAn
c1t0d0                 0.00GB        0      8      0      0      0      8      0     10      0  AMI Virtual CDROM
c0t5000CCA0123D00E8d0  600.13GB   2369     24      2     23      0      1   2369      0      0  HITACHI H106060SDSUN600G
c0t5000CCA012569F20d0  600.13GB      0      0      0      0      0      0      0      0      0  HITACHI H106060SDSUN600G
c0t5000CCA0123AB328d0  600.13GB      0      0      0      0      0      0      0      0      0  HITACHI H106060SDSUN600G
c0t5000CCA01244E0E0d0  600.13GB      0      0      0      0      0      0      0      0      0  HITACHI H106060SDSUN600G
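
When many disks are listed, the counter check can be scripted. The following is a minimal Python sketch that parses rows in the summarized layout shown above (not the raw multi-line output that iostat -En prints on a live system). The function names and the rule "media errors or a predictive-failure count greater than zero" are assumptions made for illustration, not an Oracle-defined threshold.

```python
# Column names follow the header of the table above; "product" is a label
# chosen here for the trailing vendor/model text (an assumption).
COUNTERS = ["soft", "hard", "trans", "media", "ready",
            "nodev", "recov", "illeg", "pflan"]

# Rows copied from the example table above.
SAMPLE = """\
Disk Size Soft Hard Trans Media Ready NoDev Recov Illeg PFlAn
c1t0d0 0.00GB 0 8 0 0 0 8 0 10 0 AMI Virtual CDROM
c0t5000CCA0123D00E8d0 600.13GB 2369 24 2 23 0 1 2369 0 0 HITACHI H106060SDSUN600G
c0t5000CCA012569F20d0 600.13GB 0 0 0 0 0 0 0 0 0 HITACHI H106060SDSUN600G
"""


def parse_rows(text):
    """Turn each whitespace-separated table row into a dictionary."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header line and anything too short to be a data row.
        if len(fields) < 11 or fields[0] == "Disk":
            continue
        row = {"disk": fields[0], "size": fields[1]}
        for name, value in zip(COUNTERS, fields[2:11]):
            row[name] = int(value)
        row["product"] = " ".join(fields[11:])
        rows.append(row)
    return rows


def suspects(rows):
    """Disks with media errors or a predictive-failure count (assumed rule)."""
    return [r["disk"] for r in rows if r["media"] > 0 or r["pflan"] > 0]


if __name__ == "__main__":
    print(suspects(parse_rows(SAMPLE)))
```

In the sample, only c0t5000CCA0123D00E8d0 is reported. The virtual CDROM has hard errors, but they are all "no device" errors, which the assumed rule ignores. Any result still has to be correlated with the uptime as described above.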

6. #metastat -t

Use this command for disks configured under Solaris Volume Manager.

Check whether any submirrors are in the "Needs maintenance" state.

Also check Document 1002410.1 How to recover mirrored metadevice with both submirrors in a "Needs Maintenance" state caused by a bad disk block.

#metastat -t
Submirror 1: d2 State: Needs maintenance Sat Jun 25 00:22:43 2011
d2: Submirror of d0
State: Needs maintenance Sat Jun 25 00:22:43 2011
Invoke: metareplace d0 /dev/dsk/c0t5000C500332C7E73d0s0 <new device>
Size: 50337500 blocks (24 GB)
Stripe 0:
Device                            Start  Dbase  State        Reloc  Hot Spare  Time
/dev/dsk/c0t5000C500332C7E73d0s0  0      No     Maintenance  Yes               Sat Jun 25 00:22:43 2011
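
For saved metastat output, the same check can be sketched in Python. The function name below is hypothetical. The script only looks for the "Needs maintenance" text and the "Invoke:" hint that appear in the example above, and it does not run any command itself.

```python
import re

# Lines copied from the metastat example above.
SAMPLE = """\
Submirror 1: d2 State: Needs maintenance Sat Jun 25 00:22:43 2011
d2: Submirror of d0
State: Needs maintenance Sat Jun 25 00:22:43 2011
Invoke: metareplace d0 /dev/dsk/c0t5000C500332C7E73d0s0 <new device>
Size: 50337500 blocks (24 GB)
"""

# Matches the recovery hint that metastat prints for a failed component.
INVOKE = re.compile(r"^[ \t]*Invoke:[ \t]*(.+)$", re.MULTILINE)


def maintenance_report(text):
    """Return (lines reporting 'Needs maintenance', suggested commands)."""
    flagged = [line.strip() for line in text.splitlines()
               if "Needs maintenance" in line]
    suggested = [m.group(1).strip() for m in INVOKE.finditer(text)]
    return flagged, suggested


if __name__ == "__main__":
    flagged, suggested = maintenance_report(SAMPLE)
    for line in flagged:
        print("FLAGGED:", line)
    for cmd in suggested:
        print("SUGGESTED BY METASTAT:", cmd)
```

The suggested command is reported only for review; "<new device>" is a placeholder printed by metastat and must be replaced with a real device before anything is run.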

7. #zpool status

Use this command for disks configured under ZFS.

Check whether any pool or device is in the DEGRADED state (or in any other state that is not ONLINE).

#zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        rpool                    ONLINE       0     0     0
          c0t5000CCA025244F18d0  ONLINE       0     0     0

errors: No known data errors
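
The config section can also be checked with a short script. This Python sketch flags any pool or device row whose state is not ONLINE or whose READ, WRITE or CKSUM counter is not zero. The function name is hypothetical, and the DEGRADED sample is invented purely for illustration; it is not output from a real system.

```python
# States that zpool status can report for a pool or device.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

# Config rows copied from the rpool example above.
HEALTHY = """\
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c0t5000CCA025244F18d0 ONLINE 0 0 0
"""

# Invented rows, for illustration only (device names are placeholders).
DEGRADED = """\
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
c0tAAAAd0 ONLINE 0 0 0
c0tBBBBd0 FAULTED 3 0 0
"""


def unhealthy(text):
    """Return (name, state, [read, write, cksum]) for rows needing attention."""
    problems = []
    for line in text.splitlines():
        fields = line.split()
        # A config row has a name, a known state and three counters.
        if len(fields) >= 5 and fields[1] in STATES:
            name, state, counters = fields[0], fields[1], fields[2:5]
            if state != "ONLINE" or any(c != "0" for c in counters):
                problems.append((name, state, counters))
    return problems


if __name__ == "__main__":
    print(unhealthy(HEALTHY))
    for name, state, counters in unhealthy(DEGRADED):
        print(name, state, counters)
```

Counters are compared as text because zpool status may abbreviate large values, which would not convert cleanly to integers.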

8. Get the output from dmesg and messages

#dmesg

# more /var/adm/messages

Look for error blocks that are reported and for messages mentioning that the disk has been taken offline.

Important! If you can't find any evidence of the fault in /var/adm/messages, then you probably do not need to replace the disk.
One exception would be a disk that failed so long ago that the messages files have rolled over and no longer show that time period.

If, after completing steps 1-8, the only problem you find is error blocks on a particular disk, run format > analyze > read to repair the defective blocks on that disk.
format> analyze > read - reads each sector on the current disk and repairs defective blocks by default (it doesn't harm SunOS).

Example of error blocks reported in /var/adm/messages:
Aug 24 22:39:56 : [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca0123d00e8 (sd5):
Aug 24 22:39:56 Error for Command: read(10) Error Level: Fatal
Aug 24 22:39:56 : [ID 107833 kern.notice] Requested Block: 1118816592 Error Block: 1118816592
Aug 24 22:39:56 : [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 111555BBB
Aug 24 22:39:56 : [ID 107833 kern.notice] Sense Key: Media_Error
Aug 24 22:39:56 : [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x2d
Aug 24 22:39:57 : [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca0123d00e8 (sd5):
Aug 24 22:39:57 Error for Command: read(10) Error Level: Retryable
Aug 24 22:39:57 : [ID 107833 kern.notice] Requested Block: 1119053824 Error Block: 1119053825
Aug 24 22:39:57 : [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 111555BBB
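
To summarize a long messages file, the warning blocks can be grouped by device. The Python sketch below relies only on the fields visible in the example above (the WARNING line, Error Level, Error Block and Sense Key); the function name is hypothetical. It assumes the detail lines of one warning directly follow its WARNING line, which may not hold if messages from several devices are interleaved.

```python
import re

# Lines copied from the /var/adm/messages example above.
SAMPLE = """\
Aug 24 22:39:56 : [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca0123d00e8 (sd5):
Aug 24 22:39:56 Error for Command: read(10) Error Level: Fatal
Aug 24 22:39:56 : [ID 107833 kern.notice] Requested Block: 1118816592 Error Block: 1118816592
Aug 24 22:39:56 : [ID 107833 kern.notice] Sense Key: Media_Error
Aug 24 22:39:57 : [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000cca0123d00e8 (sd5):
Aug 24 22:39:57 Error for Command: read(10) Error Level: Retryable
Aug 24 22:39:57 : [ID 107833 kern.notice] Requested Block: 1119053824 Error Block: 1119053825
"""

WARNING = re.compile(r"WARNING: (\S+) \((\w+)\):")
DETAILS = {
    "level": re.compile(r"Error Level: (\w+)"),
    "error_block": re.compile(r"Error Block: (\d+)"),
    "sense_key": re.compile(r"Sense Key: (\w+)"),
}


def parse_events(text):
    """Build one dictionary per WARNING line, filled from the lines after it."""
    events = []
    for line in text.splitlines():
        match = WARNING.search(line)
        if match:
            events.append({"path": match.group(1), "instance": match.group(2)})
            continue
        if not events:
            continue  # detail line seen before any WARNING line
        for key, pattern in DETAILS.items():
            match = pattern.search(line)
            if match:
                events[-1][key] = match.group(1)
    return events


if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        print(event)
```

Counting events per device path, and separating Fatal from Retryable levels, gives a quick view of which disk the error blocks belong to.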

9. If internal LSI hardware RAID volumes are configured using one of the following three utilities (sas2ircu, raidconfig or FCode), check Document 1483874.1 - How to Identify a Failed Disk when Internal LSI Hardware RAID Volumes are configured (T3-x and T4-x).

If the internal disks are configured using a RAID HBA, indicated by "LSI,mrsas" in the device path to the disks, check Document 1471121.1 Some or all internal disks in SPARC T3 / T4 Servers may not show up in "# format" or "ok> probe-scsi-all".

10. Displaying Disk Physical Locations:

a. Get the output of the command #prtconf -v and identify the value of 'phy-num'.

Further details can be found in Document 1365089.1 - How to locate a failed internal disk drive on a T3/T4/T5/T7/T8 series system.

b. For servers that use Solaris 11, you can use the #diskinfo command.

Details can be found at https://docs.oracle.com/cd/E53394_01/html/E54782/disksprep-3.html.

root@t5-8:~# diskinfo
D:devchassis-path                     c:occupant-compdev
------------------------------------  ---------------------
/dev/chassis/SYS/PCIE9/F40/LUN0/disk  c0t50005160000283D8d0
/dev/chassis/SYS/PCIE9/F40/LUN1/disk  c0t50005160000275CCd0
/dev/chassis/SYS/PCIE9/F40/LUN2/disk  c0t5000516000028010d0
/dev/chassis/SYS/PCIE9/F40/LUN3/disk  c0t50005160000278D0d0
/dev/chassis/SYS/SASBP0/HDD0/disk     c0t5000CCA0162B76ACd0
/dev/chassis/SYS/SASBP0/HDD1/disk     c0t5000CCA0162A47B0d0
/dev/chassis/SYS/SASBP0/HDD2          -
/dev/chassis/SYS/SASBP0/HDD3          -
/dev/chassis/SYS/SASBP1/HDD4/disk     c0t5000CCA0162A6B4Cd0
/dev/chassis/SYS/SASBP1/HDD5/disk     c0t5000CCA0162AAA00d0
/dev/chassis/SYS/SASBP1/HDD6/disk     c0t5000CCA0160B4C88d0
/dev/chassis/SYS/SASBP1/HDD7/disk     c0t5000CCA0160BF17Cd0
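
Once a suspect device name is known from the earlier steps, the diskinfo listing gives its physical slot. A minimal Python sketch of that lookup follows; the function names are hypothetical, and it assumes the two-column layout shown above, where empty slots are marked with "-".

```python
# Rows copied from the diskinfo example above (shortened).
SAMPLE = """\
D:devchassis-path c:occupant-compdev
------------------------------------ ---------------------
/dev/chassis/SYS/PCIE9/F40/LUN0/disk c0t50005160000283D8d0
/dev/chassis/SYS/SASBP0/HDD0/disk c0t5000CCA0162B76ACd0
/dev/chassis/SYS/SASBP0/HDD1/disk c0t5000CCA0162A47B0d0
/dev/chassis/SYS/SASBP0/HDD2 -
/dev/chassis/SYS/SASBP1/HDD4/disk c0t5000CCA0162A6B4Cd0
"""


def chassis_map(text):
    """Map each device name (cXtYdZ) to its /dev/chassis path."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if (len(fields) == 2 and fields[0].startswith("/dev/chassis/")
                and fields[1] != "-"):
            mapping[fields[1]] = fields[0]
    return mapping


def empty_slots(text):
    """List chassis paths that have no occupant."""
    return [fields[0] for fields in (line.split() for line in text.splitlines())
            if len(fields) == 2 and fields[1] == "-"]


if __name__ == "__main__":
    print(chassis_map(SAMPLE).get("c0t5000CCA0162B76ACd0"))
    print(empty_slots(SAMPLE))
```

A failed disk that is no longer visible to Solaris may show up as an empty slot rather than as a device name, so both lists are worth reviewing.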

c. If the disk is under MPxIO/scsi_vhci control, check Document 1542744.1 How to Identify the Target ID for an Internal System Disk under MPXIO/scsi_vhci Control.

11. #raidconfig list all -v

raidconfig is available when Oracle Hardware Management Pack (HMP) is installed. HMP is automatically installed on Solaris 11.2 and later.

Sample output from a T3-4:

root@t3-4:~# raidconfig list all -v

CONTROLLER c0
=============
Node ID: mptir2:03:00.0:500605b0028cf110
Manufacturer: LSI Logic
Model: SGX-SAS6-REM-Z
F/W Version: 05.00.17.00
Serial Number: 500605b0028cf110
RAID Volumes: 0
Disks: 4
PCI Address: 03:00.0
PCI Vendor ID: 0x1000
PCI Device ID: 0x0072
PCI Subvendor ID: 0x1000
PCI Subdevice ID: 0x3180
Battery Backup Status: Not installed
Max RAID Volumes: 2
Max Disks per RAID Volume: 256
Supported RAID Levels: 0, 1, 10
Max Dedicated Spares: 0
Max Global Spares: 2
Stripe Size Min (KB): 64
Stripe Size Max (KB): 64

Disk c0d0
=========
ID: c0d0
Chassis: 0
Slot: 0
Node ID: PDS:5000cca0150c4fb9
Mapped to Host OS: true
Device: /dev/rdsk/c0t5000CCA0150C4FB8d0s2
Disabled: false
Status: OK
Type: sas
Media: HDD
Manufacturer: HITACHI
Model: H103030SCSUN300G
Size (GiB): 279
Serial Number: 001043G6SWTE PFV6SWTE
F/W Version: A2A8
NAC Name: /SYS/MB/HDD0
Current Task: none

Disk c0d1
=========
ID: c0d1
Chassis: 0
Slot: 1
Node ID: PDS:5000cca00ac1fbfd
Mapped to Host OS: true
Device: /dev/rdsk/c0t5000CCA00AC1FBFCd0s2
Disabled: false
Status: OK
Type: sas
Media: HDD
Manufacturer: HITACHI
Model: H103030SCSUN300G
Size (GiB): 279
Serial Number: 001033GEP6VE PDYEP6VE
F/W Version: A2A8
NAC Name: /SYS/MB/HDD1
Current Task: none
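
The per-disk blocks in this output can be reduced to a short status list. The Python sketch below splits the text into "Disk" sections and collects the "key: value" pairs of each; the function names are hypothetical. Treating any Status other than "OK" as needing attention is an assumption based on the sample, since other status values are not shown in this document.

```python
# Shortened from the T3-4 sample above.
SAMPLE = """\
CONTROLLER c0
=============
Model: SGX-SAS6-REM-Z
Disks: 4

Disk c0d0
=========
ID: c0d0
Slot: 0
Node ID: PDS:5000cca0150c4fb9
Device: /dev/rdsk/c0t5000CCA0150C4FB8d0s2
Status: OK
NAC Name: /SYS/MB/HDD0

Disk c0d1
=========
ID: c0d1
Slot: 1
Device: /dev/rdsk/c0t5000CCA00AC1FBFCd0s2
Status: OK
NAC Name: /SYS/MB/HDD1
"""


def parse_disks(text):
    """Return one dictionary of 'key: value' pairs per 'Disk' section."""
    disks, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Disk "):
            current = {}          # start of a new disk section
            disks.append(current)
        elif line.startswith("CONTROLLER "):
            current = None        # controller fields are ignored here
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value.strip()
    return disks


def not_ok(disks):
    """Disks whose Status is anything other than 'OK' (assumed rule)."""
    return [d for d in disks if d.get("Status") != "OK"]


if __name__ == "__main__":
    for disk in parse_disks(SAMPLE):
        print(disk["ID"], disk["Status"], disk["NAC Name"], disk["Device"])
    print(not_ok(parse_disks(SAMPLE)))
```

The NAC Name field gives the physical location of each disk, which complements the diskinfo and prtconf methods in step 10.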

12. If, after completing steps 1-9, you have gathered sufficient evidence that a disk has failed, contact Oracle Support.

References

<NOTE:1012731.1> - How to Reset the iostat -E Error Counters Without Rebooting
<NOTE:1002410.1> - Solaris Volume Manager (SVM) How to recover mirrored metadevice with both submirrors in a "Needs Maintenance" state caused by a bad disk block
<NOTE:1483874.1> - How to Identify a Failed Disk when Internal LSI Hardware RAID Volumes are configured (T3-x and T4-x)
<NOTE:1469821.1> - Solaris Volume Manager (SVM) SPARC How to Replace a Failed, SCSI Disk, Mirrored with SVM
<NOTE:1471121.1> - Some or all internal disks in SPARC T3 / T4 Servers may not show up in "# format" or "ok> probe-scsi-all"
<NOTE:1153444.1> - Oracle Services Tools Bundle (STB) - RDA/Explorer, SNEEP, ACT
<NOTE:1365089.1> - How to locate a failed internal disk drive on a T3/T4/T5/T7/T8 series system
<NOTE:1542744.1> - How to Identify the Target ID for an Internal System Disk under MPXIO/scsi_vhci Control
