Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1005520.1: How to verify I/O device errors on V210/V240/V215/V245/V250/V440/V445, T1000/T2000, T5120/T5140/T5220/T5240/T5440, V480/V490/V880/V890 servers
Previously Published As: 207650

Applies to:
Sun Fire V480 Server - Version All Versions and later
Sun Fire V880 Server - Version All Versions and later
Sun Fire V210 Server - Version All Versions and later
Sun Fire V215 Server - Version All Versions and later
Sun Fire V240 Server - Version All Versions and later
All Platforms

Goal / Description

This information does not apply to systems in which the disks are configured in a hardware RAID volume, because 'format' does not show disks that are part of a RAID volume.
Solution

Steps to Follow

Most I/O errors for failing drives on Sun Fire[TM] servers are caused by the disk itself, not by the disk backplane or cables. To confirm a disk failure from I/O errors, check the following.

First, verify that 'format' does not report a device problem. A typical example is 'format' showing 'drive type unknown' for a specific drive.

Server platforms such as the 280R, V480/V490, and V880/V890 use FC-AL disk drives. Each FC-AL disk has a World Wide Name (WWN), which affects how the device appears in Solaris[TM] (in the format output):

    AVAILABLE DISK SELECTIONS:
        0. c1t0d0
           /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371e4d,0
        1. c1t1d0
           /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6372ccc,0
        2. c1t2d0
           /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0

After analyzing the format output, it is strongly recommended to also examine /var/adm/messages for matching disk drive errors:

    Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0 (ssd0):
    Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Error for Command: read(10)
    Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Error Level: Retryable
    Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Requested Block: 404016 Error Block: 404016
    Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 0446B9xxxx
    Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
    Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] ASC: 0x29 (), ASCQ: 0x3, FRU: 0x4

Errors like these generally indicate that the listed drive needs to be replaced. To confirm the failing drive, map the WWN in the messages above (w21000011c6371bc0,0) to a drive in the output of the format command. Here it matches the device path of 'c1t2d0'.
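The mapping step above (device path in /var/adm/messages to cXtYdZ name in the format listing) can be sketched in Python. This is only an illustrative sketch, not an Oracle tool: the function names `parse_format_listing` and `disks_with_warnings` are hypothetical, the sample text is copied from the outputs above, and the regular expressions assume the line layouts shown in this document.

```python
import re

# Sample text copied from the outputs shown in this document.
FORMAT_SAMPLE = """
AVAILABLE DISK SELECTIONS:
    0. c1t0d0
       /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371e4d,0
    1. c1t1d0
       /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6372ccc,0
    2. c1t2d0
       /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0
"""

MESSAGES_SAMPLE = """
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0 (ssd0):
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Error for Command: read(10)
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
"""

# Assumed layouts: "<index>. cXtYdZ" starts a disk entry, and a token
# beginning with "/" and containing "@" is its physical device path.
DISK_RE = re.compile(r"^\s*\d+\.\s+(c\d+t\d+d\d+)\b")
PATH_RE = re.compile(r"(/\S+@\S+)")
WARN_RE = re.compile(r"WARNING:\s+(/\S+)\s+\((\w+)\):")


def parse_format_listing(text):
    """Return a dict mapping physical device path -> cXtYdZ disk name."""
    path_to_disk = {}
    current = None
    for line in text.splitlines():
        disk = DISK_RE.match(line)
        if disk:
            current = disk.group(1)
        path = PATH_RE.search(line)
        if path and current:
            path_to_disk[path.group(1)] = current
            current = None
    return path_to_disk


def disks_with_warnings(format_text, messages_text):
    """Count scsi WARNING lines per disk name found in the format listing."""
    path_to_disk = parse_format_listing(format_text)
    counts = {}
    for line in messages_text.splitlines():
        warn = WARN_RE.search(line)
        if warn:
            disk = path_to_disk.get(warn.group(1), "unknown")
            counts[disk] = counts.get(disk, 0) + 1
    return counts


if __name__ == "__main__":
    print(disks_with_warnings(FORMAT_SAMPLE, MESSAGES_SAMPLE))
```

A path that appears in messages but not in the format listing is reported under "unknown", which is itself worth investigating since 'format' may not be seeing the device.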
Here is another example, for server platforms using SCSI drives (servers such as V215/V245, V440/V445, T1000/T2000). The format output shows:

    AVAILABLE DISK SELECTIONS:
        0. c0t0d0
           /pci@1f,700000/scsi@2/sd@0,0
        1. c0t1d0
           /pci@1f,700000/scsi@2/sd@1,0

The following errors are in /var/adm/messages:

    Nov 20 12:28:51 sg5000-maildb-0 scsi: WARNING: /pci@1f,700000/scsi@2/sd@1,0 (sd2):
    Nov 20 12:28:51 sg5000-maildb-0 scsi: Error for Command: persistent reservation in Error Level: Informational
    Nov 20 12:28:51 sg5000-maildb-0 scsi: Requested Block: 0 Error Block: 0
    Nov 20 12:28:51 sg5000-maildb-0 scsi: Vendor: SEAGATE Serial Number: 0449B9xxxx
    Nov 20 12:28:51 sg5000-maildb-0 scsi: Sense Key: Soft Error
    Nov 20 12:28:51 sg5000-maildb-0 scsi: ASC: 0x5d (drive operation marginal, service immediately (failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x5

In this example the device path from messages (/pci@1f,700000/scsi@2/sd@1,0) matches the disk c0t1d0 reported in the format output, so that disk needs to be replaced.

When troubleshooting I/O errors for failing devices, also carefully examine the output of the 'iostat -E' (or 'iostat -En') command for matching error events that affect the disk drives.

Note: Use iostat only after errors have been seen in messages. iostat output alone should never be used as justification for disk replacement; it is valid evidence only when there are matching errors in the messages file.
Look for non-zero counts in the output of 'iostat' (usually in the 1st, 4th, and 5th lines of each device entry):

    # iostat -En
    c1t1d0 Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
    Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7CC470000xxxx
    Size: 73.40GB <73400057856 bytes>
    Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 1
    Illegal Request: 0 Predictive Failure Analysis: 0
    c1t2d0 Soft Errors: 0 Hard Errors: 394 Transport Errors: 0
    Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7CLM60000xxxx
    Size: 0.00GB <0 bytes>
    Media Error: 394 Device Not Ready: 0 No Device: 0 Recoverable: 0
    Illegal Request: 0 Predictive Failure Analysis: 0

If more than one disk has non-zero counts (as in the example above), the cause may be a problem on one disk and a side effect of that problem on the other. In this case the error counts on the failing drive c1t2d0 are significantly higher than those on the other disk, c1t1d0.

A disk problem reported in the 'format' output (or in messages) typically translates to a high error count in iostat, for example:

        2. c1t2d0
           /pci@1f,700000/scsi@2/sd@2,0

    # iostat -E
    ..........
    c1t2d0 Soft Errors: 0 Hard Errors: 932 Transport Errors: 0
    Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0415Q0XXXX
    Size: 0.00GB <0 bytes>
    Media Error: 0 Device Not Ready: 335 No Device: 190 Recoverable: 0
    Illegal Request: 0 Predictive Failure Analysis: 0

However, a non-zero count in 'iostat -E' output does not always mean an error event on a device. Some specific conditions of the target device can cause non-zero values in the 'iostat' output.
The following is an example of such a condition, where the device is working normally:

    # iostat -E
    ssd10 Soft Errors: 0 Hard Errors: 10 Transport Errors: 0
    Vendor: SEAGATE Product: ST336605FSUN36G Revision: 0638 Serial No: 0201P1xxxx
    Size: 36.42GB <36418595328 bytes>
    Media Error: 0 Device Not Ready: 0 No Device: 10 Recoverable: 0
    Illegal Request: 0 Predictive Failure Analysis: 0

In this case the "Hard Errors" and "No Device" counts are the same. This implies that the device has gone through resets or power-on events. The device does not need immediate replacement. It is recommended to monitor the values over a period of time; if other related errors appear, they must be investigated. Refer to Document 1017741.1, Solaris Operating System: High Hard Error value in iostat -E output, for more details.

NOTE: A helpful utility, "diskinfo.sparc", is part of Sun Explorer. It always gives up-to-date disk model and serial number information, even after a disk hot swap. For example:

    # /opt/SUNWexplo/bin/diskinfo.sparc
    AVAILABLE SCSI DEVICES:
    Location  Vendor   Product          Rev   Serial #
    c1t0d0    FUJITSU  MAP3147F SUN146G 1601  0515R0304B
    c1t1d0    SEAGATE  ST373307FSUN72G  0307  0426B7MQQ5
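The iostat checks described above (non-zero counters, and the special case where "Hard Errors" equals "No Device") can be sketched as a small Python parser. This is an illustrative sketch under stated assumptions, not a supported tool: `parse_iostat_errors` and `classify` are hypothetical names, the labels "clean", "monitor", and "cross-check messages" are made up for this example, and the parser assumes the 'iostat -E' layout shown in this document. In line with the note above, it never recommends replacement from iostat data alone.

```python
import re

# Sample text copied from the iostat outputs shown in this document.
IOSTAT_SAMPLE = """
c1t1d0 Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7CC470000xxxx
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 1
Illegal Request: 0 Predictive Failure Analysis: 0
c1t2d0 Soft Errors: 0 Hard Errors: 394 Transport Errors: 0
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7CLM60000xxxx
Size: 0.00GB <0 bytes>
Media Error: 394 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
ssd10 Soft Errors: 0 Hard Errors: 10 Transport Errors: 0
Vendor: SEAGATE Product: ST336605FSUN36G Revision: 0638 Serial No: 0201P1xxxx
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 10 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
"""

# Counter labels as they appear in the outputs above.
COUNTERS = (
    "Soft Errors", "Hard Errors", "Transport Errors",
    "Media Error", "Device Not Ready", "No Device", "Recoverable",
    "Illegal Request", "Predictive Failure Analysis",
)


def parse_iostat_errors(text):
    """Return {device: {counter: value}} from 'iostat -E' style text."""
    devices = {}
    current = None
    for line in text.splitlines():
        # A line containing "Soft Errors:" opens a new device entry.
        if "Soft Errors:" in line:
            current = line.split()[0]
            devices[current] = {}
        if current is None:
            continue
        for name in COUNTERS:
            found = re.search(re.escape(name) + r":\s*(\d+)", line)
            if found:
                devices[current][name] = int(found.group(1))
    return devices


def classify(counters):
    """Label one device; the labels are illustrative, not Oracle terms."""
    nonzero = {name: value for name, value in counters.items() if value}
    if not nonzero:
        return "clean"
    # Equal "Hard Errors" and "No Device" with nothing else set suggests
    # resets or power-on events rather than a failing disk.
    if (set(nonzero) == {"Hard Errors", "No Device"}
            and nonzero["Hard Errors"] == nonzero["No Device"]):
        return "monitor"
    return "cross-check messages"


if __name__ == "__main__":
    for device, counters in parse_iostat_errors(IOSTAT_SAMPLE).items():
        print(device, classify(counters))
```

The "cross-check messages" label is deliberate: a non-zero counter only becomes grounds for replacement when /var/adm/messages shows matching errors for the same device.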
References

<NOTE:778.1> - Troubleshooting Video Issues in MOS
<NOTE:1017741.1> - Solaris Operating System: High Hard Error value in iostat -E output

Attachments

This solution has no attachment