Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1005520.1
Update Date:2017-11-30
Keywords:

Solution Type  Technical Instruction Sure

Solution  1005520.1 :   How to verify I/O device errors on V210/V240/V215/V245/V250/V440/V445, T1000/T2000, T5120/T5140/T5220/T5240/T5440, V480/V490/V880/V890 servers  


Related Items
  • Sun Fire T1000 Server
  • Sun SPARC Enterprise T5220 Server
  • Sun Blade T6300 Server Module
  • Sun Fire V440 Server
  • Sun SPARC Enterprise T5240 Server
  • Sun Blade T6320 Server Module
  • Sun Fire V880 Server
  • Sun Fire V215 Server
  • Sun Fire V890 Server
  • Sun SPARC Enterprise T5140 Server
  • Sun Fire V240 Server
  • Solaris Operating System
  • Sun Fire V480 Server
  • Sun Fire V210 Server
  • Sun Fire V245 Server
  • Sun SPARC Enterprise T5120 Server
  • Sun Fire V445 Server
  • Sun Fire T2000 Server
  • Sun Blade T6340 Server Module
  • Sun Fire V490 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Workgroup Servers>SN-SPARC: SF-V2x0
  • _Old GCS Categories>Sun Microsystems>Servers>Entry-Level Servers
  • _KM>Content>Video
  • _Old GCS Categories>Sun Microsystems>Servers>CMT Servers

PreviouslyPublishedAs
207650


Applies to:

Sun Fire V480 Server - Version All Versions and later
Sun Fire V880 Server - Version All Versions and later
Sun Fire V210 Server - Version All Versions and later
Sun Fire V215 Server - Version All Versions and later
Sun Fire V240 Server - Version All Versions and later
All Platforms


Goal

Description
This document will help the user to identify a failing disk device based on errors reported in the 'format' output, 'iostat' and /var/adm/messages.

This information does not apply to systems in which the disks are configured in a hardware RAID volume, because 'format' will not show disks that are part of a RAID volume.


A brief how-to video tutorial with step-by-step instructions is available for this topic. View the video answer and/or follow the detailed instructions below.

Video - Analysing Disk Errors (5:00)

Solution

Steps to Follow
Confirming Disk failure for failing drives

Most I/O errors for failing drives on Sun Fire[TM] servers are caused by the disk itself, not by the disk backplane or cables. Several checks can confirm a disk failure from I/O errors.

First, verify whether 'format' reports a device problem. A typical example is 'format' showing 'drive type unknown' for a specific drive. Server platforms such as the 280R, V480/V490, and V880/V890 use FC-AL disk drives. Each FC-AL disk has a World Wide Name (WWN), which appears in the device path that Solaris[TM] shows in the format output:

AVAILABLE DISK SELECTIONS:
0. c1t0d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371e4d,0
1. c1t1d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6372ccc,0
2. c1t2d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0

After checking the format output, also examine /var/adm/messages for matching disk drive errors:

Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0 (ssd0):
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice]  Error for Command: read(10)
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice]  Error Level: Retryable
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice]  Requested Block: 404016 Error Block: 404016
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice]  Vendor: SEAGATE Serial Number: 0446B9xxxx
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice]  Sense Key: Unit Attention
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice]  ASC: 0x29 (), ASCQ: 0x3, FRU: 0x4

Errors like these generally indicate that the listed drive needs to be replaced. To confirm the failing drive, map the WWN in the above messages (w21000011c6371bc0,0) to a drive in the output of the format command. In this case it matches c1t2d0.
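The mapping step described above can be sketched in Python. This is a minimal illustration, not a supported tool: it assumes the condensed one-line-per-disk format listing and the WARNING line layout shown in this document, and the function names are made up for the example.

```python
import re

# Sample text copied from the format and /var/adm/messages excerpts above.
FORMAT_OUTPUT = """\
0. c1t0d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371e4d,0
1. c1t1d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6372ccc,0
2. c1t2d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0
"""

WARNING_LINE = (
    "Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.warning] WARNING: "
    "/pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0 (ssd0):"
)


def parse_format(text):
    """Return {physical device path: cXtYdZ name} from format-style lines."""
    mapping = {}
    for line in text.splitlines():
        m = re.match(r"\s*\d+\.\s+(c\d+t\d+d\d+)\s+(/\S+)", line)
        if m:
            mapping[m.group(2)] = m.group(1)
    return mapping


def path_from_warning(line):
    """Extract the device path that follows 'WARNING:' in a scsi log line."""
    m = re.search(r"WARNING:\s+(/\S+)\s+\(\w+\):", line)
    return m.group(1) if m else None


def disk_for_warning(format_text, warning_line):
    """Look up the cXtYdZ name whose device path matches the warning."""
    return parse_format(format_text).get(path_from_warning(warning_line))


print(disk_for_warning(FORMAT_OUTPUT, WARNING_LINE))  # c1t2d0
```

Because the WWN is part of the device path, an exact match on the whole path is enough here; no separate WWN extraction is needed.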

Here is another example, showing format output for server platforms that use SCSI drives (such as the V215/V245, V440/V445, and T1000/T2000):

AVAILABLE DISK SELECTIONS:
0. c0t0d0 /pci@1f,700000/scsi@2/sd@0,0
1. c0t1d0 /pci@1f,700000/scsi@2/sd@1,0

The following errors appear in /var/adm/messages:

Nov 20 12:28:51 sg5000-maildb-0 scsi: WARNING: /pci@1f,700000/scsi@2/sd@1,0 (sd2):
Nov 20 12:28:51 sg5000-maildb-0 scsi: Error for Command: persistent reservation in Error Level: Informational
Nov 20 12:28:51 sg5000-maildb-0 scsi:   Requested Block: 0 Error Block: 0
Nov 20 12:28:51 sg5000-maildb-0 scsi:   Vendor: SEAGATE Serial Number: 0449B9xxxx
Nov 20 12:28:51 sg5000-maildb-0 scsi:   Sense Key: Soft Error
Nov 20 12:28:51 sg5000-maildb-0 scsi:   ASC: 0x5d (drive operation marginal, service immediately (failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x5

In the above example, the device path from messages (/pci@1f,700000/scsi@2/sd@1,0) matches disk c0t1d0 in the format output, so that disk needs to be replaced.
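When many such blocks have to be read, it can help to pull out the key fields automatically. The following Python sketch extracts the device path, driver instance, Sense Key, ASC and ASCQ from one error block. It assumes the field layout of the excerpt above; the function name and the dictionary keys are illustrative only.

```python
import re

# One error block, copied from the /var/adm/messages excerpt above.
MESSAGES = """\
Nov 20 12:28:51 sg5000-maildb-0 scsi: WARNING: /pci@1f,700000/scsi@2/sd@1,0 (sd2):
Nov 20 12:28:51 sg5000-maildb-0 scsi: Error for Command: persistent reservation in Error Level: Informational
Nov 20 12:28:51 sg5000-maildb-0 scsi:   Requested Block: 0 Error Block: 0
Nov 20 12:28:51 sg5000-maildb-0 scsi:   Vendor: SEAGATE Serial Number: 0449B9xxxx
Nov 20 12:28:51 sg5000-maildb-0 scsi:   Sense Key: Soft Error
Nov 20 12:28:51 sg5000-maildb-0 scsi:   ASC: 0x5d (drive operation marginal, service immediately (failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x5
"""


def summarize_scsi_error(text):
    """Pull the device path, sense key, ASC and ASCQ out of one error block."""
    fields = {}
    m = re.search(r"WARNING:\s+(/\S+)\s+\((\w+)\):", text)
    if m:
        fields["path"], fields["instance"] = m.group(1), m.group(2)
    m = re.search(r"Sense Key:\s*(.+)", text)
    if m:
        fields["sense_key"] = m.group(1).strip()
    m = re.search(r"ASC:\s*(0x[0-9a-fA-F]+).*?ASCQ:\s*(0x[0-9a-fA-F]+)", text)
    if m:
        fields["asc"] = int(m.group(1), 16)
        fields["ascq"] = int(m.group(2), 16)
    return fields


summary = summarize_scsi_error(MESSAGES)
print(summary)
# ASC 0x5d is the code that the log line itself labels as
# "failure prediction threshold exceeded".
print("predictive failure reported:", summary.get("asc") == 0x5D)
```

The extracted path can then be compared with the format output exactly as in the manual procedure.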

When troubleshooting I/O errors for failing devices, also examine the output of 'iostat -E' (or 'iostat -En') for matching error events on the disk drives.

Note: Use iostat only after errors have been seen in messages. iostat output alone should never be used to justify a disk replacement; it counts as evidence only when there are matching errors in the messages file.
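The rule in this note can be expressed as a small decision sketch in Python. The function name and the input values are hypothetical and only illustrate the logic: a disk with non-zero iostat counters becomes a replacement candidate only when it also has errors in messages.

```python
def triage(disks_with_message_errors, iostat_error_totals):
    """Split disks into replacement candidates and disks to keep watching.

    disks_with_message_errors: set of cXtYdZ names seen in /var/adm/messages.
    iostat_error_totals: {cXtYdZ: summed soft + hard + transport errors}.
    """
    replace, monitor = [], []
    for disk, total in sorted(iostat_error_totals.items()):
        if total == 0:
            continue  # nothing reported by iostat for this disk
        if disk in disks_with_message_errors:
            replace.append(disk)  # both sources agree
        else:
            monitor.append(disk)  # iostat alone is not sufficient evidence
    return replace, monitor


# Hypothetical inputs for illustration only.
replace, monitor = triage({"c1t2d0"}, {"c1t1d0": 1, "c1t2d0": 394, "c1t3d0": 0})
print("replacement candidates:", replace)
print("monitor only:", monitor)
```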

Look for non-zero counts in the output of 'iostat' (usually on the 1st, 4th, and 5th lines of each disk entry):

# iostat -En
c1t1d0          Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST373307LSUN72G  Revision: 0507 Serial No: 3HZ7CC470000xxxx
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 1
Illegal Request: 0 Predictive Failure Analysis: 0
c1t2d0          Soft Errors: 0 Hard Errors: 394 Transport Errors: 0
Vendor: SEAGATE  Product: ST373307LSUN72G  Revision: 0507 Serial No: 3HZ7CLM60000xxxx
Size: 0.00GB <0 bytes>
Media Error: 394 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

If more than one disk has non-zero counts (as in the above example), there may be a problem on one disk and a side effect of that problem on the other. In this case, the error counts on the failing drive c1t2d0 are significantly higher than those on the other disk, c1t1d0.
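Comparing counters across disks is easier once the output is parsed. The Python sketch below reads 'iostat -En' style text into per-disk counters and lists the disks from highest to lowest error total. It assumes the counter names and line layout shown in this document; the function names are illustrative.

```python
import re

# Counter names as they appear in the iostat -En output shown above.
COUNTERS = [
    "Soft Errors", "Hard Errors", "Transport Errors",
    "Media Error", "Device Not Ready", "No Device", "Recoverable",
    "Illegal Request", "Predictive Failure Analysis",
]

IOSTAT_EN = """\
c1t1d0          Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST373307LSUN72G  Revision: 0507 Serial No: 3HZ7CC470000xxxx
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 1
Illegal Request: 0 Predictive Failure Analysis: 0
c1t2d0          Soft Errors: 0 Hard Errors: 394 Transport Errors: 0
Vendor: SEAGATE  Product: ST373307LSUN72G  Revision: 0507 Serial No: 3HZ7CLM60000xxxx
Size: 0.00GB <0 bytes>
Media Error: 394 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
"""


def parse_iostat_e(text):
    """Return {device name: {counter name: value}} from iostat -E/-En text."""
    disks, current = {}, None
    for line in text.splitlines():
        header = re.match(r"(\S+)\s+Soft Errors:", line)
        if header:
            current = disks.setdefault(header.group(1), {})
        if current is None:
            continue  # skip anything before the first device entry
        for name in COUNTERS:
            m = re.search(re.escape(name) + r":\s*(\d+)", line)
            if m:
                current[name] = int(m.group(1))
    return disks


def top_level_errors(counters):
    """Sum the three counters from the first line of a disk entry."""
    return sum(counters.get(k, 0)
               for k in ("Soft Errors", "Hard Errors", "Transport Errors"))


disks = parse_iostat_e(IOSTAT_EN)
ranked = sorted(disks, key=lambda d: top_level_errors(disks[d]), reverse=True)
for name in ranked:
    nonzero = {k: v for k, v in disks[name].items() if v}
    print(name, nonzero)
```

The disk at the top of the ranking is only a suspect; it still has to be matched against errors in /var/adm/messages before replacement.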

A disk problem reported in the 'format' output (or messages) typically translates to a high error count in iostat, for example:

2. c1t2d0 /pci@1f,700000/scsi@2/sd@2,0

# iostat -E
..........
c1t2d0          Soft Errors: 0 Hard Errors: 932 Transport Errors: 0
Vendor: FUJITSU  Product: MAP3735N SUN72G  Revision: 0401 Serial No: 0415Q0XXXX
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 335 No Device: 190 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

However, a non-zero count in 'iostat -E' output does not always mean an error event on a device. Certain conditions of the target device can cause non-zero values in the 'iostat' output. The following is an example of such a condition, where the device is working normally:

# iostat -E
ssd10    Soft Errors: 0 Hard Errors: 10 Transport Errors: 0
Vendor: SEAGATE  Product: ST336605FSUN36G  Revision: 0638 Serial No:  0201P1xxxx
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 10 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

In this case, the "Hard Errors" and "No Device" counts are equal. This implies that the device has gone through resets or power-on events. The device does not need immediate replacement. Monitor the values over a period of time, and investigate further if other related errors appear.
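This pattern can be checked mechanically. The Python sketch below is a heuristic drawn only from the description above, not an official rule: it reports True when every hard error is accounted for by "No Device" events and all other counters are zero. The function name is illustrative, and the two counter sets are taken from the iostat examples in this document.

```python
def looks_like_reset_only(counters):
    """True when Hard Errors equals No Device and no other counter is set."""
    hard = counters.get("Hard Errors", 0)
    others = [v for k, v in counters.items()
              if k not in ("Hard Errors", "No Device")]
    return hard > 0 and hard == counters.get("No Device", 0) and not any(others)


# Counters from the ssd10 example (device working normally).
ssd10 = {
    "Soft Errors": 0, "Hard Errors": 10, "Transport Errors": 0,
    "Media Error": 0, "Device Not Ready": 0, "No Device": 10,
    "Recoverable": 0, "Illegal Request": 0, "Predictive Failure Analysis": 0,
}

# Counters from the earlier c1t2d0 example (failing drive with media errors).
c1t2d0 = {
    "Soft Errors": 0, "Hard Errors": 394, "Transport Errors": 0,
    "Media Error": 394, "Device Not Ready": 0, "No Device": 0,
    "Recoverable": 0, "Illegal Request": 0, "Predictive Failure Analysis": 0,
}

print("ssd10 reset-only pattern:", looks_like_reset_only(ssd10))
print("c1t2d0 reset-only pattern:", looks_like_reset_only(c1t2d0))
```

A True result means "monitor over time", as described above; it does not rule out a fault if related errors show up later.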

Refer to  Document: 1017741.1   Solaris Operating System: High Hard Error value in iostat -E output  for more details.

NOTE: A helpful utility, "diskinfo.sparc", is included in Sun Explorer. It reports current disk model and serial number information, even after a disk hot swap. For example:

# /opt/SUNWexplo/bin/diskinfo.sparc

AVAILABLE SCSI DEVICES:
   Location     Vendor          Product         Rev  Serial #
    c1t0d0      FUJITSU    MAP3147F SUN146G     1601 0515R0304B
    c1t1d0      SEAGATE    ST373307FSUN72G      0307 0426B7MQQ5
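The serial numbers in this listing can be used to find which location holds a given drive. The following Python sketch parses the table into a dictionary and looks up a disk by serial number. It assumes the column layout shown above (the product name may contain spaces); the function names are illustrative, and serial numbers printed in other outputs may be formatted differently.

```python
import re

# Table rows copied from the diskinfo.sparc example above.
DISKINFO = """\
   Location     Vendor          Product         Rev  Serial #
    c1t0d0      FUJITSU    MAP3147F SUN146G     1601 0515R0304B
    c1t1d0      SEAGATE    ST373307FSUN72G      0307 0426B7MQQ5
"""


def parse_diskinfo(text):
    """Return {location: {vendor, product, rev, serial}} for each disk row."""
    disks = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a cXtYdZ location; the header row does not.
        if len(parts) < 5 or not re.fullmatch(r"c\d+t\d+d\d+", parts[0]):
            continue
        disks[parts[0]] = {
            "vendor": parts[1],
            "product": " ".join(parts[2:-2]),  # product may span two words
            "rev": parts[-2],
            "serial": parts[-1],
        }
    return disks


def disk_by_serial(disks, serial):
    """Return the location whose serial number matches, or None."""
    for name, info in disks.items():
        if info["serial"] == serial:
            return name
    return None


inventory = parse_diskinfo(DISKINFO)
print(inventory["c1t0d0"])
print(disk_by_serial(inventory, "0426B7MQQ5"))  # c1t1d0
```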


Internal Comments
  This document contains normalized content and is managed by the Domain Lead(s) of the
  respective domains. To notify content owners of a knowledge gap contained in this document,
  and/or prior to updating this document, please contact the domain engineers that are managing this document via the "Document Feedback" alias(es) listed below:    VSP-SPARC-Normalization@sun.com

  Note: Some of the error examples in this document list Vendor, Sense Key, ASC and ASCQ information.
  These values will vary with the type of drive error and are explained further in
  Doc ID 1005787.1, Kernel tips: understanding SCSI and its errors.
 
  normalized, I/O errors, failed drive, format, iostat, Problem Solved = Disk Error Verification
  Previously Published As   91406
  Change History
  Date: 2009-11-18
  User name: Dencho Kojucharov
  Action: Updated
  Comments: Currency check, audited by Dencho Kojucharov, Entry-Level SPARC Content Lead

 

References

<NOTE:778.1> - Troubleshooting Video Issues in MOS
<NOTE:1017741.1> - Solaris Operating System High Hard Error value in iostat -E output

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.