Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1534963.1
Update Date:2017-05-01
Keywords:

Solution Type  Problem Resolution Sure

Solution  1534963.1 :   How to verify a bad disk in a Cougar-controlled RAID 5 array  


Related Items
  • Sun SPARC Enterprise T5220 Server
  • Sun SPARC Enterprise T5240 Server
  • Sun SPARC Enterprise T5140 Server
  • Sun Netra T5220 Server
  • Sun SPARC Enterprise T5120 Server
  • Sun Netra T5440 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T5xx0


This document shows where to look to verify a bad disk in a Cougar-controlled RAID 5 array.

In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-6900218801>

Applies to:

Sun SPARC Enterprise T5120 Server - Version All Versions and later
Sun Netra T5440 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Enterprise T5220 Server - Version All Versions and later
Sun SPARC Enterprise T5240 Server - Version All Versions and later
Sun SPARC Enterprise T5140 Server - Version All Versions and later
Information in this document applies to any platform.

Symptoms

The customer reports that a disk (0,11) in a Cougar-controlled array has disappeared from the configuration.

Cause

First, check the output of "/opt/StorMan/arcconf getconfig 1" and look for the missing disk. In this case, the disk belongs to logical device 2 (LD_2):

...
Logical device number 2
  Logical device name                      : LD_2
  RAID level                               : 5
  Status of logical device                 : Optimal
  Size                                     : 857078 MB
  Stripe-unit size                         : 256 KB
  Read-cache mode                          : Enabled
  Write-cache mode                         : Enabled (write-back)
  Write-cache setting                      : Enabled (write-back)
  Partitioned                              : Yes
  Protected by Hot-Spare                   : Yes
  Dedicated Hot-Spare                      : 0,19
  Bootable                                 : No
  Failed stripes                           : No
  --------------------------------------------------------
  Logical device segment information
  --------------------------------------------------------
  Segment 0                                : Present (0,10) 00100371LSP2        3SE1LSP2
  Segment 1                                : Spare (0,18) 00100271JLP0        3SE1JLP0  <--------------------- Missing (0,11); the hot spare has likely kicked in.
  Segment 2                                : Present (0,12) 00100271HQ7F        3SE1HQ7F
  Segment 3                                : Present (0,13) 00100371LJSC        3SE1LJSC

...
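The segment check above can also be scripted. The following is a minimal Python sketch, assuming the getconfig output has been saved as text and that each segment line follows the "Segment N : State (channel,device) ..." layout shown above; the names `segments`, `non_present`, and `SAMPLE_LD` are hypothetical. It lists every segment whose state is not "Present", such as the "Spare" segment in LD_2.

```python
import re

# Matches segment lines such as
#   "  Segment 1    : Spare (0,18) 00100271JLP0        3SE1JLP0".
# The (channel,device) part is optional, in case a state is printed without one
# (an assumption; only the "Present" and "Spare" forms appear in this note).
SEGMENT_RE = re.compile(r"^\s*Segment\s+(\d+)\s*:\s*(\w+)(?:\s+\((\d+),(\d+)\))?")

def segments(getconfig_text):
    """Return (segment, state, (channel, device) or None) for each segment line."""
    found = []
    for line in getconfig_text.splitlines():
        m = SEGMENT_RE.match(line)
        if m:
            seg, state, ch, dev = m.groups()
            loc = (int(ch), int(dev)) if ch is not None else None
            found.append((int(seg), state, loc))
    return found

def non_present(getconfig_text):
    """Segments whose state is anything other than 'Present'."""
    return [s for s in segments(getconfig_text) if s[1] != "Present"]

# Segment lines copied from the LD_2 output above.
SAMPLE_LD = """\
  Segment 0                                : Present (0,10) 00100371LSP2        3SE1LSP2
  Segment 1                                : Spare (0,18) 00100271JLP0        3SE1JLP0
  Segment 2                                : Present (0,12) 00100271HQ7F        3SE1HQ7F
  Segment 3                                : Present (0,13) 00100371LJSC        3SE1LJSC
"""

for seg, state, loc in non_present(SAMPLE_LD):
    print(f"Segment {seg}: {state} at {loc}")  # prints: Segment 1: Spare at (0, 18)
```

A "Spare" segment by itself only says that a hot spare is standing in for a member; identifying which member is gone still means comparing against the expected (channel,device) numbering, here 0,10 through 0,13.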

 

Disk 0,11 does not appear in the segment list above, nor in the Physical Device information section:

...
      Device #2
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device            : 0,10
         Reported Location                  : Enclosure 0, Slot 2
         Reported ESD                       : 2,0
         Vendor                             : SEAGATE
         Model                              : ST930003SSUN300G
         Firmware                           : 0868
         Serial number                      : 00100371LSP2        3SE1LSP2
         World-wide name                    : 5000C5001D2033BC
         Size                               : 286102 MB
         Write Cache                        : Disabled (write-through)
         FRU                                : None
         S.M.A.R.T.                         : No
      Device #3
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device            : 0,12
         Reported Location                  : Enclosure 0, Slot 4
         Reported ESD                       : 2,0
         Vendor                             : SEAGATE
         Model                              : ST930003SSUN300G
         Firmware                           : 0868
         Serial number                      : 00100271HQ7F        3SE1HQ7F
         World-wide name                    : 5000C5001D097D34
         Size                               : 286102 MB
         Write Cache                        : Disabled (write-through)
         FRU                                : None
         S.M.A.R.T.                         : No

Normally, a failed disk would still be listed in the physical device output above. In this case, disk 0,11 has disappeared from the configuration entirely, which most likely indicates a hard failure in which the Cougar card can no longer reach the disk.
However, the RAID event log (/opt/StorMan/RaidEvtA.log) recorded the disk failure events:
March 4, 2013 9:22:08 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 1c 54 58 00 00 02 00 00 00 00]    <------------------ Original error.
March 4, 2013 9:22:18 PM CST INF                   ILHH241FDA Sense data: Unit attention (BUS DEVICE RESET FUNCTION OCCURRED). Controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 1c 71 96 00 00 02 00 00 00 00], data [70 00 06 00 00 00 00 00 00 00 00 00 29 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00]  
March 4, 2013 9:28:10 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 14 b9 c8 00 00 02 00 00 00 00]  
March 4, 2013 9:28:21 PM CST INF                   ILHH241FDA Sense data: Unit attention (BUS DEVICE RESET FUNCTION OCCURRED). Controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 00 01 15 eb 00 00 10 00 00 00], data [70 00 06 00 00 00 00 00 00 00 00 00 29 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00]  
March 4, 2013 9:30:53 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [2a 00 00 06 5c 00 00 02 00 00 00 00]  
March 4, 2013 9:30:58 PM CST INF                   ILHH241FDA Sense data: Unit attention (BUS DEVICE RESET FUNCTION OCCURRED). Controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [2a 00 1c 60 04 00 00 02 00 00 00 00], data [70 00 06 00 00 00 00 00 00 00 00 00 29 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00]  
March 4, 2013 9:32:11 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 22 ec b1 e0 00 00 01 00 00 00]  
March 4, 2013 9:32:27 PM CST INF                   ILHH241FDA Sense data: Unit attention (BUS DEVICE RESET FUNCTION OCCURRED). Controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 00 cb fe 00 00 02 00 00 00 00], data [70 00 06 00 00 00 00 00 00 00 00 00 29 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00]  
March 4, 2013 9:33:34 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 00 01 a6 00 00 02 00 00 00 00]  
March 4, 2013 9:33:55 PM CST INF                   ILHH241FDA Sense data: Unit attention (BUS DEVICE RESET FUNCTION OCCURRED). Controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 22 ec b2 23 00 00 01 00 00 00], data [70 00 06 00 00 00 00 00 00 00 00 00 29 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00]  
March 4, 2013 9:53:41 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 1c 5d 94 f6 00 00 10 00 00 00]    
March 4, 2013 9:54:05 PM CST ERR                   ILHH241FDA Drive in a RAID-5 set failed: controller 1, logical device 2  <-------------- RAID 5 failover
March 4, 2013 9:54:05 PM CST ERR                   ILHH241FDA Disk failed: controller 1, channel 0, SCSI device ID 11
March 4, 2013 9:54:40 PM CST INF                   ILHH241FDA Container changed: controller 1, logical device 2
March 4, 2013 9:54:40 PM CST INF                   ILHH241FDA PPI update.  Age 278
March 4, 2013 9:54:40 PM CST INF                   ILHH241FDA PPI update.  Age 279
March 4, 2013 9:54:45 PM CST INF                   ILHH241FDA Container changed: controller 1, logical device 2
March 4, 2013 9:54:46 PM CST INF                   ILHH241FDA PPI update.  Age 280
March 4, 2013 9:54:46 PM CST INF                   ILHH241FDA Failover disk changed: controller 1, logical device 2
March 4, 2013 9:54:46 PM CST INF                   ILHH241FDA Failover and rebuild operation started on a RAID-5 set: controller 1, logical device 2
March 4, 2013 9:54:46 PM CST INF                   ILHH241FDA Container changed: controller 1, logical device 2
March 4, 2013 9:54:46 PM CST INF                   ILHH241FDA Configuration has changed.
March 4, 2013 9:55:01 PM CST WRN      301:A01C-S--L02 ILHH241FDA Logical device is degraded: controller 1, logical device 2 ("LD_2").
March 4, 2013 9:55:01 PM CST ERR      401:A01C0S11L-- ILHH241FDA Failed drive: controller 1, enclosure 0, slot 3, S/N 00100271HQVC        3SE1HQVC (Vendor: SEAGATE Model: ST930003SSUN300G).  <--------------- Controller acknowledging the disk is bad.
March 4, 2013 9:55:06 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 0%. Controller 1, logical device 2   <------------------- Rebuilding the RAID.
March 4, 2013 9:55:06 PM CST INF      304:A01C-S--L02 ILHH241FDA Rebuilding: controller 1, logical device 2 ("LD_2").
March 4, 2013 9:57:33 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 5%. Controller 1, logical device 2
March 4, 2013 10:00:09 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 10%. Controller 1, logical device 2
March 4, 2013 10:02:50 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 15%. Controller 1, logical device 2
March 4, 2013 10:06:03 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 20%. Controller 1, logical device 2
March 4, 2013 10:08:39 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 25%. Controller 1, logical device 2
March 4, 2013 10:11:20 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 30%. Controller 1, logical device 2
March 4, 2013 10:14:51 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 35%. Controller 1, logical device 2
March 4, 2013 10:17:43 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 40%. Controller 1, logical device 2
March 4, 2013 10:20:24 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 45%. Controller 1, logical device 2
March 4, 2013 10:23:31 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 50%. Controller 1, logical device 2
March 4, 2013 10:26:31 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 55%. Controller 1, logical device 2
March 4, 2013 10:27:46 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [1a 08 1c 00 10 00 00 02 00 00 00 00]  
March 4, 2013 10:30:39 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 60%. Controller 1, logical device 2
March 4, 2013 10:33:26 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [1a 08 3f 00 ff 00 00 02 00 00 00 00]  
March 4, 2013 10:33:26 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [1a 08 3f 00 ff 00 00 02 00 00 00 00]  
March 4, 2013 10:34:59 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 65%. Controller 1, logical device 2
March 4, 2013 10:38:08 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 70%. Controller 1, logical device 2
March 4, 2013 10:41:15 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 75%. Controller 1, logical device 2
March 4, 2013 10:44:32 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 80%. Controller 1, logical device 2
March 4, 2013 10:47:51 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 85%. Controller 1, logical device 2
March 4, 2013 10:51:29 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 90%. Controller 1, logical device 2
March 4, 2013 10:55:37 PM CST INF                   ILHH241FDA Running: RAID 5 rebuild - 95%. Controller 1, logical device 2
March 4, 2013 10:59:36 PM CST INF                   ILHH241FDA Complete: RAID 5 rebuild - 100%. Controller 1, logical device 2
March 4, 2013 10:59:36 PM CST INF                   ILHH241FDA PPI update.  Age 281
March 4, 2013 10:59:36 PM CST INF                   ILHH241FDA Container changed: controller 1, logical device 2
March 4, 2013 10:59:36 PM CST INF                   ILHH241FDA PPI update.  Age 282
March 4, 2013 10:59:36 PM CST INF                   ILHH241FDA RAID-5 rebuild operation completed successfully: controller 1, logical device 2
March 4, 2013 10:59:36 PM CST INF                   ILHH241FDA Complete: RAID 5 rebuild - 100%. Controller 1, logical device 2
March 4, 2013 10:59:36 PM CST INF                   ILHH241FDA Configuration has changed.
March 4, 2013 10:59:37 PM CST INF      345:A01C-S--L02 ILHH241FDA Logical device is normal: controller 1, logical device 2 ("LD_2").
March 4, 2013 10:59:37 PM CST INF      305:A01C-S--L02 ILHH241FDA Rebuild complete: controller 1, logical device 2 ("LD_2").
March 4, 2013 11:53:13 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [1a 08 3f 00 ff 00 00 02 00 00 00 00]  
March 4, 2013 11:53:59 PM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [1a 08 3f 00 ff 00 00 02 00 00 00 00]  
March 5, 2013 1:43:28 AM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [1a 08 3f 00 ff 00 00 00 01 00 00 00]  
March 5, 2013 1:44:20 AM CST INF                   ILHH241FDA Sense data: Unit attention (BUS DEVICE RESET FUNCTION OCCURRED). Controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [1a 08 3f 00 ff 00 00 00 01 00 00 00], data [70 00 06 00 00 00 00 00 00 00 00 00 29 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00]  
March 5, 2013 1:44:56 AM CST WRN                   ILHH241FDA Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [00 00 00 00 00 00 00 00 01 00 00 00]  
March 5, 2013 1:50:18 AM CST INF      408:A01C0S11L-- ILHH241FDA Physical drive removed: controller 1, enclosure 0, slot 3, S/N 00100271HQVC        3SE1HQVC.   <----------------- Shows the drive was removed.
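Correlating these entries by hand is error-prone, so here is a minimal Python sketch of the same walk through the log. It assumes RaidEvtA.log lines keep the wording shown above (the sample lines are shortened copies of them), and `summarize`, `decode_sense`, and `SAMPLE` are hypothetical names. The decoding uses standard SCSI definitions: the first CDB byte is the operation code (28h READ(10), 2Ah WRITE(10), 1Ah MODE SENSE(6), 00h TEST UNIT READY), and in fixed-format sense data (response code 70h) byte 2 carries the sense key while bytes 12 and 13 carry the ASC and ASCQ, so "06 ... 29 03" means UNIT ATTENTION, BUS DEVICE RESET FUNCTION OCCURRED.

```python
import re
from collections import Counter

# Patterns mirror the RaidEvtA.log lines shown above (assumed stable format).
TARGET_RE = re.compile(r"channel (\d+), SCSI device ID (\d+)")
CDB_RE = re.compile(r"cdb \[([0-9a-f ]+)\]")
DATA_RE = re.compile(r"data \[([0-9a-f ]+)\]")
SLOT_RE = re.compile(r"slot (\d+), S/N (\S+)")

# Standard SCSI operation codes (first CDB byte) seen in this log.
OPCODES = {0x00: "TEST UNIT READY", 0x1A: "MODE SENSE(6)",
           0x28: "READ(10)", 0x2A: "WRITE(10)"}

def decode_sense(hex_bytes):
    """Fixed-format sense data -> (sense key, ASC, ASCQ)."""
    b = [int(x, 16) for x in hex_bytes.split()]
    return b[2] & 0x0F, b[12], b[13]

def summarize(lines, target):
    """Summarize log events for one (channel, device) pair, e.g. (0, 11)."""
    s = {"timeouts": Counter(), "unit_attentions": 0,
         "disk_failed": False, "failed_drive": None, "removed": None}
    for line in lines:
        m = TARGET_RE.search(line)
        on_target = m is not None and (int(m.group(1)), int(m.group(2))) == target
        if on_target and "Command timeout" in line:
            cdb = CDB_RE.search(line)
            name = "no cdb"
            if cdb:
                opcode = int(cdb.group(1).split()[0], 16)
                name = OPCODES.get(opcode, f"opcode {opcode:#04x}")
            s["timeouts"][name] += 1
        elif on_target and "Sense data" in line:
            data = DATA_RE.search(line)
            # Sense key 6h, ASC/ASCQ 29h/03h: bus device reset occurred.
            if data and decode_sense(data.group(1)) == (0x6, 0x29, 0x03):
                s["unit_attentions"] += 1
        elif on_target and "Disk failed" in line:
            s["disk_failed"] = True
        # "Failed drive" and "Physical drive removed" lines report the enclosure
        # slot and S/N instead of the SCSI device ID, so they are not filtered
        # by target; the last matching line wins.
        slot = SLOT_RE.search(line)
        if slot and "Failed drive" in line:
            s["failed_drive"] = (int(slot.group(1)), slot.group(2))
        elif slot and "Physical drive removed" in line:
            s["removed"] = (int(slot.group(1)), slot.group(2))
    return s

UA = "data [70 00 06 00 00 00 00 00 00 00 00 00 29 03 00 00 00 00]"
SAMPLE = [
    "WRN Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 1c 54 58 00 00 02 00 00 00 00]",
    "INF Sense data: Unit attention (BUS DEVICE RESET FUNCTION OCCURRED). Controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [28 00 1c 71 96 00 00 02 00 00 00 00], " + UA,
    "WRN Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [2a 00 00 06 5c 00 00 02 00 00 00 00]",
    "ERR Disk failed: controller 1, channel 0, SCSI device ID 11",
    "ERR 401:A01C0S11L-- Failed drive: controller 1, enclosure 0, slot 3, S/N 00100271HQVC        3SE1HQVC (Vendor: SEAGATE Model: ST930003SSUN300G).",
    "WRN Command timeout: controller 1, channel 0, SCSI device ID 11, LUN 0, cdb [00 00 00 00 00 00 00 00 01 00 00 00]",
    "INF 408:A01C0S11L-- Physical drive removed: controller 1, enclosure 0, slot 3, S/N 00100271HQVC        3SE1HQVC.",
]

s = summarize(SAMPLE, (0, 11))
print(dict(s["timeouts"]), s["unit_attentions"], s["disk_failed"])
print("failed:", s["failed_drive"], "removed:", s["removed"])
```

Timeouts on reads, writes, and even TEST UNIT READY, each followed by a bus device reset, are consistent with the hard failure described above rather than isolated media errors.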
 

Solution

Replace the failed disk drive (channel 0, device 11; enclosure 0, slot 3) with the appropriate part.

References

<NOTE:1509311.1> - How to isolate disk problems on an Adaptec RAID controller (Cougar)

  Copyright © 2018 Oracle, Inc.  All rights reserved.