Replacement disk not recognized on X4540

Asset ID:	1-72-1386502.1
Update Date:	2016-10-24
Keywords:

Solution Type Problem Resolution Sure

Solution 1386502.1 : Replacement disk not recognized on X4540

Applies to:

Sun Fire X4540 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire X4500 Server - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on x86 (32-bit)
***Checked for relevance on 15-Dec-2013***

Symptoms

Under very rare circumstances, a replacement disk is not recognized by the system. When "# cfgadm -al" is issued, the customer sees an sd instance under the Attachment Point (Ap_Id) instead of the actual target.

x4540# raidctl -l
Controller: 1
Disk: 0.0.0
'
'
Disk: 0.7.0
Controller: 2
Disk: 0.0.0
Disk: 0.1.0
Disk: 0.2.0
Disk: 0.3.0
Disk: 0.5.0
Disk: 0.6.0
Disk: 0.7.0

Discrepancy in cfgadm -alv

Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 disk connected configured unknown
c2::dsk/c2t1d0 disk connected configured unknown
c2::dsk/c2t2d0 disk connected configured unknown
c2::dsk/c2t3d0 disk connected configured unknown
c2::dsk/c2t5d0 disk connected configured unknown
c2::dsk/c2t6d0 disk connected configured unknown
c2::dsk/c2t7d0 disk connected configured unknown
c2::sd30 disk connected unconfigured unknown <<<<<<<<<<<<<
(The sd instance number is recorded as the Attachment Point ID: this is a stale entry.)
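
As a rough aid for spotting this symptom, the sketch below filters "cfgadm -al" style output for attachment points named after an sd instance and lists the target numbers that have no dsk entry. The function names are hypothetical, and the 0 to 7 target range is an assumption taken from the raidctl listing above.

```shell
# Hypothetical helpers; both read "cfgadm -al" style output on stdin.

# Print every Ap_Id that is an sd instance (e.g. c2::sd30)
# instead of a disk target (e.g. c2::dsk/c2t4d0).
find_stale_ap_ids() {
    awk '$1 ~ /^c[0-9]+::sd[0-9]+$/ { print $1 }'
}

# Print the target numbers 0..7 that have no dsk entry on controller $1.
# Assumption: eight targets per controller, as in the raidctl listing.
find_missing_targets() {
    awk -v c="$1" '
        BEGIN { prefix = c "::dsk/" c "t" }
        index($1, prefix) == 1 {
            rest = substr($1, length(prefix) + 1)   # e.g. "4d0"
            sub(/d.*/, "", rest)                    # keep only the target number
            seen[rest] = 1
        }
        END { for (t = 0; t <= 7; t++) if (!(t in seen)) print t }
    '
}

# Example use on a live system (not run here):
#   cfgadm -al | find_stale_ap_ids
#   cfgadm -al | find_missing_targets c2
```

With the listing above, the first helper would report c2::sd30 and the second would report target 4, which suggests that c2t4d0 is the disk hidden behind the stale entry.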

Changes

The customer had replaced a disk.

Cause

Cause 1.
Quite possibly, the disk replacement was not done as per the procedure
(refer to the links below for the procedure).
Customers should simply unconfigure the disk and, once the "ready to remove" light is lit, remove the disk and replace it with the new disk.

Verify that the blue LED (Light Emitting Diode) on the disk turns off after one minute. If the blue LED does not turn off after one minute, you can have the O/S re-enumerate device nodes and links by typing:

# devfsadm -C

Cause 2.
Assumptions:
A bug in the disk firmware, the disk controller (LSI) firmware, or the sd or cfgadm driver patches.

Solution

Most customers will not agree to downtime, so the possible ways to recover from this situation without a reboot are:

Step a.
- Pull out the disk
- Perform a:

# devfsadm -Cv #command to clear all stale entries

- Perform a:

# cfgadm -c unconfigure c2::sd30 # command to unconfigure the device

Step b.
- Insert the disk
- Perform a:

# devfsadm -v [#command to scan for new devices]

- Check the format output to see whether the disk is detected
- Perform a:

# cfgadm -c configure c2 [#configure all the underlying devices on controller 2]
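
To confirm the result of steps 'a' and 'b', one option is to check the "cfgadm -al" listing for the expected disk. The sketch below is a minimal, hypothetical helper (ap_is_configured is not a system command); it assumes the column order shown in the listing above (Ap_Id, Type, Receptacle, Occupant, Condition).

```shell
# Hypothetical helper: reads "cfgadm -al" style output on stdin and
# succeeds only if the Ap_Id given as $1 is connected and configured.
ap_is_configured() {
    awk -v ap="$1" '
        $1 == ap && $3 == "connected" && $4 == "configured" { found = 1 }
        END { exit !found }      # exit status 0 when found, 1 otherwise
    '
}

# Example use on a live system (not run here):
#   if cfgadm -al | ap_is_configured c2::dsk/c2t4d0; then
#       echo "disk is configured"
#   else
#       echo "disk still missing, continue with Step c"
#   fi
```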

If steps 'a' and 'b' fail:

Step c.
Repeat the procedure using the commands below:

# cfgadm -c unconfigure c2::sd30 [unconfigure the device; if this fails, try the next command]
# cfgadm -x remove_device c2::sd30 [forceful removal]
# cfgadm -al [check whether c2::sd30 is gone]
# cfgadm -x insert_device c2::dsk/c2t4d0

If the above command fails, run:

# cfgadm -x insert_device c2

Verify that the device c2::dsk/c2t4d0 has been added by running cfgadm -al, then run:

# cfgadm -c configure c2::dsk/c2t4d0
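
Because the Step c commands depend on the stale Ap_Id and on the expected disk target, it can help to print the sequence for review before typing it. The sketch below only prints the commands and runs nothing; print_step_c_plan is a hypothetical name, and the two arguments are whatever the cfgadm listing shows on the affected system.

```shell
# Hypothetical helper: print the Step c command sequence for review.
# Nothing is executed.
#   $1 = stale Ap_Id, e.g. c2::sd30
#   $2 = expected disk Ap_Id, e.g. c2::dsk/c2t4d0
print_step_c_plan() {
    stale=$1
    disk=$2
    ctrl=${stale%%::*}      # controller part of the Ap_Id, e.g. c2
    printf '%s\n' \
        "cfgadm -c unconfigure $stale" \
        "cfgadm -x remove_device $stale    # only if the unconfigure fails" \
        "cfgadm -al                        # check that $stale is gone" \
        "cfgadm -x insert_device $disk" \
        "cfgadm -x insert_device $ctrl     # only if the previous insert fails" \
        "cfgadm -al                        # check that $disk is listed" \
        "cfgadm -c configure $disk"
}

# Example:
#   print_step_c_plan c2::sd30 c2::dsk/c2t4d0
```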

If the above step fails, customers are requested to reboot the server (a cold reboot is preferred).

Why:
Because devfsadm has not responded even after the unconfigure command was performed, the corruption persists and the cx::sdxx entries are not removed.

If the issue is not resolved after a reboot or cold power cycle, engage Oracle Support.

Important Document Links:

Replacing a drive that has not been explicitly failed by ZFS
Server Maintenance Documentation

Note: 1.
These types of issues are generally due either to incorrect procedures followed during disk replacement or to a genuine driver bug (SN-DK Storage Drivers).

Customers should be advised to keep their system firmware, disk controller firmware and disk firmware up to date.

TSC is requested to engage the drivers team, SN-DK Storage Drivers (MOS Group). Why: Because the fault lies in the driver.

Attachments

This solution has no attachment