Recommended Setting for the "fcp_offline_delay" Variable When Upgrading a Sun Storage 6000, 2500 or 2500-M2 Controller Firmware

Asset ID:	1-79-1569976.1
Update Date:	2017-07-17

Solution Type Predictive Self-Healing Sure

Applies to:

Sun Storage 6180 Array - Version All Versions to All Versions [Release All Releases]
Sun Storage 6580 Array - Version All Versions to All Versions [Release All Releases]
Sun Storage 2540 Array - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 2540-M2 Array - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage Traffic Manager Software - Version 3.0 to 4.6 [Release 3.0 to 4.0]
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on SPARC (32-bit)
Oracle Solaris on x86 (32-bit)

Purpose

This document recommends fibre channel timeout values to apply on Solaris hosts (SPARC or x86 only) before performing a controller firmware upgrade on Sun Storage 6000, 2500 or 2500-M2 arrays.

If I/O is quiesced before proceeding with the controller firmware upgrade, do not follow the recommendations in this document. These recommendations are only for online controller firmware upgrades, while I/O is running.

Details

A Sun Storage 6000, 2500 or 2500-M2 controller firmware upgrade goes through the following steps:

1. Sun Storage Common Array Manager (CAM) sends the new firmware file, in chunks, to controller A.
2. Controller A verifies the checksum of each chunk and, if it is valid, sends a copy to controller B.
3. Once the transfer completes, CAM requests controller A to activate the new firmware.
4. Controller A then reboots, which takes between 20 and 40 seconds.
5. After controller A completes its reboot, controller B is automatically rebooted, which also takes between 20 and 40 seconds.

When controller A reboots, its fibre channel targets (the controller host ports) disappear. Sun StorEdge Traffic Manager (STMS, also known as MPxIO) is designed to offline a path 20 seconds after its targets disappear, so at that point STMS fails over to the controller B path. About 20 seconds later, when controller A comes back online, the targets reappear and STMS can fail the volumes back; however, this may take some time depending on the workload in a production environment. Controller B is nevertheless rebooted automatically to activate its firmware, and when its targets disappear, STMS may not yet have failed back all the volumes. As a result, host applications may temporarily lose access to the volumes, which may lead to "transport rejected fatal error" events like the following:

Jul 11 11:35:51 <hostname> scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600a0b80006e32e20000cdcb4d92c41f (ssd29):
Jul 11 11:35:51 <hostname>       transport rejected fatal error
Jul 11 11:35:51 <hostname> scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600a0b80006e33820000cd2e4d81b669 (ssd14):
Jul 11 11:35:51 <hostname>       transport rejected fatal error
Jul 11 11:35:51 <hostname> scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600a0b80006e32e20000ccb34d8097b2 (ssd2):
Jul 11 11:35:51 <hostname>       transport rejected fatal error

The above scenario is specific to the controller firmware upgrade only.

To prevent these transport errors, temporarily increase fcp_offline_delay, the timeout variable that controls when a path is offlined. By default this variable is set to 20 seconds. Before performing a controller firmware upgrade, increase it to 50 seconds; after the upgrade completes, it can be changed back to 20 seconds.
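The effect of the timeout can be illustrated with a deliberately simplified model. The sketch below assumes that a path is offlined only when its targets stay absent for at least fcp_offline_delay seconds, and it uses the 20 to 40 second reboot range described above. The path_outcome function is a hypothetical helper written for this illustration, not a Solaris utility, and treating the exact boundary (reboot time equal to the delay) as "offlined" is an assumption.

```shell
#!/bin/sh
# Simplified model (assumption): a path is offlined when the targets
# remain absent for at least fcp_offline_delay seconds.
# path_outcome is a hypothetical helper, not part of Solaris.
path_outcome() {
    delay=$1     # fcp_offline_delay, in seconds
    reboot=$2    # controller reboot duration, in seconds
    if [ "$reboot" -ge "$delay" ]; then
        echo "offlined"
    else
        echo "kept"
    fi
}

# Compare the default (20) and the recommended (50) delay
# against reboot durations in the documented 20-40 second range.
for delay in 20 50; do
    for reboot in 20 30 40; do
        echo "delay=${delay}s reboot=${reboot}s path=$(path_outcome "$delay" "$reboot")"
    done
done
```

Under this model, a 50 second delay exceeds the longest documented reboot time, so neither controller reboot triggers a path offline and no failback is pending when controller B restarts.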

The modification of this value can be done in a running kernel without the need to reboot the server.

Root access is required to execute the commands below on a Solaris server.

Procedure to change the "fcp_offline_delay" value to 50 seconds:

Execute the following command to set the value to 50:

# echo fcp_offline_delay/W 0t50 | mdb -kw

Example:

# echo fcp_offline_delay/W 0t50 | mdb -kw
fcp_offline_delay: 0x14 = 0x32

Confirm the value with the following command:

# echo fcp_offline_delay/D | mdb -k

Example:

# echo fcp_offline_delay/D | mdb -k
fcp_offline_delay:
fcp_offline_delay: 50
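
Note that the write example reports the old and new values in hexadecimal, while the /D format prints decimal. As a small aid for cross-checking the two, the sketch below converts a decimal second count to hexadecimal with printf; to_hex is a hypothetical helper name used only for this illustration.

```shell
#!/bin/sh
# Convert a decimal number of seconds to the hexadecimal form
# that appears in the mdb write output.
# to_hex is a hypothetical helper, not part of mdb.
to_hex() {
    printf '0x%x\n' "$1"
}

to_hex 50   # prints 0x32
to_hex 20   # prints 0x14
```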

Procedure to change the "fcp_offline_delay" value back to 20 seconds:

Execute the following command to set the value to 20:

# echo fcp_offline_delay/W 0t20 | mdb -kw

Example:

# echo fcp_offline_delay/W 0t20 | mdb -kw
fcp_offline_delay: 0x32 = 0x14

Confirm the value with the following command:

# echo fcp_offline_delay/D | mdb -k

Example:

# echo fcp_offline_delay/D | mdb -k
fcp_offline_delay:
fcp_offline_delay: 20
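
Because the value is changed twice per upgrade, some administrators may prefer to wrap the command in a small function. The following is a minimal sketch, not a supported tool: set_fcp_offline_delay and the DRY_RUN variable are hypothetical names introduced here. By default the function only prints the mdb command shown above; setting DRY_RUN=0 runs it, which requires root on a Solaris server.

```shell
#!/bin/sh
# Hypothetical wrapper around the mdb command from this document.
# DRY_RUN (an assumption of this sketch) defaults to 1, so the
# command is only printed; set DRY_RUN=0 as root to apply it.
set_fcp_offline_delay() {
    seconds=$1
    # Reject empty or non-numeric input before touching the kernel.
    case "$seconds" in
        ''|*[!0-9]*)
            echo "error: seconds must be a non-negative integer" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "echo fcp_offline_delay/W 0t${seconds} | mdb -kw"
    else
        echo "fcp_offline_delay/W 0t${seconds}" | mdb -kw
    fi
}

set_fcp_offline_delay 50   # before the controller firmware upgrade
set_fcp_offline_delay 20   # after the upgrade completes
```

After applying a change, the value should still be confirmed with the /D command shown above.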
