Red Hat Linux - FC HBA Qla2xxx Abort Command Issued, qla2xxx_eh_bus_reset: reset succeeded To Sun ZFS Storage 7320 Through Brocade 300

Asset ID:	1-72-1645310.1
Update Date:	2017-02-13
Keywords:

Solution Type Problem Resolution Sure

Solution 1645310.1 : Red Hat Linux - FC HBA Qla2xxx Abort Command Issued, qla2xxx_eh_bus_reset: reset succeeded To Sun ZFS Storage 7320 Through Brocade 300

Applies to:

Qlogic FC HBA - Version All Versions to All Versions [Release All Releases]
Emulex FC HBA - Version All Versions and later
Sun Fire X4270 M2 Server - Version All Versions and later
Information in this document applies to any platform.

Symptoms

Red Hat Enterprise Linux Server release 5.9 , on Oracle SUN FIRE X4270 M2 SERVER with Oracle FC HBAs, accessing Oracle Storage 7320 servers object of the problem are server01 and server02

Server Model : SF x4270M2
OS : RHEL 5.10, kernel version 2.6.18-308.13.1.el5 and qla driver 8.03.07.09.05.08-k
PCI : 2x PCIe 8GB FC HBA "SG-XPCIE2FC-QF8-Z"
SAN Switch : 2x Brocade 300
Storage : Sun 7320 ZFS Storage

/var/log/messages:
Jan 22 08:05:11 server02 auditd[17338]: Audit daemon rotating log files
Jan 22 09:36:03 server02 kernel: qla2xxx 0000:21:00.0: scsi(7:1:12): Abort command issued -- 1 1370fe4b 2002.
Jan 22 09:36:14 server02 kernel: qla2xxx 0000:21:00.0: scsi(7:1:12): Abort command issued -- 1 1370fe4b 2002.
Jan 22 09:36:14 server02 kernel: qla2xxx 0000:21:00.0: scsi(7:1:12): DEVICE RESET ISSUED.
Jan 22 09:36:14 server02 kernel: qla2xxx 0000:21:00.0: scsi(7:1:12): DEVICE RESET SUCCEEDED.
Jan 22 09:36:25 server02 kernel: qla2xxx 0000:21:00.0: scsi(7:1:12): Abort command issued -- 1 1370fe4b 2002.
Jan 22 09:36:25 server02 kernel: qla2xxx 0000:21:00.0: scsi(7:1:12): LOOP RESET ISSUED.
Jan 22 09:36:25 server02 kernel: qla2xxx 0000:21:00.0: qla2xxx_eh_bus_reset: reset succeeded
Jan 22 09:36:46 server02 kernel: qla2xxx 0000:21:00.0: scsi(7:1:12): Abort command issued -- 1 1370fe4b 2002.
Jan 22 09:36:46 server02 kernel: qla2xxx 0000:21:00.0: scsi(7:1:12): ADAPTER RESET ISSUED.
Jan 22 09:36:46 server02 kernel: qla2xxx 0000:21:00.0: Performing ISP error recovery - ha= ffff81047f78c4f8.
Jan 22 09:36:47 server02 kernel: qla2xxx 0000:21:00.0: LIP reset occured (f700).
Jan 22 09:36:47 server02 kernel: qla2xxx 0000:21:00.0: LOOP UP detected (4 Gbps).
Jan 22 09:36:47 server02 kernel: qla2xxx 0000:21:00.0: qla2xxx_eh_host_reset: reset succeeded
Jan 22 09:37:08 server02 kernel: qla2xxx 0000:21:00.0: scsi(7:1:12): Abort command issued -- 1 1370fe4b 2002.
Jan 22 09:37:08 server02 kernel: sd 7:0:1:12: scsi: Device offlined - not ready after error recovery
Jan 22 09:37:08 server02 kernel: sd 7:0:1:12: rejecting I/O to offline device
Jan 22 09:37:08 server02 kernel: Buffer I/O error on device sdd, logical block 0
Jan 22 09:37:08 server02 kernel: sd 7:0:1:12: rejecting I/O to offline device
Jan 22 09:37:08 server02 kernel: Buffer I/O error on device sdd, logical block 0
Jan 22 09:38:08 server02 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:12): Abort command issued -- 1 3b8a77d 2002.
Jan 22 09:38:18 server02 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:12): Abort command issued -- 1 3b8a77d 2002.
Jan 22 09:38:18 server02 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:12): DEVICE RESET ISSUED.
Jan 22 09:38:18 server02 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:12): DEVICE RESET SUCCEEDED.
Jan 22 09:38:28 server02 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:12): Abort command issued -- 1 3b8a77d 2002.
Jan 22 09:38:28 server02 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:12): LOOP RESET ISSUED.
Jan 22 09:38:28 server02 kernel: qla2xxx 0000:2f:00.0: qla2xxx_eh_bus_reset: reset succeeded
Jan 22 09:38:48 server02 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:12): Abort command issued -- 1 3b8a77d 2002.
Jan 22 09:38:48 server02 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:12): ADAPTER RESET ISSUED.
Jan 22 09:38:48 server02 kernel: qla2xxx 0000:2f:00.0: Performing ISP error recovery - ha= ffff81047d0584f8.
Jan 22 09:38:49 server02 kernel: qla2xxx 0000:2f:00.0: LIP reset occured (f700).
Jan 22 09:38:49 server02 kernel: qla2xxx 0000:2f:00.0: LIP occured (f700).
Jan 22 09:38:49 server02 kernel: qla2xxx 0000:2f:00.0: LIP reset occured (f7f7).
Jan 22 09:38:49 server02 kernel: qla2xxx 0000:2f:00.0: LOOP UP detected (4 Gbps).
Jan 22 09:38:50 server02 kernel: qla2xxx 0000:2f:00.0: qla2xxx_eh_host_reset: reset succeeded
Jan 22 09:39:11 server02 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:12): Abort command issued -- 1 3b8a77d 2002.
Jan 22 09:39:11 server02 kernel: sd 9:0:1:12: scsi: Device offlined - not ready after error recovery
Jan 22 09:39:11 server02 kernel: sd 9:0:1:12: rejecting I/O to offline device
Jan 22 09:39:11 server02 kernel: Buffer I/O error on device sdf, logical block 0
Jan 22 09:39:11 server02 kernel: sd 9:0:1:12: rejecting I/O to offline device
Jan 22 09:39:11 server02 kernel: Buffer I/O error on device sdf, logical block 0

dmesg:

qla2xxx 0000:2f:00.0:
QLogic Fibre Channel HBA Driver: 8.03.07.09.05.08-k
QLogic QLE2562 - Sun StorageTek 8Gb FC PCIe HBA, dual port
ISP2532: PCIe (5.0Gb/s x4) @ 0000:2f:00.0 hdma+, host#=9, fw=5.06.03 (90d5)
GSI 32 sharing vector 0x74 and IRQ 32
ACPI: PCI Interrupt 0000:2f:00.1[B] -> GSI 42 (level, low) -> IRQ 116
qla2xxx 0000:2f:00.1: Found an ISP2532, irq 116, iobase 0xffffc200100ec000
qla2xxx 0000:2f:00.1: Configuring PCI space...
PCI: Setting latency timer of device 0000:2f:00.1 to 64
qla2xxx 0000:2f:00.1: Configure NVRAM parameters...
qla2xxx 0000:2f:00.1: Verifying loaded RISC code...
qla2xxx 0000:2f:00.1: Allocated (64 KB) for EFT...
qla2xxx 0000:2f:00.1: Allocated (1414 KB) for firmware dump...
scsi10 : qla2xxx
qla2xxx 0000:2f:00.1:
QLogic Fibre Channel HBA Driver: 8.03.07.09.05.08-k
QLogic QLE2562 - Sun StorageTek 8Gb FC PCIe HBA, dual port
ISP2532: PCIe (5.0Gb/s x4) @ 0000:2f:00.1 hdma+, host#=10, fw=5.06.03 (90d5)
qla2xxx 0000:21:00.0: LIP reset occured (f700).
qla2xxx 0000:21:00.0: LIP occured (f700).

We have focused analysis on server server01

We have replaced the FC HBAs on server server01
Also replaced the SFP on port6 of Brocade switches where the FC HBAs of this server are connected.
Customer also replaced the FC cables

After that, the problem persist:

Mar 3 14:28:01 server01 kernel: scsi(7): RFT_ID exiting normally.
Mar 3 14:28:01 server01 kernel: scsi(7): RFF_ID exiting normally.
Mar 3 14:28:01 server01 kernel: scsi(7): RNN_ID exiting normally.
Mar 3 14:28:01 server01 kernel: scsi(7): RSNN_NN exiting normally.
Mar 3 14:28:01 server01 kernel: scsi(7): GID_PT entry - nn 20000024ff4a8b58 pn 21000024ff4a8b58 portid=010600.
Mar 3 14:28:01 server01 kernel: scsi(7): GID_PT entry - nn 20000024ff34235e pn 21000024ff34235e portid=010100.
Mar 3 14:28:01 server01 kernel: scsi(7): GID_PT entry - nn 20000024ff38c330 pn 21000024ff38c330 portid=010000.
Mar 3 14:28:01 server01 kernel: scsi(7): GPSC failed, rejected request:
Mar 3 14:28:01 server01 kernel: 0 1 2 3 4 5 6 7 8 9 Ah Bh Ch Dh Eh Fh
Mar 3 14:28:01 server01 kernel: --------------------------------------------------------------
Mar 3 14:28:02 server01 kernel: 01 00 00 00 fa 01 00 00 80 01 00 00 00 0b 00 00
Mar 3 14:28:02 server01 kernel: scsi(7): GPSC command unsupported, disabling query...
Mar 3 14:28:02 server01 kernel: scsi(7): device wrap (010000)
Mar 3 14:28:02 server01 kernel: scsi(7): Trying Fabric Login w/loop id 0x0081 for port 010100.
Mar 3 14:28:02 server01 kernel: scsi(7): Trying Fabric Login w/loop id 0x0082 for port 010000.
Mar 3 14:28:02 server01 kernel: scsi(7): LOOP READY
Mar 3 14:28:02 server01 kernel: scsi(7): qla2x00_loop_resync - end
Mar 3 14:28:02 server01 kernel: scsi(7): Asynchronous PORT UPDATE ignored 0000/0004/0600.
Mar 3 14:28:02 server01 kernel: scsi(7): Asynchronous PORT UPDATE ignored 0000/0007/0b00.
Mar 3 14:28:04 server01 kernel: qla2xxx_eh_abort(9): aborting sp ffff8103917cc840 from RISC. pid=28264.
Mar 3 14:28:04 server01 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:4) FCP command status: 0x5-0x0 (0x80000) portid=010000 oxid=0x23 ser=0x6e68 cdb=008000 len=0x0 rsp_info=0x0 resid=0x0 fw_resid=0x0
Mar 3 14:28:04 server01 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:4): Abort command issued -- 1 6e68 2002.
Mar 3 14:28:04 server01 kernel: qla2xxx 0000:2f:00.0: scsi(9:1:4): LOOP RESET ISSUED.
Mar 3 14:28:04 server01 kernel: qla2x00_wait_for_hba_online return_status=0
Mar 3 14:28:04 server01 kernel: qla2xxx 0000:2f:00.0: qla2xxx_eh_bus_reset: reset succeeded
Mar 3 14:28:04 server01 kernel: scsi(9): Asynchronous P2P MODE received.
Mar 3 14:28:04 server01 kernel: scsi(9): qla2x00_reset_marker()
Mar 3 14:28:04 server01 kernel: scsi(9): Asynchronous PORT UPDATE.
Mar 3 14:28:04 server01 kernel: scsi(9): Port database changed ffff 0006 0000.
Mar 3 14:28:04 server01 kernel: scsi(9): qla2x00_loop_resync()
Mar 3 14:28:04 server01 kernel: scsi(9): F/W Ready - OK
Mar 3 14:28:04 server01 kernel: scsi(9): fw_state=3 (3, 0, 4, 0) curr time=1149a59d8.
Mar 3 14:28:04 server01 kernel: scsi(9): Configure loop -- dpc flags =0x5260
Mar 3 14:28:04 server01 kernel: scsi(9): RSCN queue entry[0] = [00/000000].
Mar 3 14:28:04 server01 kernel: scsi(9): device_resync: rscn overflow.
Mar 3 14:28:04 server01 kernel: scsi(9): RFT_ID exiting normally.
Mar 3 14:28:04 server01 kernel: scsi(9): RFF_ID exiting normally.
Mar 3 14:28:04 server01 kernel: scsi(9): RNN_ID exiting normally.
Mar 3 14:28:04 server01 kernel: scsi(9): RSNN_NN exiting normally.
Mar 3 14:28:04 server01 kernel: scsi(9): GID_PT entry - nn 20000024ff35ce36 pn 21000024ff35ce36 portid=010600.
Mar 3 14:28:04 server01 kernel: scsi(9): GID_PT entry - nn 20000024ff34235f pn 21000024ff34235f portid=010100.
Mar 3 14:28:04 server01 kernel: scsi(9): GID_PT entry - nn 20000024ff38c331 pn 21000024ff38c331 portid=010000.
Mar 3 14:28:04 server01 kernel: scsi(9): GPSC failed, rejected request:
Mar 3 14:28:04 server01 kernel: 0 1 2 3 4 5 6 7 8 9 Ah Bh Ch Dh Eh Fh
Mar 3 14:28:04 server01 kernel: --------------------------------------------------------------

scsi 9:0:1:4: scsi: Device offlined - not ready after error recovery
scsi 9:0:1:4: timing out command, waited 22s
scsi 9:0:1:0: Unexpected response from lun 4 while scanning, scan aborted
qla2xxx 0000:2f:00.0: scsi(9:1:4): Abort command issued -- 1 3bd 2002.
qla2xxx 0000:21:00.0: scsi(7:1:4): Abort command issued -- 1 1a900 2002.
scsi 7:0:1:4: scsi: Device offlined - not ready after error recovery
scsi 7:0:1:4: timing out command, waited 22s
scsi 7:0:1:0: Unexpected response from lun 4 while scanning, scan aborted
qla2xxx 0000:2f:00.0: scsi(9:1:4): Abort command issued -- 1 3bd 2002.
qla2xxx 0000:2f:00.0: scsi(9:1:4): DEVICE RESET ISSUED.
qla2xxx 0000:2f:00.0: scsi(9:1:4): DEVICE RESET SUCCEEDED.
qla2xxx 0000:21:00.0: scsi(7:1:4): Abort command issued -- 1 1b362 2002.
qla2xxx 0000:2f:00.0: scsi(9:1:4): Abort command issued -- 1 3bd 2002.
qla2xxx 0000:2f:00.0: scsi(9:1:4): LOOP RESET ISSUED.
qla2xxx 0000:2f:00.0: qla2xxx_eh_bus_reset: reset succeeded
qla2xxx 0000:21:00.0: scsi(7:1:4): Abort command issued -- 1 1b362 2002.
qla2xxx 0000:21:00.0: scsi(7:1:4): DEVICE RESET ISSUED.
qla2xxx 0000:21:00.0: scsi(7:1:4): DEVICE RESET SUCCEEDED.
qla2xxx 0000:21:00.0: scsi(7:1:4): Abort command issued -- 1 1b362 2002.
qla2xxx 0000:21:00.0: scsi(7:1:4): LOOP RESET ISSUED.
qla2xxx 0000:21:00.0: qla2xxx_eh_bus_reset: reset succeeded
qla2xxx 0000:2f:00.0: scsi(9:1:4): Abort command issued -- 1 3bd 2002.
qla2xxx 0000:2f:00.0: scsi(9:1:4): ADAPTER RESET ISSUED.
qla2xxx 0000:2f:00.0: Performing ISP error recovery - ha= ffff81047acdc4f8.
qla2xxx 0000:2f:00.0: LIP reset occured (f700).
qla2xxx 0000:2f:00.0: LIP occured (f700).
qla2xxx 0000:2f:00.0: LIP reset occured (f7f7).
qla2xxx 0000:2f:00.0: LOOP UP detected (4 Gbps).
qla2xxx 0000:2f:00.0: qla2xxx_eh_host_reset: reset succeeded
qla2xxx 0000:21:00.0: scsi(7:1:4): Abort command issued -- 1 1b362 2002.
qla2xxx 0000:21:00.0: scsi(7:1:4): ADAPTER RESET ISSUED.
qla2xxx 0000:21:00.0: Performing ISP error recovery - ha= ffff81047f5944f8.
qla2xxx 0000:21:00.0: LIP reset occured (f700).
qla2xxx 0000:21:00.0: LIP occured (f700).
qla2xxx 0000:21:00.0: LIP reset occured (f7f7).
qla2xxx 0000:21:00.0: LOOP UP detected (4 Gbps).
qla2xxx 0000:21:00.0: qla2xxx_eh_host_reset: reset succeeded
qla2xxx 0000:2f:00.0: scsi(9:1:4): Abort command issued -- 1 3bd 2002.

On the Brocade 300 switches, the FC HBAs are logged as Fabric ports

switchName: swA
Index Port Address Media Speed State Proto
==============================================
6 6 010600 id N4 Online FC F-Port 21:00:00:24:ff:4a:8b:58

switchName: swB
Index Port Address Media Speed State Proto
==============================================
6 6 010600 id N4 Online FC F-Port 21:00:00:24:ff:35:ce:36

These ports present enc_out values being way above normal threshold, also high values for link fail, loss sync, loss sig :

Switch Name: swA Version: v6.4.0c

port enc crc crc too too bad enc disc link loss loss
in err g_eof short long eof out c3 fail sync sig
================================================================================================
6: 8 11 0 9 0 12 5.7m 5.2k 12.8k 13.0k 13.0k

Switch Name: swB Version: v6.4.0c

port enc crc crc too too bad enc disc link loss loss
in err g_eof short long eof out c3 fail sync sig
================================================================================================
6: 0 4 0 4 0 4 405.8k 4.7k 13.3k 13.3k 13.3k

On the FC switch side we could see constant errors like this (on both switches)

fabriclog --show :
Time Stamp Input and *Action S, P Sn,Pn Port Xid
===================================================================================
Switch 0; Thu Feb 27 10:00:38 2014 GMT (GMT+0:00)
10:00:38.349377 *Removing all nodes from port D2,P0 D2,P0 6 NA
10:00:38.349494 SCN Port F_PORT D2,P1 D2,P0 6 NA

10:01:25.450826 SCN Port Offline;g=0x381e8 D2,P0 D2,P0 6 NA
10:01:25.450843 *Removing all nodes from port D2,P0 D2,P0 6 NA
10:01:25.464843 SCN LR_PORT (0);g=0x381e8 D2,P0 D2,P0 6 NA
10:01:25.464883 SCN Port Online;g=0x381e8 D2,P0 D2,P1 6 NA
10:01:25.465060 Port Elp engaged D2,P1 D2,P0 6 NA
10:01:25.465123 *Removing all nodes from port D2,P0 D2,P0 6 NA
10:01:25.465243 SCN Port F_PORT D2,P1 D2,P0 6 NA

10:01:45.453616 SCN Port Offline;g=0x381ea D2,P0 D2,P0 6 NA
10:01:45.453633 *Removing all nodes from port D2,P0 D2,P0 6 NA
10:01:46.460183 SCN LR_PORT (0);g=0x381ea D2,P0 D2,P0 6 NA
10:01:46.463966 SCN Port Online;g=0x381ea D2,P0 D2,P1 6 NA
10:01:46.464143 Port Elp engaged D2,P1 D2,P0 6 NA
10:01:46.464193 *Removing all nodes from port D2,P0 D2,P0 6 NA
10:01:46.464309 SCN Port F_PORT D2,P1 D2,P0 6 NA

So, these errors on switch side are related to these other on the server side:

qla2xxx 0000:21:00.0: LIP reset occured (f700).
qla2xxx 0000:21:00.0: LIP occured (f700).
qla2xxx 0000:21:00.0: LIP reset occured (f7f7).
qla2xxx 0000:21:00.0: LOOP UP detected (4 Gbps).
qla2xxx 0000:21:00.0: qla2xxx_eh_host_reset: reset succeeded
qla2xxx 0000:2f:00.0: scsi(9:1:4): Abort command issued -- 1 3bd 2002.

Standard Enc_out definition:

8bit/10bit encoding errors occurred in words (ordered sets) outside of the FC frame. Words outside of frames are encoded, if this encoding is corrupted or an
error is detected enc_out is generated. It indicates a problem if it increments faster than the link-bit error rate allows, approximately once every 20 minutes for 1 Gbit/s. Statistically, enc_out errors on their own imply a cable/connector problem. Enc_out errors and crc_err together imply GBIC/SFP problem.

Such errors are also expected every time a user brings a port down and up (i.e. reboot host, power-cycle storage subsystem, unplug/plug cable or portdisable/portenable etc).Such errors will also be generated on a link which has a 1Gbit/s port connected to a 2Gbit/s port when autonegotiation is turned off.

--> So a possible scenario that explains the errors observed on the Brocade 300 port 6 is that qla2xx driver reset the FC HBAs.

Cause

On the Storage 7320 , we observed that we have high load on system: over 9 load average considering only 8 CPU cores available.
There are CR related to FC fixed in recent ak version. It seems we were hitting this known bug: 15758100 FC ports are frequently reset while the system is under heavy load.

It is possible that the server received from the storage some ASC/ASCQ sense data and the reaction of the qla driver was to reset the link and the switch errors/messages on the host side might be _after_, as a consequence of the driver resetting, this is just an hypothesis of what was going on.

Errors stopped only after firmware upgrade on the 7320 storage.

If problem recur, next time can try to compare time stamps on qlc messages with switch messages to work out which messages happened first, and also we can try to enable on linux servers qla2xxx extended logging to see if any ASC/ASCQ sense data is received:

Doc ID 1452418.1 - How to configure detailed error logging for QLogic/Emulex (qla2xxx/lpfc) HBA driver on Oracle Linux & Oracle VMDelete Reference
Doc ID 1531816.1 - Enable Error Logging for QLogic/Emulex (qla2xxx/lpfc) HBA Driver on Oracle Linux & Oracle VMDelete Reference
Doc ID 1612676.1 - How to get Fibre Channel HBA information from Linux SOSreport

Solution

Upgrade the 7320 storage to the latest firmware.
Please look at this document for Oracle ZFS Storage Appliance: Software Updates (Doc ID 2021771.1)

And take into account the next document:
Sun Storage 7000 Unified Storage System: How to upgrade a clustered system (Doc ID 1447284.1)

Firmware on Brocade 300 switches was not updated, only the Storage 7320 firmware,
the sequence as below:
1- Upgrade Storage FW from ak-2011.04.24.1.0 -> ak-2011.04.24.4.2
2- Upgrade Storage FW from ak-2011.04.24.4.2 -> ak-2013.06.05.1.6
3- Both 2x Brocade Switch remain at current running FW 6.4.0c.
4- Bring up all connected servers successful.
5- Bring up the 2x RHEL servers which affected and customer verified no
more port reset seen in messages right after OS is up.
6- Tested now all the backup is running fine and no issues or slowness encountered.

References

<NOTE:1582197.1> - Very Slow I/O Performance On Storage - Qlogic HBA Reporting Dropped Frame And I/O Error
<NOTE:1003754.1> - How to Gather Brocade SupportSave or SupportShow Logs
<NOTE:1452418.1> - How to configure detailed error logging for QLogic/Emulex (qla2xxx/lpfc) HBA driver on Oracle Linux & Oracle VM
<NOTE:1531816.1> - Enable Error Logging for QLogic/Emulex (qla2xxx/lpfc) HBA Driver on Oracle Linux & Oracle VM
<NOTE:1612676.1> - How to get Fibre Channel HBA information from Linux SOSreport

Attachments

This solution has no attachment