Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1370677.1
Update Date:2018-01-10
Keywords:

Solution Type  Problem Resolution Sure

Solution  1370677.1 :   FC HBA (Invalid Tx Word Count Errors Are Increasing)  


Related Items
  • Sun Storage FC HBA
  • Emulex FC HBA
  • Qlogic FC HBA
  • Sun SPARC Enterprise M5000 Server
Related Categories
  • PLA-Support>Sun Systems>DISK>HBA>SN-DK: FC HBA
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Drives - FC


How to troubleshoot suspected FC HBA problems where a link is bouncing, timeouts are seen and the Invalid Tx Word counter is increasing.

In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-4817403391>

Applies to:

Qlogic FC HBA - Version All Versions to All Versions [Release All Releases]
Emulex FC HBA - Version All Versions to All Versions [Release All Releases]
Sun SPARC Enterprise M5000 Server - Version All Versions and later
Sun Storage FC HBA - Version Not Applicable and later
Information in this document applies to any platform.

Symptoms

You may see one or more of the following symptoms:

Command Timeouts
Retryable SCSI "tran_err" messages
Repeating "Link down" and "Link up" messages on HBAs using the emlxs driver.
Repeating "Link OFFLINE" and "Link ONLINE" messages on HBAs using the qlc driver.

Examples:

scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
/scsi_vhci/ssd@g6006016077a0290080b06a0303eae011 (ssd45): Command Timeout on path /pci@3,700000/SUNW,emlxs@0/fp@0,0 (fp2)

/scsi_vhci/ssd@g600a0b80005b8bcc00000c434a8d400d (ssd75): Command Timeout on path fp1/ssd@w202300a0b85b8bda,7

emlxs: [ID 349649 kern.info] [ 5.031F]emlxs0: NOTICE: 710: Link down.
scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6006016074a02900b69d3895b1b3e011 (ssd31):
       SCSI transport failed: reason 'tran_err': retrying command
scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6006016077a0290080b06a0303eae011 (ssd45):
       SCSI transport failed: reason 'tran_err': retrying command
scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6006016074a02900965ecbcb92b8e011 (ssd40):
       SCSI transport failed: reason 'tran_err': retrying command
emlxs: [ID 349649 kern.info] [ 5.0549]emlxs0: NOTICE: 720: Link up. (4Gb, fabric, initiator)


In addition to the messages above, the "Invalid Tx Word" counter grows continuously.  This counter can be viewed with either the 'fcinfo hba-port -l' or the 'luxadm -e rdls' command.  Because a single reading only shows a cumulative total, collect at least two samples of the command output some time apart and compare the counter values to confirm growth.

Example:

# fcinfo hba-port -l

HBA Port WWN: 10000000c991afba
        OS Device Name: /dev/cfg/c1
        Manufacturer: Emulex
        Model: LPe11002-S
        Firmware Version: 2.82a4 (Z3F2.82A4)
        FCode/BIOS Version: Boot:5.02a1 Fcode:1.50a9
        Serial Number: 0999BT0-094200059G
        Driver Name: emlxs
        Driver Version: 2.60h (2010.10.22.16.55)
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb 4Gb
        Current Speed: 4Gb
        Node WWN: 20000000c991afba
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 177
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 1337580037
                Invalid CRC Count: 0

# fcinfo hba-port -l

HBA Port WWN: 10000000c991afba
        OS Device Name: /dev/cfg/c1
        Manufacturer: Emulex
        Model: LPe11002-S
        Firmware Version: 2.82a4 (Z3F2.82A4)
        FCode/BIOS Version: Boot:5.02a1 Fcode:1.50a9
        Serial Number: 0999BT0-094200059G
        Driver Name: emlxs
        Driver Version: 2.60h (2010.10.22.16.55)
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb 4Gb
        Current Speed: 4Gb
        Node WWN: 20000000c991afba
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 177
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 1337583533
                Invalid CRC Count: 0
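
The comparison of the two samples above can be scripted. The following is a minimal sketch, not an Oracle-supplied tool: it assumes the 'fcinfo hba-port -l' output has been saved as text, and the function names (parse_invalid_tx_words, counter_growth) are made up for illustration. The sample strings are abbreviated from the output shown above.

```python
import re


def parse_invalid_tx_words(fcinfo_text):
    """Map each HBA Port WWN to its 'Invalid Tx Word Count' value."""
    counts = {}
    wwn = None
    for line in fcinfo_text.splitlines():
        line = line.strip()
        match = re.match(r"HBA Port WWN:\s*(\S+)", line)
        if match:
            wwn = match.group(1)
            continue
        match = re.match(r"Invalid Tx Word Count:\s*(\d+)", line)
        if match and wwn is not None:
            counts[wwn] = int(match.group(1))
    return counts


def counter_growth(first_text, second_text):
    """Return the per-port increase between two saved samples."""
    first = parse_invalid_tx_words(first_text)
    second = parse_invalid_tx_words(second_text)
    return {wwn: second[wwn] - first[wwn] for wwn in first if wwn in second}


# Abbreviated copies of the two samples shown above.
SAMPLE_1 = """\
HBA Port WWN: 10000000c991afba
        Link Error Statistics:
                Invalid Tx Word Count: 1337580037
"""
SAMPLE_2 = """\
HBA Port WWN: 10000000c991afba
        Link Error Statistics:
                Invalid Tx Word Count: 1337583533
"""

growth = counter_growth(SAMPLE_1, SAMPLE_2)
print(growth)  # {'10000000c991afba': 3496}
```

A non-zero difference between samples is what indicates an active problem; a large but static total may simply be historical.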

 

# luxadm -e rdls /dev/cfg/c1

Link Error Status information for loop:

al_pa   lnk fail    sync loss   signal loss   sequence err   invalid word   CRC
20000   4           9           5             0              1020           0
20800   5           95          5             0              1275           0
31700   0           0           0             0              1              0
31c00   0           0           0             0              2              0
21200   0           177         0             0              1337583533     0

# luxadm -e rdls /dev/cfg/c1

Link Error Status information for loop:
al_pa   lnk fail    sync loss   signal loss   sequence err   invalid word   CRC
20000   4           9           5             0              1020           0
20800   5           95          5             0              1275           0
31700   0           0           0             0              1              0
31c00   0           0           0             0              2              0
21200   0           177         0             0              1337625025     0
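
The same check can be applied to the 'luxadm -e rdls' table, which lists one row per port on the link. The sketch below is illustrative only. It assumes the seven-column layout shown above, and the helper names (parse_rdls, growing_ports) are hypothetical. The rows are copied from the two samples above.

```python
def parse_rdls(text):
    """Map the first column (al_pa) to the 'invalid word' column.

    Data rows are recognised as lines with exactly seven fields whose
    last six fields are all integers; header lines are skipped.
    """
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        if not all(field.isdigit() for field in fields[1:]):
            continue
        rows[fields[0]] = int(fields[5])
    return rows


def growing_ports(first_text, second_text):
    """Return only the rows whose invalid word count increased."""
    first = parse_rdls(first_text)
    second = parse_rdls(second_text)
    return {
        port: second[port] - first[port]
        for port in first
        if port in second and second[port] > first[port]
    }


HEADER = "al_pa   lnk fail    sync loss   signal loss   sequence err   invalid word   CRC\n"
COMMON_ROWS = (
    "20000   4           9           5             0              1020           0\n"
    "20800   5           95          5             0              1275           0\n"
    "31700   0           0           0             0              1              0\n"
    "31c00   0           0           0             0              2              0\n"
)
RDLS_1 = HEADER + COMMON_ROWS + "21200   0           177         0             0              1337583533     0\n"
RDLS_2 = HEADER + COMMON_ROWS + "21200   0           177         0             0              1337625025     0\n"

print(growing_ports(RDLS_1, RDLS_2))  # {'21200': 41492}
```

In these samples only the 21200 row grows, and its counters match the values reported by fcinfo for the HBA port; the other rows are static.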

Changes

 

Cause

The increasing error counters here are reported against incoming signal decoding violations.  In other words, in these situations where the counters are seen increasing on the host side, the most likely cause of the errors is some component outside the server sending the traffic INTO the HBA.  Therefore the SFP on the switch side and/or the cable should be examined first for properly secured connections.

If the connections are firmly seated and the errors continue, a spare switch-side SFP and/or spare cable should be tried to see if the errors will subside.  Again, because the errors are being reported on incoming traffic, the HBA is the least likely candidate for the cause. Having said that, the HBA itself cannot be completely ruled out. There is still a possibility that the receiver optic is faulty as well.

Please note:

If the HBA is QLogic and the only indication of a fault is the Invalid Word count reported by luxadm and fcinfo, also check <Document 1594320.1> Incorrect Invalid Tx Word Counts may be reported against QLogic HBAs

 

Solution

Log on to the FC switch and check its port error counters to see whether they indicate an issue on the switch side.

If so, engage your FC switch support vendor to investigate.

If the FC switch is under Oracle support, open an Oracle Service Request (SR) and provide the FC switch support data:

Cisco MDS switch - What logs are required to troubleshoot a Cisco MDS switch? (Doc ID 1016141.1)
Brocade - What logs are required to troubleshoot a Brocade switch? (Doc ID 1003754.1)
Qlogic Switch - What logs are required to troubleshoot a Qlogic Fibre Channel switch (Doc ID 1270583.1)
McData - What logs are required to troubleshoot a McData switch? (Doc ID 1015572.1)

Note: In some situations the FC switch port error counters do not increase, and the growth is visible only from the server side with luxadm and fcinfo.

In these cases, check the Tx and Rx power values reported by sfpshow (on Brocade switches). A low Rx value may indicate that the wrong type of FC cable is in use; see:

Brocade FC Switch Port RX Power Shows Low Value - FC Cable Types - SFP Types (Doc ID 2306903.1)


If there is no indication of an issue on the FC switch side, work through the steps below and check after each one whether the error count is still increasing. If it is, move on to the next step:

1. If there is a spare port on the FC switch, and assuming soft/WWN zoning is in use, move the switch end of the cable to another switch port to rule out the original switch SFP and port. Otherwise, check and/or replace everything between the FC HBA port and the FC switch port, such as cabling, patch panels, and splices.

Note: To isolate the issue faster, if possible connect a known-good FC cable directly from the FC HBA port to the FC switch port, bypassing all patch panels, splices, etc. between them, and then monitor for a few days. If the issue no longer occurs, that points to the cabling, patch panels, etc. as the cause.

2. If the server has QLogic FC HBA cards, also check:

Incorrect Invalid Tx Word Counts may be reported against QLogic HBAs (Doc ID 1594320.1)
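
One way to keep track of the counter across these isolation steps is to record a reading at the start and end of each monitoring window. The sketch below is only an illustration of that bookkeeping: the StepObservation class, the first_clean_step function, the step labels, and all counter values are hypothetical and do not come from a real system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepObservation:
    step: str         # what was changed before this monitoring window
    count_start: int  # Invalid Tx Word Count at the start of the window
    count_end: int    # Invalid Tx Word Count at the end of the window

    @property
    def growth(self) -> int:
        return self.count_end - self.count_start


def first_clean_step(observations: List[StepObservation]) -> Optional[str]:
    """Return the first step after which the counter stopped growing."""
    for obs in observations:
        if obs.growth == 0:
            return obs.step
    return None


# Hypothetical readings for illustration only.
observations = [
    StepObservation("baseline", 1000, 4496),
    StepObservation("moved cable to spare switch port", 4496, 9120),
    StepObservation("direct known-good cable, no patch panel", 9120, 9120),
]

for obs in observations:
    print(f"{obs.step}: +{obs.growth}")
print("growth stopped after:", first_clean_step(observations))
```

Keeping the per-step readings in this form also makes it straightforward to attach them to a Service Request if the issue persists.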

 

If the issue persists:

1. Verify that the FC HBA is Oracle branded; see:
How to Identify Oracle[TM] Branded Fibre Channel (FC) HBA and CNA Cards and Their Slot Locations (Doc ID 1282491.1)

2. Open an Oracle Service Request (SR) and provide the error count samples from each step and a new Explorer output:
Oracle Explorer Data Collector - Product Information Center (Doc ID 1312847.1)

3. Collect and upload the FC switch port error counters and the port/SFP light power levels.

4. Collect and upload the FC HBA port light power levels:

<Document 2345039.1> How to check Fibre Channel (FC) HBA port Light Tx and Rx Power Levels

5. Provide details of the connection between the FC HBA port and the FC switch port: is there just a single cable, or are patch panels, splices, etc. involved?

6. Verify and provide server address/location and site contact person information.


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.