Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1602837.1
Update Date:2018-03-07
Keywords:

Solution Type  Problem Resolution Sure

Solution  1602837.1 :   FC HBA Emlxs ERROR: 420: Adapter Hardware Error. (Host Error Attention: Status=0x40000000  


Related Items
  • Sun SPARC Enterprise T5140 Server
  • Emulex FC HBA
  • Solaris Operating System
Related Categories
  • PLA-Support>Sun Systems>DISK>HBA>SN-DK: FC HBA




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-8084961071>

Applies to:

Sun SPARC Enterprise T5140 Server - Version All Versions and later
Solaris Operating System - Version 8.0 and later
Emulex FC HBA - Version Not Applicable and later
Information in this document applies to any platform.

Symptoms

Solaris 10 SPARC server shows one FC HBA LPe11000-S as NOT CONNECTED

C# INST# PORT WWN MODEL FCODE STATUS DEVICE PATH
-- ----- -------- ----- ----- ------ -----------
c2 emlxs0 10000000c981b0f2 LPe11000-S 1.50a9 CONNECTED /pci@400/pci@0/pci@c/SUNW,emlxs@0
c3 emlxs1 10000000c9b03dd4 LPe11000-S 1.50a9 NOT CONNECTED /pci@500/pci@0/pci@9/SUNW,emlxs@0  <<--Problem

 

The fcinfo command reports no errors; it shows only that port c3 is offline and that its firmware version is lower than that of the working port:

HBA Port WWN: 10000000xxxxxxx2
        OS Device Name: /dev/cfg/c2
        Manufacturer: Emulex
        Model: LPe11000-S
        Firmware Version: 2.82a4 (Z3D2.82A4)
        FCode/BIOS Version: Boot:5.02a1 Fcode:1.50a9
        Serial Number: 0999VM0-0846000NXU
        Driver Name: emlxs
        Driver Version: 2.60k (2011.03.24.16.45)
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb 4Gb
        Current Speed: 4Gb
        Node WWN: 20000000c981b0f2
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 1
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 26
                Invalid CRC Count: 0

HBA Port WWN: 10000000xxxxxxx4
        OS Device Name: /dev/cfg/c3
        Manufacturer: Emulex
        Model: LPe11000-S
        Firmware Version: 2.50a6 (Z2D2.50A6)  <<<-- low firmware
        FCode/BIOS Version: Boot:5.02a1 Fcode:1.50a9
        Serial Number: 0999BT0-1014000ENQ
        Driver Name: emlxs
        Driver Version: 2.60k (2011.03.24.16.45)
        Type: N-port
        State: offline   <<-- problem
        Supported Speeds: 1Gb 2Gb 4Gb
        Current Speed: not established
        Node WWN: 20000000c9b03dd4
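
The same check can be scripted when many hosts have to be reviewed. The following is a minimal sketch, assuming the fcinfo output shown above has been saved as text; the function names parse_fcinfo and offline_ports are hypothetical helpers, not part of any Oracle or Emulex tool, and the sample is a shortened copy of the listing above.

```python
def parse_fcinfo(text):
    """Split saved fcinfo output into one dict per HBA port."""
    ports = []
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values such as
        # "Boot:5.02a1 Fcode:1.50a9" stay intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "HBA Port WWN":
            ports.append({})  # a new port section starts here
        if ports:
            ports[-1][key] = value
    return ports


def offline_ports(ports):
    """Return the ports whose State field is anything other than online."""
    return [p for p in ports if p.get("State", "").lower() != "online"]


# Shortened copy of the listing above (WWNs are masked in the source).
SAMPLE = """\
HBA Port WWN: 10000000xxxxxxx2
        OS Device Name: /dev/cfg/c2
        Firmware Version: 2.82a4 (Z3D2.82A4)
        State: online
HBA Port WWN: 10000000xxxxxxx4
        OS Device Name: /dev/cfg/c3
        Firmware Version: 2.50a6 (Z2D2.50A6)
        State: offline
"""

flagged = offline_ports(parse_fcinfo(SAMPLE))
for port in flagged:
    print(port["OS Device Name"], port["State"], port["Firmware Version"])
```

Reporting the firmware version next to the state makes it easy to spot a port that is both offline and on older firmware, as in this case.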

 

 

The emlxs patch 149173-03 was installed, the server was rebooted, and the firmware on that FC HBA was upgraded.
After that the link came online, but it failed a few minutes later with the errors shown below.

The link comes up during boot:

Nov 13 14:09:47 host1 emlxs: [ID 349649 kern.info] [13.0303]emlxs1: NOTICE: 200: Adapter initialization. (Firmware update not needed.)
Nov 13 14:09:48 host1 emlxs: [ID 349649 kern.info] [ B.1A84]emlxs1: NOTICE: 100: Driver attach. (Emulex-S s10-64 sparc v2.80.8.0 (2012.09.17.15.45))
Nov 13 14:09:48 host1 emlxs: [ID 349649 kern.info] [ B.1A87]emlxs1: NOTICE: 100: Driver attach. (LPe11000-S Dev_id:fc20 Sub_id:fc21 Id:25)
Nov 13 14:09:48 host1 emlxs: [ID 349649 kern.info] [ B.1A94]emlxs1: NOTICE: 100: Driver attach. (Firmware:2.82a4 (Z3D2.82A4) Boot:5.02a1 Fcode:1.50a9)
Nov 13 14:09:48 host1 emlxs: [ID 349649 kern.info] [ B.1AC4]emlxs1: NOTICE: 100: Driver attach. (SLI:3 MSI:2 NPIV:0 FCA)
Nov 13 14:09:48 host1 emlxs: [ID 349649 kern.info] [ B.1ACC]emlxs1: NOTICE: 100: Driver attach. (WWPN:10000000C9B03DD4 WWNN:20000000C9B03DD4)
Nov 13 14:09:48 host1 pcieb: [ID 586369 kern.info] PCIE-device: SUNW,emlxs@0, emlxs1
Nov 13 14:09:48 host1 genunix: [ID 936769 kern.info] emlxs1 is /pci@500/pci@0/pci@9/SUNW,emlxs@0
Nov 13 14:09:48 host1 pcieb: [ID 586369 kern.info] PCIE-device: SUNW,emlxs@0, emlxs1

Nov 13 14:09:48 host1 emlxs: [ID 349649 kern.info] [ B.0680]emlxs1: NOTICE: 720: Link up. (4Gb, fabric, initiator)

Nov 13 14:09:48 host1 genunix: [ID 936769 kern.info] fp4 is /pci@500/pci@0/pci@9/SUNW,emlxs@0/fp@0,0


But a few minutes later it fails with these errors:

Nov 13 14:15:52 host1 emlxs: [ID 349649 kern.info] [13.11EE]emlxs1:  ERROR: 420: Adapter hardware error. (HS_FFER1 cleared)
Nov 13 14:15:52 host1 emlxs: [ID 349649 kern.info] [13.1208]emlxs1:  ERROR: 420: Adapter hardware error. (Host Error Attention: status=0x40000000 status1=0x93994 status2=0x6000000d)
Nov 13 14:15:52 host1 emlxs: [ID 349649 kern.info] [ 5.03DD]emlxs1: NOTICE: 710: Link down.
Nov 13 14:15:54 host1 emlxs: [ID 349649 kern.info] [ 6.0901]emlxs1:WARNING: 231: Adapter shutdown. (Reboot required.)
Nov 13 14:16:14 host1 emlxs: [ID 349649 kern.info] [13.0303]emlxs1: NOTICE: 200: Adapter initialization. (Firmware update not needed.)
Nov 13 14:16:24 host1 genunix: [ID 408114 kern.info] /pci@500/pci@0/pci@9/SUNW,emlxs@0 (emlxs1) down
Nov 13 14:17:10 host1 emcp: [ID 801593 kern.notice] Error: Path Bus 3076 Tgt 500009740825F959 Lun 3 to 000292602430 is dead.
Nov 13 14:17:10 host1 emcp: [ID 801593 kern.notice] Error: Killing bus 3076 to Symmetrix     000292602430 port 7fB.
Nov 13 14:17:10 host1 emcp: [ID 801593 kern.notice] Error: Path Bus 3076 Tgt 500009740825F959 Lun 6 to 000292602430 is dead.
Nov 13 14:17:10 host1 emcp: [ID 801593 kern.notice] Error: Path Bus 3076 Tgt 500009740825F959 Lun 5 to 000292602430 is dead.
Nov 13 14:17:10 host1 emcp: [ID 801593 kern.notice] Error: Path Bus 3076 Tgt 500009740825F959 Lun 4 to 000292602430 is dead.
Nov 13 14:17:10 host1 emcp: [ID 801593 kern.notice] Error: Path Bus 3076 Tgt 500009740825F959 Lun 1 to 000292602430 is dead.
Nov 13 14:17:10 host1 emcp: [ID 801593 kern.notice] Error: Path Bus 3076 Tgt 500009740825F959 Lun 2 to 000292602430 is dead.
Nov 13 14:17:22 host1 fctl: [ID 517869 kern.warning] WARNING: fp(4)::OFFLINE timeout
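
When comparing occurrences across hosts or over time, it helps to pull the three register values out of the "Host Error Attention" message. The following is a minimal sketch, assuming the message format shown in the log above; the function name parse_host_error_attention is hypothetical and the regular expression is based only on the lines in this document.

```python
import re

# Matches the emlxs instance number and the three hexadecimal values of a
# "420: Adapter hardware error. (Host Error Attention: ...)" message.
HEA_PATTERN = re.compile(
    r"emlxs(\d+):\s*ERROR:\s*420:.*?Host Error Attention:\s*"
    r"status=\s*0x([0-9a-fA-F]+)\s+"
    r"status1=\s*0x([0-9a-fA-F]+)\s+"
    r"status2=\s*0x([0-9a-fA-F]+)"
)


def parse_host_error_attention(line):
    """Return the instance and register values as integers, or None."""
    match = HEA_PATTERN.search(line)
    if match is None:
        return None
    inst, status, status1, status2 = match.groups()
    return {
        "instance": int(inst),
        "status": int(status, 16),
        "status1": int(status1, 16),
        "status2": int(status2, 16),
    }


LOG_LINE = (
    "Nov 13 14:15:52 host1 emlxs: [ID 349649 kern.info] [13.1208]emlxs1:  "
    "ERROR: 420: Adapter hardware error. (Host Error Attention: "
    "status=0x40000000 status1=0x93994 status2=0x6000000d)"
)

parsed = parse_host_error_attention(LOG_LINE)
print({k: (hex(v) if k != "instance" else v) for k, v in parsed.items()})
```

The sketch only extracts the values. It does not interpret them, because this document gives a decode for a single status2 value only (see the Cause section).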

 

As a result, FMA reports these errors:

bash-3.2$ more fmdump-e.out
TIME                 CLASS
Nov 13 14:15:54.8141 ereport.io.device.inval_state
Nov 13 14:15:54.8142 ereport.io.service.lost

  

bash-3.2$ more fmdump-eV.out

Nov 13 2013 14:15:54.814112899 ereport.io.device.inval_state
nvlist version: 0
        class = ereport.io.device.inval_state
        ena = 0xab7324acf9e09c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@500/pci@0/pci@9/SUNW,emlxs@0
        (end detector)

        __ttl = 0x1
        __tod = 0x5283515a 0x30866083

Nov 13 2013 14:15:54.814246955 ereport.io.service.lost
nvlist version: 0
        class = ereport.io.service.lost
        ena = 0xab7324cdc5609c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@500/pci@0/pci@9/SUNW,emlxs@0
        (end detector)

        __ttl = 0x1
        __tod = 0x5283515a 0x30886c2b
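
The two __tod words in each ereport appear to be seconds since the Unix epoch followed by nanoseconds; this interpretation is inferred from the values above, which match the printed timestamps. The following minimal sketch shows the arithmetic; decode_tod is a hypothetical helper name.

```python
from datetime import datetime, timezone


def decode_tod(sec_hex, nsec_hex):
    """Convert the two __tod hex words into (seconds, nanoseconds, UTC time)."""
    seconds = int(sec_hex, 16)
    nanoseconds = int(nsec_hex, 16)
    utc_time = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return seconds, nanoseconds, utc_time


# Values from the first ereport above.
seconds, nanoseconds, utc_time = decode_tod("0x5283515a", "0x30866083")
print(seconds, nanoseconds, utc_time.isoformat())
```

The nanosecond word decodes to 814112899, the fractional part of the 14:15:54.814112899 timestamp printed by fmdump. The seconds word decodes to 10:15:54 UTC on Nov 13 2013, which is consistent with the host clock being four hours ahead of UTC.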

 

NOTE. Observation of the problem shows that when the server is rebooted the link comes online, but after a few minutes it fails again with the same errors.

Changes

 
On the FC switch port to which this HBA is connected, the "Invalid_word" counter is high and keeps increasing while the link is online.

BROCADE-SW1:admin> portshow 4/16
portIndex: 28
portName: host1 - 10000000xxxxxxx4
portHealth: No Fabric Watch License

Authentication: None
portDisableReason: None
portCFlags: 0x1
portFlags: 0x4001 PRESENT U_PORT LED
LocalSwcFlags: 0x0
portType: 10.0
portState: 2 Offline
Protocol: FC
portPhys: 4 No_Light portScn: 2 Offline
port generation number: 5018
state transition count: 22

portId: 082b00
portIfId: 43220013
portWwn: 20:1c:00:05:1e:36:00:02
portWwn of device(s) connected:

Distance: normal
portSpeed: N4Gbps

LE domain: 0
FC Fastwrite: OFF
Interrupts: 216 Link_failure: 4 Frjt: 0
Unknown: 3 Loss_of_sync: 34 Fbsy: 0
Lli: 160 Loss_of_sig: 59
Proc_rqrd: 530 Protocol_err: 0
Timed_out: 0 Invalid_word: 533481 <<<----
Rx_flushed: 0 Invalid_crc: 0
Tx_unavail: 0 Delim_err: 0
Free_buffer: 0 Address_err: 0
Overrun: 0 Lr_in: 11
Suspended: 0 Lr_out: 2
Parity_err: 0 Ols_in: 2
2_parity_err: 0 Ols_out: 11
CMI_bus_err: 0
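
To confirm that a counter is still increasing, two portshow samples taken some time apart can be compared. The following is a minimal sketch, assuming both samples have been saved as text; parse_counters and increased are hypothetical helper names, and the second sample is invented for illustration (only the first comes from the output above).

```python
import re


def parse_counters(text):
    """Return {name: value} for every 'name: integer' pair in portshow text."""
    counters = {}
    for line in text.splitlines():
        # The lookahead rejects values such as '082b00' or '10.0'.
        for name, value in re.findall(r"(\w+):\s+(\d+)(?=\s|$)", line):
            counters[name] = int(value)
    return counters


def increased(before, after):
    """Return the counters that grew between two samples, with their deltas."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


FIRST_SAMPLE = """\
Interrupts: 216 Link_failure: 4 Frjt: 0
Unknown: 3 Loss_of_sync: 34 Fbsy: 0
Timed_out: 0 Invalid_word: 533481
Rx_flushed: 0 Invalid_crc: 0
"""

# Hypothetical second sample: only Invalid_word has changed.
SECOND_SAMPLE = FIRST_SAMPLE.replace("533481", "533999")

deltas = increased(parse_counters(FIRST_SAMPLE), parse_counters(SECOND_SAMPLE))
print(deltas)
```

A counter that grows only while the link is online points at the link itself (HBA, cable, SFP or switch port) rather than at a historical event.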

 

Cause

1) Failed Oracle Emulex FC HBA or Bug 24320491 - LPe12002-S ERROR: 420:Adapter hardware error.
  

emlxs1:  ERROR: 420: Adapter hardware error. (HS_FFER1 cleared)
emlxs1:  ERROR: 420: Adapter hardware error. (Host Error Attention: status=0x40000000 status1=0x93994 status2=0x6000000d)

This indicates that an interrupt has occurred and
the status register indicates a nonrecoverable hardware error.
This error usually indicates a hardware problem with the adapter.
Try running adapter diagnostics.
Report these errors to customer service.

From a recent Bug 24320491 - LPe12002-S ERROR: 420:Adapter hardware error.

From the error code in the emlxs driver messages we can state that the
following does indicate a parity error occurred,

(Host Error Attention: status=0x40000000 status1= 0x9ee1a4 status2=0x6000000e)

Specifically, a trap code of "0x6000000e" indicates that a parity error
was hit and detected by Saturn's memory controller (a chip on the Emulex FC HBA).

In the case of a parity error, Emulex does recommend that an adapter be reset
and put back into service, due to the possibility this was a one time
occurrence due to environmental factors, and if a parity error reoccurs then
the adapter should be RMA'd if under warranty, due to the possibility the
errors are occurring due to hardware failure. Firmware dumps can be reviewed
to note any hardware issues.
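
The Emulex guidance quoted above can be written down as a small decision rule. This is only a sketch under stated assumptions: 0x6000000e is the only status2 value that this document decodes, so every other value is treated as unknown, and the function name recommend and its return strings are hypothetical.

```python
# The only trap code decoded in this document: a parity error detected by
# the memory controller on the HBA.
PARITY_TRAP_CODE = 0x6000000E


def recommend(status2_history):
    """Suggest an action from the status2 values seen so far on one adapter."""
    parity_hits = sum(1 for code in status2_history if code == PARITY_TRAP_CODE)
    if parity_hits == 0:
        # No decode is available here for other codes.
        return "unknown code: collect a firmware dump and contact support"
    if parity_hits == 1:
        # Possibly a one-time event caused by environmental factors.
        return "reset the adapter and put it back into service"
    # A recurring parity error suggests a hardware failure.
    return "replace the adapter"


print(recommend([0x6000000E]))
print(recommend([0x6000000E, 0x6000000E]))
print(recommend([0x6000000D]))
```

Keeping the unknown case separate matters because the log in the Symptoms section shows status2=0x6000000d, which is not the documented parity code.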

 

 

2) Other cases with a similar failure have been found in which the FC HBA was not replaced and continues to work with no issues:

Feb 14 06:39:07 server01 emlxs: [ID 349649 kern.info] [13.1225]emlxs1:  ERROR: 420: Adapter hardware error. (HS_FFER1 cleared)
Feb 14 06:39:07 server01 emlxs: [ID 349649 kern.info] [13.123F]emlxs1:  ERROR: 420: Adapter hardware error. (Host Error Attention: status=0x20000000 status1=0x1e78 status2=0x168200)
Feb 14 06:39:07 server01 emlxs: [ID 349649 kern.info] [ 5.0401]emlxs1: NOTICE: 710: Link down.
Feb 14 06:39:09 server01 emlxs: [ID 349649 kern.info] [ 6.0987]emlxs1:WARNING: 231: Adapter shutdown. (Reboot required.)
Feb 14 06:39:12 server01 emlxs: [ID 349649 kern.info] [13.0315]emlxs1: NOTICE: 200: Adapter initialization. (Firmware update not needed.)
Feb 14 06:39:39 server01 genunix: [ID 408114 kern.info] /pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0,1 (emlxs1) down
Feb 14 06:39:40 server01 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-0A, TYPE: Fault, VER: 1, SEVERITY: Critical
Feb 14 06:39:40 server01 EVENT-TIME: Sat Feb 14 06:39:39 CET 2015
Feb 14 06:39:40 server01 PLATFORM: unknown, CSN: unknown, HOSTNAME: server01
Feb 14 06:39:40 server01 SOURCE: eft, REV: 1.16
Feb 14 06:39:40 server01 EVENT-ID: d1317dc3-aec4-4283-8da7-c859d4a1307d
Feb 14 06:39:40 server01 DESC: A problem was detected for a PCIEX device.
Feb 14 06:39:40 server01 AUTO-RESPONSE: One or more device instances may be disabled
Feb 14 06:39:40 server01 IMPACT: Loss of services provided by the device instances associated with this fault
Feb 14 06:39:40 server01 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/PCIEX-8000-0A for the latest service procedures and policies regarding this diagnosis.
Feb 14 06:39:43 server01 genunix: [ID 631017 kern.notice] NOTICE: Device: already retired: /pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0,1
Feb 14 06:39:53 server01 genunix: [ID 390243 kern.info] Creating /etc/devices/retire_store
Feb 14 06:40:05 server01 emlxs: [ID 349649 kern.info] [13.0315]emlxs1: NOTICE: 200: Adapter initialization. (Firmware update not needed.)

 

The associated FMA fault:

Feb 14 06:39:39 d1317dc3-aec4-4283-8da7-c859d4a1307d  PCIEX-8000-0A  Critical

Problem Status    : isolated
Diag Engine       : eft / 1.16
System
   Manufacturer  : unknown
   Name          : unknown
   Part_Number   : unknown
   Serial_Number : unknown
   Host_ID       : 84fafc23

----------------------------------------
Suspect 1 of 1 :
  Fault class : fault.io.pciex.device-interr
  Certainty   : 100%
  Affects     : dev:////pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0,1
  Status      : faulted and taken out of service

  FRU
    Location         : "PCIE4"
    Manufacturer     : unknown
    Name             : unknown
    Part_Number      : unknown
    Revision         : unknown
    Serial_Number    : unknown
    Chassis
       Manufacturer  : Oracle Corporation
       Name          : SPARC T5-4
       Part_Number   : 31930909+7+1
       Serial_Number : AK00117917
       Status        : faulty

Description : A problem was detected for a PCIEX device.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
             this fault

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
             Please refer to the associated reference document at
             http://support.oracle.com/msg/PCIEX-8000-0A for the latest
             service procedures and policies regarding this diagnosis.

 

Clear the error; see: How to repair FMA module errors seen in 'fmadm faulty' (Doc ID 1332409.1)

# fmadm repaired dev:////pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0,1
# fmadm acquit d1317dc3-aec4-4283-8da7-c859d4a1307d
# fmadm flush PCIE4

Then reboot the server and check that the device has been removed from the retired list:
# strings /etc/devices/retire_store
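
The three fmadm commands above take their arguments from the 'fmadm faulty' output: the "Affects" FMRI, the event UUID and the FRU "Location" label. The following is a minimal sketch that extracts those fields from saved output and prints the commands for review. It does not run anything, and build_clear_commands is a hypothetical helper name.

```python
import re


def build_clear_commands(fmadm_faulty_text):
    """Return the three fmadm commands for the fault described in the text."""
    uuid = re.search(
        r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b", fmadm_faulty_text
    )
    fmri = re.search(r"Affects\s*:\s*(\S+)", fmadm_faulty_text)
    label = re.search(r'Location\s*:\s*"([^"]+)"', fmadm_faulty_text)
    if not (uuid and fmri and label):
        raise ValueError("could not find the UUID, Affects and Location fields")
    return [
        f"fmadm repaired {fmri.group(1)}",
        f"fmadm acquit {uuid.group(0)}",
        f"fmadm flush {label.group(1)}",
    ]


# Excerpt of the 'fmadm faulty' output shown above.
FAULT_TEXT = """\
Feb 14 06:39:39 d1317dc3-aec4-4283-8da7-c859d4a1307d  PCIEX-8000-0A  Critical
  Fault class : fault.io.pciex.device-interr
  Affects     : dev:////pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0,1
  FRU
    Location         : "PCIE4"
"""

commands = build_clear_commands(FAULT_TEXT)
for command in commands:
    print("#", command)
```

Printing the commands instead of executing them leaves the decision to clear the fault with the administrator.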

Before rebooting the server, the FC HBA can be reset to see whether this makes it work again:

a) Use luxadm -e offline/online

# luxadm -e port
/devices/pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0/fp@0,0:devctl  NOT CONNECTED
/devices/pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0:devctl  CONNECTED
 
# luxadm -e offline /devices/pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0

# luxadm -e port
/devices/pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0:devctl  CONNECTED

# luxadm -e online /devices/pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0

# luxadm -e port
/devices/pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0/fp@0,0:devctl  CONNECTED  <-- it worked; the port is connected again
/devices/pci@340/pci@1/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0:devctl  CONNECTED

  
OR

b) Use the OneCommand Manager utility from Emulex.

Select the failed port and choose "Reset FC port". A dialog box appears with this message:

"Resetting a boot adapter may cause system instability.
Emulex assumes no responsibility for the consequences
of reseting a boot adapter.

Do you want to continue?

Yes   No  " --> select Yes


If the problem persists after the reboot, replace the FC HBA.

 

 

Please refer to emlxs_messages.h, which contains the messages, located here:
https://grok.cz.oracle.com/source/xref/nws10-patch-clone/src/sun_nws/emlxs/hdrs/emlxs_messages.h


/* MESSAGE defines */
#ifdef DEF_MSG_REPORT
emlxs_msg_t emlxs_message[] =
{
#endif /* DEF_MSG_REPORT */

...


DEFINE_MSG(420, \
    emlxs_hardware_error_msg, \
    "Adapter hardware error.", \
    EMLXS_ERROR, \
    MSG_SLI, \
    "This indicates that an interrupt has occurred and the " \
    "status register indicates a nonrecoverable hardware ", \
    "error. This error usually indicates a hardware problem " \
    "with the adapter. Try running adapter diagnostics. Report "\
    "these errors to customer service.", \
    NULL, \
    0)



The status values come directly from the HBA registers. No decode for them was found in the driver source code.
Decoding them most likely requires the chip/firmware documentation, which is probably not published by Oracle,
so Emulex may need to be asked:


        status =
            READ_CSR_REG(hba, FC_HS_REG(hba));

    status1 =
        READ_SLIM_ADDR(hba,
        ((volatile uint8_t *)hba->sli.sli3.slim_addr + 0xa8));
    status2 =
        READ_SLIM_ADDR(hba,
        ((volatile uint8_t *)hba->sli.sli3.slim_addr + 0xac));

    EMLXS_MSGF(EMLXS_CONTEXT, &emlxs_hardware_error_msg,
        "Host Error Attention: "
        "status=0x%x status1=0x%x status2=0x%x",
        status, status1, status2);

 

 

Solution

If you are facing the scenario presented above, the recommended action is to collect a firmware dump (if possible) to troubleshoot the problem further (see below)
and, depending on the error and the number of times the problem has occurred, to replace the Oracle Emulex FC HBA.

Other cases with a similar failure have been found in which the FC HBA was not replaced and continues to work with no issues (see the Cause section).

For example, in the case of a parity error (indicated by a trap code of "0x6000000e"),
Emulex recommends that the adapter be reset and put back into service,
because this may have been a one-time occurrence caused by environmental factors.
If a parity error recurs, the adapter should be replaced,
because the errors may be caused by a hardware failure.

 

Note. To troubleshoot this problem further, a firmware dump can be collected and sent
to Oracle Support for analysis to identify any hardware issues.

Please download and install OCM version 11.1.218.18-1 for your OS from:

https://www.broadcom.com/support/oem/oracle-fc/fibre-channel-8gb/sg-xpcie2fc-em8-z

In this particular case (the FW-detected parity error), reboot if there's a
single occurrence; if there are multiple occurrences, replace the HBA.

Getting a useful dump: In order to help us provide you with the best possible support, please download and install OCM as directed above.

Installing this CLI starts the elxhbamgrd daemon process, which ensures that, in the unlikely event of a failure,
a usable firmware (FW) dump is available to send to Oracle and Emulex.

This is needed for all operating systems, and the default location for the firmware dump varies by OS:

- Windows : In the Dump directory under the OneCommand Manager Installation Directory \Util\Dump\
- Solaris : /opt/ELXocm/Dump
- Linux : /var/log/emulex/ocmanager/Dump
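
A collection script can use the default locations listed above to find the dump directory. The following is a minimal sketch; default_dump_dir and list_dumps are hypothetical helper names. The Windows branch takes the OneCommand Manager installation directory as a parameter because this document does not state where it is, and the path in the example call is an assumption.

```python
import ntpath
import os
import sys


def default_dump_dir(platform=sys.platform, ocm_install_dir=None):
    """Return the default OCM firmware dump directory for a platform string."""
    if platform.startswith("sunos"):
        return "/opt/ELXocm/Dump"
    if platform.startswith("linux"):
        return "/var/log/emulex/ocmanager/Dump"
    if platform.startswith("win"):
        if ocm_install_dir is None:
            raise ValueError("the OCM installation directory must be supplied")
        # ntpath builds a Windows-style path on any host OS.
        return ntpath.join(ocm_install_dir, "Util", "Dump")
    raise ValueError(f"no default dump directory known for {platform!r}")


def list_dumps(directory):
    """Return the files in the dump directory, newest first (empty if absent)."""
    if not os.path.isdir(directory):
        return []
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [path for path in paths if os.path.isfile(path)]
    return sorted(files, key=os.path.getmtime, reverse=True)


print(default_dump_dir("sunos5"))
print(default_dump_dir("linux"))
# The installation directory below is a made-up example.
print(default_dump_dir("win32", r"C:\OCM"))
```

Sorting by modification time puts the dump that belongs to the most recent failure first, which is usually the one to send for analysis.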

The process for creating and collecting firmware dumps after a fatal firmware error varies by adapter generation, but in all cases OCM
(OneCommand Manager / hbacmd) 11.1.218.x or higher must be installed.

In the case of the 8Gb adapters, manual FW dumps apparently do not collect fatal FW errors,
and they are not stored on the HBA, which has no flash memory for that purpose.
Instead, when a fatal error is detected, the firmware tells the driver, which notifies the elxhbamgrd daemon to collect the dump and place it in the OCM dump directory.

 

 

Some other notes from Emulex:

If a hardware error is noted in the emlxs driver messages but a dump is not
present in the OS's filesystem (in /opt/ELXocm/Dump/), possibly because OCM was not installed, the
following workaround should be attempted.

OCM must be installed. Existing FW dumps can be captured by restarting the OCM
daemon with the following commands:

1. /opt/ELXocm/stop_ocmanager
2. /opt/ELXocm/start_ocmanager
3. The dump file will be located in /opt/ELXocm/Dump/

Server reboots will also collect firmware dumps to /opt/ELXocm/Dump/

Firmware dumps initiated through OCM (with hbacmd)
collect the current state of the adapter and erase any dump that existed in the adapter.

A bug (now closed) was filed to address this issue with hbacmd; see:

Bug 24450164 - Firmware dump collected by emlxs driver in kernel needs to be usuable.

This bug was opened after the comments made by Emulex on this other bug:

Bug 24320491 - LPe12002-S ERROR: 420:Adapter hardware error.  --> this bug has been closed as not reproducible

References

<BUG:24320491> - LPE12002-S ERROR: 420:ADAPTER HARDWARE ERROR.
<BUG:24450164> - FIRMWARE DUMP COLLECTED BY EMLXS DRIVER IN KERNEL NEEDS TO BE USUABLE
<NOTE:1629921.1> - How To Get a Firmware Dump From an Emulex FC HBA
<NOTE:1356876.1> - Firmware Update Required. (A Manual Hba Reset Or Link Reset (Using Luxadm Or Fcadm) Is Required
<BUG:18940856> - ADAPTER HARDWARE ERROR / FMA ERROR
<NOTE:1399644.1> - How to Locate FC HBA Manual to Get Oracle Fibre Channel (FC) HBA Port LED Patterns and Other HBA information

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.