Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1926283.1
Update Date:2018-03-07
Keywords:

Solution Type  Problem Resolution Sure

Solution  1926283.1 :   FC HBA Replacement Error - cfgadm: Hardware specific failure: configure failed  


Related Items
  • Sun Storage FC HBA
  • Qlogic FC HBA
  • Emulex FC HBA
  • Sun SPARC Enterprise M5000 Server
  • Solaris Operating System
Related Categories
  • PLA-Support>Sun Systems>DISK>HBA>SN-DK: FC HBA




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-9567926061>

Applies to:

Sun SPARC Enterprise M5000 Server - Version All Versions and later
Qlogic FC HBA - Version Not Applicable and later
Emulex FC HBA - Version Not Applicable and later
Sun Storage FC HBA - Version Not Applicable and later
Solaris Operating System - Version 8.0 and later
Information in this document applies to any platform.

Symptoms

The M5000 server01, running Solaris 10, had two FC HBAs installed; c2 had to be replaced because of a hardware problem.

FOUND PATH TO 2 LEADVILLE HBA/CNA PORTS IN EXPLORER

C#  INST#  PORT WWN          MODEL            FCODE    STATUS         DEVICE PATH
--  -----  --------          -----            -----    ------         -----------
c0  qlc0   210000e08b94123a  SG-XPCIE1FC-QF4   2.01     CONNECTED      /pci@1,700000/SUNW,qlc@0
c2  qlc1   210000e08b94567b  SG-XPCIE1FC-QF4   2.01     CONNECTED      /pci@2,600000/SUNW,qlc@0  <<- to replace, located in iou#0-pci#3



The FC HBA was still operational and was replaced online, with Solaris up and running.
It was replaced using the following procedure:

1) Check status

# cfgadm -al
      iou#0-pci#0                    unknown      empty        unconfigured unknown
      iou#0-pci#1                    etherne/hp   connected    configured   ok
      iou#0-pci#2                    fibre/hp     connected    configured   ok
      iou#0-pci#3                    fibre/hp     connected    configured   ok  <<--OK
      iou#0-pci#4                    unknown      empty        unconfigured unknown
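
To check a single slot without reading the whole listing, the relevant columns can be filtered out of the `cfgadm -al` output. The following is a minimal sketch, not an Oracle tool: `slot_state` is a hypothetical helper name, and it assumes the standard five-column layout (Ap_Id, Type, Receptacle, Occupant, Condition) and an awk that accepts `-v` (on Solaris 10 that would be `nawk` or `/usr/xpg4/bin/awk`).

```shell
# slot_state AP_ID: read `cfgadm -al` output on stdin and print
# "<receptacle> <occupant>" for the matching attachment point.
# Hypothetical helper, not part of Solaris.
slot_state() {
    awk -v ap="$1" '$1 == ap { print $3, $4 }'
}

# Example against one line of the listing shown in step 1:
printf '%s\n' 'iou#0-pci#3  fibre/hp  connected  configured  ok' |
    slot_state 'iou#0-pci#3'
```

On a live domain the same helper would be fed directly, for example `cfgadm -al | slot_state 'iou#0-pci#3'`.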


2) Unconfigure the card

# cfgadm -c unconfigure iou#0-pci#3
# cfgadm -al
      iou#0-pci#0                    unknown      empty        unconfigured unknown
      iou#0-pci#1                    etherne/hp   connected    configured   ok
      iou#0-pci#2                    fibre/hp     connected    configured   ok
      iou#0-pci#3                    unknown      connected    unconfigured unknown  <<-- now unconfigured
      iou#0-pci#4                    unknown      empty        unconfigured unknown



3) The Oracle Field Engineer physically replaced the FC HBA, which produced errors and an FMA fault

Note: After step 2, "cfgadm -c disconnect iou#0-pci#3" was not run on the PCI device before the card was pulled; this could be the reason for the FMA fault.
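
A simple guard can make the missing disconnect harder to overlook. The sketch below is only an illustration: `safe_to_pull` is a hypothetical name, it reads `cfgadm -al` output on stdin, and it assumes that a slot may be pulled only when its receptacle is no longer "connected" and its occupant is "unconfigured".

```shell
# safe_to_pull AP_ID: succeed (exit 0) only if the slot is listed, its
# receptacle is not "connected", and its occupant is "unconfigured".
# Hypothetical helper; the rule it encodes is an assumption of this sketch.
safe_to_pull() {
    awk -v ap="$1" '
        $1 == ap { found = 1; ok = ($3 != "connected" && $4 == "unconfigured") }
        END      { if (found && ok) exit 0; exit 1 }
    '
}

# The state recorded after step 2 (unconfigured but still connected)
# fails the check:
printf '%s\n' 'iou#0-pci#3  unknown  connected  unconfigured  unknown' |
    safe_to_pull 'iou#0-pci#3' || echo "do not remove: slot still connected"
```

Used as `cfgadm -al | safe_to_pull 'iou#0-pci#3'`, the exit status can gate the physical step in a checklist script.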



...at this point the card was removed

Sep  9 13:52:47 server01 pcie: [ID 126225 kern.notice] NOTICE: pciehpc (px2): card is removed from the slot iou#0-pci#3
Sep  9 13:52:48 server01 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SUNOS-8000-FU, TYPE: Defect, VER: 1, SEVERITY: Major
Sep  9 13:52:48 server01 EVENT-TIME: Tue Sep  9 13:52:48 MEST 2014
Sep  9 13:52:48 server01 PLATFORM: SUNW,SPARC-Enterprise, CSN: BCF1018012, HOSTNAME: server01
Sep  9 13:52:48 server01 SOURCE: eft, REV: 1.16
Sep  9 13:52:48 server01 EVENT-ID: af575bdb-e6f5-6cf2-9123-c4127e0ff4ba
Sep  9 13:52:48 server01 DESC: The diagnosis engine encountered telemetry for which it was unable to perform a diagnosis.  Refer to http://sun.com/msg/SUNOS-8000-FU for more information.
Sep  9 13:52:48 server01 AUTO-RESPONSE: Error reports have been logged for examination by Sun.
Sep  9 13:52:48 server01 IMPACT: Automated diagnosis and response for these events will not occur.
Sep  9 13:52:48 server01 REC-ACTION: Ensure that the latest Solaris Kernel and Predictive Self-Healing (PSH) patches are installed.



...the FMA fault in detail:

bash-3.2$ more fmadm-faulty.out
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 09 13:52:48 af575bdb-e6f5-6cf2-9123-c4127e0ff4ba  SUNOS-8000-FU  Major

Host        : server01
Platform    : SUNW,SPARC-Enterprise     Chassis_id  : BCF1018012
Product_sn  :

Fault class : defect.sunos.eft.undiag.fme
FRU         : None
                 faulty

Description : The diagnosis engine encountered telemetry for which it was
             unable to perform a diagnosis.  Refer to
             http://sun.com/msg/SUNOS-8000-FU for more information.

Response    : Error reports have been logged for examination by Sun.

Impact      : Automated diagnosis and response for these events will not occur.

Action      : Ensure that the latest Solaris Kernel and Predictive Self-Healing
             (PSH) patches are installed.

bash-3.2$
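
The EVENT-ID is the value later passed to `fmadm repair`. As a rough sketch, it can be pulled out of saved `fmadm faulty` output with standard tools; `fault_uuids` is a hypothetical helper name, and the pattern simply assumes the usual 8-4-4-4-12 lowercase hexadecimal UUID form seen above.

```shell
# fault_uuids: read `fmadm faulty` output on stdin and print each
# EVENT-ID (UUID) once. Hypothetical helper for illustration only.
fault_uuids() {
    sed -n 's/.*\([0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}\).*/\1/p' |
        sort -u
}

# Example against the summary line shown above:
printf '%s\n' 'Sep 09 13:52:48 af575bdb-e6f5-6cf2-9123-c4127e0ff4ba  SUNOS-8000-FU  Major' |
    fault_uuids
```

The helper only lists the IDs; each fault should still be reviewed before it is cleared.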


...with some associated events that point to /pci@2,600000


bash-3.2$ more fmdump-e.out
...
Sep 09 13:52:47.9385 ereport.io.fire.pec.fcp
Sep 09 13:52:47.9385 ereport.io.fire.pec.te
Sep 09 13:52:47.9385 ereport.io.fire.pec.ldn
Sep 09 13:52:47.9385 ereport.io.pci.sserr
Sep 09 13:52:47.9385 ereport.io.pciex.pl.te
Sep 09 13:52:47.9385 ereport.io.pciex.tl.fcp
Sep 09 13:52:47.9385 ereport.io.pci.sserr
Sep 09 13:52:47.9385 ereport.io.pciex.pl.te
Sep 09 13:52:47.9385 ereport.io.pciex.tl.fcp
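
When the error log is long, counting the ereport classes gives a quicker overview than reading it line by line. This is a minimal sketch that works on saved `fmdump -e` output; `ereport_counts` is a hypothetical name, and it assumes the class is the last field on each line, as in the excerpt above.

```shell
# ereport_counts: read `fmdump -e` output on stdin and print
# "<count> <class>" per ereport class, most frequent first.
# Hypothetical helper for illustration only.
ereport_counts() {
    awk '$NF ~ /^ereport\./ { n[$NF]++ }
         END { for (c in n) print n[c], c }' |
        sort -k1,1nr -k2,2
}

# Typical use against the explorer file shown above:
#   ereport_counts < fmdump-e.out
```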


...in more detail

bash-3.2$ more fmdump-eV.out
...
Sep 09 2014 13:52:47.938510300 ereport.io.fire.pec.fcp
nvlist version: 0
       class = ereport.io.fire.pec.fcp
       ena = 0x22b5bd1722002c01
       detector = (embedded nvlist)
       nvlist version: 0
               version = 0x0
               scheme = dev
               device-path = /pci@2,600000
       (end detector)

       primary = 1
       tlu-uele = 0x1fffff
       tlu-uie = 0x1fffff001fffff
       tlu-uis = 0x100002000
       tlu-uess = 0x100002000
       __ttl = 0x1
       __tod = 0x540eea0f 0x37f087dc
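
To confirm which device the ereports point at, the detector paths can be collected from the verbose output. The snippet below is a sketch under the assumption that each detector is printed as a `device-path = <path>` line, as in the record above; `ereport_device_paths` is a hypothetical helper name.

```shell
# ereport_device_paths: read `fmdump -eV` output on stdin and print
# each distinct detector device-path once.
# Hypothetical helper for illustration only.
ereport_device_paths() {
    awk '$1 == "device-path" && $2 == "=" { print $3 }' | sort -u
}

# Typical use against the explorer file shown above:
#   ereport_device_paths < fmdump-eV.out
```

A single path in the result, as with /pci@2,600000 here, suggests that all events came from the same leaf.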



...here the FC HBA was inserted

Sep  9 13:59:30 server01 pcie: [ID 661617 kern.notice] NOTICE: pciehpc (px2): card is inserted in the slot iou#0-pci#3



4) The attempt to configure the new card failed:

# cfgadm -c configure iou#0-pci#3
cfgadm: Hardware specific failure: configure failed


The messages log shows this error associated with the previous command:

Sep  9 14:11:04 server01 genunix: [ID 396655 kern.warning] WARNING: (px2): failed to probe the Connection iou#0-pci#3



Trying again with the force option gave the same error:

# cfgadm -f -c configure iou#0-pci#3
cfgadm: Hardware specific failure: configure failed  

...so the new PCI card fails to be configured.



On the XSCF, the FC HBA card in PCI#3 is present but has no Name_Property:

M5000 Active XSCFU/XCP1116

- showhardconf.out
 IOU#0 Status:Normal; Ver:0101h; Serial:BF08412AAA  ;
        + FRU-Part-Number:CF00541-2240 03   /541-2240-03          ;
        + Type:1;
        DDC_A#0 Status:Normal;
        DDCR Status:Normal;
            DDC_B#0 Status:Normal;
        PCI#1 Name_Property:network; Card_Type:Other;
        PCI#2 Name_Property:SUNW,qlc; Card_Type:Other;
        PCI#3 Name_Property:; Card_Type:Other;   <<< new HBA should be "Name_Property:SUNW,qlc"
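
Slots in this state can be picked out of saved `showhardconf` output mechanically. The following sketch assumes the `PCI#n Name_Property:...;` line format shown above; `slots_without_name` is a hypothetical helper name and is not an XSCF command.

```shell
# slots_without_name: read `showhardconf` output on stdin and print the
# PCI slots whose Name_Property field is empty.
# Hypothetical helper for illustration only.
slots_without_name() {
    awk '$1 ~ /^PCI#[0-9]+$/ && $2 == "Name_Property:;" { print $1 }'
}

# Typical use against the saved output shown above:
#   slots_without_name < showhardconf.out
```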

 

At this point, before rebooting the server, the following was tried:

A)
1. Clear FMA fault
# fmadm repair af575bdb-e6f5-6cf2-9123-c4127e0ff4ba

2. Try to configure the new FC HBA:

 # cfgadm -c connect iou#0-pci#3
 # cfgadm -c configure iou#0-pci#3

--> did not work, same errors and Name_Property on XSCF did not change


B)
0. Check that there are no FMA faults. If there are any, clear them as before.

1. Unconfigure and disconnect the FC HBA

 # cfgadm -c unconfigure iou#0-pci#3
 # cfgadm -c disconnect iou#0-pci#3

2. Try to configure the new FC HBA:

 # cfgadm -c connect iou#0-pci#3
 # cfgadm -c configure iou#0-pci#3

--> did not work, same errors and Name_Property on XSCF did not change


C)
0. Check that there are no FMA faults. If there are any, clear them as before.

1. Unconfigure and disconnect the FC HBA

 # cfgadm -c unconfigure iou#0-pci#3
 # cfgadm -c disconnect iou#0-pci#3

2. At this point, perform a dummy replacement of the FC HBA:

- Push the lever to the right to unseat the cassette
- Pull the cassette out of the slot without removing it completely, wait 1 minute, and push it back in
- Push the lever to the left to seat the cassette


3. Try to configure the new FC HBA:

 # cfgadm -c configure iou#0-pci#3

--> did not work, same errors and Name_Property on XSCF did not change
 

Cause

Overall, the issue appears to be due to an incomplete procedure when
the card in iou#0-pci#3 was replaced. A "disconnect" is required when
dynamically replacing the card, in order to isolate it from Solaris
and turn off power to the card.

FMA fault SUNOS-8000-FU and its associated ereports on Sep 09 2014 13:52:48
seem to be an effect of the wrong procedure, i.e. pulling out the card
while it was NOT "disconnected".

For the FMA fault, 'fmadm repair <uuid>' should be enough.
To get the new FC HBA configured in the same slot, a reboot of domain server01 would be required.

 

The active replacement of a PCI cassette requires the following steps:

Verify with the customer that there is no I/O activity on the card in the cassette.

From the Solaris prompt, check the component status:
    # cfgadm -a
    The Ap_Id includes the IOU number (iou#0 or iou#1) and the cassette slot number (pci#1, pci#2, pci#3, pci#4).
Unconfigure the FC HBA:
    # cfgadm -c unconfigure Ap_Id
Disconnect the component (this powers off the PCI FC HBA in the cassette):
    # cfgadm -c disconnect Ap_Id
Verify that the component is now disconnected and unconfigured:
    # cfgadm -a

Physically remove the cassette from the PCI slot and replace the FC HBA inside the cassette.
Once the cassette with the new FC HBA has been inserted, run:

    # cfgadm -c configure Ap_Id
Check that the component is now connected and configured:
    # cfgadm -a
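
The command order above can be kept in a small wrapper so that the disconnect step is not skipped. This is a sketch only: `run`, `prepare_removal`, `finish_insert` and the `DRY_RUN` variable are names introduced here for illustration, and the I/O check and the physical cassette swap remain manual steps between the two functions.

```shell
# run: execute a command, or only print it when DRY_RUN is set.
# DRY_RUN is a variable introduced for this sketch.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Before pulling the cassette: unconfigure, then disconnect, then list.
# Stops at the first command that fails.
prepare_removal() {
    run cfgadm -c unconfigure "$1" || return 1
    run cfgadm -c disconnect "$1"  || return 1
    run cfgadm -a
}

# After inserting the cassette with the new card: configure, then list.
finish_insert() {
    run cfgadm -c configure "$1" || return 1
    run cfgadm -a
}

# Dry run: print the commands for slot iou#0-pci#3 without touching the system.
DRY_RUN=1
prepare_removal 'iou#0-pci#3'
finish_insert 'iou#0-pci#3'
```

With `DRY_RUN` unset the functions would run the real commands as root, so the dry run is a reasonable first check of the Ap_Id.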

 

Solution

A workaround that avoids a server reboot is to configure the FC HBA in another free PCIe slot of the M5000 domain; in this case, PCIe slot 4 was used.
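
Candidate slots for this workaround can be listed from the `cfgadm -al` output. The sketch below assumes the Ap_Id naming (`iou#N-pci#M`) and column layout seen earlier in this document; `empty_slots` is a hypothetical helper name. Whether an empty slot is physically suitable for the card still has to be checked against the server documentation.

```shell
# empty_slots: read `cfgadm -al` output on stdin and print the PCI
# attachment points whose receptacle state is "empty".
# Hypothetical helper for illustration only.
empty_slots() {
    awk '$1 ~ /^iou#[0-9]+-pci#[0-9]+$/ && $3 == "empty" { print $1 }'
}

# Typical use on the domain:
#   cfgadm -al | empty_slots
```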


1. Clear FMA fault

# fmadm repair af575bdb-e6f5-6cf2-9123-c4127e0ff4ba

2. Unconfigure and disconnect the FC HBA

 

# cfgadm -c unconfigure iou#0-pci#3
# cfgadm -c disconnect iou#0-pci#3

3. Move the cassette with the FC HBA from iou#0-pci#3 to iou#0-pci#4

- Push the lever to the right to unseat the cassette
- Remove the cassette from slot 3 and insert it in slot 4
- Push the lever to the left to seat the cassette

4. Try to configure the new FC HBA:

 # cfgadm -c configure iou#0-pci#4

Note: After inserting the cassette in PCIe slot 4, the XSCF did not initially recognize the Name_Property, and this command took a long time (several minutes), but it completed without error.

Now the FC HBA is seen under a new path:
/devices/pci@3,700000/fibre-channel@0/fp@0,0:devctl

root@server01:/root# luxadm -e port
/devices/pci@1,700000/SUNW,qlc@0/fp@0,0:devctl CONNECTED
/devices/pci@3,700000/fibre-channel@0/fp@0,0:devctl NOT CONNECTED
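
Ports that `luxadm -e port` reports as NOT CONNECTED, such as the relocated HBA above, can be filtered out for follow-up. This is a minimal sketch; `ports_not_connected` is a hypothetical helper name, and it assumes the two-column output format shown above.

```shell
# ports_not_connected: read `luxadm -e port` output on stdin and print
# the device paths whose state is NOT CONNECTED.
# Hypothetical helper for illustration only.
ports_not_connected() {
    awk '/NOT CONNECTED[ \t]*$/ { print $1 }'
}

# Typical use on the domain:
#   luxadm -e port | ports_not_connected
```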

 

Note: If the above steps do not work, a server reboot is required to clear this state.

 

References

<NOTE:1399644.1> - How to Locate FC HBA Manual to Get Oracle Fibre Channel (FC) HBA Port LED Patterns and Other HBA information
<NOTE:1012980.1> - Sun SPARC(R) Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers: General troubleshooting running the snapshot command

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.