Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2079496.1
Update Date:2018-03-07
Keywords:

Solution Type  Problem Resolution Sure

Solution  2079496.1 :   FC HBA PCI Card Not Detected by Solaris After T5-4 Firmware Upgrade  


Related Items
  • Sun Storage 8Gb FC PCIe HBA, 2 Port, Emulex
  • SPARC T5-4
  • Qlogic FC HBA
Related Categories
  • PLA-Support>Sun Systems>DISK>HBA>SN-DK: FC HBA




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-11689059251>

Applies to:

SPARC T5-4 - Version All Versions and later
Qlogic FC HBA - Version All Versions and later
Sun Storage 8Gb FC PCIe HBA, 2 Port, Emulex - Version All Versions and later
Information in this document applies to any platform.

Symptoms

A SPARC T5-4 control domain running Solaris 11.1 has four FC HBAs installed in PCIe slots 5, 9, 11 and 13, each assigned to a guest domain.

After a T5-4 firmware upgrade, the PCIe card in slot 5 is no longer visible in prtdiag output:

======================================== IO Devices =======================================
Slot +            Bus   Name +                                   Model  Max Speed  Cur Speed
Status            Type  Path                                            /Width     /Width
-------------------------------------------------------------------------------------------

/SYS/RCSA/PCIE9   PCIE  SUNW,assigned-device-pciex10df,fc40             5.0GT/x8   2.5GT/x8
                        /pci@380/pci@1/pci@0/pci@a/SUNW,assigned-device@0
/SYS/RCSA/PCIE9   PCIE  SUNW,assigned-device-pciex10df,fc40             5.0GT/x8   2.5GT/x8
                        /pci@380/pci@1/pci@0/pci@a/SUNW,assigned-device@0,1

/SYS/RCSA/PCIE11  PCIE  SUNW,assigned-device-pciex10df,fc40             5.0GT/x8   2.5GT/x8
                        /pci@3c0/pci@1/pci@0/pci@e/SUNW,assigned-device@0
/SYS/RCSA/PCIE11  PCIE  SUNW,assigned-device-pciex10df,fc40             5.0GT/x8   2.5GT/x8
                        /pci@3c0/pci@1/pci@0/pci@e/SUNW,assigned-device@0,1

/SYS/RCSA/PCIE13  PCIE  SUNW,assigned-device-pciex10df,fc40             5.0GT/x8   5.0GT/x4
                        /pci@480/pci@1/pci@0/pci@a/SUNW,assigned-device@0
/SYS/RCSA/PCIE13  PCIE  SUNW,assigned-device-pciex10df,fc40             5.0GT/x8   5.0GT/x4
                        /pci@480/pci@1/pci@0/pci@a/SUNW,assigned-device@0,1

 

An FMA fault was generated for PCIE5:

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 11 09:37:16 490c94d7-013f-6b85-f62d-98d2992d8c96 ILOM-8000-1G Major

Problem Status : repaired
Diag Engine : fdd / 1.0
System
  Manufacturer : Oracle Corporation
  Name : SPARC T5-4
  Part_Number : 31930909+8+1
  Serial_Number : AK00117922
  Host_ID : 862dd3a4

----------------------------------------
Suspect 1 of 1 :
  Fault class : fault.fruid.replay
  Certainty : 100%
  Affects : /chassis=0/rcsa=0/pcie=5/car=0

  FRU
  Location : "/SYS/RCSA/PCIE5/CAR"
  Manufacturer : Oracle Corporation
  Name : TLA,CAR,T5-4,T5-8
  Part_Number : 7069814
  Revision : 01
  Serial_Number : 465769T+1311U20L1L
  Chassis
  Manufacturer : Oracle Corporation
  Name : SPARC T5-4
  Part_Number : 31930909+8+1
  Serial_Number : AK00117922
  Status : removed

Description : A Field Replaceable Unit (FRU) in the chassis contains records to
  indicate it is faulty.

Response : The service-required LED may be illuminated on the affected FRU
  and chassis.

Impact : The system may not be able to use one or more components on the
  affected FRU.

Action : Please refer to the associated reference document at
  http://support.oracle.com/msg/ILOM-8000-1G for the latest service
  procedures and policies regarding this diagnosis.

The Solaris virtualization software (LDoms, ldm ls-io output) reports the card status as UNK:

bash-4.1$ more ldm_ls-io.out
NAME                                      TYPE   BUS      DOMAIN   STATUS
----                                      ----   ---      ------   ------
pci_0                                     BUS    pci_0    primary
pci_1                                     BUS    pci_1    primary
pci_2                                     BUS    pci_2    primary  IOV
pci_3                                     BUS    pci_3    primary  IOV
pci_4                                     BUS    pci_4    primary  IOV
pci_5                                     BUS    pci_5    primary
pci_6                                     BUS    pci_6    primary  IOV
pci_7                                     BUS    pci_7    primary
/SYS/RCSA/PCIE1                           PCIE   pci_0    primary  EMP
/SYS/RCSA/PCIE2                           PCIE   pci_0    primary  EMP
/SYS/MB/SASHBA0                           PCIE   pci_0    primary  OCC
/SYS/RIO/NET0                             PCIE   pci_0    primary  OCC
/SYS/RCSA/PCIE3                           PCIE   pci_1    primary  EMP
/SYS/RCSA/PCIE4                           PCIE   pci_1    primary  EMP
/SYS/RCSA/PCIE9                           PCIE   pci_2    nodeB    OCC
/SYS/RCSA/PCIE10                          PCIE   pci_2    primary  EMP
/SYS/RCSA/PCIE11                          PCIE   pci_3    nodeA    OCC
/SYS/RCSA/PCIE12                          PCIE   pci_3    primary  EMP
/SYS/RCSA/PCIE5                           PCIE   pci_4    nodeA    UNK  <<----
/SYS/RCSA/PCIE6                           PCIE   pci_4    primary  EMP
/SYS/RCSA/PCIE7                           PCIE   pci_5    primary  OCC
/SYS/RCSA/PCIE8                           PCIE   pci_5    primary  EMP
/SYS/RCSA/PCIE13                          PCIE   pci_6    nodeB    OCC
/SYS/RCSA/PCIE14                          PCIE   pci_6    primary  EMP
/SYS/RCSA/PCIE15                          PCIE   pci_7    primary  EMP
/SYS/RCSA/PCIE16                          PCIE   pci_7    primary  EMP
/SYS/MB/SASHBA1                           PCIE   pci_7    primary  OCC
/SYS/RIO/NET2                             PCIE   pci_7    primary  OCC


From the ldm man page:

       The STATUS column applies to slots that  accept  plug-in
       cards  as  well  as  to devices on a motherboard and can
       have one of the following values:

           o    UNK - The device in the slot has been  detected  <<---
                by the firmware, but not by the OS.

           o    OCC - The device has been detected on the moth-
                erboard or is a PCIe card in a slot.
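
The UNK check can also be run against a saved copy of the ldm ls-io output. The following is a minimal Python sketch, assuming the five-column layout shown in this document (NAME, TYPE, BUS, DOMAIN, STATUS) with every column populated on PCIE rows. The function name find_unk_slots and the sample text are illustrative only and are not part of any Oracle tool.

```python
def find_unk_slots(ldm_io_text):
    """Return (name, bus, domain) for every PCIE row whose STATUS is UNK.

    Assumption: rows are whitespace-separated as NAME TYPE BUS DOMAIN STATUS,
    as in the listing in this document. Other ldm versions may differ.
    """
    hits = []
    for line in ldm_io_text.splitlines():
        fields = line.split()
        # Header, separator and BUS rows are skipped by the TYPE check.
        if len(fields) >= 5 and fields[1] == "PCIE" and fields[4] == "UNK":
            hits.append((fields[0], fields[2], fields[3]))
    return hits


# Trimmed sample taken from the listing in this document.
SAMPLE = """\
NAME                                      TYPE   BUS      DOMAIN   STATUS
----                                      ----   ---      ------   ------
pci_4                                     BUS    pci_4    primary  IOV
/SYS/RCSA/PCIE11                          PCIE   pci_3    nodeA    OCC
/SYS/RCSA/PCIE5                           PCIE   pci_4    nodeA    UNK
/SYS/RCSA/PCIE6                           PCIE   pci_4    primary  EMP
"""

if __name__ == "__main__":
    for name, bus, domain in find_unk_slots(SAMPLE):
        print(f"{name} on {bus} (domain {domain}) is UNK")
```

A slot flagged this way has been detected by the firmware but not by the OS, which matches the symptom described here.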


The PCIE5 card should be visible in the guest domain, as in the following listing from an earlier Explorer collection, but it no longer appears there (for example, in luxadm or prtdiag output):

lrwxrwxrwx   1 root     root          64 Aug 29  2013 c3 -> ../../devices/pci@400/pci@1/pci@0/pci@e/SUNW,emlxs@0,1/fp@0,0:fc
lrwxrwxrwx   1 root     root          62 Aug 29  2013 c6 -> ../../devices/pci@400/pci@1/pci@0/pci@e/SUNW,emlxs@0/fp@0,0:fc

 

A snapshot of the T5-4 shows:

/SYS/RCSA/PCIE5
   Properties:
       type = Slot
       requested_config_state = Enabled  <<-------- should be enabled at next POST (reset /SYS)
       current_config_state = Disabled  <<------- still disabled, why?
       disable_reason = Configuration Rules

 /SYS/RCSA/PCIE5/CAR
   Properties:
       type = PCIe Hot Plug Carrier
       fru_description = TLA,CAR,T5-4,T5-8
       fru_manufacturer = Oracle Corporation
       fru_part_number = 7069814   <<------------- PCI carrier correct
       fru_rev_level = 01
       fru_serial_number = 465769T+1311U20L1L
       fault_state = OK
       clear_fault_action = (none)

 /SYS/RCSA/PCIE5/CAR/CARD
   Properties:
       type = PCIE Module
       fault_state = OK
       clear_fault_action = (none)
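
The mismatch between requested_config_state and current_config_state is the key indicator in the snapshot. The following is a minimal Python sketch for scanning snapshot text for targets in that state, assuming the "target path, then key = value properties" layout shown above. The function names parse_ilom_targets and stuck_disabled are hypothetical helpers, not ILOM commands.

```python
def parse_ilom_targets(text):
    """Map each ILOM target path to a dict of its 'key = value' properties."""
    targets = {}
    current = None
    for raw in text.splitlines():
        # Drop trailing "<<---" style annotations such as those in this document.
        line = raw.split("<<")[0].strip()
        if line.startswith("/"):
            current = targets.setdefault(line, {})
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return targets


def stuck_disabled(targets):
    """Return (path, disable_reason) for targets requested Enabled but still Disabled."""
    return [
        (path, props.get("disable_reason", ""))
        for path, props in targets.items()
        if props.get("requested_config_state") == "Enabled"
        and props.get("current_config_state") == "Disabled"
    ]


# Trimmed sample taken from the snapshot excerpt in this document.
SNAPSHOT_SAMPLE = """\
/SYS/RCSA/PCIE5
   Properties:
       type = Slot
       requested_config_state = Enabled  <<-------- annotation
       current_config_state = Disabled
       disable_reason = Configuration Rules

 /SYS/RCSA/PCIE5/CAR
   Properties:
       type = PCIe Hot Plug Carrier
       fault_state = OK
"""

if __name__ == "__main__":
    for path, reason in stuck_disabled(parse_ilom_targets(SNAPSHOT_SAMPLE)):
        print(f"{path}: requested Enabled, currently Disabled ({reason})")
```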

 

The snapshot also shows the card type in PCIE5:

 /System/PCI_Devices/On-board/Device_5
   Properties:
       description = 8-port SAS Controller
       location = SASHBA1 (SAS Controller 1)

 /System/PCI_Devices/Add-on
   Properties:

 /System/PCI_Devices/Add-on/Device_5
   Properties:
       part_number = SG-XPCIE2FC-EM8-Z
       description = Sun StorageTek Dual 8 Gb Fibre Channel PCIe HBA, Emulex
       location = PCIE5 (PCIE 5)
       pci_vendor_id = 0x10df
       pci_device_id = 0xfc40
       pci_subvendor_id = 0x10df
       pci_subdevice_id = 0xfc42

 

The PCIe link to the card was OK during POST:

2:0 |                           /SYS/RCSA/PCIE5/CAR | 815001a00000 | 26:00:0 | 10df | fc40 | 03 | fc42 | 10df |  8 G1
2:0 |                           /SYS/RCSA/PCIE5/CAR | 815001a01000 | 26:00:1 | 10df | fc40 | 03 | fc42 | 10df |  8 G1
 

  

After clearing the FMA faults in ILOM, the fault management shell and Solaris, and power cycling the T5-4 (ILOM and Solaris reset), the problem persists.

Cause

A stale fault from 2013 remains in the faultDB (it can be found in the faultDB data collected by the snapshot) and needs to be removed:

<root revision="99" version="1.0" qualifier="pod">
<FAULT_LIST VAL="/SYS/RCSA/PCIE5/CAR:930eb93e-8f22-c0c3-a383-f6e1ee722d17:0">
<FRU>/SYS/RCSA/PCIE5/CAR</FRU>
<Time_Stamp>2013-10-02/12:11:05</Time_Stamp>
<UUID>930eb93e-8f22-c0c3-a383-f6e1ee722d17</UUID>
<MSGID>PCIEX-8000-0A</MSGID>
<CLASS>fault.io.pciex.device-interr</CLASS>
<RESOURCE>/SYS/RCSA/PCIE5/CAR/CARD</RESOURCE>
<ASRU>/SYS/RCSA/PCIE5/CAR/CARD</ASRU>
<LOCATION>/SYS/RCSA/PCIE5/CAR</LOCATION>
<PERCENT>100</PERCENT>
<STATE>268</STATE>
<RAR_Timestamp>0</RAR_Timestamp>
<NV0>_cid=152584</NV0>
<NV1>_list_sz=1</NV1>
<NV2>_list_idx=0</NV2>
<NV3>system_component_serial_number=AK00117922</NV3>
<NV4>system_component_part_number=31930909+8+1</NV4>
<NV5>system_component_name=SPARC T5-4</NV5>
<NV6>system_component_manufacturer=Oracle Corporation</NV6>
<NV7>chassis_serial_number=AK00117922</NV7>
<NV8>chassis_part_number=31930909+8+1</NV8>
<NV9>chassis_name=SPARC T5-4</NV9>
<NV10>chassis_manufacturer=Oracle Corporation</NV10>
<NV11>system_serial_number=AK00117922</NV11>
<NV12>system_part_number=31930909+8+1</NV12>
<NV13>system_name=SPARC T5-4</NV13>
<NV14>system_manufacturer=Oracle Corporation</NV14>
<NV15>fru_name=TLA,CAR,T5-4,T5-8</NV15>
<NV16>fru_manufacturer=Oracle Corporation</NV16>
<NV17>fru_serial_number=465769T+1311U20L1L</NV17>
<NV18>fru_rev_level=01</NV18>
<NV19>fru_part_number=7069814</NV19>
<NV20>mod-version=1.16</NV20>
<NV21>mod-name=eft</NV21>
<NV22>severity=Critical</NV22>
</FAULT_LIST>
</root>
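
To spot stale entries like this one in a collected faultDB dump, the XML can be parsed offline. The following is a minimal Python sketch, assuming the element names seen in the dump above (FAULT_LIST, FRU, Time_Stamp, UUID, MSGID, CLASS) and assuming that any line breaks in the dump are hard wraps that can be joined back together. The function names and the cutoff date are hypothetical. The script only reads the dump; purging the entry still requires the escalation procedure described under Solution.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def list_faults(faultdb_xml):
    """Return one dict per FAULT_LIST entry in a faultDB XML dump."""
    # Assumption: line breaks are wrap artifacts, so joining lines restores the XML.
    root = ET.fromstring("".join(faultdb_xml.splitlines()))
    faults = []
    for entry in root.iter("FAULT_LIST"):
        stamp = datetime.strptime(entry.findtext("Time_Stamp"), "%Y-%m-%d/%H:%M:%S")
        faults.append({
            "fru": entry.findtext("FRU"),
            "uuid": entry.findtext("UUID"),
            "msgid": entry.findtext("MSGID"),
            "class": entry.findtext("CLASS"),
            "time": stamp,
        })
    return faults


def older_than(faults, cutoff):
    """Return the faults logged before the given cutoff datetime."""
    return [f for f in faults if f["time"] < cutoff]


# Trimmed sample based on the dump in this document, including a mid-tag wrap.
FAULTDB_SAMPLE = (
    '<root revision="99" version="1.0" qualifier="pod">\n'
    '<FAULT_LIST VAL="/SYS/RCSA/PCIE5/CAR:930eb93e-8f22-c0c3-a383-f6e1ee722d17:0">'
    '<FRU>/SYS/RCSA/PCIE5/CAR</FRU><Time_Stamp>2013-10-02/12:11:05</Tim\n'
    'e_Stamp><UUID>930eb93e-8f22-c0c3-a383-f6e1ee722d17</UUID>'
    '<MSGID>PCIEX-8000-0A</MSGID><CLASS>fault.io.pciex.device-interr</CLASS>'
    '</FAULT_LIST></root>\n'
)

if __name__ == "__main__":
    cutoff = datetime(2015, 1, 1)  # hypothetical cutoff, choose per case
    for fault in older_than(list_faults(FAULTDB_SAMPLE), cutoff):
        print(fault["time"], fault["fru"], fault["msgid"], fault["class"])
```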

This internal document explains the issue:

Disabled PCIe devices due to stale faultDB and deconfigDB entries w/SysFW 9.3.0.x (Doc ID 1999520.1)

This can happen with any PCIe card on any T5 server platform: T5-2, T5-4, T5-8 and T5-1B.

Solution

Contact the T5-4 server team, who will confirm the issue from the snapshot and provide an escalation password and an action plan to clear the fault.

The action plan requires a full power cycle of the T5-4 server (power off completely, then power on).

 

Note: If the system is configured with LDoms, engage the Oracle Support Solaris Virtualization team, because additional steps may be required,
such as recreating the LDoms configuration profile (refer to the Oracle VM administration guide for more information):

On systems configured with LDoms, boot factory-default and confirm that the devices are now visible to Solaris. Then re-initialize the configuration from a backup to recreate the spconfig without the device disable flag set.

On systems running factory-default, simply confirm that the devices are now visible to the OS.

 

As explained in the internal document:

Disabled PCIe devices due to stale faultDB and deconfigDB entries w/SysFW 9.3.0.x (Doc ID 1999520.1)

Once it is confirmed that there are no actionable faults and that /etc/devices/retire_store is clear, the faultDB and deconfigDB should be purged via the escalation shell.


 

References

<NOTE:1999520.1> - Disabled PCIe devices due to stale faultDB and deconfigDB entries w/SysFW 9.3.0.x
<NOTE:1332409.1> - How to Repair FMA Module Errors Seen in 'fmadm faulty'
<NOTE:1483194.1> - Commands to run to fully clear ILOM/SP, faultmgmt shell, and FMA faults on the T3-x and T4-x Servers

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.