Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1516022.1
Update Date: 2018-03-01

Solution Type  Technical Instruction Sure

Solution  1516022.1 :   How to Detect a Bad (Failed) Front Panel HCA in Oracle Fabric Interconnects (OVN) Formerly Xsigo Fabric Director  


Related Items
  • Oracle Fabric Interconnect F1-15
  • Oracle Fabric Interconnect F1-4
Related Categories
  • PLA-Support>Sun Systems>SAND>Network>SN-SND: Oracle Virtual Networking




In this Document
Goal
Solution
References


Applies to:

Oracle Fabric Interconnect F1-4 - Version All Versions to All Versions [Release All Releases]
Oracle Fabric Interconnect F1-15 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
***Checked for relevance on 02-JUL-2014***

Goal

Find errors in the Fabric Director logs or diagnostic log bundle that point to a bad Front Panel HCA.

Solution

Customers may notice problems bringing up IO Modules. Ethernet IO Modules may show as up/failed (xtLidInvalid) and FC IO Modules will show up/failed (xtChip); the modules cannot be brought up successfully even after they are reset. When a Front Panel failure is encountered, existing servers stay up and running, but a new server-profile cannot be brought up, and neither can any new or existing server that is rebooted. Front Panel HCA failure is often encountered after a power outage in the datacenter, or when the Fabric Interconnect is power cycled.

Conditions that may occur when Front Panel HCA fails:

(1) Unable to bring up IO Modules - IO Modules do not come fully up after reset or rebooting the Fabric Interconnect

(2) Unable to bring up a new server-profile or existing server-profile after server was rebooted.

(3) OpenSM not running - won't start after rebooting Fabric Interconnect.

If any of the conditions above are noted, log in to the Fabric Director as user 'root' and then navigate to /var/log:

chassis/var/log# vi dmesg

 Look for any DDR errors here:

Waiting for /dev to be fully populated...[ 18.931000] ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
[ 18.932000] ib_mthca: Initializing 0000:02:00.0
[ 18.933000] ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 22 (level, low) -> IRQ 23
[ 19.939000] ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
[ 19.940000] ib_mthca 0000:02:00.0: SYS_EN returned status 0x07, aborting.
[ 19.941000] ACPI: PCI interrupt for device 0000:02:00.0 disabled
[ 19.942000] ib_mthca: probe of 0000:02:00.0 failed with error -22

 Alternatively, grep for 'error' in dmesg and kern.log and look for 'error -22' in the results, as shown below:

[root@Centos_Balaji xsesxlgmt0201]# grep -i error dmesg.1
[ 81.859000] ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
[ 81.862000] ib_mthca: probe of 0000:02:00.0 failed with error -22

[root@Centos_Balaji xsesxlgmt0201]# grep -i error kern.log
Jun 16 00:00:57 xsesxlgmt0201 kernel: [ 81.859000] ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
Jun 16 00:00:57 xsesxlgmt0201 kernel: [ 81.862000] ib_mthca: probe of 0000:02:00.0 failed with error -22
Jun 16 04:06:24 xsesxlgmt0201 kernel: [ 21.224000] ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
Jun 16 04:06:24 xsesxlgmt0201 kernel: [ 21.227000] ib_mthca: probe of 0000:02:00.0 failed with error -22
Jun 16 06:14:22 xsesxlgmt0201 kernel: [ 14.394000] ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
Jun 16 06:14:22 xsesxlgmt0201 kernel: [ 14.397000] ib_mthca: probe of 0000:02:00.0 failed with error -22

 

Grep /var/log/kern.log or dmesg for any of the following strings:

ib_mthca

No HCA

SYS_EN DDR

error -22

 

If any of the above greps returns lines showing a failed HCA, the HCA in the Front Panel is bad and needs to be replaced.
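
The grep checks above can be bundled into one pass over a log file. The following is a minimal sketch, not an Oracle tool: the function names are hypothetical, and it assumes the log files have been copied somewhere Python is available. A bare "ib_mthca" match is deliberately not counted as a failure, since the driver banner line in the sample also contains that string. The DDR pattern accepts both "SYS_EN DDR" (as printed in the sample kernel messages) and "SYS_EN_DDR".

```python
import re

# Failure signatures taken from the strings this article says to look for.
# Matching is case-insensitive, like "grep -i".
HCA_FAILURE_PATTERNS = [
    re.compile(r"No HCA", re.IGNORECASE),
    re.compile(r"SYS_EN[ _]DDR", re.IGNORECASE),
    re.compile(r"error -22", re.IGNORECASE),
]


def find_hca_failures(lines):
    """Return the log lines that match any failure signature."""
    hits = []
    for line in lines:
        if any(p.search(line) for p in HCA_FAILURE_PATTERNS):
            hits.append(line.rstrip("\n"))
    return hits


def scan_file(path):
    """Scan one log file (for example a copy of kern.log) and return the hits."""
    with open(path, errors="replace") as fh:
        return find_hca_failures(fh)
```

Calling scan_file() on a copy of kern.log or dmesg returns the matching lines; an empty list means none of the signatures were found in that file.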

Before creating an RMA for the Front Panel HCA, also confirm whether OpenSM is up and running on the Fabric Director. If it is not, proceed with the RMA of the Front Panel:

admin@xavier[xsigo] show diagnostics sm-info

SM information:
- SM is running on xavier
- SM Lid 87
- SM Guid 0x139702010003a9
- SM key 0x0
- SM priority 0
- SM State MASTER

#show diagnostics opensm-param

OpenSM $ Current log level is 0x3
OpenSM $ Current sm-priority is 0
OpenSM $ OpenSM Version : OpenSM 3.3.5
SM State : Master
SM Priority : 0
SA State : Ready
Routing Engine : minhop
Loaded event plugins : <none>
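
When the sm-info output has been captured to text (for example from a session log), the check can be automated. This is a rough sketch with hypothetical function names. It assumes the output has the shape shown above. The article does not show what the command prints when OpenSM is down, so the sketch treats a missing "SM is running on" line as "not running"; that is an assumption, not documented behavior.

```python
def parse_sm_info(output):
    """Parse captured 'show diagnostics sm-info' text into a dict."""
    info = {"running_on": None}
    prefix = "SM is running on "
    for raw in output.splitlines():
        # Drop the leading "- " used by the CLI listing.
        line = raw.strip().lstrip("-").strip()
        if line.startswith(prefix):
            info["running_on"] = line[len(prefix):]
        elif line.startswith("SM "):
            # "SM Lid 87" becomes key "Lid" with value "87".
            parts = line.split(None, 2)
            if len(parts) == 3:
                info[parts[1]] = parts[2]
    return info


def sm_is_running(info):
    """Assumption: no 'SM is running on' line means OpenSM is not up."""
    return info["running_on"] is not None
```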

 

Please note, the above errors are noted on Gen1 Front Panels, not on Gen2 Front Panels.

NOTE when creating an RMA to replace a failed Gen1 Front Panel:    

Gen1 Front Panels are EOL'd and there are very few (if any) Gen1 Front Panels in stock in any Oracle Global Depot. The Gen1 Front Panels that may be available are most likely refurbished/remanufactured. Replacing a failed Gen1 Front Panel with another Gen1 Front Panel leaves customers exposed to the intermittent random failure of Gen1 Front Panel HCAs. Gen2 Front Panels, in addition to having an embedded System Control Processor (SCP) and HDD, also have a more powerful processor than Gen1 Front Panels. Therefore, it is almost always better to guide the customer to replace Gen1 Front Panels with Gen2 Front Panels.

The only reason that Gen1 Front Panels are still stocked at all is that some customers have insisted on not moving to Gen2 Front Panels, usually because the customer does not want to, or is unable to, upgrade their environment to the minimum required XgOS version (3.8.2 and above). Also note that, per Doc ID 1916977.1, some older serial numbers of the F1-4 Chassis need to be replaced prior to upgrading to a Gen2 Front Panel.

Taking all the above into account, here is the sequence of steps to follow when guiding the customer on Gen1 Front Panel replacement:

   1) Check with the customer whether they are on XgOS 3.8.2 or above, or whether they are prepared to upgrade, given the aforementioned limitations of the Gen1 FP (for example, the risk of failure on power cycle).

         1.1  If the customer is not on XgOS 3.8.2 or above (3.9.0 is required for Gen2 Front Panels) and is not prepared to upgrade, inform the customer of the possible delays in sourcing the part. Then create an FS-Task for a Gen1 Front Panel and state in Special Instructions that the customer's XgOS version will not support a Gen2 Front Panel as a substitute and that a Gen1 FP must be sourced.   You are done; exit here.

         1.2  If the customer is already on XgOS 3.8.2 or above, and/or is prepared to use the minimum 3.8.2 XgOS version or above that *ships* with Gen2 Front Panels, then:
                 - If this is an F1-4:  Check Doc ID 1916977.1 to validate the Serial# of the Chassis and whether it would require replacement at the same time as the Gen2 FP is ordered.  If it is one of the older Serial#s mentioned, order both the Gen2 FP and a new Chassis and provide instructions to the FE for both.
                 - If this is an F1-15 and/or a newer F1-4 chassis:    Simply order a new Gen2 FP.
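
The steps above can be sketched as a small decision function. This is an illustration only, with hypothetical names throughout. It assumes 3.9.0 as the Gen2 minimum, following the "3.9.0 required for Gen2 Front Panels" remark in step 1.1, although the steps also mention 3.8.2. The chassis_serial_affected flag must come from a manual check against Doc ID 1916977.1; the sketch does not attempt to evaluate serial numbers.

```python
# Assumed Gen2 minimum, per the "3.9.0 required for Gen2 Front Panels" remark.
GEN2_MIN_XGOS = (3, 9, 0)


def parse_version(text):
    """Turn a dotted version such as '3.8.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def plan_replacement(model, xgos_version, will_upgrade,
                     chassis_serial_affected=False):
    """Return the list of parts to order for a failed Gen1 Front Panel.

    model: "F1-4" or "F1-15".
    will_upgrade: customer agrees to run an XgOS version that supports Gen2.
    chassis_serial_affected: result of the Doc ID 1916977.1 serial check.
    """
    gen2_ok = parse_version(xgos_version) >= GEN2_MIN_XGOS or will_upgrade
    if not gen2_ok:
        # Step 1.1: source a Gen1 Front Panel and warn about delays.
        return ["Gen1 Front Panel"]
    if model == "F1-4" and chassis_serial_affected:
        # Step 1.2, older F1-4 chassis: order the chassis as well.
        return ["Gen2 Front Panel", "F1-4 Chassis"]
    return ["Gen2 Front Panel"]
```

For example, plan_replacement("F1-15", "3.8.2", will_upgrade=False) falls through to the Gen1 branch, which corresponds to step 1.1.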

The XgOS version shipped on a new Gen2 Front Panel is unknown until it arrives at the customer site.  The XgOS version can be downgraded (or upgraded) to match the version that the customer already runs, or plans to run, in their environment, as long as the target is a GA-released version: XgOS 3.8.2 and above for Gen1 Front Panels, and XgOS 3.9.0 and above for Gen2 Front Panels.

Gen2 Front Panels support 3.9.0 XgOS and above, whereas Gen1 Front Panels are supported on 3.8.2 XgOS and above.  For more information see this document:

https://docs.oracle.com/cd/E51938_01/pdf/E39030.pdf

What’s New in this Release

This release contains the following new features and functionality:

• A new Front Panel assembly, called the FP-2, is available. This new part offers a new processor and a streamlined design which includes an embedded SCP which is no longer a field-replaceable part. In addition, the entire FP-2 is RoHS compliant. The FP-2 is installed in all new shipments from the factory.

The FP-2 is supported only with XgOS 3.9.0 and later. Earlier versions of XgOS do not support the FP2.

Here is a KB that corrects 4.x Product Notes and states that Gen1 Front Panels are supported and will work with 4.x XgOS:

Please see these KBs for instructions on how to upgrade XgOS version and how to replace a Gen2 Front Panel:

 

How to Perform an Oracle Fabric Interconnect Firmware (XgOS) Upgrade (Doc ID 1517629.1)

How to replace a Gen2 Front Panel on Oracle Fabric Interconnects (Xsigo) (Doc ID 1663431.1)

 

The logscanner (deprecated) should also flag this condition in the phone home output:

 

(CRIT-1) CRITICAL problem: Detected 3 instances of HCA DDR errors. HCA is bad.
Please request RMA for Front Panel card
Date range: Jun 16 07:00:57 2012 UTC until Jun 16 13:14:22 2012 UTC

The Note below is an entry taken directly from the Oracle System Handbook, but it was edited since one person found it confusing (dropped the "a" and added an "s" to Fan Tray(s)).

NOTE:  2U/4U Front Panel G2 (ComX-Xi7) Silver with P/N 7077986  1 2 [F] is *not* an alternate part for Front Panel P/N 725-00097-10 4 [F] - 2U/4U Front Panel G2 (Com-Xi7).   The 7077986 Gen2 Front Panel P/N requires new Fan Trays to also be ordered if the 7077986 P/N is used:

Oracle Fabric Interconnect F1-15 - Components

The screenshot below was created directly from the Oracle System Handbook with no changes. It is understood that the alternate Front Panel part (R13) requires new Fan Trays (all of them) for each Fabric Interconnect model, and at this time it has not been used. The KB Team does not maintain the Oracle System Handbook.

 


 

Miscellaneous

Option # | Manufacturing Part #    | Description                                                                     | Previous Part #
---------|-------------------------|---------------------------------------------------------------------------------|-----------------
n/a      | 7063220                 | 2U/4U Front Panel G2 (Com-X i7)                                                 | n/a
         |   • 7077986 1 2 [F]     | 2U/4U Front Panel G2 (Com-X i7), Silver                                         | n/a
n/a      | 7076049                 | F1-15 Chassis (unpainted) without Power Supply, Fan, Fabric Board, Front Panel  | n/a
n/a      | 7301694 [F]             | F1-15 Chassis (unpainted) without Power Supply, Fan, Fabric Board, Front Panel  | 7076049
n/a      | VP-FRU-FP               | 2U/4U Front Panel                                                               | n/a
         |   • 725-00006-05 3 [F]  | 2U/4U Front Panel                                                               | n/a
n/a      | VP-FRU-FP-C7-R13        | 2U/4U Front Panel G2 (Com-X i7)                                                 | n/a
         |   • 725-00097-10 4 [F]  | 2U/4U Front Panel G2 (Com-X i7)                                                 | n/a
n/a      | VP780-CH-DDR            | F1-15 DDR I/O Director Chassis Assembly                                         | n/a
n/a      | VP780-CH-QDR [F]        | F1-15 QDR I/O Director Chassis Assembly [SKU]                                   | n/a
n/a      | VP780-CH-QDR-R13        | F1-15 QDR I/O Director Chassis Assembly                                         | VP780-CH-QDR [F]
n/a      | VP780-FRU-BEZEL         | F1-15 Bezel                                                                     | n/a
n/a      | VP780-FRU-CHASSIS-DDR 5 | VP780 DDR I/O Director Chassis Assembly                                         | n/a
n/a      | VP780-FRU-FP            | F1-15 Gen1 Front Panel [SKU]                                                    | n/a
         |   • 725-00006-03 [F]    | F1-15 Gen1 Front Panel                                                          | n/a
n/a      | VP780E-CH               | F1-15 Ethernet I/O Director Chassis Assembly [SKU]                              | n/a

 

1 Front Panel 7077986 requires F1-4 Fan Tray 7085622 or F1-15 Fan Tray 7082052
2 Includes SCP.
3 Includes SCP
4 Includes SCP
5 This part is just the chassis, no fans, power supplies, or System Control Processor (SCP) are included.

 

Table Legend

[F] = Field Replaceable Unit (FRU)
[C] = Customer Replaceable Unit (CRU)
[A] = Alternate, can be used in system; wasn't used during production
[S] = Supported, but can no longer be ordered with this system
[N] = No longer supported for this system

 

References

<BUG:20828613> - HIGH NUMBER OF GEN1 FRONT PANEL HCA FAILURES

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.