Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1516022.1: How to Detect a Bad (Failed) Front Panel HCA in Oracle Fabric Interconnects (OVN), Formerly Xsigo Fabric Director
Applies to:

Oracle Fabric Interconnect F1-4 - Version All Versions to All Versions [Release All Releases]
Oracle Fabric Interconnect F1-15 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
***Checked for relevance on 02-JUL-2014***

Goal

Find errors in the Fabric Director logs or diagnostic log bundle that point to a bad Front Panel HCA.

Solution

The customer may notice problems bringing up IO Modules. Ethernet IO Modules may show as up/failed (xtLidInvalid). FC IO Modules will show up/failed (xtChip), and the IO Modules cannot be brought up successfully even after resetting them.

When a Front Panel failure occurs, servers that are already running stay up, but a new server-profile cannot be brought up, and neither can any new or existing server that is rebooted.

Front Panel HCA failure is often encountered after a power outage in the datacenter, or when the Fabric Interconnect is power cycled.

Conditions that may occur when the Front Panel HCA fails:

(1) Unable to bring up IO Modules: IO Modules do not come fully up after a reset or after rebooting the Fabric Interconnect.
(2) Unable to bring up a new server-profile, or an existing server-profile after the server was rebooted.
(3) OpenSM not running: it will not start after rebooting the Fabric Interconnect.

If any of the conditions above are noted, log in to the Fabric Director as user 'root', navigate to /var/log, and open dmesg:

    chassis/var/log# vi dmesg
Look for any DDR errors here:

    Waiting for /dev to be fully populated...[ 18.931000] ib_mthca: Mellanox InfiniBand HCA drive

Or grep for 'error -22' in dmesg/kern.log as shown below:

    [root@Centos_Balaji xsesxlgmt0201]# grep -i error dmesg.1
    [ 81.859000] ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
    [ 81.862000] ib_mthca: probe of 0000:02:00.0 failed with error -22

    [root@Centos_Balaji xsesxlgmt0201]# grep -i error kern.log
    Jun 16 00:00:57 xsesxlgmt0201 kernel: [ 81.859000] ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
    Jun 16 00:00:57 xsesxlgmt0201 kernel: [ 81.862000] ib_mthca: probe of 0000:02:00.0 failed with error -22
    Jun 16 04:06:24 xsesxlgmt0201 kernel: [ 21.224000] ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
    Jun 16 04:06:24 xsesxlgmt0201 kernel: [ 21.227000] ib_mthca: probe of 0000:02:00.0 failed with error -22
    Jun 16 06:14:22 xsesxlgmt0201 kernel: [ 14.394000] ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
    Jun 16 06:14:22 xsesxlgmt0201 kernel: [ 14.397000] ib_mthca: probe of 0000:02:00.0 failed with error -22
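The log excerpts above can also be checked with a small script instead of reading grep output by eye. The following is a minimal sketch, not an Oracle-supplied tool: the function names and the `SAMPLE` list are hypothetical, and the two patterns are taken only from the ib_mthca messages shown in this document. It assumes the log lines have already been read from a copy of dmesg or kern.log.

```python
import re

# Patterns based only on the ib_mthca messages quoted in this document.
DDR_ERROR = re.compile(r"ib_mthca \S+: SYS_EN DDR error")
PROBE_FAILED = re.compile(r"ib_mthca: probe of \S+ failed with error -22")


def scan_hca_errors(lines):
    """Count DDR errors and failed driver probes in an iterable of log lines."""
    counts = {"ddr_error": 0, "probe_failed": 0}
    for line in lines:
        if DDR_ERROR.search(line):
            counts["ddr_error"] += 1
        if PROBE_FAILED.search(line):
            counts["probe_failed"] += 1
    return counts


def hca_looks_bad(counts):
    """Assumption: any DDR error or failed probe marks the HCA as suspect."""
    return counts["ddr_error"] > 0 or counts["probe_failed"] > 0


# Hypothetical sample input: two lines copied from the kern.log excerpt.
SAMPLE = [
    "Jun 16 00:00:57 xsesxlgmt0201 kernel: [ 81.859000] ib_mthca 0000:02:00.0: "
    "SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM",
    "Jun 16 00:00:57 xsesxlgmt0201 kernel: [ 81.862000] ib_mthca: probe of "
    "0000:02:00.0 failed with error -22",
]

counts = scan_hca_errors(SAMPLE)
print(counts, "HCA suspect:", hca_looks_bad(counts))
```

To scan a real file, pass an open file object, for example `scan_hca_errors(open("kern.log"))`. A non-zero count is only a pointer toward the Front Panel HCA. The OpenSM check described later in this document still applies before requesting an RMA.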
Grep /var/log/kern.log or dmesg.log for the following strings:

    ib_mthca
    No HCA
    SYS_EN DDR
    error -22

If any of these greps returns lines showing a failed HCA, the HCA in the Front Panel is bad and needs to be replaced. Before RMA'ing the Front Panel HCA, also confirm whether opensm is up and running on the Fabric Director. If it is not, proceed with the RMA of the Front Panel:

    admin@xavier[xsigo] show diagnostics sm-info
    SM information:
    - SM is running on xavier
    - SM Lid 87
    - SM Guid 0x139702010003a9
    - SM key 0x0
    - SM priority 0
    - SM State MASTER

    # show diagnostics opensm-param
    OpenSM $ Current log level is 0x3
    OpenSM $ Current sm-priority is 0
    OpenSM $ OpenSM Version : OpenSM 3.3.5
    SM State : Master
    SM Priority : 0
    SA State : Ready
    Routing Engine : minhop
    Loaded event plugins : <none>
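When the output of the two diagnostics commands above has been saved from a CLI session, the SM state can be pulled out of the text automatically. This is a rough sketch under stated assumptions: the function name `sm_state` is hypothetical, the regular expression is based only on the two output samples shown in this document, and how the text is captured (for example, pasted from a terminal log) is left to the reader.

```python
import re

# Matches both forms seen in this document:
#   "- SM State MASTER"   (show diagnostics sm-info)
#   "SM State : Master"   (show diagnostics opensm-param)
SM_STATE = re.compile(r"SM State\s*:?\s*(\w+)", re.IGNORECASE)


def sm_state(cli_output):
    """Return the reported SM state in upper case, or None if none is present."""
    match = SM_STATE.search(cli_output)
    return match.group(1).upper() if match else None


# Hypothetical captured text, copied from the sm-info sample in this document.
captured = """SM information:
- SM is running on xavier
- SM Lid 87
- SM priority 0
- SM State MASTER
"""

state = sm_state(captured)
if state is None:
    print("No SM state reported; OpenSM may not be running")
else:
    print("SM state:", state)
```

A result of `None` only means that no state line was found in the captured text. It suggests, but does not prove, that OpenSM is down, so confirm on the Fabric Director itself before proceeding with the RMA.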
Please note that the above errors are seen on Gen1 Front Panels, not on Gen2 Front Panels. Gen1 Front Panels are EOL'd, and there are very few (if any) Gen1 Front Panels in stock in any Oracle Global Depot. The Gen1 Front Panels that may be available are most likely refurbished/remanufactured. Replacing a failed Gen1 Front Panel with another Gen1 Front Panel leaves customers exposed to the intermittent, random failure of Gen1 Front Panel HCAs.

Gen2 Front Panels have an embedded System Control Processor (SCP) and HDD, and also a more powerful processor than Gen1 Front Panels. Therefore, it is almost always better to guide the customer to replace Gen1 Front Panels with Gen2 Front Panels. Gen2 Front Panels support 3.9.0 XgOS and above, whereas Gen1 Front Panels are supported on 3.8.2 XgOS and above.

For more information see this document: https://docs.oracle.com/cd/E51938_01/pdf/E39030.pdf

What’s New in this Release

This release contains the following new features and functionality:

• A new Front Panel assembly, called the FP-2, is available. This new part offers a new processor and a streamlined design which includes an embedded SCP which is no longer a field-replaceable part. In addition, the entire FP-2 is RoHS compliant. The FP-2 is installed in all new shipments from the factory. The FP-2 is supported only with XgOS 3.9.0 and later. Earlier versions of XgOS do not support the FP-2.

Here is a KB that corrects 4.x Product Notes and states that Gen1 Front Panels are supported and will work with 4.x XgOS:
Please see these KBs for instructions on how to upgrade XgOS version and how to replace a Gen2 Front Panel:
How to Perform an Oracle Fabric Interconnect Firmware (XgOS) Upgrade (Doc ID 1517629.1)
How to replace a Gen2 Front Panel on Oracle Fabric Interconnects (Xsigo) (Doc ID 1663431.1)
The logscanner (deprecated) should also point this out (phone home output):
    (CRIT-1) CRITICAL problem: Detected 3 instances of HCA DDR errors. HCA is bad.
    Please request RMA for Front Panel card
    Date range: Jun 16 07:00:57 2012 UTC until Jun 16 13:14:22 2012 UTC

The note below is an entry taken directly from the Oracle System Handbook, but it was edited because one person found it confusing (dropped the "a" and added an "s" to "Fan Tray(s)").

NOTE: 2U/4U Front Panel G2 (ComX-Xi7) Silver with P/N 7077986 1 2 [F] is *not* an alternate part for Front Panel P/N 725-00097-10 4 [F] - 2U/4U Front Panel G2 (Com-Xi7). The 7077986 Gen2 Front Panel P/N requires new Fan Trays to also be ordered if the 7077986 P/N is used:

Oracle Fabric Interconnect F1-15 - Components

The screenshot below was created directly from the Oracle System Handbook with no changes. It is understood that the alternate Front Panel part (R13) requires all new Fan Trays for each Fabric Interconnect model, and at this time it has not been used. The KB Team does not maintain the Oracle System Handbook.
References

<BUG:20828613> - HIGH NUMBER OF GEN1 FRONT PANEL HCA FAILURES

Attachments

This solution has no attachment