Asset ID: 1-72-2210497.1
Update Date: 2018-02-17
Keywords:
Solution Type: Problem Resolution Sure
Solution 2210497.1: FS System: Controller Fails Due to a PCIe Fault
Related Items
- Oracle FS1-2 Flash Storage System
Related Categories
- PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx
Oracle Confidential PARTNER - Available to partners (SUN).
Reason: not for customers
Created from <SR 3-13705927891>
Applies to:
Oracle FS1-2 Flash Storage System - Version 6.2 to 6.2 [Release 6.2]
Information in this document applies to any platform.
Symptoms
An FS1-2 Controller fails unexpectedly. In a processed FS1-2 log bundle, the events_summary.txt file reports a PSG_DMS_EVENT_ILOM WARNING event:
2016-11-24T04:53:21.456 COLLECT_LOGS_INITIATED INFORMATIONAL
2016-11-24T04:53:21.980 PCP_EVT_CONTROLLER_FAILED CRITICAL ComponentName=/CONTROLLER-01, Source=/PILOT-1, reasonCode=PMI_SUSPICION
2016-11-24T04:53:21.982 PCP_EVT_CONTROLLER_FAILOVER_COMPLETE INFORMATIONAL ComponentName=/CONTROLLER-01, Source=/PILOT-1
.
.
.
2016-11-24T05:00:20.136 PCP_EVT_NODE_STATE_CHANGED INFORMATIONAL ComponentName=/CONTROLLER-01, Source=/PILOT-1
2016-11-24T05:00:42.839 PSG_DMS_EVENT_ILOM WARNING ComponentName=/CONTROLLER-01, Source=/CONTROLLER-01
2016-11-24T05:00:42.840 COLLECT_LOGS_INITIATED INFORMATIONAL
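As a rough illustration of this check, the following Python sketch scans events_summary.txt text for Controllers that report both a PCP_EVT_CONTROLLER_FAILED event and a PSG_DMS_EVENT_ILOM event. The line layout (timestamp, event type, severity, then comma-separated key=value pairs) is inferred from the excerpt above and is an assumption; the function names are hypothetical and are not part of any Oracle tool.

```python
import re

# Sample lines copied from the events_summary.txt excerpt above.
SAMPLE = """\
2016-11-24T04:53:21.456 COLLECT_LOGS_INITIATED INFORMATIONAL
2016-11-24T04:53:21.980 PCP_EVT_CONTROLLER_FAILED CRITICAL ComponentName=/CONTROLLER-01, Source=/PILOT-1, reasonCode=PMI_SUSPICION
2016-11-24T05:00:42.839 PSG_DMS_EVENT_ILOM WARNING ComponentName=/CONTROLLER-01, Source=/CONTROLLER-01
"""

# Assumed layout: timestamp, event type, severity, optional key=value pairs.
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s*(.*)$")


def parse_events(text):
    """Return one dict per recognisable event line; other lines are skipped."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        timestamp, event_type, severity, rest = match.groups()
        params = dict(p.split("=", 1) for p in rest.split(", ") if "=" in p)
        events.append(
            {"time": timestamp, "type": event_type, "severity": severity, **params}
        )
    return events


def controllers_to_check(events):
    """Controllers that logged both a failure and an ILOM event."""
    def names(event_type):
        return {
            e["ComponentName"]
            for e in events
            if e["type"] == event_type and "ComponentName" in e
        }

    return sorted(names("PCP_EVT_CONTROLLER_FAILED") & names("PSG_DMS_EVENT_ILOM"))


suspects = controllers_to_check(parse_events(SAMPLE))
```

A Controller returned by this sketch is only a candidate; the PSG_DMS_EVENT_ILOM details in events.xml still need to be checked for a PCIe fault class.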
The details of the PSG_DMS_EVENT_ILOM event in the events.xml file show that a PCIe device has faulted:
<SystemEventInformation>
<EventType>PSG_DMS_EVENT_ILOM</EventType>
<Severity>WARNING</Severity>
<Category>SYSTEM</Category>
<Time>2016-11-24T05:00:42.839</Time>
<ComponentIdentity>
<WuName>508002000158D780</WuName>
</ComponentIdentity>
<ComponentName>/CONTROLLER-01</ComponentName>
<SourceNodeIdentity>
<Id>508002000158D780</Id>
<Fqn>/CONTROLLER-01</Fqn>
</SourceNodeIdentity>
<EventParameterList>
<ParameterName>IlomEvent.fmdTime</ParameterName>
<ParameterValue>Nov 24 04:53:03 2016</ParameterValue>
</EventParameterList>
.
.
.
<EventParameterList>
<ParameterName>IlomEvent.FmdFaulty.FaultyClass</ParameterName>
<ParameterValue>fault.io.intel.iio.pcie-downstream-devices</ParameterValue>
</EventParameterList>
<EventParameterList>
<ParameterName>IlomEvent.FmdFaulty.CertaintyFault</ParameterName>
<ParameterValue>100</ParameterValue>
</EventParameterList>
<EventParameterList>
<ParameterName>IlomEvent.FmdFaulty.FaultyLocation</ParameterName>
<ParameterValue>/chassis=0/motherboard=0/riser=3/pcie=3</ParameterValue>
</EventParameterList>
<EventParameterList>
<ParameterName>IlomEvent.FmdFaulty.FaultyStatus</ParameterName>
<ParameterValue>faulted but still in service</ParameterValue>
</EventParameterList>
<EventParameterList>
<ParameterName>IlomEvent.FmdFaulty.FmdFru.Status</ParameterName>
<ParameterValue></ParameterValue>
</EventParameterList>
<EventParameterList>
<ParameterName>IlomEvent.FmdFaulty.FmdFru.Location</ParameterName>
<ParameterValue>/SYS/MB/RISER3/PCIE3</ParameterValue>
</EventParameterList>
In this example, the faulty device is /SYS/MB/RISER3/PCIE3, reported by Controller 1.
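The fault class and FRU location can also be pulled out of the event programmatically. The sketch below uses Python's standard xml.etree.ElementTree module on a trimmed copy of the event shown above. The closing </SystemEventInformation> tag is an assumption, because the excerpt in this document is truncated, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the event above. The closing tag is assumed.
SAMPLE = """<SystemEventInformation>
<EventType>PSG_DMS_EVENT_ILOM</EventType>
<ComponentName>/CONTROLLER-01</ComponentName>
<EventParameterList>
<ParameterName>IlomEvent.FmdFaulty.FaultyClass</ParameterName>
<ParameterValue>fault.io.intel.iio.pcie-downstream-devices</ParameterValue>
</EventParameterList>
<EventParameterList>
<ParameterName>IlomEvent.FmdFaulty.FmdFru.Status</ParameterName>
<ParameterValue></ParameterValue>
</EventParameterList>
<EventParameterList>
<ParameterName>IlomEvent.FmdFaulty.FmdFru.Location</ParameterName>
<ParameterValue>/SYS/MB/RISER3/PCIE3</ParameterValue>
</EventParameterList>
</SystemEventInformation>"""


def ilom_event_params(event_xml):
    """Map each ParameterName to its ParameterValue (empty string if blank)."""
    root = ET.fromstring(event_xml)
    params = {}
    for plist in root.iter("EventParameterList"):
        name = plist.findtext("ParameterName")
        if name:
            params[name] = plist.findtext("ParameterValue") or ""
    return params


def is_pcie_fault(params):
    """True when the reported fault class mentions PCIe."""
    return "pcie" in params.get("IlomEvent.FmdFaulty.FaultyClass", "").lower()


params = ilom_event_params(SAMPLE)
fru = params["IlomEvent.FmdFaulty.FmdFru.Location"]
```

Matching on the substring "pcie" is a deliberately loose heuristic for this sketch; the only fault class confirmed by this document is fault.io.intel.iio.pcie-downstream-devices.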
Cause
Several instances of Controller ILOM PCIe fault events have been reported from multiple FS1-2 systems, and Engineering is tracking them. The cause is currently unknown.
Solution
Any Controller failure event (PCP_EVT_CONTROLLER_FAILED) should be investigated for this PCIe issue with the following steps:
- Display the list of Controller ILOM fault events, following the instructions in KM Document 1959101.1 FS System: How to Display the Fault Events for the Controller ILOM:
# ./fscli controller -command -controller /CONTROLLER-01 -commandString IPMIFM -parameters "fmadm faulty"
Connected. Use ^D to exit.
-> cd /sp/faultmgmt
/SP/faultmgmt
-> start shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
faultmgmtsp> fmadm faulty
------------------- ------------------------------------ -------------- --------
Time UUID msgid Severity
------------------- ------------------------------------ -------------- --------
2016-09-03/09:31:56 cd223e01-eb01-c649-8fc1-fb9492383eaa SPX86-8003-QH Major
Fault class : fault.io.intel.iio.pcie-downstream-devices
ASRU : /SYS/MB/RISER3/PCIE3
faulted
FRU : /SYS/MB/RISER3/PCIE3
(Part Number: unknown)
(Serial Number: unknown) 100%
faulty
- Examine the host_debug_err.log file:
- For systems running R6.2.9 and higher:
- Collect an ILOM Snapshot from the Failed Controller. For details see KM Document 1963071.1 FS System: How to Collect an ILOM Snapshot from the Pilot or Controller
- Once the Controller ILOM Snapshot has been received, extract the .zip file and look in the ilom subfolder for the @persist@host_debug_err.log file.
- For systems running R6.2.8 and below that cannot successfully gather an ILOM snapshot:
- Enable ssh access to the Pilot. For details see KM Documents 2029847.1 FS System: How to Enable Ssh Access to the Pilot and 2046703.1 FS System: Passwords Associated with the Oracle FS1-2 Flash Storage System.
- Enable session logging, then ssh to the Pilot and from there to the surviving Controller (either 172.30.80.128 or 172.30.80.129):
[root@pilot1 ~]# cat /etc/nodenames
172.30.80.2 WN2008fffffffffff2 WN2008000101000000 mgmtnode
172.30.80.3 WN2009fffffffffffa
172.30.80.128 WN508002000158ba50 WN2008000101000001
[root@pilot1 ~]# ssh 172.30.80.128
WN508002000158BA50 #
- Ssh to the failed Controller's ILOM and enter restricted mode:
WN508002000158BA50 # ssh 169.254.2.9
Password:
Oracle(R) Integrated Lights Out Manager
Version 3.1.2.40 r93718
Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
Warning: The system appears to be in manufacturing test mode.
Contact Service immediately.
Warning: password is set to factory default.
-> set SESSION mode=restricted
WARNING: The "Restricted Shell" account is provided solely
to allow Services to perform diagnostic tasks.
[(restricted_shell) ORACLESP-1315FM2008:~]#
- Dump the contents of host_debug_err.log and then exit all ssh sessions:
[(restricted_shell) ORACLESP-1315FM2008:~]# cat persist/host_debug_err.log
Thu Aug 11 23:41:37 2016 ID ffff
**** Host Boot ****
P0 GFERRSTS:00000000 GFFERRSTS:00000000 GFNERRST:00000000
Thu Aug 11 23:41:40 2016 ID ffff P0 DMI:DMI:PCIE_XP:Correctable:FIRST:76:PCI link bandwidth changed
P0:DMI:DMI XPCORERRSTS:00000001 XPCORERRMSK:00000000
P0 GNERRST:00100000 GNFERRST:00100000 GNNERRST:00000000
P1 GFERRSTS:00000000 GFFERRSTS:00000000 GFNERRST:00000000
Thu Aug 11 23:45:38 2016 ID ffff
...
Fri Nov 4 20:49:51 2016 ID ffff P1 PCIe 6:port3c:PCIE_XP:Correctable:NEXT:76:PCI link bandwidth changed
P1:PCIe 6:port3c XPCORERRSTS:00000001 XPCORERRMSK:00000000
P1 GNERRST:00005440 GNFERRST:00000040 GNNERRST:00000400
[(restricted_shell) ORACLESP-1315FM2008:~]# exit
exit
-> exit
WN508002000158BA50 # exit
[root@pilot1 ~]# exit
- View the @persist@host_debug_err.log file, or the file created from the ssh session, and look for a PCIE Fatal error message similar to the following:
Thu Nov 24 04:53:03 2016 ID 0165 P1 PCIe 3:port3a:PCIE:Fatal:FIRST:84::Completion Time-out Status
P1:PCIe 3:port3a:UNCERRSTS:00004000 UNCERRSEV:0007F030 LNERRSTS:00000000 RPERRSTS:0000005C ERRSID:80180000
Thu Nov 24 04:53:04 2016 ID 0166 P1 PCIe 3:port3a:PCIE:Fatal:MSG:Fatal Error Msgs Received
Thu Nov 24 04:55:31 2016 ID ffff
- Clear the ILOM fault using the fscli utility:
# ./fscli controller -command -controller <controller ID> -commandString IPMIFM -parameters "fmadm repaired <fru|cru>"
In this example from the System Event Log, Controller 1 reports the faulty device as /SYS/MB/RISER3/PCIE3, so the fscli command to clear the fault is:
# ./fscli controller -command -controller /CONTROLLER-01 -commandString IPMIFM -parameters "fmadm repaired /SYS/MB/RISER3/PCIE3"
- Make sure that the FS1-2 is on the latest software release. Beginning in release 6.2.11, the FS1-2 warmstarts the Controller to collect debug logs for the Controller SAS HBA. For more information, see Bug 25225099.
In release 6.2.13 and higher, the FS1-2 warmstarts the Controller to collect debug logs for the Controller SAS HBA and, depending on the type of PCIe fault, also reboots the Controller to reset the PCIe bus. If this reboot occurs, the Controller performs a fast failover so that the LUNs become available on the other Controller sooner. For more information, see Bug 25790908 and Bug 25746529. This behavior is described in the 6.2.13 README.
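The release thresholds mentioned in these steps (6.2.9 for the ILOM snapshot procedure, 6.2.11 and 6.2.13 for automatic PCIe fault handling) are easy to get wrong when compared as plain strings, since "6.2.9" sorts after "6.2.13". The Python sketch below compares them numerically. The function names and the returned labels are hypothetical and only summarize what this document states.

```python
def parse_release(release):
    """Turn a release string such as '6.2.13' or 'R6.2.9' into a tuple of ints."""
    return tuple(int(part) for part in release.lstrip("Rr").split("."))


def pcie_fault_handling(release):
    """Summarize the automatic PCIe fault handling described for a release."""
    version = parse_release(release)
    if version >= (6, 2, 13):
        return "warmstart, plus possible reboot with fast failover"
    if version >= (6, 2, 11):
        return "warmstart"
    return "none described"


def host_debug_log_source(release):
    """Pick where to read host_debug_err.log from, per the steps above."""
    if parse_release(release) >= (6, 2, 9):
        return "ILOM snapshot"
    # For R6.2.8 and below the ssh procedure applies when a snapshot
    # cannot be gathered successfully.
    return "ILOM snapshot if it succeeds, otherwise ssh restricted shell"
```

For example, pcie_fault_handling("6.2.9") returns "none described", whereas a naive string comparison against "6.2.11" would wrongly treat 6.2.9 as the newer release.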
For a PCIe fault, do not replace the hardware component without prior Engineering approval. Open a new Bug and include all of the collected information for analysis.
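The log inspection and fault-clearing steps in this Solution can be sketched in Python as follows. The regular expression is inferred from the host_debug_err.log lines shown in this document and may not cover every message variant. The helper names are hypothetical, and build_repair_command only formats the fscli command string shown above; it does not execute anything.

```python
import re

# Sample lines copied from the host_debug_err.log excerpts in this document.
SAMPLE_LOG = """\
Fri Nov 4 20:49:51 2016 ID ffff P1 PCIe 6:port3c:PCIE_XP:Correctable:NEXT:76:PCI link bandwidth changed
Thu Nov 24 04:53:03 2016 ID 0165 P1 PCIe 3:port3a:PCIE:Fatal:FIRST:84::Completion Time-out Status
Thu Nov 24 04:53:04 2016 ID 0166 P1 PCIe 3:port3a:PCIE:Fatal:MSG:Fatal Error Msgs Received
Thu Nov 24 04:55:31 2016 ID ffff
"""

# Assumed pattern: "<day> <mon> <dd> <hh:mm:ss> <yyyy> ID <id> ...:PCIE:Fatal:<detail>"
FATAL_RE = re.compile(
    r"^(?P<when>\w{3} \w{3}\s+\d+ [\d:]+ \d{4}) ID (?P<id>\w+) "
    r".*:PCIE:Fatal:(?P<detail>.*)$"
)


def find_pcie_fatal(log_text):
    """Return timestamp, ID, and detail for each PCIE Fatal line."""
    hits = []
    for line in log_text.splitlines():
        match = FATAL_RE.match(line.strip())
        if match:
            hits.append(match.groupdict())
    return hits


def build_repair_command(controller, fru):
    """Format the fscli command that marks an ILOM fault as repaired."""
    return (
        f"./fscli controller -command -controller {controller} "
        f'-commandString IPMIFM -parameters "fmadm repaired {fru}"'
    )


fatal_events = find_pcie_fatal(SAMPLE_LOG)
command = build_repair_command("/CONTROLLER-01", "/SYS/MB/RISER3/PCIE3")
```

Correctable entries such as "PCI link bandwidth changed" are intentionally ignored here, because the step above looks only for PCIE Fatal messages.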
References
<BUG:25182692> - CONTROLLER-01 PCIE FAULT IS A DUPLICATE OF BUG 24591140-AUTOMATICALLY RECOVERED
<BUG:25225099> - SOFTWARE HANDLING OF PCIE ERROR REPORTED IN BUG 24591140
<BUG:24591140> - CONTROLLER 1 PCIE FAULT
Attachments
This solution has no attachment