Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure

Solution 2216951.1 : PCIE-Fatal Errors Reported in ILOM when NVMe Drives are Incorrectly Removed from Server
Applies to:
Exadata X6-2 Hardware - Version All Versions and later
Oracle Server X6-2 - Version All Versions and later
Oracle Server X6-2L - Version All Versions and later
Oracle Server X5-2 - Version All Versions and later
Oracle Server X5-2L - Version All Versions and later
Information in this document applies to any platform.

Symptoms
The ILOM may report pcie-fatal errors when NVMe drives are removed without following the NVMe drive removal procedure in Doc ID 2003727.1 for Exadata, Doc ID 2034512.1 for standalone X5-2/X6-2 servers, or Doc ID 2034530.1 for standalone X5-2L/X6-2L servers.

An example of the pcie-fatal errors is shown below:

2016-08-15/10:48:53 0e212277-19f4-4095-f550-cd82c0a67969 SPX86A-8002-RK
timestamp ereports
fault = fault.io.intel.iio.pcie-fatal@/SYS/MB/P0
fault = fault.io.intel.iio.pcie-fatal@/SYS/MB/PCIE5
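When triaging several faults at once, it can help to pull the affected component paths out of the fault text. The following is a minimal sketch, assuming the fault lines keep the `fault = <class>@<path>` shape shown in the example above; the function name `pcie_fatal_components` is hypothetical and not part of any Oracle tool.

```python
import re

# Assumed line shape, taken from the example above:
#   fault = fault.io.intel.iio.pcie-fatal@/SYS/MB/P0
FAULT_RE = re.compile(r"fault\s*=\s*(?P<cls>[\w.\-]+)@(?P<path>/\S+)")


def pcie_fatal_components(fault_text):
    """Return the component paths reported with a pcie-fatal fault class."""
    paths = []
    for match in FAULT_RE.finditer(fault_text):
        if match.group("cls").endswith("pcie-fatal"):
            paths.append(match.group("path"))
    return paths


sample = (
    "fault = fault.io.intel.iio.pcie-fatal@/SYS/MB/P0 "
    "fault = fault.io.intel.iio.pcie-fatal@/SYS/MB/PCIE5"
)
print(pcie_fatal_components(sample))  # ['/SYS/MB/P0', '/SYS/MB/PCIE5']
```

The sketch only extracts text; it does not decide whether a fault is related to a drive removal, which still requires the event log review described under Cause.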
Changes
An NVMe drive was removed from an X5-2/X5-8/X6-2/X6-8 Extreme Flash Storage Cell or a standalone Oracle Server X5-2/X5-2L/X6-2/X6-2L server.

Cause
Further investigation shows that the errors below caused the fault:

2016-08-15/10:48:51
2016-08-15/10:48:51
2016-08-15/10:48:53

The ILOM event log shows that an NVMe drive was removed from the system around the time of the failure. In this example the event log timestamps run seven hours ahead of the fault timestamps with matching minutes and seconds, which suggests the two logs use different time zones. The example below shows NVMe drive 4 being removed:
82 | 08/15/2016 | 17:48:51 | System Firmware Progress | Management controller initialization | Asserted
83 | 08/15/2016 | 17:48:51 | System Firmware Progress | SMBus initialization | Asserted
84 | 08/15/2016 | 17:48:53 | Critical Interrupt | PCI SERR | Asserted | OEM Data-2 0x00 OEM Data-3 0x07
85 | 08/15/2016 | 17:48:53 | System Firmware Progress | Primary CPU initialization | Asserted
86 | 08/15/2016 | 17:48:53 | System Firmware Progress | Memory initialization | Asserted
87 | 08/15/2016 | 17:48:55 | Entity Presence HDD4/PRSNT | Device Absent
88 | 08/15/2016 | 17:48:55 | Entity Presence NVMe4/PRSNT | Device Absent <<<<<<<<<<<<<<<<<<<<<<<<<<<<
89 | 08/15/2016 | 17:49:12 | System Firmware Progress | Cache initialization | Asserted
8a | 08/15/2016 | 17:49:13 | System Firmware Progress | Secondary CPU Initialization | Asserted
8b | 08/15/2016 | 17:49:33 | System Firmware Progress | PCI resource configuration | Asserted
8c | 08/15/2016 | 17:49:47 | System Firmware Progress | PCI resource configuration | Asserted
8d | 08/15/2016 | 17:49:52 | System Firmware Progress | Video

Solution
For servers with a Linux Operating System installed:

Run lspci and grep for 0953 on X5 servers or 172X on X6 servers. An example output from an X6 server is below:
[root@cel01 ~]# lspci | grep 172X
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
07:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
25:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
27:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
86:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
88:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
96:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
98:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)

NOTE: If one or more NVMe drives are missing from the lspci output, confirm that all NVMe drives are seated correctly. If necessary, engage TSC x86 with the latest sundiag output to investigate further.

When you have confirmed that all NVMe drives are present, advise the customer to clear the ILOM fault. Instructions on how to clear the ILOM fault are available in Doc ID 1381773.1.
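The lspci check above can be scripted as a simple count. The following is a minimal sketch, assuming the lspci output has already been captured as text and that the number of installed NVMe drives is known for the server in question; the function name `missing_nvme_count` and the `expected` parameter are hypothetical and used only for illustration.

```python
def missing_nvme_count(lspci_output, device_id, expected):
    """Return how many fewer NVMe controllers appear than expected.

    device_id is the string to look for in each lspci line
    ("0953" on X5 servers or "172X" on X6 servers, per this document).
    expected is the number of installed NVMe drives (an assumption the
    caller must supply; it depends on the server configuration).
    """
    found = sum(1 for line in lspci_output.splitlines() if device_id in line)
    return max(expected - found, 0)


sample = (
    "05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd "
    "NVMe SSD Controller 172X (rev 01)\n"
    "07:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd "
    "NVMe SSD Controller 172X (rev 01)\n"
)
print(missing_nvme_count(sample, "172X", expected=2))  # 0
print(missing_nvme_count(sample, "172X", expected=8))  # 6
```

On a live host the text could be captured with `subprocess.run(["lspci"], capture_output=True, text=True).stdout`. A non-zero result only indicates that a reseat check is warranted; it does not identify which drive is missing.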
For servers with a Solaris Operating System installed:

NVMe storage drives are labeled on the system front panel as NVMe0, NVMe1, NVMe2, and NVMe3. However, the server BIOS internally identifies these drives by their virtual PCIe slot numbers. The following table lists the drive front panel label and its corresponding virtual PCIe slot number used by the operating system.
Note: The virtual PCIe slot name is not the same as the name on the server front panel label.
1. Log in to the Oracle Solaris host.
2. Confirm the number of NVMe drives that are detected and "ENABLED" using the "hotplug list -lc" command. An example output showing 4 NVMe drives is below:
# hotplug list -lc
Connection State Description Path
________________________________________________________________________________
pcie13 ENABLED PCIe-Native /pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@4
pcie12 ENABLED PCIe-Native /pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@5
pcie10 ENABLED PCIe-Native /pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@6
pcie11 ENABLED PCIe-Native /pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@7

If one or more NVMe drives are missing from the hotplug list -lc output, confirm that all NVMe drives are seated correctly. If necessary, engage TSC x86 with the ILOM snapshot and explorer output to investigate further.

When you have confirmed that all the installed NVMe drives are present, advise the customer to clear the ILOM fault. Instructions on how to clear the ILOM fault are available in Doc ID 1381773.1.
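The Solaris check can be scripted in a similar way. The following is a minimal sketch, assuming the `hotplug list -lc` output has been captured as text and keeps the whitespace-separated column layout shown in the example above (connection name first, state second); the function name `enabled_connections` is hypothetical.

```python
def enabled_connections(hotplug_output):
    """Return the connection names whose State column reads ENABLED."""
    names = []
    for line in hotplug_output.splitlines():
        fields = line.split()
        # Header and separator lines do not have ENABLED in the second column.
        if len(fields) >= 2 and fields[1] == "ENABLED":
            names.append(fields[0])
    return names


sample = """Connection State Description Path
________________________________________________________________________________
pcie13 ENABLED PCIe-Native /pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@4
pcie12 ENABLED PCIe-Native /pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@5
pcie10 ENABLED PCIe-Native /pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@6
pcie11 ENABLED PCIe-Native /pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@7
"""
print(enabled_connections(sample))  # ['pcie13', 'pcie12', 'pcie10', 'pcie11']
```

The sketch lists every ENABLED connection it sees. On a system where the command also reports non-NVMe connections, the result would need to be compared against the virtual PCIe slot names for the NVMe drives on that server.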
References
<NOTE:2003727.1> - How to Replace an Exadata X5-2/X6-2 Storage Server NVMe drive
<NOTE:2034512.1> - How to Replace an Oracle Server X5-2 and X6-2 NVMe Disk
<NOTE:2034530.1> - How to Replace an Oracle Server X5-2L and X6-2L NVMe Disk