Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2216951.1
Update Date: 2017-09-18
Keywords:

Solution Type  Problem Resolution Sure

Solution  2216951.1 :   PCIE-Fatal Errors Reported in ILOM when NVMe Drives are Incorrectly Removed from Server  


Related Items
  • Oracle Server X6-2L
  • Oracle Server X5-2
  • Exadata X5-8 Hardware
  • Exadata X6-2 Hardware
  • Oracle Server X6-2
  • Exadata X5-2 Hardware
  • Exadata X6-8 Hardware
  • Oracle Server X5-2L
Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: EXADATA




In this Document
Symptoms
Changes
Cause
Solution
References


Applies to:

Exadata X6-2 Hardware - Version All Versions and later
Oracle Server X6-2 - Version All Versions and later
Oracle Server X6-2L - Version All Versions and later
Oracle Server X5-2 - Version All Versions and later
Oracle Server X5-2L - Version All Versions and later
Information in this document applies to any platform.

Symptoms

The ILOM may report pcie-fatal errors when NVMe drives are removed without following the documented NVMe drive removal procedure: Doc ID 2003727.1 for Exadata, Doc ID 2034512.1 for standalone X5-2/X6-2 servers, and Doc ID 2034530.1 for standalone X5-2L/X6-2L servers.

An example of the pcie-fatal errors is shown below:

2016-08-15/10:48:53 0e212277-19f4-4095-f550-cd82c0a67969 SPX86A-8002-RK

timestamp ereports
2016-08-15/10:48:53
ereport.io.intel.iio.pcie-fatal-from-downstream@/sys/mb/p0/iio/dev00/fn0

fault = fault.io.intel.iio.pcie-fatal@/SYS/MB/P0
certainty = 50.0 %
FRU = /SYS/MB/P0
ASRU = /SYS/MB/P0
resource = /SYS/MB/P0
_list_sz = 2
_list_idx = 0
_diagnosis_engine_version = 1.0
_diagnosis_engine_name = fdd
system_serial_number = XXXXXXXXXX
system_part_number = Exadata X6-8
system_name = Exadata X6-8
system_manufacturer = Oracle Corporation
chassis_serial_number = XXXXXXXXXX
chassis_part_number = 7323224
chassis_name = ORACLE SERVER X6-2L
chassis_manufacturer = Oracle Corporation
system_component_serial_number = XXXXXXXXXX
system_component_part_number = 7323224
system_component_name = ORACLE SERVER X6-2L
system_component_manufacturer = Oracle Corporation
fru_name = Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
fru_part_number = 060F
[skipped fruid update]

fault = fault.io.intel.iio.pcie-fatal@/SYS/MB/PCIE5
certainty = 50.0 %
FRU = /SYS/MB/PCIE5
ASRU = /SYS/MB/PCIE5
resource = /SYS/MB/PCIE5
_list_sz = 2
_list_idx = 1
_diagnosis_engine_version = 1.0
_diagnosis_engine_name = fdd
system_serial_number = XXXXXXXXXX
system_part_number = Exadata X6-8
system_name = Exadata X6-8
system_manufacturer = Oracle Corporation
chassis_serial_number = XXXXXXXXXX
chassis_part_number = 7323224
chassis_name = ORACLE SERVER X6-2L
chassis_manufacturer = Oracle Corporation
system_component_serial_number = XXXXXXXXXX
system_component_part_number = 7323224
system_component_name = ORACLE SERVER X6-2L
system_component_manufacturer = Oracle Corporation
[skipped fruid update]

 

Changes

An NVMe drive was removed from an X5-2/X5-8/X6-2/X6-8 Extreme Flash Storage Cell or from a standalone Oracle Server X5-2/X5-2L/X6-2/X6-2L server.

Cause

Further investigation shows that the ereports below led to the fault:

2016-08-15/10:48:51
ereport.io.intel.iio.pcie-correctable-from-downstream@/SYS/MB/P0/IIO/DEV03/FN0/DEV00/FN0/DEV05/FN0
port = PCIe 5
slot_path = /SYS/DBP/NVMe4:/SYS/MB/PCIE5/PCIESW

2016-08-15/10:48:51
ereport.io.intel.iio.pcie-receiver-error@/SYS/MB/P0/IIO/DEV03/FN0/DEV00/FN0/DEV05/FN0
port = PCIe 5
slot_path = /SYS/DBP/NVMe4:/SYS/MB/PCIE5/PCIESW

2016-08-15/10:48:53
ereport.io.intel.iio.pcie-fatal-from-downstream@/SYS/MB/P0/IIO/DEV00/FN0
port = PCIe 5
slot_path = /SYS/MB:/SYS/MB/P0:/SYS/MB/PCIE5

 

The ILOM event log shows that an NVMe drive was removed from the system around the time of the failure. The example below shows NVMe drive 4 being removed:

  

82 | 08/15/2016 | 17:48:51 | System Firmware Progress | Management controller initialization | Asserted
83 | 08/15/2016 | 17:48:51 | System Firmware Progress | SMBus initialization | Asserted
84 | 08/15/2016 | 17:48:53 | Critical Interrupt | PCI SERR | Asserted | OEM Data-2 0x00 OEM Data-3 0x07
85 | 08/15/2016 | 17:48:53 | System Firmware Progress | Primary CPU initialization | Asserted
86 | 08/15/2016 | 17:48:53 | System Firmware Progress | Memory initialization | Asserted
87 | 08/15/2016 | 17:48:55 | Entity Presence HDD4/PRSNT | Device Absent
88 | 08/15/2016 | 17:48:55 | Entity Presence NVMe4/PRSNT | Device Absent <<<<<<<<<<<<<<<<<<<<<<<<<<<<
89 | 08/15/2016 | 17:49:12 | System Firmware Progress | Cache initialization | Asserted
8a | 08/15/2016 | 17:49:13 | System Firmware Progress | Secondary CPU Initialization | Asserted
8b | 08/15/2016 | 17:49:33 | System Firmware Progress | PCI resource configuration | Asserted
8c | 08/15/2016 | 17:49:47 | System Firmware Progress | PCI resource configuration | Asserted
8d | 08/15/2016 | 17:49:52 | System Firmware Progress | Video
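
When the event log is long, the removal entries can be picked out mechanically. The following is a minimal sketch, assuming the event log has been saved as text in the format shown above. The function name list_absent_nvme and the file name ilom_events.txt are hypothetical and not part of any Oracle tool.

```shell
#!/bin/sh
# list_absent_nvme: read ILOM event log text on stdin and print the label
# of every NVMe drive that has an "Entity Presence ... Device Absent" entry.
# Assumption: the log lines use the "NVMe<N>/PRSNT | Device Absent" layout
# shown in the example above.
list_absent_nvme() {
    sed -n 's#.*Entity Presence \(NVMe[0-9][0-9]*\)/PRSNT | Device Absent.*#\1#p'
}

# Example use (ilom_events.txt is a hypothetical saved copy of the log):
#   list_absent_nvme < ilom_events.txt
```

Run against the example log above, this would be expected to print the label of the drive that was pulled, which can then be compared with the slot_path reported in the ereports.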

  

Solution

For servers with a Linux Operating System installed:

Run lspci and grep for 0953 on X5 servers or 172X on X6 servers. Example output from an X6 server is below:

  

[root@cel01 ~]# lspci | grep 172X
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
07:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
25:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
27:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
86:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
88:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
96:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)
98:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01)

  

NOTE:
For Exadata configurations you would expect to see 8 NVMe drives in each Extreme Flash storage cell.

If one or more NVMe drives are missing from the lspci output, confirm that all NVMe drives are seated correctly. If necessary, engage TSC x86 with the latest sundiag output to investigate further.
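
The count can also be checked in one step rather than by eye. This is a minimal sketch, assuming the same grep patterns as above (0953 on X5, 172X on X6) and an expected drive count supplied by the operator; 8 is the figure given in the note for Exadata Extreme Flash storage cells. The function name check_nvme_count is hypothetical.

```shell
#!/bin/sh
# check_nvme_count PATTERN EXPECTED
# Reads lspci output on stdin, counts the lines matching PATTERN, and
# compares the count with EXPECTED. Returns non-zero on a mismatch.
check_nvme_count() {
    pattern=$1
    expected=$2
    # grep -c prints 0 and exits non-zero when nothing matches, so the
    # exit status is ignored and only the printed count is used.
    found=$(grep -c -- "$pattern" || true)
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected NVMe controllers visible"
    else
        echo "MISSING: $found of $expected NVMe controllers visible"
        return 1
    fi
}

# Example use on an X6 Extreme Flash storage cell:
#   lspci | check_nvme_count 172X 8
```

A non-zero return only indicates a count mismatch; it does not identify which drive is missing, so the seating check still applies.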

When you have confirmed that all NVMe drives are present, advise the customer to clear the ILOM fault. Instructions on how to clear the ILOM fault are available in Doc ID 1381773.1.

 

For servers with a Solaris Operating System installed:

NVMe storage drives are labeled on the system front panel as NVMe0, NVMe1, NVMe2, and NVMe3. However, the server BIOS internally identifies these drives by their virtual PCIe slot numbers.

The following table lists the drive front panel label and its corresponding virtual PCIe slot number used by the operating system.

 

Front Panel Storage Drive Label    Virtual PCIe Slot Number
NVMe0 (HDD2)                       PCIe slot 10
NVMe1 (HDD3)                       PCIe slot 11
NVMe2 (HDD4)                       PCIe slot 12
NVMe3 (HDD5)                       PCIe slot 13

 

Note: The virtual PCIe slot name is not the same as the name on the server front panel label.
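
The mapping in the table can be expressed as a small lookup. This is a minimal sketch, assuming that the hotplug connection name is "pcie" followed by the virtual slot number, as the sample hotplug output in this document suggests. The function name nvme_label_to_connection is hypothetical.

```shell
#!/bin/sh
# nvme_label_to_connection LABEL
# Translate a front panel NVMe label into the connection name expected
# in "hotplug list -lc" output, per the table above.
# Assumption: connection name = "pcie" + virtual PCIe slot number.
nvme_label_to_connection() {
    case "$1" in
        NVMe0) echo pcie10 ;;
        NVMe1) echo pcie11 ;;
        NVMe2) echo pcie12 ;;
        NVMe3) echo pcie13 ;;
        *) echo "unknown front panel label: $1" >&2; return 1 ;;
    esac
}

# Example use:
#   nvme_label_to_connection NVMe2
```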

  

1. Log in to the Oracle Solaris host.

2. Confirm the number of NVMe drives that are detected and "ENABLED" using the "hotplug list -lc" command. Example output showing 4 NVMe drives is below:

  

# hotplug list -lc
Connection State Description
Path
________________________________________________________________________________
pcie13 ENABLED PCIe-Native
/pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@4
pcie12 ENABLED PCIe-Native
/pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@5
pcie10 ENABLED PCIe-Native
/pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@6
pcie11 ENABLED PCIe-Native
/pci@7a,0/pci8086,2f08@3/pci111d,80b5@0/pci111d,80b5@7

  

If one or more NVMe drives are missing from the hotplug list -lc output, confirm that all NVMe drives are seated correctly. If necessary, engage TSC x86 with the ILOM snapshot and explorer output to investigate further.
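
The number of enabled NVMe connections can be counted directly from the command output. This is a minimal sketch, assuming the two-line-per-connection layout shown above and the connection names pcie10 through pcie13 from the table. The function name count_enabled_nvme is hypothetical.

```shell
#!/bin/sh
# count_enabled_nvme: read "hotplug list -lc" output on stdin and print
# how many of the connections pcie10..pcie13 are in the ENABLED state.
# Path lines start with "/" and therefore never match the name test.
count_enabled_nvme() {
    awk '$1 ~ /^pcie1[0-3]$/ && $2 == "ENABLED" { n++ } END { print n + 0 }'
}

# Example use on the Solaris host:
#   hotplug list -lc | count_enabled_nvme
```

A result lower than the number of installed drives points to a drive that is absent or not enabled, which is the case the seating check above is meant to catch.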

When you have confirmed that all the installed NVMe drives are present, advise the customer to clear the ILOM fault. Instructions on how to clear the ILOM fault are available in Doc ID 1381773.1.

 

References:

* How to Replace an Exadata X5-2/X6-2 Storage Server NVMe drive (Doc ID 2003727.1)
* How to Replace an Oracle Server X5-2 and X6-2 NVMe Disk (Doc ID 2034512.1)
* How to Replace an Oracle Server X5-2L and X6-2L NVMe Disk (Doc ID 2034530.1)
* Removing and Replacing an NVMe Storage Drive Using Oracle Linux https://docs.oracle.com/cd/E41059_01/html/E48312/napsm.gopbe.html
* Removing and Replacing an NVMe Storage Drive Using Oracle Solaris https://docs.oracle.com/cd/E41059_01/html/E48312/napsm.gooqp.html#scrolltoc


  Copyright © 2018 Oracle, Inc.  All rights reserved.