Asset ID: 1-71-2325988.1
Update Date: 2018-05-14
Solution Type: Technical Instruction
Solution 2325988.1: How to Replace an Oracle Server X7-2 NVMe Disk
Related Categories:
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP
Applies to:
Oracle Server X7-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Goal
How to Replace an Oracle Server X7-2 NVMe Disk.
Solution
DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?:
No special skills required, Customer Replaceable Unit (CRU) procedure
TIME ESTIMATE: 30 minutes
TASK COMPLEXITY: 0
FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW: An Oracle Server X7-2 NVMe Disk needs replacement
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? :
It is expected that the Oracle Server X7-2 is up and running, with the operating system booted and the failed drive still visible and accessible to it.
Before proceeding, confirm the part number of the part in hand (either from logistics or an on-site spare) matches the part number dispatched for replacement.
The following commands are provided as a guide in case the customer needs assistance checking the system prior to replacement. If the customer or FSE requires more assistance prior to the physical replacement of the device, X86 HW TSC should be contacted.
NVMe storage drives are supported only on X7-2 servers running Oracle Solaris, Oracle Linux, Oracle VM, or Microsoft Windows Server. Servers running Red Hat Enterprise Linux do not support NVMe drives.
How to replace an NVMe disk from the Oracle Solaris Operating System
Before you begin, the Solaris hotplug daemon must be enabled on the host, as in the sketch below.
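A minimal sketch of checking and enabling the daemon, assuming Oracle Solaris 11 (the hotplug service is delivered as svc:/system/hotplug:default):
# svcs hotplug
# svcadm enable hotplug
The service should report a state of "online" in the svcs output before you continue.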
NVMe Storage Drive Virtual PCIe Slot Designation
If NVMe storage drives are installed, they are labeled on the system front panel as NVMe0, NVMe1, NVMe2, NVMe3, NVMe4, NVMe5, NVMe6 and NVMe7. However, the server BIOS internally identifies these drives by their virtual PCIe slot numbers. When using operating system commands to power NVMe drives off before removal, you need to know the virtual PCIe slot number of the drive.
The following table lists the drive front panel label and its corresponding virtual PCIe slot number used by the operating system.
Front Panel Label | Virtual PCIe Slot Number
NVMe0 | PCIe slot 100
NVMe1 | PCIe slot 101
NVMe2 | PCIe slot 102
NVMe3 | PCIe slot 103
NVMe4 | PCIe slot 104
NVMe5 | PCIe slot 105
NVMe6 | PCIe slot 106
NVMe7 | PCIe slot 107
Note that the virtual PCIe slot name is not the same as the name on the server front panel label. The drive names provided in the table assume that NVMe cabling between the motherboard NVMe connectors and the disk backplane is correct.
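As a quick sanity check of this mapping (illustrative only), the hotplug connection for a single front panel slot can be filtered from the list output; for example, the drive labeled NVMe0 corresponds to connection Slot100:
# hotplug list -lc | grep Slot100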
- Log in to the Oracle Solaris host.
- Find the NVMe drive virtual PCIe slot number using the "hotplug list -lc" command:
# hotplug list -lc
Connection State Description Path
-------------------------------------------------------------------------------------
Slot100 ENABLED PCIe-Native /pci@13,0/pci8086,2030@0/pci111d,80b5@0/pci111d,80b5@5
- Prepare the NVMe drive for removal by powering off the drive slot using the "hotplug poweroff" command. In the following example, the NVMe drive in NVMe0 (PCIe slot 100) is powered off:
# hotplug poweroff Slot100
You can see that the drive is now powered off using the "hotplug list" command. The powered-off drive will have a state of "PRESENT":
# hotplug list -lc
Connection State Description Path
-------------------------------------------------------------------------------------
Slot100 PRESENT PCIe-Native /pci@13,0/pci8086,2030@0/pci111d,80b5@0/pci111d,80b5@5
- Verify that the blue OK to Remove indicator on the NVMe drive is lit.
- On the drive you plan to remove, push the latch release button to open the drive latch.
- Grasp the latch and pull the drive out of the drive slot.
- Verify that the NVMe drive has been removed; in the "hotplug list -lc" output, the slot should now report "EMPTY".
# hotplug list -lc
Connection State Description Path
-----------------------------------------------------------------------------------
Slot100 EMPTY PCIe-Native /pci@13,0/pci8086,2030@0/pci111d,80b5@0/pci111d,80b5@5
- Align the replacement drive with the drive slot.
- Slide the drive into the slot until the drive is fully seated.
- Close the drive latch to lock the drive in place.
- Power on the slot for the drive with the "hotplug enable" command. (The system may perform this step automatically; if not, run this command.)
# hotplug enable Slot100
Confirm that the drive has been enabled and is seen by the system.
# hotplug list -lc
Connection State Description Path
-------------------------------------------------------------------------------------
Slot100 ENABLED PCIe-Native /pci@13,0/pci8086,2030@0/pci111d,80b5@0/pci111d,80b5@5
- To check the NVMe drive health, firmware level, temperature, error log, SMART data, low-level format options, and so on, use the following nvmeadm commands. (An additional verification sketch follows the sample output.)
# nvmeadm list
SUNW-NVME-1
# nvmeadm getlog -h SUNW-NVME-1
SUNW-NVME-1
SMART/Health Information:
Critical Warning: 0
Temperature: 297 Kelvin
Available Spare: 100 percent
Available Spare Threshold: 10 percent
Percentage Used: 0 percent
Data Unit Read: 0x8e467e85 of 512k bytes.
Data Unit Written: 0x28af3dbf of 512k bytes.
Number of Host Read Commands: 0x5c7f318
Number of Host Write Commands: 0x3c02fe4
Controller Busy Time in Minutes: 0x3
Number of Power Cycle: 0x488
Number of Power On Hours: 0xe3e
Number of Unsafe Shutdown: 0x484
Number of Media Errors: 0x0
Number of Error Info Log Entries: 0x0
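As an optional extra check (not part of the dispatched procedure), the replacement drive should also enumerate as a disk at the Solaris level. The format(1M) utility lists all visible disks; redirecting from /dev/null makes it print the list and exit:
# format < /dev/null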
How to replace an NVMe disk from the Oracle Linux Operating System
Linux NVMe hot plug requires that the kernel boot argument "pci=pcie_bus_perf" be set in order to get proper MPS (MaxPayloadSize) and MRR (MaxReadRequest) settings. Fatal errors will occur without this argument. A sketch of setting it follows.
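A minimal sketch of adding the argument on a GRUB2-based Oracle Linux installation (the configuration path shown is typical for a BIOS-booted Oracle Linux 7 system and may differ on your system):
# grep GRUB_CMDLINE_LINUX /etc/default/grub
Append pci=pcie_bus_perf to the GRUB_CMDLINE_LINUX line, then regenerate the configuration and reboot:
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot
After the reboot, confirm the argument is active:
# cat /proc/cmdline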
- Log in to Oracle Linux that is running on the server.
- Obtain information about available NVMe storage devices.
- Obtain the PCIe addresses (Bus Device Function) of enabled NVMe drives using the following command.
# find /sys/devices |egrep 'nvme[0-9][0-9]?$'
/sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/nvme/nvme0
/sys/devices/pci0000:85/0000:85:01.0/0000:8d:00.0/nvme/nvme1
/sys/devices/pci0000:d7/0000:d7:02.0/0000:d9:00.0/nvme/nvme2
/sys/devices/pci0000:d7/0000:d7:03.0/0000:e0:00.0/nvme/nvme3
- Obtain the PCIe virtual slot number (APIC ID):
# egrep -H '.*' /sys/bus/pci/slots/*/address
/sys/bus/pci/slots/0-1/address:0000:17:00
/sys/bus/pci/slots/0-2/address:0000:d7:00
/sys/bus/pci/slots/0-3/address:0000:01:00
/sys/bus/pci/slots/0/address:0000:00:00
/sys/bus/pci/slots/100-1/address:0000:19:00
/sys/bus/pci/slots/100/address:0000:17:02
/sys/bus/pci/slots/101-1/address:0000:20:00
/sys/bus/pci/slots/101/address:0000:17:03
/sys/bus/pci/slots/102-1/address:0000:9b:00
/sys/bus/pci/slots/102/address:0000:85:03
/sys/bus/pci/slots/103-1/address:0000:94:00
/sys/bus/pci/slots/103/address:0000:85:02
/sys/bus/pci/slots/104-1/address:0000:8d:00
/sys/bus/pci/slots/104/address:0000:85:01
/sys/bus/pci/slots/105-1/address:0000:86:00
/sys/bus/pci/slots/105/address:0000:85:00
/sys/bus/pci/slots/106-1/address:0000:e0:00
/sys/bus/pci/slots/106/address:0000:d7:03
/sys/bus/pci/slots/107-1/address:0000:d9:00
/sys/bus/pci/slots/107/address:0000:d7:02
/sys/bus/pci/slots/1/address:0000:ae:00
/sys/bus/pci/slots/2/address:0000:3a:00
/sys/bus/pci/slots/3/address:0000:5d:00
/sys/bus/pci/slots/4/address:0000:5d:02
/sys/bus/pci/slots/8191-1/address:0000:80:00
/sys/bus/pci/slots/8191/address:0000:3a:02
In the above output, notice that the instance names for the NVMe drives do not correspond to the NVMe drive labels on the front of the server. That is, /sys/bus/pci/slots/105-1/address:0000:86:00 corresponds to instance nvme0; however, on the front of the server, this drive is labeled NVMe5.
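A quick way to confirm this mapping (a sketch using standard sysfs paths) is to resolve an instance's device symlink and compare the resulting PCIe address against the slot addresses above:
# readlink -f /sys/class/nvme/nvme0/device
/sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0
Here the ending address 0000:86:00.0 matches slot 105-1, confirming that instance nvme0 is the drive labeled NVMe5.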
- Prepare the NVMe drive for removal by removing the device path from the operating system:
- Use the umount command to unmount any file systems that are mounted on the device.
In Linux, NVMe drives do not use the standard block device labeling, such as /dev/sd*. For example, NVMe drive 0 with a single namespace presents the block device /dev/nvme0n1. If you format and partition that namespace with a single partition, the partition is /dev/nvme0n1p1.
- Remove the device from any multiple device (md) and Logical Volume Manager (LVM) volume using it. If the device is a member of an LVM volume group, it may be necessary to move data off the device using the pvmove command, then use the vgreduce command to remove the physical volume, and (optionally) pvremove to remove the LVM metadata from the disk. (See the sketch after these preparation steps.)
- If the device uses multipathing, run multipath -l and note all the paths to the device. Then, remove the multipathed device using the multipath -f device command.
- Run the blockdev --flushbufs device command to flush any outstanding I/O on all paths to the device.
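A hedged sketch of these preparation steps, assuming the failing drive is instance nvme0 with a single namespace and one partition, and a hypothetical LVM volume group named myvg (apply only the steps that match how the device is actually used):
# umount /dev/nvme0n1p1
# pvmove /dev/nvme0n1p1
# vgreduce myvg /dev/nvme0n1p1
# pvremove /dev/nvme0n1p1
# blockdev --flushbufs /dev/nvme0n1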
- Power off the NVMe slot with the command "echo 0 > /sys/bus/pci/slots/slot_number/power"
Where slot_number is the PCIe slot number obtained in step 2.b above.
For example, to power off the NVMe disk labeled NVMe5 (PCIe slot 105-1):
# echo 0 > /sys/bus/pci/slots/105-1/power
- Verify that the blue OK to Remove indicator on the NVMe drive is lit.
- On the NVMe drive you plan to remove, push the latch release button to open the drive latch.
- Grasp the latch and pull the drive out of the drive slot.
- Verify that the NVMe drive has been removed. Type:
# lspci -nnd :0a54
8d:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0a54]
d9:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0a54]
e0:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0a54]
- Note that address 86:00.0, which represents PCIe slot 105-1 (the drive labeled NVMe5 on the system front panel) and was powered off, is no longer listed.
- After you physically remove an NVMe drive from the server, wait at least 10 seconds before installing a replacement drive.
- Align the replacement drive with the drive slot.
- Slide the drive into the slot until the drive is fully seated.
- Close the drive latch to lock the drive in place.
- To power on the slot for the drive, type "echo 1 > /sys/bus/pci/slots/slot_number/power", where slot_number is the PCIe slot number assigned to the NVMe device slot (see step 4 above).
For example, to power on the newly installed NVMe disk NVMe5 (PCIe slot 105-1):
# echo 1 > /sys/bus/pci/slots/105-1/power
- Confirm that the drive has been enabled and is seen by the system. (Additional optional checks follow this list.)
- Check the /var/log/messages log file.
- List available NVMe devices. Type: "ls -l /dev/nvme*"
- List the NVMe PCI devices:
# lspci -nnd :0a54
86:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0a54]
8d:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0a54]
d9:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0a54]
e0:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0a54]
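As an optional extra check (an assumption; the nvme-cli package is not part of every installation), the new controller and its SMART data can be queried directly, and the kernel messages for the hot-added device reviewed:
# nvme list
# nvme smart-log /dev/nvme0
# dmesg | grep -i nvme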
How to replace an NVMe disk from Microsoft Windows Server
NVMe storage drive hot plug is not supported for an Oracle Server X7-2 running Microsoft Windows Server. The system must be powered down before removing and replacing an NVMe storage drive.
- Power down the server that contains the storage drive to be removed. (An example shutdown command follows this list.)
- On the NVMe drive you plan to remove, push the latch release button to open the drive latch.
- Grasp the latch and pull the drive out of the drive slot.
- Align the replacement drive with the drive slot.
- Slide the drive into the slot until the drive is fully seated.
- Close the drive latch to lock the drive in place.
- Power on the server.
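As a hedged example for the power-down step, a graceful shutdown can be initiated from an elevated command prompt before the drive is physically replaced:
C:\> shutdown /s /t 0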
REFERENCE INFORMATION:
Refer to the Oracle Server X7-2 Service Manual or System Handbook for part information.
Oracle Server X7-2 Service Manual
Oracle System Handbook - Oracle Server X7-2