Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2032240.1
Update Date: 2018-05-15
Keywords:

Solution Type: Technical Instruction

Solution 2032240.1: How to Replace an Oracle Server X5-4 NVMe Disk [VCAP]


Related Items
  • Oracle Server X5-4
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP
  • Microlearning>Video>ML-VID-VCAP




In this Document
Goal
Solution
 NVMe Storage Drive Virtual PCIe Slot Designation


Applies to:

Oracle Server X5-4 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Goal

How to Replace an Oracle Server X5-4 NVMe Disk.

Solution

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED:
No special skills required; this is a Customer Replaceable Unit (CRU) procedure.

TIME ESTIMATE: 30 minutes

TASK COMPLEXITY: 0

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:

PROBLEM OVERVIEW: An Oracle Server X5-4 NVMe Disk needs replacement

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

NVMe drives combine the controller and the storage device in a single unit and have very different failure modes from SAS devices: the controller can report a healthy status yet also report a failure code. If the controller determines that the internal state of the drive metadata could allow the drive to return incorrect data to the host, the drive goes into Disable Logical mode. This mode shuts down the drive storage device, but the controller remains visible to the NVMe driver. This state is also known as ASSERT or BAD_CONTEXT mode.
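
As a hedged example (Solaris only, assuming the default Fault Management configuration), you can ask the fault manager whether it has already diagnosed the drive before dispatching a replacement; a drive in ASSERT/BAD_CONTEXT mode typically still appears to tools such as nvmeadm because the controller remains visible to the driver:

# fmadm faulty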

It is expected that the Oracle X5-4 Server is booted and running and that the failed drive is still available to the operating system.

Before proceeding, confirm the part number of the part in hand (either from logistics or an on-site spare) matches the part number dispatched for replacement.

The following commands are provided as a guide in case the customer needs assistance checking the system prior to replacement. If the customer or field engineer requires more assistance before physically replacing the device, contact the X86 HW TSC.

 

The Oracle X5-4 Server supports NVMe disks on Solaris or Oracle Linux [see the procedures below].


How to replace an NVMe disk from the Solaris Operating System


Before you begin, the Solaris hotplug daemon must be enabled on the host.

If the hotplug command fails with "command not found" or a similar error, enable the service:

# svcadm enable hotplug
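
As an optional check (standard Solaris SMF tooling), you can confirm the service state afterward; it should report "online":

# svcs hotplug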

 

 

For a list of the virtual PCIe slots of NVMe drives as seen by the operating system, see NVMe Storage Drive Virtual PCIe Slot Designation:  https://docs.oracle.com/cd/E56388_01/html/E56396/gomph.html#scrolltoc

NVMe Storage Drive Virtual PCIe Slot Designation

If NVMe storage drives are installed, they are labeled on the system front panel as NVMe0, NVMe1, NVMe2, and NVMe3. However, the server BIOS internally identifies these drives by their virtual PCIe slot numbers. When using operating system commands to power NVMe drives off before removal, you need to know the virtual PCIe slot number of the drive.

The following table lists the drive front panel label and its corresponding virtual PCIe slot number used by the operating system.

Front Panel Storage Drive Label     Virtual PCIe Slot Number
------------------------------------------------------------
NVMe0                               PCIe slot 100
NVMe1                               PCIe slot 101
NVMe2                               PCIe slot 102
NVMe3                               PCIe slot 103

Note that the virtual PCIe slot name is not the same as the name on the server front panel label.
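
As an illustration only (the helper name is hypothetical, not part of any product tooling), the mapping in the table above is simply 100 plus the digit in the front panel label, which a small shell function can express:

label_to_connection() {
    # NVMe0 -> pcie100, NVMe1 -> pcie101, NVMe2 -> pcie102, NVMe3 -> pcie103
    echo "pcie10${1#NVMe}"
}
label_to_connection NVMe2    # prints pcie102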

1.  Log in to the Oracle Solaris host.

2.  Find the NVMe drive virtual PCIe slot number. Type:

 

# hotplug list -lc

This command produces output similar to the following for each of the NVMe drives installed in the server:

# hotplug list -lc
Connection           State           Description
Path                          
-------------------------------------------------------
pcie100              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@4
pcie101              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@5
pcie102              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@6
pcie103              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@7

3.  Prepare the NVMe drive for removal by powering off the drive slot.

For example, to prepare NVMe0 for removal, type the following commands:

# hotplug poweroff pcie100

# hotplug list -lc

The following output appears for the NVMe0 drive slot that has been powered off:

# hotplug list -lc
Connection           State           Description
Path                          
-------------------------------------------------------
pcie100              PRESENT         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@4
pcie101              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@5
pcie102              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@6
pcie103              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@7

 

4.  Verify that the blue OK to Remove indicator on the NVMe drive is lit.

5.  On the drive you plan to remove, push the latch release button to open the drive latch.

6.  Grasp the latch and pull the drive out of the drive slot.

7.  Verify that the NVMe drive has been removed. Type:

# hotplug list -lc

The following output appears (the removed drive will show the EMPTY state):

# hotplug list -lc
Connection           State           Description
Path                          
-------------------------------------------------------
pcie100              EMPTY           PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@4
pcie101              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@5
pcie102              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@6
pcie103              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@7

  

 

8.  Align the replacement drive with the drive slot.

9.  Slide the drive into the slot until the drive is fully seated.

10.  Close the drive latch to lock the drive in place.

11.  Power on the slot for the drive. Type:  [This step may happen automatically; the command is provided here in case it does not.]

  

# hotplug enable pcie100

  

12.  Confirm that the drive has been enabled and is seen by the system. Type:

# hotplug list -lc

The following status is displayed (installed NVMe drives show the ENABLED state).

# hotplug list -lc
Connection           State           Description
Path                          
-------------------------------------------------------
pcie100              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@4
pcie101              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@5
pcie102              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@6
pcie103              ENABLED         PCIe-Native
/pci@0,0/pci8086,2f06@2,2/pci111d,80b5@0/pci111d,80b5@7

  

 

13. To check NVMe drive health, firmware level, temperature, the error log, and SMART data, or to perform a low-level format, use the nvmeadm utility. For example, type:

# nvmeadm list

root@x5-4-bur09-a:~# nvmeadm list
SUNW-NVME-1
SUNW-NVME-2
SUNW-NVME-3
SUNW-NVME-4

root@x5-4-bur09-a:~# nvmeadm getlog -h SUNW-NVME-1
SUNW-NVME-1
SMART/Health Information:
        Critical Warning: 0
        Temperature: 294 Kelvin
        Available Spare: 100 percent
        Available Spare Threshold: 10 percent
        Percentage Used: 1 percent
        Data Unit Read: 0x1ddb62 of 512k bytes.
        Data Unit Written: 0x147179 of 512k bytes.
        Number of Host Read Commands: 0x6387c6d3
        Number of Host Write Commands: 0x7066af00
        Controller Busy Time in Minutes: 0x34e
        Number of Power Cycle: 0x93
        Number of Power On Hours: 0x197
        Number of Unsafe Shutdown: 0x89
        Number of Media Errors: 0x0
        Number of Error Info Log Entries: 0x0

  

 


 
How to replace an NVMe disk from the Oracle Linux Operating System
  1. Log in to Oracle Linux that is running on the server.
  2. Obtain information about available NVMe storage devices.
    1. Obtain the PCIe addresses (Bus Device Function) of enabled NVMe drives. Type:  
      # find /sys/devices | egrep 'nvme[0-9][0-9]?$'
      /sys/devices/pci0000:00/0000:00:02.2/0000:10:00.0/0000:11:04.0/0000:12:00.0/misc/nvme0
      /sys/devices/pci0000:00/0000:00:02.2/0000:10:00.0/0000:11:05.0/0000:13:00.0/misc/nvme1
      /sys/devices/pci0000:00/0000:00:02.2/0000:10:00.0/0000:11:06.0/0000:14:00.0/misc/nvme2
      /sys/devices/pci0000:00/0000:00:02.2/0000:10:00.0/0000:11:07.0/0000:15:00.0/misc/nvme3

      For example, 0000:12:00.0 matches the PCIe address of the drive labeled NVMe0 on the system front panel.

    2. Obtain the PCIe virtual slot number (APIC ID). Type:  
      # egrep -H '.*' /sys/bus/pci/slots/10?/address
      /sys/bus/pci/slots/100/address:0000:12:00
      /sys/bus/pci/slots/101/address:0000:13:00
      /sys/bus/pci/slots/102/address:0000:14:00
      /sys/bus/pci/slots/103/address:0000:15:00

      For example, the PCIe address 0000:12:00.0 matches the PCIe slot number (100) for the drive labeled NVMe0 on the system front panel.

    3. Obtain the NVMe storage device paths. Type:
      # parted -l | grep nvme
      Disk /dev/nvme0n1: 1600GB
      Disk /dev/nvme1n1: 1600GB
      Disk /dev/nvme2n1: 1600GB
      Disk /dev/nvme3n1: 1600GB
       
      The devices correspond to the physical slots as follows:
      /dev/nvme0n1 - NVMe0
      /dev/nvme1n1 - NVMe1
      /dev/nvme2n1 - NVMe2
      /dev/nvme3n1 - NVMe3

     

  3. Remove the NVMe storage device paths to prepare the NVMe drive for removal:
    1. Use the umount command to unmount any file systems that are mounted on the device.

      In Linux, NVMe drives do not use the standard block device labeling, such as /dev/sd*. For example, NVMe drive 0 that has a single namespace block device would be /dev/nvme0n1. If you formatted and partitioned that namespace with a single partition, that would be /dev/nvme0n1p1.

    2. Remove the device from any multiple device (md) and Logical Volume Manager (LVM) volume using it.

      If the device is a member of an LVM volume group, it may be necessary to move data off the device using the pvmove command, then use the vgreduce command to remove the physical volume, and (optionally) pvremove to remove the LVM metadata from the disk.

    3. If the device uses multipathing, run multipath -l and note all the paths to the device. Then, remove the multipathed device using the multipath -f device command.
    4. Run the blockdev --flushbufs device command to flush any outstanding I/O on all paths to the device (where device is the /dev entry obtained in step 2.3 above).
  4. Power off the NVMe slot with the following command:

    # echo 0 > /sys/bus/pci/slots/slot_number/power

    Where slot_number is the virtual PCIe slot number assigned to the NVMe device slot:
    100 - NVMe0
    101 - NVMe1
    102 - NVMe2
    103 - NVMe3

    (A consolidated sketch of the preparation and power-off steps appears after this procedure.)

  5. Verify that the blue OK to Remove indicator on the NVMe drive is lit.
  6. On the NVMe drive you plan to remove, push the latch release button to open the drive latch.
  7. Grasp the latch and pull the drive out of the drive slot.
  8. Verify that the NVMe drive has been removed. Type:
    # lspci -nnd :0953
    13:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0953] (rev 01)
    14:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0953] (rev 01)
    15:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:0953] (rev 01)
Note that address 12:00.0, which represents PCIe slot 100 (the drive labeled NVMe0 on the system front panel), is not listed because that drive has been powered off.

After you physically remove an NVMe drive from the server, wait at least 10 seconds before installing a replacement drive.

9. Align the replacement drive with the drive slot.

10. Slide the drive into the slot until the drive is fully seated.

11. Close the drive latch to lock the drive in place.
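
For the simple case of a drive with a single, unpartitioned namespace that is not part of an md, LVM, or multipath configuration, the preparation and power-off steps above can be summarized in the following hedged sketch; SLOT and DEV are placeholders for the values obtained in step 2 (the values shown correspond to the NVMe0 example used throughout):

SLOT=100                   # virtual PCIe slot number (NVMe0)
DEV=/dev/nvme0n1           # namespace block device (NVMe0)

umount "$DEV" 2>/dev/null                   # step 3.1 (ignore the error if nothing is mounted)
blockdev --flushbufs "$DEV"                 # step 3.4: flush outstanding I/O
echo 0 > /sys/bus/pci/slots/$SLOT/power     # step 4: power off the drive slot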

 

Power On an NVMe Storage Drive

Before You Begin

  • Linux NVMe hot plug requires that the kernel boot argument "pci=pcie_bus_perf" be set in order to get the proper MPS (MaxPayloadSize) and MRR (MaxReadRequest) settings. Fatal errors will occur without this argument. (A quick check of the running kernel is shown after this list.)

  • For a list of the virtual PCIe slots of NVMe drives as seen by the operating system, see NVMe Storage Drive Virtual PCIe Slot Designation:  https://docs.oracle.com/cd/E56388_01/html/E56396/gomph.html#scrolltoc

    • Note that the virtual PCIe slot name is not the same as the name on the server front panel label.
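
A hedged way to confirm this requirement on a running system (it only inspects /proc/cmdline, which reflects the arguments of the current boot):

# grep -o 'pci=pcie_bus_perf' /proc/cmdline || echo "pci=pcie_bus_perf is NOT set"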

  1. Power on the slot for the drive. Type:
    # echo 1 > /sys/bus/pci/slots/slot_number/power

    Where slot_number is the PCIe slot number (e.g., 100, which represents the drive labeled NVMe0 on the system front panel).

  2. Confirm that the drive has been enabled and is seen by the system.

    Do one of the following:

    • Check the /var/log/messages log file.

    • List available NVMe devices. Type:

        

      # ls -l /dev/nvme*

        

    • List the NVMe PCI devices. Type:

  

# lspci -nnd :0953
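
As a hedged follow-up (the traditional /var/log/messages path from the list above is assumed; systemd-based Oracle Linux releases may log to the kernel journal instead, viewable with journalctl -k), recent kernel messages can be scanned for the hot-added controller:

# tail -n 50 /var/log/messages | grep -i nvme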

  

 

 

 

PARTS NOTE:

Refer to the Oracle X5-4 Service Manual or System Handbook for part information.

REFERENCE INFORMATION:

Oracle Server X5-4 Service Manual:  https://docs.oracle.com/cd/E56388_01/html/E56396/gooqp.html#scrolltoc

 


Attachments
This solution has no attachment