Asset ID: 1-71-2003727.1
Update Date: 2018-04-05
Keywords:
Solution Type: Technical Instruction
Solution 2003727.1:
How to Replace an Exadata X5-2/X6-2 Storage Server NVMe drive
Related Items
- Oracle SuperCluster T5-8 Full Rack
- Oracle SuperCluster M7 Hardware
- Exadata SL6 Hardware
- Exadata X6-2 Hardware
- Exadata X6-8 Hardware
- Oracle SuperCluster T5-8 Half Rack
- Exadata X5-2 Hardware
- Exadata X5-2 Eighth Rack
- Exadata X5-2 Full Rack
- Exadata X5-2 Quarter Rack
- Exadata X4-8 Hardware
- Exadata X5-2 Half Rack
- Oracle SuperCluster T5-8 Hardware
- Oracle SuperCluster M6-32 Hardware
Related Categories
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP
Applies to:
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Exadata X6-2 Hardware - Version All Versions and later
Exadata SL6 Hardware - Version All Versions and later
Exadata X6-8 Hardware - Version All Versions and later
Oracle SuperCluster M7 Hardware - Version All Versions and later
Information in this document applies to any platform.
Goal
Procedure for replacing an NVMe drive in an Exadata Storage Cell without loss of data or interruption of Exadata service
Solution
The following information will be required prior to dispatch of a replacement:
Name/location of storage cell
Slot number of failed drive
Special Instructions for Dispatch are required for this part.
For Attention of Dispatcher:
The parts required in this action plan may be available as spares owned by the customer, which they received with the Engineered System. (These are sometimes referred to as ride-along spares.)
If parts are not available to meet the customer's preferred delivery time/planned end date, request that the TAM or field manager contact the customer and ask whether the customer has parts available and would be prepared to use them.
If customer spare parts are used, inform the customer that Oracle will replenish the customer part stock as soon as we can. More details on this process can be found in GDMR procedure "Handling Where No Parts Available" step 2: https://ptp.oraclecorp.com/pls/apex/f?p=151:138:38504529393::::DN,BRNID,DP,P138_DLID:2,86687,4,9082,
WHAT SKILLS DOES THE ENGINEER NEED:
Familiarity with Exadata Storage Servers and hard drive replacement.
TIME ESTIMATE: 60 minutes
The complete process may take longer depending on the re-balance time required.
TASK COMPLEXITY: 2
FIELD ENGINEER INSTRUCTIONS:
PROBLEM OVERVIEW:
Failed NVMe drive in Exadata Extreme Flash Storage Server.
NVMe drives combine the controller and the storage device in a single unit and have very different failure modes compared to SAS devices. The controller can report a healthy status and can also report a failure code. If the controller believes the internal state of the drive metadata could allow the drive to return incorrect data to the host, the drive will go into Disable Logical mode. This mode shuts down the drive's storage device, but the controller remains visible to the NVMe driver. This is also known as ASSERT or BAD_CONTEXT mode.
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:
It is expected that the Exadata Machine is up and running and the storage cell containing the failed drive is booted and available.
If there are multiple drives to be replaced within an Exadata machine (or between an Exadata interconnected with another Exadata or Expansion Cabinet), it is critical that only ONE DRIVE BE REPLACED AT A TIME to avoid the risk of data loss. Before replacing another disk in Exadata, ensure the re-balance operation has completed from the first replacement.
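If needed, the customer's DBA can confirm that no re-balance is currently running by querying gv$asm_operation from any ASM instance (provided as guidance only; the same query is used again in the post-replacement checks below):
SQL> select * from gv$asm_operation;
A result of "no rows selected" indicates that no re-balance is active and it is safe to proceed with the next drive.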
Before proceeding, confirm the part number of the part in hand (either from logistics or an on-site spare) matches the part number dispatched for replacement.
It is expected that the customer's DBA has completed these steps before the engineer arrives to replace the disk. The following commands are provided as guidance in case the customer needs assistance checking the system prior to replacement. If the customer or FSE requires more assistance prior to the physical replacement of the device, EEST/TSC should be contacted.
1. Confirm the drive needing replacement based on the output provided. The example below shows nvmecli output from a drive that is in Disable Logical state (assert):
[root@exdx5-tvp-a-cel3 ~]# nvmecli --identify --device=/dev/nvme7
================== Controller Information =====================
Serial Number : CVMD437300AX1P6LGN
Model Number : INTEL SSDPE2ME016T4S
Firmware Version : 8DV1RA12
Number of Namespaces : 1
Health Indicator : *ASSERT_40351938 80
Internal Device Error: The command was not completed successfully due to an internal
device error.
Alternatively, check that the PCIe device is present using lspci | grep 0953 on X5 servers or lspci | grep 172X on X6 servers. Each NVMe device should appear once; there should be either 8 or 12 NVMe devices present depending on the customer's configuration:
[root@cel3 ~]# lspci | grep 0953
05:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
07:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
25:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
27:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
86:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
88:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
96:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
98:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
Note: on X6 servers the NVMe drives were changed, so instead of "Intel Corporation Device 0953" you should expect to see "Samsung Electronics Co Ltd NVMe SSD Controller 172X" in the lspci output above.
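As an optional quick check (guidance only; for X6 servers substitute 172X for 0953), the number of NVMe controllers visible on the PCIe bus can be counted directly:
[root@cel3 ~]# lspci | grep -c 0953
8
The count should match the number of NVMe devices expected for the configuration (8 or 12).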
2. The Oracle ASM disks associated with the grid disks on the physical disk will be automatically dropped with FORCE option, and an ASM re-balance will start immediately to restore the data redundancy.
Validate the failed NVMe drive is no longer part of the ASM diskgroups:
a) Login to a database node with the username for the owner of Oracle Grid Infrastructure home. Typically this is the 'oracle' user.
edx2db01 login: oracle
Password:
Last login: Thu Jul 12 14:43:10 on ttyS0
[oracle@edx2db01 ~]$
b) Select the ASM instance for this DB node and connect to SQL Plus:
[oracle@edx2db01 ~]$ . oraenv
ORACLE_SID = [oracle] ? +ASM1
The Oracle base has been set to /u01/app/oracle
[oracle@edx2db01 ~]$ sqlplus ' / as sysasm'
SQL*Plus: Release 11.2.0.2.0 Production on Thu Jul 12 14:45:20 2012
Copyright (c) 1982, 2010, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL>
In the above output, the "1" of "+ASM1" refers to the DB node number. For example, on DB node #3 the value would be +ASM3.
c) From the DB node, run the following query using the celldisk name associated with this physical disk, which is given in the cell alert. An example is below:
SQL> select group_number,path,header_status,mount_status,mode_status,name from V$ASM_DISK where path like '%_05_exdx5_tvp_a_cel3';
no rows selected.
SQL>
This query should return no rows indicating the disk is no longer in the ASM diskgroup configuration. If this returns any other value, then contact the SR owner for further guidance.
Note: If you are not sure what the celldisk name is, or do not have the alert output available, from the CellCLI interface run "list alerthistory"
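For example (guidance only; the exact alert text varies by system):
CellCLI> list alerthistory detail
The celldisk and lun names referenced by the alert appear in the alert message body.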
3. The Cell Management Server daemon monitors and takes action on replacement disks to automatically bring the new disk into the configuration.
a) Login to the cell server and enter the CellCLI interface
# cellcli
CellCLI: Release 12.1.2.1.0 - Production on Thu Apr 16 07:05:44 BST 2015
Copyright (c) 2007, 2013, Oracle. All rights reserved.
Cell Efficiency Ratio: 504
CellCLI>
b) Verify that msStatus is 'running' before replacing the disk:
CellCLI> list cell attributes cellsrvStatus,msStatus,rsStatus detail
cellsrvStatus: running
msStatus: running
rsStatus: running
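The same check can be run non-interactively from the cell OS shell (guidance only):
# cellcli -e list cell attributes cellsrvStatus,msStatus,rsStatus detail
All three services should report "running" before continuing.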
4. If the failed NVMe drive is in slot 0 or slot 1, then the disk is a system disk which contains the running OS. Verify the root volume is in 'clean' state before hot replacing a system disk. If it is 'active' and the disk is hot removed, the OS may crash making the recovery more difficult.
a. Login as 'root' on the Storage Cell, and use 'df' to determine the md device name for "/" volume:
[root@cel2 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md5 10321144 6340492 3456368 65% /
tmpfs 32791712 4 32791708 1% /dev/shm
/dev/md7 3096272 1655564 1283428 57% /opt/oracle
/dev/md4 483886 27532 431359 6% /boot
/dev/md11 5157312 210004 4685328 5% /var/log/oracle
b. Use 'mdadm' to determine the volume status:
[root@cel2 ~]# mdadm -Q --detail /dev/md5
/dev/md5:
Version : 0.90
Creation Time : Thu Dec 25 12:59:29 2014
Raid Level : raid1
Array Size : 10485696 (10.00 GiB 10.74 GB)
Used Dev Size : 10485696 (10.00 GiB 10.74 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 5
Persistence : Superblock is persistent
Update Time : Thu Apr 16 07:14:10 2015
State : clean <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 711f6845:90d4551d:04894333:532a878b
Events : 0.241
Number Major Minor RaidDevice State
0 259 5 0 active sync /dev/nvme0n1p5
1 259 17 1 active sync /dev/nvme1n1p5
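As an optional shortcut (guidance only; confirm the md device for "/" with 'df' first, as it may differ), the array state line can be extracted directly:
[root@cel2 ~]# mdadm -Q --detail /dev/md5 | grep 'State :'
State : clean
Do not hot remove a system disk unless the state is 'clean'.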
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
Confirm the drive needing replacement based on the output provided ("name" or "slotNumber" value) and the LED status of the drive. To remove the NVMe drive, the PCIe hot-plug procedure MUST be followed. If the drive is removed without performing the hot-removal operation, the system will crash and reset with a PCIe Surprise Link Down against the drive. There is a clear visual indication (blue LED) when the drive is safe to remove. If the blue LED is not lit, do not remove the drive.
Drives have both a physical slot location and an instance in /dev, and these may not match numerically. For example, physical slot 10 may be /dev/nvme7, depending on how many drives were populated at boot time. When preparing to gather data and logs from a drive, always check the physical-to-logical mapping with the command below:
cellcli -e list physicaldisk detail
name: NVME_7
deviceName: /dev/nvme5n1
The "name" field is the physical slot; "deviceName" is the /dev/nvme entry.
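To see the slot-to-device mapping for all drives in a single view, the attribute form of the same command can be used (guidance only):
cellcli -e list physicaldisk attributes name,deviceName,slotNumber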
Slot locations for Extreme Flash Exadata Storage Cell:
View from the front:
8 Drive Configuration:
[Filler] [NVMe0] [Filler] [NVMe1] [Filler] [Filler] [Filler] [NVMe3] [Filler] [NVMe4] [Filler] [Filler] [Filler] [NVMe6] [Filler] [NVMe7] [Filler] [Filler] [Filler] [NVMe9] [Filler] [NVMe10] [Filler] [Filler]
1. To prepare an NVMe drive for removal, the following command MUST be run:
cellcli -e alter physicaldisk NVME_# drop for replacement
where # is the slot ID.
eg:
CellCLI> alter physicaldisk NVME_7 DROP FOR REPLACEMENT
Physical disk NVME_7 was dropped for replacement.
CellCLI> list physicaldisk
NVME_0 CVMD4470007K1P6LGN normal
NVME_1 CVMD437300J61P6LGN normal
NVME_3 CVMD447100791P6LGN normal
NVME_4 CVMD439000611P6LGN normal
NVME_6 CVMD4471001E1P6LGN normal
NVME_7 CVMD4471006X1P6LGN normal - dropped for replacement
NVME_9 CVMD4415003D1P6LGN normal
NVME_10 CVMD437300AX1P6LGN normal
CellCLI> list physicaldisk NVME_7 detail
name: NVME_7
deviceName: /dev/nvme5n1
diskType: FlashDisk
luns: 0_7
makeModel: "Oracle NVMe SSD"
notPresentSince: 2015-04-23T08:45:01+01:00
physicalFirmware: 8DV1RA10
physicalInsertTime: 2015-03-31T19:29:29+01:00
physicalSerial: CVMD4471006X1P6LGN
physicalSize: 1.4554837569594383T
slotNumber: 7
status: normal - dropped for replacement
NOTE: The blue OK to Remove status indicator LED on the drive will light once a PCIe hot-remove operation has completed. Do not remove the drive until this LED indicator is lit, otherwise a system reset could occur.
2. On the drive you plan to remove, push the latch release button to open the drive latch.
3. Grasp the latch and pull the drive out of the drive slot. (Caution: whenever you remove a storage drive, replace it with another storage drive or a filler panel; otherwise the server might overheat due to improper airflow.)
4. Wait three minutes for the MS daemon to recognize the removal of the old drive.
5. Slide the replacement drive into the slot until the drive is fully seated.
6. Close the drive latch to lock the drive in place.
7. The drive should automatically power on when inserted. /var/log/messages will report that a drive is present and identify the slot ID.
8. Wait three minutes for the MS daemon to start rebuilding the virtual drives before proceeding.
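To watch for the messages referenced in step 7, the system log can be followed from the cell OS while the drive is inserted (guidance only; the exact message text varies by image version):
# tail -f /var/log/messages | grep -i nvme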
OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:
It is expected that the engineer stay on-site until the customer has given the approval to depart. The customer should check the status of the drive after replacement. The following commands are provided as guidance in case the customer needs assistance checking the status of the system following replacement. If the customer or the FSE requires more assistance following the physical replacement of the device, EEST/TSC should be contacted.
After replacing the NVMe drive on the Exadata Storage Server, wait for three minutes before running any commands to query the device from the server. CellCLI (examples below) should be the principal tool used to query the drives.
1. Re-enable the NVMe drive after inserting it:
CellCLI> alter physicaldisk NVME_7 reenable
CellCLI> list physicaldisk NVME_7 detail
name: NVME_7
deviceName: /dev/nvme5n1
diskType: FlashDisk
luns: 0_7
makeModel: "Oracle NVMe SSD"
physicalFirmware: 8DV1RA10
physicalInsertTime: 2015-04-23T08:57:44+01:00
physicalSerial: CVMD4471006X1P6LGN
physicalSize: 1.4554837569594383T
slotNumber: 7
status: normal
The "status" field should report "normal". Note also that the physicalInsertTime should be current date and time, and not an earlier time. If they are not, then the old disk entries may still be present and the disk replacement did not complete successfully. If this is the case, refer to the SR owner for further assistance.
2. The firmware of the drive will be automatically upgraded to match the other disks in the system when the new drive is inserted, if it is below the supported version of the current image. If it is above the minimum supported version then no action will be taken, and the newer firmware will remain. This can be validated by the following command:
CellCLI> alter cell validate configuration
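The firmware level reported by each drive can also be compared across the cell (guidance only):
CellCLI> list physicaldisk attributes name,physicalFirmware
All drives of the same type should report the same firmware version once any automatic upgrade has completed.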
3. After the drive is replaced, a lun should be automatically created, and the grid disks and cell disks that existed on the previous disk in that slot are automatically re-created on the new physical disk. If those grid disks were part of an Oracle ASM group, then they will be added back to the disk group and the data will be re-balanced on them, based on the disk group redundancy and asm_power_limit parameter values.
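If the customer wishes to check the current re-balance power before monitoring progress, the parameter can be displayed from the ASM instance (guidance only; changing it is a customer/DBA decision):
SQL> show parameter asm_power_limit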
Grid disks and cell disks can be verified with the following CellCLI commands, where the lun name is reported in the physicaldisk output from step 1 above ("0_7" in this example):
CellCLI> list lun 0_7 detail
name: 0_7
cellDisk: FD_05_exdx5_tvp_a_cel3
deviceName: /dev/nvme5n1
diskType: FlashDisk
id: 0_7
isSystemLun: FALSE
lunSize: 1.4554837569594383T
physicalDrives: NVME_7
status: normal
CellCLI> list celldisk where lun=0_7 detail
name: FD_05_exdx5_tvp_a_cel3
comment:
creationTime: 2015-03-31T11:56:22+01:00
deviceName: /dev/nvme5n1
devicePartition: /dev/nvme5n1
diskType: FlashDisk
errorCount: 0
freeSpace: 0
id: 86f50408-9216-43d9-8e28-275d2a19df6f
interleaving: none
lun: 0_7
physicalDisk: CVMD4471006X1P6LGN
raidLevel:
size: 1.455474853515625T
status: normal
CellCLI> list griddisk where celldisk=FD_05_exdx5_tvp_a_cel3 detail
name: BACKUP_FD_05_exdx5_tvp_a_cel3
asmDiskGroupName: BACKUP_DG
asmDiskName: BACKUP_FD_05_EXDX5_TVP_A_CEL3
asmFailGroupName: EXDX5_TVP_A_CEL3
availableTo:
cachedBy: FD_05_exdx5_tvp_a_cel3
cachingPolicy: default
cellDisk: FD_05_exdx5_tvp_a_cel3
comment: "Cluster exdx5-clu1 diskgroup BACKUP"
creationTime: 2015-11-20T22:29:34+00:00
diskType: FlashDisk
errorCount: 0
id: 5fcf1ec5-b05b-4f2e-8f73-2723de18ccfa
offset: 390.625G
size: 293G
status: active
name: DATAC1_FD_05_exdx5_tvp_a_cel3
asmDiskGroupName: DATAC1
asmDiskName: DATAC1_FD_05_EXDX5_TVP_A_CEL3
asmFailGroupName: EXDX5_TVP_A_CEL3
availableTo:
cachedBy: FD_05_exdx5_tvp_a_cel3
cachingPolicy: default
cellDisk: FD_05_exdx5_tvp_a_cel3
comment: "Cluster RacA diskgroup DATAC1"
creationTime: 2015-09-28T16:38:00+01:00
diskType: FlashDisk
errorCount: 0
id: 8ccd73e7-0fdf-4680-8e17-8c0fe0718b6e
offset: 91.625G
size: 257G
status: active
name: FS_DG1_FD_05_exdx5_tvp_a_cel3
asmDiskGroupName: FS_DG1
asmDiskName: FS_DG1_FD_05_EXDX5_TVP_A_CEL3
asmFailGroupName: EXDX5_TVP_A_CEL3
availableTo:
cachedBy: FD_05_exdx5_tvp_a_cel3
cachingPolicy: default
cellDisk: FD_05_exdx5_tvp_a_cel3
comment: "Cluster RacA diskgroup FS_DG1"
creationTime: 2015-09-28T16:37:59+01:00
diskType: FlashDisk
errorCount: 0
id: 3d93b6f2-4677-44cb-bf1e-2a7ce5bc5a70
offset: 74.625G
size: 17G
status: active
name: RECOC1_FD_05_exdx5_tvp_a_cel3
asmDiskGroupName: RECOC1
asmDiskName: RECOC1_FD_05_EXDX5_TVP_A_CEL3
asmFailGroupName: EXDX5_TVP_A_CEL3
availableTo:
cachedBy: FD_05_exdx5_tvp_a_cel3
cachingPolicy: default
cellDisk: FD_05_exdx5_tvp_a_cel3
comment: "Cluster RacA diskgroup RECOC1"
creationTime: 2015-09-28T16:38:01+01:00
diskType: FlashDisk
errorCount: 0
id: b11670a7-6bed-4433-8de9-91666cb33a33
offset: 348.625G
size: 42G
status: active
Status should be normal for the cell disks and active for the grid disks. All of the creation times should also match the insertion time of the replacement disk. If they are not, then the old disk entries may still be present and the disk replacement did not complete successfully. If this is the case, refer to the SR owner for further assistance.
Note: The lun name attribute will also be shown in the original alert generated by the storage cell.
4. To confirm the status of the re-balance, connect to the ASM instance on a database node and validate that the disks were added back to the ASM diskgroups and that a re-balance is running:
SQL> set linesize 132
SQL> col path format a50
SQL> select group_number,path,header_status,mount_status,name from V$ASM_DISK where path like '%_05_exdx5_tvp_a_cel3';
GROUP_NUMBER PATH HEADER_STATU MOUNT_S NAME
------------ -------------------------------------------------- ------------ ------- ------------------------------
4 o/192.168.10.28;192.168.10.29/BACKUP_FD_05_exdx5_t MEMBER CACHED BACKUP_FD_05_EXDX5_TVP_A_CEL3
vp_a_cel3
2 o/192.168.10.28;192.168.10.29/FS_DG1_FD_05_exdx5_t MEMBER CACHED FS_DG1_FD_05_EXDX5_TVP_A_CEL3
vp_a_cel3
1 o/192.168.10.28;192.168.10.29/DATAC1_FD_05_exdx5_t MEMBER CACHED DATAC1_FD_05_EXDX5_TVP_A_CEL3
vp_a_cel3
3 o/192.168.10.28;192.168.10.29/RECOC1_FD_05_exdx5_t MEMBER CACHED RECOC1_FD_05_EXDX5_TVP_A_CEL3
vp_a_cel3
GROUP_NUMBER PATH HEADER_STATU MOUNT_S NAME
------------ -------------------------------------------------- ------------ ------- ------------------------------
SQL> select * from gv$asm_operation;
INST_ID GROUP_NUMBER OPERA STAT POWER ACTUAL SOFAR EST_WORK EST_RATE
---------- ------------ ----- ---- ---------- ---------- ---------- ---------- ----------
EST_MINUTES ERROR_CODE
----------- --------------------------------------------
2 3 REBAL WAIT 10
1 3 REBAL RUN 10 10 1541 2422
7298 0
An active re-balance operation can be identified by STATE=RUN. The group_number and inst_id columns provide the number of the diskgroup being re-balanced and the instance number where the operation is running. The re-balance operation is complete when the above query returns "no rows selected".
Validate the expected number of griddisks per failgroup and diskgroup.
SQL> select group_number,failgroup,mode_status,count(*) from v$asm_disk group by group_number,failgroup,mode_status;
The re-balance operation has completed when there are no "group_number" values of 0 and each disk group reports the same number of disks in each failgroup.
5. If the disk replaced was a system disk in slot 0 or 1, then the status of the OS volume should also be checked. Login as 'root' on the Storage cell and check the status using the same 'df' and 'mdadm' commands listed above:
[root@dbm1cel1 ~]# mdadm -Q --detail /dev/md5
/dev/md5:
Version : 0.90
Creation Time : Tue Mar 31 12:14:45 2015
Raid Level : raid1
Array Size : 10485696 (10.00 GiB 10.74 GB)
Used Dev Size : 10485696 (10.00 GiB 10.74 GB)
Raid Devices : 2
Total Devices : 3
Preferred Minor : 5
Persistence : Superblock is persistent
Update Time : Mon Apr 27 02:53:48 2015
State : active, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 1
Spare Devices : 1
UUID : e75c1b6a:64cce9e4:924527db:b6e45d21
Events : 0.215
Number Major Minor RaidDevice State
3 65 213 0 spare rebuilding /dev/nvme0n1p5
1 8 21 1 active sync /dev/nvme1n1p5
2 8 5 - faulty spare
[root@dbm1cel1 ~]#
While the system disk is rebuilding, the state will show as "active, degraded" or "active, degraded, recovering", with one device listed as rebuilding and a third listed as 'faulty'. After the rebuild has started, re-running this command will include a "Rebuild Status: X% complete" line in the output. When the system disk sync is complete, the state should return to "clean" with only 2 devices.
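Rebuild progress can be polled periodically (guidance only; adjust the md device name if the root volume differs):
# watch -n 60 "mdadm -Q --detail /dev/md5 | egrep 'State :|Rebuild Status'"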
If the status of any of the above checks (firmware, grid disk / cell disk creation, re-balance) is not successful, re-engage Oracle Support to get the correct action plan to manually complete the required steps.
PARTS NOTE:
Refer to the Exadata Database Machine Owner's Guide Appendix D for part information.
REFERENCE INFORMATION:
Exadata Database Machine Documentation:
Exadata Database Machine Owner's Guide is available on the Storage Server OS image in /opt/oracle/cell/doc/welcome.html
http://amomv0115.us.oracle.com/archive/cd_ns/E13877_01/welcome.html
Oracle Exadata Storage Server X5-2 Extreme Flash Service Manual
Mirror partitions not resynced after replacing failed system drive (lun 0 or 1) (Doc ID 1316829.1)
Internal Only References:
- Replacing a physicaldisk on a storage cell , cellcli list physicaldisk reports two entries on same slot but LUN is not created (Doc ID 1352938.1)
- Exadata Documentation - http://amomv0115.us.oracle.com/archive/cd_ns/E50790_01/doc/doc.121/e51951/storage.htm#DBMMN21046
Attachments
This solution has no attachment