Asset ID: |
1-71-1999330.1 |
Update Date: | 2018-04-05 |
Keywords: | |
Solution Type
Technical Instruction Sure
Solution
1999330.1
:
How to Replace an Exadata X5-2/X6-2 Storage Server NVMe cable
Related Items |
- Oracle SuperCluster T5-8 Full Rack
- Oracle SuperCluster M7 Hardware
- Zero Data Loss Recovery Appliance X6 Hardware
- Exadata SL6 Hardware
- Oracle SuperCluster T5-8 Half Rack
- Exadata X6-8 Hardware
- Exadata X5-2 Eighth Rack
- Exadata X5-2 Full Rack
- Exadata X5-2 Hardware
- Exadata X6-2 Hardware
- Exadata X5-2 Quarter Rack
- Exadata X4-8 Hardware
- Exadata X5-2 Half Rack
- Zero Data Loss Recovery Appliance X5 Hardware
- Oracle SuperCluster T5-8 Hardware
- Oracle SuperCluster M6-32 Hardware
|
Related Categories |
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP
|
Oracle Confidential PARTNER - Available to partners (SUN).
Reason: CAP for field engineers and partners to replace NVMe cable
Applies to:
Oracle SuperCluster M7 Hardware - Version All Versions and later
Zero Data Loss Recovery Appliance X6 Hardware - Version All Versions and later
Exadata SL6 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
Information in this document applies to any platform.
Goal
Procedure for how to replace an NVMe cable in an Exadata Storage Cell without loss of data or Exadata service
Solution
DISPATCH INSTRUCTIONS:
The following information will be required prior to dispatch of a replacement:
Name/location of storage cell
Location of NVMe cable bundle
WHAT SKILLS DOES THE ENGINEER NEED:
The engineer must be Exadata trained, and have familiarity with the storage cells.
TIME ESTIMATE: 90 minutes
TASK COMPLEXITY: 3
FIELD ENGINEER INSTRUCTIONS:
PROBLEM OVERVIEW:
There is a failed NVMe cable in an Exadata Storage Server (Cell) that needs replacing.
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:
The Storage Cell containing the failed NVMe cable must be powered off prior to cable replacement.
It is expected that the customer's DBA has completed these steps before the engineer arrives to replace the cable. The following commands are provided as guidance in case the customer needs assistance checking the status of the system prior to replacement. If the customer or the FSE requires more assistance prior to the physical replacement of the device, EEST/TSC should be contacted.
1. Locate the server in the rack being serviced. The position of the cell server within the rack can usually be determined from the hostname and the known default Exadata server numbering scheme. Exadata Storage Servers are identified by a number 1 through 18, where 1 is the lowest storage cell in the rack, installed in RU2, counting up to the top of the rack.
Turn on the locate indicator light for easier identification of the server being repaired. If the server number has been identified then the Locate Button on the front panel may be pressed. To turn on remotely, use either of the following methods:
From a login to the CellCLI on Exadata Storage Servers:
CellCLI> alter cell led on
From a login to the server’s ILOM:
-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'
From a login to the server's 'root' account:
# ipmitool sunoem cli 'set /SYS/LOCATE value=Fast_Blink'
Connected. Use ^D to exit.
-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'
-> Session closed
Disconnected
2. Shut down the node for which the NVMe cable requires replacement.
a) For extended information on this section, see MOS Note 1188080.1, "Steps to shut down or reboot an Exadata storage cell without affecting ASM".
This is also documented in the Exadata Owner's Guide, chapter 7, section "Maintaining Exadata Storage Servers", subsection "Shutting Down Exadata Storage Server", available on the customer's cell server image in the /opt/oracle/cell/doc directory.
Exadata Owner's Guide Documentation is available internally here: http://amomv0115.us.oracle.com/archive/cd_ns/E13877_01/doc/doc.112/e13874/maintenance.htm#autoId33
In the following examples, the SQL commands should be run by the customer's DBA prior to the hardware replacement. They should be run by the field engineer only if the customer directs them to, or is unable to run them. The cellcli commands must be run as root.
Note the following when powering off Exadata Storage Servers:
- Verify there are no other storage servers with disk faults. Shutting down a storage server while a disk on another server is failed may cause the running database processes and Oracle ASM to crash if ASM loses both disks in a partner pair when this server's disks go offline.
- Powering off one Exadata Storage Server with no disk faults in the rest of the rack will not affect running database processes or Oracle ASM.
b) ASM drops disks that remain offline for longer than the ASM disk repair timer. Powering off or restarting Exadata Storage Servers can therefore impact database performance if the storage server stays offline for longer than this timer. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but may have been changed by the customer. To check this parameter, have the customer log into ASM and perform the following query:
SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;
As long as the value comfortably covers the time needed to replace the components, there is no need to change it.
c) Check if ASM will be OK if the grid disks go OFFLINE.
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...sample ...
DBFS_DG_FD_06_exdx5_tvp_a_cel3 ONLINE Yes
DBFS_DG_FD_07_exdx5_tvp_a_cel3 ONLINE Yes
RECOC1_FD_00_exdx5_tvp_a_cel3 ONLINE Yes
RECOC1_FD_01_exdx5_tvp_a_cel3 ONLINE Yes
RECOC1_FD_02_exdx5_tvp_a_cel3 ONLINE Yes
RECOC1_FD_03_exdx5_tvp_a_cel3 ONLINE Yes
...repeated for all griddisks....
If one or more disks return asmdeactivationoutcome='No', then wait for some time and repeat this command. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.
d) Run the following CellCLI command to inactivate all grid disks on the cell that needs to be powered down for maintenance (this could take 10 minutes or longer):
# cellcli
CellCLI> ALTER GRIDDISK ALL INACTIVE
...sample ...
GridDisk DBFS_DG_FD_06_exdx5_tvp_a_cel3 successfully altered
GridDisk DBFS_DG_FD_07_exdx5_tvp_a_cel3 successfully altered
GridDisk RECOC1_FD_00_exdx5_tvp_a_cel3 successfully altered
GridDisk RECOC1_FD_01_exdx5_tvp_a_cel3 successfully altered
GridDisk RECOC1_FD_02_exdx5_tvp_a_cel3 successfully altered
GridDisk RECOC1_FD_03_exdx5_tvp_a_cel3 successfully altered
...repeated for all griddisks...
e) Execute the command below and the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome=Yes for all griddisks once the disks are offline and inactive in ASM.
CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
...sample...
DBFS_DG_FD_06_exdx5_tvp_a_cel3 inactive OFFLINE Yes
DBFS_DG_FD_07_exdx5_tvp_a_cel3 inactive OFFLINE Yes
RECOC1_FD_00_exdx5_tvp_a_cel3 inactive OFFLINE Yes
RECOC1_FD_01_exdx5_tvp_a_cel3 inactive OFFLINE Yes
RECOC1_FD_02_exdx5_tvp_a_cel3 inactive OFFLINE Yes
RECOC1_FD_03_exdx5_tvp_a_cel3 inactive OFFLINE Yes
...repeated for all griddisks...
f) Once all disks are offline and inactive, the customer may shut down the Cell using the following command:
# shutdown -hP now
When powering off Exadata Storage Servers, all storage services are automatically stopped.
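The pre-shutdown checks in steps b), c) and e) can be sketched as a few small parsing helpers. This is only an illustration, not an Oracle tool: the function names and the `safety_factor` margin are hypothetical, the input is assumed to be the plain-text output of the commands shown above (without the "...sample..." placeholder lines), and the repair-time parser assumes the usual `m`/`h` suffixes with a bare number meaning hours.

```python
from typing import List


def repair_time_minutes(value: str) -> float:
    """Convert a disk_repair_time string such as '3.6h' or '216m' to minutes."""
    text = value.strip().lower()
    if text.endswith("m"):
        return float(text[:-1])
    if text.endswith("h"):
        return float(text[:-1]) * 60
    # Assumption: a bare number is interpreted as hours.
    return float(text) * 60


def has_margin(value: str, outage_minutes: float, safety_factor: float = 2.0) -> bool:
    """True if the repair timer covers the planned outage with a margin.

    safety_factor is a hypothetical allowance, not an Oracle recommendation.
    """
    return repair_time_minutes(value) >= outage_minutes * safety_factor


def unsafe_to_deactivate(listing: str) -> List[str]:
    """Step c): grid disks whose asmdeactivationoutcome is not 'Yes'.

    Expects columns: name, asmmodestatus, asmdeactivationoutcome.
    The outcome column is read to end of line because a non-'Yes'
    value may contain spaces.
    """
    names = []
    for line in listing.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[2].strip() != "Yes":
            names.append(fields[0])
    return names


def not_yet_offline(listing: str) -> List[str]:
    """Step e): grid disks that are not inactive and OFFLINE/UNUSED.

    Expects columns: name, status, asmmodestatus, asmdeactivationoutcome.
    """
    names = []
    for line in listing.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        name, status, mode, outcome = fields
        if (status != "inactive"
                or mode not in ("UNUSED", "OFFLINE")
                or outcome.strip() != "Yes"):
            names.append(name)
    return names
```

For example, with the 90 minute time estimate given for this task, `has_margin("3.6h", 90)` compares 216 minutes against 180 minutes. An empty list from `unsafe_to_deactivate` or `not_yet_offline` corresponds to the "all disks return Yes" and "all disks offline and inactive" conditions described in the steps.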
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
Confirm which NVMe cable bundle requires replacement. There are two cable bundles in an Extreme Flash storage cell. One bundle runs down the left-hand side of the chassis when looking at the server from the front, and the other bundle runs down the middle of the chassis.
The storage drive backplane has ports labelled "A" to "F". On the backplane end of the cable bundle, each cable is labelled "A" to "F", so each cable will connect to the respective ports on the backplane. The cable bundle that connects to the left hand side of the backplane when looking at the server from the front runs down the left of the chassis, while the cable bundle that connects to the right hand side runs down the middle of the chassis.
Each cable bundle connects to a pair of NVMe switch cards. The pairs of NVMe switch cards are [PCIe slot 1 and 2], or [PCIe slot 5 and 6]. On the NVMe switch card end of the cable bundle there are 2 groups of 3 cables labelled "slot 1/5" or "slot 2/6", and each individual cable is labelled 0, 1, or 2. The cables connect to the respective ports on the NVMe switch card. (Note: the NVMe switch card has ports 0, 1, 2, and 3; port 3 is left empty on an Extreme Flash Storage Cell.)
The Exadata Storage Server based on Sun Server X5-2L has six PCIe slots. They are numbered 1 through 6 with 1 nearest the Power Supplies, and 6 nearest the outside wall of the chassis (the onboard ports/connectors are located between slots 3 and 4). Slot locations for the NVMe switch cards in Exadata Storage Servers are PCIe Slot 1, 2, 5 and 6.
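The labelling scheme at the switch card end of each bundle can be written out as a short checklist generator. This is a sketch under assumptions: the names `BUNDLE_SLOTS` and `switch_card_checklist` are hypothetical, and the association of the left bundle with slots 5 and 6 and the middle bundle with slots 1 and 2 is taken from the routing notes in this document. It does not say which backplane port ("A" to "F") pairs with which switch card port, because this document does not state that mapping.

```python
from typing import List, Tuple

# Bundle position (viewed from the front of the server) mapped to the
# pair of PCIe slots that hold its NVMe switch cards.
BUNDLE_SLOTS = {"left": (5, 6), "middle": (1, 2)}

# Switch card ports used by the cables; port 3 stays empty.
CABLE_PORTS = (0, 1, 2)


def switch_card_checklist(bundle: str) -> List[Tuple[str, int, int]]:
    """Return (cable group label, PCIe slot, switch card port) for a bundle."""
    first_slot, second_slot = BUNDLE_SLOTS[bundle]
    groups = (("slot 1/5", first_slot), ("slot 2/6", second_slot))
    return [(label, slot, port) for label, slot in groups for port in CABLE_PORTS]


if __name__ == "__main__":
    for label, slot, port in switch_card_checklist("left"):
        print(f"cable group '{label}', cable {port} -> PCIe slot {slot}, port {port}")
```

Each bundle yields six connections, which matches the two groups of three cables described above.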
Physical cable replacement
Reference links for Service Manual:
X5-2L : ( http://amomv0115.us.oracle.com/archive/cd_ns/E55029_01/html/E55031/goipe.html#scrolltoc )
Cable replacement
1. Slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arms.
2. Remove both power cables
3. Remove the server top cover
4. Remove the air baffle by lifting the baffle up and out of the storage server
5. Open the server fan assembly door and remove fan modules
6. Remove the storage server's front fan assembly door
7. Disconnect the NVMe cables from the NVMe switch card by pressing each latch and then pulling the cable out to disengage it from its connector
8. Disconnect the NVMe cables from the disk backplane by pressing each latch and then pulling the cable out to disengage it from its connector
Install the new NVMe cables
Note - NVMe cable connectors do not fit through the left-side chassis mid-wall. If you are installing NVMe cables between the storage drive backplane and PCIe slots 5 and 6, you first must remove the chassis mid-wall.
For NVMe switch cards located in PCIe slots 1 and 2, route the NVMe cable bundle through the center chassis mid-wall and via the cable trough between the fan modules and processors.
For NVMe switch cards located in PCIe slots 5 and 6, route the NVMe cable bundle through the left-side chassis mid-wall and along the left side of the chassis.
1. Reconnect the NVMe cables to the storage drive backplane by plugging each cable into its respective connector until you hear an audible click
2. Reconnect the NVMe cables to the NVMe switch card by plugging each cable into its connector until you hear an audible click
3. Install the storage server's front fan assembly door
4. Install the fan modules
5. Install the air baffle
6. Install the storage server top cover
7. Connect the power supply cables
8. Slide server back into rack
Post-Replacement additional steps
1. Once the power cords have been re-attached and the server has been slid back into the rack, power on the server by pressing the power button at the front and enter the BIOS Setup Utility. To enter the BIOS Setup Utility, press the F2 key (Ctrl+E from a serial connection) when prompted, while the BIOS is running the power-on self-tests (POST).
Important - When NVMe cables are removed or replaced between the storage drive backplane and NVMe switch cards, you must perform the procedure in this section to confirm that all NVMe cable connections are correct. If the NVMe cable connections are not correct, the storage server operating system should not be allowed to boot, as it could cause a problem with disk drive mapping.
Log into the ILOM CLI, enter restricted mode, and run the NVMe cable connection test:
-> set SESSION mode=restricted
WARNING: The "Restricted Shell" account is provided solely
to allow Services to perform diagnostic tasks.
[(restricted_shell) exdx5-tvp-a-cel3-sp:~]#
[(restricted_shell) exdx5-tvp-a-cel3-sp:~]#
[(restricted_shell) exdx5-tvp-a-cel3-sp:~]# hwdiag io nvme_test
HWdiag (Restricted Mode) - Build Number 94599 (Nov 17 2014, 18:59:38)
Current Date/Time: Apr 11 2015, 01:21:34
Checking NVME drive fru contents...
checking fru on drive NVMe 0 OK
checking fru on drive NVMe 1 OK
checking fru on drive NVMe 3 OK
checking fru on drive NVMe 4 OK
checking fru on drive NVMe 6 OK
checking fru on drive NVMe 7 OK
checking fru on drive NVMe 9 OK
checking fru on drive NVMe 10 OK
NVME drives fru check: PASSED
Checking NVME drive pcie links...
checking pcie link on drive NVMe 0 OK
checking pcie link on drive NVMe 1 OK
checking pcie link on drive NVMe 3 OK
checking pcie link on drive NVMe 4 OK
checking pcie link on drive NVMe 6 OK
checking pcie link on drive NVMe 7 OK
checking pcie link on drive NVMe 9 OK
checking pcie link on drive NVMe 10 OK
NVME drives pcie link check: PASSED
Checking NVME drive DSN...
checking DSN on drive NVMe 0 OK
checking DSN on drive NVMe 1 OK
checking DSN on drive NVMe 3 OK
checking DSN on drive NVMe 4 OK
checking DSN on drive NVMe 6 OK
checking DSN on drive NVMe 7 OK
checking DSN on drive NVMe 9 OK
checking DSN on drive NVMe 10 OK
NVME drives DSN check: PASSED
Checking NVME cabling...
Cables associated with Switch Card 3 in PCIe Slot 6 verified
Cables associated with Switch Card 2 in PCIe Slot 5 verified
Cables associated with Switch Card 1 in PCIe Slot 2 verified
Cables associated with Switch Card 0 in PCIe Slot 1 verified
NVME cable check: PASSED
NVME test PASSED
[(restricted_shell) exdx5-tvp-a-cel3-sp:~]#
If everything PASSED as shown above, then continue to step 2. If there are any fail statuses, then the cable issue must be resolved before going past this step. Remove the AC power cords again, and correct the cable issue. After each cable change, the above step should be repeated until everything passes.
2. Exit the restricted shell and enter the serial console again to the BIOS setup menu. In the BIOS setup menu, go to the "Exit" tab and select "Discard Changes and Exit".
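When the console session is being logged, the pass/fail review of the `hwdiag io nvme_test` report in step 1 can be sketched as a small text check. This is an illustration only: the function name is hypothetical, it works on saved report text rather than talking to ILOM, and because this document shows only a passing report, the code simply treats any line that deviates from the passing pattern as a failure.

```python
from typing import List


def nvme_test_failures(report: str) -> List[str]:
    """Return report lines that do not match the passing pattern.

    Assumption: only the passing format is known, so any per-drive line
    not ending in 'OK', any summary line not ending in 'PASSED', and any
    cable line not ending in 'verified' is reported.
    """
    failures = []
    overall_passed = False
    for raw in report.splitlines():
        line = raw.strip()
        if line == "NVME test PASSED":
            overall_passed = True
        elif line.startswith("checking ") and not line.endswith(" OK"):
            failures.append(line)
        elif "check:" in line and not line.endswith("PASSED"):
            failures.append(line)
        elif line.startswith("Cables associated") and not line.endswith("verified"):
            failures.append(line)
    if not overall_passed:
        failures.append("overall 'NVME test PASSED' line not found")
    return failures
```

An empty list corresponds to the "everything PASSED" condition in step 1; anything else means the cabling should be rechecked before the operating system is allowed to boot.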
Server Services Startup Validation
As the system boots the hardware/firmware profile will be checked, and either a green "Passed" will be displayed, or a red "Warning" that the check does not match if the firmware on the HBA is different from what the image expects. If the check passes, then the firmware is correct, and the boot will continue up to the OS login prompt. If the check fails, then the firmware will automatically be updated, and a subsequent reboot will occur. Monitor to ensure this occurs properly.
OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:
It is expected that the engineer stay on-site until the customer has given the approval to depart. The following commands are provided as guidance in case the customer needs assistance checking the status of the system following replacement. If the customer or the FSE requires more assistance following the physical replacement of the device, EEST/TSC should be contacted.
Once the storage cell is booted up check that all the NVMe devices are present:
The following command should show 8 NVMe devices on an Extreme Flash storage cell:
[root@exdx5-tvp-a-cel3 ~]# nvmecli --identify --all | grep /dev
/***************** NVMe Device /dev/nvme0n1 ******************/
/***************** NVMe Device /dev/nvme1n1 ******************/
/***************** NVMe Device /dev/nvme2n1 ******************/
/***************** NVMe Device /dev/nvme3n1 ******************/
/***************** NVMe Device /dev/nvme4n1 ******************/
/***************** NVMe Device /dev/nvme5n1 ******************/
/***************** NVMe Device /dev/nvme6n1 ******************/
/***************** NVMe Device /dev/nvme7n1 ******************/
Also check that the OS can see the NVMe devices:
[root@exdx5-tvp-a-cel3 ~]# lspci | grep 0953
05:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
07:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
25:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
27:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
86:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
88:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
96:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
98:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
If this is not correct, then there is a problem with the disk volumes that may need additional assistance to correct. The server should be re-opened and the device connections and boards checked to be sure they are secure and well seated BEFORE the following CellCLI commands are issued.
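The two device checks above amount to counting eight entries in each output. A minimal sketch of that count, assuming the output of both commands has been captured as text (the function names are hypothetical, and the `0953` device string is the one shown in the sample `lspci` output above):

```python
import re
from typing import List

EXPECTED_DEVICES = 8  # NVMe devices on an Extreme Flash storage cell


def nvme_device_nodes(nvmecli_output: str) -> List[str]:
    """Unique /dev/nvmeXnY device nodes named in the nvmecli output."""
    return sorted(set(re.findall(r"/dev/nvme\d+n\d+", nvmecli_output)))


def nvme_controllers(lspci_output: str) -> List[str]:
    """PCI addresses of lines that mention 'Device 0953'."""
    return [line.split()[0]
            for line in lspci_output.splitlines()
            if "Device 0953" in line]


def all_devices_present(nvmecli_output: str, lspci_output: str,
                        expected: int = EXPECTED_DEVICES) -> bool:
    """True only if both views show the expected number of devices."""
    return (len(nvme_device_nodes(nvmecli_output)) == expected
            and len(nvme_controllers(lspci_output)) == expected)
```

If `all_devices_present` returns False, comparing the two lists shows whether a device is missing from the PCI bus or only from the NVMe device nodes.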
After the NVMe cable has been replaced, the Exadata Storage Server should boot up automatically. Once the Exadata Storage Server comes back online, the cell services will start up automatically; however, you will need to reactivate the griddisks as follows:
Activate the griddisks:
# cellcli
CellCLI> alter griddisk all active
...sample...
GridDisk DBFS_DG_FD_06_exdx5_tvp_a_cel3 successfully altered
GridDisk DBFS_DG_FD_07_exdx5_tvp_a_cel3 successfully altered
GridDisk RECOC1_FD_00_exdx5_tvp_a_cel3 successfully altered
GridDisk RECOC1_FD_01_exdx5_tvp_a_cel3 successfully altered
GridDisk RECOC1_FD_02_exdx5_tvp_a_cel3 successfully altered
GridDisk RECOC1_FD_03_exdx5_tvp_a_cel3 successfully altered
...repeated for all griddisks...
Verify all disks show 'active':
CellCLI> list griddisk
...sample...
DBFS_DG_FD_06_exdx5_tvp_a_cel3 active
DBFS_DG_FD_07_exdx5_tvp_a_cel3 active
RECOC1_FD_00_exdx5_tvp_a_cel3 active
RECOC1_FD_01_exdx5_tvp_a_cel3 active
RECOC1_FD_02_exdx5_tvp_a_cel3 active
RECOC1_FD_03_exdx5_tvp_a_cel3 active
...repeated for all griddisks...
Verify all grid disks have been successfully put online using the following command. Wait until 'asmmodestatus' is in status 'ONLINE' for all grid disks. The following is an example of the output early in the activation process.
CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
...sample...
DBFS_DG_FD_06_exdx5_tvp_a_cel3 active SYNCING Yes
DBFS_DG_FD_07_exdx5_tvp_a_cel3 active SYNCING Yes
RECOC1_FD_00_exdx5_tvp_a_cel3 active SYNCING Yes
RECOC1_FD_01_exdx5_tvp_a_cel3 active SYNCING Yes
RECOC1_FD_02_exdx5_tvp_a_cel3 active SYNCING Yes
RECOC1_FD_03_exdx5_tvp_a_cel3 active SYNCING Yes
...repeated for all griddisks...
Notice in the above example that the grid disks are still in the 'SYNCING' process. Oracle ASM synchronization is only complete when ALL grid disks show asmmodestatus=ONLINE. This process can take some time depending on how busy the machine is, and has been while this individual server was down for repair.
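Waiting for every grid disk to reach ONLINE can be sketched as a simple polling loop. This is a hedged illustration, not an Oracle utility: `wait_until_online`, `asm_mode_counts`, and the polling intervals are hypothetical, and `fetch_listing` is a caller-supplied function assumed to return the text output of `list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome` (without the "...sample..." placeholder lines).

```python
import time
from collections import Counter
from typing import Callable


def asm_mode_counts(listing: str) -> Counter:
    """Count grid disks per asmmodestatus value (third column)."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            counts[fields[2]] += 1
    return counts


def wait_until_online(fetch_listing: Callable[[], str],
                      poll_seconds: float = 60.0,
                      max_polls: int = 60,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until every grid disk reports ONLINE, or give up.

    Returns True once all listed grid disks are ONLINE, and False if
    max_polls is reached first. The defaults are arbitrary examples.
    """
    for _ in range(max_polls):
        counts = asm_mode_counts(fetch_listing())
        if counts and set(counts) == {"ONLINE"}:
            return True
        sleep(poll_seconds)
    return False
```

In practice `fetch_listing` might wrap a call to `cellcli -e` run as root on the cell. Printing `asm_mode_counts` on each poll gives a quick view of how many grid disks are still SYNCING.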
PARTS NOTE:
Refer to the Exadata Database Machine Owner's Guide Appendix D for part information.
Oracle Exadata X5-2 - Full Components List (https://mosemp.us.oracle.com/handbook_internal/Systems/Exadata_X5_2/components.html)
Oracle Exadata X5-2 Storage Cell (X5-2L) - Full Components List (https://mosemp.us.oracle.com/handbook_internal/Systems/Exadata_X5_2_Storagecell/components.html)
REFERENCE INFORMATION:
Exadata Database Machine Documentation:
Exadata Database Machine Owner's Guide is available on the Storage Server OS image in /opt/oracle/cell/doc/welcome.html
http://amomv0115.us.oracle.com/archive/cd_ns/E13877_01/welcome.html
Oracle Exadata Storage Server X5-2 Extreme Flash Documentation Library (includes Sun Server X5-2L Service Manual) http://amomv0115.us.oracle.com/archive/cd_ns/E55029_01/index.html
Internal Only References:
- INTERNAL Exadata Database Machine Hardware Troubleshooting (Doc ID 1360360.1)
References
NOTE:1188080.1 - Steps to shut down or reboot an Exadata storage cell without affecting ASM
NVMe TOI pdf