Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction
Sure Solution 1990221.1: How to Replace an Exadata X5-2/X6-2 Storage Cell RAID HBA
Oracle Confidential PARTNER - Available to partners (SUN).

Applies to:
Zero Data Loss Recovery Appliance X6 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Exadata X5-2 Quarter Rack - Version All Versions and later
Information in this document applies to any platform.

Goal
How to Replace an Exadata X5-2/X6-2 Storage Cell RAID HBA

Solution

DISPATCH INSTRUCTIONS
- The server that contains the faulty HBA should have its services offline and the system powered off.

The instructions below assume the customer DBA is available and working with the field engineer onsite to manage the host OS and the ASM/database side of the procedure.
1. For extended information on this section, see MOS Note 1188080.1, "Steps to shut down or reboot an Exadata storage cell without affecting ASM".
This is also documented in the Exadata Owner's Guide, chapter 7, section "Maintaining Exadata Storage Servers", subsection "Shutting Down Exadata Storage Server", available on the customer's cell server image in the /opt/oracle/cell/doc directory.
Available to Oracle internally here: http://amomv0115.us.oracle.com/archive/cd_ns/E13877_01/doc/doc.112/e13874/maintenance.htm#DBMOG21129
In the following examples the SQL commands should be run by the customer's DBA prior to doing the hardware replacement. They should be done by the field engineer only if the customer directs them to, or is unable to do them. The cellcli commands will need to be run as root.
2. ASM will drop a disk from its disk group once it has been offline longer than the DISK_REPAIR_TIME attribute allows. The default value of 3.6 hours should be adequate for replacing components, but the customer may have changed it. To check this attribute, have the customer log into ASM and run the following query:
SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where
a.name = 'disk_repair_time' and a.group_number = dg.group_number;
As long as the value allows enough time to comfortably complete the replacement, there is no need to change it.
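If the customer decides a larger repair window is needed, the attribute can be raised per disk group by the customer's DBA. A minimal sketch, assuming an illustrative disk group name DATA and an illustrative 8.5-hour window (the actual name and value are the customer's choice):

SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';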
3. Check if ASM will be OK if the grid disks go OFFLINE.
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...sample output, repeated for all griddisks...
If one or more disks return asmdeactivationoutcome='No', then wait for some time and repeat this check. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.
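The re-check can also be scripted. This is only a sketch (the 60-second interval is arbitrary) that loops until every grid disk reports asmdeactivationoutcome='Yes':

# while cellcli -e list griddisk attributes name,asmdeactivationoutcome | grep -qv Yes; do sleep 60; done   # re-check every minute until no grid disk reports anything other than 'Yes'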
4. Run the following CellCLI command to inactivate all grid disks on the cell that needs to be powered down for maintenance (this could take 10 minutes or longer):
# cellcli
CellCLI> ALTER GRIDDISK ALL INACTIVE
...sample...
GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_10_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_11_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_00_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_01_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_02_zdlx5_tvp_a_cel3 successfully altered
...repeated for all griddisks...
5. Execute the command below. Once the disks are offline and inactive in ASM, the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.
CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
CATALOG_CD_09_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
CATALOG_CD_10_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
CATALOG_CD_11_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
DELTA_CD_00_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
DELTA_CD_01_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
DELTA_CD_02_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
...repeated for all griddisks...
6. Revert all the RAID disk volumes to WriteThrough cache mode to ensure all data in the RAID cache memory is flushed to disk and not lost when the HBA is replaced. As the 'root' user, set all logical volumes to WriteThrough cache mode:
# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0
Verify the current cache policy for all logical volumes is now WriteThrough:
# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
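Every logical volume should report WriteThrough in the grep output. As an alternative sketch (the -LDInfo form and grep pattern are assumptions, not part of the original note), the per-logical-drive policy can be listed directly, and output along these lines is expected for each drive:

# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0 | grep "Cache Policy"
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
...repeated for all logical drives...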
7. Once all disks are offline and inactive, the customer may shut down the cell using the following command:
# shutdown -hP now
8. The field engineer can now slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arms. Take care to ensure the cables and Cable Management Arm are moving properly. Refer to Note 1444683.1 for CMA handling training.
Remember to disconnect the power cords before opening the top cover of the server.
Step B. Physical RAID Card Replacement

1. Swivel the air baffle into the upright position to allow access to the super capacitor cable and the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card in PCI slot 6.
2. Rotate the PCIe card locking mechanism, and then lift up on the PCIe HBA card to disengage it from the motherboard connectors.
3. Disconnect the super capacitor cable and the SAS cables from the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card.
4. Lift and remove the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card from the chassis.
Step C. Server Re-installation

1. Once the power cords have been re-attached, slide the server back into the rack.
2. Once the ILOM has booted, you will see a slow blink on the green LED for the server. Power on the server by pressing the power button.
Step D. Server Services Startup Validation

1. As the system boots, the hardware/firmware profile will be checked, and either a green "Passed" will be displayed, or a red "Warning" if the firmware on the HBA does not match what the image expects. If the check passes, the firmware is correct and the boot will continue up to the OS login prompt. If the check fails, the firmware will automatically be updated and a subsequent reboot will occur. Monitor to ensure this occurs properly.

2. After the OS is up, log in as root and verify all the expected devices are present. The following command should show 12 disks:
# lsscsi | grep -i LSI
[0:2:0:0]    disk    LSI    MR9361-8i    4.23    /dev/sda
...repeated for all 12 disks...
If the device count is not correct, also check that the LSI controller has the correct Virtual Drives configured and in Optimal state, with the physical disks Online and spun up and no Foreign configuration present. There should be Virtual Drives 0 to 11, and each of physical slots 0 to 11 should be allocated to one Virtual Drive (not necessarily the same 0:0, 1:1, etc. mapping).
# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep "Virtual Drive\|State\|Slot\|Firmware state"
Virtual Drive: 0 (Target Id: 0)
...repeated for all Virtual Drives...
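As a quick sanity check on the counts, a sketch like the following can be used (the exact grep strings are assumptions based on typical MegaCli output, not part of the original note); both commands should return 12:

# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep -c "^Virtual Drive"
# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep -c "Firmware state: Online, Spun Up"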
Check the status of the Super Cap:
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0
BBU status for Adapter: 0
...sample output...
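If only a quick pass/fail on the super capacitor is wanted, the output can be narrowed as in the sketch below (the grep pattern and sample line are assumptions about the BBU output format); the state should report Optimal:

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0 | grep -i "Battery State"
Battery State: Optimal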
If this is not correct, there is a problem with the disk volumes that may need additional assistance to correct. The server should be re-opened and the device connections and boards checked to be sure they are secure and well seated BEFORE the following commands are issued.

3. Set all logical drives' cache policy to WriteBack cache mode:
# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wb -lall -a0
Verify the current cache policy for all logical drives is now using WriteBack cache mode:
# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
...repeated for all logical drives...
4. Also verify the InfiniBand links are up at 40 Gb/s, since the cables were disconnected:
# /usr/sbin/ibstatus
Infiniband device 'mlx4_0' port 1 status:
...sample output...
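To focus on just the link state and speed, a sketch such as the following can be used (the field names shown are assumptions based on typical ibstatus output); each port should report an ACTIVE state and a 40 Gb/sec rate:

# /usr/sbin/ibstatus | grep -E "state|rate"
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
...repeated for each port...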
5. Once the hardware is verified as up and running, the Customer's DBA will need to activate the grid disks:
# cellcli
CellCLI> alter griddisk all active
GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
...repeated for all griddisks...
Issue the command below and all disks should show 'active':
CellCLI> list griddisk
CATALOG_CD_09_zdlx5_tvp_a_cel3 active
...repeated for all griddisks...
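As a sketch (the grep approach is an assumption, not part of the original note), any grid disk that did not come back active can be flagged directly; the command should produce no output when all grid disks are active:

# cellcli -e list griddisk attributes name,status | grep -v active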
6. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks and no longer SYNCING. The following is an example of the output early in the activation process.
CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
CATALOG_CD_09_zdlx5_tvp_a_cel3 active SYNCING Yes
...repeated for all griddisks...
Notice in the above example that the grid disks are still in the 'SYNCING' process. Oracle ASM synchronization is only complete when ALL grid disks show asmmodestatus=ONLINE. This process can take some time, depending on how busy the machine is now and how busy it was while this individual server was down for repair.
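If the customer wants to wait on this programmatically, a minimal sketch (the 60-second interval is arbitrary) that loops until every grid disk reports asmmodestatus=ONLINE:

# while cellcli -e list griddisk attributes name,asmmodestatus | grep -qv ONLINE; do sleep 60; done   # re-check every minute until all grid disks are ONLINE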
OBTAIN CUSTOMER ACCEPTANCE - WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:
- Verify that the HW and SW components are returned to a properly functioning state, with the server up and all ASM disks online on the Storage Servers.
REFERENCE INFORMATION:
1093890.1 - Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration
1188080.1 - Steps to shut down or reboot an Exadata storage cell without affecting ASM

Attachments
This solution has no attachment