Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1388322.1 : How to Replace a Faulty RAID HBA on an Exadata server (V2 - X4-2/X4-8)
This CAP explains how to replace a faulty RAID HBA in an Exadata node (V2 to X4-8).

Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Exadata Premier Support indicates this is a FRU only.

Applies to:
Exadata X4-2 Quarter Rack - Version All Versions and later
Exadata Database Machine X2-2 Full Rack - Version All Versions and later
Exadata X4-2 Half Rack - Version All Versions and later
Exadata X4-2 Full Rack - Version All Versions and later
Exadata Database Machine V2 - Version All Versions and later
Information in this document applies to any platform.

Goal
How to replace a faulty RAID HBA on Exadata (V2 through X4-2/X4-8) successfully.

Solution
Step A. Pre-Steps to shut down the node for servicing:

For DB Nodes:

1. To shut down the compute node, check MOS Note:
2. Revert all the RAID disk volumes to WriteThrough mode so that all data in the RAID cache memory is flushed to disk and is not lost when the HBA is replaced. Set the cache policy of all logical volumes to WriteThrough:

    # /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0

Verify that the current cache policy for all logical volumes is now WriteThrough. The grep selects the cache policy lines, which end in "No Write Cache if Bad BBU":

    # /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
    # shutdown -hP now

Solaris:

    # shutdown -y -i 5 -g 0

4. The field engineer can now slide out the server for maintenance. Do not remove any cables before sliding the server forward, or the loose cable ends will jam in the cable management arms (CMA). Ensure all customer-added data network cables are properly dressed into the CMA. Take care to ensure the cables and the CMA are moving properly. Refer to Note 1444683.1 for CMA handling training.

For Storage Cells:
2. Revert all the RAID disk volumes to WriteThrough mode so that all data in the RAID cache memory is flushed to disk and is not lost when the HBA is replaced. Set the cache policy of all logical volumes to WriteThrough:

    # /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0

Verify that the current cache policy for all logical volumes is now WriteThrough:

    # /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
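The WriteThrough verification above relies on reading the grep output by eye. As a rough sketch, the same check can be scripted so that it fails loudly when any logical volume is still in the wrong mode. The helper name check_cache_policy is hypothetical (it is not part of MegaCli), and the sketch assumes each logical volume prints a line containing "Current Cache Policy" followed by the write mode.

```shell
#!/bin/sh
# Hypothetical helper: reads "MegaCli64 -ldpdinfo" output on stdin and
# checks that every "Current Cache Policy" line contains the expected mode.
# Assumption: one "Current Cache Policy: <mode>, ..." line per logical volume.
check_cache_policy() {
    expected="$1"   # WriteThrough or WriteBack
    awk -v want="$expected" '
        /Current Cache Policy/ {
            total++
            if (index($0, want) == 0) { bad++; print "MISMATCH: " $0 }
        }
        END {
            # No matching lines at all is treated as an error, not a pass.
            if (total == 0) { print "no cache policy lines found"; exit 2 }
            if (bad > 0) exit 1
            print total " logical drive(s) report " want
        }
    '
}

# Example use on the node (not executed here):
#   /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | check_cache_policy WriteThrough
```

The function returns a non-zero status on any mismatch, so it can gate the shutdown step in a wrapper script. Passing WriteBack instead covers the post-replacement check in Step D.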
3. Once all disks are offline and inactive, the customer may shut down the cell using the following command:

    # shutdown -hP now

4. The field engineer can now slide out the server for maintenance. Do not remove any cables before sliding the server forward, or the loose cable ends will jam in the cable management arms (CMA). Take care to ensure the cables and the CMA are moving properly. Refer to Note 1444683.1 for CMA handling training.

Step B. Physical RAID Card replacement

Exadata X2-2 Database Machine Compute nodes and all Storage Cells:

These steps are relevant to Exadata nodes based on x4170, x4170M2, x4275, and x4270M2.

Remove the old HBA PCI Card

1. On Storage cells, remove the IB cables from the IB card in slot 3 above the HBA, making a note of which port each cable goes into so they can go back into the same port.

Remove the HBA's battery from the old HBA

1. Use a No. 1 Phillips screwdriver to remove the 3 retaining screws that secure the battery to the HBA from the underside of the card. Do not attempt to remove any screws from the top side of the HBA.

Reinstall the HBA's battery onto the new HBA

Reverse the removal instructions.

Install the new HBA PCI Card

Reverse the removal instructions, taking care to reconnect the cables to the same ports they were removed from. If the cables are reversed, disk slot mappings may be affected.

Exadata X2-8 Database Machine Compute nodes:

These steps are relevant to Exadata nodes based on x4800 and x4800M2.

Removing the HBA REM and the battery

1. Remove CMOD0 from the server and set it on a flat, antistatic surface with ample space and light.

Install the new HBA REM

1. Attach the battery pack to the REM by aligning the circuit board connectors and gently pressing them together.

Step C. Post-Replacement RAID Card additional steps:

Power on for both DB nodes and Storage Cells:

1. Once the power cords have been re-attached, slide the server back into the rack.

Accepting the Foreign Configuration for both DB nodes and Storage Cells:

1. During boot, monitor the graphics console through either the ILOM javaconsole or the local KVM. When loading its BIOS ROM, the new RAID controller will detect the RAID configuration on the disks and report that it has a Foreign configuration. This is expected. At the prompt, press "F" to accept the foreign configuration, or "C" to enter the controller BIOS utility. If you press any other key to continue, the controller will not import the RAID configuration and will fail to find a bootable disk. If this occurs, it is safe to press "ctrl-alt-del" to reset and get the "F" or "C" prompt again.

NOTE: FOR SOLARIS ONLY CONFIG

If the DB node is running Solaris ONLY, the boot drive is not defined when the foreign config is imported. For a Solaris ONLY config, follow the steps below. A Solaris only config on non X2-8 compute nodes will have 4 Virtual Drives: VD0, VD1, VD2 and VD3. VD1 and VD2 will contain the grub loader.

i) When the Foreign Config detected message is displayed, press "C" to load the config utility.
ii) The system will then display a warning that all disks from the previous config are gone. Again press "C" to load the config utility.
iii) Finally, a warning is displayed that entering the utility will result in drive config changes. Press Y to continue into the config utility.
iv) Press the "Start" button, then select the "Preview" button and verify that the configuration is correct. Save the configuration. Refer to step 4 below for correct configs.
v) Now define the boot device. Tab to the Virtual Drives view, then from this view tab to the box listing the drives and select VD1 by using the up/down keys until it is highlighted. Hit enter to select VD1, then tab to "Set Boot Drives" and hit enter. Finally, tab to "Go" and hit enter. The value "Set Boot Drives" should now show (current = 1). Refer to solaris-screen-4.jpg for a screenshot.
vi) Return to the home window and proceed to "Step 5" below.
2. When the utility loads, there should only be 1 adapter. Select the "Start" button.
Step D. Server Services Startup Validation:

DB Node Startup:

1. As the system boots, the hardware/firmware profile will be checked, and either a green "Passed" will be displayed, or a red "Warning" that the check does not match if the firmware on the HBA is different from what the image expects.

    # /opt/oracle.SupportTools/CheckHWnFWProfile -U /opt/oracle.cellos/iso/cellbits

e) After the firmware updates, the server will reboot again. The disk volumes should remain intact and boot up to the OS again.

    # /opt/MegaRAID/MegaCli/MegaCli64 -LdInfo -Lall -a0
    # /opt/MegaRAID/MegaCli/MegaCli64 -PdList -a0 | grep "Slot\|Firmware\|Inq"
    # /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0
    BBU status for Adapter: 0
    BatteryType: iBBU
    ...Output truncated...

3. Set the cache policy of all logical drives to WriteBack cache mode:

    # /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wb -lall -a0

Verify that the current cache policy for all logical drives is now WriteBack:

    # /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
4. CRS services should start automatically during the OS boot. After the OS is up, the Customer DBA should validate that CRS is running.

Cell Node Startup:

1. As the system boots, the hardware/firmware profile will be checked, and either a green "Passed" will be displayed, or a red "Warning" that the check does not match if the firmware on the HBA is different from what the image expects. If the check passes, then the firmware is correct, and the boot will continue up to the OS login prompt. If the check fails, then the firmware will automatically be updated, and a subsequent reboot will occur. Monitor to ensure this occurs properly.

    # lsscsi | grep -i LSI
    [0:0:20:0] enclosu LSILOGIC SASX28 A.1 502E -
    [0:2:0:0] disk LSI MR9261-8i 2.90 /dev/sda
    ...Output truncated...
    [0:2:11:0] disk LSI MR9261-8i 2.90 /dev/sdl

    # /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep "Virtual Drive\|State\|Slot\|Firmware state"
    Virtual Drive: 0 (Target Id: 0)
    State : Optimal
    Slot Number: 0
    Firmware state: Online, Spun Up
    Foreign State: None
    ...Output truncated...
    Slot Number: 11
    Firmware state: Online, Spun Up
    Foreign State: None

    # /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0
    BBU status for Adapter: 0
    BatteryType: iBBU
    ...Output truncated...
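With 12 slots per cell, the LdPdInfo listing above is long enough that a single bad line is easy to miss. The following is a minimal sketch that summarizes it instead, assuming the output has the shape shown in the example ("State : Optimal" for each virtual drive and "Firmware state: Online, Spun Up" for each physical drive). The helper name check_raid_state is hypothetical. On a node that legitimately has a hot spare, the spare's firmware state would be flagged and needs to be judged by hand.

```shell
#!/bin/sh
# Hypothetical helper: reads "MegaCli64 -LdPdInfo" output on stdin and
# reports any virtual drive that is not Optimal or any physical drive
# that is not "Online, Spun Up". "Foreign State" lines are ignored
# because they do not start with "State".
check_raid_state() {
    awk '
        /^[ \t]*State[ \t]*:/ {
            vd++
            if ($0 !~ /Optimal/) { bad++; print "VD not optimal: " $0 }
        }
        /^[ \t]*Firmware state:/ {
            pd++
            if ($0 !~ /Online, Spun Up/) { bad++; print "PD not online: " $0 }
        }
        END {
            # Empty or unexpected input is an error, not a pass.
            if (vd == 0 || pd == 0) { print "no drive state lines found"; exit 2 }
            if (bad > 0) exit 1
            print vd " virtual drive(s) optimal, " pd " physical drive(s) online"
        }
    '
}

# Example use on the cell (not executed here):
#   /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | check_raid_state
```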
3. Set the cache policy of all logical drives to WriteBack cache mode:

    # /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wb -lall -a0

Verify that the current cache policy for all logical drives is now WriteBack:

    # /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
4. Because the IB cables were disconnected, also verify that both InfiniBand links are up at 40 Gb/sec:

    # /usr/sbin/ibstatus
    Infiniband device 'mlx4_0' port 1 status:
    ...truncated...
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 40 Gb/sec (4X QDR)
    Infiniband device 'mlx4_0' port 2 status:
    ...truncated...
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 40 Gb/sec (4X QDR)

5. Once the hardware is verified as up and running, the Customer's DBA will need to activate the grid disks:

    # cellcli
    ...
    CellCLI> alter griddisk all active
    GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
    ...repeated for all griddisks...

    CellCLI> list griddisk
    DATA_CD_00_dmorlx8cel01 active
    ...repeated for all griddisks...

6. Verify that all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks and none is still SYNCING. The following is an example of the output early in the activation process:

    CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
    DATA_CD_00_dmorlx8cel01 active ONLINE Yes
    RECO_CD_00_dmorlx8cel01 active SYNCING Yes
    ...repeated for all griddisks...

Notice in the above example that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' state. Oracle ASM synchronization is complete only when ALL grid disks show asmmodestatus=ONLINE. This can take some time, depending on how busy the machine is now and how busy it was while this server was down for repair.
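Waiting for every grid disk to reach ONLINE usually means re-running the list command and scanning the third column. As a rough sketch, the filter below prints only the grid disks that are not yet ONLINE and returns a non-zero status while any remain. The helper name griddisks_not_online is hypothetical, and the column positions assume the exact attribute order used in step 6 (name, status, asmmodestatus, asmdeactivationoutcome).

```shell
#!/bin/sh
# Hypothetical helper: reads the output of
#   list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
# on stdin and prints "name asmmodestatus" for every grid disk whose
# asmmodestatus (column 3) is not ONLINE.
# Exit status: 0 = all ONLINE, 1 = some still pending, 2 = no usable input.
griddisks_not_online() {
    awk '
        NF >= 3 {
            total++
            if ($3 != "ONLINE") { print $1, $3; pending++ }
        }
        END {
            if (total == 0) exit 2
            if (pending > 0) exit 1
        }
    '
}

# Example polling loop on the cell (not executed here; the 60 second
# interval is an arbitrary choice):
#   until cellcli -e "list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome" \
#           | griddisks_not_online; do
#       sleep 60
#   done
```

Treating empty input as an error rather than success keeps the loop from exiting early if cellcli itself fails to return any rows.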