How to Replace a HBA Battery Backup Unit (BBU) on Exadata Database Machine (V2, X2-2, X2-8) [VCAP]

Asset ID:	1-71-1527626.1
Update Date:	2017-07-07
Keywords:

Solution Type Technical Instruction Sure

Solution 1527626.1 : How to Replace a HBA Battery Backup Unit (BBU) on Exadata Database Machine (V2, X2-2, X2-8) [VCAP]

Applies to:

SPARC SuperCluster T4-4 - Version All Versions and later
Exadata Database Machine X2-8 - Version All Versions and later
Exadata Database Machine V2 - Version All Versions and later
Exadata Database Machine X2-2 Full Rack - Version All Versions and later
Exadata Database Machine X2-2 Half Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

Procedure for replacing the HBA Battery Backup Unit (BBU) on an Exadata Database Machine node (V2, X2-2, X2-8).

Solution

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Exadata trained
TIME ESTIMATE: 60 minutes
TASK COMPLEXITY: 2

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:

PROBLEM OVERVIEW:

The Battery Backup Unit (BBU) on the RAID HBA needs replacement in Exadata Database Machine V2, X2-2 or X2-8.

Videos for the physical replacement procedures are attached to this Note 1527626.1 for V2/X2-2 DB nodes and Storage cells, and for X2-8 DB nodes. (Note: Save and play them offline if they do not load with a browser plugin).

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The instructions below assume the customer DBA is available and working with the field engineer (FE) onsite to manage the host OS and DB/ASM services. They are provided here so that the FE has all the steps needed when onsite. The FE can perform them if the customer DBA wants, allows, or needs help with these steps.

The server that contains the faulty RAID HBA BBU should have its services offline and system powered off.

1. Locate the server in the rack being serviced. Exadata Storage Servers are identified by a number 1 through 18, where 1 is the lowermost Storage Server in the rack, installed in RU2, counting up to the top of the rack. Exadata Database Nodes are identified by a number 1 through 8, where 1 is the lowermost DB node in the rack, installed in RU16.

Turn on the locate indicator light for easier identification of the server being repaired. If the server number has been identified, the Locate button on the front panel may be pressed. To turn it on remotely, use any of the following methods:

From a login to CellCLI on Exadata Storage Servers:

CellCLI> alter cell led on

From a login to the server’s ILOM:

-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'

From a login to the server’s ‘root’ account:

# ipmitool sunoem cli 'set /SYS/LOCATE value=Fast_Blink'
Connected. Use ^D to exit.
-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'

-> Session closed
Disconnected

2. Revert all the RAID disk volumes to WriteThrough mode to ensure all data in the RAID cache memory is flushed to disk and not lost when replacement of the HBA BBU occurs. Set all logical volumes cache policy to WriteThrough cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0

Verify the current cache policy for all logical volumes is now WriteThrough:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
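
The verification above can also be scripted. The following is a minimal sketch, assuming MegaCli prints one "Current Cache Policy: <mode>, ..." line per logical volume. The function name check_cache_policy is hypothetical and is not part of MegaCli.

```shell
#!/bin/sh
# Sketch only: check that every logical volume reports the expected
# current cache mode. Assumes one "Current Cache Policy: <mode>, ..."
# line per logical volume on stdin. check_cache_policy is a hypothetical name.
check_cache_policy() {
    expected=$1    # for example WriteThrough or WriteBack
    awk -v want="$expected" '
        /Current Cache Policy/ {
            total++
            policy = $0
            sub(/^[^:]*:[ \t]*/, "", policy)    # keep the text after the first colon
            if (index(policy, want) != 1) bad++ # mode must be the first item listed
        }
        END {
            if (total == 0) { print "no logical volumes found"; exit 2 }
            if (bad > 0)    { print bad " of " total " logical volumes not in " want; exit 1 }
            print "all " total " logical volumes in " want
        }'
}

# Usage on the node (not executed here):
#   /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU | check_cache_policy WriteThrough
```

A non-zero exit status means at least one volume is not yet in the requested mode, or no policy lines were found.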

3. Shut down the node for which the RAID HBA BBU requires replacement.

For Exadata DB Nodes:

a. For extended information on this section, check MOS Note:
ID 1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration.
https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1093890.1&h=Y

For a documentation reference, in the Exadata Database Machine Administration Guide, use the section of chapter 1 titled "Non-Emergency Power Procedures" section "Powering Off Oracle Exadata Rack" sub-section "Powering off Database Servers" available on the customer's cell server image in the /opt/oracle/cell/doc directory or online:

https://docs.oracle.com/cd/E80920_01/DBMMN/exadata-general-maintenance.htm#DBMMN115

In the following examples, the SQL commands should be run by the customer's DBA prior to doing the hardware replacement. The field engineer should run them only if the customer directs them to, or is unable to run them.

b. The customer should shut down CRS services prior to powering down the DB node. As root user do the following:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl stop crs
or
# <GI_HOME>/bin/crsctl stop crs

where GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but will depend on the customer's environment.

In the above output, the "1" of "+ASM1" refers to the DB node number.
For example, for DB node #3 the value would be +ASM3.

c. Validate CRS is down cleanly. There should be no processes running.

# ps -ef | grep css

d. Shut down the server operating system:

Linux:

# shutdown -hP now

Solaris:

# shutdown -y -i 5 -g 0

For Exadata Storage Servers:

a. For Extended information on this section check MOS Note:
ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM
https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1188080.1&h=Y

This is also documented in the Exadata Database Machine Administration Guide, use the section of chapter 1 titled "Non-Emergency Power Procedures" section "Powering Off Oracle Exadata Rack" sub-section "Powering Off Exadata Storage Servers" available on the customer's cell server image in the /opt/oracle/cell/doc directory or online:

https://docs.oracle.com/cd/E80920_01/DBMMN/exadata-general-maintenance.htm#DBMMN115

Note the following when powering off Exadata Storage Servers:

Verify there are no other storage servers with disk faults. Shutting down a storage server while another disk is failed may cause running database processes and Oracle ASM to crash if both disks in a partner pair are lost when this server’s disks go offline.
Powering off one Exadata Storage Server with no disk faults in the rest of the rack will not affect running database processes or Oracle ASM.
All database and Oracle Clusterware processes should be shut down prior to shutting down more than one Exadata Storage Server. Refer to the Exadata documentation for details if this is necessary.

b. ASM drops a disk if it remains offline for longer than the DISK_REPAIR_TIME attribute allows. Powering off or restarting Exadata Storage Servers can therefore impact database performance if the storage server stays offline for longer than the ASM disk repair time. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but may have been changed by the customer. To check this parameter, have the customer log in to ASM and run the following query:

SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

As long as the value comfortably exceeds the time needed to replace the components, there is no need to change it.
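
The comparison against the planned outage can be sketched as follows. This is an illustration only: it assumes the attribute value is a number followed by "h" or "m" (for example 3.6h), treats a bare number as hours, and uses the hypothetical helper names repair_time_minutes and repair_window_ok.

```shell
#!/bin/sh
# Sketch only: convert a DISK_REPAIR_TIME value to whole minutes.
# Assumes values such as "3.6h" or "216m"; a value without a unit is
# treated as hours. Both function names are hypothetical.
repair_time_minutes() {
    printf '%s\n' "$1" | awk '
        {
            value = $0 + 0                               # leading numeric part
            unit  = tolower(substr($0, length($0), 1))   # last character
            if (unit == "m") minutes = value
            else             minutes = value * 60
            printf "%d\n", minutes + 0.5                 # round to nearest minute
        }'
}

# Succeeds when the repair time exceeds the planned outage.
# $1 = DISK_REPAIR_TIME value, $2 = planned outage in minutes
repair_window_ok() {
    [ "$(repair_time_minutes "$1")" -gt "$2" ]
}

# Usage (not executed here), with the 60 minute time estimate of this procedure:
#   repair_window_ok 3.6h 60 && echo "repair time is sufficient"
```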

c. Check if ASM will be OK if the grid disks go OFFLINE.

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...sample ...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
RECO_CD_01_cel01 ONLINE Yes
...repeated for all griddisks....

If one or more disks return asmdeactivationoutcome='No', then wait for some time and repeat this command. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.
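
The check described above can be sketched as a small filter over the cellcli output. It assumes whitespace-separated rows of name, asmmodestatus and asmdeactivationoutcome, as in the sample. The function name safe_to_deactivate is hypothetical.

```shell
#!/bin/sh
# Sketch only: read "name asmmodestatus asmdeactivationoutcome" rows on
# stdin, print the name of every grid disk whose outcome is not "Yes",
# and fail if any is found. safe_to_deactivate is a hypothetical name.
safe_to_deactivate() {
    awk '
        NF >= 3 {
            rows++
            if ($3 != "Yes") { print $1; unsafe++ }
        }
        END {
            if (rows == 0) exit 2          # no grid disk rows seen
            exit (unsafe > 0 ? 1 : 0)
        }'
}

# Usage on the storage cell (not executed here):
#   cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome | safe_to_deactivate
```

Proceed only when the function exits with status 0 and prints nothing.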

d. Run the cellcli command to inactivate all grid disks on the cell that needs to be powered down for maintenance. This can take 10 minutes or longer.

# cellcli
...sample ...
CellCLI> ALTER GRIDDISK ALL INACTIVE
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...repeated for all griddisks...

e. Execute the command below and the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome=Yes for all griddisks once the disks are offline and inactive in ASM.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_01_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_02_dmorlx8cel01 inactive OFFLINE Yes
...repeated for all griddisks...

f. Revert all the RAID disk volumes to WriteThrough mode to ensure all data in the RAID cache memory is flushed to disk and not lost when replacement of the HBA BBU occurs. Set all logical volumes cache policy to WriteThrough cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0

Verify the current cache policy for all logical volumes is now WriteThrough:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU

g. Once all disks are offline and inactive, the customer may shut down the cell using the following command:

# shutdown -hP now

When powering off Exadata Storage Servers, all storage services are automatically stopped.

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

Exadata V2, X2-2 Database Machine Compute nodes and V2, X2-2, X2-8 Storage Cells:

These steps are relevant to Exadata nodes based on x4170, x4170m2, x4275, and x4275m2.

Slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arms. Take care to ensure the cables and Cable Management Arm are moving properly. Refer to Note 1444683.1 for CMA handling training.
https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1444683.1&h=Y
Remove the AC power cords prior to removing the server’s top cover.
Remove the HBA PCI card:
1. On Storage cells remove the IB cables from the IB card in slot 3 above the HBA making a note of which port each cable goes into so they can go back into the same port.
2. Remove back panel PCI cross bar.
  1. Loosen the two captive Phillips screws on each end of the crossbar.
  2. Lift the PCI crossbar up and back to remove it from the chassis.
3. Remove the PCI Riser containing the PCI card to be serviced.
  1. Loosen the captive screw holding the riser to the motherboard.
  2. Lift up the riser and the PCI card that is attached to it as a unit.
4. Disconnect the SAS cables from PCI card making a note of which port each cable goes into so they can go back into the same port.
5. Extract the RAID HBA card from the PCI riser assembly.
Remove the old BBU from the HBA:
1. Use a No. 1 Phillips screwdriver to remove the 3 retaining screws that secure the battery to the HBA from the underside of the card. Do NOT attempt to remove any screws from the top side of the HBA and battery pack – those screws hold the standoffs that provide the bottom screw holes and should remain with the battery pack.
2. Detach the battery pack including circuit board from the HBA by gently lifting it from its circuit board connector on the top side of the HBA.
Install the new BBU on the HBA:
1. Attach the battery pack circuit board connector to mate with the HBA’s connector on the top side of the HBA.
2. Use a No. 1 Phillips screwdriver to install the 3 retaining screws, to secure the battery to the HBA from the underside of the card. If the BBU comes with a package of new screws, then use those new screws - do not re-use the screws from the old BBU attachment.
Reinstall the HBA PCI card into the PCI riser and server, reversing the removal step 3 above. Take care to get the cables re-connected to the same ports they were removed from. If reversed, this may affect disk slot mappings.

Take care on storage cells to also put the IB cables back into the original ports, as well, in the correct orientation. IB cables are factory labeled with the port identification where port 2 is the port nearest the PCI connector, and port 1 is the port near the top side of the card. The cables should be inserted with the latch release tab on the down side, so they fully seat and latch. If inserted upside down, they may appear installed but will not fully seat or latch.
Replace the server’s top cover and re-attach the AC power cords. ILOM will take up to 2 minutes to boot.
Slide the server back into the rack.

Exadata X2-8 Database Machine Compute nodes:

These steps are relevant to Exadata nodes based on x4800 and x4800M2.

Remove CMOD0 from the server setting it on a flat, antistatic surface.
Remove the CMOD top cover.
Remove the HBA REM with BBU attached:
1. Lift the REM ejector handle and rotate it to its fully open position
2. Lift the connector end of the REM and pull the REM away from the retaining clip on the front support bracket.
Remove the old BBU from the REM:
1. Use a No. 1 Phillips screwdriver to remove the 3 retaining screws that secure the battery to the REM card. Do NOT attempt to remove any screws from the top side of the REM and battery pack – those screws hold the standoffs that provide the bottom screw holes and should remain with the battery pack.
2. Detach the battery pack including circuit board from the REM by gently lifting it from its circuit board connector.
Install the new BBU on the REM:
1. Attach the battery pack circuit board connector to mate with the REM’s connector.
2. Use a No. 1 Phillips screwdriver to secure the battery to the REM. If the BBU comes with a package of new screws, then use those new screws - do not re-use the screws from the old BBU attachment.
Re-install the HBA REM with BBU attached:
1. Ensure that the REM ejector lever is in the closed position. The lever should be flat with the REM support bracket.
2. Position the REM so that the battery is facing downward and the connector is aligned with the connector on the motherboard.
3. Slip the opposite end of the REM under the retaining clips on the front support bracket and ensure that the notch on the edge of the REM is positioned around the alignment post on the bracket.
4. Carefully lower and position the connector end of the REM until the REM contacts the connector on the motherboard, ensuring that the connectors are aligned. To seat the connector, carefully push the REM downward until it is in a level position.
Install the cover on the CMOD.
Return the CMOD back into the unit in CMOD0 slot.

OBTAIN CUSTOMER ACCEPTANCE

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

These steps are applicable to all systems after completing the physical replacement above. These should be done in co-operation with the customer’s administrator to complete the procedure and verification prior to the field engineer leaving the customer site.

After ILOM has booted, power on the server by pressing the power button, and then connect to the server’s console.
To connect to the console through ILOM:
From the server's console, monitor the system booting. Watch in particular the LSI controller BIOS while it is loading. If it gives a warning message regarding drives with preserved cache, choose “D” to discard the cache and continue. This is not an issue, as the disks will be re-synced by ASM after boot. If it gives a warning message that drives are in write-through mode due to a low battery, choose to continue.

The Exadata boot should then continue normally, showing the Exadata boot splash screen followed by normal OS boot messages. Note that there may be a long pause between screen outputs on the ILOM serial console during subsequent boot steps, because the default console is the graphics console and the Exadata boot splash screen will not display on the serial console.
Once the boot has completed, you should be able to log in as the ‘root’ user and verify that the new battery is seen and is charging.

Linux:

# /opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -a0

Solaris:

# /opt/MegaRAID/MegaCli -adpbbucmd -a0
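
To pick a single value out of the BBU status output, a small filter such as the following may help. This is a sketch under assumptions: it expects "Name : value" lines, and the field name "Charging Status" shown in the usage comment may differ between firmware and MegaCli versions. The function name bbu_field is hypothetical.

```shell
#!/bin/sh
# Sketch only: print the value of the first "Name : value" line on stdin
# whose name equals $1; fail if no such line exists.
# bbu_field is a hypothetical name, not a MegaCli feature.
bbu_field() {
    awk -F':' -v key="$1" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)            # trim the field name
            if (name == key) {
                value = substr($0, index($0, ":") + 1)   # text after the first colon
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                print value
                found = 1
                exit
            }
        }
        END { exit (found ? 0 : 1) }'
}

# Usage on a Linux node (not executed here; the field name is an assumption):
#   /opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -a0 | bbu_field "Charging Status"
```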

5. Set all logical drives cache policy to WriteBack cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wb -lall -a0

6. Verify the current cache policy for all logical drives is now using WriteBack cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU

For Storage Cells, use the following to return the Cell to service:

Activate the grid disks:

# cellcli
…
CellCLI> alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...etc...
Issue the command below and all disks should show 'active':

CellCLI> list griddisk
DATA_CD_00_dmorlx8cel01        active
DATA_CD_01_dmorlx8cel01        active
DATA_CD_02_dmorlx8cel01        active
RECO_CD_00_dmorlx8cel01        active
RECO_CD_01_dmorlx8cel01        active
RECO_CD_02_dmorlx8cel01        active
...etc...
Verify all grid disks have been successfully put online using the following command. Wait until 'asmmodestatus' is in status 'ONLINE' for all grid disks. The following is an example of the output early in the activation process.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
DATA_CD_01_dmorlx8cel01 active ONLINE Yes
DATA_CD_02_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
RECO_CD_01_dmorlx8cel01 active ONLINE Yes
...etc...

Notice in the above example that 'RECO_CD_00_dmorlx8cel01' is still in the 'SYNCING' state. Oracle ASM synchronization is complete only when ALL grid disks show ‘asmmodestatus=ONLINE’. This can take some time, depending on how busy the machine is now and how busy it was while this server was down for repair.
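
Waiting for synchronization can be sketched as a count of grid disks that are not yet ONLINE. The sketch assumes rows of name, status, asmmodestatus and asmdeactivationoutcome, as in the example output above. The function name griddisks_pending and the 60 second polling interval in the usage comment are illustrative choices.

```shell
#!/bin/sh
# Sketch only: read "name status asmmodestatus asmdeactivationoutcome"
# rows on stdin and print how many grid disks are not yet ONLINE.
# griddisks_pending is a hypothetical name.
griddisks_pending() {
    awk 'NF >= 4 && $3 != "ONLINE" { pending++ } END { print pending + 0 }'
}

# Usage on the storage cell (not executed here):
#   while [ "$(cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome | griddisks_pending)" -gt 0 ]
#   do
#       sleep 60
#   done
```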

For DB Nodes, the DB services should start automatically. Use the following to verify:

Validate that CRS is running. As the ‘root’ user, execute the following:

[root@db01 ~]# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
[root@db01 ~]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

In the above output, the “1” of “+ASM1” refers to the DB node number. For example, for DB node #3 the value would be +ASM3.
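
The CRS check can be sketched as a filter that passes only when every CRS-nnnn line reports "is online", as in the sample output above. The function name crs_all_online is hypothetical.

```shell
#!/bin/sh
# Sketch only: read "crsctl check crs" output on stdin, print every
# CRS-nnnn line that does not end in "is online", and succeed only when
# at least one CRS line was seen and all of them are online.
# crs_all_online is a hypothetical name.
crs_all_online() {
    awk '
        /^CRS-[0-9]+:/ {
            total++
            if ($0 !~ /is online[ \t]*$/) { print; down++ }
        }
        END { exit ((total >= 1 && down == 0) ? 0 : 1) }'
}

# Usage on the DB node after setting the environment with oraenv (not executed here):
#   crsctl check crs | crs_all_online && echo "CRS stack is online"
```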
Validate that instances are running:

# ps -ef |grep pmon

It should return a record for ASM instance and a record for each database.
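
The pmon check can be sketched as a filter that lists the instance names and skips the grep process itself. It assumes the usual ora_pmon_<SID> and asm_pmon_<SID> process names. The function name running_instances is hypothetical.

```shell
#!/bin/sh
# Sketch only: read "ps -ef" output on stdin and print the instance name
# taken from each ora_pmon_<SID> or asm_pmon_<SID> process. The
# "grep pmon" process does not match, so it is not listed.
# running_instances is a hypothetical name.
running_instances() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(ora|asm)_pmon_/) {
                    sid = $i
                    sub(/^(ora|asm)_pmon_/, "", sid)   # strip the process prefix
                    print sid
                }
            }
        }'
}

# Usage on the DB node (not executed here):
#   ps -ef | running_instances
```

The list should contain the ASM instance and one entry for each database instance expected on the node.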

PARTS NOTE:

371-4746 6Gigabit SAS RAID PCI Battery Module ( LION), BBU-07 (Obsolete)

371-4982 6Gigabit SAS RAID PCI Battery Module ( LION), BBU-08

7050794 6Gigabit SAS RAID PCI Battery Module ( LION), BBU-08, RoHS2013.

REFERENCE INFORMATION:

Exadata Database Machine documentation 12.2 - https://docs.oracle.com/cd/E80920_01/index.htm

Oracle ILOM 3.0 documentation library - https://docs.oracle.com/cd/E19860-01/index.html

Broadcom (formerly LSI) MegaRAID User's Guide - https://www.broadcom.com/support/oem/oracle/6gb/sg_x_sas6-r-int-z

ID 1329989.1 Reference guide for LSI disk controller batteries used in Exadata
https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1329989.1
