SAS HBA does not maintain logs over a reboot on Exadata X5-2L High Capacity Storage Servers on SW Image versions below 12.1.2.3.0

Asset ID:	1-72-2135119.1
Update Date:	2017-09-26

Solution Type Problem Resolution Sure

Applies to:

Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Exadata X5-2 Hardware - Version All Versions and later
Exadata X4-8 Hardware - Version All Versions and later
Exadata X5-8 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Information in this document applies to any platform.

Symptoms

The SAS HBA does not retain its firmware terminal log (fwtermlog) across a reboot on Exadata X5-2L High Capacity Storage Servers running SW Image versions below 12.1.2.3.0.

Note: Exadata X5-2L Extreme Flash Storage Servers do not have a SAS HBA and are not affected.
Note: Exadata systems older than X5-2 use a different SAS HBA and are not affected.

Cause

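A firmware bug with terminal logging (fixed in SAS HBA firmware package 24.3.0-0081), combined with a default HBA log setting that does not preserve the fwtermlog across reboots on SW Image versions below 12.1.2.3.0.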

Solution

1. Check the image version of the system and firmware version of the SAS HBA:

# imageinfo -ver
12.1.2.2.0.150917
# MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0081

If the system is running SW Image 12.1.2.3.0 or later, the problem does not apply. These images include the firmware fix and enable the persistent log setting by default, so no further action is required.

If the system is running SW Image 12.1.2.2.x, the image already includes the firmware fix (24.3.0-0081, as in the example above), but the HBA configuration settings need to be updated. Proceed to step 2.

# imageinfo -ver
12.1.2.1.1.150316.2
# MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0073

If the system is running a SW Image release earlier than 12.1.2.2.0 with firmware 24.3.0-0073, as in the example above, it needs to be updated to SW Image 12.1.2.2.0 or later, which contains SAS HBA firmware package 24.3.0-0081, because of a firmware bug with terminal logging. For how to update the image, refer to MOS Note 888828.1.

Updating the image is the preferred way to resolve this issue. If the server cannot be updated to a later image at this time, the SAS HBA firmware alone may be updated temporarily on systems that have had a failure, using the firmware package "MR_6.3.8_24.3.0-0081.rom" attached to this Note, as follows:

a) Download the firmware package "MR_6.3.8_24.3.0-0081.rom" attached to this Note, and copy it to the /tmp directory on the Storage Cell.

b) Prepare the High Capacity Storage Cells for maintenance as follows:

NOTE: When updating firmware on multiple storage cells in a rolling manner, do not reboot or update more than one storage cell at the same time. Update one cell at a time, and ensure all disks are re-synchronized with ASM before proceeding to the next storage cell.

i. ASM drops a disk once it has been offline for longer than the DISK_REPAIR_TIME
attribute value. The default value of 3.6 hours should be adequate for replacing
components, but it may have been changed by the Customer. To check this parameter,
have the Customer log into ASM and run the following query:

SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg
where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

As long as the value is large enough to comfortably replace the hardware in a
storage cell, there is no need to change it.

ii. Check if ASM will be OK if the grid disks go OFFLINE.

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...snip...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
etc....

If one or more disks return asmdeactivationoutcome='No', you should wait for some time
and repeat the query until all disks return asmdeactivationoutcome='Yes'.

NOTE: Taking the storage server offline while one or more disks return a status of asmdeactivationoutcome='No' will cause Oracle ASM to dismount the affected disk group, causing the databases to shut down abruptly.

iii. Run the following cellcli command to inactivate all grid disks on the cell you wish to
power down or reboot. This can take 10 minutes or longer.

# cellcli -e alter griddisk all inactive
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
...etc...

iv. Run the command below. Once the disks are offline and inactive in ASM, the output
should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes'
for all grid disks.

# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
...etc...
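
The asmdeactivationoutcome checks in steps ii and iv can be scripted rather than read by eye. The following is a minimal sketch, not part of the documented procedure. It assumes the cellcli output uses the whitespace-separated columns shown in the examples above, with asmdeactivationoutcome as the last column. The function name check_deactivation_outcome is hypothetical.

```shell
# Hypothetical helper: reads "cellcli -e list griddisk attributes ..." output
# on stdin, where the last column is asmdeactivationoutcome.
# Prints the name of every grid disk whose last column is not "Yes" and
# returns non-zero if any such disk was found.
# Lines with fewer than 3 columns (for example "...etc...") are ignored.
check_deactivation_outcome() {
    awk 'NF >= 3 && $NF != "Yes" { print $1; bad = 1 } END { exit bad }'
}

# Possible usage on a storage cell (assumes cellcli is in PATH):
#   cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome \
#       | check_deactivation_outcome || echo "Do not take this cell offline yet"
```

A non-zero return status means at least one grid disk did not report 'Yes', so the query should be repeated later before continuing.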

c) Disable Exadata Storage Server services with the following command as 'root' user:

# cellcli -e alter cell shutdown services all

d) Upgrade the HBA firmware with the following command as 'root' user:

# /opt/oracle.cellos/CheckHWnFWProfile -action updatefw -mode diagnostic -component DiskController -attribute DiskControllerFirmwareRevision -diagnostic_version 24.3.0-0081 -fwpath /tmp/MR_6.3.8_24.3.0-0081.rom

Upon completion of the firmware upgrade, the system will automatically reboot. The entire process takes approximately 10 minutes after the cell reboots, excluding disk re-synchronization time.

e) Verify the SAS HBA firmware is updated:

# MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0081

The firmware package with the logging bug fix is 24.3.0-0081.

f) Verify the disks and bring them online as follows:

i. Verify the 12 disks are visible. The following command should show 12 disks:

# lsscsi | grep -i LSI
[0:2:0:0]    disk    LSI      MR9361-8i        4.23 /dev/sda
[0:2:1:0]    disk    LSI      MR9361-8i        4.23 /dev/sdb
[0:2:2:0]    disk    LSI      MR9361-8i        4.23 /dev/sdc
[0:2:3:0]    disk    LSI      MR9361-8i        4.23 /dev/sdd
[0:2:4:0]    disk    LSI      MR9361-8i        4.23 /dev/sde
[0:2:5:0]    disk    LSI      MR9361-8i        4.23 /dev/sdf
[0:2:6:0]    disk    LSI      MR9361-8i        4.23 /dev/sdg
[0:2:7:0]    disk    LSI      MR9361-8i        4.23 /dev/sdh
[0:2:8:0]    disk    LSI      MR9361-8i        4.23 /dev/sdi
[0:2:9:0]    disk    LSI      MR9361-8i        4.23 /dev/sdj
[0:2:10:0]   disk    LSI      MR9361-8i        4.23 /dev/sdk
[0:2:11:0]   disk    LSI      MR9361-8i        4.23 /dev/sdl

ii. Activate the grid disks.

# cellcli
…
CellCLI> alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
...etc...

iii. Verify all grid disks show 'active':

CellCLI> list griddisk
DATA_CD_00_dmorlx8cel01 active
DATA_CD_01_dmorlx8cel01 active
RECO_CD_00_dmorlx8cel01 active
RECO_CD_01_dmorlx8cel01 active
...etc...

iv. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in the activation process.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
DATA_CD_01_dmorlx8cel01 active ONLINE Yes
DATA_CD_02_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
...etc...

Notice in the above example that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' state. Oracle ASM synchronization is complete only when ALL grid disks show asmmodestatus='ONLINE'. This process can take some time, depending on how busy the machine is and how busy it was while this server was down for repair.
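
Waiting for re-synchronization can also be scripted. The following is a minimal sketch, assuming the four-column output format shown in step iv (name, status, asmmodestatus, asmdeactivationoutcome). The function name all_griddisks_online and the 60-second polling interval in the usage comment are illustrative assumptions.

```shell
# Hypothetical helper: reads the output of
#   list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
# on stdin and returns 0 only if at least one grid disk row was seen and
# every row shows status "active" and asmmodestatus "ONLINE".
# Lines that do not have exactly 4 columns (for example "...etc...") are ignored.
all_griddisks_online() {
    awk 'NF == 4 { seen = 1; if ($2 != "active" || $3 != "ONLINE") pending = 1 }
         END { if (seen && !pending) exit 0; exit 1 }'
}

# Possible usage on a storage cell (assumes cellcli is in PATH):
#   until cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome \
#           | all_griddisks_online; do
#       sleep 60   # polling interval is an arbitrary choice
#   done
```

Empty input is treated as "not online", so a failed cellcli call does not end the wait early.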

g) Repeat the above steps to update the firmware on each storage cell, as needed.
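
The image version decision in step 1 can be sketched as a small shell function. This is an illustration under stated assumptions, not an Oracle-supplied tool: the function names version_lt and next_action and the printed labels are hypothetical, and the input is assumed to be a dotted version string such as the one printed by "imageinfo -ver".

```shell
# version_lt A B: succeeds when dotted version A is lower than B.
# Fields are compared numerically from left to right; missing fields count as 0.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1   # versions are equal, so A is not lower than B
    }'
}

# next_action IMAGE_VERSION: prints which branch of step 1 applies.
# The labels are hypothetical names for the three cases described in step 1.
next_action() {
    if version_lt "$1" 12.1.2.2.0; then
        echo "update-image-or-firmware"    # earlier than 12.1.2.2.0
    elif version_lt "$1" 12.1.2.3.0; then
        echo "enable-fwtermlog-battery"    # 12.1.2.2.x: proceed to step 2
    else
        echo "no-action"                   # 12.1.2.3.0 or later
    fi
}

# Possible usage on a storage cell:
#   next_action "$(imageinfo -ver)"
```

The sketch looks only at the image version. The firmware package version reported by MegaCli64 should still be confirmed as shown in step 1.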

2. To check the current battery setting for the fwtermlog, log in to each storage cell as 'root' and run:

# MegaCli64 -fwtermlog -bbuget -a0
Battery is OFF for TTY history on Adapter 0
Exit Code: 0x00

The output should show that the battery mode is OFF for the fwtermlog.

3. Turn on use of the battery for maintaining the fwtermlog across cell reboots and power cycles:

# MegaCli64 -fwtermlog -bbuon -a0
Battery is set to ON for TTY history on Adapter 0

Running the above command on the cells does not have any service impact.

This change persists across cell reboots and power cycles, and is unset only by an explicit command.
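
Steps 2 and 3 can be combined so that the setting is changed only where it is currently off. The following is a minimal sketch. The function name fwtermlog_battery_state is hypothetical. The OFF message format is taken from the example in step 2; the assumption that the query prints "ON" in the same position when the setting is enabled is not confirmed by this Note.

```shell
# Hypothetical helper: reads "MegaCli64 -fwtermlog -bbuget -a0" output on stdin
# and prints ON or OFF, taken from the "Battery is ... for TTY history" line.
# Prints nothing if no such line is found.
fwtermlog_battery_state() {
    awk '/^Battery is .*for TTY history/ {
        for (i = 1; i <= NF; i++)
            if ($i == "ON" || $i == "OFF") { print $i; exit }
    }'
}

# Possible usage on a storage cell, as root:
#   if [ "$(MegaCli64 -fwtermlog -bbuget -a0 | fwtermlog_battery_state)" = "OFF" ]; then
#       MegaCli64 -fwtermlog -bbuon -a0
#   fi
```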

References

<NOTE:888828.1> - Exadata Database Machine and Exadata Storage Server Supported Versions
<BUG:22023718> - SET FWTERMLOG BBUON FOR STORAGE CELLS
<BUG:21534072> - ASPEN TTY LOG DUMP IS NOT PERSISTENT ACROSS MULTIPLE BOOTS

Attachments

This solution has no attachment