
Asset ID: 1-72-2176276.1
Update Date:2017-06-13
Keywords:

Solution Type: Problem Resolution Sure Solution

Solution  2176276.1 :   Exadata X5/X6 reports "Disk controller was hung. Cell was power cycled to stop the hang." on SW images prior to 12.1.2.3.3.161109  


Related Items
  • Exadata X5-2 Full Rack
  • Exadata X5-8 Hardware
  • Exadata X6-2 Hardware
  • Exadata X6-8 Hardware
  • Exadata X5-2 Hardware
  • Exadata X4-8 Hardware
  • Oracle SuperCluster T5-8 Hardware
  • Oracle SuperCluster M6-32 Hardware

Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: EXADATA


Exadata X5/X6 reports "Disk controller was hung. Cell was power cycled to stop the hang." on SW images prior to 12.1.2.3.3.161109

In this Document
Symptoms
Cause
Solution
References


Applies to:

Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Exadata X6-2 Hardware - Version All Versions and later
Exadata X6-8 Hardware - Version All Versions and later
Exadata X5-2 Hardware - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Information in this document applies to any platform.

Symptoms

Exadata X5/X6 reports "Disk controller was hung. Cell was power cycled to stop the hang." on SW images prior to 12.1.2.3.3.161109

Note: Exadata X5-2L/X6-2L Extreme Flash Storage Servers do not have a SAS HBA and are not affected.
Note: Exadata systems older than X5-2 use a different SAS HBA and are not affected.
Note: For Exadata X4-8 systems with X5-2L Storage Servers, this issue applies only to the Storage Servers; the X4-8 DB nodes use a different SAS HBA and are not affected.


Issue: The server will raise the following alert (via cellcli on storage servers or dbmcli on DB nodes, depending on which server type saw the event):

CellCLI> list alerthistory

...

2 2016-05-16T10:07:39-04:00 critical "Disk controller was hung. Cell
was power cycled to stop the hang."

In some cases the reset event logged by the server may be due to an HBA controller correctable error that is repeated thousands of times in the SAS HBA firmware terminal log (fwtermlog), such as:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -dsply -a0 | more
Firmware Term Log Information on controller 0:
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
... <~8100 entries is typical>..........
05/16/16 22:47:51: C0:SRAM errAddr c01af1e0 errAttrib 00000023
05/16/16 22:47:51: C0:Correctable err, continuing...
05/16/16 22:47:51: C0:SRAM errAddr c01a
MonSetAllowChipReset: MonAllowResetChip 1

05/16/16 22:47:51: C0:In MonTask; Seconds from powerup = 0x00fd125b
05/16/16 22:47:51: C0:Max Temperature = 80 on Channel 4
Firmware crash dump feature enabled
Crash dump collection will start immediately
copied 75 MB in 71957 Microseconds
[0]: fp=c03ffe00, lr=c13243c8 - _MonTask+200
... <reset output>

In this case the error is correctable and actually occurred only once, but a bug in the error-handling routine causes it to be logged repeatedly, thousands of times, before the controller forcibly resets.

In other cases the HBA controller logs may contain nothing useful about the reason for the reset, reporting only that the firmware was hung:

06/21/16 2:36:24: C0:Driver detected possible FW hang, Driver triggers FW to start crashdump collection

IMPORTANT NOTE: Due to another image issue in 12.1.2.2.2 and earlier, it is possible that the SAS HBA firmware logs contain no information, because an incorrect setting clears the logs on power cycle. If this is the case, ensure the procedure in Note 2135119.1 is also completed so that persistent logging is enabled in addition to this solution. The additional steps are included below.

 

Workaround: None. The power cycle recovers the HBA to an operational state, as designed by the Exadata server monitoring service (MS).

Cause

Unpublished HBA firmware bug 21669752 causes the controller to hang while correcting an error.

Unpublished HBA firmware bugs 23086151 and 23625327 cause the controller firmware itself to hang.

 

Solution

The solution is to update the SAS HBA firmware to version 24.3.0-0084 (or later).
If a controller hang or reset event has occurred with other messages logged, or the event occurs on a system that already has SAS HBA firmware 24.3.0-0084, then an SR should be opened and a Sundiag and diagpack for the event uploaded for analysis and an action plan.

 

1. Check the image version of the server and firmware version of the SAS HBA: 

# imageinfo -ver
12.1.2.1.1.150316.2
# /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0081

If the server is running SW Image 12.1.2.3.3.161109 or later, then the problem does not apply.

These images include the firmware fix, reported as "FW Package Build: 24.3.0-0084".
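
As an illustrative check only (a minimal sketch assuming the standard Exadata paths shown above), both values can be collected in one pass:

# IMG=$(imageinfo -ver); FW=$(/opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package | awk -F': ' '{print $2}')
# echo "Image: ${IMG}  HBA FW package: ${FW}"
Image: 12.1.2.1.1.150316.2  HBA FW package: 24.3.0-0081

If the reported FW package is below 24.3.0-0084, plan an image or firmware update as described below.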


2. Update the SW image to 12.1.2.3.3.161109 or later, which contains the firmware fix 24.3.0-0084 per the example above. For how to update the image, refer to MOS Note 888828.1.

The first preference for resolving this issue is to update the image. If the server cannot be updated with a later image at this time, then the SAS HBA firmware alone may be temporarily updated on systems that have had a failure, using the firmware package "MR_6.3.8.4_24.3.0-0084.rom" attached to this Note, as follows:

a) Download the firmware package "MR_6.3.8.4_24.3.0-0084.rom" attached to this Note, and copy it to the /tmp directory on the server to be updated. 

The md5sum of the file is:

# md5sum MR_6.3.8.4_24.3.0-0084.rom
5940e9999c9856ac0c3756702d28d0bf MR_6.3.8.4_24.3.0-0084.rom
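
Optionally, the checksum can be compared automatically (a sketch only, assuming the file was copied to /tmp as in step a; GNU md5sum expects two spaces between the hash and the file name):

# echo "5940e9999c9856ac0c3756702d28d0bf  /tmp/MR_6.3.8.4_24.3.0-0084.rom" | md5sum -c -
/tmp/MR_6.3.8.4_24.3.0-0084.rom: OK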


b) Prepare the server for maintenance as follows:

Exadata Storage Servers (based on X5-2L and X6-2L)

NOTE: If updating firmware on multiple storage servers in a rolling manner, do not reboot and apply the firmware update to multiple storage servers at the same time - only do them one at a time and ensure all disks are re-synchronized with ASM before proceeding to the next storage server.

i. ASM drops a disk shortly after it is taken offline. The default DISK_REPAIR_TIME attribute
value of 3.6 hours should be adequate for replacing components, but may have been changed by
the customer. To check this parameter, have the customer log into ASM and run the following query:

SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg
where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

As long as the value is large enough to comfortably perform the upgrade in a
storage cell, there is no need to change it.
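
If a longer repair window is needed, the attribute can be raised temporarily for the relevant disk groups (illustration only - 'DATA' is a placeholder disk group name and '8.5h' an arbitrary example value):

SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';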

ii. Check if ASM will be OK if the grid disks go OFFLINE.

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
... <output snipped> ...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
etc....

If one or more disks return asmdeactivationoutcome='No', wait for some time and repeat the query
until all disks return asmdeactivationoutcome='Yes' (a polling sketch is shown after the note below).

NOTE: Taking the storage server offline while one or more disks return a status of asmdeactivationoutcome='No' will cause Oracle ASM to dismount the affected disk group, causing the databases to shut down abruptly.
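
If desired, the wait can be scripted as a simple polling loop (a sketch only; the 60-second interval is arbitrary). The loop exits once no grid disk reports asmdeactivationoutcome='No':

# while cellcli -e "list griddisk attributes name,asmdeactivationoutcome" | grep -qw No; do sleep 60; done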

iii. Run the following cellcli command to inactivate all grid disks on the cell you wish to power
down/reboot (this can take 10 minutes or longer):

# cellcli -e alter griddisk all inactive
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
...etc...

iv. Execute the command below; once the disks are offline and inactive in ASM, the output should
show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.

# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
...etc...

v. Disable Exadata Storage Server services with the following command as 'root' user:

# cellcli -e alter cell shutdown services all

 

Exadata DB Nodes (based on X5-2, X6-2 and X5-8)

Linux DB Nodes:

i) Shut down CRS services and disable CRS auto-start:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl disable crs

# $ORACLE_HOME/bin/crsctl stop crs
or
# <GI_HOME>/bin/crsctl stop crs

where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but will depend on the customer's environment.

In the above output, the "1" of "+ASM1" refers to the DB node number. For example, for DB node #3 the value would be +ASM3.

ii) Validate that CRS is down cleanly; the following should return no CRS processes:

# ps -ef | grep css
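
The grep process itself may appear in that output; an optional refinement that returns nothing once CRS is fully down is:

# ps -ef | grep '[c]ss'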

iii)  Disable Exadata DB Node management services with the following command as 'root' user:

# dbmcli -e alter dbserver shutdown services all

 

OVM DB Nodes:

i) See what user domains are running (record the result).

Connect to the management domain (domain zero, or dom0). The following is an example with just two user domains plus the management domain Domain-0:

# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 409812.7
dm01db01vm01 8 8192 2 -b---- 156610.6
dm01db01vm02 9 8192 2 -b---- 152169.8

ii) Connect to each domain using the command:

# xm console domainname

where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.

iii) Shut down any instances of CRS on that domain:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl stop crs
or
# <GI_HOME>/bin/crsctl stop crs

where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but will depend on the customer's environment.

In the above output, the "1" of "+ASM1" refers to the DB node number. For example, for DB node #3 the value would be +ASM3.

iv) Validate CRS is down cleanly. There should be no processes running.

# ps -ef | grep css

v) Press CTRL+] to disconnect from the console.

vi) Repeat steps ii - v on each running domain.

vii) Shut down all user domains from dom0 (see the example below).
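
For example (one possible approach - confirm against your Xen tools version), each user domain reported by 'xm list' can be shut down from dom0 with the -w flag, which waits for the shutdown to complete:

# xm shutdown -w domainname

where domainname would be dm01db01vm01 or dm01db01vm02 in the earlier example.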

viii) Repeat step (i) to see what user domains are running; only Domain-0 should remain.

ix) Disable user domains from auto-starting during the dom0 boot that follows the firmware update:

# chkconfig xendomains off

 x) Disable Exadata DB Node management services with the following command as 'root' user:

# dbmcli -e alter dbserver shutdown services all

 

c) Upgrade the server's SAS HBA firmware with the following command as 'root' user:

# /opt/oracle.cellos/CheckHWnFWProfile -action updatefw  -mode diagnostic -component DiskController -attribute DiskControllerFirmwareRevision -diagnostic_version 24.3.0-0084 -fwpath /tmp/MR_6.3.8.4_24.3.0-0084.rom

Upon completion of the firmware upgrade, the server will automatically reboot. There may be periods during the update when output to the screen stops; this is expected, so please be patient. It takes roughly 10 minutes to reach the reboot and roughly 15 minutes to complete the entire process including rebooting the cell, excluding disk re-synchronization or CRS start time. The time may be longer on X5-8 DB nodes. There may be two reboots during the process.

d) Verify the server's SAS HBA firmware is updated:

# /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0084

The firmware package with the bug fix is 24.3.0-0084. 

Note: If there is a problem with the firmware download, downgrading the SAS HBA firmware to an older revision is not recommended. In this event, an SR should be opened for Oracle to investigate and determine further action.

3. After the firmware update, verify the server's disks and bring its services back online as follows:

Exadata Storage Servers (based on X5-2L and X6-2L):

i. Verify the 12 disks are visible. The following command should show 12 disks:

# lsscsi | grep -i LSI
[0:2:0:0]    disk    LSI      MR9361-8i        4.23  /dev/sda
[0:2:1:0]    disk    LSI      MR9361-8i        4.23  /dev/sdb
[0:2:2:0]    disk    LSI      MR9361-8i        4.23  /dev/sdc
[0:2:3:0]    disk    LSI      MR9361-8i        4.23  /dev/sdd
[0:2:4:0]    disk    LSI      MR9361-8i        4.23  /dev/sde
[0:2:5:0]    disk    LSI      MR9361-8i        4.23  /dev/sdf
[0:2:6:0]    disk    LSI      MR9361-8i        4.23  /dev/sdg
[0:2:7:0]    disk    LSI      MR9361-8i        4.23  /dev/sdh
[0:2:8:0]    disk    LSI      MR9361-8i        4.23  /dev/sdi
[0:2:9:0]    disk    LSI      MR9361-8i        4.23  /dev/sdj
[0:2:10:0]   disk    LSI      MR9361-8i        4.23  /dev/sdk
[0:2:11:0]   disk    LSI      MR9361-8i        4.23  /dev/sdl
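
As an optional quick check, a case-insensitive count should return 12:

# lsscsi | grep -ic LSI
12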

ii. Activate the grid disks.

# cellcli -e alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
...etc...

iii. Verify all grid disks show 'active':

# cellcli -e list griddisk
DATA_CD_00_dmorlx8cel01 active
DATA_CD_01_dmorlx8cel01 active
RECO_CD_00_dmorlx8cel01 active
RECO_CD_01_dmorlx8cel01 active
...etc...

iv. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in the activation process.

# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
DATA_CD_01_dmorlx8cel01 active ONLINE Yes
DATA_CD_02_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
...etc...

Notice in the above example that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' state. Oracle ASM synchronization is complete only when ALL grid disks show asmmodestatus=ONLINE. This can take some time, depending on how busy the machine is and was while this server was down for maintenance.
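
If desired, the resynchronization wait can be scripted similarly to the earlier deactivation check (a sketch only; the interval is arbitrary). The loop exits once every grid disk reports asmmodestatus=ONLINE:

# while cellcli -e "list griddisk attributes name,asmmodestatus" | grep -qvw ONLINE; do sleep 60; done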

Exadata DB Nodes (based on X5-2, X6-2 and X5-8)

Linux DB Nodes:

i) Verify all the disks are visible to the system and in 'normal' status.

# dbmcli -e "list physicaldisk"
252:0 F1HHYP normal
252:1 F1K76P normal
252:2 F1GZ1P normal
252:3 F1K7GP normal
252:4 F1LHUP normal
252:5 F1A2JP normal
252:6 F1LH6P normal
252:7 F1LDSP normal

 There should be 4 or 8 disks depending on the DB node model.
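
As an optional quick check, the disk count can be confirmed with (8 in the example above):

# dbmcli -e "list physicaldisk" | wc -l
8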

ii) Start CRS and re-enable CRS autostart. After the OS is up, the customer DBA should validate that CRS is running. As root, execute:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl start crs
# $ORACLE_HOME/bin/crsctl check crs

Now re-enable autostart:

# $ORACLE_HOME/bin/crsctl enable crs
or
# <GI_HOME>/bin/crsctl enable crs

where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but will depend on the customer's environment.

In the above output, the "1" of "+ASM1" refers to the DB node number. For example, for DB node #3 the value would be +ASM3.
Example output when all is online is: 

# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

iii) Validate that instances are running:

# ps -ef |grep pmon

It should return a record for the ASM instance and a record for each database.

 

OVM DB Nodes:

i) Verify all the disks are visible to the system and in 'normal' status.

# dbmcli -e "list physicaldisk"
252:0 F1HHYP normal
252:1 F1K76P normal
252:2 F1GZ1P normal
252:3 F1K7GP normal
252:4 F1LHUP normal
252:5 F1A2JP normal
252:6 F1LH6P normal
252:7 F1LDSP normal

There should be 4 or 8 disks depending on the DB node model. 

ii) Re-enable user domains to autostart during Domain-0 boot:

# chkconfig xendomains on

iii) Start all user domains that are marked for auto-start:

# service xendomains start

iv) See what user domains are running (compare against the previously recorded result):

# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 409812.7
dm01db01vm01 8 8192 2 -b---- 156610.6
dm01db01vm02 9 8192 2 -b---- 152169.8

v) If any domains did not auto-start, start a single user domain with:

# xm create -c /EXAVMIMAGES/GuestImages/DomainName/vm.cfg

vi)  Check that CRS has started in user domains:

a) Connect to each domain using the command:

# xm console domainname

where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.

b) Any instances of CRS in that domain should have started automatically; verify with:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

c) Validate that instances are running:

# ps -ef |grep pmon

It should return a record for the ASM instance and a record for each database.

d) Press CTRL+] to disconnect from the console.

vii) Repeat step (vi) on each running domain.

 

4. Repeat the above steps to update the firmware on each storage server and DB node, as needed.

NOTE: If updating firmware on multiple storage servers in a rolling manner, do not reboot and apply the firmware update to multiple storage servers at the same time - do them only one at a time and ensure all disks are completely re-synchronized with ASM before proceeding to the next storage server.

 

If the image is version 12.1.2.3.0 or later, then the procedure is complete. The following additional steps are required for images 12.1.2.2.2 or earlier (as taken from Note 2135119.1):

5. To verify the current battery status for the fwtermlog setting, log in as 'root' on each server and execute:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuget -a0
  Battery is OFF for TTY history on Adapter 0
  Exit Code: 0x00

This shows that battery-backed retention of the fwtermlog (TTY history) is OFF, so the log is not preserved across power cycles.

6. Turn on use of the battery for maintaining the fwtermlog across server reboots and power cycles:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuon -a0
Battery is set to ON for TTY history on Adapter 0

Running the above command on the server will not have any impact on running services.

This change is persistent across server reboots and power cycles, and is only unset by an explicit command.
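
To confirm the new setting, the bbuget check from step 5 can be repeated; it should now report that the battery is ON for TTY history on Adapter 0:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuget -a0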


 

References

<NOTE:2135119.1> - SAS HBA does not maintain logs over a reboot on Exadata X5-2L High Capacity Storage Servers on SW Image versions below 12.1.2.3.0
<NOTE:888828.1> - Exadata Database Machine and Exadata Storage Server Supported Versions

Attachments
This solution has no attachment