Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure

Solution 2176276.1: Exadata X5/X6 reports "Disk controller was hung. Cell was power cycled to stop the hang." on SW images prior to 12.1.2.3.3.161109
Applies to:
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Exadata X6-2 Hardware - Version All Versions and later
Exadata X6-8 Hardware - Version All Versions and later
Exadata X5-2 Hardware - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Information in this document applies to any platform.

Symptoms
Exadata X5/X6 reports "Disk controller was hung. Cell was power cycled to stop the hang." on SW images prior to 12.1.2.3.3.161109.

Note: Exadata X5-2L/X6-2L Extreme Flash Storage Servers do not have a SAS HBA and are not affected.
The alert is visible in the cell's alert history, for example:

CellCLI> list alerthistory
...
2    2016-05-16T10:07:39-04:00    critical    "Disk controller was hung. Cell was power cycled to stop the hang."

In some cases the server logs a reset event because an HBA controller Correctable Error is repeated thousands of times in the SAS HBA firmware terminal log (fwtermlog), for example:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -dsply -a0 | more
05/16/16 22:47:51: C0:In MonTask; Seconds from powerup = 0x00fd125b
...

In this case the error is correctable and really only occurred once, but a bug in the error handling routine causes it to be repeated thousands of times before the controller is forcibly reset.

In other cases the reset event may have nothing useful logged about the reason in the HBA controller logs, only reporting that the firmware was hung:

06/21/16 2:36:24: C0:Driver detected possible FW hang, Driver triggers FW to start crashdump collection
IMPORTANT NOTE: Due to another image issue in 12.1.2.2.2 and earlier, it is possible that the SAS HBA firmware logs contain no information, because an incorrect setting clears the logs on the power cycle. If this is the case, also complete the procedure in Note 2135119.1 to ensure persistent logging is enabled in addition to this solution. The additional steps are included below.
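A quick way to see whether persistent fwtermlog logging is already enabled (the same check used in step 5 of the Solution below) is to run, as 'root':

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuget -a0

If the output reports that the battery is OFF for TTY history, the logs are not preserved across a power cycle and the additional steps from Note 2135119.1 at the end of this document should be completed.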
Workaround: None. The power cycle recovers the HBA back to an operational state, as designed by the Exadata server monitoring service (MS).

Cause
Unpublished HBA firmware bug 21669752 causes the controller to hang while correcting an error. Unpublished HBA firmware bugs 23086151 and 23625327 cause the controller to hang due to a firmware hang issue.
Solution
The solution is to update the SAS HBA firmware to version 24.3.0-0084 (or later).
1. Check the image version of the server and the firmware version of the SAS HBA:

# imageinfo -ver
12.1.2.1.1.150316.2

# /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0081

If the server is running SW image 12.1.2.3.3.161109 or later, then the problem does not apply; these images already contain the firmware fix ("FW Package Build: 24.3.0-0084"). If a controller hang or reset event has occurred with other messages logged, or the event occurs on a system that already has SAS HBA firmware 24.3.0-0084, then an SR should be opened and a Sundiag and diagpack for the event uploaded for analysis and an action plan.

2. Update the SW image to 12.1.2.3.3.161109 or later, which contains the firmware fix 24.3.0-0084 per the example above. For how to update the image, refer to MOS Note 888828.1. The first preference for resolving this issue is to update the image.

If the server cannot be updated with a later image at this time, then the SAS HBA firmware alone may be temporarily updated on systems that have had a failure, using the firmware package "MR_6.3.8.4_24.3.0-0084.rom" attached to this Note, as follows:

a) Download the firmware package "MR_6.3.8.4_24.3.0-0084.rom" attached to this Note and copy it to the /tmp directory on the server to be updated. Verify the checksum:

=> md5sum MR_6.3.8.4_24.3.0-0084.rom
5940e9999c9856ac0c3756702d28d0bf MR_6.3.8.4_24.3.0-0084.rom
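If the same firmware file needs to be staged on several servers, the Exadata dcli utility can copy it in one pass. A minimal sketch, assuming a group file named cell_group listing the target storage servers (the group file name is an assumption, not from this Note); -f names the file to copy and -d the destination directory:

# dcli -g cell_group -l root -f /tmp/MR_6.3.8.4_24.3.0-0084.rom -d /tmp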
b) Shut down the services on the server to be updated, according to server type:

Exadata Storage Servers (based on X5-2L and X6-2L)

NOTE: If updating firmware on multiple storage servers in a rolling manner, do not reboot and apply the firmware update to multiple storage servers at the same time - only do them one at a time, and ensure all disks are re-synchronized with ASM before proceeding to the next storage server.
i. Check the ASM disk repair time. ASM drops a disk shortly after it is taken offline if the DISK_REPAIR_TIME attribute is exceeded; the default DISK_REPAIR_TIME is 3.6 hours. The current value can be checked with:

SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg
     where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

As long as the value is large enough to comfortably complete the firmware update on each server, it does not need to be changed.

ii. Check if ASM will be OK if the grid disks go OFFLINE.

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...snip...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
etc....

If one or more disks return asmdeactivationoutcome='No', wait for some time and re-run the command until all disks return asmdeactivationoutcome='Yes'.

NOTE: Taking the storage server offline while one or more disks return a status of asmdeactivationoutcome='No' will cause Oracle ASM to dismount the affected disk group, causing the databases to shut down abruptly.
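To avoid re-running the check by hand, a small polling loop can be used. This is a minimal sketch, not part of the original Note; it re-runs the CellCLI command above once a minute until no grid disk reports anything other than 'Yes' for asmdeactivationoutcome:

# Re-check every 60 seconds; the loop ends once every grid disk reports 'Yes'
while cellcli -e list griddisk attributes name,asmdeactivationoutcome | grep -qv Yes
do
    sleep 60
done
echo "All grid disks report asmdeactivationoutcome='Yes'"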
iii. Run the following cellcli command to inactivate all grid disks on the cell you wish to power down/reboot:

# cellcli -e alter griddisk all inactive
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
...etc...

iv. Execute the command below; the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks once they have gone offline:

# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
...etc...

v. Disable Exadata Storage Server services with the following command as 'root' user:

# cellcli -e alter cell shutdown services all
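As an optional sanity check (not part of the original Note), it can be confirmed that the main cell storage process has stopped before continuing with the firmware update. A simple sketch, assuming the standard cellsrv process name:

# ps -ef | grep -i cellsrv | grep -v grep

No cellsrv processes should be returned.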
Exadata DB Nodes (based on X5-2, X6-2 and X5-8)

Linux DB Nodes:

i) Shut down and disable auto-start of CRS services:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl disable crs
# $ORACLE_HOME/bin/crsctl stop crs

where the Grid Infrastructure home (GI_HOME) is typically "/u01/app/11.2.0/grid" but will depend on the customer's environment. In the above output the "1" of "+ASM1" refers to the DB node number; for example, on DB node #3 the value would be +ASM3.

ii) Validate CRS is down cleanly. There should be no processes running:

# ps -ef | grep css
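An additional confirmation (not shown in this Note) is to query CRS directly; once the stack is stopped it typically reports that the services cannot be contacted, for example:

# $ORACLE_HOME/bin/crsctl check crs
CRS-4639: Could not contact Oracle High Availability Services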
iii) Disable Exadata DB Node management services with the following command as 'root' user:

# dbmcli -e alter dbserver shutdown services all
OVM DB Nodes:

i) See what user domains are running (record the result). Connect to the management domain (domain zero, or dom0). The following example shows just two user domains plus the management domain Domain-0:

# xm list
Name         ID   Mem  VCPUs  State   Time(s)
Domain-0      0  8192      4  r-----  409812.7
dm01db01vm01  8  8192      2  -b----  156610.6
dm01db01vm02  9  8192      2  -b----  152169.8

ii) Connect to each domain using the command:

# xm console domainname
where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.

iii) Shut down any instances of CRS on that domain:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl stop crs

where the Grid Infrastructure home (GI_HOME) is typically "/u01/app/11.2.0/grid" but will depend on the customer's environment. In the above output the "1" of "+ASM1" refers to the DB node number; for example, on DB node #3 the value would be +ASM3.

iv) Validate CRS is down cleanly. There should be no processes running:

# ps -ef | grep css
v) Press CTRL+] to disconnect from the console.

vi) Repeat steps ii - v on each running domain.

vii) Shut down all user domains from dom0 (the command is not shown in this Note; see the sketch after this list).

viii) Repeat step i to see what user domains are running. Only Domain-0 should remain.

ix) Disable user domains from auto-starting during dom0 boot, so they do not restart until after the firmware has been updated:

# chkconfig xendomains off
x) Disable Exadata DB Node management services with the following command as 'root' user:

# dbmcli -e alter dbserver shutdown services all
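For step vii above, this Note does not include the command to shut down the user domains. One common approach with the xm toolstack shown above is the following sketch (an assumption to verify against your environment; domains can also be shut down individually with "xm shutdown <DomainName>"). The -a option shuts down all user domains and -w waits for them to complete:

# xm shutdown -a -w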
c) Upgrade the server's SAS HBA firmware with the following command as 'root' user:

# /opt/oracle.cellos/CheckHWnFWProfile -action updatefw -mode diagnostic -component DiskController -attribute DiskControllerFirmwareRevision -diagnostic_version 24.3.0-0084 -fwpath /tmp/MR_6.3.8.4_24.3.0-0084.rom
Upon completion of the firmware upgrade, the server will automatically reboot. There may be periods during the update where output to the screen stops; this is expected - please be patient. It takes roughly 10 minutes to reach the reboot and roughly 15 minutes to complete the entire process including rebooting the cell, excluding disk re-synchronization or CRS start time. The time may be longer on X5-8 DB nodes. There may be two reboots during the process.

d) Verify the server's SAS HBA firmware is updated:

# /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package
FW Package Build: 24.3.0-0084

The firmware package with the bug fix is 24.3.0-0084.

Note: If there is a problem with the firmware download, it is not recommended to downgrade the SAS HBA firmware to an older revision. In this event, an SR should be opened for Oracle to investigate and provide an action plan.
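When opening such an SR, the diagnostics referenced earlier (Sundiag) are normally gathered with the sundiag utility shipped on Exadata servers. A minimal sketch, assuming the standard location of the script on the image; the script reports where it writes the resulting archive, which can then be uploaded to the SR:

# /opt/oracle.SupportTools/sundiag.sh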
3. Verify the server's disks after the firmware update and bring its services back online as follows:

Exadata Storage Servers (based on X5-2L and X6-2L):

i. Verify the 12 disks are visible. The following command should show 12 disks:

# lsscsi | grep -i LSI
[0:2:0:0]  disk  LSI  MR9361-8i  4.23  /dev/sda
[0:2:1:0]  disk  LSI  MR9361-8i  4.23  /dev/sdb
[0:2:2:0]  disk  LSI  MR9361-8i  4.23  /dev/sdc
[0:2:3:0]  disk  LSI  MR9361-8i  4.23  /dev/sdd
[0:2:4:0]  disk  LSI  MR9361-8i  4.23  /dev/sde
[0:2:5:0]  disk  LSI  MR9361-8i  4.23  /dev/sdf
[0:2:6:0]  disk  LSI  MR9361-8i  4.23  /dev/sdg
[0:2:7:0]  disk  LSI  MR9361-8i  4.23  /dev/sdh
[0:2:8:0]  disk  LSI  MR9361-8i  4.23  /dev/sdi
[0:2:9:0]  disk  LSI  MR9361-8i  4.23  /dev/sdj
[0:2:10:0] disk  LSI  MR9361-8i  4.23  /dev/sdk
[0:2:11:0] disk  LSI  MR9361-8i  4.23  /dev/sdl

ii. Activate the grid disks:

# cellcli -e alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
...etc...

iii. Verify all grid disks show 'active':

# cellcli -e list griddisk
DATA_CD_00_dmorlx8cel01 active
DATA_CD_01_dmorlx8cel01 active
RECO_CD_00_dmorlx8cel01 active
RECO_CD_01_dmorlx8cel01 active
...etc...

iv. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in the activation process:

# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE  Yes
DATA_CD_01_dmorlx8cel01 active ONLINE  Yes
DATA_CD_02_dmorlx8cel01 active ONLINE  Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
...etc...

Notice in the above example that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' process. Oracle ASM synchronization is only complete when ALL grid disks show 'asmmodestatus=ONLINE'. This can take some time, depending on how busy the machine is and how busy it was while this server was down for repair.

Exadata DB Nodes (based on X5-2, X6-2 and X5-8)

Linux DB Nodes:

i) Verify all the disks are visible to the system and in 'normal' status:

# dbmcli -e "list physicaldisk"
252:0  F1HHYP  normal
252:1  F1K76P  normal
252:2  F1GZ1P  normal
252:3  F1K7GP  normal
252:4  F1LHUP  normal
252:5  F1A2JP  normal
252:6  F1LH6P  normal
252:7  F1LDSP  normal

There should be 4 or 8 disks depending on the DB node model.

ii) Start up CRS and re-enable auto-start of CRS. After the OS is up, the customer DBA should validate that CRS is running. As root, execute:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl start crs

Now re-enable auto-start:

# $ORACLE_HOME/bin/crsctl enable crs

where the Grid Infrastructure home (GI_HOME) is typically "/u01/app/11.2.0/grid" but will depend on the customer's environment. In the above output the "1" of "+ASM1" refers to the DB node number; for example, on DB node #3 the value would be +ASM3.

Check that CRS is running:

# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

iii) Validate that instances are running:

# ps -ef | grep pmon
It should return a record for the ASM instance and a record for each database.
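Illustrative output only (instance names are examples and will differ per environment); the ASM instance and each database instance each contribute one pmon process:

# ps -ef | grep pmon | grep -v grep
...  asm_pmon_+ASM1
...  ora_pmon_<SID>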
OVM DB Nodes:

i) Verify all the disks are visible to the system and in 'normal' status:

# dbmcli -e "list physicaldisk"
252:0  F1HHYP  normal
252:1  F1K76P  normal
252:2  F1GZ1P  normal
252:3  F1K7GP  normal
252:4  F1LHUP  normal
252:5  F1A2JP  normal
252:6  F1LH6P  normal
252:7  F1LDSP  normal

There should be 4 or 8 disks depending on the DB node model.

ii) Re-enable user domains to auto-start during Domain-0 boot:

# chkconfig xendomains on
iii) Start up all user domains that are marked for auto-start:

# service xendomains start
iv) See what user domains are running (compare against the result recorded previously):

# xm list
Name         ID   Mem  VCPUs  State   Time(s)
Domain-0      0  8192      4  r-----  409812.7
dm01db01vm01  8  8192      2  -b----  156610.6
dm01db01vm02  9  8192      2  -b----  152169.8

v) If any user domain did not auto-start, start it individually:

# xm create -c /EXAVMIMAGES/GuestImages/DomainName/vm.cfg
vi) Check that CRS has started in the user domains:

a) Connect to each domain using the command:

# xm console domainname
where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.

b) Any instances of CRS on that domain should have started automatically; verify with:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl check crs

c) Validate that instances are running:

# ps -ef | grep pmon
It should return a record for the ASM instance and a record for each database.

d) Press CTRL+] to disconnect from the console.

vii) Repeat step (vi) on each running domain.
4. Repeat the above steps to update the firmware on each storage server and DB node, as needed.

NOTE: If updating firmware on multiple storage servers in a rolling manner, do not reboot and apply the firmware update to multiple storage servers at the same time - only do them one at a time, and ensure all disks are completely re-synchronized with ASM before proceeding to the next storage server.
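After all servers have been updated, the firmware level can be confirmed across the storage servers in one pass with dcli. A minimal sketch, assuming the same cell_group file mentioned earlier (the group file name is an assumption, not from this Note):

# dcli -g cell_group -l root "/opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -i package"

Every server should report FW Package Build: 24.3.0-0084 (or later).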
If the image is version 12.1.2.3.0 or later, the procedure is complete. The following additional steps are required for images 12.1.2.2.2 or below (taken from Note 2135119.1):

5. Verify the current battery status for the fwtermlog setting. On each server, logged in as 'root', execute:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuget -a0
Battery is OFF for TTY history on Adapter 0

Exit Code: 0x00

The battery mode is shown as OFF for the fwtermlog.

6. Enable persistent logging by setting the battery mode to ON for the fwtermlog:

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuon -a0
Battery is set to ON for TTY history on Adapter 0

Running the above command on the server will not have any impact on running services. This change is persistent across server reboots or power cycles and is only unset by running the corresponding command to turn it off.
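If desired, the setting can be re-checked with the same status command used in step 5; the expected result is a sketch (exact wording may vary by MegaCli version):

# /opt/MegaRAID/MegaCli/MegaCli64 -fwtermlog -bbuget -a0
Battery is ON for TTY history on Adapter 0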
References
<NOTE:2135119.1> - SAS HBA does not maintain logs over a reboot on Exadata X5-2L High Capacity Storage Servers on SW Image versions below 12.1.2.3.0
<NOTE:888828.1> - Exadata Database Machine and Exadata Storage Server Supported Versions

Attachments
This solution has no attachment