Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1545766.1
Update Date:2014-06-04

Solution Type  Problem Resolution Sure

Solution 1545766.1: CELL-02767 When Using Cellcli List and Alter Flashcache Commands


Related Items
  • Exadata Database Machine X2-2 Hardware
  • Oracle Exadata Storage Server Software
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Exadata>DB: Exadata_EST




Applies to:

Oracle Exadata Storage Server Software - Version 11.1.0.3.0 to 11.2.3.2.1 [Release 11.1 to 11.2]
Exadata Database Machine X2-2 Hardware
Information in this document applies to any platform.
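
As a rough illustration of the version range above, the following Python sketch tests whether a cell software version string falls between 11.1.0.3.0 and 11.2.3.2.1 inclusive. The function names are hypothetical, and the sketch assumes that any sixth (build) component in a reported version string can be ignored.

```python
def parse_version(version):
    """Turn a dotted version string into a tuple of its first five integer parts."""
    # Assumption: anything after the fifth component is a build suffix and is ignored.
    return tuple(int(part) for part in version.split(".")[:5])

# Range boundaries copied from the "Applies to" statement.
FIRST_AFFECTED = parse_version("11.1.0.3.0")
LAST_AFFECTED = parse_version("11.2.3.2.1")

def is_affected(version):
    """Return True if the version lies within the affected range, inclusive."""
    return FIRST_AFFECTED <= parse_version(version) <= LAST_AFFECTED

if __name__ == "__main__":
    for candidate in ("11.1.0.3.0", "11.2.3.2.1", "11.2.3.2.2"):
        print(candidate, is_affected(candidate))
```

Tuples compare component by component, which avoids the ordering mistakes that plain string comparison makes with multi-digit components.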

Symptoms

The CellCLI LIST FLASHCACHE and ALTER FLASHCACHE commands fail with CELL-02767: Flash cache operation timed out.

 

From the ms-odl.trc file:

 

[2013-02-12T19:02:43.332+03:00] [ossmgmt] [ERROR] [] [ms.core.MSOSSComm][tid: 14] [ecid: 163.20.129.62:68041:1359568760594:4,0] oss_ioctl OSS_IOCTL_GRIDDISK_DROP [59] caused the error: FlashCache management operation timed out [350]

[2013-02-12T19:02:43.333+03:00] [ossmgmt] [NOTIFICATION] [] [ms.core.MSCoreImpl] [tid: 14] [ecid: 163.20.129.62:68041:1359568760594:4,0] Error while trying to sync diflist: oracle.ossmgmt.common.core.SageException: CELL-02767: Flash cache operation timed out.[[

oracle.ossmgmt.common.core.SageException: CELL-02767: Flash cache operation timed out.
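
To look for the same signature in a copy of the trace file, a minimal Python sketch along the following lines may help. The regular expressions are derived only from the excerpt above and may not match the wording used by other releases. The function name is hypothetical, and the sample text reuses the message portions of the excerpt with the record prefixes omitted.

```python
import re

# Patterns based solely on the excerpt above (an assumption, not a documented format).
IOCTL_ERROR = re.compile(r"(OSS_IOCTL_\w+) \[(\d+)\] caused the error: (.+?) \[(\d+)\]")
CELL_ERROR = re.compile(r"CELL-02767")

def summarize_trace(text):
    """Return (list of ioctl error tuples, number of lines mentioning CELL-02767)."""
    ioctl_errors = [match.groups() for match in IOCTL_ERROR.finditer(text)]
    cell_hits = sum(1 for line in text.splitlines() if CELL_ERROR.search(line))
    return ioctl_errors, cell_hits

# Message text copied from the excerpt; record prefixes omitted for brevity.
sample = (
    "OSS_IOCTL_GRIDDISK_DROP [59] caused the error: "
    "FlashCache management operation timed out [350]\n"
    "CELL-02767: Flash cache operation timed out.[[\n"
    "oracle.ossmgmt.common.core.SageException: "
    "CELL-02767: Flash cache operation timed out.\n"
)

errors, hits = summarize_trace(sample)
print(errors)
print(hits)
```

In practice the text would be read from the trace file itself, for example with `open(path).read()`, where `path` points to the local copy of ms-odl.trc.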

 

The /var/log/messages file shows stuck I/O requests to the flash disk:

Feb 11 21:55:31 ex01cel04 kernel: mptbase: ioc1: LogInfo(0x31111000):  Originator={PL}, Code={Reset}, SubCode(0x1000)

Feb 11 21:55:31 ex01cel04 kernel: mptbase: ioc1: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
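
The mptbase messages above can be tallied per controller and code with a short Python sketch such as the following. The regular expression is modeled only on the two sample lines, and the function name is hypothetical. The tally is just a convenience for spotting these messages and is not a diagnostic threshold.

```python
import re
from collections import Counter

# Modeled on the two sample lines above; other driver versions may log differently.
LOGINFO = re.compile(
    r"kernel: mptbase: (ioc\d+): LogInfo\((0x[0-9a-fA-F]+)\):\s*"
    r"Originator=\{([^}]*)\}, Code=\{([^}]*)\}, SubCode\((0x[0-9a-fA-F]+)\)"
)

def count_loginfo(lines):
    """Count mptbase LogInfo messages, keyed by (controller, code text)."""
    counts = Counter()
    for line in lines:
        match = LOGINFO.search(line)
        if match:
            controller, _loginfo, _originator, code, _subcode = match.groups()
            counts[(controller, code)] += 1
    return counts

# The two lines from the excerpt above.
sample_lines = [
    "Feb 11 21:55:31 ex01cel04 kernel: mptbase: ioc1: LogInfo(0x31111000):  "
    "Originator={PL}, Code={Reset}, SubCode(0x1000)",
    "Feb 11 21:55:31 ex01cel04 kernel: mptbase: ioc1: LogInfo(0x31130000): "
    "Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)",
]

counts = count_loginfo(sample_lines)
for key, value in sorted(counts.items()):
    print(key, value)
```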

Cellsrv also has threads that are still hung in aio_write.

 

Cause

This is Unpublished Bug 16360464 - PDIT EIS UERCX8027 REBOOT BY CELLSRV DUE TO IO HUNG ON BAD FLASHDISK.

There is a known problem with Aura1 cards: if I/O gets stuck in the kernel, it can take hours to complete, which leads to the problem described above.

The cellsrv process must quiesce I/O whenever it needs to access the flash cache. This operation times out if I/O is stuck at the kernel level, which is why the CELL-02767 error is generated.

Solution

Unpublished Bug 16360464 - PDIT EIS UERCX8027 REBOOT BY CELLSRV DUE TO IO HUNG ON BAD FLASHDISK is fixed in 11.2.3.2.2 and 11.2.3.3.0.

Until the fix is applied, you can reboot the cell as a workaround, following these guidelines:

1.  Verify that all the grid disks on the cell are offline.

The following ASM query can be issued to verify this:

SQL> select failgroup,group_number,mode_status,count(*)  from v$asm_disk group by failgroup,group_number,mode_status;

If the query returns mode_status ONLINE with group_number > 0 for the cell you are trying to shut down, those disks are still part of ASM and cellsrv is still replying to I/O requests. Wait for all the grid disks to go offline before proceeding.

2.  Power down the cell and replace the flash card. Refer to MOS note "Steps to shut down or reboot an Exadata storage cell without affecting ASM" (Doc ID 1188080.1).
3.  Bring the cell back online.
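
The check in step 1 can be sketched in Python, assuming the rows returned by the ASM query have already been fetched as (failgroup, group_number, mode_status, count) tuples. The function name, the failgroup names, and the counts in the sample are invented for illustration. The sketch also assumes that the failgroup name identifies the cell, which should be confirmed for the environment in question.

```python
def online_disk_count(rows, failgroup):
    """Sum the disks of one failgroup that are ONLINE in a mounted disk group.

    rows: iterable of (failgroup, group_number, mode_status, count) tuples,
    in the column order of the query in step 1.
    """
    return sum(
        count
        for row_failgroup, group_number, mode_status, count in rows
        if row_failgroup.upper() == failgroup.upper()
        and group_number > 0
        and mode_status.upper() == "ONLINE"
    )

# Hypothetical query output; names and counts are made up for this example.
sample_rows = [
    ("EX01CEL04", 1, "OFFLINE", 12),
    ("EX01CEL04", 0, "ONLINE", 2),
    ("EX01CEL03", 1, "ONLINE", 12),
]

remaining = online_disk_count(sample_rows, "ex01cel04")
if remaining == 0:
    print("No ONLINE grid disks remain in ASM for this failgroup.")
else:
    print(f"{remaining} grid disks are still ONLINE; wait before shutting down.")
```

Rows with group_number 0 are skipped because the guideline only treats ONLINE disks with group_number > 0 as still being part of ASM.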

References

<BUG:16360464> - PDIT EIS UERCX8027 REBOOT BY CELLSRV DUE TO IO HUNG ON BAD FLASHDISK
<BUG:16405983> - CELL-2767: FLASH CACHE OPERATION TIMED OUT
<NOTE:1188080.1> - Steps to shut down or reboot an Exadata storage cell without affecting ASM

  Copyright © 2018 Oracle, Inc.  All rights reserved.