Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-77-1614703.1
Update Date:2014-06-04
Keywords:

Solution Type  Sun Alert Sure

Solution  1614703.1 :   EXADATA - AURA CARD 1.0 1.1 - BAD FLASHDISK PRODUCES A FAILURE ON CELLSRV DUE TO HUNG IO  


Related Items
  • Exadata Database Machine X2-2 Hardware
  • Oracle Exadata Storage Server Software
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Exadata>DB: Exadata_EST




In this Document
Description
Occurrence
Symptoms
Workaround
Patches
History


Applies to:

Oracle Exadata Storage Server Software - Version 11.2.3.1.0 to 11.2.3.2.1 [Release 11.2]
Exadata Database Machine X2-2 Hardware
Information in this document applies to any platform.

Description

On V2 and X2 Exadata Storage Servers using Aura cards 1.0 or 1.1, the cellsrv process can hang after a FlashDisk failure.

Occurrence

Pre-requisites:

  • Exadata Storage Server using Flashdisk Aura Cards 1.0 or 1.1
To identify the type of Aura card, run the command flash_dom -l |grep 'C0\|B3'|awk '{print $1, $6}'

If the return value is B3, use the "For Aura1.0 cards only" steps below.
  
# flash_dom -l |grep 'C0\|B3'|awk '{print $1, $6}'
1. B3
2. B3
3. B3
4. B3
 

If the return value is C0, use the "For Aura1.1 cards only" steps below.

  
# flash_dom -l |grep 'C0\|B3'|awk '{print $1, $6}'
1. C0
2. C0
3. C0
4. C0
 
  • Aura Cards 1.0 or 1.1 (Exadata V2 and X2) running firmware lower than 01.27.92.00
To identify the current firmware on the Aura cards, run the command /usr/bin/flash_dom -l |grep 'Firmware image'

  
# /usr/bin/flash_dom -l |grep 'Firmware image'
        Firmware image's version is MPTFW-01.27.92.00-IT
        Firmware image's version is MPTFW-01.27.92.00-IT
        Firmware image's version is MPTFW-01.27.92.00-IT
        Firmware image's version is MPTFW-01.27.92.00-IT
 
  • Exadata Storage Server software 11.2.3.2.1
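
The card-type and firmware prerequisites above can be checked mechanically. The following is a minimal sketch, assuming the flash_dom output has the form shown in the examples above. The function names aura_model, fw_versions and fw_lower_than are hypothetical helpers, not part of the Exadata tooling, and the threshold 01.27.92.00 is simply the value stated in the prerequisite.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the Exadata tooling.

# Map the revision code printed by the card-type command to the card model,
# following the B3 / C0 rule stated in the prerequisites.
aura_model() {
    case "$1" in
        B3) echo "Aura 1.0" ;;
        C0) echo "Aura 1.1" ;;
        *)  echo "unknown" ;;
    esac
}

# Read flash_dom -l output on stdin and print each firmware version found.
# Assumes lines such as: Firmware image's version is MPTFW-01.27.92.00-IT
fw_versions() {
    sed -n 's/.*MPTFW-\([0-9][0-9.]*\)-IT.*/\1/p'
}

# Succeed (exit status 0) only if dotted version $1 is strictly lower than $2.
# Fields are compared numerically, left to right.
fw_lower_than() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1   # equal versions are not lower
    }'
}

# Example use on a storage cell (needs flash_dom, so left commented out):
#   /usr/bin/flash_dom -l | fw_versions | while read -r v; do
#       fw_lower_than "$v" 01.27.92.00 && echo "firmware $v is below 01.27.92.00"
#   done
```

The comparison is done field by field rather than as a plain string comparison so that the check does not depend on every field being zero-padded to the same width.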
 

Symptoms

Because of the problem introduced by <bug 16360464>, the possible side effects are:

  • The cellsrv process hangs, and the hang is resolved only by rebooting the Exadata Storage cell.
  • The cellsrv process hangs, but the Exadata Storage cell reboots automatically.

 

The details below describe the case in which cellsrv is stuck but the cell is not rebooted automatically.

 

The sequence of events is:

1. A flashdisk fails, for example Flashdisk FD_02_cel01:



Mon Jan 13 20:03:53 2014
CDHS: Received cd health state change  with newState HEALTH_FAIL guid 946b8c3c-32b5-48ce-ac05-2166f735f513
CDHS: Do cd health state change FD_02_osm2cel01 from HEALTH_GOOD to newState HEALTH_FAIL
FlashLog osm2cel01_FLASHLOG (2277105324, cdisk=FD_02_osm2cel01) is inactive due to inactive flash disk
Flashcache Warning: Pers list file open for write of /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/cellsrv/deploy/config/flashcache.lst failed: errno=2
QuarantineMgr: Fault does not have QM protection (threadID=3, beingMonitored=false)
Errors in file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/osm2cel01/trace/svtrc_10921_3.trc  (incident=9):

The current CELLSRV process at that time was pid 10921.

Mon Jan 13 20:03:56 2014
State dump signal delivered to Cellsrv<10921>


2. Upon a FlashDisk failure, the device is removed at the operating system level. From the file /var/log/messages:



Jan 13 20:04:08 cel01 kernel: mptscsih: ioc1: attempting task abort! (sc=ffff8802d5b293c0)
Jan 13 20:04:08 cel01 kernel: sd 9:0:2:0: [sdt] CDB: Write(10): 2a 00 02 b6 8b 50 00 00 10 00
Jan 13 20:04:12 cel01 kernel: mptbase: ioc1: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Jan 13 20:04:12 cel01 kernel: mptscsih: ioc1: task abort: SUCCESS (sc=ffff8802d5b293c0)

….

Jan 13 20:04:12 cel01 kernel: sd 9:0:2:0: [sdt] CDB: Write(10): 2a 00 02 b6 8b 50 00 00 10 00
Jan 13 20:04:12 cel01 kernel: mptscsih: ioc1: target reset: SUCCESS (sc=ffff8802d5b293c0)
Jan 13 20:04:12 cel01 kernel: mptscsih: ioc1: attempting host reset! (sc=ffff88041afe3580)
Jan 13 20:04:55 cel01 kernel: mptscsih: ioc1: host reset: SUCCESS (sc=ffff88041afe3580)
Jan 13 20:04:55 cel01 kernel: mptbase: ioc1: LogInfo(0x30030501): Originator={IOP}, Code={Invalid Page}, SubCode(0x0501)
Jan 13 20:04:55 cel01 kernel: mptbase: ioc1: LogInfo(0x30030501): Originator={IOP}, Code={Invalid Page}, SubCode(0x0501)



3. However, cellsrv gets stuck trying to fence off the pending I/Os. It is expected to receive an I/O error once the device is removed; instead, the cellsrv process goes into D state (hung), and a series of call stack dumps is logged.




Jan 13 20:05:59 cel01 kernel: cellsrv       ? ffff8805d7083100     0 10921      1 0x00000080
Jan 13 20:05:59 cel01 kernel:  ffff8806034efd68 0000000000000046 ffff8806034efd08 ffffffff810432be
Jan 13 20:05:59 cel01 kernel:  ffff8806092b8100 ffff88060345c300 ffff8806092b84d0 ffffffff81456a38
Jan 13 20:05:59 cel01 kernel:  ffff8806034efd58 ffffffff8107aab0 ffff8806034efd38 00000000ffffffff
Jan 13 20:05:59 cel01 kernel: Call Trace:
Jan 13 20:05:59 cel01 kernel:  [<ffffffff810432be>] ? need_resched+0x23/0x2d
Jan 13 20:05:59 cel01 kernel:  [<ffffffff81456a38>] ? _cond_resched+0xe/0x22
Jan 13 20:05:59 cel01 kernel:  [<ffffffff8107aab0>] ? switch_task_namespaces+0x1d/0x51
Jan 13 20:05:59 cel01 kernel:  [<ffffffff8105b678>] do_exit+0x680/0x699
Jan 13 20:05:59 cel01 kernel:  [<ffffffff8106898a>] ? freezing+0x13/0x15
Jan 13 20:05:59 cel01 kernel:  [<ffffffff8105b731>] sys_exit_group+0x0/0x1b
Jan 13 20:05:59 cel01 kernel:  [<ffffffff8106bd03>] get_signal_to_deliver+0x303/0x328
Jan 13 20:05:59 cel01 kernel:  [<ffffffff8101120a>] do_notify_resume+0x90/0x6d7
Jan 13 20:05:59 cel01 kernel:  [<ffffffff8107a2cf>] ? hrtimer_nanosleep+0x7e/0x102
Jan 13 20:05:59 cel01 kernel:  [<ffffffff81079964>] ? hrtimer_wakeup+0x0/0x26
Jan 13 20:05:59 cel01 kernel:  [<ffffffff8112412f>] ? path_put+0x22/0x27
Jan 13 20:05:59 cel01 kernel:  [<ffffffff8101207e>] int_signal+0x12/0x17
Jan 13 20:05:59 cel01 kernel: cellsrv       D 0000000000000001     0 11401      1 0x00000084
Jan 13 20:05:59 cel01 kernel:  ffff8805d7255b88 0000000000000086 0000000000000000 ffffffffadcb69db
Jan 13 20:05:59 cel01 kernel:  ffff8805fca8e140 ffff880664dca240 ffff8805fca8e510 ffff880028220000
Jan 13 20:05:59 cel01 kernel:  ffff8802cad04000 0000000000000000 ffff8805d7255b68 ffff8805fca8e140

Jan 13 20:06:00 cel01 kernel: Clocksource tsc unstable (delta = -51539592805 ns)
Jan 13 20:06:00 cel01 kernel: Switching to clocksource hpet
Jan 13 20:06:27 cel01 kernel: INFO: task cellsrv:11401 blocked for more than 120 seconds.
Jan 13 20:06:27 cel01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 13 20:06:27 cel01 kernel: cellsrv       D 0000000000000001     0 11401      1 0x00000084
Jan 13 20:06:27 cel01 kernel:  ffff8805d7255b88 0000000000000086 0000000000000000 ffffffffadcb69db
Jan 13 20:06:27 cel01 kernel:  ffff8805fca8e140 ffff880664dca240 ffff8805fca8e510 ffff880028220000
Jan 13 20:06:27 cel01 kernel:  ffff8802cad04000 0000000000000000 ffff8805d7255b68 ffff8805fca8e140
Jan 13 20:06:27 cel01 kernel: Call Trace:
Jan 13 20:06:27 cel01 kernel:  [<ffffffff814569cc>] io_schedule+0x42/0x5c
Jan 13 20:06:27 cel01 kernel:  [<ffffffff811474c0>] __blockdev_direct_IO+0x950/0xb04
Jan 13 20:06:27 cel01 kernel:  [<ffffffff81145f07>] blkdev_direct_IO+0x4e/0x50
Jan 13 20:06:27 cel01 kernel:  [<ffffffff81145f09>] ? blkdev_get_blocks+0x0/0xa6
Jan 13 20:06:27 cel01 kernel:  [<ffffffff810d87c5>] generic_file_direct_write+0xbe/0x130
Jan 13 20:06:27 cel01 kernel:  [<ffffffff810d89d9>] __generic_file_aio_write+0x1a2/0x2b7
Jan 13 20:06:27 cel01 kernel:  [<ffffffff81044aaf>] ? update_curr+0xc9/0xd2
Jan 13 20:06:27 cel01 kernel:  [<ffffffff81144ae5>] blkdev_aio_write+0x30/0x6f
Jan 13 20:06:27 cel01 kernel:  [<ffffffff8111a43f>] do_sync_write+0xe7/0x12b
Jan 13 20:06:27 cel01 kernel:  [<ffffffff81079f59>] ? hrtimer_try_to_cancel+0x40/0x4b
Jan 13 20:06:27 cel01 kernel:  [<ffffffff81077030>] ? autoremove_wake_function+0x0/0x3d
Jan 13 20:06:27 cel01 kernel:  [<ffffffff81079f7d>] ? hrtimer_cancel+0x19/0x25
Jan 13 20:06:27 cel01 kernel:  [<ffffffff811f686c>] ? selinux_file_permission+0x5d/0xb7
Jan 13 20:06:27 cel01 kernel:  [<ffffffff811eb4dc>] ? security_file_permission+0x16/0x18
Jan 13 20:06:27 cel01 kernel:  [<ffffffff8111aba4>] vfs_write+0xb0/0x10a
Jan 13 20:06:27 cel01 kernel:  [<ffffffff8111ac58>] sys_pwrite64+0x5a/0x78
Jan 13 20:06:27 cel01 kernel:  [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b


4. Once cellsrv is hung, different events can follow:



  • Stack dumps of the hung cellsrv tasks may be logged endlessly. The cellsrv process is unresponsive, and once the ASM disk repair timer (disk_repair_time) expires, the grid disks are dropped.
  • An ORA-600 error is reported:

       ORA-00600: internal error code, arguments: [main_8], [1], [The cellinit.ora file contains incorrect parameters.],


Oss boot failed as listener could not receive at ipaddress = 192.168.10.5 and port =5042
CELLSRV failed to start due to the error 1 (The cellinit.ora file contains incorrect parameters.)
Errors in file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/cel01/trace/svtrc_29095_0.trc  (incident=33):
ORA-00600: internal error code, arguments: [main_8], [1], [The cellinit.ora file contains incorrect parameters.], [], [], [], [], [], [], [], [], []
Incident details in: /opt/oracle/cell11.2.3.2.1_LINUX.X64_130109/log/diag/asm/cell/cel01/incident/incdir_33/svtrc_29095_0_i33.trc
Sweep [inc][33]: completed
CELLSRV error - ORA-600 internal error


This ORA-600 error occurs because the original cellsrv process is still alive but hung, so RS cannot check its health.

RS then tries to start a new cellsrv process, which fails to attach to the default port 5042 because the port is still in use by the hung cellsrv.
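
The log signatures in the sequence above can be searched for when triaging a cell. The following is a minimal sketch, assuming the alert log and /var/log/messages lines have the same form as the excerpts in this document. The function names failed_celldisks and hung_cellsrv_tasks are hypothetical helpers, and ALERT_LOG is a placeholder for the cellsrv alert log path, which this document does not state.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the Exadata tooling.

# Read a cellsrv alert log on stdin and print each cell disk whose health
# state changed to HEALTH_FAIL (the "CDHS: Do cd health state change" lines).
failed_celldisks() {
    awk '/^CDHS: Do cd health state change / && $NF == "HEALTH_FAIL" { print $7 }'
}

# Read /var/log/messages content on stdin and print the unique task IDs of
# cellsrv tasks that the kernel reported as blocked (hung task warnings).
hung_cellsrv_tasks() {
    sed -n 's/.*INFO: task cellsrv:\([0-9][0-9]*\) blocked for more than .*/\1/p' | sort -u
}

# Example use on a storage cell (ALERT_LOG is a placeholder path):
#   failed_celldisks < "$ALERT_LOG"
#   hung_cellsrv_tasks < /var/log/messages
```

A failed cell disk followed shortly by hung task warnings for cellsrv matches the pattern described here. On a live cell, the output of ps -eLo lwp,stat,comm could be filtered in a similar way for cellsrv threads whose STAT column starts with D.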

Workaround

Reboot the Exadata Storage Server.

 

Patches

For Exadata Storage cells running software version 11.2.3.2.1, apply firmware 01.27.95.00 available via <patch 16360464>

Upgrade to Exadata Storage software version 11.2.3.3.0

History

[15-JAN-2014] - creation


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.