Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1514687.1
Update Date: 2016-09-05
Keywords:

Solution Type  Problem Resolution Sure

Solution  1514687.1 :   Linux kernel faulty on unmap_single on Exadata compute nodes  


Related Items
  • Exadata Database Machine V2
  • Exadata Database Machine X2-2 Half Rack
  • Exadata Database Machine X2-2 Full Rack
  • Oracle Exadata Hardware
  • Exadata Database Machine X2-2 Qtr Rack
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Exadata>DB: Exadata_EST




Created from <SR 3-6517363331>

Applies to:

Exadata Database Machine X2-2 Full Rack - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine X2-2 Half Rack - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine X2-2 Qtr Rack - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine V2 - Version All Versions to All Versions [Release All Releases]
Oracle Exadata Hardware - Version 11.2.0.2 to 11.2.0.3 [Release 11.2]
x86 64 bit

Symptoms

  • One of the compute nodes rebooted.
  • No error messages appear in the GI alert log, ocssd log, CSS agent log, CSS monitor agent log, or diskmon log.
  • No error messages appear in the OS log.
  • OS Watcher (OSW) doesn't show any resource problem before the node rebooted.
  • According to the ocssd logs on the other nodes, both the network heartbeat and the disk heartbeat were lost at the time OSW logging stopped.
  • The ILOM snapshot captured the following console output before the last reboot:
Buffer I/O error on device sdb, logical block 0
scsi 13:0:0:0: rejecting I/O to dead device
Buffer I/O error on device sdb, logical block 0
 unable to read partition table
RDS/IB: connected to 172.16.3.16 version 3.1
RPC: bad TCP reclen 0x00000000 (non-terminal)
RPC: bad TCP reclen 0x00000000 (non-terminal)
RPC: bad TCP reclen 0x00000000 (non-terminal)
RPC: bad TCP reclen 0x00000000 (non-terminal)
Unable to handle kernel paging request at ffff80fd41180640 RIP:
 [<ffffffff8007e582>] unmap_single+0x24/0xc6
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /class/infiniband_mad/umad0/port
CPU 4
Modules linked in: sr_mod cdrom oracleacfs(PFU) oracleadvm(PFU) oracleoks(PU) nfs lockd fscache nfs_acl krg_8_5_0_3005(PFU) ipmi_poweroff ipmi_watchdog ipmi_devintf ipmi_si(U) ipmi_msghandler sunrpc bonding(U) iscsi_tcp libiscsi scsi_transport_iscsi rds(U) ib_ipoib(U) ipoib_helper(U) ipv6 xfrm_nalgo crypto_api rdma_ucm(U) rdma_cm(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) ib_cm(U) iw_cm(U) ib_addr(U) ib_sa(U) dm_mirror dm_log dm_multipath scsi_dh dm_mod video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac fuse(U) parport_pc lp parport mlx4_ib(U) ib_mad(U) ib_core(U) sg joydev shpchp ahci mlx4_core(U) igb i2c_i801 i2c_core pcspkr usb_storage ata_piix libata cciss(U) megaraid_sas(U) sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 11092, comm: oracle Tainted: PF     2.6.18-128.1.16.0.1.el5 #1
RIP: 0010:[<ffffffff8007e582>]  [<ffffffff8007e582>] unmap_single+0x24/0xc6
RSP: 0018:ffff81048fb23cd8  EFLAGS: 00010202
RAX: ffff8100190c1000 RBX: 00000000ceba0300 RCX: 0000000000000000
RDX: ffffffffa5017ec8 RSI: 41e9f62821005000 RDI: ffff81127ff18870
RBP: 00083d4ea5017ec8 R08: ffff810009000000 R09: 00007f0000000000
R10: 0000000000000000 R11: 0000000000000046 R12: 0000000000000000
R13: 000000000000003d R14: ffff81127ff18870 R15: 0000000000000000
FS:  00002b9ac7839e60(0000) GS:ffff810140de5d40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff80fd41180640 CR3: 0000000fd9ac0000 CR4: 00000000000006a0
Process oracle (pid: 11092, threadinfo ffff81048fb22000, task ffff810a66e4a820)
Stack:  0000000000000286 ffff8103561da000 0000000000000000 ffffffff80150a56
 ffff8112422b7840 ffff81126dcc3cc0 ffff8112422b7858 ffff81125642d1c0
 000000000000014a ffffffff8848da6f ffff8112422b7840 ffff81126dcc3cc0
Call Trace:
 [<ffffffff80150a56>] swiotlb_unmap_sg+0xba/0x126
 [<ffffffff8848da6f>] :rds:__rds_ib_teardown_mr+0x3d/0xa3
 [<ffffffff8848dca1>] :rds:rds_ib_flush_mr_pool+0x1cc/0x2c7
 [<ffffffff8848ddb6>] :rds:rds_ib_flush_mrs+0x1a/0x2e
 [<ffffffff8848475f>] :rds:rds_release+0x70/0xe5
 [<ffffffff80055562>] sock_release+0x19/0x9a
 [<ffffffff8005575d>] sock_close+0x2c/0x30
 [<ffffffff80012e22>] __fput+0xae/0x198
 [<ffffffff80023de6>] filp_close+0x5c/0x64
 [<ffffffff8001e333>] sys_close+0x88/0xbd
 [<ffffffff885e7b66>] :krg_8_5_0_3005:_close_origcode+0x78/0x1e2
 [<ffffffff885e3b6a>] :krg_8_5_0_3005:_close_postcode+0x0/0x229
 [<ffffffff885df7a1>] :krg_8_5_0_3005:syscall_wrappers_generic_flow+0x1f6/0x514
 [<ffffffff885e7aee>] :krg_8_5_0_3005:_close_origcode+0x0/0x1e2
 [<ffffffff885dfd8c>] :krg_8_5_0_3005:SYS_close_common_wrap+0x46/0xed
 [<ffffffff885e1588>] :krg_8_5_0_3005:SYS_close_wrap64+0x25/0x41
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0
  • The compute node is running OEL 5.3:
# imageinfo
Kernel version: 2.6.18-128.1.16.0.1.el5 #1 SMP Tue Jun 30 16:48:30 EDT 2009 x86_64
Image version: 11.2.2.4.2.111221
Image activated: 2012-04-15 13:12:42 -0500
Image status: success
System partition on device: /dev/sda1

# rpm -qa|grep ofa
ofa-2.6.18-128.1.16.0.1.el5-1.4.2-14
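
To check whether a captured console log shows the same signature as the symptoms above, the faulting symbol and the call-trace frames can be pulled out of the text. The following is a minimal sketch, not an Oracle tool: the function names `parse_oops` and `matches_signature` and the regular expression are illustrative assumptions, and the "signature" tested here (a fault in `unmap_single` with at least one `rds_ib` frame from the `rds` module in the trace) is simply what the console output in this note shows.

```python
import re

# Matches "[<addr>] symbol+0xoff/0xlen", with an optional ":module:" prefix,
# as printed by 2.6.18-era kernels in the console output above.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(\w+):)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def parse_oops(text):
    """Return (faulting_symbol, [(module, symbol), ...]) from console text.

    The faulting symbol is the first frame seen before "Call Trace:";
    module is None for frames that belong to the core kernel.
    """
    rip = None
    frames = []
    in_trace = False
    for line in text.splitlines():
        if line.strip().startswith("Call Trace:"):
            in_trace = True
            continue
        m = FRAME_RE.search(line)
        if not m:
            continue
        if in_trace:
            frames.append((m.group(1), m.group(2)))
        elif rip is None:
            rip = m.group(2)
    return rip, frames

def matches_signature(text):
    """True if the fault is in unmap_single and the trace passes through RDS/IB."""
    rip, frames = parse_oops(text)
    via_rds = any(mod == "rds" and "rds_ib" in sym for mod, sym in frames)
    return rip == "unmap_single" and via_rds

# A few lines copied from the console output in this note.
SAMPLE_CONSOLE = """\
Unable to handle kernel paging request at ffff80fd41180640 RIP:
 [<ffffffff8007e582>] unmap_single+0x24/0xc6
Call Trace:
 [<ffffffff80150a56>] swiotlb_unmap_sg+0xba/0x126
 [<ffffffff8848da6f>] :rds:__rds_ib_teardown_mr+0x3d/0xa3
 [<ffffffff8848dca1>] :rds:rds_ib_flush_mr_pool+0x1cc/0x2c7
"""

if __name__ == "__main__":
    print(matches_signature(SAMPLE_CONSOLE))
```

A match only suggests that the crash resembles the one described here; confirming the root cause still requires the bug analysis referenced under Cause.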

 

Cause

Two bugs were opened for this kernel crash. <Bug 13034913> was confirmed as a duplicate of <Bug 11847244>.

 

Solution

The fix was delivered in OEL 5.5. However, it cannot be backported to OEL 5.3.

Upgrade the kernel and the ofa package to the following versions or higher:

kernel 2.6.18-194.3.1.0.3.el5
ofa-2.6.18-194.3.1.0.3.el5-1.5.1-4.0.47.x86_64.rpm
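
As a rough aid for checking a node against the minimum kernel listed above, the release string reported by `imageinfo` or `uname -r` can be compared numerically. This is a simplified sketch under stated assumptions: the helper names `kernel_key` and `needs_upgrade` are hypothetical, and the comparison only looks at the numeric fields before the `.el5` suffix, so it is not a full implementation of RPM version ordering.

```python
import re

# Minimum kernel release named in this note.
MIN_FIXED_KERNEL = "2.6.18-194.3.1.0.3.el5"

def kernel_key(release):
    """Turn '2.6.18-194.3.1.0.3.el5' into (2, 6, 18, 194, 3, 1, 0, 3).

    Everything from '.el' onward is dropped so the distribution tag
    does not take part in the comparison.
    """
    base = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", base))

def needs_upgrade(running, minimum=MIN_FIXED_KERNEL):
    """True if the running kernel release is older than the minimum."""
    return kernel_key(running) < kernel_key(minimum)

if __name__ == "__main__":
    # Kernel release from the imageinfo output shown under Symptoms.
    print(needs_upgrade("2.6.18-128.1.16.0.1.el5"))
```

The same idea applies to the ofa package, whose name embeds the kernel release it was built for, but the package version should still be confirmed with `rpm -qa | grep ofa` as shown under Symptoms.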
 

In some cases, the storage software was upgraded from an older version, so the kernel was updated on the storage cells but remained old on the compute nodes. In that situation, the compute nodes can obtain the kernel/ofa update from a storage cell. Please refer to <note 1284070.1>.

 

References

<NOTE:1284070.1> - Updating key software components on database hosts to match those on the cells
<BUG:13034913> - EXADATA LONDVOP0101 SYSTEM PANIC
<BUG:11847244> - NODE APPEARS HUNG PRIOR TO REBOOT

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.