Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution 1632242.1: Snapshot-Based Backup via NFS over InfiniBand Network Will Freeze the Node
Created from <SR 3-7822700871>

Applies to:
Exadata Database Machine V2 - Version All Versions and later
Information in this document applies to any platform.

Symptoms
The backup server hangs when performing a snapshot-based backup from an Exadata compute node via NFS over the InfiniBand network. The same backup procedure works fine over a standard 10Gb Ethernet card with default settings.

The messages file on the node shows the following entries while the backup is running:

Aug 28 10:27:43 exahostdb01 lvm[79582]: Monitoring snapshot VGExaDb-u01_snap <<<
..
Aug 28 10:31:14 exahostdb01 kernel: nfs: server exaNFSbackup not responding, still trying
Aug 28 10:39:45 exahostdb01 kernel: nfs: server exaNFSbackup OK
Aug 28 10:55:00 exahostdb01 kernel: nfs: server exaNFSbackup not responding, still trying
...
Aug 28 11:01:10 exahostdb01 kernel: nfs: server exaNFSbackup OK
Aug 28 11:13:00 exahostdb01 kernel: nfs: server exaNFSbackup not responding, still trying
Aug 28 11:13:53 exahostdb01 kernel: nfs: server exaNFSbackup OK
Aug 28 12:41:20 exahostdb01 kernel: RDS/IB: re-connect to 169.XXX.XXX.XXX is stalling for more than 1 min...(drops=12 err=0)
Aug 28 12:41:20 exahostdb01 kernel: RDS/IB: re-connect to 169.XXX.XXX.XXX is stalling for more than 1 min...(drops=12 err=0)
Aug 28 12:41:58 exahostdb01 kernel: RDS/IB: re-connect to 10.XXX.XXX.XXX is stalling for more than 1 min...(drops=1 err=0)
Aug 28 14:13:49 exahostdb01 kernel: RDS/IB: connected to 10.XXX.XXX.XXX version 3.1
Aug 28 14:16:47 exahostdb01 kernel: RDS/IB: connected to 169.XXX.XXX.XXX version 3.1
Aug 28 14:16:47 exahostdb01 kernel: RDS/IB: connected to 169.XXX.XXX.XXX version 3.1
....
Sep  5 12:16:12 exahostdb01 kernel: INFO: task lsof:65691 blocked for more than 120 seconds.
Sep  5 12:16:12 exahostdb01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  5 12:16:12 exahostdb01 kernel: lsof          D 0000000000000000     0 65691  65680 0x00000080
Sep  5 12:16:12 exahostdb01 kernel: ffff88116ec7bc08 0000000000000082 0000000000000000 ffffffffadf60c48
Sep  5 12:16:12 exahostdb01 kernel: ffff88355e6ea080 ffffffff81aae4c0 ffff88355e6ea450 0000000176b6fa52
Sep  5 12:16:12 exahostdb01 kernel: 000000006ec7bc98 0000000000000000 0000000000000000 ffff88355e6ea080
Sep  5 12:16:12 exahostdb01 kernel: Call Trace:
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff814569cc>] io_schedule+0x42/0x5c
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffffa0614b02>] nfs_wait_bit_uninterruptible+0xe/0x12 [nfs]
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff81456efb>] __wait_on_bit+0x4a/0x7c
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffffa0614af4>] ? nfs_wait_bit_uninterruptible+0x0/0x12 [nfs]
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffffa0614af4>] ? nfs_wait_bit_uninterruptible+0x0/0x12 [nfs]
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff81456fa0>] out_of_line_wait_on_bit+0x73/0x80
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff8107706d>] ? wake_bit_function+0x0/0x2f
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffffa0614af2>] nfs_wait_on_request+0x2b/0x2d [nfs]
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffffa0618a6c>] nfs_sync_mapping_wait+0xec/0x1fa [nfs]
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffffa0619073>] nfs_write_mapping+0x77/0x9e [nfs]
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff810432d6>] ? should_resched+0xe/0x2f
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffffa06190b4>] nfs_wb_nocommit+0x1a/0x1c [nfs]
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffffa060e184>] nfs_getattr+0x61/0xef [nfs]
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff8111ea7b>] vfs_getattr+0x4c/0x69
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff8111eae8>] vfs_fstatat+0x50/0x67
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff8111ebe5>] vfs_stat+0x1b/0x1d
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff8111ec06>] sys_newstat+0x1f/0x39
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff810a9d23>] ? audit_syscall_entry+0x103/0x12f
Sep  5 12:16:12 exahostdb01 kernel: [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
...
Sep 10 16:00:39 exahostdb01 kernel: ixgbe 0000:20:00.0: eth0: NIC Link is Up 1 Gbps, Flow Control: RX/TX
Sep 10 16:00:39 exahostdb01 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Sep 10 16:00:41 exahostdb01 kernel: ib0: packet len 2398 (> 2048) too long to send, dropping
Sep 10 16:00:41 exahostdb01 last message repeated 2 times
Sep 10 16:11:34 exahostdb01 kernel: ixgbe 0000:30:00.1: eth5: NIC Link is Down <<<<

bondib0   Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00   (ib0 + ib1)
          inet addr:10.x.x.x  Bcast:10.x.x.255  Mask:255.255.255.0
          inet6 addr: fe80::221:2800:1fc:b3ed/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:65520  Metric:1
Cause
The stack shows that the NFS process was waiting for the task status in the "nfs_wait_bit_uninterruptible" function, stuck in an uninterruptible state because communication with the source location was failing. The logs above clearly show that the underlying IB devices that are part of bondib0 were intermittently going down and then rejoining the bond, so this is a communication issue on the client side: network communication over the source interface (bondib0) was down. The root cause is that the MTU of the IB device was left at its default value of 65520 (64K). This is a common issue with InfiniBand when the MTU is set to a large value.
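The symptom can be confirmed from the node itself. A minimal diagnostic sketch, assuming the standard Exadata interface names bondib0/ib0/ib1 (adjust for your environment):

```shell
# Show the current MTU of the IB bond; the problem case reports mtu 65520
ip link show bondib0 | grep -o 'mtu [0-9]*'

# Check whether the slave interfaces have been flapping
grep -E 'ib[01].*(Link|down|up)' /var/log/messages | tail

# Look for the NFS timeouts and RDS/IB reconnect stalls seen above
grep -E 'nfs: server .* not responding|RDS/IB: re-connect' /var/log/messages | tail
```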
Solution
1) Reduce the MTU of the IB device to 7000 and restart the network service.

References
<NOTE:1546861.1> - [Linux OS] System Hung with Large Numbers of Page Allocation Failures with "order:5" on Exadata Environments
<NOTE:1586212.1> - How to Change MTU Size in Exadata Environment

Attachments
This solution has no attachment
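As an illustration only, the MTU change could be sketched as below; the ifcfg file names assume the standard Exadata bondib0/ib0/ib1 layout, and <NOTE:1586212.1> above describes the supported procedure:

```shell
# Set MTU=7000 in the ifcfg files for the bond and its slave interfaces
# (back the files up first; restarting the network is disruptive on a live node)
for ifc in bondib0 ib0 ib1; do
    f=/etc/sysconfig/network-scripts/ifcfg-$ifc
    cp -p "$f" "$f.bak"
    if grep -q '^MTU=' "$f"; then
        sed -i 's/^MTU=.*/MTU=7000/' "$f"   # replace an existing MTU entry
    else
        echo 'MTU=7000' >> "$f"             # or add one if it is missing
    fi
done

# Restart networking so the new MTU takes effect
service network restart

# Verify the bond now reports the reduced MTU
ip link show bondib0 | grep -o 'mtu [0-9]*'
```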