Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1638517.1
Update Date: 2018-05-10
Keywords:

Solution Type: Problem Resolution Sure

Solution 1638517.1: Oracle ZFS Storage Appliance: Oracle Linux / Red Hat Enterprise Linux - Client I/O Error During FC LUN Failover


Related Items
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS5-4
  • Oracle ZFS Storage ZS3-BA
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS5-2
  • Sun Storage 7310 Unified Storage System
  • Oracle ZFS Storage ZS4-4
  • Sun ZFS Storage 7120
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: ZS




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-8466918751>

Applies to:

Sun ZFS Storage 7120 - Version All Versions to All Versions [Release All Releases]
Sun Storage 7410 Unified Storage System - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7320 - Version All Versions to All Versions [Release All Releases]
Sun Storage 7310 Unified Storage System - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7420 - Version All Versions to All Versions [Release All Releases]
7000 Appliance OS (Fishworks)

Symptoms

Appliance version: 7420
Appliance SW version: 2013.06.05.1.1,1-1.2
Client OS type/version: Red Hat Enterprise Linux 6.2

When testing takeover and failback on a ZFS Storage Appliance 7420 cluster, I/O errors appear in the Red Hat client logs for a short period.

This also occurs when one of the heads in the cluster is rebooted.

The SAN switch connection provides four FC paths to the attached ZFS Storage Appliance, and the clients use the Red Hat 6.2 multipath daemon. In all scenarios, two paths remain active while writing data for 30-40 seconds.

The two configured FC LUNs:

Views for 600144F09F6B2DCA0000528C677C000C:
   Data File    : /dev/zvol/rdsk/db_pool_02/local/Composition/Oracle
   Host group   : dc1plogdb01
   Target Group : tgt-dc101
   LUN          : 0
Views for 600144F09F6B2DCA000052DD19DB0001:
   Data File    : /dev/zvol/rdsk/db_pool_02/local/Composition/Archive
   Host group   : dc1plogdb01
   Target Group : tgt-dc101
   LUN          : 2
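
On the Red Hat client, these LUNs can be matched to their multipath devices. A minimal sketch using the standard RHEL 6 tools (the device-mapper WWID is the LUN GUID prefixed with the NAA identifier type digit, '3'):

# List the multipath maps and match the WWIDs against the LUN GUIDs above
multipath -ll | grep -i 600144f

# Query the SCSI WWID of an individual path device
/lib/udev/scsi_id --whitelisted --device=/dev/sdc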

  

The following logs were captured while the Red Hat client reported I/O offline messages:

### SCENARIO - 1 - Red Hat client messages after a reboot of the appliance:

Jan 29 17:29:46 dc1plogdb02 kernel: rport-6:0-0: blocked FC remote port time out: removing target and saving binding
Jan 29 17:29:46 dc1plogdb02 kernel: sd 6:0:0:0: alua: Detached
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 6:0:0:0: rejecting I/O to offline device
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 6:0:0:0: rejecting I/O to offline device
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 6:0:0:0: rejecting I/O to offline device

Jan 29 17:29:46 dc1plogdb02 kernel: scsi 6:0:0:0: [sdc] Unhandled error code
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 6:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 6:0:0:0: [sdc] CDB: Write(10): 2a 00 04 77 cc 88 00 00 08 00
Jan 29 17:29:46 dc1plogdb02 kernel: end_request: I/O error, dev sdc, sector 74960008

Jan 29 17:29:46 dc1plogdb02 kernel: scsi 6:0:0:0: [sdc] CDB: Write(10): 2a 00 04 77 ce 90 00 02 00 00
Jan 29 17:29:46 dc1plogdb02 kernel: end_request: I/O error, dev sdc, sector 74960528
Jan 29 17:29:46 dc1plogdb02 kernel: device-mapper: multipath: Failing path 8:32.
Jan 29 17:29:46 dc1plogdb02 kernel: rport-3:0-4: blocked FC remote port time out: removing target and saving binding
Jan 29 17:29:46 dc1plogdb02 kernel: lpfc 0000:0d:00.0: 0:(0):0203 Devloss timeout on WWPN 21:00:00:24:ff:35:93:f6 NPort x010e00 Data: x0 x7 x0
Jan 29 17:29:46 dc1plogdb02 kernel: sd 3:0:1:0: alua: Detached
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 3:0:1:0: rejecting I/O to offline device

Jan 29 17:29:46 dc1plogdb02 kernel: scsi 3:0:1:0: rejecting I/O to dead device
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 3:0:1:0: rejecting I/O to dead device
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 3:0:1:0: [sdb] Unhandled error code
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 3:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 29 17:29:46 dc1plogdb02 kernel: scsi 3:0:1:0: [sdb] CDB: Write(10): 2a 00 04 76 df 40 00 00 08 00
Jan 29 17:29:46 dc1plogdb02 kernel: end_request: I/O error, dev sdb, sector 74899264

Jan 29 17:29:47 dc1plogdb02 multipathd: mpathc: load table [0 262144000 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 2 1 8:48 1 8:64 1]
Jan 29 17:29:47 dc1plogdb02 kernel: sd 3:0:0:0: alua: port group 01 state A supports toluSnA
Jan 29 17:29:47 dc1plogdb02 kernel: sd 6:0:1:0: alua: port group 01 state A supports toluSnA
Jan 29 17:29:47 dc1plogdb02 multipathd: sdb [8:16]: path removed from map mpathc
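
The Write(10) CDBs in these messages encode the failing logical block address, so each error can be cross-checked against the sector reported by end_request: in a Write(10) CDB, bytes 2-5 hold the 32-bit LBA and bytes 7-8 the transfer length. For the first error above:

CDB: 2a 00 04 77 cc 88 00 00 08 00
     LBA             = 0x0477cc88 = 74960008   (matches "end_request: I/O error, dev sdc, sector 74960008")
     transfer length = 0x0008     = 8 blocks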

  

###  See http://www.sourceware.org/lvm2/wiki/MultipathUsageGuide for details.

 

###  Multipath.txt

SCENARIO - 1 - multipath -ll output

mpathc (3600144f09164a66c000052d4fa290003) dm-2 SUN,ZFS Storage 7420
size=125G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| |- 3:0:1:0 sdb 8:16 failed faulty running  
| `- #:#:#:# -   #:#  failed faulty running                 <<<< failed
`-+- policy='round-robin 0' prio=130 status=active
 |- 3:0:0:0 sdd 8:48 active ready  running
 `- 6:0:1:0 sde 8:64 active ready  running
mpathc (3600144f09164a66c000052d4fa290003) dm-2 ,
size=125G features='0' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| |- #:#:#:# -   #:#  failed faulty running
| `- #:#:#:# -   #:#  failed faulty running
`-+- policy='round-robin 0' prio=130 status=active
 |- 3:0:0:0 sdd 8:48 active ready running
 `- 6:0:1:0 sde 8:64 active ready running
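
Path state on the client can be watched while the takeover/failback test runs; a minimal sketch using the standard multipath tools (the map name mpathc is taken from the output above):

# Re-print the multipath topology every few seconds during the test
watch -n 5 'multipath -ll mpathc'

# Or query the running multipath daemon directly
multipathd -k'show paths'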

 

Changes

 

Cause

This is expected behavior: I/O to a single LUN is interrupted for about 30 seconds during a LUN failover.

There is always a delay when failing over LUNs.

The important point is that when I/O transitions from one path to the alternate path, there is no I/O loss.

From the multipath -ll output we can see that the first path group has changed to priority 0 - status enabled, failed/faulty, but still running - while the alternate path group is active and ready at priority 130:

mpathc (3600144f09164a66c000052d4fa290003) dm-2 ,
size=125G features='0' hwhandler='1 alua' wp=rw

|-+- policy='round-robin 0' prio=0 status=enabled
| |- #:#:#:# -   #:#  failed faulty running
| `- #:#:#:# -   #:#  failed faulty running

`-+- policy='round-robin 0' prio=130 status=active
 |- 3:0:0:0 sdd 8:48 active ready running
 `- 6:0:1:0 sde 8:64 active ready running


##  Later we can see that all paths are restored:

mpathc (3600144f09164a66c000052d4fa290003) dm-2 SUN,ZFS Storage 7420
size=125G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw

|-+- policy='round-robin 0' prio=130 status=active
| |- 3:0:0:0 sdd 8:48 active ready running
| `- 6:0:1:0 sde 8:64 active ready running

`-+- policy='round-robin 0' prio=1 status=enabled
 |- 3:0:1:0 sdb 8:16 active ready running
 `- 6:0:0:0 sdc 8:32 active ready running

 

Solution

###  Please refer to the Oracle Technical White Paper (January 2014), "Understanding the Use of Fibre Channel in the Oracle ZFS Storage Appliance":


http://www.oracle.com/technetwork/server-storage/sun-unified-storage/documentation/o12-019-fclun-7000-rs-1559284.pdf

Under "Testing the Configuration for Failure and Recovery Scenarios", it describes a small disruption in the transfer of data:

####  Failure of Links to Both Active Ports of Oracle ZFS Storage Node  ####

This is a double failure scenario. Failing both links to the Oracle ZFS Storage Appliance with paths to the FC LUN results in halting the I/O to the LUN until the links are reestablished.

Failover of the data traffic to the node with the standby path will occur when a node failover is initiated.


###  Oracle ZFS Storage Node Failure ###

Triggering a node takeover from a node that is actively serving FC LUNs results in the I/O for those LUNs being taken over by the requesting node.

The failover for a single LUN takes about 30 seconds.
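
A takeover/failback test of this kind can be driven from the appliance CLI; a minimal sketch, assuming the cluster commands documented for the appliance software release in use:

zfssa:> configuration cluster
zfssa:configuration cluster> takeover    (force this head to import the peer's resources)
zfssa:configuration cluster> failback    (later, return the resources to the peer)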

 

===================================================================================

### For Oracle Linux, please refer to these documents for guidance:

White Paper:  Configuring Multipathing for Oracle Linux and the Oracle ZFS Storage Appliance:

http://www.oracle.com/technetwork/server-storage/sun-unified-storage/documentation/multipath-linux-zfssa-2035247.pdf

Sun Storage J4500 Array System Overview  - Enabling and Disabling Multipathing in the Linux Operating System

http://docs.oracle.com/cd/E19122-01/j4500.array/820-3163/bcghjife.html
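
If multipathing is not yet enabled on the Linux client, the stock helper shipped with RHEL 6 / Oracle Linux 6 can generate a default configuration and start the daemon; a minimal sketch:

# Create a default /etc/multipath.conf and start multipathd
mpathconf --enable --with_multipathd y

# Confirm the daemon is running
service multipathd status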
====================================================================================

### Red Hat 6.2 client settings ###

The I/O path-switching behavior can be tuned in the file /etc/multipath.conf:

Red Hat 6.2 - multipath.conf - rr_min_io: specifies the number of I/O requests to route to a path before switching to the next path in the current path group.


## From the Red Hat documentation (p. 8): this parameter is no longer used in Red Hat 6.2

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/pdf/DM_Multipath/Red_Hat_Enterprise_Linux-6-DM_Multipath-en-US.pdf


1.1.3. New and Changed Features for Red Hat Enterprise Linux 6.2

Red Hat Enterprise Linux 6.2 includes the following documentation and feature updates and changes.

The Red Hat Enterprise Linux 6.2 release provides a new multipath.conf parameter, rr_min_io_rq, in the defaults, devices, and multipaths sections of the multipath.conf file.

The rr_min_io parameter no longer has an effect in Red Hat Enterprise Linux 6.2.

For information on the rr_min_io_rq parameter, see Chapter 4, The DM-Multipath Configuration File.
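
A hedged illustration of using the replacement parameter for the appliance LUNs follows; the values are examples only, not Oracle-qualified settings - refer to the multipathing white paper above for the supported configuration:

# Illustrative /etc/multipath.conf fragment (example values, not qualified recommendations)
devices {
    device {
        vendor               "SUN"
        product              "ZFS Storage 7420"
        path_grouping_policy group_by_prio
        prio                 alua
        hardware_handler     "1 alua"
        failback             immediate
        no_path_retry        queue     # keep queueing I/O while all paths are down
        rr_min_io_rq         1         # requests per path before switching (replaces rr_min_io)
    }
}

# Reload the configuration after editing
service multipathd reload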

====================================================================================

 

Checked for relevancy - 10-May-2018

References

<NOTE:1628999.1> - Oracle ZFS Storage Appliance: How to set up Client Multipathing
http://www.sourceware.org/lvm2/wiki/MultipathUsageGuide
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/DM_Multipath/MPIO_Overview.html#s1-ov-newfeatures-6.2-dmmultipath
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/pdf/DM_Multipath/Red_Hat_Enterprise_Linux-6-DM_Multipath-en-US.pdf
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/DM_Multipath/config_file_defaults.html

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.