Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2003660.1
Update Date: 2018-01-05
Keywords:

Solution Type: Problem Resolution Sure Solution

Solution  2003660.1 :   Oracle ZFS Storage Appliance: Solaris client ZFS pool (constructed from FC LUNs exported from ZFS-SA) becomes suspended due to appliance takeover.  


Related Items
  • Sun ZFS Storage 7320
  • Sun Storage 7210 Unified Storage System
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7420
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7120
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-10406486941>

Applies to:

Sun Storage 7410 Unified Storage System - Version All Versions to All Versions [Release All Releases]
Sun Storage 7310 Unified Storage System - Version All Versions to All Versions [Release All Releases]
Sun Storage 7210 Unified Storage System - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-2 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7320 - Version All Versions to All Versions [Release All Releases]
7000 Appliance OS (Fishworks)

Symptoms

A Solaris 11.2 client with STMS/MPxIO configured reports its zpool as 'suspended' when there is a takeover on the ZFS appliance.

The Fibre Channel (FC) LUNs are mirrored by ZFS on the Solaris client.

# zpool status data01
   pool: data01
  state: SUSPENDED
 status: One or more devices are unavailable in response to IO failures.
         The pool is suspended.
 action: Make sure the affected devices are connected, then run 'zpool clear' or
         'fmadm repaired'.
         Run 'zpool status -v' to see device specific details.
    see: http://support.oracle.com/msg/ZFS-8000-HC
   scan: resilvered 5.57M in 0h0m with 0 errors on Tue Mar 24 17:28:47 2015
 config:
         NAME                                       STATE     READ WRITE CKSUM
         data01                                     SUSPENDED     0   110    0
           mirror-0                                 ONLINE       0   130     0
             c0t600144F0B97C139B00005510F3350002d0  ONLINE       0   140     0
             c0t600144F0D232395600005510F2A90001d0  ONLINE       0   138     0
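
For reference, the recovery path named in the 'action' text above looks like this once failback has completed and both paths are restored (a console sketch using this example's pool name data01):

```shell
# List outstanding faults to confirm what FMA flagged:
fmadm faulty

# Clear the suspended state and resume I/O on the example pool:
zpool clear data01

# Verify the pool is back online:
zpool status -v data01
```

Note that 'zpool clear' only succeeds once the underlying LUN paths are reachable again.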

 

The ZFS appliance shows a short takeover/failback time.

 

The ZFS-SA exported FC LUNs are configured correctly with a target and host group configured.

Solaris client FMA shows probe failures on both sides of the mirrored LUNs before I/O was suspended.

Mar 24 17:19:13 ZFS-8000-NX    fault.fs.zfs.vdev.probe_failure  600144f0b97c139b00005510f3350002  <<--
Mar 24 17:19:13 ZFS-8000-FD    fault.fs.zfs.vdev.io  600144f0b97c139b00005510f3350002
Mar 24 17:19:14 ZFS-8000-NX    fault.fs.zfs.vdev.probe_failure  600144f0d232395600005510f2a90001  <<--
Mar 24 17:19:15 ZFS-8000-FD    fault.fs.zfs.vdev.io   600144f0d232395600005510f2a90001
Mar 24 17:31:27 ZFS-8000-8A    fault.fs.zfs.object.corrupt_data pool_name=data01
Mar 24 17:31:29 ZFS-8000-HC    fault.fs.zfs.io_failure_wait pool_name=data01 <<-- suspended I/O

 

The 'rm.ak' and 'debug.sys' logs show:

Tue Mar 24 06:19:09 2015: takeover completed in 4.107s 
Mar 24 06:19:10 BRSUA2-SAN-HEAD02 fct: [ID 469330 kern.notice] NOTICE: qlt0,0  LINK UP, portid ef, topology Private Loop, speed 8G.
Tue Mar 24 06:27:58 2015: ak_rm_fail_back phase 1 complete in 2.997s
Tue Mar 24 06:28:03 2015: ak_rm_fail_back phase 2 complete in 4.706s
Mar 24 06:28:04 brsua2-san-head01 fct: [ID 469330 kern.notice] NOTICE: qlt0,0  LINK UP, portid ef, topology Private Loop, speed 8G.

 

Changes

FC was directly connected to the ZFS appliance without an FC switch.

 

Cause

Connectivity options: Point-to-Point (FC-P2P) and switch-attach (FC-SW) connectivity are supported except where specifically noted.

No support is provided for arbitrated loop (FC-AL) connectivity.

 

Solution

FC direct connection supportability is documented in the ZFSSA Interoperability Testing Matrix:

https://stbeehive.oracle.com/teamcollab/wiki/ZFSSA+Interop:ZFSSA+Interoperability+Testing+Matrix+-+2013.1.3.0#Fiber+Channel

Connectivity options: Point-to-Point (FC-P2P) and switch-attach (FC-SW) connectivity are supported except where specifically noted. No support is provided for arbitrated loop (FC-AL) connectivity.

 

The 16Gb QLogic FC HBA documentation indicates that FC-AL connections are not supported at 16Gb:

Topologies supported: FC-SW switched fabric (N_Port), FC-AL arbitrated loop (not supported at 16 Gb) (NL_Port), and Point-to-Point (N_Port)

http://docs.oracle.com/cd/E24651_01/html/E24460/z40003111016271.html#scrolltoc

 

In this case, the Solaris initiator should be forced to use Fibre Channel Point-to-Point (FC-P2P) by setting connection-options=1 in /kernel/drv/qlc.conf.
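
The single change can be sketched as the following /kernel/drv/qlc.conf fragment (the value meanings below are the standard QLogic connection options; verify them against the comments in your own qlc.conf):

```shell
# /kernel/drv/qlc.conf -- force the QLogic initiator to FC-P2P
#   0 = loop only
#   1 = point-to-point only
#   2 = loop preferred, otherwise point-to-point
connection-options=1;
```

A reconfiguration reboot is typically required for qlc.conf changes to take effect.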

 

I/O errors should be issued only after an appropriate timeout, so that transient port flaps are covered.

 

Update the Solaris client to SRU 11.2.9.5.0 at a minimum.
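
To confirm the client meets that minimum, the installed SRU can be read from the version of the 'entire' package (a console sketch; the branch version encodes Solaris 11.2 SRU 9.5 as 0.175.2.9.0.5.0):

```shell
# The VERSION column shows e.g. 0.5.11-0.175.2.9.0.5.0 for Solaris 11.2 SRU 9.5
pkg list entire
```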

The workaround and best practice is to use FC switches.

References

<BUG:20802234> - LUNS PRESENTED TO SOLARIS CLIENT BECOME SUSPENDED DURING ZFS APPLIANCE TAKEOVER
<NOTE:1434184.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Fibre-Channel Problems
<NOTE:1672221.1> - Oracle Solaris 11.2 Support Repository Updates (SRU) Index
http://www.oracle.com/technetwork/server-storage/sun-unified-storage/documentation/o12-019-fclun-7000-rs-1559284.pdf
<NOTE:1402545.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems
<BUG:18969626> - I/O STOPS WHEN OTHER PATH PULLED OUT AND INSERTED AFTER A PATH IS DEGRADED.

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.