Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1586715.1
Update Date:2018-03-07
Keywords:

Solution Type: Problem Resolution (Sure)

Solution 1586715.1: Solaris 11.1 mpxio - Netapp Storage Controller Failure - Zfs Pool Has Failed Hours Later When Ports Reappeared In Fabric


Related Items
  • Solaris Operating System
  • SPARC T4-1
Related Categories
  • PLA-Support>Sun Systems>DISK>HBA>SN-DK: FC HBA




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-7774615141>

Applies to:

SPARC T4-1 - Version All Versions and later
Solaris Operating System - Version 11 and later
Information in this document applies to any platform.

Symptoms

This is a SPARC T4-1 server (t4server1) running Solaris 11.1 SRU 4.5 with a dual-port Oracle FC HBA;
both ports are connected to a SAN to access a NetApp storage array:


c9 = emlxs1 (fp3) -> /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0:devctl
c8 = emlxs0 (fp2) -> /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0:devctl


There are several LUNs mapped from the storage to this server under mpxio control, e.g.:

       3. c0t60A9800037537478742443536131576Fd0 <NETAPP-LUN-811a-3.91TB>
         /scsi_vhci/ssd@g60a9800037537478742443536131576f



Each LUN has 8 paths under mpxio, all ONLINE,
four primary to storage controller A ports and four secondary to storage controller B ports:

DEVICE PROPERTIES for disk: 500a09808d7e7bd2
 Vendor: NETAPP
 Product ID: LUN
 Revision: 811a
 Serial Num: 7Stxt$CSa1Wo
 Unformatted capacity: 4096000.000 MBytes
 Read Cache: Enabled
  Minimum prefetch: 0x0
  Maximum prefetch: 0x0
 Device Type: Disk device
 Path(s):

 /dev/rdsk/c0t60A9800037537478742443536131576Fd0s2
 /devices/scsi_vhci/ssd@g60a9800037537478742443536131576f:c,raw
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
  Device Address 500a09839d7e7bd2,21
  Host controller port WWN 10000090fa13cc4f
  Class secondary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
  Device Address 500a09819d7e7bd2,21
  Host controller port WWN 10000090fa13cc4f
  Class secondary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
  Device Address 500a09818d7e7bd2,21
  Host controller port WWN 10000090fa13cc4f
  Class primary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
  Device Address 500a09838d7e7bd2,21
  Host controller port WWN 10000090fa13cc4f
  Class primary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
  Device Address 500a09829d7e7bd2,21
  Host controller port WWN 10000090fa13cc4e
  Class secondary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
  Device Address 500a09848d7e7bd2,21
  Host controller port WWN 10000090fa13cc4e
  Class primary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
  Device Address 500a09828d7e7bd2,21
  Host controller port WWN 10000090fa13cc4e
  Class primary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
  Device Address 500a09849d7e7bd2,21
  Host controller port WWN 10000090fa13cc4e
  Class secondary
  State ONLINE


--> There was a failure on NetApp storage controller A and all of its ports disappeared from the fabric, as expected:

messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0300, PWWN=500a09848d7e7bd2 disappeared from fabric
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0600, PWWN=500a09828d7e7bd2 disappeared from fabric
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0200, PWWN=500a09818d7e7bd2 disappeared from fabric
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0300, PWWN=500a09838d7e7bd2 disappeared from fabric


Access to the LUNs on the NetApp storage continued through the remaining paths to controller B.

The next day storage controller A was repaired and its ports reappeared in the fabric:

messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0200, PWWN=500a09818d7e7bd2 reappeared in fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0300, PWWN=500a09838d7e7bd2 reappeared in fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0300, PWWN=500a09848d7e7bd2 reappeared in fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0600, PWWN=500a09828d7e7bd2 reappeared in fabric



...but just before that, with no other explanation, there was a failure on zpool pool1 (disk 3. c0t60A9800037537478742443536131576Fd0):

Aug 24 13:12:07 t4server1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-NX, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 24 13:12:07 t4server1 EVENT-TIME: Sat Aug 24 13:12:06 CEST 2013
Aug 24 13:12:07 t4server1 PLATFORM: ORCL,SPARC-T4-1, CSN: 1307BDY5C0, HOSTNAME: t4server1
Aug 24 13:12:07 t4server1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 24 13:12:07 t4server1 EVENT-ID: d37084dc-f97c-c321-c9bc-b6e5e95d4a5f
Aug 24 13:12:07 t4server1 DESC: Probe of ZFS device 'id1,ssd@n60a9800037537478742443536131576f/a' in pool 'pool1' has failed.
Aug 24 13:12:07 t4server1 AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt will be made to activate a hot spare if available.
Aug 24 13:12:07 t4server1 IMPACT: Fault tolerance of the pool may be compromised.
Aug 24 13:12:07 t4server1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-NX for the latest service procedures and policies regarding this diagnosis.

Aug 24 13:12:07 t4server1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-8A, TYPE: Fault, VER: 1, SEVERITY: Critical
Aug 24 13:12:07 t4server1 EVENT-TIME: Sat Aug 24 13:12:07 CEST 2013
Aug 24 13:12:07 t4server1 PLATFORM: ORCL,SPARC-T4-1, CSN: 1307BDY5C0, HOSTNAME: t4server1
Aug 24 13:12:07 t4server1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 24 13:12:07 t4server1 EVENT-ID: 921f491f-edef-e8ec-ab83-c85771e2d345
Aug 24 13:12:07 t4server1 DESC: A file or directory in pool 'pool1' could not be read due to corrupt data.
Aug 24 13:12:07 t4server1 AUTO-RESPONSE: No automated response will occur.
Aug 24 13:12:07 t4server1 IMPACT: The file or directory is unavailable.
Aug 24 13:12:07 t4server1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -xv' and examine the list of damaged files to determine what has been affected. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-8A for the latest service procedures and policies regarding this diagnosis.

Aug 24 13:12:07 t4server1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-HC, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 24 13:12:07 t4server1 EVENT-TIME: Sat Aug 24 13:12:07 CEST 2013
Aug 24 13:12:07 t4server1 PLATFORM: ORCL,SPARC-T4-1, CSN: 1307BDY5C0, HOSTNAME: t4server1
Aug 24 13:12:07 t4server1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 24 13:12:07 t4server1 EVENT-ID: 4401ee7a-a603-ccf0-932d-e670dff05389
Aug 24 13:12:07 t4server1 DESC: ZFS pool 'pool1' has experienced currently unrecoverable I/O failures.
Aug 24 13:12:07 t4server1 AUTO-RESPONSE: No automated response will occur.
Aug 24 13:12:07 t4server1 IMPACT: Read and write I/Os cannot be serviced.
Aug 24 13:12:07 t4server1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Make sure the affected devices are connected, then run 'zpool clear'. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-HC for the latest service procedures and policies regarding this diagnosis.

 

Then mpxio set the multipath status to optimal for each LUN/path from the NetApp storage, e.g. for LUN 0x21:

Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 14 fp3/ssd@w500a09818d7e7bd2,21 is online: Load balancing: round-robin
Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 10 fp3/ssd@w500a09838d7e7bd2,21 is online: Load balancing: round-robin
Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 26 fp2/ssd@w500a09848d7e7bd2,21 is online: Load balancing: round-robin
Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 30 fp2/ssd@w500a09828d7e7bd2,21 is online: Load balancing: round-robin


The only errors found are in FMA; because of them the ZFS pool was marked as faulted:


Aug 23 22:47:48.4700 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:48.4701 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:48.4824 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:48.4825 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:50.6209 ereport.io.scsi.cmd.disk.dev.serr
Aug 23 22:48:07.5394 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.5667 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.5670 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.5911 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.6014 ereport.io.scsi.cmd.disk.recovered

Aug 24 13:12:04.4190 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:04.4192 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:04.4193 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9191 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9193 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9194 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9194 ereport.fs.zfs.probe_failure
Aug 24 13:12:06.9195 ereport.fs.zfs.io
Aug 24 13:12:06.9195 ereport.fs.zfs.io
Aug 24 13:12:06.9195 ereport.fs.zfs.data
Aug 24 13:12:06.9196 ereport.fs.zfs.io
Aug 24 13:12:06.9196 ereport.fs.zfs.io_failure
Aug 24 13:12:06.9195 ereport.fs.zfs.io
Aug 24 13:12:06.9196 ereport.fs.zfs.io
Aug 24 13:12:09.4193 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:09.4194 ereport.fs.zfs.io
Aug 24 13:12:15.9526 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:16.0496 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:16.1496 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:16.2559 ereport.io.scsi.cmd.disk.recovered
Aug 24 13:13:08.6078 ereport.fs.zfs.io_failure
Aug 24 13:13:08.6080 ereport.fs.zfs.io_failure





Here are the detailed fmdump events that led to the ZFS failure;
the same error (asc = 0x4, ascq = 0xa) was reported on the four available paths under mpxio (to controller B):

Aug 24 2013 13:12:04.419012415 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
  class = ereport.io.scsi.cmd.disk.dev.rqs.derr
  ena = 0x5d8fa780c6c02801
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x515c4d2a00000016
  device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0/ssd@w500a09819d7e7bd2,21
  devid = id1,ssd@n60a9800037537478742443536131576f
  (end detector)

  devid = id1,ssd@n60a9800037537478742443536131576f
  driver-assessment = fail
  op-code = 0x8a
  cdb = 0x8a 0x0 0x0 0x0 0x0 0x1 0xc8 0x8f 0x1b 0xbd 0x0 0x0 0x0 0x10 0x0 0x0
  pkt-reason = 0x0
  pkt-state = 0x3f
  pkt-stats = 0x0
  stat-code = 0x2
  key = 0x2
  asc = 0x4
  ascq = 0xa
  sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
  __ttl = 0x1
  __tod = 0x52189504 0x18f99f3f

Aug 24 2013 13:12:04.419238767 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
  class = ereport.io.scsi.cmd.disk.dev.rqs.derr
  ena = 0x5d8fa7b757f0ac01
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x515c4d2a00000016
  device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0/ssd@w500a09829d7e7bd2,21
  devid = id1,ssd@n60a9800037537478742443536131576f
  (end detector)

  devid = id1,ssd@n60a9800037537478742443536131576f
  driver-assessment = retry
  op-code = 0x28
  cdb = 0x28 0x0 0xcc 0x6e 0x13 0xe4 0x0 0x0 0x1 0x0
  pkt-reason = 0x0
  pkt-state = 0x3f
  pkt-stats = 0x0
  stat-code = 0x2
  key = 0x2
  asc = 0x4
  ascq = 0xa
  sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
  __ttl = 0x1
  __tod = 0x52189504 0x18fd136f

Aug 24 2013 13:12:04.419319640 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
  class = ereport.io.scsi.cmd.disk.dev.rqs.derr
  ena = 0x5d8fa7cb27408401
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x515c4d2a00000016
  device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0/ssd@w500a09839d7e7bd2,21
  devid = id1,ssd@n60a9800037537478742443536131576f
  (end detector)

  devid = id1,ssd@n60a9800037537478742443536131576f
  driver-assessment = retry
  op-code = 0x28
  cdb = 0x28 0x0 0x0 0x0 0x3 0x10 0x0 0x0 0x10 0x0
  pkt-reason = 0x0
  pkt-state = 0x3f
  pkt-stats = 0x0
  stat-code = 0x2
  key = 0x2
  asc = 0x4
  ascq = 0xa
  sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
  __ttl = 0x1
  __tod = 0x52189504 0x18fe4f58

Aug 24 2013 13:12:06.919408534 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
  class = ereport.io.scsi.cmd.disk.dev.rqs.derr
  ena = 0x5d8fa7b757f0ac05
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x515c4d2a00000016
  device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0/ssd@w500a09849d7e7bd2,21
  devid = id1,ssd@n60a9800037537478742443536131576f
  (end detector)

  devid = id1,ssd@n60a9800037537478742443536131576f
  driver-assessment = fail
  op-code = 0x28
  cdb = 0x28 0x0 0xcc 0x6e 0x13 0xe4 0x0 0x0 0x1 0x0
  pkt-reason = 0x0
  pkt-state = 0x3f
  pkt-stats = 0x0
  stat-code = 0x2
  key = 0x2
  asc = 0x4
  ascq = 0xa
  sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
  __ttl = 0x1
  __tod = 0x52189506 0x36cd0f96


Aug 24 2013 13:12:06.919452855 ereport.fs.zfs.probe_failure
nvlist version: 0
  class = ereport.fs.zfs.probe_failure
  ena = 0x5d98f7f5abf07c01
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = zfs
  pool = 0xf0fd6429bb7c4b79
  vdev = 0xc14bbe70ab2fd060
  (end detector)

  pool = pool1
  pool_guid = 0xf0fd6429bb7c4b79
  pool_context = 0
  pool_failmode = wait
  vdev_guid = 0xc14bbe70ab2fd060
  vdev_type = disk
  vdev_path = /dev/dsk/c0t60A9800037537478742443536131576Fd0s0
  vdev_devid = id1,ssd@n60a9800037537478742443536131576f/a
  parent_guid = 0xf0fd6429bb7c4b79
  parent_type = root
  prev_state = 0x0
  __ttl = 0x1
  __tod = 0x52189506 0x36cdbcb7

Aug 24 2013 13:12:06.919506008 ereport.fs.zfs.io
nvlist version: 0
  class = ereport.fs.zfs.io
  ena = 0x5d98f80282207c01
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = zfs
  pool = 0xf0fd6429bb7c4b79
  vdev = 0xc14bbe70ab2fd060
  (end detector)

  pool = pool1
  pool_guid = 0xf0fd6429bb7c4b79
  pool_context = 0
  pool_failmode = wait
  vdev_guid = 0xc14bbe70ab2fd060
  vdev_type = disk
  vdev_path = /dev/dsk/c0t60A9800037537478742443536131576Fd0s0
  vdev_devid = id1,ssd@n60a9800037537478742443536131576f/a
  parent_guid = 0xf0fd6429bb7c4b79
  parent_type = root
  zio_err = 6
  zio_txg = 0x2a1897
  zio_offset = 0x198dc25c800
  zio_size = 0x200
  zio_objset = 0x21
  zio_object = 0xa290
  zio_level = 0
  zio_blkid = 0x0
  __ttl = 0x1
  __tod = 0x52189506 0x36ce8c58



This LUN is ssd10:
"/scsi_vhci/ssd@g60a9800037537478742443536131576f" 10 "ssd"

At this point the customer manually initiated a Solaris crash dump to collect information.

Changes

A NetApp storage controller failed; after the controller was repaired, its ports reappeared in the fabric.

Cause

All the data points to these mpxio bugs as the root cause:
Bug 15822598 SUNBT7204589 scsi_vhci not handling the takeover on netapp storage correctly
and

Bug 17228789  scsi_vhci not handling the takeover on netapp storage correctly in non cluster



Based on the FMA data, the collected crash dump, and the Solaris 11.1 version in use, the server did not know how to handle the 04/0A sense data (probably due to Bug 15822598);
the error was passed to the upper layers, where fmd caught it and ZFS faulted the pool.
 

Solution

Upgrade to Solaris 11.2 SRU 11.2.12.5.0 (or greater), which fixes Bug 17228789.

On Solaris 10, the fix for Bug 17228789 is provided in:

<SunPatch:150400-28>         Sep/10/2015   SunOS 5.10: Kernel Patch
<SunPatch:150401-28>         Sep/10/2015   SunOS 5.10_x86: Kernel Patch

 

Note that the fix provided for Bug 15822598 does not cover all scenarios:

On Solaris 11, the fix for Bug 15822598 is provided in:

Solaris 11.1 SRU 7.5 

On Solaris 10, the fix for Bug 15822598 is provided in:

<SunPatch:150400-23>         Apr/08/2015   SunOS 5.10: Kernel Patch
<SunPatch:150401-23>         Apr/08/2015   SunOS 5.10_x86: Kernel Patch

 

 

Internal notes from the core dump analysis:

 -----------------------
The analysis of the core dump is not conclusive in relation to "Bug 15822598 SUNBT7204589 scsi_vhci not handling the takeover on netapp storage correctly",
as the dump was taken after the storage issue was seen and the multipathing devices had sorted themselves out.

ssd9 and ssd11 have no outstanding commands in the driver or transport, and the
Last pkt reason:
CMD_CMPLT - no transport errors- normal completion

ssd10 and ssd12 have no outstanding commands in the driver or transport, but the
Last pkt reason:
  CMD_TRAN_ERR - unspecified transport error


We can see from the kstat error counters that ssd10 had many transport and hard errors logged against it:

kid: 1772 @ 0x3000099e5b0, data[14] @ 0x3000099e6c0
mod: ssderr name: ssd10,err class: device_error type: KSTAT_TYPE_NAMED
inst: 10 flags: 8 size: 672
  KSTAT_FLAG_PERSISTENT - kstat is to be persistent over time
creation time: 0x26d79efb14 (144 days 3 hours 51 minutes 26.806454690 seconds earlier)
last snapshot: 0x2bed0b55065713 (1 days 1 hours 26 minutes 36.707272099 seconds earlier)
update: genunix:nulldev
snapshot: unix:default_kstat_snapshot
Soft Errors: 0
Hard Errors: 2698
Transport Errors: 3294
Vendor: "NETAPP "
Product: "LUN "
Revision: "811a"
Serial No: "7Stxt$CSa1Wo"
Size: 4294967296000
Media Error: 0
Device Not Ready: 5
No Device: 5
Recoverable: 0
Illegal Request: 3481
Predictive Failure Analysis: 0

Looking at ssd10, we can see historically that the sense data from the array
is asc/ascq 04/0A (LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION):


> 0x1001915c0700::print -t struct sd_lun un_xbuf_attr
void *un_xbuf_attr = 0x1001913f0a00
> 0x1001913f0a00::print -t struct __ddi_xbuf_attr xa_reserve_headp
void *xa_reserve_headp = 0x100190de6550
> 0x100190de6550::print -t ddi_xbuf_t
ddi_xbuf_t 0x100190dba660
> 0x100190dba660::print -t sd_xbuf xb_sense_data
uchar_t [20] xb_sense_data = [ 0x70, 0, 0x2, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0x4, 0xa, 0, 0, 0, 0, 0, 0 ]
>

So, as per the bug mentioned above, if the array returns this sense data, the version of Solaris the customer is currently running
will not understand what to do with it.


The sense data returned by the array

sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0

contains sense key = 0x2 (Not Ready) and asc/ascq = 0x4/0xA.
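
For reference, these values can be read directly from the fixed-format sense bytes shown above. Below is a minimal standalone sketch (user-level C written for this note, not Solaris driver code), using the standard SPC fixed-format offsets: byte 2 (low nibble) = sense key, byte 12 = ASC, byte 13 = ASCQ.

#include <stdio.h>

/* Decode the fixed-format (response code 0x70) sense buffer reported in the
 * fmdump events above.  The buffer contents are copied from the sense-data line. */
int
main(void)
{
        unsigned char sense[20] = {
                0x70, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x0e,
                0x00, 0x00, 0x00, 0x00, 0x04, 0x0a, 0x00, 0x00,
                0x00, 0x00, 0x00, 0x00
        };
        unsigned char key  = sense[2] & 0x0f;   /* sense key: 0x2 = NOT READY */
        unsigned char asc  = sense[12];         /* additional sense code: 0x04 */
        unsigned char ascq = sense[13];         /* additional sense code qualifier: 0x0a */

        /* prints: key=0x2 asc=0x4 ascq=0xa */
        printf("key=0x%x asc=0x%x ascq=0x%x\n", key, asc, ascq);
        return (0);
}

This is the 04/0A combination (LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION) already identified from the mdb output above.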

If the customer does not have the version of Solaris with the fix defined in the bug, Solaris does not know what to do with the 0x4/0xA additional sense data.

It knows that the LUN is "not ready", and the only matching code before the fix is:

} else if ((skey == KEY_NOT_READY) &&
    (asc == STD_LOGICAL_UNIT_NOT_ACCESSIBLE) &&
    ((ascq == STD_TGT_PORT_UNAVAILABLE) ||
    (ascq == STD_TGT_PORT_STANDBY))) {
        rval = SCSI_SENSE_INACTIVE;
        VHCI_DEBUG(4, (CE_NOTE, NULL, "!std_analyze_sense:"
            " sense_key:%x, add_code: %x, qual_code:%x"
            " sense:%x\n", skey, asc, ascq, rval));

but 04/0A does not match this else-if, so rval is set to SCSI_SENSE_UNKNOWN (as per the bug comments) and the condition is handled as unknown sense data.


With the fix in place (an additional else-if):

} else if ((skey == KEY_NOT_READY) &&
    (asc == STD_SCSI_ASC_STATE_TRANS) &&
    (ascq == STD_SCSI_ASCQ_STATE_TRANS)) {
        rval = SCSI_SENSE_NOT_READY;
        VHCI_DEBUG(4, (CE_NOTE, NULL, "!std_analyze_sense:"
            " sense_key:%x, add_code: %x, qual_code:%x"
            " sense:%x\n", skey, asc, ascq, rval));

where

#define STD_SCSI_ASC_STATE_TRANS 0x04
#define STD_SCSI_ASCQ_STATE_TRANS 0x0A

So now the return value is set to SCSI_SENSE_NOT_READY and the condition is handled differently.
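
Putting the two fragments together, the decision for this 04/0A sense data can be summarized in a short sketch. This is a simplified reconstruction for this note, not the actual std_analyze_sense() source; the have_fix argument is purely illustrative of the pre-fix versus post-fix behaviour, and only the constants quoted above are used.

/* Simplified sketch of the 04/0A decision, reconstructed from the code
 * fragments above.  Not the real std_analyze_sense(). */
static int
analyze_04_0a_sketch(uchar_t skey, uchar_t asc, uchar_t ascq, int have_fix)
{
        if ((skey == KEY_NOT_READY) &&
            (asc == STD_LOGICAL_UNIT_NOT_ACCESSIBLE) &&
            ((ascq == STD_TGT_PORT_UNAVAILABLE) ||
            (ascq == STD_TGT_PORT_STANDBY)))
                return (SCSI_SENSE_INACTIVE);   /* existing standby/unavailable handling */

        if (have_fix &&
            (skey == KEY_NOT_READY) &&
            (asc == STD_SCSI_ASC_STATE_TRANS) &&        /* 0x04 */
            (ascq == STD_SCSI_ASCQ_STATE_TRANS))        /* 0x0A */
                return (SCSI_SENSE_NOT_READY);  /* handled as a NOT READY condition */

        /* without the fix, 04/0A falls through here as unknown sense data */
        return (SCSI_SENSE_UNKNOWN);
}

Without the fix, the 04/0A transition reported on every available path is therefore treated as unknown sense data and passed to the upper layers, which matches what fmd and ZFS logged just before controller A's ports reappeared in the fabric.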

Disassembling std_analyze_sense() in the dump, we only see a reference to the first else-if:

std_analyze_sense+0xf0: cmp %i2, 0x2 ( KEY_NOT_READY )
std_analyze_sense+0xf4: bne,pn %icc, +0x3c
std_analyze_sense+0xf8: cmp %i3, 0x4 (STD_LOGICAL_UNIT_NOT_ACCESSIBLE)
std_analyze_sense+0xfc: be,pn %icc, +0x14
std_analyze_sense+0x100: mov 0x5, %l3
std_analyze_sense+0x104: sra %l3, 0x0, %i0
std_analyze_sense+0x108: ret
std_analyze_sense+0x10c: restore
std_analyze_sense+0x110: sub %l1, 0xb, %l5 ( STD_TGT_PORT_STANDBY )

So this customer does not have that bug fix on their system, and they should update their packages to get it installed.

I would say that the customer should update their version of Solaris
so that this bug will not cause them any further issues when the array returns 04/0A sense data.
-------------------------------

 

References

<NOTE:1519925.1> - fmdump -eV reports ereport.io.scsi.cmd.disk.dev.rqs.derr associated with SCSI Mode Select or SCSI Mode Sense commands
<NOTE:1501435.1> - Oracle Solaris 11.1 Support Repository Updates (SRU) Index
<BUG:15822598> - SUNBT7204589 SCSI_VHCI NOT HANDLING THE TAKEOVER ON NETAPP STORAGE CORRECTLY
<BUG:17228789> - SCSI_VHCI NOT HANDLING THE TAKEOVER ON NETAPP STORAGE CORRECTLY IN NON CLUSTER

Attachments
This solution has no attachment