Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1586715.1
Update Date:2018-03-07
Keywords:

Solution Type: Problem Resolution (Sure)

Solution 1586715.1: Solaris 11.1 mpxio - Netapp Storage Controller Failure - Zfs Pool Has Failed Hours Later When Ports Reappeared In Fabric


Related Items
  • Solaris Operating System
  • SPARC T4-1
Related Categories
  • PLA-Support>Sun Systems>DISK>HBA>SN-DK: FC HBA




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-7774615141>

Applies to:

SPARC T4-1 - Version All Versions and later
Solaris Operating System - Version 11 and later
Information in this document applies to any platform.

Symptoms

This is a SPARC T4-1 server (t4server1) running Solaris 11.1 SRU 4.5 with a dual-port Oracle FC HBA;
both ports are connected to a SAN to access a NetApp storage array:


c9 = emlxs1 (fp3) -> /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0:devctl
c8 = emlxs0 (fp2) -> /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0:devctl


There are several LUNs mapped from the storage to this server under mpxio control, e.g.:

       3. c0t60A9800037537478742443536131576Fd0 <NETAPP-LUN-811a-3.91TB>
         /scsi_vhci/ssd@g60a9800037537478742443536131576f



Each LUN has 8 paths under mpxio, all ONLINE,
four primary to storage controller A ports and four secondary to storage controller B ports:

DEVICE PROPERTIES for disk: 500a09808d7e7bd2
 Vendor: NETAPP
 Product ID: LUN
 Revision: 811a
 Serial Num: 7Stxt$CSa1Wo
 Unformatted capacity: 4096000.000 MBytes
 Read Cache: Enabled
  Minimum prefetch: 0x0
  Maximum prefetch: 0x0
 Device Type: Disk device
 Path(s):

 /dev/rdsk/c0t60A9800037537478742443536131576Fd0s2
 /devices/scsi_vhci/ssd@g60a9800037537478742443536131576f:c,raw
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
  Device Address 500a09839d7e7bd2,21
  Host controller port WWN 10000090fa13cc4f
  Class secondary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
  Device Address 500a09819d7e7bd2,21
  Host controller port WWN 10000090fa13cc4f
  Class secondary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
  Device Address 500a09818d7e7bd2,21
  Host controller port WWN 10000090fa13cc4f
  Class primary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
  Device Address 500a09838d7e7bd2,21
  Host controller port WWN 10000090fa13cc4f
  Class primary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
  Device Address 500a09829d7e7bd2,21
  Host controller port WWN 10000090fa13cc4e
  Class secondary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
  Device Address 500a09848d7e7bd2,21
  Host controller port WWN 10000090fa13cc4e
  Class primary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
  Device Address 500a09828d7e7bd2,21
  Host controller port WWN 10000090fa13cc4e
  Class primary
  State ONLINE
  Controller /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
  Device Address 500a09849d7e7bd2,21
  Host controller port WWN 10000090fa13cc4e
  Class secondary
  State ONLINE


--> There was a failure on NetApp storage controller A and all of its ports disappeared from the fabric, as expected:

messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0300, PWWN=500a09848d7e7bd2 disappeared from fabric
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0600, PWWN=500a09828d7e7bd2 disappeared from fabric
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0200, PWWN=500a09818d7e7bd2 disappeared from fabric
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0300, PWWN=500a09838d7e7bd2 disappeared from fabric


Access to the LUNs on the NetApp storage continued through the remaining paths to controller B.

The next day storage controller A was repaired and its ports reappeared in the fabric:

messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0200, PWWN=500a09818d7e7bd2 reappeared in fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0300, PWWN=500a09838d7e7bd2 reappeared in fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0300, PWWN=500a09848d7e7bd2 reappeared in fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0600, PWWN=500a09828d7e7bd2 reappeared in fabric



...but just before that, with no other explanation, there was a failure on zpool pool1 (disk 3. c0t60A9800037537478742443536131576Fd0):

Aug 24 13:12:07 t4server1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-NX, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 24 13:12:07 t4server1 EVENT-TIME: Sat Aug 24 13:12:06 CEST 2013
Aug 24 13:12:07 t4server1 PLATFORM: ORCL,SPARC-T4-1, CSN: 1307BDY5C0, HOSTNAME: t4server1
Aug 24 13:12:07 t4server1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 24 13:12:07 t4server1 EVENT-ID: d37084dc-f97c-c321-c9bc-b6e5e95d4a5f
Aug 24 13:12:07 t4server1 DESC: Probe of ZFS device 'id1,ssd@n60a9800037537478742443536131576f/a' in pool 'pool1' has failed.
Aug 24 13:12:07 t4server1 AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt will be made to activate a hot spare if available.
Aug 24 13:12:07 t4server1 IMPACT: Fault tolerance of the pool may be compromised.
Aug 24 13:12:07 t4server1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-NX for the latest service procedures and policies regarding this diagnosis.

Aug 24 13:12:07 t4server1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-8A, TYPE: Fault, VER: 1, SEVERITY: Critical
Aug 24 13:12:07 t4server1 EVENT-TIME: Sat Aug 24 13:12:07 CEST 2013
Aug 24 13:12:07 t4server1 PLATFORM: ORCL,SPARC-T4-1, CSN: 1307BDY5C0, HOSTNAME: t4server1
Aug 24 13:12:07 t4server1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 24 13:12:07 t4server1 EVENT-ID: 921f491f-edef-e8ec-ab83-c85771e2d345
Aug 24 13:12:07 t4server1 DESC: A file or directory in pool 'pool1' could not be read due to corrupt data.
Aug 24 13:12:07 t4server1 AUTO-RESPONSE: No automated response will occur.
Aug 24 13:12:07 t4server1 IMPACT: The file or directory is unavailable.
Aug 24 13:12:07 t4server1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -xv' and examine the list of damaged files to determine what has been affected. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-8A for the latest service procedures and policies regarding this diagnosis.

Aug 24 13:12:07 t4server1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-HC, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 24 13:12:07 t4server1 EVENT-TIME: Sat Aug 24 13:12:07 CEST 2013
Aug 24 13:12:07 t4server1 PLATFORM: ORCL,SPARC-T4-1, CSN: 1307BDY5C0, HOSTNAME: t4server1
Aug 24 13:12:07 t4server1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 24 13:12:07 t4server1 EVENT-ID: 4401ee7a-a603-ccf0-932d-e670dff05389
Aug 24 13:12:07 t4server1 DESC: ZFS pool 'pool1' has experienced currently unrecoverable I/O failures.
Aug 24 13:12:07 t4server1 AUTO-RESPONSE: No automated response will occur.
Aug 24 13:12:07 t4server1 IMPACT: Read and write I/Os cannot be serviced.
Aug 24 13:12:07 t4server1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Make sure the affected devices are connected, then run 'zpool clear'. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-HC for the latest service procedures and policies regarding this diagnosis.

 

Then mpxio set the multipath status to optimal for each LUN/path from the NetApp storage, e.g. for LUN 0x21:

Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 14 fp3/ssd@w500a09818d7e7bd2,21 is online: Load balancing: round-robin
Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 10 fp3/ssd@w500a09838d7e7bd2,21 is online: Load balancing: round-robin
Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 26 fp2/ssd@w500a09848d7e7bd2,21 is online: Load balancing: round-robin
Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 30 fp2/ssd@w500a09828d7e7bd2,21 is online: Load balancing: round-robin


The only errors found are in FMA; because of them the ZFS pool was marked as faulted:


Aug 23 22:47:48.4700 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:48.4701 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:48.4824 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:48.4825 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:50.6209 ereport.io.scsi.cmd.disk.dev.serr
Aug 23 22:48:07.5394 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.5667 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.5670 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.5911 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.6014 ereport.io.scsi.cmd.disk.recovered

Aug 24 13:12:04.4190 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:04.4192 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:04.4193 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9191 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9193 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9194 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9194 ereport.fs.zfs.probe_failure
Aug 24 13:12:06.9195 ereport.fs.zfs.io
Aug 24 13:12:06.9195 ereport.fs.zfs.io
Aug 24 13:12:06.9195 ereport.fs.zfs.data
Aug 24 13:12:06.9196 ereport.fs.zfs.io
Aug 24 13:12:06.9196 ereport.fs.zfs.io_failure
Aug 24 13:12:06.9195 ereport.fs.zfs.io
Aug 24 13:12:06.9196 ereport.fs.zfs.io
Aug 24 13:12:09.4193 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:09.4194 ereport.fs.zfs.io
Aug 24 13:12:15.9526 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:16.0496 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:16.1496 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:16.2559 ereport.io.scsi.cmd.disk.recovered
Aug 24 13:13:08.6078 ereport.fs.zfs.io_failure
Aug 24 13:13:08.6080 ereport.fs.zfs.io_failure





Here are the detailed fmdump events that led to the ZFS failure;
the same error (asc = 0x4, ascq = 0xa) was reported on the four available paths under mpxio (to controller B):

Aug 24 2013 13:12:04.419012415 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
  class = ereport.io.scsi.cmd.disk.dev.rqs.derr
  ena = 0x5d8fa780c6c02801
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x515c4d2a00000016
  device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0/ssd@w500a09819d7e7bd2,21
  devid = id1,ssd@n60a9800037537478742443536131576f
  (end detector)

  devid = id1,ssd@n60a9800037537478742443536131576f
  driver-assessment = fail
  op-code = 0x8a
  cdb = 0x8a 0x0 0x0 0x0 0x0 0x1 0xc8 0x8f 0x1b 0xbd 0x0 0x0 0x0 0x10 0x0 0x0
  pkt-reason = 0x0
  pkt-state = 0x3f
  pkt-stats = 0x0
  stat-code = 0x2
  key = 0x2
  asc = 0x4
  ascq = 0xa
  sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
  __ttl = 0x1
  __tod = 0x52189504 0x18f99f3f

Aug 24 2013 13:12:04.419238767 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
  class = ereport.io.scsi.cmd.disk.dev.rqs.derr
  ena = 0x5d8fa7b757f0ac01
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x515c4d2a00000016
  device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0/ssd@w500a09829d7e7bd2,21
  devid = id1,ssd@n60a9800037537478742443536131576f
  (end detector)

  devid = id1,ssd@n60a9800037537478742443536131576f
  driver-assessment = retry
  op-code = 0x28
  cdb = 0x28 0x0 0xcc 0x6e 0x13 0xe4 0x0 0x0 0x1 0x0
  pkt-reason = 0x0
  pkt-state = 0x3f
  pkt-stats = 0x0
  stat-code = 0x2
  key = 0x2
  asc = 0x4
  ascq = 0xa
  sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
  __ttl = 0x1
  __tod = 0x52189504 0x18fd136f

Aug 24 2013 13:12:04.419319640 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
  class = ereport.io.scsi.cmd.disk.dev.rqs.derr
  ena = 0x5d8fa7cb27408401
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x515c4d2a00000016
  device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0/ssd@w500a09839d7e7bd2,21
  devid = id1,ssd@n60a9800037537478742443536131576f
  (end detector)

  devid = id1,ssd@n60a9800037537478742443536131576f
  driver-assessment = retry
  op-code = 0x28
  cdb = 0x28 0x0 0x0 0x0 0x3 0x10 0x0 0x0 0x10 0x0
  pkt-reason = 0x0
  pkt-state = 0x3f
  pkt-stats = 0x0
  stat-code = 0x2
  key = 0x2
  asc = 0x4
  ascq = 0xa
  sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
  __ttl = 0x1
  __tod = 0x52189504 0x18fe4f58

Aug 24 2013 13:12:06.919408534 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
  class = ereport.io.scsi.cmd.disk.dev.rqs.derr
  ena = 0x5d8fa7b757f0ac05
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = dev
  cna_dev = 0x515c4d2a00000016
  device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0/ssd@w500a09849d7e7bd2,21
  devid = id1,ssd@n60a9800037537478742443536131576f
  (end detector)

  devid = id1,ssd@n60a9800037537478742443536131576f
  driver-assessment = fail
  op-code = 0x28
  cdb = 0x28 0x0 0xcc 0x6e 0x13 0xe4 0x0 0x0 0x1 0x0
  pkt-reason = 0x0
  pkt-state = 0x3f
  pkt-stats = 0x0
  stat-code = 0x2
  key = 0x2
  asc = 0x4
  ascq = 0xa
  sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
  __ttl = 0x1
  __tod = 0x52189506 0x36cd0f96


Aug 24 2013 13:12:06.919452855 ereport.fs.zfs.probe_failure
nvlist version: 0
  class = ereport.fs.zfs.probe_failure
  ena = 0x5d98f7f5abf07c01
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = zfs
  pool = 0xf0fd6429bb7c4b79
  vdev = 0xc14bbe70ab2fd060
  (end detector)

  pool = pool1
  pool_guid = 0xf0fd6429bb7c4b79
  pool_context = 0
  pool_failmode = wait
  vdev_guid = 0xc14bbe70ab2fd060
  vdev_type = disk
  vdev_path = /dev/dsk/c0t60A9800037537478742443536131576Fd0s0
  vdev_devid = id1,ssd@n60a9800037537478742443536131576f/a
  parent_guid = 0xf0fd6429bb7c4b79
  parent_type = root
  prev_state = 0x0
  __ttl = 0x1
  __tod = 0x52189506 0x36cdbcb7

Aug 24 2013 13:12:06.919506008 ereport.fs.zfs.io
nvlist version: 0
  class = ereport.fs.zfs.io
  ena = 0x5d98f80282207c01
  detector = (embedded nvlist)
  nvlist version: 0
  version = 0x0
  scheme = zfs
  pool = 0xf0fd6429bb7c4b79
  vdev = 0xc14bbe70ab2fd060
  (end detector)

  pool = pool1
  pool_guid = 0xf0fd6429bb7c4b79
  pool_context = 0
  pool_failmode = wait
  vdev_guid = 0xc14bbe70ab2fd060
  vdev_type = disk
  vdev_path = /dev/dsk/c0t60A9800037537478742443536131576Fd0s0
  vdev_devid = id1,ssd@n60a9800037537478742443536131576f/a
  parent_guid = 0xf0fd6429bb7c4b79
  parent_type = root
  zio_err = 6
  zio_txg = 0x2a1897
  zio_offset = 0x198dc25c800
  zio_size = 0x200
  zio_objset = 0x21
  zio_object = 0xa290
  zio_level = 0
  zio_blkid = 0x0
  __ttl = 0x1
  __tod = 0x52189506 0x36ce8c58



This LUN is ssd10:
"/scsi_vhci/ssd@g60a9800037537478742443536131576f" 10 "ssd"

At this point the customer manually initiated a Solaris crash dump to collect information.

Changes

A NetApp storage controller failed; after the controller was repaired, its ports reappeared in the fabric.

Cause

All the data points to these mpxio bugs as the root cause:
Bug 15822598 SUNBT7204589 scsi_vhci not handling the takeover on netapp storage correctly
and

Bug 17228789  scsi_vhci not handling the takeover on netapp storage correctly in non cluster



Based on the FMA data, the collected crash dump, and the Solaris 11.1 version in use, the server did not know how to handle the 04/0A sense data (probably due to Bug 15822598);
the error was passed to the upper layers, where fmd caught it and ZFS faulted the pool.
 

Solution

Upgrade to Solaris 11.2 SRU 11.2.12.5.0 (or greater), which fixes Bug 17228789.

On Solaris 10, the fix for Bug 17228789 is provided in:

<SunPatch:150400-28>         Sep/10/2015   SunOS 5.10: Kernel Patch
<SunPatch:150401-28>         Sep/10/2015   SunOS 5.10_x86: Kernel Patch

 

Note that the fix provided for Bug 15822598 does not cover all scenarios:

On Solaris 11, the fix for Bug 15822598 is provided in:

Solaris 11.1 SRU 7.5 

On Solaris 10, the fix for Bug 15822598 is provided in:

<SunPatch:150400-23>         Apr/08/2015   SunOS 5.10: Kernel Patch
<SunPatch:150401-23>         Apr/08/2015   SunOS 5.10_x86: Kernel Patch

 

 

Internal notes from the core dump analysis:

 -----------------------
The analysis of the core dump is not conclusive in relation to "Bug 15822598 SUNBT7204589 scsi_vhci not handling the takeover on netapp storage correctly",
as the dump was taken after the storage issue was seen and the multipathing devices had sorted themselves out.

ssd9 and ssd11 have no outstanding commands in the driver or transport, and the
Last pkt reason:
CMD_CMPLT - no transport errors- normal completion

ssd10 and ssd12 have no outstanding commands in the driver or transport, but the
Last pkt reason:
  CMD_TRAN_ERR - unspecified transport error


We can see from the kstat error counters that ssd10 had many transport and hard errors logged against it:

kid: 1772 @ 0x3000099e5b0, data[14] @ 0x3000099e6c0
mod: ssderr name: ssd10,err class: device_error type: KSTAT_TYPE_NAMED
inst: 10 flags: 8 size: 672
  KSTAT_FLAG_PERSISTENT - kstat is to be persistent over time
creation time: 0x26d79efb14 (144 days 3 hours 51 minutes 26.806454690 seconds earlier)
last snapshot: 0x2bed0b55065713 (1 days 1 hours 26 minutes 36.707272099 seconds earlier)
update: genunix:nulldev
snapshot: unix:default_kstat_snapshot
Soft Errors: 0
Hard Errors: 2698
Transport Errors: 3294
Vendor: "NETAPP "
Product: "LUN "
Revision: "811a"
Serial No: "7Stxt$CSa1Wo"
Size: 4294967296000
Media Error: 0
Device Not Ready: 5
No Device: 5
Recoverable: 0
Illegal Request: 3481
Predictive Failure Analysis: 0

Looking at ssd10, we can see historically that the sense data from the array
is asc/ascq 04/0A (LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION):


> 0x1001915c0700::print -t struct sd_lun un_xbuf_attr
void *un_xbuf_attr = 0x1001913f0a00
> 0x1001913f0a00::print -t struct __ddi_xbuf_attr xa_reserve_headp
void *xa_reserve_headp = 0x100190de6550
> 0x100190de6550::print -t ddi_xbuf_t
ddi_xbuf_t 0x100190dba660
> 0x100190dba660::print -t sd_xbuf xb_sense_data
uchar_t [20] xb_sense_data = [ 0x70, 0, 0x2, 0, 0, 0, 0, 0xe, 0, 0, 0, 0, 0x4, 0xa, 0, 0, 0, 0, 0, 0 ]
>

So, as per the bug mentioned above, if the array returns this sense data, the version of Solaris the customer is currently running
will not understand what to do with it.


The sense data returned by the array

sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0

contains sense key = 0x2 (Not Ready) and asc/ascq = 0x4/0xA.
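
For reference, these values can be read directly from the fixed-format sense bytes shown above. Below is a minimal standalone sketch (user-level C written for this note, not Solaris driver code), using the standard SPC fixed-format offsets: byte 2 (low nibble) = sense key, byte 12 = ASC, byte 13 = ASCQ.

#include <stdio.h>

/* Decode the fixed-format (response code 0x70) sense buffer reported in the
 * fmdump events above.  The buffer contents are copied from the sense-data line. */
int
main(void)
{
        unsigned char sense[20] = {
                0x70, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x0e,
                0x00, 0x00, 0x00, 0x00, 0x04, 0x0a, 0x00, 0x00,
                0x00, 0x00, 0x00, 0x00
        };
        unsigned char key  = sense[2] & 0x0f;   /* sense key: 0x2 = NOT READY */
        unsigned char asc  = sense[12];         /* additional sense code: 0x04 */
        unsigned char ascq = sense[13];         /* additional sense code qualifier: 0x0a */

        /* prints: key=0x2 asc=0x4 ascq=0xa */
        printf("key=0x%x asc=0x%x ascq=0x%x\n", key, asc, ascq);
        return (0);
}

This is the 04/0A combination (LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION) already identified from the mdb output above.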

If the customer does not have the version of Solaris with the fix defined in the bug, Solaris does not know what to do with the 0x4/0xA additional sense data.

It knows that the LUN is "not ready", and the only matching code before the fix is:

} else if ((skey == KEY_NOT_READY) &&
    (asc == STD_LOGICAL_UNIT_NOT_ACCESSIBLE) &&
    ((ascq == STD_TGT_PORT_UNAVAILABLE) ||
    (ascq == STD_TGT_PORT_STANDBY))) {
        rval = SCSI_SENSE_INACTIVE;
        VHCI_DEBUG(4, (CE_NOTE, NULL, "!std_analyze_sense:"
            " sense_key:%x, add_code: %x, qual_code:%x"
            " sense:%x\n", skey, asc, ascq, rval));

but 04/0A does not match this else-if, so rval is set to SCSI_SENSE_UNKNOWN (as per the bug comments) and the condition is handled as unknown sense data.


With the fix in place (an additional else-if):

} else if ((skey == KEY_NOT_READY) &&
    (asc == STD_SCSI_ASC_STATE_TRANS) &&
    (ascq == STD_SCSI_ASCQ_STATE_TRANS)) {
        rval = SCSI_SENSE_NOT_READY;
        VHCI_DEBUG(4, (CE_NOTE, NULL, "!std_analyze_sense:"
            " sense_key:%x, add_code: %x, qual_code:%x"
            " sense:%x\n", skey, asc, ascq, rval));

where

#define STD_SCSI_ASC_STATE_TRANS 0x04
#define STD_SCSI_ASCQ_STATE_TRANS 0x0A

So now the return value is set to SCSI_SENSE_NOT_READY and the condition is handled differently.
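
Putting the two fragments together, the decision for this 04/0A sense data can be summarized in a short sketch. This is a simplified reconstruction for this note, not the actual std_analyze_sense() source; the have_fix argument is purely illustrative of the pre-fix versus post-fix behaviour, and only the constants quoted above are used.

/* Simplified sketch of the 04/0A decision, reconstructed from the code
 * fragments above.  Not the real std_analyze_sense(). */
static int
analyze_04_0a_sketch(uchar_t skey, uchar_t asc, uchar_t ascq, int have_fix)
{
        if ((skey == KEY_NOT_READY) &&
            (asc == STD_LOGICAL_UNIT_NOT_ACCESSIBLE) &&
            ((ascq == STD_TGT_PORT_UNAVAILABLE) ||
            (ascq == STD_TGT_PORT_STANDBY)))
                return (SCSI_SENSE_INACTIVE);   /* existing standby/unavailable handling */

        if (have_fix &&
            (skey == KEY_NOT_READY) &&
            (asc == STD_SCSI_ASC_STATE_TRANS) &&        /* 0x04 */
            (ascq == STD_SCSI_ASCQ_STATE_TRANS))        /* 0x0A */
                return (SCSI_SENSE_NOT_READY);  /* handled as a NOT READY condition */

        /* without the fix, 04/0A falls through here as unknown sense data */
        return (SCSI_SENSE_UNKNOWN);
}

Without the fix, the 04/0A transition reported on every available path is therefore treated as unknown sense data and passed to the upper layers, which matches what fmd and ZFS logged just before controller A's ports reappeared in the fabric.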

Disassembling std_analyze_sense() in the dump, we only see a reference to the first else-if:

std_analyze_sense+0xf0: cmp %i2, 0x2 ( KEY_NOT_READY )
std_analyze_sense+0xf4: bne,pn %icc, +0x3c
std_analyze_sense+0xf8: cmp %i3, 0x4 (STD_LOGICAL_UNIT_NOT_ACCESSIBLE)
std_analyze_sense+0xfc: be,pn %icc, +0x14
std_analyze_sense+0x100: mov 0x5, %l3
std_analyze_sense+0x104: sra %l3, 0x0, %i0
std_analyze_sense+0x108: ret
std_analyze_sense+0x10c: restore
std_analyze_sense+0x110: sub %l1, 0xb, %l5 ( STD_TGT_PORT_STANDBY )

So this customer does not have that bug fix on their system, and they should update their packages to get it installed.

I would say that the customer should update their version of Solaris
so that this bug will not cause them any further issues when the array returns 04/0A sense data.
-------------------------------

 

References

<NOTE:1519925.1> - fmdump -eV reports ereport.io.scsi.cmd.disk.dev.rqs.derr associated with SCSI Mode Select or SCSI Mode Sense commands
<NOTE:1501435.1> - Oracle Solaris 11.1 Support Repository Updates (SRU) Index
<BUG:15822598> - SUNBT7204589 SCSI_VHCI NOT HANDLING THE TAKEOVER ON NETAPP STORAGE CORRECTLY
<BUG:17228789> - SCSI_VHCI NOT HANDLING THE TAKEOVER ON NETAPP STORAGE CORRECTLY IN NON CLUSTER

Attachments
This solution has no attachment