Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution 1586715.1 : Solaris 11.1 mpxio - Netapp Storage Controller Failure - Zfs Pool Has Failed Hours Later When Ports Reappeared In Fabric
Solution Type: Problem Resolution (Sure Solution)
Created from <SR 3-7774615141>

Applies to:
SPARC T4-1 - Version All Versions and later
Solaris Operating System - Version 11 and later
Information in this document applies to any platform.

Symptoms
This is a Solaris 11.1 SRU 4.5 SPARC T4-1 server (t4server1) with a dual-port Oracle FC HBA.

3. c0t60A9800037537478742443536131576Fd0 <NETAPP-LUN-811a-3.91TB>
/scsi_vhci/ssd@g60a9800037537478742443536131576f
DEVICE PROPERTIES for disk: 500a09808d7e7bd2
Vendor:               NETAPP
Product ID:           LUN
Revision:             811a
Serial Num:           7Stxt$CSa1Wo
Unformatted capacity: 4096000.000 MBytes
Read Cache:           Enabled
  Minimum prefetch:   0x0
  Maximum prefetch:   0x0
Device Type:          Disk device
Path(s):
  /dev/rdsk/c0t60A9800037537478742443536131576Fd0s2
  /devices/scsi_vhci/ssd@g60a9800037537478742443536131576f:c,raw
   Controller                /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
    Device Address           500a09839d7e7bd2,21
    Host controller port WWN 10000090fa13cc4f
    Class                    secondary
    State                    ONLINE
   Controller                /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
    Device Address           500a09819d7e7bd2,21
    Host controller port WWN 10000090fa13cc4f
    Class                    secondary
    State                    ONLINE
   Controller                /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
    Device Address           500a09818d7e7bd2,21
    Host controller port WWN 10000090fa13cc4f
    Class                    primary
    State                    ONLINE
   Controller                /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0
    Device Address           500a09838d7e7bd2,21
    Host controller port WWN 10000090fa13cc4f
    Class                    primary
    State                    ONLINE
   Controller                /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address           500a09829d7e7bd2,21
    Host controller port WWN 10000090fa13cc4e
    Class                    secondary
    State                    ONLINE
   Controller                /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address           500a09848d7e7bd2,21
    Host controller port WWN 10000090fa13cc4e
    Class                    primary
    State                    ONLINE
   Controller                /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address           500a09828d7e7bd2,21
    Host controller port WWN 10000090fa13cc4e
    Class                    primary
    State                    ONLINE
   Controller                /devices/pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
    Device Address           500a09849d7e7bd2,21
    Host controller port WWN 10000090fa13cc4e
    Class                    secondary
    State                    ONLINE
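To sanity-check a symmetric path layout like the one above, the per-path Class/State pairs from saved `luxadm display` output can be tallied so a failed controller shows up as a missing primary or secondary count. This is an illustrative sketch only; `/tmp/paths.txt` is an assumed capture file with a two-path sample, not the actual eight-path case data.

```shell
# Sketch: tally ONLINE paths per mpxio class from saved 'luxadm display'
# output. /tmp/paths.txt is an assumed sample capture, not real case data.
cat > /tmp/paths.txt <<'EOF'
Class            secondary
State            ONLINE
Class            primary
State            ONLINE
EOF
# Remember the last-seen Class, print it with the following State line,
# then count each (class, state) combination.
awk '/^Class/ {cls=$2} /^State/ {print cls, $2}' /tmp/paths.txt | sort | uniq -c
```

On the real output above, a healthy layout would show four primary and four secondary ONLINE paths; anything less points at a lost controller or fabric path.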
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0300, PWWN=500a09848d7e7bd2 disappeared from fabric
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0600, PWWN=500a09828d7e7bd2 disappeared from fabric
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0200, PWWN=500a09818d7e7bd2 disappeared from fabric
messages.1:Aug 23 22:47:48 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0300, PWWN=500a09838d7e7bd2 disappeared from fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0200, PWWN=500a09818d7e7bd2 reappeared in fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(3)::N_x Port with D_ID=e0300, PWWN=500a09838d7e7bd2 reappeared in fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0300, PWWN=500a09848d7e7bd2 reappeared in fabric
messages.0:Aug 24 13:12:14 t4server1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=d0600, PWWN=500a09828d7e7bd2 reappeared in fabric
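When reading a case like this, it helps to pair each "disappeared from fabric" message with its matching "reappeared in fabric" message per PWWN, which makes the roughly 14-hour outage window obvious. A hedged sketch, using an assumed sample file `/tmp/fab.txt` standing in for the relevant /var/adm/messages lines:

```shell
# Sketch: extract (PWWN, event) pairs from fctl fabric messages.
# /tmp/fab.txt is an assumed sample; real input is /var/adm/messages*.
cat > /tmp/fab.txt <<'EOF'
Aug 23 22:47:48 t4server1 fctl: WARNING: fp(2)::N_x Port with D_ID=d0300, PWWN=500a09848d7e7bd2 disappeared from fabric
Aug 24 13:12:14 t4server1 fctl: WARNING: fp(2)::N_x Port with D_ID=d0300, PWWN=500a09848d7e7bd2 reappeared in fabric
EOF
# Each PWWN should appear exactly once per event type; an unmatched
# 'disappeared' means a port never came back.
grep -o 'PWWN=[0-9a-f]* [a-z]*' /tmp/fab.txt | sort | uniq -c
```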
Aug 24 13:12:07 t4server1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-NX, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 24 13:12:07 t4server1 EVENT-TIME: Sat Aug 24 13:12:06 CEST 2013
Aug 24 13:12:07 t4server1 PLATFORM: ORCL,SPARC-T4-1, CSN: 1307BDY5C0, HOSTNAME: t4server1
Aug 24 13:12:07 t4server1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 24 13:12:07 t4server1 EVENT-ID: d37084dc-f97c-c321-c9bc-b6e5e95d4a5f
Aug 24 13:12:07 t4server1 DESC: Probe of ZFS device 'id1,ssd@n60a9800037537478742443536131576f/a' in pool 'pool1' has failed.
Aug 24 13:12:07 t4server1 AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt will be made to activate a hot spare if available.
Aug 24 13:12:07 t4server1 IMPACT: Fault tolerance of the pool may be compromised.
Aug 24 13:12:07 t4server1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-NX for the latest service procedures and policies regarding this diagnosis.
Aug 24 13:12:07 t4server1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-8A, TYPE: Fault, VER: 1, SEVERITY: Critical
Aug 24 13:12:07 t4server1 EVENT-TIME: Sat Aug 24 13:12:07 CEST 2013
Aug 24 13:12:07 t4server1 PLATFORM: ORCL,SPARC-T4-1, CSN: 1307BDY5C0, HOSTNAME: t4server1
Aug 24 13:12:07 t4server1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 24 13:12:07 t4server1 EVENT-ID: 921f491f-edef-e8ec-ab83-c85771e2d345
Aug 24 13:12:07 t4server1 DESC: A file or directory in pool 'pool1' could not be read due to corrupt data.
Aug 24 13:12:07 t4server1 AUTO-RESPONSE: No automated response will occur.
Aug 24 13:12:07 t4server1 IMPACT: The file or directory is unavailable.
Aug 24 13:12:07 t4server1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -xv' and examine the list of damaged files to determine what has been affected. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-8A for the latest service procedures and policies regarding this diagnosis.
Aug 24 13:12:07 t4server1 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-HC, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 24 13:12:07 t4server1 EVENT-TIME: Sat Aug 24 13:12:07 CEST 2013
Aug 24 13:12:07 t4server1 PLATFORM: ORCL,SPARC-T4-1, CSN: 1307BDY5C0, HOSTNAME: t4server1
Aug 24 13:12:07 t4server1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 24 13:12:07 t4server1 EVENT-ID: 4401ee7a-a603-ccf0-932d-e670dff05389
Aug 24 13:12:07 t4server1 DESC: ZFS pool 'pool1' has experienced currently unrecoverable I/O failures.
Aug 24 13:12:07 t4server1 AUTO-RESPONSE: No automated response will occur.
Aug 24 13:12:07 t4server1 IMPACT: Read and write I/Os cannot be serviced.
Aug 24 13:12:07 t4server1 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Make sure the affected devices are connected, then run 'zpool clear'. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-HC for the latest service procedures and policies regarding this diagnosis.
mpxio then reported the multipath status as optimal for each LUN/path from the NetApp storage, e.g. for LUN 0x21:

Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 14 fp3/ssd@w500a09818d7e7bd2,21 is online: Load balancing: round-robin
Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 10 fp3/ssd@w500a09838d7e7bd2,21 is online: Load balancing: round-robin
Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 26 fp2/ssd@w500a09848d7e7bd2,21 is online: Load balancing: round-robin
Aug 24 13:12:14 t4server1 genunix: [ID 530209 kern.info] /scsi_vhci/ssd@g60a9800037537478742443536131576f (ssd10) multipath status: optimal: path 30 fp2/ssd@w500a09828d7e7bd2,21 is online: Load balancing: round-robin
Aug 23 22:47:48.4700 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:48.4701 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:48.4824 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:48.4825 ereport.io.scsi.cmd.disk.tran
Aug 23 22:47:50.6209 ereport.io.scsi.cmd.disk.dev.serr
Aug 23 22:48:07.5394 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.5667 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.5670 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.5911 ereport.io.scsi.cmd.disk.recovered
Aug 23 22:48:07.6014 ereport.io.scsi.cmd.disk.recovered
Aug 24 13:12:04.4190 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:04.4192 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:04.4193 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9191 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9193 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9194 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:06.9194 ereport.fs.zfs.probe_failure
Aug 24 13:12:06.9195 ereport.fs.zfs.io
Aug 24 13:12:06.9195 ereport.fs.zfs.io
Aug 24 13:12:06.9195 ereport.fs.zfs.data
Aug 24 13:12:06.9196 ereport.fs.zfs.io
Aug 24 13:12:06.9196 ereport.fs.zfs.io_failure
Aug 24 13:12:06.9195 ereport.fs.zfs.io
Aug 24 13:12:06.9196 ereport.fs.zfs.io
Aug 24 13:12:09.4193 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:09.4194 ereport.fs.zfs.io
Aug 24 13:12:15.9526 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:16.0496 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:16.1496 ereport.io.scsi.cmd.disk.dev.rqs.derr
Aug 24 13:12:16.2559 ereport.io.scsi.cmd.disk.recovered
Aug 24 13:13:08.6078 ereport.fs.zfs.io_failure
Aug 24 13:13:08.6080 ereport.fs.zfs.io_failure
Aug 24 2013 13:12:04.419012415 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.rqs.derr
        ena = 0x5d8fa780c6c02801
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                cna_dev = 0x515c4d2a00000016
                device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0/ssd@w500a09819d7e7bd2,21
                devid = id1,ssd@n60a9800037537478742443536131576f
        (end detector)
        devid = id1,ssd@n60a9800037537478742443536131576f
        driver-assessment = fail
        op-code = 0x8a
        cdb = 0x8a 0x0 0x0 0x0 0x0 0x1 0xc8 0x8f 0x1b 0xbd 0x0 0x0 0x0 0x10 0x0 0x0
        pkt-reason = 0x0
        pkt-state = 0x3f
        pkt-stats = 0x0
        stat-code = 0x2
        key = 0x2
        asc = 0x4
        ascq = 0xa
        sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
        __ttl = 0x1
        __tod = 0x52189504 0x18f99f3f

Aug 24 2013 13:12:04.419238767 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.rqs.derr
        ena = 0x5d8fa7b757f0ac01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                cna_dev = 0x515c4d2a00000016
                device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0/ssd@w500a09829d7e7bd2,21
                devid = id1,ssd@n60a9800037537478742443536131576f
        (end detector)
        devid = id1,ssd@n60a9800037537478742443536131576f
        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0xcc 0x6e 0x13 0xe4 0x0 0x0 0x1 0x0
        pkt-reason = 0x0
        pkt-state = 0x3f
        pkt-stats = 0x0
        stat-code = 0x2
        key = 0x2
        asc = 0x4
        ascq = 0xa
        sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
        __ttl = 0x1
        __tod = 0x52189504 0x18fd136f

Aug 24 2013 13:12:04.419319640 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.rqs.derr
        ena = 0x5d8fa7cb27408401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                cna_dev = 0x515c4d2a00000016
                device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0,1/fp@0,0/ssd@w500a09839d7e7bd2,21
                devid = id1,ssd@n60a9800037537478742443536131576f
        (end detector)
        devid = id1,ssd@n60a9800037537478742443536131576f
        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x0 0x0 0x3 0x10 0x0 0x0 0x10 0x0
        pkt-reason = 0x0
        pkt-state = 0x3f
        pkt-stats = 0x0
        stat-code = 0x2
        key = 0x2
        asc = 0x4
        ascq = 0xa
        sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
        __ttl = 0x1
        __tod = 0x52189504 0x18fe4f58

Aug 24 2013 13:12:06.919408534 ereport.io.scsi.cmd.disk.dev.rqs.derr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.rqs.derr
        ena = 0x5d8fa7b757f0ac05
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                cna_dev = 0x515c4d2a00000016
                device-path = /pci@400/pci@2/pci@0/pci@8/SUNW,emlxs@0/fp@0,0/ssd@w500a09849d7e7bd2,21
                devid = id1,ssd@n60a9800037537478742443536131576f
        (end detector)
        devid = id1,ssd@n60a9800037537478742443536131576f
        driver-assessment = fail
        op-code = 0x28
        cdb = 0x28 0x0 0xcc 0x6e 0x13 0xe4 0x0 0x0 0x1 0x0
        pkt-reason = 0x0
        pkt-state = 0x3f
        pkt-stats = 0x0
        stat-code = 0x2
        key = 0x2
        asc = 0x4
        ascq = 0xa
        sense-data = 0x70 0x0 0x2 0x0 0x0 0x0 0x0 0xe 0x0 0x0 0x0 0x0 0x4 0xa 0x0 0x0 0x0 0x0 0x0 0x0
        __ttl = 0x1
        __tod = 0x52189506 0x36cd0f96

Aug 24 2013 13:12:06.919452855 ereport.fs.zfs.probe_failure
nvlist version: 0
        class = ereport.fs.zfs.probe_failure
        ena = 0x5d98f7f5abf07c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xf0fd6429bb7c4b79
                vdev = 0xc14bbe70ab2fd060
        (end detector)
        pool = pool1
        pool_guid = 0xf0fd6429bb7c4b79
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xc14bbe70ab2fd060
        vdev_type = disk
        vdev_path = /dev/dsk/c0t60A9800037537478742443536131576Fd0s0
        vdev_devid = id1,ssd@n60a9800037537478742443536131576f/a
        parent_guid = 0xf0fd6429bb7c4b79
        parent_type = root
        prev_state = 0x0
        __ttl = 0x1
        __tod = 0x52189506 0x36cdbcb7

Aug 24 2013 13:12:06.919506008 ereport.fs.zfs.io
nvlist version: 0
        class = ereport.fs.zfs.io
        ena = 0x5d98f80282207c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xf0fd6429bb7c4b79
                vdev = 0xc14bbe70ab2fd060
        (end detector)
        pool = pool1
        pool_guid = 0xf0fd6429bb7c4b79
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xc14bbe70ab2fd060
        vdev_type = disk
        vdev_path = /dev/dsk/c0t60A9800037537478742443536131576Fd0s0
        vdev_devid = id1,ssd@n60a9800037537478742443536131576f/a
        parent_guid = 0xf0fd6429bb7c4b79
        parent_type = root
        zio_err = 6
        zio_txg = 0x2a1897
        zio_offset = 0x198dc25c800
        zio_size = 0x200
        zio_objset = 0x21
        zio_object = 0xa290
        zio_level = 0
        zio_blkid = 0x0
        __ttl = 0x1
        __tod = 0x52189506 0x36ce8c58
At this point the customer manually initiated a Solaris crash dump to collect information.

Changes
NetApp storage controller failure; once the controller was fixed, its ports reappeared in the fabric.

Cause
All the data points to this mpxio bug as the root cause:
Bug 17228789 - scsi_vhci not handling the takeover on netapp storage correctly in non cluster
Solution
Upgrade to Solaris 11.2 SRU 11.2.12.5.0 (or greater), which fixes Bug 17228789.
On Solaris 10, the fix for Bug 17228789 is provided in: <SunPatch:150400-28> Sep/10/2015 SunOS 5.10: Kernel Patch
Note that the fix provided for Bug 15822598 does not cover all scenarios:
On Solaris 11, the fix for Bug 15822598 is provided in: Solaris 11.1 SRU 7.5
On Solaris 10, the fix for Bug 15822598 is provided in: <SunPatch:150400-23> Apr/08/2015 SunOS 5.10: Kernel Patch
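On Solaris 11 the installed SRU level can be read from the `entire` package's Branch version. The sketch below shows one way to compare that against the minimum level carrying the fix; `/tmp/entire.txt` stands in for `pkg info entire` output, and the branch string `0.175.2.12.0.5.0` is an assumed example of an 11.2 SRU 12.5 branch, so verify the exact value against the SRU index note referenced below.

```shell
# Sketch: check whether the installed 'entire' branch is at or above an
# assumed minimum. /tmp/entire.txt is a stand-in for 'pkg info entire'.
cat > /tmp/entire.txt <<'EOF'
Name: entire
Branch: 0.175.2.12.0.5.0
EOF
have=$(awk '/Branch/ {print $2}' /tmp/entire.txt)
need=0.175.2.12.0.5.0
# Field-wise numeric compare on the release/SRU fields (assumption:
# both strings have the same dotted layout).
lowest=$(printf '%s\n%s\n' "$need" "$have" | sort -t. -k2,2n -k3,3n -k4,4n -k6,6n | head -1)
if [ "$lowest" = "$need" ]; then echo "fix level present"; else echo "upgrade needed"; fi
```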
Internal notes from the core dump analysis: -----------------------
References
<NOTE:1519925.1> - fmdump -eV reports ereport.io.scsi.cmd.disk.dev.rqs.derr associated with SCSI Mode Select or SCSI Mode Sense commands
<NOTE:1501435.1> - Oracle Solaris 11.1 Support Repository Updates (SRU) Index
<BUG:15822598> - SUNBT7204589 SCSI_VHCI NOT HANDLING THE TAKEOVER ON NETAPP STORAGE CORRECTLY
<BUG:17228789> - SCSI_VHCI NOT HANDLING THE TAKEOVER ON NETAPP STORAGE CORRECTLY IN NON CLUSTER

Attachments
This solution has no attachment