Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2184100.1
Update Date:2017-10-11

Solution Type  Problem Resolution Sure

Solution  2184100.1 :   SPARC M7 Series Servers : ZFS fault on eUSB disk after CMIOU replacement  


Related Items
  • Oracle SuperCluster M7 Hardware
  • SPARC M7-16
  • SPARC M7-8
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: M7




In this Document
Symptoms
Cause
Solution
References


Applies to:

SPARC M7-8 - Version All Versions and later
Oracle SuperCluster M7 Hardware - Version All Versions and later
SPARC M7-16 - Version All Versions and later
Information in this document applies to any platform.

Symptoms

After a CMIOU is replaced on a SPARC M7 server or an Oracle SuperCluster M7 system that uses iSCSI over IPoIB as a boot option, the domain owning the eUSB disk of the replaced CMIOU may report a ZFS fault when the domains are restarted.

For example:

SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Sep 15 22:03:09 PDT 2016
PLATFORM: unknown, CSN: unknown, HOSTNAME:
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 21ba6416-f79f-4b70-9059-e1c705462754
DESC: ZFS device 'id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a' in pool 'bpool' failed.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-D3 for the latest service procedures and policies regarding this diagnosis.
288 Thu Sep 15 22:03:10 2016 Fault Repair minor
Fault fault.fs.zfs.device on component - cleared
287 Thu Sep 15 21:58:52 2016 Fault Fault critical
Fault detected at time = Thu Sep 15 21:58:35 2016. The suspect component: - has fault.fs.zfs.device with probability=100. Refer to
http://support.oracle.com/msg/ZFS-8000-D3 for details

See the documents referenced below for further information about iSCSI over IPoIB.

Replacing a CMIOU does not require transferring the eUSB disk from the suspect CMIOU to the new CMIOU, unless the eUSB disk is the only device in the boot pool.

 

Cause

This situation may be caused by an incorrect or missing label on the eUSB disk. In that case, the eUSB disk may fail to join the boot pool.

The fault is reported by the domain that owns the eUSB disk and uses it as part of the boot pool to boot using iSCSI over IPoIB.

The domain owning the eUSB disk can be the control/primary domain or any guest domain.

Make sure to identify which ldom the fault is coming from. A fault diagnosed in a guest ldom is proxied to the control/primary domain. The host ID of the domain where the fault was diagnosed is reported in the 'fmadm faulty' output.

See IO faults proxying in LDOM environment (Doc ID 1942045.1)
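
As a minimal sketch of that identification step, the helper below pulls the Host_ID value out of 'fmadm faulty' output. The function name fault_host_ids is hypothetical, and the parsing assumes the "Host_ID : value" layout shown in this document; other releases may format the field differently.

```shell
# Print the unique Host_ID value(s) found in 'fmadm faulty' output on stdin.
# Assumption: the field is printed as "Host_ID : <value>", as in this document.
fault_host_ids() {
    awk -F':' '/^[ \t]*Host_ID[ \t]*:/ {
        gsub(/[ \t]/, "", $2)   # strip padding around the value
        print $2
    }' | sort -u
}

# Usage on the control domain (not run here):
#   fmadm faulty | fault_host_ids
```

The value printed can then be compared with the HOSTID of each ldom to find the domain that diagnosed the fault.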

# fmdump

Sep 15 21:58:35.1780 e8a130c0-90b6-4461-aadf-df9502fb85a9 ZFS-8000-D3 Diagnosed
100% fault.fs.zfs.device

...

Problem in: zfs://pool=3794d1209385ba27/vdev=a8e0f7cc0efd0465/pool_name=bpool/vdev_name=id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a
Affects: zfs://pool=3794d1209385ba27/vdev=a8e0f7cc0efd0465/pool_name=bpool/vdev_name=id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a
FRU: zfs://pool=3794d1209385ba27/vdev=a8e0f7cc0efd0465/pool_name=bpool/vdev_name=id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a
FRU Location: -

 

# fmadm faulty

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Sep 15 22:03:09 21ba6416-f79f-4b70-9059-e1c705462754 ZFS-8000-D3 Major

Problem Status : open
Diag Engine : zfs-diagnosis / 1.0
System
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Serial_Number : unknown
Host_ID : 84f9adb4

----------------------------------------
Suspect 1 of 1 :
Problem class : fault.fs.zfs.device
Certainty : 100%
Affects : zfs://pool=3794d1209385ba27/vdev=a8e0f7cc0efd0465/pool_name=bpool/vdev_name=id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a
Status : faulted and taken out of service

FRU
Status : faulty
FMRI : "zfs://pool=3794d1209385ba27/vdev=a8e0f7cc0efd0465/pool_name=bpool/vdev_name=id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a"

Description : ZFS device 'id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a'
in pool 'bpool' failed.

Response : No automated response will occur.

Impact : Fault tolerance of the pool may be compromised.

Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Run 'zpool status -lx' for more information. Please refer to the
associated reference document at
http://support.oracle.com/msg/ZFS-8000-D3 for the latest service
procedures and policies regarding this diagnosis.

From the ldom reporting the fault, check the error reports and the fault details ('fmadm faulty', 'fmdump -v', 'fmdump -eV'). The UUID must match the one proxied to the control domain.

# fmdump -e

Sep 15 21:56:53.8380 ereport.fs.zfs.vdev.bad_label
Sep 15 21:56:53.8380 ereport.fs.zfs.vdev.bad_label
Sep 15 21:57:00.6992 ereport.fs.zfs.vdev.bad_label
Sep 15 21:58:35.3800 ereport.fs.zfs.vdev.dtl
Sep 15 21:58:35.4219 ereport.fs.zfs.vdev.bad_label
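
When the error log is long, a per-class count can make a run of bad_label ereports easier to spot. The following is only a sketch: ereport_summary is a hypothetical helper name, and it assumes the event class is the last field on each line, as in the 'fmdump -e' listing above.

```shell
# Count ereports per class from 'fmdump -e' output on stdin.
# Assumption: the class name is the last whitespace-separated field.
ereport_summary() {
    awk '$NF ~ /^ereport\./ { count[$NF]++ }
         END { for (c in count) print count[c], c }' | sort -k2
}

# Usage in the ldom reporting the fault (not run here):
#   fmdump -e | ereport_summary
```

Applied to the five lines listed above, this gives a count of 4 for ereport.fs.zfs.vdev.bad_label and 1 for ereport.fs.zfs.vdev.dtl.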

# fmdump -eV

TIME UUID SUNW-MSG-ID
Sep 15 2016 21:58:35.178074000 e8a130c0-90b6-4461-aadf-df9502fb85a9 ZFS-8000-D3

TIME CLASS ENA
Sep 15 21:56:53.8380 ereport.fs.zfs.vdev.bad_label 0x041e400c0af00001

nvlist version: 0
version = 0x0
class = list.suspect
uuid = e8a130c0-90b6-4461-aadf-df9502fb85a9
code = ZFS-8000-D3
diag-time = 1474001915 173495
de = (embedded nvlist)
nvlist version: 0
version = 0x1
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x1
system-mfg = unknown
system-name = unknown
system-part = unknown
system-serial = unknown
sys-comp-mfg = unknown
sys-comp-name = unknown
sys-comp-part = unknown
sys-comp-serial = unknown
server-name =
host-id = 84f9adb4
(end authority)

mod-name = zfs-diagnosis
mod-version = 1.0
(end de)

fault-list-sz = 0x1
topo-uuid = 9d6fe8e4-bcd2-492c-802a-bcabdbc22e58
topo-time = 0x57db7beb
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.fs.zfs.device
certainty = 0x64
asru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x3794d1209385ba27
vdev = 0xa8e0f7cc0efd0465
pool_name = bpool
vdev_name = id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a
(end asru)

fru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x3794d1209385ba27
vdev = 0xa8e0f7cc0efd0465
pool_name = bpool
vdev_name = id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a
(end fru)

resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x3794d1209385ba27
vdev = 0xa8e0f7cc0efd0465
pool_name = bpool
vdev_name = id1,sd@SMICRON__eUSB_DISK_______17F0022700070705/a
(end resource)

 

The boot pool of the ldom owning the eUSB disk is reported as degraded in the 'zpool status' output:

# zpool status -v

pool: bpool
state: DEGRADED
status: One or more devices are unavailable in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or 'fmadm repaired', or replace the device
with 'zpool replace'.
Run 'zpool status -v' to see device specific details.
scan: resilvered 91.9M in 2s with 0 errors on Thu Jun 30 12:20:34 2016

config:

NAME STATE READ WRITE CKSUM
  bpool DEGRADED 0 0 0
    mirror-0 DEGRADED 0 0 0
      12168998648951932005 UNAVAIL 0 0 0
      c3t0d0 ONLINE 0 0 0
      c0t600144F0A78B6D340000577544A9000Cd0 ONLINE 0 0 0

device details:

12168998648951932005 UNAVAIL was /dev/dsk/c2t0d0s0
status: ZFS detected errors on this device.
The device has invalid label.
see: http://support.oracle.com/msg/ZFS-8000-D3 for recovery
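
The identifier of the unavailable vdev is needed later for 'zpool detach'. As a sketch, it can be picked out of the config table with a small filter. The name unhealthy_vdevs is hypothetical, and the filter assumes config rows have exactly the five columns NAME STATE READ WRITE CKSUM shown above.

```shell
# Print the names of vdevs whose STATE is UNAVAIL, FAULTED, OFFLINE or REMOVED,
# reading 'zpool status' output on stdin.
# Assumption: config rows have five columns (NAME STATE READ WRITE CKSUM).
unhealthy_vdevs() {
    awk 'NF == 5 && $2 ~ /^(UNAVAIL|FAULTED|OFFLINE|REMOVED)$/ { print $1 }'
}

# Usage in the ldom owning the eUSB disk (not run here):
#   zpool status -v bpool | unhealthy_vdevs
```

DEGRADED rows (the pool and the mirror) are deliberately skipped, since they only reflect the state of the member below them.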

 

# bootadm boot-pool list
Boot pool name: bpool
Parameters: eviction_algorithm=lru
Current: /dev/dsk/c2t0d0, /dev/dsk/c3t0d0, /dev/dsk/c0t600144F0A78B6D340000577544A9000Cd0
Pending: /dev/dsk/c2t0d0, /dev/dsk/c3t0d0, /dev/dsk/c0t600144F0A78B6D340000577544A9000Cd0
Platform-specified devices excluded:
Platform-specified (auto-added, unless excluded): /dev/dsk/c2t0d0, /dev/dsk/c3t0d0

 

Solution

If the fault is due to an incorrect or missing label, the disk must be properly labelled.

There is no need to replace the eUSB disk.

The fault reported on the control/primary domain identifies where the fault is coming from. In this example: host-id = 84f9adb4.

Use the 'ldm ls-dom -l' command from the control domain to locate the ldom and the eUSB disk. In this example, CMIOU2 is the CMIOU that was replaced.

See SPARC M7 Series Servers: Device Paths (Doc ID 2063247.1) to identify the path and bus/rootcomplex.

# ldm ls-dom -l

NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
ssccn1-dom3 active -n--v- 5003 112 129792M 0.0% 0.0% 10h 23m

...

HOSTID
0x84f9adb4

...

IO
DEVICE PSEUDONYM OPTIONS
pci@30e     pci_14

# ldm list-rsrc-group -l /SYS/CMIOU2

...

 

IO
DEVICE PSEUDONYM BOUND
pci@30a pci_10 primary
pci@30b pci_11 primary
pci@30d pci_13 primary
pci@30e pci_14 ssccn1-dom3
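
On a system with many ldoms, matching the host ID by eye is error-prone. The sketch below scans 'ldm ls-dom -l' output for a given host ID and prints the owning domain name. The function name domain_for_hostid is hypothetical, and the parsing assumes the layout shown above: an unindented NAME header followed by the domain row, and a HOSTID header followed by the value on the next line.

```shell
# Print the domain whose HOSTID matches $1, reading 'ldm ls-dom -l' on stdin.
# $1: host ID as reported by FMA, for example 84f9adb4 (a 0x prefix is accepted).
# Assumption: "NAME ..." and "HOSTID" headers start in column 1 and each is
# followed directly by its value line.
domain_for_hostid() {
    want=$(printf '%s' "$1" | sed 's/^0x//' | tr 'A-F' 'a-f')
    awk -v want="$want" '
        hdr { name = $1; hdr = 0 }                 # row after the NAME header
        hid { id = tolower($1); sub(/^0x/, "", id) # row after the HOSTID header
              if (id == want) print name
              hid = 0 }
        /^NAME[ \t]/ { hdr = 1 }
        /^HOSTID/    { hid = 1 }
    '
}

# Usage on the control domain (not run here):
#   ldm ls-dom -l | domain_for_hostid 84f9adb4
```

With the listing above, the helper would print ssccn1-dom3. The IO section of that domain, together with 'ldm list-rsrc-group -l', still has to be checked by hand to confirm which root complex carries the eUSB disk.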

 

The eUSB can only be managed from the ldom owning the resource.

After logging in to the identified ldom (ssccn1-dom3 in the previous example), 'zpool status' reports bpool as degraded from this ldom:

# zpool status -v

pool: bpool
state: DEGRADED
status: One or more devices are unavailable in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or 'fmadm repaired', or replace the device
with 'zpool replace'.
Run 'zpool status -v' to see device specific details.
scan: resilvered 91.9M in 2s with 0 errors on Thu Jun 30 12:20:34 2016

config:

NAME STATE READ WRITE CKSUM
bpool DEGRADED 0 0 0
  mirror-0 DEGRADED 0 0 0
    12168998648951932005 UNAVAIL 0 0 0
    c3t0d0 ONLINE 0 0 0
    c0t600144F0A78B6D340000577544A9000Cd0 ONLINE 0 0 0

...

device details:

12168998648951932005 UNAVAIL was /dev/dsk/c2t0d0s0
status: ZFS detected errors on this device.
The device has invalid label.
see: http://support.oracle.com/msg/ZFS-8000-D3 for recovery

 

From the previously identified ldom, label the eUSB disk:

# format
...
6. c2t0d0 <MICRON-eUSB DISK-1112-1.89GB>
/pci@30e/pci@2/usb@0/storage@1/disk@0,0
7. c3t0d0 <MICRON-eUSB DISK-1112-1.89GB>
/pci@313/pci@2/usb@0/storage@1/disk@0,0
# format c2t0d0
format> label
yes
format> quit

The device can then be detached from the pool and re-attached to it.

# zpool detach bpool 12168998648951932005
# zpool attach bpool c3t0d0 c2t0d0
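
Note the argument order in the two commands above: 'zpool detach' takes the identifier of the unavailable vdev, while 'zpool attach' takes a healthy device already in the mirror followed by the relabelled disk. The following sketch only prints the command pair for review and does not run anything. The helper name reattach_commands is hypothetical.

```shell
# Print (do not run) the detach/attach pair for one stale mirror member.
#   $1 pool name
#   $2 identifier of the UNAVAIL vdev as shown by 'zpool status'
#   $3 healthy device already in the mirror
#   $4 relabelled eUSB disk to attach
reattach_commands() {
    printf 'zpool detach %s %s\n' "$1" "$2"
    printf 'zpool attach %s %s %s\n' "$1" "$3" "$4"
}

# Example with the values from this document:
#   reattach_commands bpool 12168998648951932005 c3t0d0 c2t0d0
```

Printing the commands first gives a chance to confirm the device names against the 'format' and 'zpool status' output before changing the pool.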

# zpool status -v

pool: bpool
state: ONLINE
scan: resilvered 272M in 32s with 0 errors on Fri Sep 16 08:33:47 2016

config:

NAME STATE READ WRITE CKSUM
  bpool ONLINE 0 0 0
    mirror-0 ONLINE 0 0 0
      c3t0d0 ONLINE 0 0 0
      c2t0d0 ONLINE 0 0 0

Use FMA commands (for example 'fmadm faulty') to make sure that no other faults exist for this eUSB disk or bpool.

 

References

<NOTE:2094649.1> - SPARC T7 / M7 Servers : How to install Solaris on a Physical Domain using VersaBoot - iSCSI over IPoIB
<NOTE:2094741.1> - SPARC T7 / M7 / M8 Servers : Information about VersaBoot - iSCSI over IPoIB
<NOTE:2107700.1> - SPARC M7 Servers : iSCSI over IPoIB - CMIOU/eUSB replacement considerations

  Copyright © 2018 Oracle, Inc.  All rights reserved.