
Asset ID: 1-75-1942045.1
Update Date: 2017-10-11

Solution Type: Troubleshooting

Solution 1942045.1: IO fault proxying in an LDOM environment


Related Items
  • SPARC T5-4
  • SPARC T5-2
  • SPARC M6-32
  • SPARC M5-32
  • SPARC M7-8
  • SPARC T5-8
  • SPARC M7-16
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx-32




In this Document
Purpose
Troubleshooting Steps


Applies to:

SPARC T5-8 - Version All Versions to All Versions [Release All Releases]
SPARC T5-2 - Version All Versions to All Versions [Release All Releases]
SPARC M6-32 - Version All Versions to All Versions [Release All Releases]
SPARC M5-32 - Version All Versions to All Versions [Release All Releases]
SPARC T5-4 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Purpose

On SPARC M5-32, M6-32, and T5 servers, when an error is reported on the SP/SPP running ILOM or on the host (primary domain) running Solaris, the fault is diagnosed by the side that reported the error, i.e. the SP/SPP running ILOM or the host (primary domain) running Solaris.

In order to keep the FMA information in sync between the SP/SPP and the host, the diagnosed faults are proxied to the counterpart.
FMA fault proxying occurs between the primary domain of each host (up to 4 hosts/PDoms on Mx-32 servers) and the SP/SPP over the interconnect channel.
See SPARC M5-32 and M6-32 Servers: Interconnect, FMA Fault Proxying and LDOM configuration (Doc ID 1683087.1).
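
On the host side, the interconnect appears as a link-local 169.254.x.x address (the same network that shows up in the ip-transport line of the fmstat -T output further below). A quick check, assuming Solaris 11 on the control domain:

primary# ipadm show-addr | grep 169.254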

In an LDOM configuration, it is possible to assign IO resources (root complexes/buses, PCIe cards) to dedicated guest domains known as IO domains or root domains.
See the Oracle VM Server for SPARC 3.1 Administration Guide.
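
As a sketch of how such assignments are typically made (the bus and slot names below are placeholders; ldm list-io reports the actual names on a given platform, and removing a root complex or slot from the primary requires a delayed reconfiguration and reboot):

primary# ldm list-io
primary# ldm remove-io pci_1 primary
primary# ldm add-io pci_1 ldg1
primary# ldm remove-io /SYS/MB/PCIE5 primary
primary# ldm add-io /SYS/MB/PCIE5 ldg2

After the reassignment, ldg1 owns the pci_1 root complex (root domain) and ldg2 owns the /SYS/MB/PCIE5 slot via Direct I/O (IO domain).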

When an error related to an IO resource is detected, it is reported and diagnosed by the domain that owns the resource.
This means that the IO fault is diagnosed on the IO or root domain.

In order to make sure that the primary domain is aware of any IO fault diagnosed on an IO or root domain, the faults are proxied between the control domain and the IO/root domain. This is done using ETM over an LDC channel.
The faults are then proxied to the SP/SPP via the interconnect.
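
Whether the transport modules are loaded can be verified from the control domain (a minimal sketch; fmadm config lists all fmd modules, among which the etm and ip-transport transports should appear):

primary# fmadm config | egrep 'etm|ip-transport'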

Note: when a guest domain is using services (vnet, vdisk) from a control/IO/root domain, it does not own the IO resource. Any error reported for the IO resource is still diagnosed by the control/IO/root domain that owns it.

As a result, the FMA faults are in sync between the SP/SPP, the control domain, and the IO/root domains.

Basic example of a configuration where the following domains are configured on the host:

  • control domain : primary
  • root domain : ldg1
  • IO domain : ldg2
  • guest domain : ldg3

 

primary# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    16    10G      0.1%  0.1%  10h 34m
ldg1             active     -n----  5000    10    10G      0.1%  0.1%  10h 29m
ldg2             active     -n----  5001    10    10G      0.0%  0.0%  10h 26m
ldg3             active     -n----  5002    10    10G      0.1%  0.1%  11h 41m

primary# ldm list -l -p | egrep -e '(DOMAIN|HOSTID)'
DOMAIN|name=primary|state=active|flags=normal,control,vio-service|cons=UART|ncpu=16|mem=10737418240|util=0.2|uptime=38296|norm_util=0.2|softstate=Solaris running
HOSTID|hostid=0x86246a9f
DOMAIN|name=ldg1|state=active|flags=normal|cons=5000|ncpu=10|mem=10737418240|util=0.1|uptime=37989|norm_util=0.1|softstate=Solaris running
HOSTID|hostid=0x84fade7c
DOMAIN|name=ldg2|state=active|flags=normal|cons=5001|ncpu=10|mem=10737418240|util=0.0|uptime=37820|norm_util=0.0|softstate=Solaris running
HOSTID|hostid=0x84fa6db4
DOMAIN|name=ldg3|state=active|flags=normal|cons=5002|ncpu=10|mem=10737418240|util=0.1|uptime=42342|norm_util=0.1|softstate=Solaris running
HOSTID|hostid=0x84f8d08f

root@solaris1:~# virtinfo
Domain role: LDoms guest I/O root

root@solaris2:/opt# virtinfo
Domain role: LDoms guest I/O

root@solaris3:~# virtinfo
Domain role: LDoms guest


The communication channels used for fault proxying (control domain to SP/SPP, and control domain to IO/root domains) can be observed:

primary# fmstat -T
 id state module              authority
...
  7   RUN ip-transport        server-name=169.254.182.76:24
  8   RUN etm                 system-mfg=unknown,system-name=unknown,system-part=unknown,system-serial=unknown,sys-comp-mfg=unknown,sys-comp-name=unknown,sys-comp-part=unknown,sys-comp-serial=unknown,server-name=solaris2,host-id=84fa6db4
  9   RUN etm                 system-mfg=unknown,system-name=unknown,system-part=unknown,system-serial=unknown,sys-comp-mfg=unknown,sys-comp-name=unknown,sys-comp-part=unknown,sys-comp-serial=unknown,server-name=solaris1,host-id=84fade7c


In the above:

  • id 7 is used for FMA fault proxying over the interconnect (control domain to SP/SPP channel)
  • id 8 is used for FMA IO fault proxying between ldg2 and primary
  • id 9 is used for FMA IO fault proxying between ldg1 and primary
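
A host-id reported by fmstat can be mapped back to a domain name using the parseable ldm output shown above (a sketch; grep -B1 prints the DOMAIN line that precedes the matching HOSTID line):

primary# ldm list -l -p | egrep -e '(DOMAIN|HOSTID)' | grep -B1 84fade7c
DOMAIN|name=ldg1|state=active|flags=normal|cons=5000|ncpu=10|...
HOSTID|hostid=0x84fade7c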


On the guest side:

root@solaris1:~# fmstat -T | grep etm
  2   RUN etm                 system-mfg=Oracle-Corporation,system-name=SPARC-M5-32,system-part=31486290%2B1%2B1,system-serial=AK00087872,sys-comp-mfg=Oracle-Corporation,sys-comp-name=SPARC-M5-32,sys-comp-part=31486290%2B1%2B1,sys-comp-serial=AK00087872,server-name=primary,host-id=86246a9f

root@solaris2:/opt# fmstat -T | grep etm
  2   RUN etm                 system-mfg=Oracle-Corporation,system-name=SPARC-M5-32,system-part=31486290%2B1%2B1,system-serial=AK00087872,sys-comp-mfg=Oracle-Corporation,sys-comp-name=SPARC-M5-32,sys-comp-part=31486290%2B1%2B1,sys-comp-serial=AK00087872,server-name=primary,host-id=86246a9f

root@solaris3:~# fmstat -T | grep etm
root@solaris3:~#


Note: no channel exists between ldg3 and primary, as ldg3 is a guest domain that does not own any IO resource.

 

Troubleshooting Steps

When an IO fault is reported on the control domain and proxied to the SP/SPP, the source of the error must be identified so the fault can be addressed on the domain that diagnosed it.
As a result, for a proper diagnosis, the following information is required (a collection sketch follows the list):

  • snapshot of the platform,
  • explorer from the control domain of the respective Host,
  • explorer from the IO/root domain owning the resource.
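
As a sketch of how this data is typically collected (the snapshot target URI and credentials are placeholders):

On the SP (ILOM CLI):

-> set /SP/diag/snapshot dataset=normal
-> set /SP/diag/snapshot dump_uri=sftp://user@sftphost/dir

On the control domain and on the IO/root domain, assuming Oracle Explorer is installed in its default location:

# /opt/SUNWexplo/bin/explorer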


Example of a ZFS error reported on the SP as a fault:

fma/@persist@faultdiags@faults.log

2014-10-21/12:53:39  2e17ff9c-8872-ef4e-ff19-e6a179cb60dc   ZFS-8000-D3
    list_sz = 1

    fault[0] = fault.fs.zfs.device
        certainty = 100.0 %
        FRU       = -
        ASRU      = -
        RESOURCE  = -
        chassis_serial_number = AK00097030



Reported on the control domain, from the control domain explorer:

fma/fmadm-faulty.out

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Oct 21 12:53:39 2e17ff9c-8872-ef4e-ff19-e6a179cb60dc  ZFS-8000-D3    Major

Problem Status    : solved
Diag Engine       : zfs-diagnosis / 1.0
System
    Manufacturer  : unknown
    Name          : unknown
    Part_Number   : unknown
    Serial_Number : unknown
    Host_ID       : 84f8871e

 

fma/fmdump-V.out

Oct 21 2014 12:54:07.857733000 2e17ff9c-8872-ef4e-ff19-e6a179cb60dc ZFS-8000-D3

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 2e17ff9c-8872-ef4e-ff19-e6a179cb60dc
        code = ZFS-8000-D3
        diag-time = 1413914019 443053
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x1
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x1
                        system-mfg = unknown
                        system-name = unknown
                        system-part = unknown
                        system-serial = unknown
                        sys-comp-mfg = unknown
                        sys-comp-name = unknown
                        sys-comp-part = unknown
                        sys-comp-serial = unknown
                        server-name = m51s2
                        host-id = 84f8871e


The host-id and server-name help identify the domain that diagnosed the fault:

ldom/ldm_ls-dom_-l.out

NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
secondary        active     -n--v-  5000    32    16G      2.1%  2.1%  200d 2h

SOFTSTATE
Solaris running

UUID
    1cb48b92-64ad-e0b4-90cd-bbdaeef2320d

MAC
    00:14:4f:f8:87:1e

HOSTID
    0x84f8871e


fma/fmstat-T.out

  3   RUN etm                 system-mfg=unknown,system-name=unknown,system-part=unknown,system-serial=unknown,sys-comp-mfg=unknown,sys-comp-name=unknown,sys-comp-part=unknown,sys-comp-serial=unknown,server-name=m51s2,host-id=84f8871e


ldom/log/vntsd/secondary/console-log

root@m51s2 #
root@m51s2 #
m51s2 console login: Oct 21 16:49:08 m51s2 last message repeated 1 time


At this point, an explorer from the domain with host-id 84f8871e (here, the domain named secondary, hostname m51s2) is required to check the FMA logs and address the original issue.
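
On that domain, the fault and the affected pool can then be examined directly (a sketch; ZFS-8000-D3 points at a failed device in a ZFS pool, and fmdump -u limits the output to the events of this particular case):

root@m51s2 # fmadm faulty
root@m51s2 # fmdump -eV -u 2e17ff9c-8872-ef4e-ff19-e6a179cb60dc
root@m51s2 # zpool status -x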


Attachments
This solution has no attachment