
Asset ID: 1-75-1942045.1
Update Date: 2017-10-11

Solution Type: Troubleshooting

Solution 1942045.1: IO fault proxying in an LDOM environment


Related Items
  • SPARC T5-4
  • SPARC T5-2
  • SPARC M6-32
  • SPARC M5-32
  • SPARC M7-8
  • SPARC T5-8
  • SPARC M7-16
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx-32




In this Document
Purpose
Troubleshooting Steps


Applies to:

SPARC T5-8 - Version All Versions to All Versions [Release All Releases]
SPARC T5-2 - Version All Versions to All Versions [Release All Releases]
SPARC M6-32 - Version All Versions to All Versions [Release All Releases]
SPARC M5-32 - Version All Versions to All Versions [Release All Releases]
SPARC T5-4 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Purpose

On SPARC M5-32, M6-32, and T5 servers, when an error is reported on the SP/SPP running ILOM or on the host (primary domain) running Solaris, the fault is diagnosed by the side that reported the error, i.e. the SP/SPP running ILOM or the host (primary domain) running Solaris.

In order to keep the FMA information in sync between the SP/SPP and the host, the diagnosed faults are proxied to the counterpart.
FMA fault proxying occurs between the primary domain of each host (up to 4 hosts/PDoms on Mx-32 servers) and the SP/SPP over the interconnect channel.
See SPARC M5-32 and M6-32 Servers: Interconnect, FMA Fault Proxying and LDOM configuration (Doc ID 1683087.1).
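
On the host side, the interconnect appears as a link-local 169.254.x.x address (the same network that shows up in the ip-transport line of the fmstat -T output further below). A quick check, assuming Solaris 11 on the control domain:

primary# ipadm show-addr | grep 169.254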

In an LDOM configuration, it is possible to assign IO resources (root complexes/buses, PCIe cards) to dedicated guest domains known as IO domains or root domains.
See the Oracle VM Server for SPARC 3.1 Administration Guide.
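
As a sketch of how such assignments are typically made (the bus and slot names below are placeholders; ldm list-io reports the actual names on a given platform, and removing a root complex or slot from the primary requires a delayed reconfiguration and reboot):

primary# ldm list-io
primary# ldm remove-io pci_1 primary
primary# ldm add-io pci_1 ldg1
primary# ldm remove-io /SYS/MB/PCIE5 primary
primary# ldm add-io /SYS/MB/PCIE5 ldg2

After the reassignment, ldg1 owns the pci_1 root complex (root domain) and ldg2 owns the /SYS/MB/PCIE5 slot via Direct I/O (IO domain).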

When an error related to an IO resource is detected, it is reported and diagnosed by the domain that owns the resource.
This means that the IO fault is diagnosed on the IO or root domain.

In order to make sure that the primary domain is aware of any IO fault diagnosed on an IO or root domain, the faults are proxied between the control domain and the IO/root domain. This is done using ETM over an LDC channel.
The faults are then proxied to the SP/SPP via the interconnect.
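
Whether the transport modules are loaded can be verified from the control domain (a minimal sketch; fmadm config lists all fmd modules, among which the etm and ip-transport transports should appear):

primary# fmadm config | egrep 'etm|ip-transport'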

Note: when a guest domain is using services (vnet, vdisk) from a control/IO/root domain, it does not own the IO resource. Any error reported for the IO resource is still diagnosed by the control/IO/root domain that owns it.

As a result, the FMA faults are in sync between the SP/SPP, the control domain, and the IO/root domains.

Basic example of a configuration where the following domains are configured on the host:

  • control domain : primary
  • root domain : ldg1
  • IO domain : ldg2
  • guest domain : ldg3

 

primary# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    16    10G      0.1%  0.1%  10h 34m
ldg1             active     -n----  5000    10    10G      0.1%  0.1%  10h 29m
ldg2             active     -n----  5001    10    10G      0.0%  0.0%  10h 26m
ldg3             active     -n----  5002    10    10G      0.1%  0.1%  11h 41m

primary# ldm list -l -p | egrep -e '(DOMAIN|HOSTID)'
DOMAIN|name=primary|state=active|flags=normal,control,vio-service|cons=UART|ncpu=16|mem=10737418240|util=0.2|uptime=38296|norm_util=0.2|softstate=Solaris running
HOSTID|hostid=0x86246a9f
DOMAIN|name=ldg1|state=active|flags=normal|cons=5000|ncpu=10|mem=10737418240|util=0.1|uptime=37989|norm_util=0.1|softstate=Solaris running
HOSTID|hostid=0x84fade7c
DOMAIN|name=ldg2|state=active|flags=normal|cons=5001|ncpu=10|mem=10737418240|util=0.0|uptime=37820|norm_util=0.0|softstate=Solaris running
HOSTID|hostid=0x84fa6db4
DOMAIN|name=ldg3|state=active|flags=normal|cons=5002|ncpu=10|mem=10737418240|util=0.1|uptime=42342|norm_util=0.1|softstate=Solaris running
HOSTID|hostid=0x84f8d08f

root@solaris1:~# virtinfo
Domain role: LDoms guest I/O root

root@solaris2:/opt# virtinfo
Domain role: LDoms guest I/O

root@solaris3:~# virtinfo
Domain role: LDoms guest


The communication channels used for fault proxying (control domain to SP/SPP, and control domain to IO/root domains) can be observed:

primary# fmstat -T
 id state module              authority
...
  7   RUN ip-transport        server-name=169.254.182.76:24
  8   RUN etm                 system-mfg=unknown,system-name=unknown,system-part=unknown,system-serial=unknown,sys-comp-mfg=unknown,sys-comp-name=unknown,sys-comp-part=unknown,sys-comp-serial=unknown,server-name=solaris2,host-id=84fa6db4
  9   RUN etm                 system-mfg=unknown,system-name=unknown,system-part=unknown,system-serial=unknown,sys-comp-mfg=unknown,sys-comp-name=unknown,sys-comp-part=unknown,sys-comp-serial=unknown,server-name=solaris1,host-id=84fade7c


In the above:

  • id 7 is used for FMA fault proxying over the interconnect (control domain to SP/SPP channel)
  • id 8 is used for FMA IO fault proxying between ldg2 and primary
  • id 9 is used for FMA IO fault proxying between ldg1 and primary
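
A host-id reported by fmstat can be mapped back to a domain name using the parseable ldm output shown above (a sketch; grep -B1 prints the DOMAIN line that precedes the matching HOSTID line):

primary# ldm list -l -p | egrep -e '(DOMAIN|HOSTID)' | grep -B1 84fade7c
DOMAIN|name=ldg1|state=active|flags=normal|cons=5000|ncpu=10|...
HOSTID|hostid=0x84fade7c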


On the guest side:

root@solaris1:~# fmstat -T | grep etm
  2   RUN etm                 system-mfg=Oracle-Corporation,system-name=SPARC-M5-32,system-part=31486290%2B1%2B1,system-serial=AK00087872,sys-comp-mfg=Oracle-Corporation,sys-comp-name=SPARC-M5-32,sys-comp-part=31486290%2B1%2B1,sys-comp-serial=AK00087872,server-name=primary,host-id=86246a9f

root@solaris2:/opt# fmstat -T | grep etm
  2   RUN etm                 system-mfg=Oracle-Corporation,system-name=SPARC-M5-32,system-part=31486290%2B1%2B1,system-serial=AK00087872,sys-comp-mfg=Oracle-Corporation,sys-comp-name=SPARC-M5-32,sys-comp-part=31486290%2B1%2B1,sys-comp-serial=AK00087872,server-name=primary,host-id=86246a9f

root@solaris3:~# fmstat -T | grep etm
root@solaris3:~#


Note: no channel exists between ldg3 and primary, as ldg3 is a guest domain that does not own any IO resource.

 

Troubleshooting Steps

When an IO fault is reported on the control domain and proxied to the SP/SPP, the source of the error must be identified so the fault can be addressed on the domain that diagnosed it.
As a result, for a proper diagnosis, the following information is required (a collection sketch follows the list):

  • snapshot of the platform,
  • explorer from the control domain of the respective Host,
  • explorer from the IO/root domain owning the resource.
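
As a sketch of how this data is typically collected (the snapshot target URI and credentials are placeholders):

On the SP (ILOM CLI):

-> set /SP/diag/snapshot dataset=normal
-> set /SP/diag/snapshot dump_uri=sftp://user@sftphost/dir

On the control domain and on the IO/root domain, assuming Oracle Explorer is installed in its default location:

# /opt/SUNWexplo/bin/explorer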


Example of a ZFS error reported on the SP as a fault:

fma/@persist@faultdiags@faults.log

2014-10-21/12:53:39  2e17ff9c-8872-ef4e-ff19-e6a179cb60dc   ZFS-8000-D3
    list_sz = 1

    fault[0] = fault.fs.zfs.device
        certainty = 100.0 %
        FRU       = -
        ASRU      = -
        RESOURCE  = -
        chassis_serial_number = AK00097030



Reported on the control domain, from the control domain explorer:

fma/fmadm-faulty.out

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Oct 21 12:53:39 2e17ff9c-8872-ef4e-ff19-e6a179cb60dc  ZFS-8000-D3    Major

Problem Status    : solved
Diag Engine       : zfs-diagnosis / 1.0
System
    Manufacturer  : unknown
    Name          : unknown
    Part_Number   : unknown
    Serial_Number : unknown
    Host_ID       : 84f8871e

 

fma/fmdump-V.out

Oct 21 2014 12:54:07.857733000 2e17ff9c-8872-ef4e-ff19-e6a179cb60dc ZFS-8000-D3

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 2e17ff9c-8872-ef4e-ff19-e6a179cb60dc
        code = ZFS-8000-D3
        diag-time = 1413914019 443053
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x1
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x1
                        system-mfg = unknown
                        system-name = unknown
                        system-part = unknown
                        system-serial = unknown
                        sys-comp-mfg = unknown
                        sys-comp-name = unknown
                        sys-comp-part = unknown
                        sys-comp-serial = unknown
                        server-name = m51s2
                        host-id = 84f8871e


The host-id and server-name help identify the domain that diagnosed the fault:

ldom/ldm_ls-dom_-l.out

NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
secondary        active     -n--v-  5000    32    16G      2.1%  2.1%  200d 2h

SOFTSTATE
Solaris running

UUID
    1cb48b92-64ad-e0b4-90cd-bbdaeef2320d

MAC
    00:14:4f:f8:87:1e

HOSTID
    0x84f8871e


fma/fmstat-T.out

  3   RUN etm                 system-mfg=unknown,system-name=unknown,system-part=unknown,system-serial=unknown,sys-comp-mfg=unknown,sys-comp-name=unknown,sys-comp-part=unknown,sys-comp-serial=unknown,server-name=m51s2,host-id=84f8871e


ldom/log/vntsd/secondary/console-log

root@m51s2 #
root@m51s2 #
m51s2 console login: Oct 21 16:49:08 m51s2 last message repeated 1 time


At this point, an explorer from the domain with host-id 84f8871e (here, the domain named secondary, hostname m51s2) is required to check the FMA logs and address the original issue.
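
On that domain, the fault and the affected pool can then be examined directly (a sketch; ZFS-8000-D3 points at a failed device in a ZFS pool, and fmdump -u limits the output to the events of this particular case):

root@m51s2 # fmadm faulty
root@m51s2 # fmdump -eV -u 2e17ff9c-8872-ef4e-ff19-e6a179cb60dc
root@m51s2 # zpool status -x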


Attachments
This solution has no attachment