Asset ID: 1-79-1683087.1
Update Date: 2017-10-11
Solution Type: Predictive Self-Healing Sure Solution
Doc ID: 1683087.1
SPARC M5-32 and M6-32 Servers: Interconnect, FMA Fault Proxying and LDOM configuration
Related Categories:
- PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx-32
Applies to:
SPARC M5-32 - Version All Versions and later
SPARC M6-32 - Version All Versions and later
Information in this document applies to any platform.
Purpose
This document provides details about the Interconnect, FMA Fault Proxying and LDOM configuration on the SPARC M5-32 and M6-32 Servers.
Details
Role of interconnect
The Interconnect provides an internal communication interface between the Host/PDom and the service processor.
On the M5-32/M6-32, it is actually an interface between the Host (control/primary domain) and the Pdomain-SPP (aka Golden SPP) of that Host.
On the M5-32/M6-32, this IP address is used to reach the Pdomain-SPP, not the SP as on the T-series servers.
The Interconnect uses an Ethernet-over-USB interface (usbecm).
It can be controlled/managed:
- from the SP:
/Servers/PDomains/PDomain_x/SP/network/interconnect
- from the Solaris Host:
using the 'ilomconfig' command from Solaris or from the Oracle Hardware Management Pack (OHMP).
Note: with Solaris 11.1, ilomconfig is bundled with Solaris; the Oracle Hardware Management Pack is not required to configure the Interconnect.
See:
- Oracle ILOM Administrator's Guide for Configuration and Maintenance, Firmware Release 3.2.1
- SPARC M5-32 and SPARC M6-32 Servers Administration Guide
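From the Solaris host, a quick way to check that the Interconnect is in place is to verify the dedicated SMF service and list the Interconnect settings. This is a minimal sketch based on the commands used later in this document; the prompt/hostname is illustrative and the service FMRI is the one referenced in the "Additional Information" section.
root@primary:~# svcs svc:/network/ilomconfig-interconnect:default
root@primary:~# ilomconfig list interconnect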
The Fault Management Architecture (FMA) uses the Interconnect to proxy the faults diagnosed on the SP/SPP to the Host, and to proxy the faults diagnosed on the Solaris Host to the SP/SPP. This is known as FMA Fault Proxying: keeping the FMA faults in sync between the Host and the Active SP.
When the Interconnect is not available, FMA Fault Proxying cannot work properly.
Note: when the FMA Fault Proxying mechanism does not work, FMA on the SP and on the Solaris Host still works, but the faults diagnosed on one side are no longer proxied to the other side.
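On the Solaris host, the proxying transport and the currently known faults can be checked with standard FMA commands. This is a minimal sketch; the fmstat -T output is shown further below, and fmadm faulty lists the active faults, including those proxied from the SP.
root@primary:~# fmstat -T | grep ip-transport
root@primary:~# fmadm faulty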
The Interconnect is enabled and configured by default and should always be enabled from both the SP and the Host.
The Interconnect uses the usbecm interface, on top of which an IP address is configured.
The usbecm interface uses the root complex/bus from CMUx/CMP0, where x is the lowest-numbered CMU in the DCU (CMU0 in DCU#0, CMU4 in DCU#1, etc.). So the usbecm interface uses the following root complexes/buses, depending on the PDom configuration and the Pdomain-SPP role selection:
- DCU#0 - SPP0 : pci_1,
- DCU#1 - SPP1 : pci_17,
- DCU#2 - SPP2 : pci_33,
- DCU#3 - SPP3 : pci_49
See SPARC M5-32 and M6-32 Servers: Device Paths (Doc ID 1540545.1).
When LDOMs are configured, if the usbecm connection is not available to the control domain, FMA Fault Proxying does not work. See below.
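When Oracle VM Server for SPARC (LDoms) is in use, a quick way to confirm that these root complexes are still owned by the control domain is the 'ldm list-io' command. This is a minimal sketch; the grep pattern assumes the bus names listed above, and the prompt is illustrative. The DOMAIN column should show 'primary' for the bus behind the current Pdomain-SPP.
root@primary:~# ldm list-io | grep -wE 'pci_1|pci_17|pci_33|pci_49'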
Identify the Interconnect path
-> show /Servers/PDomains/PDomain_1/SP/network/interconnect
/Servers/PDomains/PDomain_1/SP/network/interconnect
Targets:
Properties:
hostmanaged = true
type = USB Ethernet
ipaddress = 169.254.182.76
ipnetmask = 255.255.255.0
spmacaddress = 02:21:28:57:47:16
hostmacaddress = 02:21:28:57:47:17
It is a communication path to the Pdomain-SPP.
Example:
-> show /Servers/PDomains/PDomain_1/HOST/ sp_name
/Servers/PDomains/PDomain_1/HOST
Properties:
sp_name = /SYS/SPP2
The available physical paths:
root@m5-32-sca11-a-pdom01:~# prtdiag -v | grep -i usb
/SYS/SPP1/USB PCIE usb-pciexclass,0c0330 5.0GTx1
/pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0
/SYS/SPP2/USB PCIE usb-pciexclass,0c0330 5.0GTx1
/pci@b40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0
root@m5-32-sca11-a-pdom01:/var/tmp# hotplug list -cv
Connection State Description
________________________________________________________________________________
...
pcie0 ENABLED PCIe-Native
Device Usage
___________________________________________________________________________
pci@0 -
pci@0 -
usb@0 -
communications@6 Network interface net24
net24: hosts IP addresses: 169.254.182.77
pci@5 -
display@0 framebuffer device
The IP configuration can be checked as follows:
root@m5-32-sca11-a-pdom01:~# ilomconfig list interconnect
Interconnect
============
State: enabled
Type: USB Ethernet
SP Interconnect IP Address: 169.254.182.76
Host Interconnect IP Address: 169.254.182.77
Interconnect Netmask: 255.255.255.0
SP Interconnect MAC Address: 02:21:28:57:47:16
Host Interconnect MAC Address: 02:21:28:57:47:17
So, by default, the Host uses the IP address 169.254.182.77 to communicate with the Pdomain-SPP at the IP address 169.254.182.76.
On a multi-DCU PDom, any of the SPPs can be selected as Pdomain-SPP, and this role may change after a stop/start operation.
On a single-DCU PDom, there is obviously only one SPP available, and it is selected as Pdomain-SPP.
It is possible to identify the path used by the usbecm interface.
On this dual-DCU (DCU1 + DCU2) PDom, SPP2 is the Pdomain-SPP.
-> show /Servers/PDomains/PDomain_1/HOST/ sp_name
/Servers/PDomains/PDomain_1/HOST
Properties:
sp_name = /SYS/SPP2
And the usbecm interface uses the path to SPP2:
root@m5-32-sca11-a-pdom01:~# ipadm
NAME CLASS/TYPE STATE UNDER ADDR
...
net46 ip ok -- --
net46/v4 static ok -- 169.254.182.77/24
root@m5-32-sca11-a-pdom01:~# dladm show-phys -L | grep usb
net46 usbecm0 --
root@m5-32-sca11-a-pdom01:~# grep usb /etc/path_to_inst
"/pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0" 0 "xhci"
"/pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6" 2 "usbecm"
"/pci@b40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0" 1 "xhci"
"/pci@b40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6" 0 "usbecm"
As described in SPARC M5-32 and M6-32 Servers: Device Paths (Doc ID 1540545.1), /pci@b40/pci@1/pci@0/pci@1/pci@0/pci@0 is the path to SPP2.
If at some point the Golden SPP role switches to SPP1, the interconnect will use usbecm2 (/pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0).
Configuration rules
Empty DCU (no CMU installed):
- should not be assigned to a Host (/Servers/PDomains/PDomain_1/HOST dcus_assigned; see the example after this list),
- the SPP of the empty DCU might become the Pdomain-SPP at some point,
- because there is no CMU in the empty DCU, the usbecm interface cannot use any root complex,
- this may result in a non-functioning Interconnect and therefore no FMA Fault Proxying.
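The DCUs currently assigned to a Host can be checked from the Active SP, using the same 'show' syntax as elsewhere in this document (PDomain_1 is just an example):
-> show /Servers/PDomains/PDomain_1/HOST dcus_assigned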
The root complexes used for the usbecm interface:
- must be owned by the control/primary domain (pci_1, pci_17, pci_33, pci_49),
- when a root complex used for the usbecm interface is assigned to a non-primary domain, FMA Fault Proxying does not work,
- this also applies to multi-DCU PDoms: because the Pdomain-SPP may change at some point, the root complexes need to remain available in the control domain even after a Pdomain-SPP role change,
- as long as the root complex is owned by the primary domain, it is possible to assign the PCIe endpoint to a non-primary domain, i.e. PCIE2 (full configuration) or PCIE1/PCIE2 (half configuration); the card installed in the slot must support DIO (see the sketch after this list),
- References:
- Oracle VM Server for SPARC 3.1 Administration Guide / Creating an I/O Domain by Assigning PCIe Endpoint Devices / How to Create an I/O Domain by Assigning a PCIe Endpoint Device
- Oracle VM Server for SPARC PCIe Direct I/O and SR-IOV Features (Doc ID 1325454.1)
- unrelated to FMA Fault Proxying, but listed here because it depends on the same root complex ownership: when a root complex used for the usbecm interface is assigned to a non-primary domain, the storage redirection used for rcdrom boot does not work,
- the rKVMS device path used for storage redirection is the same path as the Interconnect path described in this document,
- no rcdrom device will be listed or available in OBP if the required root complexes are not assigned to the control domain.
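For reference, assigning a PCIe endpoint (DIO) to a non-primary domain, while leaving the root complex with the primary domain, typically follows the flow below. This is a minimal sketch based on the Oracle VM Server for SPARC Administration Guide referenced above; the slot name /SYS/IOU2/PCIE2 and the domain name ldg1 are hypothetical, use the names reported by 'ldm list-io'.
root@primary:~# ldm list-io
root@primary:~# ldm remove-io /SYS/IOU2/PCIE2 primary
root@primary:~# ldm add-io /SYS/IOU2/PCIE2 ldg1
Note that removing an endpoint from the primary domain usually requires a delayed reconfiguration and a reboot of the primary domain; see the Administration Guide for the complete procedure.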
Additional Information
- When the SPARC M5-32 and M6-32 Servers run SysFW 9.1.1.a or 9.1.1.b, FMA Fault Proxying may not run properly because of bug 17768292: the ip-transport module is not loaded (fmstat -T). The SysFW should be upgraded to the latest version. When the module is properly loaded, you should see:
# fmstat -T | grep ip-transport
3 RUN ip-transport server-name=169.254.182.76:24
- During the boot sequence, the following warning may be reported:
WARNING: /pci@340/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0 (xhci0): Connecting device on port 6 failed
which means that "the driver failed to enumerate the device connected on port <number> of hub. If enumeration fails, disconnect and re-connect."; see man hubd(7D).
You can use the commands above to confirm that this path is used for the Ethernet-over-USB interface and that the Interconnect and FMA Fault Proxying are properly configured.
You can also check the /var/adm/messages file and/or the host console logs to confirm that, despite the warning, the re-connect was successful:
Jul 1 15:49:57 usba: [ID 691482 kern.warning] WARNING: /pci@340/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0 (xhci0): Connecting device on port 6 failed
Jul 1 15:50:02 last message repeated 1 time
Jul 1 15:50:06 mac: [ID 469746 kern.info] NOTICE: usbecm2 registered
Jul 1 15:50:06 usba: [ID 912658 kern.info] USB 1.10 device (usb430,a4a2) operating at full speed (USB 1.x) on USB 2.0 root hub: communications@6, usbecm2 at bus address 1
Jul 1 15:50:06 usba: [ID 349649 kern.info] SunMicro Virtual Eth Device
Jul 1 15:50:06 genunix: [ID 936769 kern.info] usbecm2 is /pci@340/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6
Jul 1 15:50:06 genunix: [ID 408114 kern.info] /pci@340/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6 (usbecm2) online
Jul 1 15:50:06 mac: [ID 435574 kern.info] NOTICE: usbecm2 link up, 10 Mbps, full duplex
The impact is only on FMA Fault Proxying.
FMA ereport forwarding is not impacted; i.e. any error detected on the Solaris side that requires an SP diagnosis to produce a fault (GM/FERG) is still forwarded to the SP via the SPP, as it uses a different channel than the Interconnect.
Special case: Golden SPP failover
When a Host is composed of multiple DCUs, one SPP is selected as the Pdomain-SPP (Golden SPP). The Golden SPP then hosts the Interconnect communication with the Host.
At some point, the Golden SPP role may switch from one SPP to another, for several reasons.
A Golden SPP switch is then reported in the event logs of the Active SP as follows:
45220 Thu Sep 18 00:05:34 2014 Reset Log minor
/Servers/PDomains/PDomain_3 is now managed by PDomain SPP /SYS/SPP3.
...
45140 Wed Sep 17 13:34:39 2014 Reset Log minor
/Servers/PDomains/PDomain_3 is now managed by PDomain SPP /SYS/SPP1.
If the host is up when the Golden SPP failover occurs, the switch is visible in the Solaris /var/adm/messages file on the control domain:
Sep 18 00:05:29 pdom03 mac: [ID 486395 kern.info] NOTICE: usbecm2 link down
Sep 18 00:06:22 pdom03 mac: [ID 469746 kern.info] NOTICE: usbecm0 registered
Sep 18 00:06:22 pdom03 usba: [ID 912658 kern.info] USB 1.10 device (usb430,a4a2) operating at full speed (USB 1.x) on USB 2.0 root hub: communications@6, usbecm0 at bus address 1
Sep 18 00:06:22 pdom03 usba: [ID 349649 kern.info] SunMicro Virtual Eth Device
Sep 18 00:06:22 pdom03 genunix: [ID 936769 kern.info] usbecm0 is /pci@f40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6
Sep 18 00:06:22 pdom03 genunix: [ID 408114 kern.info] /pci@f40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6 (usbecm0) online
Sep 18 00:06:22 pdom03 genunix: [ID 408114 kern.info] /pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6 (usbecm2) removed
where
# grep usb /etc/path_to_inst
"/pci@f40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0" 0 "xhci"
"/pci@f40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6" 0 "usbecm"
"/pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0" 2 "xhci"
"/pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6" 2 "usbecm"
# prtdiag -v | grep -i usb
/SYS/SPP1/USB PCIE usb-pciexclass,0c0330 5.0GT/x1 5.0GT/x1
/pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0
/SYS/SPP3/USB PCIE usb-pciexclass,0c0330 5.0GT/x1 5.0GT/x1
/pci@f40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0
A Golden SPP role switch can occur:
- when restarting (stop/start) the host,
- at any time while the host is up and running.
1- when restarting (stop/start) the host
If a Golden SPP role switch occurs while restarting the host (i.e. the Golden SPP changed between the stop and the start operations), Solaris and the svc:/network/ilomconfig-interconnect:default service will properly initialize the usbecm instance and the Interconnect communication path.
No extra step should be required in this case.
The commands described above can be used to confirm that it works properly.
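For convenience, a minimal verification sequence based on the commands already shown in this document (the prompt is illustrative):
root@primary:~# dladm show-phys -L | grep usb
root@primary:~# ipadm show-addr | grep 169.254
root@primary:~# ilomconfig list interconnect
root@primary:~# fmstat -T | grep ip-transport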
2- at any time while the host is up and running
If a Golden SPP switch occurs while the host is running Solaris, the Interconnect communication may no longer work.
The former usbecm interface will be down and removed:
Sep 18 01:42:48 pdom03 mac: [ID 486395 kern.info] NOTICE: usbecm2 link down
Sep 18 01:44:02 pdom03 genunix: [ID 408114 kern.info] /pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6 (usbecm2) removed
The new usbecm interface will be brought back up:
Sep 18 01:44:08 pdom03 mac: [ID 469746 kern.info] NOTICE: usbecm0 registered
Sep 18 01:44:08 pdom03 usba: [ID 912658 kern.info] USB 1.10 device (usb430,a4a2) operating at full speed (USB 1.x) on USB 2.0 root hub: communications@6, usbecm0 at bus address 1
Sep 18 01:44:08 pdom03 usba: [ID 349649 kern.info] SunMicro Virtual Eth Device
Sep 18 01:44:08 pdom03 genunix: [ID 936769 kern.info] usbecm0 is /pci@f40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6
Sep 18 01:44:08 pdom03 genunix: [ID 408114 kern.info] /pci@f40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6 (usbecm0) online
Sep 18 01:44:09 pdom03 mac: [ID 435574 kern.info] NOTICE: usbecm0 link up, 10 Mbps, full duplex
And the new interface is reported via dladm:
# dladm show-phys -P | grep usb
net79 usbecm2 Ethernet r----
net71 usbecm0 Ethernet -----
But the interface is not yet fully functional and must be configured manually:
root@pdom03:~# ipadm
NAME CLASS/TYPE STATE UNDER ADDR
lo0 loopback ok -- --
lo0/v4 static ok -- 127.0.0.1/8
lo0/v6 static ok -- ::1/128
net0 ip ok -- --
net0/v4 static ok -- 10.133.111.158/21
net0/v6 addrconf ok -- fe80::210:e0ff:fe24:6acc/10
net71 ip down -- --
net79 ip failed -- --
net79/v4 static inaccessible -- 169.254.182.77/24
root@pdom03:~# ipadm delete-addr net79/v4
root@pdom03:~# ipadm create-addr -T static -a 169.254.182.77/24 net71/v4
root@pdom03:~# ipadm
NAME CLASS/TYPE STATE UNDER ADDR
lo0 loopback ok -- --
lo0/v4 static ok -- 127.0.0.1/8
lo0/v6 static ok -- ::1/128
net0 ip ok -- --
net0/v4 static ok -- 10.133.111.158/21
net0/v6 addrconf ok -- fe80::210:e0ff:fe24:6acc/10
net71 ip ok -- --
net71/v4 static ok -- 169.254.182.77/24
net79 ip failed -- --
The commands described above can be used to confirm that it works properly.
At this point, you can restart ('svcadm restart' or 'svcadm disable/enable') the svc:/network/ilomconfig-interconnect:default service.
You can also disable/enable the Interconnect ('ilomconfig disable|enable interconnect').
This should not be required.
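For reference, the corresponding commands are shown below (assuming the default service FMRI mentioned above; again, they are normally not needed):
root@pdom03:~# svcadm restart svc:/network/ilomconfig-interconnect:default
or
root@pdom03:~# svcadm disable svc:/network/ilomconfig-interconnect:default
root@pdom03:~# svcadm enable svc:/network/ilomconfig-interconnect:default
root@pdom03:~# ilomconfig disable interconnect
root@pdom03:~# ilomconfig enable interconnect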
Corner case:
If, for some reason, the usbecm interface fails to be registered:
Sep 18 05:52:08 pdom03 mac: [ID 486395 kern.info] NOTICE: usbecm0 link down
Sep 18 05:52:11 pdom03 genunix: [ID 408114 kern.info] /pci@f40/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6 (usbecm0) removed
Sep 18 05:52:58 pdom03 usba: [ID 723738 kern.info] /pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6 (usbecm2): usbecm_restore_device_state: Device has been reconnected but data may have been lost
Sep 18 05:52:58 pdom03 mac: [ID 435574 kern.info] NOTICE: usbecm2 link up, 10 Mbps, full duplex
Sep 18 05:52:58 pdom03 genunix: [ID 408114 kern.info] /pci@740/pci@1/pci@0/pci@1/pci@0/pci@0/usb@0/communications@6 (usbecm2) online
All the usbecm instances are reported as removed.
root@pdom03:~# dladm show-phys -P | grep usb
net79 usbecm2 Ethernet r----
net71 usbecm0 Ethernet r----
Then restarting the svc:/network/ilomconfig-interconnect:default service will most likely report:
# tail -f /var/svc/log/network-ilomconfig-interconnect:default.log
[ Sep 18 06:01:43 Executing start method ("/lib/svc/method/svc-ilomconfig-interconnect start"). ]
ERROR: Internal error
ERROR: Internal error
ERROR: Internal error
ERROR: Internal error
ERROR: Internal error
[ Sep 18 06:04:22 Method "start" exited with status 0. ]
As will disabling/enabling the Interconnect:
# ilomconfig disable interconnect
ERROR: Internal error
# ilomconfig enable interconnect
ERROR: Internal error
At this point, it should be possible to manually configure the network interface (ipadm delete-addr/create-addr) as described above.
FMA Fault Proxying should then be able to communicate properly with the SPP:
# fmstat -T | grep ip-transport
9 RUN ip-transport server-name=169.254.182.76:24
However, it may be recommended to reboot the primary domain so that the usbecm interface is properly registered again.
Note that when the Interconnect communication is re-established, you may see a replay of some faults from the SP; i.e. the SP re-proxying some faults to the host. A fault replay (vs. a new fault) can be confirmed by checking the UUID (fmdump) and the diag-time (fmdump -Vu UUID), as shown below.
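For example (the UUID is a placeholder; use the UUID reported by fmdump):
root@pdom03:~# fmdump
root@pdom03:~# fmdump -Vu <UUID> | grep diag-time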
Note: at some point, it will be possible to manually switch the Golden SPP role via the Host initiate_sp_failover property:
set /Hostx initiate_sp_failover=true
After performing a manual Golden SPP failover, the same procedure as described in "2- at any time while the host is up and running" should be followed.
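After such a manual failover, the same checks as before apply (commands already shown in this document):
-> show /Servers/PDomains/PDomain_1/HOST sp_name
root@pdom03:~# ipadm
root@pdom03:~# fmstat -T | grep ip-transport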
Attachments
This solution has no attachment