
Asset ID: 1-72-2340619.1
Update Date: 2018-03-26
Keywords:

Solution Type: Problem Resolution Sure

Solution 2340619.1: SuperCluster: M8 IO domains may fail to boot with "ERROR: Last Trap: Fast Data Access MMU Miss"


Related Items
  • Oracle SuperCluster M8 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>SPARC SuperCluster>DB: SuperCluster_EST


IO domains on the SuperCluster M8 platform may fail to boot with the error "ERROR: Last Trap: Fast Data Access MMU Miss".
This applies to SuperCluster M8 systems where Fortville cards (Part# 7319817 - Quad 10-Gigabit or Dual 40-Gigabit Ethernet QSFP+) are used.

In this Document
Symptoms
Cause
Solution
References


Applies to:

Oracle SuperCluster M8 Hardware - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (64-bit)

Symptoms

IO domains deployed on a SuperCluster M8 where NIC cards (Part# 7319817 - Quad 10-Gigabit or Dual 40-Gigabit Ethernet QSFP+) are used may fail to boot with the following error on the IO domain console:

# telnet 0 5001
Trying 0.0.0.0...
Connected to 0.
Escape character is '^]'.
Connecting to console "ssccn2-io-dbm02" in group "ssccn2-io-dbm02" ....
Press ~? for control options ..
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD.
NOTICE: Starting slave cpus.
NOTICE: Initializing LDCs.
NOTICE: Probing PCI devices.
i40e_init_arq: Failed to write to Admin Rx Queue Regs
ERROR: Last Trap: Fast Data Access MMU Miss

 

NOTE: This issue is only applicable to IO domains on SuperCluster M8 systems where NIC cards (Part# 7319817 - Quad 10-Gigabit or Dual 40-Gigabit Ethernet QSFP+) are used for the client (10G) network. FCode versions 3.9.0 and below are susceptible to this issue.

The issue can be encountered in either of the scenarios below:

Scenario A: Starting IO domains in parallel using the "ldm start -a" command after a reboot of one or more root domains

The issue is ONLY seen when 9 or more Virtual Functions (VFs) are consumed from a single Physical Function (PF) of a given root domain.

  

Example:

(a) Identify the PFs on the control (primary) domain in a PDom (ex: ssccnX). In this example we verify on ssccn3 (the primary LDom on PDom2):

# ldm ls-io | grep IOVNET | egrep 'primary|ssccn.-dom.'
/SYS/CMIOU0/PCIE2/IOVNET.PF0 PF pci_0 primary
/SYS/CMIOU0/PCIE2/IOVNET.PF1 PF pci_0 primary
/SYS/CMIOU0/PCIE2/IOVNET.PF2 PF pci_0 primary
/SYS/CMIOU0/PCIE2/IOVNET.PF3 PF pci_0 primary
/SYS/CMIOU0/PCIE1/IOVNET.PF0 PF pci_3 primary
/SYS/CMIOU0/PCIE1/IOVNET.PF1 PF pci_3 primary
/SYS/CMIOU0/PCIE1/IOVNET.PF2 PF pci_3 primary
/SYS/CMIOU0/PCIE1/IOVNET.PF3 PF pci_3 primary
/SYS/CMIOU1/PCIE2/IOVNET.PF0 PF pci_5 primary
/SYS/CMIOU1/PCIE2/IOVNET.PF1 PF pci_5 primary
/SYS/CMIOU1/PCIE2/IOVNET.PF2 PF pci_5 primary
/SYS/CMIOU1/PCIE2/IOVNET.PF3 PF pci_5 primary
/SYS/CMIOU2/PCIE2/IOVNET.PF0 PF pci_10 ssccn3-dom1
/SYS/CMIOU2/PCIE2/IOVNET.PF1 PF pci_10 ssccn3-dom1
/SYS/CMIOU2/PCIE2/IOVNET.PF2 PF pci_10 ssccn3-dom1
/SYS/CMIOU2/PCIE2/IOVNET.PF3 PF pci_10 ssccn3-dom1
/SYS/CMIOU3/PCIE2/IOVNET.PF0 PF pci_15 ssccn3-dom2
/SYS/CMIOU3/PCIE2/IOVNET.PF1 PF pci_15 ssccn3-dom2
/SYS/CMIOU3/PCIE2/IOVNET.PF2 PF pci_15 ssccn3-dom2
/SYS/CMIOU3/PCIE2/IOVNET.PF3 PF pci_15 ssccn3-dom2

(b) For each PF, verify how many VFs are created and consumed by IO domains:

# ldm ls-io | grep CMIOU3 | grep PF0 | grep IOVNET | grep VF | grep ssccn.- | wc -l
7
#

Run step (b) for all the PFs in all CMIOUs. This needs to be verified across all the primary domains in the SuperCluster rack (ex: ssccn1 through ssccn4); a scripted version of this check is sketched below.
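Where there are many PFs to check, step (b) can be scripted. The following is a minimal sketch (run as root on each primary domain) that follows the same grep logic as the example above; the 'ssccn.-' domain-name pattern is taken from this example and may need adjusting to match the domain names on your rack:

#!/bin/sh
# Sketch: for every IOVNET PF reported by "ldm ls-io", count how many of its
# VFs are assigned to domains whose names start with "ssccn<N>-" (naming taken
# from the example above; adjust the pattern for your rack).
ldm ls-io | grep IOVNET | awk '$2 == "PF" {print $1}' | while read pf
do
    count=$(ldm ls-io | grep "${pf}.VF" | grep 'ssccn.-' | wc -l | tr -d ' ')
    echo "${pf}: ${count} VF(s) consumed"
done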

 

NOTE: If the count for any PF is 9 or more, the system is susceptible to hitting Scenario A.

  

Scenario B: Power-on of a PDom that has root domain(s) with IO domains deployed

The issue is ONLY seen when 4 or more Virtual Functions (VFs) are consumed from a single Physical Function (PF) of a given root domain. Refer to the example above to verify the number of VFs consumed from each PF.

NOTE: If the count for any PF is 4 or more, the system is susceptible to hitting Scenario B.

  

Cause

The cause of the issue is being investigated under BUG 27133932 - "i40e_init_arq: Failed to write to Admin Rx Queue Regs" observed during ldm start.

Solution

1. (Scenario A) If the issue is encountered while starting all IO domains in parallel after a reboot of one or more root domains, follow the steps below:

a. Stop all IO domains served by the root domain(s):

# ldm stop <IO-Domain>

b. Once all the IO domains are stopped, start the IO domains sequentially (one at a time), as in the sketch after this procedure:

# ldm start <IO-Domain>

  

NOTE: Allow a 5-second delay before starting the next IO domain, and proceed until all the IO domains are started.
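Where many IO domains have to be restarted, step (b) and the 5-second delay can be scripted. The following is a minimal sketch; it assumes the IO domain names are passed as arguments (the script name and domain names in the usage example are illustrative only):

#!/bin/sh
# Sketch: start the IO domains given as arguments one at a time,
# pausing 5 seconds between starts as recommended in the NOTE above.
for dom in "$@"
do
    ldm start "$dom"
    sleep 5
done

For example (script and domain names are illustrative):

# ./start-iodoms.sh ssccn2-io-dbm01 ssccn2-io-dbm02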

  

2. (Scenario B) If the issue is encountered during power-on of a PDom where all the IO domains are started after power-on, follow the steps below:

a. Stop all IO domains in the PDom (a sketch for deriving the list of IO domains is given after this procedure):

# ldm stop <IO-Domain> 

b. Reboot all the root domains in the PDom. The order in which the root domains are rebooted does not matter.

# reboot

c. Once each root domain has rebooted, log in and stop all IO domains served by that root domain:

# ldm stop <IO-Domain>

d. Once all the IO domains are stopped, start the IO domains sequentially (one at a time), as in step 1b:

# ldm start <IO-Domain>

  

NOTE: Allow a 5-second delay before starting the next IO domain, and proceed until all the IO domains are started.
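For steps (a) and (c) above, the list of IO domains to stop can be derived from the VF assignments shown by "ldm ls-io". The following is a minimal sketch; it assumes the primary/root domain naming from the example earlier in this document (primary, ssccnN-domN), so review the derived list before stopping anything:

#!/bin/sh
# Sketch: stop every domain that consumes an IOVNET VF (i.e. the IO domains),
# skipping the primary and root domains. The name patterns follow the example
# output earlier in this document and may need adjusting for your rack.
ldm ls-io | grep IOVNET | awk '$2 == "VF" && $4 != "" {print $4}' | sort -u |
    egrep -v '^(primary|ssccn.-dom.)$' | while read dom
do
    ldm stop "$dom"
done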

  

References

<BUG:27133932> - I40E_INIT_ARQ: FAILED TO WRITE TO ADMIN RX QUEUE REGS OBSERVED DURING LDM START

Attachments
This solution has no attachment