
Asset ID: 1-72-2340619.1
Update Date: 2018-03-26
Keywords:

Solution Type: Problem Resolution Sure

Solution 2340619.1: SuperCluster: M8 IO domains may fail to boot with "ERROR: Last Trap: Fast Data Access MMU Miss"


Related Items
  • Oracle SuperCluster M8 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>SPARC SuperCluster>DB: SuperCluster_EST


IO domains on the SuperCluster M8 platform may fail to boot with the error "ERROR: Last Trap: Fast Data Access MMU Miss".
This applies to SuperCluster M8 systems where Fortville cards (Part# 7319817 - Quad 10-Gigabit or Dual 40-Gigabit Ethernet QSFP+) are used.

In this Document
Symptoms
Cause
Solution
References


Applies to:

Oracle SuperCluster M8 Hardware - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (64-bit)

Symptoms

IO domains deployed on a SuperCluster M8 where NIC cards (Part# 7319817 - Quad 10-Gigabit or Dual 40-Gigabit Ethernet QSFP+) are used may fail to boot with the following error on the IO domain console:

# telnet 0 5001
Trying 0.0.0.0...
Connected to 0.
Escape character is '^]'.
Connecting to console "ssccn2-io-dbm02" in group "ssccn2-io-dbm02" ....
Press ~? for control options ..
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD.
NOTICE: Starting slave cpus.
NOTICE: Initializing LDCs.
NOTICE: Probing PCI devices.
i40e_init_arq: Failed to write to Admin Rx Queue Regs
ERROR: Last Trap: Fast Data Access MMU Miss

 

NOTE: This issue is only applicable to IO domains on SuperCluster M8 systems where NIC cards (Part# 7319817 - Quad 10-Gigabit or Dual 40-Gigabit Ethernet QSFP+) are used for the client (10G) network. FCode versions 3.9.0 and below are susceptible to this issue.

The issue can be encountered in either of the scenarios below:

Scenario A: Starting IO domains in parallel using the "ldm start -a" command after a reboot of one or more root domains

The issue is ONLY seen when 9 or more Virtual Functions (VFs) are consumed from a single Physical Function (PF) of a given root domain.

  

Example:

(a) Identify the PFs on the control (primary) domain in a PDom (ex: ssccnX). In this example we verify on ssccn3 (the primary LDom on PDom2):

# ldm ls-io | grep IOVNET | egrep 'primary|ssccn.-dom.'
/SYS/CMIOU0/PCIE2/IOVNET.PF0 PF pci_0 primary
/SYS/CMIOU0/PCIE2/IOVNET.PF1 PF pci_0 primary
/SYS/CMIOU0/PCIE2/IOVNET.PF2 PF pci_0 primary
/SYS/CMIOU0/PCIE2/IOVNET.PF3 PF pci_0 primary
/SYS/CMIOU0/PCIE1/IOVNET.PF0 PF pci_3 primary
/SYS/CMIOU0/PCIE1/IOVNET.PF1 PF pci_3 primary
/SYS/CMIOU0/PCIE1/IOVNET.PF2 PF pci_3 primary
/SYS/CMIOU0/PCIE1/IOVNET.PF3 PF pci_3 primary
/SYS/CMIOU1/PCIE2/IOVNET.PF0 PF pci_5 primary
/SYS/CMIOU1/PCIE2/IOVNET.PF1 PF pci_5 primary
/SYS/CMIOU1/PCIE2/IOVNET.PF2 PF pci_5 primary
/SYS/CMIOU1/PCIE2/IOVNET.PF3 PF pci_5 primary
/SYS/CMIOU2/PCIE2/IOVNET.PF0 PF pci_10 ssccn3-dom1
/SYS/CMIOU2/PCIE2/IOVNET.PF1 PF pci_10 ssccn3-dom1
/SYS/CMIOU2/PCIE2/IOVNET.PF2 PF pci_10 ssccn3-dom1
/SYS/CMIOU2/PCIE2/IOVNET.PF3 PF pci_10 ssccn3-dom1
/SYS/CMIOU3/PCIE2/IOVNET.PF0 PF pci_15 ssccn3-dom2
/SYS/CMIOU3/PCIE2/IOVNET.PF1 PF pci_15 ssccn3-dom2
/SYS/CMIOU3/PCIE2/IOVNET.PF2 PF pci_15 ssccn3-dom2
/SYS/CMIOU3/PCIE2/IOVNET.PF3 PF pci_15 ssccn3-dom2

(b) For each PF, verify how many VFs are created and consumed by IO domains:

# ldm ls-io | grep CMIOU3 | grep PF0 | grep IOVNET | grep VF | grep ssccn.- | wc -l
7
#

Run step (b) for all the PFs in all CMIOUs. This needs to be verified across all the primary domains in the SuperCluster rack (ex: ssccn1 through ssccn4); a scripted version of this check is sketched below.
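Where there are many PFs to check, step (b) can be scripted. The following is a minimal sketch (run as root on each primary domain) that follows the same grep logic as the example above; the 'ssccn.-' domain-name pattern is taken from this example and may need adjusting to match the domain names on your rack:

#!/bin/sh
# Sketch: for every IOVNET PF reported by "ldm ls-io", count how many of its
# VFs are assigned to domains whose names start with "ssccn<N>-" (naming taken
# from the example above; adjust the pattern for your rack).
ldm ls-io | grep IOVNET | awk '$2 == "PF" {print $1}' | while read pf
do
    count=$(ldm ls-io | grep "${pf}.VF" | grep 'ssccn.-' | wc -l | tr -d ' ')
    echo "${pf}: ${count} VF(s) consumed"
done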

 

NOTE: If the count for any PF is 9 or more, the system is susceptible to hitting Scenario A.

  

Scenario B: Power-on of a PDom that has root domain(s) with IO domains deployed

The issue is ONLY seen when 4 or more Virtual Functions (VFs) are consumed from a single Physical Function (PF) of a given root domain. Refer to the example above to verify the number of VFs consumed from each PF.

NOTE: If the count for any PF is 4 or more, the system is susceptible to hitting Scenario B.

  

Cause

The cause of the issue is being investigated under BUG 27133932 - "i40e_init_arq: Failed to write to Admin Rx Queue Regs" observed during ldm start.

Solution

1. (Scenario A) If the issue is encountered while starting all IO domains in parallel after a reboot of one or more root domains, follow the steps below:

a. Stop all IO domains served by the root domain(s):

# ldm stop <IO-Domain>

b. Once all the IO domains are stopped, start the IO domains sequentially (one at a time), as in the sketch after this procedure:

# ldm start <IO-Domain>

  

NOTE: Allow a 5-second delay before starting the next IO domain, and proceed until all the IO domains are started.
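Where many IO domains have to be restarted, step (b) and the 5-second delay can be scripted. The following is a minimal sketch; it assumes the IO domain names are passed as arguments (the script name and domain names in the usage example are illustrative only):

#!/bin/sh
# Sketch: start the IO domains given as arguments one at a time,
# pausing 5 seconds between starts as recommended in the NOTE above.
for dom in "$@"
do
    ldm start "$dom"
    sleep 5
done

For example (script and domain names are illustrative):

# ./start-iodoms.sh ssccn2-io-dbm01 ssccn2-io-dbm02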

  

2. (Scenario B) If the issue is encountered during power-on of a PDom where all the IO domains are started after power-on, follow the steps below:

a. Stop all IO domains in the PDom (a sketch for deriving the list of IO domains is given after this procedure):

# ldm stop <IO-Domain> 

b. Reboot all the root domains in the PDom. The order in which the root domains are rebooted does not matter.

# reboot

c. Once each root domain has rebooted, log in and stop all IO domains served by that root domain:

# ldm stop <IO-Domain>

d. Once all the IO domains are stopped, start the IO domains sequentially (one at a time), as in step 1b:

# ldm start <IO-Domain>

  

NOTE: Allow a 5-second delay before starting the next IO domain, and proceed until all the IO domains are started.
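For steps (a) and (c) above, the list of IO domains to stop can be derived from the VF assignments shown by "ldm ls-io". The following is a minimal sketch; it assumes the primary/root domain naming from the example earlier in this document (primary, ssccnN-domN), so review the derived list before stopping anything:

#!/bin/sh
# Sketch: stop every domain that consumes an IOVNET VF (i.e. the IO domains),
# skipping the primary and root domains. The name patterns follow the example
# output earlier in this document and may need adjusting for your rack.
ldm ls-io | grep IOVNET | awk '$2 == "VF" && $4 != "" {print $4}' | sort -u |
    egrep -v '^(primary|ssccn.-dom.)$' | while read dom
do
    ldm stop "$dom"
done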

  

References

<BUG:27133932> - I40E_INIT_ARQ: FAILED TO WRITE TO ADMIN RX QUEUE REGS OBSERVED DURING LDM START

Attachments
This solution has no attachment