
Asset ID: 1-75-1674918.1
Update Date: 2017-10-11

Solution Type: Troubleshooting

Solution 1674918.1: LDOM fails to start after configuration changes or faults


Related Items
  • SPARC M6-32
  • SPARC M5-32

Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx-32




In this Document
Purpose
Troubleshooting Steps
References


Applies to:

SPARC M5-32 - Version All Versions to All Versions [Release All Releases]
SPARC M6-32 - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (64-bit)

Purpose

This document helps administrators of Oracle VM Server for SPARC (LDoms) understand and troubleshoot LDOM start-up failures caused by configuration changes or component faults.

It explains what these changes or faults do to an LDOM configuration, so that mistakes can be avoided.

Troubleshooting Steps

 

Each numbered case below is described under the same headings: Details of Changes/Faults, Test(s), What Happens, Errors/Messages, Recovery Options, and Best Practices.
Case 1 - Expandable value on a DCU changed

Details of Changes/Faults:
  The expandable flag on a DCU is changed from true to false, or vice versa, after LDOMs are already configured.

Test(s):
  expandable property set from true to false.
  *Note: recovery mode not enabled.

What Happens:
  After a stop/start of HOSTx, the LDOM configuration falls back to factory-default.

Errors/Messages:
  > WARNING: HOST expandable property must be set to true in order to boot config
  > WARNING: Unable to boot config_kon3 due to missing resources
  > WARNING: Falling back to factory-default

  > WARNING: Missing guest memory [0x100030000000:0x100830000000]
  > WARNING: Missing required memory resources to boot config
  > WARNING: Unable to boot 26oct15 due to missing resources
  > WARNING: Falling back to factory-default

Recovery Options:
  1. See KM Doc 1640383.1 (a command sketch follows this case).

Best Practices:
  1. Good planning is needed in advance.
  2. Enable recovery mode so that a degraded configuration can load.
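A minimal console sketch of the checks behind this case, assuming the expandable flag is exposed as an ILOM property on the PDomain host target (shown here as /HOST0; verify the exact target path in the M5-32/M6-32 Servers Administration Guide before use) and that the configuration name is illustrative:

  -> show /HOST0 expandable            # SP: check the current setting before changing anything
  primary# ldm add-spconfig config_before_change
                                       # control domain: save the current LDOM configuration so it
                                       # can be restored later (see Doc 1640383.1)
  -> set /HOST0 expandable=false       # only after the config is saved and the change is planned
  -> stop /HOST0
  -> start /HOST0                      # without recovery mode, expect the fallback shown above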
Case 2 - Adding CMU(s) to the original configuration

Details of Changes/Faults:
  CMU(s) are added to an existing DCU when LDOMs are already configured.

Test(s):
  Added CMU1 to an existing half-populated (CMU0 and CMU3) PDOM/DCU.
  1. ioreconfigure is true or add_only
  2. ioreconfigure is false

What Happens:
  1) If ioreconfigure is true or add_only:
     After start /HOSTx, the LDOM configuration falls back to factory-default due to missing IO resources.

  1a) With recovery mode enabled:
     After start /HOSTx, the LDOM configuration is booted in degraded mode.
     => root complexes for the new CMU are added
     => root complexes and paths are reprogrammed
     => cards now associated with previously non-existing root complexes show as "unk"
     => those cards are no longer assigned to a specific ldom/guest (add-io required; see the command sketch after this case)
     => see Doc ID 1540545.1 for the list of root complexes and paths for PCI cards

  2) If ioreconfigure is false:
     After start /HOSTx, the LDOM configuration is booted normally.
     => root complexes and paths for the new CMU are not added or reprogrammed
     => all cards are accessible

Errors/Messages:
  1.
  > WARNING: Missing IO resources to boot LDOM config
  > WARNING: Unable to boot 13042014 due to missing resources
  > WARNING: Falling back to factory-default
  > NOTICE:  Booting config = factory-default

  2.
  > NOTICE:  Booting config = cmu-1
  > DEBUG:   Updating mdset-boot-reason prop: "0""cmu-1""cmu-1"
  > DEBUG:   ldm_set_bootedconfig_name: New bootedcfg cmu-1, last bootedcfg factory-default

Recovery Options:
  1. Set ioreconfigure to true, then remove the extra CMU(s) and start /HOSTx; it should recover.
     *Note: if ioreconfigure is add_only, removing the CMU(s) will not re-build the paths.

Best Practices:
  1. Good planning is needed in advance: either set ioreconfigure to false or prepare for a configuration change.
  2. Enable recovery mode so that a degraded configuration can load.
  3. If the customer wants to permanently add CMUs, the only way is to plan ahead and redo the LDOM configuration.
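A hedged sketch of how the "unk" cards could be checked and re-assigned from the control domain after a degraded boot; the bus, slot, domain and configuration names (pci_50, /SYS/IOU3/PCIE5, ldg1, config_after_cmu_add) are illustrative only, taken from examples elsewhere in this document:

  primary# ldm list-io                     # cards on reprogrammed root complexes show STATUS "UNK"
  primary# ldm list-io -l /SYS/IOU3/PCIE5  # inspect one slot in detail
  primary# ldm add-io pci_50 ldg1          # re-assign a bus to its guest; the guest typically has to be
                                           # stopped (or in delayed reconfiguration) for this to succeed
  primary# ldm add-spconfig config_after_cmu_add
                                           # save the corrected configuration to the SP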
Case 3 - Removing CMU(s) from the original configuration

Details of Changes/Faults:
  CMU(s) are removed from an existing DCU when LDOMs are already configured.

Test(s):
  Removed 2x CMU from a full 4x CMU PDOM to make it a half DCU.
  1. ioreconfigure is false or add_only
  2. ioreconfigure is true

What Happens:
  1) If ioreconfigure is add_only or false:
     After start /HOSTx, the LDOM configuration falls back to factory-default due to missing paths to the removed CMU(s)/CMP(s).

  1a) With recovery mode enabled:
     => the configuration is booted in degraded mode
     => root complexes managed by the removed CMU(s)/CMP(s) disappear
     => no changes or reprogramming to the existing and remaining root complexes
     => the cards managed by the removed CMU(s)/CMP(s) are also not available
     => root complexes assigned to the guests are marked IOV

  2) If ioreconfigure is true:
     After start /HOSTx, the LDOM configuration falls back to factory-default due to the change in paths to the root complexes.

  2a) With recovery mode enabled:
     => the configuration is booted in degraded mode
     => root complexes and paths are reprogrammed
     => the cards managed by the removed CMU(s)/CMP(s) are also not available
     => root complexes assigned to the guests are marked IOV

Errors/Messages:
  > WARNING: bootconfig not bootable: missing strand id 1280
  > WARNING: Missing required strand resources to boot config
  > WARNING: Unable to boot stef-full due to missing resources
  > DEBUG:   Trying to fall back to degraded config
  > DEBUG:   Can't open degraded cfg "stef-full" - rv = -6
  > DEBUG:   Degraded config doesn't exist
  > WARNING: Falling back to factory-default
  > NOTICE:  Booting config = factory-default
  > DEBUG:   Updating mdset-boot-reason prop: "1""stef-full""factory-default"

Recovery Options:
  1. Set ioreconfigure to true, then add the removed CMU(s) back and start /HOSTx; this should re-build the paths and recover (see the command sketch after this case).

Best Practices:
  1. Good planning is needed in advance, including steps to re-create the LDOMs manually if the customer wants to remove CMU(s) permanently.
  2. Enable recovery mode so that a degraded configuration can load.
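A minimal sketch of recovery option 1 from the SP, assuming /HOST0 is the affected PDomain (adjust the host number as needed):

  -> show /HOST0 ioreconfigure         # confirm the current setting
  -> set /HOST0 ioreconfigure=true     # force a full I/O path rebuild on the next start
  -> stop /HOST0                       # power off the PDomain before re-installing hardware
  (re-install the removed CMU(s))
  -> start /HOST0                      # paths are re-built and the saved configuration should boot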
Case 4 - CMP failure (root complex)

Details of Changes/Faults:
  CMP failure (root complex).

Test(s):
  Blacklisted CMU15/CMP0 and CMP1 to avoid a 7-node config and CID > 480, and driving PCIE15.
  CMP disabled: PCIE15 assigned to ldg1, no cid assignment, only vcpu - 150 out of 384 - 384 < CID < 424 (strand id 3840 below not in use).
  *Note: recovery mode not enabled.

What Happens:
  Falls back to factory-default.

Errors/Messages:
  > DEBUG:   Strand not present: chip id = 30, smp_id = 3, local_chip_id = 6, local_strand_id= 0
  > WARNING: bootconfig not bootable: missing strand id 3840
  > WARNING: Missing required strand resources to boot config
  > WARNING: Unable to boot stef-4 due to missing resources
  > DEBUG:   Trying to fall back to degraded config
  > DEBUG:   Can't open degraded cfg "stef-4" - rv = -6
  > DEBUG:   Degraded config doesn't exist
  > WARNING: Falling back to factory-default
  > NOTICE:  Booting config = factory-default
  > DEBUG:   Updating mdset-boot-reason prop: "1""stef-4""factory-default"
  > DEBUG:   bootconfig differs from last boot
  > DEBUG:   ldm_set_bootedconfig_name: New bootedcfg factory-default, last bootedcfg stef-4

Recovery Options:
  1. Replace the failed component ASAP (a command sketch for identifying it follows this case).

Best Practices:
  Enable recovery mode so that a degraded configuration can load.
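A hedged sketch of how the faulted/blacklisted components could be identified before the replacement, using standard Oracle ILOM and Solaris fault-management commands (output formats vary by firmware and OS release):

  -> show faulty                       # on the SP: list components that are faulted or disabled
  primary# fmadm faulty                # on the control domain: list active FMA faults
  primary# ldm list-spconfig           # confirm which saved configuration actually booted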
Case 5 - PCIE/IOU/CMU/DIMM failures

Details of Changes/Faults:
  PCIE/IOU/CMU/DIMM failures.

Test(s):
  1. Blacklisted IOU3/PCIE5.
     Original config: ldg1 is using IOU3/PCIE5, entire root complex assigned to the ldom.
  2. Blacklisted IOU3/IOB1.
     Original config: ldg1 is using IOU3/PCIE15, entire root complex assigned to the ldom.
  3. Blacklisted CMU/CPU.
  4. Disabled DIMMs.
  *Note: recovery mode not enabled.

What Happens:
  1. The LDOM configuration is booted normally. Only the PCIE resource is missing from the ldom when started.
  2. Falls back to factory-default => missing IO resources to boot the LDOM configuration.
  3. Falls back to factory-default => missing cpu strand.
  4. Falls back to factory-default => missing required memory.

Errors/Messages:
  1.
  > NOTICE:  Booting config = stef-4
  > DEBUG:   Updating mdset-boot-reason prop: "0""stef-4""stef-4"
  > DEBUG:   bootconfig differs from last boot
  > DEBUG:   config stef-4 has 2 IO domains
  > DEBUG:   Not in this IO domain
  > DEBUG:   /SYS/IOU3/PCIE5 marked disabled in MD. Path=/@f80/@1/@0/@8
  > DEBUG:   Updating stef-4 Control Domain's variables and keystore nodes

  /SYS/IOU3/PCIE5   PCIE   pci_50   ldg1   UNK

  2.
  > DEBUG:   Some IO unreachable from cpu nodeset: Degraded IO config.
  > DEBUG:   config_root_io_is_avail: Not enough RCs in the current config
  > WARNING: Missing IO resources to boot LDOM config
  > WARNING: Unable to boot stef-4 due to missing resources
  > DEBUG:   Trying to fall back to degraded config
  > DEBUG:   Can't open degraded cfg "stef-4" - rv = -6
  > DEBUG:   Degraded config doesn't exist
  > WARNING: Falling back to factory-default
  > NOTICE:  Booting config = factory-default
  > DEBUG:   Updating mdset-boot-reason prop: "1""stef-4""factory-default"
  > DEBUG:   bootconfig differs from last boot
  > DEBUG:   ldm_set_bootedconfig_name: New bootedcfg factory-default, last bootedcfg stef-4

  3.
  > WARNING: bootconfig not bootable: missing strand id 2176
  > WARNING: Missing required strand resources to boot config
  > WARNING: Unable to boot stef-alternate due to missing resources
  > WARNING: Falling back to factory-default
  > NOTICE:  Booting config = factory-default

  4.
  > WARNING: Missing guest memory [0x150000000000:0x158000000000]
  > WARNING: Missing required memory resources to boot config
  > WARNING: Unable to boot stef-mem due to missing resources
  > DEBUG:   Trying to fall back to degraded config
  > DEBUG:   Can't open degraded cfg "stef-mem" - rv = -6
  > DEBUG:   Degraded config doesn't exist
  > WARNING: Falling back to factory-default
  > NOTICE:  Booting config = factory-default
  > DEBUG:   Updating mdset-boot-reason prop: "1""stef-mem""factory-default"

Recovery Options:
  1. Replace the failed component ASAP (the identification sketch after Case 4 applies here as well).

Best Practices:
  Enable recovery mode so that a degraded configuration can load.
Case 6 - HDD/network card failures in an EMS

Details of Changes/Faults:
  HDD or network card failures in an EMS module.

Test(s):
  Blacklisted EMS2.
  Original config: ldg2 is using vnet and vdisk services from EMS2, owned by primary.

What Happens:
  The LDOM configuration is booted normally, but the services (vdisk, vnet) backed by EMS2 are not available to the ldom.

Errors/Messages:
  > NOTICE:  Booting config = stef-4
  > DEBUG:   Updating mdset-boot-reason prop: "0""stef-4""stef-4"
  > DEBUG:   bootconfig differs from last boot
  > DEBUG:   config stef-4 has 2 IO domains
  > DEBUG:   /SYS/IOU3/EMS2 marked disabled in MD. Path=/@1100/@1/@0/@0
  > DEBUG:   /SYS/IOU3/EMS2 marked disabled in MD. Path=/@1100/@1/@0/@0/@0
  > DEBUG:   Updating IO MD to offset 0x180006600000
  > DEBUG:   Not in this IO domain
  > DEBUG:   ldm_set_bootedconfig_name: New bootedcfg stef-4, last bootedcfg factory-default

  /SYS/IOU3/EMS2/CARD/NET0   PCIE   pci_56   primary   UNK
  /SYS/IOU3/EMS2/CARD/SCSI   PCIE   pci_56   primary   UNK

Recovery Options:
  1. Replace the failed component ASAP (the identification sketch after Case 4 applies here as well).

Best Practices:
  None.

 

Best Practices

1. Run firmware 9.1.0.x (or later) together with Oracle VM Server for SPARC 3.1 (or later) and enable recovery mode.
2. Understand what happens when the expandable mode is changed on a DCU while LDOMs are already configured (Case 1).
3. Understand the ioreconfigure options (false/add_only/true) and what they do when boards are added or removed (Cases 2 and 3); a pre-change check sketch follows this list.
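A hedged pre-change checklist that combines these practices; /HOST0 and the configuration name are illustrative, and the expandable target path is an assumption to be verified against the M5-32/M6-32 Servers Administration Guide:

  primary# ldm list-spconfig           # know which configuration is current / next poweron
  primary# ldm add-spconfig pre_change_config
                                       # save the current configuration before any change
  -> show /HOST0 ioreconfigure         # know how I/O paths will be handled on the next start
  -> show /HOST0 expandable            # know the expandable setting (target path assumed)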

Notes

1. When recovery mode is enabled, it allows the LDOM configuration to start in a degraded form if a resource is removed, added or faulted.
However, the degraded configuration should not be treated as the original production configuration; it is only a way to limit the outage.
A recovered domain is not guaranteed to be completely operable. The domain might be missing a resource that is essential to run the OS instance or an application.
For example, a recovered domain might only have a network resource and no disk resource. For more details, see the Oracle VM Server for SPARC 3.1 Administration Guide, page 291:
http://docs.oracle.com/cd/E38405_01/pdf/E38406.pdf

When a system is in recovery mode, you can only perform ldm list-* commands. All other ldm commands are disabled until the recovery operation completes.

2. Even if recovery mode was disabled, it can still be enabled after falling back to factory-default (as long as Solaris can boot on the control domain). The setting takes effect immediately and starts a degraded configuration (see the sketch below).
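A minimal sketch of enabling and verifying recovery mode on the control domain via the ldmd SMF service, following the Administration Guide referenced below:

  primary# svccfg -s ldmd setprop ldmd/recovery_mode = astring: auto
  primary# svcadm refresh ldmd         # make ldmd re-read the property; recovery mode is now active
  primary# svccfg -s ldmd listprop ldmd/recovery_mode
                                       # verify the setting ("auto" enables it, "never" disables it)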


References

1. ioreconfigure

Manage I/O Path Reconfiguration Settings - see SPARC M5-32 and SPARC M6-32 Servers Administration Guide, page 178
http://docs.oracle.com/cd/E24355_01/pdf/E41216.pdf

2. IOV

Enabling IOV on a root complex - see Oracle VM Server for SPARC 3.1 Administration Guide, page 100
http://docs.oracle.com/cd/E38405_01/pdf/E38406.pdf

3. Recovery Mode

Enabling Recovery Mode - see Oracle VM Server for SPARC 3.1 Administration Guide, page 294
http://docs.oracle.com/cd/E38405_01/pdf/E38406.pdf

 

SPARC M5-32 and M6-32 Servers: OVM Recovery Mode (Doc ID 1928902.1)

 

Extra internal resources from Stephane Dutilleul's Troubleshooting Wiki


1. ioreconfigure
https://stbeehive.oracle.com/teamcollab/wiki/SPARC+M5-32+and+M6-32+Servers+-+Troubleshooting:Misc.#ioreconfigure

2. Recovery Mode
https://stbeehive.oracle.com/teamcollab/wiki/SPARC+M5-32+and+M6-32+Servers+-+Troubleshooting:Misc.#Recovery+mode

3. SPARC M5-32 and M6-32 Servers: Missing EMS and/or PCIe cards due to ioreconfigure variable setting (Doc ID 1931995.1)

 

 

References

<NOTE:1640383.1> - How to restore a LDOM config across M5/M6 expandable property change
<NOTE:1540545.1> - SPARC M5-32 and M6-32 Servers: Device Paths
http://docs.oracle.com/cd/E24355_01/pdf/E41216.pdf
http://docs.oracle.com/cd/E38405_01/pdf/E38406.pdf

Attachments
This solution has no attachment