Asset ID: 1-71-1928902.1
Update Date: 2017-10-11
Solution Type: Technical Instruction
Doc ID 1928902.1: SPARC M5-32 and M6-32 Servers: OVM Recovery Mode
Related Categories:
- PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx-32
Applies to:
SPARC M5-32 - Version All Versions and later
SPARC M6-32 - Version All Versions and later
Goal
The ILOM HOST configuration software might not be able to boot the selected LDoms configuration (the /HOSTx/bootmode config property) following a PDom restart (stop/start). This can occur when a hardware resource is added or becomes unavailable, or when the IO topology changes.
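For reference, the LDoms configuration selected for a given PDom can be displayed from the ILOM CLI on the SP, as in this sketch (using /HOST2, as in the example at the end of this document; substitute your HOST number):
-> show /HOST2/bootmode
 /HOST2/bootmode
    Properties:
        config = stef-alternate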
Some possible causes for falling back to factory-default include:
- CMU(s) added or removed while the ioreconfigure property is set to true or add_only, as this impacts the IO topology,
- CMU(s) added or removed regardless of the ioreconfigure property, as this changes the CPU and memory configuration,
- faulted components that impact the available CPU, memory, or IO configuration,
- user-disabled components such as a CMU, CPU, memory, or IO resources,
- a modified expandable property, as this impacts the physical address and CPU map.
Note: The above list is not exhaustive and other conditions might impact resource availability.
Guest domains are unavailable when the system falls back to factory-default and fails to boot the desired LDoms configuration. The impact can be significant when guest domain applications (e.g., databases, middleware) are not running.
During the PDom startup sequence, host console WARNING messages may provide clues about why a HOST falls back to factory-default, for example:
Missing required strand resources to boot config
bootconfig not bootable: missing strand id
Missing IO resources to boot LDOM config
Missing required memory resources to boot config
System may not be running with the optimum IO topology
HOST expandable property must be set to true in order to boot config
Thorough data analysis must be conducted for the various types of events that can trigger the system to fall back to factory-default. Oracle Support can assist with the diagnosis; open a My Oracle Support hardware SR.
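If the host console output has been captured to a file, the relevant messages can be located quickly. A minimal sketch, assuming the console was logged to a file named console.log (the file name is an assumption):
# Extract the fallback-related warnings from a captured host console log
grep -E "WARNING: (Missing|Unable to boot|Falling back|bootconfig not bootable|HOST expandable)" console.log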
Example scenarios with plain-English explanations:
Example A:
2015-06-28 07:52:51 8:0:0> WARNING: bootconfig not bootable: missing strand id 768 (1)
2015-06-28 07:52:51 8:0:0> WARNING: Missing required strand resources to boot config (2)
2015-06-28 07:52:52 8:0:0> WARNING: Unable to boot ldoms20150626 due to missing resources
2015-06-28 07:52:52 8:0:0> WARNING: Falling back to factory-default
(1) The HOST configuration is missing an expected CPU strand required to use the LDoms configuration. Use Doc ID 1540202.1 to assist in identifying the missing (e.g., faulted, disabled, or degraded) CMU resource.
(2) Missing CMU resources prevent use of the LDoms configuration and the HOST falls back to factory-default.
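One way to spot faulted or disabled CMU components is to query their configuration state from the ILOM CLI on the SP, as done in the worked example later in this document (shown here for /SYS/CMU9; adjust the target as needed):
-> show -t /SYS/CMU9 current_config_state==(Disabled,Degraded) current_config_state disable_reason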
Example B:
2014-04-13 08:14:41 0:0:0> WARNING: System may not be running with the optimum IO topology (1)
2014-04-13 08:15:54 0:0:0> WARNING: Missing IO resources to boot LDOM config (2)
2014-04-13 08:15:54 0:0:0> WARNING: Unable to boot 13042014 due to missing resources
2014-04-13 08:15:55 0:0:0> WARNING: Falling back to factory-default
(1) Not all available root complexes are fully utilized. This might occur following the loss of a CMU or IO board (e.g., fault, disabled).
(2) The ILOM service processor Hostconfig software identifies that expected IO paths are missing and consequently the HOST falls back to factory-default. See Doc ID 1540545.1 to understand the M5-32/M6-32 server root complexes and available IO paths.
Note: The "Missing IO resources" WARNING occurs upon the loss of expected IO paths. This can occur due to faulted or disabled hardware. It can also be triggered upon the addition of CMUs and the reconfiguration of the IO topology when the ioreconfigure property is set to true or add_only.
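Once the PDom is up, even in factory-default, the surviving PCIe buses and root complexes can be inventoried from the control domain and compared with what the configuration expects. A minimal sketch (ldm list-io is part of OVM; its output format varies by version):
root@pdom02:~# ldm list-io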
Example C:
2017-07-06 01:07:36 0:00:0> WARNING: Missing guest memory [0xa0030000000:0xa2030000000] (1)
2017-07-06 01:07:36 0:00:0> WARNING: Missing required memory resources to boot config (2)
2017-07-06 01:07:36 0:00:0> WARNING: Unable to boot 05072017 due to missing resources
2017-07-06 01:07:36 0:00:0> WARNING: Falling back to factory-default
(1) The HOST configuration is missing an expected physical memory address range required to use the LDoms configuration.
(2) Required memory is not available in the HOST configuration. As a result, the server falls back to factory-default. The output of the Solaris 'prtdiag' command can be used to identify the CMU from which memory is missing.
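For example, from Solaris on the PDom, compare the memory reported per CMU against what is physically installed. A sketch (the exact output layout varies by platform and OS release):
root@pdom02:~# prtdiag -v | more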
Example D:
2014-05-07 07:30:26 16:0:0> WARNING: HOST expandable property must be set to true in order to boot config (1)
2014-05-07 07:30:26 16:0:0> WARNING: Unable to boot init due to missing resources (2)
2014-05-07 07:30:26 16:0:0> WARNING: Falling back to factory-default
(1) The HOST property expandable does not match the expected value for the desired LDoms configuration. The expandable property affects the physical address assignment of the devices in the PDomain.
(2) Expected resources are now missing because of address changes and the LDoms configuration cannot be used. The HOST falls back to factory-default. See Doc ID 1640383.1 for more details.
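The current value of the property can be checked, and if appropriate changed, from the ILOM CLI on the SP. A sketch assuming the /HOST2 target (verify the correct value for your configuration with Doc ID 1640383.1 before changing it):
-> show /HOST2 expandable
-> set /HOST2 expandable=true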
It is the responsibility of the SP firmware (Hostconfig) to determine that a piece of hardware belonging to the bootconfig is no longer available. The LDoms Manager is then informed via the "mdset-boot-reason" property in the "firmware" node of the PRI.
Example:
2013-10-23 14:34:19 16:0:0> DEBUG: Updating mdset-boot-reason prop: "1""stef-memory""factory-default"
Solution
General Information
The behavior described above applies to platforms running SysFW earlier than version 9.1.
When the platform is running SysFW 9.1 or later and OVM 3.1 or later, it is possible to enable the ldm Recovery Mode. By default it is disabled. When enabled, ldmd will try to start a degraded configuration. When disabled, no attempt to run a degraded configuration is made.
It is highly recommended to run SysFW 9.1 or later and OVM 3.1 or later, and to enable the ldm Recovery Mode.
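The installed LDoms Manager (OVM) and system firmware versions can be confirmed from the control domain with ldm -V (the banner text varies by release):
root@pdom02:~# ldm -V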
Recovery Mode is a property of the svc:/ldoms/ldmd:default SMF service running on the primary/control domain.
Recovery Mode can only kick in upon request from the SP, when the primary/control domain is starting Solaris.
If for some reason it is not possible to boot Solaris on the control/primary domain, Recovery Mode cannot start.
If at some point SysFW detects a condition that makes it impossible to start the saved ldom configuration, then when Solaris and ldmd are starting on the primary domain:
- if Recovery Mode is enabled, ldm recovers all active and bound domains from the last selected power-on configuration;
- if Recovery Mode is not enabled, it can still be enabled after booting from a valid boot device in factory-default. This takes effect immediately (only if no changes were made to the system, that is, if it is still in the factory-default configuration) after an automatic reboot, and ldm will then recover all active and bound domains from the last selected power-on configuration.
The resulting running configuration is called the degraded configuration.
See the examples below.
The degraded configuration is saved to the SP and remains the active configuration until either a new SP configuration is saved or the PDom is power-cycled. Using the degraded configuration, it should be possible to start all of the ldoms (root, IO, and guest domains).
Notes:
- The degraded configuration should not be considered as the original production configuration. It is just a step forward to limit the outage.
- A recovered domain is not guaranteed to be completely operable. The domain might not include a resource that is essential to run the OS instance or an application.
- When a system is in recovery mode, you can only perform ldm list-* commands. All other ldm commands are disabled until the recovery operation completes.
- When running ldm commands from the degraded configuration, ldm displays the following notice:
------------------------------------------------------------------------------
Notice: the LDoms Manager is running in Recovery Mode because not all
resources required for the selected configuration were available when
the system was powered on.
------------------------------------------------------------------------------
As soon as the problem that led to falling back to factory-default is identified and fixed, the PDom can be restarted and ldm will use the last selected configuration.
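For example, once the repair is complete, the PDom can be restarted from the ILOM CLI on the SP. A sketch using /HOST2 (substitute the HOST number of your PDom):
-> stop /HOST2
-> start /HOST2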
Note:
When Recovery Mode is enabled (whether it was already enabled before the condition was met or enabled "on the fly") and the SP makes a request to use Recovery Mode, an extra reboot will occur.
From the host console, you will see the following :
NOTICE: Recovery Mode requested by the system controller. LDoms Manager is starting recovery.
NOTICE: LDoms Manager is rebooting the primary domain to apply changes for Recovery Mode
NOTICE: LDoms Manager has completed recovery
The process flow when a hardware resource is detected missing during host startup and SysFW decides to fall back to factory-default ("Falling back to factory-default") can be summarized as follows:
1. SysFW (Hostconfig) detects that a resource required by the selected configuration is missing; it falls back to a known degraded config if one exists, otherwise to factory-default.
2. The SP informs the LDoms Manager of the reason via the "mdset-boot-reason" property in the PRI and requests Recovery Mode.
3. When Solaris and ldmd start on the primary domain, ldmd either starts recovery (if ldmd/recovery_mode is set to auto) or logs that Recovery Mode was requested but not enabled and stays in factory-default.
4. During recovery, ldmd reboots the primary domain once to apply the changes, then recovers all active and bound domains into the degraded configuration.
Managing OVM Recovery Mode
By default, for platforms running SysFW 9.1 or later and OVM 3.1 or later, OVM Recovery Mode is disabled.
Recovery Mode is a property of the ldmd service on the control domain, so it is managed from the control/primary domain where ldmd is running.
root@pdom03:~# svccfg -s ldmd listprop ldmd/recovery_mode
root@pdom03:~#
No output means the property has not been set yet, so the default of 'never' (disabled) applies.
Enable Recovery Mode
root@pdom03:~# svccfg -s ldmd setprop ldmd/recovery_mode = astring: auto
root@pdom03:~# svcadm refresh ldmd
root@pdom03:~# svcadm restart ldmd
Check if Recovery Mode is properly enabled
root@pdom03:~# svccfg -s ldmd listprop ldmd/recovery_mode
ldmd/recovery_mode astring auto
Disable Recovery Mode
If for some reason you want to disable Recovery Mode again:
root@pdom03:~# svccfg -s ldmd setprop ldmd/recovery_mode = astring: never
root@pdom03:~# svcadm refresh ldmd
root@pdom03:~# svcadm restart ldmd
root@pdom03:~# svccfg -s ldmd listprop ldmd/recovery_mode
ldmd/recovery_mode astring never
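The only meaningful values are auto and never. Keep the space between the astring: type and the value when setting the property: if the type prefix is folded into the value itself, ldmd rejects it and falls back to the default of 'never', which is what the "Invalid --recovery-mode value 'astring:never'" warning in the log excerpt further below indicates. A correct invocation:
root@pdom03:~# svccfg -s ldmd setprop ldmd/recovery_mode = astring: auto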
Logs
When Recovery Mode is requested, check the logs in the /var/svc/log/ldoms-ldmd:default.log file.
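For example, the recovery progress can be followed live while the PDom boots (a sketch; any text viewer works):
root@pdom03:~# tail -f /var/svc/log/ldoms-ldmd:default.log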
- A Recovery Mode request has been made but Recovery Mode is disabled:
Oct 31 13:49:02 Recovery Mode has been requested by the system controller but it has not been administratively enabled through the ldmd/recovery_mode smf(5) property Falling back to 'factory-default'
- A Recovery Mode request has been made and Recovery Mode is enabled:
Oct 31 13:53:16 ------------------------------------------------------------------------------
Oct 31 13:53:16 Recovery Mode requested by the system controller. Starting recovery of configuration 'stef-mem'
Oct 31 13:53:16 ------------------------------------------------------------------------------
...
Oct 31 13:54:05 NOTICE: LDoms Manager is rebooting the primary domain to apply changes for Recovery Mode
[ Oct 31 13:54:08 Stopping because service disabled. ]
[ Oct 31 13:54:08 Executing stop method (:kill). ]
[ Oct 31 13:58:01 Executing start method ("/opt/SUNWldm/bin/ldmd_start"). ]
Oct 31 13:58:20 ------------------------------------------------------------------------------
Oct 31 13:58:20 Recovery Mode requested by the system controller. Continuing recovery of configuration 'stef-mem'
Oct 31 13:58:20 ------------------------------------------------------------------------------
…
Oct 31 13:58:26 Recovering root domain ldg1
Oct 31 13:58:27 Starting root domain ldg1
Oct 31 13:58:29 LDom ldg1 started
Oct 31 13:58:29 Domain ldg1 was resized from 3000G to 2814G of memory
Examples
Assume a PDom with the following LDoms configuration (OVM 3.1) and SysFW 9.1 on the SP:
root@pdom02:~# svccfg -s ldmd listprop ldmd/recovery_mode
root@pdom02:~#
root@pdom02:~# ldm list
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 192 1094G 0.6% 0.7% 2d 37m
ldg1 active -n---- 5000 192 3000G 0.5% 0.6% 2d 53m
root@pdom02:~# ldm list-config
factory-default
stef-memory [current]
Now a resource is missing when starting the PDom: here a DIMM has been manually disabled (and configuration rules then disabled additional DIMMs):
-> show -t /SYS/CMU9 current_config_state==(Disabled,Degraded) current_config_state disable_reason
Target | Property | Value
---------------------------------+---------------------------------------+--------------------------------------------------------
/SYS/CMU9 | current_config_state | Degraded
/SYS/CMU9 | disable_reason | None
/SYS/CMU9/CMP1 | current_config_state | Degraded
/SYS/CMU9/CMP1 | disable_reason | None
/SYS/CMU9/CMP1/D0100 | current_config_state | Disabled
/SYS/CMU9/CMP1/D0100 | disable_reason | By user
/SYS/CMU9/CMP1/D0101 | current_config_state | Disabled
/SYS/CMU9/CMP1/D0101 | disable_reason | Configuration Rules
...
/SYS/CMU9/CMP1/D1113 | current_config_state | Disabled
/SYS/CMU9/CMP1/D1113 | disable_reason | Configuration Rules
If a known degraded config exists:
2013-10-31 19:29:59 16:0:0> WARNING: Missing guest memory [0x150000000000:0x158000000000]
2013-10-31 19:29:59 16:0:0> WARNING: Missing required memory resources to boot config
2013-10-31 19:29:59 16:0:0> WARNING: Unable to boot stef-memory due to missing resources
2013-10-31 19:29:59 16:0:0> DEBUG: Trying to fall back to degraded config
2013-10-31 19:30:01 16:0:0> NOTICE: Booting config = stef-memory
root@pdom02:~# ldm list
------------------------------------------------------------------------------
Notice: the system is running a degraded configuration because not all
resources required for the selected configuration were available when
the system was powered on.
------------------------------------------------------------------------------
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 192 1T 0.8% 0.8% 4m
ldg1 active -n---- 5000 192 2814G 0.6% 0.6% 4m
root@pdom02:~# ldm list-config
------------------------------------------------------------------------------
Notice: the system is running a degraded configuration because not all
resources required for the selected configuration were available when
the system was powered on.
------------------------------------------------------------------------------
factory-default
stef-memory [current] [degraded]
If no known degraded config exists, then since Recovery Mode is disabled, the system falls back to factory-default:
2013-10-31 20:43:47 16:0:0> WARNING: Missing guest memory [0x150000000000:0x158000000000]
2013-10-31 20:43:47 16:0:0> WARNING: Missing required memory resources to boot config
2013-10-31 20:43:47 16:0:0> WARNING: Unable to boot stef-mem due to missing resources
2013-10-31 20:43:47 16:0:0> DEBUG: Trying to fall back to degraded config
2013-10-31 20:43:47 16:0:0> DEBUG: Degraded config doesn't exist
2013-10-31 20:43:47 16:0:0> WARNING: Falling back to factory-default
2013-10-31 20:43:47 16:0:0> NOTICE: Booting config = factory-default
root@pdom02:~# ldm list
------------------------------------------------------------------------------
Notice: the LDoms Manager is running in Recovery Mode because not all
resources required for the selected configuration were available when
the system was powered on.
------------------------------------------------------------------------------
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-c-- UART 384 3930880M 0.4% 0.4% 10m
From /var/svc/log/ldoms-ldmd:default.log
[ Oct 31 13:48:45 Executing start method ("/opt/SUNWldm/bin/ldmd_start"). ]
Oct 31 13:49:01 warning: Invalid --recovery-mode value 'astring:never', default to 'never
Oct 31 13:49:02 Recovery Mode has been requested by the system controller but it has not been administratively enabled through the ldmd/recovery_mode smf(5) property Falling back to 'factory-default'
At this point, it is still possible to force a recovery from the factory-default configuration that was booted:
root@pdom02:~# svccfg -s ldmd setprop ldmd/recovery_mode = astring: auto
root@pdom02:~# svccfg -s ldmd listprop ldmd/recovery_mode
ldmd/recovery_mode astring auto
root@pdom02:~# svcadm refresh ldmd
root@pdom02:~# svcadm restart ldmd
leading to:
NOTICE: Recovery Mode requested by the system controller. LDoms Manager is starting recovery.
NOTICE: LDoms Manager is rebooting the primary domain to apply changes for Recovery Mode
This allows a degraded config to be used:
root@pdom02:~# ldm list
------------------------------------------------------------------------------
Notice: the LDoms Manager is running in Recovery Mode because not all
resources required for the selected configuration were available when
the system was powered on.
------------------------------------------------------------------------------
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 192 1T 1.0% 1.0% 3m
ldg1 active -t---- 5000 192 2814G 0.5% 0.5% 12s
root@pdom02:~# ldm list-config
------------------------------------------------------------------------------
Notice: the system is running a degraded configuration because not all
resources required for the selected configuration were available when
the system was powered on.
------------------------------------------------------------------------------
factory-default
stef-mem [current] [degraded]
From /var/svc/log/ldoms-ldmd:default.log
[ Oct 31 13:53:08 No 'refresh' method defined. Treating as :true. ]
[ Oct 31 13:53:16 Stopping because service restarting. ]
[ Oct 31 13:53:16 Executing stop method (:kill). ]
[ Oct 31 13:53:16 Executing start method ("/opt/SUNWldm/bin/ldmd_start"). ]
Oct 31 13:53:16 ------------------------------------------------------------------------------
Oct 31 13:53:16 Recovery Mode requested by the system controller. Starting recovery of configuration 'stef-mem'
Oct 31 13:53:16 ------------------------------------------------------------------------------
...
Oct 31 13:54:05 NOTICE: LDoms Manager is rebooting the primary domain to apply changes for Recovery Mode
[ Oct 31 13:54:08 Stopping because service disabled. ]
[ Oct 31 13:54:08 Executing stop method (:kill). ]
[ Oct 31 13:58:01 Executing start method ("/opt/SUNWldm/bin/ldmd_start"). ]
Oct 31 13:58:20 ------------------------------------------------------------------------------
Oct 31 13:58:20 Recovery Mode requested by the system controller. Continuing recovery of configuration 'stef-mem'
Oct 31 13:58:20 ------------------------------------------------------------------------------
…
Oct 31 13:58:26 Recovering root domain ldg1
Oct 31 13:58:27 Starting root domain ldg1
Oct 31 13:58:29 LDom ldg1 started
Oct 31 13:58:29 Domain ldg1 was resized from 3000G to 2814G of memory
Had Recovery Mode been enabled beforehand, ldmd would have used a degraded configuration automatically, directly after booting Solaris:
2013-09-09 15:05:45 16:0:0> WARNING: bootconfig not bootable: missing strand id 2176
2013-09-09 15:05:45 16:0:0> WARNING: Missing required strand resources to boot config
2013-09-09 15:05:46 16:0:0> WARNING: Unable to boot stef-alternate due to missing resources
2013-09-09 15:05:46 16:0:0> WARNING: Falling back to factory-default
2013-09-09 15:05:46 16:0:0> NOTICE: Booting config = factory-default
SPARC M5-32, No Keyboard
Copyright (c) 1998, 2013, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.35.3, 2.9987 TB memory available, Serial #103049886.
Ethernet address 0:10:e0:24:6a:b2, Host ID: 86246a9e.
WARNING: One or more resources have been retired, please run 'show faulty' on the SP.
Evaluating:
No viable default device found in boot-device variable.
{600} ok boot disk
Boot device: /pci@b00/pci@1/pci@0/pci@c/pci@0/pci@4/scsi@0/disk@w5000cca0162f46f1,0:a File and args:
Hostname: pdom02
NOTICE: Recovery Mode requested by the system controller. LDoms Manager is starting recovery.
root@pdom02:~# ldm list
------------------------------------------------------------------------------
Notice: the LDoms Manager is running in Recovery Mode because not all
resources required for the selected configuration were available when
the system was powered on.
------------------------------------------------------------------------------
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-c-- UART 288 3144448M 0.1% 0.1% 5m
root@pdom02:~#
NOTICE: LDoms Manager is rebooting the primary domain to apply changes for Recovery Mode
Broadcast Message from root (???) on pdom02
Mon Sep 9 08:11:18...
THE SYSTEM pdom02 IS BEING SHUT DOWN NOW ! ! !
Log off now or risk your files being damaged
NOTICE: LDoms Manager is rebooting the primary domain to apply changes for Recovery Mode
pdom02 console login: NOTICE: LDoms Manager has completed recovery
On the SP, the selected boot configuration remains:
/HOST2/bootmode
    Properties:
        config = stef-alternate
jack@pdom02:~$ ldm list
------------------------------------------------------------------------------
Notice: the LDoms Manager is running in Recovery Mode because not all
resources required for the selected configuration were available when
the system was powered on.
------------------------------------------------------------------------------
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 48 34G 0.5% 0.5% 3m
ldg1 active -n---- 5000 192 24G 1.5% 1.5% 27s
jack@pdom02:~$ ldm list-config
------------------------------------------------------------------------------
Notice: the LDoms Manager is running in Recovery Mode because not all
resources required for the selected configuration were available when
the system was powered on.
------------------------------------------------------------------------------
factory-default
stef-alternate [next poweron]
References
<NOTE:1674918.1> - LDOM fails to start after configuration changes or faults