Asset ID: 1-71-2283936.1
Update Date: 2017-11-21
Keywords:
Solution Type: Technical Instruction
Doc ID: 2283936.1
T5-4 - Main Module Motherboard Replacement Process When Veritas Foundation Suite Is Installed on Control Domains
Related Items
- Solaris x64/x86 Operating System
- SPARC T5-4
- Solaris Operating System
Related Categories
- PLA-Support>Sun Systems>DISK>Storage Drivers>SN-DK: Storage Drivers
In this Document
Created from <SR 3-14013079421>
Applies to:
SPARC T5-4 - Version All Versions and later
Solaris x64/x86 Operating System - Version 10 3/05 and later
Solaris Operating System - Version 10 3/05 and later
Information in this document applies to any platform.
Goal
How to avoid an extended boot delay after a main module (MM) motherboard replacement.
The replacement of the MM on the T5-4 platform is described in the following document:
How to Replace a SPARC T5-4 or T5-8 Server Main Module Motherboard:ATR:1528030.1:2 (Doc ID 1528030.1)
As part of the motherboard replacement operation, the platform needs to be set to 'factory default' mode, which can change the way physical access to attached storage is or is not restricted.
When the system is booted after 'factory default' mode is set, vxvm/vxdmp can be presented with an enlarged kernel device tree, without matching /dev/[r]dsk paths for all disks now exposed in that tree. This delays the boot-time discovery and configuration process that vxdmp needs to complete before the Solaris image finishes booting.
In particular, vxvm/vxdmp now has to deal with a configuration that has changed from the logical domain configuration on which it was initially installed and configured, and it has to resolve an inconsistency:
An inquiry probe to all devices succeeds (through the Solaris kernel), but not all of these devices are fully configured and available (/dev/[r]dsk paths are missing for the additional devices now 'exposed' by 'factory default' mode).
Those devices without a /dev/[r]dsk path are not fully accessible and return I/O errors. As part of its error handling mechanism, vxvm/vxdmp issues an inquiry probe to determine whether a device is ready and what the status of the path to it is. Because this probe succeeds, vxvm/vxdmp concludes that the path is available and the device is ready to be accessed. It therefore keeps (unsuccessfully) accessing the device, with retries, until the timeout limit is reached.
The system does eventually complete a successful boot, but clearly the extended boot delay is undesirable.
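A quick way to see this inconsistency on an affected system is to compare what the kernel has attached against what is visible through /dev/[r]dsk. This is an illustrative check, not part of the official procedure:
# cfgadm -al | grep disk
(lists disks the kernel has attached, regardless of /dev links)
# echo | format
(lists only the disks that have /dev/rdsk entries)
A disk that appears in the first command but not in the second is the kind of device that triggers the vxdmp retries described above.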
Note: Following the replacement of the service processor, motherboard, or main module in a SPARC CMT server, the saved LDoms configuration will be lost.
As a result, the issue described in this document will occur if Veritas is installed.
For reference, see:
How to Replace the Motherboard or Service Processor in a Logical Domain (LDom) Environment (Doc ID 1019720.1)
Depending on the platform type, the LDom configuration is stored within the Service Processor or the Motherboard/Main Module. When these components are replaced, the LDom configuration will be lost (a backup/restore sketch follows the list below):
[T1000, T2000, Netra T2000]
Stored within the /persist filesystem residing on the Service Processor
[T5x20, T5x40, Sun Blade T63x0, Netra T5xx0]
Stored within Host Data Flash embedded on the Service Processor
[T3-1, T3-2, T3-1B, Netra T3-1, T4-1, T4-2, T4-1B, Netra T4-x, T5-2, T5-1B]
Stored within Host Data Flash embedded on the System Board
[T3-4, T4-4, T5-4, T5-8, S7-2, S7-2L, T7-1, T7-2, T7-4]
Stored within Host Data Flash embedded on the Main Module (Motherboard).
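Because the configuration is lost, it is worth saving it before the replacement. A minimal sketch using the standard ldm commands, assuming a current Oracle VM Server for SPARC release; the file name and configuration name are illustrative:
primary# ldm list-spconfig
(note which configuration is currently active)
primary# ldm list-constraints -x > /var/tmp/ldom-backup.xml
After the replacement, with the system in factory-default:
primary# ldm init-system -r -i /var/tmp/ldom-backup.xml
primary# ldm add-spconfig post-replacement
(saves the restored configuration to the new SP/host data flash)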
See also the parameters suggested by Veritas:
https://www.veritas.com/support/en_US/article.000085611
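The exact tunables and values are given in the Veritas article above; the commands below only illustrate how DMP tunables are inspected and set (the tunable name and value shown here are examples, not a recommendation):
# vxdmpadm gettune all
(displays the current values of all DMP tunables)
# vxdmpadm settune dmp_lun_retry_timeout=30
(example of setting a single DMP tunable)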
Solution
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE:
Note 1: When running any OBP command while the host is powered on in 'factory-default' mode,
make sure the OBP parameter auto-boot? is set to false.
Otherwise, Solaris will boot automatically, leaving no chance to enter OBP:
In ILOM:
-> set /HOST/bootmode script="setenv auto-boot? false"
Then power on host
-> start /SYS
Once host powers on, it will drop to OBP and show ok prompt.
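Optionally, the setting can be verified at the ok prompt before going any further (output format varies slightly between OBP versions):
ok> printenv auto-boot?
auto-boot? = false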
Solution:
The way to avoid this inconsistency is to add a 'devfsadm' run to the boot process:
ok> boot -m milestone=none
# devfsadm -v
# init s
# reboot
This simple addition of a 'devfsadm' creates the additional paths for the new devices now exposed to the Solaris image, before the vxdmp driver loads, eliminating the inconsistency between the Solaris kernel device tree and the matching paths in /dev/[r]dsk.
This will eliminate the boot delay.
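Before issuing the final reboot, the result can optionally be confirmed while still in single-user mode (an illustrative check):
# devfsadm -v
(a second run should report no new links being created)
# echo | format
(all expected disks should now be listed)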
Although this problem is most likely to happen in logical domain configurations, the delayed boot can also be seen in non-logical-domain configurations whenever the number of devices, and/or the number of device paths to each device, is larger than the number of matching entries in /dev/[r]dsk, for example because of software, configuration, or hardware storage connectivity changes made before the motherboard replacement is completed.
The simplest, cleanest solution is therefore to add the extra step of issuing a 'devfsadm' at milestone 'none', before continuing the boot. This completely eliminates the inconsistency between the Solaris kernel device tree and the device paths in /dev/[r]dsk while in 'factory default' mode, and so avoids the boot delay.
Oracle created an (almost identical) lab environment for testing, with assistance from Veritas engineers, who provided guidance for installing and configuring vxdmp.
- WHAT WE DID TO REPRODUCE THE ISSUE
- Boot the primary logical domain from disk2 (boot disk) and the secondary logical domain from disk4 (boot disk)
- Run devfsadm -Cv on both primary and secondary domains
- Return the T5-4 platform to factory-default (see the ILOM sketch after this list)
- Boot from disk4, issue was reproduced
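For reference, the return to factory-default in the step above is typically done from ILOM; a sketch of one common sequence (shown as an illustration, details vary by setup):
-> set /HOST/bootmode config="factory-default"
-> stop /SYS
-> start /SYS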
- From the console logs:
Jun 7 16:15:11 secondary genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_150400-40 64-bit
Jun 7 16:15:11 secondary genunix: [ID 523776 kern.notice] Copyright (c) 1983, 2016, Oracle and/or its affiliates. All rights reserved.
Jun 7 16:15:11 secondary genunix: [ID 678236 kern.notice] Ethernet address = 0:10:e0:3a:cb:86
--- We can observe that the sd instances for the 4 local disks are online, even though, at the end of the boot, there are no /dev/rdsk entries (and no format entries) for 3 of the local disks
Jun 7 16:15:11 secondary genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_150400-40 64-bit
Jun 7 16:15:20 secondary genunix: [ID 408114 kern.notice] /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/iport@1 (mpt_sas1) online
Jun 7 16:15:20 secondary genunix: [ID 408114 kern.notice] /scsi_vhci/disk@g5000cca022232ae8 (sd2) online
Jun 7 16:15:20 secondary genunix: [ID 483743 kern.notice] /scsi_vhci/disk@g5000cca022232ae8 (sd2) multipath status: degraded: path 1 mpt_sas1/disk@w5000cca022232ae9,0 is online
Jun 7 16:15:31 secondary genunix: [ID 408114 kern.notice] /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/iport@1 (mpt_sas4) online
Jun 7 16:15:32 secondary genunix: [ID 408114 kern.notice] /scsi_vhci/disk@g5000cca0162b520c (sd0) online
Jun 7 16:15:32 secondary genunix: [ID 483743 kern.notice] /scsi_vhci/disk@g5000cca0162b520c (sd0) multipath status: degraded: path 2 mpt_sas4/disk@w5000cca0162b520d,0 is online
Jun 7 16:15:32 secondary genunix: [ID 408114 kern.notice] /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/iport@4 (mpt_sas5) online
Jun 7 16:15:32 secondary genunix: [ID 408114 kern.notice] /scsi_vhci/disk@g5000cca016239c68 (sd1) online
Jun 7 16:15:32 secondary genunix: [ID 483743 kern.notice] /scsi_vhci/disk@g5000cca016239c68 (sd1) multipath status: degraded: path 3 mpt_sas5/disk@w5000cca016239c69,0 is online
Jun 7 16:15:32 secondary genunix: [ID 408114 kern.notice] /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/iport@8 (mpt_sas6) online
Jun 7 16:15:32 secondary genunix: [ID 408114 kern.notice] /scsi_vhci/disk@g5000cca02224f160 (sd3) online
Jun 7 16:15:32 secondary genunix: [ID 483743 kern.notice] /scsi_vhci/disk@g5000cca02224f160 (sd3) multipath status: degraded: path 4 mpt_sas6/disk@w5000cca02224f161,0 is online
--- vxvm starting
Jun 7 16:18:31 secondary vxdmp: [ID 581508 kern.notice] NOTICE: VxVM vxdmp V-5-0-34 [Info] added disk array DISKS, datype = Disk
Jun 7 16:18:31 secondary vxdmp: [ID 581508 kern.notice] NOTICE: VxVM vxdmp V-5-0-34 [Info] added disk array 25039, datype = Hitachi_USP-VM
Jun 7 16:18:31 secondary vxdmp: [ID 241556 kern.notice] NOTICE: VxVM vxdmp V-5-0-0 [Info] removed disk array FAKE_ENCLR_SNO, datype = FAKE_ARRAY
--- DMP timeouts recur for both disk_2 and disk_3 (an approximately 40-minute delay waiting for vxdmp)
Jun 7 16:23:31 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x18
Jun 7 16:23:31 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x10
Jun 7 16:28:31 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x18
Jun 7 16:28:31 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x10
Jun 7 16:33:31 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x18
Jun 7 16:33:31 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x10
Jun 7 16:38:51 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x18
Jun 7 16:38:51 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x10
Jun 7 16:43:51 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x18
Jun 7 16:43:51 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x10
Jun 7 16:48:51 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x18
Jun 7 16:48:51 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x10
Jun 7 16:54:13 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x18
Jun 7 16:54:13 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x10
Jun 7 16:59:13 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x18
Jun 7 16:59:13 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x10
Jun 7 17:04:13 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x18
Jun 7 17:04:13 secondary vxdmp: [ID 760700 kern.notice] NOTICE: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (300) for disk 336/0x10
References
<NOTE:1006290.1> - Veritas Volume Manager (VxVM): Time Required to Fail a Disk When no Hard Error is Detected
<NOTE:1012201.1> - Veritas Volume Manager : How long does it take for Volume Manager to fail a disk?
<NOTE:1528030.1> - How to Replace a SPARC T5-4 or T5-8 Server Main Module Motherboard:ATR:1528030.1:2
Attachments
This solution has no attachment