Sun Fire[TM] 12K/15K/20K/25K: DR not possible when CSB is blacklisted and domain configured with both Centreplane halves active

Asset ID:	1-72-1009533.1
Update Date:	2018-01-29
Keywords:

Solution Type Problem Resolution Sure

Solution 1009533.1 : Sun Fire[TM] 12K/15K/20K/25K: DR not possible when CSB is blacklisted and domain configured with both Centreplane halves active

Applies to:

Sun Fire 12K Server - Version All Versions and later
Sun Fire 15K Server - Version All Versions and later
Sun Fire E20K Server - Version All Versions and later
Sun Fire E25K Server - Version All Versions and later
All Platforms
***Checked for relevance on 17-Jan-2014***

Symptoms

It has been observed that when a CSB (Centreplane Support Board) loses one of its redundant power supplies, an attempt to DR (Dynamically Reconfigure) a system board into a domain will fail.

This is due to the design of the hpost process: to DR in a system board, hpost attempts to set the board to use the same Address, Response and Data bus configuration as the running domain.

Because that configuration cannot be achieved while the CSB is blacklisted, hpost fails the SB (System Board).

This is, in fact, not a bug. By design, hpost (Host Power On Self Test) does not change the current bus configuration when it runs as part of a DR.

hpost can indeed make bus configuration decisions when the domain is being started up from cold, just not during a DR. The method to change the bus configuration on a running domain is 'setbus'.

Cause

See the Solution section below.

Solution

Resolution
This failure is characterized by the very early failure of POST.

The complete POST log below shows that POST fails very early on. The only real details to go on are the ESMD (Environmental Status Monitoring Daemon) blacklist entry, which lists "cplane 1" (that is, CSB1) as disabled, and the message "No minimum system left after blacklist file!".

So there is not a great deal to go on in the POST log, but there are other places we can look for the data.

First - The hpost log we have already discussed:

# SMI Sun Fire 12/15/20/25K POST log opened Tue Jun 13 00:39:13 2006
# hpost version 1.5 Generic 120648-04 Apr 24 2006 12:10:28
# libxcpost.so v. 1.5 Generic 120648-04 Apr 24 2006 11:48:42
# pid = 9538 level = 16 verbose_level = 20
# SC name: e25k1-sc1. ChHostID: XX00XX00XX00X
# Domain Id = A
# Parent PID = 6081: dxs
# Cmdline: /opt/SUNWSMS/SMS1.5/bin/hpost -dA -H16.0

Significant contents of .postrc (platform)
/etc/opt/SUNWSMS/SMS1.5/config/platform/.postrc:
# ident "@(#)postrc 1.1 01/04/02 SMI"
Reading domain blacklist file /etc/opt/SUNWSMS/config/A/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading platform blacklist file /etc/opt/SUNWSMS/config/platform/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading system ASR blacklist file /etc/opt/SUNWSMS/config/asr/blacklist ...
cplane 1 # ESMD Power Failure 0610.1718.56
SEEPROM probe took 0 seconds.
Reading Component Health Status (CHS) information ...
No minimum system left after blacklist file! Bailing out!
Exitcode = 48: No system after domain, .postrc, blacklist, etc.
POST (level=16, verbose=20, -H16.0) execution time 1:09
# SMI Sun Fire 12/15/20/25K POST log closed Tue Jun 13 00:40:22 2006
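
When several POST logs have to be checked, for example in an explorer, it can help to pull out only the two lines that matter for this problem. The following is a minimal sketch, not an SMS tool: summarise_post_log is a hypothetical helper name, the log path is supplied by the caller, and the patterns are taken from the example log above.

```shell
# Hypothetical helper: summarise an hpost log passed as $1.
# Prints any blacklisted centreplane entries and the POST exit code line.
summarise_post_log() {
    log="$1"
    # Blacklist entries for a centreplane half appear as "cplane <n> # <reason>".
    grep '^cplane [01]' "$log"
    # The exit code line states why POST stopped.
    grep '^Exitcode = ' "$log"
}
```

Run against the log above, this would print the "cplane 1" entry followed by the "Exitcode = 48" line. It prints nothing if neither pattern is present.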

Then we have the platform log, located in /var/opt/SUNWSMS/adm/platform/messages, or in /<explorer-dir>/sf15k/adm/platform/messages if you are checking an explorer.

This log may contain messages that tell us more about previous failures.
In the case that prompted this document, there had been a prior CSB power supply failure.

Jun 13 00:29:40 2006 e25k1-sc1 esmd[5620]: [2000 2674760785927349 ERR
SysControl.cc 1536] A failure has been detected on redundant PS at
ps1_power_good_l; located on CSB at CS1. SCHEDULE REPLACEMENT of CSB at CS1 as
soon as possible to restore redundancy.
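
To check a platform log for this kind of message, a simple search for the message text is usually enough. The sketch below assumes the message wording shown above. find_redundant_ps_failures is a hypothetical helper name, and the default path is the platform log location on the SC given in this document (pass an explorer path as the first argument instead if needed).

```shell
# Hypothetical helper: list esmd redundant power supply failure messages.
# $1 is the platform messages file; the default is the path on the SC
# given in this document.
find_redundant_ps_failures() {
    messages="${1:-/var/opt/SUNWSMS/adm/platform/messages}"
    # -n prefixes each match with its line number in the file.
    grep -n 'A failure has been detected on redundant PS' "$messages"
}
```

Each line of output identifies one occurrence of the message. The board location (for example "CSB at CS1") follows in the same message.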

Of course, this board has redundant power supplies, so the platform kept running after this failure. However, as the message notes, a replacement should be scheduled as soon as possible.

The trick is that this also causes an entry to be made in the ASR blacklist, which hpost must obey.
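
The ASR blacklist can also be inspected directly. The following is a minimal sketch, assuming the file is plain text in the form seen in the POST log above (whole-line "#" comments plus entries such as "cplane 1 # reason"). list_blacklist_entries is a hypothetical helper name, and the default path is the ASR blacklist path that hpost reports reading.

```shell
# Hypothetical helper: print the active entries in a blacklist file.
# $1 defaults to the ASR blacklist path shown in the POST log in this document.
list_blacklist_entries() {
    blacklist="${1:-/etc/opt/SUNWSMS/config/asr/blacklist}"
    # Drop whole-line comments (such as the "# ident" header) and blank lines;
    # what remains are component entries such as "cplane 1 # <reason>".
    grep -v '^ *#' "$blacklist" | grep -v '^ *$'
}
```

If a "cplane" entry is listed, hpost will treat that centreplane half as unavailable.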

From the Solaris[TM] side, within the domain, this type of failure produces only a somewhat generic error:

# cfgadm -c configure SB13
Jun 23 10:13:13 v4u-15ka-e-epar02 drmach: WARNING: SMS hpost reported
error, see POST log for details
cfgadm: Hardware specific failure: test SB13: SMS hpost reported error,
see POST log for details

Of course, this directs us to check the POST output.

So - We have set the scene, and now know that with the CSB partially failed, and listed as blacklisted in the ASR blacklist, we can't DR a system board into the domain.

What is the solution?

The only supported and sensible answer to this question is to replace the CSB with the failed power supply!

An example process follows: (Note: These processes are covered in great detail in the 15K and 25K service manuals. This document only supplies the minimum detail)

Let's assume that the failed CSB is CSB1, and the main SC is SC1.
We'll assume this config, as it's the hardest to work around.

In essence, we need to get SC0 (The SC in the *good* CSB) to be main, stop using the failed CSB and then replace it.

Failover from SC1 to SC0
setfailover on (wait for sync to complete)
setfailover force

This fails the SCs over, making SC0 the main SC.

Stop using CSB1
setbus -c cs0

This disables the Address, Data and Response busses for all CSB1 supported paths. This means we are ready to replace the CSB.

Halt SC1
From SC0, poweroff SC1, the SCPER1 and CSB1
poweroff sc1 scper1 csb1
Remove SC1, SCPER1 and CSB1
Install new CSB1
Install SCPER1 and SC1 *in that order*
SC1 automatically powers on and boots
setfailover on

Wait a few minutes for sync

Start using all busses again.
setbus -c cs0,cs1

Done!

Relief/Workaround

Using setbus, we can work around this issue.
Note: This assumes that CS1 is the failed CSB, and all work is done on the main SC.

- Disable all CSB1 supported busses:
  setbus -c cs0

- Perform the DR operation

- Replace the CSB at the first opportunity! Redundancy in your platform depends on having both CSBs working at 100% capacity. See the 'Solution' section above.

See also - Technical Instruction <Document 1003308.1> Sun Fire[TM]12K/15K/E20K/E25K: esmd warning; A power failure has been detected on a redundant power supply at ...

