Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1009533.1
Update Date:2018-01-29
Keywords:

Solution Type  Problem Resolution Sure

Solution  1009533.1 :   Sun Fire[TM] 12K/15K/20K/25K: DR not possible when CSB is blacklisted and domain configured with both Centreplane halves active  


Related Items
  • Sun Fire 15K Server
  • Sun Fire E20K Server
  • Sun Fire E25K Server
  • Sun Fire 12K Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-Exxk
  • _Old GCS Categories>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
213152


Applies to:

Sun Fire 12K Server - Version All Versions and later
Sun Fire 15K Server - Version All Versions and later
Sun Fire E20K Server - Version All Versions and later
Sun Fire E25K Server - Version All Versions and later
All Platforms
***Checked for relevance on 17-Jan-2014***

Symptoms

When a CSB (Centreplane Support Board) loses one of its redundant power supplies, an attempt to DR (Dynamically Reconfigure) a system board into a domain fails.

This follows from the design of hpost: to DR in a system board, hpost attempts to set the board up with the same Address, Response and Data bus configuration that the running domain already uses.

Because that configuration cannot be achieved while the CSB is blacklisted, hpost fails the SB (System Board).

This is, in fact, not a bug. By design, hpost (Host Power On Self Test) does not change the current bus configuration when it runs for a DR.

hpost can make bus configuration decisions when a domain is started from cold, but not during a DR. To change the bus configuration of a running domain, use 'setbus'.
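
The rule described above can be pictured with a small toy model. This is only an illustration of the reasoning, not hpost's actual logic: the function name, its parameters and the representation of the two Centreplane halves as the numbers 0 and 1 are all assumptions made for this sketch.

```python
# Toy model only: this is NOT hpost's real logic, just the rule described above.
ALL_HALVES = frozenset({0, 1})  # centreplane halves served by CSB0 and CSB1


def usable_bus_config(current, blacklisted, is_dr):
    """Return the set of centreplane halves POST may use, or None on failure.

    current     -- halves the running domain uses (ignored for a cold start)
    blacklisted -- halves whose CSB is blacklisted
    is_dr       -- True when POST runs to DR a board into a running domain
    """
    if is_dr:
        # A DR must reuse the running domain's configuration unchanged,
        # so any overlap with the blacklist means the board cannot be added.
        if set(current) & set(blacklisted):
            return None
        return set(current)
    # A cold start is free to pick whatever is not blacklisted.
    remaining = ALL_HALVES - set(blacklisted)
    return set(remaining) if remaining else None


# Domain on both halves, CSB1 blacklisted: the DR attempt fails.
print(usable_bus_config({0, 1}, {1}, is_dr=True))
# Once the domain runs on half 0 only, the same DR can proceed.
print(usable_bus_config({0}, {1}, is_dr=True))
# A cold start with CSB1 blacklisted simply comes up on half 0.
print(usable_bus_config({0, 1}, {1}, is_dr=False))
```

In this model, the only way to make the DR succeed without replacing hardware is to change the running domain's configuration first, which is the role 'setbus' plays.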

Cause

See the Solution section below.

Solution

Resolution
This failure is characterized by the very early failure of POST.

The complete POST log below shows how early POST fails. The only real details it offers are the ESMD (Environmental Status Monitoring Daemon) blacklist entry, which lists "cplane 1" (aka CSB1) as disabled, and the message "No minimum system left after blacklist file".

That is not a great deal to go on in the POST log, but there are other places we can look for data.

First - The hpost log we have already discussed:


# SMI Sun Fire 12/15/20/25K POST log opened Tue Jun 13 00:39:13 2006
# hpost version 1.5 Generic 120648-04 Apr 24 2006 12:10:28
# libxcpost.so v. 1.5 Generic 120648-04 Apr 24 2006 11:48:42
# pid = 9538 level = 16 verbose_level = 20
# SC name: e25k1-sc1. ChHostID: XX00XX00XX00X
# Domain Id = A
# Parent PID = 6081: dxs
# Cmdline: /opt/SUNWSMS/SMS1.5/bin/hpost -dA -H16.0

Significant contents of .postrc (platform)
/etc/opt/SUNWSMS/SMS1.5/config/platform/.postrc:
# ident "@(#)postrc 1.1 01/04/02 SMI"
Reading domain blacklist file /etc/opt/SUNWSMS/config/A/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading platform blacklist file /etc/opt/SUNWSMS/config/platform/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading system ASR blacklist file /etc/opt/SUNWSMS/config/asr/blacklist ...
cplane 1 # ESMD Power Failure 0610.1718.56
SEEPROM probe took 0 seconds.
Reading Component Health Status (CHS) information ...
No minimum system left after blacklist file! Bailing out!
Exitcode = 48: No system after domain, .postrc, blacklist, etc.
POST (level=16, verbose=20, -H16.0) execution time 1:09
# SMI Sun Fire 12/15/20/25K POST log closed Tue Jun 13 00:40:22 2006
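
When several POST logs need checking, the two telling details (the ASR blacklist entry and the exit code) can be pulled out with a short script. The following is a minimal sketch, assuming the log layout matches the example above; the function name summarize_post_log and the assumed entry shape "<component> <unit> # <reason>" are illustrative and not part of SMS.

```python
import re


def summarize_post_log(text):
    """Pull the exit code and any ASR blacklist entries out of a POST log."""
    summary = {"exit_code": None, "exit_reason": None, "asr_entries": []}
    in_asr = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Reading system ASR blacklist file"):
            in_asr = True
            continue
        if in_asr:
            # Entries are assumed to look like "<component> <unit> # <reason>";
            # lines starting with "#" are file comments and are skipped.
            entry = re.match(r"([A-Za-z]\w*)\s+(\S+)\s*#\s*(.*)", line)
            if entry:
                summary["asr_entries"].append(entry.groups())
            elif not line.startswith("#"):
                in_asr = False  # first non-entry line ends the section
        code = re.match(r"Exitcode\s*=\s*(\d+):\s*(.*)", line)
        if code:
            summary["exit_code"] = int(code.group(1))
            summary["exit_reason"] = code.group(2)
    return summary


# Excerpt copied from the POST log shown above.
SAMPLE = """\
Reading system ASR blacklist file /etc/opt/SUNWSMS/config/asr/blacklist ...
cplane 1 # ESMD Power Failure 0610.1718.56
SEEPROM probe took 0 seconds.
No minimum system left after blacklist file! Bailing out!
Exitcode = 48: No system after domain, .postrc, blacklist, etc.
"""

result = summarize_post_log(SAMPLE)
print(result["exit_code"], result["asr_entries"])
```

An exit code of 48 together with a "cplane" entry in the ASR blacklist is the combination this document describes.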


Then we have the platform log, located in /var/opt/SUNWSMS/adm/platform/messages, or in /<explorer-dir>/sf15k/adm/platform/messages if you are checking an explorer.

This log may contain messages that tell us more about earlier failures.
In the case that prompted this document, there had been a prior CSB power supply failure.


Jun 13 00:29:40 2006 e25k1-sc1 esmd[5620]: [2000 2674760785927349 ERR
SysControl.cc 1536] A failure has been detected on redundant PS at
ps1_power_good_l; located on CSB at CS1. SCHEDULE REPLACEMENT of CSB at CS1 as
soon as possible to restore redundancy.
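
A platform log can be long, so it may help to search it for this message mechanically. The sketch below assumes the message wording matches the example above; the function name and the field names (signal, board, slot) are made up for illustration. Whitespace is collapsed first because the message may be wrapped across lines, as it is in this document.

```python
import re

# Wording taken from the esmd message quoted above; the group names are
# labels chosen for this sketch, not SMS terminology.
PS_FAILURE = re.compile(
    r"A failure has been detected on redundant PS at "
    r"(?P<signal>[^;\s]+); located on (?P<board>\w+) at (?P<slot>\w+)"
)


def find_redundant_ps_failures(text):
    """Return one dict per redundant power supply failure message found."""
    flat = " ".join(text.split())  # undo any line wrapping
    return [match.groupdict() for match in PS_FAILURE.finditer(flat)]


# Message copied from the platform log excerpt shown above.
SAMPLE = """\
Jun 13 00:29:40 2006 e25k1-sc1 esmd[5620]: [2000 2674760785927349 ERR
SysControl.cc 1536] A failure has been detected on redundant PS at
ps1_power_good_l; located on CSB at CS1. SCHEDULE REPLACEMENT of CSB at CS1 as
soon as possible to restore redundancy.
"""

for failure in find_redundant_ps_failures(SAMPLE):
    print(failure["board"], failure["slot"], failure["signal"])
```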


This board has redundant power supplies, so the platform kept running after the failure. However, as the message notes, a replacement should be scheduled as soon as possible.

The catch is that the failure also adds an entry to the ASR blacklist, which hpost must obey.

From the Solaris[TM] side, within the domain, this type of failure produces only a somewhat generic error:

# cfgadm -c configure SB13
Jun 23 10:13:13 v4u-15ka-e-epar02 drmach: WARNING: SMS hpost reported
error, see POST log for details
cfgadm: Hardware specific failure: test SB13: SMS hpost reported error,
see POST log for details


Of course, this directs us to check the POST output.

So - We have set the scene, and now know that with the CSB partially failed, and listed as blacklisted in the ASR blacklist, we can't DR a system board into the domain.


What is the solution?

The only supported and sensible answer to this question is to replace the CSB with the failed power supply!

An example process follows: (Note: These processes are covered in great detail in the 15K and 25K service manuals. This document only supplies the minimum detail)

Let's assume that the failed CSB is CSB1, and the main SC is SC1.
We'll assume this config, as it's the hardest to work around.

In essence, we need to get SC0 (the SC in the *good* CSB) to be main, stop using the failed CSB, and then replace it.

 

  • Failover from SC1 to SC0
      • setfailover on (wait for sync to complete)
      • setfailover force

This fails the SCs over.

  • Stop using CSB1
      • setbus -c cs0

This disables the Address, Data and Response busses for all CSB1 supported paths. This means we are ready to replace the CSB.

  • Halt SC1
  • From SC0, power off SC1, SCPER1 and CSB1
      • poweroff sc1 scper1 csb1
  • Remove SC1, SCPER1 and CSB1
  • Install new CSB1
  • Install SCPER1 and SC1 *in that order*
  • SC1 automatically powers on and boots
  • setfailover on

Wait a few minutes for sync.

  • Start using all busses again.
      • setbus -c cs0,cs1

Done!
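
The procedure above can be kept as a printable checklist. The sketch below simply encodes the steps and commands from this document for the assumed case (failed CSB1, main SC1). The function name and its parameters are hypothetical, and the mirrored case (failed CSB0) is an assumption obtained by swapping the numbers; it is not taken from this document and should be checked against the service manuals before use.

```python
def csb_replacement_steps(failed=1, main_sc=1):
    """Return (description, command) pairs; command is None for manual steps.

    failed  -- number of the failed CSB (0 or 1)
    main_sc -- number of the SC that is currently main (0 or 1)
    """
    if failed not in (0, 1) or main_sc not in (0, 1):
        raise ValueError("failed and main_sc must be 0 or 1")
    good = 1 - failed
    steps = []
    if main_sc == failed:
        # The main SC sits in the failed CSB, so move the main role first.
        steps.append(("Enable failover and wait for sync", "setfailover on"))
        steps.append((f"Fail over to SC{good}", "setfailover force"))
    steps += [
        (f"Stop using CSB{failed}", f"setbus -c cs{good}"),
        (f"Halt SC{failed}", None),
        (f"Power off SC{failed}, SCPER{failed} and CSB{failed}",
         f"poweroff sc{failed} scper{failed} csb{failed}"),
        (f"Remove SC{failed}, SCPER{failed} and CSB{failed}", None),
        (f"Install new CSB{failed}, then SCPER{failed}, then SC{failed}", None),
        (f"Wait for SC{failed} to power on and boot", None),
        ("Enable failover and wait a few minutes for sync", "setfailover on"),
        ("Start using all busses again", "setbus -c cs0,cs1"),
    ]
    return steps


for number, (what, command) in enumerate(csb_replacement_steps(), start=1):
    print(f"{number}. {what}" + (f": {command}" if command else ""))
```

With the default arguments, the commands come out in the same order as in the procedure above. The script only prints text; it does not run anything on the SC.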


Relief/Workaround

Using setbus, we can work around this issue.
Note: This assumes that CS1 is the failed CSB, and that all work is done on the main SC.

- Disable all CSB1 supported busses
    - setbus -c cs0

- Perform the DR operation

- Replace the CSB at the first opportunity! Redundancy in your platform depends on having both CSBs working at 100% capacity. See the 'Solution' above.



See also - Technical Instruction <Document 1003308.1> Sun Fire[TM]12K/15K/E20K/E25K: esmd warning; A power failure has been detected on a redundant power supply at ...


 


  Copyright © 2018 Oracle, Inc.  All rights reserved.