Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1010757.1
Update Date:2016-10-04
Keywords:

Solution Type  Technical Instruction Sure

Solution  1010757.1 :   Sun Fire[TM] 12K/15K/E20K/E25K : Voltage Error on CPU Leads to Blacklisting the PROCPAIR  


Related Items
  • Sun Fire E25K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
  • Sun Fire E20K Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-Exxk
  • _Old GCS Categories>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
214857


Applies to:

Sun Fire 15K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E20K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 12K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E25K Server - Version Not Applicable to Not Applicable [Release N/A]
All Platforms

Goal

When a voltage problem is detected on a single CPU on Sun Fire[TM] 12K/15K/E20K/E25K platforms, ASR (Automatic System Recovery) blacklists its PROCPAIR, and the domain is reset.
Blacklisting the PROCPAIR means that two CPUs and their memory are disabled and removed from the domain configuration.

This document explains why esmd (the Environmental Status Monitoring Daemon) disables and removes two CPUs as the result of a voltage problem on a single CPU.
This behavior might seem incorrect, but the recovery action is working exactly as designed.

Solution

The following is an example of a voltage fault which might be logged in the /var/opt/SUNWSMS/SMS/adm/platform/messages file on the System Controller (SC):

Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1919 216320102467983 ERR DetectorV.cc609] A low voltage or power supply has been detected on Core3, located on CPU at SB8. The voltage detected is 0.02v; should be 1.31v to 1.47v. PROCPAIR at SB8/PP1 is being removed from the domain and powered off. Check all hardware for the cause.
Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [0 216320149848762 NOTICE SysControl.cc 5296] Component PROCPAIR at SB8/PP1 has been blacklisted
Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1930 216320225206159 NOTICE SysControl.cc 6113] PROCPAIR at SB8/PP1 has been powered off: ecode=0

In the preceding error message, Core3 (CPU3) on SB8 reports a low voltage: 0.02v against an expected range of 1.31v to 1.47v.
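
When reviewing many such entries, it can help to pull the key fields out of the message text. The following is a minimal Python sketch, not part of SMS: the function name parse_low_voltage and the regular expression are hypothetical, and the pattern is written only against the wording of the single sample message shown above, so other esmd messages may not match it.

```python
import re

# Pattern written against the one sample message in this document;
# other esmd messages may be worded differently (assumption).
LOW_VOLTAGE_RE = re.compile(
    r"detected on Core(?P<core>\d+), located on CPU at (?P<board>SB\d+)\. "
    r"The voltage detected is (?P<measured>\d+(?:\.\d+)?)v; should be\s+"
    r"(?P<low>\d+(?:\.\d+)?)v to (?P<high>\d+(?:\.\d+)?)v\. "
    r"PROCPAIR at (?P<procpair>SB\d+/PP\d+)"
)


def parse_low_voltage(message):
    """Return the fields of a low-voltage message as a dict, or None."""
    match = LOW_VOLTAGE_RE.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["core"] = int(fields["core"])
    for key in ("measured", "low", "high"):
        fields[key] = float(fields[key])
    # True when the reading falls outside the range quoted in the message.
    fields["out_of_range"] = not (
        fields["low"] <= fields["measured"] <= fields["high"]
    )
    return fields


# The sample message from this document, joined into one string.
sample = (
    "Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1919 216320102467983 ERR "
    "DetectorV.cc609] A low voltage or power supply has been detected on Core3, "
    "located on CPU at SB8. The voltage detected is 0.02v; should be "
    "1.31v to 1.47v. PROCPAIR at SB8/PP1 is being removed from the domain and "
    "powered off. Check all hardware for the cause."
)

info = parse_low_voltage(sample)
print(info["board"], info["core"], info["measured"], info["procpair"])
```

The sketch only reads the message text; it does not consult the tolerance file on the SC, and the range it compares against is the one quoted in the message itself.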

NOTE: The voltage tolerances are defined on the SC in the /etc/opt/SUNWSMS/SMS/config/esmd_tuning.txt file; do not edit this file manually!


In the preceding error message, esmd has blacklisted "Component PROCPAIR at SB8/PP1." PROCPAIRs are defined as follows:

  • PP0 = PROCPAIR0 = CPU0 and CPU1
  • PP1 = PROCPAIR1 = CPU2 and CPU3
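
The mapping above can be sketched in a few lines of Python. This is an illustration only, assuming the four CPUs per system board (CPU0 to CPU3) implied by the list; the function names procpair_for_cpu and cpus_in_procpair are hypothetical and are not SMS commands or APIs.

```python
def procpair_for_cpu(cpu):
    """Map a CPU number on a system board (0-3) to its PROCPAIR number."""
    if cpu not in range(4):
        raise ValueError("expected a CPU number from 0 to 3")
    # CPU0 and CPU1 belong to PP0; CPU2 and CPU3 belong to PP1.
    return cpu // 2


def cpus_in_procpair(procpair):
    """Return the two CPU numbers that belong to PP0 or PP1."""
    if procpair not in (0, 1):
        raise ValueError("expected PROCPAIR 0 or 1")
    return (2 * procpair, 2 * procpair + 1)


# A fault reported on CPU3 blacklists PP1, which removes CPU2 and CPU3.
pp = procpair_for_cpu(3)
print(f"CPU3 -> PP{pp}, CPUs removed: {cpus_in_procpair(pp)}")
```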

The reported voltage fault on only one CPU (CPU3) results in two CPUs (CPU2 and CPU3) being removed from the domain configuration through blacklisting.
The decision to blacklist the whole PROCPAIR for the failure of a single CPU is a compromise between differing forces placed on POST (power-on self-test) with regard to availability. POST is the set of hardware tests run against components before the domain enters OBP (OpenBoot PROM); these tests confirm the hardware sanity of the components.

Differing forces placed on POST:

  1. FORCE 1: POST needs to be able to exclude faulty components from the domain configuration so that they cannot cause future failures.
  2. FORCE 2: POST should allow as many resources as possible to be configured into the domain to minimize the impact on the domain.

It is important to note that a voltage fault reported by a single CPU might not be limited to that CPU itself. The same voltage issue could also be affecting its "related" components, such as the BBC ASIC, the DCDS ASIC, and so on. Ultimately, the fault could be the result of a power distribution issue that points to a larger problem on the board.
The actual reason for the voltage fault might not be fully known, and the number of components that are affected by it might also be unknown.

Because of this "unknown" factor, there are two approaches for dealing with voltage issues on a board: the Conservative Approach and the Aggressive Approach.

These two approaches relate directly to the two POST forces described previously:

  • Conservative: Disable the entire System Board. This prevents any future outage if "related" components are also affected by the voltage problem (FORCE 1).
  • Aggressive: Disable only the component reporting the voltage problem. This leaves as many resources as possible available to the domain, but it carries some risk that an affected "related" component remains in the configuration (FORCE 2).

Arguments for each approach can be made and no one argument is incorrect. One Customer might believe that disabling the whole System Board is the best decision; another Customer might believe that it is absolutely unacceptable to lose that many resources. Neither Customer is incorrect. A compromise is necessary to meet the needs of both of these forces.

Here is what esmd does as a compromise to meet these different POST forces:

  • Force 1 results in the exclusion of faulty components from the domain configuration, which is done by disabling the CPU that reports the voltage problem together with its PROCPAIR partner. Isolating the whole PROCPAIR prevents a problem on a "CPU-related" component, such as the BBC ASIC or the DCDS ASIC, from causing further incidents. Each PROCPAIR shares components such as these ASICs, so the PROCPAIR is a logical place to isolate.
  • Force 2 results in the configuration of as many resources as possible into the domain configuration. The remaining PROCPAIR is allowed into the domain configuration so that the domain can function. For a single-board domain, the Conservative Approach would leave the domain down until service can take place, while the Aggressive Approach would leave an exposure if a "related" component has a voltage issue of its own.
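
The difference between the two approaches and the esmd compromise can be sketched as follows. This is an illustrative Python model, not esmd code: the function disabled_cpus and the approach labels are hypothetical, the board is assumed to carry the four CPUs (two PROCPAIRs) listed earlier, and only CPUs are modeled even though the Conservative Approach disables the entire System Board.

```python
BOARD_CPUS = (0, 1, 2, 3)  # two PROCPAIRs per system board, as listed earlier


def disabled_cpus(faulty_cpu, approach):
    """Return the set of CPUs taken out of the domain for a given approach."""
    if faulty_cpu not in BOARD_CPUS:
        raise ValueError("expected a CPU number from 0 to 3")
    if approach == "conservative":
        # Disable the whole system board.
        return set(BOARD_CPUS)
    if approach == "aggressive":
        # Disable only the CPU that reported the voltage problem.
        return {faulty_cpu}
    if approach == "compromise":
        # Disable the reporting CPU and its PROCPAIR partner.
        first = (faulty_cpu // 2) * 2
        return {first, first + 1}
    raise ValueError("unknown approach: " + approach)


for approach in ("conservative", "aggressive", "compromise"):
    removed = disabled_cpus(3, approach)
    remaining = sorted(set(BOARD_CPUS) - removed)
    print(f"{approach}: removed {sorted(removed)}, remaining {remaining}")
```

For a fault on CPU3, the compromise removes CPU2 and CPU3 and leaves PP0 (CPU0 and CPU1) available, so a single-board domain can still come up.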

The compromise configures the domain with as little resource impact as possible while providing as much error isolation as possible.
Ultimately, it serves both POST forces: domain availability and fault isolation. The compromise is not perfect, but it is an appropriate way to isolate faulty components from the configuration and prevent future outages, while keeping as many resources as possible available for domain production until a maintenance window allows the issue to be resolved.

Internal Comments

A false over-voltage issue existed in SMS 1.2 and 1.3 software: Sun Alert 53625 "CPU0/CPU1 May Be Disabled on Sun Fire 12K/15K System Boards Resulting in Domain Interruption"

Keywords: 12k, 15k, 12K, 15K, esmd, voltage, procpair, PROCPAIR, blacklisted, ASR

References: Sun Fire[TM] 12K/15K/E20K/E25K: How to bring CPUs back online after they have been blacklisted on domains (Doc ID 1004765.1)

Previously Published As 76240


  Copyright © 2018 Oracle, Inc.  All rights reserved.