Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type Technical Instruction Sure Solution 1010757.1 : Sun Fire[TM] 12K/15K/E20K/E25K : Voltage Error on CPU Leads to Blacklisting the PROCPAIR
Previously Published As 214857

Applies to:
Sun Fire 15K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E20K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 12K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E25K Server - Version Not Applicable to Not Applicable [Release N/A]
All Platforms

Goal
When a voltage problem is detected on a single CPU on Sun Fire[TM] 12K/15K/E20K/E25K platforms, ASR (Automatic System Recovery) blacklists its PROCPAIR, and the domain is reset. This document explains why esmd (Event Status Monitoring Daemon) disables and removes two CPUs as the result of a voltage problem on a single CPU.

Solution
The following is an example of a voltage fault that might be logged in the /var/opt/SUNWSMS/SMS/adm/platform/messages file on the System Controller (SC):

    Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1919 216320102467983 ERR DetectorV.cc609] A low voltage or power supply has been detected on Core3, located on CPU at SB8. The voltage detected is 0.02v; should be 1.31v to 1.47v. PROCPAIR at SB8/PP1 is being removed from the domain and powered off. Check all hardware for the cause.
    Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [0 216320149848762 NOTICE SysControl.cc 5296] Component PROCPAIR at SB8/PP1 has been blacklisted
    Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1930 216320225206159 NOTICE SysControl.cc 6113] PROCPAIR at SB8/PP1 has been powered off: ecode=0

In the preceding error message, Core3 (CPU3) on SB8 has a low voltage.

NOTE: The voltage tolerances are defined on the SC in the /etc/opt/SUNWSMS/SMS/config/esmd_tuning.txt file. Do not edit this file manually!
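When reviewing the platform messages file, it can help to pull the board, core, voltage reading, and PROCPAIR out of the fault line programmatically. The following Python is a minimal sketch only: the regular expression is hypothetical and was written against the single example message above, so other esmd message variants may not match. The core-to-PROCPAIR mapping (core number divided by two) is an assumption inferred from this example, in which Core3 belongs to PP1 alongside CPU2.

```python
import re

# Example fault line from the messages file, joined onto one line.
SAMPLE = (
    "Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: "
    "[1919 216320102467983 ERR DetectorV.cc609] "
    "A low voltage or power supply has been detected on Core3, "
    "located on CPU at SB8. The voltage detected is 0.02v; "
    "should be 1.31v to 1.47v. PROCPAIR at SB8/PP1 is being removed "
    "from the domain and powered off. Check all hardware for the cause."
)

# Hypothetical pattern: derived from the one sample message only.
FAULT_RE = re.compile(
    r"detected on Core(?P<core>\d+), located on CPU at (?P<board>SB\d+)\. "
    r"The voltage detected is (?P<seen>[\d.]+)v; "
    r"should be (?P<low>[\d.]+)v to (?P<high>[\d.]+)v\. "
    r"PROCPAIR at (?P<procpair>SB\d+/PP\d+)"
)


def parse_voltage_fault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    core = int(m.group("core"))
    seen = float(m.group("seen"))
    low = float(m.group("low"))
    high = float(m.group("high"))
    # Assumption: cores 0-1 form PP0 and cores 2-3 form PP1.
    pp_index = core // 2
    return {
        "board": m.group("board"),
        "core": core,
        "procpair": m.group("procpair"),
        "paired_cpus": [2 * pp_index, 2 * pp_index + 1],
        "voltage": seen,
        "in_range": low <= seen <= high,
    }


if __name__ == "__main__":
    print(parse_voltage_fault(SAMPLE))
```

The `paired_cpus` field lists both CPUs that share the PROCPAIR with the faulting core, which are the CPUs that leave the domain when the PROCPAIR is blacklisted.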
The reported voltage fault on only one CPU (CPU3) results in two CPUs (CPU2 and CPU3) being removed from the domain configuration through blacklisting.

Differing Forces made upon POST:
A voltage fault reported by a single CPU might not be limited to that CPU. The same voltage issue could also be affecting its related components, such as the BBC ASIC, the DCDS ASIC, and so on. Ultimately, the fault could be the result of a power distribution issue, representative of a larger issue on the board. Because of this unknown factor, there are two approaches for dealing with voltage issues on a board: the Conservative Approach and the Aggressive Approach. These two approaches relate directly to the two POST forces described previously:
Arguments can be made for each approach, and neither is incorrect. One Customer might believe that disabling the whole System Board is the best decision; another Customer might believe that it is absolutely unacceptable to lose that many resources. A compromise is necessary to meet the needs of both of these forces. Here is what esmd does as a compromise to meet these different POST forces:
A compromise configures the domain with as minimal a resource impact as possible while also providing as much error isolation as possible.

Internal Comments
A false over-voltage issue existed in SMS 1.2 and 1.3 software: Sun Alert 53625 "CPU0/CPU1 May Be Disabled on Sun Fire 12K/15K System Boards Resulting in Domain Interruption"

Keywords: 12k, 15k, 12K, 15K, esmd, voltage, procpair, PROCPAIR, blacklisted, ASR

References:
Sun Fire[TM] 12K/15K/E20K/E25K: How to bring CPUs back online after they have been blacklisted on domains (Doc ID 1004765.1)
Previously Published As 76240

Attachments
This solution has no attachment