Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2001666.1
Update Date: 2018-03-05
Keywords:

Solution Type  Problem Resolution Sure

Solution  2001666.1 :   NETRA 1290 FATAL: PROM_PANIC[0x0]: ECC Error: Uncorrectable Error  


Related Items
  • Sun Netra 1290 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-x8x0/Ex900




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-10595719262>

Applies to:

Sun Netra 1290 Server - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Symptoms

The system may fail to boot and report a "PROM_PANIC" error signature in the following scenario:

  1. The Netra 1290 powers on successfully (e.g. after a planned maintenance window).
  2. At the time the issue is observed, complete POST output is either missing, or POST was run and all system resources passed. The domain needs to be up and running, so a time-consuming maximum-level POST of the entire platform is not an option.
  3. On the LOM, showerrorbuffer may report no errors, and showboards output shows that all resources passed POST and are active.
  4. The host does not boot properly, even if "auto-boot?" is set to "true".
  5. The following (abbreviated) log output, or similar, appears on the console in a loop:

{0} ok boot
|
ASI VADR         A S I   R E G I S T E R   N A M E             D A T A
--- ----  ----------------------------------------------  -----------------
06  00   Floating Point Status Register......[ASI_FPRS]  00000000.00000000
10  00   Performance Control..................[ASI_PCR]  00000000.00000000
11  00   Performance Instrumentation..........[ASI_PIC]  00000000.00000000

[  snip  ]

FATAL: PROM_PANIC[0x0]: ECC Error: Uncorrectable Error
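
When the console output has been captured to a text file, the panic signature shown above can be searched for mechanically. The following is a minimal sketch only, not an Oracle tool: the function names, the sample text and the loop threshold of 2 are assumptions made for illustration, and only the FATAL line itself is taken from the output above.

```python
import re

# Signature copied from the console output above; the bracketed code may differ.
PANIC_RE = re.compile(r"FATAL: PROM_PANIC\[(0x[0-9a-fA-F]+)\]: (.+)")


def find_prom_panics(console_text):
    """Return one (code, message) tuple per PROM_PANIC line in the capture."""
    hits = []
    for line in console_text.splitlines():
        match = PANIC_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2).strip()))
    return hits


def is_boot_loop(console_text, threshold=2):
    """Treat repeated panics in one capture as a loop (threshold is an assumption)."""
    return len(find_prom_panics(console_text)) >= threshold


# Illustrative capture: the same FATAL line seen twice, as in a boot loop.
sample = (
    "FATAL: PROM_PANIC[0x0]: ECC Error: Uncorrectable Error\n"
    "[  snip  ]\n"
    "FATAL: PROM_PANIC[0x0]: ECC Error: Uncorrectable Error\n"
)

print(find_prom_panics(sample))
print(is_boot_loop(sample))
```

A capture with only one occurrence is not reported as a loop by this sketch; adjust the threshold to match how much console history was saved.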

Changes

This behavior has been seen after an SB0 replacement, but may also occur in other circumstances.

Cause

The PROM_PANIC indicates an ECC Uncorrectable Error (UE), which points to a hardware fault.
SB0 and its subcomponents are the prime suspects, because the only components active at that stage are the boot processor and its associated memory banks.

Solution

Tasks for resolution

  1. Start the system with the best possible configuration as quickly as possible by isolating the suspect resources (in a multi-system-board configuration).
  2. Determine the hardware cause afterwards.
  3. Resolve the issue by replacing the suspect component(s), then verify by retesting the system.

1. Action to bring up the system:

Disable the suspect SB0 using "setls", then power the host off and on again to bring it up without that board and verify that the problem resides on it.
If the host comes up in this degraded mode without issues, the fault is isolated to SB0, which has been the case in all occurrences so far.

lom> setls -s disable -l sb0
lom> poweroff
(wait a few minutes)
lom> poweron

2. Find the root cause:

Once the system is up again, look for confirmation of the faulty hardware by running "testboard" on that system board. "testboard" may report one or more bad DIMMs in the memory associated with the lowest-numbered CPU, SB0/P0, which is also the processor that reported the PANIC [0x0].

lom> testboard sb0

Details on "testboard" usage can be found in the referenced diagnostic document 1009098.1.

3. Resolution and system verification:

The final solution is to verify the identified hardware and replace it. The findings may be similar to N0/SB0/P0/Bx/Dx.
In the observed cases, the root cause of the UE was found mostly in memory and was finally resolved by replacing the faulty DIMM.
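
To record such findings consistently, a component path of the form N0/SB0/P0/Bx/Dx can be split into its parts. This is a minimal sketch under stated assumptions: the field names (node, board, processor, bank, dimm) are an interpretation of the path letters, the function name is hypothetical, and the bank and DIMM numbers in the example are made up, not taken from a real "testboard" run.

```python
import re

# Assumed layout: N<node>/SB<board>/P<processor>[/B<bank>[/D<dimm>]]
PATH_RE = re.compile(r"^/?N(\d+)/SB(\d+)/P(\d+)(?:/B(\d+)(?:/D(\d+))?)?$")

FIELD_NAMES = ("node", "board", "processor", "bank", "dimm")


def parse_component(path):
    """Split a component path into named integer fields; omit absent levels."""
    match = PATH_RE.match(path.strip())
    if match is None:
        raise ValueError("unrecognized component path: %r" % path)
    return {
        name: int(value)
        for name, value in zip(FIELD_NAMES, match.groups())
        if value is not None
    }


# Hypothetical finding: bank 1, DIMM 2 behind the boot processor on SB0.
print(parse_component("N0/SB0/P0/B1/D2"))
# A path that stops at the processor level is accepted as well.
print(parse_component("N0/SB0/P0"))
```

Paths that do not follow the assumed layout raise ValueError rather than being guessed at.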

Other errors, such as CPU or memory controller faults, are also possible, but have not been seen so far.
Disabling SB0 or the affected subcomponents (P0 or P0/Bx) may serve as an interim solution.

After hardware has been replaced, the system board needs to be retested.

The PANIC is best explained by SB0/P0 being the boot processor: the lowest-numbered CPU is always the boot processor in Serengeti systems, and it therefore has control during POST, OBP and boot. The faulty DIMM(s) in the associated memory are also controlled by this processor, SB0/P0, because on this platform the memory controller for the associated memory resides on the processor.




References

<NOTE:1009098.1> - Sun[TM] Fire 3800/4800/4810/6800/E2900/E4900/E6900/V1280 and Netra[TM] 1280/1290 Server: Using testboard to run extended POST diagnostics [Video]

  Copyright © 2018 Oracle, Inc.  All rights reserved.