Troubleshooting an Unbootable Netra T5440 After Panic(s) resulting in "ERROR: 1 CPUs in MD did not start"

Asset ID:	1-72-1522934.1
Update Date:	2018-01-08
Solution Type Problem Resolution Sure

Solution 1522934.1 : Troubleshooting an Unbootable Netra T5440 After Panic(s) resulting in "ERROR: 1 CPUs in MD did not start"

Applies to:

Sun Netra T5440 Server - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

On: Netra T5440

The system first panics with messages such as:

panic: failed to stop cpu100
0x64

panic[cpu28]/thread=2a104c3dca0: xt_sync: timeout

Subsequent attempts to boot the Operating System then result in further, recurring panics:

panic[cpu0]/thread=180e000: cpu100 failed to start (2)

BUSINESS IMPACT
-----------------------
Due to this issue, the user cannot successfully reboot the Operating System.

Cause

The issue can be seen and verified in the @persist@hostconsole.log file gathered by the Snapshot utility of the ILOM on this machine:

panic: failed to stop cpu100
0x64

panic[cpu28]/thread=2a104c3dca0: xt_sync: timeout

000002a104c3cfe0 unix:xt_sync+2e8 (2a104c3d148, c, ef0423242cdff, c, 187c600, 60)
%l0-3: 000002a104c3d0c8 000ef0438d9280ff 000ef0423241ac7f 000ef0438d9280af
%l4-7: 00000000010b0000 000002a104c3d128 0000000000000000 00000000010b0120
000002a104c3d1d0 unix:hat_unload_callback+824 (3000c3bc008, 2a104c3d420, 0, 0, 0, 30001a61bc0)
%l0-3: 000003000c446000 0000000000000001 0000000000000001 000003000c4407ff
%l4-7: 000002a104c3d530 ffffffffffffffff ffffffffffffffff 000007007e5cedc0
000002a104c3d590 swrand:physmem_ent_gen+210 (70085038, 700000ef480, 0, 0, 0, 1000)
%l0-3: 0000000000008de9 0000000000000000 000002a104c3d68c 0000000000001fff
%l4-7: 0000000000002000 0000000000001000 0000000011bd2000 000000000000000d
000002a104c3d6f0 swrand:rnd_handler+14 (0, 2a104c3dca0, 0, 0, 70085028, 70085000)
%l0-3: ffffffffffffffff ffffffffffffffff 0000000000000063 000006009f715d48
%l4-7: 0000000000630000 0000000000000001 0000000000000000 000003000c840000
000002a104c3d7a0 genunix:callout_list_expire+5c (60095466fc0, 600954f2c00, 80000000, 0, bfffffffffffffff, 4000000000000000)
%l0-3: 00000300cf92fc40 8000000000000000 000000000187c270 0000000000000008
%l4-7: 0000000000000002 0000000000000010 000006009f64b5c8 000000007bfc3e74
000002a104c3d850 genunix:callout_expire+1c (60095466fc0, 60095467040, 185ed90, 185ed90, 0, 0)
%l0-3: 00000600954f2c00 0000000000000004 0000000000000016 000002a101b51ca0
%l4-7: 00000000158166a3 00000000018f4000 000003000c840178 000000003afed9b5
000002a104c3d900 genunix:callout_execute+c (60095466fc0, 6009f8b7a18, dbab91c, 0, f24f0d00, 0)
%l0-3: 00000000018f5068 000003000c7702e0 0000000000000000 000006009f657338
%l4-7: 000002a1012f9ca0 0000000000000001 000000000182b1e0 000000000182b1d8
000002a104c3d9b0 genunix:taskq_thread+300 (6009f657370, 6009f657308, cd18b6d0235d6, cd18b9e7aa50c, 6009f65733c, 6009f65733a)
%l0-3: 000006009f8b7a18 000006009f657338 0000000000000002 0000000000080000
%l4-7: 000006009f657328 0000000000010000 00000000fffeffff 000006009f657330

syncing file systems... done
dumping to /dev/md/dsk/d10, offset 10309140480, content: kernel

1% done
2% done
3% done
4% done
~SNIP~
99% done
100% done
100% done: 888929 pages dumped, compression ratio 4.29, dump succeeded
rebooting...
Resetting...

ERROR: 1 CPUs in MD did not start

A new boot attempt from the ok prompt then results in a further panic:

{0} ok boot
Boot device: rootdisk File and args:

SunOS Release 5.10 Version Generic_142900-14 64-bit
Copyright 1983-2010 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

~SNIP~

panic[cpu0]/thread=180e000: cpu100 failed to start (2)

000000000180b8b0 unix:start_cpu+140 (64, 10196a0, 1835400, 1000, 8, 1000000000)
%l0-3: 0000000001835630 0000000000000000 0000000fffffffff 00000000010ad400
%l4-7: 0000000fffffffff 0000000000000001 0000000000000024 0000000001835628
000000000180b960 unix:start_other_cpus+1dc (1906400, 1, 0, 18634c8, 185fcd0, 186f748)
%l0-3: 0000000000000064 0000000000000024 00000000010ad800 0000000000000000
%l4-7: 0000000001906400 00000000010ad800 0000000001019400 000000000182b000
000000000180ba10 genunix:main+1e4 (1901800, 2, 185ed90, 18f87e8, 0, 18f4000)
%l0-3: 000000000180c000 0000000001900c00 0000000050ec713e 0000000001900c00
%l4-7: 0000000001866400 0000000000000001 0000000001900c00 0000000001901a68

syncing file systems... done
skipping system dump - no dump device configured
rebooting...
Resetting...

ERROR: 1 CPUs in MD did not start

Netra T5440, No Keyboard
Copyright (c) 1998, 2012, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.33.6, 130848 MB memory available, Serial #1234WXYZ.
Ethernet address 0:00:00:x0:xa:1x, Host ID: 9X9X9X9X9X.

Aborting auto-boot sequence.
{0} ok
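
The panic signatures in the console output above can be searched for mechanically. The following is a minimal sketch, assuming the console log has been extracted from the Snapshot as plain text; the function name scan_console and the dictionary keys are hypothetical and are not part of any Oracle tool. It also shows that the value 0x64 printed after the first panic, and the first argument (64) to unix:start_cpu, are consistent with CPU ID 100.

```python
import re

# Patterns are taken from the console messages quoted in this document.
# The key names are hypothetical labels chosen for this sketch.
PATTERNS = {
    "stop_failed": re.compile(r"panic: failed to stop cpu(\d+)"),
    "start_failed": re.compile(r"cpu(\d+) failed to start"),
    "md_error": re.compile(r"ERROR: (\d+) CPUs in MD did not start"),
}


def scan_console(text):
    """Map each signature name to the list of numbers captured for it."""
    hits = {name: [] for name in PATTERNS}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits[name].append(int(match.group(1)))
    return hits


# Abbreviated sample built from the messages shown above.
sample = """\
panic: failed to stop cpu100
0x64
panic[cpu28]/thread=2a104c3dca0: xt_sync: timeout
ERROR: 1 CPUs in MD did not start
panic[cpu0]/thread=180e000: cpu100 failed to start (2)
ERROR: 1 CPUs in MD did not start
"""

result = scan_console(sample)
print(result)

# 0x64 in hexadecimal is 100 in decimal, matching "cpu100".
cpu_id = int("0x64", 16)
print(cpu_id)
```

If the same CPU ID appears under both the stop and start signatures, as it does here, that matches the pattern described in this document.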

The cause of this issue was determined to be a fault on the Memory Mezzanine Board.

An indication of this hardware component's failure is seen in the spos_logs/@var@log@plhwsvc.log file from the Snapshot:

##### spos_logs/@var@log@plhwsvc.log #####
01/08/13 14:39:35: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP1/MCU2 failed
01/08/13 14:39:35: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP1/MCU3 failed
01/08/13 14:42:38: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP0/MCU2 failed
01/08/13 14:42:38: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP0/MCU3 failed
01/08/13 14:43:38: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP1/MCU2 failed
01/08/13 14:43:38: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP1/MCU3 failed
01/08/13 14:46:40: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP0/MCU2 failed
01/08/13 14:46:40: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP0/MCU3 failed
01/08/13 14:47:40: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP1/MCU2 failed
01/08/13 14:47:40: plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():hwsvc_get_enable_disable_state_1_svc: /SYS/MB/CMP1/MCU3 failed

In the output above, Memory Control Units (MCUs) 2 and 3 are reported as "failed" on both the CMP0 and CMP1 processors.
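
When the log is long, the failing components can be summarized with a short script. This is only a sketch, assuming the line layout shown in the excerpt above (timestamp, source location, component path, then the word "failed"); the function name failed_components is hypothetical.

```python
import re
from collections import Counter

# Matches lines of the form shown in the plhwsvc.log excerpt:
#   <MM/DD/YY HH:MM:SS>: <source info>: /SYS/... failed
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}): .*\s(/SYS/\S+) failed\s*$"
)


def failed_components(text):
    """Count how often each /SYS/... path is reported as failed."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


# Four lines copied from the excerpt above.
prefix = ("plat_hwsvc_rpc_svc.c:2586:hwsvc_get_enable_disable_state_1_svc():"
          "hwsvc_get_enable_disable_state_1_svc:")
sample = "\n".join([
    f"01/08/13 14:39:35: {prefix} /SYS/MB/CMP1/MCU2 failed",
    f"01/08/13 14:39:35: {prefix} /SYS/MB/CMP1/MCU3 failed",
    f"01/08/13 14:42:38: {prefix} /SYS/MB/CMP0/MCU2 failed",
    f"01/08/13 14:43:38: {prefix} /SYS/MB/CMP1/MCU2 failed",
])

counts = failed_components(sample)
for path, n in sorted(counts.items()):
    print(path, n)
```

A result in which only MCU2 and MCU3 appear, under both CMP0 and CMP1, corresponds to the situation described here.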

Turning to the properties output of those components, the following additional details are shown:

/SYS/MB/CMP0/MCU2
Properties:
type = Memory Controller
component_state = (none) <--- should say either "Enabled" or "Disabled"

/SYS/MB/CMP0/MCU3
Properties:
type = Memory Controller
component_state = (none) <---

and

/SYS/MB/CMP1/MCU2
Properties:
type = Memory Controller
component_state = (none) <---

/SYS/MB/CMP1/MCU3
Properties:
type = Memory Controller
component_state = (none) <---

The output of the following ILOM command confirms that all other components are in the "Enabled" state:

-> show -d properties -level all

From this it is deduced that the Memory Control Unit is the point of failure, which prevents the Operating System from rebooting properly.
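
Checking every component_state value by eye is error-prone on a large system. The sketch below assumes the properties output has been saved as text in the layout shown above (a target path line followed by "name = value" lines); the function name abnormal_states is hypothetical, and the MCU1 entry in the sample is made up purely to show a healthy component.

```python
import re

STATE_RE = re.compile(r"component_state\s*=\s*(\S+)")
EXPECTED = {"Enabled", "Disabled"}


def abnormal_states(text):
    """Return (target, state) pairs whose component_state is unexpected."""
    problems = []
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            # A line beginning with "/" names the target being described.
            target = line
            continue
        match = STATE_RE.search(line)
        if match and match.group(1) not in EXPECTED:
            problems.append((target, match.group(1)))
    return problems


sample = """\
/SYS/MB/CMP0/MCU1
Properties:
type = Memory Controller
component_state = Enabled

/SYS/MB/CMP0/MCU2
Properties:
type = Memory Controller
component_state = (none)
"""

problems = abnormal_states(sample)
print(problems)
```

Any target returned by this check, such as a memory controller reporting "(none)", is a candidate for the failure described in this document.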

Solution

If the symptoms above match the issue being experienced, please engage the Oracle SPARC Hardware team by opening a new Service Request, either online via the My Oracle Support (MOS) portal or by calling 1-800-223-1711, option 2.

The following steps were provided for further onsite diagnosis by the field engineer:

Action Plan

The Field Engineer is to perform the following, in order:

1. Remove the memory riser (mezzanine) and boot the server.
a. If the server still does not boot, replace the system board.
b. If the server does boot, investigate the memory riser further by re-seating it (step 2).

2. Carefully re-seat the memory riser (mezzanine).
a. If re-seating the memory riser and further investigation do not correct the boot issue, replace the memory riser due to a faulty Memory Control Unit.

Parts needed on hand to perform this Action Plan:
Qty 1 of: Memory Mezzanine Assembly
Qty 1 of: System Board Assembly
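
The decision flow of the Action Plan above can be written out as a small function. This is an illustrative sketch only: the function name, parameter names, and returned strings are hypothetical, and the outcome where re-seating corrects the issue is an inference, since the Action Plan does not state it explicitly.

```python
def next_action(boots_without_mezzanine, boots_after_reseat=None):
    """Suggest the next step from the observations made so far.

    boots_without_mezzanine: result of step 1 (server booted with the
        memory mezzanine removed).
    boots_after_reseat: result of step 2, or None if not yet attempted.
    """
    if not boots_without_mezzanine:
        # Step 1a: the fault is not isolated to the mezzanine.
        return "replace system board"
    if boots_after_reseat is None:
        # Step 1b leads into step 2.
        return "re-seat memory mezzanine and retest"
    if boots_after_reseat:
        # Inferred outcome, not stated in the Action Plan.
        return "no part replaced; re-seat corrected the issue"
    # Step 2a.
    return "replace memory mezzanine"


# Observations matching the field results reported in this document:
# the server booted without the mezzanine and failed again after re-seating.
action = next_action(boots_without_mezzanine=True, boots_after_reseat=False)
print(action)
```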

Results from the field:

1- Removed the memory mezzanine board.
The system passed POST and subsequently auto-booted.

2- Placed the board back (re-seated).
The server again would not boot.

3- With the mezzanine board removed from the server once more, the system came back up and reached the Solaris root login without issue.

4- Replaced the memory mezzanine board with a new one. The system then powered on and booted to the OS successfully.

Attachments

This solution has no attachment