Asset ID: 1-72-1553271.1
Update Date: 2018-05-25
Solution Type: Problem Resolution Sure

Solution 1553271.1: Sun Storage 7000 Unified Storage System: AKD (Appliance Kit Daemon) fails to restart when a cache device is faulted
Related Items:
- Sun ZFS Storage 7320
- Oracle ZFS Storage ZS3-2
- Sun Storage 7410 Unified Storage System
- Oracle ZFS Storage ZS3-4
- Sun ZFS Storage 7420
- Sun Storage 7310 Unified Storage System
- Oracle ZFS Storage ZS4-4
Related Categories:
- PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
In this Document:
Symptoms
Cause
Solution
References
Applies to:
Sun Storage 7410 Unified Storage System - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7320 - Version All Versions to All Versions [Release All Releases]
Sun Storage 7310 Unified Storage System - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7420 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-2 - Version All Versions to All Versions [Release All Releases]
7000 Appliance OS (Fishworks)
Symptoms
The customer had a failed/faulted readzilla (cache) device:
head1:> maintenance hardware show
chassis-000 head1 faulted Sun Microsystems, Inc. Sun Storage 7310 1056QAB006
cpu-000 CPU 0 ok AMD Six-Core AMD Opteron(tm) Processor 2427 unknown
cpu-001 CPU 1 ok AMD Six-Core AMD Opteron(tm) Processor 2427 unknown
disk-000 HDD 0 faulted STEC ************** MACH8 STM0000D15A4
disk-001 HDD 1 absent - - -
disk-002 HDD 2 ok STEC MACH8 STM0000D0D30
disk-003 HDD 3 ok STEC MACH8 STS000008FCC
disk-004 HDD 4 absent - - -
disk-005 HDD 5 ok STEC MACH8 STM0001186CB
disk-006 HDD 6 ok HITACHI HTE5450SASUN500G BK0F20VEGG3R0B
disk-007 HDD 7 ok HITACHI HTE5450SASUN500G BK0F20VEGEGRRB
........
While the readzilla was in the 'faulted' state, a restart of the system management daemon (AKD) was attempted.
The restart failed and the BUI and CLI became unresponsive.
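For reference, an AKD restart is normally attempted either from the appliance CLI or from the underlying SMF service. A minimal illustration (not the customer's exact commands; the service FMRI is taken from the crash dump below):

head1:> maintenance system restart

or, from a support shell:

# svcadm restart svc:/appliance/kit/akd:default

Note 1543359.1 in the References below explains why restarting AKD can itself impact production data services.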
From the 'debug.sys' log:
Feb 2 00:13:23 head1 sata: [ID 801593 kern.warning] WARNING: /pci@1,0/pci10de,cb84@5:
Feb 2 00:13:23 head1 SATA device detached at port 0
Feb 2 00:13:23 head1 sata: [ID 801593 kern.warning] WARNING: /pci@1,0/pci10de,cb84@5:
Feb 2 00:13:23 head1 SATA device detached at port 0
Feb 2 00:13:23 head1 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci10de,cb84@5/disk@0,0 (sd105):
Feb 2 00:13:23 head1 Command failed to complete...Device is gone
Feb 2 00:13:31 head1 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci10de,cb84@5/disk@0,0 (sd105):
Feb 2 00:13:31 head1 SYNCHRONIZE CACHE command failed (5)
From the alert log:
entry-098 2013-2-2 00:13:39 151ff0f7-15d3-ebd6-e50a-d9c658a09e70 The disk in slot 'HDD 4' has been removed from chassis 'head1'. Major alert
The readzilla (cache) device became 'faulted' because of an nv_sata port issue. See the following document for further details:
1457578.1 Sun Storage 7000 Unified Storage System: When replacing faulted Readzilla SSD and System disks in the head unit the replacement is not recognized
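Before attempting any AKD restart, the outstanding fault can also be confirmed from the CLI; a minimal sketch (output varies by system and is not reproduced here):

head1:> maintenance problems show

Any active problem referencing the cache device should be resolved, or the device replaced, before the management stack is restarted.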
An NMI was generated to collect a system core dump because all remote connections were refused and the customer was unable to log in even via the console.
All data services were working fine.
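For completeness: with the BUI, CLI, ssh, and console all blocked, the dump had to be forced from the service processor. A hedged sketch, assuming an ILOM SP whose firmware exposes the generate_host_nmi property (the exact path can differ between SP types and firmware versions):

-> set /HOST generate_host_nmi=true

A dump device must be configured on the host (as seen in the dump header below) for the resulting panic to be captured.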
core file: /cores/3-6725538660/ftp-2013-02-21/Prod2/ak.d065fba5-349d-4bbe-887c-c4c72049b706/core/vmcore.6
user: xxxxxxxx xxxxxxxx (xxxxxx:123456)
release: 5.11 (64-bit)
version: ak/generic@2011.04.24.4.2,1-1.28
machine: i86pc
node name: DRSHZ-MMVLLR02P_PROD2
system type: i86pc
hostid: 0
dump_conflags: 0x40000 (DUMP_CURPROC) on /dev/zvol/dsk/system/dump(32G)
moddebug: 0x10 (NOAUTOUNLOAD)
dump_uuid: a291d5f4-862e-eff5-d0a0-e2139bb7f29b
time of crash: Tue Feb 12 11:46:57 UTC 2013 (core is 9 days old)
age of system: 55 days 21 hours 35 minutes 39.535847662 seconds
panic CPU: 5 (12 CPUs, 63.9G memory)
panic string: NMI received
sanity checks: settings...
NOTE: /etc/system: module nxge not loaded for 4 "set nxge:..."
vmem...
WARNING: CPU4 has cpu_intr_actv for 14
WARNING: CPU5 has cpu_intr_actv for 9
WARNING: last_swtch[5]: 0x1cc951ad (55 days 21 hours 35 minutes 39.052891969 seconds earlier)
WARNING: CPU9 has cpu_intr_actv for 14
sysent...clock...
WARNING: dtrace:dtrace_state_deadman+0x0 CPU5 cyclic pend: 1409499 (16d7h31m39.000000000s)
WARNING: dtrace:dtrace_state_deadman+0x0 CPU5 cyclic pend: 1409499 (16d7h31m39.000000000s)
WARNING: dtrace:dtrace_state_clean+0x0 CPU5 cyclic pend: 142359355 (16d7h31m38.550261450s)
WARNING: cpu.generic:gcpu_ntv_mca_poll_cyclic+0x0 omni[CPU5] cyclic pend: 140949 (16d7h31m30.000000000s)
misc...
WARNING: 116 severe kstat errors (run "kstat xck")
WARNING: kernelbase 0xffffed8000000000, expected 0xfffffd8000000000, resetting
WARNING: 1 expired realtime (max -1h58m56.955847662s) and 2 expired normal (max -0.015847662s) callouts (36 on expired lists)
done
CAT(vmcore.6/11X)> thread summary
reference clock = panic_lbolt: 0x1cc953e2, panic_hrtime: 0x11289f5d6450ee
66 threads ran since 1 second before current tick (4 user, 62 kernel)
91 threads ran since 1 minute before current tick (12 user, 79 kernel)
2 TS_RUN threads (0 user, 2 kernel)
1 TS_STOPPED threads (0 user, 1 kernel)
107 TS_FREE threads (0 user, 107 kernel)
0 !TS_LOAD (swapped) threads
3933* threads trying to get a mutex (3915 user, 18 kernel)
longest sleeping 16 days 7 hours 30 minutes 37.07 seconds earlier
0 threads trying to get an rwlock
1736 threads waiting for a condition variable (341 user, 1395 kernel)
1 threads sleeping on a semaphore (1 user, 0 kernel)
longest sleeping 16 days 7 hours 30 minutes 38.29 seconds earlier
4 threads sleeping on a user-level sobj (4 user, 0 kernel)
236 threads sleeping on a shuttle (door) (236 user, 0 kernel)
0 threads in biowait()
6* threads in door_get_server() (6 user, 0 kernel)
2 threads in dispatch queues (0 user, 2 kernel)
2* threads in dispq of CPU running idle thread (0 user, 2 kernel)
1* interrupt threads running (0 user, 1 kernel)
1* threads with > 90% CPU (0 user, 1 kernel)
6049 total threads in allthreads list (4501 user, 1548 kernel)
3 thread_reapcnt
0 lwp_reapcnt
6052 nthread
CAT(vmcore.6/11X)> tlist sobj mutex
thread pri pctcpu idle PID wchan command
0xffffff007ec3bc40 100 0.000 14d22h57m45.29s 0 0xffffffffc01d9308 sched
0xffffff007ec7dc40 60 0.000 14d22h56m51.94s 0 0xfffffffffbcfefb8 sched
0xffffff007af32c40 0 0.000 16d7h30m37.07s 0 0xfffffffffbc759b8 sched
0xffffff007bbb1c40 60 0.000 14d22h55m44.15s 0 0xfffffffffbcfefb8 sched
0xfffff6001f937080 59 0.000 8d2h56m48.99s 2572 0xfffffffffbcfefb8 /usr/lib/ak/proftpd/proftpd -p 0 -c /var/run/ak/proftpd.conf
0xfffff6001ef0bc20 59 0.000 16d3h7m53.02s 6420 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff600809e20e0 59 0.000 16d3h8m0.13s 6423 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff600809ebc60 59 0.000 16d3h6m22.82s 6428 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff60074927060 59 0.000 16d3h4m45.33s 6432 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff600749283e0 59 0.000 15d21h7m58.09s 6436 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff60074833000 59 0.000 15d21h6m24.79s 6440 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff6007357a4a0 59 0.000 15d21h4m43.41s 6444 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff600809d5080 59 0.000 15d15h7m58.77s 6448 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff6006c3f7c20 59 0.000 15d9h7m59.20s 6452 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff6005b077080 59 0.000 15d9h6m21.87s 6456 0xffffffffc01d9308 /usr/lib/ssh/sshd
0xfffff600809bc140 59 0.000 14d22h57m44.70s 2791 0xfffffffffbcfefb8 /usr/lib/smbsrv/smbd start
0xfffff600809e3ba0 59 0.000 14d22h57m43.94s 2791 0xfffffffffbcfefb8 /usr/lib/smbsrv/smbd start
0xfffff600809f57c0 59 0.000 14d22h57m43.17s 2791 0xfffffffffbcfefb8 /usr/lib/smbsrv/smbd start
0xfffff600809e3460 59 0.000 14d22h57m42.49s 2791 0xfffffffffbcfefb8 /usr/lib/smbsrv/smbd start
3933 threads with that sobj found.
top mutex/rwlock owners:
count thread
3916 0xfffff600809eec00 state: slp wchan: 0xfffff6001bfa99e0 sobj: semaphore (from genunix:cyclic_remove_here+0x7c)
CAT(vmcore.6/11X)> thread 0xfffff600809eec00
==== user (LWP_SYS) thread: 0xfffff600809eec00 PID: 27240 ====
cmd: /usr/lib/ak/akd
fmri: svc:/appliance/kit/akd:default
t_wchan: 0xfffff6001bfa99e0 sobj: semaphore (from genunix:cyclic_remove_here+0x7c)
t_procp: 0xfffff6005c6f1000
p_as: 0xfffff600bec9e740 size: 1198686208 RSS: 926564352
hat: 0xfffff600bfc2bca8
cpuset:
zone: global
t_stk: 0xffffff007cb4ff10 sp: 0xffffff007cb4fa50 t_stkbase: 0xffffff007cb4b000
t_pri: 100(RT) t_tid: 2 pctcpu: 0.000000
t_lwp: 0xfffff6001b1aa600 lwp_regs: 0xffffff007cb4ff10
mstate: LMS_SLEEP ms_prev: LMS_SYSTEM
ms_state_start: 31 days 4 hours 36 minutes 27.543630756 seconds later
ms_start: 5 days 16 hours 34 minutes 23.201874483 seconds later
psrset: 0 last CPU: 10
idle: 140943829 ticks (16d7h30m38.29s)
start: Tue Jan 15 14:05:26 2013
age: 2410891 seconds (27 days 21 hours 41 minutes 31 seconds)
syscall: #6 close(, 0x0) (sysent: genunix:close+0x0)
tstate: TS_SLEEP - awaiting an event
tflg:tpflg: TP_TWAIT - wait to be freed by lwp_wait
TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting
pc: unix:_resume_from_idle+0xf4 resume_return: addq $0x8,%rsp
unix:_resume_from_idle+0xf4 resume_return()
unix:swtch+0x150()
genunix:sema_p+0x1d9(0xfffff6001bfa99e0)
genunix:cyclic_remove_here+0x7c(0xfffff6001bfa9940, 0x9, 0x0, 0x0)
genunix:cyclic_remove+0x34(0xfffff6002d8ce208)
dtrace:dtrace_state_destroy+0x19d(0xfffff6005b8812c0)
dtrace:dtrace_close+0x69(0x9c0000000d, 0x3, 0x2, 0xfffff6005984a250)
genunix:dev_close+0x52(0x9c0000000d, 0x3, 0x2, 0xfffff6005984a250)
specfs:device_close+0xa2(0xfffff60122d0ce00, 0x3, 0xfffff6005984a250)
specfs:spec_close+0x163(0xfffff60122d0ce00, 0x3, 0x1, 0x0, 0xfffff6005984a250, 0x0)
genunix:fop_close+0x71(0xfffff60122d0ce00, 0x3, 0x1, 0x0, 0xfffff6005984a250, 0x0)
genunix:closef+0x5f(0xfffff6005b214078)
genunix:closeandsetf+0x4f5(0x2f8, 0x0)
genunix:close+0x18(0x2f8)
unix:_syscall32_save+0xbd()
-- switch to user thread's user stack --
## CPU5 was the panic CPU. Although we see cyclics pending on multiple CPUs, most are due to CPU5. Almost all of the missed callouts were on CPU5.
CAT(vmcore.6/11X)> callout
CPU RHLEX XID expiration function(arg)
=== ===== ================== ====================== =============
...
5 R LE 0x26cd6d9ea -16d7h31m38.535847662s genunix:cv_wakeup(0xffffff007ac68c40)
5 R LE 0x26cd6da8a -16d7h31m37.781814632s genunix:cv_wakeup(0xfffff60020d16ba0)
5 R LE 0x26cd6e7ca -16d7h31m36.546392825s genunix:cv_wakeup(0xfffff6001cc7a140)
5 R LE 0x26cd7062a -16d7h30m30.884277996s genunix:cv_wakeup(0xfffff60020e4f760)
5 R LE 0x26cd9236a -16d7h29m34.465847662s genunix:sigalarm2proc(0xfffff6001f88c028)
5 R LE 0x26cf5598a -16d7h31m27.550471020s genunix:cv_wakeup(0xfffff6007483d0c0)
5 R LE 0x26cd6d9ca -16d7h30m49.345847662s genunix:cv_wakeup(0xffffff007b8d2c40)
5 R LE 0x26cd6d9aa -16d7h29m17.255847662s genunix:cv_wakeup(*sata(data):sata_event_thread)
5 R LE 0x26cd6d98a -16d7h28m2.385847662s genunix:cv_wakeup(0xffffff007a268c40)
5 R LE 0x26cd6d96a -16d7h18m37.795804362s genunix:cv_wakeup(0xfffff60020e654a0)
5 R LE 0x26cd6d94a -16d7h1m42.315847662s genunix:cv_wakeup(*genunix(bss):seg_pasync_thr)
5 R LE 0x26cd6d92a -16d6h37m56.545780842s genunix:cv_wakeup(0xfffff60020d16800)
5 R LE 0x26cee280a -15d11h46m26.284109738s genunix:cv_wakeup(0xfffff60020eb84e0)
5 R LE 0x26cd6d8ca -16d3h12m51.395847662s genunix:cv_wakeup(0xffffff007bdd3c40)
5 R LE 0x26cd6d86a -15d7h21m35.385693889s genunix:cv_wakeup(0xfffff600809aec20)
5 R LE 0x26cd6d84a -15d1h5m4.155847662s genunix:cv_wakeup(0xffffff007a161c40)
5 R LE 0x26cd6d82a -14d22h54m21.933744755s genunix:cv_wakeup(0xfffff60020dfab00)
5 R LE 0x26cd6d7ca -14d22h35m30.665847662s genunix:cv_wakeup(0xffffff007af89c40)
5 R LE 0x26cd6d68a -14d22h4m53.223305257s genunix:cv_wakeup(0xfffff60020e397c0)
5 R LE 0x26cd6d06a -14d17h47m26.115518871s genunix:cv_wakeup(0xfffff6001ea3e060)
5 R LE 0x26cd6c94a -14d14h22m26.225629459s genunix:cv_wakeup(0xfffff60020ebb120)
5 R LE 0x26cd69cca -13d15h15m18.521775050s genunix:cv_wakeup(0xfffff60020d5a3e0)
5 R LE 0x26cd69a4a -13d14h25m12.938116677s genunix:cv_wakeup(0xfffff6001e9ec7e0)
5 R LE 0x26cd674aa -12d13h43m25.987113322s genunix:cv_wakeup(0xfffff6001e5bd120)
5 R LE 0x26cd66c4a -12d8h54m44.755581996s genunix:cv_wakeup(0xfffff6001e9e90c0)
5 R LE 0x26cd66a6a -12d7h45m15.846885021s genunix:cv_wakeup(0xfffff60020e4f020)
5 R LE 0x26cd646ca -11d8h19m17.495847662s genunix:cv_wakeup(*unix(data):pc_thread_id)
5 R LE 0x26cd63f2a -11d3h57m46.757154017s genunix:cv_wakeup(0xfffff600809d4440)
5 R LE 0x26cd636ea -10d22h29m44.072230299s genunix:cv_wakeup(0xfffff60020d5a780)
5 R LE 0x26cd6356a -10d21h28m31.621470706s genunix:cv_wakeup(0xfffff600809cf120)
5 R LE 0x26cd5f42a -9d11h35m28.198968011s genunix:cv_wakeup(0xfffff60020d63180)
5 R LE 0x26cd5a3ca -7d16h7m27.818904987s genunix:cv_wakeup(0xfffff60074927400)
5 R LE 0x26cd5946a -7d7h59m45.669068604s genunix:cv_wakeup(0xfffff60020d5a040)
5 R LE 0x26cd5070a -3d21h22m39.976263873s genunix:cv_wakeup(0xfffff60020d70be0)
5 R LE 0x26cd4f56a -3d13h3m9.742467000s genunix:cv_wakeup(0xfffff600809ad500)
5 R LE 0x26cd4996a -1d10h4m46.835749190s genunix:cv_wakeup(0xfffff600809eb520)
5 R E 0x26cd4598a -18m59.414941164s genunix:cv_wakeup(0xfffff60020db7120)
CAT(vmcore.6/11X)> cpu 5
CPU thread pri PID cmd
5 @ 0xfffff6001bf1b080 P 0xffffff007b2e0c40 P 168 0 sched (PIL9 interrupt)
## Before being interrupted for the panic, it appears to have been idle.
CAT(vmcore.6/11X)> thread 0xffffff007b2e0c40
==== panic interrupt thread: 0xffffff007b2e0c40 PID: 0 on CPU: 5 affinity
CPU: 5 (last_swtch: 55 days 21 hours 35 minutes 39.052891969 seconds earlier)
PIL: 9 ====
cmd: sched
t_procp: 0xfffffffffbc2e270(proc_sched)
p_as: 0xfffffffffbc301a0(kas)
zone: global
t_stk: 0xffffff007b2e0c30 sp: 0xffffff007b2e0910 t_stkbase: 0xffffff007b2dc000
t_pri: 168(SYS) pctcpu: 0.000000
t_lwp: 0x0 psrset: 0 last CPU: 5
idle: 565 ticks (5.65s)
start: Tue Dec 18 14:16:31 2012
age: 4829426 seconds (55 days 21 hours 30 minutes 26 seconds)
stime: 2086 (55 days 21 hours 32 minutes 21.72 seconds earlier)
tstate: TS_ONPROC - thread is being run on a processor
tflg: T_INTR_THREAD - thread is an interrupt thread
T_TALLOCSTK - thread structure allocated from stk
T_PANIC - thread initiated a system panic
tpflg: none set
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SSYS - system resident process
pc: unix:vpanic_common+0x13a: addq $0xf0,%rsp
startpc: unix:thread_create_intr+0x0: pushq %rbp
-- on interrupt thread's stack --
unix:vpanic_common+0x13a()
unix:panic+0x94(, , ...)
pcplusmp:apic_nmi_intr+0x7c(0x0, 0xffffff007b2e0a60)
unix:av_dispatch_nmivect+0x30(0xffffff007b2e0a60)
unix:nmiint+0x152()
unix:ddi_getl+0x0()
unix:av_dispatch_autovect+0x7c(0x14)
unix:dispatch_hardint+0x33(0x14, 0x0)
-- switch to CPU5 idle thread stack --
unix:switch_sp_and_call+0x13()
unix:do_interrupt+0x10d(0xfffff6001bf1b080, 0xfffff6001be2d740)
0xfffff6001bededf0()
-- end of interrupt thread's stack --
> 0xfffff6001bf1b080::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
5 fffff6001bf1b080 1b 0 9 168 no no t-565 ffffff007b2e0c40 sched
| |
RUNNING <--+ +--> PIL THREAD
READY 9 ffffff007b2e0c40
EXISTS
ENABLE
1 thread: 0xffffff007ec3bc40
unix:_resume_from_idle+0xf4 resume_return
genunix:turnstile_block+0x760
unix:mutex_vector_enter+0x261
dtrace:dtrace_probe_create+0x57
fbt:fbt_provide_module+0x259
dtrace:dtrace_module_loaded+0x4a
genunix:mod_load+0x1e7
genunix:mod_hold_stub+0xe8
unix:stubs_common_code+0x1f
usba:hubd_hotplug_thread+0x4d
genunix:taskq_d_thread+0xb1
unix:thread_start+0x8
CAT(vmcore.6/11X)> thread 0xffffff007ec3bc40
==== kernel thread: 0xffffff007ec3bc40 PID: 0 ====
cmd: sched
t_wchan: 0xffffffffc01d9308 sobj: mutex
t_procp: 0xfffffffffbc2e270(proc_sched)
p_as: 0xfffffffffbc301a0(kas)
zone: global
t_stk: 0xffffff007ec3bc40 sp: 0xffffff007ec3b760 t_stkbase: 0xffffff007ec37000
t_pri: 60(SYS) t_epri: 100 pctcpu: 0.000000
t_lwp: 0x0 psrset: 0 last CPU: 9
idle: 129226529 ticks (14d22h57m45.29s)
start: Mon Jan 28 12:42:35 2013
age: 1292662 seconds (14 days 23 hours 4 minutes 22 seconds)
stime: 353687387 (14 days 23 hours 4 minutes 48.71 seconds earlier)
tstate: TS_SLEEP - awaiting an event
tflg: T_TALLOCSTK - thread structure allocated from stk
tpflg: none set
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
TS_SIGNALLED - thread was awakened by cv_signal()
pflag: SSYS - system resident process
pc: unix:_resume_from_idle+0xf4 resume_return: addq $0x8,%rsp
startpc: genunix:taskq_d_thread+0x0: pushq %rbp
-- on kernel thread's stack --
unix:_resume_from_idle+0xf4 resume_return()
unix:swtch+0x150()
genunix:turnstile_block+0x760(0xfffff60123a014c0, 0x0, 0xffffffffc01d9308, 0xfffffffffbc07e48, 0x0, 0x0)
unix:mutex_vector_enter+0x261(0xffffffffc01d9308)
dtrace:dtrace_probe_create+0x57(0xfffff6001e3a2be8, 0xfffff6001bad73a0, 0xfffff6001beacfb2, 0xfffffffff85d50a8, 0x3, 0xfffff60158fa4b40)
fbt:fbt_provide_module+0x259(0x0, 0xfffff6001b16c650)
dtrace:dtrace_module_loaded+0x4a(0xfffff6001b16c650)
genunix:mod_load+0x1e7(0xfffff6001b16c650, 0x1)
genunix:mod_hold_stub+0xe8(0xfffffffffbc0c560)
unix:stubs_common_code+0x1f()
usba:hubd_hotplug_thread+0x4d(0xfffff600ff5bdd30)
genunix:taskq_d_thread+0xb1(0xfffff6f2d0823920)
unix:thread_start+0x8()
-- end of kernel thread's stack --
CAT(vmcore.6/11X)> mutex 0xffffffffc01d9308 <-- dtrace_lock
adaptive mutex: owner: 0xfffff600809eec00 waiters: true
## ttymon was also waiting for this mutex, which explains why the customer was unable to log in via the console.
1 thread: 0xfffff60020e0fb40
unix:_resume_from_idle+0xf4 resume_return
genunix:turnstile_block+0x760
unix:mutex_vector_enter+0x261
dtrace:dtrace_helpers_destroy+0x31
genunix:exec_args+0x1e1
elfexec:elf32exec+0x718
genunix:gexec+0x6d7
genunix:exec_common+0x4e8
genunix:exece+0x1f
unix:_syscall32_save+0xbd
CAT(vmcore.6/11X)> thread 0xfffff60020e0fb40
==== user (LWP_SYS) thread: 0xfffff60020e0fb40 PID: 10212 ====
cmd: /usr/lib/saf/ttymon -g -d /dev/console -l console -T ansi -m ldterm,ttcompat -h
fmri: svc:/system/console-login:default
t_wchan: 0xffffffffc01d9308 sobj: mutex
t_procp: 0xfffff6005a215058
p_as: 0xfffff600c06a6ae8 size: 3084288 RSS: 2252800
hat: 0xfffff600bfc2b3b8
cpuset:
zone: global
t_stk: 0xffffff007c9bbf10 sp: 0xffffff007c9bb630 t_stkbase: 0xffffff007c9b7000
t_pri: 60(TS) t_tid: 1 pctcpu: 0.000000
t_lwp: 0xfffff600750f0f00 lwp_regs: 0xffffff007c9bbf10
mstate: LMS_SLEEP ms_prev: LMS_SYSTEM
ms_state_start: 33 days 23 hours 58 minutes 11.875644128 seconds later
ms_start: 3 days 10 hours 45 minutes 45.678765140 seconds earlier
psrset: 0 last CPU: 0
idle: 129920999 ticks (15d53m29.99s)
start: Fri Jan 11 10:23:43 2013
age: 2769794 seconds (32 days 1 hours 23 minutes 14 seconds)
stime: 353035259 (15 days 53 minutes 29.99 seconds earlier)
syscall: #59 execve(, 0x0) (sysent: genunix:exece+0x0)
tstate: TS_SLEEP - awaiting an event
tflg:tpflg: TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting
pc: unix:_resume_from_idle+0xf4 resume_return: addq $0x8,%rsp
-- on user (LWP_SYS) thread's stack --
unix:_resume_from_idle+0xf4 resume_return()
unix:swtch+0x150()
genunix:turnstile_block+0x760(0xfffff60123a014c0, 0x0, 0xffffffffc01d9308, 0xfffffffffbc07e48, 0x0, 0x0)
unix:mutex_vector_enter+0x261(0xffffffffc01d9308)
dtrace:dtrace_helpers_destroy+0x31()
genunix:exec_args+0x1e1(0xffffff007c9bbe20, 0xffffff007c9bbd30, 0x0, 0xffffff007c9bb928)
elfexec:elf32exec+0x718()
genunix:gexec+0x6d7(0xffffff007c9bbca8, 0xffffff007c9bbe20, 0xffffff007c9bbd30, 0x0, 0x0, 0xffffff007c9bbcb8, 0xffffff007c9bbd10, 0xfffff6005c3bdde8, 0x0)
genunix:exec_common+0x4e8(0x80604d8, 0x8047bd0, 0x8067fb4, 0x0)
genunix:exece+0x1f(0x80604d8, 0x8047bd0, 0x8067fb4)
unix:_syscall32_save+0xbd()
-- switch to user thread's user stack --
## It looks like some sort of scheduling issue. They already have the workaround for the Intel CPU erratum.
set idle_cpu_prefer_mwait=0x0
set idle_cpu_no_deep_c=0x1
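Whether those tunables are actually in effect can be verified in the running kernel from a support shell; a minimal sketch assuming mdb access (expected values shown, matching the /etc/system lines above):

# echo "idle_cpu_prefer_mwait/D" | mdb -k
idle_cpu_prefer_mwait:
idle_cpu_prefer_mwait:          0
# echo "idle_cpu_no_deep_c/D" | mdb -k
idle_cpu_no_deep_c:
idle_cpu_no_deep_c:             1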
> ::interrupts
IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# ISR(s)
4 0xb0 12 ISA Edg Fixed 11 1 0x0/0x4 asyintr
9 0x80 9 PCI Lvl Fixed 1 1 0x0/0x9 acpi_wrapper_isr
16 0xb1 12 PCI Lvl Fixed 1 1 0x0/0x10 clustron_v1_intr
17 0x64 6 PCI Lvl Fixed 10 1 0x0/0x11 e1000g_intr
20 0x84 9 PCI Lvl Fixed 5 2 0x0/0x14 ohci_intr, mcp5x_intr
21 0x60 6 PCI Lvl Fixed 6 2 0x0/0x15 nge_chip_intr, mcp5x_intr
22 0x62 6 PCI Lvl Fixed 7 2 0x0/0x16 nge_chip_intr, mcp5x_intr
23 0x81 9 PCI Lvl Fixed 0 1 0x0/0x17 ehci_intr
44 0x63 6 PCI Lvl Fixed 8 2 0x1/0x14 nge_chip_intr, mcp5x_intr
45 0x45 5 PCI Lvl Fixed 9 1 0x1/0x15 mcp5x_intr
46 0x46 5 PCI Lvl Fixed 10 1 0x1/0x16 mcp5x_intr
47 0x61 6 PCI Lvl Fixed 2 1 0x1/0x17 nge_chip_intr
48 0x82 7 PCI Edg MSI 2 1 - pcieb_intr_handler
49 0x83 7 PCI Edg MSI 2 1 - pcieb_intr_handler
50 0x30 4 PCI Edg MSI 3 1 - pcieb_intr_handler
51 0x31 4 PCI Edg MSI 3 1 - pcieb_intr_handler
52 0x40 5 PCI Edg MSI 4 1 - mpt_intr
160 0xa0 0 Edg IPI all 0 - poke_cpu
208 0xd0 14 Edg IPI all 1 - kcpc_hw_overflow_intr
209 0xd1 14 Edg IPI all 1 - cbe_fire
210 0xd3 14 Edg IPI all 1 - cbe_fire
240 0xe0 15 Edg IPI all 1 - xc_serv
241 0xe1 15 Edg IPI all 1 - apic_error_intr
## The clock thread doesn't show up in the dump. It's possible that the delay happened in the recent past but that
## things were progressing at the precise moment the dump was captured. I can't see what was holding the clock up.
## The 7310 is running software version ak-2011.04.24.4.2 from 09/11/2012.
## The latest is ak-2011.04.24.5.0 from 07/01/2013. I would suggest proactively updating to the latest firmware.
==================
NOTE: It looks like the customer is hitting Bug 15818307.
The IDR 2011.04.24.5.0,1-2.33.19.1 contains a fix for this bug and the official fix should be available in the next release.
Closing this bug as a duplicate of Bug 15818307.
Cause
The AKD hang appears to be a consequence of the nv_sata port issue:
Bug 15818307 (nv_sata port held in reset cannot replace cache disk on 7410 running 2011.1.4)
Solution
For immediate relief, reboot the affected head.
For a permanent resolution, upgrade to the latest appliance software release, which provides fixes for all known 'nv_sata' port issues.
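The installed release and any available updates can be checked from the CLI before scheduling the upgrade; an illustrative session (release strings will vary):

head1:> configuration version show
head1:> maintenance system updates show

On clustered heads, follow the supported rolling-upgrade procedure in the appliance documentation.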
***Checked for relevance on 25-MAY-2018***
References
<NOTE:1401282.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI hang)
<NOTE:1543359.1> - Sun Storage 7000 Unified Storage System: Restarting the Appliance Kit Management Daemon (AKD) may impact production data services
<NOTE:1457578.1> - Sun Storage 7000 Unified Storage System: When Replacing Faulted Readzilla SSD and/or System Disks in the head unit the replacement is not recognized
<NOTE:1504807.1> - Sun Storage 7000 Unified Storage System: Failing or moving readzilla SSD devices can lead to a panic
<NOTE:1506500.1> - Sun Storage 7000 Unified Storage System: Alerts from readzilla cannot be cleared - even after device replacement or reboot.
Attachments
This solution has no attachment