Information in this document applies to any platform.
Date of Resolved Release: 25-Nov-2014
_______________________________________
Description
T4 series servers installed with Sun System Firmware 8.4.0.a through 8.5.1.b may, in rare cases, exhibit the following symptoms:
- Excessive correctable DIMM events leading to excessive page retires and DIMMs being incorrectly faulted
- A 'send_mondo' or unrecoverable hardware error system panic (system outage)
- A system redstate (system outage)
- A system Hypervisor abort (system outage)
Occurrence
The issue has been observed in very rare instances when T4 system firmware is upgraded to a version from 8.4.0.a through 8.5.1.b. To date, all confirmed occurrences have been limited to the SPARC T4-4 Server series and the SPARC Supercluster T4-4, but the other products listed in this alert are also potentially affected.
Systems that hit this issue usually experience memory errors within two days of upgrading to the affected firmware, but this is not always the case. If a system has been stable for a period of time on system firmware 8.4.0 or later, the likelihood of hitting this problem is quite low. However, the resolution referenced below is still recommended as a risk avoidance measure.
The full list of affected patches is as follows:
SPARC T4-1 Server
- Patch 150676-01 through 150676-06: SPARC T4-1 Sun System Firmware 8.4.0.a, 8.4.0.b, 8.4.0.c, 8.4.1.a, 8.4.2.c, 8.4.2.d
- Patch 151295-01 through 151295-02: SPARC T4-1 Sun System Firmware 8.5.0.a, 8.5.1.b
SPARC T4-1B Server Module
- Patch 150679-01 through 150679-04: SPARC T4-1B Sun System Firmware 8.4.0.a, 8.4.0.c, 8.4.1.a, 8.4.2.c
- Patch 151298-01 through 151298-02: SPARC T4-1B Sun System Firmware 8.5.0.a, 8.5.1.b
SPARC T4-2 Server
- Patch 150677-01 through 150677-05: SPARC T4-2 Sun System Firmware 8.4.0.a, 8.4.0.b, 8.4.0.c, 8.4.1.a, 8.4.2.c
- Patch 151296-01 through 151296-02: SPARC T4-2 Sun System Firmware 8.5.0.a, 8.5.1.b
SPARC T4-4 Server
- Patch 150678-01 through 150678-05: SPARC T4-4 Sun System Firmware 8.4.0.a, 8.4.0.c, 8.4.1.a, 8.4.2.c, 8.4.2.d
- Patch 151297-01 through 151297-02: SPARC T4-4 Sun System Firmware 8.5.0.a, 8.5.1.b
SPARC Supercluster T4-4
- Patch 18163942: QUARTERLY FULL STACK DOWNLOAD PATCH FOR SUPERCLUSTER (JAN 2014 - 11.2 AND 12.1)
- Patch 18517092: QUARTERLY FULL STACK DOWNLOAD PATCH FOR SUPERCLUSTER (APR 2014 - 11.2 AND 12.1)
- Patch 18965131: QUARTERLY FULL STACK DOWNLOAD PATCH FOR SUPERCLUSTER (JUL 2014 - 11.2 AND 12.1)
- Patch 19621160: QUARTERLY FULL STACK DOWNLOAD PATCH FOR SUPERCLUSTER (OCT 2014 - 11.2 and 12.1)
StorageTek VSM 6
- Contact Oracle technical support
Netra SPARC T4-1 Server
- Patch 150680-01 through 150680-06: Netra SPARC T4-1 Sun System Firmware 8.4.0.a, 8.4.0.b, 8.4.0.c, 8.4.1.a, 8.4.2.c, 8.4.2.d
- Patch 151299-01 through 151299-02: Netra SPARC T4-1 Sun System Firmware 8.5.0.a, 8.5.1.b
Netra SPARC T4-1B Server Module
- Patch 150682-01 through 150682-04: Netra SPARC T4-1B Sun System Firmware 8.4.0.a, 8.4.0.c, 8.4.1.a, 8.4.2.c
- Patch 151301-01 through 151301-02: Netra SPARC T4-1B Sun System Firmware 8.5.0.a, 8.5.1.b
Netra SPARC T4-2 Server
- Patch 150681-01 through 150681-05: Netra SPARC T4-2 Sun System Firmware 8.4.0.a, 8.4.0.b, 8.4.0.c, 8.4.1.a, 8.4.2.c
- Patch 151300-01 through 151300-02: Netra SPARC T4-2 Sun System Firmware 8.5.0.a, 8.5.1.b
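The affected range stated above (Sun System Firmware 8.4.0.a through 8.5.1.b) can be checked mechanically. The following Python sketch is illustrative only: the function names and the tuple encoding of the version string are assumptions made for this example, not part of any Oracle tool, and it assumes the version has already been obtained as a string of the form major.minor.micro.letter.

```python
import re

# Range bounds are taken from this alert; the tuple encoding is an assumption.
AFFECTED_MIN = (8, 4, 0, "a")
AFFECTED_MAX = (8, 5, 1, "b")

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\.([a-z])$")


def parse_firmware_version(text):
    """Turn a string such as '8.4.2.c' into the comparable tuple (8, 4, 2, 'c')."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized firmware version: {text!r}")
    major, minor, micro, letter = match.groups()
    return (int(major), int(minor), int(micro), letter)


def is_affected(text):
    """Return True when the version lies in the affected range, bounds included."""
    return AFFECTED_MIN <= parse_firmware_version(text) <= AFFECTED_MAX


if __name__ == "__main__":
    # All four versions below appear in this alert.
    for version in ("8.4.0.a", "8.4.2.d", "8.5.1.b", "8.6.0.b"):
        print(version, "affected" if is_affected(version) else "not affected")
```

A version inside the range only indicates exposure to the issue. As noted under Occurrence, most systems on affected firmware never show symptoms.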
Symptoms
A system may experience symptoms similar to the following:
1. Unrecoverable Hardware Error system panic preceded by L2$ memory errors
FMA events:
2014-08-21/23:31:36 ereport.cpu.generic-sparc.l2data-uc@/HOST
2014-08-21/23:31:36 ereport.cpu.generic-sparc.l2data-uc@/HOST
2014-08-21/23:31:36 ereport.cpu.generic-sparc.l2data-uc@/HOST
Solaris Unrecoverable Hardware Error (UHE) panic reported on the host console:
panic[cpu147]/thread=1007f032c000: Unrecoverable hardware error
000002a11878ac00 unix:process_nonresumable_error+2ec (2a11878ae50, 0, 2, 2a11878ad10, 2a11878ad68, 100000000)
%l0-3: 0000000003000000 0000000000000040 0000000000000100 000003000012c790
%l4-7: 0000000000000000 00000000000000ff 0000000000000000 ffffffffffffff7f
000002a11878ada0 unix:ktl0+64 (30485cfc920, 0, 1000, 15f35e20, 10398000, af9af1)
%l0-3: 000003000012c000 0000000000000498 0000000800001604 000000000102a23c
%l4-7: 00001006b610ef30 00000000102a6400 0000000000000000 000002a11878ae50
000002a11878aef0 unix:hat_unload_callback+26c (1063e400, ffffffff63d00000, 0, 1, 1004d3506540, 1004d3506540)
%l0-3: 0000000000000000 0000000000000004 0000000000000004 000000000fffffff
%l4-7: 0000000000000019 0000000000000000 000003000ffb2928 0000030485cfc920
000002a11878b3f0 genunix:anon_private+190 (2a11878b5c8, 100777678000, ffffffff63d00000, 301728b6900, 301728bd000, 100776e97590)
%l0-3: 000003000012c000 000000000000000b 0000000000000000 000010076a5060c0
%l4-7: 0000000000000000 000003000012c000 00000200eedd2000 0000000000000002
000002a11878b4f0 genunix:segvn_faultpage+6e8 (1004d3506540, 100777678000, ffffffff63d00000, 10076a5060c0, 0, 0)
%l0-3: 0000000000000000 0000000000000001 0000000000000002 00001007fce87c48
%l4-7: 00001006fe360480 0000000000000000 000000000000000b 0000000000000001
000002a11878b600 genunix:segvn_fault+b24 (ffffffff63d00000, 100777678000, ffffffff63d00000, 0, 0, 1)
%l0-3: ffffffff63d02000 0000000000000000 0000000000000000 000002a11878b7a0
%l4-7: 0000000000000002 00001007fce87c48 00001006fe360480 0000000000000001
000002a11878b800 genunix:as_fault+3f0 (1004d3506540, 100777678000, 1, 100789061080, 2, 1006b531b058)
%l0-3: 0000000000000001 ffffffff63d00000 0000000000002000 ffffffff63d02000
%l4-7: ffffffff63d00000 0000000000000001 0000100777678000 0000000000002000
000002a11878b8f0 unix:pagefault+8c (fff8000100000000, 1006ab680008, 5, 0, 1, 0)
%l0-3: 0000000000000000 00001006b531b008 000003000012c000 0000000000000000
%l4-7: 0000000000000002 0000000000000000 ffffffff63d00000 0007ffff00000000
000002a11878b9b0 unix:trap+e20 (2a11878bb80, 0, 100789061080, 10000, ffffffff7ee19044, 0)
%l0-3: 000002a11878bad0 0000000000010033 00001006ab680008 0000000000000001
%l4-7: 0000000000000002 0000000000010000 0000000000001c00 0000000000010080
2. HV Abort preceded by L2$ memory errors
FMA events:
2014-05-06/16:34:38 ereport.cpu.generic-sparc.l2data-uc@/HOST
2014-05-06/16:34:38 ereport.cpu.generic-sparc.l2data-uc@/HOST
2014-05-06/16:34:38 ereport.cpu.generic-sparc.l2data-uc@/HOST
2014-05-06/16:34:38 ereport.cpu.generic-sparc.l2data-uc@/HOST
2014-05-06/16:34:43 ereport.cpu.generic-sparc.hv-abort@/HOST
Hypervisor Abort reported on the host console:
ABORT: ../../../greatlakes/src/mmu.s, line 0x41e: DMMU error in hypervisor PC = 8a159a0
3. Redstate triggered by L2$ event
FMA events:
2014-07-23/19:25:44 ereport.hc.unspecified.redstate@/SYS/PM1/CMP1/CORE5/P0
2014-07-23/19:25:44 ereport.hc.unspecified.redstate@/SYS/PM0/CMP1/CORE2/P5
Redstate reported on the host console:
2014-07-24 02:26:10 3:5:0> NOTICE: nesr : 700000
2014-07-24 02:26:10 1:2:5> NOTICE: nesr : 400000
2014-07-24 02:26:10 3:5:0> NOTICE: near : 0
2014-07-24 02:26:10 1:2:5> NOTICE: near : 0
2014-07-24 02:26:10 3:5:0> NOTICE: desr : 0
2014-07-24 02:26:10 1:2:5> NOTICE: desr : 0
2014-07-24 02:26:10 3:5:0> NOTICE: dfesr : 0
2014-07-24 02:26:10 1:2:5> NOTICE: dfesr : 0
2014-07-24 02:26:10 3:5:0> NOTICE: pesr : 200
2014-07-24 02:26:10 1:2:5> NOTICE: pesr : 100
2014-07-24 02:26:10 3:5:0> NOTICE: dsfsr : 80
2014-07-24 02:26:10 1:2:5> NOTICE: dsfsr : 0
2014-07-24 02:26:10 3:5:0> NOTICE: dsfar : f301346800
2014-07-24 02:26:10 1:2:5> NOTICE: dsfar : 158baf10
2014-07-24 02:26:10 3:5:0> NOTICE: tl tpc tnpc tstate tt htstate
2014-07-24 02:26:10 1:2:5> NOTICE: tl tpc tnpc tstate tt htstate
2014-07-24 02:26:10 3:5:0> NOTICE: 1 000000010782cf60 000000010782cf64 0000004482001203 00a 0000000000000400
2014-07-24 02:26:10 1:2:5> NOTICE: 1 000000000100f77c 000000000100f780 0000004480001406 180 0000000000000400
2014-07-24 02:26:10 3:5:0> NOTICE: 2 0000000008a4c840 0000000008a4c844 000001994f001003 00a 0000000000000004
2014-07-24 02:26:10 1:2:5> NOTICE: 2 0000000008a213e4 0000000008a213e8 0000014480001006 032 0000000000000004
2014-07-24 02:26:10 3:5:0> NOTICE: 3 0000000008a4c840 0000000008a4c844 000002444f001003 00a 0000000000000004
2014-07-24 02:26:10 1:2:5> NOTICE: 3 0000000008a4c070 0000000008a4c074 0000024480001006 00a 0000000000000004
2014-07-24 02:26:10 3:5:0> NOTICE: 4 0000000008a4c840 0000000008a4c844 000003444f001003 00a 0000000000000004
2014-07-24 02:26:10 1:2:5> NOTICE: 4 0000000008a4c774 0000000008a4c778 000003994f001006 032 0000000000000004
2014-07-24 02:26:10 3:5:0> NOTICE: 5 0000000008a4c840 0000000008a4c844 000003444f001003 00a 0000000000000004
2014-07-24 02:26:10 1:2:5> NOTICE: 5 0000000008a4c070 0000000008a4c074 000003994f001006 00a 0000000000000004
2014-07-24 02:26:10 3:5:0> NOTICE: 6 0000000008a1b2b8 0000000008a1b2bc 000003444f001003 00a 0000000000000004
2014-07-24 02:26:11 1:2:5> NOTICE: 6 0000000008a4c774 0000000008a4c778 000003444f001006 032 0000000000000004
2014-07-24 02:26:11 3:5:0> NOTICE:
2014-07-24 02:26:11 1:2:5> NOTICE:
2014-07-24 02:26:11 3:5:0> ERROR: Redstate trap occurred on node 3 strand 40
2014-07-24 02:26:11 1:2:5> ERROR: Redstate trap occurred on node 1 strand 21
2014-07-24 02:26:15 3:5:0> ERROR: Powering down due to Redstate
4. A 'send_mondo' system panic preceded by L2$ memory errors
FMA events:
2014-08-02/00:41:20 ereport.cpu.generic-sparc.l2data-uc@/HOST
2014-08-02/00:41:20 ereport.cpu.generic-sparc.l2data-uc@/HOST
2014-08-02/00:41:21 ereport.cpu.generic-sparc.l2data-uc@/HOST
Solaris Panic reported on the host console:
panic[cpu54]/thread=1005ea1b38a0: send_mondo_set: timeout
000002a118aecc70 unix:send_mondo_set+560 (1, bec53, 3e, c3097b50968a, c3097b509310, 3000006e790)
%l0-3: 00000000299e2800 0000000000000001 000000001057e1b8 0000c3097b50968a
%l4-7: 000000000ab21dae 00000000010c8000 00000000000001f8 00000000010c8000
000002a118aecd40 unix:xt_some+1a8 (2a118aed028, 102741c, 2a119a0c000, 40001015800, 2a118aecdf0, 0)
%l0-3: 00000000104512c8 0000000000000178 0000000000000000 0000000000000036
%l4-7: ffffffffffffffff 0000000000000000 0000000000000001 0000000000000000
000002a118aecf70 unix:sfmmu_flush_pages+444 (30, 2a119a0c000, 1, 2a118aed028, 1, 2a118aed5d8)
%l0-3: 0000040001015800 0000000000000000 0000000000000000 0000040001013c00
%l4-7: 0000000000000036 0000000001027400 0000000000000000 0000000000000030
000002a118aed1b0 unix:sfmmu_tlb_range_demap+ec (2a118aed5a0, 2a119a0c000, 0, 0, 2a119a0e000, 0)
%l0-3: 0000000000000000 0000040001015800 0000000000000001 000000000000000d
%l4-7: 0000000000000000 0000040092660018 0000000000002000 0000000000002000
000002a118aed260 unix:hat_unload_callback+82c (1, 2a119a0e000, 0, 0, 40001015800, 40001015800)
%l0-3: 0000000000000001 0000000000000001 0000000000000001 000000000fffffff
%l4-7: 0000000000000000 000002a119a0e000 00000400a523b058 000003011b4fd100
000002a118aed760 genunix:segkp_release_internal+90 (1002057d6d58, ffffffffffffffff, 2a119a0c000, d, 10396ff8, 106682a8)
%l0-3: 00000000010d2f68 00000000010d2f68 000010023e10a388 0000000000000001
%l4-7: 0000000000000002 0000000000000001 0000000000001fff 00000000010d2f70
000002a118aed810 genunix:schedctl_freepage+18 (1005a148a938, 2a119a0c000, f2e1c000, 4, f2e1c000, 100d7800)
%l0-3: 000010023e10a3b8 000010023e10a392 00000000f2e1e000 0000000000000000
%l4-7: 000010028db92008 0000000000000001 000000001064c9f0 000000001064d0f0
000002a118aed8c0 genunix:schedctl_proc_cleanup+3c (1006043a1bf0, 10d2c00, 106670f8, 10667000, 1005ce5d4008, 10020ba45b90)
%l0-3: 00001006043a1bf0 00000000db5fffff 00000000db5ffc00 0000000000000050
%l4-7: 0000000010658f98 0000000010658c00 00000000010d2f60 0000000000000000
000002a118aed970 genunix:proc_exit+20c (ffff0000, 0, 0, 5a006002, 0, 1)
%l0-3: 0000000000000000 0000000000000000 00001005ea10e108 0000000000000000
%l4-7: 00001005ea1b38a0 00001006043a1bf0 0000000000000000 00001005ce5d4008
000002a118aeda20 genunix:exit+8 (1, 0, 2c400, 60000, ffffffff, 60000)
%l0-3: 000000003e270000 0000000000003e27 0000000000000001 000003000006e000
%l4-7: 000003b77bb8fef8 0000000000000000 0000000000000000 0000000000000000
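The FMA event excerpts in the symptoms above share a simple "timestamp class@resource" layout, so a saved event listing can be tallied with a short script. The Python sketch below is a hedged illustration: the function name and the list of watched suffixes are assumptions chosen for this example, and the sample lines are copied from symptom 2. A non-zero count is not a diagnosis, because these event classes can be raised for unrelated hardware faults.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpts above, for example:
# "2014-05-06/16:34:38 ereport.cpu.generic-sparc.l2data-uc@/HOST"
_EVENT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+(ereport\.[\w.\-]+)@(\S+)$"
)

# Final components of the event classes shown in this alert's symptoms.
WATCHED_SUFFIXES = ("l2data-uc", "hv-abort", "redstate")

# Sample input copied from symptom 2 of this alert.
SAMPLE_EVENTS = [
    "2014-05-06/16:34:38 ereport.cpu.generic-sparc.l2data-uc@/HOST",
    "2014-05-06/16:34:38 ereport.cpu.generic-sparc.l2data-uc@/HOST",
    "2014-05-06/16:34:38 ereport.cpu.generic-sparc.l2data-uc@/HOST",
    "2014-05-06/16:34:38 ereport.cpu.generic-sparc.l2data-uc@/HOST",
    "2014-05-06/16:34:43 ereport.cpu.generic-sparc.hv-abort@/HOST",
]


def count_watched_events(lines):
    """Count events whose class ends in one of WATCHED_SUFFIXES."""
    counts = Counter()
    for line in lines:
        match = _EVENT_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not an event line
        suffix = match.group(2).rsplit(".", 1)[-1]
        if suffix in WATCHED_SUFFIXES:
            counts[suffix] += 1
    return counts


if __name__ == "__main__":
    for suffix, total in count_watched_events(SAMPLE_EVENTS).items():
        print(f"{suffix}: {total}")
```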
Resolution
This issue is addressed in the following releases:
SPARC T4 servers and Netra SPARC T4 servers:
SPARC T4-1 Server
- Patch 151682-01: SPARC T4-1 Sun System Firmware 8.6.0.b
SPARC T4-1B Server Module
- Patch 151685-01: SPARC T4-1B Sun System Firmware 8.6.0.b
SPARC T4-2 Server
- Patch 151683-01: SPARC T4-2 Sun System Firmware 8.6.0.b
SPARC T4-4 Server
- Patch 151684-01: SPARC T4-4 Sun System Firmware 8.6.0.b
Netra SPARC T4-1 Server
- Patch 151686-01: Netra SPARC T4-1 Sun System Firmware 8.6.0.b
Netra SPARC T4-1B Server Module
- Patch 151688-01: Netra SPARC T4-1B Sun System Firmware 8.6.0.b
Netra SPARC T4-2 Server
- Patch 151687-01: Netra SPARC T4-2 Sun System Firmware 8.6.0.b
SPARC Supercluster T4-4 and StorageTek VSM 6:
- Contact Oracle technical support
Patches
<SUNPATCH:151682-01>, <SUNPATCH:151683-01>
<SUNPATCH:151684-01>, <SUNPATCH:151685-01>
<SUNPATCH:151686-01>, <SUNPATCH:151687-01>
<SUNPATCH:151688-01>
History
25-Nov-2014: Document released, status is Resolved
This issue was initially reported under Bug 18895455, which has been
closed as a duplicate of Bug 19721476.
The resolution that applies to SPARC Supercluster T4-4 currently says
"contact technical support." This will be updated to reference
the software bundle containing the fix when the bundle is released.
So far the issue has only been seen on systems that contain DIMM
part number M393B2K70CM0-YF8.
To confirm the part number from snapshot data, run the following
command from the {snapshot}/ilom directory:
/usr/gnu/bin/egrep -A 8 "type = DIMM" @usr@local@bin@collect_properties.out | egrep "fru_part_number|fru_serial_number" | sed 'N;s/\n/ /' | awk '{ print $3" : "$6 }'
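Where GNU egrep is not available, the same check can be approximated in Python. The sketch below is an assumption-laden illustration: it presumes the properties file holds "key = value" lines with fru_part_number and fru_serial_number within eight lines after each "type = DIMM" line, as the pipeline above implies. The sample lines, serial numbers, and function names are made up for the example.

```python
SUSPECT_PART_NUMBER = "M393B2K70CM0-YF8"  # part number named in this alert

# Hypothetical sample; the real file layout may differ.
SAMPLE_LINES = [
    " /SYS/PM0/CMP0/BOB0/CH0/D0",
    "    Properties:",
    "        type = DIMM",
    "        fru_name = EXAMPLE",
    "        fru_part_number = M393B2K70CM0-YF8",
    "        fru_serial_number = EXAMPLE0001",
    " /SYS/PM0/CMP0/BOB0/CH1/D0",
    "    Properties:",
    "        type = DIMM",
    "        fru_part_number = EXAMPLE-OTHER-PART",
    "        fru_serial_number = EXAMPLE0002",
]


def dimm_parts(lines, window=8):
    """Yield (part_number, serial_number) for each 'type = DIMM' entry.

    Like `egrep -A 8`, only the `window` lines after the type line are read.
    Reading stops early if another 'type' property starts a new entry.
    """
    lines = list(lines)
    for index, line in enumerate(lines):
        if "type = DIMM" not in line:
            continue
        props = {}
        for follow in lines[index + 1 : index + 1 + window]:
            key, sep, value = follow.partition("=")
            if not sep:
                continue  # not a "key = value" line
            if key.strip() == "type":
                break  # next component begins here
            props[key.strip()] = value.strip()
        yield props.get("fru_part_number"), props.get("fru_serial_number")


def suspect_dimms(lines):
    """Return serial numbers of DIMMs whose part number contains the suspect one."""
    return [
        serial
        for part, serial in dimm_parts(lines)
        if SUSPECT_PART_NUMBER in (part or "")
    ]


if __name__ == "__main__":
    for part, serial in dimm_parts(SAMPLE_LINES):
        print(f"{part} : {serial}")
```

To run it against real data, the lines of @usr@local@bin@collect_properties.out would be passed in place of SAMPLE_LINES. A substring match is used in case the part number field carries extra characters around the vendor part number.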
Questions regarding this document should be addressed to
sunalertpublication_us_grp@oracle.com, with a copy to the
Internal Contributors/Submitters listed below.
Internal Contributor/Submitter: Alex Aftandilian, Matt Finch, Justin Hatch, Marcel Widjaja
Internal Eng Responsible Engineer: Alex Aftandilian
Oracle Knowledge Analyst: david.mariotto@oracle.com
Internal Eng Business Unit Group: Systems Group - SYS
Internal Escalation ID:
Internal Resolution Patches: 151682-01, 151685-01, 151683-01, 151684-01, 151688-01, 151687-01
References
<BUG:19721476> - COMPUTE_PCHG_POWER_DOWN ERRONEOUSLY CALLED ON YF AFTER REMOVING RF/T3 SUPPORT
Attachments
This solution has no attachment