Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1580033.1: SP Resets on T5xx0, T3, T4, T5 and M5 Systems May Overwrite a User-Land Solaris Memory Page Causing In-Memory Corruption

Applies to:
SPARC T5-2 - Version All Versions and later
Sun Netra T6340 Server Module - Version Not Applicable and later
SPARC T3-1B - Version Not Applicable and later
SPARC T3-2 - Version Not Applicable and later
Netra SPARC T4-1 Server - Version Not Applicable and later
Oracle Solaris on SPARC (64-bit)

Symptoms

See also Alert 1587769.1

Should the issue be encountered, some of the symptoms you may expect to find include, but are not limited to:

Running pkgchk(1M) will show an error caused by the checksum failure of a file, e.g.:

    $ pkgchk SUNWsmagt
    ERROR: /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
        file cksum <55105> expected <43637> actual
    $ grep /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9 /var/sadm/install/contents
    /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9 f none 0755 root bin 1353872 55105 1308148769 SUNWsmagt
    $ cksum /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
    3059640111 1353872 /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
    $ sum /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
    43637 2645 /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9

Using a hex dump tool such as od(1) or xxd(1) to dump the raw contents of the file, and then diff(1) or sdiff(1) to compare the dumps, the following distinct pattern will be observed:

    $ od -x libnetsnmp.so.5.0.9.BAD > od-x_libnetsnmp.so.5.0.9.BAD.out
    $ od -x libnetsnmp.so.5.0.9.GOOD > od-x_libnetsnmp.so.5.0.9.GOOD.out
    $ diff od-x_libnetsnmp.so.5.0.9.GOOD.out od-x_libnetsnmp.so.5.0.9.BAD.out
    20829,20836c20829,20836
    < 1212700 9de3 bf30 a141 4000 0300 063c 8210 6204   <-- This is what we expect
    < 1212720 ae04 0001 e077 a7df f077 a7f7 f277 a7ef
    < 1212740 f427 a7eb 0300 0044 8210 62c8 e05d c001
    < 1212760 9004 2080 d25f a7f7 d45f a7ef 4006 8599
    < 1213000 d647 a7eb 81c7 e008 81e8 0000 81c7 e008
    < 1213020 81e8 0000 0001 0000 0001 0000 0001 0000
    < 1213040 0001 0000 0001 0000 0001 0000 0001 0000
    < 1213060 9de3 be80 a141 4000 0300 063c 8210 6194
    ---
    > 1212700 0101 0100 0000 0000 0001 0000 0000 0000   <-- Signature of the corruption
    > 1212720 0000 0000 0000 0000 0000 0000 0000 0000
    > 1212740 0000 0000 0000 0000 0000 0000 0000 0000
    > 1212760 0000 0000 0000 0000 0000 0000 0000 0000
    > 1213000 0101 0100 0000 0000 0001 0000 0000 0000
    > 1213020 0000 0000 0000 0000 0000 0000 0000 0000
    > 1213040 0000 0000 0000 0000 0000 0000 0000 0000
    > 1213060 0000 0000 0000 0000 0000 0000 0000 0000

If the 'corrupt' file exists on a ZFS filesystem, the ZFS Debugger zdb(1M) can be used to verify that the on-disk copy is good:

    ### Get the object number using ls(1). In this case, the object number is 38167.
    ### This would be known as an inode number on UFS/VxFS.
    $ ls -li /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
    38167 -rwxr-xr-x 1 root bin 1353872 Jun 15 2011 /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9

    ### Get the ZFS dataset name for the root (/) filesystem
    $ zfs list /
    NAME                     USED  AVAIL  REFER  MOUNTPOINT
    rpool/ROOT/s10u10_0413  24.1G   512G  7.91G  /

    ### Dump the object details using zdb(1M) to obtain the L0 Indirect block DVAs (Data Virtual Address)
    $ zdb -dddddd <dataset> <object>
    $ zdb -dddddd rpool/ROOT/s10u10_0413 38167 > zdb-dddddd.38167.out
    $ cat zdb-dddddd.38167.out
    Dataset rpool/ROOT/zfsroot-2012-10-24 [ZPL], ID 200, cr_txg 750261, 6.14G, 199621 objects, rootbp DVA[0]=<0:21815bcc00:200:STD:1> DVA[1]=<0:2c2e63400:200:STD:1> [L0 DMU objset] fletcher4 lzjb BE contiguous unique 2-copy size=800L/200P birth=985291L/985291P fill=199621 cksum=1c23804e52:8bd38850b62:1819830f4f043:2fd13ac6e4e250

        Object  lvl   iblk   dblk  dsize  lsize   %full  type
         38167    2    16K   128K   556K  1.38M  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                            168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 10
        path    /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
        uid     0
        gid     2
        atime   Fri Apr 5 10:00:01 2013
        mtime   Wed Jun 15 09:39:29 2011
        ctime   Fri Apr 27 19:43:33 2012
        crtime  Fri Apr 27 19:43:33 2012
        gen     92
        mode    100755
        size    1353872
        parent  6573
        links   1
        pflags  104
    Indirect blocks:
              0 L1 0:2c6463a00:600 0:8c00e9600:600 4000L/600P F=11 B=92/92
              0 L0 0:2c62ee400:de00 20000L/de00P F=1 B=92/92   <-- We're only interested in L0 blocks
          20000 L0 0:2c625d200:be00 20000L/be00P F=1 B=92/92
          40000 L0 0:2c63aa200:fc00 20000L/fc00P F=1 B=92/92
          60000 L0 0:2c63faa00:10800 20000L/10800P F=1 B=92/92
          80000 L0 0:2c6435e00:10800 20000L/10800P F=1 B=92/92
          a0000 L0 0:2c640b600:10600 20000L/10600P F=1 B=92/92
          c0000 L0 0:2c6212600:9a00 20000L/9a00P F=1 B=92/92
          e0000 L0 0:2c6192400:4400 20000L/4400P F=1 B=92/92
         100000 L0 0:2c62c1e00:d200 20000L/d200P F=1 B=92/92
         120000 L0 0:2c6346600:f200 20000L/f200P F=1 B=92/92
         140000 L0 0:2c61a5600:5400 20000L/5400P F=1 B=92/92

        segment [0000000000000000, 0000000000160000) size 1.38M

    ### We now need to use zdb(1M) again to dump each of the L0 blocks in their raw form
    bash-3.2# for dva in `awk '$2 ~/L0/ {print $3 ":dr " ;}' zdb-dddddd.38167.out`
    > do
    > zdb -R rpool ${dva} >> libnetsnmp.so.5.0.9.decomp.raw
    > done

    ### Now we can checksum the file extracted from the physical disks and compare it
    ### to a known good copy of the same library from another host running the same
    ### patches or from the patch itself
    bash-3.2# sum libnetsnmp.so.5.0.9.decomp.raw
    55105 2816 libnetsnmp.so.5.0.9.decomp.raw
    bash-3.2# sum libnetsnmp.so.5.0.9.GOOD
    55105 2645 libnetsnmp.so.5.0.9.GOOD

    ### The checksums match so we know the on-disk copy of the file is GOOD.
    ### The in-memory copy is BAD/Corrupt!

The block counts differ (2816 versus 2645) even though the checksums match. This is most likely because the raw dump consists of 11 whole 128K blocks (0x160000 bytes, or 2816 512-byte blocks), whereas the file itself is 1353872 bytes; the padding after the end of the file evidently does not alter the sum(1) checksum.

Analysis of the physical memory using mdb(1) or scat(1M) will show the same pattern seen above, e.g.:

    0xfffb815a0: 18a8658 70041850000
    0xfffb815b0: 7e8c28 18a8658
    0xfffb815c0: 101010000000000 1000000000000
    0xfffb815d0: 0 0
    0xfffb815e0: 0 0
    0xfffb815f0: 0 0
    0xfffb81600: 101010000000000 1000000000000
    0xfffb81610: 0 0
    0xfffb81620: 0 0
    0xfffb81630: 0 0
    0xfffb81640: 101010000000000 1000000000000
    0xfffb81650: 0 0
    0xfffb81660: 0 0
    0xfffb81670: 0 0
    0xfffb81680: 70041862000 7e8c31
    0xfffb81690: 18a8658 70041864000
    0xfffb816a0: 7e8c32 18a8658

Note: If you have more than one corrupt file or the corruption pattern is not the one shown above, you have a different issue. The problem described in this KM is a very specific pattern.
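
Because the corruption pattern is so specific, a suspect file can be checked for it mechanically rather than by eye. The following is a minimal Python sketch, not part of the original KM. It assumes the signature is a 64-byte record (the 16 bytes `01 01 01 00 00 00 00 00 00 01 00 00 00 00 00 00` followed by 48 zero bytes) repeated back to back, as seen in the od and mdb output above. The names `find_corruption`, `RECORD` and `min_repeats` are hypothetical.

```python
# Hypothetical helper: scan file contents for the 64-byte corruption record
# seen in the od(1) and mdb(1) output. The record layout is an assumption
# derived from those dumps, not a documented format.

SIGNATURE = bytes.fromhex("01010100000000000001000000000000")  # first 16 bytes
RECORD = SIGNATURE + bytes(48)  # followed by 48 zero bytes, 64 bytes in total


def find_corruption(data: bytes, min_repeats: int = 2) -> list[int]:
    """Return start offsets of runs of at least min_repeats adjacent records."""
    hits = []
    pos = data.find(RECORD)
    while pos != -1:
        # Count how many records follow each other without a gap.
        repeats = 1
        while data.startswith(RECORD, pos + repeats * len(RECORD)):
            repeats += 1
        if repeats >= min_repeats:
            hits.append(pos)
        # Resume the search after the run just examined.
        pos = data.find(RECORD, pos + repeats * len(RECORD))
    return hits
```

To use it, read the suspect file in binary mode and pass the bytes to `find_corruption`; `oct(offset)` converts a reported offset to octal for comparison with the offsets printed by `od -x`. Requiring at least two adjacent records is a design choice to reduce false positives in files that legitimately contain long runs of zeros.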

Changes

The Service Processor (SP) may have been manually reset by the System Administrator, or it may have reset or panicked due to a problem on the SP itself. The trigger for the described issue is an SP reset, no matter what the cause.

Cause

This issue can occur on the following CMT SPARC systems with S10u9 or later (Kernel Patch 142909-17) and without the prescribed minimum firmware revision installed (see the 'Solution' subheading below):
Note: This issue is not applicable to the T1000, T2000 or T6300 because they use ALOM, not ILOM.

Solution

Workaround: Further corruption can be prevented by rebooting the host.

The permanent fix is provided in the following system firmware revisions or later:

References

<BUG:16863221> - LDC CHANNEL 1 OVERWRITES SOLARIS MEMORY PAGE WHEN SP RESETS.

Attachments

This solution has no attachment