Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1580033.1
Update Date:2018-01-22

Solution Type  Problem Resolution Sure

Solution  1580033.1 :   SP Resets on T5xx0, T3, T4, T5 and M5 Systems May Overwrite a User-Land Solaris Memory Page Causing In-Memory Corruption  


Related Items
  • Netra SPARC T3-1B
  • Sun Netra T5440 Server
  • SPARC T4-2
  • SPARC M5-32
  • Netra SPARC T5-1B Server Module
  • Sun SPARC Enterprise T5440 Server
  • SPARC T3-2
  • SPARC T5-8
  • SPARC T3-4
  • Sun SPARC Enterprise T5120 Server
  • SPARC T4-1
  • Netra SPARC T4-1B
  • Sun SPARC Enterprise T5220 Server
  • Netra SPARC T4-2 Server
  • Sun SPARC Enterprise T5240 Server
  • SPARC T3-1
  • SPARC T5-2
  • Netra SPARC T4-1 Server
  • SPARC T4-1B
  • SPARC T4-4
  • SPARC T5-4
  • SPARC T5-1B
  • Sun Netra T5220 Server
  • Sun Blade T6340 Server Module
  • Sun Blade T6320 Server Module
  • Sun SPARC Enterprise T5140 Server
  • Sun Netra T6340 Server Module
  • SPARC T3-1B
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T5




In this Document
Symptoms
Changes
Cause
Solution
References


Applies to:

SPARC T5-2 - Version All Versions and later
Sun Netra T6340 Server Module - Version Not Applicable and later
SPARC T3-1B - Version Not Applicable and later
SPARC T3-2 - Version Not Applicable and later
Netra SPARC T4-1 Server - Version Not Applicable and later
Oracle Solaris on SPARC (64-bit)

Symptoms

See also Alert 1587769.1

Should the issue be encountered, some of the symptoms you may expect to find include, but are not limited to:

  • Applications may fail to start and/or core dump due to 'missing symbols' in linked libraries when the corruption occurs within the in-memory copy of the library.  Applications that are already running are not affected.
  • pkgchk(1M) may fail the checksum for a file.  If the file in question resides in memory or within the filesystem cache, this potentially corrupt version is used rather than the on-disk copy, which remains intact and unharmed.

Running pkgchk(1M) will show an error caused by the checksum failure of a file, e.g.:

$ pkgchk SUNWsmagt
ERROR: /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
   file cksum <55105> expected <43637> actual
$ grep /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9 /var/sadm/install/contents
/usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9 f none 0755 root bin 1353872 55105 1308148769 SUNWsmagt
$ cksum /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
3059640111      1353872 /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
$ sum /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
43637 2645 /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
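
The comparison that pkgchk(1M) performs can be reproduced by hand. The following is a minimal sketch only: the function name expected_cksum is hypothetical, and it assumes the contents file layout shown above, where the eighth whitespace-separated field of a regular-file ('f') entry is the recorded checksum.

```shell
# expected_cksum PATH [CONTENTS_FILE]
# Print the checksum recorded in the package contents database for PATH.
# Assumption: field 8 of an 'f' entry is the checksum, as in the entry above.
expected_cksum() {
    awk -v p="$1" '$1 == p && $2 == "f" { print $8 }' \
        "${2:-/var/sadm/install/contents}"
}
```

For the entry shown above, `expected_cksum /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9` prints 55105, which can then be compared with the first field of the sum(1) output for the same file.
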
Using a dump utility such as od(1) or xxd(1) to dump the raw contents of the file, and then diff(1) or sdiff(1) to compare the output, the following distinct pattern will be observed:
$ od -x libnetsnmp.so.5.0.9.BAD > od-x_libnetsnmp.so.5.0.9.BAD.out
$ od -x libnetsnmp.so.5.0.9.GOOD > od-x_libnetsnmp.so.5.0.9.GOOD.out

$ diff od-x_libnetsnmp.so.5.0.9.GOOD.out od-x_libnetsnmp.so.5.0.9.BAD.out

20829,20836c20829,20836
< 1212700 9de3 bf30 a141 4000 0300 063c 8210 6204  <-- This is what we expect
< 1212720 ae04 0001 e077 a7df f077 a7f7 f277 a7ef
< 1212740 f427 a7eb 0300 0044 8210 62c8 e05d c001
< 1212760 9004 2080 d25f a7f7 d45f a7ef 4006 8599
< 1213000 d647 a7eb 81c7 e008 81e8 0000 81c7 e008
< 1213020 81e8 0000 0001 0000 0001 0000 0001 0000
< 1213040 0001 0000 0001 0000 0001 0000 0001 0000
< 1213060 9de3 be80 a141 4000 0300 063c 8210 6194
---
> 1212700 0101 0100 0000 0000 0001 0000 0000 0000  <-- Signature of the corruption
> 1212720 0000 0000 0000 0000 0000 0000 0000 0000
> 1212740 0000 0000 0000 0000 0000 0000 0000 0000
> 1212760 0000 0000 0000 0000 0000 0000 0000 0000
> 1213000 0101 0100 0000 0000 0001 0000 0000 0000
> 1213020 0000 0000 0000 0000 0000 0000 0000 0000
> 1213040 0000 0000 0000 0000 0000 0000 0000 0000
> 1213060 0000 0000 0000 0000 0000 0000 0000 0000
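
Rather than comparing two dumps by eye, a suspect file can be scanned for the signature directly. The following is a rough sketch under stated assumptions: the function name find_sp_signature is hypothetical, the signature is taken to be the 16 bytes 01 01 01 00 00 00 00 00 00 01 00 00 00 00 00 00 followed by 48 zero bytes, and it is assumed to start on a 64-byte boundary, as it does in the example above. Bytes are dumped one at a time (od -tx1) so the result does not depend on the byte order of the host.

```shell
# find_sp_signature FILE
# Print the decimal byte offset of every 64-byte-aligned block in FILE
# that matches the corruption signature shown in this document.
find_sp_signature() {
    # 128 hex digits: 16 pattern bytes followed by 48 zero bytes.
    sig="01010100000000000001000000000000$(printf '%096d' 0)"
    od -An -v -tx1 "$1" | tr -d ' \t\n' | fold -w 128 |
        # Appending "" forces a string comparison; the digits would
        # otherwise be compared as (imprecise) floating-point numbers.
        awk -v sig="$sig" '($0 "") == (sig "") { print (NR - 1) * 64 }'
}
```

No output means that no block matching the assumed signature was found in the file.
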
If the 'corrupt' file exists on a ZFS filesystem, the ZFS Debugger zdb(1M) can be used to verify that the on-disk copy is good:
### Get the object number using ls(1).  In this case, the object number is 38167.  This would be known as an inode number on UFS/VxFS.

$ ls -li /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
     38167 -rwxr-xr-x   1 root     bin      1353872 Jun 15  2011 /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9

### Get the ZFS dataset name for the root (/) filesystem

$ zfs list /
NAME                     USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/s10u10_0413  24.1G   512G  7.91G  /

### Dump the object details using zdb(1M) to obtain the L0 Indirect block DVAs (Data Virtual Address)

$ zdb -dddddd <dataset> <object>
$ zdb -dddddd rpool/ROOT/s10u10_0413 38167 > zdb-dddddd.38167.out

$ cat zdb-dddddd.38167.out
Dataset rpool/ROOT/zfsroot-2012-10-24 [ZPL], ID 200, cr_txg 750261, 6.14G, 199621 objects, rootbp DVA[0]=<0:21815bcc00:200:STD:1> DVA[1]=<0:2c2e63400:200:STD:1> [L0 DMU objset] fletcher4 lzjb BE contiguous unique 2-copy size=800L/200P birth=985291L/985291P fill=199621 cksum=1c23804e52:8bd38850b62:1819830f4f043:2fd13ac6e4e250

   Object  lvl   iblk   dblk  dsize  lsize   %full  type
    38167    2    16K   128K   556K  1.38M  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                       168   bonus  System attributes
       dnode flags: USED_BYTES USERUSED_ACCOUNTED
       dnode maxblkid: 10
       path    /usr/sfw/lib/sparcv9/libnetsnmp.so.5.0.9
       uid     0
       gid     2
       atime   Fri Apr  5 10:00:01 2013
       mtime   Wed Jun 15 09:39:29 2011
       ctime   Fri Apr 27 19:43:33 2012
       crtime  Fri Apr 27 19:43:33 2012
       gen     92
       mode    100755
       size    1353872
       parent  6573
       links   1
       pflags  104
Indirect blocks:
              0 L1  0:2c6463a00:600 0:8c00e9600:600 4000L/600P F=11 B=92/92
              0  L0 0:2c62ee400:de00 20000L/de00P F=1 B=92/92      <-- We're only interested in L0 blocks
          20000  L0 0:2c625d200:be00 20000L/be00P F=1 B=92/92
          40000  L0 0:2c63aa200:fc00 20000L/fc00P F=1 B=92/92
          60000  L0 0:2c63faa00:10800 20000L/10800P F=1 B=92/92
          80000  L0 0:2c6435e00:10800 20000L/10800P F=1 B=92/92
          a0000  L0 0:2c640b600:10600 20000L/10600P F=1 B=92/92
          c0000  L0 0:2c6212600:9a00 20000L/9a00P F=1 B=92/92
          e0000  L0 0:2c6192400:4400 20000L/4400P F=1 B=92/92
         100000  L0 0:2c62c1e00:d200 20000L/d200P F=1 B=92/92
         120000  L0 0:2c6346600:f200 20000L/f200P F=1 B=92/92
         140000  L0 0:2c61a5600:5400 20000L/5400P F=1 B=92/92

               segment [0000000000000000, 0000000000160000) size 1.38M

### We now need to use zdb(1M) again to dump each of the L0 blocks in their raw form

bash-3.2# for dva in `awk '$2 ~/L0/ {print $3 ":dr " ;}' zdb-dddddd.38167.out`
> do
> zdb -R rpool ${dva} >> libnetsnmp.so.5.0.9.decomp.raw
> done
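
As a sanity check on the extracted file, its expected size can be computed from the logical (L) sizes of the L0 blocks. The following is an illustrative sketch only: the function name l0_logical_bytes is hypothetical, and it assumes the zdb(1M) output layout shown above, where the fourth field of each L0 line has the form 20000L/de00P with both sizes in hexadecimal.

```shell
# l0_logical_bytes ZDB_OUTPUT_FILE
# Sum the logical sizes of all L0 blocks listed in a saved zdb -dddddd dump.
l0_logical_bytes() {
    awk '$2 == "L0" {
        split($4, s, "L")              # "20000L/de00P" -> s[1] = "20000"
        h = tolower(s[1]); n = 0
        for (i = 1; i <= length(h); i++)   # manual hex-to-decimal conversion
            n = n * 16 + index("0123456789abcdef", substr(h, i, 1)) - 1
        total += n
    } END { printf "%d\n", total }' "$1"
}
```

For the object above, 11 L0 blocks of 0x20000 bytes each give 1441792 bytes. This is larger than the 1353872-byte file, which suggests that the final record is dumped at its full logical size.
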

### Now we can checksum the file extracted from the physical disks and compare it to a known good copy of the same library from another host running the same patches or from the patch itself

bash-3.2# sum libnetsnmp.so.5.0.9.decomp.raw
55105 2816 libnetsnmp.so.5.0.9.decomp.raw
bash-3.2# sum libnetsnmp.so.5.0.9.GOOD
55105 2645 libnetsnmp.so.5.0.9.GOOD

### The checksums match so we know the on-disk copy of the file is GOOD.  The in-memory copy is BAD/Corrupt!
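
The block counts reported by sum(1) above differ (2816 versus 2645) because the raw dump is longer than the original file. The 16-bit checksum still matches, presumably because the extra bytes are zero padding. For a stricter byte-for-byte check, the raw dump can be trimmed to the size of the known good copy before comparing. The following is a minimal sketch: the function name same_prefix is hypothetical, and head -c is assumed to be available (GNU or BSD head), so the comparison may need to be run on a host other than the affected one.

```shell
# same_prefix RAW GOOD
# Succeed (exit 0) when the first size(GOOD) bytes of RAW are identical to GOOD.
same_prefix() {
    size=$(wc -c < "$2" | tr -d ' ')
    head -c "$size" "$1" | cmp -s - "$2"
}
```

For example, `same_prefix libnetsnmp.so.5.0.9.decomp.raw libnetsnmp.so.5.0.9.GOOD && echo "on-disk copy matches"`.
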
 

Analysis of the physical memory using mdb(1) or scat(1M) will show the same pattern seen above, e.g.:

0xfffb815a0:    18a8658         70041850000    
0xfffb815b0:    7e8c28          18a8658        
0xfffb815c0:    101010000000000 1000000000000  
0xfffb815d0:    0               0              
0xfffb815e0:    0               0              
0xfffb815f0:    0               0              
0xfffb81600:    101010000000000 1000000000000  
0xfffb81610:    0               0              
0xfffb81620:    0               0              
0xfffb81630:    0               0              
0xfffb81640:    101010000000000 1000000000000  
0xfffb81650:    0               0              
0xfffb81660:    0               0              
0xfffb81670:    0               0              
0xfffb81680:    70041862000     7e8c31        
0xfffb81690:    18a8658         70041864000    
0xfffb816a0:    7e8c32          18a8658       

  

Note: If you have more than one corrupt file or the corruption pattern is not the one shown above, you have a different issue.  The problem described in this KM is a very specific pattern.

  

Changes

The Service Processor (SP) may have been manually reset by the System Administrator, or it may have reset or panicked due to a problem on the SP itself.  The trigger for the described issue is an SP reset, whatever its cause.

Cause

This issue can occur on the following CMT SPARC systems running S10u9 (Kernel Patch 142909-17) or later without the prescribed minimum firmware revision installed (see the 'Solution' subheading below):

  • Sun Blade T6320 Server Module
  • Sun Blade T6340 Server Module

 

  • Sun SPARC Enterprise T5120 Server
  • Sun SPARC Enterprise T5140 Server
  • Sun SPARC Enterprise T5220 Server
  • Sun SPARC Enterprise T5240 Server
  • Sun SPARC Enterprise T5440 Server
  • Sun Netra T5220 Server
  • Sun Netra T5440 Server

 

  • SPARC T3-1
  • SPARC T3-1B
  • SPARC T3-2
  • SPARC T3-4
  • Netra SPARC T3-1
  • Netra SPARC T3-1B

 

  • SPARC T4-1
  • SPARC T4-1B
  • SPARC T4-2
  • SPARC T4-4
  • Netra SPARC T4-1 Server
  • Netra SPARC T4-1B
  • Netra SPARC T4-2 Server

 

  • SPARC T5-1B
  • SPARC T5-2
  • SPARC T5-4
  • SPARC T5-8
  • Netra SPARC T5-1B Server Module

 

  • SPARC M5-32

Note: This issue is not applicable to the T1000, T2000 or T6300 because they use ALOM rather than ILOM.

  

Solution

Workaround: Further corruption can be prevented by rebooting the host.

The permanent fix is provided in the following system firmware revisions or later:

 
System                               Firmware Patch ID   Minimum Firmware Version

M5
  SPARC M5-32                        17019082            9.0.2.E
T5
  SPARC T5-8                         17199962            9.0.0.K
  SPARC T5-4                         17199962            9.0.0.K
  SPARC T5-2                         17199954            9.0.0.K
  SPARC T5-1B                        17199947            9.0.0.K
  Netra SPARC T5-1B                  17199974            9.0.0.K
T4
  Netra SPARC T4-2 Server            150418-03           8.3.0.d
  Netra SPARC T4-1B                  150419-03           8.3.0.d
  Netra SPARC T4-1 Server            150417-03           8.3.0.d
  SPARC T4-4                         150415-03           8.3.0.d
  SPARC T4-2                         150414-03           8.3.0.d
  SPARC T4-1B                        150416-03           8.3.0.d
  SPARC T4-1                         150413-03           8.3.0.d
T3
  Netra SPARC T3-1                   150411-03           8.3.0.d
  Netra SPARC T3-1B                  150412-03           8.3.0.d
  SPARC T3-4                         150409-03           8.3.0.d
  SPARC T3-2                         150408-03           8.3.0.d
  SPARC T3-1B                        150410-03           8.3.0.d
  SPARC T3-1                         150407-03           8.3.0.d
T5xx0
  Sun Netra T5440 Server             147313-09           7.4.6.c
  Sun Netra T5220 Server             147309-08           7.4.6.c
  Sun SPARC Enterprise T5440 Server  147311-07           7.4.6.c
  Sun SPARC Enterprise T5240 Server  147310-09           7.4.6.c
  Sun SPARC Enterprise T5220 Server  147307-09           7.4.6.c
  Sun SPARC Enterprise T5140 Server  147310-09           7.4.6.c
  Sun SPARC Enterprise T5120 Server  147307-09           7.4.6.c
Blades
  Sun Blade T6340 Server Module      147312-08           7.4.6.c
  Sun Blade T6320 Server Module      147308-07           7.4.6.c

 

References

<BUG:16863221> - LDC CHANNEL 1 OVERWRITES SOLARIS MEMORY PAGE WHEN SP RESETS.

  Copyright © 2018 Oracle, Inc.  All rights reserved.