Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1668946.1
Update Date:2018-03-05

Solution Type  Problem Resolution Sure

Solution  1668946.1 :   Sun Fire[TM] 12K/15K/E20K/E25K Servers: Solaris instance on System Controllers (SCs) may hang during the execution of network command 'dladm show-link'  


Related Items
  • Sun Fire E25K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
  • Sun Fire E20K Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-Exxk

In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-8623519861>

Applies to:

Sun Fire E20K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 12K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 15K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E25K Server - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on SPARC (32-bit)

Symptoms

A Sun Fire 12K/15K/E20K/E25K System Controller (SC) may suffer an unexpected outage because Solaris hangs after the data link administration command "dladm" is issued directly on the SC console. The hang may occur on the first invocation or only after several attempts.

The following commands are not needed for platform management on Starcat System Controllers:

sc_root# /usr/sbin/dladm show-link
sc_root# /usr/sbin/dladm show-linkprop

The issue could therefore be avoided by not running these commands on Sun Fire 12K/15K/E20K/E25K System Controllers. However, Oracle Explorer Data Collector has executed dladm commands in its "netinfo" module since version 6.x.
With the environment change described under Changes, an Explorer run was found to hang the OS every time during execution of that module.

Explorer run in command shell:

[ snip]
Oracle Explorer Data Collector 8.0
Apr 06 19:34:11 sc explorer:  Explorer ID: explorer.<snip>-2014.04.06.18.33
Apr 06 19:34:18 sc rda:  RUNNING
Apr 06 19:34:18 sc rda:  Initializing RDA
Apr 06 19:34:18 sc rda:  Running RDA
------------------------------------------------------------------------------
RDA Data Collection Started 06-Apr-2014 19:34:54
------------------------------------------------------------------------------
Processing RDA.BEGIN module ...
Processing RDA.CONFIG module ...
Processing XPLR module ...
Apr 06 19:34:56 sc begin:RUNNING
Apr 06 19:34:57 sc ilomsnapshot_start:RUNNING
Apr 06 19:34:57 sc patch:RUNNING
Apr 06 19:35:12 sc pkg:RUNNING
Apr 06 19:38:12 sc sysconfig:RUNNING
Apr 06 19:40:40 sc ndd:RUNNING
Apr 06 19:41:20 sc netinfo:RUNNING
<hang>

Output of a ptree command run in parallel in a second command shell:

Wed Apr  9 08:41:22 BST 2014
1     /sbin/init
 9     /lib/svc/bin/svc.startd
   410   -sh
     2169  ksh -o vi
       2827  /bin/sh ./explorer
         3601  /bin/sh /opt/SUNWexplo/tools/rda.sh
           3611  /usr/bin/perl -T /usr/lib/rda/rda.pl -nXExplorer run -d/opt/SUNWexplo/output/ex
             6995  /usr/sbin/dladm show-link

[ output abbreviated ]

Sun Apr  6 19:41:21 BST 2014
1     /sbin/init
 7     /lib/svc/bin/svc.startd
   401   -sh
     2040  /bin/sh ./explorer
       2740  /bin/sh /opt/SUNWexplo/tools/rda.sh
         2750  /usr/bin/perl -T /usr/lib/rda/rda.pl -nXExplorer run -d/opt/SUNWexplo/output/ex
           6144  /usr/sbin/dladm show-linkprop
Sun Apr  6 19:41:22 BST 2014
1     /sbin/init
 7     /lib/svc/bin/svc.startd
   401   -sh
     2040  /bin/sh ./explorer
       2740  /bin/sh /opt/SUNWexplo/tools/rda.sh
         2750  /usr/bin/perl -T /usr/lib/rda/rda.pl -nXExplorer run -d/opt/SUNWexplo/output/ex
           6144  /usr/sbin/dladm show-linkprop
Sun Apr  6 19:41:24 BST 2014
1     /sbin/init
 7     /lib/svc/bin/svc.startd
   401   -sh
     2040  /bin/sh ./explorer
       2740  /bin/sh /opt/SUNWexplo/tools/rda.sh
         2750  /usr/bin/perl -T /usr/lib/rda/rda.pl -nXExplorer run -d/opt/SUNWexplo/output/ex
           6144  /usr/sbin/dladm show-linkprop

<hang>

 

Changes

The described symptoms have so far been seen only on Sun Fire 12K/15K/E20K/E25K System Controllers running SMS 1.6 on Solaris 10.
The behavior was discovered on SCs that had previously undergone maintenance and diagnostic actions for Solaris kernel issues, during which advanced diagnostic methods were applied (kmem_flags was set in /etc/system).

Cause

The /etc/system entry "set kmem_flags=0x1" has been identified as the root cause of the behavior; it was added to the file during a previous troubleshooting session for advanced Solaris kernel debugging.

With this setting in place, the OS hangs when the command 'dladm show-link' is issued, whether standalone or as part of an Explorer data collection run. In Explorer's "netinfo" module, the OS hung consistently.
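
Before running dladm or Explorer on an SC, it may help to check whether the entry is present. The following is a minimal sketch, not part of SMS or Explorer: the function name check_kmem_flags is hypothetical, and the pattern assumes a POSIX grep (on Solaris 10, /usr/xpg4/bin/grep).

```shell
# Hypothetical helper: list active kmem_flags settings in an
# /etc/system-style file. Comment lines in /etc/system start with '*',
# so only lines that begin with "set" (after optional blanks) match.
# Prints matching lines with their line numbers; returns 0 if any were found.
check_kmem_flags() {
    file="${1:-/etc/system}"
    grep -n '^[[:space:]]*set[[:space:]]\{1,\}kmem_flags' "$file"
}

# Usage, as root on the SC:
#   check_kmem_flags && echo "kmem_flags is set: avoid dladm and Explorer netinfo"
# The value in the running kernel can also be inspected with:
#   echo "kmem_flags/X" | mdb -k
```

A non-zero return means no active entry was found in the file. It says nothing about the running kernel if the file was edited after the last boot.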

Normal output of the "dladm" command on an SF25K SC looks like the following:

# /usr/sbin/dladm show-link
eri0            type: legacy    mtu: 1500       device: eri0
eri1            type: legacy    mtu: 1500       device: eri1
eri2            type: legacy    mtu: 1500       device: eri2
eri3            type: legacy    mtu: 1500       device: eri3
eri4            type: legacy    mtu: 1500       device: eri4
eri5            type: legacy    mtu: 1500       device: eri5
eri6            type: legacy    mtu: 1500       device: eri6
eri7            type: legacy    mtu: 1500       device: eri7
eri8            type: legacy    mtu: 1500       device: eri8
eri9            type: legacy    mtu: 1500       device: eri9
eri10           type: legacy    mtu: 1500       device: eri10
eri11           type: legacy    mtu: 1500       device: eri11
eri12           type: legacy    mtu: 1500       device: eri12
eri13           type: legacy    mtu: 1500       device: eri13
eri14           type: legacy    mtu: 1500       device: eri14
eri15           type: legacy    mtu: 1500       device: eri15
eri16           type: legacy    mtu: 1500       device: eri16
eri17           type: legacy    mtu: 1500       device: eri17
eri18           type: legacy    mtu: 1500       device: eri18
eri19           type: legacy    mtu: 1500       device: eri19
eri20           type: legacy    mtu: 1500       device: eri20
eri21           type: legacy    mtu: 1500       device: eri21
scman0          type: legacy    mtu: 1500       device: scman0
scman1          type: legacy    mtu: 1500       device: scman1



Solution

1. Recover from the hung OS
2. Resolve the root cause
3. Verify the resolution


1. The only way to clear the OS 'hang' on the system controller is via 'resetsc'.
    This command is run either from the main SC to reset the spare, or from the spare to reset the main, because an SC cannot reset itself:

sms-user:> resetsc
About to reset other SC.
Are you sure you want to continue? (y or [n])


2. After the system controller has been rebooted, avoid using dladm until the /etc/system entry has been deactivated. Remove the entry "set kmem_flags=0x1" and reboot (both as user root).
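
One way to deactivate the entry is sketched below. It comments the line out rather than deleting it, which is a variant of the removal described in step 2, and it keeps a backup copy. The function name disable_kmem_flags and the backup suffix are hypothetical, and the pattern assumes a POSIX sed (on Solaris 10, /usr/xpg4/bin/sed). Review the resulting file before rebooting.

```shell
# Hypothetical helper: comment out active kmem_flags settings in an
# /etc/system-style file. '*' starts a comment line in /etc/system.
disable_kmem_flags() {
    file="${1:-/etc/system}"
    backup="$file.pre-1668946"      # hypothetical backup file name
    # Keep an untouched copy so the change can be reverted.
    cp -p "$file" "$backup" || return 1
    # Prefix every active "set kmem_flags..." line with "* ".
    sed 's/^[[:space:]]*set[[:space:]]\{1,\}kmem_flags/* &/' "$backup" > "$file"
}

# Usage, as root on the SC, followed by a reboot:
#   disable_kmem_flags /etc/system
```

The change only takes effect after the reboot, because /etc/system is read at boot time.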

3. Start Oracle Explorer Data Collector for verification and pay special attention to the "netinfo" module: confirm that the run continues past the "netinfo:RUNNING" line and completes.
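
If the Explorer console output was saved to a file, the last module that started can be read from it as a quick check. This is a sketch that assumes the "<module>:RUNNING" line format shown under Symptoms; the function name last_explorer_module and the log file path are hypothetical.

```shell
# Hypothetical helper: print the name of the last module that logged
# "<module>:RUNNING" in a saved Explorer console log.
last_explorer_module() {
    # Keep only module start lines, then strip everything except the
    # module name from the last of them.
    grep ':RUNNING$' "$1" | sed -n '$s/.* \([^ ]*\):RUNNING$/\1/p'
}

# Usage (the log path is an assumption):
#   last_explorer_module /var/tmp/explorer-console.log
```

If the function still prints "netinfo" after the run should have moved on, the collection did not get past that module.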



References

<NOTE:1538483.1> - Collect a System Controller (SC) Explorer using STB7.3 (or newer) on Sun Fire 12K/15K/E20K/E25K (Starcat) Servers
<NOTE:1007746.1> - SunFire[TM] 12K/15K/E20K/E25K: Expected behavior of domains in different scenarios when the SCs are powered down or rebooted
<NOTE:1006092.1> - Sun Fire[TM] 12K/15K/E20K/E25K: Enterprise Installation Standards(EIS) EEPROM settings
<NOTE:1153444.1> - Oracle Services Tools Bundle (STB) - RDA/Explorer, SNEEP, ACT

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.