Solaris sun4v domains may panic after 1101 days of uptime

Asset ID:	1-72-2245358.1
Update Date:	2018-01-18

Solution Type:	Problem Resolution Sure

Solution 2245358.1 : Solaris sun4v domains may panic after 1101 days of uptime

Applies to:

SPARC M7-16 - Version All Versions and later
Oracle SuperCluster M7 Hardware - Version All Versions and later
SPARC T5-2 - Version All Versions and later
SPARC M5-32 - Version All Versions and later
SPARC S7-2 - Version All Versions and later
Information in this document applies to any platform.
This issue applies to SPARC servers of machine implementation "sun4v". The Solaris command 'uname -i' can be used to display the machine implementation.

The bug was introduced in Hypervisor 1.0, but the panic can only occur on servers with Hypervisor 1.12.x or later. All "sun4v" servers use a hypervisor.

Symptoms

A bug in hypervisor (HV) 1.12.x or later may cause a domain to panic after 1101 days of uptime. The HV version can be displayed with the Solaris command 'ldm -V | grep ")Hypervisor"'.
For example:

% ldm -V | grep ")Hypervisor"
Hypervisor v. 1.15.5.a @(#)Hypervisor 1.15.5.a 2016/08/09 15:21
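
The version check can also be scripted. The following is a minimal Python sketch, not an Oracle-provided tool: it assumes the 'ldm -V' output line has the format shown above, and the function names hypervisor_version and panic_possible are hypothetical.

```python
import re

def hypervisor_version(ldm_output):
    """Return (major, minor) parsed from 'ldm -V' output, or None if not found."""
    # Assumes a line of the form "Hypervisor v. 1.15.5.a ..." as in the example above.
    match = re.search(r"Hypervisor\s+v\.\s+(\d+)\.(\d+)", ldm_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def panic_possible(version):
    """True if the HV version is 1.12.x or later, where the panic can occur."""
    return version is not None and version >= (1, 12)

sample = "Hypervisor v. 1.15.5.a @(#)Hypervisor 1.15.5.a 2016/08/09 15:21"
version = hypervisor_version(sample)
print(version, panic_possible(version))  # (1, 15) True
```

This only compares the HV version against 1.12. It does not determine whether the installed System Firmware already contains the fix; use the SysFW versions listed under Solution for that.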

Various types of panic may appear in the HOST console log, including "panic: send_mondo_set: timeout".

The ILOM event log (-> show /SP/logs/event/list) may have sufficient history to check the HOST uptime.
Look for Host "Powered on" or "HV started".

Example (event log):

225 Thu Jan 26 14:17:22 2017 System Log minor
Host: Solaris panicking <======================panic date/time
<snip>
199 Tue Jan 21 13:57:43 2014 System Log minor
Host: Host started
198 Tue Jan 21 13:57:39 2014 System Log minor
Host: HV started <======================HV start date/time
197 Tue Jan 21 13:49:00 2014 System Log minor
Host: Powered On

The Solaris GNU date command can be used to calculate the date 1101 days before the panic date.

Example:
% /usr/gnu/bin/date -d "Jan 26 2017 - 1101 days"
Tuesday, January 21, 2014 12:00:00 AM PST

Many date calculators are available via Internet search that display the duration between two dates.
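
Where GNU date or an online calculator is not available, the same arithmetic can be done with a short script. This is a minimal Python sketch using the timestamps from the event log example above; the helper name parse_ilom_time and the format string are assumptions based on that example.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the ILOM event log example,
# e.g. "Tue Jan 21 13:57:39 2014" (assumed from that example).
ILOM_FORMAT = "%a %b %d %H:%M:%S %Y"

def parse_ilom_time(text):
    """Convert an ILOM event log timestamp string to a datetime."""
    return datetime.strptime(text, ILOM_FORMAT)

hv_started = parse_ilom_time("Tue Jan 21 13:57:39 2014")
panic = parse_ilom_time("Thu Jan 26 14:17:22 2017")

# Duration between "HV started" and "Solaris panicking".
uptime = panic - hv_started
print(uptime.days)  # 1101

# Date 1101 days before the panic, as in the GNU date example.
print((panic - timedelta(days=1101)).date())  # 2014-01-21
```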

If the ILOM event log does not have sufficient history to show when "HV started", it may still be possible to
identify an uptime near 1101 days. The Solaris last command (e.g., last -5 reboot) lists when
Solaris was last booted. This does NOT identify when "HV started", but if the uptime is near 1101 days it is
reasonable to conclude that this bug caused the panic.

If either of the two techniques identifies a period of run time, measured

from when HV started, or
from the last Solaris boot,

and the duration to the panic is at or near 1101 days, then bug 23193383 has likely been manifested.
Oracle can analyze snapshot HOST status logs for confirmation.

The snapshot file ilom/@persist@host_logs@host0_status.log (substitute the proper host instance number)
can be checked for a longer history of HOST start events, including the dates and times of 'HV started'
events and Solaris panics.

On a live service processor, the "Restricted Shell" can be used to check the host status logs in
/persist/host_logs, but the "Restricted Shell" account should be used only by Oracle personnel and
authorized service partners.

Example:
-> set SESSION mode=restricted

WARNING: The "Restricted Shell" account is provided solely
to allow Services to perform diagnostic tasks.

## check for available host status logs
[(restricted_shell) server-sp:~]$ ls persist/host_logs/*status*
persist/host_logs/host0_status.log persist/host_logs/host1_status.log persist/host_logs/host2_status.log persist/host_logs/host3_status.log
persist/host_logs/host0_status.log.1 persist/host_logs/host1_status.log.1 persist/host_logs/host2_status.log.1 persist/host_logs/host3_status.log.1

## egrep for "HV start" and "panic" in the desired HOST status log
[(restricted_shell) server-sp:~]$ egrep "HV started|panic" persist/host_logs/host0_status.log
20140121 13:57:59: status='HV started'
20170126 14:17:22: status='Solaris panicking'
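
Once the matching lines have been collected, the uptime before each panic can be computed from them. The following Python sketch assumes the line format shown in the egrep output above; the function name uptimes_before_panic is hypothetical, and pairing each panic with the most recent 'HV started' entry is a simplification.

```python
import re
from datetime import datetime

# Matches lines such as: 20140121 13:57:59: status='HV started'
LINE = re.compile(r"^(\d{8} \d{2}:\d{2}:\d{2}): status='([^']*)'")

def uptimes_before_panic(lines):
    """Yield (hv_start, panic_time, uptime) for each panic after an 'HV started' entry."""
    hv_start = None
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip lines that are not status entries
        when = datetime.strptime(m.group(1), "%Y%m%d %H:%M:%S")
        status = m.group(2)
        if status == "HV started":
            hv_start = when
        elif "panic" in status and hv_start is not None:
            yield hv_start, when, when - hv_start

log = [
    "20140121 13:57:59: status='HV started'",
    "20170126 14:17:22: status='Solaris panicking'",
]
for start, panic, uptime in uptimes_before_panic(log):
    print(start, panic, uptime.days)  # 2014-01-21 13:57:59 2017-01-26 14:17:22 1101
```

An uptime at or near 1101 days in this output is consistent with the bug.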

Cause

A bug introduced in Hypervisor 1.0 causes miscalculation of various cyclic operations, but the panic only occurs on HV 1.12.x and later.

Solution

Upgrade the service processor System Firmware (SysFW) to a release which includes the bug fix. The SysFW versions that include the bug fix for various servers are listed below:

M5-32: 9.6.7.a
M6-32: 9.6.7.a
M7-8: 9.7.4
M7-16: 9.7.4
T4: 8.9.8
T5: 9.6.7.a
T7: 9.7.4
S7: 9.7.5.b (9.7.5.c is required for S7 servers with 64GB Load Reduced DIMMs to resolve Bug 25953561)
Netra S7: 9.7.4
M10-1 - M10-4S: XCP2320

An exhaustive list of all impacted Oracle servers is not provided above. Any server which uses hypervisor 1.12.x or later can be impacted. If the server is of implementation "sun4v" it may be impacted. Check the server implementation with the Solaris command 'uname -i'.

To find patches which include the hypervisor fix, search server patch README manifests for SysFW patches which include the fix for bug 23193383 or a Backport bug. SysFW releases are available via the Oracle Technology Network.

Firmware download links and release history for Oracle systems can be found on the Oracle Technology Network at:
http://www.oracle.com/technetwork/systems/patches/firmware/release-history-jsp-138416.html

Patch descriptions will list bug 23193383 or a Backport bug in the patch README manifest.
NOTE: The HOST must be stopped and restarted after the SysFW upgrade to deploy the HV fix.

If SysFW cannot be upgraded, the HOST can be stopped and restarted before it reaches 1101 days of uptime. The restart begins a new 1101-day period, so the domain will not panic from this bug for another 1101 days. Upgrading SysFW is the preferred resolution.
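
To plan such a restart, the date on which the HOST reaches 1101 days of uptime can be calculated from the "HV started" time. This is a minimal Python sketch; the function names restart_deadline and days_remaining are hypothetical, and the sample "HV started" time is taken from the event log example in the Symptoms section.

```python
from datetime import datetime, timedelta

EXPOSURE = timedelta(days=1101)  # uptime at which the panic may occur

def restart_deadline(hv_started):
    """Return the time at which the HOST reaches 1101 days of uptime."""
    return hv_started + EXPOSURE

def days_remaining(hv_started, now):
    """Return whole days left before the deadline (negative if already past)."""
    return (restart_deadline(hv_started) - now).days

hv_started = datetime(2014, 1, 21, 13, 57, 39)
print(restart_deadline(hv_started))  # 2017-01-26 13:57:39
print(days_remaining(hv_started, datetime(2016, 12, 27, 13, 57, 39)))  # 30
```

Because the panic occurs at or near 1101 days, schedule the restart well ahead of the calculated date instead of treating it as an exact limit.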

References

<BUG:23193383> - CYCLICS MISBEHAVE AFTER TWO YEARS OF UPTIME
http://www.oracle.com/technetwork/systems/patches/firmware/release-history-jsp-138416.html
<NOTE:1554086.1> - Fujitsu M10-1/M10-4/M10-4S XSCF Control Package (XCP) Firmware Image Software Version Matrix Information
<NOTE:1540816.1> - SPARC M5-32 and M6-32 Servers: Firmware Image Software Version Matrix Information
<NOTE:1967048.1> - SPARC M8 and SPARC M7 Series Servers : Firmware Image Software Version Matrix Information
