Proactive setup/troubleshooting of a Sun Fire[TM] 280R

Asset ID:	1-75-1009309.1
Update Date:	2018-02-15
Keywords:

Solution Type Troubleshooting Sure

Solution 1009309.1 : Proactive setup/troubleshooting of a Sun Fire[TM] 280R

Applies to:

Sun Fire 280R Server - Version Not Applicable and later
All Platforms

Purpose

This document describes how to set up a Sun Fire[TM] 280R system so that, if trouble arises, Sun support can troubleshoot it as effectively and efficiently as possible.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - SPARC Legacy Servers

Troubleshooting Steps

1) Patches

Make sure the system is up to date with patches. An up-to-date system has two advantages: availability improves, and any problem that does occur is easier to diagnose.

How to find the Oracle Solaris Critical Patch Update (CPU) Patchsets, Recommended OS Patchsets for Oracle Solaris and Oracle Solaris Update Patch Bundles (Doc ID 1272947.1)

2) Open Boot Prom (OBP)

Sun Fire[TM] V480/V490/V880/V890 servers: Firmware update information (Doc ID 1007526.1)

Recommended settings for this version:

diag-switch = true

diag-level = min

diag-script = normal

auto-boot = true

diag-device =

error-reset-recovery = sync

* With diag-switch set to true, booting can take a long time, especially if the system contains a lot of memory. If this is not acceptable, set it to false.

* With diag-script set to normal, obdiag tests all devices expected to be present in the baseline configuration, so PCI cards are not tested.

* With error-reset-recovery set to sync, OBP invokes a sync, which creates a crash dump, after an XIR or a Red state exception.

* diag-level should be the factory default, min. See Guidance for POST Diagnostic Level Setting on Sun Fire[TM] 280R, V480, V490, V880, V890 and V880z servers (Doc ID 1582330.1).
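
The settings above can also be applied from a running Solaris instance with the eeprom command instead of at the ok prompt. The following is a minimal sketch under stated assumptions, not a supported procedure: it assumes a POSIX-style shell, it assumes the boolean variables are named diag-switch? and auto-boot? (OpenBoot booleans normally end in "?", which the list above omits), and it leaves out diag-device because no value is shown for it. The function names are hypothetical. Unless called with "apply", it only prints the commands.

```shell
#!/bin/sh
# Recommended OBP settings from this document, one "name=value" per line.
# Assumption: boolean variables carry a trailing "?" (diag-switch?, auto-boot?).
# diag-device is omitted because the document shows no value for it.
recommended_obp_settings() {
    cat <<'EOF'
diag-switch?=true
diag-level=min
diag-script=normal
auto-boot?=true
error-reset-recovery=sync
EOF
}

# $1 = "apply" runs eeprom (root required); anything else is a dry run.
apply_obp_settings() {
    recommended_obp_settings | while read -r setting; do
        if [ "$1" = "apply" ]; then
            eeprom "$setting"
        else
            echo "eeprom '$setting'"
        fi
    done
}

apply_obp_settings dry-run
```

Review the printed commands first. Running eeprom with no arguments lists the current values, which is a convenient way to compare them with the recommendations.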

3) Configure the Remote System Controller (RSC)

Three packages need to be installed (available on the Solaris Supplemental CD or by download):

SUNWrsc On the host system - RSC

SUNWrscd On the host system - RSC user guide

SUNWrscj On a client - RSC GUI
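
Whether the host-side packages are already present can be checked with pkginfo before running the configuration script. This is a small sketch assuming a POSIX-style shell; the function name missing_packages is hypothetical, and SUNWrscj is left out because it belongs on the client rather than on the host.

```shell
#!/bin/sh
# Print the names of the given packages that are not installed.
# pkginfo -q prints nothing and returns non-zero for a missing package.
missing_packages() {
    for pkg in "$@"; do
        pkginfo -q "$pkg" 2>/dev/null || echo "$pkg"
    done
}

# Only attempt the check where pkginfo exists (that is, on Solaris).
if command -v pkginfo >/dev/null 2>&1; then
    missing=$(missing_packages SUNWrsc SUNWrscd)
    if [ -n "$missing" ]; then
        echo "Missing RSC packages:" $missing
    else
        echo "Host-side RSC packages are installed."
    fi
fi
```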

- Configure the RSC:

# /usr/platform/`uname -i`/rsc/rsc-config

To get the console messages and to reach the ok-prompt via the RSC, set the following parameters in the OBP:

input-device = rsc-console

output-device = rsc-console

diag-out-console = true

4) Enable the watchdog reset mechanism

Add the following setting in /etc/system:

set watchdog_enable=1

* A reboot is necessary to activate the setting

5) Configure Solaris to save a crash dump to disk after a panic

How much space is needed for the dump device

Crash dumps vary in size based on the memory configuration of the system and how much of that memory was in use. On systems with relatively small amounts of RAM (up to 5 GB), a guideline is to allow 35% of the amount of RAM per crash dump. For larger amounts of RAM, 2 GB is usually sufficient.
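
The guideline above can be written as a small calculation. This is only a sketch of the rule of thumb, assuming a POSIX-style shell and RAM expressed in MB; the function name dump_space_mb is hypothetical, and the result is an estimate rather than a guarantee that a dump will fit.

```shell
#!/bin/sh
# Estimate the space to allow per crash dump, in MB, from installed RAM in MB.
# Rule of thumb from the text: 35% of RAM up to 5 GB, otherwise a flat 2 GB.
dump_space_mb() {
    ram_mb=$1
    if [ "$ram_mb" -le 5120 ]; then
        # Integer arithmetic, rounded up to the next whole MB.
        echo $(( ($ram_mb * 35 + 99) / 100 ))
    else
        echo 2048
    fi
}

dump_space_mb 4096   # 4 GB of RAM
dump_space_mb 8192   # 8 GB of RAM
```

On a live system the installed memory can be read from the "Memory size" line of prtconf output and passed to the function.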

Configuring the dump device

Crash dumps are enabled by default, and unless the dumpadm command was used to change it, the dump device is the primary swap partition (the first one listed by the swap -l command). If the dump device is a regular partition (begins with /dev/dsk), and is of sufficient size, no further configuration is necessary.

Configuring a DiskSuite-encapsulated swap partition as the dump device

If the swap partitions are encapsulated by DiskSuite, you must use the name of the encapsulated partition, not one of the raw partitions it is made from. The output from dumpadm should look something like this one:

Dump device: /dev/md/dsk/d1 (swap)

If you are using the primary swap partition, use this dumpadm command to configure it:

# dumpadm -d swap

Configuring a Veritas Volume Manager encapsulated swap partition as the dump device

If there is a spare partition with sufficient space, use the "dumpadm -d" command to configure that as the dump device. If the only available space for the dump device is an encapsulated Veritas partition, you must provide the path of the original disk device name, rather than the Veritas encapsulated path name for a dedicated device. For example:

Dump device: /dev/dsk/c6t0d0s1 (dedicated)

Dump device: /dev/vx/dsk/swapvol (dedicated)

* For more info on setting up a dump device see:

Technical Instruction Document 1004803.1 - Collecting System Crash Dump Images on Solaris[TM] 7 and later

Technical Instruction Document 1017485.1 - Determining Approximate Crash Dump File Size
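
To see which of the three cases above applies, the "Dump device" line of dumpadm output can be parsed and classified. The following is a sketch assuming a POSIX-style shell and the output format shown in the examples above; the function names are hypothetical, and the sample line stands in for real dumpadm output.

```shell
#!/bin/sh
# Extract the dump device path from dumpadm output read on stdin.
dump_device() {
    sed -n 's/^ *Dump device: *\([^ ]*\).*$/\1/p'
}

# Classify a dump device path according to the cases described above.
classify_dump_device() {
    case "$1" in
        /dev/dsk/*) echo "regular partition" ;;
        /dev/md/*)  echo "DiskSuite metadevice" ;;
        /dev/vx/*)  echo "Veritas volume" ;;
        *)          echo "unknown" ;;
    esac
}

# Sample line in the format shown above; on a live system use: dumpadm | dump_device
dev=$(echo "Dump device: /dev/md/dsk/d1 (swap)" | dump_device)
echo "$dev: $(classify_dump_device "$dev")"
```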

6) Configure an external loghost for the message files

It can be useful to send the messages logged in /var/adm/messages to an external loghost as well. Do not log ONLY to the external host; always keep logging locally too.

* For information on the syslog mechanism see the following documents on sunsolve:

Technical Instruction Document 1007237.1 - Setting up and debugging logging to remote hosts

Technical Instruction Document 1004455.1 - Working with the Solaris[TM] Operating Environment messaging and logging daemon
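
A quick way to confirm that messages go both to the local file and to a remote host is to inspect the actions in syslog.conf. This sketch assumes a POSIX-style shell, that the action is the last whitespace-separated field of each line, and that remote actions start with "@" as in standard syslog.conf syntax; the function name is hypothetical, and the check is deliberately crude (it does not interpret the m4 macros a Solaris syslog.conf may contain).

```shell
#!/bin/sh
# Succeed only if the given syslog.conf-style file has at least one action
# writing to /var/adm/messages and at least one forwarding to a remote host.
logs_locally_and_remotely() {
    actions=$(grep -v '^#' "$1" | awk 'NF >= 2 { print $NF }')
    echo "$actions" | grep '^/var/adm/messages' >/dev/null || return 1
    echo "$actions" | grep '^@' >/dev/null || return 1
    return 0
}

if [ -r /etc/syslog.conf ]; then
    if logs_locally_and_remotely /etc/syslog.conf; then
        echo "messages are logged locally and forwarded"
    else
        echo "local or remote logging is missing"
    fi
fi
```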

7) When we do not have a stable system

Enable full firmware diagnostics. Change these settings in OBP:

diag-switch = true

diag-script = all

test-args = verbose, subtests

* With diag-script set to all, obdiag tests all devices, including PCI cards, not only those expected in the baseline configuration.

When hangs are expected, enable the deadman kernel. Add these settings in /etc/system:

set snooping=1

set snoop_interval=9000000

* A reboot is necessary to activate the setting

* Enabling the deadman kernel costs performance, so do not leave it enabled by default

* Technical Instruction Document 1004530.1 - KERNEL: How to enable deadman kernel code.
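
Because /etc/system is read at boot, it is worth adding such lines without creating duplicates. The following is a sketch assuming a POSIX-style shell; the function name add_system_setting is hypothetical, and the example deliberately works on a scratch file rather than on /etc/system itself.

```shell
#!/bin/sh
# Append "set name=value" to an /etc/system-style file unless a setting
# for that name is already present.
add_system_setting() {
    file=$1
    name=$2
    value=$3
    if grep "^[[:space:]]*set[[:space:]][[:space:]]*$name[[:space:]]*=" "$file" >/dev/null 2>&1; then
        echo "$name already set in $file"
    else
        echo "set $name=$value" >> "$file"
        echo "added $name=$value to $file"
    fi
}

# Demonstrate on a scratch file; the path is only an example.
scratch=${TMPDIR:-/tmp}/system.$$
: > "$scratch"
add_system_setting "$scratch" snooping 1
add_system_setting "$scratch" snoop_interval 9000000
add_system_setting "$scratch" snooping 1     # second call changes nothing
cat "$scratch"
rm -f "$scratch"
```

Keep a backup copy of /etc/system before editing the real file, so that the system can be booted with the previous settings if needed.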

8) Configure a console loghost

Console messages are especially valuable when a problem occurs but no relevant errors appear in the messages file, because the errors are then likely to be visible only on the console.

For example, a Red state exception is reported only on the console.

Connect a laptop (or other system) to the RSC network port, make a telnet connection to the RSC and capture the console messages.
If the RSC is not in use, capture the console messages from the ttyA serial port instead.

* For more info on setting up console logging:

Technical Instruction 1008702.1 - Console Logging Options to capture Fatal Reset output for Sun systems.

9) What to do when the system hangs

- What exactly is hanging (system, RSC, network)?

a) The main system with Solaris

is it pingable
can you telnet to it
run explorer
check for crashdumps
get status of system LEDs

b) The RSC card

is it pingable
can you telnet to it
get output of:

consolehistory

showenvironment

loghistory

version

- log in to the console (from the RSC)

- when Solaris is up and running

get status of system LEDs
check for crashdumps
run explorer

- when console found on ok-prompt

get status of system LEDs
get output of printenv
issue a sync
run explorer
check for the crashdump

- when no output from the console

get status of system LEDs
turn keyswitch to diagnostics position
check if the console registers this change
perform a sync; if this fails, send an XIR from the rsc> prompt (rsc> xir)

c) The network

is it pingable
telnet/rlogin to system and RSC
try to telnet from a directly attached terminal

* When there is no response from the system at all

Press the system Power button for five seconds.
This causes an immediate hardware shutdown.
Wait at least 30 seconds, then power on the system

10) What to do when the system has panicked and automatically rebooted

check for a crash dump
run explorer

11) What to do when the system has panicked and sits at the ok-prompt

get status of system LEDs
get output of printenv
issue a sync
run explorer
check for the crashdump

12) What information will Sun usually ask for in these situations

Explorer
Crash dump
Console logging

* Run explorer as follows: /opt/SUNWexplo/bin/explorer -w fru,default

The output file of the explorer will be located in /opt/SUNWexplo/output.

* The crash dump consists of two files: unix.(nr) and vmcore.(nr) located in /var/crash/`uname -n`.
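
Before sending a crash dump it helps to confirm that both files of each pair are present. This is a sketch assuming a POSIX-style shell and the default savecore directory named above; the function name list_crash_dumps is hypothetical.

```shell
#!/bin/sh
# List crash dumps in a directory and report whether each unix.N file
# has its matching vmcore.N file.
list_crash_dumps() {
    dir=$1
    for unix_file in "$dir"/unix.*; do
        [ -f "$unix_file" ] || continue      # glob matched nothing
        nr=${unix_file##*.}
        if [ -f "$dir/vmcore.$nr" ]; then
            echo "dump $nr: complete"
        else
            echo "dump $nr: vmcore.$nr missing"
        fi
    done
}

list_crash_dumps "/var/crash/$(uname -n)"
```

If the directory is empty or does not exist, the function prints nothing, which itself indicates that no crash dump was saved.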

SF280R, troubleshooting, setup, OBP, RSC, watchdog, panic, Littleneck, hang
Previously Published As
82171
