Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1624211.1
Update Date:2014-04-22
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  1624211.1 :   Sun Server X2-8 (formerly Sun Fire X4800 M2) Symptoms and Solutions  


Related Items
  • Sun Server X2-8
Related Categories
  • PLA-Support>Sun Systems>x86>Server>SN-x64: SERVER 64bit




In this Document
Purpose
Details
References


Applies to:

Sun Server X2-8 - Version Not Applicable and later
Information in this document applies to any platform.

Purpose

This document outlines the top non-hardware issues our customers encounter on the Sun Server X2-8, with the firmware version that fixes each issue or the workaround.


These issues are outlined in more detail on the internal Sun Server X2-8 (formerly Sun Fire X4800 M2) & Sun Fire X4800 Current Product Issues <Document: 1356060.1> page.

Details

Documentation X2-8 - Patches & Firmware X2-8 - My Oracle Support Community (MOSC) Sun x86 Systems - Oracle System Handbook X2-8
Symptom | Details | Resolution/Workaround
Service Manual has CPU location reversed.

Service Manual 821-0282-10 has CPU location reversed on page 98.

Looking at the CMOD from the front, CPU0 is on the right and CPU1 is on the left.  Note that the server top cover label is correct. The OSH (Oracle System Handbook) also has the positions correct.

NEM hangs midplane I2C bus with SW1.0.1 and SAS FW 5.3.3.3 causing system to be unbootable.

This generates FMA message ID SPX86-8002-QQ, which stops the system from booting until the FMA fault is cleared in the SP.

This is an I2C issue, so symptoms include the SP not showing proper environmental data such as temperature data.  NEMs may also log inserted or removed messages when they were not actually removed.  In the worst case the system becomes unbootable if the initial configuration check for supported configurations fails.

The fix is to upgrade the system to firmware release SW1.1, which includes NEM FW 5.3.3.4 for the SAS expanders.  Updating the NEM SAS expanders to FW 5.3.3.4 is a manual process; they are not updated automatically when ILOM/BIOS is updated.  Please see the Product Notes for directions on updating them.
Applications hang in S10u8/u9, and during periods of application hang the clock goes backwards.

Nehalem deeper C-states cause erratic scheduling behaviour.  Whether using S10u8 or S10u9, the applications hang for periods of time, and while they are hung the OS clock goes backwards in time.  The system never hard hangs and the OS never completely hangs; users can always access the OS and run commands, but the applications hang.  The applications recover on their own, but once the hangs start they become progressively more frequent, to the point that Veritas Cluster starts complaining.

Workarounds:

1. Set the tunable below in the /etc/system file and reboot.

set idle_cpu_no_deep_c=1


2. If the first workaround does not work, disable C-states in the BIOS. This is a more robust fix and has solved all cases to date:

In the BIOS, go to CPU Configuration and scroll down to "Intel(R) C-State tech", then set it to Disabled (excerpt included).

* Intel(R) SpeedStep(tm) tech     [Enabled]
* Intel(R) TurboMode tech     [Enabled]
* Intel(R) C-State tech     [Disabled]
* ACPI T State     [Enabled]

The fix is in Solaris 10 with patch 144489-17 or later, and in Solaris 11 Express based on build snv_165 or later.
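To confirm that workaround 1 is actually in place, the /etc/system contents can be scanned for an active setting of the tunable. The following is a minimal sketch only: the function name is hypothetical, and it assumes that comment lines begin with "*" and that the last active entry is the one that takes effect.

```python
import re

# Matches an active "set idle_cpu_no_deep_c=<value>" line. Comment lines in
# /etc/system start with "*", so they never match this pattern.
_TUNABLE_RE = re.compile(
    r"^\s*set\s+idle_cpu_no_deep_c\s*=\s*(\S+)", re.MULTILINE
)


def deep_c_state_workaround_set(etc_system_text):
    """Return True if the last active idle_cpu_no_deep_c setting equals 1.

    Hypothetical helper for illustration; it only inspects the text passed
    in and does not query the running kernel.
    """
    values = _TUNABLE_RE.findall(etc_system_text)
    if not values:
        return False
    try:
        # Base 0 accepts both decimal ("1") and hex ("0x1") spellings.
        return int(values[-1], 0) == 1
    except ValueError:
        return False


if __name__ == "__main__":
    sample = "* C-state workaround\nset idle_cpu_no_deep_c=1\n"
    print(deep_c_state_workaround_set(sample))
```

The check only shows that the file contains the setting; a reboot is still required for it to take effect.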

Memory faults not lighting fault LEDs or creating FMA faults even though memory is disabled.

While installing X2-8/X4800 we noticed a DB couldn't boot, see Top Issue #08, due to memory being faulted and disabled.  We are at the beginning of the investigation, so it is not yet known whether upgrading to SW1.2.1 would resolve this issue, as X2-8 Database Servers currently ship with SW1.1 installed.  One symptom is that the fault service LEDs all show "na" or OFF instead of ON. The only indications are that the OS doesn't see the correct memory and that the host_debug_err.log shows the fault.

Fixed in G5 SW1.2.1 and G5+ SW1.1.
X2-8 with memory faults becomes unbootable due to NUMA setting being enabled on SMP platforms.

While installing X2-8/X4800 we noticed a DB couldn't boot due to memory being faulted and disabled.  We discovered that because the X4800 is a symmetric multiprocessing (SMP) system, the Non-Uniform Memory Access (NUMA) setting is enabled.  When memory fails and is disabled, this setting prevents the Database from booting, which takes the whole DB out until the memory is replaced.

Workaround is to disable the NUMA setting so the DB can boot until the memory failure can be addressed.  Fixed with Bundle Pack 6 release.

Stability and fibre channel storage performance issues when running Windows with QLogic HBA PEMs (371-4522 MetisQ) installed.


Errors similar to the following are observed in the ILOM event logs :

The suspect component:
/SYS/BL3 has fault.io.ioh.core.fatal with probability=100. Refer to
http://www.sun.com/msg/SPX86-8001-52 for details.

Critical Interrupt : BIOS : Bus Fatal IOH Core
Error: IOH 2 Error 6


Usually the blade or CPU will be marked as faulty, and in some cases multiple parts. Performance of copying files to/from the external storage via the QLogic cards will usually be extremely poor.

The problem is due to ASPM (L0s) getting set during the Windows ACPI handling and/or QLogic driver installation. At that point bus errors occur. The BIOS should disable ASPM (Active State Power Management) in the ACPI table.
Workaround :
Disabling ASPM support in Windows should fix the stability and performance issues.

In Windows:
Under Control Panel -> Power Options -> Advanced Settings - There is PCI Express -> Link State Power Management which can be turned off.

Follow these steps to disable PCI Express Active-State Power Management:

1. Open the Control Panel in the Start menu.
2. Open Power Options in the Control Panel.
Note: If Power Options is not available, change View by to Large icons at the top right of the Control Panel.
3. Select Change plan settings next to the power plan you want to set.
4. Select Change advanced power settings.
5. Select "PCI EXPRESS".
6. Under PCI EXPRESS, expand "Link State Power Management", select Off, and then click OK to save the changes.

Final fix is in SW1.3 which includes G5 BIOS build 11016600.

*Note:* Does not affect MetisE (Emulex FC HBA PEMs).

Seeing ixgbe link down errors with no FEM installed in Solaris.

Below are examples of the messages being seen:

Jun 24 04:00:33 xrtp710 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe1: link down
Jun 24 04:00:33 xrtp710 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe0: link down
Jun 24 04:00:36 xrtp710 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe1: link down
Jun 24 04:00:36 xrtp710 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe0: link down


We are seeing these link down messages without the Niantic FEM X4871A-Z (pn# 375-3648) being installed.

When we mapped the device path of the messages it turned out that the customer had a Niantic PEM card installed.  When Explorer is run it automatically probes all devices including ixgbe causing the link down messages.  These messages are expected when running Explorer and were only seen when running Explorer.
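To check whether the messages line up with Explorer runs, it can help to group the link down entries by interface and timestamp. The sketch below is illustrative only: the function name is hypothetical and the pattern assumes the message layout shown in the samples above.

```python
import re
from collections import defaultdict

# Assumes the syslog layout shown in the samples in this document:
# "<Mon> <day> <hh:mm:ss> <host> ixgbe: [...] NOTICE: ixgbeN: link down"
_LINK_DOWN_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+ixgbe:"
    r".*NOTICE:\s+(?P<ifname>ixgbe\d+): link down"
)


def link_down_times(lines):
    """Return {interface: [timestamps]} for every ixgbe link down message.

    Hypothetical helper; lines that do not match the pattern are ignored.
    """
    events = defaultdict(list)
    for line in lines:
        match = _LINK_DOWN_RE.match(line)
        if match:
            events[match.group("ifname")].append(match.group("ts"))
    return dict(events)


if __name__ == "__main__":
    sample = [
        "Jun 24 04:00:33 xrtp710 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe1: link down",
        "Jun 24 04:00:33 xrtp710 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe0: link down",
        "Jun 24 04:00:36 xrtp710 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe1: link down",
        "Jun 24 04:00:36 xrtp710 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe0: link down",
    ]
    for ifname, stamps in sorted(link_down_times(sample).items()):
        print(ifname, stamps)
```

If every timestamp falls inside a window when Explorer was running, the messages match the expected behavior described above.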
Disk amber LEDs are lit on all disks.

The issue occurs when SATA drives, which are not dual ported, are installed: the LSI REM marks the drives as faulty because they do not have a secondary path to the disk. The fix is in firmware release SW1.2 for X2-8.
OHIA ISO image won't boot on X2-8.

On the X4800 M2, the OHIA image won't boot unless the x2APIC option is disabled in BIOS.

Solution:

Go to Main->CPU Configuration and disable x2apic.  After installation the x2APIC option can be enabled again depending on what OS is being installed.

X4800 PSx/S0/V_OUT_OK sensor state toggles between asserted and deasserted periodically in Solaris OS /var/adm/messages file.


These errors are seen when the system is up and running with no issues with power:

Aug 24 02:56:40 xrtp701 ipmievd: [ID 702911 daemon.notice] Power Supply sensor PS2/V_OUT_OK State Deasserted
Aug 24 02:56:40 xrtp701 ipmievd: [ID 702911 daemon.notice] Power Supply sensor PS3/V_OUT_OK State Deasserted



 Fix is in SW1.2 for X2-8
The BIOS setting "Restore on AC power loss" is not working correctly.

The default setting for "Restore on AC power loss" is "Last State".  With this setting, if you power off the system and then remove AC power, the system should not power on when AC power is restored.  With the two firmware versions tested (SW1.1.1 and SW1.1) this is not the case: the system powers up regardless of what the previous state was.

This has been root caused to a BIOS issue.

 The fix is to install SW1.2 or higher.
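The expected behavior can be written down as a small decision function, which makes the faulty case easy to see. This is a sketch only: the function name is hypothetical, and the "Power On" and "Power Off" policy names are assumptions, since this document only mentions "Last State".

```python
def should_power_on_after_ac_restore(policy, was_on_before_loss):
    """Return True if the host is expected to power on when AC returns.

    Hypothetical model of the "Restore on AC power loss" setting. Only
    "Last State" comes from this document; the other two policy names are
    assumed for illustration.
    """
    normalized = policy.strip().lower()
    if normalized == "last state":
        # The host should return to whatever state it was in before the loss.
        return was_on_before_loss
    if normalized == "power on":
        return True
    if normalized == "power off":
        return False
    raise ValueError("unknown policy: %r" % policy)


if __name__ == "__main__":
    # Host was powered off before AC was removed, policy is the default.
    expected = should_power_on_after_ac_restore("Last State", False)
    observed = True  # behavior reported with SW1.1 and SW1.1.1
    print("expected power on:", expected, "observed power on:", observed)
```

With the affected firmware the observed result is always power on, which disagrees with the expected result whenever the host was off before the power loss.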

Receiving large numbers of ixgbe link down messages on unconfigured, empty 10GbE ports running Solaris 10.

Some symptoms on this issue are that you will see link down messages like below:

Dec 5 09:21:08 itghkdev98 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe7: link down
Dec 5 09:21:18 itghkdev98 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe7: link down
Dec 5 09:22:14 itghkdev98 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe6: link down


 

We found that these systems were either missing these two patches or had them at a lower revision:

148329-03 SunOS 5.10_x86: dladm patch
148323-08 s10_x86 ixgbe patch

Once these two patches were updated the issue was resolved across 11 systems.
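When many systems need checking, the comparison against these two patch IDs can be scripted. The sketch below is illustrative: the function names are hypothetical, the list of installed patch IDs is supplied by the caller, and the listed revisions are treated as minimum levels, which is an assumption.

```python
# Minimum revisions taken from the two patches listed above; treating them
# as minimums (rather than exact levels) is an assumption.
REQUIRED_PATCHES = {"148329": 3, "148323": 8}


def parse_patch_id(patch_id):
    """Split an ID such as "148323-08" into ("148323", 8)."""
    base, _, rev = patch_id.strip().partition("-")
    if not (base.isdigit() and rev.isdigit()):
        raise ValueError("not a patch ID: %r" % patch_id)
    return base, int(rev)


def missing_or_down_rev(installed_ids, required=REQUIRED_PATCHES):
    """Return {base: installed_rev or None} for patches below the minimum.

    Hypothetical helper; installed_ids is any iterable of "NNNNNN-RR"
    strings gathered by the caller.
    """
    installed = {}
    for patch_id in installed_ids:
        base, rev = parse_patch_id(patch_id)
        # Keep the highest revision seen for each base patch number.
        installed[base] = max(rev, installed.get(base, -1))
    problems = {}
    for base, minimum in required.items():
        have = installed.get(base)
        if have is None or have < minimum:
            problems[base] = have
    return problems


if __name__ == "__main__":
    print(missing_or_down_rev(["148329-03", "148323-07"]))
```

An empty result means both patches are at or above the listed revisions; a value of None means the patch was not found at all.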


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.