Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2156884.1
Update Date:2018-04-09
Keywords:

Solution Type  Technical Instruction Sure

Solution  2156884.1 :   How to Troubleshoot Event SPX86-8003-RR for Exadata with Unpublished Bug 22727539 fault_state(0x0d04)!  


Related Items
  • Exadata X4-2 Hardware
  • Exadata X3-8b Hardware
  • Exadata X4-8 Hardware
  • Exadata X3-2 Hardware
Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: EXADATA


In this Document
Goal
Solution
References


Applies to:

Exadata X4-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X4-8 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X3-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X3-8b Hardware - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Goal

An Exadata storage cell (X3-2L or X4-2L) resets and reports the event SPX86-8003-RR IIO PCIE Fatal Error in the ILOM. This document helps identify whether the event was caused by unpublished bug 22727539 and explains how to resolve the problem.

Solution




When the event SPX86-8003-RR occurs, the storage cell reboots. Determine whether the event is due to unpublished Bug 22727539.

Collect an ILOM snapshot (Doc ID 1062544.1) and a sundiag collection (Doc ID 761868.1).

 

1. Begin with the ILOM snapshot.

i) Look in the file fma/@persist@faultdiags@faults.log

Check whether the following type of event is reported against a PCI slot containing a flash card.

2016-06-19/03:36:54 e01dbe58-56a3-e4d6-d00d-d4634492c113 SPX86-8003-RR

timestamp ereports
2016-06-19/03:36:54 ereport.io.intel.iio.pcie-completion-timeout-on-np-transactions@/sys/mb/p0/iio/dev02/fn2

fault = fault.io.intel.iio.pcie-fatal@/SYS/MB/PCIE4
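
The faults.log search can be scripted. The following is a minimal sketch, not an Oracle tool: the function name list_rr_events is hypothetical, and it assumes the ILOM snapshot has been unpacked into a directory that contains the fma/ subdirectory named in this step.

```shell
#!/bin/sh
# Sketch only: list_rr_events is a hypothetical helper, not part of any Oracle tool.
# $1: directory of the unpacked ILOM snapshot (assumed to contain fma/).
list_rr_events() {
    faults_log="$1/fma/@persist@faultdiags@faults.log"
    [ -r "$faults_log" ] || { echo "cannot read $faults_log" >&2; return 2; }
    # Event header lines carry the message ID; fault lines name the PCIe slot.
    grep -E 'SPX86-8003-RR|fault\.io\.intel\.iio\.pcie-fatal@' "$faults_log"
}
```

Calling list_rr_events with the snapshot directory prints the matching event and fault lines. The slot named in each fault line (for example /SYS/MB/PCIE4) still has to be compared by hand against the slots that hold flash cards.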

 

An entry of this kind in faults.log confirms that event SPX86-8003-RR occurred. Next, check whether it is due to unpublished bug 22727539.

 

ii) Look in the file ilom/@persist@hostconsole.log and check whether the following message is reported before the host resets.

mpt2sas2: fault_state(0x0d04)!

It may be necessary to look in ilom/@persist@hostconsole.log.1 for this message.

If the SPX86-8003-RR event is seen in faults.log and the message "mpt2sas2: fault_state(0x0d04)!" is reported in hostconsole.log, then the storage cell may have encountered unpublished bug 22727539.
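
The console log check can be sketched in the same way. The function name has_fault_state_0d04 is hypothetical. The sketch matches only the text "fault_state(0x0d04)!" rather than the full "mpt2sas2:" prefix, on the assumption that the controller index could differ between systems.

```shell
#!/bin/sh
# Sketch only: has_fault_state_0d04 is a hypothetical helper.
# $1: directory of the unpacked ILOM snapshot (assumed to contain ilom/).
# Returns 0 if the message is found in hostconsole.log or hostconsole.log.1.
has_fault_state_0d04() {
    snap="$1"
    for f in "$snap/ilom/@persist@hostconsole.log" \
             "$snap/ilom/@persist@hostconsole.log.1"; do
        [ -r "$f" ] || continue
        # -F matches the text literally, so the parentheses are not regex syntax.
        if grep -qF 'fault_state(0x0d04)!' "$f"; then
            return 0
        fi
    done
    return 1
}
```

A zero exit status only shows that the message is present somewhere in the log. Whether it was logged before the host reset still has to be confirmed by reading the log.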

 

2. Confirm the problem by checking the sundiag collection.

i) Check and make a note of the image version by viewing imageinfo-all.out

Example: Active image version: 12.1.2.1.3.151021
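
Reading the version can be scripted. This is a minimal sketch with hypothetical function names. It assumes the file contains a line in the form shown in the example above.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical.
# $1: path to imageinfo-all.out from the sundiag collection.
active_image_version() {
    # Assumed line format: "Active image version: 12.1.2.1.3.151021"
    sed -n 's/^[[:space:]]*Active image version:[[:space:]]*//p' "$1" | head -n 1
}

# Print only the first five dot-separated fields, e.g. 12.1.2.1.3,
# which is the form used in the patch list in this document.
image_release() {
    active_image_version "$1" | cut -d. -f1-5
}
```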

 

ii) Check the firmware version on the F40 or F80 flash card.

Change to the cell directory within the sundiag and unpack the file lsidiag-xxxxx-xxxxx-xxxxx-min.tz2

For example:

tar -xf lsidiag-exastoracel02-20160627-045657-min.tz2

 

Change to the unpacked directory and view the file ddcli-listall.txt

The bug applies when the previous checks in the snapshot match and the F40 or F80 Flash Accelerator module reports the following firmware.

F40 will report firmware 09.05.42.00
F80 will report firmware 09.05.43.00

Example for F40:

ID  WarpDrive       Package Version  PCI Address
--  ---------       ---------------  -----------
1   ELP-4x100-4d-n  09.05.42.00      00:20:00:00
2   ELP-4x100-4d-n  09.05.42.00      00:30:00:00
3   ELP-4x100-4d-n  09.05.42.00      00:90:00:00
4   ELP-4x100-4d-n  09.05.42.00      00:b0:00:00

 

Example for F80:

ID  WarpDrive       Package Version  PCI Address
--  ---------       ---------------  -----------
1   ELP-4x100-4d-n  09.05.43.00      00:20:00:00
2   ELP-4x100-4d-n  09.05.43.00      00:30:00:00
3   ELP-4x100-4d-n  09.05.43.00      00:90:00:00
4   ELP-4x100-4d-n  09.05.43.00      00:b0:00:00
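
The comparison against the two affected versions can be sketched with awk. The function name flag_affected_firmware is hypothetical. The sketch assumes ddcli-listall.txt uses the whitespace-separated column layout shown in the examples, with the package version in the third column.

```shell
#!/bin/sh
# Sketch only: flag_affected_firmware is a hypothetical helper.
# $1: path to ddcli-listall.txt from the unpacked lsidiag archive.
# Prints "<ID> <package version> <AFFECTED|ok>" for each card row.
flag_affected_firmware() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
        # 09.05.42.00 (F40) and 09.05.43.00 (F80) are the versions named in this document.
        status = ($3 == "09.05.42.00" || $3 == "09.05.43.00") ? "AFFECTED" : "ok"
        print $1, $3, status
    }' "$1"
}
```

Header and separator lines are skipped because their first field is not numeric.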

 

The firmware can also be checked on the cell with the following command:

/opt/oracle.SupportTools/CheckHWnFWProfile -action list -component Flash | grep -i 'cardfw' | uniq

<CardFw FIRMWARE_ID="1" VALUE="09.05.42.00"/>
<CardFw FIRMWARE_ID="2" VALUE="09.05.42.00"/>
<CardFw FIRMWARE_ID="4" VALUE="09.05.42.00"/>
<CardFw FIRMWARE_ID="5" VALUE="09.05.42.00"/>

This example shows F40

 

/opt/oracle.SupportTools/CheckHWnFWProfile -action list -component Flash | grep -i 'cardfw' | uniq

<CardFw FIRMWARE_ID="1" VALUE="09.05.43.00"/>
<CardFw FIRMWARE_ID="2" VALUE="09.05.43.00"/>
<CardFw FIRMWARE_ID="4" VALUE="09.05.43.00"/>
<CardFw FIRMWARE_ID="5" VALUE="09.05.43.00"/>

This example shows F80
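
Because each CardFw line carries a different FIRMWARE_ID, uniq does not collapse the lines in the output above. To see only the distinct firmware versions, the output can be piped through a small filter. This is a sketch: cardfw_versions is a hypothetical name, and it assumes the CardFw line format shown in the examples.

```shell
#!/bin/sh
# Sketch only: cardfw_versions is a hypothetical filter.
# Reads CheckHWnFWProfile output on stdin and prints each distinct
# VALUE attribute found on a <CardFw .../> line.
cardfw_versions() {
    sed -n 's/.*<CardFw[^>]*VALUE="\([^"]*\)".*/\1/p' | sort -u
}
```

Usage on the cell, with the command taken from this document:

/opt/oracle.SupportTools/CheckHWnFWProfile -action list -component Flash | cardfw_versions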

 

The FMOD will show firmware UI03

# cellcli -e list physicaldisk attributes makeModel , physicalFirmware where diskType = FlashDisk

"Sun Flash Accelerator F40 PCIe Card" UIO3

or

"Sun Flash Accelerator F80 PCIe Card" UIO3

 

 

Resolution to the problem

The README files of the patches that fix this problem do not list unpublished bug 22727539, because the firmware fix is delivered as part of the resolution to critical issue EX28.

The problem is resolved by applying either a patch or an image update, as follows:

i) From any earlier image, update to full image 12.1.2.3.1 or higher.

If an update to image 12.1.2.3.1 is not possible, one of the following options must be applied to resolve the problem.

ii) If running image 12.1.2.3.0, apply the interim fix from the 12.1.2.3.0 unpublished patch # 21749993.

The patch is in two parts. Part 1 contains a set of rpm files and is described as the interim fix. Part 2 contains a full ISO image and is described as the full release component.

iii) If running image 12.1.2.2.2 and the system needs to stay at this version, apply the interim fix for 12.1.2.2.2, unpublished patch # 24306258.
Or update the image to a higher release using the full release component of the patch.

iv) If running image 12.1.2.2.1 and the system needs to stay at this version, apply the interim fix for 12.1.2.2.1, unpublished patch # 22106928.
Or update the image to a higher release using the full release component of the patch.

v) If running image 12.1.2.2.0 and the system needs to stay at this version, apply the interim fix for 12.1.2.2.0, unpublished patch # 22086811.
Or update the image to a higher release using the full release component of the patch.

vi) If running image 12.1.2.1.3 and the system needs to stay at this version, apply unpublished patch # 23263418.
Or update the image to a higher release using the full release component of the patch.

vii) If running image 12.1.2.1.2 and the system needs to stay at this release, no patch is currently available (no MLR patch available yet; see unpublished bug 23257267).
Therefore, update the image to a higher release using the full release component of the patch.

viii) If running image 12.1.2.1.1 and the system needs to stay at this release, apply unpublished patch # 23193769.
Or update the image to a higher release using the full release component of the patch.

 

For any earlier image version, contact the software support specialist (EEST) for guidance.

Before applying any update, check that the storage cells are not currently exposed to Exadata critical issue EX17 by reviewing MOS Doc 1968234.1.

 Internal Patch details

12.1.2.3.2 - unpublished Patch 23200959

12.1.2.3.1 - unpublished Patch 24306177

12.1.2.3.0 - unpublished Patch 21749993

12.1.2.2.3 - unpublished Patch 23217781

12.1.2.2.2 - unpublished Patch 24306258

12.1.2.2.1 - unpublished Patch 22106928

12.1.2.2.0 - unpublished Patch 22086811

12.1.2.1.3 - unpublished Patch 23263418

 

Images 12.1.2.2.3, 12.1.2.3.1 and 12.1.2.3.2 already contain the correct firmware to avoid this bug; the patches above are listed for reference when updating from a lower image.

Images 12.1.2.1.3, 12.1.2.2.0, 12.1.2.2.1, 12.1.2.2.2 and 12.1.2.3.0 require the respective patches above to resolve the bug.
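
The image-to-patch mapping in this document can be captured as a lookup. This is a sketch only: patch_for_image is a hypothetical name, and the patch numbers are copied from the lists in this document, with nothing added from other sources.

```shell
#!/bin/sh
# Sketch only: patch_for_image is a hypothetical helper.
# $1: image version, with or without the trailing build date
#     (e.g. 12.1.2.1.3 or 12.1.2.1.3.151021).
# Prints the unpublished patch number listed in this document, or returns 1.
patch_for_image() {
    rel=$(printf '%s' "$1" | cut -d. -f1-5)
    case "$rel" in
        12.1.2.3.2) echo 23200959 ;;  # image already contains the fixed firmware
        12.1.2.3.1) echo 24306177 ;;  # image already contains the fixed firmware
        12.1.2.3.0) echo 21749993 ;;
        12.1.2.2.3) echo 23217781 ;;  # image already contains the fixed firmware
        12.1.2.2.2) echo 24306258 ;;
        12.1.2.2.1) echo 22106928 ;;
        12.1.2.2.0) echo 22086811 ;;
        12.1.2.1.3) echo 23263418 ;;
        12.1.2.1.1) echo 23193769 ;;
        12.1.2.1.2) echo "no patch listed for $rel; update to a higher release" >&2
                    return 1 ;;
        *)          echo "image $rel is not covered by this document" >&2
                    return 1 ;;
    esac
}
```

For example, patch_for_image 12.1.2.1.3.151021 prints 23263418.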

 

 

Once the patches have been applied, the flash card firmware will report a 13.05.xx.xx version, as listed below, which confirms that the firmware that resolves this problem is installed.

The FMOD firmware will change to UI06.

F40 will report firmware 13.05.10.00 or 13.05.10.01, depending on the patch applied.
F80 will report firmware 13.05.11.00 or 13.05.11.01, depending on the patch applied.
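
A post-update check against these versions can be sketched as follows. The function name is_fixed_firmware is hypothetical, and the accepted versions are only the four listed above.

```shell
#!/bin/sh
# Sketch only: is_fixed_firmware is a hypothetical helper.
# $1: flash card firmware version as reported on the cell.
# Returns 0 for the fixed versions named in this document, 1 otherwise.
is_fixed_firmware() {
    case "$1" in
        13.05.10.00|13.05.10.01) return 0 ;;  # F40
        13.05.11.00|13.05.11.01) return 0 ;;  # F80
        *) return 1 ;;
    esac
}
```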

 

 

References:

Diagnostic Information for ILOM, ILO , LO100 Issues (Doc ID 1062544.1)

Oracle Exadata Diagnostic Information required for Disk Failures and some other Hardware issues (Doc ID 761868.1)

Following software upgrade on X3 hardware with Exadata Smart Flash Cache Compression enabled, multiple flash drives may fail,
leading to reduced performance or data loss (Doc ID 1968234.1)


  Copyright © 2018 Oracle, Inc.  All rights reserved.