Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Predictive Self-Healing Sure Solution
Document 1525251.1: SPSUN4V-8001-32 - Unexpected IO fault
Applies to:
SPARC T5-2, SPARC T5-4, SPARC T5-8, SPARC M7-8, SPARC M7-16
Information in this document applies to any platform.
Purpose
Provide additional information for message ID: SPSUN4V-8001-32
Details
Type: Hardware Fault
Fault class: fault.cpu.SPARC-T5.remote-pio-to
Severity: Critical
Description: An unexpected IO fault has occurred.
Automated Response: No automated response.
Impact: Loss of service; the host system will be powered down.
Suggested Action for System Administrator: An unexpected IO fault has occurred. Contact Oracle service for support.
Internal Only Section
Revision 1.5
Data Gathering for Remote PIO timeout issue
This issue is of a sensitive nature and requires appropriate confidentiality:
it should be discussed with customers on an individual basis, and only once
the symptoms being exhibited have been confirmed.
Applies to:
SPARC T5-2
SPARC T5-4
SPARC T5-8
Procedure for further investigation
The following procedure describes the essential steps for gathering the key
information needed to investigate a remote PIO timeout issue that may occur
on T5 systems.
The process is split into four sections:
1) Initial confirmation that the symptom being observed matches this issue.
2) Instrumentation to be configured on a system exhibiting the symptom.
3) Information to be relayed to the customer regarding detection once instrumented.
4) Engagement of the appropriate product groups once diagnostics from the
instrumentation have been gathered.
Initial confirmation of symptom being observed
==============================================
-- Task to be performed by any Oracle support engineer
1) Ensure that the symptoms/events reported match the following...
When a Remote PIO Timeout occurs on a system, the following two events will
take place:
i) A fatal error will be reported and the HOST system will be powered off
shortly afterwards, giving the impression the HOST system has hung.
ii) Once the HOST system has been powered on and Solaris has booted, a
fault of type "fault.cpu.SPARC-T5.remote-pio-to" will be recorded
within FMA.
The command 'fmadm faulty' can be used to confirm the appropriate fault has
been logged
i.e.
# fmadm faulty
------------------- ------------------------------------ -------------- --------
Time UUID msgid Severity
------------------- ------------------------------------ -------------- --------
2012-11-10/01:46:33 9116b25e-3c52-457b-93fd-bdb94e7a6f16 SPSUN4V-8001-32 Critical
Problem Status : solved
Diag Engine : fdd 1.0
System
Manufacturer : Oracle Corporation
Name : SPARC T5-8
Part_Number : unknown
Serial_Number : unknown
----------------------------------------
Suspect 1 of 1
Fault class : fault.cpu.SPARC-T5.remote-pio-to
Certainty : 100%
Affects : /SYS/PM3/CM1/CMP
Status : faulted but still in service
FRU
Status : faulty
Location : /SYS/PM3
Manufacturer : Oracle Corporation
Name : TLA,PM,B2.0/R,3.6.T5
Part_Number : 7054493
Revision : 02
Serial_Number : 465769T+123585002K
Chassis
Manufacturer : Oracle Corporation
Name : SPARC T5-8
Part_Number : 7041616
Serial_Number : 1204BDC013
There will also be a corresponding FMA ereport logged.
i.e.
# fmdump -eV
2012-11-09/23:24:19 ereport.cpu.SPARC-T5.remote-pio-to@/SYS/PM0/CM1/CMP
__tod-0 = 0x509d9009
__tod-1 = 0x2270ca20
diagnose = 0x1
tstate = 0x82000006
htstate = 0x0
tpc = 0x0
tl = 0x6
tt = 0x1
ps-g-ffesr = 0x800000000000
ps-g-fesr = 0x800000003d80
ps-ncu-pesr = 0x8200
ps-ncu-esr = 0x9000000000004000
ps-ncu-iesr1 = 0x200000000000
ps-ncu-iesr2 = 0x0
ps-ncu-iesr3 = 0x0
ps-ncu-iesr4 = 0x0
ps-nesr = 0x9000000000004000
Once confirmed, ensure this SR is escalated and owned by a TSC L2 Backline
engineer for further detailed investigation and system instrumentation.
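As a quick cross-check, the standard FMA utilities can pull the verbose
detail for the logged fault and list the matching ereports (a minimal sketch;
substitute the UUID actually reported by 'fmadm faulty' -- the one below is
taken from the example output above):
# fmdump -v -u 9116b25e-3c52-457b-93fd-bdb94e7a6f16
# fmdump -e | grep remote-pio-to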
Instrumentation to be configured on a system exhibiting the symptom
===================================================================
-- Task to be performed by TSC L2 Backline Engineers
1) Engage with the customer and obtain a detailed overview of the system
configuration. This needs to cover both h/w and s/w.
i) For h/w, a detailed description of the system needs to be gathered.
Focus on all PCIe cards, especially 3rd-party ones, and establish how
the cards are used.
ii) For s/w, we need a detailed description of what applications run on
the system, with as much detail on each application's purpose and
function as can be gathered, especially around the failure time.
iii) An Explorer and an ILOM snapshot (using the 'normal' dataset) of the
system must be gathered; a sketch of the host-side Explorer run follows
this list. Details on obtaining a snapshot can be found via MOS
article 1020204.1 (https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1020204.1&h=Y)
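For the host-side Explorer collection, the run is typically along these lines
(a minimal sketch, assuming the Oracle Explorer Data Collector is installed
in its default /opt/SUNWexplo location; the ILOM snapshot procedure itself is
described in MOS article 1020204.1):
# /opt/SUNWexplo/bin/explorer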
In addition, try to determine with the customer what events were taking
place around the time of the outage (use of analytical troubleshooting
techniques will help).
2) Attach the customer details to Oracle bug 16203743 to raise awareness
that the symptom has been observed in the field and that instrumentation is
being provided.
3) Engage with the customer to further instrument the system. This will
require configuring the system such that, upon a fatal error, it will
automatically capture additional data from the host (a full system scandump).
In ILOM, ensure the following properties are set as follows:
cd /HOST
set state_capture_on_error=enabled
set state_capture_mode=fatal_scandump
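The new values can then be verified from the same ILOM session ('show' with a
target and property name is standard ILOM CLI usage):
-> show /HOST state_capture_on_error
-> show /HOST state_capture_mode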
NOTE
As before, owing to TPM being enabled by default on these systems, the HOST
system, upon detecting a fatal error, will be powered down shortly after the
event, giving the impression the system has hung.
Please advise the customer NOT to close the SR as this is an ongoing
investigation.
Information to be relayed to customer regarding detection once instrumented
===========================================================================
-- Task to be provided to the customer by TSC L2 Backline Engineer
The system has been instrumented so as to capture further diagnostics should
the event reoccur.
When a fatal error occurs, the system will automatically capture key data,
and the HOST system will appear to hang.
Should the system appear to have hung with the instrumentation applied,
please first confirm whether a scandump has been taken.
This can be performed by logging in to the Service Processor (SP) and viewing
the SP event log i.e.
-> cd /HOST/logs/events/list
-> show
7331 Mon Jan 21 15:43:43 2013 System Log minor
Host: Error Standby
7330 Mon Jan 21 15:43:21 2013 Fault Fault critical
Fault detected at time = Mon Jan 21 15:43:21 2013. The suspect component: /SYS has fault.cpu.generic-sparc.inconsistent with probability=100.
Refer to http://support.oracle.com/msg/SPSUN4V-8000-P3 for details.
7329 Mon Jan 21 15:43:06 2013 HOST Log critical
scandump data has been gathered
7328 Mon Jan 21 15:40:26 2013 HOST Log critical
Fatal polled has occurred. scandump data is being gathered.
7327 Mon Jan 21 15:30:25 2013 System Log minor
Host: OpenBoot Running
If the messages are similar to the above, it indicates that scandump data
has been gathered, so perform the following...
i) Reapply power to the HOST using the ILOM command
"start /SYS"
ii) Obtain an ILOM snapshot using the 'normal' dataset, as sketched after
this list (details on obtaining a snapshot can be found via MOS article
1020204.1,
https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1020204.1&h=Y)
iii) Contact Oracle Support quoting your initial Service Request (SR) number
and supply the engineer with a copy of the ILOM snapshot captured for
further analysis.
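For reference, initiating the snapshot from the ILOM CLI generally looks like
the following (a sketch only; the sftp user, host and directory are
placeholders, and MOS article 1020204.1 remains the authoritative procedure):
-> set /SP/diag/snapshot dataset=normal
-> set /SP/diag/snapshot dump_uri=sftp://user@host/directory
-> show /SP/diag/snapshot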
Should the SP event log NOT indicate that a scandump has been gathered at the
time of the failure, please contact Oracle Support for further analysis.
Engagement of appropriate product groups once instrumentation gathered
=======================================================================
-- Task to be performed by TSC L2 Backline Engineers
1) Upon receiving a copy of the snapshot from the customer:
i) Verify the scandump files reside in the ILOM snapshot
ls <snapshotdir>/ilom/conf/traces/*scandump*
ii) Email the Beehive workgroup
"remote-pio-to-analysis_ww_grp@oracle.com" with the following template:
Customer Name:
SR Number:
TSC L2 Backline Engineer Name:
Location of fmdump data:
Location of explorer data:
Location of snapshot containing scandump:
2) Await a response from a member of the Beehive group for further direction.
NOTE FOR TSC:
System instrumentation can be returned to its 'default' behaviour by setting
the ILOM variable state_capture_mode=default
i.e.
cd /HOST
set state_capture_mode=default
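The reverted value can be confirmed with the same ILOM 'show' usage as above:
-> show /HOST state_capture_mode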
Attachments
This solution has no attachment