Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Predictive Self-Healing Sure
Solution 1525251.1: SPSUN4V-8001-32 - Unexpected IO fault
Applies to:
SPARC T5-2
SPARC T5-4
SPARC T5-8
SPARC M7-8
SPARC M7-16
Information in this document applies to any platform.

Purpose
Provide additional information for message ID: SPSUN4V-8001-32

Details
Type: Hardware Fault
Fault class: fault.cpu.SPARC-T5.remote-pio-to
Severity: Critical

Description
An unexpected IO fault has occurred.

Automated Response
No automated response.

Impact
Loss of service; the host system will be powered down.

Suggested Action for System Administrator
An unexpected IO fault has occurred. Contact Oracle Service for support.
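Before contacting Oracle, an administrator can confirm that this diagnosis has actually been logged on the host. A minimal check sketch, assuming a Solaris host with FMA running (the grep pattern simply matches the message ID quoted above):

# fmadm faulty | grep SPSUN4V-8001-32

If a line is returned, run 'fmadm faulty' without the filter to obtain the full suspect list, including the FRU location, before opening the Service Request.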
Internal Only Section
Revision 1.5

Data Gathering for Remote PIO Timeout Issue

This issue is of a sensitive nature and requires appropriate confidentiality: it should be discussed with customers on an individual basis, and only once the symptoms being exhibited have been confirmed.

Applies to
SPARC T5-2
SPARC T5-4
SPARC T5-8

Procedure for further investigation
The following procedure describes the essential steps required to gather the key information needed to further investigate a remote PIO timeout issue that may occur on T5 systems. The process is split into 4 sections:

1. Initial confirmation that the symptom being observed matches this issue.
2. Instrumentation to be configured on a system exhibiting the symptom.
3. Information to be relayed to the customer regarding detection once instrumented.
4. Engagement of the appropriate product groups once diagnostics from the instrumentation have been gathered.

Initial confirmation of symptom being observed
==============================================
Task to be performed by any Oracle support engineer

1) Ensure that the symptoms/events reported match the following.

When a Remote PIO Timeout occurs on a system, the following 2 events will take place:

i) A fatal error will be reported and the HOST system will be powered off shortly afterwards, giving the impression the HOST system has hung.

ii) Once the HOST system has been powered on and Solaris booted, a fault of type "fault.cpu.SPARC-T5.remote-pio-to" will be recorded within FMA.

The command 'fmadm faulty' can be used to confirm the appropriate fault has been logged, i.e.

# fmadm faulty
------------------- ------------------------------------ --------------- --------
Time                UUID                                 msgid           Severity
------------------- ------------------------------------ --------------- --------
2012-11-10/01:46:33 9116b25e-3c52-457b-93fd-bdb94e7a6f16 SPSUN4V-8001-32 Critical

Problem Status  : solved
Diag Engine     : fdd 1.0
System
   Manufacturer  : Oracle Corporation
   Name          : SPARC T5-8
   Part_Number   : unknown
   Serial_Number : unknown

----------------------------------------
Suspect 1 of 1

   Fault class : fault.cpu.SPARC-T5.remote-pio-to
   Certainty   : 100%
   Affects     : /SYS/PM3/CM1/CMP
   Status      : faulted but still in service

   FRU
      Status        : faulty
      Location      : /SYS/PM3
      Manufacturer  : Oracle Corporation
      Name          : TLA,PM,B2.0/R,3.6.T5
      Part_Number   : 7054493
      Revision      : 02
      Serial_Number : 465769T+123585002K

      Chassis
         Manufacturer  : Oracle Corporation
         Name          : SPARC T5-8
         Part_Number   : 7041616
         Serial_Number : 1204BDC013

There will also be a corresponding FMA ereport logged, i.e.

# fmdump -eV
2012-11-09/23:24:19 ereport.cpu.SPARC-T5.remote-pio-to@/SYS/PM0/CM1/CMP
        __tod-0 = 0x509d9009
        __tod-1 = 0x2270ca20
        diagnose = 0x1
        tstate = 0x82000006
        htstate = 0x0
        tpc = 0x0
        tl = 0x6
        tt = 0x1
        ps-g-ffesr = 0x800000000000
        ps-g-fesr = 0x800000003d80
        ps-ncu-pesr = 0x8200
        ps-ncu-esr = 0x9000000000004000
        ps-ncu-iesr1 = 0x200000000000
        ps-ncu-iesr2 = 0x0
        ps-ncu-iesr3 = 0x0
        ps-ncu-iesr4 = 0x0
        ps-nesr = 0x9000000000004000

Once confirmed, ensure this SR is escalated and owned by a TSC L2 Backline engineer for further detailed investigation and system instrumentation.
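The engagement template later in this procedure asks for the location of fmdump data, so it is worth capturing it while confirming the symptom. A minimal collection sketch, assuming a Solaris host with FMA and a writable /var/tmp (the output file names are only examples); if the '-c class' filter is not available on the installed fmdump release, grep the full 'fmdump -eV' output instead:

# fmdump -eV -c ereport.cpu.SPARC-T5.remote-pio-to > /var/tmp/remote-pio-ereports.txt
# fmadm faulty > /var/tmp/remote-pio-faulty.txt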
Instrumentation to be configured on a system exhibiting the symptom
===================================================================
-- Task to be performed by TSC L2 Backline Engineers

1) Engage with the customer and obtain a detailed overview of the system configuration. This needs to cover both h/w and s/w.

i) For h/w, a detailed description of the system needs to be gathered. Focus on all PCIe cards, especially 3rd-party ones, and establish how the cards are used.

ii) For s/w, we need a detailed description of what applications are run on the system, with as much detail on each application's purpose and function as can be gathered, especially around the failure time.

iii) An Explorer and an ILOM snapshot using the 'normal' dataset must be gathered from the system. (Details on obtaining a snapshot can be found via MOS article 1020204.1, https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1020204.1&h=Y)

In addition, try to determine with the customer what events were taking place around the time of the outage (use of analytical troubleshooting techniques will help).

2) Attach the customer details to Oracle bug 16203743, so as to raise awareness that the symptom has been observed in the field and that instrumentation is being provided.

3) Engage with the customer to further instrument the system. This requires configuring the system so that, upon a fatal error, it automatically captures additional data from the host (a full system scandump). In ILOM, ensure the following properties are set as follows:

-> cd /HOST
-> set state_capture_on_error=enabled
-> set state_capture_mode=fatal_scandump

NOTE: As before, owing to TPM being enabled by default on these systems, the HOST system, upon detecting a fatal error, will be powered down shortly after the event, giving the impression the system has hung.

Please advise the customer NOT to close the SR, as this is an ongoing investigation.

Information to be relayed to customer regarding detection once instrumented
===========================================================================
-- Task to be provided to the customer by TSC L2 Backline Engineer

The system has been instrumented so as to capture further diagnostics should the event reoccur. When a fatal error occurs, the system will automatically capture key data, and the HOST system will appear to hang.

Should the system appear to have hung with the instrumentation applied, please first confirm whether a scandump has been taken. This can be done by logging in to the Service Processor (SP) and viewing the SP event log, i.e.

-> cd /HOST/logs/events/list
-> show
7331 Mon Jan 21 15:43:43 2013 System Log minor
     Host: Error Standby
7330 Mon Jan 21 15:43:21 2013 Fault Fault critical
     Fault detected at time = Mon Jan 21 15:43:21 2013. The suspect component: /SYS has fault.cpu.generic-sparc.inconsistent with probability=100. Refer to http://support.oracle.com/msg/SPSUN4V-8000-P3 for details.
7329 Mon Jan 21 15:43:06 2013 HOST Log critical
     scandump data has been gathered
7328 Mon Jan 21 15:40:26 2013 HOST Log critical
     Fatal polled has occurred. scandump data is being gathered.
7327 Mon Jan 21 15:30:25 2013 System Log minor
     Host: OpenBoot Running

If the messages are similar to the above, then scandump data has been gathered, so perform the following:

i) Reapply power to the HOST using the ILOM command "start /SYS".

ii) Obtain an ILOM snapshot using the 'normal' dataset. (Details on obtaining a snapshot can be found via MOS article 1020204.1, https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1020204.1&h=Y)

iii) Contact Oracle Support, quoting your initial Service Request (SR) number, and supply the engineer with a copy of the ILOM snapshot captured, for further analysis.

Should the SP event log NOT indicate that a scandump has been gathered at the time of the failure, please contact Oracle Support for further analysis.
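For convenience, a minimal sketch of collecting the ILOM snapshot from the ILOM CLI follows. It assumes ILOM 3.x CLI syntax and a reachable SFTP server; the user, host, and directory in the dump_uri are placeholders, ILOM will prompt for the remote user's password, and MOS article 1020204.1 remains the authoritative procedure:

-> set /SP/diag/snapshot dataset=normal
-> set /SP/diag/snapshot dump_uri=sftp://user@sftp-host/export/snapshots
-> show /SP/diag/snapshot

The 'result' property reported by the final command indicates whether the snapshot collection is still running or has completed.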
Engagement of appropriate product groups once instrumentation gathered
=======================================================================
Task to be performed by TSC L2 Backline Engineers

1) Upon receiving a copy of the snapshot from the customer:

i) Verify that the scandump files reside in the ILOM snapshot:

   ls <snapshotdir>/ilom/conf/traces/*scandump*

ii) Email the Beehive workgroup "remote-pio-to-analysis_ww_grp@oracle.com" with the following template:

   Customer Name:
   SR Number:
   TSC L2 Backline Engineer Name:
   Location of fmdump data:
   Location of explorer data:
   Location of snapshot containing scandump:

2) Await a response from a member of the Beehive group for further direction.

NOTE FOR TSC: System instrumentation can be returned to its 'default' behaviour by setting the ILOM variable state_capture_mode=default, i.e.

-> cd /HOST
-> set state_capture_mode=default
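Optionally, the setting can be verified after it has been changed back. A minimal sketch, assuming the standard ILOM 'show <target> <property>' syntax; the output shown is illustrative:

-> show /HOST state_capture_mode

 /HOST
    Properties:
        state_capture_mode = default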
Attachments
This solution has no attachment