SPSUN4V-8001-32 - Unexpected IO fault

Asset ID:	1-79-1525251.1
Update Date:	2017-03-09
Keywords:

Solution Type Predictive Self-Healing Sure

Solution 1525251.1 : SPSUN4V-8001-32 - Unexpected IO fault

Applies to:

SPARC T5-2
SPARC T5-4
SPARC T5-8
SPARC M7-8
SPARC M7-16
Information in this document applies to any platform.

Purpose

Provide additional information for message ID: SPSUN4V-8001-32

Details

Type: Hardware Fault

fault.cpu.SPARC-T5.remote-pio-to

Severity: Critical

Description

An unexpected IO fault has occurred.

Automated Response

No automated response.

Impact

Loss of Service, host system will be powered down.

Suggested Action for System Administrator

An unexpected IO fault has occurred. Contact Oracle service for support.

Internal Only Section

Revision 1.5

Data Gathering for Remote PIO timeout issue

This issue is of a sensitive nature and must be handled with appropriate
confidentiality. Discuss it with customers on an individual basis, and only
once the symptoms being exhibited have been confirmed.

Applies to

SPARC T5-2
SPARC T5-4
SPARC T5-8

Procedure for further investigation 

The following procedure describes the essential steps required to gather key
information in order to further investigate a remote PIO timeout issue that may
occur on T5 systems.

The process is split into 4 sections:
    Initial confirmation that the symptom being observed matches this issue.
    Instrumentation to be configured on a system exhibiting the symptom.
    Information to be relayed to the customer regarding detection once
	instrumented.
    Engagement of appropriate product groups once diagnostics from the
	instrumentation have been gathered.


Initial confirmation of symptom being observed
==============================================
Task to be performed by any Oracle support engineer

1) Ensure that the symptoms/events reported match the following...

When a Remote PIO Timeout occurs on a system, the following 2 events will take
place:
   i) A fatal error will be reported and the HOST system will be powered off
      shortly afterwards, giving the impression the HOST system has hung.

  ii) The HOST system once it has been powered on and Solaris
      booted will have a recorded fault within FMA of the type
      "fault.cpu.SPARC-T5.remote-pio-to"

The command 'fmadm faulty' can be used to confirm that the appropriate fault
has been logged.

For example:
# fmadm faulty
------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2012-11-10/01:46:33 9116b25e-3c52-457b-93fd-bdb94e7a6f16    SPSUN4V-8001-32     Critical

Problem Status    : solved
Diag Engine       : fdd 1.0
System           
   Manufacturer   : Oracle Corporation
   Name           : SPARC T5-8
   Part_Number    : unknown
   Serial_Number  : unknown

----------------------------------------
Suspect 1 of 1
   Fault class  : fault.cpu.SPARC-T5.remote-pio-to
   Certainty    : 100%
   Affects      : /SYS/PM3/CM1/CMP
   Status       : faulted but still in service

   FRU                 
      Status            : faulty
      Location          : /SYS/PM3
      Manufacturer      : Oracle Corporation
      Name              : TLA,PM,B2.0/R,3.6.T5
      Part_Number       : 7054493
      Revision          : 02
      Serial_Number     : 465769T+123585002K
      Chassis          
         Manufacturer   : Oracle Corporation
         Name           : SPARC T5-8
         Part_Number    : 7041616
         Serial_Number  : 1204BDC013
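
The confirmation above can also be scripted when many systems need checking.
The following is a minimal sketch only, assuming the 'fmadm faulty' output has
been saved to a text file and uses the "key : value" layout shown in the
example. The function name find_remote_pio_suspects is hypothetical and is not
part of any Oracle tool.

```python
# Fault class and message ID as quoted in this document.
FAULT_CLASS = "fault.cpu.SPARC-T5.remote-pio-to"
MSG_ID = "SPSUN4V-8001-32"


def find_remote_pio_suspects(fmadm_output):
    """Return one dict per suspect whose fault class equals FAULT_CLASS.

    Assumption: each suspect block contains a "Fault class : ..." line that
    may be followed by an "Affects : ..." line, as in the sample output.
    """
    suspects = []
    current = None
    for line in fmadm_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key : value" line
        key, value = key.strip(), value.strip()
        if key == "Fault class":
            current = {"fault_class": value, "affects": None}
            if value == FAULT_CLASS:
                suspects.append(current)
        elif key == "Affects" and current is not None:
            current["affects"] = value
    return suspects
```

A possible use is to run 'fmadm faulty > /var/tmp/fmadm.out' on the host, read
that file into a string and pass it to find_remote_pio_suspects(). An empty
list means the symptom does not match this issue.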


There will also be a corresponding FMA ereport logged.
For example:
# fmdump -eV
2012-11-09/23:24:19  ereport.cpu.SPARC-T5.remote-pio-to@/SYS/PM0/CM1/CMP
	__tod-0      = 0x509d9009
	__tod-1      = 0x2270ca20
	diagnose     = 0x1
	tstate       = 0x82000006
	htstate      = 0x0
	tpc          = 0x0
	tl           = 0x6
	tt           = 0x1
	ps-g-ffesr   = 0x800000000000
	ps-g-fesr    = 0x800000003d80
	ps-ncu-pesr  = 0x8200
	ps-ncu-esr   = 0x9000000000004000
	ps-ncu-iesr1 = 0x200000000000
	ps-ncu-iesr2 = 0x0
	ps-ncu-iesr3 = 0x0
	ps-ncu-iesr4 = 0x0
	ps-nesr      = 0x9000000000004000
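
When correlating the ereport with other logs it can help to convert the two
tod values into a readable timestamp. The sketch below assumes, as is
conventional for FMA ereports but not stated in this document, that the first
tod value holds seconds since the Unix epoch and the second holds nanoseconds.
The function name ereport_time_utc is hypothetical.

```python
from datetime import datetime, timezone


def ereport_time_utc(tod0_hex, tod1_hex):
    """Convert the two tod values of an ereport to a UTC datetime.

    Assumption: tod0 is seconds since the Unix epoch and tod1 is nanoseconds.
    Nanoseconds are truncated to microseconds.
    """
    seconds = int(tod0_hex, 16)       # int() accepts the "0x" prefix with base 16
    nanoseconds = int(tod1_hex, 16)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.replace(microsecond=nanoseconds // 1000)


# Values taken from the example ereport above.
print(ereport_time_utc("0x509d9009", "0x2270ca20").isoformat())
```

The result is in UTC, whereas the time on the first line of the fmdump output
is normally shown in the local time zone of the system where fmdump was run.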


Once confirmed, ensure this SR is escalated to, and owned by, a TSC L2 Backline
engineer for further detailed investigation and system instrumentation.


Instrumentation to be configured on a system exhibiting the symptom
===================================================================
-- Task to be performed by TSC L2 Backline Engineers

1) Engage with the customer and obtain a detailed overview of the system
configuration. This needs to cover both h/w and s/w.
    i) For h/w, a detailed description of the system needs to be gathered.
       Focus on all PCIE cards, especially 3rd party ones, and establish how
       the cards are used.

   ii) For s/w, we need a detailed description of what applications are run on
       the system, and as much detail on each application's purpose and
       function as can be gathered, especially around the failure time.

  iii) An Explorer and an ILOM snapshot using the 'normal' dataset of the system
       must be gathered. Details on obtaining a snapshot can be found via MOS
       article 1020204.1 (https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1020204.1&h=Y)

In addition, try to determine with the customer what events were taking
place around the time of the outage (use of analytical troubleshooting
techniques will help).

2) Attach the customer details to Oracle bug 16203743 to raise awareness
that the symptom has been observed in the field and that instrumentation is
being provided.

3) Engage with the customer to further instrument the system. This requires
configuring the system so that, upon a fatal error, it automatically captures
additional data from the host (a full system scandump).

In ILOM, ensure the following properties are set:
    cd /HOST
    set state_capture_on_error=enabled
    set state_capture_mode=fatal_scandump


NOTE
As before, owing to TPM being enabled by default on these systems, the HOST
system will be powered down shortly after it detects a fatal error, giving
the impression the system has hung.


Please advise the customer NOT to close the SR as this is an ongoing
investigation.


Information to be relayed to customer regarding detection once instrumented
===========================================================================
-- Task to be provided to the customer by TSC L2 Backline Engineer

The system has been instrumented to capture further diagnostics should the
event recur.

When a fatal error occurs, the system will automatically capture key data, and
the HOST system will appear to hang.

Should the system appear to have hung with instrumentation applied,
please first confirm whether a scandump has been taken.

This can be performed by logging in to the Service Processor (SP) and viewing
the SP event log, for example:

-> cd /HOST/logs/events/list
-> show

7331   Mon Jan 21 15:43:43 2013  System    Log       minor
       Host: Error Standby
7330   Mon Jan 21 15:43:21 2013  Fault     Fault     critical
       Fault detected at time = Mon Jan 21 15:43:21 2013. The suspect component: /SYS has fault.cpu.generic-sparc.inconsistent with probability=100.
       Refer to http://support.oracle.com/msg/SPSUN4V-8000-P3 for details.
7329   Mon Jan 21 15:43:06 2013  HOST      Log       critical
       scandump data has been gathered
7328   Mon Jan 21 15:40:26 2013  HOST      Log       critical
       Fatal polled has occurred. scandump data is being gathered.
7327   Mon Jan 21 15:30:25 2013  System    Log       minor
       Host: OpenBoot Running

If the messages are similar to the above then it indicates scandump data has
been gathered, so perform the following...
 
   i) Reapply power to the HOST using the ILOM command
	"start /SYS"

  ii) Obtain an ILOM snapshot using the 'normal' dataset. Details on obtaining
      a snapshot can be found via MOS article 1020204.1
      (https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1020204.1&h=Y)

 iii) Contact Oracle Support quoting your initial Service Request (SR) number
      and supply to the engineer a copy of the ILOM snapshot captured for
      further analysis.


Should the SP event log NOT indicate that a scandump has been gathered at the
time of the failure, please contact Oracle Support for further analysis.
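
The event log check described in this section can be sketched in Python for
cases where the log has been saved to a file. This is an illustration only,
assuming entries follow the layout of the example above (a header line with
ID, date, class, type and severity, followed by indented message lines). The
function names parse_events and scandump_gathered are hypothetical.

```python
import re

# Header line, e.g. "7329   Mon Jan 21 15:43:06 2013  HOST      Log       critical"
ENTRY_HEADER = re.compile(
    r"^(\d+)\s+(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)


def parse_events(log_text):
    """Split SP event log text into entries with their joined message text."""
    events = []
    for line in log_text.splitlines():
        match = ENTRY_HEADER.match(line)
        if match:
            events.append({
                "id": int(match.group(1)),
                "time": match.group(2),
                "class": match.group(3),
                "type": match.group(4),
                "severity": match.group(5),
                "message": "",
            })
        elif events and line.strip():
            # Continuation line: append to the message of the latest entry.
            joined = events[-1]["message"] + " " + line.strip()
            events[-1]["message"] = joined.strip()
    return events


def scandump_gathered(log_text):
    """True if any entry reports that scandump data has been gathered."""
    return any(
        "scandump data has been gathered" in event["message"]
        for event in parse_events(log_text)
    )
```

Note that the entry "scandump data is being gathered" alone is not treated as
success, since it only shows that the capture started.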



Engagement of appropriate product groups once instrumentation gathered.
=======================================================================
Task to be performed by TSC L2 Backline Engineers

1) Upon receiving a copy of the snapshot from the customer:
    i)  Verify the scandump files reside in the ILOM snapshot
	ls <snapshotdir>/ilom/conf/traces/*scandump*

    ii) Email the beehive workgroup 
        "remote-pio-to-analysis_ww_grp@oracle.com" with the following template

	Customer Name:
	SR Number:
	TSC L2 Backline Engineer Name:
	Location of fmdump data:
	Location of explorer data:
	Location of snapshot containing scandump:

2) Await a response from a member of the Beehive group for further direction.
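
The verification in step 1 and the email template can be sketched in Python as
follows. This is a minimal illustration, assuming the snapshot has already
been unpacked into a local directory. The function names and all values passed
to them are hypothetical, and only the ilom/conf/traces path and the template
field names come from this document.

```python
from pathlib import Path


def find_scandump_files(snapshot_dir):
    """List files matching *scandump* under <snapshotdir>/ilom/conf/traces."""
    traces = Path(snapshot_dir) / "ilom" / "conf" / "traces"
    if not traces.is_dir():
        return []
    return sorted(p for p in traces.glob("*scandump*") if p.is_file())


def escalation_email_body(customer, sr_number, engineer,
                          fmdump_location, explorer_location,
                          snapshot_location):
    """Fill in the email template for the analysis workgroup."""
    fields = [
        ("Customer Name", customer),
        ("SR Number", sr_number),
        ("TSC L2 Backline Engineer Name", engineer),
        ("Location of fmdump data", fmdump_location),
        ("Location of explorer data", explorer_location),
        ("Location of snapshot containing scandump", snapshot_location),
    ]
    return "\n".join(f"{label}: {value}" for label, value in fields)
```

If find_scandump_files() returns an empty list, the snapshot does not contain
scandump data and the email should not be sent until a valid snapshot has been
obtained.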


NOTE FOR TSC:
System instrumentation can be returned to 'default' behaviour by setting the
ILOM variable state_capture_mode=default, for example:
	cd /HOST
	set state_capture_mode=default

Attachments

This solution has no attachment