Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1530633.1
Update Date:2018-05-09
Keywords:

Solution Type  Problem Resolution Sure

Solution  1530633.1 :   Sun SPARC[TM] Enterprise M4000/M5000 multiple "IO Manager:Link error" in context of other errored components due to IOU#0  


Related Items
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M5000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx000




Applies to:

Sun SPARC Enterprise M5000 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun SPARC Enterprise M4000 Server - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Symptoms

Multiple "IO Manager:Link error" messages, together with errors on other components, caused by a faulty IOU#0

This document describes a situation with multiple errors on multiple parts in which IOU#0 must be considered the faulty component.

It shows two timeframes. The error pattern in the second timeframe is already known to be caused by IOU#0 and is discussed in the following document:
Doc ID 1296435.1: Sun SPARC[TM] Enterprise M4000/M5000 - MBU_B and MEMB being faulted with SCF-8004-8X, SCF-8000-1D, and SCF-8005-MJ errors.

 

On an M5000 system, the following errors are recorded in 'showlogs monitor' during the first timeframe:

  [...]
  Jan 31 02:14:12 <hostname> Warning: /IOU#0/PCI#1:IO Manager:Link error
  Jan 31 02:14:15 <hostname> Warning: /IOU#0/PCI#3:IO Manager:Link error
  Jan 31 02:15:16 <hostname> Alarm: /MBU_B/MEMB#4,/MBU_B:ANALYZE:MAC-SC interface fatal error
  [...]

 
Notice: The "IO Manager:Link error" messages relate to both fibre downlink cards installed in IOU#0 (PCI#1 and PCI#3). The only other card installed in IOU#0 is a QLogic Fibre Channel PCIe card (SUNW,qlc).

The XSCF FMA output of 'fmdump -V' shows the following (excerpt):

  Jan 31 02:14:11.0116 95f0befa-8f62-4a6f-b45e-a4e3cccc0289 IOXSCF-8000-1A
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.iox.device.fail
                certainty = 0x64
                        scf-resource = hc:///chassis=0/iou=0/pcislot=1/link=0/xmtr=0
                location = IOU#0-PCI#1
        (end fault-list[0])
        fault-status = 0x1

  Jan 31 02:14:14.8418 a6a40219-a951-4f47-afc0-5e3804ad0d75 IOXSCF-8000-1A
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.iox.device.fail
                certainty = 0x64
                        scf-resource = hc:///chassis=0/iou=0/pcislot=3/link=0/xmtr=0
                location = IOU#0-PCI#3
        (end fault-list[0])
        fault-status = 0x1

  Jan 31 02:15:11.2313 0bada0f7-4dc6-4069-8695-6468be76a040 SCF-8005-5X
        fault-list-sz = 0x2
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.SPARC-Enterprise.if.fe-mac-sc
                certainty = 0x21
                        scf-resource = hc:///chassis=0/cmu=1
                detected-by = ANALYZE
                location = /MBU_B
        (end fault-list[0])
        (start fault-list[1])
                class = fault.chassis.SPARC-Enterprise.if.fe-mac-sc
                certainty = 0x42
                        scf-resource = hc:///chassis=0/cmu=1/mac=0
                detected-by = ANALYZE
                location = /MBU_B/MEMB#4
        (end fault-list[1])
        fault-status = 0x1 0x1
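In 'fmdump -V' output the certainty field is a percentage printed in hexadecimal: 0x64 is 100%, and the SCF-8005-5X event above splits the blame between /MBU_B (0x21, 33%) and /MBU_B/MEMB#4 (0x42, 66%). None of these suspects is IOU#0 itself, only PCI slots within it. The following minimal Python sketch extracts class, location and certainty per suspect, assuming the excerpt layout shown above (one "key = value" per line between "(start fault-list[n])" and "(end fault-list[n])"). The helper name parse_suspects is hypothetical and not part of any XSCF tool.

```python
import re

# "key = value" lines inside an fmdump -V fault-list entry (assumed layout).
FIELD_RE = re.compile(r"^\s*(class|certainty|location|detected-by)\s*=\s*(\S+)\s*$")


def parse_suspects(text):
    """Return one dict per fault-list[n] entry of an fmdump -V excerpt.

    Hypothetical helper; certainty is converted from hex to a percentage.
    """
    suspects = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("(start fault-list["):
            current = {}
        elif stripped.startswith("(end fault-list["):
            if current is not None:
                suspects.append(current)
            current = None
        elif current is not None:
            m = FIELD_RE.match(line)
            if m:
                key, value = m.groups()
                # int() with base 16 accepts the "0x" prefix.
                current[key] = int(value, 16) if key == "certainty" else value
    return suspects


EXCERPT = """
        (start fault-list[0])
                class = fault.chassis.SPARC-Enterprise.if.fe-mac-sc
                certainty = 0x21
                        scf-resource = hc:///chassis=0/cmu=1
                detected-by = ANALYZE
                location = /MBU_B
        (end fault-list[0])
        (start fault-list[1])
                class = fault.chassis.SPARC-Enterprise.if.fe-mac-sc
                certainty = 0x42
                        scf-resource = hc:///chassis=0/cmu=1/mac=0
                detected-by = ANALYZE
                location = /MBU_B/MEMB#4
        (end fault-list[1])
"""

for s in parse_suspects(EXCERPT):
    print(f"{s['location']}: {s['certainty']}%")
# /MBU_B: 33%
# /MBU_B/MEMB#4: 66%
```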

 
The resulting Degraded/Faulted components reported by 'showstatus' are:

  *   MBU_B Status:Degraded;
  *       MEMB#4 Status:Faulted;
      IOU#0 Status:Normal;
  *       PCI#1 Status:Faulted;
  *       PCI#3 Status:Faulted;
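In 'showstatus' output an asterisk marks a unit with a failure, and indentation shows which unit a component belongs to; IOU#0 itself is listed as Normal only to give context for its faulted PCI slots. A minimal Python sketch that turns such output into full component paths, assuming exactly the layout above (the helper name marked_units is hypothetical), lists the units the Solution section says to clear in service mode:

```python
import re

# Optional "*" marker, unit name and status, e.g. "  *   MBU_B Status:Degraded;" (assumed layout).
LINE_RE = re.compile(r"^\s*(\*)?\s*(\S+)\s+Status:(\w+);\s*$")


def marked_units(text):
    """Return (path, status) for every unit flagged with '*' in showstatus output.

    Hypothetical helper; parent units are inferred from the indentation column.
    """
    stack = []  # (column, name) of the enclosing units seen so far
    result = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        col = m.start(2)
        # Units at the same or deeper indentation are siblings, not parents.
        while stack and stack[-1][0] >= col:
            stack.pop()
        path = "/".join([name for _, name in stack] + [m.group(2)])
        stack.append((col, m.group(2)))
        if m.group(1):
            result.append((path, m.group(3)))
    return result


SHOWSTATUS = """\
  *   MBU_B Status:Degraded;
  *       MEMB#4 Status:Faulted;
      IOU#0 Status:Normal;
  *       PCI#1 Status:Faulted;
  *       PCI#3 Status:Faulted;
"""

for path, status in marked_units(SHOWSTATUS):
    print(path, status)
# MBU_B Degraded
# MBU_B/MEMB#4 Faulted
# IOU#0/PCI#1 Faulted
# IOU#0/PCI#3 Faulted
```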

 
The error pattern in the second timeframe clearly indicates a faulty IOU#0. The 'showlogs monitor' output contains the following errors:

  [...]
  Jan 31 20:11:43 <hostname> Warning: /UNSPECIFIED:SCF:spurious unit interrupt
  Jan 31 20:11:50 <hostname> Warning: /UNSPECIFIED:SCF:spurious unit interrupt
  Jan 31 20:11:56 <hostname> Warning: /IOU#0/PCI#1:IO Manager:Link error
  Jan 31 20:12:01 <hostname> Warning: /UNSPECIFIED:SCF:spurious unit interrupt
  Jan 31 20:12:08 <hostname> Warning: /IOU#0/PCI#3:IO Manager:Link error
  Jan 31 20:13:28 <hostname> Alarm: /MBU_B/MEMB#5:ANALYZE:MAC detected clock fatal failure
  Jan 31 20:13:32 <hostname> monitor_msg: SCF:DomainID 0 state change (initialize phase started, detail#10)
  Jan 31 20:13:33 <hostname> monitor_msg: SCF:DomainID 1 state change (initialize phase started, detail#10)
  Jan 31 20:13:35 <hostname> monitor_msg: SCF:DomainID 3 state change (initialize phase started, detail#10)
  Jan 31 20:13:53 <hostname> monitor_msg: SCF:DomainID 3 is deconfigured (no available XSB)
  Jan 31 20:14:01 <hostname> Warning: /MBU_B:SCF:SC test error
  Jan 31 20:14:06 <hostname> Warning: /MBU_B:SCF:SC test error
  Jan 31 20:14:07 <hostname> monitor_msg: SCF:System stopped (no available XSB)
  [...]
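Both timeframes share one shape in 'showlogs monitor': "IO Manager:Link error" warnings on the IOU#0 downlink slots, followed roughly 1 to 2 minutes later by an Alarm on MBU_B or one of its MEMBs. A minimal Python sketch that flags this co-occurrence, assuming the line layout shown in the excerpts; the helper name and the 5-minute window are illustrative assumptions, not documented XSCF thresholds, and a hit is only a hint to consider IOU#0, not a diagnosis:

```python
import re
from datetime import datetime, timedelta

# "<Mon> <day> <hh:mm:ss> <host> <Level>: <message>" as in the excerpts (assumed layout).
LOG_RE = re.compile(r"^\s*(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ (Warning|Alarm): (.*)$")

# Hypothetical correlation window, chosen for illustration only.
WINDOW = timedelta(minutes=5)


def iou0_link_then_mbu_alarm(text):
    """Return (slots, alarm) pairs where IOU#0 link errors precede an MBU_B alarm within WINDOW."""
    events = []
    for line in text.splitlines():
        m = LOG_RE.match(line)
        if not m:
            continue  # monitor_msg lines and "[...]" are ignored
        # showlogs omits the year; a fixed year is assumed so parsing is unambiguous.
        ts = datetime.strptime("2000 " + m.group(1), "%Y %b %d %H:%M:%S")
        events.append((ts, m.group(2), m.group(3)))
    hits = []
    for ts, level, msg in events:
        if level != "Alarm" or not msg.startswith("/MBU_B"):
            continue
        slots = sorted({
            # "/IOU#0/PCI#1:IO Manager:Link error" -> "PCI#1"
            other.split(":")[0].split("/")[2]
            for t2, _, other in events
            if other.startswith("/IOU#0/")
            and other.endswith("IO Manager:Link error")
            and timedelta(0) <= ts - t2 <= WINDOW
        })
        if slots:
            hits.append((slots, msg))
    return hits


SAMPLE = """\
  Jan 31 20:11:43 <hostname> Warning: /UNSPECIFIED:SCF:spurious unit interrupt
  Jan 31 20:11:56 <hostname> Warning: /IOU#0/PCI#1:IO Manager:Link error
  Jan 31 20:12:08 <hostname> Warning: /IOU#0/PCI#3:IO Manager:Link error
  Jan 31 20:13:28 <hostname> Alarm: /MBU_B/MEMB#5:ANALYZE:MAC detected clock fatal failure
  Jan 31 20:14:01 <hostname> Warning: /MBU_B:SCF:SC test error
"""

for slots, alarm in iou0_link_then_mbu_alarm(SAMPLE):
    print(slots, "->", alarm)
# ['PCI#1', 'PCI#3'] -> /MBU_B/MEMB#5:ANALYZE:MAC detected clock fatal failure
```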

 
The XSCF FMA output of 'fmdump -V' shows the following (excerpt):

  Jan 31 20:11:41.5983 86c6543b-3552-4c9c-8896-3e824e7b4f9f SCF-8004-8X
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.SPARC-Enterprise.asic.cpu.power.fail
                certainty = 0x64
                detected-by = SCF
                location = CHASSIS
        (end fault-list[0])
        fault-status = 0x0
  Jan 31 20:11:48.4577 8be58b80-9acc-4b51-af92-37366e6f4e5a SCF-8004-8X
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.SPARC-Enterprise.asic.cpu.power.fail
                certainty = 0x64
                detected-by = SCF
                location = CHASSIS
        (end fault-list[0])
        fault-status = 0x0
  Jan 31 20:11:51.7522 faa758e8-0759-47c4-a831-4e71815c61da IOXSCF-8000-1A
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.iox.device.fail
                certainty = 0x64
                        scf-resource = hc:///chassis=0/iou=0/pcislot=1/link=0/xmtr=0
                location = IOU#0-PCI#1
        (end fault-list[0])
        fault-status = 0x1
  Jan 31 20:11:57.9733 739f130c-82a7-47cd-bd5c-f817ecb12cff SCF-8004-8X
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.SPARC-Enterprise.asic.cpu.power.fail
                certainty = 0x64
                detected-by = SCF
                location = CHASSIS
        (end fault-list[0])
        fault-status = 0x0
  Jan 31 20:12:05.3927 72ba0b96-e1ce-4dbb-b504-3df4066d00b8 IOXSCF-8000-1A
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.iox.device.fail
                certainty = 0x64
                        scf-resource = hc:///chassis=0/iou=0/pcislot=3/link=0/xmtr=0
                location = IOU#0-PCI#3
        (end fault-list[0])
        fault-status = 0x1
  Jan 31 20:13:24.7270 9cef99db-92a3-4dde-9460-0ab5d3d8635c SCF-8000-1D
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.SPARC-Enterprise.if.fe-asic-clk
                certainty = 0x64
                        scf-resource = hc:///chassis=0/cmu=1/mac=1
                detected-by = ANALYZE
                location = /MBU_B/MEMB#5
        (end fault-list[0])
        fault-status = 0x1
  Jan 31 20:13:57.1795 3ecdc2a5-bea0-4d28-a6b4-ce2d764ef539 SCF-8005-MJ
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.SPARC-Enterprise.asic.sc.test
                certainty = 0x64
                        scf-resource = hc:///chassis=0/cmu=0/sc=0
                detected-by = SCF
                location = /MBU_B
        (end fault-list[0])
        fault-status = 0x1
  Jan 31 20:13:59.4440 fc42731d-c11f-4526-b10a-d4bbf86d32fa SCF-8005-MJ
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
                class = fault.chassis.SPARC-Enterprise.asic.sc.test
                certainty = 0x64
                        scf-resource = hc:///chassis=0/cmu=0/sc=1
                detected-by = SCF
                location = /MBU_B
        (end fault-list[0])
        fault-status = 0x1
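The second-timeframe excerpt contains all three event codes that Doc ID 1296435.1 associates with IOU#0 (SCF-8004-8X, SCF-8000-1D, SCF-8005-MJ), whereas the first timeframe shows only IOXSCF-8000-1A and SCF-8005-5X, which is why its link to IOU#0 has to be inferred from context. A minimal Python sketch that checks an 'fmdump -V' excerpt for that code combination, assuming the event header layout shown above (timestamp, UUID, code); the helper names are hypothetical:

```python
import re

# Event header, e.g. "Jan 31 20:13:24.7270 <uuid> SCF-8000-1D" (assumed layout).
HEADER_RE = re.compile(
    r"^\s*\w{3} +\d+ \d\d:\d\d:\d\d\.\d+ ([0-9a-f-]{36}) ([A-Z]+-[0-9A-Z]{4}-[0-9A-Z]{2})\s*$"
)

# Codes named in Doc ID 1296435.1 for the known IOU#0 pattern.
KNOWN_PATTERN = {"SCF-8004-8X", "SCF-8000-1D", "SCF-8005-MJ"}


def event_codes(text):
    """Return the event codes found on fmdump -V header lines (hypothetical helper)."""
    return [m.group(2) for m in map(HEADER_RE.match, text.splitlines()) if m]


def matches_known_pattern(text):
    """True if every code in KNOWN_PATTERN appears in the excerpt."""
    return KNOWN_PATTERN <= set(event_codes(text))


FIRST = """\
  Jan 31 02:14:11.0116 95f0befa-8f62-4a6f-b45e-a4e3cccc0289 IOXSCF-8000-1A
  Jan 31 02:14:14.8418 a6a40219-a951-4f47-afc0-5e3804ad0d75 IOXSCF-8000-1A
  Jan 31 02:15:11.2313 0bada0f7-4dc6-4069-8695-6468be76a040 SCF-8005-5X
"""

SECOND = """\
  Jan 31 20:11:41.5983 86c6543b-3552-4c9c-8896-3e824e7b4f9f SCF-8004-8X
  Jan 31 20:11:51.7522 faa758e8-0759-47c4-a831-4e71815c61da IOXSCF-8000-1A
  Jan 31 20:13:24.7270 9cef99db-92a3-4dde-9460-0ab5d3d8635c SCF-8000-1D
  Jan 31 20:13:57.1795 3ecdc2a5-bea0-4d28-a6b4-ce2d764ef539 SCF-8005-MJ
"""

print(matches_known_pattern(FIRST))   # False
print(matches_known_pattern(SECOND))  # True
```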

 
Notice: In this example MEMB#5 is the highest-numbered installed MEMB. The FMA messages and the 'showstatus' output shown above were produced with XCP 1091. Starting with XCP 1111, IOU#0 is included as a suspect component for the error pattern in the second timeframe.

Changes

 

Cause

IOU#0 supplies power to the MEMBs and the system clock.

Solution

Contact your authorized service provider. This is a known condition that requires replacing IOU#0. If the system runs an XCP version earlier than 1111, schedule a firmware update with 'flashupdate'.
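The version rule above can be sketched as a simple comparison. This minimal Python sketch assumes the four-digit XCP numbering used in this document (1091, 1111), which orders correctly as an integer; how the version string is obtained from the XSCF is not shown, and the helper name needs_flashupdate is hypothetical.

```python
import re

# First XCP release that lists IOU#0 as a suspect for this pattern (per this document).
FIXED_XCP = 1111


def needs_flashupdate(xcp_version):
    """True if a version string such as "1091" or "XCP 1091" is older than FIXED_XCP.

    Hypothetical helper; assumes four-digit XCP numbers compare correctly as integers.
    """
    m = re.search(r"\d{4}", xcp_version)
    if m is None:
        raise ValueError(f"no four-digit XCP number in {xcp_version!r}")
    return int(m.group()) < FIXED_XCP


print(needs_flashupdate("XCP 1091"))  # True: the version used in the examples above
print(needs_flashupdate("1111"))      # False
```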

Other Degraded/Faulted components are not suspect and should not be replaced.

 

Use service mode to clear the faults ('clearfault') on all Degraded/Faulted components, which should include MBU_<x>, MEMB#<x> and IOU#0-PCI#<x>.

See also HowTo Doc ID 1007101.1: Sun SPARC(R) Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers: Fault clearing and LEDs behavior


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.