Asset ID: 1-75-1019646.1
Update Date: 2017-10-18
Keywords:
Solution Type: Troubleshooting Sure
Solution 1019646.1: Troubleshooting Interconnect errors on Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra 1280, 1290 systems.
Related Items
- Sun Fire 4810 Server
- Sun Fire 3800 Server
- Sun Netra 1290 Server
- Sun Fire 6800 Server
- Sun Fire E6900 Server
- Sun Fire E2900 Server
- Sun Fire 4800 Server
- Sun Fire V1280 Server
- Sun Fire E4900 Server
- Sun Netra 1280 Server
Related Categories
- PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-x8x0/Ex900
- _Old GCS Categories>Sun Microsystems>Servers>Entry-Level Servers
- _Old GCS Categories>Sun Microsystems>Servers>Midrange Servers
- _Old GCS Categories>Sun Microsystems>Servers>Midrange V and Netra Servers
Previously Published As: 242866
Applies to:
- Sun Fire E2900 Server - Version Not Applicable and later
- Sun Netra 1280 Server - Version Not Applicable and later
- Sun Fire 3800 Server - Version Not Applicable and later
- Sun Fire 4800 Server - Version Not Applicable and later
- Sun Fire 6800 Server - Version Not Applicable and later
All Platforms
Purpose
Description
This document provides the basic troubleshooting steps for diagnosing the cause of Interconnect Errors on Sun Fire[TM] Midrange Servers.
Symptoms:
- A System Board, I/O Board, or Repeater may have been recently serviced, replaced, or reseated.
- A domain may not be able to boot.
- A domain could be described as down, can't be setkeyswitched on, can't be powered on, or as having failed POST.
- Error messages displayed in the System Controller (SC) log files (showlogs -v) or on the console could include messages like:
Failed AR interconnect test.
CPU Board V3 at /N0/SB1 has been removed from domain C due to a failure in interconnection test. Service action required.
AR Interconnect test: System board SB1/ar0 address repeater connections to system board RP3/ar0 failed
DX Interconnect test: System board /N0/SB1 data line connections to system board RP0 failed
NOTE: The example errors can be associated with any domain, Repeater (RP), System Board (SB), or I/O Board (IB). The messages above are examples only; other messages may accompany the same faults.
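As an illustration of how the implicated components might be pulled out of saved SC log or console text, here is a minimal Python sketch. It assumes the log wording matches the sample messages above; the function name `implicated_components` and both patterns are hypothetical helpers, not part of any Sun or Oracle tool, and real logs may use other wording.

```python
import re

# Board and repeater names as they appear in the sample messages (SB1, RP0, ...).
# IB is included on the assumption that I/O boards are named the same way.
COMPONENT = re.compile(r"\b(?:SB|IB|RP)\d+\b")

# Matches "interconnect test" and "interconnection test", in any letter case.
INTERCONNECT = re.compile(r"interconnect(?:ion)? test", re.IGNORECASE)


def implicated_components(log_text):
    """Return a sorted list of components named in interconnect-test messages."""
    found = set()
    for line in log_text.splitlines():
        if INTERCONNECT.search(line):
            found.update(COMPONENT.findall(line))
    return sorted(found)


sample = """\
Failed AR interconnect test.
CPU Board V3 at /N0/SB1 has been removed from domain C due to a failure in interconnection test. Service action required.
AR Interconnect test: System board SB1/ar0 address repeater connections to system board RP3/ar0 failed
DX Interconnect test: System board /N0/SB1 data line connections to system board RP0 failed
"""

print(implicated_components(sample))  # ['RP0', 'RP3', 'SB1']
```

The result is only a list of names that appear in the messages. Ranking them as primary or secondary suspects is the subject of Document 1019649.1.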
System Type:
- Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900
- Netra[TM] 1280, 1290
Troubleshooting Steps
Steps to Follow
Collect the appropriate troubleshooting data and contact Oracle Support Services. The error you have encountered is a board interconnection (connectivity) issue. It is likely caused by a hardware defect, a board or slot issue, or a board "seating" issue. A Sun Support Engineer must be engaged to diagnose and resolve this event.
Having the following troubleshooting data ready when you contact Oracle Support Services will allow the engineer to begin diagnosis immediately and will decrease the time to resolution.
Please provide:
- Explorer with the scextended or 1280extended option (depending on platform type); see Document 1019066.1 for details.
- When Explorer data cannot be captured, please collect the output of the System Controller (SC) commands listed in Document 1003529.1.
Please validate that each troubleshooting step below is true for your environment.
The steps will provide instructions or a link to a document, for validating the step and taking corrective action as necessary.
The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution.
Please do not skip a step.
1. Verify the components implicated in the interconnect errors were not recently replaced, reseated, or "handled".
- Recently "handled" hardware includes any board that has been removed or inserted, whether to replace the board itself or a hardware component on it.
- Since the error is an interconnection problem, the physical act of servicing or handling the board could be the cause of the problem.
Document 1019218.1 Sun Fire[TM] Midrange Servers: How to identify pin or socket damage.
2. Verify that the errors persist after executing System Controller Failover (dual SC config) or an SC Reset (single SC config).
- Failover (scfailover) is only available on systems with Dual SCs.
Document 1003245.1
Sun Fire[TM] 3800-6900: System Controller failover functionality
- On Sun Fire[TM] v1280/E2900 and Netra[TM] 1280/1290 (single SC configurations) you will need to use the resetsc command to reset the SC and confirm its sanity.
Document 1012388.1
Sun Fire[TM] V1280/E2900 LOM Quick Command Reference
- If errors persist on both SCs or after the resetsc is issued, proceed to Step 3.
- If errors go away after the resetsc you are done.
- If they go away after executing scfailover, fail back to the original Main SC and check whether the errors return.
- If the errors return, replace that SC.
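The decision logic of Step 2 can be summarized in a short Python sketch. This is only an illustration of the branches described above; the function name, its parameters, and the returned strings are hypothetical. The case where errors do not return after failing back is not addressed by this step, so the sketch reports it as not covered rather than guessing.

```python
def step2_next_action(dual_sc, errors_persist, errors_return_on_failback=None):
    """Map Step 2 observations to the next action.

    dual_sc: True on dual SC systems (scfailover), False on single SC
        systems (resetsc).
    errors_persist: True if the errors are still seen after scfailover
        or resetsc.
    errors_return_on_failback: dual SC only. True or False once the original
        Main SC is active again, None if failback has not been tried yet.
    """
    if errors_persist:
        return "proceed to Step 3"
    if not dual_sc:
        return "done"
    if errors_return_on_failback is None:
        return "fail back to the original Main SC and recheck"
    if errors_return_on_failback:
        return "replace the original Main SC"
    # Assumption: the procedure does not say what to do in this case.
    return "outcome not covered by Step 2"


print(step2_next_action(dual_sc=False, errors_persist=False))
print(step2_next_action(dual_sc=True, errors_persist=False,
                        errors_return_on_failback=True))
```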
3. Confirm that you are able to determine the suspect list for this issue and prioritize which suspect is most likely to be root cause.
Document 1019649.1
How to determine the suspect list for Sun Fire[TM] Midrange Server interconnect errors.
4. Verify that the primary FRU is NOT defective (primary FRU determined by the results of Step 3).
- If a System Board or I/O Board is implicated, it can be verified as defective two different ways:
- By replacing the board.
- By having a Sun engineer move the suspect board into an empty slot or switch it with another board in the domain and observe the behavior.
- If the board works in the alternate slot, the RP or the board slot (CP) is implicated (proceed to Step 5).
- If the board fails to work in the alternate slot, the board is defective, so replace it.
- If a Repeater (RP) is implicated, it can be verified as defective two different ways:
- By having a Sun engineer switch the suspect RP with an alternate RP in the same system and observe the behavior.
- If the error follows the RP to its new location, then the RP is defective, so replace it.
- If the failure remains at the old RP's slot, then the Centerplane is suspect.
- The Sun engineer performing any replacement or moving any hardware should be extremely careful to inspect the board and CP pins and sockets.
Document 1019218.1
Sun Fire[TM] Midrange Servers: How to identify pin or socket damage.
5. Verify that the secondary FRU is not defective (secondary FRU determined by the results of Step 3).
- If a System Board or I/O Board is implicated, it can be verified as defective two different ways:
- By replacing the board.
- By having a Sun engineer move the suspect board into an empty slot or switch it with another board in the domain and observe the behavior.
- If the board works in the alternate slot, the RP or the board slot (CP) is implicated (proceed to Step 6).
- If the board fails to work in the alternate slot, the board is defective, so replace it.
- If a Repeater (RP) is implicated, it can be verified as defective two different ways:
- By having a Sun engineer switch the suspect RP with an alternate RP in the same system and observe the behavior.
- If the error follows the RP to its new location, then the RP is defective, so replace it.
- If the failure remains at the old RP's slot, then the Centerplane is suspect.
- The Sun engineer performing any replacement or moving any hardware should be extremely careful to inspect the board and CP pins and sockets.
Document 1019218.1
Sun Fire[TM] Midrange Servers: How to identify pin or socket damage.
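The swap tests in Steps 4 and 5 follow the same pattern: move the suspect FRU and see whether the fault moves with it. A minimal Python sketch of that interpretation is shown here. The function name and its arguments are hypothetical, and the sketch covers only the outcomes listed in the steps above.

```python
def interpret_swap_test(fru, fault_follows_fru):
    """Interpret the result of moving a suspect FRU to another slot.

    fru: "board" for a System Board or I/O Board, "repeater" for an RP.
    fault_follows_fru: True if the failure appears at the FRU's new
        location, False if the FRU works there (board) or the failure
        stays at the old slot (repeater).
    """
    if fru not in ("board", "repeater"):
        raise ValueError("fru must be 'board' or 'repeater'")
    if fault_follows_fru:
        # The failure moved with the hardware, so the hardware is defective.
        return "replace the " + fru
    if fru == "board":
        return "RP or board slot (CP) is implicated"
    return "Centerplane is suspect"


print(interpret_swap_test("board", fault_follows_fru=True))
print(interpret_swap_test("repeater", fault_follows_fru=False))
```

Either way, the pins and sockets on the board and the Centerplane should still be inspected as described in Document 1019218.1 whenever hardware is moved.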
6. Collaborate with TSC prior to proceeding to a Centerplane replacement.
- When collaborating with TSC, make sure to have console data, Explorer data, and a detailed explanation of what has been replaced, and when, available.
- Most likely the Centerplane will have to be replaced, but TSC will want to make absolutely sure that nothing has been overlooked before proceeding to this invasive replacement action.
NOTE: The testinterconnect command can be utilized to test board interconnections if you obtain a service mode password (setkeyswitch on also accomplishes this testing). For details on testinterconnect command usage refer to
Document 1005014.1
interconnect, Interconnect, interconnect test, interconnection,
testinterconnect, Service action required, failure, POST, normalized, Mapped, Global_Oring
References
- Document 1003245.1 - Sun Fire[TM] 3800-6900: System Controller failover functionality
- Document 1003529.1 - Procedure to manually collect System Controller (SC) level failure data on Sun Fire[TM] v1280, E2900, 3800, 4800, E4900, 6800, E6900, and Netra 1280, 1290 servers.
- Document 1012388.1 - Sun Fire[TM] v1280, E2900, and Netra 1280, 1290: Lights Out Management (LOM) Quick Command Reference
- Document 1019066.1 - Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900 and Netra[TM] 1280, 2900 servers: How to collect scextended or 1280extended Explorer
- Document 1019218.1 - Sun Fire[TM] Midrange Servers: How to identify pin or socket damage.
- Document 1019649.1 - Sun Fire[TM] 3800, v1280, E2900, 4800, 4810, E4900, 6800 and E6900 and Netra 1280, 1290: How to determine the suspect list for interconnect errors.