Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1672184.1
Update Date: 2017-02-24
Keywords:

Solution Type  Problem Resolution Sure

Solution  1672184.1 :   SunFire[TM] 12K/15K/E20K/E25K:System Controller failover because SCPER1 is being deconfigured  


Related Items
  • Sun Fire 12K Server
  • Sun Fire 15K Server
  • Sun Fire E25K Server
  • Sun Fire E20K Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-Exxk
  • _Old GCS Categories>Announcements>All Product Lines>Support Systems




In this Document
Symptoms
Changes
Cause
Solution
References


Applies to:

Sun Fire E25K Server - Version All Versions to All Versions [Release All Releases]
Sun Fire 12K Server - Version All Versions to All Versions [Release All Releases]
Sun Fire E20K Server - Version All Versions to All Versions [Release All Releases]
Sun Fire 15K Server - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

The following messages appear when power is failing on the SC's peripheral board (SCPER):


failing SC:

Apr 12 13:06:02 2014 localhost-sc1-hme0 esmd[25768]: [1920 9749903521434542 ERR DetectorV.cc 613] A high voltage has been detected on 3.3VHK, located on SCPER1. The voltage detected is 4.53v; should be 3.00v to 3.50v. SCPER1 is being deconfigured and powered off. Check all hardware for the cause.

Apr 12 13:12:56 2014 localhost-sc1-hme0 ssd[2106]: [1319 119715124649 NOTICE SSDWorkArea.cc 38] ssd output: SMS 1.6 start-up initiated
Apr 12 13:12:56 2014 localhost-sc1-hme0 ssd[2106]: [1319 119759223543 NOTICE SSDWorkArea.cc 38] ssd output: SC POST results:  'CP1500 POST Passed; SSCPOST v1.25 Passed'
Apr 12 13:12:56 2014 localhost-sc1-hme0 ssd[2106]: [1304 119852261626 NOTICE StartupManager.cc 2744] software component start-up initiated: name=hwad
Apr 12 13:12:58 2014 localhost-sc1-hme0 ssd[2106]: [1304 121402054917 NOTICE StartupManager.cc 2744] software component start-up initiated: name=mand
Apr 12 13:12:59 2014 localhost-sc1-hme0 ssd[2106]: [1304 121921660404 NOTICE StartupManager.cc 2744] software component start-up initiated: name=frad
Apr 12 13:12:59 2014 localhost-sc1-hme0 ssd[2106]: [1304 122441777767 NOTICE StartupManager.cc 2744] software component start-up initiated: name=fomd

[Other messages may also appear that are only routine SC reporting messages, e.g.:
Apr 12 13:13:05 2014 localhost-sc1-hme0 mld[2079]: [9128 128060481973 NOTICE LogCleanup.cc 336] Removing old log: /var/opt/SUNWSMS/SMS1.6/adm/G/dump/dsmd.rstop.130325.0755.19.
These may be ignored. The relevant messages are the ones shown above and below this note.]

Apr 12 13:13:11 2014 localhost-sc1-hme0 fomd[2137]: [8600 134481139721 NOTICE FailoverMgr.cc 2842] Heartbeat interrupt detected
Apr 12 13:13:12 2014 localhost-sc1-hme0 ssd[2106]: [1320 135459079581 NOTICE StartupManager.cc 423] SMS software startup complete.
Apr 12 13:13:12 2014 localhost-sc1-hme0 fomd[2137]: [8563 135561498788 NOTICE FOConfig.cc 204] Failed to configure the logical interface - the interface may have already been removed, please check (ecode = -1)
Apr 12 13:13:12 2014 localhost-sc1-hme0 fomd[2137]: [8577 135562418705 NOTICE FailoverMgr.cc 3226] SC configured as Spare
Apr 12 13:13:12 2014 localhost-sc1-hme0 fomd[2137]: [8624 135737259010 NOTICE FMI2NetTest.cc 148] Remote SC is running SMS 1.6
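
When triaging a platform log, it can help to pull the affected rail, the location, and the measured value out of the esmd voltage message automatically. The following is a minimal Python sketch, not an Oracle tool. The regular expression is modeled only on the single esmd message quoted above, and the function name `parse_voltage_event` and the rule that a location starting with "SCPER" means the peripheral board are assumptions made for illustration.

```python
import re

# Pattern modeled only on the single esmd message quoted in this document.
# Other SMS releases or detectors may word the message differently (assumption).
VOLTAGE_RE = re.compile(
    r"A (?P<kind>\w+) voltage has been detected on (?P<rail>\S+), "
    r"located on (?P<location>\S+)\. "
    r"The voltage detected is (?P<measured>[\d.]+)v; "
    r"should be (?P<low>[\d.]+)v to (?P<high>[\d.]+)v\."
)


def parse_voltage_event(line):
    """Return a dict describing the voltage event, or None if the line does not match."""
    match = VOLTAGE_RE.search(line)
    if match is None:
        return None
    measured = float(match.group("measured"))
    low = float(match.group("low"))
    high = float(match.group("high"))
    location = match.group("location")
    return {
        "kind": match.group("kind"),
        "rail": match.group("rail"),
        "location": location,
        "measured": measured,
        "allowed": (low, high),
        "out_of_range": not (low <= measured <= high),
        # Hypothetical naming rule: a location starting with "SCPER" is
        # treated as the SC peripheral board rather than the SC itself.
        "on_peripheral_board": location.startswith("SCPER"),
    }


# The esmd message from the failing SC, as quoted in this document.
SAMPLE_LINE = (
    "Apr 12 13:06:02 2014 localhost-sc1-hme0 esmd[25768]: "
    "[1920 9749903521434542 ERR DetectorV.cc 613] "
    "A high voltage has been detected on 3.3VHK, located on SCPER1. "
    "The voltage detected is 4.53v; should be 3.00v to 3.50v. "
    "SCPER1 is being deconfigured and powered off. Check all hardware for the cause."
)

event = parse_voltage_event(SAMPLE_LINE)
print(event)
```

For the quoted message, the sketch reports rail 3.3VHK at location SCPER1 with a measured value outside the allowed range, which points at the peripheral board rather than the SC.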

 

redundant SC:
On the SC that has now taken over the main role, the log shows that the failing SC did not respond, so a failover was triggered:

Apr 12 13:06:44 2014 localhost-sc0-hme0 fomd[2240]: [8609 9228638330146745 ERR RemoteSC.cc 964] Remote SC call failed: RPC: Timed out
Apr 12 13:06:44 2014 localhost-sc0-hme0 fomd[2240]: [8569 9228638332467521 NOTICE FailoverMgr.cc 1377] The I2 network test FAILED
Apr 12 13:07:44 2014 localhost-sc0-hme0 fomd[2240]: [8612 9228698351571928 ERR FOHASram.cc 1824] Timeout waiting for response from remote SC
Apr 12 13:07:44 2014 localhost-sc0-hme0 fomd[2240]: [8569 9228698352973495 NOTICE FailoverMgr.cc 1377] The HASRAM network test FAILED
Apr 12 13:07:44 2014 localhost-sc0-hme0 fomd[2240]: [8599 9228698353719138 NOTICE FMHeartbeat.cc 223] Checking for SC heartbeat interrupts (can take up to 50 seconds) ...
Apr 12 13:08:09 2014 localhost-sc0-hme0 fomd[2240]: [8582 9228722870838730 NOTICE FailoverMgr.cc 5256] Not detecting remote SC's heartbeat interrupts
Apr 12 13:08:09 2014 localhost-sc0-hme0 fomd[2240]: [8574 9228722872157503 NOTICE FailoverMgr.cc 2297] Taking over main role because remote SC is unresponsive or down
Apr 12 13:08:09 2014 localhost-sc0-hme0 fomd[2240]: [8519 9228722873273673 NOTICE FailoverMgr.cc 2631] Failover deactivated
Apr 12 13:08:14 2014 localhost-sc0-hme0 fomd[2240]: [8570 9228728150848495 NOTICE FailoverMgr.cc 2356] Reset the remote SC
Apr 12 13:08:46 2014 localhost-sc0-hme0 hwad[2171]: [50144 9228760130438198 NOTICE DevPresent.cc 1172] Changed clock sources.
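
To see how long the redundant SC took to go from the first failed remote call to taking over the main role, the timestamps of the fomd messages can be compared. The following Python sketch is illustrative only. It assumes the timestamp layout seen in the excerpt above (month, day, time, year as the first four fields), and the function names and the two marker strings used to pick out the start and end events are choices made for this example.

```python
from datetime import datetime

# Marker substrings taken from the fomd messages quoted in this document.
FIRST_FAILURE = "Remote SC call failed"
TAKEOVER = "Taking over main role"


def timestamp(line):
    """Parse the leading 'Mon DD HH:MM:SS YYYY' fields of a platform log line."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(stamp, "%b %d %H:%M:%S %Y")


def seconds_to_takeover(lines):
    """Seconds between the first failed remote SC call and the takeover, or None."""
    start = next((timestamp(l) for l in lines if FIRST_FAILURE in l), None)
    end = next((timestamp(l) for l in lines if TAKEOVER in l), None)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Two of the fomd messages from the redundant SC, as quoted in this document.
LOG_LINES = [
    "Apr 12 13:06:44 2014 localhost-sc0-hme0 fomd[2240]: "
    "[8609 9228638330146745 ERR RemoteSC.cc 964] Remote SC call failed: RPC: Timed out",
    "Apr 12 13:08:09 2014 localhost-sc0-hme0 fomd[2240]: "
    "[8574 9228722872157503 NOTICE FailoverMgr.cc 2297] "
    "Taking over main role because remote SC is unresponsive or down",
]

print(seconds_to_takeover(LOG_LINES))  # 85.0
```

In this excerpt the takeover follows the first RPC timeout by 85 seconds, covering the I2 network test, the HASRAM network test, and the heartbeat check.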


Changes

No changes have been made to the platform.

Cause

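The voltage event occurs on the System Controller Peripheral Board (SCPER) of the affected SC. In the example above, esmd reports 4.53v on the 3.3VHK rail of SCPER1, outside the allowed 3.00v to 3.50v range.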

Solution

The solution is to replace the System Controller Peripheral Board on the failing System Controller.

Do not replace the SC itself.

Note the difference from Doc 1583980.1, in which the power failure was on the SC itself.

References

<NOTE:1583980.1> - SunFire[TM] 12K/15K/E20K/E25K:System Controller Is Down
<NOTE:1001320.1> - SMS DC Power Supply Voltage Monitoring Flaw May Expose a Domain to Outage

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.