Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

2232459.1 : T7-2 - FRUID access failure

Server reporting FRUID errors is unable to boot
Oracle Confidential PARTNER - Available to partners (SUN).

Applies to:
SPARC T7-2 - Version All Versions and later
Oracle Solaris on SPARC (64-bit)

Symptoms
A T7-2 server crashed and was unable to boot, reporting FRUID access failure.

The first visible signs of the issue were noted on 'fmdump -e' output, hours before the crash:

2017-02-03/01:56:01 ereport.fruid.writefail@/SYS/MB
2017-02-03/14:56:33 ereport.fruid.writefail@/SYS/DBP

These were accompanied by FRUID errors in the SP trace file @coredump@sp_trace@logs@CRIT.log:

LIBFRU CRITICAL 2017-02-01 16:11:44.779639 1587 frucache.c:889 /SYS/DBP, capidirectfe_write_fru_direct() failed, status=-1, offset=000007ab, count=4
LIBFRU CRITICAL 2017-02-01 16:11:46.884166 1587 frucache.c:889 /SYS/DBP, capidirectfe_write_fru_direct() failed, status=-1, offset=000006a4, count=4
LIBFRU CRITICAL 2017-02-01 16:11:46.884785 1587 sun_file_access.c:2461 check /SYS/DBP fruid, error in hardware
LIBFRU CRITICAL 2017-02-01 16:11:49.131451 1587 frucache.c:889 /SYS/FANBD, capidirectfe_write_fru_direct() failed, status=-1, offset=000007ab, count=4
LIBFRU CRITICAL 2017-02-01 16:11:52.259782 1587 frucache.c:889 /SYS/FANBD, capidirectfe_write_fru_direct() failed, status=-1, offset=000006a4, count=4
LIBFRU CRITICAL 2017-02-01 16:11:52.260371 1587 sun_file_access.c:2461 check /SYS/FANBD fruid, error in hardware
LIBFRU CRITICAL 2017-02-01 16:11:53.300464 1587 frucache.c:889 /SYS/MB, capidirectfe_write_fru_direct() failed, status=-1, offset=000006ac, count=254
LIBFRU CRITICAL 2017-02-01 16:11:54.329394 1587 frucache.c:889 /SYS/MB, capidirectfe_write_fru_direct() failed, status=-1, offset=000007ab, count=4
LIBFRU CRITICAL 2017-02-01 16:11:58.520436 1587 sun_file_access.c:2461 check /SYS/MB fruid, error in hardware
LIBFRU CRITICAL 2017-02-01 16:12:01.349682 1587 frucache.c:889 /SYS/MB/CM0/CMP/MR0, capidirectfe_write_fru_direct() failed, status=-1, offset=000006ac, count=254
LIBFRU CRITICAL 2017-02-01 16:12:02.719661 1587 frucache.c:889 /SYS/MB/CM0/CMP/MR0, capidirectfe_write_fru_direct() failed, status=-1, offset=000007ab, count=4
LIBFRU CRITICAL 2017-02-01 16:12:03.089826 1587 frucache.c:889 /SYS/MB/CM0/CMP/MR0, capidirectfe_write_fru_direct() failed, status=-1, offset=00000693, count=16
LIBFRU CRITICAL 2017-02-01 16:12:03.449674 1587 frucache.c:889 /SYS/MB/CM0/CMP/MR0, capidirectfe_write_fru_direct() failed, status=-1, offset=000006a4, count=4
LIBFRU CRITICAL 2017-02-01 16:12:03.450269 1587 sun_file_access.c:2461 check /SYS/MB/CM0/CMP/MR0 fruid, error in hardware
...snip...
LIBFRU CRITICAL 2017-02-02 20:14:43.003704 1587 sun_file_access.c:2461 check /SYS/DBP fruid, error in hardware
LIBFRU CRITICAL 2017-02-03 01:56:01.388245 8846 frucache.c:727 restore /SYS/MB: capidirectfe_write_fru_direct() failed ret: -1 bytes: 1280
LIBFRU CRITICAL 2017-02-03 01:56:01.388560 8846 frucache.c:734 /SYS/MB FRUID is corrupted.
LIBFRU CRITICAL 2017-02-03 04:15:46.333176 1587 frucache.c:889 /SYS/MB, capidirectfe_write_fru_direct() failed, status=-1, offset=00000000, count=6
LIBFRU CRITICAL 2017-02-03 04:15:47.363345 1587 frucache.c:889 /SYS/MB, capidirectfe_write_fru_direct() failed, status=-1, offset=00000006, count=40
...snip...
LIBFRU CRITICAL 2017-02-03 14:56:33.310096 5577 frucache.c:727 restore /SYS/DBP: capidirectfe_write_fru_direct() failed ret: -1 bytes: 1792
LIBFRU CRITICAL 2017-02-03 14:56:33.310347 5577 frucache.c:734 /SYS/DBP FRUID is corrupted.
LIBFRU CRITICAL 2017-02-03 16:16:38.698445 1587 frucache.c:889 /SYS/DBP, capidirectfe_write_fru_direct() failed, status=-1, offset=00000006, count=50
LIBFRU CRITICAL 2017-02-03 16:16:39.978307 1587 frucache.c:889 /SYS/DBP, capidirectfe_write_fru_direct() failed, status=-1, offset=00000562, count=144

Later, the host crashed due to what was wrongly interpreted as a sudden removal of the disk backplane (DBP), as seen in the event log:

475 Fri Feb 3 19:14:13 2017 Power Off major Power to /SYS has been turned off by: SP, Reason: Fault
474 Fri Feb 3 19:14:12 2017 Chassis Action major Hot removal of /SYS/DBP <<<===

After that, more errors appeared in the 'fmdump -e' output:

2017-02-03/19:13:28 ereport.component.absent@/SYS/DBP detector = /SYS/DBP/PRSNT hidden = true
2017-02-03/19:15:21 ereport.component.present@/SYS/DBP detector = /SYS/DBP/PRSNT hidden = true
2017-02-03/19:16:01 ereport.fruid.inaccessible@/SYS/DBP
2017-02-03/20:26:59 ereport.hc.dev_fault@/SYS/MB/CM1/CMP/MR0/FRUID reason = FRUID access failure
2017-02-03/20:26:59 ereport.hc.dev_fault@/SYS/MB/CM0/CMP/MR0/FRUID reason = FRUID access failure
2017-02-03/20:27:01 ereport.hc.dev_fault@/SYS/MB/CM0/CMP/MR1/FRUID reason = FRUID access failure
...snip...
2017-02-03/20:27:51 ereport.hc.dev_fault@/SYS/MB/CM1/CMP/MR2/FRUID reason = FRUID access failure
2017-02-03/20:27:53 ereport.hc.dev_fault@/SYS/MB/CM0/CMP/MR3/FRUID reason = FRUID access failure
2017-02-03/20:27:53 ereport.hc.dev_fault@/SYS/MB/CM1/CMP/MR3/FRUID reason = FRUID access failure
2017-02-03/20:28:42 ereport.hc.dev_info@/SYS/MB/CM0/CMP/MCU0 reason = Mem init retry: 1
2017-02-03/20:28:42 ereport.hc.dev_info@/SYS/MB/CM0/CMP/MCU1 reason = Mem init retry: 1
2017-02-03/20:28:42 ereport.hc.dev_info@/SYS/MB/CM0/CMP/MCU2 reason = Mem init retry: 1
...snip...
2017-02-03/20:30:45 ereport.hc.component_disabled@/SYS/MB/CM1/CMP reason = Socket has no usable memory
2017-02-03/20:30:45 ereport.hc.component_disabled@/SYS/MB/CM0/CMP reason = Socket has no usable memory
2017-02-03/20:30:46 ereport.hc.abort@/SYS/MB/CM0/CMP reason = No active CMPs
2017-02-03/20:56:07 ereport.hc.dev_fault@/SYS/MB/CM1/CMP/MR0/FRUID reason = FRUID access failure
2017-02-03/20:56:08 ereport.hc.dev_fault@/SYS/MB/CM0/CMP/MR0/FRUID reason = FRUID access failure
2017-02-03/20:56:09 ereport.hc.dev_fault@/SYS/MB/CM1/CMP/MR1/FRUID reason = FRUID access failure
...snip...
2017-02-03/20:56:57 ereport.hc.dev_fault@/SYS/MB/CM1/CMP/MR2/FRUID reason = FRUID access failure
2017-02-03/20:56:58 ereport.hc.dev_fault@/SYS/MB/CM0/CMP/MR3/FRUID reason = FRUID access failure
2017-02-03/20:56:58 ereport.hc.dev_fault@/SYS/MB/CM1/CMP/MR3/FRUID reason = FRUID access failure
2017-02-03/20:57:42 ereport.hc.component_disabled@/SYS/MB/CM0/CMP/MR0/BOB0/CH0/DIMM reason = BOB not usable
2017-02-03/20:57:42 ereport.hc.component_disabled@/SYS/MB/CM0/CMP/MR0/BOB0/CH1/DIMM reason = BOB not usable
2017-02-03/20:57:42 ereport.hc.component_disabled@/SYS/MB/CM0/CMP/MR0/BOB1/CH0/DIMM reason = BOB not usable
...snip...
2017-02-03/20:57:47 ereport.hc.component_disabled@/SYS/MB/CM1/CMP/MR3/BOB0/CH1/DIMM reason = BOB not usable
2017-02-03/20:57:47 ereport.hc.component_disabled@/SYS/MB/CM1/CMP/MR3/BOB1/CH0/DIMM reason = BOB not usable
2017-02-03/20:57:47 ereport.hc.component_disabled@/SYS/MB/CM1/CMP/MR3/BOB1/CH1/DIMM reason = BOB not usable
2017-02-03/20:57:53 ereport.hc.component_disabled@/SYS/MB/CM1/CMP reason = Socket has no usable memory
2017-02-03/20:57:53 ereport.hc.component_disabled@/SYS/MB/CM0/CMP reason = Socket has no usable memory
2017-02-03/20:57:54 ereport.hc.abort@/SYS/MB/CM0/CMP reason = No active CMPs

The following was then seen in the 'capiasrtest ddb' output (available in the 'ilom' folder of the ILOM snapshot):

$ more @usr@local@bin@capiasrtest_ddb.out
/SYS/MB/CM1/CMP/MR0 State:0x08 CHILD_AFFECTED
/SYS/MB/CM1/CMP State:0x0a STATE_DISABLED CHILD_AFFECTED
/SYS/MB/CM1 State:0x08 CHILD_AFFECTED
/SYS/MB State:0x08 CHILD_AFFECTED
/SYS/MB/CM0/CMP/MR0 State:0x08 CHILD_AFFECTED
/SYS/MB/CM0/CMP State:0x0a STATE_DISABLED CHILD_AFFECTED
/SYS/MB/CM0 State:0x08 CHILD_AFFECTED
/SYS/MB/CM0/CMP/MR1 State:0x08 CHILD_AFFECTED
/SYS/MB/CM1/CMP/MR1 State:0x08 CHILD_AFFECTED
/SYS/MB/CM0/CMP/MR2 State:0x08 CHILD_AFFECTED
/SYS/MB/CM1/CMP/MR2 State:0x08 CHILD_AFFECTED
/SYS/MB/CM0/CMP/MR3 State:0x08 CHILD_AFFECTED
/SYS/MB/CM1/CMP/MR3 State:0x08 CHILD_AFFECTED
/SYS/MB/CM0/CMP/MR0/BOB0 State:0x4e STATE_DISABLED FAULTED CHILD_AFFECTED RETIRE
/SYS/MB/CM0/CMP/MR1/BOB0 State:0x4e STATE_DISABLED FAULTED CHILD_AFFECTED RETIRE
...snip...
/SYS/MB/CM1/CMP/MR1/BOB1 State:0x4e STATE_DISABLED FAULTED CHILD_AFFECTED RETIRE
/SYS/MB/CM1/CMP/MR2/BOB1 State:0x4e STATE_DISABLED FAULTED CHILD_AFFECTED RETIRE
/SYS/MB/CM1/CMP/MR3/BOB1 State:0x4e STATE_DISABLED FAULTED CHILD_AFFECTED RETIRE
/SYS/MB/CM1/CMP/MR0/FRUID State:0x02 STATE_DISABLED
/SYS/MB/CM0/CMP/MR0/FRUID State:0x02 STATE_DISABLED
/SYS/MB/CM1/CMP/MR1/FRUID State:0x02 STATE_DISABLED
/SYS/MB/CM0/CMP/MR1/FRUID State:0x02 STATE_DISABLED
/SYS/MB/CM1/CMP/MR2/FRUID State:0x02 STATE_DISABLED
/SYS/MB/CM0/CMP/MR2/FRUID State:0x02 STATE_DISABLED
/SYS/MB/CM1/CMP/MR3/FRUID State:0x02 STATE_DISABLED
/SYS/MB/CM0/CMP/MR3/FRUID State:0x02 STATE_DISABLED
/SYS/MB/CM0/CMP/MR0/BOB0/CH0/DIMM State:0x02 STATE_DISABLED
/SYS/MB/CM0/CMP/MR0/BOB0/CH0 State:0x08 CHILD_AFFECTED
/SYS/MB/CM0/CMP/MR0/BOB0/CH1/DIMM State:0x02 STATE_DISABLED
/SYS/MB/CM0/CMP/MR0/BOB0/CH1 State:0x08 CHILD_AFFECTED
...snip...
/SYS/MB/CM1/CMP/MR3/BOB1/CH0/DIMM State:0x02 STATE_DISABLED
/SYS/MB/CM1/CMP/MR3/BOB1/CH0 State:0x08 CHILD_AFFECTED
/SYS/MB/CM1/CMP/MR3/BOB1/CH1/DIMM State:0x02 STATE_DISABLED
/SYS/MB/CM1/CMP/MR3/BOB1/CH1 State:0x08 CHILD_AFFECTED

The host was rendered unbootable because the memory riser FRUIDs could not be accessed:

@(#)Hostconfig 1.8.4 2016/12/08 06:16
2017-02-03 22:37:09 1:00:0> ERROR: /SYS/MB/CM1/CMP/MR0/FRUID: FRUID access failure
2017-02-03 22:37:09 0:00:0> ERROR: /SYS/MB/CM0/CMP/MR0/FRUID: FRUID access failure
2017-02-03 22:37:13 0:00:0> ERROR: /SYS/MB/CM0/CMP/MR1/FRUID: FRUID access failure
2017-02-03 22:37:13 1:00:0> ERROR: /SYS/MB/CM1/CMP/MR1/FRUID: FRUID access failure
2017-02-03 22:37:16 0:00:0> ERROR: /SYS/MB/CM0/CMP/MR2/FRUID: FRUID access failure
2017-02-03 22:37:17 1:00:0> ERROR: /SYS/MB/CM1/CMP/MR2/FRUID: FRUID access failure
2017-02-03 22:37:19 0:00:0> ERROR: /SYS/MB/CM0/CMP/MR3/FRUID: FRUID access failure
2017-02-03 22:37:20 1:00:0> ERROR: /SYS/MB/CM1/CMP/MR3/FRUID: FRUID access failure
...snip...
2017-02-03 20:41:14 0:00:0> WARNING: Running with a nonstandard DIMM configuration. Refer to service document for details.
2017-02-03 20:41:14 1:00:0> ERROR: /SYS/MB/CM1/CMP/MR2/BOB1/CH1/DIMM: BOB not usable. Not configured
2017-02-03 20:41:14 1:00:0> ERROR: /SYS/MB/CM1/CMP/MR3/BOB1/CH0/DIMM: BOB not usable. Not configured
2017-02-03 20:41:14 1:00:0> ERROR: /SYS/MB/CM1/CMP/MR3/BOB1/CH1/DIMM: BOB not usable. Not configured
2017-02-03 20:41:14 1:00:0> WARNING: Running with a nonstandard DIMM configuration. Refer to service document for details.
2017-02-03 20:41:22 1:00:0> ERROR: /SYS/MB/CM1/CMP: Socket has no usable memory. Not configured
2017-02-03 20:41:22 0:00:0> ERROR: /SYS/MB/CM0/CMP: Socket has no usable memory. Not configured
2017-02-03 20:41:22 1:00:0> NOTICE: Idling self
2017-02-03 20:41:22 0:00:0> NOTICE: Idling self
2017-02-03 20:41:22 0:00:0> FATAL: No active CMPs
2017-02-03 20:41:22 0:00:0> NOTICE: Waiting for poweroff or powercycle from the SP
2017-02-03 20:32:07 SP> NOTICE: HOST0 cannot be restarted. Reason: No configurable CPUs

Probing the i2c bus with 'i2ctest -v' (restricted shell) failed on most components.

-> set SESSION mode=restricted
WARNING: The "Restricted Shell" account is provided solely to allow Services to perform diagnostic tasks.
[(restricted_shell) t7cooxupe4-sp:~]# i2ctest -v
Starting I2C Sub System Test (i2ctest)...
TESTING /SYS/DBP/5V0_VCC_OBPS.PMBUS_CMD (T:0 0x88)...
/SYS/DBP/5V0_VCC_OBPS.PMBUS_CMD (T:0 0x88) SKIPPED
TESTING /SYS/DBP/DBP_JTAG_GPIO.EXPANDER (T:0 0x80)...
/SYS/DBP/DBP_JTAG_GPIO.EXPANDER (T:0 0x80) Read FAILED
...snip...
TESTING /SYS/DBP/FRUID.FRU_PROM (T:0 0x84)...
data1: 8 0 1 1A, data2: 8 0 1 1A
Read Data Compare PASSED
FRU CRC verification...
Checking /SYS/DBP: NOT OK
FRU FAIL
/SYS/DBP/FRUID.FRU_PROM (T:0 0x84) FRU CRC verification FAILED
/SYS/DBP/FRUID.FRU_PROM (T:0 0x84) FAILED
...snip/condensed MR fault output...
/SYS/MB/CM1/CMP/MR0/BOB0_TEMP.TEMP_SENSOR (T:2 0x5c) Read FAILED
/SYS/MB/CM1/CMP/MR0/BOB1_TEMP.TEMP_SENSOR (T:2 0x5e) Read FAILED
/SYS/MB/CM1/CMP/MR0/DIMM_LED_CTRL.EXPANDER (T:2 0x5a) Read FAILED
/SYS/MB/CM1/CMP/MR0/DIMM_PRSNT_GPIO.EXPANDER (T:2 0x58) Read FAILED
/SYS/MB/CM1/CMP/MR0/FRUID.FRU_PROM (T:2 0x54) Read FAILED
/SYS/MB/CM1/CMP/MR0/FRU_WREN_GPIO.EXPANDER (T:2 0x56) Read FAILED
/SYS/MB/CM1/CMP/MR0/INLET_TEMP.TEMP_SENSOR (T:2 0x4e) Read FAILED
/SYS/MB/CM1/CMP/MR0/MR_CURRENT.MONITOR (T:2 0x52) Read FAILED
/SYS/MB/CM1/CMP/MR0/OUTLET_TEMP.TEMP_SENSOR (T:2 0x50) Read FAILED
/SYS/MB/CM1/CMP/MR0/PWR_MGR.PMBUS (T:1 0x36) Read FAILED
/SYS/MB/CM1/CMP/MR1/BOB0_TEMP.TEMP_SENSOR (T:2 0x6e) Read FAILED
/SYS/MB/CM1/CMP/MR1/BOB1_TEMP.TEMP_SENSOR (T:2 0x70) Read FAILED
/SYS/MB/CM1/CMP/MR1/DIMM_LED_CTRL.EXPANDER (T:2 0x6c) Read FAILED
/SYS/MB/CM1/CMP/MR1/DIMM_PRSNT_GPIO.EXPANDER (T:2 0x6a) Read FAILED
/SYS/MB/CM1/CMP/MR1/FRUID.FRU_PROM (T:2 0x66) Read FAILED
/SYS/MB/CM1/CMP/MR1/FRU_WREN_GPIO.EXPANDER (T:2 0x68) Read FAILED
/SYS/MB/CM1/CMP/MR1/INLET_TEMP.TEMP_SENSOR (T:2 0x60) Read FAILED
/SYS/MB/CM1/CMP/MR1/MR_CURRENT.MONITOR (T:2 0x64) Read FAILED
/SYS/MB/CM1/CMP/MR1/OUTLET_TEMP.TEMP_SENSOR (T:2 0x62) Read FAILED
/SYS/MB/CM1/CMP/MR1/PWR_MGR.PMBUS (T:1 0x42) Read FAILED
/SYS/MB/CM1/CMP/MR2/BOB0_TEMP.TEMP_SENSOR (T:2 0x80) Read FAILED
/SYS/MB/CM1/CMP/MR2/BOB1_TEMP.TEMP_SENSOR (T:2 0x82) Read FAILED
/SYS/MB/CM1/CMP/MR2/DIMM_LED_CTRL.EXPANDER (T:2 0x7e) Read FAILED
/SYS/MB/CM1/CMP/MR2/DIMM_PRSNT_GPIO.EXPANDER (T:2 0x7c) Read FAILED
/SYS/MB/CM1/CMP/MR2/FRUID.FRU_PROM (T:2 0x78) Read FAILED
/SYS/MB/CM1/CMP/MR2/FRU_WREN_GPIO.EXPANDER (T:2 0x7a) Read FAILED
/SYS/MB/CM1/CMP/MR2/INLET_TEMP.TEMP_SENSOR (T:2 0x72) Read FAILED
/SYS/MB/CM1/CMP/MR2/MR_CURRENT.MONITOR (T:2 0x76) Read FAILED
/SYS/MB/CM1/CMP/MR2/OUTLET_TEMP.TEMP_SENSOR (T:2 0x74) Read FAILED
/SYS/MB/CM1/CMP/MR2/PWR_MGR.PMBUS (T:1 0x4e) Read FAILED
/SYS/MB/CM1/CMP/MR3/BOB0_TEMP.TEMP_SENSOR (T:2 0x92) Read FAILED
/SYS/MB/CM1/CMP/MR3/BOB1_TEMP.TEMP_SENSOR (T:2 0x94) Read FAILED
/SYS/MB/CM1/CMP/MR3/DIMM_LED_CTRL.EXPANDER (T:2 0x90) Read FAILED
/SYS/MB/CM1/CMP/MR3/DIMM_PRSNT_GPIO.EXPANDER (T:2 0x8e) Read FAILED
/SYS/MB/CM1/CMP/MR3/FRUID.FRU_PROM (T:2 0x8a) Read FAILED
/SYS/MB/CM1/CMP/MR3/FRU_WREN_GPIO.EXPANDER (T:2 0x8c) Read FAILED
/SYS/MB/CM1/CMP/MR3/INLET_TEMP.TEMP_SENSOR (T:2 0x84) Read FAILED
/SYS/MB/CM1/CMP/MR3/MR_CURRENT.MONITOR (T:2 0x88) Read FAILED
/SYS/MB/CM1/CMP/MR3/OUTLET_TEMP.TEMP_SENSOR (T:2 0x86) Read FAILED
/SYS/MB/CM1/CMP/MR3/PWR_MGR.PMBUS (T:1 0x5a) Read FAILED
...snip...
TESTING /SYS/MB/CM1_BUSBAR_TEMP.TEMP_SENSOR (T:0 0x22)...
/SYS/MB/CM1_BUSBAR_TEMP.TEMP_SENSOR (T:0 0x22) Read FAILED
...snip...
total I2C Devices Passed: 63, Failed: 83, Skipped: 162 <<<==== note number of failed devices on fault summary
I2C Sub System Test (i2ctest) FAILED
[(restricted_shell) t7cooxupe4-sp:~]#

The fault persisted after replacing the DBP, then the MB, and then the SP.

Cause
Multiple components faulted due to an i2c lockup caused by a faulty component on the i2c bus.

Solution
Considering the parts already replaced and the many pointers to the memory risers, all Yuba Memory Risers (YMRs) were removed and then 'i2ctest -v' was used to isolate the faulty component.
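
When comparing 'i2ctest -v' runs between riser changes, it can help to reduce a saved copy of the output to the fault summary and a per-FRU count of failed devices. The following is a minimal sketch in Python, not an Oracle-supplied tool: the function name summarize_i2ctest and the sample text are hypothetical, and the regular expressions assume the output format matches the excerpts shown above.

```python
import re
from collections import Counter

# Fault summary line, e.g. "total I2C Devices Passed: 63, Failed: 83, Skipped: 162"
SUMMARY_RE = re.compile(r"Passed:\s*(\d+),\s*Failed:\s*(\d+),\s*Skipped:\s*(\d+)")

# Per-device result line, e.g.
# "/SYS/MB/CM1/CMP/MR0/FRUID.FRU_PROM (T:2 0x54) Read FAILED"
# Intermediate lines such as "... FRU CRC verification FAILED" are deliberately
# not matched, so each device is counted once.
FAILED_RE = re.compile(
    r"^(/SYS/\S+)\s+\(T:\d+ 0x[0-9a-fA-F]+\)\s+(?:Read )?FAILED$"
)


def summarize_i2ctest(text):
    """Return ((passed, failed, skipped) or None, Counter of failures per parent FRU)."""
    counts = None
    failed_by_fru = Counter()
    for raw in text.splitlines():
        line = raw.strip()
        summary = SUMMARY_RE.search(line)
        if summary:
            counts = tuple(int(g) for g in summary.groups())
            continue
        failed = FAILED_RE.match(line)
        if failed:
            # Drop the last path element (the device) to get the parent FRU path.
            fru = failed.group(1).rsplit("/", 1)[0]
            failed_by_fru[fru] += 1
    return counts, failed_by_fru


# Hypothetical sample assembled from lines in the output above.
SAMPLE = """\
TESTING /SYS/DBP/DBP_JTAG_GPIO.EXPANDER (T:0 0x80)...
/SYS/DBP/DBP_JTAG_GPIO.EXPANDER (T:0 0x80) Read FAILED
/SYS/DBP/FRUID.FRU_PROM (T:0 0x84) FRU CRC verification FAILED
/SYS/DBP/FRUID.FRU_PROM (T:0 0x84) FAILED
/SYS/MB/CM1/CMP/MR0/FRUID.FRU_PROM (T:2 0x54) Read FAILED
/SYS/MB/CM1/CMP/MR0/PWR_MGR.PMBUS (T:1 0x36) Read FAILED
/SYS/MB/CM1_BUSBAR_TEMP.TEMP_SENSOR (T:0 0x22) Read FAILED
total I2C Devices Passed: 63, Failed: 83, Skipped: 162
"""

if __name__ == "__main__":
    counts, by_fru = summarize_i2ctest(SAMPLE)
    print("passed/failed/skipped:", counts)
    for fru, n in by_fru.most_common():
        print(n, fru)
```

A run that reports a failed count of 0 corresponds to the clean state described in the procedure below; a sudden rise in failures after installing a particular YMR points to that riser as the suspect.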
In a similar situation, if 'i2ctest -v' reports no failed components (Failed: 0) after all YMRs have been removed, install the first YMR and repeat 'i2ctest -v'. Continue installing one YMR at a time and re-testing until you encounter a failure, at which point you may replace the suspect YMR. Then proceed with the test until all YMRs are installed and no i2c faults are detected. At this point you may proceed with starting the HOST again.

IMPORTANT NOTE: Never remove or install a YMR with AC applied to the server. Ensure the AC power cords are disconnected before removing or installing these components.

References
<BUG:25119460> - HIGH RATE OF FRUCACHE ERRORS SEEN ON T7

Attachments
This solution has no attachment