Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2149494.1
Update Date:2017-09-29
Keywords:

Solution Type  Problem Resolution Sure

Solution  2149494.1 :   Sun SPARC Enterprise M3000/M4000/M5000/M8000/M9000 (OPL) Servers: "Fast Data Access MMU Miss" error along with BERR/BTO event during boot may be due to root file-system inconsistencies  


Related Items
  • Sun SPARC Enterprise M8000 Server
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M3000 Server
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M9000-64 Server
  • Sun SPARC Enterprise M5000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx000




In this Document
Symptoms
Cause
Solution
References


Applies to:

Sun SPARC Enterprise M9000-64 Server - Version All Versions and later
Sun SPARC Enterprise M4000 Server - Version All Versions and later
Sun SPARC Enterprise M5000 Server - Version All Versions and later
Sun SPARC Enterprise M3000 Server - Version All Versions and later
Sun SPARC Enterprise M8000 Server - Version All Versions and later
Information in this document applies to any platform.

Symptoms

On rare occasions, a root file-system inconsistency or partial corruption on an OPL domain can prevent the domain from booting. The errors generated during the boot attempts can be misleading and may lead to unnecessary hardware replacement.

BERR and/or BTO events can be reported during boot even if no platform hardware is actually faulty.

This document's solution applies only if XSCF reports no degraded or faulted HW.

The example below was collected on Domain 0 of an M4000 platform:

Jun 15 07:05:48 CEST 2011 {0} ok boot boot-disk
Jun 15 07:05:48 CEST 2011 Boot device: /pci@0,600000/pci@0/pci@9/SUNW,qlc@0/fp@0,0/disk@w202400a0b867732e,0:a File and args: -m verbose
Jun 15 07:05:54 CEST 2011
Jun 15 07:05:54 CEST 2011 ERROR: Last Trap: Fast Data Access MMU Miss
Jun 15 07:05:54 CEST 2011 %TL:1 %TT:68 %TPC:f0038804 %TnPC:f0038808 %TSTATE:886a001600
Jun 15 07:05:54 CEST 2011 %PSTATE:16 ( IE:1 PRIV:1 PEF:1 )
Jun 15 07:05:54 CEST 2011 DSFSR:4280804b ( FV:1 OW:1 PR:1 E:1 TM:1 ASI:80 NC:1 BERR:1 )
Jun 15 07:05:54 CEST 2011 DSFAR:fd8f9000 DSFPAR:413182c07000 D-TAG:0

The logs show a 'Fast Data Access MMU Miss' error along with 'BERR:1', meaning that a bus error response occurred during the boot attempt.

The result is the same when booting from an alternate path to the same boot device (this test is not possible when internal HDDs are used as system drives):

Jun 15 06:34:28 CEST 2011 Boot device: /pci@1,700000/SUNW,qlc@0/fp@0,0/disk@202500a0b867732e,0:a File and args: -m verbose
Jun 15 06:34:28 CEST 2011 QLogic QLE2462 Host Adapter FCode(SPARC): 2.01 03/27/08
Jun 15 06:34:32 CEST 2011 Wait for link up - Firmware version 4.03.01
Jun 15 06:34:32 CEST 2011 ERROR: /pci@1,700000/SUNW,qlc@0/fp@0,0: Last Trap: Fast Data Access MMU Miss
Jun 15 06:34:32 CEST 2011 %TL:1 %TT:68 %TPC:f000a428 %TnPC:f000a42c %TSTATE:446a001600
Jun 15 06:34:32 CEST 2011 %PSTATE:16 ( IE:1 PRIV:1 PEF:1 )
Jun 15 06:34:32 CEST 2011 DSFSR:4280804b ( FV:1 OW:1 PR:1 E:1 TM:1 ASI:80 NC:1 BERR:1 )
Jun 15 06:34:32 CEST 2011 DSFAR:fd8f9000 DSFPAR:413182c07000 D-TAG:fffffffc003c4000

A BRTO event is also possible:

{0} ok boot
Boot device: boot-disk File and args: ""
QLogic QLE2462 Host Adapter FCode(SPARC): 2.01 03/27/08
Firmware version 4.03.01
ERROR: Last Trap: Fast Data Access MMU Miss
%TL:1 %TT:68 %TPC:f0038804 %TnPC:f0038808 %TSTATE:886a001600
%PSTATE:16 ( IE:1 PRIV:1 PEF:1 )
DSFSR:2280804b ( FV:1 OW:1 PR:1 E:1 TM:1 ASI:80 NC:1 BRTO:1 )
DSFAR:fd8f8006 DSFPAR:413180200000 D-TAG:0
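
The signature shown in the three console excerpts above can be searched for in a saved console log. The following is a minimal Python sketch, not an Oracle tool: the function name `find_boot_trap_events` and the returned dictionary keys are hypothetical, and the pattern assumes that the DSFSR line has the same layout as in the excerpts (with or without a leading timestamp).

```python
import re

# Matches the OBP register dump line as printed in the excerpts, e.g.
# "DSFSR:4280804b ( FV:1 OW:1 PR:1 E:1 TM:1 ASI:80 NC:1 BERR:1 )"
DSFSR_RE = re.compile(r"DSFSR:([0-9a-fA-F]+)\s*\(([^)]*)\)")
TRAP_TEXT = "Last Trap: Fast Data Access MMU Miss"


def find_boot_trap_events(console_log):
    """Return one dict per MMU-miss trap that is followed by a DSFSR dump."""
    events = []
    trap_seen = False
    for line in console_log.splitlines():
        if TRAP_TEXT in line:
            trap_seen = True
            continue
        match = DSFSR_RE.search(line)
        if trap_seen and match:
            flags = match.group(2).split()
            events.append({
                "dsfsr": match.group(1).lower(),
                "berr": "BERR:1" in flags,   # bus error response
                "brto": "BRTO:1" in flags,   # bus timeout response
            })
            trap_seen = False
    return events
```

An event with `berr` or `brto` set only flags a boot attempt for review; whether this document applies still depends on the XSCF status described under Cause.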

Decoding the dumped registers:

Decoding SFSR 4280804b
Bit setting in 4280804b:
100 0010 1000 0000 1000 0000 0100 1011
Bit [63-00] value of Status Register = 4280804b
[0] FV (Fault Valid)
[1] OW (Overwritten)
[3] PR (CPU privilege status)
[15] TM (Translation Miss)
[25] NC (Non cacheable reference; valid for UE, BERR or BRTO)
[30] BERR (Bus error response)

Decoding SFSR 2280804b
Bit setting in 2280804b:
10 0010 1000 0000 1000 0000 0100 1011
Bit [63-00] value of Status Register = 2280804b
[0] FV (Fault Valid)
[1] OW (Overwritten)
[3] PR (CPU privilege status)
[15] TM (Translation Miss)
[25] NC (Non cacheable reference; valid for UE, BERR or BRTO)
[29] BRTO (Bus timeout response)
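
The two decodings above can be reproduced offline with a few lines of Python. This is a minimal sketch, not the Oracle decoder tool: the function name `decode_sfsr` is hypothetical, and the table holds only the bit positions quoted in the decoder output above. Other fields that OBP prints (such as E and ASI) are deliberately not decoded here.

```python
# Bit positions and names copied from the decoder output quoted in this document.
SFSR_FLAGS = {
    0: "FV (Fault Valid)",
    1: "OW (Overwritten)",
    3: "PR (CPU privilege status)",
    15: "TM (Translation Miss)",
    25: "NC (Non cacheable reference)",
    29: "BRTO (Bus timeout response)",
    30: "BERR (Bus error response)",
}


def decode_sfsr(hex_value):
    """Return the names of the known flag bits that are set in an SFSR value."""
    value = int(hex_value, 16)
    return [
        "[%d] %s" % (bit, name)
        for bit, name in sorted(SFSR_FLAGS.items())
        if (value >> bit) & 1
    ]


if __name__ == "__main__":
    for sfsr in ("4280804b", "2280804b"):
        print("Decoding SFSR", sfsr)
        for flag in decode_sfsr(sfsr):
            print(flag)
```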

A decoder tool for the OPL status registers is available at:
Synchronous Fault Status Register: https://mos-cores.us.oracle.com/cgi-bin/opltools/oplTools.cgi?SFSR=true

 

Cause

  • If the system shows the above symptoms and MBU_A or IOU is degraded, refer to the following document:
    • M3000/M4000/M5000 - OBP probe-all command execution and subsequent boot causes "Last Trap: Data Access Error" and MBU_A or IOU is degraded (Doc ID 1465634.1)
  • If XSCF marks some HW as NOT OK and Doc ID 1465634.1 does not apply, repair the faulty HW first.
  • If the platform's XSCF detects no degraded or faulty HW, and network or CDROM boot works normally, the cause may be a partially corrupted boot environment, and this document can be used to troubleshoot it.
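
The three checks above form a simple decision order, which can be sketched in Python as follows. The function name `next_step`, its parameters, and the returned strings are hypothetical. Matching unit names by the prefixes "MBU_A" and "IOU" is an assumption about how the units appear in the XSCF status output.

```python
def next_step(xscf_not_ok_units, alternate_boot_works):
    """Map the checks in the Cause section to a next action.

    xscf_not_ok_units: names of units that XSCF reports as degraded or faulted
    alternate_boot_works: True if network or CDROM boot works normally
    """
    names = [unit.upper() for unit in xscf_not_ok_units]
    # Assumption: MBU_A and IOU units are identified by these name prefixes.
    if any(name.startswith(("MBU_A", "IOU")) for name in names):
        return "refer to Doc ID 1465634.1"
    if names:
        return "repair the faulted hardware first"
    if alternate_boot_works:
        return "check the boot file system and boot archive (this document)"
    return "outside the scope of this document"
```

For example, `next_step([], True)` returns the file-system check path, which is the only case this document covers.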

 

Solution

If the above scenario occurs, verify the integrity and status of the boot FS and boot environment before moving to HW replacement. Recovery steps may include:

  • verify that the root FS can be mounted correctly and is clean
  • run fsck on the root partition
  • check the boot block
  • check the boot archive
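
The checks listed above could translate into a command sequence like the one built by the Python sketch below. It only assembles the command strings for review and executes nothing. It assumes a UFS root file system, a domain booted from network or CDROM media, and a root slice named c0t0d0s0 mounted on /a; the slice name, the mount point, and the function name `recovery_commands` are hypothetical placeholders, not values from the case above.

```python
def recovery_commands(slice_name="c0t0d0s0", mount_point="/a"):
    """Build (not run) an illustrative Solaris recovery sequence for a UFS root."""
    raw_dev = "/dev/rdsk/" + slice_name   # raw device, used by fsck and installboot
    blk_dev = "/dev/dsk/" + slice_name    # block device, used by mount
    return [
        # -y answers yes to every fsck prompt; drop it to review each repair
        "fsck -F ufs -y " + raw_dev,
        # confirm that the repaired root FS mounts
        "mount -F ufs " + blk_dev + " " + mount_point,
        # rewrite the UFS boot block on the root slice
        "installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk " + raw_dev,
        # rebuild the boot archive of the mounted root
        "bootadm update-archive -R " + mount_point,
    ]


if __name__ == "__main__":
    for command in recovery_commands():
        print(command)
```

In the case described in this document, fsck had to be run more than once, so the first command may need repeating until it reports a clean file system.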

In the example above, the problem appeared after a system panic. It was fixed by running fsck on the root FS several times and then resolving the boot-archive inconsistency, as the console log below shows:

Jun 13 11:11:28 CEST 2011 FRAGMENT 17306698 DUP I=298444 LFN 250
Jun 13 11:11:28 CEST 2011 FRAGMENT 17306699 DUP I=298444 LFN 251
Jun 13 11:11:28 CEST 2011 FRAGMENT 17306700 DUP I=298444 LFN 252
Jun 13 11:11:28 CEST 2011 FRAGMENT 17306701 DUP I=298444 LFN 253
Jun 13 11:11:28 CEST 2011 FRAGMENT 17306702 DUP I=298444 LFN 254
Jun 13 11:11:28 CEST 2011 FRAGMENT 17306703 DUP I=298444 LFN 255
...
Jun 13 11:12:45 CEST 2011 FRAGMENT 87248639 DUP I=298444 LFN 16479
Jun 13 11:12:45 CEST 2011 FRAGMENT 2891792 DUP I=298452 LFN 0
...
Jun 13 11:17:00 CEST 2011 UNALLOCATED I=298427 OWNER=oracle MODE=0
Jun 13 11:17:01 CEST 2011 SIZE=0 MTIME=Jun 13 10:35 2011
Jun 13 11:17:01 CEST 2011 NAME=/export/home/oracle/DANIELE/osw/archive/oswnetstat/kgp-db2_netstat_11.06.01.0900.dat
Jun 13 11:17:01 CEST 2011
Jun 13 11:17:01 CEST 2011 REMOVE DIRECTORY ENTRY FROM I=297079? yes
...
Jun 13 11:17:14 CEST 2011 FRAG BITMAP WRONG (CORRECTED)
Jun 13 11:17:14 CEST 2011 FRAG BITMAP WRONG (CORRECTED)
Jun 13 11:17:26 CEST 2011 CORRECT GLOBAL SUMMARY
Jun 13 11:17:26 CEST 2011 SALVAGE? yes
Jun 13 11:17:26 CEST 2011
Jun 13 11:17:26 CEST 2011 Log was discarded, updating cyl groups
Jun 13 11:17:34 CEST 2011 217080 files, 26602524 used, 242400629 free (10037 frags, 30298824 blocks, 0.0% fragmentation)
Jun 13 11:17:34 CEST 2011
Jun 13 11:17:34 CEST 2011 ***** FILE SYSTEM WAS MODIFIED *****
Jun 13 11:17:46 CEST 2011 root@kgp-db2 #
Jun 13 11:17:46 CEST 2011 root@kgp-db2 #
Jun 13 11:17:50 CEST 2011 root@kgp-db2 # init 6

Jun 13 11:22:56 CEST 2011 WARNING: The following files in / differ from the boot archive:
Jun 13 11:22:56 CEST 2011
Jun 13 11:22:56 CEST 2011 changed /kernel/drv/did.conf
Jun 13 11:22:56 CEST 2011
Jun 13 11:22:56 CEST 2011 The recommended action is to reboot to the failsafe archive to correct
Jun 13 11:22:56 CEST 2011 the above inconsistency. To accomplish this, on a GRUB-based platform,
Jun 13 11:22:56 CEST 2011 reboot and select the "Solaris failsafe" option from the boot menu.
Jun 13 11:22:56 CEST 2011 On an OBP-based platform, reboot then type "boot -F failsafe". Then
Jun 13 11:22:56 CEST 2011 follow the prompts to update the boot archive. Alternately, to continue
Jun 13 11:22:56 CEST 2011 booting at your own risk, you may clear the service by running:
Jun 13 11:22:56 CEST 2011 "svcadm clear system/boot-archive"
...
Jun 13 11:22:56 CEST 2011 Jun 13 11:18:20 svc.startd[8]: svc:/system/boot-archive:default: Method "/lib/svc/method/boot-archive" failed with exit status 95.
...
Jun 13 11:23:55 CEST 2011 root@kgp-db2 # svcadm clear system/boot-archive
...
Jun 13 11:23:55 CEST 2011 root@kgp-db2 # [ system/boot-archive:default starting (check boot archive content) ]
Jun 13 11:23:55 CEST 2011 [ system/filesystem/usr:default starting (read/write root file systems mounts) ]
Jun 13 11:23:56 CEST 2011 [ system/keymap:default starting (keyboard defaults) ]
...
Jun 13 11:24:45 CEST 2011 Booting as part of a cluster
...
Jun 13 11:24:58 CEST 2011 [ milestone/multi-user:default starting (multi-user milestone) ]

 

References:

Solaris 10 SPARC system fails to boot with a "ERROR: Last Trap: Fast Data Access MMU Miss" error (Doc ID 1020309.1)

Internal References

  • Analyzing TO/BTO/DTO and BERR/DBERR Solaris Crash Dumps (Doc ID 1003640.1)
  • Service Requests: 3-3820184681 - 3-4617240041

 


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.