Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2180489.1
Update Date: 2016-11-09
Keywords:

Solution Type: Technical Instruction

Solution 2180489.1: FS System: How to Determine if a Sector Repair Operation has Taken Place


Related Items
  • Oracle FS1-2 Flash Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx




In this Document
Goal
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Internal processing commands.
Created from <SR 3-13284905149>

Applies to:

Oracle FS1-2 Flash Storage System - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Goal

To help determine whether a RAID_EVENT_SAVE_TRACE log bundle was generated due to a sector repair and, if so, how to proceed.

Solution

Log bundles with the generic description RAID_EVENT_SAVE_TRACE are an indication that there was a problem with a Drive Enclosure.  Because the intent is to capture the RAID logs as close to the event as possible, before critical information can be overwritten, the log bundle often will not contain the event log.  As a result, the severity of the problem and the appropriate response are not always clear.

 

  1. Check for other logs associated with the event. 
    1. In the first example, the log bundles immediately following the RAID_EVENT_SAVE_TRACE are not event-driven collections:
      AK0027XXXX-160901230146-160904040613-RAID_EVENT_SAVE_TRACE-e-02-02.tar
      AK0027XXXX-160904040615-160904225936-PRIMARY_PERIODIC_COLLECTION-p-02-02.tar
      AK0027XXXX-160905103622-160906200758-MANUAL_COLLECTION-m-01-01.tar


    2. In this second example, the log bundle immediately after the RAID_EVENT_SAVE_TRACE is event-driven, and the two can logically be tied together:
      AK0027XXXX-160906200803-160906225039-RAID_EVENT_SAVE_TRACE-e-02-02.tar
      AK0027XXXX-160906225041-160906225227-ENCLOSURE_DRIVE_STATE_CHANGE-e-02-02.tar

      Note: For information on log collection types and other information about the log bundle filenames, see KM 1997980.1 FS System: How to Identify and Send Log Bundles for Specific Dates and Times to Oracle Support.
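
The start and end timestamps embedded in the bundle filenames are what allow adjacent collections to be tied together.  As a rough illustration only, the sketch below splits a bundle filename from the examples above into its fields; the field layout (serial, start/end timestamps as YYMMDDHHMMSS, collection type, trigger flag, part counters) is inferred from those examples and should be confirmed against KM 1997980.1, and the helper name is purely illustrative.

from datetime import datetime

# Illustrative parser for log bundle filenames such as
# AK0027XXXX-160906200803-160906225039-RAID_EVENT_SAVE_TRACE-e-02-02.tar.
# The field layout is inferred from the examples above; see KM 1997980.1
# for the authoritative description.
def parse_bundle_name(filename):
    name = filename[:-4] if filename.endswith(".tar") else filename
    serial, start, end, *rest = name.split("-")
    # The last three fields are the trigger flag and the part counters (e.g. "e-02-02");
    # whatever sits between the end timestamp and those is the collection type.
    trigger, part, total = rest[-3], rest[-2], rest[-1]
    collection = "-".join(rest[:-3])
    return {
        "serial": serial,
        "start": datetime.strptime(start, "%y%m%d%H%M%S"),
        "end": datetime.strptime(end, "%y%m%d%H%M%S"),
        "collection": collection,
        "event_driven": trigger == "e",   # "p" = periodic, "m" = manual
        "part": f"{part} of {total}",
    }

for name in (
    "AK0027XXXX-160906200803-160906225039-RAID_EVENT_SAVE_TRACE-e-02-02.tar",
    "AK0027XXXX-160906225041-160906225227-ENCLOSURE_DRIVE_STATE_CHANGE-e-02-02.tar",
):
    info = parse_bundle_name(name)
    print(info["collection"], "event-driven" if info["event_driven"] else "scheduled/manual",
          info["start"], "->", info["end"])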


  2. RAID_EVENT_SAVE_TRACE log bundles that do not have any other collections associated with them are an indication that the problem resolved itself.  This can be confirmed by examining the RAID logs that are associated with the event.  Under the slammer directory, <slammer WWN>/*OUTLOG/var/log, there can be up to three RAID logs (raid0.log, raid1.log, and raid2.log), which represent the three SAS HBAs possible in each Controller.  The reason for the RAID_EVENT_SAVE_TRACE event will be towards the end of one of these files.  In this example, it is a sector repair.

    Check Condition; cdb 0x28 0x60 0x32 0xCA 0xE4 0x80 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00 0x00 0x00
    SenseLength 0000-0018, Sense 0x72 0x03 0x11 0x00 0x00 0x00 0x00 0x34 0x00 0x0A 0x80 0x00 0x00 0x00 0x00 0x00 0x32 0xCA 0xE4 0xC8 0x01
    SER C 00 T 0057 Sts 0202 RC 0000, SR 0000-3000-8408-42D8
    Key/Asc/AscQ 03/11/00 Cmd 0000-0003, LBA 0000-0000-32CA-E480, XfrLength 0001-0400
    Sun Sep  4 04:04:01 2016  +       452..+    550490. msec
    ERROR--dm::cb_fcrwreq read -- p(        87.) tgtId(        87.) ec(0000-001E) scsiSts(02)
    72 03 03 00 00 00 00 34 00 0A 80 00 11 00 00 00 32 CA E4 C8 01 0A 00 00 00 00 00 00 00 00 00 00
    RG: ! Disk Error Detected in "DiskReadCallback()", psDR=0x3000840842d8, errCode=38("DM- Can't read a sector on the drive"), crcCheckFl
    RG:   bufIdx=1, logIdx=1, phyIdx=3, portNum=87, diskLBA=852157568(0x32CAE480), blocks=128(0x80)
    RG:   stripMap=FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG:   psSI=0x300088765d40, stripeNr=6657481(0x006595C9), flags=[ sif_Parity | sif_difBlkSize_4K ]
    RG: a Abort the sequence due to err=38(DM- Can't read a sector on the drive)
    RG: . Queue the pending RR=0x300084f93570 to psRSI=0x300088765d40, stripeNr=6657481(0x6595C9) due to StripeRepair.
    RG:   rrf_Flags=[ rrf_Read ], sif_Flags=[ sif_Parity | sif_StripeRepair | sif_difBlkSize_4K ]
    RG: # 0x006595C9, StripeRepair for psRSI=0x300088765d40, stripeNr=6657481 started, offsetInStride=1.
    RG: & 0x006595C9, Lock Peer, DIL::Add_Entry(psRR=0x3000834056c8).
    {s0004-84D3} 09-04-2016  04:04:01.    453299. usec
    RG: > 0x006595C9, Dumping the Recovery Map, * Stripe Bitmap Dump for: 0x300083404008
    RG:          smt_SR_Read : 0x00000000000003FE, +{ 01 02 03 04 05 06 07 08 09 }
    RG:                        [ 01  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG:                        [ 02  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG:                        [ 03  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG:                        [ 04  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG:                        [ 05  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG:                        [ 06  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG:                        [ 07  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG:                        [ 08  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG:                        [ 09  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    ...
    RG: ! 0x006595C9, Logical Device or Buffer Device Errors found, * Stripe Bitmap Dump for: 0x300083404008
    RG:            smt_Error : N/A
    RG:      smt_LogDevError : 0x0000000000000002, +{ 01 }
    RG:                        [ 01  ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG: @ 0x006595C9, Build Write Fix Sequence for stripeNr=6657481 logDevMap=0x0000000000000FFF, +{ 00 01 02 03 04 05 06 07 08 09 10 11 }
    RG: @ 0x006595C9, Dumping the Fix Sequence:
    RG:   Write: < buf=01, log=01, phy=03 >, map=FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
    RG: @ 0x006595C9, Staring the Fix Sequence ...
    {s0004-84D5} 09-04-2016  04:04:02.    186657. usec
    RG: @ 0x006595C9, Fix Sequence finished.
    RG: + 0x006595C9, Recovery Done, Dumping the SR map, * Stripe Bitmap Dump for: 0x300083404008

    1. The last two lines indicate that the repair was likely successful. 
    2. Look for any T10DIF ERROR messages immediately afterwards; these would indicate a corrupted LUN, and the data would have to be restored.
      RG: RR-T10DIF ERROR lba=0x1FBECED80 RU=0x3000a6f537b8
      RG: GOT T10DIF ERROR during Stripe repair, Calculated CRC:0x4495,
      media-read-CRC:0xA3B9 !!!!!
       
    3. The Sense Key/ASC/ASCQ code (03/11/00) on the fourth line indicating an unrecoverable read error.
    4. Using one of the other log bundles, look in the event log around the timestamps found on the fifth line from the top (Sun Sep 4 04:04:01 2016) and the third line from the end (09-04-2016  04:04:02) of the raid log excerpt:
      2016-09-04T04:04:02.241  RAID_EVENT_SAVE_TRACE                     INFORMATIONAL  ComponentName=/ENCLOSURE-01, Source=/CONTROLLER-01
      2016-09-04T04:04:02.243  COLLECT_LOGS_INITIATED                    INFORMATIONAL
      2016-09-04T04:04:02.305  CM_EVT_BS_TRIAGE_INFO                     INFORMATIONAL  Source=/CONTROLLER-02, TriageInfo=sd:0x0 requesting IC - reason: BS_SLAT_IC_REQ_PM_MEDIA_ERR
      2016-09-04T04:04:03.250  CM_EVT_SLAT_INTEGRITY_CHECK_STARTED       INFORMATIONAL  ComponentName=/Storage Domain 1, Source=/CONTROLLER-02
      2016-09-04T04:04:22.237  CALL_HOME_SUCCEEDED                       INFORMATIONAL  ComponentName=/INTERNAL_OP/13771558/SYSTEM, reason=EVENT, bundle=AK0027XXXX-160901230146-160904040402-RAID_EVENT_SAVE_TRACE-e-01-02
      2016-09-04T04:09:08.822  CALL_HOME_SUCCEEDED                       INFORMATIONAL  ComponentName=/INTERNAL_OP/13771649/SYSTEM, reason=EVENT, bundle=AK0027XXXX-160901230146-160904040613-RAID_EVENT_SAVE_TRACE-e-02-02
      2016-09-04T05:57:57.731  PM_EVT_REPL_RELOAD_CONFIGURATION          INFORMATIONAL

      This confirms the sector repair and no further action is required.
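
For reference, the Key/Asc/AscQ values quoted in the raid logs follow the standard SCSI sense-data layout, so they can also be read directly from the raw Sense bytes on the second line of each excerpt.  The sketch below is a minimal, illustrative decoder: it handles the fixed (0x70) and descriptor (0x72) sense formats and maps only the two code pairs that appear in this document, 03/11/00 (Medium Error, Unrecovered Read Error) and 06/29/04 (Unit Attention, Device Internal Reset, seen in the next step).

# Illustrative decoder for the "Sense" bytes printed in raid*.log.
# Only the two Key/ASC/ASCQ pairs seen in this document are mapped to text.
KNOWN_CODES = {
    (0x03, 0x11, 0x00): "Medium Error - Unrecovered Read Error (sector repair expected)",
    (0x06, 0x29, 0x04): "Unit Attention - Device Internal Reset",
}

def decode_sense(sense_bytes):
    response_code = sense_bytes[0] & 0x7F
    if response_code in (0x72, 0x73):        # descriptor-format sense data
        key, asc, ascq = sense_bytes[1] & 0x0F, sense_bytes[2], sense_bytes[3]
    elif response_code in (0x70, 0x71):      # fixed-format sense data
        key, asc, ascq = sense_bytes[2] & 0x0F, sense_bytes[12], sense_bytes[13]
    else:
        raise ValueError(f"unrecognized sense response code 0x{response_code:02X}")
    text = KNOWN_CODES.get((key, asc, ascq), "not mapped in this sketch")
    return f"{key:02X}/{asc:02X}/{ascq:02X}: {text}"

# First bytes of the sector-repair example above (descriptor format) and of the
# device-reset example in the next step (fixed format).
repair = [0x72, 0x03, 0x11, 0x00, 0x00, 0x00, 0x00, 0x34]
reset = [0x70, 0x00, 0x06, 0x00, 0x00, 0x00, 0x00, 0x38,
         0x00, 0x00, 0x00, 0x00, 0x29, 0x04]
print(decode_sense(repair))   # 03/11/00
print(decode_sense(reset))    # 06/29/04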
       
  3. For the RAID_EVENT_SAVE_TRACE that did have other log bundle collections associated with it:
    1. From the RAID_EVENT_SAVE_TRACE log bundle, examine the end of the appropriate raid*.log file:
      Check Condition; cdb 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00 0x00 0x00
      SenseLength 0000-0018, Sense 0x70 0x00 0x06 0x00 0x00 0x00 0x00 0x38 0x00 0x00 0x00 0x00 0x29 0x04 0x00 0x00 0x00 0x00 0xC0
      SER C 00 T 0039 Sts 0202 RC 0000, SR 0000-3007-F229-6DE0
      Key/Asc/AscQ 06/29/04 Cmd 0000-0012, LBA 0000-0000-0000-0000, XfrLength 0000-0000
      Tue Sep  6 22:49:34 2016  +       573..+     85751. msec
       
      The Sense Key/ASC/ASCQ code (06/29/04) indicating a Device Reset and the timestamp (Sep  6 22:49:34 2016) in the last two lines.  This likely means that there were multiple failed reads or writes whereupon the system gave up and issued the reset.

    2. Searching this raid log for the keyword OFFLINE identifies the actual drive port (57):
      Removal call 0000-0039, 0000-0000
      n----> TRANS:  DrvPresent->Remove  P(        57.)
      RemoveStateHndlr -- Taking Port Offline. Port         57.
      DM::RemoveStateHndlr -- Need to Send AE when done. Port         57.
      !!!! Clearing Diag err bit !!!!
       -@-@   Changing NVR Drive State to 0000-0401 @-@-
      Clearing WSM for tgt 02 slot 09


      *******************************************************************
      **************  DRIVE HAS BEEN REMOVED ON PORT    57.   *****************
      *******************************************************************


      Calling Xpt         57.
      Commands On Hold 0000-0000-0000-012C FFFF-FFFF
      **************  PORT         57. IS BEING TAKEN OFFLINE *****************
      PORT         57.:

      Ident Data
          max LBA     = 0000-0000-BA4D-4AAF
          Model #     = LB1606R  SUN1.6T^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
          Serial #    = 41567872^@^@^@^@tag/T3BF
          F/W Rev     = S311^@^@^@^@
      ...
      ReturnPendingCommandToCaller
      DM::CB_FcRwReq -- Drive Offline. P(        57.) tgtId(        57.)
      ToSC
      RG: ! Disk Error Detected in "DiskReadCallback()", psDR=0x300af1df2898, errCode=28("DM- Port Offline"), crcCheckFla
      RG:   bufIdx=2, logIdx=2, phyIdx=3, portNum=57, diskLBA=2637725184(0x9D387E00), blocks=128(0x80)
      RG:   stripMap=FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
      RG:   psSI=0x300af7799000, stripeNr=20607228(0x013A70FC), flags=[ sif_WrLock | sif_WrIP | sif_Parity | sif_DIL_Allo
      RG: ! AHHH, err_DM_PortOffline Read psDR=0x300af1df2898
      RG: % Set sif_PortOffline for psRSI=0x300af7799000, stripeNum=0x013A70FC
      RG: a Abort the sequence due to err=28(DM- Port Offline)
      ...
      RG: . Ignore: sif_PortOffline already set for psRSI=0x300af63e20e0, stripeNum=0x00C37372# 0x01351517, StripeRepair
      RG: ! 0x01351517, LUN=13, psRU=0x300ae27992f0 offline/portOffline, reject all the pending requests
      RG: - Reject psRR=0x300af3260e50 with errCode:28
      RG: @ Return psRR=0x300af3260e50, errCode=28("DM- Port Offline"), stripeNr=0x1351517
      RG:   rrf_Flags=[ rrf_Read ]
      RG: * 0x01351517, StripeRepair for psRSI=0x300af7bce040, stripeNr=20256023 skipped since array/port offline.
      RG: ! 0x01351517, LUN=13, psRU=0x300ae27992f0 portOffline, reject all the incoming requests
      RG: . List Empty, nothing to reject
      RG: % Clear sif_PortOffline for psRSI=0x300af7bce040, stripeNum=0x01351517
      RG: # 0x01351517, StripeRepair for psRSI=0x300af7bce040, stripeNr=20256023 Ended.
      RG: % ON SR_END
      ...
      Target::AE_Handler Local, 0000-302C-0708-38F8, id +  1.
      Tgt - DrvOff AE p(        57.) t(         2.) desc(03) psFT(E27F-A170)
      RUSTY::ae_Drive: port  57., event   1., desc   3.
      RUSTY::AE_Drive offline LostPart, psRU E279-92F0 flags 0031-D000 portNum=+        57. partnum=+         3.
      LostPart LUN:    13., Part:     3.
      LUN_IDLE, tgt 0x02 lun 0x01 rstFlags 0x20 lunFlags 0x42
      LUN_Idle_2 Must Wait
      Companion responded that it doesn't have the token, Good.
      SES::PollEnclosureStatus: IMMED 20160906-22:49:07.781228963
      SES::PollEnclosureStatus: Found active enclosure 0x02D2 SES flags = 0x0090
      SES::PollEnclosureStatus: Requesting SES page two for enclosure 0x02D2
      {r0042-2125} 09-06-2016  22:49:08.     55379. usec
      {r0042-2127} 09-06-2016  22:49:08.     67814. usec
      {s0005-4DC9} 09-06-2016  22:49:08.     95034. usec
      {r0042-2129} 09-06-2016  22:49:08.     95058. usec
      {r0042-212B} 09-06-2016  22:49:08.    103035. usec
      Release Commands From Hold
      Tue Sep  6 22:49:10 2016  +       787..+     75608. msec
      SlotNumber 0000-0000
      Release Commands From Hold
      SlotNumber 0000-0001

       
      The seventh line down says that this drive is target 0x2, slot 9.  Targets, in this case Drive Enclosures, are each allotted 24 slots, so counting 24 slots for each of the two preceding targets and adding the slot number (24 + 24 + 9) gives the port number (57), as in the sketch below.
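
A minimal sketch of that arithmetic, using the 24-slots-per-target allotment stated above:

# Port number from target and slot, assuming 24 slots per target (Drive Enclosure).
SLOTS_PER_TARGET = 24

def port_number(target, slot):
    return target * SLOTS_PER_TARGET + slot

print(port_number(target=0x02, slot=9))   # 57, matching "tgt 02 slot 09" in the raid log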

    3. From the ENCLOSURE_DRIVE_STATE_CHANGE log bundle, examine the *.chsh.xml file for the timestamp (2016-09-06T22:49:10) and the offending component (ENCLOSURE-04):
          <EventType>ENCLOSURE_DRIVE_STATE_CHANGE</EventType>
          <Severity>WARNING</Severity>
          <Category>SYSTEM</Category>
          <Time>2016-09-06T22:49:10.878</Time>
          <ComponentIdentity>
            <WuName>5080020001A9CE70</WuName>
          </ComponentIdentity>
          <ComponentName>/ENCLOSURE-04</ComponentName>
          <SourceNodeIdentity>
            <Id>508002000158C671</Id>
            <Fqn>/CONTROLLER-02</Fqn>
          </SourceNodeIdentity>
       
    4. Now compare these values with the event log from the ENCLOSURE_DRIVE_STATE_CHANGE log bundle:
      2016-09-06T22:49:10.843  RAID_EVENT_SAVE_TRACE                     INFORMATIONAL  ComponentName=/ENCLOSURE-11, Source=/CONTROLLER-01
      2016-09-06T22:49:10.844  COLLECT_LOGS_INITIATED                    INFORMATIONAL
      2016-09-06T22:49:10.878  ENCLOSURE_DRIVE_STATE_CHANGE              WARNING        ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02, cruId=CRU_DRIVE9, action=CRU_OFFLINE
      2016-09-06T22:49:10.892  RAID_EVENT_PARTIAL_OFFLINE_STATE_CHANGE   INFORMATIONAL  ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-01, cruId=CRU_DRIVE9, partialOfflineState=PARTIAL_OFFLINE
      2016-09-06T22:49:10.902  PSG_PI_EVENT_PATH_FAILURE                 WARNING        ComponentName=/CONTROLLER-01, Source=/CONTROLLER-01, reasonCode=PDS_ILLEGAL
      2016-09-06T22:49:11.205  ENCLOSURE_DRIVE_REMOVED_FROM_DRIVE_GROUP  WARNING        ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02, cruId=CRU_DRIVE9
      2016-09-06T22:49:11.206  RAID_EVENT_SAVE_TRACE                     INFORMATIONAL  ComponentName=/ENCLOSURE-11, Source=/CONTROLLER-02
      2016-09-06T22:49:12.162  RAID_EVENT_PARTIAL_OFFLINE_STATE_CHANGE   INFORMATIONAL  ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-01, cruId=CRU_DRIVE9, partialOfflineState=NORMAL
      2016-09-06T22:49:13.597  RAID_EVENT_ENCLOSURE_ARRAY_DEVICE_STATUS_CRITICAL  CRITICAL       ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02, cruId=CRU_DRIVE9
      2016-09-06T22:49:13.632  PSG_PI_EVENT_PATH_RESTORED                INFORMATIONAL  ComponentName=/CONTROLLER-01, Source=/CONTROLLER-01
      2016-09-06T22:49:18.153  RAID_EVENT_MAINT_BLOCKED                  INFORMATIONAL  ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02
      2016-09-06T22:49:25.419  CALL_HOME_SUCCEEDED                       INFORMATIONAL  ComponentName=/INTERNAL_OP/13973922/SYSTEM, reason=EVENT, bundle=AK0027XXXX-160906200803-160906224910-RAID_EVENT_SAVE_TRACE-e-01-02.tar
      2016-09-06T22:49:29.629  ENCLOSURE_MAINTENANCE_STARTED             INFORMATIONAL  ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02

      This confirms an actual drive issue in ENCLOSURE-04, DRIVE-09.
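
When the event log is large, a small filter can speed up this kind of comparison.  The sketch below is illustrative only: it assumes the event log has been saved as plain text with lines formatted like the excerpt above (ISO timestamp first), and the filename event.log is just a placeholder.

from datetime import datetime, timedelta

# Illustrative filter: print event-log lines within a time window of a given
# timestamp that mention a given component (enclosure, controller, cruId, ...).
def events_near(lines, when, component, window_seconds=30):
    target = datetime.fromisoformat(when)
    for line in lines:
        try:
            stamp = datetime.fromisoformat(line.split()[0])
        except (IndexError, ValueError):
            continue                      # skip lines that do not start with a timestamp
        if abs(stamp - target) <= timedelta(seconds=window_seconds) and component in line:
            yield line.rstrip()

with open("event.log") as f:              # placeholder filename
    for hit in events_near(f, "2016-09-06T22:49:10.878", "ENCLOSURE-04"):
        print(hit)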

For more information, see KM Document 2126081.1, FS System: How to Handle a "Save Trace Event" Auto-Service Request.  Handling these events will likely mean examining other logs generated immediately before and afterwards to get a more complete picture of the event.

 

 

 

 


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.