Asset ID: 1-71-2180489.1
Update Date: 2016-11-09
Solution Type: Technical Instruction
Document ID: 2180489.1
FS System: How to Determine if a Sector Repair Operation has Taken Place
Related Items:
- Oracle FS1-2 Flash Storage System

Related Categories:
- PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx
Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Internal processing commands.
Created from <SR 3-13284905149>
Applies to:
Oracle FS1-2 Flash Storage System - Version All Versions to All Versions [Release All Releases] Information in this document applies to any platform.
Goal
To determine whether a RAID_EVENT_SAVE_TRACE log bundle was generated due to a sector repair and, if so, how to proceed.
Solution
Log bundles with the generic description RAID_EVENT_SAVE_TRACE indicate that there was a problem with a Drive Enclosure. Because the intent is to capture the RAID logs as close to the event as possible, before critical information can be overwritten, the log bundle often does not contain the event log. As a result, the severity of the problem and the appropriate response are not always clear.
- Check for other logs associated with the event.
- In the first example, the log bundles immediately following the RAID_EVENT_SAVE_TRACE are not event driven collections:
AK0027XXXX-160901230146-160904040613-RAID_EVENT_SAVE_TRACE-e-02-02.tar
AK0027XXXX-160904040615-160904225936-PRIMARY_PERIODIC_COLLECTION-p-02-02.tar
AK0027XXXX-160905103622-160906200758-MANUAL_COLLECTION-m-01-01.tar
- In this second example, the log bundle immediately after the RAID_EVENT_SAVE_TRACE is event driven and can logically be tied together:
AK0027XXXX-160906200803-160906225039-RAID_EVENT_SAVE_TRACE-e-02-02.tar
AK0027XXXX-160906225041-160906225227-ENCLOSURE_DRIVE_STATE_CHANGE-e-02-02.tar
Note: For information on log collection types and other information about the log bundle filenames, see KM 1997980.1 FS System: How to Identify and Send Log Bundles for Specific Dates and Times to Oracle Support.
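As a quick triage aid, the pairing logic above can be sketched in Python. This is a minimal sketch, not part of any product tooling: the filename layout (serial, start/end timestamps in YYMMDDHHMMSS form, description, and a collection-type letter where 'e' marks event-driven collections) is inferred from the example filenames above; KM 1997980.1 documents the actual format.

```python
# Sketch: pair each RAID_EVENT_SAVE_TRACE bundle with an event-driven
# bundle that starts shortly after it. Filename layout is an assumption
# based on the examples in this document.
import re
from datetime import datetime, timedelta

BUNDLE_RE = re.compile(
    r'(?P<serial>[A-Z0-9]+)-(?P<start>\d{12})-(?P<end>\d{12})-'
    r'(?P<desc>[A-Z_]+)-(?P<kind>[epm])-\d{2}-\d{2}\.tar$')

def parse_bundle(name):
    """Split a log-bundle filename into its fields, or return None."""
    m = BUNDLE_RE.match(name)
    if not m:
        return None
    ts = lambda s: datetime.strptime(s, '%y%m%d%H%M%S')
    return {'desc': m.group('desc'), 'kind': m.group('kind'),
            'start': ts(m.group('start')), 'end': ts(m.group('end'))}

def related_followups(names, window=timedelta(minutes=5)):
    """Return (trace_desc, followup_desc) pairs where an event-driven
    bundle starts within `window` of a RAID_EVENT_SAVE_TRACE bundle's end."""
    bundles = sorted((b for b in map(parse_bundle, names) if b),
                     key=lambda b: b['start'])
    return [(prev['desc'], nxt['desc'])
            for prev, nxt in zip(bundles, bundles[1:])
            if prev['desc'] == 'RAID_EVENT_SAVE_TRACE'
            and nxt['kind'] == 'e'
            and nxt['start'] - prev['end'] <= window]
```

Run against the second example above, this reports the RAID_EVENT_SAVE_TRACE / ENCLOSURE_DRIVE_STATE_CHANGE pair; against the first example it reports nothing, since the following bundle is a periodic collection.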
- RAID_EVENT_SAVE_TRACE log bundles that do not have any other bundles associated with them are an indication that the problem took care of itself. This can be confirmed by examining the RAID logs associated with the event. Under the slammer directory, <slammer WWN>/*OUTLOG/var/log, there can be up to three RAID logs (raid0.log, raid1.log, and raid2.log), representing the three SAS HBAs possible in each Controller. The reason for the RAID_EVENT_SAVE_TRACE event will be towards the end of one of these files. In this example, it is a sector repair.
Check Condition; cdb 0x28 0x60 0x32 0xCA 0xE4 0x80 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00 0x00 0x00
SenseLength 0000-0018, Sense 0x72 0x03 0x11 0x00 0x00 0x00 0x00 0x34 0x00 0x0A 0x80 0x00 0x00 0x00 0x00 0x00 0x32 0xCA 0xE4 0xC8 0x01
SER C 00 T 0057 Sts 0202 RC 0000, SR 0000-3000-8408-42D8
Key/Asc/AscQ 03/11/00 Cmd 0000-0003, LBA 0000-0000-32CA-E480, XfrLength 0001-0400
Sun Sep 4 04:04:01 2016 + 452..+ 550490. msec
ERROR--dm::cb_fcrwreq read -- p( 87.) tgtId( 87.) ec(0000-001E) scsiSts(02)
72 03 03 00 00 00 00 34 00 0A 80 00 11 00 00 00 32 CA E4 C8 01 0A 00 00 00 00 00 00 00 00 00 00
RG: ! Disk Error Detected in "DiskReadCallback()", psDR=0x3000840842d8, errCode=38("DM- Can't read a sector on the drive"), crcCheckFl
RG: bufIdx=1, logIdx=1, phyIdx=3, portNum=87, diskLBA=852157568(0x32CAE480), blocks=128(0x80)
RG: stripMap=FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: psSI=0x300088765d40, stripeNr=6657481(0x006595C9), flags=[ sif_Parity | sif_difBlkSize_4K ]
RG: a Abort the sequence due to err=38(DM- Can't read a sector on the drive)
RG: . Queue the pending RR=0x300084f93570 to psRSI=0x300088765d40, stripeNr=6657481(0x6595C9) due to StripeRepair.
RG: rrf_Flags=[ rrf_Read ], sif_Flags=[ sif_Parity | sif_StripeRepair | sif_difBlkSize_4K ]
RG: # 0x006595C9, StripeRepair for psRSI=0x300088765d40, stripeNr=6657481 started, offsetInStride=1.
RG: & 0x006595C9, Lock Peer, DIL::Add_Entry(psRR=0x3000834056c8).
{s0004-84D3} 09-04-2016 04:04:01. 453299. usec
RG: > 0x006595C9, Dumping the Recovery Map, * Stripe Bitmap Dump for: 0x300083404008
RG: smt_SR_Read : 0x00000000000003FE, +{ 01 02 03 04 05 06 07 08 09 }
RG: [ 01 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 02 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 03 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 04 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 05 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 06 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 07 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 08 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 09 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
...
RG: ! 0x006595C9, Logical Device or Buffer Device Errors found, * Stripe Bitmap Dump for: 0x300083404008
RG: smt_Error : N/A
RG: smt_LogDevError : 0x0000000000000002, +{ 01 }
RG: [ 01 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: @ 0x006595C9, Build Write Fix Sequence for stripeNr=6657481 logDevMap=0x0000000000000FFF, +{ 00 01 02 03 04 05 06 07 08 09 10 11 }
RG: @ 0x006595C9, Dumping the Fix Sequence:
RG: Write: < buf=01, log=01, phy=03 >, map=FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: @ 0x006595C9, Staring the Fix Sequence ...
{s0004-84D5} 09-04-2016 04:04:02. 186657. usec
RG: @ 0x006595C9, Fix Sequence finished.
RG: + 0x006595C9, Recovery Done, Dumping the SR map, * Stripe Bitmap Dump for: 0x300083404008
- The last two lines indicate that the repair was likely successful.
- Look for any T10DIF ERROR messages immediately afterwards; these would indicate a corrupted LUN, and the data would have to be restored.
RG: RR-T10DIF ERROR lba=0x1FBECED80 RU=0x3000a6f537b8
RG: GOT T10DIF ERROR during Stripe repair, Calculated CRC:0x4495, media-read-CRC:0xA3B9 !!!!!
- The Sense Key/ASC/ASCQ code (03/11/00) on the fourth line indicates an unrecovered read error.
- Using one of the other log bundles, look in the event log around the timestamps on the fifth line from the top (Sun Sep 4 04:04:01 2016) and the third line from the end (09-04-2016 04:04:02):
2016-09-04T04:04:02.241 RAID_EVENT_SAVE_TRACE INFORMATIONAL ComponentName=/ENCLOSURE-01, Source=/CONTROLLER-01
2016-09-04T04:04:02.243 COLLECT_LOGS_INITIATED INFORMATIONAL
2016-09-04T04:04:02.305 CM_EVT_BS_TRIAGE_INFO INFORMATIONAL Source=/CONTROLLER-02, TriageInfo=sd:0x0 requesting IC - reason: BS_SLAT_IC_REQ_PM_MEDIA_ERR
2016-09-04T04:04:03.250 CM_EVT_SLAT_INTEGRITY_CHECK_STARTED INFORMATIONAL ComponentName=/Storage Domain 1, Source=/CONTROLLER-02
2016-09-04T04:04:22.237 CALL_HOME_SUCCEEDED INFORMATIONAL ComponentName=/INTERNAL_OP/13771558/SYSTEM, reason=EVENT, bundle=AK0027XXXX-160901230146-160904040402-RAID_EVENT_SAVE_TRACE-e-01-02
2016-09-04T04:09:08.822 CALL_HOME_SUCCEEDED INFORMATIONAL ComponentName=/INTERNAL_OP/13771649/SYSTEM, reason=EVENT, bundle=AK0027XXXX-160901230146-160904040613-RAID_EVENT_SAVE_TRACE-e-02-02
2016-09-04T05:57:57.731 PM_EVT_REPL_RELOAD_CONFIGURATION INFORMATIONAL
This confirms the sector repair and no further action is required.
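The Sense Key/ASC/ASCQ triples called out in these excerpts can be pulled from the raw Sense bytes with a small sketch like the following; it is not part of the product tooling. The byte offsets follow the standard SCSI sense-data layouts (descriptor format 0x72/0x73 versus fixed format 0x70/0x71), and the two text labels are the standard SPC meanings of the codes seen in this document.

```python
# Sketch: decode Sense Key/ASC/ASCQ from the raw "Sense" bytes shown in
# the raid*.log excerpts. Descriptor-format sense data (0x72/0x73)
# carries key/ASC/ASCQ in bytes 1-3; fixed format (0x70/0x71) carries
# them in byte 2 (low nibble), 12, and 13.
KNOWN = {  # the two codes discussed in this document
    (0x03, 0x11, 0x00): 'Medium Error - Unrecovered Read Error',
    (0x06, 0x29, 0x04): 'Unit Attention - Device Internal Reset',
}

def decode_sense(hex_bytes):
    b = [int(x, 16) for x in hex_bytes.split()]
    if (b[0] & 0x7F) in (0x72, 0x73):      # descriptor-format sense data
        key, asc, ascq = b[1] & 0x0F, b[2], b[3]
    else:                                   # fixed-format (0x70/0x71)
        key, asc, ascq = b[2] & 0x0F, b[12], b[13]
    label = KNOWN.get((key, asc, ascq), 'unknown')
    return '%02X/%02X/%02X (%s)' % (key, asc, ascq, label)
```

For the first excerpt (Sense starting 0x72 0x03 0x11 0x00) this yields 03/11/00, the unrecovered read error; for the second excerpt (Sense starting 0x70) it yields 06/29/04, the device reset discussed later in this document.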
- For the RAID_EVENT_SAVE_TRACE that did have other log bundle collections associated with it:
- From the RAID_EVENT_SAVE_TRACE log bundle, examine the end of the appropriate raid*.log file:
Check Condition; cdb 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00 0x00 0x00
SenseLength 0000-0018, Sense 0x70 0x00 0x06 0x00 0x00 0x00 0x00 0x38 0x00 0x00 0x00 0x00 0x29 0x04 0x00 0x00 0x00 0x00 0xC0
SER C 00 T 0039 Sts 0202 RC 0000, SR 0000-3007-F229-6DE0
Key/Asc/AscQ 06/29/04 Cmd 0000-0012, LBA 0000-0000-0000-0000, XfrLength 0000-0000
Tue Sep 6 22:49:34 2016 + 573..+ 85751. msec
Note the Sense Key/ASC/ASCQ code (06/29/04), indicating a device reset, and the timestamp (Sep 6 22:49:34 2016) in the last two lines. This likely means there were multiple failed reads or writes, whereupon the system gave up and issued the reset.
- Searching this RAID log for the keyword OFFLINE identifies the actual drive port (57):
Removal call 0000-0039, 0000-0000
n----> TRANS: DrvPresent->Remove P( 57.)
RemoveStateHndlr -- Taking Port Offline. Port 57.
DM::RemoveStateHndlr -- Need to Send AE when done. Port 57.
!!!! Clearing Diag err bit !!!!
-@-@ Changing NVR Drive State to 0000-0401 @-@-
Clearing WSM for tgt 02 slot 09
*******************************************************************
************** DRIVE HAS BEEN REMOVED ON PORT 57. *****************
*******************************************************************
Calling Xpt 57. Commands On Hold 0000-0000-0000-012C FFFF-FFFF
************** PORT 57. IS BEING TAKEN OFFLINE *****************
PORT 57.:
Ident Data max LBA = 0000-0000-BA4D-4AAF
Model # = LB1606R SUN1.6T^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
Serial # = 41567872^@^@^@^@tag/T3BF
F/W Rev = S311^@^@^@^@
...
ReturnPendingCommandToCaller
DM::CB_FcRwReq -- Drive Offline. P( 57.) tgtId( 57.) ToSC
RG: ! Disk Error Detected in "DiskReadCallback()", psDR=0x300af1df2898, errCode=28("DM- Port Offline"), crcCheckFla
RG: bufIdx=2, logIdx=2, phyIdx=3, portNum=57, diskLBA=2637725184(0x9D387E00), blocks=128(0x80)
RG: stripMap=FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: psSI=0x300af7799000, stripeNr=20607228(0x013A70FC), flags=[ sif_WrLock | sif_WrIP | sif_Parity | sif_DIL_Allo
RG: ! AHHH, err_DM_PortOffline Read psDR=0x300af1df2898
RG: % Set sif_PortOffline for psRSI=0x300af7799000, stripeNum=0x013A70FC
RG: a Abort the sequence due to err=28(DM- Port Offline)
...
RG: . Ignore: sif_PortOffline already set for psRSI=0x300af63e20e0, stripeNum=0x00C37372
# 0x01351517, StripeRepair
RG: ! 0x01351517, LUN=13, psRU=0x300ae27992f0 offline/portOffline, reject all the pending requests
RG: - Reject psRR=0x300af3260e50 with errCode:28
RG: @ Return psRR=0x300af3260e50, errCode=28("DM- Port Offline"), stripeNr=0x1351517
RG: rrf_Flags=[ rrf_Read ]
RG: * 0x01351517, StripeRepair for psRSI=0x300af7bce040, stripeNr=20256023 skipped since array/port offline.
RG: ! 0x01351517, LUN=13, psRU=0x300ae27992f0 portOffline, reject all the incoming requests
RG: . List Empty, nothing to reject
RG: % Clear sif_PortOffline for psRSI=0x300af7bce040, stripeNum=0x01351517
RG: # 0x01351517, StripeRepair for psRSI=0x300af7bce040, stripeNr=20256023 Ended.
RG: % ON SR_END
...
Target::AE_Handler Local, 0000-302C-0708-38F8, id + 1.
Tgt - DrvOff AE p( 57.) t( 2.) desc(03) psFT(E27F-A170)
RUSTY::ae_Drive: port 57., event 1., desc 3.
RUSTY::AE_Drive offline LostPart, psRU E279-92F0 flags 0031-D000 portNum=+ 57. partnum=+ 3.
LostPart LUN: 13., Part: 3.
LUN_IDLE, tgt 0x02 lun 0x01 rstFlags 0x20 lunFlags 0x42
LUN_Idle_2 Must Wait
Companion responded that it doesn't have the token, Good.
SES::PollEnclosureStatus: IMMED 20160906-22:49:07.781228963
SES::PollEnclosureStatus: Found active enclosure 0x02D2 SES flags = 0x0090
SES::PollEnclosureStatus: Requesting SES page two for enclosure 0x02D2
{r0042-2125} 09-06-2016 22:49:08. 55379. usec
{r0042-2127} 09-06-2016 22:49:08. 67814. usec
{s0005-4DC9} 09-06-2016 22:49:08. 95034. usec
{r0042-2129} 09-06-2016 22:49:08. 95058. usec
{r0042-212B} 09-06-2016 22:49:08. 103035. usec
Release Commands From Hold
Tue Sep 6 22:49:10 2016 + 787..+ 75608. msec
SlotNumber 0000-0000
Release Commands From Hold
SlotNumber 0000-0001
The line "Clearing WSM for tgt 02 slot 09" says that this drive is target 0x2, slot 9. Targets, in this case Drive Enclosures, are allotted 24 slots each, so multiplying the target by 24 and adding the slot (24 + 24 + 9) gives the port number (57).
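The port-number arithmetic can be expressed as a one-line helper. This is a sketch; the 24-slots-per-target rule is as stated above.

```python
# Sketch of the port-number arithmetic: targets (Drive Enclosures) are
# allotted 24 slots each, so port = target * 24 + slot.
SLOTS_PER_TARGET = 24

def port_number(target, slot):
    return target * SLOTS_PER_TARGET + slot
```

For the drive above (tgt 02, slot 09): 24 + 24 + 9 = 57.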
- From the ENCLOSURE_DRIVE_STATE_CHANGE log bundle, examine the *.chsh.xml file for the timestamp (2016-09-06T22:49:10) and the offending component (ENCLOSURE-04):
<EventType>ENCLOSURE_DRIVE_STATE_CHANGE</EventType>
<Severity>WARNING</Severity>
<Category>SYSTEM</Category>
<Time>2016-09-06T22:49:10.878</Time>
<ComponentIdentity>
  <WuName>5080020001A9CE70</WuName>
</ComponentIdentity>
<ComponentName>/ENCLOSURE-04</ComponentName>
<SourceNodeIdentity>
  <Id>508002000158C671</Id>
  <Fqn>/CONTROLLER-02</Fqn>
</SourceNodeIdentity>
- Now compare these values with the event log from the ENCLOSURE_DRIVE_STATE_CHANGE log bundle:
2016-09-06T22:49:10.843 RAID_EVENT_SAVE_TRACE INFORMATIONAL ComponentName=/ENCLOSURE-11, Source=/CONTROLLER-01
2016-09-06T22:49:10.844 COLLECT_LOGS_INITIATED INFORMATIONAL
2016-09-06T22:49:10.878 ENCLOSURE_DRIVE_STATE_CHANGE WARNING ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02, cruId=CRU_DRIVE9, action=CRU_OFFLINE
2016-09-06T22:49:10.892 RAID_EVENT_PARTIAL_OFFLINE_STATE_CHANGE INFORMATIONAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-01, cruId=CRU_DRIVE9, partialOfflineState=PARTIAL_OFFLINE
2016-09-06T22:49:10.902 PSG_PI_EVENT_PATH_FAILURE WARNING ComponentName=/CONTROLLER-01, Source=/CONTROLLER-01, reasonCode=PDS_ILLEGAL
2016-09-06T22:49:11.205 ENCLOSURE_DRIVE_REMOVED_FROM_DRIVE_GROUP WARNING ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02, cruId=CRU_DRIVE9
2016-09-06T22:49:11.206 RAID_EVENT_SAVE_TRACE INFORMATIONAL ComponentName=/ENCLOSURE-11, Source=/CONTROLLER-02
2016-09-06T22:49:12.162 RAID_EVENT_PARTIAL_OFFLINE_STATE_CHANGE INFORMATIONAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-01, cruId=CRU_DRIVE9, partialOfflineState=NORMAL
2016-09-06T22:49:13.597 RAID_EVENT_ENCLOSURE_ARRAY_DEVICE_STATUS_CRITICAL CRITICAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02, cruId=CRU_DRIVE9
2016-09-06T22:49:13.632 PSG_PI_EVENT_PATH_RESTORED INFORMATIONAL ComponentName=/CONTROLLER-01, Source=/CONTROLLER-01
2016-09-06T22:49:18.153 RAID_EVENT_MAINT_BLOCKED INFORMATIONAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02
2016-09-06T22:49:25.419 CALL_HOME_SUCCEEDED INFORMATIONAL ComponentName=/INTERNAL_OP/13973922/SYSTEM, reason=EVENT, bundle=AK0027XXXX-160906200803-160906224910-RAID_EVENT_SAVE_TRACE-e-01-02.tar
2016-09-06T22:49:29.629 ENCLOSURE_MAINTENANCE_STARTED INFORMATIONAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02
This confirms an actual drive issue in ENCLOSURE-04, DRIVE-09.
See KM Document 2126081.1, FS System: How to Handle a "Save Trace Event" Auto-Service Request, for additional details. Handling the event will likely mean examining other logs generated immediately before and after it to get a more complete picture.
Attachments
This solution has no attachments.