Asset ID: 1-71-2180489.1
Update Date: 2016-11-09
Solution Type: Technical Instruction
Document ID: 2180489.1
FS System: How to Determine if a Sector Repair Operation has Taken Place
Related Items:
- Oracle FS1-2 Flash Storage System

Related Categories:
- PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx
Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Internal processing commands.
Created from <SR 3-13284905149>
Applies to:
Oracle FS1-2 Flash Storage System - Version All Versions to All Versions [Release All Releases] Information in this document applies to any platform.
Goal
To determine whether a RAID_EVENT_SAVE_TRACE log bundle was generated due to a sector repair and, if so, how to proceed.
Solution
Log bundles with the generic description RAID_EVENT_SAVE_TRACE indicate that there was a problem with a Drive Enclosure. Because the intent is to capture the RAID logs as close to the event as possible, before critical information can be overwritten, the log bundle often does not contain the event log. As a result, the severity of the problem and the appropriate response are not always clear.
- Check for other logs associated with the event.
- In the first example, the log bundles immediately following the RAID_EVENT_SAVE_TRACE are not event driven collections:
AK0027XXXX-160901230146-160904040613-RAID_EVENT_SAVE_TRACE-e-02-02.tar
AK0027XXXX-160904040615-160904225936-PRIMARY_PERIODIC_COLLECTION-p-02-02.tar
AK0027XXXX-160905103622-160906200758-MANUAL_COLLECTION-m-01-01.tar
- In this second example, the log bundle immediately after the RAID_EVENT_SAVE_TRACE is event driven and can logically be tied together:
AK0027XXXX-160906200803-160906225039-RAID_EVENT_SAVE_TRACE-e-02-02.tar
AK0027XXXX-160906225041-160906225227-ENCLOSURE_DRIVE_STATE_CHANGE-e-02-02.tar
Note: For information on log collection types and other information about the log bundle filenames, see KM 1997980.1 FS System: How to Identify and Send Log Bundles for Specific Dates and Times to Oracle Support.
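As a quick triage aid, the pairing logic above can be sketched in Python. This is a minimal sketch, not part of any product tooling: the filename layout (serial, start/end timestamps in YYMMDDHHMMSS form, description, and a collection-type letter where 'e' marks event-driven collections) is inferred from the example filenames above; KM 1997980.1 documents the actual format.

```python
# Sketch: pair each RAID_EVENT_SAVE_TRACE bundle with an event-driven
# bundle that starts shortly after it. Filename layout is an assumption
# based on the examples in this document.
import re
from datetime import datetime, timedelta

BUNDLE_RE = re.compile(
    r'(?P<serial>[A-Z0-9]+)-(?P<start>\d{12})-(?P<end>\d{12})-'
    r'(?P<desc>[A-Z_]+)-(?P<kind>[epm])-\d{2}-\d{2}\.tar$')

def parse_bundle(name):
    """Split a log-bundle filename into its fields, or return None."""
    m = BUNDLE_RE.match(name)
    if not m:
        return None
    ts = lambda s: datetime.strptime(s, '%y%m%d%H%M%S')
    return {'desc': m.group('desc'), 'kind': m.group('kind'),
            'start': ts(m.group('start')), 'end': ts(m.group('end'))}

def related_followups(names, window=timedelta(minutes=5)):
    """Return (trace_desc, followup_desc) pairs where an event-driven
    bundle starts within `window` of a RAID_EVENT_SAVE_TRACE bundle's end."""
    bundles = sorted((b for b in map(parse_bundle, names) if b),
                     key=lambda b: b['start'])
    return [(prev['desc'], nxt['desc'])
            for prev, nxt in zip(bundles, bundles[1:])
            if prev['desc'] == 'RAID_EVENT_SAVE_TRACE'
            and nxt['kind'] == 'e'
            and nxt['start'] - prev['end'] <= window]
```

Run against the second example above, this reports the RAID_EVENT_SAVE_TRACE / ENCLOSURE_DRIVE_STATE_CHANGE pair; against the first example it reports nothing, since the following bundle is a periodic collection.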
- RAID_EVENT_SAVE_TRACE log bundles that do not have any other bundles associated with them are an indication that the problem took care of itself. This can be confirmed by examining the RAID logs associated with the event. Under the slammer directory, <slammer WWN>/*OUTLOG/var/log, there can be up to three RAID logs (raid0.log, raid1.log, and raid2.log), representing the three SAS HBAs possible in each Controller. The reason for the RAID_EVENT_SAVE_TRACE event will be towards the end of one of these files. In this example, it is a sector repair.
Check Condition; cdb 0x28 0x60 0x32 0xCA 0xE4 0x80 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00 0x00 0x00
SenseLength 0000-0018, Sense 0x72 0x03 0x11 0x00 0x00 0x00 0x00 0x34 0x00 0x0A 0x80 0x00 0x00 0x00 0x00 0x00 0x32 0xCA 0xE4 0xC8 0x01
SER C 00 T 0057 Sts 0202 RC 0000, SR 0000-3000-8408-42D8
Key/Asc/AscQ 03/11/00 Cmd 0000-0003, LBA 0000-0000-32CA-E480, XfrLength 0001-0400
Sun Sep 4 04:04:01 2016 + 452..+ 550490. msec
ERROR--dm::cb_fcrwreq read -- p( 87.) tgtId( 87.) ec(0000-001E) scsiSts(02)
72 03 03 00 00 00 00 34 00 0A 80 00 11 00 00 00 32 CA E4 C8 01 0A 00 00 00 00 00 00 00 00 00 00
RG: ! Disk Error Detected in "DiskReadCallback()", psDR=0x3000840842d8, errCode=38("DM- Can't read a sector on the drive"), crcCheckFl
RG: bufIdx=1, logIdx=1, phyIdx=3, portNum=87, diskLBA=852157568(0x32CAE480), blocks=128(0x80)
RG: stripMap=FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: psSI=0x300088765d40, stripeNr=6657481(0x006595C9), flags=[ sif_Parity | sif_difBlkSize_4K ]
RG: a Abort the sequence due to err=38(DM- Can't read a sector on the drive)
RG: . Queue the pending RR=0x300084f93570 to psRSI=0x300088765d40, stripeNr=6657481(0x6595C9) due to StripeRepair.
RG: rrf_Flags=[ rrf_Read ], sif_Flags=[ sif_Parity | sif_StripeRepair | sif_difBlkSize_4K ]
RG: # 0x006595C9, StripeRepair for psRSI=0x300088765d40, stripeNr=6657481 started, offsetInStride=1.
RG: & 0x006595C9, Lock Peer, DIL::Add_Entry(psRR=0x3000834056c8).
{s0004-84D3} 09-04-2016 04:04:01. 453299. usec
RG: > 0x006595C9, Dumping the Recovery Map, * Stripe Bitmap Dump for: 0x300083404008
RG: smt_SR_Read : 0x00000000000003FE, +{ 01 02 03 04 05 06 07 08 09 }
RG: [ 01 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 02 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 03 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 04 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 05 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 06 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 07 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 08 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: [ 09 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
...
RG: ! 0x006595C9, Logical Device or Buffer Device Errors found, * Stripe Bitmap Dump for: 0x300083404008
RG: smt_Error : N/A
RG: smt_LogDevError : 0x0000000000000002, +{ 01 }
RG: [ 01 ] FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: @ 0x006595C9, Build Write Fix Sequence for stripeNr=6657481 logDevMap=0x0000000000000FFF, +{ 00 01 02 03 04 05 06 07 08 09 10 11 }
RG: @ 0x006595C9, Dumping the Fix Sequence:
RG: Write: < buf=01, log=01, phy=03 >, map=FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: @ 0x006595C9, Staring the Fix Sequence ...
{s0004-84D5} 09-04-2016 04:04:02. 186657. usec
RG: @ 0x006595C9, Fix Sequence finished.
RG: + 0x006595C9, Recovery Done, Dumping the SR map, * Stripe Bitmap Dump for: 0x300083404008
- The last two lines indicate that the repair was likely successful.
- Look for any T10DIF ERROR messages immediately afterwards; these would indicate a corrupted LUN, and the data would have to be restored.
RG: RR-T10DIF ERROR lba=0x1FBECED80 RU=0x3000a6f537b8
RG: GOT T10DIF ERROR during Stripe repair, Calculated CRC:0x4495, media-read-CRC:0xA3B9 !!!!!
- The Sense Key/ASC/ASCQ code (03/11/00) on the fourth line indicates an unrecovered read error.
- Using one of the other log bundles, look in the event log around the timestamps on the fifth line from the top (Sun Sep 4 04:04:01 2016) and the third line from the end (09-04-2016 04:04:02):
2016-09-04T04:04:02.241 RAID_EVENT_SAVE_TRACE INFORMATIONAL ComponentName=/ENCLOSURE-01, Source=/CONTROLLER-01
2016-09-04T04:04:02.243 COLLECT_LOGS_INITIATED INFORMATIONAL
2016-09-04T04:04:02.305 CM_EVT_BS_TRIAGE_INFO INFORMATIONAL Source=/CONTROLLER-02, TriageInfo=sd:0x0 requesting IC - reason: BS_SLAT_IC_REQ_PM_MEDIA_ERR
2016-09-04T04:04:03.250 CM_EVT_SLAT_INTEGRITY_CHECK_STARTED INFORMATIONAL ComponentName=/Storage Domain 1, Source=/CONTROLLER-02
2016-09-04T04:04:22.237 CALL_HOME_SUCCEEDED INFORMATIONAL ComponentName=/INTERNAL_OP/13771558/SYSTEM, reason=EVENT, bundle=AK0027XXXX-160901230146-160904040402-RAID_EVENT_SAVE_TRACE-e-01-02
2016-09-04T04:09:08.822 CALL_HOME_SUCCEEDED INFORMATIONAL ComponentName=/INTERNAL_OP/13771649/SYSTEM, reason=EVENT, bundle=AK0027XXXX-160901230146-160904040613-RAID_EVENT_SAVE_TRACE-e-02-02
2016-09-04T05:57:57.731 PM_EVT_REPL_RELOAD_CONFIGURATION INFORMATIONAL
This confirms the sector repair and no further action is required.
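The Sense Key/ASC/ASCQ triples called out in these excerpts can be pulled from the raw Sense bytes with a small sketch like the following; it is not part of the product tooling. The byte offsets follow the standard SCSI sense-data layouts (descriptor format 0x72/0x73 versus fixed format 0x70/0x71), and the two text labels are the standard SPC meanings of the codes seen in this document.

```python
# Sketch: decode Sense Key/ASC/ASCQ from the raw "Sense" bytes shown in
# the raid*.log excerpts. Descriptor-format sense data (0x72/0x73)
# carries key/ASC/ASCQ in bytes 1-3; fixed format (0x70/0x71) carries
# them in byte 2 (low nibble), 12, and 13.
KNOWN = {  # the two codes discussed in this document
    (0x03, 0x11, 0x00): 'Medium Error - Unrecovered Read Error',
    (0x06, 0x29, 0x04): 'Unit Attention - Device Internal Reset',
}

def decode_sense(hex_bytes):
    b = [int(x, 16) for x in hex_bytes.split()]
    if (b[0] & 0x7F) in (0x72, 0x73):      # descriptor-format sense data
        key, asc, ascq = b[1] & 0x0F, b[2], b[3]
    else:                                   # fixed-format (0x70/0x71)
        key, asc, ascq = b[2] & 0x0F, b[12], b[13]
    label = KNOWN.get((key, asc, ascq), 'unknown')
    return '%02X/%02X/%02X (%s)' % (key, asc, ascq, label)
```

For the first excerpt (Sense starting 0x72 0x03 0x11 0x00) this yields 03/11/00, the unrecovered read error; for the second excerpt (Sense starting 0x70) it yields 06/29/04, the device reset discussed later in this document.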
- For the RAID_EVENT_SAVE_TRACE that did have other log bundle collections associated with it:
- From the RAID_EVENT_SAVE_TRACE log bundle, examine the end of the appropriate raid*.log file:
Check Condition; cdb 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x80 0x00 0x00 0x00 0x00 0x00 0x00 0x00
SenseLength 0000-0018, Sense 0x70 0x00 0x06 0x00 0x00 0x00 0x00 0x38 0x00 0x00 0x00 0x00 0x29 0x04 0x00 0x00 0x00 0x00 0xC0
SER C 00 T 0039 Sts 0202 RC 0000, SR 0000-3007-F229-6DE0
Key/Asc/AscQ 06/29/04 Cmd 0000-0012, LBA 0000-0000-0000-0000, XfrLength 0000-0000
Tue Sep 6 22:49:34 2016 + 573..+ 85751. msec
Note the Sense Key/ASC/ASCQ code (06/29/04), indicating a device reset, and the timestamp (Sep 6 22:49:34 2016) in the last two lines. This likely means there were multiple failed reads or writes, whereupon the system gave up and issued the reset.
- Searching this RAID log for the keyword OFFLINE identifies the actual drive port (57):
Removal call 0000-0039, 0000-0000
n----> TRANS: DrvPresent->Remove P( 57.)
RemoveStateHndlr -- Taking Port Offline. Port 57.
DM::RemoveStateHndlr -- Need to Send AE when done. Port 57.
!!!! Clearing Diag err bit !!!!
-@-@ Changing NVR Drive State to 0000-0401 @-@-
Clearing WSM for tgt 02 slot 09
*******************************************************************
************** DRIVE HAS BEEN REMOVED ON PORT 57. *****************
*******************************************************************
Calling Xpt 57. Commands On Hold 0000-0000-0000-012C FFFF-FFFF
************** PORT 57. IS BEING TAKEN OFFLINE *****************
PORT 57.:
Ident Data max LBA = 0000-0000-BA4D-4AAF
Model # = LB1606R SUN1.6T^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
Serial # = 41567872^@^@^@^@tag/T3BF
F/W Rev = S311^@^@^@^@
...
ReturnPendingCommandToCaller
DM::CB_FcRwReq -- Drive Offline. P( 57.) tgtId( 57.) ToSC
RG: ! Disk Error Detected in "DiskReadCallback()", psDR=0x300af1df2898, errCode=28("DM- Port Offline"), crcCheckFla
RG: bufIdx=2, logIdx=2, phyIdx=3, portNum=57, diskLBA=2637725184(0x9D387E00), blocks=128(0x80)
RG: stripMap=FFFFFFFFFFFFFFFF : FFFFFFFFFFFFFFFF, [ 0, 128 )
RG: psSI=0x300af7799000, stripeNr=20607228(0x013A70FC), flags=[ sif_WrLock | sif_WrIP | sif_Parity | sif_DIL_Allo
RG: ! AHHH, err_DM_PortOffline Read psDR=0x300af1df2898
RG: % Set sif_PortOffline for psRSI=0x300af7799000, stripeNum=0x013A70FC
RG: a Abort the sequence due to err=28(DM- Port Offline)
...
RG: . Ignore: sif_PortOffline already set for psRSI=0x300af63e20e0, stripeNum=0x00C37372
# 0x01351517, StripeRepair
RG: ! 0x01351517, LUN=13, psRU=0x300ae27992f0 offline/portOffline, reject all the pending requests
RG: - Reject psRR=0x300af3260e50 with errCode:28
RG: @ Return psRR=0x300af3260e50, errCode=28("DM- Port Offline"), stripeNr=0x1351517
RG: rrf_Flags=[ rrf_Read ]
RG: * 0x01351517, StripeRepair for psRSI=0x300af7bce040, stripeNr=20256023 skipped since array/port offline.
RG: ! 0x01351517, LUN=13, psRU=0x300ae27992f0 portOffline, reject all the incoming requests
RG: . List Empty, nothing to reject
RG: % Clear sif_PortOffline for psRSI=0x300af7bce040, stripeNum=0x01351517
RG: # 0x01351517, StripeRepair for psRSI=0x300af7bce040, stripeNr=20256023 Ended.
RG: % ON SR_END
...
Target::AE_Handler Local, 0000-302C-0708-38F8, id + 1.
Tgt - DrvOff AE p( 57.) t( 2.) desc(03) psFT(E27F-A170)
RUSTY::ae_Drive: port 57., event 1., desc 3.
RUSTY::AE_Drive offline LostPart, psRU E279-92F0 flags 0031-D000 portNum=+ 57. partnum=+ 3.
LostPart LUN: 13., Part: 3.
LUN_IDLE, tgt 0x02 lun 0x01 rstFlags 0x20 lunFlags 0x42
LUN_Idle_2 Must Wait
Companion responded that it doesn't have the token, Good.
SES::PollEnclosureStatus: IMMED 20160906-22:49:07.781228963
SES::PollEnclosureStatus: Found active enclosure 0x02D2 SES flags = 0x0090
SES::PollEnclosureStatus: Requesting SES page two for enclosure 0x02D2
{r0042-2125} 09-06-2016 22:49:08. 55379. usec
{r0042-2127} 09-06-2016 22:49:08. 67814. usec
{s0005-4DC9} 09-06-2016 22:49:08. 95034. usec
{r0042-2129} 09-06-2016 22:49:08. 95058. usec
{r0042-212B} 09-06-2016 22:49:08. 103035. usec
Release Commands From Hold
Tue Sep 6 22:49:10 2016 + 787..+ 75608. msec
SlotNumber 0000-0000
Release Commands From Hold
SlotNumber 0000-0001
The line "Clearing WSM for tgt 02 slot 09" says that this drive is target 0x2, slot 9. Targets, in this case Drive Enclosures, are allotted 24 slots each, so multiplying the target by 24 and adding the slot (24 + 24 + 9) gives the port number (57).
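The port-number arithmetic can be expressed as a one-line helper. This is a sketch; the 24-slots-per-target rule is as stated above.

```python
# Sketch of the port-number arithmetic: targets (Drive Enclosures) are
# allotted 24 slots each, so port = target * 24 + slot.
SLOTS_PER_TARGET = 24

def port_number(target, slot):
    return target * SLOTS_PER_TARGET + slot
```

For the drive above (tgt 02, slot 09): 24 + 24 + 9 = 57.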
- From the ENCLOSURE_DRIVE_STATE_CHANGE log bundle, examine the *.chsh.xml file for the timestamp (2016-09-06T22:49:10) and the offending component (ENCLOSURE-04):
<EventType>ENCLOSURE_DRIVE_STATE_CHANGE</EventType>
<Severity>WARNING</Severity>
<Category>SYSTEM</Category>
<Time>2016-09-06T22:49:10.878</Time>
<ComponentIdentity>
  <WuName>5080020001A9CE70</WuName>
</ComponentIdentity>
<ComponentName>/ENCLOSURE-04</ComponentName>
<SourceNodeIdentity>
  <Id>508002000158C671</Id>
  <Fqn>/CONTROLLER-02</Fqn>
</SourceNodeIdentity>
- Now compare these values with the event log from the ENCLOSURE_DRIVE_STATE_CHANGE log bundle:
2016-09-06T22:49:10.843 RAID_EVENT_SAVE_TRACE INFORMATIONAL ComponentName=/ENCLOSURE-11, Source=/CONTROLLER-01
2016-09-06T22:49:10.844 COLLECT_LOGS_INITIATED INFORMATIONAL
2016-09-06T22:49:10.878 ENCLOSURE_DRIVE_STATE_CHANGE WARNING ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02, cruId=CRU_DRIVE9, action=CRU_OFFLINE
2016-09-06T22:49:10.892 RAID_EVENT_PARTIAL_OFFLINE_STATE_CHANGE INFORMATIONAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-01, cruId=CRU_DRIVE9, partialOfflineState=PARTIAL_OFFLINE
2016-09-06T22:49:10.902 PSG_PI_EVENT_PATH_FAILURE WARNING ComponentName=/CONTROLLER-01, Source=/CONTROLLER-01, reasonCode=PDS_ILLEGAL
2016-09-06T22:49:11.205 ENCLOSURE_DRIVE_REMOVED_FROM_DRIVE_GROUP WARNING ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02, cruId=CRU_DRIVE9
2016-09-06T22:49:11.206 RAID_EVENT_SAVE_TRACE INFORMATIONAL ComponentName=/ENCLOSURE-11, Source=/CONTROLLER-02
2016-09-06T22:49:12.162 RAID_EVENT_PARTIAL_OFFLINE_STATE_CHANGE INFORMATIONAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-01, cruId=CRU_DRIVE9, partialOfflineState=NORMAL
2016-09-06T22:49:13.597 RAID_EVENT_ENCLOSURE_ARRAY_DEVICE_STATUS_CRITICAL CRITICAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02, cruId=CRU_DRIVE9
2016-09-06T22:49:13.632 PSG_PI_EVENT_PATH_RESTORED INFORMATIONAL ComponentName=/CONTROLLER-01, Source=/CONTROLLER-01
2016-09-06T22:49:18.153 RAID_EVENT_MAINT_BLOCKED INFORMATIONAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02
2016-09-06T22:49:25.419 CALL_HOME_SUCCEEDED INFORMATIONAL ComponentName=/INTERNAL_OP/13973922/SYSTEM, reason=EVENT, bundle=AK0027XXXX-160906200803-160906224910-RAID_EVENT_SAVE_TRACE-e-01-02.tar
2016-09-06T22:49:29.629 ENCLOSURE_MAINTENANCE_STARTED INFORMATIONAL ComponentName=/ENCLOSURE-04, Source=/CONTROLLER-02
This confirms an actual drive issue in ENCLOSURE-04, DRIVE-09.
See KM Document 2126081.1, FS System: How to Handle a "Save Trace Event" Auto-Service Request, for additional details. Handling the event will likely mean examining other logs generated immediately before and after it to get a more complete picture.
Attachments
This solution has no attachments.