Understanding MRdiagd Events in /var/log/messages file Produced by Internal RAID HBA

Asset ID:	1-72-2272177.1
Update Date:	2018-04-09
Keywords:

Solution Type Problem Resolution Sure

Solution 2272177.1 : Understanding MRdiagd Events in /var/log/messages file Produced by Internal RAID HBA

Applies to:

Exadata X6-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine V2 - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine X2-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X3-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X4-2 Hardware - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Image 12.1.2.2.0 and above include the mrdiagd service monitor.

This document helps explain the mrdiagd codes displayed in /var/log/messages; it is not intended to provide a full diagnosis of the events.
It explains only what the "code=" field in the message means. Further diagnosis of the event type is still required and is
beyond the scope of this document.

Symptoms

MRdiagd events in /var/log/messages

Changes

Image 12.1.2.2.0 and above include the mrdiagd service monitor.

Cause

Exadata image 12.1.2.2.0 and above

Solution

With the introduction of the LSI Diagnostic Service monitor, mrdiagd events are now reported in the /var/log/messages file.

Examples of some of the messages can be seen below:

Feb 12 12:32:55 hostname MRdiagd: MR Controller event (seq 103617) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=113 (PD:Info)

Feb 12 06:43:55 hostname MRdiagd: MR Controller event (seq 103615) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=110 (PD:Info)

Feb 10 21:33:37 hostname MRdiagd: MR Controller event (seq 103599) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=65 (LD:Progress)

Feb 16 20:37:15 hostname MRdiagd: MR Controller event (seq 103630) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=112 (PD:Warning)

Feb 16 20:05:52 hostname MRdiagd: MR Controller event (seq 103626) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=251 (LD:Critical)

The messages above show examples of Informational, Progress, Warning and Critical messages.
Many more message types are possible; these are only a brief sample.
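
As an illustration, the fields in these lines can be pulled apart with a short script. This is a minimal sketch that assumes every MRdiagd line follows the layout of the samples above. The pattern and the function name parse_mrdiagd_line are illustrative and are not part of any Oracle or LSI tool.

```python
import re

# Pattern modelled on the sample lines in this document; other message
# layouts may exist and would need their own pattern.
MRDIAGD_RE = re.compile(
    r"MRdiagd: MR Controller event \(seq (?P<seq>\d+)\) "
    r"tracer=(?P<tracer>\S+) ctrlId=(?P<ctrl_id>\S+) "
    r"code=(?P<code>\d+) \((?P<locale>\w+):(?P<severity>\w+)\)"
)


def parse_mrdiagd_line(line):
    """Return a dict of fields, or None if the line is not an MRdiagd event."""
    match = MRDIAGD_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # seq and code are printed in decimal in /var/log/messages.
    fields["seq"] = int(fields["seq"])
    fields["code"] = int(fields["code"])
    return fields


sample = (
    "Feb 12 12:32:55 hostname MRdiagd: MR Controller event (seq 103617) "
    "tracer=Controller_500605b004921420 ctrlId=500605b004921420 "
    "code=113 (PD:Info)"
)
event = parse_mrdiagd_line(sample)
print(event["seq"], event["code"], event["locale"], event["severity"])
# prints: 103617 113 PD Info
```

The seq value is the number to search for in the FWTermLog, and the code value is the number to convert to hexadecimal.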

The message can be quickly understood by viewing the RAID controller FWTermLog file.

Entry in /var/log/messages

Example:

/var/log/messages:

Feb 12 12:32:55 hostname MRdiagd: MR Controller event (seq 103617) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=113 (PD:Info)

Search the FWTermLog for the sequence number, 103617. In the FWTermLog it appears as EVT#103617.

Entry in the controller Firmware Termlog (FWTermLog):

02/12/17 12:32:55: EVT#103617-02/12/17 12:32:55: 113=Unexpected sense: PD 0c(e0xfc/s3) Path 5000cca025432481, CDB: 28 00 07 9e c8 91 00 05 68 00, Sense: 3/11/00

This example shows that the unexpected sense was 3/11/00 = UNRECOVERED READ ERROR.
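
The search described above can be sketched in a few lines. This sketch assumes the FWTermLog has already been saved as text and that each event appears on one line in the "EVT#<seq>-" form shown in the example. The function names and the one-entry sense table are illustrative; only the 3/11/00 meaning is taken from this document.

```python
import re

# Only the sense value discussed in this document is listed here.
# The sense numbers are assumed to be printed in hexadecimal.
KNOWN_SENSE = {(0x3, 0x11, 0x00): "UNRECOVERED READ ERROR"}


def find_termlog_event(termlog_text, seq):
    """Return the first line containing EVT#<seq>, or None if not found."""
    marker = re.compile(r"EVT#%d(?!\d)" % seq)
    for line in termlog_text.splitlines():
        if marker.search(line):
            return line
    return None


def extract_sense(line):
    """Return (sense key, ASC, ASCQ) as integers, or None if absent."""
    match = re.search(
        r"Sense: ([0-9a-fA-F]+)/([0-9a-fA-F]+)/([0-9a-fA-F]+)", line
    )
    if match is None:
        return None
    return tuple(int(part, 16) for part in match.groups())


termlog = (
    "02/12/17 12:32:55: EVT#103617-02/12/17 12:32:55: 113=Unexpected sense: "
    "PD 0c(e0xfc/s3) Path 5000cca025432481, "
    "CDB: 28 00 07 9e c8 91 00 05 68 00, Sense: 3/11/00\n"
)
entry = find_termlog_event(termlog, 103617)
sense = extract_sense(entry)
print(sense, KNOWN_SENSE.get(sense, "not listed in this sketch"))
# prints: (3, 17, 0) UNRECOVERED READ ERROR
```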

If the controller event log is viewed, it shows the sequence number as 0x000194c1 (hexadecimal) = 103617 (decimal), and again we see the Sense 3/11/00.

seqNum: 0x000194c1
Time: Sun Feb 12 12:32:55 2017

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0c(e0xfc/s3) Path 5000cca025432481, CDB: 28 00 07 9e c8 91 00 05 68 00, Sense: 3/11/00
Event Data:
===========
Device ID: 12
Enclosure Index: 252
Slot Number: 3
CDB Length: 10
CDB Data:
0028 0000 0007 009e 00c8 0091 0000 0005 0068 0000 0000 0000 0000 0000 0000 0000
Sense Length: 32
Sense Data:
00f0 0000 0003 0007 009e 00c8 0094 0018 0000 0000 0000 0000 0011 0000 002d 0080 0006 007f 0000 0000 00f7 002d 0000 0000 0000 0051 00ee 0002 0004 0054 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
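
To match an event log entry like the one above against /var/log/messages, the hexadecimal seqNum and Code values need to be converted to decimal. The following is a minimal sketch, assuming the event log text uses the "seqNum:" and "Code:" labels exactly as shown in the example. The function name is illustrative.

```python
import re


def parse_event_log_header(text):
    """Return the seqNum and Code fields of an event log entry as integers."""
    fields = {}
    for key in ("seqNum", "Code"):
        match = re.search(r"^%s:\s*(0x[0-9a-fA-F]+)" % key, text, re.MULTILINE)
        if match:
            # int() with base 16 accepts the leading "0x" prefix.
            fields[key] = int(match.group(1), 16)
    return fields


entry_text = """seqNum: 0x000194c1
Time: Sun Feb 12 12:32:55 2017

Code: 0x00000071
Class: 0
Locale: 0x02
"""
header = parse_event_log_header(entry_text)
print(header["seqNum"], header["Code"])
# prints: 103617 113
```

The two decimal values correspond to "seq 103617" and "code=113" in the /var/log/messages line.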

Decoding the message code in full requires either the LSI/Broadcom/Avago document "12Gb/s MegaRAID® SAS Software User Guide" or the "MegaRAID SAS Software User Guide".

The links below may change. If they do, search the Broadcom site for the MegaRAID SAS Software User Guide to find a copy of the guide with the relevant information.

https://www.broadcom.com/products/storage/raid-controllers/megaraid-sas-9361-8i#documentation

https://docs.broadcom.com/docs/12353236

In the guide, go to the Appendix: Events, Messages, and Behaviors.

The code displayed in the /var/log/messages file is a decimal number. To decode it, convert it to hexadecimal.

For example, code=113 converted to hexadecimal is 0x0071.

Look up 0x0071 in the Appendix and this shows:

0x0071 | Warning | Unexpected sense: %s, CDB%s, Sense: %s | Logged when an I/O fails due to unexpected reasons and sense data needs to be logged.
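
The decimal-to-hexadecimal step can be done with any calculator. As a small sketch, the helper below (an illustrative name, not part of any tool) formats the decimal code in the four-digit style used above.

```python
def mrdiagd_code_to_hex(code):
    """Format a decimal MRdiagd code as a four-digit hexadecimal string."""
    return "0x{:04x}".format(code)


print(mrdiagd_code_to_hex(113))
# prints: 0x0071
```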

As the Appendix shows, many different codes can be displayed. As a quick example, the other codes shown in this document decode as follows:

code=110 (PD:Info) - this would be 0x006e - Logged when recovery completed successfully and fixed a medium error

code=65 (LD:Progress) - this would be 0x0041 - Logs Consistency Check progress

code=112 (PD:Warning) - this would be 0x0070 - Logged when a drive is removed from the controller

code=251 (LD:Critical) - this would be 0x00fb - Logged when a logical drive state changes to degraded state
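
The codes discussed in this document can be collected into a small lookup table for quick reference. This is only a sketch: the table holds just the five codes described here, with the descriptions as given in this document, and the names EVENT_DESCRIPTIONS and describe are illustrative. Any other code still has to be looked up in the Appendix of the guide.

```python
# Only the five codes covered in this document; keys are written in
# hexadecimal to match the Appendix.
EVENT_DESCRIPTIONS = {
    0x0071: "Logged when an I/O fails due to unexpected reasons "
            "and sense data needs to be logged",
    0x006e: "Logged when recovery completed successfully "
            "and fixed a medium error",
    0x0041: "Logs Consistency Check progress",
    0x0070: "Logged when a drive is removed from the controller",
    0x00fb: "Logged when a logical drive state changes to degraded state",
}


def describe(code):
    """Return a one-line summary for a decimal MRdiagd code."""
    hex_code = "0x{:04x}".format(code)
    text = EVENT_DESCRIPTIONS.get(
        code, "not listed in this document; see the Appendix"
    )
    return "code={} -> {}: {}".format(code, hex_code, text)


for code in (113, 110, 65, 112, 251):
    print(describe(code))
```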

Attachments

This solution has no attachment