Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type Problem Resolution Sure

Solution 2272177.1 : Understanding MRdiagd Events in /var/log/messages file Produced by Internal RAID HBA
Applies to:
Exadata X6-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine V2 - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine X2-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X3-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X4-2 Hardware - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Image 12.1.2.2.0 and above include the mrdiagd service monitor. This document helps explain the mrdiagd codes displayed in /var/log/messages; it is not intended to provide a full diagnosis of the events. It only explains what the "code=" field of the message is displaying. Further diagnosis of the event type is still required and is beyond the scope of this document.

Symptoms
MRdiagd events appear in /var/log/messages.

Changes
Image 12.1.2.2.0 and above include the mrdiagd service monitor.

Cause
Exadata image 12.1.2.2.0 and above.

Solution
With the introduction of the LSI Diagnostic Service monitor, mrdiagd events are now reported to the /var/log/messages file.
Examples of some of the messages can be seen below:

Feb 12 12:32:55 hostname MRdiagd: MR Controller event (seq 103617) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=113 (PD:Info)
Feb 12 06:43:55 hostname MRdiagd: MR Controller event (seq 103615) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=110 (PD:Info)
Feb 10 21:33:37 hostname MRdiagd: MR Controller event (seq 103599) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=65 (LD:Progress)
Feb 16 20:37:15 hostname MRdiagd: MR Controller event (seq 103630) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=112 (PD:Warning)
Feb 16 20:05:52 hostname MRdiagd: MR Controller event (seq 103626) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=251 (LD:Critical)
The messages above show examples of Informational, Progress, Warning and Critical messages. A message can be quickly understood by viewing the RAID controller FWTermLog file.

Example entry in /var/log/messages:

Feb 12 12:32:55 hostname MRdiagd: MR Controller event (seq 103617) tracer=Controller_500605b004921420 ctrlId=500605b004921420 code=113 (PD:Info)
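As a rough illustration, the fields of such an entry could be pulled apart with a regular expression. The Python sketch below assumes the line layout matches the samples in this document; the pattern and the name parse_mrdiagd_line are illustrative only and are not part of any Oracle or Broadcom tool.

```python
import re

# Pattern inferred from the sample lines in this document (an assumption);
# real mrdiagd output may carry fields that are not shown here.
MRDIAGD_RE = re.compile(
    r"MRdiagd: MR Controller event \(seq (?P<seq>\d+)\)\s+"
    r"tracer=(?P<tracer>\S+)\s+"
    r"ctrlId=(?P<ctrl_id>\S+)\s+"
    r"code=(?P<code>\d+)\s+"
    r"\((?P<target>[A-Za-z]+):(?P<severity>[A-Za-z]+)\)"
)


def parse_mrdiagd_line(line):
    """Return the event fields as a dict, or None if the line does not match."""
    match = MRDIAGD_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["seq"] = int(fields["seq"])    # decimal sequence number
    fields["code"] = int(fields["code"])  # decimal event code
    return fields


sample = (
    "Feb 12 12:32:55 hostname MRdiagd: MR Controller event (seq 103617) "
    "tracer=Controller_500605b004921420 ctrlId=500605b004921420 "
    "code=113 (PD:Info)"
)
event = parse_mrdiagd_line(sample)
print(event["seq"], event["code"], event["target"], event["severity"])
```

The sequence number returned here is the value to search for in the FWTermLog.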
Search the FWTermLog for the sequence number, 103617. In the entry below it appears as EVT#103617.

Entry in the controller Firmware Termlog (FWTermLog):

02/12/17 12:32:55: EVT#103617-02/12/17 12:32:55: 113=Unexpected sense: PD 0c(e0xfc/s3) Path 5000cca025432481, CDB: 28 00 07 9e c8 91 00 05 68 00, Sense: 3/11/00
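The search described above could be scripted along the following lines. This is a minimal Python sketch that assumes the FWTermLog has already been saved as text and that entries carry the EVT#<seq>- marker in the form shown in this document; the helper names find_termlog_event and extract_sense are hypothetical. The sense triplet is read as hexadecimal sense key / ASC / ASCQ, which is the conventional SCSI ordering.

```python
import re


def find_termlog_event(termlog_text, seq):
    """Return FWTermLog lines that carry the marker EVT#<seq>-."""
    # The trailing "-" keeps seq 103617 from matching EVT#1036170.
    marker = f"EVT#{seq}-"
    return [line for line in termlog_text.splitlines() if marker in line]


def extract_sense(entry):
    """Return (sense key, ASC, ASCQ) as integers, or None if not present."""
    match = re.search(
        r"Sense:\s*([0-9a-fA-F]+)/([0-9a-fA-F]+)/([0-9a-fA-F]+)", entry
    )
    if match is None:
        return None
    return tuple(int(part, 16) for part in match.groups())


# Sample text copied from the entry shown in this document.
termlog = (
    "02/12/17 12:32:55: EVT#103617-02/12/17 12:32:55: 113=Unexpected sense: "
    "PD 0c(e0xfc/s3) Path 5000cca025432481, "
    "CDB: 28 00 07 9e c8 91 00 05 68 00, Sense: 3/11/00\n"
)
hits = find_termlog_event(termlog, 103617)
print(hits[0])
print(extract_sense(hits[0]))  # (3, 17, 0), i.e. 3/11/00
```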
This example shows that the unexpected sense was 3/11/00 = UNRECOVERED READ ERROR.

If the controller event log is viewed, it shows the sequence number as 0x000194c1 (hexadecimal) = 103617, and the same Sense 3/11/00 appears again:

seqNum: 0x000194c1 Code: 0x00000071
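To cross-check an event log entry against /var/log/messages, the hexadecimal values can be converted back to decimal. The Python sketch below assumes the seqNum and Code fields appear in the form shown in this document; the function name decode_event_header is hypothetical.

```python
import re

# Matches the "seqNum: 0x... Code: 0x..." form quoted in this document.
EVENT_RE = re.compile(
    r"seqNum:\s*(0x[0-9a-fA-F]+)\s+Code:\s*(0x[0-9a-fA-F]+)"
)


def decode_event_header(text):
    """Return (sequence number, event code) as decimal integers, or None."""
    match = EVENT_RE.search(text)
    if match is None:
        return None
    # int(..., 16) accepts the leading "0x" prefix.
    return int(match.group(1), 16), int(match.group(2), 16)


seq, code = decode_event_header("seqNum: 0x000194c1 Code: 0x00000071")
print(seq, code)  # 103617 113
```

The two values match the "seq 103617" and "code=113" fields of the /var/log/messages entry.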
Decoding the message code in full requires either the LSI/Broadcom/Avago document "12Gb/s MegaRAID® SAS Software User Guide" or "MegaRAID SAS Software User Guide". The links below may change; if they do, search the Broadcom site for the MegaRAID SAS Software User Guide to find a copy with the relevant information.

https://www.broadcom.com/products/storage/raid-controllers/megaraid-sas-9361-8i#documentation
https://docs.broadcom.com/docs/12353236

In the guide, go to the Appendix: Events, Messages, and Behaviors.
The code displayed in the /var/log/messages file is a decimal number. To decode it, convert it to hexadecimal. For example, code=113 converted to hexadecimal gives 0x0071.
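The conversion is simple arithmetic; a short Python sketch, with the hypothetical helper name code_to_hex, produces the four-digit form used in this document:

```python
def code_to_hex(code):
    """Format a decimal MRdiagd code as four hex digits, e.g. 113 -> 0x0071."""
    return f"0x{code:04x}"


# Codes taken from the sample messages in this document.
for code in (113, 110, 65, 112, 251):
    print(code, code_to_hex(code))
```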
Look up 0x0071 in the Appendix, which shows:

0x0071 | Warning | Unexpected sense: %s, CDB%s, Sense: %s | Logged when an I/O fails due to unexpected reasons and sense data needs to be logged.
The Appendix lists many codes that can be displayed. As a quick example, the other codes shown in this document decode as follows:

code=110 (PD:Info) - event 0x006e - Logged when recovery completed successfully and fixed a medium error
code=65 (LD:Progress) - event 0x0041 - Logs Consistency Check progress
code=112 (PD:Warning) - event 0x0070 - Logged when a drive is removed from the controller
code=251 (LD:Critical) - event 0x00fb - Logged when a logical drive state changes to degraded state
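For quick reference, the codes decoded in this document can be kept in a small lookup table. The Python sketch below contains only the five entries quoted here; it is not a copy of the Appendix, and the names EVENT_DESCRIPTIONS and describe_code are hypothetical.

```python
# Only the events quoted in this document; the full list is in the
# "Events, Messages, and Behaviors" appendix of the MegaRAID guide.
EVENT_DESCRIPTIONS = {
    0x0041: "Logs Consistency Check progress",
    0x006E: "Logged when recovery completed successfully and fixed a medium error",
    0x0070: "Logged when a drive is removed from the controller",
    0x0071: "Logged when an I/O fails due to unexpected reasons "
            "and sense data needs to be logged",
    0x00FB: "Logged when a logical drive state changes to degraded state",
}


def describe_code(code):
    """Map a decimal code from /var/log/messages to its hex form and text."""
    description = EVENT_DESCRIPTIONS.get(code)
    if description is None:
        description = "not listed in this document, check the Appendix"
    return f"code={code} (0x{code:04x}): {description}"


print(describe_code(251))
print(describe_code(999))
```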
Attachments
This solution has no attachment