Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-2365134.1
Update Date:2018-04-30
Keywords:

Solution Type  Troubleshooting Sure

Solution  2365134.1 :   FS System: Using the Node Matrix in the pcp.log File to Troubleshoot Component Boot Issues  


Related Items
  • Oracle FS1-2 Flash Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx




In this Document
Purpose
Troubleshooting Steps
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: restricted information

Applies to:

Oracle FS1-2 Flash Storage System - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Purpose

To document the node matrix in the pcp.log of the FS1-2 storage system and how you can use the information for initial troubleshooting of problems.  This document requires that the nodes be seen on the Private Management Interface (PMI). 

Troubleshooting Steps


The Pilot Control Process log, pcp.log, is a file in the /var/log directory that is synchronized across both Pilots.  Information in this file is a good starting point for understanding system-wide failures.  It shows system node (Pilot or Controller) events such as restarts, failovers, and failbacks.

Viewing this file live (using tailf or tail -f) lets you monitor the current status of the system nodes, while the copy of the file in a log bundle provides historical data.  The pcp.log file is updated at 5-second intervals, but the node matrix itself updates every second.  Nodes 2 and 3 represent Pilots 1 and 2, while nodes 128 and 129 represent Controllers 1 and 2, respectively.  Node 255 represents the entire system; it can be ignored because it provides no information useful for troubleshooting.  The last row indicates which nodes have active/master or passive/slave status.

A typical (normal) node matrix is shown below.  Filtering the log with grep and cut while troubleshooting an actual problem will likely hide other important clues; the filters are used here only to focus on the node matrix itself:

[root@pilot2 log]# tailf /var/log/pcp.log | grep -A 7 "node matrix" | cut -c67-
 pcp:info fofb: node matrix
 node           3          2        129        128
   3:   1 (  4  6  0)(  0  6  0)( 20  6  0)( 20  6  0)
   2:   1 (  0  6  0)(  0  6  0)( 20  6  0)( 20  6  0)
 129:   1 ( 20  6  0)( 20  6  0)(  0  6  0)(  0  6  0)
 128:   1 ( a0  6  0)( a0  6  0)( 80  6  0)( c0  6  0)
 255:     (  0  6  0)(  0  6  0)(  0  6  0)(  0  6  0)
   1 -1 0 3 3


The entries in the second row and the first column identify the nodes in the matrix.  The first column holds the node number for that row, followed by a heartbeat count.  If the heartbeat count is seen to be increasing, the node at the beginning of that row is not reporting heartbeats, which can indicate a failing node.

Rows that begin with a node number show how that node is seen by the other nodes (columns).  For the purposes of identification in this document, the entries of the second row are used to indicate columns, and the columns are delineated below by vertical marks (|):

[root@pilot2 log]# tailf /var/log/pcp.log | grep -A 7 "node matrix" | cut -c67-
 pcp:info fofb: node matrix
 node     |       3     |       2     |     129     |     128
   3:   1 | (  4  6  0) | (  0  6  0) | ( 20  6  0) | ( 20  6  0)
   2:   1 | (  0  6  0) | (  0  6  0) | ( 20  6  0) | ( 20  6  0)
 129:   1 | ( 20  6  0) | ( 20  6  0) | (  0  6  0) | (  0  6  0)
 128:   1 | ( a0  6  0) | ( a0  6  0) | ( 80  6  0) | ( c0  6  0)
 255:     | (  0  6  0) | (  0  6  0) | (  0  6  0) | (  0  6  0)
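
As a rough illustration of the row and column layout described above, the following Python sketch parses one matrix dump into a nested dictionary keyed by row node and then column node.  The name parse_node_matrix and the regular expression are hypothetical and not part of any FS1-2 tooling.  The sketch assumes that the three values in each cell are printed in hexadecimal, which values such as a0 and c0 suggest but this document does not state outright.

```python
import re

# One "(flags state stage)" cell. ASSUMPTION: all three fields are hexadecimal.
CELL_RE = re.compile(r"\(\s*([0-9a-f]+)\s+([0-9a-f]+)\s+([0-9a-f]+)\s*\)")

SAMPLE = """\
 pcp:info fofb: node matrix
 node           3          2        129        128
   3:   1 (  4  6  0)(  0  6  0)( 20  6  0)( 20  6  0)
   2:   1 (  0  6  0)(  0  6  0)( 20  6  0)( 20  6  0)
 129:   1 ( 20  6  0)( 20  6  0)(  0  6  0)(  0  6  0)
 128:   1 ( a0  6  0)( a0  6  0)( 80  6  0)( c0  6  0)
 255:     (  0  6  0)(  0  6  0)(  0  6  0)(  0  6  0)
   1 -1 0 3 3
"""


def parse_node_matrix(text):
    """Return {row_node: {column_node: (flags, state, stage)}} for one dump."""
    columns = []
    matrix = {}
    for line in text.splitlines():
        # Tolerate the "|" column marks used for illustration in this document.
        line = line.replace("|", " ").strip()
        if line.startswith("node") and "matrix" not in line:
            # Header row: the numbers after "node" name the columns.
            columns = [int(tok) for tok in line.split()[1:]]
            continue
        row = re.match(r"(\d+):", line)
        if not row or not columns:
            continue  # banner line, trailer line, or anything else
        cells = [tuple(int(v, 16) for v in m.groups())
                 for m in CELL_RE.finditer(line)]
        if len(cells) == len(columns):
            matrix[int(row.group(1))] = dict(zip(columns, cells))
    return matrix


matrix = parse_node_matrix(SAMPLE)
print(matrix[128][128])  # (192, 6, 0): flags 0xc0, state 6, stage 0
```

Here matrix[128][3] is the cell where the node 128 row meets the node 3 column.  The heartbeat count and the trailing line of numbers are deliberately ignored by this sketch.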

The intersection of a node row and a node column is a cell, or tuple, made up of three elements from the Fail Over Fail Back (FOFB) process: flags, state, and software component stage number.  FOFB flags and states are defined in the header file fofb.h.  Values for the flag field bits are:

/* These bitmask flags are over loaded.
* VOTE_ talk about any node, node x thinks of node y, or itself.
* SUSPECT_ talk only about other nodes, node x thinks of node y.
* AM_ talk only about this node, node x thinks about itself. */
#define VOTE_MASTER            0x0080 /* Term 'MASTER' used for Controllers only */
#define VOTE_ACTIVE            0x0008 /* Term 'ACTIVE' is used for Pilots only */
#define SUSPECT_P              0x0040 /* suspect on PMI Network */
#define SUSPECT_F              0x0020 /* suspect Controllers: Fibre Channel; Pilots: serial */
#define AM_MASTER              0x0040 /* again, Controller */
#define AM_CRITICAL            0x0020
#define AM_WARMSTARTING        0x0010
#define AM_ACTIVE              0x0004 /* again, Pilot */
#define AM_DEBUG_STATE         0x0001 /* Node in debug mode; health checks disabled */
#define SW_UPDATE_ACTIVE       0x0002 /* A pilot node is in a software update state */
#define COLD_START_IN_PROGRESS 0x0100 /* Only the master pilot will set this flag */
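
Because several bit values are shared between the SUSPECT_ and AM_ names, decoding a flag value depends on whether the cell describes a node's view of itself.  The Python sketch below applies the header comment mechanically: AM_ names for a node's own cell, SUSPECT_ names for cells about other nodes.  The function decode_flags is hypothetical, and treating SW_UPDATE_ACTIVE and COLD_START_IN_PROGRESS as valid in any cell is an assumption, since the excerpt does not say where they apply.

```python
# Bit values copied from the fofb.h excerpt in this document; they may differ
# in other software releases.
VOTE_FLAGS = {0x0080: "VOTE_MASTER", 0x0008: "VOTE_ACTIVE"}
SUSPECT_FLAGS = {0x0040: "SUSPECT_P", 0x0020: "SUSPECT_F"}
AM_FLAGS = {
    0x0040: "AM_MASTER",
    0x0020: "AM_CRITICAL",
    0x0010: "AM_WARMSTARTING",
    0x0004: "AM_ACTIVE",
    0x0001: "AM_DEBUG_STATE",
}
# ASSUMPTION: these two are not prefixed, so they are checked in every cell.
UNPREFIXED_FLAGS = {0x0002: "SW_UPDATE_ACTIVE", 0x0100: "COLD_START_IN_PROGRESS"}


def decode_flags(value, is_self):
    """Name the bits set in a flag value.

    is_self is True when the cell is a node's view of itself.
    """
    table = {}
    table.update(VOTE_FLAGS)
    table.update(UNPREFIXED_FLAGS)
    table.update(AM_FLAGS if is_self else SUSPECT_FLAGS)
    names = []
    remaining = value
    for bit in sorted(table, reverse=True):
        if value & bit:
            names.append(table[bit])
            remaining &= ~bit
    if remaining:
        # Bits with no name in the chosen table are reported, not dropped.
        names.append("unknown:0x%x" % remaining)
    return names


print(decode_flags(0xC0, is_self=True))   # ['VOTE_MASTER', 'AM_MASTER']
print(decode_flags(0x04, is_self=True))   # ['AM_ACTIVE']
```

The sketch only names bits; it does not interpret what a set bit means for system health.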

Values for the FOFB node state are:

#define FOFB_UNKNOWN         0
#define FOFB_WBR             1
#define FOFB_INIT            2
#define FOFB_INITIALIZING    3
#define FOFB_READY           4
#define FOFB_NOT_READY       5
#define FOFB_NORMAL          6
#define FOFB_SLAVE           7
#define FOFB_FAILING         8
#define FOFB_FAILED          9
#define FOFB_FO             10    /* Node Failover state */
#define FOFB_FB             13    /* Node failback state */
#define FOFB_SHUTDOWN       11
#define FOFB_QUIESCE        14
#define FOFB_CANCEL_QUIESCE 15
#define FOFB_EC_WARMSTART   16
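
For quick reference while reading a matrix, the state values above can be placed in a lookup table.  This Python sketch is illustrative only; state_name is a hypothetical helper, and it assumes the state field in the log is printed in hexadecimal (so that b corresponds to 11).

```python
# State numbers copied from the fofb.h excerpt in this document.
FOFB_STATES = {
    0: "FOFB_UNKNOWN",
    1: "FOFB_WBR",
    2: "FOFB_INIT",
    3: "FOFB_INITIALIZING",
    4: "FOFB_READY",
    5: "FOFB_NOT_READY",
    6: "FOFB_NORMAL",
    7: "FOFB_SLAVE",
    8: "FOFB_FAILING",
    9: "FOFB_FAILED",
    10: "FOFB_FO",
    11: "FOFB_SHUTDOWN",
    13: "FOFB_FB",
    14: "FOFB_QUIESCE",
    15: "FOFB_CANCEL_QUIESCE",
    16: "FOFB_EC_WARMSTART",
}


def state_name(field):
    """Translate the middle field of a cell, as printed in the log, to a name."""
    value = int(field, 16)  # ASSUMPTION: the log prints this field in hex
    return FOFB_STATES.get(value, "undefined (%d)" % value)


print(state_name("6"))  # FOFB_NORMAL
print(state_name("b"))  # FOFB_SHUTDOWN
```

The value 12 has no entry because the excerpt above does not define one.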

 

The mapping of software components to FOFB stages is defined in the pds_components.h header file:

int champ_comp_map[FOFB_MAX_COMPONENTS_TO_MONITOR] = {
  PDS_COMP_DMS,     // FOFB stage: 1
  PDS_COMP_PI,      // FOFB stage: 2
  PDS_COMP_AM,      // FOFB stage: 3
  PDS_COMP_BS,      // FOFB stage: 4
  PDS_COMP_SAN,     // FOFB stage: 5
  PDS_COMP_MFS,     // FOFB stage: 6
  PDS_COMP_VS,      // FOFB stage: 7
  PDS_COMP_XPAL,    // FOFB stage: 8
  PDS_COMP_NFS,     // FOFB stage: 9
  PDS_COMP_NFSAUXD, // FOFB stage: 10
  PDS_COMP_CIFS,    // FOFB stage: 11
  PDS_COMP_SIM,     // FOFB stage: 12
  PDS_COMP_BE,      // FOFB stage: 13
  PDS_COMP_NP,      // FOFB stage: 14
  PDS_COMP_DSC,     // FOFB stage: 15
  PDS_COMP_RMR,     // FOFB stage: 16
  PDS_COMP_FPI,     // FOFB stage: 17
  PDS_COMP_CM       // FOFB stage: 18
  };

 

NOTE: the values and thus meaning of these entries may change over time.  When in doubt, consult the header files for the installed software version of the FS1-2 system.
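
To translate the third field of a cell into a component name, the array above can be indexed by stage number.  The Python sketch below assumes the stage numbers are 1-based, as the comments in the array indicate, and that the log prints the field in hexadecimal like the other fields; neither the helper component_for_stage nor the hexadecimal assumption comes from the header file.

```python
# Order copied from the champ_comp_map excerpt in this document.
CHAMP_COMP_MAP = [
    "PDS_COMP_DMS", "PDS_COMP_PI", "PDS_COMP_AM", "PDS_COMP_BS",
    "PDS_COMP_SAN", "PDS_COMP_MFS", "PDS_COMP_VS", "PDS_COMP_XPAL",
    "PDS_COMP_NFS", "PDS_COMP_NFSAUXD", "PDS_COMP_CIFS", "PDS_COMP_SIM",
    "PDS_COMP_BE", "PDS_COMP_NP", "PDS_COMP_DSC", "PDS_COMP_RMR",
    "PDS_COMP_FPI", "PDS_COMP_CM",
]


def component_for_stage(field):
    """Return the component for a stage field, or None if it maps to nothing."""
    stage = int(field, 16)  # ASSUMPTION: printed in hex like the other fields
    if 1 <= stage <= len(CHAMP_COMP_MAP):
        return CHAMP_COMP_MAP[stage - 1]  # stage numbers start at 1
    return None


print(component_for_stage("1"))  # PDS_COMP_DMS
print(component_for_stage("0"))  # None
```

A stage of 0, which appears in every example in this document, has no entry in the array, so the helper returns None for it.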

Of the three elements in the tuple, the node state in the middle is perhaps the best one to monitor.  In this next example, taken shortly after a system restart was issued, the node states of both Controllers (nodes 128 and 129) are seen by all 4 nodes as "b" (hexadecimal for 11, or FOFB_SHUTDOWN):

pcp:info fofb: node matrix
node           2          3        128        129
  2:   1 (  0  6  0)(  0  6  0)( 20  6  0)( 20  6  0)
  3:   1 (  0  6  0)(  4  6  0)( 20  6  0)( 20  6  0)
128:   1 ( a0  b  0)( a0  b  0)( c0  b  0)( 80  b  0)
129:   1 ( 20  b  0)( 20  b  0)(  0  b  0)(  0  b  0)
255:     (  0  6  0)(  0  6  0)(  0  6  0)(  0  6  0)
  2 -1 0 3 3

When a system is restarted, the Pilots come online first.  Typically, by the time you can monitor a live system, the Pilots will already be at FOFB_NORMAL (a value of 6).  The Controllers can be observed to go through the state sequence 1, 2, 3, 4, 7, 6.  This typically takes 12-15 minutes once the Controllers are seen in the node matrix.  A Controller that hangs at 1 (FOFB_WBR) is likely to require its failure history to be cleared.  See KM Doc 2093580.1, FS System: How to Clear Controller Failure History, for details on clearing the failure history.  A node state of 5 (FOFB_NOT_READY) indicates a likely hardware issue.
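
The boot sequence described above can be checked against a series of state values noted for one Controller over time.  The following Python sketch is only an aid to reading the log, not a diagnostic tool: the function assess_controller_boot and its result strings are hypothetical, and it treats skipped states as acceptable on the assumption that a sampled log can miss short-lived states.

```python
EXPECTED_BOOT_SEQUENCE = [1, 2, 3, 4, 7, 6]  # from the paragraph above


def assess_controller_boot(observed):
    """Classify state values sampled for one Controller, oldest first."""
    # Collapse consecutive repeats so [1, 1, 2] becomes [1, 2].
    steps = [s for i, s in enumerate(observed) if i == 0 or s != observed[i - 1]]
    if not steps:
        return "no-data"
    if 5 in steps:
        return "likely-hardware-issue"  # state 5 (FOFB_NOT_READY)
    # The steps must appear in the expected order; gaps are tolerated.
    remaining = iter(EXPECTED_BOOT_SEQUENCE)
    if not all(step in remaining for step in steps):
        return "unexpected-sequence"
    if steps[-1] == 6:
        return "normal"
    if steps[-1] == 1:
        return "waiting-at-1"  # if this persists, check the failure history
    return "in-progress"


print(assess_controller_boot([1, 1, 2, 3, 4, 7, 6, 6]))  # normal
print(assess_controller_boot([1, 1, 1, 1]))              # waiting-at-1
print(assess_controller_boot([1, 2, 5]))                 # likely-hardware-issue
```

The sketch does not measure elapsed time, so judging whether a Controller has truly hung at 1 still requires comparing against the 12-15 minute window mentioned above.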

Details on which node is active (Pilots) or master (Controllers) can be obtained from the first element or FOFB flag.  In the example below it can be determined that Pilot 2 is active while Controller 1 is the master:

fofb: node matrix
node           3          2        128        129
  3:   1 (  4  6  0)(  0  6  0)( 20  6  0)( 20  6  0)
  2:   1 (  0  6  0)(  0  6  0)( 20  6  0)( 20  6  0)
128:   1 ( a0  6  0)( a0  6  0)( c0  6  0)( 80  6  0)
129:   1 ( 20  6  0)( 20  6  0)(  0  6  0)(  0  6  0)
255:     (  0  6  0)(  0  6  0)(  0  6  0)(  0  6  0)
  1 -1 0 3 3

The tuple formed by the intersection of node 3 with itself has a flag value of 4 (AM_ACTIVE), so node 3 (Pilot 2) is the active Pilot.  The tuple formed by the intersection of node 128 with itself has a flag value of c0 (0x80 + 0x40, or VOTE_MASTER plus AM_MASTER), indicating that node 128 (Controller 1) is the master Controller.
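
The same reasoning can be expressed as a short Python sketch that tests the AM_ACTIVE and AM_MASTER bits in each node's own cell.  The function find_roles and the input dictionary are hypothetical; the node-to-name mapping and the bit values are taken from earlier in this document.

```python
NODE_NAMES = {2: "Pilot 1", 3: "Pilot 2", 128: "Controller 1", 129: "Controller 2"}
AM_ACTIVE = 0x0004  # Pilots only
AM_MASTER = 0x0040  # Controllers only


def find_roles(self_flags):
    """Return (active Pilots, master Controllers).

    self_flags maps a node number to the flag value in the cell where that
    node's row meets its own column.
    """
    active = [NODE_NAMES[n] for n in (2, 3) if self_flags.get(n, 0) & AM_ACTIVE]
    master = [NODE_NAMES[n] for n in (128, 129) if self_flags.get(n, 0) & AM_MASTER]
    return active, master


# Self-view flag values read from the example matrix above.
example = {3: 0x04, 2: 0x00, 128: 0xC0, 129: 0x00}
print(find_roles(example))  # (['Pilot 2'], ['Controller 1'])
```

Lists are returned rather than single names so that an empty or doubled result is visible instead of hidden.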

Finally, there is a series of numbers just below the node 255 row.  Of these numbers, only the first provides reliable information.  It is either 1 or 2, indicating whether the active Pilot (1) or the standby Pilot (2) is the one doing the monitoring.
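
A minimal Python sketch of reading that first number follows.  The helper monitoring_pilot is hypothetical; it looks only at the first number and ignores the rest of the line, in keeping with the guidance above.

```python
def monitoring_pilot(trailer_line):
    """Interpret the first number on the line below the node 255 row."""
    first = int(trailer_line.split()[0])
    roles = {1: "active Pilot", 2: "standby Pilot"}
    return roles.get(first, "unexpected value %d" % first)


print(monitoring_pilot("   1 -1 0 3 3"))  # active Pilot
print(monitoring_pilot("  2 -1 0 3 3"))   # standby Pilot
```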

Additional details on triaging the pcp.log file can be found in KM document 1944574.1, FS System: How to triage system logs and understand Release 6.x system operations, under Triage PACMAN Pilot and FS1 Startup.


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.