Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution

2365134.1 : FS System: Using the Node Matrix in the pcp.log File to Troubleshoot Component Boot Issues
Oracle Confidential PARTNER - Available to partners (SUN).

Applies to:
Oracle FS1-2 Flash Storage System - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Purpose
To document the node matrix in the pcp.log of the FS1-2 storage system and how you can use the information for initial troubleshooting of problems. This document requires that the nodes be seen on the Private Management Interface (PMI).

Troubleshooting Steps
Viewing this file live (using tailf or tail -f) lets you monitor the current status of the system nodes, while looking at the file in a log bundle provides historical data. The pcp.log file is updated at 5 second intervals but the node matrix itself updates every second.

Nodes 2 & 3 represent Pilots 1 & 2, while nodes 128 & 129 represent Controllers 1 & 2 respectively. Node 255 represents the entire system; it can be ignored as it doesn’t provide any valuable information for troubleshooting. The last row is an indication of which nodes have active/master or passive/slave status.

A typical/normal node matrix is shown below. The grep and cut options are used here only to focus on the node matrix itself; using them when troubleshooting an actual problem will likely obscure other important clues:

    [root@pilot2 log]# tailf /var/log/pcp.log | grep -A 7 "node matrix" | cut -c67-
    pcp:info fofb: node matrix
    node           3         2       129       128
      3: 1   (  4 6 0)(  0 6 0)( 20 6 0)( 20 6 0)
      2: 1   (  0 6 0)(  0 6 0)( 20 6 0)( 20 6 0)
    129: 1   ( 20 6 0)( 20 6 0)(  0 6 0)(  0 6 0)
    128: 1   ( a0 6 0)( a0 6 0)( 80 6 0)( c0 6 0)
    255:     (  0 6 0)(  0 6 0)(  0 6 0)(  0 6 0)
    1 -1 0 3 3
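When many snapshots need to be compared, it can help to load a node matrix into a data structure instead of reading it by eye. The following is a minimal sketch, not an Oracle tool: the function name `parse_node_matrix` is hypothetical, the layout is assumed to match the example above after the `cut`, and all three tuple fields are assumed to be hexadecimal (the flag values such as `a0` clearly are).

```python
import re

# One "( flags state stage )" cell. Assumption: all three fields are hex.
CELL_RE = re.compile(r"\(\s*([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s*\)")
# A matrix row starts with a node number followed by a colon.
ROW_RE = re.compile(r"^(\d+):")


def parse_node_matrix(text):
    """Return (columns, matrix); matrix[row][col] = (flags, state, stage).

    If the text holds several matrices, only the last one is kept.
    """
    columns, matrix = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        parts = line.split()
        # Header row: the word "node" followed only by node numbers.
        if len(parts) > 1 and parts[0] == "node" and all(p.isdigit() for p in parts[1:]):
            columns = [int(p) for p in parts[1:]]
            matrix = {}
            continue
        row = ROW_RE.match(line)
        cells = CELL_RE.findall(line)
        if row and columns and len(cells) == len(columns):
            matrix[int(row.group(1))] = {
                col: tuple(int(field, 16) for field in cell)
                for col, cell in zip(columns, cells)
            }
    return columns, matrix


SAMPLE = """\
pcp:info fofb: node matrix
node 3 2 129 128
3: 1 ( 4 6 0)( 0 6 0)( 20 6 0)( 20 6 0)
2: 1 ( 0 6 0)( 0 6 0)( 20 6 0)( 20 6 0)
129: 1 ( 20 6 0)( 20 6 0)( 0 6 0)( 0 6 0)
128: 1 ( a0 6 0)( a0 6 0)( 80 6 0)( c0 6 0)
255: ( 0 6 0)( 0 6 0)( 0 6 0)( 0 6 0)
1 -1 0 3 3
"""

columns, matrix = parse_node_matrix(SAMPLE)
print("columns:", columns)
print("node 128 as seen by itself:", matrix[128][128])
```

The final summary row is skipped on purpose, since it has no node number prefix; it would need separate handling if the first number (the monitoring Pilot) is of interest.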
Rows that begin with a node number show how that node is seen by the other nodes (columns). For the purposes of identification in this document, the entries of the second row are used to indicate columns and are delineated below by vertical bars:

    [root@pilot2 log]# tailf /var/log/pcp.log | grep -A 7 "node matrix" | cut -c67-
    pcp:info fofb: node matrix
    node     |     3     |     2     |    129    |    128
      3: 1   | (  4 6 0) | (  0 6 0) | ( 20 6 0) | ( 20 6 0)
      2: 1   | (  0 6 0) | (  0 6 0) | ( 20 6 0) | ( 20 6 0)
    129: 1   | ( 20 6 0) | ( 20 6 0) | (  0 6 0) | (  0 6 0)
    128: 1   | ( a0 6 0) | ( a0 6 0) | ( 80 6 0) | ( c0 6 0)
    255:     | (  0 6 0) | (  0 6 0) | (  0 6 0) | (  0 6 0)

The intersection of a node row and a node column is a cell, or tuple, composed of three elements in the Fail Over Fail Back (FOFB) process: flags, state and software component stage number. FOFB flags and states are defined in the header file fofb.h. Values for the flag field bits are:

    /* These bitmask flags are over loaded.
     * VOTE_ talk about any node, node x thinks of node y, or itself.
     * SUSPECT_ talk only about other nodes, node x thinks of node y.
     * AM_ talk only about this node, node x thinks about itself.
     */
    #define VOTE_MASTER 0x0080 /* Term 'MASTER' used for Controllers only */
    #define VOTE_ACTIVE 0x0008 /* Term 'ACTIVE' is used for Pilots only */
    #define SUSPECT_P 0x0040 /* suspect on PMI Network */
    #define SUSPECT_F 0x0020 /* suspect Controllers: Fibre Channel; Pilots: serial */
    #define AM_MASTER 0x0040 /* again, Controller */
    #define AM_CRITICAL 0x0020
    #define AM_WARMSTARTING 0x0010
    #define AM_ACTIVE 0x0004 /* again, Pilot */
    #define AM_DEBUG_STATE 0x0001 /* Node in debug mode; health checks disabled */
    #define SW_UPDATE_ACTIVE 0x0002 /* A pilot node is in a software update state */
    #define COLD_START_IN_PROGRESS 0x0100 /* Only the master pilot will set this flag */

Values for the FOFB node state are:
    #define FOFB_UNKNOWN 0
    #define FOFB_WBR 1
    #define FOFB_INIT 2
    #define FOFB_INITIALIZING 3
    #define FOFB_READY 4
    #define FOFB_NOT_READY 5
    #define FOFB_NORMAL 6
    #define FOFB_SLAVE 7
    #define FOFB_FAILING 8
    #define FOFB_FAILED 9
    #define FOFB_FO 10 /* Node Failover state */
    #define FOFB_FB 13 /* Node failback state */
    #define FOFB_SHUTDOWN 11
    #define FOFB_QUIESCE 14
    #define FOFB_CANCEL_QUIESCE 15
    #define FOFB_EC_WARMSTART 16
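Because the flag bits are overloaded (0x0040 and 0x0020 each have two meanings), decoding a tuple by hand is error-prone. The sketch below copies the values listed above into lookup tables and decodes one tuple. It rests on two assumptions that should be checked against fofb.h for the installed release: that AM_ names apply when a node's row meets its own column and SUSPECT_ names apply otherwise, and that SW_UPDATE_ACTIVE and COLD_START_IN_PROGRESS can appear in any cell. The function names are hypothetical.

```python
# Values copied from the fofb.h excerpts quoted in this document.
VOTE_FLAGS = {0x0080: "VOTE_MASTER", 0x0008: "VOTE_ACTIVE"}
SUSPECT_FLAGS = {0x0040: "SUSPECT_P", 0x0020: "SUSPECT_F"}
AM_FLAGS = {
    0x0040: "AM_MASTER",
    0x0020: "AM_CRITICAL",
    0x0010: "AM_WARMSTARTING",
    0x0004: "AM_ACTIVE",
    0x0001: "AM_DEBUG_STATE",
}
# Assumption: these two flags are not tied to the self/other distinction.
OTHER_FLAGS = {0x0002: "SW_UPDATE_ACTIVE", 0x0100: "COLD_START_IN_PROGRESS"}

FOFB_STATES = {
    0: "FOFB_UNKNOWN", 1: "FOFB_WBR", 2: "FOFB_INIT", 3: "FOFB_INITIALIZING",
    4: "FOFB_READY", 5: "FOFB_NOT_READY", 6: "FOFB_NORMAL", 7: "FOFB_SLAVE",
    8: "FOFB_FAILING", 9: "FOFB_FAILED", 10: "FOFB_FO", 11: "FOFB_SHUTDOWN",
    13: "FOFB_FB", 14: "FOFB_QUIESCE", 15: "FOFB_CANCEL_QUIESCE",
    16: "FOFB_EC_WARMSTART",
}


def decode_flags(flags, is_self):
    """Name each set bit, highest bit first.

    is_self=True uses the AM_ meanings (a node describing itself);
    is_self=False uses the SUSPECT_ meanings (one node describing another).
    """
    table = dict(VOTE_FLAGS)
    table.update(OTHER_FLAGS)
    table.update(AM_FLAGS if is_self else SUSPECT_FLAGS)
    names = []
    bit = 0x8000
    while bit:
        if flags & bit:
            names.append(table.get(bit, "UNKNOWN_0x%04x" % bit))
        bit >>= 1
    return names


def state_name(state):
    """Map a numeric node state to its FOFB_ name."""
    return FOFB_STATES.get(state, "UNDEFINED_%d" % state)


# The matrix prints values in hex, so convert before decoding.
print(decode_flags(int("c0", 16), is_self=True))
print(decode_flags(int("a0", 16), is_self=False))
print(state_name(int("b", 16)))
```

Keeping the self and other tables separate mirrors the comment in the header: the same bit is read differently depending on whether the cell sits on the diagonal of the matrix.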
The mapping of components to FOFB stages is defined in the pds_components.h header file:

    int champ_comp_map[FOFB_MAX_COMPONENTS_TO_MONITOR] = {
NOTE: the values and thus meaning of these entries may change over time. When in doubt, consult the header files for the installed software version of the FS1-2 system.
Of the three elements in the tuple, the node state in the middle is perhaps the best one to monitor. In this next example, shortly after a system restart was issued, the node states of both Controllers (nodes 128 and 129) are seen by all 4 nodes as "b" (hexadecimal for 11, FOFB_SHUTDOWN):

    pcp:info fofb: node matrix
    node           2         3       128       129
      2: 1   (  0 6 0)(  0 6 0)( 20 6 0)( 20 6 0)
      3: 1   (  0 6 0)(  4 6 0)( 20 6 0)( 20 6 0)
    128: 1   ( a0 b 0)( a0 b 0)( c0 b 0)( 80 b 0)
    129: 1   ( 20 b 0)( 20 b 0)(  0 b 0)(  0 b 0)
    255:     (  0 6 0)(  0 6 0)(  0 6 0)(  0 6 0)
    2 -1 0 3 3

When a system is restarted, the Pilots come online first. Typically, by the time you can monitor a live system, the Pilots will already be FOFB_NORMAL (a value of 6). The Controllers can be observed to go through the sequence of 1, 2, 3, 4, 7, 6. This typically takes 12-15 minutes once the Controllers are seen in the node matrix. A Controller that hangs at 1 is likely to require its failure history to be cleared. See KM Doc 2093580.1 FS System: How to Clear Controller Failure History for details on clearing the failure history. A node state with a value of 5 indicates a likely hardware issue.

Details on which node is active (Pilots) or master (Controllers) can be obtained from the first element, the FOFB flag. In the example below it can be determined that Pilot 2 is active while Controller 1 is the master:

    fofb: node matrix
    node           3         2       128       129
      3: 1   (  4 6 0)(  0 6 0)( 20 6 0)( 20 6 0)
      2: 1   (  0 6 0)(  0 6 0)( 20 6 0)( 20 6 0)
    128: 1   ( a0 6 0)( a0 6 0)( c0 6 0)( 80 6 0)
    129: 1   ( 20 6 0)( 20 6 0)(  0 6 0)(  0 6 0)
    255:     (  0 6 0)(  0 6 0)(  0 6 0)(  0 6 0)
    1 -1 0 3 3

The tuple formed by the intersection of node 3 with itself has the flag set to 4 (AM_ACTIVE), so node 3 is the active Pilot. The tuple formed by the intersection of node 128 with itself has the flag set to c0 (80+40, VOTE_MASTER plus AM_MASTER), indicating that it is the master Controller.

Finally, there is a series of numbers just below the node 255 row. Of these numbers, only the first provides reliable information. It will be either 1 or 2, indicating whether the active Pilot (1) or the standby Pilot (2) is the one doing the monitoring.

Additional details on triaging the pcp.log file can be found in KM document 1944574.1 FS System: How to triage system logs and understand Release 6.x system operations, under Triage PACMAN Pilot and FS1 Startup.

Attachments
This solution has no attachment
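The checks described in the troubleshooting steps (which Pilot is active, which Controller is master, and which node states deserve attention) can be sketched as a few small functions. This is an illustration only, under stated assumptions: the matrix is a nested dictionary of the form `matrix[row_node][column_node] = (flags, state, stage)` with values already converted from hex, each node's own view of itself is used as the reference cell, and all function names are hypothetical. A single snapshot cannot show that a Controller is hanging at state 1, so the hint for that state is conditional.

```python
AM_ACTIVE = 0x0004    # from fofb.h as quoted in this document
AM_MASTER = 0x0040
VOTE_MASTER = 0x0080

PILOTS = (2, 3)           # Pilots 1 and 2
CONTROLLERS = (128, 129)  # Controllers 1 and 2


def active_pilot(matrix):
    """Return the Pilot node whose self cell has AM_ACTIVE set, else None."""
    for node in PILOTS:
        if node in matrix and matrix[node][node][0] & AM_ACTIVE:
            return node
    return None


def master_controller(matrix):
    """Return the Controller whose self cell shows c0 (VOTE_MASTER + AM_MASTER)."""
    wanted = VOTE_MASTER | AM_MASTER
    for node in CONTROLLERS:
        if node in matrix and matrix[node][node][0] & wanted == wanted:
            return node
    return None


def state_hints(matrix):
    """Map node -> hint for the states this document calls out (1 and 5)."""
    hints = {}
    for node in PILOTS + CONTROLLERS:
        if node not in matrix:
            continue
        state = matrix[node][node][1]
        if state == 5:
            hints[node] = "state 5 (FOFB_NOT_READY): likely hardware issue"
        elif state == 1 and node in CONTROLLERS:
            hints[node] = ("state 1 (FOFB_WBR): if the Controller stays here, "
                           "its failure history may need clearing (KM Doc 2093580.1)")
    return hints


# The normal matrix from the last example, keyed as matrix[row][column].
NORMAL = {
    3:   {3: (0x04, 6, 0), 2: (0x00, 6, 0), 128: (0x20, 6, 0), 129: (0x20, 6, 0)},
    2:   {3: (0x00, 6, 0), 2: (0x00, 6, 0), 128: (0x20, 6, 0), 129: (0x20, 6, 0)},
    128: {3: (0xa0, 6, 0), 2: (0xa0, 6, 0), 128: (0xc0, 6, 0), 129: (0x80, 6, 0)},
    129: {3: (0x20, 6, 0), 2: (0x20, 6, 0), 128: (0x00, 6, 0), 129: (0x00, 6, 0)},
}

print("active Pilot node:", active_pilot(NORMAL))
print("master Controller node:", master_controller(NORMAL))
print("hints:", state_hints(NORMAL))
```

For the normal matrix this reports node 3 (Pilot 2) as active and node 128 (Controller 1) as master, with no hints, which matches the reading given in the text.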