Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1948040.1
Update Date:2016-04-28
Keywords:

Solution Type  Problem Resolution Sure

Solution  1948040.1 :   FS System: Pilots in FS1-2 May Reboot If System Date and Time Are Changed  


Related Items
  • Oracle FS1-2 Flash Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-9903511042>

Applies to:

Oracle FS1-2 Flash Storage System - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

Changing the source from which the Pilots obtain their time can confuse the Pilots and cause them to reboot.  In this example, a customer disabled the NTP (Network Time Protocol) setting, manually set the system time one hour ahead, and then re-enabled NTP.

NOTE: Some of the logs shown below must be processed by Oracle Support before they appear in this form.  Contact Oracle Support if you cannot see the details as shown.

Event Log entries to look for:

2014-11-20T20:15:33.503  NTP_SETTINGS_MODIFIED                  INFORMATIONAL  ComponentName=/NtpSettings
2014-11-20T20:16:00.486  PCP_EVT_NTP_SERVER_NOT_RESPONDING      WARNING
2014-11-20T20:16:00.490  PCP_EVT_NTP_SYNCHRONIZATION_FAILED     WARNING
2014-11-20T20:16:10.503  NTP_SETTINGS_MODIFIED                  INFORMATIONAL  ComponentName=/NtpSettings
2014-11-20T20:19:43.996  PCP_EVT_ALL_NTP_SERVERS_RESPONDING     INFORMATIONAL
2014-11-20T20:19:44.007  PCP_EVT_NTP_SYNCHRONIZATION_RESTORED   INFORMATIONAL
2014-11-20T20:19:44.208  PCP_EVT_PILOT_FAILED_OVER              WARNING        ComponentName=/PILOT-2
2014-11-20T20:19:46.055  PCP_EVT_SYSTEM_STATE_CHANGED           INFORMATIONAL  currentSystemState=NORMAL, previousSystemState=PILOT_FAILBACK_IN_PROGRESS
2014-11-20T20:20:00.216  PCP_EVT_FOUND_PILOT_CORE_FILE          INFORMATIONAL
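
The pattern in the event log above (an NTP settings change followed shortly by a Pilot failover) can be checked for mechanically. The following Python sketch is illustrative only: it assumes the event log has already been exported as whitespace-separated text in the form shown above, and the function names and the 10-minute window are assumptions, not part of any Oracle tool.

```python
from datetime import datetime, timedelta

# Event lines copied from the example above.
EVENT_LOG = """\
2014-11-20T20:15:33.503  NTP_SETTINGS_MODIFIED                  INFORMATIONAL  ComponentName=/NtpSettings
2014-11-20T20:16:00.486  PCP_EVT_NTP_SERVER_NOT_RESPONDING      WARNING
2014-11-20T20:16:00.490  PCP_EVT_NTP_SYNCHRONIZATION_FAILED     WARNING
2014-11-20T20:16:10.503  NTP_SETTINGS_MODIFIED                  INFORMATIONAL  ComponentName=/NtpSettings
2014-11-20T20:19:43.996  PCP_EVT_ALL_NTP_SERVERS_RESPONDING     INFORMATIONAL
2014-11-20T20:19:44.007  PCP_EVT_NTP_SYNCHRONIZATION_RESTORED   INFORMATIONAL
2014-11-20T20:19:44.208  PCP_EVT_PILOT_FAILED_OVER              WARNING        ComponentName=/PILOT-2
2014-11-20T20:19:46.055  PCP_EVT_SYSTEM_STATE_CHANGED           INFORMATIONAL  currentSystemState=NORMAL, previousSystemState=PILOT_FAILBACK_IN_PROGRESS
2014-11-20T20:20:00.216  PCP_EVT_FOUND_PILOT_CORE_FILE          INFORMATIONAL
"""


def parse_events(text):
    """Yield (timestamp, event_name, severity) for each event line."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ts = datetime.strptime(fields[0], "%Y-%m-%dT%H:%M:%S.%f")
        yield ts, fields[1], fields[2]


def failovers_after_ntp_change(events, window=timedelta(minutes=10)):
    """Return (ntp_change_time, failover_time) pairs.

    A pair is reported when a Pilot failover follows the most recent
    NTP settings change within `window` (an assumed threshold).
    """
    last_change = None
    hits = []
    for ts, name, _severity in events:
        if name == "NTP_SETTINGS_MODIFIED":
            last_change = ts
        elif name == "PCP_EVT_PILOT_FAILED_OVER" and last_change is not None:
            if timedelta(0) <= ts - last_change <= window:
                hits.append((last_change, ts))
    return hits


for changed, failed in failovers_after_ntp_change(parse_events(EVENT_LOG)):
    print(f"NTP settings changed at {changed}, Pilot failed over at {failed}")
```

A match only suggests this issue; the Pilot logs still need to be reviewed to confirm it.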

 

The Pilot logs also contain important indications of the issue:

  1. The user disabled the external NTP servers and set the internal clock, which changed the time from 20:16:11 to 21:15:49:

    2014-11-20 20:16:11.032 pilot2 pilotcfgproc: 154264 17938 MemAlloc::getMemoryFromPool() 0x7fc7e82c7010 0x7fc7e83999e0 1
    2014-11-20 20:16:11.032 pilot2 pilotcfgproc: 154265 17938 getMsg() 0x7fc7e82e4c78 0x7fc7e82e4598 0x7fc7e83999e0 12
    2014-11-20 20:16:11.032 pilot2 pilotcfgproc: 154266 17938 pmi:info AllocMsg() 0x15000f 0x80000000546e2fd1 0x7fc7e82e4c78 0x7fc7e82e4598 0x7fc7e83999e0 12
    2014-11-20 20:16:11.032 pilot2 pilotcfgproc: 154267 17938 pcp:info Searching for message handler for 15000f type [4]. There are currently 55 registered.
    2014-11-20 20:16:11.032 pilot2 pilotcfgproc: 154268 17938 pcp:debug Searching for status handler for 15000f. There are currently 10 registered.
    2014-11-20 20:16:11.032 pilot2 pilotcfgproc: 154269 17938 pcp:info PCP_MSG_SET_SYSTEM_TIME Time set on the passive pilot
    2014-11-20 21:15:49.497 pilot2 pilotcfgproc: 154270 17939 pds_bulk_select_s:poll() sockfd 5 events 0x3 revents 0x0 zero 0x0
    2014-11-20 21:15:49.497 pilot2 pilotcfgproc: 154271 17939 pds_bulk_select_s:poll() sockfd 20 events 0x3 revents 0x1 zero 0x0
    2014-11-20 21:15:49.497 pilot2 pilotcfgproc: 154272 17939 pds_bulk_select_s:poll() sockfd 21 events 0x3 revents 0x0 zero 0x0
    2014-11-20 21:15:49.497 pilot2 pilotcfgproc: 154273 17939 pds_bulk_select_s:poll() sockfd 18 events 0x3 revents 0x0 zero 0x0
    2014-11-20 21:15:49.497 pilot2 pilotcfgproc: 154274 17939 pds_bulk_select_s:poll() sockfd 29 events 0x3 revents 0x0 zero 0x0
    2014-11-20 21:15:49.497 pilot2 pilotcfgproc: 154275 17939 bcd 0x1558910 nobuffer last data sent 3579s ago
    2014-11-20 21:15:49.497 pilot2 pilotcfgproc: 154276 17939 rd():bcd 0x1558910buf 0x7fc7e3ffeb10 len 24 flags 0x100
      

  2. The user then re-enabled the NTP server, which put the clock back an hour:

    2014-11-20 21:17:44.500 pilot2 pilotcfgproc: 157608 17977 pcp:info PCP_MSG_SET_PILOT_NTP_SERVER complete
    ...
    2014-11-20 21:17:51.835 pilot2 pilotcfgproc: 157894 17995 pcp:info fofb hb 4440 node-129 State=6 flags=0
    2014-11-20 21:17:51.914 pilot2 pilotcfgproc: 157895 17997 pcp:debug Data read in from serial port:<<<PILOT_ONE****NODE_PASSIVE*CMD_NOOP*****>>>
    2014-11-20 21:17:52.291 pilot2 pilotcfgproc: 157896 17996 pcp:debug SystemState::threadLoop() 3 8
    2014-11-20 21:17:52.291 pilot2 pilotcfgproc: 157897 17996 pcp:debug SystemState is NORMAL.
    2014-11-20 20:18:15.728 pilot2 pilotcfgproc: 157898 17932 pcp:info Received heartbeat from PACMAN.
    2014-11-20 20:18:15.869 pilot2 pilotcfgproc: 157899 17995 pcp:info fofb hb 4446 node-128 State=6 flags=c0
    2014-11-20 20:18:15.869 pilot2 pilotcfgproc: 157900 17995 pcp:info Heartbeat sent 1503 3
    2014-11-20 20:18:15.869 pilot2 pilotcfgproc: 157901 17995 pcp:info fofb: node matrix
    ...
    2014-11-20 20:18:48.250 pilot2 pilotcfgproc: 158277 18003 pcp:warning Pacman has not heartbeat in 30 seconds!
    2014-11-20 20:18:48.250 pilot2 pilotcfgproc: 158278 18003 pcp:warning Failing over to passive pilot as Pacman is not running
    2014-11-20 20:18:48.250 pilot2 pilotcfgproc: 158279 18003 pilot_logging:forcing a log rollover
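
The two clock changes shown in the Pilot log excerpts above appear as large differences between the timestamps of consecutive log records. The Python sketch below, offered only as an illustration, scans processed Pilot log lines for such jumps. It assumes the line layout shown above (date, time, host, process, sequence number); the function name and the 5-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta

# A few lines taken from the excerpts above.
SAMPLE = [
    "2014-11-20 20:16:11.032 pilot2 pilotcfgproc: 154269 17938 pcp:info PCP_MSG_SET_SYSTEM_TIME Time set on the passive pilot",
    "2014-11-20 21:15:49.497 pilot2 pilotcfgproc: 154270 17939 pds_bulk_select_s:poll() sockfd 5 events 0x3 revents 0x0 zero 0x0",
    "...",
    "2014-11-20 21:17:52.291 pilot2 pilotcfgproc: 157897 17996 pcp:debug SystemState is NORMAL.",
    "2014-11-20 20:18:15.728 pilot2 pilotcfgproc: 157898 17932 pcp:info Received heartbeat from PACMAN.",
]


def find_clock_jumps(lines, threshold=timedelta(minutes=5)):
    """Return (prev_seq, seq, delta) for suspicious timestamp changes.

    A change is reported when the timestamp moves backwards, or moves
    forwards by more than `threshold` (an assumed value), between two
    consecutive parsed records.
    """
    jumps = []
    prev = None  # (sequence number, timestamp) of the previous record
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # elision markers and blank lines
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1],
                                   "%Y-%m-%d %H:%M:%S.%f")
            seq = int(parts[4])
        except ValueError:
            continue  # not a log record in the expected layout
        if prev is not None:
            delta = ts - prev[1]
            if delta < timedelta(0) or delta > threshold:
                jumps.append((prev[0], seq, delta))
        prev = (seq, ts)
    return jumps


for prev_seq, seq, delta in find_clock_jumps(SAMPLE):
    direction = "back" if delta < timedelta(0) else "forward"
    print(f"records {prev_seq} -> {seq}: clock moved {direction} by {abs(delta)}")
```

On a full log, long idle periods could also exceed the threshold, so any forward jump reported this way should be checked against the surrounding records.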
      

 

Cause

This is a known issue in which setting the clock back in time causes a Pilot core dump.  Because the Pilots are not part of the data path, host access to the data stored on the FS1-2 system is not affected.

The issue is documented in the Oracle FS1-2 Flash Storage System Customer Release Notes:

[18006920] When the system time is changed and the new time is before the previously set time, the Pilot nodes might become unresponsive and then restart.

 

Solution

No fix is available for this issue in releases earlier than 6.2.0 (see the update below).  On those releases, the only way to avoid the problem is not to set the system time to an earlier value.

 

 

Bug 18006920 is fixed in release 6.2.0, but the fix addresses only one issue. The fix was a side effect of moving from Java 7 (used by 6.1) to Java 8 (used by 6.2). The older Java decided when to wake up from a Java sleep() call by using the current time, which the Dev manager calls wall clock time.

If you set the clock back (manually or by enabling NTP), Java would never wake up, because it would be trying to wake up before it went to sleep.

You can also trigger this on 6.1 by setting the clock forward, since that also changes the time at which Java wakes from a sleep.
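
The difference between a wake-up deadline based on wall clock time and one based on a monotonic clock can be illustrated with a small simulation. This Python sketch is a simplified model under stated assumptions, not the JVM's actual sleep() implementation: the FakeClocks class and all values are hypothetical, and it models only the case where the clock is set back.

```python
class FakeClocks:
    """Two simulated clocks, in seconds.

    `wall` can be stepped by an administrator or by NTP.
    `monotonic` only ever moves forward with elapsed time.
    """

    def __init__(self, wall=0.0):
        self.wall = wall
        self.monotonic = 0.0

    def advance(self, seconds):
        """Real time passes: both clocks move forward together."""
        self.wall += seconds
        self.monotonic += seconds

    def step_wall(self, offset):
        """The wall clock is reset; the monotonic clock is unaffected."""
        self.wall += offset


def seconds_until_wakeup(deadline, now):
    """Time a sleeper still has to wait, given its deadline and clock."""
    return max(0.0, deadline - now)


clocks = FakeClocks(wall=1000.0)
sleep_for = 10.0

# Both sleepers compute their deadline when the sleep starts.
wall_deadline = clocks.wall + sleep_for
mono_deadline = clocks.monotonic + sleep_for

clocks.advance(5.0)        # 5 seconds of the sleep elapse
clocks.step_wall(-3600.0)  # the clock is then set back one hour

wall_remaining = seconds_until_wakeup(wall_deadline, clocks.wall)
mono_remaining = seconds_until_wakeup(mono_deadline, clocks.monotonic)

print(f"wall-clock sleeper still waits {wall_remaining} s")
print(f"monotonic sleeper still waits {mono_remaining} s")
```

In this model, the sleeper that uses a wall clock deadline waits roughly an extra hour after the clock is set back, while the sleeper that uses a monotonic clock is unaffected. In real programs the monotonic sources are, for example, time.monotonic() in Python and System.nanoTime() in Java.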

The heartbeats between PacMan [Java] and PCP [C++] are driven by that Java sleep() in PacMan.

If PacMan does not send a heartbeat to PCP (within the same Pilot) at least once every 30 seconds, PCP kills that Pilot and logs the warning "pcp:warning Pacman has not heartbeat in 30 seconds!"
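
The 30-second heartbeat rule can be sketched as a simple watchdog. The Python simulation below is only an illustration of the rule as described above. The real PCP is written in C++, and the class name, function name, one-second check interval, and 120-second horizon are all assumptions.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds, taken from the log message above


class HeartbeatWatchdog:
    """Tracks the time of the last heartbeat and reports when it is overdue."""

    def __init__(self, now, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_beat = now

    def beat(self, now):
        """Record a heartbeat received at time `now`."""
        self.last_beat = now

    def expired(self, now):
        """True if more than `timeout` seconds have passed since the last beat."""
        return now - self.last_beat > self.timeout


def first_expiry(beat_times, check_every=1.0, horizon=120.0):
    """Simulate the watchdog and return the first time it expires, or None.

    `beat_times` are the simulated times (in seconds) at which heartbeats
    arrive; the watchdog is polled every `check_every` seconds.
    """
    watchdog = HeartbeatWatchdog(now=0.0)
    pending = sorted(beat_times)
    t = 0.0
    while t <= horizon:
        # Deliver every heartbeat that has arrived by time t.
        while pending and pending[0] <= t:
            watchdog.beat(pending.pop(0))
        if watchdog.expired(t):
            return t
        t += check_every
    return None


healthy = [10.0 * i for i in range(1, 13)]  # a heartbeat every 10 seconds
stalled = [10.0, 20.0]                      # heartbeats stop after 20 seconds

print("healthy sender, first expiry:", first_expiry(healthy))
print("stalled sender, first expiry:", first_expiry(stalled))
```

A sender whose sleep never returns behaves like the stalled case: once the heartbeats stop, the watchdog expires shortly after the 30-second limit.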

 

 


 

References

<BUG:18006920> - PILOTS RESTARTED WHEN SYSTEM DATE/TIME WAS CHANGED - BUILD 683
<BUG:20079499> - FS1 PILOT CORES FOUND, SOFTWARE UPDATE 8 HOURS PRIOR

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.