Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
2385833.1 : Oracle ZFS Storage Appliance: When Starting Oracle Databases after doing an OS Upgrade on a Cluster, we receive many "lost lock" errors

Solution Type: Problem Resolution
Created from <SR 3-15972116521>

Applies to:
Oracle ZFS Storage ZS5-2 - Version All Versions and later
Oracle ZFS Storage ZS5-4 - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS3-2 - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms
When starting Oracle Databases after an OS upgrade on the cluster, many "lost lock" errors were received and the databases then crashed. The issue occurred on every one of the ZFSSA clusters during each of the last three upgrades. A failover and failback of the cluster nodes appears to trigger the issue.
Changes
The upgrades included a live upgrade of the LDOMs and an appliance-kit (AK) upgrade of the ZFS appliance that serves the LDOMs. The problem was seen on multiple upgrades and did not appear to be tied to any particular OS or AK update version.
Cause
Examples of errors received on clients (from eimt_dB_errors_due_to_lost_loscks.txt), events starting Thu Oct 19 18:01:59 2017.
Examples of lost-lock errors on the client LDOMs:

messages.0:Oct 19 20:54:11 [db client] klmops: [ID 424047 kern.notice] NOTICE: lockd: pid 43108 lost lock on server (server name)
We also saw statd events, e.g.:

Oct 19 18:43:41 [db client] statd[10117]: [ID 652648 daemon.notice] Received SM_NOTIFY (server name) from [server name] (IP)
Oct 19 18:43:41 [db client] klmops: [ID 424047 kern.notice] NOTICE: lockd: pid 10907 lost lock on server [server name]
Oct 19 18:43:41 [db client] last message repeated 3 times
...
messages:Oct 19 18:46:33 [db client] Oracle GoldenGate Capture for Oracle[15377]: [ID 702911 user.error] 2017-10-19 18:46:33 ERROR OGG-00664 Oracle GoldenGate Capture for Oracle, ODISC.prm: OCI Error beginning session (status = 1034-ORA-01034: ORACLE not available)
messages:Oct 19 18:46:33 [db client] ORA-27101: shared memory realm does not exist
We also saw other database errors:

Oct 19 15:12:13 [db client] Oracle GoldenGate Delivery for Oracle[88734]: [ID 702911 user.error] 2017-10-19 15:12:13 ERROR OGG-00664 Oracle GoldenGate Delivery for Oracle, DEL_CCB1.prm: OCI Error calling OCITransCommit (status = 3114-ORA-03114: not connected to ORACLE).
Oct 19 15:12:13 [db client] Oracle GoldenGate Delivery for Oracle[88734]: [ID 702911 user.error] 2017-10-19 15:12:13 ERROR OGG-00664 Oracle GoldenGate Delivery for Oracle, DEL_CCB1.prm: OCI Error calling OCITransRollback (status = 3114-ORA-03114: not connected to ORACLE).
Oct 19 15:12:13 [db client] Oracle GoldenGate Delivery for Oracle[88734]: [ID 702911 user.error] 2017-10-19 15:12:13 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, DEL_CCB1.prm: PROCESS ABENDING.
Oct 19 15:26:00 [db client] nfs: [ID 333984 kern.notice] NFS server [server name] not responding still trying
Oct 19 15:33:13 [db client] statd[30378]: [ID 652648 daemon.notice] Received SM_NOTIFY (server name) from [server name] (IP)
Oct 19 15:33:21 [db client] nfs: [ID 563706 kern.notice] NFS server [server name] ok
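When diagnosing this on a client, the log excerpts above can be scanned programmatically. A minimal sketch, using a hypothetical stand-in log (the host, server, and IP names below are placeholders; point LOG at the real /var/adm/messages on the db client):

```shell
#!/bin/sh
# Scan a Solaris client's messages file for the lost-lock signature.
# The sample log is a hypothetical stand-in built from the entries above.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Oct 19 18:43:41 dbclient statd[10117]: [ID 652648 daemon.notice] Received SM_NOTIFY (zfssa-head1) from zfssa-head1 (10.0.0.5)
Oct 19 18:43:41 dbclient klmops: [ID 424047 kern.notice] NOTICE: lockd: pid 10907 lost lock on server zfssa-head1
Oct 19 15:26:00 dbclient nfs: [ID 333984 kern.notice] NFS server zfssa-head1 not responding still trying
Oct 19 15:33:21 dbclient nfs: [ID 563706 kern.notice] NFS server zfssa-head1 ok
EOF

# Count each class of event; correlating their timestamps with the
# appliance takeover window shows whether a failover invalidated the locks.
lost=$(grep -c 'lost lock on server' "$LOG")
notify=$(grep -c 'Received SM_NOTIFY' "$LOG")
echo "lost-lock events: $lost"
echo "SM_NOTIFY events: $notify"
rm -f "$LOG"
```

SM_NOTIFY entries mark the server-side lock-state reset after takeover, so lost-lock entries clustered just after an SM_NOTIFY point at the failover rather than at the client OS upgrade itself.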
[LDOM] patch+pkg history for 10/19:

2017-10-19T15:52:04 refresh-publishers pkg Succeeded
2017-10-19T15:52:04 sync-linked pkg Canceled
2017-10-19T15:52:17 rebuild-image-catalogs pkg Succeeded
2017-10-19T15:53:26 sync-linked pkg Canceled
2017-10-19T15:57:16 sync-linked pkg Succeeded
The problem appears to occur whenever the local zones boot the new alternate boot environment (ABE), and also whenever the appliance fails over. The latter is expected, since locks must be relinquished when a pool fails over to the peer node, and fail-overs are performed as part of the ZFS appliance upgrade process. No errors or events were seen on the appliance side, and no resource or pool imports took an extended time; the longest takeover was 6 seconds.
We identified that the LDOM ABEs were being created before the appliance was upgraded: the clients were stopped, the appliance was upgraded, and the LDOMs were then rebooted onto their new ABEs, at which point the issue appeared.
Solution
Recommendations were to either create the LDOM ABE after the ZFSSA upgrade, or boot the new ABE prior to the ZFSSA upgrade.
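The recommended ordering can be sketched as a dry-run script. This is a hypothetical outline only: the `step` helper just echoes what would be run, and the boot-environment name (`newBE`) and commands shown on the LDOM side are illustrative placeholders, not a confirmed procedure from this SR.

```shell
#!/bin/sh
# Dry-run sketch of the recommended upgrade ordering (hypothetical names).
# Each step only echoes the action, so the sequence can be reviewed safely.
step() { echo "STEP: $*"; }

# 1. Upgrade the ZFSSA cluster first; the takeover/failback (and the
#    associated NFS lock relinquishment) happens at this point.
step "zfssa: apply AK update to both heads, fail over and fail back"

# 2. Only then create and activate the new boot environment on each LDOM,
#    so the reboot onto the ABE does not coincide with the failover.
step "ldom: pkg update --be-name newBE   # creates the ABE post-upgrade"
step "ldom: beadm activate newBE"
step "ldom: init 6                        # reboot onto the new ABE"
```

The alternative from the Solution above is the mirror image: boot the LDOMs onto their new ABE first, then perform the ZFSSA upgrade, so the two disruptive events are never stacked in the same window.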
References
<NOTE:2298786.1> - Processes receive SIGLOST (lost lock) upon SPARC OVM guest migration for NFSv3 mounts with a NetApp NFS server
<NOTE:1622379.1> - Terminating the instance due to error 471: Out-Of-Memory (OOM) Killer Crashes Oracle Database
<NOTE:1541158.1> - Instance Terminated By PMON After DBWR Failed With SIGSEGV on [SSKGDS_SNM()+592]
<NOTE:1996984.1> - Bug 17484923: Unrecoverable Error ORA-15188 Raised in ASM I/O Path
<NOTE:1005087.1> - Pros and Cons of Using UFS Direct I/O for Databases

Attachments: This solution has no attachment.