Asset ID: 1-75-1492780.1
Update Date: 2017-07-21
Solution Type: Troubleshooting Sure
Solution 1492780.1: Troubleshooting Guide For NFSv4 File Lock & Hang Issues On Exalogic Linux Environments
Related Items
- Exalogic Elastic Cloud X5-2 Eighth Rack
- Oracle Exalogic Elastic Cloud Software
Related Categories
- PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core
Applies to:
Oracle Exalogic Elastic Cloud Software - Version 1.0.0.0.0 and later
Exalogic Elastic Cloud X5-2 Eighth Rack
Linux x86-64
Purpose
This document provides detailed steps for troubleshooting and debugging NFSv4 file lock and hang issues on Exalogic Linux environments, both physical and virtual. Hang issues typically show up as commands such as ls, ps, or df hanging on an NFS share. File lock issues typically show up as WebLogic, Java, or other application processes (Siebel, SOA, and so on) failing to start with errors such as “unable to obtain file lock”.
The majority of these NFSv4 hang and file lock issues have one or more of the following causes:
- Recommended configurations not in place, such as the IPMP settings and delegation settings on the storage appliance
- Improper NIS/LDAP configuration
- NTP configuration issues, such as NTP not being synced across compute nodes/vServers and ZFS heads
- NFSv4 misconfigurations
- Network issues
- Running older versions of Exalogic PSUs which have known issues
Troubleshooting Steps
Symptoms
The following symptoms are commonly observed during NFSv4 file lock and hang issues.
- Startup of WebLogic, Java, and other application processes fails with “unable to obtain file lock” errors. The following are two of the common file lock errors seen when starting the processes. There may be other file lock related errors as well.
Error 1:
<Mar 17, 2013 1:42:05 PM EET> <Info> <Security> <BEA-090906> <Changing the default Random Number Generator in RSA CryptoJ from ECDRBG to FIPS186PRNG. To disable this change, specify -Dweblogic.security.allowCryptoJDefaultPRNG=true>
<Mar 17, 2013 1:42:05 PM EET> <Info> <WebLogicServer> <BEA-000377> <Starting WebLogic Server with Oracle JRockit(R) Version R28.2.0-79-146777-1.6.0_29-20111005-1807-linux-x86_64 from Oracle Corporation>
<Mar 17, 2013 1:42:16 PM EET> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Mar 17, 2013 1:42:26 PM EET> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Mar 17, 2013 1:42:36 PM EET> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
Error 2:
weblogic.management.ManagementException: Unable to obtain lock on /u01/soa/dev01/domains/dev01_domain/servers/dev01_soa51/tmp/dev01_soa51.lok. Server may already be running
at weblogic.management.internal.ServerLocks.getServerLock(ServerLocks.java:206)
at weblogic.management.internal.ServerLocks.getServerLock(ServerLocks.java:67)
at weblogic.management.internal.DomainDirectoryService.start(DomainDirectoryService.java:74)
at weblogic.t3.srvr.ServerServicesManager.startService(ServerServicesManager.java:461)
at weblogic.t3.srvr.ServerServicesManager.startInStandbyState(ServerServicesManager.java:166)
at weblogic.t3.srvr.T3Srvr.initializeStandby(T3Srvr.java:882)
at weblogic.t3.srvr.T3Srvr.startup(T3Srvr.java:572)
at weblogic.t3.srvr.T3Srvr.run(T3Srvr.java:469)
at weblogic.Server.main(Server.java:71)
- Changes cannot be made from the WebLogic Admin console.
- Applications running on an NFSv4 file system cannot be accessed, and application and server logs show network I/O error messages when opening files on the NFSv4 file system.
- Commands such as ls, df, and ps hang when run on an NFSv4 file system.
- One or more of the above symptoms may occur together.
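To gauge how often a server log shows the lock failures quoted above, the matching lines can be counted with a short shell sketch. The function name count_lock_errors is hypothetical; the two patterns are taken from the sample errors in this section, and other lock-related messages would need their own patterns.

```shell
# Hypothetical helper: count log lines that match the two file lock
# errors quoted in this section. Reads the named file, or stdin if no
# file is given. grep exits non-zero when the count is 0.
count_lock_errors() {
    grep -c -E 'BEA-141281|Unable to obtain lock' "$@"
}
```

For example, run count_lock_errors against the log of the server that fails to start; a count that keeps growing across retries suggests the lock is never released.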
Requirements For NFSv4 Stability
The following are mandatory requirements for NFSv4 stability in Exalogic Linux environments. Missing any of them may result in NFSv4 stability, performance, and file lock issues:
- Proper NTP Configuration: NTP is an important component of the NFSv4 scheme, and both the clients (compute nodes/vServers) and the server (ZFS storage appliance) must sync to the same NTP server(s). Typically NTP is configured identically across all Exalogic components during the ECU setup. However, there have been numerous incidents where either NTP was not configured properly or the NTP servers were not reachable, causing an NFS outage in either case.
- Proper NIS & LDAP Configuration: NIS or LDAP must be set up in order to use an NFSv4 file system. A missing or improperly configured NIS/LDAP setup also causes NFS and file lock issues. Refer to the following notes for configuring NIS or LDAP for Exalogic Linux physical and virtual environments:
Note 1470844.1 for configuring NIS for Linux Physical environments
Note 1516025.1 for configuring NIS for Linux Virtual environments
Note 1599868.1 for setting up the LDAP service for NFSv4
NOTE: If NIS was configured on compute nodes prior to EECS 2.0.0.0.0, it must be reconfigured as part of the upgrade.
- For a NIS configuration, make sure the YP packages are at least at the versions listed below. There are a few reported issues of YP NIS services crashing with older versions of the YP packages.
ypserv: ypserv-2.19-9.el5_8.1.x86_64.rpm (or later)
yptools: yp-tools-2.9-1.el5 (or later)
ypbind: ypbind-1.19-12.el5_6.1.x86_64.rpm (or later)
The above NIS packages can be downloaded from the site http://public-yum.oracle.com/public-yum-el5.repo
- IPMP Settings on ZFS: The recommended IPMP settings must be in place on the Exalogic ZFS Storage Appliance. The steps for changing the default IPMP settings on the ZFS storage appliance are documented in Exalogic Known Issues Note 1268557.1; refer to the section "Default IPMP settings in ZFS Storage Appliance causing system hang issues" under "ZFS STORAGE APPLIANCE".
- Disable NFSv4 Delegation: There is a known issue with the currently available ZFSSA firmware on Exalogic releases where NFSv4 delegation enabled on the ZFSSA causes hangs on NFSv4 clients. Disabling NFSv4 delegation is the official direction from Oracle at the moment; it is a valid and recommended configuration. This only applies when using NFSv4. Follow the steps listed in Note 1481713.1 to disable NFSv4 delegation on both ZFS storage heads, active and passive.
Steps For Debugging NFSv4 File Lock Issues
User/Group & Permissions Validation
Check whether the user/group of the WebLogic or other application process that has file lock issues matches the owner/group of the file/directory structure it accesses on the NFSv4 share. File lock issues can happen if a file on the NFSv4 share is owned by one user/group (for example, oracle/oracle) and is accessed by a process started under a different user/group (for example, testuser/testuser).
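The ownership comparison described above can be sketched in shell as follows. The function name owner_matches is hypothetical, and the sketch assumes GNU stat, as shipped with Oracle Linux.

```shell
# Hypothetical helper: succeed only when PATH is owned by USER:GROUP.
# Usage: owner_matches USER GROUP PATH
owner_matches() {
    expected="$1:$2"
    # stat -c '%U:%G' prints the owner and group names of the file.
    actual=$(stat -c '%U:%G' "$3") || return 2
    if [ "$actual" != "$expected" ]; then
        echo "owner mismatch on $3: expected $expected, found $actual" >&2
        return 1
    fi
}
```

The user and group of a running process can be read, for example, with ps -o user= -o group= -p <pid>, and then passed to owner_matches together with the lock file or domain directory that the process uses.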
NIS Services Validation
Check whether the NIS services are up and online on the NIS Master, NIS Slave, client compute nodes/vServers, and ZFS Storage Appliance. If they are down, restart the NIS services.
- NIS services validation on NIS Master
On the NIS Master compute node/vServer, the ypserv, ypbind, ypxfrd, and yppasswdd services should be up and running. Verify the status of these services with the following commands.
service ypserv status
service ypbind status
service ypxfrd status
service yppasswdd status
You can also use the ps command below to check that the above YP processes are running.
ps -ef | grep -i yp
If any of the above YP services are down, restart them with the following commands, as needed for the service that is down.
service ypserv start
service ypbind start
service ypxfrd start
service yppasswdd start
- NIS services validation on NIS Slave
On the NIS Slave compute node/vServer, the ypserv, ypbind, ypxfrd, and yppasswdd services should be up and running. Verify the status of these services with the following commands.
service ypserv status
service ypbind status
service ypxfrd status
service yppasswdd status
You can also use the ps command below to check that the above YP processes are running.
ps -ef | grep -i yp
If any of the above YP services are down, restart them with the following commands, as needed for the service that is down.
service ypserv start
service ypbind start
service ypxfrd start
service yppasswdd start
- NIS services validation on NIS Client compute nodes/vServers
On the NIS Client compute nodes/vServers, the ypbind service should be up and running. Verify the status of the ypbind service with the following command.
service ypbind status
You can also use the ps command below to check that the YP process is running.
ps -ef | grep -i yp
If the ypbind service is down, start it with the following command.
service ypbind start
- NIS services validation on ZFS Storage Appliance (NIS Client)
Log in to the active Storage Appliance (not the passive one) and check whether the NIS service is running, either with the following command in CLI mode or from the ZFSSA BUI by going to Configuration -> Services and looking at the status of NIS.
configuration services nis show
If the NIS service on the active ZFSSA is down and not showing as online, restart it from the ZFSSA BUI console by going to the Configuration -> Services -> NIS screen and clicking the “refresh” icon on the left side.
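The compute node checks in this section can be condensed into one pass that reports which YP daemons are absent from a process listing. This is a minimal sketch; the function name missing_yp_daemons is hypothetical, and it matches daemon names as substrings so that prefixed process names are also recognized.

```shell
# Hypothetical helper: read a process listing on stdin and print each
# expected YP daemon name that does not appear anywhere in it.
missing_yp_daemons() {
    listing=$(cat)
    for name in "$@"; do
        case "$listing" in
            *"$name"*) ;;          # found somewhere in the listing
            *) echo "$name" ;;     # not found: report it as missing
        esac
    done
}
```

On a NIS Master or Slave this could be run as ps -e -o comm= | missing_yp_daemons ypserv ypbind ypxfrd yppasswdd, and on a NIS Client with ypbind as the only argument. Empty output means every listed daemon was found.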
NTP Service & Synchronization Validation
- Validate that the NTP service is up and running on compute nodes/vServers
Check the status of the NTP service on each compute node/vServer with the following command.
service ntpd status
If the ntpd service is not running, start it with the following command.
service ntpd start
- Validate that the NTP service is up and running on the ZFS Storage Appliance
Log in to the active Storage Appliance (not the passive one) and check whether the NTP service is running, either with the following command in CLI mode or from the ZFSSA BUI by going to Configuration -> Services and looking at the status of NTP.
configuration services ntp show
If the NTP service on the active ZFSSA is down and not showing as online, restart it from the ZFSSA BUI console by going to the Configuration -> Services -> NTP screen and clicking the “refresh” icon on the left side.
- Validate that the NTP service on all compute nodes/vServers is synchronized to the same NTP server
Run the following command on each compute node/vServer.
ntpq -p
Below is sample output of the above command. The server name prefixed with “*” is the active NTP server to which the compute node is synchronized. Server names prefixed with “+” are standby NTP servers; when the primary NTP server goes down, NTP synchronization falls back to these standby servers. The NTP servers are configured in the /etc/ntp.conf file.
ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 3 64 377 0.000 0.000 0.001
+ntpserver2.oracle.com 10.232.254.246 5 u 598 1024 377 0.326 -0.611 2.323
*ntpserver.oracle.com 10.57.254.246 4 u 706 1024 377 3.048 -4.559 0.507
If the compute nodes/vServers are not all synchronized to the same server, verify that the NTP servers listed in /etc/ntp.conf are the same on all compute nodes/vServers. Make changes as needed so that every compute node/vServer has the same list of NTP servers, then restart the NTP service with the following command.
service ntpd restart
- Validate that the NTP service on the ZFS Storage Appliance is synchronized to the same NTP server
Log in to the active Storage Appliance (not the passive one) and check the NTP servers listed for NTP synchronization, either with the following command in CLI mode or from the ZFSSA BUI by going to Configuration -> Services -> NTP and looking at the servers listed.
configuration services ntp show
If the NTP servers listed on the ZFSSA differ from the NTP servers the compute nodes/vServers synchronize to, change the NTP servers on the ZFSSA to match the compute nodes and restart the NTP service on the ZFSSA. The NTP service can be restarted by clicking the refresh icon on the left.
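When comparing many compute nodes/vServers, it can help to pull just the synchronized peer out of the ntpq output. The sketch below assumes the output layout shown in the sample above; the function name ntp_sync_peer is hypothetical.

```shell
# Hypothetical helper: read `ntpq -p` output on stdin and print the peer
# whose line starts with "*", i.e. the server currently selected for
# synchronization. Prints nothing when no peer is selected.
ntp_sync_peer() {
    awk '/^\*/ { print substr($1, 2) }'
}
```

Running ntpq -p | ntp_sync_peer on each node and comparing the results shows quickly whether all nodes have selected the same server. Empty output on a node means it has not selected any peer.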
nobody:nobody Known Issue Validation
Check the user/group ownership of the file/directory structure on the NFSv4 share on which the lock file errors are thrown. If the ownership shows as nobody:nobody, you might be hitting a known issue in the Storage Appliance where a ZFS Storage Head failover causes file permissions on all compute nodes to change to "nobody:nobody" intermittently. Refer to Exalogic Known Issues Note 1268557.1, section “File Permissions On Compute Nodes Showing As "nobody:nobody" After ZFS Storage Head Failover”, for details. Restart the NIS services on the ZFS Storage Appliance using the steps in that section and check whether the ownership of the files/directories on the NFSv4 shares changes back to the original user/group.
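A directory tree can be scanned for the nobody:nobody symptom with a small filter such as the one below. This is a sketch under the assumption that GNU find and stat are available; the function name list_nobody_owned is hypothetical.

```shell
# Hypothetical helper: read "owner:group path" lines on stdin (as printed
# by stat -c '%U:%G %n') and print the paths owned by nobody:nobody.
list_nobody_owned() {
    awk '$1 == "nobody:nobody" { sub(/^[^ ]+ /, ""); print }'
}
```

It could be fed with find <directory on the NFSv4 share> -exec stat -c '%U:%G %n' {} + | list_nobody_owned. Any path printed is a candidate for the known issue described above.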
NIS Services Testing
Perform the following validation to ensure that NIS propagation is happening from the NIS Master to the NIS Slave, the NIS clients, and the ZFS Storage Appliance (which is a NIS client).
- Run the "domainname" command on all compute nodes/vServers and ensure that the domain name returned is the same on all of them.
- Run the "id" command below for an existing NIS user on all compute nodes/vServers and check that it shows the same UID and GID mapping everywhere.
id <username in NIS>
For example: id testuser
- Check that NIS user propagation is reaching the ZFSSA NIS client and that it can see the NIS user from the NIS Master, using the steps below.
From the active Storage Appliance BUI console, go to the Configuration -> Users screen, click Add user, select Directory as the Type, enter a username that is already present on your NIS server, and click Add. The user from the NIS server should be listed.
- If NIS propagation is not happening as expected in the three checks above, ensure that the NIS services are up and online on the NIS Master, NIS Slave, client compute nodes/vServers, and ZFS Storage Appliance as described in the "NIS Services Validation" section above.
- If all the NIS services are up and running and NIS propagation is still not happening properly, this may be the result of a NIS misconfiguration, for example an incorrect YP server configuration on NIS clients or misconfigurations in the /etc/yp.conf and /etc/hosts files. Check and verify the following:
a) Check that the NIS Master hostname/address that the NIS clients bind to is specified in the /etc/hosts file of all NIS Master and Slave compute nodes/vServers.
b) Check that the NIS Master and Slave binding hostnames/addresses specified in the /etc/yp.conf file on all compute nodes/vServers are correct.
c) Check that the NIS Master and Slave binding hostnames/addresses specified in the ZFSSA NIS client configuration are correct. This can be done by going to the Configuration -> Services -> NIS screen and looking at the list of servers specified.
If any of the above configurations has discrepancies, make the changes and restart all the NIS services. Refer to the following notes to validate the NIS configuration for Exalogic Linux physical and virtual environments and to make any other changes that are needed:
Note 1470844.1 for configuring NIS for Linux Physical environments
Note 1516025.1 for configuring NIS for Linux Virtual environments
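The UID/GID comparison across nodes can be automated once the values have been gathered into one list. The sketch below assumes one line per node in the form "host uid:gid"; the function name ids_consistent and that input format are hypothetical conventions, not part of NIS.

```shell
# Hypothetical helper: read "host uid:gid" lines on stdin and succeed only
# when every host reports the same uid:gid pair as the first line.
ids_consistent() {
    awk 'BEGIN { bad = 0 }
         NR == 1 { ref = $2 }
         $2 != ref { print "mismatch on " $1 ": " $2 " (expected " ref ")"; bad = 1 }
         END { exit bad }'
}
```

Each input line could be produced on a node with echo "$(hostname) $(id -u testuser):$(id -g testuser)", where testuser stands for an existing NIS user. Any mismatch line points at a node whose NIS binding or maps should be checked first.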
Validate Unused LockStateID Entries In ZFS Storage Appliance
This check applies only when running Exalogic Linux physical PSUs older than the January 2013 PSU (2.0.3.0.1). Check whether you are running into the known issue of increasing unused LockStateID entries in the ZFS Storage Appliance, which leads to file lock issues on NFSv4 clients. Refer to MOS Note 1540532.1 for information and steps to validate this known issue. If you are running into it, upgrade to the Exalogic January 2013 PSU (2.0.3.0.1) or later.
NFSv4 File Lock Diagnostic Data Collection
If you still face NFSv4 file lock issues after the above validations, provide the following information:
- Run Exachk and collect Exachk diagnostic output.
- Collect a sosreport from every compute node/vServer that has NFS lock issues.
- Collect a support bundle from the active Storage Appliance.
- Provide the /etc/hosts, /etc/yp.conf, and /etc/ntp.conf files from all compute nodes/vServers.
- Provide the outputs below from the active ZFSSA head (CLI mode). These CLI commands show the list of servers used for the NTP and NIS services and the status of these services.
configuration services nis show
configuration services ntp show
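Gathering the three configuration files from each compute node/vServer can be scripted. The following is a minimal sketch; the function name collect_configs and the archive name in the usage note are hypothetical, and files that do not exist on a node are skipped rather than treated as an error.

```shell
# Hypothetical helper: pack the readable files among the arguments into a
# gzip-compressed tar archive. Usage: collect_configs ARCHIVE FILE...
collect_configs() {
    archive=$1
    shift
    # Rebuild the argument list so that it holds only readable files.
    for f in "$@"; do
        shift
        if [ -r "$f" ]; then
            set -- "$@" "$f"
        else
            echo "skipping unreadable file: $f" >&2
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "no files to collect" >&2
        return 1
    fi
    tar -czf "$archive" "$@"
}
```

For example: collect_configs /tmp/$(hostname)-nfs-configs.tar.gz /etc/hosts /etc/yp.conf /etc/ntp.conf. Including the hostname in the archive name keeps the files from different nodes apart when they are uploaded together.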
Steps For Debugging NFSv4 Hang Issues
Validate IPMP Settings On ZFSSA
Validate that the recommended IPMP settings are in place on the Exalogic ZFS Storage Appliance. The steps for changing the default IPMP settings on the ZFS storage appliance are documented in Exalogic Known Issues Note 1268557.1; refer to the section "Default IPMP settings in ZFS Storage Appliance causing system hang issues" under "ZFS STORAGE APPLIANCE". The IPMP settings on the ZFSSA should be as follows:
Failure Detection Latency: 5000 ms
Failback: false
Validate If NFSv4 Delegation Is Disabled on ZFSSA
Check whether NFSv4 delegation is turned off on the ZFS Storage Appliance. If it is not, follow the steps listed in Note 1481713.1 to disable NFSv4 delegation on both ZFS storage heads, active and passive.
Validate Network Status On ZFSSA
Check whether the ZFS Storage Appliance is reachable from the compute node NFSv4 clients. This can be verified by pinging the ZFS Storage Appliance IPoIB address from the compute nodes, as follows.
- Run the df -h command to get the IPoIB address of the ZFSSA; it appears as the source of the NFS mounts.
- Ping the IPoIB address of the ZFSSA from the Linux compute nodes.
- If the ZFSSA is not reachable via the above ping, run the command below on the active ZFSSA in CLI mode to see whether the IB network interfaces are up and running on the ZFSSA.
configuration net interfaces show
If you see any network issues in the steps above, troubleshoot the issue on the IB network and ZFSSA side.
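The reachability check above can be scripted by reading the NFS server addresses from the mount table instead of copying them by hand. This sketch assumes IPv4 addresses or hostnames in the mount source, as is usual for IPoIB on Exalogic; the function name nfs_servers is hypothetical.

```shell
# Hypothetical helper: read mount-table lines (the /proc/mounts layout) on
# stdin and print the unique server part of every nfs or nfs4 mount source.
nfs_servers() {
    awk '$3 == "nfs" || $3 == "nfs4" { split($1, src, ":"); print src[1] }' | sort -u
}
```

A possible use is nfs_servers < /proc/mounts | while read -r srv; do ping -c 3 "$srv"; done, which pings each storage address that the node currently mounts from.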
Validate NFS Service Status On ZFSSA
Check whether the NFS service is up and online on the active ZFS Storage Appliance. This can be verified with the following command on the active ZFSSA in CLI mode.
configuration services nfs show
If the NFS service is down on the active ZFSSA, restart it from the BUI console by going to the Configuration -> Services -> NFS screen and clicking the “refresh” icon on the left side.
Validate NFSv4 Mount Options
Validate that the NFSv4 mount options used in the /etc/fstab file are as recommended in the Exalogic owner’s guide below, section “9.5 Creating NFSv4 Mount Points on Oracle Linux”.
http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm#BGBFFIBF
If the mount options in use are different, change them as recommended in the above owner’s guide and remount the NFSv4 shares.
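To compare the configured options against the owner’s guide, the NFS entries can be extracted from /etc/fstab with a small filter. This is a sketch only; the function name fstab_nfs_options is hypothetical, and it lists the options without judging them, since the recommended values come from the guide.

```shell
# Hypothetical helper: read fstab-format lines on stdin and print the mount
# point and option string of every nfs or nfs4 entry, skipping comments.
fstab_nfs_options() {
    awk '$1 !~ /^#/ && ($3 == "nfs" || $3 == "nfs4") { print $2, $4 }'
}
```

Running fstab_nfs_options < /etc/fstab on each compute node/vServer gives one line per NFS mount point, which makes differences between nodes or between shares easy to spot.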
Unmount and Remount the NFSv4 shares
If no problems were detected in the above validations, unmount and remount all NFSv4 shares using the following commands and see if that fixes the issue.
Command for unmounting all NFS shares:
umount -a -t nfs,nfs4
Command for remounting all NFS shares:
mount -a
NFSv4 Hang Diagnostic Data Collection
If you still face NFS hang issues after all of the above, collect the diagnostic data below.
- Run Exachk and collect Exachk diagnostic output.
- Collect a sosreport from every compute node where the NFS hang issue is observed.
- Capture strace output for the commands that hang on NFS shares, such as ls or df. The strace syntax is as follows.
strace -o /tmp/stracecommand.out ls
The above command captures a trace of the “ls” command and writes it to stracecommand.out in the /tmp directory.
- Collect support bundles from both ZFS Storage heads. Refer to Note 1019887.1 for information on how to collect support bundles on the ZFS appliance.
References
<NOTE:1481713.1> - NFSv4 mount directories hang on Exalogic Machine
<NOTE:1019887.1> - Sun Storage 7000 Unified Storage System: How to Collect a Support Bundle using the BUI or CLI
<NOTE:1469247.1> - Clock Sync issues between ZFS Storage Appliance Nodes and Compute Nodes in Exalogic Rack Machine
<NOTE:1268557.1> - Exalogic Elastic Cloud Software Known Issues
<NOTE:1470844.1> - How To Configure NIS Master, Slave And Client Configuration On Compute Nodes In Exalogic Elastic Cloud Software 2.x Physical Environment
<NOTE:1540532.1> - NFS File Lock Issues Caused By Increasing Unused LockStateID Entries In ZFS Storage Appliance In Exalogic
<NOTE:1599868.1> - Oracle Exalogic Elastic Cloud - Setting Up LDAP Service for NFSv4
<NOTE:1314535.1> - Exalogic Patch Set Updates (PSU) Master Note