Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type Predictive Self-Healing Sure Solution 1463157.1 : Exalogic Exachk Diagnostic Information and Suggested Actions
In this Document
Applies to:
Oracle Exalogic Elastic Cloud Software - Version 1.0.0.2.0 and later
Exalogic Elastic Cloud X5-2 Hardware
Linux x86-64
Oracle Solaris on x86-64 (64-bit)

Purpose
Exachk for Exalogic is a health-check tool designed to audit important configuration settings within an Oracle Exalogic machine. For each health check that Exachk performs, this reference document describes the benefit of the check, the risk if the check fails, and the steps to resolve the failure.

Scope
This document is intended for anyone planning to run Exachk on an Oracle Exalogic Engineered Machine.

Details
This document outlines the Exachk health check diagnostic information for Compute Nodes, Switches and Storage Nodes, and for Exalogic virtualization components such as OVM Manager, Database (DB), Exalogic Controller (EC) and Proxy Controller (PC), as follows:

Compute Nodes Linux

Compute Node: Hardware and Firmware Profile
Benefit / Impact
The Exalogic Elastic Cloud is an engineered system. Validating the hardware and firmware before the system is placed into, or returned to, production status can help avoid problems related to hardware or firmware modifications.

Risk
If the hardware and firmware are not validated, inconsistencies between components can lead to problems and outages.

Action / Repair
The output contains a few lines similar to the following:
The BIOS is at a supported version
If any result other than "at a supported version" is returned, investigate and correct the condition.

Compute Node: Software Profile
Benefit / Impact
The Exalogic Elastic Cloud is an engineered system. Validating the software packages before the system is placed into, or returned to, production status can help avoid problems related to configuration.

Risk
If the software is not validated, inconsistencies between components can lead to problems and outages.

Action / Repair
The output contains a few lines similar to the following:
[SUCCESS]........Has supported operating system
If any result other than "SUCCESS" is returned, investigate and correct the condition.

Compute Node: NTP Synchronization
Benefit / Impact
NTP synchronizes the system clock with an accurate time source. To ensure correct synchronization, the delay and offset values should be non-zero and the jitter value should be under 100.

Risk
An unsynchronized system clock may lead to errors and outages.

Action / Repair
Any warnings generated by the NTP Synchronization check could be caused by the following:
Compute Node: IB Startup Sequence
Benefit / Impact
To avoid inconsistencies within the Exalogic Elastic Cloud, and for network services to work properly, the openibd service must start before the network services.

Risk
If openibd does not start before the network services, inconsistencies within the nodes can lead to problems and outages.

Action / Repair
Relink openibd and mlx4_vnic_confd so that openibd starts before mlx4_vnic_confd, by running the following commands.

To relink openibd with S05:
rm -rf /etc/rc3.d/$(ls /etc/rc3.d/ | grep openibd); ln -s ../init.d/openibd /etc/rc3.d/S05openibd

To relink mlx4_vnic_confd with S06:
rm -rf /etc/rc3.d/$(ls /etc/rc3.d/ | grep mlx4_vnic_confd); ln -s ../init.d/mlx4_vnic_confd /etc/rc3.d/S06mlx4_vnic_confd
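Note that in the one-line commands above, if `grep` finds no matching link, the `rm -rf` argument expands to `/etc/rc3.d/` itself. A more defensive variant of the same relinking step might look like the sketch below. The function name `relink_start_order` and its argument order are hypothetical and not part of Exachk or the Exalogic image; the sketch assumes the usual SysV naming of `S` or `K` plus a two-digit priority followed by the service name.

```shell
#!/bin/sh
# Sketch only: relink_start_order is a hypothetical helper, not part of Exachk.
# Usage: relink_start_order RC_DIR SERVICE PRIORITY
relink_start_order() {
    local rc_dir=$1 service=$2 priority=$3 link
    # Refuse to run with a missing argument so no path is built from an empty string.
    if [ -z "$rc_dir" ] || [ -z "$service" ] || [ -z "$priority" ]; then
        echo "usage: relink_start_order RC_DIR SERVICE PRIORITY" >&2
        return 2
    fi
    # Remove only existing start/kill symlinks that belong to this exact service.
    for link in "$rc_dir"/[SK][0-9][0-9]"$service"; do
        if [ -L "$link" ]; then
            rm -f -- "$link"
        fi
    done
    # Recreate the start link with the requested two-digit priority.
    ln -s "../init.d/$service" "$rc_dir/S${priority}${service}"
}

# Example (as root, on a compute node):
#   relink_start_order /etc/rc3.d openibd 05
#   relink_start_order /etc/rc3.d mlx4_vnic_confd 06
```

Passing the run-level directory as an argument also makes it possible to rehearse the change against a scratch copy of `/etc/rc3.d` before touching the live system.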
Compute Node: NFS Mount Point - Version
Benefit / Impact
Verifying the correct configuration of mount points helps avoid performance problems related to NFS.

Risk
If the NFS mount points are not configured correctly, inconsistencies related to storage access may occur, and these can possibly lead to problems and outages.

Action / Repair
It is recommended to upgrade the NFS mount point to the latest version.

Links
http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm

Compute Node: NFS Mount Point - Attribute Caching
Benefit / Impact
By ensuring that attribute caching within NFS is not disabled, NFS mounts can perform more efficiently.

Risk
Disabling attribute caching causes extra network operations, which degrade network performance.

Action / Repair
Fix the configuration of the NFS mount point by removing the "noac" and/or "actimeo=0" attributes from the mount points.

Links
http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm

Compute Node: NFS Mount Point - Rsize Wsize
Benefit / Impact
The NFS mount point options "rsize" and "wsize" specify the size of the chunks of data that the client and server pass to each other. To maintain high performance for block transfers between the mount points, the "rsize" and "wsize" values need to be verified.

Risk
Incorrect configuration of rsize or wsize may lead to performance degradation.

Action / Repair
Correct the configuration of the NFS mount point by setting the rsize and/or wsize properties of the mount points to the recommended value of 131072.

Links
http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm

Compute Node: NIS domain with NFSv4 (ypbind)
Benefit / Impact "ypbind" is a client–server directory service protocol for distributing system configuration data. It allows Exalogic Elastic Cloud to find each server for NIS domains, and maintains the NIS binding information. Risk Without correct NIS configurations and binding information, inconsistency related to network services may occur, and these can possibly lead to problems and outages. Action / Repair Verify and investigate the NIS configuration based on NFSv4 Compute Node: DNS Setup
Benefit / Impact
The DNS service allows the components within the Exalogic Elastic Cloud to reach one another. Verifying the DNS setup is critical to avoid access problems between the components.

Risk
If the DNS setup is not verified, inconsistent access between components can lead to problems and outages.

Action / Repair
Verify the DNS setup by examining /etc/resolv.conf and running the nslookup command on the local host.

Compute Node: IP Configuration - eth0 and bond0
Benefit / Impact
Correct IP configuration of eth0 and bond0 allows each compute node to manage hostname mapping and DNS entries.

Risk
A misconfiguration in /etc/hosts will cause problems when a compute node tries to reach the other nodes in the same rack.

Action / Repair
Investigate /etc/hosts and the interface configuration returned for eth0 and bond0:

Links
Network Preconfiguration
Adding Exalogic Machine to Your Network

Compute Node: Swap Space
Benefit / Impact
The level of swappiness controls the amount of memory reclaim distress at the point where the kernel decides to start reclaiming mapped pages. If the swap space is unused, the kernel has an adequate amount of free physical memory, and this ensures that the Exalogic Elastic Cloud performs at its optimal level.

Risk
The usage of swap space indicates that the kernel is running out of free physical memory. Lack of free physical memory can lead to degraded performance.

Action / Repair
Clear up the used memory.

Compute Node: Free Physical Memory
Benefit / Impact
An adequate amount of free physical memory ensures that the Exalogic Elastic Cloud performs at its optimal level.

Risk
If there is not enough free physical memory, problems and outages may occur.

Action / Repair
The available free memory is calculated by adding Free Memory (MemFree), Reclaimable Memory (SReclaimable), Buffers and Cache, and subtracting Shared Memory (Shmem), as listed in the /proc/meminfo file. The result should be at least 20% of Total Memory (MemTotal). Cached memory can be released with:
sync; echo 3 > /proc/sys/vm/drop_caches
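The free-memory calculation described above can be sketched as a small shell helper. This is an illustration under stated assumptions, not the actual Exachk implementation: the function names `free_mem_percent` and `free_mem_ok` are hypothetical, and "Cache" is assumed to correspond to the `Cached:` field of `/proc/meminfo`.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and not part of Exachk.
# Reads a meminfo-formatted file (default /proc/meminfo) and prints, as a
# whole number, the percentage:
#   (MemFree + SReclaimable + Buffers + Cached - Shmem) / MemTotal * 100
free_mem_percent() {
    awk '
        $1 == "MemTotal:"     { total   = $2 }
        $1 == "MemFree:"      { free    = $2 }
        $1 == "SReclaimable:" { reclaim = $2 }
        $1 == "Buffers:"      { buffers = $2 }
        $1 == "Cached:"       { cached  = $2 }
        $1 == "Shmem:"        { shmem   = $2 }
        END {
            # Without a usable MemTotal the ratio is undefined.
            if (total <= 0) { exit 1 }
            printf "%d\n", (free + reclaim + buffers + cached - shmem) * 100 / total
        }
    ' "${1:-/proc/meminfo}"
}

# Succeeds when the percentage meets the 20% recommendation.
free_mem_ok() {
    local pct
    pct=$(free_mem_percent "$1") || return 2
    [ "$pct" -ge 20 ]
}

# Example: free_mem_ok || echo "free memory below 20% of MemTotal"
```

All values in `/proc/meminfo` are reported in kB, so the ratio needs no unit conversion.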
Compute Node: Virtual Memory Tuning for Dom0 & Physical
Benefit / Impact
The tuning for virtual memory consists of two components:
1. vm.dirty_background_ratio = 3
The default value of this ratio is 10%. With the default, the kernel is forced to write dirty pages to disk when their total size reaches 9.6 GB (10% of 96 GB). Oracle recommends tuning this parameter down to 3% to smooth out the I/O traffic.
2. vm.min_free_kbytes =
- 1048576 KB (1 GB) for physical rack
- 524288 KB (512 MB) for Dom0
The default value of this parameter is 32 MB. Oracle recommends increasing it accordingly to account for the large MTU size within an IPoIB network, which is currently 64K.

Risk
Without this tuning, the kernel may not perform at an optimum level.

Action / Repair
Edit the /etc/sysctl.conf file and modify the corresponding tuning parameters as specified in the Benefit / Impact section.

Compute Node: TCP Tuning
Benefit / Impact
The tuning for TCP consists of three components:
1. net.ipv4.tcp_timestamps should be set to 1 to avoid PAWS issues (Protection Against Wrapped Sequences, RFC 1323).

Risk
Without this tuning, TCP may not perform at an optimum level.

Action / Repair
Add the recommended tuning parameters to the /etc/sysctl.conf file.

Compute Node: MTU for Infiniband Link in Compute Node
Benefit / Impact Correct MTU size for the InfiniBand Link ensures that the communication protocol layer within InfiniBand performs optimally. Risk Incorrect MTU size may slow down InfiniBand Link and cause latency issues. Action / Repair Please refer to <Note 1624434.1>: Revised MTU Tuning Recommendations for the IPoIB Related Network Interfaces on Exalogic Physical and Virtual Environments Compute Node: MTU for Ethernet Link in Compute Node
Benefit / Impact
Correct MTU size for the Ethernet Link ensures that the communication protocol layer within Ethernet performs optimally.

Risk
Incorrect MTU size may slow down the Ethernet Link and cause latency issues.

Action / Repair
1. Identify the Ethernet interface using ifconfig. The output contains a line such as "bond1 Link encap:Ethernet".

Compute Node: ib_ipoib Module
Benefit / Impact
Having the ib_ipoib module loaded ensures that the Internet Protocol (IP) works properly over InfiniBand.

Risk
If the ib_ipoib module is not loaded, InfiniBand may not work properly.

Action / Repair
Load the module through /etc/infiniband/openib.conf.

Compute Node: ib_sdp Module
Benefit / Impact
Having the ib_sdp module loaded ensures that the Sockets Direct Protocol (SDP) works properly over InfiniBand.

Risk
If the ib_sdp module is not loaded, InfiniBand might not work properly.

Action / Repair
Load the module through /etc/infiniband/openib.conf.

Compute Node: IPoIB in Connected Mode
Benefit / Impact
Having IPoIB in Connected Mode ensures that the Internet Protocol (IP) works properly over the InfiniBand network.

Risk
If Connected Mode is not set, IPoIB might not work properly.

Action / Repair
1. If SET_IPOIB_CM and/or IPOIB_LOAD is not set to "yes", modify the /etc/infiniband/openib.conf file (or /etc/ofed/openib.conf in later versions of OFED) and change these properties to "yes".
After modifying the file, restart InfiniBand by running the command:
/etc/init.d/openibd restart
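Before restarting InfiniBand, the two settings named above can be read back from the configuration file. The sketch below is one possible way to do that; the helper names `openib_setting` and `ipoib_cm_enabled` are hypothetical, and the parsing assumes plain unquoted `KEY=value` lines, which may not cover every openib.conf variant.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and not part of Exachk.
# Prints the value assigned to KEY in an openib.conf-style file.
# Commented-out lines are ignored; if KEY is assigned more than once,
# the last assignment is printed.
openib_setting() {
    local key=$1 conf=${2:-/etc/infiniband/openib.conf}
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p" "$conf" | tail -n 1
}

# Succeeds only when both SET_IPOIB_CM and IPOIB_LOAD are set to "yes".
ipoib_cm_enabled() {
    local conf=${1:-/etc/infiniband/openib.conf}
    [ "$(openib_setting SET_IPOIB_CM "$conf")" = "yes" ] &&
    [ "$(openib_setting IPOIB_LOAD "$conf")" = "yes" ]
}

# Example:
#   ipoib_cm_enabled || echo "SET_IPOIB_CM and IPOIB_LOAD must both be yes"
#   ipoib_cm_enabled /etc/ofed/openib.conf   # later OFED location
```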
Compute Node: Enabled Cache on Local SSD
Benefit / Impact Enabling caching ensures that Local SSD performs optimally. Risk If the cache is disabled within Local SSD, performance degradation may occur, and it may lead to problems and outages. Action / Repair The following commands are listed as an example to turn the cache on: /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp Cached -L0 -a0
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -L0 -a0 Compute Node: EoIB Setup
Benefit / Impact
Having EoIB set up correctly allows networking to work properly.

Risk
If EoIB is not set up, it may lead to problems and outages related to networking.

Action / Repair
Investigate the bonding and IP configuration of the VNICs listed by the "mlx4_vnic_info -l" command. In some cases, a network device may show up as __tmp instead of ethX. For more information, see the following MOS note:
<Note 1458683.1>: Network device showing __tmp instead of ethX
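One quick way to spot the renamed-device symptom mentioned above is to scan the interface names for the `__tmp` marker. The sketch below assumes the names are supplied one per line on standard input (for example from `ls /sys/class/net`); the function name `find_tmp_netdevs` is hypothetical.

```shell
#!/bin/sh
# Sketch only: find_tmp_netdevs is a hypothetical helper, not part of Exachk.
# Reads interface names on stdin, one per line, and prints those that
# contain "__tmp". Prints nothing and still succeeds when none match.
find_tmp_netdevs() {
    grep -- '__tmp' || true
}

# Example:
#   ls /sys/class/net | find_tmp_netdevs
```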
Compute Node: Correct Slot Installation for IB Card
Benefit / Impact For the best performance on the PCI Express interface, the adapter card should be installed in a PCIe x8 slot. Risk Installing the card in a slower slot limits bandwidth and performance significantly. Action / Repair Reinstall the card in a PCIe x8 slot, or replace the card if it is already in a PCIe x8. Compute Node: Subnet Manager
Benefit / Impact The subnet manager (SM) manages all operational characteristics of the InfiniBand network. The InfiniBand network typically has more than one SM, but only one SM is active at a time. The active SM is Master SM. Others are Standby SMs. Risk If the active SM shuts down or fails and there is no standby SM to replace it, the InfiniBand network will fail, which can cause loss of connectivity within Oracle Exalogic Elastic Cloud. Action / Repair Refer to the link for detailed instructions to repair the problem. Links Compute Node: Lockd Configuration
Benefit / Impact
Lock recovery after a reboot is critical, to maintain data integrity and to prevent unnecessary application hangs. To help rpc.statd match SM_NOTIFY requests to NLM requests, this best practice should be observed.

Risk
NFSv3 locks may not be recovered after a reboot.

Action / Repair
NOTE: The compute node becomes unavailable during this period, causing applications to stop running within the compute nodes. To handle the possible impact of a temporary loss of service, ensure adequate preparation ahead of time.
Follow the steps given below:
1. Edit the /etc/sysconfig/nfs file.
2. Change the following line:
From: #STATDARG=""
To: STATDARG="-n `uname -n`"
3. Reboot the compute node.

Compute Node: BIOS Settings
Benefit / Impact
Before upgrading Oracle Exalogic Elastic Cloud (to 2.0.0.0.1 for example), it is important to ensure that the system has the recommended BIOS settings.

Risk
If the recommended BIOS settings are not followed, problems may occur during the Oracle Exalogic Elastic Cloud 2.0.0.0.1 upgrade.

Action / Repair
Refer to <Note 1608959.1>: Updating the BIOS Settings for X2-2, X3-2, and X4-2 Compute nodes before installing EECS on Exalogic

Compute Node: Consistent Hardware Clock Timezone Reference
Benefit / Impact
Having the same time zone reference to UTC for the hardware clock avoids any potential time skew between nodes before time is synchronized with the NTP server.

Risk
Different time zone settings across different machines cause some applications to have job synchronization issues.

Action / Repair
Compute Node: OVS Cluster Connectivity (Livenodes)
Benefit / Impact Exalogic Elastic Cloud must maintain the cluster connectivity of the compute nodes synchronized as a distributed system to support running applications. Risk If any of the compute nodes are not live, it may lead to problems and outages. Action / Repair Please investigate which compute nodes are not live and check if it is due to networking issues in the cluster. If the problem persists, please open an SR with Oracle Support. Compute Node: OVS Server Pool Virtual IP Ping Test
Benefit / Impact Ensuring that the virtual IP of the Oracle VM Server pool is up and running is crucial to ensure that the Server Pool Master is accessible. Risk If the virtual IP of the Oracle VM Server pool is down, Server Pool Master cannot be accessed. Action / Repair Please check your Oracle VM Server master compute node and ensure everything is functional. One possible workaround is as follows: 1) Reboot the Oracle VM Server master compute node. If problem persists, please contact Oracle Support. Compute Node: vServer Stale Lock
Benefit / Impact A lock indicates that a vServer is already running on one of the hypervisors in the server pool. A lock file that remains even when the vServer is not running due to some unexpected error is a stale lock file. The vServer with that lock will then not be able to start. Risk A vServer cannot be started even though it is not running. Action / Repair Please refer to the following MOS notes for details: If problem persists, please contact Oracle Support. Compute Node: Recent Reboot Info
Benefit / Impact
Ensuring the stability of the system is important to support applications running on Exalogic. By discovering an unexpected and recent reboot of a compute node, action can be taken to fully restore the service and resolve the potential cause of the problem.

Risk
An unexpected and recent reboot of a compute node may lead to problems and outages.

Action / Repair
If the recent reboot was intentional or expected, ignore this warning. Otherwise, investigate why this compute node rebooted unexpectedly. If the problem persists, contact Oracle Support.

Compute Node: Recent Critical Error
Benefit / Impact Ensuring the stability of the system is important to support applications running on Exalogic. By discovering unexpected critical errors in a compute node, action can be taken to fully restore the service as well as resolve the potential cause of problem. Risk An unexpected critical error within a compute node may lead to problems and outages. Action / Repair If the critical errors in the recent reboot were expected, please ignore this warning. Otherwise, please investigate further by looking at the log file /var/log/ovs-agent.log*. If problem persists, please contact Oracle Support. Links Note 1501348.1 - Identifying And Resolving Oracle VM Issues In Exalogic Virtual Environment Compute Node: Connectivity To OVMM
Benefit / Impact For the virtual data center management system to work, all the different Exalogic control components must be running. Risk The unavailability of any Exalogic control component will result in a loss of functionality in the management of the virtual datacenter. Action / Repair The failed Exalogic Elastic Cloud Software (EECS) component must be restarted. Links Note 1501228.1 - How To Start A Stopped Exalogic Control Stack In An Exalogic Virtual Environment Compute Node: Local Disk Usage Limit
Benefit / Impact
Keeping enough local disk space free ensures the compute node can operate optimally.

Risk
Insufficient free local disk space degrades the performance of the compute node.

Action / Repair
Free up disk space on the local disk. Oracle recommends that most, if not all, user data be stored on the storage appliance.

Compute Node: OVS Agent Status
Benefit / Impact Oracle VM Manager communicates with the Oracle VM Agent to create and manage guests on an Oracle VM Server. Risk If the Oracle VM Agent is not running, problems and outages related to the management of guest VMs will occur. Action / Repair Investigate the issue and notify Oracle Support for further assistance. Compute Node: Orphan Image File
Benefit / Impact
Removing orphan image files frees up disk space.

Risk
The image files of orphan virtual disks occupy disk space. This disk space is wasted and cannot be used for other data.

Action / Repair
Remove the orphan image files indicated by Exachk after verifying that they are not being used by any vServer.

Compute Node: OVS Pool File System
Benefit / Impact Without the ovspoolfs storage share properly mounted, vServers hosted by Oracle VM Server will not work correctly. Risk If any of the vServers do not have this system mount point, it may lead to problems and outages. Action / Repair Please investigate if there is storage connectivity issue. If the problem persists, please open an SR with Oracle Support. Compute Node: Bonding of InfiniBand Interface
Benefit / Impact
The InfiniBand interfaces are a communication link between various components of the Exalogic machine. To maintain high availability (HA) with the IPoIB interface, the InfiniBand interfaces must be bonded correctly.

Risk
Without proper bonding of the InfiniBand interfaces, the Exalogic machine cannot maintain high availability (HA) if one of the communication links goes down. It can also affect performance.

Action / Repair
Investigate the bonding in the /etc/sysconfig/network-scripts/ifcfg-ib* files for each applicable pkey.

Compute Node: ZCOPY Configuration
Benefit / Impact Proper zcopy configuration must be ensured for the Exalogic machine to perform optimally. Risk An incorrect zcopy configuration can affect performance. Action / Repair Add sdp_zcopy_thresh=0, recv_poll=0 to the /etc/modprobe.conf file. Compute Node: Ulimit
Benefit / Impact
The ulimit parameter specifies the maximum number of processes that a user can have running. This parameter must meet the specification set in the base image to ensure that performance is optimal.

Risk
If the value of the ulimit parameter is too low, performance can suffer because the Exalogic machine cannot start new processes.

Action / Repair
Add the following line to the file ~/.bashrc:
ulimit -u value
Replace value with the new limit. Oracle recommends that it be at least 65536, as set in the base image specification.

Compute Node: MegaCLI Status
Benefit / Impact The RAID disks serve as a redundant data storage. By monitoring RAID disks for failed or degraded disks, high availability (HA) in an Exalogic machine is maintained. Risk Failed or degraded RAID disks can affect the high availability of Exalogic and affect performance. Action / Repair Contact Oracle Support. Compute Node: Consistency with DNS on the Physical Compute Node
Benefit / Impact A correct hostname that matches the DNS prevents network configuration issues. Risk An incorrect hostname that does not match the DNS may cause configuration issues. It can also cause Exachk to report wrong results. Action / Repair You must determine if it is an error in the host or in the DNS entry. If it is the host, fix the hostname by changing the value of the parameter HOSTNAME in the /etc/sysconfig/network file. Compute Node: Disabled LRO on OVS
Benefit / Impact
The Large Receive Offload (LRO) option offers the lowest CPU utilization for receivers and is enabled by default in the driver, but it is completely incompatible with routing/IP forwarding and bridging. Hence, it must be disabled for the Oracle Virtual Server network bridge to work correctly.

Risk
Any VM using dom0 bridging would have extremely poor network performance with LRO enabled.

Action / Repair
Compute Node: Corruption in dom0 Partition Key Table
Benefit / Impact
In EECS 2.0.6, deploying vServers with an EECS 2.0.1.1.0-based template can cause issues in the network of the vServers. This is usually indicated by the presence of all zeros in the hardware address. You must fix this issue to ensure proper network connectivity.

Risk
A corrupted vGUID table can cause loss of network connectivity.

Action / Repair
In each port, the table should not contain the value 0x00000 as a vGUID in the first 64 entries (that is, entries 0-63).

Compute Node: Host interconnect (usb0) Disabled
Benefit / Impact The host interconnect should be disabled to allow all assets within Enterprise Manager Ops Center (EMOC) to be discovered. In a physical environment, it prevents potential conflict with network interfaces. Risk When the host interconnect is not disabled, EMOC asset discovery can fail. In a physical environment, the IPoIB-default interface might be missing. Action / Repair To disable host interconnect, perform the following steps: Compute Node: PCI 64-Bit Resource Allocation Disabled
Benefit / Impact Disabling PCI 64-bit resource allocation ensures that all MMIO get allocated below 4GB within the Exalogic system. Risk If the recommended PCI 64-bit resource allocation settings are not used, you may face problems related to memory. Action / Repair NOTE: To fix this issue, you must restart the compute node. Applications running on the compute node will be stopped while the compute node restarts. Note 1608959.1 - Updating the BIOS Settings for X2-2, X3-2, and X4-2 Compute nodes before installing EECS on Exalogic
Compute Node: RPM Database Corruption for Control VM
Benefit / Impact
If an RPM query returns an error, any RPM operation is likely to fail. Because upgrading and patching require the use of RPM, they would also fail.

Risk
The upgrade process cannot proceed without fixing the errors with the RPM installation.

Action / Repair
Run rpm -qa. If the command runs without any issue, proceed with the upgrade installation.

Compute Node: RPM Database Corruption
Benefit / Impact
If an RPM query returns an error, any RPM operation is likely to fail. Because upgrading and patching require the use of RPM, they would also fail.

Risk
The upgrade process cannot proceed without fixing the errors with the RPM installation.

Action / Repair
Run rpm -qa. If the command runs without any issue, proceed with the upgrade installation.

Compute Node: Memory Recommendation for dom0
Benefit / Impact
Physical memory of dom0 is critical for applications to be able to run on the Exalogic system.

Risk
When the physical memory attributes allocated to dom0 do not meet the recommended values, the compute node may freeze or experience an unexpected kernel panic and restart.

Action / Repair
Please follow instructions in the MOS link.

Compute Node: RAID Battery Level
Benefit / Impact Exalogic local storage is set up in RAID configuration. Ensuring that RAID has sufficient battery power is critical for the local storage to function properly, especially during a power outage. Risk When RAID battery runs out, the compute node may not have data protection against failure and may also experience performance degradation. Action / Repair Please refer to following Note and contact Oracle Support. <Note 1437353.1>: Exalogic Battery Check and Replacement Guidelines
Compute Node: Xen Vulnerability Patch Verification for Oracle Virtual Server of 2.0.6.x.x
Benefit / Impact
Applying the CVE-2014-7188/XSA-108 patch addresses a critical Xen security vulnerability that can allow malicious guest virtual machines to potentially read data from either other guest machines, or the hypervisor itself.

Risk
Not applying the patch exposes the Exalogic machine to a critical Xen security vulnerability that can allow malicious guest virtual machines to potentially read data from either other guest machines, or the hypervisor itself.

Action / Repair
Please refer to the following MOS note:
Note 1932297.1 - CVE-2014-7188 / XSA-108 (Xen Vulnerability) Patch Availability for Oracle Exalogic in a Virtualized Configuration

Compute Node: Non-2.0.6.x.x VM with 12 vCPUs and 32G Memory
Benefit / Impact After upgrading to version 2.0.6.x.x, VMs with 12 or more vCPUs and 32 GB or more of RAM would start up normally. Risk Loading VMs with 12 or more vCPUs and 32G or more of memory takes an extended period of time. Action / Repair Upgrade pre-2.0.6.x.x templates and virtual machines in the system to 2.0.6.x.x. Note 1582091.1 - Exalogic Virtual dom0 Memory Recommendations Compute Node: Check for Version 2.0.1.x.x Template, Virtual Machine or Large VM in Version 2.0.4.x.x Environment
Benefit / Impact There is a possibility of multiple versions of Virtual Machines and Templates installed in Exalogic virtual setup. For virtual machines, templates and virtual setups, we have the following versions: 2.0.1.x.x, 2.0.4.x.x, 2.0.6.x.x. Virtual machines and templates of version 2.0.1.x.x are compatible with a 2.0.4.x.x. virtual setup, but are not compatible with the version 2.0.6.x.x infrastructure. Upgrade these 2.0.1.x.x VMs and templates before or immediately after the upgrade to 2.0.6.x.x. Risk 2.0.1.x.x template and virtual machines are not supported by version 2.0.6.x.x infrastructure. Action / Repair Before you upgrade your version 2.0.4.x.x virtual setup to version 2.0.6.x.x, create a plan to upgrade the version 2.0.1.x.x virtual machines and templates before or immediately after the upgrade to version 2.0.6.x.x. Compute Node: Check for version 2.0.4.x.x Template and Virtual Machine in 2.0.6 Environment
Benefit / Impact
Multiple versions of virtual machines and templates may be installed in an Exalogic virtual setup. Virtual machines, templates and virtual setups exist in versions such as 2.0.1.x.x, 2.0.4.x.x and 2.0.6.x.x. Virtual machines and templates of version 2.0.4.x.x can be used in a 2.0.6.x.x virtual setup.

Risk
Virtual machines and templates of version 2.0.4.x.x can be used in a 2.0.6.x.x virtual setup. There is no risk in this scenario.

Action / Repair
No action is needed.

Compute Node: Check for Version 2.0.1.x.x Template, Virtual Machine or Large VM in Version 2.0.6.x.x Environment
Benefit / Impact Virtual machines and templates of version 2.0.1.x.x are not compatible with a version 2.0.6.x.x. virtual setup. Risk Virtual machines and templates of version 2.0.1.x.x do not function properly in a version 2.0.6.x.x. virtual setup. Action / Repair Upgrade version 2.0.1.x.x templates and virtual machines in the system to version 2.0.6.x.x. Compute Node: Check for Version 2.0.6.x.x Template and Virtual Machine in Version 2.0.4.x.x Environment
Benefit / Impact Multiple versions of Virtual Machines and Templates can be installed in Exalogic virtual setup. For virtual machines, templates and virtual setups, we have the following versions: 2.0.1.x.x, 2.0.4.x.x, 2.0.6.x.x. Virtual machines and templates of version 2.0.4.x.x can be used in a version 2.0.6.x.x. virtual setup, but will not have access to the newest features and fixes. Risk Virtual machines and templates of version 2.0.6.x.x can be used in a version 2.0.4.x.x. virtual setup. Action / Repair Upgrade 2.0.1.x.x templates and virtual machines in the system. Compute Node: Resource Control Utility Information
Benefit / Impact
Each compute node in Oracle Exalogic Elastic Cloud X3-2 and X4-2 machines has a total of 16 and 24 processor cores, respectively. Not all Exalogic customers need to use all cores. The Resource Control utility controls and manages the available CPU cores. The number of enabled cores is persisted in the BIOS. The affected compute node must be shut down (powered off) and started again for the changes to take effect.

Risk
When processor cores are not enabled correctly, the system will have performance issues.

Action / Repair
Please refer to the following MOS note:
Note 1671659.1 - Exalogic Core Capping for bare metal (physical) Linux and Solaris

Compute Node: CPU CAP for Virtual Machine Configuration File in Oracle Virtual Server
Benefit / Impact
When the CPU Cap is configured to be less than 100% (through EMOC), several issues related to CPU soft lockup and vServer hangs have been reported on Exalogic. When the CPU Cap is configured to be 100% through EMOC, it is translated to cpu_cap=0 in vm.cfg, which is the recommended value.

Risk
When the CPU Cap is configured to be less than 100% (through EMOC), several issues related to CPU soft lockup and vServer hangs have been reported on Exalogic.

Action / Repair
Please refer to the following MOS note:
Note 1912480.1 - Setting CPU CAP to be less than 100% is not supported for Guest vServers on Exalogic

Compute Node: Check for unknown files in OVS repositories
Benefit / Impact There is a possibility of multiple versions of Virtual Machines and Templates installed in Exalogic virtual setup. For virtual machines, templates and virtual setups, we have the following versions: 2.0.1.x.x, 2.0.4.x.x, 2.0.6.x.x. Virtual machines and templates of version 2.0.1.x.x are compatible with a 2.0.4.x.x. virtual setup, but are not compatible with the version 2.0.6.x.x infrastructure. Upgrade these 2.0.1.x.x VMs and templates before or immediately after the upgrade to 2.0.6.x.x. Risk 2.0.1.x.x template and virtual machines are not supported by version 2.0.6.x.x infrastructure. Action / Repair Before you upgrade your version 2.0.4.x.x virtual setup to version 2.0.6.x.x, create a plan to upgrade the version 2.0.1.x.x virtual machines and templates before or immediately after the upgrade to version 2.0.6.x.x. Compute Node: BIOS SR-IOV Status
Benefit / Impact Before upgrading Oracle Exalogic Elastic Cloud, it is important to ensure that the system has the recommended BIOS settings. Risk Using the incorrect BIOS settings will result in a failed install or upgrade process. Action / Repair Enable SR-IOV as detailed in the following MOS Note: <Note 1608959.1>: Updating the BIOS Settings for X2-2, X3-2, and X4-2 Compute nodes before installing EECS on Exalogic
Compute Node: RAID Configuration of Local Disks
Benefit / Impact Proper RAID configuration of local disks ensures data redundancy. Risk If the RAID configuration of local disks is not set correctly, there is no redundancy; that is, if one of the SSDs malfunctions, all the data on that SSD will be lost. Action / Repair Please find the document "Engineering Approved Steps" under the issue "Some RAID groups were not setup correctly in the factory when the Exalogic was built", referenced in MOS Note 1360310.1 - Oracle EXALOGIC Current Product Issues X2-2, X3-2, X4-2
Compute Node: Check Dom0 Kernel Memory Slab Usage of Size-192
Benefit / Impact Certain workloads may trigger a rare condition of excessive memory use. Risk Dom0 kernel memory may leak and cause Dom0 to freeze up over time. Action / Repair Contact Oracle Support to file an SR.
Compute Node: IPoIB in Connected Mode for OEL6
Benefit / Impact Having IPoIB in Connected Mode ensures that Internet Protocol (IP) works properly over the InfiniBand network. Risk If Connected Mode is not set, IPoIB might not work properly. Action / Repair
1. If it is the OFED R2 version, check the following:
- In the /etc/sysconfig/network-scripts/ifcfg-ib* files, check whether "CONNECTED_MODE=yes" is set.
- If "CONNECTED_MODE" is not set to "yes", modify the /etc/sysconfig/network-scripts/ifcfg-ib* file and change the property to "yes".
2. If it is not the OFED R2 version, check the following:
- If SET_IPOIB_CM and/or IPOIB_LOAD is not set to "yes", modify the /etc/rdma/rdma.conf file and change these properties to "yes".
- If the content of /sys/class/net/ib0/mode and /sys/class/net/ib0/mode are not "connected", modify the content of these files to "connected".
After modifying the files above, restart InfiniBand by running the command: /etc/init.d/openibd restart
Note 1982645.1 - Exachk Reporting "IPoIB is not in connected mode" WARNING Message On Exalogic 2.0.6.2.0 Linux Physical Racks
Compute Node: Ibswitches Information Validation
Benefit / Impact Correct format for ibswitches information will ensure proper networking. Risk Incorrect switch description could cause patching issues. Action / Repair Please follow the MOS note 1476772.1: A script to reset (to factory defaults) the NM2-GW switch Description field to show Leaf Details. Note 1476772.1 - A script to reset (to factory defaults) the NM2-GW switch Description field to show Leaf Details. Compute Node: Eport_State_Enforce Status
Benefit / Impact On releases running the Oracle Enterprise Linux operating system, if the Ethernet link used by a vnic goes down, the bond configured with that particular vnic will not detect it. By default, the bond detects only the physical link it is using, which is the InfiniBand link. It will not detect the link of the Ethernet port the vnic is connected to. The eport_state_enforce=1 flag needs to be present in /etc/modprobe.conf for this failure to be detected and for failover to occur. Risk Without the eport_state_enforce=1 flag in /etc/modprobe.conf, a network outage will occur if one of the links fails. Action / Repair Make sure eport_state_enforce=1 is set in the /etc/modprobe.conf file.
Note 1512139.1 - Oracle Exalogic Elastic Cloud Known Issues - Virtualization Release
Note 1436514.1 - Exalogic: VNIC 10gb Bond Network Ethernet Link Failover Detection
Compute Node: Detect EM Agent On Dom0
Benefit / Impact Additional software running on the hypervisor is not recommended. It may affect performance of the hypervisor and guests running on it. Risk Installation of EM agent on dom0 has the potential to de-stabilize the environment. Action / Repair Uninstall the EM Agent from dom0. Note 1668193.1 - FAQs: Modifications to Exalogic Control vServers and Dom0 In Exalogic Virtual Environments (Doc ID 1668193.1) Compute Node: Eport_State_Enforce Status for OEL6
Benefit / Impact On releases running the Oracle Enterprise Linux operating system, if the Ethernet link used by a vnic goes down, the bond configured with that particular vnic will not detect it. By default, the bond detects only the physical link it is using, which is the InfiniBand link. It will not detect the link of the Ethernet port the vnic is connected to. The eport_state_enforce=1 flag needs to be present in /etc/modprobe.d/mlx4_vnic.conf for this failure to be detected and for failover to occur. Risk Without the eport_state_enforce=1 flag in /etc/modprobe.d/mlx4_vnic.conf, a network outage will occur if one of the links fails. Action / Repair Make sure eport_state_enforce=1 is set in the /etc/modprobe.d/mlx4_vnic.conf file.
Note 1512139.1 - Oracle Exalogic Elastic Cloud Known Issues - Virtualization Release
Note 1436514.1 - Exalogic: VNIC 10gb Bond Network Ethernet Link Failover Detection
Compute Node: Check CPUspeed Governor Setting
Benefit / Impact placeholder to have all exalogic related checks under this Risk Performance degradation. Action / Repair Please modify CPUspeed governor setting according to MOS note 1925546.1 Note 1925546.1 - Performance Issue with CPU Processing In Exalogic X4-2 Linux Virtual Racks (Doc ID 1925546.1) Compute Node: lro_num=0 in mlx4_vnic.conf
Benefit / Impact LRO (Large Receive Offload) is a technique used to increase throughput by reducing CPU overhead. It is not compatible with infiniband network settings, therefore needs to be disabled. Risk The system will panic if LRO is not disabled. Action / Repair Add "lro_num=0" to /etc/modprobe.d/mlx4_vnic.conf. Compute Node: ARI in BIOS Setting Enabled
Benefit / Impact If Alternate Routing ID (ARI) is supported by the hardware and set to enabled, devices are permitted to locate virtual functions (VFs) in function numbers 8 to 255 of the captured bus number, instead of the normal function numbers 0 to 7. Risk If ARI is not enabled, only 7 virtual functions will be available for VMs to use. This means any additional VM will not be able to attach a virtual function, and networking inside that VM will fail. Action / Repair NOTE: To fix this issue, you must restart the compute node. Applications running on the compute node will be stopped while the compute node restarts. Ensure that you have made adequate preparations to handle the temporary loss of service before you start this procedure. To fix this issue, see the MOS Note listed below.
Oracle Hardware Management Pack User's Guide (http://docs.oracle.com/cd/E20451_01/html/E25303/mpigt.glqbr.html)
Note 1608959.1 - Updating the BIOS Settings for X2-2, X3-2, and X4-2 Compute nodes before installing EECS on Exalogic
Compute Node: Grub Conf Settings for Dom0
Benefit / Impact To limit and pin Dom0 CPUs to run on the first 20 logical CPUs. Risk If the two parameters are missing from grub.conf, then as soon as the compute nodes are rebooted, the customer will encounter stability issues and will be unable to communicate over InfiniBand. Restoring the parameters returns the machine to functionality. Action / Repair Add the following to the xen.gz kernel boot line, then reboot the server: dom0_vcpus_pin dom0_max_vcpus=20
Compute Node: Disabled Automatic Path Migration(APM)
Benefit / Impact There is a compatibility issue between the OFA software version on Exadata (1.5.1) and on Exalogic (1.5.5). The SDP protocol fails because of a new feature, APM (Automatic Path Migration), that is enabled in Exalogic by default but not yet supported in the OFED version in Exadata, which triggers the error "RDMA CMA: unexpected IB CM event: 13". Disabling APM ensures that the SDP protocol works properly in this particular case. Risk Enabling APM on an Exalogic machine that is connected to Exadata can lead to problems and outages related to SDP protocol failure. Action / Repair Refer to Note 1588546.1 for Action/Repair.
___________________________________________________________________________________
Compute Nodes Solaris
Compute Node: Software Profile
Benefit / Impact [SUCCESS]........Has supported operating system
If a result that is not SUCCESS is returned, investigate and correct the condition.
# /opt/exalogic.tools/tools/CheckSWProfile
[SUCCESS]........Has supported operating system
[SUCCESS]........Has supported processor
[SUCCESS]........Kernel is at the supported version
[SUCCESS]........Has supported kernel architecture
[SUCCESS]........Software is at the supported profile
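As a rough illustration of how the CheckSWProfile results might be screened, the following Python sketch scans report text for result lines whose status is not SUCCESS. The function name failed_checks, the regular expression, and the FAILURE status used in the usage note are assumptions made for this example; only the [SUCCESS] line format comes from the sample output above.

```python
import re

# Matches result lines such as "[SUCCESS]........Has supported processor".
RESULT_LINE = re.compile(r"^\[(?P<status>[A-Z]+)\]\.*(?P<message>.*)$")


def failed_checks(report: str) -> list:
    """Return (status, message) pairs for result lines that are not SUCCESS."""
    failures = []
    for raw in report.splitlines():
        match = RESULT_LINE.match(raw.strip())
        if match and match.group("status") != "SUCCESS":
            failures.append((match.group("status"), match.group("message").strip()))
    return failures


# Sample output copied from the section above.
SAMPLE = """\
[SUCCESS]........Has supported operating system
[SUCCESS]........Has supported processor
[SUCCESS]........Kernel is at the supported version
[SUCCESS]........Has supported kernel architecture
[SUCCESS]........Software is at the supported profile
"""

if __name__ == "__main__":
    # An empty list means every reported check returned SUCCESS.
    print(failed_checks(SAMPLE))
```

With the sample output the function returns an empty list. A line with any other status, for example a hypothetical "[FAILURE]........Has supported processor", would be returned as ("FAILURE", "Has supported processor") for investigation.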
Compute Node: NTP Synchronization
Benefit / Impact
Note:
KISS keywords in the NTP parameters are ignored. A list of these keywords can be found at the following link: http://www.iana.org/assignments/ntp-parameters/ntp-parameters.xml
Sample Output
# report=$(ntpq -pn 2>&1); echo "$report"
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+10.133.40.1     144.25.255.140   3 u  346 1024  377    0.321    0.088   0.011
*144.25.255.141  144.20.10.10     2 u  475 1024  377    1.870   -0.210   0.002
+144.25.255.142  144.25.255.140   3 u  432 1024  377    2.065    0.251   0.179
 127.127.1.0     .LOCL.          10 l   15   64  377    0.000    0.000   0.001
Compute Node: DNS Setup
Benefit / Impact
# /usr/sbin/nscfg import -f name-service/switch
# svcadm enable dns/client
# svcadm refresh name-service/switch
Compute Node: Correct Slot Installation of IB Card for Solaris
Compute Node: Subnet Manager
Compute Node: Root Partition Usage Limit for Solaris
Compute Node: Lockd Configuration for Solaris Compute Node
svcadm enable svc:/network/nfs/nlockmgr:default
- To enable status, please run the following command: svcadm enable svc:/network/nfs/status:default
Links http://docs.oracle.com/cd/E23824_01/html/821-1462/lockd-1m.html#REFMAN1Mlockd-1m Compute Node: ib_ipoib Module for Solaris
Benefit / Impact # lsmod | grep ib_core
ib_core 61642 13 rdma_ucm,ib_sdp,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx4_vnic,ib_sa,mlx4_ib,ib_mthca,ib_mad
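The sample lsmod line above can be checked programmatically. The following Python sketch extracts the dependent modules listed for ib_core and tests whether ib_ipoib is among them. The function name module_users and the pass criterion (ib_ipoib appearing in the list of modules that use ib_core) are assumptions for illustration, not a description of what Exachk itself does.

```python
def module_users(lsmod_output: str, module: str) -> set:
    """Return the modules listed in the 'Used by' column for `module`."""
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0] == module:
            # Columns: name, size, use count, optional comma-separated users.
            return set(fields[3].split(",")) if len(fields) > 3 else set()
    return set()


# Sample line copied from the section above.
SAMPLE = (
    "ib_core 61642 13 rdma_ucm,ib_sdp,rdma_cm,iw_cm,ib_ipoib,ib_cm,"
    "ib_uverbs,ib_umad,mlx4_vnic,ib_sa,mlx4_ib,ib_mthca,ib_mad"
)

if __name__ == "__main__":
    users = module_users(SAMPLE, "ib_core")
    print("ib_ipoib loaded:", "ib_ipoib" in users)
```

The same helper can be reused for the ib_sdp check by testing for "ib_sdp" in the returned set.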
Compute Node: ib_sdp Module for Solaris
Benefit / Impact Compute Node: IP Configuration - net0 and bond0
Benefit / Impact
Compute Node: Recent Reboot Info for Solaris
Benefit / Impact Compute Node: Probe Based IPMP for Solaris
Benefit / Impact: # svccfg -s svc:/network/ipmp setprop config/transitive-probing=true
# svcadm refresh svc:/network/ipmp:default
Compute Node: Swap Space for Solaris
Benefit / Impact: Compute Node: Free Physical Memory for Solaris
Benefit / Impact Compute Node: MTU for Solaris
Benefit / Impact Compute Node: IPMP Configuration for Solaris
Benefit / Impact: Compute Node: Fault Management Log for Solaris
Benefit / Impact: Compute Node: BIOS Settings
Benefit / Impact Refer to <Note 1608959.1> Compute Node: NFS Mount Point - Version for Solaris
Benefit / Impact http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm Compute Node: Hostname Consistency with DNS on the Physical Compute Node
Benefit / Impact:
# svccfg -s svc:/system/identity:node setprop config/nodename = astring: hostname
# svcadm refresh svc:/system/identity:node
# svcadm restart identity:node
Compute Node: NFS Mount Point - Attribute Caching for Solaris
Benefit / Impact http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm Compute Node: NFS Mount Point - Rsize Wsize for Solaris
Benefit / Impact http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm Compute Node: TCP Protocol on NFS Mount Point for Solaris
Benefit / Impact To be compatible with ZFSSA, all NFS mount points in Solaris must have the protocol specified as TCP. Risk In Solaris, when no protocol is specified, the protocol used by default is RDMA, which is not compatible with ZFSSA. Action / Repair 1. From the output of the report command, identify the shares for which the specified protocol is not TCP. Use the umount command to unmount these shares.
Compute Node: RAID Battery Level
Benefit / Impact Exalogic local storage is set up in RAID configuration. Ensuring that RAID has sufficient battery power is critical for the local storage to function properly, especially during a power outage. Risk When RAID battery runs out, the compute node may not have data protection against failure and may also experience performance degradation. Action / Repair Contact Oracle Support. Compute Node: IP Configuration in /etc/hosts for Solaris
Benefit / Impact Correct IP configuration for interfaces allows each compute node to manage hostname mapping and DNS entries. Risk A misconfiguration of the /etc/hosts file can cause problems when a compute node tries to reach other nodes in the same rack. Action / Repair Investigate the /etc/hosts file and the content that is returned from interface configuration of net0/igb0 and bond0 by doing the following: 1. Verify if the content of /etc/hosts has multiple entries of the same IP address. 2. The IP address obtained from the 'ipadm show-addr' command on net0/igb0 and bond0 should be listed in the /etc/hosts file. Adding Exalogic Machine to Your Network (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/spreadsheet.htm) Network, Storage, and Database Preconfiguration (http://docs.oracle.com/cd/E18476_01/doc.220/e18479/net.htm#BHCJBICD) Compute Node: Check Solaris CACAO Publisher Setting
Benefit / Impact Verify pre-conditions for the Solaris PSU patching. Risk Solaris PSU may fail in installation process. Action / Repair Change the cacao publisher to non-sticky setting: pkg set-publisher --non-sticky cacao Compute Node: NIS domain (YPBind) for Solaris
Benefit / Impact
__________________________________________________________________________________________________________
Switches
Switch: /conf/configvalid File
Benefit / Impact The content of the /conf/configvalid file verifies that the switch is not misconfigured. Having the correct configuration ensures that Oracle Exalogic Elastic Cloud runs properly. Risk Misconfiguration of /conf/configvalid file may lead to problems and outages. Action / Repair If the /conf/configvalid is invalid(0), investigate possible misconfiguration in the other components within the switch to correct the condition. Switch: EoIB Data SL
Benefit / Impact Misconfiguration of EoIB Data SL within gateway switch may lead to problems and outages. Action / Repair Please refer to the following MOS note: <Note 2120372.1>: Exalogic: How to Change Service Levels of EoIB Data and Control on NM2-GW Switches
Switch: EoIB Control SL
Benefit / Impact Correct configuration of the EoIB Control SL ensures that the gateway switch runs properly. Risk Misconfiguration of EoIB Control within gateway switch may lead to problems and outages. Action / Repair Please refer to the following MOS note: <Note 2120372.1>: Exalogic: How to Change Service Levels of EoIB Data and Control on NM2-GW Switches
Switch: Localhost Configuration
Benefit / Impact Valid localhost configuration of the switch needs to be ensured for the Oracle Middleware Exalogic Machine to perform its processes optimally. Risk If the localhost configuration is invalid, problems and outages may occur. Action / Repair If the localhost is not configured within the host, modify the /etc/hosts file to include localhost entry. Switch: Free Physical Memory
Benefit / Impact Availability of free memory needs to be ensured within the switches for the Oracle Middleware Exalogic Machine to perform its processes optimally. Note: The recommended free space is at least 70%.
Risk Insufficient memory may lead to degraded performance, and may cause problems and outages. Action / Repair # sync ; echo 2 > /proc/sys/vm/drop_caches
NOTE: The switch becomes unavailable during this period, causing applications within this switch to stop running. Ensure that you have made adequate preparations to handle the temporary loss of service, before you start this procedure.
# reboot -n
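To make the 70% recommendation concrete, the following Python sketch computes the percentage of free memory from text in the /proc/meminfo format and compares it with the threshold. Which fields Exachk actually reads is not stated in this document, so the use of MemFree divided by MemTotal is an assumption, as are the function names.

```python
def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo key to its integer value (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def free_memory_percent(text: str) -> float:
    """Percentage of free memory, assuming MemFree / MemTotal is the measure."""
    info = parse_meminfo(text)
    return 100.0 * info["MemFree"] / info["MemTotal"]


def meets_recommendation(text: str, minimum_percent: float = 70.0) -> bool:
    """True when free memory is at or above the recommended minimum."""
    return free_memory_percent(text) >= minimum_percent


# Hypothetical values for illustration only, not taken from a real switch.
SAMPLE = "MemTotal:  1000 kB\nMemFree:  750 kB\nBuffers:  10 kB\n"

if __name__ == "__main__":
    print(free_memory_percent(SAMPLE), meets_recommendation(SAMPLE))
```

On a switch, the input text would come from reading /proc/meminfo. If the result falls below the threshold, the cache drop and reboot steps above apply.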
Switch: Unused VNICS
Benefit / Impact
1) Investigate possible misconfiguration in other components within the switch.
2) Check whether the content of the /conf/configvalid file is 1 (investigate the "/conf/configvalid File" check).
3) Check whether the localhost entry exists (investigate the "Localhost Configuration" check).
4) Check whether Subnet Manager is configured properly (investigate the output of the sminfo command).
5) Check whether the GUID is correct (investigate the output of the ibnetdiscover command).
6) Check whether the partition is correctly configured.
Switch: Opensm
Benefit / Impact Opensm provides an implementation of an InfiniBand Subnet Manager and Administration to support Oracle Middleware Exalogic Machine. Risk If the opensm is not running, possible problems and/or outages may occur. Action / Repair 1) Run the "getmaster" command on all NM2-GW switches. If any of the NM2-GW switches does not have a local instance of the subnet manager running, enable the subnet manager by using the "enablesm" command. Switch: List Link Up
Benefit / Impact This hardware command lists the presence of links and the up-down state of the associated ports on the switch chip. Risk If any of the links are "down", problems and outages may occur. Action / Repair
If Exadata is on the same InfiniBand fabric as Exalogic, verify that the subnet manager is disabled on Exadata as well. Switch: Environment Test
Benefit / Impact Verifying that the hardware passes environment test ensures Oracle Exalogic Elastic Cloud to run properly. Risk If the environment tests result in failure, problems and outages may occur. Action / Repair If the environment tests fail, perform a power cycle through ILOM. Investigate possible hardware problems within the switch. Switch: Ibstat
Benefit / Impact Ibstat displays the basic status of InfiniBand. A properly working InfiniBand supports optimal fabric communication between the switches. Risk If these parameters for ibstat are not met, problems and outages may occur. Action / Repair This InfiniBand software command displays basic information retrieved from the local InfiniBand driver. For this software to work properly, ensure that the following criteria are met: If any of these criteria are not met, investigate the problems based on these components below:
Switch: SNMP Daemon
Benefit / Impact OpsCenter utilizes Simple Network Management Protocol (SNMP) to retrieve various switch properties, which are critical to ensure proper network monitoring of the system. Risk If the SNMP daemon is not running, network management may experience performance degradation. Action / Repair To start the snmpd service, the complete ILOM stack must be started by running the following command:
Switch: Number of Partition Keys on Bridge-X Ports
Benefit / Impact To allow VNICs to work properly, ensure that the number of partitions associated with Bridge-X (BX) ports has not reached the upper limit. Risk If the number of partition keys associated with Bridge-X ports reaches or exceeds the recommended upper limit, any newly created VNICs will be in the WAIT-VHUB state even if all Bridge-X ports are full members of the appropriate partition. Action / Repair Remove BX port GUID from the unused partition and reduce the number of partitions. Each Bridge-X port has a maximum capacity of 128 partition keys. However, it is recommended that you keep the number of partitions below 100. Switch: Host Config VNIC
Benefit / Impact To be able to create VNICs in the Host Manual Mode, the "Allow host VNIC config" parameter must be set to "yes". Risk If the "Allow host VNIC config" parameter is not set to "yes", you will not be able to create VNICs in the Host Manual Mode. Action / Repair 1. If the output of showgwconfig shows "BXM not running", ensure that the BXM service is up and running. Restart the service by running the following commands on the switch:
Switch: Pre-upgrade check on switch memory and disk space
Benefit / Impact The upgrade needs 80 MB of space in the / filesystem, 200 MB of space in /tmp, and 240 MB of memory. Meeting these requirements allows the upgrade to avoid failures due to insufficient space or memory. Risk The upgrade will fail if these criteria are not met. Action / Repair To free up memory, execute sync ; echo 2 > /proc/sys/vm/drop_caches or reboot. Remove unwanted files in / and /tmp if the free space falls below 80 MB and 200 MB, respectively.
Switch: VLAN PKEY PAIR Information for Switch
Benefit / Impact Unique VLAN Pkey Pairing guarantees correct networking. Risk Virtualized EECS environment does not support multiple vlans on a single IB partition. Action / Repair Please revert any manual vlan creation. If no vlan was created manually, contact Oracle Support. Switch: Validate No Stale Partition Key Temporary File Exists
Benefit / Impact A stale partition key temporary file can remain after a failed smpartition session. These stale files need to be cleaned up, or the next smpartition session may fail or commit incorrectly. Risk Partition-related operations cannot be executed, or invalid partition information gets committed. Action / Repair
1) Validate that no valid smpartition operation is in progress. Execute a diff command between the two files to see the pending changes: diff /conf/partitions.conf.tmp /conf/partitions.current
2) Move the partitions.conf.tmp file to a location for Oracle Support analysis if necessary.
3) Execute smpartition abort on the master switch to terminate the stale smpartition session if it is invalid.
Switch: Validate Partition Keys Are Using Latest Format
Benefit / Impact Partition key needs to be consistent using the new format since switch FW version 2.1.3-4. Risk EMOC will fail to execute any operations related to partition keys. Action / Repair Make sure the April 2014 PSU or July 2014 PSU was applied correctly. In particular, the documented step that modifies the pkeys (pkey_filter.pl) was executed successfully. Switch: /conf/configvalid File for Spine Switch
Benefit / Impact The content of the /conf/configvalid file verifies that the switch is not misconfigured. Having the correct configuration ensures that Oracle Exalogic Elastic Cloud runs properly. Risk Misconfiguration of /conf/configvalid file may lead to problems and outages. Action / Repair If the /conf/configvalid is invalid(0), investigate possible misconfiguration in the other components within the switch to correct the condition. Note 1520330.1 - "smpartition list active" Shows Inconsistent Partition Information Between Exalogic Switches Switch: Version Consistency on All Switches
Benefit / Impact Version consistency among switches avoids problems with hardware/software configuration. Risk Version inconsistency among switches can cause functional issues. Action / Repair Please confirm the EECS patch level of the system. Continue finishing the switch FW upgrade according to PSU instructions. Switch: Life Expectancy for SW
Benefit / Impact There is a high level of urgency to perform a control stack or switch backup right away. Risk When the remaining life is less than approximately 2%, a switch replacement should be scheduled, because disk failure is possible and the disk cannot be repaired in the field. Action / Repair Replace the switch.
Switch: Consistent Subnet Manager across Switches
Benefit / Impact
Storage Nodes
Storage Node: Backend (chkBackend.aksh)
Benefit / Impact Reports any faults, single paths, and mismatches in firmware versions between data disks, write caches (logs), and JBOD SIMs. Risk Any faults, single paths, and mismatches in firmware versions can lead to problems and outages. Action / Repair If these checks fail, follow the instructions given below, depending on the error messages that are printed in the output:
- ERROR: {hostname} SHELF: {shelf} DISK: {disk} PATH ERROR ONLY FOUND 1 PATHS
  Repair: Replace the device if both SIMs are online.
- ERROR: {hostname} SHELF: {shelf} DISK: {disk} DISK FIRMWARE MISMATCH ERROR DETECTED
  Repair: If the disk is not in the process of having its firmware upgraded (Maintenance -> System -> Firmware Upgrades), then replace the disk or have a field engineer manually upgrade the device.
- ERROR: {hostname} SHELF: {shelf} DISK: {disk} FAULTED
  Repair: Replace the disk.
- ERROR: {hostname} SHELF: {shelf} DISK: {disk} REPORTED AS MISSING - SHOULD IT BE?
  Repair: Reinsert or replace the disk. Exalogic should have all of the slots populated with disks.
- ERROR: {hostname} SHELF: {shelf} LOG: {log} PATH ERROR ONLY FOUND 1 PATHS
  Repair: Replace the device if both SIMs are online.
- ERROR: {hostname} SHELF: {shelf} LOG: {log} FIRMWARE: {fw} FIRMWARE BELOW MINIMUM RELEASE
  Repair: If the disk is not in the process of having its firmware upgraded (Maintenance -> System -> Firmware Upgrades), then replace the disk or have a field engineer manually upgrade the device.
- ERROR: {hostname} SHELF: {shelf} LOG: {log} FIRMWARE: {fw} FIRMWARE BELOW MINIMUM FOR AK VERSION
  Repair: If the disk is not in the process of having its firmware upgraded (Maintenance -> System -> Firmware Upgrades), then replace the disk or have a field engineer manually upgrade the device.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} REPORTS FAULTED
  Repair: Replace the SIM.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} UNKNOWN STATE
  Repair: Reseat the SIM and check whether its firmware is up to date.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} FIRMWARE: {fw} FIRMWARE MISMATCH: {fw} on another SIM
  Repair: If the SIM is not in the process of having its firmware upgraded (Maintenance -> System -> Firmware Upgrades), then replace the SIM or have a field engineer manually upgrade the device.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} UNKNOWN PART
  Repair: Reseat the SIM and check that its firmware is up to date.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} NOT PRESENT
  Repair: Reinsert or replace the missing SIM.
Storage Node: Cluster (chkCluster.aksh)
Examines the cluster link health of the appliance. Risk Any faults within the cluster link may lead to problems and outages. Action / Repair If this check fails, follow the instructions given below, depending on the error messages that are printed in the output: ERROR: {hostname} CLUSTER: FAILOVER - NO ONE OWNS THE RESOURCES!
Repair: Reconfigure the cluster configuration on the owner node. Storage Node: Datasets (chkDatasets.aksh)
Examines the size of the datasets of the appliance. Risk An excessive number of large datasets can cause performance degradation. Action / Repair Delete the large datasets that are no longer needed and set up a dataset retention policy. Please refer to the Action / Repair section of the storage check "ZFSSA Analytics Retention Policy" to properly set the analytics settings. To purge the datasets, refer to the "Datasets" section of the "Sun ZFS Storage 7000 Analytics Guide" to find all the datasets that are larger than 2 GB, then select and prune them.
Storage Node: Shadow Migrated Shares (chkShadow.aksh)
Iterates through all of the shares to discover those that are being shadow migrated. An error is generated when a shadow is not moving data, or when it is showing errors. Risk Any faults within the shadow migration setup may lead to problems and outages. Action / Repair If this check fails, follow the instructions given below depending on the error messages that are printed in the output: ERROR: {hostname} SHARE: {sharename} SHADOWSOURCE: {shadowsource} ERRORS: {errors} TRANSFERRED: {transferred}
Repair: Reconfig and restart the shadow migration setup, if applicable. Storage Node: Space Utilization (chkSpace.aksh)
Checks the space utilization of the storage appliance based on the pool, project, and share size. Risk Insufficient space may lead to performance degradation which will cause problems and outages. Action / Repair If the overall pool goes above 80%, reduce the amount of data stored in the disk tray by transferring some of the data to another storage device. Storage Node: Lockd Servers(chkLockd.aksh)
Some applications try to get an exclusive lock against the same file. When the lock reaches a limit, no more new sessions can be started. Risk Applications cannot scale out for more than a few concurrent sessions. Action / Repair Update to the latest firmware version. Storage Node: IPMP Failback Configuration (chkIPMPFailback.aksh)
IPMP failback policy needs to be "false" and the value of IPMP's interval needs to be "5000" to match the default setting on the Linux side. Risk Failback policy allows the original link to take over the active role after a failover. If this happens within a short period of time, erroneous conditions are possible: applications may still be in recovery mode because of the first failure, and a failback at that point makes them fail again. Action / Repair IPMP settings can be changed in configuration -> services -> ipmp, or from the CLI as shown below:
el02sn01:> configuration services ipmp
el02sn01:configuration services ipmp> show
Properties:
status = online
interval = 10000
failback = true
el02sn01:configuration services ipmp> set interval=5000
interval = 5000 (uncommitted)
el02sn01:configuration services ipmp> set failback=false
failback = false (uncommitted)
el02sn01:configuration services ipmp> commit
el02sn01:configuration services ipmp> show
Properties:
status = online
interval = 5000
failback = false
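The comparison this check performs can be sketched in Python: parse the "name = value" lines printed by the show command and report any property that differs from the recommended values (interval 5000, failback false). The function names and the RECOMMENDED table are assumptions for this example, and the parser is meant for the output of a single show command, not for a whole CLI transcript.

```python
# Recommended values taken from the check description above.
RECOMMENDED = {"interval": "5000", "failback": "false"}


def parse_properties(show_output: str) -> dict:
    """Parse 'name = value' lines, skipping CLI prompt lines that contain '>'."""
    props = {}
    for line in show_output.splitlines():
        name, sep, value = line.partition("=")
        if sep and ">" not in name:
            props[name.strip()] = value.strip()
    return props


def ipmp_findings(show_output: str) -> list:
    """Describe each property that differs from the recommended value."""
    props = parse_properties(show_output)
    return [
        f"{name} is {props.get(name)!r}, expected {expected!r}"
        for name, expected in RECOMMENDED.items()
        if props.get(name) != expected
    ]


# The two 'show' outputs from the CLI session above.
BEFORE = "Properties:\n status = online\n interval = 10000\n failback = true\n"
AFTER = "Properties:\n status = online\n interval = 5000\n failback = false\n"

if __name__ == "__main__":
    for finding in ipmp_findings(BEFORE):
        print(finding)
```

For the first show output in the session, both interval and failback are reported. For the output after commit, the list of findings is empty.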
Storage Node: IPMP Standby Configuration (chkIPMPStandby.aksh)
Verifying Storage Node's IPMP standby field to be non-empty is needed to ensure storage's high availability in case a failure occurs. Risk Having an active/active configuration may cause network issue. Action / Repair 1. Login to the storage appliance BUI, navigate to "Configuration" -> "Network" Storage Node: ZFS Snapshot Visibility
NOTE: Correction for snapshot visibility and ZFS block size can be done automatically using the script available in Note 1594039.1
Storage Node: L2ARC Header Size
Storage Node: ZFS Block Size
Perform the following:
NOTE: Correction for snapshot visibility and ZFS block size can be done automatically using the script available in Note 1594039.1
Storage Node: ZFS Maintenance Status
Investigate issues listed in the ZFS storage appliance maintenance status Storage Node: ZFS DNS Configuration
Benefit / Impact To allow proper network configuration within the ZFS storage appliance, ensure that you use a host name that matches the DNS . Risk An incorrect host name that does not match the DNS can cause network configuration issues. It can also cause Exachk to report wrong results. Action / Repair To configure the ZFSSA DNS domain and/or server settings, run the following commands. Please consult with the DNS network administrator for the specific domain or servers for the assigned machine.
Storage Node: NFSv4 Lock Object Leak
Benefit / Impact This is a known defect, number 14781917, in the Linux kernel versions shipped in Exalogic 2.0 Linux physical releases 2.0.0.0.0, 2.0.0.0.1, 2.0.0.0.2, and 2.0.0.0.3. Risk Due to this defect, unused LockStateID entries are not disposed of on the ZFS storage appliance and keep growing at a constant rate. When the number of unused LockStateID entries reaches 1 million (which is the maximum) on the ZFS storage appliance, any further NFSv4 calls from Linux clients will receive file lock errors. Action / Repair Upgrade to January 2013 PSU version 2.0.3.0.1 or later, which has the Linux kernel version containing the fix for this issue.
Storage Node: Nfsmapid Domain Matching with NIS server
Benefit / Impact: On ZFSSA, the NIS domain setting must match the NFSv4 identity domain setting. A mismatch between the two causes file ownership to show as nobody:nobody.
Risk: A mismatch between the NIS domain and the NFSv4 identity domain causes file ownership on NFSv4 mounts to show as nobody:nobody.
Action / Repair: Set the two domain settings to the correct value, using the following commands.
Storage Node: Softring Workflow
Benefit / Impact: This is a performance tuning workflow for the ZFS storage appliance to allow optimal performance on Exalogic.
Risk: Under heavy load, you might experience performance degradation while accessing shares on the ZFS storage appliance.
Action / Repair: Upgrade to January 2013 PSU version 2.0.3.0.1 or later, which includes a tuning workflow for the ZFS storage appliance named "Provide work around for CR 7122961".
Storage Node: ZFSSA Analytics Retention Policy
Benefit / Impact ZFSSA Analytics stores periodic snapshots of the system and retains them for a specified period of time. These snapshots are used for backup and recovery in case a failure occurs. You must consider the available disk space while setting up the retention policy, to ensure that system snapshots do not use up too much disk space. Risk If the retention policy is not set carefully within the recommended values, dataset growth may exceed the available disk space, which could cause significant performance degradation on the ZFS storage appliance. Action / Repair The maximum recommended settings for each property are as follows:
To configure the ZFSSA Analytics Retention Policy settings, use the following procedure:
Storage Node: ZFS Check Head Status
Benefit / Impact: Clustering is recommended for the ZFS appliance. When it is configured correctly, only one storage head is actively functioning for each ZFS appliance. One head should be set as active for its own property description and ready for its peer property description, or vice versa. One head should be set as AKCS_OWNER for its own property state and AKCS_STRIPPED for its peer property state, or vice versa. This ensures that the storage heads are configured and run correctly.
Risk: When clustering is configured for the ZFS appliance, storage head description and state properties that show more than one active head indicate a configuration or hardware error, which can cause issues with the clustering availability feature.
Action / Repair: Contact Oracle Support. Refer to the Sun ZFS Storage 7000 System Administration Guide for more information.
Sun ZFS Storage 7000 System Administration Guide - Cluster (http://docs.oracle.com/cd/E22471_01/html/820-4167/configuration__cluster.html)
Storage Node: ZFS Mirror Profile Status
Benefit / Impact When data is mirrored, it reduces capacity by half, but yields a highly reliable and high-performing system. Data mirroring is recommended when space is considered ample, but performance is at a premium. An Exalogic system only has one pool, therefore the storage configuration profile status "mirror" shows up on the active head. Risk According to the ZFS Appliance Administration Guide, while arbitrary numbers of pools are supported, creating multiple pools with the same redundancy characteristics owned by the same cluster head is not advised. Doing so will result in poor performance, suboptimal allocation of resources, artificial partitioning of storage, and additional administrative complexity. Action / Repair Refer to the ZFS Appliance Administration Guide. Oracle ZFS Storage Appliance Administration Guide - Storage Configuration (http://docs.oracle.com/cd/E27998_01/html/E48433/configuration__storage.html#configuration__storage__configuration_rules_and_guidelines)
Storage Node: ZFS Share Quota
Benefit / Impact Projects and shares in the ZFS storage appliance should not use more than 85% of the space as a best practice. Risk Inadequate space in the ZFS storage appliance and its shares and projects can cause issues and affect performance. Action / Repair Clean up the ZFS storage appliance and reallocate resources in shares and projects to keep the space usage under 85%.
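As a minimal sketch of the 85% guideline above, the following Python helper flags projects or shares whose usage exceeds the threshold. The function name, the input mapping, and the sample share names and sizes are hypothetical; the usage figures would have to be collected from the appliance BUI or CLI, which this sketch does not query.

```python
# Hypothetical helper: the share names and byte counts below are illustrative,
# not values read from a real ZFS storage appliance.
QUOTA_THRESHOLD = 0.85  # best-practice ceiling stated in this check


def over_quota(usage, threshold=QUOTA_THRESHOLD):
    """Return the names whose used/quota ratio exceeds the threshold.

    usage maps a share or project name to a (used_bytes, quota_bytes) tuple.
    Entries with a quota of 0 (no quota set) are skipped.
    """
    flagged = []
    for name, (used, quota) in sorted(usage.items()):
        if quota > 0 and used / quota > threshold:
            flagged.append(name)
    return flagged


sample = {
    "common/general": (90 * 2**30, 100 * 2**30),  # 90% of quota used
    "common/patches": (40 * 2**30, 100 * 2**30),  # 40% of quota used
}
print(over_quota(sample))
```

Any name the helper returns is a candidate for cleanup or for reallocating space, as described in the Action / Repair text.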
Storage Node: Check for ZFSSA Installed Ram
Benefit / Impact: For optimal performance, it is recommended that the ZFS storage appliance have a minimum of 90 GB of RAM.
Risk: Under heavy load, you might experience performance degradation while accessing shares on the ZFS storage appliance if the RAM size is less than the recommended size. Some systems with very low memory configurations might not run reliably.
Action / Repair: Upgrade the RAM to at least 90 GB.
Storage Node: ZFS Dedup Status
Benefit / Impact: Data deduplication controls whether duplicate copies of data are eliminated in the ZFS appliance. It is synchronous, pool-wide, block-based, and can be enabled on a per project or share basis. If your data doesn't contain any duplicates, enabling Data Deduplication will add overhead (a more CPU-intensive checksum and on-disk deduplication table entries) without providing any benefit. If your data does contain duplicates, enabling Data Deduplication will save space by storing only one copy of a given block regardless of how many times it occurs. The recommended practice for Exalogic systems is not to enable deduplication.
Risk: By its nature, deduplication requires modifying the deduplication table when a block is written to or freed. If the deduplication table cannot fit in DRAM, writes and frees may induce significant random read activity where there was previously none. As a result, the performance impact of enabling deduplication can be severe.
Action / Repair: According to the Administration Guide, you can disable data deduplication by deselecting the Data Deduplication checkbox on the general properties screen for projects or shares.
Oracle ZFS Storage Appliance Administration Guide - Data Deduplication (http://docs.oracle.com/cd/E27998_01/html/E48433/shares__shares__general.html#shares__shares__general__data_deduplication)
Storage Node: ZFS Disk Timeout Warning
Benefit / Impact: This check looks for disk timeout warnings recorded in the log file within the last 7 days.
Risk: A disk timeout warning could point to a disk failure that can be addressed well before the disk actually goes bad.
Action / Repair: Open an SR for further assistance if there is any performance issue accessing the ZFSSA.
Storage Node: ZFS Disk Health
Benefit / Impact: This check identifies any faulted disk among all disks.
Risk: A faulted disk may cause data corruption or even system failure.
Action / Repair: Replace the faulted disk.
Storage Node: NFSv4 Delegation
Benefit / Impact: Disabling the NFSv4 delegation feature helps avoid hanging problems within ZFS storage. Hangs typically happen when there are multiple concurrent writes to the same file on an NFS-mounted directory.
Risk: NFS mount points for the compute nodes may hang, which leads to problems and outages.
Action / Repair: Refer to the following Note:
<Note 1481713.1>: NFSv4 mount directories hang on Exalogic Machine
Storage Node: IPMP configuration on ZFS node
Benefit / Impact: IP network multipathing (IPMP) is used primarily as a way of increasing redundancy so that network connectivity is unaffected by the failure of a single component, be it a physical network port, a cable, or a switch. This check determines whether or not IPMP is configured correctly.
Risk: Network connectivity may be affected by the failure of a single component.
Action / Repair: Link-based failure detection uses properties of the network device driver to check whether the link to the network is active. To enable link-based failure detection, make sure that the test interfaces in an IPMP group do not have traditional IP addresses configured. Instead, they should be configured with the address and netmask of 0.0.0.0/8. Only the IPMP interface itself should be configured with a valid IP address and netmask for the appropriate subnet.
Storage Node: ZFS Slot Health
Benefit / Impact: Identifies faulted slots.
Risk: A faulted slot can cause performance degradation.
Action / Repair: Service or replace faulted slots.
Storage Node: Verify ZFS node disk storage pools
Benefit / Impact: This check determines the health of each pool from the state of all the pool's devices.
Risk: Unhealthy pools may go undetected.
Action / Repair: Follow instructions in ZFS TroubleShooting and Pool Recovery.
__________________________________________________________________________________________________________
Oracle VM Manager (OVMM)
OVMM: Oracle VM Manager (OVMM) Service Status
Benefit / Impact: Ensuring that Oracle VM Manager is running is critical as it provides a central location to manage Oracle VM Server and virtual machines.
Risk: If Oracle VM Manager goes down, problems and outages may occur.
Links: Note 1501348.1 - Identifying And Resolving Oracle VM Issues In Exalogic Virtual Environment.
OVMM: Database Corruption
Benefit / Impact: Ensures the database used by Oracle VM Manager is operating smoothly.
Risk: Corrupted data in the database can cause unexpected errors, including making the management console inaccessible.
Action / Repair: Contact Oracle Support.
Links: Note 1501348.1 - Identifying And Resolving Oracle VM Issues In Exalogic Virtual Environment.
OVMM: Sufficient CPU resources for the Oracle VM Manager
Benefit / Impact: Sufficient CPU resources are necessary for the vServer to run optimally.
Risk: A lack of sufficient CPU resources can affect performance.
Action / Repair: Contact Oracle Support.
OVMM: OVMM Pool VM Start Policy
Benefit / Impact: OVMM Pool VM Start Policy manages which servers a VM will be started on. For Exalogic, it is designed to start on the current server.
Risk: Misconfiguration of OVMM Pool VM Start Policy may cause a VM creation job to be stopped by EMOC.
Action / Repair: Refer to the latest Exachk User Guide under the heading "Verifying and Enabling Passwordless SSH to the Oracle VM Manager CLI". The link to the latest Exachk User Guide can be found on <Note 1449226.1>. For further information on setting up passwordless SSH to the Oracle VM Manager CLI, refer to the document "OVM CLI". For more information on setting up new pools and adding servers to the new pools with proper parameter values, refer to the document "B.9.2.3 Create the Required Pools and Add Servers to the New Pools".
OVMM: Check Connection Channels Before Upgrade
Benefit / Impact: The attached Python script checks whether the default configuration for WebLogic network channels has changed. The script has WLST embedded in it, which connects to the AdminServer and inspects the network channel configuration. The connection channel can affect the upgrade process.
Risk: When the connection channel on WLS port 7002 is used, the OVM upgrade script will fail.
Action / Repair: Set the connection channel properly.
OVMM: Sufficient RAM for the Oracle VM Manager
Benefit / Impact: Sufficient memory is necessary for the vServer to run optimally.
Risk: A lack of memory allocated to Oracle VM Manager can affect performance.
Action / Repair: Contact Oracle Support.
Database (DB)
DB: Oracle Database Service Status
Benefit / Impact: Ensuring Oracle Database is running is critical for database management within applications to function properly.
Risk: If the database is not up, applications will not be able to run properly as they need to store and load data from the database.
Action / Repair: Investigate the issue and notify Oracle Support for further assistance.
DB: Sufficient CPU resources for the Database Control vServer
Benefit / Impact: Sufficient CPU resources are necessary for the vServer to run optimally.
Risk: A lack of CPU resources for the Database can affect performance.
Action / Repair: Contact Oracle Support.
DB: Password Expiration Status for OVS User on DB Control vServer
Benefit / Impact: If no password expiration is set, then the OVS user will not require a password reset and cannot lose access due to an expired ID.
Risk: If the password expiry date is set and the password expires, the control stack will stop working as the OVS user won't be available.
Action / Repair:
1. Log in to the DB control vServer.
2. su - oracle
3. Use the following code:
ORACLE_SID=elctrldb
ORAENV_ASK=NO
. oraenv >/dev/null 2>&1
unset ORAENV_ASK
sqlplus / as sysdba
CREATE PROFILE OVS_PROFILE LIMIT PASSWORD_LIFE_TIME UNLIMITED;
ALTER USER OVS PROFILE OVS_PROFILE;
exit;
DB: Sufficient RAM for the Database Control vServer
Benefit / Impact: Sufficient memory is necessary for the vServer to run optimally.
Risk: A lack of memory for the Database can affect performance.
Action / Repair: Contact Oracle Support.
Enterprise Controller (EC)
EC: Enterprise Controller Service Status
Benefit / Impact
EC: Excessive Jobs within EMOC
Benefit / Impact
EC: Connectivity To EMOC
Benefit / Impact
Note 1501228.1 - How To Start A Stopped Exalogic Control Stack In An Exalogic Virtual Environment.
EC: Network Interface Connectivity for Control vServers
Benefit / Impact
EC: Storage Network Interface Connectivity
Benefit / Impact: For the whole system to work optimally, all the network interfaces for the storage appliance must be running and pingable.
Risk: Unavailability of any interface will disrupt communication to the storage appliance.
Action / Repair: Verify the configuration of the storage appliance and fix the issue.
EC: Compute Node (OVS) Network Interface Connectivity
Benefit / Impact: For the Exalogic machine to operate, all the network interfaces of the compute nodes must be running and pingable.
Risk: Unavailability of any interface of a compute node will cause problems with Exalogic Control and vServers.
Action / Repair: Verify the configuration of the compute nodes and fix the issue.
EC: Sufficient CPU Resources for the Enterprise Controller
Benefit / Impact: Sufficient CPU resources are necessary for the vServer to run optimally.
Risk: A lack of CPU resources for the Enterprise Controller can affect performance.
Action / Repair: Contact Oracle Support.
EC: Uce_scheduler status check
Benefit / Impact: Avoids a class loading error that can cause a continuous segmentation fault in uce_scheduler.
Risk: A continuous segmentation fault caused by the class loading error of uce_scheduler can cause the virtual machines to restart.
Action / Repair: Turn off the uce scheduler by running the following commands:
EC: Valid Hostname within /etc/hosts in Enterprise Controller
Benefit / Impact: The /etc/hosts file provides mapping information between hostnames and IP addresses. This file is particularly useful as a cache by the host node to resolve its hostname information when DNS service is unavailable.
Risk: When DNS service is unavailable, an incorrect hostname entry in the /etc/hosts file can lead to a severe outage and loss of data.
Action / Repair: Verify that there is an entry for the hostname in the /etc/hosts file that maps to the EoIB external management interface.
EC: OVS database schema BLOB corruption check
Benefit / Impact: If no BLOB corruption is found, vDC functionality remains intact.
Risk: New guest VMs cannot be created, or existing guest VMs cannot be managed.
Action / Repair: Follow Document 1509888.1 for recovery. To prevent a recurrence, contact Oracle Support.
Note 1509888.1 - How To Recover Exalogic Virtual Environment After OVM Manager DB Problems In EECS v2.0.1.0.x and EECS v2.0.4.0.x Virtual
EC: Sufficient RAM for the Enterprise Controller
Benefit / Impact: Sufficient memory is necessary for the vServer to run optimally.
Risk: A lack of memory for the Enterprise Controller can affect performance.
Action / Repair: Contact Oracle Support.
Proxy Controller (PC)
PC: Proxy Controller Service Status
Benefit / Impact
PC: Sufficient CPU resources for the Proxy Controller
Benefit / Impact
PC: Valid Hostname within /etc/hosts in Proxy Controller
Benefit / Impact: The /etc/hosts file provides mapping information between hostnames and IP addresses. This file is particularly useful as a cache by the host node to resolve its hostname information when DNS service is unavailable.
Risk: When DNS service is unavailable, an incorrect hostname entry in the /etc/hosts file can lead to a severe outage and loss of data.
Action / Repair: Verify that there is an entry for the hostname in the /etc/hosts file that maps to the eth0 interface.
PC: Sufficient RAM for the Proxy Controller
Benefit / Impact
Multiple Components
Multiple Components: Kernel Out-of-Memory Errors
Benefit / Impact: The kernel out-of-memory error indicates a potential resource issue.
Multiple Components: Control Virtual Server's Uptime
Benefit / Impact:
Multiple Components: NFSv3 Usage Verification for Control vServers Shares
Benefit / Impact: Ensure that NFSv3 is being used for Exalogic Control shares to prevent control stack instability.
Risk: Exalogic Control shares not using NFSv3 can destabilize the communication between the storage and control vServers.
Action / Repair: NFSv3 is used by default. If it was changed, revert it to NFSv3 for the EL Control stack.
Multiple Components: Gateway Configuration for non-Switch
Benefit / Impact: Valid gateway configuration of the switch needs to be ensured for the Exalogic Machine to perform optimally.
Risk: An invalid gateway configuration will cause communication issues between the component and others outside its subnet.
Action / Repair: Correct the appropriate network configuration file with gateway information. A gateway should be specified in the format "GATEWAY=XX.XX.XX.XX" in the /etc/sysconfig/network file.
Multiple Components: MTU for InfiniBand Link in Control vServers
Benefit / Impact: A correct MTU size for the InfiniBand Link ensures that the communication protocol layer in InfiniBand performs optimally.
Risk: An incorrect MTU size can slow down the InfiniBand Link and cause latency issues.
Action / Repair: Refer to <Note 1624434.1>: Revised MTU Tuning Recommendations for the IPoIB Related Network Interfaces on Exalogic Physical and Virtual Environments.
Multiple Components: TCP Tuning for Control vServers
Benefit / Impact: The tuning for TCP consists of three components:
1. net.ipv4.tcp_timestamps should be set to 1 to avoid PAWS issues (Protection Against Wrapped Sequences - RFC 1323).
2. net.ipv4.tcp_window_scaling should be set to 1 to allow efficient transfer of data for high bandwidth-delay products.
3. net.ipv4.tcp_sack should be set to 1 to enable selective acknowledgement, mitigating duplicate acknowledgement and/or retransmission issues (RFC 2018).
Risk: Without this tuning, TCP may not perform at an optimum level.
Action / Repair: Add the recommended tuning parameters to the /etc/sysctl.conf file.
Multiple Components: Swap Space for Control vServers
Benefit / Impact: The level of swappiness controls the amount of memory reclaim distress at which the kernel decides to start reclaiming mapped pages. If the swap space is unused, the kernel has an adequate amount of free physical memory, which allows Exalogic Elastic Cloud to perform at its optimal level.
Risk: The usage of swap space indicates that the kernel is running out of free physical memory. Lack of free physical memory can lead to degraded performance.
Action / Repair: Clear up the used memory.
Multiple Components: Lockd Configuration for Control vServers
Benefit / Impact: Lock recovery after a reboot is critical to maintain data integrity and to prevent unnecessary application hangs. To help rpc.statd match SM_NOTIFY requests to NLM requests, this best practice should be observed.
Risk: NFSv3 locks may not be recovered after a reboot.
Action / Repair: NOTE: The control vServer becomes unavailable during this period, causing applications within the control vServer to stop running. To manage the impact of a temporary loss of service, prepare your environment.
<Note 1594223.1>: How To Stop and Start the Entire Exalogic Control Stack In An Exalogic EECS v2.0.6.0.0 and later Virtual releases
<Note 1501228.1>: How To Start A Stopped Exalogic Control Stack In EECS v2.0.1.0.x and EECS v2.0.4.0.x Virtual Environments
Multiple Components: Name Service Switch Config File Permission Status in Compute Nodes
Benefit / Impact: In addition to the /etc/hosts file having the alias for name resolution, /etc/nsswitch.conf also uses “files” for resolution. The /etc/nsswitch.conf file should have 644 rights so that /etc/hosts can be used by everyone.
Risk: If /etc/nsswitch.conf is not given the 644 permission, /etc/hosts will be ignored by anyone but root. In this case, WebLogic Server may fail to start and may complain about incorrect network configuration since it cannot resolve the hostname used as the listen address.
Action / Repair: Change the rights of /etc/nsswitch.conf to 644.
Multiple Components: Local Partition Usage Limit
Benefit / Impact: Keeping enough free space on the local disk allows the system to operate optimally.
Risk: Insufficient free space on the local disk degrades system performance.
Action / Repair: Free up disk space on the local disk. Oracle recommends that most, if not all, user data be stored on the storage appliance.
Multiple Components: Check Root Space in DB and EC VM Before Upgrade
Benefit / Impact: Sufficient space is required during the upgrade process. This allows the upgrade to proceed successfully.
Risk: The system might be left in an unstable condition if the upgrade fails due to insufficient space.
Action / Repair: Compute Node: 500 MB (to clean up space, run: "yum clean all" and if space is still needed, run: "> /var/log/devmon.log" to create an empty file on the compute node)
Multiple Components: Cross check hostname with /etc/hosts in Guest VMs
Benefit / Impact: Applications refer to the hostname in /etc/hosts. A matching entry keeps application functionality from being interrupted.
Risk: Application functionality might be interrupted.
Action / Repair: Check the hostname by running "hostname -s" and add the output, together with the IP address, to /etc/hosts.
Multiple Components: Check Root Space in OVM PC VM and Compute Node Before Upgrade
Benefit / Impact: Sufficient space is required during the upgrade process. This allows the upgrade to proceed successfully.
Risk: The system might be left in an unstable condition if the upgrade fails due to insufficient space.
Action / Repair: Compute Node: 1 GB (to clean up space, run: "yum clean all" and if space is still needed, run: "> /var/log/devmon.log" to create an empty file on the compute node)
Multiple Components: Bash Vulnerability Update Check
Benefit / Impact: Applying the bash vulnerability patch addresses a critical security vulnerability in bash that allows a malicious user to execute arbitrary commands and gain unauthorized access to the system.
Risk: Not applying the patch exposes the Exalogic machine to a critical security vulnerability that can potentially allow a malicious user to execute arbitrary commands and gain unauthorized access to the system.
Action / Repair: See the following links for more information about CVE-2014-6271 and to fix the bash code injection vulnerability.
Vulnerability Summary for CVE-2014-6271 (http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-6271)
Note 1930090.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability Document for Oracle Solaris
Note 1930120.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability Document for Oracle Linux
Oracle Security Alert for CVE-2014-7169 (http://www.oracle.com/technetwork/topics/security/alert-cve-2014-7169-2303276.html)
Note 1929881.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability for Oracle Exalogic Linux Physical and Virtual Racks
Multiple Components: Verify ILOM open issue
Benefit / Impact: Any open issues found in ILOM usually indicate hardware degradation or failure.
Risk: If open issues are not addressed promptly, a loss of service may result.
Action / Repair: Address the issue by contacting Oracle Support.
Multiple Components: Validate Control VMs JDK Version
Benefit / Impact: Ensuring that control VM services are running under a supported JDK version is critical to ensure functional compatibility.
Risk: Failure to ensure that control VM services are running under a supported JDK version can lead to functional issues with EMOC and failure to execute patching.
Action / Repair: Investigate the issue and notify Oracle Support for further assistance.
Multiple Components: Version Consistency on All Switches
Benefit / Impact: Version consistency among switches avoids problems with hardware/software configuration.
Risk: Version inconsistency among switches can cause functional issues.
Action / Repair: Confirm the EECS patch level of the system. Finish the switch FW upgrade according to the PSU instructions.
Multiple Components: Ghost Vulnerability
Benefit / Impact: For the whole system to work optimally, and to avoid problems related to the Ghost Vulnerability, the installed image must be verified to be at its latest supported version.
Risk: If a system is not patched, it is exposed to a security vulnerability.
Action / Repair: Apply Patch 20448956, following the steps in the referenced MOS Note, and rerun exachk after patching to validate.
Note 1965975.1 - CVE-2015-0235 - Ghost Vulnerability - Patch Availability for Oracle Exalogic Linux Physical and Virtual Racks (Doc ID 19
Multiple Components: IPoIB in Connected Mode
Benefit / Impact: Having IPoIB in Connected Mode ensures that Internet Protocol (IP) works properly over the InfiniBand network.
Risk: If the Connected Mode is not set, IPoIB might not work properly.
Action / Repair:
1. If it is OFED R2 version, then check the following:
- Go to the /etc/sysconfig/network-scripts/ifcfg-ib* files and check whether "CONNECTED_MODE=yes" is set.
- If "CONNECTED_MODE" is not set to "yes", modify the /etc/sysconfig/network-scripts/ifcfg-ib* file and change the property to "yes".
<Note 1982645.1>: Exachk Reporting "IPoIB is not in connected mode" WARNING Message On Exalogic 2.0.6.2.0 Linux Physical Racks
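The manual inspection described above can be sketched in Python. This is an illustrative helper, not part of Exachk: the function name is hypothetical, and the sample text only imitates the general KEY=value layout of an ifcfg file.

```python
def connected_mode_enabled(ifcfg_text):
    """Return True if the ifcfg content sets CONNECTED_MODE to yes.

    Comment lines are ignored. If the key appears more than once, the last
    assignment wins, which mirrors how a shell would source the file.
    """
    enabled = False
    for raw in ifcfg_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "CONNECTED_MODE":
            enabled = value.strip().strip("\"'").lower() == "yes"
    return enabled


# Illustrative content; real files live under /etc/sysconfig/network-scripts/.
sample_ifcfg = "DEVICE=ib0\nBOOTPROTO=static\nCONNECTED_MODE=yes\n"
print(connected_mode_enabled(sample_ifcfg))
```

On a compute node, the same function could be applied to the contents of each file matched by glob.glob("/etc/sysconfig/network-scripts/ifcfg-ib*"), reporting any file for which it returns False.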
Multiple Components: NFS Mount Point - Attribute Caching
Benefit / Impact: By ensuring that attribute caching within NFS is not disabled, NFS mounts can perform more efficiently.
Risk: Disabling attribute caching can lead to extra network operations, which degrade network performance.
Action / Repair: Fix the configuration of the NFS mount point by removing any of these attributes from the mount points:
- noac
- actimeo=0
- acregmin=0
- acregmax=0
- acdirmin=0
- acdirmax=0
Configuring NFS Version 4 (NFSv4) on Exalogic (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm)
Multiple Components: Free Physical Memory
Benefit / Impact: Availability of free memory needs to be ensured within the switches for the Oracle Middleware Exalogic Machine to perform its processes optimally.
Risk: Insufficient memory may lead to degraded performance, and may cause problems and outages.
Action / Repair: The recommended free space is calculated by adding Free Memory (MemFree) and Reclaimable Memory (SReclaimable) listed in /proc/meminfo. The free memory should be at least 20% of the Total Memory (MemTotal).
# sync ; echo 2 > /proc/sys/vm/drop_caches
If there is still not enough free memory, reboot the switch.
NOTE: The switch becomes unavailable during this period, causing applications to stop running within this switch. To handle the possible impact of a temporary loss of service, ensure adequate preparation ahead of time.
Reboot the switch by running the following command:
# reboot -n
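The 20% rule above can be worked through with a short Python sketch. The function names and the sample figures are hypothetical; only the three /proc/meminfo fields named in the text (MemFree, SReclaimable, MemTotal) are used.

```python
RECOMMENDED_RATIO = 0.20  # free memory should be at least 20% of MemTotal


def free_memory_ratio(meminfo_text):
    """Return (MemFree + SReclaimable) / MemTotal from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    return (fields["MemFree"] + fields["SReclaimable"]) / fields["MemTotal"]


def meets_recommendation(meminfo_text, minimum=RECOMMENDED_RATIO):
    """Return True if the free-memory ratio is at or above the minimum."""
    return free_memory_ratio(meminfo_text) >= minimum


# Illustrative figures, not taken from a real switch.
sample_meminfo = (
    "MemTotal:        1000000 kB\n"
    "MemFree:          150000 kB\n"
    "SReclaimable:     100000 kB\n"
)
print(meets_recommendation(sample_meminfo))
```

With these sample figures, the sum of 150000 kB and 100000 kB is 25% of MemTotal, so the recommendation is met. On a live system the input would be the contents of /proc/meminfo.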
Multiple Components: MTU for Ethernet Link in Control vServers
Benefit / Impact Correcting the MTU size for the Ethernet Link ensures that the communication protocol layer in InfiniBand performs optimally. Risk Incorrect MTU size can slow down the Ethernet Link and cause latency issues. Action / Repair To correct the MTU size, perform the following:
__________________________________________________________________________________________________________
Cross-Components
Cross-Component: Firmware Version Consistency for Storage Node
Benefit / Impact: Having a consistent base firmware version across all storage nodes ensures a stable environment for Exalogic to perform optimally.
Risk: Inconsistent firmware versions across the storage nodes can lead to problems and outages.
Action / Repair: Investigate which storage nodes have a different firmware version and upgrade the storage nodes with lower firmware versions.
Cross-Component: NTP Configuration for Control vServers
Benefit / Impact: Ensuring correct NTP configuration for control vServers is crucial to running Exalogic vServers. Control vServers are configured to point to two compute nodes in the same rack by default.
Risk: Incorrect NTP configuration for control vServers can lead to job scheduling issues in managing Exalogic vServers.
Action / Repair: Correct the NTP server configuration in /etc/ntp.conf for control vServers to point to the first 2 compute nodes in the rack.
Cross-Component: NTP Configuration Consistency with Oracle VM Server for ZFS
Benefit / Impact: The ZFS storage appliance must use the same time source as the other components of the Exalogic machine.
Risk: An out-of-sync clock source can cause stability issues.
Action / Repair: Modify the NTP server using the BUI to point to the same NTP servers configured on the compute node.
Cross-Component: NTP Configuration Consistency with Physical Compute Nodes for ZFS
Benefit / Impact: The ZFS storage appliance must use the same time source as the other components of the Exalogic machine.
Risk: An out-of-sync clock source can cause stability issues.
Action / Repair: Modify the NTP server using the BUI to point to the same NTP servers configured on the compute node.
Cross-Component: NTP Configuration for Compute Nodes
Benefit / Impact: The compute nodes must use the same time source as the other components of the Exalogic machine.
Risk: An out-of-sync clock source can cause stability issues.
Action / Repair: Modify the NTP server configuration in /etc/ntp.conf for the compute nodes to point to the same set of external NTP servers.
Cross-Component: NTP Configuration Consistency with Oracle VM Servers for Switch Nodes
Benefit / Impact The switches must use the same time source as the rest of the system. Risk An out of sync clock source can cause stability issues. Action / Repair Modify the configuration via the ILOM. Example:
Cross-Component: NTP Configuration Consistency with Physical Compute Nodes for Switch Nodes
Benefit / Impact The switches must use the same time source as the rest of the system. Risk An out of sync clock source can cause stability issues. Action / Repair Modify the configuration via the ILOM. Example:
Cross-Component: Hostname Consistency with DNS on Oracle VM Server
Benefit / Impact: A correct hostname setting that matches the DNS avoids problems with network configuration.
Risk: An incorrect hostname that does not match the DNS can cause configuration problems.
Action / Repair: Determine whether the error is in the host or in the DNS entry. If it is the host, fix the hostname by changing the value of the HOSTNAME parameter in the /etc/sysconfig/network file. Example:
HOSTNAME=el01cn01.example.com
If it is an error on the DNS server, contact your network administrator to correct the issue.
Cross-Component: Hostname Consistency with DNS on Switches
Benefit / Impact: A hostname that matches the DNS will avoid problems with the networking configuration.
Risk: An incorrect hostname that does not match the DNS can cause configuration problems.
Action / Repair: Log in to the ILOM and set the hostname. Example:
set /SP hostname=el01sw-ib01
Cross-Component: Stale VNICs in the Switch
Benefit / Impact: Valid VNICs in the switch ensure that the Exalogic machine performs optimally. For a virtual rack, VNICs are important in the creation of new vServers.
Risk: VNICs in states other than "UP" can cause network outages. In a physical rack, problems related to the EoIB network can occur. In a virtual rack, an excessive number of unused VNICs can cause performance issues.
Action / Repair: Delete the stale VNICs listed under the report command. These stale VNICs can be removed from the respective switch via the deletevnic command.
Cross-Component: OVS Repo Consistency
Benefit / Impact When the Oracle Virtual Server (OVS) repositories on all Dom0s point to the same repository, this consistency avoids performance problems.
Risk Exalogic is engineered to use a single repository. Any misconfiguration can cause functional issues.
Action / Repair Ensure that no manual changes were made via OVM Manager. Revert any such changes if necessary.
Cross-Component: Non-sequential Even-numbered Gateway Instance
Benefit / Impact Check that all switches use non-sequential even numbers for their GWInstance values (e.g. 20, 30, 40) to avoid issues during future upgrades.
Risk If the GWInstance values are sequential and/or odd, issues may arise during future upgrades.
Action / Repair Ensure that all switches use non-sequential even numbers for their GWInstance values. If these criteria are not met, change the values accordingly. For example:
[root@nm2gw-ib03 ~]# setgwinstance 30
Stopping Bridge Manager.. [ OK ]
Starting Bridge Manager. [ OK ]
Confirm that the change has been applied by running the command below:
[root@nm2gw-ib03 ~]# showgwconfig
BXM (pid 19825) is running
BXM versions: bxm_user 2.0.0816.3-0, BXM-API 1.6.0, bxm_libs 2.0.0816.3-0, bxm_main 1.31 mlx_bx_core 1.31
Parameter                 Configured Value   Running Value
-----------------------------------------------------------
GWInstance                30                 30
SystemName                None               scae01sw-ib03
EoIB Data SL              1                  1
EoIB Control SL           2                  2
Allow host VNIC config    None               no
LAG mode                  yes                yes
Default discover P_key    None               0xffff
System MAC                Not applicable     00:21:28:54:7f:22
Guest vServers
Guest VM: ib_sdp Module
Benefit / Impact Having the ib_sdp module loaded ensures that the Sockets Direct Protocol (SDP) works properly over InfiniBand.
Risk If the ib_sdp module is not loaded, InfiniBand might not work properly.
Action / Repair Load the module through /etc/infiniband/openib.conf.
Guest VM: IB Startup Sequence
Benefit / Impact To avoid inconsistencies within the Exalogic Elastic Cloud, and for network services to work properly, the openibd service must start before the network services.
Risk If openibd does not start before the network services, inconsistencies within the nodes can lead to problems and outages.
Action / Repair Relink openibd with S05 and mlx4_vnic_confd with S06.
Using Exalogic Configuration Utility (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/app_a.htm)
Guest VM: TCP Tuning
Benefit / Impact The tuning for TCP consists of three components:
1. net.ipv4.tcp_timestamps should be set to 1 so that PAWS (Protection Against Wrapped Sequences, RFC 1323) can guard against wrapped sequence numbers.
2. net.ipv4.tcp_window_scaling should be set to 1 to allow efficient data transfer over paths with a high bandwidth-delay product.
3. net.ipv4.tcp_sack should be set to 1 to enable selective acknowledgement, which mitigates duplicate acknowledgement and/or retransmission issues (RFC 2018).
Risk Without this tuning, TCP may not perform at an optimum level.
Action / Repair Add the recommended tuning parameters to the /etc/sysctl.conf file.
Guest VM: NFS Mount Point - Attribute Caching
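For the TCP Tuning check, the three settings can be verified against a configuration file with a small shell sketch. The function name check_tcp_tuning is hypothetical and not part of Exachk. The sketch only reads the file passed to it (/etc/sysctl.conf by default); it does not query or change the running kernel.

```shell
# Hypothetical helper (not part of Exachk): report whether a sysctl-style
# file sets each of the three recommended TCP parameters to 1.
check_tcp_tuning() {
    local conf="${1:-/etc/sysctl.conf}"
    local key value status=0
    for key in net.ipv4.tcp_timestamps net.ipv4.tcp_window_scaling net.ipv4.tcp_sack; do
        # Use the last uncommented "key = value" line for this key.
        value=$(awk -F= -v k="$key" '
            { name = $1; gsub(/[ \t]/, "", name) }
            name == k { v = $2; gsub(/[ \t]/, "", v) }
            END { print v }' "$conf")
        if [ "$value" = "1" ]; then
            echo "OK   $key = 1"
        else
            echo "FAIL $key = ${value:-not set}"
            status=1
        fi
    done
    return "$status"
}
# Usage: check_tcp_tuning /etc/sysctl.conf
```

The function returns a nonzero status if any of the three parameters is missing or set to a value other than 1, so it can be used in a larger script.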
Benefit / Impact Ensuring that attribute caching within NFS is not disabled allows NFS mounts to perform more efficiently.
Risk Disabling attribute caching leads to extra network operations, which degrade network performance.
Action / Repair Fix the configuration of the NFS mount point by removing any of these attributes from the mount points:
- noac
- actimeo=0
- acregmin=0
- acregmax=0
- acdirmin=0
- acdirmax=0
Configuring NFS Version 4 (NFSv4) on Exalogic (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm)
Guest VM: Name Service Switch Config File Permission Status in Control vServers
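For the NFS attribute caching check, a minimal shell sketch that lists NFS mount entries carrying any of the six options named in that check. The function name find_uncached_nfs_mounts is hypothetical, and the sketch assumes the usual six-column /etc/fstab layout (device, mount point, type, options, dump, pass).

```shell
# Hypothetical helper (not part of Exachk): print NFS mount points whose
# option list contains any of the attribute-caching options named in the check.
find_uncached_nfs_mounts() {
    local table="${1:-/etc/fstab}"
    awk '
        $1 ~ /^#/ { next }    # skip comment lines
        $3 == "nfs" || $3 == "nfs4" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) {
                if (opts[i] ~ /^(noac|actimeo=0|acregmin=0|acregmax=0|acdirmin=0|acdirmax=0)$/) {
                    print $2 ": " opts[i]
                }
            }
        }
    ' "$table"
}
# Usage: find_uncached_nfs_mounts /etc/fstab
```

Each output line names a mount point and the offending option. No output means that none of the listed options was found in the file.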
Benefit / Impact In addition to the /etc/hosts file holding the aliases for name resolution, /etc/nsswitch.conf also uses "files" for resolution. The /etc/nsswitch.conf file should have 644 permissions so that /etc/hosts can be used by everyone.
Risk If /etc/nsswitch.conf does not have 644 permissions, /etc/hosts is ignored for every user except root. In this case, WebLogic Server may fail to start and may report an incorrect network configuration, because it cannot resolve the hostname used as the listen address.
Action / Repair Change the permissions of /etc/nsswitch.conf to 644.
Guest VM: NTP Synchronization
Benefit / Impact NTP helps synchronize the clock of Exalogic with an accurate time source. For correct synchronization, the delay and offset values should be nonzero and the jitter value should be under 100.
Risk An unsynchronized system clock can lead to errors and outages.
Action / Repair Warnings generated by the NTP Synchronization check can have the following causes:
1. You are using an older version of the NTP package that does not work with DNS names for the NTP servers. In this case, you must use the IP addresses.
2. A firewall is blocking access to your Stratum 1 and 2 NTP servers. The firewall may be located on one of the networks between the NTP server and its time source, or it may be firewall software, such as iptables, running on the NTP server.
3. The notrust nomodify notrap keywords are present in the restrict statement of the NTP client.
4. Localhost is configured on the NTP server. On a Linux system, remove localhost from the /etc/ntp.conf file to fix the issue. On a Solaris system, remove localhost from /etc/inet/ntp.conf.
Note: KISS keywords in the NTP parameters are ignored.
Your Linux NTP clients cannot Synchronize Properly (http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch24_:_The_NTP_Server#Your_Linux_NTP_clients_cannot_Synchronize_Properly)
Guest VM: Swap Space
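For the NTP Synchronization check, the thresholds (nonzero delay and offset, jitter under 100) can be applied to the peer table printed by ntpq. The sketch below is an illustration only: the function name check_ntp_peer is hypothetical, and applying the thresholds to the peer marked "*" (the current synchronization source) is an assumption, not a description of how Exachk evaluates the values.

```shell
# Hypothetical helper (not part of Exachk): read "ntpq -pn" output on stdin
# and apply the thresholds to the peer marked "*" (the current sync source).
# Columns: remote refid st t when poll reach delay offset jitter
check_ntp_peer() {
    awk '
        /^\*/ {
            found = 1
            if ($8 + 0 == 0 || $9 + 0 == 0 || $10 + 0 >= 100) bad = 1
            printf "peer=%s delay=%s offset=%s jitter=%s\n", substr($1, 2), $8, $9, $10
        }
        END { if (found && !bad) exit 0; exit 1 }
    '
}
# Usage (assumes ntpq is installed): ntpq -pn | check_ntp_peer
```

The function exits with a nonzero status when no peer is selected or when the selected peer falls outside the thresholds.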
Benefit / Impact The swappiness level controls the amount of memory reclaim distress at which the kernel decides to start reclaiming mapped pages. If the swap space is unused, the kernel has an adequate amount of free physical memory, which allows the Exalogic Elastic Cloud to perform at its optimal level.
Risk Swap space usage indicates that the kernel is running out of free physical memory. A lack of free physical memory can lead to degraded performance.
Action / Repair Free up the used memory.
Guest VM: Lockd Configuration
Benefit / Impact Lock recovery after a reboot is critical to maintain data integrity and to prevent unnecessary application hangs. This best practice should be observed to help rpc.statd match SM_NOTIFY requests to NLM requests.
Risk NFSv3 locks may not be recovered after a reboot.
Action / Repair NOTE: The node is unavailable while it reboots (step 3), which causes applications to stop running. Prepare ahead of time for this temporary loss of service.
1. Edit the /etc/sysconfig/nfs file.
2. Change the following line:
From: #STATDARG=""
To: STATDARG="-n `uname -n`"
3. Reboot the node.
Guest VM: ib_ipoib Module
Benefit / Impact Having the ib_ipoib module loaded ensures that the Internet Protocol (IP) works properly over InfiniBand.
Risk If the ib_ipoib module is not loaded, InfiniBand may not work properly.
Action / Repair Load the module through /etc/infiniband/openib.conf.
Guest VM: Recent Critical Error
Benefit / Impact System stability is important for supporting the applications that run on Exalogic. When an unexpected critical error in a node is discovered, action can be taken to fully restore the service and to resolve the potential cause of the problem.
Risk An unexpected critical error within a node may lead to problems and outages.
Action / Repair If the critical errors in the recent reboot were expected, ignore this warning. Otherwise, investigate further by examining the log files /var/log/ovs-agent.log*. If the problem persists, open an SR with Oracle Support.
Note 1501348.1 - Resolving OVS issues in Exalogic
Guest VM: Recent Reboot Info
Benefit / Impact System stability is important for supporting the applications that run on Exalogic. When an unexpected and recent reboot of a node is discovered, action can be taken to fully restore the service and to resolve the potential cause of the problem.
Risk An unexpected and recent reboot of a node may lead to problems and outages.
Action / Repair If the recent reboot was intentional or expected, ignore this warning. Otherwise, investigate why this compute node rebooted unexpectedly. If the problem persists, open an SR with Oracle Support.
Guest VM: IPoIB in Connected Mode
Benefit / Impact Having IPoIB in Connected Mode ensures that the Internet Protocol (IP) works properly over the InfiniBand network.
Risk If Connected Mode is not set, IPoIB might not work properly.
Action / Repair
1. If the OFED version is R2, check the following:
- In the /etc/sysconfig/network-scripts/ifcfg-ib* files, check whether "CONNECTED_MODE=yes" is set.
- If "CONNECTED_MODE" is not set to "yes", modify the /etc/sysconfig/network-scripts/ifcfg-ib* file and change the property to "yes".
2. If the OFED version is not R2, check the following:
- If SET_IPOIB_CM and/or IPOIB_LOAD is not set to "yes", modify the /etc/infiniband/openib.conf file and change these properties to "yes".
- If the content of an InfiniBand interface's mode file (for example, /sys/class/net/ib0/mode) is not "connected", change the content of that file to "connected".
After modifying the files above, restart InfiniBand by running the command:
/etc/init.d/openibd restart
Note 1982645.1 - Exachk Reporting "IPoIB is not in connected mode" WARNING Message On Exalogic 2.0.6.2.0 Linux Physical Racks
Guest VM: Kernel Out-of-Memory Errors
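For the IPoIB Connected Mode check, a minimal shell sketch for inspecting the per-interface mode files. The function name list_non_connected_ib is hypothetical, and the assumption that all InfiniBand interfaces appear as ib* directories under /sys/class/net is an illustration rather than something Exachk documents.

```shell
# Hypothetical helper (not part of Exachk): print each ib* interface under a
# sysfs-style directory whose "mode" file does not read "connected".
list_non_connected_ib() {
    local root="${1:-/sys/class/net}"
    local f mode status=0
    for f in "$root"/ib*/mode; do
        [ -r "$f" ] || continue    # no ib* interfaces present
        mode=$(cat "$f")
        if [ "$mode" != "connected" ]; then
            echo "$(basename "$(dirname "$f")"): $mode"
            status=1
        fi
    done
    return "$status"
}
# Usage: list_non_connected_ib
```

The function only reads the mode files. It returns a nonzero status if any interface reports a mode other than "connected".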
Benefit / Impact A kernel out-of-memory error indicates a potential resource issue.
Risk If the cause of the kernel out-of-memory error is not identified, service can be disrupted.
Action / Repair Check the kernel log, /var/log/messages*, and identify the process and the cause of the out-of-memory error.
Guest VM: Local Partition Usage Limit
Benefit / Impact Keeping enough free space on the local disk allows the system to operate optimally.
Risk System performance will be affected.
Action / Repair Free up disk space on the local disk. Oracle recommends that most, if not all, user data be stored on the storage appliance.
Guest VM: MTU for Ethernet Link
Benefit / Impact A correct MTU size for the Ethernet link ensures that the communication protocol layer within Ethernet performs optimally.
Risk An incorrect MTU size may slow down the Ethernet link and cause latency issues.
Action / Repair Investigate and set the MTU for the Ethernet link to the correct size.
Guest VM: ZCOPY Configuration
Benefit / Impact A proper zcopy configuration is required for the Exalogic machine to perform optimally.
Risk An incorrect zcopy configuration can affect performance.
Action / Repair Add sdp_zcopy_thresh=0, recv_poll=0 to the /etc/modprobe.conf file.
Guest VM: Consistent Hardware Clock Timezone Reference
Benefit / Impact Having the same time zone reference (UTC) for the hardware clock avoids any potential time skew between nodes before the time is synchronized with the NTP server.
Risk Different time zone settings across machines may cause job scheduling issues.
Action / Repair
1. Log in to the server as root.
2. Run the command "cat /etc/adjtime".
3. Make sure the third line indicates UTC instead of LOCAL. If it shows UTC, the hardware clock is configured correctly. If it shows LOCAL, run the following repair steps:
- Make sure the system time is correctly synchronized with an NTP server.
- Run the following command to change the hardware clock to use UTC: "hwclock --utc --systohc"
Guest VM: Bonding of InfiniBand Interfaces
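For the hardware clock time zone check, the inspection of the third line of /etc/adjtime can be scripted. This is a sketch only; the function name hwclock_uses_utc is hypothetical, and it reads the file without modifying the hardware clock.

```shell
# Hypothetical helper (not part of Exachk): succeed only if the third line
# of an adjtime-style file reads UTC.
hwclock_uses_utc() {
    local adjtime="${1:-/etc/adjtime}"
    [ "$(sed -n '3p' "$adjtime")" = "UTC" ]
}
# Usage: hwclock_uses_utc && echo "hardware clock is kept in UTC"
```

A nonzero return status corresponds to the LOCAL case described in the repair steps for that check.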
Benefit / Impact The InfiniBand interfaces are a communication link between the various components of the Exalogic machine. To maintain high availability (HA) with the IPoIB interface, the InfiniBand interfaces must be bonded correctly.
Risk Without proper bonding of the InfiniBand interfaces, the Exalogic machine cannot maintain high availability (HA) if one of the communication links goes down. Performance can also be affected.
Action / Repair Investigate the bonding in the /etc/sysconfig/network-scripts/ifcfg-ib* files for each applicable pkey.
Guest VM: Disabled Automatic Path Migration(APM)
Benefit / Impact There is a compatibility issue between the OFA software version on Exadata (1.5.1) and the version on Exalogic (1.5.5). The SDP protocol fails because of a new feature, Automatic Path Migration (APM), which is enabled in Exalogic by default but not yet supported by the OFED version in Exadata. This triggers the error "RDMA CMA: unexpected IB CM event: 13". Disabling APM ensures that the SDP protocol works properly in this particular case.
Risk Enabling APM on an Exalogic machine that is connected to Exadata can lead to problems and outages related to SDP protocol failure.
Action / Repair Consult the following MOS note:
Note 1588546.1 - SDP Connection in inter-connected Exalogic and Exadata stopped working
Guest VM: MTU for InfiniBand Link
Benefit / Impact A correct MTU size for the InfiniBand link ensures that the communication protocol layer in InfiniBand performs optimally.
Risk An incorrect MTU size can slow down the InfiniBand link and cause latency issues.
Action / Repair Refer to the MOS note below (Doc ID 1624434.1).
Note 1624434.1 - Revised MTU Tuning Recommendations for the IPoIB Related Network Interfaces on Exalogic Physical and Virtual Environment
Guest VM: Free Physical Memory
Benefit / Impact An adequate amount of free physical memory ensures that the Exalogic Elastic Cloud performs at its optimal level.
Risk If there is not enough free physical memory, problems and outages may occur.
Action / Repair Free memory is calculated from the values listed in /proc/meminfo, using the following formula:
Free Memory = MemFree + Buffers + SReclaimable + Cached - Shmem
The free memory should be at least 20% of the total memory (MemTotal).
Clear the memory cache by running this command:
sync; echo 3 > /proc/sys/vm/drop_caches
Guest VM: RPM Database Corruption
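For the Free Physical Memory check, the formula can be evaluated directly against /proc/meminfo. The sketch below is illustrative: the function name free_memory_percent is hypothetical, and it simply applies the formula from that check and prints the result as a percentage of MemTotal.

```shell
# Hypothetical helper (not part of Exachk): apply
#   MemFree + Buffers + SReclaimable + Cached - Shmem
# to a meminfo-style file and print the result as a percentage of MemTotal.
free_memory_percent() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        { sub(/:$/, "", $1); kb[$1] = $2 }    # "MemFree: 123 kB" becomes kb["MemFree"] = 123
        END {
            free = kb["MemFree"] + kb["Buffers"] + kb["SReclaimable"] + kb["Cached"] - kb["Shmem"]
            printf "%.1f\n", 100 * free / kb["MemTotal"]
        }
    ' "$meminfo"
}
# Usage: free_memory_percent
```

A printed value below 20 corresponds to the condition that the check warns about.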
Benefit / Impact If an RPM query returns an error, any RPM operation is likely to fail. Because upgrades and patching require RPM, they would also fail.
Risk The upgrade process cannot proceed until the errors with the RPM installation are fixed.
Action / Repair Run rpm -qa. If the command runs without any issue, proceed with the upgrade installation. If the RPM query returns a lock issue, refer to the MOS note below to fix the issue.
Note 1599404.1 - Error received while executing rpm commands - "rpmdb: Lock table is out of available locker entries"
Guest VM: Cross check hostname with /etc/hosts in Guest VMs
Benefit / Impact Applications refer to the hostname in /etc/hosts. A matching entry keeps application functionality from being interrupted.
Risk Application functionality might be interrupted.
Action / Repair Check the hostname by running "hostname -s" and add the output, together with the IP address, to /etc/hosts.
Guest VM: CPU CAP for Virtual Machine Configuration File in Oracle Virtual Server
Benefit / Impact When the CPU Cap is configured to be less than 100% (through EMOC), several issues related to CPU soft lockups and vServer hangs have been reported on Exalogic. When the CPU Cap is configured to be 100% through EMOC, it is translated to cpu_cap=0 in vm.cfg, which is the value that should be configured.
Risk When the CPU Cap is configured to be less than 100% (through EMOC), several issues related to CPU soft lockups and vServer hangs have been reported on Exalogic.
Action / Repair Refer to the following MOS note:
Note 1912480.1 - Setting CPU CAP to be less than 100% is not supported for Guest vServers on Exalogic
Guest VM: Bash Vulnerability Update Check
Benefit / Impact Applying the bash vulnerability patch addresses a critical security vulnerability in bash that allows a malicious user to execute arbitrary commands and gain unauthorized access to the system.
Risk Not applying the patch exposes the Exalogic machine to a critical security vulnerability that can allow a malicious user to execute arbitrary commands and gain unauthorized access to the system.
Action / Repair See the following links for more information about CVE-2014-6271 and for the fix for the bash code injection vulnerability.
Vulnerability Summary for CVE-2014-6271 (http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-6271)
Note 1930090.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability Document for Oracle Solaris
Note 1930120.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability Document for Oracle Linux
Oracle Security Alert for CVE-2014-7169 (http://www.oracle.com/technetwork/topics/security/alert-cve-2014-7169-2303276.html)
Note 1929881.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability for Oracle Exalogic Linux Physical and Virtual Racks
Guest VM: OL6 Guest vServer Performance Check
Benefit / Impact Under high load, active processes need to be assigned a long enough time slice during critical execution time.
Risk With the Oracle Linux 6 kernel, under high load, active processes are not assigned a long enough time slice during critical execution time.
Action / Repair Contact Oracle Support to follow the steps in the following note.
<Note 1980462.1>: Performance Regression in OL6 Guest vServers compared to OL5 Guest vServers on Exalogic
Guest VM: IPoIB in Connected Mode for OEL6
Benefit / Impact Having IPoIB in Connected Mode ensures that the Internet Protocol (IP) works properly over the InfiniBand network.
Risk If Connected Mode is not set, IPoIB might not work properly.
Action / Repair
1. If the OFED version is R2, check the following:
- In the /etc/sysconfig/network-scripts/ifcfg-ib* files, check whether "CONNECTED_MODE=yes" is set.
- If "CONNECTED_MODE" is not set to "yes", modify the /etc/sysconfig/network-scripts/ifcfg-ib* file and change the property to "yes".
2. If the OFED version is not R2, check the following:
- If SET_IPOIB_CM and/or IPOIB_LOAD is not set to "yes", modify the /etc/rdma/rdma.conf file and change these properties to "yes".
- If the content of an InfiniBand interface's mode file (for example, /sys/class/net/ib0/mode) is not "connected", change the content of that file to "connected".
After modifying the files above, restart InfiniBand by running the command:
/etc/init.d/openibd restart
Note 1982645.1 - Exachk Reporting "IPoIB is not in connected mode" WARNING Message On Exalogic 2.0.6.2.0 Linux Physical Racks
Guest VM: ZCOPY Configuration for OEL6
Benefit / Impact A proper zcopy configuration is required for the Exalogic machine to perform optimally.
Risk An incorrect zcopy configuration can affect performance.
Action / Repair Add sdp_zcopy_thresh=0, recv_poll=0 to the /etc/modprobe.d/id_sdp.conf file.
Guest VM: Eport_State_Enforce Status
Benefit / Impact On releases running the Oracle Enterprise Linux operating system, if the Ethernet link used by a vNIC goes down, the bond configured with that vNIC does not detect it. By default, the bond detects only the state of the physical link it is using, which is the InfiniBand link. It does not detect the state of the Ethernet port that the vNIC is connected to. The eport_state_enforce=1 flag must be present in /etc/modprobe.conf for this failure to be detected and for failover to occur.
Risk Without the eport_state_enforce=1 flag in /etc/modprobe.conf, a network outage occurs if one of the links fails.
Action / Repair Make sure eport_state_enforce=1 is set in the /etc/modprobe.conf file.
Note 1512139.1 - Oracle Exalogic Elastic Cloud Known Issues - Virtualization Release
Note 1436514.1 - Exalogic: VNIC 10gb Bond Network Ethernet Link Failover Detection
Guest VM: Eport_State_Enforce Status for OEL6
Benefit / Impact On releases running the Oracle Enterprise Linux operating system, if the Ethernet link used by a vNIC goes down, the bond configured with that vNIC does not detect it. By default, the bond detects only the state of the physical link it is using, which is the InfiniBand link. It does not detect the state of the Ethernet port that the vNIC is connected to. The eport_state_enforce=1 flag must be present in /etc/modprobe.d/mlx4_vnic.conf for this failure to be detected and for failover to occur.
Risk Without the eport_state_enforce=1 flag in /etc/modprobe.d/mlx4_vnic.conf, a network outage occurs if one of the links fails.
Action / Repair Make sure eport_state_enforce=1 is set in the /etc/modprobe.d/mlx4_vnic.conf file.
Note 1512139.1 - Oracle Exalogic Elastic Cloud Known Issues - Virtualization Release
Note 1436514.1 - Exalogic: VNIC 10gb Bond Network Ethernet Link Failover Detection
Guest VM: OVS Partition Usage Limit
Benefit / Impact Keeping enough free space allows the system to operate optimally.
Risk vDC functionality will be affected.
Action / Repair Free up disk space on these filesystems:
- /nfsmnt/*
- /poolfsmnt/*
- /OVS/Repositories/*
- /var/lib/xenstored
- local disk
Guest VM: Virtual Memory Tuning for DomU
Benefit / Impact The tuning for virtual memory consists of two components:
1. vm.dirty_background_ratio <= 10. The default value of this ratio is 10%. With this value, the kernel is forced to write dirty pages to disk when their size reaches 9.6 GB (10% of 96 GB). Oracle recommends tuning this parameter down to 3% to smooth out the I/O traffic.
2. vm.min_free_kbytes = 524288 KB (512 MB) for DomU. The default value of this parameter is 32 MB. Oracle recommends increasing this parameter to account for the large MTU size within an IPoIB network, which is currently 64K.
Risk Without this tuning, the kernel may not perform at an optimum level.
Action / Repair Edit the /etc/sysctl.conf file and set the corresponding tuning parameters as specified in the Benefit / Impact section above.
Guest VM: Ghost Vulnerability
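For the Virtual Memory Tuning check, the arithmetic behind the vm.dirty_background_ratio recommendation can be worked through in a few lines. The helper name dirty_background_threshold_gb is hypothetical, and the sketch follows that check's simplification that the ratio applies to the full 96 GB of memory.

```shell
# Hypothetical helper (not part of Exachk): amount of dirty data, in GB, at
# which background writeback starts, given total memory in GB and
# vm.dirty_background_ratio in percent.
dirty_background_threshold_gb() {
    awk -v mem="$1" -v ratio="$2" 'BEGIN { printf "%.2f\n", mem * ratio / 100 }'
}
# Usage:
#   dirty_background_threshold_gb 96 10   # default ratio
#   dirty_background_threshold_gb 96 3    # recommended ratio
```

With the default ratio the threshold is 9.60 GB. With the recommended 3% it drops to 2.88 GB, so writeback starts earlier and in smaller bursts.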
Benefit / Impact For the whole system to work optimally, and to avoid problems related to the Ghost vulnerability, the installed image must be verified to be at its latest supported version.
Risk A system that is not patched is exposed to a security vulnerability.
Action / Repair Apply Patch 20448956, following the steps in the referenced MOS note, and rerun exachk after patching to validate.
Note 1965975.1 - CVE-2015-0235 - Ghost Vulnerability - Patch Availability for Oracle Exalogic Linux Physical and Virtual Racks (Doc ID 19
References
<NOTE:1967979.1> - Performance Degradation issues in Exalogic X2-2 Racks when ZFS 7320 Appliance configured with 24GB RAM
<NOTE:1449226.1> - Exachk Health-Check Tool for Exalogic