Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1463157.1
Update Date:2017-05-29
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  1463157.1 :   Exalogic Exachk Diagnostic Information and Suggested Actions  


Related Items
  • Oracle Exalogic Elastic Cloud Software
  • Exalogic Elastic Cloud X5-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core




In this Document
Purpose
Scope
Details
 Compute Nodes Linux
 Compute Node: Hardware and Firmware Profile
 Compute Node: Software Profile
 Compute Node: NTP Synchronization
 Compute Node: IB Startup Sequence
 Compute Node: NFS Mount Point - Version
 Compute Node: NFS Mount Point - Attribute Caching
 Compute Node: NFS Mount Point - Rsize Wsize
 Compute Node: NIS domain with NFSv4 (ypbind)
 Compute Node: DNS Setup
 Compute Node: IP Configuration - eth0 and bond0
 Compute Node: Swap Space
 Compute Node: Free Physical Memory
 Compute Node: Virtual Memory Tuning for Dom0 & Physical
 Compute Node: TCP Tuning
 Compute Node: MTU for Infiniband Link in Compute Node
 Compute Node: MTU for Ethernet Link in Compute Node
 Compute Node: ib_ipoib Module
 Compute Node: ib_sdp Module
 Compute Node: IPoIB in Connected Mode
 Compute Node: Enabled Cache on Local SSD
 Compute Node: EoIB Setup
 Compute Node: Correct Slot Installation for IB Card
 Compute Node: Subnet Manager
 Compute Node: Lockd Configuration
 Compute Node: BIOS Settings
 Compute Node: Consistent Hardware Clock Timezone Reference
 Compute Node: OVS Cluster Connectivity (Livenodes)
 Compute Node: OVS Server Pool Virtual IP Ping Test
 Compute Node: vServer Stale Lock
 Compute Node: Recent Reboot Info
 Compute Node: Recent Critical Error
 Compute Node: Connectivity To OVMM
 Compute Node: Local Disk Usage Limit
 Compute Node: OVS Agent Status
 Compute Node: Orphan Image File
 Compute Node: OVS Pool File System
 Compute Node: Bonding of InfiniBand Interface
 Compute Node: ZCOPY Configuration
 Compute Node: Ulimit
 Compute Node: MegaCLI Status
 Compute Node: Consistency with DNS on the Physical Compute Node
 Compute Node: Disabled LRO on OVS
 Compute Node: Corruption in dom0 Partition Key Table
 Compute Node: Host interconnect (usb0) Disabled
 Compute Node: PCI 64-Bit Resource Allocation Disabled
 Compute Node: RPM Database Corruption for Control VM
 Compute Node: RPM Database Corruption
 Compute Node: Memory Recommendation for dom0
 Compute Node: RAID Battery Level
 Compute Node: Xen Vulnerability Patch Verification for Oracle Virtual Server of 2.0.6.x.x
 Compute Node: Non-2.0.6.x.x VM with 12 vCPUs and 32G Memory
 Compute Node: Check for Version 2.0.1.x.x Template, Virtual Machine or Large VM in Version 2.0.4.x.x Environment
 Compute Node: Check for version 2.0.4.x.x Template and Virtual Machine in 2.0.6 Environment
 Compute Node: Check for Version 2.0.1.x.x Template, Virtual Machine or Large VM in Version 2.0.6.x.x Environment
 Compute Node: Check for Version 2.0.6.x.x Template and Virtual Machine in Version 2.0.4.x.x Environment
 Compute Node: Resource Control Utility Information
 Compute Node: CPU CAP for Virtual Machine Configuration File in Oracle Virtual Server
 Compute Node: Check for unknown files in OVS repositories
 Compute Node: BIOS SR-IOV Status
 Compute Node: RAID Configuration of Local Disks
 Compute Node: Check Dom0 Kernel Memory Slab Usage of Size-192
 Compute Node: IPoIB in Connected Mode for OEL6
 Compute Node: Ibswitches Information Validation
 Compute Node: Eport_State_Enforce Status
 Compute Node: Detect EM Agent On Dom0
 Compute Node: Eport_State_Enforce Status for OEL6
 Compute Node: Check CPUspeed Governor Setting
 Compute Node: lro_num=0 in mlx4_vnic.conf
 Compute Node: ARI in BIOS Setting Enabled
 Compute Node: Grub Conf Settings for Dom0
 Compute Node: Disabled Automatic Path Migration(APM)
 Compute Nodes Solaris
 Compute Node: Software Profile
 Compute Node: NTP Synchronization
 Compute Node: DNS Setup
 Compute Node: Correct Slot Installation of IB Card for Solaris
 Compute Node: Subnet Manager
 Compute Node: Root Partition Usage Limit for Solaris
 Compute Node: Lockd Configuration for Solaris Compute Node
 Compute Node: ib_ipoib Module for Solaris
 Compute Node: ib_sdp Module for Solaris
 Compute Node: IP Configuration - net0 and bond0
 Compute Node: Recent Reboot Info for Solaris
 Compute Node: Probe Based IPMP for Solaris
 Compute Node: Swap Space for Solaris
 Compute Node: Free Physical Memory for Solaris
 Compute Node: MTU for Solaris
 Compute Node: IPMP Configuration for Solaris
 Compute Node: Fault Management Log for Solaris
 Compute Node: BIOS Settings
 Compute Node: NFS Mount Point - Version for Solaris
 Compute Node: Hostname Consistency with DNS on the Physical Compute Node
 Compute Node: NFS Mount Point - Attribute Caching for Solaris
 Compute Node: NFS Mount Point - Rsize Wsize for Solaris
 Compute Node: TCP Protocol on NFS Mount Point for Solaris
 Compute Node: RAID Battery Level
 Compute Node: IP Configuration in /etc/hosts for Solaris
 Compute Node: Check Solaris CACAO Publisher Setting
 Compute Node: NIS domain (YPBind) for Solaris
 Switches
 Switch: /conf/configvalid File
 Switch: EoIB Data SL
 Switch: EoIB Control SL
 Switch: Localhost Configuration
 Switch: Free Physical Memory
 Switch: Unused VNICS
 Switch: Opensm
 Switch: List Link Up
 Switch: Environment Test
 Switch: Ibstat
 Switch: SNMP Daemon
 Switch: Number of Partition Keys on Bridge-X Ports
 Switch: Host Config VNIC
 Switch: Pre-upgrade check on switch memory and disk space
 Switch: VLAN PKEY PAIR Information for Switch
 Switch: Validate No Stale Partition Key Temporary File Exists
 Switch: Validate Partition Keys Are Using Latest Format
 Switch: /conf/configvalid File for Spine Switch
 Switch: Version Consistency on All Switches
 Switch: Life Expectancy for SW
 Switch: Consistent Subnet Manager across Switches
 Storage Nodes
 Storage Node: Backend (chkBackend.aksh)
 Storage Node: Cluster (chkCluster.aksh)
 Storage Node: Datasets (chkDatasets.aksh)
 Storage Node: Shadow Migrated Shares (chkShadow.aksh)
 Storage Node: Space Utilization (chkSpace.aksh)
 Storage Node: Lockd Servers(chkLockd.aksh)
 Storage Node: IPMP Failback Configuration (chkIPMPFailback.aksh)
 Storage Node: IPMP Standby Configuration (chkIPMPStandby.aksh)
 Storage Node: ZFS Snapshot Visibility
 Storage Node: L2ARC Header Size
 Storage Node: ZFS Block Size
 Storage Node: ZFS Maintenance Status
 Storage Node: ZFS DNS Configuration
 Storage Node: NFSv4 Lock Object Leak
 Storage Node: Nfsmapid Domain Matching with NIS server
 Storage Node: Softring Workflow
 Storage Node: ZFSSA Analytics Retention Policy
 Storage Node: ZFS Check Head Status
 Storage Node: ZFS Mirror Profile Status
 Storage Node: ZFS Share Quota
 Storage Node: Check for ZFSSA Installed Ram
 Storage Node: ZFS Dedup Status
 Storage Node: ZFS Disk Timeout Warning
 Storage Node: ZFS Disk Health
 Storage Node: NFSv4 Delegation
 Storage Node: IPMP configuration on ZFS node
 Storage Node: ZFS Slot Health
 Storage Node: Verify ZFS node disk storage pools
 Oracle VM Manager (OVMM)
 OVMM: Oracle VM Manager (OVMM) Service Status
 OVMM: Database Corruption
 OVMM: Sufficient CPU resources for the Oracle VM Manager
 OVMM: OVMM Pool VM Start Policy
 OVMM: Check Connection Channels Before Upgrade
 OVMM: Sufficient RAM for the Oracle VM Manager
 Database (DB)
 DB: Oracle Database Service Status
 DB: Sufficient CPU resources for the Database Control vServer
 DB: Password Expiration Status for OVS User on DB Control vServer
 DB: Sufficient RAM for the Database Control vServer
 Enterprise Controller (EC)
 EC: Enterprise Controller Service Status
 EC: Excessive Jobs within EMOC
 EC: Connectivity To EMOC
 EC: Network Interface Connectivity for Control vServers
 EC: Storage Network Interface Connectivity
 EC: Compute Node (OVS) Network Interface Connectivity
 EC: Sufficient CPU Resources for the Enterprise Controller
 EC: Uce_scheduler status check
 EC: Valid Hostname within /etc/hosts in Enterprise Controller
 EC: OVS database schema BLOB corruption check
 EC: Sufficient RAM for the Enterprise Controller
 Proxy Controller (PC)
 PC: Proxy Controller Service Status
 PC: Sufficient CPU resources for the Proxy Controller
 PC: Valid Hostname within /etc/hosts in Proxy Controller
 PC: Sufficient RAM for the Proxy Controller
 Multiple Components
 Multiple Components: Kernel Out-of-Memory Errors
 Multiple Components: Control Virtual Server's Uptime
 Multiple Components: NFSv3 Usage Verification for Control vServers Shares
 Multiple Components: Gateway Configuration for non-Switch
 Multiple Components: MTU for InfiniBand Link in Control vServers
 Multiple Components: TCP Tuning for Control vServers
 Multiple Components: Swap Space for Control vServers
 Multiple Components: Lockd Configuration for Control vServers
 Multiple Components: Name Service Switch Config File Permission Status in Compute Nodes
 Multiple Components: Local Partition Usage Limit
 Multiple Components: Check Root Space in DB and EC VM Before Upgrade
 Multiple Components: Cross check hostname with /etc/hosts in Guest VMs
 Multiple Components: Check Root Space in OVM PC VM and Compute Node Before Upgrade
 Multiple Components: Bash Vulnerability Update Check
 Multiple Components: Verify ILOM open issue
 Multiple Components: Validate Control VMs JDK Version
 Multiple Components: Version Consistency on All Switches
 Multiple Components: Ghost Vulnerability
 Multiple Components: IPoIB in Connected Mode
 Multiple Components: NFS Mount Point - Attribute Caching
 Multiple Components: Free Physical Memory
 Multiple Components: MTU for Ethernet Link in Control vServers
 Cross-Components
 Cross-Component: Firmware Version Consistency for Storage Node
 Cross-Component: NTP Configuration for Control vServers
 Cross-Component: NTP Configuration Consistency with Oracle VM Server for ZFS
 Cross-Component: NTP Configuration Consistency with Physical Compute Nodes for ZFS
 Cross-Component: NTP Configuration for Compute Nodes
 Cross-Component: NTP Configuration Consistency with Oracle VM Servers for Switch Nodes
 Cross-Component: NTP Configuration Consistency with Physical Compute Nodes for Switch Nodes
 Cross-Component: Hostname Consistency with DNS on Oracle VM Server
 Cross-Component: Hostname Consistency with DNS on Switches
 Cross-Component: Stale VNICs in the Switch
 Cross-Component: OVS Repo Consistency
 Cross-Component: Non-sequential Even-numbered Gateway Instance
 Guest vServers
 Guest VM: IB Startup Sequence
 Guest VM: TCP Tuning
 Guest VM: NFS Mount Point - Attribute Caching
 Guest VM: Name Service Switch Config File Permission Status in Control vServers
 Guest VM: NTP Synchronization
 Guest VM: Swap Space
 Guest VM: Lockd Configuration
 Guest VM: ib_ipoib Module
 Guest VM: Recent Critical Error
 Guest VM: Recent Reboot Info
 Guest VM: IPoIB in Connected Mode
 Guest VM: Kernel Out-of-Memory Errors
 Guest VM: Local Partition Usage Limit
 Guest VM: MTU for Ethernet Link
 Guest VM: ZCOPY Configuration
 Guest VM: Consistent Hardware Clock Timezone Reference
 Guest VM: Bonding of InfiniBand Interfaces
 Guest VM: Disabled Automatic Path Migration(APM)
 Guest VM: MTU for InfiniBand Link
 Guest VM: Free Physical Memory
 Guest VM: RPM Database Corruption
 Guest VM: Cross check hostname with /etc/hosts in Guest VMs
 Guest VM: CPU CAP for Virtual Machine Configuration File in Oracle Virtual Server
 Guest VM: Bash Vulnerability Update Check
 Guest VM: CPU CAP for Virtual Machine Configuration File in Oracle Virtual Server
 Guest VM: OL6 Guest vServer Performance Check
 Guest VM: IPoIB in Connected Mode for OEL6
 Guest VM: ZCOPY Configuration for OEL6
 Guest VM: Eport_State_Enforce Status
 Guest VM: Eport_State_Enforce Status for OEL6
 Guest VM: OVS Partition Usage Limit
 Guest VM: Virtual Memory Tuning for DomU
 Guest VM: Ghost Vulnerability
References


Applies to:

Oracle Exalogic Elastic Cloud Software - Version 1.0.0.2.0 and later
Exalogic Elastic Cloud X5-2 Hardware
Linux x86-64
Oracle Solaris on x86-64 (64-bit)

Purpose

Exachk for Exalogic is a health-check tool designed to audit important configuration settings within an Oracle Exalogic machine. For each health check that Exachk performs, this reference document describes the benefit of the check, the risk if the check fails, and the steps to resolve a failed check.

Scope

This document is intended for anyone planning to use and run Exachk on Oracle Exalogic Engineered Machine.

Details

This document outlines the Exachk health check diagnostic information for Compute Nodes, Switches, and Storage Nodes, and for Exalogic virtualization components such as Oracle VM Manager (OVMM), Database (DB), Enterprise Controller (EC), and Proxy Controller (PC), as follows:

Compute Nodes Linux

Compute Node: Hardware and Firmware Profile

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

The Exalogic Elastic Cloud is an engineered system. Validating the hardware and firmware before the system is placed into, or returned to, production status can help avoid problems related to hardware or firmware modifications.

Risk

If the hardware and firmware are not validated, inconsistencies between components can lead to problems and outages.

Action / Repair

The output contains a few lines similar to the following:

The BIOS is at a supported version

If any result other than "at a supported version" is returned, investigate and correct the condition.

Compute Node: Software Profile

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

The Exalogic Elastic Cloud is an engineered system. Validating the software packages before the system is placed into, or returned to, production status can help avoid problems related to configuration.

Risk

If the software is not validated, inconsistencies between components can lead to problems and outages.

Action / Repair

The output contains a few lines similar to the following:

[SUCCESS]........Has supported operating system

If any result other than "SUCCESS" is returned, investigate and correct the condition.

Compute Node: NTP Synchronization

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

NTP synchronizes the system clock with an accurate time source. To ensure correct synchronization, the delay and offset values should be non-zero and the jitter value should be under 100.
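
The thresholds above can be checked mechanically. The following is a minimal sketch, not Exachk's own implementation: it assumes the peer table printed by "ntpq -pn" has been saved to a file, and the function name check_ntp_peers is hypothetical. It flags any peer whose delay or offset is zero or whose jitter is 100 or more.

```shell
# check_ntp_peers FILE   (hypothetical helper, for illustration only)
# FILE holds output saved from "ntpq -pn". Prints one line per peer and
# returns non-zero if any peer misses the thresholds described above.
check_ntp_peers() {
    awk '
        NR <= 2 { next }                      # skip the two header lines
        NF >= 10 {
            # The last three columns are delay, offset and jitter.
            delay = $(NF-2); offset = $(NF-1); jitter = $NF
            ok = (delay + 0 != 0 && offset + 0 != 0 && jitter + 0 < 100)
            printf "%s %s\n", $1, (ok ? "OK" : "WARN")
            if (!ok) bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For example, run "ntpq -pn > /tmp/peers.txt" and then "check_ntp_peers /tmp/peers.txt"; a non-zero exit status indicates that at least one peer needs attention.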

Risk

An unsynchronized system clock may lead to errors and outages.

Action / Repair

Warnings generated by the NTP Synchronization check can be caused by any of the following:

  • Older versions of the NTP package that do not work correctly when a DNS name is used for the NTP servers. In these cases, use the actual IP addresses instead.
  • A firewall blocking access to your Stratum 1 and 2 NTP servers. The firewall could be located on one of the networks between the NTP server and its time source, or it could be firewall software, such as iptables, running on the server itself.
  • The notrust nomodify notrap keywords present in the restrict statement for the NTP client. With some versions of the Fedora Core 2 implementation of NTP, clients cannot synchronize with a Fedora Core 2 time server unless the notrust nomodify notrap keywords are removed from the restrict statement of the NTP client.
  • Localhost configured as the NTP server. To fix this issue, remove localhost from the /etc/ntp.conf file.

Compute Node: IB Startup Sequence

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

To avoid inconsistencies within the Exalogic Elastic Cloud, and for network services to work properly, the openibd service must start before the network services.

Risk

Inconsistencies within the nodes can lead to problems and outages, if openibd does not start before network services start functioning.

Action / Repair

Relink openibd and mlx4_vnic_confd so that openibd starts before mlx4_vnic_confd. This can be done by running the following commands:

To relink openibd with S05:

rm -f /etc/rc3.d/*openibd; ln -s ../init.d/openibd /etc/rc3.d/S05openibd

To relink mlx4_vnic_confd with S06:

rm -f /etc/rc3.d/*mlx4_vnic_confd; ln -s ../init.d/mlx4_vnic_confd /etc/rc3.d/S06mlx4_vnic_confd
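
After relinking, the start order can be confirmed with a small helper. This is an illustrative sketch that assumes the start links follow the SNNname pattern used in the commands above; the function name start_order_ok is hypothetical.

```shell
# start_order_ok RC_DIR   (hypothetical helper, for illustration only)
# Succeeds when the openibd start link in RC_DIR has a lower sequence
# number than the mlx4_vnic_confd start link. Returns 2 if a link is missing.
start_order_ok() {
    so_dir=$1
    so_ib=$(ls "$so_dir" | grep '^S[0-9][0-9]openibd$' | head -n 1)
    so_vnic=$(ls "$so_dir" | grep '^S[0-9][0-9]mlx4_vnic_confd$' | head -n 1)
    # Both links must exist before their order can be compared.
    [ -n "$so_ib" ] && [ -n "$so_vnic" ] || return 2
    # Extract the two-digit sequence numbers that follow the leading "S".
    so_ib_seq=$(printf '%s' "$so_ib" | cut -c2-3)
    so_vnic_seq=$(printf '%s' "$so_vnic" | cut -c2-3)
    [ "$so_ib_seq" -lt "$so_vnic_seq" ]
}
```

For example, "start_order_ok /etc/rc3.d && echo ordered" prints "ordered" only when openibd is linked to start first.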

Compute Node: NFS Mount Point - Version

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Physical

Benefit / Impact

Verifying the correct configuration of mount points helps avoid performance problems related to NFS.

Risk

If the NFS mount points are not configured correctly, inconsistencies related to storage access may occur, and these can possibly lead to problems and outages.

Action / Repair

It is recommended to upgrade the NFS mount point to the latest version.

Links

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm

Compute Node: NFS Mount Point - Attribute Caching

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

By ensuring that attribute caching within NFS is not disabled, NFS mounts can perform more efficiently.

Risk

Disabling attribute caching causes extra network operations, which degrades network performance.

Action / Repair

Fix the configuration of the NFS Mount Point by removing "noac" and/or "actimeo=0" attributes from the mount points.
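
To locate the affected entries, a sketch along the following lines may help. It assumes the mount definitions live in a file with the /etc/fstab column layout (device, mount point, type, options); the function name nfs_attr_cache_disabled is hypothetical.

```shell
# nfs_attr_cache_disabled FSTAB_FILE   (hypothetical helper)
# Prints the mount point of every NFS entry whose options include
# "noac" or "actimeo=0".
nfs_attr_cache_disabled() {
    awk '
        /^[ \t]*#/ { next }                   # skip comment lines
        $3 ~ /^nfs4?$/ {
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++)
                if (opt[i] == "noac" || opt[i] == "actimeo=0") { print $2; break }
        }
    ' "$1"
}
```

Running "nfs_attr_cache_disabled /etc/fstab" lists the mount points that still need the option removed; no output means no entry disables attribute caching.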

Links

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm

Compute Node: NFS Mount Point - Rsize Wsize

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

The NFS mount options "rsize" and "wsize" specify the size of the chunks of data that the client and server pass to each other. To maintain high performance for block transfers, verify that "rsize" and "wsize" are set correctly on the mount points.

Risk

Incorrect configuration of the rsize or wsize may lead to performance degradation.

Action / Repair

Correct the configuration of the NFS mount point by modifying rsize and/or wsize properties in the mount points, to the recommended value of 131072.
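
As a hedged illustration, the following sketch reports NFS entries whose rsize or wsize is missing or differs from 131072. It assumes a file in the /etc/fstab column layout; the function name nfs_bad_rwsize is hypothetical.

```shell
# nfs_bad_rwsize FSTAB_FILE   (hypothetical helper)
# Prints "mountpoint option=value" for every NFS entry whose rsize or
# wsize is missing or differs from the recommended 131072.
nfs_bad_rwsize() {
    awk -v want=131072 '
        /^[ \t]*#/ { next }                   # skip comment lines
        $3 ~ /^nfs4?$/ {
            r = ""; w = ""
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++) {
                if (opt[i] ~ /^rsize=/) r = substr(opt[i], 7)
                if (opt[i] ~ /^wsize=/) w = substr(opt[i], 7)
            }
            if (r != want) print $2, "rsize=" r
            if (w != want) print $2, "wsize=" w
        }
    ' "$1"
}
```

An empty value in the output (for example "rsize=") means the option is not set explicitly for that mount point.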

Links

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm

Compute Node: NIS domain with NFSv4 (ypbind)

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical

Benefit / Impact

NIS is a client-server directory service protocol for distributing system configuration data. The "ypbind" daemon allows Exalogic Elastic Cloud to find the server for each NIS domain, and maintains the NIS binding information.

Risk

Without correct NIS configuration and binding information, inconsistencies related to network services may occur, which can lead to problems and outages.

Action / Repair

Verify and investigate the NIS configuration based on NFSv4

Compute Node: DNS Setup

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical

Benefit / Impact

The DNS service allows components within Exalogic Elastic Cloud to reach each other in support of its functions. Verifying the DNS setup is critical to avoid access problems between the components.

Risk

If the DNS setup is not verified, access inconsistencies between components can lead to problems and outages.

Action / Repair

Verify the DNS setup by examining /etc/resolv.conf and running the nslookup command on the local host.

Compute Node: IP Configuration - eth0 and bond0

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Correct IP configuration of eth0 and bond0 allows each Compute Node to manage hostname mapping and DNS entries.

Risk

A misconfigured /etc/hosts file causes problems when a compute node tries to reach the other nodes in the same rack.

Action / Repair

Investigate /etc/hosts and the output of the interface configuration for eth0 and bond0:
1. Verify that each IP address is listed only once in the /etc/hosts file.
2. Verify that the IP addresses reported by ifconfig for eth0 and bond0 are listed in /etc/hosts.
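
Both checks can be sketched in a few lines of shell. The function names duplicate_host_ips and ip_in_hosts are hypothetical, and the IP address to look up is assumed to have been read from the ifconfig output by hand.

```shell
# duplicate_host_ips HOSTS_FILE   (hypothetical helper)
# Prints each IP address that appears on more than one non-comment line.
duplicate_host_ips() {
    awk '
        { sub(/#.*/, "") }          # drop comments before counting
        NF >= 2 { seen[$1]++ }      # count lines that map an IP to a name
        END { for (ip in seen) if (seen[ip] > 1) print ip }
    ' "$1" | sort
}

# ip_in_hosts IP HOSTS_FILE   (hypothetical helper)
# Succeeds when IP is listed in HOSTS_FILE.
ip_in_hosts() {
    awk -v ip="$1" '
        { sub(/#.*/, "") }
        $1 == ip { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$2"
}
```

For example, "duplicate_host_ips /etc/hosts" should print nothing on a correctly configured node, and "ip_in_hosts <bond0 address> /etc/hosts" should succeed.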

Links

Network Preconfiguration
http://docs.oracle.com/cd/E18476_01/doc.220/e18479/net.htm#BHCJBICD

Adding Exalogic Machine to Your Network
http://docs.oracle.com/cd/E18476_01/doc.220/e18478/spreadsheet.htm

Compute Node: Swap Space

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

The swappiness level controls how much memory pressure must build up before the kernel starts reclaiming mapped pages. If the swap space is unused, the kernel has an adequate amount of free physical memory, which ensures that the Exalogic Elastic Cloud performs at its optimal level.

Risk

The usage of swap space indicates that the kernel is running out of free physical memory. Lack of free physical memory can lead to degraded performance.

Action / Repair

Clear up the used memory.

Compute Node: Free Physical Memory

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

An adequate amount of free physical memory ensures that the Exalogic Elastic Cloud performs at its optimal level.

Risk

If there is not enough free physical memory, problems and outages may occur.

Action / Repair

The available memory is calculated by adding Free Memory (MemFree), Reclaimable Memory (SReclaimable), Buffers, and Cached, and then subtracting Shared Memory (Shmem), as listed in the /proc/meminfo file. This value should be at least 20% of the Total Memory (MemTotal).

Clear up the memory cache by running this command:

sync; echo 3 > /proc/sys/vm/drop_caches
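
The calculation described above can be sketched as follows. This is an illustration rather than Exachk's own code; the function name free_mem_pct is hypothetical, and the field names are those used in /proc/meminfo.

```shell
# free_mem_pct MEMINFO_FILE   (hypothetical helper)
# Prints (MemFree + SReclaimable + Buffers + Cached - Shmem) as a whole
# percentage of MemTotal. All values in /proc/meminfo are in kB.
free_mem_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemFree:"      { free += $2 }
        $1 == "SReclaimable:" { free += $2 }
        $1 == "Buffers:"      { free += $2 }
        $1 == "Cached:"       { free += $2 }
        $1 == "Shmem:"        { free -= $2 }
        END { if (total > 0) printf "%d\n", free * 100 / total }
    ' "$1"
}
```

For example, "free_mem_pct /proc/meminfo" prints a number that should be 20 or higher to satisfy this check.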

Compute Node: Virtual Memory Tuning for Dom0 & Physical

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

The tuning for virtual memory consists of two components:

1. vm.dirty_background_ratio = 3

The default value of this ratio is 10%. With this value, the kernel is forced to write dirty pages to disk when their total size reaches 9.6 GB (10% of 96 GB). Oracle recommends that this parameter be tuned down to 3% to smooth out the I/O traffic.

2. vm.min_free_kbytes =

- 1048576 KB (1GB) for physical rack

- 524288 KB (512 MB) for Dom0

The default value of this parameter is 32M. Oracle recommends that this parameter be increased accordingly to account for the large MTU size within an IPoIB network, which is currently at 64K.

Risk

Without this tuning, the kernel may not perform at an optimum level.

Action / Repair

Edit the /etc/sysctl.conf file and modify the corresponding tuning parameters as specified in the Benefit / Impact section.
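
Before or after editing, the configured values can be read back with a helper such as the one below. It is a sketch under the assumption that the file uses plain "key = value" lines; the function name sysctl_conf_value is hypothetical.

```shell
# sysctl_conf_value FILE KEY   (hypothetical helper)
# Prints the last value assigned to KEY in a sysctl.conf-style file.
# Whitespace is stripped, so this suits single-valued keys only.
sysctl_conf_value() {
    awk -F= -v key="$2" '
        /^[ \t]*[#;]/ { next }                 # skip comment lines
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k); gsub(/[ \t]/, "", v)
            if (k == key) val = v              # later entries override earlier ones
        }
        END { if (val != "") print val }
    ' "$1"
}
```

For example, on a physical rack "sysctl_conf_value /etc/sysctl.conf vm.dirty_background_ratio" should print 3 and "sysctl_conf_value /etc/sysctl.conf vm.min_free_kbytes" should print 1048576.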

Compute Node: TCP Tuning

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

The tuning for TCP consists of three components:

1. net.ipv4.tcp_timestamps should be set to 1 to avoid PAWS (Protect Against Wrapped Sequence Numbers, RFC 1323) issues.
2. net.ipv4.tcp_window_scaling should be set to 1 to allow efficient data transfer over paths with a high bandwidth-delay product.
3. net.ipv4.tcp_sack should be set to 1 to enable selective acknowledgement (RFC 2018), which mitigates duplicate acknowledgement and retransmission issues.

Risk

Without this tuning, TCP may not perform at an optimum level.

Action / Repair

Add the recommended tuning parameters into the /etc/sysctl.conf file.
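
One way to add the parameters without creating duplicate entries is sketched below. The function name set_sysctl_conf is hypothetical, and the approach (rewrite through a temporary file) is an assumption rather than an Oracle-documented procedure. Back up /etc/sysctl.conf before trying it.

```shell
# set_sysctl_conf FILE KEY VALUE   (hypothetical helper)
# Rewrites FILE so that KEY is assigned VALUE exactly once: existing
# assignments of KEY are replaced in place, otherwise a line is appended.
set_sysctl_conf() {
    ssc_tmp=$(mktemp) || return 1
    awk -v key="$2" -v val="$3" '
        {
            k = $0
            sub(/=.*/, "", k); gsub(/[ \t]/, "", k)
            if (k == key) {
                if (!done) print key " = " val   # keep only the first occurrence
                done = 1
                next
            }
            print
        }
        END { if (!done) print key " = " val }
    ' "$1" > "$ssc_tmp" && cat "$ssc_tmp" > "$1"
    ssc_status=$?
    rm -f "$ssc_tmp"
    return $ssc_status
}

# Example usage (run as root):
# for key in net.ipv4.tcp_timestamps net.ipv4.tcp_window_scaling net.ipv4.tcp_sack; do
#     set_sysctl_conf /etc/sysctl.conf "$key" 1
# done
# sysctl -p    # reload the settings from /etc/sysctl.conf
```

Commented-out lines are left untouched, because their key starts with "#" and therefore does not match.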

Compute Node: MTU for Infiniband Link in Compute Node

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Correct MTU size for the InfiniBand Link ensures that the communication protocol layer within InfiniBand performs optimally.

Risk

Incorrect MTU size may slow down InfiniBand Link and cause latency issues.

Action / Repair

Please refer to <Note 1624434.1>: Revised MTU Tuning Recommendations for the IPoIB Related Network Interfaces on Exalogic Physical and Virtual Environments

Compute Node: MTU for Ethernet Link in Compute Node

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Correct MTU size for the Ethernet Link ensures that the communication protocol layer within Ethernet performs optimally.

Risk

Incorrect MTU size may slow down Ethernet Link and cause latency issues.

Action / Repair

1. Identify the Ethernet interface using ifconfig. The output contains a line such as "bond1 Link encap:Ethernet".
2. Issue the command ifconfig ${Interface} mtu 1500 up, for example: ifconfig bond1 mtu 1500 up
3. To persist the setting after reboot, edit the corresponding file /etc/sysconfig/network-scripts/ifcfg-${Interface} and add or modify the line MTU=1500.
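
To confirm that the setting has been persisted, a check like the following can be used. It is a minimal sketch; the function name mtu_persisted is hypothetical.

```shell
# mtu_persisted IFCFG_FILE EXPECTED   (hypothetical helper)
# Succeeds when IFCFG_FILE contains an uncommented MTU=EXPECTED line,
# with or without double quotes around the value.
mtu_persisted() {
    grep -Eq "^[[:space:]]*MTU=\"?$2\"?[[:space:]]*$" "$1"
}
```

For example, "mtu_persisted /etc/sysconfig/network-scripts/ifcfg-bond1 1500" succeeds once step 3 has been completed for bond1.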

Compute Node: ib_ipoib Module

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Having the ib_ipoib module loaded ensures that the Internet Protocol (IP) works properly over InfiniBand.

Risk

If the ib_ipoib module is not loaded, InfiniBand may not work properly.

Action / Repair

Load the module through /etc/infiniband/openib.conf.

Compute Node: ib_sdp Module

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Having the ib_sdp module loaded ensures that Sockets Direct Protocol (SDP) works properly over InfiniBand.

Risk

If ib_sdp module is not loaded, InfiniBand might not work properly.

Action / Repair

Load the module through /etc/infiniband/openib.conf

Compute Node: IPoIB in Connected Mode

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Having IPoIB in Connected Mode ensures that Internet Protocol (IP) works properly over InfiniBand network.

Risk

If the Connected Mode is not set, IPoIB might not work properly.

Action / Repair

1. If SET_IPOIB_CM and/or IPOIB_LOAD is not set to "yes", modify the /etc/infiniband/openib.conf file (or /etc/ofed/openib.conf in later versions of OFED) and change these properties to "yes".
2. If the content of /sys/class/net/ib0/mode and /sys/class/net/ib1/mode is not "connected", change the content of these files to "connected".

After modifying the files above, restart InfiniBand by running the command:

/etc/init.d/openibd restart
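
After the restart, the mode of each IPoIB interface can be verified with a sketch like the one below. The function name ipoib_not_connected is hypothetical; the optional directory argument exists only so the helper can be tried against a test directory.

```shell
# ipoib_not_connected [SYSFS_NET_DIR]   (hypothetical helper)
# Prints every ib* interface under SYSFS_NET_DIR (default /sys/class/net)
# whose mode file does not read "connected".
ipoib_not_connected() {
    inc_dir=${1:-/sys/class/net}
    for inc_mode_file in "$inc_dir"/ib*/mode; do
        [ -r "$inc_mode_file" ] || continue    # no match or unreadable
        if [ "$(cat "$inc_mode_file")" != "connected" ]; then
            basename "$(dirname "$inc_mode_file")"
        fi
    done
}
```

No output means that every ib* interface found is in connected mode.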

Compute Node: Enabled Cache on Local SSD

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Enabling caching ensures that Local SSD performs optimally.

Risk

If the cache is disabled within Local SSD, performance degradation may occur, and it may lead to problems and outages.

Action / Repair

The following commands are listed as an example to turn the cache on:

/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp Cached -L0 -a0
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -L0 -a0

Compute Node: EoIB Setup

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Physical

Benefit / Impact

Having EoIB set up correctly allows networking to work properly.

Risk

If EoIB is not set up, it may lead to problems and outages related to networking.

Action / Repair

Investigate the bonding and IP configuration of the VNICs listed by the "mlx4_vnic_info -l" command. In some cases, a network device may show up as __tmp instead of ethX. For more information, see the following MOS note:

<Note 1458683.1>: Network device showing __tmp instead of ethX 


REFERENCE:

Setting Up Ethernet Over InfiniBand (EoIB) on Oracle Linux (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/commproc.htm#CHDCCEBJ)

Compute Node: Correct Slot Installation for IB Card

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

For the best performance on the PCI Express interface, the adapter card should be installed in a PCIe x8 slot.

Risk

Installing the card in a slower slot limits bandwidth and performance significantly.

Action / Repair

Reinstall the card in a PCIe x8 slot, or replace the card if it is already in a PCIe x8 slot.

Compute Node: Subnet Manager

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

The subnet manager (SM) manages all operational characteristics of the InfiniBand network. The InfiniBand network typically has more than one SM, but only one SM is active at a time. The active SM is the Master SM; the others are Standby SMs.

Risk

If the active SM shuts down or fails and there is no standby SM to replace it, the InfiniBand network will fail, which can cause loss of connectivity within Oracle Exalogic Elastic Cloud.

Action / Repair

Refer to the link for detailed instructions to repair the problem.

Links

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/leafswitch.htm#CBHICGAA

Compute Node: Lockd Configuration

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Lock recovery after a reboot is critical, to maintain data integrity and to prevent unnecessary application hangs. To help rpc.statd match SM_NOTIFY requests to NLM requests, this best practice should be observed.

Risk

NFSv3 locks may not be recovered after a reboot.

Action / Repair

NOTE: Step 3 reboots the compute node, which makes it unavailable and stops the applications running on it. Prepare ahead of time for this temporary loss of service, then follow the steps below:

1. Edit /etc/sysconfig/nfs file
2. Change the following lines:
From
#STATDARG=""
To
STATDARG="-n `uname -n`"
3. Reboot the compute node.

Compute Node: BIOS Settings

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Before upgrading Oracle Exalogic Elastic Cloud (to 2.0.0.0.1 for example), it is important to ensure that the system has the recommended BIOS settings.

Risk

If the recommended BIOS settings are not followed, problems may occur during the Oracle Exalogic Elastic Cloud 2.0.0.0.1 upgrade.

Action / Repair

Refer to <Note 1608959.1>


Links:

Note 1608959.1 : Updating the BIOS Settings for X2-2, X3-2, and X4-2 Compute nodes before installing EECS on Exalogic

Oracle Hardware Management User Guide

Compute Node: Consistent Hardware Clock Timezone Reference

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0, Control vServers

Benefit / Impact

Using UTC as the common time zone reference for the hardware clock on all nodes avoids any potential time skew between nodes before the time is synchronized with the NTP server.

Risk

Different time zone settings across different machines cause some applications to have job synchronization issues.

Action / Repair

  1. Log in to the server as root.

  2. Run the command "cat /etc/adjtime".

  3. Make sure the third line indicates UTC instead of LOCAL. If it shows UTC, the hardware clock is configured correctly. If it shows LOCAL, perform the following repair steps:

    • Make sure the system time is correctly synchronized with an NTP server.
    • Run the following command to change the hardware clock to use UTC:
      "hwclock --utc --systohc"

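The check in step 3 can be sketched in Python as follows. This is a minimal illustration, not part of exachk: the function name hardware_clock_reference is hypothetical, and it assumes only that the third line of /etc/adjtime holds UTC or LOCAL, as the steps above describe.

```python
def hardware_clock_reference(adjtime_text):
    """Return the clock reference found on the third line of /etc/adjtime.

    Hypothetical helper: returns "UTC" or "LOCAL" (upper-cased), or None
    when the content has fewer than three lines.
    """
    lines = adjtime_text.splitlines()
    if len(lines) < 3:
        return None
    return lines[2].strip().upper()


if __name__ == "__main__":
    try:
        with open("/etc/adjtime") as handle:
            reference = hardware_clock_reference(handle.read())
    except OSError as exc:
        print(f"Could not read /etc/adjtime: {exc}")
    else:
        if reference == "UTC":
            print("Hardware clock uses UTC: configured correctly")
        else:
            print(f"Hardware clock reference is {reference!r}: run the repair steps")
```

The sketch only reads the file; the repair itself is still done with the hwclock command shown above.
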
Compute Node: OVS Cluster Connectivity (Livenodes)

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 2.0.1.0.0+ | Dom0

Benefit / Impact

The compute nodes in Exalogic Elastic Cloud operate as a distributed system. Cluster connectivity between them must be maintained to support running applications.

Risk

If any of the compute nodes are not live, it may lead to problems and outages.

Action / Repair

Please investigate which compute nodes are not live and check if it is due to networking issues in the cluster. If the problem persists, please open an SR with Oracle Support.

Compute Node: OVS Server Pool Virtual IP Ping Test

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 2.0.1.0.0+ | Dom0

Benefit / Impact

Ensuring that the virtual IP of the Oracle VM Server pool is up and running is crucial to ensure that the Server Pool Master is accessible.

Risk

If the virtual IP of the Oracle VM Server pool is down, Server Pool Master cannot be accessed.

Action / Repair

Please check your Oracle VM Server master compute node and ensure everything is functional. One possible workaround is as follows:

1) Reboot the Oracle VM Server master compute node.
2) Reboot the rest of the compute nodes in the pool.

If the problem persists, please contact Oracle Support.

Compute Node: vServer Stale Lock

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 2.0.1.0.0+ | Dom0

Benefit / Impact

A lock indicates that a vServer is already running on one of the hypervisors in the server pool. A lock file that remains even when the vServer is not running due to some unexpected error is a stale lock file. The vServer with that lock will then not be able to start.

Risk

A vServer cannot be started even though it is not running.

Action / Repair

Please refer to the following MOS notes for details:
1) Note 1474565.1 Acquire running lock failed.
2) Note 1380333.1 Oracle VM 3.0: Error: Acquire running

If the problem persists, please contact Oracle Support.

Compute Node: Recent Reboot Info

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Physical and Dom0

Benefit / Impact

Ensuring the stability of the system is important to support applications running on Exalogic. By discovering an unexpected and recent reboot of a compute node, action can be taken to fully restore the service and resolve the potential cause of the problem.

Risk

An unexpected and recent reboot of a compute node may lead to problems and outages.

Action / Repair

If the recent reboot was intentional or expected, please ignore this warning. Otherwise, please investigate why this compute node rebooted unexpectedly. If the problem persists, please contact Oracle Support.

Compute Node: Recent Critical Error

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 2.0.1.0.0+ | Dom0

Benefit / Impact

Ensuring the stability of the system is important to support applications running on Exalogic. By discovering unexpected critical errors in a compute node, action can be taken to fully restore the service and resolve the potential cause of the problem.

Risk

An unexpected critical error within a compute node may lead to problems and outages.

Action / Repair

If the critical errors in the recent reboot were expected, please ignore this warning. Otherwise, please investigate further by looking at the log file /var/log/ovs-agent.log*.

If the problem persists, please contact Oracle Support.

Links

Note 1501348.1 - Identifying And Resolving Oracle VM Issues In Exalogic Virtual Environment

Compute Node: Connectivity To OVMM

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 2.0.1.0.0+ | Dom0

Benefit / Impact

For the virtual data center management system to work, all the different Exalogic control components must be running.

Risk

The unavailability of any Exalogic control component will result in a loss of functionality in the management of the virtual datacenter.

Action / Repair

The failed Exalogic Elastic Cloud Software (EECS) component must be restarted.

Links

Note 1501228.1 - How To Start A Stopped Exalogic Control Stack In An Exalogic Virtual Environment

Compute Node: Local Disk Usage Limit

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical and Dom0

Benefit / Impact

Keeping enough local disk space free ensures the compute node can operate optimally.

Risk

Insufficient free space on the local disk degrades the performance of the compute node.

Action / Repair

Free up disk space on the local disk. Oracle recommends most, if not all, user data be stored on the storage appliance.
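
A quick way to see how full the local disk is can be sketched in Python as follows. This is an illustration only: the source does not state the usage limit that exachk applies, so USAGE_LIMIT_PERCENT is a hypothetical value, and the helper names are assumptions.

```python
import shutil

# Hypothetical threshold: the usage limit applied by exachk is not stated here.
USAGE_LIMIT_PERCENT = 80.0


def percent_used(total_bytes, used_bytes):
    """Return the used share of a file system as a percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def exceeds_limit(total_bytes, used_bytes, limit_percent=USAGE_LIMIT_PERCENT):
    """Return True when usage is above the (assumed) limit."""
    return percent_used(total_bytes, used_bytes) > limit_percent


if __name__ == "__main__":
    usage = shutil.disk_usage("/")  # named tuple: total, used, free
    used = percent_used(usage.total, usage.used)
    state = "above" if exceeds_limit(usage.total, usage.used) else "within"
    print(f"/ is {used:.1f}% used, {state} the assumed limit")
```

Adjust the threshold to whatever limit your exachk report shows for this check.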

Compute Node: OVS Agent Status

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 2.0.1.0.0+ | Dom0

Benefit / Impact

Oracle VM Manager communicates with the Oracle VM Agent to create and manage guests on an Oracle VM Server.

Risk

If the Oracle VM Agent is not running, problems and outages related to the management of guest VMs will occur.

Action / Repair

Investigate the issue and notify Oracle Support for further assistance.

Compute Node: Orphan Image File

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 2.0.1.0.0+ | Dom0

Benefit / Impact

Removing orphan image files frees up disk space.

Risk

The image files of orphan virtual disks occupy disk space. This disk space is wasted and cannot be used for other data.

Action / Repair

Remove the orphan image files indicated by exachk after verifying that they are not being used by any vServer.

Compute Node: OVS Pool File System

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
FAIL | All Types | Linux | 2.0.1.0.0+ | Dom0

Benefit / Impact

Without the ovspoolfs storage share properly mounted, vServers hosted by Oracle VM Server will not work correctly.

Risk

If any of the vServers do not have this system mount point, it may lead to problems and outages.

Action / Repair

Please investigate whether there is a storage connectivity issue. If the problem persists, please open an SR with Oracle Support.

Compute Node: Bonding of InfiniBand Interface

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical and Dom0

Benefit / Impact

The InfiniBand interfaces are a communication link between various components of the Exalogic machine. To maintain high availability (HA) with the IPoIB interface, the InfiniBand interfaces must be bonded correctly.

Risk

Without proper bonding of the InfiniBand interfaces, the Exalogic machine cannot maintain high availability (HA) if one of the communication links goes down. It can also affect performance.

Action / Repair

Investigate the bonding in the /etc/sysconfig/network-scripts/ifcfg-ib* files for each applicable pkey.

Compute Node: ZCOPY Configuration

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical

Benefit / Impact

Proper zcopy configuration must be ensured for the Exalogic machine to perform optimally.

Risk

An incorrect zcopy configuration can affect performance.

Action / Repair

Add sdp_zcopy_thresh=0, recv_poll=0 to the /etc/modprobe.conf file.

Compute Node: Ulimit

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical

Benefit / Impact

The ulimit parameter specifies the maximum number of open processes that a user can have running. This parameter must meet the specification set in the base image to ensure performance is optimal.

Risk

If the value of the ulimit parameter is too low, it can have an impact on performance due to the Exalogic machine not being able to open processes.

Action / Repair

Add the following line to the file ~/.bashrc.

ulimit -s value

Replace value with the value you want to change the ulimit to. Oracle recommends that it should be at least 65536, as set in the base image specification.

Compute Node: MegaCLI Status

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical and Dom0 (All Compute Nodes)

Benefit / Impact

The RAID disks serve as redundant data storage. Monitoring the RAID disks for failed or degraded disks helps maintain high availability (HA) in an Exalogic machine.

Risk

Failed or degraded RAID disks can affect the high availability of Exalogic and affect performance.

Action / Repair

Contact Oracle Support.

Compute Node: Consistency with DNS on the Physical Compute Node

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical and Dom0 (All Compute Nodes)

Benefit / Impact

A correct hostname that matches the DNS prevents network configuration issues.

Risk

An incorrect hostname that does not match the DNS may cause configuration issues. It can also cause Exachk to report wrong results.

Action / Repair

You must determine if it is an error in the host or in the DNS entry. If it is the host, fix the hostname by changing the value of the parameter HOSTNAME in the /etc/sysconfig/network file.
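
To see which side disagrees, the local hostname can be compared with the name that DNS returns for it. The following Python sketch is an illustration under stated assumptions: the helper names are hypothetical, and treating two names as matching when their short (unqualified) forms are equal is an assumption, not necessarily the rule exachk applies.

```python
import socket


def short_name(name):
    """Return the lower-cased host part of a possibly fully qualified name."""
    return name.strip().rstrip(".").split(".")[0].lower()


def hostname_matches_dns(local_hostname, dns_name):
    """Assumed rule: names match when their short forms are equal."""
    return short_name(local_hostname) == short_name(dns_name)


if __name__ == "__main__":
    local = socket.gethostname()
    try:
        address = socket.gethostbyname(local)        # forward lookup
        dns_name = socket.gethostbyaddr(address)[0]  # reverse lookup
    except OSError as exc:
        print(f"Lookup failed for {local}: {exc}")
    else:
        verdict = "matches" if hostname_matches_dns(local, dns_name) else "differs from"
        print(f"Hostname {local} {verdict} DNS name {dns_name}")
```

If the names differ, decide whether the host or the DNS entry is wrong before changing HOSTNAME in /etc/sysconfig/network.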

Compute Node: Disabled LRO on OVS

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

The Large Receive Offload (LRO) option offers the lowest CPU utilization for receivers and is enabled by default in the driver, but it is incompatible with routing/IP forwarding and bridging. It must therefore be disabled for the Oracle Virtual Server network bridge to work correctly.

Risk

Any VM using dom0 bridging would have extremely poor network performance with LRO enabled.

Action / Repair

  1. Add the following line to ifcfg-eth0 in dom0:
    ETHTOOL_OFFLOAD_OPTS="lro off"
  2. Restart the network service by running the following command:
    service network restart

Compute Node: Corruption in dom0 Partition Key Table

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

In EECS 2.0.6, deploying vServers with an EECS 2.0.1.1.0-based template can cause issues in the network of the vServers. This is usually indicated by the presence of all zeros in the hardware address. You must fix this issue to ensure proper network connectivity.

Risk

A corrupted vGUID table can cause loss of network connectivity.

Action / Repair

In each port, the first 64 entries of the table (entries 0-63) should not contain the value 0x00000 as a vGUID.

To repair the issue, reboot dom0 during maintenance and avoid deploying EECS 2.0.1.1.0 based templates.

Compute Node: Host interconnect (usb0) Disabled

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

The host interconnect should be disabled to allow all assets within Enterprise Manager Ops Center (EMOC) to be discovered. In a physical environment, it prevents potential conflict with network interfaces.

Risk

When the host interconnect is not disabled, EMOC asset discovery can fail. In a physical environment, the IPoIB-default interface might be missing.

Action / Repair

To disable host interconnect, perform the following steps:
1. Install the ilomconfig tool, which is part of oracle-hmp-tools
2. Run the 'ilomconfig disable interconnect' command
3. Verify that the host interconnect is successfully disabled by running 'ifconfig -a | grep usb'. The output should not display usb0.

Note 1609142.1 - Disable USB Network Interfaces Prior To Installing EECS On Exalogic Virtual
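
The verification in step 3 can also be scripted. The Python sketch below is illustrative only: usb_interfaces is a hypothetical helper that scans `ifconfig -a` output for interface names beginning with "usb", assuming that interface names start in the first column and that detail lines are indented.

```python
import subprocess


def usb_interfaces(ifconfig_output):
    """Return interface names starting with 'usb' found in ifconfig output."""
    names = []
    for line in ifconfig_output.splitlines():
        if not line or line[0].isspace():
            continue  # blank or indented detail line, not an interface header
        name = line.split()[0].rstrip(":")
        if name.startswith("usb"):
            names.append(name)
    return names


if __name__ == "__main__":
    try:
        result = subprocess.run(["ifconfig", "-a"], capture_output=True, text=True)
    except OSError as exc:
        print(f"Could not run ifconfig: {exc}")
    else:
        found = usb_interfaces(result.stdout)
        if found:
            print("Host interconnect still present:", ", ".join(found))
        else:
            print("No usb interface found: host interconnect is disabled")
```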

Compute Node: PCI 64-Bit Resource Allocation Disabled

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Disabling PCI 64-bit resource allocation ensures that all MMIO get allocated below 4GB within the Exalogic system.

Risk

If the recommended PCI 64-bit resource allocation settings are not used, you may face problems related to memory.

Action / Repair

NOTE: To fix this issue, you must restart the compute node. Applications running on the compute node will be stopped while the compute node restarts.

Ensure that you have made adequate preparations to handle the temporary loss of service, before you start this procedure.

To fix this issue, restart the compute node and navigate through the BIOS screens.

Examine the following properties within the BIOS settings:

IO -> "PCI SubSystem Settings" -> "PCI 64 bit Resources" IO -> "Allocation" -> Change value to "Disabled"

After this property is set, exit the BIOS screen and complete the startup procedure.

Oracle Hardware Management Pack User's Guide (http://docs.oracle.com/cd/E20451_01/html/E25303/mpigt.glqbr.html)

REFERENCE:

Note 1608959.1 - Updating the BIOS Settings for X2-2, X3-2, and X4-2 Compute nodes before installing EECS on Exalogic 

Compute Node: RPM Database Corruption for Control VM

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

If an RPM query returns an error, other RPM operations are likely to fail as well. Because upgrading and patching require RPM, they would also fail.

Risk

The upgrade process cannot proceed without fixing the errors with RPM installation.

Action / Repair

Run rpm -qa. If the command runs without any issue, proceed with the upgrade installation.
If the RPM query returns a lock issue, refer to the following MOS note to fix it:
Note 1599404.1 - Error received while executing rpm commands - "rpmdb: Lock table is out of available locker entries"

Compute Node: RPM Database Corruption

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

If an RPM query returns an error, other RPM operations are likely to fail as well. Because upgrading and patching require RPM, they would also fail.

Risk

The upgrade process cannot proceed without fixing the errors with RPM installation.

Action / Repair

Run rpm -qa. If the command runs without any issue, proceed with the upgrade installation.
If the RPM query returns a lock issue, refer to the following MOS note to fix it:
Note 1599404.1 - Error received while executing rpm commands - "rpmdb: Lock table is out of available locker entries"
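
The decision described above can be sketched in Python. This is a minimal illustration, not the exachk implementation: classify_rpm_query is a hypothetical helper, and the only error text it recognizes is the lock message quoted in the MOS note title.

```python
import subprocess

# Error text quoted in the title of Note 1599404.1.
LOCK_MESSAGE = "Lock table is out of available locker entries"


def classify_rpm_query(return_code, stderr_text):
    """Classify the outcome of `rpm -qa` as 'ok', 'lock-issue', or 'error'."""
    if LOCK_MESSAGE in stderr_text:
        return "lock-issue"
    if return_code != 0:
        return "error"
    return "ok"


if __name__ == "__main__":
    try:
        result = subprocess.run(["rpm", "-qa"], capture_output=True, text=True)
    except OSError as exc:
        print(f"Could not run rpm: {exc}")
    else:
        outcome = classify_rpm_query(result.returncode, result.stderr)
        print(f"rpm -qa result: {outcome}")
```

A result of "ok" means the upgrade can proceed; "lock-issue" points to the MOS note above.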

Compute Node: Memory Recommendation for dom0

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

Physical memory of dom0 is critical for applications to be able to run on the Exalogic system.

Risk

When the physical memory attributes allocated to dom0 do not meet the recommended values, the compute node may freeze, or experience an unexpected kernel panic and restart.

Action / Repair

Please follow instructions in the MOS link.
Note 1582091.1 - Exalogic Virtual dom0 Memory Recommendations

Compute Node: RAID Battery Level

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Exalogic local storage is set up in RAID configuration. Ensuring that RAID has sufficient battery power is critical for the local storage to function properly, especially during a power outage.

Risk

When the RAID battery runs out, the compute node may not have data protection against failure and may also experience performance degradation.

Action / Repair

Please refer to the following note and contact Oracle Support.

<Note 1437353.1>: Exalogic Battery Check and Replacement Guidelines 

Compute Node: Xen Vulnerability Patch Verification for Oracle Virtual Server of 2.0.6.x.x

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

Applying the CVE-2014-7188/XSA-108 patch addresses a critical Xen security vulnerability that can allow malicious guest virtual machines to read data from other guest machines or from the hypervisor itself.

Risk

Not applying the patch exposes the Exalogic machine to a critical Xen security vulnerability that can allow malicious guest virtual machines to read data from other guest machines or from the hypervisor itself.

Action / Repair

Please refer to MOS Note 1932297.1 - CVE-2014-7188 / XSA-108 (Xen Vulnerability) Patch Availability for Oracle Exalogic in a Virtualized Configuration

Compute Node: Non-2.0.6.x.x VM with 12 vCPUs and 32G Memory

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

After upgrading to version 2.0.6.x.x, VMs with 12 or more vCPUs and 32 GB or more of memory start up normally.

Risk

Loading pre-2.0.6.x.x VMs with 12 or more vCPUs and 32 GB or more of memory takes an extended period of time.

Action / Repair

Upgrade pre-2.0.6.x.x templates and virtual machines in the system to 2.0.6.x.x.

Note 1582091.1 - Exalogic Virtual dom0 Memory Recommendations

Compute Node: Check for Version 2.0.1.x.x Template, Virtual Machine or Large VM in Version 2.0.4.x.x Environment

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

Multiple versions of virtual machines and templates may be installed in an Exalogic virtual setup. Virtual machines, templates, and virtual setups exist in the following versions: 2.0.1.x.x, 2.0.4.x.x, and 2.0.6.x.x. Virtual machines and templates of version 2.0.1.x.x are compatible with a version 2.0.4.x.x virtual setup, but are not compatible with the version 2.0.6.x.x infrastructure. Upgrade these 2.0.1.x.x VMs and templates before or immediately after the upgrade to 2.0.6.x.x.

Risk

Version 2.0.1.x.x templates and virtual machines are not supported by the version 2.0.6.x.x infrastructure.

Action / Repair

Before you upgrade your version 2.0.4.x.x virtual setup to version 2.0.6.x.x, create a plan to upgrade the version 2.0.1.x.x virtual machines and templates before or immediately after the upgrade to version 2.0.6.x.x.
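
The compatibility statements made in the version checks of this document can be collected into a small lookup, sketched here in Python. The function and table names are hypothetical, and the table lists only the combinations the text states explicitly; anything else is reported as not covered rather than guessed.

```python
def release_family(version):
    """Reduce a version such as '2.0.6.1.0' to its family '2.0.6'."""
    return ".".join(version.split(".")[:3])


# (guest family, infrastructure family) -> status, as stated in this document.
COMPATIBILITY = {
    ("2.0.1", "2.0.4"): "compatible",
    ("2.0.1", "2.0.6"): "not compatible",
    ("2.0.4", "2.0.6"): "compatible",
}


def guest_status(guest_version, infrastructure_version):
    """Return the documented status of a VM or template in a virtual setup."""
    guest = release_family(guest_version)
    infra = release_family(infrastructure_version)
    if guest == infra:
        return "same release"
    return COMPATIBILITY.get((guest, infra), "not covered here")


if __name__ == "__main__":
    print(guest_status("2.0.1.1.0", "2.0.6.0.0"))
```

A "not compatible" result corresponds to the case where the VMs and templates must be upgraded as part of the move to 2.0.6.x.x.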

Compute Node: Check for version 2.0.4.x.x Template and Virtual Machine in 2.0.6 Environment

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

Multiple versions of virtual machines and templates may be installed in an Exalogic virtual setup. Virtual machines, templates, and virtual setups exist in the following versions: 2.0.1.x.x, 2.0.4.x.x, 2.0.6.x.x, and so on. Virtual machines and templates of version 2.0.4.x.x can be used in a version 2.0.6.x.x virtual setup.

Risk

Virtual machines and templates of version 2.0.4.x.x can be used in a version 2.0.6.x.x virtual setup. There is no risk in this scenario.

Action / Repair

No action is needed.

Compute Node: Check for Version 2.0.1.x.x Template, Virtual Machine or Large VM in Version 2.0.6.x.x Environment

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

Virtual machines and templates of version 2.0.1.x.x are not compatible with a version 2.0.6.x.x virtual setup.

Risk

Virtual machines and templates of version 2.0.1.x.x do not function properly in a version 2.0.6.x.x virtual setup.

Action / Repair

Upgrade version 2.0.1.x.x templates and virtual machines in the system to version 2.0.6.x.x.

Compute Node: Check for Version 2.0.6.x.x Template and Virtual Machine in Version 2.0.4.x.x Environment

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

Multiple versions of virtual machines and templates can be installed in an Exalogic virtual setup. Virtual machines, templates, and virtual setups exist in the following versions: 2.0.1.x.x, 2.0.4.x.x, and 2.0.6.x.x. Virtual machines and templates of version 2.0.4.x.x can be used in a version 2.0.6.x.x virtual setup, but will not have access to the newest features and fixes.

Risk

Virtual machines and templates of version 2.0.6.x.x can be used in a version 2.0.4.x.x virtual setup.

Action / Repair

Upgrade 2.0.1.x.x templates and virtual machines in the system.

Compute Node: Resource Control Utility Information

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Physical

Benefit / Impact

Each compute node in Oracle Exalogic Elastic Cloud X3-2 and X4-2 machines has a total of 16 and 24 processor cores, respectively. Not all Exalogic customers need to use all cores. The Resource Control utility controls and manages the available CPU cores. The number of enabled cores is persisted in the BIOS. The affected compute node must be shut down (powered off) and started again for the changes to take effect.

Risk

When processor cores are not enabled correctly, the system will have performance issues.

Action / Repair

Please refer to MOS Note 1671659.1 - Exalogic Core Capping for bare metal (physical) Linux and Solaris

Compute Node: CPU CAP for Virtual Machine Configuration File in Oracle Virtual Server

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

When the CPU Cap is configured to be less than 100% (through EMOC), several issues related to CPU soft lockup and vServer hangs have been reported on Exalogic. When the CPU Cap is configured to be 100% through EMOC, it is translated to cpu_cap=0 in vm.cfg, which is the expected value.

Risk

When the CPU Cap is configured to be less than 100% (through EMOC), several issues related to CPU soft lockup and vServer hangs have been reported on Exalogic.

Action / Repair

Please refer to MOS Note 1912480.1 - Setting CPU CAP to be less than 100% is not supported for Guest vServers on Exalogic
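
A vm.cfg file can be inspected for the cpu_cap value with a short Python sketch like the one below. It is illustrative only: cpu_cap_value is a hypothetical helper, it assumes the setting appears on its own line as cpu_cap = <number>, and it reports a missing setting as None rather than interpreting it.

```python
import re


def cpu_cap_value(vm_cfg_text):
    """Return the integer cpu_cap setting from vm.cfg content, or None if absent."""
    match = re.search(r"^\s*cpu_cap\s*=\s*['\"]?(\d+)", vm_cfg_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def cpu_cap_is_expected(vm_cfg_text):
    """Return True only when cpu_cap is explicitly set to 0 (100% in EMOC)."""
    return cpu_cap_value(vm_cfg_text) == 0


if __name__ == "__main__":
    sample = "name = 'example-vserver'\ncpu_cap = 50\n"  # illustrative content
    print("cpu_cap:", cpu_cap_value(sample), "expected:", cpu_cap_is_expected(sample))
```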

Compute Node: Check for unknown files in OVS repositories

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

Multiple versions of virtual machines and templates may be installed in an Exalogic virtual setup. Virtual machines, templates, and virtual setups exist in the following versions: 2.0.1.x.x, 2.0.4.x.x, and 2.0.6.x.x. Virtual machines and templates of version 2.0.1.x.x are compatible with a version 2.0.4.x.x virtual setup, but are not compatible with the version 2.0.6.x.x infrastructure. Upgrade these 2.0.1.x.x VMs and templates before or immediately after the upgrade to 2.0.6.x.x.

Risk

Version 2.0.1.x.x templates and virtual machines are not supported by the version 2.0.6.x.x infrastructure.

Action / Repair

Before you upgrade your version 2.0.4.x.x virtual setup to version 2.0.6.x.x, create a plan to upgrade the version 2.0.1.x.x virtual machines and templates before or immediately after the upgrade to version 2.0.6.x.x.

Compute Node: BIOS SR-IOV Status

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Before upgrading Oracle Exalogic Elastic Cloud, it is important to ensure that the system has the recommended BIOS settings.

Risk

Using the incorrect BIOS settings will result in a failed install or upgrade process.

Action / Repair

Enable SR-IOV as detailed in the following MOS Note:

<Note 1608959.1>: Updating the BIOS Settings for X2-2, X3-2, and X4-2 Compute nodes before installing EECS on Exalogic 

Compute Node: RAID Configuration of Local Disks

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Proper RAID configuration of local disks will ensure data redundancy.

Risk

If the RAID configuration of local disks is not set correctly, there is no redundancy: if one of the SSDs malfunctions, all the data on that SSD is lost.

Action / Repair

Please find the document "Engineering Approved Steps" under the issue "Some RAID groups were not setup correctly in the factory when the Exalogic was built", referenced in the following MOS note:

Note 1360310.1 - Oracle EXALOGIC Current Product Issues X2-2, X3-2, X4-2

Compute Node: Check Dom0 Kernel Memory Slab Usage of Size-192

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

Certain workloads may trigger a rare condition of excessive memory use.

Risk

Dom0 kernel memory may leak and cause Dom0 to freeze over time.

Action / Repair

Contact Oracle Support and file an SR.

Compute Node: IPoIB in Connected Mode for OEL6

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Having IPoIB in Connected Mode ensures that Internet Protocol (IP) works properly over InfiniBand network.

Risk

If the Connected Mode is not set, IPoIB might not work properly.

Action / Repair

  1. If it is the OFED R2 version, check the following:

    • In the /etc/sysconfig/network-scripts/ifcfg-ib* files, check whether "CONNECTED_MODE=yes" is set.
    • If "CONNECTED_MODE" is not set to "yes", modify the /etc/sysconfig/network-scripts/ifcfg-ib* file and change the property to "yes".

  2. If it is not the OFED R2 version, check the following:

    • If SET_IPOIB_CM and/or IPOIB_LOAD is not set to "yes", modify the /etc/rdma/rdma.conf file and change these properties to "yes".
    • If the content of /sys/class/net/ib0/mode and /sys/class/net/ib0/mode are not connected, modify the content of these files to "connected".

After modifying the files above, restart InfiniBand by running the command:

/etc/init.d/openibd restart

Note 1982645.1 - Exachk Reporting "IPoIB is not in connected mode" WARNING Message On Exalogic 2.0.6.2.0 Linux Physical Racks
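
The two file checks described above can be sketched in Python. The helper names are hypothetical; the sketch assumes only what the text states: that the ifcfg-ib* files carry a CONNECTED_MODE=yes property and that a mode file under /sys/class/net contains the word connected.

```python
import glob


def connected_mode_enabled(ifcfg_text):
    """Return True when the ifcfg content sets CONNECTED_MODE to yes."""
    for raw_line in ifcfg_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines without an assignment
        key, _, value = line.partition("=")
        if key.strip() == "CONNECTED_MODE":
            return value.strip().strip("\"'").lower() == "yes"
    return False


def interface_is_connected(mode_file_text):
    """Return True when a /sys/class/net/<interface>/mode file reads 'connected'."""
    return mode_file_text.strip() == "connected"


if __name__ == "__main__":
    for path in sorted(glob.glob("/etc/sysconfig/network-scripts/ifcfg-ib*")):
        try:
            with open(path) as handle:
                enabled = connected_mode_enabled(handle.read())
        except OSError as exc:
            print(f"{path}: could not read ({exc})")
        else:
            print(f"{path}: CONNECTED_MODE=yes is {'set' if enabled else 'NOT set'}")
```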

Compute Node: Ibswitches Information Validation

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

Correct format for ibswitches information will ensure proper networking.

Risk

Incorrect switch description could cause patching issues.

Action / Repair

Please follow MOS Note 1476772.1 - A script to reset (to factory defaults) the NM2-GW switch Description field to show Leaf Details.

Compute Node: Eport_State_Enforce Status

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

On releases running the Oracle Enterprise Linux operating system, if the Ethernet link used by a vnic goes down, the bond configured with that vnic does not detect it. By default, the bond detects only the physical link it is using, which is the InfiniBand link; it does not detect the link of the Ethernet port to which the vnic is connected. The eport_state_enforce=1 flag must be present in /etc/modprobe.conf for this failure to be detected and for failover to occur.

Risk

Without the eport_state_enforce=1 flag in /etc/modprobe.conf, a network outage will occur if one of the links fails.

Action / Repair

Make sure eport_state_enforce=1 is set in the /etc/modprobe.conf file.

Note 1512139.1 - Oracle Exalogic Elastic Cloud Known Issues - Virtualization Release
Note 1436514.1 - Exalogic: VNIC 10gb Bond Network Ethernet Link Failover Detection

Compute Node: Detect EM Agent On Dom0

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Dom0

Benefit / Impact

Additional software running on the hypervisor is not recommended. It may affect performance of the hypervisor and guests running on it.

Risk

Installation of EM agent on dom0 has the potential to de-stabilize the environment.

Action / Repair

Uninstall the EM Agent from dom0.

Note 1668193.1 - FAQs: Modifications to Exalogic Control vServers and Dom0 In Exalogic Virtual Environments (Doc ID 1668193.1)

Compute Node: Eport_State_Enforce Status for OEL6

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical

Benefit / Impact

On releases running the Oracle Enterprise Linux operating system, if the Ethernet link used by a vnic goes down, the bond configured with that vnic does not detect it. By default, the bond detects only the physical link it is using, which is the InfiniBand link; it does not detect the link of the Ethernet port to which the vnic is connected. The eport_state_enforce=1 flag must be present in /etc/modprobe.d/mlx4_vnic.conf for this failure to be detected and for failover to occur.

Risk

Without the eport_state_enforce=1 flag in /etc/modprobe.d/mlx4_vnic.conf, a network outage will occur if one of the links fails.

Action / Repair

Make sure eport_state_enforce=1 is set in the /etc/modprobe.d/mlx4_vnic.conf file.

Note 1512139.1 - Oracle Exalogic Elastic Cloud Known Issues - Virtualization Release
Note 1436514.1 - Exalogic: VNIC 10gb Bond Network Ethernet Link Failover Detection
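
Whether the flag is present can be checked with a short Python sketch such as the following. The helper name has_module_option is hypothetical. The sketch looks for the literal option=value token on any uncommented line and makes no assumption about which module the line configures; the module name in the sample line is illustrative only.

```python
def has_module_option(conf_text, option, expected_value):
    """Return True when 'option=expected_value' appears on an uncommented line."""
    wanted = f"{option}={expected_value}"
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0]  # drop trailing comments
        if wanted in line.split():
            return True
    return False


if __name__ == "__main__":
    # Illustrative content; the module name on this line is an assumption.
    sample = "options mlx4_vnic eport_state_enforce=1\n"
    print(has_module_option(sample, "eport_state_enforce", 1))
```

The same helper can be pointed at the content of /etc/modprobe.conf or /etc/modprobe.d/mlx4_vnic.conf, depending on the release.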

Compute Node: Check CPUspeed Governor Setting

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

placeholder to have all exalogic related checks under this

Risk

Performance degradation.

Action / Repair

Please modify the CPUspeed governor setting according to the following MOS note:

Note 1925546.1 - Performance Issue with CPU Processing In Exalogic X4-2 Linux Virtual Racks (Doc ID 1925546.1)

Compute Node: lro_num=0 in mlx4_vnic.conf

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | DomU

Benefit / Impact

LRO (Large Receive Offload) is a technique used to increase throughput by reducing CPU overhead. It is not compatible with the InfiniBand network settings and therefore must be disabled.

Risk

The system will panic if LRO is not disabled.

Action / Repair

Add "lro_num=0" to /etc/modprobe.d/mlx4_vnic.conf.

Compute Node: ARI in BIOS Setting Enabled

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
FAIL | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

If Alternate Routing ID (ARI) is supported by the hardware and set to enabled, devices are permitted to locate virtual functions (VFs) in function numbers 8 to 255 of the captured bus number, instead of normal function numbers 0 to 7.

Risk

If ARI is not enabled, only 7 virtual functions are available for VMs to use. Any additional VM cannot attach a virtual function, so networking inside that VM fails.

Action / Repair

NOTE: To fix this issue, you must restart the compute node. Applications running on the compute node will be stopped while the compute node restarts. Ensure that you have made adequate preparations to handle the temporary loss of service, before you start this procedure.

To fix this issue, see the following references:

Note 1608959.1 - Updating the BIOS Settings for X2-2, X3-2, and X4-2 Compute nodes before installing EECS on Exalogic

Oracle Hardware Management Pack User's Guide (http://docs.oracle.com/cd/E20451_01/html/E25303/mpigt.glqbr.html)

Compute Node: Grub Conf Settings for Dom0

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

To limit and pin the Dom0 CPUs to run on the first 20 logical CPUs.

Risk

If the two parameters are missing from grub.conf, the compute node can become unstable and unable to communicate over InfiniBand as soon as it is rebooted. Restoring the parameters returns the machine to normal operation.

Action / Repair

Add the following to the xen.gz kernel boot line, then reboot the server: dom0_vcpus_pin dom0_max_vcpus=20.
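
Whether both parameters are present can be checked before the next reboot. The following is a sketch under stated assumptions: check_dom0_cpu_params is a hypothetical helper name, the file to inspect is passed as an argument, and every non-comment line that mentions xen.gz is treated as a boot line to be checked.

```shell
# Hypothetical helper: report xen.gz boot lines that lack either
# dom0_vcpus_pin or dom0_max_vcpus=20. Exit status 0 means at least one
# xen.gz line was found and all of them carry both parameters.
check_dom0_cpu_params() {
    awk '
        /^[ \t]*#/ { next }                  # ignore commented-out entries
        /xen\.gz/ {
            seen = 1
            pin = 0; lim = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "dom0_vcpus_pin")    pin = 1
                if ($i == "dom0_max_vcpus=20") lim = 1
            }
            if (!pin || !lim) { print "missing parameters: " $0; bad = 1 }
        }
        END { exit((seen && !bad) ? 0 : 1) }
    ' "$1"
}
```

The helper only reads the file; editing the boot line and rebooting remain manual steps.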

Compute Node: Disabled Automatic Path Migration(APM)

 

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical, Dom0

Benefit / Impact

There is a compatibility issue between the OFA software version on Exadata (1.5.1) and on Exalogic (1.5.5). APM (Automatic Path Migration), a new feature that is enabled by default on Exalogic, is not yet supported by the OFED version on Exadata. As a result, the SDP protocol fails with the error "RDMA CMA: unexpected IB CM event: 13". Disabling APM ensures that the SDP protocol works properly in this case.

Risk

Enabling APM on an Exalogic machine that is connected to Exadata can lead to problems and outages related to SDP protocol failure.

Action / Repair

Refer to Note 1588546.1 for Action/Repair.

___________________________________________________________________________________

Compute Nodes Solaris

Compute Node: Software Profile

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
INFO        All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

Validating the software packages can avoid problems related to configuration.

Risk

If the software is not validated, inconsistencies between components can lead to problems and outages.

Action / Repair

The output contains lines similar to the following:

[SUCCESS]........Has supported operating system

If a result that is not SUCCESS is returned, investigate and correct the condition.

Sample Output

# /opt/exalogic.tools/tools/CheckSWProfile
[SUCCESS]........Has supported operating system
[SUCCESS]........Has supported processor
[SUCCESS]........Kernel is at the supported version
[SUCCESS]........Has supported kernel architecture
[SUCCESS]........Software is at the supported profile

 

Compute Node: NTP Synchronization

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

NTP helps synchronize the clock of Exalogic with an accurate time source. To ensure correct synchronization, the delay and offset values should be non-zero and the jitter value should be under 100.

Risk

An unsynchronized system clock can lead to possible errors and outages.

Action / Repair

Any warnings generated by NTP Synchronization check can be caused by the following:

  • You are using an older version of the NTP package that does not work if you use the DNS name for the NTP servers. In this case, you must use the IP addresses.
  • A firewall is blocking access to your Stratum 1 and 2 NTP servers. The firewall can be a device on one of the networks between the NTP server and its time source, or firewall software, such as iptables, running on the NTP server.
  • The notrust nomodify notrap keywords present in the restrict statement of the NTP client.
  • Localhost is configured on the NTP server. If it is a Linux system, remove localhost from /etc/ntp.conf file to fix the issue. If it is a Solaris system, remove localhost from /etc/inet/ntp.conf
Note:

KISS keywords in the NTP parameters are ignored. A list of these keywords can be found at the following link:
http://www.iana.org/assignments/ntp-parameters/ntp-parameters.xml

Sample Output

# report=$(ntpq -pn 2>&1); echo "$report"
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+10.133.40.1     144.25.255.140   3 u  346 1024  377    0.321    0.088   0.011
*144.25.255.141  144.20.10.10     2 u  475 1024  377    1.870   -0.210   0.002
+144.25.255.142  144.25.255.140   3 u  432 1024  377    2.065    0.251   0.179
 127.127.1.0     .LOCL.          10 l   15   64  377    0.000    0.000   0.001
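
The criteria stated under Benefit / Impact (non-zero delay and offset, jitter under 100) can be applied to this output mechanically. The following is a minimal sketch, not the logic that Exachk uses: check_ntp_peers is a hypothetical helper name, the last three columns are taken to be delay, offset and jitter as in the sample, and skipping 127.127.* local reference clock entries is an assumption.

```shell
# Hypothetical helper: read "ntpq -pn" output on stdin and print a WARN
# line for each peer whose delay or offset is zero or whose jitter is
# 100 or more. Exit status 0 means no peer was flagged.
check_ntp_peers() {
    awk '
        NR <= 2 { next }                     # header and separator lines
        {
            peer = $1
            sub(/^[-+*#ox.]/, "", peer)      # drop the leading tally character
            if (peer ~ /^127\.127\./) next   # assumption: local clock is not judged
            delay = $(NF-2); offset = $(NF-1); jitter = $NF
            if (delay + 0 == 0 || offset + 0 == 0 || jitter + 0 >= 100) {
                print "WARN " peer " delay=" delay " offset=" offset " jitter=" jitter
                bad = 1
            }
        }
        END { exit(bad ? 1 : 0) }
    '
}
```

A typical invocation would be: ntpq -pn 2>&1 | check_ntp_peers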

Compute Node: DNS Setup

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

The DNS service allows components of Exalogic to have access to each other. Verifying the DNS setup is critical to avoid problems of access between components.

Risk

If the DNS setup is not verified, inconsistent access protocol between components can lead to problems and outages.

Action / Repair

Verify DNS setup configuration by examining /etc/resolv.conf and import the configuration to the SMF with the following commands:

# /usr/sbin/nscfg import -f name-service/switch
# svcadm enable dns/client
# svcadm refresh name-service/switch

 

Compute Node: Correct Slot Installation of IB Card for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact:

For the best performance on the PCI Express interface, the adapter card should contain the correct Vendor ID and Device ID.

Risk:

Installing the card in a slower slot or using unsupported adapter card vendors can significantly limit bandwidth and performance.

Action / Repair:

Reinstall the card in the correct slot as specified below:

For ConnectX-2: Vendor ID is 0x15b3 and device ID is 0x673c
For ConnectX-3: Vendor ID is 0x15b3 and device ID is 0x1003

Verify that the physical slot of the card at the back of the compute node is correct.

Compute Node: Subnet Manager

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

The subnet manager (SM) manages all operational characteristics of the InfiniBand network. The InfiniBand network has more than one SM, but only one SM is active at a time. Standby SMs ensure that the InfiniBand network stays up.

Risk

If the master SM shuts down or fails and there is no standby SM to replace it, the InfiniBand network will fail. This can cause a loss of connectivity for the Exalogic machine.

Action / Repair

To repair the problem, see the following link for detailed instructions.

Links

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/leafswitch.htm#CBHICGAA

Compute Node: Root Partition Usage Limit for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact:

Keeping enough free space in the root partition ensures the compute node can operate optimally.

Risk:

Performance of the compute node is affected.

Action / Repair:

Free up disk space on the root partition of the local disk. Oracle recommends user data be stored on the storage appliance.

Compute Node: Lockd Configuration for Solaris Compute Node

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
INFO        All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

To maintain data integrity and to prevent application lag, lock recovery after a reboot is critical.

Risk

NFSv3 locks may not be recovered after a reboot.

Action / Repair

- To enable lock manager, please run the following command:

svcadm enable svc:/network/nfs/nlockmgr:default

- To enable status, please run the following command:

svcadm enable svc:/network/nfs/status:default

Links

http://docs.oracle.com/cd/E23824_01/html/821-1462/lockd-1m.html#REFMAN1Mlockd-1m

Compute Node: ib_ipoib Module for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

The ib_ipoib module must be loaded to ensure that the Internet Protocol (IP) works over InfiniBand.

Risk

If the ib_ipoib module is not loaded, InfiniBand may not work properly.

Action / Repair

To install the package which includes the module, contact Oracle Support.

Sample Output

# lsmod | grep ib_core
ib_core 61642 13 rdma_ucm,ib_sdp,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx4_vnic,ib_sa,mlx4_ib,ib_mthca,ib_mad

 

Compute Node: ib_sdp Module for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

A loaded ib_sdp module ensures that Sockets Direct Protocol (SDP) works properly over InfiniBand.

Risk

If the ib_sdp module is not loaded, InfiniBand may not work correctly.

Action / Repair

To install the package which includes the module, contact Oracle Support.

Compute Node: IP Configuration - net0 and bond0

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

Correct IP configuration for net0 and bond0 allows each compute node to manage hostname mapping and DNS entries.

Risk

A misconfiguration of the /etc/hosts file can cause problems when a compute node tries to reach other nodes in the same rack.

Action / Repair

Investigate the /etc/hosts file and the content that is returned from interface configuration of net0 and bond0 by doing the following:

  1. Verify if the content of /etc/hosts has multiple entries of the same IP address.
  2. The IP address obtained from the ipadm show-addr command on net0 and bond0 should be listed in the /etc/hosts file.
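
Both steps can be approximated with a short script. The following is a sketch, not the Exachk implementation: duplicate_host_ips and ip_in_hosts are hypothetical helper names, the hosts file is passed as an argument, and the address to look up is supplied by hand (for example, copied from the ipadm show-addr output).

```shell
# Hypothetical helper: print every address that appears on more than one
# non-comment line of the given hosts file.
duplicate_host_ips() {
    awk '
        { sub(/#.*/, "") }                   # strip trailing comments
        NF >= 2 { count[$1]++ }              # first field is the address
        END { for (ip in count) if (count[ip] > 1) print ip }
    ' "$1" | sort
}

# Hypothetical helper: succeed if address $1 is listed in hosts file $2.
ip_in_hosts() {
    awk -v ip="$1" '
        { sub(/#.*/, "") }
        $1 == ip { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$2"
}
```

An empty result from duplicate_host_ips /etc/hosts means that no address is listed more than once.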

Compute Node: Recent Reboot Info for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
INFO        All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

Ensuring the stability of the system is important to support applications running on Exalogic. By discovering an unexpected and recent reboot of a compute node, action can be taken to restore service and resolve a potential problem.

Risk

An unexpected and recent reboot of a compute node can lead to problems.

Action / Repair

If the recent reboot was intentional or expected, ignore this warning. Otherwise, investigate why this compute node rebooted unexpectedly. If the problem persists, open an SR with Oracle Support.

Compute Node: Probe Based IPMP for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact:

For probe based failure detection, the IPMP daemon (mpathd) sends ICMP packets back and forth across the network through ping test to ensure connectivity. It monitors the path to the gateway.

Risk:

When probe based IPMP is not set up properly, the system failure detection does not ensure connectivity to a test address dynamically.

Action / Repair:

Run the following commands to enable probe based IPMP failure detection:

# svccfg -s svc:/network/ipmp setprop config/transitive-probing=true
# svcadm refresh svc:/network/ipmp:default

Compute Node: Swap Space for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact:

If the kernel runs out of physical memory, it uses the swap space to get more memory. If the swap space is unused, the kernel has an adequate amount of free physical memory and the Exalogic machine is performing optimally.

Risk:

The usage of swap space indicates that the kernel is running out of free physical memory. A lack of free physical memory can lead to degraded performance.

Action / Repair:

Free up the used memory.

Compute Node: Free Physical Memory for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

An adequate amount of free physical memory ensures that the Exalogic machine performs optimally.

Risk

If there is not enough free physical memory, problems and outages may occur.

Action / Repair

Evaluate the memory usage. If possible, clear up the memory cache. Otherwise, contact Oracle Support for possible tuning or upgrade.

Compute Node: MTU for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

The correct MTU size ensures that the communication protocol layer performs optimally over both Ethernet and InfiniBand.

Risk

Incorrect MTU size may slow down the performance and cause latency issues.

Action / Repair

Contact Oracle Support.

Compute Node: IPMP Configuration for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact:

The InfiniBand interfaces are a communication link between various components of the Exalogic machine. To maintain high availability (HA) with the IPoIB interface, the InfiniBand interfaces must be bonded correctly.

Risk:

Without proper bonding of the InfiniBand interfaces, the Exalogic machine cannot maintain high availability (HA) if one of the communication links goes down. It can also affect performance.

Action / Repair:

Refer to My Oracle Support (MOS) Note 1547687.1.

Compute Node: Fault Management Log for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact:

The stability of the Exalogic machine is important to support running applications. By discovering faulty devices in a compute node, potential issues can be resolved.

Risk:

A faulty device in a compute node can lead to problems and outages.

Action / Repair:

Investigate the problems listed in the report. To fix the issue, follow the steps outlined in the report. If the problem persists, open an SR with Oracle Support.

Compute Node: BIOS Settings

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

Before upgrading Oracle Exalogic Elastic Cloud to 2.0.0.0.1 or higher, ensure that the system is set to the recommended BIOS settings.

Risk

If the recommended BIOS settings are not used, problems can occur during the Oracle Exalogic Elastic Cloud 2.0.0.0.1 upgrade.

Action / Repair

Refer to Note 1608959.1.

Compute Node: NFS Mount Point - Version for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
INFO        All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

Verifying the correct configuration of mount points helps avoid NFS related performance issues.

Risk

If the NFS mount points are not configured correctly, inconsistencies related to storage access can occur. These inconsistencies can lead to problems and outages.

Action / Repair

It is recommended to upgrade the NFS mount point to the latest version.

Links

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm

Compute Node: Hostname Consistency with DNS on the Physical Compute Node

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact:

A correct hostname that matches the DNS prevents network configuration issues.

Risk:

An incorrect hostname that does not match the DNS may cause configuration issues. It can also cause Exachk to report wrong results.

Action / Repair:

You must determine if it is an error on the host or in the DNS entry.

If the error is on the host, fix the hostname by using svccfg in the Service Management Facility (SMF). The following is an example in Solaris:

# svccfg -s svc:/system/identity:node setprop config/nodename = astring: hostname
# svcadm refresh svc:/system/identity:node
# svcadm restart identity:node 


If it is an error in the DNS entry, contact your network administrator to correct the issue.

Compute Node: NFS Mount Point - Attribute Caching for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

By ensuring that attribute caching in NFS is not disabled, NFS mounts can perform optimally.

Risk

Disabling attribute caching can lead to extra network operations, which reduce network performance.

Action / Repair

Fix the configuration of the NFS Mount Point by removing "noac" and "actimeo=0" attributes from the mount points.

Links

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm

Compute Node: NFS Mount Point - Rsize Wsize for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

The NFS mount options rsize and wsize specify the size of the blocks of data that the client and server send to each other. The correct values for rsize and wsize need to be verified to maintain high performance for block transfer between the mount points. Oracle suggests setting both rsize and wsize to 131072.

Risk

Incorrect configuration of the rsize or wsize may lead to performance degradation.

Action / Repair

Correct the configuration of the NFS mount point by modifying rsize and wsize properties in the mount points to the recommended value of 131072.
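
The recommended values can be checked against the option string of a mount. The following is a minimal sketch: check_nfs_opts is a hypothetical helper name, and the comma-separated option string is assumed to have been copied by hand from the mount information reported on the compute node.

```shell
# Hypothetical helper: check a comma-separated NFS option string for
# rsize=131072 and wsize=131072. Prints one line per deviation and
# returns non-zero if either value differs or is missing.
check_nfs_opts() {
    status=0
    rsize=""; wsize=""
    old_ifs=$IFS; IFS=,
    for opt in $1; do
        case $opt in
            rsize=*) rsize=${opt#rsize=} ;;
            wsize=*) wsize=${opt#wsize=} ;;
        esac
    done
    IFS=$old_ifs
    [ "$rsize" = 131072 ] || { echo "rsize is '${rsize}', expected 131072"; status=1; }
    [ "$wsize" = 131072 ] || { echo "wsize is '${wsize}', expected 131072"; status=1; }
    return $status
}
```

For example, check_nfs_opts "rw,proto=tcp,rsize=32768,wsize=131072" reports the rsize value and returns a non-zero status.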

Links

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm

Compute Node: TCP Protocol on NFS Mount Point for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

To be compatible with ZFSSA, all NFS mount points in Solaris must have the protocol specified as TCP.

Risk

In Solaris, when no protocol is specified, the protocol used by default is RDMA, which is not compatible with ZFSSA.

Action / Repair

  1. From the output of the report command, identify the shares for which the specified protocol is not TCP. Use the umount command to unmount these shares.
     Example:
     # umount /mnt/share
  2. Remount using "-o proto=tcp".
     Example:
     # mount -o proto=tcp 192.168.1.2:/path/to/share /mnt/share

Links

Configuring NFS Version 4 (NFSv4) on Exalogic (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm)

Compute Node: RAID Battery Level

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
ERROR       All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

Exalogic local storage is set up in RAID configuration. Ensuring that RAID has sufficient battery power is critical for the local storage to function properly, especially during a power outage.

Risk

When the RAID battery runs out, the compute node may not have data protection against failure and may also experience performance degradation.

Action / Repair

Contact Oracle Support.



INTERNAL Note for Support:

Refer to INTERNAL <Note 1437353.1> which has steps to check and validate whether the Battery replacement is needed.

Compute Node: IP Configuration in /etc/hosts for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

Correct IP configuration for interfaces allows each compute node to manage hostname mapping and DNS entries.

Risk

A misconfiguration of the /etc/hosts file can cause problems when a compute node tries to reach other nodes in the same rack.

Action / Repair

Investigate the /etc/hosts file and the content that is returned from the interface configuration of net0/igb0 and bond0 by doing the following:

  1. Verify if the content of /etc/hosts has multiple entries of the same IP address.
  2. The IP address obtained from the 'ipadm show-addr' command on net0/igb0 and bond0 should be listed in the /etc/hosts file.

Links

Adding Exalogic Machine to Your Network (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/spreadsheet.htm)
Network, Storage, and Database Preconfiguration (http://docs.oracle.com/cd/E18476_01/doc.220/e18479/net.htm#BHCJBICD)

Compute Node: Check Solaris CACAO Publisher Setting

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
INFO        All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

Verify pre-conditions for the Solaris PSU patching.

Risk

The Solaris PSU installation may fail.

Action / Repair

Change the cacao publisher to non-sticky setting: pkg set-publisher --non-sticky cacao

Compute Node: NIS domain (YPBind) for Solaris

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Solaris  1.0.0.2.0+        Physical

Benefit / Impact

Ypbind is a client–server directory service protocol for distributing system configuration data. It allows Exalogic Elastic Cloud to find each server for NIS domains, and maintains the NIS binding information.

Risk

Without correct NIS configuration and binding information, inconsistencies related to network services may occur. These inconsistencies can lead to problems and outages.

Action / Repair

Verify the NIS configuration depending on the NFS version.

__________________________________________________________________________________________________________

Switches

Switch: /conf/configvalid File

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

The content of the /conf/configvalid file verifies that the switch is not misconfigured. Having the correct configuration ensures that Oracle Exalogic Elastic Cloud runs properly.

Risk

Misconfiguration of /conf/configvalid file may lead to problems and outages.

Action / Repair

If the content of /conf/configvalid is invalid (0), investigate possible misconfiguration in the other components within the switch to correct the condition.

Links

Note 1520330.1 - "smpartition list active" Shows Inconsistent Partition Information Between Exalogic Switches

Switch: EoIB Data SL

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

Correct configuration of the EoIB Data SL ensures that the gateway switch runs properly.

Risk

Misconfiguration of the EoIB Data SL within the gateway switch may lead to problems and outages.

Action / Repair

Please refer to the following MOS note:

<Note 2120372.1>: Exalogic: How to Change Service Levels of EoIB Data and Control on NM2-GW Switches 

Switch: EoIB Control SL

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

Correct configuration of the EoIB Control SL ensures that the gateway switch runs properly.

Risk

Misconfiguration of the EoIB Control SL within the gateway switch may lead to problems and outages.

Action / Repair

Please refer to the following MOS note:

<Note 2120372.1>: Exalogic: How to Change Service Levels of EoIB Data and Control on NM2-GW Switches  

Switch: Localhost Configuration

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

Valid localhost configuration of the switch needs to be ensured for the Oracle Middleware Exalogic Machine to perform its processes optimally.

Risk

If the localhost configuration is invalid, problems and outages may occur.

Action / Repair

If localhost is not configured on the switch, modify the /etc/hosts file to include a localhost entry.

Switch: Free Physical Memory

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

Availability of free memory needs to be ensured within the switches for the Oracle Middleware Exalogic Machine to perform its processes optimally.

Note: The recommended free space is at least 70%.


Risk

Insufficient memory may lead to degraded performance, and may cause problems and outages.

Action / Repair

The recommended free space is calculated by adding Free Memory (MemFree) and Reclaimable Memory (SReclaimable) listed in /proc/meminfo. The free memory should be at least 20% of the Total Memory(MemTotal).
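
The calculation described above can be sketched as follows. free_mem_percent is a hypothetical helper name; it adds MemFree and SReclaimable, divides by MemTotal, and prints the result as a whole percentage. The threshold to compare against is left to the caller.

```shell
# Hypothetical helper: print (MemFree + SReclaimable) as a whole
# percentage of MemTotal, read from a meminfo-style file
# (default: /proc/meminfo).
free_mem_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemFree:"      { free = $2 }
        $1 == "SReclaimable:" { reclaim = $2 }
        END {
            if (total <= 0) exit 1           # MemTotal missing or zero
            printf "%d\n", (free + reclaim) * 100 / total
        }
    ' "${1:-/proc/meminfo}"
}
```

Called without arguments, the helper reads /proc/meminfo on the machine where it runs.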

Run the following command to clear in-kernel caches:

# sync ; echo 2 > /proc/sys/vm/drop_caches


If there is still not enough free memory, restart the switch.

NOTE: The switch becomes unavailable during this period, causing applications within this switch to stop running. Ensure that you have made adequate preparations to handle the temporary loss of service, before you start this procedure. 


Restart the switch by running the following command:

# reboot -n

Switch: Unused VNICS

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

Valid vNIC within the switch needs to be ensured for the Oracle Middleware Exalogic Machine to perform its processes optimally. For a virtual rack, VNICs are important in the creation of new vServers.

Risk

VNICs in states other than "UP" may cause network outages. In a physical rack, possible problems related to EoIB network may occur. In a virtual rack, excessive number of unused vNICs may cause performance issues.

Action / Repair

If there are unused VNICs in a physical rack, do the following:

1) Investigate possible misconfiguration in other components within the switch.

2) Check whether the content of /conf/configvalid file is 1 (investigate "/conf/configvalid File" Check).

3) Check whether the localhost entry exists (investigate "Localhost Configuration" Check).

4) Check whether Subnet Manager is configured properly (investigate the output of sminfo command).

5) Check whether GUID is correct (Investigate the output of ibnetdiscover command).

6) Check whether partition is correctly configured.

Switch: Opensm

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

Opensm provides an implementation of an InfiniBand Subnet Manager and Administration to support Oracle Middleware Exalogic Machine.

Risk

If opensm is not running, problems and outages may occur.

Action / Repair

1) Run the "getmaster" command on all NM2-GW switches. If any of the NM2-GW switches does not have a local instance of the subnet manager running, enable the subnet manager by using the "enablesm" command.

2) Run "smnodes list" on all switches that have opensm enabled. Make sure that only those switches that have the subnet manager enabled are listed in this smnodes list on each node. If any of these switches is not listed, add the missing switches using "smnodes add <IP>". Repeat on all switches that have opensm enabled.

If Exadata is on the same InfiniBand fabric as Exalogic, verify that the subnet manager is disabled on Exadata as well.

Switch: List Link Up

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
INFO        All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

This hardware command lists the presence of links and the up-down state of the associated ports on the switch chip.

Risk

If any of the links are "down", problems and outages may occur.

Action / Repair

  1. Run the "getmaster" command on all NM2-GW switches. If any of the NM2-GW switches does not have a local instance of the subnet manager running, enable the subnet manager by using the "enablesm" command.
  2. Run "smnodes list" on all switches that have opensm enabled. Make sure that only those switches that have the subnet manager enabled are listed in this smnodes list on each node. If any of these switches is not listed, add the missing switches using "smnodes add <IP>". Repeat on all switches that have opensm enabled.

If Exadata is on the same InfiniBand fabric as Exalogic, verify that the subnet manager is disabled on Exadata as well.

Switch: Environment Test

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

Verifying that the hardware passes the environment test ensures that Oracle Exalogic Elastic Cloud runs properly.

Risk

If the environment tests result in failure, problems and outages may occur.

Action / Repair

If the environment tests fail, perform a power cycle through ILOM. Investigate possible hardware problems within the switch.

Switch: Ibstat

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Leaf Switches

Benefit / Impact

Ibstat displays the basic status of InfiniBand. A correctly working InfiniBand link allows fabric communication between the switches to work optimally.

Risk

If these parameters for ibstat are not met, problems and outages may occur.

Action / Repair

This InfiniBand software command displays basic information retrieved from the local InfiniBand driver. Ensure that the following criteria are met:
1. Physical state is LinkUp.
2. Rate is 40 Gb/s.
3. SM lid is present.

If any of these criteria are not met, investigate the following:
1. Check whether "env_test" passes (investigate the "Environment Test" check).
2. Check whether the subnet manager is configured properly (investigate the output of the sminfo command).
3. If Rate is not 40 Gb/s or SM lid is not present, investigate possible problems within the hardware.
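
The three criteria can be checked against the ibstat output with a short script. The following is a sketch under stated assumptions: check_ibstat is a hypothetical helper name, the output is assumed to contain "Port N:", "Physical state:", "Rate:" and "SM lid:" lines, and a non-zero SM lid is taken to mean that the SM lid is present.

```shell
# Hypothetical helper: read ibstat output on stdin and report ports that
# are not LinkUp, not at rate 40, or have no SM lid. Exit status 0 means
# at least one port was seen and none was flagged.
check_ibstat() {
    awk '
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:$/, "", port) }
        /Physical state:/ {
            seen = 1
            if ($3 != "LinkUp") { print "port " port ": physical state is " $3; bad = 1 }
        }
        /^[ \t]*Rate:/ {
            if ($2 + 0 != 40) { print "port " port ": rate is " $2; bad = 1 }
        }
        /SM lid:/ {
            if ($3 + 0 == 0) { print "port " port ": no SM lid"; bad = 1 }   # assumption: 0 means absent
        }
        END { exit((seen && !bad) ? 0 : 1) }
    '
}
```

A typical invocation would be: ibstat | check_ibstat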

Switch: SNMP Daemon

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Leaf Switches

Benefit / Impact

OpsCenter utilizes Simple Network Management Protocol (SNMP) to retrieve various switch properties, which are critical to ensure proper network monitoring of the system.

Risk

If the SNMP daemon is not running, network management may experience performance degradation.

Action / Repair

To start the snmpd service, the complete ILOM stack must be started by running the following command:
service ilom start

 

Switch: Number of Partition Keys on Bridge-X Ports

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
ERROR       All Types     Linux    1.0.0.2.0+        Leaf Switches

Benefit / Impact

To allow VNICs to work properly, ensure that the number of partitions associated with Bridge-X (BX) ports has not reached the upper limit.

Risk

If the number of partition keys associated with Bridge-X ports reaches or exceeds the recommended upper limit, any newly created VNICs will be in the WAIT-VHUB state even if all Bridge-X ports are full members of the appropriate partition.

Action / Repair

Remove BX port GUID from the unused partition and reduce the number of partitions. Each Bridge-X port has a maximum capacity of 128 partition keys. However, it is recommended that you keep the number of partitions below 100.
To reduce the number of partitions, run the following commands:
1. smpartition start
2. smpartition delete -pkey <pkey_to_be_deleted>
3. smpartition commit

Switch: Host Config VNIC

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Leaf Switches

Benefit / Impact

To be able to create VNICs in the Host Manual Mode, the "Allow host VNIC config" parameter must be set to "yes".

Risk

If the "Allow host VNIC config" parameter is not set to "yes", you will not be able to create VNICs in the Host Manual Mode.

Action / Repair

1. If the output of showgwconfig shows "BXM not running", ensure that the BXM service is up and running. Restart the service by running the following commands on the switch:
- service bxm start
- allowhostconfig
2. If either the configured value or the running value of Allow Host Config is "no", run the following command on the switch:
- allowhostconfig

 

Switch: Pre-upgrade check on switch memory and disk space

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
ERROR       All Types     Linux    1.0.0.2.0+        Leaf Switches

Benefit / Impact

The upgrade needs 80 MB of free space in the / filesystem, 200 MB of free space in /tmp, and 240 MB of free memory. Meeting these requirements avoids upgrade failures caused by insufficient space or memory.

Risk

The upgrade will fail if these criteria are not met.

Action / Repair

To free up memory, execute "sync ; echo 2 > /proc/sys/vm/drop_caches" or reboot the switch. Remove unwanted files from / and /tmp if their free space falls below 80 MB and 200 MB, respectively.
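
The disk space part of this check can be sketched as follows. avail_mb and has_space are hypothetical helper names, and the sketch assumes that the df command on the switch accepts the -P and -k options, so that the fourth column of the second output line is the available space in kilobytes.

```shell
# Hypothetical helper: read "df -Pk <path>" output on stdin and print
# the available space of that filesystem in whole megabytes.
avail_mb() {
    awk 'NR == 2 { printf "%d\n", $4 / 1024 }'
}

# Hypothetical helper: succeed if the filesystem described on stdin has
# at least $1 megabytes available.
has_space() {
    [ "$(avail_mb)" -ge "$1" ]
}
```

For example, "df -Pk / | has_space 80" and "df -Pk /tmp | has_space 200" correspond to the two disk space requirements stated above.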

Switch: VLAN PKEY PAIR Information for Switch

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Leaf Switches

Benefit / Impact

Unique VLAN Pkey Pairing guarantees correct networking.

Risk

A virtualized EECS environment does not support multiple VLANs on a single IB partition.

Action / Repair

Revert any manually created VLANs. If no VLAN was created manually, contact Oracle Support.
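
The condition described under Risk (more than one VLAN on a single IB partition) can be illustrated with a small Python sketch. The input format, a list of (vlan, pkey) pairs, is hypothetical; how the pairs are collected from the switch is not covered here.

```python
from collections import defaultdict


def pkeys_with_multiple_vlans(pairs):
    """Return {pkey: [vlans]} for partition keys paired with more than one VLAN.

    `pairs` is an iterable of (vlan, pkey) tuples; this shape is an assumption
    made for illustration.
    """
    vlans_by_pkey = defaultdict(set)
    for vlan, pkey in pairs:
        # Normalize case so 0X8001 and 0x8001 count as the same partition key.
        vlans_by_pkey[pkey.lower()].add(vlan)
    return {
        pkey: sorted(vlans)
        for pkey, vlans in vlans_by_pkey.items()
        if len(vlans) > 1
    }
```

An empty result means that every partition key is paired with exactly one VLAN.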

Switch: Validate No Stale Partition Key Temporary File Exists

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Leaf Switches

Benefit / Impact

A stale partition key temporary file can be left behind by a failed smpartition session. Such stale files need to be cleaned up, or the next smpartition session may fail or commit incorrectly.

Risk

Partition-related operations cannot be executed, or invalid partition information gets committed.

Action / Repair

1. Validate that no valid smpartition operation is in progress. Run a diff between the two files to see the pending changes:
   diff /conf/partitions.conf.tmp /conf/partitions.current
2. If necessary, move the partitions.conf.tmp file to another location for Oracle Support analysis.
3. If the session is invalid, run smpartition abort on the master switch to terminate the stale smpartition session.

Switch: Validate Partition Keys Are Using Latest Format

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | Leaf Switches

Benefit / Impact

Since switch firmware version 2.1.3-4, partition keys must consistently use the new format.

Risk

EMOC will fail to execute any operations related to partition keys.

Action / Repair

Make sure that the April 2014 PSU or July 2014 PSU was applied correctly. In particular, verify that the documented step that modifies the pkeys (pkey_filter.pl) was executed successfully.

Switch: /conf/configvalid File for Spine Switch

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Spine Switch

Benefit / Impact

The content of the /conf/configvalid file verifies that the switch is not misconfigured. Having the correct configuration ensures that Oracle Exalogic Elastic Cloud runs properly.

Risk

Misconfiguration of /conf/configvalid file may lead to problems and outages.

Action / Repair

If /conf/configvalid reports an invalid state (0), investigate possible misconfiguration in the other components within the switch to correct the condition. See Note 1520330.1 - "smpartition list active" Shows Inconsistent Partition Information Between Exalogic Switches.

Switch: Version Consistency on All Switches

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Leaf Switches

Benefit / Impact

Version consistency among switches avoids problems with hardware/software configuration.

Risk

Version inconsistency among switches can cause functional issues.

Action / Repair

Confirm the EECS patch level of the system, then complete the switch firmware upgrade according to the PSU instructions.

Switch: Life Expectancy for SW

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Leaf Switches

Benefit / Impact

When the remaining life expectancy of a switch is low, a control stack or switch backup must be performed right away.

Risk

When the remaining life is below approximately 2%, a switch replacement should be scheduled, because disk failure is possible and the disk cannot be repaired in the field.

Action / Repair

Replace the switch.

Switch: Consistent Subnet Manager across Switches

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
INFO | All Types | Linux | 1.0.2.0.0+ | Leaf Switches

Benefit / Impact

All switches must use the same Subnet Manager for the Exalogic system to work.

Risk

Inconsistent Subnet Managers across switches will cause severe problems with EECS.

Action / Repair

Contact Oracle Support.

__________________________________________________________________________________________________________

Storage Nodes

Storage Node: Backend (chkBackend.aksh)

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact

Reports any faults, single paths, and mismatches in firmware versions between data disks, write caches (logs), and JBOD SIMs.

Risk

Any faults, single paths, and mismatches in firmware versions can lead to problems and outages.

Action / Repair

If these checks fail, follow the instructions given below, depending on the error messages that are printed in the output:
- ERROR: {hostname} SHELF: {shelf} DISK: {disk} PATH ERROR ONLY FOUND 1 PATHS
Repair: Replace the device if both SIMs are online.
- ERROR: {hostname} SHELF: {shelf} DISK: {disk} DISK FIRMWARE MISMATCH ERROR DETECTED
Repair: If the disk is not in the process of having its firmware upgraded (Maintenance -> System -> Firmware Upgrades), then replace the disk or have a field engineer manually upgrade the device.
- ERROR: {hostname} SHELF: {shelf} DISK: {disk} FAULTED
Repair: Replace the disk.
- ERROR: {hostname} SHELF: {shelf} DISK: {disk} REPORTED AS MISSING - SHOULD IT BE?
Repair: Reinsert or replace the disk. Exalogic should have all of the slots populated with disks.
- ERROR: {hostname} SHELF: {shelf} LOG: {log} PATH ERROR ONLY FOUND 1 PATHS
Repair: Replace the device if both SIMs are online.
- ERROR: {hostname} SHELF: {shelf} LOG: {log} FIRMWARE: {fw} FIRMWARE BELOW MINIMUM RELEASE
Repair: If the disk is not in the process of having its firmware upgraded (Maintenance -> System -> Firmware Upgrades), then replace the disk or have a field engineer manually upgrade the device.
- ERROR: {hostname} SHELF: {shelf} LOG: {log} FIRMWARE: {fw} FIRMWARE BELOW MINIMUM FOR AK VERSION
Repair: If the disk is not in the process of having its firmware upgraded (Maintenance -> System -> Firmware Upgrades), then replace the disk or have a field engineer manually upgrade the device.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} REPORTS FAULTED
Repair: Replace the SIM.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} UNKNOWN STATE
Repair: Reseat the SIM and check whether its firmware is up to date.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} FIRMWARE: {fw} FIRMWARE MISMATCH: {fw} on another SIM
Repair: If the SIM is not in the process of having its firmware upgraded (Maintenance -> System -> Firmware Upgrades), then replace the SIM or have a field engineer manually upgrade the device.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} UNKNOWN PART
Repair: Reseat the SIM and check that its firmware is up to date.
- ERROR: {hostname} SHELF: {shelf} SIM: {sim} NOT PRESENT
Repair: Reinsert or replace the missing SIM.
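
The message-to-repair mapping above can be expressed as a lookup table. The following Python sketch is illustrative only: it assumes each error is printed on one line containing the fixed message text shown above, and it abbreviates the repair wording. The names REPAIRS and suggest_repair are hypothetical.

```python
FIRMWARE_REPAIR = (
    "Unless a firmware upgrade is in progress, replace the device "
    "or have a field engineer upgrade it manually."
)

# (message fragment, repair) pairs taken from the list above. The first
# matching fragment wins, so "REPORTS FAULTED" (SIM) precedes "FAULTED" (disk).
REPAIRS = [
    ("PATH ERROR ONLY FOUND 1 PATHS", "Replace the device if both SIMs are online."),
    ("DISK FIRMWARE MISMATCH ERROR DETECTED", FIRMWARE_REPAIR),
    ("FIRMWARE BELOW MINIMUM", FIRMWARE_REPAIR),
    ("FIRMWARE MISMATCH:", FIRMWARE_REPAIR),
    ("REPORTED AS MISSING", "Reinsert or replace the disk."),
    ("REPORTS FAULTED", "Replace the SIM."),
    ("FAULTED", "Replace the disk."),
    ("UNKNOWN STATE", "Reseat the SIM and check its firmware."),
    ("UNKNOWN PART", "Reseat the SIM and check its firmware."),
    ("NOT PRESENT", "Reinsert or replace the missing SIM."),
]


def suggest_repair(line):
    """Return the repair for one output line, or None if nothing matches."""
    if "ERROR:" not in line:
        return None
    for fragment, repair in REPAIRS:
        if fragment in line:
            return repair
    return None
```

For example, suggest_repair("ERROR: el01sn01 SHELF: 1 DISK: 3 FAULTED") returns "Replace the disk."; the hostname and numbers in that line are made up.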

Storage Node: Cluster (chkCluster.aksh)

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact

Examines the cluster link health of the appliance.

Risk

Any faults within the cluster link may lead to problems and outages.

Action / Repair

If this check fails, follow the instructions given below, depending on the error messages that are printed in the output:

ERROR: {hostname} CLUSTER: FAILOVER - NO ONE OWNS THE RESOURCES!
Repair: Reconfigure the cluster configuration on the owner node.

Storage Node: Datasets (chkDatasets.aksh)

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact

Examines the size of the datasets of the appliance.

Risk

An excessive number of large datasets can cause performance degradation.

Action / Repair

Delete large datasets that are no longer needed and set up a dataset retention policy. Refer to the Action / Repair section of the storage check "ZFSSA Analytics Retention Policy" to set the analytics settings properly. To purge datasets, refer to the "Datasets" section of the "Sun ZFS Storage 7000 Analytics Guide" to find all datasets larger than 2 GB, then select and prune them.

Storage Node: Shadow Migrated Shares (chkShadow.aksh)

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact

Iterates through all of the shares to discover those that are being shadow migrated. An error is generated when a shadow is not moving data, or when it is showing errors.

Risk

Any faults within the shadow migration setup may lead to problems and outages.

Action / Repair

If this check fails, follow the instructions given below depending on the error messages that are printed in the output:

ERROR: {hostname} SHARE: {sharename} SHADOWSOURCE: {shadowsource} ERRORS: {errors} TRANSFERRED: {transferred}
Repair: Reconfigure and restart the shadow migration setup, if applicable.

Storage Node: Space Utilization (chkSpace.aksh)

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact

Checks the space utilization of the storage appliance based on the pool, project, and share size.

Risk

Insufficient space may lead to performance degradation, which can cause problems and outages.

Action / Repair

If the overall pool goes above 80%, reduce the amount of data stored in the disk tray by transferring some of the data to another storage device.
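
As a worked example of the 80% threshold, the following Python sketch computes pool utilization and the amount of data that would have to be moved elsewhere to return to the limit. The byte counts are assumed inputs; how they are read from the appliance is not shown.

```python
import math


def utilization_percent(used_bytes, total_bytes):
    """Percentage of the pool that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def bytes_to_free(used_bytes, total_bytes, limit_percent=80.0):
    """Bytes that must be moved off the pool to get back to the limit (0 if none)."""
    allowed = total_bytes * limit_percent / 100.0
    return max(0, math.ceil(used_bytes - allowed))
```

For a pool of 1000 units with 900 in use, utilization is 90% and 100 units need to be transferred to another storage device.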

Storage Node: Lockd Servers (chkLockd.aksh)

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact

Some applications try to get an exclusive lock against the same file. When the lock reaches a limit, no more new sessions can be started.

Risk

Applications cannot scale out for more than a few concurrent sessions.

Action / Repair

Update to the latest firmware version.

Storage Node: IPMP Failback Configuration (chkIPMPFailback.aksh)

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact

The IPMP failback policy must be "false" and the IPMP interval must be "5000" to match the default settings on the Linux side.

Risk

The failback policy allows the original link to take over the active role after a failover. If this happens within a short period of time, erroneous conditions are possible: applications may still be in recovery mode from the first failure, and a failback at that point can make them fail again.

Action / Repair

IPMP settings can be changed in configuration -> services -> ipmp, or from CLI as shown below:

el02sn01:> configuration services ipmp
el02sn01:configuration services ipmp> show
Properties:
status = online
interval = 10000
failback = true

el02sn01:configuration services ipmp> set interval=5000
interval = 5000 (uncommitted)
el02sn01:configuration services ipmp> set failback=false
failback = false (uncommitted)
el02sn01:configuration services ipmp> commit
el02sn01:configuration services ipmp> show
Properties:
status = online
interval = 5000
failback = false
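
The "show" output above can also be checked programmatically. This Python sketch parses "name = value" lines of the kind shown in the listing and reports properties that differ from the values this check expects. It assumes the output has been captured as text; the function names are illustrative.

```python
# Values this check expects, as stated under Benefit / Impact.
EXPECTED = {"interval": "5000", "failback": "false"}


def parse_properties(show_output):
    """Parse 'name = value' lines from CLI 'show' output into a dict."""
    props = {}
    for line in show_output.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines without '=', such as 'Properties:'
            props[name.strip()] = value.strip()
    return props


def ipmp_mismatches(show_output, expected=EXPECTED):
    """Return {property: actual_value} for each property that is not as expected."""
    props = parse_properties(show_output)
    return {
        name: props.get(name)
        for name, wanted in expected.items()
        if props.get(name) != wanted
    }
```

With the first "show" output in the listing the function reports both properties; with the output after the commit it returns an empty dictionary.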

 

Storage Node: IPMP Standby Configuration (chkIPMPStandby.aksh)

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact

The storage node's IPMP standby field must be non-empty to ensure high availability of the storage if a failure occurs.

Risk

Having an active/active configuration may cause network issues.

Action / Repair

1. Log in to the storage appliance BUI and navigate to "Configuration" -> "Network".
2. Click the "edit" icon next to the IPMP interface (the name of this interface may differ from machine to machine).
3. Make sure that one interface (for example, ibp0) is in Active mode and the other (for example, ibp1) is in Standby mode.

Storage Node: ZFS Snapshot Visibility

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact

ZFS snapshots should be kept hidden by default unless a recovery is being performed.

Risk

Exalogic can potentially use snapshot files instead of the current version of the files.

Action / Repair

Change the snapshot visibility of Exalogic Control project by doing the following:

  1. Open the following URL, replacing <Storage Node Address> with your storage node's hostname or IP address:
    https://<Storage Node Address>:215/#shares/projects=ExalogicControl+snapshots
  2. Change the .zfs/snapshot visibility property from "visible" to "hidden".
  3. Click "Apply" to save the changes, then log out of the storage node.
NOTE: Correction for snapshot visibility and ZFS block size can be done automatically using the script available in Note 1594039.1.

Storage Node: L2ARC Header Size

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact


Excessive size of the L2ARC header can lead to performance degradation on the ZFS storage appliance.

Risk

Significant degradation in performance.

Action / Repair

If the problem persists and accessing the storage appliance is slow, refer to the following MOS note:

<Note 1682174.1>: Exachk Reports: L2ARC Header Size Exceeds the Recommended 3 GB Limit

Storage Node: ZFS Block Size

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact


As a best practice, shares in the ZFS storage appliance should use the optimal block size.

Risk

An inadequate block size in the ZFS storage appliance can cause issues and affect performance.

Action / Repair

Perform the following:

  1. Log in to the ZFS storage appliance BUI.
  2. Click the Shares tab.
  3. Select the edit entry icon for the share with a low block size.
  4. Set the Database record size to 128k.
NOTE: Correction for snapshot visibility and ZFS block size can be done automatically using the script available in Note 1594039.1.

Storage Node: ZFS Maintenance Status

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.2.0.0+ | Storage Node

Benefit / Impact


The maintenance status of the ZFS storage appliance lists recent and active problems. You must ensure that outstanding issues are resolved to maintain the availability of the ZFS shares and the stability of the Exalogic machine.

Risk

Outstanding issues displayed in the ZFS storage appliance maintenance status can affect the availability of the ZFS shares and the stability of the Exalogic machine.

Action / Repair

Investigate and resolve the issues listed in the ZFS storage appliance maintenance status.

Storage Node: ZFS DNS Configuration

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

To allow proper network configuration within the ZFS storage appliance, ensure that you use a host name that matches the DNS.

Risk

An incorrect host name that does not match the DNS can cause network configuration issues. It can also cause Exachk to report wrong results.

Action / Repair

To configure the ZFSSA DNS domain and/or server settings, run the following commands. Please consult with the DNS network administrator for the specific domain or servers for the assigned machine.
el01sn01:> configuration services dns set domain=<domain_value>
el01sn01:> configuration services dns set servers=<servers_value>
Example:
el01sn01:> configuration services dns set domain=my.example.com
el01sn01:> configuration services dns set servers=1.2.3.4

 

Storage Node: NFSv4 Lock Object Leak

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

This is a known defect, number 14781917, in the Linux kernel versions shipped in Exalogic 2.0 Linux physical releases 2.0.0.0.0, 2.0.0.0.1, 2.0.0.0.2, and 2.0.0.0.3.

Risk

Due to this defect, unused LockStateID entries are not disposed of on the ZFS storage appliance and keep growing at a constant rate. When the number of unused LockStateID entries reaches 1 million (which is the maximum) on the ZFS storage appliance, any further NFSv4 calls from Linux clients will receive file lock errors.

Action / Repair

Upgrade to January 2013 PSU version 2.0.3.0.1 or later, which has the Linux kernel version containing the fix for this issue.
Note 1540532.1 - NFS File Lock Issues Caused By Increasing Unused LockStateID Entries In ZFS Storage Appliance In Exalogic

 

Storage Node: Nfsmapid Domain Matching with NIS server

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

On ZFSSA, the NIS domain setting must match the NFSv4 identity domain setting. A mismatch between the two will cause file ownership to show as nobody:nobody.

Risk

A mismatch between the NIS domain and NFSv4 identity domain will cause file ownerships on NFSv4 mounts to be nobody:nobody.

Action / Repair

Check the two domain settings with the following commands, and set both to the same, correct value.
sn01:> configuration services nfs show
Properties:
<status> = online
version_min = 2
version_max = 4
nfsd_servers = 500
grace_period = 90
mapid_domain = example.com << This is the NFSv4 domain
enable_delegation = true
krb5_realm =
krb5_kdc =
krb5_kdc2 =
krb5_admin =
sn01:> configuration services nis show
Properties:
<status> = online
domain = example.com << This is the NIS domain, must match
broadcast = false
ypservers = somewhere.example.com
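
The comparison between the two "show" outputs above can be sketched in Python. This is an illustration, not part of Exachk: it assumes both outputs have been captured as text, compares the values exactly as printed, and ignores any trailing "<< ..." annotation like those added to the listing above.

```python
def get_property(show_output, name):
    """Return the value of one 'name = value' property, or None if absent."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == name:
            # Drop a trailing "<< ..." annotation such as the ones in the listing.
            return value.split("<<")[0].strip()
    return None


def domains_match(nfs_show, nis_show):
    """True when the NFSv4 mapid_domain equals the NIS domain."""
    nfs_domain = get_property(nfs_show, "mapid_domain")
    nis_domain = get_property(nis_show, "domain")
    return nfs_domain is not None and nfs_domain == nis_domain
```

A False result corresponds to the mismatch that makes file ownership appear as nobody:nobody on NFSv4 mounts.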

 

Storage Node: Softring Workflow

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

This is a performance tuning workflow for the ZFS storage appliance to allow optimal performance on Exalogic.

Risk

Under heavy load, you might experience performance degradation while accessing shares on the ZFS storage appliance.

Action / Repair

Upgrade to January 2013 PSU version 2.0.3.0.1 or later, which includes a tuning workflow for the ZFS storage appliance named "Provide work around for CR 7122961".

Storage Node: ZFSSA Analytics Retention Policy

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

ZFSSA Analytics stores periodic snapshots of the system and retains them for a specified period of time. These snapshots are used for backup and recovery in case a failure occurs. You must consider the available disk space while setting up the retention policy, to ensure that system snapshots do not use up too much disk space.

Risk

If the retention policy is not set carefully within the recommended values, dataset growth may exceed the available disk space, which could cause significant performance degradation on the ZFS storage appliance.

Action / Repair

The maximum recommended settings for each property are as follows:

  • 1 month for retain_second_data
  • 2 months for retain_minute_data
  • 3 months for retain_hour_data

To configure the ZFSSA Analytics Retention Policy settings, use the following procedure:

  1. Make sure that the PSU installed on the system is the April 2014 PSU or later. If an earlier PSU is installed, install the April 2014 PSU or the latest one.

  2. Change the value

    el01sn01:> analytics settings set <property_type>=<property_value_in_hours>

    Example:
    el01sn01:> analytics settings set retain_second_data=672

  3. Commit the change

    el01sn01:> analytics settings commit

  4. Verify the change

    el01sn01:> analytics settings show
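
Because the property values are given in hours, the recommended maximums have to be converted from months. The example value 672 equals 28 x 24, so the Python sketch below assumes a 28-day month; that month length is inferred from the example and is not stated explicitly in this note.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 28  # assumption inferred from the example value 672 = 28 * 24

# Maximum recommended retention, in months, from the list above.
MAX_MONTHS = {
    "retain_second_data": 1,
    "retain_minute_data": 2,
    "retain_hour_data": 3,
}


def max_retention_hours():
    """Convert the recommended maximums from months to hours."""
    return {
        prop: months * DAYS_PER_MONTH * HOURS_PER_DAY
        for prop, months in MAX_MONTHS.items()
    }


def settings_commands():
    """Build the 'analytics settings set' command text for each property."""
    return [
        "analytics settings set {}={}".format(prop, hours)
        for prop, hours in max_retention_hours().items()
    ]
```

Under that assumption the three maximums are 672, 1344, and 2016 hours.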

Storage Node: ZFS Check Head Status

Alert Type | Machine Type | Exalogic Version | Applicable To
ERROR | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

Clustering is recommended for the ZFS appliance. When it is configured correctly, only one storage head is actively functioning for each ZFS appliance. One head should be set as active for its own property description and ready for its peer property description, or vice versa. One head should be set as AKCS_OWNER for its own property state and AKCS_STRIPPED for its peer property state, or vice versa. This ensures that the storage heads are configured and run correctly.

Risk

When clustering is configured for the ZFS appliance, storage head description and state properties that show more than one active head indicate a configuration or hardware error, which can cause issues with the availability feature of clustering.

Action / Repair

Contact Oracle Support. For more information, refer to the Sun ZFS Storage 7000 System Administration Guide - Cluster (http://docs.oracle.com/cd/E22471_01/html/820-4167/configuration__cluster.html).
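
The state rule described under Benefit / Impact can be written as a small predicate. The Python sketch below covers only the two state combinations named in this check and treats every other combination as a failure, which is an assumption. How the state and peer state values are read from the appliance is not shown.

```python
# The two healthy combinations named in this check: (own state, peer state).
VALID_STATE_PAIRS = {
    ("AKCS_OWNER", "AKCS_STRIPPED"),
    ("AKCS_STRIPPED", "AKCS_OWNER"),
}


def cluster_states_ok(state, peer_state):
    """True when exactly one head owns the resources and the other is stripped."""
    return (state, peer_state) in VALID_STATE_PAIRS
```

For example, a head that reports AKCS_OWNER for itself and AKCS_OWNER for its peer fails the check.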

Storage Node: ZFS Mirror Profile Status

Alert Type | Machine Type | Exalogic Version | Applicable To
ERROR | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

When data is mirrored, it reduces capacity by half, but yields a highly reliable and high-performing system. Data mirroring is recommended when space is considered ample, but performance is at a premium. An Exalogic system only has one pool, therefore the storage configuration profile status "mirror" shows up on the active head.

Risk

According to the ZFS Appliance Administration Guide, while arbitrary numbers of pools are supported, creating multiple pools with the same redundancy characteristics owned by the same cluster head is not advised. Doing so will result in poor performance, suboptimal allocation of resources, artificial partitioning of storage, and additional administrative complexity.

Action / Repair

Refer to the Oracle ZFS Storage Appliance Administration Guide - Storage Configuration (http://docs.oracle.com/cd/E27998_01/html/E48433/configuration__storage.html#configuration__storage__configuration_rules_and_guidelines).

 

Storage Node: ZFS Share Quota

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

As a best practice, projects and shares in the ZFS storage appliance should not use more than 85% of the space.

Risk

Inadequate space in the ZFS storage appliance and its shares and projects can cause issues and affect performance.

Action / Repair

Clean up the ZFS storage appliance and reallocate resources in shares and projects to keep the space usage under 85%.

 

Storage Node: Check for ZFSSA Installed RAM

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

For optimal performance, it is recommended that the ZFS storage appliance have a minimum of 90 GB of RAM.

Risk

Under heavy load, you might experience performance degradation while accessing shares on the ZFS storage appliance if the RAM size is less than the recommended size. Some systems with very low memory configurations might not run reliably.

Action / Repair

Upgrade the RAM to at least 90 GB.

Storage Node: ZFS Dedup Status

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

Data deduplication controls whether duplicate copies of data are eliminated in the ZFS appliance. It is synchronous, pool-wide, and block-based, and can be enabled on a per-project or per-share basis. If your data doesn't contain any duplicates, enabling Data Deduplication will add overhead (a more CPU-intensive checksum and on-disk deduplication table entries) without providing any benefit. If your data does contain duplicates, enabling Data Deduplication will save space by storing only one copy of a given block regardless of how many times it occurs. The recommended practice for Exalogic systems is not to enable deduplication.

Risk

By its nature, deduplication requires modifying the deduplication table when a block is written to or freed. If the deduplication table cannot fit in DRAM, writes and frees may induce significant random read activity where there was previously none. As a result, the performance impact of enabling deduplication can be severe.

Action / Repair

According to the Administration Guide, you can disable data deduplication by deselecting the Data Deduplication checkbox on the general properties screen for projects or shares. Oracle ZFS Storage Appliance Administration Guide - Data Deduplication (http://docs.oracle.com/cd/E27998_01/html/E48433/shares__shares__general.html#shares__shares__general__data_deduplication)

 

Storage Node: ZFS Disk Timeout Warning

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

Finds disk timeout warnings recorded in the log file within the last 7 days.

Risk

A disk timeout warning could point to a disk failure that can be addressed well before the disk actually fails.

Action / Repair

Open an SR for further assistance if there is any performance issue accessing the ZFSSA.

 

Storage Node: ZFS Disk Health

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

Finds possible faulted disks among all disks.

Risk

A faulted disk may cause data corruption or even system failure.

Action / Repair

Replace the faulted disk.

 

Storage Node: NFSv4 Delegation

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

Disabling the NFSv4 delegation feature helps avoid hanging problems within ZFS storage. These typically happen when there are multiple concurrent writes to the same file on an NFS-mounted directory.

Risk

NFS mount points on the compute nodes may hang, which leads to problems and outages.

Action / Repair

Refer to the following note:

<Note 1481713.1>: NFSv4 mount directories hang on Exalogic Machine

REFERENCE:

Sun ZFS Storage 7000 System Administration Guide (http://docs.oracle.com/cd/E22471_01/pdf/820-4167.pdf) 

Storage Node: IPMP configuration on ZFS node

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

IP network multipathing (IPMP) is used primarily to increase redundancy, so that network connectivity is unaffected by the failure of a single component, be it a physical network port, a cable, or a switch. This check determines whether IPMP is configured correctly.

Risk

Network connectivity may be affected by the failure of a single component.

Action / Repair

Link-based failure detection uses properties of the network device driver to check whether the link to the network is active. To enable link-based failure detection, make sure that the test interfaces in an IPMP group do not have traditional IP addresses configured. Instead, they should be configured with the address and netmask 0.0.0.0/8. Only the IPMP interface itself should be configured with a valid IP address and netmask for the appropriate subnet.

Storage Node: ZFS Slot Health

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

Identifies faulted slots.

Risk

A faulted slot can cause performance degradation.

Action / Repair

Service or replace faulted slots.

Storage Node: Verify ZFS node disk storage pools

Alert Type | Machine Type | Exalogic Version | Applicable To
WARNING | All Types | 1.0.0.2.0+ | Storage Node

Benefit / Impact

This check determines the health of each pool from the state of all the pool's devices.

Risk

Unhealthy pools may go undetected.

Action / Repair

Follow the instructions in "ZFS Troubleshooting and Pool Recovery".

__________________________________________________________________________________________________________

Oracle VM Manager (OVMM)

OVMM: Oracle VM Manager (OVMM) Service Status

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
FAIL | All Types | Linux | 2.0.1.0.0+ | OVMM

Benefit / Impact

Ensuring that Oracle VM Manager is running is critical, as it provides a central location to manage Oracle VM Server and virtual machines.

Risk

If Oracle VM Manager goes down, problems and outages may occur.

Action / Repair

Investigate the issue and notify Oracle Support for further assistance.

Links

Note 1501348.1 - Identifying And Resolving Oracle VM Issues In Exalogic Virtual Environment.

OVMM: Database Corruption

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 2.0.1.0.0+ | OVMM

Benefit / Impact

Ensures the database used by Oracle VM Manager is operating smoothly.

Risk

Corrupted data in the database can cause unexpected errors including making the management console inaccessible.

Action / Repair

Contact Oracle Support.

Links

Note 1501348.1 - Identifying And Resolving Oracle VM Issues In Exalogic Virtual Environment.

OVMM: Sufficient CPU resources for the Oracle VM Manager

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
FAIL | All Types | Linux | 2.0.1.0.0+ | OVMM

Benefit / Impact

Sufficient CPU resources are necessary for the vServer to run optimally.

Risk

A lack of sufficient CPU resources can affect performance.

Action / Repair

Contact Oracle Support.

OVMM: OVMM Pool VM Start Policy

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | Physical, Dom0

Benefit / Impact

OVMM Pool VM Start Policy manages which servers a VM will be started on. For Exalogic, it is designed to start on the current server.

Risk

Misconfiguration of the OVMM Pool VM Start Policy may cause the VM creation job to be stopped by EMOC.

Action / Repair

Refer to the latest Exachk User Guide under the heading "Verifying and Enabling Passwordless SSH to the Oracle VM Manager CLI". The link to the latest Exachk User Guide can be found on <Note 1449226.1>. For further information on setting up passwordless SSH to the Oracle VM Manager CLI, please refer to the document "OVM CLI". For more information on setting up new pools and adding servers to the new pools with proper parameter values, please refer to the document "B.9.2.3 Create the Required Pools and Add Servers to the New Pools".

OVMM: Check Connection Channels Before Upgrade

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
ERROR | All Types | Linux | 1.0.0.2.0+ | OVMM

Benefit / Impact

The attached Python script checks whether the default configuration for WebLogic network channels has changed. The script has WLST embedded in it, which connects to the AdminServer and inspects the network channel configuration. The connection channel configuration can affect the upgrade process.

Risk

When the connection channel on WLS port 7002 is used, the OVM upgrade script will fail.

Action / Repair

Set the connection channel properly.

OVMM: Sufficient RAM for the Oracle VM Manager

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
FAIL | All Types | Linux | 2.0.1.0.0+ | OVMM

Benefit / Impact

Sufficient memory is necessary for the vServer to run optimally.

Risk

A lack of memory allocated to Oracle VM Manager can affect performance.

Action / Repair

Contact Oracle Support.
__________________________________________________________________________________________________________

Database (DB)

DB: Oracle Database Service Status

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
FAIL | All Types | Linux | 2.0.1.0.0+ | DB

Benefit / Impact

Ensuring Oracle Database is running is critical for database management within applications to function properly.

Risk

If the database is not up, applications will not be able to run properly as they need to store and load data from the database.

Action / Repair

Investigate the issue and notify Oracle Support for further assistance.

Links
Note 1501228.1 - How To Start A Stopped Exalogic Control Stack In An Exalogic Virtual Environment.

DB: Sufficient CPU resources for the Database Control vServer

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
FAIL | All Types | Linux | 2.0.1.0.0+ | DB

Benefit / Impact

Sufficient CPU resources are necessary for the vServer to run optimally.

Risk

A lack of CPU resources for the Database can affect performance.

Action / Repair

Contact Oracle Support.

DB: Password Expiration Status for OVS User on DB Control vServer

Alert Type | Machine Type | OS Type | Exalogic Version | Applicable To
WARNING | All Types | Linux | 1.0.0.2.0+ | DB

Benefit / Impact

If no password expiration is set, then the OVS user will not require a password reset and cannot lose access due to an expired ID.

Risk

If the password expiry date is set and the password expires, the control stack will stop working as the OVS user won't be available.

Action / Repair

1. Log in to the DB control vServer.

2. Switch to the oracle user:
   su - oracle

3. Run the following commands:
   ORACLE_SID=elctrldb
   ORAENV_ASK=NO
   . oraenv >/dev/null 2>&1
   unset ORAENV_ASK
   sqlplus / as sysdba
   CREATE PROFILE OVS_PROFILE LIMIT PASSWORD_LIFE_TIME UNLIMITED;
   ALTER USER OVS PROFILE OVS_PROFILE;
   exit;

DB: Sufficient RAM for the Database Control vServer

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
FAIL        All Types     Linux    2.0.1.0.0+        DB

Benefit / Impact

Sufficient memory is necessary for the vServer to run optimally.

Risk

A lack of memory for the Database can affect performance.

Action / Repair

Contact Oracle Support.
__________________________________________________________________________________________________________

Enterprise Controller (EC)

EC: Enterprise Controller Service Status

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
FAIL        All Types     Linux    2.0.1.0.0+        EC

Benefit / Impact

Enterprise Controller is the primary browser user interface for managing the virtual data center. Ensuring that Enterprise Controller is running is critical since it functions as the core of Enterprise Manager Ops Center.

Risk

If Enterprise Controller is not online, critical communication within the applications and other external data sources may be disrupted. This may lead to problems and outages for managing the virtual data center.

Action / Repair

Investigate the issue and notify Oracle Support for further assistance.

Links

Note 1501228.1 - How To Start A Stopped Exalogic Control Stack In An Exalogic Virtual Environment.

EC: Excessive Jobs within EMOC

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        EC

Benefit / Impact

Clearing unnecessary jobs ensures that important jobs can be completed quickly.

Risk

A large number of pending jobs can prohibit newer jobs from starting.

Action / Repair

Stop and delete running jobs that are unnecessary. If automatic jobs are accumulating, please contact Oracle Support.

EC: Connectivity To EMOC

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        EC

Benefit / Impact

For the whole management system to work, all different Exalogic Control components must be up and running.

Risk

Unavailability of any Exalogic Control component will result in loss of management functionality.

Action / Repair

Ensure that the failed Exalogic Elastic Cloud Software (EECS) component is available. It might be necessary to restart it.

Links

Note 1501228.1 - How To Start A Stopped Exalogic Control Stack In An Exalogic Virtual Environment.

EC: Network Interface Connectivity for Control vServers

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        EC

Benefit / Impact

For the whole management system to work, all the network interfaces for the Exalogic Control vServers must be running and pingable.

Risk

Unavailability of any interface on an Exalogic Control vServer will cause problems with managing the virtual datacenter.

Action / Repair

Verify the configuration of the failed vServer and fix the issue.

Links

Note 1501228.1 - How To Start A Stopped Exalogic Control Stack In An Exalogic Virtual Environment.

EC: Storage Network Interface Connectivity

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        EC

Benefit / Impact

For the whole system to work optimally, all the network interfaces for the storage appliance must be running and pingable.

Risk

Unavailability of any interface will disrupt communication to the storage appliance.

Action / Repair

Verify the configuration of the storage appliance and fix the issue.

Links

Note 1501228.1 - How To Start A Stopped Exalogic Control Stack In An Exalogic Virtual Environment.

EC: Compute Node (OVS) Network Interface Connectivity

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        EC

Benefit / Impact

For the Exalogic machine to operate, all the network interfaces of the compute nodes must be running and pingable.

Risk

Unavailability of any interface of a compute node will cause problems with Exalogic Control and vServers.

Action / Repair

Verify the configuration of the compute nodes and fix the issue.

EC: Sufficient CPU Resources for the Enterprise Controller

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
FAIL        All Types     Linux    2.0.1.0.0+        EC

Benefit / Impact

Sufficient CPU resources are necessary for the vServer to run optimally.

Risk

A lack of CPU resources for the Enterprise Controller can affect performance.

Action / Repair

Contact Oracle Support.

EC: Uce_scheduler status check

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical, Dom0

Benefit / Impact

Avoids a class loading error that can cause continuous segmentation faults in uce_scheduler.

Risk

Continuous segmentation fault caused by the class loading error of uce_scheduler can cause the virtual machines to restart.

Action / Repair

Turn off the uce scheduler by running the following commands:
# /opt/sun/xvmoc/bin/svcadm disable application/scn/uce-scheduler
# /etc/init.d/uce_scheduler stop
# chkconfig uce_scheduler off

EC: Valid Hostname within /etc/hosts in Enterprise Controller

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        EC

Benefit / Impact

/etc/hosts file provides mapping information between hostnames and IP addresses. This file is particularly useful as a cache by the host node to resolve its hostname information when DNS service is unavailable.

Risk

When DNS service is unavailable, an incorrect hostname entry in the /etc/hosts file can lead to a severe outage and loss of data.

Action / Repair

Verify that there is an entry for the hostname in the /etc/hosts file that maps to the EoIB external management interface.

EC: OVS database schema BLOB corruption check

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
ERROR       All Types     Linux    1.0.0.2.0+        EC

Benefit / Impact

If no BLOB corruption is found, vDC functionality remains intact.

Risk

New guest VMs cannot be created, and existing guest VMs cannot be managed.

Action / Repair

Follow Note 1509888.1 for recovery. To prevent a recurrence, contact Oracle Support.

Note 1509888.1 - How To Recover Exalogic Virtual Environment After OVM Manager DB Problems In EECS v2.0.1.0.x and EECS v2.0.4.0.x Virtual

EC: Sufficient RAM for the Enterprise Controller

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
FAIL        All Types     Linux    2.0.1.0.0+        EC

Benefit / Impact

Sufficient memory is necessary for the vServer to run optimally.

Risk

A lack of memory for the Enterprise Controller can affect performance.

Action / Repair

Contact Oracle Support.

__________________________________________________________________________________________________________

Proxy Controller (PC)

PC: Proxy Controller Service Status

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
FAIL        All Types     Linux    2.0.1.0.0+        PC

Benefit / Impact

Proxy controllers are one of the components of Oracle Enterprise Manager Ops Center. The proxy controllers must be up to perform critical functions such as carrying out application-related jobs, storing data such as OS images, and interacting directly with managed assets.

Risk

If the proxy controller goes down, problems and outages may occur.

Action / Repair

Investigate the issue and notify Oracle Support for further assistance.

Links

Note 1501228.1 - How To Start A Stopped Exalogic Control Stack In An Exalogic Virtual Environment.

PC: Sufficient CPU resources for the Proxy Controller

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
FAIL        All Types     Linux    2.0.1.0.0+        PC

Benefit / Impact

Sufficient CPU resources are necessary for the vServer to run optimally.

Risk

A lack of CPU resources for the Proxy Controller can affect performance.

Action / Repair

Contact Oracle Support.

PC: Valid Hostname within /etc/hosts in Proxy Controller

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        PC

Benefit / Impact

/etc/hosts file provides mapping information between hostnames and IP addresses. This file is particularly useful as a cache by the host node to resolve its hostname information when DNS service is unavailable.

Risk

When DNS service is unavailable, an incorrect hostname entry in the /etc/hosts file can lead to a severe outage and loss of data.

Action / Repair

Verify that there is an entry for the hostname in the /etc/hosts file that maps to the eth0 interface.

PC: Sufficient RAM for the Proxy Controller

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
FAIL        All Types     Linux    2.0.1.0.0+        PC

Benefit / Impact

Sufficient memory is necessary for the vServer to run optimally.

Risk

A lack of memory for the Proxy Controller can affect performance.

Action / Repair

Contact Oracle Support.

__________________________________________________________________________________________________________

Multiple Components

Multiple Components: Kernel Out-of-Memory Errors

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        All Compute Nodes (Physical, Dom0) and Control vServers

Benefit / Impact

The kernel out-of-memory error indicates a potential resource issue.

Risk

If the cause of the kernel out-of-memory error is not identified, service can be disrupted.

Action / Repair

Please check the kernel log, /var/log/message*, and identify the process and cause of the out-of-memory error.

Multiple Components: Control Virtual Server's Uptime

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        All Control vServers

Benefit / Impact

Verify that the vServers have not experienced an unexpected restart.

Risk

Unexpected reboots usually indicate more serious underlying problems.

Action / Repair

Investigate the cause of the reboot. If reboots happen frequently, contact Oracle Support.

Multiple Components: NFSv3 Usage Verification for Control vServers Shares

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        All Control vServers

Benefit / Impact

Ensure that NFSv3 is being used for Exalogic Control shares to prevent control stack instability.

Risk

Exalogic Control shares not using NFSv3 can destabilize the communication between the storage and control vServers.

Action / Repair

NFSv3 is used by default. If it was changed, revert to NFSv3 for the Exalogic Control stack.

Multiple Components: Gateway Configuration for non-Switch

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.2.0.0+        Physical Compute Node, EC, OVMM

Benefit / Impact

Valid gateway configuration of the switch needs to be ensured for the Exalogic Machine to perform optimally.

Risk

Invalid gateway configuration will cause communication issues of the component with others outside its subnet.

Action / Repair

Correct the appropriate network configuration file with gateway information.

A gateway should be specified in the format "GATEWAY=XX.XX.XX.XX" in the /etc/sysconfig/network file.
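
The format rule above can be checked with a short script. The following is a minimal sketch, not the exachk implementation: the function names are hypothetical, and it assumes the file uses simple KEY=VALUE lines and that the gateway is an IPv4 address.

```python
import ipaddress


def read_gateway(network_text):
    """Return the GATEWAY value from /etc/sysconfig/network-style text, or None."""
    gateway = None
    for line in network_text.splitlines():
        line = line.strip()
        # Skip comments and lines that are not KEY=VALUE assignments.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "GATEWAY":
            # Later assignments override earlier ones; drop optional quotes.
            gateway = value.strip().strip("\"'")
    return gateway


def gateway_is_valid(network_text):
    """Return True if a GATEWAY entry exists and parses as an IPv4 address."""
    gateway = read_gateway(network_text)
    if not gateway:
        return False
    try:
        ipaddress.IPv4Address(gateway)
    except ValueError:
        return False
    return True
```

To use it, pass the contents of /etc/sysconfig/network to gateway_is_valid(); a False result means the GATEWAY line is missing or malformed.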

Multiple Components: MTU for InfiniBand Link in Control vServers

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        Control vServers

Benefit / Impact

A correct MTU size for the InfiniBand Link ensures that the communication protocol layer in InfiniBand performs optimally.

Risk

Incorrect MTU size can slow down InfiniBand Link and cause latency issues.

Action / Repair

Please refer to <Note 1624434.1>: Revised MTU Tuning Recommendations for the IPoIB Related Network Interfaces on Exalogic Physical and Virtual Environments.

Multiple Components: TCP Tuning for Control vServers

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        All Control vServers

Benefit / Impact

The tuning for TCP consists of three components:

1. net.ipv4.tcp_timestamps should be set to 1 to avoid the PAWS issue (Protect Against Wrapped Sequence - RFC 1323).

2. net.ipv4.tcp_window_scaling should be set to 1 to allow efficient transfer of data for high bandwidth-delay products.

3. net.ipv4.tcp_sack should be set to 1 to enable selective acknowledgement in mitigating duplicate acknowledgement and/or retransmission issues (RFC 2018).

Risk

Without this tuning, the TCP may not perform at an optimum level.

Action / Repair

Add the recommended tuning parameters into the /etc/sysctl.conf file.
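
To see which of the three parameters still need to be added, the file can be compared against the recommended values. This is a minimal sketch with hypothetical function names; it assumes plain "key = value" lines in /etc/sysctl.conf and does not look at the running kernel values.

```python
# Recommended values taken from the check description above.
RECOMMENDED = {
    "net.ipv4.tcp_timestamps": "1",
    "net.ipv4.tcp_window_scaling": "1",
    "net.ipv4.tcp_sack": "1",
}


def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict of key -> value strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments ('#' or ';') and lines without '='.
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def tcp_tuning_gaps(text):
    """Return {key: current value or None} for settings that differ from RECOMMENDED."""
    current = parse_sysctl(text)
    return {
        key: current.get(key)
        for key, wanted in RECOMMENDED.items()
        if current.get(key) != wanted
    }
```

Passing the contents of /etc/sysctl.conf to tcp_tuning_gaps() returns an empty dict when all three settings are present with the value 1.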

Multiple Components: Swap Space for Control vServers

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        All Control vServers

Benefit / Impact

The level of swappiness controls the amount of memory reclaim distress at a point where the kernel decides to start reclaiming mapped pages. If the swap space is unused, the kernel has an adequate amount of free physical memory, which allows Exalogic Elastic Cloud to perform at its optimal level.

Risk

The usage of swap space indicates that the kernel is running out of free physical memory. Lack of free physical memory can lead to degraded performance.

Action / Repair

Clear up the used memory.

Multiple Components: Lockd Configuration for Control vServers

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
INFO        All Types     Linux    1.0.0.2.0+        Multiple Components

Benefit / Impact

Lock recovery after a reboot is critical to maintain data integrity and to prevent unnecessary application hangs. To help rpc.statd match SM_NOTIFY requests to NLM requests, this best practice should be observed.

Risk

NFSv3 locks may not be recovered after a reboot.

Action / Repair

NOTE: The control vServer becomes unavailable during this period, causing applications within the control vServer to stop running. To manage the impact of a temporary loss of service, prepare your environment.

1. Edit /etc/sysconfig/nfs file

2. Change the following line:
   From: #STATDARG=""
   To: STATDARG="-n `uname -n`"

3. Reboot the control vServer. Please follow the correct order for safe shutdown and restart of control stack vServers. Refer to following MOS Notes:

<Note 1594223.1>: How To Stop and Start the Entire Exalogic Control Stack In An Exalogic EECS v2.0.6.0.0 and later Virtual releases

<Note 1501228.1>: How To Start A Stopped Exalogic Control Stack In EECS v2.0.1.0.x and EECS v2.0.4.0.x Virtual Environments 

Multiple Components: Name Service Switch Config File Permission Status in Compute Nodes

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
ERROR       All Types     Linux    1.0.0.2.0+        Dom0, Physical

Benefit / Impact

In addition to /etc/hosts file having the alias for name resolution, /etc/nsswitch.conf also uses “files” for resolution. The /etc/nsswitch.conf file should have 644 rights so that /etc/hosts can be used by everyone.

Risk

If /etc/nsswitch.conf is not given the 644 permission, /etc/hosts will be ignored by anyone but root. In this case, WebLogic Server may fail to start and may complain about incorrect network configuration since it cannot resolve the hostname used as the listen address.

Action / Repair

Change the rights of /etc/nsswitch.conf to 644.
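
The permission bits can be verified before and after the change. The sketch below is illustrative only; has_mode is a hypothetical helper name, and it compares just the permission bits, not ownership.

```python
import os
import stat


def has_mode(path, expected=0o644):
    """Return True if the permission bits of path equal expected (default 644)."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected
```

For example, has_mode("/etc/nsswitch.conf") returning False indicates that the permissions still need to be changed to 644.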

Multiple Components: Local Partition Usage Limit

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
ERROR       All Types     Linux    1.0.0.2.0+        Physical, Dom0, All Control vServers, Guest VM

Benefit / Impact

Keeping enough free space on the local disk allows the system to operate optimally.

Risk

System performance will be affected.

Action / Repair

Free up disk space on the local disk. Oracle recommends most, if not all, user data be stored on the storage appliance.

Multiple Components: Check Root Space in DB and EC VM Before Upgrade

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
ERROR       All Types     Linux    1.0.0.2.0+        DB, EC

Benefit / Impact

Sufficient space is required during the process of upgrade. This will allow upgrade to proceed successfully.

Risk

The system might be left in an unstable condition if the upgrade fails due to insufficient space.

Action / Repair

Compute Node: 500 MB (to clean up space, run: "yum clean all" and if space is still needed, run: "> /var/log/devmon.log" to empty that log file on the compute node)

Multiple Components: Cross check hostname with /etc/hosts in Guest VMs

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical, Guest VM

Benefit / Impact

Applications refer to the hostname in /etc/hosts. A matching entry keeps application functionality from being interrupted.

Risk

Functionality of application might get interrupted.

Action / Repair

Check the hostname by running "hostname --s" and add the output, together with the IP address, to /etc/hosts.
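
The cross-check can also be scripted. The following is a minimal sketch with hypothetical function names; it assumes the standard /etc/hosts layout of an address followed by one or more names, and the host name and address in the usage note are illustrative only.

```python
def hosts_entries(hosts_text):
    """Yield (address, [names]) for each entry in /etc/hosts-style text."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            yield fields[0], fields[1:]


def addresses_for(hostname, hosts_text):
    """Return the addresses that /etc/hosts maps to the given hostname."""
    return [ip for ip, names in hosts_entries(hosts_text) if hostname in names]
```

If addresses_for("el01cn01", text) returns an empty list for the short hostname of the node, the entry is missing and should be added.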

Multiple Components: Check Root Space in OVM PC VM and Compute Node Before Upgrade

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
ERROR       All Types     Linux    1.0.0.2.0+        OVMM, PC, Dom0

Benefit / Impact

Sufficient space is required during the process of upgrade. This will allow upgrade to proceed successfully.

Risk

The system might be left in an unstable condition if the upgrade fails due to insufficient space.

Action / Repair

Compute Node: 1 GB (to clean up space, run: "yum clean all" and if space is still needed, run: "> /var/log/devmon.log" to empty that log file on the compute node)

Multiple Components: Bash Vulnerability Update Check

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical, Dom0, All Control vServers, Guest VM

Benefit / Impact

Applying the bash vulnerability patch addresses a critical security vulnerability with bash that allows a malicious user to execute arbitrary commands and gain unauthorized access to the system.

Risk

Not applying the patch exposes the Exalogic machine to a critical security vulnerability that can potentially allow a malicious user to execute arbitrary commands and gain unauthorized access to the system.

Action / Repair

Please see the following links for more information about CVE-2014-6271 and to fix the bash code injection vulnerability:

Vulnerability Summary for CVE-2014-6271 (http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-6271)
Note 1930090.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability Document for Oracle Solaris
Note 1930120.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability Document for Oracle Linux
Oracle Security Alert for CVE-2014-7169 (http://www.oracle.com/technetwork/topics/security/alert-cve-2014-7169-2303276.html)
Note 1929881.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability for Oracle Exalogic Linux Physical and Virtual Racks

Multiple Components: Verify ILOM open issue

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Multiple Components

Benefit / Impact

Any open issues found in ILOM usually indicate HW degradation or failure.

Risk

If open issues are not addressed promptly, a loss of service may result.

Action / Repair

Address the issue by contacting Oracle Support.

Multiple Components: Validate Control VMs JDK Version

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
FAIL        All Types     Linux    1.0.0.2.0+        Multiple Components

Benefit / Impact

Ensuring control VM services are running under supported JDK version is critical to ensure functional compatibility.

Risk

Failure to ensure that control VM services are running under supported JDK version can lead to functional issues with EMOC and failure to execute patching.

Action / Repair

Investigate the issue and notify Oracle Support for further assistance.

Multiple Components: Version Consistency on All Switches

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Multiple Components

Benefit / Impact

Version consistency among switches avoids problems with hardware/software configuration.

Risk

Version inconsistency among switches can cause functional issues.

Action / Repair

Please confirm the EECS patch level of the system. Continue finishing the switch FW upgrade according to PSU instructions.

Multiple Components: Ghost Vulnerability

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Multiple Components

Benefit / Impact

For the whole system to work optimally, and to avoid problems related to Ghost Vulnerability, the installed image needs to be verified to be at its supported latest version.

Risk

If a system is not patched, it is exposed to security vulnerability.

Action / Repair

Please apply Patch 20448956 and refer to the referenced MOS Note for the steps on how to apply the patch, and rerun exachk post patching to validate. Note 1965975.1 - CVE-2015-0235 - Ghost Vulnerability - Patch Availability for Oracle Exalogic Linux Physical and Virtual Racks (Doc ID 19

 

Multiple Components: IPoIB in Connected Mode

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Multiple Components

Benefit / Impact

Having IPoIB in Connected Mode ensures that Internet Protocol (IP) works properly over InfiniBand network.

Risk

If the Connected Mode is not set, IPoIB might not work properly.

Action / Repair

1. If it is OFED R2 version, check the following:
   - In the /etc/sysconfig/network-scripts/ifcfg-ib* files, check whether "CONNECTED_MODE=yes" is set.
   - If "CONNECTED_MODE" is not set to "yes", modify the /etc/sysconfig/network-scripts/ifcfg-ib* file and change the property to "yes".

2. If it is not OFED R2 version, check the following:
   - If SET_IPOIB_CM and/or IPOIB_LOAD is not set to "yes", modify the /etc/infiniband/openib.conf file and change these properties to "yes".
   - If the content of /sys/class/net/ib0/mode and /sys/class/net/ib1/mode is not "connected", modify the content of these files to "connected".

   After modifying the files above, restart InfiniBand by running the command:
   /etc/init.d/openibd restart
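
The two file checks described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the ifcfg files contain simple KEY=VALUE lines and that the sysfs mode file holds a single word.

```python
def connected_mode_enabled(ifcfg_text):
    """Return True if ifcfg-ib*-style text sets CONNECTED_MODE to yes."""
    enabled = False
    for line in ifcfg_text.splitlines():
        line = line.strip()
        # Skip comments and lines that are not KEY=VALUE assignments.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "CONNECTED_MODE":
            # The last assignment wins, as it would when the file is sourced.
            enabled = value.strip().strip("\"'").lower() == "yes"
    return enabled


def link_is_connected(mode_text):
    """Return True if the content of a /sys/class/net/<ib>/mode file is 'connected'."""
    return mode_text.strip() == "connected"
```

Pass the contents of each ifcfg-ib* file to connected_mode_enabled(), or the contents of each mode file to link_is_connected(), depending on the OFED version in use.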

REFERENCE:

<Note 1982645.1>: Exachk Reporting "IPoIB is not in connected mode" WARNING Message On Exalogic 2.0.6.2.0 Linux Physical Racks 

Multiple Components: NFS Mount Point - Attribute Caching

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Multiple Components

Benefit / Impact

By ensuring that attribute caching within NFS is not disabled, NFS mounts can perform more efficiently.

Risk

Disabling attribute caching can lead to extra network operation which leads to degrading network performance.

Action / Repair

Fix the configuration of the NFS Mount Point by removing any of these attributes from the mount points:
- noac
- actimeo=0
- acregmin=0
- acregmax=0
- acdirmin=0
- acdirmax=0

Configuring NFS Version 4 (NFSv4) on Exalogic (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm)
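
To locate the affected mount points, the mount table can be scanned for the attributes listed above. This is a minimal sketch, assuming input in the /proc/mounts field layout; the function names are hypothetical.

```python
# Mount options named in the check as disabling attribute caching.
DISABLING_OPTIONS = {
    "noac", "actimeo=0", "acregmin=0", "acregmax=0", "acdirmin=0", "acdirmax=0",
}


def caching_disabled_options(mount_options):
    """Return the options in a comma-separated string that disable attribute caching."""
    options = (opt.strip() for opt in mount_options.split(","))
    return [opt for opt in options if opt in DISABLING_OPTIONS]


def nfs_mounts_with_disabled_caching(mounts_text):
    """Scan /proc/mounts-style text; return {mount point: offending options} for NFS mounts."""
    findings = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, file system type, options, dump, pass.
        if len(fields) >= 4 and fields[2] in ("nfs", "nfs4"):
            bad = caching_disabled_options(fields[3])
            if bad:
                findings[fields[1]] = bad
    return findings
```

An empty result from nfs_mounts_with_disabled_caching() means that no NFS mount carries any of the listed attributes.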

Multiple Components: Free Physical Memory

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Multiple Components

Benefit / Impact

Availability of free memory needs to be ensured within the switches for the Oracle Middleware Exalogic Machine to perform its processes optimally.

Risk

Insufficient memory may lead to degraded performance, and may cause problems and outages.

Action / Repair

The recommended free space is calculated by adding Free Memory(MemFree) and Reclaimable Memory(SReclaimable) listed in /proc/meminfo. The free memory should be at least 20% of the Total Memory(MemTotal).

Run the following command to clear in-kernel caches:

# sync ; echo 2 > /proc/sys/vm/drop_caches

If there is still not enough free memory, reboot the switch.

NOTE: The switch becomes unavailable during this period, causing applications to stop running within this switch. To handle the possible impact of a temporary loss of service, ensure adequate preparation ahead of time.

Reboot the switch by running the following command:

# reboot -n
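
The 20% rule described in this check can be worked through in a few lines. The sketch below is an illustration rather than the exachk logic itself; the function names are hypothetical and the input is assumed to be the text of /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[name.strip()] = int(parts[0])
    return values


def free_memory_ratio(meminfo):
    """Return (MemFree + SReclaimable) / MemTotal."""
    return (meminfo["MemFree"] + meminfo["SReclaimable"]) / meminfo["MemTotal"]


def has_enough_free_memory(text, threshold=0.20):
    """Return True if free plus reclaimable memory is at least the threshold share."""
    return free_memory_ratio(parse_meminfo(text)) >= threshold
```

Running has_enough_free_memory() on the file contents before and after clearing the in-kernel caches shows whether dropping the caches was sufficient.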

Multiple Components: MTU for Ethernet Link in Control vServers

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        Control vServers

Benefit / Impact

Correcting the MTU size for the Ethernet Link ensures that the communication protocol layer in InfiniBand performs optimally.

Risk

Incorrect MTU size can slow down the Ethernet Link and cause latency issues.

Action / Repair

To correct the MTU size, perform the following:

  1. Log in to the corresponding vServer.
  2. Edit the corresponding ifcfg-bondx file.
    • Example:
      # vi /etc/sysconfig/network-scripts/ifcfg-bond1
  3. Add the following line to the file:
    MTU=1500
  4. Save the changes.
  5. Bring the interface up using the ifup command.
    • Example:
      ifup bond1

 __________________________________________________________________________________________________________

Cross-Components

Cross-Component: Firmware Version Consistency for Storage Node

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical, Virtual (EC & Dom0)

Benefit / Impact

Having a consistent base firmware version across all storage nodes ensures a stable environment for Exalogic to perform optimally.

Risk

Inconsistent firmware versions across the storage nodes can lead to problems and outages.

Action / Repair

Investigate which storage nodes have different firmware versions and upgrade the storage nodes with lower firmware versions.

Cross-Component: NTP Configuration for Control vServers

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical, Virtual (EC & Dom0)

Benefit / Impact

Ensuring correct NTP configuration for control vServers is crucial to running Exalogic vServers. Control vServers are configured to point to two compute nodes in the same rack by default.

Risk

Incorrect NTP configuration for control vServers can lead to job scheduling issues in managing Exalogic vServers.

Action / Repair

Correct the NTP server configuration in /etc/ntp.conf for control vServers to point to the first 2 compute nodes in the rack.

Cross-Component: NTP Configuration Consistency with Oracle VM Server for ZFS

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical, Virtual (EC & Dom0)

Benefit / Impact

The ZFS storage appliance must use the same time source as the other components of the Exalogic machine.

Risk

An out of sync clock source can cause stability issues.

Action / Repair

Modify the NTP server using BUI to point to the same NTP servers configured on the compute node.

Cross-Component: NTP Configuration Consistency with Physical Compute Nodes for ZFS

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical

Benefit / Impact

The ZFS storage appliance must use the same time source as the other components of the Exalogic machine.

Risk

An out of sync clock source can cause stability issues.

Action / Repair

Modify the NTP server using BUI to point to the same NTP servers configured on the compute node.

Cross-Component: NTP Configuration for Compute Nodes

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical, Virtual (EC & Dom0)

Benefit / Impact

The compute nodes must use the same time source as the other components of the Exalogic machine.

Risk

An out of sync clock source can cause stability issues.

Action / Repair

Modify the NTP server configuration in /etc/ntp.conf for the compute nodes to point to the same set of external NTP servers.

Cross-Component: NTP Configuration Consistency with Oracle VM Servers for Switch Nodes

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        Virtual (EC)

Benefit / Impact

The switches must use the same time source as the rest of the system.

Risk

An out of sync clock source can cause stability issues.

Action / Repair

Modify the configuration via the ILOM.

Example:

  1. To configure your clock to synchronize with an NTP server, run the following command:
    -> set /SP/clients/ntp/server/1 address=125.128.84.20
  2. Then enable the NTP service by running the following command:
    -> set /SP/clock usentpserver=enabled

Cross-Component: NTP Configuration Consistency with Physical Compute Nodes for Switch Nodes

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    1.0.0.2.0+        Physical

Benefit / Impact

The switches must use the same time source as the rest of the system.

Risk

An out of sync clock source can cause stability issues.

Action / Repair

Modify the configuration via the ILOM.

Example:

  1. To configure your clock to synchronize with an NTP server, run the following command:
    -> set /SP/clients/ntp/server/1 address=125.128.84.20
  2. Then enable the NTP service by running the following command:
    -> set /SP/clock usentpserver=enabled

Cross-Component: Hostname Consistency with DNS on Oracle VM Server

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        Virtual (Dom0)

Benefit / Impact

A correct hostname setting that matches the DNS avoids problems with network configuration.

Risk

An incorrect hostname that does not match the DNS can cause configuration problems.

Action / Repair

Determine if it is an error in the host or in the DNS entry. If it is the host, fix the hostname by changing the value of the HOSTNAME parameter in /etc/sysconfig/network file.

Example:

HOSTNAME=el01cn01.example.com

If it is an error on the DNS server, contact your network administrator to correct the issue.

Cross-Component: Hostname Consistency with DNS on Switches

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        Virtual (EC)

Benefit / Impact

A hostname that matches the DNS will avoid problems with the networking configuration.

Risk

An incorrect hostname that does not match the DNS can cause configuration problems.

Action / Repair

Log in to the ILOM and set the hostname.

Example:

set /SP hostname=el01sw-ib01

Cross-Component: Stale VNICs in the Switch

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
WARNING     All Types     Linux    2.0.1.0.0+        Virtual (EC & Dom0)

Benefit / Impact

Valid vNICs in the switch ensure the Exalogic machine performs optimally. For a virtual rack, VNICs are important in the creation of new vServers.

Risk

VNICs in states other than "UP" can cause network outages. In a physical rack, problems related to EoIB network can occur. In a virtual rack, excessive number of unused vNICs can cause performance issues.

Action / Repair

Delete the stale VNICs listed in the report. These stale VNICs can be removed from the respective switch via the deletevnic command.

Cross-Component: OVS Repo Consistency

Alert Type  Machine Type  OS Type  Exalogic Version  Applicable To
ERROR       All Types     Linux    1.0.0.2.0+        Physical, Virtual (EC & Dom0)

Benefit / Impact

When the Oracle virtual server repositories on all Dom0s point to the same repository, this consistency avoids performance problems.

Risk

Exalogic is engineered to use a single repository. Any misconfiguration would cause functional issue.

Action / Repair

Please ensure no manual change was done via OVM Manager. Revert those change if necessary.

Cross-Component: Non-sequential Even-numbered Gateway Instance

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux, Solaris  1.0.0.2.0+        Physical, Dom0

Benefit / Impact

Making sure that all switches use non-sequential even numbers for their GWInstance values (e.g. 20, 30, 40) avoids issues during future upgrades.

Risk

If the GWInstance values are sequential and/or odd, issues may arise during future upgrades.

Action / Repair

Ensure that all switches use non-sequential even numbers for their GWInstance values. If these criteria are not met, change the values accordingly. For example:
nm2gw-ib02: 20
nm2gw-ib03: 30
nm2gw-ib04: 40
nm2gw-ib05: 50

To change the GWInstance value:
- Log in to each switch as root.
- Change the GWInstance value to the recommended non-sequential even-numbered value. The example below shows how to set the GWInstance value to 30 for nm2gw-ib03.

[root@nm2gw-ib03 ~]# setgwinstance 30
Stopping Bridge Manager.. [ OK ]
Starting Bridge Manager. [ OK ]

- Confirm that the change has been applied by running the command below:

[root@nm2gw-ib03 ~]# showgwconfig
BXM (pid 19825) is running
BXM versions: bxm_user 2.0.0816.3-0, BXM-API 1.6.0, bxm_libs 2.0.0816.3-0, bxm_main 1.31 mlx_bx_core 1.31

Parameter                 Configured Value   Running Value
-----------------------------------------------------------
GWInstance                30                 30
SystemName                None               scae01sw-ib03
EoIB Data SL              1                  1
EoIB Control SL           2                  2
Allow host VNIC config    None               no
LAG mode                  yes                yes
Default discover P_key    None               0xffff
System MAC                Not applicable     00:21:28:54:7f:22
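
The GWInstance rule can be sketched in Python as follows. This is a minimal illustration and not part of Exachk: the function names are hypothetical, the parser assumes the three-column showgwconfig layout shown above, and "non-sequential" is interpreted here as "no two switches use equal or adjacent values", which is an assumption.

```python
def parse_gwinstance(showgwconfig_output):
    """Return (configured, running) GWInstance values from showgwconfig text."""
    for line in showgwconfig_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == "GWInstance":
            return int(fields[1]), int(fields[2])
    raise ValueError("GWInstance line not found")


def check_gwinstances(values):
    """values maps switch name -> GWInstance; returns a list of problem strings."""
    problems = []
    # Every value must be even.
    for switch, value in sorted(values.items()):
        if value % 2 != 0:
            problems.append(f"{switch}: GWInstance {value} is odd")
    # Assumed meaning of "non-sequential": no equal or adjacent values.
    ordered = sorted(values.items(), key=lambda item: item[1])
    for (sw_a, a), (sw_b, b) in zip(ordered, ordered[1:]):
        if b - a <= 1:
            problems.append(f"{sw_a} and {sw_b}: values {a} and {b} are equal or sequential")
    return problems
```

Collect the configured value from each switch with parse_gwinstance, then pass the resulting mapping to check_gwinstances. An empty list means the values satisfy both conditions.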

Guest vServers

Guest VM: ib_sdp Module

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Having the ib_sdp module loaded ensures that the Sockets Direct Protocol (SDP) works properly over InfiniBand.

Risk

If the ib_sdp module is not loaded, SDP over InfiniBand might not work properly.

Action / Repair

Load the module through /etc/infiniband/openib.conf.

Guest VM: IB Startup Sequence

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

To avoid inconsistencies within the Exalogic Elastic Cloud, and for network services to work properly, the openibd service must start before the network services.

Risk

If openibd does not start before the network services, inconsistencies within the nodes can lead to problems and outages.

Action / Repair

Relink openibd with S05 and mlx4_vnic_confd with S06.
Using Exalogic Configuration Utility (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/app_a.htm)

Guest VM: TCP Tuning

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

The tuning for TCP consists of three components:
1. net.ipv4.tcp_timestamps should be set to 1 to avoid PAWS issues (Protect Against Wrapped Sequence numbers, RFC 1323).
2. net.ipv4.tcp_window_scaling should be set to 1 to allow efficient data transfer over paths with a high bandwidth-delay product.
3. net.ipv4.tcp_sack should be set to 1 to enable selective acknowledgement, which mitigates duplicate acknowledgement and retransmission issues (RFC 2018).

Risk

Without this tuning, TCP may not perform at an optimum level.

Action / Repair

Add the recommended tuning parameters into the /etc/sysctl.conf file.
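
As an illustration, the three settings can be verified against the contents of /etc/sysctl.conf with a short Python sketch. The function and constant names below are hypothetical, and the parser assumes the usual "key = value" layout with "#" or ";" comment lines.

```python
REQUIRED_TCP = {
    "net.ipv4.tcp_timestamps": "1",
    "net.ipv4.tcp_window_scaling": "1",
    "net.ipv4.tcp_sack": "1",
}


def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict of key -> value strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def missing_tcp_tuning(text):
    """Return the required settings that are absent or set to another value."""
    settings = parse_sysctl(text)
    return {k: v for k, v in REQUIRED_TCP.items() if settings.get(k) != v}
```

Calling missing_tcp_tuning(open("/etc/sysctl.conf").read()) returns an empty dict when all three parameters are present with the value 1. The sketch inspects only the file, not the values currently active in the running kernel.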

Guest VM: NFS Mount Point - Attribute Caching

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

By ensuring that attribute caching within NFS is not disabled, NFS mounts can perform more efficiently.

Risk

Disabling attribute caching causes extra network operations, which degrades network performance.

Action / Repair

Fix the configuration of the NFS mount point by removing any of these attributes from the mount points:
- noac
- actimeo=0
- acregmin=0
- acregmax=0
- acdirmin=0
- acdirmax=0

Configuring NFS Version 4 (NFSv4) on Exalogic (http://docs.oracle.com/cd/E18476_01/doc.220/e18478/nfs.htm)
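
A minimal Python sketch of this check follows. It assumes /etc/fstab-style or /proc/mounts-style input (device, mount point, file system type, options), and the function name is hypothetical.

```python
DISABLING_OPTIONS = {
    "noac", "actimeo=0", "acregmin=0", "acregmax=0", "acdirmin=0", "acdirmax=0",
}


def attribute_cache_offenders(mounts_text):
    """Return {mount_point: [offending options]} for NFS entries only."""
    offenders = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expect: device, mount point, fs type, options, ...
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        bad = sorted(set(fields[3].split(",")) & DISABLING_OPTIONS)
        if bad:
            offenders[fields[1]] = bad
    return offenders
```

Feeding the function the contents of /proc/mounts lists each NFS mount point that still carries one of the options above.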

Guest VM: Name Service Switch Config File Permission Status in Control vServers

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
ERROR       All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

The /etc/hosts file holds the aliases used for name resolution, and /etc/nsswitch.conf directs resolution to "files". The /etc/nsswitch.conf file should have 644 permissions so that /etc/hosts can be used by every user.

Risk

If /etc/nsswitch.conf does not have 644 permissions, /etc/hosts will be ignored for every user except root. In this case, WebLogic Server may fail to start and may complain about incorrect network configuration, because it cannot resolve the hostname used as the listen address.

Action / Repair

Change the permissions of /etc/nsswitch.conf to 644.
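
The permission test can be sketched in Python with the standard os and stat modules. The function names are hypothetical; the default path is the file named above.

```python
import os
import stat


def permission_bits(path):
    """Return only the permission bits of a file, for example 0o644."""
    return stat.S_IMODE(os.stat(path).st_mode)


def has_mode_644(path="/etc/nsswitch.conf"):
    """True when the file is rw for the owner and read-only for group and others."""
    return permission_bits(path) == 0o644
```

If has_mode_644() returns False, the repair corresponds to running os.chmod("/etc/nsswitch.conf", 0o644) as root, which is equivalent to chmod 644 on the command line.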

Guest VM: NTP Synchronization

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

NTP helps synchronize the clock of Exalogic with an accurate time source. To ensure correct synchronization, the delay and offset values should be non-zero and the jitter value should be under 100.

Risk

An unsynchronized system clock can lead to possible errors and outages.

Action / Repair

Warnings generated by the NTP Synchronization check can be caused by the following:
1. You are using an older version of the NTP package that does not work if you use the DNS name for the NTP servers. In this case, you must use the IP addresses.
2. A firewall is blocking access to your Stratum 1 and 2 NTP servers. The firewall can be located on one of the networks between the NTP server and its time source, or it can be firewall software, such as iptables, running on the NTP server.
3. The notrust nomodify notrap keywords are present in the restrict statement of the NTP client.
4. Localhost is configured on the NTP server. On a Linux system, remove localhost from the /etc/ntp.conf file to fix the issue. On a Solaris system, remove localhost from /etc/inet/ntp.conf.

Note: KISS keywords in the NTP parameters are ignored.

Your Linux NTP clients cannot Synchronize Properly (http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch24_:_The_NTP_Server#Your_Linux_NTP_clients_cannot_Synchronize_Properly)
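
The delay, offset, and jitter thresholds from the Benefit / Impact section can be sketched in Python as follows. The sketch assumes the peer table comes from "ntpq -p" in its usual ten-column layout, with delay, offset, and jitter as the last three columns; the function name is hypothetical, and this is not how Exachk itself is implemented.

```python
def check_ntp_peers(ntpq_output, max_jitter=100.0):
    """Return problem strings for peers with zero delay/offset or high jitter."""
    problems = []
    for line in ntpq_output.splitlines():
        fields = line.split()
        # Skip the header, the separator row, and anything malformed.
        if len(fields) < 10 or fields[0] == "remote":
            continue
        try:
            delay, offset, jitter = (float(x) for x in fields[-3:])
        except ValueError:
            continue
        peer = fields[0]  # may carry a leading tally character such as "*"
        if delay == 0:
            problems.append(f"{peer}: delay is zero")
        if offset == 0:
            problems.append(f"{peer}: offset is zero")
        if jitter >= max_jitter:
            problems.append(f"{peer}: jitter {jitter} is not under {max_jitter}")
    return problems
```

An empty result means every listed peer satisfies the three conditions.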

Guest VM: Swap Space

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Swappiness controls how much memory-reclaim distress the kernel tolerates before it starts reclaiming mapped pages. If the swap space is unused, the kernel has an adequate amount of free physical memory, which lets the Exalogic Elastic Cloud perform at its optimal level.

Risk

The usage of swap space indicates that the kernel is running out of free physical memory. Lack of free physical memory can lead to degraded performance.

Action / Repair

Free up used memory so that swap space is no longer needed.

Guest VM: Lockd Configuration

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
INFO        All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Lock recovery after a reboot is critical to maintain data integrity and to prevent unnecessary application hangs. To help rpc.statd match SM_NOTIFY requests to NLM requests, this best practice should be observed.

Risk

NFSv3 locks may not be recovered after a reboot.

Action / Repair

NOTE: The node becomes unavailable during this period, causing applications to stop running. To handle the possible impact of a temporary loss of service, ensure adequate preparation ahead of time.
1. Edit the /etc/sysconfig/nfs file.
2. Change the following line:
From: #STATDARG=""
To: STATDARG="-n `uname -n`"
3. Reboot the node.

Guest VM: ib_ipoib Module

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Having the ib_ipoib module loaded ensures that the Internet Protocol (IP) works properly over InfiniBand.

Risk

If the ib_ipoib module is not loaded, InfiniBand may not work properly.

Action / Repair

Load the module through /etc/infiniband/openib.conf.

Guest VM: Recent Critical Error

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
INFO        All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Ensuring the stability of the system is important to support applications running on Exalogic. By discovering unexpected critical errors in a node, action can be taken to fully restore the service and to resolve the potential cause of the problem.

Risk

An unexpected critical error within a node may lead to problems and outages.

Action / Repair

If the critical errors in the recent reboot were expected, ignore this warning. Otherwise, investigate further by looking at the log file /var/log/ovs-agent.log*. If the problem persists, open an SR with Oracle Support.
Note 1501348.1 - Resolving OVS issues in Exalogic

Guest VM: Recent Reboot Info

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
INFO        All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Ensuring the stability of the system is important to support applications running on Exalogic. By discovering an unexpected and recent reboot of a node, action can be taken to fully restore the service and to resolve the potential cause of the problem.

Risk

An unexpected and recent reboot of a node may lead to problems and outages.

Action / Repair

If the recent reboot was intentional or expected, ignore this warning. Otherwise, investigate why this node rebooted unexpectedly. If the problem persists, open an SR with Oracle Support.

Guest VM: IPoIB in Connected Mode

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Having IPoIB in Connected Mode ensures that the Internet Protocol (IP) works properly over the InfiniBand network.

Risk

If the Connected Mode is not set, IPoIB might not work properly.

Action / Repair

1. If it is the OFED R2 version, check the following:
- In the /etc/sysconfig/network-scripts/ifcfg-ib* files, check whether "CONNECTED_MODE=yes" is set.
- If "CONNECTED_MODE" is not set to "yes", modify the /etc/sysconfig/network-scripts/ifcfg-ib* file and change the property to "yes".
2. If it is not the OFED R2 version, check the following:
- If SET_IPOIB_CM and/or IPOIB_LOAD is not set to "yes", modify the /etc/infiniband/openib.conf file and change these properties to "yes".
- If the content of /sys/class/net/ib0/mode and /sys/class/net/ib1/mode is not "connected", change the content of these files to "connected".

After modifying the files above, restart InfiniBand by running the command:
/etc/init.d/openibd restart

Note 1982645.1 - Exachk Reporting "IPoIB is not in connected mode" WARNING Message On Exalogic 2.0.6.2.0 Linux Physical Racks
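
The sysfs part of this check can be sketched in Python as follows. The function names are hypothetical, and the directory argument exists only so that the sketch can be exercised against a test tree instead of the real /sys/class/net.

```python
import glob
import os


def ipoib_modes(sys_class_net="/sys/class/net"):
    """Return {interface: mode} for every ib* interface that exposes a mode file."""
    modes = {}
    for mode_file in sorted(glob.glob(os.path.join(sys_class_net, "ib*", "mode"))):
        iface = os.path.basename(os.path.dirname(mode_file))
        with open(mode_file) as handle:
            modes[iface] = handle.read().strip()
    return modes


def not_connected(sys_class_net="/sys/class/net"):
    """List the IPoIB interfaces whose mode is anything other than 'connected'."""
    return [i for i, m in ipoib_modes(sys_class_net).items() if m != "connected"]
```

An empty list from not_connected() indicates that every IPoIB interface found is in connected mode.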

Guest VM: Kernel Out-of-Memory Errors

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

The kernel out-of-memory error indicates a potential resource issue.

Risk

If the cause of the kernel out-of-memory error is not identified, service can be disrupted.

Action / Repair

Please check the kernel log, /var/log/message*, and identify the process and cause of the out-of-memory error.
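
As a starting point for that investigation, the following Python sketch extracts the killed process from kernel log lines. The pattern is an assumption: it matches the "Out of memory: Kill process <pid> (<name>)" wording used by many Linux kernels, and the exact message text varies between kernel versions. The function name is hypothetical.

```python
import re

# Assumed message format; adjust the pattern to the kernel version in use.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, process_name) tuples found in the given lines."""
    hits = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Passing the lines of a kernel log file to find_oom_kills gives a quick list of the processes that the kernel terminated; the surrounding log entries still need to be read to determine the cause.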

Guest VM: Local Partition Usage Limit

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
ERROR       All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Keeping enough free space on the local disk ensures that the system operates optimally.

Risk

If the local disk runs short of free space, system performance is affected.

Action / Repair

Free up disk space on the local disk. Oracle recommends most, if not all, user data be stored on the storage appliance.

Guest VM: MTU for Ethernet Link

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

A correct MTU size for the Ethernet link ensures that the communication protocol layer within Ethernet performs optimally.

Risk

An incorrect MTU size may slow down the Ethernet link and cause latency issues.

Action / Repair

Investigate the MTU for the Ethernet link and set it to the correct size.

Guest VM: ZCOPY Configuration

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Proper zcopy configuration must be ensured for the Exalogic machine to perform optimally.

Risk

An incorrect zcopy configuration can affect performance.

Action / Repair

Add sdp_zcopy_thresh=0, recv_poll=0 to the /etc/modprobe.conf file.

Guest VM: Consistent Hardware Clock Timezone Reference

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Using UTC as the common time zone reference for the hardware clock avoids potential time skew between nodes before the time is synchronized with the NTP server.

Risk

Different time zone settings across different machines may cause job scheduling issues.

Action / Repair

1. Log in to the server as root.
2. Run the command "cat /etc/adjtime".
3. Make sure the third line indicates UTC instead of LOCAL. If it shows UTC, it is configured correctly. If it shows LOCAL, run the following repair steps:
- Make sure the system time is correctly synchronized with an NTP server.
- Run the following command to change the hardware clock to use UTC: "hwclock --utc --systohc"
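
Step 3 can be sketched in Python as follows. The function names are hypothetical; the sketch only reads the third line of /etc/adjtime-style text and does not change the hardware clock.

```python
def hwclock_reference(adjtime_text):
    """Return the third line of /etc/adjtime (UTC or LOCAL), or None if absent."""
    lines = adjtime_text.splitlines()
    if len(lines) < 3:
        return None
    return lines[2].strip()


def uses_utc(adjtime_text):
    """True when the hardware clock reference recorded in the file is UTC."""
    return hwclock_reference(adjtime_text) == "UTC"
```

For example, uses_utc(open("/etc/adjtime").read()) returns False on a node that still needs the repair steps above.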

Guest VM: Bonding of InfiniBand Interfaces

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

The InfiniBand interfaces are a communication link between various components of the Exalogic machine. To maintain high availability (HA) with the IPoIB interface, the InfiniBand interfaces must be bonded correctly.

Risk

Without proper bonding of the InfiniBand interfaces, the Exalogic machine cannot maintain high availability (HA) if one of the communication links goes down. It can also affect performance.

Action / Repair

Investigate the bonding in the /etc/sysconfig/network-scripts/ifcfg-ib* files for each applicable pkey.

Guest VM: Disabled Automatic Path Migration(APM)

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

There is a compatibility issue between the OFA software version on Exadata (1.5.1) and on Exalogic (1.5.5). SDP fails because of a new feature, Automatic Path Migration (APM), which is enabled by default on Exalogic but not yet supported by the OFED version on Exadata. This triggers the error "RDMA CMA: unexpected IB CM event: 13". Disabling APM ensures that SDP works properly in this particular case.

Risk

Enabling APM on an Exalogic machine that is connected to Exadata can lead to problems and outages related to SDP failure.

Action / Repair

Consult the following MOS note for the repair steps:
Note 1588546.1 - SDP Connection in inter-connected Exalogic and Exadata stopped working

Guest VM: MTU for InfiniBand Link

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

A correct MTU size for the InfiniBand Link ensures that the communication protocol layer in InfiniBand performs optimally.

Risk

An incorrect MTU size can slow down the InfiniBand link and cause latency issues.

Action / Repair

Refer to the following MOS note:
Note 1624434.1 - Revised MTU Tuning Recommendations for the IPoIB Related Network Interfaces on Exalogic Physical and Virtual Environments

Guest VM: Free Physical Memory

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

An adequate amount of free physical memory ensures that the Exalogic Elastic Cloud performs at its optimal level.

Risk

If there is not enough free physical memory, problems and outages may occur.

Action / Repair

The recommended free memory is calculated from the values listed in /proc/meminfo. The free memory should be at least 20% of the total memory (MemTotal).

Free Memory = MemFree + Buffers + SReclaimable + Cached - Shmem

Clear the memory cache by running this command:

sync; echo 3 > /proc/sys/vm/drop_caches
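
The calculation can be sketched in Python as follows. The field names come from /proc/meminfo and the 20% threshold from the text above; the function names are hypothetical, and all values are in kB as reported by the kernel.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])
    return values


def free_memory_kb(meminfo):
    """Free Memory = MemFree + Buffers + SReclaimable + Cached - Shmem."""
    return (meminfo["MemFree"] + meminfo["Buffers"] + meminfo["SReclaimable"]
            + meminfo["Cached"] - meminfo["Shmem"])


def has_enough_free_memory(meminfo, minimum_fraction=0.20):
    """True when free memory is at least the given fraction of MemTotal."""
    return free_memory_kb(meminfo) >= minimum_fraction * meminfo["MemTotal"]
```

For example, with MemTotal 1000000, MemFree 100000, Buffers 20000, SReclaimable 30000, Cached 90000, and Shmem 10000, the free memory is 230000 kB, which is 23% of the total and therefore passes the check.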

Guest VM: RPM Database Corruption

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
ERROR       All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

If an RPM query returns an error, any RPM operation is likely to fail. Because upgrades and patching rely on RPM, they would fail as well.

Risk

The upgrade process cannot proceed until the RPM errors are fixed.

Action / Repair

Run rpm -qa. If the command runs without any issue, proceed with the upgrade. If the RPM query returns a lock issue, refer to the MOS note below to fix the issue.
Note 1599404.1 - Error received while executing rpm commands - "rpmdb: Lock table is out of available locker entries"

Guest VM: Cross check hostname with /etc/hosts in Guest VMs

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Applications look up the hostname in /etc/hosts. A matching entry keeps application functionality from being interrupted.

Risk

If the hostname is missing from /etc/hosts, application functionality might be interrupted.

Action / Repair

Check the hostname by running "hostname -s" and add the output, together with the IP address, to /etc/hosts.
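
A minimal Python sketch of the cross-check follows. The function names are hypothetical, and the parser assumes the standard /etc/hosts layout of an IP address followed by one or more names, with "#" starting a comment.

```python
def hosts_entries(hosts_text):
    """Map every name or alias in /etc/hosts-style text to its IP address(es)."""
    entries = {}
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(fields) < 2:
            continue
        ip, names = fields[0], fields[1:]
        for name in names:
            entries.setdefault(name, []).append(ip)
    return entries


def hostname_in_hosts(short_name, hosts_text):
    """True when the short hostname appears as a name or alias."""
    return short_name in hosts_entries(hosts_text)
```

On a live system the short name can be taken from socket.gethostname().split(".")[0], which corresponds to the output of "hostname -s" on a correctly configured node.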

Guest VM: CPU CAP for Virtual Machine Configuration File in Oracle Virtual Server

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
INFO        All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

When the CPU Cap is configured to be less than 100% (through EMOC), several issues related to CPU soft lockup and vServer hangs have been reported on Exalogic. When the CPU Cap is configured to be 100% through EMOC, it is translated to cpu_cap=0 in vm.cfg, which is the expected value.

Risk

When the CPU Cap is configured to be less than 100% (through EMOC), several issues related to CPU soft lockup and vServer hangs have been reported on Exalogic.

Action / Repair

Refer to the following MOS note:
Note 1912480.1 - Setting CPU CAP to be less than 100% is not supported for Guest vServers on Exalogic

Guest VM: Bash Vulnerability Update Check

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Applying the bash vulnerability patch addresses a critical security vulnerability with bash that allows a malicious user to execute arbitrary commands and gain unauthorized access to the system.

Risk

Not applying the patch exposes the Exalogic machine to a critical security vulnerability that can potentially allow a malicious user to execute arbitrary commands and gain unauthorized access to the system.

Action / Repair

See the following links for more information about CVE-2014-6271 and to fix the bash code injection vulnerability.
Vulnerability Summary for CVE-2014-6271 (http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-6271)
Note 1930090.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability Document for Oracle Solaris
Note 1930120.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability Document for Oracle Linux
Oracle Security Alert for CVE-2014-7169 (http://www.oracle.com/technetwork/topics/security/alert-cve-2014-7169-2303276.html)
Note 1929881.1 - CVE-2014-6271 and CVE-2014-7169 Patch Availability for Oracle Exalogic Linux Physical and Virtual Racks

Guest VM: OL6 Guest vServer Performance Check

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
INFO        All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

In cases with high load, the active processes need to get assigned a long enough time slice during critical execution time.

Risk

With the Oracle Linux 6 kernel, in cases with high load, the active processes do not get assigned a long enough time slice during critical execution time.

Action / Repair

Contact Oracle Support to follow the steps in the following note.

Note 1980462.1 - Performance Regression in OL6 Guest vServers compared to OL5 Guest vServers on Exalogic

Guest VM: IPoIB in Connected Mode for OEL6

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Having IPoIB in Connected Mode ensures that the Internet Protocol (IP) works properly over the InfiniBand network.

Risk

If the Connected Mode is not set, IPoIB might not work properly.

Action / Repair

1. If it is the OFED R2 version, check the following:
- In the /etc/sysconfig/network-scripts/ifcfg-ib* files, check whether "CONNECTED_MODE=yes" is set.
- If "CONNECTED_MODE" is not set to "yes", modify the /etc/sysconfig/network-scripts/ifcfg-ib* file and change the property to "yes".
2. If it is not the OFED R2 version, check the following:
- If SET_IPOIB_CM and/or IPOIB_LOAD is not set to "yes", modify the /etc/rdma/rdma.conf file and change these properties to "yes".
- If the content of /sys/class/net/ib0/mode and /sys/class/net/ib1/mode is not "connected", change the content of these files to "connected".

After modifying the files above, restart InfiniBand by running the command:
/etc/init.d/openibd restart

Note 1982645.1 - Exachk Reporting "IPoIB is not in connected mode" WARNING Message On Exalogic 2.0.6.2.0 Linux Physical Racks

Guest VM: ZCOPY Configuration for OEL6

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Proper zcopy configuration must be ensured for the Exalogic machine to perform optimally.

Risk

An incorrect zcopy configuration can affect performance.

Action / Repair

Add sdp_zcopy_thresh=0, recv_poll=0 to the /etc/modprobe.d/id_sdp.conf file.

Guest VM: Eport_State_Enforce Status

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

On releases running the Oracle Enterprise Linux operating system, if the Ethernet link used by a VNIC goes down, the bond configured with that VNIC will not detect it. By default, the bond detects only the state of the physical link it is using, which is the InfiniBand link. It does not detect the state of the Ethernet port the VNIC is connected to. The eport_state_enforce=1 flag must be present in /etc/modprobe.conf for this failure to be detected and for failover to occur.

Risk

Without the eport_state_enforce=1 flag in /etc/modprobe.conf, a network outage will occur if one of the links fails.

Action / Repair

Make sure eport_state_enforce=1 is set in the /etc/modprobe.conf file.
Note 1512139.1 - Oracle Exalogic Elastic Cloud Known Issues - Virtualization Release
Note 1436514.1 - Exalogic: VNIC 10gb Bond Network Ethernet Link Failover Detection
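
The presence of the flag can be checked with a short Python sketch. It assumes the flag is set through a standard modprobe "options <module> <parameter>=<value>" line and that the module is mlx4_vnic, a name inferred from the mlx4_vnic.conf file used on OEL6; both points are assumptions, and the function names are hypothetical.

```python
def module_options(modprobe_text, module):
    """Collect parameter=value pairs from 'options <module> ...' lines."""
    options = {}
    for raw in modprobe_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # ignore comments
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == module:
            for item in fields[2:]:
                key, sep, value = item.partition("=")
                if sep:
                    options[key] = value
    return options


def eport_enforced(modprobe_text, module="mlx4_vnic"):
    """True when eport_state_enforce=1 is configured for the assumed module."""
    return module_options(modprobe_text, module).get("eport_state_enforce") == "1"
```

The same helper can be pointed at /etc/modprobe.d/mlx4_vnic.conf on OEL6, or used with another module name to inspect other options lines.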

Guest VM: Eport_State_Enforce Status for OEL6

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

On releases running the Oracle Enterprise Linux operating system, if the Ethernet link used by a VNIC goes down, the bond configured with that VNIC will not detect it. By default, the bond detects only the state of the physical link it is using, which is the InfiniBand link. It does not detect the state of the Ethernet port the VNIC is connected to. The eport_state_enforce=1 flag must be present in /etc/modprobe.d/mlx4_vnic.conf for this failure to be detected and for failover to occur.

Risk

Without the eport_state_enforce=1 flag in /etc/modprobe.d/mlx4_vnic.conf, a network outage will occur if one of the links fails.

Action / Repair

Make sure eport_state_enforce=1 is set in the /etc/modprobe.d/mlx4_vnic.conf file.
Note 1512139.1 - Oracle Exalogic Elastic Cloud Known Issues - Virtualization Release
Note 1436514.1 - Exalogic: VNIC 10gb Bond Network Ethernet Link Failover Detection

Guest VM: OVS Partition Usage Limit

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
ERROR       All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

Keeping enough free space ensures that the system operates optimally.

Risk

vDC functionality will be affected.

Action / Repair

Free up disk space on these filesystems: /nfsmnt/*, /poolfsmnt/*, /OVS/Repositories/*, /var/lib/xenstored, and the local disk.

Guest VM: Virtual Memory Tuning for DomU

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        DomU

Benefit / Impact

The tuning for virtual memory consists of two components:

1. vm.dirty_background_ratio <= 10

The default value of this ratio is 10%. With this value, the kernel is forced to write dirty pages to disk when their total size reaches 9.6 GB (10% of 96 GB). Oracle recommends that this parameter be tuned down to 3% to smooth out the I/O traffic.

2. vm.min_free_kbytes = 524288 KB (512 MB) for DomU.

The default value of this parameter is 32 MB. Oracle recommends that this parameter be increased accordingly to account for the large MTU size within an IPoIB network, which is currently 64K.

Risk

Without this tuning, the kernel may not perform at an optimum level.

Action / Repair

Edit the /etc/sysctl.conf file and modify the corresponding tuning parameters as specified in the Benefit section above.
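
The two thresholds can be checked against the contents of /etc/sysctl.conf with a Python sketch like the one below. The function names are hypothetical. As an assumption of this sketch, a parameter that is missing from the file is reported as a problem so that the value gets set explicitly.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict of key -> value strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def vm_tuning_problems(text):
    """Return a list of problems for the two DomU virtual memory settings."""
    settings = parse_sysctl(text)
    problems = []
    ratio = settings.get("vm.dirty_background_ratio")
    if ratio is None or int(ratio) > 10:
        problems.append("vm.dirty_background_ratio should be <= 10")
    min_free = settings.get("vm.min_free_kbytes")
    if min_free is None or int(min_free) != 524288:
        problems.append("vm.min_free_kbytes should be 524288")
    return problems
```

A file containing "vm.dirty_background_ratio = 3" and "vm.min_free_kbytes = 524288" passes with an empty list. The sketch checks the file only; it does not read the values active in the running kernel.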

Guest VM: Ghost Vulnerability

Alert Type  Machine Type  OS Type         Exalogic Version  Applicable To
WARNING     All Types     Linux           1.0.0.2.0+        Guest VM

Benefit / Impact

For the whole system to work optimally, and to avoid problems related to the Ghost vulnerability, the installed image must be verified to be at its latest supported version.

Risk

If a system is not patched, it is exposed to this security vulnerability.

Action / Repair

Apply Patch 20448956, following the steps in the MOS note referenced below, and rerun exachk after patching to validate.
Note 1965975.1 - CVE-2015-0235 - Ghost Vulnerability - Patch Availability for Oracle Exalogic Linux Physical and Virtual Racks (Doc ID 19

References

<NOTE:1967979.1> - Performance Degradation issues in Exalogic X2-2 Racks when ZFS 7320 Appliance configured with 24GB RAM
<NOTE:1449226.1> - Exachk Health-Check Tool for Exalogic

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.