Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1985576.1
Update Date: 2018-05-20
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  1985576.1 :   Common Problems Reported by Platinum Monitoring and Recommended Actions for Exalogic Systems  


Related Items
  • Exalogic Elastic Cloud X3-2 Eighth Rack
  • Exalogic Elastic Cloud X4-2 Eighth Rack
  • Oracle Exalogic Elastic Cloud Software
  • Exalogic Elastic Cloud X4-2 Quarter Rack
  • Exalogic Elastic Cloud X4-2 Full Rack
  • Oracle Exalogic Elastic Cloud X2-2 Hardware
  • Exalogic Elastic Cloud X4-2 Hardware
  • Exalogic Elastic Cloud X5-2 Eighth Rack
  • Exalogic Elastic Cloud X5-2 Quarter Rack
  • Exalogic Elastic Cloud X3-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core




In this Document
Purpose
Scope
Details
 PLATINUM FAULT: LogFileMonitor:sys_occurrence_count Scanned /var/log/messages from line xxxxx to yyyyy. Found 1 occurence of the pattern [kernel:.* (error|crit|fatal)].. 1 crossed warning ( ) or critical (0) threshold.
 PLATINUM FAULT: LogFileMonitor:sys_occurrence_count Scanned /var/adm/messages from line xxxxx to yyyyy. Found 2 occurences of the pattern [svc.startd.*failed].. 2 crossed warning (0) or critical (1) threshold.
 1. kernel: Error: Driver 'pcspkr' is already registered, aborting...
 2. kernel: sdp_process_tx_wc:261 sdp_sock( 4551:14 58027:10280): Send completion with error. wr_id 0x400000002 Status 12
 3. kernel: uce_agent.bin[23006]: segfault at f6eac05c ip 00000000082639a1 sp 00000000f6eac060 error 6 in uce_agent.bin[8048000+6e8000]
 4. kernel: xs_tcp_setup_socket: connect returned unhandled error -107
 5. kernel: bonding: bond1: Error: Unable to enslave eth326_2 because it is already up
 6. yum-updatesd: error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: ol5_UEK_latest. Please verify its path and try again
 7. kernel: cgrep[562]: segfault at 0 ip 000000004e7f7a3c sp 00000000ffc0ac4c error 4 in libc-2.5.so[4e79f000+154000]
 8. kernel: ponu_ge_lstdat1[16101]: segfault at 28 ip 00007f5002b0a33d sp 00007fffe0fb6ee0 error 4 in libkitd.so[7f5002aea000+3f000]
 9. kernel: FNDLOAD[1296]: segfault at c ip 000000000805689a sp 00000000ff8340ec error 4 in FNDLOAD[8048000+111000]
 10. kernel: ERST: Error Record Serialization Table (ERST) support is initialized
 11. svc.startd[13]: [ID 652011 daemon.warning] svc:/application/pkg/system-repository:default: Method "/lib/svc/method/svc-pkg-sysrepo refresh" failed with exit status 95.
     svc.startd[13]: [ID 748625 daemon.error] application/pkg/system-repository:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
 PLATINUM FAULT: adrAlertLogIncidentError:oraBlockCorruptErrStack An Oracle data block corruption detected in /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/alert/log.xml at time/line number: <Date Time>/<Line Number>
 PLATINUM FAULT: adrAlertLogIncidentError:accessViolationErrStack An access violation detected in /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/alert/log.xml at time/line number: <Date Time>/<Line Number>
 PLATINUM FAULT: adrAlertLogIncidentError:genericIncidentErrStack Incident (ORA 445) detected in /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/alert/log.xml at time/line number: <Date Time>/<Line Number>
 PLATINUM FAULT: Ilom Sensor Alerts: SensorAlerts:PowerSupplyStatus Power supply sensor(s) at level - CRITICAL
 PLATINUM FAULT: ZFSProblem:ProblemSeverity AK-8003-Y6 : The device configuration for JBOD 1111FMD00X
 PLATINUM FAULT: ZFSProblem:ProblemSeverity AK-8002-9M : The cable between the Ethernet ports of each controller is down
 PLATINUM FAULT: ZFSProblem:ProblemSeverity ZFS-8000-D3 : ZFS device id1 sd@SATA_____TOSHIBA_THN
 PLATINUM FAULT: ZFSProblem:ProblemSeverity USB-8000-GT : A hardware fault within the device or its interface was detected in the USB device. The driver has failed to initialize the device and the device is in an invalid state.
 PLATINUM FAULT: ZFSProblem:ProblemSeverity DISK-8000-CY : There have been non-recovered ZFS checksum errors on this disk
 PLATINUM FAULT: ZFSAlert:ProblemType All communication with the cluster peer has been lost
 PLATINUM: An Integrated I/O (IIO) fatal error in downstream PCIE device has occurred
 PLATINUM FAULT: ZFSProblem:ProblemSeverity SUNOS-8000-KL : The system has rebooted after a kernel panic. Severity: Major Message ID: xxxxxxxx-xxxx-xxxx-xxxxxxxxxxxx
References


Applies to:

Exalogic Elastic Cloud X4-2 Quarter Rack - Version X4 to X4 [Release X4]
Exalogic Elastic Cloud X3-2 Eighth Rack - Version X5 to X5 [Release X5]
Exalogic Elastic Cloud X3-2 Hardware - Version X3 to X3 [Release X3]
Exalogic Elastic Cloud X4-2 Hardware - Version X4 to X4 [Release X4]
Exalogic Elastic Cloud X4-2 Hardware - Version X5 to X5 [Release X5]
Linux x86-64
Oracle Solaris on x86-64 (64-bit)
Oracle Virtual Server x86-64

Purpose

This note lists the Platinum Fault and Alert messages most commonly reported by the Platinum monitoring setup and provides recommended actions for addressing each of them.

Scope

This note focuses on common Platinum alerts and their solutions. For more information on Oracle Platinum Services, including a full list of the fault monitoring that is performed, see the following note:

Oracle Platinum Services – Quick Reference Guide (Doc ID 1993848.1)

Details

The following is the list of known Platinum Faults and Alerts reported by the Platinum Monitoring setup, with recommended actions for each.


Likely replaced (older message format): PLATINUM FAULT: Logfile Monitor found matches in (/var/log/messages) for pattern (kernel:.*(error|crit|fatal)) last count (33)

PLATINUM FAULT: LogFileMonitor:sys_occurrence_count Scanned /var/log/messages from line xxxxx to yyyyy. Found 1 occurence of the pattern [kernel:.* (error|crit|fatal)].. 1 crossed warning ( ) or critical (0) threshold.

PLATINUM FAULT: LogFileMonitor:sys_occurrence_count Scanned /var/adm/messages from line xxxxx to yyyyy. Found 2 occurences of the pattern [svc.startd.*failed].. 2 crossed warning (0) or critical (1) threshold.

These alerts come from log monitors that search for several patterns. To check for these messages, connect to the given machine as orarom or root and run the appropriate command:

for Linux: egrep -i "kernel:.*(error|crit|fatal)" /var/log/messages*
for Solaris: egrep -i "svc.startd.*failed" /var/adm/messages

Notice that we add the asterisk after "messages" because there will be several files (messages, messages.1, messages.2, and so on) in the /var/log directory. We want to check all of them, as it isn't always clear when the alerts occurred. In the example above, we would expect to find 33 matches (the count value in parentheses).
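That check can be reproduced end-to-end against a sample file. A minimal sketch, in which the log lines are illustrative and only the egrep pattern comes from this note:

```shell
#!/bin/sh
# Recreate the Linux-side check against a sample messages file.
# The sample log lines below are illustrative; the pattern is the one
# Platinum uses for /var/log/messages.
tmp=$(mktemp -d)
cat > "$tmp/messages" <<'EOF'
Apr  4 22:05:59 host kernel: Error: Driver 'pcspkr' is already registered, aborting...
Apr  4 22:06:01 host kernel: xs_tcp_setup_socket: connect returned unhandled error -107
Apr  4 22:06:02 host sshd[123]: Accepted publickey for user
EOF
# -E: extended regex, -i: case-insensitive, -c: count matching lines
count=$(grep -Eic 'kernel:.*(error|crit|fatal)' "$tmp/messages")
echo "matches: $count"   # here: 2 (the two kernel: lines)
rm -rf "$tmp"
```

On a real compute node the same grep runs over /var/log/messages* so that rotated files are included.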

Below is the current list of known kernel and svc.startd error messages that may be reported by Platinum.

  1. kernel: Error: Driver 'pcspkr' is already registered, aborting...
  2. kernel: sdp_process_tx_wc:261 sdp_sock( 4551:14 58027:10280): Send completion with error. wr_id 0x400000002 Status 12
  3. kernel: uce_agent.bin[23006]: segfault at f6eac05c ip 00000000082639a1 sp 00000000f6eac060 error 6 in uce_agent.bin[8048000+6e8000]
  4. kernel: xs_tcp_setup_socket: connect returned unhandled error -107
  5. kernel: bonding: bond1: Error: Unable to enslave eth326_2 because it is already up
  6. yum-updatesd: error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: ol5_UEK_latest. Please verify its path and try again
  7. kernel: cgrep[562]: segfault at 0 ip 000000004e7f7a3c sp 00000000ffc0ac4c error 4 in libc-2.5.so[4e79f000+154000]
  8. kernel: ponu_ge_lstdat1[16101]: segfault at 28 ip 00007f5002b0a33d sp 00007fffe0fb6ee0 error 4 in libkitd.so[7f5002aea000+3f000]
  9. kernel: FNDLOAD[1296]: segfault at c ip 000000000805689a sp 00000000ff8340ec error 4 in FNDLOAD[8048000+111000]
  10. kernel: ERST: Error Record Serialization Table (ERST) support is initialized
  11. svc.startd[13]: [ID 652011 daemon.warning] svc:/application/pkg/system-repository:default: Method "/lib/svc/method/svc-pkg-sysrepo refresh" failed with exit status 95.
    svc.startd[13]: [ID 748625 daemon.error] application/pkg/system-repository:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)

Below are more details and recommended solutions for each of the above error messages.

1. kernel: Error: Driver 'pcspkr' is already registered, aborting...

Details:

Following is the error message seen in /var/log/messages which triggers the platinum fault.

kernel: Error: Driver 'pcspkr' is already registered, aborting...

Solution: When you encounter the pcspkr error, follow the note below to resolve the problem. The note should be applied on every compute node.

<Note 1948665.1> In Exalogic 2.0.6.x Virtual releases "kernel: Error: Driver 'pcspkr' is already registered, aborting" messages are seen in /var/log/messages

2. kernel: sdp_process_tx_wc:261 sdp_sock( 4551:14 58027:10280): Send completion with error. wr_id 0x400000002 Status 12

Details:

Following is the error message seen in /var/log/messages which triggers the platinum fault.

kernel: sdp_process_tx_wc:261 sdp_sock( 4551:14 58027:10280): Send completion with error. wr_id 0x400000002 Status 12

Solution: This occurs when the environment is configured for SDP with Oracle Traffic Director and there is high load on the system. Please review the following note for best practices, "Exalogic Best Practice: Configure Oracle Traffic Director (OTD) to use TCP instead of SDP (Doc ID 1932308.1)"



INTERNAL ONLY - Note for Support

The note below resolves this by disabling SDP.

<Note 1492408.1>: On very high load, performance issue with SDP between the Oracle Traffic Director (OTD) to Midtier traffic

3. kernel: uce_agent.bin[23006]: segfault at f6eac05c ip 00000000082639a1 sp 00000000f6eac060 error 6 in uce_agent.bin[8048000+6e8000]

Details:

Following is the error message seen in /var/log/messages which triggers the platinum fault.  

kernel: uce_agent.bin[23006]: segfault at f6eac05c ip 00000000082639a1 sp 00000000f6eac060 error 6 in uce_agent.bin[8048000+6e8000]

Solution: This issue can be ignored as noted in the following note: 

<Note 1900391.1>: uce_agent.bin segfault on Exalogic -- error 6 in uce_agent.bin[8048000+6e7000]

4. kernel: xs_tcp_setup_socket: connect returned unhandled error -107

Details:

Following is the error message seen in /var/log/messages which triggers the platinum fault. 

kernel: xs_tcp_setup_socket: connect returned unhandled error -107

Solution: Please review the following MOS note on this Platinum fault.

<Note 1985206.1>: Oracle Linux: kernel: xs_tcp_setup_socket: connect returned unhandled error -107

5. kernel: bonding: bond1: Error: Unable to enslave eth326_2 because it is already up

Details: 

Following is the error message seen in /var/log/messages which triggers the platinum fault. 

kernel: bonding: bond1: Error: Unable to enslave eth326_2 because it is already up

Solution: Please review the following MOS note on this Platinum fault.

<Note 1593686.1>: Race Condition in Exalogic vServer Network Initialization Script Can Result in vServers Being Inaccessible Via Some IP Addresses

6. yum-updatesd: error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: ol5_UEK_latest. Please verify its path and try again

Details: 

Following is the error message seen in /var/log/messages which triggers the platinum fault.

yum-updatesd: error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository: ol5_UEK_latest. Please verify its path and try again

Solution: Please review the following MOS note on this Platinum fault.

<Note 1912342.1>: Exalogic : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository

7. kernel: cgrep[562]: segfault at 0 ip 000000004e7f7a3c sp 00000000ffc0ac4c error 4 in libc-2.5.so[4e79f000+154000]

Details: 

Following is the error message seen in /var/log/messages which triggers the platinum fault. This is a standard Linux application.

kernel: cgrep[562]: segfault at 0 ip 000000004e7f7a3c sp 00000000ffc0ac4c error 4 in libc-2.5.so[4e79f000+154000]

Solution: Please review the following MOS note on this Platinum fault.

<Note 2039129.1>: Troubleshooting Segmentation Faults on Exalogic Environments

8. kernel: ponu_ge_lstdat1[16101]: segfault at 28 ip 00007f5002b0a33d sp 00007fffe0fb6ee0 error 4 in libkitd.so[7f5002aea000+3f000]

Details: 

Following is the error message seen in /var/log/messages which triggers the platinum fault. This is a third-party application.

ponu_ge_lstdat1[16101]: segfault at 28 ip 00007f5002b0a33d sp 00007fffe0fb6ee0 error 4 in libkitd.so[7f5002aea000+3f000]
ponu_ge_lstdat1[16100] general protection ip:7f185d09334b sp:7fff7cc91db0 error:0 in libkitd.so[7f185d073000+3f000]

Solution: Please review the following MOS note on this Platinum fault.

<Note 2039129.1>: Troubleshooting Segmentation Faults on Exalogic Environments

9. kernel: FNDLOAD[1296]: segfault at c ip 000000000805689a sp 00000000ff8340ec error 4 in FNDLOAD[8048000+111000]

Details: 

Upon review, FNDLOAD is related to E-Business Suite and was running on the server that reported the following error.

kernel: FNDLOAD[1296]: segfault at c ip 000000000805689a sp 00000000ff8340ec error 4 in FNDLOAD[8048000+111000]

Solution: This message does not have any adverse impact on the OS kernel and can be ignored.

10. kernel: ERST: Error Record Serialization Table (ERST) support is initialized

Details:

The following is the error message seen in /var/log/messages which triggers the platinum fault.

kernel: ERST: Error Record Serialization Table (ERST) support is initialized

Solution: When you encounter the ERST message, it can be ignored per the following note:

<Note 2012603.1>: /var/log/messages reports ERST: Error Record Serialization Table (ERST) support is initialized

11. svc.startd[13]: [ID 652011 daemon.warning] svc:/application/pkg/system-repository:default: Method "/lib/svc/method/svc-pkg-sysrepo refresh" failed with exit status 95.
     svc.startd[13]: [ID 748625 daemon.error] application/pkg/system-repository:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)

Details:

These errors denote a service that Solaris expects to start but that fails to do so.

Solution:

If this coincides with PSU patching, it can be ignored as documented in Doc ID 2049973.1. If not, collect Explorer data and open a Service Request, engaging Solaris support as needed.

INTERNAL NOTE FOR SUPPORT

From time to time, alerts in older files can be "rediscovered" by the Platinum monitoring system and be reported again.

If the messages are several days to weeks older than the SR, this may indicate that the agent was restarted and rescanned all the log files. If a previous SR for the customer covers the same errors, they can be ignored as false positives. If no previous SR was triggered, you'll need to research the message and act accordingly.
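One way to see whether the matches live in older rotated files is to count them per file. A minimal sketch (a temporary directory stands in for /var/log, and the log contents are illustrative):

```shell
#!/bin/sh
# Count Platinum pattern matches per rotated log file, so matches found
# only in older files (messages.1, messages.2, ...) can be compared
# against the SR open date. A temp directory stands in for /var/log;
# the file contents are illustrative.
tmp=$(mktemp -d)
printf 'May 20 00:00:00 host sshd[1]: Accepted publickey\n' > "$tmp/messages"
printf 'Jan  1 00:00:00 host kernel: disk fatal fault\n' > "$tmp/messages.1"
cur=$(grep -Eic 'kernel:.*(error|crit|fatal)' "$tmp/messages")
old=$(grep -Eic 'kernel:.*(error|crit|fatal)' "$tmp/messages.1")
printf 'messages: %s match(es)\nmessages.1: %s match(es)\n' "$cur" "$old"
rm -rf "$tmp"
```

Matches that appear only in the older files, with timestamps well before the SR, point to a rescan rather than a new fault.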

To find previous SRs for an account, do the following:

  1. Go to the SR
  2. To the right of the "Account" field there is an icon that looks like a sound bubble with an "I" in it, click that icon
  3. Click on the "Related SRs" tab and sort by "Date Opened" to find their latest SRs
  4. Check for SRs opened around the time of the error timestamp


PLATINUM FAULT: adrAlertLogIncidentError:oraBlockCorruptErrStack An Oracle data block corruption detected in /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/alert/log.xml at time/line number: <Date Time>/<Line Number>

Details:

As seen in the error, this issue is an Oracle Database data block corruption.

PLATINUM FAULT: adrAlertLogIncidentError:oraBlockCorruptErrStack An Oracle data block corruption detected in /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/alert/log.xml at time/line number: Mon Apr 4 22:05:59 2016/4991.

Cause:

The control stack database is throwing an error.

Solution:

Check the indicated line number in the log file for the error, and contact Support if you see this Platinum fault message; the error must be researched and addressed by Support.

Collect and review the log.xml file reported in the above error. The log.xml file will usually point to a .trc file for the database, which needs to be collected in addition to the DB alert log; engage the DB team via a collaboration SR.
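As a sketch of that collection step, the line number from the fault can be looked up in log.xml and the referenced .trc file located. The two-line log.xml content below is illustrative; the real file is the one named in the fault message:

```shell
#!/bin/sh
# Show the log.xml entry at the line number reported in the fault, and
# extract the .trc file it references so it can be collected too.
# The sample log.xml content is illustrative.
tmp=$(mktemp -d)
cat > "$tmp/log.xml" <<'EOF'
<msg_text>ORA-01578: ORACLE data block corrupted (file # 4, block # 123)</msg_text>
<msg_text>Incident details in: /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/incident/incdir_1/elctrldb_ora_1_i1.trc</msg_text>
EOF
line=$(sed -n '1p' "$tmp/log.xml")            # line number taken from the fault
trc=$(grep -o '/[^ <]*\.trc' "$tmp/log.xml")  # trace file(s) to collect
echo "$line"
echo "trace file: $trc"
rm -rf "$tmp"
```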


PLATINUM FAULT: adrAlertLogIncidentError:accessViolationErrStack An access violation detected in /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/alert/log.xml at time/line number: <Date Time>/<Line Number>

Details:

This Platinum Fault error message is reported with a date/time and line number by Platinum monitoring. For example, the following fault message shows the date/time "Mon Feb 2 22:00:39 2015" and the line number 116830.

PLATINUM FAULT: adrAlertLogIncidentError:accessViolationErrStack An access violation detected in /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/alert/log.xml at time/line number: Mon Feb 2 22:00:39 2015/116830.

Cause:

The control stack database is throwing an error.

Solution:

Check the indicated line number in the log file for the error. The error will likely be an ORA-XXXX error, which must be researched and addressed by Support. Some common ones are:

Exalogic Control DB Alert Log Throws "ORA-07445: exception encountered: core dump" Error In Exalogic Virtual Releases (Doc ID 1995841.1)


PLATINUM FAULT: adrAlertLogIncidentError:genericIncidentErrStack Incident (ORA 445) detected in /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/alert/log.xml at time/line number: <Date Time>/<Line Number>

Details:

This Platinum Fault error message is reported with a date/time and line number by Platinum monitoring. For example, the following fault message shows the date/time "Thu Mar 5 23:35:00 2015" and the line number 59970.

PLATINUM FAULT: adrAlertLogIncidentError:genericIncidentErrStack Incident (ORA 445) detected in /u01/app/oracle/diag/rdbms/elctrldb/elctrldb/alert/log.xml at time/line number: Thu Mar 5 23:35:00 2015/59970.

Cause:

The control stack database is throwing an error.

Solution:

Check the indicated line number in the log file for the error, and contact Support if you see this Platinum fault message; the error must be researched and addressed by Support.


PLATINUM FAULT: Ilom Sensor Alerts: SensorAlerts:PowerSupplyStatus Power supply sensor(s) at level - CRITICAL

Details:

Following is the Platinum Fault message reported by Platinum monitoring.

PLATINUM FAULT: Ilom Sensor Alerts: SensorAlerts:PowerSupplyStatus Power supply sensor(s) at level - CRITICAL

Cause:

This Platinum fault message can be due to an alert condition found in the ILOM logs, or it can be a false positive, indicated by the absence of alerts in the ILOM snapshot.

Solution:

Review an ILOM snapshot for the specific error; if a faulty part is found, it needs to be replaced.

Also refer to the following note, which provides more information on power supply sensor alerts.

<Note 1398378.1>: ILOM Targets Raise Critical Power Supply Sensor Alerts In EM That Never Clear

PLATINUM FAULT: ZFSProblem:ProblemSeverity AK-8003-Y6 : The device configuration for JBOD 1111FMD00X

Details:

The following is the Platinum Fault message reported by Platinum monitoring.

PLATINUM FAULT: ZFSProblem:ProblemSeverity AK-8003-Y6 : The device configuration for JBOD 1111FMD00X is invalid. Severity: Major Message ID: 13da90c9-a088-481c-83c2-9680ec5fa4b1

Cause:

The command "sn01:> maintenance hardware select chassis-001 select disk list" shows one of the disk bays as absent:
disk-020  HDD 20  absent    -             -                 -                             --  

Solution:

Hardware replacement using the appropriate action plan:

<Note 2211440.1>: Exalogic ZFS Storage Appliance: Support Strategy for Replacing a ZFS Logzilla with a Higher Capacity SSD

<Note 1410463.1>: How to Replace Sun ZFS Unified Storage Appliance Hard Disk Drive

PLATINUM FAULT: ZFSProblem:ProblemSeverity AK-8002-9M : The cable between the Ethernet ports of each controller is down

Details:

The following is the Platinum Fault message reported by Platinum monitoring.

PLATINUM FAULT: ZFSProblem:ProblemSeverity AK-8002-9M : The cable between the Ethernet ports of each controller is down

Cause:

This is still under investigation, but it is widely seen on nodes running OS8.6.x (2013.1.6.x) firmware.

Solution:

Please review the following MOS note on this Platinum fault.

<Note 2195659.1> - Oracle ZFS Storage Appliance: Alert "The cable between the Ethernet ports of each controller is down"

PLATINUM FAULT: ZFSProblem:ProblemSeverity ZFS-8000-D3 : ZFS device id1 sd@SATA_____TOSHIBA_THN

Details:

Following is the Platinum Fault message reported by Platinum monitoring.

PLATINUM FAULT: ZFSProblem:ProblemSeverity ZFS-8000-D3 : ZFS device id1 sd@SATA_____TOSHIBA_THN

Cause:

This Platinum fault message can be due to an alert condition found in the ILOM logs, or it can be a false positive, indicated by the absence of alerts in the ILOM snapshot.

Solution: Please review the following MOS note on this Platinum fault.

<Note 2023190.1> - Exalogic: "Major Fault ZFS device 'id1,sd@SATA_____TOSHIBA_THNSNC51________XXXXXXXXX/a' in pool 'exalogic' failed" Fault Messages Thrown on ZFS Heads

PLATINUM FAULT: ZFSProblem:ProblemSeverity USB-8000-GT : A hardware fault within the device or its interface was detected in the USB device. The driver has failed to initialize the device and the device is in an invalid state.

Details:

Following is the Platinum Fault message reported by Platinum monitoring.

PLATINUM FAULT: ZFSProblem:ProblemSeverity USB-8000-GT : A hardware fault within the device or its interface was detected in the USB device. The driver has failed to initialize the device and the device is in an invalid state

Cause:

This Platinum fault message is caused by <Bug 20957047> - Unrecoverable USB Hardware Error reported on internal USB

Solution: Please review the following MOS note on this Platinum fault.

<Note 2019044.1> - Exalogic ZS3-ES Heads: Critical Fault Message "A hardware fault within the device or its interface was detected in the USB device. The driver has failed to initialize the device and the device is in an invalid state"

 


PLATINUM FAULT: ZFSProblem:ProblemSeverity DISK-8000-CY : There have been non-recovered ZFS checksum errors on this disk

Details:

The following is the Platinum Fault message reported by Platinum monitoring.

PLATINUM FAULT: ZFSProblem:ProblemSeverity DISK-8000-CY : There have been non-recovered ZFS checksum errors on this disk

Cause:

This Platinum fault message occurs when there is a disk fault in the disk tray. It can be validated by running "maintenance hardware show" on a storage head and checking for the status "faulted".

Solution:

Please review the following MOS note, collect and upload support bundles to the SR, and engage a ZFS appliance specialist to review the fault and schedule a replacement.

 


PLATINUM FAULT: ZFSAlert:ProblemType All communication with the cluster peer has been lost

Details:

This Platinum fault message occurs during normal patching or failover testing. If either was happening during the alert, no action is necessary.

If no planned activity was occurring at the time of the alert, open an SR and collect a support bundle on the affected head for review.

From the command line, issue the following (replace SR number as appropriate):

el01sn01-adm:> maintenance system
el01sn01-adm:maintenance system> sendbundle 3-00000000000
A support bundle is being created and sent to Oracle. You will receive an alert
when the bundle has finished uploading. Please save the following filename, as
Oracle support personnel will need it in order to access the bundle:

/upload/issue/3-00000000000/3-00000000000_ak.b616378c-c1b7-479e-a9b4-f62259469d2c.tar.gz

 


PLATINUM: An Integrated I/O (IIO) fatal error in downstream PCIE device has occurred

Details:

This Platinum fault message will occur after patching the firmware of a storage node.

Cause:

Bug 22012490 "PCIE fatal enountered when upgrading infiniband CX-2 firmware"

Solution:

Contact ZFS support and have them implement the workaround found in this note: Oracle ZFS Storage Appliance: FMA 'pcie-fatal' event seen after upgrade of Infiniband CX-2 firmware on ZS3-ES (Doc ID 2082722.1)


PLATINUM FAULT: ZFSProblem:ProblemSeverity SUNOS-8000-KL : The system has rebooted after a kernel panic. Severity: Major Message ID: xxxxxxxx-xxxx-xxxx-xxxxxxxxxxxx

Details:

Following is the Platinum Fault message reported by Platinum monitoring.

PLATINUM FAULT: ZFSProblem:ProblemSeverity SUNOS-8000-KL : The system has rebooted after a kernel panic. Severity: Major Message ID: xxxxxxxx-xxxx-xxxx-xxxxxxxxxxxx

Cause:

There are a number of reasons, both system and user related, that can trigger this message. This will be updated with known causes as they occur. The current list is:

Bug 20188335 - zfs appliance crashed due to Deadlock: cycle in blocking chain
Bug 18199329 - Deadlock: cycle in blocking chain
Bug 15918412 - EXALOGIC: DISK FAILURE CAN CAUSE ZFSSA HEAD LOCKUP WITH SAS LOCK ERROR
Bug 15710557 - SUNBT7038163-SOLARIS_11 UPDATE THEBE FIRMWARE TO 1.11.0

Solution:

For the known bugs above, apply the latest PSU.

There is also an internal note on this issue:

Sun ZFS Storage Appliance: Involuntary Reboots with SAS zone locking violation (Doc ID 1544989.1)

 

References

<NOTE:2023190.1> - Exalogic: "Major Fault ZFS device 'id1,sd@SATA_____TOSHIBA_THNSNC51________XXXXXXXXX/a' in pool 'exalogic' failed" Fault Messages Thrown on ZFS Heads
<NOTE:1948665.1> - In Exalogic "kernel: Error: Driver 'pcspkr' is already registered, aborting" messages are seen in /var/log/messages
<NOTE:1993848.1> - Oracle Platinum Services – Quick Reference Guide
<NOTE:2019044.1> - Exalogic ZS3-ES Heads: Critical Fault Message "A hardware fault within the device or its interface was detected in the USB device. The driver has failed to initialize the device and the device is in an invalid state"
<NOTE:1398378.1> - ILOM Targets Raise Critical Power Supply Sensor Alerts In EM That Never Clear
<NOTE:1593686.1> - Race Condition in Exalogic vServer Network Initialization Script Can Result in vServers Being Inaccessible Via Some IP Addresses
<NOTE:2039129.1> - Troubleshooting Segmentation Faults on Exalogic Environments
<NOTE:1900391.1> - uce_agent.bin segfault on Exalogic -- error 6 in uce_agent.bin[8048000+6e7000]
<NOTE:1912342.1> - Exalogic : error getting update info: Cannot retrieve repository metadata (repomd.xml) for repository
<NOTE:1985206.1> - Oracle Linux: kernel: xs_tcp_setup_socket: connect returned unhandled error -107
<NOTE:1492408.1> - On very high load, performance issue with SDP between the Oracle Traffic Director (OTD) to Mid-tier traffic
<NOTE:1995841.1> - Exalogic Control DB Alert Log Throws "ORA-07445: exception encountered: core dump" Error In Exalogic Virtual Releases
<NOTE:1932308.1> - Exalogic Best Practice: Configure Oracle Traffic Director (OTD) to use TCP instead of SDP

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.