"Undo metadata area is bad" Error During Patchmgr Bug Exposure Check

Asset ID:	1-72-2273254.1
Update Date:	2017-07-10
Keywords:

Solution Type Problem Resolution Sure

Solution 2273254.1 : "Undo metadata area is bad" Error During Patchmgr Bug Exposure Check

Applies to:

Exadata X5-2 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Exadata X6-2 Hardware - Version All Versions and later
SPARC SuperCluster T4-4 - Version All Versions and later
Information in this document applies to any platform.

Symptoms

The following symptoms are evident for this problem. They apply to Exadata storage cell nodes regardless of the compute node type (Exadata or SuperCluster).

patchmgr prechecks fail with an error block similar to the following example:

     2017-06-01 20:52:16 +0200 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 22468216 v1.0. Details in logfile /export/home/patch_12.1.2.2.1.160119/patchmgr.stdout.
     2017-06-01 20:52:16 +0200 :INFO: Arguments: Patch Check Prereq ssc1cel01,ssc1cel02,ssc1cel03,ssc1cel04,ssc1cel05,ssc1cel06 patch_prereq rolling 12.1.2.2.1.160119 /export/home/patch_12.1.2.2.1.160119/patchmgr.stdout patchmgr
     2017-06-01 20:52:41 +0200 :INFO: check_fix_cell_metadata.sh output for all cells ...
     2017-06-01 20:52:41 +0200 :INFO: ssc1cel01: ^[[40;36m[INFO ]^[[0m All metadata areas are good on all cell disks
     ssc1cel02: ^[[40;36m[INFO ]^[[0m All metadata areas are good on all cell disks
     ssc1cel03: ^[[40;36m[INFO ]^[[0m All metadata areas are good on all cell disks
     ssc1cel04: ^[[40;31m[ERROR ]^[[0m Primary and secondary metadata areas are bad on /dev/sdg
     ssc1cel04: ^[[40;31m[ERROR ]^[[0m Undo metadata area is bad on /dev/sdg
     ssc1cel04: ^[[40;31m[ERROR ]^[[0m Contact Oracle Support
     ssc1cel05: ^[[40;36m[INFO ]^[[0m All metadata areas are good on all cell disks
     ssc1cel06: ^[[40;36m[INFO ]^[[0m All metadata areas are good on all cell disks
     2017-06-01 20:52:41 +0200 :^[[40;1;31mERROR^[[0m: Patchmgr plugin complete: Prereq check failed for the bug 22468216
     2017-06-01 20:52:41 +0200 ++++++++++++++++++ Logs so far begin ++++++++++
     2017-06-01 20:52:42 +0200 ++++++++++++++++++ Logs so far end ++++++++++
     2017-06-01 20:52:42 +0200 :^[[40;31mFAILED^[[0m: Details in files <cell_name>.log /export/home/patch_12.1.2.2.1.160119/patchmgr.stdout, /export/home/patch_12.1.2.2.1.160119/patchmgr.stderr
     2017-06-01 20:52:42 +0200 :^[[40;1;31mFAILED^[[0m: DONE: Execute plugin check for Patch Check Prereq.
     [ERROR] Patch prerequisite checks failed. Please run cleanup before retrying.
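The color escape sequences in the output above make the per-cell errors hard to spot. The following is a minimal sketch for filtering them out. It assumes the log file contains raw ESC bytes (shown above as ^[), and the function name patchmgr_errors is hypothetical, not part of patchmgr.

```shell
#!/bin/sh
# Strip ANSI color sequences (ESC [ ... m) from patchmgr output read on stdin,
# then print only the lines tagged [ERROR ] or [ERROR].
patchmgr_errors() {
    esc=$(printf '\033')                      # literal ESC byte
    sed "s/${esc}\[[0-9;]*m//g" | grep '\[ERROR *]'
}

# Example (log path taken from the output above; yours will differ):
# patchmgr_errors < /export/home/patch_12.1.2.2.1.160119/patchmgr.stdout
```

Each remaining line starts with the cell name, so the affected cell and device can be read off directly.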

Note that the output reports bad metadata on /dev/sdg of one cell (your drive will probably differ), while the other cells report that the metadata areas on all of their cell disks are good. Focus on the specified drive, because bad metadata does exist on that drive.

Running cellcli -e list griddisk detail and cellcli -e list celldisk detail shows the drive with a status of normal but a nonzero error count. For example:

     name: CD_06_ssc1cel04
     comment:
     creationTime: 2012-11-14T11:22:15+01:00
     deviceName: /dev/sdg
     devicePartition: /dev/sdg
     diskType: HardDisk
     errorCount: 68
     freeSpace: 0
     id: d8500612-b6ad-4dc4-9cbe-6684310c5e34
     interleaving: none
     lun: 0_6
     physicalDisk: KB5ADL
     raidLevel: 0
     size: 557.859375G
     status: normal
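On a cell with many disks, the affected cell disk can be picked out of the detail listing with a small filter. This is only a sketch: it assumes the attribute layout shown in the example above (one "attribute: value" pair per line, with name: starting each record), and the function name disks_with_errors is hypothetical.

```shell
#!/bin/sh
# Read "cellcli -e list celldisk detail" output on stdin and print one line
# for every cell disk whose errorCount is greater than zero.
disks_with_errors() {
    awk '
        function report() {
            if (name != "" && errs + 0 > 0)
                printf "%s %s status=%s errorCount=%d\n", name, dev, status, errs
        }
        $1 == "name:"       { report(); name = $2; dev = ""; status = ""; errs = 0 }
        $1 == "deviceName:" { dev = $2 }
        $1 == "status:"     { status = $2 }
        $1 == "errorCount:" { errs = $2 }
        END                 { report() }
    '
}

# Example, run on the cell:
# cellcli -e list celldisk detail | disks_with_errors
```

A disk that is listed with status=normal and a nonzero errorCount matches the symptom described here.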

The /var/log/messages file for the cell will show unhandled sense errors for the drive. For example:

     Apr 6 01:14:20 ssc1cel04 kernel: sd 0:2:6:0: [sdg] Unhandled sense code
     Apr 6 01:14:20 ssc1cel04 kernel: sd 0:2:6:0: [sdg] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
     Apr 6 01:14:20 ssc1cel04 kernel: sd 0:2:6:0: [sdg] Sense Key : Medium Error [current]
     Apr 6 01:14:20 ssc1cel04 kernel: sd 0:2:6:0: [sdg] Add. Sense: Unrecovered read error
     Apr 6 01:14:20 ssc1cel04 kernel: sd 0:2:6:0: [sdg] CDB: Read(10): 28 00 00 53 80 00 00 08 00 00
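To see how often each drive has logged a medium error, the messages file can be summarized per device. The sketch below assumes the kernel message format shown in the example above. The function name medium_errors_by_device is hypothetical.

```shell
#!/bin/sh
# Count "Sense Key : Medium Error" kernel messages per device in syslog text
# read on stdin. Prints "<device> <count>" lines, sorted by device name.
medium_errors_by_device() {
    awk '/kernel: sd / && /Sense Key : Medium Error/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^\[sd[a-z]+\]$/) {
                     dev = substr($i, 2, length($i) - 2)   # drop the brackets
                     count[dev]++
                 }
         }
         END { for (d in count) print d, count[d] }' | sort
}

# Example, run on the cell:
# medium_errors_by_device < /var/log/messages
```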

One or both of the following commands, run as root on the cell, produce a "Fatal error when getting file size" error, which indicates that bad metadata exists.

cellutil -c primary -d /dev/<device>
cellutil -c secondary -d /dev/<device>
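Both checks can be run in one pass and scanned for the error text. This is a sketch only: it relies solely on the two cellutil invocations and the error message quoted above and makes no assumption about the rest of the cellutil output. The function name check_cell_metadata and the CELLUTIL override variable are hypothetical.

```shell
#!/bin/sh
# Run both cellutil metadata checks against one device and report which of
# them print the "Fatal error when getting file size" message.
# CELLUTIL is a hypothetical override, useful for testing off the cell.
check_cell_metadata() {
    dev=$1
    bad=0
    for area in primary secondary; do
        if "${CELLUTIL:-cellutil}" -c "$area" -d "$dev" 2>&1 |
               grep -q 'Fatal error when getting file size'; then
            echo "$area metadata check failed on $dev"
            bad=1
        fi
    done
    return "$bad"    # 0 if neither check printed the error, 1 otherwise
}

# Example, run as root on the cell (the device name is taken from the output above):
# check_cell_metadata /dev/sdg
```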

The output of cellcli -e list alerthistory usually shows no alerts for the drive in question.

The following actions have been taken but have not solved the problem (see Bug 23277291):

1. Ensure the tmp directory exists in /var/log/cellos

mkdir -p /var/log/cellos/tmp

2. Run the patchmgr cleanup. For example:

./patchmgr -cells cells_file.txt -cleanup

3. Run the patchmgr prechecks.

Cause

The hard drive is failing.

Solution

1. Collect a Sundiag bundle. Refer to Doc ID 1683842.1 for details.

2. Collect an ILOM Snapshot. Refer to Doc ID 1594992.1 for details.

3. Collect the patchmgr.stdout file.

4. Open a Service Request with Oracle Support.

5. Attach the data collected in steps 1 through 3 to the Service Request.

Oracle Engineers: If all of the above symptoms are confirmed, transfer the SR to Exadata hardware for drive replacement.

Product: The hardware type, e.g. Exadata X5-2 Hardware

Component: Errors or Missing Components

Sub Component: Disk Issues

Category: HW MICC TO x64 EXADATA
