On Exadata during disk replacement physical disk shows normal but cell disk shows proactive failure status

Asset ID:	1-71-1996075.1
Update Date:	2017-03-12
Keywords:

Solution Type Technical Instruction Sure

Solution 1996075.1 : On Exadata during disk replacement physical disk shows normal but cell disk shows proactive failure status

Applies to:

Exadata Database Machine X2-2 Hardware - Version All Versions and later
Information in this document applies to any platform.

Goal

This document describes a situation where, after a disk replacement, the physical disk status is "normal" but the cell disk status is "proactive failure". The reasons are:

1) The cell disk still points to the old disk serial number and was not dropped automatically, so it shows status "proactive failure".

2) Because of the wrong pointer, the auto-create of the new cell disk and grid disks failed.

3) Even manually dropping the cell disk fails with error CELL-04519: Cannot complete the drop of cell disk

Solution

Actual machine/disk/cell/grid disk names have been replaced with generic ones for security purposes.

Steps to manually drop the cell disk and create the cell/grid disks on Exadata to fix the above situation.

1. In this example the issue is with disk 11, which has recently been replaced. Confirm the status in alerthistory and note that the old cell/grid disks were NOT dropped automatically and show status "proactive failure".

alerthistory.out:
==================

55_3 2015-03-25T23:39:18+01:00 critical "Data hard disk failed. Status : NOT PRESENT Manufacturer : HITACHI Model Number : H723*******3.0T Size : 3.0TB Serial Number : 121*****ZD Firmware : A690 Slot Number : 11 Cell Disk : CD_11_cellnode03a Grid Disk : DATA_CD_11_cellnode03a, DBFS_DG_CD_11_cellnode03a, RECO_CD_11_cellnode03a"

celldisk-detail.out:
===============

name: CD_11_cellnode03a
comment:
creationTime: 2013-06-24T10:58:13+02:00
deviceName: /dev/sdad
devicePartition: /dev/sdl
diskType: HardDisk
errorCount: 5
freeSpace: 0
id: d5c3ef97***********14956469694
interleaving: none
lun: 0_11
physicalDisk: R5S8ZD <<<<<==== cell disk pointing to old physical disk which is removed
raidLevel: 0
size: 2793.953125G
status: proactive failure <<<<<<=======

griddisk-detail.out:
===============

name: DATA_CD_11_cellnode03a
asmDiskGroupName: DATA
asmDiskName: DATA_CD_11_cellnode03A
asmFailGroupName: cellnode03A
availableTo:
cachingPolicy: default
cellDisk: CD_11_cellnode03a
comment:
creationTime: 2013-06-24T11:00:53+02:00
diskType: HardDisk
errorCount: 1
id: 1ccd50e**************15ecf
offset: 32M
size: 2208G
status: proactive failure

name: DBFS_DG_CD_11_cellnode03a
asmDiskGroupName: DBFS_DG
asmDiskName: DBFS_DG_CD_11_cellnode03A
asmFailGroupName: cellnode03A
availableTo:
cachingPolicy: default
cellDisk: CD_11_cellnode03a
comment:
creationTime: 2013-06-24T11:00:41+02:00
diskType: HardDisk
errorCount: 1
id: a39f54be*************674070d621
offset: 2760.15625G
size: 33.796875G
status: proactive failure

name: RECO_CD_11_cellnode03a
asmDiskGroupName: RECO
asmDiskName: RECO_CD_11_cellnode03A
asmFailGroupName: cellnode03A
availableTo:
cachingPolicy: none
cellDisk: CD_11_cellnode03a
comment:
creationTime: 2013-06-24T11:00:58+02:00
diskType: HardDisk
errorCount: 1
id: aa8f912a*************f6528a5c3f
offset: 2208.046875G
size: 552.109375G
status: proactive failure
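The offsets and sizes above describe how the three grid disks are laid out on the cell disk, which is why step 8 recreates them in offset order. As a rough check, the following Python sketch sorts the grid disks by offset and works out where each one starts and ends. The values are copied from the output above; the helper names are illustrative only, and the conversion assumes 1G = 1024M.

```python
# Values copied from the celldisk-detail and griddisk-detail output above.
CELL_DISK_SIZE = "2793.953125G"
GRID_DISKS = [
    # (name, offset, size)
    ("DATA_CD_11_cellnode03a", "32M", "2208G"),
    ("DBFS_DG_CD_11_cellnode03a", "2760.15625G", "33.796875G"),
    ("RECO_CD_11_cellnode03a", "2208.046875G", "552.109375G"),
]

# Assumption: sizes use binary multiples (1G = 1024M).
UNIT_TO_MB = {"M": 1.0, "G": 1024.0, "T": 1024.0 * 1024.0}


def to_mb(value):
    """Convert a size such as '32M' or '2208G' to megabytes."""
    return float(value[:-1]) * UNIT_TO_MB[value[-1].upper()]


def layout(grid_disks):
    """Return (name, start_mb, end_mb) tuples sorted by starting offset."""
    rows = [
        (name, to_mb(offset), to_mb(offset) + to_mb(size))
        for name, offset, size in grid_disks
    ]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, start, end in layout(GRID_DISKS):
        print(f"{name}: starts at {start:.0f}M, ends at {end:.0f}M")
    print(f"cell disk size: {to_mb(CELL_DISK_SIZE):.0f}M")
```

Under that assumption, the order by offset is DATA, RECO, DBFS_DG, and the end of DBFS_DG coincides with the cell disk size.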

2. Because the cell/grid disks pointing to the old physical disk were not dropped automatically, the auto-create of the new cell/grid disks also fails.

alerthistory.out:
==================

55_4 2015-03-31T10:26:00+02:00 warning "Oracle Exadata Storage Server failed to auto-create cell disk and grid disks on the newly inserted physical disk. Physical Disk : 20:11 Status : NORMAL Manufacturer : HITACHI Model Number : H723********.0T Size : 3.0TB Serial Number : 14******GK Firmware : A690 Slot Number : 11 "

3. Next we confirm the physical disk and lun are fine.

physicaldisk-detail.out:
========================

name: 20:11
deviceId: 21
diskType: HardDisk
enclosureDeviceId: 20
errMediaCount: 0
errOtherCount: 0
luns: 0_11
makeModel: "HITACHI H7*********.0T"
physicalFirmware: A690
physicalInsertTime: 2015-03-31T13:42:17+02:00
physicalInterface: sas
physicalSerial: RJ52GK                     <<<<==== Correct physical disk serial for new disk
physicalSize: 2794.5199813842773G
slotNumber: 11
status: normal

lun-detail.out:
===============

name:                 0_11
cellDisk:             CD_11_cellnode03a
deviceName:           /dev/sdad
diskType:             HardDisk
id:                   0_11
isSystemLun:          FALSE
lunSize:              2793.966796875G
lunUID:               0_11
physicalDrives:       20:11
raidLevel:            0
lunWriteCacheMode:    "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
status:               normal
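The mismatch described in steps 1 to 3 can be sketched as a small Python check: the cell disk and the physical disk share LUN 0_11, but the cell disk's physicalDisk value differs from the new disk's physicalSerial. The sample text below is trimmed from the outputs in this document (without the arrow annotations). The function names are hypothetical, and comparing physicalDisk with physicalSerial is an assumption based on the annotations above, not a documented CellCLI rule.

```python
CELLDISK_DETAIL = """
name: CD_11_cellnode03a
lun: 0_11
physicalDisk: R5S8ZD
status: proactive failure
"""

PHYSICALDISK_DETAIL = """
name: 20:11
luns: 0_11
physicalSerial: RJ52GK
status: normal
"""


def parse_detail(text):
    """Parse 'attribute: value' lines of one detail record into a dict."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as '20:11' stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record


def has_stale_pointer(cell_disk, physical_disk):
    """True when both records share a LUN but the serial values differ."""
    same_lun = cell_disk.get("lun") == physical_disk.get("luns")
    serial_differs = cell_disk.get("physicalDisk") != physical_disk.get("physicalSerial")
    return same_lun and serial_differs


if __name__ == "__main__":
    cd = parse_detail(CELLDISK_DETAIL)
    pd = parse_detail(PHYSICALDISK_DETAIL)
    if has_stale_pointer(cd, pd):
        print(f"{cd['name']} still points to {cd['physicalDisk']}, "
              f"but LUN {cd['lun']} now holds {pd['physicalSerial']}")
```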

4. Manually drop the cell disk that points to the wrong disk, using the force option.

cellcli> drop celldisk CD_11_cellnode03a force;

The above command might fail with the error below.

CELL-04519: Cannot complete the drop of cell disk: CD_11_cellnode03a. Received error: CELL-02583: The operation is not permitted on this cell disk.
Cell disks not dropped: CD_11_cellnode03a

5. If the drop cell disk step fails, which happens in most cases, reboot the affected cell node and retry the drop, which should then succeed.

"shutdown -r now"

cellcli> drop celldisk CD_11_cellnode03a force;

Dropping the cell disk also drops its grid disks.

6. Validate the drop by running the command below. It should not return the cell disk.

cellcli> list celldisk CD_11_cellnode03a detail

7. Create the celldisk manually and assign the same lun 0_11

cellcli> create celldisk CD_11_cellnode03a lun=0_11

8. Create the new grid disks in order of their OFFSET, using the same sizes they had earlier. You can refer to the sundiag logs, or query the grid disks of another cell disk, to find reference sizes. Please review this carefully.

Run the below query on cell disk 10 to get size/offset reference values

CellCLI> list griddisk where celldisk=CD_10_cellnode03a attributes name,size,offset

Create the new grid disks in order of offset, using the sizes shown by the previous command.

CellCLI> create griddisk DATA_CD_11_cellnode03a celldisk=CD_11_cellnode03a,size=2208G
CellCLI> create griddisk RECO_CD_11_cellnode03a celldisk=CD_11_cellnode03a,size=552.109375G
CellCLI> create griddisk DBFS_DG_CD_11_cellnode03a celldisk=CD_11_cellnode03a,size=33.796875G
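The ordering rule can be sketched in Python: take the name, size and offset rows from the reference cell disk, sort them by offset, and emit the create commands in that order. The reference listing below is a hypothetical sample for cell disk 10 that reuses the sizes and offsets shown for disk 11 in step 1. The three-column whitespace-separated layout, the helper names, and the 1G = 1024M conversion are assumptions made for illustration.

```python
# Hypothetical output of the reference query on cell disk 10 (name, size, offset).
REFERENCE_LISTING = """
DATA_CD_10_cellnode03a     2208G        32M
DBFS_DG_CD_10_cellnode03a  33.796875G   2760.15625G
RECO_CD_10_cellnode03a     552.109375G  2208.046875G
"""

# Assumption: sizes use binary multiples (1G = 1024M).
UNIT_TO_MB = {"M": 1.0, "G": 1024.0, "T": 1024.0 * 1024.0}


def to_mb(value):
    """Convert a size such as '32M' or '2208G' to megabytes."""
    return float(value[:-1]) * UNIT_TO_MB[value[-1].upper()]


def parse_listing(text):
    """Return (name, size, offset) tuples from a three-column listing."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            rows.append(tuple(fields))
    return rows


def create_commands(reference_rows, old_cell_disk, new_cell_disk):
    """Build create griddisk commands for the new cell disk, lowest offset first."""
    ordered = sorted(reference_rows, key=lambda row: to_mb(row[2]))
    commands = []
    for name, size, _offset in ordered:
        new_name = name.replace(old_cell_disk, new_cell_disk)
        commands.append(
            f"create griddisk {new_name} celldisk={new_cell_disk},size={size}"
        )
    return commands


if __name__ == "__main__":
    rows = parse_listing(REFERENCE_LISTING)
    for command in create_commands(rows, "CD_10_cellnode03a", "CD_11_cellnode03a"):
        print(command)
```

With this sample input the sketch prints the same three commands as above, in the order DATA, RECO, DBFS_DG. Review the generated commands before running them in CellCLI.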

Run the query below on the new grid disks and make sure all the offsets (in the third column) match the reference values:

CellCLI> list griddisk where celldisk=CD_11_cellnode03a attributes name,size,offset
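The comparison can also be done mechanically. The Python sketch below assumes the output of both queries has been saved as text with three whitespace-separated columns (name, size, offset). Both listings are hypothetical samples built from the values in this document, and the function names are illustrative. Values are compared as printed, so the same quantity written in different units would be reported as a difference.

```python
# Hypothetical saved output of the two queries (name, size, offset).
REFERENCE_LISTING = """
DATA_CD_10_cellnode03a     2208G        32M
RECO_CD_10_cellnode03a     552.109375G  2208.046875G
DBFS_DG_CD_10_cellnode03a  33.796875G   2760.15625G
"""

NEW_LISTING = """
DATA_CD_11_cellnode03a     2208G        32M
RECO_CD_11_cellnode03a     552.109375G  2208.046875G
DBFS_DG_CD_11_cellnode03a  33.796875G   2760.15625G
"""


def parse_listing(text, cell_disk):
    """Map each grid disk prefix (for example DATA) to its (size, offset) pair."""
    suffix = "_" + cell_disk
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0].endswith(suffix):
            name, size, offset = fields
            result[name[: -len(suffix)]] = (size, offset)
    return result


def mismatches(reference, new):
    """Return prefixes whose size or offset differ, or that exist on one side only."""
    prefixes = reference.keys() | new.keys()
    return sorted(p for p in prefixes if reference.get(p) != new.get(p))


if __name__ == "__main__":
    reference = parse_listing(REFERENCE_LISTING, "CD_10_cellnode03a")
    new = parse_listing(NEW_LISTING, "CD_11_cellnode03a")
    differing = mismatches(reference, new)
    if differing:
        print("Sizes or offsets differ for:", ", ".join(differing))
    else:
        print("All sizes and offsets match the reference cell disk")
```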

9. At the ASM level, the old disks were dropped from the diskgroups when the cell disk was dropped. Log in to the +ASM1 instance and add the new grid disks to the ASM diskgroups. Set a higher rebalance power (11) so the rebalance completes faster.

Add each griddisk to the diskgroup by running:

sql> alter diskgroup DATA add disk '<path for data>' rebalance power 11;
sql> alter diskgroup RECO add disk '<path for reco>' rebalance power 11;
sql> alter diskgroup DBFS_DG add disk '<path for dbfs_dg>' rebalance power 11;

10. Run a sundiag again for this cell node to verify "Normal" status for cell/grid disks.
