Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction - Sure Solution

Solution 1386147.1: How to Replace a Hard Drive in an Exadata Storage Server (Hard Failure)
Applies to:
Exadata X5-2 Hardware - Version All Versions and later
Exadata Database Machine X2-2 Qtr Rack - Version All Versions and later
Oracle SuperCluster M7 Hardware - Version All Versions and later
Exadata X6-2 Hardware - Version All Versions and later
Exadata X6-8 Hardware - Version All Versions and later
Information in this document applies to any platform.

Goal
How to Replace a Hard Drive in an Exadata Storage Server (Cell) (Hard Failure)

Solution
DISPATCH INSTRUCTIONS:
Special Instructions for Dispatch are required for this part.

For Attention of Dispatcher: The parts required in this action plan may be available as spares owned by the customer, which they received with the Engineered System. (These are sometimes referred to as ride-along spares.) If parts are not available to meet the customer's preferred delivery time/planned end date, then request the TAM or field manager to contact the customer and ask if the customer has parts available and would be prepared to use them. If customer spare parts are used, inform the customer that Oracle will replenish the customer's part stock as soon as possible. More details on this process can be found in GDMR procedure "Handling Where No Parts Available" step 2: https://ptp.oraclecorp.com/pls/apex/f?p=151:138:38504529393::::DN,BRNID,DP,P138_DLID:2,86687,4,9082,
The complete process may take longer depending on the re-balance time that may be required. This document is specific to hard drives in the "critical failure" state, also known as hard failure. In some situations a drive is first flagged as a predictive failure, which means the disk may still be in use. In such cases, refer to Doc ID 1390836.1 for the replacement steps.
600GB Special Handling (Exadata Critical Alert EX32): Exadata Storage Servers in V2, X2-2 and X2-8 systems with 600GB high performance disk drives must run version 11.2.2.4.1 (released December 2011) or higher prior to completing this procedure to replace the disk, or the replacement will not be accepted by the Exadata Storage Server software. See Doc ID 2199949.1 for details.

8TB Special Handling: Exadata Storage Servers in X2-2 and later with 8TB disk drives must run version 12.1.2.3.6 or 12.2.1.1.2 (released July 2017) or higher prior to completing this procedure to replace the disk if the replacement part is 7337414, or the replacement may not be accepted by the Exadata Storage Server software. If the image is earlier, then only use part 7301588. For Exadata/ZDLRA refer to Doc ID 2352138.1 for details. For SuperCluster refer to Doc ID 2376948.1 for details.

It is expected that the customer's DBA has completed these steps prior to arriving to replace the disk. The following commands are provided as guidance in case the customer needs assistance checking the system prior to replacement. If the customer or FSE requires more assistance prior to the physical replacement of the device, EEST/TSC should be contacted.
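As a quick aid for the version prerequisites above, the current image version can be compared against the required minimum with GNU "sort -V". This is a minimal sketch only: the version strings are hardcoded sample values for illustration, and on a real cell the current version would come from the DBA's imageinfo output.

```shell
# check_min_image: succeed (exit 0) when the current image version is at
# least the required minimum. Uses GNU "sort -V" for version ordering.
check_min_image() {
  min="$1"; cur="$2"
  # The smaller of the two versions sorts first; if that is the minimum,
  # the current version meets or exceeds it.
  [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n 1)" = "$min" ]
}

# Hypothetical sample values for the 8TB case described above.
if check_min_image "12.1.2.3.6" "12.1.2.2.0"; then
  echo "image meets the minimum; newer replacement part acceptable"
else
  echo "image below the minimum; only use the older replacement part"
fi
```

The same comparison works for the 600GB minimum (11.2.2.4.1) by substituting the version strings.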
For "imageinfo" versions 11.2.3.2.x and later, use this syntax:
In the output above, both the "name" value (the number following the ":") and the "slotNumber" field identify the slot of the physical device requiring replacement, where the "status" field reports "critical". In the example above, the slot is determined to be slot 5 (slotNumber: 5 and name: 28:5).

2. The Oracle ASM disks associated with the grid disks on the physical disk will be automatically dropped with the FORCE option, and an ASM re-balance will start immediately to restore data redundancy. Because the disk is in "critical" state, there is no need to check whether ASM is still re-balancing.

a. Login to a database node with the username of the owner of the Oracle Grid Infrastructure home. Typically this is the 'oracle' user.
b. Select the ASM instance for this DB node and connect to SQL*Plus:
c. From the DB node, run the following query, using the name of the celldisk associated with this physical disk, which is given in the Cell alert:

SQL> select group_number,path,header_status,mount_status,mode_status,name from V$ASM_DISK where path like '%CD_05_edx2cel02';

This query should return no rows, indicating the disk is no longer in the ASM diskgroup configuration. If it returns any rows, contact the SR owner for further guidance.
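The slot identification in step 1 can also be scripted against captured output. A minimal sketch follows; the sample text is a hypothetical excerpt of the physicaldisk attributes (name, slotNumber, status) rather than live cell output.

```shell
# Hypothetical excerpt of "list physicaldisk detail" style attributes for
# the failed device, matching the slot-5 example above.
sample='name:               28:5
slotNumber:         5
status:             critical'

# Pull the slot from the slotNumber attribute line.
slot=$(printf '%s\n' "$sample" | awk '/^slotNumber:/ {print $2}')
echo "disk requiring replacement is in slot $slot"
```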
a. Login to the cell server and enter the CellCLI interface:

edx2cel01 login: celladmin
Password:
[celladmin@edx2cel01 ~]$ cellcli
CellCLI: Release 11.2.2.4.2 - Production on Mon Jul 23 16:21:17 EDT 2012

Copyright (c) 2007, 2009, Oracle. All rights reserved.
Cell Efficiency Ratio: 1,000

CellCLI>

b. Verify that the msStatus attribute is "running" before replacing the disk:
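As a sketch of that verification, the check can be expressed as a small shell function over a captured CellCLI attribute line; the sample line below is hypothetical.

```shell
# ms_is_running: succeed when a captured attribute line reports msStatus
# as "running".
ms_is_running() {
  case "$1" in
    *msStatus*running*) return 0 ;;
    *) return 1 ;;
  esac
}

sample='msStatus:            running'   # hypothetical captured line
if ms_is_running "$sample"; then
  echo "MS is running; OK to proceed with the replacement"
else
  echo "MS is not running; stop and investigate before replacing the disk"
fi
```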
a. Login as 'root' on the Storage Cell, and use 'df' to determine the md device name for the "/" volume:

[root@dbm1cel1 /]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md5              10317752   2906660   6886980  30% /
tmpfs                 12265720         0  12265720   0% /dev/shm
/dev/md7               2063440    569452   1389172  30% /opt/oracle
/dev/md4                118451     37567     74865  34% /boot
/dev/md11              2395452     74228   2199540   4% /var/log/oracle

b. Use 'mdadm' to determine the volume status:

[root@dbm1cel1 ~]# mdadm -Q --detail /dev/md5
/dev/md5:
        Version : 0.90
  Creation Time : Wed Apr 11 12:08:33 2012
     Raid Level : raid1
...
    Update Time : Wed Apr 11 13:35:04 2012
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : c93a778e:64f89fb5:2c560736:d50b1c04
         Events : 0.838

    Number   Major   Minor   RaidDevice State
       0       8        5        0      active sync   /dev/sda5
       1       0        0        1      removed
       2       8       21        -      faulty spare   /dev/sdb5

Verify the root volume is in 'State : clean, degraded' before hot replacing a system disk. If it is 'State : active' or 'State : clean', then the disk is not yet ready to be removed.

Confirm the drive needing replacement based on the output provided ("name" or "slotNumber" value) and the LED status of the drive. For a hard failure, the failed drive should have the amber "Service Action Required" LED illuminated or flashing, and may have the blue "OK to Remove" LED illuminated or flashing, depending on the nature of the failure mode and when the failure occurred. The cell server within the rack should also have its white LOCATE LED illuminated or flashing.

Perform the physical replacement of the disk following the directions in the service manual of the respective server (see REFERENCE INFORMATION below):
1. On the drive you plan to remove, push the storage drive release button to open the latch.
2. Grasp the latch and pull the drive out of the drive slot. (Caution: The latch is not an ejector. Do not bend it too far to the right; doing so can damage the latch. Also, whenever you remove a storage drive, replace it with another storage drive or a filler panel; otherwise the server might overheat due to improper airflow.)
3. Wait three minutes for the MS daemon to recognize the removal of the old drive.
4. Slide the new drive into the drive slot until it is fully seated.
5. Close the latch to lock the drive in place.
6. Verify the green "OK/Activity" LED begins to flicker as the system recognizes the new drive. The other two LEDs for the drive should no longer be illuminated.
7. Wait three minutes for the MS daemon to start rebuilding the virtual drives before proceeding.
Note: Do not run any controller commands from the service manual when replacing the disk.
8. The server's LOCATE LED and the disk's service LED blinking should automatically turn off as the system returns to an operational state.
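For system disks (slots 0 and 1), the pre-removal state check described earlier (the root volume must show 'State : clean, degraded') can be scripted against captured mdadm output; a minimal sketch, using a hypothetical captured State line:

```shell
# ready_for_removal: succeed only when captured "mdadm -Q --detail" output
# contains the "clean, degraded" state required before hot replacement.
ready_for_removal() {
  printf '%s\n' "$1" | grep -q 'State : clean, degraded'
}

# Hypothetical captured State line, matching the example output above.
sample='          State : clean, degraded'
if ready_for_removal "$sample"; then
  echo "root volume is clean and degraded: safe to hot-replace the system disk"
else
  echo "root volume not ready: do not remove the disk yet"
fi
```

Note that plain "clean" or "active" states correctly fail this check, matching the guidance above.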
OBTAIN CUSTOMER ACCEPTANCE
1. Verify the status of the newly replaced drive, using/substituting the "name" value provided in the action plan. The "status" field should report "normal". Note also that the "physicalInsertTime" should show the current date and time, not an earlier time. If it does not, then the old disk entries may still be present and the disk replacement did not complete successfully; in this case, refer to the SR owner for further assistance.

2. The firmware of the drive will be automatically upgraded to match the other disks in the system when the new drive is inserted, if it is below the supported version of the current image. If it is above the minimum supported version, then no action will be taken and the newer firmware will remain. This can be validated by the following command:
3. After the physical disk is replaced, a LUN should be automatically created, and the grid disks and cell disks that existed on the previous disk in that slot are automatically re-created on the new physical disk. If those grid disks were part of an Oracle ASM disk group, then they will be added back to the disk group and the data will be re-balanced on them, based on the disk group redundancy and the asm_power_limit parameter values.
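The physicalInsertTime recency check from step 1 can be scripted as well; a minimal sketch, assuming GNU date (-d) and a captured timestamp string, with an arbitrary one-hour threshold standing in for "current":

```shell
# insert_time_is_recent: succeed when the given timestamp parses and is
# less than one hour in the past. Requires GNU date for "-d" parsing.
insert_time_is_recent() {
  ts=$(date -d "$1" +%s 2>/dev/null) || return 1
  now=$(date +%s)
  [ $(( now - ts )) -ge 0 ] && [ $(( now - ts )) -lt 3600 ]
}

# A timestamp generated just now stands in for a freshly captured
# physicalInsertTime value.
if insert_time_is_recent "$(date '+%Y-%m-%dT%H:%M:%S%z')"; then
  echo "physicalInsertTime is current: the replacement was registered"
else
  echo "physicalInsertTime is stale: old disk entries may still be present"
fi
```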
Status should be "normal" for the cell disks and "active" for the grid disks. All of the creation times should also match the insertion time of the replacement disk. If they do not, then the old disk entries may still be present and the disk replacement did not complete successfully. If this is the case, refer to the SR owner for further assistance. Note: The LUN name attribute will also be shown in the original alert generated by the storage cell.

4. To confirm the status of the re-balance, connect to the ASM instance on a database node, and validate that the disks were added back to the ASM diskgroups and a re-balance is running:
An active re-balance operation can be identified by STATE=RUN. The group_number and inst_id columns provide the diskgroup number of the diskgroup being re-balanced and the instance number where the operation is running. The re-balance operation is complete when the above query returns "no rows selected".
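That completion test can be folded into a small script over the captured output of the re-balance status query; a minimal sketch:

```shell
# rebalance_complete: succeed when captured re-balance query output reads
# "no rows selected", i.e. no operation is still running; any other
# output (e.g. STATE=RUN rows) means the re-balance is still in progress.
rebalance_complete() {
  case "$1" in
    *"no rows selected"*) return 0 ;;
    *) return 1 ;;
  esac
}

if rebalance_complete "no rows selected"; then
  echo "re-balance complete"
else
  echo "re-balance still running; check again later"
fi
```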
If the new grid disks were not automatically added back into the ASM diskgroup configuration, then locate the disks with group_number=0, and add them back manually using the "alter diskgroup <name> add disk '<path>' rebalance power 10;" command:

SQL> select path,header_status from v$asm_disk where group_number=0;
PATH                                               HEADER_STATU
-------------------------------------------------- ------------
o/192.168.9.10/DBFS_DG_CD_05_edx2cel02             FORMER
o/192.168.9.10/DATA_Q1_CD_05_edx2cel02             FORMER
o/192.168.9.10/RECO_Q1_CD_05_edx2cel02             FORMER

SQL> alter diskgroup dbfs_dg add disk 'o/192.168.9.10/DBFS_DG_CD_05_edx2cel02' rebalance power 10;
SQL> alter diskgroup data_q1 add disk 'o/192.168.9.10/DATA_Q1_CD_05_edx2cel02' rebalance power 10;
SQL> alter diskgroup reco_q1 add disk 'o/192.168.9.10/RECO_Q1_CD_05_edx2cel02' rebalance power 10;

Repeat the prior queries to validate that the re-balance has started and there are no longer any disks with "group_number" values of 0.

5. If the disk replaced was a system disk in slot 0 or 1, then the status of the OS volume should also be checked. Login as 'root' on the Storage Cell and check the status using the same 'df' and 'mdadm' commands listed above:

[root@dbm1cel1 ~]# mdadm -Q --detail /dev/md5
/dev/md5:
        Version : 0.90
  Creation Time : Thu Mar 17 23:19:42 2011
     Raid Level : raid1
...
    Update Time : Mon Jul 23 20:11:36 2012
          State : active, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 1

           UUID : e75c1b6a:64cce9e4:924527db:b6e45d21
         Events : 0.215

    Number   Major   Minor   RaidDevice State
       3      65      213        0      spare rebuilding   /dev/sdad5
       1       8       21        1      active sync   /dev/sdb5
       2       8        5        -      faulty spare
[root@dbm1cel1 ~]#

While the system disk is rebuilding, the state will show as "active, degraded" or "active, degraded, recovering", with one device listed as rebuilding and a third listed as the 'faulty' disk. After the rebuild has started, re-running this command will include a "Rebuild Status : X% complete" line in the output. When the system disk sync is complete, the state should return to "clean" with only 2 devices. If an extra entry (faulty spare) is seen, it can be ignored; refer to Doc ID 2031054.1 for details.
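The rebuild progress mentioned above can be pulled out of captured mdadm output with a one-line filter; a minimal sketch, using a hypothetical sample line:

```shell
# rebuild_pct: print the percentage from a "Rebuild Status : X% complete"
# line in captured mdadm output; prints nothing if the line is absent.
rebuild_pct() {
  printf '%s\n' "$1" | sed -n 's/.*Rebuild Status *: *\([0-9][0-9]*\)%.*/\1/p'
}

# Hypothetical captured progress line.
echo "rebuild progress: $(rebuild_pct ' Rebuild Status : 27% complete')%"
```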
Refer to the Exadata Database Machine Owner's Guide, Appendix C, for part information.

REFERENCE INFORMATION:
References
<NOTE:1390836.1> - How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure)
<NOTE:1416303.1> - How to Identify Which Exadata Disk FRU Part Number to Order Based on Disk Model and Server Model
<NOTE:1316829.1> - Mirror Partitions Not Resynched After Replacing Failed System Drive (LUN 0 or 1)
<NOTE:1087742.1> - Replacing an Oracle Exadata V2 Data Disk Drive
<NOTE:1352938.1> - Replacing a Physicaldisk on a Storage Cell: cellcli List Physicaldisk Reports Two Entries on Same Slot but LUN Is Not Created
<NOTE:1360343.1> - INTERNAL Exadata Database Machine Hardware Current Product Issues (V2, X2-2, X2-8)
<NOTE:1360360.1> - INTERNAL Exadata Database Machine Hardware Troubleshooting
<NOTE:1274324.1> - Oracle Exadata Database Machine Diagnosability and Troubleshooting Best Practices
<NOTE:2199949.1> - (EX32) V2 and X2 Storage Servers with 600GB High Performance Disks Running Exadata Version 11.2.2.4.0 or Lower Require Software Update to Receive Replacement Drives
<NOTE:1501450.1> - INTERNAL Exadata Database Machine Hardware Current Product Issues (X3-2, X4-2, X3-8, X4-8 w/X4-2L)
<NOTE:1088475.1> - Replacing an Oracle Exadata System Disk Drive
<NOTE:1281395.1> - Steps to Manually Create Cell/Grid Disks on Exadata if Auto-Create Fails During Disk Replacement
<NOTE:1113013.1> - HALRT-02003: Data Hard Disk Failure
<NOTE:1071220.1> - Oracle Sun Database Machine V2 Diagnosability and Troubleshooting Best Practices
<NOTE:2352138.1> - Storage Servers with 8TB High Capacity Disks Running Exadata Versions Older Than 12.2 or 12.1 Require Software Update to Receive Newer Model Replacement Drives
<NOTE:2376948.1> - Support Strategy for Replacing an 8TB HDD in SPARC Based Systems
<NOTE:1524329.1> - Root Volume of Predictive Failure Boot Hard Drive in an Exadata Storage Server Remains in State of 'active'

Attachments
This solution has no attachment