Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type Technical Instruction Sure Solution 2011874.1 : How to Replace an Exadata X5-2, X6-2 Storage Server Internal USB drive
Oracle Confidential PARTNER - Available to partners (SUN). Reason: FRU for engineered system

Applies to:
Exadata X4-8 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Information in this document applies to any platform.

Goal
How to Replace an Exadata X5-2, X6-2 Storage Server Internal USB drive.

Solution

DISPATCH INSTRUCTIONS
TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
The instructions below assume the customer DBA is available and working with the field engineer onsite to manage the host OS and DB/ASM services. They are provided here so that the FE has all the steps needed when onsite, and they can be performed by the FE if the customer DBA wants, allows, or needs help with these steps.

1. Locate the server in the rack being serviced. Turn on the locate indicator light for easier identification of the server being repaired. If the server number has been identified, the Locate button on the front panel may be pressed. To turn the light on remotely, use any of the following methods:

From a login to CellCLI:
CellCLI> alter cell led on

From a login to the server's ILOM:
-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'

From a login to the server's 'root' account:
# ipmitool sunoem cli 'set /SYS/LOCATE value=Fast_Blink'
Connected. Use ^D to exit.
-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'
-> Session closed
Disconnected
2. Shut down the node for which the USB stick requires replacement.

a. For extended information on this section check MOS Note:
This is also documented in the Exadata Database Maintenance Guide chapter 3 section titled "Maintaining Exadata Storage Servers" subsection "Shutting Down Exadata Storage Server" available on the customer's cell server image in the /opt/oracle/cell/doc directory. Exadata Maintenance Guide Documentation is available internally here: http://amomv0115.us.oracle.com/archive/cd_ns/E50790_01/doc/doc.121/e51951/storage.htm#DBMMN22021
In the following examples the SQL commands should be run by the customer's DBA before the hardware replacement. They should be run by the field engineer only if the customer directs the FE to do so, or is unable to run them. The cellcli commands need to be run as root.

Note the following when powering off Exadata Storage Servers:
b. ASM drops disks that remain offline for longer than the ASM disk repair timer. Powering off or restarting Exadata Storage Servers can impact database performance if the storage server is offline for longer than the disk repair timer allows before it is restored. The default DISK_REPAIR_TIME attribute value of 3.6hrs should be adequate for replacing components, but may have been changed by the customer. To check this parameter, have the customer log into ASM and perform the following query:
SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;
As long as the value is large enough to comfortably cover the component replacement, there is no need to change it.

c. Check that ASM will tolerate the grid disks on this cell going offline:
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...sample ...
DATA_CD_09_cel01 ONLINE Yes
RECO_CD_01_cel01 ONLINE Yes
...repeated for all griddisks....
If one or more disks return asmdeactivationoutcome='No', wait for some time and repeat this check. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.

d. Run the cellcli command to inactivate all grid disks on the cell that needs to be powered down for maintenance (this could take 10 minutes or longer):
# cellcli
...sample ...
CellCLI> ALTER GRIDDISK ALL INACTIVE
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...repeated for all griddisks...

e. Execute the command below. Once the disks are offline and inactive in ASM, the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.
CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_02_dmorlx8cel01 inactive OFFLINE Yes
...repeated for all griddisks...

f. Before shutting down, make a note of the active image version. In this example it is 12.1.2.2.0; this will be required later if replacing the USB drive in an Extreme Flash storage cell.
# imageinfo
Kernel version: 2.6.39-400.264.1.el6uek.x86_64 #1 SMP Wed Aug 26 16:42:25 PDT 2015 x86_64
Active image version: 12.1.2.2.0.150917

g. For Extreme Flash systems ONLY, which use NVMe drives in place of hard disks: if the image is 12.1.2.2.0 or 12.1.2.2.1, it is necessary to make the NVMe devices bootable before shutting down, due to bug 22620662 "EXTREME FLASH CELLS FAIL TO BOOT FROM THE NVME SYSTEM DISKS". This step is not necessary if the Extreme Flash image version is 12.1.2.1.x or 12.1.2.3.0 or higher. This step is not required for High Capacity storage cells; for High Capacity proceed to step "h".

i) For 12.1.2.2.0 or 12.1.2.2.1 ONLY, type the following:
# cat << GRUB_INSTALL > /root/device.map
> (hd0) /dev/nvme0n1
> (hd1) /dev/nvme1n1
> GRUB_INSTALL
This creates the file device.map, which is used in the next step and kept for reference if required.

ii) Make the NVMe devices bootable. Type the following:
# /sbin/grub --device-map=/root/device.map << CELL_GRUB_INSTALL
> root (hd0,0)
> setup (hd0)
> root (hd1,0)
> setup (hd1)
> quit
> CELL_GRUB_INSTALL
This is an example of the expected output:
# /sbin/grub --device-map=/root/device.map << CELL_GRUB_INSTALL
GNU GRUB version 0.97 (640K lower / 3072K upper memory)
[ Minimal BASH-like line editing is supported. For the first word, TAB
If the above command fails with "Error 22: No such partition", continue to shut down the cell, as the device can still be replaced and made bootable using the diagnostics ISO image. See the later instructions in this action plan.

h. Once all disks are offline and inactive, the customer may shut down the cell using the following command:
# shutdown -hP now
When powering off Exadata Storage Servers, all storage services are automatically stopped.
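The asmdeactivationoutcome check made before the grid disks are inactivated can be scripted. The following is a minimal sketch, not part of the Exadata tooling: the function name safe_to_deactivate is hypothetical, and it assumes the three-column output of the "list griddisk attributes name,asmmodestatus,asmdeactivationoutcome" command is supplied on standard input.

```shell
# Hypothetical helper (not shipped with Exadata): reads the output of
#   cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
# on stdin, prints the name of every grid disk whose last column
# (asmdeactivationoutcome) is not "Yes", and returns 0 only when at least
# one grid disk was listed and all of them reported "Yes".
safe_to_deactivate() {
    awk '
        NF >= 3 {
            seen++
            if ($NF != "Yes") { print $1; blocked++ }
        }
        END { if (seen == 0 || blocked > 0) exit 1 }
    '
}
```

On a cell it might be used as: cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome | safe_to_deactivate. An empty listing is treated as "not safe" so that a failed cellcli call is not mistaken for a clean result.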
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

1. Slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arms. Take care to ensure the cables and Cable Management Arm are moving properly. Refer to Note 1444683.1 for CMA handling training.

2. Remove the AC power cords prior to removing the server's top cover.

For a High Capacity storage cell with a single internal USB drive, follow "High Capacity Configuration". For an Extreme Flash storage cell, follow "Extreme Flash (EF) Configuration".

High Capacity Configuration

1. Remove and replace the USB thumb drive in the internal USB port. Make a note of which slot the USB drive is inserted into; there are two USB slots. On an Exadata Storage Server based on an Oracle Server X5-2L, the internal USB ports are located near the handle on the Rear I/O daughter board, between PCIe slots 3 and 4.

2. Replace the server's top cover and re-attach the AC power cords. ILOM will take up to 2 minutes to boot.

3. Slide the server back into the rack.

4. After ILOM has booted, power on the server by pressing the power button, and then connect to the server's console. Do not connect using the ILOM web browser, as the console output with image version 12.1.2.1.0 and above is displayed on the CLI console.

To connect to the console through ILOM:
- From the ILOM CLI:
-> start /SP/console
5. From the console, monitor the system booting. The server should boot from the primary hard disk. This will be mentioned in the Exadata splash screen.

6. After the Storage Server is booted, log in as the 'root' user.

7. Run the following to copy the recovery image and configuration data to the new USB stick:
# cd /opt/oracle.SupportTools
# ./make_cellboot_usb -verbose -force
Ignore any messages such as the following; they do not prevent the action from completing:
WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.
Note: The message above may not be present with image 12.2.1.1.0 or above.
It may be necessary to stop the MS service to run this command. Remember to restart it once make_cellboot_usb has completed.

8. Set the next boot to forcibly stop at the BIOS setup menu:
# ipmitool chassis bootdev bios
Set Boot Device to bios
#

9. Reboot the server with the following command:
# shutdown -r now
10. Monitor the system booting again. The system should go automatically into the BIOS Setup screen.

11. Once the BIOS Setup screen is displayed on the console, use the arrow keys to navigate to the Boot screen. Check the "Legacy Boot Option Priority" list. Set "USB:USBIN0:ORACLE SSM PMAP" to be the first boot device, followed by "PCI RAID Adapter", followed by the onboard network PXE devices. Press "Esc" to exit the 'Boot Order Device Priority' screen. Refer to the X5 Series Servers Administration Guide for details.

12. Navigate to the Exit screen and select "Save Changes and Exit".

13. The server will boot. This time it should load the Exadata splash screen (grub) from the USB stick and indicate as such.

Extreme Flash (EF) Configuration

With image 12.1.2.2.0 and higher, only one USB drive, /dev/sda "USB:USBIN0:ORACLE SSM PMAP" (the lower USB slot), is used for the grub boot loader. USBIN1 is no longer mirrored and no longer contains the grub boot loader. In addition, with 12.1.2.2.0 and higher it is now possible to boot from the NVMe devices.

If the image is 12.1.2.1.x (where x = 0, 1, 2 or 3), proceed with the section "EF Replacement when image is 12.1.2.1.x".
If the image is 12.1.2.2.0 or higher, go to the section "EF Replacement when using image 12.1.2.2.0 or higher".
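The version rules in this note (which EF procedure applies, and whether the NVMe devices had to be made bootable before shutdown) can be expressed as a small decision helper. This is only a sketch under stated assumptions: version_ge and ef_usb_plan are hypothetical names, the comparison looks at the first five dotted fields only, and images between 12.1.2.2.2 and 12.1.2.3.0 are reported as "unstated" because this note does not mention them.

```shell
# Hypothetical helpers; neither function is part of the Exadata tooling.

# version_ge A B: succeeds when dotted version A >= B. Only the first five
# fields are compared, so a build suffix such as .150917 is ignored.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 5; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# ef_usb_plan VERSION: prints "<EF procedure> <NVMe grub step needed>".
ef_usb_plan() {
    if ! version_ge "$1" 12.1.2.1.0; then
        echo "unknown unknown"        # older images are not covered by this note
    elif ! version_ge "$1" 12.1.2.2.0; then
        echo "12.1.2.1.x no"
    elif ! version_ge "$1" 12.1.2.2.2; then
        echo "12.1.2.2.0+ yes"        # 12.1.2.2.0 and 12.1.2.2.1, bug 22620662
    elif ! version_ge "$1" 12.1.2.3.0; then
        echo "12.1.2.2.0+ unstated"   # this note is silent about these images
    else
        echo "12.1.2.2.0+ no"
    fi
}
```

For the imageinfo example earlier in this note, ef_usb_plan 12.1.2.2.0.150917 selects the "12.1.2.2.0 or higher" procedure and flags the NVMe grub step as required.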
EF Replacement when image is 12.1.2.1.x

1. Remove and replace the USB thumb drive in the internal USB port. There are two USB slots: drive /dev/sda is "USB:USBIN0:ORACLE SSM PMAP", the lower USB slot; drive /dev/sdb is "USB:USBIN1:ORACLE SSM PMAP", the upper slot. On an Exadata Storage Server based on an Oracle Server X5-2L, the internal USB ports are located near the handle on the Rear I/O daughter board, between PCIe slots 3 and 4.

2. Replace the server's top cover and re-attach the AC power cords. ILOM will take up to 2 minutes to boot.

3. Slide the server back into the rack.

4. After ILOM has booted, log in to the ILOM and connect to the console. Do not connect using the ILOM web browser, as the console output with image version 12.1.2.1.0 and above is displayed on the CLI console.

To connect to the console through ILOM:
- From the ILOM CLI:
-> start /SP/console
5. From the console, monitor the system booting. It will boot from the good USB drive.

6. After the Storage Server is booted, log in as the 'root' user; this will be the first session. Then, in a second login session, ssh to the storage cell as user root; this will be the second session.

The following process is further documented in the attachment to this document, logfile_and_notes.pdf. The entire process takes approximately one hour to complete and must not be interrupted. Do NOT attempt to manually recover the USB drive. Leave the storage cell to automatically recover the device. The attached logfile shows timings as an example of a device replacement.

The cell will now automatically rebuild the new USB drive; this is performed by the "checkdeveachboot" validation script. To view the activity of the rebuild, use the second login session and issue the following command. Note that you may need to wait at least 5 minutes after login before this log file is created. Once it exists, use the following to watch the rebuild activity:
# tail -f /var/log/cellos/checkdeveachboot.log
The rebuild has started when events such as the following are displayed in the log file:
[1432042135][2015-05-19 14:28:55 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] .....

The script will show errors such as the following. This is because the device files /dev/sdX1 and /dev/sdX2 (where X = a or b) have not yet been created. These will be created later by the script. This example shows /dev/sdb1 missing when first checked.
[1432042135][2015-05-19 14:29:00 +0100][WARNING][0-0][/opt/oracle.cellos/image_functions][cmd_retry][] Failed to run mdadm /dev/md4 --add /dev/sdb1. Retry
[1432042135][2015-05-19 14:29:01 +0100][WARNING][0-0][/opt/oracle.cellos/image_functions][cmd_retry][] Failed to run mdadm /dev/md4 --add /dev/sdb1. Retry
[1432042135][2015-05-19 14:29:05 +0100][ERROR][0-0][/opt/oracle.cellos/image_functions][cmd_retry][][DISPLAY] Unable to run mdadm /dev/md4 --add /dev/sdb1

Approximately 10 minutes after the validation script has started it will appear to have stopped, and the following will be displayed in the log file. The script will then sleep for at least 10 minutes before recommencing and again rebuilding the new device.
[1432042135][2015-05-19 14:37:29 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] CELLBOOT USB is fixable
[1432042135][2015-05-19 14:37:29 +0100][WARNING][/opt/oracle.cellos/imageLogger][imageLogger_init][] Init string is pre-initalized while calling imageLogger_init from source in /opt/oracle.cellos/restore_cellboot_usb.sh at line 32

Further warnings are displayed:
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] Reset lock /var/log/exadatatmp/cellos/locks/12410.checkdeveachboot
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] #^#^# [WARNING] [MD] 4 Device /dev/md4 is either in degraded state or stopped
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] #^#^# [ERROR] [SYSDISK] 27 One or more md devices in degraded state or stopped
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] #^#^# [ERROR] [SYSDISK] 22 /dev/sdb1 boot partition does not have BOOT label
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] #^#^# [ERROR] [USB] 25 CELLBOOT USB DEVICE USB1 not found
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] #^#^# [INFO] [USB] 28 CELLBOOT USB fixed
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] 0:138412033:301989893:16

The script will eventually complete with the following event; this event will be displayed twice before the script halts.
[1432044051][2015-05-19 15:00:58 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] USB partition /dev/sdb2 is mountable
[1432044051][2015-05-19 15:00:59 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] [INFO] mount_dev: Mount device. Cmd: mount /dev/sdb2 /mnt/usb.check.dev.each.boot
[1432044051][2015-05-19 15:01:00 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] USB has version 12.1.2.1.0.141206.1
[1432044051][2015-05-19 15:01:00 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] Reset lock /var/log/exadatatmp/cellos/locks/20512.checkdeveachboot
[1432044051][2015-05-19 15:01:00 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] #^#^# [ERROR] [SYSDISK] 22 /dev/sdb1 boot partition does not have BOOT label
[1432044051][2015-05-19 15:01:00 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][] 0:4194305:0:0

At this point the mirror of the two USB devices, /dev/md4, will be recovering; this will take approximately 35 minutes. Progress of the recovery can be viewed by examining the file /proc/mdstat from the first console session:
# cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb1[2] sda1[0]
499904 blocks [2/1] [U_]
[>....................] recovery = 3.4% (17408/499904) finish=30.8min speed=260K/sec

The recovery can also be viewed with mdadm:
# mdadm -D /dev/md4
/dev/md4:
Version : 0.90
Creation Time : Thu Apr 30 10:26:07 2015
Raid Level : raid1
Array Size : 499904 (488.27 MiB 511.90 MB)
Used Dev Size : 499904 (488.27 MiB 511.90 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 4
Persistence : Superblock is persistent
Update Time : Fri May 8 09:07:27 2015
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Rebuild Status : 1% complete
UUID : 1a7aafda:114252b2:04894333:532a878b
Events : 0.252
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
2 8 17 1 spare rebuilding /dev/sdb1

Wait for the recovery to complete; it will show the following:
# cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb1[1] sda1[0]
499904 blocks [2/2] [UU]
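The /proc/mdstat readings above can be reduced to a one-word state when waiting for the mirror to finish. The following is a minimal sketch and not part of the cell software: md_state is a hypothetical name, and it assumes the usual multi-line /proc/mdstat layout (device line, "blocks" line, optional progress line) is supplied on standard input.

```shell
# Hypothetical helper: reads /proc/mdstat-style text on stdin and reports the
# state of one md device (default md4) as "synced", "recovering <pct>",
# "degraded" or "missing". The exit status is 0 only for "synced".
md_state() {
    awk -v dev="${1:-md4}" '
        # A line such as "md4 : active raid1 sdb1[2] sda1[0]" starts a device.
        $1 ~ /^md/ && $2 == ":" { in_dev = ($1 == dev); if (in_dev) found = 1; next }
        # "[U_]" at the end of the blocks line marks a missing mirror half.
        in_dev && /blocks/ && $NF ~ /_/ { degraded = 1 }
        # The progress line carries the percentage, e.g. "recovery = 3.4%".
        in_dev && /recovery/ {
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) { pct = $i; sub(/^=/, "", pct) }
        }
        END {
            if (!found)    { print "missing";         exit 2 }
            if (pct != "") { print "recovering " pct; exit 1 }
            if (degraded)  { print "degraded";        exit 1 }
            print "synced"
        }
    '
}
```

On the cell it might be run as: md_state md4 < /proc/mdstat. With the two sample listings above it would report "recovering 3.4%" during the rebuild and "synced" once the array shows [UU].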
# mdadm -D /dev/md4
/dev/md4:
Version : 0.90
Creation Time : Tue May 19 11:35:29 2015
Raid Level : raid1
Array Size : 499904 (488.27 MiB 511.90 MB)
Used Dev Size : 499904 (488.27 MiB 511.90 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 4
Persistence : Superblock is persistent
Update Time : Wed May 20 10:52:51 2015
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : df6e8567:c1b1ea55:04894333:532a878b
Events : 0.2317
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1

7. Check that the correct device files and labels have been created.
- Ensure both drives are seen:
# lsscsi
[6:0:0:0] disk ORACLE SSM PMAP /dev/sda
[7:0:0:0] disk ORACLE SSM PMAP /dev/sdb
- Ensure all device files exist:
# ls -l /dev/sd*
brw-rw---- 1 root disk 8, 0 May 19 15:39 /dev/sda
brw-rw---- 1 root disk 8, 1 May 19 15:39 /dev/sda1
brw-rw---- 1 root disk 8, 2 May 19 15:40 /dev/sda2
brw-rw---- 1 root disk 8, 16 May 19 15:39 /dev/sdb
brw-rw---- 1 root disk 8, 17 May 19 15:39 /dev/sdb1
brw-rw---- 1 root disk 8, 18 May 19 15:40 /dev/sdb2
Check drive labels:
# e2label /dev/md4
# e2label /dev/sdb1
# e2label /dev/sdb2
The check of partition /dev/sdb1 may fail due to bug 20765279; it may show the following:
# e2label /dev/sdb1
e2label: Bad magic number in super-block while trying to open /dev/sdb1
Couldn't find valid filesystem superblock.

There are two options for recovering the label on the drive.

a) Ensure the recovery of the mirror /dev/md4 has completed, then reboot the cell. This action will recover the missing label.
# shutdown -r now
When the cell has completed the reboot, log in and check the labels again using the e2label command; all should now be correct.

b) Alternatively, if the customer does not wish to reboot, follow this procedure:
# umount /boot
# mdadm --stop /dev/md4
# mdadm --assemble --scan
# mount /dev/md4 /boot
# e2label /dev/md4
EF Replacement when using image 12.1.2.2.0 or higher

1. Remove and replace the USB thumb drive in the internal USB port. There are two USB slots: drive /dev/sda is "USB:USBIN0:ORACLE SSM PMAP", the lower USB slot; drive /dev/sdb is "USB:USBIN1:ORACLE SSM PMAP", the upper slot. USBIN1 is not used for this version. On an Exadata Storage Server based on an Oracle Server X5-2L, the internal USB ports are located near the handle on the Rear I/O daughter board, between PCIe slots 3 and 4.

2. Replace the server's top cover and re-attach the AC power cords. ILOM will take up to 2 minutes to boot.

3. Slide the server back into the rack.

4. After ILOM has booted, log in to the ILOM and connect to the console. Do not connect using the ILOM web browser, as the console output with image version 12.1.2.1.0 and above is displayed on the CLI console.

To connect to the console through ILOM:
-> start /SP/console
5. Now power on the storage cell. As soon as the first BIOS display is visible, press Ctrl-P to select the boot pop-up menu. Eventually the boot menu will be displayed. Select one of the NVMe devices from the display to boot from; see the example below.
───────────────────────────────────
Please select boot device:
───────────────────────────────────
PCIE6:NVMe0:INTEL SSDPE2ME016T4S
PCIE6:NVMe1:INTEL SSDPE2ME016T4S
PXE:NET0:IBA XE Slot 3A00 v2320
Enter Setup
───────────────────────────────────
↑ and ↓ to move selection
ENTER to select boot device
ESC to boot using defaults
───────────────────────────────────
The cell will now boot. If the cell fails to boot from the NVMe device, or it was not possible to make the NVMe devices bootable prior to shutdown, refer to the document "Exadata Extreme Flash storage cell fails to boot from NVME when using image 12.1.2.2.0 or 12.1.2.2.1 (Doc ID 2108452.1)".

6. After the Storage Server is booted, log in as the 'root' user.

7. Run the following to copy the recovery image and configuration data to the new USB stick:
# cd /opt/oracle.SupportTools
# ./make_cellboot_usb -verbose -force
It may be necessary to stop the MS service to run this command:
# cellcli -e alter cell shutdown services MS
Remember to restart it once make_cellboot_usb has completed:
# cellcli -e alter cell startup services MS
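Because MS must be restarted even if the USB rebuild fails, the three commands above can be wrapped so that the restart is never skipped. This is a sketch only: the functions run and rebuild_cellboot_usb and the DRY_RUN variable are hypothetical, and the wrapper assumes MS does need to be stopped, which this note says is only sometimes the case.

```shell
# Hypothetical wrapper around the commands shown above.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_cellboot_usb() {
    # Stop MS first; give up if that fails, since nothing has been changed yet.
    run cellcli -e alter cell shutdown services MS || return 1

    # Rebuild the USB stick from the tools directory, remembering the result.
    usb_rc=0
    run sh -c 'cd /opt/oracle.SupportTools && ./make_cellboot_usb -verbose -force' || usb_rc=$?

    # Restart MS whether or not the rebuild succeeded.
    run cellcli -e alter cell startup services MS || return 1
    return "$usb_rc"
}
```

Running DRY_RUN=1 rebuild_cellboot_usb first prints the three commands in order without touching the cell, which allows the sequence to be reviewed with the customer before it is run as root.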
8. Set the next boot to forcibly stop at the BIOS setup menu:
# ipmitool chassis bootdev bios
Set Boot Device to bios
#

9. Reboot the server with the following command:
# shutdown -r now

10. Monitor the system booting again. The system should go automatically into the BIOS Setup screen.

11. Once the BIOS Setup screen is displayed on the console, use the arrow keys to navigate to the Boot screen. Check the "Legacy Boot Option Priority" list. Set "USB:USBIN0:ORACLE SSM PMAP" to be the first boot device, followed by "PCIE6:NVMe0:INTEL SSDPE2ME016T4S", then "PCIE6:NVMe1:INTEL SSDPE2ME016T4S" and finally "PXE:NET0:IBA XE Slot 3A00 v2320". Press "Esc" to exit the 'Boot Order Device Priority' screen. Refer to the X5 Series Servers Administration Guide for details.

12. Navigate to the Exit screen and select "Save Changes and Exit".

13. The server will boot. This time it should load the Exadata splash screen (grub) from the USB stick and indicate as such.
OBTAIN CUSTOMER ACCEPTANCE

The following steps should be done by the customer's administrator to return the disks to service:

1. Activate the grid disks:
# cellcli
…
CellCLI> alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...etc...

2. Issue the command below; all disks should show 'active':
CellCLI> list griddisk
DATA_CD_00_dmorlx8cel01 active
RECO_CD_02_dmorlx8cel01 active
...etc...

3. Verify that all grid disks have been successfully put online using the following command. Wait until 'asmmodestatus' is 'ONLINE' for all grid disks. The following is an example of the output early in the activation process.
CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
RECO_CD_01_dmorlx8cel01 active ONLINE Yes
...etc...
Notice in the above example that 'RECO_CD_00_dmorlx8cel01' is still in the 'SYNCING' process. Oracle ASM synchronization is only complete when ALL grid disks show ‘asmmodestatus=ONLINE’. This process can take some time depending on how busy the machine is, and has been while this individual server was down for repair.
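The wait for every grid disk to reach ONLINE can be checked mechanically. The sketch below is not an Oracle tool: griddisks_pending is a hypothetical name, and it assumes the four-column output of "list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome" is supplied on standard input.

```shell
# Hypothetical helper: reads the four-column griddisk listing on stdin and
# prints "<name> <asmmodestatus>" for every grid disk that is not yet ONLINE.
# Returns 0 only when at least one grid disk was listed and all are ONLINE.
griddisks_pending() {
    awk '
        NF >= 4 {
            seen++
            if ($3 != "ONLINE") { print $1, $3; pending++ }
        }
        END { if (seen == 0 || pending > 0) exit 1 }
    '
}
```

For the sample output above it would print "RECO_CD_00_dmorlx8cel01 SYNCING" and return a non-zero status, so it can be re-run at intervals until nothing is printed.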
On very rare occasions the above procedure has not worked with images 12.1.2.1.0 and 12.1.2.1.1. If this happens, refer to the following document and contact Exadata software support for assistance:
Exadata Storage Software 12.1.2.1.0 and 12.1.2.1.1 System Disk Replacement Issues (Doc ID 2003674.1)
7090170 - 8GB USB Stick
Oracle ILOM 3.2 documentation library - https://docs.oracle.com/cd/E37444_01/index.html

References
<NOTE:2108452.1> - Extreme Flash Storage Cell Fails to Boot from NVME When Using Image 12.1.2.2.0 or 12.1.2.2.1 on Exadata Platform

Attachments
This solution has no attachment