Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2011874.1
Update Date:2018-05-10
Keywords:

Solution Type  Technical Instruction Sure

Solution  2011874.1 :   How to Replace an Exadata X5-2, X6-2 Storage Server Internal USB drive  


Related Items
  • Oracle SuperCluster T5-8 Full Rack
  • Oracle SuperCluster M7 Hardware
  • Exadata SL6 Hardware
  • Zero Data Loss Recovery Appliance X6 Hardware
  • Oracle SuperCluster T5-8 Half Rack
  • Exadata X5-2 Eighth Rack
  • Exadata X5-2 Hardware
  • Exadata X5-2 Full Rack
  • Exadata X6-8 Hardware
  • Exadata X6-2 Hardware
  • Exadata X4-8 Hardware
  • Exadata X5-2 Quarter Rack
  • Zero Data Loss Recovery Appliance X5 Hardware
  • Exadata X5-2 Half Rack
  • Oracle SuperCluster M6-32 Hardware
  • Oracle SuperCluster T5-8 Hardware
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




Oracle Confidential PARTNER - Available to partners (SUN).
Reason: FRU for engineered system

Applies to:

Exadata X4-8 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

 How to Replace an Exadata X5-2, X6-2 Storage Server Internal USB drive.

Solution

 DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Exadata Trained


TIME ESTIMATE: 90 minutes

TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW: A faulty internal USB drive has been detected and needs to be replaced in an Exadata X5-2/X6-2 Storage cell.


WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The instructions below assume the customer DBA is available and working with the field engineer onsite to manage the host OS and DB/ASM services. They are provided here so that the FE has all the necessary steps available when onsite; the FE can perform them if the customer DBA requests, permits, or needs help with these steps.

1. Locate the server in the rack being serviced.

Turn on the locate indicator light for easier identification of the server being repaired. If the server number has been identified, the Locate button on the front panel may be pressed. To turn the light on remotely, use any of the following methods:

From a login to CellCLI:

CellCLI> alter cell led on

 From a login to the server’s ILOM:

-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'

From a login to the server’s ‘root’ account:

# ipmitool sunoem cli 'set /SYS/LOCATE value=Fast_Blink'
Connected. Use ^D to exit.
-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'

-> Session closed
Disconnected

 

2. Shut down the node for which the USB stick requires replacement.

a. For Extended information on this section check MOS Note:
ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM

 

This is also documented in the Exadata Database Maintenance Guide chapter 3 section titled "Maintaining Exadata Storage Servers" subsection "Shutting Down Exadata Storage Server" available on the customer's cell server image in the /opt/oracle/cell/doc directory.

Exadata Maintenance Guide Documentation is available internally here:

http://amomv0115.us.oracle.com/archive/cd_ns/E50790_01/doc/doc.121/e51951/storage.htm#DBMMN22021

 

In the following examples the SQL commands should be run by the customer's DBA prior to doing the hardware replacement. The field engineer should run them only if the customer directs them to, or is unable to run them. The cellcli commands must be run as root.

Note the following when powering off Exadata Storage Servers:

  • Verify there are no other storage servers with disk faults. Shutting down a storage server while another disk has failed may cause the running database processes and Oracle ASM to crash if both disks in a partner pair are lost when this server’s disks go offline.

  • Powering off one Exadata Storage Server with no disk faults in the rest of the rack will not affect running database processes or Oracle ASM.

  • All database and Oracle Clusterware processes should be shut down prior to shutting down more than one Exadata Storage Server. Refer to the Exadata Owner’s Guide for details if this is necessary.

b. ASM drops offline disks once the ASM disk repair timer expires. Powering off or restarting an Exadata Storage Server can therefore impact database performance if the server stays offline for longer than that timer. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but may have been changed by the customer. To check this parameter, have the customer log in to ASM and run the following query:

SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

As long as the value is large enough to comfortably replace the components being replaced, then there is no need to change it.
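As a rough sketch of that comparison, the shell function below converts a DISK_REPAIR_TIME value as returned by the query into whole minutes so it can be compared with the planned outage. The function name is hypothetical, the 90-minute figure is simply the time estimate from the dispatch instructions, and the handling of the unit suffix (h or m, with a bare number treated as hours) is an assumption about the value format rather than something taken from this procedure.

```shell
# Hypothetical helper: convert a DISK_REPAIR_TIME value such as "3.6h" or
# "30m" into whole minutes. Assumes an h/H or m/M suffix; a bare number is
# treated as hours.
repair_time_minutes() {
    printf '%s\n' "$1" | awk '{
        value = $1
        unit = tolower(substr(value, length(value), 1))
        if (unit == "h" || unit == "m") {
            value = substr(value, 1, length(value) - 1)
        } else {
            unit = "h"
        }
        minutes = (unit == "h") ? value * 60 : value + 0
        # Round to the nearest whole minute.
        printf "%d\n", int(minutes + 0.5)
    }'
}

# Example: compare the timer with a planned outage given in minutes.
planned=90
available=$(repair_time_minutes "3.6h")
if [ "$available" -gt "$planned" ]; then
    echo "disk_repair_time (${available} min) covers the planned ${planned} min outage"
else
    echo "disk_repair_time (${available} min) is too short; discuss with the customer DBA"
fi
```

The decision to change the attribute remains with the customer DBA; the sketch only makes the arithmetic explicit.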

c. Check if ASM will be OK if the grid disks go OFFLINE.

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...sample ...
DATA_CD_09_cel01 ONLINE Yes
RECO_CD_01_cel01 ONLINE Yes
...repeated for all griddisks....

If one or more disks return asmdeactivationoutcome='No', wait for some time and repeat step c. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.
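With many grid disks, the check in step c can be easier to read if only the problem disks are listed. The following is a minimal sketch, assuming the three-column output shown in the sample; the function name unsafe_griddisks is hypothetical, and anything other than Yes in the last column is treated as not safe.

```shell
# Hypothetical helper: read "name asmmodestatus asmdeactivationoutcome" lines
# on stdin and print the name of every grid disk whose last column is not
# "Yes". Exit status is 0 only when every disk is safe to deactivate.
unsafe_griddisks() {
    awk 'NF >= 3 && $NF != "Yes" { print $1; bad = 1 } END { exit bad }'
}

# Intended use on a cell, as root (not run here):
#   cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome | unsafe_griddisks
sample='DATA_CD_09_cel01 ONLINE Yes
RECO_CD_01_cel01 ONLINE No'
printf '%s\n' "$sample" | unsafe_griddisks || echo "wait and re-check before continuing"
```

An empty listing with exit status 0 corresponds to the condition for proceeding to the next step.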

d. Run the cellcli command to inactivate all grid disks on the cell that needs to be powered down for maintenance (this can take 10 minutes or longer).

# cellcli
...sample ...
CellCLI> ALTER GRIDDISK ALL INACTIVE
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered 
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...repeated for all griddisks...

 e. Execute the command below and the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome=Yes for all griddisks once the disks are offline and inactive in ASM.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_02_dmorlx8cel01 inactive OFFLINE Yes
...repeated for all griddisks...

f. Before shutting down, make a note of the active image version.

In this example it is 12.1.2.2.0. This will be required later if replacing the USB drive in an Extreme Flash storage cell.

# imageinfo

Kernel version: 2.6.39-400.264.1.el6uek.x86_64 #1 SMP Wed Aug 26 16:42:25 PDT 2015 x86_64
Cell version: OSS_12.1.2.2.0_LINUX.X64_150917
Cell rpm version: cell-12.1.2.2.0_LINUX.X64_150917-1.x86_64

Active image version: 12.1.2.2.0.150917
Active image activated: 2015-09-25 11:40:51 +0100
Active image status: success
Active system partition on device: /dev/md5
Active software partition on device: /dev/md7

g. For Extreme Flash systems ONLY, which use NVME drives in place of hard disks: if the image is 12.1.2.2.0 or 12.1.2.2.1, the NVME devices must be made bootable before shutting down. This step is not required for High Capacity storage cells; for High Capacity, proceed to step "h".

This step is needed because of bug 22620662, EXTREME FLASH CELLS FAIL TO BOOT FROM THE NVME SYSTEM DISKS.

This step is not necessary if the extreme flash image version is 12.1.2.1.x or 12.1.2.3.0 or higher.
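The version rule above can be sketched as a small shell test. This is illustrative only: the function name needs_nvme_grub_fix is hypothetical, it applies to Extreme Flash cells only, and it assumes the version string has the form shown in the imageinfo example (release followed by a build date).

```shell
# Hypothetical helper: succeed (exit 0) only for the two Extreme Flash image
# versions named above as needing the NVME grub step. Takes the "Active image
# version" string, e.g. 12.1.2.2.0.150917, and ignores the trailing build date.
needs_nvme_grub_fix() {
    case "$1" in
        12.1.2.2.0|12.1.2.2.0.*|12.1.2.2.1|12.1.2.2.1.*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a cell the version could be read from the imageinfo output shown in
# step f, for example:
#   imageinfo | awk -F': ' '/^Active image version/ {print $2}'
version="12.1.2.2.0.150917"
if needs_nvme_grub_fix "$version"; then
    echo "$version: make the NVME devices bootable before shutdown"
else
    echo "$version: NVME grub step not required"
fi
```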

i) For 12.1.2.2.0 or 12.1.2.2.1 ONLY type the following:

# cat << GRUB_INSTALL > /root/device.map
> (hd0) /dev/nvme0n1
> (hd1) /dev/nvme1n1
> GRUB_INSTALL

 This creates the file device.map, which is used in the next step and kept for reference if required.

ii) Make the NVME devices bootable. Type the following:

# /sbin/grub --device-map=/root/device.map << CELL_GRUB_INSTALL
> root (hd0,0)
> setup (hd0)
> root (hd1,0)
> setup (hd1)
> quit
> CELL_GRUB_INSTALL

This is an example of the expected output:

# /sbin/grub --device-map=/root/device.map << CELL_GRUB_INSTALL
> root (hd0,0)
> setup (hd0)
> root (hd1,0)
> setup (hd1)
> quit
> CELL_GRUB_INSTALL

GNU GRUB version 0.97 (640K lower / 3072K upper memory)

[ Minimal BASH-like line editing is supported. For the first word, TAB
lists possible command completions. Anywhere else TAB lists the possible
completions of a device/filename.]
grub> root (hd0,0)
Filesystem type is ext2fs, partition type 0x83
grub> setup (hd0)
Checking if "/boot/grub/stage1" exists... no
Checking if "/grub/stage1" exists... yes
Checking if "/grub/stage2" exists... yes
Checking if "/grub/e2fs_stage1_5" exists... yes
Running "embed /grub/e2fs_stage1_5 (hd0)"... failed (this is not fatal)
Running "embed /grub/e2fs_stage1_5 (hd0,0)"... failed (this is not fatal)
Running "install /grub/stage1 (hd0) /grub/stage2 p /grub/grub.conf "... succeeded
Done.
grub> root (hd1,0)
Filesystem type is ext2fs, partition type 0x83
grub> setup (hd1)
Checking if "/boot/grub/stage1" exists... no
Checking if "/grub/stage1" exists... yes
Checking if "/grub/stage2" exists... yes
Checking if "/grub/e2fs_stage1_5" exists... yes
Running "embed /grub/e2fs_stage1_5 (hd1)"... failed (this is not fatal)
Running "embed /grub/e2fs_stage1_5 (hd1,0)"... failed (this is not fatal)
Running "install /grub/stage1 (hd1) /grub/stage2 p /grub/grub.conf "... succeeded
Done.
grub> quit

If the above command fails with "Error 22: No such partition", continue to shut down the cell; the device can still be replaced and made bootable using the diagnostics ISO image. See the later instructions in this action plan.

h. Once all disks are offline and inactive, the customer may shut down the cell using the following command:

# shutdown -hP now

When powering off Exadata Storage Servers, all storage services are automatically stopped.

 

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

1. Slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arms. Take care to ensure the cables and Cable Management Arm (CMA) are moving properly. Refer to Note 1444683.1 for CMA handling training.

2. Remove the AC power cords prior to removing the server’s top cover.

For High Capacity Storage cell with a single internal USB drive follow "High Capacity configuration".
For Extreme Flash storage cell with two internal USB drives follow the later section "Extreme Flash Configuration"

High Capacity Configuration

1. Remove and replace the USB thumb drive in the internal USB port. Make a note of which slot the USB drive is inserted into; there are two USB slots.

   On an Exadata Storage Server based on an Oracle Server X5-2L, the internal USB ports are located near the handle on the rear I/O daughter board, between PCIe slots 3 and 4.

2. Replace the server’s top cover and re-attach the AC power cords. ILOM will take up to 2 minutes to boot.

3. Slide the server back into the rack.

4. After ILOM has booted, power on the server by pressing the power button, and then connect to the server’s console. 

Do not connect using the ILOM web browser as the console output with image version 12.1.2.1.0 and above is displayed to the CLI console.

 To connect to the console through ILOM:

- From the ILOM CLI:

-> start /SP/console

5. From the console, monitor the system booting. The server should boot from the primary hard disk. This will be mentioned in the Exadata splash screen.

6. After the Storage Server is booted, login as ‘root’ user.

7. Run the following to copy the recovery image and configuration data to the new USB stick:

# cd /opt/oracle.SupportTools
# ./make_cellboot_usb -verbose -force

 

Ignore any messages such as the following; they do not prevent the action from completing: WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.

Note: The message above may not be present with image 12.2.1.1.0 or above.

It may be required to stop the MS service to run this command.
cellcli -e alter cell shutdown services MS

Remember to re-enable this once the make_cellboot_usb has completed.
cellcli -e alter cell startup services MS

 

8. Set the next boot to forcibly stop at the BIOS setup menu:

# ipmitool chassis bootdev bios
Set Boot Device to bios
#

 9. Reboot the server with the following command:

# shutdown -r now

 10. Monitor the system booting again. The system should go automatically into the BIOS Setup screen.

11. Once the BIOS Setup screen is displayed on the console, use the arrow keys to navigate to the Boot screen. Check the "Legacy Boot Option Priority" list. Set "USB:USBIN0:ORACLE SSM PMAP" to be the first boot device, followed by “PCI RAID Adapter”, followed by the onboard network PXE devices. Press “Esc” to exit the ‘Boot Order Device Priority’ screen.

Refer to X5 Series Servers Administration Guide for details.

12. Navigate to the Exit screen and select “Save Changes and Exit”

13. The server will boot. This time it should load the Exadata splash screen (grub) from the USB stick and indicate as such.

Extreme Flash (EF) Configuration

With image 12.1.2.2.0 and higher, only one USB drive, /dev/sda "USB:USBIN0:ORACLE SSM PMAP" (the lower USB slot), is used for the grub boot loader. USBIN1 is no longer mirrored and no longer contains the grub boot loader. In addition, with 12.1.2.2.0 and higher it is now possible to boot from the NVME devices.

If the image is 12.1.2.1.x (where x = 0, 1, 2 or 3) proceed as follows.
If the image is 12.1.2.2.0 or higher go down to the section "EF Replacement when using image 12.1.2.2.0 or higher"

 

EF Replacement when image is 12.1.2.1.x

1. Remove and replace the USB thumb drive in the internal USB port. There are two USB slots: drive /dev/sda is "USB:USBIN0:ORACLE SSM PMAP", the lower USB slot, and drive /dev/sdb is "USB:USBIN1:ORACLE SSM PMAP", the upper slot.

    On an Exadata Storage Server based on an Oracle Server X5-2L, the internal USB ports are located near the handle on the rear I/O daughter board, between PCIe slots 3 and 4.

2. Replace the server’s top cover and re-attach the AC power cords. ILOM will take up to 2 minutes to boot.

3. Slide the server back into the rack.

4. After ILOM has booted, log in to the ILOM and connect to the console.

Do not connect using the ILOM web browser as the console output with image version 12.1.2.1.0 and above is displayed to the CLI console.

 To connect to the console through ILOM:

- From the ILOM CLI:

-> start /SP/console

 5. From the console, monitor the system booting. It will boot from the good USB drive.

6. After the Storage Server is booted, log in as the ‘root’ user; this will be the first session. Then open a second login session by connecting to the storage cell over ssh as user root; this will be the second session.

The following process is further documented in the attachment to this document, logfile_and_notes.pdf. The entire process takes approximately one hour to complete and must not be interrupted. Do not attempt to manually recover the USB drive. Leave the storage cell to recover the device automatically.

The attached logfile shows timings as an example of a device replacement.

 

The cell will now automatically rebuild the new USB drive; this is performed by the "checkdeveachboot" validation script. To view the rebuild activity, use the second login session and issue the following command.

Note: you may need to wait at least 5 minutes after login before this log file is created. Once it exists, use the following to watch the rebuild activity.

tail -f /var/log/cellos/checkdeveachboot.log

The rebuild has started when events such as the following are displayed in the log file:

[1432042135][2015-05-19 14:28:55 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]
[1432042135][2015-05-19 14:28:55 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  /opt/oracle.cellos/validations/init.d/checkdeveachboot started at 2015-05-19 14:28:55 +0100
[1432042135][2015-05-19 14:28:55 +0100][WARNING][/opt/oracle.cellos/imageLogger][imageLogger_init][]  Init string is pre-initalized while calling imageLogger_init from source in /usr/local/bin/imageinfo at line 24
[1432042135][2015-05-19 14:28:57 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  Set lock /var/log/exadatatmp/cellos/locks/12410.checkdeveachboot
[1432042135][2015-05-19 14:28:57 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  Fix mode for usb,md is in use
[1432042135][2015-05-19 14:28:57 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  Check and fix system disks for the cell node
[1432042135][2015-05-19 14:28:57 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  nvme controller has disks that are mapped to: /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1
[1432042135][2015-05-19 14:28:57 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  State of md devices is in /var/log/cellos/checkdeveachboot.proc_mdstat.log

.....

The script will show errors such as the following. This is because the device files /dev/sdX1 and /dev/sdX2 (where X = a or b) have not yet been created. These will be created later by the script.

This example shows /dev/sdb1 missing when first checked.

[1432042135][2015-05-19 14:29:00 +0100][WARNING][0-0][/opt/oracle.cellos/image_functions][cmd_retry][]  Failed to run mdadm /dev/md4 --add /dev/sdb1. Retry
[1432042135][2015-05-19 14:29:01 +0100][WARNING][0-0][/opt/oracle.cellos/image_functions][cmd_retry][]  Failed to run mdadm /dev/md4 --add /dev/sdb1. Retry
[1432042135][2015-05-19 14:29:05 +0100][ERROR][0-0][/opt/oracle.cellos/image_functions][cmd_retry][][DISPLAY]  Unable to run mdadm /dev/md4 --add /dev/sdb1

Approximately 10 minutes after the validation script has started it will appear to have stopped, and the following will be displayed in the log file. The script will then sleep for at least 10 minutes before resuming and again rebuilding the new device.

[1432042135][2015-05-19 14:37:29 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  CELLBOOT USB is fixable
[1432042135][2015-05-19 14:37:29 +0100][WARNING][/opt/oracle.cellos/imageLogger][imageLogger_init][]  Init string is pre-initalized while calling imageLogger_init from source in /opt/oracle.cellos/restore_cellboot_usb.sh at line 32

Further warnings are displayed

[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  Reset lock /var/log/exadatatmp/cellos/locks/12410.checkdeveachboot
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  #^#^# [WARNING] [MD] 4 Device /dev/md4 is either in degraded state or stopped
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  #^#^# [ERROR] [SYSDISK] 27 One or more md devices in degraded state or stopped
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  #^#^# [ERROR] [SYSDISK] 22 /dev/sdb1 boot partition does not have BOOT label
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  #^#^# [ERROR] [USB] 25 CELLBOOT USB DEVICE USB1 not found
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  #^#^# [INFO] [USB] 28 CELLBOOT USB fixed
[1432042135][2015-05-19 14:47:18 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  0:138412033:301989893:16

The script will eventually complete with the following event. This event will be displayed twice before the script halts.

[1432044051][2015-05-19 15:00:58 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  USB partition /dev/sdb2 is mountable
[1432044051][2015-05-19 15:00:59 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  [INFO] mount_dev: Mount device. Cmd: mount  /dev/sdb2 /mnt/usb.check.dev.each.boot
[1432044051][2015-05-19 15:01:00 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  USB has version 12.1.2.1.0.141206.1
[1432044051][2015-05-19 15:01:00 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  Reset lock /var/log/exadatatmp/cellos/locks/20512.checkdeveachboot
[1432044051][2015-05-19 15:01:00 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  #^#^# [ERROR] [SYSDISK] 22 /dev/sdb1 boot partition does not have BOOT label
[1432044051][2015-05-19 15:01:00 +0100][INFO][/opt/oracle.cellos/image_functions][imlog_msg][]  0:4194305:0:0

At this point the mirror of the two USB devices, /dev/md4, will be recovering; this will take approximately 35 minutes. Progress of the recovery can be viewed by examining the file /proc/mdstat from the first console session:

# cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb1[2] sda1[0]
      499904 blocks [2/1] [U_]
      [>....................]  recovery =  3.4% (17408/499904) finish=30.8min speed=260K/sec

The recovery can also be viewed with mdadm

# mdadm -D /dev/md4
/dev/md4:
        Version : 0.90
  Creation Time : Thu Apr 30 10:26:07 2015
     Raid Level : raid1
     Array Size : 499904 (488.27 MiB 511.90 MB)
  Used Dev Size : 499904 (488.27 MiB 511.90 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 4
    Persistence : Superblock is persistent

    Update Time : Fri May  8 09:07:27 2015
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 1% complete

           UUID : 1a7aafda:114252b2:04894333:532a878b
         Events : 0.252

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       2       8       17        1      spare rebuilding   /dev/sdb1

Wait for the recovery to complete; it will show the following:

# cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sdb1[1] sda1[0]
      499904 blocks [2/2] [UU]

 

# mdadm -D /dev/md4
/dev/md4:
        Version : 0.90
  Creation Time : Tue May 19 11:35:29 2015
     Raid Level : raid1
     Array Size : 499904 (488.27 MiB 511.90 MB)
  Used Dev Size : 499904 (488.27 MiB 511.90 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 4
    Persistence : Superblock is persistent

    Update Time : Wed May 20 10:52:51 2015
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : df6e8567:c1b1ea55:04894333:532a878b
         Events : 0.2317

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
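Rather than re-reading /proc/mdstat by hand, the completed state shown above can be detected with a small read-only check. This is a sketch under assumptions: the function name md_in_sync is hypothetical, and it relies on the status line directly following the array header and ending in a field such as [UU], as in the examples above. It only reads the file and does not attempt any manual recovery.

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and succeed only
# when the named array reports every member up, e.g. "[2/2] [UU]".
md_in_sync() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { found = 1; next }   # array header line
        found {
            # The first line after the header carries the [n/m] [UU] status.
            status = $NF
            ok = (status ~ /^\[U+\]$/)
            exit
        }
        END { exit (found && ok) ? 0 : 1 }
    '
}

# Intended use on the cell (not run here):
#   until md_in_sync md4 < /proc/mdstat; do sleep 60; done
sample='Personalities : [raid1]
md4 : active raid1 sdb1[2] sda1[0]
      499904 blocks [2/1] [U_]
      [>....................]  recovery =  3.4% (17408/499904) finish=30.8min speed=260K/sec'
if printf '%s\n' "$sample" | md_in_sync md4; then
    echo "md4 is in sync"
else
    echo "md4 is still recovering"
fi
```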

7. Check that the correct device files and labels have been created.

- ensure both drives are seen.

# lsscsi
[6:0:0:0]    disk    ORACLE   SSM              PMAP  /dev/sda
[7:0:0:0]    disk    ORACLE   SSM              PMAP  /dev/sdb

- ensure all device files exist

# ls -l /dev/sd*
brw-rw---- 1 root disk 8,  0 May 19 15:39 /dev/sda
brw-rw---- 1 root disk 8,  1 May 19 15:39 /dev/sda1
brw-rw---- 1 root disk 8,  2 May 19 15:40 /dev/sda2
brw-rw---- 1 root disk 8, 16 May 19 15:39 /dev/sdb
brw-rw---- 1 root disk 8, 17 May 19 15:39 /dev/sdb1
brw-rw---- 1 root disk 8, 18 May 19 15:40 /dev/sdb2

 

Check drive labels.

# e2label /dev/md4
BOOT

# e2label /dev/sdb1
BOOT

# e2label /dev/sdb2
CELLBOOT

 The label check on partition /dev/sdb1 may fail due to bug 20765279; it may show the following:

 # e2label /dev/sdb1
e2label: Bad magic number in super-block while trying to open /dev/sdb1
Couldn't find valid filesystem superblock.

There are now two options for recovering the label on the drive.

a) Ensure the recovery of the mirror /dev/md4 has completed, then reboot the cell. This action will recover the missing label.

# shutdown -r now

When the cell has completed the reboot, log in and check the labels again using the e2label command; all should now be correct.

Or, if the customer does not wish to reboot:

b) Follow this procedure:

# umount /boot

# mdadm --stop /dev/md4

# mdadm --assemble --scan

 # mount /dev/md4 /boot

# e2label /dev/md4
BOOT
# e2label /dev/sdb1
BOOT
# e2label /dev/sdb2
CELLBOOT

 

 

EF Replacement when using image 12.1.2.2.0 or higher

1. Remove and replace the USB thumb drive in the internal USB port. There are two USB slots: drive /dev/sda is "USB:USBIN0:ORACLE SSM PMAP", the lower USB slot, and drive /dev/sdb is "USB:USBIN1:ORACLE SSM PMAP", the upper slot.

USBIN1 is not used for this version.

On an Exadata Storage Server based on an Oracle Server X5-2L, the internal USB ports are located near the handle on the rear I/O daughter board, between PCIe slots 3 and 4.

 2. Replace the server’s top cover and re-attach the AC power cords. ILOM will take up to 2 minutes to boot.

3. Slide the server back into the rack.

 4. After ILOM has booted, log in to the ILOM and connect to the console.

Do not connect using the ILOM web browser as the console output with image version 12.1.2.1.0 and above is displayed to the CLI console.

 To connect to the console through ILOM:

-> start /SP/console

 5. Now power on the storage cell. As soon as the first BIOS display is visible, press Ctrl-P to select the boot pop-up menu. Eventually the boot menu will be displayed.

Select one of the NVME devices from the display to boot from; see the example below.

───────────────────────────────────
Please select boot device:
───────────────────────────────────
PCIE6:NVMe0:INTEL SSDPE2ME016T4S
PCIE6:NVMe1:INTEL SSDPE2ME016T4S
PXE:NET0:IBA XE Slot 3A00 v2320
Enter Setup
───────────────────────────────────
↑ and ↓ to move selection
ENTER to select boot device
ESC to boot using defaults
───────────────────────────────────

The cell will now boot.

If the cell fails to boot from the NVME device, or it was not possible to make the NVME devices bootable prior to shutdown, refer to the document "Exadata Extreme Flash storage cell fails to boot from NVME when using image 12.1.2.2.0 or 12.1.2.2.1 (Doc ID 2108452.1)".

6. After the Storage Server is booted, login as ‘root’ user.

7. Run the following to copy the recovery image and configuration data to the new USB stick:

# cd /opt/oracle.SupportTools
# ./make_cellboot_usb -verbose -force

It may be required to stop the MS service to run this command.

cellcli -e alter cell shutdown services MS

Remember to re-enable this once the make_cellboot_usb has completed.

cellcli -e alter cell startup services MS

8. Set the next boot to forcibly stop at the BIOS setup menu:

# ipmitool chassis bootdev bios
Set Boot Device to bios
#

9. Reboot the server with the following command:

# shutdown -r now

10. Monitor the system booting again. The system should go automatically into the BIOS Setup screen.

11. Once the BIOS Setup screen is displayed on the console, use the arrow keys to navigate to the Boot screen. Check the "Legacy Boot Option Priority" list. Set "USB:USBIN0:ORACLE SSM PMAP" to be the first boot device, followed by “PCIE6:NVMe0:INTEL SSDPE2ME016T4S”, then "PCIE6:NVMe1:INTEL SSDPE2ME016T4S", and finally "PXE:NET0:IBA XE Slot 3A00 v2320". Press “Esc” to exit the ‘Boot Order Device Priority’ screen.

Refer to X5 Series Servers Administration Guide for details.

12. Navigate to the Exit screen and select “Save Changes and Exit”

13. The server will boot. This time it should load the Exadata splash screen (grub) from the USB stick and indicate as such.

 

OBTAIN CUSTOMER ACCEPTANCE
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE?:

The following steps should be done by the customer's administrator to return the disks to service:

1. Activate the grid disks:

# cellcli

    …    
CellCLI> alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_02_dmorlx8cel01 successfully altered
...etc...

 2. Issue the command below and all disks should show 'active':

CellCLI> list griddisk
DATA_CD_00_dmorlx8cel01         active
RECO_CD_02_dmorlx8cel01         active
...etc...

 3. Verify all grid disks have been successfully put online using the following command. Wait until 'asmmodestatus' is in status 'ONLINE' for all grid disks. The following is an example of the output early in the activation process.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
RECO_CD_01_dmorlx8cel01 active ONLINE Yes
...etc...

 

Notice in the above example that 'RECO_CD_00_dmorlx8cel01' is still in the 'SYNCING'  process. Oracle ASM synchronization is only complete when ALL grid disks show ‘asmmodestatus=ONLINE’.  This process can take some time depending on how busy the machine is, and has been while this individual server was down for repair.
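To see at a glance which grid disks are still resynchronizing, the four-column listing above can be filtered as sketched below. The function name griddisks_not_online is hypothetical, and the column order is assumed to match the command shown in step 3.

```shell
# Hypothetical helper: read "name status asmmodestatus asmdeactivationoutcome"
# lines on stdin and print each grid disk that is not yet ONLINE together
# with its current asmmodestatus. Exit status is 0 only when all are ONLINE.
griddisks_not_online() {
    awk 'NF >= 3 && $3 != "ONLINE" { print $1, $3; pending = 1 } END { exit pending }'
}

# Intended use on a cell, as root (not run here):
#   cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome | griddisks_not_online
sample='DATA_CD_00_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
RECO_CD_01_dmorlx8cel01 active ONLINE Yes'
printf '%s\n' "$sample" | griddisks_not_online || echo "ASM synchronization still in progress"
```

Repeating the check until nothing is listed corresponds to waiting for all grid disks to show asmmodestatus=ONLINE.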

On very rare occasions the above procedure has been found not to work with images 12.1.2.1.0 and 12.1.2.1.1. If this happens, refer to the following document and contact Exadata software support for assistance.

Exadata Storage Software 12.1.2.1.0 and 12.1.2.1.1 System Disk Replacement Issues (Doc ID 2003674.1)

 


PARTS NOTE:

7090170 - 8GB USB Stick


REFERENCE INFORMATION:

Oracle ILOM 3.2 documentation library - https://docs.oracle.com/cd/E37444_01/index.html

References

<NOTE:2108452.1> - Extreme Flash Storage Cell Fails to Boot from NVME When Using Image 12.1.2.2.0 or 12.1.2.2.1 on Exadata Platform

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.