Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1982342.1
Update Date: 2018-04-09
Keywords:

Solution Type: Technical Instruction

Solution  1982342.1 :   How to Shutdown and Startup Exadata Compute Nodes and Storage Cells (including SuperCluster) When Performing Hardware Maintenance (X5 and later)  


Related Items
  • Oracle SuperCluster T5-8 Full Rack
  • Zero Data Loss Recovery Appliance X6 Hardware
  • Exadata X7-2 Hardware
  • Oracle SuperCluster T5-8 Half Rack
  • Exadata X5-2 Eighth Rack
  • Exadata X6-2 Hardware
  • Exadata X6-8 Hardware
  • Exadata X5-2 Hardware
  • Exadata X5-2 Full Rack
  • Exadata X4-8 Hardware
  • Exadata X5-2 Quarter Rack
  • Zero Data Loss Recovery Appliance X5 Hardware
  • Exadata X5-2 Half Rack
  • Oracle SuperCluster T5-8 Hardware
  • Oracle SuperCluster M6-32 Hardware
Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: EXADATA




In this Document
Goal
Solution
References


Applies to:

Zero Data Loss Recovery Appliance X6 Hardware - Version All Versions and later
Exadata X6-2 Hardware - Version All Versions and later
Exadata X6-8 Hardware - Version All Versions and later
Exadata X7-2 Hardware - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Information in this document applies to any platform.

Goal

This document provides a guide for shutting down and starting up Exadata compute nodes and storage cells (including SuperCluster) when performing hardware maintenance, for models running 12.1.x.x.x or later images as shipped on the X5-2 and later.

Solution

 Shutdown of Storage Cells

For extended information on this section, see MOS Note:
Doc ID 1188080.1 - Steps to shut down or reboot an Exadata storage cell without affecting ASM
Where noted, the SQL, CellCLI and 'root' commands should be run by the customer's DBA, unless the customer provides login access to the Field Engineer.

 

NOTE: If the Exadata X5 cell is to be shut down for maintenance because of a suspected bad flash device, or when changing flashcachemode from WB (write-back) to non-WB (write-through), then the flash cache must be flushed first. Please refer to Doc ID 1500257.1 for details.

This is not required for normal shutdown of the cell.
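
Where a flush is required, a minimal sketch of the flush and its verification is shown below (see Doc ID 1500257.1 for the authoritative steps before relying on this):

CellCLI> ALTER FLASHCACHE ALL FLUSH
CellCLI> LIST CELLDISK ATTRIBUTES name, flushstatus, flusherror

Wait until flushstatus reports 'completed' and flusherror is empty for every flash cell disk before continuing.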

 


Customer Activity:

1. ASM drops a disk shortly after it is taken offline. The default DISK_REPAIR_TIME
attribute value of 3.6 hours should be adequate for replacing components, but it may have been
changed by the customer. To check this parameter, have the customer log into ASM and
perform the following query:

SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg
where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

As long as the value is large enough to comfortably replace the hardware in a
storage cell, there is no need to change it.
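
If the customer does need a longer repair window, the attribute can be raised per disk group before the maintenance. A minimal example is shown below; the disk group name DATA and the value 8.5h are illustrative only and should be agreed with the customer DBA:

SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';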


2. Check if ASM will be OK if the grid disks go OFFLINE.

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...snipped...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
etc....

If one or more disks return asmdeactivationoutcome='No', you should wait for some time
and repeat the query until all disks return asmdeactivationoutcome='Yes'.

NOTE: Taking the storage server offline while one or more disks return a status of
asmdeactivationoutcome='No' will cause Oracle ASM to dismount the affected disk
group, causing the databases to shut down abruptly.
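
As a convenience, the check can be repeated in a loop from the cell until no grid disk reports 'No' (a hedged sketch, not part of the standard procedure; it prints the disks that are still blocking deactivation and exits once all return 'Yes'):

# while cellcli -e "list griddisk attributes name,asmdeactivationoutcome" | grep -v Yes; do sleep 60; done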

3. Run the CellCLI command to inactivate all grid disks on the cell you wish to power down or reboot
(this can take 10 minutes or longer):

# cellcli
...
CellCLI> ALTER GRIDDISK ALL INACTIVE
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
...etc...

4. Execute the command below; once the disks are offline and inactive in ASM, the output should
show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
...etc...

 

Customer or Service Engineer Activity:

You can now shut down the cell using the following command:

# shutdown -hP now

Disconnect the power cords before opening the top of the server.

 

Start-up of Storage Cells

Service Engineer Activity:

To start the cells, begin by booting the server. Once the power cords have been re-attached and
the ILOM has booted, you will see a slow blink on the server's green LED. Power on the
server by pressing the power button on the front of the unit.
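
If preferred, the server can instead be powered on from the ILOM command line after logging in to the service processor over SSH (a minimal sketch; the power target is /System on recent ILOM firmware and /SYS on older releases):

-> start /System
Are you sure you want to start /System (y/n)? y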


Customer or Service Engineer Activity:


1. Connect to the console graphics screen and monitor the boot; it may also be necessary to connect to the ILOM CLI console.

2. Before activating the cell disks, verify that all are visible to the server.

High Capacity (HC) systems only

Verify that the 12 disks are visible. The following command should show 12 disks:

# lsscsi | grep -i LSI
[0:2:0:0]    disk    LSI      MR9361-8i        4.23  /dev/sda
[0:2:1:0]    disk    LSI      MR9361-8i        4.23  /dev/sdb
[0:2:2:0]    disk    LSI      MR9361-8i        4.23  /dev/sdc
[0:2:3:0]    disk    LSI      MR9361-8i        4.23  /dev/sdd
[0:2:4:0]    disk    LSI      MR9361-8i        4.23  /dev/sde
[0:2:5:0]    disk    LSI      MR9361-8i        4.23  /dev/sdf
[0:2:6:0]    disk    LSI      MR9361-8i        4.23  /dev/sdg
[0:2:7:0]    disk    LSI      MR9361-8i        4.23  /dev/sdh
[0:2:8:0]    disk    LSI      MR9361-8i        4.23  /dev/sdi
[0:2:9:0]    disk    LSI      MR9361-8i        4.23  /dev/sdj
[0:2:10:0]   disk    LSI      MR9361-8i        4.23  /dev/sdk
[0:2:11:0]   disk    LSI      MR9361-8i        4.23  /dev/sdl

Ensure the NVME devices are visible:

# ls -l /dev | grep nvme | grep brw
brw-rw----  1 root disk    259,   0 Feb 20 13:43 nvme0n1
brw-rw----  1 root disk    259,   1 Feb 20 13:43 nvme1n1
brw-rw----  1 root disk    259,   2 Feb 20 13:43 nvme2n1
brw-rw----  1 root disk    259,   3 Feb 20 13:43 nvme3n1

We expect 4 NVME block device entries to be shown above.

If the device count is not correct, the server should be re-opened and the
device connections checked to ensure they are secure BEFORE the following CellCLI
commands are issued.
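
As a quick sanity check of the counts (a hedged convenience, equivalent to counting the listings above by eye; expected values on an HC cell are 12 and 4):

# lsscsi | grep -ic lsi
12
# ls /dev/nvme?n1 | wc -l
4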

Verify that the 4 Sun F160 Flash AIC cards are visible to the cell’s software stack:

# cellcli -e "list lun where disktype=flashdisk"
         1_1     1_1     normal
         2_1     2_1     normal
         4_1     4_1     normal
         5_1     5_1     normal

If any LUN reports a status other than "normal", it must be resolved before proceeding.

Confirm the AIC card health status:

# nvmecli --identify --all | grep -i ind
Health Indicator                       : Healthy
Health Indicator                       : Healthy
Health Indicator                       : Healthy
Health Indicator                       : Healthy

 
Extreme Flash (EF) storage cells only

Verify that the 2 boot USB "disks" are visible and online:

# lsscsi
[6:0:0:0]    disk    ORACLE   SSM              PMAP  /dev/sda
[7:0:0:0]    disk    ORACLE   SSM              PMAP  /dev/sdb

Display the NVME logical devices:

# ls -l /dev | grep nvme | grep brw | grep -v n1p
brw-rw----  1 root disk    259,   0 Jan 30 17:14 nvme0n1
brw-rw----  1 root disk    259,  12 Jan 30 17:14 nvme1n1
brw-rw----  1 root disk    259,  24 Feb 20 13:53 nvme2n1
brw-rw----  1 root disk    259,  25 Feb 20 13:52 nvme3n1
brw-rw----  1 root disk    259,  26 Feb 20 13:52 nvme4n1
brw-rw----  1 root disk    259,  27 Feb 20 13:53 nvme5n1
brw-rw----  1 root disk    259,  28 Feb 20 13:52 nvme6n1
brw-rw----  1 root disk    259,  29 Feb 20 13:53 nvme7n1

We expect 8 or 12 NVME block device entries to be shown above
(depending on the model).

If the device count is not correct in the output above, the server should be re-opened and the
device connections checked to ensure they are secure BEFORE the following CellCLI
commands are issued.

Verify that the Flash is visible to the cell’s software stack:

# cellcli -e "list lun where disktype=flashdisk"
         0_0     0_0     normal
         0_1     0_1     normal
         0_3     0_3     normal
         0_4     0_4     normal
         0_6     0_6     normal
         0_7     0_7     normal
         0_9     0_9     normal
         0_10    0_10    normal

If any LUN reports a status other than "normal", it must be resolved before proceeding.

Confirm the health status:

# nvmecli --identify --all | grep -i ind
Health Indicator                       : Healthy
Health Indicator                       : Healthy
Health Indicator                       : Healthy
Health Indicator                       : Healthy
Health Indicator                       : Healthy
Health Indicator                       : Healthy
Health Indicator                       : Healthy
Health Indicator                       : Healthy


Customer Activity:


1. Once the operating system is up, activate the grid disks:

# cellcli

CellCLI> alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
...etc...

2. Issue the command below and all disks should show 'active':

CellCLI> list griddisk
DATA_CD_00_dmorlx8cel01 active
DATA_CD_01_dmorlx8cel01 active
RECO_CD_00_dmorlx8cel01 active
RECO_CD_01_dmorlx8cel01 active
...etc...

Proceed to "Verifying Storage Cells have returned to service" below.

Verifying Storage Cells have returned to service


Customer Activity:


Verify all grid disks have been successfully put online using the following command. Wait until
asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in
the activation process.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
DATA_CD_01_dmorlx8cel01 active ONLINE Yes
DATA_CD_02_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
...etc...

Notice in the example above that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' process.
Oracle ASM synchronization is complete only when ALL grid disks show
asmmodestatus='ONLINE'. This process can take some time, depending on how busy the
machine is now and how busy it was while this server was down for repair.
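
The same progress can also be watched from the ASM instance if the customer prefers (a hedged alternative to the CellCLI check, to be run by the customer DBA); when the query returns no rows, resynchronization is complete:

SQL> select name, mode_status from v$asm_disk where mode_status <> 'ONLINE';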

 

Shutdown of DB nodes


Customer Activity:

If running OVM, go to the section "For Compute Node running OVM" below; for non-OVM nodes, proceed as follows:

 Shutdown crs


(a) As the root user, do the following to stop CRS and disable autostart of CRS on reboot:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl disable crs

# $ORACLE_HOME/bin/crsctl stop crs
     or
# <GI_HOME>/bin/crsctl stop crs

The GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid", but this depends on the customer's environment.

In the output above, the "1" in "+ASM1" refers to the DB node number; for DB node #3, for example, the value would be +ASM3.
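
Where the Grid Infrastructure home is not known, one place it is commonly recorded on the node is /etc/oracle/olr.loc (a hedged hint only; confirm the location with the customer DBA):

# grep crs_home /etc/oracle/olr.loc
crs_home=/u01/app/11.2.0/grid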

(b) Validate that CRS is down cleanly. There should be no CSS processes running:

     

# ps -ef | grep css
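
A slightly broader check (a hedged addition to the css check above) is to look for the main clusterware daemons and to confirm that the stack reports itself as down:

# ps -ef | grep -E 'ocssd|crsd|evmd' | grep -v grep
# $ORACLE_HOME/bin/crsctl check crs
CRS-4639: Could not contact Oracle High Availability Services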

 

(c) The customer or the field engineer can now shut down the server operating system:

Linux:

# shutdown -hP now

 

(d) The field engineer can now slide out the server for maintenance. Remember to disconnect the power cords before opening the top of the server.

 

 For Compute Node running OVM, proceed as follows:

If there are any concerns, engage an EEST engineer.

The customer should perform the following:

(a) See what user domains are running (record the result).

Connect to the management domain (domain zero, or dom0). This is an example with just two user domains plus the management domain Domain-0:

# xm list
 Name ID Mem VCPUs State Time(s)
 Domain-0 0 8192 4 r----- 409812.7
 dm01db01vm01 8 8192 2 -b---- 156610.6
 dm01db01vm02 9 8192 2 -b---- 152169.8

Connect to each domain using the command:

# xm console domainname

where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.

Shut down any instances of CRS in all user domains; refer to the example in the previous "Shutdown crs" section.

Note: Omit the following command for OVM, as it is not required:

# $ORACLE_HOME/bin/crsctl disable crs

Press CTRL+] to disconnect from the console.

 

(b) Shut down all user domains from dom0:

# xm shutdown -a -w

(c) Check which user domains are still running (only Domain-0 should remain).
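
Illustrative output once all user domains have stopped (values are examples; only the management domain should be listed):

# xm list
 Name ID Mem VCPUs State Time(s)
 Domain-0 0 8192 4 r----- 409815.2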

(d) The customer or the field engineer can now shut down the server operating system:

# shutdown -hP now

(e) The field engineer can now slide out the server for maintenance. Remember to disconnect the power cords before opening the top of the server.

 

Start-up of DB Nodes


To start the DB nodes, begin by booting the server. Once the power cords have been reattached
and the ILOM has booted, you will see a slow blink on the server's green LED.
Power on the server by pressing the power button on the front of the unit.

This section is for compute nodes NOT running OVM; if running OVM, see the later section "For Compute Node running OVM".

You can now hand the system back to the customer DBA to check that all ASM, DB and CRS services can be brought up and are online before obtaining sign-off. This step
may take more than 10 minutes to complete, depending on the current load on the database; see the detailed information below. If the customer DBA requires assistance
beyond this, direct them to call back the parent SR owner in EEST.


DB Node Startup Verification:

1. Start up CRS and re-enable autostart of CRS. After the OS is up, the customer DBA should validate that CRS is running. As root, execute:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl start crs
# $ORACLE_HOME/bin/crsctl check crs

Now re-enable autostart:

# $ORACLE_HOME/bin/crsctl enable crs
     or
# <GI_HOME>/bin/crsctl check crs
# <GI_HOME>/bin/crsctl enable crs

The GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid", but this depends on the customer's environment.
In the output above, the "1" in "+ASM1" refers to the DB node number; for DB node #3, for example, the value would be +ASM3.
Example output when all is online is:

# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

2. Validate that instances are running:

# ps -ef |grep pmon

It should return a record for the ASM instance and a record for each database.
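
Illustrative output (the PIDs and instance names are examples only; expect one asm_pmon_+ASM<n> process plus one ora_pmon_<SID> process per database instance):

# ps -ef | grep pmon | grep -v grep
grid     12345     1  0 10:02 ?        00:00:01 asm_pmon_+ASM1
oracle   23456     1  0 10:05 ?        00:00:03 ora_pmon_dbm011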

 

For Compute Node running OVM

If the customer requires assistance, please ask them to contact an EEST engineer or the parent case owner.

 

See what user domains are running (compare against the result collected previously):

# xm list

 

If any domain did not auto-start, start a single user domain with:

# xm create -c /EXAVMIMAGES/GuestImages/DomainName/vm.cfg

Check that CRS has started in the user domains; refer to the previous section "DB Node Startup Verification".


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.