Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction

Solution 1982342.1: How to Shutdown and Startup Exadata Compute Nodes and Storage Cells (including SuperCluster) When Performing Hardware Maintenance (X5 and later)
Applies to:
Zero Data Loss Recovery Appliance X6 Hardware - Version All Versions and later
Exadata X6-2 Hardware - Version All Versions and later
Exadata X6-8 Hardware - Version All Versions and later
Exadata X7-2 Hardware - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Information in this document applies to any platform.

Goal

This document provides a guide for shutting down and starting up Exadata compute nodes and storage cells (including SuperCluster) when hardware maintenance is performed, for models running 12.1.x.x.x or later images as shipped on X5-2 or later.

Solution

Shutdown of Storage Cells

For extended information on this section, check MOS Note ID 1188080.1, "Steps to shut down or reboot an Exadata storage cell without affecting ASM".

Where noted, the SQL, CellCLI and 'root' commands should be run by the Customer's DBA, unless the Customer provides login access to the Field Engineer.
NOTE: If the Exadata X5 cell is to be shut down for maintenance because of a perceived bad flash device, or when changing flashcachemode from WB (write-back) to non-WB (write-through), then it will be necessary to flush the flash cache. Please refer to Doc ID 1500257.1 for details. This is not required for a normal shutdown of the cell.
1. ASM drops a disk shortly after it is taken offline. The default DISK_REPAIR_TIME is 3.6 hours. To check the current value for all mounted disk groups, log into the ASM instance and run:

SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg
     where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

As long as the value is large enough to comfortably complete the hardware maintenance, the disks will not be dropped while the cell is down.
2. Check that ASM will be OK if the grid disks go OFFLINE:

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...snippet...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
DATA_CD_11_cel01 ONLINE Yes
RECO_CD_00_cel01 ONLINE Yes
...etc...

If one or more disks return asmdeactivationoutcome='No', wait for some time and re-run the command until all disks return asmdeactivationoutcome='Yes'.

NOTE: Taking the storage server offline while one or more disks return a status of asmdeactivationoutcome='No' will cause Oracle ASM to dismount the affected disk group, causing the databases to shut down abruptly.

3. Run the following CellCLI command to inactivate all grid disks on the cell you wish to power down/reboot:

# cellcli
CellCLI> ALTER GRIDDISK ALL INACTIVE
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk DATA_CD_02_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
...etc...

4. Execute the command below; the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks:

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_02_dmorlx8cel01 inactive OFFLINE Yes
RECO_CD_00_dmorlx8cel01 inactive OFFLINE Yes
...etc...
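When several cells are serviced in sequence, the step-4 check can be scripted. The helper below is a sketch, not part of the MOS note: it reads rows in the column order shown above and prints SAFE only when every grid disk is OFFLINE or UNUSED with asmdeactivationoutcome=Yes. On a real cell you would pipe the `cellcli -e "list griddisk attributes ..."` output into it instead of the sample `printf`.

```shell
# Sketch only: decide from captured CellCLI output whether the cell is safe
# to power off. Columns: name status asmmodestatus asmdeactivationoutcome.
check_griddisks() {
  awk '($3 != "OFFLINE" && $3 != "UNUSED") || $4 != "Yes" { bad++ }
       END { print (bad ? "NOT SAFE" : "SAFE") }'
}

# Example run against sample rows (on a cell, replace printf with cellcli):
printf '%s\n' \
  'DATA_CD_00_cel01 inactive OFFLINE Yes' \
  'RECO_CD_00_cel01 inactive OFFLINE Yes' | check_griddisks   # prints SAFE
```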
Customer or Service Engineer Activity: You can now shut down the cell using the following command:

# shutdown -hP now
Disconnect the power cords before opening the top of the server.
Start-up of Storage Cells

Service Engineer Activity:

1. Reconnect the power cords and power on the cell; allow the operating system to boot.
2. Before activating the cell disks, verify all are visible to the server.

High Capacity ( HC ) Systems only

Verify the 12 disks are visible. The following command should show 12 disks:

# lsscsi | grep -i LSI
[0:2:0:0]  disk LSI MR9361-8i 4.23 /dev/sda
[0:2:1:0]  disk LSI MR9361-8i 4.23 /dev/sdb
[0:2:2:0]  disk LSI MR9361-8i 4.23 /dev/sdc
[0:2:3:0]  disk LSI MR9361-8i 4.23 /dev/sdd
[0:2:4:0]  disk LSI MR9361-8i 4.23 /dev/sde
[0:2:5:0]  disk LSI MR9361-8i 4.23 /dev/sdf
[0:2:6:0]  disk LSI MR9361-8i 4.23 /dev/sdg
[0:2:7:0]  disk LSI MR9361-8i 4.23 /dev/sdh
[0:2:8:0]  disk LSI MR9361-8i 4.23 /dev/sdi
[0:2:9:0]  disk LSI MR9361-8i 4.23 /dev/sdj
[0:2:10:0] disk LSI MR9361-8i 4.23 /dev/sdk
[0:2:11:0] disk LSI MR9361-8i 4.23 /dev/sdl

Ensure the NVME devices are visible:

# ls -l /dev | grep nvme | grep brw
brw-rw---- 1 root disk 259, 0 Feb 20 13:43 nvme0n1
brw-rw---- 1 root disk 259, 1 Feb 20 13:43 nvme1n1
brw-rw---- 1 root disk 259, 2 Feb 20 13:43 nvme2n1
brw-rw---- 1 root disk 259, 3 Feb 20 13:43 nvme3n1

We expect 4 NVME block device entries to be shown above.

Verify that the 4 Sun F160 Flash AIC cards are visible to the cell's software stack:

# cellcli -e "list lun where disktype=flashdisk"
1_1 1_1 normal
2_1 2_1 normal
4_1 4_1 normal
5_1 5_1 normal

If anything is in a status other than "normal", it needs to be resolved. Confirm the health status:

# nvmecli --identify --all | grep -i ind
Health Indicator : Healthy
Health Indicator : Healthy
Health Indicator : Healthy
Health Indicator : Healthy

Verify that the 2 boot USB "disks" are visible and online:

# lsscsi
[6:0:0:0] disk ORACLE SSM PMAP /dev/sda
[7:0:0:0] disk ORACLE SSM PMAP /dev/sdb

Extreme Flash ( EF ) Systems only

Display the NVME logical devices:

# ls -l /dev | grep nvme | grep brw | grep -v n1p
brw-rw---- 1 root disk 259, 0  Jan 30 17:14 nvme0n1
brw-rw---- 1 root disk 259, 12 Jan 30 17:14 nvme1n1
brw-rw---- 1 root disk 259, 24 Feb 20 13:53 nvme2n1
brw-rw---- 1 root disk 259, 25 Feb 20 13:52 nvme3n1
brw-rw---- 1 root disk 259, 26 Feb 20 13:52 nvme4n1
brw-rw---- 1 root disk 259, 27 Feb 20 13:53 nvme5n1
brw-rw---- 1 root disk 259, 28 Feb 20 13:52 nvme6n1
brw-rw---- 1 root disk 259, 29 Feb 20 13:53 nvme7n1

We expect 8 or 12 NVME block device entries to be shown above. If the device count is not correct in the above output, then the server should be re-opened and the drive connections checked before proceeding.

Verify that the Flash is visible to the cell's software stack:

# cellcli -e "list lun where disktype=flashdisk"
0_0  0_0  normal
0_1  0_1  normal
0_3  0_3  normal
0_4  0_4  normal
0_6  0_6  normal
0_7  0_7  normal
0_9  0_9  normal
0_10 0_10 normal

If anything is in a status other than "normal", it needs to be resolved. Confirm the health status:

# nvmecli --identify --all | grep -i ind
Health Indicator : Healthy
Health Indicator : Healthy
Health Indicator : Healthy
Health Indicator : Healthy
Health Indicator : Healthy
Health Indicator : Healthy
Health Indicator : Healthy
Health Indicator : Healthy
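As a quick sanity check, the count of "Healthy" indicators can be compared against the expected device count automatically. The helper below is a sketch, not part of the note: set EXPECTED to 4 for the HC flash cards or 8 for the EF drives shown above, and pipe the saved `nvmecli --identify --all | grep -i ind` output into it.

```shell
# Sketch: count healthy flash devices in captured nvmecli output and report
# the result against the expected count (4 on HC, 8 on EF in the examples).
healthy_count() {
  grep -c 'Health Indicator : Healthy'
}

EXPECTED=8
n=$(printf 'Health Indicator : Healthy\nHealth Indicator : Healthy\n' | healthy_count)
echo "found $n healthy devices (expected $EXPECTED)"
```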
1. Activate the grid disks:

# cellcli
CellCLI> alter griddisk all active
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
GridDisk RECO_CD_00_dmorlx8cel01 successfully altered
GridDisk RECO_CD_01_dmorlx8cel01 successfully altered
...etc...

2. Issue the command below; all disks should show 'active':

CellCLI> list griddisk
DATA_CD_00_dmorlx8cel01 active
DATA_CD_01_dmorlx8cel01 active
RECO_CD_00_dmorlx8cel01 active
RECO_CD_01_dmorlx8cel01 active
...etc...

Proceed to "Verifying Storage Cells have returned to service" below.

Verifying Storage Cells have returned to service
CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
DATA_CD_01_dmorlx8cel01 active ONLINE Yes
DATA_CD_02_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
...etc...

Notice in the above example that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' process. Re-run the command until all grid disks show asmmodestatus=ONLINE before proceeding to the next cell.
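Re-running the check by hand can be tedious during a long resync. The helper below is a sketch, not from the note: it parses the name/asmmodestatus columns and prints READY only when every grid disk is ONLINE. The commented polling loop is hypothetical and assumes `cellcli` is in the PATH on the storage server, so the helper is demonstrated against sample text here.

```shell
# Sketch: report READY only when every grid disk shows asmmodestatus=ONLINE.
# Input columns: name asmmodestatus
all_online() {
  awk '$2 != "ONLINE" { n++ } END { print (n ? "WAIT" : "READY") }'
}

# Hypothetical polling loop to run on the cell itself:
#   until cellcli -e "list griddisk attributes name,asmmodestatus" | \
#         all_online | grep -q READY; do sleep 60; done
printf 'RECO_CD_00_cel01 SYNCING\n' | all_online   # prints WAIT
```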
Shutdown of DB Nodes
If running OVM, go to the section "For Compute Node running OVM" below; for non-OVM proceed as follows:

(a) Shut down CRS:

# . oraenv
# $ORACLE_HOME/bin/crsctl disable crs
# $ORACLE_HOME/bin/crsctl stop crs

where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but will depend on the customer's environment. When oraenv prompts for the ORACLE_SID, the "1" of "+ASM1" refers to the DB node number; for example, on DB node #3 the value would be +ASM3.

(b) Validate CRS is down cleanly. There should be no processes running:
# ps -ef | grep css
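Note that `grep css` always matches its own grep process, so "no processes running" really means only the grep line appears. The bracket trick below avoids that self-match and makes the check scriptable; this is a sketch, not part of the note.

```shell
# Sketch: '[c]ss' cannot match the grep command line itself, so the pipeline
# produces output only when real CSS daemons are still running.
if ps -ef | grep '[c]ss' > /dev/null; then
  echo "CSS processes still running - do not power off yet"
else
  echo "CRS is down cleanly"
fi
```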
(c) The customer or the field engineer can now shut down the server operating system:

Linux: # shutdown -hP now
(d) The field engineer can now slide out the server for maintenance. Remember to disconnect the power cords before opening the top of the server.
For Compute Node running OVM, proceed as follows. If there are any concerns, engage an EEST engineer. The customer should perform the following:

(a) See what user domains are running (record the result). Connect to the management domain (domain zero, or dom0). This is an example with just two user domains plus the management domain Domain-0:

# xm list
Name          ID  Mem  VCPUs  State   Time(s)
Domain-0       0  8192     4  r-----  409812.7
dm01db01vm01   8  8192     2  -b----  156610.6
dm01db01vm02   9  8192     2  -b----  152169.8

Connect to each domain using the command:

# xm console domainname
where domainname would be dm01db01vm01 or dm01db01vm02 in the above example. Shut down any instances of CRS in all user domains; refer to the "Shut down CRS" example in the previous section.

Note: Omit the following command for OVM as it is not required:

# $ORACLE_HOME/bin/crsctl disable crs

Press CTRL+] to disconnect from the console.
(b) Shut down all user domains from dom0:

# xm shutdown -a -w
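After the shutdown completes, checking that only Domain-0 remains can be scripted against the `xm list` output. The helper below is a sketch, not part of the note; the domain names in the sample are from the earlier example, not your system.

```shell
# Sketch: given `xm list` output (header line first), print ONLY-DOM0 when
# no user domains remain running, otherwise GUESTS-RUNNING.
only_dom0() {
  awk 'NR > 1 && $1 != "Domain-0" { n++ }
       END { print (n ? "GUESTS-RUNNING" : "ONLY-DOM0") }'
}

printf 'Name ID Mem VCPUs State Time(s)\nDomain-0 0 8192 4 r----- 409812.7\n' |
  only_dom0   # prints ONLY-DOM0
```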
(c) See what user domains are running (should be only Domain-0):

# xm list

(d) The customer or the field engineer can now shut down the server operating system:

# shutdown -hP now
(e) The field engineer can now slide out the server for maintenance. Remember to disconnect the power cords before opening the top of the server.
Start-up of DB Nodes
This section is for compute nodes NOT running OVM; if running OVM, see the later section "For Compute Node running OVM". You can now hand the system back to the customer DBA to check that all ASM, DB and CRS services can be brought up and are online before obtaining sign-off.
1. Start up CRS and re-enable autostart of CRS. After the OS is up, the Customer DBA should validate that CRS is running. As root, execute:

# . oraenv
# $ORACLE_HOME/bin/crsctl start crs

Now re-enable autostart:

# $ORACLE_HOME/bin/crsctl enable crs

where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but will depend on the customer's environment. Check CRS status:

# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

2. Validate that instances are running:

# ps -ef | grep pmon
It should return a record for the ASM instance and a record for each database.
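The pmon check is easier to read if the instance names are extracted from the process list. The helper below is a sketch, not part of the note; the sample `ps` lines are illustrative, and on the node you would pipe `ps -ef` into it.

```shell
# Sketch: pull instance names (ASM and databases) out of ps output by
# stripping the pmon_ prefix from each background-process name.
list_instances() {
  grep -o 'pmon_[A-Za-z0-9+_$]*' | sed 's/^pmon_//' | sort -u
}

# Example against sample ps -ef lines (on the node: ps -ef | list_instances):
printf 'oracle 4321 1 0 ? asm_pmon_+ASM1\noracle 4409 1 0 ? ora_pmon_ORCL1\n' |
  list_instances
```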
For Compute Node running OVM

If the customer requires assistance, please ask them to contact an EEST engineer or the parent case owner.
See what user domains are running (compare against the result previously collected):

# xm list
If any did not auto-start, start a single user domain with:

# xm create -c /EXAVMIMAGES/GuestImages/DomainName/vm.cfg
Check that CRS has started in the user domains; refer to the previous section "DB Node Startup Verification".