Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1674297.1
Update Date: 2016-09-26
Keywords:

Solution Type: Technical Instruction

Solution  1674297.1 :   SuperCluster M6-32 - How to cleanly shutdown and startup an Oracle SuperCluster M6-32  


Related Items
  • Oracle SuperCluster M6-32 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>SPARC SuperCluster>DB: SuperCluster_EST
  • _KM>Content>Hints and Tips
  • _KM>Content>Procedures and Guidelines


Describe the recommended procedure for cleanly powering down and powering up an Oracle SuperCluster M6-32.

In this Document
Goal
Solution
 Shutdown Procedures
  If running Oracle Solaris Cluster OSC3.3u1/S10 or OSC4.0/S11, shut down the clustering service on all global zones involved in clustering to prevent failover when shutting down applications and zones
  Obtain a list of all running zones
  Shut down all non-Oracle Database applications running in zones
  Log in to each zone and cleanly shut down CRS and all DB resources
  Obtain a list of all running LDoms from the primary (control) LDom in each PDom
  Stop the guest domains first
  Shut down the CRS stack on all domains running Oracle CRS
  Verify all Oracle processes are stopped; if they are not, remediate as necessary
  Check that the LDoms and zones, depending on the Oracle deployment, are shut down
  Shut down Exadata storage cell services and operating systems
  Shut down the operating system of the control LDom
  Connect to the compute node ILOM and stop /System
  Show and then set, if need be, the power policy settings so the M6-32 domains do not power on automatically when rack power is restored
  Shut down the ZFS Storage Appliance
  The switches do not have specific power-off instructions; they will be powered off when power is removed from the rack
  Flip the breakers on the PDUs to the off position
 Startup Procedures
  Flip the breakers on all PDUs to the on position
  Verify and, if necessary, fix the partitioning on the IB switches
  Start up the ZFS Storage Appliance
  Verify the startup of the Exadata storage cells
  Bring up the M6-32 domains
  Verify the system
  Restart all applicable applications and test


Applies to:

Oracle SuperCluster M6-32 Hardware - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (64-bit)
This document addresses the various offered configurations of Oracle SuperCluster M6-32; if any variations have been approved as exceptions, the steps described in this document may vary.

Goal

 Describe the recommended procedure for cleanly powering down and powering up an Oracle SuperCluster M6-32

 

Solution

 

This document addresses the various offered configurations of Oracle SuperCluster M6-32; if any variations have been approved as exceptions, the steps described in this document may vary.

 

Shutdown Procedures

If running Oracle Solaris Cluster OSC3.3u1/S10 or OSC4.0/S11, you need to shut down the clustering service. Run the following on all global zones involved in clustering to prevent failover when shutting down applications and zones.

# /usr/cluster/bin/cluster shutdown -g 0 -y

 

 Follow applicable documentation to cleanly shut down all user applications or databases running in zones or LDoms.

Obtain a list of all running zones

# zoneadm list
global
orlm6db01z2
orlm6db01z1
orlm6db01z3
orlm6db01_T

 Shut down all non-Oracle Database applications running in zones
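
For example, if an application in a zone is managed by an SMF service, a minimal sketch of stopping it cleanly could look like the following (the zone name, application name, and service FMRI are placeholders, not part of the standard procedure):

# zlogin <zonename>
# svcs -a | grep <application>
# svcadm disable <application_fmri>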

Log in to each zone and cleanly shut down CRS and all DB resources with

# zlogin <zonename> /u01/app/< Grid Infrastructure version at install time >/grid/bin/crsctl stop crs

# zlogin orlm6db01z1 /u01/app/11.2.0.3/grid/bin/crsctl stop crs


 Obtain a list of all running LDoms from the primary (control) LDom in each PDom

# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    128   523776M  1.1%  4d 19h 50m
ssccn1-dom1      active     -n----  5001    64    32G      0.0%  11d 2h 59m
ssccn1-dom2      active     -t--v-  5000    64    256G     1.6%  3d 23h 45m

 

The Oracle SuperCluster M6-32 LDom configurations can vary based on the configuration chosen during installation. If running with one LDom, you shut down the machine just as you would any other server, by cleanly shutting down the OS. If running with two LDoms, shut down the guest domain first and then the primary (control) domain. If running with three or more domains, identify the domain(s) running off virtualized hardware and shut it/them down first before moving on to the guest domain and finally the primary (control) domain.
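
One way to identify which domains rely on virtual devices served by another domain (and therefore must be stopped first) is to inspect the long listing from the control domain; this is a suggested check, not a mandated step:

# ldm list -l

Domains whose disk and network devices are virtual (vdisk/vnet entries) depend on the domain(s) providing those services and should be stopped before the domains that serve them.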

 

 Stop the guest domains first

# ldm stop-domain ssccn1-dom2
# ldm stop-domain ssccn1-dom1

 

 Shut down the CRS stack on all domains running Oracle CRS.

# /u01/app/11.2.0.3/grid/bin/crsctl stop crs

Verify all Oracle processes are stopped; if they are not, remediate as necessary.

# ps -ef | grep oracle

Check that the LDoms and zones, depending on the Oracle deployment, are shut down.

# ldm list

# zoneadm list

 Check CRS and resources at the LDom layer or zone layer, depending on the Oracle deployment

# /u01/app/11.2.0.3/grid/bin/crsctl status res -t

# svcs -xv

 Shut down Exadata storage cell services and operating systems

# cd /opt/oracle.SupportTools/onecommand
# dcli -g cell_group -l root 'cellcli -e "alter cell shutdown services all"'
# dcli -g cell_group -l root 'shutdown -h now'

 Shut down the operating system of the control LDom

# shutdown -g0 -i0 -y

 Connect to the compute node ILOM and stop /System

stop /System

 Show and then set, if need be, the power policy settings so the M6-32 domains DO NOT power on automatically when rack power is restored. The following shows the settings that you want to reach.

-> show /SP/policy

 /SP/policy
    Targets:

    Properties:
        HOST_AUTO_POWER_ON = disabled
        HOST_LAST_POWER_STATE = disabled

    Commands:
        cd
        set
        show

If either of these policy settings is set to enabled, disable it as follows:
-> set /SP/policy HOST_AUTO_POWER_ON=disabled
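
Likewise, if HOST_LAST_POWER_STATE is shown as enabled, disable it as well:

-> set /SP/policy HOST_LAST_POWER_STATE=disabled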

 Shut down the ZFS Storage Appliance

Browse to the BUI of both storage heads and, from the dashboard, select the power off appliance button in the upper-left section of the screen below the Oracle logo.
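
If the BUI is unavailable, the appliance CLI offers an equivalent; the following is only a sketch assuming SSH access as root to each storage head:

# ssh root@<storage-head>
<storage-head>:> maintenance system poweroff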

 

 The switches do not have specific power-off instructions; they will be powered off when power is removed from the rack.

 Flip the breakers on the PDUs to the off position.

Startup Procedures

Please note that if you are running switch firmware 1.1.3-x you will need to run steps to correct the switch InfiniBand partitioning. This is documented in <Document 1452277.1> SPARC SuperCluster Critical Issues. It is highly advisable to upgrade your rack to the latest Quarterly Maintenance Bundle to get the switch to version 2.0.6 or above to prevent this issue. The link to the download can be found in <Document 1487081.1> SPARC SuperCluster and Exadata Storage Server 11g Release 2 (11.2) Supported Versions.

 Flip the breakers on all PDUs to the on position

Only perform the following steps if the switches are at a firmware version below 2.0.6

 Verify and, if necessary, fix the partitioning on the IB switches

# smpartition list active
# getmaster

 

The smpartition command should show 3 or more partitions (0x0501, 0x0502, 0x0503, etc., depending on configuration). getmaster should show the spine switch as the master. If it does not, run the following command.

 

# smpartition start; smpartition commit

 

If this does not remediate the issue, please open an SR with your SuperCluster CSI and serial number and request an engineer to assist you with the more in-depth remediation steps. Reference this document ID in your SR.

 

Internal remediation steps: before proceeding, check /conf/configvalid on each switch and verify that it contains 1. If it does not, at any point during these steps, run: echo 1 > /conf/configvalid
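
A minimal sketch of checking, and if needed fixing, this flag remotely from the compute node (the switch host names are the same placeholders used in the steps below):

# for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW cat /conf/configvalid ; done
# ssh <ib-switch> 'echo 1 > /conf/configvalid'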

START:

Modify the host list in the following commands for the customer's IB switch host names

From the compute node, run disablesm on all IB switches:

# for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW disablesm ; done

View and, if necessary, fix the network address/netmask on the spine switch (IB1), then reboot it (first time only!):

# vi /etc/sysconfig/network-scripts/ifcfg-eth0
# reboot
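
The fields to verify in ifcfg-eth0 follow the standard sysconfig format; the values below are illustrative placeholders, not required settings:

DEVICE=eth0
BOOTPROTO=static
IPADDR=<switch management IP>
NETMASK=<management netmask>
ONBOOT=yes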

From the compute node, run enablesm on all IB switches:

Before running enablesm, verify /conf/configvalid here and, if it is not correct, log in to each switch and fix it:

# for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW cat /conf/configvalid ; done
# for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW enablesm ; done

CHECK:
To see if a master subnet manager was arbitrated:


 # for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW getmaster ; done

If a subnet manager was arbitrated but it is not ib-sw1 (spine), go back to START: (except skip the network check/reboot steps)

else if no subnet manager was arbitrated:
Verify the /conf/configvalid here


# for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW cat /conf/configvalid ; done

From the spine switch

# smpartition start; smpartition commit
go back to CHECK:

 

 

 Start up the ZFS Storage Appliance

Browse to the BUI of both storage heads; if you can connect, proceed to the Exadata storage cell steps.

 

 

If you cannot connect with the BUI, verify the 7320 has started by connecting via ssh as root to the SP of each head and issuing the following:

 

-> start /SYS

 Verify the startup of the Exadata storage cells

 

Run the following from cel01 of your SuperCluster as celladmin to verify that the cell services are online and that all griddisks are active.

 

dcli -g cell_group -l celladmin 'cellcli -e "list cell"'
dcli -g cell_group -l celladmin 'cellcli -e "list griddisk"'
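
To focus on the griddisk status column directly, an equivalent check (assuming the same cell_group file) is:

dcli -g cell_group -l celladmin 'cellcli -e "list griddisk attributes name,status"'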

 Bring up the M6-32 domains

Log in to the ILOM of the active SP, start /SYS, and then monitor the progress via the /HOSTx/console:

 

-> start /SYS
-> start /HOSTx/console

 Verify the system

Unless configured otherwise by the site database administrators or system administrators, all LDoms, zones, Clusterware, and database-related items should come up automatically as the system boots. If they fail to do so, manually start these components per your site's standard operating procedures. Verify the system is all the way up via the console before checking dependent items. If for any reason you cannot restart something, gather appropriate diagnostic data and file an SR after consulting with your local administrators. svcs -xv will show which system services, if any, did not start and assist you in debugging why.

 

# ldm list
# zoneadm list
# /u01/app/11.2.0.3/grid/bin/crsctl status res -t
# svcs -xv

 Restart all applicable applications and test.
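
As with shutdown, a hypothetical example for restarting an application managed by an SMF service in a zone (the zone name and service FMRI are placeholders):

# zlogin <zonename> svcadm enable <application_fmri>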



Attachments
This solution has no attachment