
Asset ID: 1-71-1487791.1
Update Date: 2017-03-14
Keywords:

Solution Type: Technical Instruction

Solution  1487791.1 :   SuperCluster - How to cleanly shutdown and startup an Oracle SuperCluster T4-4 or T5-8  


Related Items
  • SPARC SuperCluster T4-4
  • Oracle SuperCluster T5-8 Half Rack
  • SPARC SuperCluster T4-4 Half Rack
  • Oracle SuperCluster T5-8 Full Rack
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>SPARC SuperCluster>DB: SuperCluster_EST




In this Document
Goal
Solution
 Shutdown Procedures
 If running Oracle Solaris Cluster OSC3.3u1/S10 or OSC4.0/S11, shut down the clustering service
 If running Ops Center 12c in SuperCluster mode, halt the Enterprise Controller
  Follow applicable documentation to cleanly shut down all user applications or databases running in zones or LDoms
  Obtain a list of all running zones
  Shut down all running zones
  Obtain a list of all running LDoms
  Obtain the names of the LDoms with direct hardware access
  Stop the domains from the ldm list output that are not on this list
  Stop the guest domain with direct hardware access
  Shut down the CRS stack on all domains running Oracle CRS
  Verify all Oracle processes are stopped and remediate as necessary if they are not
  Shut down the Exadata storage cell services and operating systems
  Shut down the operating system of the control LDom
  Connect to the compute node ILOM and stop /SYS
  Show, and set if need be, the power policy settings so the T4-4 or T5-8 machines do NOT power on automatically when rack power is restored
  Shut down the ZFS Storage Appliance
  The switches do not have specific power-off instructions; they will be powered off when power is removed from the rack
  Flip the breakers on the PDUs to the off position.
 Startup Procedures
  Flip the breakers on all PDUs to the on position
  Verify, and if necessary fix, the partitioning on the IB switches
  Start up the ZFS Storage Appliance
  Verify the startup of the Exadata storage cells
  Bring up the T4-4 or T5-8 systems
  Verify the system
  Restart all applicable applications and test.
References


Applies to:

SPARC SuperCluster T4-4 - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster T5-8 Full Rack - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster T5-8 Half Rack - Version All Versions to All Versions [Release All Releases]
SPARC SuperCluster T4-4 Half Rack - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (64-bit)

Goal

 Describe the recommended procedure for cleanly powering down and powering up an Oracle SuperCluster T4-4 or T5-8.

 

Solution

 

This note addresses the standard offered configurations of Oracle SuperCluster. If your machine has any variations approved as exceptions, your steps may vary.

 

Shutdown Procedures

If running Oracle Solaris Cluster OSC3.3u1/S10 or OSC4.0/S11, shut down the clustering service first. Run the following on all global zones involved in clustering to prevent failover while applications and zones are being shut down.

# /usr/cluster/bin/cluster shutdown -g 0 -y

If running Ops Center 12c in SuperCluster mode, you will also have to halt the Enterprise Controller so it does not attempt to fail over while CRS is being brought down.

# /opt/SUNWxvmoc/bin/ecadm ha-stop-no-relocate
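
Once halted, the Enterprise Controller state can be confirmed before proceeding. The following check is a suggestion based on the standard Ops Center ecadm utility (same path as the command above); confirm the subcommand against your Ops Center documentation:

# /opt/SUNWxvmoc/bin/ecadm status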

 Follow applicable documentation to cleanly shut down all user applications or databases running in zones or LDoms.

Obtain a list of all running zones

# zoneadm list
global
sol10_zone

 Shut down all running zones

# zoneadm -z sol10_zone shutdown
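
If more than one non-global zone is running, a small loop such as the following (a sketch that takes the zone names straight from the zoneadm list output above) shuts them all down in one pass:

# for Z in $(zoneadm list | grep -v '^global$'); do zoneadm -z $Z shutdown; done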

 Obtain a list of all running LDoms

# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    128   523776M  1.1%  4d 19h 50m
orlscclldm01     active     -n----  5001    64     32G      0.0%  11d 2h 59m
ssccn2-app1      active     -t--v-  5000    64    256G     1.6%  3d 23h 45m

 

The T4-4 and T5-8 LDom configurations can vary based on the configuration chosen during installation. If running with 1 LDom, you shut down the machine just as you would any other server, by cleanly shutting down the OS. If running 2 LDoms, you shut down the guest domain first and then the primary (control) domain. If running with 3 or more domains, you will have to identify the domain(s) running purely on virtualized I/O (no direct hardware access) and shut them down first, before moving on to the guest domain(s) with direct hardware access and finally the primary (control) domain.

 Obtain the names of the LDoms with direct hardware access.

T4-4
# ldm  list-io |egrep  "pci@400|pci@700"
pci@400         pci_0           primary
pci@700         pci_3           ssccn2-app1
...
The T5-8 is built a bit differently, but you can identify the same information by searching on SASHBA:
# ldm list-io |grep SASHBA
/SYS/MB/SASHBA0                           PCIE   pci_0    primary  OCC
/SYS/MB/SASHBA1                           PCIE   pci_15   ssccn1-dom3 OCC

 Stop the domains from the ldm list output that are not on this list

# ldm stop-domain orlscclldm01

 Stop the guest domain with direct hardware access

# ldm stop-domain ssccn2-app1
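
At this point ldm list should show only the primary domain as active; the stopped guest domains drop to the bound state. Expect output along these lines (domain names taken from the earlier example; yours will differ):

# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    128   523776M  0.9%  4d 19h 55m
orlscclldm01     bound      ------  5001    64    32G
ssccn2-app1      bound      ------  5000    64    256G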

 

 Shut down the CRS stack on all domains running Oracle CRS.

# /u01/app/11.2.0.3/grid/bin/crsctl stop crs

Verify all Oracle processes are stopped and remediate as necessary if they are not

# ps -ef | grep oracle
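
Using a bracketed pattern keeps the grep itself out of the output, and a CRS check confirms the stack is down; both are sketches using the grid home path shown above:

# ps -ef | grep [o]racle
# /u01/app/11.2.0.3/grid/bin/crsctl check crs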

 Shut down the Exadata storage cell services and operating systems

# cd /opt/oracle.SupportTools/onecommand
# dcli -g cell_group -l root 'cellcli -e "alter cell shutdown services all"'
# dcli -g cell_group -l root 'shutdown -h now'
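
After a few minutes the cells should stop responding on the management network. A simple reachability check from the same domain where dcli was run (cell_group is the same group file used above) can confirm they are down:

# for CELL in $(cat cell_group); do ping $CELL; done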

 Shut down the operating system of the control LDom

# shutdown -g0 -i0 -y

 

 Connect to the compute node ILOM and stop /SYS

-> stop /SYS
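
If the graceful stop does not complete in a reasonable time, ILOM also accepts a forced stop; use it only after the graceful attempt, since it removes host power immediately:

-> stop -f /SYS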

 Show, and set if need be, the power policy settings so the T4-4 or T5-8 machines do NOT power on automatically when rack power is restored. The following shows the settings that you want to reach.

-> show /SP/policy

 /SP/policy
    Targets:

    Properties:
        HOST_AUTO_POWER_ON = disabled
        HOST_COOLDOWN = disabled
        HOST_LAST_POWER_STATE = disabled
        HOST_POWER_ON_DELAY = disabled
        PARALLEL_BOOT = enabled
If any of yours are set to enabled, modify them as follows:
-> set /SP/policy HOST_AUTO_POWER_ON=disabled
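
If either of the other host power policies is enabled, the same set syntax applies, for example:

-> set /SP/policy HOST_LAST_POWER_STATE=disabled
-> set /SP/policy HOST_POWER_ON_DELAY=disabled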

 Shut down the ZFS Storage Appliance

Browse to the BUI of both storage heads and, from the dashboard, select the power off appliance button in the upper left section of the screen below the Oracle logo.
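
If the BUI is not reachable, the appliance CLI offers an equivalent. The following is a sketch assuming the standard ZFS Storage Appliance CLI maintenance context and an illustrative host name; confirm against your appliance documentation:

zfs-head1:> maintenance system poweroff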

 

 The switches do not have specific power-off instructions; they will be powered off when power is removed from the rack.

 Flip the breakers on the PDUs to the off position.

Startup Procedures

Please note that if you are running switch firmware 1.1.3-x you will need to run steps to correct the switch InfiniBand partitioning. This is documented in <Document 1452277.1> SPARC SuperCluster Critical Issues. It is highly advisable to upgrade your rack to the latest Quarterly Maintenance Bundle to bring the switches to version 2.0.6 or above and prevent this issue. The link to the download can be found in <Document 1567979.1> SPARC SuperCluster and Exadata Storage Server 11g Release 2 (11.2) Supported Versions.

 Flip the breakers on all PDUs to the on position

Only perform the following steps if the switches are at a firmware version below 2.0.6

 Verify, and if necessary fix, the partitioning on the IB switches

#  smpartition list active
# getmaster

 

The smpartition command should reflect 3 or more partitions (0x0501, 0x0502, 0x0503, etc., depending on configuration), and getmaster should reflect the spine switch as the master. If this is not the case, run the following command:

 

# smpartition start; smpartition commit
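
After the commit, re-run the two checks above and confirm that the partitions are listed and that the spine switch now reports itself as the master subnet manager:

# smpartition list active
# getmaster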

 

If this does not remediate the issue, please open an SR with your SuperCluster CSI and serial number and request an engineer to assist you with the more in-depth remediation steps. Reference this document ID in your SR.

 

Internal remediation steps: before proceeding, check /conf/configvalid on each switch and verify that it contains 1. If it does not at any point during these steps, run echo 1 > /conf/configvalid on that switch.

START:

Modify the host list in the following commands to match the customer's IB switch host names.

From a compute node, run disablesm on all IB switches:

# for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW disablesm ; done

View, and fix if necessary, the network address/netmask on the spine switch (IB1), then reboot it (first time only!):

# vi /etc/sysconfig/network-scripts/ifcfg-eth0
# reboot

From a compute node, run enablesm on all IB switches:

Before running enablesm, verify /conf/configvalid; if it is not correct, log into each switch and fix it.

# for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW cat /conf/configvalid ; done
# for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW enablesm ; done

CHECK:
To see if a master subnet manager has been arbitrated:


 # for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW getmaster ; done

If a subnet manager was arbitrated but it is not ib-sw1 (spine), go back to START: (except skip the network check/reboot steps)

Else, if no subnet manager was arbitrated, verify /conf/configvalid:


# for IBSW in ib-sw1 ib-sw2 ib-sw3; do ssh $IBSW cat /conf/configvalid ; done

From the spine switch

# smpartition start; smpartition commit
Go back to CHECK:

 

 

 Start up the ZFS Storage Appliance

Browse to the BUI of both storage heads; if you can connect, proceed to the Exadata storage cell steps.

 

 

If you cannot connect to the BUI, verify that the 7320 has started by connecting via ssh as root to the SP of each head and issuing the following:

 

-> start /SYS

 Verify the startup of the Exadata storage cells

 

Run the following from cel01 of your SuperCluster as celladmin and verify that the cell services are online and that all griddisks are active.

 

# dcli -g cell_group -l celladmin 'cellcli -e "list cell"'
# dcli -g cell_group -l celladmin 'cellcli -e "list griddisk"'
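
A more detailed grid disk check can confirm that no disks are stuck inactive or offline; this is a sketch using the same dcli/cellcli pattern as above:

# dcli -g cell_group -l celladmin 'cellcli -e "list griddisk attributes name,status,asmmodestatus"'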

 Bring up the T4-4 or T5-8 systems

Log into the ILOM for each T4-4 or T5-8, start /SYS, and then monitor the progress via the /SP/console:

 

-> start /SYS
-> start /SP/console

 Verify the system

Unless configured otherwise by the site database administrators or system administrators, all LDoms, zones, Clusterware, and database-related items should come up automatically as the system boots. If they fail to do so, manually start these components per your site's standard operating procedures. Please verify the system is all the way up via the console before checking dependent items. If for any reason you cannot restart something, gather appropriate diagnostic data and file an SR after consulting with your local administrators. The svcs -xv command will show which system services, if any, did not start and assist you in debugging why.

 

# ldm list
# zoneadm list
# /u01/app/11.2.0.3/grid/bin/crsctl status res -t
# svcs -xv
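
If the CRS resources are slow to come online, a cluster-wide check (same grid home as above) shows whether the clusterware stack has started on every node:

# /u01/app/11.2.0.3/grid/bin/crsctl check cluster -all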

 Restart all applicable applications and test.


References

<NOTE:1452277.1> - SuperCluster Critical Issues
<NOTE:1567979.1> - Oracle SuperCluster Supported Software Versions - All Hardware Types
