Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type Technical Instruction Sure Solution 1966059.1 : How to Replace an Exadata Database Machine X5-2/X6-2 Compute Node Motherboard
Oracle Confidential PARTNER - Available to partners (SUN).
Applies to:
Exadata X6-2 Hardware - Version All Versions and later
Zero Data Loss Recovery Appliance X6 Hardware - Version All Versions and later
Exadata X5-2 Full Rack - Version All Versions and later
Exadata X5-2 Half Rack - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Information in this document applies to any platform.
Goal
Canned Action Plan for replacing a motherboard in a compute node in an Exadata Database Machine [X5-2/X6-2]
Solution
DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Exadata Trained
When connecting to ILOM via serial cable, remember that the baud rate is 9600 for replacement boards. This will get changed during
To make room for needed text MB = Motherboard
Pre-Install Steps:
1. Back up the ILOM SP configuration so it can be restored after the replacement:
-> cd /SP/config
-> set passphrase=welcome1
-> set dump_uri=scp://root:password@laptop_IP/var/tmp/SP.config
2. Obtain the correct Serial Numbers required.
3. If the system is not down already due to whatever problem is causing the MB to be replaced, then have the customer DBA shut the
(a) For Extended information on this section, check MOS Note 1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS
http://amomv0115.us.oracle.com/archive/cd_ns/E13877_01/doc/doc.112/e13874/maintenance.htm#autoId18
(b) The customer should shut down CRS services prior to powering down the DB node. If running OVM, go to the section "For Compute Node running OVM"; for non-OVM, proceed as follows:
i. Shutdown crs
# . oraenv
# $ORACLE_HOME/bin/crsctl disable crs
# $ORACLE_HOME/bin/crsctl stop crs
where ORACLE_HOME is set by oraenv to the Grid Infrastructure home (GI_HOME), typically “/u01/app/11.2.0/grid” but dependent on the customer's environment. The “1” of the ASM SID “+ASM1” refers to the DB node number. For example, for DB node #3 the value would be +ASM3.
ii. Validate CRS is down cleanly. There should be no processes running.
# ps -ef | grep css
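The check above always lists the grep command itself, so it is easy to misread. A minimal sketch of a filter that counts only real CSS processes follows; the function name count_css_procs and the sample listing are illustrative assumptions, not part of the Exadata tooling.

```shell
# Sketch only: count CSS-related processes in `ps -ef` output read from stdin,
# ignoring the grep command that appears in the listing.
count_css_procs() {
    awk '/css/ && !/grep/ { n++ } END { print n + 0 }'
}

# On the node this would be run as: ps -ef | count_css_procs
# The sample listing below is made up for illustration.
remaining=$(count_css_procs <<'EOF'
grid      4390     1  0 10:02 ?        00:01:10 /u01/app/11.2.0/grid/bin/ocssd.bin
root      4312     1  0 10:02 ?        00:00:03 /u01/app/11.2.0/grid/bin/cssdmonitor
root      9001  8990  0 10:30 pts/0    00:00:00 grep css
EOF
)

if [ "$remaining" -eq 0 ]; then
    echo "CRS is down cleanly"
else
    echo "$remaining CSS process(es) still running"
fi
```

A count of 0 corresponds to the "no processes running" condition; any other value means CRS has not stopped cleanly yet.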
(c) Check to see if CoD (Capacity on Demand) is configured. Issue the following command and make a note of the total number of active physical cores. If the count is 36 then CoD is not in use. If the compute node which requires the motherboard replacement is down, run this command on another compute node in the same cluster. Record the result.
# dbmcli -e LIST DBSERVER attributes coreCount detail
coreCount: 36
Check to see if IaaS is configured. If ON, this will have to be re-enabled after rebooting. If a blank line is returned then IaaS is not configured.
# dbmcli -e list dbserver ATTRIBUTES iaasMode detail
iaasMode: ON
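Because both values must be compared again after the replacement, it can help to capture them in a consistent form. The following is a minimal sketch that extracts the value from the "name: value" lines shown above; the helper name get_attr is an assumption, and the input here is sample text rather than live dbmcli output.

```shell
# Sketch only: print the value of attribute $1 from "name: value" lines on stdin.
# Prints nothing when the attribute is absent (for example, a blank line).
get_attr() {
    awk -v key="$1" -F':' '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
        }
        name == key {
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            print val
            exit
        }
    '
}

# On the node this would be, for example:
#   dbmcli -e LIST DBSERVER attributes coreCount detail | get_attr coreCount
# Sample input is used here so the sketch runs anywhere.
cores=$(printf '\t coreCount:\t36\n' | get_attr coreCount)
iaas=$(printf '\n' | get_attr iaasMode)
echo "coreCount=${cores:-unknown} iaasMode=${iaas:-not configured}"
```

Redirecting the final line to a file on the engineer's laptop keeps the recorded values available for the post-installation checks.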
(d) The customer or the field engineer can now shut down the server operating system:
Linux:
# shutdown -hP now
(e) The field engineer can now slide out the server for maintenance. Remember to disconnect the power cords before opening the top of the server.
For Compute Node running OVM proceed as follows:
If there are any concerns engage an EEST engineer. The customer should perform the following:
(a) See what user domains are running (record the result). Connect to the management domain (domain zero, or dom0). This is an example with just two user domains and the management domain Domain-0:
# xm list
Name              ID   Mem VCPUs  State   Time(s)
Domain-0           0  8192     4  r-----  409812.7
dm01db01vm01       8  8192     2  -b----  156610.6
dm01db01vm02       9  8192     2  -b----  152169.8
Connect to each domain using the command:
# xm console domainname
where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples. Shut down any instances of CRS in all user domains; refer to the example in the previous section "Shutdown crs".
Note: Omit the following command for OVM as it is not required.
# $ORACLE_HOME/bin/crsctl disable crs
Press CTRL+] to disconnect from the console.
(b) Shut down all user domains from dom0:
# xm shutdown -a -w
(c) See what user domains are running (should be only Domain-0):
# xm list
(d) Disable autostart of user domains during Domain-0 boot:
# chkconfig xendomains off
(e) Check to see if CoD (Capacity on Demand) is configured. Issue the following command and make a note of the total number of active physical cores. If the count is 36 then CoD is not in use. If the compute node which requires the motherboard replacement is down, run this command on another compute node in the same cluster. Record the result.
# dbmcli -e LIST DBSERVER attributes coreCount detail
coreCount: 36
Check to see if IaaS is configured. If ON, this will have to be re-enabled after rebooting. If a blank line is returned then IaaS is not configured.
# dbmcli -e list dbserver ATTRIBUTES iaasMode detail
iaasMode: ON
(f) The customer or the field engineer can now shut down the server operating system:
# shutdown -hP now
(g) The field engineer can now slide out the server for maintenance. Remember to disconnect the power cords before opening the top of the server.
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:
Note: The removal/insertion CPU tool is new for the Ivy Bridge M3 product lines. If you have not used this new tool before please make yourself familiar before
NOTE: Pull power cords before opening the top cover to avoid an SP degraded condition.
2. Carefully follow the port numbers on the cables when re-attaching so they are not reversed. It is easiest to plug cables in while the server is in the fully extended
Post-Installation Steps:
TLI MOS Note 1280913.1
-> set SESSION mode=restricted
WARNING: The "Restricted Shell" account is provided solely to allow Services to perform diagnostic tasks.
(c) Review the current PSNC containers with the “showpsnc” command:
[(restricted_shell) exdx5-tvp-a-db1-sp:~]# showpsnc
Primary:  fruid:///SYS/DBP
Backup 1: fruid:///SYS/MB
Backup 2: fruid:///SYS/PS0
Element           | Primary           | Backup1           | Backup2
------------------+-------------------+-------------------+-------------------
PPN                 7090664             7090664             7090664
PSN                 1450NM104V          1450NM104V          1450NM104V
Product Name        ORACLE SERVER X5-2  ORACLE SERVER X5-2  ORACLE SERVER X5-2
[(restricted_shell) exdx5-tvp-a-db1-sp:~]#
[(restricted_shell) db02-ilom:~]# exit
exit
->
(d) Where there is at least one container which still contains valid TLI information (usually the primary disk backplane DBP0), a service mode command
-> cd /SP/users
-> create sunny role=aucros (will ask for password)
ii. Gather “version”, “show /SYS” and “show /SP/clock” outputs needed for generating the service mode password:
-> version
.............. .............. ..............
iii. Generate a service mode password using “http://modepass.us.oracle.com/”. Login is via Oracle Single-Sign-On. Example output of the tool is:
BRAND : sun
MODE : service
VERSION : 3.2.4.10
SERIAL : 00000000
UTC DATE : 05/20/2013 16:00
POP DOLL PHI TOW BRAN TAUT FEND PAW SKI SCAR BURG CEIL MINT DRAB KAHN FIR MAGI LEAF LIMB EM LAWS BRAE DEAL BURN GOAL HEFT HEAR KEY SEE A
iv. Log out of root and log back in as the 'sunny' user that you created, and enter service mode:
-> set SESSION mode=service
Password:*** **** *** *** **** **** **** *** *** **** **** **** **** **** **** *** **** **** **** ** **** **** **** ****
v. Correct the invalid containers using the “copypsnc” command:
-> copypsnc
Number of arguments is incorrect.
Usage: copypsnc [-n] <src> <dest>
where <src> is PRIMARY|BACKUP1|BACKUP2
<dest> is PRIMARY|BACKUP1|BACKUP2
-n: If src is a bilingual FRU, copy from new-style record.
PRIMARY: fruid:///SYS/DBP0
BACKUP1: fruid:///SYS/MB
BACKUP2: fruid:///SYS/PS0
-> copypsnc BACKUP1 PRIMARY
The copypsnc command produces no output upon success.
-> reset /SP
-> delete /SP/users/sunny
Exadata X5-2 requires minimum image 12.1.2.1.0. This image automatically checks the version of ILOM/BIOS on the motherboard and will attempt to correct it if required.
Before power up it is necessary to configure the ILOM network settings. Obtain the ILOM network values from the customer or the previously saved SP.config file. Log in to the ILOM, then "-> cd /SP/network" and apply the settings. If this step is not performed the ILOM will fail to re-flash.
i. Power up the system using the front button or from ILOM "-> start /SYS"
Note: If you see the following warning then the firmware update has failed. You will need to manually update the firmware.
[WARNING] Firmware updates were already tried once and likely failed.
[WARNING] To retry the firmware update reboot or try to update the firmware manually.
then please delete /opt/oracle.cellos/TRIED_FW_UPDATE_ONCE and re-run /opt/oracle.SupportTools/CheckHWnFWProfile -U /opt/oracle.cellos/iso/cellbits
Log in as root to the node after it is booted, and run the following:
# /opt/oracle.SupportTools/CheckHWnFWProfile -U /opt/oracle.cellos/iso/cellbits
NOTE: The above command will do a similar update to the Cell automatic update method. The server will be powered off during this, ILOM will reset,
and after 10 minutes of being off to allow ILOM reset and BIOS flash update, the server host will be automatically powered back on. See Example output below:-
[root@gmpadb04 cellbits]# /opt/oracle.SupportTools/CheckHWnFWProfile -U /opt/oracle.cellos/iso/cellbits
Now updating the ILOM and the BIOS ...
3. Restore the backed up SP configuration done during the pre-installation steps.
-> cd /SP/config
-> set passphrase=welcome1
-> set load_uri=scp://root:password@laptop_IP/var/tmp/SP.config
If SP backup was not possible, check with the customer for network information and use another ILOM within the rack for general settings. The primary specific
-> set /SP/serial/external pendingspeed=115200
-> set /SP/serial/external commitpending=true
-> set /SP/serial/host pendingspeed=115200
-> set /SP/serial/host commitpending=true
-> show /SP
Properties:
check_physical_presence = false
current_hostname = exdx5-tvp-a-db1-sp
hostname = exdx5-tvp-a-db1-sp
reset_to_defaults = none
system_contact = (none)
system_description = ORACLE SERVER X5-2, ILOM v3.2.4.10, r94551
system_identifier = Exadata Database Machine X5-2 AK00268428
system_location = (none)
-> set /SP/users/root password=welcome1 (or customer's password)
Changing password for user /SP/users/root...
Enter new password again: ********
New password was successfully set for user /SP/users/root
(b) Reset the ILOM under the Maintenance Tab or from the ILOM CLI:
-> reset /SP
(c) Check you can login to all interfaces and ILOM can be accessed using a browser and ssh from another system on the customer's management network.
hardware/firmware profile will be checked, and either a green “Passed” will be displayed, or a red “Warning” that something with the hardware or firmware does
# ifconfig eth0 (for each eth1/bondeth0 etc.)
OR
# ipmitool sunoem cli "show /SYS/MB/NET0" (for each NIC NET0/1/2/3)
OR from ILOM
-> show /SYS/MB/NET0
Verify that the management network is working:
# ethtool eth0 | grep det
Link detected: yes
Verify that the ILOM management network is working:
# ipmitool sunoem cli 'show /SP/network' | grep ipadd
ipaddress = 192.168.1.108
pendingipaddress = 192.168.1.108
[root@db01 ~]# ping -c 3 192.168.1.108
PING 192.168.1.108 (192.168.1.108) 56(84) bytes of data.
64 bytes from 192.168.1.108: icmp_seq=1 ttl=64 time=0.625 ms
64 bytes from 192.168.1.108: icmp_seq=2 ttl=64 time=0.601 ms
64 bytes from 192.168.1.108: icmp_seq=3 ttl=64 time=0.606 ms
--- 192.168.1.108 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 3199ms
rtt min/avg/max/mdev = 0.601/0.608/0.625/0.026 ms
Verify that all memory is present in Linux. Sun Server X5-2 DB nodes have 256GB.
# grep MemTotal /proc/meminfo
MemTotal: 264152848 kB - this may vary depending on BIOS version.
Verify the disks are visible and online:
# /opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0 | grep "Slot\|Firmware state"
Output from Exadata X5-2 DB node with dual-boot option:
Slot Number: 0
Firmware state: Online, Spun Up
Slot Number: 1
Firmware state: Online, Spun Up
Slot Number: 2
Firmware state: Online, Spun Up
Slot Number: 3
Firmware state: Online, Spun Up
Verify the hardware logical volume is correctly set up:
# /opt/MegaRAID/MegaCli/MegaCli64 -LdInfo -lAll -a0
Output for Exadata X5-2 DB nodes:
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :DBSYS
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
Size : 1.633 TB
Physical Sector Size: 512
Logical Sector Size : 512
VD has Emulated PD : No
Parity Size : 557.861 GB
State : Optimal
Strip Size : 1.0 MB
Number Of Drives : 4
Span Depth : 1
Creation Date : 14-11-2014
Creation Time : 04:14:18 PM
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disabled
Encryption Type : None
Bad Blocks Exist: No
PI type: No PI
Is VD Cached: No
Exit Code: 0x00
Verify HW Profile is operating correctly.
# /opt/oracle.SupportTools/CheckHWnFWProfile
[SUCCESS] The hardware and firmware matches supported profile for server=ORACLE_SERVER_X5-2
If there are any errors, they will need to be corrected.
(b) Verify the InfiniBand connections are up and actively seen in the fabric:
# cd /opt/oracle.SupportTools/ibdiagtools
# ./verify-topology (options to verify-topology may be required depending on configuration)
# ibstatus (Looking for both link ports up and active at 40Gb/s (4X QDR))
# ibdiagnet (Looking for any fabric errors that might suggest a link or cabling failure)
# ibnetdiscover (Looking for ability to see all expected switches and other DB nodes and cells in the IB fabric)
(c) Verify server functionality:
To confirm if this is an Eighth rack config, ask the customer to log in to a storage cell which is part of this cluster and issue the following command:
# cellcli -e list cell attributes eighthRack detail
eighthRack: TRUE
If a blank line is displayed then this is not an Eighth rack config. The above shows an Eighth rack has been detected.
(root)# /opt/oracle.SupportTools/resourcecontrol -show
[INFO] Validated hardware and OS. Proceed.
[SHOW] Number of cores active per socket: 18
[SHOW] Total number of cores active: 36
(root)#
For an eighth rack configuration, you should see 18 cores enabled. If that's what you see, then there are no configuration changes needed and the rest of this
(root)# /opt/oracle.SupportTools/resourcecontrol -core 18
Reboot the host:
# reboot
(c) After the node reboots, verify the changes are now made:
(root)# /opt/oracle.SupportTools/resourcecontrol -show
[INFO] Validated hardware and OS. Proceed.
[SHOW] Number of cores active per socket: 9
[SHOW] Total number of cores active: 18
(root)#
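The eighth rack check comes down to reading one number from the resourcecontrol output. A minimal sketch of that comparison follows, assuming the "[SHOW] Total number of cores active:" line format shown above; the function name total_active_cores is hypothetical.

```shell
# Sketch only: print the total active core count found in
# `resourcecontrol -show` output read from stdin.
total_active_cores() {
    sed -n 's/.*Total number of cores active:[ ]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# On the node this would be:
#   /opt/oracle.SupportTools/resourcecontrol -show | total_active_cores
expected=18   # eighth rack value from the text above
actual=$(total_active_cores <<'EOF'
[INFO] Validated hardware and OS. Proceed.
[SHOW] Number of cores active per socket: 9
[SHOW] Total number of cores active: 18
EOF
)

if [ "$actual" = "$expected" ]; then
    echo "core count OK ($actual)"
else
    echo "core count mismatch: expected $expected, got ${actual:-none}"
fi
```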
If the compute node is configured for CoD, restore the config if required. Check the reported value is the same as previously recorded.
# dbmcli -e LIST DBSERVER attributes coreCount detail
coreCount: 36
If the value is incorrect then adjust the core count:
# dbmcli -e ALTER DBSERVER pendingCoreCount = new_number_of_active_physical_cores
Verify the pending number of active physical cores using the following command:
# dbmcli -e LIST DBSERVER attributes pendingCoreCount
Now reboot the compute node.
If IaaS is configured:
Verify that all physical cores are active using the following command:
# dbmcli -e LIST DBSERVER attributes coreCount detail
Enable IaaS:
# dbmcli
DBMCLI> ALTER DBSERVER iaasMode = "on"
DBMCLI> list dbserver ATTRIBUTES iaasMode detail
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:
For compute nodes NOT running OVM, proceed as follows. If running OVM, see the later section "For Compute Node running OVM".
You can now hand the system back to the customer DBA to check all ASM or DB CRS services can be brought up and are online before obtaining sign-off. This step
1. Start up CRS and re-enable autostart of CRS. After the OS is up, the customer DBA should validate that CRS is running. As root execute:
# . oraenv
# $ORACLE_HOME/bin/crsctl start crs
Now re-enable autostart:
# $ORACLE_HOME/bin/crsctl enable crs
# <GI_HOME>/bin/crsctl enable crs
where GI_HOME environment variable is typically set to “/u01/app/11.2.0/grid” but will depend on the customer's environment.
# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
2. Validate that instances are running:
# ps -ef |grep pmon
It should return a record for the ASM instance and a record for each database.
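To make the pmon check easier to compare against the expected instance list, the instance names can be pulled out of the process names. This is a minimal sketch that relies on the usual asm_pmon_<SID> and ora_pmon_<SID> process naming; the function name and the database SID dbm011 in the sample are illustrative assumptions.

```shell
# Sketch only: print the instance name of every pmon process found in
# `ps -ef` output read from stdin (one name per line).
list_pmon_instances() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^(asm|ora)_pmon_/) {
                sub(/^(asm|ora)_pmon_/, "", $i)
                print $i
            }
        }
    }'
}

# On the node this would be: ps -ef | list_pmon_instances
# The sample listing below is made up for illustration.
list_pmon_instances <<'EOF'
grid      5120     1  0 10:05 ?        00:00:02 asm_pmon_+ASM1
oracle    6230     1  0 10:07 ?        00:00:04 ora_pmon_dbm011
root      7001  6990  0 10:30 pts/0    00:00:00 grep pmon
EOF
```

The grep line is skipped because its arguments do not start with asm_pmon_ or ora_pmon_.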
For Compute Node running OVM
If the customer requires assistance please ask them to contact an EEST engineer or the parent case owner.
Once the compute node has booted, re-enable user domains to autostart during Domain-0 boot:
# chkconfig xendomains on
Start up all user domains that are marked for auto start:
# service xendomains start
See what user domains are running (compare against the previously collected result):
# xm list
If any did not auto-start, start up a single user domain:
# xm create -c /EXAVMIMAGES/GuestImages/DomainName/vm.cfg
Check that CRS has started in the user domains; refer to previous section "DB Node Startup Verification".
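Comparing the "xm list" output recorded before shutdown with the current output can be scripted when there are many user domains. The following is a minimal sketch, assuming both outputs were saved to text files; the function names, file names and the "after" listing are hypothetical.

```shell
# Sketch only: print user domain names (everything except the header line and
# Domain-0) from `xm list` output read from stdin.
xm_domain_names() {
    awk 'NR > 1 && $1 != "Domain-0" { print $1 }'
}

# Print domains present in the saved listing ($1) but absent from the current one ($2).
missing_domains() {
    xm_domain_names < "$1" | while read -r name; do
        if ! xm_domain_names < "$2" | grep -qxF -- "$name"; then
            echo "$name"
        fi
    done
}

# Demonstration with sample data; on dom0 the second file would come from: xm list
workdir=$(mktemp -d)
cat > "$workdir/before.txt" <<'EOF'
Name              ID   Mem VCPUs  State   Time(s)
Domain-0           0  8192     4  r-----  409812.7
dm01db01vm01       8  8192     2  -b----  156610.6
dm01db01vm02       9  8192     2  -b----  152169.8
EOF
cat > "$workdir/after.txt" <<'EOF'
Name              ID   Mem VCPUs  State   Time(s)
Domain-0           0  8192     4  r-----  120.3
dm01db01vm01       1  8192     2  -b----  12.4
EOF

for dom in $(missing_domains "$workdir/before.txt" "$workdir/after.txt"); do
    echo "user domain not running: $dom"
done
rm -rf "$workdir"
```

Each name reported is a candidate for the manual "xm create" step above.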
PARTS NOTE: 7098505 System Board Assembly for Exadata X5-2 DB Nodes (Sun Server X5-2)
Service Manuals:
MB Serial Number Reprogramming:
Attachments
This solution has no attachment