Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction (Sure Solution)

Document 1599846.1: How to replace a Motherboard in a server in an Exadata Database Machine [X4-2]

Canned Action Plan for replacing a Motherboard in a server in an Exadata Database Machine [X4-2]
Oracle Confidential PARTNER - Available to partners (SUN). Reason: CAP

Applies to:
Zero Data Loss Recovery Appliance X4 Hardware - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Exadata X3-8 Hardware - Version All Versions and later
Exadata X4-2 Eighth Rack - Version All Versions and later
Exadata X4-2 Half Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

Canned Action Plan for replacing a Motherboard in a server in an Exadata Database Machine [X4-2]

Solution

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Exadata trained

TIME ESTIMATE: 120 Minutes

TASK COMPLEXITY: 3
FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:

PROBLEM OVERVIEW: A server in an Exadata Database Machine requires the motherboard to be replaced. This procedure is specific to Exadata X4-2 systems based on the Sun Server X4-2 and Sun Server X4-2L.

Connectivity to the rack will depend on the customer's access requirements. Parts of the following procedure require a serial connection and network access to the ILOM, which assumes a laptop attached to the Cisco management switch. If no port is available in a full rack, temporarily disconnect a port used for another host's ILOM (e.g. port 2). If the customer does not allow login access to the host ILOM, then they will need to run the commands given below.

When connecting to the ILOM via serial cable, remember that the baud rate is 9600 on replacement boards. This is changed during the post-install procedure to the Exadata default of 115200 used on installed boards.

For brevity, MB = Motherboard in the text below.

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

Pre-Install Steps:

1. Back up the ILOM settings. Assuming the ILOM is not the reason for the replacement of the system MB, take a current backup of the ILOM SP configuration using a browser under the "ILOM Administration -> Configuration Management" tab on the left menu list. This can also be done from the ILOM CLI as follows:

   -> cd /SP/config
   -> set passphrase=welcome1
   -> set dump_uri=scp://root:password@laptop_IP/var/tmp/SP.config

2. Obtain the correct serial numbers required.
   (a) Make a note of the System Serial Number from the front label of the server.

3. If the system is not already down due to whatever problem is causing the MB to be replaced, have the customer DBA shut the node down.

For DB Nodes:
   (a) For extended information on this section, see MOS Note 1093890.1 "Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration" ( https://support.oracle.com/epmos/faces/ui/km/SearchDocDisplay.jspx?id=1093890.1&type=DOCUMENT ) and the Owner's Guide maintenance chapter: http://amomv0115.us.oracle/archive/cd_ns/E13877_01/doc/doc.112/e13874/maintenance.htm#autoId18
   (b) The customer should shut down CRS services prior to powering down the DB node:
       i. As the root user, run the following:

          # . oraenv
          ORACLE_SID = [root] ? +ASM1
          The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

          # $ORACLE_HOME/bin/crsctl stop crs

          or

          # <GI_HOME>/bin/crsctl stop crs

          where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but will depend on the customer's environment.

       ii. Validate CRS is down cleanly. There should be no processes running:

          # ps -ef | grep css
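          If CRS is fully down, the only line returned should be the grep process itself; illustrative output (the PID and timestamp below are examples only):

          root      8233  7878  0 10:01 pts/0    00:00:00 grep css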
   (c) The customer or the field engineer can now shut down the server operating system:

       Linux:   # shutdown -hP now
       Solaris: # shutdown -y -i 5 -g 0
   (d) The field engineer can now slide out the server for maintenance. Remember to disconnect the power cords before opening the top of the server for compute nodes based on X4-2 servers.

For Storage Cells:
   (a) For extended information on this section, see MOS Note 1188080.1 "Steps to shut down or reboot an Exadata storage cell without affecting ASM" ( https://support.oracle.com/epmos/faces/ui/km/SearchDocDisplay.jspx?id=1188080.1&type=DOCUMENT ). This is also documented in the Exadata Owner's Guide, chapter 7, section "Maintaining Exadata Storage Servers", subsection "Shutting Down Exadata Storage Server", available on the customer's cell server image in /opt/oracle/cell/doc.

       In the following examples, the SQL commands should be run by the customer's DBA prior to doing the hardware replacement. The field engineer should run them only if the customer directs them to, or is unable to do so. The CellCLI commands will need to be run as root.

   (b) ASM drops a disk shortly after it is taken offline. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but may have been changed by the customer. To check this parameter, have the customer log into ASM and perform the following query:

       SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;
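       Illustrative output only (the diskgroup names below are examples; the customer's names and values will differ):

       NAME                           VALUE
       ------------------------------ ----------
       DATA                           3.6h
       RECO                           3.6h
       DBFS_DG                        3.6h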
       As long as the value is large enough to comfortably replace the components being replaced, there is no need to change it.

   (c) Check that ASM will be OK if the grid disks go OFFLINE:

       # cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
       ...snippet...
       DATA_CD_09_cel01 ONLINE Yes
       DATA_CD_10_cel01 ONLINE Yes
       etc....

       If one or more disks return asmdeactivationoutcome='No', then wait for some time and repeat this check. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.

   (d) Run the CellCLI command to inactivate all grid disks on the cell that needs to be powered down for maintenance (this could take up to 10 minutes or longer):

       # cellcli
       ...
       CellCLI> ALTER GRIDDISK ALL INACTIVE
       GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
       GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
       ...etc...

   (e) Execute the command below; once the disks are offline and inactive in ASM, the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome=Yes for all grid disks:

       CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
       ...
       DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
       DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
       ...etc...

   (f) Once all disks are offline and inactive, the customer or field engineer may shut down the cell (a quick final check is sketched below) using the following command:

       # shutdown -hP now
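       The quick final check mentioned in step (f): a one-liner sketch, to run before issuing the shutdown, which should print nothing if every grid disk has left the ONLINE state (-w avoids matching the substring inside 'OFFLINE'):

       # cellcli -e list griddisk attributes name,asmmodestatus | grep -w ONLINE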
   (g) The field engineer can now slide out the server for maintenance. Remember to disconnect the power cords before opening the top of the server. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arms.
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

Note: The CPU removal/insertion tool is new for the Ivy Bridge M3 product lines. If you have not used this tool before, familiarize yourself with it before attempting to use it on-site. The tool is not intuitive, so reference the service manual before attempting this service action.

Reference link for Service Manuals:
X4-2L Cells: http://docs.oracle.com/cd/E36974_01/html/E38145/z40001d31037512.html#scrolltoc
Physical Replacement Steps:

1. Replace the MB as per MOS Note 1592250.1 (X4-2) or 1592281.1 (X4-2L), migrating the existing CPUs, DIMMs, PCI cards and risers.
   https://mosemp.us.oracle.com/epmos/faces/ui/km/DocumentDisplay.jspx?id=1592250.1 (X4-2 / DB Node)
   https://mosemp.us.oracle.com/epmos/faces/ui/km/DocumentDisplay.jspx?id=1592281.1 (X4-2L / Storage Cell Node)
NOTE:- On Storage Cells, remember to move the internal USB stick onto the new board.
NOTE:- Pull the power cords before opening the top cover to avoid an SP degraded condition.
2. Carefully follow the port numbers on the cables when re-attaching so they are not reversed. It is easiest to plug the cables in while the server is in the fully extended maintenance position.

3. Do not power up the system yet, just the ILOM.
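   Once the power cords are reconnected, the SP boots on standby power. A quick, optional way to confirm the ILOM is up before proceeding (the hostname below is an example; waiting for the ILOM login prompt on the serial console works equally well):

   # ping -c 3 db02-ilom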
Post-Installation Steps:

1. Update the serial number on the new MB to that of the server chassis. This is REQUIRED in order for ASR to continue to work on the unit, and is required for all servers that are part of Exadata racks that may have a future Service Request, whether ASR is configured now or not.

   These platforms use the Top Level Indicator (TLI) feature in ILOM to perform the MB serial number update automatically. In certain circumstances this may not work correctly and will need to be corrected manually. For more information on TLI and restricted shell, refer to the following two MOS notes for these systems:
   TLI: MOS Note 1280913.1
   Restricted Shell: MOS Note 1302296.1

   NOTE: The serial number of each server can be found at the front on the left-hand side.

   (a) Connect to the ILOM via the serial port, and log in as "root" with the default password "changeme".
   (b) Enter restricted mode:

       -> set SESSION mode=restricted
       WARNING: The "Restricted Shell" account is provided solely to allow Services to perform diagnostic tasks.

   (c) Review the current PSNC containers with the "showpsnc" command:

       [(restricted_shell) db02-ilom:~]# showpsnc
       Primary:  fruid:///SYS/DBP0
       Backup 1: fruid:///SYS/MB
       Backup 2: fruid:///SYS/PS0

       Element           | Primary           | Backup1           | Backup2
       ------------------+-------------------+-------------------+-------------------
       PPN                 7073164             7073164             7073164
       PSN                 1336FML0LJ          1336FML0LJ          1336FML0LJ
       Product Name        SUN SERVER X4-2     SUN SERVER X4-2     SUN SERVER X4-2

       [(restricted_shell) db02-ilom:~]#

       If the replacement has the correct product serial number in all 3 containers, including "/SYS/MB", then skip to step 2 of the post-replacement procedures. If the replacement does not have the product serial number populated correctly, then "exit" out of restricted shell mode and continue:

       [(restricted_shell) db02-ilom:~]# exit
       exit
       ->

   (d) Where at least one container still contains valid TLI information (usually the primary disk backplane DBP0), the service mode command "copypsnc" can be used to update the product serial number.
       i. Log in as root and create an escalation mode user with the service role:

          -> cd /SP/users
          -> create sunny role=aucros
          (will ask for a password)

       ii. Gather the "version", "show /SYS" and "show /SP/clock" outputs, needed for generating the service mode password:

          -> version
          SP firmware 3.1.2.10
          SP firmware build number: 74387
          SP firmware date: Tue Jun 19 15:08:47 EDT 2012
          SP filesystem version: 0.1.23

          -> show /SYS
          -> show /SP/clock

       iii. Generate a service mode password using http://modepass.us.oracle.com/ (login is via Oracle Single Sign-On). Example output of the tool is:

          BRAND : sun
          MODE : service
          VERSION : 3.1.2.10
          SERIAL : 00000000
          UTC DATE : 05/20/2013 16:00

          POP DOLL PHI TOW BRAN TAUT FEND PAW SKI SCAR BURG CEIL MINT DRAB KAHN FIR MAGI LEAF LIMB EM LAWS BRAE DEAL BURN GOAL HEFT HEAR KEY SEE A

       iv. Log out of root, log back in as the 'sunny' user that you created, and enter service mode:

          -> set SESSION mode=service
          Password:*** **** *** *** **** **** **** *** *** **** **** **** **** **** **** *** **** **** **** ** **** **** **** **** **** **** **** *** *** *
          Short form password is: ARMY ULAN HULL
          Currently in service mode.

       v. Correct the invalid containers using the "copypsnc" command:

          -> copypsnc
          Number of arguments is incorrect.
          Usage: copypsnc [-n] <src> <dest>
                 where <src> is PRIMARY|BACKUP1|BACKUP2
                       <dest> is PRIMARY|BACKUP1|BACKUP2
                 -n: If src is a bilingual FRU, copy from new-style record.
          PRIMARY: fruid:///SYS/DBP0

          -> copypsnc BACKUP1 PRIMARY

          The copypsnc command produces no output upon success.

       vi. After running copypsnc, the service processor should be rebooted:

          -> reset /SP
       vii. Log in again as the 'root' user with the default password 'changeme' and verify the serial number is now populated correctly using 'show /SYS' and 'showpsnc' as shown above.

       viii. Remove the 'sunny' user:

          -> delete /SP/users/sunny
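          Optionally, confirm the account was removed by listing the remaining SP users (illustrative check):

          -> show /SP/users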
       If there are any issues with programming the serial number with "copypsnc", then an escalation mode password and instructions will need to be provided by the TSC x86 engineer assigned to the SR.

2. Re-flash the ILOM/BIOS to the correct levels required for Exadata.

   Exadata Storage Cells using X4-2L servers: The image contains the firmware and will automatically re-flash it during boot if it is not correct. You do not need to do any flash updates manually.
   (a) Power up the system using the front button or from ILOM: "-> start /SYS"
   (b) During the boot validation phase, CheckHWnFWProfile will run, determine that the ILOM is not correct, and automatically flash it. The server will be powered off during this, the ILOM will reset, and after about 10 minutes off (to allow the ILOM reset and BIOS flash update) the server host will be automatically powered back on. It is recommended to stay connected to the serial console and monitor the host console through ILOM to verify this completes successfully.
   Exadata DB Nodes using X4-2 servers: The DB nodes do not automatically update, nor do they require an Exadata-specific version of ILOM/BIOS. Best practice for deciding when, and to what version, to update firmware on database nodes after parts replacement is one of the following three options, in order of preference:
   (a) Check with the customer first. If they have a specific version they are using, ask that they update to that version themselves using their normal firmware update utility and bits, and help them with how if they need it.
   (b) If they have no preference, check and update to the latest available on the Oracle download site for that part, using normal manual ILOM update methods if one is available. Use "/opt/oracle.SupportTools/CheckHWnFWProfile -s" to show the versions being checked for by the image.
   (c) If no version newer than what the checks expect is available, use the image bits/utilities to update to the Exadata tested/approved image version manually. The image contains the firmware but will NOT automatically re-flash it during boot if it is not correct.
       i. Power up the system using the front button or from ILOM: "-> start /SYS"
       ii. During the boot validation phase, CheckHWnFWProfile will run, determine that the ILOM is not correct, and give a WARNING, but no action is taken. Log in to the node as root after it is booted, and run the following:

          # /opt/oracle.SupportTools/CheckHWnFWProfile -U /opt/oracle.cellos/iso/cellbits
          NOTE: The above command performs an update similar to the cell automatic update method. The server will be powered off during this, the ILOM will reset, and after about 10 minutes off (to allow the ILOM reset and BIOS flash update) the server host will be automatically powered back on. See the example output below:

          [root@gmpadb04 cellbits]# /opt/oracle.SupportTools/CheckHWnFWProfile -U /opt/oracle.cellos/iso/cellbits
          Broadcast message from root (Thu Dec 9 00:57:48 2010):
          The system is going down for system halt NOW!
3. Restore the SP configuration backed up during the pre-installation steps.
   (a) Using a browser under the Maintenance tab, or from the ILOM CLI:

       -> cd /SP/config
       -> set passphrase=welcome1
       -> set load_uri=scp://root:password@laptop_IP/var/tmp/SP.config

       If an SP backup was not possible, check with the customer for network information and use another ILOM within the rack for the general settings. The primary Exadata-specific settings are:
       i. Serial baud rate is 115200 for external and host
       ii. /SP system_identifier is set to the appropriate rack type string and master Rack Serial Number. This is critical for ASR deployments. The master rack serial number can be obtained from the label at the top left inside the cabinet, or from "show /SP" on any other ILOM. The string should be of the following format:
          X4-2 - "Exadata Database Machine X4-2 <Rack SN>"
       For example:

       -> show /SP
       Properties:
           check_physical_presence = false
           hostname = edx42bur09cel01-ilom
           reset_to_defaults = none
           system_contact = (none)
           system_description = SUN SERVER X4-2, ILOM v3.1.2.32, r82440
           system_identifier = Exadata Database Machine X4-2 AK00054114
           system_location = (none)

       iii. /SP hostname is set up
       iv. /SP/network settings
       v. /SP/alertmgmt rules that may have been previously set up by ASR or cell configuration
       vi. /SP/clock timezone, datetime, and /SP/clients/ntp NTP settings
       vii. /SP/clients/dns name service settings
       viii. root account password. If the root password has not been provided, you can have the customer set it, or set it manually:

       -> set /SP/users/root password=welcome1 (or the customer's password)
       Changing password for user /SP/users/root...
       Enter new password again: ********
       New password was successfully set for user /SP/users/root

   (b) Reset the ILOM under the Maintenance tab or from the ILOM CLI:

       -> reset /SP
   (c) Check that you can log in to all interfaces, and that the ILOM can be accessed using a browser and ssh from another system on the customer's management network.

4. Power on the host server, go into BIOS setup, and check the BIOS settings against the EIS checklist. In particular, make sure USB is first in the boot order if this is a Storage Cell (the original USB stick should have been moved from the old board to the new board), and check that the date and time are correct. Use either ILOM/ipmi to set the boot device to BIOS ("-> set /HOST boot_device=bios") or press F2 (Ctrl-E) at the right time during BIOS to get into the BIOS setup menu.

5. Power on and boot the system, monitoring the graphics java console through ILOM (or local video if there is a crash cart available). As the system boots, the hardware/firmware profile will be checked, and either a green "Passed" will be displayed, or a red "Warning" that something in the hardware or firmware does not match what is expected. If the check passes, then everything is correct and the boot will continue up to the OS login prompt. If the check fails, then the flagged issue should be investigated and rectified before continuing.

6. Additional OS checks:
   (a) Verify the network interfaces have correctly picked up the new MAC addresses of the new system board (a scripted comparison is sketched at the end of this step):

       # ifconfig eth0                                  (for each eth1/bondeth0 etc.)
       # ipmitool sunoem cli "show /SYS/MB/NET0"        (for each NIC NET0/1/2/3)

       /SYS/MB/NET0
        Targets:
        Properties:
            type = Network Interface
            ipmi_name = MB/NET0
            fru_description = 10G Ethernet Controller
            fru_manufacturer = INTEL
            fru_part_number = X540
            fru_macaddress = 00:10:e0:3e:51:ec
            fault_state = OK
            clear_fault_action = (none)

       Compare this to the network configuration files under:

       /etc/sysconfig/network-scripts/ifcfg-ethX
       where X is 0 (ifcfg-eth0=NET0), 1 (ifcfg-eth1=NET1), 2 (NET2), 3 (NET3), or "ifcfg-bondeth0" if a DB node with bonding. Example of file contents:

       #### DO NOT REMOVE THESE LINES ####
       #### %GENERATED BY CELL% ####
       DEVICE=eth0
       BOOTPROTO=static
       ONBOOT=yes
       IPADDR=10.167.166.90
       NETMASK=255.255.252.0
       NETWORK=10.167.164.0
       BROADCAST=10.167.167.255
       GATEWAY=10.167.164.1
       HOTPLUG=no
       IPV6INIT=no
       HWADDR=00:21:28:46:ef:8a

       If there is any inconsistency with the new MAC addresses or IPs, there should be backup files ending in .bak in the same directory; use these and the ILOM information to update the files with the correct information.
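       The scripted comparison mentioned in step (a): a minimal bash sketch, assuming the standard ifcfg-ethX layout shown above and sysfs availability (adapt the interface range to the node), that prints the MAC recorded in each config file next to the MAC the kernel currently reports:

       # for f in /etc/sysconfig/network-scripts/ifcfg-eth[0-3]; do
       >   dev=$(awk -F= '/^DEVICE=/{print $2}' "$f")           # interface name from the config file
       >   cfg=$(awk -F= '/^HWADDR=/{print tolower($2)}' "$f")  # MAC recorded in the config file
       >   live=$(cat /sys/class/net/$dev/address 2>/dev/null)  # MAC currently reported by the kernel
       >   echo "$dev  config=$cfg  live=$live"                 # the two values should match per interface
       > done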
   (b) Verify the InfiniBand connections are up and actively seen in the fabric. If it is possible to log in to DB01, check the InfiniBand connections are OK by running the following from DB01:

       # cd /opt/oracle.SupportTools/ibdiagtools
       # ./verify-topology    (options to verify-topology may be required depending on configuration)

       If that is not possible for security reasons, then on this local node verify the IB connection status with:
# ibdiagnet (Looking for any fabric errors that might suggest a link or cabling failure) # ibnetdiscover (Looking for ability to see all expected switches and other DB nodes and cells in the IB fabric) (c) Verify server functionality per the EIS checklist server and common check sections (d) If dcli is setup for password-less SSH, then the SSH keys need to be updated for new mac address. The customer should be able to do this using their root password. 6. [Eighth Rack Only] For DB Nodes in Eighth Rack configurations, the configuration requires limiting CPU cores for licensing as follows. Refer to MOS Note 1538561.1 for more details if required. (a) Verify the current configuration with the following command: (root)# /opt/oracle.SupportTools/resourcecontrol -show
   (c) Verify server functionality per the EIS checklist server and common check sections.
   (d) If dcli is set up for password-less SSH, then the SSH keys need to be updated for the new MAC address. The customer should be able to do this using their root password.

7. [Eighth Rack Only] For DB Nodes in Eighth Rack configurations, the configuration requires limiting CPU cores for licensing, as follows. Refer to MOS Note 1538561.1 for more details if required.
   (a) Verify the current configuration with the following command:

       (root)# /opt/oracle.SupportTools/resourcecontrol -show
       [INFO] Validated hardware and OS. Proceed.
       [SHOW] Number of cores active per socket: All
       (root)#

       For an Eighth Rack configuration, you should see 8 cores enabled. If that is what you see, then no configuration changes are needed and the rest of this step should not be used.

   (b) If the output shows "All" cores enabled as shown above, change the configuration with the following command:

       (root)# /opt/oracle.SupportTools/resourcecontrol -core 8
       Then reboot the host.

       Linux:   # reboot
       Solaris: # reboot -p    (use the '-p' option)
   (c) After the node reboots, verify the changes have been made:

       (root)# /opt/oracle.SupportTools/resourcecontrol -show

       (On a Solaris database server node, results are reported as 4 cores per socket, i.e. 8 cores in total.)
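       Illustrative output only; the original note does not reproduce the Solaris output, so the lines below are an assumption modeled on the Linux output shown in step (a):

       [INFO] Validated hardware and OS. Proceed.
       [SHOW] Number of cores active per socket: 4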
OBTAIN CUSTOMER ACCEPTANCE

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE?

You can now hand the system back to the customer DBA to check that all ASM or DB CRS services can be brought up and are online before obtaining sign-off. This step may take more than 10 minutes to complete, depending on the current load on the database. See the detailed information below. If the customer DBA requires assistance beyond this, direct them to call back the parent SR owner in EEST.

DB Node Startup Verification:

1. CRS services should start automatically during the OS boot. After the OS is up, the customer DBA should validate that CRS is running. As root, execute:

   # . oraenv
   # $ORACLE_HOME/bin/crsctl check crs

   where the GI_HOME environment variable is typically set to "/u01/app/11.2.0/grid" but will depend on the customer's environment. The "1" of "+ASM1" (entered at the oraenv prompt, as in the shutdown steps above) refers to the DB node number; for example, on DB node #3 the value would be +ASM3. Example output when all is online:

   # /u01/app/11.2.0/grid/bin/crsctl check crs
   CRS-4638: Oracle High Availability Services is online
   CRS-4537: Cluster Ready Services is online
   CRS-4529: Cluster Synchronization Services is online
   CRS-4533: Event Manager is online

2. Validate that instances are running:

   # ps -ef | grep pmon
It should return a record for the ASM instance and a record for each database.
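   Illustrative output (user names, PIDs and SIDs below are examples; expect one asm_pmon line plus one ora_pmon line per database):

   grid      9345     1  0 10:02 ?        00:00:01 asm_pmon_+ASM1
   oracle    9912     1  0 10:04 ?        00:00:03 ora_pmon_dbm011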
Cell Node Startup Verification:

1. Activate the grid disks:

   # cellcli
   CellCLI> ALTER GRIDDISK ALL ACTIVE

   Then issue the command below; all disks should show 'active':

   CellCLI> list griddisk
   DATA_CD_00_dmorlx8cel01 active
   DATA_CD_01_dmorlx8cel01 active
   ...etc...

2. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in the activation process:

   CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
   DATA_CD_00_dmorlx8cel01 active ONLINE Yes
   DATA_CD_01_dmorlx8cel01 active ONLINE Yes
   DATA_CD_02_dmorlx8cel01 active ONLINE Yes
   RECO_CD_00_dmorlx8cel01 active SYNCING Yes
   ...etc...

   Notice in the example above that RECO_CD_00_dmorlx8cel01 is still SYNCING. Oracle ASM synchronization is complete only when ALL grid disks show asmmodestatus=ONLINE. This process can take some time, depending on how busy the machine is now and has been while this individual server was down for repair.
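   A simple way to wait for synchronization to finish is a polling loop like this sketch (the 60-second interval is arbitrary; the loop exits once no grid disk reports SYNCING):

   # while cellcli -e list griddisk attributes name,asmmodestatus | grep -q SYNCING; do
   >   sleep 60    # re-check once a minute
   > done; echo "No grid disks still SYNCING"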
PARTS NOTE:
7058153 System Board Assembly for Exadata X4-2 DB Nodes (Sun Server X4-2)
7058152 System Board Assembly for Exadata X4-2 Storage Cells (Sun Server X4-2L)

REFERENCE INFORMATION:
Service Manuals:
X4-2L Cells: http://docs.oracle.com/cd/E36974_01/html/E38145/gentextid-14743.html#scrolltoc
X4-2 Motherboard Replacement Procedure: MOS Note 1592250.1
X4-2L Motherboard Replacement Procedure: MOS Note 1592281.1
MOS Note 1093890.1: Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration
MOS Note 1188080.1: Steps to shut down or reboot an Exadata storage cell without affecting ASM
MB Serial Number Reprogramming (TLI): MOS Note 1280913.1
Exadata Database Machine Owner's Guide: http://amomv0115.us.oracle/archive/cd_ns/E13877_01/doc/doc.112/e13874/toc.htm
EIS Checklist: http://eis.us.oracle.com/checklists/pdf/Exadata-X4.pdf
Attachments: This solution has no attachment.