Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

2360554.1 : How to Replace an Exadata X7-2 Compute Node Server Motherboard Assembly

Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Exadata internal only for Oracle support engineers use and approved HW partners

Applies to:
Exadata X7-2 Hardware - Version All Versions and later
Zero Data Loss Recovery Appliance X7 Hardware - Version All Versions and later
Information in this document applies to any platform.

Goal
How to Replace an Exadata X7-2 Compute Node Server Motherboard Assembly

Solution

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED:
PROBLEM OVERVIEW: An Exadata X7-2 Compute Node Server Motherboard Assembly needs replacement

IMPORTANT NOTE TO TSC ENGINEER: Cut and paste the “CUSTOMER ACTIVITY” sections of the Pre-Replacement and Post-Replacement steps into an SR Note, and ensure the customer knows to perform these steps before, during, and after the scheduled field engineer activity.

CUSTOMER ACTIVITY:

Offlining the disk cache and shutting down the database node are required before the part replacement.

1. Shut down the database services:
If running Linux or Solaris native - follow Steps 1 to 7 of MOS Note:
If running OVM - follow Steps 1 to 4 of MOS Note:

2. Revert all the RAID disk volumes to WriteThrough mode so that all data in the RAID cache memory is flushed to disk and is not lost if the SuperCap is disconnected. As the 'root' user, set the cache policy of all logical volumes to WriteThrough:

    # /opt/MegaRAID/storcli/storcli64 /c0/vall set wrcache=WT

3. Verify that the current cache policy for all logical volumes is now WriteThrough:

    # /opt/MegaRAID/storcli/storcli64 /c0/vall show
In the volume table, the "Cache" column should report "NRWTD", where WT indicates WriteThrough.

4. Once the database services are stopped and all volumes report WriteThrough, the customer may shut down the database node using the following command:

    # shutdown -hP now
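
The check in step 3 can also be scripted. The following is a minimal Python sketch, not part of the official procedure, that scans saved output of the `/c0/vall show` command and lists any volume whose "Cache" value does not contain WT. The sample text and the function name `non_writethrough_volumes` are illustrative assumptions; the column layout is assumed to match the usual storcli virtual drive table and should be checked against real output from the node.

```python
import re

# Illustrative sample only: assumed to resemble the storcli virtual drive table.
SAMPLE_OK = """\
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
0/0   RAID5 Optl  RW     Yes     NRWTD -   ON  3.270 TB
"""

# Hypothetical output with the volume still in WriteBack mode.
SAMPLE_BAD = SAMPLE_OK.replace("NRWTD", "NRWBD")


def non_writethrough_volumes(storcli_output):
    """Return (volume, cache) pairs whose Cache value lacks 'WT'."""
    cache_idx = None
    offenders = []
    for line in storcli_output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Locate the Cache column from the header row.
        if tokens[0] == "DG/VD" and "Cache" in tokens:
            cache_idx = tokens.index("Cache")
            continue
        # Data rows start with a "group/volume" identifier such as 0/0.
        is_row = re.fullmatch(r"\d+/\d+", tokens[0]) is not None
        if cache_idx is not None and is_row and len(tokens) > cache_idx:
            cache = tokens[cache_idx]
            if "WT" not in cache:
                offenders.append((tokens[0], cache))
    return offenders


if __name__ == "__main__":
    for name, text in (("ok", SAMPLE_OK), ("bad", SAMPLE_BAD)):
        print(name, non_writethrough_volumes(text))
```

An empty list means every listed volume reports WriteThrough; any entry returned should be corrected with the step 2 command before shutdown.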
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

Prepare the Server for Service

The customer should have already prepared the server and powered it off. If not, provide them the instructions in the previous section.

1. Log in to the ILOM, check the fruid container values, and sync them if needed.

To prevent mismatched fruid values from causing a failure after a motherboard replacement, confirm that the fruid containers hold matching data. The motherboard is the Backup1 container, so the Primary (DBP) and Backup2 (PS0) containers must hold valid, identical values for the replacement motherboard's container to be updated to the correct values automatically. Enter restricted mode and use the showpsnc command to check this.

    -> set SESSION mode=restricted
    WARNING: The "Restricted Shell" account is provided solely to allow Services to perform diagnostic tasks.
    [(restricted_shell) exa1dbadm01-ilom:~]# showpsnc
    Primary:  fruid:///SYS/DBP
    Backup 1: fruid:///SYS/MB
    Backup 2: fruid:///SYS/PS0

    Element           | Primary           | Backup1           | Backup2
    ------------------+-------------------+-------------------+-------------------
    PPN                 7338405             7338405             7338405
    PSN                 1735XC3004          1735XC3004          1735XC3004
    Profile             0x00010000          0x00010000          0x00010000
    Product Name        ORACLE SERVER X7-2  ORACLE SERVER X7-2  ORACLE SERVER X7-2
    RFID SN             341A583DE5800000000232F8  341A583DE5800000000232F8  341A583DE5800000000232F8
    [(restricted_shell) exa1dbadm01-ilom:~]# exit

The above example shows a system with all three containers properly in sync. If the output from the system does not show all of the containers with matching values, reset the SP and then re-check the values. An ILOM reset will attempt to auto-populate the matching values if one container is out of sync.

    -> reset /SP
    Are you sure you want to reset /SP (y/n)? y
    Performing reset on /SP

2. After an ILOM reset, if the Primary and Backup2 containers match, proceed with the following steps to replace the motherboard. If these two containers do not match, DO NOT proceed with the replacement yet. Contact TSC for further assistance.

If the containers do not match, you will need to use the "copypsnc" command from service or escalation mode to copy the data from the good container so that the Primary and Backup2 containers match (Backup1 is the motherboard, which is about to be replaced, so it is less important at this step). If you are unfamiliar with this process and require assistance, refer to the steps for using copypsnc to fix the serial number detailed in "How to update product serial number on systems which implement TLI functionality (Doc ID 1280913.1)" and "How to access service mode and escalation mode on ILOM 3.x and later platforms (Doc ID 1019946.1)".

After the fruid data in the Primary and Backup2 containers has been confirmed to match, proceed with the following steps.

3. Back up the current ILOM configuration settings, including fault data history, to an XML file on an external laptop/system, using one of the transfer protocols supported by ILOM 4.x:

    -> cd /SP/config
    -> set include_faultdata=true
    -> set passphrase=motherboard-replacement
    -> set dump_uri=transfer_method://username:password@ipaddress_or_hostname/directorypath/ilom_config_backup.xml

For additional information, refer to https://docs.oracle.com/cd/E81115_01/html/E86149/z40048b81489311.html#scrolltoc

4. Back up the current BIOS configuration parameters to an XML file on an external laptop/system, using one of the transfer protocols supported by ILOM 4.x:

    -> cd /System/BIOS/Config
    -> set dump_uri=transfer_method://username:password@ipaddress_or_hostname/directorypath/bios_config_backup.xml

For additional information, refer to https://docs.oracle.com/cd/E81115_01/html/E86149/z40001541481533.html#scrolltoc

5. Extend the server to the maintenance position.

6. Disconnect the power cords from the power supplies.

7. Attach an anti-static wrist strap to your wrist and to a metal area on the chassis or the rack.

8. Remove the server top cover. Use a Torx T10 screwdriver to unlock the release button latch.
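
The container comparison in steps 1 and 2 above can be sketched in Python for use against a saved copy of the showpsnc output. This is an illustrative aid, not an Oracle tool. The function names `parse_showpsnc` and `mismatches` are hypothetical, and the parser assumes that data columns are separated by at least two spaces, which should be verified against real ILOM output.

```python
import re

# Sample taken from the showpsnc example in this note; column spacing is assumed.
SAMPLE = """\
Primary:  fruid:///SYS/DBP
Backup 1: fruid:///SYS/MB
Backup 2: fruid:///SYS/PS0

Element           | Primary           | Backup1           | Backup2
------------------+-------------------+-------------------+-------------------
PPN                 7338405             7338405             7338405
PSN                 1735XC3004          1735XC3004          1735XC3004
Profile             0x00010000          0x00010000          0x00010000
Product Name        ORACLE SERVER X7-2  ORACLE SERVER X7-2  ORACLE SERVER X7-2
RFID SN             341A583DE5800000000232F8  341A583DE5800000000232F8  341A583DE5800000000232F8
[(restricted_shell) exa1dbadm01-ilom:~]# exit
"""


def parse_showpsnc(text):
    """Map each element name to its Primary/Backup1/Backup2 values."""
    rows = {}
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        # The dashed separator marks the start of the data rows.
        if re.match(r"^-+\+", stripped):
            in_table = True
            continue
        if not in_table or not stripped:
            continue
        # Assumption: columns are separated by two or more spaces.
        cells = re.split(r"\s{2,}", stripped)
        if len(cells) == 4:
            element, primary, backup1, backup2 = cells
            rows[element] = {
                "Primary": primary,
                "Backup1": backup1,
                "Backup2": backup2,
            }
    return rows


def mismatches(rows, containers=("Primary", "Backup2")):
    """Return element names whose values differ across the given containers."""
    return [
        element
        for element, values in rows.items()
        if len({values[c] for c in containers}) != 1
    ]


if __name__ == "__main__":
    parsed = parse_showpsnc(SAMPLE)
    print("Primary vs Backup2 mismatches:", mismatches(parsed))
```

Before the replacement, only Primary and Backup2 need to agree, which is the default comparison. After the replacement, passing all three container names checks that Backup1 on the new motherboard has been populated.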
Removing the Motherboard

Caution - These procedures require that you handle components that are sensitive to electrostatic discharge. This sensitivity can cause the components to fail. To avoid damage, ensure that you follow anti-static practices.
1. Remove the following components and set them aside on an anti-static mat:

Caution - During the motherboard removal procedure, pull the power supplies out only as far as necessary to disengage them from the motherboard, without removing them completely from their chassis slots. If they are removed completely, it is critical to label the power supplies with the slot numbers from which they were removed (PS0, PS1). The power supplies must be reinstalled into the chassis slots from which they were removed because PS0 is a backup container for fruid data, which will be used to verify and update the fruid data on the replacement motherboard. If they are accidentally swapped, manual re-programming of the fruid data will be required. Contact TSC for further assistance with that.
2. Remove the following cables from the motherboard:

a. Remove the SAS cables and the super capacitor cable that are connected to the internal HBA card, and then carefully lift them from the left-side cable trough and set them out of the way.

3. Remove the motherboard from the server chassis with all reusable components that populate the motherboard in place.

4. Remove the coin cell battery from the motherboard and re-install it on the replacement motherboard.

5. Remove the DDR4 DIMMs from the motherboard and re-install them into the corresponding slots on the replacement motherboard.

Note - Install the DIMMs only in the sockets (connectors) that correspond to the sockets from which they were removed. Performing a one-to-one replacement of the DIMMs significantly reduces the possibility that the DIMMs will be installed in the wrong slots. If you do not reinstall the DIMMs in the same sockets, server performance might suffer and some DIMMs might not be used by the server.
6. Remove the CPUs from the failed motherboard.

7. Remove the CPU socket covers from the replacement motherboard and install the CPUs into the replacement motherboard.

a. Grasp the CPU socket cover finger grips (labeled REMOVE) and lift the socket cover up and off the processor socket.

8. Install the CPU socket covers onto the CPU sockets of the faulty motherboard.

Caution - The CPU socket covers must be installed on the faulty motherboard; otherwise, damage might result to the CPU sockets during handling and shipping, preventing motherboards from being repairable.
a. Align the CPU socket cover over the CPU socket alignment posts. Install the CPU socket cover by firmly pressing down on all four corners (labeled INSTALL) on the socket cover.
Installing the Motherboard

1. Attach an anti-static wrist strap to your wrist, and then to a metal area on the chassis.

2. Insert the motherboard into the server chassis.

3. Reinstall the cables onto the motherboard.

4. Reinstall the following components:

Caution - The power supplies must be reinstalled into the chassis slots from which they were removed because PS0 is a backup container for fruid data, which will be used to verify and update the fruid data on the replacement motherboard. If they are accidentally swapped, manual re-programming of the fruid data will be required. Contact TSC for further assistance with that.
5. Reinstall all network cable connections to the ports they were removed from, as labelled.
Return the Server to Operation

1. Install the server top cover. Use a Torx T10 screwdriver to lock the release button latch.

Note: When connecting to ILOM via serial cable, the baud rate on replacement boards is 9600. It changes to the Exadata default of 115200 when the ILOM settings are restored and/or the Exadata OS image is booted.
5. Log in to the ILOM as root with the default password 'changeme'. Power on the server to BIOS so that ILOM can access the BIOS but the server OS does not boot:

    -> set /HOST boot_device=bios
    -> start /System

6. Install the Exadata ILOM profile required for UEFI secure boot. The update_entitlements.pkg package file is attached to this Note 2360554.1. If this is not installed into ILOM, the system will not be able to boot the Exadata OS image. Load the attached package from an external laptop/system using one of the transfer protocols supported by ILOM 4.x. After installation, reset the BIOS properties to default.

    -> set /SP system_contact='psnc profile|0x00010000'
    -> load -script -source transfer_method://username:password@ipaddress_or_hostname/directorypath/update_entitlements.pkg
    -> set /System/BIOS reset_to_defaults=factory

For additional information on the ILOM load command, refer to: https://docs.oracle.com/cd/E81115_01/html/E86149/z400371a1482689.html#scrolltoc

7. Check and set the system serial number/fruid data:

a. Enter the ILOM restricted shell to check the psnc values. Follow the example below to enter restricted shell and use the showpsnc command:

    -> set SESSION mode=restricted
    Element | Primary | Backup1 | Backup2

The above example shows a system with the Backup1 container not in sync after MB replacement. If the output from the system does not show all of the containers with matching values, reset the SP and then re-check the values. An ILOM reset will attempt to auto-populate the matching values if one container is out of sync.

    -> reset /SP
    Are you sure you want to reset /SP (y/n)? y
    Performing reset on /SP

If all three containers match after the ILOM reset, this step is done. If the containers still do not match, contact the TSC for further assistance.

8. Restore the ILOM configuration from the backup XML file made earlier, using one of the transfer protocols supported by ILOM 4.x:

    -> cd /SP/config
    /SP/config
    -> set include_faultdata=true
    -> set passphrase=motherboard-replacement
    -> set load_uri=transfer_method://username:password@ipaddress_or_hostname/directorypath/ilom_config_backup.xml

For additional information, refer to https://docs.oracle.com/cd/E81115_01/html/E86149/z40048b81489452.html#scrolltoc

9. Restore the BIOS configuration from the backup XML file made earlier.

    -> cd /System/BIOS/Config
    -> set load_uri=transfer_method://username:password@ipaddress_or_hostname/directorypath/bios_config_backup.xml

For additional information, refer to https://docs.oracle.com/cd/E81115_01/html/E86149/z40001541481574.html#scrolltoc

Note - If the ILOM or BIOS configuration could not be backed up because of the faulty motherboard, manually set at least the following settings, using a working node's ILOM or BIOS as the reference for values:
10. Reset the ILOM to apply the configuration changes:

    -> reset /SP

11. Reset the host power, connect to the server console via the ILOM, and monitor the boot.

    -> reset /System
    -> start /HOST/console

By default the ILOM serial console displays the primary console output.
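
Steps 6, 8 and 9 above, and the earlier backup steps, all take a URI of the form transfer_method://username:password@ipaddress_or_hostname/directorypath/filename. A small Python sketch such as the one below can assemble those command lines from one set of server details so that the backup and restore URIs stay consistent. The function names, the `sftp` method, and the host, user, and directory values are illustrative assumptions. The sketch does no escaping of special characters in passwords, and the transfer method must be one that the installed ILOM version supports.

```python
def ilom_uri(method, user, password, host, directory, filename):
    """Assemble transfer_method://username:password@host/directorypath/filename."""
    path = "/".join(part for part in (directory.strip("/"), filename) if part)
    return f"{method}://{user}:{password}@{host}/{path}"


def transfer_commands(method, user, password, host, directory):
    """Return the ILOM command lines from this note with the URIs filled in."""

    def uri(filename):
        return ilom_uri(method, user, password, host, directory, filename)

    return {
        "ilom_backup": "set dump_uri=" + uri("ilom_config_backup.xml"),
        "bios_backup": "set dump_uri=" + uri("bios_config_backup.xml"),
        "entitlements": "load -script -source " + uri("update_entitlements.pkg"),
        "ilom_restore": "set load_uri=" + uri("ilom_config_backup.xml"),
        "bios_restore": "set load_uri=" + uri("bios_config_backup.xml"),
    }


if __name__ == "__main__":
    # Hypothetical laptop details; replace with the real transfer server.
    commands = transfer_commands("sftp", "fe", "secret", "192.0.2.10", "/backups/")
    for name, line in commands.items():
        print(f"{name}: -> {line}")
```

The generated lines still have to be entered from the correct ILOM target, /SP/config for the ILOM configuration and /System/BIOS/Config for the BIOS configuration, as shown in the steps themselves.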
OBTAIN CUSTOMER ACCEPTANCE

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE?:

FIELD SERVICE ENGINEER and CUSTOMER ACTIVITY:

1. Verify all expected hardware is visible to the server and the fault is cleared. Assistance from the customer for server login access will be required.

    -> show /SYS/MB
    /SYS/MB
     Properties:
     Commands:
    ->

2. Verify there are no outstanding faults in ILOM:

    # ipmitool sunoem cli 'show faulty'
    Connected. Use ^D to exit.
    -> show faulty
    Target             | Property              | Value
    -------------------+-----------------------+-----------------------------------

    -> Session closed
    Disconnected
    #

If faults that did not auto-clear in ILOM are still outstanding after the replacement, refer to the post-repair procedures section of Doc ID 1155200.1 to clear them.

3. Verify there are no outstanding alerts in the Database Node:

    # dbmcli -e list alerthistory
4. Re-enable and restart the Database services:
If running Linux or Solaris native - follow Steps 11 to 14 of MOS Note:
If running OVM then follow MOS Note:
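
The fault check in step 2 above passes when the `show faulty` table has no rows under its header. The following Python sketch, offered as an illustrative aid only, extracts any rows from captured output of that command. The function name `faulty_rows` is hypothetical, and the faulted sample is an assumed example of what a populated table could look like, not output from a real system.

```python
# Clean output, as shown in step 2 of this note.
CLEAN = """\
Connected. Use ^D to exit.
-> show faulty
Target             | Property              | Value
-------------------+-----------------------+-----------------------------------

-> Session closed
Disconnected
"""

# Hypothetical example of a table that still lists a fault.
FAULTED = """\
-> show faulty
Target             | Property              | Value
-------------------+-----------------------+-----------------------------------
/SP/faultmgmt/0    | fru                   | /SYS/MB

-> Session closed
"""


def faulty_rows(output):
    """Return (target, property, value) tuples listed below the table separator."""
    rows = []
    in_table = False
    for line in output.splitlines():
        stripped = line.strip()
        # The dashed line with '+' column joints ends the header.
        if stripped.startswith("---") and "+" in stripped:
            in_table = True
            continue
        # Data rows have exactly three pipe-separated cells.
        if in_table and stripped.count("|") == 2:
            rows.append(tuple(cell.strip() for cell in stripped.split("|")))
    return rows


if __name__ == "__main__":
    for name, text in (("clean", CLEAN), ("faulted", FAULTED)):
        found = faulty_rows(text)
        print(name, "-> no outstanding faults" if not found else found)
```

Any rows returned indicate faults that still need to be cleared using the post-repair procedures referenced in step 2.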
PARTS NOTE:

7317636 [F] System Board Assembly
7352217 [F] 12 in/lb Torque Driver (Required Tool)
REFERENCE INFORMATION:

Oracle Exadata Database Machine Maintenance Guide: https://docs.oracle.com/cd/E80920_01/DBMMN/maintaining-exadata-database-servers.htm#DBMMN22020
Oracle Server X7-2 Documentation: https://docs.oracle.com/cd/E72435_01/index.html
How to shutdown the Exadata database nodes and storage cells in a rolling fashion so certain hardware tasks can be performed. (Doc ID 1539451.1)
How to Shutdown and Startup Exadata compute nodes running OVM (Doc ID 2367609.1)

Attachments
This solution has no attachment