Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2360554.1
Update Date:2018-05-30
Keywords:

Solution Type  Technical Instruction Sure

Solution  2360554.1 :   How to Replace an Exadata X7-2 Compute Node Server Motherboard Assembly  


Related Items
  • Exadata X7-2 Hardware
  • Zero Data Loss Recovery Appliance X7 Hardware
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Exadata internal only for Oracle support engineers use and approved HW partners

Applies to:

Exadata X7-2 Hardware - Version All Versions and later
Zero Data Loss Recovery Appliance X7 Hardware - Version All Versions and later
Information in this document applies to any platform.

Goal

How to Replace an Exadata X7-2 Compute Node Server Motherboard Assembly

Solution

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED:
Exadata X7-2 Training

TIME ESTIMATE: 120 minutes

TASK COMPLEXITY: 3



FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS

PROBLEM OVERVIEW: An Exadata X7-2 Compute Node Server Motherboard Assembly needs replacement

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? :

IMPORTANT NOTE TO TSC ENGINEER: CUT & PASTE the “CUSTOMER ACTIVITY” sections of the Pre-Replacement and Post-Replacement steps into an SR Note, and ensure the customer knows to perform these steps prior to the scheduled field engineer activity, and during and after the replacement activity.

CUSTOMER ACTIVITY:

Taking the disk cache offline and shutting down the database node are required prior to the part replacement.

1. Shut down the database services:

   If running Linux or Solaris native - follow Steps 1 to 7 of MOS Note:
      How to shutdown the Exadata database nodes and storage cells in a rolling fashion so certain hardware tasks can be performed. (Doc ID 1539451.1)

   If running OVM - follow Steps 1 to 4 of MOS Note:
      How to Shutdown and Startup Exadata database nodes running OVM (Doc ID 2367609.1)

2. Revert all the RAID disk volumes to WriteThrough mode to ensure all data in the RAID cache memory is flushed to disk and is not lost if the SuperCap is disconnected. As the 'root' user, set the cache policy of all logical volumes to WriteThrough mode:

# /opt/MegaRAID/storcli/storcli64 /c0/vall set wrcache=WT

3. Verify the current cache policy for all logical volumes is now WriteThrough:

# /opt/MegaRAID/storcli/storcli64 /c0/vall show

In the volume table, the "Cache" column should report as "NRWTD" where WT indicates WriteThrough.
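
The check in step 3 can also be scripted. The following is a minimal sketch, not an Oracle-supplied tool: check_writethrough is a hypothetical helper name, and it assumes that volume rows begin with a DG/VD pair such as 0/0 and that the Cache value is the sixth whitespace-separated column. Confirm both assumptions against the actual output on the node before relying on it.

```shell
# Hypothetical helper: reads "storcli64 /c0/vall show" output on stdin.
# Succeeds (exit 0) only if at least one volume row is found and every
# volume row has "WT" in its Cache value (for example NRWTD).
check_writethrough() {
    awk '
        # Assumption: volume rows start with a DG/VD pair such as 0/0,
        # and the Cache value is the sixth whitespace-separated field.
        $1 ~ /^[0-9]+\/[0-9]+$/ {
            rows++
            if ($6 !~ /WT/) {
                bad++
                print "Volume " $1 " is not WriteThrough (Cache=" $6 ")"
            }
        }
        END { if (rows > 0 && bad == 0) exit 0; exit 1 }
    '
}
```

Possible usage, as 'root' on the database node: /opt/MegaRAID/storcli/storcli64 /c0/vall show | check_writethrough && echo "All volumes are WriteThrough"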

4. Once the preceding steps are complete, the customer may shut down the database node using the following command:

# shutdown -hP now

 

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?: 

Prepare the Server for Service

The customer should have already prepared the server and powered it off. If not, provide them the instructions in the previous section.

1. Log in to the ILOM, check the fruid container values, and sync them if needed. Mismatched fruid values can cause a failure after a motherboard replacement, so confirm that the fruid data matches before starting. The motherboard is the Backup1 container, so the Primary (DBP) and Backup2 (PS0) containers must hold valid, identical values for the replacement motherboard's container to be updated to the correct values automatically.

Go into restricted mode and use the showpsnc command to check this.  

-> set SESSION mode=restricted

WARNING: The "Restricted Shell" account is provided solely
to allow Services to perform diagnostic tasks.

[(restricted_shell) exa1dbadm01-ilom:~]# showpsnc
Primary: fruid:///SYS/DBP
Backup 1: fruid:///SYS/MB
Backup 2: fruid:///SYS/PS0

Element           | Primary           | Backup1           | Backup2
------------------+-------------------+-------------------+-------------------
PPN                 7338405             7338405             7338405
PSN                 1735XC3004          1735XC3004          1735XC3004
Profile             0x00010000          0x00010000          0x00010000
Product Name        ORACLE SERVER X7-2  ORACLE SERVER X7-2  ORACLE SERVER X7-2
RFID SN             341A583DE5800000000232F8 341A583DE5800000000232F8 341A583DE5800000000232F8
[(restricted_shell) exa1dbadm01-ilom:~]# exit

The above example shows a system with all three containers properly in sync. If the output from the system does not show matching values in all of the containers, reset the SP and then re-check the values. An ILOM reset will attempt to auto-populate the matching values if one container is out of sync.

-> reset /SP
Are you sure you want to reset /SP (y/n)? y
Performing reset on /SP

2. After the ILOM reset, if the Primary and Backup2 containers match, proceed with the following steps to replace the motherboard. If these two containers do not match, DO NOT proceed with the replacement yet. Contact TSC for further assistance.

If the containers do not match, you will need to use the "copypsnc" command from service or escalation mode to copy the data from the good container so that the Primary and Backup2 containers match (Backup1 is the motherboard, which is about to be replaced, so it is less important at this step). If you are unfamiliar with this process and require assistance, refer to the steps for using copypsnc to fix the serial number in "How to update product serial number on systems which implement TLI functionality (Doc ID 1280913.1)" and "How to access service mode and escalation mode on ILOM 3.x and later platforms (Doc ID 1019946.1)". After the fruid data in the Primary and Backup2 containers has been confirmed to match, proceed with the following steps.
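
The Primary/Backup2 comparison described in steps 1 and 2 can be sketched as a small script. This is an illustration only: psnc_primary_backup2_match is a hypothetical helper name, and it assumes the showpsnc output has been saved to a text file (for example from a terminal session log) in the layout shown above. Only the single-token rows (PPN, PSN, Profile) are compared; Product Name and RFID SN contain spaces and are left for visual inspection.

```shell
# Hypothetical helper: reads saved showpsnc output on stdin and compares
# the Primary (column 2) and Backup2 (column 4) values.
# Succeeds (exit 0) only if PPN, PSN and Profile are all present and match.
psnc_primary_backup2_match() {
    awk '
        $1 == "PPN" || $1 == "PSN" || $1 == "Profile" {
            seen++
            # Appending "" forces a string comparison of the two values.
            if (($2 "") != ($4 "")) {
                bad++
                print $1 ": Primary=" $2 " Backup2=" $4
            }
        }
        END { if (seen == 3 && bad == 0) exit 0; exit 1 }
    '
}
```

Possible usage, where showpsnc.txt is an assumed file name: psnc_primary_backup2_match < showpsnc.txt && echo "Primary and Backup2 match"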

3. Back up the current ILOM configuration settings, including fault data history, to an XML file on an external laptop/system, using one of the transfer protocols supported by ILOM 4.x:

-> cd /SP/config
-> set include_faultdata=true
-> set passphrase=motherboard-replacement
-> set dump_uri=transfer_method://username:password@ipaddress_or_hostname/directorypath/ilom_config_backup.xml

For additional information, refer to  https://docs.oracle.com/cd/E81115_01/html/E86149/z40048b81489311.html#scrolltoc 
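
To make the URI layout concrete, the sketch below assembles a value in the transfer_method://username:password@ipaddress_or_hostname/directorypath/filename form used by dump_uri and load_uri in this procedure. build_ilom_uri is a hypothetical helper name, and the sftp method, credentials, and address in the usage line are placeholder assumptions. The sketch performs no escaping, so special characters in the password are not handled.

```shell
# Hypothetical helper: assembles a URI in the form
#   transfer_method://username:password@host/directorypath/filename
# Usage: build_ilom_uri METHOD USER PASSWORD HOST DIRECTORY FILENAME
build_ilom_uri() {
    dir=${5#/}     # drop one leading slash from the directory, if present
    dir=${dir%/}   # drop one trailing slash from the directory, if present
    printf '%s://%s:%s@%s/%s/%s\n' "$1" "$2" "$3" "$4" "$dir" "$6"
}
```

Possible usage with placeholder values: build_ilom_uri sftp admin secret 192.0.2.10 backups ilom_config_backup.xml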

4. Back up the current BIOS configuration parameters to an XML file on an external laptop/system, using one of the transfer protocols supported by ILOM 4.x:

-> cd /System/BIOS/Config
-> set dump_uri=transfer_method://username:password@ipaddress_or_hostname/directorypath/bios_config_backup.xml

For additional information, refer to  https://docs.oracle.com/cd/E81115_01/html/E86149/z40001541481533.html#scrolltoc 

5. Extend the server to the maintenance position

6. Disconnect the power cords from the power supplies

7. Attach an anti-static wrist strap to your wrist and to a metal area on the chassis or the rack.

8. Remove the server top cover. Use a Torx T10 screwdriver to unlock the release button latch.

 

Removing the Motherboard 

Caution - These procedures require that you handle components that are sensitive to electrostatic discharge. This sensitivity can cause the components to fail. To avoid damage, ensure that you follow anti-static practices.

1. Remove the following components and set them aside on an anti-static mat:

Caution - During the motherboard removal procedure, it is recommended to only pull the power supplies as far out as necessary to disengage them from the motherboard, without removing them completely from the chassis slot they are in. If they are removed completely from the chassis slot, it is critical to label the power supplies with the slot numbers from which they were removed (PS0, PS1). The power supplies must be reinstalled into the chassis slots from which they were removed because PS0 is a backup container for fruid data which will be used to verify and update the fruid data on the replacement motherboard. If they are accidentally swapped, then manual re-programming of the fruid data will be required. Contact TSC for further assistance with that.
  • Air baffles
  • Fan modules
  • Power supplies - disengage only, do not fully remove.
  • PCIe risers, with their attached PCIe cards and cables. Ensure all cables removed from network ports are properly labelled for reconnecting after replacement.
  • SFP28/SFP+ Transceivers in onboard network ports.

2. Remove the following cables from the motherboard:

    a. Remove the SAS cables and the super capacitor cable that are connected to the internal HBA card, and then carefully lift them from the left-side cable trough and set them out of the way.
    b. Disconnect the disk backplane power cable from the motherboard by pressing in on the connector latch and then pulling out the cable connector.
    c. Disconnect the disk backplane data cable from the motherboard by opening the ejectors and pulling out the cable connector.
    d. Disconnect the front indicator module (FIM) cable connector by opening the ejectors and pulling out the cable connector.

3. Remove the motherboard from the server chassis with all reusable components that populate the motherboard in place.

    a. Using a Torx T25 screwdriver, loosen the two green captive screws that secure the motherboard bracket/handle to the server chassis.
    b. Grasp the metal bracket located just to the rear of the DIMM sockets and the finger loop, and then slide the motherboard toward the front of the server and lift it slightly to disengage it from the eight mushroom-shaped standoffs located on the server chassis under the motherboard.
    c. Lift the motherboard out of the server chassis and place it on an anti-static mat next to the replacement motherboard.

4. Remove the Coin Cell battery from the motherboard and re-install it on the replacement motherboard.

5. Remove the DDR4 DIMMs from the motherboard and re-install them onto the corresponding slots in the replacement motherboard.

Note - Install the DIMMs only in the sockets (connectors) that correspond to the sockets from which they were removed. Performing a one-to-one replacement of the DIMMs significantly reduces the possibility that the DIMMs will be installed in the wrong slots. If you do not reinstall the DIMMs in the same sockets, server performance might suffer and some DIMMs might not be used by the server. 

6. Remove the CPUs from the failed motherboard.

7. Remove the CPU socket covers from the replacement motherboard and install the CPUs into the replacement motherboard.

    a. Grasp the CPU socket cover finger grips (labeled REMOVE) and lift the socket cover up and off the processor socket.
    b. Install a CPU into the socket from which you removed the CPU socket cover. See MOS Doc ID 2360561.1 (How to Replace an Exadata X7-2 Compute Node Server CPU) for steps. A 12.0 in-lbs (inch-pounds) torque driver (part number 7352217) with a Torx T30 bit is required for CPU installation.
    c. Repeat Step 7.a and Step 7.b to remove the second CPU socket cover from the replacement motherboard and install the second CPU processor.

8. Install the CPU socket covers onto the CPU sockets of the faulty motherboard.  

Caution - The CPU socket covers must be installed on the faulty motherboard; otherwise, damage might result to the CPU sockets during handling and shipping, preventing motherboards from being repairable.

    a. Align the CPU socket cover over the CPU socket alignment posts. Install the CPU socket cover by firmly pressing down on all four corners (labeled INSTALL) on the socket cover.
        You will hear an audible click when the CPU socket cover is securely attached to the CPU socket.
    b. Repeat Step 8.a to install the second CPU socket cover on the faulty motherboard.
 

 

Installing the Motherboard 

1. Attach an anti-static wrist strap to your wrist, and then to a metal area on the chassis.

2. Insert the motherboard into the server chassis.

    a. Grasp the metal bracket located just to the rear of the DIMM sockets and the finger loop, and then tilt the front of the motherboard up slightly and push it into the opening in the rear of the server chassis.

    b. Lower the motherboard into the server chassis and slide it to the rear until it engages the eight mushroom-shaped standoffs located on the server chassis under the motherboard.
    c. Ensure that the indicators, controls, and connectors on the rear of the motherboard fit correctly into the rear of the server chassis.
    d. Using a Torx T25 screwdriver, tighten the two green captive screws to secure the motherboard bracket/handle to the server chassis.

3. Reinstall cables on to the motherboard.

    a. Reconnect the front indicator module (FIM) cable to the motherboard connector.
    b. Reconnect the disk backplane data cable to the motherboard connector.
    c. Reconnect the disk backplane power cable to the motherboard connector.
    d. Carefully reinstall the SAS cables and super capacitor cable along the left-side cable trough.  Reconnect them to the HBA while re-installing the riser containing the HBA.

4. Reinstall the following components:

Caution - The power supplies must be reinstalled into the chassis slots from which they were removed because PS0 is a backup container for fruid data which will be used to verify and update the fruid data on the replacement motherboard. If they are accidentally swapped, then manual re-programming of the fruid data will be required. Contact TSC for further assistance with that.
  • SFP28/SFP+ Transceivers in onboard network ports.
  • PCIe risers and attached PCIe cards:
    • Slot 1 Riser - either empty (1/8th rack) or 10/25GbE SFP28 NIC card
    • Slot 2 Riser - IB HCA card
    • Slot 3/4 Riser - SAS HBA in internal Slot 4 and one of the following in Slot 3:
      • 10/25GbE SFP28 NIC card (1/8th rack Compute nodes) or
      • Quad 10Gb Base-TX NIC card (optional Exadata) or
      • Qlogic Fiber Channel card (optional ZDLRA) or
      • perforated filler panel
  • Power supplies - ensure they are re-installed in the same slot they were removed from.
  • Fan modules
  • Air baffles

5. Reinstall all network cable connections to the ports they were removed from, as labelled.

 

Return the Server to Operation

1. Install the server top cover. Use a Torx T10 screwdriver to lock the release button latch.
2. Reconnect the power cords to the server power supply and connect any other cables to their original locations.
3. Return the server to the normal rack position.
4. Once the power cords have been re-attached and the ILOM has booted you will see a slow blink on the green LED for the server.

Note: When connecting to ILOM via serial cable, the baud rate is 9600 for replacement boards. It changes to the Exadata default of 115200 when the ILOM settings are restored and/or the Exadata OS image is booted.

5. Log in to the ILOM as root with the default password 'changeme'. Power on the server to BIOS so that ILOM can access the BIOS but the server OS does not boot:

-> set /HOST boot_device=bios
-> start /System

6. Install the Exadata ILOM profile required for UEFI secure boot. The update_entitlements.pkg package file is attached to this Note 2360554.1.  If this is not installed into ILOM, the system will not be able to boot the Exadata OS image.  Load the attached package from an external laptop/system using one of the transfer protocols supported by ILOM 4.x. After installation, reset the BIOS properties to default.

-> set /SP system_contact='psnc profile|0x00010000'

-> load -script -source transfer_method://username:password@ipaddress_or_hostname/directorypath/update_entitlements.pkg

-> set /System/BIOS reset_to_defaults=factory

For additional information on the ILOM load command, refer to: https://docs.oracle.com/cd/E81115_01/html/E86149/z400371a1482689.html#scrolltoc 

7. Check and set the system serial number/fruid data:

a. Enter the ILOM restricted shell to check the psnc values. Follow the example below to enter restricted shell and use the showpsnc command:

-> set SESSION mode=restricted

WARNING: The "Restricted Shell" account is provided solely
to allow Services to perform diagnostic tasks.

[(restricted_shell) exa1dbadm01-ilom:~]# showpsnc
Primary: fruid:///SYS/DBP
Backup 1: fruid:///SYS/MB
Backup 2: fruid:///SYS/PS0

Element           | Primary           | Backup1           | Backup2
------------------+-------------------+-------------------+-------------------
PPN                 7338405             7338405             7338405
PSN                 1735XC3004          0000000000          1735XC3004
Profile             0x00010000          0x00010000          0x00010000
Product Name        ORACLE SERVER X7-2  ORACLE SERVER X7-2  ORACLE SERVER X7-2
RFID SN             341A583DE5800000000232F8 341A583DE5800000000232F8 341A583DE5800000000232F8
[(restricted_shell) exa1dbadm01-ilom:~]# exit

The above example shows a system with the Backup1 container out of sync after the motherboard replacement. If the output from the system does not show matching values in all of the containers, reset the SP and then re-check the values. An ILOM reset will attempt to auto-populate the matching values if one container is out of sync.

-> reset /SP
Are you sure you want to reset /SP (y/n)? y
Performing reset on /SP

If all three containers match, this step is complete. If the containers still do not match after the ILOM reset, contact the TSC for further assistance.
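
The three-way comparison in step 7 can be sketched in the same style. This is an illustration under stated assumptions: psnc_all_match is a hypothetical helper name, the showpsnc output is assumed to have been saved to a text file in the layout shown above, and only the single-token rows (PPN, PSN, Profile) are compared.

```shell
# Hypothetical helper: reads saved showpsnc output on stdin and reports
# which backup container differs from Primary for PPN, PSN and Profile.
# Succeeds (exit 0) only if all three rows are present and all columns match.
psnc_all_match() {
    awk '
        $1 == "PPN" || $1 == "PSN" || $1 == "Profile" {
            seen++
            # Appending "" forces string comparisons.
            if (($2 "") != ($3 "")) { bad++; print $1 ": Backup1 differs from Primary" }
            if (($2 "") != ($4 "")) { bad++; print $1 ": Backup2 differs from Primary" }
        }
        END { if (seen == 3 && bad == 0) exit 0; exit 1 }
    '
}
```

Run against the example output in step 7.a, this sketch would report the PSN row of the Backup1 container and return a non-zero status.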

8. Restore the ILOM configuration using the backup XML file made earlier, using one of the transfer protocols supported by ILOM 4.x: 

-> cd /SP/config
/SP/config
-> set include_faultdata=true
-> set passphrase=motherboard-replacement
-> set load_uri=transfer_method://username:password@ipaddress_or_hostname/directorypath/ilom_config_backup.xml

For additional information, refer to https://docs.oracle.com/cd/E81115_01/html/E86149/z40048b81489452.html#scrolltoc 

9. Restore the BIOS configuration using the backup XML file made earlier.

-> cd /System/BIOS/Config
-> set load_uri=transfer_method://username:password@ipaddress_or_hostname/directorypath/bios_config_backup.xml

For additional information, refer to https://docs.oracle.com/cd/E81115_01/html/E86149/z40001541481574.html#scrolltoc 

Note - In the event the ILOM or BIOS configuration could not be backed up due to the faulty motherboard, manually set at least the following settings, using a working node ILOM or BIOS as the reference for values:
  • Serial Baud rate is 115200 for external and host
  • /SP system_identifier contains the Rack type and serial number
  • /SP hostname
  • /SP/network settings
  • /SP/clock and /SP/clients/ntp settings
  • /SP/clients/dns settings
  • /SP/alertmgmt rules
  • /SP/users/root account password
  • BIOS boot order
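
When the settings have to be entered by hand, it can help to print the ILOM commands for review before pasting them into the ILOM CLI. The sketch below covers only the first three items in the list and is an assumption-laden illustration: print_ilom_manual_settings is a hypothetical helper name, and the /SP/serial/external and /SP/serial/host pendingspeed and commitpending property names are assumed from common ILOM layouts and should be confirmed with "show" on a working node.

```shell
# Hypothetical helper: prints ILOM CLI commands for review; it does not
# connect to or change any ILOM.
# Usage: print_ilom_manual_settings HOSTNAME SYSTEM_IDENTIFIER
print_ilom_manual_settings() {
    printf 'set /SP hostname=%s\n' "$1"
    printf "set /SP system_identifier='%s'\n" "$2"
    # Assumed property names; verify on a working node before use.
    printf 'set /SP/serial/external pendingspeed=115200 commitpending=true\n'
    printf 'set /SP/serial/host pendingspeed=115200 commitpending=true\n'
}
```

The hostname and system identifier values should be taken from the customer's records or from a working node in the same rack, not guessed.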

10. Reset the ILOM to apply the configuration changes:

-> reset /SP

11. Reset the host power and connect to the server console via the ILOM and monitor the boot.

-> reset /System

-> start /HOST/console

By default the ILOM serial console displays the primary console output.
In the event of unexpected boot behavior, it is advisable to connect to both ILOM serial and ILOM graphics consoles at the same time and monitor.

 

OBTAIN CUSTOMER ACCEPTANCE

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE?:

FIELD SERVICE ENGINEER and CUSTOMER ACTIVITY: 

1. Verify all expected hardware is visible to the server and the fault is cleared. Assistance from the customer for server login access will be required.

-> show /SYS/MB

/SYS/MB
   Targets:
       BIOS
       CPLD
       FM0
       FM1
       FM2
       FM3
       NET0
       NET1
       NET2
       P0
       P1
       RISER1
       RISER2
       RISER3
       T_IN_SLOT1
       T_IN_SLOT2
       T_IN_SLOT3
       T_OUT_SLOT1
       T_OUT_SLOT2
       T_OUT_SLOT3

   Properties:
       type = Motherboard
       ipmi_name = MB
       fru_description = ASM, MB, X7-2
       fru_manufacturer = Oracle Corporation
       fru_part_number = 7317636
       fru_rev_level = 00
       fru_serial_number = 465136N+1732P5005A
       fru_macaddress = 00:10:e0:c3:c7:aa
       fault_state = OK
       clear_fault_action = (none)

   Commands:
       cd
       set
       show

->

2. Verify there are no outstanding faults in ILOM:

# ipmitool sunoem cli 'show faulty'
Connected. Use ^D to exit.
-> show faulty
Target | Property | Value
-------------------+-----------------------+-----------------------------------
-> Session closed
Disconnected
#

If there are faults still outstanding that did not auto-clear in ILOM after replacement, refer to the post-repair procedures section of Doc ID 1155200.1 to clear the fault.
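
The "no outstanding faults" check in step 2 can be sketched as a filter on the command output. This is an illustration only: no_ilom_faults is a hypothetical helper name, and it assumes that each outstanding fault appears as a table row whose Target column begins with /SP/faultmgmt.

```shell
# Hypothetical helper: reads "show faulty" output on stdin.
# Prints any fault rows and succeeds (exit 0) only if none are found.
no_ilom_faults() {
    awk '
        # Assumption: fault rows have a Target beginning with /SP/faultmgmt.
        $1 ~ /^\/SP\/faultmgmt/ { n++; print }
        END { if (n > 0) exit 1; exit 0 }
    '
}
```

Possible usage on the database node: ipmitool sunoem cli 'show faulty' | no_ilom_faults && echo "No outstanding ILOM faults"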

3. Verify there are no outstanding alerts in the Database Node:

# dbmcli -e list alerthistory

4. Re-enable and restart the Database services:

If running Linux or Solaris native - follow Steps 11 to 14 of MOS Note:
How to shutdown the Exadata database nodes and storage cells in a rolling fashion so certain hardware tasks can be performed. (Doc ID 1539451.1)

If running OVM then follow MOS Note:
How to Shutdown and Startup Exadata compute nodes running OVM (Doc ID 2367609.1)

 

PARTS NOTE:

7317636 [F] System Board Assembly

7352217 [F] 12 in/lb Torque Driver (Required Tool)

 

REFERENCE INFORMATION:

Oracle Exadata Database Machine Maintenance Guide: https://docs.oracle.com/cd/E80920_01/DBMMN/maintaining-exadata-database-servers.htm#DBMMN22020

Oracle Server X7-2 Documentation https://docs.oracle.com/cd/E72435_01/index.html

How to shutdown the Exadata database nodes and storage cells in a rolling fashion so certain hardware tasks can be performed. (Doc ID 1539451.1)

How to Shutdown and Startup Exadata compute nodes running OVM (Doc ID 2367609.1)


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.