Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction
Sure Solution 1968764.1: How to Replace Motherboard in Storage Cell in Exadata Database Machine [X5-2/X6-2]
Oracle Confidential PARTNER - Available to partners (SUN).

Applies to:
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Full Rack - Version All Versions and later
Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Exadata X5-2 Half Rack - Version All Versions and later
Information in this document applies to any platform.

Goal
Canned action plan for replacing a motherboard in a storage cell in an Exadata Database Machine [X5-2/X6-2].

Solution
DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Exadata trained
TIME ESTIMATE: 120 Minutes
TASK COMPLEXITY: 3
FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:

PROBLEM OVERVIEW: A server in an Exadata Database Machine requires the motherboard to be replaced. This procedure is specific to Exadata X5-2/X6-2 systems based on Oracle Server X5-2/X6-2 and Oracle Server X5-2L/X6-2L.

Connectivity to the rack will depend on the customer's access requirements. The following procedure partially requires a serial connection and network access to ILOM, which assumes a laptop attached to the Cisco management switch. If no port is available in a full rack, temporarily disconnect a port used for another host's ILOM (e.g. port 2). If the customer does not allow login access to the host ILOM, they will need to run the commands given below.

When connecting to ILOM via serial cable, remember that the baud rate is 9600 on replacement boards. This is changed during the post-install procedure to the Exadata default of 115200 for installed boards.

For brevity, MB = Motherboard.

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

Pre-Install Steps:

1. Backup ILOM settings. Assuming the ILOM is not the reason for the replacement of the system MB, take a current backup of the ILOM SP configuration using a browser under the "ILOM Administration → Configuration Management" tab in the left menu. This can also be done from the ILOM CLI as follows:

-> cd /SP/config
-> set passphrase=welcome1
-> set dump_uri=scp://root:password@laptop_IP/var/tmp/SP.config

2. Obtain the correct serial numbers required.
(a) Make a note of the System Serial Number from the front label of the server.

3. If the system is not already down due to whatever problem is causing the MB to be replaced, have the customer DBA shut the node down.
(a) For extended information on this section, see MOS Note ID 1188080.1 "Steps to shut down or reboot an Exadata storage cell without affecting ASM". This is also documented in the Exadata Owner's Guide, chapter 7, section "Maintaining Exadata Storage Servers", subsection "Shutting Down Exadata Storage Server", available on the customer's cell server image in /opt/oracle/cell/doc.
http://amomv0115.us.oracle.com/archive/cd_ns/E13877_01/doc/doc.112/e13874/maintenance.htm#autoId33
In the following examples the SQL commands should be run by the customer's DBA prior to doing the hardware replacement. These should be done by the field engineer only if the customer directs them to, or is unable to do them. The CellCLI commands will need to be run as root.

(b) ASM drops a disk shortly after it is taken offline. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but may have been changed by the customer. To check this parameter, have the customer log into ASM and perform the following query:

SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;
As long as the value is large enough to comfortably cover the replacement, there is no need to change it.

(c) Check whether ASM will be OK if the grid disks go OFFLINE:

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...snippet...
DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes
etc....

If one or more disks return asmdeactivationoutcome='No', wait for some time and repeat step (c). Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.

(d) Run the cellcli command to inactivate all grid disks on the cell that needs to be powered down for maintenance (this could take 10 minutes or longer):

# cellcli
...
CellCLI> ALTER GRIDDISK ALL INACTIVE
GridDisk DATA_CD_00_dmorlx8cel01 successfully altered
GridDisk DATA_CD_01_dmorlx8cel01 successfully altered
...etc...
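The safety check in step (c) lends itself to a small script. The sketch below is a hedged illustration, not part of the official procedure: it parses cellcli-style output (one grid disk per line, asmdeactivationoutcome in the third column, as in the sample above) and only reports success when every disk says 'Yes'. On a live cell you would pipe `cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome` into it; here canned samples stand in.

```shell
# Hedged sketch: decide whether it is safe to inactivate the grid disks by
# parsing cellcli output. Column layout follows the sample output in this note.
all_disks_safe() {
  # Success (0) only when every disk reports asmdeactivationoutcome=Yes.
  awk '$3 != "Yes" { unsafe = 1 } END { exit unsafe }'
}

sample_ok="DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE Yes"

sample_bad="DATA_CD_09_cel01 ONLINE Yes
DATA_CD_10_cel01 ONLINE No"

printf '%s\n' "$sample_ok"  | all_disks_safe && echo "safe to proceed"
printf '%s\n' "$sample_bad" | all_disks_safe || echo "wait and re-check"
```

In practice this would be wrapped in a sleep-and-retry loop until the check passes.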
(e) Execute the command below; once the disks are offline and inactive in ASM, the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome=Yes for all grid disks:

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
...
DATA_CD_00_dmorlx8cel01 inactive OFFLINE Yes
DATA_CD_01_dmorlx8cel01 inactive OFFLINE Yes
...etc...
(f) Once all disks are offline and inactive, the customer or field engineer may shut down the cell using the following command:

# shutdown -hP now
(g) The field engineer can now slide out the server for maintenance. Remember to disconnect the power cords before opening the top of the server. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arms.
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?: Note: You must use the CPU removal/insertion tool for the X5-2L server. If you have not used this tool before, familiarize yourself with it before attempting to use it on-site. The tool is not intuitive, so reference the service manual before attempting this service action.
Reference links for Service Manuals:
X5-2L Cells: http://docs.oracle.com/cd/E41033_01/html/E48325/cnpsm.z40001d31037512.html#scrolltoc
Physical Replacement Steps: 1. Replace the MB as per MOS Note "How to Remove and Replace a Motherboard Assembly in an Oracle Server X5-2" (Doc ID 1992420.1).
NOTE:- On Storage Cells, remember to move the internal USB stick onto the new board.
NOTE:- Pull power cords before opening the top cover to avoid a SP degraded condition.
2. Carefully follow the port numbers on the cables when re-attaching so they are not reversed. It is easiest to plug cables in while the server is in the fully extended maintenance position. 3. Do not power up the host system yet; only the ILOM should come up on standby power.
Post-Installation Steps:

1. Update the serial number on the new MB to that of the server chassis. This is REQUIRED in order for ASR to continue to work on the unit, and is required for all servers that are part of Exadata racks that may have a future Service Request, whether ASR is configured now or not. These platforms use the Top Level Indicator (TLI) feature in ILOM to perform the MB serial number update automatically. In certain circumstances this may not work correctly and will need to be manually corrected. For more information on TLI and restricted shell, refer to the following two MOS notes for these systems:
TLI: MOS Note 1280913.1
Restricted Shell: MOS Note 1302296.1
NOTE: The serial number of each server can be found at the front on the left-hand side.
(a) Connect to ILOM via the serial port and log in as "root" with the default password "changeme".
(b) Enter restricted mode:

-> set SESSION mode=restricted
WARNING: The "Restricted Shell" account is provided solely to allow Services to perform diagnostic tasks.
(c) Review the current PSNC containers with “showpsnc” command: [(restricted_shell) exdx5-tvp-a-db1-sp:~]# showpsnc
Primary:  fruid:///SYS/DBP
Backup 1: fruid:///SYS/MB
Backup 2: fruid:///SYS/PS0

Element           | Primary            | Backup1            | Backup2
------------------+--------------------+--------------------+-------------------
PPN                 7090664              7090664              7090664
PSN                 1450NM104V           1450NM104V           1450NM104V
Product Name        ORACLE SERVER X5-2   ORACLE SERVER X5-2   ORACLE SERVER X5-2

[(restricted_shell) exdx5-tvp-a-db1-sp:~]#
If the replacement has the correct product serial number in all three containers, including /SYS/MB, skip to step 2 of the post-replacement procedures. If the product serial number is not populated correctly, "exit" out of restricted shell mode and continue:

[(restricted_shell) exdx5-tvp-a-db1-sp:~]# exit
exit ->
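The container comparison in step (c) can also be scripted. This is a hedged sketch, not part of the official procedure: it parses the PSN row of showpsnc output (layout as in the sample above; adjust if your ILOM release formats it differently) and succeeds only when all three containers agree.

```shell
# Hedged sketch: confirm the PSN row of "showpsnc" holds the same product
# serial number in all three containers (Primary, Backup1, Backup2).
psn_consistent() {
  awk '$1 == "PSN" { if ($2 != $3 || $2 != $4) exit 1; found = 1 }
       END { exit found ? 0 : 1 }'
}

sample='PPN 7090664 7090664 7090664
PSN 1450NM104V 1450NM104V 1450NM104V'

printf '%s\n' "$sample" | psn_consistent && echo "PSN consistent - skip copypsnc"
```

If the check fails (or no PSN row is present), continue with step (d) below.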
(d) Where at least one container still contains valid TLI information (usually the primary disk backplane, DBP0), the service mode command copypsnc can be used to update the product serial number.

i. Log in as root and create an escalation mode user with the service role:

-> cd /SP/users
-> create sunny role=aucros (will ask for password)
ii. Gather “version”, “show /SYS” and “show /SP/clock” outputs needed for generating the service mode password: -> version
SP firmware 3.2.4.12
SP firmware build number: 94599
SP firmware date: Mon Nov 17 13:07:41 EST 2014
SP filesystem version: 0.2.10

-> show /SYS
..............
Properties:
    type = Host System
    ipmi_name = SYS
    product_name = ORACLE SERVER X5-2L
    product_part_number = 7090697
    product_serial_number = XXXXXXXXXX
    product_manufacturer = Oracle Corporation
    fault_state = OK
    clear_fault_action = (none)
    power_state = On

-> show /SP/clock
..............
Properties:
    datetime = Wed Feb 4 05:35:50 2015
    timezone = PST (America/Los_Angeles)
    uptime = 5 days, 07:03:01
    usentpserver = disabled
iii. Generate a service mode password using http://modepass.us.oracle.com/ (login is via Oracle Single Sign-On). Example output of the tool:

BRAND : sun
MODE : service
VERSION : 3.2.4.10
SERIAL : 00000000
UTC DATE : 05/20/2013 16:00
POP DOLL PHI TOW BRAN TAUT FEND PAW SKI SCAR BURG CEIL MINT DRAB KAHN FIR MAGI LEAF LIMB EM LAWS BRAE DEAL BURN GOAL HEFT HEAR KEY SEE A
iv. Log out of root, log back in as the 'sunny' user you created, and enter service mode:

-> set SESSION mode=service
Password: *** **** *** *** **** **** **** *** *** **** **** **** **** **** **** *** **** **** **** ** **** **** **** **** **** **** **** *** *** *
Short form password is: ARMY ULAN HULL
Currently in service mode.
v. Correct the invalid containers using the “copypsnc” command: -> copypsnc
Number of arguments is incorrect.
Usage: copypsnc [-n] <src> <dest>
  where <src> is PRIMARY|BACKUP1|BACKUP2
        <dest> is PRIMARY|BACKUP1|BACKUP2
  -n: If src is a bilingual FRU, copy from new-style record.
PRIMARY: fruid:///SYS/DBP0
BACKUP1: fruid:///SYS/MB
BACKUP2: fruid:///SYS/PS0

-> copypsnc BACKUP1 PRIMARY
The copypsnc command produces no output upon success.

vi. After running copypsnc, reboot the service processor:

-> reset /SP
vii. Log in again as the 'root' user with default password 'changeme' and verify the SN is now populated correctly using 'show /SYS' and 'showpsnc' as shown above.

viii. Remove the 'sunny' user:

-> delete /SP/users/sunny
If there are any issues programming the serial number with "copypsnc", an escalation mode password and instructions will need to be provided by the TSC x86 engineer assigned to the SR.

2. Re-flash the ILOM/BIOS to the correct levels required for Exadata. The image on the Exadata storage cells contains the firmware and will automatically re-flash it during boot if it is not correct. You do not need to do any flash updates manually.
(a) Power up the system using the front button or ILOM "-> start /SYS".
(b) During the boot validation phase, CheckHWnFWProfile will run, determine that the ILOM firmware is not correct, and automatically flash it. The server will be powered off during this, ILOM will reset, and after about 10 minutes off (to allow the ILOM reset and BIOS flash update) the host will be automatically powered back on. It is recommended to stay connected to the serial console and monitor the host console through ILOM to verify this completes successfully.
3. Restore the SP configuration backed up during the pre-installation steps.
(a) Using a browser under the Maintenance tab, or from the ILOM CLI:

-> cd /SP/config
-> set passphrase=welcome1
-> set load_uri=scp://root:password@laptop_IP/var/tmp/SP.config

If an SP backup was not possible, check with the customer for network information and use another ILOM within the rack for general settings. The primary Exadata-specific settings are:

i. Serial baud rate is 115200 for both external and host:

-> set /SP/serial/external pendingspeed=115200
-> set /SP/serial/external commitpending=true
-> set /SP/serial/host pendingspeed=115200
-> set /SP/serial/host commitpending=true
ii. /SP system_identifier is set to the appropriate rack type string and master rack serial number. This is critical for ASR deployments. The master rack serial number can be obtained from the label at the top left inside the cabinet, or from "show /SP" on any other ILOM in the rack. The string should be of the following format:

X5-2: "Exadata Database Machine X5-2 <Rack SN>"

For example:

check_physical_presence = false
current_hostname = ORACLESP-1449NM702E
hostname = (none)
reset_to_defaults = none
system_contact = svcid pn|Exadata X5-2| sn|AK00000000| name|Exadata X5-2|
system_description = ORACLE SERVER X5-2L, ILOM v3.2.4.12, r94599
system_identifier = Exadata Database Machine X5-2 AK00000000
system_location = (none)
iii. /SP hostname is set up.
iv. /SP/network settings.
v. /SP/alertmgmt rules that may have been previously set up by ASR or cell configuration.
vi. /SP/clock timezone, datetime, and /SP/clients/ntp NTP settings.
vii. /SP/clients/dns name service settings.
viii. root account password. If the root password has not been provided you can have the customer do this, or set it manually:

-> set /SP/users/root password=welcome1 (or the customer's password)
Changing password for user /SP/users/root...
Enter new password again: ********
New password was successfully set for user /SP/users/root
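Because the system_identifier string in item ii is critical for ASR, it is worth verifying its format after restoring the configuration. This hedged sketch checks the X5-2 pattern described above; the AK-prefixed serial is the placeholder from the example output, not a real rack SN.

```shell
# Hedged sketch: validate that /SP system_identifier follows the expected
# "Exadata Database Machine X5-2 <Rack SN>" pattern for ASR.
valid_identifier() {
  case "$1" in
    "Exadata Database Machine X5-2 "?*) return 0 ;;
    *) return 1 ;;
  esac
}

valid_identifier "Exadata Database Machine X5-2 AK00000000" && echo "format OK"
valid_identifier "Exadata X5-2" || echo "format WRONG - fix before ASR deployment"
```

On a live system the value would come from `ipmitool sunoem cli 'show /SP system_identifier'` rather than a literal string.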
(b) Reset the ILOM under the Maintenance Tab or from ILOM cli: -> reset /SP
(c) Check that you can log in to all interfaces and that ILOM can be accessed using a browser or ssh from another system on the customer's management network.

4. Once the SP is re-configured, power on the server by pressing the power button at the front and enter the BIOS Setup Utility: press the F2 key (Ctrl+E from a serial connection) when prompted while the BIOS is running the power-on self-tests (POST). Check the BIOS settings against the EIS checklist; in particular make sure USB is first in the boot order (the original USB stick should have been moved from the old board to the new board), and check that the date and time are correct.
Important - When NVMe cables are removed or replaced between the storage drive backplane and the NVMe switch cards, you must perform the procedure in this section to confirm that all NVMe cable connections are correct. If the NVMe cable connections are not correct, the storage server operating system should not be allowed to boot, as incorrect cabling could cause a problem with disk drive mapping.

-> set SESSION mode=restricted
WARNING: The "Restricted Shell" account is provided solely to allow Services to perform diagnostic tasks.

[(restricted_shell) exdx5-tvp-a-cel3-sp:~]# hwdiag io nvme_test
HWdiag (Restricted Mode) - Build Number 94599 (Nov 17 2014, 18:59:38)
Current Date/Time: Apr 11 2015, 01:21:34
Checking NVME drive fru contents...
  checking fru on drive NVMe 0 OK
  checking fru on drive NVMe 1 OK
  checking fru on drive NVMe 3 OK
  checking fru on drive NVMe 4 OK
  checking fru on drive NVMe 6 OK
  checking fru on drive NVMe 7 OK
  checking fru on drive NVMe 9 OK
  checking fru on drive NVMe 10 OK
NVME drives fru check: PASSED
Checking NVME drive pcie links...
  checking pcie link on drive NVMe 0 OK
  checking pcie link on drive NVMe 1 OK
  checking pcie link on drive NVMe 3 OK
  checking pcie link on drive NVMe 4 OK
  checking pcie link on drive NVMe 6 OK
  checking pcie link on drive NVMe 7 OK
  checking pcie link on drive NVMe 9 OK
  checking pcie link on drive NVMe 10 OK
NVME drives pcie link check: PASSED
Checking NVME drive DSN...
  checking DSN on drive NVMe 0 OK
  checking DSN on drive NVMe 1 OK
  checking DSN on drive NVMe 3 OK
  checking DSN on drive NVMe 4 OK
  checking DSN on drive NVMe 6 OK
  checking DSN on drive NVMe 7 OK
  checking DSN on drive NVMe 9 OK
  checking DSN on drive NVMe 10 OK
NVME drives DSN check: PASSED
Checking NVME cabling...
  Cables associated with Switch Card 3 in PCIe Slot 6 verified
  Cables associated with Switch Card 2 in PCIe Slot 5 verified
  Cables associated with Switch Card 1 in PCIe Slot 2 verified
  Cables associated with Switch Card 0 in PCIe Slot 1 verified
NVME cable check: PASSED
NVME test PASSED
[(restricted_shell) exdx5-tvp-a-cel3-sp:~]#
7. As the system boots, the hardware/firmware profile will be checked, and either a green "Passed" will be displayed, or a red "Warning" that something in the hardware or firmware does not match what is expected. If the check passes, everything is correct and the boot will continue up to the OS login prompt. If the check fails, the flagged issue should be investigated and rectified before continuing.

8. Additional OS checks:
(a) Verify the network interfaces have correctly picked up the MAC addresses of the new system board:

# ifconfig eth0 (repeat for each of eth1/bondeth0 etc.)
# ipmitool sunoem cli "show /SYS/MB/NET0" (repeat for each NIC NET0/1/2/3)
-> show /SYS/MB/NET0
Compare this to the following network configuration files under: /etc/sysconfig/network-scripts/ifcfg-ethX
where X is 0 (ifcfg-eth0=NET0), 1 (ifcfg-eth1=NET1), 2 (NET2), or 3 (NET3), or "ifcfg-bondeth0" on a DB node with bonding. Example file output:

#### DO NOT REMOVE THESE LINES ####
#### %GENERATED BY CELL% ####
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.167.166.90
NETMASK=255.255.252.0
NETWORK=10.167.164.0
BROADCAST=10.167.167.255
GATEWAY=10.167.164.1
HOTPLUG=no
IPV6INIT=no
HWADDR=00:21:28:46:ef:8a
If there is any inconsistency with the new MAC addresses or IPs, there should be backup files ending in .bak in the same directory; use these together with the ILOM information to update the files with the correct values.

(b) Verify that the management network is working:

# ethtool eth0 | grep det
Link detected: yes

# ipmitool sunoem cli 'show /SP/network' | grep ipadd
ipaddress = 192.168.1.108
pendingipaddress = 192.168.1.108

[root@db01 ~]# ping -c 3 192.168.1.108
PING 192.168.1.108 (192.168.1.108) 56(84) bytes of data.
64 bytes from 192.168.1.108: icmp_seq=1 ttl=64 time=0.625 ms
64 bytes from 192.168.1.108: icmp_seq=2 ttl=64 time=0.601 ms
64 bytes from 192.168.1.108: icmp_seq=3 ttl=64 time=0.606 ms

--- 192.168.1.108 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 3199ms
rtt min/avg/max/mdev = 0.601/0.608/0.625/0.026 ms
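The MAC-address comparison in step (a) can also be scripted. This is a hedged sketch only: it extracts HWADDR from an ifcfg-style file and compares it against the board's reported MAC. The ILOM value is hard-coded here as a stand-in; in practice it would come from `ipmitool sunoem cli "show /SYS/MB/NET0"`.

```shell
# Hedged sketch: compare the HWADDR recorded in an ifcfg file against the MAC
# address the new motherboard reports.
ifcfg_mac() {
  # Extract HWADDR from an ifcfg-style file, lower-cased for comparison.
  awk -F= '$1 == "HWADDR" { print tolower($2) }' "$1"
}

cfg=$(mktemp)
cat > "$cfg" <<'EOF'
DEVICE=eth0
HWADDR=00:21:28:46:ef:8a
EOF

ilom_mac="00:21:28:46:ef:8a"   # stand-in for the ILOM-reported address
if [ "$(ifcfg_mac "$cfg")" = "$ilom_mac" ]; then
  echo "eth0 MAC matches"
else
  echo "eth0 MAC mismatch - update ifcfg from the .bak files"
fi
rm -f "$cfg"
```

Repeat for each ifcfg-ethX / NETx pair listed above.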
(d) Verify that all memory is present in Linux. High Capacity X5-2 storage cells have 96 GB in total; Extreme Flash X5-2 storage cells have 64 GB in total.

# grep MemTotal /proc/meminfo
MemTotal: 65583428 kB (this figure may vary slightly depending on BIOS version)
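The MemTotal figure is in kB and falls a little short of the nominal capacity because of BIOS/firmware reservations. This hedged sketch converts it to GiB and checks it against the expected size; the 3 GB tolerance is an assumption chosen to match the sample value above, not an official threshold.

```shell
# Hedged sketch: check a MemTotal value (kB) against the expected capacity
# (96 GB High Capacity, 64 GB Extreme Flash), allowing for BIOS reservations.
mem_ok() {
  # $1 = MemTotal in kB, $2 = expected size in GB
  awk -v kb="$1" -v gb="$2" 'BEGIN {
    have = kb / 1024 / 1024
    # allow up to ~3 GB short for BIOS/firmware reservations
    exit (have >= gb - 3 && have <= gb) ? 0 : 1
  }'
}

mem_ok 65583428 64 && echo "Extreme Flash memory OK"
```

A value well below the tolerance suggests a DIMM was not seated or recognized after the board swap.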
(e) Verify HW Profile is operating correctly. # /opt/oracle.SupportTools/CheckHWnFWProfile
[SUCCESS] The hardware and firmware matches supported profile for server=ORACLE_SERVER_X5-2
(f) Verify the InfiniBand connections are up and actively seen in the fabric: If possible to login to DB01, then check InfiniBand connections are ok by running the following from DB01: # cd /opt/oracle.SupportTools/ibdiagtools
# ./verify-topology
(options to verify-topology may be required depending on configuration)

If that is not possible for security reasons, then on this local node verify the IB connection status with:

# ibstatus (look for both link ports up and active at 40 Gb/s (4X QDR))
# ibdiagnet (look for any fabric errors that might suggest a link or cabling failure)
# ibnetdiscover (check that all expected switches, DB nodes and cells are visible in the IB fabric)
(g) Verify server functionality per the EIS checklist server and common check sections
(h) If dcli is set up for password-less SSH, the SSH keys need to be updated for the new MAC address. The customer should be able to do this using their root password.
(i) This step is only required on Extreme Flash storage cells. Once the storage cell has booted, check that all the NVMe devices are present:

[root@exdx5-tvp-a-cel3 ~]# nvmecli --identify --all | grep /dev
/***************** NVMe Device /dev/nvme0n1 ******************/
/***************** NVMe Device /dev/nvme1n1 ******************/
/***************** NVMe Device /dev/nvme2n1 ******************/
/***************** NVMe Device /dev/nvme3n1 ******************/
/***************** NVMe Device /dev/nvme4n1 ******************/
/***************** NVMe Device /dev/nvme5n1 ******************/
/***************** NVMe Device /dev/nvme6n1 ******************/
/***************** NVMe Device /dev/nvme7n1 ******************/
[root@exdx5-tvp-a-cel3 ~]# lspci | grep 0953
05:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
07:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
25:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
27:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
86:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
88:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
96:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
98:00.0 Non-Volatile memory controller: Intel Corporation Device 0953 (rev 01)
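Both listings above should show eight entries on an Extreme Flash cell. This hedged sketch simply counts matching lines; here the canned sample from the nvmecli output stands in, and on a live cell you would pipe the nvmecli or lspci output in instead.

```shell
# Hedged sketch: count NVMe namespaces and check the total against the 8
# expected on an Extreme Flash storage cell.
count_lines() { grep -c "$1"; }

nvme_devs='NVMe Device /dev/nvme0n1
NVMe Device /dev/nvme1n1
NVMe Device /dev/nvme2n1
NVMe Device /dev/nvme3n1
NVMe Device /dev/nvme4n1
NVMe Device /dev/nvme5n1
NVMe Device /dev/nvme6n1
NVMe Device /dev/nvme7n1'

n=$(printf '%s\n' "$nvme_devs" | count_lines '/dev/nvme')
[ "$n" -eq 8 ] && echo "all 8 NVMe devices present"
```

The same count applied to `lspci | grep 0953` should also return 8.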
OBTAIN CUSTOMER ACCEPTANCE

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE?

You can now hand the system back to the customer DBA to check that all ASM or DB CRS services can be brought up and are online before obtaining sign-off. This step may take more than 10 minutes to complete, depending on the current load on the database; see the detailed information below. If the customer DBA requires assistance beyond this, direct them to call back the parent SR owner in EEST.

Cell Node Startup Verification:

1. Activate the grid disks:
# cellcli
CellCLI> ALTER GRIDDISK ALL ACTIVE

Then issue the command below; all disks should show 'active':

CellCLI> list griddisk
2. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks. The following is an example of the output early in the activation process. CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_dmorlx8cel01 active ONLINE Yes
DATA_CD_01_dmorlx8cel01 active ONLINE Yes
DATA_CD_02_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes
...etc...
Notice in the above example that RECO_CD_00_dmorlx8cel01 is still SYNCING. Oracle ASM synchronization is only complete when ALL grid disks show asmmodestatus=ONLINE. This process can take some time, depending on how busy the machine is and has been while this server was down for repair.
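The re-sync wait above can be scripted along the same lines as the pre-shutdown check. This hedged sketch parses griddisk status lines (asmmodestatus in the third column, as in the sample above) and succeeds only when no disk is still syncing; on a live cell the input would come from `cellcli -e list griddisk attributes name,status,asmmodestatus`, typically inside a sleep-and-retry loop.

```shell
# Hedged sketch: report whether ASM re-synchronization has finished by
# checking that every grid disk's asmmodestatus (3rd column) is ONLINE.
all_online() {
  awk '$3 != "ONLINE" { pending = 1 } END { exit pending }'
}

sample='DATA_CD_00_dmorlx8cel01 active ONLINE
RECO_CD_00_dmorlx8cel01 active SYNCING'

if printf '%s\n' "$sample" | all_online; then
  echo "ASM synchronization complete"
else
  echo "still syncing - check again in a few minutes"
fi
```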
PARTS NOTE:
7098504 System Board Assembly for Exadata X5-2 Storage Cells (Sun Server X5-2L)

REFERENCE INFORMATION:
Service Manuals:
X5-2L Motherboard Replacement Procedure: MOS Note 1993394.1
MOS Note 1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration
MOS Note ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM
Attachments
This solution has no attachment