Asset ID: 1-71-1993842.1
Update Date: 2018-04-04
Keywords:
Solution Type: Technical Instruction Sure Solution
1993842.1: How to Replace an Exadata X5-2/X6-2 Storage Server Flash F160/F320 Card
Related Items:
- Oracle SuperCluster T5-8 Full Rack
- Oracle SuperCluster M7 Hardware
- Exadata SL6 Hardware
- Zero Data Loss Recovery Appliance X6 Hardware
- Exadata X6-8 Hardware
- Oracle SuperCluster T5-8 Half Rack
- Exadata X5-2 Hardware
- Exadata X5-2 Eighth Rack
- Exadata X5-2 Full Rack
- Exadata X6-2 Hardware
- Exadata X4-8 Hardware
- Exadata X5-2 Quarter Rack
- Zero Data Loss Recovery Appliance X5 Hardware
- Exadata X5-2 Half Rack
- Exadata Cloud at Customer X6-2 Hardware
- Oracle SuperCluster T5-8 Hardware
- Oracle SuperCluster M6-32 Hardware
Related Categories:
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP
Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Action plan applicable only to FRU component so only field engineers need access to this
Applies to:
Exadata X6-8 Hardware - Version All Versions and later
Exadata X4-8 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
Oracle SuperCluster T5-8 Full Rack - Version All Versions and later
Exadata SL6 Hardware - Version All Versions and later
Information in this document applies to any platform.
Goal
Procedure for replacing an Oracle Flash Accelerator F160/F320 NVMe Card in an Exadata Storage Cell without loss of data or Exadata service.
Solution
DISPATCH INSTRUCTIONS:
The following information will be required prior to dispatch of a replacement:
Name/location of storage cell
PCI Slot number of failed card
Image Version (output of "imageinfo -active")
Special Instructions for Dispatch are required for this part.
For Attention of Dispatcher:
The parts required in this action plan may be available as spares owned by the customer, which they received with the Engineered System. (These are sometimes referred to as ride-along spares.)
If parts are not available to meet the customer preferred delivery time/planned end date, then request TAM or field manager to contact the customer, and ask if the customer has parts available, and would be prepared to use them.
If customer spare parts are used, inform the customer that Oracle will replenish the customer part stock as soon as we can. More details on this process can be found in GDMR procedure "Handling Where No Parts Available" step 2: https://ptp.oraclecorp.com/pls/apex/f?p=151:138:38504529393::::DN,BRNID,DP,P138_DLID:2,86687,4,9082,
WHAT SKILLS DOES THE ENGINEER NEED:
The engineer must be Exadata trained and familiar with the storage cells and with replacing hard drives.
TIME ESTIMATE: 60 minutes
TASK COMPLEXITY: 3
FIELD ENGINEER INSTRUCTIONS:
PROBLEM OVERVIEW:
There is a failed Oracle Flash Accelerator F160/F320 Card in an Exadata Storage Server (Cell) that needs replacing.
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:
The Storage Cell containing the failed F160/F320 card is required to be powered off prior to card replacement.
It is expected that the customer's DBA has completed these steps before the engineer arrives to replace the card. The following commands are provided as guidance in case the customer needs assistance checking the status of the system prior to replacement. If the customer or the FSE requires more assistance prior to the physical replacement of the device, EEST/TSC should be contacted.
1. Locate the server in the rack being serviced. The cell server within the rack can usually be determined from the hostname and the known default Exadata server numbering scheme. Exadata Storage Servers are identified by a number 1 through 18, where 1 is the lowest Storage Server in the rack, installed in RU2, counting up to the top of the rack.
Turn on the locate indicator light for easier identification of the server being repaired. If the server number has been identified, the Locate Button on the front panel may be pressed. To turn the light on remotely, use any of the following methods:
From a login to CellCLI on Exadata Storage Servers:
CellCLI> alter cell led on
From a login to the server’s ILOM:
-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'
From a login to the server’s ‘root’ account:
# ipmitool sunoem cli 'set /SYS/LOCATE value=Fast_Blink'
Connected. Use ^D to exit.
-> set /SYS/LOCATE value=Fast_Blink
Set 'value' to 'Fast_Blink'
-> Session closed
Disconnected
2. Determine the active image version of the Exadata Storage Server:
# imageinfo -active
3. Shut down the node for which the Flash F160/F320 card requires replacement.
a) For Extended information on this section check MOS Note:
ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM
This is also documented in the Exadata Owner's Guide in chapter 7 section titled "Maintaining Exadata Storage Servers" subsection "Shutting Down Exadata Storage Server" available on the customer's cell server image in the /opt/oracle/cell/doc directory.
Exadata Documentation is available internally here: https://docs.oracle.com/cd/E80920_01/DBMMN/maintaining-exadata-storage-servers.htm#DBMMN21129
In the following examples, the SQL commands should be run by the customer's DBA prior to the hardware replacement. The field engineer should run them only if the customer directs them to or is unable to run them. The cellcli commands will need to be run as root.
Note the following when powering off Exadata Storage Servers:
- Verify there are no other storage servers with disk faults. Shutting down a storage server while a disk on another server has failed may cause the running database processes and Oracle ASM to crash if both disks in a partner pair are lost when this server’s disks go offline.
- Powering off one Exadata Storage Server with no disk faults in the rest of the rack will not affect running database processes or Oracle ASM.
b) ASM drops a disk once it has been offline for longer than the ASM disk repair timer. Powering off or restarting Exadata Storage Servers can impact database performance if the storage server is offline for longer than this timer. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but may have been changed by the customer. To check this parameter, have the customer log into ASM and run the following query:
SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;
As long as the value is large enough to comfortably cover the time needed for the replacement, there is no need to change it.
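The comparison between the repair timer and the planned outage can be sketched in shell. This is a minimal illustration, not an Oracle tool: the function names repair_time_minutes and repair_window_ok are hypothetical, and it assumes the queried value ends in h/H or m/M (a bare number is treated as hours).

```shell
# Sketch only: convert a disk_repair_time value such as "3.6h" or "90m" to minutes.
# Assumption: the value ends in h/H (hours) or m/M (minutes); a bare number is
# treated as hours.
repair_time_minutes() {
    awk -v v="$1" 'BEGIN {
        unit = tolower(substr(v, length(v), 1))
        if (unit == "h" || unit == "m") {
            num = substr(v, 1, length(v) - 1)
        } else {
            unit = "h"
            num = v
        }
        if (num !~ /^[0-9]+(\.[0-9]+)?$/) exit 1   # reject anything non-numeric
        mins = (unit == "h") ? num * 60 : num + 0
        printf "%d\n", int(mins + 0.5)             # round to the nearest minute
    }'
}

# Succeeds when the repair window is longer than the planned outage in minutes.
# Returns 2 when the value cannot be parsed. Hypothetical helper name.
repair_window_ok() {
    _window=$(repair_time_minutes "$1") || return 2
    [ "$_window" -gt "$2" ]
}
```

For example, "repair_window_ok 3.6h 60" succeeds for the default timer against the 60 minute time estimate given in this document. Allow extra margin for ILOM boot and any firmware update reboot.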
c) If the flash card disks are being used as griddisks, then please refer to Note 1545103.1 for additional specific instructions before continuing.
d) Check if ASM will be OK if the grid disks go OFFLINE.
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
...sample ...
CATALOG_CD_09_zdlx5_tvp_a_cel1 ONLINE Yes
CATALOG_CD_10_zdlx5_tvp_a_cel1 ONLINE Yes
CATALOG_CD_11_zdlx5_tvp_a_cel1 ONLINE Yes
DELTA_CD_00_zdlx5_tvp_a_cel1 ONLINE Yes
DELTA_CD_01_zdlx5_tvp_a_cel1 ONLINE Yes
...repeated for all griddisks....
If one or more disks return asmdeactivationoutcome='No', then wait for some time and repeat this command. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.
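The check above can be scripted by filtering the listing. The following is a minimal sketch, assuming the output has the same layout as the sample (grid disk name first, asmdeactivationoutcome last). The function name griddisks_not_safe is a hypothetical helper, not part of CellCLI.

```shell
# Sketch only: read the output of
#   cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
# on stdin and print the name of every grid disk whose last column is not "Yes".
# Returns 0 only when at least one disk was listed and all of them report "Yes".
griddisks_not_safe() {
    awk '
        NF >= 3 {
            seen++
            if ($NF != "Yes") {   # any other outcome means do not proceed
                print $1
                bad++
            }
        }
        END { if (seen > 0 && bad == 0) exit 0; exit 1 }
    '
}
```

On the cell, pipe the cellcli command shown in step d) into griddisks_not_safe and repeat after a pause until it succeeds with no names printed.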
e) Run the following CellCLI command to inactivate all grid disks on the cell that needs to be powered down for maintenance. This can take 10 minutes or longer.
# cellcli
CellCLI> ALTER GRIDDISK ALL INACTIVE
...sample ...
GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_10_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_11_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_00_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_01_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_02_zdlx5_tvp_a_cel3 successfully altered
...repeated for all griddisks...
f) Run the command below. Once the disks are offline and inactive in ASM, the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.
CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
CATALOG_CD_09_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
CATALOG_CD_10_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
CATALOG_CD_11_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
DELTA_CD_00_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
DELTA_CD_01_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
DELTA_CD_02_zdlx5_tvp_a_cel3 inactive OFFLINE Yes
...repeated for all griddisks...
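As a final gate before shutdown, the four-column listing above can be checked mechanically. This is a sketch under the assumption that the columns appear in the order name, status, asmmodestatus, asmdeactivationoutcome, as in the sample. The function name cell_ready_for_shutdown is hypothetical.

```shell
# Sketch only: read the output of
#   list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
# on stdin. A disk is ready when status is "inactive", asmmodestatus is
# "OFFLINE" or "UNUSED", and asmdeactivationoutcome is "Yes".
# Returns 0 only when at least one disk was listed and all of them are ready.
cell_ready_for_shutdown() {
    awk '
        NF >= 4 {
            seen++
            if ($2 != "inactive" || ($3 != "OFFLINE" && $3 != "UNUSED") || $4 != "Yes") {
                printf "not ready: %s (status=%s, asmmodestatus=%s)\n", $1, $2, $3
                bad++
            }
        }
        END { if (seen > 0 && bad == 0) exit 0; exit 1 }
    '
}
```

Any "not ready" line means the cell should not be powered off yet.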
g) Once all disks are offline and inactive, the customer may shut down the cell using the following command:
# shutdown -hP now
When powering off Exadata Storage Servers, all storage services are automatically stopped.
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
Confirm the slot of the flash card needing replacement.
The Exadata Storage Server based on Oracle Server X5-2L/X6-2L has six PCIe slots. They are numbered 1 through 6 with 1 nearest the Power Supplies, and 6 nearest the outside wall of the chassis (the onboard ports/connectors are located between slots 3 and 4). Slot locations for Flash F160/F320 cards in Exadata Storage Servers are PCIe Slot 1, 2, 4 and 5.
The Oracle Flash Accelerator F160/F320 card does not have any field-serviceable parts. The FRU is the entire card, unlike the Sun Flash Accelerator F20/F20M2 PCIe cards used in previous Exadata versions.
Physical card replacement
Reference links for Service Manual:
X5-2L : ( http://docs.oracle.com/cd/E41033_01/html/E48325/cnpsm.html#scrolltoc )
Remove the flash card
1. Slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arms.
2. Remove both power cables
3. Remove the server top cover
4. Swivel the air baffle into the upright position to allow access to PCIe cards
5. Rotate the PCIe card locking mechanism on the flash card that requires replacement, and then lift up on the PCIe card to disengage it from the motherboard connectors
Install the new flash card
1. Insert the new flash card into the required PCIe slot and rotate the PCIe locking mechanism to secure the PCIe card in place
2. Lower the air baffle to the installed position
3. Install the top cover
4. Re-connect both power cables
Post-Replacement additional steps
1. Once the power cords have been re-attached, slide the server back into the rack.
2. Once the ILOM has booted you will see a slow blink on the green LED for the server. Power on the server by pressing the power button on the front of the unit.
Server Services Startup Validation
As the system boots the hardware/firmware profile will be checked, and either a green "Passed" will be displayed, or a red "Warning" that the check does not match if the firmware on the HBA is different from what the image expects. If the check passes, then the firmware is correct, and the boot will continue up to the OS login prompt. If the check fails, then the firmware will automatically be updated, and a subsequent reboot will occur. Monitor to ensure this occurs properly.
OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:
It is expected that the engineer stay on-site until the customer has given the approval to depart. The following commands are provided as guidance in case the customer needs assistance checking the status of the system following replacement. If the customer or the FSE requires more assistance following the physical replacement of the device, EEST/TSC should be contacted.
After the Flash F160/F320 card has been replaced and its firmware updated if necessary, the Exadata Storage Server should boot up automatically. Once the Exadata Storage Server comes back online, the cell services will start up automatically; however, you will need to reactivate the griddisks as follows:
1. Activate the griddisks:
# cellcli
CellCLI> alter griddisk all active
GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_10_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_11_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_00_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_01_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_02_zdlx5_tvp_a_cel3 successfully altered
...repeated for all griddisks...
2. Verify all disks show 'active':
CellCLI> list griddisk
CATALOG_CD_09_zdlx5_tvp_a_cel3 active
CATALOG_CD_10_zdlx5_tvp_a_cel3 active
CATALOG_CD_11_zdlx5_tvp_a_cel3 active
DELTA_CD_00_zdlx5_tvp_a_cel3 active
DELTA_CD_01_zdlx5_tvp_a_cel3 active
DELTA_CD_02_zdlx5_tvp_a_cel3 active
...repeated for all griddisks...
3. Verify all grid disks have been successfully put online using the following command. Wait until 'asmmodestatus' is in status 'ONLINE' for all grid disks. The following is an example of the output early in the activation process.
CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
CATALOG_CD_09_zdlx5_tvp_a_cel3 active SYNCING Yes
CATALOG_CD_10_zdlx5_tvp_a_cel3 active SYNCING Yes
CATALOG_CD_11_zdlx5_tvp_a_cel3 active SYNCING Yes
DELTA_CD_00_zdlx5_tvp_a_cel3 active SYNCING Yes
DELTA_CD_01_zdlx5_tvp_a_cel3 active SYNCING Yes
DELTA_CD_02_zdlx5_tvp_a_cel3 active SYNCING Yes
...repeated for all griddisks...
Notice in the above example that the grid disks are still in the 'SYNCING' state. Oracle ASM synchronization is only complete when ALL grid disks show asmmodestatus=ONLINE. This process can take some time, depending on how busy the machine is and how busy it was while this server was down for repair.
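Waiting for synchronization can be sketched as a polling loop around the same listing. This is a minimal illustration, assuming the four-column layout shown above. The function names asm_sync_progress and wait_for_asm_online are hypothetical, and the listing command is passed in as arguments so the loop is not tied to one invocation.

```shell
# Sketch only: read the griddisk listing on stdin, print how many disks are
# ONLINE, and return 0 only when every listed disk is ONLINE.
asm_sync_progress() {
    awk '
        NF >= 4 { total++; if ($3 == "ONLINE") online++ }
        END {
            printf "%d of %d grid disks ONLINE\n", online, total
            if (total > 0 && online == total) exit 0
            exit 1
        }
    '
}

# Poll until all grid disks are ONLINE or the attempts run out.
# $1 = maximum attempts, $2 = seconds between attempts, rest = listing command.
wait_for_asm_online() {
    _tries=$1
    _delay=$2
    shift 2
    _i=1
    while [ "$_i" -le "$_tries" ]; do
        if "$@" | asm_sync_progress; then
            return 0
        fi
        if [ "$_i" -lt "$_tries" ]; then
            sleep "$_delay"
        fi
        _i=$((_i + 1))
    done
    return 1
}
```

A possible invocation as root on the cell is "wait_for_asm_online 60 60 cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome", which checks once a minute for up to an hour. The attempt count and interval are arbitrary choices and should be sized to the expected resync time.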
4. If the flash disks were offlined due to 'poor performance' then the LUN faults will need to be manually cleared before the new flash card can be used in the flashcache again. Refer to Note 1306635.1 for additional specific instructions to clear this.
5. If the flashcache was dropped prior to the flash card replacement (i.e., if "CellCLI> drop celldisk all flashdisk force" was run), then the following steps will need to be run in order to recreate the flashcache:
CellCLI> create celldisk all flashdisk
CellCLI> create flashlog all
CellCLI> create flashcache all
6. If WriteBack flashcache mode is enabled, then validate the cell has resumed caching to the flashdisks.
The following will show whether the mode is WriteBack or WriteThrough:
CellCLI> list cell attributes name,flashcachemode
In WriteBack mode, the following should show each griddisk being cached by flash disks:
CellCLI> list griddisk attributes name,status,cachedby
To validate that data is dirty (i.e. present only on flash disks and not yet written to hard disks), run the following:
CellCLI> list metriccurrent fc_by_used,fc_by_dirty
On a busy system, metric "fc_by_dirty" should start increasing in value.
PARTS NOTE:
7090698 [F] 1.6TB Flash Accelerator F160 NVMe Card w/ LP Bracket - B0 Silicon
7307468 [F] 1.6TB Flash Accelerator F160 NVMe Card w/ LP Bracket (B1 Silicon)
7317693 [F] 3.2TB Flash Accelerator F320 NVMe Card
Oracle Exadata X5-2 Storage Cell (X5-2L) - Full Components List (https://mosemp.us.oracle.com/handbook_internal/Systems/Exadata_X5_2_Storagecell/components.html)
REFERENCE INFORMATION:
Exadata Database Machine Documentation:
Exadata Database Machine documentation is available on the Storage Server OS image in /opt/oracle/cell/doc/welcome.html or http://docs.oracle.com/cd/E50790_01/welcome.html
Oracle Server X5-2L Documentation Library (includes Sun Server X5-2L Service Manual) http://docs.oracle.com/cd/E41033_01/index.html
References
NOTE:1188080.1 - Steps to shut down or reboot an Exadata storage cell without affecting ASM
NOTE:1545103.1 - Replacing FlashCards or FDOM's when Griddisks are created on FlashDisk's