Asset ID: 1-71-1984957.1
Update Date: 2017-10-02
Solution Type: Technical Instruction
Document ID: 1984957.1
How to Replace an Exalytics X4-4 F80 Flash Accelerator PCIe Card
Related Items:
- Flash Accelerator F80 PCIe Card
- Exalytics In-Memory Machine X4-4
Related Categories:
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP
Oracle Confidential PARTNER - Available to partners (SUN)
Reason: internal support doc
Applies to:
Exalytics In-Memory Machine X4-4 - Version All Versions to All Versions [Release All Releases]
Flash Accelerator F80 PCIe Card - Version All Versions to All Versions [Release All Releases]
x86_64
Goal
How to Replace an F80 Flash Accelerator PCIe Card in an Oracle Exalytics X4-4 system
Solution
CAP PROBLEM OVERVIEW: F80 Flash PCIe Card replacement
DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE ENGINEER NEED:
Oracle Exalytics Server Training
TIME ESTIMATE: 60 minutes
TASK COMPLEXITY: 3-FRU
FIELD ENGINEER INSTRUCTIONS
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:
If the system is still up and functioning, the customer should be ready to perform an orderly and graceful shutdown of applications and OS. Access to the system's OS root login may be needed if the flash card failure needs to be confirmed.
A data backup is not a prerequisite but is a wise precaution.
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
1. Check the Flash card status and confirm/identify the failed card.
- Check the status of the flash cards using the exalytics_CheckFlash.sh script. The following example shows a failure of one of the devices on Flash card 1 (output from the other cards has been cut for brevity).
[root@exalytics0 ~]# /opt/exalytics/bin/exalytics_CheckFlash.sh
Checking Exalytics Flash Drive Status
Fetching some info on installed flash drives ....
Driver version : 01.250.41.04 (2012.06.04)
Supported number of flash drives detected (3)
Flash card 1 :
Overall health status : ERROR. Use --detail for more info
Size (in MB) : 572202
Capacity (in bytes) : 600000000000
Firmware Version : 109.05.26.00
Devices: /dev/sde /dev/sdc /dev/sdf
:
---cut---
:
Raid Array Info (/dev/md0):
/dev/md0: 1117.59GiB raid0 6 devices, 0 spares. Use mdadm --detail for more detail.
/dev/md0: No md super block found, not an md component.
Summary:
Healthy flash drives : 2
Broken flash drives : 1
Fail : Flash card health check failed. See above for more details.
The script will report "ERROR" on the health status line for the card that has experienced a failure. In the above example, device /dev/sdd has failed and is no longer seen by Flash card 1. Make note of the devices assigned to this card so that they can be checked against the devices assigned to the replacement card later.
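The check above can be scripted. The sketch below greps a saved copy of the exalytics_CheckFlash.sh output for the ERROR status line; the sample text stands in for a live run of the script.

```shell
# Sketch: locate the failed card in saved exalytics_CheckFlash.sh output.
# The sample text below stands in for a real run of the script.
cat > /tmp/checkflash.out <<'EOF'
Flash card 1 :
Overall health status : ERROR. Use --detail for more info
Devices: /dev/sde /dev/sdc /dev/sdf
Flash card 2 :
Overall health status : GOOD
Devices: /dev/sdg /dev/sdh /dev/sdi /dev/sdj
EOF

# Which card reported ERROR (the line above the status line names the card):
grep -B1 'ERROR' /tmp/checkflash.out | head -1
# Devices still visible on that card (note these for the post-replacement check):
grep -A1 'ERROR' /tmp/checkflash.out | grep '^Devices:'
```

On a live system, pipe the script output into the same greps instead of using a sample file.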
- A failed Flash card should have its status LED lit amber or red. The status LED is the middle LED on the rear of the card (the top LED is the Life LED and the bottom is the Activity LED). Check the rear of the system to confirm that the failed card can be identified by an amber/red status LED; on a normal/non-failed card this LED is solid green. If the failed card can be identified by its status LED, make note of its location and proceed to the next step to perform the physical replacement of the card.
- If the card has failed in such a way that the status LED is not showing a fault, the card to be replaced will need to be identified manually. In an Exalytics X4-4 system the Flash cards populate PCIe slots 6, 7, and 10. The first check above should have identified which flash card has failed; use that information to identify the card to be replaced by matching its ID number to the list of flash cards in the "ddcli -listall" command output to get the PCI Address.
[root@exalytics0 ~]# /opt/exalytics/flashUtil/ddcli -listall
****************************************************************************
LSI Corporation WarpDrive Management Utility
Version 107.00.00.04 (2012.06.05)
Copyright (c) 2011 LSI Corporation. All Rights Reserved.
****************************************************************************
ID WarpDrive Package Version PCI Address
-- --------- --------------- -----------
1 ELP-4x200-4d-n 09.05.33.00 00:41:00:00
2 ELP-4x200-4d-n 09.05.33.00 00:90:00:00
3 ELP-4x200-4d-n 09.05.33.00 00:c1:00:00
LSI WarpDrive Management Utility: Execution completed successfully.
Use this output list to confirm the physical PCIe slot to be replaced by matching the PCI Address for the Flash card identified above to the list below.
ID PCI Address Physical Slot
-- ----------- -------------
1 00:41:00:00 slot 6
2 00:90:00:00 slot 7
3 00:c1:00:00 slot 10
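This fixed mapping can be captured in a small helper. The sketch below is illustrative only; pci_to_slot is a hypothetical name and the table is specific to the X4-4 slot layout shown above.

```shell
# Hypothetical helper: translate a WarpDrive PCI Address (as shown by
# "ddcli -listall") into the physical PCIe slot on an Exalytics X4-4.
pci_to_slot() {
  case "$1" in
    00:41:00:00) echo "slot 6"  ;;
    00:90:00:00) echo "slot 7"  ;;
    00:c1:00:00) echo "slot 10" ;;
    *)           echo "unknown PCI address: $1" ;;
  esac
}

pci_to_slot 00:90:00:00   # prints "slot 7"
```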
- You can also use the "locate" sub-command of ddcli to identify the card. This causes the status LED to blink for a couple of minutes so that the card can be identified physically. The following example turns on the locate feature for Flash card 2, which is listed as "2 ELP-4x200-4d-n 09.05.33.00 00:90:00:00" and is physically located in PCIe slot 7.
[root@exalytics0 ~]# /opt/exalytics/flashUtil/ddcli -c 2 -locate on
****************************************************************************
LSI Corporation WarpDrive Management Utility
Version 107.00.00.04 (2012.06.05)
Copyright (c) 2011 LSI Corporation. All Rights Reserved.
****************************************************************************
LSI WarpDrive Management Utility: Execution completed successfully.
- Once the physical location of the PCIe card to be replaced has been identified, proceed to the replacement steps.
2. Prepare the server for service.
- Power off the server and disconnect the power cords from the power supplies.
- Extend the server to the maintenance position in the rack.
- Attach an anti-static wrist strap.
- Remove the top cover.
3. Locate and Remove the PCIe card.
- The X4-4 server has eleven PCIe slots. They are numbered 1 through 11 from left to right when you view the server from the rear (the onboard ports/connectors are located between slots 6 and 7).
- Identify the location of the PCIe slot that contains the failed Flash card using the previous steps.
- Disengage the PCIe slot crossbar from its locked position and rotate it into its upright position.
- Carefully remove the Flash PCIe card from the PCIe card slot by lifting it straight up from its connector.
- Place the PCIe card on an antistatic mat.
4. Install the replacement Flash PCIe card.
- Remove the replacement Flash PCIe card from its anti-static bag and place it on an anti-static mat.
- Make sure to re-install the card into the same location from which the previous card was removed.
- Insert the PCIe card into the correct slot.
- Return the PCIe card slot crossbar to its closed and locked position to secure the PCIe cards in place.
5. Return the Server to operation
- Replace the top cover
- Remove any anti-static measures that were used.
- Return the server to its normal operating position within the rack.
- Re-install the AC power cords and any data cables that were removed.
- Power on the server. Verify that the Power/OK indicator LED is lit steady on.
- Allow the system to boot into the OS.
6. Confirm replacement card is healthy and identify the raid configuration type.
- After the system boots into the OS with the replacement Flash card installed, the status LED for the new card should be lit green. Physically check to make sure the status LED of the new card is solid green.
- Execute the /opt/exalytics/bin/exalytics_CheckFlash.sh script to check on the status of the Flash cards and to see what devices are mapped to the newly replaced card. Confirm that the system now reports all 3 cards as GOOD/Healthy.
[root@exalytics0 ~]# /opt/exalytics/bin/exalytics_CheckFlash.sh
Checking Exalytics Flash Drive Status
Fetching some info on installed flash drives ....
Driver version : 01.250.41.04 (2012.06.04)
Supported number of flash drives detected (3)
Flash card 1 :
Overall health status : GOOD
Size (in MB) : 762936
Capacity (in bytes) : 800000000000
Firmware Version : 109.05.26.00
Devices: /dev/sdc /dev/sdd /dev/sde /dev/sdf
Flash card 2 :
Overall health status : GOOD
Size (in MB) : 762936
Capacity (in bytes) : 800000000000
Firmware Version : 109.05.26.00
Devices: /dev/sdi /dev/sdg /dev/sdh /dev/sdj
Flash card 3 :
Overall health status : GOOD
Size (in MB) : 762936
Capacity (in bytes) : 800000000000
Firmware Version : 109.05.26.00
Devices: /dev/sdk /dev/sdl /dev/sdm /dev/sdn
Raid Array Info (/dev/md0):
/dev/md0: 1490.12GiB raid5 3 devices, 0 spares. Use mdadm --detail for more detail.
/dev/md0: No md super block found, not an md component.
Summary:
Healthy flash drives : 3
Broken flash drives : 0
Pass : Flash card health check passed
- At this point the card has been replaced and confirmed to be working properly at the hardware level. If the system is a "bare-metal" install and the flash is configured as a Raid10 or Raid05 created by the config_flash.sh script (as done during a normal EIS install), follow the steps below to bring the new card back into use by the SW raid array. If the system is virtualized or is not using a standard raid configuration, the following steps do not apply and should not be followed; the HW replacement is now complete, and the system administrator will need to take care of putting the new card back into use for virtualized and non-standard raid configurations.
- To identify the Raid configuration type, check the "Raid Array Info" section near the bottom of the exalytics_CheckFlash.sh output. A Raid10 configuration will show /dev/md0 as a Raid0 with 6 devices, and a Raid05 configuration will show /dev/md0 as a Raid5 with 3 devices.
- For a Raid10 configuration follow the steps in section 7A
- For a Raid05 configuration follow the steps in section 7B.
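The layout check above can be automated by matching the /dev/md0 summary line. In this sketch a sample line stands in for the live exalytics_CheckFlash.sh output.

```shell
# Sketch: classify the flash RAID layout from the /dev/md0 summary line.
# The sample line stands in for live exalytics_CheckFlash.sh output.
line='/dev/md0: 1490.12GiB raid5 3 devices, 0 spares.'
case "$line" in
  *'raid0 6 devices'*) echo "Raid10 layout - follow section 7A" ;;
  *'raid5 3 devices'*) echo "Raid05 layout - follow section 7B" ;;
  *)                   echo "non-standard layout - do not proceed" ;;
esac
```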
7A. Raid10 restoration steps.
- Looking at the output of the exalytics_CheckFlash.sh script we can see that the system has assigned the same devices to the four flash drives on the replaced card as were mapped to the original card (/dev/sdc /dev/sdd /dev/sde /dev/sdf). This is normal and expected but be aware that the Operating System may map new/different devices to the flash card. If this happens you will need to recreate the RAID using the new devices as listed. Compare to the original output from step 1 to confirm if the devices are the same or are now different.
- Since the replaced Flash card contains four flash devices, the SW Raid will now show four degraded Raid1 devices (one for each flash module on the replaced card). Check the /proc/mdstat file to see the SW Raid status for the arrays made up by the flash devices. Each md device is listed with the devices it includes. We should see four md devices that have only a single sd device listed; the second line for these devices will end with something similar to [2/1] [_U], showing that only one device of the mirror is attached.
[root@exalytics0 ~]# cat /proc/mdstat
Personalities : [raid1] [raid0]
md6 : active raid1 sdj[0] sdm[1]
195312384 blocks [2/2] [UU]
md5 : active raid1 sdg[0] sdn[1]
195312384 blocks [2/2] [UU]
md4 : active raid1 sdl[1]
195312384 blocks [2/1] [_U] <<<<<<<<<
md3 : active raid1 sdk[1]
195312384 blocks [2/1] [_U] <<<<<<<<<
md2 : active raid1 sdi[1]
195312384 blocks [2/1] [_U] <<<<<<<<<
md1 : active raid1 sdh[1]
195312384 blocks [2/1] [_U] <<<<<<<<<
md0 : active raid0 md1[0] md6[5] md5[4] md4[3] md3[2] md2[1]
1171873920 blocks 64k chunks
unused devices: <none>
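Picking the degraded mirrors out of mdstat can be mechanized with a short awk one-liner. The sample text below stands in for /proc/mdstat on a live system.

```shell
# Sketch: list raid1 arrays that are missing a mirror half by matching
# the "[2/1]" pattern. The sample text stands in for /proc/mdstat.
cat > /tmp/mdstat.sample <<'EOF'
md5 : active raid1 sdg[0] sdn[1]
      195312384 blocks [2/2] [UU]
md4 : active raid1 sdl[1]
      195312384 blocks [2/1] [_U]
md3 : active raid1 sdk[1]
      195312384 blocks [2/1] [_U]
EOF

# Remember the last "mdN :" header seen, print it on a degraded line:
awk '/^md/ {dev=$1} /\[2\/1\]/ {print dev}' /tmp/mdstat.sample
```

On a live system run the same awk directly against /proc/mdstat.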
- For each of the md devices missing a drive we need to add the drive back to the mirror device. In this example devices md1, md2, md3, md4 need to be fixed. Check the /etc/mdadm.conf file to see what the correct configuration should be.
[root@exalytics0 ~]# cat /etc/mdadm.conf
ARRAY /dev/md1 level=raid1 num-devices=2 metadata=0.90 UUID=c37f9932:98627153:9000c110:fe4f17f1
devices=/dev/sdc,/dev/sdh
ARRAY /dev/md2 level=raid1 num-devices=2 metadata=0.90 UUID=4f2a82ff:a2a0e677:4cbf5fa6:4d371f8c
devices=/dev/sde,/dev/sdi
ARRAY /dev/md3 level=raid1 num-devices=2 metadata=0.90 UUID=8b843386:388c163f:f1f2eef1:0cff5de6
devices=/dev/sdd,/dev/sdk
ARRAY /dev/md4 level=raid1 num-devices=2 metadata=0.90 UUID=47319bbc:0c8757f6:480f8991:b1979dcc
devices=/dev/sdf,/dev/sdl
ARRAY /dev/md5 level=raid1 num-devices=2 metadata=0.90 UUID=0f927e1b:480c143c:7b32f73a:4d3ef7ea
devices=/dev/sdg,/dev/sdn
ARRAY /dev/md6 level=raid1 num-devices=2 metadata=0.90 UUID=04414596:2b5dbee6:57fdada8:0342dd21
devices=/dev/sdj,/dev/sdm
ARRAY /dev/md0 level=raid0 num-devices=6 metadata=0.90 UUID=a3b13b19:3fa10c4e:5efaae7f:5ab85d70
devices=/dev/md1,/dev/md2,/dev/md3,/dev/md4,/dev/md5,/dev/md6
- In our example we match the four md devices to the sd devices they should contain so that we can add the correct sd device to the correct md. Comparing the mdstat output to the mdadm.conf file shows what each md device should contain versus what it currently contains: we need to add sdc to md1, sde to md2, sdd to md3, and sdf to md4, because these are the devices missing their second disk. The four md devices should contain:
/dev/md1 - /dev/sdc, /dev/sdh
/dev/md2 - /dev/sde, /dev/sdi
/dev/md3 - /dev/sdd, /dev/sdk
/dev/md4 - /dev/sdf, /dev/sdl
- Use the mdadm --add command to add the replaced device to each of the four md devices missing their drives:
[root@exalytics0 ~]# mdadm /dev/md1 --add /dev/sdc
mdadm: added /dev/sdc
[root@exalytics0 ~]# mdadm /dev/md2 --add /dev/sde
mdadm: added /dev/sde
[root@exalytics0 ~]# mdadm /dev/md3 --add /dev/sdd
mdadm: added /dev/sdd
[root@exalytics0 ~]# mdadm /dev/md4 --add /dev/sdf
mdadm: added /dev/sdf
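The four --add commands above can also be expressed as a loop over md:device pairs taken from the mdadm.conf comparison. This sketch echoes each command as a dry run rather than executing it; on a live system, drop the echo once the pairs are confirmed.

```shell
# Sketch: the four mdadm --add commands as a dry-run loop. The md:sd
# pairs come from comparing /etc/mdadm.conf with /proc/mdstat; this
# example echoes each command instead of running it.
for pair in md1:sdc md2:sde md3:sdd md4:sdf; do
  echo "mdadm /dev/${pair%%:*} --add /dev/${pair##*:}"
done
```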
- After adding the devices we can check the /proc/mdstat file to confirm that they were added and are now being rebuilt.
[root@exalytics0 ~]# cat /proc/mdstat
Personalities : [raid1] [raid0]
md6 : active raid1 sdj[0] sdm[1]
195312384 blocks [2/2] [UU]
md5 : active raid1 sdg[0] sdn[1]
195312384 blocks [2/2] [UU]
md4 : active raid1 sdf[2] sdl[1]
195312384 blocks [2/1] [_U]
[>....................] recovery = 1.7% (3487296/195312384) finish=15.5min speed=205135K/sec
md3 : active raid1 sdd[2] sdk[1]
195312384 blocks [2/1] [_U]
[>....................] recovery = 2.7% (5404416/195312384) finish=15.8min speed=200163K/sec
md2 : active raid1 sde[2] sdi[1]
195312384 blocks [2/1] [_U]
[>....................] recovery = 3.7% (7400064/195312384) finish=15.1min speed=206668K/sec
md1 : active raid1 sdc[2] sdh[1]
195312384 blocks [2/1] [_U]
[=>...................] recovery = 5.2% (10182528/195312384) finish=15.0min speed=205554K/sec
md0 : active raid0 md1[0] md6[5] md5[4] md4[3] md3[2] md2[1]
1171873920 blocks 64k chunks
unused devices: <none>
- The rebuild time will vary depending on the device sizes and system activity. If the system is actively using the flash then the rebuild time will be extended. After confirming that the recovery for each device finished successfully the Raid restoration is complete.
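Rebuild progress can be pulled out of the mdstat recovery lines with grep. The sample line below stands in for /proc/mdstat during a rebuild.

```shell
# Sketch: extract the rebuild progress figures from mdstat output.
# The sample line stands in for /proc/mdstat during a rebuild.
cat > /tmp/mdstat.rebuild <<'EOF'
      [>....................]  recovery =  1.7% (3487296/195312384) finish=15.5min speed=205135K/sec
EOF

grep -o 'recovery = *[0-9.]*%' /tmp/mdstat.rebuild
grep -o 'finish=[0-9.]*min' /tmp/mdstat.rebuild
```

Re-running the same greps against /proc/mdstat periodically gives a quick view of how far each mirror has resynced.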
7B. Raid05 restoration steps.
- Looking at the output of the exalytics_CheckFlash.sh script we can see that the system has assigned the same devices to the four flash drives on the replaced card as were mapped to the original card (/dev/sdc /dev/sdd /dev/sde /dev/sdf). This is normal and expected but be aware that the Operating System may map new/different devices to the flash card. If this happens you will need to recreate the RAID using the new devices as listed. Compare to the original output from step 1 to confirm if the devices are the same or are now different.
- Check the /proc/mdstat file to see the SW Raid status.
[root@exalytics0 ~]# cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md3 : active raid0 sdk[0] sdl[3] sdn[2] sdm[1]
781249536 blocks 64k chunks
md2 : active raid0 sdg[0] sdj[3] sdi[2] sdh[1]
781249536 blocks 64k chunks
md0 : active raid5 md2[1] md3[2]
1562498944 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
unused devices: <none>
- Since the replaced Flash card contained all four flash disks that made up one of the Raid0 devices, the Raid5 device (/dev/md0) shows that one of its three devices is now missing. /dev/md0 should show something similar to [3/2] [_UU] at the end of its output, indicating that only 2 of the 3 devices are attached. md0 should be made up of md1, md2, and md3, but in our example md1 is missing and will need to be re-created. (If your system is missing a different device, adjust your commands to use the md device that is missing.) Using the devices listed in the exalytics_CheckFlash.sh output as discussed above, re-create the raid0 md device using mdadm --create. (If the flash card used for replacement was previously set up in a raid configuration, you may see a warning that the device was previously part of another raid device; if so, reply 'y' when asked whether to continue creating the array.)
[root@exalytics0 ~]# mdadm /dev/md1 --create --raid-devices=4 --level=0 /dev/sdc /dev/sdd /dev/sde /dev/sdf
mdadm: array /dev/md1 started.
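The identification of the missing raid0 leg described above can be scripted by checking which of md1, md2, md3 appears in md0's member list. The sample line stands in for the md0 entry in /proc/mdstat.

```shell
# Sketch: determine which raid0 leg is absent from /dev/md0. The
# sample line stands in for the md0 entry in /proc/mdstat.
md0_line='md0 : active raid5 md2[1] md3[2]'
for leg in md1 md2 md3; do
  case "$md0_line" in
    *"$leg"*) : ;;                       # leg present, nothing to do
    *)        echo "missing leg: $leg" ;;
  esac
done
```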
- After the Raid0 device has been created it then needs to be added to the Raid5 device. The Raid5 device should be /dev/md0 and from our example /dev/md1 is the new device to be added. (adjust the command as needed for your configuration)
[root@exalytics0 ~]# mdadm /dev/md0 --add /dev/md1
mdadm: added /dev/md1
- After adding the device we can check the /proc/mdstat file to confirm that it was added and /dev/md0 is now being rebuilt.
[root@exalytics0 ~]# cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md3 : active raid0 sdk[0] sdl[3] sdn[2] sdm[1]
781249536 blocks 64k chunks
md1 : active raid0 sdf[3] sde[2] sdd[1] sdc[0]
781249536 blocks 64k chunks
md2 : active raid0 sdg[0] sdj[3] sdi[2] sdh[1]
781249536 blocks 64k chunks
md0 : active raid5 md1[3] md2[1] md3[2]
1562498944 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
[>....................] recovery = 0.1% (1486380/781249472) finish=61.2min speed=212340K/sec
unused devices: <none>
- Since the md1 device was newly created, it now has a different UUID from the one previously recorded by the system, so the /etc/mdadm.conf file will need to be re-created. Use "mdadm --detail --scan --verbose" to regenerate the file, then cat the file to check that it was properly created:
[root@exalytics0 ~]# mdadm --detail --scan --verbose > /etc/mdadm.conf
[root@exalytics0 ~]# cat /etc/mdadm.conf
ARRAY /dev/md2 level=raid0 num-devices=4 metadata=0.90 UUID=9f4bc73e:41cd9df8:9711f03c:ffeb2aa1
devices=/dev/sdg,/dev/sdh,/dev/sdi,/dev/sdj
ARRAY /dev/md1 level=raid0 num-devices=4 metadata=0.90 UUID=e4d1429f:4a93e2b0:a8dd07ca:1b71d281
devices=/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf
ARRAY /dev/md3 level=raid0 num-devices=4 metadata=0.90 UUID=89a0bee7:c303d748:fe365ff0:439c4131
devices=/dev/sdk,/dev/sdm,/dev/sdn,/dev/sdl
ARRAY /dev/md0 level=raid5 num-devices=3 metadata=0.90 spares=1 UUID=11378895:dec1e684:05ffa13f:b6096c0f
devices=/dev/md1,/dev/md2,/dev/md3
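A quick sanity check on the regenerated file is to count its ARRAY definitions: the Raid05 layout should define three raid0 legs plus the raid5 container. The sample file below stands in for /etc/mdadm.conf.

```shell
# Sketch: sanity-check a regenerated mdadm.conf - the Raid05 layout
# should define three raid0 legs plus the raid5 container (4 ARRAYs).
# The sample file stands in for /etc/mdadm.conf.
cat > /tmp/mdadm.conf.sample <<'EOF'
ARRAY /dev/md2 level=raid0 num-devices=4 metadata=0.90 UUID=9f4bc73e:41cd9df8:9711f03c:ffeb2aa1
ARRAY /dev/md1 level=raid0 num-devices=4 metadata=0.90 UUID=e4d1429f:4a93e2b0:a8dd07ca:1b71d281
ARRAY /dev/md3 level=raid0 num-devices=4 metadata=0.90 UUID=89a0bee7:c303d748:fe365ff0:439c4131
ARRAY /dev/md0 level=raid5 num-devices=3 metadata=0.90 UUID=11378895:dec1e684:05ffa13f:b6096c0f
EOF

grep -c '^ARRAY' /tmp/mdadm.conf.sample   # prints 4
```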
- The rebuild time will vary depending on the device sizes and system activity. If the system is actively using the flash then the rebuild time will be extended. After confirming the recovery for each device finished successfully the Raid restoration is complete.
OBTAIN CUSTOMER ACCEPTANCE
WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:
Boot up the system and verify full functionality.
REFERENCE INFORMATION:
Oracle Exalytics In-Memory Machine Documentation Library
https://docs.oracle.com/cd/E56045_01/index.htm
Sun Server X4-4 Documentation
http://docs.oracle.com/cd/E38212_01/index.html
Attachments
This solution has no attachment