Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1681141.1
Update Date:2018-04-10
Keywords:

Solution Type  Technical Instruction Sure

Solution  1681141.1 :   How to Replace a Faulty Flash F20 or F20M2 card in an Exadata Storage Server (V2/X2-2/X2-8)  


Related Items
  • Exadata Database Machine X2-2 Qtr Rack
  • Exadata Database Machine X2-8
  • Exadata Database Machine X2-2 Full Rack
  • SPARC SuperCluster T4-4
  • Exadata Database Machine X2-2 Half Rack
  • Exadata Database Machine X2-2 Hardware
  • SPARC SuperCluster T4-4 Full Rack
  • Exadata Database Machine V2
  • SPARC SuperCluster T4-4 Half Rack
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP


This CAP explains how to service a faulty Flash F20/F20M2 card in an Exadata Storage Server (V2/X2-2/X2-8).

Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Exadata FRU only; Internal only and HW support partners

Applies to:

Exadata Database Machine V2 - Version All Versions and later
SPARC SuperCluster T4-4 Half Rack - Version All Versions and later
Exadata Database Machine X2-2 Hardware - Version All Versions and later
Exadata Database Machine X2-2 Qtr Rack - Version All Versions and later
Exadata Database Machine X2-2 Full Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

Service a faulty Flash F20/F20M2 PCIe card in an Exadata Storage Server (V2/X2-2/X2-8)

Solution


DISPATCH INSTRUCTIONS:
The following information will be required prior to dispatch of a replacement:

  • Type of Exadata (V2, X2-2 or X2-8) / Exadata Storage Expansion Rack / SPARC SuperCluster
  • Type of storage cell/Node (X4275 or X4270M2)
  • Name/location of storage cell
  • PCI Slot number of failed card
  • Type of PCIe card (F20 or F20M2) - refer to Note 1416397.1
  • Image Version (output of "imageinfo -active")

Special Instructions for Dispatch are required for this part.

For Attention of Dispatcher:

The parts required in this action plan may be available as spares owned by the customer, which they received with the Engineered System. (These are sometimes referred to as ride-along spares.)

If parts are not available to meet the customer preferred delivery time/planned end date, then request TAM or field manager to contact the customer, and ask if the customer has parts available, and would be prepared to use them.

If customer spare parts are used, inform the customer that Oracle will replenish the customer part stock as soon as we can. More details on this process can be found in GDMR procedure "Handling Where No Parts Available" step 2: https://ptp.oraclecorp.com/pls/apex/f?p=151:138:38504529393::::DN,BRNID,DP,P138_DLID:2,86687,4,9082,


- WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED:
- TIME ESTIMATE: 90 Minutes
- TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW:
A Sun Flash Accelerator F20/F20M2 PCIe card in an Exadata Storage Server (Cell) has failed and needs replacing.

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The Storage Cell that contains the faulty component(s) needs to be powered off.

It is expected that the customer's DBA has completed these steps before the engineer arrives to replace the card. The following commands are provided as guidance in case the customer needs assistance checking the status of the system prior to replacement. If the customer or the FSE requires more assistance prior to the physical replacement of the device, EEST/TSC should be contacted.

  1. Locate the cell server in the rack being serviced. The cell server within the rack can usually be determined from the hostname and the known default Exadata server numbering scheme. Exadata Storage Servers are identified by a number 1 through 18, where 1 is the lowest Storage Server in the rack, installed in RU2, counting up to the top of the rack.

    Turn the locate indicator light on for easier identification of the server being repaired. If the server number has been identified, the Locate Button on the front panel may be pressed. To turn it on remotely, use any of the following methods:

    From a login to the CellCli on Exadata Storage Servers:

    CellCli> alter cell led on

    From a login to the server’s ILOM:

    -> set /SYS/LOCATE value=Fast_Blink
    Set 'value' to 'Fast_Blink'

    From a login to the server’s ‘root’ account:

    # ipmitool sunoem cli 'set /SYS/LOCATE value=Fast_Blink'
    Connected. Use ^D to exit.
    -> set /SYS/LOCATE value=Fast_Blink
    Set 'value' to 'Fast_Blink'

    -> Session closed
    Disconnected
  2. Determine the active image version of the Exadata Storage Server:

    # imageinfo -active


    This information will be needed to determine if the replacement needs its firmware updated, and should be provided to the Oracle service engineer performing the replacement.

  3. Shut down the node for which the Flash F20/F20M2 PCIe card requires replacement.

    1. For Extended information on this section check MOS Note:
      ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM

      This is also documented in the Exadata Maintenance Guide section titled "Maintaining Exadata Storage Servers" subsection "Shutting Down Exadata Storage Server" available on the customer's cell server image in the /opt/oracle/cell/doc directory or online here:

      https://docs.oracle.com/cd/E80920_01/DBMMN/maintaining-exadata-storage-servers.htm#DBMMN21129

      In the following examples the SQL commands should be run by the Customer's DBA prior to doing the hardware replacement. The cellcli commands will need to be run as root.

      Note the following when powering off Exadata Storage Servers:
      • Verify there are no other storage servers with disk faults. Shutting down a storage server while another disk is failed may cause the running database processes and Oracle ASM to crash if both disks in a partner pair are lost when this server’s disks go offline.

      • Powering off one Exadata Storage Server with no disk faults in the rest of the rack will not affect running database processes or Oracle ASM.

      • All database and Oracle Clusterware processes should be shut down prior to shutting down more than one Exadata Storage Server. Refer to the Exadata Owner’s Guide for details if this is necessary.

    2. ASM drops a disk if it stays offline for longer than the ASM disk repair timer. Powering off or restarting an Exadata Storage Server can therefore impact database performance if the server is not restored before the timer expires. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but it may have been changed by the Customer. To check this parameter, have the Customer log in to ASM and run the following query:

      SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

      As long as the value comfortably exceeds the time needed to replace the components, there is no need to change it.

    3. If the flash disks are being used for griddisks, then please refer to Note 1545103.1 for additional instructions before continuing.
    4. Check if ASM will be OK if the griddisks go OFFLINE.
      # cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
      ...sample ...
      DATA_CD_09_dbm1cel01 ONLINE Yes
      DATA_CD_10_dbm1cel01 ONLINE Yes
      DATA_CD_11_dbm1cel01 ONLINE Yes
      RECO_CD_00_dbm1cel01 ONLINE Yes
      RECO_CD_01_dbm1cel01 ONLINE Yes
      ...repeated for all griddisks....

      If one or more disks return asmdeactivationoutcome='No', then wait for some time and repeat this command. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.

    5. Run the following cellcli command to inactivate all griddisks on the cell that needs to be powered down for maintenance. This can take 10 minutes or longer.

      # cellcli
      ...sample ...
      CellCLI> ALTER GRIDDISK ALL INACTIVE
      GridDisk DATA_CD_00_dbm1cel01 successfully altered
      GridDisk DATA_CD_01_dbm1cel01 successfully altered
      GridDisk DATA_CD_02_dbm1cel01 successfully altered
      GridDisk RECO_CD_00_dbm1cel01 successfully altered
      GridDisk RECO_CD_01_dbm1cel01 successfully altered
      GridDisk RECO_CD_02_dbm1cel01 successfully altered
      ...repeated for all griddisks...
    6. Execute the command below and the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome=Yes for all griddisks once the disks are offline and inactive in ASM.

      CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
      DATA_CD_00_dbm1cel01 inactive OFFLINE Yes
      DATA_CD_01_dbm1cel01 inactive OFFLINE Yes
      DATA_CD_02_dbm1cel01 inactive OFFLINE Yes
      RECO_CD_00_dbm1cel01 inactive OFFLINE Yes
      RECO_CD_01_dbm1cel01 inactive OFFLINE Yes
      RECO_CD_02_dbm1cel01 inactive OFFLINE Yes
      ...repeated for all griddisks...
    7. Once all disks are offline and inactive, the customer may shut down the Cell using the following command:

      # shutdown -hP now

       When powering off Exadata Storage Servers, all storage services are automatically stopped.
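
The readiness check in sub-step 4 and the verification in sub-step 6 above can be scripted rather than read by eye. The following is a minimal shell sketch, not part of the Oracle procedure. The function names check_deactivation_safe and check_offline_inactive are hypothetical, and the sketch assumes the cellcli output is whitespace-separated with the griddisk name in the first column, as in the samples above.

```shell
#!/bin/sh
# Hypothetical helpers; both read cellcli output on standard input.

# Sub-step 4: expects columns name, asmmodestatus, asmdeactivationoutcome.
# Prints every griddisk whose outcome (last column) is not exactly "Yes".
# Returns 0 only if at least one griddisk was read and all report "Yes".
check_deactivation_safe() {
    awk 'NF >= 3 {
             seen++
             if ($NF != "Yes") { print $1; bad++ }
         }
         END {
             if (seen > 0 && bad == 0) exit 0
             exit 1
         }'
}

# Sub-step 6: expects columns name, status, asmmodestatus, asmdeactivationoutcome.
# Prints every griddisk that is not inactive, not OFFLINE/UNUSED, or not "Yes".
# Returns 0 only if at least one griddisk was read and none were printed.
check_offline_inactive() {
    awk 'NF >= 4 {
             seen++
             ok = ($2 == "inactive") && ($3 == "OFFLINE" || $3 == "UNUSED") && ($4 == "Yes")
             if (!ok) { print $1; bad++ }
         }
         END {
             if (seen > 0 && bad == 0) exit 0
             exit 1
         }'
}
```

Assuming the functions have been loaded into a root shell on the cell, they could be used as follows. Any griddisk name printed is one that still blocks the next step:

    # cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome | check_deactivation_safe
    # cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome | check_offline_inactive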

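The DISK_REPAIR_TIME comparison in sub-step 2 above is simple arithmetic: 3.6 hours is 216 minutes, which exceeds the 90-minute time estimate given for this task. A small shell sketch of that conversion follows. The names repair_time_minutes and repair_time_covers are hypothetical, and the sketch assumes the attribute value is a number followed by h or m (a value without a suffix is treated as hours). The planned outage figure is whatever the customer and engineer agree on; it is not a value defined by this procedure.

```shell
#!/bin/sh
# Hypothetical helpers for comparing DISK_REPAIR_TIME with a planned outage.

# Convert a value such as "3.6h" or "240m" to whole minutes (rounded to nearest).
# Assumption: the suffix is h/H or m/M; a value without a suffix is taken as hours.
repair_time_minutes() {
    printf '%s\n' "$1" | awk '{
        v = tolower($1)
        unit = substr(v, length(v), 1)
        if (unit == "h" || unit == "m") {
            num = substr(v, 1, length(v) - 1)
        } else {
            num = v
            unit = "h"
        }
        mins = (unit == "h") ? num * 60 : num + 0
        printf "%d\n", int(mins + 0.5)
    }'
}

# Succeed if the repair time ($1) is at least the planned outage in minutes ($2).
repair_time_covers() {
    [ "$(repair_time_minutes "$1")" -ge "$2" ]
}
```

For example, repair_time_covers 3.6h 90 succeeds, while repair_time_covers 60m 90 fails and indicates that the customer should review the attribute before the cell is shut down.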

WHAT ACTION DOES THE ENGINEER NEED TO TAKE:

Identify the F20/F20M2 PCIe card slot that is at fault, from the fault messages.  It should also be possible to observe the LEDs on the rear of  the Flash F20/F20M2 Card to determine which card is showing a fault.

These are the steps to physically remove and replace the Flash F20/F20M2 PCIe Card:

A. Remove the PCI Riser - that contains the associated Flash F20/F20M2 PCIe Card.

  1. Prepare the server for service.
    1. Power off the server and disconnect the power cord (or cords) from the power supply (or supplies).
    2. Extend the server to the maintenance position.
    3. Attach an antistatic wrist strap.
    4. Remove the top cover. If the top cover is removed before the AC power cords are removed, the ILOM SP will fault for chassis intrusion. This will need to be cleared after restarting ILOM.

  2. Locate the card's position on the riser in the system.

    The Flash F20/F20M2 cards installed in the Storage Cells are located on PCIe Riser 1 (PCIe slots 1 and 4) in the middle of the X4275/X4270M2 server and PCIe Riser 2 (PCIe slots 2 and 5) on the outside wall of the X4275/X4270M2 server.

    1. Remove the back panel PCI crossbar.
      1. Loosen the two captive Phillips screws on the end of the PCI crossbar.
      2. Lift the PCI crossbar up and back to remove it from the chassis.

    2. Remove the PCIe riser from the system.
      1. Loosen the captive screw holding the riser to the motherboard.
      2. Lift up the riser and any PCIe cards that are attached to it as a unit.

B. Remove the faulty Flash F20/F20M2 PCIe Card from the PCIe riser and place it on an anti-static mat. The card nearest the riser PCI card edge connector is slot 1 or 2. The card further away is slot 4 or 5.

C. Remove the Flash Modules (FMODs) from the faulty Flash F20/F20M2 PCIe card. Note which FMOD is located in which slot and re-install each one in the same slot on the new card; otherwise data loss and ASM data corruption may occur.

Note that the F20 and F20M2 cards are laid out differently:

The FMOD locations for F20 are plainly seen on the side of the ESM plastic housing (once the PCI card is removed from the system).

For F20M2, the FMOD labels are a bit trickier to see. Once the PCI card is removed from the system, look below the SAS controller near the PCI card edge connector. They are located on a small white label with arrows, and on the board near each FMOD (though some of those labels are partially obscured).

They are arranged as follows:

F20 Card Faceplate <-- FMOD0 Upper; FMOD1 Lower <-> ESM/SAS controller <-> FMOD2 Lower; FMOD3 Upper <-> 2 SAS ports on rear of card.
F20M2 Card Faceplate <-- FMOD0 Upper; FMOD1 Lower <-> SAS controller <-> FMOD2 Upper; FMOD3 Lower <-> ESM / 1 SAS port on rear of card
  1. Remove the FMODs as follows:
    1. Loosen the 3 clips on the appropriate side retaining the FMODs on the card. 
       - On F20 cards, loosen the 3 screws on the underside of the PCIe card. Do NOT fully remove the screws; this reduces the chance of the clips breaking or being lost.
       - On F20M2 cards, pull open the retaining clips to loosen them.
    2. Carefully lift and remove the upper FMODs, followed by the lower FMODs. On F20 cards, the cable to the ESM should be disconnected from J803 before removing FMOD1 in the lower location.

  2. Install the FMODs on the replacement card.
    1. Carefully install the FMODs in the same slot number they were removed from, ensuring the side of the FMOD engages inside any metal tabs and clip guides. If the card is being changed from a F20 to a F20M2, or F20M2 to F20, then make sure the correct slot numbering location is used, following the chart above and card DOM location markings.
    2. On F20 cards, the ESM should be moved before re-installing FMOD0 in the upper slot - refer to step D below.
    3. Re-engage the clips that retain the FMODs, either tightening the screws (F20) or closing the locking clips (F20M2). If any clips are broken, additional clips should be available in the FRU box for the FMOD or the FRU for the F20/F20M2 PCI card itself. These may have to be ordered separately.

D. Remove and relocate the ESM to the new card. The ESMs on F20 and F20 M2 cards are laid out differently. Follow the appropriate physical procedure for the type of card being serviced.

Note that if the card is being changed from a F20 to a F20M2, or from F20M2 to F20, the ESMs are not interchangeable. Do NOT attempt to use a F20 ESM on a F20 M2 card, or a F20 M2 ESM on a F20 card, or damage may occur to the cable or other parts of the Flash card and adjacent components. If the correct ESM is not available, the card replacement should not be continued until the parts are available.

 

The F20 Card has the ESM located in the centre of the card, with FMODs on either side of it. The assembly part number label is located on the front of the card near the card edge connector between the disk controller and rear FMODs. For F20 card replacement, do the following:

  1. With upper FMOD0 removed, disconnect the cable from connector J803.
  2. Disengage the ESM shroud from the PCIe card:
    1. First, remove the center pin from each retaining pin.
    2. Next, push the outer section of each retaining pin through the card and remove them.
  3. Carefully slide the ESM assembly (the ESM shroud and the ESM) off the faulty card without disturbing any FMODs.
  4. Carefully slide the ESM assembly (the ESM shroud and the ESM) on to the replacement card, without disturbing any FMODs.
  5. Secure the ESM shroud to the replacement PCIe card:
    1. Push the outer section of each retaining pin through the card and ESM shroud to secure it in place.
    2. Push the center pin up from under the card into the outer section to lock it in place.
  6. Route the cable around the FMOD clip post and on top of FMOD1, and connect the cable to connector J803.

The F20M2 Card has the ESM located on the rear of the card next to the SAS cable connector. The assembly part number label is located next to the orange WWN label on the rear side of the card. For F20M2 card replacement, do the following:

  1. Locate the plastic retaining clip for the ESM plastic housing on the rear side of the card.
  2. With a small tool such as the tip of a screwdriver, carefully press the clip down while pushing the housing off the rear end of the PCI card.
  3. Disconnect the ESM cable from connector J803 on the faulty card.
  4. Connect the ESM cable to connector J803 on the replacement card.
  5. Slide the ESM assembly feet carefully onto the board, one into the slotted hole, the other slides onto the end of the PCI card. There should be an audible click when the retaining clip engages in its slot.

E. Reinstall the Flash F20/F20M2 PCIe card back into the PCIe riser.

F. Reinstall the PCIe riser back into the system.

G. Reinstall the PCIe crossbar on the rear of the system.

H. Reinstall the top cover.

I. Plug the AC power cords in and power on the Storage Cell.


OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:


It is expected that the engineer stay on-site until the customer has given the approval to depart.   The following commands are provided as guidance in case the customer needs assistance checking the status of the system following replacement.  If the customer or the FSE requires more assistance following the physical replacement of the device, EEST/TSC should be contacted.

After replacing the Flash F20/F20M2 PCIe card, the Exadata Storage Server should boot up automatically and if necessary it will update the replacement card's firmware to the version currently supported by the image.

Note - if the image version identified in the preparation steps is 11.2.3.2.1 and an MLR patch for firmware later than 1.27.92 has been installed, the replacement card's firmware may or may not update automatically. Use the "/usr/bin/flash_dom -l" command to check that the firmware version of each existing card matches that of the replacement. If it does not, alert the customer that they may need to reapply any MLR patch they previously installed in order to update the new card to the correct firmware.

Once the Exadata Storage Server comes back online, the cell services will start up automatically. However, you will need to reactivate the griddisks as follows:

  1. Activate the griddisks:

    # cellcli
        …    
    CellCLI> alter griddisk all active
    GridDisk DATA_CD_00_dbm1cel01 successfully altered
    GridDisk DATA_CD_01_dbm1cel01 successfully altered
    GridDisk DATA_CD_02_dbm1cel01 successfully altered
    GridDisk RECO_CD_00_dbm1cel01 successfully altered
    GridDisk RECO_CD_01_dbm1cel01 successfully altered
    GridDisk RECO_CD_02_dbm1cel01 successfully altered
    ...etc...
  2. Verify all disks show 'active':

    CellCLI> list griddisk
    DATA_CD_00_dbm1cel01         active
    DATA_CD_01_dbm1cel01         active
    DATA_CD_02_dbm1cel01         active
    RECO_CD_00_dbm1cel01         active
    RECO_CD_01_dbm1cel01         active
    RECO_CD_02_dbm1cel01         active
    ...etc...
  3. Verify all griddisks have been successfully put online using the following command. Wait until 'asmmodestatus' is in status 'ONLINE' for all griddisks. The following is an example of the output early in the activation process.

    CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
    DATA_CD_00_dbm1cel01 active ONLINE Yes
    DATA_CD_01_dbm1cel01 active ONLINE Yes
    DATA_CD_02_dbm1cel01 active ONLINE Yes
    RECO_CD_00_dbm1cel01 active SYNCING Yes
    RECO_CD_01_dbm1cel01 active ONLINE Yes
    ...etc...


    Notice in the above example that 'RECO_CD_00_dbm1cel01' is still in the 'SYNCING' state. Oracle ASM synchronization is complete only when ALL griddisks show ‘asmmodestatus=ONLINE’. This process can take some time, depending on how busy the machine is now and how busy it was while this server was down for repair. (Note: This operation uses Fast Mirror Resync, which does not trigger an ASM rebalance. The resync restores only the extents that would have been written while the disk was offline.)

  4. If the flashcache was dropped prior to the Flash PCIe replacement (i.e., if "CellCLI> drop celldisk all flashdisk force" was run), then the following steps will need to be run in order to recreate the flashcache:

    CellCLI> create celldisk all flashdisk
    CellCLI> create flashlog all
    CellCLI> create flashcache all

    Note: If you are running an image version prior to 11.2.2.4, DO NOT run the 'create flashlog all' operation as this feature was introduced in the 11.2.2.4 release.

  5. If WriteBack flashcache mode is enabled, then validate the cell has resumed caching to the flashdisks.
    The following will show whether the mode is WriteBack or WriteThrough:
    CellCLI> list cell attributes name,flashcachemode

    In WriteBack mode, the following should show each griddisk being cached by 4 flash disks:

    CellCLI> list griddisk attributes name,status,cachedby

    To validate that data is dirty (i.e. only on flash disks and not yet on hard disks), run the following:

    CellCLI> list metriccurrent fc_by_used,fc_by_dirty

    On a busy system, metric "fc_by_dirty" should start increasing in value.
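
The wait described in step 3 above (all griddisks 'active' with asmmodestatus 'ONLINE') can be checked with a short shell helper. This is a sketch only: resync_complete is a hypothetical name, and it assumes the four-column, whitespace-separated output shown in the step 3 sample.

```shell
#!/bin/sh
# Hypothetical helper; reads on standard input the output of
#   list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
# Prints each griddisk that is not yet active/ONLINE with its asmmodestatus,
# then a one-line summary. Returns 0 only when every griddisk is active and ONLINE.
resync_complete() {
    awk 'NF >= 3 {
             total++
             if ($2 == "active" && $3 == "ONLINE") online++
             else print $1, $3
         }
         END {
             printf "%d of %d griddisks ONLINE\n", online, total
             if (total > 0 && online == total) exit 0
             exit 1
         }'
}
```

Fed the sample output from step 3, the helper prints "RECO_CD_00_dbm1cel01 SYNCING" followed by "4 of 5 griddisks ONLINE" and returns a non-zero status. A polling loop such as the following could then be left running as root; the 60-second interval is an arbitrary choice:

    # until cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome | resync_complete; do sleep 60; done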

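The version rule in the note under step 4 above (no 'create flashlog all' on images earlier than 11.2.2.4) amounts to a field-by-field comparison of dotted version numbers. The sketch below illustrates it. The function names are hypothetical, and the image version is passed in by the caller as a dotted numeric string; how it is extracted from the imageinfo output is deliberately left out.

```shell
#!/bin/sh
# Hypothetical helpers for the flashlog version rule.

# Succeed if dotted version $1 is lower than dotted version $2.
# Fields are compared numerically from left to right; missing fields count as 0.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1
    }'
}

# Print the CellCLI commands from step 4 that apply to image version $1.
# "create flashlog all" is omitted for images earlier than 11.2.2.4.
recreate_flash_commands() {
    echo "create celldisk all flashdisk"
    if ! version_lt "$1" 11.2.2.4; then
        echo "create flashlog all"
    fi
    echo "create flashcache all"
}
```

For example, recreate_flash_commands 11.2.2.3.5 prints only the celldisk and flashcache commands, while recreate_flash_commands 11.2.3.2.1 prints all three. The sketch only prints the commands; running them in CellCLI remains a manual step.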

PARTS NOTE:

511-1275 Flash F20 PCI Express Board (superseded by 511-1500)

511-1500 Flash F20 PCI Express Board

371-4650 Flash F20 Energy Storage Module (ESM)

541-4416 Flash F20 M2 PCI Express Board

371-4953 Flash F20 M2 Energy Storage Module (ESM)


REFERENCE INFORMATION:

Sun Flash Accelerator F20 PCIe Card User’s Guide - https://docs.oracle.com/cd/E19682-01/index.html
How to Shutdown a Storage Cell for Service - Note 1188080.1

Aura (F20) Hardware and Software Troubleshooting Document - Note 1285796.1
How to identify which Flash card is installed in an Exadata Storage cell and order the correct FRU - Note 1416397.1

References

<NOTE:1545103.1> - Replacing FlashCards or FDOM's when Griddisks are created on FlashDisk's
<NOTE:1681165.1> - How to Replace a Faulty ESM on a Flash F20 or F20M2 card in an Exadata Storage Server (V2/X2-2/X2-8) [VCAP]
<NOTE:1381209.1> - How to Replace a Faulty FMOD on a Flash F20 or F20M2 card in an Exadata Storage Server (V2/X2-2/X2-8)
<NOTE:1416397.1> - How to Identify Which Flash card is Installed in an Exadata Storage Cell and Order the Correct FRU
<NOTE:1188080.1> - Steps to shut down or reboot an Exadata storage cell without affecting ASM

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.