Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1381209.1
Update Date: 2018-04-10

Solution Type: Technical Instruction

Solution 1381209.1: How to Replace a Faulty FMOD on a Flash F20 or F20M2 Card in an Exadata Storage Server (V2/X2-2/X2-8)


Related Items
  • Exadata Database Machine X2-2 Qtr Rack
  • Oracle Platinum Services
  • SPARC SuperCluster T4-4
  • Exadata Database Machine X2-8
  • Exadata Database Machine X2-2 Full Rack
  • Exadata Database Machine X2-2 Half Rack
  • Exadata Database Machine X2-2 Hardware
  • SPARC SuperCluster T4-4 Full Rack
  • Exadata Database Machine V2
  • SPARC SuperCluster T4-4 Half Rack
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP


This CAP explains how to replace a faulty FMOD on a Flash F20/F20M2 card in an Exadata Storage Server (V2/X2-2/X2-8).

In this Document
Goal
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Exadata FRU only; Internal only and HW support partners

Applies to:

Exadata Database Machine X2-2 Hardware - Version All Versions and later
Exadata Database Machine X2-2 Half Rack - Version All Versions and later
SPARC SuperCluster T4-4 Full Rack - Version All Versions and later
Exadata Database Machine V2 - Version All Versions and later
SPARC SuperCluster T4-4 - Version All Versions and later
Information in this document applies to any platform.

Goal

Replace a faulty FMOD on a Flash F20/F20M2 card in an Exadata Storage Server (V2/X2-2/X2-8).

Solution


DISPATCH INSTRUCTIONS:
The following information will be required prior to dispatch of a replacement:

  • Type of Exadata (V2, X2-2 or X2-8) / Exadata Storage Expansion Rack / SPARC SuperCluster
  • Type of storage cell/Node (X4275 or X4270M2)
  • Name/location of storage cell
  • PCI Slot and FMOD number of failed card
  • Image Version (output of "imageinfo -active")
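
These details can be collected from a root login on the storage cell. The following is a minimal sketch; dmidecode is a generic Linux utility rather than an Exadata-specific tool, and the exact product string it reports varies by server model:

  # hostname                            (name of the storage cell)
  # dmidecode -s system-product-name    (server model, e.g. X4275 for V2 or X4270 M2 for X2-2/X2-8)
  # imageinfo -active                   (active image version)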

Special Instructions for Dispatch are required for this part.

For Attention of Dispatcher:

The parts required in this action plan may be available as spares owned by the customer, which they received with the Engineered System. (These are sometimes referred to as ride-along spares.)

If parts are not available to meet the customer's preferred delivery time/planned end date, then request the TAM or field manager to contact the customer and ask whether the customer has parts available and would be prepared to use them.

If customer spare parts are used, inform the customer that Oracle will replenish the customer part stock as soon as we can. More details on this process can be found in GDMR procedure "Handling Where No Parts Available" step 2: https://ptp.oraclecorp.com/pls/apex/f?p=151:138:38504529393::::DN,BRNID,DP,P138_DLID:2,86687,4,9082,


- WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED:
- TIME ESTIMATE: 90 Minutes
- TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW:
There is a Sun Flash Accelerator F20/F20M2 PCIe Card with a failed FMOD in an Exadata Storage Server (Cell) that needs replacing.

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The Storage Cell that contains the faulty FMOD needs to be powered off. It may also require a firmware patch update prior to powering off for replacement.

Important Part Number NOTE: FMOD 7061269 with D21Y firmware substitutes for and replaces both the 371-5014 (D20Y) and 371-4415 (D20R) FMOD FRUs. If 371-4415 or 371-5014 is no longer available locally, then the customer must install patch 14793859, which installs D21Y firmware into the image, prior to replacing the FMOD.

It is expected that the customer's DBA has completed these steps before the engineer arrives to replace the card. The following commands are provided as guidance in case the customer needs assistance checking the status of the system prior to replacement. If the customer or the FSE requires more assistance prior to the physical replacement of the device, EEST/TSC should be contacted.

  1. Locate the cell server in the rack being serviced. The cell server within the rack can usually be determined from the hostname and the known default Exadata server numbering scheme. Exadata Storage Servers are identified by a number 1 through 18, where 1 is the lowermost Storage Server in the rack (installed in RU2), counting up to the top of the rack.

    Turn the Locate indicator light on for easier identification of the server being repaired. If the server has been identified, the Locate button on the front panel may be pressed. To turn the light on remotely, use any of the following methods:

    From a login to CellCLI on the Exadata Storage Server:

    CellCLI> alter cell led on

    From a login to the server’s ILOM:

    -> set /SYS/LOCATE value=Fast_Blink
    Set 'value' to 'Fast_Blink'

    From a login to the server’s ‘root’ account:

    # ipmitool sunoem cli 'set /SYS/LOCATE value=Fast_Blink'
    Connected. Use ^D to exit.
    -> set /SYS/LOCATE value=Fast_Blink
    Set 'value' to 'Fast_Blink'

    -> Session closed
    Disconnected
  2. Determine the active image version of the Exadata Storage Server:

    # imageinfo -active


    This information will be needed to determine if the replacement needs its firmware updated, and should be provided to the Oracle service engineer performing the replacement.
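
    For reference, the output includes a line similar to the following (the version string here is only an example):

    Active image version: 11.2.3.2.1.130109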

  3. Shutdown the node for which the Flash F20 FMOD requires replacement.

        1. For extended information on this section, check MOS Note:
          ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM

          This is also documented in the Exadata Maintenance Guide section titled "Maintaining Exadata Storage Servers" subsection "Shutting Down Exadata Storage Server" available on the customer's cell server image in the /opt/oracle/cell/doc directory.
          https://docs.oracle.com/cd/E80920_01/DBMMN/maintaining-exadata-storage-servers.htm#GUID-AE16A1DA-53C6-4E80-94E5-963AA65373AB

          In the following examples the SQL commands should be run by the Customer's DBA prior to doing the hardware replacement. The cellcli commands will need to be run as root.

          Note the following when powering off Exadata Storage Servers:
          • Verify there are no other storage servers with disk faults. Shutting down a storage server while another storage server has a failed disk may cause running database processes and Oracle ASM to crash, if both disks in a partner pair are lost when this server's disks go offline.

          • Powering off one Exadata Storage Server with no disk faults in the rest of the rack will not affect running database processes or Oracle ASM.

          • All database and Oracle Clusterware processes should be shut down prior to shutting down more than one Exadata Storage Server. Refer to the Exadata Owner’s Guide for details if this is necessary.

        2. ASM drops a disk shortly after it is taken offline. Powering off or restarting Exadata Storage Servers can impact database performance if the storage server is offline for longer than the ASM disk repair timer allows. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but may have been changed by the customer. To check this parameter, have the customer log into ASM and run the following query:

          SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;
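
          A sample result (illustrative; disk group names and the exact value depend on the customer's configuration):

          NAME                           VALUE
          ------------------------------ --------
          DATA                           3.6h
          RECO                           3.6h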

          As long as the value is large enough to comfortably cover the replacement, there is no need to change it.

        3. If the flash disks are being used for griddisks, then please refer to Note 1545103.1 for additional instructions before continuing.
           
        4. Check if ASM will be OK if the grid disks go OFFLINE.
          # cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
          ...sample ...
          DATA_CD_09_dbm1cel01 ONLINE Yes
          DATA_CD_10_dbm1cel01 ONLINE Yes
          DATA_CD_11_dbm1cel01 ONLINE Yes
          RECO_CD_00_dbm1cel01 ONLINE Yes
          RECO_CD_01_dbm1cel01 ONLINE Yes
          ...repeated for all griddisks....

          If one or more disks return asmdeactivationoutcome='No', then wait for some time and repeat this command. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.

        5. Run the following cellcli command to inactivate all grid disks on the cell that needs to be powered down for maintenance. (This could take up to 10 minutes or longer.)

          # cellcli
          ...sample ...
          CellCLI> ALTER GRIDDISK ALL INACTIVE
          GridDisk DATA_CD_00_dbm1cel01 successfully altered
          GridDisk DATA_CD_01_dbm1cel01 successfully altered
          GridDisk DATA_CD_02_dbm1cel01 successfully altered
          GridDisk RECO_CD_00_dbm1cel01 successfully altered
          GridDisk RECO_CD_01_dbm1cel01 successfully altered
          GridDisk RECO_CD_02_dbm1cel01 successfully altered
          ...repeated for all griddisks...
        6. Execute the command below. Once the disks are offline and inactive in ASM, the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all griddisks.

          CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
          DATA_CD_00_dbm1cel01 inactive OFFLINE Yes
          DATA_CD_01_dbm1cel01 inactive OFFLINE Yes
          DATA_CD_02_dbm1cel01 inactive OFFLINE Yes
          RECO_CD_00_dbm1cel01 inactive OFFLINE Yes
          RECO_CD_01_dbm1cel01 inactive OFFLINE Yes
          RECO_CD_02_dbm1cel01 inactive OFFLINE Yes
          ...repeated for all griddisks...
        7. If the image version determined in step 2 above is 11.2.3.2.0 or earlier, and the one-off firmware patch 14793859 has not  yet been applied, then it must be applied now prior to replacement of the F20 FMOD with D21Y firmware.  Run the following command to determine the FMOD firmware level:

          # cellcli -e list physicaldisk attributes name, id, physicalFirmware where diskType = 'FlashDisk'

          If the image is patched, the firmware will report as D21Y in the third column of the output. If it is not patched, it will report as D20R or D20Y.
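
          Illustrative output (disk names are examples and the id column is elided here; the third column is the firmware level):

          FLASH_1_0    ...FMOD0    D20R
          FLASH_1_1    ...FMOD1    D20R
          ...repeated for all FMODs...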

          If the patch is needed, follow the instructions in MOS Note 1504776.1 and the README contained within the patch. The patch will also update ILOM/BIOS on Storage Servers in X2-2/X2-8 racks and will force a reboot or two. This one-off patch may be applied to the single storage cell requiring FMOD service if it is not possible to patch all storage cells in the rack at the time of service. It may take 30 minutes to complete.

          Failure to install the patch will result in the firmware on the FRU being downgraded to D20R or D20Y, depending on the current image installed. This may cause the replacement FRU to fail, or the original problem to return.

          After the system reboots following the firmware update, repeat substeps 4 through 6 above to verify the griddisks are offline and inactive.

        8. If the Flash Modules are not already marked failed, and the system is using WriteBack Flash Cache mode, then the flash module contents must be flushed to disk before replacement. Normally this is automatic when the system fails the module, and no additional actions are necessary. If the flash module is being replaced for any reason without having failed, perform step 3 of Doc ID 1306635.1 prior to shutdown; a sketch of that flush sequence follows.
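
          A minimal sketch of the manual flush (the authoritative procedure is step 3 of Doc ID 1306635.1). The flush is complete when flushstatus reports 'Completed' for all flash celldisks and flusherror is empty:

          CellCLI> ALTER FLASHCACHE ALL FLUSH
          CellCLI> LIST CELLDISK ATTRIBUTES name, flushstatus, flusherror WHERE diskType = 'FlashDisk'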
        9. Once all disks are offline and inactive, the customer may shut down the cell using the following command:

          # shutdown -hP now

           When powering off Exadata Storage Servers, all storage services are automatically stopped.

     
    WHAT ACTION DOES THE ENGINEER NEED TO TAKE:

    Identify the F20 PCI card slot and FMOD number that is at fault, from the fault messages. It should also be possible to observe the LEDs on the rear of the Flash F20/F20M2 card to determine which card and FMOD is showing a fault.

    These are the steps to remove and replace the FMOD:

    A. Remove the PCI riser that contains the associated Flash F20 card.

        1. Prepare the server for service.
          1. Power off the server and disconnect the power cord (or cords) from the power supply (or supplies).
          2. Extend the server to the maintenance position.
          3. Attach an antistatic wrist strap.
          4. Remove the top cover.

            If the top cover is removed before the AC power cords are removed, the ILOM SP will fault for chassis intrusion. This will need to be cleared after restarting ILOM.

        2. Locate the card's position on its riser in the system.

          The Flash F20/F20M2 cards installed in the Storage Cells are located on PCIe Riser 1 (PCIe slots 1 and 4) in the middle of the X4275/X4270M2 server and PCIe Riser 2 (PCIe slots 2 and 5) on the outside wall of the X4275/X4270M2 server.

        3. Remove the back panel PCI crossbar.
          1. Loosen the two captive Phillips screws on the end of the PCI crossbar.
          2. Lift the PCI crossbar up and back to remove it from the chassis.

        4. Remove the PCIe riser from the system.
          1. Loosen the captive screw holding the riser to the motherboard.
          2. Lift up the riser and any PCIe cards that are attached to it as a unit.


    B. Remove the Flash F20/F20M2 card that contains the faulty FMOD(s).

    The FMODs on F20 and F20M2 cards are laid out differently.

    The PCI numbers and FMOD locations for F20 are plainly seen on the side of the ESM plastic housing (once the PCI card is removed from the system).

    For the F20M2, the FMOD labels are a bit trickier to see. Once the PCI card is removed from the system, look below the SAS controller near the PCI card edge connector. The labels are on a small white label with arrows, and on the board near each FMOD (though some of those labels are partially obscured).

    They are arranged as follows:

    F20 Card Faceplate <-- FMOD0 Upper; FMOD1 Lower <-> ESM/SAS controller <-> FMOD2 Lower; FMOD3 Upper <-> 2 SAS ports on rear of card.
    F20M2 Card Faceplate <-- FMOD0 Upper; FMOD1 Lower <-> SAS controller <-> FMOD2 Upper; FMOD3 Lower <-> ESM / 1 SAS port on rear of card.
        1. Remove the affected F20/F20M2 card from the riser. The card nearest the riser's PCI card edge connector is in slot 1 or 2; the card further away is in slot 4 or 5. If necessary, make a note of where the PCIe cards are installed.

        2. Place the F20/F20M2 card on an antistatic mat.

        3. Identify and remove the Faulty FMOD(s) from the card. 
          1. Loosen the 3 clips on the appropriate side retaining the FMOD(s) on the card.
            On F20 cards, loosen the 3 screws on the underside of the PCIe card - do NOT fully remove the screws, to reduce the chances of the clips breaking or being lost.
            On F20M2 cards, pull open the retaining clips to loosen them.
          2. Carefully lift and remove the faulty FMOD, and any upper FMOD needed to access the lower FMOD if required.

        4. Install the replacement FMOD(s) on the card.
          1. Carefully install the replacement in the same slot, ensuring the side of the FMOD engages inside any metal tabs and clip guides. 
          2. Re-install any additional FMOD(s) that were removed.
          3. Re-engage the clips that retain the FMOD(s), either tightening the screws (F20) or closing the locking clips (F20M2). If any clips are broken, additional clips should be available in the FRU box for the FMOD or in the FRU for the F20/F20M2 PCI card itself; these may need to be ordered separately.

        5. Reinstall the PCIe card into the riser.

     

    C. Reinstall the PCIe riser back into the system.

    D. Reinstall the PCIe crossbar on the rear of the system.

    E. Reinstall the top cover.

    F. Plug the AC power cords in and power on the Storage Cell.



    OBTAIN CUSTOMER ACCEPTANCE
    - WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:


    It is expected that the engineer stay on-site until the customer has given the approval to depart.   The following commands are provided as guidance in case the customer needs assistance checking the status of the system following replacement.  If the customer or the FSE requires more assistance following the physical replacement of the device, EEST/TSC should be contacted.

    After replacing the Flash F20/F20M2 FMOD and updating its firmware if necessary, the Exadata Storage Server should boot up automatically. Once the Exadata Storage Server comes back online, the cell services will start automatically; however, you will need to reactivate the griddisks as follows:

        1. Activate the griddisks:

          # cellcli
              …    
          CellCLI> alter griddisk all active
          GridDisk DATA_CD_00_dbm1cel01 successfully altered
          GridDisk DATA_CD_01_dbm1cel01 successfully altered
          GridDisk DATA_CD_02_dbm1cel01 successfully altered
          GridDisk RECO_CD_00_dbm1cel01 successfully altered
          GridDisk RECO_CD_01_dbm1cel01 successfully altered
          GridDisk RECO_CD_02_dbm1cel01 successfully altered
          ...etc...
        2. Verify all disks show 'active':

          CellCLI> list griddisk
          DATA_CD_00_dbm1cel01         active
          DATA_CD_01_dbm1cel01         active
          DATA_CD_02_dbm1cel01         active
          RECO_CD_00_dbm1cel01         active
          RECO_CD_01_dbm1cel01         active
          RECO_CD_02_dbm1cel01         active
          ...etc...
        3. Verify all grid disks have been successfully put online using the following command. Wait until 'asmmodestatus' is in status 'ONLINE' for all grid disks. The following is an example of the output early in the activation process.

          CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
          DATA_CD_00_dbm1cel01 active ONLINE Yes
          DATA_CD_01_dbm1cel01 active ONLINE Yes
          DATA_CD_02_dbm1cel01 active ONLINE Yes
          RECO_CD_00_dbm1cel01 active SYNCING Yes
          RECO_CD_01_dbm1cel01 active ONLINE Yes
          ...etc...


          Notice in the above example that 'RECO_CD_00_dbm1cel01' is still in the 'SYNCING' process. Oracle ASM synchronization is only complete when ALL grid disks show 'asmmodestatus=ONLINE'. This process can take some time, depending on how busy the machine is and has been while this server was down for repair. (Note: This operation uses Fast Mirror Resync, which does not trigger an ASM rebalance. The resync operation restores only the extents that would have been written while the disk was offline.)
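
          If the customer's DBA wishes to watch the resync from the ASM side as well, the online operation is visible in gv$asm_operation while it runs. A minimal illustrative query:

          SQL> select group_number, operation, state from gv$asm_operation;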

        4. If the replaced FDOM or any FDOMs on the same flash card were flagged as 'poor performance' at the time of the failure, then all FDOMs in the 'poor performance' state will need to be manually cleared before those FDOMs can be used in the flashcache again. Refer to Note 1306635.1 for additional specific instructions on how to clear this state.

        5. If the flashcache was dropped prior to the FMOD replacement (i.e., if CellCLI> drop celldisk all flashdisk force was run), then the following steps will need to be run in order to recreate the flashcache:

          CellCLI> create celldisk all flashdisk
          CellCLI> create flashlog all
          CellCLI> create flashcache all

          Note: If you are running an image version prior to 11.2.2.4, DO NOT run the 'create flashlog all' operation as this feature was introduced in the 11.2.2.4 release.
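
          To confirm the flashlog and flashcache were recreated, the following CellCLI commands can be used (sizes and names in the output are environment-specific):

          CellCLI> list flashlog detail
          CellCLI> list flashcache detail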

        6. If WriteBack flashcache mode is enabled, then validate the cell has resumed caching to the flashdisks.
          The following will show whether the mode is WriteBack or WriteThrough:
          CellCLI> list cell attributes name,flashcachemode

          In WriteBack mode, the following should show each griddisk being cached by 4 flash disks:

          CellCLI> list griddisk attributes name,status,cachedby

          To validate that data is dirty (i.e., present only on flash disks and not yet on hard disks), run the following:

          CellCLI> list metriccurrent fc_by_used,fc_by_dirty

          On a busy system, metric "fc_by_dirty" should start increasing in value.
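
          For reference, the flashcachemode check above returns output along these lines (the cell name shown is an example):

          CellCLI> list cell attributes name,flashcachemode
          dbm1cel01     WriteBack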


    PARTS NOTE:

    7061269 24GB Solid State Flash Memory Module, FW D21Y

    371-5014 24GB Solid State Flash Memory Module, FW D20Y

    371-4415 24GB Solid State Flash Memory Module, FW D20R


    REFERENCE INFORMATION:

    Sun Flash Accelerator F20 PCIe Card User’s Guide - https://docs.oracle.com/cd/E19682-01/index.html 
    How to Shutdown a Storage Cell for Service - Note 1188080.1

    Aura (F20) Hardware and Software Troubleshooting Document - Note 1285796.1

References

<NOTE:1306635.1> - Flash Disks may report 'Not Present' or 'Poor Performance' after FDOM/Flash Disk Replacement
<NOTE:1188080.1> - Steps to shut down or reboot an Exadata storage cell without affecting ASM
<NOTE:1545103.1> - Replacing FlashCards or FDOM's when Griddisks are created on FlashDisk's

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.