Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1681165.1
Update Date:2018-04-10
Keywords:

Solution Type  Technical Instruction Sure

Solution  1681165.1 :   How to Replace a Faulty ESM on a Flash F20 or F20M2 card in an Exadata Storage Server (V2/X2-2/X2-8) [VCAP]  


Related Items
  • SPARC SuperCluster T4-4 Full Rack
  • Exadata Database Machine X2-2 Qtr Rack
  • Exadata Database Machine X2-8
  • Exadata Database Machine X2-2 Full Rack
  • Exadata Database Machine X2-2 Half Rack
  • SPARC SuperCluster T4-4 Half Rack
  • Exadata Database Machine X2-2 Hardware
  • SPARC SuperCluster T4-4
  • Exadata Database Machine V2
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP
  • Microlearning>Video>ML-VID-VCAP


This CAP explains how to Service a faulty ESM on a Flash F20/F20M2 card in an Exadata Storage Server (V2/X2-2/X2-8)

Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Exadata FRU only; Internal only and HW support partners

Applies to:

SPARC SuperCluster T4-4 - Version All Versions and later
Exadata Database Machine X2-8 - Version All Versions and later
SPARC SuperCluster T4-4 Half Rack - Version All Versions and later
Exadata Database Machine X2-2 Hardware - Version All Versions and later
Exadata Database Machine X2-2 Full Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

Service a faulty ESM on a Flash F20/F20M2 PCIe card in an Exadata Storage Server (V2/X2-2/X2-8)

Solution


DISPATCH INSTRUCTIONS:
The following information will be required prior to dispatch of a replacement:

  • Type of Exadata (V2, X2-2 or X2-8) / Exadata Storage Expansion Rack / SPARC SuperCluster
  • Type of storage cell/Node (X4275 or X4270M2)
  • Name/location of storage cell
  • PCI Slot number of failed ESM
  • Type of Flash card requiring servicing (F20 or F20M2)

- WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED:
- TIME ESTIMATE: 90 Minutes
- TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW:
There is a Sun Flash Accelerator F20/F20M2 Card with a failed ESM in an Exadata Storage Server (Cell) that needs replacing.

A video showing the physical replacement steps is attached to this Note 1681165.1.

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The Storage Cell that contains the faulty component(s) needs to be powered off. 

It is expected that the customer's DBA has completed these steps prior to arriving to replace the card. The following commands are provided as guidance in case the customer needs assistance checking the status of the system prior to replacement.  If the customer or the FSE requires more assistance prior to the physical replacement of the device, EEST/TSC should be contacted.

  1. Locate the cell server in the rack being serviced.  The cell server within the rack can usually be determined from the hostname and the known default Exadata server numbering scheme. Exadata Storage Servers are identified by a number 1 through 18, where 1 is the lowest Storage Server in the rack, installed in RU2, counting up to the top of the rack. 

    Turn on the locate indicator light for easier identification of the server being repaired. If the server number has been identified then the Locate Button on the front panel may be pressed. To turn on remotely, use either of the following methods:

    From a login to CellCLI on Exadata Storage Servers:

    CellCLI> alter cell led on

    From a login to the server’s ILOM:

    -> set /SYS/LOCATE value=Fast_Blink
    Set 'value' to 'Fast_Blink'

    From a login to the server’s ‘root’ account:

    # ipmitool sunoem cli 'set /SYS/LOCATE value=Fast_Blink'
    Connected. Use ^D to exit.
    -> set /SYS/LOCATE value=Fast_Blink
    Set 'value' to 'Fast_Blink'

    -> Session closed
    Disconnected
  2. Shut down the node for which the Flash F20/F20M2 ESM requires replacement.
    For extended information on this section, check MOS Note ID 1188080.1 "Steps to shut down or reboot an Exadata storage cell without affecting ASM"

    This is also documented in the Exadata Maintenance Guide section titled "Maintaining Exadata Storage Servers", subsection "Shutting Down Exadata Storage Server", available on the customer's cell server image in the /opt/oracle/cell/doc directory, or online here:

    https://docs.oracle.com/cd/E80920_01/DBMMN/maintaining-exadata-storage-servers.htm#DBMMN21129

    1. In the following examples the SQL commands should be run by the Customer's DBA prior to doing the hardware replacement. The cellcli commands will need to be run as root.

      Note the following when powering off Exadata Storage Servers:
      • Verify there are no other storage servers with disk faults. Shutting down a storage server while another disk is failed may cause running database processes and Oracle ASM to crash if both disks in a partner pair are lost when this server’s disks go offline.

      • Powering off one Exadata Storage Server with no disk faults in the rest of the rack will not affect running database processes or Oracle ASM.

      • All database and Oracle Clusterware processes should be shut down prior to shutting down more than one Exadata Storage Server. Refer to the Exadata Owner’s Guide for details if this is necessary.

    2. ASM drops a disk shortly after it is taken offline. Powering off or restarting an Exadata Storage Server can impact database performance if the storage server remains offline for longer than the ASM disk repair timer allows. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but it may have been changed by the Customer. To check this parameter, have the Customer log into ASM and perform the following query:

      SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

      As long as the value is large enough to comfortably complete the component replacement, there is no need to change it.

    3. If the flash disks are being used for griddisks, then please refer to Note 1545103.1 for additional instructions before continuing.
       
    4. Check if ASM will be OK if the grid disks go OFFLINE.
      # cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
      ...sample ...
      DATA_CD_09_dbm1cel01 ONLINE Yes
      DATA_CD_10_dbm1cel01 ONLINE Yes
      DATA_CD_11_dbm1cel01 ONLINE Yes
      RECO_CD_00_dbm1cel01 ONLINE Yes
      RECO_CD_01_dbm1cel01 ONLINE Yes
      ...repeated for all griddisks....

      If one or more disks return asmdeactivationoutcome='No', then wait for some time and repeat this command. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.

    5. Run the cellcli command to inactivate all grid disks on the cell that needs to be powered down for maintenance. (This could take up to 10 minutes or longer.)

      # cellcli
      ...sample ...
      CellCLI> ALTER GRIDDISK ALL INACTIVE
      GridDisk DATA_CD_00_dbm1cel01 successfully altered
      GridDisk DATA_CD_01_dbm1cel01 successfully altered
      GridDisk DATA_CD_02_dbm1cel01 successfully altered
      GridDisk RECO_CD_00_dbm1cel01 successfully altered
      GridDisk RECO_CD_01_dbm1cel01 successfully altered
      GridDisk RECO_CD_02_dbm1cel01 successfully altered
      ...repeated for all griddisks...
    6. Execute the command below and the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome=Yes for all griddisks once the disks are offline and inactive in ASM.

      CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
      DATA_CD_00_dbm1cel01 inactive OFFLINE Yes
      DATA_CD_01_dbm1cel01 inactive OFFLINE Yes
      DATA_CD_02_dbm1cel01 inactive OFFLINE Yes
      RECO_CD_00_dbm1cel01 inactive OFFLINE Yes
      RECO_CD_01_dbm1cel01 inactive OFFLINE Yes
      RECO_CD_02_dbm1cel01 inactive OFFLINE Yes
      ...repeated for all griddisks...
    7. Once all disks are offline and inactive, the customer may shut down the Cell using the following command:

       # shutdown -hP now

       When powering off Exadata Storage Servers, all storage services are automatically stopped.
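
The checks in sub-steps 2, 4 and 6 of the shutdown procedure above lend themselves to scripting. The following is a minimal Python sketch and not part of the official procedure. It only parses text that has already been captured from the SQL query and from the `list griddisk attributes` commands. All function names are hypothetical. The parsing assumes whitespace-separated columns exactly as in the sample output above, and it assumes that DISK_REPAIR_TIME uses the ASM unit suffixes `m` (minutes) or `h` (hours), with hours as the default when no suffix is given.

```python
def repair_time_hours(value):
    """Convert a DISK_REPAIR_TIME string such as '3.6h' or '216m' to hours.

    Assumption: 'm'/'M' means minutes, 'h'/'H' means hours, and a bare
    number is treated as hours.
    """
    text = value.strip().lower()
    if text.endswith("m"):
        return float(text[:-1]) / 60.0
    if text.endswith("h"):
        return float(text[:-1])
    return float(text)


def parse_griddisks(output):
    """Yield the list of columns for each grid disk line, skipping '...' filler."""
    for line in output.splitlines():
        fields = line.split()
        if fields and not fields[0].startswith("..."):
            yield fields


def blocking_griddisks(output):
    """Sub-step 4: names of grid disks whose asmdeactivationoutcome is not 'Yes'.

    Expects columns: name, asmmodestatus, asmdeactivationoutcome.
    """
    return [f[0] for f in parse_griddisks(output) if f[-1] != "Yes"]


def not_offline_griddisks(output):
    """Sub-step 6: names of grid disks that are not yet inactive and offline.

    Expects columns: name, status, asmmodestatus, asmdeactivationoutcome.
    """
    pending = []
    for f in parse_griddisks(output):
        ok = (len(f) >= 4 and f[1] == "inactive"
              and f[2] in ("OFFLINE", "UNUSED") and f[3] == "Yes")
        if not ok:
            pending.append(f[0])
    return pending


# Example with text shaped like the sample output in this note.
before = "DATA_CD_09_dbm1cel01 ONLINE Yes\nRECO_CD_00_dbm1cel01 ONLINE Yes\n"
print(blocking_griddisks(before))          # an empty list means it is safe to proceed
print(repair_time_hours("3.6h") >= 1.5)    # compare with the 90 minute time estimate
```

An empty result from `blocking_griddisks` corresponds to every disk reporting asmdeactivationoutcome='Yes'; any name returned means the command should be repeated later, as described in sub-step 4.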

 
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:

Identify the F20/F20M2 PCIe card slot that is at fault, from the fault messages.  It should also be possible to observe the LEDs on the rear of Flash F20/F20M2 Card to determine which card and ESM is showing a fault.

These are the steps to remove and replace the ESM:

A. Remove the PCI Riser - that contains the associated Flash F20/F20M2 Card.

      1. Prepare the server for service.
        1. Power off the server and disconnect the power cord (or cords) from the power supply (or supplies).
        2. Extend the server to the maintenance position.
        3. Attach an antistatic wrist strap.
        4. Remove the top cover.

          If the top cover is removed before the AC power cords are removed, the ILOM SP will fault for chassis intrusion. This will need to be cleared after restarting ILOM.

      2. Locate the card's position on the riser in the system.

        The Flash F20/F20M2 cards installed in the Storage Cells are located on PCIe Riser 1 (PCIe slots 1 and 4) in the middle of the X4275/X4270M2 server and PCIe Riser 2 (PCIe slots 2 and 5) on the outside wall of the X4275/X4270M2 server.

      3. Remove the back panel PCI crossbar.
        1. Loosen the two captive Phillips screws on the end of the PCI crossbar.
        2. Lift the PCI crossbar up and back to remove it from the chassis.

      4. Remove the PCIe riser from the system.
        1. Loosen the captive screw holding the riser to the motherboard.
        2. Lift up the riser and any PCIe cards that are attached to it as a unit.


B. Remove the affected F20/F20M2 PCIe card from the riser.  The card nearest the riser PCI card edge connector is slot 1 or 2. The card further away is slot 4 or 5.  If necessary, make a note of where the PCIe cards are installed.  Place the F20/F20M2 card on an anti-static mat.

C. The ESMs on F20 and F20 M2 cards are laid out differently. Follow the appropriate physical procedure for the type of card being serviced.

The ESMs are not interchangeable - do NOT attempt to use an F20 ESM on an F20 M2 card, or an F20 M2 ESM on an F20 card. If the correct ESM is not available, then both card and ESM should be replaced together with the correct pairing.

 Removing the ESM on F20 Card (541-3731)

The F20 Card has the ESM located in the centre of the card, with FMODs on either side of it. The assembly part number label is located on the front of the card near the card edge connector between the disk controller and rear FMODs.

    1. Remove the two ESM assembly retaining pins on the back of the card.

      1. First, remove the center pin from each retaining pin.

      2. Next, push the outer section of each retaining pin through the card and remove them.

    2. Carefully slide the ESM assembly (the ESM shroud and the ESM) off the card without disturbing FMOD0 or FMOD3.

    3. Using a pair of wire cutters, clip the ESM cable near the ESM end. This will allow removal of the cable without needing to unscrew clips and remove FMOD0.

    4. Disconnect the ESM cable from connector J803 on the card using the remaining tail.

    5. Remove the ESM from the plastic ESM shroud.

 

Installing the ESM on F20 Card (541-3731)

The F20 Card has the ESM located in the centre of the card, with FMODs on either side of it.
The assembly part number label is located on the front of the card near the card edge connector between the disk controller and rear FMODs.

    1. Place the ESM in the plastic ESM shroud.

    2. Place the ESM assembly next to the board, then slide it gently onto the card. Carefully route the cable and plug between FMOD0 and FMOD1 while sliding it on.

    3. Install the two retaining pins from the back of the card:

      1. First, install the outer section of each retaining pin.

      2. Next, install the center section of each retaining pin.

    4. Connect the ESM plug to J803 on the card, routing the ESM cable around the retainer clip holding FMOD0 and FMOD1, with the cable lying between the 2 FMODs.
       

Removing the ESM on F20 M2 Card (541-4417)

The F20 M2 Card has the ESM located on the rear of the card next to the SAS cable connector. The assembly part number label is located next to the orange WWN label on the rear side of the card.

    1. Locate the plastic retaining clip for the ESM plastic housing on the rear side of the card.

    2. With a small tool such as the tip of a screwdriver, carefully press the clip down while pushing the housing off the rear end of the PCI card.

    3. Disconnect the ESM cable from connector J803 on the card.

    4. Remove the ESM from the plastic ESM shroud.

 

Installing the ESM on F20 M2 Card (541-4417)

The F20 M2 Card has the ESM located on the rear of the card next to the SAS cable connector. The assembly part number label is located next to the orange WWN label on the rear side of the card.

    1. Place the ESM in the plastic ESM shroud.

    2. Connect the ESM plug to J803 on the card. 

    3. Slide the ESM assembly feet carefully onto the board, one into the slotted hole, the other slides onto the end of the PCI card. There should be an audible click when the retaining clip engages in its slot.
       

D. Reinstall the PCIe card into the riser.

E. Reinstall the PCIe riser back into the system.

F. Reinstall the PCIe crossbar on the rear of the system.

G. Reinstall the top cover.

H. Plug the AC power cords in and power on the Storage Cell.

I. The ESM power-on hours counter in ILOM must be manually cleared to reset it to zero. The physical replacement of the ESM does not initiate this automatically. The monitoring feature was added in ILOM 3.0.9.19.a and is contained in Exadata software image version 11.2.1.3.1 and later.

  1. After the system is plugged in and ILOM is booted, log in to ILOM on the Storage Cell as ‘root’ user. This can be done from the host, the network, or serial interface.

  2. Power on the server, so the ILOM can see the Flash cards, using either the power button or the following ILOM command, and wait 5 minutes for the system to boot.

    -> start /SYS

  3. For each ESM that was replaced, check if the fault_state is set to 'critical' or 'Faulted' (depending on ILOM version). 

    -> show /SYS/MB/RISER1/PCIE1/F20CARD fault_state

    Use the appropriate riser and pcie slot numbers for the ESM that was replaced. (RISER1/PCIE1 or RISER1/PCIE4 or RISER2/PCIE2 or RISER2/PCIE5)

  4. a) If the fault_state is showing as ‘critical’ or 'Faulted', then skip to step 5.

    b) If the fault_state is showing “OK”, then the ESM is not yet critical, so it may not have reached the power-on-hours threshold needed to flag it as critical. This may be because the unit was close to, but had not reached, the threshold when the PM was performed, or because ILOM was flash updated at some interim time, which resets the counter to 0. The fault_state can be manually set to critical as follows:

    1. Enter the fault management shell in ILOM CLI:

      -> start /SP/faultmgmt/shell
      Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

      faultmgmtsp>


    2. a) On ILOM versions 3.0.14.x or later, use the following to mark the card failed (all on one line):

      faultmgmtsp> etcd -i ereport.chassis.device.esm.eol.warning@/SYS/MB/RISER1/PCIE1/F20CARD

      or
      b) On ILOM version 3.0.9.x, use the following to mark the card failed (all on one line):

      faultmgmtsp> etcd -i ereport.env.uptime-unr-ghi@/SYS/MB/RISER1/PCIE1/F20CARD

    3. Exit the fault management shell

      faultmgmtsp> exit

    4. Verify the fault_state is now showing as ‘critical’ or 'Faulted'. This may take up to 2 minutes after setting the fault state using etcd:

      -> show /SYS/MB/RISER1/PCIE1/F20CARD fault_state

  5. With the fault_state showing as ‘critical’ or 'Faulted', it can be cleared as follows:

    -> set /SYS/MB/RISER1/PCIE1/F20CARD clear_fault_action=true

    This will reset the power-on-hours counter to 0.

  6. Verify the fault_state is now 'OK' and the power-on-hours counter is 0.

    -> show /SYS/MB/RISER1/PCIE1/F20CARD fault_state
    -> show /SYS/MB/RISER1/PCIE1/F20CARD/UPTIME

    If the fault_state is OK but the power-on-hours has not cleared to 0, then resetting ILOM may update it.  If resetting ILOM does not reset it to 0, then repeat steps 4 and 5.
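
The ILOM steps 3 to 6 above follow a fixed pattern once the PCIe slot, the ILOM version and the current fault_state are known. The following Python sketch, offered as an illustration only, assembles the ILOM CLI lines for one card so they can be reviewed before being typed. It does not connect to ILOM. The function names are hypothetical, and the version handling covers only the two cases named in step 4 (3.0.9.x, and 3.0.14.x or later); any other version is rejected rather than guessed.

```python
# Riser that carries each PCIe slot holding an F20/F20M2 card, as listed above.
RISER_FOR_SLOT = {1: 1, 4: 1, 2: 2, 5: 2}


def f20_target(slot):
    """Return the ILOM target path for the F20 card in the given PCIe slot."""
    if slot not in RISER_FOR_SLOT:
        raise ValueError(f"no F20/F20M2 card is expected in PCIe slot {slot}")
    return f"/SYS/MB/RISER{RISER_FOR_SLOT[slot]}/PCIE{slot}/F20CARD"


def ereport_class(ilom_version):
    """Pick the ereport class from step 4 for a version such as '3.0.9.19.a'."""
    parts = ilom_version.lstrip("vV").split(".")
    release = tuple(int(p) for p in parts[:3])
    if release >= (3, 0, 14):
        return "ereport.chassis.device.esm.eol.warning"
    if release == (3, 0, 9):
        return "ereport.env.uptime-unr-ghi"
    raise ValueError(f"this note names no ereport class for ILOM {ilom_version}")


def clear_sequence(slot, ilom_version, fault_state):
    """List the ILOM CLI lines for steps 4 to 6, given the current fault_state."""
    target = f20_target(slot)
    lines = []
    if fault_state.lower() not in ("critical", "faulted"):
        # Step 4b: fault_state is OK, so the fault is set manually first.
        # 'start' prompts for confirmation (y/n) before opening the shell.
        lines += [
            "start /SP/faultmgmt/shell",
            f"etcd -i {ereport_class(ilom_version)}@{target}",
            "exit",
            f"show {target} fault_state",
        ]
    # Step 5: clearing the fault resets the power-on-hours counter.
    lines.append(f"set {target} clear_fault_action=true")
    # Step 6: verify fault_state and the counter.
    lines += [f"show {target} fault_state", f"show {target}/UPTIME"]
    return lines


for line in clear_sequence(slot=4, ilom_version="3.0.16.10", fault_state="OK"):
    print(line)
```

Generating the lines from the slot number avoids leaving RISER1/PCIE1 in a command that was meant for another card, which is the most likely copy-and-paste error in this step.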

 

Additional Notes on the ESM power-on hours thresholds:

  • ILOM v3.0.9.19.a on Exadata V2 systems with image 11.2.2.2.2 and earlier has a bug that prevents slot PCIE4 from reporting the presence of the flash F20 card. Skip that slot if the system has this problem.

  • ILOM v3.0.9.19.a on Exadata V2 systems and v3.0.9.27.a on Exadata X2-2/X2-8 systems have a bug that programmed the thresholds to 2 years (17200 hours) instead of 3 years or 4 years, so the fault status may have already been triggered and cleared.

  • On Exadata X2-2/X2-8 systems, the threshold may report 3 years (26220 hours) instead of 4 years (35052 hours) if the system_identifier property in ILOM /SP is not programmed to the standard Exadata identity string “Exadata Database Machine X2-2” (or X2-8) that identifies this card as being in an Exadata, rather than a regular X4270M2 system. This may be the case on V2 systems upgraded with X2-2 servers if the identity string was changed to the V2 rack string “Sun Oracle Database Machine”. Check the value and correct it for the future if it is incorrect:

    -> show /SP system_identifier

    -> set /SP system_identifier="Exadata Database Machine X2-2 <Rack SN>"

  • On Exadata X2-2/X2-8 systems, the threshold may report 3 years (26220 hours) instead of 4 years (35052 hours) on some cards and not others due to bug 18356571. This has been seen on ILOM v3.0.16.10 through 3.1.2.20.c. The resolution for this bug is in ILOM v3.1.2.20.d or later; ILOM v3.1.2.20.e is contained in image 12.1.2.1.0 and later.

  • The counter does not update every hour; it updates every 4 hours and may not read greater than 0 immediately. If the fault persists in the ILOM display after clearing the fault, reset the ILOM.
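
The threshold values quoted in the notes above can be turned into a small worked check. The following Python sketch is illustrative only: the function names are hypothetical, the hour values are copied exactly as quoted in this note, and the logic covers only the system_identifier case on X2-2/X2-8 systems. It deliberately ignores the ILOM bugs listed above, which can make a card report a different threshold.

```python
# Threshold values in hours, exactly as quoted in the notes above.
THRESHOLD_HOURS = {"2 years": 17200, "3 years": 26220, "4 years": 35052}

# Identity strings that mark an X2-2/X2-8 server as part of an Exadata.
EXADATA_PREFIXES = (
    "Exadata Database Machine X2-2",
    "Exadata Database Machine X2-8",
)


def expected_threshold_hours(system_identifier):
    """Threshold an X2-2/X2-8 cell is expected to report for this identifier.

    With the standard Exadata identity string the 4 year value applies;
    with any other string the 3 year value is expected.
    """
    if system_identifier.strip().startswith(EXADATA_PREFIXES):
        return THRESHOLD_HOURS["4 years"]
    return THRESHOLD_HOURS["3 years"]


def hours_until_threshold(uptime_hours, threshold_hours):
    """Hours left before the ESM is flagged; 0 if the threshold has passed."""
    return max(threshold_hours - uptime_hours, 0)


identifier = "Sun Oracle Database Machine"   # V2 rack string from the note above
threshold = expected_threshold_hours(identifier)
print(threshold, hours_until_threshold(26000, threshold))
```

Because the counter is updated only every 4 hours, a reading taken shortly after clearing the fault may still show 0, so the remaining-hours figure is approximate.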



OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:


It is expected that the engineer stay on-site until the customer has given the approval to depart.   The following commands are provided as guidance in case the customer needs assistance checking the status of the system following replacement.  If the customer or the FSE requires more assistance following the physical replacement of the device, EEST/TSC should be contacted.

The Exadata Storage Server should boot up automatically.  Once the Exadata Storage Server comes back online the cell services will start up automatically, however you will need to reactivate the griddisks as follows:

      1. Activate the griddisks:

        # cellcli
            …    
        CellCLI> alter griddisk all active
        GridDisk DATA_CD_00_dbm1cel01 successfully altered
        GridDisk DATA_CD_01_dbm1cel01 successfully altered
        GridDisk DATA_CD_02_dbm1cel01 successfully altered
        GridDisk RECO_CD_00_dbm1cel01 successfully altered
        GridDisk RECO_CD_01_dbm1cel01 successfully altered
        GridDisk RECO_CD_02_dbm1cel01 successfully altered
        ...etc...
      2. Verify all disks show 'active':

        CellCLI> list griddisk
        DATA_CD_00_dbm1cel01         active
        DATA_CD_01_dbm1cel01         active
        DATA_CD_02_dbm1cel01         active
        RECO_CD_00_dbm1cel01         active
        RECO_CD_01_dbm1cel01         active
        RECO_CD_02_dbm1cel01         active
        ...etc...
      3. Verify all grid disks have been successfully put online using the following command. Wait until 'asmmodestatus' is in status 'ONLINE' for all grid disks. The following is an example of the output early in the activation process.

        CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
        DATA_CD_00_dbm1cel01 active ONLINE Yes
        DATA_CD_01_dbm1cel01 active ONLINE Yes
        DATA_CD_02_dbm1cel01 active ONLINE Yes
        RECO_CD_00_dbm1cel01 active SYNCING Yes
        RECO_CD_01_dbm1cel01 active ONLINE Yes
        ...etc...


        Notice in the above example that 'RECO_CD_00_dbm1cel01' is still in the 'SYNCING'  process. Oracle ASM synchronization is only complete when ALL grid disks show ‘asmmodestatus=ONLINE’.  This process can take some time depending on how busy the machine is, and has been while this individual server was down for repair. (Note: This operation uses Fast Mirror Resync operation - which does not trigger an ASM rebalance. The Resync operation restores only the extents that would have been written while the disk was offline.) 

      4. If WriteBack flashcache mode is enabled, then validate the cell has resumed caching to the flashdisks.
        The following will show whether the mode is WriteBack or WriteThrough:

        CellCLI> list cell attributes name,flashcachemode

        In WriteBack mode, the following should show each griddisk being cached by 4 flash disks:

        CellCLI> list griddisk attributes name,status,cachedby

        To validate that data is dirty (i.e. only in flash disks and not yet on harddisks), run the following:

        CellCLI> list metriccurrent fc_by_used,fc_by_dirty

        On a busy system, metric "fc_by_dirty" should start increasing in value.
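
The wait in step 3 above (every grid disk must reach asmmodestatus ONLINE before ASM synchronization is complete) can be checked from captured command output. The following Python sketch is an illustration only. The function name is hypothetical, and the parsing assumes the four whitespace-separated columns shown in the sample output of step 3.

```python
def griddisks_not_online(output):
    """Return {name: asmmodestatus} for grid disks that are not yet ONLINE.

    Expects columns: name, status, asmmodestatus, asmdeactivationoutcome.
    Filler lines such as '...etc...' are skipped.
    """
    pending = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("..."):
            continue
        name, status, asm_mode = fields[0], fields[1], fields[2]
        if status != "active" or asm_mode != "ONLINE":
            pending[name] = asm_mode
    return pending


# Text shaped like the sample output in step 3.
sample = """\
DATA_CD_00_dbm1cel01 active ONLINE Yes
DATA_CD_01_dbm1cel01 active ONLINE Yes
RECO_CD_00_dbm1cel01 active SYNCING Yes
RECO_CD_01_dbm1cel01 active ONLINE Yes
...etc...
"""
print(griddisks_not_online(sample))   # {'RECO_CD_00_dbm1cel01': 'SYNCING'}
```

An empty dictionary corresponds to the condition described above, where all grid disks show asmmodestatus=ONLINE. Until then, the list command should be repeated.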


PARTS NOTE:

371-4650 5.5V, 11F, Capacitive Backup Power Module, (ESM) (F20)

371-4953 5.5V, 11F, Capacitive Backup Power Module, (ESM) (F20 M2)
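
Since the ESMs are not interchangeable between card types, the pairing of card assembly and ESM part numbers given in this note can be captured as a lookup and checked before a part is fitted. The following Python sketch is illustrative only; the table and function names are hypothetical, and the part numbers are the ones listed in this note.

```python
# Card assembly part number -> (card type, ESM part number), from this note.
ESM_FOR_CARD = {
    "541-3731": ("F20", "371-4650"),
    "541-4417": ("F20 M2", "371-4953"),
}


def esm_matches_card(card_part, esm_part):
    """True only when the ESM part number is the one listed for that card."""
    if card_part not in ESM_FOR_CARD:
        raise ValueError(f"unknown card assembly part number: {card_part}")
    return ESM_FOR_CARD[card_part][1] == esm_part


print(esm_matches_card("541-3731", "371-4650"))   # True: F20 card with F20 ESM
print(esm_matches_card("541-3731", "371-4953"))   # False: F20 M2 ESM on F20 card
```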

 

REFERENCE INFORMATION:

Sun Flash Accelerator F20 PCIe Card User’s Guide - https://docs.oracle.com/cd/E19682-01/index.html 
How to Shutdown a Storage Cell for Service - Note 1188080.1

Aura (F20) Hardware and Software Troubleshooting Document - Note 1285796.1

References

<NOTE:1285796.1> - Aura (F20) Hardware and Software Troubleshooting Document
<NOTE:1188080.1> - Steps to shut down or reboot an Exadata storage cell without affecting ASM
<NOTE:1505691.1> - Unable To Reset Flash F20 Accelerator ESM UPTIME

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.