Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction - Sure Solution 1385308.1: How to Replace a Failed InfiniBand (HCA) Card on an Exalogic Storage Node (Sun ZFS Storage 7320)
In this Document
Oracle Confidential PARTNER - Available to partners (SUN).

Applies to:
Oracle ZFS Storage ZS3-4 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7320 - Version Not Applicable to Not Applicable [Release N/A]
Sun Infiniband HCA - Version Not Applicable to Not Applicable [Release N/A]
Oracle ZFS Storage ZS3-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Goal
How to Replace a Failed InfiniBand Card on an Exalogic Storage Node.

Solution
DISPATCH INSTRUCTIONS

- Determine the cluster state of each storage node.

Example of an Active Storage Node (state = AKCS_OWNER):

sn01:> configuration cluster show
Properties:
                       state = AKCS_OWNER
                 description = Active (takeover completed)
                    peer_asn = d6df4e45-3677-4ac0-9aaa-90746df9d6a5
               peer_hostname = sn02
                  peer_state = AKCS_STRIPPED
            peer_description = Ready (waiting for failback)
Children:
                   resources => Configure resources

Example of a Passive Storage Node (state = AKCS_STRIPPED):

sn02:> configuration cluster show
Properties:
                       state = AKCS_STRIPPED
                 description = Ready (waiting for failback)
                    peer_asn = eeac79d6-5822-6ca6-e4dd-c68b25265f21
               peer_hostname = sn01
                  peer_state = AKCS_OWNER
            peer_description = Active (takeover completed)
Children:
                   resources => Configure resources

- Determine whether the target node is currently hosting the clustered resources (see the previous step). If it is, there are two options: shut down the target node (which will force resource failover to the alternate node), or force a 'takeover' operation from the (AKCS_OWNER) node.

Example of a Storage Node shutdown:

sn01:> maintenance system poweroff <CR>
This will turn off power to the appliance. Are you sure? (Y/N)

Example of a forced Storage Node takeover:

On the current active node:

sn01:> maintenance system reboot
This will reboot the appliance. Are you sure? (Y/N)

Note: As per "Failback" and "Takeover" Cluster Operation Supportability on ZFS Storage Appliance in Exalogic Rack (Doc ID 2091131.1), rebooting the active node is the method used within Exalogic to effect takeovers.
Note: The reboot forces a 'takeover' operation, so the clustered resources fail over as a result of rebooting the original owner. You will either need to wait for the reboot to complete and then shut down the node, or intercept the reboot operation and force a poweroff via the ILOM (stop /SYS).

- If resource failover was required, check the alternate node to ensure the cluster resources migrated successfully (its state should transition from AKCS_STRIPPED to AKCS_OWNER).
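The state check above can be scripted. The following is an illustrative sketch (not an Oracle-supplied tool) that parses the text output of 'configuration cluster show' — using the property format shown in the examples above — and reports whether the node owns the clustered resources and therefore requires a failover before poweroff:

```python
# Illustrative sketch, assuming the 'configuration cluster show' output
# format shown in this document. Not an Oracle utility.

def cluster_state(show_output: str) -> str:
    """Return the value of the 'state' property from the CLI output."""
    for line in show_output.splitlines():
        line = line.strip()
        if line.startswith("state ="):
            return line.split("=", 1)[1].strip()
    raise ValueError("no 'state' property found in output")

def needs_failover(show_output: str) -> bool:
    """True if this node owns the clustered resources (AKCS_OWNER),
    meaning they must be failed over before the node is powered off."""
    return cluster_state(show_output) == "AKCS_OWNER"

# Sample output modeled on the active-node example above.
sample = """\
Properties:
         state = AKCS_OWNER
   description = Active (takeover completed)
    peer_state = AKCS_STRIPPED
"""

print(cluster_state(sample))   # AKCS_OWNER
print(needs_failover(sample))  # True
```

How the output is collected (e.g. over an ssh session to the appliance) is left out of the sketch; only the decision logic is shown.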
Update the partition map on the InfiniBand master switch with the new IB-HCA's GUIDs

1. Determine the port GUIDs of the new IB-HCA with the ibstat command run on the Storage node.
4. Add the port GUIDs of the new IB-HCA to the partitions that are being used by the node.

# smpartition start
NOTE: The -m switch sets the membership mode needed for the partition. If a default mode is configured on the partition and that is what these GUIDs will use, you do not need to use -m; however, it will not cause any issues if you do. If NO default is configured on the partition, you will need to set the mode to one of the following: both, limited, or full.
NOTE: Add the two new port GUIDs to all the partitions that are needed.

You will also want to remove the port GUIDs of the card being replaced from the partition maps. The port GUIDs of the faulty card are the Node GUID of the card + 1 and the Node GUID of the card + 2. The Node GUID can be obtained with the ibstat command prior to removing the card, or it is printed on a label on the card itself, which can be read if the card has already been removed.
You can remove these GUIDs from the active partitions using the following command. Example for a card with Node GUID 21280001ef1233:

# smpartition remove -pkey 0x503 -port 21280001ef1234 21280001ef1235
NOTE: Remove the GUIDs from all partitions the faulty card was a member of.

Verify the partition information is correct:

# smpartition list modified

If everything is correct:

# smpartition commit

Verify the entries using:

# smpartition list active
.
.
0x0021280001ef5d24=both,
0x0021280001ef5d23=both,

Ensure the GUIDs are in all the partitions the card will be required to communicate in.
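The GUID arithmetic above (port GUIDs = Node GUID + 1 and + 2) is easy to get wrong by hand. The following is an illustrative sketch (not an Oracle utility) that derives both port GUIDs from a Node GUID and builds the corresponding smpartition command line; the partition key 0x503 is just the example value used in this document:

```python
# Illustrative sketch, assuming the Node GUID +1 / +2 rule and the
# smpartition syntax shown in this document. Not an Oracle utility.

def port_guids(node_guid: str) -> list[str]:
    """Return the two port GUIDs (Node GUID + 1 and + 2) as hex strings,
    zero-padded to the same width as the input."""
    width = len(node_guid)
    base = int(node_guid, 16)
    return [format(base + i, f"0{width}x") for i in (1, 2)]

def smpartition_cmd(action: str, pkey: str, node_guid: str) -> str:
    """Build an 'smpartition add'/'smpartition remove' line covering
    both ports of one HCA."""
    ports = " ".join(port_guids(node_guid))
    return f"smpartition {action} -pkey {pkey} -port {ports}"

# Faulty card from the example above: Node GUID 21280001ef1233
print(smpartition_cmd("remove", "0x503", "21280001ef1233"))
# smpartition remove -pkey 0x503 -port 21280001ef1234 21280001ef1235
```

The generated line matches the worked example earlier in this section; the same helper with action "add" would produce the corresponding add command for the replacement card.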
The customer is responsible for verifying that the new component is functioning correctly. Some suggested steps for basic verification follow.
sn01:configuration cluster> takeover <CR>
Continuing will immediately fail back the resources assigned to the cluster peer. This may result in clients experiencing a slight delay in service. Are you sure? (Y/N)
sn01:> configuration cluster show
Properties:
                       state = AKCS_OWNER
                 description = Active (takeover completed)
                    peer_asn = d6df4e45-3677-4ac0-9aaa-90746df9d6a5
               peer_hostname = sn02
                  peer_state = AKCS_STRIPPED
            peer_description = Ready (waiting for failback)
Children:
                   resources => Configure resources
sn01:> configuration net datalinks show
Datalinks:
DATALINK   CLASS    LINKS    STATE    LABEL
igb0       device   igb0     up       igb0
igb1       device   igb1     up       igb1
ibp0       device   ibp0     up       ibp0
ibp1       device   ibp1     up       ibp1

sn01:> configuration net interfaces show
Interfaces:
INTERFACE  STATE     CLASS   LINKS       ADDRS               LABEL
igb0       up        ip      igb0        10.10.10.10/24      igb0
igb1       offline   ip      igb1        10.10.10.11/24      igb1
ipmp1      up        ipmp    ibp0 ibp1   192.168.10.15/24    IB_Interface
ibp0       up        ip      ibp0        0.0.0.0/8           ibp0
ibp1       up        ip      ibp1        0.0.0.0/8           ibp1
sn01:> configuration net interfaces
sn01:configuration net interfaces> show
Interfaces:
INTERFACE  STATE     CLASS   LINKS       ADDRS               LABEL
igb0       up        ip      igb0        10.10.10.10/24      igb0
igb1       offline   ip      igb1        10.10.10.11/24      igb1
ipmp1      up        ipmp    ibp0 ibp1   192.168.10.15/24    IB_Interface
ibp0       up        ip      ibp0        0.0.0.0/8           ibp0
ibp1       up        ip      ibp1        0.0.0.0/8           ibp1

sn01:configuration net interfaces> select ibp0
sn01:configuration net interfaces ibp0> show
Properties:
                       state = up
                    curaddrs = 0.0.0.0/8
                       class = ip
                       label = ibp0
                      enable = true
                       admin = true
                       links = ibp0
                     v4addrs = 0.0.0.0/8
                      v4dhcp = false
                     v6addrs =
                      v6dhcp = false

sn01:configuration net interfaces ibp0> set enable=false
                      enable = false (uncommitted)
sn01:configuration net interfaces ibp0> commit
sn01:configuration net interfaces ibp0> cd ..
sn01:configuration net interfaces> show
Interfaces:
INTERFACE  STATE     CLASS   LINKS       ADDRS               LABEL
igb0       up        ip      igb0        10.10.10.10/24      igb0
igb1       offline   ip      igb1        10.10.10.11/24      igb1
ipmp1      up        ipmp    ibp0 ibp1   192.168.10.15/24    IB_Interface
ibp0       disabled  ip      ibp0        0.0.0.0/8           ibp0
ibp1       up        ip      ibp1        0.0.0.0/8           ibp1

sn01:configuration net interfaces> select ibp0
sn01:configuration net interfaces ibp0> set enable=true
                      enable = true (uncommitted)
sn01:configuration net interfaces ibp0> commit
sn01:configuration net interfaces ibp0> cd ..
sn01:configuration net interfaces> show
Interfaces:
INTERFACE  STATE     CLASS   LINKS       ADDRS               LABEL
igb0       up        ip      igb0        10.10.10.10/24      igb0
igb1       offline   ip      igb1        10.10.10.11/24      igb1
ipmp1      up        ipmp    ibp0 ibp1   192.168.10.15/24    IB_Interface
ibp0       up        ip      ibp0        0.0.0.0/8           ibp0
ibp1       up        ip      ibp1        0.0.0.0/8           ibp1
sn01:configuration net interfaces>

Note: Perform the same set of commands above for interface "ibp1" as well. This will test both links in the IPMP group "ipmp1".

- Login to one or more compute nodes and verify access to the storage appliance over the InfiniBand network.
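The interface check above can also be automated. The following is an illustrative sketch (not an Oracle tool) that parses the tabular output of 'configuration net interfaces show' — in the format shown in the examples above — and confirms that the IPMP group and both IB links report an "up" state:

```python
# Illustrative sketch, assuming the 'configuration net interfaces show'
# table format shown in this document. Not an Oracle utility.

def interface_states(show_output: str) -> dict[str, str]:
    """Map interface name -> STATE column from the CLI table."""
    states = {}
    in_table = False
    for line in show_output.splitlines():
        cols = line.split()
        if cols[:2] == ["INTERFACE", "STATE"]:
            in_table = True        # header row found; data rows follow
            continue
        if in_table and len(cols) >= 2:
            states[cols[0]] = cols[1]
    return states

# Sample table modeled on the verification output above.
sample = """\
Interfaces:
INTERFACE  STATE  CLASS  LINKS      ADDRS             LABEL
ipmp1      up     ipmp   ibp0 ibp1  192.168.10.15/24  IB_Interface
ibp0       up     ip     ibp0       0.0.0.0/8         ibp0
ibp1       up     ip     ibp1       0.0.0.0/8         ibp1
"""

states = interface_states(sample)
if all(states[i] == "up" for i in ("ipmp1", "ibp0", "ibp1")):
    print("IPMP group and IB links are up")
```

A check like this would flag the transient "disabled" state seen mid-procedure if it persisted after the re-enable step.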
Depending on the level of testing desired, you could perform one or more of the tests from the following list:

References
<NOTE:2091131.1> - "Failback" and "Takeover" Cluster Operation Supportability on ZFS Storage Appliance in Exalogic Rack

Attachments
This solution has no attachment