Asset ID: 1-79-2366777.1
Update Date: 2018-04-30
Solution Type: Predictive Self-Healing Sure
Solution 2366777.1: FS System Procedure to Restore an FS1-2 Back to Small Build After a Failed Upgrade
Related Items
- Oracle FS1-2 Flash Storage System
Related Categories
- PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx
Oracle Confidential PARTNER - Available to partners (SUN).
Reason: sensitive procedure
Applies to:
Oracle FS1-2 Flash Storage System - Version 6.2 to 6.2 [Release 6.2]
Information in this document applies to any platform.
Purpose
To recover an FS1-2 system back to "Small Build" after a failed software upgrade attempted to enable Enhanced Allocation ("Big Build").
Scope
An FS1-2 "Small Build" system failed a Disruptive Update (DU) leaving the FS1-2 system in Read Only status because of a failed Cold Start. This procedure requires the engagement of the Configuration Manager (ConMan) group in FS Engineering. Do NOT attempt these steps without their active involvement. ConMan will verify if this procedure will work or if the data is lost.
Details
This issue can be avoided by following the steps in KM Document 2366718.1 FS System: Procedure to Inhibit an ILOM Upgrade and Enhanced Allocation Migration During an FS1-2 Disruptive Software Upgrade.
CAUTION: This procedure should only be attempted after verifying that the FS1-2 did not get a clean shutdown and that NONE of the COD copies has been converted from Small to Big Build.
Confirmation
This issue will have the following symptoms:
- System status of Read Only
- Cold Start failed in Boot State CM
- PM_EVT_COLD_START_FAILED log bundle
Gather and scanlog the logs associated with the failed cold start. File a bug to engage the ConMan FS Engineering group.
The following is INTERNAL ONLY:
The ConMan FS Engineering group will check whether there are any Big Build artifacts.
- Locate a log bundle that contains the cod file. The file name is an alphanumeric string ending in .cod, .cod.tar, or .cod.tar.gz. Extract it as needed:
% tar xvf A136B0559494795F.cod.tar
A136B0559494795F.cod
%
- Use the dumpCod6211 utility to convert the cod file to text and grep for the 5 lines after each "Master Block":
% /cores_data/local/tools/pillar/dumpCod6211 A136B0559494795F.cod | grep -A 5 "Master Block"
Master Block:
signature: PDSCOD
SSN: AK00126934
status: 1
maj/min: 60/0
gen: 0x59c9955a0000295f (reset Mon Sep 25 23:46:34 2017 UTC)
--
Master Block:
signature: HW_COD
SSN: AK00126934
status: 1
maj/min: 60/0
gen: 0x59c9955500000406 (reset Mon Sep 25 23:46:34 2017 UTC)
%
- If the major version on the maj/min line in the Master Block is 60 (as in the above example, 60/0), it is not necessary to inhibit Big Build COD conversion.
- If the major version on the maj/min line in the Master Block is 6 (as in the example below, 6/21), it will be necessary to inhibit the Big Build COD conversion.
Master Block:
signature: PDSCOD
SSN: AK00795540
status: 1
maj/min: 6/21
gen: 0x5575ce7500002d62 (reset Mon Jun 8 17:18:45 2015 UTC)
Do NOT rely on the BIG_BUILD file in the scanlog created directory for the log bundle as it is not reliable for a system in this state.
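The maj/min check above can be sketched as a small shell filter. This is only an illustration, not part of the ConMan tooling: the function name classify_cod and the "mixed", "unknown", and "no-master-block" labels are hypothetical, and it assumes the dumpCod6211 text output contains one "maj/min: <major>/<minor>" line per Master Block, as in the examples above.

```shell
# Classify COD Master Block versions from dumpCod text output read on stdin.
# Assumption: every Master Block has a line of the form "maj/min: <major>/<minor>".
# Prints one of: small (all major 6), big (all major 60), mixed, unknown,
# or no-master-block. The labels are illustrative, not official terms.
classify_cod() {
  awk '
    /maj\/min:/ {
      split($2, v, "/")            # $2 is "<major>/<minor>"
      if (v[1] == 60)      big++
      else if (v[1] == 6)  small++
      else                 other++
    }
    END {
      if (big + small + other == 0) { print "no-master-block"; exit }
      if (other > 0)                { print "unknown"; exit }
      if (big > 0 && small > 0)     { print "mixed"; exit }
      print (big > 0 ? "big" : "small")
    }'
}
```

For example, the output of the dumpCod6211 command shown earlier could be piped into classify_cod. Any result other than "small" or "big" should be treated as a reason to stop and consult ConMan rather than as a diagnosis.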
Ideally there should be a CM_SLAMMER_NODE_NOT_CLEAN_BOOT(0x1ce032) Event in the Controller log files. TDS traces of interest that should be seen are:
CsPreserveCodTask.cpp 194 CsPreserveCodTask::endExecutionStep CM CM_COD "CHECK_FOR_CONVERSION, CsPreserveCodTask(0x8198004f)"
CodManagerBase.cpp 838 CodManagerBase::checkForConversion CM CM_COD "cod conversion check: old cod major:6 minor:5"
HwcodManager.cpp 556 HwcodManager::checkForConversion CM CM_COD "HWCOD version difference: old cod major, minor: (6,5), current build: (60,0)"
PerformCodConversionTask.cpp 116 PerformCodConversionTask::endExecutionStep CM CM_COD "REJECTING COD conversion because some node has dirty BBM."
PerformCodConversionTask::finish CM CM_COD "PerformCodConversionTask(0x80390051) -- Failed, state(2), error code(0x1ce032)"
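As a rough aid when reading large log bundles, the two strongest indicators named above (the CM_SLAMMER_NODE_NOT_CLEAN_BOOT event and the "REJECTING COD conversion" trace) can be searched for in one pass. This is a minimal sketch that assumes the Controller logs and TDS traces are available as plain text; the function name check_cod_traces and its output format are hypothetical.

```shell
# Read log or trace text on stdin and report whether the two indicators
# from this article were seen. Output format is illustrative only.
check_cod_traces() {
  awk '
    /CM_SLAMMER_NODE_NOT_CLEAN_BOOT/ { dirty = 1 }
    /REJECTING COD conversion/       { rejected = 1 }
    END {
      printf "not_clean_boot=%d conversion_rejected=%d\n", dirty, rejected
    }'
}
```

A result of 1 for both fields is consistent with the symptoms described here, but it does not replace the COD inspection performed by ConMan.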
As long as NONE of the Configuration On Disk (COD) copies has been touched, you can revert by uninstalling the "50" series oraclefs-controller* package and installing the "00" series oraclefs-controller package. You must keep the Controllers from booting while you do this; shutting off PCP accomplishes that. Do not let them boot until the removal of Big Build and the installation of Small Build have been verified. Enabling SSH so that it survives a Pilot reboot will help with this procedure.
Revert Big Build to Small Build
Once the ConMan group in FS Engineering has confirmed that COD conversion has NOT taken place, use the steps below to recover the system by manually backing out the Big Build Controller rpm package and installing the Small Build Controller package:
- Use the fscli utility to enable ssh to the FS1-2. See KM Document 2029847.1 FS System: How to Enable SSH Access to the Pilot for details.
- ssh to the active Pilot (Pilot2 in the example below) and use the sshConfig utility to enable ssh access on both Pilots so it will survive a reboot:
[root@pilot2 ~]# /usr/local/sbin/sshConfig enable
iptables: Setting chains to policy ACCEPT: nat filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
iptables: No config file. [WARNING]
[root@pilot2 ~]# ssh 172.30.80.2 /usr/local/sbin/sshConfig enable
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
iptables: No config file.[WARNING]
[root@pilot2 ~]#
- Open ssh sessions to the unique IP address of each Pilot.
NOTE: The next sequence of steps must be executed on both Pilots, but only Pilot 2 is shown for brevity. Be sure to run these commands on BOTH Pilots.
- To prevent the Controllers from booting during this procedure, shut down the Pilot Control Process (PCP):
[root@pilot2 ~]# service pilotcfg stop
Shutting down pcp_monitor:
Shutting down pilotcfg:
[root@pilot2 ~]# service pilotcfg status
pilotcfg is stopped
[root@pilot2 ~]#
- On both Pilots, verify the installed Controller rpm package:
[root@pilot2 ~]# rpm -qa | grep oraclefs-controller
oraclefs-controller-060216-054450.x86_64
[root@pilot2 ~]#
For a Big Build package, the last two digits (before x86_64) are 50 or greater. In the above example they are 50, which confirms that the Big Build rpm package is installed on the Pilots.
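The naming rule above can be expressed as a small helper. This is a sketch only: the function name controller_build_type is hypothetical, and the threshold of 50 is taken from this article's description of the "50" series (Big Build) and "00" series (Small Build) packages.

```shell
# Print "big" or "small" for an oraclefs-controller package name or rpm path,
# based on the last two digits before the architecture suffix.
# Prints "unknown" and returns 1 if the name does not end in two digits.
controller_build_type() {
  _pkg=${1##*/}               # drop any directory part
  _pkg=${_pkg%.rpm}           # drop a trailing .rpm, if present
  _pkg=${_pkg%.x86_64}        # drop the architecture suffix
  _tag=${_pkg#"${_pkg%??}"}   # keep the last two characters
  case $_tag in
    [0-9][0-9]) ;;
    *) echo unknown; return 1 ;;
  esac
  # Strip one leading zero so values such as "08" compare as plain decimals.
  if [ "${_tag#0}" -ge 50 ]; then echo big; else echo small; fi
}
```

It could be combined with the query above, for example: controller_build_type "$(rpm -qa | grep oraclefs-controller)".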
- On both Pilots, remove that package and use the command from the previous step to confirm:
[root@pilot2 ~]# rpm -e --nodeps oraclefs-controller-060216-054450.x86_64
[root@pilot2 ~]# rpm -qa | grep oraclefs-controller
[root@pilot2 ~]#
- On both Pilots, confirm /var/images/pds/pxe is either empty or does not exist:
[root@pilot2 ~]# ls /var/images/pds/pxe
ls: cannot access /var/images/pds/pxe: No such file or directory
[root@pilot2 ~]#
- On both Pilots, confirm that the Small Build Controller rpm package is in the /rpms/AxiomONE-SW-installed directory:
[root@pilot2 ~]# ls /rpms/AxiomONE-SW-installed/*controller*
/rpms/AxiomONE-SW-installed/oraclefs-controller-060216-054450.x86_64.rpm
[root@pilot2 ~]#
If it is not, as in the example above (*50.x86_64.rpm is the Big Build version), copy the Small Build file from the /rpms/AxiomONE-SW-staged directory, set its permissions to 644, and confirm that the owner and group are both root:
[root@pilot2 ~]# cp /rpms/AxiomONE-SW-staged/oraclefs-controller-060216-054400.x86_64.rpm /rpms/AxiomONE-SW-installed/
[root@pilot2 ~]# chmod 644 /rpms/AxiomONE-SW-installed/oraclefs-controller-060216-054400.x86_64.rpm
[root@pilot2 ~]# ls -l /rpms/AxiomONE-SW-installed/*controller*
-rw-r--r-- 1 root root 121159015 Mar 2 18:41 /rpms/AxiomONE-SW-installed/oraclefs-controller-060216-054400.x86_64.rpm
-rw-r--r-- 1 root root 121167817 Feb 22 00:32 /rpms/AxiomONE-SW-installed/oraclefs-controller-060216-054450.x86_64.rpm
[root@pilot2 ~]#
CAUTION: Do not create any other files in any of the /rpms subdirectories. Future upgrades will fail.
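The permission and ownership check on the copied file can also be scripted instead of read from the ls -l output. The following is a minimal sketch, assuming GNU stat (coreutils) is available on the Pilot; the function name check_rpm_file is hypothetical. It compares numeric IDs, and root is uid 0 and gid 0.

```shell
# Succeed if FILE has mode 644 and the expected numeric owner and group.
# UID and GID default to 0 (root), as this procedure requires.
# Assumption: GNU stat, which supports the -c format option.
check_rpm_file() {
  _file=$1
  _uid=${2:-0}
  _gid=${3:-0}
  _actual=$(stat -c '%a %u %g' "$_file") || return 1
  [ "$_actual" = "644 $_uid $_gid" ]
}
```

For example: check_rpm_file /rpms/AxiomONE-SW-installed/oraclefs-controller-060216-054400.x86_64.rpm && echo OK. The check only reads file metadata and creates nothing under /rpms.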
- On both Pilots, install the Small Build Controller rpm package and confirm:
[root@pilot2 ~]# rpm -i --force /rpms/AxiomONE-SW-installed/oraclefs-controller-060216-054400.x86_64.rpm
[root@pilot2 ~]# rpm -qa | grep oraclefs-controller
oraclefs-controller-060216-054400.x86_64
[root@pilot2 ~]#
- On both Pilots, confirm the contents of /var/images/pds/pxe. There should be several files including the initrd-axnp.gz which is the Controller runtime image:
[root@pilot2 AxiomONE-SW-installed]# ls /var/images/pds/pxe
cmd.c32 pxelinux.0 vmlinuz
initrd-axnp.gz pxelinux.cfg vmlinuz-3.0.16-200.29.3.el6uek-axnp.ndebug.060215.050300
[root@pilot2 AxiomONE-SW-installed]#
- On both Pilots, clear the system cold start failure and Controller failure history by removing the node-info.xml file:
[root@pilot2 ~]# rm /var/lib/pillar/pcp/node-info.xml
rm: remove regular file `/var/lib/pillar/pcp/node-info.xml'? y
[root@pilot2 ~]#
- Restart PCP and verify:
[root@pilot2 ~]# service pilotcfg start
Starting pilotcfg:
[root@pilot2 ~]# service pilotcfg status
pilotcfg is running
[root@pilot2 ~]#
- After about 10 minutes, the Controllers should begin requesting DHCP addresses. The requests are logged in /var/log/messages:
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 available DHCP subnet: 172.30.80.0/255.255.255.0
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 available DHCP range: 172.30.80.200 -- 172.30.80.240
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 vendor class: udhcp 1.15.1
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 DHCPREQUEST(pmi_if) 172.30.80.128 00:21:28:a1:c6:42
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 tags: fs1noclearfbm, known, pmi_if
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 DHCPACK(pmi_if) 172.30.80.128 00:21:28:a1:c6:42 WN508002000158BA50
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 requested options: 1:netmask, 3:router, 6:dns-server, 12:hostname,
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 requested options: 15:domain-name, 28:broadcast, 42:ntp-server
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 bootfile name: /pds/pxe/pxelinux.0
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 server name: 172.30.80.1
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 next server: 172.30.80.2
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 1 option: 53 message-type 5
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 4 option: 54 server-identifier 172.30.80.2
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 4 option: 51 lease-time 1h
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 4 option: 58 T1 27m54s
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 4 option: 59 T2 50m24s
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 4 option: 1 netmask 255.255.255.0
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 4 option: 28 broadcast 172.30.80.255
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 5 option: 15 domain-name axiom
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 18 option: 12 hostname WN508002000158BA50
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 20 option:209 70:78:65:6c:69:6e:75:78:2e:63:66:67:2f:64...
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 4 option:208 f1:00:74:7e
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 12 option: 42 ntp-server 172.30.80.1, 172.30.80.2, 172.30.80.3
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 4 option: 3 router 172.30.80.1
2018-03-02 19:52:52.439+00:00 pilot1 dnsmasq-dhcp[23394]: 2307783428 sent size: 4 option: 6 dns-server 172.30.80.1
- About 3-4 minutes later, the Controllers will download the initrd-axnp.gz file. The transfers are logged in /var/log/messages:
2018-03-02 19:56:34.828+00:00 pilot1 dnsmasq-tftp[23394]: sent /var/images/pds/pxe/initrd-axnp.gz to 172.30.80.128
2018-03-02 19:56:34.831+00:00 pilot1 dhcp-script-pmi: tftp 282179072 172.30.80.128 /var/images/pds/pxe/initrd-axnp.gz
2018-03-02 19:56:47.499+00:00 pilot1 dnsmasq-tftp[23394]: sent /var/images/pds/pxe/initrd-axnp.gz to 172.30.80.129
2018-03-02 19:56:47.501+00:00 pilot1 dhcp-script-pmi: tftp 282179072 172.30.80.129 /var/images/pds/pxe/initrd-axnp.gz
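To confirm that every Controller fetched the runtime image, the TFTP entries can be reduced to a list of Controller addresses. This is a minimal sketch, assuming the dnsmasq-tftp lines have the same layout as the excerpt above; the function name controllers_with_image is hypothetical.

```shell
# Print each Controller IP that was sent initrd-axnp.gz, one per line.
# Assumption: dnsmasq-tftp lines end with "sent <path>/initrd-axnp.gz to <ip>".
controllers_with_image() {
  awk '/dnsmasq-tftp/ && /sent .*initrd-axnp\.gz to / { print $NF }' | sort -u
}
```

For example: controllers_with_image < /var/log/messages. With the log lines shown above, the expected addresses are 172.30.80.128 and 172.30.80.129.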
- The Controllers will then begin heartbeating. They appear in the node matrix logged in /var/log/pcp.log:
2018-03-02 20:05:00.115 pilot1 pilotcfgproc: 43267 19217 pcp:info fofb: node matrix
2018-03-02 20:05:00.115 pilot1 pilotcfgproc: 43268 19217 pcp:info node 2 3 128 129
2018-03-02 20:05:00.115 pilot1 pilotcfgproc: 43269 19217 pcp:info 2: 1 (104 6 0)( 0 6 0)( 20 6 0)( 20 6 0)
2018-03-02 20:05:00.115 pilot1 pilotcfgproc: 43270 19217 pcp:info 3: 1 ( 0 6 0)( 0 6 0)( 20 6 0)( 20 6 0)
2018-03-02 20:05:00.115 pilot1 pilotcfgproc: 43271 19217 pcp:info 128: 1 ( a0 4 1)( a0 4 0)( c0 4 0)( a0 4 0)
2018-03-02 20:05:00.115 pilot1 pilotcfgproc: 43272 19217 pcp:info 129: 1 ( 20 4 1)( 20 4 0)( 20 4 0)( 0 4 0)
2018-03-02 20:05:00.115 pilot1 pilotcfgproc: 43273 19217 pcp:info 255: ( 0 6 0)( 0 6 0)( 0 6 0)( 0 6 0)
2018-03-02 20:05:00.115 pilot1 pilotcfgproc: 43274 19217 pcp:info 1 -1 0 3 3
2018-03-02 20:05:00.920 pilot1 pilotcfgproc: 43275 19218 pcp:debug SystemState::threadLoop() 3 5
2018-03-02 20:05:00.920 pilot1 pilotcfgproc: 43276 19218 pcp:debug SystemState is COLD_START_IN_PROGRESS.
- As soon as the Controllers are heartbeating in the node matrix, check their version with ver -v to make sure that the OS version no longer ends in the "50" tag:
[root@pilot2 ~]# ver -v | grep 172.30.80
172.30.80.129 : OS version: 2060-00004-060216-054400
172.30.80.128 : OS version: 2060-00004-060216-054400
[root@pilot2 ~]#
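The version check can be automated so that a leftover Big Build image is not missed by eye. This is a sketch only, assuming the ver -v output format shown above and the article's rule that Big Build versions end in 50 or greater; the function name all_small_build is hypothetical.

```shell
# Read `ver -v` output on stdin. Succeed (exit 0) only if at least one
# "OS version" line is present and every one ends in a two-digit tag below 50.
all_small_build() {
  awk '
    /OS version:/ {
      n++
      tag = substr($NF, length($NF) - 1)     # last two characters
      if (tag !~ /^[0-9][0-9]$/ || tag + 0 >= 50) bad++
    }
    END { exit !(n > 0 && bad == 0) }'
}
```

For example: ver -v | grep 172.30.80 | all_small_build && echo "Small Build confirmed". A non-zero exit status means at least one Controller still reports a Big Build tag, or no version lines were found.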
- Monitor the boot. It should proceed past Boot State ConMan and Boot State BS. After that, any boot failures would be some other issue.
References
<BUG:27386591> - ASSIST-TSC: SYSTEM COLDSTART FAILED REPORTED BY PILOT SERVER